Six Lists of Lists for Data Scientists

Data science relies on a strong foundation of mathematics, statistics, and results visualization, most of which are available through the R statistical programming ecosystem. To master the data sciences, one needs to delve into some of the more important pieces of literature (spending 10,000 hours). But what does one read and when?

While many have tried, it is impractical to define the definitive list of R resources, given all the great blogs, texts, and videos available. Most attempts to create such a list are failures from the start. So, in many cases, one just needs to Google the phrase “R Resources” in order to find 80% of the good ones, while exerting less than 20% of one’s overall research effort.

For my list, here are the texts and PDFs that I keep near or with me most of the time:

General introductions to R
1.  An introduction to R. Venables and Smith (2009) – PDF
2.  A beginner’s guide to R (Use R!). Zuur et al. (2009) – Text
3.  R for Dummies. Meys and de Vries (2012) – Text
4.  The R book. Crawley (2012)
5.  R in a nutshell: A desktop quick reference. Adler (2012)

Statistics books
1.  Statistics for Dummies. Gotelli and Ellison (2012) – Text
2.  Statistical methods. Snedecor and Cochran (2014) – Text
3.  Introduction to Statistics: Fundamental Concepts and Procedures of Data Analysis. Reid (2013) – Text

Statistics books specifically using R
1.  Introductory statistics: a conceptual approach using R. Ware et al. (2012) – Text
2.  Foundations and applications of statistics: an introduction using R. Pruim (2011) – Text
3.  Probability and statistics with R, 2nd Edition. Ugarte et al. (2008) – Text

Visualization using R
1.  ggplot2: elegant graphics for data analysis. Wickham (2009)
2.  R graphics cookbook. Chang (2013)

Programming using R
1.  The art of R programming. Matloff (2011)
2. Mastering Data Analysis with R. Daroczi (2015)

Interesting predictive analytics books
1. The Signal and the Noise: Why So Many Predictions Fail – But Some Don’t. Silver (2012)
2. Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Siegel (2013)

Deep Web Intelligence Platform: 6 Plus Capabilities Necessary for Finding Signals in the Noise


Over the last several months I have been involved with developing unique data science capabilities for the intelligence community, ones specifically based on exploiting insights derived from the open source intelligence (OSINT) found in the deep web. The deep web is World Wide Web (WWW) content that is not part of the Surface Web, which is indexed by standard search engines. It is usually inaccessible through traditional search engines because of the dynamic characteristics of the content and the impersistent nature of its URLs. Spanning over 7,500 terabytes of data, it is the richest source of raw material that can be used to build out value.


One of the more important aspects of intelligence is being able to connect multiple seemingly unrelated events within a time frame that still allows for actionable decisions. This capability is the optimal blend of man and machine, enabling customers to know more and know sooner. It is only in the low-level signals found in the deep web that one can use the behavioral sciences (psychology and sociology) to extract outcome-oriented value.


Data on the web is mostly composed of noise, which can be unique but is often of low value. Unfortunately, the index engines of the world (Google, Bing, Yahoo) add only marginal value to the very few data streams that are important to any valuation process. Real value comes from correlating event networks (people performing actions) through deep web signals, which are not the purview of traditional search engines.


These deep web intelligence capabilities can be achieved in part through the use of machine learning enabled, data science driven, and Hadoop-oriented enterprise information hubs. The platform supports the 6 plus essential capabilities for actionable intelligence operations:

1. Scalable Infrastructure – Industry standard hardware supported through cloud-based infrastructure providers that scales linearly with analytical demands.

2. Hadoop – Allows for computation to occur next to data storage and enables schema-on-read storage, meaning data is stored in its native raw format.

3. Enterprise Data Science – Scalable exploratory methods, predictive algorithms, and prescriptive analytics and machine learning.

4. Elastic Data Collection – In addition to pulling data from third party sources through APIs, bespoke data collection through scraping web services enables data analyses not possible within traditional enterprise analytics groups.

5. Temporal/Geospatial/Contextual Analysis – The ability to regionalize events to a specific context during a specified time (past, present, future).

6. Visualization – Effective visualization that tailors actionable results to individual needs.

The Plus – data, Data, DATA. Without data, lots of disparate data, data science platforms are of no value.
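To make the schema-on-read idea from capability 2 a little more concrete, here is a minimal sketch in plain Python (not Hadoop tooling, and not the platform's actual code). Raw records are kept exactly as collected, and a schema is applied only when a particular analysis reads them. The sample records, the `EVENT_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical names made up for this illustration.

```python
import json

# Raw records are kept exactly as they arrived (one JSON document per line).
# Fields differ between sources; nothing is validated at write time.
RAW_STORE = [
    '{"source": "forum", "ts": "2014-02-01T10:00:00", "user": "a1", "text": "hello"}',
    '{"source": "api", "ts": "2014-02-01T10:05:00", "lat": "38.9", "lon": "-77.0"}',
    '{"source": "scrape", "text": "no timestamp on this one"}',
]

# Hypothetical schema chosen by one analysis at read time: field -> converter.
EVENT_SCHEMA = {"source": str, "ts": str, "lat": float, "lon": float}


def read_with_schema(raw_lines, schema):
    """Parse raw lines and project each onto `schema`; missing fields become None."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            row[field] = convert(value) if value is not None else None
        rows.append(row)
    return rows


events = read_with_schema(RAW_STORE, EVENT_SCHEMA)
geo_events = [e for e in events if e["lat"] is not None and e["lon"] is not None]
print(len(events), "records read;", len(geo_events), "with coordinates")
```

A different analysis could read the same raw store with a different schema, which is the point of deferring the schema to read time.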

[Figure: Deep Web Intelligence Architecture]

Today’s executive, inundated with TOO MUCH DATA, has limited ability to synthesize the trends and actionable insights that drive competitive advantage. Traditional research tools and internet and social harvesters do not correlate or predict trends. They look at hindsight or, at best, exist at the surface of things. A newer approach, based on combining the behavioral analyses achievable through people with the machine learning found in scalable computational systems, can bridge this capability gap.

The Film Industry’s Golden Rule – Part 2


This is very early, but nevertheless interesting, and is based on the initial insights from the “Film Industry Executives Golden Rule – Total Gross is 3x Opening Box Office Receipts” post. As discussed, identifying outliers could be an important part of identifying the characteristics of the exceptional films in the industry. The plot below shows the number of outlying (exceptional) films where opening revenue was higher than 2.68 standard deviations (line with circles). In addition, the plot shows (line with triangles) the number of outliers that also exceeded a 4x Total Gross/Opening Gross ratio (the industry average being 3.1).



The second group (triangles) is the candidate study group for any future project, i.e., a good place to look for characteristic differences between exceptional and average films. There appear to be thirty years of data to explore here, which is helpful for creating, testing, and scoring regression and logistic regression models.
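The two groups described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: I read the 2.68 threshold as "more than 2.68 sample standard deviations above the mean opening gross," the film rows are made up (they are not Box Office Mojo data), and the sketch ignores the per-year counting that the plot does.

```python
import statistics

# Made-up (title, opening_gross, total_gross) rows in $ millions.
films = [(f"typical_{i}", 8 + i % 5, 3.1 * (8 + i % 5)) for i in range(20)]
films += [("blockbuster_a", 200, 500), ("blockbuster_b", 210, 945)]

# Assumed reading of the threshold: mean + 2.68 sample standard deviations.
openings = [opening for _, opening, _ in films]
cutoff = statistics.mean(openings) + 2.68 * statistics.stdev(openings)

# Group 1 (circles): films whose opening gross is an outlier.
opening_outliers = [f for f in films if f[1] > cutoff]

# Group 2 (triangles): outliers whose total/opening ratio also exceeds 4x.
study_group = [f for f in opening_outliers if f[2] / f[1] > 4.0]

print("opening outliers:", [title for title, _, _ in opening_outliers])
print("study group:", [title for title, _, _ in study_group])
```

With real data, the same two filters would be applied per release year to reproduce the two lines in the plot.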

However, the more interesting trends are the exponential increase in outlier opening gross revenue films (line with circles) and the divergence between the two. While I don’t know what to make of it yet, there appears to be something going on.

In order to systematically address these data science questions, any future engagement lifecycle needs to be run through an organic process in order to maximize the likelihood of success (coming up with actionable insights on budget and on time). The key will most likely be access to film industry data sets, specifically those used to build web sites like Box Office Mojo. It would be useful to get detailed accounting for each film, inclusive of budgetary items (e.g., marketing spend). In addition, the project needs to pull in other third party data like regional/national economics (Bureau of Economic Analysis), weather (Weather Underground), social (Facebook, Twitter), demographic/psychographic models, etc. Here is the macro model for deriving insights from ones and zeros:




The analysis process itself is driven by data aggregation, preparation, and design of experiments (DOE). Having access to a few big data toolsmiths (data scientists who are Cloudera hackers) pays off at this phase. The data science team should set up a multi-node Hadoop environment at the start for all the data that will be pulled in over time (potentially terabytes within 1 year). They should also not waste effort trying to force-fit all the disparate data sources into some homegrown relational data schema. Accept the fact that uncertainty exists and build a scalable storage model accessible by R/SPSS/etc. from the start.

Once the data is in hand, the fun process begins. While modeling is both a visual and design process, it is all driven through an effective design of experiments. Knowing how to separate data into modeling, test, and scoring sets is a science, so there is no real need to second-guess what to do. Here is one such systematic and teachable process:
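As one small piece of such a process, the separation into modeling, test, and scoring sets might look like the following Python sketch. The 60/20/20 proportions, the fixed seed, and the `split_dataset` helper are my own assumptions for illustration, not a prescribed standard.

```python
import random


def split_dataset(rows, model_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle rows reproducibly and split into modeling, test, and scoring sets."""
    if not 0 < model_frac + test_frac < 1:
        raise ValueError("model_frac + test_frac must leave room for a scoring set")
    shuffled = list(rows)
    # A seeded generator keeps the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    n_model = round(len(shuffled) * model_frac)
    n_test = round(len(shuffled) * test_frac)
    modeling = shuffled[:n_model]
    test = shuffled[n_model:n_model + n_test]
    scoring = shuffled[n_model + n_test:]  # whatever remains
    return modeling, test, scoring


modeling, test, scoring = split_dataset(range(100))
print(len(modeling), len(test), len(scoring))
```

The important property is that the three sets never overlap, so a model is never scored on rows it was built or tuned on.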




At the micro level (day to day), the team needs to build out an ecosystem to support data analytics and science. This includes tools (R, SPSS, Gephi, Mathematica, Matlab, SAS, HANA, etc.), big data (Cloudera – Hadoop, Flume, Hive, Mahout (important), HBase, etc.), visualization (Raphaël, D3, Polymaps, OpenLayers, Tableau, etc.), computing (local desktops/servers, AWS, etc.), and potentially third party composite processing (Pneuron). Last, but not least, is an Insights Management Framework (a dashboard-driven application to manage an agile, client-centric workflow). This will manage the resolution process around all questions developed with the client (buy or build this application).

While the entertainment industry is a really exciting opportunity, this enterprise-level data science (EDS) framework generalizes to all insights analyses across industries. By investing in the methodology (macro/micro) and infrastructure up front (Hadoop, etc.), the valuation of data science teams will be driven through a more systematic monetization strategy built on insights analysis and reuse.

Film Industry Executives Golden Rule – Total Gross is 3x Opening Box Office Receipts


The film entertainment industry believes that the total gross theater earnings from a film can be determined by looking at the opening gross box office receipts. Industry executives use the rule of thumb that for every dollar earned on opening day, three dollars will be earned in total from box office receipts (i.e., Total Gross = 3 x Open Gross). This is why they invest in all that marketing prior to opening day.

I decided to take a look at this rule of thumb, so I created an R script that pulled the required data from Box Office Mojo (see below). I grabbed all 14K+ films from BOM, did a bit of data cleaning and formatting, then plotted the relationship between Opening Box Office Receipts and Total Gross Theater Earnings. As it turns out, the executives are right: the 95% confidence interval (2.5% to 97.5%) for the golden ratio runs from 3.13 to 3.19. As a correlative predictive model, it is significant (R^2 = .8034).
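For readers who want to see the arithmetic behind a ratio estimate like this, here is a minimal Python sketch. It is not the R script used for the analysis: it assumes a no-intercept model (Total Gross = k x Open Gross), uses a normal-approximation 95% interval, reports the uncentered R^2 that is conventional for no-intercept fits, and runs on synthetic data generated with a made-up ratio of 3.0. The `fit_through_origin` helper is a hypothetical name.

```python
import math
import random


def fit_through_origin(x, y):
    """Least-squares fit of y = k * x (no intercept) with an approximate 95% interval for k."""
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    k = sxy / sxx
    # Residual sum of squares and the standard error of the slope.
    ssr = sum((yi - k * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(ssr / (n - 1) / sxx)
    # Uncentered R^2, the usual convention for no-intercept models.
    r2 = 1 - ssr / sum(yi * yi for yi in y)
    return k, (k - 1.96 * se, k + 1.96 * se), r2


# Synthetic stand-in data: opening gross and a noisy total gross.
rng = random.Random(1)
opening = [rng.uniform(1, 100) for _ in range(500)]
total = [3.0 * o + rng.gauss(0, 20) for o in opening]

k, (low, high), r2 = fit_through_origin(opening, total)
print(f"k = {k:.2f}, 95% interval = ({low:.2f}, {high:.2f}), R^2 = {r2:.3f}")
```

Whether the original analysis included an intercept is not stated, so the numbers from this sketch should not be compared directly with the 3.13 to 3.19 interval above.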


R-SCRIPT (based on Tony Breyal Quick Scrape script)


DSI 001 Integrating R and Hadoop with RHadoop


This is the first in a series of screencasts designed to demonstrate practical aspects of data science. In this episode, I will show you how to integrate R, that awe-inspiring statistical processing environment, with Hadoop, the master of distributed data storage and processing. Once done, we will apply the RHadoop environment to count the number of words in that massive classic book “Moby Dick.”

In this screencast, we are going to set up a Hadoop environment on a Mac OS X operating system; download, install, and configure Hadoop; download and install R and RStudio; download and load the RHadoop packages; configure R; and finally, create and execute a test MapReduce problem. Here, let me show you exactly how all this works.

The scripts to this screencast will be posted over the next couple of days.
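Until those are posted, the shape of the word-count job can be sketched in plain Python. To be clear, this is not the RHadoop code from the screencast and it does not use any Hadoop API; it only mimics the map, shuffle, and reduce steps in a single process. The function names and the two sample lines are stand-ins chosen for illustration.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())


# Stand-in input; a real run would stream the full text of the book.
sample = ["Call me Ishmael.", "Call me, call me."]
counts = word_count(sample)
print(counts)
```

In a real cluster the map and reduce functions run on many nodes at once and the shuffle moves data between them, but the logic of each step is the same.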

60+ R Resources Every Data Scientist Should Be Aware Of!


There are a lot of great R resources on the internet, ranging from one-off articles and texts to comprehensive tutorials. Here are a few of the more popular links:





R: The Video!


In the last ten years, the open source R statistics language has exploded in popularity and functionality, emerging as the data scientist’s tool of choice. Today, R is used by over 2 million analysts worldwide, many having been introduced to its elegance and power in academia. Users around the world have embraced R to solve their most challenging problems in fields ranging from computational biology to quantitative finance, and to train their students in these same fields. The result has been an explosion of R analysts and applications, leading to enthusiastic adoption by premier analytics-driven companies like Google, Facebook, and the New York Times.



Visualizing the CRAN: The Right Package for a Given Task


R is an extremely useful software environment for statistical computing and graphics. But as awesome as it is, it can be quite daunting to find just the right package for a specific task. Well, CRAN Task Views are designed to help alleviate this challenge. The table below matches the task (right) with the package (left). While a bit primitive, as a quick reference it does work.

For those who are much more industrious, check out “Visualizing the CRAN: Graphing Package Dependencies.” The author uses graph visualization to show the relationships between packages in CRAN. His analysis shows that MASS is the most “depended on” package on the CRAN. A total of 294 of the 3794 packages (almost 8%) in the CRAN depend on MASS. An additional 95 suggest MASS, so just over 10% of all R packages either suggest or depend on MASS.
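The counting behind a "most depended on" claim is simple enough to sketch in Python. The miniature package index below is hypothetical (pkgA through pkgD are invented, and the real analysis worked from actual CRAN metadata), and `reverse_counts` is a made-up helper name; only the last two lines use the actual figures quoted above.

```python
from collections import Counter

# Hypothetical miniature package index: name -> declared "depends" and "suggests".
packages = {
    "pkgA": {"depends": ["MASS"], "suggests": []},
    "pkgB": {"depends": ["MASS", "lattice"], "suggests": []},
    "pkgC": {"depends": ["MASS"], "suggests": ["lattice"]},
    "pkgD": {"depends": [], "suggests": ["MASS"]},
}


def reverse_counts(index, field):
    """Count how many packages list each package under `field`."""
    return Counter(dep for meta in index.values() for dep in meta[field])


depended_on = reverse_counts(packages, "depends")
suggested = reverse_counts(packages, "suggests")
top_name, top_count = depended_on.most_common(1)[0]
print(top_name, "is depended on by", top_count, "packages")

# Arithmetic behind the percentages quoted in the text.
share_depends = 294 / 3794
share_either = (294 + 95) / 3794
print(f"{share_depends:.1%} depend on MASS; {share_either:.1%} depend on or suggest it")
```

The same reverse counts are what a dependency graph visualizes: each count is the in-degree of a package node.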

Package (task view)        Task
Bayesian                   Bayesian Inference
ChemPhys                   Chemometrics and Computational Physics
ClinicalTrials             Clinical Trial Design, Monitoring, and Analysis
Cluster                    Cluster Analysis & Finite Mixture Models
DifferentialEquations      Differential Equations
Distributions              Probability Distributions
Econometrics               Computational Econometrics
Environmetrics             Analysis of Ecological and Environmental Data
ExperimentalDesign         Design of Experiments (DoE) & Analysis of Experimental Data
Finance                    Empirical Finance
Genetics                   Statistical Genetics
Graphics                   Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
HighPerformanceComputing   High-Performance and Parallel Computing with R
MachineLearning            Machine Learning & Statistical Learning
MedicalImaging             Medical Image Analysis
MetaAnalysis               Meta-Analysis
Multivariate               Multivariate Statistics
NaturalLanguageProcessing  Natural Language Processing
OfficialStatistics         Official Statistics & Survey Methodology
Optimization               Optimization and Mathematical Programming
Pharmacokinetics           Analysis of Pharmacokinetic Data
Phylogenetics              Phylogenetics, Especially Comparative Methods
Psychometrics              Psychometric Models and Methods
ReproducibleResearch       Reproducible Research
Robust                     Robust Statistical Methods
SocialSciences             Statistics for the Social Sciences
Spatial                    Analysis of Spatial Data
SpatioTemporal             Handling and Analyzing Spatio-Temporal Data
Survival                   Survival Analysis
TimeSeries                 Time Series Analysis
gR                         gRaphical Models in R