Six Lists of Lists for Data Scientists

Data science relies on a strong foundation of mathematics, statistics, and results visualization, most of which is available through the R statistical programming ecosystem. To master data science, one needs to delve into some of the more important pieces of literature (spending one's 10,000 hours). But what does one read, and when?

While many have tried, it is impractical to define the definitive list of R resources, given all the great blogs, texts, and videos available. Most attempts to create such a list are doomed from the start. So, in many cases, one needs only to Google the phrase “R Resources” to find 80% of the good ones while exerting less than 20% of one’s overall research effort.

For my list, here are the texts and PDFs that I keep near or with me most of the time:

General introductions to R
1.  An introduction to R. Venables and Smith (2009) – PDF
2.  A beginner’s guide to R (Use R!). Zuur et al. (2009) – Text
3.  R for Dummies. Meys and de Vries (2012) – Text
4.  The R book. Crawley (2012)
5.  R in a nutshell: A desktop quick reference. Adler (2012)

Statistics books
1.  Statistics for Dummies. Gotelli and Ellison (2012) – Text
2.  Statistical methods. Snedecor and Cochran (2014) – Text
3.  Introduction to Statistics: Fundamental Concepts and Procedures of Data Analysis. Reid (2013) – Text

Statistics books specifically using R
1.  Introductory statistics: a conceptual approach using R. Ware et al. (2012) – Text
2.  Foundations and applications of statistics: an introduction using R. Pruim (2011) – Text
3.  Probability and statistics with R, 2nd Edition. Ugarte et al. (2008) – Text

Visualization using R
1.  ggplot2: elegant graphics for data analysis. Wickham (2009)
2.  R graphics cookbook. Chang (2013)

Programming using R
1.  The art of R programming. Matloff (2011)
2. Mastering Data Analysis with R. Daroczi (2015)

Interesting predictive analytics books
1. The Signal and the Noise: Why So Many Predictions Fail – But Some Don’t. Silver (2012)
2. Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Siegel (2013)

Fermi Problem Solving for Data Scientists

How many new tires can be sold in the Philadelphia area just prior to its first snow storm? How many people will die from the next pandemic that infects North America? What is the global revenue potential for a new medical app on the iPad Pro that helps first-time parents with their newborn child? These are relatively simple questions that data scientists are often asked to address.

As simple as they might seem, the real world is fraught with networks of complexity, and data scientists are often accused of overthinking solutions as they try to make sense of it. Even the simplest of explorations, like determining the number of tires sold, can take on unbounded fidelity without proper problem scoping. In turn, this can result both in exponential growth of the data and in greater uncertainty in our confidence in observing that data.

It is important for the analyst to grossly understand, to estimate, the solution without spending time and money on detailed analyses supported by countless models. One such type of estimation is called a Fermi problem, a framework designed to teach dimensional analysis that can be thought of as a “back-of-the-envelope calculation.” Fermi problems are often used in engineering and the sciences to scope the larger problem before attempting to build complex models that yield more precise answers.

Michael Mitchell does an excellent job at TED Ed talking about Fermi approaches when dealing with complex problems:

Interesting. Yes?

Moving on…while Fermi estimation has no formal calculus, with the help of Sherman Kent’s (CIA analyst) perspective on information, one can break the approach down into the following equation:

Fermi Estimation = things we know for certain (facts) + things we should know, but don’t (assumptions, which have ranges) + things we don’t know we don’t know (error term)

The first term is as close as one can come to a statement of indisputable fact. It describes something knowable and known with a high degree of certainty.

The second term is a judgment or estimate. It describes something which is knowable in terms of the human understanding but not precisely known by the man who is talking about it.

The third term is another judgment or estimate, this one made almost without any evidence, direct or indirect. It may be an estimate of something that no man alive can know or will ever know. As such, it truly represents the ultimate error in our knowledge.
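The three terms above can be sketched in code. The following is a minimal, illustrative back-of-the-envelope calculation for the tire question from earlier; every number in it is a made-up assumption for demonstration, not real market data, and the calculation itself is just one plausible way to decompose the problem:

```python
# Fermi-style estimate: tires sold in the Philadelphia area before a snow storm.
# All figures below are illustrative assumptions, not real data.

# Term 1 -- things we know for certain (facts); here, a rough public figure:
metro_population = 6_000_000  # approximate Philadelphia metro population (assumed)

# Term 2 -- things we should know but don't (assumptions, carried as low/high ranges):
cars_per_person = (0.4, 0.6)        # vehicles per resident
share_buying_tires = (0.01, 0.05)   # fraction of cars buying tires pre-storm
tires_per_purchase = (2, 4)         # tires bought per purchasing car

# Term 3 -- things we don't know we don't know: this error term cannot be
# computed; it is acknowledged by treating the result as a rough range only.

def fermi_range(fact, *assumptions):
    """Multiply a known fact through each assumption's low/high bounds."""
    low = high = fact
    for lo, hi in assumptions:
        low *= lo
        high *= hi
    return low, high

low, high = fermi_range(metro_population, cars_per_person,
                        share_buying_tires, tires_per_purchase)
print(f"Estimated tires sold: {low:,.0f} to {high:,.0f}")
```

With these particular assumptions the range spans roughly 48,000 to 720,000 tires, an order-of-magnitude frame of reference rather than a precise forecast, which is exactly the role a Fermi estimate is meant to play.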

The Fermi estimation approach, as you can see, provides an answer before turning to more sophisticated modeling methods, and it serves as a useful check on their results. As long as the assumptions in the estimate are reasonable, Fermi estimation gives a quick and simple way to obtain a “frame of reference” for what might be a reasonable expectation of the final answer.