Six Lists of Lists for Data Scientists

Data science relies on a strong foundation of mathematics, statistics, and results visualization, most of which are available through the R statistical programming ecosystem. To master data science, one needs to delve into some of the more important pieces of literature (and spend the proverbial 10,000 hours). But what does one read, and when?

While many have tried, it is impractical to define the definitive list of R resources, given all the great blogs, texts, and videos available. Most attempts to create such a list are failures from the start. So, in many cases, one needs just to Google the phrase “R Resources” to find 80% of the good ones while exerting less than 20% of one’s overall research effort.

For my list, here are the texts and PDFs that I keep near or with me most of the time:

General introductions to R
1.  An introduction to R. Venables and Smith (2009) – PDF
2.  A beginner’s guide to R (Use R!). Zuur et al. (2009) – Text
3.  R for Dummies. Meys and de Vries (2012) – Text
4.  The R book. Crawley (2012)
5.  R in a nutshell: A desktop quick reference. Adler (2012)

Statistics books
1.  Statistics for Dummies. Gotelli and Ellison (2012) – Text
2.  Statistical methods. Snedecor and Cochran (2014) – Text
3.  Introduction to Statistics: Fundamental Concepts and Procedures of Data Analysis. Reid (2013) – Text

Statistics books specifically using R
1.  Introductory statistics: a conceptual approach using R. Ware et al. (2012) – Text
2.  Foundations and applications of statistics: an introduction using R. Pruim (2011) – Text
3.  Probability and statistics with R, 2nd Edition. Ugarte et al. (2008) – Text

Visualization using R
1.  ggplot2: elegant graphics for data analysis. Wickham (2009)
2.  R graphics cookbook. Chang (2013)

Programming using R
1.  The art of R programming. Matloff (2011)
2.  Mastering Data Analysis with R. Daroczi (2015)

Interesting predictive analytics books
1.  The Signal and the Noise: Why So Many Predictions Fail – But Some Don’t. Silver (2012)
2.  Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Siegel (2013)

DSI 001 Integrating R and Hadoop with RHadoop

2013 08 25 22 00 19

This is the first in a series of screencasts designed to demonstrate practical aspects of data science. In this episode, I will show you how to integrate R, that awe-inspiring statistical processing environment, with Hadoop, the master of distributed data storage and processing. Once that is done, we will use the RHadoop environment to count the words in that massive classic, “Moby Dick.”

In this screencast, we are going to set up a Hadoop environment on Mac OS X; download, install, and configure Hadoop; download and install R and RStudio; download and load the RHadoop packages; configure R; and finally, create and execute a test MapReduce job. Here, let me show you exactly how all this works.

The scripts for this screencast will be posted over the next couple of days.
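
As a rough illustration of what a word-count MapReduce job does, here is a minimal sketch in plain Python. It is not the screencast's script and does not use RHadoop or Hadoop at all; it only mimics the map, shuffle, and reduce steps in memory. The function names and the two sample lines are assumptions made up for this example.

```python
import re
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def shuffle(pairs):
    """Shuffle step: group the emitted counts by word (the key)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input; a real run would read the full book text.
    sample = ["Call me Ishmael.", "Call me when the whale appears."]
    for word, n in sorted(word_count(sample).items()):
        print(word, n)
```

On a real cluster, the map and reduce functions run in parallel on different nodes and the framework handles the shuffle; the in-memory version above only shows how the three steps fit together.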

R: The Video!

In the last ten years, the open source R statistics language has exploded in popularity and functionality, emerging as the data scientist’s tool of choice. Today, R is used by over 2 million analysts worldwide, many having been introduced to its elegance and power in academia. Users around the world have embraced R to solve their most challenging problems in fields ranging from computational biology to quantitative finance, and to train their students in these same fields. The result has been an explosion of R analysts and applications, leading to enthusiastic adoption by premier analytics-driven companies like Google, Facebook, and the New York Times.