Six Lists of Lists for Data Scientists

Data science relies on a strong foundation of mathematics, statistics, and results visualization, most of which are available through the R statistical programming ecosystem. To master data science, one needs to delve into some of the more important pieces of literature (spending 10,000 hours). But what does one read and when?

While many have tried, it is impractical to define the definitive list of R resources, given all the great blogs, texts, and videos available. Most attempts to create such a list fail from the start. So, in many cases, one needs just to Google the phrase “R Resources” in order to find 80% of the good ones, while exerting less than 20% of one’s overall research effort.

For my list, here are the texts and PDFs that I keep near or with me most of the time:

General introductions to R
1.  An introduction to R. Venables and Smith (2009) – PDF
2.  A beginner’s guide to R (Use R!). Zuur et al. (2009) – Text
3.  R for Dummies. Meys and de Vries (2012) – Text
4.  The R book. Crawley (2012)
5.  R in a nutshell: A desktop quick reference. Adler (2012)

Statistics books
1.  Statistics for Dummies. Gotelli and Ellison (2012) – Text
2.  Statistical methods. Snedecor and Cochran (2014) – Text
3.  Introduction to Statistics: Fundamental Concepts and Procedures of Data Analysis. Reid (2013) – Text

Statistics books specifically using R
1.  Introductory statistics: a conceptual approach using R. Ware et al. (2012) – Text
2.  Foundations and applications of statistics: an introduction using R. Pruim (2011) – Text
3.  Probability and statistics with R, 2nd Edition. Ugarte et al. (2008) – Text

Visualization using R
1.  ggplot2: elegant graphics for data analysis. Wickham (2009)
2.  R graphics cookbook. Chang (2013)

Programming using R
1.  The art of R programming. Matloff (2011)
2.  Mastering Data Analysis with R. Daroczi (2015)

Interesting predictive analytics books
1.  The Signal and the Noise: Why So Many Predictions Fail – But Some Don’t. Silver (2012)
2.  Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Siegel (2013)

Data Visualization – Lessons Based on Stephen Few


Edward Tufte’s principal point in data visualization is, above all else, to show the data. But with too many graphical options to count, how does one go about creating effective visualizations? A short video presentation by Tyler Rinker tries to address this critical question.

The presentation, which is based on the works of Stephen Few and was given at the Center for Literacy and Research Instruction’s 50th Anniversary Conference, focuses on designing graphs that are in tune with the brain/eye perceptual subsystem, thus maximizing graph effectiveness. Few argues that in order to show the data effectively, one needs to use pre-attentive visual attributes (length, position, motion, color, hue, intensity, blur, etc.) to grab and direct the viewer’s attention (iconic memory), while constraining the visuals to work within the limits of working memory.

Rinker’s presentation is an excellent source of definitions (charts, graphics, tables, diagrams, geoms, etc.), graph parts (primary data, secondary data, non-data, and chart junk), and examples (bars, boxes, lines, points, etc.). He also provides an entry-level view of the brain and memory, and of how they affect the data visualization process. Finally, he wraps up with visualization do’s and don’ts (e.g., don’t use 3D, but do use faceting).
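To make the faceting advice concrete, here is a minimal sketch of the idea in plain Python. It is not taken from Rinker’s presentation: the helper names `facet` and `shared_range` and the sample records are made up for illustration. Faceting splits one data set into small panels by a category, and the panels usually share axis ranges so the eye can compare them directly.

```python
from collections import defaultdict

def facet(records, by):
    """Split records into one list per distinct value of the `by` field."""
    panels = defaultdict(list)
    for record in records:
        panels[record[by]].append(record)
    return dict(panels)

def shared_range(records, field):
    """Return (min, max) of a field across all records, for a common axis."""
    values = [record[field] for record in records]
    return min(values), max(values)

# Made-up sample data, for illustration only.
records = [
    {"region": "north", "year": 2011, "sales": 10},
    {"region": "north", "year": 2012, "sales": 14},
    {"region": "south", "year": 2011, "sales": 7},
    {"region": "south", "year": 2012, "sales": 21},
]

panels = facet(records, by="region")        # one panel per region
y_limits = shared_range(records, "sales")   # same y-axis for every panel
```

Each entry in `panels` would be drawn as its own small chart, all using `y_limits`, instead of stacking a third dimension onto a single plot.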

Datalandia – Invasion of the Cattle Snatchers [VIDEO]


It’s not every day that big ideas come to the little screen, but GE has done just that with Datalandia. This short video promo of a fictional land where menacing space aliens collide with brilliant machines and Big Data is a brilliant reminder that visualization and storytelling are important capabilities in the quest to find revelations in data.

Visualization: The Artist in the Data Scientist

Data Scientist Insights

The artistic part of data visualization is one of the most underappreciated parts of data science. On a day-to-day basis, we spend most of our time aggregating data sets, exploring their multidimensional depths, making and testing models, and visualizing the results. However, the visualizations produced by many data scientists are often lackluster and uninspiring. But we can do better.

Infographics, data visualized to yield revelations rather than mere observations, are one of the simplest ways a data scientist can turn on the artisan within. Whitespace, darkspace, type fonts, graphics, colors, and textures are new concepts encountered in this new world. While they may seem complex and confusing, the good news about infographics is that there are several great sources of information available to those looking to start this journey.

From a book perspective, one of the best texts on the market is: The Power of Infographics: Using Pictures to Communicate and Connect With Your Audiences. Mark Smiciklas covers everything from the business value to the creative process and into distribution. This is one of those books that tools like the iPad and Kindle were made for.

In addition to books, online courses are also available to those looking to build out new skills using tools like Adobe Illustrator. My two favorite online training sites are:

Tut Plus: How to Create Outstanding Modern Infographics

Lynda: Creating Infographics with Illustrator

Of the two, I found the Lynda course taught by Mordy Golding the most useful. He identified five core characteristics of great designs that you will want to consider: Contrast, Hierarchy, Accuracy, Relevance, and Truth. The last is probably the most important, in that Edward Tufte says, “style and esthetics cannot rescue failed content. If the words aren’t truthful, the finest typography won’t turn lies into truth.” Without truth there can be no understanding.

So, putting all this together and fingers to my keyboard, I produced this renewable energy infographic. It is based on the techniques discussed by Golding, data from various government agencies, and a bit of number crunching. Not bad. It could use some more artistic flair, but it is much better than the traditional scientifically based graphics I am so used to producing.



The Art of Data Visualization: Getting Design Out Of The Way


This is a great PBS Off Book webisode that covers the visualization topic from the point of view appreciated by a data scientist – the data. Edward Tufte points out that “You want to see to learn something, not to confirm something.” At the same time, Jer Thorp says that “Data Visualization is about Revelation – seeing something you have never seen before.”

Revelation is truly seeing to learn, a key characteristic that differentiates data science from business intelligence. BI visualization (e.g., dashboards, pie charts, etc.) is all about seeing to confirm. Very operational, very tactical. In contrast, Data Science strives to learn through visualization, which is very strategic and hopefully transformative.

From scientific visualization to pop infographics, designers are increasingly tasked with incorporating data into the media experience. Data has emerged as such a critical part of modern life that it has entered into the realm of art, where data-driven visual experiences challenge viewers to find personal meaning from a sea of information, a task that is increasingly present in every aspect of our information-infused lives.

This short 8 min webisode features:
Edward Tufte, Yale University
Julie Steele, O’Reilly Media
Josh Smith, Hyperakt
Jer Thorp, Office for Creative Research


Data Visualization Can Be A Beautiful Thing

Data visualization is a part, a very significant part, of the big data story. The human brain, through its 100 billion neurons each interconnected 10,000 times, has an absolutely amazing visual processing capability, which is arguably surpassed by none. So leveraging this ability should not come as a surprise when thinking about the human component in the quadrication of big data.

Paul Butler, a Facebook intern, created this earthly visualization using R, an open source statistical programming language. Based on around 10 million samples of friend relationships taken from Facebook’s Apache Hive (a data warehouse system for Hadoop), he was able to plot the weights for each pair of cities as a function of the Euclidean distance between them and the number of their respective friends. The result is this astonishing image of the earth. By the way, notice anything missing in the geographical location where China normally resides?
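The idea of weighting each city pair can be sketched in a few lines of Python. The exact function Butler used is not given here, so the `pair_weight` formula below (log of the friendship count divided by one plus the distance) is purely an assumption chosen for illustration, as are the city coordinates and counts.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def pair_weight(city_a, city_b, friend_count):
    """Hypothetical weight: rises with friendships, falls with distance."""
    distance = euclidean_distance(city_a, city_b)
    return math.log1p(friend_count) / (1.0 + distance)

# Made-up coordinates and friendship counts, for illustration only.
cities = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (30.0, 40.0)}
friendships = {("A", "B"): 500, ("A", "C"): 500, ("B", "C"): 20}

weights = {
    pair: pair_weight(cities[pair[0]], cities[pair[1]], count)
    for pair, count in friendships.items()
}

# Weakest pair first: drawing lines in this order would leave the
# strongest connections on top of the others.
draw_order = sorted(weights, key=weights.get)
```

Any monotone combination of the two inputs would serve the same purpose; the point is only that every pair of cities receives a single number that controls how prominently its line is drawn.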


Here is another awesome image, this time created by Eric Fischer using a heat map (a type of visualization tool) of places with Flickr photographs and Tweets. Twitter and Flickr may not individually have enough users to create a map as detailed as the Facebook image, but put them together, and there’s a wealth of information that can be discovered in all of its glory.
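For readers new to heat maps, the core step is simply counting how many points land in each cell of a grid. The sketch below is not Fischer’s method; the function name `bin_points`, the one-degree cell size, and the coordinates are all assumptions made for illustration. It also shows why two sparse sources can add up to a denser map.

```python
import math
from collections import Counter

def bin_points(points, cell_size=1.0):
    """Count (lat, lon) points per grid cell of the given size in degrees."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        counts[cell] += 1
    return counts

# Made-up locations standing in for geotagged photos and tweets.
photo_points = [(40.7, -74.0), (40.2, -73.9), (51.5, -0.1)]
tweet_points = [(40.9, -73.5), (51.4, -0.3), (35.6, 139.7)]

photo_counts = bin_points(photo_points)
tweet_counts = bin_points(tweet_points)

# Adding the two Counters merges the sources cell by cell.
combined = photo_counts + tweet_counts
```

A plotting library would then map each cell’s count in `combined` to a color intensity.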


Data visualization is a powerful exploratory tool that should be exploited as early as possible when working to monetize your big data. Do not assume away discoveries just because you don’t know what you don’t know (third level of knowledge). Let the data take your brain on an unguided journey where discovery is the ultimate destination.

Design Thinking for Effective Data Visualization

Noah Iliinsky is one of the best presenters on complex visualization tools and techniques. One of the classic stories Noah tells is how he used visualization tools to decide what bicycle tires to buy. In this presentation he talks about design thinking for effective data visualization. It is well worth 18 minutes of your time.


Make sure to pay attention to the section on “How to Visualize: Appropriate Encodings,” which occurs at about 8:10. In this section, Noah covers the properties and best uses of visual encoding. Yes, a bit geeky, but essential if you are a data scientist.