20 R Packages That Should Impact Every Data Scientist

Anybody who has used R knows just how frustrating it can be to have an analytical idea in mind that is hard to express. From a language perspective, R is fairly straightforward. For those just starting to learn it, a wide range of resources is available, from free tutorials to commercial texts. A quick Google search on most structural R questions will quickly lead to a handful of viable solutions (Learn R Blog, R Overview, Example R graphics, R Blogger, etc.). But the power of R lies less in its grammar and more in its packages.

Earlier this year, Yhat published a great article, “10 R Packages I wish I knew about earlier,” that should be the basis for exploring R’s powerful capabilities. As Yhat also points out, while R can be a bit more “obscure than other languages,” it provides thousands of useful packages through its vibrant, growing community. Here is a re-listing of those packages:

1. sqldf – Manipulate R data frames using SQL.
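
A quick, hypothetical example (the `players` data frame below is made up for illustration):

```r
library(sqldf)

# A made-up data frame for illustration
players <- data.frame(
  name  = c("Phil", "Tiger", "Vijay", "Phil", "Tiger"),
  score = c(70, 66, 68, 72, 68)
)

# Average score per player, expressed as plain SQL
avg_scores <- sqldf("SELECT name, AVG(score) AS avg_score
                     FROM players GROUP BY name ORDER BY name")
print(avg_scores)
```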

2. forecast – Methods and tools for displaying and analysing univariate time series forecasts including exponential smoothing via state space models and automatic ARIMA modelling.
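
For instance, a hands-off forecast of the built-in AirPassengers series might look like this sketch, with auto.arima() choosing the model order automatically:

```r
library(forecast)

# Fit an ARIMA model to the classic AirPassengers series,
# letting auto.arima() choose the order automatically
fit <- auto.arima(AirPassengers)

# Forecast the next 12 months, with prediction intervals
fc <- forecast(fit, h = 12)
print(fc)
```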

3. plyr – A set of tools for a common set of problems: you need to split up a big data structure into homogeneous pieces, apply a function to each piece and then combine all the results back together.
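
A split-apply-combine sketch using the built-in mtcars data:

```r
library(plyr)

# Split mtcars by cylinder count, summarise each piece,
# and combine the results back into one data frame
mpg_by_cyl <- ddply(mtcars, "cyl", summarise,
                    avg_mpg = mean(mpg),
                    n       = length(mpg))
print(mpg_by_cyl)
```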

4. stringr – a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use.
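
A small illustration of the consistent verb naming (the messy vector is made up):

```r
library(stringr)

messy <- c("  Phil ", "TIGER", "vijay  ")

# str_trim() strips surrounding whitespace; str_detect() tests a pattern
clean <- str_trim(messy)
print(clean)
print(str_detect(clean, "^[A-Z]"))  # which entries start capitalized?
```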

5. Database Drivers (via install.packages) – R has drivers for nearly every commercially viable database. If you can’t find a specific interface for your database, then you can always use RODBC. Examples: RPostgreSQL, RMySQL, RMongo, RODBC, RSQLite, etc.

6. lubridate – Make dealing with dates a little easier.
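
A few illustrative calls:

```r
library(lubridate)

# Parse dates from different common formats with one verb each
d1 <- ymd("2013-08-25")
d2 <- mdy("08/25/2013")

# Extract components without format() gymnastics
print(year(d1))
print(wday(d1, label = TRUE))
print(d1 == d2)
```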

7. ggplot2 – A plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts.
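
A minimal sketch of the layered grammar, using the built-in mtcars data:

```r
library(ggplot2)

# Layered grammar: data + aesthetic mappings + geometries
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)  # render the plot
```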

8. qcc – A library for statistical quality control: Shewhart quality control charts for continuous, attribute and count data; Cusum and EWMA charts; operating characteristic curves; process capability analysis; Pareto charts and cause-and-effect charts; and multivariate control charts.

9. reshape2 – Reshape lets you flexibly restructure and aggregate data using just two functions: melt and cast. This Hadley Wickham package specializes in converting data from wide to long format, and vice versa.
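
A small melt/cast round trip (the wide data frame is made up for illustration):

```r
library(reshape2)

# Wide format: one row per player, one column per round
wide <- data.frame(player = c("Phil", "Tiger"),
                   round1 = c(70, 66),
                   round2 = c(72, 68))

# melt: wide -> long
long <- melt(wide, id.vars = "player",
             variable.name = "round", value.name = "score")
print(long)

# dcast: long -> wide again
back <- dcast(long, player ~ round, value.var = "score")
print(back)
```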

10. randomForest – A machine learning package that performs classification and regression based on a forest of trees using random inputs, through supervised or unsupervised learning.
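
A supervised-classification sketch on the built-in iris data:

```r
library(randomForest)

set.seed(42)  # make the forest reproducible

# Supervised classification: predict Species from the four measurements
fit <- randomForest(Species ~ ., data = iris, ntree = 500)

print(fit)              # OOB confusion matrix and error rate
print(importance(fit))  # variable importance scores
```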

In addition to these packages, anybody working in the social sciences will also want to look into:

11. Zelig – a one-stop statistical shop for nearly all regression model specifications. It can estimate, and help interpret the results of, an enormous range of statistical models.

12/13. Statnet/igraph – An integrated set of tools for the representation, visualization, analysis, and simulation of network data.

14. Amelia II – Contains a set of algorithms for multiple imputation of missing data across a wide range of data types, such as survey, time series, and cross-sectional.

15. nlme – Used to fit and compare Gaussian linear and nonlinear mixed-effects models.

16/17. SNOW (Simple Network of Workstations)/Rmpi – Support for simple parallel computing in R.

18/19. xtable/apsrtable – Packages that convert R summary results into LaTeX/HTML table format.

20. plm – Contains all of the necessary model specifications and tests for fitting a panel data model, including specifications for instrumental variable models.

So, what are some of your favorite and/or necessary R packages, and why? Post them in the comments section and let’s build out this space together.

Singular Value Decomposition (SVD): A Golfer’s Tutorial

Singular Value Decomposition (SVD) is one of my favorite tools for factorizing data, but it can be a rather hard concept to wrap one’s brain around, especially if you don’t have a strong mathematical background. To gain a more practical understanding of how an SVD is performed and how it is applied, many resort to Googling terms like “Singular Value Decomposition tutorial” and “Singular Value Decomposition practical example,” only to be disappointed by the results. Fortunately, here is a tutorial that is both easy to understand and applies a practical example that many can relate to: Golf Score Prediction Using SVD.

This tutorial breaks down the SVD process by looking at the golf scores of three players – Phil, Tiger, and Vijay. By starting with a simple, naive example, the author builds a complete understanding of not only the practical mechanics of SVD but the mathematical background as well. Overall, a simple and elegant example.

Based on the tutorial work, here are a few R scripts I used to recreate the results:

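
The original scripts were embedded as images that did not survive, so here is a base-R sketch of the kind of setup the tutorial uses. The scores below are illustrative stand-ins, not the tutorial’s actual data:

```r
# Hypothetical 18-hole totals for three players at four courses.
# A (nearly) rank-1 matrix: each score is roughly
# player ability x course difficulty, which is what SVD will uncover.
golf <- matrix(c(70, 66, 68,
                 72, 68, 70,
                 75, 70, 72,
                 68, 65, 67),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("course", 1:4),
                               c("Phil", "Tiger", "Vijay")))
print(golf)
```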

Then, one can compute the SVD:

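
With a score matrix in hand (re-created here with the same illustrative numbers so the snippet runs on its own), the decomposition is a single base-R call:

```r
# Illustrative score matrix, repeated so this snippet is standalone
golf <- matrix(c(70, 66, 68, 72, 68, 70, 75, 70, 72, 68, 65, 67),
               nrow = 4, byrow = TRUE)

s <- svd(golf)   # returns list(d, u, v) with golf == u %*% diag(d) %*% t(v)

print(s$d)  # singular values, largest first
print(s$u)  # left singular vectors (courses)
print(s$v)  # right singular vectors (players)
```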


Graphically, the singular values can be visualized as,

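
The tutorial’s chart was also an image; a simple way to see how much each singular value contributes (again using the illustrative matrix) is a scree-style bar plot:

```r
# Illustrative score matrix, repeated so this snippet is standalone
golf <- matrix(c(70, 66, 68, 72, 68, 70, 75, 70, 72, 68, 65, 67),
               nrow = 4, byrow = TRUE)
s <- svd(golf)

# Proportion of variance captured by each singular value
variance_share <- s$d^2 / sum(s$d^2)
print(round(variance_share, 4))

# Scree-style plot: the first value should dominate
barplot(variance_share,
        names.arg = paste0("sv", seq_along(s$d)),
        ylab = "Share of variance")
```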

This means that the first left and right singular vectors ($u, $v), together with the leading singular value, represent almost 98.9% of the variance in the matrix. In R, we can approximate the result with,

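
A rank-1 sketch of that approximation, using the same illustrative matrix as in the snippets above (the original code image is gone):

```r
# Illustrative score matrix, repeated so this snippet is standalone
golf <- matrix(c(70, 66, 68, 72, 68, 70, 75, 70, 72, 68, 65, 67),
               nrow = 4, byrow = TRUE)
s <- svd(golf)

# Rank-1 approximation: leading singular value times the outer
# product of the first left and right singular vectors
approx1 <- s$d[1] * outer(s$u[, 1], s$v[, 1])

print(round(approx1, 1))         # close to the original scores
print(max(abs(approx1 - golf)))  # worst-case absolute error
```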



Five Graphical Perception Best Practices Every Data Scientist Should Know

Graphical perception – the visual encoding of data on graphs – is an important consideration in data exploration and presentation visualization. In their seminal work, “Graphical Perception: Theory, Experimentation and Application to the Development of Graphical Methods,” William Cleveland and Robert McGill lay the foundational theory for setting guidelines on graph construction.

Their graphical perception research enables the data scientist to maximize the likelihood of value transfer for the incurred study cost (AKA data monetization). As we encode information (relevant data) into graphics, the viewer has to decode the data and interpret the results. This is an asymmetric and error-prone knowledge transformation. Fortunately, Cleveland and McGill have identified several best practices that reduce the likelihood of viewer misperception.

Five Graphical Perception Best Practices:

  • Use common scales when possible – it is hard to compare across scales, especially offset ones
  • Use position comparisons on identical scales when possible
  • Limit the use of length comparisons – proportions are difficult to interpret
  • Limit pie charts – angular and curvature comparisons are hard to interpret
  • Do not use 3-D charts or shading
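
To make the pie-chart point concrete, here is a base-R sketch that renders the same made-up market-share data as both a pie chart and a dot chart; judging which segment is second-largest is far easier from positions than from angles:

```r
# Made-up market shares for five products
shares <- c(A = 0.24, B = 0.21, C = 0.20, D = 0.19, E = 0.16)

op <- par(mfrow = c(1, 2))  # side-by-side panels

# Angular encoding: similar segments are hard to rank
pie(shares, main = "Angles (pie)")

# Positional encoding on a common scale: easy to rank
dotchart(sort(shares), xlim = c(0, 0.25),
         main = "Positions (dot chart)", xlab = "Share")

par(op)
```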

Elementary Perceptual Task

[Figure: Cleveland and McGill’s elementary perceptual tasks]

Data Monetization: A Road Paved On Top Of Data Sets

The road to efficient data monetization is paved on top of effective data sets. No single source of data is comprehensive enough to be an all-encompassing source of transformational insights. It is only through the fusion of orthogonal data sets (independent subject areas) that true insight into those things we don’t know we don’t know (level-three knowledge) can be revealed. While we have access to data of interest (ERPs, IT, etc.), where can we find other sources to aid in this third-level knowledge spelunking?

While data is everywhere, useful data sets are not. A Google search on terms like “open data sets” or “data sets in R” reveals thousands of sources. Over the years as a CTO and Data Scientist, I have collected a few hundred myself. In 2011, however, I came across the work of RevoJoe at Revolution Analytics, which more or less got me organized in this area. So here are a few data sets from the list that I maintain today:

Commercial Sources
Data MarketPlace: http://www.infochimps.com/marketplace

Economics
UMD:: http://inforumweb.umd.edu/econdata/econdata.html
World bank: http://data.worldbank.org/indicator

Finance
CBOE Futures Exchange: http://cfe.cboe.com/Data/
Gapminder: http://www.gapminder.org
Google Finance: http://www.google.com/finance (R)
Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
St Louis Fed: http://research.stlouisfed.org/fred2/ (R)
NASDAQ: https://data.nasdaq.com/
OANDA: http://www.oanda.com/ (R)
Yahoo Finance: http://finance.yahoo.com/ (R)

Government
Archived national government statistics: http://www.archive-it.org/
Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
Civic Commons: http://wiki.civiccommons.org/Initiatives
DataMarket: http://datamarket.com/
Datamob: http://datamob.org
Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
Guardian world governments: http://www.guardian.co.uk/world-government-data
List of cities/states by Simply Statistics: http://simplystatistics.org/2012/01/02/list-of-cities-states-with-open-data-help-me-find/
London, U.K. data: http://data.london.gov.uk/catalogue
New Zealand: http://www.stats.govt.nz/tools_and_services/tools/TableBuilder/tables-by…
NYC data: http://nycplatform.socrata.com/
Open Government Data (Hub): http://opengovernmentdata.org
Open Government Data – United States of America: http://www.data.gov
Open Government Data – United Kingdom: http://data.gov.uk
Open Government – France: http://www.data.gouv.fr
OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
San Francisco Data sets: http://datasf.org/
U.K. Government Data:http://data.gov.uk/data
United Nations: http://data.un.org/
U.S. Federal Government Agencies: http://www.data.gov/metric
US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm
The World Bank: http://wdronline.worldbank.org/

Machine Learning
Causality Workbench: http://www.causality.inf.ethz.ch/repository.php
Kaggle competition data: http://www.kaggle.com/
KDnuggets competition site: www.kdnuggets.com/datasets/
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
Machine Learning Data Set Repository: http://mldata.org/
Microsoft Research: http://research.microsoft.com/apps/dp/dl/downloads.aspx
Million songs: http://blog.echonest.com/post/3639160982/million-song-dataset
Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
The Koblenz Network Collection: http://konect.uni-koblenz.de/

Miscellaneous
Datasets: http://www.reddit.com/r/datasets
Datasets: http://www.reddit.com/r/opendata/
Hilary Mason’s research data (Chief Data Scientist at Bit.ly): http://bitly.com/bundles/hmason/1
Kaggle Contests: http://www.kaggle.com/
R Datasets: http://vincentarelbundock.github.com/Rdatasets/datasets.html

Public Domain Collections
Data360: http://www.data360.org/index.aspx
Datamob.org: http://datamob.org/datasets
Factual: http://www.factual.com/topics/browse
Freebase: http://www.freebase.com/
Google: http://www.google.com/publicdata/directory
infochimps: http://www.infochimps.com/
numbray: http://numbrary.com/
Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html (R)
SourceForge Research Data: http://www.nd.edu/~oss/Data/data.html
UFO Reports: http://www.nuforc.org/webreports.html
Wikileaks 911 pager intercepts: http://911.wikileaks.org/files/index.html
Stats4Stem.org: R data sets: http://www.stats4stem.org/data-sets.html (R)
The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html

Science
Agricultural Experiments: http://www.inside-r.org/packages/cran/agridat/docs/agridat (R)
Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter
and ftp://ftp.cmdl.noaa.gov/
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
Geo Spatial Data: http://geodacenter.asu.edu/datalist/
Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
KDnuggets Datasets: http://www.kdnuggets.com/datasets/index.html
MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html
NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/ (R)
Protein structure: http://www.infobiotic.net/PSPbenchmarks/
Public Gene Data: http://www.pubgene.org/
Stanford Microarray Data: http://smd.stanford.edu//

Social Sciences
Analyze Survey Data for Free: http://www.asdfree.com/
General Social Survey: http://www3.norc.org/GSS+Website/
ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp
UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
UPJOHN INST: http://www.upjohn.org/erdc/erdc.html

Time Series
Time Series data Library: http://robjhyndman.com/TSDL/

Universities
Carnegie Mellon University Enron email: http://www.cs.cmu.edu/~enron/
Carnegie Mellon University StatLab: http://lib.stat.cmu.edu/datasets/
Carnegie Mellon University JASA data archive: http://lib.stat.cmu.edu/jasadata/
Ohio State University Financial data: http://fisher.osu.edu/fin/osudata.htm
Stanford Large Network Data: http://snap.stanford.edu/data/
UC Berkeley: http://ucdata.berkeley.edu/
UCI Machine Learning: http://archive.ics.uci.edu/ml/
UCLA: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/
University of Toronto: http://www.cs.toronto.edu/~delve/data/datasets.html