20 R Packages That Should Impact Every Data Scientist

Anybody who has used R knows just how frustrating it is to have an analytical idea in mind that is hard to express. From a language perspective, R is pretty straightforward. For those who are just starting to learn it, there is a wide range of resources available, from free tutorials to commercial texts. A quick Google search on most structural R questions will lead to a handful of viable solutions (Learn R Blog, R Overview, Example R graphics, R Blogger, etc.). But the power of R lies less in its grammar and more in its packages.

Earlier this year, Yhat published a great article on the “10 R Packages I wish I knew about earlier” that should be the basis for exploring R’s powerful capabilities. As Yhat also points out, while R can be a bit more “obscure than other languages,” it provides thousands of useful packages through its vibrant, growing community. Here is a re-listing of those packages:

1. sqldf – Manipulate R data frames using SQL.

2. forecast – Methods and tools for displaying and analysing univariate time series forecasts including exponential smoothing via state space models and automatic ARIMA modelling.

3. plyr – A set of tools for a common set of problems: you need to split up a big data structure into homogeneous pieces, apply a function to each piece and then combine all the results back together.

4. stringr – A set of simple wrappers that make R’s string functions more consistent, simpler and easier to use.

5. Database Drivers (through install.packages) – R has drivers for nearly every commercially viable database. If you can’t find a specific interface for your database, you can always use RODBC. Examples: RPostgreSQL, RMySQL, RMongo, RODBC, RSQLite, etc.

6. lubridate – Makes dealing with dates a little easier.

7. ggplot2 – A plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts.

8. qcc – A library for statistical quality control. It covers Shewhart quality control charts for continuous, attribute and count data; Cusum and EWMA charts; operating characteristic curves; process capability analysis; Pareto and cause-and-effect charts; and multivariate control charts.

9. reshape2 – Lets you flexibly restructure and aggregate data using just two functions: melt and cast (in reshape2 the cast step is done with dcast or acast). This Hadley Wickham package specializes in converting data from wide to long format, and vice versa.

10. randomForest – A machine learning package that performs classification and regression based on a forest of trees using random inputs, through supervised or unsupervised learning.
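To make two of the descriptions above concrete, here is a minimal sketch of plyr's split-apply-combine pattern and reshape2's wide-to-long round trip. It assumes both packages are installed; mtcars is a dataset bundled with R, while the small `wide` table and its column names are made up purely for illustration.

```r
# Assumes: install.packages(c("plyr", "reshape2")) has already been run.
library(plyr)
library(reshape2)

# plyr: split mtcars by cylinder count, summarise each piece,
# then combine the per-group results into one data frame.
mpg_by_cyl <- ddply(mtcars, "cyl", summarise,
                    mean_mpg = mean(mpg),
                    n_cars   = length(mpg))
print(mpg_by_cyl)

# reshape2: a small wide table (hypothetical data), one row per subject
# and one column per visit.
wide <- data.frame(id     = 1:3,
                   visit1 = c(5.1, 4.8, 6.0),
                   visit2 = c(5.4, 4.9, 6.3))

# melt(): wide -> long, one row per (id, variable) pair.
long <- melt(wide, id.vars = "id")
print(long)

# dcast(): long -> wide again, spreading "variable" back into columns.
wide_again <- dcast(long, id ~ variable, value.var = "value")
print(wide_again)
```

The same grouping could be done with base R's aggregate(); the appeal of ddply is that the split, the per-piece function and the recombination are all named in a single call.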

In addition to these packages, anybody working in the social sciences will also want to look into:

11. Zelig – A one-stop statistical shop for nearly all regression model specifications. It can estimate, and help interpret the results of, an enormous range of statistical models.

12/13. Statnet/igraph – An integrated set of tools for the representation, visualization, analysis, and simulation of network data.

14. Amelia II – Contains a set of algorithms for multiple imputation of missing data across a wide range of data types, such as survey, time series and cross-sectional data.

15. nlme – Used to fit and compare Gaussian linear and nonlinear mixed-effects models.

16/17. SNOW (Simple Network of Workstations)/Rmpi – Support for simple parallel computing in R.

18/19. xtable/apsrtable – Packages that convert R summary results into LaTeX/HTML table format.

20. plm – Contains all of the necessary model specifications and tests for fitting a panel data model, including specifications for instrumental variable models.
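As one small illustration from this second list, the sketch below uses nlme to fit and compare two Gaussian linear mixed-effects models. It relies on the Orthodont dataset that ships with nlme; the choice of models (random intercept versus random intercept and slope) is an assumption made for the example, not a recommendation from the original article.

```r
library(nlme)  # distributed with R as a recommended package

# Orthodont (bundled with nlme): a dental distance measured at ages
# 8, 10, 12 and 14 for 27 children.

# Model 1: each child gets their own intercept.
fit_intercept <- lme(distance ~ age, random = ~ 1 | Subject,
                     data = Orthodont)

# Model 2: each child gets their own intercept and their own age slope.
fit_slope <- lme(distance ~ age, random = ~ age | Subject,
                 data = Orthodont)

# Population-level (fixed) effects of the simpler model.
print(fixef(fit_intercept))

# Compare the two fits; both share the same fixed effects, so the
# comparison concerns the random-effects structure only.
print(anova(fit_intercept, fit_slope))
```

The anova() table reports AIC, BIC and a likelihood ratio test, which is the "compare" half of what the package description promises.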

So, what are some of your favorite or most essential R packages, and why? Post them in the comments section and let's build out this space together.