20 R Packages That Should Impact Every Data Scientist

Anybody who has used R knows just how frustrating it is to have an analytical idea in mind that is hard to express. From a language perspective, R is pretty straightforward. For those who are just starting to learn it, there is a wide range of resources available, from free tutorials to commercial texts. A quick Google search on most structural R questions will lead to a handful of viable solutions (Learn R Blog, R Overview, Example R graphics, R Blogger, etc.). But the power of R lies less in its grammar and more in its packages.

Earlier this year, Yhat published a great article, “10 R Packages I wish I knew about earlier,” that should be the basis for exploring R’s powerful capabilities. As Yhat also points out, while R can be a bit more “obscure than other languages,” it provides thousands of useful packages through its vibrant, growing community. Here is a re-listing of those packages:

1. sqldf – Manipulate R data frames using SQL.

2. forecast – Methods and tools for displaying and analysing univariate time series forecasts including exponential smoothing via state space models and automatic ARIMA modelling.

3. plyr – A set of tools for a common set of problems: you need to split up a big data structure into homogeneous pieces, apply a function to each piece and then combine all the results back together. 

4. stringr – A set of simple wrappers that make R’s string functions more consistent, simpler and easier to use.

5. Database Drivers (via install.packages) – R has drivers for nearly every commercially viable database. If you can’t find a specific interface for your database, you can always use RODBC. Examples: RPostgreSQL, RMySQL, RMongo, RODBC, RSQLite, etc.

6. lubridate – Makes dealing with dates a little easier.

7. ggplot2 – A plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts.

8. qcc – A library for statistical quality control, including Shewhart quality control charts for continuous, attribute and count data; Cusum and EWMA charts; operating characteristic curves; process capability analysis; Pareto and cause-and-effect charts; and multivariate control charts.

9. reshape2 – Lets you flexibly restructure and aggregate data using just two functions: melt and cast (in reshape2 the cast step is done with dcast or acast). This Hadley Wickham package specializes in converting data from wide to long format, and vice versa.

10. randomForest – A machine learning package that performs classification and regression based on a forest of trees using random inputs, through supervised or unsupervised learning.
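To make two of the descriptions above concrete, here is a minimal sketch of the split-apply-combine pattern from plyr (item 3) and the melt/cast round trip from reshape2 (item 9). It assumes both packages are installed and uses R's built-in iris data set; the column names mean_sepal_length, mean_petal_length, measurement and value are illustrative choices, not anything prescribed by the packages.

```r
# A minimal sketch; assumes the plyr and reshape2 packages are installed.
library(plyr)
library(reshape2)

# Split-apply-combine with plyr: split iris by Species, summarise each
# piece, and combine the per-species results into one data frame.
species_means <- ddply(iris, "Species", summarise,
                       mean_sepal_length = mean(Sepal.Length),
                       mean_petal_length = mean(Petal.Length))

# Wide to long with reshape2: one row per (Species, measurement) pair.
long <- melt(iris, id.vars = "Species",
             variable.name = "measurement", value.name = "value")

# Long back to wide, aggregating with the mean: one row per Species.
wide <- dcast(long, Species ~ measurement,
              value.var = "value", fun.aggregate = mean)

print(species_means)
print(wide)
```

Both routes arrive at the same per-species means, which is a handy sanity check when you are learning either package.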

In addition to these packages, anybody working in the social sciences will also want to look into:

11. Zelig – A one-stop statistical shop for nearly all regression model specifications. It can estimate, and help interpret the results of, an enormous range of statistical models.

12/13. Statnet/igraph – An integrated set of tools for the representation, visualization, analysis, and simulation of network data.

14. Amelia II – Contains a set of algorithms for multiple imputation of missing data across a wide range of data types, such as survey, time series and cross-sectional.

15. nlme – Used to fit and compare Gaussian linear and nonlinear mixed-effects models.

16/17. snow (Simple Network of Workstations)/Rmpi – Support for simple parallel computing in R.

18/19. xtable/apsrtable – Packages that convert R summary results into LaTeX/HTML table format.

20. plm – Contains the model specifications and tests needed to fit a panel data model, including specifications for instrumental variable models.
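As a rough illustration of the "simple parallel computing" mentioned in item 16/17, the sketch below uses the parallel package, which ships with R and whose cluster functions build on snow. The workload (bootstrap estimates of a mean), the helper name boot_mean, and the choice of two local workers are all assumptions made for the example.

```r
# A minimal sketch using the parallel package that ships with R.
library(parallel)

# Hypothetical workload: one bootstrap estimate of the mean of x.
boot_mean <- function(i, x) mean(sample(x, replace = TRUE))

set.seed(1)
x <- rnorm(1000)

cl <- makeCluster(2)           # two local worker processes (an assumption)
clusterSetRNGStream(cl, 42)    # reproducible random streams on the workers

# Run 200 bootstrap replicates, spread across the workers.
estimates <- unlist(parLapply(cl, 1:200, boot_mean, x = x))

stopCluster(cl)                # always release the workers when done

summary(estimates)
```

The same code shape (create a cluster, apply a function over inputs, stop the cluster) carries over to larger clusters; only the makeCluster call changes.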

So, what are some of your favorite or most necessary R packages, and why? Post them in the comments section and let’s build out this space together.



13 replies

  1. For point 16/17, I would advise using the new core package “parallel”.

  2. Yes, I’ll echo “parallel” over “snow” and “Rmpi”. I’ve had nothing but grief configuring MPI on Linux and sometimes Rmpi doesn’t even compile. I gave up on it. See the O’Reilly book “Parallel R” for more.

    My favorite packages? Well, I can’t live without the RStudio IDE, knitr, Shiny and tm.

  3. The one I can’t-live-without, use-every-single-day: foreign.

  4. I would echo that knitr has really transformed reproducible research within R. In addition, the slidify package should be discussed as an honorable mention; it is young but has significant potential.

  5. I would also include the shiny package, which lets you develop simple and complex web applications using R. It’s really good.

  6. The data.table package should also be in the top 20. It is a very strong competitor to plyr.

  7. I would put data.table as the number one package in this list. If you like sqldf, then you will be blown away by data.table.

    Copy and paste this code and try it out yourself (it assumes you have installed the packages):

    # How fast is data.table compared with sqldf?

    # Load the required packages (they must already be installed)
    libs <- c('sqldf', 'data.table', 'rbenchmark')
    invisible(lapply(libs, library, character.only = TRUE))

    # Two data frames of one million rows each, sharing the join keys id1 and id2
    n <- 1000000
    set.seed(1)
    ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
    rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))

    # Inner join with sqldf, without and with an index on the join keys
    benchmark(replications = 5, order = "user.self",
      noindex.sqldf = (sqldf('select * from ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2')),
      indexed.sqldf = (sqldf(c('create index ldx on ldf(id1, id2)',
        'select * from main.ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2')))
    )

    # The same join with data.table, without and with keys on id1 and id2
    benchmark(replications = 5, order = "user.self",
      noindex.table = {
        ldt <- data.table(ldf)
        rdt <- data.table(rdf)
        merge(ldt, rdt, by = c('id1', 'id2'))
      },
      indexed.table = {
        ldt <- data.table(ldf, key = c('id1', 'id2'))
        rdt <- data.table(rdf, key = c('id1', 'id2'))
        merge(ldt, rdt, by = c('id1', 'id2'))
      }
    )

  8. Although I am very good with SQL, I prefer data.table. In fact, I don’t use sqldf at all.

  9. Use the package “caret”, the mother of all packages, containing 197 models.
