Anybody who has used R knows just how frustrating it is to have an analytical idea in mind that is hard to express. From a language perspective, R is pretty straightforward. For those just starting to learn it, there is a wide range of resources available, ranging from free tutorials to commercial texts. A quick Google search on most structural R questions will quickly lead to a handful of viable solutions (Learn R Blog, R Overview, Example R graphics, R Blogger, etc.). But the power of R is less about its grammar and more about its packages.

Earlier this year, Yhat published a great article, “10 R Packages I wish I knew about earlier,” that should be the basis for exploring R’s powerful capabilities. As Yhat also points out, while R can be a bit more “obscure than other languages,” it provides thousands of useful packages through its vibrant, growing community. Here is a re-listing of those packages:

1. sqldf – Manipulate R data frames using SQL.
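As a quick sketch (using the built-in mtcars data frame, nothing package-specific assumed beyond sqldf itself), any data frame can be queried as if it were a SQL table:

```r
# query an R data frame as if it were a SQL table
library(sqldf)

# average miles-per-gallon by cylinder count
res <- sqldf('select cyl, avg(mpg) as avg_mpg from mtcars group by cyl')
print(res)  # one row per distinct cyl value (4, 6, 8)
```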

2. forecast – Methods and tools for displaying and analysing univariate time series forecasts including exponential smoothing via state space models and automatic ARIMA modelling.
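A minimal sketch of the automatic-ARIMA workflow, using the built-in AirPassengers monthly series:

```r
# fit an automatically selected ARIMA model to a built-in monthly series
library(forecast)

fit <- auto.arima(AirPassengers)
fc <- forecast(fit, h = 12)  # point forecasts plus prediction intervals
print(fc)
```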

3. plyr – A set of tools for a common set of problems: you need to split up a big data structure into homogeneous pieces, apply a function to each piece and then combine all the results back together.
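For example, the split-apply-combine pattern on the built-in mtcars data might look like:

```r
# split mtcars by cylinder count, summarise each piece, recombine
library(plyr)

res <- ddply(mtcars, .(cyl), summarise,
             mean_mpg = mean(mpg),
             n = length(mpg))
print(res)  # one summary row per cylinder group
```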

4. stringr – a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use.
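A few of the consistently named str_ wrappers in action:

```r
library(stringr)

x <- c("apple pie", "banana split")
str_detect(x, "split")              # FALSE TRUE
str_replace(x, " ", "_")            # "apple_pie" "banana_split"
str_pad("7", width = 3, pad = "0")  # "007"
```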

5. Database Drivers (via install.packages) – R has drivers for nearly every commercially viable database. If you can’t find a specific interface for your database, then you can always use RODBC. Examples: RPostgreSQL, RMySQL, RMongo, RODBC, RSQLite, etc.
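To give a flavor of the common DBI-style interface these drivers share, here is a sketch of a round trip through an in-memory SQLite database via RSQLite (the table name and query are just illustrative):

```r
# write a data frame into SQLite, query it back, and disconnect
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
res <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
dbDisconnect(con)
print(res)
```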

6. lubridate – Make dealing with dates a little easier.
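A small sample of what “a little easier” means in practice:

```r
library(lubridate)

d <- ymd("2013-09-01")   # parse from a year-month-day string
month(d)                 # 9
wday(d, label = TRUE)    # day-of-week label
d + days(30)             # date arithmetic: 2013-10-01
```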

7. ggplot2 – Is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts.
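The grammar-of-graphics idea is easiest to see in code — map data columns to aesthetics, then add layers:

```r
library(ggplot2)

# scatterplot of weight vs. mileage, coloured by cylinder count
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
print(p)  # draws the plot on the active graphics device
```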

8. qcc – A library for statistical quality control: Shewhart quality control charts for continuous, attribute, and count data; Cusum and EWMA charts; operating characteristic curves; process capability analysis; Pareto charts and cause-and-effect charts; and multivariate control charts.

9. reshape2 – Reshape lets you flexibly restructure and aggregate data using just two functions: melt and cast. This Hadley Wickham package specializes in converting data from wide to long format, and vice versa.
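A tiny round trip between the two shapes:

```r
library(reshape2)

wide <- data.frame(id = 1:2, x = c(1, 2), y = c(3, 4))
long <- melt(wide, id.vars = "id")   # one row per id/variable pair
back <- dcast(long, id ~ variable)   # and back to wide again
print(long)
```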

10. randomForest – A machine learning package that performs classification and regression based on a forest of trees using random inputs, through supervised or unsupervised learning.
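A hedged sketch of the supervised-classification case, on the built-in iris data:

```r
library(randomForest)

set.seed(42)
# supervised classification: predict iris species from the four measurements
rf <- randomForest(Species ~ ., data = iris, ntree = 100)
print(rf)       # includes the out-of-bag error estimate
importance(rf)  # per-variable importance scores
```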

In addition to these packages, anybody working in the social sciences will also want to look into:

11. Zelig – a one-stop statistical shop for nearly all regression model specifications. It can estimate, and help interpret the results of, an enormous range of statistical models.

12/13. Statnet/igraph – An integrated set of tools for the representation, visualization, analysis, and simulation of network data.
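On the igraph side (statnet has its own network objects and API), a minimal sketch of building a network from an edge list:

```r
library(igraph)

# a small undirected triangle built from an edge list
edges <- cbind(c(1, 2, 3), c(2, 3, 1))
g <- graph_from_edgelist(edges, directed = FALSE)
degree(g)  # every vertex in a triangle has degree 2
```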

14. Amelia II – Contains a set of algorithms for multiple imputation of missing data across a wide range of data types, such as survey, time-series, and cross-sectional.

15. nlme – Used to fit and compare Gaussian linear and nonlinear mixed-effects models.
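A sketch of the classic random-intercept case, using the Orthodont growth data that ships with nlme:

```r
library(nlme)

# linear mixed-effects model: fixed effect of age,
# random intercept for each Subject
fit <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)
summary(fit)
```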

16/17. SNOW (Simple Network of Workstations)/Rmpi – Support for simple parallel computing in R.
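Note that the base “parallel” package (bundled with R since 2.14) exposes much of snow’s cluster interface without any extra installation; a minimal sketch:

```r
# snow-style cluster computing via the base "parallel" package
library(parallel)

cl <- makeCluster(2)                        # two local worker processes
res <- parSapply(cl, 1:4, function(x) x^2)  # apply across the cluster
stopCluster(cl)
res  # 1 4 9 16
```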

18/19. xtable/apsrtable – Packages that convert R summary results into LaTeX/HTML table format.
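For xtable, the typical workflow is to wrap a model summary and print it as LaTeX (the default) or HTML:

```r
library(xtable)

fit <- lm(mpg ~ wt, data = mtcars)
tab <- xtable(summary(fit))
print(tab)                 # emits a LaTeX tabular by default
print(tab, type = "html")  # or an HTML table
```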

20. plm – contains all of the necessary model specifications and tests for fitting a panel data model; including specifications for instrumental variable models.
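As a sketch, a fixed-effects (“within”) estimator on the classic Grunfeld investment panel that ships with plm:

```r
library(plm)

data("Grunfeld", package = "plm")
# fixed-effects panel model: investment on firm value and capital stock
fit <- plm(inv ~ value + capital, data = Grunfeld,
           index = c("firm", "year"), model = "within")
summary(fit)
```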

So, what are some of your favorite and/or most necessary R packages, and why? Post them in the comments section and let’s build out this space together.


  1. Yes, I’ll echo “parallel” over “snow” and “Rmpi”. I’ve had nothing but grief configuring MPI on Linux and sometimes Rmpi doesn’t even compile. I gave up on it. See the O’Reilly book “Parallel R” for more.

    My favorite packages? Well, I can’t live without the RStudio IDE, knitr, Shiny and tm.

  2. I would echo that knitr has really transformed reproducible research within R. In addition, the slidify package should be discussed as an honorable mention; it is young but has significant potential.

  3. I would also include the shiny package, which lets you develop simple and complex web applications using R. It’s really good.

  4. I would put data.table as the number one package in this list. If you like sqldf then you will be blown away by data.table.

    Copy and paste this code and try it out yourself (it assumes you have installed the packages):

    # how fast is data.table compared with sqldf?

    libs <- c('sqldf', 'data.table', 'rbenchmark')
    lapply(libs, require, character.only = TRUE)

    n <- 1000000
    ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
    rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))

    # join via sqldf, with and without an index on the left table
    benchmark(replications = 5, order = "user.self",
      noindex.sqldf = sqldf('select * from ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2'),
      indexed.sqldf = sqldf(c('create index ldx on ldf(id1, id2)',
        'select * from main.ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2')))

    # join via data.table merge, with and without keys set
    benchmark(replications = 5, order = "user.self",
      noindex.table = {
        ldt <- data.table(ldf)
        rdt <- data.table(rdf)
        merge(ldt, rdt, by = c('id1', 'id2'))
      },
      indexed.table = {
        ldt <- data.table(ldf, key = 'id1,id2')
        rdt <- data.table(rdf, key = 'id1,id2')
        merge(ldt, rdt, by = c('id1', 'id2'))
      })
