Data Repositories – Mother’s Milk for Data Scientists


Mothers are life givers, giving the milk of life. Few analogies are as apt: data is often considered the Mother’s Milk of Corporate Valuation. As data scientists, we should therefore treasure every source of data and understand its place in the overall value chain of corporate existence.

A Data Repository is a logical (and sometimes physical) partition of data in which multiple databases that serve a specific application or set of applications reside. For example, several databases (revenues, expenses) that support financial applications (A/R, A/P) could reside in a single financial Data Repository. Data Repositories can be found both inside an organization (e.g., in data warehouses) and outside it (see below). Here are a few repositories from KDnuggets that are worth taking a look at:


Data Monetization: A Road Paved On Top Of Data Sets

The road to efficient data monetization is paved on top of effective data sets. No single source of data is comprehensive enough to be an all-encompassing source of transformational insights. It is only through the fusion of orthogonal data sets (independent subject areas) that true insights into those things we don’t know we don’t know (level three knowledge) can be revealed. While we have access to data of interest (ERPs, IT, etc.), where can we find other sources to aid in this third-level knowledge spelunking?
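To make the idea of fusing orthogonal data sets concrete, here is a minimal sketch in Python. It assumes two independent data sets that share a key (here, a year); the variable names, the helper `fuse_on_key`, and all of the values are hypothetical and exist only for illustration.

```python
# Hypothetical sample data: two independent ("orthogonal") subject areas
# that share a key (year). All names and values are made up for illustration.
internal_revenue = {2009: 1.2, 2010: 1.5, 2011: 1.9}       # e.g. from an internal ERP
external_indicator = {2010: 7.0, 2011: 6.5, 2012: 6.0}     # e.g. from a public source


def fuse_on_key(left, right):
    """Inner-join two dicts on their shared keys, returning rows sorted by key."""
    shared_keys = sorted(left.keys() & right.keys())
    return [(key, left[key], right[key]) for key in shared_keys]


fused = fuse_on_key(internal_revenue, external_indicator)
for year, revenue, indicator in fused:
    print(year, revenue, indicator)
```

Only the years present in both sources survive the join, which is usually the first thing to check when combining an internal source with an external one.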

While data is everywhere, useful data sets are not. A Google search on terms like “open data sets” or “data sets in R” reveals thousands of sources. Over the years as a CTO and Data Scientist, I have collected a few hundred myself. In 2011, however, I came across the work of RevoJoe of Revolution Analytics, which more or less got me organized in this area. So here are a few data sets from the list that I maintain today:

Commercial Sources
Data MarketPlace:

World Bank:

CBOE Futures Exchange:
Google Finance: (R)
Google Trends:
St Louis Fed: (R)
Yahoo Finance: (R)

Archived national government statistics:
Civic Commons:
Fed Stats:
Guardian world governments:
List of cities/states by Simply Statistics:
London, U.K. data:
New Zealand:…
NYC data:
Open Government Data (Hub):
Open Government Data – United States of America:
Open Government Data – United Kingdom:
Open Government – France:
San Francisco Data sets:
U.K. Government Data:
United Nations:
U.S. Federal Government Agencies:
US CDC Public Health datasets:
The World Bank:

Machine Learning
Causality Workbench:
Kaggle competition data:
KDnuggets competition site:
UCI Machine Learning Repository:
Machine Learning Data Set Repository:
Microsoft Research:
Million songs:
Social Networking:
The Koblenz Network Collection:

Hilary Mason’s research data (Chief Data Scientist at
Kaggle Contests:
R Datasets:

Public Domain Collections
Sample R data sets: (R)
SourceForge Research Data:
UFO Reports:
Wikileaks 911 pager intercepts:
R data sets: (R)
The Washington Post List:

Agricultural Experiments: (R)
Climate data:
Gene Expression Omnibus:
Geo Spatial Data:
Human Microbiome Project:
KDnuggets Datasets:
MIT Cancer Genomics Data:
NIH Microarray data: (R)
Protein structure:
Public Gene Data:
Stanford Microarray Data:

Social Sciences
Analyze Survey Data for Free:
General Social Survey:
UCLA Social Sciences Archive:

Time Series
Time Series Data Library:

Carnegie Mellon University Enron email:
Carnegie Mellon University StatLab:
Carnegie Mellon University JASA data archive:
CMU StatLib:
Ohio State University Financial data:
Stanford Large Network Data:
UC Berkeley:
UCI Machine Learning:
UC Riverside Time Series:
University of Toronto:

FIELD NOTE: Quandl – An Interesting Source For Datasets

Tammer Kamel, a Canadian Data Scientist, has recently posted a beta version of Quandl, an index of 2 million time series data sets. Tammer says Quandl’s mission is to make numerical data easy to find and easy to use. The site is collaboratively maintained and free, with many features including search, browse, download, visualization, merging, and an API.

Here is one of the many datasets that I have been using to research crime-related trends.