FIELD NOTE: Answering One of the Most Asked Questions of Data Scientists

Gregory Piatetsky and Shashank Iyer have begun to answer one of the most asked questions facing data science practitioners: Which tools work with which other tools? If I had a dollar for every time this question was asked of me, well, let’s just say I’d already be retired! So, when I saw that Gregory and Shashank had taken a crack at this question, I was intrigued.

In their recent article Which Big Data, Data Mining, and Data Science Tools go together?, the authors use a version of the Apriori algorithm to analyze the results of the 2015 KDnuggets Data Mining Software Poll. This work is an excellent example of how simple techniques, added together, can result in very useful insights.
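To make the idea concrete, here is a minimal sketch in Python of the pair-counting step that sits underneath Apriori-style association analysis. It is not the authors' code: the votes are invented for illustration (they are not the poll data), and the function name `pair_lift` is hypothetical. Full Apriori goes on to grow and prune larger itemsets; this sketch stops at pairs.

```python
from collections import Counter
from itertools import combinations


def pair_lift(votes, min_support=0.2):
    """Return {(tool_a, tool_b): lift} for pairs whose support >= min_support.

    votes: list of sets, one set of tool names per respondent.
    A lift above 1 means the two tools are chosen together more often
    than independent choices would predict.
    """
    n = len(votes)
    single = Counter(tool for vote in votes for tool in vote)
    pairs = Counter(
        pair for vote in votes for pair in combinations(sorted(vote), 2)
    )
    result = {}
    for (a, b), count in pairs.items():
        support = count / n
        if support >= min_support:
            result[(a, b)] = support / ((single[a] / n) * (single[b] / n))
    return result


# Invented example votes (NOT the KDnuggets poll data).
votes = [
    {"R", "Python", "SQL"},
    {"R", "SQL"},
    {"Python", "Hadoop", "Spark"},
    {"Hadoop", "Spark"},
    {"R", "Python"},
]
strong_pairs = pair_lift(votes, min_support=0.3)
```

In this toy data, Hadoop and Spark always appear together, so their lift is the highest of the three pairs that clear the support threshold.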

For example, the graph below visualizes the correlation between the top 20 most popular tools. For the nodes, Red: Free/Open Source tools, Green: Commercial tools, Fuchsia: Hadoop/Big Data tools. The node sizes vary based on the percentage of votes each tool received. The edge thickness shows the weight of each association: thicker edges indicate a high association and thinner edges a low one.


For those who, like me, work predominantly in the R world, here is a list of the tools that are most often associated with R.


While this work is just the start and only covers a limited user population, the authors provide the data set for those who want to explore the survey further or collect additional tools data in order to extend these insights.

Art of Resistance – The Social Network Anatomy of a Kinetic Activist Group


As data scientists working in the intelligence community, we are often asked to help identify where intelligence gathering and analysis resources should be allocated. Governmental and non-governmental intelligence organizations are bounded by both limited operational funds and limited time. As such, resource allocation planning becomes an extremely important operational activity for data science teams. But how does one actually go about this?

There is no perfect way of looking for the proverbial needle in the haystack – needle being the bad guy and haystack being the world. While it is sometimes better to be lucky than good, having a systematically organic approach to resource allocation enables teams to manage the process to some level of statistical quality control (see Deming). One such way is through the use of Social Network Analysis (SNA).

Social networks encapsulate the human dynamics that are characteristically important to most intelligence activities. Each node in the network represents an entity (person, place, thing, etc.), which is governed by psychological behavioral characteristics. As these entities interact with each other, the nodes become interconnected, forming networks. In turn, these networks are governed by sociological behavioral principles. Taken together, social networks enable the intelligence community to understand and exploit behavioral dynamics – both psychological and sociological characteristics.

As a sidebar, intelligence analysis is not always about why someone or a group does something. It is often more important to understand why they are not doing things. For example, in intelligence we look extensively at why certain groups associate with each other. But it is equally important to understand why one activist group does not associate with another. From a business perspective there is an equivalent in the sales process. Product managers often strive to understand who is buying their products and services, but lack a material understanding of why people don’t buy these same solutions.

In a recent project, we were tasked by a client to determine whether Greenpeace was, or could become, a significant disruptive geopolitical force to a critical operational initiative. As part of the initial scoping activity, we needed to understand where to allocate our limited resources (global intel experts, business intel experts, subject matter experts, and data scientists) in order to increase the likelihood of addressing the client’s needs. A high-level SNA not only identified where to focus our effort, but also revealed a previously unknown activism actor.

The six (6) panel layout below shows how we stepped through our discovery. On Facebook, we used Netvizz to make an initial collection of group-oriented relationships for the principal target (Greenpeace). The 585 nodes, interconnected by 1788 edges, were imported into Gephi as shown in panel 1. As we say… somewhere in that spaghetti is a potential bad guy, but where?

Gephi Panel 01


After identifying and importing the data, it is important to generate an initial structural view of the entities. Force Atlas 2 is an effective layout algorithm, since studies have found that organizational structure can be inferred from layout structure (panel 2). While this layout provides some transparency into the network, it still lacks any real clarity around behavioral importance.

To better understand which entities are more central than others, we used Betweenness Centrality. This measures how often a node lies on the shortest paths between other nodes in the network, an underlying psychological characteristic. Betweenness centrality is a more useful measure than simple connectivity, in that bigger nodes are more central to behavioral dynamics. As seen in panel 3, several nodes become central figures in the overall network.
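For readers who want to see what the metric actually counts, here is a small Python sketch of betweenness centrality for an undirected, unweighted graph, using Brandes' algorithm. It is an illustration only, not the Gephi implementation, and the toy edge list is invented rather than taken from the Greenpeace collection; `build_graph` and `betweenness` are hypothetical helper names.

```python
from collections import deque


def build_graph(edges):
    """Turn an undirected edge list into a dict of neighbour sets."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # nodes in non-decreasing distance from s
        preds = {v: [] for v in adj}    # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                    # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):       # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


# Invented toy network: two triangles joined through a single "hub" node.
edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "hub"), ("hub", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]
scores = betweenness(build_graph(edges))
```

The hub has only two connections, fewer than its neighbours `c` and `d`, yet it scores highest because every path between the two triangles runs through it. That is the sense in which betweenness says more than simple connectivity.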

Identifying community relationships is an important next step in helping understand sociological characteristics. Using Modularity as a measure to unfold community organizations (panel 4), we now begin to see a clearer picture of who is doing what with whom. What becomes really interesting at this stage is understanding some of the more nuanced relationships.
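As a rough illustration of what the modularity score measures, the Python sketch below computes it for a given assignment of nodes to communities: for each community, the fraction of edges that fall inside it minus the fraction expected if edges were placed at random with the same node degrees. This only scores a partition; community detection tools search for the partition that maximizes the score. The edge list and the `modularity` helper are invented for illustration.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (node_a, node_b) pairs, each edge listed once.
    community: dict mapping each node to a community label.
    """
    m = len(edges)
    internal = {}   # edges with both ends in the same community
    degree = {}     # summed node degree per community
    for a, b in edges:
        ca, cb = community[a], community[b]
        degree[ca] = degree.get(ca, 0) + 1
        degree[cb] = degree.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree.items()
    )


# Invented toy network: two triangles joined by one edge.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_group = {node: 1 for node in two_groups}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the toy network into its two triangles gives a clearly positive score, while lumping every node into one community gives zero, which is why a higher score is read as a more meaningful community structure.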

Take for example the five outlying nodes in the network (blue, maroon, yellow, dark green, and light green). They appear to be connected through an equally important red node in the center. Panel 5 clearly shows this central relationship. Upon further examination (filtering out nodes with low Betweenness Centrality values), we see the emergence of a previously unrecognized activism player: Art of Resistance.



While Greenpeace was the original target of interest, use of basic social network analysis principles resulted in the discovery of an emergent activism group playing a central role in the coordination and communication of events. Further analysis of this group revealed their propensity to promote kinetic activities (physical violence, bombing, etc.) over the more traditional passive, non-kinetic events found in Greenpeace.

Gephi Panel 02

A resource allocation plan was then developed to monitor and harvest open source information around key players of each community (larger nodes). The plan resulted in a more focused intelligence analysis process where human analysts could explore in depth the behavioral dynamics of critical entities, rather than tangentially digesting summary information from all.

Social network analysis (SNA) is an effective tool for the intelligence team, as well as the data science team. Finding the proverbial needle in the haystack requires a systematically organic process that explains both the why and why not of behavioral dynamics. Use of these kinds of tools enables a broad set of capabilities, ranging from resource allocation to discovery.

Deep Web Intelligence Platform: 6 Plus Capabilities Necessary for Finding Signals in the Noise


Over the last several months I have been involved with developing unique data science capabilities for the intelligence community, ones specifically based on exploiting insights derived from the open source intelligence (OSINT) found in the deep web. The deep web is World Wide Web (WWW) content that is not part of the Surface Web, which is indexed by standard search engines. It is usually inaccessible through traditional search engines because of the dynamic characteristics of the content and the non-persistent nature of its URLs. Spanning over 7,500 terabytes of data, it is the richest source of raw material that can be used to build out value.


One of the more important aspects of intelligence is being able to connect multiple seemingly unrelated events together within a time frame amenable to making actionable decisions. This capability is the optimal blend of man and machine, enabling customers to know more and know sooner. It is only in the low-level signals found in the deep web that one can use behavioral sciences (psychology and sociology) to extract outcome-oriented value.


Data in the web is mostly composed of noise, which can be unique but is often of low value. Unfortunately, the index engines of the world (Google, Bing, Yahoo) add only marginal value to the very few data streams that are important to any valuation process. Real value comes from correlating event networks (people performing actions) through deep web signals, which are not the purview of traditional search engines.


These deep web intelligence capabilities can be achieved in part through the use of machine learning enabled, data science driven, and Hadoop-oriented enterprise information hubs. The platform supports the 6 plus essential capabilities for actionable intelligence operations:

1. Scalable Infrastructure – Industry standard hardware supported through cloud-based infrastructure providers that scales linearly with analytical demands.

2. Hadoop – Allows for computation to occur next to data storage and enables storage schema on read – stores data in native raw format.

3. Enterprise Data Science – Scalable exploratory methods, predictive algorithms, and prescriptive and machine learning methods.

4. Elastic Data Collection – In addition to pulling data from third party sources through APIs, bespoke data collection through scraping web services enables data analyses not possible within traditional enterprise analytics groups.

5. Temporal/Geospatial/Contextual Analysis – The ability to regionalize events to a specific context during a specified time (past, present, future).

6. Visualization – Effective visualization that tailors actionable results to individual needs.

The Plus – data, Data, DATA. Without data, lots of disparate data, data science platforms are of no value.

Deep Web Intelligence Architecture 01

Today’s executive, inundated with TOO MUCH DATA, has limited ability to synthesize the trends and actionable insights that drive competitive advantage. Traditional research tools, internet and social harvesters do not correlate or predict trends. They look at hindsight or, at best, exist at the surface of things. A newer approach, based on combining the behavioral analyses achievable through people and the machine learning found in scalable computational systems, can bridge this capability gap.

Mahout: Machine Learning For Enterprise Data Science

Machine Learning

The success of companies in effectively monetizing their information depends on how efficiently they can identify revelations in their data sources. While Enterprise Data Science (EDS) is one of the methodologies needed to organically and systematically achieve this goal, it is but one of many such needed frameworks.

Machine Learning, a subdomain of artificial intelligence and a branch of statistical learning, is one such computational methodology, composed of techniques and algorithms that enable computing devices to improve their recommendations based on the effectiveness of previous experiences (that is, to learn). Machine learning is related to data mining (with which it is often confused) and relies on techniques from statistics, probability, numerical analysis, and pattern recognition.

There is a wide variety of machine learning tasks, successful applications, and implementation frameworks. Mahout, one of the more popular frameworks, is an open source project based on Apache Hadoop. Mahout currently can be used for:

  • Collaborative filtering (Recommendation systems – user based, item based)
  • Clustering
  • Classification
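To give a feel for the first item in the list, here is a small, self-contained Python sketch of item-based collaborative filtering. It does not use Mahout or its API; the ratings and the helper names (`item_similarity`, `recommend`) are invented, and treating a missing rating as zero is a simplifying assumption.

```python
from math import sqrt


def item_similarity(ratings, item_a, item_b):
    """Cosine similarity between two items' rating vectors.

    ratings: dict of user -> {item: rating}. A missing rating counts as 0.
    """
    a = [r.get(item_a, 0) for r in ratings.values()]
    b = [r.get(item_b, 0) for r in ratings.values()]
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity to items they have."""
    seen = ratings[user]
    all_items = {item for r in ratings.values() for item in r}
    scores = {
        candidate: sum(
            item_similarity(ratings, candidate, liked) * rating
            for liked, rating in seen.items()
        )
        for candidate in all_items - set(seen)
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented ratings on a 1 to 5 scale.
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 5, "C": 1},
    "cat": {"B": 4, "C": 5, "D": 4},
    "dan": {"A": 4},
}
suggestions = recommend(ratings, "dan")
```

Here `dan` has only rated item A, and item B comes out on top because the users who rated A highly also rated B highly. A user-based variant would compare users instead of items.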

Varad Meru created and is sharing this introductory Mahout presentation, which is an excellent source of basic information as well as implementation details.

Heilmeier Catechism: Nine Questions To Develop A Meaningful Data Science Project


As director of ARPA in the 1970s, George H. Heilmeier developed a set of questions that he expected every proposal for a new research program to answer. No exceptions. He referred to them as the “Heilmeier Catechism,” and they are now the basis of how DARPA (Defense Advanced Research Projects Agency) and IARPA (Intelligence Advanced Research Projects Activity) operate. Today, it’s equally important to answer these questions for any individual data science project, both for yourself and for communicating to others what you hope to accomplish.

While there have been many variants on Heilmeier’s questions, I still prefer to use the original catechism to guide the development of my data science projects:

1. What are you trying to do? Articulate your objectives using absolutely no jargon.

2. How is it done today, and what are the limits of current practice?

3. What’s new in your approach and why do you think it will be successful?

4. Who cares?

5. If you’re successful, what difference will it make?

6. What are the risks and the payoffs?

7. How much will it cost?

8. How long will it take?

9. What are the midterm and final “exams” to check for success?

Each question is critical in the success chain of events, but numbers 3 and 5 are most aligned to the way business leaders think. Data science is fraught with failures, by the definition of science. As such, business leaders are still a bit (truthfully – a lot) suspicious of how data science teams do what they do and how their results would integrate into the larger enterprise in order to solve real business problems. Part of the data science sales cycle, addressed by question 3, needs to address these concerns. For example, in the post “Objective-Based Data Monetization: A Enterprise Approach to Data Science (EDS),” I present a model for scaling out our results.

In terms of the differences a project makes (question 5), we need to be sure to cover the business as well as the technical differences. The business differences are the standard three: impact on revenue, margin (combined ratios for insurance), and market share. If there is no business value (data/big data economics), then your project is a sunk cost that somebody else will need to make up for.

Here is an example taken from a project proposed in the insurance industry. Brokers are third party entities that sell insurance products on behalf of a company. They are not employees and often are under the governance of underwriters (employees who sell similar products). There are instances where brokers “shop” around looking to get coverage for a prospect that might have above average risk (e.g., files too many claims, in a high risk business, etc.). They do this by manipulating answers to pre-bind questions (prior to issuing a policy) in order to create a product that will not necessarily need underwriter review and/or approval. This project is designed to help stop this practice, which would help improve the business’s financial fundamentals. Here is Heilmeier’s Catechism for the Pre-Bind Gaming Project:

1. What are you trying to do? Automate the identification of insurance brokers that use corporate policy pricing tools as a means to undersell through third party providers.

2. How is it done today? Corporate underwriters observe broker behaviors and pass judgement based on personal criteria.

3. What is new in your approach? Develop signature algorithms, based on the analysis of gamer/non-gamer pre-bind data, that can be implemented across enterprise product applications.

4. Who cares? Business executives – CEO, President, CMO, and CFO.

5. What difference will it make? In an insurance company that generates $350M in premiums at a combined ratio (margin) of 97%, addressing this problem could result in an additional $12M to $32M of incremental revenue while improving the combined ratio to 95.5%.

6. What are the risks and payoffs? Risks – Not having collected, or not having access to, relevant causal data reflecting the gamers’ patterns. Payoffs – Improved revenue and combined ratios.

7. How much will it cost? Proof of concept (POC) will cost between $80K and $120K. Scaling the POC into the enterprise (implementing algorithms into 5 to 10 product applications) will cost between $500K and $700K.

8. How long will it take? Proof of concept (POC) will take between 8 and 10 weeks. Scaling the POC into the enterprise will take between 3 and 7 months.

9. What are the midterms & final check points for success? The POC will act as the initial milestone that demonstrates gaming patterns can be identified with existing data.
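The figures in the answer to question 5 can be turned into a quick back-of-the-envelope check. The Python sketch below assumes a simplified view in which the underwriting margin is premiums times (1 minus the combined ratio), ignores investment income, and assumes the incremental revenue is added to the premium base. Those assumptions are mine for illustration and are not part of the original proposal; `underwriting_margin` is a hypothetical helper.

```python
def underwriting_margin(premiums_musd, combined_ratio):
    """Underwriting margin in $M: premiums left after losses and expenses.

    Simplifying assumption: margin = premiums * (1 - combined ratio).
    """
    return premiums_musd * (1.0 - combined_ratio)


# Figures quoted in the catechism, in millions of dollars.
baseline = underwriting_margin(350, 0.97)
low_case = underwriting_margin(350 + 12, 0.955)
high_case = underwriting_margin(350 + 32, 0.955)

improvement_range = (low_case - baseline, high_case - baseline)
```

Under these assumptions the margin moves from about $10.5M to somewhere between roughly $16.3M and $17.2M, which is the kind of single number business executives tend to ask for first.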

Regardless of whether you use Heilmeier’s questions or other research topic development methodologies (e.g., The Craft of Research), it is important to systematically address the who, what, when, where, and why of the project. While a firm methodology does not guarantee success, not addressing these nine questions is sure to put you on a risky path, one that will take work to get off of.


Visualizing the CRAN: The Right Package for a Given Task


R is an extremely useful software environment for statistical computing and graphics. But as awesome as it is, it can be quite daunting to find just the right package for a specific task. Well, CRAN Task Views are designed to help alleviate this challenge. The table below matches each task (right) with its task view (left). While a bit primitive, as a quick reference it does work.

For those who are much more industrious, check out “Visualizing the CRAN: Graphing Package Dependencies.” The author uses graphs to visualize the relationships between packages in CRAN. His analysis shows that MASS is the most “depended on” package on the CRAN. A total of 294 of the 3794 packages (almost 8%) in the CRAN depend on MASS. An additional 95 suggest MASS, so just over 10% of all R packages either suggest or depend on MASS.
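The counting behind a “most depended on” ranking is simple enough to sketch. The Python snippet below uses invented package metadata (it is not the CRAN data and not the linked author's code) to count reverse dependencies, and then re-derives the two percentages quoted above from the numbers in the text. The helper name `reverse_dependency_counts` is hypothetical.

```python
from collections import Counter


def reverse_dependency_counts(depends):
    """Count how many packages list each package as a dependency.

    depends: dict mapping a package name to the packages it depends on.
    """
    return Counter(dep for deps in depends.values() for dep in set(deps))


# Hypothetical package metadata, invented for illustration only.
depends = {
    "pkgA": ["MASS", "lattice"],
    "pkgB": ["MASS"],
    "pkgC": ["MASS", "pkgA"],
    "pkgD": [],
}
counts = reverse_dependency_counts(depends)
most_depended_on, n_dependents = counts.most_common(1)[0]

# Shares quoted in the text: 294 depend on MASS, 95 more suggest it, of 3794.
depends_share = 100 * 294 / 3794
depends_or_suggests_share = 100 * (294 + 95) / 3794
```

The two shares come out at roughly 7.7% and 10.3%, consistent with “almost 8%” and “just over 10%”.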

Bayesian – Bayesian Inference
ChemPhys – Chemometrics and Computational Physics
ClinicalTrials – Clinical Trial Design, Monitoring, and Analysis
Cluster – Cluster Analysis & Finite Mixture Models
DifferentialEquations – Differential Equations
Distributions – Probability Distributions
Econometrics – Computational Econometrics
Environmetrics – Analysis of Ecological and Environmental Data
ExperimentalDesign – Design of Experiments (DoE) & Analysis of Experimental Data
Finance – Empirical Finance
Genetics – Statistical Genetics
Graphics – Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
HighPerformanceComputing – High-Performance and Parallel Computing with R
MachineLearning – Machine Learning & Statistical Learning
MedicalImaging – Medical Image Analysis
MetaAnalysis – Meta-Analysis
Multivariate – Multivariate Statistics
NaturalLanguageProcessing – Natural Language Processing
OfficialStatistics – Official Statistics & Survey Methodology
Optimization – Optimization and Mathematical Programming
Pharmacokinetics – Analysis of Pharmacokinetic Data
Phylogenetics – Phylogenetics, Especially Comparative Methods
Psychometrics – Psychometric Models and Methods
ReproducibleResearch – Reproducible Research
Robust – Robust Statistical Methods
SocialSciences – Statistics for the Social Sciences
Spatial – Analysis of Spatial Data
SpatioTemporal – Handling and Analyzing Spatio-Temporal Data
Survival – Survival Analysis
TimeSeries – Time Series Analysis
gR – gRaphical Models in R


20 R Packages That Should Impact Every Data Scientist

Anybody who has used R knows just how frustrating it is to have an analytical idea in mind that is hard to express. From a language perspective, R is pretty straightforward. For those who are just starting to learn it, there is a wide range of resources available, from free tutorials to commercial texts. A quick Google search on most structural R questions will quickly lead to a handful of viable solutions (Learn R Blog, R Overview, Example R graphics, R-bloggers, etc.). But the power of R is less about its grammar and more in its packages.

Earlier this year, Yhat published a great article on the “10 R Packages I wish I knew about earlier” that should be the basis for exploring R’s powerful capabilities. As Yhat also points out, while R can be a bit more “obscure than other languages,” it provides thousands of useful packages through its vibrant, growing community. Here is a re-listing of those packages:

1. sqldf – Manipulate R data frames using SQL.

2. forecast – Methods and tools for displaying and analysing univariate time series forecasts including exponential smoothing via state space models and automatic ARIMA modelling.

3. plyr – A set of tools for a common set of problems: you need to split up a big data structure into homogeneous pieces, apply a function to each piece and then combine all the results back together.

4. stringr – a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use.

5. Database Drivers (through install.packages) – R has drivers for nearly every commercially viable database. If you can’t find a specific interface for your database, then you can always use RODBC. Examples: RPostgreSQL, RMySQL, RMongo, RODBC, RSQLite, etc.

6. lubridate – Make dealing with dates a little easier.

7. ggplot2 – Is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts.

8. qcc – Is a library for statistical quality control, such as Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. Multivariate control charts.

9. reshape2 – Reshape lets you flexibly restructure and aggregate data using just two functions: melt and cast. This Hadley Wickham package specializes in converting data from wide to long format, and vice versa.

10. randomForest – A machine learning package that performs classification and regression based on a forest of trees using random inputs, through supervised or unsupervised learning.

In addition to these packages, anybody working in the social sciences will also want to look into:

11. Zelig – a one-stop statistical shop for nearly all regression model specifications. It can estimate, and help interpret the results of, an enormous range of statistical models.

12/13. Statnet/igraph – An integrated set of tools for the representation, visualization, analysis, and simulation of network data.

14. Amelia II – Contains a set of algorithms for multiple imputation of missing data across a wide range of data types, such as survey, time series and cross sectional.

15. nlme – Used to fit and compare Gaussian linear and nonlinear mixed-effects models.

16/17. SNOW (Simple Network of Workstations)/Rmpi – Support for simple parallel computing in R.

18/19. xtable/apsrtable – Packages that convert R summary results into LaTeX/HTML table format.

20. plm – contains all of the necessary model specifications and tests for fitting a panel data model; including specifications for instrumental variable models.

So, what are some of your favorite and/or most necessary R packages, and why? Post them in the comments section and let’s build out this space together.

Singular Value Decomposition (SVD): A Golfer’s Tutorial

Singular Value Decomposition (SVD) is one of my favorite tools for factorizing data, but it can be a rather hard concept to wrap one’s brain around, especially if you don’t have a strong mathematical background. In order to gain a more practical understanding of how SVD is performed and of its practical applications, many resort to Googling terms like “Singular Value Decomposition tutorial” and “Singular Value Decomposition practical example,” only to be disappointed by the results. Fortunately, here is a tutorial that is both easy to understand and built around a practical example that more people can relate to: Golf Score Prediction Using SVD.

This tutorial breaks down the SVD process by looking at the golf scores of three players – Phil, Tiger, and Vijay. By starting with a simple, naive example, the author builds a complete understanding of not only the practical mechanics of SVD, but the mathematical background as well. Overall, a simple and elegant example.
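As a rough illustration of the core idea (in Python rather than R, and with made-up scores that are not the tutorial's data), the sketch below finds the largest singular value and its left and right singular vectors by power iteration, then measures how much of the matrix a rank-1 approximation captures. The function names are hypothetical, and starting from an all-ones vector assumes a matrix of positive entries such as golf scores.

```python
from math import sqrt


def top_singular_triplet(a, iterations=200):
    """Largest singular value of matrix a with its left/right singular vectors.

    Uses power iteration on A^T A; a is a list of rows with positive entries.
    """
    rows, cols = len(a), len(a[0])
    v = [1.0] * cols
    for _ in range(iterations):
        av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(a[i][j] * av[i] for i in range(rows)) for j in range(cols)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]
    return u, sigma, v


def rank_one(u, sigma, v):
    """Rank-1 approximation sigma * u * v^T as a list of rows."""
    return [[sigma * ui * vj for vj in v] for ui in u]


# Made-up scores: one row per hole, one column per player (NOT the tutorial data).
scores = [
    [4, 4, 5],
    [5, 5, 6],
    [3, 3, 4],
    [4, 5, 5],
]
u, sigma, v = top_singular_triplet(scores)
approx = rank_one(u, sigma, v)

# Share of the matrix's total squared magnitude captured by the first component.
share = sigma ** 2 / sum(x * x for row in scores for x in row)
```

Intuitively, `u` plays the role of a per-hole difficulty profile and `v` a per-player ability profile, and a `share` close to 1 says that those two profiles alone nearly reproduce every score.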

Based on the tutorial work, here are a few R scripts I used to recreate the results:



Then, one can compute the SVD:


Resulting in,


Graphically, the singular values can be visualized as,


This means that the first singular value, together with the first left and right singular vectors ($u, $v), represents almost 98.9% of the variance in the matrix. In R, we can approximate the result with,


Resulting in,



Six Types Of Analyses Every Data Scientist Should Know

Jeffrey Leek, Assistant Professor of Biostatistics at Johns Hopkins Bloomberg School of Public Health, has identified six (6) archetypical analyses. As presented, they range from the least to most complex, in terms of knowledge, costs, and time. In summary,

  • Descriptive
  • Exploratory
  • Inferential
  • Predictive
  • Causal
  • Mechanistic

1. Descriptive (least amount of effort):  The discipline of quantitatively describing the main features of a collection of data. In essence, it describes a set of data.

– Typically the first kind of data analysis performed on a data set

– Commonly applied to large volumes of data, such as census data

– The description and interpretation processes are different steps

– Univariate and Bivariate are two types of statistical descriptive analyses.

Type of data set applied to: Census Data Set – a whole population

Example: Census Data

2. Exploratory: An approach to analyzing data sets to find previously unknown relationships.

– Exploratory models are good for discovering new connections

– They are also useful for defining future studies/questions

– Exploratory analyses are usually not the definitive answer to the question at hand, but only the start

– Exploratory analyses alone should not be used for generalizing and/or predicting

– Remember: correlation does not imply causation

Type of data set applied to: Census and Convenience Sample Data Set (typically non-uniform) – a random sample with many variables measured

Example: Microarray Data Analysis

3. Inferential: Aims to test theories about the nature of the world in general (or some part of it) based on samples of “subjects” taken from the world (or some part of it). That is, use a relatively small sample of data to say something about a bigger population.

– Inference is commonly the goal of statistical models

– Inference involves estimating both the quantity you care about and your uncertainty about your estimate

– Inference depends heavily on both the population and the sampling scheme

Type of data set applied to: Observational, Cross Sectional Time Study, and Retrospective Data Set – the right, randomly sampled population

Example: Inferential Analysis
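As a small illustration of estimating a quantity together with its uncertainty, the Python sketch below computes a sample mean and an approximate 95% confidence interval for the population mean. The sample values are invented, the helper name is hypothetical, and the normal approximation (z = 1.96) is an assumption; for a sample this small a t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import mean, stdev


def mean_confidence_interval(sample, z=1.96):
    """Return (estimate, (low, high)) for the population mean.

    Uses the sample standard deviation and a normal approximation.
    """
    estimate = mean(sample)
    standard_error = stdev(sample) / sqrt(len(sample))
    return estimate, (estimate - z * standard_error, estimate + z * standard_error)


# Invented sample drawn from some larger population.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
estimate, (low, high) = mean_confidence_interval(sample)
```

The estimate is only as good as the sampling scheme: the interval describes uncertainty from sampling noise, not from having sampled the wrong population.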

4. Predictive: The various types of methods that analyze current and historical facts to make predictions about future events. In essence, to use the data on some objects to predict values for another object.

– The model predicts, but that does not mean that the independent variables cause the outcome

– Accurate prediction depends heavily on measuring the right variables

– Although there are better and worse prediction models, more data and a simple model work really well

– Prediction is very hard, especially about the future

Type of data set applied to: Prediction Study Data Set – a training and test data set from the same population

Example: Predictive Analysis


Another Example of Predictive Analysis
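To make the training and test idea concrete, here is a minimal Python sketch that fits a straight line by least squares on one half of a data set and checks its error on the other half. The data points are invented and the helper names are hypothetical; a real study would use a random split and far more data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared gap between predictions and observed values."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented observations that roughly follow y = 2x + 1.
xs = list(range(10))
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2, 16.8, 19.1]

# Even positions train the model; odd positions are held out for testing.
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]

model = fit_line(train_x, train_y)
test_error = mean_squared_error(model, test_x, test_y)
```

A low error on the held-out half is evidence that the model predicts well for this population. It says nothing about whether x causes y.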


5. Causal: To find out what happens to one variable when you change another.

– Implementation usually requires randomized studies

– There are approaches to inferring causation in non-randomized studies

– Causal models are said to be the “gold standard” for data analysis

Type of data set applied to: Randomized Trial Data Set – data from a randomized study

Example: Causal Analysis


6. Mechanistic (most amount of effort): Understand the exact changes in variables that lead to changes in other variables for individual objects.

– Incredibly hard to infer, except in simple situations

– Usually modeled by a deterministic set of equations (physical/engineering science)

– Generally the random component of the data is measurement error

– If the equations are known but the parameters are not, they may be inferred with data analysis

Type of data set applied to: Randomized Trial Data Set – data about all components of the system

Example: Mechanistic Analysis