FIELD NOTE: Answering One of the Most-Asked Questions of Data Scientists

Gregory Piatetsky and Shashank Iyer have begun to answer one of the most-asked questions facing data science practitioners: Which tools work with which other tools? If I had a dollar for every time this question was asked of me, well, let’s just say I’d already be retired! So, when I saw that Gregory and Shashank had taken a crack at this question, I was intrigued.

In their recent article Which Big Data, Data Mining, and Data Science Tools go together?, the authors use a version of the Apriori algorithm to analyze the results of the 2015 KDnuggets Data Mining Software Poll. This work is an excellent example of how simple techniques, added together, can result in very useful insights.
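To make the idea concrete, here is a minimal sketch of the first two passes of an Apriori-style search: count single tools, keep the frequent ones, then count only pairs built from those frequent tools. The ballots, the function name `frequent_pairs`, and the 0.3 support threshold are my own illustrative assumptions, not the authors' data or code. Lift is one common association measure, and the article may well use a different one.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy ballots: each set is the tools one respondent reported using.
VOTES = [
    {"R", "Python", "SQL"},
    {"R", "RapidMiner"},
    {"Python", "SQL", "Hadoop"},
    {"R", "Python"},
    {"Excel", "SQL"},
    {"R", "Python", "Hadoop"},
]


def frequent_pairs(votes, min_support=0.3):
    """Return {(a, b): (support, lift)} for tool pairs meeting min_support."""
    n = len(votes)
    # Pass 1: count how many ballots mention each single tool.
    single = Counter(tool for vote in votes for tool in vote)
    frequent = {t for t, c in single.items() if c / n >= min_support}
    # Pass 2: count only pairs built from frequent tools (the Apriori pruning step).
    pair_counts = Counter()
    for vote in votes:
        kept = sorted(vote & frequent)
        pair_counts.update(combinations(kept, 2))
    result = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            # Lift > 1 means the two tools co-occur more often than chance predicts.
            lift = support / ((single[a] / n) * (single[b] / n))
            result[(a, b)] = (support, lift)
    return result


PAIRS = frequent_pairs(VOTES)
for pair, (support, lift) in sorted(PAIRS.items()):
    print(pair, round(support, 2), round(lift, 2))
```

A lift well above 1 for a pair is the kind of signal that would show up as a thick edge in an association graph; a lift near 1 suggests the two tools are used together about as often as chance would predict.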

For example, the authors’ graph visualizes the associations among the top 20 most popular tools. Node colors indicate the type of tool: red for Free/Open Source tools, green for Commercial tools, and fuchsia for Hadoop/Big Data tools. Node sizes vary with the percentage of votes each tool received. Edge thickness shows the weight of each association: thicker edges indicate a high association and thinner edges a low one.


For those who, like me, work predominantly in the R world, the analysis also lists the tools most often associated with R.

While this work is just a start and covers only a limited user population, the authors provide the data set for those who want to explore the survey further or collect additional tool data to extend these insights.

Cybernetic Historical Debris Fields – Big Data’s Proof of Life

How did Robert Ballard find the Titanic? Most people think it was by looking for it. Well, most people would be wrong. Ballard believed he could rediscover the Titanic by looking for the debris field created when the ship sank. With the Titanic being only around 900 feet long, he hypothesized that ship parts would be spread out more widely the farther they were from the ship, narrowing like a funnel the closer one got. In essence, this much larger historical debris field would point the way to the much smaller artifact of interest – the Titanic.

Every physical object leaves some trace of its interaction with the real world over time – everything. That is true whether it is the Titanic plunging to her death in the depths of the Atlantic Ocean or a lonely rock sitting in the middle of a dry desert lake bed. Everything leaves a trace; everything has a Historical Debris Field (HDF). Formally,


Definition: A Historical Debris Field (HDF) is any time-dependent perturbation of an object and its environment.

One of the key points is that it is an observation over time, not just a point in time. HDFs are about capturing the absolute historical changes in the environment in order to make relative projections about some object in the future.

As it turns out, just as physical real-world objects leave historical debris fields, so does data through its virtual interactions in cyberspace. Data, by definition, is merely a representative abstraction of a concept or real-world object, and is a direct artifact of some computational process. At some level, every known relevant piece of electronic information (these words, your digital photos, a YouTube video) boils down to a series of zeros (0) and ones (1), strung together by a complex series of implicit and tacit interacting algorithms. These algorithms are, in essence, the counterpart of the natural, often unseen forces that govern the historical debris of real-world objects. So, the HDF for cyberspace might be defined as,

Definition: A Cybernetic Historical Debris Field (CHDF) is any time-dependent perturbation of data and its information environment (information being relevant data).

Why is this lengthy definitional exposition important? Because big data represents the Atlantic Ocean in which a company is looking for opportunities. And as in Robert Ballard’s search for the Titanic, one cannot merely set out looking for a piece of insight or knowledge itself in the vastness of all that internal and external, structured and unstructured data. One needs to look for the Cybernetic Historical Debris Fields that point to the electronic treasure. But what kind of new “virtual sonar” systems can we employ to help us?

While I will explore this concept more over time, let me suggest that the “new” new in the field of data mining will be in coupling data scientists (DS) with behavioral analysts (BA). Data changes because, at its core, some human initiated a change (a causal antecedent). It is through a better understanding of human behavior (patterns) that we will have the best chance of monetizing the vastness of big data. Charles Duhigg, author of “The Power of Habit: Why We Do What We Do in Life and Business,” shows that by understanding human nature (aka our historical debris field) we can accurately predict a behavior-based future.


For example, Duhigg shows how Target tries to hook future parents at the crucial moment before they turn into loyal buyers of prenatal and postnatal products (e.g., supplements, diapers, strollers). Target, specifically Andrew Pole, determined that while lots of people buy lotion, women on baby registries were buying larger quantities of unscented lotion. Also, women at about twenty weeks into their pregnancy would start loading up on supplements like calcium, magnesium, and zinc. This CHDF led the Target team to one of the first behavioral patterns (a virtual Titanic sonar pattern) that could discriminate pregnant from non-pregnant women. Not impressed? Well, this type of thinking contributed to Target’s $23 billion revenue growth from 2002 to 2010.

The net of all this is that data can be monetized by systematically searching big data for relevant patterns (cybernetic historical debris fields) grounded in human patterns of behavior. There are patterns in everything, and just because we don’t see them doesn’t mean they don’t exist. Through data science and behavioral analysis (AKA Big Data), one can reveal the behavioral past in order to monetize the future.

Data Science: Beyond Intuition – The Movie

Data science is changing the way we look at business, innovation, and intuition. It challenges our subconscious decisions, helps us find patterns, and empowers us to ask better questions. Hear from thought leaders at the forefront, including Growth Science, IBM, Intel, and the National Center for Supercomputing Applications. This video is an excellent source of information for those who have struggled to understand data science and its value.