FIELD NOTE: Your Math is All Wrong…

At the request of a friend, I recently reviewed the article “Your Math Is All Wrong: Flipping The 80/20 Rule For Analytics” by John Thuma. It is a good article, but incomplete and a bit misguided. Thuma argues that we spend too much time preparing data (prepping) and not enough time analyzing it. The article creates the illusion that he will “reveal” the magic needed to solve this problem at the end. Spoiler… he does not.

I agree with the premise; that is, a disproportionate amount of time is spent on data prepping (80%), but the author does not provide any insight into how to reduce it (the flip from 80% to 20%). Study after study has shown this to be the case, so it is pointless to argue the statistical point. But towards the end of the article, he states that “Flipping the rule will mean more data-driven decisions.” OK, I get it. But please explain how.

Well, the cheap “naive” way would be to just start spending more time on the analytics process itself. That is, once the prep process is complete, just spend 16x more effort on analytics (do the math). This would give you the 20% prep and 80% analytics the author wants to achieve. Cheap trick, but that is statistics. But even that is not the issue. The real issue isn’t moving from 80% to 20%.
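For anyone who wants to "do the math", here is a minimal sketch of where the 16x comes from. It assumes prep effort stays fixed at its current level and only analytics effort grows; the function name is made up for illustration.

```python
def analytics_multiplier(prep_share_now: float, prep_share_target: float) -> float:
    """Factor by which analytics effort must grow if prep effort stays fixed."""
    analytics_now = 1.0 - prep_share_now
    # Total effort (in units of today's total) at which the unchanged
    # prep effort shrinks to the target share.
    total_target = prep_share_now / prep_share_target
    analytics_target = total_target - prep_share_now
    return analytics_target / analytics_now


if __name__ == "__main__":
    # 80% prep today, 20% prep desired.
    print(round(analytics_multiplier(0.80, 0.20), 6))
```

With prep at 80 units and analytics at 20, prep only becomes 20% of the total when the total reaches 400 units, which means 320 units of analytics, or 16 times today's 20.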

The real challenge is understanding exactly what “value” means in the data science process and understanding a systematic way to achieve it. In the end, if I have to spend 80% of my time preparing and 20% analyzing in order to discover “how” to grow a business in a profitable way, who cares what the ratio is? Real value comes from focusing on the questions: from what (descriptive), to why (diagnostic), to when (predictive), and finally how (prescriptive). In doing so, a chain is created with each stage linking value (AKA a value chain). OK, but how do you do this?


Addressing that question (my reveal) is beyond the scope of this article. I would suggest starting with a few articles on the Data Scientist Insights blog. Several of them deal with exactly this point. After that, write me (@InsightDataSci) and we can talk.


Coopetition – A Comparison of Data Analytics & Data Sciences

There is a lot of discussion around how data sciences and data analytics differ, from the tools that are used to the methodologies that are employed. Two useful perspectives are to look at the differences (what separates them) and then at the commonalities (what brings them together). The “tale of the tape” (below) provides nine common measures used to differentiate these two “data fighters.” The most notable for this discussion is the first – Philosophy. Data analytics tends to focus its mental energy on confirming (quantifying and qualifying) things we know we want to know. On the other hand, data sciences is about revelation – the discovery of something new in a previously unknown area.


Another lens through which we can look at the differences question is that of the Knowledge Model (used above). This model divides our understanding (or not) of the world around us into four groups: you know what you know, you know what you don’t know, you don’t know what you know, and you don’t know what you don’t know. Simple examples of the first two are: you know your age, but don’t know mine. The third is a bit trickier in that it is about recall and recognition. A possible example is recalling an event from earlier in your life when you smell a particular scent or hear a specific song (ah, those were the days). There are an infinite number of examples in the last category, but one I use a lot is that you probably don’t know much about pebble bed nuclear reactors and you did not know you didn’t know it until you read those words. On with data sciences. By the way, we hardly ever spend time looking into things we don’t know we know (recall), since a lot of assessment-oriented events highlight them during discovery, which results in more knowing what you know.

The Knowledge Model is very useful when thinking through data analytics and data sciences. Data analytics is fundamentally about providing clarity around those things we know we know. For example, what is my product inventory throughout my global supply chain? Data sciences, on the other hand, explores those things we don’t know we don’t know, with the goal of producing actionable insights. An example is finding undiscovered ways of limiting product leakage throughout a global supply chain. In the middle, the connective layer, is where data analytics and data science often come together. For example, trying to better understand why there are different levels of inventory throughout our supply chain or discovering events that will impact them.


While there are differences and commonalities between data analytics and data science, they are both equally important. Without analytics, we would not be able to operate our factories or even pay our employees. Data analytics powers the economic engine of society. On the other hand, without data science we would be stuck doing the same thing over and over, and our businesses would be incapable of real strategic growth. Data science is a catalyst that moves our society through stagnation. The two are very different, but interconnected. A perfect example of Coopetition.


FIELD NOTE: Answering One of the Most Asked Questions of Data Scientists

Gregory Piatetsky and Shashank Iyer have begun to answer one of the most asked questions facing data science practitioners: Which tools work with which other tools? If I had a dollar for every time this question was asked of me, well, let’s just say I’d already be retired! So, when I saw that Gregory and Shashank had taken a crack at this question, I was intrigued.

In their recent article Which Big Data, Data Mining, and Data Science Tools go together?, the authors use a version of the Apriori algorithm to analyze the results of the 2015 KDnuggets Data Mining Software Poll. This work is an excellent example of how simple techniques, added together, can result in very useful insights.
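To give a feel for the technique, here is a minimal Apriori sketch in plain Python. It is not the authors' code, and the ballots are invented for illustration; each ballot is simply the set of tools one hypothetical respondent said they use.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


if __name__ == "__main__":
    # Hypothetical poll ballots, not the KDnuggets data.
    ballots = [
        {"R", "Python", "SQL"},
        {"R", "SQL"},
        {"R", "Python"},
        {"Python", "Hadoop"},
        {"R", "Python", "SQL"},
    ]
    for itemset, share in sorted(apriori(ballots, 0.6).items(), key=lambda kv: -kv[1]):
        print(sorted(itemset), share)
```

The prune step is what makes Apriori cheap: a tool combination is only counted if every smaller combination inside it has already cleared the support threshold.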

For example, the graph below visualizes the correlation between the top 20 most popular tools. Node colors: Red for Free/Open Source tools, Green for Commercial tools, Fuchsia for Hadoop/Big Data tools. The node sizes vary based on the percentage of votes each tool received. The edge widths show the weight of each association, the thicker ones showing a high association and the thinner ones a low association.


For those, like me, who predominantly work in the R world, here is a list of the tools that are most often associated with R:


While this work is just the start and only covers a limited user population, the authors provide the data set for those that want to further explore the survey or continue collecting additional tools data in order to extend their insights.

Darkness, A Flashlight, and the Data Scientist

Data sciences and data analytics not only use different techniques, which are often highly dependent on the distribution characteristics of the data, but also produce very different categorical types of insights. These insights range from a better understanding of things you know you know (data analytics) to discoveries in areas where you don’t know what you don’t know (data sciences). However, this knowledge metaphor can be a bit confusing, so I often use the “Darkness, A Flashlight, and the Data Scientist” parable.


In your mind, picture a darkened room, where you are standing, but do not know where in the room you are. In your hand is a large flashlight. You raise it slowly, pointing it in a direction. You turn it on and white light radiates forward.

The light of the flashlight shines brightly on a distant wall, where you see several items. These are the things you know that you know. As your eyes begin to scan outward, the wall turns to deep, dark black where the light does not reach. In this darkness, there are things you don’t know you don’t know. You begin to look back toward the center of the light. In that grey transitional boundary between the light of what we know and the darkness of what we don’t know are all the things we know we don’t know.


Data analytics is largely about understanding those things we know we know, that is, quantifying the light. This is the world of descriptive and diagnostic analytics. On the other hand, data sciences help us understand the darkest parts of our world, where we look to predict temporal and spatial relationships and prescribe means for achieving desired outcomes. Data analytics and data sciences are different in their own ways, each very important in its own right.

However, in the case of the data scientist, the metaphorical role is to pull the flashlight back so that more areas of the wall are illuminated. So, as the flashlight is linearly pulled back, the data scientist enables an exponential increase in our knowledge. In essence, the data scientist works in the dark so that others can benefit from the light. Think about it!

Critical Capabilities for Enterprise Data Science

In the article “46 Critical Capabilities of a Data Science Driven Intelligence Platform” an original set of critical enterprise capabilities was identified. In enterprise architecture language, capabilities are “the ability to perform or achieve certain actions or outcomes through a set of controllable and measurable faculties, features, functions, processes, or services.”(1) In essence, they describe the what of the activity, but not necessarily the how. While individually effective, the set was nevertheless incomplete. Below is an update where several new capabilities have been added and others relocated. Given my emphasis on deep learning, composed of cognitive and intelligence processes, I have added genetic and evolutionary programming as a set of essential capabilities.


The implementation architecture has also been updated to reflect the application of Spark and SparkR.


46 Critical Capabilities of a Data Science Driven Intelligence Platform

Data science is much more than just a singular computational process. Today, it’s a noun that collectively encompasses the ability to derive actionable insights from disparate data through mathematical and statistical processes, scientifically orchestrated by data scientists and functional behavioral analysts, all being supported by technology capable of linearly scaling to meet the exponential growth of data. One such set of technologies can be found in the Enterprise Intelligence Hub (EIH), a composite of disparate information sources, harvesters, Hadoop (HDFS and MapReduce), enterprise R statistical processing, metadata management (business and technical), enterprise integration, and insights visualization – all wrapped in a deep learning framework. However, while this technical stuff is cool, Enterprise Intelligence Capabilities (EIC) are an even more important characteristic that drives the successful realization of the enterprise solution.


In enterprise architecture language, capabilities are “the ability to perform or achieve certain actions or outcomes through a set of controllable and measurable faculties, features, functions, processes, or services.”(1) In essence, they describe the what of the activity, but not necessarily the how. For a data science-driven approach to deriving insights, these are the collective sets of abilities that find and manage data, transform data into features capable of being exploited through modeling, model the structural and dynamic characteristics of phenomena, visualize the results, and learn from the complete round trip process. The end-to-end process can be sectioned into Data, Information, Knowledge, and Intelligence.


Each of these atomic capabilities can be used by four different key resources to produce concrete intermediate and final intelligence products. The Platform Engineer (PE) is responsible for harvesting and maintaining raw data, ensuring well-formed metadata. For example, they would write Python scripts used by Flume to ingest Reddit dialogue into the Hadoop ecosystem. The MapReduce Engineer (MR) produces features based on imported data sets. One common function is extracting topics from document sets through natural language processing programmed in MapReduce. The Data Scientist (DS) performs statistical analyses and develops machine learning algorithms. Time series analysis, for example, is often used by the data scientist as a basis for identifying anomalies in data sets. Taken all together, Enterprise Intelligence Capabilities can transform generic text sources (observations) into actionable intelligence through the intermediate production of metadata tagged signals and contextualized events.
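As one illustration of the anomaly detection capability mentioned above, here is a minimal sketch that flags points sitting far outside a trailing window of recent values. It is only one of many possible approaches, and the window size, threshold, and sample counts are assumptions chosen for the example.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the trailing window."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat windows, where a z-score is undefined.
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical daily mention counts with one obvious spike.
    daily_mentions = [10, 11, 9, 10, 12, 11, 10, 40, 11, 10]
    print(rolling_zscore_anomalies(daily_mentions))
```

Because the window only looks backward, a spike inflates the standard deviation for the next few points, so this version is better at catching the first anomaly in a burst than the ones that follow it.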


Regardless of how data science is being used to derive insights, at the desktop or throughout the enterprise, capabilities become the building blocks for effective solution development. Independent of actual implementation (e.g., there are many different ways to perform anomaly detection), they are the scalable building blocks that transform raw data into the intelligence needed to realize true actionable insights.

UPDATE: U.S. On The Brink: Near-Depression Levels Losses In Wealth Expected


U.S. employers’ labor costs sustained their five-year high into the third quarter of 2014. Economists believe this is being driven by a tightening labor market, which often results in pressure on companies to raise wages and salaries. According to the Bureau of Labor Statistics, wages and salaries, which make up about 70% of compensation costs, rose 0.7% over the last two quarters.


In the original “U.S. On The Brink: Near-Depression Levels Losses In Wealth Expected” article, the expected median wealth loss was projected to be 18% to 27% over the next 2 to 5 years, respectively. This was driven by a decline in the Wealth to Income index and a lower than expected rise in Median Income. Given this sustained change in wages and salaries, the following revised losses in Wealth are based on projected mean Median US Incomes (upward revision):


The revised analysis now shows a median wealth loss of 15% to 23% over the next 2 to 5 years, respectively. This means that for a family who has a median net wealth of $182K (Federal Reserve, 2013), they are likely to see it fall to $154K by 2016 and $140K by 2019.



U.S. On The Brink: Near-Depression Levels Losses In Wealth Expected

The U.S. is on the brink of witnessing some of the largest economic losses in net wealth since the Great Depression. The US Wealth To Income index (reported in Credit Suisse Global Wealth Report 2014) has exceeded its mean 3rd quartile for only the fourth time in history (see below). While the significance of this most recent event cannot be overstated, one can determine the actual economic impact likely to be seen with a bit of time series and probabilistic modeling.


In order to quantify the impact on US wealth, we need to forecast the future US Wealth to Income index, along with the expected Median Income for the same period of time. Let’s start by looking at a few of the more interesting characteristics of the Wealth to Income index. A stationarity analysis (Augmented Dickey-Fuller test) of the index data indicates that we cannot reject the null hypothesis that it is non-stationary (Dickey-Fuller = -2.3486, Lag order = 0, p-value = 0.4319). The series therefore needs to be differenced before modeling, which is what Autoregressive Integrated Moving Average (ARIMA) time series modeling provides for forecasting future values.

ARIMA models are the most general class of models for forecasting a time series that can be made “stationary” by differencing (if necessary), perhaps in conjunction with nonlinear transformations such as logging or deflating (if necessary). An ARIMA model is classified as an “ARIMA(p,d,q)” model, where:

  • p is the number of autoregressive terms, 
  • d is the number of nonseasonal differences, and 
  • q is the number of lagged forecast errors in the prediction equation.
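For readers who prefer the equation, the three terms above can be written in the usual textbook form. The symbols here are my notation, not taken from the fitted model: B is the backshift operator (B y_t = y_{t-1}), c is a constant, the phi and theta values are the autoregressive and moving-average coefficients, and epsilon_t is the forecast error at time t.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right)(1 - B)^{d}\, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right)\varepsilon_t

% Special case p = 1, d = 1, q = 2, writing w_t = y_t - y_{t-1}:
w_t = c + \phi_1 w_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}
```

In words, an ARIMA(1,1,2) model predicts the year-over-year change in the series from last year's change plus the two most recent forecast errors.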

Through experimental evaluation, the most appropriate model was found to be ARIMA(1,1,2), which is forecast for 10 years and added to the original data series in order to produce the graph below. Here we see the fitted mean, forecasted mean, upper and lower 95% confidence interval, as well as the historical Wealth to Income data.


At first glance, one expects an equal likelihood of realizing either the forecasted upper or lower values. However, history can provide event-oriented insights that will allow a more probabilistic approach to determining the most likely forecast. Given a certain threshold value of the Wealth to Income index, we can count the number of years it takes for the index to return to its pre-threshold level, once exceeded. For example, if we set a Wealth to Income index threshold of 5.5, the mean number of years spent above this threshold is 4.6 yrs, with a standard deviation (sd) of 2.198 and standard error (se) of 0.98. In addition, the upper and lower 95% confidence levels are 6.52 and 2.68 yrs, respectively. Here is a complete table of years spent above a Wealth to Income threshold value:
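The counting described above can be sketched as follows. The index values in the example are invented, and the function names are mine. The reported figures appear consistent with five historical episodes and a normal approximation (2.198 / sqrt(5) is about 0.98, and 4.6 plus or minus 1.96 x 0.98 gives roughly 6.52 and 2.68), so the sketch uses the same approach.

```python
from math import sqrt
from statistics import mean, stdev


def spells_above(index_by_year, threshold):
    """Lengths (in years) of each consecutive run spent above the threshold."""
    spells, run = [], 0
    for value in index_by_year:
        if value > threshold:
            run += 1
        elif run:
            spells.append(run)
            run = 0
    if run:  # the series ends while still above the threshold
        spells.append(run)
    return spells


def summarize(spells, z=1.96):
    """Mean, sample sd, standard error, and normal-approximation 95% bounds."""
    m, sd = mean(spells), stdev(spells)
    se = sd / sqrt(len(spells))
    return {"mean": m, "sd": sd, "se": se, "lower": m - z * se, "upper": m + z * se}


if __name__ == "__main__":
    # Hypothetical yearly Wealth to Income values, not the Credit Suisse series.
    index = [5.0, 5.6, 5.8, 5.2, 5.7, 5.9, 6.1, 5.4, 5.6]
    spells = spells_above(index, 5.5)
    print(spells, summarize(spells))
```

With so few episodes, a t-based interval would be wider than the normal approximation, which is worth keeping in mind when leaning on the bounds.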


With this new threshold data, one can see that the Wealth to Income index stays above the 6.0 level for only 1.08 to 4.42 yrs. Given that this phase is 2 yrs into the cycle, it is more likely that the Wealth to Income index will see a decline in the next 2 years. Thus, we can reject the upper bounds of the forecast model and accept the lower bounds (forecasted lower 95%) for modeling purposes.

An analysis similar to the one above was used to forecast the median US Income (see below). In this case, the ARIMA(2,1,0) model was experimentally found to best represent this time series. The median US income is projected to have low to moderate growth over the next ten years and does not have significant volatility, as seen in the Wealth to Income index. Given some of the downward economic and regulatory pressures, the lower bounds (forecasted lower 95%) of the forecast will be used in the analysis.


The last step in the analysis is to compute the cumulative percentage change (cumPercentWealthDiff) in wealth as a function of the forecasted Wealth to Income index and US Median Income. The table below shows the results of multiplying the respective values and differencing them over the periods in question.
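A minimal sketch of that last step, assuming implied wealth is the product of the index and median income and that the change is measured against the first period. The inputs are invented for illustration, and the exact procedure behind the original table may differ.

```python
def cum_percent_wealth_diff(wealth_to_income, median_income):
    """Cumulative % change in implied wealth relative to the first period."""
    # Implied wealth = (wealth / income) * income, period by period.
    wealth = [w * i for w, i in zip(wealth_to_income, median_income)]
    base = wealth[0]
    return [100.0 * (w - base) / base for w in wealth]


if __name__ == "__main__":
    # Hypothetical forecast values, not the ones from the analysis.
    index_forecast = [6.0, 5.4, 4.8]
    income_forecast = [50000, 50500, 51000]
    for pct in cum_percent_wealth_diff(index_forecast, income_forecast):
        print(round(pct, 1))
```

The point of multiplying first is that a falling index can be partly offset by rising income, so the wealth loss is smaller than the drop in the index alone.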


The analysis shows a median wealth loss of 18% to 27% over the next 2 to 5 years, respectively. This means that a family with the median net wealth of $182K (Federal Reserve, 2013) is likely to see it fall to $150K by 2016 and $133K by 2019. For comparison, the Federal Reserve said that during the 2007-2010 recession the median net worth of families plunged by 39 percent in just three years, from $126,400 in 2007 to $77,300 in 2010. This analysis appears to be consistent with the reality seen over the last few years.
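The dollar figures are easy to sanity check. The sketch below applies the rounded 18% and 27% losses to $182K and recomputes the 2007 to 2010 drop; the helper names are mine, and small differences from the quoted figures come from rounding the percentages.

```python
def apply_loss(wealth, loss_pct):
    """Wealth remaining after a percentage loss."""
    return wealth * (1 - loss_pct / 100)


def percent_drop(before, after):
    """Percentage decline from before to after."""
    return 100 * (before - after) / before


if __name__ == "__main__":
    print(round(apply_loss(182, 18), 1))        # in $K, after 2 years
    print(round(apply_loss(182, 27), 1))        # in $K, after 5 years
    print(round(percent_drop(126400, 77300), 1))
```

The results land at roughly $149K and $133K, and the Federal Reserve numbers work out to a drop of just under 39%.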

The cause and effect relationship of this correlative model remains unclear. So, while some can probably find faults with this analysis (e.g., assume the Wealth to Income index continues to increase – like during the depression), the final story seems likely to remain the same – a dramatic loss in wealth for the United States over the next few years. The only real question that now remains is identifying and implementing the best investment strategy to undertake given that we are on this brink. I hear there are great specials going on at



Art of Resistance – The Social Network Anatomy of a Kinetic Activist Group


As data scientists who work in the intelligence community, we are often asked to help identify where intelligence gathering and analysis resources should be allocated. Governmental and non-governmental intelligence organizations are bounded by both limited operational funds and time. As such, resource allocation planning becomes an extremely important operational activity for data science teams. But how does one actually go about this?

There is no perfect way of looking for the proverbial needle in the haystack – needle being the bad guy and haystack being the world. While it is sometimes better to be lucky than good, having a systematically organic approach to resource allocation enables teams to manage the process to some level of statistical quality control (see Deming). One such way is through the use of Social Network Analysis (SNA).

Social networks encapsulate the human dynamics that are characteristically important to most intelligence activities. Each node in the network represents an entity (person, place, thing, etc.), which is governed by psychological behavioral characteristics. As these entities interact with each other, the nodes become interconnected, forming networks. In turn, these networks are governed by sociological behavioral principles. Taken together, social networks enable the intelligence community to understand and exploit behavioral dynamics – both psychological and sociological characteristics.

As a sidebar, intelligence analysis is not always about why someone or a group does something. It is often more important to understand why they are not doing things. For example, in intelligence we look extensively at why certain groups associate with each other. But it is equally important to also understand why one activist group does not associate with another. From a business perspective there is an equivalent in the sales process. Product managers often strive to understand who is buying their products and services, but lack a material understanding of why people don’t buy these same solutions.

In a recent project, we were tasked by a client to determine if Greenpeace was or could become a significant disruptive geopolitical force for a critical operational initiative. As part of the initial scoping activity, we needed to understand where to allocate our limited resources (global intel experts, business intel experts, subject matter experts, and data scientists) in order to increase the likelihood of addressing the client’s needs. A high-level SNA not only identified where to focus our effort, but also surfaced a previously unknown activism actor.

The six (6) panel layout below shows how we stepped through our discovery. In Facebook, we leveraged Netvizz to make an initial collection of group-oriented relationships for the principal target (Greenpeace). The 585 nodes, interconnected by 1788 edges, were imported into Gephi as shown in panel 1. As we say… somewhere in that spaghetti is a potential bad guy, but where?

Gephi Panel 01


After identifying and importing the data, it is important to generate an initial structural view of the entities. Force Atlas 2 is an effective algorithm, since studies have identified that organizational structure can be inferred from layout structure (panel 2). While this layout provides some transparency into the network, it still lacks any real clarity around behavioral importance.

To better understand which entities are more central than others, we leveraged Betweenness Centrality. This is a measure of a node’s centrality in a network, based on how often it lies on the shortest paths between other nodes, and reflects an underlying psychological characteristic. Betweenness centrality is a more useful measure (than just connectivity) in that bigger nodes are more central to behavioral dynamics. As seen in panel 3, several nodes become central figures in the overall network.
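To make the measure concrete, here is a small pure-Python sketch of betweenness centrality for an undirected, unweighted graph, following Brandes' shortest-path counting approach. It is not what Gephi runs internally, and the toy network is invented for illustration.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalized betweenness for an undirected, unweighted graph.

    adj maps each node to the set of its neighbours.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back toward s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


if __name__ == "__main__":
    # Hypothetical network: a - b - c, with d and e both hanging off c.
    network = {
        "a": {"b"},
        "b": {"a", "c"},
        "c": {"b", "d", "e"},
        "d": {"c"},
        "e": {"c"},
    }
    print(betweenness_centrality(network))
```

In the toy network, c scores highest even though it has only one more connection than b, because every path between the two sides of the graph has to pass through it.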

Identifying community relationships is an important next step in helping understand sociological characteristics. Using Modularity as a measure to unfold community organizations (panel 4), we now begin to see a clearer picture of who is doing what with whom. What becomes really interesting at this stage is understanding some of the more nuanced relationships.
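For intuition about what the modularity score measures, the sketch below evaluates Newman's modularity Q for a given assignment of nodes to communities. Community detection tools search for an assignment that scores well; this sketch only scores one you hand it, and the graph is a made-up example.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges is a list of (u, v) pairs; community maps node -> community label.
    """
    m = len(edges)
    internal = {}  # number of edges with both ends inside each community
    degree = {}    # sum of node degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (share of internal edges) - (expected share).
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )


if __name__ == "__main__":
    # Hypothetical graph: two triangles joined by a single bridge edge.
    edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
    two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
    one_group = {n: 0 for n in "abcdef"}
    print(round(modularity(edges, two_groups), 4))
    print(round(modularity(edges, one_group), 4))
```

Splitting the two triangles into separate communities scores well above lumping everything together, which is the kind of structure panel 4 is coloring.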

Take for example the five outlying nodes in the network (blue, maroon, yellow, dark green, and light green). They all appear to connect to an equally important red node in the center. Panel 5 clearly shows this central relationship. Upon further examination (filtering out nodes with low Betweenness Centrality values), we see the emergence of a previously unrecognized activism player: Art of Resistance.



While Greenpeace was the original target of interest, use of basic social network analysis principles resulted in the discovery of an emergent activism group playing a central role in the coordination and communication of events. Further analysis of this group revealed their propensity to promote kinetic activities (physical violence, bombing, etc.) over the more traditional passive, non-kinetic events found in Greenpeace.

Gephi Panel 02

A resource allocation plan was then developed to monitor and harvest open source information around key players of each community (larger nodes). The plan resulted in a more focused intelligence analysis process where human analysts could explore in depth the behavioral dynamics of critical entities, rather than tangentially digesting summary information from all.

Social network analysis (SNA) is an effective tool for the intelligence team, as well as the data science team. Finding the proverbial needle in the haystack requires a systematically organic process that explains both the why and why not of behavioral dynamics. Use of these kinds of tools enables a broad set of capabilities, ranging from resource allocation to discovery.