## FIELD NOTE: Your Math is All Wrong…

At the request of a friend, I recently reviewed the article “Your Math Is All Wrong: Flipping The 80/20 Rule For Analytics” by John Thuma. It is a good article, but incomplete and a bit misguided. Thuma argues that we are spending too much time preparing data (prepping) and not enough time analyzing it. The article creates the illusion that it will “reveal” the magic needed to solve this problem at the end. Spoiler… it does not.

I agree with the premise; that is, a disproportionate amount of time is spent prepping data (80%), but the author does not provide any insight into how to reduce it (the flip from 80% to 20%). Study after study has shown this to be the case, so it is pointless to argue the statistics. But toward the end of the article, he states that “Flipping the rule will mean more data-driven decisions.” Ok, I get it. But please explain how?

Well, the cheap “naive” way would be to just start spending more time on the analytics process itself. That is, once the prep process is complete, just spend 16x more effort on analytics (do the math). This would give you the 20% prep and 80% analytics the author wants to achieve. A cheap trick, but that is statistics. But even that is not the issue. The real issue isn’t moving from 80% to 20%.
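The “do the math” step is worth spelling out. A minimal sketch, treating the 80/20 split as units of effort and holding prep effort fixed:

```python
# The "naive flip": keep prep effort fixed and inflate analytics effort
# until the ratio inverts from 80/20 to 20/80.
prep, analytics = 80, 20           # current effort split (units of time)

# For prep to be only 20% of the total, analytics must be 4x prep.
analytics_flipped = 4 * prep       # 320 units

multiplier = analytics_flipped / analytics
print(multiplier)                  # 16.0 -> 16x more analytics effort
```

Sixteen times the analytics effort flips the ratio without reducing prep time by a single minute, which is exactly why the trick is cheap.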

The real challenge is understanding exactly what “value” means in the data science process and finding a systematic way to achieve it. In the end, if I have to spend 80% of my time preparing and 20% analyzing in order to discover “how” to grow a business in a profitable way, who cares what the ratio is? Real value comes from focusing on the questions: from what (descriptive), to why (diagnostic), to when (predictive), and finally how (prescriptive). In doing so, a chain is created with each stage linking value (AKA a value chain). Ok, but how do you do this?

Addressing that question (my reveal) is beyond the scope of this article. I would suggest one start by looking at a few articles on the Data Scientist Insights blog. There are several that deal with exactly this point. After that, write me (@InsightDataSci) and we can talk.


## FIELD NOTE: Answering One of the Most Asked Questions of Data Scientists

Gregory Piatetsky and Shashank Iyer have begun to answer one of the most asked questions facing data science practitioners: which tools work with which other tools? If I had a dollar for every time this question was asked of me, well, let’s just say I’d already be retired! So, when I saw Gregory and Shashank took a crack at this question, I was intrigued.

In their recent article “Which Big Data, Data Mining, and Data Science Tools Go Together?”, the authors use a version of the Apriori algorithm to analyze the results of the 2015 KDnuggets Data Mining Software Poll. This work is an excellent example of how simple techniques, added together, can yield very useful insights.
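To give a flavor of their approach, here is a hedged sketch of the pairwise step of an Apriori-style association analysis. The survey responses below are made up for illustration; they are not drawn from the KDnuggets data:

```python
# Pairwise association analysis over tool-usage "transactions".
from itertools import combinations
from collections import Counter

# Each transaction is one (hypothetical) respondent's set of tools.
responses = [
    {"R", "RapidMiner"},
    {"R", "Python", "Spark"},
    {"R", "KNIME"},
    {"Python", "Spark"},
    {"R", "Python"},
]

n = len(responses)
item_counts = Counter(tool for resp in responses for tool in resp)
pair_counts = Counter(frozenset(p) for resp in responses
                      for p in combinations(sorted(resp), 2))

def lift(a, b):
    """Lift > 1 means the two tools co-occur more often than chance."""
    support_ab = pair_counts[frozenset((a, b))] / n
    return support_ab / ((item_counts[a] / n) * (item_counts[b] / n))

print(lift("Python", "Spark"))  # ~1.67: used together more than chance
```

The real Apriori algorithm prunes the candidate itemsets by support before counting larger combinations, but the support/lift calculation above is the core of how such tool-association graphs are scored.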

For example, the graph below visualizes the correlation between the top 20 most popular tools. For the nodes, red indicates free/open source tools, green commercial tools, and fuchsia Hadoop/Big Data tools. Node sizes vary with the percentage of votes each tool received. Edge thickness shows the weight of each association: thicker edges indicate a high association, thinner edges a low one.

For those, like me, who predominantly work in the R world, here is a list of the tools most often associated with R.

While this work is just a start and covers only a limited user population, the authors provide the data set for those who want to explore the survey further, or to continue collecting tools data in order to extend these insights.

## Global Terrorism Database (GTDb)

I have built a prototype database of terrorists and their known associates, inferred associates, cutouts, and ghosts (AKA the Global Terrorism Database). Using the Deep Web Intelligence Platform and many of the critical capabilities found in my enterprise data science framework, I have managed to pull together an initial repository of bad guys and the people associated with them that will scale globally. While having a composite database of known bad guys is important, what is really interesting is the list of previously unknown people associated with them – some of whom I would never have guessed.

I want to know if this is important and why. If you have any thoughts, please let me know (datascientistinsights@gmail.com).


## FIELD NOTE: Predictive Apps

Mike Gualtieri (Forrester) believes that “developers are stuck in a design paradigm that reduces app design to making functionality and content decisions based on a few defined customer personas or segments.”

The answer to developing apps that dazzle the digital consumer and make your company stand out from the competition lies in what Gualtieri calls Predictive Apps. Forrester defines predictive apps as:

Apps that leverage big data and predictive analytics to anticipate and provide the right functionality and content on the right device at the right time for the right person by continuously learning about them.

To build anticipatory, individualized app experiences, app developers will use big data and predictive analytics to continuously and automatically tune the app experience by:

• Learning who the customer (individual user) really is
• Detecting the customer’s intent in the moment
• Morphing functionality and content to match the intent
• Optimizing for the device (or channel)
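As a rough sketch of that continuous tuning loop (the class and method names here are hypothetical illustrations, not Forrester's or any vendor's API):

```python
# A toy predictive-app loop: learn the user, detect intent, morph
# content, optimize for the device. The rules are deliberately trivial.
class PredictiveApp:
    def __init__(self):
        self.profile = {}                      # continuously learned user model

    def learn(self, event):
        """Learn who the customer (individual user) really is."""
        self.profile[event["signal"]] = event["value"]

    def detect_intent(self):
        """Detect the customer's intent in the moment (trivial rule here)."""
        return "purchase" if self.profile.get("viewed_pricing") else "browse"

    def render(self, device):
        """Morph content to the intent and optimize for the device."""
        layout = "compact" if device == "phone" else "full"
        return {"intent": self.detect_intent(), "layout": layout}

app = PredictiveApp()
app.learn({"signal": "viewed_pricing", "value": True})
print(app.render("phone"))  # {'intent': 'purchase', 'layout': 'compact'}
```

A production predictive app would replace the if/else rules with models trained on behavioral data, but the loop structure (learn → detect → morph → optimize) is the same.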


## FIELD NOTE: A Bayesian View of the Monty Hall Problem

A friend and I were Bayesian-doodling on the Monty Hall game show problem: there are three doors to choose from. Behind one door is a prize; behind the others, goats. You pick a door, say Door A, and Monty Hall (the host), who knows what’s behind the doors, opens another door, say Door B, which has a goat. Monty then says to you, “Do you want to switch and pick Door C?” Is it to your advantage to switch your choice?

Here is a rough Bayesian analysis that shows why it is more favorable to switch from your current door (A) to the remaining door (C).
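The same argument can be checked numerically. This is a minimal sketch of the Bayes computation, not the pen-captured derivation itself:

```python
# Bayesian Monty Hall: you pick Door A, the host opens Door B (a goat).
from fractions import Fraction

prior = {d: Fraction(1, 3) for d in "ABC"}   # prize equally likely anywhere

# Likelihood that the host opens Door B, given where the prize is:
# - prize behind A: host picks B or C at random -> 1/2
# - prize behind B: host never reveals the prize -> 0
# - prize behind C: host is forced to open B  -> 1
likelihood_open_B = {"A": Fraction(1, 2), "B": Fraction(0), "C": Fraction(1)}

unnormalized = {d: prior[d] * likelihood_open_B[d] for d in "ABC"}
evidence = sum(unnormalized.values())
posterior = {d: unnormalized[d] / evidence for d in "ABC"}

print(posterior)  # switching doubles your chance: C is 2/3, A is 1/3
```

The asymmetry comes entirely from the likelihoods: the host's forced move when the prize is behind Door C concentrates posterior mass there.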

Note: Captured with LiveScribe’s Sky Pen – one of the best personal productivity tools I have ever purchased (coupled with Evernote).

## FIELD NOTE: Methodological Problems

Methodological problems (i.e., problems for which existing methods are inadequate) lead to new statistical research and improved techniques. This area of work is called mathematical statistics and is primarily concerned with developing and evaluating the performance of new statistical methods and algorithms. It is important to note that computing and solving computational problems are integral components of all of the areas of statistical science mentioned here.

From Wikipedia:

Statistical science is concerned with the planning of studies, especially with the design of randomized experiments and with the planning of surveys using random sampling. The initial analysis of the data from properly randomized studies often follows the study protocol.
Of course, the data from a randomized study can be analyzed to consider secondary hypotheses or to suggest new ideas. A secondary analysis of the data from a planned study uses tools from data analysis.

Data analysis is divided into:
— descriptive statistics: the part of statistics that describes data, i.e., summarises the data and their typical properties.
— inferential statistics: the part of statistics that draws conclusions from data (using some model for the data). For example, inferential statistics involves selecting a model for the data, checking whether the data fulfill the conditions of that model, and quantifying the involved uncertainty (e.g., using confidence intervals).
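A minimal illustration of this split, using a made-up sample and a normal-approximation confidence interval (a z of 1.96 is assumed; for a sample this small a t-interval would be more careful):

```python
# Descriptive vs. inferential statistics on a hypothetical sample.
import math
import statistics

sample = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 5.1, 4.9, 5.0]

# Descriptive: summarize the data and their typical properties.
mean = statistics.mean(sample)
sd = statistics.stdev(sample)

# Inferential: quantify uncertainty about the underlying mean with an
# approximate 95% confidence interval (normal approximation).
se = sd / math.sqrt(len(sample))
ci = (mean - 1.96 * se, mean + 1.96 * se)

print(mean, ci)
```

The descriptive step only summarizes what was observed; the inferential step makes a model-based claim about the process that generated the data.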

While the tools of data analysis work best on data from randomized studies, they are also applied to other kinds of data — for example, from natural experiments and observational studies, in which case the inference is dependent on the model chosen by the statistician, and so subjective.[1]
Mathematical statistics has been inspired by and has extended many procedures in applied statistics.

## FIELD NOTE: Definition – Data Driven

During a spirited debate over the meaning of data driven, a colleague asked me for my definition – you know, as a data scientist. I replied with this one, which I learned from DJ Patil (who built the LinkedIn data science team):

A data-driven organization acquires, processes, and leverages data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.

This definition avoids defining an organization by the volume, velocity, or even variety of its data (big data terms). Instead, it focuses on how the data is effectively used to create a more competitive position.

## FIELD NOTE: Quadrification of Big Data

1. Data (the intrinsic 1/0 property of big data), which can be broken down into areas like interaction vs. transaction data || structured vs. unstructured || realtime/streaming vs. batch/static || etc.

2. MapReduce platforms (AKA divide and conquer) – virtual integration capabilities that enable aggregation and management of multiple name-spaced data sources (Hadoop, InfoSphere Streams, Pneuron, etc.).
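The divide-and-conquer pattern these platforms provide at scale can be sketched in miniature with a toy map/reduce word count:

```python
# Toy MapReduce: map each shard to partial counts, then reduce by merging.
from collections import Counter
from functools import reduce

shards = ["big data has data", "data science uses data"]

def map_shard(text):
    """Map step: count words within one shard independently."""
    return Counter(text.split())

def reduce_counts(a, b):
    """Reduce step: merge two partial counts into one."""
    return a + b

total = reduce(reduce_counts, map(map_shard, shards))
print(total["data"])  # 4
```

Real platforms run the map step in parallel across many machines and shuffle the partial results before reducing, but the programming model is exactly this pair of functions.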

3. Data Exploration, Data Mining, and Intelligence Platforms – technical capabilities that enable one to derive insights from data (Pentaho, IBM InfoSphere, ListenLogic, MATLAB, Mathematica, Statistica, etc.).

4. Knowledge Worker platform (AKA the human component) – the two most important capabilities come from data scientists (who navigate the data) and behavioral scientists (who navigate human behavior, to which the most important things seem to connect back).

In essence, Big Data has data, an ability to find it and use it, and an ability to explore and learn from it.

Does this seem right?  Missing anything? Please post or email me.

## Field Note: Big Data Changes the Traditional Paradigm of Social Analysis

A brief big data touch point. In a conversation with a colleague, he asked a question about how big data is changing everyday activities. I noted a conclusion made by Alessandro Mantelero, in his paper “Masters of Big Data: concentration of power over digital information.” He stated, in the context of big data, that:

Examination of the data flows in the evolving datasets shows trends and information without the need of prior working hypothesis, changing the traditional paradigm of social analysis in which the design of the study sample represents the first step which is then followed by the analysis of raw data.

The key point is that big data is a catalyst for the exploration of level-three knowledge: those things we don’t know we don’t know. Traditional discovery methods, used in level-one (know what we know) and level-two (know what we don’t know) knowledge acquisition, are limited because they infer, and often require in advance, the relevant aspects of the questions under study. In contrast to the traditional deductive approach to knowledge acquisition, big data supports inductive knowledge acquisition, which fundamentally requires large amounts of information.

Therefore, if one believes that transformative business insights can be found in understanding more of what we don’t know we don’t know (level 3), then the use of big data is one of the proven means to achieve this.

## FIELD NOTE: Big Data and the Power of Three

A common data exploration question came up while talking with a British colleague in the advertising industry on Friday: how many independent subject areas should be investigated (1, 10, 100, …, N) in order to have a statistically significant chance of making a discovery with the least amount of effort? An answer can be found in “The Power of Three (3),” an application of the Knowledge Singularity when N is very small, which defines meaningful returns on knowledge discovery costs.

As I discussed in the field note “FIELD NOTE: What Makes Big Data Big – Some Mathematics Behind Its Quantification,” perfect insight can be approached asymptotically as one systematically converges on the Knowledge Singularity (77 independent subject areas out of an N-dimensional universe where N >> 77). While this convergence on infinite knowledge (insight) is theoretically interesting, it is preceded by a more practical application when N is three (3); that is, when one explores the combinatorial space of only three subjects.

Let insight 1 (I_1) represent the insights implicit in data set 1 (Ds_1) and insight 2 (I_2) the insights implicit in data set 2 (Ds_2), where the intersection of data sets 1 and 2 is empty (Ds_1 ∩ Ds_2 = {}). Further, let insight N (I_N) represent the insights implicit in data set N (Ds_N), where the intersection of data set N with all previous data sets is empty (Ds_1 ∩ Ds_2 ∩ … ∩ Ds_N = {}). The total insight implicit in all data sets, 1 through N, is therefore proportional to the insights gained by exploring all combinations of the data sets (from the previous field note). That is,

Total Insight [Ds_1 … Ds_N] ∝ 2^N − 1 (the number of non-empty combinations of N data sets)

In order to compute a big data ROI, we need to quantify the cost of knowledge discovery. Using current knowledge exploration techniques, the cost of discovering the insights in any data set is proportional to the size of that data set:

Discovery Cost (I_N) = Knowledge Exploration[Size of Data Set N]

Therefore, a big data ROI could be measured by:

Big Data ROI = Total Insights [Ds_1 … Ds_N] / Total Discovery Cost [Ds_1 … Ds_N]

If we assume the explored data sets to be equal in size (which is generally not the case, but does not matter for this analysis), then:

Discovery Cost (I_1) = Discovery Cost (I_2) = … = Discovery Cost (I_N)

or

Total Discovery Cost [Ds_1 ∪ Ds_2 ∪ … ∪ Ds_N] = N × Discovery Cost [Ds] = O(N), i.e., proportional to N, where Ds is any one of the (equal-sized) data sets

Thus, combining the insight and cost expressions,

Big Data ROI [Ds_1 … Ds_N] ∝ (2^N − 1) / N

and we can now plot Big Data ROI as a function of N, for small values of N.
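As a hedged sketch, assuming total insight scales with the 2^N − 1 non-empty combinations of N data sets and discovery cost scales linearly with N (the proportionality constants are dropped), the ROI for small N can be tabulated:

```python
# ROI(N) ∝ (2**N - 1) / N: insight from all non-empty combinations of
# N independent data sets, divided by a cost that is linear in N.
def big_data_roi(n):
    return (2 ** n - 1) / n

for n in range(1, 6):
    print(n, big_data_roi(n))
```

Under these assumptions the denominator never keeps pace with the combinatorial numerator, which is the mathematical heart of the argument that combining even a handful of independent data sets pays for itself.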

That was fun, but so what? The single biggest ROI in knowledge discovery comes when insights are sought in and across the first two combined independent data sets. However, while the total knowledge gained increases exponentially for each additional independent data set added, the return on investment does not keep pace as N grows large. One can therefore reasonably argue that, given a limited discovery investment (budget), a minimum of two subjects is needed, while three ensures some level of sufficiency.

Take the advertising market (McCann, Tag, Goodby, etc.), for example. Significant insight can be gained by exploring the combination of enterprise data (campaign-specific data) and social data (how the market reacts) – two independent subject areas. However, to gain some level of assurance, or sufficiency, the addition of one more data set, such as IT data (click-throughs, induced hits, etc.), increases the overall ROI without materially increasing the costs.

This combination of at least three independent data sets to ensure insightful sufficiency is what is being called “The Power of Three.” While a bit of a mathematical and statistical journey, this should make intuitive sense. Think about the benefits that come from combining subjects like psychology, marketing, and computer science. While any one or two is great, all three provide the basis for a compelling ability to cause consumer behavior, not just to report on it (computer science) or correlate around it (computer science and market science).