## FIELD NOTE: Big Data and the Power of Three

A common data exploration question came up while talking with a British colleague in the advertising industry on Friday: how many independent subject areas should be investigated (1, 10, 100, …, N) in order to have a statistically significant chance of making a discovery with the least amount of effort? An answer can be found in “The Power of Three (3),” an application of the Knowledge Singularity when N is very small, which defines meaningful returns on knowledge discovery costs.

As I discussed in the field note “FIELD NOTE: What Makes Big Data Big – Some Mathematics Behind Its Quantification,” perfect insight can be gained asymptotically as one systematically approaches the Knowledge Singularity (77 independent subject areas out of an N-dimensional universe where N >> 77). While this convergence on infinite knowledge (insight) is theoretically interesting, it is preceded by a more practical application when N is three (3); that is, when one explores the combinatorial space of only three subjects.

Let insight 1 (I_1) represent the insights implicit in data set 1 (Ds_1) and insight 2 (I_2) represent the insights implicit in data set 2 (Ds_2), where the two data sets are independent; that is, their intersection is null (Ds_1 ∩ Ds_2 = {}). Further, let insight N (I_N) represent the insights implicit in data set N (Ds_N), where the intersection of data set N with every previous data set is null (Ds_i ∩ Ds_j = {} for all i ≠ j). The total insight implicit in all data sets, 1 through N, is therefore proportional to the insights gained by exploring the total combinations of all data sets (from the previous field note). That is,

Total Insights [Ds_1 … Ds_N] ∝ 2^N - 1

In order to compute a big data ROI, we need to quantify the cost of knowledge discovery. Using current knowledge exploration techniques, the cost of discovering insights in any data set is proportional to the size of the data:

Discovery Cost (I_N) = Knowledge Exploration[Size of Data Set N]

Therefore, a big data ROI could be measured by:

Big Data ROI = Total Insights [Ds_1 … Ds_N] / Total Discovery Cost [Ds_1 … Ds_N]

If we assume the explored data sets to be equal in size (which generally is not the case, but does not matter for this analysis), then:

Discovery Cost (I_1) = Discovery Cost (I_2) = Discovery Cost (I_N)

or

Total Discovery Cost [Ds_1 … Ds_N] = N x Discovery Cost [Ds] = O(N), that is, proportional to N, where Ds is any one of the equally sized data sets

Thus,

and

We can now plot Big Data ROI as a function of N, for small values of N,
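The values behind such a plot can be sketched in a few lines of Python. This is only one plausible reading, not the original figure: it assumes total insights scale as 2^N - 1 (the order-independent combination count from the earlier field note) and that every data set has the same unit discovery cost. The function names and the `cost_per_set` parameter are illustrative assumptions, and the original plot may have used a different normalization.

```python
def total_insights(n: int) -> int:
    """Non-empty combinations of n independent data sets (assumed to be 2^n - 1)."""
    return 2 ** n - 1


def discovery_cost(n: int, cost_per_set: float = 1.0) -> float:
    """Total cost of exploring n equally sized data sets (linear in n)."""
    return n * cost_per_set


def big_data_roi(n: int, cost_per_set: float = 1.0) -> float:
    """Total insights divided by total discovery cost, under the assumptions above."""
    return total_insights(n) / discovery_cost(n, cost_per_set)


# Tabulate the ratio for small N, the range the field note is concerned with.
for n in range(1, 7):
    print(f"N={n}: insights={total_insights(n):>2}, "
          f"cost={discovery_cost(n):.0f}, ROI={big_data_roi(n):.2f}")
```

Changing `cost_per_set` rescales every ROI value by the same factor, so it does not change how the values for N = 1, 2, and 3 compare with one another.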

That was fun, but so what? The single biggest ROI in knowledge discovery comes when insights are looked for in and across the very first two combined independent data sets. However, while the total knowledge gained increases exponentially for each additional independent data set added, the return on investment asymptotically approaches a finite limit as N approaches infinity. One can therefore reasonably argue that, given a limited discovery investment (budget), a minimum of two subjects is needed, while three ensures some level of sufficiency.

Take the advertising market (McCann, Tag, Goodby, etc.), for example. Significant insight development can be gained by exploring the necessary combination of enterprise data (campaign specific data) and social data (how the market reacts) – two independent subject areas. However, to gain some level of assurance, or sufficiency, the addition of one more data set, such as IT data (click throughs, induce hits, etc.), increases the overall ROI without materially increasing the costs.

This combination of at least three independent data sets to ensure insightful sufficiency is what is being called “The Power of Three.” While a bit of a mathematical and statistical journey, this intuitively should make sense. Think about the benefits that come from combining subjects like Psychology, Marketing, and Computer Science. While any one or two is great, all three provide the basis for a compelling ability to cause consumer behavior, not just to report on it (computer science) or correlate around it (computer science and market science).

## FIELD NOTE: What Makes Big Data Big – Some Mathematics Behind Its Quantification

Heads Up – This is a stream of consciousness! Please be patient with me while I incrementally refine it over time. Critical feedback is welcome!

There are several different ways to define when data becomes big data. The two traditional approaches are based on some variant of:

- Big is the sample size of data after which the asymptotic properties of the exploratory data analysis (EDA) methods kick in for valid results.

- Big is the gross size of the data under investigation (e.g., size of a database, data mart, data warehouse, etc.).

While both of these measures tend to provide an adequate means through which one can discuss the sizing issue, they both are correlative and not causal by nature. But before we get into a more precise definition of big, let’s look at some characteristics of data.

Regardless of what you are told, all data touched or influenced by natural forces (e.g., hand of man, nature, etc.) has structure (even man-made randomly generated data). This structure can be either real (provides meaningful insights into the behaviors of interest) or spurious (trivial and/or uncorrelated insights). The bigger the data, the more likely the structure can be found.

Data, at its core, can be described in terms of three important characteristics: condition, location, and population. Condition is the state of the data’s readiness for analysis. If one can use it as is, it is “well conditioned.” If the data needs to be preconditioned/transformed prior to analysis, then it is “ill conditioned.” Location is where the data resides, both physically (databases, logs, etc.) and in time (events). Data populations describe how data is grouped around specific qualities and/or characteristics.

Small data represents a random sample of a known population that is not expected to encounter changes in its composition (condition, location, and population) over the targeted time frame. It tends to address specific and well-defined problems through straightforward applications of problem-specific methods. In essence, small data is limited to answering questions about what we know we don’t know (second level of knowledge).

Big data, on the other hand, represents multiple, non-random samples of unknown populations, shifting in composition (condition, location, and population) within the target interval. Analyzing big data often requires complex analyses that deal with post-hoc problem assessments, where straightforward solutions cannot be obtained. This is the realm where one discovers and answers questions in areas where we don’t know what we don’t know (third level of knowledge).

With this as a basis, we can now identify more precise quantitative measures of data size and, more importantly, of the subjects/independent variables needed to lift meaningful observations and learnings from its samples. Data describing simple problems (aka historical debris) are governed by the interaction of small numbers of independent variables or subjects. For example, the distance a car travels can be understood by analyzing two variables over time – initial starting velocity and acceleration. Informative, but not very interesting. The historical debris for complex problems is governed by the interaction of large numbers of independent variables, whose solutions often fall into the realm of non-deterministic polynomials (i.e., an analytical closed-form solution cannot be found). Consider, for example, the unbounded number of factors that influence the behavior of love.

A measure of the amount of knowledge contained in data can therefore be defined through understanding the total possible state space of the system, which is proportional to all the possible ways (combinations and/or permutations) the variables/factors or subjects can interact. The relative knowledge contained within two variables/subjects (A and B), for example, can be assessed by looking at A alone, then B alone, and then A and B, for a total of 3 combinatorial spaces. Three variables/subjects (A, B, and C) give us a knowledge state space of 7. Four subjects result in 15. And so on.
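The counts above can be checked by brute force. The following is a minimal sketch that enumerates every non-empty, order-independent grouping of a small set of subjects; the function name `knowledge_states` is an illustrative assumption, not something defined in the field note.

```python
from itertools import combinations


def knowledge_states(subjects):
    """Return every non-empty grouping of the subjects, ignoring order."""
    return [
        group
        for size in range(1, len(subjects) + 1)
        for group in combinations(subjects, size)
    ]


two = knowledge_states(["A", "B"])
three = knowledge_states(["A", "B", "C"])
four = knowledge_states(["A", "B", "C", "D"])

print(two)                               # [('A',), ('B',), ('A', 'B')]
print(len(two), len(three), len(four))   # 3 7 15
```

Enumeration like this is only practical for a handful of subjects, since the number of groupings roughly doubles with each subject added.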

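An interesting point is that there is a closed-form solution, based on summing up all the possible combinations where the order of knowledge is NOT important, which is:

K(N) = C(N, 1) + C(N, 2) + … + C(N, N) = 2^N - 1, where C(N, k) is the number of ways to choose k of the N subjects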

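and where the order of knowledge is important:

K_ordered(N) = P(N, 1) + P(N, 2) + … + P(N, N), where P(N, k) = N! / (N - k)! is the number of ordered arrangements of k of the N subjects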

A plot of the knowledge space (where order is not important) over the number of variables/subjects looks like:

What this tells us is that as we explore the integration of large variable sets (subjects), our ability to truly define/understand complex issues (behaviors) increases exponentially. Note – Where order of knowledge is important, the asymptotic nature (shape) is the same.
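To get a feel for both growth curves, the two counts can be tabulated side by side. This is a sketch under stated assumptions: the order-independent count is taken as the sum of combinations (which equals 2^N - 1 and matches the 3, 7, 15 sequence in the text), and the order-dependent count is assumed to be the sum of k-permutations, which the text does not spell out. The function names are illustrative.

```python
from math import comb, perm


def unordered_states(n: int) -> int:
    """Sum of C(n, k) for k = 1..n; equals 2^n - 1."""
    return sum(comb(n, k) for k in range(1, n + 1))


def ordered_states(n: int) -> int:
    """Sum of n! / (n - k)! for k = 1..n (assumed order-dependent count)."""
    return sum(perm(n, k) for k in range(1, n + 1))


# Tabulate both counts for small numbers of subjects.
print(f"{'N':>2} {'unordered':>10} {'ordered':>10}")
for n in range(1, 9):
    print(f"{n:>2} {unordered_states(n):>10} {ordered_states(n):>10}")
```

Under these assumptions both counts grow without bound, with the order-dependent count growing faster because it is dominated by the factorial term.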

More importantly, it gives a direct measure of the number of independent subjects that are needed to completely define a knowledge set. Specifically,

Theorem: The independent interaction from 77 variable/subject areas asymptotically defines all knowledge contained within that space. In other words, as we identify, integrate, and analyze subjects across 77 independent data sources, we exponentially increase our likelihood of completely defining the characteristics (behaviors) of the systems contained therein.

Big data, therefore, is defined as:

Definition: “Big Data” represents the historical debris (observable data) resulting from the interaction of between 70 and 77 independent variables/subjects, from which non-random samples of unknown populations, shifting in composition within a targeted time frame, can be taken.

Definition: “Knowledge Singularity” is the maximum theoretical number of independent variables/subjects that, if combined and/or permutated, would represent a complete body of knowledge.

It is in the aggregation of the possible 70-77 independent subject areas (patients, doctors, donors, activists, buyers, good guys, bad guys, shipping, receiving, etc.) from internal and external data sources (logs, tweets, Facebook, LinkedIn, blogs, databases, data marts, data warehouses, etc.) that the initial challenge resides, for this is the realm of Data Lakes. And this is yet another story.

Lots of stuff, some of it interesting I hope, and more to come later. But this is enough as a field note for now.

## Refactoring Insurance/Reinsurance Catastrophe Modeling using Big Data

The Catastrophe Modeling ecosystem, used in insurance and reinsurance, is a good example of the types of traditional computational platforms that are undergoing an assault from the exponential changes seen in data. Not only are commercially available simulation and modeling tools incapable of closing the forecasting capability gaps in the near future, but most organizations are not addressing the needed changes in the human factor (data scientists and functional behavioral analysts). The net effect for those insurance/reinsurance companies that rely on these old school techniques is 1) reduced accuracy in understanding the physical effects of catastrophic events, 2) reduced precision in quantifying the direct and indirect cost of a catastrophe, and 3) increased blind spots for new and emergent catastrophic events, coming from combinations and permutations of existing events, as well as the creation of new ones.

The quadrafication of big data (infrastructure, tools, exploratory methods, and people) is having a positive impact on these kinds of ecosystems. I believe we can use the big data reference architecture as the basis for refactoring the traditional catastrophic simulation, modeling, and financial analysis activities. Using platforms like Pneuron, we can help these companies more effectively map computationally complex MDMI (multi-data multi-instruction) workstreams into disaggregated process maps functioning in a MapReduce format, potentially using some of the existing simulation models. They could get the benefit of their a priori knowledge (models, tools), while dealing with the growth in data sets. Just a few thoughts.

One last note – this is an exercise in science and not engineering, or even systems integration. The practices that make for excellent enterprise architectures, requirements development, or even software engineering are of very little use here (those beyond critical thinking). To solve this problem, one must be willing to fail, fail early, and fail often. It is only through these failures that the true realization of Big Data Cat Modeling capabilities will be found.

## A Few Interesting Ways Big Data Is Being Used

Here are five interesting uses of big data that are happening every day:

1. Google now studies the timing and location of search-engine queries to predict flu outbreaks and unemployment trends before official government statistics come out. Interesting.

2. Credit card companies routinely pore over vast quantities of census, financial and personal information to try to detect fraud and identify consumer purchasing trends. While not new, the big data approach is improving accuracy and precision, as well as speeding up prediction times.

3. Medical researchers sift through the health records of thousands of people to try to identify useful correlations between medical treatments and health outcomes. I wonder if the healthcare insurance industry is taking advantage of this?

4. Companies running social-networking websites conduct “data mining” studies on huge stores of personal information in attempts to identify subtle consumer preferences and craft better marketing strategies. This is a subset of the Target case study.

5. A new class of “geo-location” data is emerging that lets companies analyze mobile device data to make intriguing inferences about people’s lives and the economy. It turns out, for example, that the length of time that consumers are willing to travel to shopping malls—data gathered from tracking the location of people’s cell phones—is an excellent proxy for measuring consumer demand in the economy.

These applications do raise the question of privacy: “When does now-casting – searching through massive amounts of data to predict individual behavior – violate personal privacy?”

## Big Data Landscape – A New Look

Dave Feinleib, a Forbes Technology contributor, released a new Big Data landscape point of view. While not all-encompassing (e.g., it is missing technologies like Pneuron), it is a great start at making the complicated big data landscape understandable.

## Design Thinking for Effective Data Visualization

Noah Iliinsky is one of the best presenters on complex visualization tools and techniques.  One of the classic stories Noah tells is how he used visualization tools to decide what bicycle tires to buy. In this presentation he talks about design thinking for effective data visualization. This presentation is worth the 18 minutes of your time.

Make sure to pay attention to the section on “How to Visualize: Appropriate Encodings,” which occurs at about 8:10.  In this section, Noah covers the properties and best uses of visual encoding. Yes, a bit geeky, but essential if you are a data scientist.

## Data Science: Beyond Intuition – The Movie

Data science is changing the way we look at business, innovation and intuition. It challenges our subconscious decisions, helps us find patterns and empowers us to ask better questions. Hear from thought leaders at the forefront including Growth Science, IBM, Intel, Inside-BigData.com and the National Center for Supercomputing Applications. This video is an excellent source of information for those who have struggled to understand data science and its value.