*Heads Up* – This is a stream of consciousness! Please be patient with me while I incrementally refine it over time. Critical feedback is welcome!

**T**here are several different ways to define when data becomes big data. The two traditional approaches are based on some variant of:

- Big is the sample size of data after which the asymptotic properties of the exploratory data analysis (EDA) methods kick in for valid results

- Big is the gross size of the data under investigation (e.g., size of a database, data mart, data warehouse, etc.).

**W**hile both of these measures tend to provide an adequate means through which one can discuss the sizing issue, they are both correlative and not causal by nature. But before we get into a more precise definition of big, let's look at some characteristics of data.

**R**egardless of what you are told, all data touched or influenced by natural forces (e.g., the hand of man, nature, etc.) has structure (even man-made, randomly generated data). This structure can be either real (provides meaningful insights into the behaviors of interest) or spurious (trivial and/or uncorrelated insights). The bigger the data, the more likely it is that the structure can be found.

**D**ata, at its core, can be described in terms of three important characteristics: condition, location, and population. Condition is the state of the data's readiness for analysis. If one can use it as is, it is “well conditioned.” If the data needs to be preconditioned/transformed prior to analysis, then it is “ill conditioned.” Location is where the data resides, both physically (databases, logs, etc.) and in time (events). Data populations describe how data is grouped around specific qualities and/or characteristics.

**S**mall data represents a random sample of a known population that is not expected to encounter changes in its composition (condition, location, and population) over the targeted time frame. It tends to address specific and well-defined problems through straightforward applications of problem-specific methods. In essence, small data is limited to answering questions about what we know we don’t know (second level of knowledge).

**B**ig data, on the other hand, represents multiple, non-random samples of unknown populations, shifting in composition (condition, location, and population) within the target interval. Analyzing big data often requires complex analyses that deal with post-hoc problem assessments, where straightforward solutions cannot be obtained. This is the realm where one discovers and answers questions in areas where we don’t know what we don’t know (third level of knowledge).

**W**ith this as a basis, we can now identify more precise quantitative measures of data size and, more importantly, of the number of subjects/independent variables needed to lift meaningful observations and learnings from its samples. Data describing simple problems (aka *historical debris*) are governed by the interaction of small numbers of independent variables or subjects. For example, the distance a car travels can be understood by analyzing two variables over time – initial velocity and acceleration. Informative, but not very interesting. The historical debris for complex problems is governed by the interaction of large numbers of independent variables, whose solutions often fall into the realm of non-deterministic polynomial (NP) problems (i.e., an analytical closed-form solution cannot be found). Consider, for example, the unbounded number of factors that influence the behavior of love.

**A** measure of the amount of knowledge contained in data can therefore be defined through understanding the total possible state space of the system, which is proportional to all the possible ways (combinations and/or permutations) the variables/factors or subjects can interact. The relative knowledge contained within two variables/subjects (A and B), for example, can be assessed by looking at A alone, then B alone, and then A and B together, for a total of 3 combinatorial spaces. Three variables/subjects (A, B, and C) give us a knowledge state space of 7. Four subjects result in 15. And so on.
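The counts of 3, 7, and 15 quoted above can be checked by brute force. The following is only a minimal sketch: the function name `knowledge_spaces` is a hypothetical label chosen for illustration, and it assumes that a "combinatorial space" means any non-empty, order-independent grouping of the subjects.

```python
from itertools import combinations


def knowledge_spaces(subjects):
    """Return every non-empty, order-independent grouping of the subjects.

    Assumption: one "combinatorial space" is one non-empty subset.
    """
    return [
        combo
        for size in range(1, len(subjects) + 1)
        for combo in combinations(subjects, size)
    ]


if __name__ == "__main__":
    for names in (["A", "B"], ["A", "B", "C"], ["A", "B", "C", "D"]):
        spaces = knowledge_spaces(names)
        print(len(names), "subjects ->", len(spaces), "combinatorial spaces")
```

Enumerating the groupings explicitly only works for small numbers of subjects, since the list roughly doubles in length with every subject added.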

**A**n interesting point is that there is a closed-form solution, based on summing up all the possible combinations where the order of knowledge is NOT important, which is:

$$\sum_{i=1}^{n} \binom{n}{i} = \sum_{i=1}^{n} \frac{n!}{i!\,(n-i)!}$$

and where the order of knowledge is important:

$$\sum_{i=1}^{n} \frac{n!}{(n-i)!}$$

**A** plot of the knowledge space (where order is not important) over the number of variables/subjects looks like:

**W**hat this tells us is that as we explore the integration of large variable sets (subjects), our ability to truly define/understand complex issues (behaviors) increases exponentially. Note – Where the order of knowledge is important, the asymptotic nature (shape) is the same.
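To get a feel for how quickly the knowledge space grows, the two counts can be tabulated directly. This is a rough sketch under stated assumptions: the function names `unordered_count` and `ordered_count` are hypothetical, the order-independent count is taken to be the sum of binomial coefficients for group sizes 1 through n, and the order-dependent count is assumed to be the analogous sum of k-permutations n!/(n-i)!.

```python
from math import comb, perm  # both available in Python 3.8+


def unordered_count(n):
    """Non-empty groupings of n subjects where order does not matter."""
    return sum(comb(n, i) for i in range(1, n + 1))


def ordered_count(n):
    """Non-empty arrangements of n subjects where order matters.

    Assumption: this sums the k-permutations n! / (n - i)! for i = 1..n.
    """
    return sum(perm(n, i) for i in range(1, n + 1))


if __name__ == "__main__":
    print("n, order-independent, order-dependent")
    for n in (2, 3, 4, 10, 77):
        print(n, unordered_count(n), ordered_count(n))
```

Python integers are arbitrary precision, so the values for 77 subjects are exact rather than floating-point approximations. The order-independent sum equals 2^n - 1, so it doubles (plus one) with each additional subject.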

More importantly, it gives a direct measure of the number of independent subjects that are needed to completely define a knowledge set. Specifically,

Theorem: The independent interaction of 77 variable/subject areas asymptotically defines all knowledge contained within that space. In other words, as we identify, integrate, and analyze subjects across 75 independent data sources, we exponentially increase our likelihood of completely defining the characteristics (behaviors) of the systems contained therein.

**B**ig data, therefore, is defined as:

Definition: “Big Data” represents the historical debris (observable data) resulting from the interaction of between 70 and 77 independent variables/subjects, from which non-random samples of unknown populations, shifting in composition within a targeted time frame, can be taken.

Definition: “Knowledge Singularity” is the maximum theoretical number of independent variables/subjects that, if combined and/or permuted, would represent a complete body of knowledge.

**I**t is in the aggregation of the possible 70-77 independent subject areas (patients, doctors, donors, activists, buyers, good guys, bad guys, shipping, receiving, etc.) from internal and external data sources (logs, tweets, Facebook, LinkedIn, blogs, databases, data marts, data warehouses, etc.) that the initial challenge resides, for this is the realm of *Data Lakes*. And this is yet another story.

**L**ots of stuff, some interesting I hope, and more to come later. But this is enough as a field note for now.

Categories: Big Data, Data Science

Dr. Smith, I believe that the Mathematica plotting capability may have failed you here. Your formula for summing up all the possible combinations where the order of knowledge is NOT important can be simplified to 2^n -1 (if the subscript i ranges from 0 to n instead of 1 to n, your formula would be simply 2^n). This formula (2^n) has no asymptotic limit. A log plot demonstrates this clearly.

David… Thanks for looking at the plot and for the feedback. I will take a look at your analysis over the next couple of days. Dr. J

Great article. Just one question though: if we consider a data matrix of size m x n, the article treats the dimension n as the (70-77) independent variables. What about m? How many data rows should we take as constituting a Big Data set? A 3 x 70 data set can’t be considered equal to, say, a 300,000 x 70 data set. When does m become large? Surely we must consider both m and n?

The analysis is more about the possible insights gained through the interactions of independent subject areas. In this case, the m x n matrix (possibly a data frame) would be considered one subject area. Analysis on that data frame (df1) will yield some finite amount of insights, say Idf_1. The combinatorial math demonstrates that infinite knowledge can be gained through the analysis of large numbers (70-80) of independent data sets of limited knowledge.

To me this is less about some theoretical proof and more about how I look at the value of data. Many data scientists constrain their investigations to coherent, related constructs. I find that looking for insights in larger numbers of multiple independent data sets gives me the best chance at finding that “interesting” result, which in itself leads to value (data monetization).