FIELD NOTE: Your Math is All Wrong…

NewImageAt the request of a friend, I recently reviewed the article “Your Math Is All Wrong: Flipping The 80/20 Rule For Analytics” by John Thuma. It is a good article, but incomplete and bit misguided. Thuma argues that we are spending too much time in preparing data (prepping) and not enough time analyzing it.  The illusion of the article is that he will “reveal” the magic needed to solve this problem at the end. Spoiler… he does not.

NewImageI agree with the premise; that is, a disproportionate amount of time is spent in data prepping (80%), but the author does not provide any insights into how to reduce it (the flip from 80% to 20%). Study after study has show this to be the case, so it is worthless to argue a statistical point. But towards the end of the article, he states that, “Flipping the rule will mean more data-driven decisions.” Ok, I get it. But please explain how?

Well, the cheap “naive” way would be to just start spending more time with the analytics process itself. That is, once the prep process is complete, just spend 16x more effort with analytics (do the math). This would give you the 20% prep and 80% analytics the author wants to achieve. Cheep trick, but that is statistics. But even that is not the issue. The real issue isn’t moving from 80% to 20%.

The real is challenge is understanding exactly what “value” means in the data science process and understanding a systematic way to achieve it. In the end, if I have to spend 80% of time preparing and 20% analyzing in order to discover “how” to grown a business in a profitable way, who cares what the ratio is. Real value comes for focusing on the questions; from what (descriptive), to why (diagnostic), to when (predictive), and finally how (prescriptive). In doing so, a chain is created with each stage linking value (AKA a value chain). Ok, but how do you do this?

2015 10 09 09 19 48

Addressing that question (my reveal) is beyond the scope of this article. I would suggest one start by looking at a few article in Data Scientist Insights blog. There are several articles that deal exactly on this point. After that, write me (@InsightDataSci) and we can talk.


Big Data: Conventional Definitions and Some Statistics (big numbers for big data)

NewImageDefinition: “Extremely scalable analytics – analyzing petabytes of structured and unstructured data at high velocity.”

Definition: “Big data is data that exceeds the processing capacity of conventional data base systems.”

Big Data has three characteristics:

Variety – Structured and unstructured data

Velocity – Time sensitive data that should be used simultaneously with its enterprise data counterparts, in order to maximize value

Volume – Size of data exceeds the nominal storage capacity of the enterprise.


– In 2011, the global output of data was estimated to be 1.8 zettabytes (10^21 bytes)

– 90% of the world data has been created in the last 2 years.

– We create 2.5 quintillion (10^18) bytes of data per day (from sensors, social media posts, digital pictures, etc.)

– The digital world will increase in capacity 44 folds between 2009 and 2020.

– Only 5% of data is being created in structured forms, 95% is largely unstructured.

– 80% of the effort involved in dealing with unstructured data is reconditioning ill-formed data to well-formed data (cleaning it up).

Performance Statistics (I will start tracking more closely):

– Traditional data storage costs approximately $5/GB, but storing the same data using Hadoop only cost $0.25/GB – yep 25cents/GB. Hum!

– FaceBook stores more than 20Petabytes of data across 23,000 cores, with 50Terabytes of raw data being generated per day.

– eBay uses over 2,600 clustered Hadoop servers.