Mahout: Machine Learning For Enterprise Data Science

Machine Learning

The success of companies in monetizing their information depends on how efficiently they can uncover insights in their data sources. While Enterprise Data Science (EDS) is one of the methodologies needed to achieve this goal organically and systematically, it is but one of many such frameworks.

Machine Learning, a subdomain of artificial intelligence and a branch of statistical learning, is one such computational methodology, composed of techniques and algorithms that enable computing devices to improve their recommendations based on the effectiveness of previous experiences (that is, to learn). Machine learning is related to (and often confused with) data mining, and relies on techniques from statistics, probability, numerical analysis, and pattern recognition.

There is a wide variety of machine learning tasks, successful applications, and implementation frameworks. Mahout, one of the more popular frameworks, is an open source project built on Apache Hadoop. Mahout can currently be used for:

  • Collaborative filtering (Recommendation systems – user based, item based)
  • Clustering
  • Classification
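Mahout itself exposes these algorithms through a Java API, but the idea behind user-based collaborative filtering can be sketched in a few lines of Python. This is an illustrative toy (the users, items, and ratings are invented), not Mahout's actual API:

```python
import math

# Toy ratings matrix: user -> {item: rating}. All names are illustrative.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 5},
    "carol": {"B": 5, "C": 2, "D": 4},
}

def cosine_sim(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user):
    """Score items the user has not rated, weighted by neighbor similarity."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = cosine_sim(user, other)
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # → ['D'], the one item alice's neighbors rated that she hasn't
```

Mahout's Hadoop-based implementations apply the same neighborhood logic, but distributed across a cluster so the ratings matrix can scale far beyond memory.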

Varad Meru created and is sharing this introductory Mahout presentation, an excellent source of basic information as well as implementation details.

Neuroscience, Big Data, and Data Science Are Impacting Big Ideas In The Creative World of Advertising

Moxie Group’s Creative Director Tina Chadwick makes the case that real-time data analytics “brings us tangible facts on how consumers actually react to almost anything.” She argues that the notion that 10 people in a room, who volunteered to be there because they got paid and fed, could truly represent consumer behaviors (psychographics) is a thing of the past. Sadly though, for many advertising companies, this is still the mainstay of their advertising-oriented evaluative methodology.

New capabilities based on neuroscience, integrating machine learning with human intuition, and data science/big data are leading to new creative processes, which many call NeuroMarketing: the direct measurement of consumer thoughts about advertising through neuroscience. The persuasive effects of an advertising campaign (psychographic response) are contingent upon the emotional alignment of the viewer (targeted demographic); that is, the campaign’s buying call to action has a higher likelihood of succeeding when the viewer has a positive emotional response to the material. Through neuroscience, we can now directly measure emotional alignment without inducing a Hawthorne Effect.

This is a new field of marketing research, founded in neuroscience, that studies consumers’ sensorimotor, cognitive, and affective responses to marketing stimuli. It explores how a consumer’s brain responds to ads (broadcast, print, digital) and measures how well and how often media engages the areas for attention, emotion, memory, and personal meaning – measures of emotional response. From data science-driven analyses, we can determine:

  • The effectiveness of the ad to cause a marketing call to action (e.g., buy product, inform, etc.)
  • Components of the ad that are most/least effective (Ad Component Analysis) – identifying what elements make an ad great or not so great.
  • Effectiveness of a transcreation process (language and culture migration) used to create advertising in different culturally centric markets.

One of the best and most entertaining case studies I have seen for NeuroMarketing was done by Neuro-Insight, a leader in the application of neuroscience for marketing and advertising. Top Gear used their technology to evaluate which cars attract women to which type of men. The results are pretty amazing.

While NeuroMarketing is an emergent field for advertising creation and evaluation, the fundamentals of neuroscience and data science make it an essential transformational capability. This new field offers significant transformational opportunities within the advertising industry – it allows an above-average firm to become a great firm through the application of incremental, quantitative neuroscience. For any advertising agency looking to leapfrog older, less agile companies still anchored in the practices of the 70s, neuromarketing might be worth looking into.

Big Data: Conventional Definitions and Some Statistics (big numbers for big data)

Definition: “Extremely scalable analytics – analyzing petabytes of structured and unstructured data at high velocity.”

Definition: “Big data is data that exceeds the processing capacity of conventional data base systems.”

Big Data has three characteristics:

Variety – Structured and unstructured data

Velocity – Time sensitive data that should be used simultaneously with its enterprise data counterparts, in order to maximize value

Volume – Size of data exceeds the nominal storage capacity of the enterprise.


– In 2011, the global output of data was estimated to be 1.8 zettabytes (10^21 bytes)

– 90% of the world’s data has been created in the last 2 years.

– We create 2.5 quintillion (10^18) bytes of data per day (from sensors, social media posts, digital pictures, etc.)

– The digital world will increase in capacity 44-fold between 2009 and 2020.

– Only 5% of data is being created in structured forms; the other 95% is largely unstructured.

– 80% of the effort involved in dealing with unstructured data is reconditioning ill-formed data to well-formed data (cleaning it up).

Performance Statistics (I will start tracking more closely):

– Traditional data storage costs approximately $5/GB, but storing the same data using Hadoop costs only $0.25/GB – yep, 25 cents/GB. Hmm!

– Facebook stores more than 20 petabytes of data across 23,000 cores, with 50 terabytes of raw data being generated per day.

– eBay uses over 2,600 clustered Hadoop servers.

Big Data Driven By Bigger Numbers

As most of us know, there is tremendous growth in global data. There are trillions of transactions occurring daily, ranging from operations to sales to marketing to buying. Every human activity is supported by over 100 business processes, all contributing to this exponential data growth.

The McKinsey Global Institute report “Big Data: The next frontier for innovation, competition, and productivity” discusses this new (?) or at least emergent world. As part of their business case, they cite some very interesting statistics. Here are just a few:

– $600 to buy a disk that can store all the world’s music

– 5 billion mobile phones in use in 2010

– 30 billion pieces of content shared on Facebook every month

– 40% projected growth in global data generated per year versus 5% growth in global IT spending

– 235 terabytes of data collected by the US Library of Congress by April 2011

– 15 out of 17 sectors in the United States have more data stored per company than the US Library of Congress

– $300 billion potential annual value to US health care – more than double the total annual health care spending in Spain

– $600 billion potential annual consumer surplus from using personal location data globally

– 60% potential increase in retailers’ operating margins possible with big data

– 140,000 to 190,000 more deep analytical talent positions

– 1.5 million more data savvy managers needed to take full advantage of big data in the United States

Cybernetic Historical Debris Fields – Big Data’s Proof of Life

How did Robert Ballard find the Titanic? Most people think it was by looking for it. Well, most people would be wrong. Ballard believed he could rediscover the Titanic by looking for the debris field created when the ship sank. With the Titanic only being around 900 feet long, he hypothesized that ship parts would be spread out much wider the farther one was from the ship, narrowing like a funnel the closer one got. In essence, this much larger historical debris field would point the way to the much smaller artifact of interest – the Titanic.

Every physical object leaves some trace of its interaction with the real world over time – everything. Whether it is the Titanic plunging to her death in the depths of the Atlantic Ocean or a lonely rock sitting in the middle of a dry desert lake bed. Everything leaves a trace; everything has a Historical Debris Field (HDF). Formally,


Definition: Historical Debris Field (HDF) is any time dependent perturbation of an object and its environment.

One of the key points is that it is an observation over time, not just a point in time. HDFs are about capturing the absolute historical changes in the environment in order to make relative projections about some object in the future.

As it turns out, just as physical real-world objects leave historical debris fields, so does data through its virtual interactions in cyberspace. Data, by definition, is merely a representative abstraction of a concept or real-world object, and is a direct artifact of some computational process. At some level, every known relevant piece of electronic information (these words, your digital photos, a YouTube video) boils down to a series of zeros (0) and ones (1), strung together by a complex series of implicit and tacit interacting algorithms. These algorithms are, in essence, the natural, often unseen forces that govern the historical debris seen in real-world objects. So, the HDF for cyberspace might be defined as,

Definition: Cybernetic Historical Debris Field (CHDF) is any time dependent perturbation of data and its information environment (information being relevant data).

Why is this lengthy definitional exposé important? Because big data represents the Atlantic Ocean in which a company is looking for opportunities. And like Robert Ballard’s search for the Titanic, one cannot merely set out looking for a piece of insight or knowledge itself in the vastness of all that internal/external, structured/unstructured data; one needs to look for the Cybernetic Historical Debris Fields that point to the electronic treasure. But what kind of new “virtual sonar” systems can we employ to help us?

While I will explore this concept more over time, let me suggest that the “new” new in the field of data mining will be in coupling data scientists (DS) with behavioral analysts (BA). Data changes because, at its core, some human initiated a change (causal antecedent). It is through a better understanding of human behavior (patterns) that we will have the best chance of monetizing the vastness of big data. Charles Duhigg, author of “The Power Of Habit: Why We Do What We Do in Life and Business,” shows that by understanding human nature (aka our historical debris field) we can accurately predict a behavioral-based future.


For example, Duhigg shows how Target tries to hook future parents at the crucial moment before they turn into loyal buyers of pre/post-natal products (e.g., supplements, diapers, strollers, etc.). Target, specifically Andrew Pole, determined that while lots of people buy lotion, women on baby registries were buying larger quantities of unscented lotion. Also, women at about twenty weeks into pregnancy would start loading up on supplements like calcium, magnesium, and zinc. This CHDF led the Target team to one of the first behavioral patterns (a virtual Titanic sonar pattern) that could discriminate (point to) pregnant from non-pregnant women. Not impressed? Well, this type of thinking contributed to Target’s $23 billion revenue growth from 2002 to 2010.
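The pattern-matching idea behind this story can be sketched as a simple weighted-signal score over a purchase history. Everything here – the products, weights, and threshold – is invented for illustration; it is not Target's actual model:

```python
# Hypothetical weights for purchase signals; a basket whose combined score
# crosses the threshold matches the behavioral pattern. All values invented.
SIGNAL_WEIGHTS = {
    "unscented_lotion": 0.35,
    "calcium_supplement": 0.25,
    "magnesium_supplement": 0.20,
    "zinc_supplement": 0.20,
    "scented_candle": 0.0,  # a neutral, non-discriminating signal
}
THRESHOLD = 0.6

def pregnancy_score(basket):
    """Sum the weights of the known signals present in a purchase history."""
    return sum(SIGNAL_WEIGHTS.get(item, 0.0) for item in set(basket))

basket = ["unscented_lotion", "calcium_supplement", "zinc_supplement"]
print(pregnancy_score(basket) >= THRESHOLD)  # → True: this basket matches
```

In practice the weights would be learned from historical data (e.g., baby-registry purchase histories) rather than hand-set, but the shape of the idea is the same: a handful of individually mundane signals combine into a discriminating pattern.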

The net of all this is that data can be monetized by systematically searching for relevant patterns (cybernetic historical debris fields) in big data based on human patterns of behavior. There are patterns in everything and just because we don’t see them it doesn’t mean they don’t exist. Through data science and behavioral analysis (AKA Big Data), one can reveal the behavioral past in order to monetize the future.

FIELD NOTE: What Makes Big Data Big – Some Mathematics Behind Its Quantification

Heads Up – This is a stream of consciousness! Please be patient with me while I incrementally refine it over time. Critical feedback is welcome!

There are several different ways to define when data becomes big data. The two traditional approaches are based on some variant of:

— Big is the sample size of data after which the asymptotic properties of the exploratory data analysis (EDA) methods kick in for valid results

— Big is the gross size of the data under investigation (e.g., size of a database, data mart, data warehouse, etc.).

While both of these measures tend to provide an adequate means through which one can discuss the sizing issue, they are both correlative and not causal in nature. But before getting into a more precise definition of big, let’s look at some characteristics of data.

Regardless of what you are told, all data touched or influenced by natural forces (e.g., the hand of man, nature, etc.) has structure (even man-made, randomly generated data). This structure can be either real (providing meaningful insights into the behaviors of interest) or spurious (trivial and/or uncorrelated insights). The bigger the data, the more likely the structure can be found.

Data, at its core, can be described in terms of three important characteristics: condition, location, and population. Condition is the state of the data’s readiness for analysis. If one can use it as is, it is “well conditioned.” If the data needs to be preconditioned/transformed prior to analysis, then it is “ill conditioned.” Location is where the data resides, both physically (databases, logs, etc.) and in time (events). Data populations describe how data is grouped around specific qualities and/or characteristics.

Small data represents a random sample of a known population that is not expected to encounter changes in its composition (condition, location, and population) over the targeted time frame. It tends to address specific and well-defined problems through straightforward applications of problem-specific methods. In essence, small data is limited to answering questions about what we know we don’t know (the second level of knowledge).


Big data, on the other hand, represents multiple, non-random samples of unknown populations, shifting in composition (condition, location, and population) within the target interval. Analyzing big data often requires complex analyses that deal with post-hoc problem assessments, where straightforward solutions cannot be obtained. This is the realm where one discovers and answers questions in areas where we don’t know what we don’t know (the third level of knowledge).

With this as a basis, we can now identify more precise quantitative measures of data size and, more importantly, of the subjects/independent variables needed to lift meaningful observations and learnings from its samples. Data describing simple problems (aka historical debris) is governed by the interaction of small numbers of independent variables or subjects. For example, the distance a car travels can be understood by analyzing two variables over time – initial starting velocity and acceleration. Informative, but not very interesting. The historical debris for complex problems is governed by the interaction of large numbers of independent variables, whose solutions often fall into the realm of non-deterministic polynomials (i.e., an analytical closed-form solution cannot be found). Consider, for example, the unbounded number of factors that influence the behavior of love.
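The car example above is the kind of small, closed-form problem that needs only two independent variables; in code it is one line of kinematics:

```python
# Distance traveled under constant acceleration: d(t) = v0*t + 0.5*a*t**2,
# governed entirely by two variables, initial velocity v0 and acceleration a.
def distance(v0, a, t):
    """Distance traveled after time t (same units assumed throughout)."""
    return v0 * t + 0.5 * a * t ** 2

print(distance(10.0, 2.0, 5.0))  # → 75.0 (50 from velocity + 25 from acceleration)
```

Contrast this with a complex behavioral problem: no such closed-form function exists, which is exactly the distinction the paragraph above draws.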

A measure of the amount of knowledge contained in data can therefore be defined through understanding the total possible state space of the system, which is proportional to all the possible ways (combinations and/or permutations) the variables/factors or subjects can interact. The relative knowledge contained within two variables/subjects (A and B), for example, can be assessed by looking at A alone, then B alone, and then A and B together, for a total of 3 combinatorial spaces. Three variables/subjects (A, B, and C) gives us a knowledge state space of 7. Four subjects results in 15. And so on.

An interesting point is that there is a closed-form solution, based on summing up all the possible combinations where the order of knowledge is NOT important:

K(n) = C(n,1) + C(n,2) + … + C(n,n) = 2^n − 1

and where the order of knowledge is important:

K(n) = P(n,1) + P(n,2) + … + P(n,n), where P(n,k) = n!/(n−k)!
A plot of the knowledge space (where order is not important) over the number of variables/subjects shows a curve that roughly doubles with each added variable/subject – exponential growth.
What this tells us is that as we explore the integration of large variable sets (subjects), our ability to truly define/understand complex issues (behaviors) increases exponentially. Note – where the order of knowledge is important, the asymptotic nature (shape) of the curve is the same.
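The knowledge state space described above is just the count of non-empty subsets (or ordered selections) of n variables, which a few lines of Python can check against the 3, 7, 15 progression:

```python
from math import comb, perm

def knowledge_space_unordered(n):
    """Non-empty subsets of n variables: sum of C(n,k), which equals 2**n - 1."""
    return sum(comb(n, k) for k in range(1, n + 1))

def knowledge_space_ordered(n):
    """Non-empty ordered selections: sum of P(n,k) = n!/(n-k)! for k = 1..n."""
    return sum(perm(n, k) for k in range(1, n + 1))

for n in (2, 3, 4):
    print(n, knowledge_space_unordered(n), knowledge_space_ordered(n))
# unordered column prints 3, 7, 15 – matching the counts in the text
```

Either way the state space explodes exponentially in n, which is the quantitative point behind "big": the knowledge potentially latent in the data grows far faster than the variable count itself.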

More importantly, it gives a direct measure of the number of independent subjects needed to completely define a knowledge set. Specifically,

Theorem: The independent interaction of 77 variable/subject areas asymptotically defines all knowledge contained within that space. In other words, as we identify, integrate, and analyze subjects across 75 independent data sources, we exponentially increase our likelihood of completely defining the characteristics (behaviors) of the systems contained therein.

Big data, therefore, is defined as:

Definition: “Big Data” represents the historical debris (observable data) resulting from the interaction of between 70 and 77 independent variables/subjects, from which non-random samples of unknown populations, shifting in composition within a targeted time frame, can be taken.

Definition: “Knowledge Singularity” is the maximum theoretical number of independent variables/subjects that, if combined and/or permutated, would represent a complete body of knowledge.

It is in the aggregation of the possible 70-77 independent subject areas (patients, doctors, donors, activists, buyers, good guys, bad guys, shipping, receiving, etc.) from internal and external data sources (logs, tweets, Facebook, LinkedIn, blogs, databases, data marts, data warehouses, etc.) that the initial challenge resides, for this is the realm of Data Lakes. And that is yet another story.

Lots of stuff – some interesting, I hope – and more to come later. But this is enough as a field note for now.

Big Data – Exploring the Darkest Places on Earth

A lot of folks know that I have done some interesting things in my life. From flying jets off aircraft carriers to running nuclear reactors on submarines to building artificial brains in boxes in laboratories. All fun stuff, none of which I could have ever done without the help and support of very talented friends and teammates. So it doesn’t surprise me too much when someone asks me to talk about some of the scariest places I have been. My response often confounds the asker in ways that can be profound. What is my response, you ask?

I have only one reply: “The scariest place I have ever been is THE darkest place on earth. It is a place of neither happiness nor misery. Nor is it a place of rich or poor, or even right or wrong. This is a place where nothing exists but the absence of nothing. It is that place you find yourself in when you begin to realize just how much you don’t know you don’t know.”

Usually at about this point I see in their face the “oh man, I shouldn’t have asked” kind of look. You know the one I am talking about. It is the one people give when they realize they want to be some place else, but need to stay and listen out of respect. That’s the look.

Being a nice guy, I usually let them off the “listening hook” by saying they can always stop by later to talk about this in more detail if they want. But, before they go, I add that “my death opened a door to this dark place just once, and that is a story worth telling.” After that, their faces change from “I need to get out” to “I want to know more.” From there we talk about the philosophy of death’s role in helping us all better understand our true limits.

So, why all this pontification around death, not knowing what we don’t know, life, the universe, and everything else? Because we are on the technological brink of being capable of systematically exploring this deep, dark space in ways that were never possible just a few years ago. We are capable of discovering endless new opportunities just by looking through all that data (Facebook, Twitter, CRM, etc.) we take for granted every day. This is the world of Big Data, and these profound outcomes are why this new capability is important to us all.

Ok, enough philosophy, for now. Back to the practical, technical, and business aspect of big data 🙂