## FIELD NOTE: Your Math is All Wrong…

At the request of a friend, I recently reviewed the article “Your Math Is All Wrong: Flipping The 80/20 Rule For Analytics” by John Thuma. It is a good article, but incomplete and a bit misguided. Thuma argues that we are spending too much time preparing data (prepping) and not enough time analyzing it. The article creates the illusion that he will “reveal” the magic needed to solve this problem at the end. Spoiler… he does not.

I agree with the premise; that is, a disproportionate amount of time is spent prepping data (80%), but the author does not provide any insight into how to reduce it (the flip from 80% to 20%). Study after study has shown this to be the case, so it is pointless to argue the statistic. But towards the end of the article, he states that, “Flipping the rule will mean more data-driven decisions.” OK, I get it. But please explain how.

Well, the cheap “naive” way would be to simply start spending more time on the analytics process itself. That is, once the prep process is complete, just spend 16x more effort on analytics (do the math). This would give you the 20% prep and 80% analytics the author wants to achieve. Cheap trick, but that is statistics. But even that is not the issue. The real issue isn’t moving from 80% to 20%.
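For those who want to check the “do the math” claim, the 16x figure falls straight out of the arithmetic: hold prep fixed and grow analytics until prep shrinks to 20% of the total. A quick sketch (the 80/20 hour split is just an illustrative unit choice):

```python
# Start with the classic split: 80 units of prep, 20 units of analytics.
prep = 80.0
analytics = 20.0

# For prep to become only 20% of the total while staying at 80 units,
# the total must be prep / 0.20 = 400 units...
target_total = prep / 0.20

# ...which means analytics must grow from 20 to 320 units: a 16x increase.
new_analytics = target_total - prep
multiplier = new_analytics / analytics

print(multiplier)                # 16x more analytics effort
print(prep / target_total)       # prep is now 20% of the total
```

The trick, of course, is that nothing about the prep work improved; you simply diluted it with more analytics hours.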

The real challenge is understanding exactly what “value” means in the data science process and finding a systematic way to achieve it. In the end, if I have to spend 80% of my time preparing and 20% analyzing in order to discover “how” to grow a business in a profitable way, who cares what the ratio is? Real value comes from focusing on the questions; from what (descriptive), to why (diagnostic), to when (predictive), and finally how (prescriptive). In doing so, a chain is created with each stage linking value (AKA a value chain). OK, but how do you do this?

Addressing that question (my reveal) is beyond the scope of this article. I would suggest one start by looking at a few articles on the Data Scientist Insights blog. There are several articles that deal with exactly this point. After that, write me (@InsightDataSci) and we can talk.


## Mahout: Machine Learning For Enterprise Data Science

A company’s success in effectively monetizing its information depends on how efficiently it can identify revelations in its data sources. While Enterprise Data Science (EDS) is one of the methodologies needed to organically and systematically achieve this goal, it is but one of many such frameworks.

Machine Learning, a subdomain of artificial intelligence and a branch of statistical learning, is one such computational methodology, composed of techniques and algorithms that enable computing devices to improve their recommendations based on the effectiveness of previous experiences (that is, to learn). Machine learning is related to (and often confused with) data mining, and relies on techniques from statistics, probability, numerical analysis, and pattern recognition.

There is a wide variety of machine learning tasks, successful applications, and implementation frameworks. Mahout, one of the more popular frameworks, is an open source project based on Apache Hadoop. Mahout can currently be used for:

• Collaborative filtering (Recommendation systems – user based, item based)
• Clustering
• Classification
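To make the first of these concrete, here is a minimal sketch of user-based collaborative filtering using cosine similarity. This is a conceptual illustration in Python, not the Mahout API (Mahout itself is a Java/Hadoop framework), and the toy ratings data is entirely hypothetical:

```python
from math import sqrt

# Toy user -> {item: rating} data, purely for illustration.
ratings = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob":   {"a": 4.0, "b": 3.0, "d": 5.0},
    "carol": {"b": 2.0, "c": 5.0, "d": 1.0},
}

def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    nu = sqrt(sum(u[i] ** 2 for i in shared))
    nv = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (nu * nv)

def recommend(user):
    """Score unrated items as similarity-weighted averages of
    other users' ratings, highest score first."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, r in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
                weights[item] = weights.get(item, 0.0) + sim
    return sorted(((i, scores[i] / weights[i]) for i in scores),
                  key=lambda x: -x[1])

print(recommend("alice"))  # item "d", scored from bob's and carol's ratings
```

Mahout implements this same idea (plus item-based variants) at Hadoop scale, where the similarity computations are distributed across the cluster.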

Varad Meru created and is sharing an introductory Mahout presentation that is an excellent source of basic information, as well as implementation details.

## Data Valuation: Seven Laws of Data Science

I was re-reading the paper “Measuring The Value Of Information: An Asset Valuation Approach,” by Moody and Walsh (European Conference on Information Systems, 1999) when I realized just how powerful their approach was to data valuation. This is seminal research in the field of information theory and should be required reading for data scientists. Moody and Walsh recognized early on that information is the most valuable asset an organization has and that it is important to quantify this value through a formal methodology. While the paper falls short of defining a practical approach, the overall framework can be used as a basis for implementing a repeatable enterprise data valuation methodology.

The reason for this blog post, however, is my desire to recast Moody and Walsh’s Seven Laws of Information. While they do not explicitly define information and how it is different from data, we can use the DIKW Pyramid to recast a few of the laws more towards the field of data science. That is, the world is full of data, information is the relevant data, studying information gives knowledge, and reflecting on knowledge leads to wisdom. So, if we deconstruct the information laws and rethink their data equivalents, one might find these Seven Laws of Data Science as the result:

Law One: Data has value only if it is studied. Intrinsically, data does not generate residual value through its mere presence. Revelations can only be found through the exploration and study of data.

Law Two: The value of data increases with its use. As data is explored, combined with other data, and explored again, additional value is generated.

Law Three: Data cannot be depleted through its use. Data is not a physical commodity subject to the physical laws of entropy and degradation. As such, data is infinitely reusable, and the exploratory process will produce more data than was originally evaluated.

Law Four: Causal data is more valuable than correlative data. While correlative principles are very useful in some operational circumstances, to forecast the future one needs to truly understand causality within the system. Or, as someone more important than I has stated, “Felix, qui potuit rerum cognoscere causas.” Translated: “Fortunate is he who was able to know the causes of things.”

Law Five: The value of combined independent data is greater than the combined value of each data set alone. This is equivalent to saying the whole is usually greater than the sum of its parts. That is, one plus one is greater than two.

Law Six: The value of data is perishable, even though the data itself is not. The insights derived from the study of data have a limited value time horizon.

Law Seven: More data does not necessarily lead to more value. Studies have shown that more data does not necessarily increase the accuracy of our predictions, just our confidence in those predictions.

So this is a first cut at the Laws of Data Science. What is missing, needs to be rethought, or should even be deleted? Let me know.

## Objective-Based Data Monetization: An Enterprise Approach to Data Science (EDS)

Across all industries, companies are looking to Data Science for ways to grow revenue, improve margins, and increase market share. In doing so, many are at a tipping point for where and how to realize these value improvement objectives.

Those that see limited opportunities to grow through their traditional application and services portfolios may already be well underway in this data science transformation phase. For those that don’t see the need to find real value in their data and information assets (Data Monetization), it may be a competitively unavoidable risk that jeopardizes the business’s viability and solvency.

Either way, increasing the valuation of a company or business line through the conversion of its data and information assets into actionable outcome-oriented business insights is the single most important capability that will drive business transformation over the next decade.

Data and information have become the single most important assets needed to fuel today’s transformational growth. Most organizations have seen revenue and margin growth plateau for organic products and services (those based on people, process, and technologies). The next generation of corporate value will come through the spelunking (exploration, evaluation, and visualization) of enterprise, information technology, and social data sources.

“Data is the energy source of business transformation and Data Science is the engine for its delivery.”

This valuation process, however, is not without its challenges. While all data is important, not all data is of value. Data science provides a systematic process to identify and test critical hypotheses associated with increased valuation through data.

Once validated, these hypotheses must be shown to actually create or foster value (Proof of Value – POV). These POVs extract optimal models from sampled data sets. Only those proven objective-oriented models that have supported growth hypotheses are extended into the enterprise (e.g., big data, data warehousing, business intelligence, etc.).

The POV phase of value generation translates growth objective-based goals into model systems, from which value can be optimally obtained.

This objective-based approach to data science is different from, but complements, traditional business intelligence programs. Data-science-driven activities are crucial for strategic transformations where one does not know what one doesn’t know. In essence, data science provides the revelations needed to identify the value venues necessary for true business transformations.

For those solutions that have clearly demonstrable value, the system models are scaled into the enterprise. Unfortunately, this is where most IT-driven processes start and often unsuccessfully finish. Enterprise data warehouses are created and big data farms are implemented, all before any sense of data value is identified and extracted. Through these implementations, traditional descriptive statistics and BI reports are generated that tell us mostly things that we know we don’t know, an expensive investment in knowledge confirmation. The objective-based data monetization approach, however, incorporates only those information technology capabilities into the enterprise that are needed to support the scalability of the optimized solutions.

While there are many Objective-Based Data Monetization case studies, a common use can be found in the insurance and reinsurance field. In this case, a leading global insurance and re-insurance company is facing significant competitive pricing and margin (combined ratio) pressure. While having extensive applications covering numerous markets, the business line data was not being effectively used to identify optimal price points across their portfolio of products.

Using Objective-Based Data Monetization, key pricing objectives are identified, along with the critical causal levers that impact the pricing value chain. Portfolio data and information assets are inventoried and assessed for their causal and correlative characteristics. Exploratory visualization maps are created that lead to the design and development of predictive models. These models are aggregated into complex solution spaces that then represent a comprehensive, cohesive pricing ecosystem. Using simulated annealing, optimal pricing structures are identified, which are then implemented across the enterprise applications.
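To illustrate the simulated annealing step, here is a minimal sketch that searches a toy pricing curve for the margin-maximizing price. The objective function (a linear demand curve with a fixed unit cost) and all parameters are hypothetical placeholders for illustration, not figures from the case study:

```python
import math
import random

random.seed(42)  # reproducible run

def margin(price):
    """Toy objective: margin for a hypothetical linear demand curve
    demand = 100 - 2 * price, with a unit cost of 10."""
    demand = max(0.0, 100.0 - 2.0 * price)
    return (price - 10.0) * demand

def anneal(start=12.0, temp=50.0, cooling=0.95, steps=2000):
    """Basic simulated annealing: always accept improving moves,
    accept worsening moves with probability exp(delta / temp),
    and cool the temperature geometrically each step."""
    current = best = start
    for _ in range(steps):
        candidate = current + random.uniform(-1.0, 1.0)
        delta = margin(candidate) - margin(current)
        if delta > 0 or random.random() < math.exp(delta / temp):
            current = candidate
        if margin(current) > margin(best):
            best = current
        temp = max(temp * cooling, 1e-6)
    return best

best_price = anneal()
print(round(best_price, 2))  # converges near the analytic optimum of 30
```

In the real pricing ecosystem the objective is not a single closed-form curve but the aggregated model solution space, which is exactly why a stochastic search like annealing is useful: it explores non-convex landscapes where gradient methods stall.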

Data science is a proven means through which value can be created from the existing assets in today’s organization. By focusing on a hypothesis-driven methodology based on business-objective outcomes, value identification and extraction can be maximized in order to prioritize the investments needed to realize them in the enterprise.

## 3 Factors Of A Successful Data Monetization Strategy

We are at a tipping point for the realization of value from data-oriented services (big data, data science, etc.). Those that see limited growth opportunities in traditional application and services development are already well underway in this data science transformation phase. For those that don’t see the need to monetize their data and information assets, it may be an Extinction Level Event (ELE) that is competitively unavoidable. Either way, understanding the effective components of an actionable data monetization strategy is extremely important.

Data Monetization is the process of actively generating value from a company’s data inventory. Today, only 1% of the world’s data is being analyzed (IDC); while at the same time, 100% of the data is costing companies CapEx and OpEx on a daily basis. Consumers and business line owners are beginning to recognize that the insights locked in data reflecting personal usage, location, profile, and activity have a tangible market value. This is especially true when you apply the Power of Three principle to corporate data sets.

A data monetization strategy actively looks to extract latent value through three principal venues:

Level 1. Aggregating and Analyzing – Companies look to drive incremental revenue by aggregating multiple data sources (Power of Three) and conducting deep analyses through data science. The resulting models are then used to drive changes in the decision making process for operational, sales, and marketing. Ownership of value is retained and protected, but the cost of value generation is the highest of the three models.

Level 2. Licensing and Selling – Companies are launching ventures that package, license, and resell corporate data (creating new data sets and insights), or using data sets to launch new information-based products. For example, placing their point-of-sale (POS), internal social, relationship-oriented, and other data online for business partners to subscribe to. Ownership of value is transferred, but the cost of value generation is the least costly of the three models (cost of sales and marketing).

Level 3. Crowdsource Data Insights – Based on deriving value from the crowd, data is supplied to the crowd for analyses that produce specific actionable outcomes. For example, Kaggle.com is a data prediction competition platform that allows organizations to post their data and have it scrutinized by data scientists in exchange for a prize. Ownership of value is retained or shared, and the cost of value generation is distributed throughout the crowd at a compensation based on tiered rewards (the cost of 1st, 2nd, and 3rd place rewards << the total cost of all data science activities for N participants).

Of the three strategies, crowdsourcing data insights (Level 3) tends to offer the highest long-term benefits at the least capital and operational cost. Companies can retain the intellectual property from the insights derived through third-party analyses without directly incurring the operational costs associated with hiring resources. A true win-win.

Data Monetization is increasingly becoming a significant business activity for most companies. While less than 10% of Fortune 1000 companies have a data monetization strategy today, it is projected that 30% of businesses will monetize their data and information assets by 2016 (Gartner). As big data management consultants and data scientists, working with lines of business, begin to address these drivers, we should expect to see one or more of these venues fundamentally change the way we monetize our businesses.
