Quote of the Week: Stephen Jay Gould


Stephen Jay Gould (September 10, 1941 – May 20, 2002) was an American paleontologist, evolutionary biologist, and historian of science. He was also one of the most influential and widely read writers of popular science of his generation.[1] Gould spent most of his career teaching at Harvard University and working at the American Museum of Natural History in New York. In the later years of his life, Gould also taught biology and evolution at New York University.

Facts and theories are different things, not rungs in a hierarchy of increasing certainty. Facts are the world’s data. Theories are structures of ideas that explain and interpret facts. Facts do not go away while scientists debate rival theories for explaining them. Einstein’s theory of gravitation replaced Newton’s, but apples did not suspend themselves in mid-air pending the outcome.
— Stephen Jay Gould

Insurance Product Recommender (IPR) – Growing Value Through Insights


The insurance industry can grow additional revenues by incorporating Recommender Systems into existing applications and across enterprise solutions as a means of monetizing product-related data. Some have estimated that national insurance companies that have not implemented this category of data monetization could add 10% to the top line at current combined ratios (AKA margin). For a small-business insurance company with $500M in premiums, this means a potential additional premium growth of $50M on existing operations.

Predictive systems, like the Recommender System, have become an extremely important source of new revenue growth over the last few years. Recommender systems are a type of “information filtering system” that seeks to predict the ‘rating’ or ‘preference’ that a user would give to an item (such as music, books, or movies) or social element (e.g. people or groups) they had not yet considered, using a model built from the characteristics of an item (content-based approaches) or the user’s social environment (collaborative filtering approaches). A familiar example is Amazon’s “Customers Who Bought This Item Also Bought” recommendation sections, which identify other products (books, electronics, etc.) thought to be of interest to the prospect as well.
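To make the “also bought” idea concrete, here is a minimal sketch of an item co-occurrence recommender, one of the simplest forms of item-to-item collaborative filtering. The product names and purchase histories are hypothetical, and the function names `build_cooccurrence` and `also_bought` are mine; a production system would likely add normalization, weighting, and far more data.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """For each item, count how often every other item shares a basket with it."""
    co_counts = defaultdict(Counter)
    for basket in baskets:
        # set() ignores duplicate items within a single customer's basket
        for a, b in permutations(set(basket), 2):
            co_counts[a][b] += 1
    return co_counts


def also_bought(co_counts, item, top_n=3):
    """Return up to top_n items most often purchased together with `item`."""
    counts = co_counts.get(item, Counter())
    # Sort by descending count, then by name so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical purchase histories: one list of products per customer
baskets = [
    ["general_liability", "property", "cyber"],
    ["general_liability", "property"],
    ["general_liability", "workers_comp"],
    ["property", "cyber"],
]

co_counts = build_cooccurrence(baskets)
print(also_bought(co_counts, "general_liability"))
```

In this toy data, a customer holding general liability coverage would see property coverage ranked first because it co-occurs most often. The same counting works on any list of per-customer product sets.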


Because of their direct impact on top line revenue through the organic use of enterprise data, recommender systems are being increasingly used in other industries, such as e-commerce, giving businesses a strategic advantage over businesses without them. These systems involve industry-specific predictive models, heuristic search, data collection, user interaction, and model maintenance that are developed through the emerging field of data science and supported by new big data platforms. No two are the same, and each becomes part of a competitive differentiator that can only be attained by analyzing data internally across the enterprise and externally throughout social networks.

One example that could transform the insurance industry is the Insurance Product Recommender (IPR). This data science-driven, recommender-based risk profiling application helps underwriters and brokers identify industry-specific client risks, then pinpoint cross-selling and up-selling opportunities by offering access to collateral insurance products, marketing materials, and educational materials that support a complete sales cycle. As more products are sold to an ever-increasing customer base, the recommendations become more reliable, resulting in an exponential increase in revenue realization.

The insurance industry, driven by economic calculus to improve combined ratios (margin) for a given risk profile that is secured at a given premium (revenue), can use predictive systems, like Recommender Systems, to optimize value by leveraging existing untapped data sources (e.g., pre-bind, claims, product purchases, etc.). These systems can become the pathway through which increased client retention (renewals) and client satisfaction can be achieved while growing the risk coverage wallet share within the industry.

Data Science Catechesis: A Systematic Teaching of the Science of Finding Revelations in Data

A catechesis is the systematic practice of teaching and, in this case, that teaching is about data science. While there are no formal catechisms (questions to invoke reflection and response) in the field of science to draw upon, we can nevertheless begin to compose some of the more important expositions of existing doctrine as a start.

The first pillar is composed of three of the essential elements that form data economics:

No. 1: Data is the energy source of business transformation.

Question: Why is data the fundamental energy source of transformation and not people, processes, or technology?

Question: What does transformation mean and why is it the basis of value?


No. 2: Data Science is the organic and systematic practice of transforming hypotheses and data into actionable predictions.

Question: What does it mean that data science is both organic and systematic?

Question: Why are hypotheses an important part of the process of data science?

Question: Why do predictions need to be actionable and whom should they act upon?


No. 3: A Data Scientist is a person who is better at mathematics and statistics than any software engineer and better at software engineering than any mathematician or statistician.

Question: What kinds of mathematics and statistics are important in the discovery of revelations in data?

Question: What are the necessary elements of software engineering needed to systematically produce actionable predictions?

Question: What programming languages does a data scientist need to know?

This is just a start, so please reply with your reflections and other relevant questions for this first pillar.

Heilmeier Catechism: Nine Questions To Develop A Meaningful Data Science Project


As director of ARPA in the 1970s, George H. Heilmeier developed a set of questions that he expected every proposal for a new research program to answer. No exceptions. He referred to them as the “Heilmeier Catechism,” and they are now the basis of how DARPA (Defense Advanced Research Projects Agency) and IARPA (Intelligence Advanced Research Projects Activity) operate. Today, it’s equally important to answer these questions for any individual data science project, both for yourself and for communicating to others what you hope to accomplish.

While there have been many variants on Heilmeier’s questions, I still prefer to use the original catechism to guide the development of my data science projects:

1. What are you trying to do? Articulate your objectives using absolutely no jargon.

2. How is it done today, and what are the limits of current practice?

3. What’s new in your approach and why do you think it will be successful?

4. Who cares?

5. If you’re successful, what difference will it make?

6. What are the risks and the payoffs?

7. How much will it cost?

8. How long will it take?

9. What are the midterm and final “exams” to check for success?

Each question is critical in the success chain of events, but numbers 3 and 5 are most aligned to the way business leaders think. Data science is fraught with failures, by the definition of science. As such, business leaders are still a bit (truthfully, a lot) suspicious of how data science teams do what they do and how their results would integrate into the larger enterprise in order to solve real business problems. Part of the data science sales cycle, covered by question 3, needs to address these concerns. For example, in the post “Objective-Based Data Monetization: A Enterprise Approach to Data Science (EDS),” I present a model for scaling out our results.

In terms of the differences a project makes (question 5), we need to be sure to cover the business as well as the technical differences. The business differences are the standard three: impact on revenue, margin (combined ratios for insurance), and market share. If there is no business value (data/big data economics), then your project is a sunk cost that somebody else will need to make up for.

Here is an example taken from a project proposed in the insurance industry. Brokers are third-party entities that sell insurance products on behalf of a company. They are not employees and often are under the governance of underwriters (employees who sell similar products). There are instances where brokers “shop” around looking to get coverage for a prospect that might have above-average risk (e.g., files too many claims, is in a high-risk business, etc.). They do this by manipulating answers to pre-bind questions (asked prior to issuing a policy) in order to create a product that will not necessarily need underwriter review and/or approval. This project is designed to help stop this practice, which would help improve the business’s financial fundamentals. Here is Heilmeier’s Catechism for the Pre-Bind Gaming Project:

1. What are you trying to do? Automate the identification of insurance brokers that use corporate policy pricing tools as a means to undersell through third party providers.

2. How is it done today? Corporate underwriters observe broker behaviors and pass judgement based on personal criteria.

3. What is new in your approach? Develop signature algorithms, based on the analysis of gamer/non-gamer pre-bind data, that can be implemented across enterprise product applications.

4. Who cares? Business executives – CEO, President, CMO, and CFO.

5. What difference will it make? In an insurance company that generates $350M in premiums at a combined ratio (margin) of 97%, addressing this problem could result in an additional $12M to $32M of incremental revenue while improving the combined ratio to 95.5%.

6. What are the risks and payoffs? Risks: not having collected, or not having access to, relevant causal data reflecting the gamers’ patterns. Payoffs: improved revenue and combined ratios.

7. How much will it cost? Proof of concept (POC) will cost between $80K and $120K. Scaling the POC into the enterprise (implementing algorithms into 5 to 10 product applications) will cost between $500K and $700K.

8. How long will it take? Proof of concept (POC) will take between 8 and 10 weeks. Scaling the POC into the enterprise will take between 3 and 7 months.

9. What are the midterm and final checkpoints for success? The POC will act as the initial milestone that demonstrates that gaming can be identified algorithmically using existing data.
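The figures in question 5 can be sanity-checked with a few lines of arithmetic. This is only a rough sketch under a simplifying assumption of mine: that underwriting profit is premiums times (1 − combined ratio), ignoring investment income and the difference between written and earned premium. The function name `underwriting_result` is hypothetical.

```python
def underwriting_result(premiums_m, combined_ratio):
    """Underwriting profit in $M, assuming profit = premiums * (1 - combined ratio)."""
    return premiums_m * (1.0 - combined_ratio)


# Figures taken from question 5 above (all in $M)
base_premiums = 350.0
baseline = underwriting_result(base_premiums, 0.97)

# Incremental revenue of $12M to $32M at the improved 95.5% combined ratio
low = underwriting_result(base_premiums + 12.0, 0.955)
high = underwriting_result(base_premiums + 32.0, 0.955)

print(f"Baseline underwriting result: ${baseline:.2f}M")
print(f"After project: ${low:.2f}M to ${high:.2f}M")
```

Under that assumption, the improvement in underwriting result can be set against the cost estimates in question 7 to judge whether the project is worth funding.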

Regardless of whether you use Heilmeier’s questions or other research topic development methodologies (e.g., The Craft of Research), it is important to systematically address the who, what, when, where, and why of the project. While a firm methodology does not guarantee success, not addressing these nine questions is sure to put you on a risky path, one that will take work to get off of.


Data Valuation: Seven Laws of Data Science

I was re-reading the paper “Measuring The Value Of Information: An Asset Valuation Approach,” by Moody and Walsh (European Conference on Information Systems, 1999) when I realized just how powerful their approach to data valuation was. This is seminal research in the field of information valuation and should be required reading for data scientists. Moody and Walsh recognized early on that information is the most valuable asset an organization has and that it is important to quantify this value through a formal methodology. While the paper falls short of defining a practical approach, the overall framework can be used as a basis for implementing a repeatable enterprise data valuation methodology.

The reason for this blog post, however, is my desire to recast Moody and Walsh’s Seven Laws of Information. While they do not explicitly define information and how it is different from data, we can use the DIKW Pyramid to recast a few of the laws more towards the field of data science. That is, the world is full of data, information is the relevant data, studying information gives knowledge, and reflecting on knowledge leads to wisdom. So, if we deconstruct the information laws and rethink their data equivalents, one might find these Seven Laws of Data Science as the result:

Law One: Data has value only if it is studied. Intrinsically, data does not generate residual value through its mere presence. Revelations can only be found in the exploration and study of data.

Law Two: The value of data increases with its use. As data is explored, combined with other data, and explored again, additional value is generated.

Law Three: Data cannot be depleted through its use. Data is not a physical commodity that is subject to the physical laws of entropy and degradation. As such, data is infinitely reusable and, through the exploratory process, will produce more data than was originally evaluated.

Law Four: Causal data is more valuable than correlative data. While correlative principles are very useful in some operational circumstances, to forecast the future one needs to truly understand causality within the system. Or, as someone more important than me has stated, “Felix, qui potuit rerum cognoscere causas.” Translated: “Fortunate is he who was able to know the causes of things.”

Law Five: The value from combined independent data is greater than the combined value of each data set alone. This is equivalent to saying that the whole is usually greater than the sum of its parts. That is, one plus one is greater than two.

Law Six: The value of data is perishable, while the data itself is not. The insights derived from the study of data have a limited value time horizon.

Law Seven: More data does not necessarily lead to more value. Studies have shown that more data does not necessarily increase the accuracy of our predictions, just our confidence in those predictions.

So this is the first cut of the Laws of Data Science. What is missing, needs to be rethought, or should even be deleted? Let me know.