Deep Web Intelligence Platform: 6 Plus Capabilities Necessary for Finding Signals in the Noise


Over the last several months I have been involved with developing unique data science capabilities for the intelligence community, ones specifically based on exploiting insights derived from the open source intelligence (OSINT) found in the deep web. The deep web is World Wide Web (WWW) content that is not part of the Surface Web, which is indexed by standard search engines. It is usually inaccessible through traditional search engines because of the dynamic characteristics of the content and the non-persistent nature of its URLs. Spanning over 7,500 terabytes of data, it is the richest source of raw material that can be used to build out value.


One of the more important aspects of intelligence is being able to connect multiple seemingly unrelated events together within a time frame amenable to making actionable decisions. This capability is the optimal blend of man and machine, enabling customers to know more and know sooner. It is only in the low signals found in the deep web that one can use the behavioral sciences (psychology and sociology) to extract outcome-oriented value.


Data in the web is mostly composed of noise, which can be unique but is often of low value. Unfortunately, the index engines of the world (Google, Bing, Yahoo) add only marginal value to the few data streams that matter to any valuation process. Real value comes from correlating event networks (people performing actions) through deep web signals, which are not the purview of traditional search engines.


These deep web intelligence capabilities can be achieved in part through the use of machine learning enabled, data science driven, and Hadoop-oriented enterprise information hubs. The platform supports the six-plus essential capabilities for actionable intelligence operations:

1. Scalable Infrastructure – Industry standard hardware, supported through cloud-based infrastructure providers, that scales linearly with analytical demands.

2. Hadoop – Allows computation to occur next to data storage and enables schema-on-read – data is stored in its native raw format.

3. Enterprise Data Science – Scalable exploratory methods, predictive algorithms, prescriptive analytics, and machine learning.

4. Elastic Data Collection – In addition to pulling data from third party sources through APIs, bespoke data collection through scraping web services enables data analyses not possible within traditional enterprise analytics groups.

5. Temporal/Geospatial/Contextual Analysis – The ability to localize events to a region, a specific context, and a specified time (past, present, future).

6. Visualization – Effective visualization that tailors actionable results to individual needs.

The Plus – data, Data, DATA. Without data, lots of disparate data, data science platforms are of no value.

Deep Web Intelligence Architecture 01

Today’s executive, inundated with TOO MUCH DATA, has limited ability to synthesize trends and actionable insights that drive competitive advantage. Traditional research tools and internet and social harvesters do not correlate or predict trends. They look at hindsight or, at best, exist at the surface of things. A newer approach, combining the behavioral analyses achievable through people with the machine learning found in scalable computational systems, can bridge this capability gap.

Data Analytics vs Data Science: Two Separate, but Interconnected Disciplines


The current working definitions of Data Analytics and Data Science are inadequate for most organizations. But in order to think about improving their characterizations, we need to understand what they hope to accomplish. Data analytics seeks to provide operational observations into issues that we either know we know or know we don’t know. Descriptive analytics, for example, quantitatively describes the main features of a collection of data. Predictive analytics, which focuses on correlative analysis, predicts relationships between known random variables or sets of data in order to identify how an event will occur in the future. For example, identifying where to sell personal power generators, and at which store locations, as a function of future weather conditions (e.g., storms). While the weather may not cause the buying behavior, it often strongly correlates with future sales.

Data Analytics vs Data Science

The goal of Data Science, on the other hand, is to provide strategic actionable insights into the world where we don’t know what we don’t know. For example, trying to identify a future technology that doesn’t exist today but will have the most impact on an organization in the future. Predictive analytics in the area of causation, prescriptive analytics (predictive plus decision science), and machine learning are three primary means through which actionable insights can be found. Predictive causal analytics precisely identifies the cause of an event – take, for example, the impact of a film’s title on box office revenue. Prescriptive analytics couples decision science to predictive capabilities in order to identify actionable outcomes that directly impact a desired goal.

Separating data analytics into operations and data science into strategy allows us to more effectively apply them to the enterprise solution value chain. Enterprise Information Management (EIM) consists of those capabilities necessary for managing today’s large scale data assets. In addition to relational databases, data warehouses, and data marts, we now see the emergence of big data solutions (Hadoop). Data analytics (EDA) leverages data assets to provide day-to-day operational insights – everything from counting assets to predicting inventory. Data science (EDS) then seeks to exploit the vastness of information and analytics in order to provide actionable decisions that have a meaningful impact on strategy. For example, discovering the optimal price point for products or the means to increase movie theater box office revenues. Finally, all of these insights are for nothing if they are not operationally fused into the capabilities of the larger enterprise through architecture and solutions.

Data Analytics vs Data Science 2

Data science is about finding revelations in the historical electronic debris of society. Through mathematical, statistical, computational, and visualization techniques, we seek not only to make sense of, but also to provide meaningful action through, the zeros and ones that constitute the exponentially growing data produced by our electronic DNA. While data science alone is a significant capability, its overall valuation is exponentially increased when coupled with its cousin, Data Analytics, and integrated into an end-to-end enterprise value chain.

DSI 001 Integrating R and Hadoop with RHadoop


This is the first in a series of screencasts designed to demonstrate practical aspects of data science. In this episode, I will show you how to integrate R, that awe-inspiring statistical processing environment, with Hadoop, the master of distributed data storage and processing. Once done, we will apply the RHadoop environment to count the number of words in that massive classic “Moby Dick.”

In this screencast, we are going to set up a Hadoop environment on a Mac OS X operating system; download, install, and configure Hadoop; download and install R and RStudio; download and load the RHadoop packages; configure R; and finally, create and execute a test mapreduce problem. Here, let me show you exactly how all this works.
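The word count in the screencast follows the classic map, shuffle, reduce pattern. As a minimal sketch of that logic (in plain Python rather than the screencast’s actual RHadoop code, and with a one-line stand-in for the full text of “Moby Dick”):

```python
# Minimal map/shuffle/reduce word count, mirroring the structure of the
# RHadoop "Moby Dick" example. The text is a stand-in snippet, not the book.
from collections import defaultdict
import re

text = "Call me Ishmael. Some years ago - never mind how long precisely -"

# Map: emit a (word, 1) pair for every token in a line.
def mapper(line):
    for word in re.findall(r"[a-z']+", line.lower()):
        yield (word, 1)

# Shuffle/sort: group emitted values by key (Hadoop does this between phases).
groups = defaultdict(list)
for key, value in mapper(text):
    groups[key].append(value)

# Reduce: sum the grouped counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["me"], counts["how"])
```

On a real cluster the mapper and reducer run in parallel across data blocks; the structure of the code, however, is the same.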

The scripts to this screencast will be posted over the next couple of days.

60+ R Resources Every Data Scientist Should Be Aware Of!


There are a lot of great R resources on the internet, ranging from one-off articles and texts to comprehensive tutorials. Here are a few of the more popular links:





Enterprise Data Science (EDS) – Updated Framework Model


Companies continue to struggle with how to implement an organic and systematic approach to data science. As part of an ongoing trend to generate new revenues through enterprise data monetization, product and services owners have turned to internal business analytics teams for help, only to find their individual efforts fall well short of business expectations. Enterprise Data Science (EDS), based on the proven techniques of the Cross Industry Standard Process for Data Mining (CRISP-DM), is designed to overcome most of the traditional limitations found in common business intelligence units.

The earlier post “Objective-Based Data Monetization: An Enterprise Approach to Data Science (EDS)” was an initial cut at describing the framework. It defines data monetization, hypothesis driven assessments, the objective-based data science framework, and the differences between business intelligence and data science. While it was a good first cut, several refinements (below) have been made to better clarify each phase and their explicit interactions.

Data Science Architecture Insurance Prebind Example

In addition to restructuring the EDS framework and its insurance pre-bind data (all the data that goes into quoting insurance policies) example, it was important to document the data science processes that come with an overall enterprise solution (below).

Data Science Process


Objective-Based Data Monetization: An Enterprise Approach to Data Science (EDS)


Across all industries, companies are looking to Data Science for ways to grow revenue, improve margins, and increase market share. In doing so, many are at a tipping point for where and how to realize these value improvement objectives.

Those that see limited opportunities to grow through their traditional application and services portfolios may already be well underway in this data science transformation phase. For those that don’t see the need to find real value in their data and information assets (Data Monetization), it may be a competitively unavoidable risk that jeopardizes a business’s viability and solvency.

Either way, increasing the valuation of a company or business line through the conversion of its data and information assets into actionable outcome-oriented business insights is the single most important capability that will drive business transformation over the next decade.


Data and information have become the single most important assets needed to fuel today’s transformational growth. Most organizations have seen revenue and margin growth plateau for organic products and services (those based on people, process, and technologies). The next generation of corporate value will come through the spelunking (exploration, evaluation, and visualization) of enterprise, information technology, and social data sources.

“Data is the energy source of business transformation and Data Science is the engine for its delivery.”

This valuation process, however, is not without its challenges. While all data is important, not all data is of value. Data science provides a systematic process to identify and test critical hypotheses associated with increased valuation through data.


Once validated, these hypotheses must be shown to actually create or foster value (Proof of Value – POV). These POVs extract optimal models from sampled data sets. Only these proven objective-oriented models, the ones that have supported growth hypotheses, are extended into the enterprise (e.g., big data, data warehousing, business intelligence, etc.).


The POV phase of value generation translates growth objective-based goals into model systems, from which value can be optimally obtained.


This objective-based approach to data science differs from, but complements, traditional business intelligence programs. Data science driven activities are crucial for strategic transformations where one does not know what one doesn’t know. In essence, data science provides the revelations needed to identify the value venues necessary for true business transformations.


For those solutions that have clearly demonstrable value, the system models are scaled into the enterprise. Unfortunately, this is where most IT-driven processes start and often unsuccessfully finish. Enterprise data warehouses are created and big data farms are implemented, all before any sense of data value is identified and extracted (blue). Through these implementations, traditional descriptive statistics and BI reports are generated that tell us mostly things we already know – an expensive investment in knowledge confirmation. The objective-based data monetization approach, however, incorporates into the enterprise only those information technology capabilities needed to support the scalability of the optimized solutions.


While there are many Objective-Based Data Monetization case studies, a common use case can be found in the insurance and reinsurance field. In this case, a leading global insurance and reinsurance company is facing significant competitive pricing and margin (combined ratio) pressure. While it has extensive applications covering numerous markets, the business line data was not being effectively used to identify optimal price points across its portfolio of products.

Using Objective-Based Data Monetization, key pricing objectives are identified, along with the critical causal levers that impact the pricing value chain. Portfolio data and information assets are inventoried and assessed for their causal and correlative characteristics. Exploratory visualization maps are created that lead to the design and development of predictive models. These models are aggregated into complex solution spaces that then represent a comprehensive, cohesive pricing ecosystem. Using simulated annealing, optimal pricing structures are identified and implemented across the enterprise applications.
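The simulated annealing step can be sketched in a few lines. The profit function below is a hypothetical stand-in for the aggregated pricing ecosystem (a linear demand curve with a fixed unit cost), not the insurer’s actual models; the technique, however, is the standard accept-worse-moves-with-cooling search.

```python
# Toy simulated annealing over a single price point. The demand curve is a
# hypothetical stand-in for the real pricing ecosystem models.
import math
import random

random.seed(42)

def profit(price):
    # Hypothetical: demand falls linearly with price; unit cost is 20.
    demand = max(0.0, 1000 - 8 * price)
    return (price - 20) * demand

price, temp = 50.0, 100.0
best_price, best_profit = price, profit(price)

while temp > 0.01:
    candidate = price + random.uniform(-5, 5)   # perturb the current price
    delta = profit(candidate) - profit(price)
    # Accept improvements always; accept worse moves with Boltzmann probability,
    # which shrinks as the temperature cools.
    if delta > 0 or random.random() < math.exp(delta / temp):
        price = candidate
    if profit(price) > best_profit:
        best_price, best_profit = price, profit(price)
    temp *= 0.99                                # cooling schedule

print(f"optimal price ~ {best_price:.1f}")
```

For this toy demand curve the analytic optimum is 72.5, so the search should settle close to it; in the real case the objective is the full multi-product solution space rather than a one-dimensional curve.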

Data science is a proven means through which value can be created from the existing assets of today’s organization. By focusing on a hypothesis-driven methodology that is grounded in business objectives and outcomes, value identification and extraction can be maximized in order to prioritize the investments needed to realize them in the enterprise.

3 Factors Of A Successful Data Monetization Strategy

We are at a tipping point for the realization of value from data-oriented services (big data, data science, etc.). Those that see limited growth opportunities in traditional application and services development are already well underway in this data science transformation phase. For those that don’t see the need to monetize their data and information assets, it may be an Extinction Level Event (ELE) that is competitively unavoidable. Either way, understanding the effective components of an actionable data monetization strategy is extremely important.

Data Monetization is the process of actively generating value from a company’s data inventory. Today, only 1% of the world’s data is being analyzed (IDC); at the same time, 100% of the data is costing companies CapEx and OpEx on a daily basis. Consumers and business line owners are beginning to recognize that the insights locked in data reflecting personal usage, location, profile, and activity have a tangible market value. This is especially true when you apply the Power of Three principle to corporate data sets.

A data monetization strategy actively looks to extract latent value through three principal venues:

Level 1. Aggregating and Analyzing – Companies look to drive incremental revenue by aggregating multiple data sources (Power of Three) and conducting deep analyses through data science. The resulting models are then used to drive changes in the decision making processes for operations, sales, and marketing. Ownership of value is retained and protected, but the cost of value generation is the highest of the three models.

Level 2. Licensing and Selling – Companies are launching ventures that package, license, and resell corporate data (creating new data sets and insights), or using data sets to launch new information-based products. For example, placing their point-of-sale (POS), internal social, relationship-oriented, and other data online for business partners to subscribe to. Ownership of value is transferred, but the cost of value generation is the least of the three models (the cost of sales and marketing).

Level 3. Crowdsourcing Data Insights – Based on deriving value from the crowd, data is supplied to the crowd for analyses that produce specific actionable outcomes. For example, a data prediction competition platform allows organizations to post their data and have it scrutinized by data scientists in exchange for a prize. Ownership of value is retained or shared, and the cost of value generation is distributed throughout the crowd at a compensation based on tiered rewards (the cost of 1st, 2nd, and 3rd place rewards << the total cost of all data science activities for N participants).

Of all three strategies, crowdsourcing data insights (Level 3) tends to offer the highest long term benefits at the least capital and operational cost. Companies can retain the intellectual property from the insights derived through third party analyses without directly incurring the operational costs associated with hiring resources. A true win-win.

Data Monetization is increasingly becoming a significant business activity for most companies. While less than 10% of Fortune 1000 companies have a data monetization strategy today, it is projected that 30% of businesses will monetize their data and information assets by 2016 (Gartner). As big data management consultants and data scientists, working with lines of business, begin to address these drivers, we should expect to see one or more of these venues fundamentally change the way we monetize our businesses.


Neuroscience, Big Data, and Data Science Are Impacting Big Ideas In The Creative World of Advertising

Moxie Group’s Creative Director Tina Chadwick makes the case that real-time data analytics “brings us tangible facts on how consumers actually react to almost anything.” She dismisses the notion that “10 people in a room, who volunteered to be there because they got paid and fed,” could truly represent consumer behaviors (psychographics) as a thing of the past. Sadly though, for many advertising companies, this is still the mainstay of their advertising-oriented evaluative methodology.

New capabilities based on neuroscience, integrating machine learning with human intuition, and data science/big data are leading to new creative processes, which many call NeuroMarketing: the direct measurement of consumer thoughts about advertising through neuroscience. The persuasive effects of an advertising campaign (psychographic response) are contingent upon the emotional alignment of the viewer (targeted demographic); that is, the campaign’s buying call to action has a higher likelihood of succeeding when the viewer has a positive emotional response to the material. Through neuroscience we can now directly measure emotional alignment without inducing a Hawthorne Effect.

This is a new field of marketing research, founded in neuroscience, that studies consumers’ sensorimotor, cognitive, and affective responses to marketing stimuli. It explores how consumers’ brains respond to ads (broadcast, print, digital) and measures how well and how often media engages the areas for attention, emotion, memory, and personal meaning – measures of emotional response. From data science-driven analyses, we can determine:

  • The effectiveness of the ad in causing a marketing call to action (e.g., buy product, inform, etc.)
  • The components of the ad that are most/least effective (Ad Component Analysis) – identifying what elements make an ad great or not so great.
  • The effectiveness of a transcreation process (language and culture migration) used to create advertising in different culturally centric markets.

One of the best and most entertaining case studies I have seen for NeuroMarketing was done by Neuro-Insight, a leader in the application of neuroscience for marketing and advertising. Top Gear used their technology to evaluate which cars attract women to which type of men. The results are pretty amazing.

While NeuroMarketing is an emergent field for advertising creation and evaluation, the fundamentals of neuroscience and data science make this an essential transformational capability. This new field has significant transformational opportunities within the advertising industry – it allows an above average firm to become a great firm through the application of incremental quantitative neuroscience. For any advertising agency looking to leapfrog those older, less agile companies that are still anchored in the practices of the 70s, neuromarketing might be worth looking into.

Cybernetic Historical Debris Fields – Big Data’s Proof of Life

How did Robert Ballard find the Titanic? Most people think it was by looking for it. Well, most people would be wrong. Ballard believed he could rediscover the Titanic by looking for the debris field created when the ship sank. With the Titanic only being around 900 feet long, he hypothesized that ship parts would be spread out much wider the farther one was from the ship, narrowing like a funnel the closer one got. In essence, this much larger historical debris field would point the way to the much smaller artifact of interest – the Titanic.

Every physical object leaves some trace of its interaction with the real world over time – everything. Whether it is the Titanic plunging to her death in the depths of the Atlantic Ocean or a lonely rock sitting in the middle of a dry desert lake bed, everything leaves a trace; everything has a Historical Debris Field (HDF). Formally,


Definition: Historical Debris Field (HDF) is any time dependent perturbation of an object and its environment.

One of the key points is that it is an observation over time, not just a point in time. HDFs are about capturing the absolute historical changes in the environment in order to make relative projections about some object in the future.

As it turns out, just as physical real world objects leave historical debris fields, so does data through its virtual interactions in cyberspace. Data, by definition, is merely a representative abstraction of a concept or real world object, and is a direct artifact of some computational process. At some level, every known relevant piece of electronic information (these words, your digital photos, a YouTube video) boils down to a series of zeros (0) and ones (1), strung together by a complex series of implicit and tacit interacting algorithms. These algorithms are, in essence, the natural, often unseen forces that govern the historical debris seen in real world objects. So, the HDF for cyberspace might be defined as,

Definition: Cybernetic Historical Debris Field (CHDF) is any time dependent perturbation of data and its information environment (information being relevant data).

Why is this lengthy definitional expose important? Because big data represents the Atlantic Ocean in which a company is looking for opportunities. And like Robert Ballard’s search for the Titanic, one cannot merely set out looking for a piece of insight or knowledge itself in the vastness of all that internal/external structured/unstructured data; one needs to look for the Cybernetic Historical Debris Fields that point to the electronic treasure. But what kind of new “virtual sonar” systems can we employ to help us?

While I will explore this concept more over time, let me suggest that the “new” new in the field of data mining will be in coupling data scientists (DS) with behavioral analysts (BA). Data changes because, at the core, some human initiated a change (causal antecedent). It is through a better understanding of human behavior (patterns) that we will have the best chance of monetizing the vastness of big data. Charles Duhigg, author of “The Power Of Habit: Why We Do What We Do in Life and Business,” shows that by understanding human nature (aka our historical debris field) we can accurately predict a behavioral-based future.


For example, Duhigg shows how Target tries to hook future parents at the crucial moment before they turn into loyal buyers of pre/post natal products (e.g., supplements, diapers, strollers, etc.). Target, specifically Andrew Pole, determined that while lots of people buy lotion, women on baby registries were buying larger quantities of unscented lotion. Also, women at about twenty weeks into their pregnancies would start loading up on supplements like calcium, magnesium, and zinc. This CHDF led the Target team to one of the first behavioral patterns (a virtual Titanic sonar pattern) that could discriminate (point to) pregnant from non-pregnant women. Not impressed…well, this type of thinking contributed to Target’s $23 billion revenue growth from 2002 to 2010.
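The discriminating power of such a pattern can be illustrated with a toy "lift" calculation: how much more common is the purchase pattern in the target group than in the shopper base overall? Every count below is invented for illustration and bears no relation to Target's actual data or Pole's actual models.

```python
# Toy "lift" calculation in the spirit of the Target example. All counts
# are invented for illustration only.
shoppers = 10_000          # hypothetical shopper base
pregnant = 200             # hypothetical known-pregnant shoppers (registry)

# Hypothetical counts of shoppers exhibiting the pattern
# (unscented lotion + calcium/magnesium/zinc supplements).
pattern_all = 450          # shoppers overall showing the pattern
pattern_pregnant = 150     # pregnant shoppers showing the pattern

baseline_rate = pattern_all / shoppers        # pattern rate across everyone
pregnant_rate = pattern_pregnant / pregnant   # pattern rate among pregnant shoppers

# Lift: how many times more likely the pattern is in the target group.
lift = pregnant_rate / baseline_rate
print(f"lift = {lift:.1f}x")
```

A lift well above 1 is exactly the kind of "sonar return" in the CHDF that lets a pattern point to the smaller artifact of interest; a lift near 1 means the pattern is just noise.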

The net of all this is that data can be monetized by systematically searching for relevant patterns (cybernetic historical debris fields) in big data based on human patterns of behavior. There are patterns in everything, and just because we don’t see them doesn’t mean they don’t exist. Through data science and behavioral analysis (AKA Big Data), one can reveal the behavioral past in order to monetize the future.