Critical Capabilities for Enterprise Data Science

In the article “46 Critical Capabilities of a Data Science Driven Intelligence Platform,” an original set of critical enterprise capabilities was identified. In enterprise architecture language, capabilities are “the ability to perform or achieve certain actions or outcomes through a set of controllable and measurable faculties, features, functions, processes, or services.”(1) In essence, they describe the what of the activity, but not necessarily the how. While individually effective, the set was nevertheless incomplete. Below is an update in which several new capabilities have been added and others relocated. Given my emphasis on deep learning, composed of cognitive and intelligence processes, I have added genetic and evolutionary programming as a set of essential capabilities.


The implementation architecture has also been updated to reflect the application of Spark and SparkR.


46 Critical Capabilities of a Data Science Driven Intelligence Platform

Data science is much more than just a singular computational process. Today, it’s a noun that collectively encompasses the ability to derive actionable insights from disparate data through mathematical and statistical processes, scientifically orchestrated by data scientists and functional behavioral analysts, all supported by technology capable of linearly scaling to meet the exponential growth of data. One such set of technologies can be found in the Enterprise Intelligence Hub (EIH), a composite of disparate information sources, harvesters, Hadoop (HDFS and MapReduce), enterprise R statistical processing, metadata management (business and technical), enterprise integration, and insights visualization – all wrapped in a deep learning framework. However, while this technical stuff is cool, Enterprise Intelligence Capabilities (EIC) are an even more important characteristic that drives the successful realization of the enterprise solution.


In enterprise architecture language, capabilities are “the ability to perform or achieve certain actions or outcomes through a set of controllable and measurable faculties, features, functions, processes, or services.”(1) In essence, they describe the what of the activity, but not necessarily the how. For a data science-driven approach to deriving insights, these are the collective sets of abilities that find and manage data, transform data into features capable of being exploited through modeling, model the structural and dynamic characteristics of phenomena, visualize the results, and learn from the complete round-trip process. The end-to-end process can be sectioned into Data, Information, Knowledge, and Intelligence.


Each of these atomic capabilities can be used by four different key resources to produce concrete intermediate and final intelligence products. The Platform Engineer (PE) is responsible for harvesting and maintaining raw data, ensuring well-formed metadata. For example, they would write Python scripts used by Flume to ingest Reddit dialogue into the Hadoop ecosystem. The MapReduce Engineer (MR) produces features based on imported data sets. One common function is extracting topics through MapReduce-programmed natural language processing on document sets. The Data Scientist (DS) performs statistical analyses and develops machine learning algorithms. Time series analysis, for example, is often used by the data scientist as a basis for identifying anomalies in data sets. Taken all together, Enterprise Intelligence Capabilities can transform generic text sources (observations) into actionable intelligence through the intermediate production of metadata-tagged signals and contextualized events.
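To make the data scientist's anomaly-detection step concrete, here is a minimal sketch of one common time series approach – a rolling z-score detector. This is illustrative only, not the platform's actual implementation; the function name, window size, and sample data are all assumptions for the example.

```python
# Illustrative rolling z-score anomaly detector (a toy sketch, not the
# platform's actual method): flag points that deviate sharply from the
# recent history of the series.
def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` observations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = sum(history) / window
        var = sum((x - mean) ** 2 for x in history) / window
        std = var ** 0.5
        if std > 0 and abs(series[i] - mean) / std > threshold:
            anomalies.append(i)
    return anomalies

signal = [10, 11, 10, 12, 11, 10, 11, 55, 11, 10]
print(rolling_zscore_anomalies(signal))  # [7] - the spike at index 7
```

In practice the data scientist would use a robust statistical library and tune the window and threshold to the data, but the round trip – history, expectation, deviation, flag – is the same.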


Regardless of how data science is being used to derive insights, at the desktop or throughout the enterprise, capabilities become the building blocks for effective solution development. Independent of actual implementation (e.g., there are many different ways to perform anomaly detection), they are the scalable building blocks that transform raw data into the intelligence needed to realize true actionable insights.

Deep Web Intelligence Platform: 6 Plus Capabilities Necessary for Finding Signals in the Noise


Over the last several months I have been involved in developing unique data science capabilities for the intelligence community, specifically ones based on exploiting insights derived from the open source intelligence (OSINT) found in the deep web. The deep web is World Wide Web (WWW) content that is not part of the Surface Web, which is indexed by standard search engines. It is usually inaccessible to traditional search engines because of the dynamic characteristics of its content and the impermanent nature of its URLs. Spanning over 7,500 terabytes of data, it is the richest source of raw material that can be used to build out value.


One of the more important aspects of intelligence is being able to connect multiple seemingly unrelated events together within a time frame amenable to making actionable decisions. This capability is the optimal blend of man and machine, enabling customers to know more and know sooner. It is only in the low signals found in the deep web that one can use the behavioral sciences (psychology and sociology) to extract outcome-oriented value.
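The simplest machine-side version of "connecting seemingly unrelated events within a time frame" is a time-window correlation. The sketch below is a toy illustration of that idea – the function name, window size, sources, and events are all invented for the example, not drawn from any actual platform.

```python
from collections import defaultdict

def correlate_by_window(events, window_seconds=3600):
    """Bucket (timestamp, source, description) events into fixed time
    windows; a bucket drawing on more than one source is a candidate
    correlation worth an analyst's attention."""
    buckets = defaultdict(list)
    for ts, source, desc in events:
        buckets[ts // window_seconds].append((source, desc))
    # Keep only windows where events from different sources coincide.
    return {
        w: items for w, items in buckets.items()
        if len({source for source, _ in items}) > 1
    }

events = [
    (1000, "forum", "product complaint"),
    (1500, "marketplace", "bulk listing appears"),
    (9000, "forum", "unrelated chatter"),
]
print(correlate_by_window(events))  # only the first window survives
```

A real system would add entity resolution and confidence scoring before anything reached a human, but the machine's role – surfacing coincidences in time for a person to judge – is the blend of man and machine described above.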


Data in the web is mostly noise, which can be unique but is often of low value. Unfortunately, the index engines of the world (Google, Bing, Yahoo) add marginal value to the very few data streams that matter to any valuation process. Real value comes from correlating event networks (people performing actions) through deep web signals, which are not the purview of traditional search engines.


These deep web intelligence capabilities can be achieved in part through the use of machine learning-enabled, data science-driven, and Hadoop-oriented enterprise information hubs. The platform supports the 6 plus essential capabilities for actionable intelligence operations:

1. Scalable Infrastructure – Industry-standard hardware, supported through cloud-based infrastructure providers, that scales linearly with analytical demands.

2. Hadoop – Allows computation to occur next to data storage and enables schema on read – data is stored in its native raw format.

3. Enterprise Data Science – Scalable exploratory methods, predictive algorithms, and prescriptive and machine learning techniques.

4. Elastic Data Collection – In addition to pulling data from third-party sources through APIs, bespoke data collection through scraping web services enables data analyses not possible within traditional enterprise analytics groups.

5. Temporal/Geospatial/Contextual Analysis – The ability to localize events to a region, within a specific context, during a specified time (past, present, or future).

6. Visualization – Effective visualization that tailors actionable results to individual needs.

The Plus – data, Data, DATA. Without data, lots of disparate data, data science platforms are of no value.
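Capability 2's schema-on-read principle – store raw data untouched, impose structure only when it is queried – can be sketched in a few lines. This is plain Python over JSON lines purely for illustration; a real hub would use Hive, Pig, or Spark over HDFS, and the field names here are invented.

```python
import json

# Raw records land in their native format; no schema is imposed on write.
raw_lines = [
    '{"user": "a1", "action": "login", "ts": 1}',
    '{"user": "b2", "action": "purchase", "amount": 9.99, "ts": 2}',
    '{"user": "a1", "action": "logout", "ts": 3}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: project only the requested fields,
    tolerating records that lack some of them."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two consumers read the same raw data through different schemas.
actions = list(read_with_schema(raw_lines, ["user", "action"]))
revenue = [r for r in read_with_schema(raw_lines, ["amount"]) if r["amount"]]
print(actions[0])  # {'user': 'a1', 'action': 'login'}
print(revenue)     # [{'amount': 9.99}]
```

The point is that neither consumer forced a schema on the writer: new fields can appear in the raw store at any time without breaking existing readers, which is what lets the hub keep everything in native raw format.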

Deep Web Intelligence Architecture 01

Today’s executive, inundated with TOO MUCH DATA, has limited ability to synthesize the trends and actionable insights that drive competitive advantage. Traditional research tools and internet and social harvesters do not correlate or predict trends. They look at hindsight or, at best, skim the surface of things. A newer approach, combining the behavioral analyses achievable through people with the machine learning found in scalable computational systems, can bridge this capability gap.

Mahout: Machine Learning For Enterprise Data Science

Machine Learning

The success of companies in effectively monetizing their information depends on how efficiently they can identify revelations in their data sources. While Enterprise Data Science (EDS) is one of the methodologies needed to organically and systematically achieve this goal, it is but one of many such frameworks.

Machine learning, a subdomain of artificial intelligence and a branch of statistical learning, is one such computational methodology, composed of techniques and algorithms that enable computing devices to improve their recommendations based on the effectiveness of previous experiences – that is, to learn. Machine learning is related to (and often confused with) data mining, and relies on techniques from statistics, probability, numerical analysis, and pattern recognition.

There is a wide variety of machine learning tasks, successful applications, and implementation frameworks. Mahout, one of the more popular frameworks, is an open source project built on Apache Hadoop. Mahout can currently be used for:

  • Collaborative filtering (Recommendation systems – user based, item based)
  • Clustering
  • Classification
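To give a feel for the first of these, here is a toy item-based collaborative filter using cosine similarity. It is a plain Python sketch of the concept Mahout implements at scale, not Mahout's actual API; the ratings data and function names are invented for the example.

```python
from math import sqrt

# ratings[user][item] = score; tiny illustrative data set.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 3, "C": 5},
    "carol": {"A": 1, "B": 5},
}

def item_cosine(item_x, item_y):
    """Cosine similarity between two items over users who rated both."""
    common = [u for u in ratings if item_x in ratings[u] and item_y in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][item_x] * ratings[u][item_y] for u in common)
    nx = sqrt(sum(ratings[u][item_x] ** 2 for u in common))
    ny = sqrt(sum(ratings[u][item_y] ** 2 for u in common))
    return dot / (nx * ny)

def recommend(user):
    """Score each unrated item by similarity-weighted ratings of the
    user's rated items; return the best candidate."""
    seen = ratings[user]
    candidates = {i for r in ratings.values() for i in r} - set(seen)
    scores = {
        i: sum(item_cosine(i, j) * score for j, score in seen.items())
        for i in candidates
    }
    return max(scores, key=scores.get) if scores else None

print(recommend("carol"))  # C - the one item carol has not rated
```

Mahout performs the same item-item computation as distributed MapReduce jobs, which is what makes the approach viable when the ratings matrix no longer fits on one machine.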

Varad Meru created and is sharing this introductory Mahout presentation, an excellent source of basic information as well as implementation details.

Enterprise Data Science (EDS) – Updated Framework Model


Companies continue to struggle with how to implement an organic and systematic approach to data science. As part of an ongoing trend to generate new revenues through enterprise data monetization, product and service owners have turned to internal business analytics teams for help, only to find their individual efforts fall very short of achieving business expectations. Enterprise Data Science (EDS), based on the proven techniques of the Cross Industry Standard Process for Data Mining (CRISP-DM), is designed to overcome most of the traditional limitations found in common business intelligence units.

The earlier post “Objective-Based Data Monetization: A Enterprise Approach to Data Science (EDS)” was an initial cut at describing the framework. It defines data monetization, hypothesis-driven assessments, the objective-based data science framework, and the differences between business intelligence and data science. While it was a good first cut, several refinements (below) have been made to better clarify each phase and their explicit interactions.

Data Science Architecture Insurance Prebind Example

In addition to restructuring the EDS framework and its insurance pre-bind data example (all the data that goes into quoting insurance policies), it was important to document the data science processes that accompany an overall enterprise solution (below).

Data Science Process
