Deep Web Intelligence Platform: 6 Plus Capabilities Necessary for Finding Signals in the Noise


Over the last several months I have been involved in developing unique data science capabilities for the intelligence community, ones specifically based on exploiting insights derived from the open source intelligence (OSINT) found in the deep web. The deep web is World Wide Web (WWW) content that is not part of the surface web, which is indexed by standard search engines. It is usually inaccessible to traditional search engines because of the dynamic character of its content and the non-persistent nature of its URLs. Spanning over 7,500 terabytes of data, it is the richest source of raw material from which to build out value.


One of the more important aspects of intelligence is being able to connect multiple seemingly unrelated events within a time frame amenable to making actionable decisions. This capability is an optimal blend of man and machine, enabling customers to know more and know it sooner. It is only with the low signals found in the deep web that one can use the behavioral sciences (psychology and sociology) to extract outcome-oriented value.


Data on the web is mostly noise, which can be unique but is often of low value. Unfortunately, the indexing engines of the world (Google, Bing, Yahoo) add marginal value to the very few data streams that matter to any valuation process. Real value comes from correlating event networks (people performing actions) through deep web signals, which are not the purview of traditional search engines.


These deep web intelligence capabilities can be achieved in part through machine learning enabled, data science driven, Hadoop-oriented enterprise information hubs. The platform supports the 6 plus essential capabilities for actionable intelligence operations:

1. Scalable Infrastructure – Industry-standard hardware, supported through cloud-based infrastructure providers, that scales linearly with analytical demands.

2. Hadoop – Allows computation to occur next to data storage and enables schema-on-read – data is stored in its native raw format (see the sketch after this list).

3. Enterprise Data Science – Scalable exploratory methods, predictive algorithms, prescriptive analytics, and machine learning.

4. Elastic Data Collection – In addition to pulling data from third-party sources through APIs, bespoke data collection through web scraping services enables analyses not possible within traditional enterprise analytics groups.

5. Temporal/Geospatial/Contextual Analysis – The ability to localize events to a region, a specific context, and a specified time (past, present, future).

6. Visualization – Effective visualization that tailors actionable results to individual needs.

The Plus – data, Data, DATA. Without data, lots of disparate data, data science platforms are of no value.
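
To make item 2 concrete, here is a minimal schema-on-read sketch in R using the rhdfs package. Everything here is illustrative – the file paths, the pipe-delimited layout, and the field names are assumptions, not part of any particular platform: the raw feed lands in HDFS untouched, and a schema is applied only when an analysis reads the data back.

```r
# Minimal schema-on-read sketch using rhdfs (assumes a running Hadoop
# cluster and the rhdfs package; all paths and field names are illustrative).
library(rhdfs)
hdfs.init()

# Write time: store the raw feed exactly as collected -- no schema imposed.
hdfs.put("events.log", "/raw/events.log")

# Read time: pull raw lines back and apply a schema chosen by this
# particular analysis (here, a pipe-delimited timestamp|user|action layout).
reader <- hdfs.line.reader("/raw/events.log")
lines  <- reader$read()
reader$close()

events <- read.table(text = lines, sep = "|",
                     col.names = c("timestamp", "user", "action"),
                     stringsAsFactors = FALSE)
head(events)
```

A different analysis could read the same raw bytes with a completely different schema, which is the point: no value is lost to a premature write-time schema decision.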

[Image: Deep Web Intelligence Architecture]

Today’s executive, inundated with TOO MUCH DATA, has limited ability to synthesize the trends and actionable insights that drive competitive advantage. Traditional research tools and internet and social media harvesters do not correlate or predict trends; they look at hindsight or, at best, skim the surface of things. A newer approach, combining the behavioral analyses achievable through people with the machine learning found in scalable computational systems, can bridge this capability gap.

DSI 001 Integrating R and Hadoop with RHadoop


This is the first in a series of screencasts designed to demonstrate practical aspects of data science. In this episode, I will show you how to integrate R, that awe-inspiring statistical processing environment, with Hadoop, the master of distributed data storage and processing. Once done, we are going to apply the RHadoop environment to count the words in the classic book “Moby-Dick.”

In this screencast, we are going to set up a Hadoop environment on Mac OS X; download, install, and configure Hadoop; download and install R and RStudio; download and load the RHadoop packages; configure R; and finally, create and execute a test MapReduce job. Here, let me show you exactly how all this works.
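
While the full scripts are still being posted (see below), here is a minimal sketch of the word count job itself, using the rmr2 and rhdfs packages. The local file name and HDFS path are assumptions for illustration, not the actual screencast scripts.

```r
# Word count with RHadoop (assumes rmr2 and rhdfs are installed and the
# HADOOP_CMD / HADOOP_STREAMING environment variables are set).
library(rmr2)
library(rhdfs)
hdfs.init()

# Push a local copy of the book into HDFS (illustrative paths).
hdfs.put("mobydick.txt", "/data/mobydick.txt")

# Map: split each line into lowercase words, emit a (word, 1) pair per word.
wc.map <- function(k, lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  keyval(words[words != ""], 1)
}

# Reduce: sum the counts collected for each word.
wc.reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

# Run the MapReduce job over the raw text input.
out <- mapreduce(input        = "/data/mobydick.txt",
                 input.format = "text",
                 map          = wc.map,
                 reduce       = wc.reduce)

# Pull the results back into R and look at the most frequent words.
counts <- from.dfs(out)
df <- data.frame(word = keys(counts), n = values(counts))
head(df[order(-df$n), ])
```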

The scripts for this screencast will be posted over the next couple of days.

Data Visualization Can Be A Beautiful Thing

Data visualization is a part, a very significant part, of the big data story. The human brain, with its 100 billion neurons each interconnected some 10,000 times, has an absolutely amazing visual processing capability, arguably surpassed by none. So leveraging this ability should not come as a surprise when thinking about the human component of big data.

Paul Butler, a Facebook intern, created this earthly visualization using R, an open source statistical application. Based on roughly 10 million sampled friend relationships taken from Facebook’s Apache Hive (a data warehouse system for Hadoop), he plotted a weight for each pair of cities as a function of the Euclidean distance between them and the number of friends they share. The result is this astonishing image of the earth. By the way, notice anything missing in the geographic location where China normally resides?
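
For a rough sense of how such a map is drawn – this is a sketch of the general technique, not Butler’s actual code – the snippet below generates synthetic city pairs, then draws each connection as a straight segment whose brightness scales with its friendship weight. Butler’s published version used great-circle arcs and real Hive data.

```r
# Butler-style connection map in base R. The data here is synthetic;
# the real input was ~10 million friend pairs sampled from Hive.
set.seed(1)
n <- 5000
links <- data.frame(
  lon1 = runif(n, -180, 180), lat1 = runif(n, -60, 75),
  lon2 = runif(n, -180, 180), lat2 = runif(n, -60, 75),
  w    = rexp(n)                     # stand-in for friendship counts
)

links <- links[order(links$w), ]     # draw the strongest links last
alpha <- links$w / max(links$w)      # normalize weights into [0, 1]

# Dark background, no axes: the connections themselves trace the continents.
par(bg = "black", mar = c(0, 0, 0, 0))
plot(NULL, xlim = c(-180, 180), ylim = c(-90, 90),
     axes = FALSE, xlab = "", ylab = "", asp = 1)
segments(links$lon1, links$lat1, links$lon2, links$lat2,
         col = rgb(0.6, 0.8, 1, alpha), lwd = 0.3)
```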

[Image: Facebook friendship visualization]

Here is another awesome image, this time created by Eric Fischer using a heat map (a type of visualization tool) of places with Flickr photographs and tweets. Twitter and Flickr may not individually have enough users to create a map as detailed as the Facebook image, but put them together and there is a wealth of information that can be discovered in all its glory.
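
A Fischer-style heat map is easy to approximate in R with ggplot2 by binning point densities into a grid. The sketch below uses synthetic longitude/latitude points as a stand-in for the geotagged photos and tweets, which are not reproduced here.

```r
# Heat-map sketch in the spirit of Fischer's image: bin geotagged points
# into a 2-D grid and shade each cell by count. Points are synthetic.
library(ggplot2)

set.seed(2)
pts <- data.frame(
  lon = rnorm(100000, mean = 0,  sd = 40),
  lat = rnorm(100000, mean = 30, sd = 15)
)

ggplot(pts, aes(lon, lat)) +
  geom_bin2d(bins = 300) +                              # density per grid cell
  scale_fill_gradient(low = "black", high = "orange") + # glow on dark ground
  coord_fixed() +
  theme_void() +
  theme(legend.position = "none")
```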

[Image: Flickr/Twitter heat map]

Data visualization is a powerful exploratory tool that should be exploited as early as possible when working to monetize your big data. Do not assume away discoveries just because you don’t know what you don’t know (the third level of knowledge). Let the data take your brain on an unguided journey where discovery is the ultimate destination.