## Six Types Of Analyses Every Data Scientist Should Know

Jeffrey Leek, Assistant Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, has identified six archetypal analyses. As presented, they range from least to most complex in terms of knowledge, cost, and time. In summary:

• Descriptive
• Exploratory
• Inferential
• Predictive
• Causal
• Mechanistic

1. Descriptive (least amount of effort):  The discipline of quantitatively describing the main features of a collection of data. In essence, it describes a set of data.

– Typically the first kind of data analysis performed on a data set

– Commonly applied to large volumes of data, such as census data

– The description and interpretation processes are different steps

– Univariate and Bivariate are two types of statistical descriptive analyses.

Type of data set applied to: Census Data Set – a whole population

Example: Census Data
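To make the univariate/bivariate distinction concrete, here is a minimal sketch in Python using only the standard library. The records are made-up toy values (not real census data), and the helper names `describe` and `pearson_r` are my own, chosen for illustration.

```python
import statistics

# Hypothetical toy records: (age in years, household size) for a tiny "census".
records = [(34, 2), (51, 4), (27, 1), (45, 3), (62, 2), (38, 4)]


def describe(values):
    """Univariate summary: describe a single variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Population standard deviation, since a census covers the whole population.
        "stdev": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson_r(xs, ys):
    """Bivariate summary: Pearson correlation between two variables."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


ages = [age for age, _ in records]
sizes = [size for _, size in records]
print(describe(ages))
print(round(pearson_r(ages, sizes), 3))
```

Note that the code only describes the data; interpreting what the numbers mean is the separate step mentioned above.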

2. Exploratory: An approach to analyzing data sets to find previously unknown relationships.

– Exploratory models are good for discovering new connections

– They are also useful for defining future studies/questions

– Exploratory analyses are usually not the definitive answer to the question at hand, but only the start

– Exploratory analyses alone should not be used for generalizing and/or predicting

– Remember: correlation does not imply causation

Type of data set applied to: Census and Convenience Sample Data Set (typically non-uniform) – a random sample with many variables measured

Example: Microarray Data Analysis

3. Inferential: Aims to test theories about the nature of the world in general (or some part of it) based on samples of “subjects” taken from the world (or some part of it). That is, use a relatively small sample of data to say something about a bigger population.

– Inference is commonly the goal of statistical models

– Inference depends heavily on both the population and the sampling scheme

Type of data set applied to: Observational, Cross Sectional Time Study, and Retrospective Data Set – the right, randomly sampled population

Example: Inferential Analysis
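As a rough illustration of using a small sample to say something about a bigger population, the sketch below draws a random sample from a simulated population and computes an approximate 95% confidence interval for the population mean. The population (simulated heights), the sample size, and the function name `mean_confidence_interval` are all assumptions made for the example; the interval uses the normal approximation, which is only reasonable for fairly large random samples.

```python
import math
import random
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% CI for the population mean (normal approximation)."""
    m = statistics.mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - z * se, m + z * se


random.seed(0)
# Simulated population; in practice the population is what we cannot fully observe.
population = [random.gauss(170, 10) for _ in range(100_000)]
sample = random.sample(population, 200)  # simple random sample

low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.1f}, {high:.1f})")
```

The interval is only trustworthy if the sample really was drawn at random from the population of interest, which is the point about the sampling scheme above.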

4. Predictive: The various types of methods that analyze current and historical facts to make predictions about future events. In essence, to use the data on some objects to predict values for another object.

– The model predicts, but that does not mean the independent variables cause the outcome

– Accurate prediction depends heavily on measuring the right variables

– Although there are better and worse prediction models, more data and a simple model work really well

– Prediction is very hard, especially about the future

Type of data set applied to: Prediction Study Data Set – a training and test data set from the same population

Example: Predictive Analysis

Another Example of Predictive Analysis
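The "training and test data set from the same population" idea can be sketched with a deliberately simple model: a one-variable least-squares line fit on a training split and scored on a held-out test split. The data is simulated (the true slope of 3 and intercept of 2 are assumptions of the simulation), and `fit_line` and `mean_squared_error` are illustrative helpers, not part of any library.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def mean_squared_error(model, xs, ys):
    """Average squared prediction error of the fitted line on (xs, ys)."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


random.seed(1)
# Simulated data: y depends linearly on x, plus noise.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 2.0 + random.gauss(0, 1) for x in xs]

# Fit on the first 150 points, evaluate on the 50 the model never saw.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

model = fit_line(train_x, train_y)
print("slope, intercept:", model)
print("test MSE:", mean_squared_error(model, test_x, test_y))
```

A low test error says the model predicts well on this population; it says nothing about whether x causes y.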

5. Causal: To find out what happens to one variable when you change another.

– Implementation usually requires randomized studies

– There are approaches to inferring causation in non-randomized studies

– Causal models are said to be the “gold standard” for data analysis

Type of data set applied to: Randomized Trial Data Set – data from a randomized study

Example: Causal Analysis
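Here is a small simulated sketch of why randomization matters: subjects are assigned to treatment or control by a coin flip, so the difference in group means estimates the average effect of the treatment. Everything here is invented for illustration, including the `TRUE_EFFECT` constant, which is knowable only because the data is simulated.

```python
import random
import statistics

random.seed(2)
TRUE_EFFECT = 5.0  # assumed effect size; known only because we simulate the data


def run_trial(n):
    """Randomly assign n simulated subjects to treatment or control."""
    treated, control = [], []
    for _ in range(n):
        baseline = random.gauss(50, 10)  # outcome the subject would have untreated
        if random.random() < 0.5:        # coin-flip assignment
            treated.append(baseline + TRUE_EFFECT)
        else:
            control.append(baseline)
    return treated, control


def average_treatment_effect(treated, control):
    """Difference in mean outcomes between the two groups."""
    return statistics.mean(treated) - statistics.mean(control)


treated, control = run_trial(2000)
print(round(average_treatment_effect(treated, control), 2))
```

Because assignment is random, the two groups have the same baseline on average, which is what lets the difference in means be read as a causal effect rather than a correlation.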

6. Mechanistic (most amount of effort): Understand the exact changes in variables that lead to changes in other variables for individual objects.

– Incredibly hard to infer, except in simple situations

– Usually modeled by a deterministic set of equations (physical/engineering science)

– Generally the random component of the data is measurement error

– If the equations are known but the parameters are not, they may be inferred with data analysis

Type of data set applied to: Randomized Trial Data Set – data about all components of the system

Example: Mechanistic Analysis
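The case where the equations are known but the parameters are not can be illustrated with a textbook physical system (my choice of example, not one from Leek's material): free fall, where distance follows d = 0.5 * g * t^2 and the only randomness is measurement error. The sketch below recovers g from noisy simulated measurements by least squares; the noise level and the name `estimate_g` are assumptions.

```python
import random


def estimate_g(times, distances):
    """Least-squares fit of d = 0.5 * g * t**2 for the single parameter g."""
    # Minimizing sum((d - a * t**2)**2) over a gives a = sum(d * t**2) / sum(t**4),
    # and g = 2 * a.
    num = sum(d * t ** 2 for t, d in zip(times, distances))
    den = sum(t ** 4 for t in times)
    return 2.0 * num / den


random.seed(3)
times = [0.1 * k for k in range(1, 21)]  # 0.1 s to 2.0 s
# Deterministic physics plus a small simulated measurement error (in meters).
distances = [0.5 * 9.81 * t ** 2 + random.gauss(0, 0.05) for t in times]

print(round(estimate_g(times, distances), 2))
```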

## Quotes of the Week: John Tukey

John Tukey (1915-2000) was an American mathematician who has been called the father of modern exploratory data analysis and data visualization. Tukey wrote a lot on these subjects, so I thought I’d share three of my favorite, and also more popular, quotes:

The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

To statisticians, hubris should mean the kind of pride that fosters an inflated idea of one’s powers and thereby keeps one from being more than marginally helpful to others. … The feeling of “Give me (or more likely even, give my assistant) the data, and I will tell you what the real answer is!” is one we must all fight against again and again, and yet again.

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.

So, if you like these quotes and are looking for a great data science read, check out Tukey’s text, Exploratory Data Analysis.

## 11 Steps To Finding Data Scientists

Data scientist recruiting can be a challenging task, but not an impossible one. Here are eleven tips that can get you going in the right recruiting direction:

1. Focus recruiting on universities that have top-notch computer programming, statistics, and advanced science programs. For example, Stanford, MIT, Berkeley, and Harvard are some of the top schools in the world. Also consider a few other schools with proven strengths in data analytics, such as North Carolina State, UC Santa Cruz, University of Maryland, University of Washington, and UT Austin.

2. Look for recruits in the membership rolls of user groups devoted to data science tools. Two excellent places to start are The R User Group (for an open-source statistical tool favored by data scientists) and Python Interest Groups (for PIGies). Revolutions provides a list of known R User Groups, as well as information about the R community.

3. Search for data scientists on LinkedIn, many of whom have formed formal groups.

4. Hang out with data scientists at Strata, Structure:Data, and Hadoop World conferences and similar gatherings, or at informal data scientist “meet-ups” in your area. The R User Group Meetup Groups is an excellent source for finding meetings in your particular area.

5. Talk with local venture capitalists (Osage, NewSprings, etc.), who are likely to have received a variety of big data proposals over the past year.

6. Host a competition on Kaggle (online data science competitions) and/or TopCoder (online coding competitions), the analytical and coding competition websites. One of my favorite Kaggle competitions was the Heritage Provider Network Health Prize: identify patients who will be admitted to a hospital within the next year using historical claims data.

7. Candidates need to code. Period. So don’t bother with any candidate that doesn’t understand some formal language (R, Python, Java, etc.). Coding skills don’t have to be at a world-class level, but they should be good enough to get by (hacker).

8. The old saying that “we start dying the day we stop learning” is so true of the data science space. Candidates need to have a demonstrable ability to learn about new technologies and methods, since the field of data science is changing exponentially. Have they earned certificates from Coursera’s Data Science or Machine Learning courses, contributed to open-source projects, or built an online repository of code or data sets (e.g., Quandl) to share?

9. Make sure a candidate can tell a story with the data sets they are analyzing. It is one thing to do the hard analytical work, but another to provide a coherent narrative about key insights. Test their ability to communicate with numbers, visually, and verbally.

10. Candidates need to be able to work in the business world. Take a pass on those candidates that get stuck for answers on how their work might apply to your management challenges.

11. Ask candidates about their favorite analysis or insight. Every data scientist should have something in their insights portfolio, applied or academic. Have them break out the laptop (iPad) to walk through their data sets and analyses. It doesn’t matter what the subject is, just that they can walk through the complete data science value chain.

## FIELD NOTE: Definition – Data Driven

During a spirited debate over the meaning of data driven, a colleague asked me for my definition – you know, as a data scientist. I replied with this one that I learned from DJ Patil (who built the LinkedIn data science team):

A data-driven organization acquires, processes, and leverages data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.

This definition avoids defining an organization by the volume, velocity, or even variety of its data (big data terms). Instead, it focuses on how the data is effectively used to create a more competitive position.

## Data Monetization: 30 percent of businesses will monetize data and information assets in 4 years

There are three data points that are driving the business discussion around big data:

1. Only 1% of the world’s data was being analyzed (IDC); while at the same time, 100% of the data is costing companies CapEx and OpEx every day.

2. Consumers and businesses are beginning to recognize that the insights locked in data that reflects personal usage, location, profile, and activity have a tangible market value. This is especially true when you apply the Power of Three principle to data sets.

3. As a result, 30% of businesses will monetize their data and information assets by 2016 (Gartner), up from today’s 10% baseline.

As big data management consultants and data scientists, working with lines of business, begin to address these drivers, we should expect the following solutions to fundamentally change the way we monetize our businesses (mostly through applications and people):

:: Companies will look to drive incremental revenue by placing their point-of-sale (POS), internal social, relationship-oriented, and other data online for business partners to subscribe to.

:: Companies will launch ventures that package and resell publicly available data (creating new data sets and insights), or use it to launch new information-based products.

:: Information Resellers are arising to help organizations develop and execute data and information asset monetization strategies.

:: Information Product Managers will lead these efforts internally to identify, create, and make operational new services out of data.

:: New information architectures, focused on monetized data services (Quadrification of Big Data), will emerge since traditional business intelligence products and implementations are not well-suited to analyzing and sharing data in a subscription-based manner. This will transform platform companies that produce data into data insights companies that have platforms.

This type of monetization strategy can open new revenue doors without a significant change in existing platform and/or services investments. The nice thing about information product management is that it leverages most of the platform/service development to date. New immediate revenue can come through the sale and/or licensing of de-personalized data (loyalty, POS, social, etc.) to third parties.

Secondary revenue streams, which can come later in the implementation phase, come from combining existing data sets with other third-party data (transactional, social, etc.) in order to identify orthogonally conflated services (see the Power of Three). This capability would come at a marginal incremental cost and could be outsourced to cost-competitive data science teams.

We are at a tipping point for the realization of value from data-oriented services (big data, data science, etc.). Those that see limited growth opportunities in traditional application and services development are already well underway in this data science transformation phase. For those that don’t see the need to monetize their data and information assets, the result may be a competitively unavoidable Extinction Level Event (ELE).

## FIELD NOTE: Quandl – An Interesting Source For Datasets

Tammer Kamel, a Canadian data scientist, has recently posted a beta version of Quandl.com, an index of 2 million time series data sets. Tammer says Quandl’s mission is to make numerical data easy to find and easy to use. The site is collaboratively maintained and free, with many features including search, browse, download, visualization, merging, and an API.

Here is one of the many datasets that I have been using to research crime-related trends.

## Neuroscience, Big Data, and Data Science Are Impacting Big Ideas In The Creative World of Advertising

Moxie Group’s Creative Director Tina Chadwick makes the case that real-time data analytics “brings us tangible facts on how consumers actually react to almost anything.” She argues that the “notion that 10 people in a room, who volunteered to be there because they got paid and fed,” could truly represent consumer behaviors (psychographics) is a thing of the past. Sadly though, for many advertising companies, this is still the mainstay of their advertising-oriented evaluative methodology.

New capabilities based on neuroscience, integrating machine learning with human intuition, and data science/big data are leading to a new creative process, which many call NeuroMarketing: the direct measurement of consumer thoughts about advertising through neuroscience. The persuasive effects of an advertising campaign (psychographic response) are contingent upon the emotional alignment of the viewer (targeted demographic); that is, the campaign’s buying call to action has a higher likelihood of succeeding when the viewer has a positive emotional response to the material. Through neuroscience we can now directly measure emotional alignment without inducing a Hawthorne Effect.

This is a new field of marketing research, founded in neuroscience, that studies consumers’ sensorimotor, cognitive, and affective responses to marketing stimuli. It explores how consumers’ brains respond to ads (broadcast, print, digital) and measures how well and how often media engages the areas for attention, emotion, memory, and personal meaning (measures of emotional response). From data science-driven analyses, we can determine:

• The effectiveness of the ad to cause a marketing call to action (e.g., buy product, inform, etc)
• Components of the ad that are most/least effective (Ad Component Analysis) – identifying what elements make an ad great or not so great.
• Effectiveness of a transcreation process (language and culture migration) used to create advertising in different culturally centric markets.

One of the best and most entertaining case studies I have seen for NeuroMarketing was done by Neuro-Insight, a leader in the application of neuroscience for marketing and advertising. Top Gear used their technology to evaluate which cars attract women to which type of men. The results are pretty amazing.

While NeuroMarketing is an emergent field for advertising creation and evaluation, the fundamentals of neuroscience and data science make this an essential transformational capability. This new field has significant transformational opportunities within the advertising industry: it allows an above-average firm to become a great firm through the application of incremental quantitative neuroscience. For any advertising agency looking to leapfrog those older, less agile companies that are still anchored in the practices of the 70s, neuromarketing might be worth looking into.

## Big Data: Conventional Definitions and Some Statistics (big numbers for big data)

Definition: “Extremely scalable analytics – analyzing petabytes of structured and unstructured data at high velocity.”

Definition: “Big data is data that exceeds the processing capacity of conventional database systems.”

Big Data has three characteristics:

Variety – Structured and unstructured data

Velocity – Time sensitive data that should be used simultaneously with its enterprise data counterparts, in order to maximize value

Volume – Size of data exceeds the nominal storage capacity of the enterprise.

Statistics:

– In 2011, the global output of data was estimated to be 1.8 zettabytes (10^21 bytes)

– 90% of the world’s data has been created in the last 2 years.

– We create 2.5 quintillion (10^18) bytes of data per day (from sensors, social media posts, digital pictures, etc.)

– The digital world will increase in capacity 44-fold between 2009 and 2020.

– Only 5% of data is being created in structured forms, 95% is largely unstructured.

– 80% of the effort involved in dealing with unstructured data is reconditioning ill-formed data to well-formed data (cleaning it up).

Performance Statistics (I will start tracking more closely):

– Traditional data storage costs approximately \$5/GB, but storing the same data using Hadoop only costs \$0.25/GB – yep, 25 cents/GB. Hum!

– Facebook stores more than 20 petabytes of data across 23,000 cores, with 50 terabytes of raw data being generated per day.

– eBay uses over 2,600 clustered Hadoop servers.
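To get a feel for these numbers, here is a quick back-of-the-envelope sketch. It combines figures from the lists above in a way the original sources do not: it prices a 20-petabyte store at the two quoted per-GB rates, and converts the daily data creation figure into zettabytes per year. Decimal units (1 PB = 10^6 GB) and a 365-day year are assumptions, and `storage_cost` is just an illustrative helper.

```python
GB_PER_PB = 10 ** 6  # decimal units assumed: 1 PB = 1,000,000 GB


def storage_cost(petabytes, dollars_per_gb):
    """Total cost in dollars of storing the given volume at a flat per-GB rate."""
    return petabytes * GB_PER_PB * dollars_per_gb


# A 20 PB store priced at the two quoted rates.
traditional = storage_cost(20, 5.00)
hadoop = storage_cost(20, 0.25)
print(f"traditional: ${traditional:,.0f}")
print(f"hadoop:      ${hadoop:,.0f}")
print(f"ratio:       {traditional / hadoop:.0f}x")

# 2.5 quintillion bytes per day, expressed as zettabytes (10^21 bytes) per year.
bytes_per_day = 2.5e18
zettabytes_per_year = bytes_per_day * 365 / 1e21
print(f"data created per year: {zettabytes_per_year:.2f} ZB")
```

At these rates the per-GB difference is a factor of 20, which on a 20 PB store is the difference between roughly \$100 million and \$5 million.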

## FIELD NOTE: Quadrification of Big Data

1. Data (the intrinsic 1/0 property of big data), which can be broken down into subjective areas like interaction data, transaction data || structured, unstructured || realtime/streaming, batch/static || etc.

2. MapReduce platforms – AKA divide and conquer – virtual integration capabilities that enable aggregation and management of multiple name-spaced data sources (Hadoop, InfoSphere Streams, Pneuron, etc.)

3. Data Exploration, Data Mining, and Intelligence Platforms – technical capabilities that enable one to derive insights from data (Pentaho, IBM InfoSphere, ListenLogic, MATLAB, Mathematica, Statistica, etc.).

4. Knowledge Worker platform (AKA the human component) – The two most important capabilities come from data scientists (who navigate through data) and behavioral scientists (who navigate through human behavior, which most important things seem to connect back to).

In essence, Big Data has data, an ability to find it and use it, and an ability to explore and learn from it.

Does this seem right?  Missing anything? Please post or email me.

## Big Data Driven By Bigger Numbers

As most of us know, there is tremendous growth in global data. There are trillions of transactions occurring daily, ranging from operations to sales to marketing to buying. Every human activity is supported by over 100 business processes, all contributing to this exponential data growth.

The McKinsey Global Institute report “Big Data: The next frontier for innovation, competition, and productivity” discusses this new (?) or at least emergent world. As part of their business case, they cite some very interesting statistics. Here are just a few:

– \$600 to buy a disk that can store all the world’s music

– 5 billion mobile phones in use in 2010

– 30 billion pieces of content shared on Facebook every month

– 40% projected growth in global data generated per year versus 5% growth in global IT spending

– 235 terabytes of data collected by the US Library of Congress by April 2011

– 15 out of 17 sectors in the United States have more data stored per company than the US Library of Congress

– \$300 billion potential annual value to US health care – more than double the total annual health care spending in Spain

– \$600 billion potential annual consumer surplus from using personal location data globally

– 60% potential increase in retailers’ operating margins possible with big data

– 140,000 to 190,000 more deep analytical talent positions

– 1.5 million more data savvy managers needed to take full advantage of big data in the United States