The current working definitions of Data Analytics and Data Science are inadequate for most organizations, but before we can improve those characterizations, we need to understand what each hopes to accomplish. Data analytics seeks to provide operational insight into issues that we either know we know or know we don’t know. Descriptive analytics, for example, quantitatively describes the main features of a collection of data. Predictive analytics, which focuses on correlative analysis, predicts relationships between known random variables or sets of data in order to anticipate how an event will unfold in the future. An example is identifying where to sell personal power generators, and at which store locations, as a function of future weather conditions (e.g., storms). While the weather may not cause the buying behavior, it often correlates strongly with future sales.
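To make the descriptive-analytics idea concrete, here is a minimal Python sketch that quantitatively summarizes the main features of a data set. The sales figures are entirely hypothetical, chosen only for illustration:

```python
import statistics

# Hypothetical weekly generator sales (units) for one store -- illustrative only
sales = [12, 15, 9, 22, 30, 18, 25, 11, 27, 20]

# Descriptive analytics: summarize the main features of the collection
summary = {
    "count": len(sales),
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "stdev": statistics.stdev(sales),
    "min": min(sales),
    "max": max(sales),
}

for name, value in summary.items():
    print(name, value)
```

Predictive analytics would then go a step further, correlating a series like this against an external variable (e.g., storm forecasts) to anticipate future demand.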
The goal of Data Science, on the other hand, is to provide strategic, actionable insight into the world where we don’t know what we don’t know: for example, trying to identify a future technology that doesn’t exist today but will have the greatest impact on an organization. Predictive analytics in the area of causation, prescriptive analytics (predictive analytics plus decision science), and machine learning are three primary means through which actionable insights can be found. Predictive causal analytics precisely identifies the cause of an event; take, for example, the impact of a film’s title on its box office revenue. Prescriptive analytics couples decision science with predictive capabilities in order to identify actionable outcomes that directly advance a desired goal.
Separating data analytics into operations and data science into strategy allows us to apply them more effectively across the enterprise solution value chain. Enterprise Information Management (EIM) consists of the capabilities necessary for managing today’s large-scale data assets. In addition to relational databases, data warehouses, and data marts, we now see the emergence of big data solutions (Hadoop). Enterprise data analytics (EDA) leverages those data assets to provide day-to-day operational insights, everything from counting assets to predicting inventory. Enterprise data science (EDS) then seeks to exploit the vastness of information and analytics to provide actionable decisions that have a meaningful impact on strategy, for example, discovering the optimal price point for products or the means to increase movie theater box office revenues. Finally, all of these insights are for nothing if they are not operationally fused into the capabilities of the larger enterprise through architecture and solutions.
Data science is about finding revelations in the historical electronic debris of society. Through mathematical, statistical, computational, and visualization techniques, we seek not only to make sense of, but also to drive meaningful action from, the zeros and ones that constitute the exponentially growing data produced by our electronic DNA. While data science alone is a significant capability, its overall value is multiplied when coupled with its cousin, Data Analytics, and integrated into an end-to-end enterprise value chain.
This is very early but nevertheless interesting, and is based on the initial insights from the “Film Industry Executives Golden Rule – Total Gross is 3x Opening Box Office Receipts” post. As discussed, identifying outliers could be an important part of identifying the characteristics of those exceptional films in the industry. The plot below shows the number of outlying (exceptional) films whose opening revenue was more than 2.68 standard deviations above the mean (line with circles). In addition, the plot shows (line with triangles) the number of those outliers that also exceeded a 4x Total Gross/Opening Gross ratio (the industry average being 3.1).
The second group (triangles) is the candidate study group for any future project, e.g., a good place to look for characteristic differences between exceptional and average films. There appear to be thirty years of data to explore here, helpful for creating, testing, and scoring regression and logistic regression models.
However, the more interesting trends are the exponential increase in films with outlier opening gross revenue (line with circles) and the divergence between the two groups. While I don’t know what to make of it yet, there appears to be something going on.
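The two outlier criteria above can be sketched in a few lines of Python. The film records below are synthetic stand-ins (the real analysis used Box Office Mojo data); only the 2.68-standard-deviation and 4x-ratio thresholds come from the text:

```python
import statistics

# Illustrative (title, opening gross, total gross) records in $M -- not real BOM data
films = [
    ("Film A", 1.0, 3.1), ("Film B", 2.0, 6.0), ("Film C", 1.5, 4.5),
    ("Film D", 2.5, 7.8), ("Film E", 1.0, 3.0), ("Film F", 3.0, 9.3),
    ("Film G", 2.0, 6.2), ("Film H", 1.5, 4.7), ("Film I", 2.0, 6.1),
    ("Film J", 500.0, 2200.0),  # an exceptional opener
]

openings = [f[1] for f in films]
mu = statistics.mean(openings)
sigma = statistics.stdev(openings)

# Criterion 1 (circles): opening gross more than 2.68 stdev above the mean
opening_outliers = [f for f in films if f[1] > mu + 2.68 * sigma]

# Criterion 2 (triangles): opening outliers whose Total/Opening ratio exceeds 4x
ratio_outliers = [f for f in opening_outliers if f[2] / f[1] > 4.0]

print(opening_outliers)
print(ratio_outliers)
```

Grouping the counts by release year would then reproduce the two trend lines in the plot.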
In order to systematically address these data science questions, any future engagement lifecycle needs to run through an organic process that maximizes the likelihood of success (coming up with actionable insights on budget and on time). The key will most likely be access to film industry data sets, specifically those used to build websites like Box Office Mojo. It would be useful to get detailed accounting for each film, inclusive of budgetary items (e.g., marketing spend). In addition, the project needs to pull in other third-party data, such as regional/national economics (Bureau of Economic Analysis), weather (Weather Underground), social (Facebook, Twitter), demographic/psychographic models, etc. Here is the macro model for deriving insights from ones and zeros:
The analysis process itself is driven by data aggregation, preparation, and design of experiments (DOE). Having access to a few big data toolsmiths (data scientists who are Cloudera hackers) pays off at this phase. The data science team should set up a multi-node Hadoop environment at the start for all the data that will be pulled in over time (potentially terabytes within a year). They should also not waste effort trying to force-fit all the disparate data sources into some home-grown relational schema. Accept the fact that uncertainty exists and build a scalable storage model accessible by R/SPSS/etc. from the start.
Once the data is in hand, the fun begins. While modeling is both a visual and a design process, it is all driven through an effective design of experiments. Knowing how to separate data into modeling, test, and scoring sets is a science, so there is no real need to second-guess what to do. Here is one such systematic and teachable process:
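One common way to implement that separation is sketched below in Python. The 60/20/20 proportions and the function name are illustrative assumptions, not part of the process described above:

```python
import random

def split_dataset(records, train=0.6, test=0.2, seed=42):
    """Partition records into modeling (train), test, and scoring sets."""
    rng = random.Random(seed)              # fixed seed for reproducibility
    shuffled = records[:]
    rng.shuffle(shuffled)                  # randomize before splitting
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (shuffled[:n_train],                      # build the model
            shuffled[n_train:n_train + n_test],      # test / validate it
            shuffled[n_train + n_test:])             # score fresh predictions

films = list(range(100))                   # stand-in for film records
train_set, test_set, score_set = split_dataset(films)
print(len(train_set), len(test_set), len(score_set))  # 60 20 20
```

The scoring set is deliberately held out of both model building and testing, so it behaves like genuinely unseen data.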
At the micro (day-to-day) level, the team needs to build out an ecosystem to support data analytics and science. This includes tools (R, SPSS, Gephi, Mathematica, MATLAB, SAS, HANA, etc.), big data (Cloudera: Hadoop, Flume, Hive, Mahout (important), HBase, etc.), visualization (Raphaël, D3, Polymaps, OpenLayers, Tableau, etc.), computing (local desktops/servers, AWS, etc.), and potentially third-party composite processing (Pneuron). Last, but not least, is an Insights Management Framework: a dashboard-driven application that manages an agile, client-centric workflow. This will manage the resolution process around all questions developed with the client (buy or build this application).
While the entertainment industry is a really exciting opportunity, this enterprise-level data science (EDS) framework generalizes to insights analyses across industries. By investing in the methodology (macro/micro) and infrastructure (Hadoop, etc.) up front, the valuation of data science teams will be driven by a more systematic monetization strategy built on insights analysis and reuse.
Mike Gualtieri (Forrester) believes that “developers are stuck in a design paradigm that reduces app design to making functionality and content decisions based on a few defined customer personas or segments.”
The answer to developing apps that dazzle the digital consumer and making your company stand out from the competition lies in what Gualtieri calls Predictive Apps. Forrester defines predictive apps as:
Apps that leverage big data and predictive analytics to anticipate and provide the right functionality and content on the right device at the right time for the right person by continuously learning about them.
To build anticipatory, individualized app experiences, app developers will use big data and predictive analytics to continuously and automatically tune the app experience by:
- Learning who the customer (individual user) really is
- Detecting the customer’s intent in the moment
- Morphing functionality and content to match the intent
- Optimizing for the device (or channel)
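The four steps above can be sketched as a toy feedback loop. Everything here is hypothetical (class, method, and action names are mine, not Forrester's); a real predictive app would replace the frequency counter with genuine predictive models:

```python
from collections import Counter

class PredictiveApp:
    """Toy sketch of the learn / detect / morph / optimize loop."""

    def __init__(self):
        # Learn who the customer is: accumulate per-user behavior over time
        self.history = Counter()

    def record_event(self, user, action):
        self.history[(user, action)] += 1

    def detect_intent(self, user):
        # Detect in-the-moment intent: here, simply the user's most frequent action
        actions = {a: c for (u, a), c in self.history.items() if u == user}
        return max(actions, key=actions.get) if actions else "browse"

    def render(self, user, device):
        intent = self.detect_intent(user)
        # Morph content to the intent, and optimize layout for the device/channel
        layout = "compact" if device == "phone" else "full"
        return f"{intent}-centric view ({layout} layout)"

app = PredictiveApp()
for action in ["search", "search", "buy"]:
    app.record_event("alice", action)
print(app.render("alice", "phone"))  # search-centric view (compact layout)
```

The point of the sketch is the loop itself: every interaction feeds the model, so the experience tunes continuously rather than being fixed at design time.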
The film entertainment industry believes that the total gross theater earnings from a film can be determined by looking at the opening gross box office receipts. Industry executives use the rule of thumb that for every dollar earned on opening day, three dollars will be earned in total from box office receipts (i.e., Total Gross = 3 x Open Gross). This is why they invest in all that marketing prior to opening day.
I decided to take a look at this rule of thumb, so I created an R script that pulled the required data from Box Office Mojo (see below). I grabbed all 14K+ films from BOM, did a bit of data cleaning and formatting, then plotted the relationship between Opening Box Office Receipts and Total Gross Theater Earnings. As it turns out, the executives are right: the 2.5% and 97.5% confidence bounds for the golden ratio are 3.13 and 3.19, respectively. As a correlative predictive model, it is significant (R² = 0.8034).
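The core of that calculation can be sketched as a regression through the origin, Total = r × Opening. The Python below runs on synthetic data (the actual analysis used the R script below against real BOM data), so the specific numbers it prints are illustrative, not the 3.13–3.19 result above:

```python
import random

# Synthetic films: openings in $M, totals noisily scattered around ~3.16x opening
random.seed(1)
openings = [random.uniform(0.5, 100.0) for _ in range(1000)]
totals = [3.16 * o * random.uniform(0.7, 1.3) for o in openings]

# Least-squares slope for a line through the origin: r_hat = sum(x*y) / sum(x^2)
r_hat = sum(x * y for x, y in zip(openings, totals)) / sum(x * x for x in openings)

# Coefficient of determination (R^2) for the fitted model
mean_y = sum(totals) / len(totals)
ss_res = sum((y - r_hat * x) ** 2 for x, y in zip(openings, totals))
ss_tot = sum((y - mean_y) ** 2 for y in totals)
r_squared = 1 - ss_res / ss_tot

print(round(r_hat, 2), round(r_squared, 3))
```

With real data, bootstrapping or the standard error of the slope would give the 2.5%/97.5% confidence bounds quoted above.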
R SCRIPT (based on Tony Breyal’s Quick Scrape script)