This is very early, but nevertheless interesting, and is based on the initial insights from the “Film Industry Executives Golden Rule – Total Gross is 3x Opening Box Office Receipts” post. As discussed, identifying outliers could be an important part of identifying the characteristics of the exceptional films in the industry. The plot below shows the number of outlying (exceptional) films whose opening revenue was higher than 2.68 standard deviations (line with circles). In addition, the plot shows (line with triangles) the number of those outliers that also exceeded a 4x Total Gross/Opening Gross ratio (the industry average being 3.1).
The second group (triangles) is the candidate study group for any future project, e.g., a good place to look for characteristic differences between exceptional and average films. There appear to be thirty years of data to explore here, which is helpful for creating, testing, and scoring regression and logistic regression models.
However, the more interesting trends are the exponential increase in the number of films with outlier opening gross revenue (line with circles) and the divergence between the two lines. While I don’t know what to make of it yet, there appears to be something going on.
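The two outlier groups described above can be sketched in a few lines of Python. This is only a minimal illustration, not the analysis behind the plot: the dictionary keys (`title`, `opening`, `total`) and the function name are hypothetical, and it assumes the 2.68 cutoff is measured above the mean opening gross using the sample standard deviation.

```python
from statistics import mean, stdev


def flag_outliers(films, z_cutoff=2.68, ratio_cutoff=4.0):
    """Return (opening_outliers, study_group) for a list of film dicts.

    Each film is a dict with the hypothetical keys 'title', 'opening'
    (opening gross) and 'total' (total gross). Needs at least two films.
    """
    openings = [film["opening"] for film in films]
    # Assumption: the cutoff is mean + z_cutoff * sample standard deviation.
    threshold = mean(openings) + z_cutoff * stdev(openings)
    # "Circles": films whose opening gross exceeds the cutoff.
    opening_outliers = [f for f in films if f["opening"] > threshold]
    # "Triangles": those outliers whose total/opening ratio also exceeds 4x.
    study_group = [
        f for f in opening_outliers
        if f["total"] / f["opening"] > ratio_cutoff
    ]
    return opening_outliers, study_group
```

Counting the two returned lists per release year would give the two lines in the plot. Note that with only a handful of films no value can sit 2.68 sample standard deviations above the mean, so the filter only makes sense on a reasonably large data set.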
To address these data science questions systematically, any future engagement lifecycle needs to be run through an organic process in order to maximize the likelihood of success (coming up with actionable insights on budget and on time). The key will most likely be access to film industry data sets, specifically those used to build web sites like Box Office Mojo. It would be useful to get detailed accounting for each film, inclusive of budgetary items (e.g., marketing spend). In addition, the project needs to pull in other third-party data such as regional/national economics (Bureau of Economic Analysis), weather (Weather Underground), social media (Facebook, Twitter), demographic/psychographic models, etc. Here is the macro model for deriving insights from ones and zeros:
The analysis process itself is driven by data aggregation, preparation, and design of experiments (DOE). Having access to a few big data toolsmiths (data scientists who are Cloudera hackers) pays off at this phase. The data science team should set up a multi-node Hadoop environment at the start for all the data that will be pulled in over time (potentially terabytes within one year). They should also not waste effort trying to force-fit all the disparate data sources into some homegrown relational data schema. Accept the fact that uncertainty exists and build a scalable storage model accessible by R/SPSS/etc. from the start.
Once the data is in hand, the fun process begins. While modeling is both a visual and a design process, it is all driven by an effective design of experiments. Knowing how to separate data into modeling, test, and scoring sets is a science, so there is no real need to second-guess what to do. Here is one such systematic and teachable process:
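To make the three-way split mentioned above concrete, here is a minimal sketch. It is an illustration under stated assumptions rather than the process itself: the 60/20/20 proportions, the fixed random seed, and the function name are all assumptions that do not come from the post.

```python
import random


def split_dataset(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle records and split them into (modeling, test, scoring) lists.

    The default 60/20/20 fractions are an assumed starting point only.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_model = round(n * fractions[0])
    n_test = round(n * fractions[1])
    modeling = shuffled[:n_model]
    test = shuffled[n_model:n_model + n_test]
    # Whatever remains is held back for scoring.
    scoring = shuffled[n_model + n_test:]
    return modeling, test, scoring
```

Fixing the seed means the same films land in the same set on every run, which keeps model comparisons honest when the experiment is repeated.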
At the micro level (day to day), the team needs to build out an ecosystem to support data analytics and science. This includes tools (R, SPSS, Gephi, Mathematica, Matlab, SAS, HANA, etc.), big data (Cloudera: Hadoop, Flume, Hive, Mahout (important), HBase, etc.), visualization (Raphaël, D3, Polymaps, OpenLayers, Tableau, etc.), computing (local desktops/servers, AWS, etc.), and potentially third-party composite processing (Pneuron). Last, but not least, is an Insights Management Framework (a dashboard-driven application to manage an agile, client-centric workflow). This will manage the resolution process around all questions developed with the client (buy or build this application).
While the entertainment industry is a really exciting opportunity, this enterprise-level data science (EDS) framework generalizes to insights analyses across all industries. By investing in the methodology (macro/micro) and infrastructure (Hadoop, etc.) up front, the valuation of data science teams will be driven by a more systematic monetization strategy built on insights analysis and reuse.