Visualizing the CRAN: The Right Package for a Given Task


R is an extremely useful software environment for statistical computing and graphics. But as awesome as it is, finding just the right package for a specific task can be quite daunting. CRAN Task Views are designed to help alleviate this challenge. The table below matches each task (right) with its task view (left). While a bit primitive, it does work as a quick reference.

For those who are much more industrious, check out “Visualizing the CRAN: Graphing Package Dependencies.” The author uses graph visualization to show the relationships between packages in the CRAN. His analysis shows that MASS is the most “depended on” package on the CRAN. A total of 294 of the 3794 packages (almost 8%) in the CRAN depend on MASS. An additional 95 suggest MASS, so just over 10% of all R packages either suggest or depend on MASS.
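
The percentages above follow directly from the counts quoted from that analysis. As a quick check, here is a minimal Python sketch that uses only the three figures given in the text; the variable names are illustrative and are not taken from the original analysis.

```python
# Counts quoted in the text (from the cited dependency analysis)
total_packages = 3794
depend_on_mass = 294
suggest_mass = 95  # "an additional 95", so counted separately from the 294

# Share of packages that depend on MASS
depends_share = depend_on_mass / total_packages

# Share of packages that either depend on or suggest MASS
either_share = (depend_on_mass + suggest_mass) / total_packages

print(f"Depend on MASS: {depends_share:.2%}")
print(f"Depend on or suggest MASS: {either_share:.2%}")
```

The first share comes out a little under 8% and the second a little over 10%, which matches the rounding used in the text.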

Task View                  Topic
Bayesian                   Bayesian Inference
ChemPhys                   Chemometrics and Computational Physics
ClinicalTrials             Clinical Trial Design, Monitoring, and Analysis
Cluster                    Cluster Analysis & Finite Mixture Models
DifferentialEquations      Differential Equations
Distributions              Probability Distributions
Econometrics               Computational Econometrics
Environmetrics             Analysis of Ecological and Environmental Data
ExperimentalDesign         Design of Experiments (DoE) & Analysis of Experimental Data
Finance                    Empirical Finance
Genetics                   Statistical Genetics
Graphics                   Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
HighPerformanceComputing   High-Performance and Parallel Computing with R
MachineLearning            Machine Learning & Statistical Learning
MedicalImaging             Medical Image Analysis
MetaAnalysis               Meta-Analysis
Multivariate               Multivariate Statistics
NaturalLanguageProcessing  Natural Language Processing
OfficialStatistics         Official Statistics & Survey Methodology
Optimization               Optimization and Mathematical Programming
Pharmacokinetics           Analysis of Pharmacokinetic Data
Phylogenetics              Phylogenetics, Especially Comparative Methods
Psychometrics              Psychometric Models and Methods
ReproducibleResearch       Reproducible Research
Robust                     Robust Statistical Methods
SocialSciences             Statistics for the Social Sciences
Spatial                    Analysis of Spatial Data
SpatioTemporal             Handling and Analyzing Spatio-Temporal Data
Survival                   Survival Analysis
TimeSeries                 Time Series Analysis
gR                         gRaphical Models in R


Objective-Based Data Monetization: An Enterprise Approach to Data Science (EDS)


Across all industries, companies are looking to Data Science for ways to grow revenue, improve margins, and increase market share. In doing so, many are at a tipping point for where and how to realize these value improvement objectives.

Those that see limited opportunities to grow through their traditional application and services portfolios may already be well underway in this data science transformation. For those that don’t see the need to find real value in their data and information assets (Data Monetization), inaction may be a competitive risk that jeopardizes the business’s viability and solvency.

Either way, increasing the valuation of a company or business line through the conversion of its data and information assets into actionable outcome-oriented business insights is the single most important capability that will drive business transformation over the next decade.

Data and information have become the single most important assets needed to fuel today’s transformational growth. Most organizations have seen revenue and margin growth plateau for organic products and services (those based on people, process, and technologies). The next generation of corporate value will come through the spelunking (exploration, evaluation, and visualization) of enterprise, information technology, and social data sources.

“Data is the energy source of business transformation and Data Science is the engine for its delivery.”

This valuation process, however, is not without its challenges. While all data is important, not all data is of value. Data science provides a systematic process to identify and test critical hypotheses associated with increased valuation through data.

Once validated, these hypotheses must be shown to actually create or foster value (Proof of Value – POVs). These POVs extract optimal models from sampled data sets. Only these proven, objective-oriented models that have supported growth hypotheses are extended into the enterprise (e.g., big data, data warehousing, business intelligence, etc.).

The POV phase of value generation translates growth objective-based goals into model systems, from which value can be optimally obtained.

This objective-based approach to data science differs from, but complements, traditional business intelligence programs. Data science driven activities are crucial for strategic transformations where one does not know what they don’t know. In essence, data science provides the revelations needed to identify the value venues necessary for true business transformations.

For those solutions that have clearly demonstrable value, the system models are scaled into the enterprise. Unfortunately, this is where most IT-driven processes start and, often unsuccessfully, finish. Enterprise data warehouses are created and big data farms are implemented, all before any sense of data value is identified and extracted (shown in blue). Through these implementations, traditional descriptive statistics and BI reports are generated that tell us mostly things that we know we don’t know, an expensive investment in knowledge confirmation. The objective-based data monetization approach, however, incorporates into the enterprise only those information technology capabilities that are needed to support the scalability of the optimized solutions.

While there are many Objective-Based Data Monetization case studies, a common use can be found in the insurance and reinsurance field. In this case, a leading global insurance and reinsurance company is facing significant competitive pricing and margin (combined ratio) pressure. While the company has extensive applications covering numerous markets, its business line data is not being used effectively to identify optimal price points across its portfolio of products.

Using Objective-Based Data Monetization, key pricing objectives are identified, along with critical causal levers that impact the pricing value chain. Portfolio data and information assets are inventoried and assessed for their causal and correlative characteristics. Exploratory visualization maps are created that lead to the design and development of predictive models. These models are aggregated into complex solution spaces that then represent a comprehensive, cohesive pricing ecosystem. Using simulated annealing, optimal pricing structures are identified, which are implemented across the company’s enterprise applications.
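
To make the simulated annealing step more concrete, here is a minimal Python sketch of how such a search might look for a single price point. It is not the method used in the case study, whose models and objective function are not described here. The linear demand curve, the cost, the price bounds, and every parameter value below are hypothetical, chosen only so that the example runs on its own.

```python
import math
import random


def expected_profit(price, cost=40.0, base_demand=1000.0, sensitivity=8.0):
    """Toy profit curve (hypothetical): linear demand that falls as price rises."""
    demand = max(base_demand - sensitivity * price, 0.0)
    return (price - cost) * demand


def anneal_price(start=50.0, low=40.0, high=125.0, steps=5000,
                 start_temp=1000.0, cooling=0.995, seed=0):
    """Search for a profit-maximizing price with simulated annealing."""
    rng = random.Random(seed)
    current = best = start
    current_profit = best_profit = expected_profit(start)
    temp = start_temp
    for _ in range(steps):
        # Propose a nearby price, clipped to the allowed range
        candidate = min(max(current + rng.uniform(-2.0, 2.0), low), high)
        candidate_profit = expected_profit(candidate)
        delta = candidate_profit - current_profit
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature cools
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_profit = candidate, candidate_profit
            if current_profit > best_profit:
                best, best_profit = current, current_profit
        # Geometric cooling schedule, floored to avoid division by zero
        temp = max(temp * cooling, 1e-9)
    return best, best_profit


best_price, best_profit = anneal_price()
print(f"Best price found: {best_price:.2f}, profit: {best_profit:.0f}")
```

For this toy curve the optimum can be worked out by hand (a price of 82.5), which makes it easy to check that the search lands nearby. In a real pricing ecosystem the objective would come from the aggregated predictive models instead of a closed-form curve, and that is the situation where a stochastic search of this kind becomes useful.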

Data science is a proven means through which value can be created from existing assets in today’s organization. By focusing on a hypothesis-driven methodology that is based on business objectives and outcomes, value identification and extraction can be maximized in order to prioritize the investment needed to realize them in the enterprise.

Visualization: The Artist in the Data Scientist

Data Scientist Insights

The artistic part of data visualization is one of the most underappreciated parts of data science. On a day-to-day basis, we spend most of our time aggregating data sets, exploring their multidimensional depths, making and testing models, and visualizing the results. However, the visualizations produced by many data scientists are often lackluster and uninspiring. But we can do better.

Infographics, data visualized to yield revelations over observations, are one of the simplest ways a data scientist can turn on the artisan within. Whitespace, darkspace, type fonts, graphics, colors, and textures are new concepts encountered in this new world. While seemingly complex and confusing, the good news about infographics is that there are several great sources of information available to those looking to start this journey.

From a book perspective, one of the best texts on the market is: The Power of Infographics: Using Pictures to Communicate and Connect With Your Audiences. Mark Smiciklas covers everything from the business value to the creative process and into distribution. This is one of those books that tools like the iPad and Kindle were made for.

In addition to books, online courses are also available to those looking to build out new skills using tools like Adobe Illustrator. My two favorite online training sites are:

Tut Plus: How to Create Outstanding Modern Infographics

Lynda: Creating Infographics with Illustrator

Of the two, I found the Lynda course taught by Mordy Golding the most useful. He identified five core characteristics of great designs that you will want to consider: Contrast, Hierarchy, Accuracy, Relevance, and Truth. The last is probably the most important; as Edward Tufte says, “style and esthetics can not rescue failed content. If the words aren’t truthful, the finest typography won’t turn lies into truth.” Without truth there can be no understanding.

So, putting all this together and fingers to my keyboard, I produced this renewable energy infographic. It is based on the techniques discussed by Golding, data from various government agencies, and a bit of number crunching. Not bad; it could use some more artistic flair, but it is much better than the traditional scientifically based graphics I am so used to producing.



New Twitter Feed: @DSI2013


Data Scientist Insights has a new dedicated Twitter feed, @DSI2013. With all the data science activity, it became apparent that the time had come to separate Data Scientist Insights (@DSI2013) from Dr. Jerry A. Smith (@drjerryasmith). Please take a bit of time to follow the dialogue in Data Scientist Insights, should that content interest you.

The Art of Data Visualization: Getting Design Out Of The Way


This is a great PBS Off Book webisode that covers the visualization topic from the point of view appreciated by a data scientist – the data. Edward Tufte points out that “You want to see to learn something, not to confirm something.” At the same time, Jer Thorp says that “Data Visualization is about Revelation – seeing something you have never seen before.”

Revelation is truly seeing to learn, a key characteristic that differentiates data science from business intelligence. BI visualization (e.g., dashboards, pie charts, etc.) is all about seeing to confirm. Very operational, very tactical. In contrast, Data Science strives to learn through visualization, which is very strategic and hopefully transformative.

From scientific visualization to pop infographics, designers are increasingly tasked with incorporating data into the media experience. Data has emerged as such a critical part of modern life that it has entered into the realm of art, where data-driven visual experiences challenge viewers to find personal meaning from a sea of information, a task that is increasingly present in every aspect of our information-infused lives.

This short 8 min webisode features:
Edward Tufte, Yale University
Julie Steele, O’Reilly Media
Josh Smith, Hyperakt
Jer Thorp, Office for Creative Research