A friend and I were Bayesian-doodling on the Monty Hall game show problem: there are three doors to choose from. Behind one door is a prize; behind the others, goats. You pick a door, say Door A, and Monty Hall (the host), who knows what's behind the doors, opens another door, say Door B, which has a goat. Monty then asks, "Do you want to switch and pick Door C?" Is it to your advantage to switch? Here is a rough Bayesian view of why switching from your first choice (Door A) to the remaining door (Door C) is to your advantage.
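The Bayesian argument can also be checked empirically. The sketch below (a minimal simulation; the function name and trial count are mine, not from the original doodle) plays the game many times under the stated rules, with the host always opening a goat door the contestant did not pick:

```python
import random

def simulate_monty_hall(trials=100_000, seed=42):
    """Simulate the Monty Hall game; return win rates for staying vs. switching."""
    rng = random.Random(seed)
    stay_wins = switch_wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)   # door hiding the prize
        pick = rng.randrange(3)    # contestant's first choice
        # Host opens a door that is neither the pick nor the prize
        opened = next(d for d in range(3) if d != pick and d != prize)
        # Switching means taking the one remaining unopened door
        switched = next(d for d in range(3) if d != pick and d != opened)
        stay_wins += (pick == prize)
        switch_wins += (switched == prize)
    return stay_wins / trials, switch_wins / trials

stay, switch = simulate_monty_hall()
print(f"stay: {stay:.3f}, switch: {switch:.3f}")  # roughly 1/3 vs 2/3
```

The simulation agrees with the Bayesian posterior: your first pick keeps its prior probability of 1/3, so the remaining door carries the other 2/3.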
Note: Captured with LiveScribe's Sky Pen, one of the best personal productivity tools I have ever purchased (coupled with Evernote).
This is the first in a series of screencasts designed to demonstrate practical aspects of data science. In this episode, I will show you how to integrate R, that awe-inspiring statistical processing environment, with Hadoop, the master of distributed data storage and processing. Once done, we will apply the RHadoop environment to count the words in that massive classic, "Moby Dick."
In this screencast, we are going to set up a Hadoop environment on Mac OS X: download, install, and configure Hadoop; download and install R and RStudio; download and load the RHadoop packages; configure R; and finally, create and execute a test MapReduce job. Here, let me show you exactly how all this works.
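The screencast's actual scripts use RHadoop's map and reduce functions in R; as a language-neutral sketch of the same word-count logic (the function names and sample lines below are my own illustration, not the screencast's code), the map step emits (word, 1) pairs and the reduce step sums them per key:

```python
from collections import defaultdict
import re

def map_words(line):
    """Map step: emit (word, 1) pairs for each word in a line of text."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1

def reduce_counts(pairs):
    """Reduce step: sum the counts for each key (word)."""
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

text = ["Call me Ishmael.", "Some years ago - never mind how long"]
counts = reduce_counts(pair for line in text for pair in map_words(line))
print(counts)
```

In the real RHadoop setup, Hadoop handles distributing the map tasks across the cluster and shuffling the pairs to the reducers; the per-record logic stays this simple.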
The scripts to this screencast will be posted over the next couple of days.
Edward Tufte's principal point in data visualization is, above all else, to show the data. But with so many graphical options to choose from, how does one go about creating effective visualizations? This short video presentation by Tyler Rinker tries to address this critical question.
The presentation, based on the work of Stephen Few, was given at the Center for Literacy and Research Instruction's 50th Anniversary Conference and focuses on designing graphs that are in tune with the brain/eye perceptual subsystem, thus maximizing graph effectiveness. Few believed that in order to effectively show the data, one needs to use pre-attentive visual attributes (length, position, motion, color, hue, intensity, blur, etc.) to grab and direct the viewer's attention (iconic memory), while constraining the visuals to work within the limits of working memory.
Rinker's presentation is an excellent source of definitions (charts, graphics, tables, diagrams, geoms, etc.), graph parts (primary data, secondary data, non-data, and chart junk), and examples (bars, boxes, lines, points, etc.). He also provides an entry-level view of the brain and memory and how they impact the data visualization process. Finally, he wraps up with visualization dos and don'ts (e.g., don't use 3D, but do use faceting).
There are a lot of great R resources on the internet, ranging from one-off articles and texts to comprehensive tutorials. Here are a few of the more popular links:
FICO, known for its analytics and decision-making products and of course its eponymous credit-scoring service, has a new infographic that summarizes the eight characteristics of a top-notch data scientist. You’ll find it here.
According to Dr. Andrew Jennings, chief analytics officer at FICO and head of FICO Labs, three of these characteristics are most important, and every organization in the market for a data scientist should know what they are. Summary:
1. Problem-Solving Skills
2. Communications Skills
This brief history of data science is an updated version of Gil Press's "A Very Short History of Data Science." How data scientists became sexy is mostly the story of the coupling of the mature discipline of statistics with a very young one, computer science. The term "Data Science" has emerged only recently to specifically designate a new profession that is expected to make sense of the vast stores of big data. But making sense of data has a long history and has been discussed by scientists, statisticians, librarians, computer scientists, and others for years. The following timeline traces the evolution of the term "Data Science" and its use, attempts to define it, and related terms.
(CLICK TO VIEW COMPLETE TIMELINE)
Methodological problems (i.e., problems for which existing methods are inadequate) lead to new statistical research and improved techniques. This area of work is called mathematical statistics and is primarily concerned with developing new statistical methods and algorithms and evaluating their performance. It is important to note that computing and solving computational problems are integral components of all four of the previously mentioned areas of statistical science.
Statistical science is concerned with the planning of studies, especially with the design of randomized experiments and with the planning of surveys using random sampling. The initial analysis of the data from properly randomized studies often follows the study protocol.
Of course, the data from a randomized study can be analyzed to consider secondary hypotheses or to suggest new ideas. A secondary analysis of the data from a planned study uses tools from data analysis.
Data analysis is divided into:
— descriptive statistics – the part of statistics that describes data, i.e., summarizes the data and their typical properties.
— inferential statistics – the part of statistics that draws conclusions from data (using some model for the data). For example, inferential statistics involves selecting a model for the data, checking whether the data fulfill the conditions of that model, and quantifying the uncertainty involved (e.g., using confidence intervals).
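The descriptive/inferential split above can be made concrete with a small numeric sketch (the sample data and the t critical value below are assumptions for illustration, not from the text): descriptive statistics summarize the sample itself, while the confidence interval is an inferential statement about the unseen population mean.

```python
import math
import statistics

# Hypothetical sample of ten measurements
data = [4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 4.7, 5.2, 5.5]

# Descriptive statistics: summarize the data we actually observed
mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation

# Inferential statistics: 95% confidence interval for the population mean,
# assuming roughly normal data (t critical value for df = 9 is about 2.262)
t_crit = 2.262
half_width = t_crit * sd / math.sqrt(len(data))
ci = (mean - half_width, mean + half_width)

print(f"mean={mean:.2f}, sd={sd:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

The interval quantifies the uncertainty the text mentions: with a different sample, the mean would move, and the CI expresses how far it plausibly sits from the population value under the chosen model.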
While the tools of data analysis work best on data from randomized studies, they are also applied to other kinds of data — for example, from natural experiments and observational studies, in which case the inference is dependent on the model chosen by the statistician, and so subjective.
Mathematical statistics has been inspired by and has extended many procedures in applied statistics.
Thomas H. Davenport is an academic and author specializing in analytics, business process innovation and knowledge management. He is currently the President’s Distinguished Professor in Information Technology and Management at Babson College, Director of Research at the International Institute for Analytics, and a Senior Advisor to Deloitte Analytics.
We live in a world awash with data. Data is proliferating at an astonishing rate—we have more and more data all the time, and much of it was collected in order to improve decisions about some aspect of business, government, or society. If we can’t turn that data into better decision making through quantitative analysis, we are both wasting data and probably creating suboptimal performance.
— Tom Davenport
Companies' success in effectively monetizing their information depends on how efficiently they can uncover insights in their data sources. While Enterprise Data Science (EDS) is one of the methodologies needed to achieve this goal organically and systematically, it is but one of many such frameworks.
Machine learning, a subdomain of artificial intelligence and a branch of statistical learning, is one such computational methodology: a collection of techniques and algorithms that enable computing devices to improve their recommendations based on the effectiveness of previous experience (that is, to learn). Machine learning is related to (and often confused with) data mining, and relies on techniques from statistics, probability, numerical analysis, and pattern recognition.
There is a wide variety of machine learning tasks, successful applications, and implementation frameworks. Mahout, one of the more popular frameworks, is an open-source project built on Apache Hadoop. Mahout can currently be used for
- Collaborative filtering (Recommendation systems – user based, item based)
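The core idea behind a user-based recommender can be sketched compactly. The toy ratings, function names, and similarity choice below are my own illustration (Mahout's actual API differs): score each item the target user has not rated by the similarity-weighted ratings of other users.

```python
import math

# Toy user-item ratings (hypothetical data for illustration)
ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 4, "item2": 3, "item3": 5, "item4": 4},
    "carol": {"item1": 1, "item2": 5, "item4": 2},
}

def cosine_similarity(u, v):
    """Cosine similarity over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, ratings):
    """Predict ratings for unrated items via similarity-weighted averages."""
    scores, weights = {}, {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
                weights[item] = weights.get(item, 0.0) + sim
    return {i: scores[i] / weights[i] for i in scores if weights[i] > 0}

print(recommend("alice", ratings))  # predicted rating for item4
```

Item-based filtering flips the same computation around, comparing item rating vectors instead of user rating vectors; at Mahout's scale, both variants are distributed across Hadoop.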
Varad Meru created and is sharing this introductory Mahout presentation, an excellent source of basic information as well as implementation details.