Embracing Predictive Analytics (Part 1 in a Series)
The more you understand how algorithmic models work, the better the opportunities to engage the right set of data scientists and make them productive.
- By David Loshin
- March 4, 2016
The hype surrounding big data in recent years has pushed the role of the corporate data scientist to the forefront, turning it into one of the "best jobs" in the United States. According to recruiting site Glassdoor, "data scientist" is the number one Best Job in America for 2016, with a median base salary of $116,840. Although the role and responsibilities of a data scientist may differ slightly from company to company, there is no doubt that overall, the role blends data management capabilities, analytical skills, and proficiency in the use of predictive methods to build analytical models for profiling, segmentation, analysis, prediction, visualization, and storytelling. The role also includes evaluating the accuracy and precision of developed analytical models.
In some ways, the breadth of skills expected of a successful data scientist betrays how fluidly the role is defined from one enterprise to the next. Depending on the organization, the role demands some combination of knowledge about Hadoop, NoSQL, text mining, data mining, R, Python, Java, visualization tools (such as Tableau or Qlik), and a host of statistical techniques. Comparing these requirements, you might find that a data scientist in one organization would not be suitable for the role in another.
To some extent, this is a symptom of the growing pains that come with organizations embracing, adopting, and mainstreaming analytics. Although the technology media promote the technologies as if they were just recently invented, much of what is referred to as "predictive analytics" today has a long history, both algorithmically and in production. This is reflected in the terminology used. The terms analytics, machine learning, and data mining are used interchangeably to refer to several algorithmic methods of discovering patterns in large (or massive) data sets and forecasting patterns based on that analysis.
Some of the methods used are not new at all. Bayesian analysis can be traced back to the 1700s, and regression analysis goes back to the early 1800s. Most of the other algorithms are relatively mature as well: neural networks date to the 1950s, decision trees to the 1960s, and association rules and support vector machines hail from the 1990s.
What makes these methods still seem so innovative today?
One explanation is the shift in ease of use. In the 1990s, if you wanted to use data mining algorithms for predictive analytics, you had to be an expert in statistical analysis, computer science, programming, or (most likely) all three. The emergence of user-friendly tools for data mining in the 1990s, followed by their integration into usage scenarios that did not demand expertise, allowed non-experts to take advantage of these models without having to understand how they worked.
The integration of machine-learning algorithms into open source tools made them more widely useful. The open source statistical language R (in particular) opened up the world of analytics to many individuals, while a number of vendors integrated the R capabilities within their own tool frameworks. Other open source libraries (such as Hadoop's Mahout or the MLlib machine learning library in Apache Spark) are also enablers.
The upshot is that the more one understands about how these algorithmic models work, the better the opportunities for an organization to engage the right set of data scientists and make them productive. An organization with a few people tinkering with R may be able to develop some interesting ideas, but without a fundamental recognition of how prediction and prescription fit into the enterprise, the incorporation of those ideas into day-to-day business processes will remain limited. That means that every technology manager and corresponding business counterpart should have some awareness of how predictive models work and how they are used.
To that end, over the next few articles, we will explore data mining and machine-learning techniques and discuss ways that they can be exploited to solve specific classes of business problems.
David Loshin is a recognized thought leader in the areas of data quality and governance, master data management, and business intelligence. David is a prolific author regarding BI best practices via the expert channel at BeyeNETWORK and numerous books on BI and data quality. His valuable MDM insights can be found in his book, Master Data Management, which has been endorsed by data management industry leaders.