The Four Core Project Types: A Conceptual Foundation for Predictive Analytics
The vast majority of predictive analytics projects can be conceptualized as falling into one of four basic types, from which we can creatively combine the benefits of each to enhance our organizational performance.
By Tony Rathburn, Senior Consultant and Director of Training, The Modeling Agency
[Editor's note: Tony Rathburn will be teaching a 4-day intensive session about this topic on May 12 - 15, 2014. For more information, please visit http://analyticsworld.com.]
When approaching an analytics project, many people immediately jump into preliminary analysis of available data and mentally begin to consider the application of a variety of techniques.
Conversely, the available literature in predictive analytics is awash in a sea of project types. In this article, we propose that the vast majority of projects can be conceptualized as falling into one of four basic types. From this basic structure, there are many variations of each type, and many ways we can creatively combine the benefits of each to enhance our organizational performance.
Outcomes: Our First Priority
Our projects begin with the understanding that we are building a formula for "Outcome" that factors in a set of weighted "Conditions." That is, to determine an Outcome, we build a mathematical formula that applies individual Weights to each of several Conditions. We'll explain with examples in a moment.
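As a minimal sketch of what such a formula might look like, the Python below computes an Outcome score as a weighted sum of Conditions. The Condition names (age, prior_purchases), the Weights, and the intercept are hypothetical values chosen only for illustration; in a real project the Weights would be estimated from historical data.

```python
def outcome_score(conditions, weights, intercept=0.0):
    """Return intercept + sum(weight * condition value) over the named Conditions."""
    return intercept + sum(weights[name] * value for name, value in conditions.items())


# Hypothetical Weights and one hypothetical customer record.
weights = {"age": 0.5, "prior_purchases": 2.0}
customer = {"age": 30, "prior_purchases": 4}

score = outcome_score(customer, weights, intercept=1.0)
print(score)
```

Each Condition contributes to the score in proportion to its Weight, so changing a Weight changes how strongly that Condition influences the estimated Outcome.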
Insufficient attention to defining the Outcome attribute is the single biggest cause of predictive analytics project failure. We develop exceptionally sophisticated skills and techniques for analyzing Conditions in massively complex environments. Our academic training and our software focus on increasingly complex methods for manipulating our data. What is overlooked is that no amount of complex computations will overcome the faults of an ill-structured project.
Real-world predictive analytics projects possess a characteristic that is unavailable to the theoretician and to the software engineer, both of whom are focused on providing general tools and techniques to a broad audience. Our analytics projects are undertaken for the specific purpose of advancing our organizational objectives, as measured by our performance metrics.
Precision versus Resolution
We are often predisposed to attempt to develop high-precision estimates of our Outcomes. Due to the inconsistency of human behavior, this is often simply not possible. The recommended first step in conceptualizing an analytics project is to understand whether the decision maker requires additional precision or additional resolution.
Precision refers to the number of decimal places our formulas will use in computing our Outcome. Resolution, on the other hand, is how tightly we will define the sub-groups we are identifying. The degree of resolution is determined by how many Conditions are used in our formula.
Many decision makers require very low levels of true precision. In a large number of opportunities, the decision maker simply needs a better way of anticipating whether or not a behavior will likely be displayed, resulting in an Outcome that is binary (yes, a particular outcome will occur or no, it won't).
In pursuing precision, an analyst may be tempted to develop a model that estimates an Outcome in very precise terms -- for example, Sales estimated to the nearest dollar.
Our decision maker, however, often receives more benefit from a lower level of precision but a higher level of resolution: for example, a relative ranking of customers' propensity to purchase a product that breaks the customer base into 10 groups instead of two. Adding Condition attributes increases the resolution by increasing the number of sub-groups we are defining in our customer base.
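To make the idea of a relative ranking concrete, here is a small sketch that splits a customer base into equal-sized groups by score, first into 10 groups and then into two. The scores and customer identifiers are made up, and the function name assign_groups is our own; the point is only that the same scores can be reported at different levels of resolution without any change in precision.

```python
def assign_groups(scores, n_groups=10):
    """Map each customer id to a group number, where group 1 holds the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    groups = {}
    for position, customer_id in enumerate(ranked):
        groups[customer_id] = position * n_groups // len(ranked) + 1
    return groups


# Made-up propensity scores for 20 hypothetical customers.
scores = {f"c{i}": i / 20 for i in range(20)}

deciles = assign_groups(scores, n_groups=10)  # higher resolution
halves = assign_groups(scores, n_groups=2)    # lower resolution
```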
Classification: Estimating the Future Category of an Outcome
The first and most popular way of conceptualizing our project is to think of the possible categories our Outcome could fit into. When the Outcome is a "category," our problems are much easier to solve technically. The category represents whether a business relationship displays a behavior that impacts our performance metrics.
From our earlier example, the behavior that impacts our performance metric would be "buying our product" versus the behavior of "not buying."
In these types of problems, we are attempting to develop a scoring system of sorts, where the Conditions help us project whether a behavior is likely to be displayed based on the relative magnitude of the computed score for each customer record. Those customers with a higher score (based on our formula) would be more likely to display the behavior than those with a lower score.
In our example, we could then rank our customers based on their estimated propensity to buy our product and determine whom we should mail on the basis of that ranking system.
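The scoring and ranking steps can be sketched as follows. The customer records, the Condition names (visits, prior_purchases), the Weights, and the mailing budget of two customers are all hypothetical; this is an illustration of the idea, not a recommended model.

```python
def rank_customers(records, weights):
    """Return (customer_id, score) pairs sorted from highest to lowest score."""
    scored = [
        (cid, sum(weights[name] * value for name, value in conditions.items()))
        for cid, conditions in records.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical customer records and Weights.
records = {
    "alice": {"visits": 5, "prior_purchases": 2},
    "bob": {"visits": 1, "prior_purchases": 0},
    "carol": {"visits": 3, "prior_purchases": 4},
}
weights = {"visits": 1.0, "prior_purchases": 3.0}

ranking = rank_customers(records, weights)
# Assume the budget covers mailing only the two highest-scoring customers.
mail_list = [cid for cid, _ in ranking[:2]]
```

Note that the decision depends only on the order of the scores, not on their exact values, which is why a classification project can tolerate a low level of precision.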
We can pursue highly complex levels of resolution in the specification of our sub-groups by increasing the number of Condition attributes used to define the groups under consideration. This allows us to treat each of our sub-groups somewhat differently, based on the impact of their behavior on our performance metrics. This approach is often significantly more important to the decision maker than an estimate of the Outcome attribute to a higher level of precision.
From our example, we might find that with a single Condition attribute, Gender, men display our behavior of Buying at a lower rate than women. If we add a second Condition attribute, Age, and break it into two groups at Age 25, we would now have four groups: Men>25, Men<=25, Women>25, Women<=25. Each group would have its own propensity to buy, allowing the decision maker to allocate available resources accordingly.
The addition of a third Condition attribute, Geography (U.S. versus non-U.S.) would result in eight sub-groups within our customer base, each with its own propensity to buy.
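A short sketch of this sub-grouping follows. The six records are invented for illustration, and the field names (gender, age_band, bought) are our own assumptions; the function simply computes the share of buyers within each combination of Condition values.

```python
from collections import defaultdict


def propensity_by_group(records, attributes):
    """Return {group key: share of records in that group with bought == True}."""
    counts = defaultdict(lambda: [0, 0])  # group key -> [buyers, total]
    for record in records:
        key = tuple(record[a] for a in attributes)
        counts[key][0] += int(record["bought"])
        counts[key][1] += 1
    return {key: buyers / total for key, (buyers, total) in counts.items()}


# Made-up customer records.
records = [
    {"gender": "M", "age_band": ">25", "bought": True},
    {"gender": "M", "age_band": ">25", "bought": False},
    {"gender": "M", "age_band": "<=25", "bought": False},
    {"gender": "F", "age_band": ">25", "bought": True},
    {"gender": "F", "age_band": "<=25", "bought": True},
    {"gender": "F", "age_band": "<=25", "bought": False},
]

by_gender = propensity_by_group(records, ["gender"])                  # 2 groups
by_gender_age = propensity_by_group(records, ["gender", "age_band"])  # 4 groups
```

With two-valued Conditions, each added attribute doubles the maximum number of sub-groups: two, then four, then eight.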
Forecasting: Estimating the Future "Value" of an Outcome
Forecasting consists of projects where we are attempting to estimate a value of a future Outcome based on a known set of current conditions. From our example, we might attempt to estimate the dollar value of purchases by a customer for a future time frame, such as the upcoming quarter or year.
The Outcome associated with our record is a continuous-valued variable (rather than a categorical variable). As such, our Conditions also need to be restricted to the set of continuous-valued variables in our candidate pool of available fields from our data. (In forecasting, we can only use fields that have a quantitative value as Condition attributes. We cannot use categorical, or qualitative, Condition attributes, such as Gender or Marital Status.)
It is possible to develop multiple models, one for each of the categories of qualitative variables, but we simply cannot reliably use our qualitative variables as Conditions for determining the value of our Outcome variable.
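As a sketch of the "one model per category" idea, the code below fits a separate straight line for each value of a qualitative field, using a single quantitative Condition. Ordinary least squares is our choice for the illustration; the article does not prescribe a particular technique. The segment names, the Condition (prior-quarter spend), and the history rows are all made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up history: (segment, prior-quarter spend, next-quarter spend).
history = [
    ("retail", 100.0, 120.0), ("retail", 200.0, 220.0), ("retail", 300.0, 320.0),
    ("wholesale", 100.0, 200.0), ("wholesale", 200.0, 400.0), ("wholesale", 300.0, 600.0),
]

# One model per category of the qualitative field.
models = {}
for segment in {row[0] for row in history}:
    xs = [x for s, x, _ in history if s == segment]
    ys = [y for s, _, y in history if s == segment]
    models[segment] = fit_line(xs, ys)


def forecast(segment, prior_spend):
    """Estimate next-quarter spend using the model fitted for this segment."""
    intercept, slope = models[segment]
    return intercept + slope * prior_spend
```

The qualitative field selects which model to apply; only the quantitative Condition enters the formula itself.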
Forecasting is technically more difficult. The inconsistency of human behavior does not lend itself well to this level of precision. It requires significantly more data, while not allowing the use of much of the data in our data repositories because of their qualitative characteristics. The resulting level of precision is often simply not required by the decision maker.
Time Series: Estimating the Next Step in a Sequence of Values or Categories from a Set of Current Conditions and a Current Position
Time-series projects may estimate either a value or a category of a future Outcome based on a known set of current conditions. In a time-series project we are not working from a static "snapshot" of the value of the Conditions. Rather, we recognize a dependence on a sequence of values over time.
There are many projects where a time-series approach is actually the best conceptualization for a project. There are many sophisticated techniques for this type of work. However, in most cases, it is both simpler and more reliable to include the currently known value of the Outcome variable as a Condition variable in estimating the next unknown value of the Outcome variable. From there, we can complete either a forecasting project or a classification project, as discussed above.
The classic example for time series problems is the movement of stock prices. We could approach a project for modeling prices of a financial instrument by attempting to determine what its price should be based on a set of appropriate Condition attributes, or we could add the Condition attribute of Current Price and change our project conceptualization to estimate the change from the current price.
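This reframing can be sketched in a few lines. Given a sequence of prices (made up here), we pair each current price with the change to the next price, which gives us records suitable for a forecasting project; labeling each change as "up" or "not up" gives records suitable for a classification project. The function names and the labels are our own, for illustration only.

```python
def to_change_records(prices):
    """Turn a price sequence into (current_price, change_to_next_price) pairs."""
    return [(current, nxt - current) for current, nxt in zip(prices, prices[1:])]


def to_direction_records(prices):
    """Same pairing, but with a categorical Outcome: 'up' or 'not up'."""
    return [
        (current, "up" if change > 0 else "not up")
        for current, change in to_change_records(prices)
    ]


# Made-up price history.
prices = [10.0, 10.5, 10.25, 11.0]

change_records = to_change_records(prices)        # forecasting-style Outcome
direction_records = to_direction_records(prices)  # classification-style Outcome
```

In practice, other Condition attributes would sit alongside the current price in each record.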
Clustering: Identification of Mathematically Correct Groupings of Data
Our first three project types were similar in that we had the luxury of providing our algorithms with a historical "right" answer in their search for a solution to our problems. Our fourth type of project is based on what is commonly referred to as "unsupervised learning." That is, we do not know what the value of our Outcome should be at any given time.
These projects are often undertaken due to their value in current big data projects. Specifically, they are of value when we have "fat data" and our first concern is with determining which Condition fields in our data have information content related to our Outcome.
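To illustrate grouping records without a historical "right" answer, here is a minimal sketch using k-means on a single numeric field. K-means is only one of many clustering algorithms and is our choice for the example, not a method the article prescribes; the values and the starting centers are made up.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one numeric field, starting from the given centers.

    Returns the final centers and the clusters from the last assignment pass.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Made-up values for one Condition field, and arbitrary starting centers.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centers, final_clusters = kmeans_1d(values, centers=[0.0, 5.0])
```

No Outcome field is supplied anywhere in this procedure; the groupings come entirely from how the values sit relative to one another.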
Summary
This article introduced the four core project types that are the foundation for virtually all of our analytics projects.
For the majority of our projects, we recommend that the project team attempt to conceptualize the project as a Classification project. It is the least technically demanding, it avoids many of the issues of inconsistency in human behavior, and it tends to be the easiest to introduce to individuals who are relatively new to quantitative approaches to decision modeling. It also tends to provide significant performance enhancements in the least amount of time.
Mr. Rathburn is a senior consultant and director of training at The Modeling Agency. He holds a strong track record of innovation and creativity in the application of predictive analytics in business environments. Tony has assisted commercial and government clients internationally in the development and implementation of applied analytics solutions since the mid-1980s. He is a regular presenter of the Data Mining and Predictive Analytics tracks at TDWI World Conferences, as well as being engaged by a number of software vendors to present the practical implementation aspects of their tools. Mr. Rathburn can be reached at [email protected].