TDWI FlashPoint: Exclusive Excerpt for The Modeling Agency Subscribers
We'd like to extend a special welcome to TMA's newsletter subscribers! Below, you'll find an excerpt from "Reallocation of Data Resources for Predictive Analytics Projects" by Thomas A. "Tony" Rathburn. This article was published in the September 2013 issue of TDWI FlashPoint.
Distributed monthly via e-mail to thousands of BI/DW professionals, TDWI FlashPoint features unique how-to articles, key findings from TDWI Research, and tips on building and managing BI/DW teams. Written by TDWI Premium Members, fellows, and instructors, the focus is on timely BI and DW issues.
If you are interested in reading the full article, we invite you to become a TDWI Premium Member. TDWI Premium Membership comes with a wide range of benefits, including a comprehensive selection of industry research, news, and information; access to all of TDWI's current and archived research and publications in password-protected areas of the TDWI website; and discounts to TDWI World Conferences, Seminars, and CBIP exams.
Thank you for considering Premium Membership with TDWI! Please send us your questions and feedback.
Reallocation of Data Resources for Predictive Analytics Projects
By Thomas A. "Tony" Rathburn
Traditional statistical projects often use a train/test design. This approach is appropriate because most such projects are completed to test a hypothesis about a specific formulation of a solution using a single model.
Predictive analytics projects generally involve the development of many models in the search for a solution that significantly outperforms the current decision-making process. In a business environment, these projects are likely to involve some aspect of human behavior.
Human behavior does not possess the underlying structure of physical systems, and is further complicated by a high level of inconsistency. The combination of these factors means you must add a significant level of validation to your project development effort to ensure the models will perform in a live decision-making environment.
The Train/Test/Validate Project Design
Mutually exclusive data sets are developed for each phase of the project design:
- Training Data Set: Training data is used for the development of challenger models that will compete to replace the current champion model. Whichever model prevails fulfills the decision-making role for the process under analysis.
- Testing Data Set: All developed models are run against a single set of test data. This allows you to evaluate the relative performance of each of the competing models based on your business performance metrics.
- Validation Data Sets: A single model is selected, based on test performance, and advances into validation. Validation studies are completed to allow the analyst to develop a reliable estimate of performance and variance expectations.
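The three phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's procedure: the function names, the random shuffle, the 50/20/30 allocation, and the choice of three validation sets are all hypothetical, since the excerpt does not specify how records are assigned or how many are needed.

```python
import random
from statistics import mean, stdev

def split_records(records, train_frac=0.5, test_frac=0.2, n_validation=3, seed=0):
    """Partition records into mutually exclusive train, test, and validation sets.

    The fractions and the number of validation sets are illustrative
    assumptions, not recommendations from the article.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    rest = shuffled[n_train + n_test:]
    # Deal the remaining records round-robin into several validation sets.
    validation = [rest[i::n_validation] for i in range(n_validation)]
    return train, test, validation

def select_and_validate(models, score, test, validation_sets):
    """Pick the best model on the single test set, then validate only that one.

    `score(model, data)` is a hypothetical stand-in for a business
    performance metric where higher is better.
    """
    best = max(models, key=lambda m: score(m, test))
    # Scoring the selected model on each validation set gives an estimate
    # of expected performance and of its variability.
    scores = [score(best, v) for v in validation_sets]
    return best, mean(scores), stdev(scores)
```

Every candidate model is scored against the same test set, but only the selected model is ever run on the validation sets. That separation is what keeps the validation estimate from being biased by the selection step.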
How Much Data Is Enough?
The number of records to be included in each data set varies depending on the type of project being undertaken, and the nature of the decision process in the business environment. In today’s environments, we tend to use far more data than is required, or desirable, in the development of models. ...
About the Author
Thomas A. “Tony” Rathburn is senior consultant and training director for The Modeling Agency. He has a strong track record of innovation and creativity, with more than two decades of experience applying predictive analytics in business environments and assisting commercial and government clients internationally in developing and implementing applied analytics solutions. He is a regular presenter of the data mining and predictive analytics tracks at TDWI World Conferences.