TDWI FlashPoint: Exclusive Excerpt for The Modeling Agency Subscribers

We'd like to extend a special welcome to TMA's newsletter subscribers! Below, you'll find an excerpt from Keith McCormick's recent TDWI FlashPoint article, published in the January 14 issue.

The Importance of Data Preparation to Predictive Analytics

In building a data warehouse, organizations expend considerable effort on data cleansing, and most predictive analytics modeling is done with data that resides in a data warehouse or data mart. Why, then, does data preparation still take weeks and weeks during predictive analytics projects? In short, because a great deal of the work is not related to data cleansing.

Data Prep: High-Priority Activities

Data prep doesn’t simply involve the drudgery of cleansing data. If it did, automated data preparation technologies would be more successful. Much of the work depends on the specific modeling problem, so you can anticipate only some of the problems in advance.

How do data scientists spend all those weeks on a lengthy project? Here are just three examples of high-priority activities:

  1. Making sure the modeling data set reflects the problem. In their book Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schönberger and Kenneth Cukier emphasize that these days, with our powerful computers, there is never any sampling; rather, N=all. In other words, they argue, give us everything and run it through the model. This is never the case on the projects that I work on. I have to start with and examine all of the data, certainly, but the data I actually model reflects the data I will deploy the model on. For example, if I’m generating a model to incentivize business-to-consumer transactions, it is natural that I will remove business-to-business transactions.
  2. Aggregating the data. The data is already rolled up, so why do we need to worry about that? Because it was rolled up and aggregated for another purpose: to support routine reporting. The internal client was the finance team or management, not the analysts. Analysts will have completely different data needs that cannot be anticipated until the modeling project is defined. This almost always means going to the most granular level possible and rolling the data back up again. For example, during a cellphone project years ago, I was trying to model customer behavior. For obvious reasons, phone activity that generated a charge on the billing statement was readily available, but activity that was free of charge, under a bundle of services, was not. For something as simple as the number of text messages a month, whether customers paid for them or not, I needed raw data.
  3. Constructing the data. Data “construction” is perhaps the most important activity of all. Modeling algorithms are not magic; it takes a lot of work to put data into a form the analytical techniques can exploit. Dates are a great example. Data warehouses store dates, and many rookies in predictive analytics just throw them into the model. Only in the most extreme cases will you get lucky enough to find something. Veteran data miners know to calculate the distances between dates and then discard the raw date information. Even a handful of key dates could generate dozens of date distances to calculate and explore.
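
The three activities above can be sketched in a few lines of code. This is a minimal illustration using only the Python standard library, not the author's actual workflow: the record layout, the field names (segment, kind, billed, day), the sample events, and the helper functions are all hypothetical and exist only to make the ideas concrete.

```python
from collections import Counter
from datetime import date
from itertools import combinations

# Hypothetical raw, event-level records (illustrative data only).
events = [
    {"customer": "A", "segment": "B2C", "kind": "text", "billed": True,  "day": date(2023, 1, 5)},
    {"customer": "A", "segment": "B2C", "kind": "text", "billed": False, "day": date(2023, 1, 20)},
    {"customer": "A", "segment": "B2C", "kind": "call", "billed": True,  "day": date(2023, 1, 21)},
    {"customer": "B", "segment": "B2B", "kind": "text", "billed": True,  "day": date(2023, 1, 7)},
    {"customer": "C", "segment": "B2C", "kind": "text", "billed": False, "day": date(2023, 2, 2)},
]


def select_population(rows, segment="B2C"):
    """Activity 1: keep only rows that match the deployment population."""
    return [r for r in rows if r["segment"] == segment]


def monthly_counts(rows, kind="text"):
    """Activity 2: roll granular events back up per customer and month.

    Billed and free events are both counted, which a billing-oriented
    summary table would not allow.
    """
    counts = Counter()
    for r in rows:
        if r["kind"] == kind:
            counts[(r["customer"], r["day"].year, r["day"].month)] += 1
    return dict(counts)


def date_distances(key_dates):
    """Activity 3: replace raw dates with day counts between every pair."""
    return {
        f"days_{a}_to_{b}": (key_dates[b] - key_dates[a]).days
        for a, b in combinations(key_dates, 2)
    }


b2c_events = select_population(events)
texts_per_month = monthly_counts(b2c_events)

# Hypothetical key dates for one customer.
features = date_distances({
    "signup": date(2023, 1, 1),
    "first_purchase": date(2023, 1, 15),
    "last_contact": date(2023, 3, 1),
})
print(texts_per_month)
print(features)
```

Because every pair of dates yields one candidate feature, n key dates produce n(n-1)/2 distances; eight key dates already give 28, which is how a handful of dates turns into dozens of variables to explore.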

Summary

It is often observed that 70–90 percent of a typical predictive analytics project is dedicated to data preparation. Although we’ve presented only a partial list of the activities here, and entire books have been dedicated to the subject, it offers a glimpse into what all of those data scientists are doing behind the scenes.

Distributed monthly via email to thousands of BI/DW professionals, TDWI FlashPoint features unique how-to articles, key findings from TDWI Research, and tips on building and managing BI/DW teams. Written by TDWI Premium Members, fellows, and instructors, the focus is on timely BI and DW issues.

If you are interested in reading the full article, we invite you to become a TDWI Premium Member. TDWI Premium Membership comes with a wide range of benefits, including:

  • A comprehensive selection of industry research, news, and information
  • Access to all of TDWI's current and archived research and publications in password-protected areas of the TDWI website
  • Discounts on TDWI Conferences, seminars, and CBIP exams

Thank you for considering Premium Membership with TDWI! Please send us your questions and feedback.

About the Author

Keith McCormick is a highly accomplished professional consultant, mentor, and trainer, having served as keynote and moderator at international conferences focused on analytics practitioners and leadership alike.

Keith has built predictive analytics models since the 1990s using widely adopted advanced analytics tools such as IBM SPSS Statistics, IBM SPSS Modeler, and KNIME, as well as other popular open-source machine learning tools. He has also authored seven books on the effective use of predictive analytics software and techniques.

He has guided organizations to establish highly effective analytical practices across industries, including the public sector, media, marketing, healthcare, retail, finance, manufacturing, and higher education.

Keith serves as data science principal on the Data Science and AI team at Further, a business services and consulting firm.

