TDWI FlashPoint: Exclusive Excerpt for The Modeling Agency Subscribers
- By Keith McCormick
- January 15, 2016
We'd like to extend a special welcome to TMA's newsletter subscribers! Below, you'll find an excerpt from Keith McCormick's recent TDWI FlashPoint article, published in the January 14 issue.
The Importance of Data Preparation to Predictive Analytics
In building a data warehouse, organizations expend considerable effort in data cleansing. Most predictive analytics modeling is done with data that resides in a data warehouse or data mart. Why does it still take weeks and weeks to perform data preparation during predictive analytics projects? In short, it is because a great deal of the work is not related to data cleansing.
Data Prep: High-Priority Activities
Data prep doesn’t simply involve the drudgery of cleansing data. If it were only that, automated data preparation technologies would be more successful. You can only anticipate some of the problems in advance.
How do data scientists spend all those weeks on a lengthy project? Here are just three examples of high-priority activities:
- Making sure the modeling data set reflects the problem. In their book Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schönberger and Kenneth Cukier emphasize that these days, with our powerful computers, there is never any sampling; rather, N=all. In other words, they argue, give us everything and run it through the model. This is never the case on the projects that I work on. I have to start with and examine all of the data, certainly, but the data I actually model reflects the data I will deploy the model on. For example, if I’m generating a model to incentivize business-to-consumer transactions, it is natural that I will remove business-to-business transactions.
- Aggregating the data. The data is already rolled up, so why do we need to worry about that? It’s because the data was rolled up and aggregated for another purpose: to support routine reporting. The internal client was the finance team or management, not the analysts. Analysts have completely different data needs, and those needs cannot be anticipated until the modeling project is defined. This almost always means going to the most granular level possible and rolling the data back up again. For example, during a cellphone project years ago, I was trying to model customer behavior. For obvious reasons, phone activity that generated a charge on the billing statement was readily available, but activity that was free of charge under a bundle of services was not. For something as simple as the number of text messages a month, whether customers paid for them or not, I needed raw data.
- Data “construction” is perhaps the most important activity of all. Modeling algorithms are not magic; it takes a lot of work to put the data into a form that the analytical techniques can use. Dates are a great example. Data warehouses store dates, and many rookies in predictive analytics just throw them into the model. Only in the most extreme cases will you get lucky enough to find something. Veteran data miners know to calculate the distances between dates and then discard the raw date information. Even a handful of key dates could generate dozens of date distances to calculate and explore.
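The three activities above can be illustrated with a minimal sketch in plain Python. It is not taken from the article: the record layouts, field names (segment, kind, billed), sample values, and function names are all hypothetical, chosen only to show the shape of each step (selecting rows that match the deployment population, rolling raw events back up for modeling, and turning key dates into distances).

```python
from collections import Counter
from datetime import date
from itertools import combinations

# Hypothetical raw records; every field name and value here is illustrative.
transactions = [
    {"customer": "c1", "segment": "B2C", "amount": 40.0},
    {"customer": "c2", "segment": "B2B", "amount": 900.0},
    {"customer": "c3", "segment": "B2C", "amount": 15.5},
]

events = [
    {"customer": "c1", "day": date(2015, 3, 2), "kind": "text", "billed": True},
    {"customer": "c1", "day": date(2015, 3, 9), "kind": "text", "billed": False},
    {"customer": "c1", "day": date(2015, 4, 1), "kind": "text", "billed": False},
    {"customer": "c3", "day": date(2015, 3, 5), "kind": "call", "billed": True},
    {"customer": "c3", "day": date(2015, 3, 7), "kind": "text", "billed": False},
]

key_dates = {
    "signup": date(2014, 1, 10),
    "first_purchase": date(2014, 2, 9),
    "last_contact": date(2015, 1, 10),
}


def select_modeling_rows(rows, segment="B2C"):
    """Keep only the rows that resemble the data the model will be deployed on."""
    return [r for r in rows if r["segment"] == segment]


def monthly_text_counts(rows):
    """Roll raw events up to (customer, year, month) text-message counts.

    Billed and unbilled messages are both counted, which a billing-oriented
    summary would not provide.
    """
    counts = Counter()
    for r in rows:
        if r["kind"] == "text":
            counts[(r["customer"], r["day"].year, r["day"].month)] += 1
    return dict(counts)


def date_distances(dates):
    """Return the signed number of days between every pair of key dates.

    Only the distances are returned; the raw dates themselves are dropped.
    """
    names = list(dates)  # keeps the insertion order of the input dict
    return {
        f"days_{a}_to_{b}": (dates[b] - dates[a]).days
        for a, b in combinations(names, 2)
    }


modeling_rows = select_modeling_rows(transactions)
text_counts = monthly_text_counts(events)
distances = date_distances(key_dates)
```

With n key dates, the pairwise approach yields n(n-1)/2 distances, so eight dates already produce 28 candidate variables, which is in line with the "dozens" mentioned above. Which of those distances are worth keeping is a modeling decision that this sketch does not address.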
It is a famous observation that 70–90 percent of a typical predictive analytics project is dedicated to data preparation. Although we’ve presented only a partial list of the activities, and entire books have been dedicated to the subject, it offers a glimpse into what all of those data scientists are doing behind the scenes.
Distributed monthly via email to thousands of BI/DW professionals, TDWI FlashPoint features unique how-to articles, key findings from TDWI Research, and tips on building and managing BI/DW teams. Written by TDWI Premium Members, fellows, and instructors, the focus is on timely BI and DW issues.
If you are interested in reading the full article, we invite you to become a TDWI Premium Member. TDWI Premium Membership comes with a wide range of benefits, including:
- A comprehensive selection of industry research, news, and information
- Access to all of TDWI's current and archived research and publications in password-protected areas of the TDWI website
- Discounts on TDWI Conferences, seminars, and CBIP exams
Thank you for considering Premium Membership with TDWI! Please send us your questions and feedback.
Keith McCormick has leveraged statistical and machine learning software since the early 1990s and has deep expertise using popular commercial and open source solutions involving structured data, text, and big data analytics. McCormick guides organizations to establish highly effective analytical practices across industries, including public sector, media, marketing, healthcare, retail, finance, manufacturing, and higher education. He is a highly accomplished professional mentor and trainer, having served as keynote speaker and moderator at international conferences focused on practitioners and leadership alike. He possesses a unique blend of tactical and strategic skills along with the business acumen to ensure superior project design, oversight, and outcomes that align with organizational targets.