TDWI FlashPoint: Exclusive Excerpt for The Modeling Agency Subscribers
We'd like to extend a special welcome to TMA's newsletter subscribers! Below, you'll find an excerpt from Keith McCormick's recent TDWI FlashPoint article, published in the January 12 issue.
Enterprise-Level Data Preparation: The Need for Team Collaboration
By Keith McCormick
Senior Consultant, The Modeling Agency, LLC
Everyone knows that data preparation is labor intensive. Increasingly, word has been getting out that some data preparation can be done at the enterprise level to support self-service analytics, which is true. In this article, I focus on data preparation in support of predictive analytics models built by data scientists.
A typical predictive analytics project might draw from a half-dozen to two-dozen data sources. Projects that combine data not typically joined during the course of normal business operations have a much higher likelihood of discovering relationships that can drive real change. For example, visits to the company website might not be routinely combined with inbound calls to a customer service center because they might be managed by different departments. Integrating such sources would seem to fall naturally to IT, but it must be a collaborative effort to be productive.
Quite simply, there are always more business intelligence (BI) end users in any organization than data scientists. Existing tables and views were almost certainly designed to meet the needs of most users and to support routine reporting. They will never be perfectly suited to the needs of any particular predictive analytics project. However, integration is usually the most computationally intensive step. Even if there is a modest number of customers (perhaps only a few million), there might be hundreds of millions of transactions to aggregate in order to create a “customer signature.” In most cases, therefore, the modeler has to be involved in the design, but IT is probably involved in the execution even if the modeler prototypes the solution on a sample of the data.
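To make the idea of a customer signature concrete, here is a minimal sketch of the aggregation a modeler might prototype on a sample before handing the design to IT. The field names (customer_id, date, amount) and the three summary measures are assumptions chosen for illustration; they are not taken from the article, and a real signature would typically carry many more variables.

```python
from collections import defaultdict

def build_customer_signatures(transactions):
    """Collapse many transaction rows into one summary row per customer.

    Field names and summary measures are hypothetical, for illustration only.
    """
    signatures = defaultdict(
        lambda: {"txn_count": 0, "total_amount": 0.0, "last_date": None}
    )
    for txn in transactions:
        sig = signatures[txn["customer_id"]]
        sig["txn_count"] += 1
        sig["total_amount"] += txn["amount"]
        # ISO-formatted dates (YYYY-MM-DD) compare correctly as plain strings.
        if sig["last_date"] is None or txn["date"] > sig["last_date"]:
            sig["last_date"] = txn["date"]
    return dict(signatures)

# A tiny made-up sample standing in for hundreds of millions of rows.
transactions = [
    {"customer_id": "C001", "date": "2016-11-02", "amount": 40.0},
    {"customer_id": "C001", "date": "2016-12-15", "amount": 60.0},
    {"customer_id": "C002", "date": "2016-10-20", "amount": 25.0},
]
signatures = build_customer_signatures(transactions)
```

The prototype runs comfortably on a sample; at full volume the same logic would more likely be expressed as a grouped aggregation in the database, which is where IT's involvement in the execution comes in.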
Data cleaning probably seems like a no-brainer. Why not let IT do the cleaning for everyone and do it in advance? That way, the self-service folks benefit, BI runs more smoothly, and the data scientists are happy, too. Unfortunately, it is not as simple as that. Yes, addressing typos, string/numeric conversion issues, problems with dates, and leading zeros up front benefits everyone. However, we just discussed how integration needs to be a collaboration, and the act of integration inevitably creates some data cleaning issues.
A common example is the creation of nulls when a customer has no transactions or when transactions can’t be matched to a customer. Even if the data going into integration is perfectly clean (it rarely is), cleaning still needs to be redone. On most projects, the post-integration cleaning is prototyped by the data scientist and then operationalized by IT in preparation for the deployment phase of the project.
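The following sketch illustrates how integration itself can create nulls even when both inputs are clean, and one possible way to handle them afterward. The record layout, the helper names (integrate, clean_after_integration), and the choice to fill missing counts with zero plus an indicator flag are all assumptions made for illustration; the right treatment depends on the project.

```python
def integrate(customers, signatures):
    """Left-join customers to their signatures (hypothetical layout).

    Returns the joined rows and the signature ids that match no customer.
    """
    known_ids = {c["customer_id"] for c in customers}
    rows = []
    for c in customers:
        sig = signatures.get(c["customer_id"])
        rows.append({
            "customer_id": c["customer_id"],
            # Customers with no transactions get nulls here.
            "txn_count": sig["txn_count"] if sig else None,
            "total_amount": sig["total_amount"] if sig else None,
        })
    orphans = sorted(cid for cid in signatures if cid not in known_ids)
    return rows, orphans

def clean_after_integration(rows):
    """Replace join-created nulls with zeros and record that it happened."""
    cleaned = []
    for row in rows:
        had_txns = row["txn_count"] is not None
        cleaned.append({
            "customer_id": row["customer_id"],
            "txn_count": row["txn_count"] if had_txns else 0,
            "total_amount": row["total_amount"] if had_txns else 0.0,
            "has_transactions": had_txns,
        })
    return cleaned

customers = [{"customer_id": "C001"}, {"customer_id": "C003"}]
signatures = {
    "C001": {"txn_count": 2, "total_amount": 100.0},
    "C999": {"txn_count": 1, "total_amount": 25.0},  # matches no customer
}
rows, orphans = integrate(customers, signatures)
cleaned = clean_after_integration(rows)
```

Here customer C003 ends up with nulls because it has no transactions, and C999 surfaces as an unmatched signature. Neither problem existed in the inputs; both appear only once the sources are combined, which is why the cleaning has to be revisited after integration.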
CRISP-DM defines the format tasks as “primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool.” Efficiency dictates that this probably be done at the enterprise level before it gets to the modeler. Nevertheless, clearly there is a dilemma. If the problem is defined as meeting the requirements of the modeling tool, how can it be done in advance if the IT team isn’t a user of the modeling tool?
Without carelessly shifting a burden from one team to another, it is prudent to have the IT team sit in on short briefings of the modeling tool. Even if it is only a half-day orientation, quick solutions to nagging problems might be uncovered and resolved. Something as simple as one software solution being case sensitive while another is not might inspire new naming conventions and save a lot of time downstream.
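As one illustration of the case-sensitivity point, the sketch below checks a set of column names for collisions that a case-insensitive tool would introduce, and applies a lowercase snake_case naming convention. The convention itself, the helper names, and the sample column names are assumptions for illustration, not a recommendation from the article.

```python
import re

def normalize_name(name):
    """Convert a column name to lowercase snake_case (one possible convention)."""
    # Insert an underscore where a lowercase letter or digit meets an uppercase letter.
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse any run of non-alphanumeric characters into a single underscore.
    s = re.sub(r"[^A-Za-z0-9]+", "_", s)
    return s.strip("_").lower()

def find_case_collisions(names):
    """Group names that differ only by letter case.

    Such names are distinct in a case-sensitive tool but would clash
    in a case-insensitive one.
    """
    groups = {}
    for name in names:
        groups.setdefault(name.lower(), []).append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}

# Hypothetical column names from two different source systems.
columns = ["CustomerID", "customerid", "TotalAmount"]
collisions = find_case_collisions(columns)
normalized = [normalize_name(c) for c in ["CustomerID", "Customer ID", "TotalAmount"]]
```

A check like this is cheap to run at the enterprise level once the IT team knows which behavior the modeling tool has.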
Distributed monthly via email to thousands of BI/DW professionals, TDWI FlashPoint features unique how-to articles, key findings from TDWI Research, and tips on building and managing BI/DW teams. Written by TDWI Premium Members, fellows, and instructors, the focus is on timely BI and DW issues.
If you are interested in reading the full article, we invite you to become a TDWI Premium Member. TDWI Premium Membership comes with a wide range of benefits, including:
- A comprehensive selection of industry research, news, and information
- Access to all of TDWI's current and archived research and publications in password-protected areas of the TDWI website
- Discounts on TDWI Conferences, seminars, and CBIP exams
Thank you for considering Premium Membership with TDWI! Please send us your questions and feedback.
About the Author
Keith McCormick is a senior consultant and trainer at The Modeling Agency, LLC. He is a highly seasoned, career-long practitioner in predictive analytics. His forthcoming fourth book, Effective Data Preparation for Predictive Analytics, coauthored with Bob Nisbet for Cambridge University Press, will be released in 2017. Keith presents multiple courses at TDWI conferences and will present four at the TDWI Las Vegas 2017 Conference:
• February 15: Serious Play for Predictive Analytics
• February 16 (half-day): Data Enrichment in Predictive Analytics
• February 16 (half-day): Data Construction for Analytic Modeling
• February 17: Data Preparation for Predictive Analytics