TDWI Articles

AI and BI Projects Are Bogged Down With Data Preparation Tasks

We offer four best practices that will help you do data prep the right (and efficient) way.

IBM is reporting that data quality challenges are a top reason why organizations are reassessing (or ending) artificial-intelligence (AI) and business intelligence (BI) projects.

For Further Reading:

How to Cut Data Preparation Time for Visualization Tools

Four Data Preparation Trends to Watch in 2019

Accessible Data Preparation: 6 Data Quality Tips

Arvind Krishna, IBM’s senior vice president of cloud and cognitive software, stated in a recent interview with the Wall Street Journal, “about 80% of the work with an AI project is collecting and preparing data. Some companies aren’t prepared for the cost and work associated with that going in. And you say: ‘Hey, wait a moment, where’s the AI? I’m not getting the benefit.’ And you kind of bail on it.” [1]

Many businesses are not prepared for the cost and effort associated with data preparation (DP) when starting AI and BI projects. To compound matters, hundreds of data and record types and billions of records are often involved in a project’s DP effort.

However, data analytics projects are increasingly imperative to organizational success in the digital economy, hence the need for DP solutions.

What Is AI/BI Data Preparation?

Gartner defines data preparation as “an iterative and agile process for exploring, combining, cleaning, and transforming raw data into curated datasets for data integration, data science, data discovery, and analytics/business intelligence (BI) use cases.” [2]

A 2019 International Data Corporation (IDC) study [3] reports that data workers spend a remarkable share of their time each week on data-related activities: 33 percent on data preparation compared to 32 percent on analytics (and, sadly, just 13 percent on data science). The top challenge, cited by more than 30 percent of all data workers in the study, was that “too much time is spent on data preparation.”

The variety of data sources, the multiplicity of data types, the enormity of data volumes, and the numerous uses for data analytics and business intelligence all add complexity to each project. Consequently, today’s data workers often need several tools to complete their DP work.

Capabilities Needed in Data Preparation Tools

Evidence in the Gartner Research report, Market Guide for Data Preparation Tools [4], shows that implementing DP tools can cut the time spent on data preparation, and on reporting the information discovered during DP, by more than half.

In the same research report, Gartner lists details of vendors and DP tools. The analyst firm predicts that the market for DP solutions will reach $1 billion this year, with nearly a third (30 percent) of IT organizations employing some type of self-service data preparation tool set.

Another Gartner Research Circle Survey [5] on data and analytics trends revealed that over half (54 percent) of respondents want and need to automate their data preparation and cleansing tasks during the next 12 to 24 months.

To accelerate data understanding and improve trust, data preparation tools should have certain key capabilities [4], including the ability to:

  • Extract and profile data. Typically, a data prep tool uses a visual environment that enables users to interactively extract, search, sample, and prepare data assets.
  • Create and manage data catalogs and metadata. Tools should be able to create and search metadata as well as track data sources, data transformations, and user activity against each data source. They should also keep track of data source attributes, data lineage, relationships, and APIs. All of this enables access to a metadata catalog for data auditing, analytics/BI, data science, and other operational use cases.
  • Support basic data quality and governance features. Tools must be able to integrate with other tools that support data governance/stewardship and data quality criteria.

Getting Started with Data Preparation: Best Practices

The challenge is getting good at DP. As a recent report by business intelligence pioneer Howard Dresner found, 64 percent of respondents constantly or frequently perform end-user DP, but only 12 percent reported they were very effective [3]. Nearly 40 percent of data professionals spend half of their time prepping data rather than analyzing it.

Following are a few of the practices that help ensure effective DP for your AI and BI projects. Many more are available from data preparation service and product suppliers [6, 7].

Best Practice #1: Decide which data sources are needed to meet AI and BI requirements

Take these three general steps to data discovery:

  1. Identify the data needed to meet required business tasks.
  2. Identify potential internal and external sources of that data (and include its owners).
  3. Confirm that each source will be available at the required frequency.
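
The three steps above can be captured in a simple source inventory. The following is a minimal sketch in Python, not a prescribed method: the DataSource fields, the cadence ranking, and the example sources and owners are all hypothetical and only illustrate how the checks might be recorded.

```python
from dataclasses import dataclass

# Hypothetical ranking of refresh cadences, from most to least frequent.
CADENCE_RANK = {"hourly": 0, "daily": 1, "weekly": 2, "monthly": 3}


@dataclass
class DataSource:
    name: str       # step 1: data needed for a business task
    owner: str      # step 2: who owns the source
    internal: bool  # step 2: internal or external source
    refresh: str    # step 3: how often the source is actually refreshed


def meets_frequency(source: DataSource, required: str) -> bool:
    """Return True if the source refreshes at least as often as required."""
    return CADENCE_RANK[source.refresh] <= CADENCE_RANK[required]


def unavailable_sources(sources, required):
    """List the names of sources that cannot meet the required cadence."""
    return [s.name for s in sources if not meets_frequency(s, required)]


# Illustrative inventory; the names and owners are made up.
inventory = [
    DataSource("crm_accounts", "Sales Ops", True, "daily"),
    DataSource("web_clickstream", "Marketing", True, "hourly"),
    DataSource("market_benchmarks", "External vendor", False, "monthly"),
]

too_slow = unavailable_sources(inventory, required="daily")
print(too_slow)
```

Keeping the owner next to each source makes it easier to know whom to ask when a source turns out not to meet the required frequency.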

Best Practice #2: Identify tools for data analysis and preparation

It will be necessary to load data sources into DP tools so the data can be analyzed and manipulated. It’s important to get the data into an environment where it can be closely examined and readied for the next steps.

Best Practice #3: Profile potential and selected source data

This is a vital (but often discounted) step in DP. A project must analyze source data before it can be properly prepared for downstream consumption. Beyond simple visual examination, you need to profile data, detect outliers, and find null values (and other unwanted data) among sources.

The primary purpose of this profiling analysis is to decide which data sources are even worth including in your project. As data warehouse guru Ralph Kimball writes in his book, The Data Warehouse Toolkit [8], “Early disqualification of a data source is a responsible step that can earn you respect from the rest of the team.”
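
As an illustration of the profiling step, the sketch below computes a null rate and flags outliers for a single numeric column using only the Python standard library. It is one possible approach under stated assumptions: the 1.5 × IQR outlier rule is a common rule of thumb rather than something the sources cited here prescribe, and the worth_including threshold and the sample values are hypothetical.

```python
import statistics


def profile_column(values):
    """Summarize one column: row count, null rate, and IQR-based outliers.

    `values` is a list in which None marks a missing value; the
    remaining entries are assumed to be numeric.
    """
    present = [v for v in values if v is not None]
    null_rate = (len(values) - len(present)) / len(values) if values else 0.0

    outliers = []
    if len(present) >= 4:
        # Quartiles from the standard library; anything beyond
        # 1.5 * IQR from the quartiles is flagged as an outlier.
        q1, _, q3 = statistics.quantiles(present, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        outliers = [v for v in present if v < low or v > high]

    return {"rows": len(values), "null_rate": null_rate, "outliers": outliers}


def worth_including(profile, max_null_rate=0.2):
    """Hypothetical disqualification rule: reject sources with too many nulls."""
    return profile["null_rate"] <= max_null_rate


# Made-up order totals with one missing value and one suspicious value.
order_totals = [120.0, 95.5, None, 110.0, 101.25, 9800.0, 99.0, 102.5, 105.0, 98.0]
profile = profile_column(order_totals)
print(profile)
print("include source:", worth_including(profile))
```

A real project would profile every column of every candidate source, but even a summary this small gives the team evidence for keeping or disqualifying a source early.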

Best Practice #4: Cleanse and screen source data

Based on your knowledge of the end business analytics goal, experiment with different data cleansing strategies that will get the relevant data into a usable format. Start with a small, statistically valid sample so you can iteratively experiment with different data prep strategies, refine your record filters, and discuss the results with business stakeholders.

Once you find what seems to be a good DP approach, take time to rethink the subset of data you really need to meet the business objective. Running your data prep rules on the entire data set will be very time consuming, so think critically with business stakeholders about which entities and attributes you do and don’t need and which records you can safely filter out.
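
The sample-then-refine loop described above might look like the following sketch. It assumes a simple random sample (choosing a sample size that is statistically valid for your data is a separate question not covered here), and the cleansing rules, record filter, and field names are hypothetical examples rather than recommended rules.

```python
import random


def sample_records(records, k, seed=42):
    """Draw a reproducible simple random sample of up to k records."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def cleanse(record):
    """Hypothetical cleansing rules: trim text, normalize case, parse amounts."""
    amount = record.get("amount")
    return {
        "customer": (record.get("customer") or "").strip().title(),
        "region": (record.get("region") or "").strip().upper(),
        "amount": float(amount) if amount not in (None, "") else None,
    }


def keep(record):
    """Hypothetical record filter: drop rows with no customer or no amount."""
    return bool(record["customer"]) and record["amount"] is not None


def prepare(records):
    """Cleanse, then filter, and report the share of rows that survive."""
    cleaned = [cleanse(r) for r in records]
    kept = [r for r in cleaned if keep(r)]
    survival_rate = len(kept) / len(records) if records else 0.0
    return kept, survival_rate


# Made-up raw records standing in for a much larger source.
raw = [
    {"customer": "  acme corp ", "region": "emea", "amount": "1200.50"},
    {"customer": "", "region": "apac", "amount": "300"},
    {"customer": "globex", "region": " amer", "amount": ""},
    {"customer": "initech", "region": "amer", "amount": "75"},
]

trial = sample_records(raw, k=3)      # experiment on a small sample first
kept, survival_rate = prepare(raw)    # then run the agreed rules more widely
print(len(trial), kept, survival_rate)
```

The survival rate is a useful number to bring to stakeholders: if a filter discards far more records than expected, the rule (or the source) deserves another look before it runs on the full data set.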

Final Thoughts

Proper and thorough data preparation, conducted from the start of an AI/BI project, leads to faster, more efficient AI and BI down the line. The DP steps and processes outlined here apply to whatever technical setup you are using -- and they will get you better results.

Note that DP is not a “do once and forget” task. Data is constantly generated from multiple sources that may change over time, and the context of your business decisions will certainly change over time. Partnering with data preparation solution providers is an important consideration for the long-term capability of your DP infrastructure.

References

[1] The Wall Street Journal, AI Projects Bogged Down in Data Preparation, May 29, 2019, p. B3

[2] Gartner IT Glossary

[3] Alteryx, IDC Data Preparation, Analytics and Science Survey February 2019

[4] Gartner Research, Market Guide for Data Preparation Tools 2019

[5] Gartner Research, Gartner Survey Shows Organizations Are Slow to Advance in Data and Analytics

[6] Import.io, 10 Best Practices in Data Preparation

[7] Paxata, 10 Hard Questions to Make Your Choice of Self-Service Data Prep Easy

[8] Ralph Kimball, The Data Warehouse Toolkit
