TDWI Articles

Great Machine Learning Needs Careful Data Engineering

A new TDWI Checklist Report examines best practices for data engineering and management to support machine learning with a focus on collecting, cleansing, transforming, and governing new and big data for analysis.

In a new TDWI Checklist Report, "Five Data Engineering Requirements for Enabling Machine Learning," Fern Halper, vice president and senior director of TDWI Research for advanced analytics, notes how a new generation of data is reinvigorating interest in AI and machine learning -- and posing new challenges for enterprises of all sizes.

Machine learning does what its name implies -- it is a system that learns to identify patterns by examining data. There are two approaches: supervised (where the system is given the desired target and learns to predict that outcome based on attributes) and unsupervised (where there are no predefined outcomes and the system looks for structure in the data on its own). Once trained, a model is tested against additional data to make sure it is valid.
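
To make the distinction concrete, here is a minimal sketch in plain Python. It is not taken from Halper's report: the toy data, the nearest-neighbor rule, and the two-group clustering routine are all assumptions chosen to keep the example small. The supervised half learns from labeled rows and is checked against held-back data; the unsupervised half groups unlabeled values with no target at all.

```python
# Toy illustration only: the data, labels, and function names are hypothetical.

def nearest_neighbor_predict(train, x):
    """Supervised: predict the label of the closest labeled training point."""
    closest = min(train, key=lambda pair: abs(pair[0] - x))
    return closest[1]

def two_means(values, iterations=10):
    """Unsupervised: split unlabeled 1-D values into two groups (k-means, k=2).

    Assumes the input contains at least two distinct values.
    """
    low, high = min(values), max(values)
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)

# Supervised: each training row is (monthly_spend, outcome) with a known target.
train = [(10, "churn"), (12, "churn"), (80, "stay"), (95, "stay")]
holdout = [(15, "churn"), (70, "stay")]  # additional data held back for validation
correct = sum(nearest_neighbor_predict(train, x) == y for x, y in holdout)
accuracy = correct / len(holdout)

# Unsupervised: the same kind of values, but with no labels supplied.
clusters = two_means([10, 12, 15, 70, 80, 95])
```

Real projects would typically rely on an established machine learning library rather than hand-written routines; the point here is only the difference in what each approach is given.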

Although still in the early mainstream phase of adoption, machine learning is being deployed in a wide range of use cases, including recommendation engines, fraud detection, churn analysis, and cybersecurity. The technology isn't new -- it's been around since the 1990s. As Halper points out, "the advent of big data has, in several important ways, both revitalized machine learning and increased the complexity of using these models to drive insight and action."

The challenge is moving from the model-building "training" phase to full production. "Data engineers must create robust production data pipelines to feed machine learning models the increasing amounts of disparate data they require," Halper explains.

The report discusses best practices for data engineering and management to support machine learning, with a focus on collecting, cleansing, transforming, and governing "new" and big data for analysis. Although organizations may have used rules-based AI systems based on heuristics in the past, they are now moving to automated discovery against vast volumes of disparate data.

Best Practices Lead to Better Results

For machine learning, more is better -- having more data brings more accurate results, and having widely diverse data is better still. Whether rich, new data sources are internal or external to the organization, two popular platforms are proving their worth when it comes to managing data for model building: data lakes and the cloud. Data management platforms also need to handle a new set of sourcing strategies to deal with different ingestion patterns (such as streaming data) and enable data enrichment (such as including metadata or geocoding).
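
As a rough illustration of enrichment during streaming ingestion, the sketch below tags each incoming record with ingestion metadata and looks up coordinates for a city name. Everything in it is an assumption made for the example: the field names, the `enrich_stream` function, and the small `CITY_COORDINATES` table, which merely stands in for a real geocoding service.

```python
from datetime import datetime, timezone

# Hypothetical lookup table standing in for a real geocoding service.
CITY_COORDINATES = {"seattle": (47.61, -122.33), "austin": (30.27, -97.74)}

def enrich_stream(records, source_name, now=None):
    """Yield each incoming record with ingestion metadata and coordinates added."""
    for record in records:
        enriched = dict(record)
        # Metadata enrichment: where the record came from and when it arrived.
        enriched["_source"] = source_name
        enriched["_ingested_at"] = (now or datetime.now(timezone.utc)).isoformat()
        # Geocoding enrichment: None when the city is not in the lookup table.
        city = record.get("city", "").lower()
        enriched["coordinates"] = CITY_COORDINATES.get(city)
        yield enriched

# An iterator simulates records arriving one at a time.
events = iter([{"user": "a", "city": "Seattle"}, {"user": "b", "city": "Unknown"}])
fixed_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
enriched_events = list(enrich_stream(events, "clickstream", now=fixed_time))
```

Because the function is a generator, records are enriched as they arrive rather than after the whole batch has been loaded.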

Of course, low-quality data leads to low-quality machine learning results. To that end, Halper suggests seeking out tools that can ensure standardization and accuracy. "The good news is that more vendor solutions are now using advanced technologies such as artificial intelligence to identify (and often correct) data problems."
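
A simple rules-based version of such a check might look like the following sketch. It is far less sophisticated than the AI-assisted vendor tools Halper mentions, and the record fields, the `COUNTRY_ALIASES` table, and the validation rules are all hypothetical.

```python
# Hypothetical alias table; a real deployment would use a maintained reference list.
COUNTRY_ALIASES = {"us": "US", "usa": "US", "united states": "US", "uk": "GB", "gb": "GB"}

def standardize_record(record):
    """Return a cleaned copy of the record plus a list of detected problems."""
    cleaned = dict(record)
    problems = []

    # Standardization: trim whitespace and normalize case.
    email = (record.get("email") or "").strip().lower()
    cleaned["email"] = email
    # Accuracy: a deliberately crude validity check for illustration.
    if "@" not in email:
        problems.append("invalid email")

    # Standardization: map free-text country names to a single code.
    country_key = (record.get("country") or "").strip().lower()
    cleaned["country"] = COUNTRY_ALIASES.get(country_key)
    if cleaned["country"] is None:
        problems.append("unknown country")

    return cleaned, problems

cleaned, problems = standardize_record({"email": "  Ana@Example.COM ", "country": "usa"})
bad, bad_problems = standardize_record({"email": "no-at-sign", "country": "Atlantis"})
```

Returning the problems alongside the cleaned record lets a pipeline decide whether to correct, quarantine, or reject a row instead of silently dropping it.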

Data for model building must also be up to date, Halper warns. Currency matters both when building the initial model and afterward, to ensure that the model doesn't become stale -- like automobiles, models occasionally need to be tuned up.
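
One way a team might decide when a tune-up is due is sketched below. The report does not prescribe this rule: the `needs_retraining` function, the 90-day age limit, and the five-point accuracy tolerance are illustrative assumptions only.

```python
from datetime import date

def needs_retraining(trained_on, today, baseline_accuracy, recent_accuracy,
                     max_age_days=90, max_accuracy_drop=0.05):
    """Flag a model as stale if it is old or has degraded on recent data.

    The default thresholds are placeholders, not recommendations.
    """
    too_old = (today - trained_on).days > max_age_days
    degraded = (baseline_accuracy - recent_accuracy) > max_accuracy_drop
    return too_old or degraded

fresh = needs_retraining(date(2024, 1, 1), date(2024, 2, 1), 0.90, 0.89)
aged = needs_retraining(date(2024, 1, 1), date(2024, 6, 1), 0.90, 0.89)
drifted = needs_retraining(date(2024, 1, 1), date(2024, 2, 1), 0.90, 0.80)
```

In this sketch, a model trained a month ago that still performs near its baseline is left alone, while one that is five months old or has lost ten points of accuracy is flagged.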

Data engineers and data scientists must be able to engineer the right features for the model, which often requires access to disparate data sources. Halper says that newly derived features need to be stored and persisted to whatever data store the organization is using for analysis, and the calculations necessary to re-create the features must be tracked.
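
The idea of tracking the calculations behind each derived feature can be sketched with a small registry. The feature names, formulas, and the `FEATURE_DEFINITIONS` structure below are hypothetical; a production system would persist both the definitions and the computed values to the organization's data store.

```python
# Registry of derived features: name -> description and the function that computes it.
FEATURE_DEFINITIONS = {}

def feature(name, description):
    """Decorator that records how a derived feature is calculated."""
    def register(func):
        FEATURE_DEFINITIONS[name] = {"description": description, "compute": func}
        return func
    return register

@feature("avg_order_value", "total_spend / order_count (0 when there are no orders)")
def avg_order_value(row):
    return row["total_spend"] / row["order_count"] if row["order_count"] else 0.0

@feature("is_high_value", "1 if total_spend >= 1000 else 0")
def is_high_value(row):
    return 1 if row["total_spend"] >= 1000 else 0

def add_features(row):
    """Return a copy of the row with every registered feature computed."""
    enriched = dict(row)
    for name, definition in FEATURE_DEFINITIONS.items():
        enriched[name] = definition["compute"](row)
    return enriched

customer = {"customer_id": 7, "total_spend": 1200.0, "order_count": 4}
enriched = add_features(customer)
```

Because every feature is registered with its description and its function, the same definitions can be replayed later to re-create the features on new data.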

Finally, great machine learning models also require data governance. For governance to work, your enterprise will need to invest in processes as well as tools. Two key tooling areas Halper recommends are management of metadata (data descriptions including data types and structures) and attention to data lineage (which describes where data originated and how it has been changed and transformed).
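
As a minimal sketch of what metadata and lineage tracking capture, the example below records a data set's schema, its origin, and each transformation applied to it. The `DatasetRecord` class, its fields, and the file and data set names are assumptions made for illustration, not the design of any particular governance tool.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    name: str
    schema: dict   # metadata: column name -> data type
    source: str    # lineage: where the data originated
    transformations: list = field(default_factory=list)  # lineage: changes, in order

    def derive(self, new_name, step, new_schema=None):
        """Describe a new data set produced by applying one more transformation."""
        return DatasetRecord(
            name=new_name,
            schema=dict(new_schema or self.schema),
            source=self.source,
            transformations=self.transformations + [step],
        )

raw = DatasetRecord("orders_raw", {"order_id": "int", "amount": "str"},
                    source="crm_export.csv")
clean = raw.derive("orders_clean", "cast amount to float",
                   {"order_id": "int", "amount": "float"})
features = clean.derive("orders_features", "aggregate amount per customer")
```

Each derived record carries the full history forward, so anyone inspecting `features` can see both where the data came from and every step that changed it.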

Halper's report includes dozens of concrete recommendations that will help any enterprise, large or small, start off on the right foot with machine learning. The full report is available from TDWI; visitors new to TDWI must complete a short, one-time registration for access.

About the Author

James E. Powell is the editorial director of TDWI, overseeing research reports, the Business Intelligence Journal, and the Upside newsletter.

