Data Requirements for Machine Learning
Machine learning can enable new forms of predictive analytics and embed algorithm-driven intelligence into many software applications. However, none of that is possible without the right data, captured and processed the right way.
By Philip Russom
September 14, 2018
Machine learning algorithms consume and process large volumes of data to learn complex patterns about people, business processes, transactions, events, and so on. This intelligence is then incorporated into a predictive model. Comparisons to the model can reveal whether an entity is operating within acceptable parameters or is exhibiting an anomaly.
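To make "comparisons to the model" concrete, here is a minimal Python sketch. It assumes a deliberately simple model (the mean and standard deviation of historical values) and flags any new value that falls far outside that range. The function names, the sample figures, and the threshold of three standard deviations are illustrative assumptions; production models are usually far richer than this.

```python
import statistics

def fit_baseline(history):
    """Learn a very simple 'model': the mean and standard deviation of past values."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomaly(value, model, threshold=3.0):
    """Return True if value lies more than `threshold` standard deviations from the mean."""
    mean, stdev = model
    if stdev == 0:
        # All historical values were identical, so any different value is unusual.
        return value != mean
    return abs(value - mean) / stdev > threshold

# Hypothetical daily transaction totals used as training data.
history = [100, 102, 98, 101, 99, 103, 97, 100]
model = fit_baseline(history)

print(is_anomaly(101, model))  # within acceptable parameters
print(is_anomaly(250, model))  # far outside the learned range
```

The same pattern (fit on history, then score new observations against what was learned) carries over to more capable models; only the fitting and scoring steps change.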
Today, machine learning is used to solve well-bounded tasks such as classification and clustering. Note that a machine learning algorithm learns from so-called training data during development; it also learns continuously from real-world data during deployment so the algorithm can improve its model with experience.
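The two learning phases can be sketched in Python under some simplifying assumptions. The "model" here is just a running mean and variance, updated one observation at a time with Welford's method; the class name `RunningModel` and the sample values are hypothetical. The point is only that the same update step serves both the initial training data and later production data.

```python
class RunningModel:
    """Running mean and sample variance, updated incrementally (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new observation into the model without storing past data."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far (0.0 until two values are seen)."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

model = RunningModel()

# Development: learn from training data.
for value in [10.0, 12.0, 11.0, 13.0]:
    model.update(value)
mean_after_training = model.mean

# Deployment: keep learning from real-world data as it arrives.
for value in [20.0, 22.0]:
    model.update(value)

print(mean_after_training, model.mean)
```

Because each update needs only the current summary, the model can keep absorbing production data indefinitely without rereading the original training set.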
Machine learning has a voracious appetite for data during both development and production, making unique demands of an organization's infrastructure for data management.
Data Requirements for Successful Machine Learning
#1: Large, diverse data sets
The development of a machine learning algorithm depends on large volumes of data, from which the learning process draws many entities, relationships, and clusters. To broaden and enrich the correlations made by the algorithm, machine learning needs data from diverse sources, in diverse formats, about diverse business processes.
For the most comprehensive learning experience, provide diverse training data (integrated from multiple sources, concerning various business entities, and collected across multiple time frames) so that the algorithm's assessments are more realistic and accurate, and more likely to succeed in production. Once in production, a machine learning algorithm continues to read large, diverse data sets to keep its model up to date and growing.
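As a rough illustration of integrating training data from multiple sources, the Python sketch below merges records from two hypothetical systems into one row per customer. The source names (`crm`, `orders`), the field names, and the `customer_id` key are assumptions made up for this example. Real integration work would also have to handle conflicting values, missing keys, and differing formats.

```python
from collections import defaultdict

def integrate(*sources, key="customer_id"):
    """Merge records from several sources into one combined record per entity.

    If two sources supply the same field, the later source wins.
    """
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            merged[record[key]].update(record)
    return dict(merged)

# Hypothetical source 1: customer attributes from a CRM system.
crm = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "wholesale"},
]

# Hypothetical source 2: order counts across two time frames.
orders = [
    {"customer_id": 1, "orders_2017": 4, "orders_2018": 7},
]

training_rows = integrate(crm, orders)
print(training_rows[1])
print(training_rows[2])
```

Customer 2 has no order history in this sample, so its row carries only CRM fields; gaps like this are typical when sources are combined and usually need explicit handling before training.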
Savvy organizations are deploying tools for multiple types of analytics (not just machine learning), because each type tells them something unique and valuable. Each of these analytics approaches needs data that is prepared and presented in a certain way that is optimal for the analytics tool or the user practice involved. Machine learning algorithms are almost always optimized for raw, detailed source data. Thus, the data environment must provision large quantities of raw data for discovery-oriented analytics practices such as data exploration, data mining, statistics, and machine learning.
#2: Large, diverse infrastructure for data management
Infrastructure for training data for machine learning typically involves multiple data platforms, tools, and processing engines, ranging from traditional (relational and columnar databases) to modern (Hadoop, Spark, and cloud storage). Multiple technologies are required to cope with training data's extreme size, multiple data structures, and (in some cases) multiple latencies. Tools for machine learning are obviously important, but data management infrastructure is just as important.
There are many ways to provision training and production data for machine learning. This data can come from multiple platforms in the extended data infrastructure, but the trend is toward consolidating as much data as possible into a data lake designed for machine learning and other forms of advanced analytics. In a related trend, data lakes are moving toward elastic clouds for reasons of automation, optimization, and economics.
Data management infrastructure can be vast. It can include platforms and tools for data warehousing, data lakes, data integration, data preparation, multiple forms of analytics, and big data. New data platforms are emerging as well, dominated by clouds, open source engines, open source libraries and languages, and self-service tools. That is a long list of platforms, technologies, and processing engines, but all of it is required for modern organizations that want to operate and compete on analytics and intelligence.
Finally, when organizations already have big data infrastructure in place, adding machine learning extends the life cycle and business value of the infrastructure.
To Go In-Depth
Portions of this article were adapted from the 2018 TDWI Checklist Report "The Automation and Optimization of Advanced Analytics Based on Machine Learning." Read the complete report for more information about machine learning and its data requirements.
Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 600 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at firstname.lastname@example.org, @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.