Machine Learning that Automates Data Management Tasks and Processes
Machine learning is not just for predictive analytics. It can also be embedded within tools to automate data management development and optimize execution.
- By Philip Russom
- July 30, 2018
Today, most machine learning (ML) efforts support predictive analytics, especially when the analytics parse vast amounts of diverse big data. This is an important practice, and it will continue to grow and mature.
However, a few cutting-edge vendors and open-source projects are embedding ML-driven intelligence into data management (DM) tools. Embedded within DM tools, ML algorithms and models typically address three broad goals:
- Automation for well-understood but time-consuming development tasks, such as mapping sources to targets, cataloging data, or onboarding new sources
- Optimization of system performance, by automatically selecting query optimization strategies, table join approaches, resource management schemes, and distribution methods for data (e.g., hot versus cold storage, memory versus disk, or replication across nodes)
- Capacity management via workload-aware autoscaling, spot instance purchasing, and integrating node types in heterogeneous clusters
Machine learning is highly valuable in these contexts because it increases developer productivity, makes advanced functions accessible to lightly technical users, and elevates system performance with minimal administrator involvement. Due to these compelling benefits, TDWI expects that within a few years most DM functions will be automated or optimized via ML and other approaches (e.g., rules engines). Here are a few examples.
Data cataloging. Modern tools can catalog and categorize data automatically via machine learning algorithms and models as well as via old-school business rules and application logic. Cataloging can apply to data sources, datasets, tables, or even individual columns and fields. A single data element can be categorized by its domain, compliance risk, quality level, source, lineage, and so on, as the user organization requires. Cataloging each data element multiple ways enriches user searches and queries of the catalog, and it enables richer cross-category analytics correlations.
Data domains. ML algorithms and other tool logic can recognize and catalog data sources and structures that are of particular domains. This helps users who will browse or search the catalog for domains of high interest, such as the customer, product, and financial domains. Advanced algorithms can even detect domains and domain relationships across datasets. ML algorithms can also recognize and catalog data elements that are potentially sensitive in terms of privacy and compliance.
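As a rough illustration of domain and sensitivity tagging, the sketch below labels a column by the share of its values that match a known pattern. It is a rule-based stand-in, not a trained ML model, and every name in it (PATTERNS, SENSITIVE, tag_column, the 0.8 threshold) is a hypothetical choice made for this example rather than part of any particular DM tool.

```python
import re

# Hypothetical value patterns; a real tool would learn or configure these.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "currency": re.compile(r"^\$?\d+(,\d{3})*(\.\d{2})?$"),
}

# Assumption for this sketch: these tags count as privacy-sensitive.
SENSITIVE = {"email", "us_phone"}


def tag_column(values, threshold=0.8):
    """Return (tag, is_sensitive) for the best-matching pattern.

    A tag is assigned only when at least `threshold` of the non-empty
    values match; otherwise the result is (None, False).
    """
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return None, False

    best_tag, best_share = None, 0.0
    for tag, pattern in PATTERNS.items():
        share = sum(1 for v in cleaned if pattern.match(v)) / len(cleaned)
        if share > best_share:
            best_tag, best_share = tag, share

    if best_share >= threshold:
        return best_tag, best_tag in SENSITIVE
    return None, False
```

A catalog could store the returned tag alongside the column so that users can search by domain or filter for potentially sensitive fields.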
Data lineage. ML algorithms can parse large volumes of complex data (even data distributed across multiple data platforms) to record data pathways and cluster data elements and datasets of common origin. With these details, users can quickly trace data provenance and perform impact analysis.
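Once data pathways have been recorded, provenance and impact analysis reduce to graph traversal. The following minimal sketch assumes lineage is available as (source, target) pairs; the function names and dataset names are hypothetical, and discovering the pathways themselves (the part ML would handle) is out of scope here.

```python
from collections import defaultdict, deque


def build_lineage(edges):
    """Build downstream and upstream adjacency maps.

    `edges` is an iterable of (source, target) pairs, meaning that
    `target` is derived from `source`.
    """
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(graph, start):
    """Return every node reachable from `start` (breadth-first search)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # exclude the start node if a cycle leads back to it
    return seen


# Hypothetical dataset names, for illustration only.
edges = [
    ("crm.customers", "staging.customers"),
    ("staging.customers", "dw.dim_customer"),
    ("erp.orders", "dw.fact_orders"),
    ("dw.dim_customer", "reports.churn"),
]
downstream, upstream = build_lineage(edges)
impact = reachable(downstream, "crm.customers")    # what a change would affect
provenance = reachable(upstream, "reports.churn")  # where a report's data came from
```

Traversing the downstream map answers impact questions, while traversing the upstream map answers provenance questions.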
Metadata management. With big data, IoT, and other new sources that are notoriously devoid of metadata, a modern DM tool with ML embedded can parse data and deduce credible metadata. The tool can suggest a metadata structure to a data developer for approval or log that structure in a metadata repository without human intervention.
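To make the idea of deducing metadata concrete, here is a minimal sketch that infers a column type from raw string values and proposes a schema for review. It uses simple parsing heuristics rather than ML. The function names, the type labels, and the single ISO date format are all assumptions for illustration.

```python
from datetime import datetime


def infer_type(values):
    """Guess a type label for a list of raw string values."""
    cleaned = [v.strip() for v in values if v is not None and v.strip()]
    if not cleaned:
        return "unknown"

    def all_parse(parser):
        # True only if every value parses without a ValueError.
        for v in cleaned:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    # Assumption: dates arrive as YYYY-MM-DD; other formats fall through.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"


def suggest_schema(rows):
    """Propose {column: type} from a list of dicts holding string values."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return {name: infer_type(vals) for name, vals in columns.items()}
```

The suggested schema could then be shown to a data developer for approval or written directly to a metadata repository, mirroring the two options described above.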
Data mappings. Time-consuming source-to-target mappings can now be performed by ML models and algorithms. ML's accuracy and breadth increase as it watches successful users map manually. Automated mappings increase the productivity of data developers, data scientists, and data-savvy business users.
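A simple way to picture automated mapping is name similarity combined with a memory of mappings that users have already approved. The sketch below uses Python's difflib for string similarity in place of a trained model. The suggest_mappings function, its confirmed parameter, and the 0.6 cutoff are hypothetical choices for this example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a column name and drop separators such as '_' or spaces."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest_mappings(source_cols, target_cols, confirmed=None, min_score=0.6):
    """Suggest a target column for each source column.

    `confirmed` holds mappings a user has already approved; they take
    precedence, which is a crude stand-in for learning from manual work.
    Columns with no sufficiently similar target map to None.
    """
    confirmed = confirmed or {}
    suggestions = {}
    for src in source_cols:
        if src in confirmed:
            suggestions[src] = confirmed[src]
            continue
        best, best_score = None, 0.0
        for tgt in target_cols:
            score = SequenceMatcher(None, normalize(src), normalize(tgt)).ratio()
            if score > best_score:
                best, best_score = tgt, score
        suggestions[src] = best if best_score >= min_score else None
    return suggestions
```

For instance, calling suggest_mappings with a confirmed mapping of "zip" to "postal_code" would reuse that decision even though the two names share almost no characters.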
Data-anomaly detection. ML has the potential to spot and react to data defects, such as outliers, nonstandard data, and various data quality issues. Some tools go beyond detection to automatically remediate data quality issues, based on ML models or encoded business rules.
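Outlier detection is the easiest of these defects to sketch. The example below flags numeric values using the modified z-score based on the median absolute deviation (MAD), a standard robust statistic and not a technique the report specifically names. The function name and the 3.5 threshold (a commonly cited default) are assumptions for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds `threshold`.

    Modified z-score = 0.6745 * |x - median| / MAD, where MAD is the
    median of absolute deviations from the median.
    """
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values are identical; this score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]
```

Because the median and MAD are barely affected by extreme values, a single bad reading does not mask itself the way it can with a mean and standard deviation.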
Upcoming use cases for the ML automation and optimization of DM. In the near future, catalog-based ML will also contribute to data security, governance, capacity planning, system performance, and guided data exploration.
This article is excerpted from the final section of the 2018 TDWI Checklist Report "The Automation and Optimization of Advanced Analytics Based on Machine Learning." The entire report is available online.
Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 600 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at firstname.lastname@example.org, @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.