4 Reasons to Use Graphs to Optimize Machine Learning Data Engineering
Semantic knowledge graphs accelerate data engineering for machine learning, helping you maximize results.
- By Sean Martin
- November 9, 2018
Throughout the data ecosystem, organizations are beginning to realize the worth of an enterprise information fabric that uses semantic technology for business understanding to provide uniform access to all data assets -- regardless of how dispersed they are. Forrester Research's recent "Big Data Fabric Wave" report identifies data fabrics as a viable means of dealing with the distributed complexities of big data.
One of the innate benefits of using a semantic approach to construct a data fabric is the ability to harmonize all data into an enterprise knowledge graph (predicated on business meaning) that's perfect for machine learning data engineering. Traditionally, data preparation was a data science and machine learning bottleneck; it was so time-consuming it limited the impact of this valuable technology.
However, knowledge graphs accelerate this process in four key ways to maximize machine learning data engineering results:
- Training data. Graph algorithms provide a richer source of training data than other analytics approaches.
- Unstructured data. Graphs are difficult to surpass for harmonizing unstructured and structured data.
- Feature engineering. Semantic graphs drastically decrease the time and effort of feature engineering, partly due to automated query generation.
- Traceability. When operationalizing machine learning models, graphs provide an immutable provenance chain that retraces the data's journey from the initial model testing phase, making it easier to recreate that journey in production.
These four characteristics effectively unclog the data science bottleneck hampering machine learning throughout the enterprise, enabling organizations to focus on benefiting from this technology instead of preparing for it.
Optimal Model Input Data
Because graph algorithms are so robust at determining the number and nature of relationships between data elements, they deliver newer, richer sources of input data for machine learning models than traditional means can. Running graph analytics (such as clustering), or simply issuing a query that asks how data objects (such as people, places, or products) are related, exploits this granular understanding of data relationships.
In relational settings, users must determine the relationships between data elements and issue queries for confirmation; with graphs, you simply ask what the relationships are. Graphs provide additional, more comprehensive sources for input data, and this broader data set significantly improves model training. This difference is indispensable for relationship-dependent sources such as a patient's pharmaceuticals, symptoms, related disease research, and charting information from wearable devices. In graph settings, you simply ask for the relationships between these factors and others; the answers themselves could function as predictors for machine learning models.
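To make the idea of relationships as predictors concrete, here is a minimal sketch in plain Python. The toy edge list, the relationship names, and the helper functions are all hypothetical; a real knowledge graph would be queried through its own engine, not an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical toy graph: (subject, relationship, object) edges.
EDGES = [
    ("patient_1", "takes", "drug_a"),
    ("patient_1", "reports", "fatigue"),
    ("patient_2", "takes", "drug_a"),
    ("patient_2", "takes", "drug_b"),
    ("patient_2", "reports", "fatigue"),
    ("patient_3", "takes", "drug_b"),
]


def build_index(edges):
    """Index outgoing edges: subject -> set of (relationship, object)."""
    index = defaultdict(set)
    for subject, relationship, obj in edges:
        index[subject].add((relationship, obj))
    return index


def relationship_features(index, node):
    """Count a node's outgoing edges per relationship type."""
    counts = defaultdict(int)
    for relationship, _ in index[node]:
        counts[relationship] += 1
    return dict(counts)


def shared_neighbors(index, a, b):
    """Objects both nodes link to, regardless of relationship type."""
    objs_a = {obj for _, obj in index[a]}
    objs_b = {obj for _, obj in index[b]}
    return objs_a & objs_b


index = build_index(EDGES)
features = relationship_features(index, "patient_2")
overlap = shared_neighbors(index, "patient_1", "patient_2")
```

Here `features` holds per-relationship counts for one patient and `overlap` holds what two patients have in common; either could be fed to a model as a candidate input column.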
Harmonizing Unstructured and Structured Data
Graphs are peerless at aligning unstructured, structured, and semistructured data. When you consider how much of today's unstructured data remains unmined, this advantage is particularly valuable. For example, semantic graphs easily harmonize the unstructured data gleaned from text analytics with data from traditional tabular databases. The linked-data approach of semantic graphs aligns data sets seamlessly, allowing additional data sources to be considered simultaneously when looking for the best variables to help make predictions.
Largely due to the flexible nature of graph technology, it's easy to start with virtually any data set and readily add others when preparing machine learning models.
Best of all, this harmonization is based on the business meaning of data -- another consequence of the linked-data approach. Semantic graphs are predicated on standards-based data models to which all data (structured or otherwise) adheres. Those ontologies provide a common business meaning for data regardless of originating source or format. When analyzing unstructured sources such as text, there's no telling what organizations might uncover. Semantic graphs ensure that whatever the results are, they'll be harmonized with structured data and the business meaning underpinning their value to the enterprise. Existing data sets described using open standard graph descriptions are also much easier to reuse in any combination.
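As a rough illustration of harmonizing by business meaning, the sketch below maps a tabular source and the output of a text-analytics step onto the same predicates, so both end up in one set of triples. The `ex:` terms, the sample records, and the function names are assumptions made for this example, not part of any particular ontology or product.

```python
# Hypothetical ontology terms shared by every source.
HAS_DIAGNOSIS = "ex:hasDiagnosis"
TAKES_MEDICATION = "ex:takesMedication"

# Structured source: rows from a tabular database (illustrative data).
ROWS = [
    {"patient_id": "p1", "medication": "drug_a"},
    {"patient_id": "p2", "medication": "drug_b"},
]

# Unstructured source: entities a text-analytics step might emit
# (illustrative; the extraction itself is out of scope here).
TEXT_MENTIONS = [
    {"patient_id": "p1", "entity_type": "diagnosis", "value": "hypertension"},
    {"patient_id": "p1", "entity_type": "medication", "value": "drug_c"},
]

# Map each extracted entity type onto the shared ontology predicate.
ENTITY_TO_PREDICATE = {
    "diagnosis": HAS_DIAGNOSIS,
    "medication": TAKES_MEDICATION,
}


def triples_from_rows(rows):
    """Turn table rows into (subject, predicate, object) triples."""
    return {
        ("ex:" + r["patient_id"], TAKES_MEDICATION, "ex:" + r["medication"])
        for r in rows
    }


def triples_from_mentions(mentions):
    """Turn extracted text entities into triples using the same predicates."""
    return {
        ("ex:" + m["patient_id"], ENTITY_TO_PREDICATE[m["entity_type"]], "ex:" + m["value"])
        for m in mentions
    }


def objects_of(graph, subject, predicate):
    """All objects linked from a subject by a predicate, sorted for stability."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


# Both sources land in one graph because they share predicates.
graph = triples_from_rows(ROWS) | triples_from_mentions(TEXT_MENTIONS)
p1_medications = objects_of(graph, "ex:p1", TAKES_MEDICATION)
```

Because both sources use the same predicate for medications, a single lookup returns values that came from the table and from the text without any source-specific handling.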
Feature Engineering
Perhaps the biggest differentiator of properly applied knowledge graphs in data preparation is the acceleration -- and automation -- of feature engineering. Feature engineering is the process whereby data scientists identify the data attributes that predict the desired outcome of a machine learning model; it's essential for model accuracy. Time-consuming data preparation and inefficient feature engineering often go hand in hand, and together they slow the production of machine learning models. It's little surprise, then, that nearly three-fourths of data scientists view data prep and feature engineering as the least enjoyable part of their jobs.
Graphs can expedite feature engineering and feature selection partly because of automatic query generation and transformation capabilities. Accelerating this part of model engineering lets teams work with more features, which can improve model accuracy. By assisting data scientists and engineers with the transformations necessary for feature engineering, graphs shorten the process from days and weeks to hours.
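The article does not describe how query generation is implemented, so the following is only one plausible sketch: a template that turns a list of candidate property names into a SPARQL query with one column per feature. The function name, the `ex:` prefix, the example IRI, and the property names are hypothetical.

```python
def build_feature_query(entity_class, feature_properties,
                        prefix_iri="http://example.org/"):
    """Generate a SPARQL SELECT with one column per candidate feature.

    Assumes every feature is a direct property of the entity in a
    hypothetical `ex:` namespace.
    """
    variables = ["?entity"] + ["?" + name for name in feature_properties]
    select_clause = " ".join(variables)
    lines = [
        "PREFIX ex: <" + prefix_iri + ">",
        "SELECT " + select_clause,
        "WHERE {",
        "  ?entity a ex:" + entity_class + " .",
    ]
    for name in feature_properties:
        # OPTIONAL keeps entities that lack a value for this feature.
        lines.append("  OPTIONAL { ?entity ex:" + name + " ?" + name + " . }")
    lines.append("}")
    return "\n".join(lines)


query = build_feature_query("Patient", ["age", "medication"])
print(query)
```

Adding a candidate feature then means adding a name to the list; nobody has to hand-write a new query, which is the kind of saving the paragraph above points to.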
Traceability
Traceability, also known as data lineage or data provenance, is pivotal for ensuring that accuracy and consistency in production match what was achieved while training machine learning models. Models are trained with specific input data that delivers equally specific outputs. As such, most initial models are brittle and require data as similar as possible to that used during their training. The provenance recorded in graph databases illustrates the flow of data used to train models. This lineage provides a road map for recreating the data's journey once models are put into production. Traceability shows how to reconstruct the data flow to leverage models without having to rebuild or substantially tweak them.
When building a machine learning model to predict patient outcomes for a specific medication or prescription, for example, a host of information about that specific patient -- potentially contained in scores of tables and documents -- must be encapsulated within that model. Provenance demonstrates just how it was captured and what processes took place, which is invaluable when operationalizing models.
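One way to picture a provenance chain is as an ordered log of preparation steps that can be checked for tampering and replayed on production data. The sketch below uses hash-linked records for this; that mechanism, the step names, and the sample rows are assumptions chosen for illustration, not a description of how any graph database stores lineage.

```python
import hashlib
import json


def _digest(payload):
    """Stable SHA-256 of a JSON-serializable payload."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_step(chain, step_name, params):
    """Return a new chain with a step whose hash covers the previous hash."""
    previous_hash = chain[-1]["hash"] if chain else ""
    body = {"step": step_name, "params": params, "previous": previous_hash}
    return chain + [dict(body, hash=_digest(body))]


def verify(chain):
    """Check that no recorded step was altered or reordered."""
    previous_hash = ""
    for record in chain:
        body = {"step": record["step"], "params": record["params"],
                "previous": previous_hash}
        if record["previous"] != previous_hash or record["hash"] != _digest(body):
            return False
        previous_hash = record["hash"]
    return True


def replay(chain, operations, data):
    """Re-run the recorded steps, in order, on new data."""
    for record in chain:
        data = operations[record["step"]](data, **record["params"])
    return data


# Hypothetical preparation steps used during training.
OPERATIONS = {
    "drop_missing": lambda rows, field: [r for r in rows if r.get(field) is not None],
    "scale": lambda rows, field, factor: [dict(r, **{field: r[field] * factor}) for r in rows],
}

chain = []
chain = append_step(chain, "drop_missing", {"field": "dose_g"})
chain = append_step(chain, "scale", {"field": "dose_g", "factor": 1000})

# In production, replay the same journey on fresh data.
production_rows = [{"dose_g": 2}, {"dose_g": None}]
prepared = replay(chain, OPERATIONS, production_rows)
```

Replaying the recorded steps gives production data the same treatment the training data received, and `verify` flags a chain whose steps were changed after the fact.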
A Final Word
The graph approach expedites machine learning data engineering for more effective models than are otherwise possible. It accelerates this process by rapidly harmonizing unstructured data alongside semistructured and structured data; automated query generation considerably reduces the time required for feature engineering and feature selection.
Moreover, graphs make the preparation process more effective by offering a new, relationship-savvy source of training data and by providing a provenance chain that keeps delivering value when models are operationalized. This combination enables organizations to optimize data engineering so they can concentrate on machine learning's value.
About the Author
Sean Martin serves as chief technology officer at Cambridge Semantics. In his career, Martin has pioneered the use of semantic technologies and enterprise knowledge graphs to solve data integration and application development problems. Prior to founding Cambridge Semantics in 2007, he spent 15 years with IBM Corporation, where he was a founder of the IBM Advanced Internet Technology Skunkworks group.