Minimizing the Complexities of Machine Learning with Data Virtualization
How the features and benefits of data virtualization can make working with data easier and more efficient.
- By Alberto Pan
- September 21, 2018
Data lakes have become the principal data management architecture for data science. A data lake's primary role is to store raw structured and unstructured data in one central location, making it easy for data scientists and other investigative and exploratory users to analyze data.
The data lake can store vast amounts of data affordably. It can potentially store all data of interest to data scientists in a single physical repository, making discovery easier. The data lake can reduce the time data scientists spend on data selection and data integration by storing data in its original form, avoiding transformations designed for specific tasks. The data lake also provides massive computing power so data can be efficiently transformed and combined to meet the needs of each process.
However, when it comes to applying machine learning (ML) in the enterprise, most data scientists still struggle with the complexities of data discovery and integration. In fact, a recent study revealed that data scientists spend as much as 80 percent of their time on these tasks.
Why Challenges Remain
In the same way that it is not easy to find a specific person in a crowded stadium, having all your data in the same physical place does not necessarily make discovery easy. In addition, only a small subset of the relevant data tends to be stored in the lake because data replication from the origin systems is slow and costly. Further complicating matters is the fact that many companies may have hundreds of data repositories distributed across multiple on-premises data centers and cloud providers.
When it comes to data integration, storing data in its original form does not remove the need to adapt it to each machine learning process. Rather, it simply moves the burden of that adaptation to the data scientists. Although the required processing capacity may be available in the lake, data scientists usually do not have the skills needed to integrate data.
Some data preparation tools have emerged in the past few years to make simple integration tasks accessible to data scientists. However, more complex tasks still require advanced skills. IT often needs to come to the rescue by creating new data sets in the data lake for specific ML processes, drastically slowing progress.
Data Virtualization Benefits
To address these challenges, organizations have started to apply new approaches such as data virtualization (DV). DV provides a single access point to any data, regardless of where it is located or what its native format is, without first replicating it in a central repository.
The DV layer can also provide different logical views of the same physical data without creating additional replicas. This provides a fast and inexpensive way of offering different views of the data to meet the unique needs of each type of user and application. These logical views can be created by applying complex data transformation and combination functions on top of the physical data, using sophisticated optimization techniques to achieve the best performance.
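The idea of several logical views over one physical copy can be sketched with ordinary SQL views. The example below is only an illustration: it uses an in-memory SQLite database as a stand-in for a DV layer, and the table, view, and column names are invented for the example.

```python
import sqlite3

# Stand-in for the physical data: a single table, stored once.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "acme", "EU", 120.0), (2, "acme", "US", 80.0), (3, "globex", "EU", 200.0)],
)

# Logical view for a business analyst: revenue per region.
conn.execute(
    """CREATE VIEW revenue_by_region AS
       SELECT region, SUM(amount) AS revenue
       FROM orders GROUP BY region"""
)

# Logical view for a data scientist: per-customer features.
conn.execute(
    """CREATE VIEW customer_features AS
       SELECT customer, COUNT(*) AS n_orders, AVG(amount) AS avg_amount
       FROM orders GROUP BY customer"""
)

# Both views are computed from the same rows at query time; no copy is made.
revenue_by_region = conn.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
customer_features = conn.execute(
    "SELECT customer, n_orders, avg_amount FROM customer_features ORDER BY customer"
).fetchall()
```

Because the views hold no data of their own, a row added to the underlying table is visible through both of them on the next query. A real DV layer would apply the same principle across many remote sources rather than a single local database.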
Specifically, data virtualization helps with the two main challenges in the following ways:
Challenge #1: Data Discovery
DV allows data scientists to access more data. Because data sets do not need to be replicated from their origin systems to be available in the DV system, adding new content is faster and cheaper. DV tools also offer complete flexibility about which data is actually replicated. For one process, you can access all the data in real time from the sources. For another, you can first materialize all required data in a physical repository such as the data lake. For yet another, you can opt for a mixed strategy and materialize only a subset of the data (e.g., data that will be used frequently during the process or that may be useful for many processes).
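The three replication choices described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how any particular DV product works: the `VirtualDataset` class, the `fetch` function, and the sample records are all hypothetical, and the materialized copy here is never refreshed.

```python
from typing import Callable, Dict, Iterable, List

class VirtualDataset:
    """Serves records by key, from a local copy if present, else live from the source."""

    def __init__(
        self,
        fetch_from_source: Callable[[str], dict],
        materialize_keys: Iterable[str] = (),
    ) -> None:
        self._fetch = fetch_from_source
        # Copy the chosen subset up front; everything else stays in the source.
        self._materialized: Dict[str, dict] = {
            key: fetch_from_source(key) for key in materialize_keys
        }

    def get(self, key: str) -> dict:
        if key in self._materialized:
            return self._materialized[key]  # served from the replica
        return self._fetch(key)             # served live from the source


# Hypothetical origin system, with a log of every call made to it.
source = {"a": {"value": 1}, "b": {"value": 2}, "c": {"value": 3}}
calls: List[str] = []

def fetch(key: str) -> dict:
    calls.append(key)
    return source[key]

live = VirtualDataset(fetch)                                  # nothing replicated
full = VirtualDataset(fetch, materialize_keys=source.keys())  # everything replicated
mixed = VirtualDataset(fetch, materialize_keys=["a"])         # frequently used subset only
```

All three objects answer `get` the same way, so the process consuming the data does not need to know which strategy was chosen. The difference is how often the origin system is contacted.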
In addition, best-of-breed DV tools offer a searchable, browsable catalog of all the data sets available through the DV layer. This catalog includes extensive metadata about each data set (for example, tags, column descriptions, and usage information such as who uses each data set, when, and how). The content of the data sets can also be searched and queried directly from this catalog.
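A keyword search over catalog metadata might look roughly like the sketch below. The `CatalogEntry` structure, the `search_catalog` function, and the sample entries are assumptions made for illustration; a real catalog would also carry usage information and would typically rely on a proper search index rather than substring matching.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CatalogEntry:
    name: str
    tags: List[str] = field(default_factory=list)
    columns: Dict[str, str] = field(default_factory=dict)  # column name -> description

def search_catalog(entries: List[CatalogEntry], term: str) -> List[str]:
    """Return names of data sets whose name, tags, or column metadata mention the term."""
    term = term.lower()
    hits = []
    for entry in entries:
        searchable = [entry.name, *entry.tags, *entry.columns.keys(), *entry.columns.values()]
        if any(term in text.lower() for text in searchable):
            hits.append(entry.name)
    return hits

# Hypothetical catalog contents.
catalog = [
    CatalogEntry(
        "crm_customers",
        tags=["customer", "sales"],
        columns={"customer_id": "Unique customer identifier", "segment": "Marketing segment"},
    ),
    CatalogEntry(
        "web_clickstream",
        tags=["web", "behavior"],
        columns={"session_id": "Browser session", "customer_id": "Customer who owns the session"},
    ),
    CatalogEntry(
        "plant_sensors",
        tags=["iot"],
        columns={"reading": "Temperature in Celsius"},
    ),
]

customer_hits = search_catalog(catalog, "customer")
temperature_hits = search_catalog(catalog, "temperature")
```

Searching column descriptions as well as names is what lets a data scientist find `web_clickstream` when looking for customer data, even though nothing in its name suggests it.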
Challenge #2: Data Integration
DV tools expose all data according to a consistent data representation and query model. This means that regardless of whether the data was originally stored in a relational database, a Hadoop cluster, a SaaS application, or a NoSQL system, the data scientist can see all data as if it were stored in a single relational database. This "virtual database" can be accessed through standard methods such as SQL, REST, or OData, so it works with standard tools and languages including R, Scala, Python, and Spark ML, to name a few.
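To make the "single relational database" idea concrete, the sketch below presents a CSV source and a JSON source as two tables and joins them with one SQL query. It is an illustration only: the sample data and names are invented, and for simplicity it copies the data into an in-memory SQLite database, whereas a DV layer would leave the data in place and translate the query for each source.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical sources in different native formats.
CUSTOMERS_CSV = "customer_id,name\n1,Acme\n2,Globex\n"
ORDERS_JSON = (
    '[{"customer_id": 1, "amount": 120.0},'
    ' {"customer_id": 1, "amount": 80.0},'
    ' {"customer_id": 2, "amount": 200.0}]'
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")

# Expose the CSV source as a relational table.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(int(row["customer_id"]), row["name"]) for row in csv.DictReader(io.StringIO(CUSTOMERS_CSV))],
)

# Expose the JSON source as a relational table.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(order["customer_id"], order["amount"]) for order in json.loads(ORDERS_JSON)],
)

# One query model for both sources: plain SQL with a join.
totals = conn.execute(
    """SELECT c.name, SUM(o.amount) AS total
       FROM customers c JOIN orders o ON o.customer_id = c.customer_id
       GROUP BY c.name ORDER BY c.name"""
).fetchall()
```

The query at the end never mentions CSV or JSON. That is the property the data scientist relies on: the same SQL works whatever the original format was.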
DV also enables a clear and cost-effective separation of responsibilities between IT data architects and data scientists. IT data architects can use DV to create "reusable logical data sets" that expose information in ways useful for many processes. Because these logical data sets do not need to physically replicate the data, they can be created and maintained with much less effort than with conventional approaches. Data scientists can then adapt these reusable data sets to meet the needs of each individual ML process. By design, the reusable logical data sets take care of complex issues such as transformations and performance optimization, leaving data scientists to perform the final (and easier) customizations as needed.
Modern DV tools also include advanced governance capabilities so security policies can be enforced centrally, the lineage of the virtual data sets can be preserved, and common transformations and calculations can be reused across multiple ML processes. Data virtualization platforms can also seamlessly expose the results of ML analysis to business users and applications so they can be easily incorporated into business processes and reports.
A Final Word
As machine learning and data lakes continue to proliferate and support modern analytics, data virtualization is the key to drastic productivity improvement for data scientists. It allows them to focus on their core skills instead of data management. DV enables data scientists to access more data and leverage catalog-based data discovery, and it greatly simplifies data integration so the organization can truly benefit from the data at hand.
Alberto Pan is chief technical officer at Denodo, a provider of data virtualization software, and an associate professor at the University of A Coruña. He has led product development for all versions of the Denodo Platform and has authored more than 25 scientific papers in areas such as data virtualization, data integration, and Web automation.