Executive Q&A: Data Virtualization and the Use Cases of Today and Tomorrow
Data virtualization has been around for decades, but confusion with related technologies remains. Denodo’s SVP and CMO Ravi Shankar helps clear things up.
- By Upside Staff
- April 4, 2022
Ravi Shankar, SVP and CMO of Denodo, dispels some common misconceptions about data virtualization and explains the role it plays today, as well as the role it is likely to play in the future.
Upside: There is some confusion about what data virtualization (DV) is and what it is not. For those looking to learn more about it, can you clear up any differences between data virtualization and other technologies that are sometimes confused with it?
Ravi Shankar: Data virtualization has been around for over 20 years, yet some vendors -- primarily those offering a small subset of what data virtualization provides -- claim that they offer data virtualization solutions. The technologies most often confused with data virtualization are data federation and SQL query acceleration.
It’s true that both data federation and data virtualization enable two or more databases, either on premises or in the cloud, to appear as a single database. The difference is that data virtualization establishes an enterprise-wide semantic layer above the disparate data sources, which abstracts away the complexities of data access, such as the need to know where the data is sourced. This semantic layer can be flexibly manipulated to meet a wide variety of use cases without affecting the source data.
Even though the data virtualization layer contains no source data, it holds the critical metadata needed to access each source, so it can provide real-time access to the data without moving it into a consolidated repository. Unlike simple data federation, a DV architecture enables enterprise governance capabilities and the creation of data catalogs that not only list data but also deliver it.
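As a rough illustration of this metadata-only idea (a hypothetical toy, not Denodo’s architecture), the sketch below exposes two independent sources through a single virtual layer. The layer stores only a catalog of where each logical table lives and how to query it; the data stays in the sources and is fetched and combined at query time, with no replication into a central store.

```python
import sqlite3

# Two independent "sources" (stand-ins for separate databases).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany("INSERT INTO invoices VALUES (?, ?)",
                    [(1, 100.0), (1, 250.0), (2, 75.0)])

class VirtualView:
    """Metadata-only layer: maps logical tables to physical sources."""
    def __init__(self):
        self.catalog = {}  # logical name -> (connection, physical query)

    def register(self, name, conn, query):
        self.catalog[name] = (conn, query)

    def fetch(self, name):
        # Data is pulled from the source on demand, never copied ahead of time.
        conn, query = self.catalog[name]
        return conn.execute(query).fetchall()

    def revenue_by_customer(self):
        # Combine both sources at query time across the virtual layer.
        customers = dict(self.fetch("customers"))
        totals = {}
        for cust_id, amount in self.fetch("invoices"):
            totals[cust_id] = totals.get(cust_id, 0.0) + amount
        return {customers[c]: t for c, t in totals.items()}

dv = VirtualView()
dv.register("customers", crm, "SELECT id, name FROM customers")
dv.register("invoices", billing, "SELECT customer_id, amount FROM invoices")
print(dv.revenue_by_customer())  # {'Acme': 350.0, 'Globex': 75.0}
```

The consumer of `revenue_by_customer()` never learns which physical database holds which table -- that is the abstraction the semantic layer provides.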
Although data virtualization includes SQL query acceleration, products that are only SQL query accelerators are not true data virtualization technologies. Such query accelerators usually fast-track the use of data in data lakes for analytics. However, these tools are very specific and limited; they can’t work with multiple types of data sources, such as fast data stores that stream Internet of Things (IoT) data. Also, they do not provide strong security and universal governance capabilities like the data catalogs I mentioned. Finally, their data delivery is limited to SQL analytics, and they do not support operational use cases using APIs.
What is the current state of data virtualization, and can you provide some use cases?
Over the years, data virtualization has evolved into a mature data integration, management, and delivery technology that offers broad capabilities, including hybrid/multicloud data integration, query optimization, advanced semantics, unified security, artificial intelligence/machine learning (AI/ML)-powered recommendations, and enterprise data governance. For instance, data virtualization automates many of the common data integration and management functions using AI/ML. By learning the usage patterns of users and statistics of the queries executed, data virtualization streamlines the development of views with practical guidance. It uses active metadata to, for example, automatically infer relationships, boost performance with refined cost estimations, offer suggestions for joins and transformations, and perform smart autocompletion for frequently used SQL fragments.
Data virtualization also accelerates performance with summary tables. This enables business leaders to ask questions that rely on aggregate information, such as “What were the most profitable products last year in the Americas?” Data virtualization uses summary tables to rapidly return the required results without having to query millions of rows of transactional data. Summary tables are pre-aggregated data sets that are much smaller than the originals and can be transferred quickly over the network to the visualization application. The best part is that business leaders will not even realize that the report or chart they are viewing is built from summary tables.
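The pre-aggregation idea can be sketched in plain Python (a hypothetical toy, not Denodo’s implementation): the detailed transactions are rolled up once into a small summary keyed by year, region, and product, and the aggregate question is then answered from the summary alone rather than by rescanning every transactional row.

```python
from collections import defaultdict

# Stand-in for millions of transactional rows: (year, region, product, profit).
transactions = [
    (2021, "Americas", "Widget", 120.0),
    (2021, "Americas", "Widget", 80.0),
    (2021, "Americas", "Gadget", 150.0),
    (2021, "EMEA",     "Widget", 300.0),
    (2020, "Americas", "Gadget", 500.0),
]

def build_summary(rows):
    """Pre-aggregate profit by (year, region, product) -- the 'summary table'."""
    summary = defaultdict(float)
    for year, region, product, profit in rows:
        summary[(year, region, product)] += profit
    return summary

def most_profitable(summary, year, region):
    """Answer the aggregate question from the small summary alone."""
    candidates = {p: v for (y, r, p), v in summary.items()
                  if y == year and r == region}
    return max(candidates, key=candidates.get)

summary = build_summary(transactions)
print(most_profitable(summary, 2021, "Americas"))  # Widget
```

The summary has one row per (year, region, product) combination, so it stays small and fast to ship to a visualization tool even as the transaction log grows.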
As the concept of data fabric continues to evolve, what role does data virtualization play?
Data fabric emerged as an alternative to the traditional configuration in which all data sits in a single repository, such as a monolithic data warehouse or data lake. In a data fabric, data is distributed across the enterprise, and anyone in the organization can access the data by tapping into any individual “strand” of the fabric.
However, data still needs to be replicated, which takes time, causing the usual frustration. Data virtualization turns a data fabric into a “logical” data fabric, in that data virtualization makes data available in real time without replication. A logical data fabric knits a virtual view of data across applications by leaving it within its original sources while enabling a unified view of all enterprise data.
Data mesh is another hot topic in 2022, especially for organizations looking to modernize their data infrastructures. How does data virtualization support a data mesh architecture?
Data mesh is another concept that has recently emerged as an alternative to the traditional, consolidated paradigm I’ve described. In a data mesh, data is not stored in a single repository or owned by a single group. Instead, data is organized into different “data domains” that are owned and operated by different departments within the organization.
The data domains in a data mesh are not siloed; however, each data domain is supported by a core provisioning platform that enables data to be shared throughout the organization as “data products.” These data products are specially curated for consumption like the products in a grocery store. Just as it is the natural foundation for a logical data fabric, data virtualization is the perfect fit for a data mesh.
By enabling the creation of highly customizable semantic models above an organization’s disparate data sources, data virtualization facilitates establishing full-featured data domains without changing the underlying data. In this way, data virtualization serves as the core provisioning platform of a data mesh, enabling data domains that serve curated, governed data products to the organization at large.
A new concept -- composable data architectures -- is another interesting topic. What does composable mean and what benefits does it offer?
Composable architecture emphasizes the “composers” -- the multiple data-creation centers within an organization. With the proliferation and growing importance of roles such as citizen analysts and citizen integrators, self-service infrastructure creation and self-service analytics are critical for many modern organizations. Certain business units or users are empowered to pick and choose their own low-code/no-code tools to build part (or all) of their required data infrastructure.
Composable architecture also implies a balance between collecting data and connecting to it through a logical infrastructure such as the one data virtualization enables. In this way, a logical data fabric is an inherently composable architecture. Composable data and analytics bring agility to an organization’s data and analytics environment by reducing IT dependency, making business users more self-sufficient, and reducing the time required to build the infrastructure.
Big data is already ubiquitous throughout the industry, yet organizations continue to struggle with using unstructured and structured data together. How do “small data analytics” and “wide data analytics” work with data virtualization to address these challenges?
Consumers and businesses alike are using small data analytics to do things such as create hyper-personalized customer experiences, enabling the enterprise to understand each individual customer’s sentiment about a specific product or service within a short time window. For such analytics to succeed, companies need to combine certain data sets in real time -- something only data virtualization makes possible and that cannot be done with legacy data integration methods requiring physical consolidation.
Wide data analytics involves the combination of structured, unstructured, and semistructured data from various data sources; this often includes geospatial data, machine-generated data, text data, video data, temperature data, and more. Healthcare companies often combine lab data, X-ray data, R&D data, patient data, and many other types of data for clinical purposes and patient treatment as well as for offering data-as-a-service to their ecosystem partners. In such cases, traditional data integration methods often fall short, either because they are not good at handling less-structured or unstructured data or because real-time data integration is needed for quicker decision-making. In such scenarios, data virtualization is an absolute necessity.
Where is data virtualization headed? What role will it play 2-3 years from now?
Data virtualization will continue to push the envelope with AI/ML-driven functionality to the point where it will automatically infer changes at the individual data sources, even as data is continuously created through business transactions. Soon, there will be no friction in accessing, combining, and using data, and no one will have to ask where the data is physically located or what format it is stored in at the source. Data will continue to grow in volume, velocity, and variety, but with data virtualization and real-time access across disparate systems, the conversation will shift toward using data and away from managing its complexity.