The Broad New Powers of Modern Data Catalogs
We explain the role of two different types of data catalogs and how they can work together in your enterprise.
- By Ravi Shankar
- July 29, 2019
An organization's data is often fragmented across numerous sources: legacy on-premises systems and data warehouses, flat files stored on individual desktops and laptops, and modern, cloud-based repositories. Because of this all-too-familiar data environment, data governance becomes a challenge. Business stakeholders, data analysts, and other users are unable to discover data or run queries across an entire data set, which diminishes the value of data as an asset. In such an environment, it's challenging for users to generate accurate compliance reports or forecast sales with any reliable degree of accuracy.
Like the legendary Wild West, fragmented data environments are difficult to govern effectively, if they can be governed at all. In such environments, business users often do not know what data is stored in which system. Even when they do, they might not know who owns the data or how they are allowed to use it.
Data catalogs solve these problems by bringing an organization's diverse data holdings together in one list. However, not all data catalogs are alike, and it is important for data stewards and data consumers to understand all the options these critical tools offer.
Two Types of Data Catalogs
A data catalog may list all of the important data in one place. A data catalog may also be able to:
- Link to the data and provide access to it
- Enable users to query the data or combine data sets from disparate systems in the same query
- Provide a business glossary
- Support rules and policies for governance
However, not all data catalogs accomplish all of these feats. Furthermore, it's important to understand the two basic types of data catalogs.
Catalogs that focus on data quality and managing ownership and stewardship through workflows are called inventory data catalogs.
Catalogs that integrate disparate data assets so that business users can readily consume them are called, not surprisingly, consumption data catalogs.
Inventory data catalogs emphasize the asset's metadata. Such data catalogs are optimized for data stewards and others who manage data governance. These data catalogs are geared towards organizing information about lineage, quality, ownership, change control, location, and access privileges. Inventory data catalogs are specifically not designed to support data access or query activity.
Consumption data catalogs, on the other hand, are deliberately designed for real-time data access and querying in support of self-service BI or data science initiatives. In contrast to inventory data catalogs, they do not emphasize data quality and governance information; instead, they focus on information about individual data assets: who uses what data, when, and how. Consumption data catalogs enable data marketplace functionality.
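To make the distinction concrete, here is a minimal sketch of what an entry in each type of catalog might hold. The class names, fields, and sample values are illustrative assumptions, not the schema of any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class InventoryEntry:
    """Metadata-only record aimed at data stewards (hypothetical fields)."""
    name: str
    location: str
    owner: str
    lineage: list = field(default_factory=list)   # upstream asset names
    quality_score: Optional[float] = None         # set by a steward, if known
    # Note: no method for reading the underlying data.


@dataclass
class ConsumptionEntry:
    """Record that exposes live access and tracks usage (hypothetical fields)."""
    name: str
    fetch: Callable[[], list]                     # reads the source at call time
    usage_log: list = field(default_factory=list)

    def query(self, user: str) -> list:
        self.usage_log.append(user)               # record who used the asset
        return self.fetch()
```

The inventory entry describes the asset without offering a way to read it, while the consumption entry offers access and records who used it.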
Choosing the Right Type
The two types of data catalogs are complementary. Each is designed for different use cases -- inventory data catalogs are typically used in data governance projects and consumption data catalogs are used in self-service initiatives.
Inventory data catalogs provide a business glossary to store the definition of business assets and related data assets. These data assets can be assigned governance rules that belong to global governance policies.
In contrast, consumption data catalogs provide a way to create a single source of truth about all enterprise data assets. Consumption data catalogs provide a single location from which to enforce security and governance policies across multiple systems, so they can enhance governance activities in logical data warehousing and data science use cases.
However, consumption data catalogs on their own do not create a business glossary or a way to manage a change in an ETL flow or a data quality rule. For that you'll want to employ an inventory data catalog.
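The idea of a single location for enforcing policies across multiple systems can be pictured with a small sketch. The policy table, role names, and function names below are hypothetical; a real catalog would load policies from its own governance store.

```python
# Hypothetical policy table: asset name -> roles allowed to read it.
POLICIES = {
    "sales_orders": {"analyst", "steward"},
    "payroll": {"steward"},
}


class AccessDenied(Exception):
    """Raised when a role is not permitted to read an asset."""


def read_asset(asset: str, role: str, sources: dict) -> list:
    """Check the central policy first, then fetch from the owning system."""
    if role not in POLICIES.get(asset, set()):
        raise AccessDenied(f"{role} may not read {asset}")
    return sources[asset]()  # each source is a zero-argument fetch function


def visible_assets(role: str) -> list:
    """List the assets a role would see in its view of the catalog."""
    return sorted(name for name, roles in POLICIES.items() if role in roles)
```

Because every read passes through `read_asset`, the rule is stated once rather than separately in each source system.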
The Data Virtualization Connection
Data virtualization is a modern data integration and data management technology that enables data catalogs to provide real-time access to listed data assets and enable queries across all available data sets.
Rather than relying on the movement of data from myriad disparate systems into a single, unified one, data virtualization creates secure views of the source data (leaving the data's location unchanged) and provides these views to authorized users of the data catalog in real time. As a unified data access layer above an organization's data sources, the data virtualization layer contains no actual data; instead, it contains the metadata required for accessing the various sources through the data catalog.
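The following is a toy sketch of that idea, not how any commercial data virtualization product is built. Two in-memory SQLite databases stand in for separate source systems, and the hypothetical `VirtualLayer` class stores only view definitions, reading rows from the sources when a view is requested.

```python
import sqlite3

# Two in-memory databases stand in for separate source systems.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
erp.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 250.0), (1, 100.0), (2, 75.0)])


class VirtualLayer:
    """Holds view definitions (metadata) only; rows stay in the sources."""

    def __init__(self):
        self._views = {}  # view name -> (source connection, SQL text)

    def register(self, view, source, sql):
        self._views[view] = (source, sql)

    def read(self, view):
        source, sql = self._views[view]
        return source.execute(sql).fetchall()  # fetched at request time


layer = VirtualLayer()
layer.register("customers", crm, "SELECT id, name FROM customers")
layer.register("orders", erp, "SELECT customer_id, amount FROM orders")


def revenue_by_customer(layer):
    """Combine two virtual views whose data lives in different systems."""
    names = dict(layer.read("customers"))
    totals = {}
    for customer_id, amount in layer.read("orders"):
        name = names[customer_id]
        totals[name] = totals.get(name, 0.0) + amount
    return totals
```

Because nothing is copied into the layer, a row added to a source system shows up the next time the view is read.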
More often than not, inventory data catalogs and data virtualization are employed as part of a broader data governance project. In such projects, data virtualization brings the data together, and because it serves as a universal access layer to the disparate sources, it provides a powerful way to establish rules that define a single source of truth, which is often among the goals of a data governance project.
Data Catalogs, Reimagined
Data virtualization can help inventory and consumption data catalogs play an expanded role in data governance and self-service BI. Better yet, the two types need not be mutually exclusive -- companies can implement either type, or both side by side. Alternatively, they can implement aspects of both in a single data catalog. Because data virtualization simplifies role-based access by establishing a unified data-access layer, companies can tailor the data catalog experience to their own needs.
Ravi Shankar is senior vice president and chief marketing officer at Denodo, a provider of data virtualization software. He is responsible for Denodo’s global marketing efforts, including product marketing, demand generation, communications, and partner marketing. Ravi brings to his role more than 25 years of marketing leadership from enterprise software leaders such as Oracle and Informatica. Ravi holds an MBA from the Haas School of Business at the University of California, Berkeley.