How Data Catalogs Expand Discovery and Improve Governance
AI and automation are making it easier for users to find the data they need.
- By David Stodder
- September 28, 2020
Those of us beyond a certain age remember when school research projects began in front of the library card catalog: that now-antique set of wooden cabinets with the long drawers full of well-thumbed cards that adhered to a standard bibliographic system. If you understood that system (or had the help of a good librarian), you could perform a surprising amount of research at a metadata level before having to hunt through the library stacks for the actual books you needed. You could use the system to understand relationships between book topics and perhaps discover an unexpected book that was perfect for your report.
Library catalogs, like many other pre-digital-age information systems, have changed. Today, many library catalogs are linked so you can find an item located anywhere in the system, and you can query the system in natural language.
In our data-rich digital world, organizations need to make it easy to locate data across physically distributed systems while providing governance and security.
Strong Interest in Data Catalogs and Business Glossaries
TDWI finds strong interest in using data catalogs and business glossaries to address these requirements. A data catalog is a central repository of metadata that describes data sets: what they contain, how they are defined, and where to find them. It gives organizations an inventory of their data assets, which is critical both for governance and for making data easier for users to discover.
Business glossaries, which provide standard identification and validated data definitions within a business context, can be integrated with data catalogs, which hold information more closely tied to data schemas, structure, and physical storage. Data virtualization middleware relies heavily on both for faster querying and for more comprehensive views of data from multiple sources.
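To make the relationship concrete, here is a minimal sketch in Python of how a catalog entry (technical metadata) might link to business glossary terms. All names and structures are hypothetical illustrations, not any vendor's actual model:

```python
from dataclasses import dataclass, field

@dataclass
class GlossaryTerm:
    """Business-level definition of a concept (hypothetical structure)."""
    name: str
    definition: str
    steward: str  # person or team accountable for the definition

@dataclass
class CatalogEntry:
    """Technical metadata for one data set, linked to business terms."""
    dataset: str
    location: str                 # where to find the data
    schema: dict                  # column name -> data type
    glossary_terms: list = field(default_factory=list)

# A catalog maps data set names to their entries.
catalog = {}
revenue = GlossaryTerm("Net Revenue",
                       "Gross sales minus returns and discounts",
                       "finance-team")
catalog["sales_2020"] = CatalogEntry(
    dataset="sales_2020",
    location="s3://warehouse/sales/2020/",
    schema={"order_id": "string", "net_revenue": "decimal"},
    glossary_terms=[revenue],
)

def find_by_term(catalog, term_name):
    """Discovery: find data sets associated with a business concept."""
    return [e.dataset for e in catalog.values()
            if any(t.name == term_name for t in e.glossary_terms)]

print(find_by_term(catalog, "Net Revenue"))  # ['sales_2020']
```

The point of the sketch is the join: a user searching for the business concept "Net Revenue" is routed to the physical data sets that carry it, without needing to know schemas or storage locations in advance.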
Unfortunately, most organizations are held back because they lack a data catalog or use a hodgepodge of catalogs and glossaries, sometimes created in spreadsheets. Each of these typically collects technical and/or business metadata about just a single BI or database platform, data warehouse, or application. As data grows more voluminous and varied, enterprises need a more comprehensive enterprise data catalog.
At an enterprise level, the data catalog can be the centerpiece of a broader strategy for easier location, governance, and authorized access to data across enterprise systems on premises and in the cloud. Providing developers and users with easy access to the enterprise data catalog can eliminate challenges in creating data pipelines that draw data from multiple sources and help ensure consistent access to trusted data.
An enterprise data catalog enables organizations to respond faster to governance audits and the need to monitor data as it moves through life cycles across the organization. TDWI research finds that the ability to centrally monitor data usage and lineage -- where the data came from and what has happened to it along the way in terms of transformation, enrichment, and cleansing -- is a governance priority for the majority of organizations surveyed. About a third of organizations surveyed for our Q1 2020 TDWI Best Practices Report are looking at consolidating smaller data catalogs and glossaries into a central enterprise data catalog resource, especially to improve response to governance and data privacy regulations.
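Lineage tracking of the kind described above can be pictured as a chain of recorded steps that a catalog walks backward on demand. The following Python sketch is a simplified illustration of the idea; the data set names and the flat log structure are invented for the example:

```python
# Minimal lineage log: each record notes a source, an operation, and a target.
lineage = []

def record(source, operation, target):
    lineage.append({"source": source, "operation": operation, "target": target})

# Example pipeline: raw CRM data is cleansed, then enriched into the warehouse.
record("crm.contacts_raw", "cleanse", "staging.contacts_clean")
record("staging.contacts_clean", "enrich", "warehouse.contacts")

def trace(target):
    """Walk backward from a data set to its original source."""
    steps = []
    current = target
    while True:
        step = next((r for r in lineage if r["target"] == current), None)
        if step is None:
            break
        steps.append(step)
        current = step["source"]
    return steps

for step in trace("warehouse.contacts"):
    print(f'{step["source"]} --{step["operation"]}--> {step["target"]}')
```

An auditor asking "where did warehouse.contacts come from?" gets the full chain of transformations back to the raw source, which is exactly the question a central lineage capability answers at enterprise scale.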
However, an organization needs more than data catalogs to address these issues. For example, to protect its data, it needs tools that can use the catalog to find data and then apply governance constraints and enforcement policies at runtime to prevent unauthorized access. The enterprise also needs tools to manage copying, replication, and archiving of data so it does not lose control of governance and security.
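The runtime enforcement described above can be sketched as a policy check that sits between the catalog lookup and the data itself. This is a minimal illustration with invented data set names and roles, not a depiction of any particular access-control product:

```python
# Hypothetical policy table: data set -> roles allowed to read it.
policies = {
    "hr.salaries": {"hr_admin"},
    "sales.orders": {"analyst", "hr_admin"},
}

def read_dataset(dataset, user_role):
    """Consult the governance policies before releasing any data."""
    allowed = policies.get(dataset, set())
    if user_role not in allowed:
        raise PermissionError(f"{user_role} may not read {dataset}")
    return f"contents of {dataset}"  # stand-in for the actual data fetch

print(read_dataset("sales.orders", "analyst"))   # permitted
try:
    read_dataset("hr.salaries", "analyst")       # blocked at runtime
except PermissionError as err:
    print(err)
```

The design point is that the policy is enforced at the moment of access rather than baked into each downstream application, so a single catalog-level change takes effect everywhere.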
Advancing with AI and Automation
Artificial intelligence and automation are crucial to augmenting human efforts in building and maintaining data catalogs and glossaries, which traditionally required significant manual work. Our research shows many organizations are planning to use AI and automation to reduce manual effort in crawling metadata at the sources, tagging new data, classifying data for governance and security, and developing taxonomies. Organizations are also interested in natural language search capabilities to make it easier for users to find data.
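As a toy illustration of automated classification, the sketch below tags crawled column names against sensitivity rules. Real products use far richer techniques (including machine learning on data values, not just names); the rule patterns and tags here are assumptions made up for the example:

```python
import re

# Hypothetical rules mapping column-name patterns to governance tags.
RULES = {
    r"ssn|social_security": "PII:national-id",
    r"email|e_mail": "PII:contact",
    r"phone|mobile": "PII:contact",
    r"salary|compensation": "sensitive:financial",
}

def classify_columns(columns):
    """Tag columns whose names match a sensitivity rule; first match wins."""
    tags = {}
    for col in columns:
        for pattern, tag in RULES.items():
            if re.search(pattern, col, re.IGNORECASE):
                tags[col] = tag
                break
    return tags

crawled = ["customer_id", "email_address", "annual_salary", "region"]
print(classify_columns(crawled))
# {'email_address': 'PII:contact', 'annual_salary': 'sensitive:financial'}
```

Even this crude approach shows why automation matters: a crawler can apply such rules to thousands of tables overnight, leaving humans to review the flagged cases rather than tag everything by hand.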
AI and machine learning techniques are being applied in data catalog solutions to help organizations move beyond just inventorying data to curate data for use in pipelines by exposing which data sets are available, trusted, and governed. This supports the trend toward providing recommendations within BI, analytics applications, and data pipeline development tools so users can select and analyze data sets faster. These more "active" data catalogs depend on AI and automation to provide services that save users from delays and the need for intervention to "wake up" the catalog system.
An active data catalog can also automate steps for responding to audits and for maintaining ongoing reports that demonstrate compliance with data privacy regulations. It can warn users when certain data may be sensitive and could raise governance or regulatory concerns. Using AI and automation in data catalogs is essential as organizations try to scale up to support thousands of users and their dashboards, data pipelines, and analytics.
Fortunately, solution providers are not ignoring demand in the data catalog, business glossary, and related governance, security, and data access control arenas. This has been a big year in the development of solutions from established providers such as IBM and Informatica as well as newer and more specialized companies such as Alation, Collibra, Okera, and Zaloni. Organizations should evaluate solutions from a range of providers to determine which fit current and future requirements, keeping in mind that typically no one solution can address all needs.
Next Step for Solutions
The solutions industry should enable organizations to implement industry data definition and metadata standards in their catalogs to avoid adding to the already difficult problem of integrating data catalogs and glossaries across data systems and applications. This would provide users with single, trusted views of all relevant data rather than having to tailor data discovery and analysis to what is available in a single system.
David Stodder is director of TDWI Research for business intelligence. He focuses on providing research-based insight and best practices for organizations implementing BI, analytics, performance management, data discovery, data visualization, and related technologies and methods. He is the author of TDWI Best Practices Reports on mobile BI and customer analytics in the age of social media, as well as TDWI Checklist Reports on data discovery and information management. He has chaired TDWI conferences on BI agility and big data analytics. Stodder has provided thought leadership on BI, information management, and IT management for over two decades. He has served as vice president and research director with Ventana Research, and he was the founding chief editor of Intelligent Enterprise, where he served as editorial director for nine years.