Data Cataloging Comes of Age
Organizations can take advantage of the latest data cataloging technology to bring users closer to a complete view of information.
- By David Stodder
- September 17, 2018
"One person's trash is another person's treasure." If you were to look in my basement, you might find my use of this wise saying to be a little generous. Yet, as I often assert to my family, there is treasure down there amid the teetering boxes full of artifacts from my three decades in the information technology industry. The problem, of course, is finding it. If only there were an app for that.
Most organizations today face a similar problem as they amass volumes of data on premises and in the cloud: How do they locate what's valuable, sensitive, or relevant to the question they need to answer right now? One important solution is a data catalog (or similar resource, such as a metadata repository or business glossary). The increasing volume and variety of data are making data catalogs one of the hottest sectors in IT.
A data catalog can be a kind of Rosetta stone that enables users, developers, and administrators to find and learn about data, and enables information professionals to properly organize, integrate, and curate data for users. An up-to-date, comprehensive data catalog can make it easier for users to collaborate on data because it offers agreed-upon data definitions they can use to organize related data and build analytics models.
Data catalogs' star turn is a little ironic because organizations have been applying technologies and practices toward building them for a long time. You'll have to trust me when I say this, but scattered in those basement boxes are books, magazine articles, and white papers about data catalogs and metadata repositories that date back decades, written by some of the best minds in the industry.
For as long as I have been in this field, organizations have been seeking better tools and systems to centrally define and describe both existing and new data faster and more completely. Vendors have been trying to address this need for just as long, with mixed success. Comprehensive solutions have proven elusive, and the ones that reached the market were usually too limited in scope.
No One Catalog to Rule Them All
Smaller organizations or individual users might have their own improvised repositories of data definitions that are recorded in spreadsheets or other types of files, but the serious starting point for most organizations has been the relational database management system (RDBMS) catalog. Although in an earlier time these facilities required considerable manual work to build and manage, most modern RDBMSs have "active" system catalogs that can populate and update themselves automatically, with guidance from administrators. Automation is key to building and maintaining modern data catalogs; without it, organizations must rely on manual effort that is slow, error-prone, and incomplete.
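To make the idea of a self-updating system catalog concrete, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. SQLite is only a convenient stand-in for a larger RDBMS, and the table and column names are made up for illustration. The point is that the database records each schema change in its own catalog with no manual bookkeeping.

```python
import sqlite3

# In-memory database; table and column names are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")


def describe_tables(connection):
    """Return {table_name: [(column_name, declared_type), ...]} read from the system catalog."""
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    catalog = {}
    for (table_name,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = connection.execute(f"PRAGMA table_info('{table_name}')").fetchall()
        catalog[table_name] = [(col[1], col[2]) for col in columns]
    return catalog


before = describe_tables(conn)

# A schema change is reflected in the catalog automatically.
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT")
after = describe_tables(conn)

print(before["orders"])
print(after["orders"])
```

The second printout includes the new `currency` column even though nothing in the script updated a catalog by hand, which is the "active" behavior described above.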
When data warehousing systems came along, they generally built up from the lower-level database system catalogs to provide fuller definitions to support BI and reporting systems. Then, many enterprise BI and online analytical processing (OLAP) systems themselves provided data catalogs or metadata repositories that further tailored definitions to what users needed for dashboards and other visualization, reporting, and analysis.
Enterprise applications, product information management systems, and content management systems grew up with their own catalogs, taxonomies, and master data management systems. More recently, organizations that have big data lakes and cloud-based data platforms are finding that they need data catalogs to enable easier data access and discovery as well as to support data management and governance.
Thus, for many large organizations, the problem is not that they don't have a data catalog. It's that they have too many. Instead of a single, comprehensive knowledge base for all data definitions, they have a collection of uncoordinated and conflicting data catalogs, metadata repositories, business glossaries, taxonomies, and master data management systems.
To deal with this problem, some large organizations look for a higher level of abstraction. They are evaluating semantic integration or mediation solutions that can resolve inconsistencies. Others are building their own ontologies, that is, higher-level information naming and meaning models that can provide integrated views across catalogs.
Ontologies and other methods of mediating inconsistencies in metadata definitions can help organizations avoid time-consuming and unproductive data confusion and increase the value they can gain from data. Too often, data is limited in value because the associated catalogs and models satisfy only a single purpose; an enterprise-level semantic integration capability could open up data sources to multiple uses.
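As a rough illustration of what such mediation involves, the sketch below maps field names from three uncoordinated catalogs onto shared business concepts. Everything in it is hypothetical: the catalog names, the field names, and the hand-written mapping that stands in for an ontology. Real semantic integration tools are far richer, but the core idea of resolving local names to a common meaning is the same.

```python
from collections import defaultdict

# Hypothetical field names from three uncoordinated catalogs.
CATALOGS = {
    "warehouse": ["cust_id", "order_total", "region_cd"],
    "crm": ["customerNumber", "salesRegion", "loyalty_tier"],
    "bi_tool": ["Customer ID", "Revenue"],
}

# Hypothetical ontology: (catalog, local field name) -> shared business concept.
ONTOLOGY = {
    ("warehouse", "cust_id"): "customer_identifier",
    ("crm", "customerNumber"): "customer_identifier",
    ("bi_tool", "Customer ID"): "customer_identifier",
    ("warehouse", "region_cd"): "sales_region",
    ("crm", "salesRegion"): "sales_region",
    ("warehouse", "order_total"): "order_revenue",
    ("bi_tool", "Revenue"): "order_revenue",
}


def build_concept_index(catalogs, ontology):
    """Group local fields by shared concept and list fields the ontology does not cover."""
    index = defaultdict(list)
    unmapped = []
    for catalog_name, fields in catalogs.items():
        for field in fields:
            concept = ontology.get((catalog_name, field))
            if concept is None:
                unmapped.append((catalog_name, field))
            else:
                index[concept].append((catalog_name, field))
    return dict(index), unmapped


concept_index, unmapped_fields = build_concept_index(CATALOGS, ONTOLOGY)
for concept, locations in concept_index.items():
    print(concept, "->", locations)
print("needs review:", unmapped_fields)
```

The list of unmapped fields matters as much as the index: it shows where definitions still have to be reconciled by a person before the data can serve more than one purpose.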
Technology Directions for Data Catalogs
Fortunately, technologies are maturing for automatically building catalogs and glossaries and keeping them up to date. Many employ artificial intelligence (AI) techniques, particularly machine learning, to enable solutions to more rapidly learn data definitions and the context and organization present in massive data volumes. AI and machine learning are critical for organizations that want to tap IoT and streaming data and do not have time to wait for the data to land in a data warehouse and go through standard profiling procedures.
Beyond streaming and IoT uses, organizations can use AI capabilities embedded in tools to discover metadata from a range of new and existing data sets, and then learn details about the data, tag data according to higher-level business definitions and rules, and locate and use documentation. AI-infused cataloging tools are becoming important for addressing governance requirements because they can make it easier to track data lineage and learn how data has been consumed, transformed, and shared. Data catalogs can enable organizations in heavily regulated industries (such as healthcare and financial services) to monitor data lineage for audits.
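The tagging step can be pictured with a deliberately simple sketch. Note that it uses hand-written rules rather than machine learning, so it shows only the shape of the task: inspect a column's name and sample values, then attach business-level tags. The patterns, tag names, and sample data are all assumptions chosen for illustration, and the patterns are far too crude for production use.

```python
import re

# Crude illustrative patterns, not validated for real-world data.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?[\d\-\s()]{7,}$")
ID_NAME_RE = re.compile(r"(^|_)(id|key)$")


def tag_column(name, sample_values):
    """Suggest tags for a column from its name and a handful of sample values."""
    tags = set()
    values = [v for v in sample_values if v]  # ignore empty samples
    if values and all(EMAIL_RE.match(v) for v in values):
        tags.update({"email", "sensitive"})
    if values and all(PHONE_RE.match(v) for v in values):
        tags.update({"phone", "sensitive"})
    if ID_NAME_RE.search(name.lower()):
        tags.add("identifier")
    return sorted(tags)


print(tag_column("contact_email", ["a@example.com", "b@example.org"]))
print(tag_column("customer_id", ["1001", "1002"]))
print(tag_column("phone", ["+1 415-555-0100"]))
```

A learning-based tool would replace the fixed patterns with models trained on previously tagged data, but the output, a set of suggested tags that a data steward can confirm or reject, would look much the same.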
Building knowledge about data, including its lineage, is becoming a priority as organizations expand their use of analytics. Users want to be able to reproduce analytics to verify insights and build on them, but this is difficult if there's too much confusion about who collected the data and what kind of transformations or enrichments were done. Modern data cataloging systems can apply the speed and efficiency of AI and machine learning to improve data lineage tracking.
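Once lineage has been captured, answering "where did this come from?" is a graph traversal. The sketch below assumes a catalog that stores, for each data set, the data sets it was derived from. The data set names and the dictionary layout are hypothetical.

```python
# Hypothetical lineage records: each data set maps to the data sets it was derived from.
LINEAGE = {
    "sales_dashboard": ["sales_by_region"],
    "sales_by_region": ["orders_clean", "regions"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "regions": [],
}


def upstream_sources(dataset, lineage):
    """Return every data set that `dataset` depends on, directly or indirectly."""
    seen = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(lineage.get(current, []))
    return sorted(seen)


def original_sources(dataset, lineage):
    """Return only the upstream data sets that have no recorded parents."""
    return [d for d in upstream_sources(dataset, lineage) if not lineage.get(d)]


print(upstream_sources("sales_dashboard", LINEAGE))
print(original_sources("sales_dashboard", LINEAGE))
```

An auditor or an analyst trying to reproduce a result could use the first list to see every intermediate step and the second to see which raw sources ultimately feed the dashboard.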
Beyond AI and machine learning, here are three technology trends that merit attention as organizations set their strategy for data catalogs.
Trend #1: Consolidation of BI, data preparation, and data cataloging
BI and analytics solution providers today understand that slow and problematic data preparation can undermine satisfaction with self-service data visualization and discovery. The opportunity for market-differentiating excellence in this area is driving development and acquisition. A good example is QlikTech International AB's recent acquisition of Podium Data. Podium provides an innovative Podium Data Marketplace platform that uses a shopping paradigm to give users a searchable resource for cataloged and curated data.
Behind the scenes, administrators can work at a metadata level to manage and catalog the flow of raw data into the data lake or data warehouse, monitor demand so they can make popular sources easier to access, and determine whether to pre-integrate or aggregate data. Combining Podium with Qlik's visual analytics tools will create a more comprehensive solution to compete with the likes of Tableau, Alteryx, and Microsoft, which are also developing integrated solutions.
Trend #2: Integration of data catalogs and data pipelines
Organizations are employing a data pipeline paradigm to describe the process from ingesting data to extracting, replicating, and potentially landing the data in a system for discovery, preparation, transformation, and analytics. The pipeline paradigm especially fits scenarios where organizations want to perform operational reporting or analytics on real-time data as it flows through the pipelines to gain the most value from it. In some cases, the streamed data may be discarded rather than stored at the end of a pipeline process.
Organizations that require high-volume streams of data to train machine learning models are particularly driving demand for data pipeline management. Third-party providers such as Attunity, Informatica, StreamSets, and Talend are joining data platform players such as Amazon, Google, IBM, Microsoft, Oracle, and Teradata in offering data pipeline management to serve streaming analytics use cases. If solutions such as these are able to tap data catalogs for knowledge about the data as well as supply knowledge to the catalogs about new data sources, organizations can reduce definitional confusion between pipeline processes and improve quality and governance.
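The two-way exchange between pipeline and catalog might look something like the following sketch. The `Catalog` class and `ingest` function are hypothetical and do not represent any vendor's API. The pipeline consults the catalog for a source's known schema and supplies an inferred schema when it encounters a source the catalog has not seen.

```python
class Catalog:
    """Hypothetical in-memory catalog keyed by source name."""

    def __init__(self):
        self._entries = {}

    def lookup(self, source):
        return self._entries.get(source)

    def register(self, source, schema, registered_by):
        self._entries[source] = {"schema": dict(schema), "registered_by": registered_by}


def infer_schema(record):
    """Describe a record as {field_name: Python type name}."""
    return {field: type(value).__name__ for field, value in record.items()}


def ingest(catalog, source, record, pipeline_name):
    """Check a record against the catalog, registering the source if it is new."""
    entry = catalog.lookup(source)
    observed = infer_schema(record)
    if entry is None:
        # The pipeline supplies knowledge about a new source to the catalog.
        catalog.register(source, observed, registered_by=pipeline_name)
        return "registered"
    # The pipeline taps the catalog for the agreed definition.
    return "ok" if observed == entry["schema"] else "schema_mismatch"


catalog = Catalog()
first = ingest(catalog, "sensor_feed", {"device": "a1", "temp_c": 21.5}, "stream_etl")
second = ingest(catalog, "sensor_feed", {"device": "b2", "temp_c": 19.0}, "stream_etl")
third = ingest(catalog, "sensor_feed", {"device": "c3", "temp_c": "hot"}, "stream_etl")
print(first, second, third)
```

Flagging the mismatch at ingestion time, rather than after the data has landed somewhere, is one way a shared catalog can reduce definitional confusion between pipeline processes.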
Trend #3: Data virtualization layers for integrating multiple catalogs
As I noted, organizations with too many disconnected and conflicting data catalogs can be in almost as bad a situation as those with too few or no data catalogs. Data virtualization tools enable users to find, preview, and query data in multiple sources through a single layer. These tools, supplied by Denodo, SAP, TIBCO, and others, could offer a potential solution to the problem of multiple data catalogs. Organizations could use the virtualization layer to get an integrated view of metadata from all relevant catalogs, which could help mediate differences between them.
Updating Data Catalog Strategies
Data catalogs and similar systems can improve how users view, query, and analyze data sources. They have a long history, which I might be able to recount if I could only find the literature in my basement. Alas, the boxes stored in my basement predate technologies such as sensors, tags, and codes that modern data catalog solutions can use to search for and find relevant information. I can say, however, that the metadata and semantic data integration visionaries of 20 or 30 years ago -- who toiled in frustration due to the limitations of the data management systems of their era -- would be excited by the potential of today's technologies.
Data integration challenges are only growing more complex as data volumes and variety rise. Your organization should evaluate how it can use the latest data cataloging technology to more quickly and intelligently mediate differences between data definitions and bring your users closer to a complete view of your enterprise's information.