Where Next for Metadata
Metadata finally makes the big time, but how big and for how long? The answers lie in traditional data warehousing principles.
By Barry Devlin
December 16, 2019
It’s ironic that data lakes -- possibly the most poorly managed data stores in the history of IT -- have finally, after nearly three decades, focused the industry’s attention on the need for metadata. Ironic but unsurprising. Metadata is prescribed to make data usable and useful to both business and IT, but dumping data flotsam and information jetsam in ungoverned lakes leaves everybody struggling to figure out what’s what and where it might be.
Gartner’s 2019 Metadata Management Solutions Magic Quadrant (also available from many of the listed vendors) reflects the burgeoning growth in interest, with the number of vendors evaluated almost doubling to 17 since the 2016 version. The Leaders quadrant is rather packed: fully a dozen of the contenders are placed there, with Informatica leading the race for the top right corner. Alation, Alex Solutions, ASG, Collibra, Infogix, and Smartlogic form a bunch of six in the middle of the quadrant, while Adaptive, erwin, IBM, Oracle, and SAP sit toward the lower left among the Leaders.
Inclusion and rankings aside, the metadata market is now well populated with vendors tackling different aspects of function and scope. Indeed, Gartner’s Peer Insights lists 32 products from 22 vendors of metadata management solutions. Software vendors clearly see a worthwhile opportunity, but how real is it?
Market Hype Versus Implementation Reality
Gartner’s Magic Quadrant helpfully estimates the number of customers for the included vendors. It suggests that customer uptake, while significant, is still slow. Although a few big hitters (such as Informatica, Oracle, and SAP) boast a few thousand clients each, many others cluster around a hundred and below. By my count, the total is six to eight thousand customers, which is only a fraction of data warehouse and data lake customers worldwide.
Data warehouse experts like myself have been promoting the need for metadata since the early 1990s. One might imagine a tsunami of pent-up demand being released now that a plethora of powerful products and simple solutions have arrived on the market. Even allowing for the possibility that many data warehouses are ticking over in maintenance mode and thus closed to new investment in metadata function, there has been a significant migration of data warehouses to lakes in recent years. Why is the uptake of metadata products not considerably faster?
Metadata: Two Four-Letter Words
I have long argued that metadata is a wholly inadequate word to describe what it is and what it does. Metadata is far more than “data about data” -- a typically simplistic definition. That is but one limited subset of what is needed, which is to provide business and IT users with sufficient background and context to enable them to create and use data and IT systems with complete confidence. Furthermore, because data is only a subset of information, what’s really needed is “information about information (and data and systems and people and...),” leading to my preferred name: context-setting information, or CSI for short.
A key reason metadata seldom got beyond the basic technical variety in data warehousing was that it was usually implemented as a separate, IT-driven subproject in an often-overstretched program. When time or funding pressures arose, it became an easy target because its only business value was for future users, with the initial cohort already well versed in the current data. As a result, most data warehouses have only the metadata related to ETL (extract, transform, and load) processes and precious little for the business.
The situation for data lakes is more complicated. The original approach of simply loading raw data as it arrived -- due to volume, velocity, and variety challenges -- was later formalized in the principle of schema-on-read. As a result, data scientists do minimal data shaping and contextualizing, deliver their desired analytics insights, and then push responsibility for data governance down the road.
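To make the schema-on-read trade-off concrete, here is a minimal sketch in Python. The record layout, field names, and the read_with_schema helper are all hypothetical, invented for illustration only; the point is simply that when no structure is enforced as data is loaded, every reader has to discover at query time which records actually fit.

```python
import json

# Hypothetical raw records as they might land in a data lake:
# nothing validated them when they were written.
RAW_LINES = [
    '{"customer_id": "17", "amount": "42.50", "ts": "2019-12-16"}',
    '{"cust": 18, "amount": 10}',
    '{"customer_id": "19", "amount": "n/a"}',
]


def read_with_schema(line):
    """Apply a schema at read time; return None if the record does not fit."""
    record = json.loads(line)
    try:
        return {
            "customer_id": int(record["customer_id"]),
            "amount": float(record["amount"]),
        }
    except (KeyError, ValueError):
        # Missing field or unparseable value: the problem only surfaces now.
        return None


parsed = [read_with_schema(line) for line in RAW_LINES]
usable = [r for r in parsed if r is not None]
rejected = len(parsed) - len(usable)

print(f"usable records: {len(usable)}, rejected at read time: {rejected}")
```

In this sketch only one of the three records survives, and nothing in the store itself explains why the other two differ. That missing explanation is exactly the context that metadata is meant to supply.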
The data management outcomes were obvious, at least in retrospect. The lakes began to silt up with uncontrolled and unknown data, multiple copies of the same files, and conflicting information from disparate sources. As data lakes became data swamps and data scientists became sewage engineers, multiple vendors remembered the metadata myths and dusted off old products or invented new solutions. The current boom in metadata products began.
Whither Metadata and CSI?
Although today’s products benefit immensely from modern techniques -- such as artificial intelligence, natural language processing and generation, data usage mining, and support for collaboration -- they are being introduced into toxic data lakes in need of initial deep cleansing and subsequent restructuring according to best data management principles. I suspect that, on their own, even powerful metadata tools will struggle.
The real solution lies -- in my opinion -- in the rebranding of metadata to CSI. Not in the name change itself, but in the rethinking of the concept it implies. What is the difference between context-setting and “real” information? In short, none. Every piece of information provides context for every other piece. CSI thus needs to be directly incorporated into our information stores (database management systems) and CSI creation embedded directly into our information system projects, rather than being left to standalone metadata products and projects.
One approach, as I’ve described elsewhere, is to create a new class of information context management systems, as exemplified by CortexDB, which uses a combination of a document store and a fifth normal form database to store information together with its context.
Another approach -- perhaps more immediately applicable for many organizations -- is to step back from data lakes and schema-on-read and revisit relational database technology as a core storage and structuring mechanism for a significant proportion of the data now floating free in data lakes. The relational model and its implementation, when used fully and with minimal extensions, offer the basis for managing and storing all information irrespective of its context-setting status.
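As a rough illustration of keeping context-setting information in the same relational store as the data it describes, the following sketch uses Python's built-in sqlite3 module. The table names (sales, column_context), the columns, and the sample definitions are assumptions made up for this example; this is not a description of CortexDB or of any specific product's design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (
    sale_id    INTEGER PRIMARY KEY,
    region     TEXT NOT NULL,
    net_amount REAL NOT NULL
);
-- Hypothetical table: context kept as ordinary rows beside the data.
CREATE TABLE column_context (
    table_name          TEXT NOT NULL,
    column_name         TEXT NOT NULL,
    business_definition TEXT NOT NULL,
    source_system       TEXT NOT NULL,
    PRIMARY KEY (table_name, column_name)
);
""")

conn.execute("INSERT INTO sales VALUES (1, 'EMEA', 120.0)")
conn.executemany(
    "INSERT INTO column_context VALUES (?, ?, ?, ?)",
    [
        ("sales", "region", "Sales territory of the customer", "CRM"),
        ("sales", "net_amount", "Invoice value after discounts", "Billing"),
    ],
)

# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = [row[1] for row in conn.execute("PRAGMA table_info(sales)")]

# Look up the business context with a plain SQL query, like any other data.
documented = dict(
    conn.execute(
        "SELECT column_name, business_definition FROM column_context "
        "WHERE table_name = ?",
        ("sales",),
    )
)

# Columns with no recorded context are a governance gap one can query for.
undocumented = [name for name in columns if name not in documented]
print("undocumented columns:", undocumented)
```

Because the context lives in ordinary tables, the same SQL, constraints, and access controls that govern the data can govern its description too, and gaps such as the undocumented sale_id column here can be found with a query rather than a separate tool.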
The data management and governance principles of relational-based data warehousing are ripe for resurrection.
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.