Time to Kill Metadata?
Metadata is neither meta nor data. The amounts and types of information important to an enterprise today require context, not a misleading term that is only vaguely defined.
- By Barry Devlin
- April 15, 2016
Metadata. I've described it as two four-letter words: data, which it certainly is not, and meta, which denotes something higher or beyond, a claim that is, at best, doubtful. Let me clarify.
We'll start with data vs. information. Many people use the words interchangeably. This is unfortunate, although so deeply ingrained in the IT industry that I often do it myself. Some address the issue from a philosophical point of view, which leads to much heat but little light. The widespread use of the term big data has confused the issue even further.
We need simple and practical definitions, optimized for business computing and decision making. In Business unIntelligence, I defined information as "the recorded and stored symbols and signs we use to describe the world and our thoughts about it, and to communicate with each other. Information is mostly digital but also includes paper, books and analogue recordings." The digital variety includes everything from tweets to videos and is thus loosely structured, highly variable, and distinctly human.
Data, in contrast, is information that has been heavily optimized for processing by traditional computers. Such optimization consists of separating values, often numeric, from the descriptions of their meaning and usage. Data thus looks like simple "facts" -- measurements, statistics, the output of physical sensors, people's first and last names, etc. -- that are, in reality, meaningful only when read alongside contextual information stored elsewhere. This contextual information is one part of what, today, is called metadata.
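To make the separation concrete, here is a minimal sketch in Python. The sensor reading, the field names, and the `describe` helper are all hypothetical, invented purely for illustration. The point is only that the stored values say nothing until they are recombined with descriptions kept somewhere else.

```python
# "Data": bare values optimized for processing (hypothetical sensor reading).
row = (1017, 21.5, "2016-04-15T09:30:00")

# Contextual information, stored separately from the values (hypothetical).
context = [
    {"name": "station_id", "unit": None, "meaning": "weather station identifier"},
    {"name": "temperature", "unit": "degrees Celsius", "meaning": "air temperature"},
    {"name": "observed_at", "unit": None, "meaning": "observation time"},
]

def describe(row, context):
    """Recombine each bare value with the description of its meaning and unit."""
    lines = []
    for value, field in zip(row, context):
        unit = f" {field['unit']}" if field["unit"] else ""
        lines.append(f"{field['meaning']}: {value}{unit}")
    return lines

for line in describe(row, context):
    print(line)
```

On its own, `21.5` could be a temperature, a price, or a percentage. Only the separately held description settles the question.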
In BI, metadata first appeared in the late 1980s as "data about data." This is overly simplistic at best. In the context of information vs. data, it is misleading. Data, as a very restricted subset of information, is ill-suited to provide the type of description we expect from metadata. My own early definition of metadata expanded on the above but still missed this limitation: "metadata is data that describes the meaning and structure of business data, as well as how it is created, accessed and used."
David Marco went further in 2000, declaring that metadata covers: "... all physical data (contained in software and other media) and knowledge (contained in employees and various media) from inside and outside an organization, including information about the physical data, technical and business processes, rules and constraints of the data, and structures of the data used by a corporation." The inclusion of knowledge inside employees' heads as a metadata component is perhaps extreme, and poses some difficulties in its extraction and use! However, this definition shows the breadth of what is needed.
In commercial computing, metadata thus emerged as a separate concept when we tried to model the meaning of business information and created files and databases to store it. It also appears when we build and run all the processes that create, manage, or use information. Its physical reality is diverse in the extreme. In simple files, metadata resides in design documents for applications and, if you're lucky, in comment fields in the code. It is also inherent in the code.
In relational databases, metadata resides in system tables as table and column names, and perhaps descriptive information. In other types of databases, such as XML stores, the metadata is mixed (in an easily recognizable way) with the information values. Metadata is also stored in data dictionaries and other repositories.
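As a small illustration of the relational case, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customer` table and its columns are hypothetical, and SQLite is only one example; other database products expose their system tables differently. It shows that table and column names are themselves stored and queried in much the same way as the business values they describe.

```python
import sqlite3

# In-memory database with one hypothetical business table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, "
    "last_name TEXT NOT NULL, "
    "credit_limit REAL)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Smith', 5000.0)")

# Table names live in SQLite's catalog table, sqlite_master.
tables = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
]

# Column names and declared types come from PRAGMA table_info,
# whose rows are (cid, name, type, notnull, dflt_value, pk).
columns = [
    (info[1], info[2]) for info in conn.execute("PRAGMA table_info(customer)")
]

# The business values themselves, queried from the same database.
values = conn.execute("SELECT * FROM customer").fetchall()

print(tables)
print(columns)
print(values)
```

Note that a value such as `5000.0` is retrieved with one query and its description (`credit_limit`, `REAL`) with another, even though both sit in the same database.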
The problem is compounded by the vast array of information we now store beyond that of business. Digital photographs, for example, are documented by metadata (Exif data) that includes date and time, geolocation, camera settings, and user-entered descriptions. Which of these characteristics are metadata and which are "business information" for the photographer?
Metadata also exists implicitly in the information itself, but recognizing it depends on context and on audience. In a movie, for example, opening shots of the Eiffel Tower easily set the "location metadata" as Paris for most Western people but may fail for the residents of an African township. Increasingly sophisticated deep-learning algorithms can automatically extract metadata from images. These and other examples render the distinction between metadata and information impossible to justify or even articulate.
As defined today, metadata presents us with several problems:
- Business users don't understand if or how it differs from "business information"
- Vendors and IT have devised separate approaches to gather, store, and manage it
- Knowing where to find a specific item of information can thus be a challenge, as can any requirement to combine items from the two categories
The solution may sound too simplistic: reintegrate metadata -- descriptive information -- into the larger information scope. In structure, it is certainly not data. It is information. As one part of business information, it is not higher or beyond any other information it describes. Let's just call it context-setting information (CSI) because that's exactly what it does. CSI is simply one part (composed of many components) of the information resource of a business. Depending on the situation, the same piece of information may be a business item or it may set the context for another business item. We must treat the management, storage, use, and access of all business information with exactly the same approaches and care.
Metadata is a combination of two four-letter words that confounds business users and confuses IT. Let's kill it.
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.