Combining SQL and NoSQL to Support Information Context
A combination of relational and non-relational technologies opens the path for the level of adaptive, contextual information management urgently needed by modern business.
- By Barry Devlin
- May 13, 2019
It has been exactly 10 years since NoSQL proponents launched a fake war against relational databases with the charge that SQL could not handle the famous 3Vs of volume, velocity and variety of big data. The retreat from "No SQL" to "Not only SQL" ended the war, but both sides still struggle with how to manage change -- not just in data but, more important, in its meaning and context over time and in different circumstances.
Kimball's slowly changing dimensions and relational bitemporal data are examples of how changes in data over time are recorded. For example, Jane was sales manager for the southwestern region until December 31, 2018, and then took responsibility for the western region. This data-level change is easily recorded and understood in both above methods.
What is not and cannot be recorded in today's databases is the fact that three months later, the southwestern region was subsumed into the western region. The business decided to reorganize its regional sales structures and, in the process, redefined the meaning of "western region." Furthermore, with the thought of enabling more granular reporting, it redefined the meaning of "region" from a collection of states to a collection of ZIP codes.
Tracking Jane's sales performance this year will require some "creative accounting" because the context for calculating it has changed in some rather complex ways. Managing such contextual change is vital to deliver the agility required by modern business.
The Importance of Tracking the Context of Information
I have long promoted the phrase "context-setting information (CSI)" as a necessary improvement on the widely (ab)used word metadata. The word metadata, usually defined as "data about data," hides its full scope and importance as a first step in recording the context of the data stored. We need the complete context of data to truly understand the real meaning of the data we collect or generate.
Context-setting information specifies that what we are dealing with is actually information of business interest that adds contextual richness to its use. Such context includes information definition, sourcing, timeliness, prior manipulation, relationships, valid uses, ownership, and responsibility for quality. In reality, all information is CSI to some extent; every item of information provides context for other pieces of information. This leads to the realization that -- contrary to commonly held beliefs -- metadata/CSI, like most information, may vary with time or circumstances, as seen in the example above.
According to Lars Rönnbäck, co-inventor of Anchor modeling, data (or, better, information) modeling must evolve to identify and describe these continuously changing contextual relationships as discussed in my recent article "Models in Times of Uncertainty." A more challenging question is how to represent such relationships in a useful way in data storage and management systems.
Beyond SQL and NoSQL Databases
SQL and NoSQL databases adopt very different approaches to handling CSI. Relational databases lock in the information context and data model at design time based on the needs and understanding of the builders of the system that first stores the data. This context is freely available to all subsequent users of the data. At the other extreme, schema-less file structures and document stores manage only "naked data" -- data values and their names -- in simple, sequential records. Context must be added by every application using this data to turn it into usable information. The former approach is too rigid to offer the agility demanded by modern business; the latter is too loose to support the necessary data quality and easy reuse in a rapidly changing world.
What is required is to step beyond these contrary data(base) management system approaches to a new and more powerful concept -- what I call an adaptive information context management system (ICMS). In addition to managing basic data, such a system also manages the context around it. This allows important context, such as the changes in definition of sales region in our example, to be stored and formally tracked over time.
By explicitly managing context-setting information, an ICMS also enables existing data to be repurposed for new needs without requiring the creation and management of multiple copies of the same data. For example, an informational view of atomic operational data can be created without building and populating a separate data warehouse or mart. An ICMS allows context to be defined and changed with the same agility that document stores offer for data, but with the added advantage of providing management of CSI.
The concept of an ICMS was inspired by a small, German-built product, CortexDB, which uses a combination of relational and non-relational technologies to combine structure with flexibility, enabling agility in data use as described above. The naked data, independent of context, is preserved in a document store, which is automatically modeled and managed in an integrated, multidimensional index that points to every piece of data in every document. This index is formally in sixth normal form (6NF), which allows all other normalization levels and models to be subsequently built from it and maintained independently for diverse information uses.
The Future is Both Relational and Not
The partial shift away from relational databases since the mid-2000s was driven, among other reasons, by a lack of agility in coping with changing data contexts and schemata. The alternative NoSQL systems' limited support for CSI has proven problematic for data quality and governance efforts. The current focus on data catalogs, business glossaries, and similar tools marks a much-needed renaissance in interest in understanding and managing information meaning and context. A combination of relational and non-relational technologies opens the path for the level of adaptive, contextual information management urgently needed by modern business.
About the Author
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.