Models in Times of Uncertainty
Data models seem to represent the real world, but a model can be misleading if the relationship between truth and information is not understood.
- By Barry Devlin
- February 1, 2019
If you consider data modeling to be an abstruse and arcane (albeit necessary) process in database design, get your hard hat on now; we must drill much deeper into the abstract to define what "facts" really mean in today's digital business.
Data modeling emerged in the 1970s as a means of ensuring that databases, particularly relational ones, were internally consistent and mathematically sound. However, Peter Chen's seminal 1976 paper on the entity-relationship model adds: "This model incorporates some of the important semantic information about the real world." Since then, data models have become central artifacts in information systems design for the real world, but few practitioners now recall the phrase "some of" in Chen's description. It's really important.
Wikipedia's definition of a (scientific) model as "a simplified and idealized understanding of physical systems" reiterates the point. As digital business increasingly blurs the line between internal data processing systems and the online and physical worlds, we must focus on where the simplifications and idealizations are now too limiting or even misleading.
Missing from the Model
Traditional data modeling only addresses the current moment in time, as if nothing ever changes. Time is certainly not of the essence. I've previously written about the challenges of handling temporal data in databases and data lakes. Although progress continues in database technology, many data warehouse implementations still rely on the outdated and imperfect approaches of Kimball and Inmon. Data lakes lag further behind.
At a deeper level, models only deal in the present tense. An entity, such as customer or product, is declared at modeling time. As business goes digital, even these seemingly fundamental definitions may need to change on multiple occasions. Managing such change is currently left to application developers without any theoretical guidance.
However, there exists an even deeper level of theory that modelers of digital businesses must address. What is the nature of the "things" we write into databases and other data stores? Most of us blithely assume them to be "facts," although few people wonder what a fact is or how we would know if it were "true" or not.
Say What You Mean and Mean What You Say
Lars Rönnbäck, co-inventor of anchor modeling, provides food for thought in a recent article, "The Illusion of a Fact," where he explores whether "the statement 'There are no aliens on the dark side of the moon' [is] a fact." It isn't, but he concludes that "Information, is ... factless and instead has two parts, the pieces of information (posits) and the opinions about the pieces (assertions)."
If you wonder why this is important, I suggest you pay more attention to the plague of disinformation that has infected the Internet in recent years. Reactions such as "stupid readers," "villainous writers," or "immoral social media platforms" entirely miss the point.
Within the disinformation are many posits, supported by various levels of justification. Which of these posits becomes popular depends on people's assertions that they are true. Readers decide which posits they accept as truth based on their existing biases and the extent to which they trust whoever makes the assertion. To this extent, truth, like beauty, is in the eye of the beholder.
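The split between posits and assertions, and the role of reader trust, can be sketched in a few lines of Python. This is only an illustration of the idea, not Rönnbäck's formal model: the class names, the `certainty` scale from -1.0 to 1.0, and the trust-weighted average in `perceived_truth` are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posit:
    """A piece of information; it carries no claim of truth by itself."""
    subject: str
    statement: str


@dataclass(frozen=True)
class Assertion:
    """An opinion about a posit, held by a named asserter."""
    asserter: str
    posit: Posit
    # Assumed scale: -1.0 (certain it is false) to 1.0 (certain it is true).
    certainty: float


def perceived_truth(reader_trust, assertions, posit):
    """Average the certainty of each assertion about `posit`,
    weighted by how much this reader trusts the asserter."""
    relevant = [a for a in assertions if a.posit == posit]
    total_trust = sum(reader_trust.get(a.asserter, 0.0) for a in relevant)
    if total_trust == 0:
        return 0.0  # the reader trusts nobody who has spoken: no opinion
    weighted = sum(reader_trust.get(a.asserter, 0.0) * a.certainty
                   for a in relevant)
    return weighted / total_trust


moon = Posit("moon", "There are no aliens on the dark side of the moon")
opinions = [
    Assertion("alice", moon, 1.0),   # alice is sure the posit holds
    Assertion("bob", moon, -1.0),    # bob is sure it does not
]
```

Two readers given the same posit and the same assertions reach opposite conclusions if one trusts only alice and the other trusts only bob, which is the "eye of the beholder" effect described above.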
However, whenever we look at data in our data warehouses and lakes, we tend to assume -- data governance worries aside -- that what we see there are "the true facts" about the business. This may be reasonable for internally sourced information. However, as businesses onboard increasing volumes of information from untrustworthy social media and Internet of Things sources, this assumption must be carefully considered.
Tom Johnston also discusses these conundrums in depth in his 2014 book, "Bitemporal Data." His conclusion (private correspondence) is "that there is no such thing as a 'God's eye' point of view from which any of us can see what things are really and truly like. So for us, lacking that God's eye point of view, there aren't just very few facts; there are no such facts at all. But even so, the distinction between 'fact' and 'belief' is clearly a useful one. In this useful sense, a fact is what nearly all well-informed people in a particular topic say that they know is the case."
Rönnbäck, in his formal research paper, coins the term transitional modeling for this model of information consisting of posits and assertions. Johnston, basing his thinking on the philosophical concept of "speech acts," further asserts that these considerations lead to the need for a tri-temporal model that includes information about who asserted (or retracted) certain posits and when. These are powerful concepts that merit wider consideration in the IT community as digital business becomes more pervasive.
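To make the idea of recording who asserted or retracted a posit, and when, a little more concrete, here is a minimal Python sketch. It is not Johnston's or Rönnbäck's design. The three date fields (`valid_from`, `stated_on`, `recorded_on`) are my own naming for three time axes, and `SpeechAct` and `standing_posits` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SpeechAct:
    """One entry in an append-only log of assertions and retractions."""
    asserter: str
    posit: str
    valid_from: date    # when the described state is said to begin
    stated_on: date     # when the asserter asserted or retracted it
    recorded_on: date   # when the entry was written to the store
    retracts: bool = False


def standing_posits(log, asserter, as_of):
    """Posits that `asserter` had asserted and not yet retracted by `as_of`."""
    held = {}
    for act in sorted(log, key=lambda a: a.stated_on):
        if act.asserter != asserter or act.stated_on > as_of:
            continue
        # A later act by the same asserter overrides an earlier one.
        held[act.posit] = not act.retracts
    return {posit for posit, is_held in held.items() if is_held}


log = [
    SpeechAct("analyst", "customer 42 lives in Dublin",
              valid_from=date(2018, 6, 1),
              stated_on=date(2019, 1, 10),
              recorded_on=date(2019, 1, 11)),
    SpeechAct("analyst", "customer 42 lives in Dublin",
              valid_from=date(2018, 6, 1),
              stated_on=date(2019, 3, 1),
              recorded_on=date(2019, 3, 1),
              retracts=True),
]
```

Because nothing is ever overwritten, the same log can answer both "what did the analyst stand behind in February?" and "what does the analyst stand behind now?", and the two answers differ.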
Traditional modeling approaches, such as third normal form, anchor modeling, and data vault, can be considered special cases of transitional modeling. However, it is unclear if the relational model, from which they originally sprang, is the best basis on which to proceed further. Nonetheless, relational databases can be used to at least test and prototype some of the ideas described here.
Regarding digital business and the increasing twin challenges of uncertainty in information and blatant disinformation, the approach for now is (sadly) restricted to recommendations to use metadata to record and track data sources and their reliability, and to communicate directly to businesspeople the dangers -- as well as the opportunities -- of using externally sourced data, especially in conjunction with internally sourced information.
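As one possible reading of that recommendation, the following Python sketch attaches source metadata to data and summarizes it for any result derived from several sources. Everything here is an assumption made for illustration: the `SourceInfo` fields, the 0.0 to 1.0 reliability score, and the rule that a derived result is only as reliable as its weakest input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceInfo:
    """Metadata describing where a data set came from."""
    name: str
    internal: bool
    reliability: float  # assumed score: 0.0 (untrusted) to 1.0 (fully trusted)


def provenance_summary(sources):
    """Summarize the provenance of a result derived from several sources."""
    if not sources:
        raise ValueError("at least one source is required")
    kinds = {s.internal for s in sources}
    return {
        "sources": sorted(s.name for s in sources),
        # Assumption: the result is no more reliable than its weakest input.
        "weakest_reliability": min(s.reliability for s in sources),
        # True only when internal and external data are both present.
        "mixes_internal_and_external": kinds == {True, False},
    }


sales = SourceInfo("sales_ledger", internal=True, reliability=0.95)
social = SourceInfo("social_feed", internal=False, reliability=0.4)
summary = provenance_summary([sales, social])
```

A report built on `summary` could then warn businesspeople whenever `mixes_internal_and_external` is true and show the weakest reliability score next to the figures.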
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.