It's Information Governance We Need
Information governance requires us to discern the intention behind information and its provenance rather than the bald facts of its technical sources and processing demanded by data governance.
- By Barry Devlin
- October 19, 2018
Rudy Giuliani's declaration on August 19 that "truth isn't truth" was met with equal doses of shock and scorn. However, data scientists and BI experts should pause and examine how this seemingly paradoxical statement might apply within their own field. The outcome may be new insights in the area of information governance for the increasing volumes of often poor-quality data we capture and process in modern digital business.
What inspired this article was a statement in the 2018 marketing material of a very large and respected IT company that shall remain nameless: "Every day, we create 2.5 quintillion bytes of data -- so much that 90 [percent] of the data in the world today has been created in the last two years alone." I'm sure you've seen this statement many times over the past few years. I know I have. My curiosity was piqued: if it is true that we are living in a world of "exponentially" growing data volumes, the daily 2.5 quintillion bytes figure must date to a particular year.
I went looking for the original research for the quote. I'm still searching. I did discover its probable first use. The Internet Archive's Wayback Machine revealed -- to my surprise -- that the statement first appeared on an IBM Big Data marketing page in May 2011. This inconvenient "non-truth" has been quoted repeatedly by reputable sources for seven years and, to my knowledge, has never been questioned or verified.
The impact of this misinformation is rather limited, unless, of course, some disk storage vendor used it to sell you some very large equipment. In fact, if they had properly applied compound growth to the 2011 figure, they could have had an even stronger case. However, it seems that Giuliani was making sense. Truth may not necessarily be truth -- for a variety of reasons: errors, carelessness, or even malicious disinformation (so-called fake news).
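To make the compound-growth point concrete, here is a rough back-of-the-envelope sketch. It takes the quoted claim at face value and assumes steady exponential growth from 2011 to 2018. Neither assumption is verified, and the function names are purely illustrative. If 90 percent of all data was created in the last two years, the total must grow tenfold every two years, and under steady growth the daily creation rate grows at the same pace.

```python
import math


def annual_growth_factor(recent_share: float, window_years: float) -> float:
    """Annual growth factor g implied by the claim that `recent_share` of all
    data was created in the last `window_years` years.

    Assumes steady exponential growth, so 1 - g**(-window_years) = recent_share.
    """
    return (1.0 / (1.0 - recent_share)) ** (1.0 / window_years)


def projected_daily_bytes(base_daily_bytes: float, base_year: int,
                          target_year: int, growth: float) -> float:
    """Compound the base-year daily rate forward to the target year."""
    return base_daily_bytes * growth ** (target_year - base_year)


# Figures from the quoted statement; 2011 is its probable first appearance.
g = annual_growth_factor(recent_share=0.90, window_years=2)
daily_2018 = projected_daily_bytes(2.5e18, base_year=2011, target_year=2018, growth=g)

print(f"Implied annual growth factor: {g:.2f}")
print(f"Implied 2018 daily volume: {daily_2018:.2e} bytes")
```

Under these assumptions, the implied annual growth factor is the square root of 10 (roughly 3.16), and the 2011 figure compounded over seven years comes to roughly 7.9 sextillion bytes per day, some three thousand times the number still being quoted in 2018. The result says nothing about whether either half of the original claim is true; it shows only that the two halves cannot both hold unchanged year after year.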
Data Rich, Information Poor
Whatever the correct figure, it is undeniable that today's business is dealing with enormous and increasing volumes of data from external sources. Not only is this data orders of magnitude bigger than traditional, internally sourced data, but it also comes in a variety of sometimes inadequately described structures whose reliability may too often be suspect. With such poor-quality data increasingly driving important business decisions, data scientists and BI experts must repeatedly ask, "Yes, but is it the truth?"
The challenge comes in two main guises. The first -- and already long-standing -- concern relates to social media, a subset of human-sourced information. As has become obvious over the past half-decade, social media has become an increasingly unreliable indicator of real-world opinions and behaviors as it has been progressively gamed and politicized.
Part of the problem is that the statistical data extracted and analyzed by businesses fails to reflect the complexity and nuance of the current human-sourced information found on social media. Is a post a real opinion from a real person, or is it a paid-for post or the output of a nefarious bot farm?
Our ability to distinguish between real information, misinformation (genuine error), and disinformation lags the interest and drive of those who profit from distorting the system. Data aggregators and other parties have developed simplified but enormous data models that drive the collection and use of vast troves of personal details, purportedly to deliver individually targeted advertising. Behind them sits an Internet business model built upon surveillance -- as suggested by Open University Professor John Naughton in #10 of his 95 Theses about Technology. In a system built on simple, numerical data gathered from a commercialized and gamed environment, truth may well be the first casualty.
Our latest obsession with machine-generated data from the Internet of Things poses the second challenge. There is a widespread belief that data generated by electronic sensors and passed through the Internet represents the truth about the physical world. The reality is considerably messier. Sensors may be faulty or hacked. Communications may be interrupted or intercepted. With data volumes and velocities far exceeding those of traditional, social media-sourced big data, the temptation is to "process and be damned" as quickly and efficiently as possible. Furthermore, data scientists often analyze data with incomplete context, leaving them to intuit the actual information content even before looking for truth.
Truth certainly does not equal information, but the path between them is shorter and smoother than that from data to truth.
Information Governance as a Quest for Truth
Data governance, a long-neglected discipline, has seen a recent revival in interest as the diggers of data lakes finally recognize the swamps they are creating (at a rate of 2.5 quintillion bytes per day). Data catalogs and metadata stores, business glossaries and enterprise data models -- often populated through machine learning technologies -- are the flavor du jour. Technology to the rescue again, and just in time!
The problem is that it is not data that needs to be governed; rather, it is information that requires true governance. Data governance is necessary but not sufficient to discern the path to truth. With data, based on our data warehousing history, we assume there is a "single version of the truth." However, when we consider information, it becomes clear that truth comes in many varieties, some genuinely conflicting and some that must indeed be reconciled.
Information governance requires us to discern the intention behind information and its provenance rather than the bald facts of its technical sources and processing demanded by data governance. For example, the automobile industry has demonstrated repeatedly that its goals in measuring fuel efficiency and emissions differ considerably from the aims of the regulators and the expectations of the public.
Information governance steps beyond the dry roles of data owners and the quality proofs of data stewards within the business to explore how information emerges and morphs under human influence in the world beyond and within the enterprise walls. Now that sounds like a much more interesting pursuit!
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.