Historical Data: From Data Warehouse to Immutable Blockchain
For data warehousing, the blockchain has the potential to simplify and even eliminate the process of building history.
- By Barry Devlin
- March 14, 2016
Managers and other decision makers have long trusted the data warehouse as the best available source of a consistent, historical record of business performance. I discussed the topic of consistency in my previous article. In this article, I consider the creation and management of this historical record in IT systems -- how we have traditionally done it through data warehousing and how modern technology may affect this approach.
Operational systems, by definition, record the current state of the business. The amount of historical data they retain varies as a result of both business and technical limitations. Business requirements for operational systems often exclude longer-term trend analysis and cross-system reporting. In the past, hardware performance and storage limitations led designers to minimize data volumes in operational systems, a situation still seen in legacy systems.
Today's operational systems tend to be less parsimonious in the data they retain, but still few hold a complete historical record. As a result, data warehouses have long built history from the current state records of operational systems. In my 1997 book, Data Warehouse -- from Architecture to Implementation, I called such data transient.
One simple method of building history in a data warehouse, espoused early on by Bill Inmon, was to take snapshots of transient operational data, usually at day's end. However, this approach misses intraday changes. I favor a more complete approach, which captures each change, and adds a number of timestamps to each (permanently stored) record in the warehouse. Today, this approach is known as bitemporal data, and has been implemented in a number of relational databases, including Teradata, IBM DB2, and Microsoft SQL Server. Ralph Kimball takes another approach to building history using slowly changing dimensions of various types.
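To make the difference from end-of-day snapshots concrete, here is a minimal sketch of the bitemporal idea in Python. It is an illustration only, not how Teradata, DB2, or SQL Server implement it: the names Version, BitemporalTable, record, and as_of are hypothetical, and the sketch assumes each key has a single open-ended current version and that changes arrive in order. Each stored row carries two pairs of timestamps, one for when the fact was true in the business and one for when the warehouse believed it, and no row is ever deleted.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional

FOREVER = date.max  # marker for an open-ended period (an assumption of this sketch)


@dataclass(frozen=True)
class Version:
    key: str
    value: str
    valid_from: date     # when the fact became true in the business
    valid_to: date       # when it stopped being true (exclusive)
    recorded_from: date  # when the warehouse started believing this row
    recorded_to: date    # when the warehouse superseded this row (exclusive)


class BitemporalTable:
    """Hypothetical in-memory table; rows are superseded, never deleted."""

    def __init__(self) -> None:
        self.rows: List[Version] = []

    def record(self, key: str, value: str, valid_from: date, recorded_on: date) -> None:
        """Capture one change to `key`, keeping what was previously believed."""
        updated: List[Version] = []
        for row in self.rows:
            is_current = (
                row.key == key
                and row.valid_to == FOREVER
                and row.recorded_to == FOREVER
            )
            if is_current:
                # Keep the old belief, but mark it as superseded from recorded_on.
                updated.append(replace(row, recorded_to=recorded_on))
                # Re-assert the old value with its validity now bounded.
                updated.append(
                    replace(row, valid_to=valid_from, recorded_from=recorded_on)
                )
            else:
                updated.append(row)
        updated.append(
            Version(key, value, valid_from, FOREVER, recorded_on, FOREVER)
        )
        self.rows = updated

    def as_of(self, key: str, valid_on: date, known_on: date) -> Optional[str]:
        """What did the warehouse believe on `known_on` was true on `valid_on`?"""
        for row in self.rows:
            if (
                row.key == key
                and row.valid_from <= valid_on < row.valid_to
                and row.recorded_from <= known_on < row.recorded_to
            ):
                return row.value
        return None


history = BitemporalTable()
history.record("cust1", "Boston", date(2016, 1, 1), date(2016, 1, 1))
# The customer moved on March 1, but the warehouse only learned of it on March 5.
history.record("cust1", "Denver", date(2016, 3, 1), date(2016, 3, 5))

# What we believed on March 3 about March 2: still Boston.
print(history.as_of("cust1", date(2016, 3, 2), date(2016, 3, 3)))
# What we believe on April 1 about March 2: Denver.
print(history.as_of("cust1", date(2016, 3, 2), date(2016, 4, 1)))
```

A daily snapshot could answer only the second question. Keeping both time axes lets the warehouse reproduce a report exactly as it would have looked before a late-arriving correction.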
Dr. Tom Johnston's excellent 2014 book, Bitemporal Data: Theory and Practice, explores in depth the philosophical, business, and technical aspects of building a correct historical record that can fully represent all the ways that a business may want to explore or roll back the history of its activities. He shows that neither Inmon's nor Kimball's approach can meet these needs. At a more fundamental level, Johnston also concludes that we may need "tritemporal data" (with three sets of timestamps) to fully track all aspects of the provenance of business events and changes over time in our data warehouses.
The world of big data, particularly the Internet of Things, also suffers from a limited historical record, especially where devices operate in "fire-and-forget" mode. As a result, the need to build a historical record remains. The need for a truly permanent record of the ever-changing state of the world has taken on a new urgency as we digitize and track more aspects of physical, personal, business, and societal activities and behavior in a highly distributed computing environment.
Pat Helland comments in an article cleverly titled Immutability Changes Everything that "there is an inexorable trend toward storing and sending immutable data. We need immutability to coordinate at a distance, and we can afford immutability as storage gets cheaper." Helland notes that immutability is the backbone of big data processing but misses the fact that data warehousing has been working the same ground for years.
On the operational front, where transience has long been the norm (and drove the warehouse to build history, as mentioned above), the blockchain is finally bringing the concept of a permanent data record. Defined as a secure transaction ledger database shared by all parties in a distributed network of computers, a blockchain records and stores every transaction that occurs in the network in a public and immutable manner. This approach is attracting widespread interest with its promise to eliminate trusted third parties in a wide range of industries from finance and legal services to real estate and management of intellectual and other valuable property.
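The core mechanism behind that immutability can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not a real blockchain: it leaves out the distributed network, consensus among parties, and digital signatures, and the names Block, Ledger, block_digest, and GENESIS_DIGEST are hypothetical. What it does show is the hash chaining that makes the record tamper-evident: each block stores the digest of its predecessor, so altering any past transaction breaks every link after it.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List

GENESIS_DIGEST = "0" * 64  # placeholder predecessor for the first block (an assumption)


def block_digest(index: int, transaction: dict, prev_digest: str) -> str:
    """SHA-256 over a canonical JSON encoding of the block's contents."""
    payload = json.dumps(
        {"index": index, "transaction": transaction, "prev_digest": prev_digest},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Block:
    index: int
    transaction: dict
    prev_digest: str
    digest: str


class Ledger:
    """Hypothetical append-only ledger: transactions are added, never updated."""

    def __init__(self) -> None:
        self.blocks: List[Block] = []

    def append(self, transaction: dict) -> Block:
        prev_digest = self.blocks[-1].digest if self.blocks else GENESIS_DIGEST
        index = len(self.blocks)
        block = Block(
            index,
            transaction,
            prev_digest,
            block_digest(index, transaction, prev_digest),
        )
        self.blocks.append(block)
        return block

    def is_valid(self) -> bool:
        """Recompute every digest and check each link to its predecessor."""
        prev_digest = GENESIS_DIGEST
        for position, block in enumerate(self.blocks):
            expected = block_digest(block.index, block.transaction, block.prev_digest)
            if (
                block.index != position
                or block.prev_digest != prev_digest
                or block.digest != expected
            ):
                return False
            prev_digest = block.digest
        return True


ledger = Ledger()
ledger.append({"from": "alice", "to": "bob", "amount": 10})
ledger.append({"from": "bob", "to": "carol", "amount": 4})
print(ledger.is_valid())  # True: the chain is intact

# Quietly rewriting history is detected on the next verification.
ledger.blocks[0].transaction["amount"] = 1000
print(ledger.is_valid())  # False: block 0 no longer matches its digest
```

In a real blockchain, every party holds a copy of the chain and can run the equivalent of is_valid independently, which is what removes the need for a trusted third party to vouch for the history.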
For data warehousing, the blockchain -- assuming its eventual success -- has the potential to simplify and, in some cases, eliminate the process of building history. Significant research and development will, however, be required to ensure that the philosophical foundations of correctly recording events and actions in business, as described by Johnston above, are implemented.
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.