Data Models: Beauty Is in the Eye of the Implementer
The data vault model and data warehouse automation are worth investigating if you are about to embark on a new data warehouse project.
- By Barry Devlin
- June 19, 2017
In a recent TDWI Upside article, I suggested that data models will be beautiful again, pointing to new techniques at the conceptual and logical levels that connect users' requirements with the data representations needed to support them. These new modeling approaches allow business users and IT to collaborate closely on valid data definitions and provide agility to finesse the model or substantially rework it as the business changes.
Readers engaged in the more hands-on aspects of implementing such models, whether for a data warehouse, mart, or lake, may well ask: that's all very well, but are there any advances at the logical-physical level to make our lives easier? What about actual implementation?
One important advance that has been growing in popularity over the past decade is the "Data Vault" approach, pioneered by Dan Linstedt and now at version 2.0. This approach, especially when combined with data warehouse automation, offers several advantages, particularly in the areas of enterprise scope and agility, two aspects often considered irreconcilable.
Support for scope and agility, as well as for temporal data, is vital in modern data warehousing. However, understanding how these issues have been addressed takes us back to the old wars between the "Inmonites" and "Kimballites" of the 1990s.
The Historical Warehouse Design Landscape
Per Bill Inmon, a data warehouse should be "subject-oriented," implying tables that are enterprisewide in scope -- highly normalized implementations of the main entities of an enterprise data model. They therefore require up-front, enterprisewide negotiation and definition, which typically delays initial implementation, although staged implementation methodologies may mitigate the delay. These models are also notoriously difficult to change afterwards as business needs evolve.
Furthermore, Inmon's suggested snapshot update approach fails to properly address temporal data, although bitemporal data structures combined with incremental update strategies can address this problem, as I showed in my 1997 book, Data Warehouse -- From Architecture to Implementation.
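To make the bitemporal idea concrete, here is a minimal sketch in Python (not drawn from either book; the class and field names are illustrative). Each row carries two time axes: valid time, when the fact was true in the business, and transaction time, when the warehouse learned of it. This lets an incremental update correct history without destroying what was previously believed.

```python
from dataclasses import dataclass
from datetime import date

FAR_FUTURE = date(9999, 12, 31)  # conventional "open-ended" sentinel

@dataclass
class BitemporalRow:
    customer_id: str
    segment: str
    valid_from: date      # business validity start (inclusive)
    valid_to: date        # business validity end (exclusive)
    recorded_from: date   # when this version entered the warehouse
    recorded_to: date     # when it was superseded (exclusive)

def as_of(rows, valid_on, known_on):
    """Rows valid on `valid_on`, as the warehouse knew them on `known_on`."""
    return [r for r in rows
            if r.valid_from <= valid_on < r.valid_to
            and r.recorded_from <= known_on < r.recorded_to]

history = [
    # Originally recorded in January as "retail"...
    BitemporalRow("C1", "retail", date(2017, 1, 1), FAR_FUTURE,
                  date(2017, 1, 1), date(2017, 3, 1)),
    # ...then corrected in March: the customer was "corporate" all along.
    BitemporalRow("C1", "corporate", date(2017, 1, 1), FAR_FUTURE,
                  date(2017, 3, 1), FAR_FUTURE),
]

# The same business question, answered as of two different points of knowledge.
print(as_of(history, date(2017, 2, 1), date(2017, 2, 1))[0].segment)  # retail
print(as_of(history, date(2017, 2, 1), date(2017, 4, 1))[0].segment)  # corporate
```

A snapshot approach would keep only the latest belief; the bitemporal structure preserves both what was true and what was known at every point.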
Ralph Kimball's star schema dimensional model approach, on the other hand, focuses first on quick-win, departmental solutions (data marts) optimized for a common set of slice-and-dice analysis needs. Although later extended with enterprise bus architecture, full cross-enterprise support remains a challenge, as does ongoing modification in the light of business change. In addition, Tom Johnston challenged the validity of the approach's support for temporal data in his 2014 book, Bitemporal Data: Theory and Practice.
Enter the Data Vault
The Data Vault model was released in 2000 as a public domain modeling method to address these (and other) challenges. This model defines a detail-oriented, history-tracking, specially linked set of normalized tables supporting one or more functional business areas. It addresses enterprisewide needs in a flexible, scalable, consistent, and adaptable manner.
The model consists of three specialized types of entities/tables: hubs based on rarely changed business keys, links that describe associations or transactions between business keys, and satellites that hold all temporal and descriptive attributes of business keys and their associations. A data vault is typically not used directly by business users, but serves as the agreed source for a wide range of business-facing, user-specified data marts in any required format on any platform.
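The division of labor among the three table types can be sketched in a few lines of Python. This is an illustrative, in-memory caricature, not a reference implementation: the table and column names are invented, and Data Vault 2.0 conventions (hash keys, load timestamps, record sources) are followed only loosely.

```python
from datetime import datetime
from hashlib import md5

def hash_key(*business_keys):
    """DV 2.0-style surrogate key: a hash over the business key(s)."""
    return md5("|".join(business_keys).encode()).hexdigest()

hub_customer = {}   # hash key -> rarely changing business key + load metadata
link_order = {}     # hash key -> association between two hub keys
sat_customer = []   # all descriptive, history-tracked attributes, one row per load

def load_customer(customer_no, name, source, load_ts):
    hk = hash_key(customer_no)
    # The hub holds only the business key; inserts are idempotent.
    hub_customer.setdefault(hk, {"customer_no": customer_no,
                                 "load_ts": load_ts, "source": source})
    # The satellite absorbs every descriptive change as a new row.
    sat_customer.append({"hub_key": hk, "name": name,
                         "load_ts": load_ts, "source": source})
    return hk

def load_order(customer_no, product_no, source, load_ts):
    # The link records a transaction between business keys, by reference.
    lk = hash_key(customer_no, product_no)
    link_order.setdefault(lk, {"customer_hk": hash_key(customer_no),
                               "product_hk": hash_key(product_no),
                               "load_ts": load_ts, "source": source})
    return lk

# A renamed customer adds a satellite row; hub and link are untouched.
load_customer("C001", "Acme Ltd", "CRM", datetime(2017, 1, 1))
load_order("C001", "P042", "Orders", datetime(2017, 2, 1))
load_customer("C001", "Acme Holdings Ltd", "CRM", datetime(2017, 6, 1))
print(len(hub_customer), len(link_order), len(sat_customer))  # 1 1 2
```

The point of the separation is visible in the last three lines: business change lands entirely in satellites, so hubs and links stay stable as the enterprise evolves.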
This model has been growing in popularity among data warehouse implementers, and notable successes have been reported. Nonetheless, like the Inmon and Kimball approaches, it is not without its implementation problems.
The Challenge of Population
Data warehouses, irrespective of their models, have in common the task of populating their tables with data from multiple sources of varying quality, structure, and timeliness. Approaches to data warehouse population have evolved over the decades from hand-written code -- of which there is still far too much -- through ETL and data integration platforms to recent attempts to empower users with self-service data preparation.
Each approach has its strengths and weaknesses. Data warehouse automation (DWA), which carves a middle path between overwhelming technical complexity and unachievable simplicity, shows much promise. There are various flavors of DWA, but most have in common a metadata-driven approach to data transformation that uses the power and functionality of the database where the warehouse and its model reside.
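The metadata-driven principle can be illustrated with a toy generator. Real DWA products such as WhereScape keep far richer metadata and templates; the hypothetical metadata record and SQL below merely show the shape of the idea, namely that the population code is rendered from metadata and executes as set-based SQL inside the warehouse database itself.

```python
# Hypothetical metadata describing one hub load (names are illustrative).
hub_meta = {
    "target": "hub_customer",
    "source": "stg_customers",
    "business_key": "customer_no",
    "record_source": "CRM",
}

def generate_hub_load(meta):
    """Render a set-based INSERT that adds only previously unseen business keys."""
    return (
        f"INSERT INTO {meta['target']} (hub_key, {meta['business_key']}, "
        f"load_ts, record_source)\n"
        f"SELECT MD5(s.{meta['business_key']}), s.{meta['business_key']}, "
        f"CURRENT_TIMESTAMP, '{meta['record_source']}'\n"
        f"FROM {meta['source']} s\n"
        f"WHERE NOT EXISTS (SELECT 1 FROM {meta['target']} h\n"
        f"  WHERE h.{meta['business_key']} = s.{meta['business_key']});"
    )

print(generate_hub_load(hub_meta))
```

Because the SQL is regenerated from metadata, a change to a source or target ripples through by editing one metadata record rather than hand-maintained transformation code, which is where the maintenance advantage over hand-written ETL comes from.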
The beauty of this is that all the necessary components for population come together in one place, streamlining IT and user interaction in design and simplifying the creation and subsequent maintenance of the warehouse.
A recent interesting example of this approach sees DWA software pioneer WhereScape collaborating with Data Vault inventor Dan Linstedt to develop code templates and other software with embedded support for the Data Vault 2.0 standards. This is intended to simplify and accelerate the design, creation, population, and maintenance of the hubs, links, and satellites of the model.
Modelers may debate the relative beauty of the different data warehouse models described above. However, the ultimate measure of their value lies in how well they deliver business value, which is, in turn, based on how easily they can be implemented, operated, and maintained. The data vault model, particularly when combined with data warehouse automation, is worth investigating if you are about to embark on a new data warehouse project.
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.