Data Vault 101
Is the data vault one of the better-kept secrets in business intelligence? Agile plus the data vault may be a match made in heaven.
- By Stephen Swoyer
- August 5, 2014
Is the data vault one of the better-kept secrets in business intelligence (BI)?
Raphael Klebanov, a certified data vault modeler and a technical support manager with data warehouse automation specialist WhereScape Inc., thinks so. He recently made a presentation at the World Wide Data Vault Consortium (WWDVC) in St. Albans, VT. His topic: using WhereScape's data warehouse automation tools to drive an agile data vault development project.
According to Klebanov, agile and the data vault are nothing less than a match made in heaven.
"For me, the process of building a data warehouse is [one of] continuous morphing. You have this piece of clay and you change it and transform it into something more complex," he comments.
"This does not happen all at once. You don't have to build [a data vault warehouse]100 percent correctly at the beginning. You just build something that makes sense [using] the knowledge that you have today, then you build that data warehouse model and implement it [using a data warehouse automation tool]. You show it to your decision maker or your business people, they look at it and see if it does what they need, and you evolve from there. In the final stage, the piece of clay you started with has become something else. You can build anything out using this [approach]."
Data Vault Debutante
The data vault, which was invented and championed by industry luminary Dan Linstedt, is sometimes described as a hybrid of traditional star schema and third-normal-form (3NF) warehouse architectures. According to Klebanov, the data vault architecture addresses several of the primary shortcomings of 3NF -- inflexibility and brittleness; a protracted development process; and a requirement for a custom-built presentation layer to support BI tool access -- and co-opts one of the biggest benefits of the star-schema (or dimensional) model: namely, some degree of abstraction from dependency on specific BI tools. Unlike a star schema design, a data vault architecture normalizes and regularizes data, which makes it suitable for data mining (as well as for other applications that require raw or normalized data) and for real-time loading. The data vault model has an additional advantage over both dominant architectures, Klebanov argues: it expects conditions to change, and it's designed to permit rapid and flexible responses to business change.
"Change will cause [a] 3NF [warehouse] to crash. A lot of customers prefer using [3NF] because it's pretty easy to follow because each table is holding information about something singular. Even if you have a lot of different tables, the relations between those tables are easy to follow. However, if something changes, if your business changes [in a non-trivial way], then your model is going to crash because the [3NF] model isn't flexible at all," he argues.
For many data management practitioners, Klebanov acknowledges, the pros and cons are still an extremely contentious subject. This is because one common variant of the 3NF data warehouse architecture (the "Corporate Information Factory") is the brainchild of one of the giants of DW -- industry luminary Bill Inmon -- who was associated with one of the Big Two warring factions in the data warehouse wars of the 1990s. (It isn't strictly accurate to say that Inmon and his counterpart, Ralph Kimball, who helped popularize an enterprise bus-based flavor of the star schema architecture, actually led either warring faction. All the same, plenty of DW professionals took up arms in defense of either Inmon's 3NF or Kimball's star schema models during the 1990s and early 2000s.)
Even though 3NF has several big benefits -- intelligibility, storage efficiency, and enterprisewide scope among them -- it also has a number of prominent drawbacks, starting with fragility. In 3NF modeling, for example, business keys -- e.g., "Customer_ID" or "Order_ID" -- are hard-coded, with the result that if the structure of the relationship between business keys in a table changes (e.g., from many:many to one:many), the 3NF model must itself be refactored.
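The fragility described above can be sketched in a few lines. The tables and column names below are hypothetical illustrations (not drawn from any real schema), with rows modeled as plain Python dicts:

```python
# Hypothetical 3NF design, one:many -- each order belongs to exactly
# one customer, so the Customer_ID business key is hard-coded as a
# column of the Order table.
orders_3nf = [
    {"Order_ID": 1001, "Customer_ID": "C1", "total": 250.0},
    {"Order_ID": 1002, "Customer_ID": "C2", "total": 99.0},
]

# If the relationship becomes many:many (say, an order can now be
# split across customers), the Order table itself must be refactored:
# the foreign key moves out into a new junction table, and every
# query that joined on the old Customer_ID column breaks.
orders_refactored = [
    {"Order_ID": 1001, "total": 250.0},  # Customer_ID is gone
]
order_customer = [  # new junction table replaces the hard-coded key
    {"Order_ID": 1001, "Customer_ID": "C1"},
    {"Order_ID": 1001, "Customer_ID": "C3"},
]
```

The point is that the change is structural, not just a matter of loading different data -- which is exactly the brittleness the data vault's link entities are meant to avoid.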
By contrast, Klebanov claims, data vault's concept of "link" entities is more flexible, and thus more resilient, than is the corresponding implementation in 3NF, such that making or introducing non-trivial changes (e.g., redefining the relationships between business keys) won't necessarily break a data vault model. (In data vault modeling, a "link" represents the relationship between two or more business keys.) "If you try to bring new sources into that [3NF] model, then it can also become complicated. That can [also] crash your model. Some of the people who still advocate for [3NF], if the business doesn't change much, it's a valid approach, but most businesses will change," Klebanov argues.
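A minimal sketch of the link concept, using hypothetical table and column names (the hash-key naming here is illustrative, not a vendor convention): the relationship between customers and orders lives in rows of a link table keyed by hub keys, so a cardinality change means loading more rows, not restructuring tables.

```python
# Hubs hold only business keys (plus a surrogate hash key).
hub_customer = [
    {"Customer_HK": "hk_c1", "Customer_ID": "C1"},
    {"Customer_HK": "hk_c3", "Customer_ID": "C3"},
]
hub_order = [
    {"Order_HK": "hk_o1", "Order_ID": 1001},
]

# The link table represents the relationship between the two business
# keys as data. One customer per order today:
link_customer_order = [
    {"Link_HK": "hk_l1", "Customer_HK": "hk_c1", "Order_HK": "hk_o1"},
]

# Tomorrow the business allows split orders (many:many). No table is
# refactored -- the model simply accepts another link row:
link_customer_order.append(
    {"Link_HK": "hk_l2", "Customer_HK": "hk_c3", "Order_HK": "hk_o1"}
)
```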
One of the biggest knocks against 3NF is its protracted development process: 3NF projects have been known to go off track during the planning or requirements-gathering process, as designers plunge down into a rabbit hole of constantly expanding requirements. This is a function of 3NF's enterprisewide scope. To the extent that a CIF aims to be -- according to Inmon himself -- a "subject-oriented, integrated, time-variant, non-volatile collection" of all data relevant for business decision-making, it's going to encompass lots of different subject areas, with lots of different tables and dimensions.
A dimensional or star schema approach usually manages to avoid the problem, and star schema likewise lends itself to agile methodologies. On the other hand, star schema lacks 3NF's enterprise scope: it works so well in an agile context precisely because its development model breaks projects down into discrete units of work.
On top of this, star schema imposes limitations (e.g., its unsuitability for data mining and real-time loading) that -- particularly in contemporary deployments -- can be deal-breakers for some adopters. This is why the data vault architecture has emerged as an especially compelling alternative, Klebanov argues: it supports normalization and regularization without the schematic brittleness or performance issues that are characteristic of traditional 3NF architectures; it likewise accommodates the use of (denormalized) subject-specific star schemas fed from the data vault data warehouse. This makes it ideal for supporting BI tool access.
"For me, 3NF is the first stage of transforming data from sources all of the way through to the data mart. It's what most of the sources look like. It's a transactional type of data, but to select from it is a pain. ... It has to be morphed, it has to be translated into something different," says Klebanov.
"Data vault seems to be a very good model, a very good option for that [i.e., translation], because you create hubs and satellites, and you take all of your attributes away from that. If one source goes away and another one comes [online], that is very easy for the data vault to accept this [kind of] change. However, [by itself] the data vault is not very good [at doing] any type of reporting out of it, even running queries out of it. All of this linkage [i.e., relationships between business keys] might not be very clear, and it can be quite convoluted for a user to make sense out of it."
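The hub-and-satellite split Klebanov describes -- business keys in hubs, attributes stripped out into satellites -- can be sketched as follows. The source systems, column names, and dates below are all hypothetical:

```python
# The hub holds only the business key; it never changes when sources do.
hub_customer = [
    {"Customer_HK": "hk_c1", "Customer_ID": "C1"},
]

# All descriptive attributes live in satellites, stamped with a load
# date so history is preserved. One satellite per source, e.g. a CRM:
sat_customer_crm = [
    {"Customer_HK": "hk_c1", "load_date": "2014-01-01",
     "name": "Acme Corp", "region": "West"},
]

# If that source goes away and a new one (say, an ERP) comes online,
# the vault accepts the change by adding a new satellite; the hub and
# any links pointing at it are untouched.
sat_customer_erp = [
    {"Customer_HK": "hk_c1", "load_date": "2014-06-01",
     "credit_limit": 50000},
]
```

This also shows why the raw vault is awkward to query directly: answering even a simple question about a customer means stitching the hub back together with one or more satellites (and links), which is the "convoluted" linkage Klebanov warns business users about.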
A Pragmatic Approach
This is why Klebanov and some other data vault enthusiasts champion the use of subject-specific downstream star schema data marts that can be fed by the data vault data warehouse. This is something of an inversion of Ralph Kimball's bus-based model, which has denormalized data marts feeding a star schema data warehouse.
"Data marts are the best for presenting data to the user. They're pretty brittle too, [and] they're very hard to adjust. However, they're very small, and they can be dropped and recreated almost at run time. As far as presenting data to the user, it's the best approach: it's very easy to follow, the business user is only looking at the stuff they're familiar with, they can build cubes on top of it, simple reports on top of it, or Excel spreadsheets on top of it. It's very manageable, too. Of course, as time passes and the business changes, they become obsolete -- but because they're small, they just can be dropped and recreated very easily."
What's important, he explains, is that "the core of the data, with the relationships and all of the history, is held in the data vault structure, or data store structure, or sometimes just 3NF structure."
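Because the marts are derived from the vault rather than being the system of record, regenerating one is cheap. A hypothetical sketch (the table names, attributes, and join logic here are illustrative, not WhereScape's generated code) of rebuilding a small customer dimension from a hub and its satellite:

```python
# Hypothetical vault structures: hub of business keys, satellite of
# attribute history stamped with load dates.
hub_customer = [
    {"Customer_HK": "hk_c1", "Customer_ID": "C1"},
    {"Customer_HK": "hk_c2", "Customer_ID": "C2"},
]
sat_customer = [
    {"Customer_HK": "hk_c1", "load_date": "2014-01-01",
     "name": "Acme", "region": "West"},
    {"Customer_HK": "hk_c1", "load_date": "2014-06-01",
     "name": "Acme Corp", "region": "West"},
    {"Customer_HK": "hk_c2", "load_date": "2014-03-01",
     "name": "Globex", "region": "East"},
]

def build_customer_dim(hub, sat):
    """Join each hub key to its latest satellite row to produce a flat,
    denormalized dimension a BI tool can query directly. The mart can
    be dropped and rebuilt by rerunning this against the vault."""
    latest = {}
    for row in sorted(sat, key=lambda r: r["load_date"]):
        latest[row["Customer_HK"]] = row  # later load dates win
    dim = []
    for h in hub:
        attrs = {k: v for k, v in latest.get(h["Customer_HK"], {}).items()
                 if k not in ("Customer_HK", "load_date")}
        dim.append({"Customer_ID": h["Customer_ID"], **attrs})
    return dim

dim_customer = build_customer_dim(hub_customer, sat_customer)
```

The history stays in the vault; the mart carries only the current, business-friendly view -- which is what makes it safe to treat as disposable.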
Data vault modeling has its critics. They note (correctly) that it's no more suited for programmatic or ad hoc access than is 3NF; that it places more of a premium on technique-specific knowledge or skill; and that (certainly for tyros) a data vault approach is comparatively more complicated and difficult to maintain.
There's truth to all of this. In a sense, however, it doesn't matter, according to Klebanov: some data warehouse automation tools -- he specifically cites WhereScape's 3D prototyping and modeling tool -- can automatically generate data vault models, as well as convert or translate from 3NF to data vault models, with limited support for star schema to data vault models. (These tools can likewise automatically generate 3NF and star schema dimensional models, too.)
Data vault modeling isn't something that one does lightly -- Klebanov is himself a certified data vault modeler -- and an organization would do well to build up data vault-specific expertise before committing to developing a data vault-based data warehouse, but automated development tools can make it a lot easier.
"If you build your dimensional model [in a tool such as 3D] and bring in your 3NF tables, then you can create associations between the keys, [for example] the Customer_ID in one table and the Order_ID in the other. [The tool] will automatically create associations, automatically build an artificial key on top of business key if it's a dimension, [and] automatically bring a foreign key into dimension. You can also use 3D to create multiple models and compare them, for whatever reason."
The tools are getting increasingly intelligent, too, says Klebanov. He again uses the example of WhereScape's 3D product, but acknowledges that competitive products -- from vendors such as BIReady Inc., Kalido, and timeXtender Software, among others -- have their own strengths and weaknesses.
"You can create a 'compare' documentation based on all of the metadata information about the different models, or you can do a gap analysis between them," he points out. "If you're trying to build a fact table of a specific grain and you don't have one of the components of that grain, [3D] is going to give you a gap analysis and tell you that a component or dimension is missing."