Evolving the Data Warehouse
The classic data warehouse architecture is in need of a retrofit. It must be updated to support a real-time, data-in-motion paradigm.
- By Steve Swoyer
- April 10, 2017
A quarter century on, data warehouse architecture can no longer keep pace with the requirements of radically new business intelligence (BI) and advanced analytics use cases.
The reason: data warehouse architecture was conceived at a time when data volumes were comparatively small, information consumption was comparatively predictable, and relational data was the dominant game in town. Obviously, a lot has changed since 1988.
However, even if its enabling architecture needs updating, the concept of the data warehouse isn't obsolete. Something like a data warehouse is and will continue to be an essential part of day-to-day business decision making. Its enabling architecture will be different, that's all. "We have to change our data warehouse architectures, but we have to change them in an evolutionary rather than a revolutionary way," says Rick van der Lans, founder of R20/Consultancy, an advisory firm based in the Netherlands and operating worldwide.
What does van der Lans mean by "evolutionary"? He'll tackle this very question at TDWI's upcoming Chicago conference, where he's slated to teach a pair of half-day classes: "Modern Data Warehouse Architectures: From A to Z" and "The Logical Data Warehouse Architecture: Design, Architecture, and Technology."
Each class explores the twofold issue of evolution: first, how must data warehouse architecture evolve to support (radically) new information consumption patterns? Second, what role do new technologies -- e.g., data virtualization (DV), NoSQL data stores, in-memory compute engines, and cloud -- have to play in this evolution?
The short answer, he says, is that these and other technologies aren't incompatible with modern data warehouse architectures -- provided they're used in complementary ways.
"For the last 25 years, most companies that developed BI and analytical systems … designed them using a classic data warehouse architecture," van der Lans explains, referring not only to the physical RDBMSs on which the data warehouse lives, but to ETL tools, staging areas, operational data stores, and the like. "If you look at the new BI and analytics requirements, on the one hand, and the new technologies we have available to address these requirements, on the other, more and more companies are seeing that the traditional architecture is not the right one anymore. It's too inflexible, and it doesn't support operational, real-time use cases."
Evolutionary, Not Revolutionary, Change
Real revolution is rare. In fact, most technological change is evolutionary, for a couple of reasons. The first is that organizations have a lot of time and money invested in the technologies they use. These technologies cannot be changed -- let alone replaced -- overnight. As a result, organizations tend to prefer evolutionary changes to revolutionary, rip-and-replace updates.
A second reason is more fundamental: even if the specific implementation of a thing becomes obsolete, the needs it was designed to address are often still valid. People no longer use modems to send and receive data over analog telephone lines. However, this doesn't mean they've stopped transmitting data -- or using modems, for that matter. The specific implementation might have changed, but the need itself (a convenient means to rapidly send and receive data) is as critical as ever.
So it is with the data warehouse. Its core premise is that the business needs a time-variant repository of trusted information on which business people can base decisions, perform and compare historical analyses, and so on. However, the specific implementation of the warehouse is a function of a host of other factors and assumptions, some of which are incidental to this core need.
Traditionally, data warehouse systems ran on RDBMSs because that was what was available. ETL tools emerged to address a data integration problem -- namely, extracting data from upstream systems prior to transforming it -- that was unique to the 1990s. ODSs and staging areas were used to address other incidental factors.
Two Takes on Modern Data Warehouse Architectures
The upshot is that the classic data warehouse architecture is in need of a retrofit, van der Lans argues. The first of his classes explores what's involved in retrofitting the warehouse for relevance in the present and beyond. The needs the data warehouse was designed to address must be reconciled with -- and, to the degree possible, updated to support -- a real-time, data-in-motion paradigm. Furthermore, the use cases for which the warehouse was optimized must also be reconciled with new use cases, such as exploratory analytics, that have radically different requirements.
"More and more environments are processing data live, as soon as it comes in. This is [a model for which] the classic warehouse is unsuitable. The challenge is to understand the uses and limitations of the new technologies -- Hadoop and Spark, Storm and Kafka for streaming [processing] -- as well as their place in a modern data warehouse architecture," he explains.
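The difference between the classic warehouse's batch model and the data-in-motion model van der Lans describes can be sketched in a few lines. The example below is illustrative only (the event records and field names are hypothetical, and a real deployment would read from something like a Kafka consumer loop rather than an in-memory list): instead of loading a day's data and then aggregating it, each event updates a running total the moment it arrives.

```python
from collections import defaultdict

def process_stream(events):
    """Update running per-product totals as each event arrives,
    instead of waiting for a nightly batch load (illustrative only)."""
    totals = defaultdict(float)
    for event in events:  # in practice, a Kafka/Storm consumer loop
        totals[event["product"]] += event["amount"]
        yield event["product"], totals[event["product"]]

# Hypothetical stream of sales events
events = [
    {"product": "A", "amount": 10.0},
    {"product": "B", "amount": 5.0},
    {"product": "A", "amount": 2.5},
]
for product, running_total in process_stream(events):
    print(product, running_total)
```

The point of the sketch is the shape of the loop, not the arithmetic: results are available after every event, which is exactly what a periodic ETL-and-load cycle cannot offer.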
"There's also [the issue of] how cloud fits in. If your architecture has become too rigid, would it help if you moved some or all of it into the cloud? Some vendors are saying it would -- that their solutions are more flexible. Are they?"
The second of van der Lans' classes focuses on one style of modern DW architecture -- the logical data warehouse. "I do address the logical data warehouse briefly in [the first] class, because it's an innovative approach to [doing] data warehouse architecture," he says, explaining that the logical data warehouse architecture uses DV technology to create a virtual abstraction layer between data sources and the people (or machines) who consume data.
"Logical data warehouse architecture [is a means of] separating the data consumers -- ranging from the people who are consuming very straightforward reports to the business analysts, data scientists, and investigative users who want to do more sophisticated things with data -- from the sources of data," he says. "The core reason for doing that is to get more flexibility. If we want to plug in new types of data or even new types of use cases -- we just plug them in. We don't have to change anything about the physical layout of the architecture."
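That "just plug them in" flexibility comes from routing every consumer through one logical schema. Here is a minimal sketch of the idea, with entirely hypothetical source and field names: consumers see a single `customers` view, and only the mapping from that view to the physical sources changes when a source is added or swapped.

```python
# Hypothetical physical sources with different layouts
CRM_ROWS = [{"cust_id": 1, "name": "Acme"}]          # e.g., an RDBMS table
WEB_ROWS = [{"customerId": 2, "fullName": "Birch"}]  # e.g., a NoSQL store

def customers():
    """Logical view: one schema for consumers, regardless of how
    (or where) the underlying data is physically stored."""
    for row in CRM_ROWS:
        yield {"id": row["cust_id"], "name": row["name"]}
    for row in WEB_ROWS:  # swapping this source changes only this mapping
        yield {"id": row["customerId"], "name": row["fullName"]}

print(list(customers()))
```

A DV product does this declaratively and at scale, of course; the sketch only shows why reports, analysts, and data scientists querying the view are insulated from changes to the physical layout.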
Engineering More Resilient, More Responsive Data Warehouse Systems
One takeaway from both classes is that something like the data warehouse will continue to play an important role going forward. Even logical data warehouse architecture -- which notionally eschews a physical data warehouse -- will probably use a limited version of the warehouse. "One of the questions people ask is, 'Does this mean we have to get rid of the physical data warehouse?' The answer is that you'll probably need a simplified one. For example, you're still going to have source systems that don't keep track of history. You need to have some way to keep track of history, however," van der Lans points out.
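The history-keeping job van der Lans describes is typically solved with an append-only table of validity intervals -- a simplified slowly-changing-dimension pattern. The sketch below is an assumption-laden illustration (keys, field names, and dates are invented): when a source value changes, the current row is closed and a new one opened, so the full history survives even though the source system keeps only the latest value.

```python
from datetime import date

def record_change(history, key, value, today):
    """Append-only history: close the open row and start a new one
    when a tracked value changes (simplified SCD-style pattern)."""
    current = next((r for r in history
                    if r["key"] == key and r["valid_to"] is None), None)
    if current and current["value"] == value:
        return  # no change, nothing to record
    if current:
        current["valid_to"] = today  # close the previous version
    history.append({"key": key, "value": value,
                    "valid_from": today, "valid_to": None})

history = []
record_change(history, "cust-1", "Gold", date(2017, 1, 1))
record_change(history, "cust-1", "Platinum", date(2017, 4, 1))
```

After the second call, the "Gold" row is closed as of April 1 and a "Platinum" row is open -- the kind of time-variant record a source system without history cannot supply.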
Finally, van der Lans will address another critical issue in both of his classes: is it possible to develop a data warehouse architecture that is not only easier (and cheaper) to maintain but more resilient, too? For example, a data warehouse architecture that will permit IT to respond more rapidly, more effectively, to users' needs? "We're just trying to make it a more lightweight architecture so that when users come to us with new questions, we don't have to tell them, 'You're going to have to wait two to three months.'
"The old architecture is much too slow. We no longer have the luxury of two to three months. We need to be able to tell users, 'We can have that for you -- in two to three days.'"