Evolving Data Warehouse Architectures: From EDW to DWE
Many Enterprise Data Warehouses (EDWs) are evolving into multi-platform Data Warehouse Environments (DWEs)
By Philip Russom, TDWI Research Director for Data Management
Analytics, big data, real time, and unstructured data present new data warehouse (DW) workloads.
Workload-centric DW architecture. One way to measure a data warehouse’s architecture is to count the number of workloads it supports. According to the TDWI Survey on High-Performance Data Warehousing of 2012, a little over half of user organizations surveyed (55%) support only the most common workloads, namely those for standard reports, performance management, and online analytic processing (OLAP). The other half (45%) also supports workloads for advanced analytics, detailed source data, various forms of big data, and real-time data feeds.
The trend is toward the latter. In other words, the number and diversity of DW workloads are increasing, as organizations embrace big data, multi-structured data, real-time or streaming data, and data processing for advanced analytics. The catch is that some data warehouses (whether defined as a vendor product or a user’s design) can handle multiple, concurrent workloads of various types, whereas others cannot.
The diversification of DW workloads leads to distributed architectures for DWs.
Distributed DW architecture. The issue in a multi-workload environment is whether a single-platform data warehouse can be designed and optimized such that all workloads run optimally, even when concurrent. More and more DW teams are concluding that a single-platform DW is no longer desirable. Instead, they maintain a core DW platform for traditional workloads (reports, performance management, and OLAP), but offload other workloads to other platforms. For these organizations, the DW is not going away; it’s just being complemented by additional data platforms tuned to workloads that can and should be offloaded from the core warehouse.
For example, data and processing for SQL-based analytics are regularly offloaded to DW appliances and columnar DBMSs. And a few teams offload workloads for big data and advanced analytics to HDFS, MapReduce, and various NoSQL platforms. The result is a strong trend toward distributed DW architectures, where many areas of the logical DW architecture are physically deployed on standalone platforms instead of the core DW platform.
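To make the offloading pattern concrete, here is a minimal Python sketch of a workload-to-platform routing table. It is only an illustration: the workload names, the platform labels, and the route_workload function are hypothetical, and the rule that unknown workloads fall back to the core warehouse is an assumption, not something the survey prescribes.

```python
# Illustrative only: workload and platform names are hypothetical labels.
CORE_DW = "core_dw"

# Traditional workloads that stay on the core warehouse platform.
CORE_WORKLOADS = {"standard_reports", "performance_management", "olap"}

# Workloads offloaded to other platforms in the environment.
OFFLOAD_TARGETS = {
    "sql_analytics": "dw_appliance_or_columnar_dbms",
    "big_data": "hadoop",
    "advanced_analytics": "hadoop",
}


def route_workload(workload: str) -> str:
    """Return the platform label that should run the given workload type."""
    if workload in CORE_WORKLOADS:
        return CORE_DW
    # Assumption: workload types not listed above default to the core DW.
    return OFFLOAD_TARGETS.get(workload, CORE_DW)


if __name__ == "__main__":
    for w in ("olap", "sql_analytics", "big_data"):
        print(w, "->", route_workload(w))
```

In practice the mapping would come from your own workload inventory; the point is simply that the core DW keeps the traditional workloads while the others each get a tuned platform.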
A distributed DW architecture is both good and bad. It’s good if your fidelity to business requirements and DW performance leads you to deploy another data platform in your DW environment, and the new platform integrates well with others in the distributed architecture. But it’s bad when disconnected systems proliferate uncontrolled, like the errant data marts we all fear. So far, the newest generation of analytic databases and data management platforms is controlled by users far better than the marts of yore. But you still have to be diligent to avoid abuses.
Also, note that the architectural distinctions made here have always been a matter of degree, and will continue to be so. In other words, no architecture is 100% monolithic or 100% distributed. Many are hybrids, and the mix that is right for you depends on many matters of business and technology. Many DW architectures have always been distributed, to some degree. It’s just that the degree is more pronounced today.
The trend toward distributed DW architectures isn’t new. Not by a long shot. For decades, warehouses have wended their way through a variety of “edge systems” that are deployed on standalone servers off to the side of the warehouse, but integrated with it. This has been true from the dawn of warehousing (as with data marts and operational data stores (ODSs)). The practice recently expanded (with DW appliances and columnar DBMSs) and now continues with new types of data platforms (namely NoSQL and Hadoop). Hence, even the new platforms fit comfortably into the well-established tradition of DW edge systems.
Rearrange the acronym from EDW to DWE, standing for “data warehouse environment,” meaning multi-platform DW.
From the single-platform EDW to the multi-platform DWE. A consequence of the workload-centric approach is a trend away from the single-platform monolith of the enterprise data warehouse (EDW) toward a physically distributed data warehouse environment (DWE). A modern DWE consists of multiple platform types, ranging from the traditional warehouse (and its satellite systems for marts and ODSs) to new platforms like DW appliances, columnar DBMSs, NoSQL databases, MapReduce tools, and HDFS. In other words, users’ portfolios of tools for BI/DW and related disciplines are diversifying aggressively.
The multi-platform approach adds more complexity to the DW environment, but BI/DW professionals have always managed complex technology stacks successfully. The upside is that users love the high performance and solid information outcomes that they get from workload-tuned platforms.
Note that a DWE can be a simple bucket of standalone silos, and that’s where many organizations are today. Ideally, the physically distinct systems of the DWE should be integrated with one another, so they connect via an overall logical design. Integration within the DWE can take many forms, including shared dimensions, data sync, federation, data flows across DWE platforms, and so on. Unless its platforms are integrated at appropriate levels, the DWE remains just a bucket of silos; with a unifying architectural design, it will be more efficient technically and more effective for business users.
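One rough way to check for the "bucket of silos" problem is to list the platforms in your DWE alongside the integration links between them, then flag any platform that has no link at all. The Python sketch below assumes a hypothetical inventory format (platform names plus tuples of two platforms and an integration kind); the find_silos function and the sample data are illustrative, not a TDWI method.

```python
from typing import Iterable, Set, Tuple

# Each link is (platform_a, platform_b, kind), where kind is a free-text
# label such as "shared dimensions", "data sync", "federation", "data flow".
Link = Tuple[str, str, str]


def find_silos(platforms: Iterable[str], links: Iterable[Link]) -> Set[str]:
    """Return the platforms that take part in no integration link."""
    connected: Set[str] = set()
    for platform_a, platform_b, _kind in links:
        connected.add(platform_a)
        connected.add(platform_b)
    return set(platforms) - connected


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    platforms = ["core_dw", "columnar_dbms", "hadoop", "ods"]
    links = [
        ("core_dw", "columnar_dbms", "shared dimensions"),
        ("ods", "core_dw", "data flow"),
    ]
    print(sorted(find_silos(platforms, links)))
```

A platform that appears in this list is not necessarily a problem, but it is a candidate for the kind of integration described above.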
Stay tuned, because I’ll soon post more blog entries about evolving data warehouse architectures. In the meantime, please attend an upcoming TDWI Webinar, in which I’ll address many of the issues mentioned here. Register online for the Webinar Big Data and Your Data Warehouse, to be broadcast September 5, 2013 at 9:00am ET.
Posted by Philip Russom, Ph.D. on July 26, 2013