It's the End of the Data Warehouse as We Know It
According to TDWI survey data, about half of all enterprises expect to replace their data warehouse systems -- in some cases, their analytics tools, too -- over the next three years. What should they replace them with?
By Steve Swoyer
January 11, 2017
According to TDWI survey data, about half of all enterprises expect to replace their data warehouse systems -- and in some cases, their analytics tools, too -- over the next three years.
These systems could (and probably should) be replaced by modern data warehouse systems that -- like the database equivalent of a Swiss Army knife -- integrate multiple fit-for-purpose analytics engines. These systems could (but probably should not) be replaced by Hadoop and other NoSQL platforms, which are no less Swiss Army-like.
It is misleading to frame this as a question of rip-and-replace, however; the choice isn't zero-sum. It's more complicated than that.
Modern Analytics Demand Modern Data Architecture
It's complicated because the core premises of data warehouse architecture -- viz., that all critical business information can be integrated and consolidated into a single, centralized system of record and that data access itself should be centralized -- are no longer viable. Conventional data warehouse systems are ripe for rip-and-replace because their role has been outstripped -- in some cases marginalized -- by the emergence of new, nontraditional types of analytics.
"Many organizations need a more modern DW platform to address a number of new and future business and technology requirements. Most of the new requirements relate to big data and advanced analytics, so the data warehouse of the future must support these in multiple ways," writes Philip Russom, senior research director for data management with TDWI Research, in TDWI Checklist Report: Evolving Toward the Modern Data Warehouse.
As Russom sees it, the modern data warehouse must be able to manage and integrate both strictly structured and multistructured data types. It must integrate support for advanced analytics processing -- via in-database functions and algorithms and/or fit-for-purpose data processing engines -- to support new, advanced analytics use cases.
It must support near-real-time or real-time access and analysis at a scale (and cost) not previously practical. It must be as adept at moving data to and from cloud services as it is with on-premises data sources and services.
Finally, it must transparently integrate multiple platforms in a unified data warehouse architecture.
"Users should manage big data on as few data platform types as possible to minimize data movement as well as to avoid data sync and silo problems," Russom writes.
"As you expand into multiple types of analytics with multiple big data structures, you will inevitably spawn many types of data workloads. Because no single platform runs all workloads equally well, most DW and analytics systems are trending toward a multiplatform environment."
Something Like a Data Warehouse Is Still Needed
The day-to-day decision support activities for which the data warehouse was designed aren't going anywhere. Nor is traditional data warehouse architecture likely to be supplanted by Hadoop and other NoSQL platforms. The alternatives -- even specialized compute engines such as Spark, which can run on a standalone basis or in the context of Hadoop or Cassandra -- offer poor or incomplete SQL support. They likewise fail to scale to match the high concurrency -- with hundreds or even thousands of simultaneous users -- of massively parallel processing (MPP) data warehouse systems.
This suggests that something like a data warehouse will be required to support traditional decision-support workloads -- such as business intelligence (BI) reporting or ad hoc (i.e., OLAP-driven) analysis -- at enterprise scale. There's a caveat, however: the concept of the data warehouse environment as the central destination for all data in the enterprise has basically been exploded.
"A consequence of the workload-centric approach is a trend away from the single-platform monolith of the enterprise data warehouse ... toward a physically distributed data warehouse environment [DWE]," writes Russom. One consequence of this, he argues, is a DWE with an awfully big tent.
"A modern DWE consists of multiple data platform types, ranging from the traditional relational and multidimensional warehouse ... to new platforms such as DW appliances, columnar RDBMSs, NoSQL databases, big data processing engines, and data lake platforms. In other words, users' portfolios of tools for BI/DW and related disciplines are diversifying aggressively."
Mark Madsen, a research analyst with information management consultancy Third Nature, agrees. Now more than ever, he says, platform heterogeneity is the rule, not the exception.
"Data warehouse architecture was predicated on the assumption that people would be passively consuming information. It's in the standard definition of the data warehouse as 'a read-only repository,'" Madsen notes.
"That is no longer the case, if it ever was. It's difficult to anticipate the needs -- the workflows and data flows -- of new analytics use cases. They're predictable in a general sense. You need a repository in which to persist data, so we have concepts such as the data lake, which is less a source for than a complement to the data warehouse. The data lake is used for large-scale data collection and exploratory use cases. It's not used for common, core data."
Many Ways to Extend the DWE
The data lake is just one example of platform heterogeneity, Madsen says. There are oodles of others, from streaming ingest engines (and repositories) to text analytics and graph database engines to scalable, parallel processing compute environments such as Spark (or Hadoop with YARN or Mesos). "If our design focus is scenarios that involve exploring data, analyzing data to generate new insights, and developing new analytics applications, then the new need is to get data from where it is to where it is needed, when it is needed," he comments.
The extended data warehouse environment is an attempt to address the new topology of the analytics landscape. It describes a flexible, pragmatic, complementary model that cedes important functions to fit-for-purpose platforms even as it recasts the data warehouse as a platform for both traditional BI use cases (reports, dashboards) and new use cases. These include embedded or operational apps that consume BI information and analytics along with insights from advanced analytics research and development.
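To make the idea of ceding functions to fit-for-purpose platforms concrete, here is a minimal Python sketch of workload routing in an extended DWE. It is an illustration only: the workload categories, the platform names, and the choice to send unrecognized workloads to the data lake are assumptions made for this example, not something Russom or Madsen prescribes.

```python
from dataclasses import dataclass

# Hypothetical mapping of workload kinds to platforms (illustrative names only).
ROUTES = {
    "bi_reporting": "data_warehouse",      # reports and dashboards
    "embedded_app": "data_warehouse",      # operational apps consuming BI output
    "exploration": "data_lake",            # large-scale collection, exploratory work
    "stream_ingest": "streaming_engine",
    "text_analytics": "text_engine",
    "graph_analysis": "graph_database",
}

# Assumption: anything unclassified lands in the data lake for later exploration.
DEFAULT_PLATFORM = "data_lake"


@dataclass(frozen=True)
class Workload:
    name: str
    kind: str


def route(workload: Workload) -> str:
    """Return the platform assumed to fit this kind of workload."""
    return ROUTES.get(workload.kind, DEFAULT_PLATFORM)


def plan(workloads):
    """Group workload names by the platform each one is routed to."""
    placement = {}
    for w in workloads:
        placement.setdefault(route(w), []).append(w.name)
    return placement


if __name__ == "__main__":
    jobs = [
        Workload("monthly sales report", "bi_reporting"),
        Workload("churn model prototyping", "exploration"),
        Workload("clickstream capture", "stream_ingest"),
    ]
    for platform, names in plan(jobs).items():
        print(platform, names)
```

In this sketch the warehouse keeps the production-facing workloads while the other platforms take the work they are better suited for, which is the division of labor the extended model describes.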
As Russom and other experts note, data warehouse architecture is still a best-in-class platform for putting analytics insights into production.
That hasn't changed, and it probably won't.