Modernizing the Data Warehouse
A new report from TDWI Research aims to provide a primer for modernizing the data warehouse.
- By Stephen Swoyer
- June 30, 2015
If you have a data warehouse, chances are it needs to be modernized. Not phased out, not lock-stock upgraded, but modernized. A tweak here, a minor change there, a supplement (or platform upgrade) elsewhere. A new report from TDWI Research aims to provide a primer for doing just this.
TDWI's Philip Russom, research director for data management, outlines eight common-sense steps for modernizing your data warehouse, starting with a rationale for doing so: modernization lets you take advantage of new data types, permits new kinds of analysis -- including extreme SQL, algorithmic, and hierarchical analytics -- and helps you rationalize your platform priorities.
"Organizations are grappling with new data types and sources and how to capture and manage these information assets, plus how to leverage them for business advantage," writes Russom in the aptly titled "Eight Tips for Modernizing a Data Warehouse," the latest entry in TDWI's Checklist Reports series. He cites Web logs; the emergence of industry-specific types of big data (such as call detail records, RFID tags, and sensor data); human-readable data sources, particularly text; and multi-structured data (Russom cites JSON and XML documents) as examples in kind.
"[U]sers are challenged by the newness of the[se] data [types], the massive volume of many new data sets, the wide range of data structures, and the streaming nature of some sources. The problem is further compounded because most vendor platforms and user designs for traditional data warehouses were originally designed for structured data alone or just for relational data," Russom writes. On the other hand, he notes, one kind of "modernization" could involve opening up the data warehouse to new and/or big data types. "Many users choose to reserve their core DW for the relational data that goes into standard reports, dashboards, performance management, and OLAP. For new big data, users are deploying specialized platforms built for new data types, and they are integrating the new platforms with the core DW and related systems. Specialized platforms include those based on column stores and appliances, plus open source Hadoop and NoSQL databases."
It must be noted that Russom's TDWI report is sponsored by several vendors -- viz., Cloudera Inc. and MapR Technologies Inc. -- that market "specialized platforms built for new data types," but Russom is also right. The data warehouse is a platform par excellence for managing strictly structured data types.
Even though most prominent database vendors offer some support for semi-structured data types -- including JSON and XML -- the RDBMS itself isn't an ideal repository for semi-structured or poly-structured data. (One such vendor is Teradata Corp., which also co-sponsored Russom's TDWI report. Teradata markets a dedicated appliance for Hadoop, as well as its Aster Discovery Platform, which supports advanced analytic functions.)
"Given the real-world limitations of modernizing a DW that's tightly wedded to the relational paradigm, complementing the relational DW with other data platforms is a viable strategy for DW modernization. Even so, some organizations prefer to replace the old DW platform with a different platform that's more broadly suited to the extreme diversity of data we're witnessing today, even though rip-and-replace is time-consuming and disruptive for the business."
The second checklist item on Russom and TDWI's data warehouse modernization agenda is pick the right tool for the right job. If you're going to undertake new types of analytics, you need to support them with a "modern" data warehouse and other fit-for-purpose analytic platforms. To this end, Russom writes, there are many different types of analytics -- and one size doesn't fit all.
"A ramification of the diversity of analytics is that the requirements for data to be analyzed vary tremendously. Some analytic methods demand relational data; others need some other structure. This, in turn, complicates the modernization of a data warehouse that must supply data for multiple analytic approaches. ... [G]iven the diversity of analytic data, many users choose to deploy multiple purpose-built platforms, instead of expecting a relational warehouse to supply all data types."
Examples of different analytic practices include data exploration and discovery, large data samples, extreme SQL, dimensional models, hierarchies, file-based data, multimedia data, textual documents, and set-based or algorithmic analysis, according to Russom. No single platform is ideal for all types of analyses: some data warehouse systems, for example -- such as those marketed by Actian (Matrix and Vector) Hewlett-Packard Co. (Vertica), IBM Corp. (Netezza), Kognitio, and Teradata (Teradata and Aster) -- specialize in relational, Extreme SQL, and dimensional analytics. For other types of analysis, non-data warehouse platforms are ideal.
The next step involves a conceptual recalibration of sorts. We're used to thinking of the data warehouse in terms of the enterprise data warehouse, or EDW: fabled/chimaerical repository for all critical business data, for single version of the truth, and so on. For a while now, Russom and TDWI have been trying to recast the EDW as a DWE -- or data warehouse environment. This isn't just a semantic or terminological switcheroo, Russom urges. "Many user organizations are evolving their mature enterprise data warehouses (EDWs) into multi-platform data warehouse environments," he writes.
"We say 'the warehouse' or the EDW as if it's one monolithic entity, although for many organizations it's long been a collection of more-or-less integrated tools, data platforms, and data sets. Rearranging the EDW acronym to DWE acknowledges the extreme degree the multi-platform DW and BI technology stack has achieved in recent years, and it's not just the DWE. Data management in other areas of the enterprise has attained a similar extreme of platform diversity."
The ramifications of this conceptual shift are far-reaching. For one thing, new platforms (such as Hadoop, MongoDB, and other NoSQL platforms) tend to complement the data warehouse without replacing it, Russom argues. "[T]he DW is still the best platform for the aggregated, standardized, and documented data that goes into standard reports, dashboards, performance management, operational analytics, and OLAP. Instead of replacing it, the new platforms complement the warehouse because they are optimized for workloads that manage, process, and analyze data that's new, big, unstructured, exotic, or real time. Also, new data platforms are better suited to the 'early ingestion, later processing' many users need to apply during data exploration and analytics."
Russom's report includes several other "Checklist" action items. It's available -- for free -- from TDWI Research.