Evolving Data Warehouse Architectures: The Roles of Hadoop
HDFS and other Hadoop tools promise to extend and improve some areas within data warehouse architectures
By Philip Russom, TDWI Research Director for Data Management
In a TDWI survey I designed and ran in 2012, 88% of the users surveyed reported that the Hadoop ecosystem of products is a business opportunity (not a technology problem) because it enables new types of applications. When asked which types of applications benefit most from Hadoop, survey respondents chose (in priority order) big data analytics, advanced analytics (i.e., data mining, statistical analysis, and complex SQL), and discovery analytics. After these three analytic application types, respondents then chose two data management use cases for Hadoop, namely information exploration and complementing a data warehouse. Other data management uses seen in the survey include data archiving, transforming big data for analytics, and data staging.
If you pull together all the things I just mentioned, it’s quite a list of use cases in data warehousing for the Hadoop Distributed File System (HDFS) and other Hadoop tools (MapReduce, Hive, HBase, HCatalog, Impala, etc.). And many of these—if implemented in a multi-platform data warehouse environment (DWE)—would have a strong influence on the architecture of that data environment.
Promising Uses of Hadoop that Impact DW Architectures
I see a handful of areas in data warehouse architectures where HDFS and other Hadoop products have the potential to play positive roles:
A lot of data processing occurs in a DW’s staging area, to prepare source data for specific uses (reporting, analytics, OLAP) and for loading into specific databases (DWs, marts, appliances). Much of this processing is done by homegrown or tool-based solutions for extract, transform, and load (ETL). Imagine staging and processing a wide variety of data on HDFS.
For users who prefer to hand-code most of their solutions for extract, transform, and load (ETL), they will most likely feel at home in code-intense environments like Apache MapReduce. And they may be able to refactor existing code to run there. For users who prefer to build their ETL solutions atop a vendor tool, the community of vendors for ETL and other data management tools is rolling out new interfaces and functions for the entire Hadoop product family.
Note that I’m assuming that (whether you use Hadoop or not), you should physically locate your data staging area(s) on standalone systems outside the core data warehouse, if you haven’t already. That way, you preserve the core DW’s capacity for what it does best: squeaky clean, well-modeled data (with an audit trail via metadata and master data) for standard reports, dashboards, performance management, and OLAP. In this scenario, the standalone data staging area(s) offload most of the management of big data, archiving source data, and much of the data processing for ETL, data quality, and so on.
When organizations embrace forms of advanced analytics that require detail source data, they amass large volumes of source data, which taxes areas of the DW architecture where source data is stored. Imagine managing detailed source data as an archive on HDFS.
You probably already do archiving with your data staging area, though you probably don’t call it archiving. If you think of it as an archive, maybe you’ll adopt the best practices of archiving, especially information life cycle management (ILM), which I feel is valuable but woefully vacant from most DWs today. Archiving is yet another thing the staging area in a modern DW architecture must do, thus another reason to offload the staging area from the core DW platform.
Traditionally, enterprises had three options when it came to archiving data: leave it within a relational database, move it to tape or optical disk, or delete it. Hadoop’s scalability and low cost enable organizations to keep far more data in a readily accessible online environment. An online archive can greatly expand applications in business intelligence, advanced analytics, data exploration, auditing, security, and risk management.
Relatively few organizations are getting BI value from semi-structured and unstructured data, despite years of wishing for it. Imagine HDFS as a special place within your DW environment for managing and processing semi-structured and unstructured data. Another way to put it is: imagine not stretching your RDBMS-based DW platform to handle data types that it’s not all that good with.
One of Hadoop’s strongest complements to a DW is its handling of semi-structured and unstructured data. But don’t go thinking that Hadoop is only for unstructured data; HDFS handles the full range of data, including structured forms, too. In fact, Hadoop can manage just about any data you can store in a file and copy into HDFS.
Given its ability to manage diverse multi-structured data, as I just described, Hadoop’s NoSQL approach is a natural framework for manipulating non-traditional data types. Note that these data types are often free of schema or metadata, which makes them challenging for SQL-based relational DBMSs. Hadoop supports a variety of programming languages (Java, R, C), thus providing more capabilities than SQL alone can offer.
In addition, Hadoop enables the growing practice of “late binding.” Instead of transforming data as it’s ingested by Hadoop (the way you often do with ETL for data warehousing), which imposes an a priori model on data, structure is applied at runtime. This, in turn, enables the open-ended data exploration and discovery analytics that many users are looking for today.
Imagine HDFS as a data stage, archive, or twenty-first-century operational data store that manages and processes big data for advanced forms of analytics, especially those based on MapReduce, data mining, statistical analysis, and natural language processing (NLP). There’s much to say about this; in a future blog I’ll drill into how advanced analytics is one of the strongest influences on data warehouse architectures today, whether Hadoop is in use or not.
Stay tuned, because I’ll soon post more blogs about evolving data warehouse architectures. In the meantime, please read the new TDWI Checklist “Where Hadoop Fits in Your Data Warehouse Architecture
Other blogs in the Evolving Data Warehouse Architectures series:
• From EDW to DWE
Posted by Philip Russom, Ph.D. on August 4, 2013