BI Experts: Big Data and Your Data Warehouse's Data Staging Area
No part of your DW architecture is immune to big data.
- By Philip Russom, Ph.D.
- July 10, 2012
My head is spinning from thinking about all the users I've talked to in recent years who've had to adjust -- then adjust again -- their data warehouse (DW) architecture. If you're a DW professional, you've probably adjusted your DW architecture to accommodate new business requirements for real-time operation, just in time to readjust it for one advanced analytic workload after the next. Before that, you made adjustments for performance management, operational BI, text analytics, high performance, data mart consolidation, and so on. Get ready: there's a new adjustment coming to your data warehouse's architecture.
Many of you are diving deeper into big data as you explore the new business information and analytic applications it can enable. Well, guess what? No part of your data warehouse architecture is immune to big data and the advanced analytics usually associated with it. Even arcane corners of architecture are being affected by big data.
Data staging areas are a case in point. As you know, data staging areas have evolved to do more than stage data. In the average DW today, a tremendous amount of hefty data processing is done in and around one or more data staging areas. The bad news is that data staging areas must evolve again to accommodate big data. The good news is that the many new technologies and techniques designed for big data can help data staging, just as they help other components of your DW architecture.
Here are a few ideas about future directions for data staging areas:
As data staging areas evolved from temporary storage platforms to pre- and post-processing platforms, they typically moved out of the data warehouse proper (where they were simply a few tables where data landed) and onto standalone database instances. This gives the staging area more processing power, and it offloads workloads that might degrade DW performance. If you have not made this journey yet, you should consider it. Big data and analytics push DWs into more distributed architectures; data staging probably will, too.
Certain data operations are common in a staging area. For example, you might sort data prior to a DW load, perhaps to get it in ZIP code order before loading a geography dimension. Also, SQL temp tables are commonly held in staging for later merging. Because column-dependent operations like these are common in staging, you might consider a columnar database, especially because these excel with big data volumes.
Originally, data staging areas were temporary holding bins. After all, an ETL or data quality job needs a place to land data before another one picks it up for additional processing. Jobs and users are always fastidious about cleaning out the bins, so some users have daemons that delete data after a certain age. In that spirit, the data staging platform might be appropriate for "analytic sandboxes," which are temporary bins with an expiration date.
Most data staging areas are optimized for detailed source data, because that's what most of the incoming data is like. The advanced forms of analytics that are growing aggressively in user adoption typically require detailed source data -- and large volumes of it -- because of the discovery nature of data mining, statistical analysis, extreme SQL, and natural language processing. Much of the content of big data by most definitions involves giant volumes of detailed source data. Hence, think about evolving your data staging area into a multi-purpose platform for managing and processing big data.
Landed and staged data is transient by nature. Much of it arrives and is deleted as part of overnight batch processing. Big data can be equally transient; exploratory analytic practices need a platform where a business analyst can land multiple terabytes of data for iterative study; most of that data is deleted when it's found to be not useful for the analytic project at hand. Imagine a staging area that elastically scales up to transient volumes of big data, then reallocates server and storage resources later. Hence, imagine your data staging area on some form of cloud.
The detailed source data handled by data staging platforms is mostly structured, but evolving to include more unstructured and semi-structured data. Imagine a staging platform that can handle the full range of multi-structured data. Hence, imagine your staging area on the Hadoop Distributed File System (HDFS). Many staging areas are under pressure to store data for longer periods; HDFS is making a reputation as a cost-effective, live archive for many data types. Imagine much of your staging area's processing made fully parallel and, therefore, high performance by layering MapReduce atop HDFS.
That's a lot to think about, and I admit that much of it falls into the "crystal ball" category. You've probably seen what I have -- namely, that data staging areas are driven to evolve as new big data and analytic requirements emerge, so evolution seems inevitable. At least, there are several positive -- and potentially exciting -- directions you can go, as you adopt your data staging area and other components of DW architecture to the new requirements.
For more information, replay the TDWI Webinar of April 5, 2012, "Big Data and Your Data Warehouse," available online from TDWI's Webinar Archive.