Staging Area: Critical Component or Overkill?
By Jim Benjamin, Director, Product Management, D2K, Incorporated
Designing the physical path of data movement from transaction databases into a data mart or warehouse can range from very simple to very complex. In an environment where there are one or two homogenous sources and a single data mart, introducing an intermediate “staging area” (also called an “operational data store”) may add little value. However, as the complexity increases—by involving multiple database types or different platforms for different sources (e.g., ERP data on Unix, historical data on MVS, a data mart on Windows NT)—the need for a staging area to integrate all data types into one format becomes apparent, even critical.
For organizations with heterogeneous sources, or where multiple data marts are supported from a common set of data sources, the staging area becomes the first level data warehouse. This “atomic” data warehouse is a place to clean and synchronize all sources and organize the elements to support the downstream data warehouse or data mart environment. Here, data can automatically be loaded and undergo data type adjustment, freeing the database administrators from writing and maintaining a number of complicated programs.

Tapestry by D2K, Inc.™ explicitly supports and leverages the staging area concept. Following are some key reasons why an organization might incorporate a relational staging area into its data warehouse architecture.
A staging area allows the process of converting data to be separate from transforming data.
For example, if sales data must be joined with human resources data, each will be extracted and moved independently to the intermediate relational database; the administrator performs the join there, before populating the data mart tables. In this way, differences between the data sources can be resolved in the staging area before applying the transformations in the data mart. The staging area facilitates building data marts with the least amount of complexity and the greatest amount of transparency for the user.
A staging area facilitates moving data from different sources on different schedules.
Different sources may need to be extracted at different times. For example, the sales data may be available (i.e., sales processing is finished and the data is in a stable and synchronized state) for extract only between 1:00 a.m. and 2:00 a.m., while the financial data is available only from 4:00 a.m. until 5:00 a.m. A staging area would allow the administrator to extract the data during its available times and hold it for integration after the fact. It facilitates scheduling extractions from their operational systems during off-peak periods.
A relational staging area provides a place to check data cleanliness and correctness.
The staging area is the only place to determine which values actually came out of the legacy system. The data warehouse administrator has the opportunity to investigate the staging area to determine where the error occurred (e.g., there was an error in the business rules, or the data was incorrect when it came from the legacy system). The staging area can be used for cleaning the dirty data before it goes into the transformation process—the sooner data is cleansed, the less chance there is for error.
Staging areas, as the first data warehouse tier, are one of the components that differentiate D2K’s solution from others in the marketplace. The Tapestry staging area is the hub in the “hub-and-spoke” model. As the hub, the spokes are able to have data integrated and transformed in various ways, all from one atomic level. The hub-and-spoke model delivers flexibility, scalability, integrity, and performance by extracting data once for customization at the user level.
This article originally appeared in the 5/1/1999 issue of TDWI.