Key Factors in Selecting a Data Warehouse Architecture
- By Thilini Ariyachandra, Hugh J. Watson
- May 23, 2005
Our two-phase study investigated which factors influence the selection of data warehouse architectures. In the first phase, rational and social/political factors were identified based on the academic and data warehousing literature and interviews with 10 leading authorities in the field. In the second phase, a Web-based survey collected data from 369 companies about the importance of each factor.
The survey data shows that information interdependence between organizational units, upper management’s information needs, and the strategic view of the warehouse prior to implementation are the most important factors. The data also shows that the importance of the factors varies with the architecture implemented.
Over the past 15 years, companies have spent billions of dollars on data marts and warehouses. We’ve learned the importance of thoroughly understanding source systems before building, starting with only a few subject areas or business processes but having an enterprisewide goal in mind, and giving end users data-access tools and applications that are appropriate for their needs.
One question, however, still causes considerable controversy: Which architecture should an enterprise use?
There are multiple options. The most popular is the hub-and-spoke architecture (i.e., centralized data warehouse with dependent data marts) advocated by Bill Inmon. He refers to this architecture as the corporate information factory (CIF). Another popular choice is the data-mart bus architecture with linked dimensional data marts, advocated by Ralph Kimball, the other preeminent figure in data warehousing. These and other architectures have strong proponents (Breslin, 2004; Wells, 2003). The selection of an appropriate architecture is one of the keys to data warehousing success (Laney, 2000).
Considering the importance of architecture selection, there is surprisingly little research on the topic. The literature tends to discuss the architectures, provide case study examples, or present survey data about the popularity of the various options (Joshi and Curtis, 1999; Hackney, 2000).
Here we discuss findings from a large study we recently conducted on data warehouse architectures. In particular, we explore the factors that influence the selection of an architecture, whether it is independent data marts, a data-mart bus architecture with linked dimensional data marts, a hub-and-spoke architecture, a centralized warehouse (i.e., no dependent marts), or a federated architecture.
In the study’s first phase, the factors that may affect the architecture selection were identified based on a review of the academic and data warehousing literature and interviews with 10 leading authorities in the field. These sources, along with additional experts (20 experts in all), were used to develop the Web-based survey we employed to collect data in the study’s second phase. We asked about the importance of each factor on a company’s architecture decision. Three hundred sixty-nine respondents provided information for this question.
Many individuals and organizations helped promote the study in e-mail and newsletters, at conferences, and on Web sites.
The data warehousing literature provides discussions and examples of a variety of architectures. For our study, we included five: (1) independent data marts, (2) data-mart bus architecture with linked dimensional data marts, (3) hub-and-spoke architecture, (4) centralized data warehouse (no dependent data marts), and (5) federated.
Independent Data Marts
It is common for organizational units to develop their own data marts (Winsberg, 1996; Hoss, 2002). These marts are independent of one another, and while they may meet the needs for which they were created, they do not provide “a single version of the truth.” They have inconsistent data definitions and different dimensions and measures (i.e., non-conformed dimensions) that make it difficult to analyze data across the marts. Figure 1 shows the architecture for independent data marts.
Data-Mart Bus Architecture with Linked Dimensional Data Marts
Creation of this architecture starts with a business requirements analysis for a specific business process, such as orders, deliveries, customer calls, or billing (Kimball et al., 2003). The first mart is built for a single business process using dimensions and measures that will be used with other marts (i.e., conformed dimensions). Additional marts are developed using these conformed dimensions, which results in logically integrated marts and an enterprise view of the data. Atomic-level data is maintained in the marts and is organized in a star schema to provide a dimensional view of the data. This architecture is illustrated in Figure 2.
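As a rough illustration of what conformed dimensions make possible, the sketch below builds two process-level marts that share a single product dimension and then queries across them. The table names, column names, and rows are all invented for the example and do not come from the study or from any particular implementation.

```python
import sqlite3

# Hypothetical schema: two process-level marts (orders, deliveries) that
# share one conformed product dimension. All names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_orders (product_key INTEGER, qty_ordered INTEGER);
    CREATE TABLE fact_deliveries (product_key INTEGER, qty_delivered INTEGER);
    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO fact_orders VALUES (1, 10), (1, 5), (2, 7);
    INSERT INTO fact_deliveries VALUES (1, 12), (2, 7);
""")


def drill_across(conn):
    """Aggregate each mart separately, then join on the conformed dimension."""
    sql = """
        SELECT p.name, o.ordered, d.delivered
        FROM dim_product AS p
        JOIN (SELECT product_key, SUM(qty_ordered) AS ordered
              FROM fact_orders GROUP BY product_key) AS o
          ON o.product_key = p.product_key
        JOIN (SELECT product_key, SUM(qty_delivered) AS delivered
              FROM fact_deliveries GROUP BY product_key) AS d
          ON d.product_key = p.product_key
        ORDER BY p.name
    """
    return conn.execute(sql).fetchall()


rows = drill_across(conn)
```

Because both fact tables use the same `product_key`, the two marts can be compared row by row. With independent marts that define products differently, no such join key would exist.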
Hub-and-Spoke Architecture (The Corporate Information Factory)
This architecture is developed after an enterprise-level analysis of data requirements (Inmon, et al., 2001). Attention is also focused on building a scalable and maintainable infrastructure. Using the enterprise view of the data, the architecture is developed in an iterative manner, subject area by subject area. Atomic-level data is maintained in third-normal form in the warehouse. Dependent data marts obtain data from the warehouse.
The dependent data marts may be developed for departmental, functional area, or special purposes (e.g., data mining) and may have normalized, denormalized, or summarized dimensional data structures based on user needs. Most users query the dependent data marts. Figure 3 shows this architecture.
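The relationship between the hub and a dependent mart can be sketched as follows, assuming a tiny normalized warehouse and one summarized mart derived from it. The schema, the `build_dependent_mart` helper, and the data are hypothetical and serve only to show the direction of data flow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hub: atomic-level data held in normalized tables (invented schema).
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sale (
        sale_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount REAL
    );
    INSERT INTO customer VALUES (1, 'East'), (2, 'West');
    INSERT INTO sale VALUES (100, 1, 20.0), (101, 1, 30.0), (102, 2, 5.0);
""")


def build_dependent_mart(conn):
    """Spoke: rebuild a summarized mart whose only source is the warehouse."""
    conn.executescript("""
        DROP TABLE IF EXISTS mart_sales_by_region;
        CREATE TABLE mart_sales_by_region AS
        SELECT c.region AS region,
               SUM(s.amount) AS total_amount,
               COUNT(*) AS sale_count
        FROM sale AS s
        JOIN customer AS c ON c.customer_id = s.customer_id
        GROUP BY c.region;
    """)


build_dependent_mart(conn)
mart_rows = conn.execute(
    "SELECT region, total_amount, sale_count "
    "FROM mart_sales_by_region ORDER BY region"
).fetchall()
```

Users would query `mart_sales_by_region` rather than the normalized tables. The mart can be dropped and rebuilt at any time because the warehouse remains the single source of the atomic data.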
Centralized Data Warehouse (No Dependent Data Marts)
This architecture is similar to the hub-and-spoke architecture except there are no dependent data marts. The warehouse contains atomic-level data, some summarized data, and logical dimensional views of the data. Queries and applications access data from both the relational data and the dimensional views. This architecture is often a logical rather than a physical implementation of the corporate information factory. Figure 4 illustrates this architecture.
Federated Architecture
This architecture leaves existing decision-support structures (e.g., operational systems, data marts, and data warehouses) in place (Hackney, 2002). Based on business requirements, data is accessed from these sources. The data is either logically or physically integrated using shared keys, global metadata, distributed queries, and other methods. This architecture is advocated as a practical solution for firms that already have a complex existing decision support environment and do not want to rebuild. The federated architecture is shown in Figure 5.
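One way to picture logical integration through shared keys and global metadata is the minimal sketch below. It assumes two existing sources that name the customer key differently. The source names, fields, and the `federated_customer_view` function are invented for illustration; a real federation layer would issue distributed queries rather than read in-memory lists.

```python
# Hypothetical existing sources left in place; names and rows are invented.
SALES_MART = [
    {"cust_id": 1, "revenue": 120.0},
    {"cust_id": 2, "revenue": 80.0},
]
SUPPORT_WAREHOUSE = [
    {"customer_no": 1, "open_tickets": 3},
    {"customer_no": 3, "open_tickets": 1},
]

# Global metadata: records which local column holds the shared customer key.
GLOBAL_METADATA = {
    "sales": {"rows": SALES_MART, "key": "cust_id"},
    "support": {"rows": SUPPORT_WAREHOUSE, "key": "customer_no"},
}


def federated_customer_view(metadata):
    """Fan out to every source and merge rows logically on the shared key."""
    merged = {}
    for source in metadata.values():
        local_key = source["key"]
        for row in source["rows"]:
            customer = merged.setdefault(row[local_key], {})
            # Copy every attribute except the source's local key column.
            customer.update({k: v for k, v in row.items() if k != local_key})
    return merged


view = federated_customer_view(GLOBAL_METADATA)
```

Customers known to only one source still appear in the result with partial attributes, which reflects the trade-off of leaving the underlying systems untouched.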
Factors Affecting Architecture Selection
From the academic and data-warehousing literature and the experts’ input, 10 factors were identified that potentially affect the architecture selection decision. Some of the factors are related to rational theory, such as the information processing theory of the firm, while others are related to social/political theories, such as power and politics.
Nine rational factors were identified that should lead an organization to select an architecture that is optimal for its needs (Tushman, et al., 1978; Goodhue, et al., 1992). All stakeholders work toward accomplishing this organizational goal (Jasperson, et al., 2002).
Information Interdependence between Organizational Units
There is a high level of information interdependence when the work of one organizational unit is dependent upon information from one or more other organizational units. In this situation, the ability to share and integrate information is important. It is likely that firms with high information interdependence select an enterprisewide architecture.
Upper Management’s Information Needs
In order to carry out their responsibilities, senior management needs information from lower organizational levels. They may need to drill down into areas of interest, aggregate lower-level data, and be confident the company is in compliance with regulations such as the Sarbanes-Oxley Act. To the extent this capability is important, so, too, is having an architecture that supports it.
Urgency of Need for a Data Warehouse
An organization can have an urgent need for the capabilities of a data warehouse (or a data mart), and this urgency may dictate a fast implementation schedule. Some architectures are more quickly implemented than others, which can impact the architecture selection.
Nature of End-User Tasks
Some users perform non-routine tasks. Structured queries and reports are insufficient for their needs. They have to analyze data in novel ways and require an architecture that provides enterprisewide data that can be analyzed “on the fly” in creative ways.
Constraints on Resources
Some data warehouse architectures require more resources than others. As a result, the availability of IT personnel, business unit personnel, and monetary resources can impact the architecture that is selected.
Strategic View of the Data Warehouse Prior to Implementation
Organizations differ in their view or plans for the warehouse (or mart). It may be developed to provide a “point solution” to a particular business unit’s need, may be a decision support infrastructure project to support a range of applications, or may support a company’s strategic business objectives. Depending on the strategic view of the warehouse, some architectures are more appropriate than others.
Compatibility with Existing Systems
There are many benefits to implementing IT solutions compatible with legacy systems. Organizations may realize cost and time benefits by implementing a data warehouse solution compatible with existing systems. Consequently, the data warehouse architecture selected will depend on what other systems are already in place.
Perceived Ability of the In-House IT Staff
Building a data warehouse can be a daunting task, and implementing some data warehouse architectures may be perceived as more challenging than implementing others, depending on the internal IT staff’s technical skills, successful experiences with similar projects, and level of confidence. Consequently, the IT staff may choose an architecture that is compatible with what they think can be built successfully.
Technical Issues
A variety of technical considerations affect the choice of an architecture: the ability to integrate metadata; scalability in the number of users, volume of data, and/or query performance; the ability to maintain historical data; and the ability to adapt to changes (such as changes in source systems). Depending on the importance of these technical issues, some architectures may be deemed more appropriate than others.
One social/political factor was selected in our survey: expert influence. The social/political view of organizations is that organizational decision making is a process of negotiation and coalition building in which multiple, ambiguous goals exist (Eisenhardt, et al., 1988). With this perspective, no unified company goal drives the selection of a data warehouse architecture.
Expert Influence
When building a data warehouse, there are many places to turn for help: consultants, literature, conferences and seminars, internal experts, and end users. To varying degrees, these sources can affect the architecture that is selected. For example, a consultant may recommend an architecture that he or she has successfully implemented in the past.
Architecture is Different from Methodology
It is important to recognize that a data warehouse architecture identifies component parts, their characteristics, and the relationships among the parts, while a methodology identifies the activities that must be performed and their sequencing. Too often, the architecture and methodology terms are used interchangeably, which creates confusion. An architecture is an end product while a methodology is a process for developing an end product.
Sometimes the corporate information factory is referred to as a top-down approach and the data-mart bus architecture as bottom-up. This makes sense because the CIF approach places considerable emphasis on initially putting the infrastructure and processes in place to create an enterprise data warehouse; and the data-mart bus architecture focuses on delivering a solution that addresses a pressing business need. These are methodologies rather than architectures because they describe development processes.
Over time, the top-down and bottom-up approaches have become increasingly similar. Advocates of the top-down approach now state the importance of developing incrementally and of delivering early “wins.” The bottom-up proponents recognize the importance of having an enterprise plan for integrating the incrementally developed data marts.
Survey participants were asked to indicate the importance of each factor on the selection of a data warehouse architecture. We used a seven-point scale (1 for not important through 7 for very important). The factors rated were:
- Information interdependence between organizational units: The need to share information among organizational units.
- Upper management’s information needs: Upper management’s need for information from lower organizational levels.
- Urgency of need for a data warehouse: The extent to which there was an urgent need to build the data warehouse.
- Nature of end-user tasks: The extent to which users’ jobs required non-routine data analyses.
- Constraints on resources: The availability of resources (IT personnel, business unit personnel, and monetary resources) for building the data warehouse.
- Strategic view of the warehouse prior to implementation: The extent to which implementing a data warehouse was viewed as being important to supporting strategic objectives.
- Compatibility with existing systems: The extent to which the data warehouse architecture was compatible with existing systems.
- Perceived ability of the in-house IT staff: The perceived ability of the in-house IT staff in terms of technical skills, experiences, and confidence in developing a data warehouse.
- Technical issues: The extent to which technical issues affected the data warehouse architecture.
- Expert influence: The influence from sources of data warehouse expertise.
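The averages discussed in the remainder of this article are means of these seven-point ratings, grouped by architecture and factor. The sketch below shows one way such averages could be computed. The `responses` list holds invented placeholder values, not the study’s data, and `average_scores` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Invented placeholder responses, NOT the study's data:
# (architecture, factor, rating on the 1-7 scale).
responses = [
    ("hub-and-spoke", "urgency of need", 4),
    ("hub-and-spoke", "urgency of need", 6),
    ("centralized", "urgency of need", 7),
    ("centralized", "constraints on resources", 5),
    ("centralized", "constraints on resources", 3),
]


def average_scores(responses):
    """Return {(architecture, factor): mean rating}, validating the scale."""
    buckets = defaultdict(list)
    for architecture, factor, rating in responses:
        if not 1 <= rating <= 7:
            raise ValueError(f"rating {rating} is outside the 1-7 scale")
        buckets[(architecture, factor)].append(rating)
    return {key: mean(values) for key, values in buckets.items()}


scores = average_scores(responses)
```

Grouping by the (architecture, factor) pair yields one cell per combination, which is the layout a factor-by-architecture table of averages would use.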
The data reveals that all of the selection factors have some influence. The lowest average score slightly exceeds 4.3 (for the perceived ability of the in-house IT staff), indicating that even the lowest-rated factor is important. The most important factors (with average scores over 5.0) are information interdependence between organizational units, the strategic view of the warehouse prior to implementation, and upper management’s information needs. All are rational factors, suggesting that optimizing the architecture selection decision is of paramount importance.
Table 1 drills into the data further and provides the average score for every factor and architecture. While it is risky to speculate about the meaning of the differences among the scores for the various architectures based on just this data, some interpretations are possible.
In general, the selection factors for the independent data marts received lower average scores than the other architectures. This finding suggests that the independent data-marts architecture is employed more by happenstance than the others. It is the consequence of a series of independent decisions rather than an overall plan.
The notable exception (where the independent data marts architecture scored relatively high) was constraints on resources. A likely explanation is that a lack of resources prevents some organizations from implementing a better architectural solution. Of course, this architecture has its own costs—missed business opportunities, the need to support multiple decision-support platforms, and so on.
Despite the arguments over the merits of the hub-and-spoke and centralized data warehouse architectures versus the data-mart bus architecture, the scores for the architecture selection factors are surprisingly similar for most factors. Apparently, companies focus on many of the same factors but arrive at different architecture decisions.
The most significant differences (where the data-mart bus architecture scored lower) are for constraints on resources, strategic view of the data warehouse prior to implementation, and compatibility with existing systems. A possible interpretation of these lower scores is that the data-mart bus architecture is sometimes selected because the availability of resources is less of an issue (perhaps being sufficient to meet the needs of the data warehouse initiative), the view of the warehouse is less strategic, and there are fewer concerns about compatibility with existing systems. The differences, however, are not substantial.
The hub-and-spoke and centralized data warehouse architectures are similar except for the dependent data marts, so it is not surprising that the scores for many of the selection factors are about the same. One might argue, however, that the centralized data warehouse architecture is faster and easier to implement because it does not require dependent data marts. The data provides some support for this view. The centralized data warehouse architecture scored higher on urgency of need (indicating a need for a relatively fast implementation), higher on constraints on resources (the solution had to require fewer resources than the hub-and-spoke architecture), and higher on the perceived ability of the in-house IT staff (suggesting confidence in the ability to successfully implement the architecture).
Table 1. The importance of selection factors: all architectures.
Architectures (columns): Independent Data Marts; Data-Mart Bus Architecture with Linked Dimensional Data Marts; Hub-and-Spoke Architecture; Centralized Data Warehouse (No Dependent Data Marts)
Selection factors (rows): Information Interdependence between Organizational Units; Upper Management’s Information Needs; Urgency of Need for a Data Warehouse; Nature of End-User Tasks; Constraints on Resources; Strategic View of the Data Warehouse Prior to Implementation; Compatibility with Existing Systems; Perceived Ability of the In-House IT Staff
The relatively small number of companies with a federated architecture (n=15) makes it difficult to generalize. While recognizing this limitation, it is interesting to note that the score on technical issues was lower than for any of the other architectures. The highest scores were for information interdependence between organizational units and upper management’s information needs. The IT staff may have been told to cobble data together from various systems to meet senior management’s information needs and not be concerned with a technically elegant solution.
This article is the first report from a large study of data warehouse architectures. We are currently conducting multivariate analyses of the survey data in order to test a specific hypothesis and to find interesting relationships. We are also conducting follow-up telephone interviews to gain additional, qualitative insights about the architecture selection process. We will share our research when it is complete.
Breslin, Mary. “Data Warehousing Battle of Giants: Comparing the Basics of the Kimball and Inmon Models,” Business Intelligence Journal, Vol. 9, No. 1 (Winter 2004), 6-20.
Eisenhardt, K.M., and L.J. Bourgeois. “Politics of Strategic Decision-Making in High-Velocity Environments—toward a Midrange Theory,” Academy of Management Journal, Vol. 31, No. 4 (December 1988), 737-770.
Goodhue, D.L., L.J. Kirsch, and M.D. Wybo. “The Impact of Data Integration on the Costs and Benefits of Information Systems,” MIS Quarterly, Vol. 16, No. 3 (September 1992), 293-311.
Hackney, D. “Architecture anarchy and how to survive it: God save the queen,” Enterprise Systems Journal, Vol. 15, No. 4 (2000), 24-30.
Hackney, D. “BI Architecture Tiers,” DM Review, (July 2002).
Hoss, D. “The Bottom Line Looms, but Innovation Persists,” www.datawarehouse.com, (August 16, 2002).
Inmon, William, Claudia Imhoff, and R. Sousa. Corporate Information Factory (Second ed.), New York: Wiley & Sons, 2001.
Jasperson, S., B. Butler, T. Carte, H. Croes, C. Saunders, and W. Zheng. “Review: Power and Information Technology Research: A Metatriangulation Review,” MIS Quarterly, Vol. 26, No. 4 (December 2002), 397-460.
Joshi, K., and M. Curtis. “Issues in building a successful data warehouse,” Information Strategy: The Executive’s Journal, (Winter 1999), 28-35.
Kimball, Ralph, Laura Reeves, Margy Ross, and Warren Thornthwaite. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses, New York: Wiley & Sons, 2003.
Laney, D. “Data warehouse factors to address for success,” HP Professional, Vol. 14, No. 5 (2000), 21-22.
Tushman, M.L., and D.A. Nadler. “Information processing as an integrating concept in organizational design,” Academy of Management Review, Vol. 3, No. 3 (1978), 613-624.
Wells, David. “Choosing the right data warehouse approach,” TDWI FlashPoint, (2003).
Winsberg, P. “Modeling the data warehouse and the data mart,” INFODB, (1996), 1-10.
This article originally appeared in the issue of TDWI.