LESSON - Creating a True Enterprise Data Warehouse with Grid-Enabled Data Warehouse Appliances
By Stuart Frost, CEO, DATAllegro
October 18, 2007
Data warehouse installations at large companies generally fall into three architectural categories:
- Centralized enterprise data warehouses (EDW)
- Decentralized collections of data marts (DM)
- Attempts at hub-and-spoke architectures that combine the two
While many organizations have obtained significant value from their data warehouse installations, few have been entirely successful at implementing full-scale enterprise data warehouses. Centralized EDW installations tend to be extremely expensive and inflexible. Consequently, business units become frustrated because the EDW cannot meet their needs at sensible cost and within a reasonable timeframe. Decentralized data marts often result in many versions of the same data that are difficult to keep consistent across the enterprise. While a true hub-and-spoke architecture would address many of these issues, technical limitations in current data warehouse infrastructures have made such architectures difficult to implement.
A Grid of Appliances
DATAllegro’s DW appliances are generally used as a data warehousing “black box” with data access from a single point (the control node). However, the nodes within a DATAllegro appliance are actually self-contained Ingres database servers. Therefore, a DATAllegro appliance could be viewed as a highly specialized grid of servers being pulled together to collectively form a DW appliance.
Taking this view, it is a small step to think of a connected set of DATAllegro systems as both a grid of appliances and a grid of nodes. Moving or loading data could be done directly between nodes in different appliances to maximize parallelism and overall transfer speeds.
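The idea of moving data directly between nodes in different appliances can be illustrated with a minimal Python sketch. It is not DATAllegro's implementation: the node names, the round-robin pairing, and the `copy_slice` placeholder are all assumptions made for illustration. The point is only that each source node ships its own slice to a target node concurrently, rather than funnelling everything through a single control node.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_transfers(source_nodes, target_nodes):
    """Pair every source node with a target node, round-robin (assumed policy)."""
    if not target_nodes:
        raise ValueError("target appliance has no nodes")
    return [
        (src, target_nodes[i % len(target_nodes)])
        for i, src in enumerate(source_nodes)
    ]


def copy_slice(pair):
    """Hypothetical stand-in for a real node-to-node copy of one data slice."""
    src, dst = pair
    return f"{src}->{dst}"


def transfer(source_nodes, target_nodes):
    """Run all node-to-node copies concurrently and return them in plan order."""
    pairs = plan_transfers(source_nodes, target_nodes)
    with ThreadPoolExecutor(max_workers=max(len(pairs), 1)) as pool:
        return list(pool.map(copy_slice, pairs))


if __name__ == "__main__":
    print(transfer(["hub1", "hub2", "hub3"], ["mart1", "mart2"]))
```

In this sketch the degree of parallelism grows with the number of source nodes, which mirrors the article's claim that transfer speed depends on node counts.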
Solving the Hub-and-Spoke Challenge
Imagine a fairly large appliance acting as the hub of a set of data mart appliances. The hub would hold detailed data (near real-time or batch-loaded) for a number of business units or perhaps the entire enterprise. ETL tools such as Informatica, or SQL scripts, could create star schemas from this detailed data. The star schemas could then be transferred to the appropriate data mart(s) via the grid at more than a terabyte per minute, depending on the number of nodes in each target appliance.
Users would connect to the independent DM appliances as usual for running queries. Each DM would be tuned according to business needs and sized to handle the required level of performance and concurrency.
Multi-Temperature Data Warehousing
Data warehouse managers are under increasing pressure to store large amounts of historical data while also improving general query response times, all without exceeding tight budgets.
Leveraging the grid concept, DATAllegro could provide a multi-temperature system that balances performance and cost across the periods for which data must be stored. For example, assume that the data warehouse must store seven years of data for compliance purposes. The most recent quarter (the most frequently requested data) would be placed on a very high-performance appliance. Data between three and 12 months old could be stored on a standard DATAllegro appliance with very good performance, and data older than one year would be stored on one of DATAllegro’s archive appliances offering up to 200 TB of user data storage per rack.
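The tiering rule in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tier names (`hot`, `standard`, `archive`) are invented here, and a quarter and a year are approximated as 90 and 365 days.

```python
from datetime import date

HOT_DAYS = 90        # assumed approximation of "the most recent quarter"
STANDARD_DAYS = 365  # assumed approximation of "up to 12 months"


def tier_for(record_date, today):
    """Return the (hypothetical) tier name that should hold data of this date."""
    age_days = (today - record_date).days
    if age_days < 0:
        raise ValueError("record date is in the future")
    if age_days <= HOT_DAYS:
        return "hot"        # very high-performance appliance
    if age_days <= STANDARD_DAYS:
        return "standard"   # standard appliance
    return "archive"        # high-capacity archive appliance


if __name__ == "__main__":
    today = date(2007, 10, 18)
    for d in (date(2007, 9, 1), date(2007, 3, 1), date(2005, 1, 1)):
        print(d, tier_for(d, today))
```

Re-evaluating the same rule as time passes gives the set of rows that need to move from one tier to the next.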
As fresh data is loaded, older data would be automatically aged (moved) across the grid. Incoming queries would be automatically broken down into the relevant date ranges and the responses from the appliances collated into a single result set before being sent back to the user.
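One way to picture the query decomposition is the following Python sketch. It assumes three tiers with boundaries at 90 and 365 days before today, and a caller-supplied `fetch` function standing in for a query against one appliance; none of these names come from DATAllegro's product. The sketch splits a requested date range into one sub-range per tier, queries each, and collates the rows into a single sorted result set.

```python
from datetime import date, timedelta

HOT_DAYS = 90        # assumed boundary for the high-performance tier
STANDARD_DAYS = 365  # assumed boundary for the standard tier


def split_by_tier(start, end, today):
    """Split the inclusive range [start, end] into per-tier sub-ranges."""
    if start > end:
        raise ValueError("start must not be after end")
    one_day = timedelta(days=1)
    hot_from = today - timedelta(days=HOT_DAYS)
    standard_from = today - timedelta(days=STANDARD_DAYS)
    windows = [
        ("archive", date.min, standard_from - one_day),
        ("standard", standard_from, hot_from - one_day),
        ("hot", hot_from, date.max),
    ]
    parts = []
    for tier, lo, hi in windows:
        sub_start, sub_end = max(start, lo), min(end, hi)
        if sub_start <= sub_end:  # the query overlaps this tier
            parts.append((tier, sub_start, sub_end))
    return parts


def run_query(start, end, today, fetch):
    """Query every relevant tier and collate the rows into one sorted result."""
    rows = []
    for tier, sub_start, sub_end in split_by_tier(start, end, today):
        rows.extend(fetch(tier, sub_start, sub_end))
    return sorted(rows)


if __name__ == "__main__":
    def fake_fetch(tier, s, e):
        # Hypothetical appliance query: returns one row per tier.
        return [(s, tier)]

    print(run_query(date(2006, 1, 1), date(2007, 10, 1), date(2007, 10, 18), fake_fetch))
```

A query that falls entirely within one tier produces a single sub-range, so only that appliance is touched.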
The grid concept could also be extended across multiple data centers to provide a highly effective disaster recovery strategy. Individual appliances could be replicated on a second site and automatically synchronized with node-to-node replication.
A high-performance hub-and-spoke architecture that is easily managed and cost effective on an enterprise scale is now a practical reality with DATAllegro’s grid technology. Added benefits include support for multi-temperature data warehousing and disaster recovery, both integrated into the grid.