June 29, 2006: Data Warehousing Service-Level Agreements
John Bair and Faisal Shah, Knightsbridge Solutions
A service-level agreement (SLA) is a contract that spells out in measurable terms what services a provider will deliver to a customer. Though conventional wisdom suggests that SLAs accompany data warehouses, little has been written about data warehouse SLAs. Let’s take a closer look at data warehouse SLAs—their benefits, the areas of quality they should cover, and some of the implications and solution components for successful implementation.
Among the many benefits of SLAs, three stand out:
- User satisfaction—When service levels are discussed, set, and measured, users will know what to expect. As a result, they will develop confidence in the data warehouse and the people who build it.
- Data-driven decisions—Data warehouses are built so that users can make data-driven decisions. Measuring data warehousing service levels is a good way to practice what we preach, thus promoting good analytic behavior. Service-level measures also facilitate problem diagnosis, capacity planning, ROI analysis, and cost justification.
- Operational excellence—Service-level agreements help us measure our progress toward efficiency and operational excellence.
In short, service-level agreements are a good thing!
Before we delve into data warehousing–specific SLA categories, let’s review some universal categories:
1. Uptime (data and system availability)—Users and other systems depend on the data warehouse to be open and ready for business. Time-of-day data availability targets are normally considered to be one of the key components of a data warehouse SLA. Uptime agreements include the times the system will be available for use; communication methods for planned outages; and consequences and penalties for unplanned outages.
2. Performance—Performance agreements with users are often written in terms of average and worst-case response times, and average and peak concurrent users. Performance agreements with other systems involve delivery/consumption latencies for event-driven interactions and batch windows for bulk interactions.
3. Problem resolution—Problem resolution agreements define problem classes. For each class, they define responsible parties, maximum resolution times, and communication processes.
4. Business continuity—Continuity measures and recovery times from catastrophic system failures should be established. Recovery times should not only address data loss and the time required to restore the database, but should also take into account data collection, data staging, and “catch-up” processing times.
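The uptime and performance measures above lend themselves to simple computation. The following Python sketch (the targets, hours, and response times are purely illustrative, not from the article) computes an uptime percentage plus average and worst-case response times, and checks each against a hypothetical SLA target:

```python
from statistics import mean

def sla_report(response_times_sec, planned_hours, downtime_hours,
               avg_target_sec=5.0, worst_target_sec=30.0, uptime_target=0.99):
    """Evaluate sample measurements against illustrative SLA targets.

    All targets here are hypothetical; a real SLA would define its own.
    """
    uptime = (planned_hours - downtime_hours) / planned_hours
    avg_resp = mean(response_times_sec)
    worst_resp = max(response_times_sec)
    return {
        "uptime_pct": round(uptime * 100, 2),
        "uptime_met": uptime >= uptime_target,
        "avg_response_sec": round(avg_resp, 2),
        "avg_met": avg_resp <= avg_target_sec,
        "worst_response_sec": worst_resp,
        "worst_met": worst_resp <= worst_target_sec,
    }

# Example: a 720-hour month with 3 hours of unplanned outage
report = sla_report([2.1, 3.4, 8.0, 1.2, 4.9],
                    planned_hours=720, downtime_hours=3)
print(report)
```

In practice these measurements would come from system monitors and query logs rather than literals, but the comparison against agreed targets is the same.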
Next, let’s look at some data warehouse–specific SLA categories:
5. Data freshness—Data freshness measures the latency between data origination and its availability in the warehouse. Uptime and availability metrics alone are insufficient. Freshness agreements extend the uptime concept with statements of how old the data will be when it is made available. For example, a nightly refresh cycle typically makes the previous day’s data available.
6. Data quality—Data should meet prescribed quality levels so that it will be fit for its intended uses. Data that exceeds the quality levels for one use, such as campaign management, may at the same time be deficient for another use, such as sales compensation. Data quality metrics should accompany all published data.
7. Data retention—Typically, as data ages, its value decreases. Eventually, retention costs exceed the data’s value. When this happens, data is purged or archived. Retention agreements state how long data will be available and how often it will be purged.
8. Ad hoc query response—Because of unpredictable query submission times and query run times, ad hoc query response can be one of the most difficult SLA categories to manage. Generally, ad hoc query response agreements are written in terms of averages.
9. Acceptable usage—It is virtually impossible to guarantee data warehouse query service levels without agreements about usage guidelines and what constitutes acceptable usage. Unfortunately, it is not uncommon for a data warehouse, large or small, to be brought to its knees by a few “killer” queries, interrupting service to other users. A data warehouse SLA therefore cannot be one-sided: if data warehouse users (whether people or systems) are to expect consistent query response times, a few users cannot be permitted to monopolize the data warehouse service. Acceptable usage also covers data privacy and security.
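Data freshness, as defined above, reduces to a latency check between data origination and its publication in the warehouse. A minimal Python sketch of that check, using hypothetical timestamps and an assumed eight-hour latency target:

```python
from datetime import datetime, timedelta

def freshness_met(origination: datetime, published: datetime,
                  max_latency: timedelta) -> bool:
    """Check whether data became available within the agreed latency."""
    return published - origination <= max_latency

# Nightly refresh scenario: the previous day's last transaction
# must be queryable in the warehouse by early the next morning.
origin = datetime(2006, 6, 28, 23, 59)    # last transaction of the day
published = datetime(2006, 6, 29, 5, 30)  # warehouse load finished
print(freshness_met(origin, published, timedelta(hours=8)))  # True
```

In a real SLA the latency target would be stated per subject area, since different feeds can carry very different freshness requirements.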
Because a data warehouse is inherently a shared resource, it tends to be the place where different service levels meet. The operational systems that feed the data warehouse operate at a service level different from the warehouse itself, which in turn may be different from the service levels of any downstream analytic application or mart.
This service-level disparity is to be expected. Removing the disparity would force all interconnected applications in the enterprise to operate at the strictest service level found among them. Typically, operational systems require the highest availability service level, and analytic systems require the highest data quality service level. Moving all interconnected applications to the highest of each of the service-level categories is impractical and expensive, and may be impossible.
A warehouse often bridges disparate service levels. What does this mean to the data warehouse service level itself? The answer depends on which service-level category is in question and the direction data moves between applications. Let’s focus primarily on the availability and data quality service-level categories.
Because data tends to move from operational systems to the warehouse, operating a data warehouse at a lower availability is quite practical. Trouble arises, however, when operational or online systems need direct, event-driven access to the data warehouse. To support direct access, the data warehouse would need to be as available as the operational system, yet operating a data warehouse round the clock is too expensive for most organizations. More sophisticated architectures are therefore needed to bridge this service-level disparity while still supporting the operational systems’ need to access data warehouse data.
Data quality in operational systems is notoriously suspect compared with the quality analytic applications expect. The data warehouse must often enhance the quality of raw operational data before publishing it downstream. From a data quality perspective, we bridge the service-level disparity between operational systems and the warehouse by implementing rigorous quality verification, remediation, and certification processes in the link between the two.
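One illustrative sketch of such a verification gate is a completeness check that blocks publication when too many rows fail. The field names and the 2% threshold below are assumptions for the example, not prescriptions from the article:

```python
def certify_for_publication(rows, required_fields, max_fail_rate=0.02):
    """Gate raw operational data before publishing it downstream.

    Returns (certified, failing_rows). The rule here—no more than
    2% of rows missing a required field—is purely illustrative; a
    real agreement would specify its own rules and remediation steps.
    """
    failures = [
        r for r in rows
        if any(r.get(f) in (None, "") for f in required_fields)
    ]
    fail_rate = len(failures) / len(rows)
    return fail_rate <= max_fail_rate, failures

# Hypothetical customer extract with one incomplete row
rows = [
    {"cust_id": 1, "region": "NE"},
    {"cust_id": 2, "region": None},   # fails the completeness check
    {"cust_id": 3, "region": "SW"},
]
ok, bad = certify_for_publication(rows, ["cust_id", "region"])
print(ok, len(bad))  # False 1
```

Failing rows would then feed a remediation process rather than simply being dropped, so that certification and correction stay linked.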
Thus, service-level disparities must be recognized and explicitly bridged.
The management of all of these different data warehousing services and service levels requires sophisticated monitoring and reporting systems. Fortunately, if there is one system in the enterprise that is capable of reporting on its own performance over time, it’s the data warehouse!
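As one toy illustration of that self-monitoring idea, the warehouse’s own query log can be rolled up into daily average response times for trend reporting (the log data below is invented for the example):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical query log entries: (date, response time in seconds)
query_log = [
    ("2006-06-27", 3.2), ("2006-06-27", 7.8),
    ("2006-06-28", 2.4), ("2006-06-28", 12.1),
    ("2006-06-29", 4.0),
]

def daily_average_response(log):
    """Roll the query log up into per-day average response times."""
    by_day = defaultdict(list)
    for day, secs in log:
        by_day[day].append(secs)
    return {day: round(mean(v), 2) for day, v in sorted(by_day.items())}

print(daily_average_response(query_log))
# {'2006-06-27': 5.5, '2006-06-28': 7.25, '2006-06-29': 4.0}
```

In a production warehouse this rollup would naturally live in the warehouse itself, queried and trended like any other subject area.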
In this article, we have explored the benefits of having data warehouse service-level agreements. We delved into some of the key areas of quality that a data warehouse SLA should address, including not only uptime and business continuity, but also data warehousing–specific considerations such as data freshness, data quality, and query response. After looking at these facets of SLAs, the conclusion becomes clear. In order to manage the data warehouse as a shared platform, we must stratify uses into differing service levels and manage usage so that each class of user receives predictable service.
John Bair and Faisal Shah are chief technology officers with Knightsbridge Solutions, LLC.