Figure 1. Based on 225 respondents who have implemented operational BI according to TDWI’s definition.
Nevertheless, data warehousing has become a well-established IT practice in corporate environments, and few organizations will jettison their DW investments without trying to adapt the architecture to support just-in-time data and operational processes. Our survey also shows that about half of organizations (51%) run both operational and analytical reporting from the same environment. Given the high number of survey respondents who equate operational reporting with operational BI, this indicates that many companies have moved operational reporting into their data warehousing environment. (See Figure 1.)
Organizations that want to deliver just-in-time data via a data warehousing environment face daunting challenges. The rest of this section describes the major challenges involved in transforming a data warehousing environment to support operational BI.
Some of these technologies support multiple tasks. For instance, BI and EII tools can both acquire and deliver just-in-time data, and if you consider a report definition a storage mechanism, they support all three activities. Likewise, event-driven analytic platforms, working in concert with messaging backbones (EAI networks, enterprise service buses, publish/subscribe middleware), support all three activities. This is not surprising, since they are designed to provide end-to-end support for real-time operational BI requirements.
Our survey shows that organizations use a variety of techniques to support operational BI. The most common are ODSs, followed by more frequent batch loads and changed data capture. (See Figure 2.)
Figure 2. Based on 225 respondents who have implemented operational BI and rated the approach’s importance as “high.”
Increase scalability and throughput. Another key issue is building systems that scale to support greater numbers of users, more data sources, and higher volumes of data with increased rates of throughput—while ensuring high data quality and security. Organizations need to implement high-performance data warehousing platforms from leading companies such as Sybase, Teradata, and HP, and robust data integration platforms like those from Syncsort and others. To ensure scalability and throughput, companies may need to upgrade their networks and hardware, and parallelize key extraction, transformation, and load processes so they can adequately meet these increased throughput demands and eliminate processing bottlenecks.
Use inserts. To increase ETL throughput, some (but not all) architects recommend inserting new records (i.e., appending data) rather than updating or deleting existing records. “Avoid updates at all costs,” says O’Brien. “With large data sets, it takes too much time to find the record and update it.” Although this approach creates lots of duplicate data, administrators can use SQL DISTINCT and GROUP BY statements to identify and delete the duplicates, O’Brien says.
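As a rough illustration of this cleanup step, the sketch below assumes a hypothetical table named customer_stage that accumulates duplicate rows during insert-only loads; exact syntax varies slightly by database.

```sql
-- Find keys that were inserted more than once (duplicates created by insert-only loads).
SELECT customer_id, COUNT(*) AS copies
FROM   customer_stage
GROUP  BY customer_id
HAVING COUNT(*) > 1;

-- Collapse exact duplicates with DISTINCT into a clean copy of the table,
-- which can then be swapped in for (or used to rebuild) the original.
CREATE TABLE customer_stage_dedup AS
SELECT DISTINCT *
FROM   customer_stage;
```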
O’Brien also recommends turning off referential integrity and database logging. “Why log inserts if you have the original load file?” he asks. In addition, there is no reason to create indexes and aggregations during the ETL process, since they consume a lot of CPU and slow down the inserts. Like many architects, O’Brien now advocates performing these tasks after the data is loaded into the data warehouse, exploiting the parallelism of the underlying database. This process is known as extract, load, and transform (ELT), as opposed to the more traditional extract, transform, and load (ETL).
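A minimal ELT-style sketch of this sequence follows; the table and index names are hypothetical, and commands such as DROP INDEX and CREATE TABLE AS vary slightly across databases.

```sql
-- 1. Drop the index so it does not slow the raw inserts during the load window.
DROP INDEX sales_fact_date_idx;

-- 2. Bulk-insert the extracted records as-is; no transformation yet (the "L" before the "T").
INSERT INTO sales_fact (sale_id, customer_id, sale_date, amount)
SELECT sale_id, customer_id, sale_date, amount
FROM   extract_staging;

-- 3. Let the data warehouse engine build aggregates in parallel after the load.
CREATE TABLE sales_daily_agg AS
SELECT sale_date, SUM(amount) AS total_amount
FROM   sales_fact
GROUP  BY sale_date;

-- 4. Re-create the index once the data is in place.
CREATE INDEX sales_fact_date_idx ON sales_fact (sale_date);
```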
Increase availability and recoverability. In a just-in-time data warehousing environment there is little time to recover from errors, because batch windows are small or nonexistent. “In a real-time system, you have no time to fix problems, and messages can start piling up,” says OLAP architect Eric Lofstrom. The best way to avert these problems is to build high-availability systems to drive the acquisition and deployment process. These systems eliminate single points of failure by parallelizing critical components, run in a clustered environment, support failover and backup processes, and have an offsite backup.
Microbatches. To avoid losing data, O’Brien recommends using “microbatches” instead of real-time streaming, since microbatches provide a buffer with which to recover from a network or server outage. His rule of thumb is to set the ETL batch cycle to roughly three times the time required to process the data. “If it takes 90 seconds to process a file, then I’ll set the batch cycle to five minutes and have 3.5 minutes of sleep time.”
However, other IT practitioners believe it is perfectly safe to stream events into a data warehouse. Lofstrom, for example, recommends logging each message as it comes off a queue, which may slow the stream slightly but prevents the loss of data caused by server outages. If an outage occurs, Lofstrom recommends loading the accumulated messages into the warehouse in a single batch job before turning event streaming back on. In other cases, companies use two-phase commit to move streamed messages from a queue into a staging area, so updates are never lost during a planned or unplanned outage of the database server.
Deliver adequate performance. A challenging problem for data warehousing designers is ensuring rapid query response times while simultaneously loading or updating a data warehouse and performing other tasks, such as monitoring events, triggering alerts, running backups, and scoring models. A mixed workload can cause RDBMS performance to degrade to the point where it blocks incoming queries or causes load processes to fail. Achieving adequate performance in a mixed workload environment can be a vexing challenge.
Complicating matters is that most users expect near-instantaneous response times in an operational BI environment. For example, customer service representatives can’t wait more than a few seconds for an integrated view of customer activity and cross-sell offers before their callers get impatient and hang up. Most data warehouses aren’t designed to support a mixed workload of tactical queries that return a few rows of data in less than a second along with complex requests that scan millions of records and take minutes or hours to process.
How do you balance these mixed workloads and maintain SLAs without causing the system to crash? There are several options, all of which have trade-offs:
- Throttle incoming requests in response to heavy system loads. You can schedule queries or jobs to run only at night, or queue or delete them during periods of heavy load using query governors or job managers, to protect the integrity of system updates. This approach obviously won’t sit well with users, and this kind of interference is the very reason companies began building data warehouses in the first place: to provide the business with a dedicated analytical environment that would not interfere with operational processes.
- Separate analytical and tactical queries and jobs. Another approach is to off-load tactical queries and predictive scoring jobs to an ODS or specialized data mart and leave the data warehouse to support long-running analytical queries. Here, the ODS contains current transaction data, usually no more than 30 days’ worth, stored in third normal form models that reflect operational schemas. The downside of this approach is that it separates historical and current data into different systems, making it difficult for users to view just-in-time data in historical context.
- Leverage RDBMS mixed-workload capabilities. Many database management systems (DBMS) have significantly improved their ability to optimize the performance of long-running strategic queries, short-running tactical queries, load and update processes, and other types of workloads. Teradata, for example, has made big improvements in its priority scheduler for optimizing tactical queries, says Dan Graham, marketing manager at Teradata. The priority scheduler is a real-time traffic cop that always allocates CPU to tactical queries first to guarantee sub-second response times. It continuously monitors in-flight tasks and dynamically adjusts CPU allocation to optimize performance. In addition, administrators can configure and prioritize workloads by user groups, types of activity, and other variables. HP’s new Neoview data warehousing platform also boasts workload management capabilities.
An RDBMS that supports mixed-workload processing lets organizations have their cake and eat it, too: they can load current and historical data into the same database and optimize performance across all types of queries. Without an RDBMS that supports mixed workloads, many organizations decide to avoid using a data warehouse for operational BI. However, not all mixed-workload capabilities are created equal, so evaluate your database vendor’s capabilities carefully. Also, running multiple workloads on a single platform may require hardware upgrades to maintain adequate performance, so calculate these additional costs when deciding whether to use a data warehouse to support operational requirements.
Avoid contention problems. When user queries and ETL processes hit the same table, there is the potential for one process to block the other and cause query or load performance to degrade. Along with mixed workload requirements, the issue of contention causes many people to maintain historical and just-in-time data in distinct data stores. There are three ways to minimize the contention problem, but they are not for the fainthearted:
- Allow simultaneous reads and writes. If you are inserting rather than updating data, the database won’t lock the table. The only problem here is that the data can be out of sync if the user reads the table before all the inserts have completed. To avoid confusion, it’s imperative to provide users with guidelines for submitting ad hoc queries against real-time data, and it helps if you dynamically time-stamp all reports. It’s also important to turn off caching in BI servers, since there is no point in loading data every hour if the BI tool requests data from a cache that gets flushed once a day.
- Insert and flip partitions. Another approach is to create a duplicate partition of a fact table and load it with current data. Then, on a periodic basis, swap the duplicate with the live fact table to update the data warehouse. This approach may require you to pause the application server while the flip takes place so no new requests are initiated during the swap, says Justin Langseth, CTO of Claraview, a BI consultancy in Reston, VA. Many organizations use this “insert-and-flip” approach; a simple sketch of the idea appears after this list.
- External real-time cache. Another approach Langseth recommends is to load data into an in-memory cache outside of the data warehouse. Requests for real-time data are fulfilled from the cache, while queries that require a combination of real-time and historical data merge the requested data in a set of temporary tables in either the data warehouse or real-time cache, depending on which database holds the largest volume of data requested. This type of just-in-time merging of information requires complex SQL and may be challenging for most BI tools to support.
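To make the insert-and-flip idea concrete, here is a minimal sketch using a view swap; all object names are hypothetical, and some databases offer partition-exchange commands that accomplish the same flip more directly.

```sql
-- 1. Load the standby copy while users continue to query the sales_fact view,
--    which currently points at the live copy.
INSERT INTO sales_fact_standby (sale_id, customer_id, sale_ts, amount)
SELECT sale_id, customer_id, sale_ts, amount
FROM   staging_sales;

-- 2. "Flip": repoint the view that users and BI tools actually query.
--    Briefly quiesce the application server so no request straddles the swap.
CREATE OR REPLACE VIEW sales_fact AS
SELECT * FROM sales_fact_standby;
```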
Summary. There are many challenges involved in implementing operational BI. The challenges discussed here deal primarily with transforming a data warehousing architecture to support just-in-time data delivery. Although there are other ways to deliver just-in-time data to support operational BI, many architects recommend using a data warehousing environment to ensure a consistent set of data for the entire organization.
Wayne Eckerson is the director of TDWI Research and author of Performance Dashboards: Measuring, Monitoring, and Managing Your Business (John Wiley & Sons, 2005). He can be reached at [email protected].
This article was excerpted from the full, 30-page report by the same name. You can download this and other TDWI Research free of charge at www.tdwi.org/research.
The report was sponsored by Business Objects, HP, Oracle/Hyperion, InetSoft, SAS, Sybase, Syncsort Incorporated, and Teradata Corporation.