LESSON - Why End-to-End Extract, Transform, and Load (ETL) Automation Is a Must for Business Intelligence Success
By Derek Evan, Solutions Architect, Cisco
The information retrieved from a BI system is only as trustworthy as the data going into it. Bad information is generally not the result of bad source data; rather, it’s the result of pulling out-of-sync data into the system due to problems in the data processing flow. Process flow issues can also impact information availability, process auditability, and IT’s ability to respond to changing business needs. While bad information can lead to poor decisions, the full scope of problems includes high IT costs, potential governance and compliance issues, and other less-than-desirable consequences.
Scripts, Custom Code, and Islands of Automation
In typical IT environments, there can be hundreds—or thousands—of data sources, including legacy databases, application databases, departmental data marts, and information from business partners. Each can have unique issues around method of access, content, and quality; update and arrival schedules; and requirements for transformation.
ETL tools often include basic process schedulers to initiate, coordinate, and manage the process. These schedulers typically direct operations inside the tool; for process flows outside the tool, scripting and custom code are required. Error handling and documentation are frequently afterthoughts. The overall result is a solution that provides little visibility into the ETL process, scant notification when a portion of the process fails, and no help in recovering from errors.
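To make the fragility concrete, below is a minimal sketch of the kind of ad hoc driver script this approach tends to produce. The step scripts named here are hypothetical, standing in for whatever standalone jobs an environment has accumulated:

    # Hypothetical ETL driver script of the kind described above: it chains
    # steps together with no dependency checks, no error handling, and no
    # notification when something goes wrong.
    import subprocess

    subprocess.run(["python", "extract_orders.py"])    # pull from source systems
    subprocess.run(["python", "transform_orders.py"])  # cleanse and reshape
    subprocess.run(["python", "load_warehouse.py"])    # load the warehouse

    # Nothing verifies that a step succeeded before the next one runs, and a
    # failure raises no alert, just a silent gap in the warehouse.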
Recovery and Cascading Errors
Even in a well-managed system, errors and failures inevitably occur. Periodic problems with networks, servers, storage devices, and applications cause processing steps to fail. Data or application errors may produce corrupted data or truncated files. Unplanned maintenance may take a critical application offline. In a system based on scripting, custom code, and embedded schedulers, these numerous problem points make the BI solution so fragile that IT spends significant time identifying and solving problems.
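As a simple illustration of catching one such failure mode, the sketch below checks an incoming file against an expected row count before it is processed. The feed name and the assumption that an expected count is available (for example, from a manifest) are inventions for the example:

    # A minimal validation sketch, assuming each incoming feed arrives with a
    # known expected row count (e.g., from a manifest). Names are hypothetical.
    import csv

    def validate_feed(data_path: str, expected_rows: int) -> None:
        """Raise an error if the file appears truncated or padded."""
        with open(data_path, newline="") as f:
            actual_rows = sum(1 for _ in csv.reader(f))
        if actual_rows != expected_rows:
            raise ValueError(
                f"{data_path}: expected {expected_rows} rows, found {actual_rows}"
            )

    # Example use; failing fast here keeps a truncated file out of the warehouse:
    # validate_feed("orders_feed.csv", expected_rows=125000)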
Another major problem is cascading errors. These occur when custom solutions lack sufficiently robust error detection or dependency management to halt a data flow when a step fails. Errors then ripple through the downstream steps: some fail outright, or, worse, failures go unnoticed and processing continues using erroneous or missing data.
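The simplest defense is to stop the flow at the first failure. The sketch below adds that minimal guard to the earlier driver script; the step names remain hypothetical:

    # The same chain as before, with the minimal guard that prevents a cascade:
    # every exit status is checked, and the flow halts at the first failure so
    # downstream steps never run against bad or missing data.
    import subprocess
    import sys

    STEPS = [
        ["python", "extract_orders.py"],
        ["python", "transform_orders.py"],
        ["python", "load_warehouse.py"],
    ]

    for step in STEPS:
        result = subprocess.run(step)
        if result.returncode != 0:
            sys.exit(f"ETL halted: {step[1]} failed with code {result.returncode}")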
End-to-End Automation Is the Key
Controlling the ETL process flow requires standardizing on a platform that automates and provides visibility into the data flow end to end. A particularly effective tool is a distributed job scheduler. Distributed job schedulers offer both the reach and control necessary to manage the ETL process, as well as the input, output, and notification functions associated with a complete data flow.
Key to proper ETL processing is managing the sequence of steps in the process and coordinating steps that may otherwise operate autonomously. Enterprise job schedulers orchestrate these sequences as a series of dependencies, which can be defined to include a broad range of events and triggers. Through this mechanism, each step in a process receives complete and accurate data, and the next step proceeds as soon as possible, maximizing throughput for the BI system. Automated dependency management dramatically reduces the complexity of creating an end-to-end ETL process and is the only way to consistently deliver data to the right place at the right time.
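The toy sketch below illustrates the idea of dependency-driven scheduling; it is not a real enterprise scheduler. Each job declares its upstream dependencies, runs only when all of them succeed, and is skipped along with its dependents when an upstream job fails. The job names and actions are invented for the example:

    # A toy illustration of dependency-driven scheduling. Each job maps to
    # (upstream dependencies, action to run); a job runs only after all of
    # its dependencies succeed, and is skipped if any upstream job fails.
    from typing import Callable

    JOBS: dict[str, tuple[list[str], Callable[[], None]]] = {
        "extract_crm":    ([], lambda: print("pulling CRM data")),
        "extract_erp":    ([], lambda: print("pulling ERP data")),
        "transform":      (["extract_crm", "extract_erp"],
                           lambda: print("cleansing and joining feeds")),
        "load_warehouse": (["transform"], lambda: print("loading warehouse")),
    }

    status: dict[str, str] = {}

    def run_job(name: str) -> str:
        """Run a job once all of its dependencies have succeeded."""
        if name in status:
            return status[name]
        deps, action = JOBS[name]
        if any(run_job(dep) != "ok" for dep in deps):
            status[name] = "skipped"             # upstream failure: do not run
        else:
            try:
                action()
                status[name] = "ok"
            except Exception as exc:
                status[name] = f"failed: {exc}"  # isolate the failing step
        return status[name]

    for job in JOBS:
        run_job(job)
    print(status)  # which step failed or was skipped is immediately visible

Running the sketch produces a status map that plays the role of the scheduling console: a failed step is named explicitly, and everything downstream of it is marked skipped rather than run against bad data.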
With the automation of an enterprise job scheduler, cascading errors are prevented right from the start. Dependency management ensures that each step in a process completes successfully before launching the next activity. Additionally, errors are isolated quickly, because the scheduling console immediately highlights any failed processing steps, localizing the problem to the specific system component.
For more on this topic, see the free white paper “Automating Data Flows to Your Business Intelligence System.”