Beyond ETL: On Demand ETL and the Big Shift in BI and Analysis
An emerging trend -- "on-demand" ETL -- augurs a big shift in the way analysis and BI are performed and results disseminated.
- By Stephen Swoyer
- April 2, 2013
ETL's batch-based conceptual model lends itself to a very different kind of timeliness: what many call "on-demand" ETL, in which batch jobs are triggered as needed rather than on a fixed schedule.
This isn't real-time ETL -- not as marketed by real-time specialists such as IBM Corp. and Informatica Corp. -- but it does have some real-time characteristics.
In the on-demand model, a user or application might not be consuming time-sensitive information from operational systems in real time, but nonetheless is triggering an ETL process in real time. For some industry veterans, it augurs a big shift in how analysis and, with it, business intelligence (BI) are performed and results disseminated.
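To make the distinction concrete, here's a minimal sketch of the pattern in Python: a batch ETL job that would ordinarily run on a nightly schedule is instead kicked off by a consumer's request, subject to a freshness check. The job runner command and job name are hypothetical, not any vendor's actual API.

```python
# Minimal sketch of on-demand ETL: the job is triggered by a request,
# not by a scheduler. "etl-runner" and the job names are hypothetical.
import subprocess
from datetime import datetime, timedelta

_last_run = {}                       # job name -> timestamp of last run
FRESHNESS = timedelta(minutes=15)    # tolerate data up to 15 minutes old

def run_etl_job(job_name: str) -> None:
    """Invoke an existing batch ETL job (here, a hypothetical external command)."""
    subprocess.run(["etl-runner", job_name], check=True)
    _last_run[job_name] = datetime.utcnow()

def get_report_data(job_name: str, query_fn):
    """Trigger the ETL job on demand, but only if its output is stale."""
    last = _last_run.get(job_name)
    if last is None or datetime.utcnow() - last > FRESHNESS:
        run_etl_job(job_name)        # the user's request triggers the load
    return query_fn()                # then read from the (now fresh) mart
```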
Beyond ETL: a Quick Primer
Over the last five years, there's been a push to recast ETL as a more agile or iterative technology -- or to eliminate it altogether.
Data warehousing specialist Kalido, for example, has pursued the latter course. Its ELT-like Unified Load Controller (ULC), the company says, can be used to bypass traditional ETL: ULC "lands" data in the Kalido Information Engine, where it can be conformed to DW standards. From Kalido's perspective, ETL -- with its protracted development process and batch-based underpinnings -- is the problem.
Competitor WhereScape Inc. comes at the same problem from an ETL-centric perspective. It touts an iterative approach to ETL that emphasizes rapid development, testing, and optimization. WhereScape concedes that the latency and inertia associated with DW development and management are problematic, but it doesn't see either as intrinsic to ETL itself.
Rather, WhereScape claims, the problem lies with the process or methodology by which data warehouses are developed and managed.
Even though WhereScape's approach is predicated on a build-in-advance model, its focus on highly iterative ETL -- in effect, on rapid ETL development, from which the name of its flagship product, RED, derives -- gets close to a kind of on-demand ETL. It's likewise consistent with an analytic discovery model that emphasizes access to information, regardless of its cleanliness, consistency, or standardization. Both WhereScape's and Kalido's schemes promise to reduce ETL batch windows, automate (or, in Kalido's case, eliminate) common ETL tasks, and accelerate access.
On-Demand ETL
The on-demand ETL visions touted by ParAccel Inc. (On Demand Integration, or ODI) and Pentaho Inc. go further.
Dave Henry, senior vice president of engineering with Pentaho Inc., describes a big data use case that involves blending data from Hadoop with data from operational sources, which he calls "on demand" ETL.
"If you can think about [a scenario in which you're] doing a query against Hadoop and getting some data out of it -- maybe you're going through Hive, maybe you're reading Hadoop files or supporting something like Impala from Cloudera -- you're going to get a stream of data out of that [query]. As that data comes out, you'd like to do look ups [i.e., comparing or blending it] against your operational data. It's a kind of on-the-fly enrichment."
Pentaho markets an ETL tool in Pentaho Data Integration (PDI), which is based on the Kettle open source software (OSS) ETL project. Nevertheless, Pentaho conceives of ETL as a prerequisite for discovery and analysis: its focus isn't on PDI as a general-purpose ETL product but on PDI as an enabling technology for Pentaho Analysis.
It's a distinction worth emphasizing: ETL is a product of the data warehouse-driven BI model. It's an ingestion technology; its purpose is to transform data such that it conforms to the requirements of the DW. As a result, the ETL model traditionally emphasizes the cleanliness, consistency, and standardization of data. The priorities of this model are fundamentally at odds with the more relaxed constraints of discovery. They're likewise at odds with the ways in which people want and (increasingly) expect to consume information.
Pentaho's new InstaView product -- which is a kind of real-time "blender" for data from Hadoop, other NoSQL repositories, and Web applications -- is an exemplar of this, Henry maintains. "Increasingly, we have more and more information that's more distributed than ever, and people want to be able to mash it up on the fly. Salesforce[.com] doesn't want to be your data warehouse, [nor are] most people ... going to stuff all of their non-CRM corporate information into Salesforce," Henry says.
"People who are creating line-of-business applications that may be more specialized than Salesforce don't want to get into that [putting everything into a data warehouse]: it's all one-off. You'd have to have a really complex [DW] schema to handle all of that ingestion, so what we're seeing is [the creation of] these kinds of 'data marts on demand,' particularly for highly aggregated stuff."
On-Demand ETL in Practice
At last year's Strata + Hadoop World conference, analytic database specialist ParAccel Inc. discussed a range of similar scenarios. Today, ParAccel offers On Demand Integration (ODI) modules for several database platforms or standards (e.g., Teradata, ParAccel itself, and ODBC) in addition to Hadoop. ODI connectivity can be used to directly import data from these platforms into ParAccel.
At Strata + Hadoop World, officials discussed the idea of embedding user-defined functions (UDFs) in the ODI layer. This is less an ad hoc ETL capability -- e.g., importing data from another platform into ParAccel -- than a kind of user-initiated batch ETL, said vice president of marketing John Santaferraro. A user or application doesn't initiate a completely new ETL job or consume data from a new (i.e., external) source. Instead, the user (or the application used) consumes data that's already in ParAccel. The embedded UDFs transform this data, on the fly, into a format that can be consumed by the initiating application.
ParAccel, like Pentaho, doesn't position its ODI connectivity as a replacement for robust ETL. Instead, Santaferraro explained, it sees ODI-powered information access and on-the-fly transformations as consistent with the relaxed constraints of the discovery model. Discovery, Santaferraro argued, emphasizes data access above all; it's more tolerant of consistency or quality issues.
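The general pattern (data that already lives in the analytic database, reshaped at query time by an embedded UDF for the requesting application) might look something like this sketch over ODBC. The DSN, table, and UDF name are hypothetical; this is not ParAccel's actual ODI interface.

```python
# Sketch of user-initiated, on-the-fly transformation: the query applies an
# embedded UDF to data already stored in the analytic database. The DSN,
# "orders" table, and to_app_json() UDF are all hypothetical.
import pyodbc

def fetch_transformed(dsn: str = "DSN=analytics"):
    conn = pyodbc.connect(dsn)
    cur = conn.cursor()
    # The transformation happens inside the database, at query time,
    # via a user-defined function embedded in the integration layer.
    cur.execute(
        "SELECT to_app_json(order_id, customer_id, amount) "
        "FROM orders WHERE order_date >= ?",
        "2013-01-01",
    )
    return [row[0] for row in cur.fetchall()]
```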
"We're not trying to become a transformation engine; we are an analytics company. Everything we do has to do with analytics," he said.
"On-demand integration is about making data available for analysis. When you take the [ParAccel analytics] platform and add ODI services, you've now become sort of the service center for analytic services for people and applications. You can now use the power of the analytic platform to do the heavy-lifting processing of the analytics."
Harbinger of Things to Come?
ParAccel isn't alone. At the TDWI World Conference in Las Vegas, Teradata Corp. trumpeted a new release, version 5.10, of its Aster Discovery Platform, which it claims can automate the time-consuming or repetitive aspects of data access and preparation. Teradata's pitch with Aster Discovery 5.10 isn't unlike ParAccel's with ODI; there's likewise a sense in which both efforts -- along with Pentaho's on-demand ETL vision -- might be seen as harbingers of an industry-wide shift to come.
Industry luminary Colin White, president of BI Research, says the DW-driven BI model is being supplanted by a more diverse scheme in which multiple platforms or disciplines cohabit in a kind of information ecosystem. An analog to discovery in this new ecosystem is what White calls "investigative computing," a methodology that emphasizes rapid, test-driven iteration -- of hypotheses, analytic models, or other artifacts. Investigative computing is an umbrella paradigm for practices such as predictive analytics (PA) and analytic discovery. The point, says White, is that the traditional DW-driven model can't keep pace with -- can't feed -- the information requirements of PA, discovery, or other investigative practices.
"At the moment, getting data into the data warehouse has become a bottleneck; typically, with operational intelligence, by the time we have it in there, it's already changed," White observes. In the DW-driven status quo, analysis has largely been a retrospective practice. Analytic discovery and PA change this. Information heterogeneity -- i.e., more data, from more sources -- is a prerequisite for both practices; so, too, is rapid iteration. PA, in particular, works by optimizing models, and models are optimized iteratively: data scientists hypothesize, test, and tweak. The DW-driven status quo is inimical to this kind of "investigative" model.
"Up until now, [the way] we've produced analytics [is a lot like] 'what's-happened-in-the-past' and 'what's-happening-now.' When we get into predictive analytics ... we have to change what we're doing. We have to take advantage of [new] technology to blend in data faster, run models faster, and get results much faster. This isn't possible in the [store and analyze] model we have today," White says.