The Logic of Disintermediation and the End of ETL as We’ve Known It
Big data platforms such as Hadoop, Cassandra, and Spark are usurping the role of the ETL engine at the heart of data integration.
- By Steve Swoyer
- April 18, 2016
As data consolidation and data preparation shift to big data platforms, so do the tools and techniques of data integration (DI).
The most significant change is that big data platforms such as Hadoop, Cassandra, and Spark are usurping the role of the ETL engine that's traditionally been at the core of DI. The term “ETL” -- extract, transform, load -- is something of a misnomer in this context, however.
We're used to thinking of ETL as a technology for engineering data, but its most important function is arguably data movement. ETL describes a technique for extracting and moving information from upstream source systems to downstream target systems. Data movement can be mostly frictionless (as with an ETL engine that transforms data in flight) or quite the opposite, as with the use of an ETL staging area: a place where data is landed, checked for consistency, cleansed, and transformed prior to being moved again into a destination system.
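The two-hop staging pattern is easy to sketch in miniature. In the toy Python example below, in-memory lists stand in for the source system, the staging area, and the warehouse, and the records themselves are invented -- the point is simply that the data moves twice:

```python
# Classic staged ETL, in miniature: land raw rows in a staging area,
# check and cleanse them there, then move them again into the target.
# All systems here are stand-ins (plain Python lists); the rows are invented.

source_system = [
    {"id": 1, "amount": " 19.99"},
    {"id": 2, "amount": "5.00 "},
    {"id": 3, "amount": None},  # an inconsistent record
]

# Extract: first movement, source -> staging area
staging_area = list(source_system)

# Transform in the staging area: consistency checks and cleansing
cleansed = [
    {"id": row["id"], "amount": float(row["amount"].strip())}
    for row in staging_area
    if row["amount"] is not None
]

# Load: second movement, staging area -> warehouse
warehouse = cleansed
print(warehouse)
```

The inconsistent record is dropped in the staging area, and only the two clean rows make the second hop -- but every byte, clean or not, made the first one.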
As a technique for moving data, classical ETL is outmoded, however. It's outmoded because it assumes a reality -- that of stateful connectivity between and among discrete physical systems -- that's been superseded by events. It's outmoded because it accords space to an interstitial tier or staging area in which data can be landed and processed -- prior to being moved once again. It's outmoded, more precisely, because it is profligate, not frugal, with data movement. In the physics of the big data universe, such profligacy is untenable: it's wasteful and illogical.
In a data lake architecture, for example, data “movement” involves only the logical (or at least local, i.e., in-cluster) “extraction” and “loading” of data. Think of this as analogous to landing data in a scratch table of an RDBMS before “loading” it into the warehouse proper. In both cases, “movement,” as such, is either logical or local.
Imagine a Hadoop-based data lake. When you “move” data from Hive -- which, in combination with Hadoop's Tez framework, is a capable engine for processing very large data sets -- to Spark (for query processing via the Spark SQL library, for analysis using Spark Streaming, or for other purposes), what you're actually doing is a kind of TEL: transforming data in situ, extracting a derived data set, and loading it into the Spark environment.
Spark SQL can persist data into and process/query against a myriad of formats, from columnar Parquet and ORC files to JSON files, Avro files, Hive tables, and so on. Best of all, because Spark SQL can query against Hive tables, you might not be moving the data at all.
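The TEL pattern can likewise be sketched in a few lines. In the toy Python example below, SQLite's in-process engine stands in for Hive/Tez, and the events table and its contents are invented: the transformation happens where the data lives, and only the smaller, derived set is extracted for loading elsewhere.

```python
import sqlite3

# TEL in miniature: transform in situ (inside the engine that holds the
# data), then extract only the derived result. SQLite stands in for
# Hive/Tez here; the events table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 100), (1, 250), (2, 75), (2, 25), (3, 500)],
)

# The transform runs where the data lives: the engine aggregates...
derived = conn.execute(
    "SELECT user_id, SUM(bytes) FROM events "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()

# ...and only the derived set crosses the boundary to the next engine.
print(derived)  # 3 derived rows extracted instead of 5 raw rows
```

Swap in Hive for SQLite and a Spark DataFrame for `derived`, and the shape of the operation is the same: in-situ transform, then a local or logical extract-and-load.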
Consider another, even more intriguing scenario: that of the cloud-based data lake/storage sink. A number of organizations are using cheap cloud storage, such as Amazon's Simple Storage Service (S3), as all-purpose data sinks/storage vaults. S3 is used as a persistence layer for strictly structured relational data, polystructured formats (such as JSON objects, multimedia content, and other kinds of objects), and semi-structured sources such as text files and event/application messages. (Message traffic can be encapsulated in JSON and serialized in Avro, among other formats, so structural distinctions on the basis of file containers aren't all that helpful.)
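A toy sketch of the all-purpose sink idea, with a Python dict standing in for S3's flat key space (the keys and payloads are invented):

```python
import json

# A single object store as an all-purpose sink: one flat key space holds
# relational extracts, JSON-encapsulated messages, and plain text alike.
# A dict stands in for S3 here; all keys and payloads are hypothetical.
sink = {}

# Strictly structured relational extract, serialized as CSV text
sink["warehouse/orders/2016-04-18.csv"] = "order_id,amount\n42,19.99\n"

# An event message encapsulated in JSON
sink["events/clickstream/msg-0001.json"] = json.dumps(
    {"user": 7, "action": "view", "ts": "2016-04-18T12:00:00Z"}
)

# Semi-structured application log text
sink["logs/app/server.log"] = "INFO started\nWARN slow query\n"

# The container tells you little about the structure inside: the sink
# persists all three the same way, as opaque objects under keys.
print(sorted(sink))
```

The sink itself imposes no schema; structure is something the downstream processing engine discovers (or asserts) at read time.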
Hadoop combines a scalable, distributed storage layer with a baked-in, general-purpose parallel processing facility. Amazon S3, by contrast, is a storage-only layer. It doesn't have a built-in parallel processing facility. Amazon Web Services (AWS) does -- its Elastic MapReduce (EMR) service. (It's also possible to spin up Hive/Tez, Spark, and other engines in Amazon's Elastic Compute Cloud, or EC2.)
When we talk about “moving” data in AWS, we mean something like ETL. The difference is that “extracting” data from S3 involves a logical movement: a change of virtualized context rather than a hop between discrete physical systems -- even if some movement across physical hardware is implied.
In the multi-tenant cloud, however, there's no 1:1 mapping of system-to-hardware -- or, for that matter, of system-to-rack. Instead, there's virtual abstraction, with several operating system instances/nodes cohabiting on a single physical system -- sharing pooled memory, storage, and network resources. Data “moves,” to be sure, but not like it does in the classical ETL paradigm.
The physics of moving data in the realms of big data and the cloud-based data lake/storage sink is remarkably similar. The first priority is to minimize data movement by processing data in situ, i.e., in the context in which it physically lives. In big data, this involves using Hive, Spark SQL, Presto, or other SQL interpreters to produce smaller, derived data sets. (In the context of AWS, this could entail processing data in Redshift or spinning up Hadoop Hive-Tez or Spark instances in EC2.)
In conjunction with S3 and other cloud storage services, the priority is to minimize movement outside of or away from the service context -- in S3's case, that means minimizing how much data is moved outside of the AWS region. Data movement between and among contexts (or AWS regions) is severely constrained by the network transport bottleneck. Moving data from S3 to Redshift, EMR, or EC2 is trivial compared with moving it between and among AWS regions, or across a WAN/VPN connection to a local (on-premises) repository.
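The arithmetic behind this frugality is simple. In the illustrative Python sketch below (the row counts are invented, not benchmarks), aggregating in situ shrinks the set that has to cross the expensive boundary by three orders of magnitude:

```python
# Why in-situ processing wins: compare the rows that would cross the
# network under "move first, process later" versus "process where the
# data lives, move the derived set." All figures are illustrative.

# 100,000 hypothetical raw event rows spread across 100 users
raw_rows = [{"user_id": i % 100, "bytes": 1} for i in range(100_000)]

# Process in situ: aggregate where the data lives...
derived = {}
for row in raw_rows:
    derived[row["user_id"]] = derived.get(row["user_id"], 0) + row["bytes"]

# ...then ship only the derived set across the boundary.
rows_moved_naively = len(raw_rows)  # classic extract-first: 100,000 rows
rows_moved_in_situ = len(derived)   # derived set: 100 rows
print(rows_moved_naively, rows_moved_in_situ)
```

Whether the expensive boundary is a WAN link, an AWS region border, or a hop out of the cluster, the logic is the same: move the answer, not the evidence.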
In a paradigm in which processing is collocated with storage, the classical ETL model no longer makes sense. Or, rather, it doesn't make as much sense as it did decades ago, when ETL's enabling technologies and techniques were first developed.
ETL was designed to address two specific practical and technological constraints. The first was the problem of integrating data from multiple source systems, i.e., getting the right data to the right place at the right time. The second was the challenge of engineering data (often in non-trivial ways) in an era of sparse computing capacity. Like any good engineering solution, ETL was a compromise.
As part of that compromise, ETL was allotted -- at least in theory -- its own topologically discrete middle tier -- a place in which to land, stage, and process data, prior to moving it into still another landing area at the warehouse, or undergoing additional ETL processing. Over time, first ETL and then DI evolved into a kind of separate discipline. Then a sort of institutional forgetting happened: the DI middle tier never completely went away, and DI was sometimes treated as an end unto itself.
Thanks to the economics of cloud and big data, that's changing, and that's a great thing.
Does this mean that ETL as a standalone product category will simply go away? No, not on your life. The focus and practice of DI (and with it ETL) will shift to the site of data: to the data lake, data refinery, data sink, data-what-have-you. This has already happened. Prominent ETL vendors have been out in front of Hadoop, Spark, and other big data platforms, either porting their own engines to run on -- or exploit -- these platforms, or trumpeting their ability to move data in and out of Hadoop.
Keep in mind, too, that an ETL or data integration tool isn't just a pipeline processing engine and that (consistent with this) most established ETL tools offer a wide variety of connectors or adapters that support getting data out of (and, sometimes, loading it into) operational data sources. Finally, big data platforms such as Hadoop and Cassandra are comparatively impoverished data management platforms, at least relative to the RDBMS. They lack critical amenities (metadata management, data lineage tracking) that are taken for granted by traditional data management.
Smart DI vendors have repositioned their products as combined data integration and data management offerings for big data. Call it big data management.
That's just what Informatica Corp. did.
There will continue to be a place for ETL, be it in the form of the standalone ETL tool or (less commonly) the vestigial ETL middle tier. This vestigial tier will no longer be an assumed requirement, however. Increasingly, the emerging model prescribes a single repository for all business information -- namely, a massive storage sink, be it Hadoop, Cassandra, or Spark (running over a distributed file system), or, for that matter, a cloud storage service such as S3 -- and emphasizes the movement of smaller, derived data sets from that repository to its constituent feeder systems.
In this scheme, there's no room or space for an extraneous tier or stage.
Although there are other conceivable schemes, there are few possible permutations in which ETL continues to enjoy the same outsized prominence of the last 20 years. Physics and economics militate against this. So, too, does the fast pace of commoditization. Flash back to the database market of the 1990s, when discrete product categories -- from third-party defrag and reorg tools to performance monitoring tools to replication and backup tools -- abounded.
Over time, the big database vendors (like the big operating system vendors) incorporated most of these features into their products. The same thing has already happened to a significant extent in the ETL space -- e.g., in the late 90s, (R)DBMS vendors started offering ETL capabilities in their products. Over time, these features morphed (in the cases of Microsoft and Oracle, at least) into full-fledged ETL tools -- and this will likely happen all over again in the still-coalescing big data DI space.
Hadoop has the makings of a logical (if not exactly ideal) platform for data storage, data management, and (especially when used with Spark SQL, Presto, or other engines) data preparation. Some vendors explicitly position the Hadoop platform as a one-stop shop for data integration, preparation, and analysis. (Cloudera's Enterprise Data Hub is the most ambitious of these visions.) As the cloud and big data markets evolve and vendors work to flesh out the data management feature sets of AWS, Google's Cloud Platform, Hadoop, Cassandra, Spark, and other services, platforms, and frameworks, expect to see the focus of DI shift accordingly.