
The Logic of Disintermediation and the End of ETL as We’ve Known It

Big data platforms such as Hadoop, Cassandra, and Spark are usurping the central role of the ETL engine that is at the heart of data integration.

As data consolidation and data preparation are shifting to big data platforms, so are the tools and techniques of data integration (DI).

The most significant change is that big data platforms such as Hadoop, Cassandra, and Spark are usurping the role of the ETL engine that's traditionally been at the core of DI. The term “ETL” -- extract, transform, load -- is something of a misnomer in this context, however.

We're used to thinking of ETL as a technology for engineering data, but its most important function is arguably data movement. ETL describes a technique for extracting and moving information from upstream source systems to downstream target systems. Data movement can be mostly frictionless (as with an in-flight ETL engine, which performs data transformation) or quite the opposite, as with the use of an ETL staging area: a place where data is landed, checked for consistency, cleansed, and transformed prior to being moved again into a destination system.
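
To make the pattern concrete, here is a minimal, hypothetical Python sketch of staged ETL (the table names and cleansing rule are invented for illustration). The point to notice is that the same rows are moved twice: once from the source into the staging area, and again from the staging area into the target.

```python
# Minimal sketch of classical staged ETL: data is moved twice --
# once into an interstitial staging area, then again into the target.
# All names and rules here are hypothetical.
import sqlite3

source = sqlite3.connect(":memory:")   # stand-in for an upstream OLTP system
target = sqlite3.connect(":memory:")   # stand-in for the warehouse

source.execute("CREATE TABLE orders (id INTEGER, amount TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, "10.50"), (2, None), (3, "7.25")])

# 1. Extract: pull rows out of the source system.
rows = source.execute("SELECT id, amount FROM orders").fetchall()

# 2. Stage + transform: land the rows, check consistency, cleanse,
#    and convert types before they move again.
staging = [(i, float(a)) for i, a in rows if a is not None]

# 3. Load: move the cleansed rows (a second time) into the destination.
target.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL)")
target.executemany("INSERT INTO fact_orders VALUES (?, ?)", staging)
target.commit()

print(target.execute("SELECT COUNT(*) FROM fact_orders").fetchone())  # (2,)
```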

As a technique for moving data, classical ETL is outmoded, however. It's outmoded because it assumes a reality -- that of stateful connectivity between and among discrete physical systems -- that's been superseded by events. It's outmoded because it accords space to an interstitial tier or staging area in which data can be landed and processed -- prior to being moved once again. It's outmoded, more precisely, because it is profligate, not frugal, with data movement. In the physics of the big data universe, such profligacy is untenable: it's wasteful and illogical.

In a data lake architecture, for example, data “movement” involves only the logical (or at least local, i.e., in-cluster) “extraction” and “loading” of data. Think of this as analogous to landing data in a scratch table of an RDBMS before “loading” it into the warehouse proper. In both cases, “movement,” as such, is either logical or local.

Imagine a Hadoop-based data lake. When you “move” data from Hive -- which, in combination with Hadoop's Tez framework, is a capable engine for processing very large data sets -- to Spark (for query processing via the Spark SQL library, for analysis using Spark Streaming, or for other purposes), what you're actually doing is a kind of TEL: transforming data in situ, extracting a derived data set, and loading it into the Spark environment.
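
A rough sketch of that TEL pattern in PySpark follows; the Hive table name and columns are hypothetical. The aggregation runs in situ against the table where the data already lives, and only the smaller, derived result is "loaded" into the Spark environment for further work.

```python
# Hypothetical sketch of "TEL" inside a Hadoop-based data lake.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tel-sketch")
         .enableHiveSupport()      # lets Spark SQL see the Hive metastore
         .getOrCreate())

# "Transform" in situ: the aggregation runs against the Hive table
# where the data already lives; nothing leaves the cluster.
derived = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM sales.transactions          -- hypothetical Hive table
    GROUP BY customer_id
""")

# "Extract"/"load": keep the much smaller derived data set in Spark
# for downstream query processing or analysis.
derived.cache()
derived.createOrReplaceTempView("customer_spend")
spark.sql("SELECT COUNT(*) FROM customer_spend").show()
```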

Spark SQL can persist data into and process/query against a myriad of formats, from columnar Parquet and ORC files to JSON files, Avro files, Hive tables, and so on. Best of all, because Spark SQL can query against Hive tables, you might not be moving the data at all.
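
A short sketch of that flexibility, again with hypothetical paths and table names: Spark SQL can read a Hive table in place and persist or consume derived results in several formats without a separate movement step.

```python
# Sketch (hypothetical paths and table names): Spark SQL queries Hive
# tables in place and reads/writes several formats natively.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Query a Hive table without copying it anywhere first.
events = spark.table("logs.web_events")

# Persist a derived result in columnar form ...
daily = events.groupBy("event_date").count()
daily.write.mode("overwrite").parquet("/data/derived/daily_counts")
daily.write.mode("overwrite").orc("/data/derived/daily_counts_orc")

# ... and read other formats just as easily.
json_df = spark.read.json("/data/raw/clickstream.json")
parquet_df = spark.read.parquet("/data/derived/daily_counts")
```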

Consider another, even more intriguing scenario: that of the cloud-based data lake/storage sink. A number of organizations are using cheap cloud storage, such as Amazon's Simple Storage Service (S3), as all-purpose data sinks/storage vaults. S3 is used as a persistence layer for strictly structured relational data, polystructured formats (such as JSON objects, multimedia content, and other kinds of objects), and semi-structured sources such as text files and event/application messages. (Message traffic can be encapsulated in JSON and serialized in Avro, among other formats, so structural distinctions on the basis of file containers aren't all that helpful.)

Hadoop combines a scalable, distributed storage layer with a baked-in, general-purpose parallel processing facility. Amazon S3, by contrast, is a storage-only layer. It doesn't have a built-in parallel processing facility. Amazon Web Services (AWS) does -- its Elastic MapReduce (EMR) service. (It's also possible to spin up Hive/Tez, Spark, and other engines in Amazon's elastic compute cloud, or EC2.)
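
The division of labor looks something like the hedged boto3 sketch below: S3 holds the data, and compute is attached on demand by standing up a transient EMR cluster next to it. The bucket, roles, instance sizes, and release label are placeholders, not recommendations.

```python
# Hedged sketch: storage (S3) and processing (EMR) are separate services;
# compute is provisioned alongside the data when it's needed.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="transient-spark-cluster",
    ReleaseLabel="emr-6.9.0",                      # placeholder release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,      # tear down when idle
    },
    LogUri="s3://example-bucket/emr-logs/",        # placeholder bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```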

When we talk about “moving” data in AWS, we mean something like ETL. The difference is that “extracting” data from S3 involves a logical movement: a change of virtualized context rather than a hop between discrete physical systems, even if physical machines are still doing the work under the covers.

In the multi-tenant cloud, however, there's no 1:1 mapping of system-to-hardware -- or, for that matter, of system-to-rack. Instead, there's virtual abstraction, with several operating system instances/nodes cohabiting on a single physical system -- sharing pooled memory, storage, and network resources. Data “moves,” to be sure, but not like it does in the classical ETL paradigm.

The physics of moving data in the realms of big data and the cloud-based data lake/storage sink are remarkably similar. The first priority is to minimize data movement by processing data in situ, i.e., in the context in which it physically lives. In big data, this involves using Hive, Spark SQL, Presto, or other SQL interpreters to produce smaller, derived data sets. (In the context of AWS, this could entail processing data in Redshift or spinning up Hadoop Hive-Tez or Spark instances in EC2.)

In conjunction with S3 and other cloud storage services, the priority is to minimize movement outside of or away from the service context -- in S3's case, that means minimizing how much data is moved outside of the AWS region. Data movement between and among contexts (or AWS regions) is severely constrained by the network transport bottleneck. Moving data from S3 to Redshift, EMR, or EC2 is trivial compared with moving it between and among AWS regions, or across a WAN/VPN connection to a local (on-premises) repository.
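
For instance, loading a derived Parquet data set from S3 into Redshift within the same region amounts to a short, local hop. The sketch below is illustrative only: the cluster endpoint, credentials, bucket, table, and IAM role are all placeholders.

```python
# Hedged sketch: an in-region S3-to-Redshift load via Redshift's COPY.
import os
import psycopg2  # assumed driver; any Postgres-compatible client works

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="analytics",
    user="etl_user",
    password=os.environ.get("REDSHIFT_PASSWORD", ""),
)

# COPY pulls the Parquet files straight from S3 into Redshift; within a
# single region this is a local transfer rather than a WAN/VPN crossing.
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY analytics.daily_counts
        FROM 's3://example-bucket/derived/daily_counts/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET;
    """)
conn.close()
```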

In a paradigm in which processing is collocated with storage, the classical ETL model no longer makes sense. Or, rather, it no longer makes as much sense as it did decades ago, when ETL's enabling technologies and techniques were first developed.

ETL was designed to address two specific practical and technological constraints. The first was the problem of integrating data from multiple source systems, i.e., getting the right data to the right place at the right time. The second was the challenge of engineering data (often in non-trivial ways) in an era of sparse computing capacity. Like any good engineering solution, ETL was a compromise.

As part of that compromise, ETL was allotted -- at least in theory -- its own topologically discrete middle tier -- a place in which to land, stage, and process data, prior to moving it into still another landing area at the warehouse, or undergoing additional ETL processing. Over time, first ETL and then DI evolved into a kind of separate discipline. Then a sort of institutional forgetting happened: the DI middle tier never completely went away, and DI was sometimes treated as an end unto itself.

Thanks to the economics of cloud and big data, that's changing, and that's a great thing.

Does this mean that ETL as a standalone product category will simply go away? No, not on your life. The focus and practice of DI (and with it ETL) will shift to the site of data: to the data lake, data refinery, data sink, data-what-have-you. This has already happened. Prominent ETL vendors have been out in front of Hadoop, Spark, and other big data platforms, either porting their engines to use (or to run in the context of) these platforms or trumpeting their ability to move data into and out of Hadoop.

Keep in mind, too, that an ETL or data integration tool isn't just a pipeline processing engine: most established ETL tools also offer a wide variety of connectors or adapters for getting data out of (and, sometimes, loading it into) operational data sources. Finally, big data platforms such as Hadoop and Cassandra are comparatively impoverished data management platforms, at least relative to the RDBMS. They lack critical amenities (metadata management, data lineage tracking) that traditional data management takes for granted.

Smart DI vendors have repositioned their products as combined data integration and data management offerings for big data. Call it big data management.

That's just what Informatica Corp. did.

There will continue to be a place for ETL, be it in the form of the standalone ETL tool or (less commonly) the vestigial ETL middle tier. This vestigial tier will no longer be an assumed requirement, however. Increasingly, the emerging model prescribes a single repository for all business information -- namely, a massive storage sink, be it Hadoop, Cassandra, or Spark (running over a distributed file system), or, for that matter, a cloud storage service such as S3 -- and emphasizes the movement of smaller, derived data sets from that repository to its constituent feeder systems.

In this scheme, there's no room or space for an extraneous tier or stage.

Although there are other conceivable schemes, there are few possible permutations in which ETL continues to enjoy the outsized prominence it has had for the last 20 years. Physics and economics militate against this. So, too, does the fast pace of commoditization. Flash back to the database market of the 1990s, when discrete product categories -- from third-party defrag and reorg tools to performance monitoring tools to replication and backup tools -- abounded.

Over time, the big database vendors (like the big operating system vendors) incorporated most of these features into their products. The same thing has already happened to a significant extent in the ETL space -- e.g., in the late 90s, (R)DBMS vendors started offering ETL capabilities in their products. Over time, these features morphed (in the cases of Microsoft and Oracle, at least) into full-fledged ETL tools -- and this will likely happen all over again in the still-coalescing big data DI space.

Hadoop has the makings of a logical (if not exactly ideal) platform for data storage, data management, and (especially when used with Spark SQL, Presto, or other engines) data preparation. Some vendors explicitly position the Hadoop platform as a one-stop shop for data integration, preparation, and analysis. (Cloudera's Enterprise Data Hub is the most ambitious of these visions.) As the cloud and big data markets evolve and vendors work to flesh out the data management feature sets of AWS, Google's Cloud Platform, Hadoop, Cassandra, Spark, and other services, platforms, and frameworks, expect to see the focus of DI shift accordingly.
