3 Steps to Free Yourself from the ETL Burden
Even in an era of big data, ETL is here to stay. However, we must recognize that Hadoop is forever changing the approach and the economics of data integration.
By Jorge Lopez, Director of Product Marketing, Syncsort
Much has been debated about the future of ETL in an increasingly Hadoop-focused world. There's never a shortage of voices claiming the end of ETL or, even worse, promising to set you free from architecting a solid ETL infrastructure. I'm sorry to break the bad news, but ETL is not going away. Not today, not tomorrow. As long as organizations need to leverage data from multiple sources, we can be sure ETL will continue to exist.
Although it's certainly true that ETL is here to stay, we must recognize that Hadoop is forever changing the approach and the economics of data integration. In fact, many organizations are already shifting core ETL workloads to Hadoop, and it is only natural. For years, many of these organizations have struggled with the cost and processing limitations of using their enterprise data warehouse (EDW) for data integration. As a colleague and very good friend of mine likes to point out, staging areas, once considered a best practice, have become the "dirty secret" of every data warehouse environment -- one that consumes the lion's share of time, money, and effort.
In fact, just a few weeks ago, the CEO of a leading data warehousing company acknowledged that ETL consumes 20 to 40 percent of EDW workloads ("with some outliers below and above average"). I've found plenty of outliers where ETL drives 50 to 80 percent of total data warehouse spending.
The good news is that today, for the first time, Hadoop is presenting us with a realistic and cost-effective alternative. With inexpensive storage, high reliability, and massive scalability, Hadoop can become the ideal staging area for all of an enterprise's data.
How can your enterprise ensure you stay ahead of the curve? Here are three specific steps to help you get started on freeing your EDW from the ETL burden.
Step 1: Understand objectives and benefits
You guessed it: the EDW is not going away. You still need it for those fast, interactive user queries, for speed-of-thought analytics, and for business intelligence. Your goal is not to get rid of it but to give it a break by shifting heavy ETL workloads to Hadoop. What's in it for you? Deferred database costs, additional database capacity, less contention between BI and ETL workloads, and faster database user queries, among other benefits. More importantly, cost savings alone will allow you to justify the investment in Hadoop and build up your organization's Hadoop skills. Look at it as your Trojan Elephant.
Step 2: Start by targeting the top 20 percent
Most ETL workloads follow this pattern -- 20 percent of the transformations cause 80 percent of the troubles. These are long-running queries and queries that consume relatively high CPU and I/O on your data warehouse. The idea is to focus on the low-hanging fruit: the transformations that will give you the biggest resource and elapsed-time savings. Most RDBMSs provide comprehensive logging and reporting capabilities -- such as DBQL for Teradata -- where you can identify these types of queries. Using your favorite BI tool, you can easily create flashy dashboards and interactive visualizations to analyze these logs and make your life easier. (I was actually part of the team that created a similar tool for a popular EDW with my previous employer.)
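To make the 80/20 triage concrete, here is a minimal sketch of ranking exported query-log rows by total CPU time to surface the top 20 percent of offenders. The column names, query names, and figures are invented for illustration -- your actual DBQL (or equivalent) export will differ:

```python
from collections import defaultdict

def top_offenders(log_rows, top_fraction=0.2):
    """Sum CPU seconds per query name and return the top fraction, heaviest first."""
    totals = defaultdict(float)
    for row in log_rows:
        totals[row["query_name"]] += float(row["cpu_seconds"])
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))  # keep at least one
    return ranked[:cutoff]

# Hypothetical rows, as they might come out of a DBQL-style log export
rows = [
    {"query_name": "load_dim_customer",  "cpu_seconds": "5400"},
    {"query_name": "cdc_merge_orders",   "cpu_seconds": "9200"},
    {"query_name": "daily_sales_report", "cpu_seconds": "300"},
    {"query_name": "rank_web_sessions",  "cpu_seconds": "7100"},
    {"query_name": "bi_dashboard_feed",  "cpu_seconds": "450"},
]

heaviest = top_offenders(rows)  # → [('cdc_merge_orders', 9200.0)]
```

The same ranking over elapsed time or I/O counts works identically; these are the candidates worth moving to Hadoop first.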
In general, you want to target queries that include change data capture (CDC), slowly changing dimensions, ranking functions, lots of volatile tables, multiple merges, large joins, cursors, and unions. Any data transformations involving flat files or sequential and semi-structured data, such as Web logs and clickstream analysis, are also good candidates.
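To see why CDC in particular is such a heavy in-warehouse workload, consider this toy sketch of change data capture as a snapshot diff: every run compares a full current snapshot against the previous one to classify inserts, updates, and deletes. The tables and keys below are invented for illustration:

```python
def capture_changes(previous, current):
    """Naive CDC: diff two snapshots keyed by primary key.

    This full-table comparison is exactly the kind of large-join,
    high-I/O work that competes with BI queries inside an EDW.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

# Hypothetical customer snapshots from two consecutive days
yesterday = {1: ("alice", "NY"), 2: ("bob", "SF"), 3: ("carol", "LA")}
today     = {1: ("alice", "NY"), 2: ("bob", "TX"), 4: ("dave", "DC")}

ins, upd, dels = capture_changes(yesterday, today)
# ins  → {4: ('dave', 'DC')}   upd → {2: ('bob', 'TX')}   dels → {3: ('carol', 'LA')}
```

At warehouse scale, both snapshots are large tables and the diff becomes a massive join -- precisely the work Hadoop's batch processing handles cheaply.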
Step 3: Make it enterprise-ready
This step might seem obvious but it's often overlooked. As much as we love the yellow elephant, Hadoop is not a complete ETL solution. Without the right approach, you're setting yourself up for lots of disparate tools -- each to accomplish a very specific task -- such as Sqoop for loading database tables, Flume for ingesting logs, HiveQL and Pig for developing data transformations, Java, and maybe even some C#. This can impose some severe adoption barriers including finding the right talent, getting productive quickly, and training existing staff. You also need to think about security, monitoring, and administration.
A Final Word
The good news is that vendors -- including the major Hadoop distributions as well as many players in the big data ecosystem -- are quickly closing these gaps, making Hadoop enterprise-ready. Although some organizations might have the skills and resources to go "solo" open source, the majority of businesses will find that a more balanced approach -- open source complemented by commercial tools and enterprise-level support -- will help them lower the barriers to Hadoop adoption and achieve much-needed scalability without compromising on cost or reliability.
Any claim that you can get rid of ETL works like a charm as an attention-getter: it attracts both supporters and detractors, and Hadoop has once again fueled this debate. Nevertheless, I would argue that nearly all organizations with a Hadoop initiative need to ingest data from one or more sources, process the data in Hadoop -- sort, aggregate, join -- and then distribute it by either loading it into an EDW or presenting insights via reports, interactive visualizations, or dashboards. To me, that looks a lot like ETL -- even if many of the developers working with Hadoop don't know the term. Then again, that might just mean I've been in this industry long enough.
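The ingest-process-distribute cycle described above is the extract-transform-load pattern by another name. A toy Python sketch with invented data (in practice the extract step would be Sqoop or Flume, and the load step a warehouse bulk loader):

```python
def extract(source):
    # Ingest: stand-in for reading from a database, log files, or HDFS.
    return list(source)

def transform(orders, customers):
    # Process: join orders to customers, then aggregate revenue per region --
    # the sort/aggregate/join work Hadoop jobs typically perform.
    region_by_customer = {c["id"]: c["region"] for c in customers}
    totals = {}
    for order in orders:
        region = region_by_customer[order["customer_id"]]
        totals[region] = totals.get(region, 0) + order["amount"]
    return totals

def load(totals, target):
    # Distribute: stand-in for loading into an EDW or feeding a dashboard.
    target.update(totals)

# Hypothetical source data
customers = [{"id": 1, "region": "east"}, {"id": 2, "region": "west"}]
orders = [{"customer_id": 1, "amount": 100},
          {"customer_id": 2, "amount": 50},
          {"customer_id": 1, "amount": 25}]

warehouse = {}
load(transform(extract(orders), extract(customers)), warehouse)
# warehouse → {'east': 125, 'west': 50}
```

Whatever the tooling, the three stages remain -- which is the author's point: the work is still ETL.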
Jorge A. Lopez is the director of product marketing at Syncsort. You can contact the author at [email protected].