3 Steps to Free Yourself from the ETL Burden
Even in an era of big data, ETL is here to stay. However, we must recognize that Hadoop is forever changing the approach and the economics of data integration.
By Jorge Lopez, Director of Product Marketing, Syncsort
Much has been debated about the future of ETL in an increasingly Hadoop-focused world. There's never a shortage of voices claiming the end of ETL or, even worse, promising to set you free from architecting a solid ETL infrastructure. I'm sorry to break the bad news, but ETL is not going away. Not today, not tomorrow. As long as organizations need to leverage data from multiple sources, we can be sure ETL will continue to exist.
Although it's certainly true that ETL is here to stay, we must recognize that Hadoop is forever changing the approach and the economics of data integration. In fact, many organizations are already shifting core ETL workloads to Hadoop, and it is only natural. For years, many of these organizations have struggled with the cost and processing limitations of using their enterprise data warehouse (EDW) for data integration. As a colleague and very good friend of mine likes to point out, staging areas, once considered a best practice, have become the "dirty secret" of every data warehouse environment -- one that consumes the lion's share of time, money, and effort.
In fact, just a few weeks ago, the CEO of a leading data warehousing company acknowledged that ETL consumes 20 to 40 percent of EDW workloads ("with some outliers below and above average"). I've found plenty of outliers where ETL drives 50 to 80 percent of total data warehouse spending.
The good news is that today, for the first time, Hadoop is presenting us with a realistic and cost-effective alternative. With inexpensive storage, high reliability, and massive scalability, Hadoop can become the ideal staging area for all of an enterprise's data.
How can your enterprise ensure you stay ahead of the curve? Here are three specific steps to help you get started on freeing your EDW from the ETL burden.
Step 1: Understand objectives and benefits
You guessed it: the EDW is not going away. You still need it for those fast, interactive user queries, for speed-of-thought analytics, and for business intelligence. Your goal is not to get rid of it but to give it a break by shifting heavy ETL workloads to Hadoop. What's in it for you? Deferred database costs, additional database capacity, less contention between BI and ETL workloads, and faster database user queries, among other benefits. More importantly, cost savings alone will allow you to justify the investment in Hadoop and build up your organization's Hadoop skills. Look at it as your Trojan Elephant.
Step 2: Start by targeting the top 20 percent
Most ETL workloads follow this pattern -- 20 percent of the transformations cause 80 percent of the troubles. These are long-running queries and queries that consume relatively high CPU and I/O on your data warehouse. The idea is to focus on the low-hanging fruit: the transformations that will give you the biggest resource and elapsed-time savings. Most RDBMSs provide comprehensive logging and reporting capabilities -- such as DBQL for Teradata -- where you can identify these types of queries. Using your favorite BI tool, you can easily create flashy dashboards and interactive visualizations to analyze these logs and make your life easier. (I was actually part of the team that created a similar tool for a popular EDW with my previous employer.)
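To make the 80/20 triage concrete, here is a minimal sketch of ranking exported query-log rows by total CPU time to surface the top 20 percent of offenders. The column names, query names, and figures are invented for illustration -- your actual DBQL (or equivalent) export will differ:

```python
from collections import defaultdict

def top_offenders(log_rows, top_fraction=0.2):
    """Sum CPU seconds per query name and return the top fraction, heaviest first."""
    totals = defaultdict(float)
    for row in log_rows:
        totals[row["query_name"]] += float(row["cpu_seconds"])
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))  # keep at least one
    return ranked[:cutoff]

# Hypothetical rows, as they might come out of a DBQL-style log export
rows = [
    {"query_name": "load_dim_customer",  "cpu_seconds": "5400"},
    {"query_name": "cdc_merge_orders",   "cpu_seconds": "9200"},
    {"query_name": "daily_sales_report", "cpu_seconds": "300"},
    {"query_name": "rank_web_sessions",  "cpu_seconds": "7100"},
    {"query_name": "bi_dashboard_feed",  "cpu_seconds": "450"},
]

heaviest = top_offenders(rows)  # → [('cdc_merge_orders', 9200.0)]
```

The same ranking over elapsed time or I/O counts works identically; these are the candidates worth moving to Hadoop first.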
In general, you want to target queries that include change data capture (CDC), slowly changing dimensions, ranking functions, lots of volatile tables, multiple merges, large joins, cursors, and unions. Any data transformations involving flat files or sequential and semi-structured data, such as Web logs and clickstream analysis, are also good candidates.
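To see why CDC in particular is such a heavy in-warehouse workload, consider this toy sketch of change data capture as a snapshot diff: every run compares a full current snapshot against the previous one to classify inserts, updates, and deletes. The tables and keys below are invented for illustration:

```python
def capture_changes(previous, current):
    """Naive CDC: diff two snapshots keyed by primary key.

    This full-table comparison is exactly the kind of large-join,
    high-I/O work that competes with BI queries inside an EDW.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

# Hypothetical customer snapshots from two consecutive days
yesterday = {1: ("alice", "NY"), 2: ("bob", "SF"), 3: ("carol", "LA")}
today     = {1: ("alice", "NY"), 2: ("bob", "TX"), 4: ("dave", "DC")}

ins, upd, dels = capture_changes(yesterday, today)
# ins  → {4: ('dave', 'DC')}   upd → {2: ('bob', 'TX')}   dels → {3: ('carol', 'LA')}
```

At warehouse scale, both snapshots are large tables and the diff becomes a massive join -- precisely the work Hadoop's batch processing handles cheaply.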
Step 3: Make it enterprise-ready
This step might seem obvious but it's often overlooked. As much as we love the yellow elephant, Hadoop is not a complete ETL solution. Without the right approach, you're setting yourself up for lots of disparate tools -- each to accomplish a very specific task -- such as Sqoop for loading database tables, Flume for ingesting logs, HiveQL and Pig for developing data transformations, Java, and maybe even some C#. This can impose some severe adoption barriers including finding the right talent, getting productive quickly, and training existing staff. You also need to think about security, monitoring, and administration.
A Final Word
The good news is that vendors -- including the major Hadoop distributions as well as many players in the big data ecosystem -- are quickly closing these gaps, making Hadoop enterprise-ready. Although some organizations might have the skills and resources to go "solo" open source, the majority of businesses will find that a more balanced approach -- open source complemented by commercial tools and enterprise-level support -- will help them lower the barriers to Hadoop adoption and achieve much-needed scalability without compromising on cost or reliability.
Any claim that you can get rid of ETL works like a charm as an attention-getter: it attracts both supporters and detractors, and Hadoop has once again fueled this debate. Nevertheless, I would argue that nearly all organizations with a Hadoop initiative need to ingest data from one or more sources, process the data in Hadoop -- sort, aggregate, join -- and then distribute it by either loading it into an EDW or presenting insights via reports, interactive visualizations, or dashboards. To me, that looks a lot like ETL -- even if many of the developers working with Hadoop don't know the term. Then again, that might just mean I've been in this industry long enough.
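The ingest-process-distribute cycle described above is the extract-transform-load pattern by another name. A toy Python sketch with invented data (in practice the extract step would be Sqoop or Flume, and the load step a warehouse bulk loader):

```python
def extract(source):
    # Ingest: stand-in for reading from a database, log files, or HDFS.
    return list(source)

def transform(orders, customers):
    # Process: join orders to customers, then aggregate revenue per region --
    # the sort/aggregate/join work Hadoop jobs typically perform.
    region_by_customer = {c["id"]: c["region"] for c in customers}
    totals = {}
    for order in orders:
        region = region_by_customer[order["customer_id"]]
        totals[region] = totals.get(region, 0) + order["amount"]
    return totals

def load(totals, target):
    # Distribute: stand-in for loading into an EDW or feeding a dashboard.
    target.update(totals)

# Hypothetical source data
customers = [{"id": 1, "region": "east"}, {"id": 2, "region": "west"}]
orders = [{"customer_id": 1, "amount": 100},
          {"customer_id": 2, "amount": 50},
          {"customer_id": 1, "amount": 25}]

warehouse = {}
load(transform(extract(orders), extract(customers)), warehouse)
# warehouse → {'east': 125, 'west': 50}
```

Whatever the tooling, the three stages remain -- which is the author's point: the work is still ETL.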
Jorge A. Lopez is the director of product marketing at Syncsort. You can contact the author at [email protected].