
Syncsort’s New Data Integration Solutions Provide a Smarter Approach to Hadoop ETL

Two new Hadoop offerings and DMX innovations bring benefits of better ETL through Hadoop and better Hadoop with enhanced ETL.

Syncsort, a big data integration solutions provider, has released two new Hadoop products and breakthrough enhancements to DMX that turn Hadoop into a more robust, feature-rich, and easy-to-use ETL solution.

Big data is prompting organizations to look at Hadoop to process more data in less time and for less money, but Hadoop is not yet a complete ETL solution. Syncsort’s two new offerings for Hadoop -- DMX-h ETL Edition and DMX-h Sort Edition -- are designed to strengthen Hadoop by providing the full functionality required to deliver enterprise ETL capabilities. According to the company, they provide greater ease of use and higher node performance than non-native, code-generating ETL tools. In addition, performance and connectivity enhancements to DMX expand usage by end users and partners.

The new DMX-h solutions take advantage of Syncsort’s recent contribution to Apache Hadoop, which provides native integration to deliver best-in-class data integration capabilities and sort acceleration for Apache Hadoop distributions.

Highlights of DMX-h ETL Edition include:

  • Smarter architecture: DMX-h has an ETL engine that runs natively within MapReduce, maximizing node performance

  • Smarter development: Developers can leverage an easy-to-use Windows GUI and deploy seamlessly into Hadoop

  • Smarter productivity: Use-case accelerators -- a library of pre-built templates -- help developers fast-track Hadoop ETL implementations

  • Smarter connectivity: Extends access to and delivery of all data, including from the mainframe

  • Smarter economics: Smarter architecture, development, connectivity, and productivity combine to help drive results in less time and at a fraction of the cost of other solutions

Benchmark Results

Recent Syncsort benchmarks show significant Hadoop performance and resource efficiency improvements when using DMX-h. The results show predictable and sustainable throughput even as data volumes grow. Using the TeraSort benchmark, DMX-h Sort Edition achieved a sustainable throughput of over 100 megabytes per second per node (MB/S/N), delivering upwards of 2x the throughput per node of Hadoop's native sort at 45 MB/S/N. Similarly, DMX-h ETL Edition achieved sustainable throughput in excess of 255 MB/S/N, up to 2.5x faster than Pig, when aggregating 2TB of Web log data.

In both cases, tests were run for data volumes ranging from 500GB to 2TB. Although alternatives such as Hadoop's native sort and Pig reach a saturation point -- where throughput starts to decline -- at around 500GB of data, DMX-h delivered sustainable and predictable performance from 500GB to 2TB. This matters to organizations because they can size their Hadoop infrastructure more efficiently, minimize uncertainty, and achieve a more predictable cost structure as big data becomes even bigger.
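
To make the sizing point concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of Syncsort's benchmark: it simply takes the two per-node sort throughputs quoted above and estimates elapsed time for a 2TB job. The 10-node cluster, the assumption that throughput scales linearly with node count and stays constant across data volumes, the decimal conversion (1TB = 1,000,000 MB), and the helper name `estimated_seconds` are all assumptions made for illustration.

```python
# Illustrative sizing arithmetic only. The per-node throughputs are the figures
# quoted in the benchmark; the 10-node cluster, linear scaling, and constant
# throughput across data volumes are assumptions.

MB_PER_TB = 1_000_000  # decimal units (assumed)


def estimated_seconds(data_tb: float, mb_per_sec_per_node: float, nodes: int) -> float:
    """Estimate elapsed seconds, assuming throughput scales linearly with nodes."""
    if mb_per_sec_per_node <= 0 or nodes <= 0:
        raise ValueError("throughput and node count must be positive")
    return data_tb * MB_PER_TB / (mb_per_sec_per_node * nodes)


DMX_H_SORT = 100.0   # MB/s per node, quoted for DMX-h Sort Edition ("over 100")
NATIVE_SORT = 45.0   # MB/s per node, quoted for Hadoop's native sort
NODES = 10           # hypothetical cluster size

dmx_minutes = estimated_seconds(2, DMX_H_SORT, NODES) / 60
native_minutes = estimated_seconds(2, NATIVE_SORT, NODES) / 60
speedup = DMX_H_SORT / NATIVE_SORT

print(f"DMX-h Sort Edition: {dmx_minutes:.1f} min")
print(f"Native sort:        {native_minutes:.1f} min")
print(f"Throughput ratio:   {speedup:.2f}x")
```

Under these assumptions, the ratio of the quoted throughputs (100 / 45) is consistent with the "upwards of 2x" claim. Real clusters rarely scale perfectly linearly, so the elapsed times should be read as rough estimates only.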

Users can download a free test drive that contains everything required, with no need to set up their own Hadoop cluster. It includes a Linux virtual machine with Cloudera CDH 4.2 and DMX-h ETL Edition pre-installed, along with use-case accelerators and sample data.

More information is available at www.syncsort.com.
