
Greenplum Touts Speedy, Parallel Data Warehouse Bulk Loading

Greenplum says its new Scatter/Gather Streaming capability can meaningfully accelerate loading times

In a highly competitive data warehousing (DW) segment, differentiation matters.

Differentiation, of course, is in the eye of the beholder, but Greenplum Inc. this week touted what it claims is best-in-class differentiation in the form of a new high-speed bulk-loading option for its DW software.

Bulk loading is a big issue for some customers, officials maintain, and -- for those shops that need it -- Greenplum says its new parallel loading feature can meaningfully accelerate loading times.

Ben Werther, Greenplum's director of product management, cites the example of one customer -- he declined to disclose its name -- that uses Scatter/Gather Streaming to achieve load speeds of up to 4 TB per hour. Werther calls Scatter/Gather a "drastic" improvement over the bulk-loading approach that's still used in many DW environments.

"The classical bulk-loading approach is where you have a command line program. It pushes the file … to one server on the database, or in some cases [it] can open a few [different] sessions and send it to a few sessions on that database, but you're usually limited to a single file on the source side. No one else that we're aware of can be pulling from multiple servers simultaneously and coordinate [those transfers] as one load and doing transactions at the same time," he argues.

Greenplum is one of several vendors that market massively parallel processing (MPP) DW technology. Scatter/Gather can be seen as a highly parallelized take on bulk loading.

"We take the source, whatever it is, and we're essentially breaking it up into chunks that are sent over to the database in streams. Those scattered pieces are going across all of those different streams concurrently. [They're] flowing in parallel fashion to the database, so this [data] may be coming form one source or many sources," Werther explains.

"In the gathering phase, we don't simply land it to disc and call it a day. Because a lot of customers want to support ELT or ETLT … we want to be able to do processing on the data on the way in, so we support arbitrary processing through SQL in parallel as the data flows in to those data servers. … The data is automatically partitioned and … flows across the network and lands where it needs to land."

Parallel loading isn't a new idea. Oracle Corp., for example, has long supported parallel loading in its flagship DBMS. Ditto for high-end data warehousing kingpin Teradata, which -- like Greenplum -- touts the ability to "perform continuous loading of transactional data" via its Parallel Transporter offering.

Parallel loading of the kind Greenplum touts with Scatter/Gather could prove to be a disruptive feature in the broader DW segment, particularly among vendors that market DW products based on open source DBMSes such as MySQL or PostgreSQL.

Jos van Dongen is a data warehousing consultant who works with open source software (OSS) analytic databases -- of which Greenplum (which runs on an MPP-ized version of PostgreSQL) is one example. He says loading can be a time-consuming process in many OSS environments. Van Dongen and other like-minded specialists are currently experimenting with different approaches to accelerate data loading (using sample sets of between 10 and 100 GB) in OSS data warehouses. It's proving to be a tough nut to crack.

"Our problem with this is two-fold: [a lack of] serious hardware resources -- let alone an MPP cluster -- and [a] lack of parallel capabilities in the software," van Dongen explained in an e-mail. He noted that few, if any, OSS data warehouse platforms -- e.g., Infobright's Community Edition, LucidDB, or MonetDB -- support parallel loading. For example, he says, "[T]wo weeks ago I got the results from a TPC-H SF100 benchmark on PostgreSQL using a single quad-core machine with 8GB of RAM and [two] 500GB RAID 0 disks. [It] took them over 10 hours to load the data, and about half the queries didn't respond within 10 hours or just crashed." In most cases, van Dongen indicates, OSS DW platforms can achieve load speeds of up to 20-30 GB an hour -- a far cry from the load performance of an Oracle or a Teradata.

At the same time, OSS platforms are considerably cheaper than offerings from Oracle, Teradata, or other proprietary vendors. "What would be interesting … is [to] add another item to the equation: cost. [What] do load times look like in [dollars] per TB [per] hour?" van Dongen continues.

"[I]n my case, … Infobright loaded 100 GB in a little less than 4.5 hours on a $2,500 machine. All software is free. This results in 22.2 GB-per-hour, or … $112,612 [per TB/hour]. If Greenplum can do 4.5 TB an hour, their system would have to cost less than $506,756 to be relatively faster than mine. Is it?"

A Greenplum representative told BI This Week: "Feedback from customers is that TPC-H numbers aren't meaningful for the kinds of workloads they are running. For this reason, none of the data warehouse appliance leaders -- Greenplum, Teradata, or Netezza -- publish[es] TPC-H benchmarks. Instead of artificial benchmarks, Greenplum works with customers to show them the performance it can bring to their unique workload. Cost-wise, Greenplum offers pricing as low as $20k per TB of usable data. This is significantly less than Oracle's pricing -- particularly when you include their RAC and parallel query options, and the need to purchase either SAN or Exadata servers."

The upshot, according to van Dongen, is that speedy load times come at a cost. He cites published TPC-H benchmarks that demonstrate that Oracle can load 1 TB of data in 1 hour and 35 minutes; the cost of doing so, however, works out to about $4 million per TB an hour. Other MPP data warehouse players can do so at less cost, van Dongen concludes -- Exasol can load 1 TB in 1 hour and 3 minutes (at roughly $1 million per TB/hour), and ParAccel can load a single terabyte in 41 minutes (at a cost of about $861,000 per TB/hour) -- but, relative to his own example, they're still extremely expensive.
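The same formula runs in reverse: a published load time plus a quoted dollars-per-TB-per-hour rate implies a system price. The figures below are back-of-the-envelope derivations from van Dongen's numbers, not published prices:

```python
# Back-solving van Dongen's figures: the system price implied by a
# load time and a quoted $-per-TB/hour rate. Derived estimates only.
def implied_system_cost(minutes_per_tb: float, usd_per_tb_hour: float) -> float:
    tb_per_hour = 60.0 / minutes_per_tb
    return usd_per_tb_hour * tb_per_hour

for name, mins, rate in [("Oracle", 95, 4_000_000),
                         ("Exasol", 63, 1_000_000),
                         ("ParAccel", 41, 861_000)]:
    print(f"{name}: ~${implied_system_cost(mins, rate):,.0f}")
# Oracle: ~$2,526,316   Exasol: ~$952,381   ParAccel: ~$1,260,000
```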

Of course, some customers are willing to pay a premium -- in some cases a significant premium -- to get speedy loading capabilities. "[T]his discussion is purely academic, though insightful: if you need to load large amounts of data in a very short time, there are no cheap alternatives," he says.

Randy Lea, vice-president of product and services marketing with Teradata, concurs -- to a degree.

Some of Teradata's customers load "as much as 50 TB per day in real production environments," according to Lea. For these customers, extreme loading is an important issue, but it isn't the most important one, he argues. "Depending on customer requirements, sometimes the load is [a fraction of a percent] of your system resources; sometimes it's 10 percent; sometimes it can be 50-60 percent if you have batch windows. What is the cost of that? You never really buy a system for a load."

On the other hand, Lea points out, extreme loading separates the enterprise data warehousing (EDW) haves from the EDW have-nots. "Make no mistake, you do have to have a parallel loading environment. If you don't have a parallel loading environment, it is tough to play in this space. It's just that the number of customers who actually have those extreme [loading] requirements is very small."

Greenplum officials say that Scatter/Gather Streaming isn't just a one-trick pony. In addition to its bulk-loading benefits, its implementation is both DBA- and developer-friendly, Werther maintains.

"[Another neat thing] is the way this is exposed to the DBAs and the developers. Traditionally, you'd have a command-line program that you'd run and pipe your file into it. We have that and we support it [via] a wrapper around this type of technology with something that we call gpload," he explains. "When you query a table, you're causing this whole Scatter/Gather Streaming technology to fire up and you pull from that source into the database and it's streamed into the database and appears to you like a table in the database. You can join against that … and pull it in in parallel, no matter where that data is -- on the subnet or on the Internet, half way around the world."

The point, Werther concludes, is that "it looks to the database like it's doing a parallel query. It looks like a scan that's an internal scan. Under the covers it's actually pulling data from all of these systems in parallel. It's not just about big bulk loading."
