TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing


Apache Spark: The Next Big Thing

January 6, 2015

If you heard a lot about something called "Spark" last year, you shouldn't be surprised. Spark went supernova in 2014 -- just about every prominent vendor in business intelligence (BI) and data integration (DI) announced plans to support it.

Which raises a question: what is Spark? More important, why should consumers of BI and traditional business analytics care about it? As it happens, this last question is a very good one -- but the answer can be complicated.

The short take is that Spark, which runs in Hadoop, is everything that Hadoop's MapReduce engine is not. A more complicated take is that even though Spark can run in the context of Hadoop, the Spark framework isn't in any sense tethered or bound to Hadoop. Spark can run in other contexts, too. That's one of its most attractive features.

Another thing that's intriguing about Spark is that it can both run in-memory and persist data to disk-based storage, such as the Hadoop Distributed File System (HDFS) or the Cassandra File System (CFS), among other distributed file systems or data stores. From a BI and DI perspective, however, the most interesting thing about Spark is that it was conceived as a cluster computing framework for processing complex workloads with synchronous and asynchronous operations. Hadoop MapReduce was not, and what this means is that Spark, unlike vanilla Hadoop MapReduce, supports interactive processing -- including the kinds of pipelined operations that tend to get performed in analytic and DI processing.
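The difference between batch-only staging and pipelined processing can be shown with a small Python sketch. It is an analogy only: it uses no Spark or Hadoop APIs, and the function names (`staged_batch`, `pipelined`) and the sample records are hypothetical. The staged version materializes each intermediate result before the next step begins, roughly as a chain of separate batch jobs would; the pipelined version streams each record through every step in memory.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical sample records: (region, sale amount).
SALES: List[Tuple[str, float]] = [
    ("east", 120.0),
    ("west", 80.0),
    ("east", 45.5),
    ("north", 300.0),
]


def staged_batch(records: List[Tuple[str, float]]) -> List[str]:
    """Run each step to completion and materialize its full output
    before the next step starts (analogous to chained batch jobs)."""
    large = [r for r in records if r[1] >= 50.0]             # stage 1 output
    taxed = [(region, amt * 1.1) for region, amt in large]   # stage 2 output
    return [f"{region}:{amt:.2f}" for region, amt in taxed]  # stage 3 output


def pipelined(records: Iterable[Tuple[str, float]]) -> Iterator[str]:
    """Stream every record through all steps lazily; no intermediate
    collection is built between steps."""
    large = (r for r in records if r[1] >= 50.0)
    taxed = ((region, amt * 1.1) for region, amt in large)
    return (f"{region}:{amt:.2f}" for region, amt in taxed)


if __name__ == "__main__":
    print(staged_batch(SALES))
    print(list(pipelined(SALES)))
```

Both functions produce the same results; what differs is when the work happens. The pipelined version can hand back its first result before the rest of the input has been read, which is the property that makes interactive use practical.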

True, Hadoop is no longer a MapReduce-only proposition, thanks to its (still) relatively new Yet Another Resource Negotiator (YARN) resource manager. True, Hadoop, too, has an in-memory capacity. True, the Hadoop platform by the end of 2014 had become a more credible platform for both SQL query and (SQL-driven) data preparation. However, Spark's proponents claim that it comprises a faster, more elegant solution than Hadoop's rapidly maturing SQL ecosystem -- which includes projects such as vanilla Hive, Cloudera's Impala engine, or Hive running in tandem with the new Apache Tez framework -- for most BI and analytic workloads.

To recap: Hive is a SQL interpreter for Hadoop that compiles SQL-like Hive Query Language (HiveQL) statements into MapReduce jobs. Impala is an in-memory SQL-on-Hadoop project that supports interactive use cases. Its development was and is spearheaded by Cloudera. (Proponents claim that Spark's ability to persist data to disk gives it a distinct advantage over Impala, which has no provision for spilling over to disk if it runs out of physical memory.) Tez is a YARN-aware framework for MapReduce that brings features such as pipelining and interactivity to BI and DI on Hadoop. This brings us back to Spark -- which, again, is what exactly?

Call it everything that Hadoop -- which will turn 10 years old in 2015 -- might have been.

"Hadoop in general despite all of its claimed uses so far has been great for a low-cost data management solution, but in general it has struggled from a processing perspective. What's the only thing you could do? Batch processing," says Arsalan Tavakoli, director of customer engagement with Spark commercial parent company Databricks Inc.

Tavakoli's referring to Hadoop's recent past, when -- until Hadoop 2.0 shipped in late 2013 -- Hadoop MapReduce was a batch-only proposition, and third-party engines such as Cloudera's Impala or Pivotal's Hawq couldn't be effectively managed using Hadoop's native feature set.

"Spark can support an arbitrary set of third-party data sources, [such as] Cassandra, [SAP] HANA, Mongo[DB], and [Amazon] S3. I can stick my operational data in Cassandra, have my sales data in Salesforce, have other [document] data in MongoDB, and I can do my advanced analytics in Spark to tie all of this together. Spark is the only thing that can seamlessly go from SQL [analytics] to advanced [non-SQL] analytics."

By "seamlessly," Tavakoli means it's possible to "do" both SQL analytics and non-SQL analytics (coded in Java, Python, Scala, or other languages) in the same engine. To the extent it's possible to do the same thing in Hadoop -- and it is -- it requires coding to different engines: Hive or Impala, along with Mahout, as well as -- possibly -- Pig or vanilla MapReduce to handle data preparation. (Cascading, an API that's layered on top of Hadoop, aims to make it easier to program/manage data processing in Hadoop. To that end, Cascading does provide a single API to which to program -- and likewise handles the scheduling and syncing of workloads in Hadoop's constitutive engines.)

Instead of coding to different engines and writing scripts to schedule or sequence different jobs, you write to one engine -- Spark -- and that takes care of everything. This is Spark's first trump card, says Tavakoli.
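As a rough illustration of a SQL step feeding a non-SQL step inside one program, consider the sketch below. It deliberately uses Python's built-in `sqlite3` and `statistics` modules as a stand-in, not Spark; the table, column names, values, and function names are all hypothetical. In Spark, the analogous pattern would pass a Spark SQL result to code written against Spark's Java, Python, or Scala APIs without leaving the engine.

```python
import sqlite3
import statistics
from typing import List, Tuple

# Hypothetical sample data: (region, sale amount).
SALES: List[Tuple[str, float]] = [
    ("east", 120.0),
    ("west", 80.0),
    ("east", 45.5),
    ("north", 300.0),
]


def load_sales(rows: List[Tuple[str, float]]) -> sqlite3.Connection:
    """Create an in-memory database and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def totals_by_region(conn: sqlite3.Connection) -> List[Tuple[str, float]]:
    """SQL step: aggregate sales per region."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales "
        "GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


def regions_above_mean(totals: List[Tuple[str, float]]) -> List[str]:
    """Non-SQL step: flag regions whose total exceeds the mean total."""
    mean_total = statistics.mean([total for _, total in totals])
    return sorted(region for region, total in totals if total > mean_total)


if __name__ == "__main__":
    connection = load_sales(SALES)
    totals = totals_by_region(connection)
    print(totals)
    print(regions_above_mean(totals))
```

The point of the sketch is the hand-off: the query result goes straight into ordinary program code, with no separate job to schedule and no intermediate file to write between the two steps.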

"Because we say we're a data processing layer, we don't care where your data actually is. Hold it in [Amazon] S3, hold it elsewhere. It doesn't matter. You don't have to worry about writing code [to different engines or APIs] to stitch everything together. You just code for Spark."

Spark has a second potentially huge trump card with respect to Hadoop: viz., its native support for SQL query. Hadoop MapReduce wasn't designed to speak SQL, which is why Hive has been a focus of feverish activity. (Two years ago, Hortonworks kickstarted its "Stinger" initiative to improve Hive's data management feature set. Last year, Hortonworks announced the completion of the first Stinger effort and promptly kickstarted a second initiative, dubbed "Stinger.next.")

Again, Hive was originally conceived as a SQL interpreter for Hadoop's MapReduce engine, which used to have a number of drawbacks for BI and DI workloads. For example, part of the focus of Stinger 1.0 was to bring interactive SQL query to Hadoop.

Spark's SQL story is a little complicated, but -- by most accounts -- more promising. Spark's traditional SQL query facility was "Shark," which was coined as a kind of portmanteau of Hive-on-Spark, or Spark Hive. Basically, Shark kind-of/sort-of decoupled Hive from MapReduce: instead of compiling HiveQL into MapReduce jobs (generated in Java), Shark compiled HiveQL into Scala jobs for Spark. The problem is that Hive wasn't optimized for Spark but for MapReduce, which made its use with Spark inelegant at best.

Enter Spark SQL, a SQL-like query facility that Tavakoli and others argue is a better (more efficient, elegant, and scalable) framework for the future. Eventually, this might be the case, inasmuch as Spark SQL is an optimized interpreter for the Spark engine. However, Spark SQL, which officially debuted in June of 2014, is also comparatively immature. Because of this immaturity, some claim that Spark SQL is currently a less-functional option than Shark. This is a claim Tavakoli vigorously disputes.

"Unequivocally, I would disagree with that. Shark, when it was created, was way back when. We had Hive, everybody's doing all of this work in Hive, [our thinking was] can we kind of contort it [such that] instead of spitting out MapReduce jobs, [it can] spit out Spark jobs. So Hive wasn't leveraging a ton of what Spark could offer," Tavakoli told BI This Week at O'Reilly's Strata + Hadoop World 2014 conference.

"One of the other reasons we moved away from Shark is that Spark SQL can point to almost any data store -- whether it's in Cassandra, HBase, Parquet [a column storage layer for Hadoop], or whatever. If the structure's there, it can write SQL [to it]."

Immature or not, Tavakoli claims, Spark SQL is at least "competitive" with Hive and Impala in most common decision support benchmarks. "The fact of the matter is that benchmarks always frustrate me because everybody talks about and takes TPC-DS benchmarks, so out of, say, 100 queries, they'll say 'Here's our performance on five of them,'" he explains. "We want to run Spark SQL across the full breadth of [TPC-DS] queries. The real answer you'll see is that Spark SQL will perform competitive across the board in all of those [queries]."

Tavakoli here returns to Spark's first trump card: its role as a general-purpose parallel computing framework -- a "data processing layer," to use his term -- that can consolidate all workloads.

"Something like Impala or Hawq, those are custom-built MPP [massively parallel processing] engines just designed for a single purpose. We believe that if you have a general-purpose [engine] like Spark that can get pretty close, that's good enough for most customers," he says.

If vendor interest is any indication, Spark is a rock star. Last year, for example, almost every major DI vendor -- Actian (with its Pervasive technology), IBM Corp., Informatica Corp., SAP AG, SAS Institute Inc., and Syncsort Inc. -- announced support for Spark, with announcements coming especially fast and furious in the second half of 2014.

About the Author

Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about. You can contact him at [email protected].
