
Apache Spark: The Next Big Thing

What is Spark and why should consumers of BI and business analytics care about it?

If you heard a lot about something called "Spark" last year, you shouldn't be surprised. Spark went supernova in 2014 -- just about every prominent vendor in business intelligence (BI) and data integration (DI) announced plans to support it.

Which raises a question: what is Spark? More important, why should consumers of BI and traditional business analytics care about it? As it happens, this last question is a very good one -- but the answer can be complicated.

The short take is that Spark, which runs in Hadoop, is everything that Hadoop's MapReduce engine is not. A more complicated take is that even though Spark can run in the context of Hadoop, the Spark framework isn't in any sense tethered or bound to Hadoop. Spark can run in other contexts, too. That's one of its most attractive features.
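To make that portability concrete, here's a minimal sketch (with a hypothetical application name) showing how the same Spark program can target different execution contexts simply by changing its "master" setting -- local threads for development, Spark's standalone cluster manager, Hadoop YARN, or Apache Mesos:

    import org.apache.spark.{SparkConf, SparkContext}

    // The application code runs unchanged in any of these contexts;
    // only the master URL differs.
    val conf = new SparkConf()
      .setAppName("PortableApp")                   // hypothetical name
      .setMaster("local[4]")                       // local threads, for development
      // .setMaster("spark://master-host:7077")    // Spark standalone cluster
      // .setMaster("yarn-client")                 // Hadoop YARN (Spark 1.x syntax)
      // .setMaster("mesos://master-host:5050")    // Apache Mesos
    val sc = new SparkContext(conf)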

Another thing that's intriguing about Spark is that it can both run in-memory and persist data to disk-based storage, such as the Hadoop Distributed File System (HDFS) or the Cassandra File System (CFS), among other distributed file systems or data stores. From a BI and DI perspective, however, the most interesting thing about Spark is that it was conceived as a cluster computing framework for processing complex workloads with synchronous and asynchronous operations. Hadoop MapReduce was not. This means that Spark, unlike vanilla Hadoop MapReduce, supports interactive processing -- including the kinds of pipelined operations common in analytic and DI processing.
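A short sketch of that dual behavior, assuming hypothetical HDFS paths: the working set is held in memory with explicit permission to spill to local disk, downstream operations reuse it, and the result is persisted back to HDFS.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD operations (early Spark 1.x)
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("PersistDemo"))

    // Keep parsed records in memory, spilling partitions to local disk
    // when they don't fit, instead of failing or recomputing.
    val events = sc.textFile("hdfs:///data/events")        // hypothetical path
      .map(_.split(","))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Pipelined operations reuse the cached data without rereading it ...
    val byType = events.map(f => (f(1), 1)).reduceByKey(_ + _)

    // ... and results can be persisted to disk-based storage such as HDFS.
    byType.saveAsTextFile("hdfs:///out/events-by-type")    // hypothetical path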

True, Hadoop is no longer a MapReduce-only proposition, thanks to its (still) relatively new Yet Another Resource Negotiator (YARN) resource manager. True, Hadoop, too, has in-memory capability. True, by the end of 2014 the Hadoop platform had become a more credible platform for both SQL query and (SQL-driven) data preparation. However, Spark's proponents claim that it offers a faster, more elegant solution than Hadoop's rapidly maturing SQL ecosystem -- which includes projects such as vanilla Hive, Cloudera's Impala engine, and Hive running in tandem with the new Apache Tez framework -- for most BI and analytic workloads.

To recap: Hive is a SQL interpreter for Hadoop that compiles SQL-like Hive Query Language (HiveQL) statements into MapReduce jobs. Impala is an in-memory SQL-on-Hadoop project that supports interactive use cases. Its development was and is spearheaded by Cloudera. (Proponents claim that Spark's ability to persist data to disk gives it a distinct advantage over Impala, which has no provision for spilling over to disk if it runs out of physical memory.) Tez is a YARN-aware framework for MapReduce that brings features such as pipelining and interactivity to BI and DI on Hadoop. This brings us back to Spark -- which, again, is what exactly?

Call it everything that Hadoop -- which will turn 10 years old in 2015 -- might have been.

"Hadoop in general despite all of its claimed uses so far has been great for a low-cost data management solution, but in general it has struggled from a processing perspective. What's the only thing you could do? Batch processing," says Arsalan Tavakoli, director of customer engagement with Spark commercial parent company Databricks Inc.

Tavakoli's referring to Hadoop's recent past, when -- until Hadoop 2.0 shipped in late 2013 -- Hadoop MapReduce was a batch-only proposition, and third-party engines such as Cloudera's Impala or Pivotal's Hawq couldn't be effectively managed using Hadoop's native feature set.

"Spark can support an arbitrary set of third-party data sources, [such as] Cassandra, [SAP] HANA, Mongo[DB], and [Amazon] S3. I can stick my operational data in Cassandra, have my sales data in Salesforce, have other [document] data in MongoDB, and I can do my advanced analytics in Spark to tie all of this together. Spark is the only thing that can seamlessly go from SQL [analytics] to advanced [non-SQL] analytics."

By "seamlessly," Tavakoli means it's possible to "do" both SQL analytics and non-SQL analytics (coded in Java, Python, Scala, or other languages) in the same engine. To the extent it's possible to do the same thing in Hadoop -- and it is -- it requires coding to different engines: Hive or Impala, along with Mahout, as well as -- possibly -- Pig or vanilla MapReduce to handle data preparation. (Cascading, an API that's layered on top of Hadoop, aims to make it easier to program/manage data processing in Hadoop. To that end, Cascading does provide a single API to which to program -- and likewise handles the scheduling and syncing of workloads in Hadoop's constitutive engines.)

Instead of coding to different engines and writing scripts to schedule or sequence different jobs, you write to one engine -- Spark -- and that takes care of everything. This is Spark's first trump card, says Tavakoli.

"Because we say we're a data processing layer, we don't care where your data actually is. Hold it in [Amazon] S3, hold it elsewhere. It doesn't matter. You don't have to worry about writing code [to different engines or APIs] to stitch everything together. You just code for Spark."

Spark has a second potentially huge trump card with respect to Hadoop: viz., its native support for SQL query. Hadoop MapReduce wasn't designed to speak SQL, which is why Hive has been a focus of feverish activity. (Two years ago, Hortonworks kicked off its "Stinger" initiative to improve Hive's data management feature set. Last year, Hortonworks announced the completion of the first Stinger effort and promptly launched a second initiative, dubbed "Stinger.next.")

Again, Hive was originally conceived as a SQL interpreter for Hadoop's MapReduce engine, which used to have a number of drawbacks for BI and DI workloads. For example, part of the focus of Stinger 1.0 was to bring interactive SQL query to Hadoop.

Spark's SQL story is a little complicated, but -- by most accounts -- more promising. Spark's traditional SQL query facility was "Shark," a name coined as a kind of portmanteau of Hive-on-Spark, or Spark Hive. Basically, Shark kind-of/sort-of decoupled Hive from MapReduce: instead of compiling HiveQL statements into MapReduce jobs (generated in Java), Shark compiled HiveQL into Spark jobs (generated in Scala). The problem is that Hive was optimized not for Spark but for MapReduce, which made its use with Spark inelegant at best.

Enter Spark SQL, a SQL-like query facility that Tavakoli and others argue is a better (more efficient, elegant, and scalable) framework for the future. Eventually, this might be the case, inasmuch as Spark SQL is an optimized interpreter for the Spark engine. However, Spark SQL, which officially debuted in June of 2014, is also comparatively immature. Because of this immaturity, some claim that Spark SQL is currently a less-functional option than Shark. This is a claim Tavakoli vigorously disputes.
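A minimal sketch of the Spark SQL side of that story, using the Spark 1.x API of the period (SQLContext and SchemaRDD) and made-up sales data. The point is the seam -- or rather the lack of one -- between the SQL query and the MLlib call that consumes its result:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    case class Sale(customer: String, amount: Double)

    val sc = new SparkContext(new SparkConf().setAppName("SqlToAdvanced"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD-to-SchemaRDD conversion (Spark 1.x)

    // Made-up data; in practice the table could be backed by Parquet, S3, etc.
    val sales = sc.parallelize(Seq(Sale("a", 120.0), Sale("b", 80.0), Sale("c", 210.0)))
    sales.registerTempTable("sales")

    // SQL analytics ...
    val big = sqlContext.sql("SELECT customer, amount FROM sales WHERE amount > 100")

    // ... feeding straight into non-SQL advanced analytics (MLlib k-means)
    // on the same engine, with no hand-off between systems.
    val model = KMeans.train(big.map(row => Vectors.dense(row.getDouble(1))), 2, 10)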

"Unequivocally, I would disagree with that. Shark, when it was created, was way back when. We had Hive, everybody's doing all of this work in Hive, [our thinking was] can we kind of contort it [such that] instead of spitting out MapReduce jobs, [it can] spit out Spark jobs. So Hive wasn't leveraging a ton of what Spark could offer," Tavakoli told BI This Week at O'Reilly's Strata + Hadoop World 2014 conference.

"One of the other reasons we moved away from Shark is that Spark SQL can point to almost any data store -- whether it's in Cassandra, HBase, Parquet [a column storage layer for Hadoop], or whatever. If the structure's there, it can write SQL [to it]."

Immature or not, Tavakoli claims, Spark SQL is at least "competitive" with Hive and Impala in most common decision support benchmarks. "The fact of the matter is that benchmarks always frustrate me because everybody talks about and takes TPC-DS benchmarks, so out of, say, 100 queries, they'll say 'Here's our performance on five of them,'" he explains. "We want to run Spark SQL across the full breadth of [TPC-DS] queries. The real answer you'll see is that Spark SQL will perform competitively across the board in all of those [queries]."

Tavakoli here returns to Spark's first trump card: its role as a general-purpose parallel computing framework -- a "data processing layer," to use his term -- that can consolidate all workloads.

"Something like Impala or Hawk, those are custom-built MPP [massively parallel processing] engines just designed for a single purpose. We believe that if you have a general-purpose [engine] like Spark that can get pretty close, that's good enough for most customers," he says.

If vendor interest is any indication, Spark is a rock star. Last year, for example, almost every major DI vendor -- Actian (with its Pervasive technology), IBM Corp., Informatica Corp., SAP AG, SAS Institute Inc., and Syncsort Inc. -- announced support for Spark, with announcements coming especially fast and furious in the second half of 2014.
