
Spark SQL: A Query Processing Powerhouse?

Almost two years on, it looks as if Spark SQL, the SQL interpreter for the red-hot Spark cluster computing framework, is shaping up to be a query-processing powerhouse.


The latest evidence comes by way of a new set of decision-support benchmarks released by AtScale Inc., a start-up that markets an analytics query-processing engine for Hadoop.

It isn't that Spark SQL outclassed the competition -- in most of the benchmarks, it didn't -- it's that it improved substantially between version 1.4, released nine months ago, and version 1.6, released in January of this year.

For a technology that's been a version 1.x product for all of 16 months -- a technology, moreover, that replaced a highly regarded predecessor -- Spark SQL has made astonishing progress.

"Spark [SQL] is really catching up to some extent here. [Earlier] we had benchmarked [version] 1.4, and it hadn't performed as well. This benchmark is [version] 1.6, and it's coming back up. It has improved drastically," says Dave Mariani, founder and CEO of AtScale.

Spark SQL hit the big time in late May of 2014, when it was promoted in place of "Shark" -- a first-generation SQL query facility for Spark. Shark, a kind of portmanteau of "Spark" and "Hive," partially decoupled Hive from Hadoop 1.x's MapReduce framework. Instead of compiling Hive Query Language (HiveQL) code into MapReduce jobs -- generated as Java code -- Shark compiled HiveQL into Scala jobs for Spark. The problem was that Hive had been optimized not for Spark but for MapReduce, which made the pairing less than ideal.

Spark SQL was supposed to solve this. The problem, at least initially, was that some in the Spark community claimed Spark SQL was inferior to the technology (Shark) it replaced. This was an issue that Arsalan Tavakoli, vice president of customer engagement with Spark commercial company Databricks Inc., vehemently disputed in a late-2014 interview with TDWI's BI This Week.

"Unequivocally, I would disagree with that. Shark, when it was created, was way back when. We had Hive, everybody's doing all of this work in Hive, [our thinking was] can we kind of contort it [so that] instead of spitting out MapReduce jobs, [it can] spit out Spark jobs. Hive wasn't leveraging a ton of what Spark could offer," Tavakoli said at O'Reilly's Strata + Hadoop World 2014 conference.

Fast forward just 16 months -- from Strata + Hadoop World in October of 2014 to now, when AtScale published its benchmark -- and Tavakoli's unequivocal objection seems to have been vindicated. Today, Mariani describes Impala and Spark as "basically, the two workhorses" for query processing on Hadoop. AtScale's benchmark ran 13 star-schema queries, spanning three different query patterns, against a 6-billion-row data set. In six of the 13 queries -- the largest share -- Impala, a rival SQL interpreter for Hadoop prominently sponsored by Cloudera Inc., finished ahead of Spark SQL and the Hive-Tez combination.
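For a sense of the workload, here's a minimal sketch of the kind of star-schema query such a benchmark exercises: a large fact table joined to small dimension tables, filtered, and aggregated. (All table and column names here are hypothetical; these are not AtScale's actual queries.)

```sql
-- Representative star-schema query: join a large fact table to
-- small dimension tables, filter, and aggregate.
-- (Schema is hypothetical, not drawn from AtScale's benchmark suite.)
SELECT d.year,
       s.region,
       SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_store s ON f.store_key = s.store_key
WHERE d.year = 2015
GROUP BY d.year, s.region;
```

Queries of this shape run unchanged on Impala, Hive, and Spark SQL, which is what makes head-to-head engine comparisons possible in the first place.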

Hive used to be tightly coupled to Hadoop's MapReduce engine. With Hadoop 2.0 came a new resource manager, YARN, which effectively broke this dependence: MapReduce is now just one of several compute frameworks, or engines, that YARN can manage. Tez, an engine said to be optimized for the data processing characteristic of decision-support workloads, is another. Spark SQL, for its part, finished first in five of the 13 queries.
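Because Hive's execution engine is just a session-level setting, pairing Hive with Tez is a one-line switch. A minimal sketch (the table name is hypothetical; which engines are available depends on how the cluster's Hive is built):

```sql
-- Route this session's HiveQL through Tez instead of MapReduce.
-- hive.execution.engine is a standard Hive setting; typical values
-- are mr (MapReduce) and tez.
SET hive.execution.engine=tez;

-- The same HiveQL then runs unchanged on the new engine.
-- (fact_sales is a hypothetical table.)
SELECT COUNT(*) FROM fact_sales;
```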

Hive/Tez, which Mariani says is best suited for processing very large data sets, finished first in just two of the query benchmarks. "With Hive, you need it for large data sets, but running a[n interactive] query on Hive makes no sense," he argues.

Aside from promoting AtScale's brand, the new benchmark also promotes its pitch. It positions its AtScale engine -- which decomposes SQL or MDX queries and redirects them to Impala, Spark, Hive/Tez, Presto (a SQL interpreter for Hadoop), and Drill (an open source version of "Dremel," the distributed ad hoc query system that underpins Google's BigQuery infrastructure service) -- as a kind of gateway-to-Hadoop for business intelligence (BI) and data discovery tools.

In other words, AtScale doesn't market a front-end tool but rather a kind of query optimizer designed to work with both traditional BI tools and self-service offerings such as Qlik Sense, TIBCO Spotfire, or Tableau. A benchmark that demonstrates how different Hadoop-based engines are better suited for different workloads thus aligns neatly with its marketing.

"There's a fallacy out there that you can use a single engine to do all of the work that's required for business intelligence workloads. What we found in the benchmark is that this isn't true at all. Although Impala may be great for certain kinds of concurrency, Hive is better for large data sets, and Spark [SQL] is better in a growing number of instances," Mariani said.

AtScale and Mariani aren't the only ones bullish on Spark. At TDWI's recent conference in Las Vegas, Mark Eaton, an enterprise architect with Autodesk Inc., discussed his company's shift to a virtual data warehouse environment powered by Spark 1.6.

"Spark SQL 1.6 is pretty darn close to language maturity. It isn't fully ANSI-standard [SQL], but it supports enough [of the ANSI standard] to do some really complicated stuff," Eaton told Upside in an interview following his presentation. Autodesk won't miss its existing SQL Server and Oracle based data warehouses, he said, because Spark SQL is at least an order of magnitude faster.

"We were using both Microsoft SQL Server and Oracle [for data warehouse services]. We had a very, very traditional [warehouse architecture], with fact tables and dimension tables," he said. "When you move that to SparkSQL, the fact that you're doing the vast majority of your processing in memory alone means that you were getting at least a 10x performance increase."

About the Author

Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about. You can contact him at [email protected].

