TDWI Upside - Where Data Means Business

The State of Spark Adoption, Carefully Considered

The "Apache Spark Survey 2016," published last month, paints a promising picture of the Spark cluster computing platform's prospects.

Whether it's the expanded use in public cloud deployments, increased attendance at meet-ups, or the huge surge in the number of code contributors, Spark appears to be growing by leaps and bounds.

The Apache Spark Survey 2016 is sponsored by Spark's commercial parent company, Databricks. It's chock-full of interesting data points and Spark-related minutiae but has almost nothing concrete to say about the number of Spark production deployments in the enterprise.

This isn't to say that production deployments aren't happening. They certainly are. After all, use of Spark increased by almost two-thirds (63 percent) in banking and by almost 40 percent in the health vertical -- a category that includes medical, pharmaceutical, and biotech.

Spark's use -- for testing and experimentation, proofs-of-concept, and production apps -- is indisputably increasing. It isn't clear what portion of this increase is from its use in production applications, though. According to the report, Spark is most commonly used (by 68 percent of respondents) to support business intelligence (BI) or customer intelligence applications. Just over half (52 percent) of respondents said they use Spark to support data warehousing workloads, and 45 percent use it for real-time or streaming workloads. Other popular applications for Spark include recommendation engines (40 percent) and log processing (37 percent).

One reason it's important to get a feel for Spark's use in production deployments is that its predecessor platform, Hadoop, is still far more likely to be used in a testing environment than in production. Like Spark, Hadoop promised cheap, general-purpose parallelism; unlike Spark, which is an in-memory compute platform, Hadoop bundled a baked-in scalable file system layer, too. That Spark is a thing is in part because of Hadoop's many shortcomings -- its overly tight coupling to MapReduce in Hadoop versions 0.x to 1.x, its lack of granular resource management facilities, and its batch-oriented processing paradigm, to name a few.

Recently, even the Hadoop Distributed File System (HDFS), Hadoop's scalable file system, emerged as a sore spot. HDFS isn't optimized for the kinds of I/O activity (rapid inserts or updates) that are characteristic of data warehousing and other analytics workloads. Cloudera and other proponents now position the Apache Kudu project as a complement to HDFS.

The upshot is that would-be Hadoop adopters are having trouble managing the transition from testing to production. According to Steve Dine, a managing partner with Datasource Consulting and a TDWI instructor, "Our clients have [Hadoop] initiatives, but ... I don't see a ton of Hadoop in what I would call production. A lot of the Hadoop that [we're told] is 'in production' -- when we start digging down and asking questions about what exactly are they doing, what users are on it, and what workloads and all that -- [our clients] can't really answer or they say there's really not a lot of users on it."

"In our customer base, I would say that the ones who are using it ... in production, they struggle. They struggle because the performance is not good. What I mean is they struggle with ... concurrency. If you have a couple of people on a cluster running jobs and other jobs get submitted, they don't run or [performance] slows down for everybody."

The relevance here is that the hype around Spark today is no less than that surrounding Hadoop at its most celebrated, which makes one wonder -- will history repeat itself?

Spark is Flexible, Adaptable, and Irrepressible

There are good reasons to think it won't. Much of this has to do with Spark's flexibility and adaptability. Spark is a credible complement to existing Hadoop deployments, whether running under a cluster resource manager (YARN, in the context of Hadoop itself, or Mesos) or standalone on a host operating system. Used separately or spun up in the context of an existing Hadoop cluster, Spark is a superior platform for interactive or real-time applications; Hadoop (and HDFS or Kudu) gives it an inexpensive means of persistence.

Spark is also arguably more flexible for certain kinds of workloads -- such as SQL query, machine learning, and real-time streaming analytics -- than Hadoop. The Apache Spark Survey 2016 indicates that Spark's usage for SQL query (via its Spark SQL library) is growing rapidly -- year-over-year usage has increased by 67 percent, with 40 percent of respondents reporting using Spark SQL in 2016. Use of Spark DataFrames -- which often complements Spark SQL usage -- increased by a massive 153 percent from 2015 to 2016. (Almost 40 percent of respondents use Spark DataFrames, too.) In addition, 82 percent of respondents use open source software (OSS) -- and 52 percent use proprietary -- RDBMSs with Spark, either as data sources or as targets.

SQL query is just one part of the story, however. Spark Streaming isn't the last word in streaming analytics -- there are literally dozens of open source streaming projects -- but it does continue to grow in influence. From 2015 to 2016, use of Spark Streaming increased by 57 percent, such that it's used by 22 percent of the respondents in the Apache Spark Survey 2016.

Spark's MLlib -- or Machine Learning Library -- is also widely used. Adoption increased by 38 percent between 2015 and 2016, and nearly one in five (18 percent) survey respondents use Spark MLlib.
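The survey figures above pair a growth rate with a 2016 share, which makes it easy to misread how large each user base actually was a year earlier. The sketch below backs out a rough 2015 share for each library. It rests on an assumption the article does not confirm: that each growth figure is a relative increase in the share of respondents. The function and variable names are illustrative only.

```python
# Assumption (not stated in the article): each growth figure is a relative
# increase in the share of respondents, so
#   implied 2015 share = 2016 share / (1 + growth).

def implied_prior_share(current_share_pct, growth_pct):
    """Back out the prior-year share (percent) from the current share and relative growth."""
    return current_share_pct / (1 + growth_pct / 100)

# (2016 share in percent, year-over-year growth in percent), as quoted above
survey_figures = {
    "Spark SQL": (40, 67),
    "Spark Streaming": (22, 57),
    "Spark MLlib": (18, 38),
}

# Round to one decimal place; the inputs are too coarse to justify more.
implied_2015 = {
    name: round(implied_prior_share(share, growth), 1)
    for name, (share, growth) in survey_figures.items()
}

for name, share_2015 in implied_2015.items():
    print(f"{name}: roughly {share_2015} percent of respondents in 2015 (implied)")
```

Under that assumption, the implied 2015 shares come out to roughly 24 percent for Spark SQL, 14 percent for Spark Streaming, and 13 percent for MLlib. If the survey instead measured growth in absolute user counts, these figures would not hold.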

There are other reasons to be optimistic. Because Spark is platform-agnostic, it can run on systems other than Linux. (Hadoop, too, has been ported to non-Linux platforms, such as most other flavors of UNIX, the Mac OS, Windows, and IBM's System z mainframe.) The Apache Spark Survey 2016 indicates that adoption of Spark in tandem with Windows increased by 39 percent between 2015 and 2016, while Spark usage on Mac OS increased by more than 50 percent. Combined Linux and UNIX usage actually decreased slightly, though -- from 75 to 74 percent -- year over year.

As to how Spark manages the transition from testing and early adoption to production, only time will tell. If the Apache Spark Survey 2016 is a reliable indication, however, it's off to a promising start.

About the Author

Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about. You can contact him at evets@alwaysbedisrupting.com.

