Clash of the Titans: Hadoop versus Spark
Because Spark can run independently of Hadoop, some see the two as competitors. However, these technologies can complement each other depending on enterprise needs.
- By Stephen Swoyer
- August 22, 2016
In an infographic published last November, Hadoop powerhouse Cloudera flagged Apache Spark, not Apache Hadoop, as the most popular open source project. Spark boasts more than 750 contributors from at least 200 companies.
Cloudera's infographic gets at one of the ironies of Spark uptake and adoption: because Spark can run independently of Hadoop, some see it as a competitor to that platform.
If Spark doesn't need Hadoop, Cassandra, or any distributed computing substrate, why is a Hadoop heavy like Cloudera promoting Spark without (much) qualification or reservation? It's an intriguing question -- one that Gartner vice president (and veteran industry watcher) Merv Adrian tackled at last month's Pacific Northwest BI Summit.
Could Spark Put Hadoop Out of Business?
"We are now seeing deployments of Spark that don't have any Hadoop in them. They're not [running on top of Hadoop and] HDFS, they're deploying against a different substrate," Adrian told attendees.
"If [Spark] continues to make the kind of market progress it's making so far, it will be viewed as a competitor to Hadoop in a way that today arguably it's not. I get asked [about competition between Hadoop and Spark] a lot and I say, 'No, it's not competing with Hadoop -- yet.'"
Research analyst Mike Ferguson, a principal with UK-based Intelligent Business Strategies, offered a slightly more caustic assessment. "If the Hadoop vendors didn't ship Spark, Spark would have put them out of business by now," Ferguson argued, citing that platform's phenomenal growth (see companion article).
Adrian concurred: "Yes, they were smart to defend themselves. At the same time, they let a Trojan horse in[side] the walls" -- in this case, the "walls" of their Hadoop platform distributions.
No Universal Agreement
Not all Hadoop vendors are sanguine about Spark. Adrian noted that at least one of Cloudera's competitors is much less eager to promote the Spark framework. "They [Cloudera] were probably the first of any [of the Hadoop] distributor[s] to embrace [Spark] very visibly," he said.
He then noted anecdotally that at an event he attended, analysts from another Hadoop pure-play vendor had been lukewarm toward Spark, saying, for example: "This is looking like it's going to be pretty interesting stuff and we're going to step up and support it much more when it's ready."
This pure-play vendor's resistance probably stems less from a concern that Spark might ultimately marginalize Hadoop than from a desire to differentiate itself from an arch-competitor (Cloudera) that champions Spark. "They have their own solution and they don't want to go with [a technology Cloudera is using]: everybody's constructing their stacks differently," Adrian said.
Hadoop and Spark Together
As Abraham Lincoln might put it, Hadoop and Spark must not -- or need not at any rate -- be enemies, but friends. The two technologies can complement one another.
Unlike Hadoop, Spark wasn't conceived as an all-in-one platform for distributed compute and scalable distributed storage. Spark was designed as a high-performance in-memory compute engine. Hadoop itself isn't a database, but Spark is even less of one than Hadoop, which (via HBase and Hive) implements NoSQL key-value and relational database-like services.
Spark is a compute engine designed to work with data structures, such as resilient distributed datasets (RDDs). Accordingly, it doesn't implement a persistence layer comparable to the Hadoop Distributed File System (HDFS).
Per the Apache Spark Programming Guide, "[t]here are two ways to create RDDs: parallelizing an existing collection in your driver program or referencing a data set in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat."
At first, Spark worked primarily with RDDs. It now supports several other data structures as well, namely DataFrames and Datasets. Whichever of these structures/APIs you work with in the Spark environment, the data they reference can be ingested from local file storage, network file systems, Cassandra, HDFS, and external SQL databases (DBMSs or RDBMSs).
"When you run Spark [on a] standalone [basis] it just manages the execution nodes for Spark execution on its own and not through another node manager like YARN or Mesos," explains Larry Murdock, a principal engineer with consultancy Silicon Valley Data Science.
Adds Murdock, "There is no persistence in Spark. It connects to persistence services."
From the perspective of a vendor such as Cloudera, then, Hadoop and HDFS provide cost-effective persistence services from which the Spark compute engine can ingest -- and to which it can write -- different data structures.
As noted, Spark isn't limited to persisting to (or reading data from) HDFS, Hive, or other Hadoop-based resources; it can connect to all kinds of data sources, including conventional relational database management systems. Using Hadoop is logical, however, because RDBMS storage isn't nearly as cost-effective as HDFS storage.
Different Spark Architectures for Different Uses
At this year's Spark Summit East, Vida Ha, a lead solutions engineer at Spark commercial parent company Databricks, discussed the use of Spark against a file system and talked about when it makes sense to use Spark with an external relational database -- or an RDBMS-like engine, such as Apache Hive running in Hadoop.
For many kinds of use cases, Spark runs very well against a file system, she stressed. For others, it doesn't.
"What's not so great with Spark connected directly to a file system is if you do very, very frequent random access." She cited a representative SQL statement, a SELECT * query that looks up a single ID: "Will this command run in Spark? It definitely will, but it's not very efficient [against a file system]. If you think about what Spark is doing when it's reading data from flat files, it reads all of the files in, searches through linearly until it finds that particular ID, and returns it to you."
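Ha's point about linear scans can be sketched in plain Python (this is an analogy, not Spark itself): with no index over the flat files, a single-ID lookup degenerates into reading record after record until the match turns up. The records here are hypothetical.

```python
# Plain-Python sketch (not Spark) of why single-ID lookups against flat
# files are slow: without an index, the reader must scan linearly.
records = [{"id": i, "value": f"row-{i}"} for i in range(100_000)]

def lookup_by_scan(records, target_id):
    """Linear scan, analogous to Spark reading flat files for one ID."""
    scanned = 0
    for rec in records:
        scanned += 1
        if rec["id"] == target_id:
            return rec, scanned
    return None, scanned

rec, scanned = lookup_by_scan(records, 99_999)
print(scanned)  # 100000 -- nearly every record read to return one row
```

Fine for a one-off ad hoc query; ruinous at many queries per second.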
Classic Use Case
"If you're doing ad hoc analysis and you're searching for something, this is fine, but if you're doing this in terms of queries-per-second because you're trying to find data to service a report on your website to end users, you don't want to use Spark just attached to a file system for that," Ha said.
"This is just a common classic use case for a database," she continued, noting that a key-value NoSQL store such as HBase (as distinct from a SQL relational database) would be ideal for a scenario like this. HBase ships standard with most Hadoop distributions.
"[A] key-value NoSQL store just obviously retrieve[s] the value of a key efficiently out of the box."
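The contrast Ha draws can be sketched with an ordinary Python dictionary standing in for a key-value store such as HBase (an analogy only; a real store adds distribution and persistence): the key hashes straight to its value, with no scan at all.

```python
# Plain-Python sketch: a dict standing in for a key-value NoSQL store.
# One hashed lookup replaces the linear scan a flat-file read requires.
store = {i: f"row-{i}" for i in range(100_000)}

value = store[99_999]          # O(1) average-case lookup by key
print(value)  # row-99999

missing = store.get(123_456)   # absent keys handled without a scan
print(missing)  # None
```

This is the "efficiently out of the box" behavior that makes a key-value store the natural backend for report-serving workloads.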
The Future Is Coexistence
A Hadoop versus Spark throwdown might make for a good storyline, but it isn't a particularly helpful way of looking at things.
It isn't just that Spark needs a robust (or, in the case of Cassandra, fault-tolerant) persistence layer. It's that enterprises have invested hundreds of millions of dollars in Hadoop and other new technologies. Complementarity and coexistence, not rip-and-replace, are the order of the day.