The Trouble with Spark
Spark boasts a lot of processing power but still lacks maturity. There are difficulties in keeping it running, and its troubles with workload variety make it problematic for production applications.
- By Jake Dolezal
- August 19, 2016
Apache Spark is widely regarded as a powerful platform. By keeping working data in memory, it lets users run workloads quickly. Because it dispatches tasks through remote procedure calls and executes them on a thread pool, Spark can schedule and execute tasks in milliseconds, even on busy clusters.
Although this speed is an attractive reason to use Spark, my colleagues and I have sometimes found it difficult to use. I have personally seen the touted performance of Spark in action. Spark is fast, if you can get Spark jobs to run.
The trouble with Spark remains the sometimes overwhelming amount of configuration and optimization it requires.
Too Many Parameters
Simply put, with so many parameter options, there are many ways for a Spark job to fail. Spark is also remarkably good at spinning up processes that mysteriously grind to a halt. Debugging is difficult because error tracebacks are cryptic at best, including the dreaded "failed for unknown reason" message.
Users are often left fiddling with any number of parameters, including:
- Number of partitions
- Frame size
- Ask timeout
- Worker timeout
- Memory caches
- JVM heap size of Spark's executors
- Memory fraction for caching partial results
- Akka frame size
- Speculative execution
- Shuffle file consolidation
- Default parallelism
- HDFS block size of the input data
And this is only a partial list.
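For orientation, several of the parameters listed above correspond to named properties in Spark 1.x. The Python sketch below collects a handful of them in a dictionary and renders them as `--conf` arguments for `spark-submit`. The property names are given as best recalled from the Spark 1.x configuration documentation, and some were renamed or removed in Spark 2.0, so check them against your version. Every value is a placeholder chosen for illustration, not a recommendation, and `to_submit_args` is a hypothetical helper.

```python
# Illustrative only: Spark 1.x property names as best recalled; verify them
# against the configuration docs for your Spark version before relying on them.
# All values are placeholders, not tuning advice.
SPARK_1X_SETTINGS = {
    "spark.default.parallelism": "64",         # default parallelism
    "spark.akka.frameSize": "128",             # Akka frame size, in MB
    "spark.akka.askTimeout": "120",            # ask timeout, in seconds
    "spark.worker.timeout": "120",             # worker timeout (standalone mode)
    "spark.executor.memory": "6g",             # JVM heap size of each executor
    "spark.storage.memoryFraction": "0.5",     # heap fraction for cached data
    "spark.speculation": "true",               # speculative execution
    "spark.shuffle.consolidateFiles": "true",  # shuffle file consolidation
}


def to_submit_args(settings):
    """Render a settings dict as a flat list of spark-submit arguments."""
    args = []
    for key in sorted(settings):
        args.extend(["--conf", "{}={}".format(key, settings[key])])
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(SPARK_1X_SETTINGS)))
```

Two items from the list do not appear here because they are not plain Spark properties: the number of partitions is usually set in the job code itself, and the HDFS block size is a Hadoop setting.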
The confounding part about configuring Spark is that the optimal (or even minimal) configuration varies by workload and input data size. To my knowledge, there is no single perfect configuration. It is nearly impossible to predict how Spark will behave across varied use cases and data sizes.
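A common partition-count rule of thumb shows how even one setting shifts with the input. The Python sketch below is an illustration under stated assumptions: the 128 MB target partition size is a convention rather than a Spark setting, the two-tasks-per-core floor follows the Spark tuning guide's suggestion of two to three tasks per CPU core, and `suggest_partitions` and the 32-core cluster are hypothetical.

```python
import math


def suggest_partitions(input_mb, total_cores, target_partition_mb=128, tasks_per_core=2):
    """Rule-of-thumb partition count: enough partitions to keep each one near
    the target size, but never fewer than a couple of tasks per core."""
    by_size = math.ceil(input_mb / target_partition_mb)
    by_cores = total_cores * tasks_per_core
    return max(by_size, by_cores)


if __name__ == "__main__":
    # Hypothetical cluster with 32 cores in total.
    for input_gb in (2, 20, 200):
        print(input_gb, "GB ->", suggest_partitions(input_gb * 1024, total_cores=32))
```

The suggested count changes as the input grows, so a value tuned for one data set is unlikely to suit another. A heuristic like this is only a starting point and does not replace testing against real workloads.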
Testing Full of Unexplained Errors
One case I was involved in concerned Spark jobs failing with a variety of timeout errors and other inexplicable low-level issues. The use case involved a graph match grouping process to identify all separate subgraphs within a population of graphs and uniquely label their vertices. Ultimately, we wanted to use Spark to produce pairs of matched records and group them together to find the duplicates and links to a master record.
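In graph terms, the grouping step described here is a connected-components computation: matched pairs are edges, and each group of linked records is one component. The Python sketch below shows the idea on a single machine with a union-find structure. It is not the Spark implementation used in this case, and the record IDs and the function name `group_matches` are hypothetical.

```python
def group_matches(pairs):
    """Group records linked by matched pairs; label each group by its smallest ID."""
    parent = {}

    def find(x):
        # Walk to the root, shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the root so labels are deterministic.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    for a, b in pairs:
        union(a, b)

    # Map every record seen in the input to the label of its group.
    return {record: find(record) for record in list(parent)}


if __name__ == "__main__":
    # Records 1, 2, 3 are linked in a chain; records 7 and 8 form a second group.
    print(group_matches([(1, 2), (2, 3), (7, 8)]))
```

Because the smaller ID always becomes the root, each group is labeled by its lowest record ID regardless of the order in which the pairs arrive.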
The Spark cluster had eight nodes, each with 8 GB dedicated to Spark, for a total of 64 GB. The input data set was only 2 GB in total, yet we were already at the limit of what Spark could process. When we made slight changes to the input data, such as the row order or the target size of the match groups, the Spark job would sometimes complete and sometimes fail.
Unpredictable Performance in Production
We simply don't know how Spark will behave in varying circumstances, which raises serious concerns about its use in production, where conditions are unpredictable.
I can easily envision scenarios where a Spark process has been built and tested but breaks down when:
- Input data set volume increases
- Concurrent users and competing jobs strain resources
- Additional processing steps are added to a job
- Some of the cluster nodes are down for maintenance
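One way to probe these conditions before production is to enumerate them as an explicit test matrix. The Python sketch below builds every combination of four hypothetical factors, one per scenario in the list above. The factor names and values are assumptions for illustration, and actually running a Spark job for each scenario is left out.

```python
from itertools import product

# Hypothetical values for each condition; adjust to your own environment.
INPUT_SIZES_GB = [2, 4, 8]
CONCURRENT_JOBS = [1, 4]
EXTRA_STEPS = [0, 2]
NODES_AVAILABLE = [8, 6]


def build_scenarios():
    """Return one dict per combination of the conditions."""
    keys = ("input_gb", "concurrent_jobs", "extra_steps", "nodes")
    combos = product(INPUT_SIZES_GB, CONCURRENT_JOBS, EXTRA_STEPS, NODES_AVAILABLE)
    return [dict(zip(keys, combo)) for combo in combos]


if __name__ == "__main__":
    for scenario in build_scenarios():
        print(scenario)
```

Even this small grid yields 24 scenarios, which gives a sense of how quickly the space of conditions to test grows.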
Spark's strong suit is still highly iterative processing that takes advantage of its in-memory processing. However, if even veteran engineers struggle with Spark, what will the average IT shop face in getting it to work in a production setting with varied workloads, data sizes, and other irregular conditions?
Elevate Spark with the Right Talent and Tools
The trick to getting Spark ready for prime time in your organization is to elevate its maturity.
You have to find the right talent to stand up a Spark cluster for your big data integration and management workloads. As a mentor of mine said of talent, "Rent it, buy it, or grow it." I recommend all three in a hybrid arrangement. Bring in an expert consultancy with the right chemistry for your organization to get started and move towards self-sufficiency by hiring a full-time Spark architect and mentoring some talent already in your IT organization.
Second, make sure you are using the right tool for the job. There are many vendor tools that leverage Spark as a framework (and several really good ones that don't and perform just as well). Going through a tool selection process means being very clear about your use cases, your anticipated data volume growth, and your scaling trajectory. The right tools can help guide the tuning, maximize Spark performance, and minimize failures.
Third, you will have to take Spark through the paces of a rigorous quality assurance and testing process. Is your QA team ready to take on an unfamiliar (and potentially complex) platform such as Spark?
Introducing Spark will require you to build up a strong team and a quality tool repertoire to ensure success.
Dr. Jake Dolezal is practice leader of Analytics in Action at McKnight Consulting Group Global Services, where he is responsible for helping clients build programs around data and analytics. You can contact the author at firstname.lastname@example.org