Apache Spark and Big Data: What's Ahead
Five trends worth your attention as Spark's use with big data grows.
- By Anzy Meerasahib
- June 22, 2018
Developed at the University of California, Berkeley in 2009, Spark is a cluster-computing engine known for fast, in-memory, large-scale data processing. Spark was donated to the Apache Software Foundation in 2013 and is available as open source technology. Apache Spark also provides APIs in multiple programming languages (Scala, Java, Python, and R), which makes it flexible enough for business applications across many industry verticals.
This article identifies five important trends that indicate the acceptance, adoption, and application of Apache Spark we can expect over the next few years.
Trend #1: The shift from storage to computational power
The era of data warehouse modernization was driven by large organizations focused on distributed storage mechanisms using Hadoop. More recently, businesses have shifted their attention to deriving value from big data analysis, translating data into actionable insights that provide a competitive advantage. As a result, the processing power and RAM dedicated to analyzing data have begun to outpace the resources dedicated to storing it.
Spark, with its large-scale, in-memory data processing capability, is at the center of this smart-computation evolution. We should expect to see significant growth in Spark investment, especially in highly competitive industry sectors such as financial services, manufacturing, and pharmaceuticals.
Trend #2: Improved cloud-based infrastructures
Organizations employ Spark to leverage its rapid innovation cycles fueled by contributions from the open source community. It is significantly faster to upgrade to newer versions of software in the cloud than it is for any on-premises implementation.
One way for organizations to get up and running quickly on Spark is to utilize cloud-based implementations. However, this has been a viable option only for smaller companies and start-ups whose data volume was small. For enterprises with sizable data volumes or investments in large data centers, moving their data into the cloud was expensive. Larger organizations opted for a hybrid strategy where a cloud implementation of Spark was used to analyze streaming data while an on-premises Spark cluster was used to analyze historical and aggregated data.
The cloud infrastructure has improved significantly in the last few years with considerable investments from Amazon, Google, and Microsoft. Scalability, elasticity, and ease of use are the pillars of the mainstream cloud infrastructure. Migration to the cloud has never been easier. Based on these cloud infrastructure improvements, even organizations with large data volumes may now adopt an entirely cloud-based Spark implementation. This would result in a more widespread adoption of Spark.
Trend #3: Improved security and governance models
Spark began as a playground for data scientists, where they could build sophisticated data and predictive models using significantly larger volumes and disparate types of data. By and large, Spark has been used in organizations as part of clusters that served mainly data scientists for prototyping and rapid iteration, so the need for enterprise features such as security and data governance was minimal.
As more enterprises use Spark in production deployments to derive critical business insights from big data, Spark implementations are now required to provide the same safeguards to protect data as traditional data architectures and other Hadoop components have. Governance models are a requirement as well.
As adoption of Spark increases, so will the availability of enterprise-grade security and data governance frameworks for Spark implementations. The advanced security platform and governance models will attract traditionally conservative industry sectors such as finance and insurance, leading, in turn, to even wider adoption of Spark.
Trend #4: The advent of BigDL
Until recently, Spark batch processing was not used for deep learning because optimizing Spark's compute engine for training deep learning models required significant effort. This is where Intel comes in with its big data deep learning framework, called BigDL. Intel developed BigDL as a distributed deep learning library on Apache Spark and contributed it to the open source community to unite big data processing and deep learning.
The availability of BigDL to write deep learning programs on Spark helps resolve some key use cases and related technology challenges. These include:
- The ability to analyze big data using deep learning on the same big data Spark cluster where the data is stored
- The ability to add deep learning functionality to the big data (Spark) programs or workflows
- Leveraging existing Hadoop/Spark clusters to run deep learning applications
- Making deep learning more accessible to big data users and data scientists who are usually not deep learning experts
Overall, the advent of BigDL creates a world of new possibilities and possibly a new set of users and use cases that span the big-data and deep-learning landscapes.
Trend #5: The growing popularity of Python and Spark
Developers have increasingly adopted Python as a coding language for Spark rather than Scala or Java, which had been the norm for the last few years. There are several reasons behind this trend. Python is simpler to learn and use, and because it is an interpreted language, there is no separate compile step and typically no JARs to build and manage. Many developers also find Python code more readable, maintainable, and familiar. The Python ecosystem offers well-known libraries for data analysis and statistics that are arguably more mature and time-tested than those in Spark's Machine Learning Library (MLlib).
Additionally, it may be argued that the Scala community is far less helpful to the typical programmer than the Python community. This makes Python a more valuable language to learn. Above all, it is just easier to find, hire, and train Python programmers and get them on board with Spark. This is another significant factor that will further drive adoption of Spark.
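To make the point about Python's low ceremony concrete, here is a minimal sketch of a Spark word count written with PySpark. It assumes the pyspark package is installed and that a local master (local[*]) is acceptable; the function names, the application name, and the sample sentences are hypothetical and chosen only for illustration.

```python
from operator import add


def tokenize(line):
    """Split a line of text into lowercase words."""
    return line.lower().split()


def count_words(spark, lines):
    """Count word frequencies across `lines` using Spark's RDD API."""
    rdd = spark.sparkContext.parallelize(lines)
    pairs = (
        rdd.flatMap(tokenize)            # one record per word
           .map(lambda word: (word, 1))  # pair each word with a count of 1
           .reduceByKey(add)             # sum the counts per word
    )
    return dict(pairs.collect())


def main():
    # Imported here so the helpers above can be loaded without Spark present.
    # Assumption: pyspark is installed; "local[*]" runs Spark on this machine.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("wordcount-sketch")  # hypothetical application name
        .getOrCreate()
    )
    try:
        sample = ["Spark is fast", "Python with Spark is simple"]
        print(count_words(spark, sample))
    finally:
        spark.stop()
```

Calling main() from a plain Python interpreter is enough to run the job locally. There is no build file and no JAR to assemble, which is the kind of simplicity the argument above refers to.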
[Editor's note: This article solely represents the views of the author and does not necessarily represent the views or opinions of EY.]
About the Author
Anzy Meerasahib has 10 years of experience in information management with expertise in data and analytics. She is currently a senior data and analytics consultant at EY GDS LLP and is certified as an OBIEE implementation specialist and an SME in next-gen and traditional data integration and Oracle Business Intelligence Applications.