TDWI Blog: Data 360

Sparks are Flying in 2015

By David Stodder, TDWI Director of Research for Business Intelligence

We are past the halfway point of 2015. Major League Baseball is celebrating its all-stars in Cincinnati as teams contemplate trades that they hope will make them stronger for the second-half run. Meanwhile, fall sports are starting to stir; National Football League teams open their training camps around the end of the month. Even pumpkin farmers are aware of time passing; to have fully grown pumpkins for Halloween, they need to have their seeds planted by now. While the air is warm and the sun is still high in the sky, it’s a good time to contemplate significant trends in our industry this year.

The top trend on my list would be the flourishing of Apache Spark, the open source parallel processing framework (or “engine”) for developing analytic applications and systems working with big data. If Spark “went supernova in 2014,” as Stephen Swoyer put it in a fine article earlier this year, the energy from its explosion is forcefully generating a lot of industry activity in 2015. And not just among the small, newer vendors: IBM, Intel, Microsoft, and other mainstream vendors have issued major Spark announcements and product releases already this year, with more to come. Describing Spark’s potential impact, IBM experts have called Spark “the next Linux.”

As I learned at Strata in February and even more at the Spark Summit in June, Spark is shaking up the big data realm, which has been dominated by Hadoop, MapReduce, Hive, and Storm technologies. While compatible with them, Spark offers performance and scalability advantages over these technologies, in part through support for multi-step pipelines that reduce the wait for steps to complete, and through support for in-memory data sharing.
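For readers who like to see an idea in code, here is a toy sketch of those two points in plain Python. It is not the Spark API and does not run on a cluster; the function names, the sample records, and the threshold are all hypothetical and chosen only for illustration. The chained steps hand records along one at a time rather than waiting for a whole step to finish, and the intermediate result is held in memory once so that two later computations can share it.

```python
# Toy illustration only: plain Python, not the Spark API.
# All names and sample data below are hypothetical.

def parse(lines):
    # Step 1: turn "user,amount" text lines into (user, amount) tuples.
    for line in lines:
        user, amount = line.split(",")
        yield user, float(amount)

def keep_large(records, threshold):
    # Step 2: keep only records at or above the threshold.
    for user, amount in records:
        if amount >= threshold:
            yield user, amount

raw_lines = ["ann,12.5", "bob,3.0", "ann,7.5", "cho,20.0"]

# Pipelining: the steps are chained lazily, so each line flows through
# both steps before the next line is read. No step waits for the
# previous one to finish the whole dataset.
pipeline = keep_large(parse(raw_lines), threshold=5.0)

# In-memory sharing: materialize the intermediate result once...
cached = list(pipeline)

# ...and reuse it for two different downstream computations without
# re-reading or re-parsing the raw input.
total = sum(amount for _, amount in cached)

per_user = {}
for user, amount in cached:
    per_user[user] = per_user.get(user, 0.0) + amount

print(total)
print(per_user)
```

In a real Spark job the same pattern would be expressed through Spark's own abstractions and spread across many machines; the sketch only conveys why chaining steps and sharing data in memory can cut waiting time.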

One of Spark’s most important attributes is a unified approach to managing and interacting with a greater diversity of data. The Spark framework can support not only batch processing a la Hadoop but also interactive SQL, real-time processing, machine learning, and stream analytics. At Strata, I met with Matei Zaharia, CTO of Databricks, which was founded by Zaharia and other members of the University of California, Berkeley’s AMPLab team that created Spark and launched it as an Apache project. He did not envision organizations being satisfied with putting all their data into massive Hadoop data lakes; he saw instead increasing diversity in data sources that users seek to access, which requires the unified framework and processing layer that Spark provides.

Spark has changed the parameters of the debate about how SQL-based business intelligence and visual analytics tools and application users might access big data. With Spark SQL, one of the four primary AMPLab-developed libraries that fit into the Spark framework, organizations could bypass some of the steps that have been necessary to move and transform Hadoop files into data warehouses before they can fully analyze the data. Application programming interfaces, such as SparkR for R language programming, are broadening the toolkit available for analytics.

Spark is not as mature as Hadoop or the SQL-on-Hadoop offerings in the market. Spark is also not the only “star” in the open source interactive analytic SQL query galaxy; Presto, which is now strongly backed by Teradata, is another interesting distributed SQL query engine to watch. All of these technologies are enabling organizations to do broader and deeper analytics with data and are becoming important parts of emerging diverse, “hybrid” data architectures (pardon a shameless plug: this topic will be covered at our Solution Summit in Scottsdale later this year).

Spark is a major trend in 2015. What are other trends you are seeing? I would be interested to hear your thoughts.

Hyperlinks embedded in this blog:

Apache Spark: https://spark.apache.org/

Swoyer article: http://tdwi.org/articles/2015/01/06/apache-spark-next-big-thing.aspx

IBM announcement: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss

Intel: https://software.intel.com/sites/campaigns/sparks/IgnitingSparks.php

Microsoft: http://azure.microsoft.com/blog/2015/07/10/interactive-analytics-on-big-data-with-the-release-of-spark-for-azure-hdinsight/

“the next Linux”: https://youtu.be/CrGB_2GJ-fA

Strata: http://strataconf.com/

Spark Summit: https://spark-summit.org/

Databricks: http://www.databricks.com/

AMPLab: https://amplab.cs.berkeley.edu/