Igniting the Analytic Spark

TDWI Blog: Data 360

An Introduction to Apache Spark and its uses in Business Intelligence (BI), Data Warehousing (DW), and Advanced Analytics

Blog by Philip Russom
Research Director for Data Management, TDWI

At TDWI, we’re hearing a lot of interest in Apache Spark, although it’s still new and most users are unfamiliar with it. So, please allow me to define Spark for you, explain its potential benefits, and describe actual use cases.

Apache Spark is a parallel processing engine. It specializes in big data, and works well with Hadoop environments. However, Spark is not just for Hadoop; it provides parallel processing for other environments, too. Spark is known for high speed and low latency, which it achieves by leveraging in-memory computing and cyclic data flows.
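To make "parallel processing" concrete, here is a minimal sketch of the split, process, and combine pattern that an engine like Spark applies across a cluster. It uses only Python's standard library on a single machine, so it illustrates the idea and is not Spark's API. The names partition, count_words, and parallel_word_count are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n):
    """Split records into n chunks, dealing them out round-robin."""
    return [records[i::n] for i in range(n)]


def count_words(lines):
    """Work done independently on one partition: count its words."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, workers=4):
    """Count words per partition in parallel, then merge the partial results."""
    parts = partition(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # combine step: add up the per-partition counts
    return total


lines = ["spark is fast", "hadoop is big", "spark is in memory"]
totals = parallel_word_count(lines, workers=2)
print(totals.most_common(2))
```

Threads keep the sketch short; a real engine distributes the partitions across many processes and machines, and keeps intermediate results in memory between steps.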

Spark is fast. Very fast. Benchmarks show Spark to be up to one hundred times faster than Hadoop MapReduce with in-memory operations, and up to ten times faster than MapReduce with disk-bound operations. The point is that Spark has the low latency required of new data-driven practices, like data exploration, discovery, streaming analytics, and SQL-based analytics.

Spark functions apply directly to applications in BI, DW, data integration (DI), and analytics. Spark today includes four libraries of functionality, and each is of interest to professionals in BI, DW, and analytics. The libraries support ANSI-standard SQL, streaming data, machine learning, and graph analytics.

A Spark library provides native support for ANSI- and ISO-standard SQL. In a recent TDWI survey, 69% of users surveyed said that ANSI- and ISO-standard SQL on Hadoop is required for broad enterprise use. That’s because a modern enterprise wants to leverage pre-existing SQL skills and SQL-based tools. Furthermore, users want fast queries on Hadoop, to enable data exploration, analytics, and other interactive, data-driven practices. Spark and its SQL support promise to enable these – in both batch and interactive sessions, for Hadoop and other environments – which in turn will spark big data analytics for users in BI, DW, and analytics.
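As an illustration of reusing existing SQL skills, the sketch below runs one plain SQL segmentation query on two engines. The customers table, its columns, and the sample rows are hypothetical. The run_on_sqlite function uses Python's built-in sqlite3 module as a stand-in, so the query can be checked without a cluster. The run_on_spark function assumes a local PySpark installation with the SparkSession API (Spark 2.0 or later) and is defined but not called here.

```python
import sqlite3

# Hypothetical sample data: one (region, revenue) row per customer.
ROWS = [("east", 120.0), ("west", 80.0), ("east", 30.0)]

# Plain SQL that both engines below accept unchanged.
QUERY = """
SELECT region, COUNT(*) AS customers, SUM(revenue) AS total_revenue
FROM customers
GROUP BY region
ORDER BY region
"""


def run_on_sqlite(rows):
    """Stand-in engine: run QUERY on an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE customers (region TEXT, revenue REAL)")
        conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


def run_on_spark(rows):
    """Same QUERY on Spark SQL; assumes PySpark is installed (not called here)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("segmentation-sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, ["region", "revenue"])
        df.createOrReplaceTempView("customers")
        return [tuple(row) for row in spark.sql(QUERY).collect()]
    finally:
        spark.stop()


result = run_on_sqlite(ROWS)
print(result)
```

With Spark available, calling run_on_spark(ROWS) should give the same per-region totals; only the engine changes, not the query text.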

Spark offers broad compatibility. Spark SQL reuses the Hive front-end and metastore to provide compatibility with existing Hive data, queries, and UDFs. Spark SQL’s server mode extends interoperability via industry-standard ODBC/JDBC. Spark can process data in S3, HDFS, HBase, Hive, Cassandra, and any Hadoop InputFormat.

Spark can be deployed many ways. Spark requires only some kind of shared file system (NFS compliant), so its deployment options are diverse. Spark runs on its standalone cluster, Hadoop YARN, Apache Mesos, and Amazon EC2; on premises or in the cloud. A single job, query, or stream-processing task can be deployed in either batch or interactive mode via Scala, Python, and R shells.
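In practice, the choice of cluster manager mostly shows up as the "master URL" handed to Spark at launch, for example through spark-submit --master. The helper below is a small sketch of those URL forms. The master_url function and the host name in the usage lines are hypothetical, and the bare "yarn" form assumes Spark 2.0 or later (older releases used yarn-client and yarn-cluster).

```python
def master_url(target, host=None, port=None):
    """Return a Spark master URL string for a deployment target.

    target: "local", "standalone", "mesos", or "yarn".
    host/port: needed for standalone and Mesos clusters only.
    """
    if target == "local":
        return "local[*]"  # one machine, all available cores
    if target == "yarn":
        return "yarn"  # cluster location is read from the Hadoop configuration
    if target in ("standalone", "mesos"):
        if host is None:
            raise ValueError(f"{target} deployment needs a host")
        # 7077 and 5050 are the usual default ports for each manager.
        default_port = 7077 if target == "standalone" else 5050
        scheme = "spark" if target == "standalone" else "mesos"
        return f"{scheme}://{host}:{port or default_port}"
    raise ValueError(f"unknown deployment target: {target}")


# Hypothetical host name, for illustration only.
print(master_url("local"))
print(master_url("standalone", host="spark-master.example.com"))
print(master_url("mesos", host="spark-master.example.com", port=5050))
```

The application code stays the same across these targets; only the master URL (and the cluster behind it) changes.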

Spark has one console for the seamless development of diverse functionality. Apache Spark includes libraries for four high-level applications: SQL, streaming data, machine learning, and graph analytics. These are integrated tightly, so users can create applications that mix SQL queries and stream processing alongside complex analytic algorithms.

Spark and its libraries enable several application types for BI, DW, and analytics:

SQL analytics and related set-based applications – e.g., data exploration and discovery, customer-base segmentation, financial analyses, dimensional modeling and analysis, reporting, ETL pushdown that requires SQL
Stream capture and analysis – monitoring facilities (utilities, factories), tracking social sentiment, predicting machine maintenance, rerouting vehicle traffic, managing mobile assets, any time-sensitive process
Graph analytics – anomaly detection for fraud or risk, behavioral analysis, entity clustering, patient outcome optimization
Mixtures of the above – a trend among users is to mix multiple analytic methods in a single application, because each reveals different insights

Want to learn more about Spark? Click here to replay my recent TDWI Webinar, where I go into more detail about Spark and its uses in BI, DW, and analytics.

Posted by Philip Russom, Ph.D. on December 7, 2015