Igniting Analytics: Apache Spark’s Promise and Potential Perils
TDWI Speaker: Philip Russom, TDWI Research Director
Apache Spark is a parallel processing engine for big data that achieves high speed and low latency by leveraging in-memory computing and cyclic data flows. Benchmarks show Spark to be up to 100 times faster than Hadoop MapReduce with in-memory operations and 10 times faster with disk-bound ones. High performance aside, interest in Spark is rising rapidly because it offers a number of other advantages over Hadoop MapReduce, while also aligning with the needs of enterprise users and IT organizations.
This Webinar will discuss a number of those advantages:
- Broad compatibility. Spark SQL reuses the Hive front end and metastore to provide compatibility with existing Hive data, queries, and UDFs. Spark SQL’s server mode extends interoperability via industry-standard ODBC/JDBC.
- Flexible deployment. Spark runs on its standalone cluster, Amazon EC2, Hadoop YARN, and Apache Mesos. A single job, query, or streaming computation can be executed in either batch or interactive mode via the Scala, Python, and R shells.
- One console for seamless development and diverse functionality. Apache Spark includes libraries for four high-level applications: SQL, streaming data, machine learning, and graph analytics. These are integrated tightly, so users can create apps that mix SQL queries and stream processing alongside complex analytic algorithms.
- Native support for standard SQL. In a recent TDWI survey, 69% of respondents said that ANSI-standard SQL on Hadoop is required for broad enterprise use. That’s because a modern enterprise wants to leverage pre-existing SQL skills and SQL-based tools. Furthermore, users want fast queries on Hadoop to enable data exploration, analytics, and other interactive, data-driven practices. Spark and its SQL support promise to enable these, which in turn will spark big data analytics for end users.
This Webinar will also contemplate Spark’s role today and in the future, including the following:
- Real-world use cases for Spark and its four application libraries
- How Spark can either replace or complement MapReduce, Pig, Hive, HBase, etc.
- Moving from Spark pilot to production—potential pitfalls
- How Spark should integrate with the broader analytics ecosystem
- Future directions for Spark SQL and similar open-source tools, such as Drill and Presto
Philip Russom, Ph.D.