Faster Analytics Processing with Open Source
By David Stodder, TDWI Director of Research for Business Intelligence
A tsunami of big data is hitting many organizations and the demand for faster, more frequent, and more varied analytics is riding the crest of that wave. Organizations want to apply predictive analytics, stream analytics, machine learning, and other forms of advanced analytics to their key decisions and operations. They are also experiencing the rise of self-service visual analytics, which is whetting the appetite of nontechnical users throughout organizations who want do more with data than they can using standard business intelligence (BI) reports and spreadsheets.
Fortunately, technology trends are moving in a positive direction for organizations seeking to expand the business impact of analytics and send data exploration in new directions. Many of the most important innovations are occurring in the open source realm. In the decade since Hadoop and MapReduce were first developed, we have seen a flurry of initiatives, the best of which have become ongoing Apache Software foundation projects. Today, with the Hadoop 2.0 ecosystem and YARN, it is more possible for organizations to plug their choice of interactive SQL programs, advanced analytics, open source-based processing and execution engines, and other best-of-breed tools into something resembling a unified architecture.
TDWI has just published my new Checklist Report, “Seven Steps to Faster Analytics Processing with Open Source.” We also did a Webinar on this topic that featured discussion with representatives of the four sponsors of the checklist: Cloudera, DataTorrent, Platfora, and Talend. I invite you to check out these resources.
One of the key areas that I wrote about in the checklist—and that was also discussed in the Webinar—was open source stream processing and stream analytics. With interest growing in Internet of Things (IoT) data streams from sensors and other machines, many organizations need to develop a technology strategy for stream processing and stream analytics. The Apache Spark Streaming module, Apache Storm, and Apache Apex are aimed at processing streams of real-time data for analytics. These technologies can be integrated with Apache Kafka, a popular publish-and-subscribe messaging system that can serve data to streaming systems. In the coming year, I am sure we will see rapid evolution of open source technologies for gaining value from real-time data streams.
Other important topics that we discussed in the Webinar and I covered in the report are interactive SQL-based querying of Hadoop systems, and data integration and preparation. Good interactivity with Hadoop data, which includes the ability to send ad hoc SQL queries and receive responses in a reasonable time, is critical to analytics. However, until recently interactivity with Hadoop data was slow and difficult. New options involving SQL-on-Hadoop, Hive/Spark integration, and packaged MapReduce-based big data discovery are improving performance and making interactivity easier for users and developers. Data integration is also getting a push from Spark. Programs for data integration and preparation can use its in-memory data processing and generally better performance to quicken the pace of what are often the most time-consuming steps in BI and analytics.
I expect an active year ahead in open source-based technologies for BI and analytics and will be observing them closely in my 2016 research and analysis.
Hyperlinks embedded in this blog:
Apache Spark Streaming: https://spark.apache.org/streaming/
Apache Storm: https://storm.apache.org/
Apache Kafka: http://kafka.apache.org/
Apache Apex: http://apex.incubator.apache.org/
Posted by David Stodder on December 21, 2015