Analysis: The Future will be Streamed and Analyzed
The future will be streamed-and-analyzed -- at (or close to) real time.
- By Stephen Swoyer
- September 23, 2014
According to market watcher Forrester Research Inc., organizations in North America and the European Union increased their usage of streaming analytics by almost two-thirds over the last two years. Streaming, Forrester declares, has a cachet that extends far beyond its early adopters in the industrial and financial services markets.
The future will be streamed and analyzed at (or close to) real time. "Many streaming analytics platforms provide tools to design and build command-center[-]style dashboards for human monitoring. [Streaming analytic technologies] can also feed other dashboard [and/or] visualization tools and/or custom monitoring applications," write Forrester researchers Mike Gualtieri and Rowan Curran in The Forrester Wave: Big Data Streaming Analytics Platforms, Q3 2014. This is just one potential use case, according to the authors.
They cite a host of drivers, including the capacity to visualize business or market conditions at or close to real time (e.g., by using streaming analytics on data from social feeds to sample customer sentiment); the ability to respond rapidly to time-critical events or situations (e.g., by using location/motion data to tailor promotions for customers based on where they are); and a high degree of intelligent automation (e.g., by configuring streaming analytic apps to send e-mail messages, phone calls, and text messages to human beings, or by automating substantial portions of business processes, as well as the hand-offs or orchestrations between those processes). A good example of automation comes via the utility space, the Forrester authors explain: "Energy providers can use streaming analytics to monitor their grids, identify opportunities for predictive maintenance, and automatically deploy crews based on certain thresholds."
Forrester's Wave survey jibes with an earlier survey from TDWI Research, which found that almost half of organizations are already collecting streaming data. ("Streaming" data, in this context, means information that's generated by machines, sensors, Web applications, and other resources.)
Forrester and TDWI are also alike in stressing that streaming analytics poses certain technological, logistical, and cultural challenges, not least of which include the problems of capturing, representing, analyzing, and acting on the "world" of events that unfolds in a streaming context. In this respect, streaming throws traditional data management a curve ball. Traditional DM is oriented around (and designed for) a basically static world -- or, at any rate, a world that can be captured and modeled in snapshot images. It's a world that's measured out in units of batch windows; by radically compressing batch windows, it's possible to frequently refresh these snapshots.
However, batch windows can be compressed just so much; as a result, latency is a fact of life. The metaphor of the streaming world, by contrast, is one of continuously occurring atomic events: it's by definition a right-time world -- with "right-time" defined as an acceptable period of latency from the occurrence of an event to its detection, analysis, and remediation, if any.
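The gap between compressed batch windows and "right-time" processing can be made concrete with a little arithmetic. The sketch below is illustrative only: the 60-second latency budget, the window sizes, and the 2-second streaming delay are assumed figures rather than numbers from Forrester or TDWI, and the helper names are hypothetical. It also assumes that batch refreshes fire at exact multiples of the window and complete instantly.

```python
def batch_latency(event_time: int, window: int) -> int:
    """Seconds an event waits until the next batch refresh.

    Assumption: refreshes fire at multiples of `window` (in seconds)
    and take no time to run.
    """
    next_refresh = (event_time // window + 1) * window
    return next_refresh - event_time


def is_right_time(latency: int, budget: int) -> bool:
    """True if the latency fits within the acceptable ("right-time") budget."""
    return latency <= budget


BUDGET = 60        # assumed acceptable latency, in seconds
EVENT_TIME = 610   # assumed moment the event occurs, in seconds

hourly = batch_latency(EVENT_TIME, 3600)     # hourly batch window
compressed = batch_latency(EVENT_TIME, 300)  # window compressed to 5 minutes
streaming = 2                                # assumed per-event processing delay

for label, latency in [("hourly batch", hourly),
                       ("5-minute batch", compressed),
                       ("streaming", streaming)]:
    print(label, latency, is_right_time(latency, BUDGET))
```

Under these assumptions, even the compressed window leaves the event waiting several minutes, while per-event processing stays inside the budget.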
There's another wrinkle, Gualtieri and Curran write: the application programming model -- in addition to the DM model -- is different, too. "The streaming application programming model is unfamiliar to most application developers. It's a different paradigm from normal programming [in which] code execution controls data," the pair write, noting that the streaming model inverts this order.
"In streaming applications, the incoming data controls the code. The heart of all streaming applications is a set of streaming operators that are configured and threaded together to process the incoming streams."
Key operators are filtering, aggregation/correlation, location/motion, time series, and temporal pattern-detection functions, Gualtieri and Curran explain. Other types of operators include enrichment (in which streaming data is enriched with reference data to flesh out context), query functionality (so it's possible to query against or to take action based on streaming data), and the growing number of functions provided by custom-built or third-party libraries.
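The notion of operators that are "threaded together" and driven by incoming data can be sketched in a few lines of Python. This is a minimal illustration, not code from the Forrester report or from any platform it evaluates: the operator names (`filter_op`, `enrich_op`, `windowed_average`), the sample meter readings, and the `REGIONS` reference table are all hypothetical. Each operator is a generator, so no operator does any work until an event is pulled through the chain.

```python
from collections import deque
from typing import Iterable, Iterator

# Hypothetical reference data used by the enrichment operator.
REGIONS = {"m1": "north", "m2": "south"}


def filter_op(events: Iterable[dict], min_kw: float) -> Iterator[dict]:
    """Filtering: pass along only readings at or above min_kw."""
    for event in events:
        if event["kw"] >= min_kw:
            yield event


def enrich_op(events: Iterable[dict], reference: dict) -> Iterator[dict]:
    """Enrichment: attach a region looked up from reference data."""
    for event in events:
        yield {**event, "region": reference.get(event["meter"], "unknown")}


def windowed_average(events: Iterable[dict], size: int) -> Iterator[dict]:
    """Aggregation: add the mean kw over the last `size` events seen."""
    window = deque(maxlen=size)
    for event in events:
        window.append(event["kw"])
        yield {**event, "avg_kw": sum(window) / len(window)}


def run_pipeline(source: Iterable[dict]) -> list:
    """Thread the operators together; each event flows through all three."""
    filtered = filter_op(source, min_kw=1.0)
    enriched = enrich_op(filtered, REGIONS)
    return list(windowed_average(enriched, size=2))


# Hypothetical sample stream of meter readings.
readings = [
    {"meter": "m1", "kw": 0.5},
    {"meter": "m1", "kw": 2.0},
    {"meter": "m2", "kw": 4.0},
    {"meter": "m3", "kw": 3.0},
]

results = run_pipeline(readings)
for row in results:
    print(row)
```

In a real deployment the source would be an unbounded feed rather than a list, and the final stage would trigger an action instead of collecting results, but the shape is the same: the arrival of data, not a controlling loop in application code, determines when each operator runs.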
Whither Open Source?
One of the best-known streaming technologies is Apache Storm, which is used in many notable customer accounts (Forrester cites Twitter, among others) and which is likewise supported by Hadoop vendors such as Hortonworks Inc. As its name suggests, Apache Storm is available as open source software (OSS); so, too, is Spark Streaming, a comparatively new entrant for the Apache Spark parallel processing framework.
Spark can run on top of Hadoop and can persist data to the Hadoop Distributed File System, but can also leverage "Tachyon," a distributed, fault-tolerant, in-memory file system. (Spark is still immature but is touted by many as a more promising alternative to vanilla Hadoop for advanced analytics and other demanding workloads. Another, related OSS project is Apache Kafka, which supports high-throughput data feeds at -- or close to -- real time.)
Forrester's Wave report concedes that Storm and other OSS technologies have a great deal of buzz, but argues that they lack the support, functions, and amenities of third-party offerings such as those from IBM Corp., Informatica Corp., SAP AG, Software AG, Tibco Software Inc., and Vitria Inc.
"[A]t this early stage, [Storm] is utilized by well-known companies with significant volumes of streaming data, such as The Weather Channel, Spotify, Twitter, and Rocket Fuel. It is, however, a very technical platform that lacks the higher order tools and streaming operators that are provided by the vendor platforms evaluated in this Forrester Wave evaluation," according to Gualtieri and Curran.
This is true -- from a certain perspective. Viewed from the perspective of traditional data management, for example, streaming itself is "technical." IBM's InfoSphere Streams or the Informatica Platform for Streaming Analytics (to cite the first two entries in Forrester's report) may be less "technical" to deploy, especially if (as is often the case) one contracts with teams of IBM- or Informatica-sponsored integrators or consultants to assist with implementation and deployment. From the perspective of many enterprise architects, sys admins, operations managers, and developers, however, these platforms, too, might seem confusing, complex, unwieldy, or "technical."
That said, it's likely that few large organizations -- with the exception of edge-case innovators such as Google Inc., Facebook, or similar companies -- would opt to base an enterprise streaming strategy on Storm or on similar technologies. As the success of Hadoop demonstrates, however, that scenario is also something of a straw man. Forrester's rationale ignores the prevalence of (and, among Java and OSS developers, the preference for) technologies such as Storm in the Web, cloud, and social media spaces, as well as in more conventional environments. Indeed, using Puppet or other automation tools, it's possible to click-to-deploy Apache Storm in cloud platforms such as OpenStack or CloudFoundry; Amazon offers its own streaming service, Kinesis, for which it has developed a connector, dubbed Kinesis Storm Spout, for Apache Storm. There's also an OSS tool, dubbed "storm-deploy," which enables a one-click installation of Storm to Amazon Web Services.
Forrester's reasoning likewise ignores individual or small-team development efforts inside large organizations, from skunkworks projects or proof-of-concept/quick-fix applications to larger -- but still department- or business-unit-level -- efforts. If you're curious about streaming but don't have the resources to invest in commercial software (to say nothing of the in-house physical resources on which to run that software), and if you have a team of enterprising, highly talented developers, Apache Storm -- like the Hadoop platform of four or five years ago -- can seem like a promising and, indeed, manageable proposition.
Storm is nowhere to be found in Forrester's Wave tally: it isn't among the market "Leaders" (i.e., vendors that have "the resources and vision to take advantage of the increased adoption of streaming analytics by firms"), nor is it among the "Strong Performers" or "Contenders." IBM, Informatica, SAP, Software AG, and Tibco Software market "mature" alternatives to Storm and related OSS technologies, Forrester says. Not surprisingly, Forrester lists all five companies as streaming analytic "Leaders." Vitria is Forrester's sole "Strong Performer," and upstart SQLStream Inc. is recognized as a "Contender."