Streaming Data and Message Queuing in the Enterprise
Streaming and message queuing have lasting value to organizations and may soon become as prevalent as ETL is today.
- By William McKnight
- March 27, 2018
When an information architect needs to move data around, how should it be done? Recently, there has been a dramatic change in the thinking about this subject. Driven by an increased need for real-time integration of high-velocity data, streaming solutions are increasingly being considered where ETL used to dominate.
Enterprise application integration (EAI) and enterprise service buses (ESB) were the initial responses, but they lacked the scale required for many modern integration challenges. Today's streaming solutions have solved the scaling problems associated with these early attempts at real-time data integration.
A Gap in Features
In the current landscape of streaming and message-queuing technology, a gap has emerged: platforms tend to deliver either rich message-queuing capabilities or scale, but not both. A platform is either more streaming-data oriented (such as Apache Kafka and Amazon Kinesis) or more message-queuing oriented (such as RabbitMQ, Apache ActiveMQ Artemis, and Google Cloud Pub/Sub).
If a company goes with a solution built for streaming data, it may have to give up capabilities that are more message-queuing oriented, such as consumer and producer queue definition, conditional message routing, batch fetch and delivery, broker push, and message rejection and resending.
If a company goes with a more message-queue-oriented technology, it will lose capabilities such as ordered storage and delivery, message persistence and durability options, queue data compression, publisher/subscriber methods, and scale.
What has been missing is an enterprise-grade solution ready for deployment that covers both streaming and queuing use cases.
Enter Apache Pulsar
Enter Apache Pulsar (available in an enterprise-ready deployment from Streamlio). Pulsar was developed at Yahoo and open-sourced in 2016; it is used in Yahoo Mail, Finance, and Sports; Flickr; Gemini Ads; and Sherpa.
Pulsar follows the publish/subscribe model and has built-in multi-datacenter replication. Pulsar was architected for multitenancy, using the concepts of properties and namespaces: a Pulsar cluster contains multiple properties, each property contains namespaces, and each namespace contains any number of topics. A property could represent all the messaging for a particular team, application, or product vertical. A namespace is the administrative unit at which security, replication, configuration, and subscriptions are managed.
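To make that hierarchy concrete, here is a minimal sketch using the Pulsar Java client, assuming a locally running broker and hypothetical property, namespace, and topic names (in newer Pulsar releases the property level is called a tenant). The topic name itself encodes the property/namespace/topic hierarchy, and both the producer and the consumer address the topic through it.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class PulsarPubSubSketch {
    public static void main(String[] args) throws Exception {
        // Service URL for a local broker (an assumption for this sketch).
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // The topic name encodes the hierarchy: property (tenant) / namespace / topic.
        String topic = "persistent://ads-team/clickstream/page-views";

        // Subscribe first so the published message is retained for this subscription.
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic(topic)
                .subscriptionName("reporting")
                .subscribe();

        // Publish a message to the same topic.
        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic(topic)
                .create();
        producer.send("user 42 viewed /pricing");

        // Receive and acknowledge the message.
        Message<String> msg = consumer.receive();
        System.out.println("Received: " + msg.getValue());
        consumer.acknowledge(msg);

        producer.close();
        consumer.close();
        client.close();
    }
}
```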
Messages are partitioned and managed by brokers using a user-defined routing policy -- single, round robin, hash, or custom -- giving the architect visibility into and control over how messages are distributed across partitions. Pulsar also offers a Kafka API compatibility interface to make porting existing Kafka applications easier.
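As an illustration of choosing a routing policy on a partitioned topic, the sketch below (hypothetical topic name, same local-broker assumption as above) uses the Java client's MessageRoutingMode to spread keyless messages round robin across partitions; SinglePartition and CustomPartition (with a user-supplied MessageRouter) are the other built-in options.

```java
import org.apache.pulsar.client.api.MessageRoutingMode;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class PartitionedRoutingSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Round-robin routing spreads keyless messages evenly across partitions.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://ads-team/clickstream/page-views-partitioned")
                .messageRoutingMode(MessageRoutingMode.RoundRobinPartition)
                .create();

        // Messages that carry a key are hashed to a partition,
        // so per-key ordering is preserved.
        producer.newMessage()
                .key("user-42")
                .value("user 42 viewed /pricing".getBytes())
                .send();

        producer.close();
        client.close();
    }
}
```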
Pulsar uses Apache BookKeeper to provide low-latency persistent storage. When a Pulsar broker receives messages, it sends the message data to BookKeeper nodes, which write the data to a write-ahead log and hold it in memory. Because messages are persisted before they are acknowledged, they remain safe in permanent storage even if a broker fails.
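For a sense of how those durability settings are surfaced to administrators, here is a hedged sketch using the Pulsar admin Java client to set per-namespace persistence policies: the BookKeeper ensemble size, the number of copies written per entry, and the number of acknowledgments required before a write is considered durable. The namespace name, admin URL, and values are illustrative, not recommendations.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class PersistencePolicySketch {
    public static void main(String[] args) throws Exception {
        // Admin endpoint for a local cluster (an assumption for this sketch).
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // For the hypothetical ads-team/clickstream namespace:
        // use an ensemble of 3 bookies, write 2 copies of each entry,
        // and require 2 acks before the write counts as durable.
        // The final argument caps the mark-delete rate (0 = no throttling).
        admin.namespaces().setPersistence(
                "ads-team/clickstream",
                new PersistencePolicies(3, 2, 2, 0));

        admin.close();
    }
}
```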
Streaming and message queuing have lasting value to organizations. I predict they will become as prevalent as ETL is today because they can meet the data volume, variety, and timing requirements of the data-driven future. Streaming and message queuing allow us to ingest data and operate at a scale that was not possible until recently. I expect to see them in heavy use in the near future.
About the Author
McKnight Consulting Group is led by William McKnight. He serves as strategist, lead enterprise information architect, and program manager for sites worldwide utilizing the disciplines of data warehousing, master data management, business intelligence, and big data. Many of his clients have gone public with their success stories. McKnight has published hundreds of articles and white papers and given hundreds of international keynotes and public seminars. His teams’ implementations from both IT and consultant positions have won awards for best practices. William is a former IT VP of a Fortune 50 company and a former engineer of DB2 at IBM, and holds an MBA. He is author of the book Information Management: Strategies for Gaining a Competitive Advantage with Data.