TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Articles

Streaming Data Who’s Who: Kafka, Kinesis, Flume, and Storm

Streaming data offers an opportunity for real-time business value. Knowing the big names in streaming data technologies and which one best integrates with your infrastructure will help you make the right architectural decisions.

By Jake Dolezal
April 15, 2016

Real-time data has potentially high business value, but that value is perishable. If the value of the data is not realized within a certain window of time, it is lost, and the decision or action that should have followed never occurs. This category of big data arrives continuously and often quickly, so we call it streaming data. Streaming data needs special attention because a sudden price change, a critical threshold being met, a rapidly changing sensor reading, or a blip in a log file can all be of immense value, but only if we are alerted in time.

There are four big names in big data technologies designed to handle time-sensitive, streaming data -- Kafka, Kinesis, Flume, and Storm. They are alike in their ability to process massive amounts of streaming data generated from social media, logging systems, click streams, Internet-of-Things devices, and so forth. However, each has a few distinctions, strengths, and weaknesses.

Kafka is one of the better-known streaming data technologies. Born at LinkedIn, Kafka is used by some big names in the industry, such as LinkedIn, Netflix, PayPal, Spotify, and Uber. In short, Kafka is a distributed messaging system that maintains feeds of messages called topics. Publishers write data to topics and subscribers read from topics. Kafka topics are partitioned and replicated across multiple nodes (brokers) in a Kafka cluster.

Kafka messages are simple byte arrays that can store objects in virtually any format. A key can be attached to each message so that all messages with the same key land in the same partition and are delivered to the same subscriber. Kafka is distinctive in how it treats each topic partition like a log file, in which every message is identified by a sequential offset. Subscribers track their own position (offset) within each log, which spares Kafka the per-subscriber bookkeeping and lets it serve large numbers of users and large volumes of data with little overhead.
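
The key-to-partition and offset ideas above can be sketched in a few lines of plain Python. This is a toy in-memory model, not Kafka's client API: the names `ToyTopic` and `ToySubscriber` are hypothetical, and the CRC32 hash is an assumption chosen only because it is deterministic (real Kafka clients use their own partitioner).

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for one topic: a list of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption: CRC32 modulo the partition count. Only the property
        # "same key, same partition" matters for this illustration.
        return zlib.crc32(key) % len(self.partitions)

    def publish(self, key, value):
        partition = self.partition_for(key)
        log = self.partitions[partition]
        log.append((key, value))
        # The offset is simply the message's position in the partition log.
        return partition, len(log) - 1


class ToySubscriber:
    """Tracks its own read position per partition; the topic keeps no reader state."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition, max_messages=10):
        start = self.offsets[partition]
        batch = self.topic.partitions[partition][start:start + max_messages]
        self.offsets[partition] = start + len(batch)
        return batch
```

Publishing several messages with the key `b"user-42"` returns the same partition number each time with offsets 0, 1, 2, and a subscriber that polls that partition twice receives the messages once, in order, because it remembers where it stopped.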

Kafka has a follow-on competitor -- Amazon Kinesis. Kafka and Kinesis are conceptually much the same. However, although Kafka is very fast and also free, you must do the work of turning it into an enterprise-class solution for your organization. Amazon filled that gap by offering Kinesis as a managed, out-of-the-box streaming data tool with the speed and scale of Kafka in an enterprise-ready package. Kinesis has shards -- what Kafka calls partitions -- that Amazon users pay for by the shard-hour and payload.
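
To make the shard-hour-plus-payload billing model concrete, here is a minimal arithmetic sketch. The function name and its parameters are hypothetical, and the rates in the usage note are placeholders rather than Amazon's prices; substitute the current published rates for a real estimate.

```python
def estimate_stream_cost(shards, hours, payload_units,
                         rate_per_shard_hour, rate_per_million_units):
    """Estimate a bill made of a per-shard-hour charge plus a per-payload charge.

    All rates are supplied by the caller; none are built in.
    """
    shard_cost = shards * hours * rate_per_shard_hour
    payload_cost = (payload_units / 1_000_000) * rate_per_million_units
    return shard_cost + payload_cost
```

For example, with placeholder rates of 0.5 per shard-hour and 0.25 per million payload units, `estimate_stream_cost(2, 24, 3_000_000, 0.5, 0.25)` adds 24.0 for the shards and 0.75 for the payload. The shard term grows with provisioned capacity whether or not data flows, while the payload term grows only with traffic.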

Apache Flume is also a service for collecting large amounts of streaming data, particularly logs. Kafka and Kinesis require consumers to pull data. Flume pushes data to consumers using mechanisms it calls data sinks. Flume can push data to many popular sinks right out of the box, including HDFS, HBase, Cassandra, and some relational databases. Thus, it's a quick starter, as opposed to Kafka, where you have to build your consumers' ability to plug into the data stream from scratch. Kafka provides event replication, meaning that if a node goes down, the others will pick up the slack and still make the data available. Flume does not. Thus, if your data is so mission-critical that any loss is unacceptable, Kafka is the way to go.
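
The push-versus-pull distinction can be illustrated with two small classes. This is a conceptual sketch only: `PullLog` and `PushChannel` are hypothetical names, and neither reproduces the actual Kafka, Kinesis, or Flume APIs.

```python
class PullLog:
    """Pull model: the log only stores events; each reader asks for what it wants."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset):
        # The reader decides when to call this and which offset to resume from.
        return self.events[offset:]


class PushChannel:
    """Push model: the channel hands every event to each registered sink."""

    def __init__(self):
        self.sinks = []

    def add_sink(self, sink):
        # A sink here is any callable that accepts one event.
        self.sinks.append(sink)

    def append(self, event):
        for sink in self.sinks:
            sink(event)


# Pull: nothing happens until the reader asks.
log = PullLog()
log.append("login")
log.append("logout")
unread = log.read_from(1)

# Push: the sink receives the event as soon as it is appended.
delivered = []
channel = PushChannel()
channel.add_sink(delivered.append)
channel.append("login")
```

In the pull model the reader controls its own pace and can re-read old events; in the push model delivery is immediate but the pace is set by the producer side, which is the trade-off to weigh when choosing between the two styles.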

Finally, there is Apache Storm, a distributed system for processing streaming data. Storm bridges batch processing and stream processing, the latter being something Hadoop is not natively designed to handle. A Storm topology runs continuously, processing a stream of incoming data as it arrives; its output can be diced into batches so Hadoop can more easily ingest it. Data sources are called spouts and each processing step is a bolt. Bolts perform computations and processes on the data, including pushing output to data stores and other services.
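
The spout-and-bolt pipeline described above can be sketched with plain Python generators. This is an analogy under stated assumptions, not Storm's API: the function names are hypothetical, and a real Storm topology runs its spouts and bolts in parallel across a cluster rather than in one process.

```python
def sentence_spout(lines):
    """Spout: the source of the stream, emitting one item at a time."""
    for line in lines:
        yield line


def split_bolt(stream):
    """Bolt: transforms each incoming sentence into individual lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def batch_bolt(stream, batch_size):
    """Bolt: groups the stream into fixed-size batches for a downstream store."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly smaller, batch


def run_topology(lines, batch_size):
    """Wire spout -> split bolt -> batch bolt and collect the emitted batches."""
    return list(batch_bolt(split_bolt(sentence_spout(lines)), batch_size))
```

Calling `run_topology(["Error disk full", "Warning retry"], 2)` pulls each line through the chain lazily and emits the words in groups of two, with the last group holding whatever is left over. Each stage only knows about the stream it receives, which is the property that lets a real topology spread stages across machines.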

If you have a streaming data use case, you have some architectural decisions to make regarding which solution you should choose. If you want a fault-tolerant, do-it-yourself solution and you have the developers to support it, go with Kafka. If you need something that works out of the box, choose Kinesis or Flume, once you decide whether push or pull makes more sense. Finally, if streaming data is, for now, just a small add-on to your already developed Hadoop environment, Storm is a good choice.

Streaming data offers an opportunity for real-time business value. Knowing who's who in streaming data technologies and which one best integrates with your infrastructure will help you make the right architectural decisions.

About the Author

Jake Dolezal

Dr. Jake Dolezal is practice leader of Analytics in Action at McKnight Consulting Group Global Services, where he is responsible for helping clients build programs around data and analytics. You can contact the author at [email protected].
