
Apache Kafka

From Wikipedia, the free encyclopedia

Apache Kafka
Original author(s): LinkedIn
Developer(s): Apache Software Foundation
Initial release: January 2011[1]
Stable release: 4.0.0 / January 15, 2025[2]
Repository: github.com/apache/kafka
Written in: Java, Scala
Operating system: Cross-platform
Platform: Java VM
Type: Stream processing, message broker, event streaming
License: Apache License 2.0
Website: kafka.apache.org

Apache Kafka is a distributed event store and stream-processing platform developed by the Apache Software Foundation. Written in Java and Scala, Kafka provides a unified, high-throughput, low-latency platform for handling real-time data feeds.[3] The project, originally developed at LinkedIn, has become one of the most widely adopted open-source stream processing systems, with thousands of companies using it for mission-critical applications.[4]

Overview

Apache Kafka functions as a distributed publish-subscribe messaging system that can handle high volumes of data with low latency. It combines aspects of messaging systems, storage systems, and stream processing platforms into a single solution.[5] Unlike traditional message brokers, Kafka stores streams of records in categories called topics, with each record consisting of a key, value, and timestamp.

The platform's architecture enables it to serve multiple roles within an organization's data infrastructure:

  • As a messaging system for decoupling producers and consumers
  • As a storage system for reliably storing data streams
  • As a stream processing platform for transforming data in real-time

Kafka achieves high performance through several design decisions, including the use of a distributed commit log, sequential disk I/O patterns, and zero-copy data transfer. The system can scale horizontally across commodity hardware, making it suitable for handling millions of messages per second.[6]

History

Origins at LinkedIn

Apache Kafka was created at LinkedIn in 2010 by Jay Kreps, Neha Narkhede, and Jun Rao to address the company's growing data pipeline needs.[7] The team faced challenges with existing messaging systems that couldn't handle LinkedIn's scale requirements for real-time data processing and analytics.

The name "Kafka" was chosen by Jay Kreps because it is "a system optimized for writing", and he was a fan of Franz Kafka's work.[8] The project was open-sourced in January 2011, allowing the broader community to contribute to its development.[1]

Apache Software Foundation

Kafka entered the Apache Incubator in November 2011 and graduated as a top-level Apache project on October 23, 2012.[9] This transition brought structured governance and a growing community of contributors from various organizations.

In 2014, the original creators left LinkedIn to found Confluent, a company focused on providing commercial support and developing ecosystem tools around Kafka. This move helped accelerate Kafka's adoption in enterprise environments while maintaining the open-source project's independence.[7]

Major releases and evolution

The evolution of Apache Kafka has been marked by several significant milestones:

Early versions (2011-2015): Focus on core messaging capabilities, establishing the fundamental architecture of topics, partitions, and consumer groups.

Version 0.9-0.11 (2015-2017): Introduction of Kafka Connect for data integration and Kafka Streams for stream processing, transforming Kafka from a messaging system into a complete streaming platform.[10]

Version 1.0-2.8 (2017-2021): Improvements in exactly-once semantics, enhanced security features, and the introduction of KRaft (Kafka Raft) mode as an alternative to ZooKeeper for metadata management.

Version 3.0-3.9 (2021-2024): Continued improvements in performance, security, and the gradual deprecation of ZooKeeper dependencies. Version 3.8 introduced production-ready tiered storage.[11]

Version 4.0 (January 2025): A landmark release that operates entirely without ZooKeeper, running exclusively in KRaft mode. This version also introduces "Queues for Kafka" (KIP-932) and the next-generation consumer rebalance protocol (KIP-848).[12]

Architecture

Core concepts

Apache Kafka's architecture is built around several key concepts that enable its distributed, fault-tolerant operation:[13]

Topics and Partitions: Data in Kafka is organized into topics, which are divided into partitions. Each partition is an ordered, immutable sequence of records that is continually appended to. Partitions allow topics to scale beyond a single server and provide parallelism for both producers and consumers.
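
For illustration, the following sketch creates a partitioned, replicated topic with Kafka's Java Admin client; the topic name, partition count, and replication factor are arbitrary example values, and the broker address assumes a local test cluster.

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.List;
    import java.util.Properties;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Address of any broker in the cluster (example value)
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // A topic with 6 partitions, each replicated to 3 brokers
                NewTopic topic = new NewTopic("page-views", 6, (short) 3);
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }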

Producers and Consumers: Producers publish data to topics, while consumers read from topics. Kafka maintains the order of records within a partition, but not across partitions. This design choice enables high throughput while still providing ordering guarantees where needed.
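
A minimal producer in Java might look like the following sketch; the topic name, key, value, and serializer choices are illustrative assumptions.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class ProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records with the same key are routed to the same partition,
                // preserving their relative order
                producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
            }
        }
    }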

Consumer Groups: Consumers can be organized into groups, where each partition is consumed by exactly one consumer within the group. This enables both queue-like and publish-subscribe messaging patterns within the same system.
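
A corresponding consumer sketch is shown below; all consumers configured with the same group.id divide the topic's partitions among themselves (the group name and topic are example values).

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            // Consumers sharing this group.id split the topic's partitions
            props.put("group.id", "page-view-processors");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("page-views"));
                while (true) {
                    ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("%s -> %s%n", record.key(), record.value());
                    }
                }
            }
        }
    }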

Brokers and Clusters: A Kafka cluster consists of multiple brokers (servers), each managing a subset of partitions. Data is replicated across multiple brokers for fault tolerance, with one broker serving as the leader for each partition.

KRaft architecture

With version 4.0, Kafka operates exclusively in KRaft (Kafka Raft) mode, eliminating the dependency on ZooKeeper for metadata management; a configuration sketch follows the list below.[14] In KRaft mode:

  • Metadata is managed by a subset of Kafka brokers running in controller mode
  • The Raft consensus protocol ensures consistency across controllers
  • Operational complexity is reduced by eliminating a separate system
  • Scalability is improved, supporting millions of partitions per cluster
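
A minimal configuration for a single node acting as both broker and controller might look like the following server.properties sketch; the node ID, host names, and ports are illustrative assumptions, not recommended values.

    # KRaft mode: this node serves as both broker and Raft controller
    process.roles=broker,controller
    node.id=1
    # Voting members of the Raft metadata quorum (id@host:port)
    controller.quorum.voters=1@localhost:9093
    listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
    controller.listener.names=CONTROLLER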

Data flow and guarantees

Kafka provides several important guarantees that make it suitable for mission-critical applications:[5]

Ordering: Records within a partition maintain their order from producer to consumer.

Durability: Records are persisted to disk and replicated for fault tolerance. Kafka can be configured to ensure data is not lost even if multiple brokers fail.

Delivery semantics: Kafka supports at-least-once, at-most-once, and exactly-once delivery semantics, configurable based on application requirements.
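
These semantics are selected through standard producer settings. The sketch below configures a producer for exactly-once delivery via idempotence and transactions; the transactional ID and serializers are example values.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import java.util.Properties;

    public class ExactlyOnceConfig {
        public static KafkaProducer<String, String> create() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all");                 // wait for all in-sync replicas
            props.put("enable.idempotence", "true");  // suppress duplicates from retries
            props.put("transactional.id", "example-tx-1"); // enables transactions
            return new KafkaProducer<>(props);
        }
    }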

Performance: Through techniques like batching, compression, and zero-copy transfer, Kafka achieves high throughput (millions of messages per second) with low latency (single-digit milliseconds).
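
Batching and compression are likewise exposed as ordinary producer settings; the fragment below extends the Properties object from the previous sketch with illustrative values, not tuning recommendations.

    // Illustrative tuning values, added to the props object above
    props.put("batch.size", "65536");       // accumulate up to 64 KiB per partition batch
    props.put("linger.ms", "10");           // wait up to 10 ms for a batch to fill
    props.put("compression.type", "lz4");   // compress whole batches before sending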

Features

Kafka Connect

Kafka Connect is a framework for connecting Kafka with external systems.[15] Introduced in version 0.9.0.0, it provides:

  • A standard framework for building and running connectors
  • Distributed and standalone modes of operation
  • REST interface for managing connectors
  • Automatic offset management and fault tolerance
  • Transformation capabilities for data in flight

The Kafka ecosystem includes hundreds of connectors for popular systems including databases, cloud storage services, search indexes, and other messaging systems. While Apache Kafka provides the Connect framework, production-ready connectors are typically maintained by the community or commercial vendors.
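
As an illustration of the REST interface, the sketch below registers a connector with a Connect worker on its default port (8083). The FileStreamSourceConnector class shown ships with Kafka as a simple example; the connector name, file path, and topic are assumptions.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RegisterConnector {
        public static void main(String[] args) throws Exception {
            // Connector configuration as JSON; values are illustrative
            String body = """
                {"name": "demo-file-source",
                 "config": {
                   "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                   "file": "/tmp/input.txt",
                   "topic": "file-lines"}}""";
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }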

Kafka Streams

Kafka Streams is a client library for building stream processing applications.[16] Key features include:

  • A high-level DSL for common operations (filter, map, aggregate, join)
  • Exactly-once processing semantics
  • Stateful processing with local state stores backed by RocksDB
  • Fault tolerance through state replication to Kafka topics
  • Elastic scaling without downtime

The library enables developers to build sophisticated stream processing applications that run as standard Java applications, without requiring a separate processing cluster.
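
A minimal word-count topology using the DSL might look like this sketch; the application ID, topic names, and serde choices are example assumptions.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;
    import java.util.Arrays;
    import java.util.Properties;

    public class WordCountExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> lines = builder.stream("text-input");
            lines.flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                 .groupBy((key, word) -> word)   // repartition by word
                 .count()                        // stateful aggregation in a local store
                 .toStream()
                 .to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

            new KafkaStreams(builder.build(), props).start();
        }
    }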

Queues for Kafka

Introduced in version 4.0, "Queues for Kafka" (KIP-932) adds share groups as an alternative to consumer groups.[17] This feature enables:

  • Queue-like semantics with individual message acknowledgment
  • Multiple consumers processing from the same partitions cooperatively
  • Solutions to the "over-partitioning" problem
  • Better support for work-queue patterns

Share groups allow the number of consumers to exceed the partition count, addressing a long-standing limitation of consumer groups and making Kafka more suitable for traditional queuing use cases.
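
KIP-932 proposes a dedicated share consumer client with per-record acknowledgment. The sketch below follows the interface described in the KIP (KafkaShareConsumer); since the feature shipped as early access, the exact API may differ, and all names and values here are illustrative.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaShareConsumer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ShareGroupSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "order-workers"); // interpreted as a share group
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            // API per KIP-932; early-access releases may differ in detail
            try (KafkaShareConsumer<String, String> consumer = new KafkaShareConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    for (ConsumerRecord<String, String> record :
                            consumer.poll(Duration.ofMillis(500))) {
                        process(record);
                        consumer.acknowledge(record); // per-record acknowledgment
                    }
                    consumer.commitSync(); // make the acknowledgments durable
                }
            }
        }
        static void process(ConsumerRecord<String, String> record) { /* ... */ }
    }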

Security

Kafka provides a range of security features; a client configuration sketch follows the list:[18]

  • Authentication via SASL (including Kerberos) and SSL/TLS
  • Authorization through Access Control Lists (ACLs)
  • Encryption of data in transit via SSL/TLS, with encryption at rest typically provided by the underlying storage layer or by commercial distributions
  • Audit logging for compliance requirements
  • Integration with external security systems
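
The following sketch configures a client for SASL over TLS; the broker address, SASL mechanism, credentials, and trust store paths are example assumptions.

    import java.util.Properties;

    public class SecureClientConfig {
        public static Properties create() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker.example.com:9093"); // TLS port (example)
            // Encrypt traffic and authenticate via SASL over TLS
            props.put("security.protocol", "SASL_SSL");
            props.put("sasl.mechanism", "SCRAM-SHA-512");
            props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                + "username=\"client\" password=\"change-me\";");
            // Trust store used to verify the brokers' certificates
            props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
            props.put("ssl.truststore.password", "change-me");
            return props;
        }
    }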

Ecosystem and adoption

Commercial distributions

Several companies provide commercial distributions and managed services for Kafka:

  • Confluent: Founded by Kafka's creators, offers Confluent Platform with additional enterprise features
  • AWS: Amazon Managed Streaming for Apache Kafka (MSK)
  • Azure: Azure Event Hubs with Kafka protocol support
  • Cloudera: Included in Cloudera Data Platform
  • Red Hat: AMQ Streams based on the Strimzi project

Use cases

Apache Kafka is used across industries for various applications:[19]

  • Financial services: Real-time fraud detection, transaction processing, regulatory reporting
  • Retail: Inventory management, real-time analytics, recommendation systems
  • Technology: Log aggregation, metrics collection, event sourcing
  • Telecommunications: Call detail record processing, network monitoring
  • Transportation: Real-time tracking, route optimization, IoT data processing

Notable users include Netflix, Uber, LinkedIn, Twitter, Airbnb, and thousands of other organizations processing trillions of messages daily.

Community and development

Apache Kafka has an active open-source community with:[20]

  • Regular release cycle (approximately every 4 months)
  • Hundreds of contributors from various organizations
  • Kafka Improvement Proposals (KIPs) process for major changes
  • Annual Kafka Summit conferences
  • Extensive documentation and training resources

The project follows the Apache Software Foundation's governance model, with a Project Management Committee (PMC) overseeing development and community growth.

References

  1. ^ a b "Open-sourcing Kafka, LinkedIn's distributed message queue". Archived from the original on December 26, 2022. Retrieved October 27, 2016.
  2. ^ "Apache Kafka Downloads". kafka.apache.org. Retrieved May 29, 2025.
  3. ^ "Introduction to Apache Kafka". kafka.apache.org. Retrieved May 29, 2025.
  4. ^ "Apache Kafka End of Life". endoflife.date. Retrieved May 29, 2025.
  5. ^ a b Narkhede, Neha; Shapira, Gwen; Palino, Todd (2017). Kafka: the definitive guide: real-time data and stream processing at scale. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-3616-0.
  6. ^ "Efficiency". kafka.apache.org. Retrieved May 29, 2025.
  7. ^ a b Li, S. (May 11, 2020). "He Left His High-Paying Job At LinkedIn And Then Built A $4.5 Billion Business In A Niche You've Never Heard Of". Forbes. Archived from the original on January 31, 2023. Retrieved May 29, 2025.
  8. ^ Narkhede, Neha; Shapira, Gwen; Palino, Todd (2017). "Chapter 1". Kafka: The Definitive Guide. O'Reilly. ISBN 978-1-4919-3611-5. People often ask how Kafka got its name and if it has anything to do with the application itself. Jay Kreps offered the following insight: "I thought that since Kafka was a system optimized for writing, using a writer's name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka."
  9. ^ "Apache Incubator: Kafka Incubation Status". Archived from the original on October 17, 2022. Retrieved October 17, 2022.
  10. ^ "Introducing Kafka Streams". Confluent. March 10, 2016. Retrieved May 29, 2025.
  11. ^ "Supported Apache Kafka versions". AWS Documentation. Retrieved May 29, 2025.
  12. ^ "Apache Kafka 4.0 Release: Default KRaft, Queues, Faster Rebalances". Confluent. January 15, 2025. Retrieved May 29, 2025.
  13. ^ "Design". kafka.apache.org. Retrieved May 29, 2025.
  14. ^ "KRaft: Apache Kafka Without ZooKeeper". Confluent Developer. Retrieved May 29, 2025.
  15. ^ "Apache Kafka Documentation: Kafka Connect". Apache.
  16. ^ "Kafka Streams". kafka.apache.org. Retrieved May 29, 2025.
  17. ^ "KIP-932: Queues for Kafka". Apache Software Foundation. Retrieved May 29, 2025.
  18. ^ "Security". kafka.apache.org. Retrieved May 29, 2025.
  19. ^ "Powered by Kafka". kafka.apache.org. Retrieved May 29, 2025.
  20. ^ "Community". kafka.apache.org. Retrieved May 29, 2025.

Further reading

  • Narkhede, Neha; Shapira, Gwen; Palino, Todd (2017). Kafka: The Definitive Guide. O'Reilly Media. ISBN 978-1-4919-3616-0.
  • Kreps, Jay (2014). I Heart Logs: Event Data, Stream Processing, and Data Integration. O'Reilly Media. ISBN 978-1-4919-0932-4.