
Apache Kafka

From Wikipedia, the free encyclopedia

Apache Kafka
Original author(s): LinkedIn
Developer(s): Apache Software Foundation
Initial release: January 2011[1]
Stable release: 4.0.0 / January 15, 2025[2]
Repository: github.com/apache/kafka
Written in: Java, Scala
Operating system: Cross-platform
Platform: Java VM
Type: Stream processing, message broker, event streaming
License: Apache License 2.0
Website: kafka.apache.org

Apache Kafka is a distributed event store and stream-processing platform developed by the Apache Software Foundation. Written in Java and Scala, Kafka provides a unified, high-throughput, low-latency platform for handling real-time data feeds.[3] The project, originally developed at LinkedIn, has become one of the most widely adopted open-source stream processing systems, with thousands of companies using it for mission-critical applications.[4]

Overview

Apache Kafka functions as a distributed publish-subscribe messaging system that can handle high volumes of data with low latency. It combines aspects of messaging systems, storage systems, and stream processing platforms into a single solution.[5] Unlike traditional message brokers, Kafka stores streams of records in categories called topics, with each record consisting of a key, value, and timestamp.

The platform's architecture enables it to serve multiple roles within an organization's data infrastructure:

  • As a messaging system for decoupling producers and consumers
  • As a storage system for reliably storing data streams
  • As a stream processing platform for transforming data in real-time

Kafka achieves high performance through several design decisions, including the use of a distributed commit log, sequential disk I/O patterns, and zero-copy data transfer. The system can scale horizontally across commodity hardware, making it suitable for handling millions of messages per second.[6]

History

Origins at LinkedIn

Apache Kafka was created at LinkedIn in 2010 by Jay Kreps, Neha Narkhede, and Jun Rao to address the company's growing data pipeline needs.[7] The team faced challenges with existing messaging systems that couldn't handle LinkedIn's scale requirements for real-time data processing and analytics.

The name "Kafka" was chosen by Jay Kreps because it is "a system optimized for writing", and he was a fan of Franz Kafka's work.[8] The project was open-sourced in January 2011, allowing the broader community to contribute to its development.[1]

Apache Software Foundation

Kafka entered the Apache Incubator in November 2011 and graduated as a top-level Apache project on October 23, 2012.[9] This transition brought structured governance and a growing community of contributors from various organizations.

In 2014, the original creators left LinkedIn to found Confluent, a company focused on providing commercial support and developing ecosystem tools around Kafka. This move helped accelerate Kafka's adoption in enterprise environments while maintaining the open-source project's independence.[7]

Major releases and evolution

The evolution of Apache Kafka has been marked by several significant milestones:

Early versions (2011-2015): Focus on core messaging capabilities, establishing the fundamental architecture of topics, partitions, and consumer groups.

Version 0.9-0.11 (2015-2017): Introduction of Kafka Connect for data integration and Kafka Streams for stream processing, transforming Kafka from a messaging system into a complete streaming platform.[10]

Version 1.0-2.8 (2017-2021): Improvements in exactly-once semantics, enhanced security features, and the introduction of KRaft (Kafka Raft) mode as an alternative to ZooKeeper for metadata management.

Version 3.0-3.9 (2021-2024): Continued improvements in performance, security, and the gradual deprecation of ZooKeeper dependencies. Version 3.8 introduced production-ready tiered storage.[11]

Version 4.0 (January 2025): A landmark release that operates entirely without ZooKeeper, running exclusively in KRaft mode. This version also introduces "Queues for Kafka" (KIP-932) and the next-generation consumer rebalance protocol (KIP-848).[12]

Architecture

Core concepts

Apache Kafka's architecture is built around several key concepts that enable its distributed, fault-tolerant operation:[13]

Topics and Partitions: Data in Kafka is organized into topics, which are divided into partitions. Each partition is an ordered, immutable sequence of records that is continually appended to. Partitions allow topics to scale beyond a single server and provide parallelism for both producers and consumers.
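
For illustration, the following sketch creates a partitioned, replicated topic with Kafka's Java Admin client; the topic name, partition count, and replication factor are arbitrary example values, and the broker address assumes a local test cluster.

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.List;
    import java.util.Properties;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Address of any broker in the cluster (example value)
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // A topic with 6 partitions, each replicated to 3 brokers
                NewTopic topic = new NewTopic("page-views", 6, (short) 3);
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }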

Producers and Consumers: Producers publish data to topics, while consumers read from topics. Kafka maintains the order of records within a partition, but not across partitions. This design choice enables high throughput while still providing ordering guarantees where needed.
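
A minimal producer in Java might look like the following sketch; the topic name, key, value, and serializer choices are illustrative assumptions.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class ProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records with the same key are routed to the same partition,
                // preserving their relative order
                producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
            }
        }
    }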

Consumer Groups: Consumers can be organized into groups, where each partition is consumed by exactly one consumer within the group. This enables both queue-like and publish-subscribe messaging patterns within the same system.
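
A corresponding consumer sketch is shown below; all consumers configured with the same group.id divide the topic's partitions among themselves (the group name and topic are example values).

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            // Consumers sharing this group.id split the topic's partitions
            props.put("group.id", "page-view-processors");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("page-views"));
                while (true) {
                    ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("%s -> %s%n", record.key(), record.value());
                    }
                }
            }
        }
    }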

Brokers and Clusters: A Kafka cluster consists of multiple brokers (servers), each managing a subset of partitions. Data is replicated across multiple brokers for fault tolerance, with one broker serving as the leader for each partition.

KRaft architecture

With version 4.0, Kafka operates exclusively in KRaft (Kafka Raft) mode, eliminating the dependency on ZooKeeper for metadata management; a configuration sketch follows the list below.[14] In KRaft mode:

  • Metadata is managed by a subset of Kafka brokers running in controller mode
  • The Raft consensus protocol ensures consistency across controllers
  • Operational complexity is reduced by eliminating a separate system
  • Scalability is improved, supporting millions of partitions per cluster
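
A minimal configuration for a single node acting as both broker and controller might look like the following server.properties sketch; the node ID, host names, and ports are illustrative assumptions, not recommended values.

    # KRaft mode: this node serves as both broker and Raft controller
    process.roles=broker,controller
    node.id=1
    # Voting members of the Raft metadata quorum (id@host:port)
    controller.quorum.voters=1@localhost:9093
    listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
    controller.listener.names=CONTROLLER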

Data flow and guarantees

Kafka provides several important guarantees that make it suitable for mission-critical applications:[5]

Ordering: Records within a partition maintain their order from producer to consumer.

Durability: Records are persisted to disk and replicated for fault tolerance. Kafka can be configured to ensure data is not lost even if multiple brokers fail.

Delivery semantics: Kafka supports at-least-once, at-most-once, and exactly-once delivery semantics, configurable based on application requirements.
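
These semantics are selected through standard producer settings. The sketch below configures a producer for exactly-once delivery via idempotence and transactions; the transactional ID and serializers are example values.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import java.util.Properties;

    public class ExactlyOnceConfig {
        public static KafkaProducer<String, String> create() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all");                 // wait for all in-sync replicas
            props.put("enable.idempotence", "true");  // suppress duplicates from retries
            props.put("transactional.id", "example-tx-1"); // enables transactions
            return new KafkaProducer<>(props);
        }
    }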

Performance: Through techniques like batching, compression, and zero-copy transfer, Kafka achieves high throughput (millions of messages per second) with low latency (single-digit milliseconds).
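
Batching and compression are likewise exposed as ordinary producer settings; the fragment below extends the Properties object from the previous sketch with illustrative values, not tuning recommendations.

    // Illustrative tuning values, added to the props object above
    props.put("batch.size", "65536");       // accumulate up to 64 KiB per partition batch
    props.put("linger.ms", "10");           // wait up to 10 ms for a batch to fill
    props.put("compression.type", "lz4");   // compress whole batches before sending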

Features

Kafka Connect

Kafka Connect is a framework for connecting Kafka with external systems.[15] Introduced in version 0.9.0.0, it provides:

  • A standard framework for building and running connectors
  • Distributed and standalone modes of operation
  • REST interface for managing connectors
  • Automatic offset management and fault tolerance
  • Transformation capabilities for data in flight

The Kafka ecosystem includes hundreds of connectors for popular systems including databases, cloud storage services, search indexes, and other messaging systems. While Apache Kafka provides the Connect framework, production-ready connectors are typically maintained by the community or commercial vendors.
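
As an illustration of the REST interface, the sketch below registers a connector with a Connect worker on its default port (8083). The FileStreamSourceConnector class shown ships with Kafka as a simple example; the connector name, file path, and topic are assumptions.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RegisterConnector {
        public static void main(String[] args) throws Exception {
            // Connector configuration as JSON; values are illustrative
            String body = """
                {"name": "demo-file-source",
                 "config": {
                   "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                   "file": "/tmp/input.txt",
                   "topic": "file-lines"}}""";
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }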

Kafka Streams

Kafka Streams is a client library for building stream processing applications.[16] Key features include:

  • A high-level DSL for common operations (filter, map, aggregate, join)
  • Exactly-once processing semantics
  • Stateful processing with local state stores backed by RocksDB
  • Fault tolerance through state replication to Kafka topics
  • Elastic scaling without downtime

The library enables developers to build sophisticated stream processing applications that run as standard Java applications, without requiring a separate processing cluster.
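
A minimal word-count topology using the DSL might look like this sketch; the application ID, topic names, and serde choices are example assumptions.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;
    import java.util.Arrays;
    import java.util.Properties;

    public class WordCountExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> lines = builder.stream("text-input");
            lines.flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                 .groupBy((key, word) -> word)   // repartition by word
                 .count()                        // stateful aggregation in a local store
                 .toStream()
                 .to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

            new KafkaStreams(builder.build(), props).start();
        }
    }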

Queues for Kafka

Introduced in version 4.0, "Queues for Kafka" (KIP-932) adds share groups as an alternative to consumer groups.[17] This feature enables:

  • Queue-like semantics with individual message acknowledgment
  • Multiple consumers processing from the same partitions cooperatively
  • Solutions to the "over-partitioning" problem
  • Better support for work-queue patterns

Share groups allow the number of consumers to exceed the partition count, addressing a long-standing limitation of consumer groups and making Kafka more suitable for traditional queuing use cases.
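
KIP-932 proposes a dedicated share consumer client with per-record acknowledgment. The sketch below follows the interface described in the KIP (KafkaShareConsumer); since the feature shipped as early access, the exact API may differ, and all names and values here are illustrative.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaShareConsumer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ShareGroupSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "order-workers"); // interpreted as a share group
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            // API per KIP-932; early-access releases may differ in detail
            try (KafkaShareConsumer<String, String> consumer = new KafkaShareConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    for (ConsumerRecord<String, String> record :
                            consumer.poll(Duration.ofMillis(500))) {
                        process(record);
                        consumer.acknowledge(record); // per-record acknowledgment
                    }
                    consumer.commitSync(); // make the acknowledgments durable
                }
            }
        }
        static void process(ConsumerRecord<String, String> record) { /* ... */ }
    }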

Security

Kafka provides a range of security features; a client configuration sketch follows the list:[18]

  • Authentication via SASL (including Kerberos) and SSL/TLS
  • Authorization through Access Control Lists (ACLs)
  • Encryption of data in transit via SSL/TLS, with encryption at rest typically provided by the underlying storage layer or by commercial distributions
  • Audit logging for compliance requirements
  • Integration with external security systems
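
The following sketch configures a client for SASL over TLS; the broker address, SASL mechanism, credentials, and trust store paths are example assumptions.

    import java.util.Properties;

    public class SecureClientConfig {
        public static Properties create() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker.example.com:9093"); // TLS port (example)
            // Encrypt traffic and authenticate via SASL over TLS
            props.put("security.protocol", "SASL_SSL");
            props.put("sasl.mechanism", "SCRAM-SHA-512");
            props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                + "username=\"client\" password=\"change-me\";");
            // Trust store used to verify the brokers' certificates
            props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
            props.put("ssl.truststore.password", "change-me");
            return props;
        }
    }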

Ecosystem and adoption

Commercial distributions

Several companies provide commercial distributions and managed services for Kafka:

  • Confluent: Founded by Kafka's creators, offers Confluent Platform with additional enterprise features
  • AWS: Amazon Managed Streaming for Apache Kafka (MSK)
  • Azure: Azure Event Hubs with Kafka protocol support
  • Cloudera: Included in Cloudera Data Platform
  • Red Hat: AMQ Streams based on the Strimzi project

Use cases

Apache Kafka is used across industries for various applications:[19]

  • Financial services: Real-time fraud detection, transaction processing, regulatory reporting
  • Retail: Inventory management, real-time analytics, recommendation systems
  • Technology: Log aggregation, metrics collection, event sourcing
  • Telecommunications: Call detail record processing, network monitoring
  • Transportation: Real-time tracking, route optimization, IoT data processing

Notable users include Netflix, Uber, LinkedIn, Twitter, Airbnb, and thousands of other organizations processing trillions of messages daily.

Community and development

Apache Kafka has an active open-source community with:[20]

  • Regular release cycle (approximately every 4 months)
  • Hundreds of contributors from various organizations
  • Kafka Improvement Proposals (KIPs) process for major changes
  • Annual Kafka Summit conferences
  • Extensive documentation and training resources

The project follows the Apache Software Foundation's governance model, with a Project Management Committee (PMC) overseeing development and community growth.

References

  1. ^ a b "Open-sourcing Kafka, LinkedIn's distributed message queue". Archived from the original on December 26, 2022. Retrieved October 27, 2016.
  2. ^ "Apache Kafka Downloads". kafka.apache.org. Retrieved May 29, 2025.
  3. ^ "Introduction to Apache Kafka". kafka.apache.org. Retrieved May 29, 2025.
  4. ^ "Apache Kafka End of Life". endoflife.date. Retrieved May 29, 2025.
  5. ^ a b Narkhede, Neha; Shapira, Gwen; Palino, Todd (2017). Kafka: the definitive guide: real-time data and stream processing at scale. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-3616-0.
  6. ^ "Efficiency". kafka.apache.org. Retrieved May 29, 2025.
  7. ^ a b Li, S. (May 11, 2020). "He Left His High-Paying Job At LinkedIn And Then Built A $4.5 Billion Business In A Niche You've Never Heard Of". Forbes. Archived from the original on January 31, 2023. Retrieved May 29, 2025.
  8. ^ Narkhede, Neha; Shapira, Gwen; Palino, Todd (2017). "Chapter 1". Kafka: The Definitive Guide. O'Reilly. ISBN 978-1-4919-3611-5. People often ask how Kafka got its name and if it has anything to do with the application itself. Jay Kreps offered the following insight: "I thought that since Kafka was a system optimized for writing, using a writer's name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka."
  9. ^ "Apache Incubator: Kafka Incubation Status". Archived from the original on October 17, 2022. Retrieved October 17, 2022.
  10. ^ "Introducing Kafka Streams". Confluent. March 10, 2016. Retrieved May 29, 2025.
  11. ^ "Supported Apache Kafka versions". AWS Documentation. Retrieved May 29, 2025.
  12. ^ "Apache Kafka 4.0 Release: Default KRaft, Queues, Faster Rebalances". Confluent. January 15, 2025. Retrieved May 29, 2025.
  13. ^ "Design". kafka.apache.org. Retrieved May 29, 2025.
  14. ^ "KRaft: Apache Kafka Without ZooKeeper". Confluent Developer. Retrieved May 29, 2025.
  15. ^ "Apache Kafka Documentation: Kafka Connect". Apache.
  16. ^ "Kafka Streams". kafka.apache.org. Retrieved May 29, 2025.
  17. ^ "KIP-932: Queues for Kafka". Apache Software Foundation. Retrieved May 29, 2025.
  18. ^ "Security". kafka.apache.org. Retrieved May 29, 2025.
  19. ^ "Powered by Kafka". kafka.apache.org. Retrieved May 29, 2025.
  20. ^ "Community". kafka.apache.org. Retrieved May 29, 2025.

Further reading

  • Narkhede, Neha; Shapira, Gwen; Palino, Todd (2017). Kafka: The Definitive Guide. O'Reilly Media. ISBN 978-1-4919-3616-0.
  • Kreps, Jay (2014). I Heart Logs: Event Data, Stream Processing, and Data Integration. O'Reilly Media. ISBN 978-1-4919-0932-4.