
Apache Kafka: A Deep Architectural and Functional Analysis of a Distributed Event Streaming Platform

Part I: Genesis and Evolution: The Making of a De Facto Standard

1.1 The LinkedIn Imperative: A Crisis of Scale

In the technological landscape of the late 2000s and early 2010s, the professional
networking platform LinkedIn found itself at the epicenter of a data explosion. The
company was experiencing an exponential surge in the volume of digitized
information, encompassing not only traditional transactional records like user profiles
and job histories but, more significantly, a torrent of user activity data.1 Every click,
search, profile view, and connection request represented a valuable event that
needed to be captured and leveraged in real-time to power core platform features like
news feeds, analytics, and recommendation engines.1

This real-time requirement created a profound technical crisis. The existing data
infrastructure paradigms of the era were fundamentally ill-equipped to handle
LinkedIn's burgeoning scale and velocity. The available options presented a critical
gap 1:
1.​ Databases: Traditional relational database management systems were optimized
for "data at rest"—the persistent storage and structured querying of information.
They excelled at transactional integrity but were too slow and cumbersome for
the high-throughput, low-latency ingestion and processing demanded by
real-time data feeds.1
2.​ Traditional Messaging Systems: Message-oriented middleware, or message
queues, were designed for "data in motion," facilitating asynchronous
communication between applications. However, these systems were typically
architected around a single, centralized broker node. While effective for many
enterprise integration patterns, they were not built for the hyper-scale LinkedIn
required. Attempting to funnel a data volume that was projected to grow by a
factor of 1,000 through a single-node system would inevitably lead to
catastrophic failure.1 Furthermore, these systems lacked the crucial capabilities
of long-term data retention and message replayability, which were becoming
essential for complex analytics and reprocessing scenarios.2

Faced with this "mismatch" between their needs and the available technology, a team
of engineers at LinkedIn—Jay Kreps, Neha Narkhede, and Jun Rao—embarked on
creating a new system from the ground up in 2010.1 The project, christened "Kafka" by
Jay Kreps in homage to the author Franz Kafka, was conceived with a singular design
tenet: to be "a system optimized for writing".3 After approximately a year of
development, the first version was deployed within LinkedIn, where it rapidly became
the central nervous system of the company's data architecture, integrating hundreds
of microservices and data systems in real-time.1

1.2 From Open Source Project to Global Platform

The transformative potential of Kafka was evident far beyond the walls of LinkedIn. In
early 2011, the project was open-sourced and contributed to the Apache Software
Foundation, where it graduated from incubation to a top-level project on October 23,
2012.3 This strategic move catalyzed its adoption across the technology industry.

A pivotal moment in Kafka's trajectory occurred in 2014 when its original creators—Kreps, Narkhede, and Rao—departed from LinkedIn to found Confluent.1
This new company was established with the express purpose of focusing on Kafka's
continued development and building an enterprise-grade, cloud-native platform
around its open-source core. The commercial backing and dedicated engineering
focus provided by Confluent were instrumental in maturing Kafka and fostering its rich
ecosystem.

This period marked a fundamental shift in Kafka's identity. It evolved rapidly from
being perceived as a powerful but specialized "messaging queue" to a "full-fledged
event streaming platform".2 This was not merely a semantic rebranding but a reflection
of a vastly expanded feature set that went far beyond simple message transport. The
integration of durable, long-term storage, native stream processing capabilities via
the Kafka Streams library, and seamless data integration with external systems
through Kafka Connect transformed it into a comprehensive platform for handling
data in motion.3

Today, Apache Kafka is a cornerstone of modern data infrastructure, used by over 80% of Fortune 100 companies.1 It powers mission-critical systems at global
enterprises like Netflix, Uber, Microsoft, and Goldman Sachs, handling trillions of
events per day and serving as the de facto standard for event streaming.8

1.3 Core Design Philosophy: The Distributed Commit Log

To truly understand Kafka's architecture and capabilities, one must grasp its central,
foundational abstraction: the distributed, partitioned, and replicated commit log.2
This is not merely one feature among many; it is the architectural DNA from which all
of Kafka's other properties—its performance, scalability, durability, and unique
messaging model—are derived.

A commit log is a simple, append-only data structure. In Kafka, data is always written
to the end of the log, making the data records immutable and the write operations
extremely fast, as they leverage the efficiency of sequential disk I/O rather than the
slower random-access patterns required by traditional databases.14 This aligns
directly with its original design goal of being "optimized for writing".3

This architectural choice has profound implications. The most significant is the
complete decoupling of data producers from data consumers.8 In a traditional
message queue, the broker often manages the state of message delivery, tracking
which consumer has received and acknowledged each message. This creates a tight
coupling and a potential bottleneck at the broker. Kafka's commit log model
obliterates this paradigm. The broker's job is simplified to its essence: append records
to the log and replicate them for fault tolerance. It does not track consumer state.

Instead, the responsibility for tracking consumption progress is shifted entirely to the
client side. Each consumer is responsible for managing its own position, or offset,
within the log.3 This "dumb broker/smart client" philosophy 20 means that consumers
can read data at their own pace, rewind to re-process historical data, or have multiple
independent consumer applications read from the same data stream without
interfering with one another. This capability for data replay is what fundamentally
elevates Kafka from a transient messaging system to a durable event streaming
platform, capable of serving both real-time and historical data processing needs.2

The decision to build Kafka around a distributed commit log was a direct and elegant
solution to the scaling crisis that birthed it. By simplifying the broker's responsibilities,
the architecture inherently supports massive horizontal scalability; adding more
brokers to a cluster is a straightforward way to increase capacity because the core
logic of each node remains simple.2 This same design choice necessitates the
consumer model of groups and offset management, which in turn provides the unique
ability to function as both a queue and a pub/sub system simultaneously. Ultimately,
the distributed log is the architectural north star that has guided Kafka's entire
evolution.

Part II: Core Architectural Deconstruction

The power and scalability of Apache Kafka stem from a set of well-defined
architectural components that work in concert. Understanding the role of each
component and their interactions is essential for designing, deploying, and managing
robust Kafka-based systems.

2.1 The Kafka Cluster: Brokers and Controllers

A Kafka deployment consists of a cluster of one or more servers, where each server is
referred to as a broker.2 These brokers form the backbone of the Kafka system. Their
primary responsibilities are to receive streams of records from producer clients,
assign them to the correct partitions, store them durably on disk, and serve them to
consumer clients upon request.13 Each broker in the cluster is identified by a unique,
integer-based ID and is responsible for a subset of the partitions in the cluster,
ensuring a balanced distribution of load.16

Within this cluster, one broker is dynamically elected to take on the additional role of
the controller.24 The controller acts as the administrative brain of the cluster. It is
responsible for managing the state of all resources, including topics, partitions, and
replicas. Its key duties include handling broker failures, performing leader elections for
partitions when a leader broker goes down, and managing the addition or removal of
brokers from the cluster.24 By centralizing these state management tasks in a single
controller, Kafka ensures that cluster-wide state changes are handled efficiently and
without race conditions. If the controller broker fails, a new controller is automatically
elected from the remaining healthy brokers in the cluster.25

Clients (producers and consumers) do not need to know the entire topology of the
cluster. They initiate a connection with one or more designated bootstrap servers.
The bootstrap server responds with metadata about the complete cluster, including
the addresses of all brokers and which broker is the leader for which partition. Armed
with this metadata, the client can then establish direct connections to the appropriate
brokers to send or receive data.27

2.2 Topics, Partitions, and Offsets: The Data Abstraction Layer

Kafka organizes streams of records into categories called topics.2 A topic is a logical
name that producers publish to and consumers subscribe from. It can be
conceptualized as a feed, analogous to a table in a relational database or a folder in a
filesystem.18

To achieve scalability and parallelism, each topic is divided into one or more
partitions.5 A partition is the fundamental unit of storage and parallelism in Kafka.
Each partition is an ordered, immutable sequence of records—effectively a structured
commit log in its own right.16 When a producer sends a record to a topic, it is
ultimately stored in one of these partitions. This partitioning allows a topic's data and
processing load to be split across multiple brokers in the cluster.16 The number of
partitions for a topic is a critical configuration parameter, as it dictates the maximum
level of parallelism for consumers within a consumer group; a group cannot have more
active consumers than the number of partitions for a topic.31

Within each partition, every record is assigned a unique, immutable, and sequential
integer known as an offset.5 The offset serves as a unique identifier for a record within
its partition. For example, the first record in a partition has an offset of 0, the second
has an offset of 1, and so on. This simple, ordered structure is what allows consumers
to reliably track their read position and enables Kafka to provide its ordering
guarantees.3 The combination of topic name, partition number, and offset uniquely
identifies any record in a Kafka cluster.34

2.3 The Producer Client: Writing Data to Kafka

Producers are the client applications responsible for publishing, or writing, streams of
events to Kafka topics.2 A Kafka record produced by these clients is a key-value pair,
accompanied by a timestamp and optional, user-defined headers.5 Both the key and
the value are serialized into byte arrays by the producer before being transmitted over
the network to the broker.27

A critical responsibility of the producer is to determine which partition of a topic to send a given record to.19 This decision directly impacts load balancing and message
ordering. Kafka provides several strategies for this:
●​ Direct Partition Assignment: A producer can explicitly specify the target
partition number when creating a ProducerRecord. This gives the developer
absolute control over data placement but requires the application to manage its
own partitioning logic.37
●​ Key-based Partitioning: This is the most common and powerful strategy. When
a record is produced with a non-null key, the producer's partitioner computes a
hash of the key (using the murmur2 algorithm by default) and maps it to a
partition using the formula partition = hash(key) mod numPartitions.29 This deterministic
mapping ensures that all records with the same key are always sent to the same
partition. This is the mechanism by which Kafka guarantees strict ordering for
related events.40
●​ Round-Robin and Sticky Partitioning: When a record is produced with a null
key, the strategy depends on the Kafka version.
○​ Prior to Kafka 2.4, the default partitioner used a simple round-robin
approach, cycling through the available partitions to distribute the load
evenly.29
○​ Since Kafka 2.4, the default is the Sticky Partitioner. This improved strategy
sends all null-key records to a single "sticky" partition until the current batch
is full or the linger.ms timeout is reached. It then selects a new sticky partition
for the next set of records. This approach dramatically improves performance
by increasing batch density and reducing the number of requests sent to the
brokers, thereby lowering latency.45
To optimize network efficiency and throughput, producers do not send each record individually. Instead, they collect records into batches before sending them to the brokers.36 This behavior is controlled by two key configurations: batch.size, which defines the maximum size of a batch in bytes, and linger.ms, which sets a maximum time the producer will wait to fill a batch before sending it. Additionally, compression (configured via compression.type) can be applied to these batches, further reducing network bandwidth and storage requirements. Larger batches generally lead to better compression ratios.47
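To make these producer mechanics concrete, here is a minimal sketch using the Java client. It is illustrative only: the broker address, the topic name "orders", the customer-ID key, and the batching values are assumptions, not recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // entry point; full cluster metadata is discovered from here
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("batch.size", 32768);        // flush a batch once it reaches 32 KB...
        props.put("linger.ms", 10);            // ...or after waiting at most 10 ms
        props.put("compression.type", "lz4");  // compress whole batches on the wire

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records sharing a key (here, a customer ID) hash to the same
            // partition, preserving their relative order.
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "orders", "customer-42", "{\"event\":\"order_created\"}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace(); // e.g., retriable errors exhausted
                } else {
                    System.out.printf("wrote partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any records still buffered in open batches
    }
}
```

Passing null instead of "customer-42" as the key would instead route the record via the sticky partitioner described above.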

2.4 The Consumer Client: Reading Data from Kafka

Consumers are the client applications that subscribe to Kafka topics to read and
process the streams of records published by producers.2

A cornerstone of Kafka's design is its pull-based consumption model.31 Unlike many traditional messaging systems that push messages to consumers, Kafka consumers
actively pull, or fetch, data from the brokers. This design choice gives consumers
granular control over their consumption rate. It prevents a fast producer from
overwhelming a slower consumer and allows consumers to batch records efficiently
for processing, which is often more performant.31

The central abstraction for consumption is the consumer group. Consumers identify
themselves with a group.id string.5 This simple mechanism elegantly unifies the two
primary messaging models:
●​ Queueing Model: When multiple consumer instances share the same group.id,
they form a pool of workers. Kafka distributes the topic's partitions among these
instances, ensuring that each partition is consumed by exactly one member of
the group. This effectively load-balances the processing workload across the
consumers, mimicking the behavior of a traditional message queue.19
●​ Publish-Subscribe Model: When consumer instances each have a unique
group.id, they act as independent subscribers. In this case, each consumer
receives a full copy of all messages from all partitions of the topic, mirroring the
broadcast behavior of a publish-subscribe system.19

The assignment of partitions to consumers within a group is a dynamic process known as rebalancing. This process is managed by a designated broker called the Group
Coordinator.31 A rebalance is triggered whenever a consumer joins the group, leaves
the group (either cleanly or due to a crash), or when the number of partitions for a
subscribed topic changes.31 During a rebalance, consumption is paused as partitions
are redistributed among the active members to maintain a balanced load.

As consumers read records from a partition, they must track their progress. This is
done via offset management. The offset of the last successfully processed record for
each partition is periodically "committed" back to a special, highly-available internal
Kafka topic named __consumer_offsets.31 This committed offset acts as a durable
bookmark. If a consumer instance fails and restarts, or if a rebalance assigns its
partition to another consumer, the new consumer will query the __consumer_offsets topic to find the last committed offset and resume processing
from that point, ensuring no data is lost and (depending on the commit strategy)
minimizing reprocessing.31 Developers have the choice between automatic offset committing (enable.auto.commit=true), which is convenient but offers less control, and manual committing, which provides fine-grained control over when a record is officially considered "processed," a critical feature for implementing robust processing semantics.36
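A corresponding consumption loop, sketched with manual offset commits so that offsets are only recorded after the fetched records have been handled; the group name and topic are the same illustrative placeholders as before:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All instances sharing this group.id divide the topic's partitions among themselves.
        props.put("group.id", "order-audit-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // Disable auto-commit so an offset is recorded only after processing succeeds.
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s%n",
                            record.partition(), record.offset(), record.key());
                }
                // Synchronous commit to __consumer_offsets: at-least-once semantics.
                consumer.commitSync();
            }
        }
    }
}
```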

This architecture reveals a deeper principle: a Kafka Consumer Group is more than
just a simple reader. It functions as a persistent, fault-tolerant, and scalable "view" or
"cursor" into the distributed commit log. The state of this view—the collection of
committed offsets—is not an ephemeral detail but is itself stored durably as data
within Kafka. A traditional database view is a logical construct that provides a window
into underlying tables. Similarly, a Kafka topic can be seen as the underlying table of
immutable events (the source of truth). A consumer group, then, defines a specific,
independent consumption process on that data. Because the state of this process
(the offsets) is also stored durably in Kafka, multiple independent applications can
consume the same data stream without impacting one another. One group might be a
real-time analytics engine, another an ETL pipeline to a data warehouse, and a third
an audit service. Each progresses at its own pace, maintaining its own persistent view.
Adding a new application—a new view—is as simple as starting a new consumer
group. This is a profoundly scalable paradigm that stems directly from the
architectural decision to decouple consumption state from the broker and manage it
as a first-class, persistent entity within the Kafka ecosystem itself.

Part III: Guarantees of a Distributed System

In any distributed system, providing clear guarantees regarding data durability, ordering, and delivery is paramount. Apache Kafka's architecture is meticulously
designed to offer a configurable set of strong guarantees, allowing architects to
balance performance, consistency, and availability to meet specific application
requirements.

3.1 Durability and Fault Tolerance: The Replication Protocol

Kafka achieves durability and high availability through its replication protocol, which is
built on a leader-follower model.26
●​ The Leader-Follower Model: When a topic is configured with a replication factor
greater than one, Kafka creates multiple copies, or replicas, of each partition.
These replicas are distributed across different brokers in the cluster to protect
against single-broker failures. For each partition, one replica is designated as the
leader, while the others become followers.5 All produce requests (writes) and
consume requests (reads) for a given partition are exclusively handled by its
leader.16 The followers' sole responsibility is to passively replicate the data from
their leader, fetching new records in sequence to maintain a byte-for-byte
identical copy of the leader's log.53
●​ In-Sync Replicas (ISR): To manage replication consistency, Kafka maintains a
dynamic set for each partition known as the In-Sync Replicas (ISR).25 This set
contains the leader and any followers that are fully "caught-up" with the leader's
log.25 A follower is considered caught-up if it is actively fetching from the leader
and its log does not lag behind the leader's log by more than a configurable time,​
replica.lag.time.max.ms.51 If a follower fails or falls too far behind, the leader
removes it from the ISR. This ISR set is the cornerstone of Kafka's failover
strategy.
●​ The High Watermark: To prevent consumers from reading data that has not been
fully replicated and could be lost in a leader failure, Kafka uses a concept called
the high watermark. The high watermark is the offset of the last record that has
been successfully copied to all replicas in the ISR.55 Consumers are only permitted
to read records up to this high watermark offset. This ensures that any data a
consumer sees is considered "committed" and will not be lost as long as at least
one replica from the ISR remains available.25
●​ Leader Election: In the event of a leader broker failure, the cluster controller
initiates a leader election. It selects a new leader from the remaining healthy
replicas that are members of the ISR.25 Because every member of the ISR is
guaranteed to have all committed records (up to the high watermark), this
process ensures that no committed data is lost during the failover.25
●​ Unclean Leader Election: A critical trade-off between availability and
consistency arises in the catastrophic scenario where all replicas in the ISR for a
partition become unavailable. By default (unclean.leader.election.enable=false),
Kafka prioritizes consistency. It will keep the partition offline and wait for a replica
from the original ISR to come back to life, thus guaranteeing no data loss but
sacrificing availability.25 If this setting is enabled, Kafka will prioritize availability by
electing the first replica to come back online as the new leader, even if it was not
in the ISR. This brings the partition back online quickly but risks losing any data
that had not been replicated to that follower.25

3.2 Configuring for Durability: The Producer-Broker Contract

Kafka's durability is not a monolithic feature but a finely-tunable contract between the
producer client and the broker cluster. This contract is defined by the interplay of
three key configuration parameters. Understanding how to orchestrate them is
essential for architecting a system that meets precise data safety and performance
goals.
1.​ replication.factor: This is a topic-level setting that defines the total number of
copies (replicas) to maintain for each partition of that topic.24 A replication factor
of​
N means the cluster can tolerate the failure of up to N−1 brokers without losing
data for that topic.19 For any production environment, a replication factor of at
least 3 is the standard best practice, typically distributed across three different
physical racks or availability zones.57
2.​ min.insync.replicas: This setting, configurable at the broker or topic level,
establishes a minimum threshold for the size of the ISR. When a producer uses
acks=all, the broker will reject the produce request with a NotEnoughReplicasException if the number of in-sync replicas is less than this value.51 This is a
critical safety mechanism. For example, with a replication factor of 3, setting​
min.insync.replicas=2 ensures that any acknowledged write has been durably
persisted on at least two separate brokers, protecting against data loss even if
the leader fails immediately after acknowledging the write.
3.​ Producer acks Setting: This client-side configuration defines the producer's
criteria for considering a write request successful, creating a direct trade-off
between durability and latency.27
○​ acks=0: The producer sends the message and immediately considers it
successful without waiting for any acknowledgment from the broker. This is a
"fire-and-forget" mode that offers the highest throughput and lowest latency
but provides the weakest durability guarantees, as messages can be lost in
transit or if the leader fails before writing the record.61
○​ acks=1 (Default before Kafka 3.0): The producer waits for an acknowledgment
from the partition leader only. This confirms that the leader has successfully
written the record to its local log. It offers a balance of durability and
performance, but data can still be lost if the leader fails before its followers
have replicated the record.61
○​ acks=all (or -1) (Default from Kafka 3.0): The producer waits for an
acknowledgment from the leader after the record has been successfully
replicated to all followers currently in the ISR. This setting provides the
strongest possible durability guarantee, as it ensures that any acknowledged
record exists on multiple brokers.61

The interplay of these three settings defines the system's resilience. Simply setting
replication.factor=3 is insufficient if the producer uses acks=1, as a leader failure can
still cause data loss. Even with acks=all, if min.insync.replicas is not set appropriately
(e.g., it defaults to 1), a write could be acknowledged by a lone leader just before it
fails. Therefore, the gold standard for mission-critical durability in production is the
combination of replication.factor=3, min.insync.replicas=2, and producer acks=all. This
configuration ensures that every acknowledged write is present on at least two
brokers and that the system can tolerate the failure of one broker without any data
loss or loss of write availability.
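As a sketch of how these three settings fit together in code, the following uses the Java AdminClient to create a hypothetical payments topic with the gold-standard configuration, alongside the matching producer-side properties; the topic name and partition count are placeholders:

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class DurableTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(adminProps)) {
            // replication.factor=3 via the third constructor argument;
            // min.insync.replicas=2 as a topic-level override.
            NewTopic topic = new NewTopic("payments", 6, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Set.of(topic)).all().get(); // block until created
        }

        // Producer side of the contract: acks=all completes the triad.
        Properties producerProps = new Properties();
        producerProps.put("acks", "all");                // wait for the full ISR
        producerProps.put("enable.idempotence", "true"); // retries cannot duplicate
    }
}
```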

3.3 Message Ordering Guarantees

Kafka's ordering guarantees are precise and directly tied to its partitioning model.
●​ Within a Partition: Kafka provides a strict ordering guarantee for records within a
single partition. If a producer sends message M1 followed by message M2 to the
same partition, Kafka guarantees they will be written to the log in that order (M1
will have a lower offset than M2), and all consumers of that partition will read
them in that exact same order.19 This is an inviolable property of the append-only
commit log.
●​ Across Partitions: Conversely, Kafka provides no global ordering guarantee for
records across the different partitions of a topic.46 A consumer reading from a
multi-partition topic may process a record from partition 1 that was produced
chronologically later than a record from partition 0 that it has not yet received.
●​ Achieving Order for Related Events: To enforce a specific order for a sequence
of related events (for example, all updates for a single customer account), the
application must ensure these events are sent to the same partition. This is
achieved by producing all related records with the same message key (e.g., using
the customer ID as the key). The producer's key-based partitioning logic
guarantees that all records with the same key will be deterministically routed to
the same partition, thereby preserving their relative order.40

3.4 Message Delivery Semantics: A Tripartite Framework

Message delivery semantics define the guarantees a system provides about whether
a message will be delivered and how many times. Kafka supports all three primary
semantics, which are configurable based on application needs.
●​ At-Most-Once: In this mode, messages may be lost but are guaranteed never to
be delivered more than once. This behavior can occur in Kafka under specific
failure scenarios. For example, if a producer fails to receive an acknowledgment
from the broker and is configured not to retry, the message might be lost. On the
consumer side, if an application is configured to commit offsets automatically
before processing the data, a crash after the commit but before processing would
cause the message to be skipped upon restart.50 This semantic prioritizes
performance over reliability and is suitable only for use cases that can tolerate
data loss, such as collecting non-critical metrics.
●​ At-Least-Once: This is Kafka's default guarantee. Messages are guaranteed
never to be lost but may be redelivered as duplicates. This occurs if a producer
sends a message but experiences a temporary network failure and does not
receive an acknowledgment. The producer's retry mechanism will then resend the
message, potentially creating a duplicate in the broker's log. On the consumer
side, if an application processes a message but crashes before it can commit the
corresponding offset, it will re-read and re-process the same message upon
restarting.50 This is the most common semantic used, with applications often
designed to be idempotent (able to handle duplicate messages gracefully).
●​ Exactly-Once Semantics (EOS): This is the strongest and most complex
guarantee, ensuring that each message is delivered and processed once and only
once. Achieving EOS in a distributed system is a non-trivial challenge that
requires coordination between the client and the broker. Kafka achieves this
through two powerful features introduced in version 0.11:
1.​ Idempotent Producer: By setting enable.idempotence=true in the producer
configuration, the producer becomes idempotent. The broker assigns a
unique Producer ID (PID) to the producer instance, and the producer includes
a sequence number with every record it sends to a specific partition. The
broker keeps track of the last sequence number it has seen for each (PID,
partition) combination. If it receives a record with a sequence number it has
already processed, it discards the duplicate, preventing data duplication from
producer retries.65 This provides exactly-once delivery guarantees​
from the producer to the broker log for a single partition.
2.​ Transactions: The Kafka Transactional API extends EOS to atomic writes
across multiple topics and partitions. This is essential for
"consume-transform-produce" stream processing applications, where an
application reads a message, processes it, and writes one or more resulting
messages back to Kafka. The entire operation must be atomic. The API allows
a producer to beginTransaction(), send records to multiple partitions, send
the consumer's offsets to the transaction, and then either
commitTransaction() or abortTransaction(). A broker-side component called
the Transaction Coordinator manages the state of these transactions.
Consumers can be configured with isolation.level="read_committed" to
ensure they only ever read records that are part of a successfully committed
transaction, effectively filtering out any data from aborted or in-progress
transactions.65
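A condensed consume-transform-produce sketch of this flow using the transactional API; the transactional.id, group, and topic names are hypothetical, and the error handling is simplified (a real application must treat fenced or otherwise fatal producer exceptions as non-recoverable):

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceRelay {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("key.serializer", StringSerializer.class.getName());
        pp.put("value.serializer", StringSerializer.class.getName());
        pp.put("transactional.id", "relay-1"); // enables transactions (and idempotence)

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "relay-group");
        cp.put("key.deserializer", StringDeserializer.class.getName());
        cp.put("value.deserializer", StringDeserializer.class.getName());
        cp.put("enable.auto.commit", "false");
        cp.put("isolation.level", "read_committed"); // hide aborted/in-flight records

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            producer.initTransactions();
            consumer.subscribe(List.of("input-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : records) {
                        producer.send(new ProducerRecord<>("output-events", r.key(),
                                r.value().toUpperCase())); // the "transform" step
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                    }
                    // Commit consumed offsets atomically with the produced records.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction(); // inputs will be re-read and re-processed
                }
            }
        }
    }
}
```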

Part IV: Kafka as a Messaging System and Event Streaming Platform

While often used as a high-performance message queue, Apache Kafka's capabilities
extend far beyond this traditional role. Its unique architecture allows it to serve as a
comprehensive event streaming platform, unifying and transcending the classic
messaging paradigms.

4.1 The Message Queue Paradigm

Historically, asynchronous inter-process communication has been dominated by two primary models, both facilitated by message-oriented middleware.72
●​ Point-to-Point (Queueing): In this model, a producer sends a message to a
named queue. Multiple consumers can listen to this queue, but the message
broker ensures that each message is delivered to and processed by only one of
the consumers.72 This pattern is ideal for distributing a workload across a pool of
worker processes, ensuring that each task is handled exactly once.
●​ Publish-Subscribe (Pub/Sub): In this model, a producer (or publisher) sends a
message to a logical channel, known as a topic. The message is then broadcast to
all consumers (or subscribers) that have registered an interest in that topic.72 This
pattern is used for disseminating information to multiple interested parties, such
as broadcasting price updates or system alerts.

4.2 Kafka's Hybrid Model: The Best of Both Worlds

Apache Kafka's architecture, centered on the partitioned log, ingeniously synthesizes these two distinct models into a single, powerful abstraction: the consumer group.10

A Kafka topic serves as the fundamental channel for the publish-subscribe pattern. A producer publishes records to a topic, and any application can subscribe to that topic to receive the records.19 The innovation lies in how consumers subscribe. By labeling themselves with a group.id, consumers form a consumer group. This group, as a whole, subscribes to the topic. Kafka then delivers each record from the topic to exactly one consumer instance within each subscribing consumer group.19

This design has two powerful consequences:
1.​ Pub/Sub Behavior Across Groups: If multiple consumer groups subscribe to the
same topic, each group will receive a full copy of all the records. This is classic
publish-subscribe behavior, allowing different applications (e.g., a real-time
monitoring dashboard, an ETL pipeline, an audit service) to independently
process the same stream of events.19
2.​ Queueing Behavior Within a Group: Within a single consumer group, Kafka
distributes the partitions of the topic among the available consumer instances.
Each partition is assigned to exactly one consumer in the group. This means the
consumers in the group collectively process the topic's data, with the load being
balanced across them. This is the classic point-to-point or queuing behavior,
enabling scalable, parallel processing of the data stream.10

This hybrid model gives Kafka the workload scalability of a traditional queuing system
and the data-sharing flexibility of a publish-subscribe system, all within a unified and
coherent framework.10

4.3 Kafka vs. Traditional Message Queues (RabbitMQ, ActiveMQ)

When compared to established message queues like RabbitMQ and ActiveMQ, Kafka's
distinct design philosophy becomes clear, leading to different strengths, weaknesses,
and ideal use cases.
●​ Architectural Philosophy: The primary distinction is the "smart broker" versus
"dumb broker" paradigm.
○​ RabbitMQ/ActiveMQ: These systems embody the "smart broker, dumb
client" model.20 The broker is a sophisticated entity responsible for complex
message routing (e.g., AMQP exchanges in RabbitMQ), tracking the delivery
state of every message, managing acknowledgments, and implementing
features like message priorities. The client's logic is correspondingly simpler.
○​ Kafka: Kafka follows the "dumb broker, smart client" model.20 The broker's
role is simplified to that of a high-performance, distributed log manager.
Complex logic, such as tracking which messages have been processed (offset
management) and deciding what to consume, is offloaded to the consumer
client.
●​ Message Retention and Replayability: This is a fundamental differentiator.
○​ RabbitMQ/ActiveMQ: These are primarily transient buffers. Messages are
typically stored in memory or on disk until they are successfully consumed
and acknowledged, at which point they are deleted from the queue.15 They are
not designed for long-term storage or data replay.
○​ Kafka: Retention is policy-based, not consumption-based. Records are
durably stored on disk for a configurable period (e.g., 7 days or indefinitely) or
until a size limit is reached, irrespective of whether they have been
consumed.2 This turns the broker into a persistent, replayable system of
record, a feature that is foundational to its use as a streaming platform.
●​ Performance and Scalability:
○​ RabbitMQ/ActiveMQ: They offer excellent performance for low-latency,
transactional messaging and can handle moderate to high throughput.
However, their broker-centric design and complex message state
management can become a bottleneck at extreme scales. Scaling is often
achieved through clustering or federation, which can be more complex to
manage than Kafka's model.64
○​ Kafka: Kafka is architected from the ground up for extreme throughput
(capable of handling millions of messages per second) and seamless
horizontal scalability.2 Its performance is derived from leveraging sequential
disk I/O and a simplified broker model. Adding more brokers to a cluster is a
straightforward way to scale capacity.
●​ Ideal Use Cases:
○​ RabbitMQ/ActiveMQ: They excel in traditional enterprise messaging
scenarios. This includes acting as a task queue for background job
processing, facilitating request-response (RPC-style) communication
between microservices, and implementing complex routing logic where
messages need to be delivered based on content or priority.21 ActiveMQ's
strong support for the Java Message Service (JMS) API also makes it ideal for
integrating with legacy enterprise systems.83
○​ Kafka: It is the superior choice for building real-time data pipelines,
implementing event sourcing patterns, aggregating logs and metrics at a
massive scale, and serving as the backbone for stream processing
applications. Any scenario that requires high throughput, long-term data
retention, and the ability for multiple applications to replay and analyze data
streams is a prime use case for Kafka.9

The following table provides a comparative summary:

Core Architecture
● Apache Kafka: Distributed, partitioned commit log ("dumb broker") 2
● RabbitMQ: Centralized message broker with exchanges/queues ("smart broker") 10
● Apache ActiveMQ: Centralized message broker with queues/topics ("smart broker") 74

Messaging Model
● Apache Kafka: Hybrid Pub/Sub and Queuing via Consumer Groups 10
● RabbitMQ: Flexible (Point-to-Point, Pub/Sub, Request/Reply) via AMQP 64
● Apache ActiveMQ: Point-to-Point (Queues) and Pub/Sub (Topics) via JMS, etc. 89

Message Retention
● Apache Kafka: Policy-based (time or size); durable and replayable 10
● RabbitMQ: Acknowledgment-based; messages deleted after consumption 15
● Apache ActiveMQ: Acknowledgment-based; messages deleted after consumption 78

Throughput
● Apache Kafka: Very high (millions of messages/sec) 22
● RabbitMQ: Moderate to high (tens of thousands of messages/sec) 21
● Apache ActiveMQ: Moderate 83

Latency Profile
● Apache Kafka: Low latency, optimized for high throughput 3
● RabbitMQ: Very low latency for transactional messages 64
● Apache ActiveMQ: Low latency for moderate workloads 83

Scalability Model
● Apache Kafka: Excellent horizontal scalability by adding brokers 2
● RabbitMQ: Vertical scaling; horizontal via clustering/federation 64
● Apache ActiveMQ: Vertical scaling; horizontal via network of brokers 79

Message Routing
● Apache Kafka: Basic (topic-based partitioning) 64
● RabbitMQ: Complex and flexible (content-based, header-based) 64
● Apache ActiveMQ: Flexible (selectors, composite destinations) 64

Primary Use Cases
● Apache Kafka: Event streaming, data pipelines, log aggregation, analytics 86
● RabbitMQ: Task queues, RPC, microservice communication 21
● Apache ActiveMQ: Enterprise integration, JMS-based applications 81

Key Strengths
● Apache Kafka: Throughput, scalability, durability, replayability 2
● RabbitMQ: Low latency, flexible routing, mature protocol support (AMQP) 64
● Apache ActiveMQ: Protocol versatility (JMS, AMQP, MQTT), enterprise features 79

4.4 Kafka vs. Modern Streaming Platforms (Apache Pulsar)

Apache Pulsar has emerged as a significant modern alternative to Kafka, presenting a different set of architectural trade-offs.
●​ Architectural Difference: Separation of Compute and Storage: This is the
most profound distinction between the two platforms.
○​ Kafka: Employs a monolithic (or tightly coupled) architecture where each
broker node is responsible for both compute (handling client requests,
replication) and storage (managing log files on its local disks).90
○​ Pulsar: Features a multi-layered, cloud-native architecture that decouples
compute from storage. Stateless brokers handle client connections and
message dispatching, while a separate, scalable storage layer powered by
Apache BookKeeper handles the durable persistence of data.77
●​ Architectural Implications:
○​ Scalability and Elasticity: Pulsar's decoupled architecture provides superior
elasticity. New brokers can be added or removed almost instantly to scale the
compute layer up or down without triggering a massive and time-consuming
data rebalancing process. In Kafka, adding a broker requires reassigning
partitions and physically moving data, which can be a slow and operationally
complex task, especially with large data volumes.91
○​ Multi-Tenancy and Isolation: Pulsar was designed from the ground up with
strong multi-tenancy support, providing resource isolation at the tenant and
namespace levels. This makes it exceptionally well-suited for large, shared,
cloud-native environments where different teams or applications need to use
the cluster without interfering with one another.90 Kafka's support for
multi-tenancy is more limited and often requires external tooling or careful
operational practices.
○​ Unified Messaging Models: Pulsar provides native support for both
streaming (via exclusive subscriptions) and traditional message queuing (via
shared subscriptions) on the same topic. This allows multiple consumers to
process messages from a single topic partition in a round-robin fashion, a pattern that, in Kafka, requires more complex client-side logic or the new "Share Groups" feature to achieve.20
●​ Performance and Ecosystem:
○​ Performance: Performance benchmarks comparing Kafka and Pulsar are
often contentious and highly dependent on the workload, configuration, and
the entity performing the benchmark. Some independent and Confluent-led
benchmarks show Kafka achieving higher raw throughput for pure streaming
workloads, leveraging its zero-copy and page cache optimizations.77
Conversely, benchmarks from Pulsar proponents often highlight Pulsar's more
consistent and lower tail latencies, especially in mixed read/write workloads,
attributing this to its I/O isolation between writing new data (to the journal)
and serving reads.95
○​ Ecosystem and Maturity: Kafka possesses a significant advantage in terms
of ecosystem maturity and adoption. It has a vastly larger community, more
extensive documentation, and a much richer ecosystem of third-party tools,
integrations, and connectors (via Kafka Connect). The availability of mature
frameworks like Kafka Streams and ksqlDB provides a comprehensive,
out-of-the-box platform experience that Pulsar is still building out.90 This
established ecosystem represents a major practical advantage for teams
adopting Kafka today.

Part V: The Broader Kafka Ecosystem: Beyond the Broker

Apache Kafka's dominance is not solely due to the performance of its core broker. Its
power is magnified by a rich, integrated ecosystem of tools and libraries that
transform it from a messaging component into a comprehensive, end-to-end data
platform. These components address critical needs in data integration, stream
processing, and data governance.

5.1. Kafka Connect: The Data Integration Hub

Purpose: Kafka Connect is a framework designed to standardize, simplify, and scale the movement of data between Apache Kafka and other data systems.3 It serves as a centralized hub for building and managing real-time data pipelines, eliminating the need for custom, one-off integration code for common data sources and sinks.

Architecture: Kafka Connect operates as one or more separate processes called workers, which can be deployed in two modes:
●​ Standalone Mode: A single worker process runs all connectors and tasks. This
mode is simple to set up and is suitable for development, testing, or small-scale,
single-machine use cases like log collection.99
●​ Distributed Mode: Multiple workers form a cluster. They coordinate to
automatically balance the execution of connectors and their tasks, providing
scalability and fault tolerance. If a worker fails, its tasks are automatically
redistributed among the remaining workers in the cluster.98

The core of the framework is the Connector plugin. Connectors are pre-built or
custom packages of code that understand how to interface with a specific external
system. There are two types 98:
●​ Source Connectors: Ingest data from external systems (e.g., polling a database
for new rows, tailing a log file) and publish it to Kafka topics.
●​ Sink Connectors: Export data from Kafka topics to external systems (e.g., writing
records to an Elasticsearch index, HDFS, or a cloud object store).

Key Features:
●​ Converters: These plugins handle the serialization and deserialization of data as
it moves between the external system and Kafka. They ensure data is in the
correct format (e.g., JSON, Avro) for both Kafka and the target system.98
●​ Single Message Transformations (SMTs): These allow for lightweight, in-flight
modification of individual records as they pass through the Connect pipeline.
SMTs can be used to filter records, mask sensitive fields, add metadata, or alter
the structure of a message without requiring a separate stream processing
application.98
●​ Dead Letter Queues (DLQs): For sink connectors, a DLQ can be configured as a
destination for records that cannot be processed successfully (e.g., due to a data
format error). This prevents the entire pipeline from halting on a single bad record
and allows for later inspection and remediation.98

By providing a declarative, configuration-driven approach to data integration, Kafka Connect dramatically lowers the barrier to entry for building robust and scalable data pipelines.98

5.2. Kafka Streams: Native Stream Processing


Purpose: Kafka Streams is a powerful yet lightweight client library for building
real-time stream processing applications and microservices. It allows developers to
process and analyze data stored in Kafka directly within their own applications, using
a simple and familiar programming model.3

Architecture: A key architectural feature of Kafka Streams is that it is not a separate cluster. It is simply a Java/Scala library that is embedded within a client application.32
This eliminates the operational overhead of managing a separate processing cluster
(like Apache Flink or Spark). Scalability and fault tolerance are achieved by leveraging
Kafka's own underlying mechanisms. When multiple instances of a Kafka Streams
application are run, they automatically form a consumer group, and Kafka distributes
the input topic partitions among them, enabling parallel processing.32 If an instance
fails, its partitions are automatically reassigned to the remaining instances.

Key Features:
●​ High-Level DSL and Processor API: It offers a functional, high-level
Domain-Specific Language (DSL) with common stream processing operators like
map, filter, groupBy, join, and aggregate. For more complex or fine-grained
control, it also provides the lower-level Processor API.3
●​ Stateful Processing: Kafka Streams has first-class support for stateful
operations, such as windowed aggregations and joins. It manages local state
using embedded, high-performance key-value stores (typically RocksDB), which
allows state to be larger than available memory. For fault tolerance, all updates to
these local state stores are backed up to a compacted changelog topic in Kafka,
allowing state to be fully restored in the event of an application failure.3
●​ Exactly-Once Semantics (EOS): Kafka Streams is the primary vehicle for
achieving end-to-end exactly-once processing in Kafka. By setting
processing.guarantee=exactly_once, the library leverages Kafka's transactional
capabilities to ensure that for every input record, the processing, state updates,
and resulting output records are completed as a single atomic unit.65
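A minimal sketch of the high-level DSL, assuming string-keyed page-view events on a hypothetical page-views topic; it maintains a continuously updated count per user in a local state store and streams the changes to an output topic:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ClickCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The application.id also serves as the consumer group id, so multiple
        // instances of this program automatically share the input partitions.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("page-views");
        // Stateful aggregation: counts live in a local (RocksDB-backed) store,
        // replicated to a compacted changelog topic for fault tolerance.
        KTable<String, Long> countsByUser = clicks
                .filter((user, page) -> page != null)
                .groupByKey()
                .count();
        countsByUser.toStream().to("view-counts-by-user",
                Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Setting processing.guarantee=exactly_once_v2 in the same properties would switch this topology to the transactional, exactly-once mode described above.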

5.3. ksqlDB: Streaming SQL

Purpose: ksqlDB is an event streaming database built on top of Kafka. Its goal is to
radically simplify the creation of stream processing applications by providing an
interactive, high-level SQL interface.104

How it Works: ksqlDB is not a new storage engine; it operates directly on data stored
in Kafka topics. Under the hood, a ksqlDB server parses the SQL statements
submitted by a user and translates them into Kafka Streams applications. These
applications are then executed on the Kafka cluster to perform the requested
processing.104

Abstractions: ksqlDB introduces two core abstractions that bridge the gap between
the relational world and the streaming world 105:
●​ STREAM: Represents an unbounded, append-only sequence of events, directly
mapping to a Kafka topic.
●​ TABLE: Represents a stateful, materialized view of a stream. It provides a
snapshot of the latest value for each key in the stream and is continuously
updated as new events arrive.

Use Case: ksqlDB significantly lowers the barrier to entry for stream processing. It
empowers developers, data analysts, and data scientists to perform real-time data
exploration, filtering, transformation, and aggregation on live data streams using
familiar SQL syntax, without needing to write complex code in Java or Scala.107

5.4. Schema Registry: Data Governance and Evolution

Purpose: The Confluent Schema Registry is a centralized, standalone service that manages and validates schemas for data being produced to and consumed from
Kafka.104 It supports popular data serialization formats like Apache Avro, Protobuf, and
JSON Schema.

Role in Data Governance: In any large-scale data architecture, ensuring data quality
and consistency is a critical challenge. Schema Registry acts as the enforcer of a
"data contract" between producers and consumers.110 By providing a central
repository for schemas, it ensures that all data flowing through Kafka adheres to a
predefined structure, preventing data corruption and downstream processing
failures.110

Schema Evolution: A key feature of Schema Registry is its ability to manage the
evolution of schemas over time. As applications and business requirements change,
data schemas must also change. Schema Registry enforces compatibility rules (e.g.,
backward, forward, full compatibility) when a new version of a schema is registered.
This ensures that new producers do not break old consumers, and new consumers
can still read data produced with old schemas, allowing for independent and
decoupled upgrades of microservices.109

How it Works: The process is highly efficient. When a producer sends a record, its
serializer first checks if the schema is registered. If not, it registers it and receives a
unique schema ID. This small integer ID is then embedded in the record's metadata,
rather than the full, verbose schema. When a consumer receives the record, its
deserializer extracts the schema ID, and if it doesn't have the corresponding schema
cached locally, it requests it from the Schema Registry. This schema is then used to
correctly deserialize the record's payload.109
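A producer-side sketch of this flow, assuming Confluent's kafka-avro-serializer dependency on the classpath and a Schema Registry at http://localhost:8081; the topic, schema, and field values are hypothetical:

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    private static final String USER_SCHEMA =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"name\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent serializer: registers the schema on first use and embeds
        // only its small integer ID in each record, not the full schema text.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(USER_SCHEMA);
        GenericRecord user = new GenericData.Record(schema);
        user.put("id", "u-123");
        user.put("name", "Ada");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "u-123", user));
        }
    }
}
```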

The power of the Kafka ecosystem lies in the synergistic way these components build
upon one another, creating a powerful feedback loop that reinforces Kafka's position
as the de facto standard for data in motion. An organization might initially adopt Kafka
for its core message brokering. Soon, the need arises to integrate data from a
relational database. Instead of building a custom, brittle pipeline, the team deploys
Kafka Connect with a pre-built JDBC connector, saving weeks of development time
and gaining a scalable, reliable solution.99 Next, they need to filter and enrich this
incoming data in real-time. Rather than introducing and managing a separate,
complex stream processing cluster, they embed the logic directly into a lightweight
Kafka Streams application, leveraging the same operational model and guarantees as
the rest of their Kafka infrastructure.3 As the volume of processed data grows, data
analysts want to run ad-hoc queries on the live streams. With ksqlDB, they can do so
using familiar SQL, gaining immediate insights without waiting for the data to be
loaded into a traditional data warehouse.106 Finally, as more teams and services begin
to depend on these data streams, maintaining data quality becomes paramount. The
organization implements Schema Registry to enforce data contracts and manage
schema evolution, preventing breaking changes and ensuring the long-term health of
their data ecosystem.110

At this stage, the organization is no longer just using Kafka; they are leveraging a
comprehensive, integrated platform for data integration, processing, querying, and
governance. The value derived is not from any single component but from their
seamless interplay. This creates a significant competitive moat. A competing
technology like Apache Pulsar, even with certain architectural advantages, must not
only rival the Kafka broker but also this entire, deeply integrated and battle-tested
ecosystem, presenting a much higher barrier to displacement.90

Part VI: The Future of Kafka: Recent Developments and Strategic Direction

Apache Kafka is not a static project; it is in a constant state of evolution. Recent and
ongoing developments are fundamentally reshaping its architecture, simplifying its
operation, and expanding its capabilities. These changes are not merely incremental
improvements but strategic moves that address historical limitations and position
Kafka for the next decade of data streaming.

6.1. The KRaft Protocol: Life After ZooKeeper

For most of its history, Apache Kafka had a symbiotic but complex relationship with
Apache ZooKeeper. Kafka relied on ZooKeeper for critical cluster coordination tasks,
including storing cluster metadata, tracking broker membership, and, most
importantly, electing the cluster controller.34 This dependency, while functional, was a
significant source of operational friction. It meant that to run a production Kafka
cluster, operators had to deploy, manage, monitor, and secure two separate, complex
distributed systems, each with its own configuration, failure modes, and security
models. This added considerable operational overhead and created a potential
scalability bottleneck, as ZooKeeper's performance could limit the number of
partitions a Kafka cluster could efficiently manage.113

The Kafka community addressed this long-standing challenge with the introduction of
KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum. This KIP
introduced the Kafka Raft (KRaft) protocol, an event-based implementation of the
Raft consensus algorithm built directly into the Kafka brokers themselves.112 In KRaft
mode, the ZooKeeper dependency is eliminated entirely. Instead, a dedicated quorum
of brokers, acting as controllers, uses the KRaft protocol to manage cluster metadata,
which is stored durably in an internal Kafka topic. This makes Kafka a self-contained,
single-system deployment.112

The architectural benefits of this shift are profound:
●​ Simplified Operations: The most immediate benefit is the reduction in
operational complexity. Administrators now only need to manage, monitor, and
secure a single system, significantly lowering the operational burden.112
●​ Enhanced Scalability: KRaft removes the ZooKeeper bottleneck, allowing a
single Kafka cluster to scale to support millions of partitions, a significant
increase from the previous practical limit of a few hundred thousand.116
●​ Improved Performance and Availability: Controller failover in KRaft mode is
nearly instantaneous. Because the metadata is replicated via an event log to all
controller nodes, a new leader already has all the committed metadata in memory
when it takes over. This is a stark contrast to the ZooKeeper-based controller,
which had to perform a slow and expensive process of loading the entire cluster
state from ZooKeeper upon failover, leading to longer periods of unavailability for
administrative operations.112

The transition to KRaft has been a carefully managed, multi-year effort. KRaft was
declared production-ready for new clusters in Apache Kafka 3.3.112 The recent Apache
Kafka 4.0 release in 2025 marks the final step in this evolution, completely removing
the ZooKeeper mode and making KRaft the default and only supported operational
mode.115 For existing ZooKeeper-based clusters, a detailed migration path is provided,
allowing for a phased transition through a "hybrid mode" before the final cutover.120

6.2. Tiered Storage: Towards Infinite, Cost-Effective Retention

One of Kafka's defining features is its durable, log-based storage. However, as organizations retain data for longer periods—for compliance, analytics, or model training—the cost of storing petabytes of data on the high-performance, local disks required by Kafka brokers can become prohibitive.

To address this economic challenge, the community introduced Tiered Storage, which became production-ready in Kafka 3.9.123 This feature fundamentally changes Kafka's storage architecture by separating data into two distinct tiers:
●​ Local Tier: This is the traditional Kafka storage layer. Hot, recent data that
requires low-latency access is kept on the brokers' local, high-performance
disks.123
●​ Remote Tier: Older, "cold" data that is accessed less frequently is automatically
and transparently offloaded to a cheaper, scalable remote object store, such as
Amazon S3, Google Cloud Storage, or an on-premises equivalent.123
This architecture effectively decouples the compute layer (brokers) from the bulk of
the storage layer. The impact of this change is significant:
●​ Economic Viability: It makes storing data in Kafka for very long periods (or even
indefinitely) economically feasible, as the majority of the data resides on low-cost
object storage.
●​ Improved Elasticity and Scalability: When scaling a cluster by adding new
brokers, only the hot data on the local tier needs to be rebalanced. This
dramatically reduces the amount of data that needs to be moved across the
network, making scaling operations much faster and less disruptive.123
●​ Faster Recovery: In the event of a broker failure, recovery is much quicker as the
new broker only needs to replicate the relatively small amount of hot data from its
peers; the vast majority of historical data is already available in the shared remote
tier.123

With tiered storage, Kafka solidifies its position as a true, long-term system of record
for event data, combining the low-latency performance of local storage for real-time
access with the cost-effectiveness and virtually infinite capacity of cloud object
storage for historical data.
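
To make this concrete, the sketch below shows the configuration surface involved, assuming a RemoteStorageManager plugin for the chosen object store has been installed on the brokers; the plugin class name is a placeholder, since Kafka defines only the interface and concrete implementations come from plugins:

```properties
# --- Broker configuration: turn on the tiered storage subsystem. ---
remote.log.storage.system.enable=true
# Plugin that reads and writes segments in the remote object store
# (placeholder class name; supplied by the chosen plugin).
remote.log.storage.manager.class.name=com.example.kafka.S3RemoteStorageManager

# --- Per-topic settings (applied at topic creation or via alteration): ---
# remote.storage.enable=true     offloads closed segments to the remote tier
# retention.ms=31536000000       total retention across both tiers (~1 year)
# local.retention.ms=86400000    keeps only ~1 day on local broker disks
```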

6.3. Emerging Capabilities in Kafka 4.0 and Beyond

The Apache Kafka 4.0 release introduced several other game-changing features that
signal the project's future direction.
●​ New Consumer Rebalance Protocol (KIP-848): Now generally available, this
new protocol revolutionizes how consumer groups handle rebalancing. It replaces
the old "stop-the-world" rebalance mechanism with a more cooperative protocol
that allows consumers to continue processing data from their assigned partitions
while a rebalance is in progress for other partitions. This dramatically reduces
downtime and improves the stability and performance of large, dynamic
consumer groups.118 (A configuration sketch follows this list.)
●​ Queues for Kafka (KIP-932): This feature, currently in early access, directly
addresses one of the few remaining areas where traditional message queues held
an advantage. It introduces the concept of a Share Group as an alternative to a
consumer group. In a share group, the strict one-to-one mapping between
partitions and consumers is relaxed, allowing the number of consumers to exceed
the number of partitions. This enables true work-queue semantics, where a pool
of consumers can cooperatively process records from the same partitions, with
individual message acknowledgment and delivery tracking. This makes Kafka a
much more viable platform for traditional queuing use cases without sacrificing its
core durability and scalability.3
●​ Eligible Leader Replicas (ELR) (KIP-966): This preview feature further
strengthens Kafka's consistency guarantees during failover. It introduces a subset
of the ISR, known as the ELR, which contains only those replicas guaranteed to
have the complete data log up to the high-watermark. By restricting leader
elections to only replicas in the ELR, Kafka can further prevent rare edge cases
that could lead to data loss.118
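
As a minimal sketch, opting a consumer group into the new KIP-848 protocol is a client-side setting in Kafka 4.0; the hosts and group names below are placeholders, and the share-group client API for KIP-932 is still in early access and therefore not shown:

```properties
# Illustrative consumer configuration for the KIP-848 rebalance protocol.
bootstrap.servers=kafka-1:9092
group.id=orders-service
# "consumer" selects the new broker-coordinated incremental protocol;
# "classic" retains the legacy client-side assignment behaviour.
group.protocol=consumer
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
```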

These recent developments are not isolated improvements. They represent a clear
and coherent strategic response to the evolving demands of the data landscape and
the competitive pressures from other platforms, particularly Apache Pulsar. Pulsar's
primary architectural selling points have been its separation of compute and storage
and its native support for both streaming and queuing. Kafka's introduction of KRaft
simplifies its operational model, directly countering the argument of its complexity.
The implementation of Tiered Storage directly mirrors Pulsar's core architectural
benefit, addressing the cost and elasticity arguments that favored Pulsar in
cloud-native deployments. Finally, the introduction of Share Groups (Queues for
Kafka) is a direct answer to Pulsar's flexible queuing capabilities. This roadmap
demonstrates a clear strategy: to systematically re-architect Kafka to incorporate the
best ideas from its competitors, thereby neutralizing its perceived weaknesses while
leveraging its unparalleled ecosystem and market dominance to secure its position as
the leading event streaming platform for the foreseeable future.

Part VII: Managing Kafka in Production: Best Practices

Deploying and operating Apache Kafka at scale requires a disciplined approach to performance tuning, monitoring, scalability planning, and disaster recovery. While Kafka is designed for resilience and high performance, optimal results are achieved through careful configuration and proactive management.

7.1. Performance Tuning and Monitoring

Effective Kafka management begins with robust monitoring. Performance in Kafka is a multi-dimensional trade-off, primarily between latency, throughput, durability, and availability.125 Tuning involves adjusting parameters to find the optimal balance for a specific use case.

Key Metrics to Monitor:
A comprehensive monitoring strategy should track metrics across the entire cluster 126:
●​ Broker Metrics: Key indicators of cluster health include CPU and memory
utilization, disk usage, network I/O, the number of under-replicated partitions (a
critical health indicator), and the rate of leader elections.
●​ Producer Metrics: Monitoring producer request latency, batch size, and
compression ratio helps diagnose data ingestion performance.
●​ Consumer Metrics: The most critical consumer metric is consumer lag, which
measures how far behind a consumer group is from the end of the log.
Persistently high lag indicates that consumers cannot keep up with the rate of
production. Fetch rate and processing latency are also vital.
●​ ZooKeeper/KRaft Metrics: Monitoring the health of the coordination service is
crucial. For KRaft, this includes tracking the status of the controller quorum and
metadata replication.

Common Tuning Levers:
●​ Producer Tuning: To maximize throughput, producers can be tuned by increasing
batch.size (to send more data per request) and linger.ms (to allow more time for
batches to fill). Enabling compression (e.g., lz4 or zstd) can significantly reduce
network load at the cost of some CPU overhead. For durability, acks should be set
to all.48
●​ Broker Tuning: Broker performance can be tuned by adjusting the number of
network and I/O threads (num.network.threads, num.io.threads) to match the
hardware. Proper JVM heap size allocation (typically 6-8 GB) and using the G1
garbage collector are recommended to avoid long GC pauses that can disrupt
broker operations.48
●​ Consumer Tuning: Consumer throughput can be improved by increasing fetch
sizes (fetch.min.bytes, max.partition.fetch.bytes) and wait times
(fetch.max.wait.ms), allowing the consumer to retrieve more data in a single poll
request.48 An illustrative sketch of these producer and consumer properties follows below.
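
A minimal sketch of these levers as client properties follows; the values are starting points to benchmark against a specific workload rather than universal recommendations:

```properties
# --- producer.properties: throughput-oriented tuning (illustrative values) ---
# Larger batches amortize per-request overhead; linger.ms lets them fill.
batch.size=65536
linger.ms=10
# Compression trades some CPU for lower network and disk usage.
compression.type=lz4
# For durability, wait for all in-sync replicas to acknowledge.
acks=all

# --- consumer.properties: fetch tuning (illustrative values) ---
# Let the broker accumulate at least 64 KiB, or wait up to 500 ms, per fetch.
fetch.min.bytes=65536
fetch.max.wait.ms=500
# Per-partition cap on data returned in a single fetch (2 MiB).
max.partition.fetch.bytes=2097152
```
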
Monitoring Tools:
A popular and powerful open-source stack for Kafka monitoring combines JMX Exporter (to
expose Kafka's JMX metrics), Prometheus (to scrape and store the metrics), and Grafana (to
visualize the metrics in dashboards).126 Commercial solutions like Confluent Control Center,
Datadog, and New Relic also offer comprehensive, out-of-the-box Kafka monitoring
capabilities.126

7.2. Scalability and Capacity Planning

Kafka is designed for horizontal scalability, but this requires careful planning.
●​ Partitioning Strategy: The number of partitions for a topic is one of the most
critical and difficult-to-change decisions. It determines the maximum consumer
parallelism and is a key factor in throughput. A common rule of thumb is to
provision partitions based on the target throughput (e.g., if a single partition can
handle 10 MB/s and the target is 100 MB/s, at least 10 partitions are needed) and
the expected number of consumer instances. It is generally better to
over-partition slightly than to under-partition, as adding partitions later can
disrupt key-based ordering guarantees.28 However, an excessive number of
partitions (e.g., thousands per broker) can increase memory overhead and leader
election time.28
●​ Broker Sizing and Scaling: Brokers should be sized based on expected network
I/O, disk throughput, and memory requirements. Kafka scales horizontally by
adding more broker nodes to the cluster. After adding brokers, partitions must be
reassigned to the new nodes to balance the load. This rebalancing can be a
resource-intensive operation. Tools like LinkedIn's Cruise Control can be used to
automate the process of generating and executing optimized partition
reassignment plans.124

7.3. Disaster Recovery (DR) Strategies

A robust DR plan is essential for any mission-critical Kafka deployment. It is important to distinguish between High Availability (HA) and Disaster Recovery (DR).
●​ High Availability (Within a Single Region): HA is about tolerating
component-level failures (e.g., a single broker crash) within a single datacenter or
geographic region. The standard HA architecture for Kafka involves deploying a
single cluster across three distinct Availability Zones (AZs). Topics are configured
with a replication.factor of 3 and min.insync.replicas of 2. This setup ensures that
the cluster can withstand the complete failure of one AZ without any data loss or
service downtime.26
●​ Disaster Recovery (Across Multiple Regions): DR is about surviving a
large-scale failure, such as the loss of an entire datacenter or region. This
requires replicating data to a geographically separate location. The primary
strategies are:
○​ Asynchronous Replication (Active-Passive): This is the most common and
recommended DR pattern. It involves running a primary (active) cluster in one
region and a secondary (passive or standby) cluster in a different region. A
data replication tool, such as Kafka's own MirrorMaker2 or a commercial
product like Confluent Replicator, is used to asynchronously copy data from
the active to the passive cluster.26
■​ Failover: In the event of a disaster at the primary site, client applications
are reconfigured to connect to the secondary cluster, which is then
promoted to active. This is the failover process.132
■​ Failback: Once the primary site is restored, a failback procedure is
initiated to replicate data back from the (now active) secondary cluster
and eventually switch client traffic back to the original primary site.132
■​ Trade-off: Because the replication is asynchronous, this pattern has a
non-zero Recovery Point Objective (RPO), meaning that a small amount of
recently produced data (seconds to minutes) that had not yet been
replicated may be lost during a failover.127
○​ Synchronous Replication (Stretch Clusters): In this advanced and less
common pattern, a single Kafka cluster is "stretched" across two or more
geographically distant datacenters. To ensure consistency, producers must be
configured with acks=all and wait for acknowledgments from brokers in all
regions before a write is considered successful.127
■​ Benefit: This architecture can provide a Recovery Point Objective (RPO) of
zero, meaning no data loss even in a regional failure.
■​ Trade-off: The benefit of RPO=0 comes at a very high cost: producer
write latency is dramatically increased, as it is now bounded by the
round-trip time between the datacenters. This makes it unsuitable for
most low-latency applications.129
○​ Active-Active Replication: This is the most complex pattern, where both
datacenters are actively serving both producer and consumer traffic. It
requires bidirectional replication between the two clusters. This architecture
demands sophisticated solutions to prevent infinite replication loops (e.g.,
using message headers to track data origin) and to handle potential
conflicts.54 It is typically reserved for global applications that require
low-latency access for users in multiple regions.

For most organizations, the active-passive model using asynchronous replication with
a tool like MirrorMaker2 provides the best balance of data protection, cost, and
performance for a disaster recovery strategy.
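
As a minimal sketch of this active-passive pattern, MirrorMaker2 is driven by a single properties file naming the clusters and the replication flows between them; the cluster aliases, hosts, and topic pattern below are placeholders:

```properties
# --- mm2.properties: one-way active-passive replication (illustrative) ---
clusters = primary, backup
primary.bootstrap.servers = primary-kafka-1:9092
backup.bootstrap.servers = backup-kafka-1:9092

# Enable the primary -> backup flow and select what to mirror.
primary->backup.enabled = true
primary->backup.topics = .*

# Keep the reverse flow disabled until a failback is required.
backup->primary.enabled = false

# Replication factor for the mirrored topics created on the target cluster.
replication.factor = 3
```

The process is launched with the connect-mirror-maker.sh script that ships with Kafka; a failback would involve enabling replication in the reverse direction once the primary site is restored.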

Part VIII: Conclusion and Recommendations

8.1. Synthesis of Findings

This comprehensive analysis reveals that Apache Kafka is far more than a
high-performance message queue. It is a sophisticated, distributed event streaming
platform, architected around the foundational principle of the replicated commit log.
This core design choice is the wellspring of its defining characteristics: extreme
scalability that supports trillions of events per day; high-throughput performance
derived from sequential I/O patterns; and configurable data durability that allows it to
serve as a fault-tolerant, long-term system of record.

Kafka's unique consumer group abstraction elegantly unifies the paradigms of queuing and publish-subscribe messaging, providing both scalable workload
distribution and flexible data broadcasting within a single model. While traditional
message queues like RabbitMQ and ActiveMQ excel in transactional, low-latency
messaging and complex routing scenarios, Kafka's domain is the high-volume,
real-time data pipeline. Its ability to retain and replay event streams decouples data
producers from consumers in time, enabling a new class of applications that can
process both live and historical data.

The platform's strength is further amplified by a mature and deeply integrated
ecosystem. Kafka Connect provides a standardized, scalable solution for data
integration; Kafka Streams offers a powerful yet lightweight library for native stream
processing; ksqlDB lowers the barrier to entry with a familiar SQL interface; and
Schema Registry ensures robust data governance and schema evolution. This
ecosystem transforms Kafka from a mere component into a self-sufficient data hub.

Furthermore, Kafka is not a static technology. Its continuous and strategic
evolution—marked by the replacement of ZooKeeper with the KRaft protocol and the
introduction of Tiered Storage and Share Groups—demonstrates a clear trajectory.
These developments simplify operations, enhance scalability to millions of partitions,
make infinite data retention economically viable, and expand its applicability to
traditional queuing workloads. This evolution is a direct response to the demands of
modern cloud-native architectures and the capabilities of competing platforms,
solidifying Kafka's position as a foundational data infrastructure for the foreseeable
future.

8.2. Strategic Recommendations for Adoption and Configuration

Based on this analysis, the following strategic recommendations can be made for
organizations considering or currently using Apache Kafka.

When to Choose Kafka:
●​ Adopt Kafka for use cases centered on high-volume, continuous data streams.
This includes building real-time analytics pipelines, aggregating logs and metrics
from distributed systems, implementing event sourcing patterns, and establishing
a central "nervous system" for a microservices architecture.
●​ Prioritize Kafka when data replayability is a requirement. If applications need to
re-process historical data for analytics, model training, or recovering from
application-level errors, Kafka's durable log is the ideal foundation.
●​ Leverage Kafka when you need to serve the same data stream to multiple,
independent applications or teams. The consumer group model allows new
applications to subscribe to data without impacting existing consumers or broker
performance.

When to Consider Alternatives:
●​ Choose a Traditional MQ (e.g., RabbitMQ) for simple, low-volume task queues
where workers need to process transient jobs.
●​ Use a Traditional MQ for applications that require complex, content-based
message routing, message prioritization, or RPC-style request-response
communication between services. In these scenarios, the "smart broker" model of
RabbitMQ is often a more direct fit.
●​ Evaluate Alternatives if the operational overhead of managing a distributed
system is prohibitive for a small-scale project. However, with the advent of KRaft
and managed cloud services, this barrier is significantly lower than in the past.

Configuration Guidance for Key Business Goals:

●​ Maximum Throughput / Lowest Latency: Producer acks=1; high batch.size and linger.ms; lz4 or zstd compression enabled.48 Rationale: this configuration minimizes the time the producer waits for acknowledgments and maximizes the amount of data sent per request. It accepts a small risk of data loss on leader failure in exchange for the highest possible performance.
●​ Maximum Durability / Zero Data Loss: replication.factor=3; min.insync.replicas=2; producer acks=all; enable.idempotence=true.58 Rationale: this combination ensures every acknowledged write is durably persisted on at least two brokers and protects against duplicates from producer retries. It is the gold standard for mission-critical data.
●​ Strict Ordering of Related Events: use a consistent message key for all related events (e.g., customerId, orderId).40 Rationale: Kafka's key-based partitioner guarantees that all messages with the same key are sent to the same partition, which in turn guarantees they are consumed in order.
●​ Exactly-Once Processing (Stream Processing): use Kafka Streams with processing.guarantee=exactly_once, or use the Transactional API with an idempotent producer and isolation.level=read_committed consumers.65 Rationale: this leverages Kafka's native support for idempotency and atomic transactions to ensure that end-to-end "consume-transform-produce" operations are processed exactly once, even in the face of failures.
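
For the durability entry above, the recommendation maps onto concrete settings roughly as follows (a minimal sketch; the topic-level values assume at least a three-broker cluster and are applied at topic creation or via later alteration):

```properties
# --- Topic settings (per topic): tolerate one broker loss with no data loss. ---
# replication.factor=3
# min.insync.replicas=2

# --- producer.properties: zero-data-loss producer. ---
# Acknowledge only once all in-sync replicas have persisted the write.
acks=all
# Deduplicate automatic retries so they cannot create duplicate records.
enable.idempotence=true
```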

The Path Forward:
For organizations building new data platforms, deploying directly in KRaft mode is the clear
and recommended path. For those with existing ZooKeeper-based clusters, developing a
migration plan to KRaft should be a high-priority infrastructure goal to benefit from the
operational simplicity and scalability improvements. Finally, for any use case involving
long-term data retention, a thorough evaluation of Tiered Storage is essential. This feature
can fundamentally alter the cost structure of a Kafka deployment, making it a viable and
cost-effective system of record for event data at a petabyte scale. By embracing these
modern architectural patterns, organizations can ensure their Kafka deployments are robust,
scalable, and ready for the future.

References

1. How (and why) Kafka was created at LinkedIn | Frontier Enterprise, accessed July 5, 2025, https://www.frontier-enterprise.com/unleashing-kafka-insights-from-confluent-jun-rao/
2. How Apache Kafka Powers Scalable Data Architectures - Peerbits, accessed July 5, 2025, https://www.peerbits.com/blog/everything-you-need-to-about-apache-kafka.html
3. en.wikipedia.org, accessed July 5, 2025, https://en.wikipedia.org/wiki/Apache_Kafka
4. en.wikipedia.org, accessed July 5, 2025, https://en.wikipedia.org/wiki/Apache_Kafka#:~:text=Kafka%20was%20originally%20developed%20at,Rao%20helped%20co%2Dcreate%20Kafka.
5. Kafka — All you want to know. History | by Ramprakash | Analytics Vidhya | Medium, accessed July 5, 2025, https://medium.com/analytics-vidhya/kafka-all-you-want-to-know-b9624e496006
6. History of Kafka - Data Lake for Enterprises [Book] - O'Reilly Media, accessed July 5, 2025, https://www.oreilly.com/library/view/data-lake-for/9781787281349/1ed43286-4179-4c35-b044-4c1b379753d3.xhtml
7. Apache Kafka: Past, Present and Future - Confluent | DE, accessed July 5, 2025, https://www.confluent.io/de-de/online-talks/apache-kafka-past-present-future-on-demand/
8. What is Apache Kafka? Introduction - Conduktor, accessed July 5, 2025, https://learn.conduktor.io/kafka/what-is-apache-kafka-part-1/
9. What is Apache Kafka? | Confluent, accessed July 5, 2025, https://www.confluent.io/what-is-apache-kafka/
10. What is Kafka? - Apache Kafka Explained - AWS, accessed July 5, 2025, https://aws.amazon.com/what-is/apache-kafka/
11. Using Apache Kafka for log aggregation - Redpanda, accessed July 5, 2025, https://www.redpanda.com/guides/kafka-use-cases-log-aggregation
12. Use Cases and Architectures for Apache Kafka across Industries ..., accessed July 5, 2025, https://www.kai-waehner.de/blog/2020/10/20/apache-kafka-event-streaming-use-cases-architectures-examples-real-world-across-industries/
13. Apache Kafka: Architecture, deployment and ecosystem [2025 guide] - Instaclustr, accessed July 5, 2025, https://www.instaclustr.com/education/apache-kafka/
14. The Past, Present and Future of Message Queue 1 - Vanus AI, accessed July 5, 2025, https://www.vanus.ai/blog/the-past-present-and-future-of-message-queue-1/
15. RabbitMQ vs Kafka - Difference Between Message Queue Systems - AWS, accessed July 5, 2025, https://aws.amazon.com/compare/the-difference-between-rabbitmq-and-kafka/
16. How Kafka distributes the topic partitions among the brokers - Codemia, accessed July 5, 2025, https://codemia.io/knowledge-hub/path/how_kafka_distributes_the_topic_partitions_among_the_brokers
17. Apache Kafka: Real-Time Event Streaming Platform Explained | by Tahir | Medium, accessed July 5, 2025, https://medium.com/@tahirbalarabe2/apache-kafka-real-time-event-streaming-platform-explained-12497b2bed44
18. Apache Kafka documentation, accessed July 5, 2025, https://kafka.apache.org/documentation/
19. Documentation - Apache Kafka, accessed July 5, 2025, https://kafka.apache.org/081/documentation.html
20. I've been trying to rationalize using either RabbitMQ or Kafka for something I'm... | Hacker News, accessed July 5, 2025, https://news.ycombinator.com/item?id=23259305
21. Kafka Vs RabbitMQ: Key Differences & Features Explained - Simplilearn.com, accessed July 5, 2025, https://www.simplilearn.com/kafka-vs-rabbitmq-article
22. Apache Kafka: What It Is, Use Cases and More | Built In, accessed July 5, 2025, https://builtin.com/data-science/what-is-kafka
23. Overview of Kafka architecture: brokers, topics, partitions, and ..., accessed July 5, 2025, https://www.codefro.com/2023/10/03/overview-of-kafka-architecture-brokers-topics-partitions-and-replication/
24. Best Practices for Kafka Production Deployments in Confluent Platform, accessed July 5, 2025, https://docs.confluent.io/platform/current/kafka/post-deployment.html
25. Kafka Replication | Confluent Documentation, accessed July 5, 2025, https://docs.confluent.io/kafka/design/replication.html
26. Disaster Recovery and High Availability in Apache Kafka: Best Practices for Resilient Streaming Systems | by Let's code - Medium, accessed July 5, 2025, https://medium.com/@letsCodeDevelopers/disaster-recovery-and-high-availability-in-apache-kafka-best-practices-for-resilient-streaming-5838122c3329
27. Kafka producer - Redpanda, accessed July 5, 2025, https://www.redpanda.com/guides/kafka-architecture-kafka-producer
28. Kafka Topics Choosing the Replication Factor and Partitions Count - Conduktor, accessed July 5, 2025, https://learn.conduktor.io/kafka/kafka-topics-choosing-the-replication-factor-and-partitions-count/
29. Starting out with Kafka clusters: topics, partitions and brokers | by Martin Hodges | Medium, accessed July 5, 2025, https://medium.com/@martin.hodges/starting-out-with-kafka-clusters-topics-partitions-and-brokers-c9fbe4ed1642
30. Kafka Partitions: Essential Concepts for Scalability and Performance - DataCamp, accessed July 5, 2025, https://www.datacamp.com/tutorial/kafka-partitions
31. A Beginner's Guide to Kafka® Consumers - Instaclustr, accessed July 5, 2025, https://www.instaclustr.com/blog/a-beginners-guide-to-kafka-consumers/
32. Architecture - Apache Kafka, accessed July 5, 2025, https://kafka.apache.org/34/documentation/streams/architecture
33. How does kafka consumers/producers commit messages/partitions? - Stack Overflow, accessed July 5, 2025, https://stackoverflow.com/questions/61679476/how-does-kafka-consumers-producers-commit-messages-partitions
34. Understanding Apache Kafka architecture – a definitive guide - Site24x7, accessed July 5, 2025, https://www.site24x7.com/learn/apache-kafka-architecture.html
35. Tutorial: Apache Kafka Producer & Consumer APIs - Azure HDInsight | Microsoft Learn, accessed July 5, 2025, https://learn.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-producer-consumer-api
36. Kafka Producer and Consumer. I talked about Kafka architecture in ..., accessed July 5, 2025, https://medium.com/@cobch7/kafka-producer-and-consumer-f1f6390994fc
37. How to send message to a particular partition in Kafka? - Stack Overflow, accessed July 5, 2025, https://stackoverflow.com/questions/50324249/how-to-send-message-to-a-particular-partition-in-kafka
38. Apache Kafka Partition Key: A Comprehensive Guide - Confluent, accessed July 5, 2025, https://www.confluent.io/learn/kafka-partition-key/
39. How Producer decides in which Partition it has to put the message? - Stack Overflow, accessed July 5, 2025, https://stackoverflow.com/questions/59389222/how-producer-decides-in-which-partition-it-has-to-put-the-message
40. Kafka Keys, Partitions and Message Ordering - Lydtech Consulting, accessed July 5, 2025, https://www.lydtechconsulting.com/blog-kafka-message-keys.html
41. dattell.com, accessed July 5, 2025, https://dattell.com/data-architecture-blog/does-kafka-guarantee-message-order/#:~:text=Kafka%20Consumer%20Offset.-,Kafka%20Guarantees%20Order,the%20message%20ordering%20is%20guaranteed.
42. Does Kafka Guarantee Message Order? - Dattell, accessed July 5, 2025, https://dattell.com/data-architecture-blog/does-kafka-guarantee-message-order/
43. How to produce messages to selected partition using kafka-console-producer? - Codemia, accessed July 5, 2025, https://codemia.io/knowledge-hub/path/how_to_produce_messages_to_selected_partition_using_kafka-console-producer
44. How to use Apache Kafka to guarantee message ordering? - Medium, accessed July 5, 2025, https://medium.com/latentview-data-services/how-to-use-apache-kafka-to-guarantee-message-ordering-ac2d00da6c22
45. Understanding Kafka Producer: How Partition Selection Works - Today I learned, accessed July 5, 2025, https://til.hashnode.dev/understanding-kafka-producer-how-partition-selection-works
46. Sending Data to a Specific Partition in Kafka - Baeldung, accessed July 5, 2025, https://www.baeldung.com/kafka-send-data-partition
47. Kafka Producer for Confluent Platform, accessed July 5, 2025, https://docs.confluent.io/platform/current/clients/producer.html
48. Optimizing Kafka Performance: Tips for Tuning and Scaling Kafka ..., accessed July 5, 2025, https://medium.com/@nemagan/optimizing-kafka-performance-tips-for-tuning-and-scaling-kafka-clusters-ebc08153c661
49. Apache Kafka — Understanding how to produce and consume messages? - Medium, accessed July 5, 2025, https://medium.com/@sirajul.anik/apache-kafka-understanding-how-to-produce-and-consume-messages-9744c612f40f
50. Delivery Semantics for Kafka Consumers | Learn Apache Kafka - Conduktor, accessed July 5, 2025, https://learn.conduktor.io/kafka/delivery-semantics-for-kafka-consumers/
51. Understanding In-Sync Replicas (ISR) in Apache Kafka - GeeksforGeeks, accessed July 5, 2025, https://www.geeksforgeeks.org/understanding-in-sync-replicas-isr-in-apache-kafka/
52. Leader Follower Pattern in Distributed Systems - GeeksforGeeks, accessed July 5, 2025, https://www.geeksforgeeks.org/system-design/leader-follower-pattern-in-distributed-systems/
53. broker - What is a partition leader in Apache Kafka? - Stack Overflow, accessed July 5, 2025, https://stackoverflow.com/questions/60835817/what-is-a-partition-leader-in-apache-kafka/60837212
54. Multi-Geo Replication in Apache Kafka - Confluent, accessed July 5, 2025, https://www.confluent.io/blog/multi-geo-replication-in-apache-kafka/
55. Kafka Replication & Min In-Sync Replicas - Lydtech Consulting, accessed July 5, 2025, https://www.lydtechconsulting.com/blog-kafka-replication.html
56. When does Kafka Leader Election happen? - Codemia, accessed July 5, 2025, https://codemia.io/knowledge-hub/path/when_does_kafka_leader_election_happen
57. Kafka Replication: Concept & Best Practices - GitHub, accessed July 5, 2025, https://github.com/AutoMQ/automq/wiki/Kafka-Replication:-Concept-&-Best-Practices
58. Learning Kafka - Configuring Kafka Producer for Message Durability - Blog, accessed July 5, 2025, https://dsinecos.github.io/blog/Learning-Kafka-Configure-Kafka-Producer-for-Message-Durability
59. What is a partition leader in Apache Kafka? - broker - Stack Overflow, accessed July 5, 2025, https://stackoverflow.com/questions/60835817/what-is-a-partition-leader-in-apache-kafka
60. Understanding Message Durability in Kafka | by Amarendra Singh - Dev Genius, accessed July 5, 2025, https://blog.devgenius.io/understanding-message-durability-in-kafka-8f6e7006aea8
61. Kafka — Data Durability and Availability Guarantees | by Mahesh Saini | The Life Titbits, accessed July 5, 2025, https://medium.com/the-life-titbits/kafka-data-durability-and-availability-guarantees-add5e4340638
62. Kafka Topic Replication | Learn Apache Kafka with Conduktor, accessed July 5, 2025, https://learn.conduktor.io/kafka/kafka-topic-replication/
63. Ensuring Message Ordering in Kafka: Strategies and Configurations | Baeldung, accessed July 5, 2025, https://www.baeldung.com/kafka-message-ordering
64. RabbitMQ vs. Kafka vs. ActiveMQ: A Battle of Messaging Brokers, accessed July 5, 2025, https://www.designgurus.io/blog/rabbitmq-kafka-activemq-system-design
65. Exactly-once Semantics is Possible: Here's How Apache Kafka Does it, accessed July 5, 2025, https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
66. Demystifying Apache Kafka Message Delivery Semantics - Keen IO, accessed July 5, 2025, https://keen.io/blog/demystifying-apache-kafka-message-delivery-semantics-at-most-once-at-least-once-exactly-once/
67. Kafka message delivery semantics: at most once, at least once, exactly once | by Navya PS, accessed July 5, 2025, https://medium.com/@psnavya90/kafka-message-delivery-semantics-at-most-once-at-least-once-exactly-once-14bc48046776
68. Apache Kafka's Exactly-Once Semantics in Spring Cloud Stream Kafka Applications, accessed July 5, 2025, https://spring.io/blog/2023/10/16/apache-kafkas-exactly-once-semantics-in-spring-cloud-stream-kafka/
69. How Kafka achieves exactly-once semantics | by Oleg Potapov - Medium, accessed July 5, 2025, https://oleg0potapov.medium.com/how-kafka-achieves-exactly-once-semantics-57fdb7ad2e3f
70. Exactly-Once Processing in Kafka explained | by sudan - Medium, accessed July 5, 2025, https://ssudan16.medium.com/exactly-once-processing-in-kafka-explained-66ecc41a8548
71. Exactly Once Processing in Kafka with Java | Baeldung, accessed July 5, 2025, https://www.baeldung.com/kafka-exactly-once
72. en.wikipedia.org, accessed July 5, 2025, https://en.wikipedia.org/wiki/Message_queue
73. What Is a Message Queue? | IBM, accessed July 5, 2025, https://www.ibm.com/think/topics/message-queues
74. Apache Kafka vs. ActiveMQ: Differences & Comparison - AutoMQ, accessed July 5, 2025, https://www.automq.com/blog/apache-kafka-vs-activemq-differences-and-comparison
75. Kafka vs RabbitMQ: Key Differences & When to Use Each | DataCamp, accessed July 5, 2025, https://www.datacamp.com/blog/kafka-vs-rabbitmq
76. Kafka vs Message Queue: A Quick Comparison - Linearloop, accessed July 5, 2025, https://www.linearloop.io/blog/kafka-vs-message-queue-a-quick-comparison
77. Benchmarking RabbitMQ vs Kafka vs Pulsar Performance - Confluent, accessed July 5, 2025, https://www.confluent.io/blog/kafka-fastest-messaging-system/
78. Apache Kafka® vs ActiveMQ: 5 key differences and how to choose - Instaclustr, accessed July 5, 2025, https://www.instaclustr.com/education/apache-kafka/apache-kafka-vs-activemq-5-key-differences-and-how-to-choose/
79. Difference between Apache Kafka, RabbitMQ, and ActiveMQ - DEV Community, accessed July 5, 2025, https://dev.to/somadevtoo/difference-between-apache-kafka-rabbitmq-and-activemq-4f1k
80. When to use RabbitMQ over Kafka? [closed] - Stack Overflow, accessed July 5, 2025, https://stackoverflow.com/questions/42151544/when-to-use-rabbitmq-over-kafka
81. When to use Apache kafka instead of ActiveMQ [closed] - Stack Overflow, accessed July 5, 2025, https://stackoverflow.com/questions/44792604/when-to-use-apache-kafka-instead-of-activemq
82. RabbitMQ vs. Apache Kafka | Confluent, accessed July 5, 2025, https://www.confluent.io/learn/rabbitmq-vs-apache-kafka/
83. ActiveMQ vs Kafka: Differences & Use Cases Explained - DataCamp, accessed July 5, 2025, https://www.datacamp.com/blog/activemq-vs-kafka
84. What is Apache Kafka? - Red Hat, accessed July 5, 2025, https://www.redhat.com/en/topics/integration/what-is-apache-kafka
85. Powered By - Apache Kafka, accessed July 5, 2025, https://kafka.apache.org/powered-by
86. Apache Kafka Use Cases: When to Choose It and When to Look Elsewhere - CelerData, accessed July 5, 2025, https://celerdata.com/glossary/apache-kafka-use-cases
87. Use Cases - Apache Kafka, accessed July 5, 2025, https://kafka.apache.org/uses
88. 10 Real-World Event-Driven Architecture Examples in Logistics. Implementing Kafka at Scale to Handle Supply Chain Network - nexocode, accessed July 5, 2025, https://nexocode.com/blog/posts/event-driven-architecture-examples-in-logistics-apache-kafka-to-handle-supply-chain-network/
89. Apache ActiveMQ vs. Kafka | Baeldung, accessed July 5, 2025, https://www.baeldung.com/apache-activemq-vs-kafka
90. Kafka vs Pulsar: Key Differences - Optiblack, accessed July 5, 2025, https://optiblack.com/insights/kafka-vs-pulsar-key-differences
91. Apache Kafka vs. Apache Pulsar: Differences & Comparison - AutoMQ, accessed July 5, 2025, https://www.automq.com/blog/apache-kafka-vs-apache-pulsar-differences-comparison
92. How is Apache Pulsar different from Apache Kafka? - Milvus, accessed July 5, 2025, https://milvus.io/ai-quick-reference/how-is-apache-pulsar-different-from-apache-kafka
93. When would you use Kafka, vs some other broker? : r/apachekafka - Reddit, accessed July 5, 2025, https://www.reddit.com/r/apachekafka/comments/hf8you/when_would_you_use_kafka_vs_some_other_broker/
94. Kafka vs. Pulsar vs. RabbitMQ: Performance, Architecture, and Features Compared, accessed July 5, 2025, https://www.confluent.io/kafka-vs-pulsar/
95. Kafka vs Pulsar: Choosing the Right Stream Processing Platform - RisingWave, accessed July 5, 2025, https://risingwave.com/blog/kafka-vs-pulsar-choosing-the-right-stream-processing-platform/
96. Comparing Apache Pulsar vs. Apache Kafka | 2022 Benchmark Report, accessed July 5, 2025, https://streamnative.io/blog/apache-pulsar-vs-apache-kafka-2022-benchmark
97. A More Accurate Perspective on Pulsar's Performance Compared to Kafka - StreamNative, accessed July 5, 2025, https://streamnative.io/blog/perspective-on-pulsars-performance-compared-to-kafka
98. What is Kafka Connect—Complete Guide - Redpanda, accessed July 5, 2025, https://www.redpanda.com/guides/kafka-tutorial-what-is-kafka-connect
99. What is Kafka Connect? Concepts & Best Practices - AutoMQ, accessed July 5, 2025, https://www.automq.com/blog/kafka-connect-architecture-concepts-best-practices
100. Kafka Connect | Confluent Documentation, accessed July 5, 2025, https://docs.confluent.io/platform/current/connect/index.html
101. Kafka Connectors—Overview, use cases, and best practices - Redpanda, accessed July 5, 2025, https://www.redpanda.com/guides/kafka-cloud-kafka-connectors
102. Architecture - Apache Kafka, accessed July 5, 2025, https://kafka.apache.org/20/documentation/streams/architecture
103. The Nuts and Bolts of Kafka Streams---An Architectural Deep Dive - YouTube, accessed July 5, 2025, https://www.youtube.com/watch?v=2_-WoWlAD5M
104. What is Apache Kafka? Ecosystem - Conduktor, accessed July 5, 2025, https://learn.conduktor.io/kafka/what-is-apache-kafka-part-3/
105. Apache Kafka® and ksqlDB for Confluent Platform, accessed July 5, 2025, https://docs.confluent.io/platform/current/ksqldb/concepts/apache-kafka-primer.html
106. Apache Kafka — Part III — ksqlDB - Medium, accessed July 5, 2025, https://medium.com/@selcuk.sert/apache-kafka-part-iii-ksqldb-f3f1b8cbaf60
107. Introduction to ksqlDB | Baeldung, accessed July 5, 2025, https://www.baeldung.com/ksqldb
108. Mastering ksqldb Tutorial: Your Ultimate Guide - RisingWave: Streaming Database Built on Open Standards, accessed July 5, 2025, https://risingwave.com/blog/mastering-ksqldb-tutorial-your-ultimate-guide/
109. Study Notes 6.11-12: Kafka ksqlDB, Connect & Schema Registry - DEV Community, accessed July 5, 2025, https://dev.to/pizofreude/study-notes-611-12-kafka-ksqldb-connect-schema-registry-2g9n
110. Schema Registry for Confluent Platform | Confluent Documentation, accessed July 5, 2025, https://docs.confluent.io/platform/current/schema-registry/index.html
111. Kafka Schema Registry in Distributed Systems | by Alex Klimenko - Medium, accessed July 5, 2025, https://medium.com/@alxkm/kafka-schema-registry-in-distributed-systems-8a99bad321b1
112. Kafka's Shift from ZooKeeper to Kraft | Baeldung, accessed July 5, 2025, https://www.baeldung.com/kafka-shift-from-zookeeper-to-kraft
113. KRaft: Apache Kafka Without ZooKeeper - SOC Prime, accessed July 5, 2025, https://socprime.com/blog/kraft-apache-kafka-without-zookeeper/
114. Kafka Raft vs. ZooKeeper vs. Redpanda, accessed July 5, 2025, https://www.redpanda.com/guides/kafka-alternatives-kafka-raft
115. The Evolution of Kafka Architecture: From ZooKeeper to KRaft | by Roman Glushach, accessed July 5, 2025, https://romanglushach.medium.com/the-evolution-of-kafka-architecture-from-zookeeper-to-kraft-f42d511ba242
116. Guide to ZooKeeper to KRaft migration - OSO, accessed July 5, 2025, https://oso.sh/blog/guide-to-zookeeper-to-kraft-migration/
117. KRaft - Apache Kafka Without ZooKeeper - Confluent Developer, accessed July 5, 2025, https://developer.confluent.io/learn/kraft/
118. Apache Kafka 4.0 Release: Default KRaft, Queues, Faster Rebalances, accessed July 5, 2025, https://www.confluent.io/blog/latest-apache-kafka-release/
119. Apache Kafka 4.0, accessed July 5, 2025, https://kafka.apache.org/blog
120. From ZooKeeper to KRaft: How the Kafka migration works - Strimzi, accessed July 5, 2025, https://strimzi.io/blog/2024/03/21/kraft-migration/
121. Migrate from ZooKeeper to KRaft on Confluent Platform, accessed July 5, 2025, https://docs.confluent.io/platform/current/installation/migrate-zk-kraft.html
122. Migrating Zookeeper to Kraft | The Write Ahead Log, accessed July 5, 2025, https://platformatory.io/blog/Migrating-Zookeeper-to-kraft/
123. The various tiers of Apache Kafka Tiered Storage - Strimzi, accessed July 5, 2025, https://strimzi.io/blog/2025/04/22/tha-various-tiers-of-apache-kafka-tiered-storage/
124. How to auto scale Apache Kafka with Tiered Storage in Production - OSO, accessed July 5, 2025, https://oso.sh/blog/how-to-auto-scale-apache-kafka-with-tiered-storage-in-production/
125. Kafka performance tuning strategies and practical tips - Redpanda, accessed July 5, 2025, https://www.redpanda.com/guides/kafka-performance-kafka-performance-tuning
126. Kafka monitoring: Tutorials and best practices - Redpanda, accessed July 5, 2025, https://www.redpanda.com/guides/kafka-performance-kafka-monitoring
127. The Hitchhiker's Guide to Disaster Recovery and Multi-Region Kafka ..., accessed July 5, 2025, https://www.warpstream.com/blog/the-hitchhikers-guide-to-disaster-recovery-and-multi-region-kafka
128. DR for Kafka Cluster : r/apachekafka - Reddit, accessed July 5, 2025, https://www.reddit.com/r/apachekafka/comments/1i93cmi/dr_for_kafka_cluster/
129. Building Bulletproof Disaster Recovery for Apache Kafka: A Field-Tested Architecture - OSO, accessed July 5, 2025, https://oso.sh/blog/building-bulletproof-disaster-recovery-for-apache-kafka-a-field-tested-architecture/
130. Replicate Multi-Datacenter Topics Across Kafka Clusters in Confluent Platform, accessed July 5, 2025, https://docs.confluent.io/platform/current/multi-dc-deployments/replicator/index.html
131. Build multi-Region resilient Apache Kafka applications with identical topic names using Amazon MSK and Amazon MSK Replicator | AWS Big Data Blog, accessed July 5, 2025, https://aws.amazon.com/blogs/big-data/build-multi-region-resilient-apache-kafka-applications-with-identical-topic-names-using-amazon-msk-and-amazon-msk-replicator/
132. Failover & Failback Runbooks - JetStream Software, accessed July 5, 2025, http://www.jetstreamsoft.com/wp-content/uploads/2020/05/Failover-Failback-Runbooks_v1.0.pdf
133. Cluster Linking Disaster Recovery and Failover on Confluent Cloud, accessed July 5, 2025, https://docs.confluent.io/cloud/current/multi-cloud/cluster-linking/dr-failover.html
134. Performing a failover or failback - Cloudera Documentation, accessed July 5, 2025, https://docs.cloudera.com/csm-operator/1.2/kafka-replication-deploy-configure/topics/csm-op-replication-failover-failback.html
