Apache Kafka Tutorial
October 4, 2023
This Apache Kafka tutorial is for absolute beginners and offers tips to help them learn Kafka in the long run. It covers fundamental aspects such as Kafka’s architecture and the key components within a Kafka cluster, and delves into more advanced topics like message retention and replication.
When compared with its counterparts, Apache Kafka provides major upgrades over the traditional messaging system. Below are the differences between a traditional messaging system (like RabbitMQ, ActiveMQ, etc.) and Kafka.

Feature | Kafka Streaming Platform | Traditional Messaging System
--- | --- | ---
Scalability | It is a distributed streaming system, so we can scale horizontally by adding more partitions. | Not a distributed system, so it is not possible to scale horizontally.
3. Kafka Architecture
Here’s an overview of the main components and their roles in Kafka’s architecture:
The above diagram represents Kafka’s architecture. Let’s discuss each component
in detail.
3.1. Message
A message is the primary unit of data within Kafka. Messages sent from a producer consist of the following parts:
• Message Value: It contains the actual data we want to transmit. Kafka does not interpret the content of the value; it is received and sent as-is. The value can be XML, JSON, a plain String, or anything else; Kafka does not care and stores everything. Many Kafka developers favor using Apache Avro, a serialization framework initially developed for Hadoop.
• Headers (Optional): Kafka allows adding headers that may contain additional meta-information related to the message.
• Partition and Offset Id: Once a message is sent to a Kafka topic, it also receives a partition number and an offset ID, which are stored within the message (see the sketch after this list).
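For illustration, here is a minimal sketch of how these parts map onto the Java client’s ProducerRecord; the topic name, key, value, and header shown are placeholders:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

// Value is the payload; the key is optional and drives partitioning; headers carry metadata.
ProducerRecord<String, String> record =
        new ProducerRecord<>("my_topic", "order-123", "{\"amount\": 42}");
record.headers().add("source", "web".getBytes(StandardCharsets.UTF_8));
// The partition number and offset are assigned by Kafka once the record is stored.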
3.2. Topics and Partitions
The topics and partitions play a crucial role in organizing and distributing messages across the cluster.
Topics are created on the Kafka broker and can have one or more partitions.
A partition is the smallest storage unit where the messages live inside the topic. The partitions have a significant effect on scalable message consumption. Each partition is an ordered, immutable sequence of records, meaning once a message is stored, it cannot be changed.
Each record in a partition is assigned a sequential number called the offset. The offset identifies the position of a record within its partition, and consumers use it to track the last consumed message in each partition. Each partition works independently of the others.
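For illustration, a topic and its partition count can be created programmatically with the Java AdminClient; the topic name, partition count, and broker address below are assumptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

try (AdminClient admin = AdminClient.create(props)) {
    // "my_topic" with 3 partitions and a replication factor of 1.
    NewTopic topic = new NewTopic("my_topic", 3, (short) 1);
    admin.createTopics(Collections.singletonList(topic)).all().get();
}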
3.3. Producer
A producer publishes messages using the topic name. The user is not required to specify the broker or the partition. By default, Kafka uses the message key to select the topic partition via the DefaultPartitioner, which applies a 32-bit murmur2 hash to the key.
Remember that the message key is optional; if no key is provided, Kafka distributes the data across partitions in a round-robin fashion (newer client versions default to a sticky partitioner that fills one partition’s batch at a time).
3.4. Consumer
A consumer reads messages from the Kafka cluster using a topic name. It continuously polls the Kafka broker for new messages on that topic. When the polling loop returns new messages, the consumer processes the retrieved records.
Consumer Group
When a topic is created in Kafka, it can have one or more consumer groups
associated with it. The consumer groups maintain the offset information for the
partitions they consume.
When messages are published to a topic, they are distributed across the partitions
in a configurable manner. Each consumer within a consumer group is assigned one
or more partitions to read from. Each partition is consumed by only one consumer
within a consumer group at a time. This ensures that all messages within a
partition are processed in the order they were received.
To decide which consumer reads from which partition, consumers within a group rely on the GroupCoordinator (which runs on a Kafka broker) and the ConsumerCoordinator (on the client side), which together assign partitions to consumers.
Consumer Offset
Typically, Kafka consumers have three options for where to begin reading messages from a partition: from the beginning of the partition, from the end (only new messages), or from a specific offset.
Consumer offset represents the position within a partition up to which a consumer group has consumed messages. In other words, each consumer group maintains its own offset for each partition it consumes. The offset determines the next message to read from a specified partition inside the topic.
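As a brief sketch, the starting position is commonly controlled through the auto.offset.reset consumer setting (the value shown is an assumption):

// In the consumer configuration: where to start when the group has no committed offset.
// "earliest" reads from the beginning of the partition, "latest" reads only new messages.
properties.put("auto.offset.reset", "earliest");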
3.5. Broker
A Kafka broker is a single server instance that stores and manages the partitions. Brokers act as a bridge between consumers and producers.
Kafka brokers store data in a directory on the server disk they run on. Each topic partition gets its own sub-directory, named after the topic and the partition number.
A client that wants to send or receive messages through the Kafka cluster may
connect to any broker in the cluster. Each broker in the cluster has metadata about
all the other brokers, and therefore any broker in the cluster can act as a
bootstrap server (the initial connection point used by Kafka clients to connect to
the cluster).
The client connects to the provided broker (bootstrap server) and requests
metadata about the Kafka cluster, such as the addresses of all the other brokers,
the available topics, and the partition information. Once the client has obtained the
metadata from the bootstrap server, it can establish connections to other brokers in
the cluster as needed for producing or consuming messages.
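A hedged sketch of this bootstrap step using the Java AdminClient (the broker address is an assumption):

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // any broker can act as the bootstrap server

try (AdminClient admin = AdminClient.create(props)) {
    // The metadata response tells the client about every other broker in the cluster.
    admin.describeCluster().nodes().get().forEach(node ->
            System.out.println("Broker " + node.id() + " at " + node.host() + ":" + node.port()));
}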
3.7. Zookeeper
ZooKeeper manages and coordinates the brokers in a Kafka cluster. Among other things, it:
• determines which broker is the leader of a given partition and topic and performs leader elections.
• sends notifications to Kafka in case of changes (e.g., a new topic is created, a broker dies, a broker comes up, a topic is deleted, etc.).
4. Replication
Suppose we store the data in only one partition with no copies; if the broker holding it goes down, the data may be lost. To avoid such data loss, Kafka uses replication.
For a specific partition, one broker is designated the leader and the other brokers act as followers. The leader assumes responsibility for the topic partition, while any additional broker that keeps track of the leader partition is called a follower and stores replicated data for that partition.
Note that the leader receives all incoming messages from producers and serves them to consumers. Followers do not serve read or write requests directly from producers or consumers; they act as backups and can take over as the leader in case the current leader fails.
4.2. Replication-Factor
Consider a cluster of three brokers with a replication factor of 2. Let’s say a producer produces a message to Partition 0; it goes to the leader partition. Upon receiving the message, Broker 1 stores it persistently in the file system. Since the replication factor is 2, we need one more copy of the message, so the follower replica on another broker receives a copy of the same message and stores it in its file system.
When a partition is replicated across multiple brokers, not all replicas are necessarily in sync with the leader at all times. The in-sync replicas (ISR) are the set of replicas that are up-to-date and synchronized with the partition’s leader. The leader continuously sends messages to the in-sync replicas, and they acknowledge the receipt of those messages.
When utilizing Kafka, producers can write data only to the leader broker for a given partition. Furthermore, producers must specify the level of acknowledgment, known as acks, which determines the minimum number of replicas that must receive the message before the write operation is considered successful.
Let us consider a few scenarios of how this value affects the message producers.
acks = 0
With acks = 0, the producer does not wait for any acknowledgment from the broker and considers the write successful as soon as the message is sent. However, this approach comes with a risk: if the broker goes offline or an exception occurs, the producer won’t be notified, and data loss may occur. This method is typically suitable for scenarios where it is acceptable to lose messages, such as metrics collection. It offers the advantage of the highest throughput since the network overhead is minimized.
acks = 1
With acks = 1, the producer waits for an acknowledgment from the leader only. Replication to the followers occurs in the background, but it is not guaranteed before the acknowledgment is sent. If no acknowledgment is received, the producer can retry the request. However, if the leader broker goes offline unexpectedly and the replicas haven’t replicated the data yet, data loss may occur.
acks = all
With acks = all, the producer receives a successful acknowledgment only after the message has been replicated to all in-sync replicas. To ensure the safety of writing the message, the leader for a partition checks whether there are enough in-sync replicas, which is determined by the broker setting min.insync.replicas. The request remains stored in a buffer until the leader confirms that the follower replicas have replicated the message. At this point, a successful acknowledgment is sent back to the client.
For instance, let’s consider a topic with three replicas and min.insync.replicas set to
2. In this case, writing to a partition in the topic is possible only when at least two
out of the three replicas are in sync. When all three replicas are in-sync, the
process proceeds as usual. This remains true even if one of the replicas becomes
unavailable. However, if two out of three replicas are unavailable, the brokers will no
longer accept produce requests. Instead, producers attempting to send data will
receive a NotEnoughReplicasException.
The most widely adopted choice for ensuring data durability and availability, capable of tolerating the loss of a single Kafka broker, is setting acks=all and min.insync.replicas=2.
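As a brief sketch of this recommendation in configuration (the surrounding producer setup is assumed from the examples in this tutorial):

// Producer configuration (Java client): wait for all in-sync replicas to acknowledge.
properties.put("acks", "all");

// Broker or per-topic configuration (e.g., in server.properties):
// min.insync.replicas=2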
In Apache Kafka, the commit log is an append-only data structure that records all published messages in the order they were received. Each record in the log represents a single message, stored in the order it was produced, which preserves message ordering within a partition.
Let’s understand the commit log in Kafka using the diagram below.
When a message is produced, the record is saved to a file with the “.log” extension. Each partition within a Kafka topic has its own dedicated log file. Therefore, if there are six partitions for a topic, there will be six log files in the file system. These files are commonly referred to as Partition Commit Logs.
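For example, a topic named my_topic with two partitions might produce a layout like this on disk (assuming the default log directory; the names are illustrative):

/tmp/kafka-logs/my_topic-0/00000000000000000000.log
/tmp/kafka-logs/my_topic-1/00000000000000000000.log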
After the messages are written to the log file, the produced records are then
committed. Consequently, only the records that have been committed to the file
system are visible to consumers actively polling for new records.
Subsequently, as new records are published to the Kafka Topic, they are appended
to the respective log file, and the process continues seamlessly, ensuring that
messages are not lost in the event of failures.
Note that although the commit log is an append-only structure, Kafka provides
efficient random access to specific offsets within a partition. Consumers can read
messages from any offset in a partition, allowing them to replay or skip messages
as needed.
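As a hedged illustration, given a KafkaConsumer<String, String> consumer (like the one shown in the Consumer API example later), a specific offset can be targeted with the seek API; the topic, partition, and offset are placeholders:

import java.util.Collections;
import org.apache.kafka.common.TopicPartition;

TopicPartition partition = new TopicPartition("my_topic", 0);
consumer.assign(Collections.singletonList(partition));
// Jump to offset 42; subsequent polls read from that position onward.
consumer.seek(partition, 42L);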
The retention policy is the primary determinant of how long messages are stored, making it a crucial policy to establish. By default, Kafka retains messages for 168 hours, equivalent to 7 days. This default retention period can be adjusted as needed.
If the log retention period is exceeded, Kafka will automatically delete the
corresponding data from the log. This process is controlled by the
log.retention.check.interval.ms property, which specifies the interval at
which retention checks occur (e.g., 300000 milliseconds).
Also, when the log size reaches a specified threshold, a new log segment is
created. The log.segment.bytes property determines the size of each log
segment, with a default value of 1073741824 bytes (1 gigabyte).
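These properties live in the broker configuration (server.properties); the values below are Kafka’s defaults:

log.retention.hours=168
log.retention.check.interval.ms=300000
log.segment.bytes=1073741824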
Kafka APIs play a crucial role in enabling the implementation of various data
pipelines and real-time data streams. They serve as a bridge between Kafka clients
and Kafka servers, facilitating seamless communication and data transfer.
There are 4 APIs available that developers can use to leverage Kafka capabilities: the Producer API, the Consumer API, the Streams API, and the Connector API.
The Producer API allows applications to publish a stream of records to one or more Kafka topics. A minimal producer sketch (the broker address, topic name, key, and value are assumptions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties properties = new Properties();
properties.put("bootstrap.servers", "localhost:9092");
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
producer.send(new ProducerRecord<>("my_topic", "key", "Hello, Kafka!"));
producer.close();
properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeser
properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDes
consumer.subscribe(Collections.singletonList(topic));
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
System.out.printf("Received message: key=%s, value=%s%n", record.key()
}
}
• Source Connector – is used to pull data from an external data source such as a database, file system, or Elasticsearch and store it in Kafka topics, making the data available for stream processing.
• Sink Connector – is used to push data from Kafka topics out to an external system such as a database or file system.

For example, the following configuration defines a JDBC source connector that streams rows from a MySQL table into Kafka:
{
  "name": "mysql-source-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:mysql://localhost:3306/mydb",
    "connection.user": "user",
    "connection.password": "password",
    "mode": "timestamp",
    "timestamp.column.name": "updated_at",
    "table.whitelist": "my_table",
    "topic.prefix": "mysql-",
    "validate.non.null": "false"
  }
}
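Such a configuration is typically registered by POSTing the JSON to the Kafka Connect REST API, which listens on port 8083 by default, for example with curl -X POST -H "Content-Type: application/json" --data @connector.json http://localhost:8083/connectors.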
The Streams API is built upon the foundations of the Producer and Consumer APIs. It offers advanced processing capabilities and empowers applications to perform continuous, end-to-end stream processing. This involves consuming records from one or more topics, performing analysis, aggregation, or transformation operations as required, and subsequently publishing the resulting streams back to the same or other designated topics.
transformedStream.to("my_output_topic");
Kafka provides APIs for administrative operations and metadata management for
Kafka clusters. Using admin APIs, developers can create and delete topics, manage
consumer groups, modify configurations, and retrieve cluster metadata.
Other than the above-mentioned APIs, developers can use KSQL, an open-source streaming SQL engine for Apache Kafka. It allows us to process (transform and aggregate) and analyze the real-time streaming data present in the Kafka platform using standard SQL constructs like SELECT, JOIN, GROUP BY, and WHERE clauses.
Check this article to learn how to start a Kafka server on a local system and execute different Kafka commands.
8. Advantages
9. Limitations
• Issues with Message Tweaking – Kafka relies on specific system calls (zero-copy transfers) to deliver messages to consumers unchanged, so any modification made to messages in flight negatively impacts performance. Kafka performs at full efficiency only when messages pass through unmodified.
• No support for wildcard topic selection – Kafka matches only exact topic names and does not offer support for wildcard topic selection. This limitation prevents Kafka from addressing certain use cases that require matching topic names by pattern.
• Performance – While individual message size typically does not pose issues, as messages grow larger, brokers and consumers must compress and then decompress them. Decompression gradually consumes node memory, and compression along the data pipeline affects throughput and overall performance.
10. Conclusion
This Apache Kafka tutorial provided a comprehensive overview of Kafka and its key features. We explored the various components of a Kafka cluster, including brokers, producers, and consumers, and delved into core concepts such as topics, partitions, consumer groups, commit logs, and retention policies.
Happy Learning!!