
Kafka

Table of contents:
Introduction:
Publish-Subscribe (Pub-Sub):
Message Queuing:
Use Cases:
Design:
Components of Kafka:
Message:
Topic:
Partition:
Message key:
Message offset:
Schemas:
Brokers:
Producers:
Consumers:
Partitioning:
Partition Rebalancer:
Custom Logic of Partitioning:
Kafka Producer:
Sending Messages:
Producer Configurations:
Ordering Issue:
Producer Serialization:
Using Avro:
Kafka Consumer:
Consumer Groups:
Consumer:
Poll Loop:
Thread safety:
Consumer Configuration:
Commits and Offsets:
Offset Commit Configuration:
Handling Rebalances:
Stopping a Consumer:
Consumer Deserialization:
Using Avro:
Running a Single Consumer:
Kafka Internals:
Replication:
Types of Replica:
Kafka Controller:
Broker Membership:

Controller:
Request Processing:
Produce Request:
Fetch Request:
Partition Allocation:
Directory Allocation:
Data Storage:
Data Retention:
File Format:
Index:
Compaction:
Reliability:
Broker Replication Config:
Exactly Once Delivery:

Introduction:
Kafka is described by the official documentation as:

A distributed event streaming platform that lets you read, write, store, and process events (also called
records or messages in the documentation) across many machines.

An event can be thought of as an independent piece of information that needs to be relayed from a producer to a consumer.
Kafka and other messaging systems are intermediaries that move data records from one application to another and decouple them
from each other. The producer of data doesn’t know who the consumer of data is or even when the data is consumed. This allows
developers to focus on the core logic of their applications and relieves them of directly connecting producers to consumers. This
design that allows for decoupling producers and consumers is called asynchronous messaging and consists of two patterns:

Publish-Subscribe (Pub-Sub)

Message Queuing

Publish-Subscribe (Pub-Sub):
In the publish-subscribe model, a participant in the system produces data and publishes the data to a channel or topic. The
message can be consumed by multiple readers and the messages are always delivered to the consumers in the order that they
were published.

Message Queuing:
In contrast to the pub-sub pattern, message queuing publishes a message to a topic or channel which is processed exactly once by
one consumer. Once the message has been processed and the consumer acknowledges consumption of the message, the
message is deleted from the queue. Implementation dictates which consumer a message will be delivered to, for processing.

Use Cases:
Messaging: Kafka can be used in scenarios where applications need to send out notifications.

Metrics and Logging: Applications can publish metrics to Kafka topics which can then be consumed by monitoring and alerting
systems. Similarly, logs can be published to Kafka topics which can then be routed to log search systems such as Elasticsearch.

Commit Log: Kafka is based on the concept of a commit log, which opens up the possibility of using it for capturing database changes.
The stream of changes can be used to replicate database updates on a remote system.

Stream Processing: Kafka can be used by streaming frameworks to allow applications to operate on Kafka messages to
perform actions such as counting metrics, partitioning messages for processing by other applications, combining messages, or
applying transformations on them.

Design:
Kafka’s design and data model are inspired by the commit log. A commit log, a.k.a. a write-ahead log, is a sequence of records where
each record has its own unique identifier. There are some characteristics of a commit log that differentiate it from other kinds of logs
and storage:

Records can only be appended at the end of a commit log.

Records are immutable (can’t be modified).

A commit log is always read from left to right. The record offsets can be used to specify the start and end of a read.

The commit log can be thought of as a story of events that are ordered in time and can be used to recreate or replicate changes. For
instance, a replica of a database can read the changes from a commit log to bring itself up to speed with the state of the active
database replica. Commit logs are simple, fast, and can handle large volumes of data better than a traditional relational database.
Another benefit of commit logs is that a complex system can continue to work when certain subcomponents face failure. These
components can consult the commit log when coming back online and replay the events to get to the current state of the system.
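
To make the idea concrete, below is a toy, in-memory sketch of an append-only log in Java (not Kafka code, just an illustration of the concept): appends go only to the end, each record is identified by the offset it was written at, and reads proceed forward from a given offset.

import java.util.ArrayList;
import java.util.List;

// A toy append-only commit log: records are immutable, appends go to the end,
// and each record is identified by the offset at which it was written.
public class ToyCommitLog {

    private final List<String> records = new ArrayList<>();

    // Append a record to the end of the log and return its offset.
    public synchronized long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Read records sequentially, starting from the given offset.
    public synchronized List<String> readFrom(long offset) {
        return new ArrayList<>(records.subList((int) offset, records.size()));
    }
}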

Components of Kafka:
Message:
A message is simply an array of bytes from Kafka’s perspective. A message is also the unit of data in the Kafka ecosystem.

Messages are batched rather than being sent individually to reduce overhead. This leads to the classical tradeoff between
latency and throughput. As batch sizes grow larger, throughput increases as more messages are handled per unit of time. At the
same time, it takes longer for an individual message to be delivered thus increasing latency.

Messages batched together are compressed for efficient data transfer.

Topic:
Messages get written to and read from topics.

Partition:
A topic has subdivisions known as partitions.

Messages are ordered by time only within a partition and not across the entire topic.

Messages are read from beginning to end in a partition.

Messages can only be appended to the end of a partition.

Partitions allow Kafka to scale horizontally and also provide redundancy. Each partition can be hosted on a different server,
which allows new partitions to be added to a topic as the load on the system increases.

Message key:
Since partitions make up a topic, messages, in reality, get written to and read from partitions. We can control which partition a
message lands in with the use of message keys. A message key is treated like a byte array by Kafka and is optional. All messages
with the same key are written to the same partition.

Message offset:
Messages also have metadata associated with them called the offset. The offset is an integer value that is ever-increasing and
determines the order of the message within a partition. By remembering the message offset, the consumer is able to continue from
where it previously left off.

Schemas:
Kafka doesn’t mandate that messages conform to any given schema. However, it is recommended that messages follow a structure
and form that allows for easy understanding. Messages can be expressed in JSON, XML, Avro, or other formats. It is important to
think about and mitigate issues that arise from schema changes; Avro provides support for schema evolution and allows for
backward and forward compatibility.

Brokers:
A single Kafka server is called a broker. Usually, several Kafka brokers operate as one Kafka cluster. The cluster is controlled by
one of the brokers, called the controller, which is responsible for administrative actions such as assigning partitions to other brokers
and monitoring for failures. The controller is elected from the live members of the cluster.

A partition can be assigned to more than one broker, in which case the partition is replicated across the assigned brokers. This
creates redundancy in case one of the brokers fails and allows another broker to take its place without disrupting access to the
partition for the users. The replication factor determines the number of replicas for a partition. Within a cluster, a single broker
owns a partition and is said to be the leader. All the other partition-replicating brokers are called followers. Every producer and
consumer interacting with the partition must connect to the leader for that partition.
Messages in Kafka are stored durably for a configurable retention period.

A broker is responsible for receiving messages from producers and committing them to disk. Similarly, brokers also receive requests
from readers and respond with messages fetched from partitions.
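
As a concrete illustration of partitions and the replication factor, the sketch below creates a topic using Kafka's AdminClient. The topic name, partition count, and replication factor are arbitrary example values, not part of the notes above.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, each replicated on 2 brokers (requires a cluster with at least 2 brokers).
            NewTopic topic = new NewTopic("datajek-topic", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}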

Producers:
Producers create messages and are sometimes known as writers or publishers. Producers can direct messages to specific
partitions using the message key and implement complex rules for partition assignment using a custom partitioner.

Consumers:
Consumers read messages and are sometimes known as subscribers or readers. Consumers operate as a group called
the consumer group. A consumer group consists of several consumers working together to read a topic. It is also possible for a
consumer group to have a single consumer in it. Each partition is read by a single member of the group, though a single consumer
can read multiple partitions. The mapping of a consumer to a partition is called the ownership of the partition by the consumer. If
a consumer fails, the remaining consumers in the group will rebalance the partitions amongst themselves to make up for the failed
member.

Partitioning:
When sending messages to Kafka topics, we can choose whether or not to specify a key. If we don’t specify a key (i.e. it is set to
null), Kafka’s default mechanism takes over to assign a partition. The default partitioner directs the message to one of the available
partitions in a round-robin fashion. If the key is specified but a partitioner class isn’t, then the default partitioner is employed: it
generates a hash of the key, and messages with the same key always produce the same hash and are therefore delivered to the same partition.
Usually, partitions have replicas that can receive a message for processing when another replica is rendered out of service.
Generally, new partitions shouldn’t be added to a topic if partitioning by keys is important, since existing keys would then hash to different partitions.

Partition Rebalancer:
Membership of consumers in a consumer group is coordinated by a designated Kafka broker referred to as the group coordinator.
It receives heartbeats from consumers to confirm that they are alive and healthy. When the group coordinator doesn’t hear from a
consumer for a configurable period of time, it assumes the particular consumer has crashed and triggers a partition rebalance. A
partition rebalance assigns the partitions that the dead consumer was reading from to the remaining healthy consumers in a
consumer group. This process of changing ownership of a partition from one consumer to another belonging to the same consumer
group is known as a partition rebalance. A rebalance can also happen when partitions are added to a topic.
While a rebalance is in progress, the partitions are unavailable for consumers to read. When a consumer gracefully leaves a
consumer group, it informs the coordinator of its intentions with the close call. The group coordinator doesn’t have to wait to detect
that a consumer has left and can initiate a rebalance immediately. Another side-effect of a rebalance is that consumers may have to
invalidate their caches or any other maintained state as they’ll now be reading from different partitions.
The assignment of partitions to consumers during a rebalance is done by the consumer designated as the group leader. Consumers
join a consumer group by making a JoinGroup request to the coordinator. The first one to do so is also set as the consumer leader.
The consumer leader receives the list of other healthy consumers within the consumer group from the group coordinator and makes
partition assignments. These assignments are sent to the group coordinator which in turn distributes the assignments to consumers.
Each consumer can only see its own assignment; the consumer leader
is the only client process which has the list of assignments for all the consumers within the group. The partition assignment logic is
decided by an implementation class of the PartitionAssignor interface.

Below is a pictorial representation of partition reassignment when one of the consumers in a consumer group goes down.

Custom Logic of Partitioning:
Kafka also gives its users the ability to specify their own custom logic for how partitioning should take place. Users can set the
property partitioner.class when creating the producer object, similar to how the key and value serializer properties are specified. The
custom partitioner class must implement the interface org.apache.kafka.clients.producer.Partitioner

An example implementation of a partitioner, for a scenario where messages with the key "USA" must always land in the last partition, appears below:

import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.InvalidRecordException;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

public class CustomPartitioner implements Partitioner {

    // Pass any configurations to use in this method
    @Override
    public void configure(Map<String, ?> configs) {
    }

    // The logic to exercise custom partitioning
    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes,
                         Cluster cluster) {

        // Retrieve the number of partitions for the topic
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();

        // Throw an exception if the key isn't specified
        if ((keyBytes == null) || (!(key instanceof String)))
            throw new InvalidRecordException("Message is missing the key");

        // Messages with key=USA will always land in the last partition (index numPartitions - 1)
        if (((String) key).equals("USA"))
            return numPartitions - 1;

        // Other records get hashed to the rest of the partitions, excluding the last one
        return Math.abs(Utils.murmur2(keyBytes)) % (numPartitions - 1);
    }

    @Override
    public void close() {
    }
}
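
To use this partitioner, it can be registered with the producer through the partitioner.class property, alongside the usual serializer properties. The snippet below is a minimal sketch of that wiring:

Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "localhost:9092");
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Register the custom partitioner with the producer
kafkaProps.put("partitioner.class", CustomPartitioner.class.getName());

KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps);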

Kafka Producer:
Producers write messages to Kafka that can then be consumed by readers.

Write workflow:

1. An object of class ProducerRecord is instantiated, which must contain the intended topic of the message and the message itself,
which is the value. The message key and partition can optionally be included.

2. Next, the key and values to be sent over the network are serialized. At this point, a serialization exception can be thrown if
serialization fails. Two properties key.serializer and value.serializer need to be set to the classes that know how to convert the
key and value objects to bytes.

3. The data is next sent to the partitioner. If the partition has been specified in the ProducerRecord, then the partitioner takes no action
and simply returns the already specified partition. If the partition is not specified, the partitioner selects a partition based on the
key.

4. Once the partition has been determined, the producer adds the message to a batch of records waiting to be sent to the same
topic and partition. A different thread is responsible for sending the batches of records to the right Kafka brokers. The location of
the Kafka cluster is gleaned from the property bootstrap.servers , which is a list of host:port pairs that the producer will attempt
to connect to. This doesn’t need an exhaustive list of brokers since the producer receives additional information after the initial
connection.

5. Once the broker receives the batch of records and writes the messages to Kafka successfully, it returns an object
of RecordMetadata that contains information about the topic, partition, and offset of the record within the partition.

6. If the broker fails to write the messages, it will return an error to the producer. The producer attempts a few retries before giving
up.

Sending Messages:
There are three ways in which messages can be sent by producers:

Fire and forget: The message is sent to Kafka with no effort made to verify whether it was received successfully by the Kafka
broker. In most instances, the messages will be received by Kafka, since the producer has retries built in for failures. However,
this method is still prone to losing messages once the retries are exhausted.

Synchronous: Message is sent and the future object returned has its get() method invoked to see if the Kafka broker received
the message successfully.

Asynchronous: Messages are sent and a callback is specified, which is invoked when a response is received from the Kafka
broker. In this method, the producer doesn’t wait to receive a response from Kafka before continuing to send other messages.

public static void main(String[] args) {


// Set-up mandatory properties
Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "localhost:9092");
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Create an object of KafkaProducer and pass in the properties


KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps);

// Create a record/message that we want to send to Kafka.


ProducerRecord<String, String> record = new ProducerRecord<>("datajek-topic", "my-key", "my-value");

try {
// Send the message to Kafka. The send(...) method returns a Future object for RecordMetadata.
Future<RecordMetadata> future = producer.send(record);

// We perform a get() on the future object, which turns the send call synchronous. The RecordMetadata object is returned if the message was written successfully.
RecordMetadata recordMetadata = future.get();

// The RecordMetadata object contains the partition the message was written to and its offset within that partition.
System.out.println(String.format("Message written to partition %s with offset %s", recordMetadata.partition(), recordMetadata.offset()));

} catch (Exception e) {
System.out.println("Exception sending message " + e.getMessage());
} finally {
// When you're finished producing records, you can flush the producer to ensure everything has been written to Kafka, and then close the producer to release its resources.
producer.flush();
producer.close();
}
}

In the example above, we sent the messages synchronously. This approach doesn’t scale when we have millions of messages to be
sent. The solution, then, is to send messages asynchronously and specify a callback that is invoked when a response from the
Kafka cluster is received. Let’s rewrite our previous example as an asynchronous producer.

public class AsyncProducerExample {


public static void main(String[] args) throws InterruptedException {
Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "localhost:9092");
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps);


ProducerRecord<String, String> record = new ProducerRecord<>("datajek-topic", "my-key-async", "my-value-async");

try {
// We must pass an object of type that implements the interface org.apache.kafka.clients.producer.Callback.
producer.send(record, new ProducerCallback());
System.out.println("Kafka message sent asynchronously.");
} catch (Exception e) {
System.out.println("Exception sending message asynchronously" + e.getMessage());
} finally {
producer.flush();
producer.close();
}

// wait before exiting to hear from Kafka broker


Thread.sleep(3000);
}

private static class ProducerCallback implements Callback {


@Override
public void onCompletion(RecordMetadata recordMetadata, Exception e) {
if (e != null) {
e.printStackTrace();
} else {
System.out.println(String.format("Message written to partition %s with offset %s", recordMetadata.partition(), recordMetada

Kafka 10
}
}
}
}

The callback can be specified by instantiating an object of any class that implements the
interface org.apache.kafka.clients.producer.Callback

Producer Configurations:
There are several knobs and levers that can affect the behavior and performance of producers. All of them are documented in the
Apache Kafka documentation and come with sensible defaults. We’ll discuss the most important ones that can significantly impact
memory usage, reliable delivery, and performance.

acks Specifies the number of replicas of a partition that must receive a message before the producer can consider the write
successful. There are only three values that acks can take on:

acks=0 : Setting acks equal to zero implies the producer doesn’t wait to hear back from the Kafka cluster and assumes each
message has been sent successfully. Obviously, this can lead to lost messages, but the strategy achieves the highest
throughput.

acks=1 : In this setting, the producer receives a confirmation once the leader replica receives the message. If the leader
crashes and a new leader has not yet been elected, an error is returned to the producer, which can retry sending the
message. However, the message can still get lost if the leader crashes and a replica that has not received the message is
elected as the new leader (known as an unclean election). In this setting, the throughput is determined by whether the messages
are sent synchronously or asynchronously. In the latter case, the throughput is capped by the number of in-flight messages.

acks=all : This setting returns a response to the producer once all the in-sync replicas have written the sent message. The increased reliability
is accompanied by increased latency, as all the replicas must receive the message.

buffer.memory This is the amount of memory the producer can use to buffer messages waiting to be sent to the brokers. If
messages are produced at a rate faster than the speed at which they can be delivered to the Kafka broker, the producer will
accumulate the messages in this buffer. If the buffer fills up, the producer may block on send(...) calls or throw an exception,
depending on other config settings.

compression.type Allows the messages being sent to be compressed. The trade-off is CPU utilization, which goes up as the
messages are compressed and decompressed on the sending (producer) and receiving (consumer) ends respectively. The
broker stores the messages from the producer without decompressing them. Additionally, compression also increases the
latency to send a message. The different algorithms supported for compression include gzip, lz4, and snappy.

retries The retries parameter sets the number of times a producer retries sending a message before giving up and declaring
a failure to the client. There are two kinds of failures a producer can encounter:

1. The first are failures that can’t be retried, such as “message too large” errors.

2. The second are failures which can be retried, for example a write failing because of the absence of a partition leader. These
failures are automatically retried by the producer. The application should only handle the case when the retries for transient
failures have been exhausted.

retry.backoff.ms Denotes the milliseconds the producer waits before re-attempting a failed send. Generally, the total time
covered by the retries and the waits between them should be greater than the time it takes for a broker to recover from a crash;
otherwise, the producer will declare a failure too soon.

batch.size Controls the amount of memory (in bytes) used for each batch of messages to be sent to the same partition. The producer
doesn’t necessarily wait for the batch to fill to capacity before sending the messages. It may instead send half-full batches or
even a batch with a single message.

linger.ms By default, Kafka sends a batch as soon as it can, which could mean sending a batch that holds a single message. To
improve throughput, the producer can be configured to wait up to linger.ms milliseconds for additional messages to be added to a
batch before sending the batch out. A batch is sent out either when it is full or when linger.ms milliseconds have elapsed.

client.id This can be any string that identifies a producer.

max.in.flight.requests.per.connection Represents the number of messages a producer can send without receiving a
corresponding response from the brokers. Setting a higher value increases memory usage but also increases throughput.
Setting this value to 1 guarantees messages are written to the brokers in order despite failures and retries.

max.request.size Controls the maximum size of message or cumulative size of all the messages in a single request that can be
sent by the producer. Note that the broker also has a setting that determines the maximum size of the message the broker will
accept. A broker may still reject a request if the message size is larger than what it can accept.

max.block.ms Determines how long the producer will block when calling the methods send(...) and partitionsFor(...) . These
methods block when the producer’s send buffer is full or when the metadata isn’t available.

receive.buffer.bytes & send.buffer.bytes Control the size of the TCP send and receive buffers used by sockets when writing and
reading data. Setting them to -1 sets the sizes to the OS default.

timeout.ms Represents the milliseconds the broker should wait (on behalf of the producer) to receive a response from in-sync
replicas to acknowledge the message sent by the producer so that the ack configuration is met.

request.timeout.ms The amount of time the producer will wait to receive a reply from the broker for a sent request. On reaching
the timeout, the producer either attempts a retry or raises an error.

metadata.fetch.timeout.ms The amount of time the producer will wait for a reply when requesting metadata. On reaching the
timeout, either a retry is attempted or an error is raised.
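
As a rough illustration, several of these settings could be combined on a producer as shown below. The values are arbitrary examples rather than recommendations; the documented defaults are usually a good starting point.

Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "localhost:9092");
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Illustrative tuning values
kafkaProps.put("acks", "all");                 // wait for all in-sync replicas
kafkaProps.put("retries", "3");                // retry transient failures a few times
kafkaProps.put("retry.backoff.ms", "200");     // wait 200 ms between retries
kafkaProps.put("batch.size", "16384");         // 16 KB batches
kafkaProps.put("linger.ms", "10");             // wait up to 10 ms to fill a batch
kafkaProps.put("compression.type", "snappy");  // compress batches before sending
kafkaProps.put("client.id", "datajek-producer");

KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps);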

Ordering Issue:
An interesting situation arises when retries is set to a non-zero value and max.in.flight.requests.per.connection is greater than one.
This allows the producer to send out more than one batch of messages without having received acknowledgements for any of them.
Consider the following scenario:

1. Two batches of messages are sent, one after another, destined for the same partition.

2. The first batch of messages fails to be written successfully at the Kafka broker.

3. The second batch gets written successfully.

4. The producer retries sending the first batch again which gets written successfully this time.

The consequence is that the ordering of messages isn’t preserved within the partition, which can be a critical requirement for some
applications. In such situations max.in.flight.requests.per.connection should be set to 1, though it will severely impact the throughput
of the producer.
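
A minimal sketch of the relevant producer settings for such order-sensitive applications (the retry count is an arbitrary example value):

kafkaProps.put("retries", "3");                                 // retry transient failures
kafkaProps.put("max.in.flight.requests.per.connection", "1");   // preserve per-partition ordering across retries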

Producer Serialization:
Kafka comes with serialization classes for simple types such as string, integers, and byte arrays. However, one has to use a
serialization library for complex types. We can use JSON, Apache Avro, Thrift or Protobuf for serializing and deserializing Kafka
messages.

Using Avro:
Apache Avro is a language-neutral serialization framework and thus a good choice for Kafka. Avro provides robust support for
schema evolution. A schema can be thought of as a blueprint of the structure of each record in an .avro file. For instance, we can
define a schema representing a car as follows:

{
  "namespace": "datajek.io.avro",
  "type": "record",
  "name": "Car",
  "fields": [
    {
      "name": "make",
      "type": "string"
    },
    {
      "name": "model",
      "type": ["string", "null"]
    }
  ]
}

Avro is especially suited for Kafka as producers can switch to new schemas while allowing consumers to read records conforming to
either the old or the new schema. There are certain rules that are followed when schema resolution takes place. The beauty of Avro
is that a reader doesn’t need to know the record schema before reading an .avro file; the schema comes embedded in the file. If
schemas have evolved, the reader will still need access to the new schema even though the application expects the previous one.
Since a Kafka topic can contain messages conforming to different Avro schemas, every message would also need to carry its
schema. This becomes impractical for Kafka, as including the schema in every message leads to bloat in message size. Kafka
addresses this issue by introducing the Schema Registry, where the actual schemas are stored. Kafka messages only contain an
identifier for their schema in the registry. This complexity is abstracted away from the user, with the serializer and deserializer
responsible for pushing and pulling the schema to and from the registry.

In general, we can create messages of type GenericRecord or generate Java classes from our schema using the Avro tools. The code in
the widget below uses the GenericRecord approach, which loses type safety and requires casts when the message is read on the consumer’s
end.

public class ProducerWithAvroSerializerExample {

public static void main(String[] args) {


Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// The key is a plain string; the value is an Avro record
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroSerializer.class);

// We also pass in the URL for the schema registry
props.put("schema.registry.url", "http://localhost:8081");
KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props);

// Set up the schema (the same Car schema defined earlier, with a make and an optional model)
String key = "key1";
String userSchema = "{\"type\":\"record\"," +
"\"name\":\"Car\"," +
"\"fields\":[{\"name\":\"make\",\"type\":\"string\"}," +
"{\"name\":\"model\",\"type\":[\"string\",\"null\"]}]}";
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(userSchema);

// Create an Avro record
GenericRecord avroRecord = new GenericData.Record(schema);
avroRecord.put("make", "Mercedes");
avroRecord.put("model", null);

// Create a producer record from the avro record


ProducerRecord<String, GenericRecord> record = new ProducerRecord<>("datajek-topic", key, avroRecord);
try {
producer.send(record);
} catch (SerializationException e) {
System.out.println("Exception while sending message " + e.getMessage());
e.printStackTrace();
} finally {

producer.flush();
producer.close();
}
}
}

Kafka Consumer:
Consumer Groups:
Consumers, or readers, receive messages from Kafka topics. Consumers subscribe to topics, then receive messages that
producers write to a topic. Typically, each consumer belongs to a consumer group. A single producer or multiple producers may
write messages to a topic faster than a single consumer can read them, causing the consumer application to fall farther and farther
behind. Kafka mitigates this scenario by allowing multiple consumers in a consumer group to work together to read messages from
a topic. The various configurations for consumers and partitions in a topic are discussed below:

Partitions in a topic and consumers in a group are equal: In this scenario, each consumer reads from one partition.

Partitions in a topic are greater than the number of consumers in a group: In this scenario, some or all consumers read
from more than one partition.

Single Consumer: In the case of a consumer group with a single consumer, all partitions are consumed by the single
consumer.

Partitions in a topic are less than the number of consumers in a group: In this scenario, some of the consumers will be
idle.

Increasing the number of consumers in a consumer group is the primary mechanism for Kafka to scale as the number of messages
in a topic increases. This is why it is generally advisable to have ample partitions in a topic so that the number of consumers can be
increased when the load increases.

When multiple applications read from the same topic, each application can have its own consumer group subscribed to the topic.
Consumer groups are independent of each other and read all the messages from the same topic, independent of the state or
progress of other consumer groups.

Consumer:
Kafka consumers are instances of the class KafkaConsumer . The KafkaConsumer class requires three compulsory properties to be
provided: the location of the servers bootstrap.servers , the key deserializer key.deserializer , and the value
deserializer value.deserializer . When creating a KafkaConsumer we can also specify a consumer group id using the property group.id .
We can create Kafka consumers that don’t belong to any consumer group, but this practice is uncommon. When subscribing to
topics, the consumer has the choice to subscribe to a single topic or use a regular expression to match multiple topics.

Poll Loop:
The general pattern for Kafka consumers is to poll for new messages and process them in a perpetual loop, often referred to as
the poll loop. Within the poll loop, the poll() method takes a timeout interval that the consumer blocks for when there’s no data in
the consumer buffer. If the interval value is set to 0 , the method returns immediately. It is essential that the consumer keeps polling
the broker, as the “I am alive” heartbeats are sent as part of the poll method.

In the newer versions of Kafka, the heartbeat can be configured to send messages in-between consumer application data polling
requests.
The first time poll() is invoked by a new consumer, the invocation is responsible for finding the GroupCoordinator , joining the
consumer group, and receiving a partition assignment.

The following code widget demonstrates the code written for a Kafka consumer that polls for events from all topics matching the
regular expression datajek-*

public class ConsumerExample {


public static void main(String[] args) {

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

// The extra properties we specify compared to when creating a Kafka producer


props.put("auto.offset.reset","earliest");
props.put("group.id", "DatajekConsumers");

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);

// Subscribe to all the topics matching the pattern


consumer.subscribe(Pattern.compile("datajek-.*"));

try {
// We have shown an infinite loop for instructional purposes
while (true) {
Duration oneSecond = Duration.ofMillis(1000);

// Poll the topic for new data and block for one second if new
// data isn't available.
ConsumerRecords<String, String> records = consumer.poll(oneSecond);

// Loop through all the records received


for (ConsumerRecord<String, String> record : records) {

String topic = record.topic();


int partition = record.partition();
long recordOffset = record.offset();
String key = record.key();
String value = record.value();

System.out.println("topic: %s" + topic + "\n" +


"partition: %s" + partition + "\n" +
"recordOffset: %s" + recordOffset + "\n" +
"key: %s" + key + "\n" +
"value: %s" + value);
}
}
} finally {
// Remember to close the consumer. If the consumer gracefully exits the consumer group, the coordinator can trigger a rebalance immediately instead of waiting for a session timeout.
consumer.close();
}
}
}

Thread safety:
Remember, we can’t have more than one instance of KafkaConsumer belonging to the same consumer group operate within the
same thread. Similarly, we can’t have two threads use the same instance of KafkaConsumer . Always use one consumer per thread, as sketched below.
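
The sketch below shows the one-consumer-per-thread pattern: each worker thread creates and owns its own KafkaConsumer instance. The thread count and topic pattern are arbitrary example values.

import java.time.Duration;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OneConsumerPerThreadExample {

    public static void main(String[] args) {
        // Run three consumers, each confined to its own thread.
        ExecutorService executor = Executors.newFixedThreadPool(3);
        for (int i = 0; i < 3; i++) {
            executor.submit(() -> {
                // The consumer is created inside the thread and is never shared across threads.
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");
                props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("group.id", "DatajekConsumers");

                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(Pattern.compile("datajek-.*"));
                    while (true) {
                        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                        for (ConsumerRecord<String, String> record : records) {
                            System.out.println(Thread.currentThread().getName() + " read offset " + record.offset());
                        }
                    }
                }
            });
        }
    }
}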

Consumer Configuration:
fetch.min.bytes Specifies the minimum amount of data a broker should send back to a consumer. The broker will wait for
messages to pile up until they are larger in aggregate size than fetch.min.bytes
before sending them to the consumer. The configuration can be set high when there are a large number of consumers or the
consumers run CPU-intensive processing on the received data.

fetch.max.wait.ms Specifies how long to wait if the minimum number of bytes to fetch specified by fetch.min.bytes aren’t
available.

max.partition.fetch.bytes Specifies the maximum number of bytes returned by a broker per partition.

session.timeout.ms Allows the consumer to be out of contact with the broker for session.timeout.ms milliseconds and still be
considered alive. By default this configuration is set to 3 seconds.

heartbeat.interval.ms Determines how frequently the consumer sends a heartbeat to the group coordinator.
Obviously, heartbeat.interval.ms is set lower than session.timeout.ms

auto.offset.reset When consumers read from a partition, they also commit an offset to remember the position they last read in
the partition. The auto.offset.reset configuration controls whether the consumer starts from the earliest or the latest records when a valid offset
can’t be determined for a partition.

enable.auto.commit By default, the offsets are committed automatically, but this behavior can be changed by
setting enable.auto.commit to false.

auto.commit.interval.ms Controls how frequently offsets are committed automatically; tuning it helps minimize duplicates and avoid missing data.

max.poll.records This configuration specifies the maximum number of records returned by the poll() call to the Kafka
consumer.

client.id Acts as an identifier for a consumer. This can be any string that identifies the consumer.

receive.buffer.bytes & send.buffer.bytes These configurations control the receive and send buffer sizes for the TCP sockets
when reading and writing data. If set to -1, the OS defaults are picked-up.

partition.assignment.strategy Topic partitions are assigned to consumers. The assignment of partitions to consumers can be
controlled by the class PartitionAssignor . If left to Kafka, there are two assignment strategies:

Range: This scheme assigns a consecutive subset of partitions from each topic to subscribing consumers. Consider two consumers,
C1 and C2, and two topics, T1 and T2. Say T1 has 3 partitions and T2 has 2 partitions. The algorithm will assign partitions
0 and 1 of T1 to the first consumer, C1, and partition 2 of T1 to the second consumer, C2. For T2, partition 0 will be assigned
to C1 and partition 1 will be assigned to C2. Note that the assignment for each topic is done independently of other
topics, so if topic T2 also had three partitions, consumer C1 would still receive the first two partitions of each topic in its share.

Round Robin: The partitions from all the topics are sequentially assigned to consumers one by one. Consider the same two
consumers and two topics. Say, T1 has 3 partitions and T2 has 2 partitions. The algorithm will assign partition 0 of topic T1
to the first consumer C1, then partition 1 of T1 to C2, then partition 2 of T1 to C1, then partition 0 of T2 to C2 and finally
partition 1 of T2 to C1.

We can specify one of the default strategies to use by configuring the partition.assignment.strategy setting. It can be set to
either org.apache.kafka.clients.consumer.RangeAssignor or org.apache.kafka.clients.consumer.RoundRobinAssignor . If you want to specify
your custom strategy, this setting should be set to your class name.
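
As a rough illustration, a handful of these settings might be combined as follows. The values are arbitrary examples, not recommendations.

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("group.id", "DatajekConsumers");

// Illustrative tuning values
props.put("fetch.min.bytes", "1024");               // wait for at least 1 KB of data per fetch
props.put("fetch.max.wait.ms", "500");              // ...but never wait longer than 500 ms
props.put("max.partition.fetch.bytes", "1048576");  // at most 1 MB per partition per fetch
props.put("max.poll.records", "500");               // cap the records returned by a single poll()
props.put("partition.assignment.strategy", "org.apache.kafka.clients.consumer.RoundRobinAssignor");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);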

Commits and Offsets:


We have read so far that Kafka consumers poll topic partitions for records using the poll() method. The method returns some
number of records for the consumer to process. However, if a rebalance occurs and the partition gets assigned to a different
consumer within the consumer group, then the newly assigned consumer must know where the previous consumer stopped in order
to resume reading records from that point on in the partition.

The offset identifies the position in a partition up to which a consumer has read. The act of durably storing or updating that position
is called the commit. The onus of tracking a consumer’s position within a partition is on the consumer itself. Each consumer
commits its offset for every partition it is reading by writing a message to a special Kafka topic called __consumer_offsets

Committed offset less than last record read: Consider a scenario where a consumer reads four messages at a time. It reads up
to message 6 but the last commit offset is recorded as 4. If the consumer were to crash at this point and another consumer took up
processing this partition, then the new consumer will start reading messages starting from the record numbered 5. Evidently, some
of the records will end up being processed twice.

Committed offset greater than last record read: If the committed offset is greater than the offset of the last processed record,
then the records between the two offsets will be missed by the consumer group on a rebalance. In the example scenario depicted
below, the records numbered 7 and 8 will be missed.

How offsets are committed can impact the design of a client application, and Kafka provides various settings to tweak this behavior.

Offset Commit Configuration:


Automatic Commit:

Offsets can be automatically committed by the consumer using the configuration enable.auto.commit=true. You can use the
configuration auto.commit.interval.ms to control how frequently the commits happen. By default, the frequency is set to 5 seconds.
Automatic commits happen when the method poll() is invoked. The method checks whether auto.commit.interval.ms has elapsed since
the last commit. If so, it commits the offsets from the last poll() invocation (not the current one).

In the diagram above, the first commit takes place at 9 seconds. Then a second commit takes place at 16 seconds, when the third
call to poll() is made. The second poll() invocation doesn’t result in an offset commit because auto.commit.interval.ms milliseconds
have not yet elapsed. The problem with this approach is the same as described earlier: records read after the last committed offset may be processed twice after a rebalance.
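
A minimal sketch of the relevant settings (the interval below simply spells out the default value):

props.put("enable.auto.commit", "true");       // automatic offset commits (the default)
props.put("auto.commit.interval.ms", "5000");  // commit at most every 5 seconds, from within poll()
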
Commit Current Offset:

Rather than relying on the timer, the consumer application can explicitly choose when to commit the offset using
the commitSync() API. In this scenario, the setting enable.auto.commit is set to false. The commitSync() method commits the latest
offset returned by the poll() method. This is unlike the automatic commit option which commits the offset returned by the last
invocation of the poll() call.

while (true) {
Duration oneSecond = Duration.ofMillis(1000);
System.out.println("Consumer polling for data...");
ConsumerRecords<String, String> records = consumer.poll(oneSecond);
for (ConsumerRecord<String, String> record : records) {

/**
* Process records
*/
}

// blocking call to commit offset


consumer.commitSync();
}

The issue with manually committing offsets is that the consumer application is blocked until it hears back from the broker. The call
to commitSync() is a blocking call. This can lead to degradation in throughput as the consumer is blocked waiting for a response over
the network.

Asynchronous Commit:

Asynchronous commits solve the problem with synchronously committing the current offset by allowing the consumer to fire off a commit request to the
broker and continue processing.

However, retries with asynchronous commits can be tricky. Let’s examine a situation where the first invocation of the
method commitAsync() fails trying to commit an offset of, say, 10. Since we didn’t block on the invocation, it is possible that we may
have made another commit request for a higher offset, say 20, later on. If we attempt to retry the first failed commitAsync() request, it
will commit the offset as 10 after 20 has already been committed. If a rebalance occurs at this stage, a larger number of records will
need to be reprocessed as the last commit offset would be 10 instead of 20.

The method commitAsync() allows for a callback handler to be invoked when the broker responds. Within this callback handler, we
can attempt a retry in case the call has failed.

while (true) {
Duration oneSecond = Duration.ofMillis(1000);
System.out.println("Consumer polling for data...");
ConsumerRecords<String, String> records = consumer.poll(oneSecond);
for (ConsumerRecord<String, String> record : records) {
/**
* Process records
*/
}

// non-blocking call
consumer.commitAsync(new OffsetCommitCallback() {
@Override
public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception e) {
if (e != null) {
System.out.println("Offsets " + offsets + " failed to commit. " + e);
}
}
});
}

One common technique used to prevent a lower commit offset from overwriting a higher one is to maintain a monotonically increasing
sequence number that is incremented whenever a commit is made. This number is captured by the callback handler and, if a failure is
encountered, the current value of the sequence number is checked against the captured number to determine if a retry should be
made. If the values are the same, then it is safe to retry since a commit call for a higher offset hasn’t yet been made. A sketch of this idea appears below.
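
The following sketch of the sequence-number guard assumes the same consumer set-up as the asynchronous commit example above; it is one possible implementation, not a canonical one. A java.util.concurrent.atomic.AtomicLong tracks the latest commit attempt, and the callback retries (synchronously, for simplicity) only if no newer commit has been issued in the meantime.

AtomicLong commitSequence = new AtomicLong(0);

while (true) {
    Duration oneSecond = Duration.ofMillis(1000);
    ConsumerRecords<String, String> records = consumer.poll(oneSecond);
    for (ConsumerRecord<String, String> record : records) {
        /**
         * Process records
         */
    }

    // Remember which attempt this commit belongs to
    final long thisAttempt = commitSequence.incrementAndGet();

    consumer.commitAsync(new OffsetCommitCallback() {
        @Override
        public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception e) {
            if (e != null && commitSequence.get() == thisAttempt) {
                // No newer commit has been issued since this one was fired off, so it is safe to retry
                // without overwriting a higher offset. The callback runs on the consumer's polling
                // thread, so calling commitSync() here does not violate thread-safety.
                consumer.commitSync(offsets);
            }
        }
    });
}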

Combining Async and Sync Offsets:

When a consumer deliberately exits after making the last commit, we know a rebalance will follow. In this case, we can combine
synchronous and asynchronous commits together.

try {
while (true) {
Duration oneSecond = Duration.ofMillis(1000);
System.out.println("Consumer polling for data...");
ConsumerRecords<String, String> records = consumer.poll(oneSecond);
for (ConsumerRecord<String, String> record : records) {
/**
* Process records
*/
}

// Async commit without callback
consumer.commitAsync();
}
} catch (Exception e) {
// log exception
} finally {
try {
// Initiate a synchronous commit since this is the last commit before the consumer exits. The synchronous commit will retry until success, or fail in case of an unrecoverable error.
consumer.commitSync();
} finally {
consumer.close();
}
}

Committing a Specific Offset:

If a poll() call returns a large number of records, we may want to commit an offset midway through processing, or at a time of our choosing. Fortunately,
both the commitSync() and commitAsync() APIs allow us to do just that by passing in the offset for each partition of a topic that we
want to commit.

Map<TopicPartition, OffsetAndMetadata> currentOffsets = new HashMap<>();


int count = 0;

while (true) {
Duration oneSecond = Duration.ofMillis(1000);
System.out.println("Consumer polling for data...");
ConsumerRecords<String, String> records = consumer.poll(oneSecond);

for (ConsumerRecord<String, String> record : records) {


/**
* Process records here
*/

// commit offset after processing every hundred records


if (count % 100 == 0) {
// Set-up the topic partition for the record we are processing currently
TopicPartition topicPartition = new TopicPartition(record.topic(), record.partition());

// We add 1 to the current record's offset so that we start reading from the record after the current one in case of a rebalance.
OffsetAndMetadata metadata = new OffsetAndMetadata(record.offset() + 1, "no metadata");
currentOffsets.put(topicPartition, metadata);

consumer.commitAsync(currentOffsets, null);
}
count++;
}
}

Handling Rebalances:
A consumer may want to do cleanup work before a partition it owns is reassigned during a rebalance. It may also want to
prepare before being assigned a new partition. Both use cases are addressed by the ConsumerRebalanceListener interface, which
offers two methods that classes can implement to insert custom logic just before a partition is revoked or just after one is
assigned.

onPartitionsRevoked This method is invoked after a consumer stops consuming events but before a rebalancing takes effect.

onPartitionsAssigned This method is invoked after partitions have been assigned to a consumer but before it starts consuming
messages.

public class RebalanceListenerExample {

private Map<TopicPartition, OffsetAndMetadata> currentOffsets = new HashMap<>();

class RebalanceHandler implements ConsumerRebalanceListener {

KafkaConsumer<String, String> consumer;

RebalanceHandler(KafkaConsumer<String, String> consumer) {
this.consumer = consumer;
}

// USE-CASE: To commit the current offset before a partition is revoked so that the next consumer can start where this consumer left off.
@Override
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
// We are committing the offsets for the most recently processed records, as we update the currentOffsets hashmap in the poll loop in the run() method after processing each record.
// This way we don't save the offset from the last poll() call, which would have resulted in re-processing some of the records.
consumer.commitSync(currentOffsets);
}

// USE-CASE: To seek the consumer to the offset which was stored in the DB before the rebalance. Useful if we don't want to re-process already processed records.
@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
for (TopicPartition partition : partitions) {
long offset = readOffsetFromDB(partition);
consumer.seek(partition, offset);
}
}

long readOffsetFromDB(TopicPartition partition) {
// Method that implements the logic to retrieve the offset for the passed-in partition from the database.
// The real lookup is omitted here; 0 is returned as a placeholder so the example compiles.
return 0;
}
}

public void run() {

Properties props = new Properties();


props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

props.put("group.id", "DataJekConsumers");
props.put("auto.offset.reset", "earliest");

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);

// Pass-in the instance of RebalanceHandler when subscribing to topics


consumer.subscribe(Pattern.compile("datajek-.*"), new RebalanceHandler(consumer));

try {
while (true) {
Duration oneSecond = Duration.ofMillis(1000);
System.out.println("Consumer polling for data...");
ConsumerRecords<String, String> records = consumer.poll(oneSecond);

for (ConsumerRecord<String, String> record : records) {


TopicPartition topicPartition = new TopicPartition(record.topic(), record.partition());

// We add 1 to the current record's offset so that we start reading from the record after the current one in case of a rebalance.
OffsetAndMetadata metadata = new OffsetAndMetadata(record.offset() + 1, "no metadata");
currentOffsets.put(topicPartition, metadata);

// Method that is responsible for storing the record as well as the offset as a single DB transaction.
commitRecordAndOffsetToDBAsTransaction(record.value(), metadata);
}

consumer.commitAsync();
}
} catch (Exception e) {
// log exception
} finally {

try {
consumer.commitSync();
} finally {
consumer.close();
}
}
}

void commitRecordAndOffsetToDBAsTransaction(String record, OffsetAndMetadata metadata) {
// Stores the record as well as the offset as a single DB transaction (implementation omitted).
}
}

Stopping a Consumer:
The KafkaConsumer object exposes a method wakeup() that can be invoked from a different thread to stop the consumer.
When wakeup() is invoked on the consumer object, the consumer throws a WakeupException if the consumer is already waiting on
the poll() method. If not, the exception is thrown the next time the consumer invokes the poll() method. As a developer, you don’t
need to handle the WakeupException , but you must invoke close() on the consumer object in the finally block. Closing the consumer
commits offsets and informs the broker that the consumer is leaving. The broker can then trigger a rebalance immediately rather
than wait for a session timeout to assign the partitions owned by the exiting consumer to other consumers in the group.

public class StoppingConsumerExample {


public static void main(String[] args) throws InterruptedException {

Properties props = new Properties();


props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("group.id", "DataJekConsumers");
props.put("auto.offset.reset", "earliest");

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);

Thread consumerThread = new Thread(new Runnable() {


@Override
public void run() {
runConsumer(consumer);
}
});
// start() runs the consumer on the new thread; calling run() would execute it on the main thread.
consumerThread.start();

// Let the consumer run for 5 seconds.


Thread.sleep(5 * 1000);

// Wakeup the consumer to exit.


consumer.wakeup();

// Wait for consumer thread to exit.


consumerThread.join();
}

public static void runConsumer(KafkaConsumer<String, String> consumer) {

consumer.subscribe(Pattern.compile("datajek-.*"));

try {
while (true) {
Duration oneSecond = Duration.ofMillis(1000);
System.out.println("Consumer polling for data...");
ConsumerRecords<String, String> records = consumer.poll(oneSecond);
for (ConsumerRecord<String, String> record : records) {
/**
* Process records
*/
}
// Async commit without callback
consumer.commitAsync();
}
} catch (WakeupException e) {
// Nothing to handle
} catch (Exception e) {
// log exception
} finally {
try {
consumer.commitSync();
} finally {
consumer.close();
}
}
}
}

Consumer in the main thread:

If the consumer is running in the main thread itself, we invoke the wakeup() method on the consumer object in the shutdown hook of
the runtime.

public class StoppingConsumerAsMainThread {
public static void main(String[] args) throws InterruptedException {

Properties props = new Properties();


/**
* Set-up properties
*/

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);


consumer.subscribe(Pattern.compile("datajek-.*"));

// We register a shutdown hook that is invoked when the key-combination ctrl+c is executed.
Thread mainThread = Thread.currentThread();
Runtime.getRuntime().addShutdownHook(new Thread() {
public void run() {
// The shutdown hook runs in a separate thread, so the only thing we can safely do to a consumer is to wake it up
consumer.wakeup();
try {
// We wait for the main thread in which the consumer runs to exit
mainThread.join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
});

try {
while (true) {
Duration oneSecond = Duration.ofMillis(1000);
System.out.println("Consumer polling for data...");
ConsumerRecords<String, String> records = consumer.poll(oneSecond);
for (ConsumerRecord<String, String> record : records) {
/**
* Process records
*/
}
// Async commit without callback
consumer.commitAsync();
}
} catch (WakeupException e) {
// Nothing to handle
} catch (Exception e) {
// log exception
} finally {
try {
consumer.commitSync();
} finally {
consumer.close();
}
}
}
}

Consumer Deserialization:
When consumers read records from a partition, they read a stream of bytes and need deserializers to convert the byte stream
back into objects. It is imperative that the deserializer used to convert the byte stream into objects matches the serializer
that initially transformed the objects into the byte stream. Generally, it is a bad idea to spin up your own custom
serialization/deserialization classes. Instead, use any of the available open-source serialization frameworks such as Avro, Thrift,
Protobuf, etc.

Using Avro:
We’ll carry forward with the Car example from the Producer lesson. The code widget below demonstrates a consumer reading Kafka
messages using Avro deserialization.

public class ConsumerWithAvroSerializerExample {


public static void main(String[] args) {

// Set up properties
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("group.id", "DataJekConsumers");
props.put("auto.offset.reset", "earliest");

// We also pass-in the URL for the schema registry


props.put("schema.registry.url", "http://localhost:8081");

// Create a Kafka consumer and subscribe to topic


KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Pattern.compile("datajek-.*"));

try {
while (true) {
Duration oneSecond = Duration.ofMillis(1000);

// Since the Producer wrote GenericRecords in the topic, our consumer will read GenericRecords from the topic.
ConsumerRecords<String, GenericRecord> records = consumer.poll(oneSecond);

for (ConsumerRecord<String, GenericRecord> record : records) {

// Read the fields of the GenericRecord and print them on the console.
System.out.println("Offset " + record.offset() + " " +
"Car Make :" + record.value().get("make") + " " +
"Car Model :" + record.value().get("model"));
}

// Commit the offset


consumer.commitSync();
}
} finally {
consumer.close();
}
}
}

Running a Single Consumer:


There may be scenarios where you want to run a single consumer that reads records from all, or just a few, of the partitions of a topic.

The code widget below demonstrates how to implement a single consumer. Note that in the properties we still specify group.id : although the consumer assigns partitions to itself, committing offsets still requires a group, even if that group consists of a single consumer. Also, note we
don’t use the subscribe API on the consumer object. Rather, we use the assign API to read all the partitions of the topic.

public class SingleConsumerExample {


public static void main(String[] args) {

Properties props = new Properties();


props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("auto.offset.reset", "earliest");
props.put("group.id", "DataJekConsumers");

// Create a consumer object


KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);

// Retrieve the partitions for the topic


List<PartitionInfo> partitionInfos = consumer.partitionsFor("datajek-topic");

// Select the partitions to consume. In this case, we'll consume all the partitions for the datajek-topic
Set<TopicPartition> partitionsToConsume = new HashSet<>();
for (PartitionInfo partition : partitionInfos) {
partitionsToConsume.add(new TopicPartition(partition.topic(), partition.partition()));
}

// Assign the partitions to the consumer


consumer.assign(partitionsToConsume);
System.out.println("Consumer assigned itself " + partitionsToConsume.size() + " partitions.");

try {
while (true) {
Duration oneSecond = Duration.ofMillis(1000);
System.out.println("Consumer polling for data...");
ConsumerRecords<String, String> records = consumer.poll(oneSecond);
for (ConsumerRecord<String, String> record : records) {

String topic = record.topic();
int partition = record.partition();
long recordOffset = record.offset();
String key = record.key();
String value = record.value();

System.out.println("\ntopic: " + topic + "\n" +


"partition: " + partition + "\n" +
"recordOffset: " + recordOffset + "\n" +
"key: " + key + "\n" +
"value: \n" + value);

                }
                // Commit after processing the batch rather than after every record.
                consumer.commitSync();
            }
} finally {
consumer.close();
}
}
}

In such a setup, the self-assigning consumer will not be notified of the new partitions when partitions are added to the topic. The
onus lies on the consumer to periodically check for new partitions being added to the topic and handle them.
The partitionsFor() method can be used to query the partitions associated with a topic.
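A minimal sketch of such a periodic check, assuming the same datajek-topic; the helper name refreshAssignment and the idea of calling it from the poll loop (say, once a minute) are our own additions, not part of the original example:

import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class PartitionRefreshSketch {

    // Re-query the partitions of the topic and re-assign the consumer if anything changed.
    // Returns the set of partitions the consumer is now assigned.
    static Set<TopicPartition> refreshAssignment(KafkaConsumer<String, String> consumer,
                                                 String topic,
                                                 Set<TopicPartition> currentlyAssigned) {
        Set<TopicPartition> latest = new HashSet<>();
        for (PartitionInfo p : consumer.partitionsFor(topic)) {
            latest.add(new TopicPartition(p.topic(), p.partition()));
        }
        if (!latest.equals(currentlyAssigned)) {
            // New partitions were added (or old ones removed); pick them up.
            consumer.assign(latest);
        }
        return latest;
    }
}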

Kafka Internals:
Replication:
Replication is the mechanism that allows Kafka to advertise availability and durability in the face of inevitable individual node failures. Topics consist of partitions, and each partition has replicas hosted on different brokers.
Generally, each broker hosts several hundred or even a few thousand partition replicas at any given time.

Types of Replica:
In Kafka, we have two kinds of replicas:

Leader replica: All requests from consumers and producers are routed through the leader replica for a partition. Each partition
has one broker designated as the leader replica.

Follower replica: All replicas that aren’t the leader are called follower replicas. They don’t participate in servicing consumer or
producer requests. Their only task is to replicate messages from the leader replica and stay up to date with the latest messages
the leader has. If the leader replica fails, one of the follower replicas is promoted to be the leader.

The follower replicas send fetch requests for the latest messages to the leader replica, similar to how other clients interact with the leader. The leader replica knows how far behind a follower replica is by examining the offset in the fetch request sent by the follower; the offset denotes the next message the follower replica wants to consume. An out-of-sync replica will not be considered for election as the leader should the current leader fail. This makes sense since the out-of-sync replica doesn’t have the latest messages and thus shouldn’t be promoted to leader. The amount of time a follower can be inactive or behind before it is considered out of sync is controlled by the replica.lag.time.max.ms configuration parameter. Tweaking this configuration has implications on client behavior and data retention during leader election. Conversely, the follower replicas that keep requesting the latest messages from the leader replica are considered in-sync replicas, and only these replicas are eligible for leader election. A replica is considered an in-sync replica if it is the leader or meets the following criteria as a follower replica:

1. The follower has sent a heartbeat to Zookeeper in the last 6 seconds (configurable).

2. The follower has fetched messages from the leader in the last 10 seconds (configurable).

3. The follower has fetched the latest message from the leader in the last 10 seconds. The follower may be fulfilling #2 but it may
be behind and asking for older messages from the leader. This restriction makes sure that almost no lag exists between the
leader and the follower.

There’s also the concept of a preferred leader. This is the replica that was chosen to be the leader when the topic was initially created. The preferred leader exists because, at the time of topic creation, the assignment algorithm attempts to evenly divide the partition leaders among the available brokers. If the leaders of all the partitions in a cluster are in fact the preferred leaders, then the load is evenly divided amongst the brokers. If the configuration auto.leader.rebalance.enable is set to true, Kafka triggers an election to make the preferred leader the leader replica if it isn’t already. However, the preferred leader must meet the criterion of being an in-sync replica.
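The leader, replica set, and in-sync replicas for each partition can also be observed from a client. Below is a minimal sketch using the Java AdminClient; the topic name and the localhost broker address are assumptions carried over from the earlier examples:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class InspectReplicasSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin.describeTopics(Collections.singletonList("datajek-topic"))
                                                .all().get().get("datajek-topic");

            // For every partition, print which broker hosts the leader and which replicas are in sync.
            for (TopicPartitionInfo partition : description.partitions()) {
                System.out.println("Partition " + partition.partition()
                        + " leader=" + partition.leader()
                        + " replicas=" + partition.replicas()
                        + " isr=" + partition.isr());
            }
        }
    }
}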

Kafka Controller:
Broker Membership:
Brokers maintain their membership in a cluster via a unique ID that is set either in the configuration file or automatically generated. Each broker creates an ephemeral node in Zookeeper with its ID under the Zookeeper path /brokers/ids . Various Kafka components receive notifications when brokers join or leave the cluster by keeping a watch on the path /brokers/ids where brokers create ephemeral nodes. A new broker can’t register itself with the same ID as an existing broker. A broker can lose connectivity to Zookeeper for a variety of reasons such as:

broker deliberately stopping

garbage collector pause

network partition

If such a situation occurs, the ephemeral node created by the broker at the time it started is automatically removed from Zookeeper.
Kafka components watching the list of brokers are notified that the broker has left. Interestingly, if a brand new broker is spun up with the same ID as the broker that left, the new broker will be assigned the same partitions and topics as the broker that left. This is because, even though the broker left the cluster, its ID is still retained within internal data structures.

Controller:
A controller is a broker with the additional responsibility of choosing leaders for partitions whenever nodes leave the cluster. It also updates existing brokers when a node with replicas rejoins the cluster. Below is the procedure used to elect a controller:

1. Each broker attempts to create the ephemeral node /controller when starting up.

2. The first broker that successfully creates the node becomes the controller while the rest receive a “node already exists”
exception.

3. Each broker receiving the exception knows that the controller already exists and sets a watch on the node so that it gets notified
in case the controller exits the cluster.

4. If the controller does leave the cluster, the remaining brokers in the cluster are notified and each one tries to create the controller
node. The one successful in creating the node becomes the new controller.

5. To mitigate the situation when two nodes consider themselves the controller, which can happen during a network partition, Kafka
also uses a controller epoch number, which is incremented whenever a new controller gets elected. Other brokers ignore
messages that don’t come with the latest controller epoch number. This prevents a split brain scenario where two nodes act as
the controller at the same time.

When a broker leaves the cluster, the controller is supposed to find a new leader for all the partitions which had the departed broker
as the leader. The new leader is simply the next replica in the list of replicas for a given partition. The news of the selected leader for
affected partitions is propagated to all the followers of those partitions as well as the new leader. The new leader starts serving
requests while the followers get busy replicating messages from the new leader. Similarly, when a new broker joins the cluster, the
controller looks up the ID of the new broker to identify if it has any replicas. If so, the controller informs both the new and existing
brokers of the change and the replicas on the new broker start replicating messages from the existing leader.

Request Processing:
The job of a broker is to process requests received from the controller, from partition replicas, and from clients addressed to the partition leaders it hosts. A request consists of the following elements:

Request Type: Also known as the API key.

Request Version: Allows brokers to handle clients of different versions.

Correlation ID: Unique identifier for the request and also appears as part of the response in the error logs.

Client ID: Identifier for the client application sending the request.

Kafka uses a binary protocol over TCP that defines how clients and brokers communicate with each other. A broker has
an acceptor thread listening on ports for incoming connections. Once a connection is established, the acceptor thread hands off the
request for processing to the processor thread. The number of processor threads is configurable. The job of a processor thread is
to take requests from client connections, place them in a request queue, pick responses from the response queue, and send them
back to the client. The IO threads do the heavy lifting of actually fetching the data from storage. They read from the request queue
and place responses in the response queue.

There are two types of requests:

Produce Requests: These requests are initiated by producers and consist of messages that producers want to write to a Kafka
topic.

Fetch Requests: These requests are initiated by consumers or follower replicas and ask for messages in a topic.

Clients send another request called the metadata request, which lists the topics a client is interested in. The server responds with a list of partitions for each topic, the partition replicas, and the partition leaders. Making a metadata request before every produce or fetch request would be taxing on the system, so clients cache the metadata information. Clients refresh the metadata after every metadata.max.age.ms milliseconds, or sooner if they receive a “not a leader for partition” error signaling that the metadata has changed.
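As a small illustration, the refresh interval is exposed as an ordinary client property, set the same way as the other properties in these notes; the value below is just an example:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
// Refresh cached topic metadata at least every 5 minutes, even if no errors are observed.
props.put("metadata.max.age.ms", "300000");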

Produce Request:
When the leader replica for a partition receives a produce request, the usual flow is to run it through a few checks:

1. Does the producer have write privilege to the topic?

2. Does the acks parameter have a valid value of 0, 1 or all?

3. If acks has been set to all, are there enough in-sync replicas to safely write the message? Brokers can refuse to write a
message if the number of in-sync replicas is below a configurable number.

Once the message has been accepted, it’ll get written to the local disk. In Linux, the message actually ends up in the filesystem
cache and there’s no certainty when the message will get written to disk. Note that if acks has been set to 0 or 1, the partition leader

will respond immediately after writing the message. However, if acks is set to all, then the leader will store the request in a buffer
called the purgatory and wait to respond to the producer client until it observes that all in-sync replicas have replicated the
message.
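The acks behavior described above is controlled on the producer side. A minimal fragment, with illustrative values, showing the setting alongside the usual serializer properties:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// acks=0: fire and forget; acks=1: wait for the leader only; acks=all: wait for all in-sync replicas.
// With acks=all, the leader keeps the request in purgatory until the ISR has replicated the message.
props.put("acks", "all");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);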

Fetch Request:
Fetch requests for messages are similar to produce requests. However, the broker has to be careful to only send the consumer as many messages as the client can fit into its memory. The client can also specify the minimum amount of data to accumulate for a partition before the broker sends back a response, so that the broker waits for a reasonable amount of data to become available before dispatching it. If a configurable timeout elapses and enough data hasn’t accumulated, the broker sends whatever data it has so far.
When a fetch request is first received, it is scrutinized for validity. If the offset exists, the broker reads messages starting from the
offset up until the number of messages that collectively sum up to the data size the client is willing to receive. Kafka is known to
copy messages from the file (likely Linux’s filesystem cache) to the network channel directly, skipping any buffers in between. This
zero-copy technique improves performance.
Fetch requests for messages that haven’t been replicated to all in-sync replicas result in an empty response from the broker rather
than an error message.
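Both sides of this trade-off are exposed as consumer configuration. A small fragment with illustrative values (the exact numbers are our own choices, not recommendations):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "DataJekConsumers");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

// Don't respond until roughly 64 KB of data has accumulated for the fetch...
props.put("fetch.min.bytes", "65536");
// ...but never make the consumer wait longer than 500 ms for it.
props.put("fetch.max.wait.ms", "500");
// Cap how much data the broker returns per partition so responses fit comfortably in memory.
props.put("max.partition.fetch.bytes", "1048576");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);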

Partition Allocation:
Kafka follows an algorithm when assigning partitions of a topic to brokers. Kafka attempts to evenly divide the partitions among the
brokers as much as possible. As an example, consider a topic with 5 partitions and a replication factor of 3, making for a total of 15
partition replicas to be allocated amongst 5 brokers. The general algorithm will proceed as follows:

There should be 3 partition replicas per broker.

Initially, we start by determining the locations for the partition replica leaders. In our example, the replica leader for partition 0 is placed on broker 1, the replica leader for partition 1 on broker 2, the replica leader for partition 2 on broker 3, and so on.

Once the replica leaders have been assigned to the brokers, we start assigning the replica followers for each partition:

We start with partition 0. We’ll assign the remaining replicas of partition 0 to broker 2 and broker 3. For each partition, we place the replicas at an increasing offset from the leader. The goal is to avoid placing a replica on the same broker as its leader, or placing two replicas of the same partition on the same broker.

Similarly, the replicas for partition 1 get placed on broker 3 and broker 4 since the leader resides on broker 2.

Kafka has become rack aware starting at version 0.10.0. When assigning partition replicas in a rack-aware environment, the
algorithm attempts to assign partition replicas in different racks so that, if an entire rack experiences failure, availability isn’t
affected. For our example of 5 brokers, suppose brokers 1 and 2 sit in rack 1, brokers 3 and 4 in rack 2, and broker 5 in rack 3; this is the layout implied by the rack-alternating broker list shown below.

When assigning the replicas of partition 0, the algorithm first places the leader onto broker 1. It can then place the second replica of partition 0 onto broker 3 or broker 4 in the second rack. Finally, the third replica is placed on broker 5 in the third rack. Now all three replicas live in separate racks, and the partition as a whole can tolerate rack failures.
Given rack awareness, the algorithm prepares a rack-alternating broker list instead of picking brokers in numerical order:
Broker#1 Broker#3 Broker#5 Broker#2 Broker#4

Directory Allocation:
Once a broker has been chosen for a partition replica, the next step is to choose the directory on the broker where the partition will place its data. The algorithm simply picks the directory with the fewest partitions, which implies that a newly added directory will receive all the new partitions.

Data Storage:
Data Retention:
Kafka doesn’t hold data in perpetuity. The admin can configure Kafka to delete the messages for a topic in two ways:

Specify a retention time after which messages are deleted.

Specify the data size to be reached before messages are deleted.

Data for a partition isn’t one contiguous file. Rather, the data is broken into chunks of files called segments. By default, each segment can be at most 1 GB in size or contain a week’s worth of data, whichever limit is reached first. The segment currently being written to is known as
the active segment; it is closed as soon as it reaches the size limit, and a new file is opened for writing. Having segments makes
deleting stale data much easier than attempting to delete messages from one long contiguous file.

💡 Note that the active segment can never be deleted. Also, note that the broker keeps an open handle to all the segments in
all the partitions, including the inactive ones.
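Retention and segment size can also be set per topic. The sketch below creates a topic with illustrative values using the Java AdminClient; the topic name, partition count, and numbers are assumptions for demonstration only:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionConfigSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        Map<String, String> topicConfig = new HashMap<>();
        topicConfig.put("retention.ms", "604800000");      // delete messages older than 7 days...
        topicConfig.put("retention.bytes", "1073741824");  // ...or once a partition exceeds roughly 1 GB
        topicConfig.put("segment.bytes", "268435456");     // roll a new segment file every 256 MB

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("datajek-retention-topic", 3, (short) 2).configs(topicConfig);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}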

File Format:

Each segment is one data file. The file consists of Kafka messages and their offsets. Interestingly, the message format on disk is the same as on the wire; this allows Kafka to use the zero-copy optimization when sending messages to consumers and avoids compressing and decompressing messages on the broker.
Each message contains information other than the key, value, and offset. This includes:

checksum to detect corruption

timestamp which can be set either to when the message was received by the broker or when sent by the producer, depending
upon configuration

compression type such as Snappy, GZip, LZ4

magic byte to detect the version of the message format

Note that if the producer is employing compression on its end, then the message sent by the producer to the broker is a single
wrapper message which has as its value all the messages in the batch compressed together. The broker stores this wrapper
message as it is and sends it out to the consumer when requested. The consumer decompresses the wrapper message and is able
to see all the messages in the batch along with their offsets and timestamps.
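Producer-side compression of this kind is enabled with a single property. A minimal fragment, with snappy chosen purely as an example:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Compress whole batches on the producer; the broker stores and forwards the compressed
// wrapper message as-is, and the consumer decompresses it.
props.put("compression.type", "snappy");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);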

Index:
Kafka allows consumers to read a topic starting from any offset; this feature requires Kafka to quickly and efficiently jump to the requested offset. Naturally, Kafka requires indexes to do fast lookups, so it creates one for each partition. Indexes, like messages, are broken up into segments so that the index entries for messages that have been deleted can be removed as well. Indexes can also be regenerated by reading the corresponding log segment.

Compaction:
There can be use cases where we want to only retain the latest message for a given key and delete all the older messages for that
key. Kafka can be configured to serve such use cases by changing the retention policy to compact. However, note that this policy
will only work when messages produced by the application also come with keys. If messages don’t have keys set, compaction will
fail. Compaction is always performed on the inactive segments by compaction/cleaner threads which may require tuning by the
administrator for memory usage.
An interesting question arises about how we can permanently delete a message with a given key since we always store at least one
message for that key. To delete a key permanently, a tombstone message is created which has null set as the value of the message.
The tombstone message is retained for a configurable amount of time so that the consumer can see that the key has been
permanently deleted from the system.
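A sketch of both halves of this: creating a compacted topic (cleanup.policy set to compact) and later deleting a key by producing a tombstone, i.e., a record with a null value. The topic and key names are illustrative assumptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompactionSketch {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092");

        // A compacted topic keeps only the latest message for each key.
        try (AdminClient admin = AdminClient.create(adminProps)) {
            NewTopic topic = new NewTopic("datajek-compacted-topic", 1, (short) 1)
                    .configs(Collections.singletonMap("cleanup.policy", "compact"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            // The latest state for the key survives compaction...
            producer.send(new ProducerRecord<>("datajek-compacted-topic", "car-42", "Toyota Corolla"));
            // ...while a tombstone (null value) eventually removes the key permanently.
            producer.send(new ProducerRecord<>("datajek-compacted-topic", "car-42", null));
        }
    }
}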

Reliability:
Kafka makes certain reliability guarantees, including:

If the same producer first writes message A and then message B to the same partition, consumers are guaranteed to see message A before message B, and the offset of message B is guaranteed to be higher than that of message A.

A message sent to the broker is considered committed when the message has been written to the partition by the leader and by all the in-sync replicas. However, this doesn’t imply the message has also been flushed to disk; it may very well still be in memory (the filesystem cache).

Committed messages will not be lost as long as one replica is alive.

Consumers only read committed messages.

Broker Replication Config:


There are three configuration settings for the broker that affect replication and reliable message storage:

Replication Factor: The number of replicas for a topic can be tweaked with the setting replication.factor . For topics created
automatically, the setting is default.replication.factor . A higher replication factor leads to higher availability, higher reliability,
and fewer disasters. Kafka also has the intelligence to place the replica in different racks to protect against rack failures. The
rack a broker lives in can be provided to Kafka using the broker configuration parameter broker.rack .

Unclean Leader Election: Consider a scenario where we have three replicas and two of them go down, leaving only the leader
as the in-sync replica. The leader keeps receiving messages from the producers and writing them. If any of the two followers
become the leader after the original leader went down, then the new leader will not have the latest messages that the previous
leader received. To complicate matters further, it is possible that some of the consumers may have read some of the newer
messages from the old leader that aren’t available with the new leader. The obvious mitigation is to not let a replica that is out of sync become the new leader. However, this solution comes at the cost of availability. The configuration parameter unclean.leader.election.enable, when set to true, allows Kafka to elect an out-of-sync replica as the leader. When set to false, we choose to wait for the original leader to come back online, resulting in lower availability.

Minimum in-sync Replicas: min.insync.replicas sets the minimum number of replicas that must be in sync before a message can be written. Say we have 3 replicas and min.insync.replicas is set to 2; a producer attempting to write a message to the topic will be met with a NotEnoughReplicasException if two out of the three replicas are down. In this situation, the single remaining in-sync replica becomes read-only and continues to serve consumers.
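On the producer side this surfaces as a failed send. A minimal sketch, assuming acks=all and a topic whose min.insync.replicas is higher than the number of currently in-sync replicas; what to do on failure is left to the application:

import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;

public class MinInsyncReplicasSketch {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all only gives the intended durability when paired with min.insync.replicas > 1 on the topic.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            try {
                producer.send(new ProducerRecord<>("datajek-topic", "key", "value")).get();
            } catch (ExecutionException e) {
                if (e.getCause() instanceof NotEnoughReplicasException) {
                    // Fewer replicas than min.insync.replicas are in sync, so the write was rejected.
                    // A real application would retry later or raise an alert here.
                    System.err.println("Write rejected: not enough in-sync replicas");
                }
            }
        }
    }
}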

Exactly Once Delivery:


Kafka doesn’t support exactly once delivery of messages out of the box, but there are ways developers can guarantee exactly
once delivery. One approach is for the consumer to write the results to a system that supports unique keys. The records can come
with their own unique keys and if the records don’t have keys, we can create one using the combination of topic + partition + offset
for the message. Every message is uniquely identified using this combination. Thereafter, even if a record has been duplicated, we’ll
simply overwrite the same value for the key. This pattern is known as idempotent write.
The other approach is to rely on an external system that offers transactions. We store the message as well as its offset in a single
transaction. After a crash or when the consumer starts up the first time, it can query the external store and retrieve the offset of the
last record read and start consuming records from that offset onwards.
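A minimal sketch of the idempotent-write pattern described above. The KeyValueStore interface below is a placeholder for any external store with unique-key (upsert) semantics, not a Kafka API:

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class IdempotentWriteSketch {

    // Placeholder for an external system that supports unique keys (e.g., a relational table with upserts).
    interface KeyValueStore {
        void upsert(String key, String value);
    }

    static void consumeOnce(KafkaConsumer<String, String> consumer, KeyValueStore store) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
        for (ConsumerRecord<String, String> record : records) {
            // topic + partition + offset uniquely identifies a message; re-processing a duplicate
            // simply overwrites the same row, so the externally visible result is written exactly once.
            String uniqueKey = record.topic() + "-" + record.partition() + "-" + record.offset();
            store.upsert(uniqueKey, record.value());
        }
        consumer.commitSync();
    }
}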
