
Kafka Notes

Kafka Installation:
The following software will be required:

• Java
• Zookeeper: It is recommended to install a full version of Zookeeper.
• Kafka Broker: Install the Kafka broker and start it.

Important Cluster/broker properties


• broker.id -> A unique integer ID for the broker within the cluster.
• zookeeper.connect -> A comma-separated list of hostname:port/path strings.

o hostname is the hostname or IP address of the Zookeeper server.

o port is the client port number for the server.
o /path is an optional Zookeeper path to use as a chroot environment for the Kafka
cluster. If it is omitted, the root path is used. It will be created on server start if it does not exist.
o It is generally considered good practice to use a chroot path for the Kafka
cluster. This allows the Zookeeper ensemble to be shared with other applications,
including other Kafka clusters, without conflict. It is also best to specify multiple
Zookeeper servers (which are all part of the same ensemble) in this configuration,
so that the Kafka broker can connect to another member of the Zookeeper
ensemble in the case of a server failure.
• log.dirs -> Kafka persists all messages to disk, and these log segments are stored in the
directories specified in the log.dirs configuration. This is a comma-separated list of paths on
the local system. If more than one path is specified, the broker will store partitions on them
in a “least used” fashion, with one partition’s log segments stored within the same path. The
broker will place a new partition in the path that has the least number of partitions
currently stored in it, not the one with the least amount of disk space used.
• auto.create.topics.enable -> Set it to false in order to disable automatic topic creation.
• log.retention.ms -> The time in milliseconds to retain logs in Kafka. The default is 1 week. Retention by time is
performed by examining the last modified time (mtime) on each log segment file on disk. Under
normal cluster operations, this is the time that the log segment was closed, and represents the
timestamp of the last message in the file. However, when using administrative tools to move
partitions between brokers, this time is not accurate, which will result in excess retention for
these partitions.
• log.retention.bytes -> If you have a topic with 8 partitions, and log.retention.bytes is set to 1
gigabyte, the amount of data retained for the topic will be 8 gigabytes at most. Note that all
retention is performed for an individual partition, not the topic. If you have specified a value for
both log.retention.bytes and log.retention.ms (or another parameter for retention by time),
messages may be removed when either criterion is met.
• log.segment.bytes -> Operates on log segments, not on individual messages. Messages sent
to the broker are appended to the current log segment for the partition. Once a segment is full (default 1
GB) it is closed and a new one is opened for writing. Only a closed segment is considered for
expiry. A small segment size on high-volume topics means frequent closing and opening of
segment files, which impacts performance. On the other hand, on low-produce-rate topics a segment
takes longer to fill and close, which effectively increases the log retention time, since messages
cannot expire until their segment is closed.
o Offset retrieval impact: The size of the log segments also affects the behavior of
fetching offsets by timestamp. When requesting offsets for a partition at a specific
timestamp, Kafka fulfills the request by looking for the log segment whose last modified
time (i.e. the time it was closed) is after the requested timestamp, while the immediately
previous segment was last modified before the timestamp. Kafka then returns the offset
at the beginning of that log segment (which is also the filename). This means that smaller
log segments will provide more accurate answers for offset requests by timestamp.

• log.segment.ms -> Another way to control when log segments are closed is the
log.segment.ms parameter, which specifies the amount of time after which a log segment
should be closed. As with the log.retention.bytes and log.retention.ms parameters,
log.segment.bytes and log.segment.ms are not mutually exclusive properties. Kafka will
close a log segment either when the size limit is reached or when the time limit is reached,
whichever comes first. By default, there is no setting for log.segment.ms, which results in
only closing log segments by size.
• The default maximum message size the broker accepts from a producer is 1 MB (message.max.bytes).
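Pulling the properties above together, a minimal server.properties sketch (host names, paths and values are illustrative, not tuning recommendations):

broker.id=0
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka
log.dirs=/data/kafka-logs-1,/data/kafka-logs-2
auto.create.topics.enable=false
log.retention.ms=604800000
log.segment.bytes=1073741824
message.max.bytes=1000012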
No of Brokers in a Cluster:
We should consider the following factors when deciding on the broker count.

1. The message retention period and the disk capacity of each broker.

2. The replication factor, as it multiplies the disk requirement (a rough sizing example follows this list).
3. Cluster capacity to handle requests, which depends on network interface bandwidth, especially if the traffic
is not consistent over the retention period of the data (e.g. bursts of traffic during peak
times). If the network interface on a single broker is used to 80% of capacity at peak, and there
are two consumers of that data, the consumers will not be able to keep up with peak traffic
unless there are two brokers. If replication is being used in the cluster, this is an additional
consumer of the data that must be taken into account.
4. It may also be desirable to scale out to more brokers in a cluster in order to handle
performance concerns caused by lower disk throughput or less available system memory.
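As a rough illustration (purely made-up numbers): ingesting 1 TB per day with a 7-day retention period needs about 7 TB per replica; with a replication factor of 3 that is roughly 21 TB of total disk, so with 4 TB of usable disk per broker you would need at least 6 brokers before even considering network capacity.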

Test Kafka Setup


1. Create Topic
/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181
--replication-factor 1 --partitions 1 --topic test

2. Describe Topic Details

kafka-topics.sh --zookeeper localhost:2181 --describe --topic test

Topic:test  PartitionCount:1  ReplicationFactor:1

3. Produce Messages:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

Test Message 1
Test Message 2

4. Consume Messages:

kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

Test Message 1
Test Message 2

Kafka Topic:
Partitions are also the way that Kafka provides redundancy and scalability. Each partition
can be hosted on a different server, which means that a single topic can be scaled
horizontally across multiple servers to provide for performance far beyond the ability of a
single server.

HOW TO CHOOSE NUMBER OF PARTITIONS?


While choosing the number of partitions for a topic, the following factors should be considered:
▪ What is the throughput you’d expect to achieve for the topic? i.e. do you expect to write
100KB a second or 1GB a second?
▪ What is the maximum throughput you expect to achieve when consuming from a single
partition? You will always have at most one consumer reading from a partition, so if you
know your slowest consumer writes the data to a database and this database never handles
more than 50MB a second from each thread writing to it, then you know you are limited to
about 50MB/sec of throughput when consuming from a partition.
▪ You can go through the same exercise to estimate the maximum throughput per producer
for a single partition, but since producers are typically much faster than consumers, it is
usually safe to skip this.
▪ The number of partitions you will place on each broker, and the available disk space and network
bandwidth per broker
▪ The fact that if you send messages to partitions based on keys, adding partitions to an
existing topic is very challenging
▪ The fact that there are limits to how many partitions you want to put on a single broker.
More partitions take more memory and also require more time to complete leader election.
With all this in mind, it's clear that you want many partitions but not too many. If you have
some estimate of the target throughput of the topic and the expected throughput of
the consumers, you can divide the target throughput by the expected consumer throughput and
derive the number of partitions this way. So if I want to be able to write and read 1GB/sec
from a topic, and I know each consumer can only process 50MB/sec, then I know I need at
least 20 partitions. This way I can have 20 consumers reading from the topic and achieve
1GB/sec.
If you don’t have this detailed information, our experience suggests that limiting the size of
a partition on the disk to 25GB often gives satisfactory results.

Kafka Producer
By default, the producer does not care what partition a specific message is written to and
will balance messages over all partitions of a topic evenly. In some cases, the producer will
direct messages to specific partitions. This is typically done using the message key and a
partitioner that will generate a hash of the key and map it to a specific partition. This
assures that all messages produced with a given key will get written to the same partition.
The producer could also use a custom partitioner that follows other business rules for
mapping messages to partitions.
o Create a ProducerRecord object. The mandatory fields are the topic and the value to be sent.
o The key and partition are optional parameters.
o The producer will serialize the key and value to byte arrays.
o The data is then sent to the partitioner. If we don't specify a partition, the partitioner will
choose a partition, by default based on the producer record key.
o Now the producer knows which topic and partition the record will go to. It then adds
the record to a batch of records that will also be sent to the same topic and partition. A
separate thread is responsible for sending those batches of records to the appropriate
Kafka brokers.
o If the messages were successfully written to Kafka, the broker will return a RecordMetadata
object with the topic, partition and the offset of the record within the partition. If the
broker failed to write the messages, it will return an error. When the producer receives
an error, it may retry sending the message a few more times before giving up and
returning an error.

Creating a Producer: A Kafka producer has 3 mandatory properties:

o bootstrap.servers - List of host:port pairs of Kafka brokers.


o key.serializer - Kafka brokers expect byte arrays as the key and value of messages. However,
the Producer interface allows you, using parameterized types, to send any Java object as the key
and value. key.serializer should be set to the name of a class that implements the
org.apache.kafka.common.serialization.Serializer interface, and the producer will use
this class to serialize the key object to a byte array. The Kafka client package includes
ByteArraySerializer (which doesn't do much), StringSerializer and IntegerSerializer, so if
you use common types, there is no need to implement your own serializers. Setting
key.serializer is required even if you intend to send only values.
o value.serializer - the same way you set key.serializer to the name of a class that will
serialize the message key object to a byte array, you set value.serializer to a class that
will serialize the message value object. The serializer can be identical to the
key.serializer, for example when both key and value are Strings, or it can be different,
for example an Integer key and a String value.
private Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "broker1:9092,broker2:9092");

kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");

kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<String, String>(kafkaProps);

There are three primary methods of sending messages:

o Fire-and-forget - we send a message to the server and don't really care if it
arrived successfully or not. Most of the time, it will arrive successfully, since Kafka is
highly available and the producer will retry sending messages automatically. However,
some messages will get lost using this method.
o Synchronous Send - we send a message, the send() method returns a Future object and
we use get() to wait on the future and see if the send() was successful or not.
o Asynchronous Send - we call the send() method with a callback function, which gets
triggered when it receives a response from the Kafka broker (an example appears after the
producer properties below).

ProducerRecord<String, String> record =


new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
producer.send(record).get();
} catch (Exception e) {
e.printStackTrace();
}

o Here, we are using Future.get() to wait until the reply from Kafka arrives back. The
specific Future implemented by the producer will throw an exception if the Kafka broker
sent back an error, and our application can handle the problem. If there were no errors,
we will get a RecordMetadata object which we can use to retrieve the offset the
message was written to.
o If there were any errors - errors before sending the data to Kafka, errors while sending, a
non-retriable exception returned by the Kafka brokers, or exhausted retries - the send will
throw an exception.
o client.id ->this can be any string, and will be used by the brokers to identify messages
sent from the client. It is used in logging, metrics and for quotas.
o max.request.size -> this setting controls the size of a produce request sent by the
producer. It caps both the size of the largest message that can be sent and the number
of messages that the producer can send in one request.
o max.in.flight.requests.per.connection -> This controls how many messages the producer
will send to the server without receiving responses. Setting this high can increase
memory usage while improving throughput, although setting it too high can reduce
throughput as batching becomes less efficient. Setting this to 1 will guarantee that
messages will be written to the broker in the order they were sent, even when retries
occur.
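As an illustration of the asynchronous send described above, a minimal sketch, reusing the producer and record values from the earlier snippets and assuming the usual org.apache.kafka.clients.producer imports:

ProducerRecord<String, String> record =
        new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
producer.send(record, new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception e) {
        if (e != null) {
            // the send failed even after the producer's internal retries
            e.printStackTrace();
        } else {
            // on success the metadata carries the topic, partition and offset
            System.out.println("written to partition " + metadata.partition()
                    + " at offset " + metadata.offset());
        }
    }
});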
Apache Kafka preserves the order of messages within a partition. This means that if messages were
sent from the producer in a specific order, the broker will write them to a partition in this order
and all consumers will read them in this order. Setting the retries parameter to non-zero and
max.in.flight.requests.per.connection to more than one means that it is possible that the broker
will fail to write the first batch of messages, succeed in writing the second (which was already in
flight), and then retry the first batch and succeed, thereby reversing the order. If the order is
critical, usually success is critical too, so setting retries to zero is not an option; however, you can
set max.in.flight.requests.per.connection = 1 to make sure that no additional messages will be sent to
the broker while the first batch is still retrying. This will severely limit the throughput of the
producer, so only use this when order is important.
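A minimal sketch of this ordering-over-throughput trade-off, added to the producer Properties shown earlier (the retry count is illustrative):

kafkaProps.put("retries", "3");                                  // keep retrying on transient errors
kafkaProps.put("max.in.flight.requests.per.connection", "1");    // preserve ordering even across retries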

Partition:
Keys serve two goals: they are additional information that gets stored with the message, and
they are also used to decide which one of the topic's partitions the message will be written to.
All messages with the same key will go to the same partition. This means that if a process is reading
only a subset of the partitions in a topic, all the records for a single key will be read by the same
process. When the key is null and the default partitioner is used, the record will be sent to one
of the available partitions of the topic at random; a round-robin algorithm is used to balance
the messages among the partitions.

If a key exists and the default partitioner is used, Kafka will hash the key (using its own hash
algorithm, so hash values will not change when Java is upgraded), and use the result to map the
message to a specific partition. This time, it is important that a key will always get mapped to
the same partition, so we use all the partitions in the topic to calculate the mapping and not just
available partitions.

The mapping of keys to partitions is consistent only as long as the number of partitions in a topic
does not change. So as long as the number of partitions is constant, you can be sure the
mapping will not change. The moment you add new partitions to the topic, this is no longer
guaranteed - the old records for a key will stay in the partition they were originally written to
(say, partition 34) while new records with that key may get written to a different partition. When
partitioning of the keys is important, the easiest solution is to create topics with sufficient
partitions and never add partitions.
Implementing a Custom Partitioner: You need to implement the Partitioner interface. It has three
methods: configure, partition and close.
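A minimal sketch of such a partitioner, in which one illustrative key ("Banana") is pinned to the last partition and all other keys are hashed over the remaining partitions (class and key names are made up):

import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

public class BananaPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) {
        // no configuration needed for this sketch
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        if ("Banana".equals(key)) {
            // the special key always lands in the last partition
            return numPartitions - 1;
        }
        // all other keys are hashed over the remaining partitions
        return Math.abs(Utils.murmur2(keyBytes)) % (numPartitions - 1);
    }

    @Override
    public void close() {
        // nothing to clean up
    }
}

The producer is pointed at it by setting the partitioner.class property, e.g. kafkaProps.put("partitioner.class", BananaPartitioner.class.getName());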

Kafka Consumer
The consumer subscribes to one or more topics and reads the messages in the order they
were produced. The consumer keeps track of which messages it has already consumed by
keeping track of the offset of messages. The offset is another bit of metadata, an integer
value that continually increases, that Kafka adds to each message as it is produced. Each
message within a given partition has a unique offset. By storing the offset of the last
consumed message for each partition, either in Zookeeper or in Kafka itself, a consumer can
stop and restart without losing its place. Consumers work as part of a consumer group. This
is one or more consumers that work together to consume a topic. The group assures that
each partition is only consumed by one member. For example, with three consumers in a
single group consuming a topic that has four partitions, two of the consumers work from one partition
each, while the third consumer works from two partitions. The mapping of a consumer
to a partition is often called ownership of the partition by the consumer.

In this way, consumers can horizontally scale to consume topics with a large number of
messages. Additionally, if a single consumer fails, the remaining members of the group will
rebalance the partitions being consumed to take over for the missing member.

Let's take topic t1 with 4 partitions. Now suppose we create a new consumer, c1, which is the
only consumer in group g1, and use it to subscribe to topic t1. Consumer c1 will get all messages
from all four of t1's partitions.
If we add more consumers to a single group with a single topic than we have partitions, then
some of the consumers will be idle and get no messages at all.

If we add a new consumer group g2 with a single consumer, this consumer will get all the
messages in topic t1 independently of what g1 is doing. g2 can have more than a single
consumer, in which case they will each get a subset of partitions, just like we showed for g1,
but g2 as a whole will still get all the messages regardless of other consumer groups.
When a consumer is added, it starts consuming messages from partitions that were previously
consumed by another consumer. Similarly, when a consumer crashes or leaves the group, the
partitions it used to consume will be consumed by one of the remaining consumers.
Reassignment of partitions to consumers also happens when the topics the consumer group is
consuming are modified, for example if an administrator adds new partitions.
When partition ownership is moved from one consumer to another it is called
Rebalance. During a rebalance, consumers can’t consume messages, so a rebalance is in
effect a short window of unavailability on the entire consumer group. In addition, when
partitions are moved from one consumer to another the consumer loses its current state; if
it was caching any data, it will need to refresh its caches, slowing down our application until
the consumer sets up its state again.
Consumers maintain their membership in a consumer group, and their ownership of the
partitions assigned to them, by sending heartbeats to a Kafka broker designated as the Group
Coordinator. The group coordinator broker can be different for different consumer groups.
As long as the consumer is sending heartbeats at regular intervals, it is assumed to be alive, well
and processing messages from its partitions. Heartbeats are sent when the consumer polls (i.e.
retrieves records) and when it commits records it has consumed. If the consumer stops sending
heartbeats for long enough, its session will time out and the group coordinator will consider it
dead and trigger a rebalance. If a consumer crashed and stopped processing messages, it will
take the group coordinator a few seconds without heartbeats to decide it is dead and trigger the
rebalance. During those seconds, no messages will be processed from the partitions owned by
the dead consumer. When closing a consumer cleanly, the consumer will notify the group
coordinator that it is leaving, and the group coordinator will trigger a rebalance immediately,
reducing the gap in processing. In release 0.10.1, the Kafka community introduced a separate
heartbeat thread which sends heartbeats in between polls as well.
THE PROCESS OF ASSIGNING PARTITIONS TO CONSUMERS:
When a consumer wants to join a group, it sends a JoinGroup request to the group coordinator.
The first consumer to join the group becomes the group leader. The leader receives a list of all
consumers in the group from the group coordinator (this will include all consumers that sent a
heartbeat recently and are therefore considered alive) and it is responsible for assigning a
subset of partitions to each consumer. It uses an implementation of PartitionAssignor interface
to decide which partitions should be handled by which consumer. Kafka has two built-in
partition assignment policies. After deciding on the partition assignment, the group leader
sends the list of assignments to the GroupCoordinator, which sends this information to all the
consumers. Each consumer only sees its own assignment - the leader is the only client process
that has the full list of consumers in the group and their assignments. This process repeats every
time a rebalance happens.

Creating Consumer:
Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("group.id", "CountryCounter");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);

group.id -> Specifies the consumer group the KafkaConsumer instance belongs to.

Subscribe to Topic:

To subscribe to one or more topics, the subscribe() method takes a list of topics as a parameter.

consumer.subscribe(Collections.singletonList("customerCountries"));

Polling:

Once the consumer subscribes to topics, the poll loop handles all details of coordination,
partition rebalances, heartbeats and data fetching, leaving the developer with a clean API that
simply returns available data from the assigned partitions. The poll loop does a lot more than
just get data. The first time you call poll() with a new consumer, it is responsible for finding the
GroupCoordinator, joining the consumer group and receiving a partition assignment. If a
rebalance is triggered, it will be handled inside the poll loop as well. And of course the
heartbeats that keep consumers alive are sent from within the poll loop. You can’t have multiple
consumers that belong to the same group in one thread and you can’t have multiple threads
safely use the same consumer. One consumer per thread is the rule. To run multiple consumers
in the same group in one application, you will need to run each in its own thread. It is useful to
wrap the consumer logic in its own object, and then use Java’s ExecutorService to start multiple
threads each with its own consumer.
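A minimal sketch of the one-consumer-per-thread pattern, assuming the props object built above in "Creating Consumer" (ConsumerWorker, the thread count and the topic name are illustrative):

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Illustrative worker: each thread owns exactly one KafkaConsumer.
class ConsumerWorker implements Runnable {
    private final Properties props;
    private final String topic;

    ConsumerWorker(Properties props, String topic) {
        this.props = props;
        this.topic = topic;
    }

    public void run() {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition = %d, offset = %d, value = %s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}

// Start three consumers in the same group, one per thread:
ExecutorService executor = Executors.newFixedThreadPool(3);
for (int i = 0; i < 3; i++) {
    executor.submit(new ConsumerWorker(props, "customerCountries"));
}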
Consumer Properties:

partition.assignment.strategy:

Partitions are assigned to consumers in a consumer group. A PartitionAssignor is a class that,


given consumers and topics they subscribed to, decides which partitions will be assigned to
which consumer. By default Kafka has two assignment strategies:
Range - assigns to each consumer a consecutive subset of partitions from each topic it
subscribes to. So if consumers C1 and C2 are subscribed to two topics, T1 and T2, and each of the
topics has 3 partitions, then C1 will be assigned partitions 0 and 1 from topics T1 and T2, while
C2 will be assigned partition 2 from those topics. Because each topic has an uneven number of
partitions and the assignment is done for each topic independently, the first consumer ends up
with more partitions than the second. This happens whenever Range assignment is used and the
number of consumers does not divide the number of partitions in each topic neatly.
RoundRobin - which takes all the partitions from all subscribed topics and assigns them to
consumers sequentially, one by one. If C1 and C2 described above would use RoundRobin
assignment, C1 would have partitions 0 and 2 from topic T1 and partition 1 from topic T2. C2
would have partition 1 from topic T1 and partitions 0 and 2 from topic T2. In general, if all
consumers are subscribed to the same topics (a very common scenario), RoundRobin
assignment will end up with all consumers having the same number of partitions (or at most 1
partition difference). partition.assignment.strategy allows you to choose a partition assignment
strategy. The default is org.apache.kafka.clients.consumer.RangeAssignor which implements the
Range strategy described above. You can replace it with
org.apache.kafka.clients.consumer.RoundRobinAssignor. A more advanced option is to
implement your own assignment strategy, in which case partition.assignment.strategy should
point to the name of your class.
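For example, switching the consumer Properties shown earlier to the RoundRobin assignor (a minimal sketch):

props.put("partition.assignment.strategy",
          "org.apache.kafka.clients.consumer.RoundRobinAssignor");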

session.timeout.ms: The amount of time a consumer can be out of contact with the brokers
while still considered alive, defaults to 3 seconds. If a consumer goes for more than
session.timeout.ms without sending a heartbeat to the group coordinator, it is considered dead
and the group coordinator will trigger a rebalance of the consumer group to allocate partitions
from the dead consumer to the other consumers in the group. This property is closely related to
heartbeat.interval.ms. heartbeat.interval.ms controls how frequently the KafkaConsumer poll()
method will send a heartbeat to the group coordinator.

client.id : This can be any string, and will be used by the brokers to identify messages sent from
the client. It is used in logging, metrics and for quotas.

Commit and Offset:

How does a consumer commit an offset? It produces a message to Kafka, to a special
__consumer_offsets topic, with the committed offset for each partition. As long as all your
consumers are up, running and churning away, this will have no impact. However, if a consumer
crashes or a new consumer joins the consumer group, this will trigger a rebalance. After a
rebalance, each consumer may be assigned a different set of partitions than the ones it processed
before. In order to know where to pick up the work, the consumer will read the latest
committed offset of each partition and continue from there.
If the committed offset is smaller than the offset of the last message the client processed, the
messages between the last processed offset and the committed offset will be processed twice.

If the committed offset is larger than the offset of the last message the client actually processed,
all messages between the last processed offset and the committed offset will be missed by the
consumer group.

When a group is first initialized, the consumers typically begin reading from either the earliest or
latest offset in each partition. The messages in each partition log are then read sequentially. As
the consumer makes progress, it commits the offsets of messages it has successfully processed.
For example, a consumer's position might be at offset 6 while its last committed
offset is at offset 1.
There are two other significant positions in the log. The log end offset is the offset
of the last message written to the log. The high watermark is the offset of the last message that
was successfully copied to all of the log's replicas. From the perspective of the consumer, the
main thing to know is that you can only read up to the high watermark. This prevents the
consumer from reading unreplicated data which could later be lost.

By setting enable.auto.commit = false, offsets will only be committed when the application
explicitly chooses to do so. The simplest and most reliable of the commit APIs is commitSync().
This API will commit the latest offset returned by poll() and return once the offset is committed,
throwing an exception if the commit fails for some reason. Because commitSync() commits the latest
offset returned by poll(), make sure you call commitSync() after you are done processing all
the records in the collection, or you risk missing messages as described above. If a rebalance
is triggered, all the messages from the beginning of the most recent batch until the time of the
rebalance will be processed twice.

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("topic = %s, partition = %s, offset = %d, customer = %s, country = %s\n",
            record.topic(), record.partition(), record.offset(), record.key(), record.value());
    }
    try {
        consumer.commitSync();
    } catch (CommitFailedException e) {
        log.error("commit failed", e);
    }
}

Commit at specified offset: the consumer API allows you to call commitSync() and commitAsync()
and pass a map of partitions and offsets that you wish to commit, as sketched below.
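A minimal sketch, assuming record is the ConsumerRecord currently being processed and imports of org.apache.kafka.common.TopicPartition and org.apache.kafka.clients.consumer.OffsetAndMetadata (the +1 marks the next offset the group expects to read):

Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
offsets.put(new TopicPartition(record.topic(), record.partition()),
            new OffsetAndMetadata(record.offset() + 1));
consumer.commitSync(offsets);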
Rebalance Listener:
A consumer will want to do some cleanup work before exiting and also before partition
rebalancing. If you know your consumer is about to lose ownership of a partition, you will want
to commit offsets of the last event you’ve processed. If your consumer maintained a buffer with
events that it only processes occasionally, you will want to process the events you accumulated
before losing ownership of the partition. Perhaps you also need to close file handles, database
connections and such.

The consumer API allows you to run your own code when partitions are added or removed from
the consumer. You do this by passing a ConsumerRebalanceListener when calling the subscribe()
method we discussed previously. ConsumerRebalanceListener has two methods you can
implement:
public void onPartitionsRevoked(Collection<TopicPartition> partitions) is called before the
rebalancing starts and after the consumer has stopped consuming messages. This is where you want
to commit offsets, so whoever gets this partition next will know where to start.
public void onPartitionsAssigned(Collection<TopicPartition> partitions) is called after partitions
have been re-assigned to the consumer, but before the consumer starts consuming messages.
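A minimal sketch of such a listener, assuming the usual consumer imports and a currentOffsets map (an illustrative name) that the poll loop keeps updated with the latest processed offsets:

Map<TopicPartition, OffsetAndMetadata> currentOffsets = new HashMap<>();

class HandleRebalance implements ConsumerRebalanceListener {
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // nothing to do - consumption resumes from the committed offsets
    }

    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // commit what has been processed so far before losing ownership
        consumer.commitSync(currentOffsets);
    }
}

// register the listener when subscribing:
consumer.subscribe(Collections.singletonList("customerCountries"), new HandleRebalance());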

Kafka Internals
Cluster Membership:

Kafka uses Apache Zookeeper to maintain the list of brokers that are currently members of the
cluster. Every broker has a unique identifier that is either set in the broker configuration file or
automatically generated. Every time a broker process starts, it registers itself with its id in
Zookeeper by creating an ephemeral node. Different Kafka components subscribe to the
/brokers/ids path in Zookeeper where brokers are registered, so they get notified when brokers
are added or removed. While the node representing the broker is gone when the broker is
stopped, the broker ID still exists in other data structures. For example, the list of replicas of
each topic (see below) contains the broker IDs for the replica. This way, if you completely lose a
broker and start a brand new broker with the ID of the old one, it will immediately join the
cluster in place of the missing broker with the same partitions and topics assigned to it.

Controller Broker: The controller is one of the Kafka brokers that in addition to the usual broker
functionality is also responsible for the task of electing partition leaders. The first broker that
starts in the cluster becomes the controller by creating an ephemeral node in Zookeeper,
/controller. The brokers create a Zookeeper watch on the controller node, so they get notified of
changes to this node. This way we guarantee the cluster will only have one controller at a time.
When the controller broker is stopped or loses connectivity to Zookeeper, the ephemeral node
will disappear. Other brokers in the cluster will be notified through the Zookeeper watch that
the controller is gone and will attempt to create the controller node in Zookeeper themselves.
When the controller notices that a broker left the cluster (by watching the relevant Zookeeper
path), it knows that all the partitions that had a leader on that broker will need a new leader. It goes
over all the partitions that need a new leader, determines who the new leader should be (simply the
next replica in the replica list of that partition) and sends a request to all the brokers that contain
either the new leaders or the existing followers for those partitions. The request contains information
on who is the new leader and who are the followers for the partitions. The new leaders now know
that they need to start serving producer and consumer requests from clients, while the followers now
know that they need to start replicating messages from the new leader.
When the controller notices a broker joined the cluster, it uses the broker ID to check if there are
replicas that exist on this broker. If there are, the controller notifies both new and existing brokers of
the change, and the replicas on the new broker start replicating messages from the existing leaders.
To summarize, the controller uses an epoch number to prevent a "split brain" scenario where two nodes
each believe they are the current controller.

Kafka Replication:
Kafka is “a distributed, partitioned, replicated commit log service”. Replication is so critical because
it is the way Kafka guarantees availability and durability when individual nodes inevitably fail. There
are two types of replica.

Leader replica - Each partition has a single replica designated as the leader. All produce and consume
requests go through the leader, in order to guarantee consistency. It is the leader's responsibility to know
which followers are up to date with it.

Follower replica - All replicas for a partition that are not leaders are called followers. Followers don’t
serve client requests, their only job is to replicate messages from the leader and stay up to date with
the most recent messages the leader has. In the event a leader replica for a partition crashes, one of
the follower replicas will be promoted to become the new leader for the partition.

In order to stay in sync with the leader, the replicas send the leader Fetch requests, the exact same
type of requests that consumers send in order to consume messages. In response to those requests, the
leader sends the messages to the replicas. Those Fetch requests contain the offset of the message that
the replica wants to receive next, and they will always be in order. If a replica hasn't requested any
messages in over 10 seconds, or is requesting messages but hasn't caught up to the most recent
message in over 10 seconds, the replica is considered "out of sync". If a replica fails to keep up with
the leader, it can no longer become the new leader in the event of a failure - after all, it does not contain
all the messages.
Replicas that are consistently asking for the latest messages are called "in-sync replicas". Only in-
sync replicas are eligible to be elected as partition leaders in case the existing leader fails.

Apart from the current leader, each partition has a preferred leader - the replica that was the leader when
the topic was originally created is the preferred leader for the partition. It is preferred because when
partitions are first created the leaders are balanced between brokers. By default, Kafka is configured
with auto.leader.rebalance.enable=true, which will check whether the preferred leader replica is not the
current leader but is in sync, and if so trigger leader election to make the preferred leader the current
leader.
Request Processing:

All requests have a standard header that includes:
o Request type (also called API key)
o Request version (so the brokers can handle clients of different versions and respond accordingly)
o Correlation ID - a number that uniquely identifies the request and also appears in the response and in
the error logs (it is used for troubleshooting)
o Client ID - used to identify the application that sent the request.

For each port the broker listens on, the broker runs an Acceptor thread that creates a connection and
hands it over to a Processor thread for handling. The number of processor threads (also called
network threads) is configurable. The network threads are responsible for taking requests from client
connections and placing them on request queue and picking up responses from response queue and
sending them back to clients.

Once requests are placed on the request queue, IO threads are responsible to pick up the requests and
process them. The most common types of requests are:

o Produce requests - sent by producers; they contain messages the clients write to Kafka brokers.
o Fetch requests - sent by consumers and follower replicas when they read messages from Kafka brokers.

Both produce requests and fetch requests have to be sent to the leader replica of a partition. If a
broker receives a produce request for a specific partition and the leader for this partition is on a
different broker, the client that sent the produce request will get an error response with the error “Not
a Leader for Partition”. The same error will occur if a fetch request for a specific partition arrives at a
broker that does not have the leader for that partition. It is the responsibility of Kafka’s clients to
always send produce and fetch requests to the broker that contains the leader for the relevant partition
for the request.

How do the clients know where to send the requests? Kafka clients use another request type called
metadata request. The request includes a list of topics the client is interested in. The server response
specifies which partitions exist in the topics, who are the replicas for each partition and which replica
is the leader. Metadata request can be sent to any broker since all brokers have a metadata cache that
contains this information.

Kafka famously uses a “Zero Copy” method to send the messages to the clients - this means that
Kafka sends messages from the file (or more likely, the Linux filesystem cache) directly to the
network channel without any intermediate buffers. This is different than most databases where data is
stored in local cache before being sent to clients. This technique removes the overhead of copying
bytes and managing buffers in memory and results in much improved performance.

Kafka Storage:

Partitions cannot be split between multiple brokers, nor even between multiple disks on the same
broker. So the size of a partition is limited by the space available on a single mount point.

o File Management: Each partition's data is split into segments (by default 1 GB of data or one week
of data, whichever is smaller). Once the segment limit is reached, the segment is closed and a new one
is opened. The segment we are currently writing to is called the active segment. The active segment is never deleted. The format
of the data on the disk is identical to the format of the messages that we send from the
producer and later send to consumers. Each message contains, in addition to its key, value and
offset, things like the message size, checksum code that allows us to detect corruption, magic
byte that indicates the version of the message format, compression codec (Snappy, GZip or LZ4)
and a timestamp (added in 0.10.0 release). The timestamp is given either by the producer when
the message was sent or by the broker when the message arrived - depending on configuration.

o In order to help brokers quickly locate the message for a given offset, Kafka maintains an
index for each partition. The index maps offsets to segment files and location within the file.
Indexes are also broken into segments, so we can delete old index entries when the messages
are purged. Kafka does not attempt to maintain checksums of the index. If the index becomes
corrupted, it will get re-generated from the matching log segment simply by re-reading the
messages and recording the offsets and locations. It is also completely safe for an
administrator to delete index segments if needed - they will be re-generated automatically.
o Compaction: Kafka supports use cases where you need to maintain the latest data forever, e.g. a
customer's current state/address, by allowing you to change the retention policy on a topic from
"delete", which deletes events older than the retention time, to "compact", which only stores the
most recent value for each key in the topic. Obviously, setting the policy to "compact" only
makes sense for topics to which applications produce events that contain both a key and a value.
If the topic contains null keys, compaction will fail.
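For example, a compacted topic can be created by setting the cleanup.policy topic configuration (the topic name is illustrative):

kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic customer-state \
  --config cleanup.policy=compact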
