Kafka
Kafka Installation:
The following software will be required:
• Java
• Zookeeper: It is recommended to install a full version of zookeeper.
• log.segment.ms -> Another way to control when log segments are closed is by using the log.segment.ms parameter, which specifies the amount of time after which a log segment should be closed. As with the log.retention.bytes and log.retention.ms parameters, log.segment.bytes and log.segment.ms are not mutually exclusive properties. Kafka will close a log segment either when the size limit is reached or when the time limit is reached, whichever comes first. By default, there is no setting for log.segment.ms, which results in only closing log segments by size.
• The default maximum message size the broker accepts from a producer is 1 MB.
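As a reference, these segment and message-size settings live in the broker's server.properties; a minimal sketch with illustrative values (not recommendations):
# close a segment at 1 GB or after 7 days, whichever comes first
log.segment.bytes=1073741824
log.segment.ms=604800000
# maximum message size the broker will accept from producers (roughly 1 MB by default)
message.max.bytes=1000012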
Number of Brokers in a Cluster:
We should consider the following factors when deciding on the broker count.
3. Produce Messages:
Test Message 1
Test Message 2
4. Consume Messages:
Test Message 1
Test Message 2
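For reference, steps 3 and 4 above are typically done with the console producer and consumer that ship with Kafka. A sketch, assuming a recent Kafka version, a broker on localhost:9092 and a topic named test:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
> Test Message 1
> Test Message 2
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
Test Message 1
Test Message 2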
Kafka Topic:
Partitions are also the way that Kafka provides redundancy and scalability. Each partition
can be hosted on a different server, which means that a single topic can be scaled
horizontally across multiple servers to provide for performance far beyond the ability of a
single server.
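For example, a topic spread over several partitions can be created with the kafka-topics tool; a sketch assuming a ZooKeeper-based cluster at localhost:2181 and a single broker (hence replication factor 1):
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 4 --topic t1
Each of the four partitions can then be hosted on a different broker as the cluster grows, which is what lets the topic scale beyond a single server.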
Kafka Producer
By default, the producer does not care what partition a specific message is written to and
will balance messages over all partitions of a topic evenly. In some cases, the producer will
direct messages to specific partitions. This is typically done using the message key and a
partitioner that will generate a hash of the key and map it to a specific partition. This
assures that all messages produced with a given key will get written to the same partition.
The producer could also use a custom partitioner that follows other business rules for
mapping messages to partitions.
o Create a ProducerRecord object. Its mandatory fields are the topic and the value to be sent.
o Key and partition are optional parameters.
o The producer will serialize the key and value to byte arrays.
o The data is then sent to the partitioner. If we don't specify a partition, the partitioner will choose a partition based on the ProducerRecord key.
o Now the producer knows which topic and partition the record will go to. It then adds
the record to a batch of records that will also be sent to the same topic and partition. A
separate thread is responsible for sending those batches of records to the appropriate
Kafka brokers.
o If the messages were successfully written to Kafka, it will return a RecordMetadata
object with the topic, partition and the offset of the record within the partition. If the
broker failed to write the messages, it will return an error. When the producer receives
an error, it may retry sending the message a few more times before giving up and
returning an error.
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
o Fire-and-forget - in which we send a message to the server and don’t really care if it
arrived successfully or not. Most of the time, it will arrive successfully, since Kafka is
highly available and the producer will retry sending messages automatically. However,
some messages will get lost using this method.
o Synchronous Send - we send a message, the send() method returns a Future object and
we use get() to wait on the future and see if the send() was successful or not.
o Asynchronous Send - we call the send() method with a callback function, which gets triggered when we receive a response from the Kafka broker (a sketch of all three variants follows the configuration notes below).
o Here, we are using Future.get() to wait until the reply from Kafka arrives back. The specific Future implemented by the producer will throw an exception if the Kafka broker sent back an error, and our application can handle the problem. If there were no errors, we will get a RecordMetadata object which we can use to retrieve the offset the message was written to.
o If there were any errors - errors before sending data to Kafka, errors while sending, non-retriable exceptions returned by the Kafka brokers, or exhaustion of the available retries - we will encounter an exception.
o client.id -> this can be any string, and will be used by the brokers to identify messages
sent from the client. It is used in logging, metrics and for quotas.
o max.request.size -> this setting controls the size of a produce request sent by the
producer. It caps both the size of the largest message that can be sent and the number
of messages that the producer can send in one request.
o max.in.flight.requests.per.connection -> This controls how many messages the producer
will send to the server without receiving responses. Setting this high can increase
memory usage while improving throughput, although setting it too high can reduce
throughput as batching becomes less efficient. Setting this to 1 will guarantee that
messages will be written to the broker in the order they were sent, even when retries
occur.
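A sketch of the three send variants described above, assuming the producer built earlier and a hypothetical topic CustomerCountry with a key/value pair chosen purely for illustration:
ProducerRecord<String, String> record =
    new ProducerRecord<>("CustomerCountry", "Precision Products", "France");

// Fire-and-forget: send and ignore the result (errors may go unnoticed)
producer.send(record);

// Synchronous send: block on the Future; get() throws if the broker returned an error
try {
    RecordMetadata metadata = producer.send(record).get();
    System.out.println("written to partition " + metadata.partition()
        + " at offset " + metadata.offset());
} catch (Exception e) {
    e.printStackTrace(); // non-retriable error or retries exhausted
}

// Asynchronous send: the callback is invoked once the broker responds
producer.send(record, new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception e) {
        if (e != null) {
            e.printStackTrace();
        }
    }
});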
Apache Kafka preserves order of messages within a partition. This means that if messages were
sent from the producer in a specific order, the broker will write them to a partition in this order
and all consumers will read them in this order. Setting the retries parameter to non-zero and max.in.flight.requests.per.connection to more than one means that it is possible for the broker to fail to write the first batch of messages, succeed in writing the second (which was already in flight) and then retry the first batch and succeed, thereby reversing the order. If order is critical, reliability is usually critical too, so setting retries to zero is not an option; instead you can set max.in.flight.requests.per.connection = 1 to make sure that no additional messages will be sent to the broker while the first batch is still retrying. This will severely limit the throughput of the producer, so only use this when order is important.
Partition:
Keys serve two goals: they are additional information that gets stored with the message, and they are also used to decide to which of the topic's partitions the message will be written. All messages with the same key will go to the same partition. This means that if a process is reading only a subset of the partitions in a topic, all the records for a single key will be read by the same process. When the key is null and the default partitioner is used, the record will be sent to one of the available partitions of the topic at random; a round-robin algorithm is used to balance the messages between the partitions.
If a key exists and the default partitioner is used, Kafka will hash the key (using its own hash
algorithm, so hash values will not change when Java is upgraded), and use the result to map the
message to a specific partition. This time, it is important that a key will always get mapped to
the same partition, so we use all the partitions in the topic to calculate the mapping and not just
available partitions.
The mapping of keys to partitions is consistent only as long as the number of partitions in a topic does not change. As long as the number of partitions is constant, you can be sure that the mapping will not change. The moment you add new partitions to the topic, this is no longer guaranteed: old records will stay in whichever partition they were originally written to, while new records with the same key may get written to a different partition. When partitioning of the keys is important, the easiest solution is to
create topics with sufficient partitions, and never add partitions.
Implementing a Custom Partitioner: You need to implement the Partitioner interface. It has three methods: configure, partition and close.
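A sketch of such a partitioner; the business rule (isolating one hypothetical high-volume key on the last partition) and the fallback hash are illustrative only and do not reproduce Kafka's default murmur2-based partitioner:
import java.util.Arrays;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class CustomerPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) {
        // read any custom settings passed to the producer here
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (numPartitions <= 1) {
            return 0; // nothing to choose from
        }
        // hypothetical rule: keep one very active customer on the last partition
        if ("BigCustomer".equals(key)) {
            return numPartitions - 1;
        }
        // simplistic hash of the key bytes over the remaining partitions
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions - 1);
    }

    @Override
    public void close() {
        // nothing to clean up in this sketch
    }
}
The partitioner is then registered on the producer with the partitioner.class property, e.g. kafkaProps.put("partitioner.class", "com.example.CustomerPartitioner"); (the package name here is hypothetical).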
Kafka Consumer
The consumer subscribes to one or more topics and reads the messages in the order they
were produced. The consumer keeps track of which messages it has already consumed by
keeping track of the offset of messages. The offset is another bit of metadata, an integer
value that continually increases, that Kafka adds to each message as it is produced. Each
message within a given partition has a unique offset. By storing the offset of the last
consumed message for each partition, either in Zookeeper or in Kafka itself, a consumer can
stop and restart without losing its place. Consumers work as part of a consumer group. This
is one or more consumers that work together to consume a topic. The group assures that
each partition is only consumed by one member. For example, consider three consumers in a single group consuming a topic with four partitions. Two of the consumers are working from one partition each, while the third consumer is working from two partitions. The mapping of a consumer
to a partition is often called ownership of the partition by the consumer.
In this way, consumers can horizontally scale to consume topics with a large number of
messages. Additionally, if a single consumer fails, the remaining members of the group will
rebalance the partitions being consumed to take over for the missing member.
Let's take topic t1 with 4 partitions. Suppose we create a new consumer, c1, which is the only consumer in group g1, and use it to subscribe to topic t1. Consumer c1 will get all messages from all four of t1's partitions.
If we add more consumers to a single group subscribed to a single topic than we have partitions, then some of the consumers will be idle and get no messages at all.
If we add a new consumer group g2 with a single consumer, this consumer will get all the
messages in topic t1 independently of what g1 is doing. g2 can have more than a single
consumer, in which case they will each get a subset of partitions, just like we showed for g1,
but g2 as a whole will still get all the messages regardless of other consumer groups.
When a consumer is added, it starts consuming messages from partitions that were previously consumed by another consumer. Similarly, when a consumer crashes or leaves the group, the partitions it used to consume will be consumed by one of the remaining consumers.
Reassignment of partitions to consumers also happen when the topics the consumer group is
consuming are modified, for example if an administrator adds new partitions.
When partition ownership is moved from one consumer to another, it is called a rebalance. During a rebalance, consumers can't consume messages, so a rebalance is in effect a short window of unavailability on the entire consumer group. In addition, when partitions are moved from one consumer to another, the consumer loses its current state; if it was caching any data, it will need to refresh its caches, slowing down our application until the consumer sets up its state again.
Consumers maintain their membership in a consumer group and their ownership of the partitions assigned to them by sending heartbeats to a Kafka broker designated as the Group Coordinator. The group coordinator broker can be different for different consumer groups.
As long as the consumer is sending heartbeats at regular intervals, it is assumed to be alive, well
and processing messages from its partitions. Heartbeats are sent when the consumer polls (i.e.
retrieves records) and when it commits records it has consumed. If the consumer stops sending
heartbeats for long enough, its session will time out and the group coordinator will consider it
dead and trigger a rebalance. If a consumer crashes and stops processing messages, it will take the group coordinator a few seconds without heartbeats to decide it is dead and trigger the
rebalance. During those seconds, no messages will be processed from the partitions owned by
the dead consumer. When closing a consumer cleanly, the consumer will notify the group
coordinator that it is leaving, and the group coordinator will trigger a rebalance immediately,
reducing the gap in processing. In release 0.10.1, the Kafka community introduced a separate
heartbeat thread which will send heartbeats in-between polls as well.
THE PROCESS OF ASSIGNING PARTITIONS TO CONSUMERS:
When a consumer wants to join a group, it sends a JoinGroup request to the group coordinator.
The first consumer to join the group becomes the group leader. The leader receives a list of all
consumers in the group from the group coordinator (this will include all consumers that sent a
heartbeat recently and are therefore considered alive) and it is responsible for assigning a
subset of partitions to each consumer. It uses an implementation of PartitionAssignor interface
to decide which partitions should be handled by which consumer. Kafka has two built-in
partition assignment policies. After deciding on the partition assignment, the consumer leader
sends the list of assignments to the GroupCoordinator which sends this information to all the
consumers. Each consumer only sees its own assignment - the leader is the only client process
that has the full list of consumers in the group and their assignments. This process repeats every
time a rebalance happens.
Creating Consumer:
Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("group.id", "CountryCounter");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
group.id -> It specifies the consumer group the KafkaConsumer instance belongs to.
Subscribe to Topic:
The subscribe() method takes a list of topics as a parameter:
consumer.subscribe(Collections.singletonList("customerCountries"));
Polling:
Once the consumer subscribes to topics, the poll loop handles all details of coordination,
partition rebalances, heartbeats and data fetching, leaving the developer with a clean API that
simply returns available data from the assigned partitions. The poll loop does a lot more than
just get data. The first time you call poll() with a new consumer, it is responsible for finding the
GroupCoordinator, joining the consumer group and receiving a partition assignment. If a
rebalance is triggered, it will be handled inside the poll loop as well. And of course the
heartbeats that keep consumers alive are sent from within the poll loop. You can’t have multiple
consumers that belong to the same group in one thread and you can’t have multiple threads
safely use the same consumer. One consumer per thread is the rule. To run multiple consumers
in the same group in one application, you will need to run each in its own thread. It is useful to
wrap the consumer logic in its own object, and then use Java’s ExecutorService to start multiple
threads each with its own consumer.
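A minimal sketch of that pattern, assuming a hypothetical Runnable wrapper class ConsumerLoop that builds its own KafkaConsumer (with the shared group.id) and runs its own poll loop:
// java.util.concurrent.ExecutorService / Executors and java.util.Collections are assumed
int numConsumers = 3; // more consumers than partitions would leave some of them idle
ExecutorService executor = Executors.newFixedThreadPool(numConsumers);
for (int i = 0; i < numConsumers; i++) {
    executor.submit(new ConsumerLoop("CountryCounter",
        Collections.singletonList("customerCountries")));
}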
Consumer Properties:
partition.assignment.strategy: Determines how partitions are assigned to consumers in the group; Kafka ships with two built-in policies, Range and RoundRobin.
session.timeout.ms: The amount of time a consumer can be out of contact with the brokers while still being considered alive; it defaults to 3 seconds. If a consumer goes for more than
session.timeout.ms without sending a heartbeat to the group coordinator, it is considered dead
and the group coordinator will trigger a rebalance of the consumer group to allocate partitions
from the dead consumer to the other consumers in the group. This property is closely related to
heartbeat.interval.ms. heartbeat.interval.ms controls how frequently the KafkaConsumer poll()
method will send a heartbeat to the group coordinator.
client.id : This can be any string, and will be used by the brokers to identify messages sent from
the client. It is used in logging, metrics and for quotas.
If the committed offset is larger than the offset of the last message the client actually processed,
all messages between the last processed offset and the committed offset will be missed by the
consumer group.
When a group is first initialized, the consumers typically begin reading from either the earliest or
latest offset in each partition. The messages in each partition log are then read sequentially. As
the consumer makes progress, it commits the offsets of messages it has successfully processed.
For example, suppose the consumer's position is at offset 6 while its last committed offset is at offset 1.
Two other significant positions exist in the log. The log end offset is the offset
of the last message written to the log. The high watermark is the offset of the last message that
was successfully copied to all of the log’s replicas. From the perspective of the consumer, the
main thing to know is that you can only read up to the high watermark. This prevents the
consumer from reading unreplicated data which could later be lost.
By setting enable.auto.commit = false, offsets will only be committed when the application
explicitly chooses to do so. The simplest and most reliable of the commit APIs is commitSync().
This API will commit the latest offset returned by poll() and return once the offset is committed,
throwing an exception if commit fails for some reason. commitSync() will commit the latest
offset returned by poll(), so make sure you call commitSync() after you are done processing all
the records in the collection, or you risk missing messages as described above. When rebalance
is triggered, all the messages from the beginning of the most recent batch until the time of the
rebalance will be processed twice.
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("topic = %s, partition = %s, offset = %d, customer = %s, country = %s\n",
            record.topic(), record.partition(), record.offset(), record.key(), record.value());
    }
    try {
        consumer.commitSync();
    } catch (CommitFailedException e) {
        log.error("commit failed", e);
    }
}
Commit at a specified offset: The consumer API allows you to call commitSync() and commitAsync() and pass a map of partitions and offsets that you wish to commit, as sketched below.
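A sketch of committing a specific offset per partition; lastProcessedOffset is a hypothetical variable tracked by the application, and the committed value is the offset of the next message to read (last processed + 1):
Map<TopicPartition, OffsetAndMetadata> currentOffsets = new HashMap<>();
currentOffsets.put(new TopicPartition("customerCountries", 0),
    new OffsetAndMetadata(lastProcessedOffset + 1, "no metadata"));
consumer.commitSync(currentOffsets);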
Rebalance Listener:
A consumer will want to do some cleanup work before exiting and also before partition
rebalancing. If you know your consumer is about to lose ownership of a partition, you will want
to commit offsets of the last event you’ve processed. If your consumer maintained a buffer with
events that it only processes occasionally, you will want to process the events you accumulated
before losing ownership of the partition. Perhaps you also need to close file handles, database
connections and such.
The consumer API allows you to run your own code when partitions are added or removed from
the consumer. You do this by passing a ConsumerRebalanceListener when calling the subscribe()
method we discussed previously. ConsumerRebalanceListener has two methods you can
implement:
public void onPartitionsRevoked(Collection<TopicPartition> partitions) is called before the rebalancing starts and after the consumer has stopped consuming messages. This is where you want to commit offsets, so whoever gets this partition next will know where to start.
public void onPartitionsAssigned(Collection<TopicPartition> partitions) is called after partitions have been reassigned to the consumer, but before the consumer starts consuming messages.
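A sketch of wiring such a listener into subscribe(); currentOffsets is the same kind of hypothetical offsets map shown earlier, and committing it in onPartitionsRevoked ensures the next owner of the partition starts from the right place:
class HandleRebalance implements ConsumerRebalanceListener {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // about to lose these partitions: commit what has been processed so far
        consumer.commitSync(currentOffsets);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // nothing special needed in this sketch
    }
}

consumer.subscribe(Collections.singletonList("customerCountries"), new HandleRebalance());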
Kafka Internals
Cluster Membership:
Kafka uses Apache Zookeeper to maintain the list of brokers that are currently members of the
cluster. Every broker has a unique identifier that is either set in the broker configuration file or
automatically generated. Every time a broker process starts, it registers itself with its id in
Zookeeper by creating an ephemeral node. Different Kafka components subscribe to the
/brokers/ids path in Zookeeper where brokers are registered, so they get notified when brokers
are added or removed. While the node representing the broker is gone when the broker is
stopped, the broker ID still exists in other data structures. For example, the list of replicas of
each topic (see below) contains the broker IDs for the replica. This way, if you completely lose a
broker and start a brand new broker with the ID of the old one, it will immediately join the
cluster in place of the missing broker with the same partitions and topics assigned to it.
Controller Broker: The controller is one of the Kafka brokers that in addition to the usual broker
functionality is also responsible for the task of electing partition leaders. The first broker that
starts in the cluster becomes the controller by creating an ephemeral node in Zookeeper,
/controller. The brokers create a Zookeeper watch on the controller node, so they get notified on
changes to this node. This way we guarantee the cluster will only have one controller at a time.
When the controller broker is stopped or loses connectivity to Zookeeper, the ephemeral node
will disappear. Other brokers in the cluster will be notified through the Zookeeper watch that
the controller is gone and will attempt to create the controller node in Zookeeper themselves.
When the controller notices that a broker left the cluster (by watching the relevant Zookeeper
path), it knows that all the partitions that had a leader on that broker will need a new leader. It goes
over all the partitions that need a new leader, determines who the new leader should be (simply the
next replica in the replica list of that partition) and sends a request to all the brokers that contain
either the new leaders or the existing followers for those partitions. The request contains information
on who is the new leader and who are the followers for the partitions. The new leaders now know
that they need to start serving producer and consumer requests from clients, while the followers now
know that they need to start replicating messages from the new leader.
When the controller notices a broker joined the cluster, it uses the broker ID to check if there are
replicas that exist on this broker. If there are, the controller notifies both new and existing brokers of
the change, and the replicas on the new broker start replicating messages from the existing leaders.
To summarize, the controller uses an epoch number to prevent a “split brain” scenario in which two brokers both believe they are the current controller.
Kafka Replication:
Kafka is “a distributed, partitioned, replicated commit log service”. Replication is so critical because
it is the way Kafka guarantees availability and durability when individual nodes inevitably fail. There
are two types of replicas:
Leader replica - Each partition has a single replica designated as the leader. All produce and consume
requests go through the leader, in order to guarantee consistency. It is the leader's responsibility to check which followers are up to date with the leader.
Follower replica - All replicas for a partition that are not leaders are called followers. Followers don't serve client requests; their only job is to replicate messages from the leader and stay up to date with
the most recent messages the leader has. In the event a leader replica for a partition crashes, one of
the follower replicas will be promoted to become the new leader for the partition.
In order to stay in sync with the leader, the replicas send the leader Fetch requests, the exact same
type of requests that consumers send in order to consume messages. In response to those requests, the
leader sends the messages to the replicas. Those Fetch requests contain the offset of the message that
the replica wants to receive next, and they will always be in order. If a replica hasn't requested any messages in over 10 seconds, or if it is requesting messages but hasn't caught up to the most recent message in over 10 seconds, the replica is considered “out of sync”. If a replica fails to keep up with the leader, it can no longer become the new leader in the event of a failure - after all, it does not contain all the messages.
Replicas that are consistently asking for the latest messages are called “in-sync replicas”. Only in-
sync replicas are eligible to be elected as partition leaders in case the existing leader fails.
Apart from the current leader, each partition has a preferred leader - the replica that was the leader when the topic was originally created is the preferred leader for the partition. It is preferred because when partitions are first created, the leaders are balanced between brokers. By default, Kafka is configured with auto.leader.rebalance.enable=true, which checks whether the preferred leader replica is in-sync but not the current leader, and if so triggers a leader election to make the preferred leader the current leader.
Request Processing:
All requests have a standard header that includes:
• Request type (also called API key)
• Request version (so the brokers can handle clients of different versions and respond accordingly)
• Correlation ID - a number that uniquely identifies the request and also appears in the response and in the error logs (it is used for troubleshooting)
• Client ID - used to identify the application that sent the request
For each port the broker listens on, the broker runs an Acceptor thread that creates a connection and
hands it over to a Processor thread for handling. The number of processor threads (also called
network threads) is configurable. The network threads are responsible for taking requests from client
connections and placing them on request queue and picking up responses from response queue and
sending them back to clients.
Once requests are placed on the request queue, IO threads are responsible for picking up the requests and processing them. The most common types of requests are:
• Produce requests - sent by producers and containing messages the clients write to Kafka brokers.
• Fetch requests - sent by consumers and follower replicas when they read messages from Kafka brokers.
Both produce requests and fetch requests have to be sent to the leader replica of a partition. If a
broker receives a produce request for a specific partition and the leader for this partition is on a
different broker, the client that sent the produce request will get an error response with the error “Not
a Leader for Partition”. The same error will occur if a fetch request for a specific partition arrives at a
broker that does not have the leader for that partition. It is the responsibility of Kafka’s clients to
always send produce and fetch requests to the broker that contains the leader for the relevant partition
for the request.
How do the clients know where to send the requests? Kafka clients use another request type called
metadata request. The request includes a list of topics the client is interested in. The server response
specifies which partitions exist in the topics, who are the replicas for each partition and which replica
is the leader. Metadata requests can be sent to any broker since all brokers have a metadata cache that
contains this information.
Kafka famously uses a “Zero Copy” method to send the messages to the clients - this means that
Kafka sends messages from the file (or more likely, the Linux filesystem cache) directly to the
network channel without any intermediate buffers. This is different than most databases where data is
stored in local cache before being sent to clients. This technique removes the overhead of copying
bytes and managing buffers in memory and results in much improved performance.
Kafka Storage:
A partition cannot be split between multiple brokers, or even between multiple disks on the same broker, so the size of a partition is limited by the space available on a single mount point.
o File Management: Each partition's data is split into segments (by default, 1 GB of data or one week of data). Once the segment limit is reached, the segment is closed and a new one is opened. The segment we are currently writing to is called the active segment. The active segment is never deleted. The format
of the data on the disk is identical to the format of the messages that we send from the
producer and later send to consumers. Each message contains, in addition to its key, value and
offset, things like the message size, checksum code that allows us to detect corruption, magic
byte that indicates the version of the message format, compression codec (Snappy, GZip or LZ4)
and a timestamp (added in 0.10.0 release). The timestamp is given either by the producer when
the message was sent or by the broker when the message arrived - depending on configuration.
o In order to help brokers quickly locate the message for a given offset, Kafka maintains an
index for each partition. The index maps offsets to segment files and location within the file.
Indexes are also broken into segments, so we can delete old index entries when the messages
are purged. Kafka does not attempt to maintain checksums of the index. If the index becomes
corrupted, it will get re-generated from the matching log segment simply by re-reading the
messages and recording the offsets and locations. It is also completely safe for an
administrator to delete index segments if needed - they will be re-generated automatically.
o Compaction: Kafka supports use cases where you need to retain the latest data forever, e.g. a customer's current state/address, by allowing you to change the retention policy on a topic from “delete”, which deletes events older than the retention time, to “compact”, which only stores the most recent value for each key in the topic. Obviously, setting the policy to “compact” only makes sense on topics to which applications produce events that contain both a key and a value. If the topic contains null keys, compaction will fail.
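For example, a compacted topic can be created by setting the cleanup.policy topic configuration to compact; a sketch using the ZooKeeper-based tooling of older Kafka versions:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 \
    --topic customer-state --config cleanup.policy=compact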