
Apache Kafka

Introduction
Companies start by wanting to send data from a source system to a target system.

But after a while, it gets complicated.

The problems that the organization is going to face are:

 If you have 4 source systems, and 6 target systems, you need to write 24 integrations.
 Each integration comes with difficulties around:
o Protocol: how the data is transported (TCP, HTTP, REST, FTP, JDBC…)
o Data format: how the data is parsed (Binary, CSV, JSON, Avro, Protobuf…)
o Data schema & evolution: how the data is shaped and how it may change over time.
 Each source system will have an increased load from the connections.

Why Apache Kafka

Now, the source systems are only in charge of sending data to Apache Kafka and, when the target systems
need the data, they request it from Apache Kafka.
Use cases
 Messaging System
 Activity Tracking
 Gather metrics from many different locations
 Application Logs gathering
 Stream processing (with the Kafka Streams API, for example)
 De-coupling of system dependencies
 Integration with Spark, Flink, Storm, Hadoop, and many other Big Data technologies
 Micro-services pub/sub

Examples
 Netflix uses Kafka to apply recommendations in real-time while you’re watching TV
shows.
 Uber uses Kafka to gather user, taxi, and trip data in real-time to compute and forecast
demand, and compute surge pricing in real-time.
 LinkedIn uses Kafka to prevent spam and to collect user interactions in order to make better
connection recommendations in real time.

Kafka Theory
Topics, partitions and offsets
Topics
They’re a particular stream of data within our Kafka cluster.

In the example the topics are logs, purchases, twitter_tweets and truck_gps.

It’s like a table in a database, but without all the constraints.

You can have as many topics as you want.

A topic is identified by its name and they support any kind of message format (JSON, Avro, text
file, binary, etc.).

The sequence of messages is called a data stream.

You cannot query topics. Instead, use Kafka Producers to send data and Kafka Consumers to
read the data.
Partitions and offsets
Topics are split in partitions (example: 100 partitions). Messages within each partition are
ordered.

The incremental numbers within each partition (e.g. in Partition 1, the numbers 0, 1, 2, 3, 4, 5,
6, 7 and 8) are IDs, and they’re called offsets.

Kafka topics are immutable: once data is written to a partition, it cannot be changed.

Example: truck_gps
You have a fleet of trucks; each truck reports its GPS position to Kafka.

Each truck will send a message to Kafka every 20 seconds, each message will contain the truck
ID and the truck position (latitude and longitude).

You can have a topic truck_gps that contains the position of all trucks.

We choose to create that topic with 10 partitions (arbitrary number).

Important notes
Once the data is written to a partition, it cannot be changed (immutability).

Data is kept only for a limited time (default is one week – configurable).

Offsets only have a meaning for a specific partition (e.g. offset 3 in partition 0 doesn’t represent
the same data as offset 3 in partition 1).

Offsets are not re-used even if previous messages have been deleted.

Order is guaranteed only within a partition (not across partitions).

Data is assigned randomly to a partition unless a key is provided (more on this later).

You can have as many partitions per topic as you want.


Producers and Message Keys
Producers
Producers write data to topics (which are made of partitions).

Producers know in advance which partition to write to (and which Kafka broker has it).

In case of Kafka broker failures, producers will automatically recover.

This load is balanced across many brokers thanks to the number of partitions.

Kafka scales, because the producers send data across all partitions and each partition receives
messages from one or more producers.

Producers: Message keys


Producers can choose to send a key with the message (string, number, binary, etc…).

If key = null, data is sent round robin (partition 0, then 1, then 2…).

But, if key != null, then all messages for that key will always go to the same partition (hashing).

A key is typically sent if you need message ordering for a specific field (ex: truck_id).
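As a small hedged sketch (reusing the KafkaProducer<String, String> producer built later in the Java Producer section; the topic name and values are illustrative), sending the truck ID as the message key guarantees that all positions of a given truck land in the same partition, and therefore stay in order:

// all messages with the same key ("truck_123") are hashed to the same partition
ProducerRecord<String, String> record =
        new ProducerRecord<>("truck_gps", "truck_123", "{\"lat\": 40.41, \"lon\": -3.70}");
producer.send(record);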
Messages anatomy
A message is composed of:

 Key: It can be null and it’s in binary format.


 Value: It’s the message content and can be null as well.
 Compression type
 Headers (optional)
 Partition + Offset
 Timestamp (system or user set)

Message Serializer
Kafka only accepts bytes as an input from producers and sends bytes out as an output to
consumers.

Message Serialization means transforming objects/data into bytes.

They are used on the value and the key.


Common Serializers:

 String (incl. JSON).


 Int, Float
 Avro
 Protobuf

Message Key Hashing


A Kafka partitioner is the code logic that takes a record and determines to which partition to
send it.

Key Hashing is the process of determining the mapping of a key to a partition.

In the default Kafka partitioner, the keys are hashed using the murmur2 algorithm, with (roughly) the
formula below for the curious:

targetPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions
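As a minimal sketch of that key hashing logic (assuming a topic with 10 partitions; Utils is org.apache.kafka.common.utils.Utils from kafka-clients):

import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;

// hypothetical key and partition count, for illustration only
byte[] keyBytes = "truck_123".getBytes(StandardCharsets.UTF_8);
int numPartitions = 10;

// same key => same hash => same partition (as long as the partition count doesn't change)
int targetPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
System.out.println("Key goes to partition " + targetPartition);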

Consumers & Deserialization


Consumers
Consumers read data from a topic (identified by name) – pull model

Consumers automatically know which broker to read from.

In case of broker failures, consumers know how to recover.

Data is read in order from low to high offset within each partition.

Consumer Deserializer
Deserialization indicates how to transform bytes into objects/data.

They are used on the value and the key of the message.

Common Deserializers:

 String (incl. JSON).


 Int, Float.
 Avro
 Protobuf
The serialization/deserialization type must not change during a topic lifecycle (create a new
topic instead).

Consumer Groups & Consumer Offsets


Consumer Groups
All the consumers in an application read data as a consumer group.

Each consumer within a group reads from exclusive partitions.

What if too many consumers?


If you have more consumers than partitions, some consumers will be inactive.
Multiple Consumers on one topic
In Apache Kafka it’s acceptable to have multiple consumer groups on the same topic.

To create distinct consumer groups, use the consumer property group.id.

Consumer Offsets
Kafka stores the offsets at which a consumer group has been reading.

The committed offsets are stored in a Kafka topic named __consumer_offsets.

When a consumer in a group has processed data received from Kafka, it should be periodically
committing the offsets (the Kafka broker will write to __consumer_offsets, not the group
itself).

If a consumer dies, it will be able to read back from where it left off thanks to the committed
consumer offsets.

Delivery semantics for consumers


By default, Java Consumers will automatically commit offsets (at least once).

There are 3 delivery semantics if you choose to commit manually.

 At least once (usually preferred):


o Offsets are committed after the message is processed.
o If the processing goes wrong, the message will be read again.
o This can result in duplicate processing of messages. Make sure your processing
is idempotent (i.e. processing again the messages won’t impact your systems).
 At most once:
o Offsets are committed as soon as messages are received.
o If the processing goes wrong, some messages will be lost (they won’t be read
again).
 Exactly once:
o For Kafka -> Kafka workflows: use the Transaction API (easy with Kafka
Streams API).
o For Kafka -> External System workflows: use an idempotent consumer.

Brokers and Topics


Brokers
A Kafka cluster is composed of multiple brokers (servers).

They’re called brokers because they receive and send data.

Each broker is identified with its ID (integer).

Each broker contains certain topic partitions.

After connecting to any broker (called a bootstrap broker), you will be connected to the entire
cluster (Kafka clients have smart mechanics for that).

A good number to get started is 3 brokers, but some big clusters have 100 brokers.

In these examples we choose to number brokers starting at 100 (arbitrary).

Example
Example of Topic A with 3 partitions and Topic B with 2 partitions.

Note: data is distributed and broker 103 doesn’t have any Topic B data.

As we can see, the data in our partitions is distributed across all brokers. This is what makes
Kafka scale, and it’s what’s called horizontal scaling: the more partitions and brokers we add,
the more the data is spread out across the entire cluster.
Kafka Broker Discovery
Every Kafka broker is also called a “bootstrap server”.

That means that you only need to connect to one broker, and the Kafka clients will know how
to be connected to the entire cluster (smart clients).

Topic replication
Topic replication factor
Topics should have a replication factor > 1 (usually between 2 and 3).

This way if a broker is down, another broker can serve the data.

Example: Topic A with 2 partitions and replication factor of 2.

In this example, if we lose Broker 102, Broker 101 and 103 can still serve the data.

Concept of Leader for a Partition


At any time, only ONE broker can be leader for a given partition.

Producers can only send data to the broker that is leader of a partition.

The other brokers will replicate the data.

Therefore, each partition has one leader and multiple ISRs (in-sync replicas).
Default producer & consumer behavior with leaders
Kafka producers can only write to the leader broker for a partition.

Kafka consumers by default will read from the leader broker for a partition.

Kafka Consumers Replica Fetching (Kafka v2.4+)


Since Kafka 2.4, it’s possible to configure consumers to read from the closest replica.

This may help to improve latency, and also decrease network costs if using the cloud.

Producer Acknowledgements (acks)


Producers can choose to receive acknowledgment of data writes:

 acks=0: Producer won’t wait for acknowledgment (possible data loss).


 acks=1: Producer will wait for leader acknowledgment (limited data loss).
 acks=all: Leader and all in-sync replicas acknowledge (no data loss).

Kafka Topic Durability


For a topic replication factor of 3, topic data can withstand the loss of 2 brokers.

As a rule, for a replication factor of N, you can permanently lose up to N-1 brokers and still
recover your data.
Zookeeper
Zookeeper manages brokers (keeps a list of them). Also, it helps in performing leader election
for partitions.

Zookeeper also sends notifications to Kafka in case of changes (e.g. new topic, broker dies,
broker comes up, delete topics, etc…).

Kafka 2.x can’t work without Zookeeper.

But, Kafka 3.x can work without Zookeeper (KIP-500) – using Kafka Raft (KRaft) instead.

Kafka 4.x will not have Zookeeper.

Zookeeper by design operates with an odd number of servers (1, 3, 5, 7).

Zookeeper has a leader (handles writes); the rest of the servers are followers (handle reads).

Zookeeper does NOT store consumer offsets with Kafka > v0.10.

Should you use Zookeeper?


With Kafka brokers: yes, until Kafka 4.0 is out, while waiting for Kafka without Zookeeper to be
production-ready.

With Kafka Clients

 Over time, the Kafka clients and CLI have been migrated to leverage the brokers as a
connection endpoint instead of Zookeeper.
 Since Kafka 0.10, consumers store offsets in Kafka (not Zookeeper) and must not connect
to Zookeeper, as that is deprecated.
 Since Kafka 2.2, the kafka-topics.sh CLI command references Kafka brokers and not
Zookeeper for topic management (creation, deletion, etc…) and the Zookeeper CLI
argument is deprecated.
 All the APIs and commands that were previously leveraging Zookeeper have been migrated to
use Kafka instead, so that when clusters are migrated to be without Zookeeper, the
change is invisible to clients.
 Zookeeper is also less secure than Kafka, and therefore, Zookeeper ports should only be
opened to allow traffic from Kafka brokers, and not Kafka clients.
 Therefore, to be a great modern-day Kafka developer, never use Zookeeper as a
configuration in your Kafka clients or other programs that connect to Kafka.
Kafka KRaft – Removing Zookeeper
Kafka KRaft
Zookeeper shows scaling issues when Kafka clusters have > 100,000 partitions.

By removing Zookeeper, Apache Kafka can:

 Scale to millions of partitions, and become easier to maintain and set up.


 Improve stability and make it easier to monitor, support and administer.
 Single security model for the whole system.
 Single process to start with Kafka.
 Faster controller shutdown and recovery time.

Kafka 3.x now implements the Raft protocol (KRaft) in order to replace Zookeeper (production
ready since Kafka 3.3.1).

Kafka 4.0 will be released only with KRaft.


Starting Kafka
First, start Zookeeper
~/kafka_2.13-3.7.1/bin/zookeeper-server-start.sh
~/kafka_2.13-3.7.1/config/zookeeper.properties

Second, start Kafka itself


~/kafka_2.13-3.7.1/bin/kafka-server-start.sh
~/kafka_2.13-3.7.1/config/server.properties

Starting Kafka without Zookeeper


If we configured Kafka to start without Zookeeper (using KRaft), we have to start it like this:

Generate a new ID for your cluster:


~/kafka_2.13-3.7.1/bin/kafka-storage.sh random-uuid

Next, format your storage directory:


~/kafka_2.13-3.7.1/bin/kafka-storage.sh format -t <uuid> -c
~/kafka_2.13-3.7.1/config/kraft/server.properties

And after that, start Kafka itself.


~/kafka_2.13-3.7.1/bin/kafka-server-start.sh
~/kafka_2.13-3.7.1/config/kraft/server.properties
CLI (Command Line Interface) 101
Kafka Topics CLI
Create a topic
kafka-topics.sh --bootstrap-server localhost:9092 --topic first_topic
--create

o kafka-topics.sh: the command itself
o --bootstrap-server: the URL where Kafka is reachable, in this case
localhost:9092.
o --create: command to create the topic
o --topic: info of the topic, in this case, the name of the topic will be first_topic

Create topic with partitions


kafka-topics.sh --bootstrap-server localhost:9092 --topic second_topic
--create --partitions 3

o --partitions: number of partitions that the topic is going to have.

Create topic with partitions and replication factor


kafka-topics.sh --bootstrap-server localhost:9092 --topic third_topic
--create --partitions 3 --replication-factor 1

o --replication-factor: specify the replication factor. In this case, 1.

List topics
kafka-topics.sh --bootstrap-server localhost:9092 --list

Get topic description


kafka-topics.sh --bootstrap-server localhost:9092 --topic first_topic
--describe

Delete a topic
kafka-topics.sh --bootstrap-server localhost:9092 --topic first_topic
--delete

Kafka Console Producer CLI


Produce a message
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic
first_topic
> Hello World

Produce a message with properties


kafka-console-producer.sh --bootstrap-server localhost:9092 --topic
first_topic --producer-property acks=all

> some message that is acked

Produce a message with non-existing topic


kafka-console-producer.sh --bootstrap-server localhost:9092 --topic
new_topic

> hello world!

This will create the topic


Produce a message with key
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic
first_topic --property parse.key=true --property key.separator=:

>example key:example value


>name:Stephane

Produce with RoundRobinPartitioner


kafka-console-producer.sh --bootstrap-server localhost:9092 --producer-property partitioner.class=org.apache.kafka.clients.producer.RoundRobinPartitioner --topic second_topic

This will produce to one partition at a time, cycling through the partitions in round-robin fashion.

Kafka Console Consumer CLI


Consume
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic
second_topic

Consume from the beginning


kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic
second_topic --from-beginning

This will consume all the messages sent to that topic from the beginning.

But it’s only helpful when no consumer offset has ever been committed as part of the group. This
means that if the consumer group is “new”, it will consume from the beginning; if not, it will
resume from the committed offsets instead.

Consume displaying key, values and timestamp


kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic second_topic --formatter kafka.tools.DefaultMessageFormatter --property print.timestamp=true --property print.key=true --property print.value=true --property print.partition=true --from-beginning

Kafka Consumer in Groups

And, in the case of having more consumers than partitions in a group, the extra consumers are
left without any partition assigned, so they stay idle.

Consume specifying consumer group


kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic
third_topic --group my-first-application
If we don’t specify a consumer group, the console consumer will create a temporary consumer group
such as console-consumer-XXXXX.

Consumer Group Management CLI

List consumer groups


kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

Describe consumer group


kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe
--group my-second-application

The “LAG” column shows the number of messages that haven’t been consumed yet.

Consumer Groups – Reset Offsets


Check before reset offsets
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-first-application --reset-offsets --to-earliest --topic third_topic --dry-run

 --reset-offsets will reset the offsets.


 --to-earliest means the offsets will be reset to the earliest offset that exists in the topic.
 --dry-run shows what the new offsets would be, without actually executing the reset.

Reset offsets
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-first-application --reset-offsets --to-earliest --topic third_topic --execute

It’s the same command as “check before reset offsets”, just changing --dry-run to --execute.
Kafka Java Programming 101
Before all
When working locally, before starting Kafka, you have to execute the following commands (they disable IPv6):

sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1

sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1

Necessary dependencies
 kafka-clients
 slf4j-api
 slf4j-simple

Java Producer
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

Properties properties = new Properties();
// connect to localhost
properties.setProperty("bootstrap.servers", "localhost:9092");

// set producer properties
properties.setProperty("key.serializer", StringSerializer.class.getName());
properties.setProperty("value.serializer", StringSerializer.class.getName());

// create the Producer
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);

// create a Producer Record
ProducerRecord<String, String> producerRecord = new ProducerRecord<>("demo_java", "hello world");

// send data
producer.send(producerRecord);

// tell the producer to send all data and block until done -- synchronous
producer.flush();

// flush and close the producer
producer.close();
Java Producer Callbacks
Normally, when sending messages without a key, Round Robin behavior is expected.

But when messages are sent very quickly, they are batched together into the same partition. This is
the Sticky Partitioner behavior, which makes the producer more efficient.

Also, to produce messages with callbacks, we can do it like this:


// create a Producer Record
ProducerRecord<String, String> producerRecord = new ProducerRecord<>("demo_java", "hello world " + i);

// send data
producer.send(producerRecord, new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        // executes every time a record is successfully sent or an exception is thrown
        if (exception == null) {
            // the record was successfully sent
            log.info("Received new metadata \n" +
                    "Topic: " + metadata.topic() + "\n" +
                    "Partition: " + metadata.partition() + "\n" +
                    "Offset: " + metadata.offset() + "\n" +
                    "Timestamp: " + metadata.timestamp());
        } else {
            log.error("Error while producing", exception);
        }
    }
});

Also, we can override the batch size and the partitioner class with these properties:
properties.setProperty("batch.size", "400");
properties.setProperty("partitioner.class",
RoundRobinPartitioner.class.getName());
Java Producer with Keys
String topic = "demo_java";
String key = "id_" + i;
String value = "hello world " + i;
// create a Producer Record
ProducerRecord<String, String> producerRecord = new
ProducerRecord<>(topic, key, value);

// send data
producer.send(producerRecord, new Callback() {
@Override
public void onCompletion(RecordMetadata metadata, Exception
exception) {
// executes every time a record successfully sent or an
exception is thrown
if (exception == null) {
// the record was successfully sent
log.info("Key: " + key + " | Partitions: " +
metadata.partition() + "\n");
} else {
log.error("Error while producing", exception);
}
}
});
Java Consumer
log.info("Im a Kafka Consumer!");

String groupId = "my-java-application";


String topic = "demo_java";

Properties properties = new Properties();


// connect to Localhost
properties.setProperty("bootstrap.servers", "localhost:9092");

// create consumer configs


properties.setProperty("key.deserializer",
StringDeserializer.class.getName());
properties.setProperty("value.deserializer",
StringDeserializer.class.getName());

properties.setProperty("group.id", groupId);

properties.setProperty("auto.offset.reset", "earliest");

// create a consumer
KafkaConsumer<String, String> consumer = new
KafkaConsumer<>(properties);

// subscribe to a topic
consumer.subscribe(Arrays.asList(topic));

// poll for data


while(true){
log.info("Polling");
ConsumerRecords<String, String> records =
consumer.poll(Duration.ofMillis(1000));

for (ConsumerRecord<String, String> record: records){


log.info("Key: " + record.key() + ", Value: " +
record.value());
log.info("Partition: " + record.partition() + ", Offset: " +
record.offset());
}

auto.offset.reset has three options:

 none: if there is no existing consumer group (no committed offsets), the consumer fails. We must
create the consumer group (commit offsets) before starting the application.
 earliest: read from the beginning of the topic (it’s the equivalent of --from-beginning in the
console consumer).
 latest: only read the messages sent from the moment the consumer starts.
Java Consumer – Graceful shutdown
// create a consumer
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);

// get a reference to the main thread
final Thread mainThread = Thread.currentThread();

// adding the shutdown hook
Runtime.getRuntime().addShutdownHook(new Thread() {
    public void run() {
        log.info("Detected a shutdown, let's exit by calling consumer.wakeup()...");
        consumer.wakeup();

        // join the main thread to allow the execution of the code in the main thread
        try {
            mainThread.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
});

try {
    // subscribe to a topic
    consumer.subscribe(Arrays.asList(topic));

    // poll for data
    while (true) {
        log.info("Polling");
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));

        for (ConsumerRecord<String, String> record : records) {
            log.info("Key: " + record.key() + ", Value: " + record.value());
            log.info("Partition: " + record.partition() + ", Offset: " + record.offset());
        }
    }
} catch (WakeupException e) {
    log.info("Consumer is starting to shut down.");
} catch (Exception e) {
    log.error("Unexpected exception in the consumer", e);
} finally {
    consumer.close(); // close the consumer, this will also commit the offsets
    log.info("The consumer is now gracefully shut down");
}

The consumer.wakeup() call causes the next consumer.poll() to throw a WakeupException.


Consumer Groups and Partition Rebalance
Moving partitions between consumers is called a rebalance.

Reassignment of partitions happens when a consumer leaves or joins a group.

It also happens if an administrator adds new partitions into a topic.

Eager Rebalance
All consumers stop and give up their membership of partitions. Then, they rejoin the
consumer group and get a new partition assignment.

This rebalance is quite random and during a short period of time, the entire consumer group
stops processing.

Also, there’s no guarantee that the consumers “get back” the same partitions as they used to have.

Cooperative Rebalance (Incremental Rebalance)


Reassigning a small subset of the partitions from one consumer to another.

Other consumers that don’t have reassigned partitions can still process uninterrupted.

Can go through several iterations to find a “stable” assignment (hence “incremental”).

This avoids “stop-the-world” events where all consumers stop processing data.

In code, it should be something like this:


properties.setProperty("partition.assignment.strategy",
CooperativeStickyAssignor.class.getName());

How to use?
KafkaConsumer: We have to modify the partition.assignment.strategy property.

 RangeAssignor (used to be the default): assign partitions on a per-topic basis (can lead to
imbalance).
 RoundRobin: assign partitions across all topics in a round-robin fashion, optimal
balance.
 StickyAssignor: balanced like RoundRobin, and then minimizes partition movements
when consumers join/leave the group.
 CooperativeStickyAssignor: rebalance strategy identical to StickyAssignor, but
supports cooperative rebalances, so consumers can keep on consuming
from the topic.
 The default assignor is [RangeAssignor, CooperativeStickyAssignor], which uses the
RangeAssignor by default, but allows upgrading to the CooperativeStickyAssignor with
just a single rolling bounce that removes the RangeAssignor from the list.

Kafka Connect: already implemented (enabled by default).

Kafka Streams: Turned on by default using StreamsPartitionAssignor.

Static Group Membership


By default, when a consumer leaves a group, its partitions are revoked and re-assigned.

If it joins back, it will have a new “member ID” and new partitions assigned.

If you specify group.instance.id it makes the consumer a static member.

Upon leaving, the consumer has up to session.timeout.ms to join back and get back its
partitions (else they will be re-assigned), without triggering a rebalance.

In code, it should be something like this:


properties.setProperty("group.instance.id", "..."); // Strategy for
static assignments

Consumer Auto Offset Commit Behavior


In the Java Consumer API, offsets are regularly committed.

Enables at-least once reading scenario by default (under conditions).

Offsets are committed when you call .poll() and auto.commit.interval.ms has elapsed.

Example: with auto.commit.interval.ms=5000 and enable.auto.commit=true, offsets will be committed every 5 seconds (at the next .poll()).

Make sure messages are all successfully processed before you call poll() again. If you don’t, you
won’t be in at-least-once reading scenario.

In that rare case, you must disable enable.auto.commit, most likely move processing to a
separate thread, and then from time to time call .commitSync() or .commitAsync() with the
correct offsets manually (advanced).
Advanced Producer Configurations
Producer Acknowledgements (acks)
acks=0
When acks=0, producers consider messages as “written successfully” the moment the message
is sent, without waiting for the broker to accept it at all.

But if the broker goes offline or an exception happens, we won’t know and will lose data.

It’s useful for data where it’s okay to potentially lose messages, such as metrics collection.

It’s the highest-throughput setting because the network overhead is minimized.

acks=1
When acks=1, producers consider messages as “written successfully” when the message was
acknowledged by only the leader.

Default for Kafka v1.0 to v2.8

Leader response is requested, but replication is not guaranteed, as it happens in the background.

If the leader broker goes offline unexpectedly but replicas haven’t replicated the data yet, we
have a data loss.

If an ack is not received, the producer may retry the request.

acks=all (acks=-1)
When acks=all, producers consider messages as “written successfully” when the message is
accepted by all in-sync replicas (ISR).

Default for Kafka 3.0+


acks=all & min.insync.replicas
The leader replica for a partition checks to see if there are enough in-sync replicas for safely
writing the message (controlled by the broker setting min.insync.replicas).

 min.insync.replicas = 1: only the broker leader needs to successfully ack


 min.insync.replicas = 2: at least the broker leader and one replica need to ack.

Kafka topic availability


 Availability: (considering RF=3)
o acks=0 & acks=1: if one replica is up and considered an ISR, the topic will be
available for writes.
o acks=all:
 min.insync.replicas=1 (default): the topic must have at least 1 replica
up as an ISR (that includes the leader), so we can tolerate two
brokers being down.
 min.insync.replicas=2: the topic must have at least 2 ISR up, and
therefore we can tolerate at most one broker being down (in the case
of replication factor of 3), and we have the guarantee that for every
write, the data will be at least written twice.
 min.insync.replicas=3: this wouldn’t make much sense for a
corresponding replication factor of 3 and we couldn’t tolerate any
broker going down.
 In summary, when acks=all, with a replication.factor=N and
min.insync.replicas=M we can tolerate N-M brokers going down for
topic availability purposes.

acks=all and min.insync.replicas=2 is the most popular option for data durability and
availability, and allows you to withstand at most the loss of one Kafka broker.
Producer Retries
Producer Retries
In case of transient failures, developers are expected to handle exceptions otherwise the data
will be lost.

Example of transient failure: NOT_ENOUGH_REPLICAS (due to the min.insync.replicas setting)

There is a “retries” setting:

 Defaults to 0 for Kafka <= 2.0


 Defaults to 2147483647 for Kafka >= 2.1

The retry.backoff.ms setting is by default 100 ms.

Producer Timeouts
If retries are enabled, for example retries=2147483647, the retries are bounded by a timeout.

Since Kafka 2.1, you can set: delivery.timeout.ms=120000 == 2 min

Producer Retries: Warning for old version of Kafka


If you’re not using an idempotent producer (not recommended – old Kafka):

 In case of retries, there’s a chance that messages will be sent out of order (if a batch
has failed to be sent).
 If you rely on key-based ordering, that can be an issue.

For this, you can set the setting which controls how many produce requests can be made in
parallel: max.in.flight.requests.per.connection

 Default: 5
 Set it to 1 if you need to ensure ordering (may impact throughput).

In Kafka >= 1.0.0, there’s a better solution with idempotent producers!


Idempotent Producer
The producer can introduce duplicate messages in Kafka due to network errors.

In Kafka >= 0.11, you can define an “idempotent producer” which won’t introduce duplicates on
network errors.

Idempotent producers are great to guarantee a stable and safe pipeline!

They’re the default since Kafka 3.0, recommended to use them.

They come with:

 retries = Integer.MAX_VALUE (2^31-1 = 2147483647)
 max.in.flight.requests=1 (Kafka == 0.11) or
 max.in.flight.requests=5 (Kafka >= 1.0 – higher performance & keeps ordering – KAFKA-5494)
 acks=all

These settings are applied automatically after your producer has started if not manually set.

Just set producerProps.put("enable.idempotence", true);


Kafka Producer defaults
Since Kafka 3.0, the producer is “safe” by default:

 acks=all (-1)
 enable.idempotence=true

With Kafka 2.8 and lower, the producer by default comes with:

 acks=1
 enable.idempotence=false

It’s recommended to use a safe producer whenever possible.

Also, it’s important to always use upgraded Kafka Clients.

Safe Kafka Producer – Summary & Demo


Since Kafka 3.0, the producer is “safe” by default, otherwise, upgrade your clients or set the
following settings:

 acks=all: Ensures data is properly replicated before an ack is received.


 min.insync.replicas=2 (broker/topic level): Ensures two brokers in ISR at least have the
data after an ack.
 enable.idempotence=true: Duplicates are not introduced due to network retries.
 retries=MAX_INT (producer level): Retry until delivery.timeout.ms is reached.
 delivery.timeout.ms=120000: Fail after retrying for 2 minutes.
 max.in.flight.requests.per.connection=5: Ensure maximum performance while keeping
message ordering.
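A minimal sketch of the settings listed above in Java (most of them are already the defaults since Kafka 3.0, so setting them explicitly mainly matters for older clients; min.insync.replicas is configured at the broker or topic level, not here):

import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");

// safe producer settings (defaults since Kafka 3.0)
properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
properties.setProperty(ProducerConfig.ACKS_CONFIG, "all");
properties.setProperty(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
properties.setProperty(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");
properties.setProperty(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");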

Kafka Message Compression


Message Compression at the Producer level
Producers usually send data that’s text-based, for example JSON data.

In this case, it’s important to apply compression to the producer.

Compression can be enabled at the Producer level and doesn’t require any configuration
change in the Brokers or in the Consumers:

compression.type can be none (default), gzip, lz4, snappy, zstd (Kafka 2.1).

Compression is more effective the bigger the batch of messages being sent to Kafka.
Advantages:

 Much smaller producer request size (compression ratio up to 4x!).


 Faster to transfer data over the network => less latency.
 Better throughput.
 Better disk utilization in Kafka (stored messages on disk are smaller).

Disadvantages (very minor):

 Producers must commit some CPU cycles to compression.


 Consumers must commit some CPU cycles to decompression.

Overall:

 Consider testing snappy or lz4 for optimal speed/compression ratio (test others too).
 Consider tweaking linger.ms and batch.size to have bigger batches, and therefore
more compression and higher throughput.
 Use compression in production.

Message Compression at the Broker/Topic level


There’s also a setting you can set at the broker level (all topics) or topic-level:

 compression.type=producer (default): The broker takes the compressed batch from
the producer client and writes it directly to the topic’s log file without recompressing
the data.
 compression.type=none: All batches are decompressed by the broker.
 compression.type=lz4:
o If it’s matching the producer setting, data is stored on disk as is.
o If it’s a different compression setting, batches are decompressed by the broker
and then re-compressed using the compression algorithm specified.

Warning: if you enable broker-side compression, it will consume extra CPU cycles.
linger.ms and batch.size Producer settings
linger.ms & batch.size
By default, Kafka producers try to send records as soon as possible:

 It will have up to max.in.flight.requests.per.connection=5, meaning up to 5 message
batches being in flight (being sent between the producer and the broker) at most.
 After this, if more messages must be sent while others are in flight, Kafka is smart and
will start batching them before the next batch send.

This smart batching helps increase throughput while maintaining very low latency. Added
benefit: batches have higher compression ratio so better efficiency.

Two settings to influence the batching mechanism:

 linger.ms (default 0): how long to wait until we send a batch. Adding a small number
for example 5 ms helps add more messages in the batch at the expense of latency.
 batch.size: if a batch is filled before linger.ms, increase the batch size.

batch.size (default 16KB)


It’s the maximum number of bytes that will be included in a batch.

Increasing the batch size to something like 32KB or 64KB can help increase the compression,
throughput and efficiency of requests.

Any message that’s bigger than the batch size will not be batched.

A batch is allocated per partition, so make sure that you don’t set it to a number that’s too
high, otherwise you’ll waste memory!

Note: you can monitor the average batch size metric using Kafka Producer Metrics.

High Throughput Producer


 Increase linger.ms and the producer will wait a few milliseconds for the batches to fill
up before sending them.
 If you are sending full batches and have memory to spare, you can increase batch.size
and send larger batches.
 Introduce some producer-level compression for more efficiency in sends.
High Throughput Producer Implementation.
 We’ll add snappy message compression in our producer. Snappy is very helpful if your messages are
text-based, for example log lines or JSON documents. Also, it has a good balance of
CPU/Compression ratio.
 We’ll also increase the batch size to 32KB and introduce a small delay through
linger.ms
// high throughput producer (at the expense of a bit of latency and CPU usage)
properties.setProperty(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
properties.setProperty(ProducerConfig.LINGER_MS_CONFIG, "20");
properties.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32*1024)); // 32 KB batch size
Producer Default Partitioner & Sticky Partitioner
Key Hashing is the process of determining the mapping of a key to a partition.

In the default Kafka partitioner, the keys are hashed using the murmur2 algorithm:
targetPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions

This means that the same key will always go to the same partition, but adding partitions to a topic
will completely alter the key-to-partition mapping.

It’s most likely preferable not to override the behavior of the partitioner, but it’s possible to do
so using the partitioner.class parameter for our Kafka producers.

When key=null, the producer has a default partitioner that varies:

 Round Robin for Kafka 2.3 and below.


 Sticky Partitioner for Kafka 2.4 and above.

The Sticky Partitioner improves the performance of the producer, especially at high throughput
when the key is null.

Producer Default Partitioner Kafka <= 2.3 – Round Robin Partitioner


With Kafka <= 2.3, when there’s no partition and no key specified, the default partitioner sends
data in a round-robin fashion.

This results in more batches (one batch per partition) and smaller batches (imagine with 100
partitions).

Smaller batches lead to more requests as well as higher latency.


Producer Default Partitioner Kafka >= 2.4 – Sticky Partitioner
It would be better to have all the records sent to a single partition and not multiple partitions
to improve batching.

The producer sticky partitioner:

 We “stick” to a partition until the batch is full or linger.ms has elapsed.


 After sending the batch, the partition that’s sticky, changes.

This results in larger batches and reduced latency (because requests are larger and batch.size is
more likely to be reached).

Over time, records are still spread evenly across all partitions.

max.block.ms & buffer.memory


If the producer produces faster than the broker can take, the records will be buffered in
memory:

buffer.memory=33554432 (32MB): the size of the send buffer.

That buffer will fill up over time and empty back down when the throughput to the broker
increases.

If that buffer is full (all 32MB), then the .send() method will start to block (won’t return right
away).

max.block.ms=60000: the time the .send() will block until throwing an exception.

Exceptions are thrown when:

 The producer has filled up its buffer.


 The broker is not accepting any new data.
 60 seconds has elapsed.

If you hit an exception, that usually means your brokers are down or overloaded as they can’t
respond to requests.
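A minimal sketch of how these two producer settings could be set explicitly (the values shown are the documented defaults, written out for illustration):

import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");

// size of the in-memory send buffer (32 MB is the default)
properties.setProperty(ProducerConfig.BUFFER_MEMORY_CONFIG, Integer.toString(32 * 1024 * 1024));

// how long .send() may block when the buffer is full before throwing an exception (60 s is the default)
properties.setProperty(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");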
Advanced Consumer Configuration
Consumer Delivery Semantics
At most once
Offsets are committed as soon as the message batch is received. If the processing goes wrong,
the message will be lost (it won’t be read again).

At least once (preferred)


Offsets are committed after the message is processed. If the processing goes wrong, the
message will be read again. This can result in duplicate processing of messages. Make sure
your processing is idempotent (i.e. processing again the messages won’t impact your systems).

Exactly once
Can be achieved for Kafka => Kafka workflows using the Transactional API (easy with Kafka
Streams API). For Kafka => Sink workflows, use an idempotent consumer.

Consumer Offset Commit Strategies


There are two common patterns for committing offsets in a consumer application.

2 strategies:

 (easy) enable.auto.commit = true & synchronous processing of batches


 (medium) enable.auto.commit = false & manual commit of offsets

Kafka Consumer - Auto Offset Commit Behavior


In the Java Consumer API, offsets are regularly committed.

Enable at-least once reading scenario by default (under conditions).

Offsets are committed when you call .poll() and auto.commit.interval.ms has elapsed.

Example: with auto.commit.interval.ms=5000 and enable.auto.commit=true, offsets will be committed every 5 seconds (at the next .poll()).


Make sure messages are all successfully processed before you call poll() again.

 If you don’t, you will not be in at-least-once reading scenario.


 In that (rare) case, you must disable enable.auto.commit, most likely move
processing to a separate thread, and then from time to time call .commitSync() or
.commitAsync() with the correct offsets manually (advanced).

enable.auto.commit = true & synchronous processing of batches.


With auto-commit, offsets will be committed automatically for you at a regular interval
(auto.commit.interval.ms=5000 by default) every time you call .poll().

If you don’t use synchronous processing, you will be in “at-most-once” behavior because
offsets will be committed before your data is processed.

enable.auto.commit=false & synchronous processing of batches


You control when you commit offsets and what’s the condition for committing them.

Example: accumulating records into a buffer and then flushing the buffer to a database +
committing offsets asynchronously then.
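A minimal sketch of this strategy, reusing the consumer properties from the Java Consumer section (ConsumerConfig is org.apache.kafka.clients.consumer.ConsumerConfig; processRecord() is a hypothetical helper standing in for your own processing, e.g. writing to a database):

// disable auto-commit and commit only after the batch has been processed
properties.setProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
consumer.subscribe(Arrays.asList("demo_java"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));

    for (ConsumerRecord<String, String> record : records) {
        processRecord(record); // hypothetical processing step
    }

    // commit the offsets of this batch only once it has been fully processed
    consumer.commitAsync();
}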

enable.auto.commit=false & storing offsets externally


This is advanced:

 You need to assign partitions to your consumers at launch manually, using the .seek() API.
 You need to model and store your offsets in a database table for example.
 You need to handle the cases where rebalances happen (ConsumerRebalanceListener
interface).

Example: If you need exactly once processing and can’t find any way to do idempotent
processing, then you “process data” + “commit offsets” as a part of a single transaction.

It’s not recommended to use this strategy unless you know exactly what you’re doing and why.
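A minimal sketch of the manual assignment part (the topic, partition number and stored offset are illustrative; in practice the offset would come from your own external store):

import org.apache.kafka.common.TopicPartition;
import java.util.Collections;

// manually assign a specific partition instead of subscribing with a group
TopicPartition partition = new TopicPartition("demo_java", 0);
consumer.assign(Collections.singletonList(partition));

// position the consumer at the offset you stored externally (hypothetical value)
long offsetFromExternalStore = 42L;
consumer.seek(partition, offsetFromExternalStore);

// subsequent polls start from that offset
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));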
Consumer Offset Reset Behavior
A consumer is expected to read from a log continuously.

But if your application has a bug, your consumer can be down.

 If Kafka has a retention of 7 days, and your consumer is down for more than 7 days,
the offsets are “invalid”.

The behavior for the consumer is to then use:

 auto.offset.reset=latest: will read from the end of the log.


 auto.offset.reset=earliest: will read from the start of the log.
 auto.offset.reset=none: will throw exception if no offset is found.

Additionally, consumer offsets can be lost:

 If a consumer hasn’t read new data in 1 day (Kafka < 2.0)


 If a consumer hasn’t read new data in 7 days (Kafka >= 2.0)

This can be controlled by the broker setting offsets.retention.minutes.

Replaying data for Consumers


To replay data for a consumer group:

 Take all the consumers from a specific group down.


 Use kafka-consumer-groups command to set offset to what you want.
 Restart consumers.

Bottom line:

 Set proper data retention period & offset retention period.


 Ensure the auto offset reset behavior is the one you expect/want.
 Use replay capability in case of unexpected behavior.
Consumer Internal Threads
Controlling Consumer Liveliness

Consumers in a group talk to a Consumer Group Coordinator.

To detect consumers that are “down”, there’s a “heartbeat” mechanism and a “poll”
mechanism.

To avoid issues, consumers are encouraged to process data fast and poll often.

Consumer Heartbeat Thread


 heartbeat.interval.ms (default 3 seconds):
o Is how often to send heartbeats.
o Usually, it’s set to 1/3 of session.timeout.ms.
 session.timeout.ms (default 45 seconds Kafka 3.0+, before 10 seconds):
o Heartbeats are sent periodically to the broker.
o If no heartbeat is sent during that period, the consumer is considered dead.
o Set it even lower for faster consumer rebalances.

This mechanism is used to detect a consumer application being down.

Consumer Poll Thread


 max.poll.interval.ms (default 5 minutes):
o Maximum amount of time between two .poll() calls before declaring the
consumer dead.
o This is relevant for Big Data frameworks like Spark in case the processing takes
time.
 max.poll.records (default 500):
o Controls how many records to receive per poll request.
o Increase if your messages are very small and you have a lot of available RAM.
o Good to monitor how many records are polled per request.
o Lower if it takes you too much time to process records.
This mechanism is used to detect a data processing issue with the consumer (consumer is
“stuck”).
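A minimal sketch of how these liveness settings could be made explicit on a consumer (the values shown are the documented defaults):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.Properties;

Properties properties = new Properties();

// heartbeat thread: detects a consumer application being down
properties.setProperty(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");  // ~1/3 of session.timeout.ms
properties.setProperty(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000");

// poll thread: detects a consumer that is "stuck" processing data
properties.setProperty(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000"); // 5 minutes
properties.setProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");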

Consumer Poll Behavior


 fetch.min.bytes (default 1):
o Controls how much data you want to pull at least on each request.
o Helps improving throughput and decreasing request number.
o At the cost of latency.
 fetch.max.wait.ms (default 500):
o The maximum amount of time the Kafka broker will block before answering
the fetch request if there isn’t sufficient data to immediately satisfy the
requirement given by fetch.min.bytes.
o This means that until the requirement of fetch.min.bytes to be satisfied, you
will have up to 500 ms of latency before the fetch returns data to the
consumer (e.g. introducing a potential delay to be more efficient in requests).
 max.partition.fetch.bytes (default 1MB):
o The maximum amount of data per partition the server will return.
o If you read from 100 partitions, you’ll need a lot of memory (RAM).
 fetch.max.bytes (default 50MB):
o Maximum data returned for each fetch request.
o If you have available memory, try increasing fetch.max.bytes to allow the
consumer to read more data in each request.

Change these settings only if your consumer maxes out on throughput already.
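A minimal sketch of how these fetch settings could be tuned on a consumer (the values are illustrative, not recommendations):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.Properties;

Properties properties = new Properties();

// ask the broker to wait for at least 1 KB of data (or fetch.max.wait.ms) before answering a fetch
properties.setProperty(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1024");
properties.setProperty(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");

// upper bounds on how much data a single fetch may return
properties.setProperty(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, Integer.toString(1024 * 1024));
properties.setProperty(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, Integer.toString(50 * 1024 * 1024));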

Consumer Replica Fetching – Rack Awareness


Default Consumer behavior with partition leaders
Kafka Consumers by default will read from the leader broker for a partition.

Possibly higher latency (multiple data centres), plus higher network charges ($$$).

Example: Data Centre === Availability Zone (AZ) in AWS; you pay for cross-AZ network charges.
Kafka Consumer Replica Fetching (Kafka v2.4+)
Since Kafka 2.4, it’s possible to configure consumers to read from the closest replica.

This may help improve latency, and also decrease network costs if using the cloud.

Consumer Rack Awareness (v2.4+) – How to Setup


 Broker settings:
o Must be version Kafka v2.4+.
o broker.rack config must be set to the data centre ID (ex: AZ ID in AWS).
o Example for AWS: broker.rack=usw2-az1 (the AZ ID).
o replica.selector.class must be set to
org.apache.kafka.common.replica.RackAwareReplicaSelector
 Consumer client setting:
o Set client.rack to the data center ID the consumer is launched on.
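A minimal sketch of the consumer-side setting (the AZ ID is illustrative):

// tell the consumer which "rack" (e.g. AWS AZ ID) it runs in, so it can fetch from the closest replica
properties.setProperty(ConsumerConfig.CLIENT_RACK_CONFIG, "usw2-az1");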
Kafka Extended APIs for Developers
Kafka and its ecosystem have introduced over time some new APIs that are higher level and solve
specific sets of problems:

 Kafka Connect solves External Source => Kafka and Kafka => External Sink
 Kafka Streams solves transformations Kafka => Kafka
 Schema Registry helps using Schema in Kafka

Kafka Connect Introduction


Why Kafka Connect
Programmers always want to import data from the same sources (i.e. databases, JDBC, Couchbase,
Blockchain, MongoDB, etc…)

Also, programmers always want to store data in the same sinks (i.e. S3, Elasticsearch, JDBC,
Cassandra, MongoDB, etc…)

Kafka Connect – High Level


 Source Connectors to get data from Common Data Sources.
 Sink Connectors to publish that data in Common Data Stores.
 Make it easy for non-experienced devs to quickly get their data reliably into Kafka.
 Part of your ETL (Extract-Transform-Load) pipeline.
 Scaling made easy from small pipelines to company-wide pipelines.
 Other programmers may already have done a very good job: re-usable code!
 Connectors achieve fault tolerance, idempotence, distribution, ordering.
Kafka Streams
Introduction
You want to do the following from the wikimedia.recentchange topic:

 Count the number of times a change was created by a bot versus a human
 Analyze number of changes per website (ru.wikipedia.org vs en.wikipedia.org)
 Number of edits per 10-seconds as a time series

With the Kafka Producer and Consumer you can achieve that, but it’s very low level and not
developer-friendly.

What is Kafka Streams?


It’s an easy data processing and transformation library within Kafka:

 You write it as a Standard Java Application.


 There’s no need to create a separate cluster.
 It’s highly scalable, elastic and fault tolerant.
 Provides you Exactly-Once transformation capability because it’s a Kafka => Kafka
workflow.
 One record at a time processing (no batching).
 Works for any application size.
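As a hedged sketch of what the first use case could look like with the Kafka Streams DSL (the application id, the output topic name and the naive "bot" string check are assumptions for illustration only):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import java.util.Properties;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wikimedia-stats-application"); // assumed app id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> changes = builder.stream("wikimedia.recentchange");

// count changes made by bots vs humans (naive string check, for illustration only)
changes.mapValues(value -> value.contains("\"bot\":true") ? "bot" : "human")
       .groupBy((key, type) -> type)
       .count()
       .toStream()
       .to("wikimedia.stats.bot-count", Produced.with(Serdes.String(), Serdes.Long()));

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();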

Kafka Schema Registry


The need for a schema registry
Kafka takes bytes as an input and publishes them. Because of that, there’s no data verification.

What if the producer sends bad data? Or a field gets renamed? Or the data format changes
from one day to another? This can lead to breaking the consumers.

We need data to be self-describable, and we need to be able to evolve data without breaking
downstream consumers. We need schemas and a schema registry.

What if the Kafka Brokers were verifying the messages they receive?

It would break what makes Kafka so good:

 Kafka doesn’t parse or even read your data (no CPU usage).
 Kafka takes bytes as an input without even loading them into memory (that’s called
zero copy)
 Kafka distributes bytes.
 As far as Kafka is concerned, it doesn’t even know if your data is an integer, a string,
etc.

The Schema Registry must be a separate component and Producers and Consumers need to be
able to talk to it.

The Schema Registry must be able to reject bad data.

A common data format must be agreed upon:

 It needs to support schemas.


 It needs to support evolution.
 It needs to be lightweight.
So we have Schema Registry and Apache Avro as the data format (Protobuf, JSON Schema also
supported).

Schema Registry – Purpose


A pipeline without Schema Registry will look like this:
Source -> Producer -> Kafka -> Consumer -> Target

But with Schema Registry, it looks like this:

1. Source
2. Producer -> Schema Registry (Send Schema if the Schema is not yet inserted)
3. Schema Registry -> Kafka (Schema Registry is going to validate the Schema itself with
Kafka)
4. Producer -> Kafka (Sends Avro data to Kafka)
5. Kafka -> Schema Registry (Validate the Schema)
6. Kafka -> Consumer (Sends Avro data to Consumer)
7. Schema Registry -> Consumer (Retrieve Schema to produce the object)
8. Consumer -> Target

So the purposes of a Schema Registry are:

 Store and retrieve schemas for Producers/Consumers.


 Enforce Backward/Forward/Full compatibility on topics.
 Decrease the size of the payload of data sent to Kafka.
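A hedged sketch of what a producer using Avro with the Confluent Schema Registry could look like, assuming the kafka-avro-serializer dependency and a registry running on localhost:8081 (the topic name and the schema are made up for illustration):

import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("key.serializer", StringSerializer.class.getName());
properties.setProperty("value.serializer", KafkaAvroSerializer.class.getName());
properties.setProperty("schema.registry.url", "http://localhost:8081"); // assumed local registry

// a hypothetical Avro schema, for illustration only
Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"TruckPosition\",\"fields\":[" +
        "{\"name\":\"truck_id\",\"type\":\"string\"}," +
        "{\"name\":\"latitude\",\"type\":\"double\"}," +
        "{\"name\":\"longitude\",\"type\":\"double\"}]}");

GenericRecord position = new GenericData.Record(schema);
position.put("truck_id", "truck_123");
position.put("latitude", 40.41);
position.put("longitude", -3.70);

// the serializer registers/validates the schema with the registry and sends Avro bytes to Kafka
KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(properties);
producer.send(new ProducerRecord<>("truck_gps_avro", "truck_123", position));
producer.flush();
producer.close();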

Schema Registry – gotchas


Utilizing a schema registry has a lot of benefits, but implies you need to:

 Set it up well.
 Make sure it’s highly available.
 Partially change the producer and consumer code.

Apache Avro as a format is awesome, but has a learning curve.

Other formats include Protobuf and JSON Schema.

The Confluent Schema Registry is free and source-available.

Other open-source alternatives may exist.
