Apache Kafka
Introduction
Companies start by wanting to send data from a source system to a target system.
If you have 4 source systems, and 6 target systems, you need to write 24 integrations.
Each integration comes with difficulties around:
o Protocol: how the data is transported (TCP, HTTP, REST, FTP, JDBC…)
o Data format: how the data is parsed (Binary, CSV, JSON, Avro, Protobuf…)
o Data schema & evolution: how the data is shaped and how it may change over time.
Each source system will have an increased load from the connections.
Now, the Source Systems are only in charge of sending data to Apache Kafka and, when the Target Systems need the data, they request it from Apache Kafka.
Use cases
Messaging System
Activity Tracking
Gather metrics from many different locations
Application Logs gathering
Stream processing (with the Kafka Streams API, for example)
De-coupling of system dependencies
Integration with Spark, Flink, Storm, Hadoop, and many other Big Data technologies
Micro-services pub/sub
Examples
Netflix uses Kafka to apply recommendations in real-time while you’re watching TV
shows.
Uber uses Kafka to gather user, taxi, and trip data in real-time to compute and forecast
demand, and compute surge pricing in real-time.
LinkedIn uses Kafka to prevent spam and to collect user interactions to make better connection recommendations in real time.
Kafka Theory
Topics, partitions and offsets
Topics
They’re a particular stream of data within our Kafka cluster.
In the example the topics are logs, purchases, twitter_tweets and truck_gps.
A topic is identified by its name and they support any kind of message format (JSON, Avro, text
file, binary, etc.).
You cannot query topics. Instead, use Kafka Producers to send data and Kafka Consumers to
read the data.
Partitions and offsets
Topics are split in partitions (example: 100 partitions). Messages within each partition are
ordered.
The incremental numbers within each partition (e.g. in Partition 1, the numbers 0, 1, 2, 3, 4, 5, 6, 7 and 8) are ids, and they're called offsets.
Kafka topics are immutable: once data is written to a partition, it cannot be changed.
Example: truck_gps
You have a fleet of trucks; each truck reports its GPS position to Kafka.
Each truck will send a message to Kafka every 20 seconds, each message will contain the truck
ID and the truck position (latitude and longitude).
You can have a topic truck_gps that contains the position of all trucks.
Important notes
Once the data is written to a partition, it cannot be changed (immutability).
Data is kept only for a limited time (default is one week – configurable).
Offsets only have a meaning for a specific partition (e.g. offset 3 in partition 0 doesn't represent the same data as offset 3 in partition 1).
Offsets are not re-used even if previous messages have been deleted.
Data is assigned randomly to a partition unless a key is provided (more on this later).
Producers know which partition to write to (and which Kafka broker has it).
Kafka scales, because the producers send data across all partitions and each partition receives
messages from one or more producers.
If key = null, data is sent round robin (partition 0, then 1, then 2…).
But, if key != null, then all messages for that key will always go to the same partition (hashing).
A key is typically sent if you need message ordering for a specific field (ex: truck_id)
Messages anatomy
A message is composed of:
a key (can be null; binary)
a value (can be null; binary)
a compression type (none, gzip, snappy, lz4, zstd)
optional headers (key-value pairs)
a partition + offset (once the message is written)
a timestamp (set by the system or by the user)
Message Serializer
Kafka only accepts bytes as an input from producers and sends bytes out as an output to consumers.
Message serialization means transforming objects/data into bytes; serializers are used on both the key and the value of the message.
Common Serializers: String (including JSON), Int, Float, Avro, Protobuf.
In the default Kafka partitioner, the keys are hashed using the murmur2 algorithm, with the
formula below for the curious:
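targetPartition = Math.abs(Utils.murmur2(keyBytes)) % (numPartitions - 1)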
Data is read in order from low to high offset within each partition.
Consumer Deserializer
Deserialization indicates how to transform bytes into objects/data.
They are used on the value and the key of the message.
Common Deserializers: String (including JSON), Int, Float, Avro, Protobuf.
Consumer Offsets
Kafka stores the offsets at which a consumer group has been reading.
When a consumer in a group has processed data received from Kafka, it should be periodically
committing the offsets (the Kafka broker will write to __consumer_offsets, not the group
itself).
If a consumer dies, it will be able to read back from where it left off thanks to the committed
consumer offsets.
After connecting to any broker (called a bootstrap broker), you will be connected to the entire
cluster (Kafka clients have smart mechanics for that).
A good number to get started is 3 brokers, but some big clusters have 100 brokers.
Example
Example of Topic A with 3 partitions and Topic B with 2 partitions.
Note: data is distributed and broker 103 doesn’t have any Topic B data.
As we can see, the data in our partitions is distributed across all brokers. This is what makes Kafka scale horizontally: the more partitions and brokers we add, the more the data is spread out across the entire cluster.
Kafka Broker Discovery
Every Kafka broker is also called a “bootstrap server”.
That means that you only need to connect to one broker, and the Kafka clients will know how
to be connected to the entire cluster (smart clients).
Topic replication
Topic replication factor
Topics should have a replication factor > 1 (usually between 2 and 3).
This way if a broker is down, another broker can serve the data.
In this example, if we lose Broker 102, Broker 101 and 103 can still serve the data.
Producers can only send data to the broker that is the leader of a partition.
Each partition has one leader and multiple ISRs (in-sync replicas).
Default producer & consumer behavior with leaders
Kafka producers can only write to the leader broker for a partition.
Kafka consumers by default will read from the leader broker for a partition. Since Kafka 2.4, consumers can also read from the closest replica, which may help improve latency and decrease network costs if using the cloud.
As a rule, for a replication factor of N, you can permanently lose up to N-1 brokers and still
recover your data.
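As an illustration (not part of the original notes), a topic with 3 partitions and a replication factor of 3 could be created programmatically with the Kafka AdminClient; the topic name truck_gps and localhost:9092 are just example values:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3: we can lose up to 2 brokers and still serve the data
            NewTopic topic = new NewTopic("truck_gps", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}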
Zookeeper
Zookeeper manages brokers (keeps a list of them). Also, it helps in performing leader election for partitions.
Zookeeper also sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, delete topics, etc…).
But Kafka 3.x can work without Zookeeper (KIP-500) – using Kafka Raft instead.
Zookeeper has a leader (handles writes); the rest of the servers are followers (handle reads).
Zookeeper does NOT store consumer offsets with Kafka > v0.10.
Over time, the Kafka clients and CLI have been migrated to leverage the brokers as a connection endpoint instead of Zookeeper.
Since Kafka 0.10, consumers store offsets in Kafka and not in Zookeeper, and must not connect to Zookeeper as it's deprecated.
Since Kafka 2.2, the kafka-topics.sh CLI command references Kafka brokers and not Zookeeper for topic management (creation, deletion, etc…) and the Zookeeper CLI argument is deprecated.
All the APIs and commands that were previously leveraging Zookeeper are migrated to use Kafka instead, so that when clusters are migrated to be without Zookeeper, the change is invisible to clients.
Zookeeper is also less secure than Kafka, and therefore Zookeeper ports should only be opened to allow traffic from Kafka brokers, and not Kafka clients.
Therefore, to be a great modern-day Kafka developer, never ever use Zookeeper as a configuration in your Kafka clients or other programs that connect to Kafka.
Kafka KRaft – Removing Zookeeper
Kafka KRaft
Zookeeper shows scaling issues when Kafka clusters have more than 100,000 partitions.
Kafka 3.x now implements the Raft protocol (KRaft) in order to replace Zookeeper (production ready since Kafka 3.3.1).
List topics
kafka-topics.sh --bootstrap-server localhost:9092 --list
Delete a topic
kafka-topics.sh --bootstrap-server localhost:9092 --topic first_topic --delete
This will produce to one partition at a time, cycling through the partitions.
This will consume all the messages sent to that topic from the beginning.
But it's only helpful when no consumer offset has ever been committed as part of the group. This means that if the consumer group is "new", it will consume from the beginning. If not, it will not consume from the beginning.
And, in the case of having more consumers than partitions, a rebalance may hand partitions over to the new consumers, and the consumers left without partitions stay inactive.
The "LAG" column shows how many messages haven't been consumed yet.
Reset offsets
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-first-application --reset-offsets --to-earliest --topic third_topic --execute
It's the same command as "check before reset offsets", just changing --dry-run to --execute.
Kafka Java Programming 101
Before all
When working locally, before running the code you have to start Zookeeper and then the Kafka broker, using the scripts that ship with Kafka.
Necessary dependencies
kafka-clients
slf4j-api
slf4j-simple
Java Producer
Properties properties = new Properties();
// connect to localhost
properties.setProperty("bootstrap.servers", "localhost:9092");
// set producer properties: the serializers turn our String key/value into bytes
properties.setProperty("key.serializer", StringSerializer.class.getName());
properties.setProperty("value.serializer", StringSerializer.class.getName());
// create the producer
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
// create a Producer Record
ProducerRecord<String, String> producerRecord = new ProducerRecord<>("demo_java", "hello world");
// send data
producer.send(producerRecord);
// tell the producer to send all data and block until done -- synchronous
producer.flush();
// close the producer (also flushes)
producer.close();
But when messages are sent very quickly, they are batched together into the same partition. This is the Sticky Partitioner behavior, which makes the producer more efficient.
// send data
producer.send(producerRecord, new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        // executes every time a record is successfully sent or an exception is thrown
        if (exception == null) {
            // the record was successfully sent
            log.info("Received new metadata \n" +
                    "Topic: " + metadata.topic() + "\n" +
                    "Partition: " + metadata.partition() + "\n" +
                    "Offset: " + metadata.offset() + "\n" +
                    "Timestamp: " + metadata.timestamp());
        } else {
            log.error("Error while producing", exception);
        }
    }
});
Also, we can override the batch size and the partitioner class with these properties:
properties.setProperty("batch.size", "400");
properties.setProperty("partitioner.class",
RoundRobinPartitioner.class.getName());
Java Producer with Keys
String topic = "demo_java";
String key = "id_" + i;
String value = "hello world " + i;
// create a Producer Record
ProducerRecord<String, String> producerRecord = new
ProducerRecord<>(topic, key, value);
// send data
producer.send(producerRecord, new Callback() {
@Override
public void onCompletion(RecordMetadata metadata, Exception
exception) {
// executes every time a record successfully sent or an
exception is thrown
if (exception == null) {
// the record was successfully sent
log.info("Key: " + key + " | Partitions: " +
metadata.partition() + "\n");
} else {
log.error("Error while producing", exception);
}
}
});
Java Consumer
log.info("Im a Kafka Consumer!");
properties.setProperty("group.id", groupId);
properties.setProperty("auto.offset.reset", "earliest");
// create a consumer
KafkaConsumer<String, String> consumer = new
KafkaConsumer<>(properties);
// subscribe to a topic
consumer.subscribe(Arrays.asList(topic));
The auto.offset.reset property controls where to start reading when there are no committed offsets:
none: if no existing consumer group (no committed offsets) exists, then we fail. We must have a consumer group before starting the application.
earliest: read from the beginning of the topic (it's the --from-beginning option of the console consumer).
latest: only read the new messages sent from now on.
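The notes stop before the poll loop; a minimal sketch of it (assuming the consumer and log variables from the block above, with Duration coming from java.time) would be:

// poll for new data in a loop
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
    for (ConsumerRecord<String, String> record : records) {
        log.info("Key: " + record.key() + " | Value: " + record.value());
        log.info("Partition: " + record.partition() + " | Offset: " + record.offset());
    }
}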
Java Consumer – Graceful shutdown
// create a consumer
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);

// add a shutdown hook that calls consumer.wakeup(): the next consumer.poll() will then throw a WakeupException
final Thread mainThread = Thread.currentThread();
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    log.info("Detected a shutdown, let's exit by calling consumer.wakeup()...");
    consumer.wakeup();
    try {
        mainThread.join(); // wait for the main thread to finish its cleanup
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}));

try {
    // subscribe to a topic
    consumer.subscribe(Arrays.asList(topic));
    // poll in a loop; on shutdown the WakeupException gets us out of it
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
        for (ConsumerRecord<String, String> record : records) {
            log.info("Key: " + record.key() + " | Value: " + record.value());
        }
    }
} catch (WakeupException e) {
    log.info("Consumer is starting to shut down.");
} catch (Exception e) {
    log.error("Unexpected exception in the consumer", e);
} finally {
    consumer.close(); // close the consumer, this will also commit the offsets
    log.info("The consumer is now gracefully shut down");
}
Eager Rebalance
All consumers stop and give up their membership of partitions. Then they rejoin the consumer group and get a new partition assignment.
This rebalance is quite random and during a short period of time, the entire consumer group
stops processing.
Also, there's no guarantee that the consumers "get back" the same partitions as they used to.
Cooperative Rebalance (Incremental Rebalance)
Only a small subset of the partitions is reassigned from one consumer to another; the other consumers that don't have reassigned partitions can still process uninterrupted.
This avoids "stop-the-world" events where all consumers stop processing data.
How to use?
KafkaConsumer: We have to modify the partition.assignment.strategy property.
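A sketch of what that could look like (CooperativeStickyAssignor is one of the built-in strategies; the rest of the consumer properties are assumed from the earlier example):

// use the cooperative (incremental) rebalance protocol
properties.setProperty(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        CooperativeStickyAssignor.class.getName());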
By default, when a consumer leaves the group, its partitions are revoked and re-assigned. If it joins back, it will have a new "member ID" and new partitions assigned.
With Static Group Membership (by specifying group.instance.id), upon leaving, the consumer has up to session.timeout.ms to join back and get back its partitions (else they will be re-assigned), without triggering a rebalance.
Offsets are committed when you call .poll() and auto.commit.interval.ms has elapsed (enable.auto.commit=true by default).
Make sure messages are all successfully processed before you call poll() again. If you don't, you won't be in an at-least-once scenario.
In that rare case, you must disable enable.auto.commit, most likely move the processing to a separate thread, and then from time to time call .commitSync() or .commitAsync() with the correct offsets manually (advanced).
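As a rough sketch of the manual-commit variant (not in the original notes; the property and method names are the standard Kafka consumer ones):

// disable auto-commit and commit offsets ourselves once a batch has been processed
properties.setProperty("enable.auto.commit", "false");

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
    for (ConsumerRecord<String, String> record : records) {
        // process the record here (e.g. write it to a database)
    }
    // commit the offsets of the batch we just processed (synchronous, at-least-once)
    consumer.commitSync();
}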
Advanced Producer Configurations
Producer Acknowledgements (acks)
acks=0
When acks=0, producers consider messages as "written successfully" the moment the message was sent, without waiting for the broker to accept it at all.
But if the broker goes offline or an exception happens, we won’t know and will lose data.
It’s useful for data where it’s okay to potentially lose messages, such as metrics collection.
It's the highest-throughput setting because the network overhead is minimized.
acks=1
When acks=1, producers consider messages as “written successfully” when the message was
acknowledged by only the leader.
If the leader broker goes offline unexpectedly but replicas haven’t replicated the data yet, we
have a data loss.
acks=all (acks=-1)
When acks=all, producers consider messages as “written successfully” when the message is
accepted by all in-sync replicas (ISR).
acks=all and min.insync.replicas=2 is the most popular option for data durability and availability, and (with a replication factor of 3) allows you to withstand at most the loss of one Kafka broker.
Producer Retries
Producer Retries
In case of transient failures, developers are expected to handle exceptions otherwise the data
will be lost.
Producer Timeouts
In case of retries (for example retries = 2147483647, the default since Kafka 2.1), retries are bounded by a timeout: delivery.timeout.ms (default 120000 ms = 2 minutes).
In case of retries, there’s a chance that messages will be sent out of order (if a batch
has failed to be sent).
If you rely on key-based ordering, that can be an issue.
For this, you can set the setting which controls how many produce requests can be made in parallel: max.in.flight.requests.per.connection
Default: 5
Set it to 1 if you need to ensure ordering (may impact throughput).
In Kafka >= 0.11, you can define an "idempotent producer" which won't introduce duplicates on network errors.
Since Kafka 3.0, the producer is "safe" by default, and these settings are applied automatically if not manually set:
acks=all (-1)
enable.idempotence=true
With Kafka 2.8 and lower, the producer by default comes with:
acks=1
enable.idempotence=false
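A sketch of setting the "safe producer" values explicitly (mainly useful on Kafka 2.8 and lower; the values simply mirror the defaults listed above):

properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
properties.setProperty(ProducerConfig.ACKS_CONFIG, "all"); // same as -1
properties.setProperty(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));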
Compression can be enabled at the Producer level and doesn’t require any configuration
change in the Brokers or in the Consumers:
compression.type can be none (default), gzip, lz4, snappy, zstd (Kafka 2.1).
Compression is more effective the bigger the batch of messages being sent to Kafka.
Advantages:
Much smaller producer request size (better compression ratio).
Faster to transfer data over the network (less latency).
Better throughput.
Better disk utilisation in Kafka (stored messages are smaller).
Overall:
Consider testing snappy or lz4 for optimal speed/compression ratio (test others too).
Consider tweaking linger.ms and batch.size to have bigger batches, and therefore
more compression and higher throughput.
Use compression in production.
Warning: if you enable broker-side compression, it will consume extra CPU cycles.
linger.ms and batch.size Producer settings
linger.ms & batch.size
By default, Kafka producers try to send records as soon as possible: the producer will have up to max.in.flight.requests.per.connection=5 requests in flight at most, and if more messages have to be sent while others are in flight, Kafka starts batching them before the next request goes out.
This smart batching helps increase throughput while maintaining very low latency. Added
benefit: batches have higher compression ratio so better efficiency.
linger.ms (default 0): how long to wait until we send a batch. Adding a small number
for example 5 ms helps add more messages in the batch at the expense of latency.
batch.size (default 16KB): maximum number of bytes included in a batch; if batches are filled before linger.ms elapses, consider increasing the batch size.
Increasing a batch size to something like 32KB or 64KB can help increasing the compression,
throughput and efficiency of request.
Any message that’s bigger than the batch size will not be batched.
A batch is allocated per partition, so make sure that you don't set it to a number that's too high, otherwise you'll waste memory!
Note: you can monitor the average batch size metric using Kafka Producer Metrics.
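A sketch of a "high throughput" producer configuration along those lines (the exact values are illustrative, not prescriptive):

properties.setProperty(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
properties.setProperty(ProducerConfig.LINGER_MS_CONFIG, "20");
properties.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32 * 1024)); // 32KB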
In the default Kafka partitioner, the keys are hashed using the murmur2 algorithm.
targetPartition = Math.abs(Utils.murmur2(keyBytes)) % (numPartitions - 1)
This means that the same key will go to the same partition, and adding partitions to a topic will
completely alter the formula.
It’s most likely preferred to not override the behavior of the partitioner, but it’s possible to do
so using partitioner.class parameter for our Kafka producers.
The Sticky Partitioner improves the performance of the producer, especially at high throughput when the key is null.
With the older round-robin partitioner (producer ≤ v2.3), a null key sends each message to a different partition in turn, which results in more batches (one batch per partition) and smaller batches (imagine with 100 partitions).
With the Sticky Partitioner (producer ≥ v2.4), the producer "sticks" to one partition until a batch is full or linger.ms has elapsed: larger batches and reduced latency (because with larger requests, batch.size is more likely to be reached).
Over time, records are still spread evenly across all partitions.
max.block.ms & buffer.memory: if the producer produces faster than the broker can take, the records are buffered in memory (buffer.memory, default 32MB).
That buffer will fill up over time and empty back down when the throughput to the broker increases.
If that buffer is full (all 32MB), then the .send() method will start to block (won't return right away).
max.block.ms=60000: the time the .send() will block until throwing an exception.
If you hit an exception, that usually means your brokers are down or overloaded as they can’t
respond to requests.
Advanced Consumer Configuration
Consumer Delivery Semantics
At most once
Offsets are committed as soon as the message batch is received. If the processing goes wrong,
the message will be lost (it won’t be read again).
At least once (usually preferred)
Offsets are committed after the message is processed. If the processing goes wrong, the message will be read again. This can result in duplicate processing of messages, so make sure your processing is idempotent.
Exactly once
Can be achieved for Kafka => Kafka workflows using the Transactional API (easy with Kafka
Streams API). For Kafka => Sink workflows, use an idempotent consumer.
2 strategies:
1. (Easy) enable.auto.commit=true with synchronous processing of batches. Offsets are committed when you call .poll() and auto.commit.interval.ms has elapsed. If you don't use synchronous processing, you will be in "at-most-once" behavior because offsets will be committed before your data is processed.
2. (Medium) enable.auto.commit=false with manual commit of offsets. Example: accumulating records into a buffer, then flushing the buffer to a database and committing the offsets asynchronously afterwards.
(Advanced) It's also possible to store offsets externally instead of in Kafka:
You need to assign partitions to your consumers at launch manually using the .seek() API.
You need to model and store your offsets in a database table, for example.
You need to handle the cases where rebalances happen (ConsumerRebalanceListener interface).
Example: if you need exactly-once processing and can't find any way to do idempotent processing, then you "process data" + "commit offsets" as part of a single transaction.
It's not recommended to use this strategy unless you know exactly what you're doing and why.
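One common trick for idempotent processing (shown here as an illustration, not as part of the original notes) is to derive a stable ID from the record coordinates, so that re-processing the same record overwrites instead of duplicating:

// a Kafka-coordinate based ID: unique per record, stable across re-reads
String id = record.topic() + "_" + record.partition() + "_" + record.offset();
// use `id` as the document/row key in the sink (e.g. an upsert), so duplicates collapse into one write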
Consumer Offset Reset Behavior
A consumer is expected to read from a log continuously.
If Kafka has a retention of 7 days, and your consumer is down for more than 7 days,
the offsets are “invalid”.
Bottom line:
Set a proper data retention period & offset retention period.
Ensure the auto offset reset behavior is the one you expect/want.
Use the replay capability in case of unexpected behavior.
To detect consumers that are “down”, there’s a “heartbeat” mechanism and a “poll”
mechanism.
To avoid issues, consumers are encouraged to process data fast and poll often.
Change these settings only if your consumer maxes out on throughput already.
If consumers read from the leader broker in a different data centre, there is possibly higher latency and also high network charges ($$$).
Example: Data Centre === Availability Zone (AZ) in AWS; you pay for Cross-AZ network charges.
Kafka Consumer Replica Fetching (Kafka v2.4+)
Since Kafka 2.4, it’s possible to configure consumers to read from the closest replica.
This may help improve latency, and also decrease network costs if using the cloud.
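A sketch of the settings involved (based on KIP-392; double-check the exact names against your Kafka version):

// broker side (server.properties):
//   broker.rack=az1
//   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
// consumer side: tell Kafka which "rack" (e.g. AWS AZ) the consumer lives in
properties.setProperty(ConsumerConfig.CLIENT_RACK_CONFIG, "az1");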
Kafka Connect solves External Source => Kafka and Kafka => External Sink
Kafka Streams solves transformations Kafka => Kafka
Schema Registry helps using Schema in Kafka
Also, programmers often want to store data in the same sinks (e.g. S3, ElasticSearch, JDBC, Cassandra, MongoDB, etc.).
Count the number of times a change was created by a bot versus a human
Analyze number of changes per website (ru.wikipedia.org vs en.wikipedia.org)
Number of edits per 10-seconds as a time series
With the Kafka Producer and Consumer you can achieve that, but it's very low-level and not developer friendly.
What if the producer sends bad data? Or a field gets renamed? Or the data format changes
from one day to another? This can lead to breaking the consumers.
We need data to be self-describable, and we need to be able to evolve data without breaking
downstream consumers. We need schemas and a schema registry.
What if the Kafka Brokers were verifying the messages they receive? It would break what makes Kafka so good:
Kafka doesn’t parse or even read your data (no CPU usage).
Kafka takes bytes as an input without even loading them into memory (that’s called
zero copy)
Kafka distributes bytes.
As far as Kafka is concerned, it doesn’t even know if your data is an integer, a string,
etc.
The Schema Registry must be a separate component and Producers and Consumers need to be
able to talk to it.
1. Source
2. Producer -> Schema Registry (Send Schema if the Schema is not yet inserted)
3. Schema Registry -> Kafka (Schema Registry is going to validate the Schema itself with
Kafka)
4. Producer -> Kafka (Sends Avro data to Kafka)
5. Kafka -> Schema Registry (Validate the Schema)
6. Kafka -> Consumer (Sends Avro data to Consumer)
7. Schema Registry -> Consumer (Retrieve Schema to deserialize the data into objects)
8. Consumer -> Target
To use a Schema Registry, you need to:
Set it up well.
Make sure it's highly available.
Partially change the producer and consumer code (see the sketch below).
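As an example of that "partial change" on the producer side, here is a hedged sketch using Confluent's Avro serializer; the io.confluent dependency and the schema.registry.url value are assumptions about the setup, not part of the original notes:

properties.setProperty("key.serializer", StringSerializer.class.getName());
// serialize values as Avro and register/fetch schemas from the Schema Registry
properties.setProperty("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
properties.setProperty("schema.registry.url", "http://localhost:8081");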