Kafka For Beginners
Dilip Sundarraj
About Me
• Dilip Sundarraj
• Producer Properties
[Diagram: ProducerRecord → kafkaproducer.send(producerRecord) → File System]
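A minimal sketch of the producer setup behind this diagram; the broker address and record values are assumptions, and the topic is the test-topic used later in the course:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleMessageProducer {
    public static void main(String[] args) {
        // Producer Properties
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        KafkaProducer<String, String> kafkaProducer = new KafkaProducer<>(props);
        ProducerRecord<String, String> producerRecord =
                new ProducerRecord<>("test-topic", "key-1", "hello kafka");

        kafkaProducer.send(producerRecord); // hands the record to the producer for publishing
        kafkaProducer.close();              // flushes any buffered records before exiting
    }
}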
KafkaProducer.send()
Synchronous
• The send() call waits until the message is published and persisted to the file system and replicas
Asynchronous
• The send() call does not wait for the message to be published and persisted to the file system and replicas
[Diagram: kafkaproducer.send() with a Callback → File System]
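A sketch of the two send styles, assuming a kafkaProducer and producerRecord like the ones above:

import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

class SendStyles {

    // Synchronous: block on the returned Future until the record is acknowledged
    static void sendSync(KafkaProducer<String, String> kafkaProducer,
                         ProducerRecord<String, String> producerRecord)
            throws InterruptedException, ExecutionException {
        RecordMetadata metadata = kafkaProducer.send(producerRecord).get();
        System.out.printf("Sync send: partition %d, offset %d%n",
                metadata.partition(), metadata.offset());
    }

    // Asynchronous: return immediately; the Callback fires once the broker responds
    static void sendAsync(KafkaProducer<String, String> kafkaProducer,
                          ProducerRecord<String, String> producerRecord) {
        Callback callback = (recordMetadata, exception) -> {
            if (exception != null) {
                exception.printStackTrace();
            } else {
                System.out.printf("Async send: partition %d, offset %d%n",
                        recordMetadata.partition(), recordMetadata.offset());
            }
        };
        kafkaProducer.send(producerRecord, callback);
    }
}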
Logging using Logback
Why Logger?
• We have used System.out.println() until now
• SysOut does not provide much visibility into what’s happening behind the scenes
• Debugging
• Exception Logging
Logback
• Logback is the successor of log4j
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder><pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern></encoder>
  </appender>
  <root level="info">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>
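Assuming Logback and the SLF4J API are on the classpath, a class-level logger replaces SysOut like this (the class name is just illustrative):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MessageProducerApp {
    private static final Logger logger = LoggerFactory.getLogger(MessageProducerApp.class);

    public static void main(String[] args) {
        // {} placeholders keep the log statement cheap when the level is disabled
        logger.info("Publishing to topic {}", "test-topic");
        logger.error("Exception logging example", new RuntimeException("simulated failure"));
    }
}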
How Data Flows into Kafka?
How Data Flows into Kafka?
[Diagram: Data Sources (REST API, Command Line to Publish New Records) → Command-line Producer → test-topic]
Producer API (Behind the Scenes)
[Diagram: KafkaProducer.send() → Producer (acks = all) → Kafka Cluster with replication-factor = 3 and min.insync.replicas = 2]
• No Data Loss
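A sketch of the producer-side piece of this setup; acks is a producer property, while replication-factor and min.insync.replicas are applied when the topic is created, so they appear here only as comments:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

class NoDataLossProducerConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // leader waits for the full set of in-sync replicas
        // Topic-level settings (set when the topic is created, not on the producer):
        //   replication-factor = 3
        //   min.insync.replicas = 2
        return props;
    }
}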
Consuming Messages using Consumer API
Consumer API
• KafkaConsumer
• Consumer Properties:
KafkaConsumer.poll(100)
• auto.offset.reset = earliest
• auto.offset.reset = latest
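A minimal consumer sketch tying these properties together; the broker address is an assumption, the group.id and topic come from later slides, and Duration.ofMillis(100) is the current form of poll(100):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleMessageConsumer {
    public static void main(String[] args) {
        // Consumer Properties
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "messageconsumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");        // or "latest"

        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);
        kafkaConsumer.subscribe(List.of("test-topic"));

        while (true) {
            // Wait up to 100 ms for new records
            ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofMillis(100));
            records.forEach(record ->
                    System.out.printf("key=%s value=%s partition=%d%n",
                            record.key(), record.value(), record.partition()));
        }
    }
}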
Kafka Consumer Configurations
Consumer Configurations
• auto.offset.reset - Property used to instruct the Kafka consumer to read either from the beginning offset or
the latest offset of the topic with the given group.id when the consumer connects to the Kafka topic for the
very first time
• auto.offset.reset = earliest
• auto.offset.reset = latest
Consumer Configurations
• max.poll.interval.ms - The maximum delay between poll calls from the
consumer
Consumer Groups
Consumer Groups
• Consumer groups are the only way to scale message consumption
[Diagram: test-topic-replicated with Partitions P0, P1, P2 consumed by a single poll loop]
Consumer Groups
[Diagram: test-topic-replicated (Partitions P0, P1, P2) consumed by MessageConsumer instances sharing group.id=messageconsumer, with the partitions split across the instances]
[Diagram: a new MessageConsumer instance joins group.id=messageconsumer and the Group Coordinator triggers a Rebalance, redistributing partitions P0, P1, P2]
max.poll.interval.ms
max.poll.interval.ms
• max.poll.interval.ms
• The maximum delay between the poll() invocations from the consumer
when using the consumer groups
while (true) {
    ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(timeOutDuration);
    consumerRecords.forEach((record) -> {
        logger.info("Consumer Record Key is {} and the value is {} and the partition is {}",
                record.key(), record.value(), record.partition());
    });
}
• If the gap between two subsequent poll() invocations exceeds max.poll.interval.ms (5 minutes by default), then the Group Coordinator triggers a Rebalance
Committing
Consumer Offsets
Consumer Offsets
• What is an Offset?
• An offset is the position of a record within a topic partition; the consumer commits the offset of the last record it has read
• In the event of a consumer crash, the consumer knows what the last read message was and picks up from where it left off once it is back up
Options for Committing Offsets
• Options for committing consumer offsets
• Option 1 - Auto Committing Offsets (Default)
• Committing offsets is automatically taken care for you by the consumer
• No code needed
• Option 2 - Manually Committing Offsets
• Committing offsets explicitly from the code.
• Two approaches to commit offsets
• Commit offsets Synchronously
• Commit Offsets Asynchronously
Option 1 - Auto Committing Offsets
• This is the default option
• enable.auto.commit = true
• auto.commit.interval.ms = 5000
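These map onto the consumer properties like this (a sketch; both values shown are the defaults, reusing the props object from the consumer example above):

props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");      // auto commit is on by default
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "5000"); // commit the polled offsets every 5 seconds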
Option 1 - Auto Committing Offsets
• Does this option work for all scenarios?
• No
• commitSync()
  • Application is blocked until the response is received from Kafka
  • Any failure will be retried
• commitAsync()
  • Application is not blocked because the commit invocation from the code is asynchronous
  • Any failure will not be retried
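A sketch of both manual commit styles, assuming enable.auto.commit has been set to false and kafkaConsumer and logger are the ones from the earlier examples:

// Synchronous commit: blocks the poll loop and retries until the commit succeeds or fails fatally
kafkaConsumer.commitSync();

// Asynchronous commit: returns immediately; failures are reported to the callback but not retried
kafkaConsumer.commitAsync((offsets, exception) -> {
    if (exception != null) {
        logger.error("Commit failed for offsets {}", offsets, exception);
    }
});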
Rebalance Listeners
Rebalance Listeners
• This concept is related to Kafka consumer rebalances
• Committing Offsets
• Closing DB Connections
• Seek to a specific offset, rather than just reading from the beginning or the latest offset
Coding Rebalance Listeners
• ConsumerRebalanceListener (Interface)
void onPartitionsRevoked(Collection<TopicPartition> partitions);   // Clean Up Tasks
void onPartitionsAssigned(Collection<TopicPartition> partitions);  // Initialization Tasks
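A sketch of a listener implementation; the class name is illustrative and the clean-up and initialization bodies are assumptions:

import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

class MessageRebalanceListener implements ConsumerRebalanceListener {
    private final KafkaConsumer<String, String> kafkaConsumer;

    MessageRebalanceListener(KafkaConsumer<String, String> kafkaConsumer) {
        this.kafkaConsumer = kafkaConsumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Clean Up Tasks: commit what has been processed before the partitions move away
        kafkaConsumer.commitSync();
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Initialization Tasks: e.g. log the new assignment or seek to a stored offset
        System.out.println("Assigned partitions: " + partitions);
    }
}

The listener is registered while subscribing, e.g. kafkaConsumer.subscribe(List.of("test-topic-replicated"), new MessageRebalanceListener(kafkaConsumer)).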
Is this Mandatory for a Kafka Consumer?
• No
• seekToBeginning()
• seekToEnd()
• The consumer always seeks to read records from the latest offset of the topic
auto.offset.reset = earliest
• seekToBeginning()
• Use-Case to read records from the beginning of the topic all the time
auto.offset.reset = earliest
• seekToEnd()
• Use-Case to read only the new records every time after the
consumer is brought up
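Both calls are usually made from onPartitionsAssigned in the rebalance listener above, because the consumer only knows its partitions after the assignment; a sketch:

@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
    // Re-read the topic from the beginning every time, regardless of committed offsets
    kafkaConsumer.seekToBeginning(partitions);

    // ...or read only records that arrive after this consumer comes up:
    // kafkaConsumer.seekToEnd(partitions);
}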
Seek to a Specific Offset
seek()
• The KafkaConsumer class has a seek() method that we can use to seek to a specific offset in the topic
while (true) {
    ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(timeOutDuration);
    consumerRecords.forEach((record) -> {
        logger.info("Consumer Record Key is {} and the value is {} and the partition is {}",
                record.key(), record.value(), record.partition());
        // Invoke Some API
        // Persist the Record in DB
    });
    if (consumerRecords.count() > 0) {
        kafkaConsumer.commitSync(); // commits the offset of the last record returned by the poll
        logger.info("Offset Committed!");
    }
}
Duplicate processing of the record is possible if the consumer crashes after the records are processed but before commitSync() commits their offsets.
Consumer Rebalance
How to avoid this?
Poll Loop
With this approach we need to use the seek() method to seek to a specific offset on the consumer side
The consumer reads the offset from the external system (DB) and then seeks to the point where it left off
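A sketch of that approach; readOffsetFromDb is a hypothetical helper that returns the next offset to read for a partition:

@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
    for (TopicPartition partition : partitions) {
        long offsetToReadFrom = readOffsetFromDb(partition); // hypothetical DB lookup
        kafkaConsumer.seek(partition, offsetToReadFrom);     // resume exactly where processing left off
    }
}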
[Diagram: Producer → Topic → Consumer exchanging String records; Record Processed Successfully]
• Banking
[Diagram: Item object serialized into Bytes before being published to the Topic]
Serialize/DeSerialize Custom Objects
• Option 1 - Build Custom Serializer/Deserializer
• StringSerializer/Deserializer
• IntegerSerializer/Deserializer
Build Custom Kafka Serializer
Build Custom Kafka Serializer
• Item Domain Class
• ItemSerializer
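A sketch of an ItemSerializer that turns the Item domain object into JSON bytes with Jackson; the course may implement it differently:

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

public class ItemSerializer implements Serializer<Item> {
    private final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, Item item) {
        if (item == null) {
            return null;
        }
        try {
            // Convert the Item domain object into JSON bytes before publishing
            return objectMapper.writeValueAsBytes(item);
        } catch (JsonProcessingException e) {
            throw new SerializationException("Error serializing Item", e);
        }
    }
}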
Build Custom Kafka DeSerializer
Build Custom Kafka DeSerializer
• Item Domain Class
• ItemDeSerializer
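And the matching ItemDeSerializer sketch, rebuilding the Item from the JSON bytes read off the topic:

import java.io.IOException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;

public class ItemDeSerializer implements Deserializer<Item> {
    private final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public Item deserialize(String topic, byte[] data) {
        if (data == null) {
            return null;
        }
        try {
            // Rebuild the Item domain object from the consumed bytes
            return objectMapper.readValue(data, Item.class);
        } catch (IOException e) {
            throw new SerializationException("Error deserializing Item", e);
        }
    }
}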