Kafka For Beginners
Dilip Sundarraj
About Me
• Dilip Sundarraj
• Producer Properties
[Diagram: ProducerRecord → kafkaproducer.send(producerRecord) → File System]
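A minimal sketch of the producer setup behind this diagram; the broker address and record values are assumptions, and the topic is the test-topic used later in the course:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleMessageProducer {
    public static void main(String[] args) {
        // Producer Properties
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        KafkaProducer<String, String> kafkaProducer = new KafkaProducer<>(props);
        ProducerRecord<String, String> producerRecord =
                new ProducerRecord<>("test-topic", "key-1", "hello kafka");

        kafkaProducer.send(producerRecord); // hands the record to the producer for publishing
        kafkaProducer.close();              // flushes any buffered records before exiting
    }
}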
KafkaProducer.send()
Synchronous
• The send() call waits until the message is published and persisted to the file system and replicas
Asynchronous
• The send() call does not wait for the message to be published and persisted to the file system and replicas
[Diagram: kafkaproducer.send() with a Callback → File System]
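A sketch of the two send styles, assuming a kafkaProducer and producerRecord like the ones above:

import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

class SendStyles {

    // Synchronous: block on the returned Future until the record is acknowledged
    static void sendSync(KafkaProducer<String, String> kafkaProducer,
                         ProducerRecord<String, String> producerRecord)
            throws InterruptedException, ExecutionException {
        RecordMetadata metadata = kafkaProducer.send(producerRecord).get();
        System.out.printf("Sync send: partition %d, offset %d%n",
                metadata.partition(), metadata.offset());
    }

    // Asynchronous: return immediately; the Callback fires once the broker responds
    static void sendAsync(KafkaProducer<String, String> kafkaProducer,
                          ProducerRecord<String, String> producerRecord) {
        Callback callback = (recordMetadata, exception) -> {
            if (exception != null) {
                exception.printStackTrace();
            } else {
                System.out.printf("Async send: partition %d, offset %d%n",
                        recordMetadata.partition(), recordMetadata.offset());
            }
        };
        kafkaProducer.send(producerRecord, callback);
    }
}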
Logging using Logback
Why Logger?
• We have used System.out.println() until now
• SysOut does not provide much visibility into what’s happening behind the scenes
• Debugging
• Exception Logging
Logback
• Logback is the successor of log4j
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder><pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern></encoder>
  </appender>
  <root level="info">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>
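Assuming Logback and the SLF4J API are on the classpath, a class-level logger replaces SysOut like this (the class name is just illustrative):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MessageProducerApp {
    private static final Logger logger = LoggerFactory.getLogger(MessageProducerApp.class);

    public static void main(String[] args) {
        // {} placeholders keep the log statement cheap when the level is disabled
        logger.info("Publishing to topic {}", "test-topic");
        logger.error("Exception logging example", new RuntimeException("simulated failure"));
    }
}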
How Data Flows into Kafka?
How Data Flows into Kafka?
[Diagram: Data Sources (REST API, Command Line to Publish New Records) → Command-line Producer → test-topic]
Producer API (Behind the Scenes)
[Diagram: KafkaProducer.send() → Producer (acks = all) → Kafka Cluster with replication-factor = 3 and min.insync.replicas = 2]
• No Data Loss
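A sketch of the producer-side piece of this setup; acks is a producer property, while replication-factor and min.insync.replicas are applied when the topic is created, so they appear here only as comments:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

class NoDataLossProducerConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // leader waits for the full set of in-sync replicas
        // Topic-level settings (set when the topic is created, not on the producer):
        //   replication-factor = 3
        //   min.insync.replicas = 2
        return props;
    }
}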
Consuming Messages using Consumer API
Consumer API
• KafkaConsumer
• Consumer Properties:
KafkaConsumer.poll(100)
• auto.offset.reset = earliest
• auto.offset.reset = latest
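A minimal consumer sketch tying these properties together; the broker address is an assumption, the group.id and topic come from later slides, and Duration.ofMillis(100) is the current form of poll(100):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleMessageConsumer {
    public static void main(String[] args) {
        // Consumer Properties
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "messageconsumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");        // or "latest"

        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);
        kafkaConsumer.subscribe(List.of("test-topic"));

        while (true) {
            // Wait up to 100 ms for new records
            ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofMillis(100));
            records.forEach(record ->
                    System.out.printf("key=%s value=%s partition=%d%n",
                            record.key(), record.value(), record.partition()));
        }
    }
}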
Kafka Consumer Configurations
Consumer Configurations
• auto.offset.reset - Property used to instruct the Kafka consumer to read either from the beginning offset or
the latest offset of the topic with the given group.id when the consumer connects to the Kafka topic for the
very first time
• auto.offset.reset = earliest
• auto.offset.reset = latest
Consumer Configurations
• max.poll.interval.ms - The maximum delay between poll calls from the
consumer
Consumer Groups
Consumer Groups
• Consumer groups are the only way to scale message consumption
[Diagram: test-topic-replicated with Partitions P0, P1, P2 consumed by a single poll loop]
Consumer Groups
[Diagram: test-topic-replicated (Partitions P0, P1, P2) consumed by MessageConsumer instances sharing group.id=messageconsumer, with the partitions split across the instances]
[Diagram: a new MessageConsumer instance joins group.id=messageconsumer and the Group Coordinator triggers a Rebalance, redistributing partitions P0, P1, P2]
max.poll.interval.ms
max.poll.interval.ms
• max.poll.interval.ms
• The maximum delay between the poll() invocations from the consumer
when using the consumer groups
while (true) {
    ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(timeOutDuration);
    consumerRecords.forEach((record) -> {
        logger.info("Consumer Record Key is {} and the value is {} and the partition is {}",
                record.key(), record.value(), record.partition());
    });
}
• If the gap between two subsequent poll() invocations exceeds max.poll.interval.ms (5 minutes by default), then the Group Coordinator triggers a Rebalance
Committing
Consumer Offsets
Consumer Offsets
• What is an Offset?
• An offset is the position of a record within a topic partition; the consumer commits the offset of the last record it has read
• In the event of a consumer crash, the consumer knows what the last read message was and picks up from where it left off once it is back up
Options for Committing Offsets
• Options for committing consumer offsets
• Option 1 - Auto Committing Offsets (Default)
• Committing offsets is automatically taken care for you by the consumer
• No code needed
• Option 2 - Manually Committing Offsets
• Committing offsets explicitly from the code.
• Two approaches to commit offsets
• Commit offsets Synchronously
• Commit Offsets Asynchronously
Option 1 - Auto Committing Offsets
• This is the default option
• enable.auto.commit = true
• auto.commit.interval.ms = 5000
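These map onto the consumer properties like this (a sketch; both values shown are the defaults, reusing the props object from the consumer example above):

props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");      // auto commit is on by default
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "5000"); // commit the polled offsets every 5 seconds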
Option 1 - Auto Committing Offsets
• Does this option work for all scenarios?
• No
• commitSync()
  • Application is blocked until the response is received from Kafka
  • Any failure will be retried
• commitAsync()
  • Application is not blocked because the commit invocation from the code is asynchronous
  • Any failure will not be retried
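A sketch of both manual commit styles, assuming enable.auto.commit has been set to false and kafkaConsumer and logger are the ones from the earlier examples:

// Synchronous commit: blocks the poll loop and retries until the commit succeeds or fails fatally
kafkaConsumer.commitSync();

// Asynchronous commit: returns immediately; failures are reported to the callback but not retried
kafkaConsumer.commitAsync((offsets, exception) -> {
    if (exception != null) {
        logger.error("Commit failed for offsets {}", offsets, exception);
    }
});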
Rebalance Listeners
Rebalance Listeners
• This concept is related to Kafka consumer rebalances
• Committing Offsets
• Closing DB Connections
• Seek to a specific offset, rather than just reading from the beginning or the latest offset
Coding Rebalance Listeners
• ConsumerRebalanceListener (Interface)
void onPartitionsRevoked(Collection<TopicPartition> partitions);   // Clean Up Tasks
void onPartitionsAssigned(Collection<TopicPartition> partitions);  // Initialization Tasks
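A sketch of a listener implementation; the class name is illustrative and the clean-up and initialization bodies are assumptions:

import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

class MessageRebalanceListener implements ConsumerRebalanceListener {
    private final KafkaConsumer<String, String> kafkaConsumer;

    MessageRebalanceListener(KafkaConsumer<String, String> kafkaConsumer) {
        this.kafkaConsumer = kafkaConsumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Clean Up Tasks: commit what has been processed before the partitions move away
        kafkaConsumer.commitSync();
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Initialization Tasks: e.g. log the new assignment or seek to a stored offset
        System.out.println("Assigned partitions: " + partitions);
    }
}

The listener is registered while subscribing, e.g. kafkaConsumer.subscribe(List.of("test-topic-replicated"), new MessageRebalanceListener(kafkaConsumer)).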
Is this Mandatory for a Kafka Consumer?
• No
• seekToBeginning()
• seekToEnd()
• The consumer always seeks to read records from the latest offset of the topic
auto.offset.reset = earliest
• seekToBeginning()
• Use-Case to read records from the beginning of the topic all the time
auto.offset.reset = earliest
• seekToEnd()
• Use-Case to read only the new records every time after the
consumer is brought up
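Both calls are usually made from onPartitionsAssigned in the rebalance listener above, because the consumer only knows its partitions after the assignment; a sketch:

@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
    // Re-read the topic from the beginning every time, regardless of committed offsets
    kafkaConsumer.seekToBeginning(partitions);

    // ...or read only records that arrive after this consumer comes up:
    // kafkaConsumer.seekToEnd(partitions);
}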
Seek to a Specific Offset
seek()
• The KafkaConsumer class has a seek() method that we can use to seek to a specific offset in the topic
while (true) {
    ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(timeOutDuration);
    consumerRecords.forEach((record) -> {
        logger.info("Consumer Record Key is {} and the value is {} and the partition is {}",
                record.key(), record.value(), record.partition());
        // Invoke Some API
        // Persist the Record in DB
    });
    if (consumerRecords.count() > 0) {
        kafkaConsumer.commitSync(); // commits the offset of the last record returned by the poll
        logger.info("Offset Committed!");
    }
}
Duplicate processing of the record is possible if the consumer crashes after the records are processed but before commitSync() commits their offsets.
Consumer Rebalance
How to avoid this?
Poll Loop
With this approach we need to use the seek() method to seek to a specific offset on the consumer side
The consumer reads the offset from the external system (DB) and then seeks to the point where it left off
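A sketch of that approach; readOffsetFromDb is a hypothetical helper that returns the next offset to read for a partition:

@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
    for (TopicPartition partition : partitions) {
        long offsetToReadFrom = readOffsetFromDb(partition); // hypothetical DB lookup
        kafkaConsumer.seek(partition, offsetToReadFrom);     // resume exactly where processing left off
    }
}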
[Diagram: Producer → Topic → Consumer exchanging String records; Record Processed Successfully]
• Banking
[Diagram: Item object serialized into Bytes before being published to the Topic]
Serialize/DeSerialize Custom Objects
• Option 1 - Build Custom Serializer/Deserializer
• StringSerializer/Deserializer
• IntegerSerializer/Deserializer
Build Custom Kafka Serializer
Build Custom Kafka Serializer
• Item Domain Class
• ItemSerializer
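A sketch of an ItemSerializer that turns the Item domain object into JSON bytes with Jackson; the course may implement it differently:

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

public class ItemSerializer implements Serializer<Item> {
    private final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, Item item) {
        if (item == null) {
            return null;
        }
        try {
            // Convert the Item domain object into JSON bytes before publishing
            return objectMapper.writeValueAsBytes(item);
        } catch (JsonProcessingException e) {
            throw new SerializationException("Error serializing Item", e);
        }
    }
}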
Build Custom Kafka DeSerializer
Build Custom Kafka DeSerializer
• Item Domain Class
• ItemDeSerializer
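And the matching ItemDeSerializer sketch, rebuilding the Item from the JSON bytes read off the topic:

import java.io.IOException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;

public class ItemDeSerializer implements Deserializer<Item> {
    private final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public Item deserialize(String topic, byte[] data) {
        if (data == null) {
            return null;
        }
        try {
            // Rebuild the Item domain object from the consumed bytes
            return objectMapper.readValue(data, Item.class);
        } catch (IOException e) {
            throw new SerializationException("Error deserializing Item", e);
        }
    }
}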