Data Engineering 101 - Kafka: Core Concepts
Kafka Broker
A Kafka broker is a server that
receives messages from producers
and stores them on disk.
Example: In a Kafka cluster,
multiple brokers work together to
ensure data is reliably stored and
served.
Shwetank Singh
GritSetGrow - GSGLearn.com
Topics
Topics are logical channels to
which messages are sent by
producers and from which
messages are read by
consumers. A topic is divided into
multiple partitions to allow
parallel processing.
Example: A "user_activity" topic
might be divided into several
partitions to handle high
message volume.
Partitions
Partitions are subdivisions of
topics. Each partition is an
ordered, immutable sequence of
messages that is continually
appended to. Partitions enable
Kafka to scale horizontally while
preserving message order within
each partition.
Example: Partition 0 of the
"user_activity" topic stores
messages for a specific subset of
users.
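As a rough illustration of how keyed messages might land in partitions, the sketch below hashes a user ID to pick a partition. It is a plain-Python simulation, not a Kafka client: the function name choose_partition, the three-partition layout, and the CRC32 hash are assumptions made for the example (Kafka's Java client uses a murmur2 hash for keyed records).

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index.

    CRC32 is an illustrative stand-in; it is not the hash Kafka uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical "user_activity" topic with three partitions,
# each modeled as an append-only list.
NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [("alice", "login"), ("bob", "click"), ("alice", "logout")]
for user, action in events:
    partitions[choose_partition(user, NUM_PARTITIONS)].append((user, action))

# All of alice's events sit in one partition, in the order they were sent.
alice_partition = partitions[choose_partition("alice", NUM_PARTITIONS)]
print([action for user, action in alice_partition if user == "alice"])
```

Because the same key always hashes to the same partition, ordering is kept per user even though the topic as a whole is spread over several partitions.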
Producers
Example: A web application that
logs user activity sends these logs
to a Kafka topic as messages.
Consumers
Consumers are clients that read
messages from Kafka topics.
Consumers can operate
individually or as part of a
consumer group, which allows for
parallel processing of messages.
Example: An analytics service
reads user activity logs from a
Kafka topic to generate reports.
Consumer Groups
Consumer groups allow multiple
consumers to collaborate on
processing messages from a
topic. Each partition in a topic is
assigned to only one consumer
within a group at a time, ensuring
parallel processing and load
balancing.
Example: Three consumers in a
group process messages from six
partitions.
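The three-consumers, six-partitions example can be sketched as a round-robin assignment. This is a simplified simulation: assign_partitions and the consumer names are hypothetical, and real Kafka clients can choose among several assignment strategies.

```python
def assign_partitions(consumers, partitions):
    """Give each partition to exactly one consumer, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = ["consumer-a", "consumer-b", "consumer-c"]
assignment = assign_partitions(group, range(6))
print(assignment)
# {'consumer-a': [0, 3], 'consumer-b': [1, 4], 'consumer-c': [2, 5]}
```

If a consumer leaves the group, calling assign_partitions again with the remaining members spreads its partitions over the others, which is roughly what a rebalance achieves.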
Offsets
Example: A consumer reads
messages up to offset 105 and
resumes from offset 106 after a
restart.
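The restart example can be sketched with an in-memory log, assuming the committed offset is the offset of the next message to read. SimpleConsumer is a hypothetical class for illustration, not part of any Kafka client library.

```python
class SimpleConsumer:
    """Reads from an in-memory log and tracks its position by offset."""

    def __init__(self, log, start_offset=0):
        self.log = log
        self.position = start_offset  # offset of the next message to read

    def poll(self, max_records):
        records = self.log[self.position:self.position + max_records]
        self.position += len(records)
        return records


# Hypothetical partition log: the list index plays the role of the offset.
log = [f"event-{offset}" for offset in range(200)]

consumer = SimpleConsumer(log)
consumer.poll(106)             # reads offsets 0 through 105
committed = consumer.position  # 106, the next offset to read

# After a restart, a new consumer picks up from the committed offset.
restarted = SimpleConsumer(log, start_offset=committed)
print(restarted.poll(1))       # ['event-106']
```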
Kafka Cluster
brokers can continue operating if
one broker fails.
Replication
Kafka replicates partitions across
multiple brokers to ensure fault
tolerance. Each partition has a
leader and one or more followers.
By default, the leader handles all
reads and writes, while followers
replicate the data.
Example: Partition 0 has one
leader and two followers across
three brokers.
ZooKeeper
Example: ZooKeeper ensures a
new leader is elected if the
current leader broker fails.
Producers and ACKs
Producers send messages to
brokers and can configure
acknowledgment settings (ACKs)
to ensure reliable message
delivery.
Example: A producer configures
ACKs to wait for confirmation
from all replicas before
considering a message sent.
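To make the acknowledgment settings concrete, the sketch below models the three values of the producer's acks setting ("0", "1", and "all"). The function write_is_acknowledged and its boolean inputs are hypothetical stand-ins for what happens on the broker side; this is not client code.

```python
def write_is_acknowledged(acks, leader_wrote, in_sync_followers_wrote):
    """Decide whether a producer would treat a send as successful.

    acks="0":   do not wait for any broker response.
    acks="1":   wait for the partition leader only.
    acks="all": wait for the leader and every in-sync follower.
    """
    if acks == "0":
        return True
    if acks == "1":
        return leader_wrote
    if acks == "all":
        return leader_wrote and all(in_sync_followers_wrote)
    raise ValueError(f"unsupported acks value: {acks!r}")


# One follower has not replicated the message yet.
print(write_is_acknowledged("1", True, [True, False]))    # True
print(write_is_acknowledged("all", True, [True, False]))  # False
```

Waiting for all in-sync replicas trades some latency for a lower risk of losing an acknowledged message when the leader fails.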
Retention Policy
retain messages for 7 days, after
which they are deleted.
Log Compaction
retains only the latest update for
each user profile.
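A minimal sketch of the idea, assuming each message is a (key, value) pair and that compaction keeps only the newest value per key. The function compact and the sample records are illustrative; real compaction runs in the background on log segments and does not touch the active segment.

```python
def compact(log):
    """Keep only the latest value for each key, preserving offset order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors]


profile_updates = [
    ("user-1", {"city": "Pune"}),
    ("user-2", {"city": "Delhi"}),
    ("user-1", {"city": "Mumbai"}),
]
print(compact(profile_updates))
# [('user-2', {'city': 'Delhi'}), ('user-1', {'city': 'Mumbai'})]
```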
Kafka Connect
sync data between a MySQL
database and a Kafka topic.
Kafka Streams
Example: An application using
Kafka Streams aggregates
clickstream data to generate
real-time metrics.
MirrorMaker
replicate messages from a
primary datacenter to a backup
datacenter.
Kafka API
to send messages from a Java
application to a Kafka topic.
Security
encrypt data in transit and SASL
for client authentication.
AdminClient API
create a new topic and configure
its retention policy.
Monitoring and Metrics
Kafka provides metrics for
monitoring cluster health and
performance. Tools like
Prometheus and Grafana can be
used to visualize these metrics.
Example: Monitoring consumer
lag and broker health using
Prometheus and Grafana
dashboards.
Message Delivery Semantics
Kafka supports three types of
message delivery semantics: at
most once, at least once, and
exactly once.
for exactly-once delivery to
ensure no message is lost or
duplicated.
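The difference between at-most-once and at-least-once often comes down to whether a consumer commits its offset before or after processing a message. The simulation below assumes a single crash at a chosen message followed by a restart; run_with_crash is a hypothetical helper, not a Kafka API.

```python
def run_with_crash(messages, commit_before_processing, crash_index):
    """Process messages, crash once at crash_index, then restart.

    Returns the list of messages that were actually processed.
    """
    processed = []
    committed = 0  # offset of the next message to read after a restart

    # First run: stops abruptly while handling messages[crash_index].
    for i in range(len(messages)):
        if commit_before_processing:
            committed = i + 1
            if i == crash_index:
                break  # crashed after commit, before processing
            processed.append(messages[i])
        else:
            processed.append(messages[i])
            if i == crash_index:
                break  # crashed after processing, before commit
            committed = i + 1

    # Second run: resume from the last committed offset.
    for i in range(committed, len(messages)):
        processed.append(messages[i])
    return processed


print(run_with_crash(["a", "b", "c"], True, 1))   # ['a', 'c']
print(run_with_crash(["a", "b", "c"], False, 1))  # ['a', 'b', 'b', 'c']
```

Committing first loses "b" (at most once), while committing afterwards processes "b" twice (at least once). Exactly-once would yield ['a', 'b', 'c'] in both cases, which needs more than commit ordering alone.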
Stateful Processing
Kafka Streams supports stateful
processing, allowing applications
to maintain state across
messages using state stores.
application that maintains a
running count of events over a
window of time.
Windowed Operations
Kafka Streams provides support
for windowed operations,
enabling time-based
aggregations and
transformations.
Example: Calculating the average
number of user clicks per minute
using windowed operations.
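The clicks-per-minute example can be sketched in plain Python rather than the Kafka Streams Java API. The code assumes fixed, non-overlapping one-minute windows keyed by their start time in milliseconds; clicks_per_window and the sample timestamps are hypothetical.

```python
from collections import Counter


def clicks_per_window(timestamps_ms, window_ms=60_000):
    """Count events in fixed, non-overlapping (tumbling) windows."""
    counts = Counter()
    for ts in timestamps_ms:
        window_start = ts - (ts % window_ms)
        counts[window_start] += 1
    return dict(sorted(counts.items()))


click_times = [1_000, 59_999, 60_000, 125_000]  # milliseconds
per_minute = clicks_per_window(click_times)
average = sum(per_minute.values()) / len(per_minute)
print(per_minute)  # {0: 2, 60000: 1, 120000: 1}
```

Note that the average here only covers windows that received at least one click; minutes with no clicks do not appear in the result.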
KSQL
aggregate, and transform
streams of data in real time.
Kafka Ecosystem
Example: Integrating Kafka with a
relational database using Kafka
Connect and processing the data
with Kafka Streams.
Publish/Subscribe Messaging
Pub/Sub systems allow
decoupling of message
producers and consumers. Kafka
acts as a broker facilitating this.
user activity logs which can be
consumed by analytics services.
Messages and Batches
Messages are the basic unit of
data in Kafka, stored as byte
arrays. Messages are written in
batches for efficiency.
sent from an application.
Schemas
Example: Avro schema for user
profile data.
Topics and Partitions
Topics are categories to which
messages are published. Topics
are divided into partitions for
scalability, and partitions are
replicated for redundancy.
with partitions.
Producers and Consumers
Producers create and send
messages to Kafka topics.
Consumers read messages from
topics.
Example: A microservice
producing order data and
another consuming it for
processing.
Brokers and Clusters
A broker is a Kafka server that
stores data and serves clients.
Multiple brokers form a Kafka
cluster, providing fault tolerance
and scalability.
Example: A Kafka cluster with
three brokers.
Disk-Based Retention
Kafka retains messages on disk
for a configured period, allowing
consumers to read at their own pace.
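Time-based retention can be sketched as a filter over timestamped messages, assuming a seven-day period. The function apply_retention and the sample log are hypothetical, and the model is simplified: Kafka removes whole log segments once they age out rather than deleting individual messages.

```python
DAY_MS = 24 * 60 * 60 * 1000
SEVEN_DAYS_MS = 7 * DAY_MS  # 604,800,000 ms


def apply_retention(log, now_ms, retention_ms=SEVEN_DAYS_MS):
    """Drop messages whose timestamp is older than the retention period."""
    return [(ts, value) for ts, value in log if now_ms - ts <= retention_ms]


now = 10 * DAY_MS  # pretend "now" is day 10
log = [
    (1 * DAY_MS, "day-1 event"),  # 9 days old, past retention
    (5 * DAY_MS, "day-5 event"),  # 5 days old, still retained
]
print(apply_retention(log, now))
```

A slow consumer can still read the day-5 event because it remains on disk; only data older than the configured period is gone.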
Multiple Producers and Consumers
Kafka supports multiple
producers and consumers for the
same topic, enabling flexible data
pipelines.
producing data to a single topic,
multiple analytics services
consuming it.
High Throughput
Stream Processing
Kafka Connect
and a Kafka topic.
in real-time.
Log Compaction
update to user profiles.
Exactly Once Semantics
Kafka can ensure that messages are
processed exactly once, even in
distributed systems, when idempotent
producers and transactions are used.
Idempotent Producer
Producers can safely retry
sending messages without
duplicating them.
guaranteed single delivery.
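One way to picture safe retries is a broker that remembers the highest sequence number it has stored for each producer and ignores anything it has already seen. The sketch below is a toy model under that assumption: the Broker class and its fields are hypothetical, and real Kafka tracks sequence numbers per producer and per partition.

```python
class Broker:
    """Toy broker that drops retried sends it has already stored."""

    def __init__(self):
        self.log = []
        self.last_sequence = {}  # producer id -> highest sequence stored

    def append(self, producer_id, sequence, value):
        if sequence <= self.last_sequence.get(producer_id, -1):
            return False  # duplicate of an earlier send; ignore it
        self.log.append(value)
        self.last_sequence[producer_id] = sequence
        return True


broker = Broker()
broker.append("producer-1", 0, "order-created")
broker.append("producer-1", 0, "order-created")  # retry after a lost ack
broker.append("producer-1", 1, "order-paid")
print(broker.log)  # ['order-created', 'order-paid']
```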
Transactions
written or none are.
MirrorMaker
data to a backup datacenter.
Security
Kafka AdminClient
Monitoring and Metrics
Kafka provides metrics and
monitoring tools to track cluster
performance.
Serialization and Deserialization
Kafka requires data to be
serialized for transmission and
supports various formats such as
Avro and JSON.
Avro format before sending to
Kafka.
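As a small illustration of serialization, the sketch below turns a record into bytes and back. It uses JSON from the Python standard library in place of Avro, which would need an extra library and a schema; the serialize and deserialize helpers and the sample record are hypothetical.

```python
import json


def serialize(record: dict) -> bytes:
    """Turn a record into the bytes that would be sent as a message value."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Rebuild the record on the consumer side."""
    return json.loads(payload.decode("utf-8"))


profile = {"user_id": 42, "name": "Asha"}
payload = serialize(profile)
print(payload)                          # b'{"name": "Asha", "user_id": 42}'
print(deserialize(payload) == profile)  # True
```

Producers and consumers have to agree on the format, since the broker only sees the bytes.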
Message Ordering
Consumer Group
collaboratively.
Offset Management
Kafka tracks the offset of
messages to manage consumer
progress.
restart.
Topic Replication
broker failure.
Message Compression
Kafka supports compressing
messages to save bandwidth and
storage.
Kafka.
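The bandwidth and storage saving can be sketched by compressing a batch of similar messages with gzip, one of the codecs Kafka producers can be configured to use. The batch contents below are made up for the example, and the code compresses a plain byte string rather than a real Kafka record batch.

```python
import gzip
import json

# Hypothetical batch of 100 similar click events, already serialized to bytes.
batch = [
    json.dumps({"user_id": i, "action": "click"}).encode("utf-8")
    for i in range(100)
]
raw = b"\n".join(batch)
compressed = gzip.compress(raw)

# Repetitive payloads compress well, so the second number is smaller.
print(len(raw), len(compressed))

# Decompressing restores the original bytes exactly.
restored = gzip.decompress(compressed)
```

Compressing a whole batch at once tends to work better than compressing messages one by one, because field names and values repeat across messages.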
ZooKeeper
election.
Kafka Broker Configuration
Brokers can be configured for
performance, retention policies,
and more.
Producer Configuration
Producers have configurable
parameters for message delivery,
retries, and more.
Consumer Configuration
Consumers can be configured for
fetch sizes, timeout settings, and
more.
performance.
Topic Management
Topics can be created, deleted,
and managed programmatically
or via CLI.
Quotas and Throttling
Kafka supports setting quotas to
control resource usage by clients.
Rebalance Protocol
Kafka handles rebalancing of
consumers within a group to
maintain load balance.
group.
Kafka API
Schema Registry
schema.
Fault Tolerance
Real-Time Analytics
Kafka supports real-time data
analytics and processing.
ETL Pipelines
loading it into a data warehouse
via Kafka.
Kafka Upgrades
Message Timestamping
Kafka messages can have
timestamps for time-based
processing.
Streams.
State Stores
using state stores.
Windowed Operations
Kafka Streams supports
windowed operations for
aggregations over time windows.
KSQL
Kafka Ecosystem
integrate with databases, Kafka
Streams for processing, and KSQL
for querying streams.
Kafka Connectors
and Kafka.
Kafka Cluster Management
Tools and practices for managing
Kafka clusters efficiently.
Tiered Storage
storage costs.