Data Engineering 101
Kafka Core Concepts
Data Engineering 101 - Kafka

Kafka Broker

A Kafka broker is a server that runs the Kafka software and is responsible for storing and serving data. Brokers receive messages from producers, assign offsets to messages, and store them on disk.

Example: In a Kafka cluster, multiple brokers work together to ensure data is reliably stored and served.

Shwetank Singh
GritSetGrow - GSGLearn.com

Topics
Topics are logical channels to which messages are sent by producers and from which messages are read by consumers. A topic is divided into multiple partitions to allow parallel processing.

Example: A "user_activity" topic might be divided into several partitions to handle high message volume.

Partitions
Partitions are subdivisions of topics. Each partition is an ordered, immutable sequence of messages that is continually appended to. Partitions enable Kafka to scale horizontally while preserving message order within each partition.

Example: Partition 0 of the "user_activity" topic stores messages for a specific subset of users.
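
The append-only behaviour described above can be pictured with a small in-memory model. This is only a sketch: the `Partition` class and its methods are hypothetical names chosen for illustration and are not part of any Kafka client library.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy model of one partition: an append-only list of records."""
    records: list = field(default_factory=list)

    def append(self, value: bytes) -> int:
        # The offset is simply the record's position in the log.
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        # Reads never modify the log; they return a slice starting at offset.
        return self.records[offset:offset + max_records]


partition_0 = Partition()
first = partition_0.append(b"login:alice")
second = partition_0.append(b"click:alice")
```

Existing records are never rewritten in this model; new records only ever go to the end, which is what keeps the order stable.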


Producers

Producers are clients that send messages to Kafka topics. They can send messages to specific partitions based on a partitioning strategy or distribute them evenly across all partitions.

Example: A web application that logs user activity sends these logs to a Kafka topic as messages.
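
The two strategies mentioned above, key-based placement and even distribution, might look roughly like this. The function names are hypothetical, and `zlib.crc32` is used purely as a stand-in hash; Kafka's own default partitioner uses a different hash function.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in hash for illustration only: the same key always maps
    # to the same partition as long as num_partitions does not change.
    return zlib.crc32(key) % num_partitions


def round_robin(num_partitions: int):
    # Generator that cycles through partitions for records without a key.
    i = 0
    while True:
        yield i % num_partitions
        i += 1


rr = round_robin(3)
spread = [next(rr) for _ in range(4)]
```

Routing by key is what lets all messages for one user land in the same partition and therefore stay in order relative to each other.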


Consumers
Consumers are clients that read messages from Kafka topics. Consumers can operate individually or as part of a consumer group, which allows for parallel processing of messages.

Example: An analytics service reads user activity logs from a Kafka topic to generate reports.


Consumer Groups
Consumer groups allow multiple consumers to collaborate on processing messages from a topic. Each partition in a topic is assigned to only one consumer within a group at a time, ensuring parallel processing and load balancing.

Example: Three consumers in a group process messages from six partitions.
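
As a rough illustration of the example above, six partitions can be spread over three consumers so that each partition has exactly one owner. The `assign_partitions` function is a hypothetical, simplified round-robin scheme and does not reproduce any of Kafka's actual assignor implementations.

```python
def assign_partitions(partitions: list, consumers: list) -> dict:
    """Spread partitions over consumers in round-robin order."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


assignment = assign_partitions(list(range(6)), ["c1", "c2", "c3"])
```

In this sketch every consumer ends up with two partitions, and no partition is shared between two consumers of the same group.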

Offsets

Offsets are unique identifiers assigned to each message within a partition. Consumers use offsets to track which messages have been read.

Example: A consumer reads messages up to offset 105 and resumes from offset 106 after a restart.
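
The restart scenario above can be sketched with a tiny bookkeeping class. `OffsetTracker` is a hypothetical name, and the sketch stores the last processed offset to mirror the example; Kafka clients conventionally commit the position of the next record to read instead.

```python
class OffsetTracker:
    """Toy consumer-side bookkeeping for a single partition."""

    def __init__(self):
        self.committed = None  # offset of the last processed record

    def commit(self, last_processed_offset: int) -> None:
        self.committed = last_processed_offset

    def next_offset(self) -> int:
        # With nothing committed, start from the beginning of the log.
        return 0 if self.committed is None else self.committed + 1


tracker = OffsetTracker()
start = tracker.next_offset()
tracker.commit(105)
resume = tracker.next_offset()
```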


Kafka Cluster

A Kafka cluster is composed of multiple brokers that work together. Clusters provide fault tolerance and high availability.

Example: A cluster with three brokers can continue operating if one broker fails, provided its partitions are replicated on the remaining brokers.


Replication
Kafka replicates partitions across multiple brokers to ensure fault tolerance. Each partition has one leader and, depending on the replication factor, one or more followers. By default the leader handles all reads and writes, while followers replicate the data.

Example: Partition 0 has one leader and two followers across three brokers.
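
A heavily simplified picture of leader failover for the example above follows. The `elect_leader` function and its parameters are hypothetical; in a real cluster the election is carried out by the cluster's controller, not by client code.

```python
def elect_leader(replicas: list, in_sync: set, failed: set):
    """Return the first replica that is in sync and not failed, or None."""
    for broker in replicas:
        if broker in in_sync and broker not in failed:
            return broker
    return None


replicas = [1, 2, 3]  # broker ids hosting partition 0
leader = elect_leader(replicas, in_sync={1, 2, 3}, failed=set())
new_leader = elect_leader(replicas, in_sync={1, 2, 3}, failed={1})
no_leader = elect_leader(replicas, in_sync={1, 2, 3}, failed={1, 2, 3})
```

Only replicas that have kept up with the leader are eligible in this sketch, which is why the in-sync set is checked before promoting a follower.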


ZooKeeper

ZooKeeper is used for distributed coordination and metadata management in Kafka. It manages broker metadata, leader election, and configuration. Newer Kafka releases can also run without ZooKeeper by using the built-in KRaft mode.

Example: ZooKeeper ensures a new leader is elected if the current leader broker fails.


Producers and ACKs

Producers send messages to brokers and can configure acknowledgment settings (ACKs) to ensure reliable message delivery.

Example: A producer sets acks=all to wait for confirmation from all in-sync replicas before considering a message sent.
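
The acknowledgment levels can be summarised as a small decision function. The values "0", "1", and "all" correspond to the producer `acks` setting; the function itself and its parameter names are hypothetical and only model when a producer would treat a send as successful.

```python
def is_acknowledged(acks: str, leader_has_record: bool,
                    in_sync_replicas: int, replicas_with_record: int) -> bool:
    if acks == "0":
        # Fire and forget: the producer does not wait for any broker response.
        return True
    if acks == "1":
        # Only the partition leader must have written the record.
        return leader_has_record
    if acks == "all":
        # Every in-sync replica must have the record.
        return leader_has_record and replicas_with_record >= in_sync_replicas
    raise ValueError(f"unsupported acks value: {acks!r}")
```

Stronger settings trade latency for durability: with "all", a send is not considered complete until every in-sync replica holds the record.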


Retention Policy

Kafka topics can have retention policies that determine how long messages are stored. Policies can be time-based or size-based.

Example: A topic is configured to retain messages for 7 days, after which they are deleted.
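
Time-based retention for the 7-day example could be modelled as below. The `apply_retention` function is a hypothetical sketch that filters individual records; Kafka itself removes whole log segments once they age out.

```python
RETENTION_MS = 7 * 24 * 60 * 60 * 1000  # 7 days in milliseconds


def apply_retention(records: list, now_ms: int,
                    retention_ms: int = RETENTION_MS) -> list:
    """Keep (timestamp_ms, value) records that are not older than retention_ms."""
    cutoff = now_ms - retention_ms
    return [(ts, value) for ts, value in records if ts >= cutoff]


DAY_MS = 24 * 60 * 60 * 1000
records = [(1 * DAY_MS, "old"), (5 * DAY_MS, "recent")]
kept = apply_retention(records, now_ms=10 * DAY_MS)
```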


Log Compaction

Log compaction ensures that only the latest message for each key is retained in a topic, useful for maintaining the latest state.

Example: A log-compacted topic retains only the latest update for each user profile.
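
The effect of compaction on the user-profile example can be sketched as follows. The `compact` function and the sample records are hypothetical; the sketch ignores details such as delete markers and the fact that real compaction runs in the background over time.

```python
def compact(log: list) -> list:
    """Keep only the most recent (offset, key, value) entry for each key."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later entries overwrite earlier ones
    # Surviving records keep their original offsets and relative order.
    return sorted(latest.values())


log = [
    (0, "user-1", "email=a@example.com"),
    (1, "user-2", "email=b@example.com"),
    (2, "user-1", "email=a2@example.com"),
]
compacted = compact(log)
```

After compaction the older record for "user-1" at offset 0 is gone, while the remaining records keep the offsets they were originally written with.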


Kafka Connect

Kafka Connect is a framework for integrating Kafka with other data systems. It provides connectors to move data in and out of Kafka.

Example: Using Kafka Connect to sync data between a MySQL database and a Kafka topic.


Kafka Streams

Kafka Streams is a library for building stream processing applications on top of Kafka. It allows processing and transforming data in real time.

Example: An application using Kafka Streams aggregates clickstream data to generate real-time metrics.


MirrorMaker

MirrorMaker is a tool for replicating data between Kafka clusters, often used for cross-datacenter replication.

Example: Using MirrorMaker to replicate messages from a primary datacenter to a backup datacenter.


Kafka API

Kafka provides APIs for producing, consuming, and managing data, including the Producer API, Consumer API, and Admin API.

Example: Using the Producer API to send messages from a Java application to a Kafka topic.


Security

Kafka supports various security features, including SSL encryption, SASL authentication, and ACLs for authorization.

Example: Configuring SSL to encrypt data in transit and SASL for client authentication.


AdminClient API

The AdminClient API allows programmatic management of Kafka topics, brokers, and configurations.

Example: Using AdminClient to create a new topic and configure its retention policy.


Monitoring and Metrics

Kafka provides metrics for monitoring cluster health and performance. Tools like Prometheus and Grafana can be used to visualize these metrics.

Example: Monitoring consumer lag and broker health using Prometheus and Grafana dashboards.
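
Consumer lag, mentioned in the example above, is commonly understood as the distance between the end of a partition's log and the consumer group's committed position. The sketch below assumes that definition; the function name and the sample numbers are made up for illustration.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition = log end offset minus committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


lag = consumer_lag({0: 120, 1: 95}, {0: 100, 1: 95})
total_lag = sum(lag.values())
```

A lag that keeps growing suggests the consumers cannot keep up with the rate at which producers write.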


Message Delivery Semantics

Kafka supports three types of message delivery semantics: at most once, at least once, and exactly once.

Example: Configuring a producer for exactly-once delivery to ensure no message is lost or duplicated.
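
On the consumer side, the difference between at-most-once and at-least-once often comes down to whether progress is committed before or after a message is processed. The toy simulation below illustrates that idea with a single simulated crash; `consume_with_crash` and its parameters are hypothetical and do not correspond to any Kafka API.

```python
def consume_with_crash(messages: list, commit_first: bool, crash_at: int) -> list:
    """Consume messages, crashing once at index crash_at, then restart.

    Returns the messages whose processing completed, in order.
    """
    processed = []
    committed = 0      # index of the next message to read after a restart
    crashed = False
    position = 0
    while position < len(messages):
        if commit_first:
            committed = position + 1              # commit before processing
            if position == crash_at and not crashed:
                crashed = True
                position = committed              # restart: message is skipped
                continue
            processed.append(messages[position])
        else:
            processed.append(messages[position])  # process before committing
            if position == crash_at and not crashed:
                crashed = True
                position = committed              # restart: message is re-read
                continue
            committed = position + 1
        position += 1
    return processed


at_most_once = consume_with_crash(["a", "b", "c"], commit_first=True, crash_at=1)
at_least_once = consume_with_crash(["a", "b", "c"], commit_first=False, crash_at=1)
```

Committing first loses "b" when the crash hits, while processing first handles "b" twice; exactly-once processing needs additional mechanisms on top of this ordering.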


Stateful Processing

Kafka Streams supports stateful processing, allowing applications to maintain state across messages using state stores.

Example: A stream processing application that maintains a running count of events over a window of time.


Windowed Operations

Kafka Streams provides support for windowed operations, enabling time-based aggregations and transformations.

Example: Calculating the average number of user clicks per minute using windowed operations.
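
The idea behind a per-minute window can be shown without Kafka Streams at all. The sketch below groups click timestamps into fixed one-minute windows in plain Python; the function name and the sample timestamps are hypothetical, and it counts clicks per window as the simplest form of windowed aggregation.

```python
from collections import Counter


def clicks_per_minute(click_timestamps_ms: list) -> dict:
    """Count events in fixed one-minute windows keyed by window start (ms)."""
    window_ms = 60_000
    counts = Counter((ts // window_ms) * window_ms for ts in click_timestamps_ms)
    return dict(counts)


windows = clicks_per_minute([0, 59_999, 60_000, 61_000, 125_000])
```

Each event belongs to exactly one window here; an average per minute could then be derived by dividing the total count by the number of windows.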


KSQL

KSQL (now ksqlDB) is a SQL-like interface for stream processing in Kafka, simplifying the creation of stream processing applications.

Example: Using KSQL to filter, aggregate, and transform streams of data in real time.


Kafka Ecosystem

Kafka's ecosystem includes various tools and frameworks for comprehensive data processing, such as Kafka Connect, Kafka Streams, and KSQL.

Example: Integrating Kafka with a relational database using Kafka Connect and processing the data with Kafka Streams.


Publish/Subscribe Messaging

Pub/Sub systems allow decoupling of message producers and consumers. Kafka acts as a broker facilitating this.

Example: An application publishes user activity logs which can be consumed by analytics services.


Messages and Batches

Messages are the basic unit of data in Kafka, stored as byte arrays. Messages are written in batches for efficiency.

Example: A batch of log messages sent from an application.


Schemas

Schemas define the structure of messages, ensuring consistency. Apache Avro is a common serialization framework used with Kafka.

Example: Avro schema for user profile data.


Topics and Partitions

Topics are categories to which messages are published. Topics are divided into partitions for scalability, and partitions are replicated for redundancy.

Example: A "user_activity" topic with several partitions.


Producers and Consumers

Producers create and send messages to Kafka topics. Consumers read messages from topics.

Example: One microservice produces order data and another consumes it for processing.


Brokers and Clusters

A broker is a Kafka server that stores data and serves clients. Multiple brokers form a Kafka cluster, providing fault tolerance and scalability.

Example: A Kafka cluster with three brokers.


Disk-Based Retention

Kafka retains messages on disk for a configured period, allowing consumers to read at their own pace.

Example: Retaining logs for 7 days.


Multiple Producers and Consumers

Kafka supports multiple producers and consumers for the same topic, enabling flexible data pipelines.

Example: Multiple sensors producing data to a single topic, with multiple analytics services consuming it.


High Throughput

Kafka can handle large volumes of messages efficiently due to its architecture, which relies on sequential disk writes, batching, and parallelism across partitions.

Example: Processing millions of log entries per second.


Stream Processing

Kafka supports real-time processing of streams of data using tools like Kafka Streams.

Example: Real-time analytics on incoming transaction data.


Kafka Connect

Kafka Connect simplifies the integration of Kafka with other data systems.

Example: Using Kafka Connect to sync data between a database and a Kafka topic.


Kafka Streams API

The Kafka Streams API allows building stream processing applications with Kafka.

Example: An application that aggregates user clickstream data in real time.


Log Compaction

Kafka can retain only the latest message per key in a log-compacted topic, useful for changelog data.

Example: Keeping only the latest update to user profiles.


Exactly Once Semantics

Kafka can ensure that messages are processed exactly once, even in distributed systems, when idempotent producers and transactions are used.

Example: Financial transactions processed without duplicates.


Idempotent Producer

With idempotence enabled, producers can safely retry sending messages without duplicating them.

Example: Sending a payment confirmation message that is written only once, even if the send is retried.
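
One way to picture why retries become safe is a log that remembers the highest sequence number it has accepted from each producer and drops anything it has already seen. The `DedupingLog` class below is a hypothetical, simplified model of that idea, not broker code.

```python
class DedupingLog:
    """Toy broker-side log that drops retried duplicates."""

    def __init__(self):
        self.records = []
        self.last_sequence = {}  # producer_id -> highest sequence number seen

    def append(self, producer_id: int, sequence: int, value: str) -> bool:
        if sequence <= self.last_sequence.get(producer_id, -1):
            return False  # duplicate of an already-written record; ignore it
        self.last_sequence[producer_id] = sequence
        self.records.append(value)
        return True


payments = DedupingLog()
first_try = payments.append(producer_id=7, sequence=0, value="payment-confirmed")
retry = payments.append(producer_id=7, sequence=0, value="payment-confirmed")
```

The retry is acknowledged as already handled rather than appended again, so the log ends up with a single copy of the message.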


Transactions

Kafka supports atomic writes across multiple partitions and topics using transactions.

Example: Ensuring that a series of related messages is either written in full or not at all.
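
The all-or-nothing behaviour can be sketched with a writer that buffers records and only applies them on commit. `TransactionalWriter` and the partition names are hypothetical; this buffering model is a simplification chosen for clarity and is not how Kafka implements transactions internally.

```python
class TransactionalWriter:
    """Toy writer that applies buffered records to several partitions atomically."""

    def __init__(self, partitions: dict):
        self.partitions = partitions  # name -> list of committed records
        self.pending = []             # (partition name, value) not yet visible

    def send(self, partition: str, value: str) -> None:
        self.pending.append((partition, value))

    def commit(self) -> None:
        for partition, value in self.pending:
            self.partitions[partition].append(value)
        self.pending.clear()

    def abort(self) -> None:
        self.pending.clear()          # nothing becomes visible


partitions = {"orders-0": [], "payments-0": []}
writer = TransactionalWriter(partitions)
writer.send("orders-0", "order-1 created")
writer.send("payments-0", "order-1 charged")
writer.abort()
after_abort = {name: list(values) for name, values in partitions.items()}
writer.send("orders-0", "order-2 created")
writer.send("payments-0", "order-2 charged")
writer.commit()
```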


MirrorMaker

MirrorMaker is a tool for replicating Kafka topics across clusters, useful for disaster recovery and multi-datacenter setups.

Example: Mirroring production data to a backup datacenter.


Security

Kafka supports authentication, authorization, and encryption to secure data.

Example: Using SSL for encrypting data in transit.


Kafka AdminClient

The AdminClient API allows programmatic management of Kafka.

Example: Creating topics and altering configurations programmatically.


Monitoring and Metrics

Kafka provides metrics and monitoring tools to track cluster performance.

Example: Monitoring consumer lag and broker health.


Serialization and Deserialization

Kafka requires serialization of data for transmission, with support for various formats such as Avro and JSON.

Example: Serializing user data to Avro format before sending to Kafka.
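
Since Kafka carries messages as bytes, a serializer is essentially a function from a record to bytes and back. The sketch below uses JSON from the Python standard library rather than Avro, because Avro needs a separate library; the function names and the sample record are hypothetical.

```python
import json


def serialize(record: dict) -> bytes:
    # sort_keys gives a stable byte representation for equal dictionaries.
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    return json.loads(payload.decode("utf-8"))


user = {"id": 42, "name": "alice"}
payload = serialize(user)
restored_user = deserialize(payload)
```

Producer and consumer have to agree on the format; a schema-based format such as Avro makes that agreement explicit.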


Message Ordering

Kafka maintains the order of messages within a partition, which is important for consistency.

Example: Ensuring the order of transaction logs.


Consumer Group

Consumers can join groups to balance load and ensure each message is handled by only one consumer in the group.

Example: Multiple consumers processing a high-volume topic collaboratively.


Offset Management

Kafka tracks the offset of messages to manage consumer progress.

Example: Storing offsets in Kafka to resume processing after a restart.


Topic Replication

Kafka replicates partitions across multiple brokers for fault tolerance.

Example: A partition replicated across three brokers to handle broker failure.


Message Compression

Kafka supports compressing messages to save bandwidth and storage.

Example: Compressing log messages before sending to Kafka.
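
The saving is easy to see on repetitive data such as log lines. The sketch below compresses a made-up batch with gzip from the Python standard library; it only illustrates the size effect and does not show how a Kafka client applies compression.

```python
import gzip

# A hypothetical batch of 100 identical log lines joined by newlines.
batch = b"\n".join(b"2024-01-01 INFO request served in 12 ms" for _ in range(100))
compressed = gzip.compress(batch)
restored = gzip.decompress(compressed)
```

Compressing a whole batch rather than single messages works well because similar messages share much of their content.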


ZooKeeper

Kafka uses ZooKeeper for distributed coordination and metadata management.

Example: ZooKeeper managing broker metadata and leader election.


Kafka Broker Configuration

Brokers can be configured for performance, retention policies, and more.

Example: Configuring a broker to retain messages for 30 days.


Producer Configuration

Producers have configurable parameters for message delivery, retries, and more.

Example: Setting producer retries to handle transient failures.


Consumer Configuration

Consumers can be configured for fetch sizes, timeout settings, and more.

Example: Configuring consumer fetch size for optimal performance.


Topic Management

Topics can be created, deleted, and managed programmatically or via CLI.

Example: Creating a new topic for storing event logs.


Quotas and Throttling

Kafka supports setting quotas to control resource usage by clients.

Example: Throttling a high-volume producer to prevent overwhelming the cluster.


Rebalance Protocol

Kafka handles rebalancing of consumers within a group to maintain load balance.

Example: Rebalancing partitions when a new consumer joins the group.


Kafka REST Proxy

The Kafka REST Proxy provides a RESTful interface for interacting with Kafka clusters.

Example: Sending messages to Kafka using HTTP requests.


Kafka API

Kafka provides APIs for producing, consuming, and managing data.

Example: Using the Kafka Producer API to send messages from a Java application.


Schema Registry

Confluent Schema Registry manages and enforces schemas for Kafka messages.

Example: Ensuring all messages in a topic follow a predefined schema.


Kafka Streams DSL

The Kafka Streams DSL is a high-level API for stream processing in Kafka.

Example: Using Kafka Streams DSL to filter and transform a stream of events.


Fault Tolerance

Kafka’s design ensures high availability and fault tolerance.

Example: Automatic failover to replicas when a broker fails.


Real-Time Analytics

Kafka supports real-time data analytics and processing.

Example: Real-time dashboard updating with live metrics from Kafka.


ETL Pipelines

Kafka can be used to build efficient ETL pipelines for data integration.

Example: Extracting data from databases, transforming it, and loading it into a data warehouse via Kafka.


Kafka Upgrades

Kafka supports rolling upgrades to minimize downtime.

Example: Upgrading Kafka brokers without disrupting message flow.


Message Timestamping

Kafka messages carry timestamps that can be used for time-based processing.

Example: Using timestamps for event time processing in Kafka Streams.


State Stores

Kafka Streams supports stateful processing by keeping state in state stores.

Example: Counting occurrences of events over a window of time using state stores.


Windowed Operations

Kafka Streams supports windowed operations for aggregations over time windows.

Example: Calculating the sum of transactions every minute.


KSQL

KSQL is a SQL-like interface for stream processing with Kafka.

Example: Using KSQL to perform real-time filtering and aggregations on Kafka topics.


Kafka Ecosystem

Kafka’s ecosystem includes tools like Connect, Streams, KSQL, and more for comprehensive data processing.

Example: Using Kafka Connect to integrate with databases, Kafka Streams for processing, and KSQL for querying streams.


Kafka Connectors

Kafka connectors are pre-built components for integrating Kafka with various data sources and sinks.

Example: Using a JDBC connector to sync data between a database and Kafka.


Kafka Cluster Management

Kafka cluster management covers the tools and practices for managing Kafka clusters efficiently.

Example: Using tools like Kafka Manager (now CMAK) for monitoring and managing cluster health.


Tiered Storage

Kafka’s tiered storage allows offloading older data to cheaper storage.

Example: Storing older Kafka topic data in S3 to reduce on-prem storage costs.
