Apache Kafka notes
A messaging queue is a simple concept that many of us are familiar with. It involves a producer, a
queue, and a consumer. The producer generates a message, which is placed into a queue, and the
consumer then reads and processes the message from the queue. This is the basic flow: producer →
queue → consumer.
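As a minimal sketch of this flow, the snippet below uses Python's standard-library `queue.Queue` as a stand-in for a real message queue. The function names `produce` and `consume` and the sample order messages are illustrative assumptions, not part of any messaging product.

```python
from queue import Queue


def produce(queue, messages):
    """Producer: place each message on the queue."""
    for message in messages:
        queue.put(message)


def consume(queue):
    """Consumer: read and process messages until the queue is empty."""
    processed = []
    while not queue.empty():
        message = queue.get()
        processed.append(f"processed {message}")
    return processed


order_queue = Queue()
produce(order_queue, ["order-1", "order-2"])
results = consume(order_queue)
print(results)
```

The producer and the consumer never call each other directly; they only share the queue, which is what lets them run at different times and speeds.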
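But the real question is: why is it needed, and what advantages does it bring? Let's explore some use
cases and advantages.
Consider an e-commerce application that must notify a user, for example after an order is placed.
Instead of calling the notification service directly, the application (the producer) places a message in
a queue.
The consumer, in this case, could be a notification service, which processes the message and sends
the notification to the user. This approach allows the process to happen asynchronously, reducing
latency in the e-commerce application.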
Without a queue, if the e-commerce app directly communicated with the notification service, the
latency would increase, as sending notifications (emails, SMS, etc.) is a resource-heavy task.
Moreover, a message queue offers additional benefits:
1. Retry Capability: If the notification service is down, the message can be retried later. The
message stays in the queue until the service becomes available again.
2. Pace Matching: If multiple producers (e.g., the e-commerce app, inventory management app,
etc.) are sending messages at different rates (10, 20, 30 messages per second), but the consumer
(notification service) can only process a certain number per second (e.g., 15), a message queue
ensures that messages are sent to the consumer at a manageable pace. The queue acts as a
buffer, matching the pace of the producers to the capabilities of the consumer.
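Pace matching can be illustrated with a small simulation. This is a toy model, not a real queue: the function name `simulate` and the per-second arrival counts are assumptions chosen for illustration, with the consumer capacity of 15 messages per second taken from the example above.

```python
from collections import deque


def simulate(arrivals_per_second, consumer_capacity):
    """Return the number of messages waiting in the buffer after each second."""
    buffer = deque()
    backlog = []
    for second, arrivals in enumerate(arrivals_per_second):
        # Producers add this second's messages to the buffer.
        buffer.extend(f"msg-{second}-{i}" for i in range(arrivals))
        # The consumer removes at most `consumer_capacity` messages.
        for _ in range(min(consumer_capacity, len(buffer))):
            buffer.popleft()
        backlog.append(len(buffer))
    return backlog


# A burst of 30, 20, then 10 messages, followed by two quiet seconds.
backlog = simulate([30, 20, 10, 0, 0], consumer_capacity=15)
print(backlog)  # [15, 20, 15, 0, 0]
```

The buffer absorbs the burst and drains once the producers slow down. If producers stay faster than the consumer indefinitely, the backlog keeps growing, so a queue smooths out bursts rather than removing the need for enough consumer capacity.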
BHARTENDU
Additional Use Case: Real-Time Location Updates
Another example of a message queue’s value is in a cab service with GPS-enabled vehicles. Each
vehicle sends location updates (car ID and current position) every 10 seconds. With many cars in a city,
this generates a large amount of data. A consumer application needs this data to update a dashboard
with car locations. However, the consumer can't process this data at such a high frequency, and a
message queue helps by buffering the messages and ensuring the consumer can process them at its
own pace.
Summary
• Retry capability: Ensures reliable message processing even if the consumer is temporarily
unavailable.
• Pace matching: Helps manage differing message production and consumption rates,
preventing overloading the consumer.
These features make message queues a powerful tool in scalable, efficient, and reliable system
design.
In message queues, there are two common messaging patterns: Point-to-Point (P2P) and Publish-
Subscribe (Pub-Sub). Here's the difference between the two:
Point-to-Point Messaging
In a Point-to-Point messaging model, there is a single queue where messages are placed by a
producer. Consumers then consume the message from that queue. The key point here is that each
message is consumed only once by a single consumer.
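• How it works:
o The producer places a message on the queue.
o One consumer takes the message from the queue; no other consumer receives it.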
• Use Case: This is suitable when you want to ensure that a message is processed only
once, such as in task processing where each task should only be handled by one worker.
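A minimal sketch of point-to-point delivery, assuming two hypothetical workers that take turns pulling from one shared queue. The names `run_workers`, `worker-a`, and `worker-b` are illustrative only.

```python
from collections import deque


def run_workers(tasks, worker_names):
    """Hand each task to exactly one worker, taking turns."""
    queue = deque(tasks)
    handled = {name: [] for name in worker_names}
    turn = 0
    while queue:
        worker = worker_names[turn % len(worker_names)]
        # popleft() removes the task, so no other worker can receive it.
        handled[worker].append(queue.popleft())
        turn += 1
    return handled


handled = run_workers(["t1", "t2", "t3", "t4", "t5"], ["worker-a", "worker-b"])
print(handled)
```

Every task appears in exactly one worker's list, which is the defining property of the point-to-point model.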
Publish-Subscribe (Pub-Sub) Messaging
• How it works:
o The producer publishes a message to an exchange.
o The exchange broadcasts the message to all the queues based on specific routing logic.
o Each consumer subscribed to a queue can independently process the same message.
• Use Case: This is useful when you want multiple consumers to receive and process the same
message, such as for updating multiple systems at once (e.g., sending notifications to different
services).
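The publish-subscribe pattern can be sketched as an exchange that copies each message into every bound queue. This sketch assumes the simplest routing logic (plain fan-out to all queues); the class name `FanoutExchange` and the queue names are hypothetical.

```python
from collections import deque


class FanoutExchange:
    """Toy exchange that copies every published message to all bound queues."""

    def __init__(self):
        self.queues = {}

    def bind(self, name):
        """Create a queue for one subscriber and return it."""
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, message):
        for queue in self.queues.values():
            queue.append(message)


exchange = FanoutExchange()
email_queue = exchange.bind("email")
sms_queue = exchange.bind("sms")
exchange.publish("order-42 shipped")

print(list(email_queue), list(sms_queue))
```

Each subscriber has its own queue, so the email and SMS consumers can process the same message independently and at their own pace.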
1. Producer: A producer can be any application that publishes messages to a topic.
2. Consumer: A consumer can be any application that subscribes to a topic and consumes the
messages.
3. Consumer Group: A group of consumers that work together to consume messages from
topics.
4. Topic: A topic is a category or feed name to which records are published. It acts like a database
table, categorizing different types of messages.
5. Partition: A subdivision of a topic that holds the actual messages and allows parallel processing.
6. Offset: The unique ID of a message within a partition.
7. Broker: A Kafka server that stores messages and manages topics and partitions.
8. Cluster: A group of brokers that work together to distribute data and handle load.
9. ZooKeeper: An external service that helps manage and coordinate the Kafka brokers. (Newer
Kafka versions can run without it by using the built-in KRaft mode.)
Kafka's Workflow:
1. Producer to Broker: The producer sends messages to a broker. A broker is essentially a Kafka
server, and it can manage multiple topics.
2. Topic and Partitions: Topics are divided into multiple partitions to allow parallelism. Each
partition holds messages, and partitions can have different lengths. Messages within a
partition are identified by offsets (e.g., 0, 1, 2, 3...).
3. Consumer and Consumer Groups: Consumers read messages from partitions. Each
consumer belongs to a consumer group, and within a group each partition is read by only one
consumer (a single consumer may read from several partitions). For example, if a topic has two
partitions (Partition 0 and Partition 1), Consumer 1 might read from Partition 0, while Consumer 2
reads from Partition 1. However, different consumer groups can read from the same partitions
simultaneously.
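The relationship between partitions and consumer groups can be sketched as a simple assignment function. This is a deliberately simplified model that deals partitions out in turn; Kafka's real assignment strategies are more involved, and the function and consumer names here are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer of the group, in turn."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Group A has two consumers, so each one gets a single partition.
group_a = assign_partitions([0, 1], ["consumer-1", "consumer-2"])
# Group B has one consumer, which reads both partitions on its own.
group_b = assign_partitions([0, 1], ["consumer-x"])

print(group_a)  # {'consumer-1': [0], 'consumer-2': [1]}
print(group_b)  # {'consumer-x': [0, 1]}
```

Both groups cover Partition 0 and Partition 1, which mirrors the point that different consumer groups read the same partitions independently.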
How a Kafka Messaging Queue Works
In Kafka, the messaging flow involves several components such as producers, consumers, topics,
partitions, and offsets. Here's a detailed breakdown of how Kafka messages are processed.
1. Key: This could be a string or ID (e.g., car ID). It's used to determine the partition to which the
message should go.
4. Partition (Optional): Kafka can either determine the partition based on the key or use a round-
robin approach if no partition is defined.
• If a Partition is specified explicitly, the message goes to that partition.
• Otherwise, if a Key is provided, Kafka computes a hash of the key and places the message in the
corresponding partition, so messages with the same key always land in the same partition.
• If both Key and Partition are missing, Kafka spreads the messages across partitions (round-robin
in older clients; newer clients fill a batch for one partition before moving on) to keep the load
evenly distributed.
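The partition-selection rules can be sketched as follows. This is a toy model under stated assumptions: the class name `Partitioner` is hypothetical, CRC32 stands in for the hash Kafka actually uses (murmur2 in the Java client), and keyless messages use plain round-robin. An explicitly requested partition is checked first, which is how Kafka's Java producer behaves.

```python
import zlib
from itertools import count


class Partitioner:
    """Toy partition chooser; not Kafka's real implementation."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = count()  # drives round-robin for keyless messages

    def choose(self, key=None, partition=None):
        if partition is not None:
            # An explicitly requested partition always wins.
            return partition
        if key is not None:
            # Assumption: CRC32 stands in for Kafka's murmur2 hash.
            return zlib.crc32(key.encode("utf-8")) % self.num_partitions
        # No key and no partition: rotate through the partitions.
        return next(self._counter) % self.num_partitions


partitioner = Partitioner(num_partitions=3)
same_key = [partitioner.choose(key="car-17") for _ in range(3)]
explicit = partitioner.choose(key="car-17", partition=2)
keyless = [partitioner.choose() for _ in range(4)]
```

Hashing the key means every update for a given car ID lands in the same partition, which preserves the order of that car's messages.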
• Topic: A topic is a logical channel where messages are published. Topics can have multiple
partitions for parallel processing.
• Partition: A partition is where the actual data resides. Each message in a partition has an offset
that marks its position (e.g., offsets 0, 1, 2, 3...).
• Offset: The offset is the unique ID of each message within a partition. Consumers use offsets to
track which messages have been consumed and which are still pending. The committed offset
indicates the last successfully read message.
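A minimal sketch of offsets and committed offsets, using an in-memory list as the partition log. The class names `Partition` and `GroupConsumer` are hypothetical. The sketch follows the convention used in these notes, where the committed offset is the last message read; Kafka's own clients commit the position of the next message to read, which is one higher.

```python
class Partition:
    """Append-only log; a message's offset is its index in the list."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset assigned to this message


class GroupConsumer:
    """Reads from one partition and remembers the last offset it read."""

    def __init__(self, partition, committed_offset=-1):
        self.partition = partition
        self.committed_offset = committed_offset  # -1 means nothing read yet

    def poll(self, max_messages):
        start = self.committed_offset + 1
        batch = self.partition.log[start:start + max_messages]
        self.committed_offset += len(batch)  # commit after reading the batch
        return batch


partition = Partition()
offsets = [partition.append(f"location-{i}") for i in range(5)]

first = GroupConsumer(partition)
first_batch = first.poll(max_messages=3)

# A replacement consumer resumes from the committed offset of the first one.
replacement = GroupConsumer(partition, committed_offset=first.committed_offset)
second_batch = replacement.poll(max_messages=3)
```

Because the position is stored as an offset rather than by deleting messages, a replacement consumer can pick up exactly where the previous one stopped.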
Consumer Groups:
Kafka Cluster:
• Cluster: A Kafka cluster consists of multiple brokers running on different machines. Brokers
are Kafka servers responsible for storing and managing messages.
1. Brokers: Kafka is distributed across multiple brokers (e.g., Broker 1, Broker 2, Broker 3).
o A topic (e.g., Topic 1) can have multiple partitions (e.g., Partition 0, Partition 1).
o Each partition has a leader and may have one or more replicas (followers).
▪ Leader: Handles all read and write operations for the partition.
4. Fault Tolerance:
o If a leader goes down, one of its followers automatically becomes the new leader.
o This protects against data loss, because followers replicate the messages from the leader.
5. Replication Mechanism:
o Replication ensures that followers hold the same data as the leader (a follower may briefly lag
behind while it catches up).
o If a consumer in a consumer group fails, another consumer in the same group takes
over and continues processing from the last recorded offset.
o If a leader partition (e.g., Partition 0 on Broker 1) goes down, its follower (e.g., on Broker
2) becomes the new leader.
2. Consumer Failure:
o If a consumer fails, another consumer in the same group starts processing from the last
committed offset.
3. Queue Size Limits:
1. Scenario:
o Let's say a consumer reads a message (e.g., Message 7), which turns out to be buggy or
unprocessable.
2. Retry Mechanism:
o The committed offset should not advance if the message fails to process (this requires
committing offsets manually rather than relying on auto-commit). The committed offset remains
at the last successfully processed message (e.g., 6 in this case).
o If the retries exceed the configured limit, the consumer application moves the message to a
Dead Letter Queue (DLQ) or failure queue.
o This ensures the faulty message does not block the partition or delay other messages.
o After handling the failure (e.g., by moving the message to a DLQ), the consumer can
increase the committed offset and continue processing subsequent messages.
o Once the underlying problem is fixed, the messages in the DLQ can be reintroduced into the
original topic/partition for processing.
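The retry and dead-letter flow can be sketched with plain Python lists standing in for the partition and the DLQ. The function name `consume_with_dlq`, the `max_retries` parameter, and the failing `handler` are all hypothetical; as in these notes, the committed offset here means the last handled message.

```python
def consume_with_dlq(messages, handler, committed_offset=-1, max_retries=3):
    """Process messages after committed_offset; park repeated failures in a DLQ."""
    dead_letter_queue = []
    for offset in range(committed_offset + 1, len(messages)):
        message = messages[offset]
        for attempt in range(1, max_retries + 1):
            try:
                handler(message)
                break  # processed successfully, stop retrying
            except ValueError:
                if attempt == max_retries:
                    # Retries exhausted: move the message aside.
                    dead_letter_queue.append((offset, message))
        # Success or DLQ: either way the message is handled, so advance.
        committed_offset = offset
    return committed_offset, dead_letter_queue


attempts = []


def handler(message):
    """Hypothetical handler that can never process message-7."""
    attempts.append(message)
    if message == "message-7":
        raise ValueError("unprocessable message")


messages = [f"message-{i}" for i in range(10)]
committed, dlq = consume_with_dlq(messages, handler, committed_offset=6)
print(committed, dlq)  # 9 [(7, 'message-7')]
```

Message 7 is tried three times and then parked in the DLQ, after which messages 8 and 9 are processed normally, so one bad message does not block the rest of the partition.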
Kafka’s Pull-Based Approach
• The consumer periodically asks, "Do you have new data?" and fetches messages when
available.
• This allows consumers to control the rate at which they process messages.
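A pull-based consumer can be sketched as a loop that repeatedly asks for the next batch. The `Broker` class and its `fetch` method are hypothetical stand-ins for a real broker, and the loop stops when no data is left, whereas a real consumer would wait and ask again.

```python
class Broker:
    """Toy broker that serves messages from an in-memory log."""

    def __init__(self, messages):
        self.log = list(messages)

    def fetch(self, offset, max_messages):
        """Return up to max_messages starting at the given offset."""
        return self.log[offset:offset + max_messages]


def pull_all(broker, batch_size):
    """Keep asking the broker for new data, one batch at a time."""
    position = 0
    batches = []
    while True:
        batch = broker.fetch(position, batch_size)
        if not batch:
            break  # nothing new; a real consumer would wait and poll again
        batches.append(batch)
        position += len(batch)
    return batches


batches = pull_all(Broker(range(5)), batch_size=2)
print(batches)  # [[0, 1], [2, 3], [4]]
```

The consumer chooses `batch_size` and decides when to call `fetch`, which is how the pull model lets it control its own processing rate.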
1. Push-Based Approach:
o If a consumer cannot process a message and retries fail, the message is moved to a
dead-letter exchange.