Apache_kafka notes

Apache Kafka is a distributed messaging system designed for real-time data streams, offering features like scalability, durability, and fault tolerance. It utilizes a messaging queue model to decouple producers and consumers, allowing asynchronous processing, retry capabilities, and pace matching. Kafka's architecture includes key components such as producers, consumers, topics, partitions, and brokers, facilitating efficient message handling and fault tolerance in distributed environments.

Uploaded by

aimlproject007

APACHE KAFKA

• Apache Kafka is a distributed publish-subscribe messaging system.

• Kafka is fast, scalable, durable, fault-tolerant, and distributed by design.
• Kafka is used for real-time streams of data and for collecting data for real-time analysis.

Q. What is a messaging queue? Why is it needed, and what are its advantages?

A messaging queue is a simple concept that many of us are familiar with. It involves a producer, a
queue, and a consumer. The producer generates a message, which is placed into a queue, and the
consumer then reads and processes the message from the queue. This is the basic flow: producer →
queue → consumer.

But the real question is, why is it needed, and what advantages does it bring? Let's explore some use
cases and advantages:

Asynchronous Nature and Advantages of Message Queues

A messaging queue is essential in many scenarios, especially when dealing with asynchronous tasks.
For example, consider an e-commerce application where a user buys a product. After the purchase, we
need to send a notification to the user, but the user shouldn't have to wait for that notification in
real-time. Instead, the e-commerce app can send a message to a queue saying, "Send notification to
this user."

The consumer, in this case, could be a notification service, which processes the message and sends
the notification to the user. This approach allows the process to happen asynchronously, reducing
latency in the e-commerce application.
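
The asynchronous flow described above can be sketched with Python's standard library. This is only an in-process illustration, not Kafka: `queue.Queue` stands in for the message queue, and the names `place_order`, `notification_service`, and `sent` are hypothetical.

```python
import queue
import threading

notification_queue = queue.Queue()   # stands in for the message queue
sent = []                            # records what the "notification service" delivered


def place_order(user_id, product):
    """Producer: complete the purchase and enqueue a notification request."""
    notification_queue.put(
        {"user": user_id, "text": f"Your order for {product} is confirmed"}
    )
    return "order accepted"          # returns without waiting for the notification


def notification_service():
    """Consumer: take requests off the queue and 'send' them."""
    while True:
        message = notification_queue.get()
        if message is None:          # sentinel used in this sketch to stop the worker
            break
        sent.append(message)         # a real service would send an email or SMS here
        notification_queue.task_done()


worker = threading.Thread(target=notification_service, daemon=True)
worker.start()

status = place_order("user-42", "headphones")

notification_queue.join()            # wait only so the demo can inspect the result
notification_queue.put(None)
worker.join()
```

The point of the sketch is that `place_order` returns as soon as the message is enqueued; the slow work happens on the consumer's side.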

Without a queue, if the e-commerce app directly communicated with the notification service, the
latency would increase, as sending notifications (emails, SMS, etc.) is a resource-heavy task.
Moreover, a message queue offers additional benefits:

1. Retry Capability: If the notification service is down, the message can be retried later. The
message stays in the queue until the service becomes available again.

2. Pace Matching: If multiple producers (e.g., the e-commerce app, inventory management app,
etc.) are sending messages at different rates (10, 20, 30 messages per second), but the consumer
(notification service) can only process a certain number per second (e.g., 15), a message queue
ensures that messages are sent to the consumer at a manageable pace. The queue acts as a buffer,
matching the pace of the producers to the capabilities of the consumer.
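
Pace matching can be illustrated with a small tick-by-tick simulation. This is a sketch under simple assumptions: time advances in one-second steps, and the function name `simulate_buffer` and the arrival numbers are made up for the example.

```python
from collections import deque


def simulate_buffer(arrivals_per_second, consumer_rate):
    """Return the queue depth at the end of each second.

    arrivals_per_second: messages added by all producers in each second.
    consumer_rate: the most messages the consumer can process per second.
    """
    buffer = deque()
    depths = []
    next_id = 0
    for arrivals in arrivals_per_second:
        # Producers add their messages for this second.
        for _ in range(arrivals):
            buffer.append(next_id)
            next_id += 1
        # The consumer drains at most `consumer_rate` messages.
        for _ in range(min(consumer_rate, len(buffer))):
            buffer.popleft()
        depths.append(len(buffer))
    return depths


# A burst of 30, 20, then 10 messages, followed by two quiet seconds.
depths = simulate_buffer([30, 20, 10, 0, 0], consumer_rate=15)
print(depths)  # [15, 20, 15, 0, 0]
```

The backlog rises during the burst and drains once the producers slow down. If producers stay faster than the consumer indefinitely, the backlog keeps growing, so a buffer absorbs bursts but does not replace adding consumer capacity.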

BHARTENDU
Additional Use Case: Real-Time Location Updates

Another example of a message queue’s value is in a cab service with GPS-enabled vehicles. Each
vehicle sends location updates (car ID and current position) every 10 seconds. With many cars in a city,
this generates a large amount of data. A consumer application needs this data to update a dashboard
with car locations. However, the consumer can't process this data at such a high frequency, and a
message queue helps by buffering the messages and ensuring the consumer can process them at its
own pace.

Summary

To sum up, message queues provide the following advantages:

• Asynchronous processing: Reduces latency by decoupling producers and consumers.

• Retry capability: Ensures reliable message processing even if the consumer is temporarily
unavailable.

• Pace matching: Helps manage differing message production and consumption rates,
preventing overloading the consumer.

These features make message queues a powerful tool in scalable, efficient, and reliable system
design.

Q. Point-to-Point vs. Publish-Subscribe (Pub-Sub) Messaging

In message queues, there are two common messaging patterns: Point-to-Point (P2P) and Publish-
Subscribe (Pub-Sub). Here's the difference between the two:

Point-to-Point Messaging

In a Point-to-Point messaging model, there is a single queue where messages are placed by a
producer. Consumers then consume the message from that queue. The key point here is that each
message is consumed only once by a single consumer.

• How it works:

o A producer publishes a message to a queue (e.g., Message A).

o If Consumer 1 processes the message, Consumer 2 cannot access it.

o Once the message is consumed by a consumer, it is no longer available for other
consumers.

• Use Case: This is suitable when you want to ensure that a message is processed only
once, such as in task processing where each task should only be handled by one worker.
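
A minimal in-memory sketch of the point-to-point behaviour, assuming a hypothetical `PointToPointQueue` class (not a real broker API): once one consumer takes a message, the next consumer finds nothing.

```python
from collections import deque


class PointToPointQueue:
    """One queue; every message is handed to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        """Remove and return the oldest message, or None if the queue is empty."""
        return self._messages.popleft() if self._messages else None


q = PointToPointQueue()
q.publish("Message A")

consumer_1_got = q.consume()   # Consumer 1 takes Message A
consumer_2_got = q.consume()   # nothing is left for Consumer 2

print(consumer_1_got, consumer_2_got)  # Message A None
```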
Publish-Subscribe (Pub-Sub) Messaging

In the Publish-Subscribe model, the producer (publisher) sends messages to a central exchange,
which then broadcasts these messages to multiple queues. Each consumer is subscribed to one or
more of these queues and can consume the message independently.

• How it works:

o A publisher sends a message (e.g., Message A) to an exchange.

o The exchange copies the message to the queues bound to it, according to its routing logic.

o Each consumer subscribed to a queue can independently process the same message.

• Use Case: This is useful when you want multiple consumers to receive and process the same
message, such as for updating multiple systems at once (e.g., sending notifications to different
services).
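
The broadcast behaviour can be sketched in memory as follows. `FanoutExchange` and the queue names are hypothetical; the sketch assumes the simplest routing logic, where every bound queue receives a copy of every message.

```python
class FanoutExchange:
    """Copies each published message to every queue bound to the exchange."""

    def __init__(self):
        self._queues = {}

    def bind(self, queue_name):
        """Create a queue, bind it to the exchange, and return it."""
        self._queues[queue_name] = []
        return self._queues[queue_name]

    def publish(self, message):
        for bound_queue in self._queues.values():
            bound_queue.append(message)


exchange = FanoutExchange()
email_queue = exchange.bind("email")
sms_queue = exchange.bind("sms")

exchange.publish("Message A")

print(email_queue, sms_queue)  # ['Message A'] ['Message A']
```

Each consumer reads from its own queue, so the email and SMS services process the same message independently.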

Kafka Architecture Overview


Kafka is a popular distributed messaging system, often used for real-time data streaming. Let's
understand the core components of Kafka and how they interact:

Key Components of Kafka:

1. Producer: A producer is any application that publishes messages to a topic.

2. Consumer: A consumer can be any application that subscribes to a topic and consumes the
messages.

3. Consumer Group: A group of consumers that work together to consume messages from
topics.

4. Topic: A topic is a category or feed name to which records are published. It acts like a database
table and categorizes different types of messages.

5. Partition: Topics are divided into partitions to allow parallel processing.

6. Offset: A unique identifier for each message within a partition.

7. Broker: A Kafka server that stores messages and manages topics and partitions.

8. Cluster: A group of brokers that work together to distribute data and handle load.

9. ZooKeeper: An external service that helps manage and coordinate the Kafka brokers. (Newer
Kafka releases can run without it, using Kafka's built-in KRaft mode.)

Kafka's Workflow:

1. Producer to Broker: The producer sends messages to a broker. A broker is essentially a Kafka
server, and it can manage multiple topics.

2. Topic and Partitions: Topics are divided into multiple partitions to allow parallelism. Each
partition holds messages, and partitions can have different lengths. Messages within a
partition are identified by offsets (e.g., 0, 1, 2, 3...).

3. Consumer and Consumer Groups: Consumers read messages from partitions. Each
consumer belongs to a consumer group, and within a group each partition is assigned to exactly
one consumer (a consumer may own several partitions when there are more partitions than
consumers). For example, if a topic has two partitions (Partition 0 and Partition 1),
Consumer 1 might read from Partition 0, while Consumer 2 reads from Partition 1. However,
different consumer groups can read from the same partitions simultaneously.

4. Cluster: Kafka operates in a distributed manner, so multiple brokers can be running on
different machines (nodes). A Kafka cluster is a group of brokers working together to distribute
data and provide fault tolerance.

5. ZooKeeper: ZooKeeper helps manage communication between brokers. It keeps track of
metadata, such as which broker holds which partition of a topic. This allows Kafka brokers to
work in coordination.
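
The statement that different consumer groups can read the same partition at the same time comes down to each group keeping its own position. A minimal sketch, assuming hypothetical group names (`billing`, `analytics`) and a plain Python list standing in for one partition:

```python
partition_0 = ["m0", "m1", "m2", "m3"]      # messages at offsets 0..3

# Each consumer group keeps its own position in the partition.
group_offsets = {"billing": 0, "analytics": 0}


def read_next(group):
    """Return the next unread message for this group, or None when caught up."""
    offset = group_offsets[group]
    if offset >= len(partition_0):
        return None
    group_offsets[group] = offset + 1
    return partition_0[offset]


first = read_next("billing")            # "m0"
second = read_next("billing")           # "m1"
analytics_first = read_next("analytics")  # "m0" again: the groups are independent

print(group_offsets)  # {'billing': 2, 'analytics': 1}
```

Reading does not remove a message from the partition, which is why a second group can still see `m0` after the first group has moved on.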

How a Kafka Messaging Queue Works

In Kafka, the messaging flow involves several components such as producers, consumers, topics,
partitions, and offsets. Here's a detailed breakdown of how Kafka messages are processed.

Kafka Message Format:

Each message in Kafka has the following components:

1. Key: This could be a string or ID (e.g., car ID). It's used to determine the partition to which the
message should go.

2. Value: The actual message (the content or data you're sending).

3. Topic: The category or feed where the message is published.

4. Partition (Optional): The producer may name the target partition explicitly. If it does not, Kafka
chooses one from the key, or spreads messages across partitions when there is no key.

• If a Partition is specified, the message goes to that partition, whether or not a Key is present.

• If no Partition is specified but a Key is provided, Kafka computes a hash of the key and places
the message in the corresponding partition, so messages with the same key land in the same
partition.

• If both Key and Partition are missing, Kafka spreads the messages across partitions (round-robin
in older clients; newer clients fill a batch for one partition before moving to the next) to keep
the load even.
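
Partition selection can be sketched as below, assuming the precedence used by Kafka's Java producer: an explicit partition wins, then the key hash, then spreading for keyless messages. The `Partitioner` class is hypothetical, and `zlib.crc32` is only a stand-in hash (the Java client uses murmur2), so the partition numbers it produces will not match a real cluster.

```python
import itertools
import zlib


class Partitioner:
    """Illustrative partition chooser; not Kafka's actual implementation."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()   # drives the round-robin fallback

    def choose(self, key=None, partition=None):
        if partition is not None:
            # An explicitly requested partition is used as-is.
            return partition
        if key is not None:
            # Same key -> same hash -> same partition (stand-in hash, see lead-in).
            return zlib.crc32(key.encode("utf-8")) % self.num_partitions
        # No key and no partition: plain round-robin in this sketch.
        return next(self._counter) % self.num_partitions


partitioner = Partitioner(num_partitions=3)
explicit = partitioner.choose(key="car-17", partition=2)
keyed_a = partitioner.choose(key="car-17")
keyed_b = partitioner.choose(key="car-17")
keyless = [partitioner.choose() for _ in range(4)]

print(explicit, keyed_a == keyed_b, keyless)  # 2 True [0, 1, 2, 0]
```

Keying by car ID, as in the cab example, keeps all updates for one car in one partition, which preserves their order.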

Understanding Kafka Partitions and Offsets:

• Topic: A topic is a logical channel where messages are published. Topics can have multiple
partitions for parallel processing.

• Partition: A partition is where the actual data resides. It is an ordered, append-only sequence
of messages, and each message in it is numbered by an offset (e.g., offsets 0, 1, 2, 3...).

• Offset: The offset is the unique, sequential ID of each message within a partition. Consumers
use it to track which messages have been consumed and which are still pending. The committed
offset marks how far a consumer group has successfully read; Kafka clients store it as the offset
of the next message to read.
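
A partition and its offsets can be modelled as an append-only list. This is a sketch with a hypothetical `PartitionLog` class; it assumes the convention that the committed offset is the offset of the next message to read.

```python
class PartitionLog:
    """Append-only list of records; a record's index is its offset."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Store a record and return the offset it was assigned."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records (offset, value) pairs starting at offset."""
        chunk = self._records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


log = PartitionLog()
offsets = [log.append(value) for value in ["a", "b", "c", "d"]]   # [0, 1, 2, 3]

committed = 0                                  # nothing consumed yet
batch = log.read(committed, max_records=2)     # [(0, 'a'), (1, 'b')]
committed = batch[-1][0] + 1                   # commit the next offset to read: 2

pending = log.read(committed)                  # [(2, 'c'), (3, 'd')]
```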

Consumer Groups:

• Consumer Group: A consumer group is a collection of consumers working together to read
messages from partitions. Kafka ensures that each partition within a topic is read by only one
consumer within a group.

Handling Failures in Consumer Groups:

• When a consumer goes down, Kafka assigns the responsibility of consuming messages from the
affected partition to another consumer in the same consumer group. This ensures that the message
consumption continues seamlessly from the last committed offset.
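
Reassignment after a consumer failure can be sketched as recomputing the partition-to-consumer mapping over the surviving members. The `assign` function below uses a simple round-robin rule chosen for illustration; real Kafka clients use configurable assignment strategies coordinated by the brokers. The consumer names and offsets are made up.

```python
def assign(partitions, consumers):
    """Spread partitions over consumers round-robin; each partition gets one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
committed = {0: 12, 1: 7, 2: 30, 3: 4}     # committed offset per partition

before = assign(partitions, ["c1", "c2"])  # {'c1': [0, 2], 'c2': [1, 3]}

# c1 goes down: the group is rebalanced over the surviving consumers.
after = assign(partitions, ["c2"])         # {'c2': [0, 1, 2, 3]}

# c2 resumes each inherited partition from its committed offset.
resume_points = {partition: committed[partition] for partition in after["c2"]}
```

Because the committed offsets are stored per group and per partition rather than inside the failed consumer, the survivor knows where to continue.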

Kafka Cluster:

• Cluster: A Kafka cluster consists of multiple brokers running on different machines. Brokers
are Kafka servers responsible for storing and managing messages.

• Broker: A broker is a Kafka server that stores messages, handles topic partitions, and serves
messages to consumers.

• The cluster allows Kafka to scale horizontally for high availability and fault tolerance.

Kafka Architecture Overview

1. Brokers: Kafka is distributed across multiple brokers (e.g., Broker 1, Broker 2, Broker 3).

o Each broker stores data for specific partitions of topics.

2. Topics and Partitions:

o A topic (e.g., Topic 1) can have multiple partitions (e.g., Partition 0, Partition 1).

o Partitions of the same topic can be distributed across different brokers:

▪ Partition 0 may reside in Broker 1.

▪ Partition 1 may reside in Broker 2.

3. Leaders and Followers:

o Each partition has a leader and may have one or more replicas (followers).

▪ Leader: Handles all read and write operations for the partition.

▪ Followers: Maintain a copy of the leader's data and stay synchronized.

4. Fault Tolerance:

o If a leader goes down, one of its in-sync followers automatically becomes the new leader.

o Messages the followers had already replicated are preserved. How much can be lost depends on
the producer's acknowledgement setting (acks) and on how far the followers were behind.

5. Replication Mechanism:

o Followers constantly sync with the leader by copying new messages.

o Followers that keep up with the leader are called in-sync replicas; a follower may briefly lag
behind before it catches up.

6. Scaling with Brokers:

o The size of each broker is limited by its machine's capacity.

o To handle larger datasets or higher throughput, additional brokers are added.

7. Consumer Groups and Offsets:

o If a consumer in a consumer group fails, another consumer in the same group takes
over and continues processing from the last recorded offset.

o Consumption continues without manual intervention, although messages processed after the
last commit may be delivered again.
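
Leader/follower replication and failover (points 3 to 5 above) can be sketched in memory. `ReplicatedPartition` is a hypothetical class, and picking the surviving replica with the longest log is a simplification of Kafka's actual leader election, which the cluster controller performs among in-sync replicas.

```python
class ReplicatedPartition:
    """One partition with a leader and followers, each holding its own log copy."""

    def __init__(self, replica_brokers):
        self.logs = {broker: [] for broker in replica_brokers}
        self.leader = replica_brokers[0]

    def write(self, message):
        """Writes go to the leader only."""
        self.logs[self.leader].append(message)

    def sync_followers(self):
        """Each follower copies whatever it is missing from the leader."""
        leader_log = self.logs[self.leader]
        for broker, log in self.logs.items():
            if broker != self.leader:
                log.extend(leader_log[len(log):])

    def fail_leader(self):
        """Drop the leader and promote the surviving replica with the longest log."""
        del self.logs[self.leader]
        self.leader = max(self.logs, key=lambda broker: len(self.logs[broker]))


partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.write("m0")
partition.write("m1")
partition.sync_followers()     # followers now hold m0 and m1
partition.write("m2")          # leader fails before followers copy m2
partition.fail_leader()

print(partition.leader, partition.logs[partition.leader])  # broker-2 ['m0', 'm1']
```

In this run the replicated messages survive the failover, while `m2`, which no follower had copied yet, does not. Waiting for followers before acknowledging a write is how a producer avoids that window.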

Key Scenarios in Kafka:

1. Partition Leader Failure:

o If the broker hosting a partition's leader (e.g., Partition 0 on Broker 1) goes down, a follower
replica (e.g., on Broker 2) becomes the new leader.

2. Consumer Failure:

o If a consumer fails, another consumer in the same group starts processing from the last
committed offset.

3. Queue Size Limits:

o To handle data growth, Kafka scales horizontally by adding more brokers.

What Happens When a Kafka Consumer Fails to Process a Message?

1. Scenario:

o A consumer from a consumer group is reading messages from a partition.

o Let's say it reads a message (e.g., Message 7), which turns out to be buggy or
unprocessable.

2. Retry Mechanism:

o When the consumer commits offsets itself (auto-commit disabled), the committed offset does
not advance if the message fails to process. It remains at the last successfully processed
message (e.g., 6 in this case).

o Kafka does not redeliver the message on its own; the consumer application (or its framework)
retries processing Message 7 based on a retry policy:

▪ Retry 1: If it fails, try again.

▪ Retry 2: Another attempt.

▪ Retry n: Continue until the retry limit is reached.

3. Dead Letter Queue (DLQ):

o If the retries exceed the configured limit, the consumer moves the message to a Dead Letter
Queue (DLQ) or failure queue, which in Kafka is typically a separate topic.

o This ensures the faulty message does not block the partition or delay other messages.

4. Commit Offset Progression:

o After handling the failure (e.g., by moving the message to a DLQ), the consumer can
increase the committed offset and continue processing subsequent messages.

5. Post-Processing of Failed Messages:

o Messages in the DLQ can be reviewed manually or by another system.

o Once fixed, they can be reintroduced into the original topic/partition for processing.
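
The retry and dead-letter flow above can be sketched as a consumer-side loop. This is an in-memory illustration: `consume_with_retries`, the handler, and the list standing in for the DLQ are hypothetical, and a real consumer would publish failed messages to a separate topic instead.

```python
def consume_with_retries(messages, handler, committed_offset=0, max_retries=3):
    """Process messages in offset order; park messages that keep failing.

    Returns (new committed offset, dead_letter_queue).
    """
    dead_letter_queue = []
    for offset in range(committed_offset, len(messages)):
        message = messages[offset]
        last_error = None
        for _attempt in range(max_retries):
            try:
                handler(message)
                break                      # processed successfully, stop retrying
            except Exception as error:     # broad catch is acceptable for a sketch
                last_error = error
        else:
            # Every attempt failed: move the message aside so it does not block the rest.
            dead_letter_queue.append(
                {"offset": offset, "message": message, "error": str(last_error)}
            )
        # Either way the message has been dealt with, so the offset can advance.
        committed_offset = offset + 1
    return committed_offset, dead_letter_queue


processed = []


def handler(message):
    if message == "bad":
        raise ValueError("cannot parse message")
    processed.append(message)


new_offset, dlq = consume_with_retries(["a", "bad", "b"], handler)
print(new_offset, processed, dlq)
# 3 ['a', 'b'] [{'offset': 1, 'message': 'bad', 'error': 'cannot parse message'}]
```

Storing the original offset and the error alongside the message makes later review and replay easier.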

Kafka’s Pull-Based Approach

• Kafka consumers poll for new messages from the broker.

• The consumer periodically asks, "Do you have new data?" and fetches messages when
available.

• This allows consumers to control the rate at which they process messages.
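
The pull model can be sketched as a loop in which the consumer asks for the next batch and chooses the batch size itself. `Broker.fetch` and `poll_until_caught_up` are hypothetical names for this in-memory illustration; in the real Java client the corresponding call is the consumer's `poll` method, with the batch size capped by the `max.poll.records` setting.

```python
class Broker:
    """Holds the records of one partition and serves them on request."""

    def __init__(self, records):
        self._records = list(records)

    def fetch(self, offset, max_records):
        """Return up to max_records records starting at offset (may be empty)."""
        return self._records[offset:offset + max_records]


def poll_until_caught_up(broker, max_records):
    """Keep asking the broker for data; stop once nothing new is returned."""
    offset = 0
    batches = []
    while True:
        batch = broker.fetch(offset, max_records)   # "Do you have new data?"
        if not batch:
            break
        batches.append(batch)
        offset += len(batch)                        # advance past what was read
    return batches


batches = poll_until_caught_up(Broker(range(5)), max_records=2)
print(batches)  # [[0, 1], [2, 3], [4]]
```

A slow consumer can simply ask for smaller batches or poll less often; the broker never sends more than was requested.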

Comparison with RabbitMQ

1. Push-Based Approach:

o RabbitMQ uses a push-based approach.

o Messages are automatically pushed to consumers as they become available.

2. Exchange and Routing Keys:

o RabbitMQ introduces the concept of exchanges.

o Producers send messages to an exchange, which routes them to queues based on
routing keys or bindings.

o This allows fine-grained control over message routing to specific queues.

3. Faulty Message Handling:

o Similar to Kafka, RabbitMQ supports retries and DLQs.

o If a consumer cannot process a message and retries fail, the message is moved to a
dead-letter exchange.
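
Routing by key (point 2 above) can be sketched in memory. `DirectExchange` is a hypothetical class that mimics exact-match routing: a message reaches only the queues bound with the same routing key. It does not use RabbitMQ's client library, and the routing keys are made up.

```python
class DirectExchange:
    """Routes each message to the queues bound with a matching routing key."""

    def __init__(self):
        self._bindings = {}    # routing key -> list of bound queues

    def bind(self, routing_key):
        """Create a queue bound with this routing key and return it."""
        bound_queue = []
        self._bindings.setdefault(routing_key, []).append(bound_queue)
        return bound_queue

    def publish(self, routing_key, message):
        # Messages whose key matches no binding are dropped in this sketch.
        for bound_queue in self._bindings.get(routing_key, []):
            bound_queue.append(message)


exchange = DirectExchange()
email_queue = exchange.bind("notify.email")
sms_queue = exchange.bind("notify.sms")

exchange.publish("notify.email", "Order shipped")
exchange.publish("notify.sms", "Your code is 1234")

print(email_queue, sms_queue)  # ['Order shipped'] ['Your code is 1234']
```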
