
Apache Kafka

Apache Kafka is like a communication system that helps different parts of a computer system exchange data by publishing and subscribing to topics.

Why Kafka?
1. High throughput
2. Fault tolerance (replication)
3. Durability
4. Scalability

Kafka Core Concepts:

Producers: Send data to Kafka topics. Producers are the "senders" in Kafka. They're applications or systems that generate data, like a website logging user clicks or a sensor reporting temperature. They push this data into Kafka by sending it to specific topics (more on that below). Example: A shopping app (producer) sends "User bought item X" to Kafka.
Brokers: Store and manage data. Brokers are the Kafka servers, the "warehouses" that store and manage the data. A Kafka setup usually has multiple brokers working together (a cluster) to handle the load and ensure reliability. They receive data from producers, store it, and serve it to consumers. If one broker fails, others can take over, making Kafka fault-tolerant.

Topics & Partitions: Data is divided for scalability Topics are like categories or
channels where data is stored. Think of them as labeled mailboxes (e.g., "Orders,"
"Clicks," "Logs"). Producers send data to a topic, and consumers read from it. Partitions
split each topic into smaller chunks. This is Kafka’s trick for scalability: more partitions =
more parallel processing. Each partition lives on a broker and holds a subset of the
topic’s data in an ordered log (like a timeline of events). Example: The "Orders" topic
might have 3 partitions, each handling a slice of order data.
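
As a rough sketch, such a topic can also be created programmatically with the AdminClient from the same Java client library; the topic name, partition count, and replication factor below are just examples:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // "Orders" with 3 partitions, each copied to 2 brokers (illustrative values).
            NewTopic orders = new NewTopic("Orders", 3, (short) 2);
            admin.createTopics(Collections.singletonList(orders)).all().get();
        }
    }
}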

Consumers: Read and process data. Consumers are the "readers," applications or systems that pull data from Kafka topics to do something with it, like analyzing trends or updating a dashboard. They can work in groups (consumer groups) to split the workload. Each consumer in a group might read from different partitions, speeding things up. Example: A fraud detection system (consumer) reads from the "Orders" topic to spot suspicious activity.
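
A minimal sketch of such a consumer, assuming the "Orders" topic and a hypothetical "fraud-detection" consumer group; running several copies of this program with the same group.id splits the partitions among them:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FraudDetector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "fraud-detection");           // consumers sharing this id share the work
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("Orders"));
            while (true) {
                // Pull the next batch of records and inspect each one.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}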

ZooKeeper: Manages metadata and leader election. ZooKeeper is like Kafka's behind-the-scenes coordinator. It's a separate system that keeps track of metadata (e.g., which broker has which partition) and ensures the cluster runs smoothly. It also handles leader election: for each partition, one broker is the "leader" (handling reads/writes), and ZooKeeper picks a new leader if one fails.

How It All Fits Together (Kafka Architecture)

Imagine this flow:
1) Producers send data (e.g., "Order #123 placed") to a topic called "Orders."
2) The "Orders" topic is split into, say, 3 partitions (P0, P1, P2), each stored on a broker in the cluster (Broker 1, Broker 2, Broker 3).
3) Brokers store the data in these partitions as an ordered log and replicate it across the cluster for safety.
4) Consumers subscribe to the "Orders" topic. One consumer might read P0, another P1, and so on, processing the data in parallel.
5) ZooKeeper watches over everything, ensuring brokers know their roles and stepping in if a broker goes down.
Kafka Architecture
Distributed System: Multiple brokers work together
Replication: Ensures fault tolerance (leader-follower model)
Message Storage: Log-based storage for durability
Diagram: Kafka Cluster Architecture with Producers, Topics, Partitions, and Consumers
Kafka vs Traditional Messaging Systems

Here's a concise comparison of Kafka versus traditional messaging systems (like RabbitMQ, ActiveMQ, or JMS-based systems) to highlight what sets Kafka apart.
Key Differences
1) Purpose and Model
Kafka: Built for event streaming and data pipelines. It's a distributed log that stores data durably, letting consumers process it at their own pace.
Traditional Messaging: Designed for message queuing. Focuses on delivering messages from producers to consumers quickly, often deleting them once consumed.
2) Data Storage
Kafka: Stores data in topics (as logs) for a set time or size (e.g., days or weeks), even after it's consumed. Consumers can replay or reprocess old data (see the sketch after this list).
Traditional: Typically removes messages after delivery (e.g., in a queue).
3) Scalability
Kafka: Scales horizontally with partitions and brokers. Handles massive throughput (millions of messages per second) by distributing data across a cluster.
Traditional: Scales vertically (bigger servers) or with limited clustering. Better for smaller-scale, point-to-point messaging.
4) Throughput
Kafka: High-throughput, optimized for large data volumes and real-time streaming (e.g., logs, events).
Traditional: Lower throughput, optimized for discrete, transactional messages (e.g., "send order to warehouse").

Kafka Performance & Optimization

Kafka's performance and optimization stem from its design as a distributed, log-based system built for high throughput and low latency. It achieves blazing speed by writing data sequentially to disk (faster than random writes), leveraging the operating system's page cache for reads, and using a zero-copy mechanism to move data directly from disk to network without buffering. Partitioning topics across multiple brokers allows parallel processing, while replication ensures fault tolerance without sacrificing speed.
To optimize Kafka, you tune factors like partition count (more partitions = higher
parallelism), batch sizes (larger batches improve throughput), and memory
settings (e.g., JVM heap size), while balancing producer compression and
consumer fetch sizes to reduce network overhead. In short, Kafka’s efficiency
comes from smart architecture and fine-tuning to match your workload, making it
a beast for real-time, high-volume data handling.
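
As a rough sketch, here are a few of the knobs mentioned above expressed as Java client settings; the values are illustrative starting points, not recommendations, and should be tuned against your own workload:

import java.util.Properties;

public class TuningConfigs {
    public static void main(String[] args) {
        // Producer-side throughput knobs (illustrative values).
        Properties producerProps = new Properties();
        producerProps.put("batch.size", "65536");      // larger batches raise throughput
        producerProps.put("linger.ms", "10");          // wait briefly so batches fill up
        producerProps.put("compression.type", "lz4");  // trade a little CPU for less network I/O

        // Consumer-side fetch sizing to cut round trips (illustrative values).
        Properties consumerProps = new Properties();
        consumerProps.put("fetch.min.bytes", "1048576");            // wait for ~1 MB per fetch
        consumerProps.put("max.partition.fetch.bytes", "2097152");  // ~2 MB cap per partition

        System.out.println(producerProps + "\n" + consumerProps);
    }
}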

Kafka Use Cases

Apache Kafka is widely used across industries for real-time data processing. Here are
its key use cases:
Messaging System – Acts as a high-throughput, fault-tolerant message broker.
Log Aggregation – Collects and processes application logs in real time.
Event-Driven Microservices – Enables communication between microservices using event streams.
Real-Time Data Streaming – Processes and analyzes data in real time for decision-making.
Fraud Detection – Identifies fraudulent transactions in financial services.
Monitoring & Observability – Streams logs, metrics, and traces for system monitoring.
E-Commerce & Order Tracking – Tracks order status, inventory updates, and user activity.
IoT & Sensor Data Processing – Handles large-scale IoT device data in real time.
Machine Learning Pipelines – Streams data for training and deploying AI models.
Stock Market & Trading Platforms – Processes high-frequency market data in real time.
Cybersecurity & Threat Detection – Monitors network traffic for anomalies.
Customer Activity Tracking – Analyzes user behavior for personalized experiences.
Social Media Analytics – Processes social media data for trends and sentiment analysis.
Healthcare Data Processing – Streams patient data for real-time diagnosis and alerts.
Telecommunications & Call Data Analysis – Manages call records, network traffic, and billing.
Real-Time Chat Applications – Powers messaging platforms with low latency.
Video Streaming & Content Delivery – Manages media processing and recommendations.
Supply Chain & Logistics – Tracks shipments, inventory, and fleet management.

Kafka’s versatility makes it an essential tool for real-time data-driven applications across
multiple domains.

Summary of Apache Kafka: Kafka is a high-throughput, fault-tolerant, and distributed event streaming platform. It enables real-time data processing, messaging, and event-driven architecture. It is used in log aggregation, microservices communication, real-time analytics, fraud detection, IoT, and AI pipelines. It is scalable and durable, and it integrates well with cloud platforms and big data ecosystems. It supports the publish-subscribe model, stream processing (Kafka Streams), and external connectors (Kafka Connect).

Conclusion: Kafka is a powerful solution for handling large-scale real-time data streams. It is widely adopted in banking, e-commerce, social media, IoT, and AI applications. Future advancements, like tiered storage and cloud-native optimizations, will further enhance its capabilities. It is essential for businesses looking to build scalable, reliable, event-driven applications.
Basic Kafka Interview Questions

Let us begin with the basic Kafka interview questions!

1. What is the role of the offset?

In partitions, messages are assigned a unique ID number called the offset. Its role is to uniquely identify each message within the partition.
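
A short sketch of offsets in practice: a consumer can jump to an arbitrary offset in a partition and read from there. The topic, partition, offset 42, and class name below are illustrative assumptions:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("enable.auto.commit", "false");           // manual assignment, no group needed
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition p0 = new TopicPartition("Orders", 0);
            consumer.assign(Collections.singletonList(p0));
            consumer.seek(p0, 42L);  // start reading at offset 42 of partition 0
            consumer.poll(Duration.ofSeconds(1)).forEach(r ->
                    System.out.println("offset " + r.offset() + " -> " + r.value()));
        }
    }
}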

2. Can Kafka be used without ZooKeeper?

It is not possible to connect directly to the Kafka server by bypassing ZooKeeper: no client request can be serviced if ZooKeeper is down. (Note that recent Kafka versions can instead run in KRaft mode, which removes the ZooKeeper dependency entirely.)

3. In Kafka, why are replications critical?

Replication is critical because it ensures that published messages are not lost and can still be consumed in the event of any program or machine error.

4. What is a partitioning key?

The partitioning key indicates the destination partition of the message within the producer. A hashing-based partitioner determines the partition ID when the key is given.
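
A small sketch of keyed sends, assuming the "Orders" topic from earlier; the default hashing partitioner routes both records below to the same partition because they share a key:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key ("user-42") -> same partition, so these two events stay ordered.
            producer.send(new ProducerRecord<>("Orders", "user-42", "order placed"));
            producer.send(new ProducerRecord<>("Orders", "user-42", "order shipped"));
        }
    }
}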

5. What is the critical difference between Flume and Kafka?

Both are used for real-time processing, but Kafka offers greater durability and scalability.
6. When does QueueFullException occur in the producer?

QueueFullException occurs when the producer attempts to send messages at a pace the broker cannot handle.

7. What is a partition of a topic in Kafka Cluster?

A partition is a single piece of a Kafka topic. More partitions allow greater parallelism when reading from a topic. The number of partitions is configured per topic.

8. Explain Geo-replication in Kafka.

Kafka MirrorMaker provides geo-replication support for clusters. Messages are replicated across multiple datacenters or cloud regions. This can be used in active/passive scenarios for backup and recovery.

9. What do you mean by ISR in the Kafka environment?

ISR is the abbreviation for in-sync replicas: the set of replicas that are fully caught up with the partition leader.
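
As a sketch, the current leader and ISR of each partition can be inspected with the AdminClient; the topic name, broker address, and class name are assumptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class ShowIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("Orders"))
                    .all().get().get("Orders");
            // Print each partition's leader and its current in-sync replica set.
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}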

10. How can you get exactly-once messaging during data production?

To get exactly-once messaging, you have to do two things: avoid duplicates during data production and avoid duplicates during data consumption. One classic approach is to include a primary key in each message and de-duplicate on the consumer.
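
Beyond the primary-key-and-deduplicate approach, newer Kafka clients also support idempotent and transactional producers, which handle the production side of exactly-once. A minimal sketch, where the topic, key, transactional id, and class name are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("enable.idempotence", "true");        // broker drops duplicated retries
        props.put("transactional.id", "orders-tx-1");   // enables atomic, all-or-nothing sends

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("Orders", "order-123", "placed"));
            producer.commitTransaction();
        }
    }
}
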
11. How do consumers consume messages in Kafka?

Kafka transfers messages to consumers using the operating system's sendfile API (zero-copy). Bytes move from the broker's page cache to the network socket within kernel space, avoiding extra copies and context switches between kernel space and user space.

12. What is Zookeeper in Kafka?

One of the basic Kafka interview questions is about ZooKeeper. It is a high-performance, open-source coordination service for distributed applications that Kafka adopts. It lets Kafka manage its brokers and cluster metadata properly.

13. What is a replica in the Kafka environment?

A replica is one of the nodes that maintains a copy of the log for a particular partition. A replica can play the role of a follower or a leader.

14. What do leader and follower mean in Kafka?

Each partition in Kafka has one server acting as the leader and zero or more servers acting as followers. The leader handles all read and write requests for the partition, while the followers replicate the leader's log. If the leader fails, one of the followers takes over as the new leader.

15. Name various components of Kafka.

The main components are:

1. Producer – produces messages and can publish to a specific topic
2. Topic – a bunch of messages that come under the same category
3. Consumer – subscribes to different topics and consumes the published data
4. Broker – a server that acts as a channel between producers and consumers

16. Why is Kafka so popular?

Kafka acts as the central nervous system that makes streaming data available to applications. It builds real-time data pipelines that process data and transfer it between the different systems that need it.

17. What are consumers in Kafka?

Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing group. Kafka's single consumer abstraction generalizes both publish-subscribe and queuing.

18. What is a consumer group?

When more than one consumer consumes a bunch of subscribed topics jointly, it forms
a consumer group.

19. How is a Kafka Server started?

To start a Kafka server, ZooKeeper has to be powered up first, followed by the Kafka broker:

> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties

20. How does Kafka work?

Kafka combines two messaging models: queuing and publish-subscribe. Queuing lets a group of consumer instances divide up the processing of a topic, while publish-subscribe makes every message available to multiple consumer groups; Kafka's consumer groups provide both at once.

21. Why is replication critical in Kafka?

Replication ensures that published messages are not lost and can still be consumed in the event of a program fault, machine error, or routine software upgrade.

22. What role does the Kafka Producer API play?

It covers two producers: kafka.producer.async.AsyncProducer and kafka.producer.SyncProducer. The API exposes all producer functionality to its clients through a single interface.

23. Discuss the architecture of Kafka.

A cluster in Kafka contains multiple brokers as the system is distributed. The topic in
the system is divided into multiple partitions. Each broker stores one or multiple
partitions so that consumers and producers can retrieve and publish messages
simultaneously.

24. What advantages does Kafka have over Flume?

Kafka is not explicitly developed for Hadoop, and using it there for writing and reading data is trickier than with Flume. However, Kafka is a highly reliable and scalable system used to connect multiple systems, including Hadoop.

25. What are the benefits of using Kafka?

Kafka has the following advantages:

1. Scalable – data is streamed over a cluster of machines and partitioned to handle large volumes of information.
2. Fast – Kafka brokers can serve thousands of clients.
3. Durable – messages are replicated in the cluster to prevent record loss.
4. Distributed – provides robustness and fault tolerance.
