Kafka
Why Kafka?
1. High Throughput
2. Fault Tolerance (Replication)
3. Durability
4. Scalability
Architecture:
Producers: Send data to Kafka topics. Producers are the "senders" in Kafka. They're applications or systems that generate data, like a website logging user clicks or a sensor reporting temperature. They push this data into Kafka by sending it to specific topics (more on that below). Example: A shopping app (producer) sends "User bought item X" to Kafka.
Brokers: Store and manage data. Brokers are the Kafka servers—the "warehouses" that store and manage the data. A Kafka setup usually has multiple brokers working together (a cluster) to handle the load and ensure reliability. They receive data from producers, store it, and serve it to consumers. If one broker fails, others can take over, making Kafka fault-tolerant.
Topics & Partitions: Data is divided for scalability. Topics are like categories or channels where data is stored. Think of them as labeled mailboxes (e.g., "Orders," "Clicks," "Logs"). Producers send data to a topic, and consumers read from it. Partitions split each topic into smaller chunks. This is Kafka's trick for scalability: more partitions = more parallel processing. Each partition lives on a broker and holds a subset of the topic's data in an ordered log (like a timeline of events). Example: The "Orders" topic might have 3 partitions, each handling a slice of order data.
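As a rough illustration, the topic/partition/offset model above can be sketched as a toy in-memory structure (the `ToyTopic` class and its methods are made up for this sketch; real Kafka stores partitions as replicated on-disk logs):

```python
# Toy in-memory model of a Kafka topic (illustration only, not real Kafka):
# a topic is a set of partitions, each partition an append-only ordered log,
# and every appended message gets a sequential offset within its partition.

class ToyTopic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition_id, message):
        """Append a message; its offset is its position in the partition log."""
        log = self.partitions[partition_id]
        offset = len(log)          # offsets run 0, 1, 2, ... per partition
        log.append(message)
        return offset

orders = ToyTopic("Orders", num_partitions=3)
print(orders.append(0, "User bought item X"))  # first message in partition 0
print(orders.append(0, "User bought item Y"))  # next offset in partition 0
print(orders.append(1, "User bought item Z"))  # partition 1 has its own offsets
```

Note that offsets are independent per partition: ordering is guaranteed only within a partition, not across the whole topic.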
Apache Kafka is widely used across industries for real-time data processing. Here are
its key use cases:
Messaging System – Acts as a high-throughput, fault-tolerant message broker.
Log Aggregation – Collects and processes application logs in real time.
Event-Driven Microservices – Enables communication between microservices using event streams.
Real-Time Data Streaming – Processes and analyzes data in real time for decision-making.
Fraud Detection – Identifies fraudulent transactions in financial services.
Monitoring & Observability – Streams logs, metrics, and traces for system monitoring.
E-Commerce & Order Tracking – Tracks order status, inventory updates, and user activity.
IoT & Sensor Data Processing – Handles large-scale IoT device data in real time.
Machine Learning Pipelines – Streams data for training and deploying AI models.
Stock Market & Trading Platforms – Processes high-frequency market data in real time.
Cybersecurity & Threat Detection – Monitors network traffic for anomalies.
Customer Activity Tracking – Analyzes user behavior for personalized experiences.
Social Media Analytics – Processes social media data for trends and sentiment analysis.
Healthcare Data Processing – Streams patient data for real-time diagnosis and alerts.
Telecommunications & Call Data Analysis – Manages call records, network traffic, and billing.
Real-Time Chat Applications – Powers messaging platforms with low latency.
Video Streaming & Content Delivery – Manages media processing and recommendations.
Supply Chain & Logistics – Tracks shipments, inventory, and fleet management.
Kafka’s versatility makes it an essential tool for real-time data-driven applications across
multiple domains.
Within a partition, each message is assigned a unique, sequential ID number called the offset. Its role is to uniquely identify each message within that partition.
It is not possible to bypass ZooKeeper and connect directly to the Kafka server. If ZooKeeper is down, no client request can be serviced.
Replication is critical because it ensures that published messages are not lost and can still be consumed in the event of a program error or machine failure.
The partitioning key indicates the destination partition of the message within the producer. A hash-based partitioner determines the partition ID when a key is given.
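A minimal sketch of key-based partitioning (Kafka's default producer partitioner actually uses a murmur2 hash of the key bytes; the `choose_partition` helper below substitutes an MD5-based hash purely for illustration):

```python
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition ID deterministically."""
    # Stand-in hash (MD5) for illustration; real Kafka producers use murmur2.
    h = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return h % num_partitions

# The same key always maps to the same partition, which is what preserves
# per-key ordering: all events for "user-42" land in one partition's log.
p1 = choose_partition(b"user-42", 3)
p2 = choose_partition(b"user-42", 3)
assert p1 == p2
```

The design consequence is worth noting: changing the number of partitions changes the key-to-partition mapping, which is why partition counts are usually chosen up front.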
Kafka offers greater durability and scalability, even though both systems are used for real-time processing.
6. When does QueueFullException occur in the producer?
QueueFullException occurs when the producer attempts to send messages at a pace the broker cannot handle, so the producer's queue of pending messages fills up. Since the producer does not block, adding more brokers to share the load helps resolve this.
A partition is a single piece of a Kafka topic. More partitions allow greater parallelism when reading from topics. The number of partitions is configured per topic.
The Kafka MirrorMaker provides geo-replication support for clusters. With it, messages are replicated across multiple datacenters or cloud regions. This can be used in active/passive scenarios for backup and recovery.
ISR is the abbreviation of in-sync replicas. They are the set of replicas that are fully caught up (in sync) with the partition leader; only replicas in the ISR are eligible to be elected leader.
10. How can you get exactly-once messaging during data production?
To get exactly-once messaging from Kafka, you have to address two things: avoiding duplicates during data production and avoiding duplicates during data consumption. For this, include a primary key (a unique ID) in each message and de-duplicate on the consumer.
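A minimal sketch of the consumer-side de-duplication step, assuming each message carries its primary key in an `id` field (both the field name and the `deduplicate` helper are invented for this sketch):

```python
def deduplicate(messages):
    """Keep only the first message seen for each primary key."""
    seen = set()
    out = []
    for msg in messages:
        key = msg["id"]            # primary key embedded in the message
        if key not in seen:
            seen.add(key)
            out.append(msg)
    return out

# A redelivery of id=1 (e.g., after a producer retry) is dropped.
batch = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
print(deduplicate(batch))
```

In a real deployment the `seen` set would have to be persisted (or bounded by a time window), since a consumer restart would otherwise forget which keys it has processed.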
11. How do consumers consume messages in Kafka?
Consumers fetch messages from the brokers, and Kafka transfers the bytes using the sendfile API. With sendfile, data moves directly within kernel space (zero-copy), avoiding the extra copies back and forth between kernel space and user space.
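The zero-copy transfer described above can be demonstrated with Python's `os.sendfile` wrapper around the same system call (a simplified file-to-file copy, not Kafka's actual broker code; this assumes Linux, where sendfile has accepted a regular file as the destination since kernel 2.6.33):

```python
import os
import tempfile

# Write a small source file to copy from.
src = tempfile.NamedTemporaryFile(delete=False)
src.write(b"kafka zero-copy demo")
src.close()

dst_path = src.name + ".copy"
with open(src.name, "rb") as fin, open(dst_path, "wb") as fout:
    # The bytes are transferred kernel-to-kernel; user space never
    # touches the payload, which is the point of zero-copy.
    sent = os.sendfile(fout.fileno(), fin.fileno(), 0,
                       os.path.getsize(src.name))

print(sent)  # number of bytes transferred
```

Kafka brokers use the same mechanism when serving consumer fetches from the partition log, which is a large part of why consumption is so cheap.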
One of the basic Kafka interview questions is about ZooKeeper. It is a high-performance, open-source coordination service for distributed applications, used by Kafka. It lets Kafka manage its cluster resources properly.
The replica list is the set of nodes that replicate the log for a particular partition. Each replica can play the role of either a follower or the leader.
Each partition in Kafka is replicated across multiple servers. One server serves as the leader for the partition, and one or more servers act as followers. The leader handles all read and write requests for the partition, while the followers replicate the leader's log.
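A toy sketch of the leader/follower write path for a single partition (the `ToyPartition` class is invented for illustration; real followers fetch from the leader asynchronously rather than being written to synchronously):

```python
# Toy model: the leader takes the write, followers mirror the leader's log.
class ToyPartition:
    def __init__(self, num_followers):
        self.leader_log = []
        self.follower_logs = [[] for _ in range(num_followers)]

    def write(self, message):
        # The leader handles the write request...
        self.leader_log.append(message)
        # ...and each follower replicates what the leader appended.
        for log in self.follower_logs:
            log.append(message)

p = ToyPartition(num_followers=2)
p.write("order-1")
p.write("order-2")
# Every follower mirrors the leader, so any of them could take over on failure.
assert all(log == p.leader_log for log in p.follower_logs)
```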
Kafka acts as the central nervous system that makes streaming data available to
applications. It builds real-time data pipelines responsible for data processing and
transferring between different systems that need to use it.
Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within every subscribing group. This single consumer-group abstraction generalizes both publish-subscribe and queuing.
When more than one consumer jointly consumes a set of subscribed topics, they form a consumer group.
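A simplified sketch of how a consumer group might share a topic's partitions (Kafka's group coordinator actually uses pluggable assignors such as range, round-robin, or sticky; the `assign_round_robin` helper below is a made-up round-robin version):

```python
def assign_round_robin(partitions, consumers):
    """Spread a topic's partitions across the members of a consumer group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions shared by a 2-member group: each member owns 3 partitions,
# and each partition is consumed by exactly one member of the group.
print(assign_round_robin(list(range(6)), ["c1", "c2"]))
```

This is why the partition count caps a group's parallelism: with more consumers than partitions, the extra consumers sit idle.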
To start a Kafka server, ZooKeeper has to be started first, using the following steps:
1. Start ZooKeeper: bin/zookeeper-server-start.sh config/zookeeper.properties
2. Start the Kafka server: bin/kafka-server-start.sh config/server.properties
Kafka combines two messaging models, queuing and publish-subscribe, making each available to several consumer instances.
This is because replication assures that published messages are not lost and can still be consumed in the event of a broker fault, machine error, or periodic software upgrades.
A Kafka cluster contains multiple brokers, as the system is distributed. Each topic in the system is divided into multiple partitions, and each broker stores one or more of those partitions so that consumers and producers can retrieve and publish messages simultaneously.