Data and AI Kafka Overview
Kafka
Overview
Shwetank Singh
GritSetGrow - GSGLearn.com
gsglearn.com
Kafka Introduction
Apache Kafka is a high performance, highly available, and redundant streaming message platform.
Kafka functions much like a publish/subscribe messaging system, but with better throughput, built-in partitioning,
replication, and fault tolerance. Kafka is a good solution for large-scale message processing applications. It is often
used in tandem with Apache Hadoop and Spark Streaming.
You might think of a log as a time-sorted file or data table. Newer entries are appended to the log over time, from left
to right. The log entry number is a convenient replacement for a timestamp.
Kafka integrates this unique abstraction with traditional publish/subscribe messaging concepts (such as producers,
consumers, and brokers), parallelism, and enterprise features for improved performance and fault tolerance.
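To make the log abstraction concrete, here is a minimal sketch (illustration only, not Kafka code) of an append-only log in Java, where each entry's index serves as its offset:

import java.util.ArrayList;
import java.util.List;

// Illustration only: an append-only log where each entry's index acts as its offset.
public class AppendOnlyLog {
    private final List<String> entries = new ArrayList<>();

    // Append a new entry to the tail of the log and return its offset.
    public long append(String entry) {
        entries.add(entry);
        return entries.size() - 1;
    }

    // Read the entry stored at a given offset; older offsets stay readable.
    public String read(long offset) {
        return entries.get((int) offset);
    }

    public static void main(String[] args) {
        AppendOnlyLog log = new AppendOnlyLog();
        long first = log.append("page-view: /home");
        long second = log.append("search: kafka");
        System.out.println(first + " -> " + log.read(first));   // 0 -> page-view: /home
        System.out.println(second + " -> " + log.read(second)); // 1 -> search: kafka
    }
}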
The original use case for Kafka was to track user behavior on websites. Site activity (page views, searches, or other
actions users might take) is published to central topics, with one topic per activity type.
Kafka can be used to monitor operational data, aggregating statistics from distributed applications to produce
centralized data feeds. It also works well for log aggregation, with low latency and convenient support for multiple
data sources.
Kafka provides the following:
• Persistent messaging with O(1) disk structures, meaning that disk access time is independent of the volume of
stored data. Performance remains constant, even with terabytes of stored messages.
• High throughput, supporting hundreds of thousands of messages per second, even with modest hardware.
• Explicit support for partitioning messages over Kafka servers. It distributes consumption over a cluster of
consumer machines while maintaining per-partition ordering of the message stream.
• Support for parallel data load into Hadoop.
Kafka Architecture
Learn about Kafka's architecture and how it compares to an ideal publish-subscribe system.
The ideal publish-subscribe system is straightforward: Publisher A’s messages must make their way to Subscriber A,
Publisher B’s messages must make their way to Subscriber B, and so on.
Figure 1: Ideal Publish-Subscribe System
Kafka's architecture, however, deviates from this ideal system. Some of the key differences are:
• Messaging is implemented on top of a replicated, distributed commit log.
• The client has more functionality and, therefore, more responsibility.
• Messaging is optimized for batches instead of individual messages.
• Messages are retained even after they are consumed; they can be consumed again.
The results of these design decisions are:
• Extreme horizontal scalability
• Very high throughput
• High availability
• Different semantics and message delivery guarantees
Kafka Terminology
Kafka uses its own terminology for its basic building blocks and key concepts, and the usage of these terms might
differ from that in other technologies. The following list defines the most important Kafka concepts:
Broker
A broker is a server that stores messages sent to the topics and serves consumer requests.
Topic
A topic is a queue of messages written by one or more producers and read by one or more
consumers.
Producer
A producer is an external process that sends records to a Kafka topic.
Consumer
A consumer is an external process that receives topic streams from a Kafka cluster.
Client
Client is a term used to refer to either producers or consumers.
Record
A record is a publish-subscribe message. A record consists of a key/value pair and metadata
including a timestamp.
Partition
Kafka divides records into partitions. Partitions can be thought of as a subset of all the records for a
topic.
Continue reading to learn more about each key concept.
Brokers
Learn more about Brokers.
Kafka is a distributed system that implements the basic features of an ideal publish-subscribe system. Each host in the
Kafka cluster runs a server called a broker that stores messages sent to the topics and serves consumer requests.
Figure 2: Brokers in a Publish-Subscribe System
Kafka is designed to run on multiple hosts, with one broker per host. If a host goes offline, Kafka does its best to
ensure that the other hosts continue running. This solves part of the “No Downtime” and “Unlimited Scaling” goals of
the ideal publish-subscribe system.
Kafka brokers all talk to Zookeeper for distributed coordination, which also plays a key role in achieving the
"Unlimited Scaling" goal from the ideal system.
Topics are replicated across brokers. Replication is an important part of “No Downtime”, “Unlimited Scaling,” and
“Message Retention” goals.
There is one broker that is responsible for coordinating the cluster. That broker is called the controller.
Topics
Learn more about Kafka topics.
In any publish-subscribe system, messages from one publisher, called producers in Kafka, have to find their way to
the subscribers, called consumers in Kafka. To achieve this, Kafka introduces the concept of topics, which allow for
easy matching between producers and consumers.
A topic is a queue of messages that share similar characteristics. For example, a topic might consist of instant
messages from social media or navigation information for users on a web site. Topics are written by one or more
producers and read by one or more consumers. A topic is identified by its name. This name is part of a global
namespace of that Kafka cluster.
As each producer or consumer connects to the publish-subscribe system, it can read from or write to a specific topic.
Figure 3: Topics in a Publish-Subscribe System
Records
Learn more about Kafka records.
In Kafka, a publish-subscribe message is called a record. A record consists of a key/value pair and metadata including
a timestamp. The key is not required, but can be used to identify messages from the same data source. Kafka stores
keys and values as arrays of bytes. It does not otherwise care about the format.
The metadata of each record can include headers. Headers may store application-specific metadata as key-value pairs.
In the context of the header, keys are strings and values are byte arrays.
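As an illustration of this structure, the following sketch builds and sends a record with a key, a value, and one application-specific header using the Java producer client. The broker address, topic name, and header name are placeholder assumptions:

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RecordExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");               // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key and value are serialized to byte arrays; Kafka does not interpret their format.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("page-views", "user-42", "/home");  // placeholder topic, key, value

            // Header keys are strings, header values are byte arrays.
            record.headers().add("source", "web-frontend".getBytes(StandardCharsets.UTF_8));

            producer.send(record);
        }
    }
}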
For specific details of the record format, see Apache Kafka documentation.
Related Information
Record Format
Partitions
Learn more about Kafka partitions.
Instead of all records handled by the system being stored in a single log, Kafka divides records into partitions.
Partitions can be thought of as a subset of all the records for a topic. Partitions help with the ideal of “Unlimited
Scaling”.
Records in the same partition are stored in order of arrival.
When a topic is created, it is configured with two properties:
partition count
The number of partitions that records for this topic will be spread among.
replication factor
The number of copies of a partition that are maintained to ensure consumers always have access to
the queue of records for a given topic.
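Both properties are supplied when the topic is created, for example through the Kafka AdminClient. The following is a sketch; the broker address, topic name, and the values of 6 partitions and replication factor 3 are illustrative assumptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread the records; replication factor 3 keeps two extra copies of each partition.
            NewTopic topic = new NewTopic("page-views", 6, (short) 3);  // placeholder topic name
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}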
Each topic has one leader partition. If the replication factor is greater than one, there will be additional follower
partitions. (For a replication factor of M, there are M-1 follower partitions.)
Any Kafka client (a producer or consumer) communicates only with the leader partition for data. All other partitions
exist for redundancy and failover. Follower partitions are responsible for copying new records from their leader
partitions. Ideally, the follower partitions have an exact copy of the contents of the leader. Such partitions are called
in-sync replicas (ISR).
With N brokers and a topic replication factor of M, each partition is stored on M of the N brokers: one broker hosts
the leader partition and M-1 brokers host its followers.
Partitions are the key to keeping good record throughput. Choosing the correct number of partitions and partition
replicas for a topic:
• Spreads leader partitions evenly on brokers throughout the cluster
• Keeps partitions within the same topic roughly the same size
• Balances the load on brokers.
Tip: Kafka guarantees that records in the same partition will be in the same order in all replicas of that
partition.
If the order of records is important, the producer can ensure that records are sent to the same partition. The producer
can include metadata in the record to override the default assignment in one of two ways:
• The record can indicate a specific partition.
• The record can include an assignment key.
The hash of the key and the number of partitions in the topic determines which partition the record is assigned to.
Including the same key in multiple records ensures all the records are appended to the same partition.
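The following sketch mirrors how the default partitioner maps a keyed record to a partition, using hash utilities from the Kafka clients library. The partition count and key are illustrative assumptions, and the sketch ignores the separate handling of records without keys:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyPartitioningExample {
    public static void main(String[] args) {
        int numPartitions = 6;                                    // partitions in a hypothetical topic
        byte[] key = "user-42".getBytes(StandardCharsets.UTF_8);  // placeholder key

        // The hash of the key modulo the partition count picks the partition,
        // so records with the same key always land in the same partition.
        int partition = Utils.toPositive(Utils.murmur2(key)) % numPartitions;
        System.out.println("Key user-42 maps to partition " + partition);
    }
}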
Note: These references to “log” should not be confused with where the Kafka broker stores their operational
logs.
In actuality, each partition does not keep all the records sequentially in a single file. Instead, it breaks each log into
log segments. Log segments can be defined using a size limit (for example, 1 GB), a time limit (for example, 1
day), or both. Administration around Kafka records often occurs at the log segment level.
Each of the partitions is broken into segments, with Segment N containing the most recent records and Segment 1
containing the oldest retained records. This is configurable on a per-topic basis.
Figure 6: Partition Log Segments
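Segment size and age limits are set per topic through the segment.bytes and segment.ms configurations. The following AdminClient sketch applies the 1 GB and 1 day examples above; the broker address and topic name are placeholder assumptions:

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SegmentConfigExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "page-views");
            // Roll a new segment at 1 GB or after 1 day, whichever comes first.
            AlterConfigOp sizeLimit = new AlterConfigOp(
                new ConfigEntry("segment.bytes", "1073741824"), AlterConfigOp.OpType.SET);
            AlterConfigOp timeLimit = new AlterConfigOp(
                new ConfigEntry("segment.ms", "86400000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Collections.singletonMap(
                topic, Arrays.asList(sizeLimit, timeLimit))).all().get();
        }
    }
}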
Related Information
Log-structured file system
Kafka relies on ZooKeeper for distributed coordination, so the connection between the brokers and the ZooKeeper
cluster needs to be reliable. Similarly, if the ZooKeeper cluster has other intensive processes running on it, that can
add sufficient latency to the broker/ZooKeeper interactions to cause issues.
• Kafka Controller maintains leadership through Zookeeper (shown in orange)
• Kafka Brokers also store other relevant metadata in Zookeeper (also in orange)
• Kafka Partitions maintain replica information in Zookeeper (shown in blue)
Figure 7: Broker/ZooKeeper Dependencies
Consider the following example, which shows a simplified version of a Kafka cluster in steady state. There are N
brokers and two topics with nine partitions each. Replicated partitions are not shown for simplicity.
Figure 8: Kafka Cluster in Steady State
In this example, each broker shown has three partitions per topic and the Kafka cluster has well balanced leader
partitions. Recall the following:
• Producer writes and consumer reads occur at the partition level.
• Leader partitions are responsible for ensuring that the follower partitions keep their records in sync.
Since the leader partitions are evenly distributed, most of the time the load to the overall Kafka cluster is relatively
balanced.
Leader Positions
Now let's look at an example where a large portion of the leaders for Topic A and Topic B are on Broker 1.
Figure 9: Kafka Cluster with Leader Partition Imbalance
In a scenario like this, much more of the overall Kafka workload occurs on Broker 1. This causes a backlog of
work, which slows down cluster throughput, which in turn worsens the backlog. Even if a cluster starts with
perfectly balanced topics, broker failures can cause these imbalances: if the leader of a partition goes down, one
of the replicas becomes the leader. When the original (preferred) leader comes back, it regains leadership only if
automatic leader rebalancing is enabled; otherwise the node becomes a replica and the cluster remains imbalanced.
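Automatic leader rebalancing is controlled on the broker side. A sketch of the relevant server.properties entries, with illustrative values, is shown below:

# Re-elect the preferred (original) leader automatically once it is back in sync.
auto.leader.rebalance.enable=true
# How often the controller checks for leader imbalance, and the per-broker
# imbalance percentage that triggers a rebalance.
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10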
In-Sync Replicas
Let’s take a closer look at Topic A from the previous example that had imbalanced leader partitions. However, this
time let's visualize follower partitions as well:
• Broker 1 has six leader partitions, broker 2 has two leader partitions, and broker 3 has one leader partition.
• The replication factor is assumed to be 3.
Figure 10: Kafka Topic with Leader and Follower Partitions
Assuming all replicas are in-sync, then any leader partition can be moved from Broker 1 to another broker without
issue. However, in the case where some of the follower partitions have not caught up, then the ability to change
leaders or have a leader election will be hampered.
Kafka FAQ
A collection of frequently asked questions on the topic of Kafka.
Basics
A collection of frequently asked questions on the topic of Kafka aimed for beginners.
What is Kafka?
Kafka is a streaming message platform. Breaking it down a bit further:
“Streaming”: Lots of messages (think tens or hundreds of thousands) being sent frequently by publishers
("producers") and polled frequently by lots of subscribers ("consumers").
“Message”: From a technical standpoint, a key/value pair. From a non-technical standpoint, a relatively small number
of bytes (think hundreds to a few thousand bytes).
If this isn’t your planned use case, Kafka may not be the solution you are looking for. Contact your favorite Cloudera
representative to discuss and find out. It is better to understand what you can and cannot do upfront than to go ahead
based on enthusiastic but generic vendor messaging and end up with a solution that does not meet your expectations.
What is Kafka not well suited for (or what are the tradeoffs)?
It’s very easy to get caught up in all the things that Kafka can be used for without considering the tradeoffs. Kafka
configuration is also not automatic. You need to understand each of your use cases to determine which configuration
properties can be used to tune (and retune!) Kafka for each use case.
Some more specific examples where you need to be deeply knowledgeable and careful when configuring are:
What’s a good size of a Kafka record if I care about performance and stability?
There is an older blog post from 2014 from LinkedIn titled: Benchmarking Apache Kafka: 2 Million Writes Per
Second (On Three Cheap Machines). In the “Effect of Message Size” section, you can see two charts which indicate
that Kafka throughput starts to be affected as record size grows from 100 bytes to 1,000 bytes, bottoming out around
10,000 bytes. In general, keeping topics specific and keeping message sizes deliberately small helps you get the most
out of Kafka.
Excerpting from Deploying Apache Kafka: A Practical FAQ:
• If shared storage is available (HDFS, S3, NAS), place the large payload on shared storage and use Kafka just to send a message with the
payload location.
• Handle large messages by chopping them into smaller parts before writing into Kafka, using a message key to make sure all the parts are
written to the same partition so that they are consumed by the same Consumer, and re-assembling the large message from its parts when
consuming.
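A rough sketch of the second approach, chopping a large payload into parts that share a message key so that every part lands in the same partition. The broker address, topic, key, payload, and chunk size are illustrative assumptions:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ChunkedSendExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        byte[] largePayload = new byte[5 * 1024 * 1024];   // stand-in for a 5 MB payload
        int chunkSize = 512 * 1024;                        // 512 KB parts
        String messageKey = "document-123";                // same key, so same partition

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            for (int start = 0; start < largePayload.length; start += chunkSize) {
                int end = Math.min(start + chunkSize, largePayload.length);
                byte[] part = Arrays.copyOfRange(largePayload, start, end);
                // Every part carries the same key, so all parts are appended to the
                // same partition in order and are consumed by the same consumer.
                producer.send(new ProducerRecord<>("large-docs", messageKey, part));
            }
        }
    }
}

The consuming application would read the parts back in offset order from that partition and concatenate them to rebuild the original payload.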
Use cases
A collection of frequently asked questions on the topic of Kafka aimed for advanced users.
Like most Open Source projects, Kafka provides a lot of configuration options to maximize performance. In some
cases, it is not obvious how best to map your specific use case to those configuration options. We attempt to address
some of those situations.
1. The kernel must be configured for the maximum I/O usage that Kafka requires:
a. Large page cache
b. Maximum file descriptors
c. Maximum file memory map limits
2. Kafka JVM configuration settings:
a. Brokers generally don't need more than 4 GB to 8 GB of heap space.
b. Run with the G1 garbage collector (-XX:+UseG1GC) using Java 8 or later.
How can I configure Kafka to ensure that events are stored reliably?
The following recommendations for Kafka configuration settings make it extremely difficult for data loss to occur.
• Producer
• block.on.buffer.full=true
• retries=Long.MAX_VALUE
• acks=all
• max.in.flight.requests.per.connection=1
• Remember to close the producer when it is finished or when there is a long pause.
• Broker
• Topic replication.factor >= 3
• min.insync.replicas=2
• Disable unclean leader election
• Consumer
• Disable automatic offset commits (enable.auto.commit=false)
• Commit offsets after messages are processed by your consumer client(s).
If you have more than 3 hosts, you can increase the broker settings appropriately on topics that need more protection
against data loss.
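As a sketch of the producer-side settings above (the broker address, topic, and record contents are placeholder assumptions; the broker- and consumer-side settings from the list are applied separately in topic configuration and in the consumer client):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");             // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Reliability settings from the list above.
        props.put("acks", "all");                                   // wait for all in-sync replicas
        props.put("retries", Integer.MAX_VALUE);                    // retry effectively forever
        props.put("max.in.flight.requests.per.connection", 1);      // preserve ordering across retries

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        try {
            producer.send(new ProducerRecord<>("critical-events", "key", "value"));
        } finally {
            // Close the producer when finished so buffered records are flushed.
            producer.close();
        }
    }
}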
Once I’ve followed all the previous recommendations, my cluster should never lose data, right?
Kafka does not ensure that data loss never occurs. There are the following tradeoffs:
• Throughput vs. reliability. For example, the higher the replication factor, the more resilient your setup will be
against data loss. However, to make those extra copies takes time and can affect throughput.
• Reliability vs. free disk space. Extra copies due to replication use up disk space that would otherwise be used for
storing events.
Beyond the above design tradeoffs, there are also the following issues:
• To ensure events are consumed you need to monitor your Kafka brokers and topics to verify sufficient
consumption rates are sustained to meet your ingestion requirements.
• Ensure that replication is enabled on any topic that requires consumption guarantees. This protects against Kafka
broker failure and host failure.
• Kafka is designed to store events for a defined duration after which the events are deleted. You can increase the
duration that events are retained up to the amount of supporting storage space.
• You will always run out of disk space unless you add more nodes to the cluster.
If you need a strict global ordering of events, there are only two options:
• Your topic must consist of one partition (although a higher replication factor could still be useful for redundancy
and failover). However, this results in very limited message throughput.
• You configure your topic with a small number of partitions and perform the ordering after the consumer has
pulled the data. This does not result in guaranteed ordering, but, given a large enough time window, will likely be
equivalent.
Conversely, it is best to take Kafka’s partitioning design into consideration when designing your Kafka setup rather
than rely on global ordering of events.
How do I size my topic? Alternatively: What is the “right” number of partitions for a topic?
Choosing the proper number of partitions for a topic is key to achieving a high degree of parallelism with respect to
writes and reads and to distributing load. Evenly distributed load over partitions is a key factor for good throughput
(avoiding hot spots). Making a good decision requires estimation based on the desired throughput of producers and
consumers per partition.
For example, if you want to be able to read 1 GB/sec, but your consumer is only able to process 50 MB/sec, then
you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same
for producers, and 1 producer can only write at 100 MB/sec, you need 10 partitions. In this case, if you have 20
partitions, you can maintain 1 GB/sec for producing and consuming messages. You should adjust the exact number of
partitions to number of consumers or producers, so that each consumer and producer achieve their target throughput.
So a simple formula could be:
#Partitions = max(NP, NC)
where:
• NP is the number of required producers determined by calculating: TT/TP
• NC is the number of required consumers determined by calculating: TT/TC
• TT is the total expected throughput for our system
• TP is the max throughput of a single producer to a single partition
• TC is the max throughput of a single consumer from a single partition
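Plugging the figures from the earlier example into this formula gives the following small sketch of the arithmetic (the throughput numbers are only illustrative):

public class PartitionCountEstimate {
    public static void main(String[] args) {
        double tt = 1000.0;  // TT: total expected throughput, MB/s (1 GB/s)
        double tp = 100.0;   // TP: max throughput of one producer to one partition, MB/s
        double tc = 50.0;    // TC: max throughput of one consumer from one partition, MB/s

        int np = (int) Math.ceil(tt / tp);    // producers needed: 10
        int nc = (int) Math.ceil(tt / tc);    // consumers needed: 20
        int partitions = Math.max(np, nc);    // suggested starting point: 20 partitions

        System.out.println("Suggested partition count: " + partitions);
    }
}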
This calculation gives you a rough indication of the number of partitions. It's a good place to start. Keep in mind the
following considerations for improving the number of partitions after you have your system in place:
• The number of partitions can be specified at topic creation time or later.
• Increasing the number of partitions also affects the number of open file descriptors, so make sure you set the file
descriptor limit properly.
• Reassigning partitions can be very expensive, and therefore it's better to over- than under-provision.
• Changing the number of partitions that are based on keys is challenging and involves manual copying.
• Reducing the number of partitions is not currently supported. Instead, create a new topic with a lower number of
partitions and copy over the existing data.
• Metadata about partitions is stored in ZooKeeper in the form of znodes. Having a large number of partitions has
effects on ZooKeeper and on client resources:
• Unneeded partitions put extra pressure on ZooKeeper (more network requests), and might introduce delay in
controller and/or partition leader election if a broker goes down.
• Producer and consumer clients need more memory, because they need to keep track of more partitions and also
buffer data for all partitions.
• As a guideline for optimal performance, you should not have more than 4000 partitions per broker and not more
than 200,000 partitions in a cluster.
Make sure consumers don’t lag behind producers by monitoring consumer lag. To check consumers' position in a
consumer group (that is, how far behind the end of the log they are), use the following command:
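The command itself is not reproduced above; a typical invocation of the consumer groups tool (kafka-consumer-groups on Cloudera clusters, kafka-consumer-groups.sh in upstream Kafka) looks like the following, where the broker address and group name are placeholders:

kafka-consumer-groups --bootstrap-server broker1:9092 --describe --group my-consumer-group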
• It is highly recommended that you minimize the volume of replica changes to make sure the cluster remains
healthy. For example, instead of moving ten replicas with a single command, move two at a time.
• It is not possible to use this command to make an out-of-sync replica into the leader partition.
• If too many replicas are moved, then there could be serious performance impact on the cluster. When using
the kafka-reassign-partitions command, look at the partition counts and sizes. From there, you can test various
partition sizes along with the --throttle flag to determine what volume of data can be copied without affecting
broker performance significantly.
Given the earlier restrictions, it is best to use this command only when all brokers and topics are healthy.
In general, if everything is going well with a particular topic, each consumer’s CURRENT-OFFSET should be up-to-
date or nearly up-to-date with the LOG-END-OFFSET. From this command, you can determine whether a particular
host or a particular partition is having issues keeping up with the data rate.
• Cloudera recommends using the "pull" model for Mirror Maker, meaning that the Mirror Maker instance that is
writing to the destination is running on a host "near" the destination cluster.
• The topics must be unique across the two clusters being copied.
• On secure clusters, the source cluster and destination cluster must be in the same Kerberos realm.
• Poll Timeout: This is the timeout between calls to KafkaConsumer.poll(). Set this timeout based on the read
latency requirements of your particular use case.
• Heartbeat Timeout: The newer consumer has a “heartbeat thread” which sends a heartbeat to the broker
(actually to the Group Coordinator within a broker) to let the broker know that the consumer is still alive. This
happens on a regular basis, and if the broker does not receive at least one heartbeat within the timeout period, it
assumes the consumer is dead and disconnects it.
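A consumer sketch that ties these timeouts together with the earlier recommendation to commit offsets only after processing. The property values, broker address, topic, and group id are illustrative assumptions:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TimeoutAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");              // placeholder broker
        props.put("group.id", "my-consumer-group");                  // placeholder group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        props.put("enable.auto.commit", "false");     // commit only after processing
        props.put("session.timeout.ms", "30000");     // heartbeat timeout: broker assumes the
                                                      // consumer is dead after 30 s without heartbeats
        props.put("heartbeat.interval.ms", "10000");  // how often the heartbeat thread sends heartbeats
        props.put("max.poll.interval.ms", "300000");  // maximum allowed gap between poll() calls

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("critical-events"));
            while (true) {
                // Poll timeout: how long a single poll() call waits for records.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                consumer.commitSync();   // commit offsets only after processing succeeds
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s -> %s%n", record.key(), record.value());
    }
}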
How can I build a Spark streaming application that consumes data from Kafka?
You will need to set up your development environment to use both Spark libraries and Kafka libraries:
• Building Spark Applications
• The kafka-examples directory on Cloudera’s public GitHub has an example pom.xml.
From there, you should be able to read data using the KafkaConsumer class and use Spark libraries for real-time
data processing. The blog post Reading data securely from Apache Kafka to Apache Spark has a pointer to a GitHub
repository that contains a word count example.
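A sketch of such a word count application using the Spark Streaming Kafka integration (spark-streaming-kafka-0-10). The broker address, topic, group id, and batch interval are illustrative assumptions:

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaSparkWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
            .setAppName("KafkaSparkWordCount")
            .setMaster("local[2]");   // local testing only; remove when submitting to a cluster
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");   // placeholder broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "wordcount-example");        // placeholder group id
        kafkaParams.put("auto.offset.reset", "latest");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("lines"), kafkaParams));  // placeholder topic

        // Count words in each batch and print the counts to the driver log.
        stream.map(ConsumerRecord::value)
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
              .mapToPair(word -> new scala.Tuple2<>(word, 1))
              .reduceByKey(Integer::sum)
              .print();

        jssc.start();
        jssc.awaitTermination();
    }
}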
For further background, read the blog post Architectural Patterns for Near Real-Time Data Processing with Apache
Hadoop.