Kafka Troubleshooting in Production: Stabilizing Kafka Clusters in the Cloud and On-premises
Elad Eldor
Tel Aviv, Israel
About the Author
Elad Eldor leads the DataOps group in the Grow division
at Unity, formerly ironSource. His role involves preventing
and solving stability issues, improving performance,
and reducing cloud costs associated with the operation
of large-scale Druid, Presto, and Spark clusters on the
AWS platform. Additionally, he assists in troubleshooting
production issues in Kafka clusters.
Elad has twelve years of experience as a backend
software engineer and six years as a Site Reliability Engineer
(SRE) and DataOps engineer for large, high-throughput, Linux-based
big data clusters.
Before joining ironSource, Elad worked as a software engineer and SRE at Cognyte,
where he developed big data applications and was in charge of the reliability and
scalability of Spark, Presto and Kafka clusters in on-premises production environments.
His primary professional interests are in improving performance and reducing cloud
costs associated with big data clusters.
About the Technical Reviewer
Or Arnon was born and raised in Tel Aviv, the startup hub
of Israel, and, according to him, one of the best cities in the
world. As a DevOps engineer, Or thrives on problem-solving
via collaboration, tools, and a build-for-scale attitude. He
aspires to make good things great. As a manager, Or focuses
on business and team growth. He aims to build a team that is
challenged, agile, and takes pride in their work.
Acknowledgments
I’d like to thank the many DevOps, DataOps, and system administrators and developers
who maintain Apache Kafka clusters. Your efforts make this book necessary and I hope it
will assist you in handling and even preventing production issues in your Kafka clusters.
Special thanks to Evgeny Rachlenko. Our coffee breaks, filled with
discussions about Linux performance tuning, sparked my deep interest in that topic, and
the knowledge I gained has been invaluable in my work with Kafka.
Sofie Zilberman encouraged me to focus on JVM tuning and later introduced
me to Kafka. These areas, along with Linux performance tuning, became my biggest
professional interests, and that wouldn’t have happened without her. I am indebted to her
for setting such a high bar for me during the years we worked together.
Uri Bariach worked with me on troubleshooting dozens of production issues in
on-premises Kafka clusters. I’d like to thank him for being such a great colleague and also for
editing the on-premises chapters of this book.
I’m grateful to Or Arnon, who works with me at ironSource (now Unity). We spent
dozens of hours together analyzing production issues in high-throughput, cloud-based
Kafka clusters. He is one of the most thorough and professional KafkaOps engineers out there,
and his technical editing of this book has been indispensable.
Writing this book was at times daunting. But taking care of Kafka clusters, whether
on-premises or cloud-based, is even more challenging. Of all the open-source
frameworks I’ve worked with in production, Kafka is by far the toughest one to handle,
and usually the most critical one as well. My hope is that this book will help those
who maintain Kafka clusters reduce the chances of production issues, and that it will
serve its purpose well.
Introduction
Operating Apache Kafka in production is no easy feat. As a high-performing,
open-source, distributed streaming platform, it includes powerful features. However, these
impressive capabilities also introduce significant complexity, especially when deploying
Kafka at high scale in production.
If you’re a Kafka admin—whether a DevOps, DataOps, SRE, or system
administrator—you know all too well the hefty challenges that come with managing
Kafka clusters in production. It’s a tough task, from unraveling configuration details to
troubleshooting hard-to-pinpoint production issues. But don’t worry, this book is here to
lend a helping hand.
This practical guide provides a clear path through the complex web of Kafka
troubleshooting in production. It delivers tried and tested strategies, useful techniques,
and insights gained from years of hands-on experience with both on-premises and
cloud-based Kafka clusters processing millions of messages per second. The objective?
To help you manage, optimize, monitor, and improve the stability of your Kafka clusters,
no matter what obstacles you face.
The book delves into several critical areas. One such area is the instability that
an imbalance or loss of leaders can bring to your Kafka cluster. It also examines CPU
saturation, helping you understand what triggers it, how to spot it, and its potential
effects on your cluster’s performance.
The book sheds light on other key aspects such as disk utilization and storage usage,
including a deep dive into performance metrics and retention policies. It covers the
sometimes puzzling topic of data skew, teaching you about Kafka’s various skew types
and their potential impact on performance. The book also explains how adjusting
producer configurations can help you strike a balance between distribution and
aggregation ratio.
Additionally, the book discusses the role of RAM in Kafka clusters, including
situations where you might need to increase RAM. It tackles common hardware failures
usually found in on-premises data centers and guides you on how to deal with different
disk configurations like RAID 10 and JBOD, among other Kafka-related issues.
Monitoring, an essential part of any KafkaOps skill set, is addressed in detail. You’ll
gain a deep understanding of producer and consumer metrics, learning how to read
them and what they signify about your cluster’s health.
Whether you’re a DevOps, a DataOps, a system administrator, or a developer, this
book was created with you in mind. It aims to demystify Kafka’s behavior in production
environments and arm you with the necessary tools and knowledge to overcome Kafka’s
challenges. My goal is to make your Kafka experience less overwhelming and more
rewarding, helping you tap into the full potential of this powerful platform. Let’s kick off
this exploration.
CHAPTER 1
Storage Usage in Kafka: Challenges, Strategies, and Best Practices
Note For simplicity, I assume that all the topics in this cluster have the
same characteristics:
• Size on disk
• Retention
• Replication factor
• Number of partitions
Let’s look at some of the reasons that Kafka disks can become full:
• Adding topics: In this example, a new topic adds 600GB of storage on disk, after
replication.
• Disk failures: When at least one disk fails on at least one broker in the
cluster, and the following conditions apply:
Then:
–– The data that resided on the failed disk will need to be replicated
to some other disk.
–– The total available storage in the cluster is reduced while the used
storage remains the same.
–– If the storage usage was high enough already, then even a single
failed disk can cause the storage to become full, and the broker
can fail.
In order to learn how these retention policies control the storage usage inside a topic
directory, imagine a topic with the characteristics shown in Table 1-2.
log.retention.bytes: 10GB
log.retention.hours: 1
Message size on disk (after compression): 1KB
Avg produce rate: 10K/sec
Segment size (determined by log.segment.bytes): 1GB
Produce rate: 1K/sec
Replication factor: 2
In this case, segments will be deleted from the partition either when all the messages
inside these segments are older than one hour or when the total size of all the segments inside
the partition reaches 10GB, whichever comes first.
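For illustration, here is a minimal sketch of how a topic with roughly these characteristics could be created using Kafka’s Java AdminClient. The topic name, partition count, and broker address are hypothetical; retention.ms and retention.bytes are the topic-level counterparts of the broker-wide log.retention.hours and log.retention.bytes settings listed in Table 1-2.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateRetentionBoundTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical bootstrap address; replace with your own brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions (an assumption) and replication factor 2, as in the example topic.
            NewTopic topic = new NewTopic("example-topic", 12, (short) 2)
                    .configs(Map.of(
                            "retention.ms", "3600000",         // 1 hour
                            "retention.bytes", "10737418240",  // 10GB per partition
                            "segment.bytes", "1073741824"));   // 1GB segments
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}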
After configuring this topic, let’s consider the chances of losing data.
To monitor such an issue, keep an eye on sudden increases in the producing rate into the
topic, as this might signal an abnormal data influx. If the topic starts receiving more data,
consider increasing its retention in order to give lagging consumers more time to catch
up. If the lagging application has fewer consumers than the number of partitions in the
topic it reads from, consider adding more consumers. Doing so can help distribute the
load more evenly and prevent data loss.
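As a sketch of what increasing the retention can look like in practice, the following example uses the Java AdminClient to raise retention.ms on a topic temporarily. The topic name, broker address, and the 6-hour value are assumptions for illustration only.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class IncreaseTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "example-topic");
            // Temporarily raise retention to 6 hours so lagging consumers can catch up.
            AlterConfigOp raiseRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(6 * 60 * 60 * 1000L)),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Map.of(topic, Collections.singleton(raiseRetention))).all().get();
        }
    }
}

Remember to revert the setting once the lag is gone, since the extra retention consumes disk space on every replica of the topic.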
then during the peak hours, the consumers might start to lag.
If, after peak time is over, the segments that the consumers didn’t manage to
consume still exist in Kafka, the consumers will be able to read all of those events.
However, if the consumers developed a lag of more than 60 minutes
during peak time, there’s a chance that once peak time is over, their delay will
exceed the 60-minute retention, and in that case the consumers will
lose that data.
Scaling Up
Scaling up refers to expanding the storage in the existing brokers. This can be done in
several ways:
Scaling Out
Scaling out refers to the addition of new brokers to the cluster. This option helps when
you need to substantially increase storage capacity. Adding more brokers inherently
means adding more disks and storage to the cluster. Depending on the configuration
and needs, this can be a viable option to manage a significant growth in data.
In all cases, care must be taken to properly manage the migration of data. When
moving Kafka’s data directories to new disks, you’ll need to mount the appropriate
directories onto the new disks and transfer all necessary data. This ensures that the
existing data remains accessible to the Kafka brokers, effectively expanding the storage
without risking data loss or unnecessary downtime.
EBS Disks
If the brokers contain only EBS disks, there are two main options for increasing storage
capacity:
• Scale up: This involves increasing the existing EBS size or attaching
more EBS disks to the existing brokers. Since EBS disks can be resized
without losing data, this process can be accomplished with minimal
disruption. However, it may require stopping the associated instances
temporarily to apply the changes, while maintaining data consistency
throughout.
• Scale up: Scaling up keeps the same number of brokers in the cluster,
but each broker will have more disks than before. The process
typically involves stopping and starting a broker, which can cause
data loss on that NVMe device since a new one is assigned when the
instance restarts. To prevent this, a proper migration process must be
followed: data must be moved from the old NVMe device to persistent
storage before the broker is stopped. Once the broker is started with
the new NVMe device, the data can be moved back.
• Scale out: Scaling out adds more brokers with the same number of
disks as before. This can be done without the same risk of data loss
as scaling up, but careful planning and coordination are required to
ensure that the new brokers are correctly integrated into the cluster
without impacting existing data and performance.
–– Required retention
–– Replication factor
The amount of storage used by a topic can then be calculated using this formula:
Topic_size_on_disk = [average message size] X [average number
of messages per sec] X [required retention in hours] X 3,600 X
[replication factor of the topic] / [compression rate]
(The factor of 3,600 converts the retention from hours into seconds so that it matches the
per-second message rate.)
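As a quick sanity check of the formula, here is a minimal sketch in Java with hypothetical numbers; adjust them to measurements from your own cluster.

public class TopicStorageEstimate {
    public static void main(String[] args) {
        // Hypothetical inputs; replace them with measurements from your own topics.
        double avgMessageSizeBytes = 1_000;   // ~1KB per message
        double messagesPerSecond   = 10_000;  // average produce rate
        double retentionHours      = 24;
        double replicationFactor   = 3;
        double compressionRate     = 4;       // e.g., 4:1, as measured on your data

        double topicSizeBytes = avgMessageSizeBytes * messagesPerSecond
                * retentionHours * 3_600 * replicationFactor / compressionRate;

        // Prints roughly 0.65 TB for these example numbers.
        System.out.printf("Estimated topic size on disk: %.2f TB%n", topicSizeBytes / 1e12);
    }
}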
After performing this calculation for all the topics in the cluster, you get a number
that represents the minimum required storage for the cluster. Since it’s hard to predict the
compression rate, you’ll need to measure it yourself by comparing the data ingested
into Kafka with the data that’s actually stored on the brokers’ disks. That’s true regardless of
whether the compression is performed by the producer or by the Kafka brokers.
But that number is only a starting point, because we also need to add some extra
storage for the following reasons:
–– We don’t want the used storage to exceed roughly 85 percent. This
threshold should be determined by the KafkaOps team, since there’s no right
or wrong value.
–– The storage requirements may change over time; for example, the replication factor or
the retention might increase.
In this case, the minimum required storage for the cluster is 27.5TB. On top of that,
you need to add some storage so you won’t use more than 80 percent of it, so multiply that
number by 1.25; you get roughly 34TB.
To conclude, the cluster requires about 34TB in order to store the required retention.
Retention Monitoring
There are certain alerts that should be raised in situations that could lead to the storage
filling up on one of the brokers, or to data loss for one of the consumers. The next sections
discuss them.
The X axis is time, and the Y axis is the message rate per minute into each partition.
Each line represents a partition, and the partitions in this case are sorted by percentiles
according to the message rate that is produced into them. So the partition named P1 is
the partition that gets the lowest number of messages, P50 is the median, and P99 is the
partition that gets the most messages.
It’s important to detect an increase in the message rate early and then adjust the
retention policies accordingly.
Summary
This chapter delved into the critical role that disk storage plays in Kafka clusters,
illuminating various aspects of Kafka’s storage usage. It began by exploring how the disks
of Kafka brokers can become filled, causing the cluster to halt. Various scenarios were
explored, from increasing the replication factor and retention of topics to disk failures
and unintended writes by other processes.
The chapter further delved into the importance of retention policies in Kafka,
exploring how they can cause data loss for consumers and the critical configuration
parameters needed to avoid such situations. Several scenarios were analyzed, including
consumer lag management, handling unexpected data influx, balancing consumer
throttling, understanding traffic variations, and ensuring batch duration compliance.
Strategies for adding more storage to a Kafka cluster were also explored, focusing on
different approaches depending on the cluster’s location (on-prem or in the cloud). The
chapter concluded with an overview of considerations for extended retention in Kafka
clusters and guidelines for calculating storage capacity based on time-based retention.
Overall, this chapter aimed to provide you with a comprehensive overview of storage
usage in Kafka clusters and help you avoid common storage-related pitfalls.
The next chapter explores a range of producer adjustments, from partitioning
strategies that balance even distribution and clustering of related messages, to the
fine-tuning of parameters like linger.ms and batch.size for enhanced speed and response
time. We’ll also examine how data cardinality influences Kafka’s performance and
uncover scenarios where duplicating data across multiple topics can be an intelligent
strategy.
CHAPTER 2
Strategies for Aggregation, Data Cardinality, and Batching
This chapter digs into various adjustments you can make to the Kafka producer that can
notably increase your Kafka system’s speed, response time, and efficiency. The chapter
starts by exploring the partitioning strategy, which aims for an equilibrium between
distributing messages evenly and clustering related messages together. It will then
dive into adjusting parameters like linger.ms and batch.size to improve speed and
decrease response time. From there, we’ll learn how the uniqueness and spread of data
values, known as data cardinality, impact Kafka’s performance. And finally, we’ll explore
why, in some cases, duplicating data for different consumers can be a smart move.
rate of 2x, while a larger batch of 1,000 similar messages might reach a compression rate
of 10x. The key factor here is the number of similar messages in the batch, which when
larger, provides a better scope for compression, reducing network and storage use.
Lastly, the way you partition your data can also influence compression rates. For
example, when using a SinglePartitionPartitioner, where the producer writes only to a
single, randomly selected partition, larger batches can enhance compression ratios by
up to two times. This is because all messages in a batch belong to the same partition,
and the likelihood of having similar messages in the batch increases, leading to better
compression.
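To make this concrete, a custom partitioner along these lines might look like the following sketch. As far as I know, Apache Kafka doesn’t ship a partitioner under this exact name, so treat it as an illustrative implementation that pins each producer instance to one randomly chosen partition.

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class SinglePartitionPartitioner implements Partitioner {
    // Partition chosen once per producer instance; -1 means "not chosen yet".
    // Kept deliberately simple; a race here only means the choice settles after a few records.
    private volatile int chosenPartition = -1;

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        if (chosenPartition == -1) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            chosenPartition = ThreadLocalRandom.current().nextInt(numPartitions);
        }
        return chosenPartition;
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}

The class is registered through the producer’s partitioner.class setting. Because every batch then targets a single partition, batches tend to contain more similar messages and compress better, at the cost of skewing load toward that partition.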
In conclusion, linger.ms and batch.size are both powerful knobs to tweak in your
Kafka setup, allowing you to optimize throughput, latency, and resource use across
producers, brokers, and consumers. Still, it is crucial to align these parameters with your
specific use case so you avoid potential bottlenecks or inefficiencies that can emerge
from misconfiguration.
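For reference, here is a minimal sketch of a producer configured along these lines. The broker address and the specific values are assumptions meant to illustrate the trade-off, not recommendations.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Wait up to 50 ms for a batch to fill before sending (trades a little latency for throughput).
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);
        // Allow batches of up to 256 KB per partition before forcing a send.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 256 * 1024);
        // Compress whole batches; larger batches of similar messages compress better.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // producer.send(...) calls go here.
        }
    }
}

With these settings, the producer waits up to 50 ms to accumulate a batch, raising latency slightly in exchange for larger, better-compressed batches.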
In this case, the cardinality level would be the total number of unique activity types.
The cardinality distribution, on the other hand, shows the frequency or proportion of
each activity type in the overall data. For instance, you might find that 70 percent of
the activities are viewed_product, 20 percent are added_to_cart, and 10 percent are
checked_out. This distribution can significantly impact the processing and analysis of
the data, especially when dealing with operations like compression and aggregation.
Summary
This chapter first tackled the task of dividing data across Kafka partitions. The goal is to
distribute data uniformly, yet also organize it strategically to meet users’ requirements.
Subsequently, the chapter explored the process of adjusting the linger.ms and
batch.size settings. By fine-tuning these configurations, you can utilize the Kafka brokers’
disk space more efficiently, alter the data compression rate, and strike a balance between
data transmission speed and volume.
Additionally, the chapter delved into the concept of data cardinality, which relates
to the number and distribution of unique values in a field. We discussed why an
abundance of unique values can negatively impact compression, consume excessive
memory, and slow down queries, and we covered some methods to handle this
issue effectively.
Lastly, we learned that duplicating data across multiple topics can sometimes
be beneficial. By establishing separate topics for distinct data subsets, we can see an
improvement in processing speed for certain consumers. This tactic allows you to
enhance the overall efficiency of your data processing system.
The following chapter navigates the intricate landscape of partition skew within
Kafka clusters. From the subtleties of leader and follower skews to the far-reaching
impacts on system balance and efficiency, the chapter explores what happens when
brokers are unevenly loaded. This in-depth examination covers the consequences
on production, replication, consumption, and even storage imbalances. Along with a
detailed exploration of these challenges, I offer guidance on how to monitor and mitigate
these issues, illuminated by real-world case studies.
CHAPTER 3
Understanding and Addressing Partition Skew in Kafka
When partitions, be they leaders or followers, are distributed unevenly among
brokers within a Kafka cluster, the equilibrium and efficiency of the cluster, as well as
its consumers and producers, can be at risk. This inequality of distribution is called
partition skew.
Partition skew leads to a corresponding imbalance in the number of production and
consumption requests. A broker that hosts more partition leaders will inevitably serve
more of these requests since the producers and consumers interact directly with the
partition leaders.
This domino effect manifests in various ways. Replication on brokers hosting a
higher number of partition leaders might become sluggish. At the same time, consumers
connected to these brokers may experience delays. Producers might find it difficult to
maintain the necessary pace for topic production on these specific brokers. Additionally,
there might be a surge in page cache misses on these brokers, driven by increased cache
eviction as a result of the higher traffic volume.
These symptoms illustrate how swiftly a mere partition skew can escalate into
substantial production challenges within a Kafka cluster. In an ideal setting, all the
topic partitions would be uniformly distributed across every broker in the Kafka cluster.
Reality, however, often falls short of this perfection, with partition imbalance being a
common occurrence.
This chapter is devoted to exploring various scenarios that can lead to such
imbalances, providing insights into how to monitor and address these issues.
Note that for the sake of clarity, throughout this chapter, the term partition leader
describes a partition to which producers supply data and from which consumers retrieve
it. This term can also describe the partition for which the hosting broker serves as its
leader. Utilizing the term leaders for partitions rather than associating brokers with
leadership of partitions offers a more intuitive understanding of the concept.
I/O operations, particularly if the page cache is unable to fulfill read requests. This would
necessitate direct reads from the disk, which can lead to more disk IOPS (input/output
operations per second).
Additionally, consumers attempting to read from leader partitions on the skewed
broker may face delays, thus affecting the overall data processing times. The situation
could also impact producers, who might find that they have more messages in their
buffers when writing to the partition leaders on that broker. These delays in transmitting
messages, if not managed appropriately, can even lead to data loss.
Figure 3-1 shows a possible scenario in which there are more partition leaders on
broker B1 compared to the other two brokers.
Figure 3-1. Number of partitions per broker. While broker B1 hosts more partition
leaders than brokers B2 and B3, these brokers host more partition followers than
broker B1
In this case, all brokers host the same total number of partitions (leaders plus
followers combined), but B1 holds more leaders, while B2 and B3 host more followers.
So a broker with more partition leaders causes the other brokers to host more partition followers.
This leads to more fetch requests from the brokers hosting the follower replicas
to the broker that hosts the corresponding leaders (which is also the broker that hosts
more leaders overall). A broker that deals with more fetch requests from other brokers might not
be able to serve fetch requests from consumers at the same rate as other, non-skewed
brokers.
• There can be a partition leader skew while the brokers host the
same number of partition leaders. Imagine a case when two big
topics aren’t balanced among the brokers, but the first topic has
more partition leaders on one broker while the other topic has more
leaders on another broker. In total, the number of partition leaders
per broker would be the same, while if you look per topic you would
see an imbalance.
The following example shows how even a slight skew can cause one broker to reach
CPU saturation and cripple the Kafka cluster. Figure 3-2 shows a cluster with three
brokers, one broker being very loaded (as can be seen from its load average).
Figure 3-2. The load average (LA) of the blue broker is more than double the
LA of the other two brokers
The cause for the high load average is the high CPU user time (called us%) of this
broker, as shown in Figure 3-3.
Figure 3-3. The CPU us% of the blue broker is more than double the us% of the
other two brokers
In searching for the cause of the high CPU us% on this broker, you would first check
whether it receives or sends more traffic. In this case, however, all the brokers
had the same incoming and outgoing traffic.
You would then compare the number of partition leaders per broker. It turns out
there is very little variance in the number of leaders between the brokers,
as shown in Figure 3-4.
When you look deeper—at the scope per topic and not just per broker—you see
a topic with a single partition leader that resides on broker number 2, as shown in
Figure 3-5.
Figure 3-5. The number of partitions per topic and per broker. Topic D has a
single partition that resides on broker 2
At first glance, there’s nothing special here—just a single partition leader that resides
on some broker. However, that single partition leader was the only difference between
the loaded broker and the other two brokers.
After further analysis, you find that the topic (to which this partition leader belongs)
has many more consumers than the other topics on this cluster, as shown in
Figure 3-6.
Figure 3-6. The number of consumers per topic and the throughput per topic
The moral of this production incident (in the scope of this particular chapter) isn’t
that many consumers can cause high CPU and load average. Instead, the takeaway from
this story is that when you’re looking for partition leader skew, it’s not enough to
compare the total number of partition leaders across the brokers. You also need to check,
per topic, whether that topic’s leaders are evenly distributed among all the brokers. Otherwise,
you can miss a case like the one illustrated here.
• Storage issue: If data is transferred into a broker and the storage use
(in the mount points that store the Kafka log segments) becomes
full (100% of the storage is used), then the broker will probably stop
functioning.
• Disk I/O operations: When data is being transferred from one broker
to another, there are spikes in both read and write operations. For
example, if partition P1 moves from broker B1 to broker B2, then
the segments of P1 are being read from the disks of B1 (resulting in
increased read operations on B1 disks) and written to the disks of
broker B2 (resulting in increased write operations on B2 disks).
at least part of the data from the page cache. The impact of this on
the consumers of the Kafka cluster is that consumers of the real-time data
flowing into both brokers now read data that no longer exists in the page cache
(because it was flushed out by data that’s related to the
reassignment).
• Transfer costs: If partitions are moved between brokers that reside in
different availability zones (AZs), there are costs associated with the data transfer.
Summary
This chapter explored the complex problem of partition skew in a Kafka cluster,
shedding light on the subtleties of both leader and follower skews and their effects on the
system’s balance and efficiency.
The chapter started by explaining what partition skew means, specifically focusing
on leader and follower skews and their consequences for sending and receiving messages.
When brokers are unevenly loaded, it can result in issues such as slow replication,
consumption delays, and an increase in page cache misses.
The chapter then detailed the distinctions between leader and follower skews.
Leader skew primarily affects the writing and reading of messages by producers and
consumers, whereas follower skew has a significant influence on replication processes.
Next, the chapter looked into the problems that can occur when a broker manages
too many partition leaders, such as higher disk I/O activities, delays in communication
for both consumers and producers, an imbalance in followers, and a decrease in the
number of synchronized copies.
We stressed the need to analyze skew not just on a broker-by-broker basis but also
for individual topics. A case study was presented to demonstrate how just one partition
leader skew can cause a CPU overload and substantial issues in the cluster.
The chapter also provided guidance on how to monitor and tackle skew challenges,
taking into consideration aspects like the rate of messages, the number of consumers
and producers, the status of in-sync replicas, and more.
Finally, the chapter delved into the particular concern of unequal data distribution
among a broker’s disks, emphasizing the possibility of storage imbalances and outlining
the measures that can be implemented to alleviate these difficulties.
The next chapter delves into the intricate issue of skewed or lost leaders in Kafka,
an area that has significant implications for brokers, consumers, and producers alike.
As you explore the mechanisms of Kafka, understanding how these leaders can become
skewed or lost and the consequences of such irregularities becomes paramount. You’ll
unravel the underlying factors that may lead to these problems, such as networking
issues or challenges with partition leadership. Additionally, you’ll learn practical
solutions and preventive strategies to ensure that the Kafka system operates smoothly
and efficiently.
CHAPTER 4
Dealing with Skewed and Lost Leaders
Figure 4-1. The general flow of messages from producers into brokers and then to
consumers
A topic is broken into multiple partitions, so you can think of the relationship
between a topic and its partitions just like folders in a filesystem—a topic is a folder and
inside this folder there are more folders, each a separate partition. Each partition has
a leader broker. This leader broker is in charge of all the read and write operations for
that partition. Also, each partition leader has one or more followers, depending on the
replication factor of the topic that the partition belongs to.
A leader skew has different meanings, depending on the scope you look at. At the
cluster level, it indicates a situation where certain brokers lead a greater number of partitions
compared to their counterparts. If you focus on a specific topic, a leader skew
means that, for that particular topic, some brokers lead more partitions than
others. It’s important to note that in this scenario, the skew only pertains to the partitions
related to that specific topic.
A partition leader can also get lost if the broker that serves as its leader stops serving
as its leader for some reason. This becomes a stability issue for the cluster if many
leaders lose their leadership at once, or when this occurs too frequently.
These two symptoms—leader skew and lost leaders—can adversely impact the
brokers, consumers, and producers.
ZooKeeper
ZooKeeper (aka ZK) takes part in appointing a leader to every partition, and that leader is
in charge of managing all associated read and write requests. ZooKeeper also nominates a
controller broker with the responsibility of overseeing changes in the state of brokers and partitions.
Furthermore, it’s tasked with maintaining a continuously updated record of metadata for
the Kafka cluster, encapsulating data on topics, partitions, and replicas.
The first networking issue relates to the ZooKeeper session timeout. There are
several steps to diagnose and mitigate this problem.
• First, you need to check whether the ZooKeeper is suffering from high
garbage collection (GC). If it is, you can address this by increasing the
amount of RAM allocated to the ZooKeeper until these GC issues are
resolved.
• If GC is not an issue, you can proceed to the next step, which involves
examining the ZooKeeper’s session and connection timeouts. These
timeouts may be too short, and if so, you can try to increase the
session timeout.
Once you’ve made these changes, restart ZooKeeper and see if the leader partitions
still lose their leadership. In my experience, once you increase the ZooKeeper’s
session timeout, the frequency of leader partitions losing their leadership diminishes
significantly. Consequently, the occasions when consumers hang due to lost leaders also
decrease substantially.
exclusively on brokers whose NIC was reset under load, you should examine your
NIC configuration. In such scenarios, simply increasing the ZooKeeper session timeout
wouldn’t be beneficial.
If your Kafka consumer application is either hanging, accumulating a backlog,
or crashing when reading from a Kafka topic, or your Kafka producers are unable to
produce into a Kafka topic, you might be experiencing issues with session timeouts or
some other causes.
One of the common causes is that Kafka brokers cannot reach the ZooKeeper due to
session timeouts. Alternatively, the issue might stem from the ZooKeeper dealing with
long or frequent full garbage collection pauses. Occasionally, the NIC in some Kafka
brokers might also reset when under a high load. There are other potential reasons, such
as broker restarts/crashes and network partitioning, but we won’t get into their details in
this section.
To prevent such scenarios, ensure that the ZooKeeper session timeout isn’t too short.
Additionally, take steps to ensure that the NIC in the Kafka brokers does not reset under
high load.
For optimal system health, you should continuously monitor full GC activity in the
ZooKeeper, look out for resets of the NIC, and keep an eye out for errors related to lost
leaders.
In order to fix partition skew, you need to run a partition reassignment procedure,
which not only takes time but also increases the CPU utilization of the brokers (due to
the higher data transfer that occurs during the reassignment), especially with CPU user
time (us%) and disk I/O time (io%). This can sometimes be problematic for Kafka
clusters that already suffer from high CPU, so there are times when we’ll want to avoid
partition reassignment even if there’s a leader skew. Alternatively, we can drop the
retention on the topics that contribute to most of the traffic in the cluster in order to run
the partition reassignment process more smoothly.
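When you do decide to reassign, the procedure can be driven with the kafka-reassign-partitions tool or programmatically. The following is a minimal sketch using the Java AdminClient; the topic, partition number, and target broker IDs are hypothetical.

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ReassignPartitionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical

        try (AdminClient admin = AdminClient.create(props)) {
            // Move partition 0 of "example-topic" so that broker 2 holds the preferred
            // leader replica (first in the list) and broker 3 holds the follower replica.
            TopicPartition partition = new TopicPartition("example-topic", 0);
            NewPartitionReassignment newReplicas = new NewPartitionReassignment(List.of(2, 3));

            admin.alterPartitionReassignments(
                    Map.of(partition, Optional.of(newReplicas))).all().get();
        }
    }
}

The first broker in the target replica list becomes the preferred leader, so this is also the lever for rebalancing leadership. Keep in mind that the data movement itself is what drives the extra CPU and disk I/O described above.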
That’s why it’s important to emphasize that not every leader skew should be dealt
with. However, there are times when it’s crucial to solve the skew, and these are
the cases in which a leader skew in some topic has a real effect on the Kafka cluster.
There are two such cases: high traffic, and a high number of consumers/producers
for the topic. The following sections consider these two cases.
Figure 4-2. An example of the distribution of partition leaders and followers (P-0,
P-1, P-2) between three brokers
There’s a pitfall to using this approach to identify leader skew, since it’s not per topic
but instead per broker.
Knowing the leader skew isn’t enough; you need to understand which topics
contribute to this skew. When looking at the number of leader partitions per broker
and per topic, as shown in Figure 4-4, you can see the following anomaly—Topic D has
a single leader partition that resides on Broker 2. This was a real case in which topic D
caused Broker 2 to reach high CPU user time due to many consumers reading from the
leader partition of that topic.
Figure 4-4. The distribution of leader partitions per broker and per topic
Figure 4-5 shows another example of uneven distribution of leaders per broker.
This skew of leader partitions doesn’t tell you much until you dive deeper into which
topics contribute to the skewed distribution, as shown in Figure 4-6.
Figure 4-6. The number of leader partitions per topic and per broker
In this case, you can see that most of the skew can be attributed to Topic D.
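One way to check this per topic, rather than only per broker, is to count leaders per topic and broker with the Java AdminClient. This is a rough sketch; the broker address is an assumption, and in practice you would feed the result into your monitoring system.

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class LeaderSkewReport {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                    admin.describeTopics(admin.listTopics().names().get()).all().get();

            // topic -> (broker ID -> number of partition leaders hosted on that broker)
            Map<String, Map<Integer, Integer>> leadersPerTopicAndBroker = new HashMap<>();
            topics.forEach((topic, description) ->
                    description.partitions().forEach(p -> {
                        if (p.leader() != null) {
                            leadersPerTopicAndBroker
                                    .computeIfAbsent(topic, t -> new HashMap<>())
                                    .merge(p.leader().id(), 1, Integer::sum);
                        }
                    }));

            leadersPerTopicAndBroker.forEach((topic, counts) ->
                    System.out.println(topic + " -> leaders per broker: " + counts));
        }
    }
}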
Summary
This chapter explored the impact of skewed or lost leaders in Kafka on brokers,
consumers, and producers. In Kafka, each partition has a leader broker responsible for
read and write tasks, and these leaders can sometimes become skewed or lost. This skew
or loss, depending on whether it’s across all topics or a specific one, can cause problems
for both consumers and producers, like instability, crashes, and backlogs.
The root cause of these issues sometimes lies in networking problems related to
ZooKeeper, which is the entity assigning partition leaders, or the Network Interface Card
(NIC), which connects servers to the network. Potential solutions include increasing
the RAM allocated for the ZooKeeper in order to mitigate high garbage collection (GC) or
extending its session timeouts, and checking for NIC-related errors leading to necessary
configuration adjustments.
Leader skew is a situation where one broker leads more partitions than the average.
Although reassigning leaders for equal distribution is a common solution, it might not
always be the best approach due to the increased CPU usage and time consumption.
Nonetheless, addressing skew is crucial for topics with high traffic or many
consumers/producers. High traffic and a large number of consumers/producers can
lead to increased CPU usage in brokers hosting more leaders, causing issues such as
consumer backlog and stalled producers. Therefore, partition reassignment or manual
leader election is recommended in such cases.
The next chapter dives into the topic of CPU saturation in Kafka clusters. We’ll learn
what CPU saturation is and how it differs from a fully engaged CPU, cover the various
types of CPU usage, and explore how each one can affect Kafka. The chapter also
discusses the influence of log compaction and the number of consumers per topic on
CPU usage, including real-world examples and practical strategies. The goal is to provide
a clear understanding of CPU utilization in Kafka, helping you spot potential issues and
apply effective solutions.
CHAPTER 5
CPU Saturation in Kafka: Causes, Consequences, and Solutions
A Kafka cluster includes several machines, each of which runs a single Kafka process.
When the Kafka process generates more tasks (tasks that require CPU) than the number
of available CPU cores, these tasks will wait in the OS queue for their turn until the CPU
cores become available. This waiting can cause latency for the brokers and their consumers
and producers, leading to performance degradation of the Kafka cluster.
The demand for CPUs is indicated by an OS metric called load average, which refers
to the number of tasks that are either running or waiting to be run by the CPUs (once a
thread is scheduled to run on the OS, its executable code is considered to be a task).
When the load average is higher than the number of CPUs, it means the CPUs are
saturated. The runnable threads must wait for the CPUs to become available.
The term normalized load average (NLA) describes the levels of saturation. For
example, in a 16-core machine that has a load average of 16, the NLA is 1 (because
16/16=1). In such a case, all runnable tasks are being served, and no runnable task is
waiting in the queue. However, when the load average is 32, the NLA is 2 (since 32/16=2).
In such a case, there are 16 tasks running on the CPUs; the other 16 tasks are in a
runnable state and are waiting for a CPU to become available.
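As a small illustration of the arithmetic, the normalized load average can be computed from values the JVM already exposes. This sketch only mirrors what tools such as top and uptime report.

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class NormalizedLoadAverage {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        int cores = os.getAvailableProcessors();
        double loadAverage = os.getSystemLoadAverage(); // 1-minute load average; -1 if unavailable

        if (loadAverage >= 0) {
            double nla = loadAverage / cores;
            System.out.printf("load average=%.2f, cores=%d, NLA=%.2f%n", loadAverage, cores, nla);
            if (nla > 1.0) {
                System.out.println("CPU saturation: runnable tasks are queueing for CPU time.");
            }
        }
    }
}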
It’s important to distinguish between a fully engaged CPU and CPU saturation. A fully
engaged CPU is one in which all the processing units (cores) are completely engaged and
operating at maximum capacity.
When CPU utilization reaches 100 percent, this means that every core is actively
processing tasks, and there is no idle time left for any new tasks to be assigned to the
cores. In this situation, the CPU is fully engaged.
However, even when a CPU is fully engaged, that doesn’t necessarily imply CPU
saturation. The CPU can still accommodate more tasks without causing any observable
performance degradation if those tasks are not demanding or if they have a low priority.
In other words, a fully engaged CPU is merely a CPU that is efficiently using all of its
resources, not necessarily one that is overwhelmed or overworked.
When the Normalized Load Average (NLA) exceeds 1, this indicates CPU saturation
as it represents the queueing of processes. The load average is a measure of the number
of runnable tasks, including the tasks that are running and the ones that are waiting
to run (i.e., queued). When the NLA is greater than 1, this implies that there are more
runnable tasks than available cores. This waiting or queueing of tasks signifies CPU
saturation, as the CPU doesn’t have enough resources to immediately accommodate
all tasks. This means the tasks are waiting, which can potentially lead to slower system
response and processing times.
The key difference is that a fully engaged CPU has all its cores actively working and
can still process incoming tasks without delay, while CPU saturation occurs when there
are more tasks to be processed than the CPU can handle at once, leading to queuing and
potentially slowing down task processing.
I’ve noticed that when Kafka consumers and/or producers experience latency due to
CPU saturation, it’s usually not because the CPUs have reached 100 percent utilization.
Usually, it’s due to the NLA reaching a value higher than 1.
Given that, the more interesting question becomes this—what causes tasks to queue
(which is equivalent to asking why the NLA is higher than 1) when CPU utilization is
below 100 percent?
To answer this question, let’s first overview the different CPU usage types, since
detecting which type of CPU reaches a high utilization can assist in troubleshooting
latency issues.
• User time, denoted as %us, indicates the part of CPU time spent
running user-space code, such as the Kafka broker’s JVM.
• System time, denoted as %sy, indicates the part of CPU time spent in
the kernel space, working on tasks vital to the system’s functioning.
• I/O wait time, or %wa, signifies the amount of CPU time spent waiting
for input/output tasks, typically associated with disk operations.
• Software interrupt time, or %si, signifies the amount of CPU time spent
servicing software interrupts, such as network packet processing.
Now that we have an idea what CPU saturation is and know the different categories
of CPU usage in Linux, the next section explains what causes Kafka brokers to use
CPU time.
For monitoring purposes, set an alert for situations when the CPU sy% exceeds 20
percent for more than ten minutes. This can serve as a preventative measure against
prolonged system stress, helping to maintain optimal system performance.
Figure 5-3. A constant high value of 20 percent in the CPU system time
To prevent these issues, you should ensure that the number of disk.io threads is kept
to a minimum, ideally no larger than the number of disks in the broker. A lower number
of threads can help prevent system overload and unnecessary software interruptions.
For effective monitoring, consider setting an alert for scenarios where CPU software
interrupt time (si%) surpasses 5 percent for more than ten minutes. This approach can
help you promptly identify and address system performance issues, thereby enhancing
overall efficiency.
Figure 5-4. A topic with log compaction enabled. After the log compaction runs,
only a single value per key is saved
Usually the log compaction process doesn’t cause a CPU burden on the cluster,
but a compacted topic with large enough retention might cause all brokers in a cluster
to saturate due to a high CPU usage for part of the time. This can cause the cluster to
function poorly during those times. The following section describes such a scenario.
Figure 5-5. Disk utilization (a metric that was taken from the iostat tool) in all the
brokers reached a value of 100% and stayed there. This means that all the disks in
this cluster are saturated
The CPU’s system time (sy%) was ±15%, as shown in Figure 5-6.
Figure 5-6. CPU sy% in all the brokers reached a value of 15% and stayed there.
Such a value is high for Kafka clusters
The load average was higher than the number of cores, which means there were more
kernel tasks in either a waiting or running state than available cores, as shown in Figure 5-7.
Figure 5-7. Load Average reached a value of 40 and stayed there. Since the
number of cores in the brokers is 32, at any given time tasks wait in the queue for
the CPU to run them. The Normalized Load Average of all the brokers is 1.25 (since
40/32 = 1.25)
Figure 5-8. The CPU us% was low on all the brokers
After several tries, the developer who created the cluster noticed that the retention
of the compacted topic was 24 hours, which was much higher than required. Once the
retention was reduced to 1 hour, the disk utilization was drastically reduced, the CPU
sy% was reduced by half, and the load average dropped to less than the number
of cores.
The conclusion is that when you use a large, compacted topic, pay attention to the
retention. A high retention can be a possible cause for high disk utilization and high CPU
sy% values during the time the compaction runs, which can cause consumer and/or
producer lags due to the cluster struggling with the high load average.
• The Load Average was much higher compared to the other brokers,
as shown in Figure 5-9.
Figure 5-9. Load Average reached a value of 40 only in the rogue broker, which
was more than double the load average of the other two brokers in the cluster
• The CPU us% was much higher compared to the other brokers, as
shown in Figure 5-10.
Figure 5-10. CPU us% reached 100% in the rogue broker, which was more than
double the us% of the other two brokers in the cluster
At first, I suspected that this broker either received and/or sent more traffic.
However, it turned out the traffic was the same on all brokers, both in and out, as shown in
Figures 5-11 and 5-12.
Figure 5-11. The number of bytes that were sent from all the brokers was almost
the same
Figure 5-12. The number of bytes that were received by all the brokers was almost
the same
The next step was to suspect a partition skew—maybe this broker contained more
partitions than the other brokers. However, it turned out that all the brokers had almost
the same number of partitions, as shown in Figure 5-13.
Figure 5-13. The number of partitions in all the brokers was almost the same
At that point I was clueless, until a developer who was working on that cluster found
the issue—there was a single topic, with low incoming traffic, that had many consumers.
What was more surprising is that the topic had only one partition, and that partition
resided on… the rogue broker! Figure 5-14 shows that the single partition resided on the
rogue broker.
Figure 5-14. The partition that belonged to the topic with a single partition
resided on the rogue broker
To fix this production issue, we did two things. First, we added two more partitions
to this topic. This helped balance the number of partitions across all brokers. Second, we
got rid of some of the consumers for this topic. We found out that many of them didn’t
need to use that topic anymore. These two steps helped solve the issue.
This finding taught me a valuable lesson regarding partition skew—when checking
for a partition skew among brokers, we need to check this skew not only in the scope
of all the topics, but also in the scope of a single topic. As this production case shows,
even a topic with a single partition that receives low traffic can cause a broker to
become loaded.
This production issue can be summarized through its symptoms, potential causes,
prevention measures, and monitoring options:
The symptom of this problem presents itself as a single broker with high CPU user
time and high load average.
Potential causes for this issue might be that the broker was handling more incoming
and/or outgoing traffic than other brokers. It could also be that this broker held a
higher number of partitions (for all topics in the cluster) compared to other brokers.
Additionally, this broker might host a topic that has a larger number of partitions on that
broker, compared to the other brokers. Finally, it’s possible that the broker hosted a topic
with one or more of the following characteristics: many consumers, a high frequency of
consumer requests, numerous producers, or a high frequency of producer requests.
To prevent such issues:
• Make sure to balance the number of partitions per topic across all
brokers.
• For topics with many producers, consider whether they are all
necessary.
Monitoring for this issue involves setting up alerts for the following scenarios: topics
whose partitions are not balanced evenly among all brokers, topics with more than a
certain number of consumers or consumer requests, and topics with more than a certain
number of producers or producer requests.
Summary
This chapter provided an in-depth exploration of CPU saturation in Kafka clusters,
its causes, and potential solutions. It started by defining CPU saturation and a fully
engaged CPU. Then it delved into the different types of CPU usage, including user time
(%us), system time (%sy), I/O wait time (%wa), and software interrupts (%si). Each usage
type was explained in the context of Kafka clusters and the potential reasons for their
excessive utilization were discussed along with strategies for prevention and monitoring.
The chapter also emphasized the impact of log compaction on the CPU usage of the
Kafka brokers and explained how this process, by retaining only the most recent value for
each key, might require both RAM and CPU resources, potentially causing high resource
utilization. A real production issue was presented where incorrect configurations for
compacted topics resulted in system strain.
Lastly, the chapter discussed the impact of the number of consumers per topic on
CPU usage, using a case study where a rogue broker caused lags and queuing.
Overall, this chapter serves as a comprehensive guide for understanding CPU
utilization in Kafka clusters, helping you identify potential issues and implement
effective strategies for prevention and monitoring.
The next chapter explores the important role of RAM in Kafka clusters. It explains
why adding more RAM can be a critical improvement, sometimes even more so than
adding CPU or disks, especially in reducing latency and boosting throughput. Cloud-based
and on-premises solutions are compared, along with practical guides on how to
monitor and manage RAM effectively. From understanding page cache to preventing
system crashes, the next chapter gives you a comprehensive look at how RAM affects
Kafka’s performance, and how to optimize it.
CHAPTER 6
RAM Allocation in Kafka Clusters: Performance, Stability, and Optimization Strategies
Performance Boost
When a Kafka broker is allocated more RAM, it can make more effective use of
the Linux page cache by keeping a greater portion of data in memory. This minimizes
the requirement for frequent disk access, thereby accelerating both read and write
operations, which in turn enhances the overall performance.
Throughput Enhancement
By augmenting the available RAM, Kafka brokers can handle larger workloads from
both consumers and producers. They can also process a higher number of messages
concurrently. This leads to an improvement in throughput, enabling the broker to
manage higher volumes of data more efficiently and support an increased number of
consumers and producers.
Latency Reduction
With increased RAM capacity, Kafka brokers can more frequently serve read requests
directly from memory. This reduces the time it takes for consumers to access messages,
leading to lower latency and quicker data retrieval. As a result, consumers experience a
more responsive Kafka cluster.
As Figure 6-1 illustrates, when producers write messages to a Kafka broker, the
pages containing these messages are placed into the page cache before being flushed
to the disk. When consumers request specific messages from the broker, the system first
searches for the corresponding pages in the page cache. If found, they are sent directly to
the consumer; if not, the page cache retrieves the pages from the disk and forwards them
to the consumer. These pages may remain in the page cache for some time, depending
on the cache policy and access patterns. If other consumers subsequently request the
same pages, they can be served directly from the page cache, improving efficiency by
avoiding repeated disk reads.
However, Kafka’s reliance on the page cache also comes with considerations.
Over-reliance on the page cache without adequate memory can lead to excessive page
swapping, causing a degradation in performance. Therefore, careful capacity planning
and resource allocation are necessary to maintain optimal Kafka performance.
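As part of that capacity planning, it helps to check how the kernel on the broker host is configured to balance the page cache against swapping and flushing. The following is only a sketch of such a check on a Linux broker; the right values depend on your workload:

  # How aggressively the kernel swaps anonymous memory instead of shrinking the page cache
  sysctl vm.swappiness

  # Thresholds at which dirty (not yet flushed) pages start being written back to disk
  sysctl vm.dirty_background_ratio vm.dirty_ratio

  # How much memory is currently free vs. held by the page cache (the buff/cache column)
  free -m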
To mitigate the risk of losing messages that exist only in the page cache and have not yet been flushed to disk, Kafka employs strategies such as replication and acknowledgment mechanisms. Replication copies data across multiple Kafka brokers, providing fault tolerance and data redundancy. Acknowledgment mechanisms, controlled by the producer's acks configuration, allow producers to receive confirmation of successful writes before considering them committed.
In the event of a failure where all Kafka replicas fail simultaneously, even with acks
set to “all,” there is still a chance of losing updates. This is because the page cache may
not have sufficient time to persist the changes to the underlying storage before the
failure occurs. Therefore, it’s crucial to design Kafka deployments with fault-tolerance
considerations and appropriate replication factors to minimize data loss risks.
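As a quick illustration of the acks setting, it can be exercised from the command line with the console producer that ships with Kafka; the broker address and topic name below (localhost:9092 and events) are placeholders:

  # Produce with acks=all: the write is acknowledged only after all in-sync replicas have it
  kafka-console-producer.sh \
    --bootstrap-server localhost:9092 \
    --topic events \
    --producer-property acks=all

In application code, the equivalent is setting acks=all in the producer configuration, usually together with an appropriate replication factor and min.insync.replicas on the topic.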
You can monitor page cache usage with the cachestat tool, whose output includes the following columns:
–– HITS: The number of cache hits (read requests served from the page cache).
–– DIRTIES: The number of dirty pages (pages that have been modified
and need to be written to disk).
–– READ_HIT%: The cache hit ratio for read requests.
–– WRITE_HIT%: The cache hit ratio for write requests.
If you notice one or more of the following—high cache miss ratio, lower hit ratio, or
high number of dirty pages—it could indicate that your Kafka brokers are not effectively
utilizing the page cache. This could lead to increased disk I/O and potential latency
issues, as read and write requests have to go to disk instead of being served from
the cache.
Possible causes for a high cache miss ratio include a lack of available RAM, a high
rate of data eviction from the cache (possibly due to other memory-intensive processes
running on the same machine), or consumers trying to read data that is not in the cache
(possibly due to a high consumer lag).
A lower hit ratio in cachestat also indicates that a higher proportion of read requests
are missing the cache and therefore need to be fetched from disk. This can lead to higher
disk utilization because your storage subsystem has to handle more I/O operations.
You can monitor disk utilization using tools like iostat and sar. If you observe
that a decrease in cache hit ratio correlates to an increase in disk utilization, it suggests
that your Kafka brokers cannot effectively utilize the page cache, forcing them to rely
more heavily on disk I/O. This can lead to increased latency and reduced performance,
especially if your disks cannot handle the increased I/O load.
To conclude, monitoring your cache usage with cachestat can help you diagnose
these issues, and adjusting your Kafka and system configurations based on these insights
can help improve your Kafka cluster’s performance.
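For example, on a broker with the bcc tools installed, you might correlate the page cache hit ratio with disk utilization roughly as follows (the cachestat path and the intervals are illustrative and vary by distribution):

  # Print page cache hit/miss statistics every 5 seconds
  # (column names vary between the bcc and perf-tools versions of cachestat)
  sudo /usr/share/bcc/tools/cachestat 5

  # In parallel, watch per-disk utilization and latency every 5 seconds
  iostat -x 5

  # Or collect disk activity samples with sar (12 samples, 5 seconds apart)
  sar -d 5 12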
Figure 6-2. The output of the top command shows a broker that has a high I/O
wait time because it reads data from the disks instead of from the RAM, probably
due to lack of RAM
In this example, the Kafka broker process uses about 5.7GB of memory, which is roughly 71.2 percent of the total available memory (8GB). The wa% value is 20.3 percent, which is quite high, indicating that the CPU spends a significant amount of time waiting for I/O operations to complete. This is likely because the Kafka broker has to read data from disk: the OS is left with only about 2.3GB of available RAM, which is not enough to keep all the necessary data in the page cache.
The next section looks at how to optimize the disks in Kafka in order to allow the
brokers to better handle a lack of RAM.
The way the brokers spread the log segments across their disks is influenced by whether the disks are configured using RAID or JBOD. A subsequent chapter delves into this configuration, detailing its effects on the distribution of files and on overall disk performance.
Figure 6-3 shows the consumer lag; all the consumers are lagging behind in all
partitions.
Figure 6-3. The sum of lag (in millions) in all the consumers over all the partitions
over time. As time goes by, the lag continues to increase
When the consumer lag started to increase, you could see the following behaviors.
Figure 6-4 shows that the throughput of the reads from the Kafka disks increased
until they reached the disks’ maximal throughput.
Figure 6-5 shows that the IOPS of reads from the disks also increased until they
reached the maximal IOPS that the disks provide.
Figure 6-6 shows that the CPU I/O wait time increased until it stabilized at 10
percent, which is pretty high for I/O wait.
Figure 6-6. The wait time increased due to latency in the disks
When you look at the disk utilization and compare it to the page cache hit ratio,
you’ll see a negative correlation—the lower the page cache hit ratio, the higher the disk
utilization. See Figure 6-7.
Figure 6-7. A negative correlation between hit ratio from the page cache and the
disk utilization—as the hit ratio goes down, the disk utilization increases until it
reaches IOPS saturation
To conclude, this example shows how a page cache hit ratio has an immediate effect
on disk utilization. The higher the hit ratio, the lower the disk utilization and vice versa—
the higher the miss ratio, the higher the disk utilization and consequently the higher the
CPU wait time.
Here's another example of a Kafka cluster (this time on-premises) whose consumers suffered from recurring lag. We decided to triple the amount of RAM in that cluster and measured the disk utilization before, during, and after the addition of RAM.
Figure 6-8 shows the effect of adding RAM on the disk utilization of the disks in the
cluster.
Figure 6-8. The disk utilization in a Kafka cluster that suffers from lack of
RAM. There’s a correlation between the amount of RAM in the brokers and a
reduction in the disk utilization
Note that the disk utilization dropped linearly based on the amount of RAM added—
from 43 to 13 percent. This is another indication that the cluster lacked RAM, and that
lack of RAM and high disk utilization go hand in hand.
Latency Spikes
When the garbage collector kicks in, it can cause temporary pauses in Kafka’s processing.
This GC pause can result in spikes in latency, negatively impacting the performance of
the cluster and disrupting the smooth flow of message processing for both producers
and consumers.
Resource Utilization
The garbage collection process is CPU-intensive. High GC activity can consume
significant CPU resources, limiting what’s available for other processes. This, in turn,
can degrade overall cluster performance and affect other services running on the
same nodes.
System Stability
GC activity is directly tied to system stability. If the garbage collector struggles to reclaim
memory in a timely manner, it can lead to JVM out-of-memory errors. Such situations
can cause Kafka brokers to crash, leading to potential service disruptions, data loss, and
a cascading effect of rebalancing across the remaining brokers.
• top: Run top and look for the Kafka process. Check the RES (resident
memory size) column to get an idea of how much memory the Kafka
process is using. If you know the exact name of the Kafka process, you
can use top -p $(pgrep -d',' -f kafka) to filter the top output
specifically for Kafka.
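To complement top, the following commands show the broker's resident memory and how much RAM remains for the page cache. This is just a sketch; it assumes the broker process can be found by matching the string kafka, as in the top example above:

  # Resident set size (RSS, in KB) of the Kafka broker process
  ps -o pid,rss,comm -p $(pgrep -d',' -f kafka)

  # Total, used, and available memory, plus the current size of the page cache (buff/cache)
  free -m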
The application would gradually consume all available RAM on a machine in the
cluster. Interestingly, each occurrence affected a different machine, leading to a game
of “musical chairs” that ultimately caused the cluster to cease functioning. It was not
because of high CPU usage or disk utilization; both metrics were surprisingly low.
Instead, a high percentage of system CPU time (CPU sy%) became the telltale sign of
trouble.
We finally concluded that each machine’s “death” (instances where AWS killed the
machine) was directly linked to an available RAM count of zero. A single process was
monopolizing most of the CPU time, but it was primarily wait time (wa%) rather than user
time (us%).
Contrary to initial suspicions, this was not a garbage collection (GC) issue, as GC
typically consumes user CPU time, not wait time. It’s crucial to note that this process had
nearly all the machine’s RAM allocated to it, which could potentially lead to misses in the
page cache.
CPU wait time (wa%) signifies the time the CPU is waiting for device I/O, which could
involve block devices (disk) or network devices. In this instance, the disk seemed to be
the bottleneck.
The machine was attempting to process an enormous volume of disk reads, reaching
200MB/sec. This high level of disk activity might have been due to the lack of available
RAM for the page cache. As the process was relentlessly reading from the disks, the disks
reached their maximum capacity (100% utilization), and the I/O wait time soared to 50%.
This was the root cause of this latency issue.
We contemplated several potential solutions:
• Allocate less RAM to the Go process (less than 14GB). Then, assess
if the disk reads drop below 200MB/sec, and if this reduces the disk
utilization and I/O wait time.
• Use a machine with more RAM, but keep the Go process’s heap size
consistent with the current configuration. This would ensure more
available RAM than the current situation, which is zero. Then, verify
if the disk utilization falls below 100% and if the I/O wait time is
significantly reduced.
• Use a machine with a disk that offers more I/O operations per second
(IOPS) and the same RAM.
Since we were operating in an AWS environment, a machine with more RAM would
likely also provide more disk IOPS. Therefore, we decided to scale up the machine
but made sure that the Go process didn’t consume more than half of the machine’s
RAM. This experience underlines the critical role of RAM in maintaining system
performance and stability.
Summary
This chapter delved into the significant effects of RAM on Kafka clusters. It elaborated on
why adding RAM can sometimes be more crucial than adding CPU or disks, as it reduces
latency and improves throughput. The chapter further differentiated the impacts of
adding RAM in cloud-based machines versus on-premises ones, emphasizing the
scalability and cost-effectiveness of cloud solutions.
We then outlined the positive outcomes of adding more RAM to Kafka brokers,
predominantly enhancing the brokers’ performance and increasing the page cache. We
shed light on how the page cache significantly impacts Kafka’s read and write operations,
with a larger cache leading to faster data retrieval and insertion. However, we also
emphasized that this performance boost needs to be balanced with fault tolerance
considerations.
We provided a practical guide to monitoring the page cache usage with the
cachestat tool, enabling effective control over Kafka’s performance. We also
emphasized the detrimental impact of a RAM shortage on disks, potentially causing the
disks to reach IOPS saturation. In such a situation, I recommend optimizing Kafka disks
to counteract the lack of RAM.
We then learned how to optimize Kafka in terms of RAM allocation, ensuring that
the Kafka cluster runs smoothly and efficiently. We further explored the relationship
between garbage collection (GC), out-of-memory (OOM) errors, and RAM, highlighting
the importance of proper memory management to prevent system crashes.
We provided a guide to Linux commands that measure the memory usage of the
Kafka machine and the Kafka process, enabling KafkaOps to monitor and manage RAM
usage effectively. Lastly, we took a detour to read about the role of RAM in non-Kafka
clusters, underlining the universal importance of RAM in all types of clusters.
In summary, this chapter serves as a thorough guide to understanding the impact of
RAM on Kafka clusters, offering practical solutions to common problems related to RAM
allocation, management, and optimization.
The next chapter takes a closer look at disk I/O overload in Kafka clusters, focusing
on how it affects consumers and producers. Starting with an introduction to key disk
performance metrics, we’ll dive into practical applications, such as diagnosing latency
issues and detecting faulty brokers. Real-life examples highlight the significance of
careful adjustment and monitoring of disk.io threads to ensure optimal performance.
Whether you’re interested in understanding how Kafka reads and writes to disks or
seeking to improve efficiency and reliability, the next chapter offers valuable insights
into the intricate world of disk utilization in Kafka.
CHAPTER 7
Disk I/O Overload in Kafka: Diagnosing and Overcoming Challenges
• The service time of a disk quantifies the time required for the disk to complete a read or write operation, and it's indicative of the disk's latency or response time. For instance, if the disk utilization is 80 percent (the disk is busy for roughly 800 ms out of every second) and the disk serves 400 IOPS, the average service time is 800 ms / 400, or 2 ms.
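The same calculation can be scripted; here is a small sketch that uses the sample numbers from the example above (80 percent utilization and 400 IOPS):

  UTIL_PCT=80   # disk utilization, in percent
  IOPS=400      # read + write operations per second

  # 80% utilization means the disk is busy for about 800 ms out of every second
  BUSY_MS=$(( UTIL_PCT * 1000 / 100 ))

  # Average service time per operation, in milliseconds
  echo "service time: $(( BUSY_MS / IOPS )) ms"    # prints: service time: 2 ms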
While these metrics are vital and effective for detecting latency induced by disks,
they only represent a subset of available disk performance indicators. Other potentially
helpful metrics, such as disk queue length, indicating the number of pending read and
write requests, and disk average response time, denoting the average response duration
of the disks, are also valuable but not detailed in this chapter. These chosen metrics,
however, provide a robust foundation for identifying and diagnosing disk-related latency
issues in Kafka clusters.
Writes
The writes to the filesystem are performed by producers, which send messages to the brokers, and by the brokers themselves, which write the messages they fetch from other brokers as part of the message replication process.
The flow of the writes to the disks of a broker is shown in Figure 7-1.
Figure 7-1. How messages are sent to the broker by producers or by other brokers
within the cluster
As shown in Figure 7-1, the Kafka threads receive these messages and relay them
to the OS filesystem cache for buffering (it’s important to note that Kafka itself doesn’t
buffer the messages). Then, the OS threads intermittently flush these messages to the
disks in bursts, resulting in periods of no writes at all, followed by instances where disk
utilization spikes dramatically.
The phenomenon of these bursts is crucial to understand, as it’s a common
misconception to equate high disk utilization in Kafka brokers with saturation. This view
is flawed because the way that the OS flushes data to the disks creates periods where the
disks in Kafka either rest (with low utilization) or work intensely (with high utilization)
due to a burst of messages being flushed. Therefore, high utilization doesn’t necessarily
mean that the disks are saturated.
Disk utilization can be influenced by various factors related to disk writes, including
high throughput from the producers, replication factor, number of segments per
partition, size of each segment within a partition, and the values of batch.size and
linger.ms.
Reads
The reads from the filesystem are performed by the following:
–– Brokers that fetch messages from other brokers as part of the message replication
–– Consumers that fetch messages that are no longer in the page cache
Disk reads are performed whenever data doesn’t exist in the page cache, and since
the reads aren’t buffered by the OS, reads aren’t performed in bursts.
The disk utilization is also affected by several factors that are related to disk reads.
Figure 7-2 shows a skew in the IOPS per disk in a specific broker.
Figure 7-2. These are the IOPS (read and write operations/sec) per disk in all the
disks of a specific broker. The broker has four disks and Disk-1 has higher IOPS
compared to the others
There can also be a skew in the throughput per disk in a specific broker, as shown in
Figure 7-3.
Figure 7-3. These are the write throughputs (wMB/s, or writes MB/sec) per disk
of all the disks in a specific broker. The broker has four disks and Disk-1 has higher
wMB/s compared to the others
In such a situation, if one disk's utilization is significantly higher than that of the others, it can lead to latency issues that may affect the replication of partitions residing on that disk, consumers consuming from those partitions, and producers writing to the partitions located on that disk.
Figure 7-4. These are the IOPS (read and write operations/sec) per disk in all the
disks of all the brokers in the cluster. There are three brokers in the cluster and each
broker has four disks. The IOPS is twice as high in the disks of Broker-1 compared
to the other brokers
Figure 7-5. These are the write throughput (wMB/s, or writes MB/sec) per disk in
all the disks of all the brokers in the cluster. There are three brokers in the cluster
and each broker has four disks. The wMB/sec is twice as high in the disks of
Broker-1 compared to the other brokers
Figure 7-6. Average disk service time (per disk) is higher in the red disk compared
to the other disks
The average number of read bytes from that specific disk was also higher, as shown
in Figure 7-7.
Figure 7-7. Disk I/O read bytes (per disk) is higher in the red disk compared to the
other disks
The size of the request queue on the broker that hosted that disk was higher (the
request queue contains produce and fetch requests that are sent by clients to the
brokers), as shown in Figure 7-8.
Figure 7-8. Request queue size (per broker) is higher in the broker that hosts the problematic disk (the one with the high I/O read time and I/O read bytes) compared to the other brokers
Figure 7-9. Produce latency 99th percentile (per broker) is higher in the broker that hosts the problematic disk (the one with the high I/O read time and I/O read bytes) compared to the other brokers
Discussion
This combination of disk and Kafka metrics helped us detect a faulty broker.
That broker had more reads from a single disk, which caused it to serve and
consume requests slower compared to the other brokers. The size of the request queue
that contains consume and produce requests was higher on that broker compared to the
other brokers.
Figure 7-10. Normalized load average in all brokers started to climb and reached
±1.3, which indicates there’s a saturation in some resource in the cluster
The problem was that we didn't know which change caused the load average to grow, so we added new brokers to check whether they suffered from the same symptom, and spread the partitions evenly across the old and new brokers.
Surprisingly, the load average of the new brokers was lower than the load average of
the old brokers, as shown in Figure 7-11.
Figure 7-11. Normalized load average in the old brokers was ±1.3, while in the
new brokers it reached ±0.7. So while the old brokers remained saturated, the new
brokers didn’t
When looking for a reason that the load average on the old brokers was high, we noticed that the software interrupt rate (the CPU %si shown in the top command) was almost twice as high in the old brokers as in the new ones, as shown in Figure 7-12. This correlates with the normalized load average of ±1.3 in the old brokers compared to 0.7 in the new brokers.
Figure 7-12. The software interrupt rate (shown as %si in the top command) reached ±4K in the old brokers, while in the new brokers it was 2K
Since the software interrupt rate is related to the number of threads, we started looking for potential suspects, and the number of I/O threads was one of them. Once we reduced the number of I/O threads from 6 to 2, the load average returned to normal in the old brokers, as shown in Figure 7-13.
Figure 7-13. Once the number of I/O threads was reduced from 6 to 2 in the old
brokers, the normalized load average in the old brokers was reduced and became
almost the same as the normalized load average in the new brokers
After that we removed the new brokers from the cluster, because the only reason we
added them was to compare the configuration between the new and old brokers in order
to find which configuration parameter caused the high load average.
Discussion
I/O threads handle requests by picking them up from the request queue for processing.
By adding more threads, throughput can be improved, but this decision is influenced
by various other factors like the number of CPU cores, the number of disks, and disk
bandwidth.
Increasing the number of threads requires careful monitoring of both the NLA
(Normalized Load Average) and the CPU si% (software interrupt percentage). If the NLA
exceeds 1.0 and/or the %si is above ±5%, it could indicate that the disk I/O parallelism
has been increased excessively.
For clusters with multiple disks per broker, boosting the number of I/O threads
generally makes more sense. However, in the context of brokers that have a single disk,
we have found that setting the num.io.threads configuration to 2 provides an optimal
balance. It allows for efficient request processing while maintaining a load average that is
smaller than the number of CPU cores (thus the NLA is below 1.0), preventing potential
overload.
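On Kafka versions that support dynamic broker configuration (1.1 and later), num.io.threads can be changed without restarting the broker. The following sketch applies the change to a single broker; the broker ID (1) and bootstrap address are placeholders:

  # Set num.io.threads=2 on broker 1 only
  kafka-configs.sh --bootstrap-server localhost:9092 \
    --entity-type brokers --entity-name 1 \
    --alter --add-config num.io.threads=2

  # Verify the change
  kafka-configs.sh --bootstrap-server localhost:9092 \
    --entity-type brokers --entity-name 1 --describe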
Figure 7-14. Average sum of writes in MB/sec (wMB/s) in all the disks in a broker,
per all the brokers in the cluster
We also checked the average rMB/s in all the brokers, as shown in Figure 7-15.
Figure 7-15. Average sum of reads in MB/sec (rMB/s) in all the disks in a broker,
per all the brokers in the cluster
The average write IOPS/sec in all the brokers is shown in Figure 7-16.
Figure 7-16. Average sum of write IOPS (W/s) in all the disks in a broker, per all
the brokers in the cluster
The next step was to verify that the even distribution of writes to the brokers' disks was also reflected in the network throughput, and indeed all brokers received and sent the same amount of network traffic. This can be seen in the average bytes received by each broker, as shown in Figure 7-17.
Figure 7-17. The number of bytes sent to each broker through the network
Figure 7-18. The number of bytes sent from each broker through the network
After verifying that the disk and network throughput were evenly distributed among
the brokers, we proceeded to look at the disk IOPS metric. However, this time, we looked
at the max values instead of average values, by checking both the read and write IOPS/
sec only during traffic peaks, when consumers lagged. Here, we found an interesting
issue in one of the brokers:
Figure 7-19 shows the write IOPS/sec during traffic peaks.
Figure 7-19. Write IOPS/sec per each broker during peak traffic time
Figure 7-20. Read IOPS/sec per each broker during peak traffic time
As you can see, when looking at the disk IOPS only during peak times, there is a
single broker that has significantly lower average read and write throughputs than the
rest. It seemed that while other brokers were able to handle bursts, the problematic
broker struggled to keep up and managed to perform only a third of the write operations
compared to the other brokers.
Then we checked, on the disks of the problematic broker, the time that read and write requests wait in the queue before being serviced by the disk (using the r_await metric for reads and the w_await metric for writes, both provided by the iostat tool). We noticed that the wait times for read and write operations were ten and thirteen times longer, respectively, on the problematic broker compared to the other brokers.
Figures 7-21 and 7-22 show the wait time for the read and write operations in all the
brokers.
Figure 7-21. Wait time for read operations on the disks of each broker during
peak traffic time
Figure 7-22. Wait time for write operations on the disks of each broker during
peak traffic time
After seeing the read and write processing times, we understood that the problematic
broker’s disk was much slower than the disks of the other brokers.
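The r_await and w_await values come from iostat's extended statistics; a quick way to compare them across the disks of a suspect broker is simply the following (the interval and sample count are illustrative):

  # Extended per-device statistics every 10 seconds, 6 samples;
  # compare the r_await and w_await columns across devices and across brokers
  iostat -x 10 6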
Another interesting metric was the queue size for produce requests (the queue that holds incoming produce requests). This queue was much larger on the problematic broker compared to the other brokers, and it was capped, possibly due to the latency of the slow disk.
Figure 7-23 shows the size of the produce request queue in all the brokers.
Figure 7-23. Size of the produce requests queue in each broker during peak
traffic time
Discussion
This issue shows the importance of looking at anomalies in a Kafka cluster during peak
times, instead of looking at wider time ranges. In this case, we checked the maximum
values of the read and write processing times (during the peak traffic time) in order to
determine the root cause.
We started with consumers that lagged during peak time, and the first thing we did was look at the disk behavior over the whole time range, which led nowhere. Only when we looked at the maximum values of the disk metrics during the peak time (which carries the maximum traffic) did we find that the disk of a single broker provided much less write throughput, and that the queue of writes to that disk was capped. This reduced the throughput of that disk dramatically.
That "slow" disk caused two problems during peak time:
–– Due to the high latency on that disk, producers that produced into the partitions that resided on the problematic disk developed an increasing queue of messages in their buffers.
–– Consumers that consumed from the partitions residing on that disk lagged behind.
Replacing the disk (if the cluster is on-premises) or the broker (if the cluster is on the
cloud) will solve this problem.
When the broker receives produce and fetch requests from clients, they're added to the RequestQueue along with any other requests that are waiting to be processed. The broker then processes the requests in the RequestQueue in the order they are received.
Note that requests from Kafka brokers to fetch data aren’t added to the
RequestQueue, but instead are handled by the Kafka replication protocol, which doesn’t
use the RequestQueue because it operates independently of client requests.
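The size of this queue is exposed over JMX as kafka.network:type=RequestChannel,name=RequestQueueSize. If JMX is enabled on the broker (the port 9999 used here is only an assumption), one way to sample it from the command line is the JmxTool bundled with Kafka (its option names may differ slightly between Kafka versions):

  # Sample the broker's request queue size every 10 seconds
  kafka-run-class.sh kafka.tools.JmxTool \
    --object-name kafka.network:type=RequestChannel,name=RequestQueueSize \
    --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
    --reporting-interval 10000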
Produce Latency
Produce latency, shown in Figure 7-25, was decreased by a third in all the brokers. The produce latency metric measures the time it takes for a producer to successfully send a message to the broker. It's measured from the time the message is sent by the producer until it's acknowledged by the broker. This metric includes any network latency, the time the message waits in the queue, and the processing time of the message.
Discussion
disk.io threads can have a dramatic impact on the overall latency of producers and
consumers, and on the Kafka clusters’ utilization. In this case, when we switched from a
ratio of 1:1 between disk.io threads and the number of disks per broker to a ratio of 2:1,
the following occurred:
–– Requests waited much less time in the request queue (the size of the
request queue was reduced by two thirds).
–– The time it took messages to be acknowledged was reduced by
two thirds.
–– The CPU utilization of the cluster doubled, and so did the load
average of the cluster.
This case shows that tuning the disk.io threads can be beneficial for Kafka clusters when their clients suffer from high latency. However, it's important to carefully adjust the number of disk.io threads, since too many threads can cause the brokers to stop functioning. It's also important to pay attention to the overall cluster utilization before trying to increase the disk.io threads, since a cluster that has high CPU utilization will probably not benefit from such an increase, even if its disk utilization is low.
Summary
This chapter unraveled the intricacies of disk I/O overload and explained how it can
impact consumers and producers in Kafka clusters. It began with an introduction to
various relevant disk performance metrics, and then delved into how these metrics
can be utilized to diagnose issues related to latency in Kafka brokers, consumers, and
producers. The analysis provided a deep understanding of how Kafka reads and writes
to the disks, shedding light on the factors affecting disk utilization and the symptoms
indicating potential disk I/O problems.
The chapter illustrated real-life production issues and demonstrated the practical
application of these metrics, such as detecting a faulty broker, and the effect of disk.io
threads on the performance of the entire Kafka system. One notable example explored
the impact of having too many disk.io threads, revealing the importance of careful
adjustment of these threads to prevent overloads.
The chapter concluded by emphasizing the necessity of considering disk
performance during peak usage times and the balanced tuning of disk.io threads. These
insights should equip you with the knowledge you need to enhance the efficiency and
reliability of your Kafka clusters, particularly in detecting and resolving disk-related
latency.
The next chapter turns to the choice between RAID10 and JBOD for disk
configuration in Kafka production environments. This decision has profound
implications for data protection, write performance, storage usage, and disk failure
tolerance. Through a thorough comparison of RAID10 and JBOD, we’ll examine the
tradeoffs and unique benefits of each configuration, including write throughput and disk
space considerations. Whether you are prioritizing data security or efficiency in storage
and performance, the next chapter will guide you in selecting the option that best suits
your Kafka brokers’ needs.
CHAPTER 8
Disk Configuration: RAID 10 vs. JBOD
The decision to choose RAID 10 or JBOD for disk configuration when deploying Kafka
clusters is crucial. This chapter thoroughly explores the advantages and disadvantages of
both disk configurations in the context of Kafka production environments.
By comparing various aspects of these two configurations—such as disk
failures, storage usage, write operation performance, disk failure tolerance, disk
health monitoring, and balancing data between disks in the broker—you will gain
a comprehensive understanding of the tradeoffs between RAID 10 and JBOD disk
configurations.
Ultimately, you will be equipped with the necessary knowledge to make
informed decisions as to which disk configuration to opt for in your Kafka production
environments.
JBOD (just a bunch of disks) is a disk configuration whereby the server has internal disks that are controlled individually by the OS, as shown in Figure 8-1. The disks connect to a disk controller on the server, and each disk can be accessed and seen by the OS.
RAID (Redundant Array of Independent Disks) presents the disks of the server as
a single disk. It can be provided as hardware RAID (via a disk controller card) or as a
software RAID. RAID disks are seen by the OS as a single virtual disk.
Before version 1.0, Kafka didn't tolerate disk failures, so using RAID 10 was almost mandatory since it mitigated the disk failure issue (unless, of course, both disks in the same mirrored pair failed). But even after Kafka 1.0, the RAID 10 option is still mentioned by Confluent.
The storage volume will still be accessible even if half of the disks fail, assuming that
each of the failed disks belongs to a different pair. If the two disks of the same pair fail,
the volume won’t be accessible. The write performance of both RAID 1 and RAID 10 is
cut in half (compared to RAID 0) due to the mirroring.
The decision to configure the disks in Kafka brokers as either RAID 10 or JBOD is
influenced by two main factors—the level of data protection desired for the information
stored in the brokers, and the quantity of storage and disks you are prepared to allocate
for that purpose.
Disk Failure
The more disks are being used, the higher the chances that they'll either partially or completely stop functioning due to failures, wear, bad sectors, and so on. Let's consider the level of data protection these two methods provide:
• JBOD: Data protection relies solely on the Kafka replication factor (assuming replication is configured for the topics).
• RAID 10: In addition to the Kafka replication factor, every write is mirrored to a second disk.
Data Skew
When there is more than a single disk in a Kafka broker, there's a chance of data skew between the disks:
• JBOD: If there is more than a single disk per broker, you need to spread the data evenly across the disks of each broker yourself.
• RAID 10: Ensures that the data is spread evenly across all disks per broker.
Storage Use
Let’s assume the amount of data written to the brokers is 10GB, and the replication factor
for all topics is 2. In that case, the brokers require the following storage:
• JBOD: 20GB
• RAID 10: 40GB (the mirroring doubles the stored data)
Storage Usage
When using RAID 10, the disk space being used is twice that compared to JBOD,
assuming the replication factor remains the same.
Conclusion: This is an advantage of JBOD over RAID 10.
• JBOD: The leaders and followers that reside on a failed disk must be
moved to another broker.
• RAID 10: The leaders and followers that reside on a failed disk won’t be
moved because there’s mirroring to a second disk, unless it also fails.
The chances of two disks that are part of the same pair failing are much smaller than the chance of a single disk failing, which means that there will be less partition movement (whether of leaders or followers) due to disk failures in clusters configured with RAID 10.
Conclusion: This is an advantage of RAID 10 over JBOD.
• RAID 10: In the case of hardware RAID, it’s not possible to monitor
the disks via the OS (e.g. using the SMART tool) because all the disks
in the RAID are shown by the OS as one disk, since it’s a single mount
point. Only by using RAID tools can you check the disk status. (Note:
When using software RAID, the disks are visible to the OS.)
• RAID 10: It’s common that the SRE on site isn’t aware that a disk
failed (due to the lack of disk failure monitoring). This causes failed
disks to not be replaced until two disks of the same mirroring fail
(which can cause a production issue).
• RAID 10: Replacing a disk in a RAID controller forces the RAID array
to be rebuilt. The rebuilding of the RAID array is so I/O intensive that
it effectively disables the server, so this does not provide much real
availability improvement compared to JBOD. In fact, I witnessed a
higher downtime when replacing disks in RAID than in JBOD.
JBOD
It's up to the KafkaOps to make sure data and partitions are spread evenly across the disks of each Kafka broker. If the broker has several data directories, each new partition is placed in the directory with the least number of stored partitions. If the data isn't evenly balanced among the partitions, the data skew can increase.
The weakness of this approach is that it’s agnostic to the number of segments per
partition. A partition P1 can have more segments than partition P2 due to several
reasons. To list a few:
–– Assuming partition P1 belongs to topic T1 and partition P2 belongs to
topic T2:
–– Assuming both partitions (P1 and P2) belong to the same topic:
In order to ensure that data is spread evenly across the disks, you can write a script
that reassigns partitions among the brokers based on the number of segments per disk
instead of the number of partitions per disk.
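As a starting point for such a script, counting segments per data directory can be done with standard shell tools. This is only a sketch and assumes the broker's log.dirs point at mounts named /data/kafka-logs-0, /data/kafka-logs-1, and so on:

  # Count .log segment files and total size per data directory (one directory per disk)
  for dir in /data/kafka-logs-*; do
    segments=$(find "$dir" -name '*.log' | wc -l)
    size=$(du -sh "$dir" | cut -f1)
    echo "$dir: $segments segments, $size"
  done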
RAID 10
The RAID controller is (theoretically) in charge of balancing the data evenly between the disks. However, according to Confluent's documentation, it doesn't always balance the data evenly. I can't verify Confluent's stance on whether RAID 10 really balances the data evenly across the disks, since I never encountered a case in which the storage of a disk in RAID 10 became full (which is one of the symptoms of data imbalance among disks).
Summary
This chapter discussed the decision to choose between RAID 10 and JBOD for disk
configuration in Kafka production environments.
JBOD is a disk configuration where the server has internal disks that are controlled
individually by the OS, whereas RAID presents the disks of the server as a single disk.
RAID 10 ensures that the data is replicated across two disks, while JBOD only provides
one level of protection via the replication factor (assuming replication is configured in
the Kafka brokers).
RAID 10 ensures that the data is spread evenly across all disks per broker, but
when using it, there’s a big performance hit for write throughput compared to
JBOD. Additionally, the disk space used with RAID 10 is double that of JBOD, assuming
the replication factor remains the same.
This chapter compared the pros and cons of RAID 10 and JBOD from several angles,
including performance of write operations, storage usage, and disk failure tolerance.
Ultimately, the decision to opt for one disk configuration over the other depends on the
amount of data protection you want for the data stored in the brokers and the amount of
storage and disks you are willing to dedicate to that.
The next chapter delves into the essential aspects of monitoring producers in your
Kafka cluster. This focused exploration is vital to balancing the relationship between
Kafka brokers and producers, which in turn reduces broker load, cuts latency, and
enhances throughput. From examining key metrics like network I/O rate and record
queue time to investigating the importance of message compression, this next chapter
aims to provide valuable insights into achieving optimal Kafka performance.
CHAPTER 9
A Deep Dive Into Producer Monitoring
Producer Metrics
There are several monitoring metrics on the producer side that will assist you in
diagnosing issues related to the effect of producers on the Kafka cluster and vice versa.
Buffer management in the producer also plays a critical role. If the broker’s
acknowledgment is slow and the producer continues to send messages without receiving
acks, the producer’s buffer may fill up, potentially leading to data loss if the buffer
reaches capacity. This situation requires careful handling and an understanding of the
producer’s buffering strategy.
To address a high network I/O rate, consider inspecting the producer’s
configurations and rate of data production. It might be necessary to tune the producer’s
settings, such as the batch size and linger time. However, note that increasing these
settings can create a back pressure on the producers and even cause them to get OOM
errors, so these values need to be tuned carefully.
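One low-risk way to experiment with these settings before touching the real producers is the producer performance tool that ships with Kafka. The topic name, record counts, and property values below are placeholders rather than recommendations:

  # Send 100,000 1KB records using a larger batch size, a short linger, and compression
  kafka-producer-perf-test.sh \
    --topic perf-test \
    --num-records 100000 \
    --record-size 1024 \
    --throughput -1 \
    --producer-props bootstrap.servers=localhost:9092 \
        acks=all batch.size=65536 linger.ms=10 compression.type=lz4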
Figure 9-1. Average network I/O rate per producer pod. The X axis is the timeline,
and the Y axis is the network I/O rate
Figure 9-2. The record queue time is the time the record waits in the buffer queue
Finally, addressing network bottlenecks forms a critical part of the solution. Regular
monitoring and analysis of network performance can lead to actionable insights such as
increasing bandwidth, optimizing routing, or scaling network resources.
Figure 9-3. Average record queue time per producer pod. The X axis is the
timeline, and the Y axis is the average record queue time
Figure 9-4. Outgoing bytes per producer pod. The X axis is the timeline, and the Y
axis is the outgoing bytes
Figure 9-5. The difference between num output bytes and num input bytes
Figure 9-6. The flow of a message from its creation by the producer until it reaches
its partition in the Kafka broker
Figure 9-7. The potential reasons and negative effects of an average batch size
that’s too high
Figure 9-8. The potential reasons and negative effects of an average batch size
that’s too small
Figure 9-9. Average batch size per producer pod. The X axis is the timeline, and
the Y axis is the average batch size
Compression Rate
Compression rate is the ratio of compressed bytes to uncompressed bytes, measuring the
effectiveness of the compression performed by the producer. This section assumes the
compression is performed by the producers and not by the brokers. Figure 9-10 shows
the flow of the compression process.
Figure 9-10. The messages in the batch are compressed and then the batch is sent
to the Kafka broker
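Where the compression happens can be confirmed from the topic configuration: a compression.type of producer (the default) means the broker stores batches exactly as the producer compressed them, while naming a codec forces the broker to recompress. A sketch, with a placeholder topic name:

  # Show the topic's configuration overrides, including compression.type
  kafka-configs.sh --bootstrap-server localhost:9092 \
    --entity-type topics --entity-name events --describe

  # Keep the producer's compression as-is on the broker side
  kafka-configs.sh --bootstrap-server localhost:9092 \
    --entity-type topics --entity-name events \
    --alter --add-config compression.type=producer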
Summary
This chapter provided a comprehensive guide to monitoring producers in a Kafka
cluster, since effectively monitoring the producers can reduce the load on the brokers,
decrease latency in producers, and increase overall throughput.
The chapter discussed several key producer metrics, including Network I/O Rate,
Record Queue Time, Output Bytes, and Average Batch Size.
Additionally, the chapter devoted a special section to the Compression Rate metric,
highlighting the importance and effectiveness of message compression in a Kafka
cluster.
The upcoming chapter turns your attention to consumer monitoring in the Kafka
cluster. It dives deeper into the metrics and behaviors essential to consumers, ensuring
smooth message flow and efficient processing. It explores essential consumer metrics,
including Consumer Lag, Fetch Request Rate, and Bytes Consumed Rate, which
together provide insights into the Kafka consumers’ functionality. It also delves into
the relationships between consumer metrics and other Kafka components, as well as
discusses the complexities of data skew and consumer lag. To bring these concepts to
life, the chapter includes a case study highlighting how to pinpoint and address broker
overload using metric correlations.
CHAPTER 10
A Deep Dive Into Consumer Monitoring
Consumer Metrics
This section reviews several Kafka consumer metrics that provide essential information
about message consumption rates, fetch details, latency, and various other aspects.
These metrics are instrumental in ensuring the smooth functioning of consumers, and
by extension, the producers and brokers in a Kafka setup. Furthermore, they provide
insights into the system’s performance, efficiency, and reliability, enabling you to
optimize resource usage, improve system responsiveness, and maintain high-quality
data flow in your Kafka pipeline.
Figure 10-1. As long as the number of produced messages is higher than the
number of consumed messages, the consumer lag increases
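Consumer lag per partition can be checked on demand with the consumer groups tool bundled with Kafka; the group name below is a placeholder:

  # Show current offset, log-end offset, and lag for each partition assigned to the group
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group my-consumer-group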
Potential causes for a high fetch request rate can be a high message production rate
or a low fetch.min.bytes configuration on the consumer, which leads to more frequent
requests, because the consumer fetches data as soon as the specified amount of data is
available.
• Third, think about how big the messages you’re producing are. If
they’re small, you’re going to have a lot of fetch requests. If your
situation allows it, you might want to think about grouping smaller
messages together to cut down on the number of fetch requests you
need to make.
Figure 10-2 shows the effect of low and high values of fetch.min.bytes and fetch.
max.bytes in the consumer.
Figure 10-2. How low or high values of fetch.min.bytes and fetch.max.bytes affect
the load and memory use of the consumer
If your consumer makes frequent fetch requests but retrieves smaller amounts of
data, or if you’re looking to manage the tradeoff between data delivery latency and fetch
efficiency, adjusting fetch.max.wait.ms can be beneficial.
By increasing the wait time, the consumer allows for a potentially larger amount
of data to accumulate before fetching, which could reduce the fetch request rate and
improve overall fetch efficiency. However, you should be mindful, as setting this too high
can introduce noticeable delays in data delivery, especially when the data production
rate isn’t consistently high.
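For a quick experiment with these fetch settings, they can be passed straight to the console consumer; the values shown here are illustrative rather than recommendations:

  # Wait for up to 1MB of data or 500 ms, whichever comes first, before each fetch returns
  kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic events --group fetch-tuning-test \
    --consumer-property fetch.min.bytes=1048576 \
    --consumer-property fetch.max.wait.ms=500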
On the broker’s side, a high number of records per request means that it’s serving a
substantial volume of data with each request. If this data is cached in RAM (page cache),
then the I/O impact is minimal. However, if the data needs to be retrieved from disk
because it’s not in the cache, then it can lead to increased I/O operations, potentially
resulting in longer response time, especially if there’s also a high request rate.
Potential causes can be a high rate of message production or a high fetch size
configured in the consumer, which allows for more data to be fetched in each request.
The message key determines which partition the message should be directed to. By ensuring messages with identical
keys land on the same partition, Kafka not only maintains order but also leverages the
similarities between messages for better compression efficiency.
If you observe that the distribution of these keys is uneven, it is beneficial to
investigate the producer’s key generation and distribution methodology. For situations
where the keys aren’t paramount, opting for a round-robin partitioner can balance the
message distribution.
However, adopting the round-robin method might sacrifice compression efficiency,
because it doesn’t group similar messages. Grouping messages by their partition key
allows for better compression, thanks to their shared content. Therefore, while round-
robin can mitigate the risk of data skew compared to partition key usage, it’s not always
the optimal strategy.
Another limitation of the round-robin approach is its potential impact on the
aggregation ratio. When consumers rely on aggregating specific message values, the
broad distribution inherent to round-robin can hinder their efforts. For consumers
emphasizing such aggregations, especially from a skewed topic, refining the partition
key may be more beneficial than defaulting to the round-robin distribution.
When some consumers lag yet no data skew is present among the partitions, it could
mean that certain consumers are slower or overwhelmed. Slower consumers can be
caused by issues such as garbage collection pauses, slow processing logic, or resource
constraints. Consider tuning the number of threads per consumer, adjusting the size of
fetched data, or scaling out the consumer application.
A complex scenario arises when some consumers lag and there’s data skew among
partitions. This indicates that not only are certain consumers slower, but specific
partitions may also have more data than others. In such cases, the strategies mentioned
previously need to be employed simultaneously.
In cases where all consumers lag, but there’s no data skew among partitions, the
problem could lie with the consumer application itself or the infrastructure. The consumer
applications might be struggling with issues such as garbage collection, resource
contention, or slow processing logic. Alternatively, there could be infrastructure problems
affecting network performance or disk I/O. Here, scaling out the consumer applications,
improving consumer logic, or increasing consumer system resources might be necessary.
The most challenging situation is when all consumers lag and there's data skew among partitions. In such a case, we need to investigate both the producer's key distribution logic, to address the data skew, and the consumer application or infrastructure issues, to tackle the consumer lag.
Figure 10-3. A topic with almost no data skew between its partitions
Figure 10-4 shows a different story—the partitions from the percentiles of P95 and
above receive significantly more messages than the other partitions, which causes their
consumers to lag, and potentially even lose data if the lag increases for a longer period of
time than the topic’s retention.
Figure 10-4. A topic with a data skew. At least two of its partitions (the P95 and
P99 partitions) receive more messages than its other partitions
Figure 10-5. Spikes of high I/O wait time in the rogue broker (represented by the
green line)
Similarly, the network processors can be heavily engaged in managing the influx of
incoming write requests. This activity level is reflected in an elevated Network Processor
Busy% metric, especially in the rogue broker, as seen in Figure 10-6, which shows the
idle% of the network processors.
A drop in the idle% of the network threads means a spike in their busy%. Essentially,
every write operation requires network communication, which, when performed at a
high frequency, keeps network processors busy, which is reflected in the metric.
Figure 10-6. Drops of network processor idle% time (which is equivalent to spikes
in their busy%) in all brokers but especially in the rogue broker (represented by the
orange line)
As the broker grapples with a high volume of write operations, its disks become too
busy, so the queue depth (the queue of pending I/O operations) grows, which causes the
time that operations spend in the queue also to increase. Figure 10-7 shows the spikes
in the number of write operations waiting in the queue in order to perform writes to the
disks of the broker.
Figure 10-7. Spikes in the number of write operations waiting in the queue of the
brokers. The rogue broker is represented by the green line
Furthermore, the Produce Latency 99th Percentile metric captures the longer
durations the broker takes to acknowledge write operations under this increased load.
Higher latency is a common side-effect of the broker striving to manage high-frequency
writes. Figure 10-8 shows the spikes of this metric in the rogue broker.
Figure 10-8. Spikes of the produce latency 99th percentile in the rogue broker
(represented by the pink line)
In tandem with juggling write operations, the broker also has to serve fetch requests from consumers. As the load increases due to high-frequency writes, serving these fetch requests may get delayed, which in turn increases the fetch latency experienced by consumers of the rogue broker, as shown in Figure 10-9.
Figure 10-9. Spikes in the time it takes the rogue broker (represented by the green
line) to serve fetch requests from the consumers
The Log Flush Rate metric refers to the frequency and amount of time it takes to flush data from Kafka's in-memory log buffer to the disk. This metric increases with frequent writes, as data persistence demands more regular flushes of the log to the disk. Figure 10-10 shows the spikes in the log flush rate in the rogue broker.
Figure 10-10. Spikes in the log flush rate in the rogue broker (represented by the
red line)
To summarize, I used various metrics related to the broker, consumer, and producer
to get to the root of the problem with an overloaded broker. By looking at these metrics
together, I could see that the broker was dealing with too many write operations. This
showed up in metrics like increased CPU wait time, busier network processors, growing
queue of operations waiting to be written to the disk, and slower time to acknowledge
write operations.
I also saw that the broker was slower to handle fetch requests from consumers
and had to flush data more often from its temporary memory storage to the disk.
Understanding how these metrics interact helped me figure out why this specific broker
was slower. This should remind you that it’s important to share the load evenly across all
brokers in a Kafka cluster to avoid putting too much strain on one broker.
Summary
This chapter explained Kafka consumer monitoring. It emphasized the importance
of tracking key consumer metrics and drawing insights from their behavior to ensure
smooth data flow in the Kafka cluster. The main objective was to enhance the proficiency
of consumers in receiving and processing messages, which is pivotal in maintaining the
robustness and reliability of the Kafka cluster.
The chapter detailed various consumer metrics, such as Consumer Lag, Fetch Request Rate, Consumer I/O Wait Ratio, and Bytes Consumed Rate, among others. These metrics provide essential insights into message consumption rates and latency, helping to ensure the smooth functioning of the consumers and, ultimately, of the Kafka brokers.
The relationship between data skew in partitions and consumer lag was also
explored—understanding different combinations of consumer lag and data skew across
the topic’s partitions can greatly enhance the health of the data streaming pipeline. An
ideal scenario is when there’s no consumer lag and data is evenly distributed across all
partitions. However, effective management is required when this is not the case.
The chapter also illustrated how correlating consumer, producer, and broker metrics can help track down issues, using a Kafka cluster as a case study. In this cluster, one broker received significantly more write operations than
others. By correlating metrics related to the broker, consumer, and producer, the chapter
highlighted how it was possible to identify and resolve the issue with the overloaded
broker. The case study underscored the importance of evenly sharing the load across all
brokers in a Kafka cluster to avoid overburdening any single broker.
The next chapter delves into the stability of on-premises Kafka data centers. Chapter
11 explores the various hardware components in a Kafka data center, identifying the
risks and potential failures that can influence the stability of the system. From disk and RAM DIMM failures to the challenges posed by network interface cards (NICs), power supplies, motherboards, and disk drawers or racks, we will examine the multifaceted elements that ensure the smooth running of a Kafka cluster. The chapter breaks down
the common causes for these hardware failures, the consequences they can have on
a Kafka broker, and the strategies that can be implemented to minimize their impact.
A special emphasis is on HDD disk failures, given their critical role in Kafka clusters,
along with a look at often-overlooked elements like power supplies and motherboards.
Additionally, the next chapter discusses how external factors like enabling firewalls and
antiviruses can affect performance.
CHAPTER 11
Stability Issues in
On-Premises Kafka
Data Centers
Kafka clusters can be separated into two deployment categories—cloud-based and on-premises. This chapter discusses the potential stability issues that may arise from hardware failures in Kafka clusters that are deployed on-premises. Such clusters are heavily dependent on their hardware components, and failures in those components can compromise the stability of the entire cluster.
These hardware components specifically include disks, DIMMs, CPUs, network
interface cards (NICs), power supplies, cooling systems, motherboards, disk drawers,
and cabling and connectors. These components can experience failures for a variety of
reasons, including natural wear and tear, manufacturing defects, environmental factors,
and improper maintenance.
These failures can significantly degrade the overall performance and stability of
the cluster, so we’ll investigate the reasons behind these hardware failures, ranging
from aging and environmental conditions to manufacturing defects and maintenance
issues. Here are some of the effects of these failures:
• Disk failures, either complete or subtle ones like latency and I/O
errors, can disrupt Kafka’s operation due to its heavy reliance on disk
I/O operations.
Given that these clusters are Linux-based, this chapter explores how to utilize Linux
tools to monitor the health of these hardware components, with a particular emphasis
on disk monitoring. Such tools can provide insights into disk performance, helping
administrators identify potential issues before they escalate.
We’ll also discuss the impacts of these hardware failures, including data loss,
disruption of replication protocols, and data distribution imbalances, all of which
negatively affect Kafka cluster performance.
Understanding hardware failures and their effects on Kafka cluster stability, as well
as how to use Linux tools for disk monitoring, will equip you with the skills to prevent,
identify, and resolve these issues. The aim of this chapter is to simplify the task of
maintaining a stable on-premises Kafka cluster and make it more effective.
• Disks: Disks are central to Kafka’s operation, as they store all of the
incoming data. Issues can include mechanical failures, firmware
bugs, or problems caused by physical shock or environmental
factors. Failures can be complete, preventing access to all data, or
partial, causing increased latency or errors during data read/write
operations.
• Disks: Disks store all Kafka messages. A disk failure could lead to
loss of data if not properly replicated. Disk latency or I/O errors can
lead to slow message processing, thereby increasing the end-to-end
latency of messages.
This section discusses reasons for disks to fail and the effects of such failures. Note that I'll refer only to 2.5-inch and 3.5-inch HDD disks, connected to Kafka brokers via either SATA or SAS interfaces and spinning at 5,400-15,000 RPM, since these are the only disk types that I've had experience with in on-prem Kafka clusters.
• Wear and tear: A disk may wear out due to the constant reads
and writes.
• Power surge: Data centers may encounter power surges from time to time, due to equipment failures, severe weather, or regional electricity outages. Some data centers may even lack sufficient backup generators and UPS (uninterruptible power supply) units, leaving them without power until the electricity comes back.
• Bad sectors: Bad sectors on a disk are typically created due to wear and tear on the disk surface. They're caused by aging disks, overheating, or filesystem errors.
• Data loss: If the failed disk contained partitions for one or more
topics, data stored on those partitions may be lost and cannot be
recovered.
• Increased latency: The broker may slow down as it tries to recover the
partitions and re-replicate the data to other brokers in the cluster.
It’s important to have a proper backup and disaster recovery plan in place to
minimize the impact of a disk failure in a Kafka broker and to ensure high availability of
the cluster.
• Wear and tear: Hard disk drives (HDD) are particularly vulnerable
to wear and tear since they contain moving parts. Over time, these
components can deteriorate, hindering the disk’s ability to read or
write data.
• Bad sectors: As disks age, certain areas (sectors) may become faulty
and lose the ability to store data. While having a few bad sectors is
considered normal, a significant number can signify a failing drive.
• Full disk: A disk that has reached its storage capacity will not have
space available for writing additional data, effectively rendering it
read-only.
One indication of an impending disk failure from the SMART tool is a failing overall-health self-assessment or a growing count of reallocated sectors. For Kafka brokers, you can run a SMART health check on all data devices to detect whether a disk is about to fail, as sketched below. In the case of write failures, system logs (e.g., dmesg or /var/log/syslog) may reveal I/O errors.
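As a minimal sketch, assuming the smartmontools package is installed and that the broker's data disks appear as /dev/sd* devices (adjust the device list to your hardware), you could check all disks like this (run as root):

# Report the SMART overall-health assessment of every disk; anything other
# than PASSED is a strong signal that the disk should be replaced
for dev in /dev/sd?; do
  echo "== $dev =="
  smartctl -H "$dev"
  smartctl -A "$dev" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'
done

# Look for I/O errors recently reported by the kernel
dmesg -T | grep -iE 'I/O error|blk_update_request'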
While this process can help you recover from a disk failure, preventing disk failures
is always preferable. Regular monitoring of disk health (using SMART attributes, for
instance) can help reduce the risk of disk failures. Moreover, ensuring that Kafka’s data
replication is correctly configured can help prevent data loss when disk failures do occur.
• Age and wear: Like any component, RAM can degrade over time.
Repeated write cycles, in particular, can lead to memory wear.
In most cases, when a DIMM fails, it needs to be replaced. Unlike some components,
RAM typically can’t be repaired, at least not without specialized equipment and
expertise.
Machine event monitoring, such as Dell’s Integrated Dell Remote Access Controller
(iDRAC) or HP’s Integrated Lights-Out (iLO), offers one such approach. These tools,
provided by server manufacturers, continuously monitor the health of hardware
components, including RAM. If they detect abnormalities or failures, they can send
alerts or log entries. System administrators who keep an eye on these notifications can
take preemptive measures before a faulty DIMM leads to serious problems.
Another avenue for monitoring DIMMs comes from the operating system itself. The kernel often detects RAM issues and writes error messages related to DIMMs to system logs, such as dmesg or /var/log/syslog on Linux systems. These messages can include details about specific memory addresses or other technical information, aiding in the diagnosis of the problem.
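For example, on a Linux broker you could grep the kernel logs for memory errors and, if the EDAC drivers are loaded (an assumption that depends on the platform), read the per-memory-controller error counters:

# Kernel messages about correctable/uncorrectable memory errors
dmesg -T | grep -iE 'edac|memory error|mce'
# Correctable (ce) and uncorrectable (ue) error counters exposed by EDAC
grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null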
• Loss of connectivity: When a NIC fails, the broker loses its ability to
communicate with other brokers in the Kafka cluster. This can make
the broker unavailable, causing client requests to fail and disrupting
data processing.
• Data loss: If a broker is offline due to a NIC failure and the topic
replication factor is low, it could potentially lead to data loss.
If it’s a hardware issue with the NIC itself, consider replacing the NIC if it’s a physical
card. For built-in NICs, it might be necessary to replace the entire motherboard, or
disable the faulty NIC and install a new network card. For redundancy reasons, it’s
recommended to set two separate network cards for each broker.
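Before replacing hardware, it's worth confirming the failure from the OS side. A quick sketch, assuming the interface is named eth0 (substitute your broker's interface name):

# Link status, negotiated speed, and driver details
ethtool eth0
# Accumulating RX/TX errors or drops point to a failing NIC, cable, or switch port
ip -s link show eth0
# Kernel messages about link flaps or NIC resets
dmesg -T | grep -i eth0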
• Power surges or dips: A sudden surge in power can damage the power
supply unit. Conversely, voltage dips can cause the power supply to
fail to provide the necessary power to components.
• Poor quality or age: Lower-quality power supply units are more likely
to fail, as are older units. Even high-quality power supplies can fail as
they age.
Motherboard Failures
The motherboard is a critical component of any computer system, and its failure can
have serious implications.
• Power supply issues: If the power supply to the disk drawer or rack
fails, all the disks it houses can fail simultaneously. This could be
due to a faulty power distribution unit (PDU), power cable, or even a
power surge that damages the unit.
• Cooling issues: Disk drawers and racks often have built-in cooling
mechanisms. If these fail, that can cause the disks to overheat
and fail.
• Physical damage: This can be from accidents like dropping the rack,
water damage, or even simple wear and tear over time.
In summary, while disk drawer and rack failures can have serious implications,
proper preventive measures and monitoring can help mitigate the risks associated with
such failures.
It’s important to carefully consider the impact of firewalls and antivirus software on the performance of a Kafka broker before enabling them. Moreover, Kafka administrators sometimes simply don’t notice that such software is installed on a broker, so it’s good practice to develop a verification script that runs periodically on the brokers and checks whether firewalls and antivirus programs are installed. If they are, the script should also verify that they aren’t enabled.
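A hypothetical verification sketch along these lines is shown below; the service names are assumptions and should be adapted to whatever firewall and antivirus software your organization actually deploys:

#!/bin/bash
# Report whether common firewall/antivirus services are installed and active
for svc in firewalld ufw iptables clamav-daemon; do
  if systemctl list-unit-files 2>/dev/null | grep -q "^${svc}"; then
    echo "$(hostname): ${svc} is installed and currently $(systemctl is-active ${svc})"
  fi
done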
Disk Performance
Ensuring optimal disk performance is vital for maintaining a healthy ZooKeeper cluster.
Solid State Drives (SSDs) are strongly advised for ZooKeeper, as they offer the low-
latency disk writes required for optimal functioning. Since each request to ZooKeeper
must be committed to disk on every server in the quorum before the result becomes
available for reading, having efficient and fast storage is a non-negotiable requirement.
Monitoring the I/O performance, disk latency, and write speeds can prevent bottlenecks
that might otherwise hamper the overall system performance.
Dedicated Machines
Another significant recommendation for ZooKeeper deployment is to host the servers
on dedicated machines, separate from the Kafka broker cluster. This isolation ensures
that ZooKeeper can function at its best without competing for resources with Kafka
brokers. Such a setup allows for more precise tuning, monitoring, and maintenance of
the ZooKeeper instances, which are crucial for stability.
Monitoring ZooKeeper
To make sure the Kafka cluster remains stable, you must pay careful attention to
several aspects of ZooKeeper. Keeping an eye on the up or down status of the nodes in
the ZooKeeper quorum is vital, as is monitoring the response time for client requests
to detect performance issues early on. It’s also essential to watch the data stored by
ZooKeeper, ensuring it stays within healthy limits. Regular checks on the number of
client connections to the ZooKeeper servers help you understand the load and potential
stress on the system.
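Much of this can be checked with ZooKeeper's built-in four-letter-word commands. A minimal sketch, assuming the nodes listen on the default client port 2181, that the hostnames (placeholders here) are your ZooKeeper servers, and that the commands are whitelisted via 4lw.commands.whitelist:

# Liveness check: a healthy node answers "imok"
echo ruok | nc zk1.example.com 2181
# Server metrics: latency, outstanding requests, connection and znode counts
echo mntr | nc zk1.example.com 2181
# Connection details per client
echo cons | nc zk1.example.com 2181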
• DIMM’s issues: Check for any RAM issues from kernel messages, as
these can impact overall system stability.
• Number of open files: Check that the number of open files has not
reached a critical threshold, which could limit Kafka’s ability to
function.
• Kafka disk usage: Monitor the disk space dedicated to storing Kafka
topics across machines to avoid running out of space.
• Time synchronization: Ensure all machines are in sync with the NTP
server, as inconsistent time-keeping can lead to various issues.
• Disk balance across Kafka machines: Check that disk space usage
is more or less balanced, to prevent certain machines from being
overloaded.
• Disk mounting and read/write status: Check that all disks are mounted correctly and that none are in a read-only state, which could limit functionality (see the smoke-test sketch after this list).
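A minimal smoke-test sketch covering several of these checks follows. The paths, process name, and thresholds are assumptions to adapt to your own brokers; treat it as a starting point rather than a complete test suite:

#!/bin/bash
# Run periodically on each broker (as root) and alert on any WARN lines

# Kernel-reported DIMM/memory issues
dmesg -T | grep -qiE 'edac|memory error' && echo "WARN: possible RAM issue in kernel log"

# Open file descriptors of the Kafka broker process vs. its limit
pid=$(pgrep -f kafka.Kafka | head -1)
if [ -n "$pid" ]; then
  echo "open files: $(ls /proc/$pid/fd | wc -l)"
  grep 'Max open files' /proc/$pid/limits
fi

# Disk usage of the Kafka data directory
df -h /var/lib/kafka

# Time synchronization with the NTP server
timedatectl | grep -i 'synchronized'

# Disks mounted read-only (would block writes)
awk '$4 ~ /(^|,)ro(,|$)/ {print "WARN: read-only mount:", $2}' /proc/mounts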
Summary
This chapter delved into various factors that can influence the stability of on-premises
Kafka data centers. It began by outlining the potential failures in key hardware
components. These included disks, RAM DIMMs, NICs, power supplies, motherboards,
and disk drawers or racks, which can all experience failures due to natural wear and tear,
manufacturing defects, environmental factors, and improper maintenance.
Next, the chapter delved into the impact of these hardware failures on the stability of
a Kafka cluster. It highlighted that the consequences of these failures could be lessened
by employing strategies such as replication, failover, and backups, as well as proactively
monitoring the health of the Kafka cluster and its hardware components.
A particular focus was given to HDD disk failures. We discussed the reasons for these
failures and their effects on Kafka clusters. The discussion broke down common reasons
for disk failures, their potential impacts on a Kafka broker, the process of monitoring disk
health, and ways to resolve disk failure issues.
We also looked at RAM DIMM failures, learning about potential causes such as physical damage and natural wear and tear, as well as the effects of these failures on system stability and data integrity.
Then, we shifted focus to Network Interface Cards (NICs), which are vital for
managing server connections. That section explored how failures in NICs, caused by
factors like physical damage, hardware incompatibility, and faulty drivers, can lead to
significant problems in on-premises Kafka clusters.
Power supplies, a critical but often overlooked component, were discussed next.
That section explored the potential causes for power supply failures, their effects on a
Kafka broker, and methods to monitor and resolve such issues.
The next section covered motherboards. As the backbone of a computer system,
motherboard failures can have severe implications. That section covered potential
causes, impacts, and solutions for these failures.
We then moved on to discuss disk drawers and racks, which house multiple disks.
The section illustrated how a failure of any of these components could result in multiple
simultaneous disk failures, and we looked into the causes, implications, and ways to
mitigate such failures.
We also discussed the potential negative effects of enabling firewalls and antivirus
programs on Kafka brokers. We touched upon potential latency issues and the impact
on consuming and producing rates that may occur when enabling antivirus and firewall
software on the disks of a Kafka broker.
We also explored ZooKeeper’s best practices in on-premises Kafka data centers
and emphasized optimal disk performance, low-latency writes using SSDs, and
the significance of dedicated machines for ZooKeeper. Comprehensive monitoring
strategies were outlined, highlighting ZooKeeper’s role in Kafka cluster stability.
We concluded with an in-depth look at the development of dedicated smoke tests
for maintaining Kafka and ZooKeeper stability. Detailed insights were provided on
various tests for hardware and Kafka cluster validations, including RAM issues, disk
problems, network functionality, memory availability, disk usage, and more. The section
underscored the preventive role of smoke tests in identifying and addressing potential
problems early.
The next chapter focuses on an aspect that can significantly impact the efficiency
and cost-effectiveness of a Kafka cluster: optimizing hardware resources.
While the stability and robustness of the Kafka cluster are of paramount importance,
an equally critical consideration is ensuring that the system is not over-provisioned.
Over-provisioning can lead to unnecessary costs and underutilized resources, resulting
in an inefficient system.
Chapter 12 dissects the various metrics and considerations to accurately assess the
cluster’s usage, such as RAM, CPU, disk storage, and disk IOPS. It explores the nuanced
differences in scaling on-prem versus cloud-based clusters and investigates the pros
and cons of different scaling alternatives from various perspectives, including technical,
managerial, and financial. Through six illustrative examples, you will get a detailed guide
on how to effectively implement scale-in and scale-down strategies to maximize cost
savings without compromising the stability of your Kafka cluster.
CHAPTER 12
Cost Reduction of
Kafka Clusters
This chapter zeroes in on the pivotal aspect of reducing hardware costs in Apache Kafka
clusters. By carefully examining the choices between deploying Kafka on the cloud and
on-premises, we’ll delve into strategies that can lead to significant cost reduction.
The chapter begins by exploring the vital aspects that influence the scaling of Kafka
clusters, with particular attention to Kafka on-premises. Here, we’ll outline various
options and their direct impact on hardware costs, drawing comparisons between cloud
and on-premises solutions. The technical, managerial, and financial facets are analyzed,
all within the context of achieving cost savings.
As we progress, the chapter takes a deep dive into the hardware considerations that
play a central role in controlling costs. This includes detailed examinations of RAM, CPU
cores, disk storage, and IOPS, with an emphasis on identifying the most economical
configurations and setups. Practical examples are provided to clarify the optimal
hardware selection for both cloud and on-premises deployments, with clear insights into
the tradeoffs involved.
The concluding section presents a series of real-world examples that vividly illustrate
proven strategies for cost reduction in over-provisioned Kafka clusters. These include
specific findings, options for reducing costs, and recommendations tailored to different
scenarios and cluster specifications. Special attention is given to cloud deployments, but
the principles can be applied more broadly.
Through these examples and the detailed exploration throughout the chapter,
you’ll gain a concrete understanding of how to minimize the hardware costs of your
Kafka cluster. The insights offered are not merely theoretical; they are drawn from
real-world applications and are designed to empower you to make informed,
cost-effective decisions.
Lack of RAM
Managing memory in a Kafka cluster is essential. In the cloud, you can easily add more
RAM by choosing bigger instances. But for clusters deployed on-premises, it’s a bit more
complex. You can either add or swap out memory sticks, or just bring in more machines.
Each choice has its own set of pros and cons, so it’s crucial to think it through.
Cloud-Based Cluster
In order to add more RAM to a Kafka cluster that is deployed on the cloud, we just need to spin up a new cluster whose instances have more RAM.
On-Prem Cluster
When aiming to add more RAM to a Kafka cluster deployed on-premises, we have
several possible approaches. If there are available memory slots on the motherboard,
additional DIMMs can be added to each broker. Alternatively, if the size of the current
DIMMs deployed on the motherboard is smaller than the maximal size, these can be
replaced with larger ones.
If all memory slots are occupied with DIMMs at their maximal size, adding more machines becomes the viable option. Note that the term DIMM used here refers to a memory module (memory stick); this chapter specifically refers to DDR4 DIMMs, which range in size from 8GB to 64GB.
Discussion
There are several courses of action when dealing with a Kafka cluster that’s deployed on-
prem and lacks RAM, and choosing which way to go depends on the use case. If there are
available slots for DIMMs, then the easiest way is to add DIMMs. If that solves the issue,
then we’re done.
However, what if that doesn't solve the problem? In such a case, we have two options, as discussed next.
Replace the DIMMs with Larger DIMMs That Have More RAM
This strategy can be employed if the current size of the DIMMs isn’t at its maximum.
Replacing all the DIMMs in all brokers with larger-sized DIMMs has specific advantages
and challenges.
From a technical standpoint, this approach is correct, as it merely involves adding
RAM without introducing additional brokers. This means there’s no need to reassign the
topics since the number of brokers remains the same.
However, the financial implications of this solution must be considered. For instance, purchasing a single server with 24 DIMMs of 32GB each may cost two to five times more than purchasing just the DIMMs themselves, depending on the server type.
The managerial aspect also presents challenges, making it a tough decision to
undertake.
This process involves removing all existing DIMMs from the brokers, replacing them
with the larger ones, and storing the old DIMMs elsewhere. Data center owners often
resist purchasing hardware that will not be utilized (such as the old DIMMs), making this
solution less attractive from an operational perspective.
Scale Out by Adding More Brokers with the Same Amount of RAM
This is an alternative solution to increasing the cluster’s capacity. From a managerial
standpoint, this method has the advantage of not requiring the disposal of old DIMMs,
making it an easier decision to implement. Moreover, adding more machines to the
cluster is often a simpler task compared to replacing all the DIMMs in the current
brokers.
However, this approach does have its drawbacks. If the DIMMs in the current brokers
can be replaced with larger ones, adding more brokers with the same amount of RAM
might end up being more costly. Essentially, the financial implications could outweigh
the convenience, especially if the current DIMMs have not reached their maximum
potential size. Therefore, careful consideration of the technical requirements and the
budget constraints should guide the decision-making process.
The decision whether to replace the DIMMs or add more brokers needs to take into account all these aspects—technical, financial, maintenance, and managerial. To make the right call, you'll need to weigh the importance of each aspect and choose the path accordingly.
• For the scale-up approach, we can create a new cluster with the
same number of instances but select instance types that come with
more cores. This essentially enhances the existing configuration
with additional processing power without increasing the number of
instances.
On-Prem Cluster
When an on-premises Kafka cluster requires more CPU cores, two primary strategies can
be considered: scaling up or scaling out.
Discussion
When facing the decision whether to scale out or scale up a Kafka cluster to get more
CPU cores, various considerations come into play, and the optimal approach might
differ between on-premises and cloud-based deployments.
In the context of an on-premises cluster, scaling out often seems more financially
attractive, as it can extend existing resources without necessitating the purchase of
entirely new or costlier hardware. Conversely, in a cloud-based environment, financial
considerations may be less significant, as replacing machines doesn’t result in the
retention of old, unused hardware.
From a managerial perspective, scaling out an on-prem cluster may also be more
appealing. It avoids the disposal or replacement of current equipment, aligning more
closely with the existing infrastructure and investment, and making the decision-making
process smoother. This managerial concern is typically not relevant for a cloud-based
cluster, where hardware replacement is a transparent operation.
Technically, scaling out is a more straightforward solution for both on-prem and
cloud-based clusters. It allows for the necessary expansion without the need to migrate
topics from the old cluster. Reassignment of partitions to new brokers simplifies the process,
making scaling out a commonly preferred method across both deployment scenarios.
On-Prem Cluster
In an on-premises environment, adding more disk storage to a Kafka cluster requires
careful consideration of the existing hardware configuration, along with identifying
the most suitable method for expansion. The term storage drawer, which refers to the
component within a server chassis that houses disk drives, is pivotal in this context.
One strategy to increase storage involves replacing the current disk drives in the
storage drawers with those having more storage capacity. This approach enhances the
existing infrastructure without necessitating additional hardware, capitalizing on the
opportunity to upgrade without substantial changes.
If there are unused slots in the storage drawers, then adding more disk drives can be
a viable option. This approach makes the most of existing capacity in the infrastructure,
promoting a cost-effective increase in storage without transforming the overall hardware
configuration.
Alternatively, the cluster can be expanded by adding more brokers, each equipped
with the same number and type of disks.
Discussion
When confronting the challenge of adding more disk storage to a Kafka cluster, various
factors must be examined, and the most suitable approach may differ between cloud-
based and on-premises deployments.
In the cloud-based environment, particularly with AWS, the availability of different
underlying storage technologies, such as Elastic Block Store (EBS) and NVMe, provides
flexibility in scaling. With EBS, you can extend disk space effortlessly, offering options
to scale out by adding more brokers or scale up by choosing larger storage volumes.
However, considerations regarding network latency and IOPS requirements must be
factored in. On the other hand, NVMe devices offer low-latency and high-throughput
storage, but data durability and replication strategies must be carefully planned, as
NVMe is typically local to the instance.
For an on-premises cluster, the decision-making process requires a thorough
understanding of existing hardware configurations. Options might include replacing
existing disk drives with larger ones, adding more drives if slots are available, or
expanding the cluster with more brokers with the same type of disks. The choices often
hinge on financial efficiency and alignment with the existing infrastructure, along with
performance requirements. The approach must be consistent with the Kafka workload,
ensuring that considerations such as latency, throughput, and data replication are
adequately addressed.
Additional Considerations
Before we delve into the specific examples of over-provisioned Kafka clusters, this
section goes over several considerations that may influence your decision-making
process. From the deployment environment (cloud vs. on-prem) to the technical details
such as hyper-threading, CPU types, and normalized load averages, these aspects
provide the context for the subsequent analysis.
Hyper-Threading
The number of CPU cores in each example is calculated under the assumption that
hyper-threading is enabled. When hyper-threading is enabled, the Linux kernel can
create two logical processors (threads) for each physical core, allowing two threads to
execute simultaneously on a single core.
For example, consider a server that has two sockets. Each socket has 12 cores, and
hyper-threading is enabled in the Linux kernel. This server has 24 physical cores but
the OS sees 48 cores due to the hyper-threading. In this case, we’ll refer to this server as
having 48 cores and not 24, since that’s the number of cores that the OS sees.
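You can verify the number the OS sees on any broker with a quick sketch like this:

# Logical CPUs visible to the OS (includes hyper-threads)
nproc
# Sockets, cores per socket, and threads per core
lscpu | grep -E 'Thread|Core|Socket'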
CPU Type
There are different CPU types available in AWS, such as Intel x86, AMD, and ARM (Graviton). Each of these types has unique characteristics that may influence the performance and cost of the Kafka cluster. However, the influence that each CPU type has on the Kafka cluster isn't discussed in this chapter.
While the load average is a measure of the number of processes that are currently running or waiting to run in the CPU queue (over the last 1, 5, and 15 minutes), the normalized load average (NLA) is a scaled value that represents the load average relative to the number of CPU cores on the system. The NLA is defined as the load average divided by the number of cores (the normalization factor).
For example, consider a machine with eight cores. If the load average for the past five minutes is four, that means that on average there were four processes either in a runnable or waiting state. In that case, the NLA is 4/8 = 0.5.
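As a quick sketch, you can compute the five-minute NLA on a broker directly from /proc/loadavg:

# Five-minute load average divided by the number of logical cores
awk -v cores="$(nproc)" '{printf "NLA: %.2f\n", $2 / cores}' /proc/loadavg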
Scale In
For the sake of clarity, in all the examples that follow, when we explore options to scale
in a Kafka cluster, the sections specifically look at using instance types that belong to
the same family as the original brokers. While there is indeed the option to scale in a
cluster by using instance types from a different family, this chapter doesn’t consider that
approach, in order to maintain simplicity in this examination.
Example 1
Table 12-1 shows a Kafka cluster with four brokers.
Total CPU% Utilization | Normalized Load Average | Used Disk Space in /var/lib/kafka | Network Processor Idle
Findings
CPU utilization: The cluster is over-provisioned in terms of CPU (only 25% CPU
utilization).
Disk storage usage: The cluster is over-provisioned in terms of disk storage (only 15%
disk storage usage).
Disk IOPS utilization: There are no reads from the disks (because the %wa is low).
Load on the Cluster: The NLA is far from reaching 1.0, which means that the load on
the cluster is low.
Scale Down
Assuming that you want to remain with the same instance family, you can replace the
current brokers with brokers that have half the cores, disk storage, and RAM. The reason
is that in AWS, the next instance type that’s smaller than the current instances has 24
cores, 15TB storage, and 192GB RAM. Let’s check each hardware aspect and see how
much you can scale it down:
• CPU: Use 24 cores instead of 48. Even with half the number of cores, the brokers will have CPU utilization of only about 50%, so 24 cores seem to be enough, given the rough estimation that CPU utilization scales linearly with the number of CPU cores. This is a valid estimation here because the CPU utilization is mostly attributable to us% and sy% and not to disk wait time (wa%).
• RAM: In AWS, instances with 24 cores come with half the RAM (192GB) compared to instances with 48 cores. You can only tell whether the RAM is sufficient by reducing it and then checking whether there are more reads from the disks. Since there are currently no reads from the disks, you can move to 192GB of RAM and verify that it's enough by monitoring the rate of read IOPS from the disks.
Scale In
Since the cluster has four brokers, the option of scaling in the cluster depends on the
replication factor (RF) of the topics in the cluster.
If the RF is 3, it’s not recommended to remove even a single broker. The reason is
that if a single broker fails in a cluster, only two brokers will be left in the cluster and the
requirement of three replicas per topic won’t be satisfied.
However if the RF is 2, you can remove a single broker and leave the cluster with
three brokers. Even if one broker fails, the cluster will still have two brokers, which means
that each topic will still have two replicas and the replication requirement will be met.
Recommendation
Scaling down the cluster would reduce the hardware costs by 50 percent, since all the brokers will be replaced with brokers at half the price (and half the hardware resources as well). This will require migrating all the topics to the new brokers without having to change the number of partitions.
On the other hand, scaling in the cluster would reduce the costs by 25 percent, but that's recommended only if the replication factor of the topics is 2 and not 3. This will require a reassignment of the partitions and a change to the number of partitions of all the topics whose partition count doesn't divide evenly by 3.
Example 2
Table 12-2 shows a Kafka cluster with six brokers.
Total CPU% Utilization | Normalized Load Average | Used Disk Space in /var/lib/kafka | Network Processor Idle
Findings
CPU utilization: The cluster is over-provisioned in terms of CPU (only 45% CPU
utilization).
Disk storage usage: The cluster is over-provisioned in terms of disk storage (only 4%
disk storage usage).
Disk IOPS utilization: Most of the time, the disks aren’t utilized, but during the day
there are several spikes of reads from the disks. This causes the CPU wa% to reach a value
of 4%, which is quite high.
Load on the cluster: The NLA is 0.6, which isn’t low but also not that high.
Network processor idle percentage: This value is really low, only 60%.
Recommendation
At first sight, this cluster seems over-provisioned in terms of CPU cores, since its CPU utilization is at ±45%. But the low network processor idle value shows that it wouldn't be a smart move to scale down the cluster: it suffers from some issue that might already be causing latency for its clients, and reducing cores might make these symptoms worse.
So in this case you should focus on investigating the cause of the problematic
symptom instead of reducing the cost of the cluster.
Example 3
Table 12-3 shows a Kafka cluster with 12 brokers.
Total CPU% Utilization | Normalized Load Average | Used Disk Space in /var/lib/kafka | Network Processor Idle
Findings
CPU utilization: The cluster is over-provisioned in terms of CPU (only 40% CPU
utilization).
Disk storage usage: The cluster is over-provisioned in terms of disk storage (only 50%
disk storage usage).
Disk IOPS utilization: The wa% is 2%, which means there are reads/writes to the disks.
Load on the cluster: The NLA is only 0.5, which isn’t high.
Scaling In
Scaling in the cluster involves reducing it from 12 brokers to 8. This reduction is expected
to affect your CPU and storage utilization as follows: CPU usage may rise from 40 to 60
percent, although it’s worth noting that CPU utilization doesn’t always scale linearly. The
storage usage is expected to go from 50 to 75 percent. By implementing this change, you
can anticipate a 30 percent savings on hardware costs without encountering any CPU
or storage limitations. To accomplish this transition, you need to reassign the partitions
across all the topics and adjust the number of partitions to ensure an even distribution
among the remaining brokers.
Example 4
Table 12-4 shows a Kafka cluster with ten brokers.
Total CPU% Utilization | Normalized Load Average | Used Disk Space in /var/lib/kafka | Network Processor Idle
Findings
CPU utilization: The cluster is over-provisioned in terms of CPU (only 35% CPU
utilization).
Disk storage usage: The cluster uses 75 percent of its disk storage, which currently
is enough.
Disk IOPS utilization: The wa% is 1 percent, which means there are almost no reads/
writes to the disks.
Load on the cluster: The NLA is only 0.4, which isn’t high.
Network processor idle: This is lower than expected, which shows there’s either some
bottleneck in the cluster or that there are not enough threads.
• CPU: The cluster currently uses 35 percent of the CPU, so with half of
the cores, the CPU utilization is expected to reach 70 percent, which
is okay.
• RAM: There are very few reads from the disks, so all you can tell is that the current amount of RAM lets the brokers almost completely avoid reading from the disks for data that isn't in the page cache. However, you can't tell whether, with half of the current RAM, the brokers still won't need to access the disks. The only way to know is to scale down the cluster and monitor the CPU wa% (using the top command), the disk utilization (using the iostat command), and the misses from the page cache (using the cachestat script), as sketched after this list. These metrics indicate whether the page cache has enough RAM to prevent the brokers from accessing the disks when serving fetch requests from consumers and from other brokers (as part of the replication process).
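A minimal monitoring sketch for those three signals follows; note that the cachestat command name and availability vary by distribution (it ships with the bcc tools and with Brendan Gregg's perf-tools):

# CPU breakdown, including the wa% (iowait) column
top -bn1 | grep '%Cpu'
# Per-disk read IOPS, utilization, and queue size, refreshed every 5 seconds
iostat -x 5
# Page cache hits/misses, sampled every 5 seconds
cachestat 5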
Scale In
To scale in the cluster, you can remove five brokers so that you’ll be left with five brokers,
each with the same number of cores, RAM, disk storage, and IOPS. The previous
arguments for CPU, RAM, disk storage, and load on the cluster apply here. In order to
keep the same amount of disk storage, you’ll need to attach two more disks per broker,
which is also possible.
Recommendation
You can either scale down the cluster by half or scale in the cluster by half, but in both
cases, you’ll need to attach more disks in order to have the same amount of storage.
However, the issue of the low network processor idle% makes the decision whether
to reduce resources from the cluster a tough one. This metric indicates the percentage of
time that the network processor threads in a Kafka broker were idle and not processing
any incoming requests from clients. A value of 85 percent means that for 15 percent of
the time, the network threads were busy, and my experience shows that clients of Kafka
clusters with such busy network threads usually experience some latency.
Although from the perspective of CPU, RAM utilization, and load on the cluster, it
seems that you could reduce half of the cores, RAM, and hardware costs of the cluster,
the low network threads idle% metric indicates there’s a bottleneck that could cause
clients of the clusters even greater latency.
That’s why in the case of this cluster, it’s better to check whether the clients suffer
from latency rather than trying to scale down or scale in the cluster.
Example 5
Table 12-5 shows a Kafka cluster with eight brokers.
Total CPU% Utilization | Normalized Load Average | Used Disk Space in /var/lib/kafka | Network Processor Idle
Findings
CPU utilization: The cluster is over-provisioned in terms of CPU (only 20% CPU
utilization).
Disk storage usage: The cluster uses 40 percent of its disk storage, which currently
is enough.
Disk IOPS utilization: The wa% is 1 percent, which means there are almost no reads/
writes to the disks.
Load on the cluster: The NLA is only 0.25, which isn’t high.
Example 6
Table 12-6 shows a Kafka cluster with six brokers.
Findings
CPU utilization: The cluster is over-provisioned in terms of CPU (only 30% CPU
utilization).
Disk storage usage: The cluster uses only 50 percent of its disk storage.
Disk IOPS utilization: The wa% is 1 percent, which means there are almost no reads/
writes to the disks.
Load on the cluster: The NLA is only 0.35, which isn’t high.
Scale In
By scaling in the cluster from 12 brokers to 8, you can anticipate the CPU usage to increase from 40 to 60 percent and storage usage to rise from 50 to 75 percent. This adjustment is projected to lead to a 33 percent reduction in hardware costs without hitting any CPU or storage constraints. To facilitate this change, it's essential to reassign the partitions across all topics in the cluster and adjust the number of partitions to ensure a balanced distribution among the remaining brokers.
Recommendation
Scaling down the cluster would reduce the hardware costs by 30 percent, since all the brokers will be replaced with brokers at 2/3 the price (and 2/3 the hardware resources). This will require migrating all the topics to the new brokers without having to change the number of partitions.
Scaling in the cluster would also reduce the costs by 33 percent and will require a reassignment of the partitions. It will also require a change in the number of partitions of all the topics whose partition count doesn't divide evenly by 8.
In this case, there’s no better or worse approach, since both reduce the cost of the
cluster equally.
Summary
Scaling and optimizing Kafka clusters is a multifaceted challenge that requires a
comprehensive understanding of both technical intricacies and financial considerations.
This chapter delved into the various strategies for handling different scaling
requirements, focusing on both cloud-based and on-premises Kafka clusters, and
emphasized the importance of avoiding over-provisioning to reduce costs.
We began by exploring the constraints on resources such as RAM, CPU cores, and
disk storage, and provided potential solutions, weighing their respective merits and
challenges. While the cloud-based environment often affords more flexibility with
options to scale up or out, on-premises clusters call for a meticulous evaluation of
existing hardware and careful alignment with financial and managerial objectives. The
importance of optimizing cost savings without sacrificing performance was highlighted,
underscoring the need for accurate evaluation of actual workloads.