Performance Analysis and Tuning on Modern CPUs
Second Edition
Responsibility. Knowledge and best practices in the field of engineering and software
development are constantly changing. Practitioners and researchers must always rely
on their own experience and knowledge in evaluating and using any information,
methods, compounds, or experiments described herein. In using such information or
methods, they should be mindful of their safety and the safety of others, including
parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the author, contributors, nor editors assume
any liability for any injury and/or damage to persons or property as a matter of
product liability, negligence or otherwise, or from any use or operations of any methods,
products, instructions, or ideas contained in the material herein.
Trademarks. Designations used by companies to distinguish their products are
often claimed as trademarks or registered trademarks. Intel, Intel Core, Intel Xeon,
Intel Pentium, Intel VTune, and Intel Advisor are trademarks of Intel Corporation
in the U.S. and/or other countries. AMD is a trademark of Advanced Micro Devices
Corporation in the U.S. and/or other countries. Arm is a trademark of Arm Limited
(or its subsidiaries) in the U.S. and/or elsewhere. Readers, however, should contact the
appropriate companies for complete information regarding trademarks and registration.
Affiliation. At the time of writing, the book’s author (Denis Bakhvalov) is an
employee of Intel Corporation. All information presented in the book is not an official
position of the aforementioned company but rather the individual knowledge and
opinions of the author. The primary author did not receive any financial sponsorship
from Intel Corporation for writing this book.
Advertisement. This book does not advertise any software, hardware, or any other
product.
Copyright
Copyright © 2024 by Denis Bakhvalov.
Preface
The second edition expands in all these and many other directions. It came out to be
twice as big as the first book.
Specifically, I want to highlight the exercises that I added to the second edition of
the book. I created a supplementary online course “Performance Ninja” with more
than twenty lab assignments. I recommend you use these small puzzles to practice
optimization techniques and check your understanding of the material. I consider
it the feature that best differentiates this book from others. I hope it will make your
learning process more entertaining. More details about the online course can be found
in Section 1.7.
I know firsthand that low-level performance optimization is not easy. I tried to
explain everything as clearly as possible, but the topic is very complex. It requires
experimentation and practice to fully understand the material. I encourage you to
take your time, read through the chapters, and experiment with examples provided in
the online course.
During my career, I never shied away from software optimization tasks. I always got a
dopamine hit whenever I managed to make my program run faster. The excitement of
discovering something and feeling proud left me even more curious and craving for
more. My initial performance work was very unstructured. Now it is my profession,
yet I still feel very happy when I make software run faster. I hope you also experience
the joy of discovering performance issues, and the satisfaction of fixing them.
I sincerely hope that this book will help you learn low-level performance analysis. If
you make your application faster as a result, I will consider my mission accomplished.
You will find that I use “we” instead of “I” in some places in the book. This is because
I received a lot of help from other people. The full list of contributors can be found at
the end of the book in the “Acknowledgements” section.
The PDF version of this book and the “Performance Ninja” online course are available
for free. This is my way to give back to the community.
Target Audience
If you’re working with performance-critical applications, this book is right for you. It
is primarily targeted at software developers in High-Performance Computing (HPC),
AI, game development, data center applications (like those at Meta, Google, etc.),
High-Frequency Trading (HFT), and other industries where the value of performance
optimizations is well known and appreciated.
This book will also be useful for any developer who wants to understand the perfor-
mance of their application better and know how it can be improved. You may just be
enthusiastic about performance engineering and want to learn more about it. Or you
may want to be the smartest engineer in the room; that’s also fine. I hope that the
material presented in this book will help you develop new skills that can be applied in
your daily work and potentially move your career forward.
A minimal background in the C and C++ programming languages is necessary to
understand the book’s examples. The ability to read basic x86/ARM assembly is
desirable, but not a strict requirement. I also expect familiarity with basic concepts
of computer architecture and operating systems like “CPU”, “memory”, “process”,
“thread”, “virtual” and “physical memory”, “context switch”, etc. If any of these terms
are new to you, I suggest studying these prerequisites first.
I suggest you read the book chapter by chapter, starting from the beginning. If you
consider yourself a beginner in performance analysis, I do not recommend skipping
chapters. After you finish reading, you can use this book as a source of ideas whenever
you face a performance issue and are unsure how to fix it. You can skim through the
second part of the book to see which optimization techniques can be applied to your
code.
I will post errata and other information about the book on my blog at the following
URL: https://easyperf.net/blog/2024/11/11/Book-Updates-Errata.
Table Of Contents
1 Introduction
1.1 Why Is Software Slow?
1.2 Why Care about Performance?
1.3 What Is Performance Analysis?
1.4 What Is Performance Tuning?
1.5 What Is Discussed in this Book?
1.6 What Is Not Discussed in this Book?
1.7 Exercises
2 Measuring Performance
2.1 Noise in Modern Systems
2.2 Measuring Performance in Production
2.3 Continuous Benchmarking
2.4 Manual Performance Testing
2.5 Software and Hardware Timers
2.6 Microbenchmarks
2.7 Active Benchmarking
3 CPU Microarchitecture
3.1 Instruction Set Architecture
3.2 Pipelining
3.3 Exploiting Instruction Level Parallelism (ILP)
3.3.1 Out-Of-Order (OOO) Execution
3.3.2 Superscalar Engines
3.3.3 Speculative Execution
3.3.4 Branch Prediction
3.4 SIMD Multiprocessors
3.5 Exploiting Thread-Level Parallelism
3.5.1 Multicore Systems
3.5.2 Simultaneous Multithreading
3.5.3 Hybrid Architectures
3.6 Memory Hierarchy
3.6.1 Cache Hierarchy
3.6.2 Main Memory
3.7 Virtual Memory
3.7.1 Translation Lookaside Buffer (TLB)
3.7.2 Huge Pages
3.8 Modern CPU Design
3.8.1 CPU Frontend
3.8.2 CPU Backend
3.8.3 Load-Store Unit
3.8.4 TLB Hierarchy
3.9 Performance Monitoring Unit
3.9.1 Performance Monitoring Counters
6.2.6 Analyze Branch Misprediction Rate
6.2.7 Precise Timing of Machine Code
6.2.8 Estimating Branch Outcome Probability
6.2.9 Providing Compiler Feedback Data
6.3 Hardware-Based Sampling Features
6.3.1 PEBS on Intel Platforms
6.3.2 IBS on AMD Platforms
6.3.3 SPE on Arm Platforms
6.3.4 Precise Events
6.3.5 Analyzing Memory Accesses
9.3.3 Discovering Loop Optimization Opportunities
9.4 Vectorization
9.4.1 Compiler Autovectorization
9.4.2 Discovering Vectorization Opportunities
9.5 Compiler Intrinsics
9.5.1 Wrapper Libraries for Intrinsics
13.4.2 True Sharing
13.4.3 False Sharing
13.5 Advanced Analysis Tools
13.5.1 Coz
13.5.2 eBPF and GAPP
Epilog
Acknowledgments
Glossary
References
1 Introduction
Performance is king: this was true a decade ago, and it certainly is now. According to
[domo.com, 2017], in 2017 the world was creating 2.5 quintillion1 bytes of data
every day. [statista.com, 2024] expects that number to reach 400 quintillion bytes
per day in 2024. In our increasingly data-centric world, the growth of information
exchange requires both faster software and faster hardware.
Software programmers have had an “easy ride” for decades, thanks to Moore’s law.
Software vendors could rely on new generations of hardware to speed up their software
products, even if they did not spend human resources on making improvements in
their code. This strategy doesn’t work any longer. By looking at Figure 1.1, we
can see that single-threaded2 performance growth is slowing down. From 1990 to
2000, single-threaded performance on SPECint benchmarks increased by a factor
of approximately 25 to 30, driven largely by higher CPU frequencies and improved
microarchitecture.
Figure 1.1: 50 Years of Microprocessor Trend Data. © Image by K. Rupp via karlrupp.net.
Original data up to the year 2010 was collected and plotted by M. Horowitz, F. Labonte, O.
Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for
2010-2021 by K. Rupp.
Single-threaded CPU performance growth was more modest from 2000 to 2010 (a
factor between four and five). At that time, clock speeds topped out around 4GHz
due to power consumption, heat dissipation challenges, limitations in voltage scaling
(Dennard Scaling3 ), and other fundamental problems. Despite clock speed stagnation,
1 A quintillion is a thousand raised to the power of six (10¹⁸).
2 Single-threaded performance is the performance of a single hardware thread inside a CPU core
when measured in isolation.
3 Dennard Scaling - https://en.wikipedia.org/wiki/Dennard_scaling
1.1 Why Is Software Slow?
Table 1.1: Speedups from performance engineering a program that multiplies two
4096-by-4096 matrices running on a dual-socket Intel Xeon E5-2666 v3 system with a total
of 60 GB of memory. From [Leiserson et al., 2020].
Version  Implementation                Absolute speedup  Relative speedup
1        Python                                       1                 —
2        Java                                        11              10.8
3        C                                           47               4.4
4        Parallel loops                             366               7.8
5        Parallel divide and conquer              6,727              18.4
6        plus vectorization                      23,224               3.5
7        plus AVX intrinsics                     62,806               2.7
So, let’s talk about what prevents systems from achieving optimal performance by
default. Here are some of the most important factors:
1. CPU limitations: It’s so tempting to ask: “Why doesn’t hardware solve all
our problems?” Modern CPUs execute instructions incredibly quickly, and
are getting better with every generation. But still, they cannot do much if
instructions that are used to perform the job are not optimal or even redundant.
Processors cannot magically transform suboptimal code into something that
performs better. For example, if we implement a bubble sort, a CPU will not
make any attempts to recognize it and use better alternatives (e.g. quicksort).
It will blindly execute whatever it was told to do.
2. Compiler limitations: “But isn’t that what compilers are supposed to do?
Why don’t compilers solve all our problems?” Indeed, compilers are amazingly
smart nowadays, but can still generate suboptimal code. Compilers are great at
eliminating redundant work, but when it comes to making more complex decisions
like vectorization, they may not generate the best possible code. Performance
experts often can come up with clever ways to vectorize loops beyond the
capabilities of compilers. When compilers have to make a decision whether to
perform a code transformation or not, they rely on complex cost models and
heuristics, which may not work for every possible scenario. For example, there
is no binary “yes” or “no” answer to the question of whether a compiler should
always inline a function into the place where it’s called. It usually depends on
many factors which a compiler should take into account. Additionally, compilers
cannot perform optimizations unless they are absolutely certain it is safe to do
so. It may be very difficult for a compiler to prove that an optimization is correct
under all possible circumstances, disallowing some transformations. Finally,
compilers generally do not attempt “heroic” optimizations, like transforming
data structures used by a program.
3. Algorithmic complexity analysis limitations: Some developers are overly
obsessed with algorithmic complexity analysis, which leads them to choose a
popular algorithm with the optimal algorithmic complexity, even though it
may not be the most efficient for a given problem. Considering two sorting
algorithms, insertion sort and quicksort, the latter clearly wins in terms of Big
O notation for the average case: insertion sort is O(N²) while quicksort is only
O(N log N). Yet for relatively small sizes of N (up to 50 elements), insertion
sort outperforms quicksort. Complexity analysis cannot account for all the
low-level performance effects of various algorithms, so people just encapsulate
them in an implicit constant C, which sometimes can make a large impact on
performance. Counting only the comparisons and swaps used for sorting ignores
cache misses and branch mispredictions, which today are actually very
costly. Blindly trusting Big O notation without testing on the target workload
could lead developers down an incorrect path. So, the best-known algorithm for
a certain problem is not necessarily the most performant in practice for every
possible input.
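To make this concrete, below is a minimal C++ sketch (my own illustration, not from the original text) of the common trick of falling back to insertion sort for small arrays; the cutoff of 50 is only indicative and should be tuned by measurement on the target machine.

#include <algorithm>
#include <vector>

// Insertion sort: O(N²) comparisons, but sequential, cache-friendly accesses
// and no recursion overhead, which is why it wins on tiny arrays.
void insertionSort(std::vector<int>& v) {
  for (size_t i = 1; i < v.size(); ++i) {
    int key = v[i];
    size_t j = i;
    while (j > 0 && v[j - 1] > key) {
      v[j] = v[j - 1];
      --j;
    }
    v[j] = key;
  }
}

// Hypothetical hybrid: insertion sort below a small cutoff, otherwise fall
// back to the standard library sort (typically an introsort).
void hybridSort(std::vector<int>& v) {
  constexpr size_t kCutoff = 50; // illustrative threshold, tune by measurement
  if (v.size() <= kCutoff)
    insertionSort(v);
  else
    std::sort(v.begin(), v.end());
}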
In addition to the limitations described above, there are overheads created by pro-
gramming paradigms. Coding practices that prioritize code clarity, readability, and
maintainability can reduce performance. Highly generalized and reusable code can
introduce unnecessary copies, runtime checks, function calls, memory allocations, etc.
For instance, polymorphism in object-oriented programming is usually implemented
using virtual functions, which introduce a performance overhead.4
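As a minimal illustration of the virtual-call cost mentioned above (a sketch of my own, not an example from the book), consider summing areas through a base-class pointer: every call loads a vtable entry and performs an indirect call, and the compiler usually cannot inline or vectorize the loop when the dynamic type is unknown.

#include <cstdint>
#include <vector>

struct Shape {
  virtual ~Shape() = default;
  virtual int64_t area() const = 0; // resolved through the vtable at runtime
};

struct Square : Shape {
  int64_t side = 2;
  int64_t area() const override { return side * side; }
};

// Each iteration performs a vtable load plus an indirect call, which is the
// overhead that devirtualization or non-polymorphic designs try to avoid.
int64_t totalArea(const std::vector<Shape*>& shapes) {
  int64_t sum = 0;
  for (const Shape* s : shapes)
    sum += s->area();
  return sum;
}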
All the factors mentioned above assess a “performance tax” on the software. There
are very often substantial opportunities for tuning the performance of our software to
reach its full potential.
6 In 2024, Meta uses mostly on-premise cloud, while Netflix uses AWS public cloud.
7 Worldwide spending on cloud services in 2020 - https://www.srgresearch.com/articles/2020-the-year-that-cloud-service-revenues-finally-dwarfed-enterprise-spending-on-data-centers
8 Worldwide spending on cloud services in 2024 - https://www.gartner.com/en/newsroom/press-releases/2024-05-20-gartner-forecasts-worldwide-public-cloud-end-user-spending-to-surpass-675-billion-in-2024
1.2 Why Care about Performance?
book to get there. Write your program in one of the native programming languages,
distribute work among multiple threads, pick a good optimizing compiler and you’ll
get there. Unfortunately, your program will still be about 200 times
slower than the optimal target.
The methodologies in this book focus on squeezing out the last bit of performance
from your application. Such transformations correspond to rows 6 and 7
in Table 1.1. The types of improvements that will be discussed are usually not big
and often do not exceed 10%. However, do not underestimate the importance of a
10% speedup. SQLite is commonplace today not because its developers one day made
it 50% faster, but because they meticulously made hundreds of 0.1% improvements
over the years. The cumulative effect of these small improvements is what makes the
difference.
The impact of small improvements is very relevant for large distributed applications
running in the cloud. According to [Hennessy, 2018], in the year 2018, Google spent
roughly the same amount of money on actual computing servers that run the cloud as
it spent on power and cooling infrastructure. Energy efficiency is a very important
problem, which can be improved by optimizing software.
“At such [Google] scale, understanding performance characteristics be-
comes critical—even small improvements in performance or utilization can
translate into immense cost savings.” [Kanev et al., 2015]
In addition to cloud costs, there is another factor at play: how people perceive
slow software. Google reported that a 500-millisecond delay in search caused a 20%
reduction in traffic.9 For Yahoo!, a 400-millisecond faster page load caused 5-9% more
traffic.10 In the game of big numbers, small improvements can make a significant
impact. Such examples prove that the slower a service works, the fewer people will
use it.
Outside cloud services, there are many other performance-critical industries where
performance engineering does not need to be justified, such as Artificial Intelligence
(AI), High-Performance Computing (HPC), High-Frequency Trading (HFT), game
development, etc. Moreover, performance is not only required in highly specialized
areas, it is also relevant for general-purpose applications and services. Many tools
that we use every day simply would not exist if they failed to meet their performance
requirements. For example, Visual C++ IntelliSense11 features that are integrated
into Microsoft Visual Studio IDE have very tight performance constraints. For the
IntelliSense autocomplete feature to work, it must parse the entire source codebase
in milliseconds.12 Nobody will use a source code editor if it takes several seconds to
suggest autocomplete options. Such a feature has to be very responsive and provide
valid continuations as the user types new code.
9 Google I/O ’08 Keynote by Marissa Mayer - https://www.youtube.com/watch?v=6x0cAzQ7PVs
10 Slides by Stoyan Stefanov - https://www.slideshare.net/stoyan/dont-make-me-wait-or-building-highperformance-web-applications
11 Visual C++ IntelliSense - https://docs.microsoft.com/en-us/visualstudio/ide/visual-cpp-intellisense
12 In fact, it’s not possible to parse the entire codebase in the order of milliseconds. Instead,
IntelliSense only reconstructs the portions of AST that have been changed. Watch more details
on how the Microsoft team achieves this in the video: https://channel9.msdn.com/Blogs/Seth-Juarez/Anders-Hejlsberg-on-Modern-Compiler-Construction
“Not all fast software is world-class, but all world-class software is fast.
Performance is the killer feature.” —Tobi Lutke, CEO of Shopify.
I hope it goes without saying that people hate using slow software, especially when
their productivity goes down because of it. Table 1.2 shows that most people consider
a delay of 2 seconds or more to be a “long wait,” and would switch to something else
after 10 seconds of waiting (I think much sooner). If you want to keep users’ attention,
your application must react quickly.
Interaction Class  Human Perception                                  Response Time
Fast               Minimally noticeable delay                        100ms–200ms
Interactive        Quick, but too slow to be described as Fast       300ms–500ms
Pause              Not quick but still feels responsive              500ms–1 sec
Wait               Not quick due to amount of work for scenario      1 sec–3 sec
Long Wait          No longer feels responsive                        2 sec–5 sec
Captive            Reserved for unavoidably long/complex scenarios   5 sec–10 sec
Long-running       User will probably switch away during operation   10 sec–30 sec
Sometimes fast tools find applications for which they were not initially designed. For
example, game engines like Unreal and Unity are used in architecture, 3D visualization,
filmmaking, and other areas. Because game engines are so performant, they are a
natural choice for applications that require 2D and 3D rendering, physics simulation,
collision detection, sound, animation, etc.
“Fast tools don’t just allow users to accomplish tasks faster; they allow
users to accomplish entirely new types of tasks, in entirely new ways.” -
Nelson Elhage wrote in his blog.14
Before starting performance-related work, make sure you have a strong reason to do
so. Optimization just for optimization’s sake is useless if it doesn’t add value to your
product.15 Mindful performance engineering starts with clearly defined performance
goals. Understand clearly what you are trying to achieve, and justify the work.
Establish metrics that you will use to measure success.
Now that we’ve talked about the value of performance engineering, let’s uncover what
it consists of. When you’re trying to improve the performance of a program, you need
to find problems (performance analysis) and then improve them (tuning), a task very
similar to a regular debugging activity. This is what we will discuss next.
13 Microsoft Windows Blogs - https://blogs.windows.com/windowsdeveloper/2023/05/26/delivering-delightful-performance-for-more-than-one-billion-users-worldwide/
14 Reflections on software performance by N. Elhage - https://blog.nelhage.com/post/reflections-on-performance/
15 Unless you just want to practice performance optimizations, which is fine too.
from high-level optimizations which are more about application-level logic, algorithms,
and data structures. As you will see in the book, the majority of low-level optimizations
can be applied to a wide variety of modern processors. To successfully implement
low-level optimizations, you need to have a good understanding of the underlying
hardware.
“During the post-Moore era, it will become ever more important to make
code run fast and, in particular, to tailor it to the hardware on which it
runs.” [Leiserson et al., 2020]
In the past, software developers had more mechanical sympathy, as they often had to
deal with nuances of the hardware implementation. During the PC era, developers
usually were programming directly on top of the operating system, with possibly a
few libraries in between. As the world moved to the cloud era, the software stack
grew deeper, broader, and more complex. The top layer of the stack (on which most
developers work) has moved further away from the hardware. The negative side of
such evolution is that developers of modern applications have less affinity for the
actual hardware on which their software is running. This book will help you build a
strong connection with modern processors.
There is a famous quote by Donald Knuth: “Premature optimization is the root of
all evil” [Knuth, 1974]. But the opposite is often true as well. Postponed performance
engineering may be too late and cause as much evil as premature optimization. For
developers working with performance-critical projects, it is crucial to know how
underlying hardware works. In such roles, program development without a hardware
focus is a failure from the beginning.17 Performance characteristics of software must be
a primary objective alongside correctness and security from day one. Poor performance
can kill a product just as easily as security vulnerabilities.
Performance engineering is important and rewarding work, but it may be very time-
consuming. In fact, performance optimization is a game with no end. There will always
be something to optimize. Inevitably, a developer will reach the point of diminishing
returns at which further improvement is not justified by expected engineering costs.
Knowing when to stop optimizing is a critical aspect of performance work.
1.5 What Is Discussed in this Book?
I hope that by the end of this book, you will be able to answer those questions.
The book is split into two parts. The first part (chapters 2–7) teaches you how to find
performance problems, and the second part (chapters 8–13) teaches you how to fix
them.
At the end of the book, there is a glossary and a list of microarchitectures for major
CPU vendors. Whenever you see an unfamiliar acronym or need to refresh your
memory on recent Intel, AMD, and ARM chip families, refer to these resources.
Examples provided in this book are primarily based on open-source software: Linux
as the operating system, the LLVM-based Clang compiler for C and C++ languages,
and various open-source applications and benchmarks18 that you can build and run.
The reason is not only the popularity of these projects but also the fact that their
source code is open, which enables us to better understand the underlying mechanism
of how they work. This is especially useful for learning the concepts presented in this
book. This doesn’t mean that we will never showcase proprietary tools. For example,
we extensively use Intel® VTune™ Profiler.
Sometimes it’s possible to obtain attractive speedups by forcing the compiler to
generate desired machine code through various hints. You will find many such examples
throughout the book. While prior compiler experience helps a lot in performance
work, most of the time you don’t have to be a compiler expert to drive performance
improvements in your application. The majority of optimizations can be done at a
source code level without the need to dig down into compiler sources.
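For instance, a branch-probability hint might look like the sketch below. This is only an illustration of the kind of hint meant here: [[unlikely]] is a standard C++20 attribute and __builtin_expect is a GCC/Clang builtin; whether either changes the generated code depends on the compiler and the surrounding context.

// Tell the compiler that the error path is rarely taken so it can lay out
// the hot path without a taken branch.
int process(int* data, int n) {
  int errors = 0;
  for (int i = 0; i < n; ++i) {
    if (data[i] < 0) [[unlikely]] { // C++20 attribute
      ++errors;
    } else {
      data[i] *= 2;
    }
  }
  return errors;
}

// Pre-C++20 equivalent using a GCC/Clang builtin:
// if (__builtin_expect(data[i] < 0, 0)) { ++errors; }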
benchmark is something that is synthesized and contrived, and does a poor job of representing
real-world scenarios. In this book, we use the terms “benchmark”, “workload”, and “application”
interchangeably and don’t mean to offend anyone.
1.7 Exercises
As supplemental material for this book, I developed “Performance Ninja”, a free
online course where you can practice low-level performance analysis and tuning. It
is available at the following URL: https://github.com/dendibakh/perf-ninja. It has
a collection of lab assignments that focus on a specific performance problem. Each
lab assignment can take anywhere from 30 minutes up to 4 hours depending on your
background and the complexity of the lab assignment itself.
Following the name of the GitHub repository, we will use perf-ninja to refer to
the online course. In the “Questions and Exercises” section at the end of each
chapter, you may find assignments from perf-ninja. For example, when you see
perf-ninja::warmup, this corresponds to the lab assignment with the name “Warmup”
in the GitHub repository. We encourage you to solve these puzzles to solidify your
knowledge.
You can solve assignments on your local machine or submit your code changes to
GitHub for automated verification and benchmarking. If you choose the latter, follow
the instructions on the “Get Started” page of the repository. We also use examples
from perf-ninja throughout the book. This enables you to reproduce a specific
performance problem on your own machine and experiment with it.
Chapter Summary
• Single-threaded CPU performance is not increasing as rapidly as it used to a
few decades ago. When it’s no longer the case that each hardware generation
provides a significant performance boost, developers should start optimizing the
code of their software.
• Modern software is massively inefficient. A regular server system in a public
cloud typically runs poorly optimized code, consuming more power than it
could have consumed, which increases carbon emissions and contributes to other
environmental issues.
• Certain limitations exist that prevent applications from reaching their full perfor-
mance potential. CPUs cannot magically speed up slow algorithms. Compilers
are far from generating optimal code for every program. Big O notation is
not always a good indicator of performance as it doesn’t account for hardware
specifics.
• For many years performance engineering was a nerdy niche. But now it’s
becoming mainstream as software vendors realize the impact that their poorly
optimized software has on their bottom line.
• People absolutely hate using slow software, especially when their productivity
goes down because of it. Not all fast software is world-class, but all world-class
software is fast. Performance is the killer feature.
• Software tuning is becoming more important than it has been for the last 40 years
and it will be one of the key drivers for performance gains in the near future. The
importance of low-level performance tuning should not be underestimated, even
if it’s just a 1% improvement. The cumulative effect of these small improvements
is what makes the difference.
• To squeeze the last bit of performance you need to have a good mental model of
how modern CPUs work.
Part 1. Performance Analysis on a Modern
CPU
2 Measuring Performance
• Discuss how to write a good microbenchmark and some common pitfalls you
may encounter while doing it.
Figure 2.1: Variance in performance caused by dynamic frequency scaling: the first run is
200 milliseconds faster than the second.
Frequency Scaling is an example of how a hardware feature can cause variations in our
measurements; however, they could also come from software. Consider benchmarking
a git status command, which accesses many files on the disk. The filesystem plays
a big role in performance in this scenario; in particular, the filesystem cache. On the
first run, the required entries in the filesystem cache are missing. The filesystem cache
is not effective and our git status command runs very slowly. However, the second
time, the filesystem cache will be warmed up, making it much faster than the first run.
19 By cold processor, we mean the CPU that stayed in idle mode for a while, allowing it to cool
surements since an additional CPU core will be activated and assigned to it. This might affect the
frequency of the core that is running the actual benchmark.
You’re probably thinking about including a dry run before taking measurements. That
certainly helps; unfortunately, measurement bias can persist through the runs as
well. The [Mytkowicz et al., 2009] paper demonstrates that UNIX environment size (i.e.,
the total number of bytes required to store the environment variables) or the link
order (the order of object files that are given to the linker) can affect performance in
unpredictable ways. There are numerous other ways in which memory layout may affect
performance measurements.21
Having consistent performance requires running all iterations of the benchmark with
the same conditions. It is impossible to achieve 100% consistent results on every run of
a benchmark, but perhaps you can get close by carefully controlling the environment.
Eliminating nondeterminism in a system is helpful for well-defined, stable performance
tests, e.g., microbenchmarks.
Consider a situation when you implemented a code change and want to know the
relative speedup ratio by benchmarking the “before” and “after” versions of the
program. This is a scenario in which you can control most of the variability in a
system, including HW configuration, OS settings, background processes, etc. Disabling
features with nondeterministic performance impact will help you get a more consistent
and accurate comparison. You can find examples of such features and how to disable
them in Appendix A. Also, there are tools that can set up the environment to ensure
benchmarking results with a low variance; one such tool is temci22 .
However, it is not possible to replicate the exact same environment and eliminate bias
completely: there could be different temperature conditions, power delivery spikes,
unexpected system interrupts, etc. Chasing all potential sources of noise and variation
in a system can be a never-ending story. Sometimes it cannot be achieved, for example,
when you’re benchmarking a large distributed cloud service.
You should not eliminate system nondeterministic behavior when you want to measure
the real-world performance impact of your change. Yes, these features may contribute
to performance instabilities, but they are designed to improve the overall performance
of the system. Users of your application are likely to have them enabled to provide
better performance. So, when you analyze the performance of a production application,
you should try to replicate the target system configuration, which you are optimizing
for. Introducing any artificial tuning to the system will change results from what users
[Curtsinger & Berger, 2013]. This work showed that it’s possible to eliminate measurement bias
that comes from memory layout by repeatedly randomizing the placement of code, stack, and heap
objects at runtime. Sadly, these ideas didn’t go much further, and right now, this project is almost
abandoned.
22 Temci - https://github.com/parttimenerd/temci.
run longer. This is especially important for Continuous Integration and Continuous Delivery (CI/CD)
performance testing when there are time limits for how long it should take to run the whole benchmark
suite.
24 Presented at CMG 2019, https://www.youtube.com/watch?v=4RG2DUK03_0.
2.3 Continuous Benchmarking
each new release. Performance defects tend to leak into production software at an
alarming rate [Jin et al., 2012]. A large number of code changes pose a challenge to
thorough analysis of their performance impact.
Performance regressions are defects that make the software run slower compared to
the previous version. Catching performance regressions (or improvements) requires de-
tecting the commit that has changed the performance of the program. From database
systems to search engines to compilers, performance regressions are commonly expe-
rienced by almost all large-scale software systems during their continuous evolution
and deployment life cycle. It may be impossible to entirely avoid performance re-
gressions during software development, but with proper testing and diagnostic tools,
the likelihood of such defects silently leaking into production code can be reduced
significantly.
It is useful to track the performance of your application with charts, like the one shown
in Figure 2.2. Using such a chart you can see historical trends and find moments
where performance improved or degraded. Typically, you will have a separate line for
each performance test you’re tracking. Do not include too many benchmarks on a
single chart as it will become very noisy.
Figure 2.2: Performance graph (higher is better) for an application showing a big drop in
performance on August 7th and smaller ones later.
Let’s consider some potential solutions for detecting performance regressions. The
first option that comes to mind is: having humans look at the graphs. For the chart in
Figure 2.2, humans will likely catch performance regression that happened on August
7th, but it’s not obvious that they will detect later smaller regressions. People tend to
lose focus quickly and can miss regressions, especially on a busy chart. In addition to
that, it is a time-consuming and boring job that must be performed daily.
There is another interesting performance drop on August 3rd. A developer will also
likely catch it, however, most of us would be tempted to dismiss it since performance
recovered the next day. But are we sure that it was merely a glitch in measurements?
What if this was a real regression that was compensated by an optimization on August
4th? If we could fix the regression and keep the optimization, we would have a
performance score of around 4500. Do not dismiss such cases. One way to proceed
here would be to repeat the measurements for the dates Aug 02–Aug 04 and inspect
code changes during that period.
The second option is to have a threshold, say, 2%. Every code modification that
has performance within that threshold is considered noise and everything above the
threshold is considered a regression. It is somewhat better than the first option but still
has its own drawbacks. Fluctuations in performance tests are inevitable: sometimes,
even a harmless code change can trigger performance variation in a benchmark.25
Choosing the right value for the threshold is extremely hard and does not guarantee
low rates of both false-positive and false-negative alarms. Setting the threshold too
low might lead to analyzing a bunch of small regressions that were not caused by the
change in source code but due to some random noise. Setting the threshold too high
might lead to filtering out real performance regressions.
Small regressions can pile up slowly into a bigger regression, which can be left unnoticed.
Going back to Figure 2.2, notice a downward trend that lasted from Aug 11 to Aug
21. The period started with a score of 3000 and ended with 2600. That is roughly a
15% regression over 10 days or 1.5% per day on average. If we set a 2% threshold all
regressions will be filtered out. However, as we can see, the accumulated regression is
much bigger than the threshold.
Nevertheless, this option works reasonably well for many projects, especially if the
level of noise in a benchmark is very low. Also, you can adjust the threshold for each
test. An example of a Continuous Integration (CI) system where each test requires
setting explicit threshold values for alerting a regression is LUCI,26 which is a part of
the Chromium project.
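The comparison itself is trivial; picking the threshold is the hard part. The hypothetical helper below (my own sketch, not from the book) flags a regression for lower-is-better metrics such as running time; note that comparing each run only against the immediately preceding one is what lets sub-threshold regressions pile up unnoticed.

// Flag a regression when the new result is slower than the baseline mean by
// more than a chosen threshold (e.g., 2%). Assumes lower-is-better metrics.
bool isRegression(double baselineMean, double newMean,
                  double thresholdPct = 2.0) {
  double changePct = (newMean - baselineMean) / baselineMean * 100.0;
  return changePct > thresholdPct;
}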
It’s worth mentioning that tracking performance results over time requires that you
maintain the same configuration of the machine(s) that you use to run benchmarks. A
change in the configuration may invalidate all the previous performance results. You
may decide to recollect all historical measurements with a new configuration, but this
is very expensive.
Another option that recently became popular uses a statistical approach to identify
performance regressions. It leverages an algorithm called “Change Point Detection”
(CPD, see [Matteson & James, 2014]), which utilizes historical data and identifies
points in time where performance has changed. Many performance monitoring systems
embraced the CPD algorithm, including several open-source projects. You can search
the web to find the one that better suits your needs.
The notable advantage of CPD is that it does not require setting thresholds. The
algorithm evaluates a large window of recent results, which allows it to ignore outliers
as noise and produce fewer false positives. The downside for CPD is the lack of
immediate feedback. For example, consider a performance test with the following
historical measurements of running time: 5 sec, 6 sec, 5 sec, 5 sec, 7 sec. If the next
benchmark result comes at 11 seconds, then the threshold would likely be exceeded
and an alert would be generated immediately. However, in the case of using the CPD
algorithm, it wouldn’t do anything at this point. If in the next run, performance
is restored to 5 seconds, then it would likely dismiss it as a false positive and not
generate an alert. Conversely, if the next run or two resulted in 10 sec and 12 sec
respectively, only then would the CPD algorithm trigger an alert.
25 The following article shows that changing the order of the functions or removing dead functions
.md
There is no clear answer to which approach is better. If your development flow requires
immediate feedback, e.g., evaluating a pull request before it gets merged, then using
thresholds is a better choice. Also, if you can remove a lot of noise from your system
and achieve stable performance results, then using thresholds is more appropriate. In
a very quiet system, the 11 second measurement mentioned before likely indicates a
real performance regression, thus we need to flag it as early as possible. In contrast, if
you have a lot of noise in your system, e.g., you run distributed macro-benchmarks,
then that 11 second result may just be a false positive. In this case, you may be better
off using Change Point Detection.
If, for some reason, a performance regression has slipped into the codebase, it is very
important to detect it promptly. First, because fewer changes were merged since it
happened. This allows us to have a person responsible for the regression look into the
problem before they move to another task. Also, it is a lot easier for a developer to
approach the regression since all the details are still fresh in their head as opposed to
several weeks after that.
Lastly, the CI system should alert, not just on software performance regressions, but
on unexpected performance improvements, too. For example, someone may check in a
seemingly innocuous commit which improves performance by 10% in the automated
tracking harness. Your initial instinct may be to celebrate this fortuitous performance
boost and proceed with your day. However, while this commit may have passed all
functional tests in your CI pipeline, chances are that this unexpected improvement
uncovered a gap in functional testing which only manifested itself in performance
regression results. For instance, the change caused the application to skip some parts
of work, which was not covered by functional tests. This scenario occurs often enough
that it warrants explicit mention: treat the automated performance regression harness
as part of a holistic software testing framework.
2.4 Manual Performance Testing
Figure 2.3: Performance measurements (lower is better) of “Before” and “After” versions of
a program presented as box plots.
• The 75th percentile (p75) divides the lowest 75% of the data from the highest
25%.
• An outlier is a data point that differs significantly from other samples in the
dataset. Outliers can be caused by variability in the data or experimental errors.
• The min and max (whiskers) represent the most extreme data points that are
not considered outliers.
By looking at the box plot in Figure 2.3, we can sense that our code change has a
positive impact on performance since “after” samples are generally faster than “before”.
However, there are some “before” measurements that are faster than “after”. Box
plots allow comparisons of multiple distributions on the same chart. The benefits of
using box plots for visualizing performance distributions are described in a blog post
by Stefan Marr.28
Performance speedups can be calculated by taking a ratio between the two means.
In some cases, you can use other metrics to calculate speedups, including median,
min, and 95th percentile, depending on which one is more representative of your
distribution.
Standard deviation quantifies how much the values in a dataset deviate from the mean
on average. A low standard deviation indicates that the data points are close to the
mean, while a high standard deviation indicates that they are spread out over a wider
range. Unless distributions have low standard deviation, do not calculate speedups. If
the standard deviation in the measurements is on the same order of magnitude as the
mean, the average is not a representative metric. Consider taking steps to reduce noise
in your measurements. If that is not possible, present your results as a combination of
the key metrics such as mean, median, standard deviation, percentiles, min, max, etc.
28 Stefan Marr’s blog post about box plots - https://stefan-marr.de/2024/06/5-reasons-for-box-plots-as-default/
Performance gains are usually represented in two ways: as a speedup factor or
percentage improvement. If a program originally took 10 seconds to run, and you
optimized it down to 1 second, that’s a 10x speedup. We shaved off 9 seconds of
running time from the original program; that’s a 90% reduction in time. The formula
to calculate percentage improvement is shown below. In the book, we will use both
ways of representing speedups.
Percentage Speedup = (1 − New Time / Old Time) × 100%
One of the most important factors in calculating accurate speedup ratios is collecting
a rich collection of samples, i.e., running a benchmark a large number of times. This
may sound obvious, but it is not always achievable. For example, some of the SPEC
CPU 2017 benchmarks29 run for more than 10 minutes on a modern machine. That
means it would take 1 hour to produce just three samples: 30 minutes for each version
of the program. Imagine that you have not just a single benchmark in your suite, but
hundreds. It would become very expensive to collect statistically sufficient data even
if you distribute the work across multiple machines.
If obtaining new measurements is expensive, don’t rush to collect many samples. Often
you can learn a lot from just three runs. If you see a very low standard deviation
within those three samples, you will probably learn nothing new from collecting more
measurements. This is very typical of programs with underlying consistency (e.g.,
static benchmarks). However, if you see an abnormally high standard deviation, I do
not recommend launching new runs and hoping to have “better statistics”. You should
figure out what is causing performance variance and how to reduce it.
In an automated setting, you can implement an adaptive strategy by dynamically
limiting the number of benchmark iterations based on standard deviation, i.e., you
collect samples until you get a standard deviation that lies in a certain range. The
lower the standard deviation in the distribution, the lower the number of samples you
need. Once you have a standard deviation lower than the threshold, you can stop
collecting measurements. This strategy is explained in more detail in [Akinshin, 2019,
Chapter 4].
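A minimal sketch of such an adaptive loop is shown below (my own illustration; it assumes the benchmark returns one timing sample per call, and the 1% relative standard deviation target and the sample caps are arbitrary choices).

#include <cmath>
#include <functional>
#include <vector>

// Keep collecting samples until the relative standard deviation drops below a
// target, or until a hard cap on the number of iterations is reached.
std::vector<double> collectSamples(const std::function<double()>& runBenchmark,
                                   double targetRelStdDev = 0.01,
                                   size_t minSamples = 3,
                                   size_t maxSamples = 100) {
  std::vector<double> samples;
  while (samples.size() < maxSamples) {
    samples.push_back(runBenchmark());
    if (samples.size() < minSamples)
      continue;
    double mean = 0.0;
    for (double s : samples) mean += s;
    mean /= samples.size();
    double var = 0.0;
    for (double s : samples) var += (s - mean) * (s - mean);
    double stddev = std::sqrt(var / (samples.size() - 1));
    if (stddev / mean <= targetRelStdDev)
      break; // measurements are stable enough
  }
  return samples;
}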
An important thing to watch out for is the presence of outliers. It is OK to discard some
samples (for example, cold runs) as outliers, but do not deliberately discard unwanted
samples from the measurement set. Outliers can be one of the most important metrics
for some types of benchmarks. For example, when benchmarking software that has
real-time constraints, the 99th percentile could be very interesting.
I recommend using benchmarking tools as they automate performance measurements.
For example, Hyperfine30 is a popular cross-platform command-line benchmarking
tool that automatically determines the number of runs, and can visualize the results
as a table with mean, min, max, or as a box plot.
29 SPEC CPU 2017 benchmarks - http://spec.org/cpu2017/Docs/overview.html#benchmarks
30 hyperfine - https://github.com/sharkdp/hyperfine
In the next two sections, we will discuss how to measure wall clock time (latency),
which is the most common case. However, sometimes we also may want to measure
other things, like the number of requests per second (throughput), heap allocations,
context switches, etc.
2.5 Software and Hardware Timers
Choosing which timer to use is very simple and depends on the duration of the thing
you want to measure. If you measure something over a very short time period, TSC
will give you better accuracy. Conversely, it’s pointless to use the TSC to measure
a program that runs for hours. Unless you need cycle accuracy, the system timer
should be enough for a large proportion of cases. It’s important to keep in mind that
accessing the system timer usually has a higher latency than accessing TSC. Making a
clock_gettime system call can be much slower than executing the RDTSC instruction. The
latter takes about 5 ns (20 CPU cycles), while the former takes about 500 ns. This may
become important for minimizing measurement overhead, especially in the production
environment. A performance comparison of different APIs for accessing timers on
various platforms is available on the wiki page of the CppPerformanceBenchmarks
repository.32
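As an illustration, the sketch below reads both timers around a region of interest. It assumes GCC or Clang on x86 (__rdtsc comes from <x86intrin.h>; on MSVC it lives in <intrin.h>), and the raw TSC delta still has to be scaled by the TSC frequency to convert it into time.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <x86intrin.h> // __rdtsc()

template <typename Func>
void measure(Func&& work) {
  auto t0 = std::chrono::steady_clock::now(); // system timer, ~ns resolution
  uint64_t c0 = __rdtsc();                    // Time Stamp Counter

  work();

  uint64_t c1 = __rdtsc();
  auto t1 = std::chrono::steady_clock::now();

  auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
  std::printf("elapsed: %lld ns, %llu TSC ticks\n",
              static_cast<long long>(ns),
              static_cast<unsigned long long>(c1 - c0));
}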
2.6 Microbenchmarks
Microbenchmarks are small self-contained programs that people write to quickly test a
hypothesis. Usually, microbenchmarks are used to choose the best implementation of
a certain relatively small algorithm or functionality. Nearly all modern languages have
benchmarking frameworks. In C++, you can use the Google benchmark33 library, C#
has the BenchmarkDotNet34 library, Julia has the BenchmarkTools35 package, Java
has JMH36 (Java Microbenchmark Harness), Rust has the Criterion37 package, etc.
When writing microbenchmarks, it’s very important to ensure that the scenario you
want to test is actually executed by your microbenchmark at runtime. Optimizing
compilers can eliminate important code that could render the experiment useless,
or even worse, drive you to the wrong conclusion. In the example below, modern
compilers are likely to eliminate the whole loop:
// foo DOES NOT benchmark string creation
void foo() {
  for (int i = 0; i < 1000; i++)
    std::string s("hi");
}
/wikis/ClockTimeAnalysis
33 Google benchmark library - https://github.com/google/benchmark
34 BenchmarkDotNet - https://github.com/dotnet/BenchmarkDotNet
35 Julia BenchmarkTools - https://github.com/JuliaCI/BenchmarkTools.jl
36 Java Microbenchmark Harness - http://openjdk.java.net/projects/code-tools/jmh/etc
37 Criterion.rs - https://github.com/bheisler/criterion.rs
Blunders like that one are nicely captured in the paper “Always Measure One Level
Deeper” [Ousterhout, 2018], where the author advocates for a more scientific approach,
and measuring performance from different angles. Following advice from the paper, we
should inspect the performance profile of the benchmark and make sure the intended
code stands out as the hotspot. Sometimes abnormal timings can be spotted instantly,
so use common sense while analyzing and comparing benchmark runs.
One of the popular ways to keep the compiler from optimizing away important code is
to use DoNotOptimize-like38 helper functions, which do the necessary inline assembly
magic under the hood:
// foo benchmarks string creation
void foo() {
  for (int i = 0; i < 1000; i++) {
    std::string s("hi");
    DoNotOptimize(s);
  }
}
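For reference, here is roughly how the same microbenchmark looks when written with the Google benchmark library mentioned earlier (a minimal sketch; the framework decides how many iterations to run and reports the statistics).

#include <benchmark/benchmark.h>
#include <string>

static void BM_StringCreation(benchmark::State& state) {
  for (auto _ : state) {
    std::string s("hi");
    benchmark::DoNotOptimize(s); // keep the compiler from removing the work
  }
}
BENCHMARK(BM_StringCreation);
BENCHMARK_MAIN();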
If written well, microbenchmarks can be a good source of performance data. They are
often used for comparing the performance of different implementations of a critical
function. A good benchmark tests performance in realistic conditions. In contrast, if
a benchmark uses synthetic input that is different from what will be given in practice,
then the benchmark will likely mislead you and will drive you to the wrong conclusions.
Besides that, when a benchmark runs on a system free from other demanding processes,
it has all resources available to it, including DRAM and cache space. Such a benchmark
will likely champion the faster version of the function even if it consumes more memory
than the other version. However, the outcome can be the opposite if there are neighbor
processes that consume a significant part of DRAM, which causes memory regions
that belong to the benchmark process to be swapped to the disk.
For the same reason, be careful when drawing conclusions from results obtained by
unit-testing a function. Modern unit-testing frameworks, e.g., GoogleTest, provide the duration of
each test. However, this information cannot substitute a carefully written benchmark
that tests the function in practical conditions using realistic input (see more in
[Fog, 2023b, chapter 16.2]). It is not always possible to replicate the exact input and
environment as it will be in practice, but it is something developers should take into
account when writing a good benchmark.
2.7 Active Benchmarking
Benchmarking done by developer A was done in a passive way. The results were
presented without any technical explanation, and the performance impact was exagger-
ated. In contrast, developer B performed active benchmarking.39 She ensured proper
machine configuration, ran extensive testing, looked one level deeper, and collected
as many metrics as possible to support her conclusions. Her analysis explains the
underlying technical reason for the performance results she observed.
You should have a good intuition to spot suspicious benchmark results. Whenever
you see publications that present benchmark results that look too good to be true
and without any technical explanation, you should be skeptical. There is nothing
wrong with presenting the results of your measurements, but as John Ousterhout
said, “Performance measurements should be considered guilty until proven innocent.”
[Ousterhout, 2018] The best way to verify the results is through active benchmarking.
Active benchmarking requires much more effort than passive benchmarking, but it is
the only way to get reliable results.
Chapter Summary
• Modern systems have nondeterministic performance. Eliminating nondeter-
minism in a system is helpful for well-defined, stable performance tests, e.g.,
microbenchmarks.
• Measuring performance in production is required to assess how users perceive
the responsiveness of your services. However, this requires dealing with noisy
environments and using statistical methods for analyzing results.
• It is beneficial to employ an automated performance tracking system to prevent
performance regressions from leaking into production software. Such CI systems
are supposed to run automated performance tests, visualize results, and alert on
discovered performance anomalies.
• Visualizing performance distributions helps compare performance results. It is a
safe way of presenting performance results to a wide audience.
• To benchmark execution time, engineers can use two different timers: the system-
wide high-resolution timer and the Time Stamp Counter. The former is suitable
for measuring events whose duration is more than a microsecond. The latter
can be used for measuring short events with high accuracy.
• Microbenchmarks are good for quick experiments, but you should always verify
your ideas on a real application in practical conditions. Make sure that you are
benchmarking the right code by checking performance profiles.
• Always measure one level deeper, collect as many metrics as possible to support
your conclusions, and be ready to explain the underlying technical reasons for
the performance results you observe.
3 CPU Microarchitecture
This chapter provides a brief summary of the critical CPU microarchitecture features
that have a direct impact on software performance. The goal of this chapter is not to
cover all the details and trade-offs of CPU architectures, which are already covered
extensively in the literature [Hennessy & Patterson, 2017], [Shen & Lipasti, 2013]. I
provide a recap of features that are present in modern processors to prepare the reader
for what comes next in the book.
Modern CPUs support 32-bit and 64-bit precision for floating-point and integer
arithmetic operations. With the fast-evolving fields of machine learning and AI, the
industry has a renewed interest in alternative numeric formats to drive significant
performance improvements. Research has shown that machine learning models perform
just as well using fewer bits to represent variables, saving on both compute and memory
bandwidth. As a result, the majority of mainstream ISAs have recently added support
for lower precision data types such as 8-bit and 16-bit integer and floating-point types
(int8, fp8, fp16, bf16), in addition to the traditional 32-bit and 64-bit formats for
arithmetic operations.
40 In the book I often write x86 for brevity, but I mean x86-64, the 64-bit version of the
x86 instruction set, first announced in 1999. Also, I use ARM to refer to the ISA, and Arm to refer
to the company.
3.2 Pipelining
Pipelining is a foundational technique used to make CPUs fast, wherein multiple
instructions overlap during their execution. Pipelining in CPUs drew inspiration from
automotive assembly lines. The processing of instructions is divided into stages. The
stages operate in parallel, working on different parts of different instructions. DLX is
a relatively simple architecture designed by John L. Hennessy and David A. Patterson
in 1994. As defined in [Hennessy & Patterson, 2017], it has a 5-stage pipeline which
consists of:
1. Instruction fetch (IF)
2. Instruction decode (ID)
3. Execute (EXE)
4. Memory access (MEM)
5. Write back (WB)
Figure 3.1 shows an ideal pipeline view of the 5-stage pipeline CPU. In cycle 1,
instruction x enters the IF stage of the pipeline. In the next cycle, as instruction x
moves to the ID stage, the next instruction in the program enters the IF stage, and
so on. Once the pipeline is full, as in cycle 5 above, all pipeline stages of the CPU
are busy working on different instructions. Without pipelining, the instruction x+1
couldn’t start its execution until after instruction x had finished its work.
Modern high-performance CPUs have multiple pipeline stages, often ranging from 10
to 20 (sometimes more), depending on the architecture and design goals. This requires
a much more complicated design than the simple 5-stage pipeline introduced earlier. For
example, the decode stage may be split into several new stages. We may also add new
stages before the execute stage to buffer decoded instructions, and so on.
The throughput of a pipelined CPU is defined as the number of instructions that
complete and exit the pipeline per unit of time. The latency for any given instruction
is the total time through all the stages of the pipeline. Since all the stages of the
pipeline are linked together, each stage must be ready to move to the next instruction
in lockstep. The time required to move an instruction from one stage to the next
defines the basic machine cycle or clock for the CPU. The value chosen for the clock
for a given pipeline is defined by the slowest stage of the pipeline. CPU hardware
designers strive to balance the amount of work that can be done in each stage, as this
directly defines the clock period, and thus the operating frequency, of the CPU.
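For example, suppose the five stages of a hypothetical pipeline require 0.8, 1.0, 0.9, 1.2, and 0.7 ns of logic delay (illustrative numbers). The clock period can be no shorter than the 1.2 ns of the slowest stage, so the CPU could be clocked at roughly 1/1.2 ns ≈ 833 MHz, even though the average stage delay is well below 1 ns.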
There is a RAW dependency for register R1. If we take the value directly after
the addition R0 ADD 1 is done (from the EXE pipeline stage), we don’t need to
wait until the WB stage finishes (when the value will be written to the register
file). Bypassing helps to save a few cycles. The longer the pipeline, the more
effective bypassing becomes.
A write-after-read (WAR) hazard requires a dependent write to execute after a
read. It occurs when an instruction writes a register before an earlier instruction
reads the source, resulting in the wrong new value being read. A WAR hazard
is not a true dependency and can be eliminated by a technique called register
renaming, which abstracts logical registers from physical registers.
CPUs support register renaming by keeping a large number of physical registers.
Logical (architectural) registers, the ones that are defined by the ISA, are just
aliases over a wider register file. With such decoupling of the architectural state,
solving WAR hazards is simple: we just need to use a different physical register
for the write operation. For example:
; machine code, WAR hazard       ; after register renaming
; (architectural registers)      ; (physical registers)
R1 = R0 ADD 1               =>   R101 = R100 ADD 1
R0 = R2 ADD 2                    R103 = R102 ADD 2
In the original assembly code, there is a WAR dependency for register R0. For
the code on the left, we cannot reorder the execution of the instructions, because
it could leave the wrong value in R1. However, we can leverage our large pool of
physical registers to overcome this limitation. To do that we need to rename
all the occurrences of the R0 register starting from the write operation (R0
= R2 ADD 2) and below to use a free register. After renaming, we give these
registers new names that correspond to physical registers, say R103. By renaming
registers, we eliminated a WAR hazard in the initial code, and we can safely
execute the two operations in any order.
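The following sketch illustrates the kind of sequence discussed in the next paragraph; the concrete registers and operands are illustrative assumptions, not taken from a real listing:

; machine code (architectural)       ; after register renaming (physical)
R1 = R0 ADD R2                  =>   R101 = R100 ADD R102
R3 = R1 SUB R0                       R103 = R101 SUB R100
R1 = R2 MUL R0                       R104 = R102 MUL R100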
You will see similar code in many production programs. In our example, R1
keeps the temporary result of the ADD operation. Once the SUB instruction is
complete, R1 is immediately reused to store the result of the MUL operation. The
original code on the left features all three types of data hazards. There is a
RAW dependency over R1 between ADD and SUB, and it must survive register
renaming. Also, we have WAW and WAR hazards over the same R1 register for
the MUL operation. Again, we need to rename registers to eliminate those two
hazards. Notice that after register renaming we have a new destination register
(R104) for the MUL operation. Now we can safely reorder MUL with the other two
operations.
• Control hazards: are caused by changes in the program flow. They arise from
pipelining branches and other instructions that alter the instruction stream.
The branch condition that determines the direction of the branch (taken vs. not
taken) is resolved in the execute pipeline stage. As a result, the fetch of the next
instruction cannot be pipelined unless the control hazard is eliminated. Tech-
niques such as dynamic branch prediction and speculative execution described
in the next section are used to mitigate control hazards.
Figure 3.2 illustrates the concept of out-of-order execution. Let’s assume that in-
struction x+1 cannot be executed during cycles 4 and 5 due to some conflict. An
in-order CPU would stall all subsequent instructions from entering the EXE pipeline
stage, so instruction x+2 would begin executing only at cycle 7. In a CPU with OOO
execution, an instruction can begin executing as long as it does not have any conflicts
(e.g., its inputs are available, the execution unit is not occupied, etc.). As you can
see on the diagram, instruction x+2 started executing before instruction x+1. The
instruction x+3 cannot enter the EXE stage at cycle 6, because it is already occupied
by instruction x+1. All instructions still retire in order, i.e., the instructions complete
the WB stage in the program order.
OOO execution usually brings large performance improvements over in-order execution.
However, it also introduces additional complexity and power consumption.
The process of reordering instructions is often called instruction scheduling. The goal
of scheduling is to issue instructions in such a way as to minimize pipeline hazards
and maximize the utilization of CPU resources. Instruction scheduling can be done at
compile time (static scheduling), or at runtime (dynamic scheduling). Let’s unpack
both options.
3.3.1.1 Static Scheduling. The Intel Itanium was an example of a statically scheduled processor.
With static scheduling of a superscalar, multi-execution unit machine, the scheduling
is moved from the hardware to the compiler using a technique known as VLIW (Very
Long Instruction Word). The rationale is to simplify the hardware by requiring the
compiler to choose the right mix of instructions to keep the machine fully utilized.
Compilers can use techniques such as software pipelining and loop unrolling to look
farther ahead than can be reasonably supported by hardware structures to find the
right ILP.
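As a sketch of the kind of transformation involved, here is a reduction loop manually unrolled by a factor of four. The four independent accumulators expose four independent chains of additions, i.e., more ILP for the hardware (or a VLIW compiler) to exploit; note that reassociating floating-point additions may slightly change the result:

#include <cstddef>

double sum_unrolled(const double* a, size_t n) {
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {   // four independent additions per iteration
    s0 += a[i + 0];
    s1 += a[i + 1];
    s2 += a[i + 2];
    s3 += a[i + 3];
  }
  for (; i < n; ++i)             // leftover elements
    s0 += a[i];
  return (s0 + s1) + (s2 + s3);
}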
The Intel Itanium never became a success, for a few reasons. One of them was the
lack of compatibility with x86 code. Another was that it was very difficult for compilers
to schedule instructions in a way that kept the CPU busy, due to variable load latencies.
The 64-bit extension of the x86 ISA (x86-64), introduced in the same time window by
AMD, was compatible with IA-32 (the 32-bit version of the x86 ISA) and eventually
became its real successor. The last Intel Itanium processors were shipped in 2021.
So instructions can be executed in any order once their operands become available
and are not tied to the program order any longer. Modern processors are becoming
wider (can execute many instructions in one cycle) and deeper (larger ROB, RS, and
other buffers), which demonstrates that there is a lot of potential to uncover more
ILP in production applications.
Figure 3.3 shows a pipeline diagram of a CPU that supports 2-wide issue. Notice
that two instructions can be processed in each stage of the pipeline every cycle. For
example, both instructions x and x+1 started their execution during cycle 3. This could
be two instructions of the same type (e.g., two additions) or two different instructions
(e.g., an addition and a branch). Superscalar processors replicate execution resources
to keep instructions in the pipeline flowing through without structural conflicts. For
instance, to support the decoding of two instructions simultaneously, we need to have
2 independent decoders.
Figure 3.4(a): No speculation.
With speculative execution, the CPU guesses an outcome of the branch and initiates
processing instructions from the chosen path. Suppose a processor predicted that
condition a < b will be evaluated as true. It proceeded without waiting for the
branch outcome and speculatively called function foo (see Figure 3.4b). State changes
to the machine cannot be committed until the condition is resolved to ensure that
the architectural state of the machine is never impacted by speculatively executing
instructions.
In the example above, the branch instruction compares two scalar values, which is
fast. But in reality, a branch instruction can depend on a value loaded from memory,
which can take hundreds of cycles.
If the prediction turns out to be correct, it saves a lot of cycles. However, sometimes
the prediction is incorrect, and the function bar should be called instead. In such a
case, the results from the speculative execution must be squashed and thrown away.
This is called the branch misprediction penalty, which we will discuss in Section 4.8.
An instruction that is executed speculatively is marked as such in the ROB. Once
it is not speculative any longer, it can retire in program order. Here is where the
architectural state is committed, and architectural registers are updated. Because the
results of the speculative instructions are not committed, it is easy to roll back when
a misprediction happens.
3.4 SIMD Multiprocessors
A SIMD processor provides a set of wide vector registers that hold data loaded from memory and store the intermediate results of computations. In
our example, two regions of 256 bits of contiguous data corresponding to arrays a
and b will be loaded from memory and stored in two separate vector registers. Next,
the element-wise addition will be done and the result will be stored in a new 256-bit
vector register. Finally, the result will be written from the vector register to a 256-bit
memory region corresponding to array c. Note that the data elements can be either
integers or floating-point numbers.
A vector execution unit is logically divided into lanes. In the context of SIMD, a
lane refers to a distinct data pathway within the SIMD execution unit and processes
one element of the vector. In our example, each lane processes 64-bit elements
(double-precision), so there will be 4 lanes in a 256-bit register.
Most of the popular CPU architectures feature vector instructions, including x86,
PowerPC, ARM, and RISC-V. In 1996, Intel released MMX, a SIMD instruction set
that was designed for multimedia applications. Following MMX, Intel introduced new
instruction sets with added capabilities and increased vector size: SSE, AVX, AVX2,
and AVX-512. ARM has optionally supported the 128-bit NEON instruction set in
various versions of its architecture. In version 8 (aarch64), this support was made
mandatory, and new instructions were added.
As the new instruction sets became available, work began to make them usable
to software engineers. The software changes required to exploit SIMD instructions
are known as code vectorization. Initially, SIMD instructions were programmed in
assembly. Later, special compiler intrinsics, which are small functions providing a
one-to-one mapping to SIMD instructions, were introduced. Today all the major
compilers support autovectorization for the popular processors, i.e., they can generate
SIMD instructions straight from high-level code written in C/C++, Java, Rust, and
other languages.
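As a sketch, the element-wise addition described above can be written directly with AVX intrinsics; the function name is mine, and the loop assumes that n is a multiple of 4 (the remainder case is discussed below):

#include <immintrin.h>
#include <cstddef>

void add_vectors(const double* a, const double* b, double* c, size_t n) {
  for (size_t i = 0; i < n; i += 4) {
    __m256d va = _mm256_loadu_pd(a + i);             // load 4 doubles from array a
    __m256d vb = _mm256_loadu_pd(b + i);             // load 4 doubles from array b
    _mm256_storeu_pd(c + i, _mm256_add_pd(va, vb));  // store the element-wise sum into c
  }
}

A modern compiler will often generate equivalent code automatically from the plain scalar loop when autovectorization is enabled.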
To enable code to run on systems that support different vector lengths, Arm introduced
the SVE instruction set. Its defining characteristic is the concept of scalable vectors:
their length is unknown at compile time. With SVE, there is no need to port software
to every possible vector length. Users don’t have to recompile the source code of
their applications to leverage wider vectors when they become available in newer CPU
generations. Another example of scalable vectors is the RISC-V V extension (RVV),
which was ratified in late 2021. Some implementations support quite wide (2048-bit)
vectors, and up to eight can be grouped together to yield 16384-bit vectors, which
greatly reduces the number of instructions executed. At each loop iteration, SVE
code typically does ptr += number_of_lanes, where number_of_lanes is not known
at compile time. ARM SVE provides special instructions for such length-dependent
operations, while RVV enables a programmer to query/set the number_of_lanes.
Going back to the example in Listing 3.2, if N equals 5, and we have a 256-bit vector,
we cannot process all the elements in a single iteration. We can process the first four
elements using a single SIMD instruction, but the 5th element needs to be processed
individually. This is known as the loop remainder: the portion of a loop that has
fewer elements left to process than the vector width, requiring additional scalar code
to handle the leftover elements. Scalable vector ISA extensions do not
have this problem, as they can process any number of elements in a single instruction.
Another solution to the loop remainder problem is to use masking, which allows
selectively enabling or disabling SIMD lanes based on a condition.
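As a sketch of the masking approach on x86 (AVX, double elements; the function name and the assumption that fewer than 4 elements remain are mine):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Finish c[i..n) = a[i..n) + b[i..n) when 0 < n - i < 4, without a scalar tail loop.
void add_remainder(const double* a, const double* b, double* c, size_t i, size_t n) {
  const int64_t rem = static_cast<int64_t>(n - i);
  // Lane k is enabled (all ones) when k < rem, disabled (zero) otherwise.
  const __m256i mask = _mm256_set_epi64x(rem > 3 ? -1 : 0, rem > 2 ? -1 : 0,
                                         rem > 1 ? -1 : 0, rem > 0 ? -1 : 0);
  __m256d va = _mm256_maskload_pd(a + i, mask);  // disabled lanes read as zero
  __m256d vb = _mm256_maskload_pd(b + i, mask);
  _mm256_maskstore_pd(c + i, mask, _mm256_add_pd(va, vb));  // disabled lanes are not written
}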
Also, CPUs increasingly accelerate the matrix multiplications often used in machine
learning. Intel’s AMX extension, supported in server processors since 2023, multiplies
8-bit matrices of shape 16x64 and 64x16, accumulating into a 32-bit 16x16 matrix. By
contrast, the unrelated but identically named AMX extension in Apple CPUs, as well
as ARM’s SME extension, compute outer products of a row and column, respectively
stored in special 512-bit registers or scalable vectors.
Initially, SIMD was driven by multimedia applications and scientific computations, but
later found uses in many other domains. Over time, the set of operations supported in
SIMD instruction sets has steadily increased. In addition to straightforward arithmetic
as shown in Figure 3.5, newer use cases of SIMD include:
• String processing: finding characters, validating UTF-8,44 parsing JSON45 and
CSV;46
• Hashing,47 random generation,48 cryptography (AES);
• Columnar databases (bit packing, filtering, joins);
• Sorting built-in types (VQSort,49 QuickSelect);
• Machine Learning and Artificial Intelligence (speeding up PyTorch, TensorFlow).
3.5 Exploiting Thread-Level Parallelism
With simultaneous multithreading (SMT), multiple software threads execute concurrently
on the same core in the same cycle. Those don't have to be threads from the same process;
they can be completely different programs that happen to be scheduled on the same
physical core.
An example of execution on a non-SMT and a 2-way SMT (SMT2) processor is shown
in Figure 3.6. In both cases, the width of the processor pipeline is four, and each slot
represents an opportunity to issue a new instruction. 100% machine utilization is when
there are no unused slots, which never happens in real workloads. It’s easy to see that
for the non-SMT case, there are many unused slots, so the available resources are not
utilized well. This may happen for a variety of reasons. For instance, in cycle 3, thread
1 cannot make forward progress because all instructions are waiting for their inputs
to become available. Non-SMT processors would simply stall, while SMT-enabled
processors take this opportunity to schedule useful work from another thread. The
goal here is to occupy unused slots by another thread to improve hardware utilization
and multithreaded performance.
With an SMT2 implementation, each physical core is represented by two logical cores,
which are visible to the operating system as two independent processors available to
take work. Consider a situation when we have 16 software threads ready to run, and
only 8 physical cores. In a non-SMT system, only 8 threads will run at the same time,
while with SMT2 we can execute all 16 threads simultaneously. In another hypothetical
situation, if two programs run on an SMT-enabled core and each consistently utilizes
only two out of four available slots, then there is a high chance they will run as fast as
if they were running alone on that physical core.
Although two programs run on the same processor core, they are completely sepa-
rated from each other. In an SMT-enabled processor, even though instructions are
mixed, they have different contexts which helps preserve the correctness of execution.
To support SMT, a CPU must replicate the architectural state (program counter,
registers) to maintain thread context. Other CPU resources can be shared. In a
typical implementation, cache resources are dynamically shared amongst the hardware
threads. Resources to track OOO and speculative execution can either be replicated
or partitioned.
In an SMT2 core, both logical cores are truly running at the same time. In the
CPU Frontend, they fetch instructions in an alternating order (every cycle or a few
cycles). In the backend, each cycle a processor selects instructions for execution from
all threads. Instruction execution is mixed as the processor dynamically schedules
execution units among both threads.
So, SMT is a very flexible setup that makes it possible to recover unused CPU issue
slots. In addition to its benefits for multiple threads, SMT largely preserves single-thread
performance when only one thread is running on a core. Modern CPUs that support
SMT usually implement two-way SMT (SMT2), and sometimes four-way SMT (SMT4).
SMT has its own disadvantages as well. Since some resources are shared among the
logical cores, the threads may compete for them. The most likely SMT penalty is caused
by competition for the L1 and L2 caches: since they are shared between the two logical
cores, each thread gets less effective cache space, and one thread may evict data that
the other thread will need in the future.
SMT brings a considerable burden on software developers as it makes it harder to
predict and measure the performance of an application that runs on an SMT core.
Imagine you’re running performance-critical code on an SMT core, and suddenly the
OS puts another demanding job on a sibling logical core. Your code nearly maxes
out the resources of the machine, and now you need to share them with someone
else. This problem is especially pronounced in a cloud environment when you cannot
predict whether your application will have noisy neighbors or not.
There is also a security concern with certain simultaneous multithreading implemen-
tations. Researchers showed that some earlier implementations had a vulnerability
through which one application could steal critical information (like cryptographic keys)
from another application that runs on the sibling logical core of the same processor
by monitoring its cache use. We will not dig deeper into this topic since hardware
security is outside the scope of this book.
For example, the Apple M1 processor combines four high-performance “Firestorm” cores
with four energy-efficient “Icestorm” cores. Intel introduced its Alder Lake hybrid
architecture in 2021 with eight P- and eight E-cores in the top configuration.
Hybrid architectures combine the best of both core types, but they come with
their own set of challenges. First of all, the cores must be fully ISA-compatible,
i.e., they should be able to execute the same set of instructions. Otherwise, scheduling
becomes restricted. For example, if a big core features some fancy instructions that
are not available on small cores, then you can only assign big cores to run workloads
that use such instructions. That’s why usually vendors use the “greatest common
denominator” approach when choosing the ISA for a hybrid processor.
Even with ISA-compatible cores, scheduling becomes challenging. Different types
of workloads call for a specific scheduling scheme, e.g., bursty execution vs. steady
execution, low IPC vs. high IPC,50 low importance vs. high importance, etc. It
becomes non-trivial very quickly. Here are a few considerations for optimal scheduling:
• Leverage small cores to conserve power. Do not wake up big cores for background
work.
• Recognize candidates (low importance, low IPC) for offloading to smaller cores.
Similarly, promote high importance, high IPC tasks to big cores.
• When assigning a new task, use an idle big core first. In the case of SMT, use
big cores with both logical threads idle. After that, use idle small cores. After
that, use sibling logical threads of big cores.
From a programmer’s perspective, no code changes are needed to make use of hybrid
systems. This approach became very popular in client-facing devices, especially in
smartphones.
3.6.1.1 Placement of Data within the Cache. The address for a request is
used to access the cache. In direct-mapped caches, a given block address can appear
only in one location in the cache and is defined by a mapping function shown below.
Direct-mapped caches are relatively easy to build and have fast access times; however,
they have a high miss rate.
Number of Blocks in the Cache = Cache Size / Cache Block Size
Direct-mapped location = (block address) mod (Number of Blocks in the Cache)
In a fully associative cache, a given block can be placed in any location in the cache.
This approach involves high hardware complexity and slow access times, so it is
considered impractical for most use cases.
An intermediate option between direct mapping and fully associative mapping is a
set-associative mapping. In such a cache, the blocks are organized as sets, typically
each set containing 2, 4, 8, or 16 blocks. A given address is first mapped to a set.
Within a set, the block can be placed anywhere among the blocks of that set. A
cache with m blocks per set is described as an m-way set-associative cache. The
formulas for a set-associative cache are:
Number of Sets in the Cache = Number of Blocks in the Cache / Number of Blocks per Set (associativity)
Set (m-way) associative location = (block address) mod (Number of Sets in the Cache)
Here is another example of the cache organization of the Apple M1 processor. The
L1 data cache inside each performance core can store 128 KB, has 256 sets with 8
ways in each set, and operates on 64-byte lines. Performance cores form a cluster and
share the L2 cache, which can keep 12 MB, is 12-way set-associative, and operates on
128-byte lines. [Apple, 2024]
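Plugging the L1 numbers into the formulas above: 128 KB / 64 bytes = 2048 blocks, and 2048 blocks / 8 ways = 256 sets, which matches the stated organization. A given block address is then mapped to set (block address) mod 256.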
3.6.1.2 Finding Data in the Cache. Every block in the m-way set-associative
cache has an address tag associated with it. In addition, the tag contains state bits
such as a bit to indicate whether the data is valid. Tags can also contain additional
bits to indicate access information, sharing information, etc.
Figure 3.7 shows how the address generated by the pipeline is used to check the
caches. The lowest-order address bits form the block offset, i.e., the offset within a
given block (5 bits for 32-byte cache lines, 6 bits for 64-byte cache lines). The set is
selected using the index bits based on the formulas described above. Once the set is
selected, the tag bits are used to compare against all the tags in that set. If one of
the tags matches the tag of the incoming request and the valid bit is set, a cache hit
results. The data associated with that block entry (read out of the data array of the
cache in parallel to the tag lookup) is provided to the execution pipeline. A cache
miss occurs in cases where the tag is not a match.
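A small sketch of this address decomposition for a hypothetical 32 KB, 8-way cache with 64-byte lines (the parameters are illustrative, not tied to a specific CPU):

#include <cstdint>

constexpr uint64_t kLineSize  = 64;
constexpr uint64_t kWays      = 8;
constexpr uint64_t kCacheSize = 32 * 1024;
constexpr uint64_t kSets      = kCacheSize / (kLineSize * kWays);  // 64 sets

struct CacheAddress {
  uint64_t offset;  // byte within the 64-byte block (bits 0..5)
  uint64_t set;     // set index (bits 6..11)
  uint64_t tag;     // remaining upper bits, compared against the tags in the set
};

CacheAddress split(uint64_t addr) {
  return { addr % kLineSize, (addr / kLineSize) % kSets, addr / (kLineSize * kSets) };
}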
3.6.1.3 Managing Misses. When a cache miss occurs, the cache controller must
select a block in the cache to be replaced to allocate the address that incurred the
miss. For a direct-mapped cache, since the new address can be allocated only in a
single location, the previous entry mapping to that location is deallocated, and the
new entry is installed in its place. In a set-associative cache, since the new cache block
can be placed in any of the blocks of the set, a replacement algorithm is required.
The typical replacement algorithm used is the LRU (least recently used) policy, where
the block that was least recently accessed is evicted to make room for the new data.
Another alternative is to randomly select one of the blocks as the victim block.
3.6.1.4 Managing Writes. Write accesses to caches are less frequent than data
reads. Handling writes in caches is harder, and CPU implementations use various
techniques to handle this complexity. Software developers should pay special attention
to the various write caching flows supported by the hardware to ensure the best
performance of their code.
CPU designs use two basic mechanisms to handle writes that hit in the cache:
• In a write-through cache, hit data is written to both the block in the cache and
to the next lower level of the hierarchy.
• In a write-back cache, hit data is only written to the cache. Subsequently, lower
levels of the hierarchy contain stale data. The state of the modified line is
tracked through a dirty bit in the tag. When a modified cache line is eventually
evicted from the cache, a write-back operation forces the data to be written back
to the next lower level.
Cache misses on write operations can be handled in two ways:
• In a write-allocate cache, the data for the missed location is loaded into the cache
from the lower level of the hierarchy, and the write operation is subsequently
handled like a write hit.
• If the cache uses a no-write-allocate policy, the cache miss transaction is sent
directly to the lower levels of the hierarchy, and the block is not loaded into the
cache.
Out of these options, most designs typically choose to implement a write-back cache
with a write-allocate policy as both of these techniques try to convert subsequent
write transactions into cache hits, without additional traffic to the lower levels of the
hierarchy. Write-through caches typically use the no-write-allocate policy.
Hardware designers take on the challenge of reducing the hit time and miss penalty
through many novel micro-architecture techniques. Fundamentally, cache misses stall
the pipeline and hurt performance. The miss rate for any cache is highly dependent
on the cache architecture (block size, associativity) and the software running on the
machine.
A programmer can explicitly request data ahead of time with a software prefetch instruction (see Section 8.5). Compilers can also automatically add prefetch instruc-
tions into the code to request data before it is required. Prefetch techniques need
to balance between demand and prefetch requests to guard against prefetch traffic
slowing down demand traffic.
3.6.2.1 DDR (Double Data Rate) is the predominant DRAM technology sup-
ported by most CPUs. Historically, DRAM bandwidths have improved every generation
while the DRAM latencies have stayed the same or increased. Table 3.1 shows the top
data rate, peak bandwidth, and the corresponding read latency for the last three
generations of DDR technologies. The data rate is measured in millions of transfers
per second (MT/s). The latencies shown in this table correspond to the latency in the
DRAM device itself. Typically, the latencies as seen from the CPU pipeline (cache
miss on a load to use) are higher (in the 50ns-150ns range) due to additional latencies
and queuing delays incurred in the cache controllers, memory controllers, and on-die
interconnects. You can see an example of measuring observed memory latency and
bandwidth in Section 4.10.
Table 3.1: Performance characteristics for the last three generations of DDR technologies.

DDR Generation   Year   Highest Data Rate (MT/s)   Peak Bandwidth (GB/s)   In-device Read Latency (ns)
DDR3             2007   2133                       17.1                    10.3
DDR4             2014   3200                       25.6                    12.5
DDR5             2020   6400                       51.2                    14
It is worth mentioning that DRAM chips require their memory cells to be refreshed
periodically. This is because the bit value is stored as the presence of an electric
charge on a tiny capacitor, so it can lose its charge over time. To prevent this, there
is special circuitry that reads each cell and writes it back, effectively restoring the
capacitor’s charge. While a DRAM chip is in its refresh procedure, it is not serving
memory access requests.
A DRAM module is organized as a set of DRAM chips. Memory rank is a term that
describes how many sets of DRAM chips exist on a module. For example, a single-rank
(1R) memory module contains one set of DRAM chips. A dual-rank (2R) memory
module has two sets of DRAM chips, therefore doubling the capacity of a single-rank
module. Likewise, there are quad-rank (4R) and octa-rank (8R) memory modules
available for purchase.
Each rank consists of multiple DRAM chips. Memory width defines how wide the bus
of each DRAM chip is; it can be one of three values: x4, x8, or x16. Since each rank is
64 bits wide (or 72 bits wide for ECC RAM), the memory width also determines the
number of DRAM chips within the rank. As an example, Figure 3.8 shows the organization of a 2Rx16
dual-rank DRAM DDR4 module, with a total of 2GB capacity. There are four chips
in each rank, with a 16-bit wide bus. Combined, the four chips provide 64-bit output.
The two ranks are selected one at a time through a rank-select signal.
Figure 3.8: Organization of a 2Rx16 dual-rank DRAM DDR4 module with a total capacity
of 2GB.
Going further, we can install multiple DRAM modules in a system to not only increase
memory capacity but also memory bandwidth. Setups with multiple memory channels
are used to scale up the communication speed between the memory controller and the
DRAM.
A system with a single memory channel has a 64-bit wide data bus between the DRAM
and memory controller. The multi-channel architectures increase the width of the
memory bus, allowing DRAM modules to be accessed simultaneously. For example,
the dual-channel architecture expands the width of the memory data bus from 64
bits to 128 bits, doubling the available bandwidth (see Figure 3.9). Notice that each
memory module is still a 64-bit device; we just connect them differently. Nowadays it is
very typical for server machines to have four or eight memory channels.
You could also encounter setups with multiple memory controllers. For example, a
processor may have two integrated memory controllers, each capable of supporting
several memory channels. The two controllers are independent, and each views only
its own slice of the total physical memory address space.
We can do a quick calculation to determine the maximum memory bandwidth for a
given memory technology, using the simple formula below:
Max. Memory Bandwidth = Data Rate × Bytes per transfer
For example, for a single-channel DDR4 configuration with a data rate of 2400 MT/s
and 64 bits (8 bytes) per transfer, the maximum bandwidth equals 2400 * 8 = 19.2
GB/s. Dual-channel or dual memory controller setups double the bandwidth to 38.4
GB/s. Remember though, these numbers are theoretical maximums that assume a data
transfer will occur on every memory clock cycle, which never happens in practice. So,
when measuring actual memory speed, you will always see a value lower than the
theoretical maximum.
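As another check, applying the same formula to the DDR5 row of Table 3.1 gives 6400 MT/s × 8 bytes = 51.2 GB/s per channel, which matches the peak bandwidth listed in the table.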
To enable multi-channel configuration, you need to have a CPU and motherboard
that support such an architecture and install an even number of identical memory
modules in the correct memory slots on the motherboard. The quickest way to check
the setup on Windows is by running a hardware identification utility like CPU-Z or
HwInfo; on Linux, you can use the dmidecode command. Alternatively, you can run
memory bandwidth benchmarks like Intel MLC or Stream.
To make use of multiple memory channels in a system, there is a technique called
interleaving. It spreads adjacent addresses within a page across multiple memory
devices. An example of a 2-way interleaving for sequential memory accesses is shown
in Figure 3.10. As before, we have a dual-channel memory configuration (channels A
and B) with two independent memory controllers. Modern processors typically interleave
at a granularity of four cache lines (256 bytes), i.e., the first four adjacent cache lines go
to channel A, and the next four cache lines go to channel B.
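Under this scheme, the channel that serves a given physical address can be sketched as follows (a simplification that ignores any controller-specific address hashing):

#include <cstdint>

// 2-way interleaving with a 256-byte granule: blocks of four consecutive
// 64-byte cache lines alternate between channel A (0) and channel B (1).
int channel_for(uint64_t phys_addr) {
  constexpr uint64_t kGranule = 256;  // 4 cache lines of 64 bytes
  return static_cast<int>((phys_addr / kGranule) % 2);
}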
Without interleaving, consecutive adjacent accesses would be sent to the same memory
controller, not utilizing the second available controller. In contrast, interleaving
enables hardware parallelism to better utilize available memory bandwidth. For most
workloads, performance is maximized when all the channels are populated as it spreads
a single memory region across as many DRAM modules as possible.
While increased memory bandwidth is generally good, it does not always translate
into better system performance and is highly dependent on the application. On the
other hand, it’s important to watch out for available and utilized memory bandwidth,
because once it becomes the primary bottleneck, the application stops scaling, i.e.,
adding more cores doesn’t make it run faster.
3.6.2.2 GDDR and HBM Besides multi-channel DDR, there are other technolo-
gies that target workloads where higher memory bandwidth is required to achieve
greater performance. Technologies such as GDDR (Graphics DDR) and HBM (High
Bandwidth Memory) are the most notable ones. They find their use in high-end
graphics, high-performance computing such as climate modeling, molecular dynamics,
and physics simulation, but also autonomous driving, and of course, AI/ML. They are
a natural fit there because such applications require moving large amounts of data
very quickly.
GDDR was primarily designed for graphics and nowadays it is used on virtually every
high-performance graphics card. GDDR shares some characteristics with DDR, but it
is also quite different. While DDR is designed for lower latencies, GDDR is built for
much higher bandwidth, and it is placed very close to the processor chip.
Similar to DDR, the GDDR interface transfers two 32-bit words
(64 bits in total) per clock cycle. The latest GDDR6X standard can achieve up to 168
GB/s bandwidth, operating at a relatively low 656 MHz frequency.
HBM is a new type of CPU/GPU memory that vertically stacks memory chips, also
called 3D stacking. Similar to GDDR, HBM drastically shortens the distance data
needs to travel to reach a processor. The main difference from DDR and GDDR is
that the HBM memory bus is very wide: 1024 bits for each HBM stack. This enables
HBM to achieve ultra-high bandwidth. The latest HBM3 standard supports up to
665 GB/s bandwidth per package. It also operates at a low frequency of 500 MHz
and has a memory density of up to 48 GB per package.
A system with HBM onboard will be a good choice if you want to maximize data
transfer throughput. However, at the time of writing, this technology is quite expensive.
As GDDR is predominantly used in graphics cards, HBM may be a good option to
accelerate certain workloads that run on a CPU. In fact, the first x86 general-purpose
server chips with integrated HBM are now available.
3.7 Virtual Memory
The page table can be either single-level or nested. Figure 3.12 shows one example
of a 2-level page table. Notice how the address gets split into more pieces. The first
thing to mention is that the 16 most significant bits are not used. This can seem like
a waste of bits, but even with the remaining 48 bits we can address 256 TB of total
memory (2^48 bytes). Some applications use those unused bits to keep metadata, a
practice known as pointer tagging.
A nested page table is a radix tree that keeps physical page addresses along with some
metadata. To find a translation within a 2-level page table, we first use bits 32..47 as
an index into the Level-1 page table also known as the page table directory. Every
descriptor in the directory points to one of the 2^16 blocks of Level-2 tables. Once we
find the appropriate L2 block, we use bits 12..31 to find the physical page address.
Concatenating it with the page offset (bits 0..11) gives us the physical address, which
can be used to retrieve the data from the DRAM.
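For the 2-level layout just described, extracting the pieces of a 48-bit virtual address can be sketched as follows (field names are mine):

#include <cstdint>

// Bits 47..32 index the page table directory, bits 31..12 index the
// Level-2 table, and bits 11..0 are the offset within the 4KB page.
struct VirtAddr {
  uint64_t l1_index;  // 16 bits
  uint64_t l2_index;  // 20 bits
  uint64_t offset;    // 12 bits
};

VirtAddr split_virtual_address(uint64_t va) {
  return { (va >> 32) & 0xFFFF, (va >> 12) & 0xFFFFF, va & 0xFFF };
}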
The exact format of the page table is dictated by the CPU for reasons we will discuss
a few paragraphs later. Thus the variations of page table organization are limited by
what a CPU supports. Modern CPUs support both 4-level page tables with 48-bit
pointers (256 TB of total memory) and 5-level page tables with 57-bit pointers (128
PB of total memory).
Breaking the page table into multiple levels doesn’t change the total addressable
memory. However, a nested approach does not require storing the entire page table as
a contiguous array and does not allocate blocks that have no descriptors. This saves
memory space but adds overhead when traversing the page table.
Failure to provide a physical address mapping is called a page fault. It occurs if a
requested page is invalid or is not currently in the main memory. The two most
common reasons are: 1) the operating system committed to allocating a page but
hasn’t yet backed it with a physical page, and 2) an accessed page was swapped out
to disk and is not currently stored in RAM.
An example of an address that points to data within a huge page is shown in Figure
3.13. Just like with a default page size, the exact address format when using huge
pages is dictated by the hardware, but luckily we as programmers usually don’t have
to worry about it.
Using huge pages drastically reduces the pressure on the TLB hierarchy since fewer
TLB entries are required. It greatly increases the chance of a TLB hit. We will discuss
how to use huge pages to reduce the frequency of TLB misses in Section 8.4 and
Section 11.8. The downsides of using huge pages are memory fragmentation and,
in some cases, nondeterministic page allocation latency because it is harder for the
operating system to manage large blocks of memory and to ensure effective utilization
of available memory. To satisfy a 2MB huge page allocation request at runtime, an
OS needs to find a contiguous chunk of 2MB. If this cannot be found, the OS needs to
reorganize the pages, resulting in a longer allocation latency.
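On Linux, one way to explicitly request a 2MB huge page is mmap with the MAP_HUGETLB flag. This is a minimal sketch; it assumes huge pages have been reserved by the administrator, and the call fails otherwise:

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t size = 2 * 1024 * 1024;  // one 2MB huge page
  void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    perror("mmap(MAP_HUGETLB)");  // e.g., no huge pages reserved in the system
    return 1;
  }
  // ... use the memory ...
  munmap(p, size);
  return 0;
}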
3.8 Modern CPU Design
Figure 3.14: Block diagram of a CPU core in the Intel Golden Cove Microarchitecture.
instruction cache. The BPU predicts the target of all branch instructions and steers
the next instruction fetch based on this prediction.
The heart of the BPU is a branch target buffer (BTB) with 12K entries containing
information about branches and their targets. This information is used by the
prediction algorithms. Every cycle, the BPU generates the next fetch address and
passes it to the CPU Frontend.
The CPU Frontend fetches 32 bytes per cycle of x86 instructions from the L1 I-cache.
This is shared among the two threads (if SMT is enabled), so each thread gets 32
bytes every other cycle. These are complex, variable-length x86 instructions. First,
the pre-decode stage determines and marks the boundaries of the individual instructions
by inspecting the chunk. In x86, the instruction length can range from 1 to 15 bytes.
This stage also identifies branch instructions. The pre-decode stage moves up to 6
instructions per cycle to the Instruction Queue.
Figure 3.15: Block diagram of the CPU Backend of the Intel Golden Cove Microarchitecture.
Second, the ROB allocates execution resources. When an instruction enters the ROB,
a new entry is allocated and resources are assigned to it, mainly an execution unit
and the destination physical register. The ROB can allocate up to 6 µops per cycle.
Third, the ROB tracks speculative execution. When an instruction has finished its
execution, its status gets updated and it stays there until the previous instructions
finish. It’s done that way because instructions must retire in program order. Once
an instruction retires, its ROB entry is deallocated and the results of the instruction
become visible. The retiring stage is wider than the allocation: the ROB can retire 8
instructions per cycle.
There are certain operations, often called idioms, that processors handle in a special
manner: they require either no execution at all or less costly execution. Processors
recognize such cases and allow them to run faster than regular instructions. Here are
some of these cases:
• Zeroing: to assign zero to a register, compilers often use XOR / PXOR / XORPS
/ XORPD instructions, e.g., XOR EAX, EAX, which they prefer over the equivalent
MOV EAX, 0x0 instruction because the XOR encoding uses fewer bytes. Such
zeroing idioms are not executed like regular instructions; they are resolved in the
CPU frontend, which saves execution resources. The instruction later retires as usual.
• Move elimination: similar to the previous one, register-to-register mov opera-
tions, e.g., MOV EAX, EBX, are executed with zero cycle delay.
• NOP instruction: NOP is often used for padding or alignment purposes. It
simply gets marked as completed without allocating it to the reservation station.
• Other bypasses: CPU architects also optimized certain arithmetic operations.
For example, multiplying any number by one will always yield the same number.
The same goes for dividing any number by one. Multiplying any number by zero
always yields zero, etc. Some CPUs can recognize such cases at runtime and
execute them with shorter latency than regular multiplication or divide.
The “Scheduler / Reservation Station” (RS) is the structure that tracks the availability
of all resources for a given µop and dispatches the µop to an execution port once it is
ready. An execution port is a pathway that connects the scheduler to its execution
units. Each execution port may be connected to multiple execution units. When an
instruction enters the RS, the scheduler starts tracking its data dependencies. Once
all the source operands become available, the RS attempts to dispatch the µop to a
free execution port. The RS has fewer entries55 than the ROB. It can dispatch up to
6 µops per cycle.
55 People have measured ~200 entries in the RS; however, the actual number of entries is not
disclosed.
I repeated a part of the diagram that depicts the Golden Cove execution engine and
Load-Store unit in Figure 3.16. There are 12 execution ports:
• Ports 0, 1, 5, 6, and 10 provide integer (INT) operations, and some of them
handle floating-point and vector (FP/VEC) operations.
• Ports 2, 3, and 11 are used for address generation (AGU) and for load operations.
• Ports 4 and 9 are used for store operations (STD).
• Ports 7 and 8 are used for address generation.
Figure 3.16: Block diagram of the execution engine and the Load-Store unit in the Intel
Golden Cove Microarchitecture.
Instructions that require memory operations are handled by the Load-Store unit (ports
2, 3, 11, 4, 9, 7, and 8) which we will discuss in the next section. If an operation
does not involve loading or storing data, then it will be dispatched to the execution
engine (ports 0, 1, 5, 6, and 10). Some instructions may require two µops that must
be executed on different execution ports, e.g., load and add.
For example, an Integer Shift operation can go only to either port 0 or 6, while a
Floating-Point Divide operation can only be dispatched to port 0. In a situation
when a scheduler has to dispatch two operations that require the same execution port,
one of them will have to be delayed.
The FP/VEC stack does floating-point scalar and all packed (SIMD) operations.
For instance, ports 0, 1, and 5 can handle ALU operations of the following types:
packed integer, packed floating-point, and scalar floating-point. Integer and Vector/FP
register files are located separately. Operations that move values from the INT stack
to FP/VEC and vice-versa (e.g., convert, extract, or insert) incur additional penalties.
memory location. It can also issue up to two stores (two 256-bit or one 512-bit) per
cycle via ports 4, 9, 7, and 8. STD stands for Store Data.
Notice that the AGU is required for both load and store operations to perform
dynamic address calculation. For example, in the instruction vmovss DWORD PTR
[rsi+0x4],xmm0, the AGU will be responsible for calculating rsi+0x4, which will be
used to store data from xmm0.
Once a load or a store leaves the scheduler, the LSU is responsible for accessing the
data. Load operations save the fetched value in a register. Store operations transfer a
value from a register to a location in memory. The LSU has a Load Buffer (also known
as the Load Queue) and a Store Buffer (also known as the Store Queue); their sizes are
not disclosed.56 Both the Load Buffer and the Store Buffer receive operations at dispatch
from the scheduler.
When a memory load request comes, the LSU queries the L1 cache using a virtual
address and looks up the physical address translation in the TLB. Those two operations
are initiated simultaneously. The size of the L1 D-cache is 48KB. If both operations
result in a hit, the load delivers data to the integer or floating-point register and leaves
the Load Buffer. Similarly, a store would write the data to the data cache and exit
the Store Buffer.
In case of an L1 miss, the hardware initiates a query of the (private) L2 cache tags.
While the L2 cache is being queried, a 64-byte wide fill buffer (FB) entry is allocated,
which will keep the cache line once it arrives. The Golden Cove core has 16 fill buffers.
As a way to lower the latency, a speculative query is sent to the L3 cache in parallel
with the L2 cache lookup. Also, if two loads access the same cache line, they will hit
the same FB. Such two loads will be “glued” together and only one memory request
will be initiated.
If the L2 miss is confirmed, the load continues to wait for the result of the L3 cache
lookup, which incurs a much higher latency. From that point, the request leaves the
core and enters the uncore, a term you may sometimes see in profiling tools. The
outstanding misses from the core are tracked in the Super Queue (SQ, not shown on
the diagram), which can track up to 48 uncore requests. In a scenario of L3 miss, the
processor begins to set up a memory access. Further details are beyond the scope of
this chapter.
When a store modifies a memory location, the processor needs to load the full cache
line, change it, and then write it back to memory. If the address to write is not in the
cache, it goes through a very similar mechanism as with loads to bring that data in.
The store cannot be complete until the data is written to the cache hierarchy.
Of course, there are a few optimizations for store operations as well. First, if
we're dealing with a store or multiple adjacent stores (also known as streaming stores)
that modify an entire cache line, there is no need to read the data first since all of the
bytes will be clobbered anyway. So, the processor will try to combine writes to fill an
entire cache line; if this succeeds, no memory read operation is needed.
56 Load Buffer and Store Buffer sizes are not disclosed, but people have measured 192 and 114
entries respectively.
Second, write combining enables multiple stores to be assembled and written further
out in the cache hierarchy as a unit. So, if multiple stores modify the same cache
line, only one memory write will be issued to the memory subsystem. All these
optimizations are done inside the Store Buffer. A store instruction copies the data
that will be written from a register into the Store Buffer. From there it may be written
to the L1 cache or it may be combined with other stores to the same cache line. The
Store Buffer capacity is limited, so it can hold a request for a partial write to a cache
line only for a limited time. However, while the data sits in the Store Buffer waiting to
be written, other load instructions can read the data straight from the store buffers
(store-to-load forwarding). Also, the LSU supports store-to-load forwarding when
there is an older store containing all of the load’s bytes, and the store’s data has been
produced and is available in the store queue.
Finally, there are cases when we can improve cache utilization by using so-called
non-temporal memory accesses. If we execute a partial store (e.g., we overwrite 8 bytes
in a cache line), we need to read the cache line first. This new cache line will displace
another line in the cache. However, if we know that we won’t need this data again,
then it would be better not to allocate space in the cache for that line. Non-temporal
memory accesses are special CPU instructions that do not keep the fetched line in the
cache and drop it immediately after use.
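On x86, non-temporal stores are exposed through streaming-store intrinsics. A minimal sketch (it assumes the destination is 32-byte aligned, n is a multiple of 4, and the data will not be read again soon):

#include <immintrin.h>
#include <cstddef>

// Copy n doubles without polluting the caches with the destination lines.
void copy_nontemporal(const double* src, double* dst, size_t n) {
  for (size_t i = 0; i < n; i += 4)
    _mm256_stream_pd(dst + i, _mm256_loadu_pd(src + i));
  _mm_sfence();  // make the streaming stores globally visible before returning
}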
During a typical program execution, there could be dozens of memory accesses in flight.
In most high-performance processors, the order of load and store operations is not
necessarily required to be the same as the program order, which is known as a weakly
ordered memory model. For optimization purposes, the processor can reorder memory
read and write operations. Consider a situation when a load runs into a cache miss
and has to wait until the data comes from memory. The processor allows subsequent
loads to proceed ahead of the load that is waiting for data. This allows later loads
to finish before the earlier load and doesn’t unnecessarily block the execution. Such
load/store reordering enables memory units to process multiple memory accesses in
parallel, which translates directly into higher performance.
The LSU dynamically reorders operations, supporting both loads bypassing older loads
and loads bypassing older non-conflicting stores. However, there are a few exceptions.
Just like with dependencies through regular arithmetic instructions, there are memory
dependencies through loads and stores. In other words, a load can depend on an
earlier store and vice-versa. First of all, stores cannot be reordered with older loads:
Load R1, MEM_LOC_X
Store MEM_LOC_X, 0
If we allow the store to go before the load, then the R1 register may read the wrong
value from the memory location MEM_LOC_X.
Another interesting situation happens when a load consumes data from an earlier
store:
Store MEM_LOC, 0
Load R1, MEM_LOC
If a load consumes data from a store that hasn’t yet finished, we should not allow
the load to proceed. But what if we don’t yet know the address of the store? In
this case, the processor predicts whether there will be any potential data forwarding
between the load and the store and if reordering is safe. This is known as memory
disambiguation. When a load starts executing, it has to be checked against all older
stores for potential store forwarding. There are four possible scenarios:
• Prediction: Not dependent; Outcome: Not dependent. This is a case of a
successful memory disambiguation, which yields optimal performance.
• Prediction: Dependent; Outcome: Not dependent. In this case, the processor
was overly conservative and did not let the load go ahead of the store. This is a
missed opportunity for performance optimization.
• Prediction: Not dependent; Outcome: Dependent. This is a memory order
violation. Similar to the case of a branch misprediction, the processor has to
flush the pipeline, roll back the execution, and start over. It is very costly.
• Prediction: Dependent; Outcome: Dependent. There is a memory dependency
between the load and the store, and the processor predicted it correctly. No
missed opportunities.
It’s worth mentioning that forwarding from a store to a load occurs in real code
quite often. In particular, any code that uses read-modify-write accesses to its data
structures is likely to trigger these sorts of problems. Due to the large out-of-order
window, the CPU can easily attempt to process multiple read-modify-write sequences
at once, so the read of one sequence can occur before the write of the previous sequence
is complete. One such example is presented in Section 12.2.
The second level of the hierarchy (STLB) caches translations for both instructions
and data. It is a larger storage for serving requests that missed in the L1 TLBs. L2
STLB can accommodate 2048 recent data and instruction page address translations,
which covers a total of 8MB of memory space. There are fewer entries available for
2MB huge pages: L1 ITLB has 32 entries, L1 DTLB has 32 entries, and L2 STLB can
only use 1024 entries that are also shared with regular 4KB pages.
In case a translation was not found in the TLB hierarchy, it has to be retrieved from
the DRAM by “walking” the kernel page tables. Recall that the page table is built
as a radix tree of subtables, with each entry of the subtable holding a pointer to the
next level of the tree.
The key element to speed up the page walk procedure is a set of Paging-Structure
Caches57 that cache the hot entries in the page table structure. For the 4-level page
table, we have the least significant twelve bits (11:0) for page offset (not translated),
and bits 47:12 for the page number. While each entry in a TLB is an individual
complete translation, Paging-Structure Caches cover only the upper 3 levels (bits
47:21). The idea is to reduce the number of loads required to execute in case of a
TLB miss. For example, without such caches, we would have to execute 4 loads,
which would add latency to the instruction completion. But with the help of the
Paging-Structure Caches, if we find a translation for levels 1 and 2 of the address (bits
47:30), we only have to do the remaining 2 loads.
The Golden Cove microarchitecture has four dedicated page walkers, which allows it
to process 4 page walks simultaneously. In the event of a TLB miss, these hardware
units will issue the required loads into the memory subsystem and populate the TLB
hierarchy with new entries. The page-table loads generated by the page walkers can hit
in L1, L2, or L3 caches (details are not disclosed). Finally, page walkers can anticipate
a future TLB miss and speculatively do a page walk to update TLB entries before a
miss actually happens.
The Golden Cove specification doesn’t disclose how resources are shared between two
SMT threads. But in general, caches, TLBs, and execution units are fully shared to
improve the dynamic utilization of those resources. On the other hand, buffers for
staging instructions between major pipe stages are either replicated or partitioned.
These buffers include IDQ, ROB, RAT, RS, Load Buffer, and the Store Buffer. PRF
is also replicated.
3.9 Performance Monitoring Unit
A description of each Intel PMU version, as well as changes from the previous version, can be found in [Intel, 2023, Volume 3B, Chapter 20].
When a counter overflows, the analysis tool then should save the fact of the overflow. We will discuss it in more detail in Chapter 5.
Programming the counters requires privileged instructions, which can only be executed from kernel space. Luckily, you only have to care about this if you're a developer of a performance analysis tool, like Linux perf or Intel VTune Profiler. Those tools handle all the complexity of programming PMCs.
When engineers analyze their applications, it is very common for them to collect the
number of executed instructions and elapsed cycles. That is the reason why some
PMUs have dedicated PMCs for collecting such events. Fixed counters always measure
the same thing inside the CPU core. With programmable counters, it’s up to the user
to choose what they want to measure.
For example, in the Intel Skylake architecture (PMU version 4, see Listing 3.3), each
physical core has three fixed and eight programmable counters. The three fixed
counters are set to count core clocks, reference clocks, and instructions retired (see
Chapter 4 for more details on these metrics). AMD Zen4 and Arm Neoverse V1 cores
support 6 programmable performance monitoring counters per processor core, with no
fixed counters.
It’s not unusual for the PMU to provide more than one hundred events available for
monitoring. Figure 3.18 shows just a small subset of the performance monitoring events available on a modern Intel CPU. It's not hard to notice that
the number of available PMCs is much smaller than the number of performance events.
It’s not possible to count all the events at the same time, but analysis tools solve this
problem by multiplexing between groups of performance events during the execution
of a program (see Section 5.3.1).
• For Intel CPUs, the complete list of performance events can be found in
[Intel, 2023, Volume 3B, Chapter 20] or at perfmon-events.intel.com.
• AMD doesn’t publish a list of performance monitoring events for every AMD
processor. Curious readers may find some information in the Linux perf source
code59 . Also, you can list performance events available for monitoring using the
AMD uProf command line tool. General information about AMD performance
counters can be found in [AMD, 2023, 13.2 Performance Monitoring Counters].
• For ARM chips, performance events are not so well defined. Vendors implement
59 Linux source code for AMD cores - https://github.com/torvalds/linux/blob/master/arch/x86/events/amd/core.c
cores following an ARM architecture, but performance events vary, both in what
they mean and what events are supported. For the Arm Neoverse V1 core, which Arm designs itself, the list of performance events can be found in [Arm, 2022b]. For the Arm Neoverse V2 and V3 microarchitectures, the list of
performance events can be found on Arm’s website.60
Chapter Summary
• Instruction Set Architecture (ISA) is a fundamental contract between software
and hardware. ISA is an abstract model of a computer that defines the set of
available operations and data types, a set of registers, memory addressing, and
other things. You can implement a specific ISA in many different ways. For
example, you can design a “small” core that prioritizes power efficiency or a
“big” core that targets high performance.
• The details of the implementation are encapsulated in the term CPU “microar-
chitecture”. This topic has been researched by thousands of computer scientists
for a long time. Through the years, many smart ideas were invented and imple-
mented in mass-market CPUs. The most notable are pipelining, out-of-order
execution, superscalar engines, speculative execution, and SIMD processors. All
these techniques help exploit Instruction-Level Parallelism (ILP) and improve
single-threaded performance.
• In parallel with single-threaded performance, hardware designers began pushing
multi-threaded performance. The vast majority of modern client-facing devices
have a processor containing multiple cores. Some processors double the number
of observable CPU cores with the help of Simultaneous Multithreading (SMT).
SMT enables multiple software threads to run simultaneously on the same
physical core using shared resources. A more recent technique in this direction
is called “hybrid” processors which combine different types of cores in a single
package to better support a diversity of workloads.
• The memory hierarchy in modern computers includes several levels of cache
that reflect different tradeoffs in speed of access vs. size. L1 cache tends to be
closest to a core, fast but small. The L3/LLC cache is slower but also bigger.
DDR is the predominant DRAM technology used in most platforms. DRAM
modules vary in the number of ranks and memory width which may have a slight
impact on system performance. Processors may have multiple memory channels
to access more than one DRAM module simultaneously.
60 Arm telemetry - https://developer.arm.com/telemetry
• Virtual memory is the mechanism for sharing physical memory with all the
processes running on the CPU. Programs use virtual addresses in their accesses,
which get translated into physical addresses. The memory space is split into
pages. The default page size on x86 is 4KB, and on ARM is 16KB. Only the
page address gets translated, the offset within the page is used as-is. The OS
keeps the translation in the page table, which is implemented as a radix tree.
There are hardware features that improve the performance of address translation:
mainly the Translation Lookaside Buffer (TLB) and hardware page walkers.
Also, developers can utilize Huge Pages to mitigate the cost of address translation
in some cases (see Section 8.4).
• We looked at the design of Intel’s recent Golden Cove microarchitecture. Logically,
the core is split into a Frontend and a Backend. The Frontend consists of a
Branch Predictor Unit (BPU), L1 I-cache, instruction fetch and decode logic, and
the IDQ, which feeds instructions to the CPU Backend. The Backend consists
of the OOO engine, execution units, the load-store unit, the L1 D-cache, and
the TLB hierarchy.
• Modern processors have performance monitoring features that are encapsulated
into a Performance Monitoring Unit (PMU). This unit is built around a concept
of Performance Monitoring Counters (PMC) that enables observation of specific
events that happen while a program is running, for example, cache misses and
branch mispredictions.
4 Terminology and Metrics in Performance Analysis
Like many engineering disciplines, performance analysis is quite heavy on using peculiar
terms and metrics. For a beginner, it can be very hard to make sense of a profile generated by an analysis tool like Linux perf or Intel VTune Profiler. Those tools juggle many complex terms and metrics; however, these metrics are "must-knowns" if you're set to do any serious performance engineering work.
Since I have mentioned Linux perf, let me briefly introduce the tool as I have
many examples of using it in this and later chapters. Linux perf is a performance
profiler that you can use to find hotspots in a program, collect various low-level CPU
performance events, analyze call stacks, and many other things. I will use Linux
perf extensively throughout the book as it is one of the most popular performance
analysis tools. Another reason why I prefer showcasing Linux perf is because it is
open-source software, which enables enthusiastic readers to explore the mechanics of
what’s going on inside a modern profiling tool. This is especially useful for learning
concepts presented in this book because GUI-based tools, like Intel® VTune™ Profiler,
tend to hide all the complexity. We will have a more detailed overview of Linux perf
in Chapter 7.
This chapter is a gentle introduction to the basic terminology and metrics used in
performance analysis. We will first define the basic things like retired/executed instruc-
tions, IPC/CPI, µops, core/reference clocks, cache misses, and branch mispredictions.
Then we will see how to measure the memory latency and bandwidth of a system and
introduce some more advanced metrics. In the end, we will benchmark four industry
workloads and look at the collected metrics.
4.3 CPI and IPC
IPC = INST_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD,
where INST_RETIRED.ANY counts the number of retired instructions and CPU_CLK_UNHALTED.THREAD counts the number of core clock cycles while the thread is not in a halted state.
The reciprocal metric, CPI (cycles per instruction), is defined as:
CPI = 1 / IPC
Using one or another is a matter of preference. I prefer to use IPC as it is easier to
compare. With IPC, we want as many instructions per cycle as possible, so the higher
the IPC, the better. With CPI, it’s the opposite: we want as few cycles per instruction
as possible, so the lower the CPI the better. The comparison that uses “the higher the
better” metric is simpler since you don’t have to do the mental inversion every time.
In the rest of the book, we will mostly use IPC, but again, there is nothing wrong
with using CPI either.
The relationship between IPC and CPU clock frequency is very interesting. In the
broad sense, performance = work / time, where we can express work as the number
of instructions and time as seconds. The number of seconds a program was running
can be calculated as total cycles / frequency:
Performance = (instructions × frequency) / cycles = IPC × frequency
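For a quick sanity check with made-up numbers: a program that retires 8 billion instructions in 10 billion cycles has an IPC of 0.8. If the core ran at an average of 4 GHz, those 10 billion cycles took 2.5 seconds, and the delivered performance is 8 billion instructions / 2.5 seconds = 3.2 billion instructions per second, which is exactly IPC × frequency = 0.8 × 4 GHz.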
4.4 Micro-operations
Microprocessors with the x86 architecture translate complex CISC instructions into
simple RISC microoperations, abbreviated as µops. A simple register-to-register
addition instruction such as ADD rax, rbx generates only one µop, while a more
complex instruction like ADD rax, [mem] may generate two: one for loading from
the mem memory location into a temporary (unnamed) register, and one for adding it
to the rax register. The instruction ADD [mem], rax generates three µops: one for
loading from memory, one for adding, and one for storing the result back to memory.
The main advantage of splitting instructions into micro-operations is that µops can
be executed:
• Out of order: consider the PUSH rbx instruction, which decrements the stack
pointer by 8 bytes and then stores the source operand on the top of the stack.
Suppose that PUSH rbx is “cracked” into two dependent micro-operations after
decoding:
SUB rsp, 8
STORE [rsp], rbx
Often, a function prologue saves multiple registers by using multiple PUSH
instructions. In our case, the next PUSH instruction can start executing after the
SUB µop of the previous PUSH instruction finishes and doesn’t have to wait for
the STORE µop, which can now execute asynchronously.
• In parallel: consider the HADDPD xmm1, xmm2 instruction, which will sum up (reduce) the two double-precision floating-point values in each of xmm1 and xmm2 and store the two results in xmm1 as follows:
xmm1[63:0] = xmm1[127:64] + xmm1[63:0]
xmm1[127:64] = xmm2[127:64] + xmm2[63:0]
One way to microcode this instruction would be to do the following: 1) reduce
xmm2 and store the result in xmm_tmp1[63:0], 2) reduce xmm1 and store the
result in xmm_tmp2[63:0], 3) merge xmm_tmp1 and xmm_tmp2 into xmm1. Three
µops in total. Notice that steps 1) and 2) are independent and thus can be done
in parallel.
Even though we were just talking about how instructions are split into smaller pieces,
sometimes, µops can also be fused together. There are two types of fusion in modern
x86 CPUs:
• Microfusion: fuse µops from the same machine instruction. Microfusion can
only be applied to two types of combinations: memory write operations and
read-modify operations. For example:
add eax, [mem]
There are two µops in this instruction: 1) read the memory location mem, and 2)
add it to eax. With microfusion, two µops are fused into one at the decoding
step.
• Macrofusion: fuse µops from different machine instructions. The decoders
can fuse arithmetic or logic instructions with a subsequent conditional jump
instruction into a single compute-and-branch µop in certain cases. For example:
.loop:
dec rdi
jnz .loop
With macrofusion, two µops from the DEC and JNZ instructions are fused into
one. The Zen4 microarchitecture also added support for DIV/IDIV and NOP
macrofusion [Advanced Micro Devices, 2023, sections 2.9.4 and 2.9.5].
Both micro- and macrofusion save bandwidth in all stages of the pipeline, from
decoding to retirement. The fused operations share a single entry in the reorder buffer
(ROB). The capacity of the ROB is utilized better when a fused µop uses only one
entry. Such a fused ROB entry is later dispatched to two different execution ports
but is retired again as a single unit. Readers can learn more about µop fusion in
[Fog, 2023a].
To collect the number of issued, executed, and retired µops for an application, you
can use Linux perf as follows:
$ perf stat -e uops_issued.any,uops_executed.thread,uops_retired.slots -- ./a.exe
2856278 uops_issued.any
2720241 uops_executed.thread
2557884 uops_retired.slots
The way instructions are split into micro-operations may vary across CPU generations.
Usually, a lower number of µops used for an instruction means that hardware has
better support for it and is likely to have lower latency and higher throughput. For
the latest Intel and AMD CPUs, the vast majority of instructions generate only one
µop. Latency, throughput, port usage, and the number of µops for x86 instructions
on recent microarchitectures can be found at the uops.info62 website.
4.5 Pipeline Slot
and destination registers, execution port, ROB entries, etc.) to 4 new µops every cycle.
Such a processor is usually called a 4-wide machine. During six consecutive cycles on
the diagram, only half of the available slots were utilized (highlighted in yellow). From
a microarchitecture perspective, the efficiency of executing such code is only 50%.
Intel’s Skylake and AMD Zen3 cores have a 4-wide allocation. Intel’s Sunny Cove
microarchitecture was a 5-wide design. As of the end of 2023, the most recent Golden
Cove and Zen4 architectures both have a 6-wide allocation. Apple M1 and M2 designs
are 8-wide, and Apple M3 has a 9-µop execution bandwidth (see [Apple, 2024, Table 4.10]).
The width of a machine puts a cap on the IPC. This means that the maximum
achievable IPC of a processor equals its width.63 For example, when your calculations
show more than 6 IPC on a Golden Cove core, you should be suspicious.
Very few applications can achieve the maximum IPC of a machine. For example, Intel
Golden Cove core can theoretically execute four integer additions/subtractions, plus
one load, plus one store (for a total of six instructions) per clock, but an application
is highly unlikely to have the appropriate mix of independent instructions adjacent to
each other to exploit all that potential parallelism.
Pipeline slot utilization is one of the core metrics in Top-down Microarchitecture
Analysis (see Section 6.1). For example, Frontend Bound and Backend Bound metrics
are expressed as a percentage of unutilized pipeline slots due to various bottlenecks.
63 Macro-fused instructions (see Section 4.4) only require a single pipeline slot but are counted as two instructions. In some extreme cases, this may cause IPC to be greater than the machine width.
Most modern CPUs do not run at a constant frequency; instead, they feature dynamic frequency scaling, which is called Turbo Boost in Intel's CPUs, and Turbo Core in AMD processors. It enables the CPU to increase and decrease its frequency dynamically. Decreasing the frequency reduces power consumption at the expense of performance, and increasing the frequency improves performance but sacrifices power savings.
The core clock cycles counter is counting clock cycles at the actual frequency that the
CPU core is running at. The reference clock event counts cycles as if the processor
is running at the base frequency. Let’s take a look at an experiment on a Skylake
i7-6000 processor running a single-threaded application, which has a base frequency of
3.4 GHz:
$ perf stat -e cycles,ref-cycles -- ./a.exe
43340884632 cycles # 3.97 GHz
37028245322 ref-cycles # 3.39 GHz
10,899462364 seconds time elapsed
The ref-cycles event counts cycles as if there were no frequency scaling. The external
clock on the platform has a frequency of 100 MHz, and if we scale it by the clock
multiplier, we will get the base frequency of the processor. The clock multiplier for the
Skylake i7-6000 processor equals 34: it means that for every external pulse, the CPU
executes 34 internal cycles when it’s running on the base frequency (i.e., 3.4 GHz).
The cycles event counts real CPU cycles and takes into account frequency scaling.
Using the formula above we can confirm that the average operating frequency was
43340884632 cycles / 10.899 sec = 3.97 GHz. When you compare the perfor-
mance of two versions of a small piece of code, measuring the time in clock cycles is
better than in nanoseconds, because you avoid the problem of the clock frequency
going up and down.
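A similar cross-check works for the ref-cycles event in the output above: 37,028,245,322 ref-cycles / 10.899 sec ≈ 3.4 GHz, which matches the base frequency of the processor (34 × 100 MHz).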
4.7 Cache Miss
Both instruction and data fetches can miss in the cache. According to Top-down
Microarchitecture Analysis (see Section 6.1), an instruction cache (I-cache) miss is
characterized as a Frontend stall, while a data cache (D-cache) miss is characterized
64 There is also an interactive view that visualizes the latency of different operations in modern
systems - https://colin-scott.github.io/personal_website/research/interactive_latency.html
as a Backend stall. Instruction cache misses happen very early in the CPU pipeline
during instruction fetch. Data cache misses happen much later during the instruction
execution phase.
Linux perf users can collect the number of L1 cache misses by running:
$ perf stat -e mem_load_retired.fb_hit,mem_load_retired.l1_miss,
mem_load_retired.l1_hit,mem_inst_retired.all_loads -- a.exe
29580 mem_load_retired.fb_hit
19036 mem_load_retired.l1_miss
497204 mem_load_retired.l1_hit
546230 mem_inst_retired.all_loads
Above is the breakdown of all loads for the L1 data cache and fill buffers. A load
might either hit the already allocated fill buffer (fb_hit), hit the L1 cache (l1_hit),
or miss both (l1_miss), thus all_loads = fb_hit + l1_hit + l1_miss.65 We can
see that only 3.5% of all loads miss in the L1 cache, thus the L1 hit rate is 96.5%.
We can further break down L1 data misses and analyze L2 cache behavior by running:
$ perf stat -e mem_load_retired.l1_miss,
mem_load_retired.l2_hit,mem_load_retired.l2_miss -- a.exe
19521 mem_load_retired.l1_miss
12360 mem_load_retired.l2_hit
7188 mem_load_retired.l2_miss
From this example, we can see that 37% of loads that missed in the L1 D-cache also
missed in the L2 cache, thus the L2 hit rate is 63%. A breakdown for the L3 cache
can be made similarly.
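Spelling out the arithmetic behind these percentages: 19,036 L1 misses out of 546,230 loads is about 3.5%, hence the 96.5% L1 hit rate; and in the second run, 7,188 of the 19,521 loads that missed in L1 also missed in L2 (about 37%), so roughly 63% of them were served by the L2 cache.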
65 The sum 29,580 + 19,036 + 497,204 = 545,820, which doesn't exactly match all_loads. Most likely it's due to slight inaccuracy in hardware event collection, but we did not investigate this since the numbers are very close.
Linux perf users can check the number of branch mispredictions by running:
$ perf stat -e branches,branch-misses -- a.exe
358209 branches
14026 branch-misses # 3,92% of all branches
# or simply do:
$ perf stat -- a.exe
4.9 Performance Metrics
Table 4.2: A list (not exhaustive) of performance metrics along with descriptions and formulas for the Intel Golden Cove architecture.
L1MPKI: L1 cache true misses per kilo instruction for retired demand loads.
    Formula: 1000 * MEM_LOAD_RETIRED.L1_MISS_PS / INST_RETIRED.ANY
Code STLB MPKI: STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page size that complete the page walk).
    Formula: 1000 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY
A few notes on those metrics. First, the ILP and MLP metrics do not represent
theoretical maximums for an application; rather they measure the actual ILP and
MLP of an application on a given machine. On an ideal machine with infinite resources,
these numbers would be higher. Second, all metrics besides “DRAM BW Use” and
“Load Miss Real Latency” are fractions; we can apply fairly straightforward reasoning
to each of them to tell whether a specific metric is high or low. But to make sense of
“DRAM BW Use” and “Load Miss Real Latency” metrics, we need to put them in
context. For the former, we would like to know if a program saturates the memory
bandwidth or not. The latter gives you an idea of the average cost of a cache miss,
which is useless by itself unless you know the latencies of each component in the cache
hierarchy. We will discuss how to find out cache latencies and peak memory bandwidth
in the next section.
Some tools can report performance metrics automatically. If not, you can always
calculate those metrics manually since you know the formulas and corresponding
performance events that must be collected. Table 4.2 provides formulas for the Intel
Golden Cove architecture, but you can build similar metrics on another platform as
long as underlying performance events are available.
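As a sketch of what such a manual calculation might look like, the snippet below computes IPC plus two metrics from Table 4.2 out of raw event counts. The counts are hypothetical and the variable names are mine; in practice you would take these numbers from perf stat or a similar tool.

#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical raw counts, e.g., taken from `perf stat` output.
  uint64_t inst_retired_any        = 1000000000;  // INST_RETIRED.ANY
  uint64_t cpu_clk_unhalted_thread =  800000000;  // CPU_CLK_UNHALTED.THREAD
  uint64_t l1_miss                 =    4200000;  // MEM_LOAD_RETIRED.L1_MISS_PS
  uint64_t itlb_walks_completed    =      90000;  // ITLB_MISSES.WALK_COMPLETED

  double ipc          = (double)inst_retired_any / cpu_clk_unhalted_thread;
  double l1mpki       = 1000.0 * l1_miss / inst_retired_any;
  double codeStlbMpki = 1000.0 * itlb_walks_completed / inst_retired_any;

  printf("IPC %.2f, L1MPKI %.2f, Code STLB MPKI %.2f\n",
         ipc, l1mpki, codeStlbMpki);
  return 0;
}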
latency-checker-intel-mlc.html
68 lmbench - https://sourceforge.net/projects/lmbench
69 Memory bandwidth benchmark by Zack Smith - https://zsmith.co/bandwidth.php
70 Stream - https://github.com/jeffhammond/STREAM
4.10 Memory Latency and Bandwidth
For this experiment, MLC was pinned to logical CPU 0, which is on a P-core. The option -L enables huge pages to limit TLB
effects in our measurements. The option -b10m tells MLC to use a 10MB buffer, which
will fit in the L3 cache on our system.
Figure 4.2 shows the read latencies of L1, L2, and L3 caches. There are four different
regions on the chart. The first region on the left from 1 KB to 48 KB buffer size
corresponds to the L1 D-cache, which is private to each physical core. We can observe
0.9 ns latency for the E-core and a slightly higher 1.1 ns for the P-core. Also, we can
use this chart to confirm the cache sizes. Notice how E-core latency starts climbing
after a buffer size goes above 32 KB but P-core latency stays constant up to 48 KB.
That confirms that the L1 D-cache size in the E-core is 32 KB, and in the P-core it is
48 KB.
Figure 4.2: L1/L2/L3 cache read latencies (lower better) on Intel Core i7-1260P, measured
with the MLC tool, huge pages enabled.
The second region shows the L2 cache latencies, which for E-core is almost two times
higher than for P-core (5.9 ns vs. 3.2 ns). For P-core, the latency increases after we
cross the 1.25 MB buffer size, which is expected. We expect E-core latency to stay the same until we hit 2 MB; however, according to our measurements, it happens sooner.
The third region from 2 MB up to 14 MB corresponds to L3 cache latency, which
is roughly 12 ns for both types of cores. The total size of the L3 cache that is
shared between all cores in the system is 18 MB. Interestingly, we start seeing some
unexpected dynamics starting from 15 MB, not 18 MB. Most likely it has to do with
some accesses missing in L3 and requiring a fetch from the main memory.
I don’t show the part of the chart that corresponds to memory latency, which begins
after we cross the 18MB boundary. The latency starts climbing very steeply and levels
off at 24 MB for the E-core and 64 MB for the P-core. With a much larger buffer size,
e.g., 500 MB, E-core access latency is 45ns and P-core is 90ns. This measures the
memory latency since almost no loads hit in the L3 cache.
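The core idea behind such latency measurements can be sketched with a dependent pointer chase (my own illustration, not how MLC is implemented): each load's address depends on the result of the previous load, so accesses cannot overlap, and the elapsed time divided by the number of loads approximates the latency of whatever level of the hierarchy the buffer fits into.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
  const size_t numElems = (8 * 1024 * 1024) / sizeof(size_t); // ~8 MB buffer
  std::vector<size_t> next(numElems);
  std::iota(next.begin(), next.end(), 0);

  // Sattolo's algorithm: build a single-cycle permutation so the chase
  // visits every element before it repeats.
  std::mt19937_64 rng(42);
  for (size_t i = numElems - 1; i > 0; i--) {
    size_t j = rng() % i;                   // j in [0, i-1]
    std::swap(next[i], next[j]);
  }

  const size_t iters = 200000000;
  size_t idx = 0;
  auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < iters; i++)
    idx = next[idx];                        // dependent load chain
  auto stop = std::chrono::steady_clock::now();

  double ns = std::chrono::duration<double, std::nano>(stop - start).count();
  // Print idx so the compiler cannot optimize the loop away.
  printf("%.2f ns per access (idx=%zu)\n", ns / iters, idx);
  return 0;
}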
Using a similar technique we can measure the bandwidth of various components of
the memory hierarchy. For measuring bandwidth, MLC executes load requests whose results are not used by any subsequent instructions. This allows MLC to generate
the maximum possible bandwidth. MLC spawns one software thread on each of the
configured logical processors. The addresses that each thread accesses are independent
and there is no sharing of data between threads. As with the latency experiments,
the buffer size used by the threads determines whether MLC is measuring L1/L2/L3
cache bandwidth or memory bandwidth.
$ sudo ./mlc --max_bandwidth -k0-15 -Y -L -u -b18m
Measuring Maximum Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 349670.42
There are a couple of new options here. The -k option specifies a list of CPU cores
used for measurements. The -Y option tells MLC to use AVX2 loads, i.e., 32 bytes at
a time. With the -u flag, each thread shares the same buffer and does not allocate its own. This option must be used to measure the L3 bandwidth (notice we used an 18 MB buffer, which equals the size of the L3 cache).
Figure 4.3: Block diagram of the memory hierarchy of Intel Core i7-1260P and external
DDR4 memory.
Combined latency and bandwidth numbers for our system under test, as measured
with Intel MLC, are shown in Figure 4.3. Cores can draw much higher bandwidth from
lower-level caches like L1 and L2 than from shared L3 cache or main memory. Shared
caches such as L3 and E-core L2, scale reasonably well to serve requests from multiple
cores at the same time. For example, a single E-core L2 bandwidth is 100GB/s. With
two E-cores from the same cluster, I measured 140 GB/s, three E-cores - 165 GB/s,
and all four E-cores can draw 175 GB/s from the shared L2. The same goes for the
L3 cache, which allows for 60 GB/s for a single P-core and only 25 GB/s for a single
E-core. But when all the cores are used, the L3 cache can sustain a bandwidth of 300
GB/s. Reading data from memory can be done at 33.7 GB/s, while the theoretical
maximum bandwidth is 38.4 GB/s on my platform.
Knowledge of the primary characteristics of a machine is fundamental to assessing
how well a program utilizes available resources. We will return to this topic in
Section 5.5 when discussing the Roofline performance model. If you constantly analyze
performance on a single platform, it is a good idea to memorize the latencies and
bandwidth of various components of the memory hierarchy or have them handy. It
helps to establish the mental model for a system under test which will aid your further
performance analysis as you will see next.
4.11 Case Study: Analyzing Performance Metrics of Four Benchmarks
To see how these metrics look in practice, I collected them for four industry workloads using the pmu-tools71 package:
$ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v -- <app with args>
Table 4.3 provides a side-by-side comparison of performance metrics for our four
benchmarks. There is a lot we can learn about the nature of those workloads just by
looking at the metrics.
Metric Name        Core Type   Blender     Stockfish   Clang15-selfbuild   CloverLeaf
Instructions P-core 6.02E+12 6.59E+11 2.40E+13 1.06E+12
Core Cycles P-core 4.31E+12 3.65E+11 3.78E+13 5.25E+12
IPC P-core 1.40 1.80 0.64 0.20
CPI P-core 0.72 0.55 1.57 4.96
Instructions E-core 4.97E+12 0 1.43E+13 1.11E+12
Core Cycles E-core 3.73E+12 0 3.19E+13 4.28E+12
IPC E-core 1.33 0 0.45 0.26
CPI E-core 0.75 0 2.23 3.85
L1MPKI P-core 3.88 21.38 6.01 13.44
L2MPKI P-core 0.15 1.67 1.09 3.58
L3MPKI P-core 0.04 0.14 0.56 3.43
Br. Misp. Ratio P-core 0.02 0.08 0.03 0.01
Code stlb MPKI P-core 0 0.01 0.35 0.01
Ld stlb MPKI P-core 0.08 0.04 0.51 0.03
St stlb MPKI P-core 0 0.01 0.06 0.1
LdMissLat (Clk) P-core 12.92 10.37 76.7 253.89
ILP P-core 3.67 3.65 2.93 2.53
MLP P-core 1.61 2.62 1.57 2.78
Dram Bw (GB/s) All 1.58 1.42 10.67 24.57
IpCall All 176.8 153.5 40.9 2,729
IpBranch All 9.8 10.1 5.1 18.8
IpLoad All 3.2 3.3 3.6 2.7
IpStore All 7.2 7.7 5.9 22.0
IpMispredict All 610.4 214.7 177.7 2,416
IpFLOP All 1.1 1.82E+06 286,348 1.8
IpArith All 4.5 7.96E+06 268,637 2.1
IpArith Scal SP All 22.9 4.07E+09 280,583 2.60E+09
IpArith Scal DP All 438.2 1.22E+07 4.65E+06 2.2
IpArith AVX128 All 6.9 0.0 1.09E+10 1.62E+09
IpArith AVX256 All 30.3 0.0 0.0 39.6
IpSWPF All 90.2 2,565 105,933 172,348
Here are the hypotheses we can make about the performance of the benchmarks:
• Blender. The work is split fairly equally between P-cores and E-cores, with a
decent IPC on both core types. The number of cache misses per kilo instructions
is pretty low (see L*MPKI). Branch misprediction presents a minor bottleneck:
the Br. Misp. Ratio metric is at 2%; we get 1 misprediction for every 610 instructions (see the IpMispredict metric).
71 pmu-tools - https://github.com/andikleen/pmu-tools
branch and one of every ~35 branches gets mispredicted. There are almost no
FP or vector instructions, but this is not surprising. Preliminary conclusion:
Clang has a large codebase, flat profile, many small functions, and “branchy”
code; performance is affected by data cache misses and TLB misses, and branch
mispredictions.
• CloverLeaf . As before, we start with analyzing instructions and core cycles.
The amount of work done by P- and E-cores is roughly the same, but it takes
P-cores more time to do this work, resulting in a lower IPC of one logical thread
on P-core compared to one physical E-core.73 The L*MPKI metrics are high,
especially the number of L3 misses per kilo instructions. The load miss latency
(LdMissLat) is off the charts, suggesting an extremely high price of the average
cache miss. Next, we take a look at the DRAM BW use metric and see that
memory bandwidth consumption is near its limits. That’s the problem: all the
cores in the system share the same memory bus, so they compete for access to
the main memory, which effectively stalls the execution. CPUs are undersupplied
with the data that they demand. Going further, we can see that CloverLeaf does
not suffer from mispredictions or function call overhead. The instruction mix is
dominated by FP double-precision scalar operations with some parts of the code
being vectorized. Preliminary conclusion: multi-threaded CloverLeaf is bound
by memory bandwidth.
As you can see from this study, there is a lot one can learn about the behavior of a
program just by looking at the metrics. It answers the “what?” question, but doesn’t
tell you the “why?”. For that, you will need to collect a performance profile, which
we will introduce in later chapters. In Part 2 of this book, we will discuss how to
mitigate the performance issues we suspect to exist in the four benchmarks that we
have analyzed.
Keep in mind that the summary of performance metrics in Table 4.3 only tells you
about the average behavior of a program. For example, we might be looking at
CloverLeaf’s IPC of 0.2, while in reality, it may never run with such an IPC. Instead,
it may have 2 phases of equal duration, one running with an IPC of 0.1, and the
second with an IPC of 0.3. Performance tools tackle this by reporting statistical
data for each metric along with the average value. Usually, having min, max, 95th
percentile, and variation (stdev/avg) is enough to understand the distribution. Also,
some tools allow plotting the data, so you can see how the value for a certain metric
changed during the program running time. As an example, Figure 4.4 shows the
dynamics of IPC, L*MPKI, DRAM BW, and average frequency for the CloverLeaf
benchmark. The pmu-tools package can automatically build those charts once you
add the --xlsx and --xchart options. The -I 10000 option aggregates collected
samples with 10-second intervals.
$ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v --xlsx workload.xlsx
--xchart -I 10000 -- ./clover_leaf
Even though the deviation from the average values reported in the summary is not
very big, we can see that the workload is not stable. After looking at the IPC chart
73 A possible explanation is that CloverLeaf is very memory-bandwidth bound. All P-
and E-cores are equally stalled waiting on memory. Because P-cores have a higher frequency, they
waste more CPU clocks than E-cores.
Figure 4.4: Performance metrics charts for the CloverLeaf benchmark with 10 second
intervals.
for P-core we can hypothesize that there are no distinct phases in the workload and
the variation is caused by multiplexing between performance events (discussed in
Section 5.3). Yet, this is only a hypothesis that needs to be confirmed or disproved.
Possible ways to proceed would be to collect more data points by running collection
with higher granularity (in our case it was 10 seconds). The chart that plots L*MPKI
suggests that all three metrics hover around their average numbers without much
deviation. The DRAM bandwidth utilization chart indicates that there are periods
with varying pressure on the main memory. The last chart shows the average frequency
of all CPU cores. As you may observe on this chart, throttling starts after the first 10
seconds. I recommend being careful when drawing conclusions just from looking at
the aggregate numbers since they may not be a good representation of the workload
behavior.
Remember that collecting performance metrics is not a substitute for looking into
the code. Always try to find an explanation for the numbers that you see by checking
relevant parts of the code.
In summary, performance metrics help you build the right mental model about what
is and what is not happening in a program. Going further into analysis, these data
will serve you well.
Chapter Summary
• In this chapter, we introduced the basic metrics in performance analysis such as
retired/executed instructions, CPU utilization, IPC/CPI, µops, pipeline slots,
core/reference clocks, cache misses, and branch mispredictions. We showed how
each of these metrics can be collected with Linux perf.
• For more advanced performance analysis, there are many derivative metrics
that you can collect. For instance, cache misses per kilo instructions (MPKI),
instructions per function call, branch, load, etc. (Ip*), ILP, MLP, and others.
The case studies in this chapter show how you can get actionable insights from
analyzing these metrics.
• Be careful about drawing conclusions just by looking at the aggregate numbers.
Don’t fall into the trap of “Excel performance engineering”, i.e., only collecting
performance metrics and never looking at the code. Always seek a second source
of data (e.g., performance profiles, discussed later) to verify your ideas.
• Memory bandwidth and latency are crucial factors in the performance of many
production software packages nowadays, including AI, HPC, databases, and
many general-purpose applications. Memory bandwidth depends on the DRAM
speed (in MT/s) and the number of memory channels. Modern high-end server
platforms have 8–12 memory channels and can reach up to 500 GB/s for the
whole system and up to 50 GB/s in single-threaded mode. Memory latency
nowadays doesn't change much; in fact, it is getting slightly worse with new DDR4 and DDR5 generations. The majority of modern client-facing systems
fall in the range of 70–110 ns latency per memory access. Server platforms may
have higher memory latencies.
5 Performance Analysis Approaches
There are also situations when you see a small change in the execution time, say
5%, and you have no clue where it’s coming from. Timing or throughput measurements
alone do not provide any explanation for why performance goes up or down. In this
case, we need more insights about how a program executes. That is the situation
when we need to do performance analysis to understand the underlying nature of the
slowdown or speedup that we observe.
When you just start working on a performance issue, you probably only have mea-
surements, e.g., before and after the code change. Based on those measurements
you conclude that the program became slower by X percent. If you know that the
slowdown occurred right after a certain commit, that may already give you enough
information to fix the problem. But if you don’t have good reference points, then the
set of possible reasons for the slowdown is endless, and you need to gather more data.
One of the most popular approaches for collecting such data is to profile an application
and look at the hotspots. This chapter introduces this and several other approaches
for gathering data that have proven to be useful in performance engineering.
The next question is: "What performance data are available and how do we collect them?" Both hardware and software layers of the stack have facilities to track
performance events and record them while a program is running. In this context,
by hardware, we mean the CPU, which executes the program, and by software, we
mean the OS, libraries, the application itself, and other tools used for the analysis.
Typically, the software stack provides high-level metrics like time, number of context
switches, and page faults, while the CPU monitors cache misses, branch mispredictions,
and other CPU-related events. Depending on the problem you are trying to solve,
some metrics are more useful than others. So, it doesn’t mean that hardware metrics
will always give us a more precise overview of the program execution. They are just
different. Some metrics, like the number of context switches, for instance, cannot be
provided by a CPU. Performance analysis tools, like Linux perf, can consume data
from both the OS and the CPU.
As you have probably guessed, there are hundreds of data sources that a performance
engineer may use. This chapter is mostly about collecting hardware-level information.
We will introduce some of the most popular performance analysis techniques: code
instrumentation, tracing, characterization, sampling, and the Roofline model. We also
discuss static performance analysis techniques and compiler optimization reports that
do not involve running the actual application.
5.1 Code Instrumentation
The plus sign at the beginning of a line means that this line was added and is not
present in the original code. In general, instrumentation code is not meant to be
pushed into the codebase; rather, it’s for collecting the needed data and later can be
deleted.
A more interesting example of code instrumentation is presented in Listing 5.2. In
this made-up code example, the function findObject searches for the coordinates of
an object with some properties p on a map. All objects are guaranteed to eventually
be located. The function getNewCoords returns new coordinates within a bigger area
that is provided as an argument. The function findObj returns the confidence level
of locating the right object with the current coordinates c. If it is an exact match,
we stop the search loop and return the coordinates. If the confidence is above the
threshold, we call zoomIn to find a more precise location of the object. Otherwise,
we get the new coordinates within the searchArea to try our search next time.
The instrumentation code consists of two classes: histogram and incrementor. The
former keeps track of whatever variable values we are interested in and frequencies of
their occurrence and then prints the histogram after the program finishes. The latter
is just a helper class for pushing values into the histogram object. It is simple and
can be adjusted to your specific needs quickly.74
In this hypothetical scenario, we added instrumentation to know how frequently we
zoomIn before we find an object. The variable inc.tripCount counts the number
of iterations the loop runs before it exits, and the variable inc.zoomCount counts
74 I have a slightly more advanced version of this code, which I usually copy-paste into whatever project I am currently working on.
+ struct incrementor {
+ uint32_t tripCount = 0;
+ uint32_t zoomCount = 0;
+ ~incrementor() {
+ h.hist[tripCount][zoomCount]++;
+ }
+ };
how many times we reduce the search area (call to zoomIn). We always expect
inc.zoomCount to be less or equal to inc.tripCount.
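The fragment above references a global histogram object h; Listing 5.2 defines it in full, but a minimal version of that part (my own reconstruction, the actual listing may differ in details) could look like this:

+ #include <cstdint>
+ #include <cstdio>
+ #include <map>
+
+ struct histogram {
+   // hist[tripCount][zoomCount] -> number of occurrences
+   std::map<uint32_t, std::map<uint32_t, uint64_t>> hist;
+   ~histogram() {                     // print the histogram at program exit
+     for (auto &[trips, inner] : hist)
+       for (auto &[zooms, count] : inner)
+         printf("[%u][%u]: %llu\n", trips, zooms,
+                (unsigned long long)count);
+   }
+ };
+ static histogram h;                  // the object used by incrementor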
The findObject function is called many times with various inputs. Here is a possible
output we may observe after running the instrumented program:
// [tripCount][zoomCount]: occurences
[7][6]: 2
[7][5]: 6
[7][4]: 20
[7][3]: 156
[7][2]: 967
[7][1]: 3685
[7][0]: 251004
[6][5]: 2
[6][4]: 7
[6][3]: 39
[6][2]: 300
[6][1]: 1235
[6][0]: 91731
[5][4]: 9
[5][3]: 32
[5][2]: 160
[5][1]: 764
[5][0]: 34142
...
The first number in the square bracket is the trip count of the loop, and the second is
the number of zoomIns we made within the same loop. The number after the colon
sign is the number of occurrences of that particular combination of the numbers.
For example, two times we observed 7 loop iterations and 6 zoomIns. 251004 times
the loop ran 7 iterations and no zoomIns, and so on. You can then plot the data
for better visualization, or employ some other statistical methods, but the main
point we can make is that zoomIns are not frequent. The total number of calls to
findObject is approximately 400k; we can calculate it by summing up all the buckets
in the histogram. If we sum up all the buckets with a non-zero zoomCount, we get
approximately 10k; this is the number of times the zoomIn function was called. So,
for every zoomIn call, we make 40 calls of the findObject function.
Later chapters of this book contain many examples of how such information can be
used for optimizations. In our case, we conclude that findObj often fails to find the
object. It means that the next iteration of the loop will try to find the object using
new coordinates but still within the same search area. Knowing that, we could attempt
a number of optimizations: 1) run multiple searches in parallel, and synchronize if
any of them succeeded; 2) precompute certain things for the current search region,
thus eliminating repetitive work inside findObj; 3) write a software pipeline that
calls getNewCoords to generate the next set of required coordinates and prefetch the
corresponding map locations from memory. Part 2 of this book looks more deeply into
some of these techniques.
Code instrumentation provides very detailed information when you need specific
knowledge about the execution of a program. It allows us to track any information
about every variable in a program. Using such a method often yields the best
insight when optimizing big pieces of code because you can use a top-down approach
(instrumenting the main function and then drilling down to its callees) to better
understand the behavior of an application. Code instrumentation enables developers
to observe the architecture and flow of an application. This technique is especially
helpful for someone working with an unfamiliar codebase.
The code instrumentation technique is heavily used in performance analysis of real-time
scenarios, such as video games and embedded development. Some profilers combine
instrumentation with other techniques such as tracing or sampling. We will look at
one such hybrid profiler called Tracy in Section 7.7.
While code instrumentation is powerful in many cases, it does not provide any
information about how code executes from the OS or CPU perspective. For example,
it can’t give you information about how often the process was scheduled in and out of
execution (known by the OS) or how many branch mispredictions occurred (known by
the CPU). Instrumented code is a part of an application and has the same privileges
as the application itself. It runs in userspace and doesn’t have access to the kernel.
A more important downside of this technique is that every time something new needs to be instrumented, the code has to be modified, recompiled, and rerun.
All of the above increases the time between experiments and consumes more de-
velopment time, which is why engineers don’t manually instrument their code very
often these days. However, automated code instrumentation is still widely used by
compilers. Compilers are capable of automatically instrumenting an entire program
(except third-party libraries) to collect interesting statistics about the execution. The
most widely known use cases for automated instrumentation are code coverage analysis
and Profile-Guided Optimization (see Section 11.7).
75 Pin - https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool
76 Intel SDE - https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html
77 DynamoRIO - https://github.com/DynamoRIO/dynamorio. It supports Linux and Windows
5.2 Tracing
Tracing is conceptually very similar to instrumentation, yet slightly different. Code
instrumentation assumes that the user has full access to the source code of their
application. On the other hand, tracing relies on the existing instrumentation. For
example, the strace tool enables us to trace system calls and can be thought of as
instrumentation of the Linux kernel. Intel Processor Traces (Intel PT, see Appendix
C) enable you to log instructions executed by a processor and can be thought of
as instrumentation of a CPU. Traces can be obtained from components that were
appropriately instrumented in advance and are not subject to change. Tracing is often
used as a black-box approach, where a user cannot modify the code of an application,
yet they want to get insights into what the program is doing.
An example of tracing system calls with the Linux strace tool is provided in Listing 5.3,
which shows the first several lines of output when running the git status command.
By tracing system calls with strace it’s possible to know the timestamp for each
system call (the leftmost column), its exit status (after the = sign), and the duration
of each system call (in angle brackets).
The overhead of tracing depends on what exactly we try to trace. For example, if
we trace a program that rarely makes system calls, the overhead of running it under
strace will be close to zero. On the other hand, if we trace a program that heavily
relies on system calls, the overhead could be very large, e.g. 100x.78 Also, tracing can
generate a massive amount of data since it doesn’t skip any sample. To compensate
for this, tracing tools provide filters that enable you to restrict data collection to a
specific time slice or for a specific section of code.
wow-much-syscall.html
Tracing also helps when a program suddenly becomes unresponsive. For example, with Intel PT, you can reconstruct the control flow of the
program and know exactly what instructions were executed.
Tracing is also very useful for debugging. Its underlying nature enables “record and
replay” use cases based on recorded traces. One such tool is the Mozilla rr79 debugger,
which performs record and replay of processes, supports backward single stepping,
and much more. Most tracing tools are capable of decorating events with timestamps,
which enables us to find correlations with external events that were happening during
that time. That is, when we observe a glitch in a program, we can take a look at the
traces of our application and correlate this glitch with what was happening in the
whole system during that time.
5.3 Collecting Performance Monitoring Events
The steps outlined in Figure 5.1 roughly represent what a typical analysis tool will do
to count performance events. A similar process is implemented in the perf stat tool,
which can be used to count various hardware events, like the number of instructions,
cycles, cache misses, etc. Below is an example of the output from perf stat:
$ perf stat -- ./my_program.exe
10580290629 cycles # 3,677 GHz
8067576938 instructions # 0,76 insn per cycle
3005772086 branches # 1044,472 M/sec
239298395 branch-misses # 7,96% of all branches
This data may become quite handy. First of all, it enables us to quickly spot some
anomalies, such as a high branch misprediction rate or low IPC. In addition, it might
come in handy when you’ve made a code change and you want to verify that the
change has improved performance. Looking at relevant events might help you justify or
reject the code change. The perf stat utility can be used as a lightweight benchmark
wrapper. It may serve as a first step in performance investigation. Sometimes anomalies
can be spotted right away, which can save you some analysis time.
A full list of available event names can be viewed with perf list:
$ perf list
cycles [Hardware event]
ref-cycles [Hardware event]
instructions [Hardware event]
branches [Hardware event]
branch-misses [Hardware event]
...
cache:
mem_load_retired.l1_hit
mem_load_retired.l1_miss
...
Modern CPUs have hundreds of observable performance events. It’s very hard to
remember all of them and their meanings. Understanding when to use a particular
event is even harder. That is why generally, I don’t recommend manually collecting a
specific event unless you really know what you are doing. Instead, I recommend using
tools like Intel VTune Profiler that automatically collect required events to calculate
various metrics.
Performance events are not available in every environment since accessing PMCs
requires root access, which applications running in a virtualized environment typically
do not have. For programs executing in a public cloud, running a PMU-based profiler
directly in a guest container does not result in useful output if a virtual machine
(VM) manager does not expose the PMU programming interfaces properly to a guest.
Thus profilers based on CPU performance monitoring counters do not work well
in a virtualized and cloud environment [Du et al., 2010], although the situation is
improving. VMware® was one of the first VM managers to enable80 virtual Performance
Monitoring Counters (vPMC). The AWS EC2 cloud has also enabled81 PMCs for
dedicated hosts.
counters-vpmcs/
81 Amazon EC2 PMCs - http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html
If you need to collect more events than the number of available PMCs, the analysis tool
uses time multiplexing to give each event a chance to access the monitoring hardware.
Figure 5.2a shows an example of multiplexing between 8 performance events with only
4 counters available.
Figure 5.2: Multiplexing between 8 performance events with only 4 PMCs available.
With multiplexing, an event is not measured all the time, but rather only during a
portion of time. At the end of the run, a profiling tool needs to scale the raw count
based on the total time enabled:
final count = raw count × (time running / time enabled)
Let’s take Figure 5.2b as an example. Say, during profiling, we were able to measure
an event from group 1 during three time intervals. Each measurement interval lasted
100ms (time enabled). The program running time was 500ms (time running). The
total number of events for this counter was measured as 10,000 (raw count). So, the
final count needs to be scaled as follows:
final count = 10,000 × (500ms / (100ms × 3)) = 16,666
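The same scaling, expressed as a small helper function (a sketch; profiling tools implement this internally). It uses the terms as defined above: the event was actually measured for measuredTime out of a total runningTime of program execution.

#include <cstdint>

uint64_t scaleCount(uint64_t rawCount, uint64_t runningTime, uint64_t measuredTime) {
  if (measuredTime == 0)
    return 0;   // the event never got scheduled onto a counter
  return (uint64_t)((double)rawCount * runningTime / measuredTime);
}
// scaleCount(10000, 500 /*ms*/, 300 /*ms*/) returns 16666, matching the example above.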
This provides an estimate of what the count would have been had the event been
measured during the entire run. It is very important to understand that this is still an
estimate, not an actual count. Multiplexing and scaling can be used safely on steady
workloads that execute the same code during long time intervals. However, if the
program regularly jumps between different hotspots, i.e., has different phases, there
will be blind spots that can introduce errors during scaling. To avoid scaling, you can
reduce the number of events to no more than the number of physical PMCs available.
However, you’ll have to run the benchmark multiple times to measure all the events.
Remember that our instrumentation measures the per-pixel ray tracing stats. Multi-
plying average numbers by the number of pixels (1024x768) should give us roughly the
total stats for the program. A good sanity check in this case is to run perf stat and
compare the overall C-Ray statistics for the performance events that we’ve collected.
The C-ray benchmark primarily stresses the floating-point performance of a CPU
core, which generally should not cause high variance in the measurements; in other
words, we expect all the measurements to be very close to each other. However, we see
that it’s not the case, as p90 values are 1.33x average numbers and max is 5x slower
than the average case. The most likely explanation here is that for some pixels the
algorithm hits a corner case, executes more instructions, and subsequently runs longer.
But it’s always good to confirm a hypothesis by studying the source code or extending
the instrumentation to capture more data for the “slow” pixels.
The additional instrumentation code in our example causes 17% overhead, which is
OK for local experiments, but quite high to run in production. Most large distributed
systems aim for less than 1% overhead, and for some up to 5% can be tolerable, but
it’s unlikely that users would be happy with a 17% slowdown. Managing the overhead
of your instrumentation is critical, especially if you choose to enable it in a production
environment.
Overhead is usefully calculated as the occurrence rate per unit of time or work (RPC,
database query, loop iteration, etc.). If a read system call on our system takes roughly
1.6 microseconds of CPU time, and we call it twice for each pixel (iteration of the
outer loop), the overhead is 3.2 microseconds of CPU time per pixel.
There are many strategies to bring the overhead down. As a general rule, your
instrumentation should always have a fixed cost, e.g., a deterministic syscall, but
not a list traversal or dynamic memory allocation. It will otherwise interfere with
the measurements. The instrumentation code has three logical parts: collecting the
information, storing it, and reporting it. To lower the overhead of the first part
(collection), we can decrease the sampling rate, e.g., sample each 10th RPC and
skip the rest. For a long-running application, performance can be monitored with a
relatively cheap random sampling, i.e., randomly select which RPCs to monitor for
each sample. These methods sacrifice collection accuracy but still provide a good
estimate of the overall performance characteristics while incurring a very low overhead.
For the second and third parts (storing and aggregating), the recommendation is
to collect, process, and retain only as much data as you need to understand the
performance of the system. You can avoid storing every sample in memory by using
“online” algorithms for calculating mean, variance, min, max, and other metrics. This
will drastically reduce the memory footprint of the instrumentation. For instance,
variance and standard deviation can be calculated using Knuth’s online-variance
algorithm. A good implementation84 uses less than 50 bytes of memory.
For long routines, you can collect counters at the beginning and end, and some parts
in the middle. Over consecutive runs, you can binary search for the part of the routine
that performs most poorly and optimize it. Repeat this until all the poorly performing
spots are removed. If tail latency is of primary concern, emitting log messages on a
particularly slow run can provide useful insights.
84 Accurately computing running variance - https://www.johndcook.com/blog/standard_deviation/
In our example, we collected 4 events simultaneously, though the CPU has 6 pro-
grammable counters. You can open up additional groups with different sets of events
enabled. The kernel will select different groups to run at a time. The time_enabled
and time_running fields indicate the multiplexing. They both indicate duration in
nanoseconds. The time_enabled field indicates how many nanoseconds the event
group has been enabled. The time_running field indicates how much of that enabled
time the events have been collected. If you had two event groups enabled simultane-
ously that couldn’t fit together on the hardware counters, you might see the running
time for both groups converge to time_running = 0.5 * time_enabled.
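For readers who want to see the mechanics, below is a bare-bones sketch of opening such a group with the Linux perf_event_open system call and reading the counter values together with time_enabled and time_running. Error handling is omitted, and a real tool does much more (and may require relaxed perf_event_paranoid settings).

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int openEvent(uint64_t config, int groupFd) {
  perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = config;
  attr.disabled = (groupFd == -1);   // only the group leader starts disabled
  attr.exclude_kernel = 1;
  attr.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED |
                     PERF_FORMAT_TOTAL_TIME_RUNNING;
  return (int)syscall(SYS_perf_event_open, &attr, 0 /*this thread*/,
                      -1 /*any cpu*/, groupFd, 0);
}

int main() {
  int leader = openEvent(PERF_COUNT_HW_CPU_CYCLES, -1);
  openEvent(PERF_COUNT_HW_INSTRUCTIONS, leader);   // second event, same group

  ioctl(leader, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
  ioctl(leader, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
  // ... run the code under measurement here ...
  ioctl(leader, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

  // With the read_format above, the kernel returns:
  // { u64 nr; u64 time_enabled; u64 time_running; u64 value[nr]; }
  uint64_t buf[5] = {0};
  if (read(leader, buf, sizeof(buf)) > 0)
    printf("enabled %llu ns, running %llu ns, cycles %llu, instructions %llu\n",
           (unsigned long long)buf[1], (unsigned long long)buf[2],
           (unsigned long long)buf[3], (unsigned long long)buf[4]);
  return 0;
}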
Capturing multiple events simultaneously makes it possible to calculate various met-
rics that we discussed in Chapter 4. For example, capturing INSTRUCTIONS_RETIRED
and UNHALTED_CLOCK_CYCLES enables us to measure IPC. We can observe the ef-
fects of frequency scaling by comparing CPU cycles (UNHALTED_CORE_CYCLES) with
the fixed-frequency reference clock (UNHALTED_REFERENCE_CYCLES). It is possible
to detect when the thread wasn’t running by requesting CPU cycles consumed
(UNHALTED_CORE_CYCLES, only counts when the thread is running) and comparing it
against wall-clock time. Also, we can normalize the numbers to get the event rate per
second/clock/instruction. For instance, by measuring MEM_LOAD_RETIRED.L3_MISS
and INSTRUCTIONS_RETIRED we can get the L3MPKI metric. As you can see, this gives
a lot of flexibility.
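Written out as formulas, using the event names mentioned above:
IPC = INSTRUCTIONS_RETIRED / UNHALTED_CLOCK_CYCLES
L3 MPKI = 1000 * MEM_LOAD_RETIRED.L3_MISS / INSTRUCTIONS_RETIRED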
The important property of grouping events is that the counters will be available
atomically under the same read system call. These atomic bundles are very useful.
First, they allow us to correlate events within each group. For example, let's assume
we measure IPC for a region of code and find that it is very low. In this case, we
can pair two events (instructions and cycles) with a third one, say L3 cache misses, to
check if this event contributes to the low IPC that we’re dealing with. If it doesn’t,
we can continue factor analysis using other events. Second, event grouping helps to
mitigate bias in case a workload has different phases. Since all the events within a
group are measured at the same time, they always capture the same phase.
In some scenarios, instrumentation may become part of a product feature.
For example, a developer can implement instrumentation logic that detects a
decrease in IPC (e.g., when a busy sibling hardware thread is running) or
a drop in CPU frequency (e.g., system throttling due to heavy load). When such an
event occurs, the application automatically defers low-priority work to compensate for
the temporarily increased load.
5.4 Sampling
Sampling is the most frequently used approach for doing performance analysis. People
usually associate it with finding hotspots in a program. To put it more broadly,
sampling helps to find places in the code that contribute to the highest number
of certain performance events. If we want to find hotspots, the problem can be
reformulated as: “find a place in the code that consumes the biggest number of CPU
cycles”. People often use the term profiling for what is technically called sampling.
According to Wikipedia,85 profiling is a much broader term and includes a wide variety of data collection techniques.
85 Profiling(wikipedia) - https://en.wikipedia.org/wiki/Profiling_(computer_programming).
For example, to find the places where the program experiences the biggest number of
L3-cache misses, we would sample on the corresponding event, i.e., MEM_LOAD_RETIRED.L3_MISS.
After we have initialized the register, we start counting and let the benchmark run.
Since we have configured a PMC to count cycles, it will be incremented every cycle.
Eventually, it will overflow. At the time the register overflows, the hardware will raise
a PMI. The profiling tool is configured to capture PMIs and has an Interrupt Service
Routine (ISR) for handling them. We do multiple steps inside the ISR: first of all, we
disable counting; after that, we record the instruction that was executed by the CPU
at the time the counter overflowed; then, we reset the counter to N and resume the
benchmark.
Now, let us go back to the value N. Using this value, we can control how frequently we
want to get a new interrupt. Say we want a finer granularity and have one sample every
1 million cycles. To achieve this, we can set the counter to (unsigned) -1,000,000
so that it will overflow after every 1 million cycles. This value is also referred to as
the sample after value.
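For reference, this is roughly how a profiler configures such sampling through the Linux perf_event_open interface; the kernel translates sample_period into the counter reload value described above (a sketch, error handling omitted):
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int open_cycles_sampler(pid_t pid) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_CPU_CYCLES;
  attr.sample_period = 1000000;       // the "sample after" value N: one PMI per 1M cycles
  attr.sample_type = PERF_SAMPLE_IP;  // record the instruction pointer with each sample
  attr.disabled = 1;                  // start disabled; enable later with an ioctl
  attr.exclude_kernel = 1;
  // monitor 'pid' on any CPU, no group leader, no flags
  return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}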
We repeat the process many times to build a sufficient collection of samples. If we
later aggregate those samples, we could build a histogram of the hottest places in our
program, like the one shown in the output from Linux perf record/report below.
This gives us the breakdown of the overhead for functions of a program sorted in
descending order (hotspots). An example of sampling the x264 benchmark86 from the
Phoronix test suite87 is shown below:
$ time -p perf record -F 1000 -- ./x264 -o /dev/null --slow --threads 1
../Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m
[ perf record: Captured and wrote 1.625 MB perf.data (35035 samples) ]
real 36.20 sec
$ perf report -n --stdio
# Samples: 35K of event 'cpu_core/cycles/'
# Event count (approx.): 156756064947
Linux perf collected 35,035 samples, which means that there were the same number
of processor interrupts (PMIs). We also used -F 1000, which sets the sampling rate to 1000
samples per second. This roughly matches the overall runtime of 36.2 seconds. Notice
that Linux perf provided the approximate number of total cycles elapsed. If we divide
it by the number of samples, we’ll have 156756064947 cycles / 35035 samples =
4.5 million cycles per sample. That means that Linux perf set the number N to
roughly 4500000 to collect 1000 samples per second. The number N can be adjusted
by Linux perf dynamically according to the actual CPU frequency.
And of course, most valuable for us is the list of hotspots sorted by the number of
samples attributed to each function. Once we know which functions are the hottest,
we may want to look one level deeper: what are the hot parts of code inside every
function? To see the profiling data for functions that were inlined as well as assembly
code generated for a particular source code region, we need to build the application
with debug information (-g compiler flag).
Linux perf doesn’t have rich graphic support, so viewing hot parts of source code is
not very convenient, but doable. Linux perf intermixes source code with the generated
assembly, as shown below:
# snippet of annotating source code of 'x264_8_me_search_ref' function
$ perf annotate x264_8_me_search_ref --stdio
Percent | Source code & Disassembly of x264 for cycles:ppp
----------------------------------------------------------
...
: bmx += square1[bcost&15][0]; <== source code
1.43 : 4eb10d: movsx ecx,BYTE PTR [r8+rdx*2] <== corresponding machine code
: bmy += square1[bcost&15][1];
0.36 : 4eb112: movsx r12d,BYTE PTR [r8+rdx*2+0x1]
: bmx += square1[bcost&15][0];
0.63 : 4eb118: add DWORD PTR [rsp+0x38],ecx
: bmy += square1[bcost&15][1];
...
Most profilers with a Graphical User Interface (GUI), like Intel VTune Profiler, can
show source code and associated assembly side-by-side. Also, there are tools that can
visualize the output of Linux perf raw data with a rich graphical interface similar to
Intel VTune and other tools. You’ll see all that in more detail in Chapter 7.
Sampling gives a good statistical representation of a program's execution; however,
one of the downsides of this technique is that it has blind spots and is not suitable
for detecting abnormal behaviors. Each sample represents an aggregated view of a
portion of the program's execution.
Figure 5.4: Control Flow Graph: hot function “foo” has multiple callers.
Analyzing the source code of all the callers of foo might be very time-consuming. We
want to focus only on those callers that caused foo to appear as a hotspot. In other
words, we want to figure out the hottest path in the CFG of a program. Profiling tools
achieve this by capturing the call stack of the process along with other information at
the time of collecting performance samples. Then, all collected stacks are grouped,
allowing us to see the hottest path that led to a particular function.
Collecting call stacks in Linux perf is possible with three methods:
1. Frame pointers (perf record --call-graph fp). It requires that the binary
be built with -fno-omit-frame-pointer. Historically, the frame pointer (RBP
register) was used for debugging since it enables us to get the call stack without
popping all the arguments from the stack (also known as stack unwinding).
The frame pointer gives the return address immediately, which makes stack
unwinding very cheap and reduces profiling overhead; however, it consumes
one additional register just for this purpose. At the time when the number of
architectural registers was small, using frame pointers was expensive in terms of
runtime performance. Nowadays, the Linux community is moving back to using
frame pointers because they provide better-quality call stacks at a low profiling
overhead.
2. DWARF debug info (perf record --call-graph dwarf). It requires that the
binary be built with DWARF debug information (-g). It also obtains call stacks
through the stack unwinding procedure, but this method is more expensive than
using frame pointers.
3. Intel Last Branch Record (LBR). This method makes use of a hardware feature,
and is accessed with the following command: perf record --call-graph lbr.
It obtains call stacks by parsing the LBR stack (a set of hardware registers). The
resulting call graph is not as deep as those produced by the first two methods.
See more information about the LBR call-stack mode in Section 6.2.
Below is an example of collecting call stacks in a program using LBR. By looking at
the output, we know that 55% of the time foo was called from func1, 33% of the time
from func2, and 11% from func3. We can clearly see the distribution of the overhead
between callers of foo and can now focus our attention on the hottest edge in the CFG
of the program, which is func1 → foo, but we should probably also pay attention to
the edge func2 → foo.
$ perf record --call-graph lbr -- ./a.out
$ perf report -n --stdio --no-children
# Samples: 65K of event 'cycles:ppp'
# Event count (approx.): 61363317007
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................ ......................
99.96% 65217 a.out a.out [.] foo
|
--99.96%--foo
|
|--55.52%--func1
| main
| __libc_start_main
| _start
|
|--33.32%--func2
| main
| __libc_start_main
| _start
|
--11.12%--func3
main
__libc_start_main
_start
When using Intel VTune Profiler, you can collect call stack data by checking the cor-
responding “Collect stacks” box when configuring an analysis. When using the command-
line interface, specify the -knob enable-stack-collection=true option.
5.5 The Roofline Performance Model
Figure 5.5: The Roofline Performance Model. The maximum performance of an application
is limited by the minimum between peak FLOPS (horizontal line) and the platform
bandwidth multiplied by arithmetic intensity (diagonal line).
Hardware has two main limitations: how fast it can make calculations (peak compute
performance, FLOPS) and how fast it can move the data (peak memory bandwidth,
GB/s). The maximum performance of an application is limited by the minimum
between peak FLOPS (horizontal line) and the platform bandwidth multiplied by
arithmetic intensity (diagonal line). The roofline chart in Figure 5.5 plots the perfor-
mance of two applications A and B against hardware limitations. Application A has
lower arithmetic intensity and its performance is bound by the memory bandwidth,
while application B is more compute intensive and doesn’t suffer as much from memory
bottlenecks. Similar to this, A and B could represent two different functions within a
program and have different performance characteristics. The Roofline performance
model accounts for that and can display multiple functions and loops of an application
on the same chart. However, keep in mind that the Roofline performance model is
mainly applicable for HPC applications that have few compute-intensive loops. I
do not recommend using it for general-purpose applications, such as compilers, web
browsers, or databases.
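In formula form, for a kernel with arithmetic intensity I (FLOPs per byte of memory traffic):
Attainable FLOPS = min(Peak FLOPS, I * Peak Memory Bandwidth)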
88 The Roofline performance model is not only applicable to floating-point calculations but can be
also used for integer operations. However, the majority of HPC applications involve floating-point
calculations, thus the Roofline model is mostly used with FLOPs.
Figure 5.6: Roofline analysis of a program and potential ways to improve its performance.
The maximum memory bandwidth of Intel NUC Kit NUC8i5BEH, which I used for
experiments, can be calculated as shown below. Remember, that DDR technology
allows transfers of 64 bits or 8 bytes per memory access.
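As a sketch of that calculation, assuming the machine is populated with DDR4-2400 modules (an assumption; substitute the data rate of your own memory):
Peak Memory Bandwidth = 2400 MT/s * 8 bytes/transfer = 19.2 GB/s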
Automated tools like Empirical Roofline Tool89 and Intel Advisor90 are capable of em-
pirically determining theoretical maximums by running a set of prepared benchmarks.
If a calculation can reuse the data in the cache, much higher FLOP rates are possible.
Roofline can account for that by introducing a dedicated roofline for each level of the
memory hierarchy (see Figure 5.7).
After hardware limitations are determined, we can start assessing the performance
of an application against the roofline. Intel Advisor automatically builds a Roofline
chart and provides hints for performance optimization of a given loop. An example of
a Roofline chart generated by Intel Advisor is presented in Figure 5.7. Notice, that
Roofline charts have logarithmic scales.
89 Empirical Roofline Tool - https://bitbucket.org/berkeleylab/cs-roofline-toolkit/src/master/.
90 Intel Advisor - https://software.intel.com/content/www/us/en/develop/tools/advisor.html.
Figure 5.7: Roofline analysis for “before” and “after” versions of matrix multiplication on
Intel NUC Kit NUC8i5BEH with 8GB RAM using the Clang 10 compiler.
5.6 Static Performance Analysis
First, static analysis tools do not analyze source code directly: at the source level, we
don't know the machine code to which it will be compiled. So, static performance
analysis works on assembly code.
Second, static analysis tools simulate the workload instead of executing it. It is
obviously very slow, so it’s not possible to statically analyze the entire program.
Instead, tools take a snippet of assembly code and try to predict how it will behave
on real hardware. The user should pick specific assembly instructions (usually a small
loop) for analysis. So, the scope of static performance analysis is very narrow.
The output of static performance analyzers is fairly low-level and often breaks execution
down to CPU cycles. Usually, developers use it for fine-grained tuning of a critical
code region in which every CPU cycle matters.
Let’s look at the code in Listing 5.5. I intentionally try to make the examples as
simple as possible, though real-world code is of course usually more complicated
than this. The code scales every element of array a by the floating-point value B and
accumulates products into sum. On the right, I present the machine code for the loop
generated by Clang-16 when compiled with -O3 -ffast-math -march=core-avx2.
This is a reduction loop, i.e., we need to sum up all the products and in the end, return
a single float value. The way this code is written, there is a loop-carry dependency
over sum. You cannot overwrite sum until you accumulate the previous product. A
smart way to parallelize this is to have multiple accumulators and roll them up in the
end. So, instead of a single sum, we could have sum1 to accumulate results from even
iterations and sum2 from odd iterations.
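Listing 5.5 itself is not reproduced in this excerpt; a sketch of the idea just described (variable names are illustrative, and the compiler-generated version discussed below uses vector registers and four accumulators instead of two):
// Original: a single accumulator creates a loop-carried dependency over sum.
float sum = 0.f;
for (int i = 0; i < N; i++)
  sum += a[i] * B;

// Two independent accumulators break the chain in half; combine them at the end.
float sum1 = 0.f, sum2 = 0.f;
int i = 0;
for (; i + 1 < N; i += 2) {
  sum1 += a[i]     * B;       // even iterations
  sum2 += a[i + 1] * B;       // odd iterations
}
if (i < N) sum1 += a[i] * B;  // leftover element when N is odd
float result = sum1 + sum2;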
This is what Clang-16 has done: it has 4 vectors (ymm2-ymm5) each holding 8 floating-
point accumulators, plus it used FMA to fuse multiplication and addition into a single
instruction. The constant B is broadcast into the ymm1 register. The -ffast-math
option allows a compiler to reassociate floating-point operations; we will discuss how
this option can aid optimizations in Section 9.4.96
The code looks good, but is it optimal? Let’s find out. We took the assembly snippet
from Listing 5.5 to UICA and ran simulations. At the time of writing, Alder Lake
(Intel’s 12th gen, based on Golden Cove) is not supported by UICA, so we ran it on
the latest available, which is Rocket Lake (Intel’s 11th gen, based on Sunny Cove).
Although the architectures differ, the issue exposed by this experiment is equally
visible in both. The result of the simulation is shown in Figure 5.8. This is a pipeline
diagram similar to what we have shown in Chapter 3. We skipped the first two
iterations, and show only iterations 2 and 3 (leftmost column “It.”). This is when the
execution reaches a steady state, and all further iterations look very similar.
UICA is a very simplified model of the actual CPU pipeline. For example, you may
notice that the instruction fetch and decode stages are missing. Also, UICA doesn’t
account for cache misses and branch mispredictions, so it assumes that all memory
accesses always hit in the L1 cache and branches are always predicted correctly,
which we know is not the case in modern processors. Still, these simplifications do not
invalidate our experiment, as we can use the simulation results to find a way to improve the
code.
96 By the way, it would be possible to sum all the elements in a and then multiply it by B once
outside of the loop. This is an oversight by the programmer, but hopefully, compilers will be able to
handle it in the future.
Figure 5.8: UICA pipeline diagram. I = issued, r = ready for dispatch, D = dispatched, E =
executed, R = retired.
Can you see the performance issue? Let's examine the diagram. First of all, every
FMA instruction is broken into two µops (see 1 in Figure 5.8): a load µop that goes
to ports {2,3} and an FMA µop that can go to ports {0,1}. The load µop has a
latency of 5 cycles: it starts at cycle 7 and finishes at cycle 11. The FMA µop has
a latency of 4 cycles: it starts at cycle 15 and finishes at cycle 18. All FMA µops
depend on load µops, as we can see in the diagram: FMA µops always start after the
corresponding load µop finishes. Now find the two r cells at cycle 6: they are ready to be
dispatched, but Rocket Lake has only two load ports, and both are already occupied
in that cycle. So, these two loads are dispatched in the next cycle.
The loop has four cross-iteration dependencies over ymm2-ymm5. The FMA µop from
instruction 2 that writes into ymm2 cannot start execution before instruction 1
from the previous iteration finishes. Notice that the FMA µop from instruction 2
was dispatched in the same cycle 18 as instruction 1 finished its execution. There is
a data dependency between instruction 1 and instruction 2 . You can observe this
pattern for other FMA instructions as well.
So, “What is the problem?”, you ask. Look at the top right corner of the image.
For each cycle, we added the number of executed FMA µops (this is not printed by
UICA). It goes like 1,2,1,0,1,2,1,..., or an average of one FMA µop per cycle.
Most recent Intel processors have two FMA execution units and thus can execute
two FMA µops per cycle, so we utilize only half of the available FMA execution
throughput. The diagram clearly shows the gap: every fourth cycle, no FMAs are
executed. As we figured out before, no FMA µops can be dispatched because
their inputs (ymm2-ymm5) are not ready.
To increase the utilization of the FMA execution units from 50% to 100%, we need to
break the cross-iteration dependencies further, e.g., by unrolling the loop more and
using twice as many accumulators.
It’s worth mentioning that you will not always see a 2x speedup in practice. This
can be achieved only in an idealized environment like UICA or nanoBench. In a
real application, even though you maximized the execution throughput of FMA, the
gains may be diminished by cache misses and other pipeline hazards. When
that happens, the effect of cache misses outweighs the effect of suboptimal FMA port
utilization, which could easily result in a much more disappointing 5% speedup. But
don’t worry; you’ve still done the right thing.
As a closing thought, let us remind you that UICA and other static performance
analyzers are not suitable for analyzing large portions of code, but they are great for
exploring microarchitectural effects. Also, they help you to build up a mental model
of how a CPU works. Another very important use case for UICA is to find critical
dependency chains in a loop as described in a post97 on the Easyperf blog.
5.7 Compiler Optimization Reports
To emit an optimization report in the Clang compiler, you need to use -Rpass* flags:
$ clang -O3 -Rpass-analysis=.* -Rpass=.* -Rpass-missed=.* a.c -c
a.c:5:3: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
for (unsigned i = 1; i < N; i++) {
^
a.c:5:3: remark: unrolled loop by a factor of 8 with run-time trip count
[-Rpass=loop-unroll]
for (unsigned i = 1; i < N; i++) {
^
By checking the optimization report above, we could see that the loop was not
vectorized, but it was unrolled instead. It’s not always easy for a developer to
recognize the existence of a loop-carry dependency in the loop on line 6 in Listing 5.6.
The value that is loaded by c[i-1] depends on the store from the previous iteration
(see operations 2 and 3 in Figure 5.9). The dependency can be revealed by manually
unrolling the first few iterations of the loop:
// iteration 1
a[1] = c[0];
c[1] = b[1]; // writing the value to c[1]
// iteration 2
a[2] = c[1]; // reading the value of c[1]
c[2] = b[2];
...
If we were to vectorize the code in Listing 5.6, it would result in the wrong values
written in the array a. Assuming a CPU SIMD unit can process four floats at a time,
we would get the code that can be expressed with the following pseudocode:
// iteration 1
a[1..4] = c[0..3]; // oops!, a[2..4] get wrong values
c[1..4] = b[1..4];
...
The code in Listing 5.6 cannot be vectorized because the order of operations inside
the loop matters. This example can be fixed by swapping lines 6 and 7 as shown
in Listing 5.7. This does not change the semantics of the code, so it’s a perfectly
legal change. Alternatively, the code can be improved by splitting the loop into two
separate loops. Doing this would double the loop overhead, but this drawback would
be outweighed by the performance improvement gained through vectorization.
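For reference, the loop implied by the unrolled iterations above, and its fixed variant, look roughly as follows (a reconstruction; the actual Listings 5.6 and 5.7 may differ in details):
// Listing 5.6 (reconstructed): the load of c[i-1] depends on the store
// made by the previous iteration, so the loop is not vectorizable as is.
for (unsigned i = 1; i < N; i++) {
  a[i] = c[i-1];   // line 6
  c[i] = b[i];     // line 7
}

// Listing 5.7 (reconstructed): the two statements are swapped. Each iteration
// still reads the c[i-1] value produced by the previous iteration, so the
// semantics are unchanged, but the loop can now be vectorized safely.
for (unsigned i = 1; i < N; i++) {
  c[i] = b[i];
  a[i] = c[i-1];
}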
In the optimization report, we can now see that the loop was vectorized successfully:
$ clang -O3 -Rpass-analysis=.* -Rpass=.* -Rpass-missed=.* a.c -c
a.cpp:5:3: remark: vectorized loop (vectorization width: 8, interleaved count: 4)
[-Rpass=loop-vectorize]
for (unsigned i = 1; i < N; i++) {
^
This was just one example of using optimization reports; we will provide more
examples in Section 9.4.2, where we discuss how to discover vectorization opportunities.
Compiler optimization reports can help you find missed optimization opportunities, and
understand why those opportunities were missed. In addition, compiler optimization
reports are useful for testing a hypothesis. Compilers often decide whether a certain
transformation will be beneficial based on their cost model analysis. But compilers
don’t always make the optimal choice. Once you detect a key missing optimization in
the report, you can attempt to rectify it by changing the source code or by providing
hints to the compiler in the form of a #pragma, an attribute, a compiler built-in, etc.
As always, verify your hypothesis by measuring it in a practical environment.
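For example, with Clang, a loop pragma can nudge the vectorizer when its cost model is too conservative (a generic illustration, not tied to the listings above; the pragma does not override the compiler's correctness checks):
#pragma clang loop vectorize(enable) interleave(enable)
for (unsigned i = 0; i < N; i++)
  out[i] = in[i] * 2.0f;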
Compiler reports can be quite large, and a separate report is generated for each
source-code file. Sometimes, finding relevant records in the output file can become a
challenge. We should mention that initially these reports were explicitly designed to
be used by compiler writers to improve optimization passes. Over the years, several
tools have been developed to make optimization reports more accessible and actionable
by application developers, most notably opt-viewer99 and optview2.100 Also, the
Compiler Explorer website has the “Optimization Output” tool for LLVM-based
compilers that reports performed transformations when you hover your mouse over
the corresponding line of source code. All of these tools help visualize successful and
failed code transformations by LLVM-based compilers.
In the Link-Time Optimization (LTO)101 mode, some optimizations are made during
the linking stage. To emit compiler reports from both the compilation and linking
stages, you should pass dedicated options to both the compiler and the linker. See the
LLVM “Remarks” guide102 for more information.
A slightly different way of reporting missing optimizations is taken by the Intel®
ISPC103 compiler (discussed in Section 9.4.2.5). It issues warnings for code constructs
that compile to relatively inefficient code. Either way, compiler optimization reports
should be one of the key tools in your toolbox. They are a fast way to check what
optimizations were done for a particular hotspot and see if some important ones
failed. I have found many improvement opportunities thanks to compiler optimization
reports.
101 Link-Time Optimization (LTO) - https://en.wikipedia.org/wiki/Interprocedural_optimization
102 LLVM compiler remarks - https://llvm.org/docs/Remarks.html
103 ISPC - https://ispc.github.io/ispc.html
Questions and Exercises
• scenario 3: you’re evaluating three different compression algorithms and you want
to know what types of performance bottlenecks (memory latency, computations,
branch mispredictions, etc) each of them has.
• scenario 4: there is a new shiny library that claims to be faster than the one
you currently have integrated into your project; you’ve decided to compare their
performance.
• scenario 5: you were asked to analyze the performance of some unfamiliar code,
which involves a hot loop; you want to know how many iterations the loop is
doing.
2. Run the application that you’re working with daily. Practice doing performance
analysis using the approaches we discussed in this chapter. Collect raw counts
for various CPU performance events, find hotspots, collect roofline data, and
generate and study the compiler optimization report for the hot function(s) in
your program.
Chapter Summary
• Latency and throughput are often the ultimate metrics of the program perfor-
mance. When seeking ways to improve them, we need to get more detailed
information on how the application executes. Both hardware and software
provide data that can be used for performance monitoring.
• Code instrumentation enables us to track many things in a program but causes
relatively large overhead both on the development and runtime side. While
most developers are not in the habit of manually instrumenting their code,
this approach is still relevant for automated processes, e.g., Profile-Guided
Optimizations (PGO).
• Tracing is conceptually similar to instrumentation and is useful for exploring
anomalies in a system. Tracing enables us to catch the entire sequence of events
with timestamps attached to each event.
• Performance monitoring counters are a very important instrument of low-level
performance analysis. They are generally used in two modes: “Counting” or
“Sampling”. The counting mode is primarily used for calculating various perfor-
mance metrics.
• Sampling skips most of a program’s execution and takes just one sample that is
supposed to represent the entire interval. Despite this, sampling usually yields
precise enough distributions. The most well-known use case of sampling is
finding hotspots in a program. Sampling is the most popular analysis approach
since it doesn’t require recompilation of a program and has very little runtime
overhead.
• Generally, counting and sampling incur very low runtime overhead (usually
below 2%). Counting gets more expensive once you start multiplexing between
different events (5–15% overhead), while sampling gets more expensive with
increasing sampling frequency [Nowak & Bitzes, 2014].
• The Roofline Performance Model is a throughput-oriented performance model
that is heavily used in the High Performance Computing (HPC) world. It visualizes
the performance of an application against hardware limitations such as peak
compute performance and peak memory bandwidth.
6 CPU Features for Performance Analysis
• Branch Recording, discussed in Section 6.2. This is a facility that continuously
logs the most recent branch outcomes in parallel with executing the program. It is
used for collecting call stacks, identifying hot branches, calculating misprediction
rates of individual branches, and more.
• Hardware-Based Sampling, discussed in Section 6.3.1. This is a feature
that enhances sampling. Its primary benefits include: lowering the overhead of
sampling and providing “Precise Events” capability, that enables pinpointing of
the exact instruction that caused a particular performance event.
• Intel Processor Traces (PT), discussed in Appendix C. It is a facility to record
and reconstruct the program execution with a timestamp on every instruction.
Its main usages are postmortem analysis and root-causing performance glitches.
The Intel PT feature is covered in Appendix C. Intel PT was supposed to be an “end
game” for performance analysis. With its low runtime overhead, it is a very powerful
analysis feature. But it turns out to be not very popular among performance engineers,
partially because the support in the tools is not mature, and partially because in many
cases it is overkill and it's just easier to use a sampling profiler. Also, it produces
a lot of data, which is not practical for long-running workloads. Nevertheless, it is
popular in some industries, such as high-frequency trading (HFT).
The hardware performance monitoring features mentioned above provide insights
into the efficiency of a program from the CPU perspective. In the next chapter, we
will discuss how profiling tools use these features to provide many different types of
analysis.
6.1 Top-down Microarchitecture Analysis
Figure 6.1: The concept behind TMA’s top-level breakdown. © Source: [Yasin, 2014]
The exact formulas behind TMA metrics are complex and microarchitecture-specific,
so it’s better to rely on tools to do the analysis rather than trying to calculate them
yourself.
In the upcoming sections, we will discuss the TMA implementation in AMD, Arm,
and Intel processors.
Figure 6.2: The TMA hierarchy of performance bottlenecks. © Image by Ahmad Yasin.
Thanks to event multiplexing, profiling tools can usually collect all the data required
for TMA in a single benchmark run. However, this is only acceptable if the workload is
steady. Otherwise, you had better fall back to the original strategy of multiple runs,
drilling down further with each run.
The top two levels of TMA metrics are expressed as a percentage of all pipeline
slots (see Section 4.5) that were available during the execution of a program. This allows
TMA to give an accurate representation of CPU microarchitecture utilization, taking
into account the full bandwidth of the processor. At these top levels, everything should
sum up nicely to 100%. However, starting from Level 3, buckets may be expressed in
a different count domain, e.g., clocks and stalls. So they are not necessarily directly
comparable with other TMA buckets.
Once we identify the performance bottleneck, we need to know where exactly in the
code it is happening. The second step in TMA is to locate the source of the problem
down to the exact line of code and corresponding assembly instructions. The analysis
methodology provides performance events that you should use for each category of
the performance problem. Then you can sample on this event to find the line in the
source code that contributes to the performance bottleneck identified by the first stage.
Don’t worry if this process sounds complicated to you; everything will become clear
once you read through the case study.
Profiling tools typically automate this second step; under the hood, a designated performance event is
used to locate the issue. To understand the mechanics of TMA, we also present the
manual way to find an event associated with a particular performance bottleneck.
Correspondence between performance bottlenecks and performance events that should
be used for determining the location of bottlenecks in source code can be found
with the help of the TMA metrics107 table. The Locate-with column denotes a
performance event that should be used to locate the exact place in the code where
the issue occurs. In our case, to find memory accesses that contribute to such a
high value of the DRAM_Bound metric (miss in the L3 cache), we should sample on
MEM_LOAD_RETIRED.L3_MISS_PS precise event. Here is the example command:
$ perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp --
./benchmark.exe
$ perf report -n --stdio
...
# Samples: 33K of event 'MEM_LOAD_RETIRED.L3_MISS'
# Event count (approx.): 71363893
# Overhead Samples Shared Object Symbol
# ........ ......... .............. .................
#
99.95% 33811 benchmark.exe [.] foo
0.03% 52 [kernel] [k] get_page_from_freelist
0.01% 3 [kernel] [k] free_pages_prepare
0.00% 1 [kernel] [k] free_pcppages_bulk
Almost all L3 misses are caused by memory accesses in function foo inside executable
benchmark.exe. Now it’s time to look at the source code of the benchmark, which
can be found on GitHub.108
To avoid compiler optimizations, function foo is implemented in assembly language,
which is presented in Listing 6.1. The “driver” portion of the benchmark is implemented
in the main function, as shown in Listing 6.2. We allocate a big enough array a to
make it not fit in the 6MB L3 cache. The benchmark generates a random index into
array a and passes this index to the foo function along with the address of array a.
Later the foo function reads this random memory location.109
109 According to the x86-64 calling conventions, the first 2 arguments land in the rdi and rsi registers, respectively.
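Listing 6.2 is not reproduced in this excerpt, but based on the description above, the driver likely resembles the following sketch (array size, iteration count, element type, and the argument order of foo are illustrative assumptions):
#include <cstddef>
#include <random>

extern "C" void foo(char* arr, std::size_t idx);   // implemented in assembly (Listing 6.1)

int main() {
  constexpr std::size_t N = 128 * 1024 * 1024;     // much larger than a 6MB L3 cache
  char* a = new char[N];
  std::mt19937_64 gen(42);
  std::uniform_int_distribution<std::size_t> dist(0, N - 1);
  for (std::size_t i = 0; i < 100'000'000; i++)
    foo(a, dist(gen));                             // random index defeats HW prefetchers
  delete[] a;
  return 0;
}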
By looking at Listing 6.1, we can see that all L3 cache misses in function foo are
tagged to a single instruction. Now that we know which instruction caused so many
L3 misses, let’s fix it.
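The fix itself is not shown in this excerpt; conceptually, an explicit prefetch hint generates the next random index one iteration ahead and requests the corresponding cache line early, roughly like this (a sketch using the GCC/Clang __builtin_prefetch builtin; names match the driver sketch above):
std::size_t idx = dist(gen);
for (std::size_t i = 0; i < 100'000'000; i++) {
  std::size_t next = dist(gen);      // index that will be used on the next iteration
  __builtin_prefetch(&a[next]);      // request the cache line well before it is needed
  foo(a, idx);                       // the line for 'idx' was prefetched one iteration ago
  idx = next;
}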
This explicit memory prefetching hint decreases execution time from 8.5 seconds to
6.5 seconds. Also, the number of CYCLE_ACTIVITY.STALLS_L3_MISS events becomes
almost ten times less: it goes from 19 billion down to 2 billion.
TMA is an iterative process, so once we fix one problem, we need to repeat the process
starting from Step 1. Likely it will move the bottleneck into another bucket, in this
case, Retiring. This was an easy example demonstrating the workflow of TMA
methodology. Analyzing real-world applications is unlikely to be that easy. Chapters
in the second part of the book are organized to make them convenient for use with the
TMA process. In particular, Chapter 8 covers the Memory Bound category, Chapter
9 covers Core Bound, Chapter 10 covers Bad Speculation, and Chapter 11 covers
Frontend Bound. Such a structure is intended to form a checklist that you can use to
drive code changes when you encounter a certain performance bottleneck.
AMD uses slightly different names for some of its categories, but the general idea remains the same. The L1 and L2 buckets are also very similar to Intel's. Since
kernel 6.2, Linux users can utilize the perf tool to collect the pipeline utilization data.
I ran the test on an AMD Ryzen 9 7950X machine with Ubuntu 22.04, Linux kernel
6.5. I compiled Crypto++ version 8.9 using GCC 12.3 C++ compiler. I used the
default -O3 optimization option, but it doesn’t impact performance much since the
code is written with x86 intrinsics (see Section 9.5) and utilizes the SHA x86 ISA
extension.
A description of pipeline utilization metrics shown above can be found in [AMD, 2024,
Chapter 2.8 Pipeline Utilization]. By looking at the metrics, we can see that branch
mispredictions are not happening in SHA256 (bad_speculation is 0%). Only 26.3%
of the available dispatch slots were used (retiring), which means the remaining
73.7% were wasted due to frontend and backend stalls.
Advanced cryptography instructions are not trivial, so internally they are broken into
smaller pieces (µops). Once a processor encounters such an instruction, it retrieves
µops for it from the microcode. Microoperations are fetched from the microcode
sequencer with a lower bandwidth than from regular instruction decoders, making it
a potential source of performance bottlenecks. Crypto++ SHA256 implementation
110 Crypto++ - https://github.com/weidai11/cryptopp
heavily uses instructions such as SHA256MSG2, SHA256RNDS2, and others which consist
of multiple µops according to uops.info111 website.
The retiring_microcode metric indicates that 6.1% of dispatch slots were used by
microcode operations that eventually retired. When comparing with its sibling metric
retiring_fastpath, we can say that roughly every 4th instruction was a microcode
operation. If we now look at the frontend_bound_bandwidth metric, we will see
that 6.1% of dispatch slots were unused due to bandwidth bottleneck in the CPU
frontend. This suggests that 6.1% of dispatch slots were wasted because the microcode
sequencer has not been providing µops while the backend could have consumed them.
In this example, the retiring_microcode and frontend_bound_bandwidth metrics
are tightly connected, however, the fact that they are equal is merely a coincidence.
The majority of cycles are stalled in the CPU backend (backend_bound), but only
1.7% of cycles are stalled waiting for memory accesses (backend_bound_memory). So,
we know that the benchmark is mostly limited by the computing capabilities of the
machine. As you will know from Part 2 of this book, it could be related to either data
flow dependencies or the execution throughput of certain cryptographic operations. Such
operations are less frequent than traditional ADD, SUB, CMP, and other instructions, and thus
can often be executed only on a single execution unit. A large number of such operations
may saturate the execution throughput of this particular unit. Further analysis should
involve a closer look at the source code and generated assembly, checking execution
port utilization, finding data dependencies, etc.
In summary, the Crypto++ implementation of SHA-256 on AMD Ryzen 9 7950X
utilizes only 26.3% of the available dispatch slots; 6.1% of the dispatch slots were
wasted due to the microcode sequencer bandwidth, and 65.9% were stalled due to lack
of computing resources of the machine. The algorithm certainly hits a few hardware
limitations, so it’s unclear if its performance can be improved or not.
When it comes to Windows, at the time of writing, TMA methodology is only supported
on AMD server platforms (codename Genoa), and not on client systems (codename
Raphael). TMA support was added in AMD uProf version 4.1, but only in the
AMDuProfPcm command-line tool, which is part of the AMD uProf installation. You
can consult [AMD, 2024, Chapter 2.8 Pipeline Utilization] for more details on how to
run the analysis. The graphical version of AMD uProf doesn’t have the TMA analysis
yet.
multiplication, code transformations like loop blocking may help (see Section 9.3). AI
algorithms are known for being “memory hungry”; however, Neoverse V1 Topdown
doesn't show whether there are stalls that can be attributed to saturated memory bandwidth.
The final category provides the operation mix, which can be useful in some scenarios.
The percentages give us an estimate of how many instructions of a certain type
were executed, including speculatively executed instructions. The numbers don’t
sum up to 100%, because the rest is attributed to the implicit “Others” category
(not printed by the topdown-tool), which is about 15% in our case. We should be
concerned by the low percentage of SIMD operations, especially given that the highly
optimized Tensorflow and numpy libraries are used. In contrast, the percentage of
integer operations and branches seems high. I checked that the majority of executed
branch instructions are loop backward jumps to the next loop iteration. The high
percentage of integer operations could be caused by lack of vectorization, or due to
thread synchronization. [Arm, 2023a] gives an example of discovering a vectorization
opportunity using data from the Speculative Operation Mix category.
In our case study, we ran the benchmark twice; however, in practice, one run
is usually sufficient. Running the topdown-tool without options will collect all the
available metrics using a single run. Also, the -s combined option will group the
metrics by the L1 category, and output data in a format similar to Intel VTune,
toplev, and other tools. The only practical reason for making multiple runs is when a
workload has a bursty behavior with very short phases that have different performance
characteristics. In such a scenario, you would like to avoid event multiplexing (see
Section 5.3.1) and improve collection accuracy by running the workload multiple times.
The AI Benchmark Alpha has various tests that could exhibit different performance
characteristics. The output presented above aggregates all 42 tests and gives an overall
breakdown. This is generally not a good idea if the individual tests indeed have
different performance bottlenecks. You need to have a separate Topdown analysis for
each of the tests. One way the topdown-tool can help is to use the -i option which
will output data per configurable interval of time. You can then compare the intervals
and decide on the next steps.
program. You will find it out once you do the necessary experiments.
While it is possible to achieve Retiring close to 100% on a toy program, real-world
applications are far from getting there. Figure 6.3 shows top-level TMA metrics for
Google’s datacenter workloads along with several SPEC CPU2006117 benchmarks
running on Intel’s Ivy Bridge server processors. We can see that most data center
workloads have a very small fraction in the Retiring bucket. This implies that most
data center workloads spend time stalled on various bottlenecks. BackendBound is
the primary source of performance issues. The FrontendBound category represents a
bigger problem for data center workloads than in SPEC2006 because those applications
typically have large codebases with poor locality. Finally, some workloads suffer from
branch mispredictions more than others, e.g., search2 and 445.gobmk.
Figure 6.3: TMA breakdown of Google’s datacenter workloads along with several SPEC
CPU2006 benchmarks, © Source: [Kanev et al., 2015]
Keep in mind that the numbers are likely to change for other CPU generations as
architects constantly try to improve the CPU design. The numbers are also likely to
change for other instruction set architectures (ISA) and compiler versions.
A few final thoughts before we move on... As we mentioned at the beginning of this
chapter, using TMA on code that has major performance flaws is not recommended
because it will likely steer you in the wrong direction, and instead of fixing real
high-level performance problems, you will be tuning bad code, which is just a waste of
time. Similarly, make sure the environment doesn’t get in the way of profiling. For
example, if you drop the filesystem cache and run the benchmark under TMA, it will
likely show that your application is Memory Bound, which may in fact be false when
the filesystem cache is warmed up.
Workload characterization provided by TMA can increase the scope of potential
optimizations beyond source code. For example, if an application is bound by memory
117 SPEC CPU 2006 - http://spec.org/cpu2006/.
bandwidth and all possible ways to speed it up on the software level have been
exhausted, it may be possible to improve performance by upgrading the memory
subsystem with faster memory chips. This demonstrates how using TMA to diagnose
performance bottlenecks can support your decision to spend money on new hardware.
6.2 Branch Recording Mechanisms
With a buffer of the most recent branch records, we can reconstruct the executed code
path since we know that the control flow was sequential from the destination address
in entry N-1 to the source address in entry N.
Next, we will take a look at each vendor’s branch recording mechanism and then
explore how they can be used in performance analysis.
Note that the recently introduced Architectural LBR feature does not depend on the
exact model number of the current CPU, which makes support in the OS and profiling tools much
easier. Also, LBR entries can be configured to be included in the PEBS records (see Section 6.3.1).
With Linux perf, you can collect LBR stacks using the following command:
$ perf record -b -e cycles -- ./benchmark.exe
[ perf record: Woken up 68 times to write data ]
[ perf record: Captured and wrote 17.205 MB perf.data (22089 samples) ]
LBR stacks can also be collected using the perf record --call-graph lbr com-
mand, but the amount of information collected is less than using perf record -b.
For example, branch misprediction and cycles data are not collected when running
perf record --call-graph lbr.
Because each collected sample captures the entire LBR stack (32 last branch records),
the size of collected data (perf.data) is significantly bigger than sampling without
LBRs. Still, the runtime overhead for the majority of LBR use cases is below 1%.
[Nowak & Bitzes, 2014]
Users can export raw LBR stacks for custom analysis. Below is the Linux perf
command you can use to dump the contents of collected branch stacks:
$ perf record -b -e cycles -- ./benchmark.exe
$ perf script -F brstack &> dump.txt
The dump.txt file, which can be quite large, contains lines like those shown below:
...
0x4edaf9/0x4edab0/P/-/-/29
0x4edabd/0x4edad0/P/-/-/2
0x4edadd/0x4edb00/M/-/-/4
0x4edb24/0x4edab0/P/-/-/24
0x4edabd/0x4edad0/P/-/-/2
0x4edadd/0x4edb00/M/-/-/1
0x4edb24/0x4edab0/P/-/-/3
0x4edabd/0x4edad0/P/-/-/1
...
In the output above, we present eight entries from the LBR stack, which typically
consists of 32 LBR entries. Each entry has FROM and TO addresses (hexadecimal
values), a predicted flag (this one single branch outcome was M - Mispredicted, P -
Predicted), and the number of cycles since the previous record (number in the last
position of each entry). Components marked with “-” are related to transactional
memory extension (TSX), which we won’t discuss here. Curious readers can look up
the format of a decoded LBR entry in the perf script specification119 .
not on a mispredicted path. Users can also filter records based on specific branch
types. One notable difference is that the depth of the BRBE buffer is configurable:
processors can implement a capacity of 8, 16, 32, or 64 records. More details are
available in [Arm, 2022a, Chapter F1 “Branch Record Buffer
Extension”].
At the time of writing, there were no commercially available machines that implement
ARMv9.2-A, so it is not possible to test the BRBE extension in action.
Several use cases become possible thanks to branch recording. Next, we will cover the
most important ones.
As you can see, we identified the hottest function in the program (which is bar). Also,
we identified the callers that contribute most to the time spent in function bar: 91% of
the time the tool captured the main → foo → bar call stack, and 9% of the time it
captured main → zoo → bar. In other words, 91% of samples in bar have foo as its
caller function.
It’s important to mention that we cannot necessarily draw conclusions about function
call counts in this case. For example, we cannot say that foo calls bar 10 times more
frequently than zoo. It could be the case that foo calls bar once, but it executes an
expensive path inside bar while zoo calls bar many times but returns quickly from it.
120 The report header generated by perf might still be confusing because it says 21K of event cycles.
122 LLVM test-suite 7zip benchmark - https://github.com/llvm-mirror/test-suite/tree/master/MultiSource/Benchmarks/7zip
# Overhead Samples Mis From Line To Line Source Sym Target Sym
# ........ ....... ... ......... ....... .......... ..........
46.12% 303391 N dec.c:36 dec.c:40 LzmaDec LzmaDec
22.33% 146900 N enc.c:25 enc.c:26 LzmaFind LzmaFind
6.70% 44074 N lz.c:13 lz.c:27 LzmaEnc LzmaEnc
6.33% 41665 Y dec.c:36 dec.c:40 LzmaDec LzmaDec
In this example, the lines that correspond to function LzmaDec are of particular interest
to us. In the output that Linux perf provides, we can spot two entries that correspond
to the LzmaDec function: one with Y and one with N letters. We can conclude that the
branch on source line dec.c:36 is the most executed in the benchmark since more
than 50% of samples are attributed to it. Analyzing those two entries together gives
us a misprediction rate for the branch. We know that the branch on line dec.c:36
was predicted 303391 times (corresponds to N) and was mispredicted 41665 times
(corresponds to Y), which gives us an 88% prediction rate.
Linux perf calculates the misprediction rate by analyzing each LBR entry and
extracting a misprediction bit from it. So for every branch, we have a number of times
it was predicted correctly and a number of mispredictions. Again, due to the nature
of sampling, some branches might have an N entry but no corresponding Y entry. It
means there are no LBR entries for the branch being mispredicted, but that doesn’t
necessarily mean the prediction rate is 100%.
Figure 6.4: Occurrence rate chart for latency of the basic block that starts at address
0x400618.
We can see two humps: a small one around 80 cycles 1 and two bigger ones at 280
and 305 cycles 2 . The block has a random load from a large array that doesn’t fit
in the CPU L3 cache, so the latency of the basic block largely depends on this load.
Based on the chart we can conclude that the first spike 1 corresponds to the L3
cache hit and the second spike 2 corresponds to the L3 cache miss where the load
request goes all the way down to the main memory.
This information can be used for a fine-grained tuning of this basic block. This
example might benefit from memory prefetching, which we will discuss in Section 8.5.
Also, cycle count information can be used for timing loop iterations, where every loop
iteration ends with a taken branch (back edge).
Before the proper support from profiling tools was in place, building graphs similar
to Figure 6.4 required manual parsing of raw LBR dumps. An example of how to
do this can be found on the easyperf blog123 . Luckily, in newer versions of Linux
perf, obtaining this information is much easier. The example below demonstrates this
method directly using Linux perf on the same 7-zip benchmark from the LLVM test
suite we introduced earlier:
$ perf record -e cycles -b -- ./7zip.exe b
$ perf report -n --sort symbol_from,symbol_to -F +cycles,srcline_from,srcline_to
--stdio
# Samples: 658K of event 'cycles'
# Event count (approx.): 658240
123 Easyperf: Building a probability density chart for the latency of an arbitrary basic block -
https://easyperf.net/blog/2019/04/03/Precise-timing-of-machine-code-with-Linux-perf.
Here is how we can interpret the data: from all the collected samples, 17% of the time
the latency of the basic block was one cycle, 27% of the time it was 2 cycles, and so
on. Notice that the distribution is mostly concentrated between 1 and 6 cycles, but there is
also a second mode of much higher latencies of 24 and 32 cycles, which likely corresponds to
the branch misprediction penalty. The second mode in the distribution accounts for 15%
of all samples.
This example shows that it is feasible to plot basic block latencies not only for tiny
microbenchmarks but for real-world applications as well. Currently, LBR is the most
precise cycle-accurate source of timing information on Intel systems.
6.3 Hardware-Based Sampling Features
A PEBS record captures the processor state at the moment of the sampled event: general-purpose
registers (EAX, EBX, ESP, etc.), EventingIP, Data Linear Address, Latency
value, and a few other fields. The content layout of a PEBS record varies across
different microarchitectures, see [Intel, 2023, Volume 3B, Chapter 20 Performance
Monitoring].
Figure 6.5: PEBS Record Format for 6th Generation, 7th Generation and 8th Generation
Intel Core Processor Families. © Source: [Intel, 2023, Volume 3B, Chapter 20].
Since Skylake, the PEBS record has been enhanced to collect XMM registers and Last
Branch Record (LBR) records. The format has been restructured where fields are
grouped into Basic group, Memory group, GPR group, XMM group, and LBR group.
Performance profiling tools have the option to select data groups of interest and thus
reduce the recording overhead. By default, the PEBS record will only contain the
Basic group.
One of the notable benefits of using PEBS is lower sampling overhead compared to
regular interrupt-based sampling. Recall that when the counter overflows, the CPU
generates an interrupt to collect one sample. Frequently generating interrupts and
having an analysis tool itself capture the program state inside the interrupt service
routine is very costly since it involves OS interaction.
On the other hand, PEBS keeps a buffer to temporarily store multiple PEBS records.
Suppose we are sampling load instructions using PEBS. When a performance counter
is configured for PEBS, an overflow condition in the counter will not trigger an
interrupt; instead, it will activate the PEBS mechanism. The mechanism will then
trap the next load, capture a new record, and store it in the dedicated PEBS buffer
area. The mechanism also takes care of clearing the counter overflow status and
reloading the counter with the initial value. Only when the dedicated buffer is full
does the processor raise an interrupt and the buffer gets flushed to memory. This
mechanism lowers the sampling overhead by triggering fewer interrupts.
Linux users can check if PEBS is enabled by executing dmesg:
$ dmesg | grep PEBS
[ 0.113779] Performance Events: XSAVE Architectural LBR, PEBS fmt4+-baseline,
AnyThread deprecated, Alderlake Hybrid events, 32-deep LBR, full-width counters,
Intel PMU driver.
For LBR, Linux perf dumps the entire contents of the LBR stack with every collected
sample. So, it is possible to analyze raw LBR dumps collected by Linux perf. However,
for PEBS, Linux perf doesn’t export the raw output as it does for LBR. Instead, it
processes PEBS records and extracts only a subset of data depending on a particular
need. So, it’s not possible to access the collection of raw PEBS records with Linux
perf. However, Linux perf provides some PEBS data processed from raw samples,
which can be accessed by perf report -D. To dump raw PEBS records, you can use
pebs-grabber126 .
With AMD IBS and Arm SPE, all the collected samples are precise by design since
the hardware captures the exact instruction address. They both work in a very similar
fashion. Whenever an overflow occurs, the mechanism saves the instruction causing
the overflow into a dedicated buffer, which is then read by the interrupt handler. As
the address is preserved, the instruction attribution of IBS and SPE samples is precise.
Users of Linux perf on Intel and AMD platforms must add the pp suffix to one of
the events listed above to enable precise tagging as shown below. However, on Arm
platforms, it has no effect, so users must use the arm_spe_0 event.
$ perf record -e cycles:pp -- ./a.exe
Precise events provide relief for performance engineers as they help to avoid misleading
data that often confuses beginners and even senior developers. The TMA methodology
heavily relies on precise events to locate the exact line of source code where the
inefficient execution takes place.
In PEBS, such a feature is called Data Address Profiling (DLA). To provide additional
information about sampled loads and stores, it uses the Data Linear Address and
Latency Value fields inside the PEBS facility (see Figure 6.5). If the performance
event supports the DLA facility, and DLA is enabled, the processor will dump the
memory address and latency of the sampled memory access. You can also filter
memory accesses that have latency higher than a certain threshold. This is useful
for finding long-latency memory accesses, which can be a performance bottleneck for
many applications.
With the IBS Execute and Arm SPE sampling, you can also do an in-depth analysis
of memory accesses performed by an application. One approach is to dump collected
samples and process them manually. IBS saves the exact linear address, its latency,
where the memory location was fetched from (cache or DRAM), and whether it hit or
missed in the DTLB. SPE can be used to estimate the latency and bandwidth of the
memory subsystem components, estimate memory latencies of individual loads/stores,
and more.
One of the most important use cases for these extensions is detecting True and False
Sharing, which we will discuss in Section 13.4. The Linux perf c2c tool heavily
relies on all three mechanisms (PEBS, IBS, and SPE) to find contested memory
accesses, which could experience True/False sharing: it matches load/store addresses
for different threads and checks if the hit occurs in a cache line modified by other
threads.
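A typical perf c2c session consists of a record step followed by a report step, for
example:
$ perf c2c record -- ./a.exe
$ perf c2c report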
Chapter Summary
• Modern processors provide features that enhance performance analysis. Using
those features greatly simplifies finding opportunities for low-level optimization.
• Top-down Microarchitecture Analysis (TMA) methodology is a powerful tech-
nique for identifying ineffective usage of CPU microarchitecture by a program,
and is easy to use even for inexperienced developers. TMA is an iterative process
that consists of multiple steps, including characterizing the workload and locating
the exact place in the source code where the bottleneck occurs. We advise that
TMA should be one of the starting points for every low-level tuning effort.
• Branch Record mechanisms such as Intel’s LBR, AMD’s LBR, and ARM’s BRBE
continuously log the most recent branch outcomes in parallel with executing
the program, causing a minimal slowdown. One of the primary usages of these
facilities is to collect call stacks. They also help identify hot branches and
misprediction rates, and enable precise timing of machine code.
• Modern processors often provide Hardware-Based Sampling features for advanced
profiling. Such features lower the sampling overhead by storing multiple samples
in a dedicated buffer without software interrupts. They also introduce “Precise
Events” that enable pinpointing the exact instruction that caused a particular
performance event. In addition, there are several other less important use cases.
Example implementations of such Hardware-Based Sampling features include
Intel’s PEBS, AMD’s IBS, and ARM’s SPE.
• Intel Processor Trace (PT) is a CPU feature that records program execution
by encoding packets in a highly compressed binary format that can be used
to reconstruct execution flow with a timestamp on every instruction. PT
has extensive coverage and a relatively small overhead. Its main usages are
postmortem analysis and finding the root cause(s) of performance glitches. The
Intel PT feature is covered in Appendix C. Processors based on ARM architecture
also have a tracing capability called Arm CoreSight,131 but it is mostly used for
debugging rather than for performance analysis.
7 Overview of Performance Analysis Tools
7.1 Intel VTune Profiler
How to configure it
On Linux, VTune can use two data collectors: Linux perf and VTune’s own driver
called SEP. The first type is used for user-mode sampling, but if you want to perform
advanced analysis, you need to build and install the SEP driver, which is not too hard.
# go to the sepdk folder in vtune's installation
$ cd ~/intel/oneapi/vtune/latest/sepdk/src
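# The remaining steps below are a sketch of a typical SEP driver installation;
# the script names come from the sepdk package, but exact options may differ
# between VTune versions, so consult the documentation shipped in that folder.
$ ./build-driver
$ sudo ./insmod-sep -r -g vtune
# add your user to the "vtune" group and re-login to collect without root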
After you’ve done with the steps above, you should be able to use advanced analysis
types like Microarchitectural Exploration and Memory Access.
Windows does not require any additional configuration after you install VTune. Col-
lecting hardware performance events requires administrator privileges.
VTune can provide very rich information about a running process. It is the right tool
for you if you’re looking to improve the overall performance of an application. VTune
always provides aggregated data over a period of time, so it can be used for finding
optimization opportunities for the “average case”.
Due to the sampling nature of the tool, it will eventually miss events with a very short
duration (e.g., sub-microsecond).
Example
Below is a series of screenshots of VTune’s most interesting features. For this example,
I took POV-Ray, which is a ray tracer used to create 3D graphics. Figure 7.1 shows
the hotspots analysis of the built-in POV-Ray 3.7 benchmark, compiled with the Clang 14
compiler with -O3 -ffast-math -march=native -g options, and run on an Intel
Alder Lake system (Core i7-1260P, 4 P-cores + 8 E-cores) with 4 worker threads.
At the left part of the image, you can see a list of hot functions in the workload along
with the corresponding CPU time percentage and the number of retired instructions.
On the right panel, you can see the most frequent call stack that leads to calling
the function pov::Noise. According to that screenshot, 44.4% of the time, function
pov::Noise was called from pov::Evaluate_TPat, which in turn was called from
pov::Compute_Pigment.133
If you double-click on the pov::Noise function, you will see an image that is shown
in Figure 7.2. In the interest of space, only the most important columns are shown.
The left panel shows the source code and CPU time that corresponds to each line
of code. On the right, you can see assembly instructions along with the CPU time
that was attributed to them. Highlighted machine instructions correspond to line
476 in the left panel. The sum of all CPU time percentages (not just the ones that
are visible) in each panel equals the total CPU time attributed to the pov::Noise
function, which is 26.8%.
When you use VTune to profile applications running on Intel CPUs, it can collect
many different performance events. To illustrate this, I ran a different analysis type,
Microarchitecture Exploration. To access raw event counts, you can switch the view
to Hardware Events as shown in Figure 7.3. To enable switching views, you need to
tick the checkbox in Options → General → Show all applicable viewpoints. Near the top
of Figure 7.3, you can see that the Platform tab is selected. Two other pages are also
useful. The Summary page gives you the absolute number of raw performance events
as collected from CPU counters. The Event Count page gives you the same data with
a per-function breakdown.
Figure 7.3 is quite busy and requires some explanation. The top panel, indicated
with 1 , is a timeline view that shows the behavior of our four worker threads over
time with respect to L1 cache misses, plus some tiny activity of the main thread
133 Notice that the call stack doesn’t lead all the way to the main function. This happens because,
with the hardware-based collection, VTune uses LBR to sample call stacks, which provides limited
depth. Most likely we’re dealing with recursive functions here, and to investigate that further users
will have to dig into the code.
(TID: 3102135), which spawns all the worker threads. The higher the black bar, the
more events (L1 cache misses in this case) happened at any given moment. Notice
occasional spikes in L1 misses for all four worker threads. We can use this view to
observe different or repeatable phases of the workload. Then to figure out which
functions were executed at that time, we can select an interval and click “filter in”
to focus just on that portion of the running time. The region indicated with 2 is
an example of such filtering. To see the updated list of functions, you can go to the
Event Count view. Such filtering and zooming features are available on all VTune
timeline views.
The region indicated with 3 shows performance events that were collected and their
distribution over time. This time it is not a per-thread view, but rather it shows
aggregated data across all the threads. In addition to observing execution phases, you
can also visually extract some interesting information. For example, we can see that
the number of executed branches is high (BR_INST_RETIRED.ALL_BRANCHES), but the
misprediction rate is quite low (BR_MISP_RETIRED.ALL_BRANCHES). This can lead you
to the conclusion that branch misprediction is not a bottleneck for POV-Ray. If you
scroll down, you will see that the number of L3 misses is zero, and L2 cache misses
are very rare as well. This tells us that 99% of memory access requests are served by
L1, and the rest of them are served by L2. By combining these two observations, we
can conclude that the application is likely bound by compute, i.e., the CPU is busy
calculating something, not waiting for memory or recovering from a misprediction.
Finally, the bottom panel 4 shows the CPU frequency chart for four hardware
threads. Hovering over different time slices tells us that the frequency of those cores
fluctuates in the 3.2–3.4GHz region.
Figure 7.3: VTune's perf events timeline view of povray built-in benchmark.
7.2 AMD uProf
How to configure it
On Linux, uProf uses Linux perf for data collection. On Windows, uProf uses its own
sampling driver that is installed along with uProf; no additional configuration
is required. AMD uProf offers both a command-line interface (CLI) and a graphical
interface. The CLI requires two separate steps, collect and report, similar
to Linux perf.
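As a rough sketch of the CLI workflow (the configuration name and options below are
illustrative; check AMDuProfCLI --help for the exact spelling in your version):
$ AMDuProfCLI collect --config tbp -o /tmp/uprof-out ./a.exe
$ AMDuProfCLI report -i /tmp/uprof-out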
Example
To demonstrate the look-and-feel of the AMD uProf tool, we ran the dense LU matrix
factorization component from the Scimark2136 benchmark on an AMD Ryzen 9 7950X,
running Windows 11, with 64 GB RAM.
Figure 7.4 shows the Function Hotspots analysis (selected in the menu list on the left side
of the image). At the top of the image, you can see an event timeline showing the
134 MPI - Message Passing Interface, a standard for parallel programming on distributed memory
systems.
135 AMD uProf User Guide - https://www.amd.com/en/developer/uprof.html#documentation
136 Scimark2 - https://math.nist.gov/scimark2/index.html
number of events observed at various times of the application execution. On the right,
you can select which metric to plot; we selected RETIRED_BR_INST_MISP. Notice a
spike in branch mispredictions in the time range from 20s to 40s. You can select this
region to analyze closely what’s going on there. Once you do that, it will update the
bottom panels to show statistics only for that time interval.
Below the timeline graph, you can see a list of hot functions, along with corresponding
sampled performance events and calculated metrics. Event counts can be viewed as
sample count, raw event count, or percentage. There are many interesting numbers to
look at, but we will not dive deep into the analysis. Instead, readers are encouraged
to figure out the performance impact of branch mispredictions and find their source.
Below the functions table, you can see a bottom-up call stack view for the selected
function in the functions table. As we can see, the selected LU_factor function is
called from kernel_measureLU, which in turn is called from main. In the Scimark2
benchmark, this is the only call stack for LU_factor, even though it shows Call
Stacks [5]. This is an artifact of collection that can be ignored. But in other
applications, a hot function can be called from many different places, so you would
want to examine other call stacks as well.
If you double-click on any function, uProf will open the source/assembly view for
that function. We don’t show this view for brevity. On the left panel, there are other
views available, like Metrics, Flame Graph, Call Graph view, and Thread Concurrency.
They are useful for analysis as well, however we decided to skip them. Readers can
experiment and look at those views on their own.
7.3 Apple Xcode Instruments
Figure 7.5 shows the main timeline view of Xcode Instruments. This screenshot was
taken after the compilation had finished. We will get back to it a bit later, but first,
let us show how to start the profiling session.
To begin, open Instruments and choose the CPU Counters analysis type. The first
step you need to do is configure the collection. Click and hold the red target icon (see
1 in Figure 7.5), then select Recording Options. . . from the menu. It will display the
dialog window shown in Figure 7.6. This is where you can add hardware performance
monitoring events for collection. Apple has documented its hardware performance
monitoring events in its manual [Apple, 2024, Section 6.2 Performance Monitoring
Events].
The second step is to set the profiling target. To do that, click and hold the name of
an application (marked 2 in Figure 7.5) and choose the one you’re interested in. Set
the arguments and environment variables if needed. Now, you’re ready to start the
collection; press the red target icon 1 .
Instruments shows a timeline and constantly updates statistics about the running
application. Once the program finishes, Instruments will display the results like those
shown in Figure 7.5. The compilation took 7.3 seconds and we can see how the volume
of events changed over time. For example, the number of executed branch instructions
and mispredictions increased towards the end of the runtime. You can zoom in to that
interval on the timeline to examine the functions involved.
The bottom panel shows numerical statistics. To inspect the hotspots similar to Intel
VTune’s bottom-up view, select Profile in the menu 3 , then click the Call Tree menu
4 and check the Invert Call Tree box. This is exactly what we did in Figure 7.5.
Instruments shows raw counts along with percentages of the total, which is useful if
you want to calculate secondary metrics like IPC, MPKI, etc. On the right side, we have
How to configure it
Installing Linux perf is very simple and can be done with a single command:
$ sudo apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`
Also, consider changing the following defaults unless security is a concern:
# Allow kernel profiling and access to CPU events for unprivileged users
$ echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
$ echo kernel.perf_event_paranoid=0 | sudo tee -a /etc/sysctl.d/local.conf
# Enable kernel modules symbols resolution for unprivileged users
$ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
$ echo kernel.kptr_restrict=0 | sudo tee -a /etc/sysctl.d/local.conf
139 Linux perf wiki - https://perf.wiki.kernel.org/index.php/Main_Page.
7.5 Flame Graphs
• KDAB Hotspot,140 a tool that visualizes Linux perf data with an interface very
similar to Intel VTune. If you have worked with Intel VTune, KDAB Hotspot
will seem very familiar to you.
• Netflix Flamescope.141 This tool displays a heat map of sampled events over
application runtime. You can observe different phases and patterns in the
behavior of a workload. Netflix engineers found some very subtle performance
bugs using this tool. Also, you can select a time range on the heat map and
generate a flame graph for that time range.
On the flame graph, each rectangle (horizontal bar) represents a function call, and the
140 KDAB Hotspot - https://github.com/KDAB/hotspot.
141 Netflix Flamescope - https://github.com/Netflix/flamescope.
142 Flame graphs by Brendan Gregg - https://github.com/brendangregg/FlameGraph
width of the rectangle indicates the relative execution time taken by the function itself
and by its callees. The function calls happen from the bottom to the top, so we can
see that the hottest path in the program is x264 → threadpool_thread_internal →
... → x264_8_macroblock_analyse. The function threadpool_thread_internal
and its callees account for 74% of the time spent in the program. But the self-time,
i.e., time spent in the function itself is rather small. Similarly, we can do the same
analysis for x264_8_macroblock_analyse, which accounts for 66% of the runtime.
This visualization gives you a very good intuition on where the most time is spent.
Flame graphs are interactive. You can click on any bar on the image and it will zoom
into that particular code path. You can keep zooming until you find a place that
doesn’t look according to your expectations or you reach a leaf/tail function—now
you have actionable information you can use in your analysis. Another strategy is to
figure out what is the hottest function in the program (not immediately clear from
this flame graph) and go bottom-up through the flame graph, trying to understand
from where this hottest function gets called.
Some tools prefer to use an icicle graph, which is the upside-down version of a flame
graph (see an example in Section 7.9).
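To produce a flame graph like this one yourself, a common recipe is to post-process
Linux perf data with the FlameGraph scripts referenced in the footnote; the paths
below assume the repository is cloned into the current directory:
$ perf record -g -- ./a.exe
$ perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg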
How to configure it
Recording ETW data is possible without any extra download since Windows 10 with
WPR.exe. But to enable system-wide profiling you must be an administrator and have
the SeSystemProfilePrivilege enabled. The Windows Performance Recorder tool
supports a set of built-in recording profiles that are suitable for common performance
issues. You can tailor your recording needs by authoring a custom performance
recorder profile xml file with the .wprp extension.
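For example, a basic system-wide recording with the built-in CPU profile can be
started and stopped from an elevated command prompt as follows (the output path is
arbitrary):
wpr -start CPU
rem ... reproduce the scenario you want to analyze ...
wpr -stop C:\Traces\scenario.etl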
If you want to not only record but also view the recorded ETW data you need to
install the Windows Performance Toolkit (WPT). You can download it from the
Windows SDK143 or ADK144 download page. The Windows SDK is huge; you don’t
necessarily need all its parts. In our case, we just enabled the checkbox of the Windows
Performance Toolkit. You are allowed to redistribute WPT as a part of your own
application.
143 Windows SDK Downloads - https://developer.microsoft.com/en-us/windows/downloads/sdk-archive/
144 Windows ADK Downloads - https://learn.microsoft.com/en-us/windows-hardware/get-started/adk-install#other-adk-downloads
7.7 Specialized and Hybrid Profilers
Developers have created profilers that provide features helpful in specific environments,
usually with a marker API that you can use to manually instrument your code. This
enables you to observe the performance of a particular function or a block of code
(later referred to as a zone). Continuing with the game industry, there are several tools
in this space: some are integrated directly into game engines like Unreal, while others
are provided as external libraries and tools that can be integrated into your project.
Some of the most commonly used profilers are Tracy, RAD Telemetry, Remotery, and
Optick (Windows only). Next, we showcase Tracy,145 as this seems to be one of the
most popular projects; however, these concepts apply to the other profilers as well.
To emulate a typical scenario where Tracy can help to diagnose the root cause of
the problem, we have manually modified the code so that some frames will consume
more time than others. Listing 7.1 shows an outline of the code along with added
Tracy instrumentation. Notice that we randomly select frames to slow down. Also,
we included Tracy’s header and added the ZoneScoped and FrameMark macros to the
functions that we want to track. The FrameMark macro can be inserted to identify
individual frames in the profiler. The duration of each frame will be visible on the
timeline, which is very useful.
Each frame can contain many zones, designated by the ZoneScoped macro. Similar
to frames, there are many instances of a zone. Every time we enter a zone, Tracy
captures statistics for a new instance of that zone. The ZoneScoped macro creates a
C++ object on the stack that will record the runtime activity of the code within the
scope of the object’s lifetime. Tracy refers to this scope as a zone. At the zone entry,
the current timestamp is captured. Once the function exits, the object’s destructor
will record a new timestamp and store this timing data, along with the function name.
145 Tracy - https://github.com/wolfpld/tracy
146 ToyPathTracer - https://github.com/wolfpld/tracy/tree/master/examples/ToyPathTracer
Listing 7.1 Outline of the code with Tracy instrumentation.
#include "tracy/Tracy.hpp" // provides ZoneScoped and FrameMark; path may vary by Tracy version

void DoExtraWork() {
  ZoneScoped;
  // imitate useful work
}
void TraceRowJob() {
  ZoneScoped;
  if (frameCount == randomlySelected)
    DoExtraWork();
  // ...
}
void RenderFrame() {
  ZoneScoped;
  for (...) {
    TraceRowJob();
  }
  FrameMark;
}
Tracy has two operation modes: it can store all the timing data until the profiler
is connected to the application (the default mode), or it can only start recording
when a profiler is connected. The latter option can be enabled by specifying the
TRACY_ON_DEMAND pre-processor macro when compiling the application. This mode
should be preferred if you want to distribute an application that can be profiled as
needed. With this option, the tracing code can be compiled into the application
and it will cause little to no overhead to the running program unless the profiler is
attached. The profiler is a separate application that connects to a running application
to capture and display the live profiling data, also known as the “flight recorder” mode.
The profiler can be run on a separate machine so that it doesn’t interfere with the
running application. Note, however, that this doesn’t mean that the runtime overhead
caused by the instrumentation code disappears. It is still there, but the overhead of
visualizing the data is avoided in this case.
We used Tracy to debug the program and find the reason why some frames are slower
than others. The data was captured on a Windows 11 machine, equipped with a Ryzen
7 5800X processor. The program was compiled with MSVC 19.36.32532. Tracy’s
graphical interface is quite rich, but unfortunately contains too much detail to fit on a
single screenshot, so we break it down into pieces. At the top, there is a timeline view
as shown in Figure 7.8, cropped to fit onto the page. It shows only a portion of frame
76, which took 44.1 ms to render. On that diagram, we see the Main thread and five
WorkerThreads that were active during that frame. All threads, including the main
thread, are performing work to advance progress in rendering the final image. As we
said earlier, each thread processes a row of pixels inside the TraceRowJob zone. Each
TraceRowJob zone instance contains many smaller zones that are not visible. Tracy
collapses inner zones and only shows the number of collapsed instances. This is what,
for example, number 4,109 means under the first TraceRowJob in the Main Thread.
Notice the instances of DoExtraWork zones, nested under TraceRowJob zones. This
observation already can lead to a discovery, but in a real application, it may not be so
obvious. Let’s leave this for now.
Figure 7.8: Tracy main timeline view. It shows the main thread and five worker threads
while rendering a frame.
Right above the main panel, there is a histogram that displays the times for all the
recorded frames (see Figure 7.9). It makes it easier to spot those frames that took
longer than average to complete. In this example, most frames take around 33 ms
(the yellow bars). However, some frames take longer than this and are marked in red.
As seen in the screenshot, a tooltip showing the details of a given frame is displayed
when you point the mouse at a bar in the histogram. In this example, we are showing
the details for the last frame.
Figure 7.9: Tracy frame timings. You can find frames that take more time to render than
other frames.
Figure 7.10 illustrates the CPU data section of the profiler. This area shows which
core a given thread is executing on and it also displays context switches. This section
will also display other programs that are running on the CPU. As seen in the image,
the details for a given thread are displayed when hovering the mouse on a given
section in the CPU data view. Details include the CPU the thread is running on, the
parent program, the individual thread, and timing information. We can see that the
TestCpu.exe thread was active on CPU 1 only for 4.4 ms during the entire run of the
program.
Next comes the panel that provides information on where our program spends its time
(hotspots). Figure 7.11 is a screenshot of Tracy’s statistics window. We can check the
recorded data, including the total time a given function was active, how many times
Figure 7.10: Tracy CPU data view. You can see what each CPU core was doing at any given
moment.
it was invoked, etc. It’s also possible to select a time range in the main view to filter
information corresponding to a time interval.
Figure 7.11: Tracy function statistics. A regular “hotspot” view that provides information
where a program spends time.
The last set of panels that we show, enables us to analyze individual zone instances in
more depth. Once you click on any zone instance, say, on the main timeline view or on
the CPU data view, Tracy will open a Zone Info window (see the left panel in Figure
7.12) with the details for this zone instance. It shows how much of the execution
time is consumed by the zone itself or its children. In this example, execution of the
TraceRowJob function took 19.24 ms, but the time consumed by the function itself
without its callees (self time) takes 1.36 ms, which is only 7%. The rest of the time is
consumed by child zones.
It’s easy to spot a call to DoExtraWork that takes the bulk of the time, 16.99 ms out of
19.24 ms (see the left panel in Figure 7.12). Notice that this particular TraceRowJob
instance runs almost 4.4 times as long as the average case (indicated by “437.93%
of the mean time” on the image). Bingo! We found one of the slow instances where
the TraceRowJob function was slowed down because of some extra work. One way to
proceed would be to click on the DoExtraWork row to inspect this zone instance. This
will update the Zone Info view with the details of the DoExtraWork instance so that we
can dig down to understand what caused the performance issue. This view also shows
the source file and line of code where the zone starts. So, another strategy would be
to check the source code to understand why the current TraceRowJob instance takes
more time than usual.
Figure 7.12: Tracy zone detail windows. It shows statistics for a slow instance of the
TraceRowJob zone.
Remember, we saw in Figure 7.9, that there are other slow frames. Let’s see if this is
the common problem among all the slow frames. If we click on the Statistics button,
it will display the Find Zone panel (on the right of Figure 7.12). Here we can see
the time histogram that aggregates all zone instances. This is particularly useful to
determine how much variation there is when executing a function. Looking at the
histogram on the right, we see that the median duration for the TraceRowJob function
is 3.59 ms, with most calls taking between 1 and 7 ms. However, there are a few
instances that take longer than 10 ms, with a peak of 23 ms. Note that the time axis
is logarithmic. The Find Zone window also provides other data points, including the
mean, median, and standard deviation for the inspected zone.
Now we can examine other slow instances to find what is common between them,
which will help us to determine the root cause of the issue. From this view, you can
select one of the slow zones. This will update the Zone Info window with the details
of that zone instance and by clicking the Zoom to zone button, the main window will
focus on this slow zone. From here we can check if the selected TraceRowJob instance
has similar characteristics as the one that we just analyzed.
are required). Zone statistics (call counts, time, histogram) are exact because Tracy
captures every zone entry/exit, but system-level data and source-code-level data are
sampled.
In the example, we used manual markup of interesting areas in the code. However,
doing this is not a strict requirement to start using Tracy. You can profile an unmodified
application and add instrumentation later when you know where it’s needed. Tracy
provides many other features, too many to cover in this overview. Here are some of
the notable ones:
In comparison with other tools like Intel VTune and AMD uProf, with Tracy, you
cannot get the same level of CPU microarchitectural insights (e.g., various performance
events). This is because Tracy does not leverage the hardware features specific to a
particular platform.
The overhead of profiling with Tracy depends on how many zones you have activated.
The author of Tracy provides some data points that he measured on a program that
does image compression: an overhead of 18% and 34% with two different compression
schemes. A total of 200M zones were profiled, with an average overhead of 2.25 ns per
zone. This test instrumented a very hot function. In other scenarios, the overhead
will be much lower. While it’s possible to keep the overhead small, you need to be
careful about which sections of code you want to instrument, especially if you decide
to use it in production.
7.8 Memory Profiling
Memory profiling helps answer questions such as:
• What is a program’s total virtual memory consumption and how does it change
over time?
• Where and when does a program make heap allocations?
• What are the code places with the largest amount of allocated memory?
• How much memory does a program access every second?
• What is the total memory footprint of a program?
Developers can observe both RSS and VSZ on Linux with a standard top utility,
however, both metrics can change very rapidly. Luckily, some tools can record and
visualize memory usage over time. Figure 7.13 shows the memory usage of the PSPNet
image segmentation algorithm, which is a part of the AI Benchmark Alpha.147 This
chart was created based on the output of a tool called memory_profiler148 , a Python
library built on top of the cross-platform psutil149 package.
Figure 7.13: RSS and VSZ memory utilization of AI_bench PSPNet image segmentation.
In addition to standard RSS and VSZ metrics, people have developed a few more
sophisticated metrics. Since RSS includes both the memory that is unique to the
process and the memory shared with other processes, it’s not clear how much memory
a process has for itself. The USS (Unique Set Size) is the memory that is unique to
a process and which would be freed if the process was terminated right now. The
PSS (Proportional Set Size) represents unique memory plus the amount of shared
memory, evenly divided between the processes that share it. For example, if a process has 10
147 AI Benchmark Alpha - https://ai-benchmark.com/alpha.html
148 Easyperf blog: Measuring memory footprint with SDE - https://easyperf.net/blog/2024/02/12/Memory-Profiling-Part3
149 psutil - https://github.com/giampaolo/psutil
MB all to itself (USS) and 10 MB shared with another process, its PSS will be 15
MB. The psutil library supports measuring these metrics (Linux-only), which can
be visualized by memory_profiler.
On Windows, similar concepts are defined by Committed Memory Size and Working
Set Size. They are not direct equivalents to VSZ and RSS but can be used to effectively
estimate the memory usage of Windows applications. The RAMMap150 tool provides
a rich set of information about memory usage for the system and individual processes.
When developers talk about memory consumption, they implicitly mean heap usage.
Heap is, in fact, the biggest memory consumer in most applications as it accommodates
all dynamically allocated objects. But heap is not the only memory consumer. For
completeness, let’s mention others:
• Stack: Memory used by stack frames in an application. Each thread inside an
application gets its own stack memory space. Usually, the stack size is only a
few MB, and the application will crash if it exceeds the limit. For example, the
default size of stack memory on Linux is usually 8MB, although it may vary
depending on the distribution and kernel settings. The default stack size on
macOS is also 8MB, but on Windows, it’s only 1 MB. The total stack memory
consumption is proportional to the number of threads running in the system.
• Code: Memory that is used to store the code (instructions) of an application and
its libraries. In most cases, it doesn’t contribute much to memory consumption,
but there are exceptions. For example, the Clang 17 C++ compiler has a 33
MB code section, while the latest Windows Chrome browser has 187MB of its
219MB chrome.dll dedicated to code. However, not all parts of the code are
frequently exercised while a program is running. We show how to measure code
footprint in Section 11.9.
Since the heap is usually the largest consumer of memory resources, it makes sense for
developers to focus on this part of memory when they analyze the memory utilization
of their applications. In the following section, we will examine heap consumption and
memory allocations in a popular real-world application.
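The profiles shown below were collected with Heaptrack. A typical invocation looks
roughly like this; the benchmark arguments and the generated file name are
illustrative:
$ heaptrack ./stockfish bench
$ heaptrack_gui heaptrack.stockfish.12345.zst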
Notice that there are many tabs at the top of the image; we will explore some of
them. Figure 7.15 shows the memory usage of the Stockfish built-in benchmark. The
memory usage stays constant at 200 MB throughout the entire run of the program.
Total consumed memory is broken into slices, e.g., regions 1 and 2 on the image.
Each slice corresponds to a particular allocation. Interestingly, it was not a single
big 182 MB allocation that was done through Stockfish::std_aligned_alloc as
we thought earlier. Instead, there are two: slice 1 134.2 MB and slice 2 48.4 MB.
Both allocations stay alive until the very end of the benchmark.
Does it mean that there are no memory allocations after the startup phase? Let’s
find out. Figure 7.16 shows the accumulated number of allocations over time. Similar
to the consumed memory chart (Figure 7.15), allocations are sliced according to the
accumulated number of memory allocations attributed to each function. As we can
see, new allocations keep coming from not just a single place, but many. The most
frequent allocations are done through operator new that corresponds to region 1
on the image.
Notice there are new allocations at a steady pace throughout the life of the program.
Figure 7.15: Stockfish memory profile with Heaptrack, memory usage over time stays
constant.
However, as we just saw, memory consumption doesn’t change; how is that possible?
Well, it can be possible if we deallocate previously allocated buffers and allocate new
ones of the same size (also known as temporary allocations).
Figure 7.16: Stockfish memory profile with Heaptrack, the number of allocations is growing.
Since the number of allocations is growing but the total consumed memory doesn’t
change, we are dealing with temporary allocations. Let’s find out where in the code
they are coming from. It is easy to do with the help of a flame graph shown in Figure
7.17. There are 4800 temporary allocations in total with 90.8% of those coming from
operator new. Thanks to the flame graph we know the entire call stack that leads to
4360 temporary allocations. Interestingly, those temporary allocations are initiated
by std::stable_sort which allocates a temporary buffer to do the sorting. One way
to get rid of those temporary allocations would be to use an in-place stable sorting
algorithm. However, by doing so I observed an 8% drop in performance, so I discarded
this change.
Similar to temporary allocations, you can also find the paths that lead to the largest
allocations in a program. In the dropdown menu at the top of Figure 7.17, you would
need to select the “Consumed” flame graph.
Figure 7.17: Stockfish memory profile with Heaptrack, temporary allocations flamegraph.
Figure 7.18: Memory intensity and footprint of four workloads. Intensity: total memory
accessed during 1B instructions interval. Footprint: accessed memory that has not been seen
before.
The dashed line (Footprint) tracks the size of the new data accessed every interval
since the start of the program. Here, we count the number of bytes accessed during
each 1B instruction interval that have never been touched before by the program.
Aggregating all the intervals (cross-hatched area under that dashed line) gives us
the total memory footprint for a program. Here are the memory footprint numbers
for our benchmarks: Clang C++ compilation (487MB); Stockfish (188MB); PSPNet
(1888MB); and Blender (149MB). Keep in mind that these unique memory locations
can be accessed many times, so the overall pressure on the memory subsystem may be
high even if the footprint is relatively small.
As you can see our workloads have very different behavior. Clang compilation has high
memory intensity at the beginning, sometimes spiking to 100MB per 1B instructions,
but after that, it decreases to about 15MB per 1B instructions. Any of the spikes on
the chart may be concerning to a Clang developer: are they expected? Could they
be related to some memory-hungry optimization pass? Can the accessed memory
locations be compacted?
The Blender benchmark is very stable; we can see the start and the end of each
rendered frame. This enables us to focus on just a single frame, without looking at
the entire workload of 1000+ frames. The Stockfish benchmark is a lot more chaotic,
probably because the chess engine crunches different positions which require different
amounts of resources. Finally, the PSPNet segmentation memory intensity is very
interesting as we can spot repetitive patterns. After the initial startup, there are
five or six sine waves from 40B to 95B, then three regions that end with a sharp
spike to 200MB, and then again three mostly flat regions hovering around 25MB
per 1B instructions. This is actionable information that can be used to optimize an
application.
Such charts help us estimate memory intensity and measure the memory footprint of an
application. By looking at the chart, you can observe phases and correlate them with
the underlying algorithm. Ask yourself: “Does it look according to your expectations,
or is the workload doing something sneaky?” You may encounter unexpected spikes in
memory intensity. On many occasions, memory profiling helped identify a problem or
served as an additional data point to support the conclusions that were made during
regular profiling.
In some scenarios, memory footprint helps us estimate the pressure on the memory
subsystem. For instance, if the memory footprint is small (several megabytes), we
might suspect that the pressure on the memory subsystem is low since the data will
likely reside in the L3 cache; remember that available memory bandwidth in modern
processors ranges from tens to hundreds of GB/s and is getting close to 1 TB/s. On
the other hand, when we’re dealing with an application that accesses 10 GB/s and
the memory footprint is much bigger than the size of the L3 cache, then the workload
might put significant pressure on the memory subsystem.
While memory profiling techniques discussed in this chapter certainly help better
understand the behavior of a workload, this is not enough to fully assess the temporal
and spatial locality of memory accesses. We still have no visibility into how many
times a certain memory location was accessed, what is the time interval between
two consecutive accesses to the same memory location, and whether the memory is
accessed sequentially, with strides, or completely randomly. The topic of data locality of
applications has been researched for a long time. Unfortunately, as of early 2024, there
are no production-quality tools available that would give us such information. Further
details are beyond the scope of this book, however, curious readers are welcome to
read the related article155 on the Easyperf blog.
7.9 Continuous Profiling
issue from the application stack down to the kernel stack from any given date and
time and supports call stack comparisons between any two arbitrary dates/times to
highlight performance differences.
To showcase the look-and-feel of a typical CP tool, let’s look at the Web UI of Parca,156
one of the open-source CP tools, depicted in Figure 7.19. The top panel displays a
time series graph of the number of CPU samples gathered from various processes on
the machine during the period selected from the time window dropdown list, which in
this case is “Last 15 minutes”. However, to make it fit on the page, the image was cut
to show only the last 10 minutes.
By default, Parca collects 19 samples per second. For each sample, it collects stack
traces from all the processes that run on the host system. The more samples are
attributed to a certain process, the more CPU activity it had during a period of time.
In our example, you can see the hottest process (top line) had a bursty behavior with
spikes and dips in CPU activity. If you were the lead developer behind this application
you would probably be curious why this happens. When you roll out a new version of
your application and suddenly see an unexpected spike in the CPU samples attributed
to the process, that is an indication that something is going wrong.
Continuous profiling tools make it easier not only to spot the point in time when
performance change occurred but also to determine the root cause of the issue. Once
you click on any point of interest on the chart, the tool displays an icicle graph
associated with that period in the bottom panel. An icicle graph is the upside-down
156 Parca - https://github.com/parca-dev/parca
version of a flame graph. Using it, you can compare call stacks before and after to
help you find what is causing performance problems.
Imagine, you merged a code change into production and after it has been running
for a while, you receive reports of intermittent response time spikes. These may or
may not correlate with user traffic or with any particular time of day. This is an area
where CP shines. You can pull up the CP Web UI and do a search for stack traces at
the dates and times of those response time spikes, and then compare them to stack
traces of other dates and times to identify anomalous executions at the application
and/or kernel stack level. This type of “visual diff” is supported directly in the UI,
like a graphical “perf diff” or a differential flamegraph.157
Google introduced the CP concept in the 2010 paper “Google-Wide Profiling”
[Ren et al., 2010], which championed the value of always-on profiling in production
environments. However, it took nearly a decade before it gained traction in the
industry:
1. In March 2019, Google Cloud released its Continuous Profiler.
2. In July 2020, AWS released CodeGuru Profiler.
3. In August 2020, Datadog released its Continuous Profiler.
4. In December 2020, New Relic acquired the Pixie Continuous Profiler.
5. In Jan 2021, Pyroscope released its open-source Continuous Profiler.
6. In October 2021, Elastic acquired Optimyze and its Continuous Profiler (Prod-
filer); Polar Signals released its Parca Continuous Profiler. It was open-sourced
in April 2024.
7. In December 2021, Splunk released its AlwaysOn Profiler.
8. In March 2022, Intel acquired Granulate and its Continuous Profiler (gProfiler).
It was made open-source in March 2024.
New entrants into this space continue to pop up in both open-source and commercial
varieties. Some of these offerings require more hand-holding than others. For example,
some require source code or configuration file changes to begin profiling. Others
require different agents for different language runtimes (e.g., Ruby, Python, Golang,
C/C++/Rust). The best of them have crafted a secret sauce around eBPF so that
nothing other than simply installing the runtime agent is necessary.
They also differ in the number of language runtimes supported, the work required for
obtaining debug symbols for readable stack traces, and the type of system resources that
can be profiled aside from the CPU (e.g., memory, I/O, or locking). While Continuous
Profilers differ in the aforementioned aspects, they all share the common function of
providing low-overhead, sample-based profiling for various language runtimes, along
with remote stack trace storage for web-based search and query capability.
Where is Continuous Profiling headed? Thomas Dullien, co-founder of Optimyze
which developed the innovative Continuous Profiler Prodfiler, delivered the Keynote at
QCon London 2023 in which he expressed his wish for a cluster-wide tool that could
answer the questions, “Why is this request slow?” or “Why is this request expensive?”
In a multithreaded application, one particular function may show up on a profile as
the highest CPU and memory consumer, yet its duties might be completely outside an
157 Differential flamegraph - https://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
Chapter Summary
• We gave a quick overview of the most popular tools available on three major
platforms: Linux, Windows, and MacOS. Depending on the CPU vendor, the
choice of a profiling tool will vary. For systems with an Intel processor we
recommend using VTune; for systems with an AMD processor use uProf; on
Apple platforms use Xcode Instruments.
• Linux perf is probably the most frequently used profiling tool on Linux. It has
support for processors from all major CPU vendors. It doesn’t have a graphical
interface. However, there are tools that can visualize perf’s profiling data.
• We also discussed Windows Event Tracing (ETW), which is designed to ob-
serve software dynamics in a running system. Linux has a similar tool called
Part 2. Source Code Tuning
A modern CPU is a very complicated device, and it’s nearly impossible to predict how
fast certain pieces of code will run. Software and hardware performance depends on
many factors, and the number of moving parts is too big for a human mind to contain.
Fortunately, observing how your code runs from a CPU perspective is possible thanks to
all the performance monitoring capabilities we discussed in the first part of the book.
We will extensively rely on methods and tools we learned about earlier in the book to
guide our performance engineering process.
At a very high level, software optimizations can be divided into five categories.
Many optimizations that we discuss in this book fall under multiple categories. For
example, we can say that vectorization is a combination of parallelizing and batching;
and loop blocking (tiling) is a manifestation of batching and eliminating redundant
work.
To complete the picture, let us also list other possibly obvious but still quite reasonable
Algorithmic Optimizations
Standard algorithms and data structures don’t always work well for performance-
critical workloads. For example, traditionally, every new node of a linked list is
dynamically allocated. Besides potentially invoking many costly memory allocations,
this will likely result in a situation where all the elements of the list are scattered
in memory. Traversing such a data structure is not cache-friendly. Even though
algorithmic complexity is still O(N), in practice, the timings will be much worse than
those of a plain array. Some data structures, like binary trees, have a natural linked-
list-like representation, so it might be tempting to implement them in a pointer-chasing
manner. However, more efficient “flat” versions of those data structures exist, e.g.,
boost::flat_map and boost::flat_set.
When selecting an algorithm for a problem at hand, you might quickly pick the most
popular option and move on, even though it might not be the best for your particular
case. Let’s assume you need to find an element in a sorted array. The first option that
most developers consider is binary search, right? It is very well-known and is optimal
in terms of algorithmic complexity, O(logN). Will you change your decision if I say
that the array holds 32-bit integer values and the size of an array is usually very small
(less than 20 elements)? In the end, measurements should guide your decision, but
binary search suffers from branch mispredictions since every test of the element value
has a 50% chance of being true. This is why on a small array, a linear scan is usually
faster even though it has worse algorithmic complexity.159
159 In addition, linear scan does not require elements to be sorted.
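To make the trade-off concrete, here is a minimal sketch of both options; for arrays
of a dozen or so elements, the simple scan is often faster despite its worse
complexity because its loop branch is easy to predict:
#include <algorithm>
#include <cstddef>

bool contains_binary(const int* a, size_t n, int key) {
  return std::binary_search(a, a + n, key); // O(log N), hard-to-predict comparisons
}
bool contains_linear(const int* a, size_t n, int key) {
  for (size_t i = 0; i < n; ++i)            // O(N), but predictable control flow
    if (a[i] == key) return true;
  return false;
}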
Data-Driven Development
The main idea in Data-Driven Development (DDD), is to study how a program accesses
data: how it is laid out in memory and how it is transformed throughout the program;
then modify the program accordingly (change the data layout, change the access
patterns). A classic example of such an approach is the Array-of-Structures (AOS) to
Structure-of-Array (SOA) transformation, which is shown in Listing 7.2.
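In rough terms, the transformation looks like this; the struct and field names below
are placeholders rather than the exact code from the listing:
// Array-of-Structures (AOS): all fields of one element are adjacent in memory.
struct S { int a; int b; int c; };
S aos[1000];

// Structure-of-Arrays (SOA): each field gets its own contiguous array.
struct SoA {
  int a[1000];
  int b[1000];
  int c[1000];
};
SoA soa;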
The answer to the question of which layout is better depends on how the code is
accessing the data. If it iterates over the data structure S and only accesses field
b, then SOA is better because all memory accesses will be sequential. However, if
a program iterates over the data structure and does extensive operations on all the
fields of the object, then AOS may give better memory bandwidth utilization and in
some cases, better performance. In the AOS scenario, members of the struct are likely
to reside in the same cache line, and thus require fewer cache line reads and use less
cache space. But more often, we see SOA gives better performance as it enables other
important transformations, for example, vectorization.
Another widespread example of DDD is small-size optimization. The idea is to
statically preallocate some amount of memory to avoid dynamic memory allocations.
It is especially useful for small and medium-sized containers when the upper limit of
elements can be well-predicted. Modern C++ STL implementations of std::string
keep the first 15–20 characters in an internal buffer allocated in the stack memory and
allocate memory on the heap only for longer strings. Other instances of this approach
can be found in LLVM’s SmallVector and Boost’s static_vector.
Low-Level Optimizations
Performance engineering is an art. And like in any art, the set of possible scenarios
is endless. It’s impossible to cover all the various optimizations one can imagine.
The chapters in Part 2 primarily address optimizations specific to modern CPU
architectures.
Before we jump into particular source code tuning techniques, there are a few caution
notes to make. First, avoid tuning bad code. If a piece of code has a high-level
performance inefficiency, you shouldn’t apply machine-specific optimizations to it.
Always focus on fixing the major problem first. Only once you’re sure that the
algorithms and data structures are optimal for the problem you’re trying to solve
should you try applying low-level improvements.
Second, remember that an optimization you implement might not be beneficial on
every platform. For example, Loop Blocking (tiling), which is discussed in Section 9.3.2,
8 Optimizing Memory Accesses
Modern computers are still being built based on the classical Von Neumann architecture
which decouples CPU, memory, and input/output units. Nowadays, operations with
memory (loads and stores) account for the largest portion of performance bottlenecks
and power consumption. It is no surprise that we start with this category.
The statement that memory hierarchy performance is critical can be illustrated by
Figure 8.1. It shows the growth of the gap in performance between memory and
processors. The vertical axis is on a logarithmic scale and shows the growth of the
CPU-DRAM performance gap. The memory baseline is the latency of memory access
of 64 KB DRAM chips from 1980. Typical DRAM performance improvement is 7%
per year, while CPUs enjoy 20-50% improvement per year. According to this picture,
processor performance has plateaued, but even then, the gap in performance remains.
[Hennessy & Patterson, 2017]
Figure 8.1: The gap in performance between memory and processors. © Source:
[Hennessy & Patterson, 2017].
A variable can be fetched from the smallest L1 cache in just a few clock cycles, but
it can take more than three hundred clock cycles to fetch a variable from DRAM if
it is not in the CPU cache. From a CPU perspective, a last-level cache miss feels
like a very long time, especially if the processor is not doing any useful work during
that time. Execution threads may also be starved when the system is highly loaded
with threads accessing memory at a very high rate and there is no available memory
bandwidth to satisfy all loads and stores promptly.
When an application executes a large number of memory accesses and spends significant
time waiting for them to finish, such an application is characterized as being bound
by memory. This means that to further improve its performance, we likely need to
improve how we access memory, reduce the number of such accesses, or upgrade the
memory subsystem itself.
In the TMA methodology, the Memory Bound metric estimates a fraction of slots where
a CPU pipeline is likely stalled due to demand for load or store instructions. The
first step to solving such a performance problem is to locate the memory accesses that
contribute to the high Memory Bound metric (see Section 6.1.1). Once a troublesome
memory access issue is identified, several optimization strategies might be applied. In
this chapter, we will discuss techniques to improve memory access patterns.
8.1 Cache-Friendly Data Structures
The example presented above is classical, but usually, real-world applications are much
more complicated than this. Sometimes you need to go the extra mile to write
cache-friendly code. If the data is not laid out in memory in a way that is optimal for
the algorithm, you may need to rearrange the data first.
Consider a standard implementation of binary search in a large sorted array, where
on each iteration, you access the middle element, compare it with the value you’re
searching for, and go either left or right. This algorithm does not exploit spatial
locality since it tests elements in different locations that are far away from each other
and do not share the same cache line. The most famous way of solving this problem
is storing elements of the array using the Eytzinger layout [Khuong & Morin, 2015].
The idea is to maintain an implicit binary search tree packed into an array using the
BFS-like layout, usually seen with binary heaps. If the code performs a large number
of binary searches in the array, it may be beneficial to convert it to the Eytzinger
layout.
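A minimal sketch of the idea is shown below, assuming 1-based indexing where the node
at index k has its children at indices 2k and 2k+1; the output array must have one
extra element because index 0 is unused:
#include <cstddef>
#include <vector>

// Recursively place elements of a sorted array into the Eytzinger (BFS) order.
// Call with: std::vector<int> out(sorted.size() + 1); size_t next = 0;
//            eytzinger(sorted, out, next);
void eytzinger(const std::vector<int>& sorted, std::vector<int>& out,
               size_t& next, size_t k = 1) {
  if (k < out.size()) {
    eytzinger(sorted, out, next, 2 * k);     // fill the left subtree first
    out[k] = sorted[next++];
    eytzinger(sorted, out, next, 2 * k + 1); // then the right subtree
  }
}

// Returns the index (in out) of the smallest element >= x, or 0 if there is none.
size_t lowerBoundEytzinger(const std::vector<int>& out, int x) {
  size_t k = 1, candidate = 0;
  while (k < out.size()) {
    if (out[k] >= x) { candidate = k; k = 2 * k; } // remember candidate, go left
    else             { k = 2 * k + 1; }            // go right
  }
  return candidate;
}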
S. This greatly reduces the amount of memory transferred back and forth and saves
cache space. However, using bitfields comes with additional costs.160 Since the bits of
a, b, and c are packed into a single byte, the compiler needs to perform additional bit
manipulation operations to extract and insert them. For example, to load b, you need
to shift the byte value right (>>) by 2 and do logical AND (&) with 0x3. Similarly,
shift left (<<) and logical OR (|) operations are needed to store the updated value
back into the packed format. Data packing is beneficial in places where additional
computation is cheaper than the delay caused by inefficient memory transfers.
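As a sketch, assuming 2-bit fields (with this layout, loading b is indeed a shift
right by 2 followed by a mask with 0x3):
// Unpacked: each field occupies a full 4-byte integer (12 bytes in total).
struct S_plain  { unsigned a; unsigned b; unsigned c; };

// Packed: a, b, and c share a single byte, at the cost of extra bit manipulation.
struct S_packed {
  unsigned char a : 2; // bits 0-1
  unsigned char b : 2; // bits 2-3
  unsigned char c : 2; // bits 4-5
};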
Also, a programmer can reduce memory usage by rearranging fields in a struct or
class when it avoids padding added by a compiler. Inserting unused bytes of memory
(pads) enables efficient storing and fetching of individual members of a struct. In the
example in Listing 8.3, the size of S can be reduced if its members are declared in the
order of decreasing size. Figure 8.3 illustrates the effect of rearranging the fields in
struct S.
Figure 8.3: Avoid compiler padding by rearranging the fields. Blank cells represent compiler
padding.
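A small sketch of the effect; the member names are arbitrary and the sizes assume a
typical 64-bit ABI:
struct S_before { // 1 + 3 (padding) + 4 + 2 + 2 (padding) = 12 bytes
  bool  b;
  int   i;
  short s;
};
struct S_after {  // 4 + 2 + 1 + 1 (padding) = 8 bytes: members in decreasing size order
  int   i;
  short s;
  bool  b;
};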
The problem with the organization of the Soldier struct in the code on the left is
that the fields are not grouped according to the phases of the game. For example,
during the battle phase, the program needs to access two different cache lines to fetch
the required fields. The fields attack and defense are very likely to reside on the
same cache line, but the health field is always pushed to the next cache line. The
same applies to the movement phase (speed and coords fields).
We can make the Soldier struct more cache-friendly by reordering the fields as shown
in Listing 8.4 on the right. With that change, the fields that are accessed together are
grouped together.
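A sketch of the kind of grouping described above; the field types, and any fields
beyond those named in the text, are assumptions:
struct Soldier {
  // battle phase: fields accessed together
  int attack;
  int defense;
  int health;
  // movement phase: fields accessed together
  float speed;
  float coordX, coordY, coordZ;
  // rarely accessed fields go last
  // ...
};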
Since Linux kernel 6.8, there is new functionality in the perf tool that allows you
to find data structure reordering opportunities. The perf mem record command
can now be used to profile data structure access patterns. The perf annotate
--data-type command will show you the data structure layout along with profiling
samples attributed to each field of the data structure. Using this information you can
identify fields that are accessed together.161
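A minimal invocation sketch (the binary name is a placeholder; perf annotate --data-type requires the perf tool from Linux 6.8 or newer):

$ perf mem record ./a.out       # sample loads/stores together with their data addresses
$ perf annotate --data-type     # break the samples down by data type and field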
Data-type profiling is very effective at finding opportunities to improve cache utiliza-
tion. Recent Linux kernel history contains many examples of commits that reorder
structures,162 pad fields,163 or pack164 them to improve performance.
Pointer inlining. Inlining data that sits behind a pointer into its parent structure can improve cache utilization. For example, if a structure contains a pointer to another structure, you can move a frequently accessed field of the pointed-to structure into the parent. This way, you avoid the additional memory access needed to fetch that field. An example of pointer inlining is shown in Listing 8.6. The weight field is used in many graph algorithms, and thus it is frequently accessed. However, in the original version, retrieving the edge weight requires an additional memory access, which can result in a cache miss. By moving the weight field into the GraphEdge structure, we avoid this issue.
Listing 8.6 Moving the weight parameter into the parent structure.
// Original layout:
struct GraphEdge {
  unsigned int from;
  unsigned int to;
  GraphEdgeProperties* prop;
};
struct GraphEdgeProperties {
  float weight;
  std::string label;
  // ...
};
// After moving weight into GraphEdge:
struct GraphEdge {
  unsigned int from;
  unsigned int to;
  float weight;
  GraphEdgeProperties* prop;
};
struct GraphEdgeProperties {
  std::string label;
  // ...
};
8.2 Explicit Memory Management
by the OS and design your own allocation strategy on top of that region. One simple strategy could be to divide that region into two parts: one for hot data and one for cold data, and to provide two allocation methods, each tapping into its own arena.
Keeping hot data together creates opportunities for better cache utilization. It is also
likely to improve TLB utilization since hot data will be more compact and will occupy
fewer memory pages.
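A minimal bump-allocator sketch of that hot/cold split (the names, the alignment, and the fallback policy are assumptions, not an implementation from the book):

#include <cstddef>
#include <cstdint>

struct Arena {
  std::byte* cur = nullptr;   // next free byte inside the reserved region
  std::byte* end = nullptr;   // one past the end of the region

  void* alloc(std::size_t size, std::size_t align = 16) {
    auto p = (reinterpret_cast<std::uintptr_t>(cur) + align - 1) & ~(align - 1);
    if (p + size > reinterpret_cast<std::uintptr_t>(end))
      return nullptr;         // arena exhausted; a real allocator would fall back
    cur = reinterpret_cast<std::byte*>(p + size);
    return reinterpret_cast<void*>(p);
  }
};

Arena hotArena;    // initialized to point into one half of the mmap-ed region
Arena coldArena;   // ...and into the other half

void* allocHot (std::size_t n) { return hotArena.alloc(n); }
void* allocCold(std::size_t n) { return coldArena.alloc(n); }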
Another cost of dynamic memory allocation appears when an application uses multiple threads. When two threads try to allocate memory at the same time, the allocator has to synchronize them. In a highly concurrent application, threads may spend a significant amount of time waiting on a common lock to allocate memory.
The same applies to memory deallocation. Again, custom allocators can help to avoid
this problem, for example, by employing a separate arena for each thread.
There are many drop-in replacements for the standard dynamic memory allocation
routines (malloc and free) that are faster, more scalable, and address fragmentation
problems better. Some of the most popular memory allocation libraries are jemalloc166
and tcmalloc167. Some projects have adopted jemalloc or tcmalloc as their default memory allocator and have seen significant performance improvements.
Finally, some costs of dynamic memory allocation are hidden168 and cannot be easily
measured. In all major operating systems, the pointer returned by malloc is just a
promise – the OS commits that when pages are touched it will provide the required
memory, but the actual physical pages are not allocated until the virtual addresses
are accessed. This is called demand paging, which incurs a cost of a minor page fault
for every newly allocated page. We discuss how to mitigate this cost in Section 12.3.1.
Also, for security reasons, all modern operating systems erase the contents (write zeros)
of a page before giving it to the next process. The OS maintains a pool of zeroed pages
to have them ready for allocation. But when this pool runs out of available zeroed
pages, the OS has to zero a page on demand. This process isn’t super expensive, but
it isn’t free either and may increase the latency of a memory allocation call.
2/10/hidden-costs-of-memory-allocation/.
8.3 Workaround Memory Bandwidth Limitations
memory-demanding threads working at the same time, so usually you need multiple
threads to saturate the memory bandwidth. Emerging AI workloads are known to
be extremely “memory hungry” and highly parallelized, so memory bandwidth is the
number one bottleneck for them.
The first step in addressing memory bandwidth limitations is to determine the maxi-
mum theoretical and expected memory bandwidth. The theoretical maximum memory
bandwidth can be calculated from the memory technology specifications as we have
shown in Section 5.5. The expected memory bandwidth can be measured using tools
like Intel Memory Latency Checker or lmbench, which we discussed in Section 4.10.
Intel VTune can automatically measure memory bandwidth before the analyzed
application starts.
The second step is to measure the memory bandwidth utilization while your application
is running. If the amount of memory traffic is close to the maximum measured
bandwidth, then the performance of your application is likely to be bound by memory
bandwidth. It is a good idea to plot the memory bandwidth utilization over time to
see if there are different phases where memory intensity spikes or takes a dip. Intel
VTune can provide such a chart if you tick the “Evaluate max DRAM bandwidth”
checkbox in the analysis configuration.
If you have determined that your application is memory bandwidth bound, the first
suggestion is to see if you can decrease the memory intensity of your application. It is
not always possible, but you can consider disabling some memory-hungry features of
your application, recomputing data on the fly instead of caching results, or compressing
your data. In the AI space, most Large Language Models (LLMs) are supplied in fp32
precision, which means that each parameter takes 4 bytes. The biggest performance
gain can be achieved with quantization techniques, which reduce the precision of
the parameters to fp16 or int8. This will reduce the memory traffic by 2x or 4x,
respectively. Sometimes, 4-bit and even 5-bit quantization is used, to reduce memory
traffic and strike the right balance between inference performance and quality.
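For a rough sense of scale, a hypothetical model with 7 billion parameters occupies about 28 GB in fp32, about 14 GB in fp16, and about 7 GB in int8, so every pass over the weights moves correspondingly less data across the memory bus.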
It is important to mention that for workloads that saturate the available memory bandwidth, code optimizations don't play as big a role as they do for compute-bound workloads. For compute-bound applications, code optimizations like vectorization usually translate into large performance gains. However, for memory-bound workloads, vectorization may not have a similar effect since the processor can't make forward progress, simply because it lacks data to work with. We cannot make the memory bus run faster; this is why memory bandwidth is often a hard limitation to overcome.
Finally, if all the options have been exhausted, and the memory bandwidth is still a
limitation, the only way to improve the situation is to buy better hardware. You can
invest money in a server with more memory channels, or DRAM modules with faster
transfer speed. This can be expensive, but it is still a viable option to speed up your application.
8.4 Reducing DTLB Misses
to calculate the correct physical address for each referenced virtual address. In a
system with a 5-level page table, it will require accessing at least 5 different memory
locations to obtain an address translation. In Section 11.8 we will discuss how huge pages can be used for code; here we will see how they can be used for data.
Any algorithm that does random accesses into a large memory region will likely suffer
from DTLB misses. Examples of such applications are binary search in a big array,
accessing a large hash table, and traversing a graph. The usage of huge pages has the
potential to speed up such applications.
On x86 platforms, the default page size is 4KB. Consider an application that frequently
references memory space of 20 MBs. With 4KB pages, the OS needs to allocate many
small pages. Also, the process will be touching many 4KB-sized pages, each of which
will contend for a limited number of TLB entries. In contrast, using huge 2MB pages,
20MB of memory can be mapped with just ten pages, whereas with 4KB pages, you
would need 5120 pages. This means fewer TLB entries are needed when using huge
pages, which in turn reduces the number of TLB misses. However, it will not be a proportional reduction by a factor of 512, since a TLB has far fewer entries for 2MB pages. For example,
in Intel’s Skylake core families, L1 DTLB has 64 entries for 4KB pages and only 32
entries for 2MB pages. Besides 2MB huge pages, x86-based chips from AMD and Intel
also support 1GB gigantic pages for data, but not for instructions. Using 1GB pages
instead of 2MB pages reduces TLB pressure even more.
Utilizing huge pages typically leads to fewer page walks, and the penalty for walking
the kernel page table in the event of a TLB miss is reduced since the table itself is
more compact. Performance gains from utilizing huge pages can sometimes go as
high as 30%, depending on how much TLB pressure an application is experiencing.
Expecting 2x speedups would be asking too much, as it is quite rare that TLB misses
are the primary bottleneck. The paper [Luo et al., 2015] presents the evaluation of
using huge pages on the SPEC2006 benchmark suite. Results can be summarized
as follows. Out of 29 benchmarks in the suite, 15 have a speedup within 1%, which
can be discarded as noise. Six benchmarks have speedups in the range of 1%-4%.
Four benchmarks have speedups in the range from 4% to 8%. Two benchmarks have
speedups of 10%, and the two benchmarks that gain the most enjoyed 22% and 27%
speedups respectively.
Many real-world applications already take advantage of huge pages, for example,
KVM, MySQL, PostgreSQL, Java’s JVM, and others. Usually, those software packages
provide an option to enable that feature. Whenever you’re using a similar application,
check its documentation to see if you can enable huge pages.
Both Windows and Linux allow applications to establish huge-page memory regions.
Instructions on how to enable huge pages for Windows and Linux can be found in
Appendix B. On Linux, there are two ways of using huge pages in an application:
Explicit and Transparent Huge Pages. Windows support is not as rich as Linux and
will be discussed later.
Explicit Huge Pages (EHPs) must be reserved at system boot time or before an application starts. See Appendix B for instructions on how to do that. Reserving EHPs at boot time increases the possibility of successful
allocation because the memory has not yet been significantly fragmented. Explicitly
preallocated pages reside in a reserved chunk of physical memory and cannot be
swapped out under memory pressure. Also, this memory space cannot be used for
other purposes, so users should be careful and reserve only the number of pages they
require.
The simplest method of using EHP in a Linux application is to call mmap with
MAP_HUGETLB as shown in Listing 8.7. In this code, the pointer ptr will point to a
2MB region of memory that was explicitly reserved for EHPs. Notice that the allocation may fail if EHPs were not reserved in advance. Other less popular ways to use
EHPs in user code are provided in Appendix B. Also, developers can write their own
arena-based allocators that tap into EHPs.
Listing 8.7 Mapping a memory region from an explicitly allocated huge page.
void* ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (ptr == MAP_FAILED)
throw std::bad_alloc{};
...
munmap(ptr, size);
In the past, there was an option to use the libhugetlbfs169 library, which overrode the malloc calls of existing dynamically linked executables to allocate memory in EHPs. It didn't require users to modify the code or to relink the binary; they could simply prepend LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes to the command line of their application. Unfortunately, this project is no longer maintained. Luckily, other libraries enable the use of huge pages (not EHPs) with malloc, which we will discuss next.
When THP is enabled system-wide, huge pages are used automatically for normal mem-
ory allocations, without an explicit request from applications. To observe the effect of
huge pages on their application, a user just needs to enable system-wide THPs with
echo "always" | sudo tee /sys/kernel/mm/transparent_hugepage/enabled.
It will automatically launch a daemon process named khugepaged which starts
scanning the application’s memory space to promote regular pages to huge pages.
Sometimes the kernel may fail to combine multiple regular pages into a huge page in
case it cannot find a contiguous 2MB chunk of memory.
System-wide THPs mode is good for quick experiments to check if huge pages can
improve performance. It works automatically, even for applications that are not aware
of THPs, so developers don’t have to change the code to see the benefit of huge pages
for their application. When huge pages are enabled system-wide, applications may
end up allocating more memory resources than needed. This is why the system-wide
mode is disabled by default. Don’t forget to disable system-wide THPs after you’ve
finished your experiments as it may hurt overall system performance.
With the madvise (per-process) option, THP is enabled only inside memory regions
attributed via the madvise system call with the MADV_HUGEPAGE flag. As shown in
Listing 8.8, the pointer ptr will point to a 2MB region of the anonymous (transparent)
memory region, which the kernel allocates dynamically. The mmap call will fail if the
kernel cannot find a contiguous 2MB chunk of memory.
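A sketch in the spirit of Listing 8.8, which is not reproduced here (error handling mirrors Listing 8.7; the alignment remark is a practical note, not a quote from the book):

#include <sys/mman.h>
#include <new>

size_t size = 2 * 1024 * 1024;      // one 2MB huge page worth of memory
void* ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (ptr == MAP_FAILED)
  throw std::bad_alloc{};
madvise(ptr, size, MADV_HUGEPAGE);  // ask the kernel to back this region with THPs
// in practice the region should also be aligned to the 2MB boundary
...
munmap(ptr, size);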
Developers can build custom THP allocators based on the code in Listing 8.8. It is also possible to use THPs for the malloc calls that an application makes. Many memory allocation libraries provide that feature by overriding the libc
implementation of malloc. Here is an example of using jemalloc, which is one of the
most popular options. If you have access to the source code of the application, you
can relink the binary with an additional -ljemalloc option. This will dynamically
link your application against the jemalloc library, which will handle all the malloc
calls. Then use the following option to enable THPs for heap allocations:
$ MALLOC_CONF="thp:always" <your app command line>
If you don’t have access to the source code, you can still make use of jemalloc by
preloading the dynamic library:
$ LD_PRELOAD=/usr/local/libjemalloc.so.2 MALLOC_CONF="thp:always" <your app command line>
Windows offers huge pages only in a way similar to the Linux per-process THP mode, via the VirtualAlloc system call. See details in Appendix B.
8.5 Explicit Memory Prefetching
OOO engine looks N instructions into the future and issues loads early to enable the
smooth execution of future instructions that will demand this data.
Hardware prefetchers fail when data access patterns are too complicated to predict, and there is nothing software developers can do about it, as we cannot control the behavior of this unit. On the other hand, an OOO engine does not try to predict memory locations that will be needed in the future as a hardware prefetcher does. So, the only measure of its success is how much load latency it was able to hide by scheduling loads in advance.
Consider a small snippet of code in Listing 8.9, where arr is an array of one million
integers. The index idx, which is assigned to a random value, is immediately used to
access a location in arr, which almost certainly misses in caches as it is random. A
hardware prefetcher can’t predict it since every time the load goes to a completely new
place in memory. The interval from the time the address of a memory location is known
(returned from the function random_distribution) until the value of that memory
location is demanded (the call to doSomeExtensiveComputation) is called the prefetching window. In this example, the OOO engine doesn't have the opportunity to issue the load early since the prefetching window is very small. As a result, the latency of the memory access arr[idx] stands on the critical path of the loop, as shown in Figure 8.4. The program waits for the value to come back (the hatched rectangle in the diagram) without making forward progress.
You’re probably thinking: “But the next iteration of the loop should start executing
speculatively in parallel”. That’s true, and indeed, it is reflected in Figure 8.4. The
doSomeExtensiveComputation function requires a lot of work, and when execution
gets closer to the finish of the first iteration, a CPU speculatively starts executing
instructions from the next iteration. It creates a positive overlap in the execution
between iterations. In fact, we presented an optimistic scenario where a processor
was able to generate the next random number and issue a load in parallel with the
previous iteration of the loop. However, a CPU wasn’t able to fully hide the latency
of the load, because it cannot look that far ahead of the current iteration to issue the
load early enough. Maybe future processors will have more powerful OOO engines,
but for now, there are cases where a programmer’s intervention is needed.
Luckily, it’s not a dead end as there is a way to speed up this code by fully overlapping
the load with the execution of doSomeExtensiveComputation, which will hide the
latency of a cache miss. We can achieve this with techniques called software pipelining
and explicit memory prefetching. Implementation of this idea is shown in Listing 8.10.
We pipeline generation of random numbers and start prefetching memory location for
the next iteration in parallel with doSomeExtensiveComputation.
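A hedged sketch of the pipelined loop (Listing 8.10 itself is not reproduced; random_distribution, generator, arr, and doSomeExtensiveComputation are the names used in the text, their exact signatures are assumptions):

int idx = random_distribution(generator);         // index for the first iteration
for (int i = 0; i < N; ++i) {
  int nextIdx = random_distribution(generator);   // software pipelining: compute the
  __builtin_prefetch(&arr[nextIdx]);              // next index early and prefetch its data
  doSomeExtensiveComputation(arr[idx]);           // the cache miss is overlapped with this work
  idx = nextIdx;
}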
A graphical illustration of this transformation is shown in Figure 8.5. We utilized software pipelining to generate random numbers for the next iteration: while the processor is busy with doSomeExtensiveComputation for the current element, we already compute the index and prefetch the data for the next one.
Figure 8.4: Execution timeline that shows the load latency standing on a critical path.
Figure 8.5: Hiding the cache miss latency by overlapping it with other execution.
Notice the usage of __builtin_prefetch,171 a special hint that developers can use to
explicitly request a CPU to prefetch a certain memory location. Another option is to
use compiler intrinsics: on x86 platforms there is the _mm_prefetch intrinsic, and on ARM platforms there is the __pld intrinsic. Compilers lower these hints to the corresponding prefetch instructions (e.g., PREFETCHT0 on x86 and PLD/PRFM on ARM).
171 GCC builtins - https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html.
There are situations when software memory prefetching is not possible. For example,
when traversing a linked list, the prefetching window is tiny and it is not possible to
hide the latency of pointer chasing.
In Listing 8.10 we saw an example of prefetching for the next iteration, but you may also frequently need to prefetch 2, 4, 8, or sometimes even more iterations ahead. The code in Listing 8.11 is one of the cases where that could be beneficial. It
presents a typical code for populating a graph with edges. If the graph is very sparse
and has a lot of vertices, it is very likely that accesses to this->out_neighbors and
this->in_neighbors vectors will frequently miss in caches. This happens because
every edge is likely to connect new vertices that are not currently in caches.
This code is different from the previous example as there are no extensive computations
on every iteration, so the penalty of cache misses likely dominates the latency of each
iteration. But we can leverage the fact that we know all the elements that will be
accessed in the future. The elements of vector edges are accessed sequentially and
thus are likely to be timely brought to the L1 cache by the hardware prefetcher. Our
goal here is to overlap the latency of a cache miss with executing enough iterations to
completely hide it.
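A hedged sketch of the look-ahead prefetching discussed here, roughly in the shape of Listing 8.11 (the Edge type and the vector-of-vectors adjacency lists are assumptions):

template <int lookAhead = 16>
void Graph::addEdges(const std::vector<Edge>& edges) {
  for (size_t i = 0; i < edges.size(); ++i) {
    if (i + lookAhead < edges.size()) {
      // Touch the adjacency lists that will be needed lookAhead iterations from now.
      __builtin_prefetch(&this->out_neighbors[edges[i + lookAhead].from]);
      __builtin_prefetch(&this->in_neighbors [edges[i + lookAhead].to]);
    }
    this->out_neighbors[edges[i].from].push_back(edges[i].to);
    this->in_neighbors [edges[i].to].push_back(edges[i].from);
  }
}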
As a general rule, for a prefetch hint to be effective, it must be inserted well ahead
of time so that by the time the loaded value is used in other calculations, it will be
already in the cache. However, it also shouldn’t be inserted too early since it may
pollute the cache with data that is not used for a long time. Notice, in Listing 8.11,
lookAhead is a template parameter, which enables a programmer to try different
values and see which gives the best performance. More advanced users can try to
estimate the prefetching window using the method described in Section 6.2.7; an
example of using this method can be found on the Easyperf blog.172
Explicit memory prefetching is most frequently used in loops, but you can also insert these hints into a parent function; again, it all depends on the available prefetching window.
172 “Precise timing of machine code with Linux perf” - https://easyperf.net/blog/2019/04/03/Precise-timing-of-machine-code-with-Linux-perf#application-estimating-prefetch-window.
This technique is a powerful weapon; however, it should be used with extreme care as it is not easy to get it right. First of all, explicit memory prefetching is not portable, meaning that if it gives performance gains on one platform, it doesn't guarantee similar speedups on another platform. It is very implementation-specific, and platforms are not required to honor those hints; in that case, the extra instructions will likely degrade performance.
My recommendation would be to verify that the impact is positive with all available
tools. Not only check the performance numbers but also make sure that the number
of cache misses (L3 in particular) went down. Once the change is committed into the
code base, monitor performance on all the platforms that you run your application on,
as it could be very sensitive to changes in the surrounding code. Consider dropping
the idea if the benefits do not outweigh the potential maintenance burden.
For more complicated scenarios, make sure that the code prefetches the right memory locations. It can get tricky when the current iteration of a loop depends on the previous one, e.g., there is a continue statement, or the choice of the next element to be processed is guarded by an if condition. In this case, my recommendation is to instrument the code to test the accuracy of your prefetching hints, because when used badly, prefetching can degrade the performance of caches by evicting other useful data.
Finally, explicit prefetching increases code size and adds pressure on the CPU Frontend.
A prefetch hint is just a fake load that goes into the memory subsystem but does not
have a destination register. And just like any other instruction, it consumes CPU
resources. Apply it with extreme care, because when used wrong, it can pessimize the
performance of a program.
Chapter Summary
• Most real-world applications experience memory-related performance bottlenecks.
Emerging application domains, such as machine learning and big data, are
particularly demanding in terms of memory bandwidth and latency.
• The performance of the memory subsystem is not growing as fast as the CPU
performance. Yet, memory accesses are a frequent source of performance problems in many applications. Speeding up such programs requires revising the way they access memory.
9 Optimizing Computations
In the previous chapter, we discussed how to clear the path for efficient memory
access. Once that is done, it’s time to look at how well a CPU works with the
data it brings from memory. Modern applications demand a large amount of CPU computation, especially applications involving complex graphics, artificial intelligence, cryptocurrency mining, and big data processing. In this chapter, we will focus on optimizations that reduce the amount of computational work a CPU needs to do and improve the overall performance of a program.
When the TMA methodology is applied, inefficient computations are usually reflected
in the Core Bound and, to some extent, in the Retiring categories. The Core Bound
category represents all the stalls inside a CPU out-of-order execution engine that were
not caused by memory issues. There are two main categories:
• Data dependencies between software instructions are limiting the performance.
For example, a long sequence of dependent operations may lead to low Instruction
Level Parallelism (ILP) and wasting many execution slots. The next section
discusses data dependency chains in more detail.
• A shortage in hardware computing resources. This indicates that certain execu-
tion units are overloaded (also known as execution port contention). This can
happen when a workload frequently performs many instructions of the same type.
For example, AI algorithms typically perform a lot of multiplications. Scientific
applications may run many divisions and square root operations. However, there
is a limited number of multipliers and dividers in any given CPU core. Thus
when port contention occurs, instructions queue up waiting for their turn to be
executed. This type of performance bottleneck is very specific to a particular
CPU microarchitecture and usually doesn’t have a cure.
In this chapter, we will take a look at well-known techniques like function inlining,
vectorization, and loop optimizations. Those code transformations aim to reduce the
total amount of executed instructions or replace them with more efficient ones.
9.1 Data Dependencies
When long data dependencies do come up, processors are forced to execute code
sequentially, utilizing only a part of their full capabilities. Long dependency chains
hinder parallelism, which defeats the main advantage of modern superscalar CPUs.
For example, pointer chasing doesn’t benefit from OOO execution and thus will run
at the speed of an in-order CPU. As we will see in this section, dependency chains are
a major source of performance bottlenecks.
You cannot eliminate data dependencies; they are a fundamental property of programs.
Any program takes an input to compute something. In fact, people have developed
techniques to discover data dependencies among statements and build data flow graphs.
This is called dependence analysis and is more appropriate for compiler developers,
rather than performance engineers. We are not interested in building data flow graphs
for the whole program. Instead, we want to find a critical dependency chain in a hot
piece of code, such as a loop or function.
You may wonder: “If you cannot get rid of dependency chains, what can you do?”
Well, sometimes this will be a limiting factor for performance, and unfortunately,
you will have to live with it. But there are cases where you can break unnecessary
data dependency chains or overlap their execution. One such example is shown in
Listing 9.1. Similar to a few other cases, we present the source code on the left
along with the corresponding ARM assembly on the right. Also, this code example is
included in the dep_chains_2 lab assignment of the Performance Ninja online course,
so you can try it yourself.173
This small program simulates random particle movement. We have 1000 particles
moving on a 2D surface without constraints, which means they can go as far from their
starting position as they want. Each particle is defined by its x and y coordinates on
a 2D surface and speed. The initial x and y coordinates are in the range [-1000,1000]
and the speed is in the range [0,1], which doesn’t change. The program simulates 1000
movement steps for each particle. For each step, we use a random number generator
(RNG) to produce an angle, which sets the movement direction for a particle. Then
we adjust the coordinates of a particle accordingly.
Given the task at hand, you decide to roll your own RNG, sine, and cosine functions
to sacrifice some accuracy and make it as fast as possible. After all, this is random
movement, so it is a good trade-off to make. You choose a medium-quality XorShift
RNG as it only has 3 shifts and 3 XORs inside. What can be simpler? Also, you
searched the web and found algorithms for sine and cosine approximation using
polynomials, which are accurate enough and very fast.
I compiled the code using the Clang-17 C++ compiler and ran it on a Mac mini
173 Performance Ninja: Dependency Chains 2 - https://github.com/dendibakh/perf-ninja/tree/main/labs/core_bound/dep_chains_2
(Apple M1, 2020). Let us examine the generated ARM assembly code:
• The first three eor (exclusive OR) instructions combined with lsl (shift left) or
lsr (shift right) correspond to the XorShift32::gen function.
• Next ucvtf (convert unsigned integer to floating-point) and fmul (floating-point
multiply) are used to convert the angle from degrees to radians (line 35 in the
code).
• Sine and Cosine functions both have two fmul instructions and one fmadd
(floating-point fused multiply-add) instruction. Cosine also has an additional
fadd (floating-point add) instruction.
• Finally, we have one more pair of fmadd instructions to calculate x and y
respectively, and an stp instruction to store the pair of coordinates.
You expect this code to “fly”; however, there is one very nasty performance problem that slows down the program. Without looking ahead in the text, can you find it?
The code that calculates the coordinates of particle N is not dependent on particle N-1,
so it could be beneficial to pull them left to overlap their execution even more. You
probably want to ask: “But how can those three (or six) instructions drag down the
performance of the whole loop?”. Indeed, there are many other “heavy” instructions
in the loop, like fmul and fmadd. However, they are not on the critical path, so they
can be executed in parallel with other instructions. And because modern CPUs are
very wide, they will execute instructions from multiple iterations at the same time.
This allows the OOO engine to effectively find parallelism (independent instructions)
within different iterations of the loop.
Let's do some back-of-the-envelope calculations.174 Each eor + lsl pair incurs 2 cycles of latency: one cycle for the shift and one for the XOR. We have three dependent eor + lsl pairs, so it takes 6 cycles to generate the next random number.
This is our absolute minimum for this loop: we cannot run faster than 6 cycles per
iteration. The code that follows takes at least 20 cycles of latency to finish all the fmul
and fmadd instructions. But it doesn’t matter, because they are not on the critical
path. The thing that matters is the throughput of these instructions. A useful rule of
thumb: if an instruction is on a critical path, look at its latency, otherwise look at its
throughput. On every loop iteration, we have 5 fmul and 4 fmadd instructions that
are served on the same set of execution units. The M1 processor can run 4 instructions
per cycle of this type, so it will take at least 9/4 = 2.25 cycles to issue all the fmul
and fmadd instructions. So, we have two performance limits: the first is imposed by
the software (6 cycles per iteration due to the dependency chain), and the second
is imposed by the hardware (2.25 cycles per iteration due to the throughput of the
174 Apple published instruction latency and throughput data in [Apple, 2024, Appendix A].
execution units). Right now we are bound by the first limit, but we can try to break
the dependency chain to get closer to the second limit.
One of the ways to solve this would be to employ an additional RNG object so that
one of them feeds even iterations and another feeds odd iterations of the loop as shown
in Listing 9.2. Notice, that we also manually unrolled the loop. Now we have two
separate dependency chains, which can be executed in parallel. You can argue that
this changes the functionality of the program, but users would not be able to tell the
difference since the motion of particles is random anyway. An alternative solution
would be to pick a different RNG that has a less expensive internal dependency chain.
Once you do this transformation, the compiler starts autovectorizing the body of the
loop, i.e., it glues two chains together and uses SIMD instructions to process them in
parallel. To isolate the effect of breaking the dependency chain, I disabled compiler
vectorization.
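A hedged sketch of that transformation (the exact code of Listing 9.2 is not reproduced; XorShift32, sine, and cosine come from the text, while the update formulas and the degreesToRadians helper are assumptions):

XorShift32 rngA{seedA}, rngB{seedB};          // two independent generators
for (int step = 0; step < STEPS; step += 2) {
  // Even step: fed by rngA.
  float angleA = degreesToRadians(rngA.gen());
  p.x += p.speed * cosine(angleA);
  p.y += p.speed * sine(angleA);
  // Odd step: fed by rngB; its xorshift chain does not depend on rngA at all.
  float angleB = degreesToRadians(rngB.gen());
  p.x += p.speed * cosine(angleB);
  p.y += p.speed * sine(angleB);
}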
To measure the performance impact of the change, I ran “before” and “after” ver-
sions and observed the running time go down from 19ms per iteration to 10ms per
iteration. This is almost a 2x speedup. The IPC also goes up from 4.0 to 7.1. To
do my due diligence, I also measured other metrics to make sure performance didn’t
accidentally improve for other reasons. In the original code, the MPKI is 0.01, and
BranchMispredRate is 0.2%, which means the program initially did not suffer from
cache misses or branch mispredictions. Here is another data point: when running the
same code on Intel’s Alder Lake system, it shows 74% Retiring and 24% Core Bound,
which confirms the performance is bound by computations.
With a few additional changes, you can generalize this solution to have as many
dependency chains as you want. For the M1 processor, the measurements show that
having 2 dependency chains is enough to get very close to the hardware limit. Having
more than 2 chains brings a negligible performance improvement. However, there is a
trend that CPUs are getting wider, i.e., they become increasingly capable of running
multiple dependency chains in parallel. That means future processors could benefit
from having more than 2 dependency chains. As always you should measure and find
the sweet spot for the platforms your code will be running on.
Sometimes it’s not enough just to break dependency chains. Imagine that instead of
a simple RNG, you have a very complicated cryptographic algorithm that is 10,000
instructions long. So, instead of a very short 6-instruction dependency chain, we
now have 10,000 instructions standing on the critical path. You immediately do the
same change we did above anticipating a nice 2x speedup, but see only 5% better
performance. What’s going on?
The problem here is that the CPU simply cannot “see” the second dependency chain to
start executing it. Recall from Chapter 3, that the Reservation Station (RS) capacity
is not enough to see 10,000 instructions ahead as its number of entries is much
smaller. So, the CPU will not be able to overlap the execution of two dependency
chains. To fix this, we need to interleave those two dependency chains. With this
approach, you need to change the code so that the RNG object will generate two
numbers simultaneously, with every statement within the function XorShift32::gen
duplicated and interleaved. Even if a compiler inlines all the code and can clearly see
both chains, it doesn’t automatically interleave them, so you need to watch out for this.
Another limitation you may hit is register pressure. Running multiple dependency
chains in parallel requires keeping more state and thus more registers. If you run out
of architectural registers, the compiler will start spilling them to the stack, which will
slow down the program.
It is worth mentioning that data dependencies can also be created through memory.
For example, if you write to memory location M on loop iteration N and read from this
location on iteration N+1, there will effectively be a dependency chain. The stored
value may be forwarded to a load, but these instructions cannot be reordered and
executed in parallel.
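For illustration, a minimal loop with such a memory-carried dependency (not an example from the book):

// Iteration i reads a[i-1], which the previous iteration just wrote: the load
// cannot complete before the earlier store (store-to-load forwarding helps,
// but the two iterations still cannot run fully in parallel).
for (int i = 1; i < n; ++i)
  a[i] = a[i - 1] * k;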
As a closing thought, I would like to emphasize the importance of finding that critical
dependency chain. It is not always easy, but it is crucial to know what stands on the
critical path in your loop, function, or other block of code. Otherwise, you may find
yourself fixing secondary issues that barely make a difference.
9.2 Inlining Functions
Listing 9.3 A profile of function foo which has a hot prologue and epilogue
Overhead | Source code & Disassembly
(%) | of function `foo`
--------------------------------------------
3.77 : 418be0: push r15 # prologue
4.62 : 418be2: mov r15d,0x64
2.14 : 418be8: push r14
1.34 : 418bea: mov r14,rsi
3.43 : 418bed: push r13
3.08 : 418bef: mov r13,rdi
1.24 : 418bf2: push r12
1.14 : 418bf4: mov r12,rcx
3.08 : 418bf7: push rbp
3.43 : 418bf8: mov rbp,rdx
1.94 : 418bfb: push rbx
0.50 : 418bfc: sub rsp,0x8
...
# function body
...
4.17 : 418d43: add rsp,0x8 # epilogue
3.67 : 418d47: pop rbx
0.35 : 418d48: pop rbp
0.94 : 418d49: pop r12
4.72 : 418d4b: pop r13
4.12 : 418d4d: pop r14
0.00 : 418d4f: pop r15
1.59 : 418d51: ret
The fact that the prologue and epilogue are hot does not necessarily mean it will be profitable to inline the function. Inlining triggers a lot of different changes, so it's hard to predict the outcome. Always measure the performance of the changed code before forcing a compiler to inline a function.
For the GCC and Clang compilers, you can make a hint for inlining foo with the help
of a C++11 [[gnu::always_inline]] attribute as shown in the code example below.
With earlier C++ standards you can use __attribute__((always_inline)). For
the MSVC compiler, you can use the __forceinline keyword.
[[gnu::always_inline]] int foo() {
// foo body
}
A compiler will recognize an opportunity for tail call optimization. The transformation will reuse
the current stack frame instead of recursively creating new frames. To do so, the
compiler flushes the current frame and replaces the call instruction with a jmp to
the beginning of the function. Just like inlining, tail call optimization provides room
for further optimizations. So, later, the compiler can apply more transformations to
replace the original version with an iterative version shown on the right. For example,
GCC 13.2 generates identical machine code for both versions.
Like with any compiler optimization, there are cases when it cannot perform the
code transformation you want. If you are using the Clang compiler, and you
want guaranteed tail call optimizations, you can mark a return statement with
__attribute__((musttail)). This indicates that the compiler must generate a tail
call for the program to be correct, even when optimizations are disabled. One example where it is beneficial is language interpreter loops.177 In case of doubt, it is better to
use an iterative version instead of tail recursion and leave tail recursion to functional
programming languages.
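A minimal sketch of the attribute in use (a hypothetical tail-recursive accumulator, not an example from the book; caller and callee must have matching signatures for the attribute to be accepted):

static long sumTo(long n, long acc) {
  if (n == 0) return acc;
  // Clang is required to compile this call as a jump that reuses the stack frame.
  __attribute__((musttail)) return sumTo(n - 1, acc + n);
}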
9.3 Loop Optimizations
In this section, we will take a look at the most well-known loop optimizations that address the types of bottlenecks mentioned above. We first discuss low-level optimizations, in Section 9.3.1, that only move code around in a single loop. Next, in Section 9.3.2, we will take a look at high-level optimizations that restructure loops, which often affect multiple loops. Note that what I present here is not a complete list of all known loop transformations. For more detailed information on each of the transformations discussed below, readers can refer to [Cooper & Torczon, 2012] and [Allen & Kennedy, 2001].
177 Josh Haberman's blog: motivation for guaranteed tail calls - https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html.
The primary benefit of loop unrolling is to perform more computations per iteration.
At the end of each iteration, the index value must be incremented and tested, then
we go back to the top of the loop if it has more iterations to process. This work is
commonly referred to as “loop overhead” or “loop tax”, and it can be reduced with
loop unrolling. For example, by unrolling the loop in Listing 9.6 by a factor of 2, we
reduce the number of executed compare and branch instructions by 2x.
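A generic unroll-by-two sketch (Listing 9.6 itself is not reproduced; the loop body is an assumption):

int i = 0;
for (; i + 1 < n; i += 2) {   // one compare-and-branch now covers two elements
  sum += a[i];
  sum += a[i + 1];
}
if (i < n)                    // remainder when n is odd
  sum += a[i];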
I do not recommend unrolling loops manually, except in cases when you need to break loop-carried dependencies as shown in Listing 9.1. First, because compilers are very
good at doing this automatically and usually can choose the optimal unrolling factor.
The second reason is that processors have an “embedded unroller” thanks to their out-
of-order speculative execution engine (see Chapter 3). While the processor is waiting
178 The compiler will perform the transformation only if it can prove that a and c don’t alias.
for long-latency instructions from the first iteration to finish (e.g. loads, divisions,
microcoded instructions, long dependency chains), it will speculatively start executing
instructions from the second iteration and only wait on loop-carried dependencies.
This spans multiple iterations ahead, effectively unrolling the loop in the instruction
Reorder Buffer (ROB). The third reason is that unrolling too much could lead to
negative consequences due to code bloat.
Loop Interchange is only legal if loops are perfectly nested. A perfectly nested loop
is one wherein all the statements are in the innermost loop. Interchanging imperfect
loop nests is harder to do but still possible; check an example in the Codee179 catalog.
Loop Blocking (Tiling): the idea of this transformation is to split the multi-
dimensional execution range into smaller chunks (blocks or tiles) so that each block
will fit in the CPU caches. If an algorithm works with large multi-dimensional arrays
and performs strided accesses to their elements, there is a high chance of poor cache
utilization. Every such access may push the data that will be requested by future
accesses out of the cache (cache eviction). By partitioning an algorithm into smaller
multi-dimensional blocks, we ensure the data used in a loop stays in the cache until it
is reused.
In the example shown in Listing 9.10, an algorithm performs row-major traversal of
elements of array a while doing column-major traversal of array b. The loop nest can
be partitioned into smaller blocks to maximize the reuse of elements in array b.
Loop Fusion helps to reduce the loop overhead (similar to Loop Unrolling) since both
loops can use the same induction variable. Also, loop fusion can help to improve
the temporal locality of memory accesses. In Listing 9.11, if both x and y members
of a structure happen to reside on the same cache line, it is better to fuse the two
loops since we can avoid loading the same cache line twice. This will reduce the cache
footprint and improve memory bandwidth utilization.
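A sketch of such a fusion (Listing 9.11 is not reproduced; the assignments are placeholders):

// Two separate passes over p ...
for (int i = 0; i < n; ++i) p[i].x = 1;
for (int i = 0; i < n; ++i) p[i].y = 2;
// ... become one pass, so each cache line of p is brought in only once:
for (int i = 0; i < n; ++i) {
  p[i].x = 1;
  p[i].y = 2;
}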
However, loop fusion does not always improve performance. Sometimes it is better to split a loop into multiple passes, pre-filter the data, sort and reorganize it, etc.; this inverse transformation is known as Loop Distribution (Fission). By distributing the large loop into multiple smaller ones, we limit the amount of data
required for each iteration of the loop and increase the temporal locality of memory
accesses. This helps in situations with a high cache contention, which typically happens
in large loops. Loop distribution also reduces register pressure since, again, fewer
operations are being done within each iteration of the loop. Also, breaking a big loop
into multiple smaller ones will likely be beneficial for the performance of the CPU
Frontend because of better instruction cache utilization. Finally, when distributed,
each small loop can be further optimized separately by the compiler.
Loop Unroll and Jam: to perform this transformation, you need to unroll the outer
loop first, then jam (fuse) multiple inner loops together as shown in Listing 9.12. This
transformation increases the ILP (Instruction-Level Parallelism) of the inner loop
since more independent instructions are executed inside the inner loop. In the code
example, the inner loop is a reduction operation that accumulates the deltas between
elements of arrays a and b. When we unroll and jam the loop nest by a factor of two,
we effectively execute two iterations of the original outer loop simultaneously. This is
emphasized by having two independent accumulators. This breaks the dependency
chain over diffs in the initial variant.
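A sketch of the transformation (in the spirit of Listing 9.12; the exact code is an assumption):

for (int i = 0; i + 1 < N; i += 2) {     // outer loop unrolled by two
  int diffs1 = 0, diffs2 = 0;            // two independent accumulators
  for (int j = 0; j < M; ++j) {          // the two inner loops are jammed together
    diffs1 += a[i][j]     - b[i][j];
    diffs2 += a[i + 1][j] - b[i + 1][j];
  }
  diffs += diffs1 + diffs2;
}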
Loop Unroll and Jam can be performed as long as there are no cross-iteration de-
pendencies on the outer loops, in other words, two iterations of the inner loop can
be executed in parallel. Also, this transformation makes sense if the inner loop has
memory accesses that are strided on the outer loop index (i in this case), otherwise,
180 Cache-oblivious algorithm - https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
other transformations likely apply better. The Unroll and Jam technique is especially
useful when the trip count of the inner loop is low, e.g., less than 4. By doing the
transformation, we pack more independent operations into the inner loop, which
increases the ILP.
The Unroll and Jam transformation sometimes is very useful for outer loop vectoriza-
tion, which, at the time of writing, compilers cannot do automatically. In a situation
when the trip count of the inner loop is not visible to a compiler, the compiler could
still vectorize the original inner loop, hoping that it will execute enough iterations to
hit the vectorized code (more on vectorization in the next section). But if the trip
count is low, the program will use a slow scalar version of the loop. By performing
Unroll and Jam, we enable the compiler to vectorize the code differently: now “gluing”
the independent instructions in the inner loop together. This technique is also known
as Superword-Level Parallelism (SLP) vectorization.
Even though there are well-known optimization techniques for a particular set of
computational problems, loop optimizations remain a “black art” that comes with
experience. I recommend that you rely on your compiler and complement it with
manually transforming code when necessary. Above all, keep the code as simple as
possible and do not introduce unreasonably complicated changes if the performance
benefits are negligible.
9.4 Vectorization
On modern processors, the use of SIMD instructions can result in a great speedup
over regular un-vectorized (scalar) code. When doing performance analysis, one of
the top priorities of the software engineer is to ensure that the hot parts of the
code are vectorized. This section guides engineers toward discovering vectorization
opportunities. For a recap of the SIMD capabilities of modern CPUs, readers can take
a look at Section 3.4.
Vectorization often happens automatically without any user intervention; this is called
compiler autovectorization. In such a situation, a compiler automatically recognizes
the opportunity to produce SIMD machine code from the source code.
Autovectorization is very convenient because modern compilers can automatically
generate fast SIMD code for a wide variety of programs. However, in some cases,
autovectorization does not succeed without intervention by a software engineer. Modern
compilers have extensions that allow power users to control the autovectorization
process and make sure that certain parts of the code are vectorized efficiently. We will
provide several examples of using compiler autovectorization hints.
In this section, we will discuss how to harness compiler autovectorization, especially
inner loop vectorization because it is the most common type of autovectorization.
The other two types (outer loop vectorization, and Superword-Level Parallelism
vectorization) are not discussed in this book.
Checking whether a loop was vectorized usually requires inspecting the assembly generated by the compiler. However, this skill is highly rewarding and often provides valuable insights.
Experienced developers can quickly tell whether the code was vectorized or not just by
looking at instruction mnemonics and the register names used by those instructions.
For example, in x86 ISA, vector instructions operate on packed data (thus have P
in their name) and use XMM, YMM, or ZMM registers, e.g., VMULPS XMM1, XMM2, XMM3
multiplies four single precision floats in XMM2 and XMM3 and saves the result in XMM1.
But be careful, often people conclude from seeing the XMM register being used, that it is
vector code—not necessarily. For instance, the VMULSS XMM1, XMM2, XMM3 instruction
will only multiply one single-precision floating-point value, not four.
Another indicator for potential vectorization opportunities is a high Retiring metric
(above 80%). In Section 6.1, we said that the Retiring metric is a good indicator of
well-performing code. The rationale behind it is that execution is not stalled and a
CPU is retiring instructions at a high rate. However, sometimes it may hide the real
performance problem, that is, inefficient computations. Perhaps a workload executes
a lot of simple instructions that could be replaced by vector instructions. In such situations, a high Retiring metric doesn't translate into high performance.
There are a few common cases that developers frequently run into when trying to
accelerate vectorizable code. Below we present four typical scenarios and give general
guidance on how to proceed in each case.
9.4.2.1 Vectorization Is Illegal. In some cases, the code that iterates over ele-
ments of an array is simply not vectorizable. Optimization reports are very effective at
explaining what went wrong and why the compiler can’t vectorize the code. Listing 9.13
shows an example of dependence inside a loop that prevents vectorization.182
While some loops cannot be vectorized due to the hard limitations (such as read-after-
write dependence), others could be vectorized when certain constraints are relaxed. For
example, the code in Listing 9.14 cannot be autovectorized by the compiler, because it
will change the order of floating-point operations and may lead to different rounding
and a slightly different result. Floating-point addition is commutative, which means
that you can swap the left-hand side and the right-hand side without changing the
result: (a + b == b + a). However, it is not associative, because rounding happens
at different times: ((a + b) + c) != (a + (b + c)).
If you tell the compiler that you can tolerate a bit of variation in the final result, it will
autovectorize the code for you. Clang and GCC compilers have a flag, -ffast-math,183
that allows this kind of transformation even though the resulting program may give
slightly different results:
182 It is easy to spot a read-after-write dependency once you unroll a couple of iterations of the loop.
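For example, with Clang (the file name and the optional vectorization-remark flag here are illustrative, not copied from the book):

$ clang++ -O3 -ffast-math -Rpass=loop-vectorize -c a.cpp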
Unfortunately, this flag involves subtle and potentially dangerous behavior changes,
including for Not-a-Number, signed zero, infinity, and subnormals. Because third-party
code may not be ready for these effects, this flag should not be enabled across large
sections of code without careful validation of the results, including for edge cases.
Since Clang 18, you can limit the scope of transformations by using dedicated pragmas,
e.g., #pragma clang fp reassociate(on).184
Let’s look at another typical situation when a compiler may need support from a
developer to perform vectorization. When compilers cannot prove that a loop operates
on arrays with non-overlapping memory regions, they usually choose to be on the safe
side. Given the code in Listing 9.15, compilers should account for the situation when
the memory regions of arrays a, b, and c overlap.
Here is the optimization report (enabled with -fopt-info) provided by GCC 10.2:
$ gcc -O3 -march=core-avx2 -fopt-info
a.cpp:2:26: optimized: loop vectorized using 32-byte vectors
a.cpp:2:26: optimized: loop versioned for vectorization because of possible aliasing
GCC has recognized potential overlap between memory regions and created multiple
184 LLVM extensions to specify floating-point flags - https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags
versions of the loop. The compiler inserted runtime checks185 to detect if the memory
regions overlap. Based on those checks, it dispatches between vectorized and scalar
versions. In this case, vectorization comes with the cost of inserting potentially
expensive runtime checks. If a developer knows that the memory regions of arrays a, b, and c do not overlap, they can insert #pragma GCC ivdep186 right before the loop or use
the __restrict__ keyword as shown in Section 5.7. Such compiler hints will eliminate
the need for the GCC compiler to insert the runtime checks mentioned earlier.
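A sketch of the loop in question with the hint applied (Listing 9.15 is not reproduced; the function shape is an assumption):

// __restrict__ promises that a, b, and c never overlap, so the compiler can
// drop the runtime aliasing checks and the scalar fallback version.
void add(float* __restrict__ a, const float* __restrict__ b,
         const float* __restrict__ c, int n) {
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + c[i];
}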
Some dynamic tools, such as Intel Advisor, can detect if issues like cross-iteration
dependence or access to arrays with overlapping memory regions occur in a loop. But
be aware that such tools only provide a suggestion. Carelessly inserting compiler hints
can cause real problems.
Here is the compiler optimization report for the code in Listing 9.16:
$ clang -c -O3 -march=core-avx2 a.cpp -Rpass-missed=loop-vectorize
a.cpp:3:3: remark: the cost-model indicates that vectorization is not beneficial
[-Rpass-missed=loop-vectorize]
for (int i = 0; i < n; i++)
^
Users can force the Clang compiler to vectorize the loop by using the #pragma hint, as
shown in Listing 9.17. However, keep in mind that whether vectorization is profitable
largely depends on the runtime data, for example, the number of iterations of the
loop. Compilers don't have this information available,187 so they often tend to be conservative. However, you can still use such hints for performance experiments.
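A sketch of such a hint (Listing 9.17 is not reproduced; the loop body is an assumption):

#pragma clang loop vectorize(enable)   // applies to the loop that immediately follows
for (int i = 0; i < n; i++)
  c[i] += a[i] * b[i];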
Developers should be aware of the hidden cost of using vectorized code. Using AVX
and especially AVX-512 vector instructions could lead to frequency downclocking or
startup overhead, which on certain CPUs can also affect subsequent code for several
microseconds. The vectorized portion of the code should be hot enough to justify
using AVX-512.188 For example, sorting 80 KiB was found to be sufficient to amortize this cost.189
185 See the example on the Easyperf blog: https://easyperf.net/blog/2017/11/03/Multiversioning_by_DD.
186 It is a GCC-specific pragma. For other compilers, check the corresponding manuals.
187 Besides Profile Guided Optimizations (see Section 11.7).
188 For more details read this blog post: https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html.
9.4.2.3 Loop Vectorized but Scalar Version Used. In some scenarios, the
compiler successfully vectorizes the code, but it does not show up as being executed
in the profiler. When inspecting the corresponding assembly of a loop, it is usually
easy to find the vectorized version of the loop body because it uses vector registers,
which are not commonly used in other parts of the program.
If the vector code is not executed, one possible reason for this is that the generated
code assumes loop trip counts that are higher than what the program uses. For
example, a compiler may decide to vectorize and unroll the loop in such a way as
to process 64 elements per iteration. An input array may not have enough elements
even for a single iteration of the loop. In this case, the scalar version (remainder) of
the loop will be used instead. It is easy to detect these cases because the scalar loop
would light up in the profiler, and the vectorized code would remain cold.
The solution to this problem is to force the vectorizer to use a lower vectorization
factor or unroll count, to reduce the number of elements that the loop processes. You
can achieve that with the help of #pragma hints. For the Clang compiler, you can use
#pragma clang loop vectorize_width(N) as shown in the article on the Easyperf
blog.190
9.4.2.4 Loop Vectorized in a Suboptimal Way. When you see a loop being
autovectorized and executed at runtime, there is a high chance that this part of the
program already performs well. However, there are exceptions. There are situations
when the scalar un-vectorized version of a loop performs better than the vectorized
one. This could happen due to expensive vector operations like gather/scatter
loads, masking, inserting/extracting elements, data shuffling, etc., if the compiler
is required to use them to make vectorization happen. Performance engineers could
also try to disable vectorization in different ways. For the Clang compiler, it can be
done via compiler options -fno-vectorize and -fno-slp-vectorize, or with a hint
specific to a particular loop, e.g., #pragma clang loop vectorize(disable).
It is important to note that there is a range of problems where SIMD is important and
where autovectorization just does not work and is not likely to work in the near future.
One example can be found in [Muła & Lemire, 2019]. Another example is outer loop
autovectorization, which is not currently attempted by compilers. Vectorizing floating-point code is also problematic because reordering arithmetic floating-point operations may change the rounding and produce slightly different results, as discussed earlier in this section.
189 Study of AVX-512 downclocking: in VQSort readme.
190 Using Clang's optimization pragmas - https://easyperf.net/blog/2017/11/09/Multiversioning_by_trip_counts
Since the function calcSum must return a single value (a uniform variable) and our sum variable is varying, we need to gather the values of each program instance using the reduce_add function. ISPC also takes care of generating peeled and remainder loops as needed to handle data that is not correctly aligned or whose size is not a multiple of the vector width.
“Close to the metal” programming model: one of the problems with traditional C
and C++ languages is that the compiler doesn’t always vectorize critical parts of code.
ISPC helps to resolve this problem by assuming every operation is SIMD by default.
For example, the ISPC statement sum += array[i] is implicitly considered as a SIMD
operation that makes multiple additions in parallel. ISPC is not an autovectorizing
compiler, and it does not automatically discover vectorization opportunities. Since
the ISPC language is very similar to C and C++, it is more readable than intrinsics
(see Section 9.5) as it allows you to focus on the algorithm rather than the low-level
instructions. Also, it has reportedly matched [Pharr & Mark, 2012] or beaten192
hand-written intrinsics code in terms of performance.
Performance portability: ISPC can automatically detect features of your CPU to
fully utilize all the resources available. Programmers can write ISPC code once and
compile to many vector instruction sets, such as SSE4, AVX2, and ARM NEON.
When compiling Listing 9.14 for the SSE target, compilers would generate assembly code nearly identical to that of Listing 9.19. I show this example just for illustration
purposes. Obviously, there is no need to use intrinsics if the compiler can generate the
same machine code. You should use intrinsics only when the compiler fails to generate
the desired code.
When you leverage compiler auto-vectorization, it will insert all necessary runtime
checks. For instance, it will ensure that there are enough elements to feed the vector
execution units (see Listing 9.19, line 6). Also, the compiler will generate a scalar
version of the loop to process the remainder (line 19). When you use intrinsics, you
have to take care of safety aspects yourself.
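As an illustration of these safety aspects, below is a minimal SSE sketch of summing an array of floats; it is not the book's Listing 9.14, just an example of the trip-count check and scalar remainder that you have to write yourself:

#include <immintrin.h>
#include <cstddef>

float calcSumSSE(const float* a, std::size_t n) {
  __m128 vsum = _mm_setzero_ps();
  std::size_t i = 0;
  // Main SIMD loop: make sure there are at least 4 elements left.
  for (; i + 4 <= n; i += 4)
    vsum = _mm_add_ps(vsum, _mm_loadu_ps(a + i));
  // Horizontal reduction of the four partial sums (requires SSE3).
  __m128 shuf = _mm_movehdup_ps(vsum);
  __m128 sums = _mm_add_ps(vsum, shuf);
  shuf = _mm_movehl_ps(shuf, sums);
  sums = _mm_add_ss(sums, shuf);
  float sum = _mm_cvtss_f32(sums);
  // Scalar remainder for counts that are not a multiple of 4.
  for (; i < n; ++i)
    sum += a[i];
  return sum;
}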
Intrinsics are better to use than inline assembly because the compiler performs type
checking, takes care of register allocation, and makes further optimizations, e.g.,
peephole transformations and instruction scheduling. However, they are still often
verbose and difficult to read.
When you write code using non-portable platform-specific intrinsics, you should also
provide a fallback option for other architectures. A list of all available intrinsics for the Intel platform can be found in this reference.193 For ARM, you can find such a list on Arm's website.194
Wrapper libraries built on top of intrinsics give developers control over the generated code. Many such libraries exist, differing in their coverage of recent or “exotic” operations and in the number of platforms they support.
The write-once, target-many model of ISPC is appealing. However, you may wish for
tighter integration into C++ programs. For example, interoperability with templates,
or avoiding a separate build step and using the same compiler. Conversely, intrinsics
offer more control, but at a higher development cost.
Wrapper libraries combine the advantages and avoid the drawbacks of both by using a so-called embedded domain-specific language, in which the vector operations are expressed as normal C++ functions. You can think of these functions as “portable intrinsics”.
Even compiling your code multiple times, once per instruction set, can be done within
a normal C++ library by using the preprocessor to ‘repeat’ your code with different
compiler settings, but within unique namespaces. One example of such a library is
Highway,195 which only requires the C++11 standard.
The Highway version of summing elements of an array is presented in Listing 9.20.
The ScalableTag<float> d is a type descriptor that represents a “scalable” type,
meaning that it can adjust to the available vector width on the target hardware
(e.g., AVX2 or NEON). Zero(d) initializes sum to a vector filled with zeros. This
variable will store the accumulated sum as the function iterates through the array.
The for loop processes Lanes(d) elements at a time, where Lanes(d) represents the
number of floats that can be loaded into a single SIMD vector. The LoadU operation
loads Lanes(d) consecutive elements from array. The Add operation performs an
element-wise addition of the loaded values with the current sum, accumulating the
result in sum.
Notice the explicit handling of remainders after the loop processes multiples of the
vector size Lanes(d). Although this is more verbose, it makes visible what is actually
happening, and allows optimizations such as overlapping the last vector instead of
relying on MaskedLoad, or even skipping the remainder entirely when the count is
known to be a multiple of the vector size. Finally, the ReduceSum operation reduces
all elements in the vector sum to a single scalar value by adding them together.
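A rough sketch of the approach just described might look as follows; this is my illustration, not a reproduction of the book's Listing 9.20, and it assumes a recent Highway version that provides ReduceSum:

#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

float calcSum(const float* array, std::size_t count) {
  const hn::ScalableTag<float> d;   // adapts to the available vector width
  auto sum = hn::Zero(d);
  std::size_t i = 0;
  for (; i + hn::Lanes(d) <= count; i += hn::Lanes(d))
    sum = hn::Add(sum, hn::LoadU(d, array + i));   // unaligned vector load + add
  float total = hn::ReduceSum(d, sum);             // reduce all lanes to a scalar
  for (; i < count; ++i)                           // explicit scalar remainder
    total += array[i];
  return total;
}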
Like ISPC, Highway also supports detecting the best available instruction sets, grouped into “clusters”, which on x86 correspond to Intel Core (S-SSE3), Nehalem (SSE4.2), and so on.
195 Highway library: https://github.com/google/highway
Highway groups its operations into the following categories:
• Initialization
• Getting/setting lanes
• Getting/setting blocks
• Printing
• Tuples
• Arithmetic
• Logical
• Masks
• Comparisons
• Memory
• Cache control
• Type conversion
• Combine
• Swizzle/permute
• Swizzling within 128-bit blocks
• Reductions
• Crypto
For the full list of operations, see its documentation.196 Highway is not the only library of this kind. Other libraries include nsimd, SIMDe, VCL, and xsimd. Note that a C++ standardization effort starting with the Vc library resulted in std::experimental::simd; however, it provides a very limited set of operations and, as of this writing, is not supported by all major compilers.
Chapter Summary
• Inefficient computations represent a significant portion of the bottlenecks in real-
world applications. Modern compilers are very good at removing unnecessary
computation overhead by performing many different code transformations. Still,
there is a high chance that we can do better than what compilers can offer.
• In Chapter 9, we showed how to find performance headroom in a program by forcing certain code optimizations. We discussed popular transformations such as function inlining, loop optimizations, and vectorization.
10 Optimizing Branch Prediction
So far we’ve been talking about optimizing memory accesses and computations.
However, we haven’t discussed another important category of performance bottlenecks
yet. It is related to speculative execution, a feature that is present in all modern
high-performance CPU cores. To refresh your memory, turn to Section 3.3.3 where
we discussed how speculative execution can be used to improve performance. In this
chapter, we will explore techniques to reduce the number of branch mispredictions.
In general, modern processors are very good at predicting branch outcomes. They not
only follow static prediction rules but also detect dynamic patterns. Usually, branch
predictors save the history of previous outcomes for the branches and try to guess
what the next result will be. However, when the pattern becomes hard for the CPU
branch predictor to follow, it may hurt performance.
Mispredicting a branch can add a significant penalty when it happens regularly. When
such an event occurs, a CPU is required to clear all the speculative work that was done
ahead of time and later was proven to be wrong. It also needs to flush the pipeline
and start filling it with instructions from the correct path. Typically, modern CPUs
experience a 10- to 25-cycle penalty as a result of a branch misprediction. The exact
number of cycles depends on the microarchitecture design, namely, on the depth of
the pipeline and the mechanism used to recover from a mispredict.
Perhaps the most frequent reason for a branch misprediction is simply that the branch has a complicated outcome pattern (e.g., it exhibits pseudorandom behavior), which is unpredictable for a processor. For completeness, let's cover the other, less frequent
reasons behind branch mispredicts. Branch predictors use caches and history registers
and therefore are susceptible to the issues related to caches, namely:
• Cold misses: mispredictions may happen on the first dynamic occurrence of the
branch when no dynamic history is available and static prediction is employed.
• Capacity misses: mispredictions arising from the loss of dynamic history due to a very high number of branches in the program or an exceedingly long dynamic pattern.
• Conflict misses: branches are mapped into cache buckets (associative sets) using a combination of their virtual and/or physical addresses. If too many active branches are mapped to the same set, the loss of history can occur. Another instance of a conflict miss is aliasing, when two independent branches are mapped to the same cache entry and interfere with each other, potentially degrading the prediction history.
A program will always experience a non-zero number of branch mispredictions. You
can find out how much a program suffers from branch mispredictions by looking at
the TMA Bad Speculation metric. It is normal for a general-purpose application to
have a Bad Speculation metric in the range of 5–10%. My recommendation is to
pay close attention once this metric goes higher than 10%.
In the past, developers had an option of providing a prediction hint to an x86
processor in the form of a prefix to the branch instruction (0x2E: Branch Not
Taken, 0x3E: Branch Taken). This could potentially improve performance on older
microarchitectures, like Pentium 4. However, modern x86 processors ignored those hints until Intel's Redwood Cove microarchitecture started honoring them again. Its branch predictor is still good at finding dynamic patterns, but now it uses the encoded prediction hint for branches that have never been seen before, i.e., when there is no stored information about a branch [Intel, 2023, Section 2.1.1.1 Branch Hint].
There are indirect ways to reduce the branch misprediction rate by reducing the
dynamic number of branch instructions. This approach helps because it alleviates
the pressure on branch predictor structures. When a program executes fewer branch
instructions, it may indirectly improve the prediction of branches that previously
suffered from capacity and conflict misses. Compiler transformations such as loop
unrolling and vectorization help reduce the dynamic branch count, though they don’t
specifically aim to improve the prediction rate of any given conditional statement.
Profile-Guided Optimizations (PGO) and post-link optimizers (e.g., BOLT) are also
effective at reducing branch mispredictions thanks to improving the fallthrough rate
(straightening the code). We will discuss those techniques in the next chapter.197
The only direct way to get rid of branch mispredictions is to get rid of the branch
instruction itself. In subsequent sections, we will take a look at both direct and
indirect ways to improve branch prediction. In particular, we will explore the following
techniques: replacing branches with lookup tables, arithmetic, selection, and SIMD
instructions.
and can’t affect performance, and therefore it doesn’t make much sense to remove them, at least from
a prediction perspective. However, contrary to the wisdom, an experiment conducted by authors
of BOLT optimizer demonstrated that replacing never-taken branches with equal-sized no-ops in a
large code footprint application, such as Clang C++ compiler, leads to approximately 5% speedup
on modern Intel CPUs. So it still pays to try to eliminate all branches.
not practical. In this case, we might use interval map data structures that accomplish the same goal using much less memory, but with logarithmic lookup complexity. Readers can find existing implementations of the interval map container in Boost198 and LLVM199.
As of 2024, compilers are usually unable to find these shortcuts on their own, so it is up to the programmer to do it manually. If you can find a way to replace a frequently mispredicted branch with arithmetic, you will likely see a performance improvement.
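As a hypothetical illustration (not from the book), consider classifying a value into equally sized buckets: a chain of hard-to-predict comparisons can be replaced with a single integer division.

// Branchy version: one comparison per bucket boundary (v is assumed < 40).
int bucketBranchy(unsigned v) {
  if (v < 10) return 0;
  if (v < 20) return 1;
  if (v < 30) return 2;
  return 3;
}

// Arithmetic version: equivalent for v < 40, with no branches at all.
int bucketArithmetic(unsigned v) {
  return static_cast<int>(v / 10);
}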
198 Boost's interval_map - …/icl/interval_map.html
199 LLVM’s IntervalMap - https://llvm.org/doxygen/IntervalMap_8h_source.html
10.3 Replace Branches with Selection
For the code on the right, the compiler can replace the branch that comes from the
ternary operator, and generate a CMOV x86 instruction instead. A CMOVcc instruction
checks the state of one or more of the status flags in the EFLAGS register (CF,
OF, PF, SF and ZF) and performs a move operation if the flags are in a specified
state or condition. A similar transformation can be done for floating-point numbers
with FCMOVcc and VMAXSS/VMINSS instructions. In the ARM ISA, there is the CSEL
(conditional selection) instruction, but also CSINC (select and increment), CSNEG (select
and negate), and a few other conditional instructions.
Listing 10.4 shows assembly listings for the original and the branchless version. In
contrast with the original version, the branchless version doesn’t have jump instructions.
However, the branchless version calculates both x and y independently, and then
selects one of the values and discards the other. While this transformation eliminates
the penalty of a branch misprediction, it is doing more work than the original code.
We already know that the branch in the original version on the left is hard to predict.
This is what motivated us to try a branchless version in the first place. In this example,
the performance gain of this change depends on the characteristics of the computeX
and computeY functions. If the functions are small200 and the compiler can inline
them, then selection might bring noticeable performance benefits. If the functions are
big201 , it might be cheaper to take the cost of a branch mispredict than to execute both
computeX and computeY functions. Ultimately, performance measurements always
decide which version is better.
Take a look at Listing 10.4 one more time. On the left, a processor can predict, for
example, that the je 400514 branch will be taken, speculatively call computeY, and
start running code from the function foo. Remember, branch prediction happens
200 Just a handful of instructions that can be completed in a few cycles.
201 More than twenty instructions that take more than twenty cycles.
many cycles before we know the actual outcome of the branch. By the time we start resolving the branch, we could already be halfway through the foo function, even though it is still speculative. If we are correct, we've saved a lot of cycles. If we are wrong, we have to take the penalty and start over from the correct path. In the latter case, we don't gain anything from the fact that we have already completed a portion of foo; it all must be thrown away. If mispredictions occur too often, the recovery penalty outweighs the gains from speculative execution.
With conditional selection, it is different. There are no branches, so the processor
doesn’t have to speculate. It can execute computeX and computeY functions in parallel.
However, it cannot start running the code from foo until it computes the result of the
CMOVNE instruction since foo uses it as an argument (data dependency). When you
use conditional select instructions, you convert a control flow dependency into a data
flow dependency.
To sum it up, for small if-else statements that perform simple operations, conditional
selects can be more efficient than branches, but only if the branch is hard to predict. So
don’t force the compiler to generate conditional selects for every conditional statement.
For conditional statements that are always correctly predicted, having a branch
instruction is likely an optimal choice, because you allow the processor to speculate
(correctly) and run ahead of the actual execution. And don’t forget to measure the
impact of your changes.
Without profiling data, compilers don’t have visibility into the misprediction rates.
As a result, they usually prefer to generate branch instructions by default. Compilers
are conservative at using selection and may resist generating CMOV instructions even
in simple cases. Again, the tradeoffs are complicated, and it is hard to make the
right decision without the runtime data.202 Starting from Clang-17, the compiler now
honors a __builtin_unpredictable hint for the x86 target, which indicates to the
compiler that a branch condition is unpredictable. It can help influence the compiler’s
decision but does not guarantee that the CMOV instruction will be generated. Here’s
an example of how to use __builtin_unpredictable:
int a;
if (__builtin_unpredictable(cond)) {
  a = computeX();
} else {
  a = computeY();
}
10.4 Multiple Tests Single Branch
Listing 10.5 shows a function that finds the longest line in an input string by testing
one character at a time. We go through the input string and search for end-of-line
(eol) characters (\n, 0x0A in ASCII). For every eol character found, we check whether the current line is the longest so far, and then reset the length of the current line to zero. This code will execute one branch instruction for every character.203
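A scalar version along the lines described above might look like this sketch (not the book's Listing 10.5):

#include <algorithm>
#include <cstddef>
#include <string>

std::size_t longestLine(const std::string& s) {
  std::size_t longest = 0;
  std::size_t current = 0;
  for (char c : s) {
    if (c == '\n') {                          // one branch per input character
      longest = std::max(longest, current);
      current = 0;
    } else {
      ++current;
    }
  }
  return std::max(longest, current);          // account for the last line
}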
Consider the alternative implementation shown in Listing 10.6 that tests eight charac-
ters at a time. You will typically see this idea implemented using compiler intrinsics
(see Section 9.5), however, I decided to show a standard C++ code for clarity. This
exact case is featured in one of Performance Ninja’s lab assignments,204 so you can try
writing SIMD code yourself. Keep in mind that the code I'm showing is incomplete, as it misses a few corner cases; I provide it just to illustrate the idea.
We start by preparing an 8-byte mask filled with eol symbols. The inner loop loads
eight characters of the input string and performs a byte-wise comparison of these
characters with the eol mask. Vectors in modern processors contain 16/32/64 bytes, so
we can process even more characters simultaneously. The result of the eight comparisons
is an 8-bit mask with either 0 or 1 in the corresponding position (see compareBytes).
For example, when comparing 0x00FF0A000AFFFF00 and 0x0A0A0A0A0A0A0A0A, we
will get 0b00101000 as a result. With x86 and ARM ISAs, the function compareBytes
can be implemented using two vector instructions.205
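A portable scalar sketch of compareBytes may help make the idea concrete; this is my illustration of the behavior described above, not the book's implementation (which, as noted, maps to two vector instructions):

#include <cstdint>

// Compare each of the 8 bytes of `chunk` with the corresponding byte of
// `eolMask` and set bit i of the result if byte i matches.
uint8_t compareBytes(uint64_t chunk, uint64_t eolMask) {
  uint8_t result = 0;
  for (int i = 0; i < 8; ++i) {
    uint8_t a = static_cast<uint8_t>(chunk >> (i * 8));
    uint8_t b = static_cast<uint8_t>(eolMask >> (i * 8));
    result |= static_cast<uint8_t>(a == b) << i;
  }
  return result;
}
// compareBytes(0x00FF0A000AFFFF00, 0x0A0A0A0A0A0A0A0A) == 0b00101000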
If the mask is zero, that means there are no eol characters in the current chunk and
we can skip it (see line 11). This is a critical optimization that provides large speedups
for input strings with long lines. If a mask is not zero, that means there are eol
characters and we need to find their positions. To do so, we use the tzcnt function,
which counts the number of trailing zero bits in an 8-bit mask (the position of the
rightmost set bit). For example, for the mask 0b00101000, it will return 3. Most
ISAs support implementing the tzcnt function with a single instruction.206 Line 14
203 Assuming that the compiler will avoid generating branch instructions for std::max.
204 Performance Ninja: compiler intrinsics 2 - https://github.com/dendibakh/perf-ninja/tree/main/labs/core_bound/compiler_intrinsics_2.
205 For example, with AVX2 (256-bit vectors), you can use VPCMPEQB and VPMOVMSKB instructions.
206 Although in x86, there is no version of the TZCNT instruction that supports 8-bit inputs.
calculates the length of the current line using the result of the tzcnt function. We
shift right the mask and repeat until there are no set bits in the mask.
For an input string with a single very long line (best case scenario), the SIMD version
will execute eight times fewer branch instructions. However, in the worst-case scenario
with zero-length lines (i.e., only eol characters in the input string), the original
approach is faster. I benchmarked this technique using AVX2 implementation (with
chunks of 16 characters) on several different inputs, including textbooks, and source
code files. The result was 5–6 times fewer branch instructions and more than 4x better
performance when running on Intel Core i7-1260P (12th Gen, Alder Lake).
Questions and Exercises
1. … introduced mispredictions?
2. Solve the following lab assignments using techniques we discussed in this chapter:
• perf-ninja::branches_to_cmov_1
• perf-ninja::lookup_tables_1
• perf-ninja::virtual_call_mispredict
• perf-ninja::conditional_store_1
3. Run the application that you’re working with daily. Collect the TMA breakdown
and check the BadSpeculation metric. Look at the code that is attributed with
the most number of branch mispredictions. Is there a way to avoid branches
using the techniques we discussed in this chapter?
Coding exercise: write a microbenchmark that will experience a 50% misprediction rate, or get as close to it as possible. Your goal is to write code in which half of all branch instructions are mispredicted. That is not as simple as you may think. Some hints
and ideas:
• Branch misprediction rate is measured as BR_MISP_RETIRED.ALL_BRANCHES /
BR_INST_RETIRED.ALL_BRANCHES.
• If you’re coding in C++, you can use 1) the Google benchmark library similar
to perf-ninja, 2) write a regular console program and collect CPU counters with
Linux perf, or 3) integrate the libpfm library into the microbenchmark (see
Section 5.3.2).
• There is no need to invent some complicated algorithm. A simple approach
would be to generate a pseudo-random number in the range [0;100) and check
if it is less than 50. Random numbers can be pre-generated ahead of time.
• Keep in mind that modern CPUs can remember long (but still limited) sequences
of branch outcomes.
Chapter Summary
• Modern processors are very good at predicting branch outcomes. So, I recommend
paying attention to branch mispredictions only when the TMA points to a high
Bad Speculation metric.
• When branch outcome patterns become hard for the CPU branch predictor to
follow, the performance of the application may suffer. In this case, the branchless
version of an algorithm can be more performant. In this chapter, I showed how
branches could be replaced with lookup tables, arithmetic, and selection.
• Branchless algorithms are not universally beneficial. Always measure to find out
what works better in your specific case.
• There are indirect ways to reduce the branch misprediction rate by reducing
the dynamic number of branch instructions in a program. This approach helps
because it alleviates the pressure on branch predictor structures. Examples of
such techniques include loop unrolling/vectorization, replacing branches with
bitwise operations, and using SIMD instructions.
11 Machine Code Layout Optimizations
The CPU Frontend (FE) is responsible for fetching and decoding instructions and
delivering them to the out-of-order Backend (BE). As newer processors get more execution “horsepower”, the CPU FE needs to be equally powerful to keep the machine balanced. If the FE cannot keep up with supplying instructions, the BE will be underutilized, and overall performance will suffer. That's why the FE is designed to always run well ahead of the actual execution, to smooth out any hiccups that may occur and always have instructions ready to be executed. For example, Intel Skylake, released in 2015, can fetch up to 16 bytes of instructions per cycle.
Most of the time, inefficiencies in the CPU FE can be described as a situation when the
Backend is waiting for instructions to execute, but the Frontend is not able to provide
them. As a result, CPU cycles are wasted without doing any actual useful work.
Recall that modern CPUs can process multiple instructions every cycle, nowadays
ranging from 4- to 9-wide. Situations when not all available slots are filled happen
very often. This represents a source of inefficiency for applications in many domains,
such as databases, compilers, web browsers, and many others.
The TMA methodology captures FE performance issues in the Frontend Bound
metric. It represents the percentage of cycles when the CPU FE is not able to deliver instructions to the BE, even though the BE could have accepted them. Most real-world applications experience a non-zero “Frontend Bound” metric, meaning that some percentage of running time is lost on suboptimal instruction fetching and decoding. Below 10% is the norm. If you see the “Frontend Bound” metric exceed 20%, it's worth spending time on it.
There could be many reasons why FE cannot deliver instructions to the execution
units. Most of the time, it is due to suboptimal code layout, which leads to poor
I-cache and ITLB utilization. Applications with a large codebase, e.g., millions of
lines of code, are especially vulnerable to FE performance issues. In this chapter, we
will take a look at some typical optimizations to improve machine code layout.
11.2 Basic Block
It is guaranteed that every instruction in the basic block will be executed only once.
This is an important property that is leveraged by many compiler transformations.
For example, it greatly reduces the problem of control flow graph analysis and
transformations since, for some classes of problems, we can treat all instructions in
the basic block as one entity.
Figure 11.2: Two versions of machine code layout for the snippet of code above.
Which layout is better? Well, it depends on whether cond is usually true or false. If cond is usually true, we would be better off choosing the default layout because otherwise, we would be doing two jumps instead of one. Also, if coldFunc is a relatively small function, we would want to have it inlined. However, in this particular example, we know that coldFunc is an error-handling function and is likely not executed very often. By choosing layout 11.2b, we maintain fall through between hot pieces of the code and convert the taken branch into a not taken one.
There are a few reasons why the layout presented in Figure 11.2b performs better. First of all, the layout in Figure 11.2b makes better use of the instruction cache and the µop-cache (DSB, see Section 3.8.1). With all hot code contiguous, there is no cache line fragmentation: all the cache lines in the L1 I-cache are used by hot code. The same is true for the µop-cache since it caches based on the underlying code layout as well.
Secondly, taken branches are also more expensive for the fetch unit. The Frontend of a CPU fetches contiguous aligned blocks of bytes, usually 16, 32, or 64 bytes, depending on the architecture. For every taken branch, the bytes in a fetch block after the jump instruction and before the branch target are unused. This reduces the maximum effective fetch bandwidth.
11.4 Basic Block Alignment
The code itself is reasonable, but its layout is not perfect (see Figure 11.3a). Instructions
that correspond to the loop are highlighted in yellow. Thick boxes denote cache line
borders. Cache lines are 64 bytes long.
Notice that the loop spans multiple cache lines: it begins on the cache line 0x80-0xBF
and ends in the cache line 0xC0-0xFF. To fetch instructions that are executed in the
loop, a processor needs to read two cache lines. These kinds of situations sometimes cause performance problems for the CPU Frontend, especially for small loops like those presented in Listing 11.3.
To fix this, we can shift the loop instructions forward by 16 bytes using a single NOP
instruction so that the whole loop will reside in one cache line. Figure 11.3b shows
the effect of doing this with the NOP instruction highlighted in blue.
Interestingly, the performance impact is visible even if you run nothing but this hot
loop in a microbenchmark. It is somewhat puzzling since the amount of code is tiny
and it shouldn’t saturate the L1 I-cache size on any modern CPU. The reason for
the better performance of the layout in Figure 11.3b is not trivial to explain and will
involve a fair amount of microarchitectural details, which we don’t discuss in this book.
Interested readers can find more information in the related article on the Easyperf
blog.209
By default, the LLVM compiler recognizes loops and aligns them at 16B boundaries,
as we saw in Figure 11.3a. To reach the desired code placement for our example, as
shown in Figure 11.3b, you can use the -mllvm -align-all-blocks=5 option, which will align every basic block in an object file at a 32-byte boundary. However, I do not recommend using this and similar options, as they affect the code layout of all the functions in the translation unit. There are other, less intrusive options.
A recent addition to the LLVM compiler is the new [[clang::code_align()]] loop
attribute, which allows developers to specify the alignment of a loop in the source
code. This gives very fine-grained control over machine code layout. The following code shows how the new Clang attribute can be used to align a loop at a 64-byte boundary:
209 “Code alignment issues” - https://easyperf.net/blog/2018/01/18/Code_alignment_issues
Figure 11.3: Two different code layouts for the loop in Listing 11.3.
void benchmark_func(int* a) {
  [[clang::code_align(64)]]
  for (int i = 0; i < 32; ++i)
    a[i] += 1;
}
Before this attribute was introduced, developers had to resort to less practical solutions, such as injecting inline assembly statements like asm(".align 64;") into the source code.
Even though CPU architects work hard to minimize the impact of machine code
layout, there are still cases when code placement (alignment) can make a difference
in performance. Machine code layout is also one of the main sources of noise in
performance measurements. It makes it harder to distinguish a real performance improvement or regression from an accidental one caused by a change in the code layout.
11.5 Function Splitting
The idea behind function splitting is to avoid placing cold code inside a hot path. An example of code where such a transformation might be profitable is shown in Listing 11.4. To remove cold basic blocks from the hot path, we cut and paste them into a new function and create a call to it.
Listing 11.4 Function splitting: cold code outlined to the new functions.
// Before:
void foo(bool cond1, bool cond2) {
  // hot path
  if (cond1) {
    /* cold code (1) */
  }
  // hot path
  if (cond2) {
    /* cold code (2) */
  }
}

// After:
void foo(bool cond1, bool cond2) {
  // hot path
  if (cond1) {
    cold1();
  }
  // hot path
  if (cond2) {
    cold2();
  }
}
void cold1() __attribute__((noinline))
{ /* cold code (1) */ }
void cold2() __attribute__((noinline))
{ /* cold code (2) */ }
Notice that we disable inlining of the cold functions by using the noinline attribute, because without it, a compiler may decide to inline them, which would effectively undo our transformation. Alternatively, we could apply the [[unlikely]] attribute (see Section 11.3) to both the cond1 and cond2 branches to convey to the compiler that inlining the cold1 and cold2 functions is not desired.
11.6 Function Reordering
11.7 Profile Guided Optimizations
Instrumentation slows the application down, which makes the build step longer and prevents profile collection directly from production systems, whether on client devices or in the cloud. Unfortunately, you cannot collect the profiling data once and use it for all future builds. As the source code of an application evolves, the profile data becomes stale (out of sync) and needs to be recollected.
Another caveat in the PGO flow is that a compiler should only be trained using
representative scenarios of how your application will be used. Otherwise, you may
end up degrading the program’s performance. The compiler “blindly” uses the profile
data that you provided. It assumes that the program will always behave the same no
matter what the input data is. Users of PGO should be careful about choosing the
input data they will use for collecting profiling data (step 2) because while improving
one use case of the application, others may be pessimized. Luckily, it doesn’t have to
be exactly a single workload since profile data from different workloads can be merged
to represent a set of use cases for the application.
An alternative solution was pioneered by Google in 2016 with sample-based PGO.
[Chen et al., 2016] Instead of instrumenting the code, the profiling data can be obtained
from the output of a standard profiling tool such as Linux perf. Google developed an
open-source tool called AutoFDO216 that converts sampling data generated by Linux
perf into a format that compilers like GCC and LLVM can understand.
This approach has a few advantages over instrumented PGO. First of all, it eliminates
one step from the PGO build workflow, namely step 1 since there is no need to build an
instrumented binary. Secondly, profiling data collection runs on an already optimized
binary, thus it has a much lower runtime overhead. This makes it possible to collect
profiling data in a production environment for a longer time. Since this approach
is based on hardware collection, it also enables new kinds of optimizations that are
not possible with instrumented PGO. One example is branch-to-cmov conversion,
which is a transformation that replaces conditional jumps with conditional moves to
avoid the cost of a branch misprediction (see Section 10.3). To effectively perform
this transformation, a compiler needs to know how frequently the original branch was
mispredicted. This information is available with sample-based PGO on modern CPUs
(Intel Skylake+).
In mid-2018, Meta open-sourced its binary optimization tool called BOLT217 that
works on already compiled binaries. It first disassembles the code, then it uses the
profile information collected by a sampling profiler, such as Linux perf, to do various
layout transformations and then relinks the binary again. [Panchenko et al., 2018] As
of today, BOLT has more than 15 optimization passes, including basic block reordering,
function splitting and reordering, and others. Similar to traditional PGO, primary
candidates for BOLT optimizations are programs that suffer from instruction cache
and ITLB misses. Since January 2022, BOLT has been a part of the LLVM project
and is available as a standalone tool.
A few years after BOLT was introduced, Google open-sourced its binary relinking tool
called Propeller. It serves a similar purpose but instead of disassembling the original
binary, it relies on linker input and thus can be distributed across several machines
216 AutoFDO - https://github.com/google/autofdo
217 BOLT - https://code.fb.com/data-infrastructure/accelerate-large-scale-applications-with-bolt/
for better scaling and less memory consumption. Post-link optimizers such as BOLT
and Propeller can be used in combination with traditional PGO (and Link-Time
Optimizations) and often provide an additional 5-10% performance speedup. Such
techniques open up new kinds of binary rewriting optimizations that are based on
hardware telemetry.
11.8 Reducing ITLB Misses
One way to reduce ITLB pressure is to map the code section onto huge pages: there is a special bit in the ELF binary header that determines whether the text segment should be backed by huge pages by default. The simplest way to set this bit is by using the hugeedit or hugectl utilities from the libhugetlbfs219 package. For example:
# Permanently set a special bit in the ELF binary header.
$ hugeedit --text /path/to/clang++
# Code section will be loaded using huge pages by default.
$ /path/to/clang++ a.cpp
11.9 Case Study: Measuring Code Footprint
--profile-mask 100 initiates LBR sampling, and -a enables you to specify a program to run. This command will collect the code footprint along with various other data. I don't show the output of the tool here; curious readers are welcome to study the documentation and experiment with the tool.
I took a set of four benchmarks: Clang C++ compilation, Blender ray tracing,
Cloverleaf hydrodynamics, and Stockfish chess engine; these workloads should be
already familiar to you from Section 4.11 where we analyzed their performance
characteristics. I ran them on an Intel Alder Lake-based processor.223
Before we start looking at the results, let’s spend some time on terminology. Different
parts of a program’s code may be exercised with different frequencies, so some parts
will be hotter than others. The perf-tools package doesn’t make this distinction
and uses the term “non-cold code” to refer to code that was executed at least once.
This is called two-way splitting since it splits the code into cold and non-cold parts.
Other tools (e.g., Meta’s HHVM) use three-way splitting and distinguish between hot,
warm, and cold code with an adjustable threshold between warm and hot. In this
section, we use the term “hot code” to refer to the non-cold code.
Results for each of the four benchmarks are presented in Table 11.1. The binary and
.text sizes were obtained with a standard Linux readelf utility, while other metrics
were collected with perf-tools. The non-cold code footprint [KB] metric is the
number of kilobytes with machine instructions that a program touched at least once.
The metric non-cold code [4KB-pages] tells us the number of non-cold 4KB-pages
with machine instructions that a program touched at least once. Together they help
us to understand how dense or sparse those non-cold memory locations are. It will
become clear once we dig into the numbers. Finally, we also present Frontend Bound
221 perf-tools - https://github.com/aayasin/perf-tools
222 The code footprint data collected by perf-tools is not exact since it is based on sampling
LBR records. Other tools like Intel’s sde -footprint, unfortunately, don’t provide code footprint.
However, it is not hard to write a PIN-based tool yourself that will measure the exact code footprint.
223 It doesn’t matter which machine you use for collecting code footprint as it depends on the
program and input data, and not on the characteristics of a particular machine. As a sanity check, I
ran it on a Skylake-based machine and got very similar results.
percentages, a metric that should be already familiar to you from Section 6.1 about
TMA.
Table 11.1: Code footprint of the benchmarks used in the case study.

Metric                          Clang17 compilation    Blender    CloverLeaf    Stockfish
Binary size [KB]                113844                 223914     672           39583
.text size [KB]                 67309                  133009     598           238
non-cold code footprint [KB]    5042                   313        104           99
non-cold code [4KB-pages]       6614                   546        104           61
Frontend Bound [%]              52.3                   29.4       5.3           25.8
Let’s first look at the binary and .text sizes. CloverLeaf is a tiny application compared
to Clang17 and Blender; Stockfish embeds the neural network file which accounts for
the largest part of the binary, but its code section is relatively small; Clang17 and
Blender have gigantic code bases. The .text size metric is the upper bound for our applications, i.e., we assume224 that the code footprint should not exceed the .text size.
A few interesting observations can be made by analyzing the code footprint data.
First, even though the Blender .text section is very large, less than 1% of Blender’s
code is non-cold: 313 KB out of 133 MB. So, just because a binary size is large,
doesn’t mean the application suffers from CPU Frontend bottlenecks. It’s the amount
of hot code that matters. For other benchmarks this ratio is higher: Clang17 7.5%,
CloverLeaf 17.4%, Stockfish 41.6%. In absolute numbers, the Clang17 compilation
touches an order of magnitude more bytes with machine instructions than the other
three applications combined.
Second, let’s examine the non-cold code [4KB-pages] row in the table. For Clang17,
non-cold 5042 KB are spread over 6614 4KB pages, which gives us 5042 / (6614 *
4) = 19% page utilization. This metric tells us how dense/sparse the hot parts of the
code are. The closer each hot cache line is located to another hot cache line, the fewer
pages are required to store the hot code. The higher the page utilization, the better. Basic block placement and function reordering, which we discussed earlier in this chapter, are perfect examples of transformations that improve page utilization. For other
benchmarks, the percentages are: Blender 14%, CloverLeaf 25%, and Stockfish 41%.
Now that we quantified the code footprints of the four applications, it’s tempting to
think about the size of L1-instruction and L2 caches and whether the hot code fits or
not. On my Alder Lake-based machine, the L1 I-cache is only 32 KB, which is not
enough to fully cover any of the benchmarks that we’ve analyzed. But remember, at
the beginning of this section we said that a large code footprint doesn’t immediately
point to a problem. Yes, a large codebase puts more pressure on the CPU Frontend,
but an instruction access pattern is also crucial for performance. The same locality
principles as for data accesses apply. That’s why we accompanied it with the Frontend
Bound metric from Topdown analysis.
224 It is not always true: an application itself may be tiny, but call into multiple other dynamically linked libraries.
For Clang17, the 5 MB of non-cold code causes a huge 52.3% Frontend Bound
performance bottleneck: more than half of the cycles are wasted waiting for instructions.
Of all the presented benchmarks, it benefits the most from PGO-type optimizations.
CloverLeaf doesn’t suffer from inefficient instruction fetch; 75% of its branches are
backward jumps, which suggests that those could be relatively small loops executed
over and over again. Stockfish, while having roughly the same non-cold code footprint
as CloverLeaf, poses a far greater challenge for the CPU Frontend (25.8%). It has a
lot more indirect jumps and function calls. Finally, Blender has even more indirect
jumps and calls than Stockfish.
I stop my analysis at this point as further investigations are outside the scope of this
case study. For readers who are interested in continuing the analysis, I suggest drilling
down into the Frontend Bound category according to the TMA methodology and
looking at metrics such as ICache_Misses, ITLB_Misses, DSB coverage, and others.
Another useful tool to study the code footprint is llvm-bolt-heatmap,225 which is part of LLVM's BOLT project. This tool can produce code heatmaps that give a fine-grained understanding of the code layout in your application. It is primarily used
to evaluate the original layout of hot code and confirm that the optimized layout is
more compact.
Chapter Summary
A summary of CPU Frontend optimizations is presented in Table 11.2.
Transform                How transformed?                 Why helps?                        Works best for                        Done by
Basic block placement    maintain fall through hot code   not taken branches are cheaper;   any code, especially with a lot of    compiler
                                                          better cache utilization          branches
Basic block alignment    shift the hot code using NOPs    better cache utilization          hot loops                             compiler
12 Other Tuning Areas
In this chapter, we will take a look at some of the optimization topics not specifically
related to any of the categories covered in the previous three chapters, but still
important enough to find their place in this book.
The major differences between x86 (considered a CISC ISA) and RISC ISAs, such as ARM and RISC-V, are summarized below:
• x86 instructions are variable-length, while ARM and RISC-V instructions are fixed-length. This makes decoding x86 instructions more complex.
• x86 has many addressing modes, while ARM and RISC-V have few. Operands in ARM and RISC-V instructions are either registers or immediate values, while x86 instruction inputs can also come from memory. This bloats the x86 instruction set but also allows for more powerful single instructions. For instance, ARM requires loading a memory location first and then performing the operation, while x86 can do both in one instruction.
In addition to this, there are a few other differences that you should consider when
optimizing for a specific microarchitecture. As of 2024, the most recent x86-64 ISA has
16 architectural general-purpose registers, while the latest ARMv8 and RV64 require
a CPU to provide 32 general-purpose registers. Extra architectural registers reduce
register spilling and hence reduce the number of loads/stores. Intel has announced a
new extension called APX226 that will increase the number of registers to 32.
There is also a difference in the default memory page size between x86 and ARM systems. The default page size on x86 platforms is 4 KB, while some ARM systems (for example, Apple MacBooks) use a 16 KB page size, although both platforms support larger page sizes
226 Intel APX - https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-
performance-extensions-apx.html
(see Section 3.7.2, and Section 8.4). All these differences can affect the performance of
your application when they become a bottleneck.
Although ISA differences may have a tangible impact on the performance of a specific
application, numerous studies show that on average, differences between the two most
popular ISAs, namely x86 and ARM, don’t have a measurable performance impact.
Throughout this book, I carefully avoided advertisements of any products (e.g., Intel
vs. AMD vs. Apple) and any religious ISA debates (x86 vs. ARM vs. RISC-V).227
Below are some references that I hope will close the debate:
• Performance or energy consumption differences are not generated by ISA differ-
ences, but rather by microarchitecture implementation. [Blem et al., 2013]
• ISA doesn’t have a large effect on the number and type of executed instructions.
[Weaver & McIntosh-Smith, 2023] [Blem et al., 2013]
• CISC code is not denser than RISC code. [Geelnard, 2022]
• ISA overheads can be effectively mitigated by microarchitecture implementa-
tion. For example, µop cache minimizes decoding overheads; instruction cache
minimizes code density impact. [Blem et al., 2013] [ChipsAndCheese, 2024]
Nevertheless, this doesn’t remove the value of architecture-specific optimizations.
In this section, we will discuss how to optimize for a particular platform. We will
cover ISA extensions, the CPU dispatch technique, and discuss how to reason about
instruction latencies and throughput.
12.1 CPU-Specific Optimizations
You would typically see CPU dispatching constructs used to optimize only specific
parts of the code, e.g., hot function or loop. Very often, these platform-specific
implementations are written with compiler intrinsics (see Section 9.5) to generate
desired instructions.
Even though CPU dispatching involves a runtime check, its overhead is not high. You can identify hardware capabilities once at startup and save them in a variable, so at runtime the dispatch becomes just a single branch, which is well predicted. Perhaps a bigger concern with CPU dispatching is the maintenance cost: every new specialized branch requires fine-tuning and validation.
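Below is a minimal sketch of the technique; the function names are illustrative, and it assumes a GCC or Clang compiler on x86, where __builtin_cpu_supports is available:

#include <cstddef>

// Specialized implementation that would use AVX2 intrinsics (body replaced
// with plain C++ here to keep the sketch self-contained).
static float sumAVX2(const float* a, std::size_t n) {
  float s = 0.f;
  for (std::size_t i = 0; i < n; ++i) s += a[i];
  return s;
}

// Portable fallback for CPUs without AVX2.
static float sumGeneric(const float* a, std::size_t n) {
  float s = 0.f;
  for (std::size_t i = 0; i < n; ++i) s += a[i];
  return s;
}

// Detect the capability once at startup and cache the result.
static const bool kHasAVX2 = __builtin_cpu_supports("avx2");

float sum(const float* a, std::size_t n) {
  // At runtime, dispatching is a single well-predicted branch.
  return kHasAVX2 ? sumAVX2(a, n) : sumGeneric(a, n);
}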
Reasoning about instruction latencies and throughput is not an easy task, and it requires deep ISA and microarchitecture knowledge. When in doubt, seek help on specialized forums. Also, keep in mind that some of these characteristics may change in future CPU generations, so consider using CPU dispatch to isolate the effect of your code changes.
On each iteration of the loop, we have two operations: calculate the squared value
of a[i] and accumulate the product in the sum variable. If you look closer, you may
notice that multiplications are independent of each other, so they can be executed in
parallel. The generated machine code (on the right) uses fused multiply-add (FMA)
to perform both operations with a single instruction. The problem here is that by
using FMAs, the compiler has included multiplication into the critical dependency
chain of the loop.
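In C++, the loop being discussed might look like the sketch below (my reconstruction based on the description above, not the book's listing):

float sumOfSquares(const float* a, int n) {
  float sum = 0.f;
  for (int i = 0; i < n; ++i)
    sum += a[i] * a[i]; // the compiler may contract this into a single FMA
  return sum;
}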
The vfmadd231ss instruction computes the squared value of a[i] (in xmm1) and then
accumulates the result in xmm0. There is a data dependency over xmm0: a processor
cannot issue a new vfmadd231ss instruction until the previous one has finished since
xmm0 is both an input and an output of vfmadd231ss. Even though multiplication
parts of FMA do not depend on each other, these instructions need to wait until all
inputs become available. The performance of this loop is bound by FMA latency,
which in Intel’s Alder Lake is 4 cycles.
In this case, fusing multiplication and addition hurts performance. We would be better off with two separate instructions. The nanoBench experiment below proves that:
# ran on Intel Core i7-1260P (Alder Lake)

# FMA version:
$ sudo ./kernel-nanoBench.sh -f -basic -loop 100 -unroll 1000
  -warm_up_count 10 -asm "
  vmovss xmm1, dword ptr [R14];
  vfmadd231ss xmm0, xmm1, xmm1;"
  -asm_init "<not shown>"
Instructions retired: 2.00
Core cycles: 4.00

# Separate multiply and add:
$ sudo ./kernel-nanoBench.sh -f -basic -loop 100 -unroll 1000
  -warm_up_count 10 -asm "
  vmovss xmm1, dword ptr [R14];
  vmulss xmm1, xmm1, xmm1;
  vaddss xmm0, xmm0, xmm1;"
  -asm_init "<not shown>"
Instructions retired: 3.00
Core cycles: 2.00
The FMA version runs in four cycles per iteration, which corresponds to the FMA latency. In the second version, however, the vmulss instructions do not depend on each other, so they can run in parallel. There is still a loop-carried dependency over xmm0 in the vaddss instruction (FADD), but the latency of FADD is only two cycles, which is why the second version runs in just two cycles per iteration. The latency and throughput characteristics of other processors may vary.230
From this experiment, we know that if the compiler had not decided to fuse the multiplication and addition into a single instruction, this loop would run twice as fast. This only became clear once we examined the loop dependencies and compared the latencies of the FMA and FADD instructions. Since Clang 18, you can prevent generating FMA instructions within a scope by using #pragma clang fp contract(off).231
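A minimal sketch of using this pragma (the function is hypothetical):

float sumOfSquaresNoFMA(const float* a, int n) {
  // Disable floating-point contraction in this scope so the multiply and add
  // are emitted as separate instructions.
  #pragma clang fp contract(off)
  float sum = 0.f;
  for (int i = 0; i < n; ++i)
    sum += a[i] * a[i];
  return sum;
}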
231 LLVM extensions to specify floating-point flags - https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags
232 Otsu’s thresholding method - https://en.wikipedia.org/wiki/Otsu%27s_method
12.2 Microarchitecture-Specific Performance Issues
Recall from Section 3.8.3 that the processor doesn’t necessarily know about a potential
store-to-load forwarding, so it has to make a prediction. If it correctly predicts a
memory order violation between two updates of color 0xFF, then these accesses will
be serialized. The performance will not be great, but it is the best we could hope
for with the initial code. On the contrary, if the processor predicts that there is no memory order violation, it will speculatively let the two updates run in parallel. Later it will recognize the mistake, flush the pipeline, and re-execute the younger of the two updates. This hurts performance significantly.
Performance will greatly depend on the color patterns of the input image. Images
with long sequences of pixels with the same color will have worse performance than
images where colors don’t repeat often. The performance of the initial version will
be good as long as the distance between two pixels of the same color is long enough.
The phrase “long enough” in this context is determined by the size of the out-of-order
instruction window. Repeating read-modify-writes of the same color may trigger
ordering violations if they occur within a few loop iterations of each other, but not if
they occur more than a hundred loop iterations apart.
A cure for the memory order violation problem is shown in Listing 12.2, on the right.
As you can see, I duplicated the histogram, and now the processing of pixels alternates
between two partial histograms. In the end, we combine the two partial histograms
to get a final result. This new version with two partial histograms is still prone to
potentially problematic patterns, such as 0xFF 0x00 0xFF 0x00 0xFF ... However,
with this change, the original worst-case scenario (e.g., 0xFF 0xFF 0xFF ...) will run
twice as fast as before. It may be beneficial to create four or eight partial histograms
depending on the color pattern of input images. This exact code is featured in one of the Performance Ninja lab assignments.233
233 Performance Ninja: memory order violation - https://github.com/dendibakh/perf-ninja/tree/main/labs/memory_bound/mem_order_violation_1
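A sketch of this technique in C++ could look as follows (my illustration of the idea; the book's Listing 12.2 is the reference implementation):

#include <array>
#include <cstdint>
#include <vector>

std::array<uint32_t, 256> computeHistogram(const std::vector<uint8_t>& image) {
  std::array<uint32_t, 256> hist1{};
  std::array<uint32_t, 256> hist2{};
  std::size_t i = 0;
  // Alternate updates between two partial histograms so that repeated pixels
  // of the same color no longer form back-to-back read-modify-writes to the
  // same memory location.
  for (; i + 1 < image.size(); i += 2) {
    ++hist1[image[i]];
    ++hist2[image[i + 1]];
  }
  if (i < image.size())
    ++hist1[image[i]];                      // handle an odd-sized input
  for (std::size_t c = 0; c < 256; ++c)
    hist1[c] += hist2[c];                   // combine the partial histograms
  return hist1;
}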
Figure 12.1: AVX2 loads in a misaligned array. Every second load crosses the cache line
boundary.
now takes an additional argument, which you can use to control the alignment of
dynamically allocated memory. When using standard containers, such as std::vector,
you can define a custom allocator. Listing 12.4 shows a minimal example of a custom
allocator that aligns the memory buffer at the cache line boundary.
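A minimal sketch of such an allocator, assuming a 64-byte cache line, might look like this (my illustration, not the book's Listing 12.4):

#include <cstddef>
#include <new>
#include <vector>

template <typename T, std::size_t Alignment = 64>  // 64 = assumed cache line size
struct CacheLineAlignedAllocator {
  using value_type = T;
  CacheLineAlignedAllocator() = default;
  template <typename U>
  CacheLineAlignedAllocator(const CacheLineAlignedAllocator<U, Alignment>&) {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(
        ::operator new(n * sizeof(T), std::align_val_t(Alignment)));
  }
  void deallocate(T* p, std::size_t) {
    ::operator delete(p, std::align_val_t(Alignment));
  }
  template <typename U>
  bool operator==(const CacheLineAlignedAllocator<U, Alignment>&) const { return true; }
  template <typename U>
  bool operator!=(const CacheLineAlignedAllocator<U, Alignment>&) const { return false; }
};

// The buffer of this vector starts at a cache line boundary.
using AlignedFloatVector = std::vector<float, CacheLineAlignedAllocator<float>>;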
234 Performance Ninja: memory alignment - https://github.com/dendibakh/perf-ninja/tree/main/labs/memory_bound/mem_alignment_1
However, it’s not enough to only align the starting offset of a matrix. Consider an
example of a 9x9 matrix of float values shown in Figure 12.2. If a cache line is 64
bytes, it can store 16 float values. When using AVX2 instructions, the program will
load/store 8 elements (256 bits) at a time. In each row, the first eight elements will be
processed in a SIMD way, while the last element will be processed in a scalar way by
the loop remainder. The second vector load/store (elements 10–17) crosses the cache line boundary, as do many other subsequent vector loads/stores. The problem highlighted in Figure 12.2 affects any matrix whose number of columns is not a multiple of 8 (for AVX2 vectorization). SSE and ARM Neon vectorization requires 16-byte alignment; AVX-512 requires 64-byte alignment.
Figure 12.2: Split loads/stores inside a 9x9 matrix when using AVX2 vectorization. The split
memory access is highlighted in yellow.
So, in addition to aligning the starting offset, each row of the matrix should be aligned
as well. For example in Figure 12.2, it can be achieved by inserting seven dummy
columns into the matrix, effectively making it a 9x16 matrix. This will align the
second row (elements 10-18) at the offset 0x40. Similarly, all the other rows will be
aligned as well. The dummy columns will not be processed by the algorithm, but they
will ensure that the actual data is aligned at the cache line boundary. In my testing,
the performance impact of this change was up to 30%, depending on the matrix size
and platform configuration.
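The required padded width can be computed by rounding the number of columns up to a multiple of the vector width, as in this small sketch (names are illustrative):

#include <cstddef>

// Round `columns` up to the next multiple of `vectorWidth` elements.
constexpr std::size_t paddedColumns(std::size_t columns, std::size_t vectorWidth) {
  return (columns + vectorWidth - 1) / vectorWidth * vectorWidth;
}

static_assert(paddedColumns(9, 8) == 16);  // the 9x9 example becomes 9x16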
Alignment and padding cause holes with unused bytes, which potentially decreases
memory bandwidth utilization. For small matrices, like our 9x9 matrix, padding
will cause almost half of each row to be unused. However, for large matrices, like 1025x1025, the impact of padding is not that big. Nevertheless, for some algorithms,
e.g., in AI, memory bandwidth can be a bigger concern. Use these techniques with
care and always measure to see if the performance gain from alignment is worth the
cost of unused bytes.
Accesses that cross a 4 KB boundary introduce more complications because virtual to
physical address translations are usually handled in 4 KB pages. Handling such access
would require accessing two TLB entries as well. Unless a TLB supports multiple
lookups per cycle, such loads can cause a significant slowdown.
Recall that caches are organized into sets and ways, and the set in which a fetched cache line may reside is determined by its address. Based on the address bits, the cache controller does set selection, i.e., it determines the set to which a cache line with the fetched memory location will go.
If two memory locations map to the same set, they will compete for the limited number of available slots (ways) in the set. When a program repeatedly accesses memory locations that map to the same set, they will constantly evict each other. This may cause saturation of one set in the cache and underutilization of other sets. This is known as cache aliasing, though you may find people using the terms cache contention, cache conflicts, or cache thrashing to describe this effect.
A simple example of cache aliasing can be observed in matrix transposition as explained
in detail in [Fog, 2023b, section 9.10 Cache contentions in large data structures]. I
encourage readers to study this manual to learn more about why it happens. I
repeated the experiment on a few modern processors and confirmed that it remains a
relevant issue. Figure 12.3 shows the performance of transposing matrices of 32-bit floating-point values on Intel's 12th-gen Core i7-1260P processor.
Figure 12.3: Cache aliasing effects observed in matrix transposition ran on Intel’s 12th-gen
processor. Matrix sizes that are powers of two or multiples of 128 cause more than 10x
performance drop.
There are several spikes in the chart, which correspond to the matrix sizes that cause
cache aliasing. Performance drops significantly when the matrix size is a power of
two (e.g., 256, 512) or is a multiple of 128 (e.g., 384, 640, 768, 896).235 This happens
because memory locations that belong to the same column are mapped to the same set
in the L1D and L2 caches. These memory locations compete for the limited number
of ways in the set which causes the same cache line to be reloaded many times before
every element on this line is processed.
On Intel’s processors, this issue can be diagnosed with the help of the L1D.REPLACEMENT
performance event, which counts L1 cache line replacements. For instance, there are
17 times more cache line replacements for the matrix size 256x256 than for the
size 255x255. I tested all sizes from 64x64 up to 10,000x10,000 and found that the
pattern repeats very consistently. I also ran the same experiment on an Intel Skylake-
based processor as well as an Apple M1 chip and confirmed that these chips are prone
to cache aliasing effects.
235 Also, there are a few spikes at the sizes 341, 683, and 819. Supposedly, these sizes suffer from the
To mitigate cache aliasing, you can use cache blocking as we discussed in Section 9.3.2.
The idea is to process the matrix in smaller blocks that fit into the cache. That
way you will avoid cache line eviction since there will be enough space in the cache.
Another way to solve this is to pad the matrix with extra columns, e.g., instead of a
256x256 matrix, you would allocate a 256x264 matrix; in a similar way we did in the
previous section. But be careful not to run into misaligned memory access issues.
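As an illustration of the cache blocking idea (a sketch, not the code used in the experiment above), a tiled transpose processes the matrix in small square blocks so that the column accesses of a block stay within the cache instead of repeatedly evicting each other:

// Transpose an n x n float matrix in kBlock x kBlock tiles. The block size of
// 32 is an assumption; tune it so that a tile of the input and a tile of the
// output fit in the caches together.
constexpr int kBlock = 32;

void transposeBlocked(const float* in, float* out, int n) {
  for (int ii = 0; ii < n; ii += kBlock)
    for (int jj = 0; jj < n; jj += kBlock)
      for (int i = ii; i < ii + kBlock && i < n; i++)
        for (int j = jj; j < jj + kBlock && j < n; j++)
          out[j * n + i] = in[i * n + j];
}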
Without subnormal values, the subtraction of two FP values a - b can underflow and
produce zero even though the values are not equal. Subnormal values allow calculations
to gradually lose precision without rounding the result to zero. However, this comes
at a cost, as we shall see shortly. Subnormal values may also occur in production
software when a value keeps decreasing in a loop with subtraction or division.
From the hardware perspective, handling subnormals is more difficult than handling
normal FP values as it requires special treatment and is generally considered an
exceptional situation. The application will not crash, but it will get a performance
penalty. Calculations that produce or consume subnormal numbers are slower than
similar calculations on normal numbers and can run 10 times slower or more.
236 IEEE Standard 754 - https://ieeexplore.ieee.org/document/8766229
237 Subnormal number - https://en.wikipedia.org/wiki/Subnormal_number
Keep in mind that both FTZ and DAZ modes are incompatible with IEEE Standard
754. They are implemented in hardware to improve performance for applications where
underflow is common and generating a denormalized result is unnecessary. I have
observed a 3%-5% performance penalty on some production floating-point applications
that were using subnormal values.
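On x86, one common way to enable these modes from source code is through the SSE control register intrinsics; the sketch below assumes the compiler ships the standard xmmintrin.h/pmmintrin.h headers. Note that both settings are per-thread.

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

// Flush subnormal results to zero (FTZ) and treat subnormal inputs as zero
// (DAZ). This trades IEEE 754 compliance for speed; call it early in every
// thread that performs the floating-point work.
void disableSubnormals() {
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}

Compilers can also do this for you: for example, GCC and Clang link in startup code that enables FTZ/DAZ when a program is built with -ffast-math.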
12.3 Low Latency Tuning Techniques
Listing 12.7 A dump of Linux top command with additional vMn field while compiling
a large C++ project.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND vMn
341763 dendiba+ 20 0 303332 165396 83200 R 99.3 1.0 0:05.09 c++ 13k
341705 dendiba+ 20 0 285768 153872 87808 R 99.0 1.0 0:07.18 c++ 5k
341719 dendiba+ 20 0 313476 176236 83328 R 94.7 1.1 0:06.49 c++ 8k
341709 dendiba+ 20 0 301088 162800 82944 R 93.4 1.0 0:06.46 c++ 2k
341779 dendiba+ 20 0 286468 152376 87424 R 92.4 1.0 0:03.08 c++ 26k
341769 dendiba+ 20 0 293260 155068 83072 R 91.7 1.0 0:03.90 c++ 22k
341749 dendiba+ 20 0 360664 214328 75904 R 88.1 1.3 0:05.14 c++ 18k
341765 dendiba+ 20 0 351036 205268 76288 R 87.1 1.3 0:04.75 c++ 18k
341771 dendiba+ 20 0 341148 194668 75776 R 86.4 1.2 0:03.43 c++ 20k
341776 dendiba+ 20 0 286496 147460 82432 R 76.2 0.9 0:02.64 c++ 25k
In the HFT world, anything more than 0 is a problem. But for low latency applications
in other business domains, a constant occurrence in the range of 100-1000 faults per
second should prompt further investigation. Investigating the root cause of runtime
minor page faults can be as simple as firing up perf record -e page-faults and
then perf report to locate offending source code lines.
To avoid page fault penalties during runtime, you should pre-fault all the memory for
the application at startup time. A toy example might look something like this:
#include <stdlib.h>   // malloc
#include <unistd.h>   // sysconf

char *mem = (char *)malloc(size);
long pageSize = sysconf(_SC_PAGESIZE);
for (size_t i = 0; i < size; i += pageSize)
    mem[i] = 0;   // touch one byte per page to pre-fault it into RAM
First, this sample code allocates size bytes of memory on the heap as usual.
Immediately after that, it steps through the newly allocated region page by page and
writes to the first byte of each page, which ensures that every page is brought into RAM.
This method helps to avoid runtime delays caused by minor page faults during future accesses.
Take a look at Listing 12.8 with a more comprehensive approach to tuning the glibc
allocator in conjunction with mlock/mlockall syscalls (taken from the “Real-time
Linux Wiki” 238 ).
The code in Listing 12.8 tunes three glibc malloc settings: M_MMAP_MAX,
M_TRIM_THRESHOLD, and M_ARENA_MAX.
• Setting M_MMAP_MAX to 0 disables the underlying mmap syscall usage for large alloca-
tions. This is necessary because the effect of mlockall can be undone when the library
calls munmap to release mmap-ed segments back to the OS, defeating
the purpose of our efforts.
• Setting M_TRIM_THRESHOLD to -1 prevents glibc from returning memory to the
OS after calls to free. As indicated before, this option has no effect on mmap-ed
segments.
• Finally, setting M_ARENA_MAX to 1 prevents glibc from allocating multiple arenas
via mmap to accommodate multiple cores. Keep in mind that this hinders
the glibc allocator's multithreaded scalability.
238 The Linux Foundation Wiki: Memory for Real-time Applications - https://wiki.linuxfoundation
.org/realtime/documentation/howto/applications/memory
Listing 12.8 Tuning the glibc allocator to lock pages in RAM and prevent releasing
them to the OS.
#include <malloc.h>
#include <sys/mman.h>
// Call these once at application startup, before allocating and pre-faulting memory.
mallopt(M_MMAP_MAX, 0);         // don't use mmap for large allocations
mallopt(M_TRIM_THRESHOLD, -1);  // never return heap memory to the OS
mallopt(M_ARENA_MAX, 1);        // use a single allocation arena
mlockall(MCL_CURRENT | MCL_FUTURE);  // lock current and future pages in RAM
Combined, these settings force glibc into heap allocations that will not release
memory back to the OS until the application ends. As a result, the heap will remain
the same size after the final call to free(mem) in the code above. Any subsequent
runtime calls to malloc or new will simply reuse space in this pre-allocated/pre-faulted
heap area if it is sufficiently sized at initialization.
More importantly, all that heap memory that was pre-faulted in the for-loop will
persist in RAM due to the previous mlockall call – the option MCL_CURRENT locks all
pages that are currently mapped, while MCL_FUTURE locks all pages that will become
mapped in the future. An added benefit of using mlockall this way is that any thread
spawned by this process will have its stack pre-faulted and locked as well. For finer
control of page locking, developers can use the mlock system call, which lets you
choose which pages should persist in RAM. A downside of this technique
is that it reduces the amount of memory available to other processes running on the
system.
Developers of applications for Windows should look into the following APIs: lock
pages with VirtualLock, and avoid immediately releasing memory by calling VirtualFree
with the MEM_DECOMMIT flag rather than MEM_RELEASE.
These are just two example methods for preventing runtime minor faults. Some or all
of these techniques may be already integrated into memory allocation libraries such
as jemalloc, tcmalloc, or mimalloc. Check the documentation of your library to see
what is available.
Since other players in the market are likely to catch the same market signal, the
success of the strategy largely relies on how fast we can react, in other words, how fast
we send the order to the exchange. When we want our order to reach the exchange as
fast as possible and to take advantage of the favorable signal detected in the market
data, the last thing we want is to meet roadblocks right at the moment we decide to
take off.
When a certain code path is not exercised for a while, its instructions and associated
data are likely to be evicted from the I-cache and D-cache. Then, just when we need
that critical piece of rarely executed code to run, we take I-cache and D-cache miss
penalties, which may cause us to lose the race. This is where the technique of cache
warming is helpful.
Cache warming involves periodically exercising the latency-sensitive code to keep it in
the cache while ensuring it does not follow all the way through with any unwanted
actions. Exercising the latency-sensitive code also “warms up” the D-cache by bringing
latency-sensitive data into it. This technique is routinely employed for HFT applica-
tions. While I will not provide an example implementation, you can get a taste of it
in a CppCon 2018 lightning talk239 .
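To give a flavor of the technique, here is my own sketch under assumed names (Order, transmitToExchange), not the implementation from the talk: the latency-critical path accepts a flag that suppresses the final side effect.

#include <cstdint>

// Hypothetical order type and network send; the real ones depend on the system.
struct Order { std::uint64_t id; double price; int quantity; };
void transmitToExchange(const Order&) { /* the real packet send lives elsewhere */ }

// Running this periodically with warmOnly=true keeps the instructions and data
// of the send path resident in the I-cache and D-cache, without emitting orders.
void sendOrder(const Order& order, bool warmOnly) {
  // ... the usual validation/serialization work, touching the same code and
  //     data structures as the real hot path ...
  if (!warmOnly)
    transmitToExchange(order);  // only the real call goes on the wire
}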
TLB shootdowns can be monitored via the /proc/interrupts file. For example, you might
run watch -n 5 -d 'grep TLB /proc/interrupts', where the -n 5 option refreshes the
view every 5 seconds and -d highlights the delta between consecutive refreshes.
Listing 12.9 shows a dump of /proc/interrupts with a large number of TLB shoot-
downs on the CPU2 processor that ran the latency-critical thread. Notice the order
of magnitude difference compared to the other cores. In that scenario, the culprit was
a Linux kernel feature called Automatic NUMA Balancing, which can
be easily disabled with sysctl -w kernel.numa_balancing=0.
Listing 12.9 A dump of /proc/interrupts that shows a large number of TLB shoot-
downs on CPU2
CPU0 CPU1 CPU2 CPU3
...
NMI: 0 0 0 0 Non-maskable interrupts
LOC: 552219 1010298 2272333 3179890 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
...
IWI: 0 0 0 0 IRQ work interrupts
RTR: 7 0 0 0 APIC ICR read retries
RES: 18708 9550 771 528 Rescheduling interrupts
CAL: 711 934 1312 1261 Function call interrupts
TLB: 4493 6108 73789 5014 TLB shootdowns
But that's not the only source of TLB shootdowns. Others include Transparent Huge
Pages, memory compaction, page migration, and page cache writeback. Garbage
collectors can also initiate TLB shootdowns. These features relocate pages
and/or alter page permissions in the process of fulfilling their duties, which requires
page table updates and, thus, TLB shootdowns.
Preventing TLB shootdowns requires limiting the number of updates made to the
shared process address space. On the source code level, you should avoid runtime exe-
cution of the aforementioned list of syscalls, namely munmap, mprotect, and madvise.
On the OS level, disable kernel features that induce TLB shootdowns as a consequence
of its function, such as Transparent Huge Pages and Automatic NUMA Balancing.
For a more nuanced discussion on TLB shootdowns, along with their detection and
prevention, read a related article240 on the JabPerf blog.
For this specific case, if heavy AVX512 instruction usage is not desired, add
-mprefer-vector-width=### to your compilation flags to cap the vector width
at 128 or 256 bits. Again, if your entire server fleet runs on the
latest chips, then this is much less of a concern since the frequency throttling impact
of AVX instructions is negligible nowadays.
242 Power Management States: P-States, C-States - https://software.intel.com/content/www/us/en
/develop/articles/power-management-states-p-states-c-states-and-package-c-states.html
243 Cache Locking. Survey of cache locking techniques [Mittal, 2016]. An example of pseudo-locking
a portion of the cache, which is then exposed as a character device in the Linux file system and made
available for mmaping: https://events19.linuxfoundation.org/wp-content/uploads/2017/11/Introducin
g-Cache-Pseudo-Locking-to-Reduce-Memory-Access-Latency-Reinette-Chatre-Intel.pdf
12.5 Case Study: Sensitivity to Last Level Cache Size
Our analysis will help us identify applications whose performance drops significantly
when the size of the LLC decreases. We say that such applications are sensitive to
the size of the LLC. It will also identify applications that are not sensitive, i.e.,
whose performance is not affected by the LLC size. This result can be used to properly
size the processor LLC, especially considering the wide range available on the market.
For example, we can determine that an application benefits from a larger LLC. Then
perhaps an investment in new hardware is justified. Conversely, if the performance of
an application doesn’t improve from having a large LLC, then we can probably buy a
cheaper processor.
For this case study, we use an AMD Milan processor, but other server processors,
such as Intel Xeon [Herdrich et al., 2016] and Arm ThunderX [Wang et al., 2017],
also include hardware support for users to control the allocation of both LLC space
and memory read bandwidth to processor threads. Based on our tests, the method
described in this section works equally well on AMD Zen4-based desktop processors,
such as 7950X and 7950X3D.
Feature Value
Processor AMD EPYC 7313P
Cores x threads 16 × 2
Configuration 4 CCX × 4 cores/CCX
Frequency 3.0/3.7 GHz, base/max
L1 cache (I, D) 8-ways, 32 KiB (per core), 64-byte lines
L2 cache 8-ways, 512 KiB (per core), 64-byte lines
LLC 16-ways, 32 MB, non-inclusive (per CCX), 64-byte lines
Main Memory 512 GiB DDR4, 8 channels, nominal peak BW: 204.8 GB/s
TurboBoost Disabled
Hyperthreading Disabled (1 thread/core)
OS Ubuntu 22.04, kernel 5.15.0-76
Figure 12.4 shows the clustered memory hierarchy of an AMD Milan 7313P processor.
It consists of four Core Complex Dies (CCDs) connected to each other and to off-chip
memory via an I/O chiplet. Each CCD integrates a Core CompleX (CCX) and an
I/O connection. In turn, each CCX has four Zen3 cores that share a 32 MB victim
LLC.245
Figure 12.4: The clustered memory hierarchy of the AMD Milan 7313P processor.
Although there is a total of 128 MB of LLC (32 MB/CCX x 4 CCX), the four cores of
a CCX cannot store cache lines in an LLC other than their own 32 MB LLC. Since
we will be running single-threaded benchmarks, we can focus on a single CCX. The
LLC size in our experiments will vary from 0 to 32 MB with steps of 2 MB. This is
245 “Victim” means that the LLC is filled with the cache lines evicted from the four L2 caches of a
CCX.
281
12 Other Tuning Areas
directly related to having a 16-way LLC: by disabling one of 16 ways, we reduce the
LLC size by 2 MB.
Metrics
The ultimate metric for quantifying the performance of an application is execution
time. To analyze the impact of the memory hierarchy on system performance, we will
also use the following three metrics: 1) CPI, cycles per instruction, 2) DMPKI, demand
misses in the LLC per thousand instructions, and 3) MPKI, total misses (demand +
prefetch) in the LLC per thousand instructions. While CPI has a direct correlation
with the performance of an application, DMPKI and MPKI do not necessarily impact
performance. Table 12.2 shows the formulas used to calculate each metric from specific
hardware counters. Detailed descriptions for each of the counters are available in
AMD's Processor Programming Reference [Advanced Micro Devices, 2021].
246 SPEC CPU® 2017 - https://www.spec.org/cpu2017/.
Table 12.2: Formulas for calculating metrics used in the case study.
CPI = Cycles not in Halt (PMCx076) / Retired Instructions (PMCx0C0)
DMPKI = Demand Data Cache Fills247 (PMCx043) / (Retired Instructions (PMCx0C0) / 1000)
MPKI = L3 Misses248 (L3PMCx04) / (Retired Instructions (PMCx0C0) / 1000)
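For reference, here is a small helper (not from the study's scripts) that turns raw counter readings into the three metrics of Table 12.2:

struct LlcMetrics { double cpi, dmpki, mpki; };

// cycles, instructions, demandFills, and l3Misses are raw counter readings
// (PMCx076, PMCx0C0, PMCx043, and L3PMCx04, respectively).
LlcMetrics computeMetrics(double cycles, double instructions,
                          double demandFills, double l3Misses) {
  return { cycles / instructions,
           demandFills / (instructions / 1000.0),
           l3Misses / (instructions / 1000.0) };
}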
Results
We run a set of SPEC CPU2017 benchmarks alone in the system using only one
instance and a single hardware thread. We repeat those runs while changing the
available LLC size from 0 to 32 MB in 2 MB steps.
Figure 12.5 shows in graphs, from left to right, CPI, DMPKI, and MPKI for each
assigned LLC size. We only show three workloads, namely 503.bwaves (blue),
520.omnetpp (green), and 554.roms (red). They cover the three main trends observed
in all other applications. Thus, we do not show the rest of the benchmarks.
For the CPI chart, a lower value on the Y-axis means better performance. Also, since
the frequency on the system is fixed, the CPI chart is reflective of absolute scores. For
example, 520.omnetpp (green line) with 32 MB LLC is 2.5 times faster than with 0
MB LLC. For the DMPKI and MPKI charts, the lower the value on the Y-axis, the
better.
Two different behaviors can be observed in the CPI and DMPKI graphs. On one
hand, 520.omnetpp takes advantage of its available space in the LLC: both CPI and
DMPKI decrease significantly as the space allocated in the LLC increases. We can
say that the behavior of 520.omnetpp is sensitive to the size available in the LLC.
Increasing the allocated LLC space improves performance because it avoids evicting
cache lines that will be used in the future.
In contrast, 503.bwaves and 554.roms don’t make use of all available LLC space.
For both benchmarks, CPI and DMPKI remain roughly constant as the allocation
limit in the LLC grows. We can say that the performance of these two applications is
247 We used subevents MemIoRemote and MemIoLocal, that count demand data cache fills from DRAM
Figure 12.5: CPI, DMPKI, and MPKI for increasing LLC allocation limits (2 MB steps).
insensitive to their available space in the LLC.249 If your application shows similar
behavior, you can buy a cheaper processor with a smaller LLC size without sacrificing
performance.
Let’s now analyze the MPKI graph, which combines both LLC demand misses and
prefetch requests. First of all, we can see that the MPKI values are always much
higher than the DMPKI values. That is, most of the blocks are loaded from memory
into the on-chip hierarchy by the prefetcher. This behavior is due to the fact that the
prefetcher is effective in preloading the private caches with the data to be used, thus
eliminating most of the demand misses.
For 503.bwaves, we observe that MPKI remains roughly at the same level, similar to
CPI and DMPKI charts. There is likely not much data reuse in the benchmark and/or
the memory traffic is very low. The 520.omnetpp workload behaves as we identified
earlier: MPKI decreases as the available space increases.
However, for 554.roms, the MPKI chart shows a large drop in total misses as the
available space increases while CPI and DMPKI remain unchanged. In this case,
there is data reuse in the benchmark, but it is not consequential to the performance.
The prefetcher can bring required data ahead of time, eliminating demand misses,
regardless of the available space in the LLC. However, as the available space decreases,
the probability that the prefetcher will not find the blocks in the LLC and will have to
load them from memory increases. So, giving more LLC capacity to 554.roms does
not directly benefit its performance, but it does benefit the system since it reduces
memory traffic. So, it is better not to limit available LLC space for 554.roms as it
may negatively affect the performance of other applications running on the system.
[Navarro-Torres et al., 2023]
0-4 MB. Once the LLC size is 4 MB and above, the performance remains constant. That means that
554.roms doesn’t require more than 4 MB of LLC space to perform well.
Questions and Exercises
1. Solve the following lab assignments from the Performance Ninja online course:
• perf-ninja::mem_alignment_1
• perf-ninja::io_opt1
2. Study the ISA extensions supported by the processor you’re working with. Check
if the application that you’re working on uses these extensions. If not, can it
benefit from them?
3. Run the application that you work with daily. Find the hotspots. Check whether
they suffer from any of the microarchitecture-specific issues that we discussed in
this chapter.
4. Describe how you can avoid page faults on a critical path in your application.
Chapter Summary
• Processors from different vendors are not created equal. They differ in terms
of instruction set architecture (ISA) that they support and microarchitecture
implementation. Reaching peak performance often requires leveraging the latest
ISA extensions and tuning the application for a specific CPU microarchitecture.
• CPU dispatching is a technique that enables you to introduce platform-specific
optimizations. Using it, you can provide a fast path for a specific microarchitec-
ture while keeping a generic implementation for other platforms.
• We explored several performance corner cases that are caused by the interaction
of the application with the CPU microarchitecture. These include memory
ordering violations, misaligned memory accesses, cache aliasing, and denormal
floating-point numbers.
• We also discussed a few low-latency tuning techniques that are essential for
applications that require fast response times. We showed how to avoid page
faults, cache misses, TLB shootdowns, and core throttling on a critical path.
• System tuning is the last piece of the puzzle. Some knobs and settings may
affect the performance of your application. It is crucial to ensure that the system
firmware, the OS, or the kernel does not destroy all the efforts put into tuning
the application.
13 Optimizing Multithreaded Applications
Modern CPUs are getting more and more cores each year. As of 2024, you can buy
a server processor which will have more than 200 cores! And even a laptop with 16
execution threads is a pretty usual setup nowadays. Since there is so much processing
power in every CPU, effective utilization of all the hardware threads becomes more
challenging. Preparing software to scale well with a growing amount of CPU cores is
very important for the future success of your application.
There is a difference in how server and client products exploit parallelism. Most
server platforms are designed to process requests from a large number of customers.
Those requests are usually independent of each other, so the server can process them
in parallel. If there is enough load on the system, the applications themselves can be
single-threaded and platform utilization will still be high. However, when you use
your server platform for HPC or AI computations, you need all the computing
power you can get. On the other hand, client platforms, such as laptops and desktops,
have all the resources to serve a single user. In this case, an application has to make
use of all the available cores to provide the best user experience. In this chapter, we
will focus on applications that can scale to a large number of cores.
From the software perspective, there are two primary ways to achieve parallelism:
multiprocessing and multithreading. In a multiprocess application, multiple inde-
pendent processes run concurrently. Each process has its own memory space and
communicates with other processes through inter-process communication mechanisms
such as pipes, sockets, or shared memory. In a multithreaded application, a single
process contains multiple threads, which share the same memory space and resources
of the process. Threads within the same process can communicate and share data
more easily because they have direct access to the same memory space. However,
synchronization between threads is usually more complex and is prone to issues like
race conditions and deadlocks. In this chapter, we will mostly focus on multithreaded
applications; however, some techniques can be applied to multiprocess applications as
well, and we will show examples of both types.
When talking about throughput-oriented applications, we can distinguish the following
two types of applications:
• Massively parallel applications. Such applications usually scale well with the
number of cores. They are designed to process a large number of independent
tasks. Massively parallel programs often use the divide-and-conquer technique
to split the work into smaller tasks, which are then processed in parallel by worker
threads. Examples of such applications are scientific computations, video
rendering, data analytics, AI, and many others. The main obstacle for such
applications is the saturation of a shared resource, such as memory bandwidth,
that can effectively stall all the worker threads in the process.
• Applications that require synchronization. Such applications have workers
share resources to complete their tasks. Worker threads depend on each other,
which creates periods when some threads are stalled. Examples of such appli-
cations are databases, web servers, and other server applications. The main
13.1 Parallel Efficiency Metrics
Measuring overhead and spin time can be challenging, and I recommend using a
performance analysis tool like Intel VTune Profiler, which can provide these metrics.
Thread Count
Most parallel applications have a configurable number of threads, which allows them to
run efficiently on platforms with a different number of cores. Running an application
using a lower number of threads than is available on the system underutilizes its
resources. On the other hand, running an excessive number of threads can cause
oversubscription; some threads will be waiting for their turn to run.
Besides actual worker threads, multithreaded applications usually have other house-
keeping threads: main thread, input/output threads, etc. If those threads consume
significant time, they will take execution time away from worker threads, as they too
require CPU cores to run. This is why it is important to know the total thread count
and configure the number of worker threads properly.
250 Threading libraries such as pthread, OpenMP, and Intel TBB incur additional overhead for creating
To avoid a penalty for thread creation and destruction, engineers usually allocate a pool
of threads251 with multiple threads waiting for tasks to be allocated for concurrent
execution by the supervising program. This is especially beneficial for executing
short-lived tasks.
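For illustration, below is a minimal thread pool sketch (my own simplification, not a production design such as Intel TBB): a fixed set of workers waits on a shared queue, so short-lived tasks reuse existing threads instead of paying thread creation and destruction costs for every task.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
  explicit ThreadPool(unsigned numThreads) {
    for (unsigned i = 0; i < numThreads; ++i)
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mutex_);
            // Sleep until there is a task to run or the pool is shutting down.
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task(); // run the task outside the lock
        }
      });
  }
  void submit(std::function<void()> task) {
    { std::lock_guard<std::mutex> lock(mutex_); tasks_.push(std::move(task)); }
    cv_.notify_one();
  }
  ~ThreadPool() {
    { std::lock_guard<std::mutex> lock(mutex_); stop_ = true; }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }
private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stop_ = false;
};

A caller would construct the pool once, e.g., ThreadPool pool(std::thread::hardware_concurrency()), and then submit work with pool.submit([]{ /* task */ });.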
Wait Time
Wait Time occurs when software threads are waiting due to APIs that block or cause
a context switch. Wait Time is per thread; therefore, the total Wait Time can exceed
the application elapsed time.
A thread can be switched off from execution by the OS scheduler due to either
synchronization or preemption. So, Wait Time can be further divided into Sync Wait
Time and Preemption Wait Time. A large amount of Sync Wait Time likely indicates
that the application has highly contended synchronization objects. We will explore how
to find them in the following sections. Significant Preemption Wait Time can signal
a thread oversubscription problem either because of a large number of application
threads or a conflict with OS threads or other applications on the system. In this case,
the developer should consider reducing the total number of threads or increasing task
granularity for every worker thread.
Spin Time
Spin time is Wait Time during which the CPU is busy. This often occurs when a
synchronization API causes the CPU to poll while the software thread is waiting. In
practice, implementations of synchronization primitives often spin on a lock for
some time instead of immediately yielding to another thread. Too much Spin Time,
however, can reflect the lost opportunity for productive work.
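The idea can be illustrated with a toy lock (a sketch, not how any particular library implements it): poll the lock flag for a bounded number of iterations, which shows up as Spin Time, and only then give up the CPU, which shows up as Wait Time.

#include <atomic>
#include <thread>

class SpinThenYieldLock {
public:
  void lock() {
    int attempts = 0;
    while (flag_.exchange(true, std::memory_order_acquire)) {
      if (++attempts < kSpinLimit)
        continue;                    // busy polling: counted as Spin Time
      std::this_thread::yield();     // give up the CPU: counted as Wait Time
    }
  }
  void unlock() { flag_.store(false, std::memory_order_release); }
private:
  static constexpr int kSpinLimit = 1000;  // the threshold is an assumption
  std::atomic<bool> flag_{false};
};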
A list of other parallel efficiency metrics can be found on Intel’s VTune page.252
13.2 Performance Scaling in Multithreaded Programs
Figure 13.1: (a) The theoretical speedup limit as a function of the number of processors,
according to Amdahl’s law. (b) Linear speedup, Amdahl’s law, and the Universal Scalability Law.
In reality, adding more computing nodes to the system may yield retrograde speedup.
We will see examples of this in the next section. This effect is explained by Neil
Gunther as the Universal Scalability Law254 (USL), which is an extension of Amdahl’s
law. USL describes communication between computing nodes (threads) as yet another
factor gating performance. As the system is scaled up, overheads start to neutralize
the gains. Beyond a critical point, the capability of the system starts to decrease
(see Figure 13.1b). USL is widely used for modeling the capacity and scalability of
systems.
The slowdowns described by USL are driven by several factors. First, as the number
of computing nodes increases, they start to compete for resources (contention). This
results in additional time being spent on synchronizing those accesses. Another
issue arises when a resource is shared among many workers: we need to maintain a
consistent state of that shared resource across all of them (coherence).
For example, when multiple workers frequently change a globally visible object,
those changes need to be broadcast to all nodes that use that object. Suddenly,
usual operations start taking more time to finish due to the additional need to
maintain coherence. Optimizing multithreaded applications not only involves all the
techniques described in this book so far but also involves detecting and mitigating the
aforementioned effects of contention and coherence.
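In their standard textbook forms (these formulas are not quoted from this chapter), Amdahl's law and the USL can be written as follows, where p is the parallelizable fraction of the work, N is the number of processors (or threads), σ models contention, and κ models coherence costs:

\text{Speedup}_{\text{Amdahl}}(N) = \frac{1}{(1 - p) + p/N},
\qquad
\text{Speedup}_{\text{USL}}(N) = \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)}.

With κ = 0, the USL reduces to an Amdahl-style curve with serial fraction σ; it is the κN(N−1) coherence term that produces the retrograde part of the curve in Figure 13.1b.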
1. Blender - this test measures Blender's Cycles rendering performance with the BMW27 blend file.
URL: https://download.blender.org/release. Command line: ./blender
-b bmw27_cpu.blend -noaudio --enable-autoexec -o output.test -x
1 -F JPEG -f 1 -t N, where N is the number of threads.
2. Clang 17 build - this test uses Clang 15 to build the Clang 17 compiler from
sources. URL: https://www.llvm.org. Command line: ninja -jN clang, where
N is the number of threads.
3. Zstandard v1.5.5, a fast lossless compression algorithm. URL: https://github.com/facebook/zstd.
A dataset used for compression: http://wanos.co/assets/silesia.tar. Command
line: ./zstd -TN -3 -f -- silesia.tar, where N is the
number of compression worker threads.
4. CloverLeaf 2018 - a Lagrangian-Eulerian hydrodynamics benchmark. All hard-
ware threads are used. This test uses the input file clover_bm.in (Problem
5). URL: http://uk-mac.github.io/CloverLeaf. Command line: export
OMP_NUM_THREADS=N; ./clover_leaf, where N is the number of threads.
5. CPython 3.12, a reference implementation of the Python programming language.
URL: https://github.com/python/cpython. I ran a simple multithreaded
binary search script written in Python, which searches 10,000 random numbers
(needles) in a sorted list of 1,000,000 elements (haystack). Command line:
./python3 binary_search.py N, where N is the number of threads. Needles
are divided equally between threads.
The benchmarks were executed on a machine with the configuration shown below:
• 12th Gen Alder Lake Intel® Core™ i7-1260P CPU @ 2.10GHz (4.70GHz Turbo),
4P+8E cores, 18MB L3-cache.
• 16 GB RAM, DDR4 @ 2400 MT/s.
• Clang 15 compiler with the following options: -O3 -march=core-avx2.
• 256GB NVMe PCIe M.2 SSD.
• 64-bit Ubuntu 22.04.1 LTS (Jammy Jellyfish, Linux kernel 6.5).
This is clearly not the top-of-the-line hardware setup, but rather a mainstream
computer, not necessarily designed to handle media, developer, or HPC workloads.
However, it is an excellent platform for our case study as it demonstrates various
effects of thread count scaling. Because of the limited resources, applications start to
hit performance roadblocks even with a small number of threads. Keep in mind that
on better hardware, the scaling results will be different.
My processor has four P-cores and eight E-cores. P-cores are SMT-enabled, which
means the total number of threads on this platform is sixteen. By default, the Linux
scheduler will first try to use idle physical P-cores. So, the first four threads will run
on the four idle P-cores. When those are fully utilized, it will start to schedule
threads on E-cores; thus, the next eight threads will be scheduled on eight E-cores.
Finally, the remaining four threads will be scheduled on the 4 sibling SMT threads of
P-cores.
I ran the benchmarks while affinitizing threads using the aforementioned scheme,
except for Zstd and CPython. Running without affinity better represents real-world
scenarios; however, thread affinity makes thread count scaling analysis cleaner. Since
the performance numbers were very similar, in this case study I present the results
with thread affinity used.
Figure 13.2: Thread Count Scalability chart for five selected benchmarks.
As you can see, most of them are very far from the linear scaling, which is quite
disappointing. The benchmark with the best scaling in this case study, Blender,
achieves only 6x speedup while using 16x threads. CPython, for example, enjoys no
thread count scaling at all. The performance of Clang and Zstd degrades when the
number of threads goes beyond 10. To understand why that happens, let’s dive into
the details of each benchmark.
Blender
Blender is the only benchmark in our suite that continues to scale up to all 16 threads
in the system. The reason for this is that the workload is highly parallelizable. The
rendering process is divided into small tiles, and each tile can be rendered independently.
However, even with this high level of parallelism, the scaling is only 6.1x speedup
/ 16 threads = 38%. What are the reasons for this suboptimal scaling?
From Section 4.11, we know that Blender’s performance is bounded by floating-point
computations. It has a relatively high percentage of SIMD instructions as well. P-cores
are much better at handling such instructions than E-cores. This is why we see the
slope of the speedup curve decrease after 4 threads as E-cores start getting used.
Performance scaling continues at the same pace up until 12 threads, where it starts to
degrade again. This is the effect of using SMT sibling threads. Two active sibling SMT
threads compete for the limited number of FP/SIMD execution units. To measure
SMT scaling, we need to divide the performance of two SMT threads (2T1C - two
threads one core) by the performance of a single P-core (1T1C).255 For Blender, SMT
scaling is around 1.3x.
There is another aspect of scaling degradation that also applies to Blender that we
will talk about while discussing Clang’s thread count scaling.
Clang
While Blender uses multithreading to exploit parallelism, concurrency in C++ com-
pilation is usually achieved with multiprocessing. Clang 17 has more than 2,500
translation units, and to compile each of them, a new process is spawned. Similar to
Blender, we classify Clang compilation as massively parallel, yet they scale differently.
We recommend you revisit Section 4.11 for an overview of Clang compiler performance
bottlenecks. In short, it has a large codebase, flat profile, many small functions, and
“branchy” code. Its performance is affected by D-Cache, I-Cache, and TLB misses, and
branch mispredictions. Clang’s thread count scaling is affected by the same scaling
issue as Blender: P-cores are more effective than E-cores, and P-core SMT scaling is
about 1.1x. However, there is more. Notice that scaling stops at around 10 threads,
and starts to degrade. Let’s understand why that happens.
The problem is related to the frequency throttling. When multiple cores are utilized
simultaneously, the processor generates more heat due to the increased workload on
each core. To prevent overheating and maintain stability, CPUs often throttle down
their clock speeds depending on how many cores are in use. Additionally, boosting all
cores to their maximum turbo frequency simultaneously would require significantly
more power, which might exceed the power delivery capabilities of the CPU. My system
doesn’t have an advanced liquid cooling solution and only has a single processor fan.
That’s why it cannot sustain high frequencies when many cores are utilized.
Figure 13.3 shows the performance scaling of the Clang workload overlaid with the
CPU frequency on our platform while running with different thread counts. Notice
that sustained frequency drops when we start using just two P-cores simultaneously.
By the time you start using all 16 threads, the frequency of P-cores is throttled down
to 3.2GHz, while E-cores operate at 2.6GHz. I used Intel VTune’s platform view to
capture CPU frequencies as shown in Section 7.1.
The tipping point of performance scaling for the Clang workload is around 10 threads.
This is the point where the frequency throttling starts to have a significant impact on
performance, and the benefit of adding additional threads is smaller than the penalty
of running at a lower frequency.
Keep in mind that this frequency chart cannot be automatically applied to all other
workloads. Applications that heavily use SIMD instructions typically operate on lower
frequencies, so Blender, for example, may see slightly more frequency throttling than
Clang. However, such a chart can give you a good intuition about the frequency
throttling issues that occur on your platform.
255 Also, you can measure SMT scaling as 4T2C/2T2C, 6T3C/3T3C, and so on.
Figure 13.3: Frequency throttling while running Clang compilation on Intel® Core™
i7-1260P. E-cores become active only after using four threads on P-cores.
To confirm that frequency throttling is one of the main reasons for performance
degradation, I temporarily disabled Turbo Boost on my platform and repeated the
scaling study for Blender and Clang. When Turbo Boost is disabled, all cores operate
at their base frequencies, which are 2.1 GHz for P-cores and 1.5 GHz for E-cores.
The results are shown in Figure 13.4. As you can see, thread count scaling almost
doubles when all 16 threads are used and TurboBoost is disabled, for both Blender
(38% → 69%) and Clang (21% → 41%). It gives us an intuition of what the thread count
scaling would look like if frequency throttling had not happened. In fact, frequency
throttling accounts for a large portion of unrealized performance scaling in modern
systems.
Figure 13.4: Thread Count Scalability chart for Blender and Clang with disabled Turbo
Boost. Frequency throttling is a major roadblock to achieving good thread count scaling.
Zstandard
Next on our list is the Zstandard compression algorithm, or Zstd for short. When
compressing data, Zstd divides the input into blocks, and each block can be compressed
independently. This means that multiple threads can work on compressing different
blocks simultaneously. Although it seems that Zstd should scale well with the number
of threads, it doesn’t. Performance scaling stops at around 5 threads, sooner than in
the previous two benchmarks. As you will see, the dynamic interaction between Zstd
worker threads is quite complicated.
First of all, the performance of Zstd depends on the compression level. The higher
the compression level, the more compact the result. Lower compression levels provide
faster compression, while higher levels yield better compression ratios. In the case
study, I used compression level 3 (which is also the default level) since it provides a
good trade-off between speed and compression ratio.
Here is the high-level algorithm of Zstd compression:
• The input file is divided into blocks, whose size depends on the compression level.
Each job is responsible for compressing a block of data. When Zstd receives
some data to compress, the main thread copies a small chunk into one of its
internal buffers and posts a new compression job, which is picked up by one of
the worker threads. In a similar manner, the main thread fills all input buffers
for all its workers and sends them to work in order. Worker threads share a
common queue from which they concurrently pick up jobs.
• Jobs are always started in order, but they can be finished in any order. Compres-
sion speed can be variable and depends on the data to compress. Some blocks
are easier to compress than others.
• After a worker finishes compressing a block, it signals to the main thread that
the compressed data is ready to be flushed to the output file. The main thread
is responsible for flushing the compressed data to the output file. Note that
flushing must be done in order, which means that the second job is allowed to be
flushed only after the first one is entirely flushed. The main thread can “partially
flush” an ongoing job, i.e., it doesn’t have to wait for a job to be completely
finished to start flushing it.
To visualize the work of the Zstd algorithm on a timeline, I instrumented the Zstd
source code with VTune’s Instrumentation and Tracing Technology (ITT) markers.256
They enable us to visualize the duration of instrumented code regions and events on
the timeline and control the collection of trace data during execution.
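As an illustration of what such instrumentation looks like (a generic sketch, not the actual patch applied to Zstd), the ITT API lets you create a domain and wrap a code region in a named task; the function name below is hypothetical.

#include <ittnotify.h>  // ships with VTune; link against the ittnotify library

// One-time setup: a domain groups related events, a string handle names a task.
static __itt_domain* domain = __itt_domain_create("zstd.compression");
static __itt_string_handle* hJob = __itt_string_handle_create("job");

void compressOneBlock(/* job arguments */) {
  __itt_task_begin(domain, __itt_null, __itt_null, hJob);  // start of the region
  // ... actual compression work ...
  __itt_task_end(domain);                                  // end of the region
}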
The timeline of compressing the Silesia corpus using 8 threads is shown in Figure
13.5. Using 8 worker threads is enough to observe thread interaction in Zstd while
keeping the image less noisy than when all 16 threads are active. The second half of
the timeline was cut to make the image fit on the page.
On the image, we have the main thread at the bottom (TID 913273), and eight worker
threads at the top. The worker threads are created at the beginning of the compression
process and are reused for multiple compressing jobs.
256 VTune ITT instrumentation - https://www.intel.com/content/www/us/en/docs/vtune-
profiler/user-guide/2023-1/instrumenting-your-application.html
Figure 13.5: Timeline view of compressing Silesia corpus with Zstandard using 8 threads.
On the worker thread timeline (top 8 rows) we have the following markers:
• job0–job25 bars indicate the start and end of a job.
• ww (short for “worker wait”, green bars) bars indicate a period when a worker
thread is waiting for a new job.
• Notches below job periods indicate that a thread has just finished compressing
a portion of the input block and is signaling to the main thread that there is
data available to be flushed (partial flushing).
On the main thread timeline (the bottom row, TID 913273) we have the following
markers:
• p0–p25 boxes indicate periods of preparing a new job. It starts when the main
thread starts filling up the input buffer until it is full (but this new job is not
necessarily posted on the worker queue immediately).
• fw (short for “flush wait”, blue bars) markers indicate a period when the main
thread waits for the produced data to start flushing it. This happens when the
main thread has prepared the next job and has nothing else to do. During this
time, the main thread is blocked.
With a quick glance at the image, we can tell that there are many ww periods when
worker threads are waiting. This negatively affects the performance of Zstandard
compression. Let’s progress through the timeline and try to understand what’s going
on.
1. First, when worker threads are created, there is no work to do, so they are
waiting for the main thread to post a new job.
2. Then the main thread starts to fill up the input buffers for the worker threads.
It has prepared jobs 0 to 7 (see bars p0–p7), which were picked up by worker
threads immediately. Notice, that the main thread also prepared job8 (p8), but
it hasn’t posted it in the worker queue yet. This is because all workers are still
busy.
3. After the main thread has finished p8, it flushed the data already produced
by job0. Notice, that by this time, job0 has already delivered five portions of
compressed data (first five notches below the job0 bar). Now, the main thread
enters its first fw period and starts to wait for more data from job0.
4. At the 45 ms timestamp, one more chunk of compressed data is produced by
job0, and the main thread briefly wakes up to flush it (see marker 1 in Figure 13.5).
After that, it goes to sleep (fw) again.
5. Job3 is the first to finish, but there is a delay of a couple of milliseconds before
TID 913309 picks up a new job (see marker 2). This happens because job8 was not
yet posted in the queue by the main thread. At this time, the main thread is waiting
for a new chunk of compressed data from job0. When it arrives a couple of
milliseconds later, the main thread wakes up, flushes the data, and notices that
there is an idle worker thread (TID 913309, the one that has just finished job3).
So, the main thread posts job8 to the worker queue and starts preparing the
next job (p9).
6. The same thing happens with TID 913313 (see marker 3) and TID 913314 (see
marker 4), but this time the delay is bigger. Interestingly, job10 could have been picked
up by either TID 913314 or TID 913312 since they were both idle at the time
job10 was pushed to the job queue.
7. We should have expected that the main thread would start preparing job11
immediately after job10 was posted in the queue as it did before. But it didn’t.
This happens because there are no available input buffers. We will discuss it in
more detail shortly.
8. Only when job0 finishes is the main thread able to acquire a new input buffer
and start preparing job11 (see marker 5).
As we just said, the reason for the 20–40ms delays between jobs (e.g., between job4
and job11) is the lack of input buffers, which are required to start preparing a new
job. Zstd maintains a single memory pool, which allocates space for both input and
output buffers. This memory pool is prone to fragmentation issues, as it has to provide
contiguous blocks of memory. When a worker finishes a job, the output buffer is
waiting to be flushed, but it still occupies memory. To start working on another job,
it will require another pair of buffers (one input and one output buffer).
Limiting the capacity of the memory pool is a design decision to reduce memory
consumption. In the worst case, there could be many “run-away” buffers, left by
workers that have completed their jobs very fast, and moved on to process the next
job; meanwhile, the flush queue is still blocked by one slow job and the buffers cannot
be released. In such a scenario, the memory consumption would be high, which is
undesirable. However, the downside of the current implementation is increased wait
time between the jobs.
The Zstd compression algorithm is a good example of a complex interaction between
threads. Visualizing worker threads on a timeline is extremely helpful in understanding
how threads communicate and synchronize, and can be useful for identifying bottlenecks.
It is also a good reminder that even if you have a parallelizable workload, the
performance of your application can be limited by the synchronization between
threads and resource availability.
CloverLeaf
CloverLeaf is a hydrodynamics workload. We will not dig deeply into the details of
the underlying algorithm as it is not relevant to this case study. CloverLeaf uses
OpenMP to parallelize the workload. HPC workloads usually scale well, so we expect
CloverLeaf to scale well too. However, on my platform performance stops growing
after using 3 threads. What’s going on?
To determine the root cause of poor scaling, I collected TMA metrics (see Section 6.1)
in four data points: running CloverLeaf with one, two, three, and four threads. Once
we compare the performance characteristics of these profiles, one thing becomes
clear immediately: CloverLeaf’s performance is bound by memory bandwidth. Table
13.1 shows the relevant metrics from these profiles that highlight increasing memory
bandwidth demand when using multiple threads.
Metric                                        1 thread   2 threads   3 threads   4 threads
TMA::Memory Bound (% of pipeline slots)         34.6        53.7        59.0        65.4
TMA::DRAM Memory Bandwidth (% of cycles)        71.7        83.9        87.0        91.3
Memory Bandwidth Utilization (range, GB/s)     20-22       25-28       27-30       27-30
As you can see from those numbers, the pressure on the memory subsystem keeps
increasing as we add more threads. An increase in the TMA::Memory Bound metric
indicates that threads increasingly spend more time waiting for data and do less useful
work. An increase in the DRAM Memory Bandwidth metric further highlights that
performance is hurt due to approaching bandwidth limits. The Memory Bandwidth
Utilization metric indicates the range of total memory bandwidth utilization while
CloverLeaf was running. I captured these numbers by looking at the memory bandwidth
utilization chart in VTune’s platform view as shown in Figure 13.6.
Let’s put those numbers into perspective. The maximum theoretical memory band-
width of my platform is 38.4 GB/s. However, as I measured in Section 4.10, the
maximum memory bandwidth that can be achieved in practice is 35 GB/s. With just
a single thread, the memory bandwidth utilization reaches 2/3 of the practical limit.
CloverLeaf fully saturates the memory bandwidth with three threads. Even when
all 16 threads are active, Memory Bandwidth Utilization doesn’t go above 30 GB/s,
which is 86% of the practical limit.
To confirm my hypothesis, I swapped two 8 GB DDR4 2400 MT/s memory modules
with two DDR4 modules of the same capacity, but faster speed: 3200 MT/s. This
brings the theoretical memory bandwidth of the system to 51.2 GB/s and the practical
maximum to 45 GB/s. The resulting performance boost grows with an increasing
number of threads used and is in the range of 10% to 33%. When running CloverLeaf
with 16 threads, faster memory modules provide the expected 33% performance
improvement, matching the ratio of the memory bandwidth increase (3200 / 2400 = 1.33).
But even with a
single thread, there is a 10% performance improvement. This means that there were
moments when CloverLeaf fully saturated the memory bandwidth with a single thread
in the original configuration.
Interestingly, for CloverLeaf, TurboBoost doesn't provide any performance benefit
when all 16 threads are used, i.e., performance is the same regardless of whether you
enable Turbo or let the cores run at their base frequency. How is that possible? The
answer is that having 16 active threads is enough to saturate the two memory controllers
even if the CPU cores run at half the frequency. Since most of the time threads are just
waiting for data, when you disable Turbo, they simply start to wait “slower”.
CPython
The final benchmark in the case study is CPython. I wrote a simple multithreaded
Python script that uses binary search to find numbers (needles) in a sorted list
(haystack). Needles are divided equally between worker threads. The script that I
wrote doesn’t scale at all. Can you guess why?
To solve this puzzle, I built CPython 3.12 from sources with debug information and
ran Intel VTune’s Threading Analysis collection while using two threads. Figure 13.7
visualizes a small portion of the timeline of the Python script execution. As you can
see, the CPU time alternates between two threads. They work for 5 ms, then yield to
another thread. In fact, if you were to scroll left or right, you would see that they
never run simultaneously.
Figure 13.7: VTune’s timeline view when running our Python script with two worker threads
(other threads are filtered out).
Let’s try to understand why two worker threads take turns instead of running together.
Once a thread finishes its turn, the Linux kernel scheduler switches to another thread
as highlighted in Figure 13.7. It also gives the reason for a context switch. If we take
a look at the pthread_cond_wait.c source code257 at line 652, we would land on the
function ___pthread_cond_timedwait64, which waits for a condition variable to be
signaled. Many other inactive wait periods wait for the same reason.
On the Bottom-up page (see the left panel of Figure 13.8), VTune reports that the
___pthread_cond_timedwait64 function is responsible for the majority of Inactive
Sync Wait Time. On the right panel, you can see the corresponding call stack. Using
this call stack we can tell what is the most frequently used code path that led to the
___pthread_cond_timedwait64 function and subsequent context switch.
Figure 13.8: VTune’s Bottom-up view when running our Python script with two worker threads:
Inactive Sync Wait Time attributed to ___pthread_cond_timedwait64 (left) and the corresponding call stack (right).
This call stack leads us to the take_gil function, which is responsible for acquiring the
Global Interpreter Lock (GIL). The GIL is preventing our attempts at running worker
threads in parallel by allowing only one thread to run at any given time, effectively
turning our multithreaded program into a single-threaded one. If you take a look at
the implementation of the take_gil function, you will find out that it uses a version
of wait on a conditional variable with a timeout of 5 ms. Once the timeout is reached,
the waiting thread asks the GIL-holding thread to drop it. Once another thread
complies with the request, the waiting thread acquires the GIL and starts running.
They keep switching roles until the very end of the execution.
Nevertheless, there are ways to bypass GIL, for example, by using GIL-immune libraries
such as NumPy, writing performance-critical parts of the code as a C extension module,
or using alternative runtime environments, such as nogil.258 Also, in Python 3.13
there is experimental support for running with the global interpreter lock disabled.259
Summary
In the case study, we have analyzed several throughput-oriented applications with
varying thread count scaling characteristics. Here is a quick summary of our findings:
257 Glibc source code - https://sourceware.org/git/?p=glibc.git;a=tree
258 Nogil - https://github.com/colesbury/nogil
259 Python 3.13 release notes - https://docs.python.org/3/whatsnew/3.13.html
13.3 Task Scheduling
Figure 13.9: Typical task scheduling pitfalls: core affinity blocks thread migration,
partitioning jobs with large granularity fails to maximize CPU utilization.
When P-cores became available, the threads migrated to them. That solved the problem
we had before, but now E-cores remain idle until the end of the processing.
My second piece of advice is to avoid static partitioning on systems with asymmetric
cores. Equal-sized chunks will likely be processed faster on P-cores than on E-cores,
which introduces load imbalance.
In the final example, I switch to using dynamic partitioning. With dynamic partitioning,
chunks are distributed to threads dynamically. Each thread processes a chunk of
elements, then requests another chunk, until no chunks remain to be distributed.
Figure 13.9c shows the result of using dynamic partitioning by dividing the array
into 16 chunks. With this scheme, each task becomes more granular, which enables
the OpenMP runtime to balance the work even when P-cores run two times faster than
E-cores. However, notice that there is still some idle time on E-cores.
Performance can be slightly improved if we partition the work into 128 chunks instead of
16. But don't make the jobs too small, otherwise the management overhead will
increase. A summary of my experiments is shown in Table 13.2. Partitioning
the work into 128 chunks turns out to be the sweet spot for our example. Even
though this example is very simple, lessons from it can be applied to production-grade
multithreaded software.
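In OpenMP terms, the switch from static to dynamic partitioning described above boils down to the loop schedule clause. The sketch below is illustrative (the array and per-element kernel are placeholders), not the exact benchmark code; compile it with -fopenmp.

#include <cstddef>

float doWork(float x);  // hypothetical per-element kernel, defined elsewhere

void processArray(float* a, std::size_t n) {
  // 128 chunks was the sweet spot in the experiment above. With
  // schedule(dynamic, chunk), an idle core grabs the next chunk as soon as it
  // finishes, so faster P-cores naturally end up processing more chunks.
  std::size_t chunk = n / 128;
  #pragma omp parallel for schedule(dynamic, chunk)
  for (std::size_t i = 0; i < n; i++)
    a[i] = doWork(a[i]);
}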
13.4 Cache Coherence
In the MESI protocol, a cache line can be in one of four states:
• Modified: a cache line is present only in the current cache and has been modified,
i.e., its value differs from the one in RAM
• Exclusive: a cache line is present only in the current cache and matches its
value in RAM
• Shared: a cache line is present in this cache and in other caches and matches its
value in RAM
• Invalid: a cache line is unused (i.e., does not contain any RAM location)
When fetched from memory, each cache line has one of these states encoded into its tag.
The cache line state then keeps transitioning from one state to another.260 In reality, CPU
vendors usually implement slightly improved variants of MESI. For example, Intel
uses MESIF,261 which adds a Forwarding (F) state, while AMD employs MOESI,262
which adds the Owned (O) state. However, these protocols still maintain the essence
of the base MESI protocol.
A lack of cache coherence would lead to sequentially inconsistent programs. This problem
is mitigated by having caches snoop all memory transactions and cooperate with each other
to maintain memory consistency. Unfortunately, it comes with a cost, since a modification
made by one core invalidates the corresponding cache line in another core’s cache. This
causes memory stalls and wastes system bandwidth. In contrast to serialization and locking
issues, which can only put a ceiling on the performance of the application, coherence issues
can cause retrograde effects, as described by the USL in Section 13.2. Two widely known
types of coherence problems are true sharing and false sharing.
260 There is an animated demonstration of the MESI protocol - https://www.scss.tcd.ie/Jeremy.Jones/vivio/caches/MESI.htm.
261 MESIF - https://en.wikipedia.org/wiki/MESIF_protocol
262 MOESI - https://en.wikipedia.org/wiki/MOESI_protocol
First of all, we have a bigger problem here besides true sharing. We actually have a data
race, which can sometimes be quite tricky to detect. Notice that we don’t have a proper
synchronization mechanism in place, which can lead to unpredictable or incorrect
program behavior, because the operations on the shared data might interfere with
one another. Fortunately, there are tools that can help identify such issues. Thread
sanitizer263 from Clang and helgrind264 are among such tools. To prevent the data
race in Listing 13.1, you should declare the sum variable as std::atomic<unsigned
int> sum.
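As a rough illustration (this is not the book’s Listing 13.1; names are made up), a parallel summation with an atomic accumulator might look like this:

#include <atomic>
#include <cstddef>

// Making the shared accumulator atomic removes the data race, but every
// addition still contends for the same cache line, which serializes threads.
std::atomic<unsigned int> sum{0};

void worker(const unsigned int *data, std::size_t begin, std::size_t end) {
  for (std::size_t i = begin; i < end; ++i)
    sum.fetch_add(data[i], std::memory_order_relaxed);
}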
Using C++ atomics can help to solve data races when true sharing happens. However,
it effectively serializes accesses to the atomic variable, which may hurt performance.
A better way of solving our true sharing issue is by using Thread Local Storage
(TLS). TLS is the method by which each thread in a given multithreaded process can
allocate memory to store thread-specific data. By doing so, threads modify their local
copies instead of contending for a globally available memory location. The example in
Listing 13.1 can be fixed by declaring sum with a TLS class specifier: thread_local
unsigned int sum (since C++11). The main thread should then incorporate results
from all the local copies of each worker thread.
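A minimal sketch of the TLS variant (again illustrative, not the book’s listing) might look like this:

#include <atomic>
#include <cstddef>

thread_local unsigned int sum = 0;        // each thread accumulates into its own copy
std::atomic<unsigned int> total{0};       // merged once per thread at the end

void worker(const unsigned int *data, std::size_t begin, std::size_t end) {
  for (std::size_t i = begin; i < end; ++i)
    sum += data[i];                       // no sharing, no coherence traffic in the hot loop
  total.fetch_add(sum, std::memory_order_relaxed);  // a single synchronized update
}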
Figure 13.11: False Sharing: two threads access the same cache line.
(On recent Apple chips, such as M1, M2, and later, the L2 cache operates on 128-byte cache lines.)
False sharing can not only be observed in native languages, like C and C++, but
also in managed ones, like Java and C#. From a general performance perspective,
the most important thing to consider is the cost of the possible state transitions. Of
all cache states, the only ones that do not involve a costly cross-cache subsystem
communication and data transfer during CPU read/write operations are the Modified
(M) and Exclusive (E) states. Thus, the longer the cache line maintains the M or
E states (i.e., the less sharing of data across caches), the lower the coherence cost
incurred by a multithreaded application. An example demonstrating how this property
has been employed can be found in Nitsan Wakart’s blog post “Diving Deeper into
Cache Coherency”.269
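One common way to exploit this property is to keep frequently written per-thread data on separate cache lines. A minimal sketch (not from the book; the line size and thread count are assumptions):

#include <cstdint>

// Each counter occupies its own 64-byte cache line, so a write by one thread
// does not invalidate the line used by another. CPUs with 128-byte lines
// (e.g., recent Apple chips) would need alignas(128).
struct alignas(64) PaddedCounter {
  std::uint64_t value = 0;
};

PaddedCounter counters[8];  // one slot per worker thread (count is illustrative)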
13.5 Advanced Analysis Tools
13.5.1 Coz
In Section 13.2, we defined the challenge of identifying parts of code that affect the
overall performance of a multithreaded program. Due to various reasons, optimizing
one part of a multithreaded program might not always give visible results. Traditional
sampling-based profilers only show the places in the code where most of the time is spent.
However, this does not necessarily correspond to where programmers should focus their
optimization efforts.
Coz270 is a profiler that addresses this problem. It uses a novel technique called causal
profiling, whereby experiments are conducted during the runtime of an application
by virtually speeding up segments of code to predict the overall effect of certain
optimizations. It accomplishes these “virtual speedups” by inserting pauses that slow
down all other concurrently running code. Also, Coz quantifies the potential impact
of an optimization. [Curtsinger & Berger, 2018]
iving-deeper-into-cache-coherency.html
270 COZ source code - https://github.com/plasma-umass/coz.
271 C-Ray benchmark - https://github.com/jtsiomb/c-ray.
that line, the impact on the application begins to level off by Coz’s estimation. For
more details on this example, see the article272 on the Easyperf blog.
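As a rough illustration of the workflow (not taken from the book), Coz’s throughput mode relies on progress-point annotations in the code under test; the function below is hypothetical:

#include <coz.h>   // progress-point macros from the Coz repository

void process_request(/* request data */) {
  // ... handle one request ...
  COZ_PROGRESS_NAMED("request handled");  // one unit of useful work completed
}

The binary is built with debug information and launched under coz run --- ./app; Coz then reports how much a virtual speedup of each source line would improve the rate at which the progress point is reached.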
sampling-profilers.
273 eBPF docs - https://prototype-kernel.readthedocs.io/en/latest/bpf/
274 BCC compiler - https://github.com/iovisor/bcc
275 GAPP - https://github.com/RN-dev-repo/GAPP/
276 Parsec 3.0 Benchmark Suite - https://parsec.cs.princeton.edu/index.htm
Questions and Exercises
Chapter Summary
• Applications that do not take advantage of modern multicore CPUs are lagging
behind their competitors. Preparing software to scale well with a growing amount
of CPU cores is very important for the future success of your application.
• When dealing with a single-threaded application, optimizing one portion of the
program usually yields positive results on performance. However, this is not
necessarily the case for multithreaded applications. This effect is widely known
as Amdahl’s law, which states that the speedup of a parallel program is
limited by its serial part.
• Thread communication can yield retrograde speedup, as explained by the Universal
Scalability Law. This poses additional challenges for tuning multithreaded
programs.
• As we saw in our thread count case study, frequency throttling, memory bandwidth saturation, and other issues can lead to poor performance scaling.
• Task scheduling on hybrid processors is challenging. Watch out for suboptimal
job scheduling and do not restrict the OS scheduler when it is not necessary.
• Optimizing the performance of multithreaded applications also involves detecting
and mitigating the effects of cache coherence, such as true sharing and false
sharing.
• Over the past years, new tools have emerged that cover gaps in analyzing the performance of multithreaded applications that traditional profilers cannot address. We introduced Coz and GAPP, each of which has a unique set of features.
Epilog
Thanks for reading through the whole book. I hope you enjoyed it and found it useful.
I would be even happier if the book helps you solve a real-world problem. In that
case, I would consider it a success and proof that my efforts were not wasted.
Before you continue with your endeavors, let me briefly highlight the essential points
of the book and give you final recommendations:
• Once you have identified the performance problem, you are more than halfway through. Based on my experience, the fix is often easier
than finding the root cause of the problem. In Part 2 we covered some essential
optimizations for every type of CPU performance bottleneck: how to optimize
memory accesses and computations, how to get rid of branch mispredictions,
how to improve machine code layout, and several others. Use chapters from Part
2 as a reference to see what options are available when your application has one
of these problems.
• Processors from different vendors are not created equal. They differ in terms of
instruction set architecture (ISA) supported and microarchitectural implementa-
tion. Reaching peak performance on a given platform requires utilizing the latest
ISA extensions, avoiding common microarchitecture-specific issues, and tuning
your code according to the strengths of a particular CPU microarchitecture.
• Multithreaded programs add one more dimension of complexity to performance
tuning. They introduce new types of bottlenecks and require additional tools
and methods to analyze and optimize. Examining how an application scales with
the number of threads is an effective way to identify bottlenecks in multithreaded
programs.
I hope you now have a better understanding of low-level performance optimizations.
Of course, this book doesn’t cover every possible scenario you may encounter in your
daily job. My goal was to give you a starting point and to show you potential options
and strategies for dealing with performance analysis and tuning on modern CPUs. I
wish you the joy of discovering performance bottlenecks in your application
and the satisfaction of fixing them.
Happy performance tuning!
I will post errata and other information about the book on my blog at the following
URL: https://easyperf.net/blog/2024/11/11/Book-Updates-Errata.
If you haven’t solved the perf-ninja exercises yet, I encourage you to take the time
to do so. They will help you to solidify your knowledge and prepare you for real-world
performance engineering challenges.
P.S. If you enjoyed reading this book, make sure to pass it on to your friends and
colleagues. I would appreciate your help in spreading the word about the book by
endorsing it on social media platforms.
Acknowledgments
I write this section with profound gratitude to all the people mentioned below. I feel
humbled to pass on to you the knowledge that these people have shared with me.
Lally Singh has authored Section 5.3.2 about Marker APIs. Lally is
currently at Tesla, his prior work includes Datadog’s performance
team, Google’s Search performance team, low-latency trading sys-
tems, and embedded real-time control systems. Lally has a PhD in
CS from Virginia Tech, focusing on scalability in distributed VR.
Also, I would like to thank the following people. Jumana Mundichipparakkal from
Arm, for helping me write about Arm PMU features in Chapter 6. Yann Collet,
the author of Zstandard, for providing me with the information about the internal
workings of Zstd for Section 13.2.1. Ciaran McHale, for finding tons of grammar
mistakes in my initial draft. Nick Black for proofreading and editing the final version
of the book. Peter Veentjer, Amir Aupov, and Charles-Francois Natali for various
edits and suggestions.
I’m also thankful to the whole performance community for countless blog articles and
papers. I was able to learn a lot from reading blogs by Travis Downs, Daniel Lemire,
Andi Kleen, Agner Fog, Bruce Dawson, Brendan Gregg, and many others. I stand on
the shoulders of giants, and the success of this book should not be attributed only to
myself. This book is my way to thank and give back to the whole community.
A special “thank you” goes to my family, who were patient enough to tolerate me
missing weekend trips and evening walks. Without their support, I wouldn’t have
finished this book.
Images were created with excalidraw.com. Cover design by Darya Antonova. The
fonts used on the cover of this book are Bebas Neue and Raleway, both provided under
the Open Font License. Bebas Neue was designed by Ryoichi Tsunekawa. Raleway
was designed by Matt McInerney, Pablo Impallari, and Rodrigo Fuenzalida.
I should also mention contributors to the first edition of this book. Below I only list
names and section titles. More detailed acknowledgments are available in the first
edition.
• Mark E. Dawson, Jr. wrote Section 8.4 “Optimizing For DTLB”, Section 11.8 “Optimizing for ITLB”, Section 12.3.2 “Cache Warming”, Section 12.4 “System Tuning”, and a few others.
• Sridhar Lakshmanamurthy authored a large part of Chapter 3 about CPU
microarchitecture.
• Nadav Rotem helped write Section 9.4 about vectorization.
• Clément Grégoire authored Section 9.4.2.5 about the ISPC compiler.
• Reviewers: Dick Sites, Wojciech Muła, Thomas Dullien, Matt Fleming, Daniel
Lemire, Ahmad Yasin, Michele Adduci, Clément Grégoire, Arun S. Kumar, Surya
Narayanan, Alex Blewitt, Nadav Rotem, Alexander Yermolovich, Suchakrapani
Datt Sharma, Renat Idrisov, Sean Heelan, Jumana Mundichipparakkal, Todd
Lipcon, Rajiv Chauhan, Shay Morag, and others.
Support This Book
If you enjoyed this book and would like to support it, there are a few options.
Glossary
DTLB Data Translation Lookaside Buffer
EBS Event-Based Sampling
EHP Explicit Huge Pages
FLOPS FLoating-point Operations Per Second
PDB files Program-Debug Data Base files
PGO Profile-Guided Optimizations
PMC Performance Monitoring Counter
PMI Performance Monitoring Interrupt
List of the Major CPU Microarchitectures
In the tables below we present the most recent ISAs and microarchitectures from Intel,
AMD, and ARM-based vendors. Of course, not all the designs are listed here. We
only include those that we reference in the book or if they represent a big transition
in the evolution of the platform.
Name           Three-letter acronym   Year released   Supported ISA (client/server chips)
Nehalem        NHM                    2008            SSE4.2
Sandy Bridge   SNB                    2011            AVX
Haswell        HSW                    2013            AVX2
Skylake        SKL                    2015            AVX2 / AVX512
Sunny Cove     SNC                    2019            AVX512
Golden Cove    GLC                    2021            AVX2 / AVX512
Redwood Cove   RWC                    2023            AVX2 / AVX512
Lion Cove      LNC                    2024            AVX2
Table 13.5: List of recent ARM ISAs along with their own and third-party implementations.
References
[Advanced Micro Devices, 2021] Advanced Micro Devices (2021). Processor Program-
ming Reference (PPR) for AMD family 19h model 01h (55898). B1 Rev 0.50.
https://www.amd.com/content/dam/amd/en/documents/epyc-technical-
docs/programmer-references/55898_B1_pub_0_50.zip
[Advanced Micro Devices, 2022] Advanced Micro Devices (2022). AMD64 technology
platform quality of service extensions. Pub. 56375, rev 1.01. https://www.amd.co
m/content/dam/amd/en/documents/processor-tech-docs/other/56375_1_03_
PUB.pdf
[Advanced Micro Devices, 2023] Advanced Micro Devices (2023). Software Optimiza-
tion Guide for the AMD Zen4 Microarchitecture. Rev 1.00. https://www.amd.co
m/content/dam/amd/en/documents/epyc-technical-docs/software-optimization-
guides/57647.zip
[Allen & Kennedy, 2001] Allen, R. & Kennedy, K. (2001). Optimizing Compilers for
Modern Architectures: A Dependence-based Approach (1 ed.). Morgan-Kaufmann.
[AMD, 2024] AMD (2024). AMD uProf User Guide, Revision 4.2. Advanced Micro
Devices, Inc. https://www.amd.com/content/dam/amd/en/documents/developer
/version-4-2-documents/uprof/uprof-user-guide-v4.2.pdf
[Apple, 2024] Apple (2024). Apple Silicon CPU Optimization Guide: 3.0. Apple®
Inc. https://developer.apple.com/documentation/apple-silicon/cpu-optimization-
guide
[Arm, 2022a] Arm (2022a). Arm Architecture Reference Manual Supplement Armv9.
Arm Limited. https://documentation-service.arm.com/static/632dbdace68c6809a
6b41710?token=
[Arm, 2022b] Arm (2022b). Arm Neoverse™ V1 PMU Guide, Revision: r1p2. Arm
Limited. https://developer.arm.com/documentation/PJDOC-1063724031-605393/2-
0/?lang=en
[Arm, 2023a] Arm (2023a). Arm Neoverse V1 Core: Performance Analysis Methodol-
ogy. Arm Limited. https://armkeil.blob.core.windows.net/developer/Files/pdf/wh
ite-paper/neoverse-v1-core-performance-analysis.pdf
[Arm, 2023b] Arm (2023b). Arm Statistical Profiling Extension: Performance Analysis
Methodology. Arm Limited. https://developer.arm.com/documentation/109429/lat
est/
[Blem et al., 2013] Blem, E., Menon, J., & Sankaralingam, K. (2013). Power struggles:
Revisiting the risc vs. cisc debate on contemporary arm and x86 architectures. 2013
IEEE 19th International Symposium on High Performance Computer Architecture
(HPCA), 1–12. https://doi.org/10.1109/HPCA.2013.6522302
[Chen et al., 2016] Chen, D., Li, D. X., & Moseley, T. (2016). Autofdo: Automatic
feedback-directed optimization for warehouse-scale applications. CGO 2016 Proceed-
ings of the 2016 International Symposium on Code Generation and Optimization,
12–23. https://ieeexplore.ieee.org/document/7559528
[ChipsAndCheese, 2024] ChipsAndCheese (2024). Why x86 doesn’t need to die. https:
//chipsandcheese.com/2024/03/27/why-x86-doesnt-need-to-die/
[Cooper & Torczon, 2012] Cooper, K. & Torczon, L. (2012). Engineering a Compiler.
Morgan Kaufmann. Morgan Kaufmann. https://books.google.co.in/books?id=CG
TOlAEACAAJ
[Curtsinger & Berger, 2013] Curtsinger, C. & Berger, E. D. (2013). Stabilizer: Statis-
tically sound performance evaluation. Proceedings of the Eighteenth International
Conference on Architectural Support for Programming Languages and Operating
Systems, ASPLOS ’13, 219–228. https://doi.org/10.1145/2451116.2451141
[Curtsinger & Berger, 2018] Curtsinger, C. & Berger, E. D. (2018). Coz: Finding
code that counts with causal profiling. Commun. ACM, 61(6), 91–99. https:
//doi.org/10.1145/3205911
[domo.com, 2017] domo.com (2017). Data Never Sleeps 5.0. Domo, Inc. https:
//www.domo.com/learn/data-never-sleeps-5?aid=ogsm072517_1&sf100871281=1
[Du et al., 2010] Du, J., Sehrawat, N., & Zwaenepoel, W. (2010). Performance profil-
ing in a virtualized environment. Proceedings of the 2nd USENIX Conference on
Hot Topics in Cloud Computing, HotCloud’10, 2. https://www.usenix.org/legacy/
event/hotcloud10/tech/full_papers/Du.pdf
[Fog, 2023a] Fog, A. (2023a). The microarchitecture of intel, amd and via cpus: An
optimization guide for assembly programmers and compiler makers. Copenhagen
University College of Engineering. https://www.agner.org/optimize/microarchitec
ture.pdf
[Geelnard, 2022] Geelnard, M. (2022). Debunking cisc vs risc code density. https:
//www.bitsnbites.eu/cisc-vs-risc-code-density/
[Herdrich et al., 2016] Herdrich, A. et al. (2016). Cache QoS: From concept to reality
in the Intel® Xeon® processor E5-2600 v3 product family. HPCA, 657–668. https:
//ieeexplore.ieee.org/document/7446102
[Intel, 2023] Intel (2023). Intel® 64 and IA-32 Architectures Optimization Reference
Manual. Intel® Corporation. https://software.intel.com/content/www/us/en
/develop/download/intel-64-and-ia-32-architectures-optimization-reference-
manual.html
[Jin et al., 2012] Jin, G., Song, L., Shi, X., Scherpelz, J., & Lu, S. (2012). Under-
standing and detecting real-world performance bugs. Proceedings of the 33rd ACM
SIGPLAN Conference on Programming Language Design and Implementation, PLDI
’12, 77–88. https://doi.org/10.1145/2254064.2254075
[Kanev et al., 2015] Kanev, S., Darago, J. P., Hazelwood, K., Ranganathan, P., Mose-
ley, T., Wei, G.-Y., & Brooks, D. (2015). Profiling a warehouse-scale computer.
SIGARCH Comput. Archit. News, 43(3S), 158–169. https://doi.org/10.1145/287288
7.2750392
[Khuong & Morin, 2015] Khuong, P.-V. & Morin, P. (2015). Array layouts for
comparison-based searching. https://arxiv.org/ftp/arxiv/papers/1509/1509.0
5053.pdf
[Leiserson et al., 2020] Leiserson, C. E., Thompson, N. C., Emer, J. S., Kuszmaul,
B. C., Lampson, B. W., Sanchez, D., & Schardl, T. B. (2020). There’s plenty of
room at the top: What will drive computer performance after moore’s law? Science,
368(6495). https://doi.org/10.1126/science.aam9744
[Luo et al., 2015] Luo, T., Wang, X., Hu, J., Luo, Y., & Wang, Z. (2015). Improving
tlb performance by increasing hugepage ratio. 2015 15th IEEE/ACM International
Symposium on Cluster, Cloud and Grid Computing, 1139–1142. https://doi.org/10
.1109/CCGrid.2015.36
[Mittal, 2016] Mittal, S. (2016). A survey of techniques for cache locking. ACM
Transactions on Design Automation of Electronic Systems, 21. https://doi.org/10.1
145/2858792
[Muła & Lemire, 2019] Muła, W. & Lemire, D. (2019). Base64 encoding and decoding
at almost the speed of a memory copy. Software: Practice and Experience, 50(2),
89–97. https://doi.org/10.1002/spe.2777
[Mytkowicz et al., 2009] Mytkowicz, T., Diwan, A., Hauswirth, M., & Sweeney, P. F.
(2009). Producing wrong data without doing anything obviously wrong! SIGPLAN
Not., 44(3), 265–276. https://doi.org/10.1145/1508284.1508275
[Nair & Field, 2020] Nair, R. & Field, T. (2020). Gapp: A fast profiler for de-
tecting serialization bottlenecks in parallel linux applications. Proceedings of
the ACM/SPEC International Conference on Performance Engineering. https:
//doi.org/10.1145/3358960.3379136
[Navarro-Torres et al., 2023] Navarro-Torres, A., Alastruey-Benedé, J., Ibáñez, P., &
Viñals-Yúfera, V. (2023). Balancer: bandwidth allocation and cache partitioning
for multicore processors. The Journal of Supercomputing, 79(9), 10252–10276.
https://doi.org/10.1007/s11227-023-05070-0
[Nowak & Bitzes, 2014] Nowak, A. & Bitzes, G. (2014). The overhead of profiling
using pmu hardware counters. https://zenodo.org/record/10800/files/TheOverhead
OfProfilingUsingPMUhardwareCounters.pdf
[Ottoni & Maher, 2017] Ottoni, G. & Maher, B. (2017). Optimizing function place-
ment for large-scale data-center applications. Proceedings of the 2017 Interna-
tional Symposium on Code Generation and Optimization, CGO ’17, 233–244.
https://ieeexplore.ieee.org/document/7863743
[Ousterhout, 2018] Ousterhout, J. (2018). Always measure one level deeper. Commun.
ACM, 61(7), 74–83. https://doi.org/10.1145/3213770
[Panchenko et al., 2018] Panchenko, M., Auler, R., Nell, B., & Ottoni, G. (2018).
BOLT: A practical binary optimizer for data centers and beyond. CoRR,
abs/1807.06735. http://arxiv.org/abs/1807.06735
[Pharr & Mark, 2012] Pharr, M. & Mark, W. R. (2012). ispc: A spmd compiler for
high-performance cpu programming. 2012 Innovative Parallel Computing (InPar),
1–13. https://doi.org/10.1109/InPar.2012.6339601
[Ren et al., 2010] Ren, G., Tune, E., Moseley, T., Shi, Y., Rus, S., & Hundt, R. (2010).
Google-wide profiling: A continuous profiling infrastructure for data centers. IEEE
Micro, 65–79. http://www.computer.org/portal/web/csdl/doi/10.1109/MM.2010.68
[Sasongko et al., 2023] Sasongko, M. A., Chabbi, M., Kelly, P. H. J., & Unat, D.
(2023). Precise event sampling on amd versus intel: Quantitative and qualitative
comparison. IEEE Transactions on Parallel and Distributed Systems, 34(5), 1594–
1608. https://doi.org/10.1109/TPDS.2023.3257105
[Seznec & Michaud, 2006] Seznec, A. & Michaud, P. (2006). A case for (partially)
tagged geometric history length branch prediction. J. Instr. Level Parallelism, 8.
https://inria.hal.science/hal-03408381/document
[Sharma, 2016] Sharma, S. D. (2016). Hardware-assisted instruction profiling and
latency detection. The Journal of Engineering, 2016, 367–376(9). https://digital-
library.theiet.org/content/journals/10.1049/joe.2016.0127
[Shen & Lipasti, 2013] Shen, J. & Lipasti, M. (2013). Modern Processor Design:
Fundamentals of Superscalar Processors. Waveland Press, Inc.
[Sites, 2022] Sites, R. (2022). Understanding Software Dynamics. Addison-Wesley
professional computing series. Addison-Wesley. https://books.google.com/books?i
d=TklozgEACAAJ
[statista.com, 2024] statista.com (2024). Volume of data/information created, captured,
copied, and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025.
Statista, Inc. https://www.statista.com/statistics/871513/worldwide-data-created/
[Suresh Srinivas, 2019] Suresh Srinivas, et al. (2019). Runtime performance opti-
mization blueprint: Intel® architecture optimization with large code pages. https:
//www.intel.com/content/www/us/en/develop/articles/runtime-performance-
optimization-blueprint-intel-architecture-optimization-with-large-code.html
[Wang et al., 2017] Wang, X. et al. (2017). SWAP: effective fine-grain management
of shared last-level caches with minimum hardware support. HPCA, 121–132.
https://ieeexplore.ieee.org/document/7920819
[Weaver & McIntosh-Smith, 2023] Weaver, D. & McIntosh-Smith, S. (2023). An
empirical comparison of the risc-v and aarch64 instruction sets. Proceedings of the
SC ’23 Workshops of The International Conference on High Performance Computing,
Network, Storage, and Analysis, SC-W ’23, 1557–1565. https://doi.org/10.1145/36
24062.3624233
[Williams et al., 2009] Williams, S., Waterman, A., & Patterson, D. (2009). Roofline:
an insightful visual performance model for multicore architectures. Commun. ACM,
52(4), 65–76. https://doi.org/10.1145/1498765.1498785
[Yasin, 2014] Yasin, A. (2014). A top-down method for performance analysis and
counters architecture. 35–44. https://doi.org/10.1109/ISPASS.2014.6844459
Appendix A
Below are some examples of features that can contribute to increased non-determinism
in performance measurements and a few techniques to reduce noise. I provided an
introduction to the topic in Section 2.1.
This section is mostly specific to the Linux operating system. Readers are encouraged
to search the web for instructions on how to configure other operating systems.
Dynamic Frequency Scaling
# TurboBoost disabled
$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
1
$ perf stat -e task-clock,cycles -- ./a.exe
13055.200832 task-clock (msec) # 0.993 CPUs utilized
29,946,969,255 cycles # 2.294 GHz
13.142983989 seconds time elapsed
The average frequency is higher when Turbo Boost is on (2.7 GHz vs. 2.3 GHz).
DFS can be permanently disabled in BIOS. To programmatically disable the DFS
feature on Linux systems, you need root access. Here is how one can achieve this:
# Intel
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# AMD
echo 0 > /sys/devices/system/cpu/cpufreq/boost
Simultaneous Multithreading
Many modern CPU cores support simultaneous multithreading (see Section 3.5.2).
SMT can be permanently disabled in BIOS. To programmatically disable SMT on
Linux systems, you need root access. The sibling pairs of CPU threads can be found
in the following files:
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list
278 Dynamic frequency scaling - https://en.wikipedia.org/wiki/Dynamic_frequency_scaling.
Here is how you can identify the sibling threads on an Intel® Core™ i5-8259U, which has
4 cores and 8 threads; a sibling hardware thread N can then be taken offline by writing 0
to /sys/devices/system/cpu/cpuN/online:
# all 8 hardware threads enabled:
$ lscpu
...
CPU(s): 8
On-line CPU(s) list: 0-7
...
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,4
$ cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
1,5
$ cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list
2,6
$ cat /sys/devices/system/cpu/cpu3/topology/thread_siblings_list
3,7
Scaling Governor
The Linux kernel can control CPU frequency for different purposes. One such purpose
is to save power. In this case, the scaling governor may decide to decrease the CPU
frequency. For performance measurements, it is recommended to set the scaling
governor policy to performance to avoid sub-nominal clocking. Here is how we can
set it for all the cores:
echo performance | sudo tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor
CPU Affinity
Processor affinity279 enables the binding of a process to a certain CPU core(s). In
Linux, you can do this with the taskset280 tool.
# no affinity
$ perf stat -e context-switches,cpu-migrations -r 10 -- a.exe
151 context-switches
10 cpu-migrations
If we now pin the process to a single core (e.g., with taskset -c 0), the number of
cpu-migrations drops to 0, i.e., the process never leaves core 0.
Alternatively, you can use the cset281 tool to reserve CPUs for just the program you are
benchmarking. If using Linux perf, leave at least two cores so that perf runs on one
core and your program runs on another. The command below will move all threads
out of N1 and N2 (-k on means that even kernel threads are moved out):
$ cset shield -c N1,N2 -k on
The command below will run the command after -- in the isolated CPUs:
$ cset shield --exec -- perf stat -r 10 <cmd>
On Windows, a program can be pinned to a specific core using the following command:
$ start /wait /b /affinity 0xC0 myapp.exe
where the /wait option waits for the application to terminate, /b starts the application
without opening a new command window, and /affinity specifies the CPU affinity
mask. In this case, the mask 0xC0 means that the application will run on cores 6 and
7.
On macOS, it is not possible to pin threads to cores since the operating system does
not provide an API for that.
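Affinity can also be set from inside the program. Below is a minimal Linux-specific sketch using pthread_setaffinity_np; the core number is illustrative:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for cpu_set_t, CPU_* macros, and pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core; returns true on success.
bool pin_to_core(int core_id) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core_id, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}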
Process Priority
In Linux, you can increase process priority using the nice tool. By increasing priority,
the process gets more CPU time, and the scheduler favors it more in comparison with
processes with normal priority. Niceness ranges from -20 (highest priority value) to
19 (lowest priority value) with the default of 0.
Notice in the previous example that the execution of the benchmarked process was
interrupted by the OS more than 100 times. If we increase process priority by running
the benchmark with sudo nice -n -<N>:
$ perf stat -r 10 -- sudo nice -n -5 taskset -c 1 a.exe
0 context-switches
0 cpu-migrations
Notice that the number of context switches drops to 0, so the process received all the
computation time uninterrupted.
Appendix B
Windows
To utilize huge pages on Windows, you need to enable SeLockMemoryPrivilege secu-
rity policy. This can be done programmatically via the Windows API, or alternatively
via the security policy GUI.
1. Hit start → search “secpol.msc”, and launch it.
2. On the left select “Local Policies” → “User Rights Assignment”, then double-click
on “Lock pages in memory”.
Linux
On Linux OS, there are two ways of using huge pages in an application: Explicit and
Transparent Huge Pages.
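As a rough sketch of what the two approaches look like from application code (Linux-specific; error handling abbreviated and sizes assumed):

#include <sys/mman.h>
#include <cstddef>

// Explicit Huge Pages: requires huge pages reserved in advance (e.g., via
// vm.nr_hugepages); `bytes` must be a multiple of the huge page size.
void *alloc_explicit_huge_pages(std::size_t bytes) {
  void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}

// Transparent Huge Pages: allocate normally and hint that the kernel should
// back the region with huge pages (honored when THP is set to "madvise").
void *alloc_thp_hinted(std::size_t bytes) {
  void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return nullptr;
  madvise(p, bytes, MADV_HUGEPAGE);
  return p;
}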
tests/vm/map_hugetlb.c.
283 Mounted hugetlbfs filesystem - https://github.com/torvalds/linux/blob/master/tools/testing/
selftests/vm/hugepage-mmap.c.
284 SHM_HUGETLB example - https://github.com/torvalds/linux/blob/master/tools/testing/self
tests/vm/hugepage-shm.c.
Also, you can observe how your application utilizes EHPs and/or THPs by looking at
the smaps file specific to your process:
$ watch -n1 "cat /proc/<PID_OF_PROCESS>/smaps"
Appendix C
Intel Processor Trace (PT) is a CPU feature that records program execution
by encoding packets in a highly compressed binary format that can be used to
reconstruct the execution flow with a timestamp on every instruction. PT has extensive
coverage and relatively small overhead,285 which is usually below 5%. Its main usages
are postmortem analysis and root-causing performance glitches.
Workflow
Similar to sampling techniques, PT does not require any modifications to the source
code. All you need to collect traces is just to run the program under the tool that
supports PT. Once PT is enabled and the benchmark launches, the analysis tool starts
writing PT packets to DRAM.
Similar to LBR (Last Branch Records), Intel PT works by recording branches. At
runtime, whenever a CPU encounters any branch instruction, PT will record the
outcome of this branch. For a simple conditional jump instruction, a CPU will record
whether it was taken (T) or not taken (NT) using just 1 bit. For an indirect call, PT
will record the destination address. Note that direct unconditional branches are ignored
since we statically know their targets.
An example of encoding for a small instruction sequence is shown in Figure 13.13.
Instructions like PUSH, MOV, ADD, and CMP are ignored because they don’t change the
control flow. However, the JE instruction may jump to .label, so its result needs to
be recorded. Later there is an indirect call for which the destination address is saved.
Suppose the PUSH instruction is the entry point of the application binary file. Then PUSH, MOV,
ADD, and CMP are reconstructed as-is without looking into encoded traces. Later, the
software decoder encounters a JE instruction, which is a conditional branch and for
which we need to look up the outcome. According to the traces in Figure 13.14, JE
was taken (T), so we skip the next MOV instruction and go to the CALL instruction.
Again, CALL(edx) is an instruction that changes the control flow, so we look up the
destination address in encoded traces, which is 0x407e1d8.
Timing Packets
With Intel PT, not only execution flow can be traced but also timing information. In
addition to saving jump destinations, PT can also emit timing packets. Figure 13.15
provides a visualization of how time packets can be used to restore timestamps for
instructions. As in the previous example, we first see that JNZ was not taken (NT), so
we update it and all the instructions above with timestamp 0ns. Then we see a timing
update of 2ns and JE being taken, so we update it and all the instructions above JE
(and below JNZ) with timestamp 2ns. After that, there is an indirect call (CALL(edx)),
but no timing packet is attached to it, so we do not update timestamps. Then we see
that 100ns elapsed, and JB was not taken, so we update all the instructions above it
with the timestamp of 102ns.
In the example shown in Figure 13.15, instruction data (control flow) is perfectly
accurate, but timing information is less accurate. Obviously, CALL(edx), TEST, and JB
instructions were not happening at the same time, yet we do not have more accurate
timing information for them. Having timestamps enables us to align the time interval
of our program with another event in the system, and it’s easy to compare to wall
clock time. Trace timing in some implementations can further be improved by a
cycle-accurate mode, in which the hardware keeps a record of cycle counts between
normal packets (see more details in [Intel, 2023, Volume 3C, Chapter 36]).
A decoded trace captures every step made by the program, which is a very strong foundation
for further functional and performance analysis.
Use Cases
1. Analyze performance glitches: because PT captures the entire instruction
stream, it is possible to analyze what was going on during the short time period
when the application was not responding. More detailed examples can be found
in an article286 on Easyperf blog.
2. Postmortem debugging: PT traces can be replayed by traditional debuggers
like gdb. In addition to that, PT provides call stack information, which is always
valid even if the stack is corrupted.287 PT traces could be collected on a remote
machine once and then analyzed offline. This is especially useful when the issue
is hard to reproduce or access to the system is limited.
3. Introspect execution of the program:
• We can immediately tell if a code path was never executed.
• Thanks to timestamps, it’s possible to calculate how much time was spent
waiting while spinning on a lock attempt, etc.
• Security mitigation by detecting specific instruction patterns.
part3
287 Postmortem debugging with Intel PT - https://easyperf.net/blog/2019/08/30/Intel-PT-part2
288 When you decode traces with perf script -F with +srcline or +srccode to emit source code,
Tools
Besides Linux perf, several other tools support Intel PT. First, Intel VTune Profiler
has an Anomaly Detection analysis type that uses Intel PT. Another popular tool worth
mentioning is magic-trace289 , which collects and displays high-resolution traces of a
process.
Appendix D
• ETWAnalyzer:295 reads ETW data and generates aggregate summary JSON files
that can be queried, filtered, and sorted at the command line or exported to a
CSV file.
• PerfView: mainly used to troubleshoot .NET applications. The ETW events
fired for Garbage Collection and JIT compilation are parsed and easily accessible
as reports or CSV data.
Setup
• Download ETWController to record ETW data and screenshots.
• Download the latest Windows 11 Performance Toolkit296 to be able to view
the data with WPA. Make sure that the newer Win 11 WPR.exe comes
first in your path by moving the install folder of the WPT before
C:\Windows\System32 in the System Environment dialog. This is how it
should look:
C> where wpr
C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\WPR.exe
C:\Windows\System32\WPR.exe
Capture traces
• Start ETWController.
• Select the CSwitch profile to track thread wait times along with the other default
recording settings. Ensure the check boxes Record mouse clicks and Take cyclic
screenshots are ticked (see Figure 13.16) so later you will be able to navigate to
the slow spots with the help of the screenshots.
• Download an application from the internet, and unpack it if needed. It doesn’t
matter which program you use; the goal is to see the delay when starting it.
• Start profiling by pressing the Start Recording button.
• Double-click the executable to start it.
• Once a program has started, stop profiling by pressing the Stop Recording button.
Stopping profiling the first time takes a bit longer because Program-Debug Data Base
files (PDBs) are generated for all managed code, which is a one-time operation. After
profiling has reached the Stopped state you can press the Open in WPA button to load
the ETL file into the Windows Performance Analyzer with an ETWController-supplied
profile. The CSwitch profile generates a large amount of data that is stored in a 4
GB ring buffer, which allows you to record 1-2 minutes before the oldest events are
295 ETWAnalyzer - https://github.com/Siemens-Healthineers/ETWAnalyzer
296 Windows SDK Downloads - https://developer.microsoft.com/en-us/windows/downloads/sdk-
archive/
overwritten. Sometimes it is a bit of an art to stop profiling at the right time point. If
you have sporadic issues you can keep recording enabled for hours and stop it when
an event like a log entry in a file shows up, which is checked by a polling script.
Windows supports Event Log and Performance Counter triggers that can start a script
when a performance counter reaches a threshold value or a specific event is written to
an event log. If you need more sophisticated stop triggers, you should take a look at
PerfView; this enables you to define a Performance Counter threshold that must be
reached and stay there for N seconds before profiling is stopped. This way, random
spikes will not trigger false positives.
Analysis in WPA
Figure 13.17 shows the recorded ETW data opened in Windows Performance Analyzer
(WPA). The WPA view is divided into three vertically layered parts: CPU Usage
(Sampled), Generic Events, and CPU Usage (Precise). To understand the difference
between them, let’s dive deeper. The upper graph CPU Usage (Sampled) is useful for
identifying where the CPU time is spent. The data is collected by sampling all the
running threads at a regular time interval. This CPU Usage (Sampled) graph is very
similar to the Hotspots view in other profiling tools.
Next comes the Generic Events view, which displays events such as mouse clicks
and captured screenshots. Remember that we enabled interception of those events in
the ETWController window. Because events are placed on the timeline, it is easy to
correlate UI interactions with how the system reacts to them.
The bottom Graph CPU Usage (Precise) uses a different source of data than the
Sampled view. While sampling data only captures running threads, CSwitch collection
takes into account time intervals during which a process was not running. The data
for the CPU Usage (Precise) view comes from the Windows Thread Scheduler. This
Figure 13.17: Windows Performance Analyzer: root causing a slow start of an application.
graph traces how long, and on which CPU, a thread was running (CPU Usage), how
long it was blocked in a kernel call (Waits), at which priority, and how long the thread
had been waiting for a CPU to become free (Ready Time), etc. Consequently, the
CPU Usage (Precise) view doesn’t show the top CPU consumers, but this view is very
helpful for understanding how long and why a certain process was blocked.
Now that we have familiarized ourselves with the WPA interface, let’s observe the charts.
First, we can find the MouseButton events 63 and 64 on the timeline. ETWController
saves all the screenshots taken during collection in a newly created folder. The
profiling data itself is saved in the file named SlowProcessStart.etl and there is
a new folder named SlowProcessStart.etl.Screenshots. This folder contains the
screenshots and a Report.html file that you can view in a web browser. Every recorded
keyboard/mouse interaction is saved in a file with the event number in its name, e.g.,
Screenshot_63.jpg. Figure 13.18 (cropped) displays the mouse double-click (events
63 and 64). The mouse pointer position is marked as a green square, except if a click
event did occur, then it is red. This makes it easy to spot when and where a mouse
click was performed.
The double click marks the beginning of a 1.2-second delay when our application was
waiting for something. At timestamp 35.1, explorer.exe is active as it attempts to
launch the new application. But then it wasn’t doing much work and the application
didn’t start. Instead, MsMpEng.exe takes over the execution up until the time 35.7.
So far, it looks like an antivirus scan before the downloaded executable is allowed
to start. But we are not 100% sure that MsMpEng.exe is blocking the start of a new
application.
Since we are dealing with delays, we are interested in wait times. These are available
on the CPU Usage (Precise) panel with Waits selected in the dropdown menu. There
we find the list of processes that our explorer.exe was waiting for, visualized as a
bar chart that aligns with the timeline on the upper panel. It’s not hard to spot the
long bar corresponding to Antivirus - Windows Defender, which accounts for a waiting
time of 1.068s. So, we can conclude that the delay in starting our application is caused
by Defender scanning activity. If you drill into the call stack (not shown), you’ll
see that the CreateProcess system call is delayed in the kernel by WDFilter.sys,
the Windows Defender Filter Driver. It blocks the process from starting until the
potentially malicious file contents are scanned. Antivirus software can intercept
everything, resulting in unpredictable performance issues that are difficult to diagnose
without a comprehensive kernel view, such as with ETW. Mystery solved? Well, not
just yet.
Knowing that Defender was the issue is just the first step. If you look at the top panel
again, you’ll see that the delay is not entirely caused by busy antivirus scanning. The
MsMpEng.exe process was active from the time 35.1 until 35.7, but the application
didn’t start immediately after that. There is an additional delay of 0.5 sec from time
35.7 until 36.2, during which the CPU was mostly idle, not doing anything. To find the
root cause of this, you would need to follow the thread wakeup history across processes,
which we will not present here. In the end, you would find a blocking web service call
to MpClient.dll!MpClient::CMpSpyNetContext::UpdateSpynetMetrics, which waited
for a Microsoft Defender web service to respond. If you enable TCP/IP
or socket ETW traces you can also find out with which remote endpoint Microsoft
Defender was communicating. So, the second part of the delay is caused by the
MsMpEng.exe process waiting for the network, which also blocked our application from
running.
This case study shows only one example of what type of issues you can effectively
analyze with WPA, but there are many others. The WPA interface is very rich and
highly customizable. It supports custom profiles to configure the graphs and tables for
visualizing event data in the way you like best. Originally, WPA was developed for
device driver developers and there are built-in profiles that do not focus on application
development. ETWController brings its own profile (Overview.wpaprofile) that you
can set as the default profile under Profiles → Save Startup Profile to always use the
performance overview profile.