21. Performance Optimization II
Code Optimization
Federico Busato
2023-11-14
Table of Contents
1 I/O Operations
printf
Memory Mapped I/O
Speed Up Raw Data Loading
2 Memory Optimizations
Heap Memory
Stack Memory
Cache Utilization
Data Alignment
Memory Prefetch
3 Arithmetic
Data Types
Operations
Conversion
Floating-Point
Compiler Intrinsic Functions
Value in a Range
Lookup Table
4 Control Flow
Loop Hoisting
Loop Unrolling
Branch Hints - [[likely]] / [[unlikely]]
Compiler Hints - [[assume]]
Recursion
5 Functions
Function Call Cost
Argument Passing
Function Optimizations
Function Inlining
Pointers Aliasing
6 Object-Oriented Programming
Object RAII Optimizations
I/O Streams
I/O Streams - Example
#include <fstream>

int main() {
    std::ifstream fin;
    // --------------------------------------------------------
    std::ios_base::sync_with_stdio(false); // disable sync with C stdio
    fin.tie(nullptr);                      // disable flushing before each I/O operation
    // increase the stream buffer
    const int BUFFER_SIZE = 1024 * 1024;   // 1 MB
    char buffer[BUFFER_SIZE];
    fin.rdbuf()->pubsetbuf(buffer, BUFFER_SIZE);
    // --------------------------------------------------------
    fin.open(filename); // Note: open() after the optimizations
    // I/O operations
    fin.close();
}
printf
www.ciselant.de/projects/gcc_printf/gcc_printf.html
Memory Mapped I/O
Benefits:
• Orders of magnitude faster than system calls
• Input can be “cached” in RAM (page/file cache)
• A file requires disk access only when a new page boundary is crossed
• Memory-mapping may bypass the page/swap file completely
• Load and store raw data (no parsing/conversion)
Memory Mapped I/O - Example 1/2
#if !defined(__linux__)
#error It works only on Linux
#endif

#include <fcntl.h>     // ::open
#include <sys/mman.h>  // ::mmap, ::madvise
#include <sys/stat.h>  // ::open
#include <sys/types.h> // ::open
#include <unistd.h>    // ::lseek, ::write
#include <string>      // std::stoll, std::string

// usage: ./exec <file> <byte_size> <mode>
int main(int argc, char* argv[]) {
    size_t file_size = std::stoll(argv[2]);
    auto   is_read   = std::string(argv[3]) == "READ";
    int    fd        = is_read ? ::open(argv[1], O_RDONLY) :
                                 ::open(argv[1], O_RDWR | O_CREAT | O_TRUNC,
                                        S_IRUSR | S_IWUSR);
    if (fd == -1)
        ERROR("::open")
    // try to get the last byte
    if (::lseek(fd, static_cast<off_t>(file_size - 1), SEEK_SET) == -1)
        ERROR("::lseek")
    if (!is_read && ::write(fd, "", 1) != 1) // try to write
        ERROR("::write")
Memory Mapped I/O - Example 2/2

    auto mm_mode  = (is_read) ? PROT_READ : PROT_WRITE;
    auto mmap_ptr = static_cast<char*>(
                        ::mmap(nullptr, file_size, mm_mode, MAP_SHARED, fd, 0));
    if (mmap_ptr == MAP_FAILED)
        ERROR("::mmap")
    // advise sequential access
    if (::madvise(mmap_ptr, file_size, MADV_SEQUENTIAL) == -1)
        ERROR("::madvise")
    // Memory-Mapped Operations
    // read from/write to "mmap_ptr" as a normal array: mmap_ptr[i]
    if (::munmap(mmap_ptr, file_size) == -1)
        ERROR("::munmap")
    ::close(fd);
}
Heap Memory

• Many small heap allocations are more expensive than one large memory allocation
  (see the sketch below). The default page size on Linux is 4 KB. For
  smaller/multiple sizes, C++ uses a sub-allocator
• Allocations within the page size are faster than larger allocations (sub-allocator)
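A minimal sketch of the first point, with hypothetical N x M dimensions:

void allocate_example(int N, int M) {
    // slow: N small heap allocations, each paying allocator overhead
    int** rows = new int*[N];
    for (int i = 0; i < N; i++)
        rows[i] = new int[M];
    // faster: one large allocation; element (i, j) is buffer[i * M + j]
    int* buffer = new int[N * M];
    // (cleanup omitted for brevity)
}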
Stack Memory
• Stack memory is faster than heap memory. Stack memory provides high
  locality, it is small (it fits in cache), and its size is known at compile-time
• static stack allocations produce better code: they avoid filling the stack each
  time the function is reached
• constexpr arrays with dynamic indexing produce very inefficient code with
  GCC. Use static constexpr instead
int f(int x) {
    // bad performance with GCC:
    // constexpr int array[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    static constexpr int array[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    return array[x];
}
Cache Utilization
Spatial Locality Example 1/2
A, B, C matrices of size N × N
Benchmark:
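A minimal sketch of the kind of code being benchmarked, assuming row-major
matrices and the classic loop-order comparison:

// ijk order: B is traversed column-wise in the inner loop -> poor spatial locality
void matmul_ijk(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// ikj order: all inner-loop accesses are sequential -> cache friendly
void matmul_ikj(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}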
Temporal-Locality Example
lemire.me/blog/2019/04/27
Data Alignment
Data alignment allows avoiding unnecessary memory accesses, and it is also essential
to exploit hardware vector instructions (SIMD) like SSE, AVX, etc.
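A brief sketch of C++11 alignment control (the alignment values are illustrative):

alignas(32) float data[1024]; // 32-byte alignment enables aligned AVX loads/stores

struct alignas(16) Vec4 {     // hypothetical type, aligned for 16-byte SSE accesses
    float x, y, z, w;
};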
Internal Structure Alignment
Considering the previous example for the structure A2, random loads from an array of
A2 structures lead to one or two cache-line operations depending on the alignment at
the specific index, e.g.
index 0 → one cache-line load
index 1 → two cache-line loads
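The definition of A2 is not repeated here; a hypothetical 48-byte layout reproduces
the effect with 64-byte cache lines:

struct A2 {     // hypothetical layout: sizeof(A2) == 48
    double v[6];
};

A2 array[100];
// array[0] occupies bytes [0, 48)  -> entirely within the first cache line
// array[1] occupies bytes [48, 96) -> straddles the boundary at byte 64,
//                                     so loading it touches two cache lines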
External Structure Alignment in C++ 2/2
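A minimal sketch of the usual technique, assuming 64-byte cache lines:

// aligning a structure to the cache-line size guarantees that an instance
// never straddles two lines (it also helps to avoid false sharing)
struct alignas(64) Padded {
    int counter;
};

// since C++17, operator new honors extended alignment requirements:
Padded* p = new Padded[8]; // every element starts on a 64-byte boundary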
Memory Prefetch
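A minimal GCC/Clang sketch; the prefetch distance (64 elements) is an arbitrary
value to be tuned:

long sum_with_prefetch(const int* data, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(data + i + 64); // hint: load future data into cache
        sum += data[i];
    }
    return sum;
}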
Multi-Threading and Caches
The CPU/thread affinity controls how a process is mapped and executed over
multiple cores (including sockets). It affects process performance due to
core-to-core communication and cache-line invalidation overhead

Maximizing thread “clustering” on a single core can potentially lead to a higher
cache hit rate and faster communication. On the other hand, if the threads work
(almost) independently, namely they show high locality on their working sets,
mapping them to different cores can improve performance
Arithmetic

• Addition, subtraction, and bitwise operations are computed by the ALU and
  have very similar throughput
uops.info: Latency, Throughput, and Port Usage Information
Data Types
• 32-bit integral vs. floating-point: in general, integral types are faster, but it
  depends on the processor characteristics
• Small integral types are slower than 32-bit integers, but they require less
  memory → better cache/memory efficiency
Operations 1/2
* The compiler automatically applies such optimizations whenever possible. This is
  not guaranteed for object types
Operations 2/2
• Keep constant values/variables close together → the compiler can merge their values
• Some unsigned operations are faster than the corresponding signed operations,
  which must deal with negative numbers, e.g. x / 2
Power-of-Two Multiplication/Division/Modulo
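A sketch of the typical strength reductions, assuming unsigned operands:

unsigned pow2_ops(unsigned x) {
    unsigned mul = x * 8u; // compiled as x << 3
    unsigned div = x / 8u; // compiled as x >> 3
    unsigned mod = x % 8u; // compiled as x & 7
    // with signed operands, division and modulo require extra instructions
    // to round toward zero for negative values
    return mul + div + mod;
}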
Conversion

From             To              Cost
Integer          Floating-point  4-16 clock cycles

Signed integer → floating-point conversion is faster than unsigned integer →
floating-point (except when the AVX512 instruction set is enabled)
Optimizing software in C++, Agner Fog
Floating-Point Division
not optimized:

// "value" is a floating-point (dynamic) divisor
for (int i = 0; i < N; i++)
    A[i] = B[i] / value;

optimized:

double div = 1.0 / value; // compute the reciprocal once
for (int i = 0; i < N; i++)
    A[i] = B[i] * div;
Compiler behavior:
• GCC 9 and ICC 19 produce a single instruction for std::fma and for a * b + c with
-O3 -march=native
• Clang 9 and MSVC 19.* produce a single instruction for std::fma but not for
a * b + c
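A minimal illustration of the two spellings:

#include <cmath>

double fused(double a, double b, double c) {
    return std::fma(a, b, c); // single FMA instruction on the compilers above
}

double maybe_fused(double a, double b, double c) {
    return a * b + c; // contracted to FMA by GCC/ICC with -O3 -march=native
}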
Compiler Intrinsic Functions 1/5

Compiler intrinsics are highly optimized functions directly provided by the compiler
instead of external libraries
Advantages:
• Directly mapped to hardware functionalities if available
• Inline expansion
• Do not inhibit high-level optimizations, and they are portable, contrary to asm code
Drawbacks:
• Portability is limited to a specific compiler
• Some intrinsics do not work on all platforms
• The same intrinsic can be mapped to a non-optimal instruction sequence
  depending on the compiler
Compiler Intrinsic Functions 2/5
__builtin_clz(x) (count leading zeros) counts the number of zero bits preceding
the most significant one bit
__builtin_ctz(x) (count trailing zeros) counts the number of zero bits following
the least significant one bit
__builtin_ffs(x) (find first set) returns one plus the index of the least
significant one bit
gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
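A small demonstration of the three builtins:

#include <cstdio>

int main() {
    unsigned x = 0b10100;                   // bits 2 and 4 set
    std::printf("%d\n", __builtin_clz(x));  // 27: zeros above bit 4 (32-bit int)
    std::printf("%d\n", __builtin_ctz(x));  // 2:  zeros below bit 2
    std::printf("%d\n", __builtin_ffs(x));  // 3:  1 + index of the lowest set bit
}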
Compiler Intrinsic Functions 3/5
software.intel.com/sites/landingpage/IntrinsicsGuide/
Compiler Intrinsic Functions 5/5
Using intrinsic instructions is extremely dangerous if the target processor does not
natively support such instructions
Example:
“If you run code that uses the intrinsic on hardware that doesn’t support the lzcnt
instruction, the results are unpredictable” - MSVC
On the contrary, GCC and Clang __builtin_* functions are always well-defined:
the instruction is translated into a non-optimal operation sequence in the worst case

The instruction set support should be checked at run-time (e.g. with the __cpuid
function on MSVC) or, when available, by using a compile-time macro (e.g. __AVX__ )
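A sketch of both checks with GCC/Clang (the MSVC __cpuid flow is not shown):

void dispatch() {
    if (__builtin_cpu_supports("avx2")) { // run-time check (GCC/Clang, x86)
        // select the AVX2 code path
    }
#if defined(__AVX2__) // compile-time check: -mavx2 / -march enabled the ISA
    // AVX2 code can be emitted unconditionally here
#endif
}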
Automatic Compiler Function Transformation
Why is the standard "abs" function faster than mine?
Value in a Range
// goal: check whether x lies in the range [A, B], i.e. x >= A && x <= B
// STEP 1: subtract A from both sides of each comparison
if (x - A >= A - A && x - A <= B - A)
// -->
if (x - A >= 0 && x - A <= B - A) // B - A is precomputed
// STEP 2:
// - convert "x - A >= 0" --> (unsigned) (x - A)
// - "B - A" is always non-negative
if ((unsigned) (x - A) <= (unsigned) (B - A))
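The final form packaged as a function (a sketch, assuming B >= A):

bool in_range(int x, int A, int B) {
    // a single unsigned comparison replaces the two signed ones
    return static_cast<unsigned>(x - A) <= static_cast<unsigned>(B - A);
}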
Value in a Range Examples
Lookup Table

Make your lookup table do more
Low-Level Optimizations
Low-Level Information
Control Flow
Branches are expensive 1/2
Branches are expensive 2/2
The downsides of long pipelines include the danger of pipeline stalls that waste CPU
time, and the time it takes to reload the pipeline on conditional branch operations
( if , while , for )
Control Flow 1/2
• Prefer the square-bracket syntax [] over pointer arithmetic for array
  accesses to facilitate compiler loop optimizations (polyhedral loop transformations)
• Prefer signed integers for loop indexing. The compiler optimizes such loops
  more aggressively since signed integer overflow is undefined
The Little Things: Everyday efficiencies
Control Flow 2/2
• Some compilers (e.g. Clang) use assertions for optimization purposes: most likely
  code path, impossible values, etc.
• Not all control-flow instructions (or branches) are translated into jump
  instructions. If the code in the branch is small, the compiler can optimize it into
  a conditional instruction, e.g. cmovl

Small code sections can be optimized in different ways (see next slides)
Loop Hoisting

Loop Hoisting, also called loop-invariant code motion, consists of moving statements
or expressions outside the body of a loop without affecting the semantics of the
program
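A minimal sketch of the idea (the compiler often performs this automatically):

// before: "scale * 2" is recomputed at every iteration
for (int i = 0; i < N; i++)
    A[i] = B[i] * (scale * 2);

// after hoisting the loop-invariant expression outside the loop
int factor = scale * 2;
for (int i = 0; i < N; i++)
    A[i] = B[i] * factor;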
Loop Unrolling 1/2

Example:
for (int i = 0; i < N; i++)
sum += A[i];
can be rewritten as:
for (int i = 0; i < N; i += 8) {
sum += A[i];
sum += A[i + 1];
sum += A[i + 2];
sum += A[i + 3];
...
} // we assume N is a multiple of 8
Loop Unrolling 2/2
Unroll directive: the Intel, IBM, and Clang compilers (but not GCC) provide the
preprocessing directive #pragma unroll (to be placed above the loop) to force loop
unrolling. The compiler already applies this optimization in most cases
Branch Hints - [[likely]] / [[unlikely]]

switch (value) {
[[likely]] case 'A': return 2;
[[unlikely]] case 'B': return 4;
}
Compiler Hints - [[assume]]
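A minimal C++23 sketch (the compiler may exploit the assumption without checking it):

int div32(int x) {
    [[assume(x >= 0)]]; // undefined behavior if violated at run-time
    return x / 32;      // can compile to a plain shift: no fixup for negative values
}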
Recursion 1/2

Avoid run-time recursion (very expensive). Prefer iterative algorithms instead (see
next slides)

Recursion cost: the program must store all variables (a snapshot) on the stack at
each recursion step, and remove them when control returns to the caller instance

The tail recursion optimization avoids maintaining the caller's stack frame and
passes control directly to the next iteration. The optimization is possible only if
all computation can be executed before the recursive call
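A minimal sketch contrasting the two forms:

// not tail recursive: the multiplication happens after the recursive call returns
long fact(long n) {
    return (n <= 1) ? 1 : n * fact(n - 1);
}

// tail recursive: all computation precedes the call, so the compiler can reuse
// the current stack frame (effectively turning the recursion into a loop)
long fact_tail(long n, long acc = 1) {
    return (n <= 1) ? acc : fact_tail(n - 1, acc * n);
}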
Recursion 2/2
Via Twitter - Jan Wildeboer
Functions
Function Call Cost
pass-by-reference may not introduce a level of indirection if caller and callee
reside in the same translation unit (or with LTO)

pass-by-reference is more efficient than pass-by-pointer as
it facilitates variable elimination by the compiler, and the function
code does not require checking for NULL pointers
Three reasons to pass std::string_view by value
Argument Passing - Active Objects 2/3
Argument Passing - Passive Objects 3/3
• Keep the number of function parameters small. Parameters can then be passed in
  registers instead of filling and emptying the stack
• Consider combining several function parameters in a structure
• The const modifier applied to pointers and references does not produce better code
  in most cases, but it is useful for ensuring read-only access
inline Function Declaration 1/2

The inline specifier, for optimization purposes, is just a hint for the compiler
that increases the heuristic threshold for inlining, namely copying the function
body where it is called
inline Function Declaration 2/2
The compiler can inline a function only if it is independent from external references
• A function with internal linkage is not visible outside the current translation unit,
so it can be aggressively inlined
• On the other hand, external linkage doesn't prevent function inlining if the
  function body is visible in a translation unit. In this situation, the compiler
  can duplicate the function code if it determines that there are no external
  references
Symbol Visibility
All compilers, except MSVC, export all function symbols by default → slow, and the
symbols can be used in other translation units
Alternatives:
• Use static functions
gcc.gnu.org/wiki/Visibility
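A brief sketch of internal linkage, plus the visibility attribute described in the
linked GCC wiki:

// internal linkage: the symbol is not exported and can be inlined aggressively
static int helper(int x) {
    return x * 2;
}

// GCC/Clang: hide an external symbol explicitly
__attribute__((visibility("hidden"))) int detail_compute(int x);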
Pointers Aliasing 1/4
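A representative sketch of the kind of loop being discussed:

void scale(float* output, const float* input, int n, float v) {
    // the compiler must assume output may alias input (e.g. output == input + 1),
    // so it re-loads input after every store and keeps the loop sequential
    for (int i = 0; i < n; i++)
        output[i] = input[i] * v;
}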
• The compiler cannot unroll the loop (sequential execution, no ILP) because the
  output and input pointers may alias, e.g. output = input + 1
• The aliasing problem is even worse for more complex code and inhibits all kinds of
  optimizations, including code reordering, vectorization, common sub-expression
  elimination, etc.
Pointers Aliasing 2/4
Potential benefits:
• Instruction-level parallelism
• Fewer instructions executed
• Merge common sub-expressions
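A sketch with the restrict qualifier (a GCC/Clang/MSVC extension in C++; restrict
is standard only in C99):

void scale_restrict(float* __restrict output, const float* __restrict input,
                    int n, float v) {
    // __restrict promises the pointers never alias, re-enabling unrolling
    // and vectorization for the same loop
    for (int i = 0; i < n; i++)
        output[i] = input[i] * v;
}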
Pointers Aliasing 3/4
Argument Passing, Core Guidelines and Aliasing
Object-Oriented
Programming
Variable/Object Scope

Declare variables in the innermost scope where they are used:
• the compiler is more likely to fit them into registers instead of the stack
• it improves readability
Wrong:

int i, x;
for (i = 0; i < N; i++) {
    x = value * 5;
    sum += x;
}

Correct:

for (int i = 0; i < N; i++) {
    int x = value * 5;
    sum += x;
}
Exception! Built-in type variables and passive structures should be placed in the
innermost loop, while objects with constructors should be placed outside loops
Object RAII Optimizations
Object Dynamic Behavior Optimizations
• Use static for all member functions that do not access instance members
  (this avoids passing the this pointer)
• Prefer ++obj / --obj (which return the updated object) over obj++ / obj--
  (which return a copy of the old object), as sketched below
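A sketch of why the prefix form is cheaper for user-defined types:

struct Iter {
    int pos = 0;
    Iter& operator++() {   // prefix: updates in place, returns *this
        ++pos;
        return *this;
    }
    Iter operator++(int) { // postfix: must materialize a copy of the old state
        Iter old = *this;
        ++pos;
        return old;
    }
};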
Object Implicit Conversion
struct A {};
struct B {
    B() = default;
    B(const A&); // expensive converting constructor
};
void f(B b);
A a;
B b;
f(b); // no cost
f(a); // very costly!! implicit conversion A -> B
Std Library and
Other Language
Aspects
From C to C++
• Avoid old C library routines such as qsort , bsearch , etc. Prefer
  std::sort , std::binary_search instead
  - std::sort is based on a hybrid sorting algorithm: quick-sort / heap-sort
    (introsort), merge-sort / insertion sort, etc., depending on the std
    implementation
Std Containers
• Set the std::vector size during object construction (or use the reserve()
  method) if the number of elements to insert is known in advance → every implicit
  resize is equivalent to copying all vector elements
• Consider unordered containers instead of the ordered ones, e.g.
  std::unordered_map vs. std::map
Drawbacks of the std library:
• Platform/compiler-dependent implementation
• Debugging is hard
• Binary bloat
• Compile time (see C++ Compile Health Watchdog, and STL Explorer)
• Most std data structures are implemented on top of heap memory. Consider
  re-implementing them using stack memory if the number of elements to insert
  is small, e.g. a queue (see the sketch below)
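A hypothetical sketch of such a re-implementation: a fixed-capacity queue backed by
a std::array (bounds checks omitted for brevity):

#include <array>
#include <cstddef>

template <typename T, std::size_t Capacity>
class StackQueue {                 // hypothetical name and interface
    std::array<T, Capacity> data_; // storage lives on the stack, no heap allocation
    std::size_t head_ = 0, tail_ = 0;

public:
    void push(const T& value) { data_[tail_++ % Capacity] = value; }
    T    pop()                { return data_[head_++ % Capacity]; }
    bool empty() const        { return head_ == tail_; }
};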