
Modern C++

Programming
21. Performance Optimization II
Code Optimization

Federico Busato
2023-11-14
Table of Contents

1 I/O Operations
printf
Memory Mapped I/O
Speed Up Raw Data Loading

2 Memory Optimizations
Heap Memory
Stack Memory
Cache Utilization
Data Alignment
Memory Prefetch

1/84
Table of Contents

3 Arithmetic
Data Types
Operations
Conversion
Floating-Point
Compiler Intrinsic Functions
Value in a Range
Lookup Table

2/84
Table of Contents

4 Control Flow
Loop Hoisting
Loop Unrolling
Branch Hints - [[likely]] / [[unlikely]]
Compiler Hints - [[assume]]
Recursion

3/84
Table of Contents

5 Functions
Function Call Cost
Argument Passing
Function Optimizations
Function Inlining
Pointers Aliasing

6 Object-Oriented Programming
Object RAII Optimizations

7 Std Library and Other Language Aspects


4/84
I/O Operations

I/O operations are orders of magnitude slower than memory accesses

5/84
I/O Streams

In general, input/output operations are among the most expensive

• Use std::endl for ostream only when it is strictly necessary (prefer '\n' )

• Disable synchronization with printf/scanf :
  std::ios_base::sync_with_stdio(false)

• Disable I/O flushing when mixing istream/ostream calls:
  <istream_obj>.tie(nullptr);

• Increase the I/O buffer size:
  file.rdbuf()->pubsetbuf(buffer_var, buffer_size);

6/84
I/O Streams - Example

#include <fstream>

int main() {
    const char* filename = "data.bin"; // placeholder input path
    std::ifstream fin;
    // --------------------------------------------------------
    std::ios_base::sync_with_stdio(false);   // sync disable
    fin.tie(nullptr);                        // flush disable
    // buffer increase
    const int BUFFER_SIZE = 1024 * 1024;     // 1 MB
    char buffer[BUFFER_SIZE];
    fin.rdbuf()->pubsetbuf(buffer, BUFFER_SIZE);
    // --------------------------------------------------------
    fin.open(filename); // Note: open() after optimizations

    // IO operations
    fin.close();
}
7/84
printf

• printf is faster than ostream (see speed test link)

• A printf call with a simple format string ending with \n is converted into a
  puts() call:

  printf("Hello World\n");
  printf("%s\n", string);

• No optimization if the string does not end with \n or if one or more %
  conversions are detected in the format string

8/84
www.ciselant.de/projects/gcc_printf/gcc_printf.html
Memory Mapped I/O

A memory-mapped file is a segment of virtual memory that has been assigned a


direct byte-for-byte correlation with some portion of a file

Benefits:
• Orders of magnitude faster than system calls
• Input can be “cached” in RAM memory (page/file cache)
• A file requires disk access only when a new page boundary is crossed
• Memory-mapping may bypass the page/swap file completely
• Load and store raw data (no parsing/conversion)

9/84
Memory Mapped I/O - Example 1/2

#if !defined(__linux__)
#error It works only on linux
#endif
#include <fcntl.h>     // ::open
#include <sys/mman.h>  // ::mmap
#include <sys/stat.h>  // ::open
#include <sys/types.h> // ::open
#include <unistd.h>    // ::lseek
#include <string>      // std::stoll, std::string
// ERROR(msg) is assumed to be a macro that prints "msg" and exits

// usage: ./exec <file> <byte_size> <mode>
int main(int argc, char* argv[]) {
    size_t file_size = std::stoll(argv[2]);
    auto   is_read   = std::string(argv[3]) == "READ";
    int fd = is_read ? ::open(argv[1], O_RDONLY) :
                       ::open(argv[1], O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1)
        ERROR("::open")
    // try to get the last byte
    if (::lseek(fd, static_cast<off_t>(file_size - 1), SEEK_SET) == -1)
        ERROR("::lseek")
    if (!is_read && ::write(fd, "", 1) != 1) // try to write
        ERROR("::write")
10/84
Memory Mapped I/O - Example 2/2

    auto mm_mode = (is_read) ? PROT_READ : PROT_WRITE;

    // Open Memory Mapped file
    auto mmap_ptr = static_cast<char*>(
                        ::mmap(nullptr, file_size, mm_mode, MAP_SHARED, fd, 0) );
    if (mmap_ptr == MAP_FAILED)
        ERROR("::mmap");
    // Advise sequential access
    if (::madvise(mmap_ptr, file_size, MADV_SEQUENTIAL) == -1)
        ERROR("::madvise");

    // Memory-Mapped Operations
    // read from/write to "mmap_ptr" as a normal array: mmap_ptr[i]

    // Close Memory Mapped file
    if (::munmap(mmap_ptr, file_size) == -1)
        ERROR("::munmap");
    if (::close(fd) == -1)
        ERROR("::close");
}
11/84
Low-Level Parsing 1/2

Consider using optimized (low-level) numeric conversion routines:


template<int N, unsigned MUL, int INDEX = 0>
struct fastStringToIntStr;

inline unsigned fastStringToUnsigned(const char* str, int length) {


switch(length) {
case 10: return fastStringToIntStr<10, 1000000000>::aux(str);
case 9: return fastStringToIntStr< 9, 100000000>::aux(str);
case 8: return fastStringToIntStr< 8, 10000000>::aux(str);
case 7: return fastStringToIntStr< 7, 1000000>::aux(str);
case 6: return fastStringToIntStr< 6, 100000>::aux(str);
case 5: return fastStringToIntStr< 5, 10000>::aux(str);
case 4: return fastStringToIntStr< 4, 1000>::aux(str);
case 3: return fastStringToIntStr< 3, 100>::aux(str);
case 2: return fastStringToIntStr< 2, 10>::aux(str);
case 1: return fastStringToIntStr< 1, 1>::aux(str);
default: return 0;
}
}
12/84
Low-Level Parsing 2/2

template<int N, unsigned MUL, int INDEX>


struct fastStringToIntStr {
static inline unsigned aux(const char* str) {
return static_cast<unsigned>(str[INDEX] - '0') * MUL +
fastStringToIntStr<N - 1, MUL / 10, INDEX + 1>::aux(str);
}
};

template<unsigned MUL, int INDEX>


struct fastStringToIntStr<1, MUL, INDEX> {
static inline unsigned aux(const char* str) {
return static_cast<unsigned>(str[INDEX] - '0');
}
};

Faster parsing: lemire.me/blog/tag/simd-swar-parsing 13/84


Speed Up Raw Data Loading 1/2

• Hard disk is orders of magnitude slower than RAM

• Parsing is faster than data reading

• Parsing can be avoided by using binary storage and mmap

• Decreasing the number of hard disk accesses improves the performance →
  compression

LZ4 is a lossless compression algorithm providing extremely fast decompression (up to
~35% of memcpy speed) and a good compression ratio
github.com/lz4/lz4

Another alternative is Facebook zstd


github.com/facebook/zstd 14/84
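A minimal sketch of how the LZ4 path could look, assuming the compressed file has
already been memory-mapped as in the previous example and that the original
(decompressed) size is known; the function name and sizes are illustrative, and
LZ4_decompress_safe comes from the liblz4 C API (link with -llz4):

#include <lz4.h>        // LZ4_decompress_safe
#include <cstddef>
#include <stdexcept>
#include <vector>

// "mmap_ptr" and the two sizes are assumed to come from the caller (e.g. the mmap example)
std::vector<int> load_lz4(const char* mmap_ptr, size_t compressed_size,
                          size_t original_size) {
    std::vector<int> data(original_size / sizeof(int));
    int written = LZ4_decompress_safe(mmap_ptr,
                                      reinterpret_cast<char*>(data.data()),
                                      static_cast<int>(compressed_size),
                                      static_cast<int>(original_size));
    if (written < 0)
        throw std::runtime_error("LZ4 decompression failed");
    return data;
}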
Speed Up Raw Data Loading 2/2

Performance comparison of different methods for a file of 4.8 GB of integer values

Load Method                             Exec. Time   Speedup
ifstream                                102 667 ms   1.0x
memory mapped + parsing (first run)      30 235 ms   3.4x
memory mapped + parsing (second run)     22 509 ms   4.5x
memory mapped + lz4 (first run)           3 914 ms   26.2x
memory mapped + lz4 (second run)          1 261 ms   81.4x

NOTE: the size of the LZ4 compressed file is 1.8 GB


15/84
Memory
Optimizations
Heap Memory

• Dynamic heap allocation is expensive: it is implementation dependent and may
  interact with the operating system

• Many small heap allocations are more expensive than one large memory allocation
  (see the sketch below).
  The default page size on Linux is 4 KB. For smaller/multiple sizes, C++ uses a
  sub-allocator

• Allocations within the page size are faster than larger allocations (sub-allocator)

16/84
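A minimal sketch of the difference between many small allocations and a single large
one (the function names are illustrative):

#include <memory>
#include <vector>

void many_small(int n) {                      // n separate heap allocations
    std::vector<std::unique_ptr<int>> v;
    for (int i = 0; i < n; i++)
        v.push_back(std::make_unique<int>(i));
}

void one_large(int n) {                       // a single heap allocation
    std::vector<int> v(n);
    for (int i = 0; i < n; i++)
        v[i] = i;
}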
Stack Memory

• Stack memory is faster than heap memory. The stack memory provides high
locality, it is small (cache fit), and its size is known at compile-time

• static stack allocations produce better code: they avoid filling the stack each
  time the function is reached

• constexpr arrays with dynamic indexing produce very inefficient code with
  GCC. Use static constexpr instead

int f(int x) {
    // bad performance with GCC:
    // constexpr int array[] = {1,2,3,4,5,6,7,8,9};
    static constexpr int array[] = {1,2,3,4,5,6,7,8,9};
    return array[x];
}
17/84
Cache Utilization

Maximize cache utilization:

• Maximize spatial and temporal locality (see next examples)

• Prefer small data types

• Prefer std::vector<bool> over an array of bool (it is bit-packed)

• Prefer std::bitset<N> over std::vector<bool> if the data size is known in
  advance or bounded (see the sketch below)

18/84
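A small sketch of the last two points (sizes are illustrative):

#include <bitset>
#include <vector>

std::vector<bool> flags_v(10'000);   // bit-packed, size chosen at run-time
std::bitset<10'000> flags_b;         // bit-packed, size known at compile-time: preferred

void set_flag(int i) {
    flags_b[i] = true;               // no heap allocation, no indirection
}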
Spatial Locality Example 1/2

A, B, C matrices of size N × N

// C = A * B
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        int sum = 0;
        for (int k = 0; k < N; k++)
            sum += A[i][k] * B[k][j]; // row × column
        C[i][j] = sum;
    }
}

// C = A * B^T
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        int sum = 0;
        for (int k = 0; k < N; k++)
            sum += A[i][k] * B[j][k]; // row × row
        C[i][j] = sum;
    }
}
19/84
Spatial Locality Example 2/2

Benchmark:

N         64       128     256    512     1024
A * B     < 1 ms   5 ms    29 ms  141 ms  1,030 ms
A * B^T   < 1 ms   2 ms    6 ms   48 ms   385 ms
Speedup   /        2.5x    4.8x   2.9x    2.7x

20/84
Temporal-Locality Example

Speeding up a random-access function

// V1
for (int i = 0; i < N; i++)
    out_array[i] = in_array[hash(i)];

// V2
for (int K = 0; K < N; K += CACHE) {
    for (int i = 0; i < N; i++) {
        auto x = hash(i);
        if (x >= K && x < K + CACHE)
            out_array[i] = in_array[x];
    }
}

V1 : 436 ms, V2 : 336 ms → 1.3x speedup (temporal locality improvement)


.. but it needs a careful evaluation of CACHE and it can even decrease the performance for
other sizes
pre-sorted hash(i) : 135 ms → 3.2x speedup (spatial locality improvement)

21/84
lemire.me/blog/2019/04/27
Data Alignment

Data alignment allows avoiding unnecessary memory accesses, and it is also essential
to exploit hardware vector instructions (SIMD) like SSE, AVX, etc.

• Internal alignment: reducing memory footprint, optimizing memory bandwidth,


and minimizing cache-line misses
• External alignment: minimizing cache-line misses

22/84
Internal Structure Alignment

struct A1 {
    char   x1; // offset 0
    double y1; // offset 8!! (not 1)
    char   x2; // offset 16
    double y2; // offset 24
    char   x3; // offset 32
    double y3; // offset 40
    char   x4; // offset 48
    double y4; // offset 56
    char   x5; // offset 64 (65 bytes)
};

struct A2 { // internal alignment
    char   x1; // offset 0
    char   x2; // offset 1
    char   x3; // offset 2
    char   x4; // offset 3
    char   x5; // offset 4
    double y1; // offset 8
    double y2; // offset 16
    double y3; // offset 24
    double y4; // offset 32 (40 bytes)
};

Considering an array of structures (AoS), there are two problems:


• We are wasting 40% of memory in the first case ( A1 )
• In common x64 processors the cache line is 64 bytes. For the first structure A1 ,
every access involves two cache line operations (2x slower)
23/84
see also #pragma pack(1)
External Structure Alignment and Padding

Considering the previous example for the structure A2 , random loads from an array of
structures A2 leads to one or two cache line operations depending on the alignment at
a specific index, e.g.
index 0 → one cache line load
index 1 → two cache line loads

It is possible to fix the structure alignment in two ways:


• Memory padding refers to introducing extra bytes at the end of the data
  structure to enforce the memory alignment,
  e.g. add a char array of size 24 to the structure A2

• The alignas keyword or an alignment attribute allows specifying the alignment
  requirement of a type or an object (next slide)
24/84
External Structure Alignment in C++ 1/2

C++ allows specifying the alignment requirement in different ways:

• C++11 alignas(N) only for variable / struct declarations

• C++17 aligned new, e.g. new (std::align_val_t(N)) int[2]

• Compiler intrinsics only for variable / struct declarations
  • GCC/Clang: __attribute__((aligned(N)))
  • MSVC: __declspec(align(N))

• Compiler intrinsics for dynamic pointers
  • GCC/Clang: __builtin_assume_aligned(x)
  • Intel: __assume_aligned(x)

25/84
External Structure Alignment in C++ 2/2

#include <new> // std::align_val_t

struct alignas(16) A1 { // C++11
    int x, y;
};

struct __attribute__((aligned(16))) A2 { // compiler-specific attribute
    int x, y;
};

auto ptr1 = new (std::align_val_t{16}) int[100]; // 16B alignment, C++17
auto ptr2 = new int[100];                        // 4B alignment guarantee
auto ptr3 = __builtin_assume_aligned(ptr2, 16);  // compiler-specific builtin
auto ptr4 = new A1[10];  // 16B alignment guaranteed only since C++17

26/84
Memory Prefetch

__builtin_prefetch is used to minimize cache-miss latency by moving data into a
cache before it is accessed. It can be used not only for improving spatial locality, but
also temporal locality

for (int i = 0; i < size; i++) {
    auto data = array[i];
    __builtin_prefetch(array + i + 1, 0, 1); // 2nd argument, '0' means read-only
                                             // 3rd argument, '1' means
                                             // temporal locality=1, default=3
    // do some computation on 'data', e.g. CRC
}

27/84
Multi-Threading and Caches

CPU/thread affinity controls how a process is mapped and executed over
multiple cores (including sockets). It affects the process performance due to
core-to-core communication and cache-line invalidation overhead

Maximizing thread “clustering” on a single core can potentially lead to a higher cache
hit rate and faster communication. On the other hand, if the threads work (almost)
independently, namely they show high locality on their working set, mapping them to
different cores can improve the performance (see the sketch below)

C++11 threads, affinity and hyper-threading 28/84
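A hedged, Linux-specific sketch of how a thread could be pinned to a given core
through the native pthread handle (the function name and the chosen core are
illustrative; pthread_setaffinity_np is a GNU extension):

#include <pthread.h>   // pthread_setaffinity_np (GNU extension)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>

void pin_to_core(std::thread& t, int core) {
    cpu_set_t cpu_set;
    CPU_ZERO(&cpu_set);
    CPU_SET(core, &cpu_set);                  // allow execution only on "core"
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpu_set);
}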


Arithmetic
Hardware Notes

• Instruction throughput greatly depends on processor model and characteristics

• Modern processors provide separate units for floating-point computation (FPU)

• Addition, subtraction, and bitwise operations are computed by the ALU and they
have very similar throughput

• In modern processors, multiplication and addition are computed by the same


hardware component for decreasing circuit area → multiplication and addition can
be fused in a single operation fma (floating-point) and mad (integer)

29/84
uops.info: Latency, Throughput, and Port Usage Information
Data Types

• 32-bit integral vs. floating-point: in general, integral types are faster, but it
depends on the processor characteristics

• 32-bit types are faster than 64-bit types


• 64-bit integral types are slightly slower than 32-bit integral types. Modern processors
widely support native 64-bit instructions for most operations, otherwise they require
multiple operations
• Single precision floating-points are up to three times faster than double precision
floating-points

• Small integral types are slower than 32-bit integer, but they require less
memory → cache/memory efficiency
30/84
Operations 1/2

• In modern architectures, arithmetic increment/decrement ++ / -- has the same


performance of add / sub

• Prefer prefix operator ( ++var ) instead of the postfix operator ( var++ ) *

• Use the arithmetic compound operators ( a += b ) instead of operators


combined with assignment ( a = a + b ) *

* the compiler automatically applies such optimization whenever possible. This is not ensured for
object types 31/84
Operations 2/2

• Keep constant values/variables close together → the compiler can merge their values

• Some unsigned operations are faster than the signed counterparts (which must deal
  with negative numbers), e.g. x / 2

• Prefer logic operations || over bitwise operations | to take advantage of
  short-circuiting

Is if(A | B) always faster than if(A || B)? 32/84


Integer Multiplication

Integer multiplication requires double the number of bits of the operands


// 32-bit platforms, or knowledge that x, y are less than 2^32

int f1(int x, int y) {


return x * y; // efficient but can overflow
}

int64_t f2(int64_t x, int64_t y) {


return x * y; // always correct but slow
}

int64_t f3(int x, int y) {


return x * static_cast<int64_t>(y); // correct and efficient!!
}

33/84
Power-of-Two Multiplication/Division/Modulo

• Prefer shift for power-of-two multiplications ( a ≪ b ) and divisions


( a ≫ b ) only for run-time values *

• Prefer bitwise AND ( a % b → a & (b - 1) ) for power-of-two modulo
  operations, only for run-time values * (see the sketch below)

• Constant multiplication and division can be heavily optimized by the compiler,


even for non-trivial values

* the compiler automatically applies such optimizations if b is known at compile-time. Bitwise


operations make the code harder to read
34/84
Ideal divisors: when a division compiles down to just a multiplication
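A minimal sketch of the run-time power-of-two tricks mentioned above; the caller must
guarantee that b is a power of two and that k is the exponent (function names are
illustrative):

// b is a run-time value known to be a power of two, k a run-time exponent
unsigned mod_pow2(unsigned a, unsigned b) { return a & (b - 1); }  // a % b
unsigned mul_pow2(unsigned a, unsigned k) { return a << k; }       // a * 2^k
unsigned div_pow2(unsigned a, unsigned k) { return a >> k; }       // a / 2^k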
Conversion

From            To               Cost
Signed          Unsigned         no cost, bit representation is the same
Unsigned        Larger Unsigned  no cost, register extended
Signed          Larger Signed    1 clock-cycle, register + sign extended
Integer         Floating-point   4-16 clock-cycles; Signed → Floating-point is faster
                                 than Unsigned → Floating-point (unless the AVX512
                                 instruction set is enabled)
Floating-point  Integer          fast if SSE2 is enabled, slow otherwise (50-100 clock-cycles)

35/84
Optimizing software in C++, Agner Fog
Floating-Point Division

Multiplication is much faster than division*

not optimized:
// "value" is floating-point (dynamic)
for (int i = 0; i < N; i++)
    A[i] = B[i] / value;

optimized:
auto div = 1.0 / value; // div is floating-point
for (int i = 0; i < N; i++)
    A[i] = B[i] * div;

* Multiplying by the inverse is not the same as the division


36/84
see lemire.me/blog/2019/03/12
Floating-Point FMA

Modern processors allow performing a * b + c in a single operation, called fused


multiply-add ( std::fma in C++11). This implies better performance and accuracy
CPU processors perform computations with a larger register size than the original data
type (e.g. 48-bit for 32-bit floating-point) for performing this operation

Compiler behavior:
• GCC 9 and ICC 19 produce a single instruction for std::fma and for a * b + c with
-O3 -march=native
• Clang 9 and MSVC 19.* produce a single instruction for std::fma but not for
a * b + c

FMA: solve quadratic equation


37/84
FMA: extended precision addition and multiplication by constant
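A minimal sketch of std::fma; with -O3 -march=native, GCC and ICC also fuse the
plain expression as noted above (function names are illustrative):

#include <cmath>   // std::fma

double axpy(double a, double x, double y) {
    return std::fma(a, x, y);   // fused multiply-add: a * x + y in one instruction
}

double axpy_plain(double a, double x, double y) {
    return a * x + y;           // fused only by some compilers (see above)
}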
Compiler Intrinsic Functions 1/5

Compiler intrinsics are highly optimized functions directly provided by the compiler
instead of external libraries
Advantages:
• Directly mapped to hardware functionalities if available
• Inline expansion
• Do not inhibit high-level optimizations and, contrary to asm code, they are portable
Drawbacks:
• Portability is limited to a specific compiler
• Some intrinsics do not work on all platforms
• The same intrinsic can be mapped to a non-optimal instruction sequence
  depending on the compiler
38/84
Compiler Intrinsic Functions 2/5

Most compilers provide intrinsic bit-manipulation functions for the SSE4.2 or ABM
(Advanced Bit Manipulation) instruction sets of Intel and AMD processors

GCC examples:

__builtin_popcount(x) counts the number of one bits

__builtin_clz(x) (count leading zeros) counts the number of zero bits preceding the
most significant one bit

__builtin_ctz(x) (count trailing zeros) counts the number of zero bits following
the least significant one bit

__builtin_ffs(x) (find first set) one plus the index of the least significant one bit
(0 if x is zero)

39/84
gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
Compiler Intrinsic Functions 3/5

• Compute integer log2

inline unsigned log2(unsigned x) {
    return 31 - __builtin_clz(x);
}

• Check if a number is a power of 2

inline bool is_power2(unsigned x) {
    return __builtin_popcount(x) == 1;
}

• Bit search and clear

inline int bit_search_clear(unsigned& x) {
    int pos = __builtin_ffs(x); // 0 if x == 0, otherwise 1 + index of the lowest set bit
    if (pos != 0)
        x &= ~(1u << (pos - 1)); // clear the least significant one bit
    return pos;
}
40/84
Compiler Intrinsic Functions 4/5

Example of intrinsic portability issue:

__builtin_popcount() GCC produces a __popcountdi2 call, while the Intel
Compiler (ICC) produces 13 instructions
_mm_popcnt_u32 GCC and ICC produce the popcnt instruction, but it is available only
for processors with support for the SSE4.2 instruction set

More advanced usage:

• Compute CRC: _mm_crc32_u32
• AES cryptography: _mm256_aesenclast_epi128
• Hash function: _mm_sha256msg1_epu32

41/84
software.intel.com/sites/landingpage/IntrinsicsGuide/
Compiler Intrinsic Functions 5/5

Using intrinsic instructions is extremely dangerous if the target processor does not
natively support such instructions

Example:
“If you run code that uses the intrinsic on hardware that doesn’t support the lzcnt
instruction, the results are unpredictable” - MSVC

On the contrary, GNU and clang __builtin_* instructions are always well-defined.
The instruction is translated to a non-optimal operation sequence in the worst case.
The instruction set support should be checked at run-time (e.g. with the __cpuid
function on MSVC), or, when available, by using compile-time macros (e.g. __AVX__ );
see the sketch below

42/84
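A sketch (GCC/Clang on x86, illustrative function names) of a run-time feature check
before using a popcount intrinsic; the builtin fallback is always well-defined:

#include <nmmintrin.h>   // _mm_popcnt_u32 (SSE4.2)

__attribute__((target("popcnt")))            // allow the intrinsic without -mpopcnt
unsigned popcount_hw(unsigned x) {
    return _mm_popcnt_u32(x);                // hardware popcnt
}

unsigned safe_popcount(unsigned x) {
    if (__builtin_cpu_supports("popcnt"))    // run-time CPU feature check
        return popcount_hw(x);
    return __builtin_popcount(x);            // software fallback, always available
}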
Automatic Compiler Function Transformation

std::abs can be recognized by the compiler and transformed into a hardware
instruction

In a similar way, C++20 provides a portable and efficient way to express bit operations
in the <bit> header (see the sketch below):
rotate left:           std::rotl
rotate right:          std::rotr
count leading zeros:   std::countl_zero
count leading ones:    std::countl_one
count trailing zeros:  std::countr_zero
count trailing ones:   std::countr_one
population count:      std::popcount

43/84
Why is the standard "abs" function faster than mine?
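A C++20 sketch of the earlier intrinsic-based snippets expressed portably with <bit>
(function names are illustrative):

#include <bit>
#include <cstdint>

int ilog2(uint32_t x)      { return 31 - std::countl_zero(x); } // integer log2, x > 0
bool is_power2(uint32_t x) { return std::has_single_bit(x); }   // same as popcount(x) == 1
uint32_t rotl8(uint32_t x) { return std::rotl(x, 8); }          // rotate left by 8 bits
int ones(uint32_t x)       { return std::popcount(x); }         // population count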
Value in a Range

Checking if a non-negative value x is within a range [A, B] can be optimized if


B > A (useful when the condition is repeated multiple times)

if (x >= A && x <= B)

// STEP 1: subtract A
if (x - A >= A - A && x - A <= B - A)
// -->
if (x - A >= 0 && x - A <= B - A) // B - A is precomputed

// STEP 2
// - convert "x - A >= 0" --> (unsigned) (x - A)
// - "B - A" is always positive
if ((unsigned) (x - A) <= (unsigned) (B - A))
44/84
Value in a Range Examples

Check if a value is an uppercase letter:

uint8_t x = ...
if (x >= 'A' && x <= 'Z')   →   if (uint8_t(x - 'A') <= 'Z' - 'A')
    ...                             ...

A more general case:

int x = ...
if (x >= -10 && x <= 30)    →   if ((unsigned) (x + 10) <= 40)
    ...                             ...

The compiler applies this optimization only in some cases


45/84
(tested with GCC/Clang 9 -O3)
Lookup Table

Lookup table (LUT) is a memoization technique which allows replacing runtime


computation with precomputed values
Example: a function that computes the logarithm base 10 of a number in the range [1-100]

#include <array>
#include <cmath>

template<int SIZE, typename Lambda>
constexpr std::array<float, SIZE> build(Lambda lambda) {
    std::array<float, SIZE> array{};
    for (int i = 0; i < SIZE; i++)
        array[i] = lambda(i);
    return array;
}

float log10(int value) {
    constexpr auto lambda = [](int i) { return std::log10f((float) i); };
    static constexpr auto table = build<100>(lambda);
    return table[value];
}

46/84
Make your lookup table do more
Low-Level Optimizations

Collection of low-level implementations/optimization of common operations:

• Bit Twiddling Hacks


graphics.stanford.edu/∼seander/bithacks.html

• The Aggregate Magic Algorithms


aggregate.org/MAGIC

• Hacker's Delight Book


www.hackersdelight.org

47/84
Low-Level Information

The same instruction/operation may take different clock-cycles on different


architectures/CPU type

• Agner Fog - Instruction tables (latencies, throughputs)


www.agner.org/optimize/instruction tables.pdf

• Latency, Throughput, and Port Usage Information


uops.info/table.html

48/84
Control Flow
Branches are expensive 1/2

Computation is faster than decision

49/84
Branches are expensive 2/2

Pipelines are an essential element in modern processors. Some processors have up to


20 pipeline stages (14/16 typically)

The downside to long pipelines includes the danger of pipeline stalls that waste CPU
time, and the time it takes to reload the pipeline on conditional branch operations
( if , while , for )

50/84
Control Flow 1/2

• Prefer switch statements to multiple if
  - If the compiler does not use a jump-table, the cases are evaluated in order of
    appearance → the most frequent cases should be placed first
  - Some compilers (e.g. clang) are able to translate a sequence of if into a switch

• Prefer square-bracket syntax [] over pointer arithmetic operations for array
  access to facilitate compiler loop optimizations (polyhedral loop transformations)

• Prefer signed integers for loop indexing. The compiler optimizes such loops more
  aggressively since signed integer overflow is undefined behavior

• Prefer range-based loops for iterating over a container 1

51/84
The Little Things: Everyday efficiencies
Control Flow 2/2

• In general, if statements affect performance when the branch is taken

• Some compilers (e.g. clang) use assertions for optimization purposes: most likely
  code path, impossible values, etc. 2

• Not all control flow instructions (or branches) are translated into jump
  instructions. If the code in the branch is small, the compiler could optimize it into a
  conditional instruction, e.g. cmovl
  Small code sections can be optimized in different ways 3 (see next slides)

1 Branch predictor: How many ‘if’s are too many?


2 Andrei Alexandrescu
52/84
3 Is this a branch?
Minimize Branch Overhead

• Branch prediction: technique to guess which way a branch goes. It requires
  hardware support, and it is generally based on the dynamic history of code execution

• Branch predication: a conditional branch is substituted by a sequence of


instructions from both paths of the branch. Only the instructions associated to a
predicate (boolean value), that represents the direction of the branch, are actually
executed
int x = (condition) ? A[i] : B[i];
P = (condition) // P: predicate
@P x = A[i];
@!P x = B[i];

• Speculative execution: execute both sides of the conditional branch to better


utilize the computer resources and commit the results associated to the branch
taken 53/84
Loop Hoisting

Loop Hoisting, also called loop-invariant code motion, consists of moving statements
or expressions outside the body of a loop without affecting the semantics of the
program

Base case:
for (int i = 0; i < 100; i++)
    a[i] = x + y;

Better:
auto v = x + y;
for (int i = 0; i < 100; i++)
    a[i] = v;

Loop hoisting is also important in the evaluation of loop conditions


Base case:
// "x" never changes
for (int i = 0; i < f(x); i++)
    a[i] = y;

Better:
int limit = f(x);
for (int i = 0; i < limit; i++)
    a[i] = y;
In the worst case, f(x) is evaluated at every iteration (especially when it belongs to
another translation unit) 54/84
Loop Unrolling 1/2

Loop unrolling (or unwinding) is a loop transformation technique which optimizes


the code by removing (or reducing) loop iterations
The optimization produces better code at the expense of binary size

Example:
for (int i = 0; i < N; i++)
    sum += A[i];
can be rewritten as:
for (int i = 0; i < N; i += 8) {
    sum += A[i];
    sum += A[i + 1];
    sum += A[i + 2];
    sum += A[i + 3];
    ...
} // we suppose N is a multiple of 8
55/84
Loop Unrolling 2/2

Loop unrolling can make your code better/faster:


+ Improve instruction-level parallelism (ILP)
+ Allow vector (SIMD) instructions
+ Reduce control instructions and branches
Loop unrolling can make your code worse/slower:
- Increase compile-time/binary size
- Require more instruction decoding
- Use more memory and instruction cache

Unroll directive The Intel, IBM, and clang compilers (but not GCC) provide the
preprocessing directive #pragma unroll (to insert above the loop) to force loop unrolling.
The compiler already applies the optimization in most cases

Why are unrolled loops faster? 56/84


Branch Hints - [[likely]] / [[unlikely]]

C++20 [[likely]] and [[unlikely]] provide a hint to the compiler to optimize


a conditional statement, such as while , for , if

for (int i = 0; i < 300; i++) {
    if (rand() < 10) [[unlikely]]
        return false;
}

switch (value) {
    [[likely]] case 'A': return 2;
    [[unlikely]] case 'B': return 4;
}

57/84
Compiler Hints - [[assume]]

C++23 allows defining an assumption in the code that is always true

int x = ...;
[[assume(x > 0)]]; // the compiler assumes that 'x' is positive

int y = x / 2;     // the operation is translated into a single shift, as in
                   // the unsigned case

Compilers provide non-portable instructions for previous C++ standards:
__builtin_assume() (clang), __builtin_unreachable() (gcc), __assume()
(msvc, icc)

C++23 also provides std::unreachable() ( <utility> ) for marking unreachable
code
58/84
Recursion 1/2

Avoid run-time recursion (very expensive). Prefer iterative algorithms instead (see
next slides)

Recursion cost: the program must store all variables (a snapshot) on the stack at each
recursion step, and remove them when control returns to the caller

The tail recursion optimization avoids maintaining the caller stack frame and passes
control directly to the next iteration. The optimization is possible only if all
computation can be executed before the recursive call (see the sketch below)

59/84
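A minimal sketch contrasting plain recursion with a tail-recursive formulation
(factorial); with optimizations enabled, compilers typically turn the latter into a loop:

long fact(long n) {                        // NOT tail recursive:
    return n <= 1 ? 1 : n * fact(n - 1);   // the multiply happens after the call returns
}

long fact_tail(long n, long acc = 1) {     // tail recursive:
    return n <= 1 ? acc                    // nothing is left to do after the
                  : fact_tail(n - 1, acc * n); // recursive call
}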
Recursion 2/2

60/84
Via Twitter - Jan Wildeboer
Functions
Function Call Cost

Function call methods:


Direct Function address is known at compile-time
Indirect Function address is known only at run-time
Inline The function code is fused in the caller code

Function call cost:


• The caller pushes the arguments on the stack in reverse order
• Jump to function address
• The caller clears (pop) the stack
• The function pushes the return value on the stack
• Jump to the caller address
61/84
The True Cost of Calls
Argument Passing 1/3

pass by-value      Small data types (≤ 8/16 bytes).
                   The data is copied into registers instead of the stack.
                   It avoids aliasing performance issues

pass by-pointer    Introduces one level of indirection.
                   It should be used only for raw pointers (potentially NULL)

pass by-reference  May not introduce one level of indirection if resolved in the same
                   translation unit/LTO.
                   Pass-by-reference is more efficient than pass-by-pointer as
                   it facilitates variable elimination by the compiler, and the function
                   code does not require checking for a NULL pointer

62/84
Three reasons to pass std::string_view by value
Argument Passing - Active Objects 2/3

For active objects with non-trivial copy constructor or destructor:

by-value Could be very expensive, and hard to optimize


by-pointer/reference Prefer pass-by- const -pointer/reference
const function member overloading can also be cheaper

63/84
Argument Passing - Passive Objects 3/3

For passive objects with trivial copy constructor and destructor:


by-value/by-reference Most compilers optimize pass by-value with pass by-reference
and the opposite case for passive data structures if related to
the same translation unit/LTO
by-const-value     Always produces the optimal code if applied in the same
                   translation unit/LTO. It is converted to pass-by-const-reference if
                   needed.
                   In general, it should be avoided since it does not change the
                   function signature
by-value Doesn’t always produce the optimal code for large data
structures
by-reference Could introduce a level of indirection 64/84
Function Optimizations

• Keep the number of function parameters small. The parameters can then be passed
  in registers instead of filling and emptying the stack
• Consider combining several function parameters into a structure
• The const modifier applied to pointers and references does not produce better code
  in most cases, but it is useful for ensuring read-only accesses

Some compilers provide additional attributes to optimize function calls
(see the sketch below):

• The __attribute__((pure)) attribute (Clang, GCC) specifies that a function has no
  side effects on its parameters or program state (external global references)
• The __attribute__((const)) attribute (Clang, GCC) specifies that a function doesn't
  depend on (read) external global references
65/84
GoTW#81: Constant Optimization?
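A small sketch of the two attributes (the functions are illustrative):

__attribute__((const))            // result depends only on the arguments
int square(int x) {
    return x * x;
}

__attribute__((pure))             // no side effects, but may read pointed-to/global memory
int my_strlen(const char* s) {
    int n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}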
inline Function Declaration 1/2

inline
inline specifier for optimization purposes is just a hint for the compiler that
increases the heuristic threshold for inlining, namely copying the function body
where it is called

inline void f() { ... }

• the compiler can ignore the hint


• inlined functions increase the binary size because they are expanded in-place for
every function call

66/84
inline Function Declaration 2/2

Compilers have different heuristics for function inlining


• Number of lines (even comments: How new-lines affect the Linux kernel
performance)
• Number of assembly instructions
• Inlining depth (recursive)

GCC/Clang extensions allow to force inline/non-inline functions:


__attribute__((always_inline)) void f() { ... }
__attribute__((noinline))      void g() { ... }

• An Inline Function is As Fast As a Macro


67/84
• Inlining Decisions in Visual Studio
Inlining and Linkage

The compiler can inline a function only if it is independent from external references

• A function with internal linkage is not visible outside the current translation unit,
so it can be aggressively inlined

• On the other hand, external linkage doesn't prevent function inlining if the
  function body is visible in a translation unit. In this situation, the compiler can
  duplicate the function code if it determines that there are no external references

68/84
Symbol Visibility

All compilers, except MSVC, export all function symbols → slow, the symbols can be
used in other translation units

Alternatives:
• Use static functions

• Use anonymous namespace (functions and classes)

• Use the GNU extension (also clang) __attribute__((visibility("hidden")))

69/84
gcc.gnu.org/wiki/Visibility
Pointers Aliasing 1/4

Consider the following example:


// suppose f() is not inline
void f(int* input, int size, int* output) {
for (int i = 0; i < size; i++)
output[i] = input[i];
}

• The compiler cannot unroll the loop (sequential execution, no ILP) because
output and input pointers can be aliased, e.g. output = input + 1

• The aliasing problem is even worse for more complex code and inhibits all kinds of
optimization including code re-ordering, vectorization, common sub-expression
elimination, etc.
70/84
Pointers Aliasing 2/4

Most compilers (included GCC/Clang/MSVC) provide restricted pointers


( restrict ) so that the programmer asserts that the pointers are not aliased
void f(int* __restrict input,
int size,
int* __restrict output) {
for (int i = 0; i < size; i++)
output[i] = input[i];
}

Potential benefits:
• Instruction-level parallelism
• Less instructions executed
• Merge common sub-expressions
71/84
Pointers Aliasing 3/4

Benchmarking matrix multiplication


void matrix_mul_v1(const int* A,
const int* B,
int N,
int* C) {

void matrix_mul_v2(const int* __restrict A,


const int* __restrict B,
int N,
int* __restrict C) {

Optimization   -O1        -O2      -O3
v1             1,030 ms   777 ms   777 ms
v2             513 ms     510 ms   761 ms
Speedup        2.0x       1.5x     1.02x
72/84
Pointers Aliasing 4/4

void foo(std::vector<double>& v, const double& coeff) {


for (auto& item : v) item *= std::sinh(coeff);
}
vs.
void foo(std::vector<double>& v, double coeff) {
for (auto& item : v) item *= std::sinh(coeff);
}

73/84
Argument Passing, Core Guidelines and Aliasing
Object-Oriented
Programming
Variable/Object Scope

Declare local variables in the innermost scope

• the compiler is more likely to fit them into registers instead of the stack
• it improves readability

Wrong:
int i, x;
for (i = 0; i < N; i++) {
    x = value * 5;
    sum += x;
}

Correct:
for (int i = 0; i < N; i++) {
    int x = value * 5;
    sum += x;
}

• C++17 allows local variable initialization in if and while statements, while
  C++20 introduces it for range-based loops
74/84
Variable/Object Scope

Exception! Built-in type variables and passive structures should be placed in the
innermost loop, while objects with constructors should be placed outside loops

for (int i = 0; i < N; i++) {
    std::string str("prefix_");
    std::cout << str + value[i];
} // str calls CTOR/DTOR N times

std::string str("prefix_");
for (int i = 0; i < N; i++) {
    std::cout << str + value[i];
}

75/84
Object RAII Optimizations

• Prefer direct initialization and the full object constructor instead of two-step
  initialization (also for variables); see the sketch below

• Prefer move semantics instead of the copy constructor. Mark the copy constructor as
  =delete when copies are not needed (implicit copies are sometimes hard to spot)

• Ensure defaulted default and copy constructors ( = default ) to enable
  vectorization

76/84
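A minimal sketch of direct initialization and move semantics (the types and names are
illustrative):

#include <string>
#include <utility>
#include <vector>

std::string make_name() { return std::string(1000, 'x'); }

int main() {
    std::vector<int> v1(100, 42);      // direct, full construction: one allocation
    // two-step alternative: std::vector<int> v2;  v2.assign(100, 42);

    std::string s = make_name();       // constructed in place (RVO), no copy
    std::vector<std::string> names;
    names.push_back(std::move(s));     // move instead of copy
}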
Object Dynamic Behavior Optimizations

• Virtual calls are slower than standard functions


- Virtual calls prevent any kind of optimizations as function lookup is at
runtime (loop transformation, vectorization, etc.)
- Virtual call overhead is up to 20%-50% for functions that can be inlined

• Mark as final all virtual functions that are not overridden

• Avoid dynamic operations, e.g. dynamic_cast

- The Hidden Performance Price of Virtual Functions


- Investigating the Performance Overhead of C++ Exceptions
77/84
Object Operation Optimizations

• Use static for all member functions that do not use instance members (avoids
  passing the this pointer)

• Avoid multiple + operations between objects to avoid temporary storage

• Prefer ++obj / --obj (return &obj ), instead of obj++ , obj-- (return old
obj )

• Prefer x += obj , instead of x = x + obj → avoid the object copy

78/84
Object Implicit Conversion

struct A { // big object


int array[10000];
};
struct B {
int array[10000];

B() = default;

B(const A& a) { // user-defined constructor


std::copy(a.array, a.array + 10000, array);
}
};
//----------------------------------------------------------------------
void f(const B& b) {}

A a;
B b;
f(b); // no cost
f(a); // very costly!! implicit conversion
79/84
Std Library and
Other Language
Aspects
From C to C++

• Avoid old C library routines such as qsort , bsearch , etc. Prefer instead
  std::sort , std::binary_search (see the sketch below)
  - std::sort is based on a hybrid sorting algorithm: quick-sort / heap-sort
    (introsort), merge-sort / insertion sort, etc., depending on the std implementation

  - Prefer std::find() for small arrays; std::lower_bound ,
    std::upper_bound , std::binary_search for large sorted arrays

80/84
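A small sketch of the C routine and its C++ replacement; std::sort can inline the
comparison, while qsort calls the comparator through a function pointer (the function
names are illustrative):

#include <algorithm>   // std::sort
#include <cstdlib>     // std::qsort

int cmp(const void* a, const void* b) {
    return *static_cast<const int*>(a) - *static_cast<const int*>(b);
}

void sort_c(int* v, int n)   { std::qsort(v, n, sizeof(int), cmp); } // indirect calls
void sort_cpp(int* v, int n) { std::sort(v, v + n); }                // comparison inlined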
Function Optimizations

• std::fill applies memset and std::copy applies memcpy if the
  input/output ranges are contiguous in memory

• Use the same type for the initial value in functions like std::accumulate() ,
  std::fill :

auto array = new int[size];
...
auto sum = std::accumulate(array, array + size, 0u);
// 0u != 0 (int) → conversion at each step

std::fill(array, array + size, 0u);
// it is not translated into memset

The Hunt for the Fastest Zero 81/84


Containers

• Use std container member functions (e.g. obj.find() ) instead of external
  ones (e.g. std::find() ). Example: std::set O(log(n)) vs. O(n)

• Be aware of container properties, e.g. vector.push_back(v) instead of
  vector.insert(vector.begin(), value) → the latter copies all vector elements

• Set the std::vector size during object construction (or use the reserve()
  method) if the number of elements to insert is known in advance → every implicit
  resize is equivalent to a copy of all vector elements (see the sketch below)

• Consider unordered containers instead of the ordered ones, e.g. unordered_map
  vs. map

• Prefer std::array instead of dynamic heap allocation
82/84
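A minimal sketch of reserving the capacity up front (the function name is illustrative):

#include <vector>

std::vector<int> fill(int n) {
    std::vector<int> v;
    v.reserve(n);              // single allocation up front
    for (int i = 0; i < n; i++)
        v.push_back(i);        // no implicit reallocation/copy
    return v;
}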


Criticisms of the Standard Template Library (STL)

• Platform/Compiler-dependent implementation

• Execution order and results across platforms

• Debugging is hard

• Complex interaction with custom memory allocators

• Error handling based on exceptions is non-transparent

• Binary bloat

• Compile time (see C++ Compile Health Watchdog, and STL Explorer)

STL isn’t for *anyone*


83/84
Other Language Aspects

• Most data structures are implemented on top of heap memory. Consider
  re-implementing them by using stack memory if the number of elements to insert
  is small (e.g. a queue)

• Prefer lambda expressions (or function objects) instead of std::function
  or function pointers

• Avoid dynamic operations: exceptions (and use noexcept ), smart pointers
  (e.g. std::unique_ptr )

• Use the noexcept decorator → the program is aborted if an error occurs instead of
  raising an exception. See
  Bitcoin: 9% less memory: make SaltedOutpointHasher noexcept
84/84
