Lecture 5

The document discusses optimizations for a vector combining function in C, focusing on performance improvements through various coding techniques and compiler flags. It presents a series of benchmarks measuring cycles per element (CPE) for different implementations, highlighting the impact of manual optimizations such as loop unrolling and the order of operations. The final recommendations emphasize the importance of profiling and understanding domain-specific knowledge for effective optimization.

Program Optimizations

2024 Fall ECE454: Computer Systems Programming Lecture 5

Jon Eyolfson 1.0.1

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
Today, We’re Creating a Vector in C

Recall: a vector is a managed dynamically allocated array

The functions we’ve already written:

/* Create vector of specified length */
vec_ptr new_vec(int len);

/* Retrieve vector element at index, store at *dest
   Return 0 if out of bounds, 1 if successful */
int get_vec_element(vec_ptr v, int index, int *dest);

/* Return pointer to start of vector data */
int *get_vec_start(vec_ptr v);
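
The slides only show the declarations, so here is a minimal sketch of what could sit behind them. The struct layout (`vec_rec` with `len` and `data`), the `data_t` typedef, and the body of `vec_length` are assumptions for illustration, not the course's actual implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef int data_t;                 /* assumed element type */

/* Hypothetical layout: a length plus a heap-allocated array */
typedef struct {
    int64_t len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Create vector of specified length (zero-filled), NULL on failure */
vec_ptr new_vec(int len) {
    if (len < 0)
        return NULL;
    vec_ptr v = malloc(sizeof(vec_rec));
    if (v == NULL)
        return NULL;
    v->len = len;
    /* Allocate at least one element so a zero-length vector still has storage */
    v->data = calloc(len > 0 ? (size_t)len : 1, sizeof(data_t));
    if (v->data == NULL) {
        free(v);
        return NULL;
    }
    return v;
}

/* Return the number of elements */
int64_t vec_length(vec_ptr v) {
    assert(v != NULL);
    return v->len;
}

/* Retrieve vector element at index, store at *dest
   Return 0 if out of bounds, 1 if successful */
int get_vec_element(vec_ptr v, int index, data_t *dest) {
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Return pointer to start of vector data */
data_t *get_vec_start(vec_ptr v) {
    return v->data;
}
```

Note the bounds check in get_vec_element: it runs on every call, which matters once the function is called inside a loop.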

What We’ll be Optimizing: Combining Data

We want to be generic: to combine, we’ll accumulate the result
of an operation between all elements of the vector

It’ll have an operation, OP, and an identity value, IDENT, where

x OP IDENT == x
For a sum, OP is + and IDENT is 0
For a product, OP is * and IDENT is 1
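
One way OP and IDENT could be wired up is with preprocessor macros, so the same loop computes either a sum or a product. This is a sketch: the `COMBINE_PRODUCT` switch and the `combine_array` helper are made-up names for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef int data_t;                 /* assumed element type */

/* Pick one OP/IDENT pair at compile time; sum is the default.
   COMBINE_PRODUCT is a hypothetical switch, e.g. -DCOMBINE_PRODUCT */
#ifdef COMBINE_PRODUCT
#define IDENT 1
#define OP *
#else
#define IDENT 0
#define OP +
#endif

/* Fold OP over a plain array, starting from the identity value */
data_t combine_array(const data_t *data, size_t n) {
    assert(data != NULL || n == 0);
    data_t acc = IDENT;
    for (size_t i = 0; i < n; i++)
        acc = acc OP data[i];
    return acc;
}
```

Starting from IDENT means an empty vector combines to the identity, and the first element passes through unchanged.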

We’ll Measure Cycles per Element (CPE) for Performance

The system I’ll use today is an AMD Ryzen 1800X (it’s a bit old)

We’ll calculate the sum (OP is +, IDENT is 0)

This could be our first attempt:

/* combine1: Maximum use of data abstraction */
void combine(vec_ptr v, data_t *dest) {
    *dest = IDENT;
    for (int64_t i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

If we compile it with no optimizations and with debugging enabled (combine1g in the results), we get a CPE of 20.22
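
CPE is cycles spent divided by elements processed. A minimal sketch of one way to compute it, assuming we already have cycle counts for two runs of different lengths: taking the slope between them cancels the fixed per-call overhead. The function name is hypothetical, and this is not necessarily how the lecture's benchmark does it.

```c
#include <assert.h>
#include <stdint.h>

/* CPE as the slope between two measurements, so that fixed
   overhead (call, setup) does not count against the elements */
double cpe_from_runs(uint64_t cycles_small, int64_t n_small,
                     uint64_t cycles_large, int64_t n_large) {
    assert(n_large > n_small);
    assert(cycles_large >= cycles_small);
    return (double)(cycles_large - cycles_small) /
           (double)(n_large - n_small);
}
```

For example, 1100 cycles for 100 elements and 10100 cycles for 1000 elements gives 9000 / 900, a CPE of 10.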

Without Changing Anything, Let’s Add Optimizations

We’ll use -O2 as our default

/* combine1: Maximum use of data abstraction */
void combine(vec_ptr v, data_t *dest) {
    *dest = IDENT;
    for (int64_t i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

Now when we run this we get a CPE of 10.00

The improvements are basically better register allocation and scheduling

What did the compiler miss that we could do?

Aside: Using perf

You should now be able to use perf on the UG machines

You can also send commands to perf using named pipe FDs
See example: examples/perf-wrapper.py and examples/src/benchmark.c
It starts perf with --delay -1, meaning events start disabled;
after calling setup, it enables perf
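
A rough sketch of the benchmark side of that handshake, assuming perf was started with its --control fd:... option and that the control descriptor's number reached the program somehow (for example through an environment variable, which is an assumption here). `perf_ctl_send` is a hypothetical helper, not the code in examples/src/benchmark.c.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Write one command line (such as "enable" or "disable") to the
   control descriptor. Returns 0 on success, -1 on a failed or short write. */
int perf_ctl_send(int ctl_fd, const char *cmd) {
    assert(cmd != NULL);
    size_t len = strlen(cmd);
    if (write(ctl_fd, cmd, len) != (ssize_t)len)
        return -1;
    if (write(ctl_fd, "\n", 1) != 1)
        return -1;
    return 0;
}
```

The idea is to call perf_ctl_send(fd, "enable") right after setup and perf_ctl_send(fd, "disable") right after the measured region, so setup cost stays out of the profile.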

The Compiler Did not Lift vec_length Out of the Loop
Since it’s in a different compilation unit, the compiler has to assume it has
side effects and that it may not be deterministic

Side effects may include reading or writing global state that may change

For example, if you did a printf in vec_length, moving it would
change the behaviour of your program

Also, without knowing the final link step, even if it knew the implementation,
you may not actually use that function at runtime

Let’s Manually Do LICM (Loop-Invariant Code Motion), Since We Know It’s Safe

We’ll change our code to:

/* combine2: Take vec_length() out of loop */
void combine(vec_ptr v, data_t *dest) {
    int64_t length = vec_length(v);
    *dest = IDENT;
    for (int64_t i = 0; i < length; i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

Now we get a healthy speedup, our CPE is now 7.00

What’s the next thing we can do?

We Can Manually Inline get_vec_element

We’ll change our code around, and get rid of the bounds check

since we know it’s valid:

/* combine3: Array reference to vector data */
void combine(vec_ptr v, data_t *dest) {
    int64_t length = vec_length(v);
    data_t *data = get_vec_start(v);

    *dest = IDENT;
    for (int64_t i = 0; i < length; i++) {
        *dest = *dest OP data[i];
    }
}

Now, our CPE is down to 1.68

We Can Try Removing the Memory Read in the Loop

We can create a local variable called acc to store the result:

/* combine3w: Update *dest within loop only with write */
void combine(vec_ptr v, data_t *dest) {
    int64_t length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    /* Initialize in event length <= 0 */
    *dest = acc;

    for (int64_t i = 0; i < length; i++) {
        acc = acc OP data[i];
        *dest = acc;
    }
}

Turns out that didn’t do much, our CPE is still 1.68

We Actually Don’t Need to Write Every Time in the Loop

Since we only have one thread, we know it’s not possible for

anything else to read dest

(not entirely true, what else could happen?)

/* combine4: Array reference, accumulate in temporary */
void combine(vec_ptr v, data_t *dest) {
    int64_t length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    for (int64_t i = 0; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

This helps a bit more, our CPE is now 1.47

If We Think Array Indexing is Slow, We Can Try Pointers

We’re just adding a constant value to the memory address,

why do we need to re-calculate?

/* combine4p: Pointer reference, accumulate in temporary */
void combine(vec_ptr v, data_t *dest) {
    int64_t length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t *dend = data+length;
    data_t acc = IDENT;

    for (; data < dend; data++)
        acc = acc OP *data;
    *dest = acc;
}

Turns out our compiler already optimized this for us, the CPE is still 1.47

Summary of Results so Far

Benchmark    CPE
combine1g    20.22
combine1     10.00
combine2      7.00
combine3      1.68
combine3w     1.68
combine4      1.47
combine4p     1.47

Don’t Overuse Pointers in C/C++

It’s very difficult for the compiler to reason about raw pointers

(especially when basically anything is possible)

You should use local variables whenever possible

The compiler can reason about the lifetime of local variables

Only update global state when you have to

What About Trying Loop Unrolling?

We can use the compiler flag -funroll-loops without changing the code

/* combine4: Array reference, accumulate in temporary */
void combine(vec_ptr v, data_t *dest) {
    int64_t length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    for (int64_t i = 0; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

With this compiler flag enabled we get a CPE of 1.05

Looking at the Assembly

Meson generates a compile_commands.json file in the build directory

I wrote a script for this example, show-assembly.py, it shows the generated
assembly using the same compiler arguments as the build

Otherwise, you can get the compiler to output assembly using the -S flag
For x86 assembly you may want to also add -masm=intel

Automatic Loop Unrolling Every 8 Elements

    add     edx, DWORD PTR [rax]
    add     rax, 32
    add     edx, DWORD PTR -28[rax]
    add     edx, DWORD PTR -24[rax]
    add     edx, DWORD PTR -20[rax]
    add     edx, DWORD PTR -16[rax]
    add     edx, DWORD PTR -12[rax]
    add     edx, DWORD PTR -8[rax]
    add     edx, DWORD PTR -4[rax]

Assumed CPU Capabilities

Some instructions can actually run in parallel:

1 load

1 store

2 integer (one may be branch)

1 FP addition

1 FP multiplication or division

Instructions Take > 1 Cycle, but Can Be Pipelined

Instruction                 Latency   Cycles/Issue
Load / Store                3         1
Integer Add / Branch        1         1
Integer Multiply            4         1
Double/Single FP Multiply   5         2
Double/Single FP Add        3         1
Integer Divide              36        36
Double/Single FP Divide     38        38
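
These numbers give rough lower bounds on CPE. A simplified sketch of the arithmetic, under the assumed capabilities above (not measured Ryzen figures); the function names are hypothetical. With one accumulator, every OP waits on the previous one, so the bound is the OP's latency. Separately, the functional units can only start so many operations per cycle.

```c
#include <assert.h>

/* Bound when each OP must wait for the previous result
   (a serial dependency chain through the accumulator) */
double latency_bound(int latency_cycles, int elems_per_op) {
    assert(latency_cycles > 0 && elems_per_op > 0);
    return (double)latency_cycles / elems_per_op;
}

/* Bound from how often the functional units can start a new OP */
double throughput_bound(int cycles_per_issue, int units, int elems_per_op) {
    assert(cycles_per_issue > 0 && units > 0 && elems_per_op > 0);
    return (double)cycles_per_issue / ((double)units * elems_per_op);
}
```

For an integer sum with one scalar accumulator, latency_bound(1, 1) is 1.00, and the single load unit gives throughput_bound(1, 1, 1) of 1.00 as well. An instruction that adds four elements at once would lower the latency bound to latency_bound(1, 4), or 0.25.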

Since We’ve Unrolled, It’s Easier to Overlap Operations

[Figure: data-flow diagram. A Load of data[i] (3 cycles) produces t.1; an Add (1 cycle) combines t.1 with edx and writes edx.]

What About Manual Loop Unrolling?

Let’s unroll 3 times:

/* combine5uX: Manual loop unrolling X times. */
void combine(vec_ptr v, data_t *dest) {
    int64_t length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    const int BATCH_SIZE = 3;

    int64_t limit = length - BATCH_SIZE + 1;
    int64_t i;
    for (i = 0; i < limit; i+=BATCH_SIZE) {
        acc = acc OP data[i];
        acc = acc OP data[i+1];
        acc = acc OP data[i+2];
    }
    /* Fix up any remaining elements */
    for (; i < length; ++i) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

This isn’t any better than what the compiler did, our CPE for this is 1.05

Surely 4 Times is Better!

We’ll also try this unrolling 5, 8, and 16 times

/* combine5uX: Manual loop unrolling X times. */
void combine(vec_ptr v, data_t *dest) {
    int64_t length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    const int BATCH_SIZE = 4;

    int64_t limit = length - BATCH_SIZE + 1;
    int64_t i;
    for (i = 0; i < limit; i+=BATCH_SIZE) {
        acc = acc OP data[i];
        acc = acc OP data[i+1];
        acc = acc OP data[i+2];
        acc = acc OP data[i+3];
    }
    /* Fix up any remaining elements */
    for (; i < length; ++i) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

This is a bit of an improvement, our CPE is 0.61

Why Was 4 Times Better?

    mov     rcx, rdx
    add     rdx, 1
    sal     rcx, 4
    movdqu  xmm1, XMMWORD PTR [rax+rcx]
    paddd   xmm0, xmm1
    cmp     rdx, rsi
    jb      .L3
    movdqa  xmm1, xmm0
    and     rdi, -4
    psrldq  xmm1, 8
    mov     rdx, rdi
    paddd   xmm0, xmm1
    add     rdx, 4
    movdqa  xmm1, xmm0
    psrldq  xmm1, 4
    paddd   xmm0, xmm1

The compiler used the XMM registers; each one is 128 bits wide and can store four 32-bit integers

This is part of the SSE2 x86-64 extension for SIMD instructions

SIMD is short for single instruction multiple data
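
For a feel of what that assembly is doing, here is a hand-written sketch using SSE2 intrinsics: paddd corresponds to _mm_add_epi32, movdqu to _mm_loadu_si128, and psrldq to _mm_srli_si128. This is an illustration of the same idea, not the compiler's output, and `sum_sse2` is a made-up name. On targets without SSE2 it falls back to the plain scalar loop.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>              /* SSE2 intrinsics (x86 and x86-64) */
#endif

int32_t sum_sse2(const int32_t *data, int64_t length) {
    assert(data != NULL);
    int32_t acc = 0;
    int64_t i = 0;
#if defined(__SSE2__)
    __m128i vacc = _mm_setzero_si128();     /* four 32-bit partial sums */
    for (; i + 4 <= length; i += 4) {
        /* Unaligned 128-bit load of data[i..i+3], then four adds at once */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(data + i));
        vacc = _mm_add_epi32(vacc, chunk);
    }
    /* Horizontal reduction: fold the upper half onto the lower half,
       then fold lane 1 onto lane 0 */
    vacc = _mm_add_epi32(vacc, _mm_srli_si128(vacc, 8));
    vacc = _mm_add_epi32(vacc, _mm_srli_si128(vacc, 4));
    acc = _mm_cvtsi128_si32(vacc);          /* extract lane 0 */
#endif
    /* Fix up any remaining elements */
    for (; i < length; i++)
        acc += data[i];
    return acc;
}
```

The two shift-and-add steps after the loop match the psrldq 8 and psrldq 4 pairs at the end of the listing: they collapse four partial sums into one.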

It Turns out Unrolling 8 Times is the Best

    movdqu  xmm2, XMMWORD PTR [rdx]
    movdqu  xmm3, XMMWORD PTR 16[rdx]
    add     rcx, 1
    add     rdx, 32
    paddd   xmm1, xmm2
    paddd   xmm0, xmm3
    cmp     rcx, rsi
    jb      .L3
    paddd   xmm0, xmm1
    and     rdi, -8
    movdqa  xmm1, xmm0
    mov     rdx, rdi
    psrldq  xmm1, 8
    add     rdx, 8
    paddd   xmm0, xmm1
    movdqa  xmm1, xmm0
    psrldq  xmm1, 4
    paddd   xmm0, xmm1

It unrolled it a bit more for us, and used more SSE2 registers

Maybe It’s Better to Change the Order of Operations?

/* combine6: Try a different order of operations */
void combine(vec_ptr v, data_t *dest) {
    int64_t length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    const int BATCH_SIZE = 8;

    int64_t limit = length - BATCH_SIZE + 1;
    int64_t i;
    for (i = 0; i < limit; i+=BATCH_SIZE) {
        acc = acc OP (
            (data[i] OP data[i+1]) OP (data[i+2] OP data[i+3]) OP
            (data[i+4] OP data[i+5]) OP (data[i+6] OP data[i+7])
        );
    }
    /* Fix up any remaining elements */
    for (; i < length; ++i) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

Turns out it isn’t better than unrolling 8 or 16 times: our CPE is 0.80

However, it’s better than the manual unrolling that stayed at 1.05 (3 and 5 times)!

Changing the Order of Operations is Better Since the Processor Can Overlap More

    mov     edx, DWORD PTR 4[rax]
    mov     ecx, DWORD PTR 12[rax]
    add     rax, 32
    add     ecx, DWORD PTR -24[rax]
    add     edx, DWORD PTR -32[rax]
    add     edx, ecx
    mov     ecx, DWORD PTR -12[rax]
    add     ecx, DWORD PTR -16[rax]
    add     edx, ecx
    mov     ecx, DWORD PTR -4[rax]
    add     ecx, DWORD PTR -8[rax]
    add     edx, ecx
    add     esi, edx
    cmp     rdi, rax
    jne     .L3
    and     r9, -8
    add     r9, 8

The trade-off is we use more registers, increasing pressure

(if we use too many they’ll spill on the stack and slow us down)

Summary of Loop Unrolling

Benchmark     CPE
combine4u     1.05
combine5u3    1.05
combine5u4    0.61
combine5u5    1.05
combine5u8    0.51
combine5u16   0.50
combine6      0.80

Some of Your Biggest Optimizations Come from Domain-specific Knowledge

For example, what if we were computing the product of all elements?

Is there an optimization we could make under

certain conditions to skip the calculation?

Takeaways

Always profile to make sure you’re optimizing the right thing!

Get the most out of your compiler before going manual

Trade-off: manual optimization vs readable/maintainable code

Limit use of pointers, prefer local variables

Reduce pointer-based arrays and pointer arithmetic

Function pointers and virtual functions (unfortunately)

For highly performance-critical code:

Look at assembly for optimization opportunities

Consider the instruction-level parallelism capabilities of the CPU
