lecture-5
Optimizations
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
Today, We’re Creating a Vector in C
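As a sketch of the kind of vector we'll build (the names `vec`, `new_vec`, `vec_length`, and `get_vec_element` are assumptions modeled on the CS:APP vec abstraction, not necessarily this lecture's exact code):

```c
/* Sketch of the vector abstraction used throughout this lecture.
 * Names follow the CS:APP convention and are assumptions. */
#include <stdint.h>
#include <stdlib.h>

typedef long data_t;

typedef struct {
    int64_t len;
    data_t *data;
} vec, *vec_ptr;

vec_ptr new_vec(int64_t len) {
    vec_ptr v = malloc(sizeof(vec));
    v->len = len;
    v->data = calloc(len, sizeof(data_t)); /* zero-initialized */
    return v;
}

int64_t vec_length(vec_ptr v) { return v->len; }

/* Bounds-checked element access: returns 0 on failure, 1 on success */
int get_vec_element(vec_ptr v, int64_t i, data_t *val) {
    if (i < 0 || i >= v->len)
        return 0;
    *val = v->data[i];
    return 1;
}
```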
What We’ll be Optimizing: Combining Data
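The baseline combine loop might look like this (a sketch; `OP` and `IDENT` are the usual CS:APP macros selecting the operation and its identity, and the helper names are assumptions):

```c
/* Baseline combine: calls vec_length every iteration and pays a
 * bounds check in get_vec_element for every element.
 * Minimal definitions are repeated so the sketch compiles standalone. */
#include <stdint.h>

typedef long data_t;
typedef struct { int64_t len; data_t *data; } vec, *vec_ptr;

#define IDENT 0   /* identity of the operation: 0 for +, 1 for * */
#define OP +

int64_t vec_length(vec_ptr v) { return v->len; }

int get_vec_element(vec_ptr v, int64_t i, data_t *val) {
    if (i < 0 || i >= v->len) return 0;
    *val = v->data[i];
    return 1;
}

void combine1(vec_ptr v, data_t *dest) {
    *dest = IDENT;
    for (int64_t i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```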
We’ll Measure Cycles per Element (CPE) for Performance
The system I’ll use today is an AMD Ryzen 1800X (it’s a bit old)
Without Changing Anything, Let’s Add Optimizations
Aside: Using perf
The Compiler Did Not Lift vec_length Out of the Loop
Since it’s in a different compilation unit, the compiler has to assume it has side effects
Side effects may include reading or writing global state that may change between calls
Also, without knowing the final link step, even if it knew the implementation, the symbol could be replaced at link time, so the compiler can’t safely optimize across the call
Let’s Manually Do LICM, Since We Know It’s Safe
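Hoisting the `vec_length` call out of the loop by hand might look like this (a sketch; helper names are assumptions):

```c
/* Manual LICM: vec_length is called once, before the loop.
 * Minimal definitions repeated so the sketch compiles standalone. */
#include <stdint.h>

typedef long data_t;
typedef struct { int64_t len; data_t *data; } vec, *vec_ptr;

#define IDENT 0
#define OP +

int64_t vec_length(vec_ptr v) { return v->len; }

int get_vec_element(vec_ptr v, int64_t i, data_t *val) {
    if (i < 0 || i >= v->len) return 0;
    *val = v->data[i];
    return 1;
}

void combine2(vec_ptr v, data_t *dest) {
    int64_t length = vec_length(v); /* hoisted: loop-invariant */
    *dest = IDENT;
    for (int64_t i = 0; i < length; i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```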
We Can Manually Inline get_vec_element
We’ll change our code around and get rid of the bounds check
void combine3(vec_ptr v, data_t *dest) {
    int64_t length = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = IDENT;
    for (int64_t i = 0; i < length; i++) {
        *dest = *dest OP data[i];
    }
}
We Can Try Removing the Memory Read in the Loop
We Actually Don’t Need to Write Every Time in the Loop
Since we only have one thread, we know it’s not possible for anyone else to observe *dest between iterations, so we can accumulate in a local variable and write the result once at the end
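Accumulating in a local variable and writing `*dest` once at the end might look like this (a sketch; the name `combine4` matches the results table, the rest is assumed):

```c
/* Local accumulator: the loop reads/writes a register instead of
 * going through *dest in memory on every iteration. */
#include <stdint.h>

typedef long data_t;
typedef struct { int64_t len; data_t *data; } vec, *vec_ptr;

#define IDENT 0
#define OP +

void combine4(vec_ptr v, data_t *dest) {
    int64_t length = v->len;
    data_t *data = v->data;   /* direct access, no bounds check */
    data_t acc = IDENT;       /* accumulator stays in a register */
    for (int64_t i = 0; i < length; i++)
        acc = acc OP data[i];
    *dest = acc;              /* single memory write at the end */
}
```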
If We Think Array Indexing is Slow, We Can Try Pointers
Turns out our compiler already optimized this for us; the CPE is still 1.47
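A pointer-based variant (presumably what `combine4p` in the results table looks like; a sketch) replaces indexing with pointer increments:

```c
/* Pointer traversal instead of array indexing. */
#include <stdint.h>

typedef long data_t;
typedef struct { int64_t len; data_t *data; } vec, *vec_ptr;

#define IDENT 0
#define OP +

void combine4p(vec_ptr v, data_t *dest) {
    data_t *p = v->data;
    data_t *end = p + v->len;  /* one-past-the-end pointer */
    data_t acc = IDENT;
    while (p < end)
        acc = acc OP *p++;     /* increment instead of indexing */
    *dest = acc;
}
```

As the slide notes, the compiler already generates essentially the same code for both forms, so the CPE doesn't change.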
Summary of Results so Far
Benchmark CPE
combine1g 20.22
combine1 10.00
combine2 7.00
combine3 1.68
combine3w 1.68
combine4 1.47
combine4p 1.47
Don’t Overuse Pointers in C/C++
It’s very difficult for the compiler to reason about raw pointers, since any two pointers may alias
What About Trying Loop Unrolling?
We can use the compiler flag -funroll-loops without changing the code
Looking at the Assembly
Otherwise, you can get the compiler to output assembly using the -S flag
For x86 assembly you may want to also add -masm=intel
Automatic Loop Unrolling Every 8 Elements
Assumed CPU Capabilities (what can issue each cycle)
1 load
1 store
1 FP addition
1 FP multiplication or division
Instructions Take > 1 Cycle, but Can Be Pipelined
Instruction                Latency  Cycles/Issue
Load / Store               3        1
Integer Multiply           4        1
Double/Single FP Multiply  5        2
Double/Single FP Add       3        1
Integer Divide             36       36
Double/Single FP Divide    38       38
Since We’ve Unrolled, It’s Easier to Overlap Operations
[Pipeline diagram: a 3-cycle Load of data[i] into a temporary (t.1, in edx) overlaps with operations on earlier elements]
What About Manual Loop Unrolling?
This isn’t any better than what the compiler did; our CPE for this is 1.05
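One plausible shape of a manual unroll, here by 2 with a cleanup loop for leftover elements (a sketch with a hypothetical name; the lecture's exact unroll factor isn't shown on this slide):

```c
/* Manual 2x loop unroll with a single accumulator. */
#include <stdint.h>

typedef long data_t;
typedef struct { int64_t len; data_t *data; } vec, *vec_ptr;

#define IDENT 0
#define OP +

void combine_unroll2(vec_ptr v, data_t *dest) { /* hypothetical name */
    int64_t length = v->len;
    data_t *data = v->data;
    data_t acc = IDENT;
    int64_t i;
    for (i = 0; i + 1 < length; i += 2)          /* two elements per iteration */
        acc = (acc OP data[i]) OP data[i + 1];
    for (; i < length; i++)                      /* cleanup for odd lengths */
        acc = acc OP data[i];
    *dest = acc;
}
```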
Surely 4 Times is Better!
Why Was 4 Times Better?
It Turns Out Unrolling 8 Times is the Best
The compiler unrolled it a bit further for us and used more SSE2 registers
Maybe It’s Better to Change the Order of Operations?
Turns out it won’t be better than unrolling 8 or 16 times; our CPE is 0.80
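Reassociating so the two loads combine with each other before touching the accumulator shortens the dependence chain through the accumulator (a sketch with a hypothetical name):

```c
/* Reassociation: data[i] OP data[i+1] is computed first and is
 * independent of acc, so it can overlap with the previous update. */
#include <stdint.h>

typedef long data_t;
typedef struct { int64_t len; data_t *data; } vec, *vec_ptr;

#define IDENT 0
#define OP +

void combine_reassoc(vec_ptr v, data_t *dest) { /* hypothetical name */
    int64_t length = v->len;
    data_t *data = v->data;
    data_t acc = IDENT;
    int64_t i;
    for (i = 0; i + 1 < length; i += 2)
        acc = acc OP (data[i] OP data[i + 1]);  /* reassociated */
    for (; i < length; i++)                     /* cleanup */
        acc = acc OP data[i];
    *dest = acc;
}
```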
Changing the Order of Operations is Better
(if we use too many accumulators they’ll spill onto the stack and slow us down)
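Using two independent accumulators creates two parallel dependence chains; more accumulators help until they no longer fit in registers and spill (a sketch with a hypothetical name):

```c
/* Two accumulators: even and odd elements form independent chains
 * that the CPU can execute in parallel. */
#include <stdint.h>

typedef long data_t;
typedef struct { int64_t len; data_t *data; } vec, *vec_ptr;

#define IDENT 0
#define OP +

void combine_2acc(vec_ptr v, data_t *dest) { /* hypothetical name */
    int64_t length = v->len;
    data_t *data = v->data;
    data_t acc0 = IDENT, acc1 = IDENT; /* two independent chains */
    int64_t i;
    for (i = 0; i + 1 < length; i += 2) {
        acc0 = acc0 OP data[i];
        acc1 = acc1 OP data[i + 1];
    }
    for (; i < length; i++)            /* cleanup */
        acc0 = acc0 OP data[i];
    *dest = acc0 OP acc1;              /* merge the chains */
}
```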
Summary of Loop Unrolling
Benchmark CPE
combine4u 1.05
combine5u3 1.05
combine5u4 0.61
combine5u5 1.05
combine5u8 0.51
combine5u16 0.50
combine6 0.80
Some of Your Biggest Optimizations Come
Takeaways