Why Parallelism?
Why Efficiency?
Parallel Computing
Stanford CS149, Fall 2023
Hello!
▪ Parallel thinking
1. Decomposing work into pieces that can safely be performed in parallel
2. Assigning work to processors
3. Managing communication/synchronization between the processors so that it does not limit speedup
▪ Just because your program runs faster on a parallel computer, it does not mean it is using the
hardware efficiently
- Is 2x speedup on computer with 10 processors a good result?
▪ Programmer’s perspective: make use of provided machine capabilities
▪ Textbook
- There is no course textbook (the internet is plenty good these days); also see the course web site for suggested references
▪ Every two weeks we will have a take-home written assignment, graded on effort only
▪ The idea of late days is to give you the flexibility to handle almost all events that arise throughout the quarter
- Work from other classes, falling behind, most illnesses, athletic/extracurricular events…
- We expect to give extra late days only under exceptional circumstances
▪ Requests for additional late days for exceptional circumstances should be made days in advance if possible.
[Figure: processor trends by year. Image credit: Olukotun and Hammond, ACM Queue 2005]
Until ~15 years ago: two significant reasons for processor
performance improvement
#include <stdio.h>

int main() {
    int x = 1;
    printf("%d\n", x);
    return 0;
}
Fetch/Decode: determine what instruction to run next
Step 4: store result 96 back to register R0
Execute program
My very simple processor: executes one instruction per clock
[Diagram: processor with Fetch/Decode, Execution Unit (ALU), and Execution Context, stepping through the instruction stream:]
ld r0, addr[r1]
mul r1, r0, r0
mul r1, r1, r0
...
st addr[r2], r0
What is an instruction?
It describes an operation for a processor to perform.
Executing an instruction typically modifies the computer’s state.
Assume register R0 = x, R1 = y, R2 = z
[Diagram: dependency tree for x*x + y*y + z*z. The three multiplies are independent (ILP = 3); each of the two adds depends on earlier results (ILP = 1)]
In this example: instructions 1, 2, and 3 can be executed in parallel without impacting program correctness
(on a superscalar processor that detects the lack of dependencies)
But instruction 4 must be executed after instructions 1 and 2
And instruction 5 must be executed after instruction 4
* Or the compiler finds independent instructions at compile time and explicitly encodes dependencies in the compiled binary.
Superscalar processor
This processor can decode and execute up to two instructions per clock
[Diagram: two Fetch/Decode units (1 and 2) and two execution units (1 and 2) sharing a single execution context]
[Figure: speedup vs. instruction issue capability of processor (instructions/clock), 0 to 16. Source: Culler & Singh (data from Johnson 1991)]
ILP tapped out + end of frequency scaling
[Figure: trends over time for transistor density, clock frequency, power, and instruction-level parallelism (ILP). Image credit: “The Free Lunch Is Over” by Herb Sutter, Dr. Dobb's 2005]
The “power wall”
Power consumed by a transistor:
Dynamic power ∝ capacitive load × voltage² × frequency
Static power: transistors burn power even when inactive due to leakage
Image credit: “Idontcare”, posted at: http://forums.anandtech.com/showthread.php?t=2281195
Single-core performance scaling
The rate of single-instruction stream performance
scaling has decreased (almost to zero)
Image credit: “The Free Lunch Is Over” by Herb Sutter, Dr. Dobb's 2005
Example: multi-core CPU
Intel “Comet Lake” 10th Generation Core i9 10-core CPU (2020): ~32-40x faster than a Raspberry Pi 3 (quad-core ARM A53 CPU)
▪ We will discuss how modern systems not only use many processing units, but also utilize specialized processing units to achieve high levels of power efficiency
[Diagram: processor (Fetch/Decode, ALU/Execution Unit, Execution Context) connected to memory, addresses 0x0 through 0x1F]
ld R0 ← mem[R2]
“Please load the four-byte value in memory starting from the address stored by register R2 and put this value into register R0.”
Execution context:      Memory:
R0: 96                  0xff68107c: 1024
R1: 64                  0xff681080: 42
R2: 0xff681080          0xff681084: 32
R3: 0x80486412          0xff681088: 0
[Diagram: processor issues a data request to memory; latency ~ 2 sec]
What are caches?
▪ A cache is a hardware implementation detail that does not impact the output of a program, only its performance
▪ Cache is on-chip storage that maintains a copy of a subset of the values in memory
▪ If an address is stored “in the cache” the processor can load/store to this address more quickly than if the data resides only in DRAM
▪ Caches operate at the granularity of “cache lines”. In the figure, the cache has a capacity of 2 lines.
[Diagram: implementation of the memory abstraction as an Address/Value table in front of memory]
Cache example 1
Array of 16 bytes in memory (values include 0x0: 16, 0x6: 6, 0x8: 32)
Assume: least recently used (LRU) replacement policy

Address accessed | Cache action           | Cache state (after load is complete)
0x0              | “cold miss”, load 0x0  | 0x0
0x1              | hit                    | 0x0
0x2              | hit                    | 0x0
...
0x1              | hit                    | 0x0, 0x4

Spatial locality: accesses to nearby addresses fall in the same line, leading to cache hits.
Temporal locality: repeated accesses to the same address result in hits.
Cache example 2
Array of 16 bytes in memory (values include 0x0: 16, 0x8: 32, 0x9: 48, 0xA: 255, 0xB: 255, 0xC: 255)
Assume: least recently used (LRU) replacement policy

Address accessed | Cache action                      | Cache state (after load is complete)
0x0              | “cold miss”, load 0x0             | 0x0
0x1              | hit                               | 0x0
0x2              | hit                               | 0x0
...
0x7              | hit                               | 0x0, 0x4
0x8              | “cold miss”, load 0x8 (evict 0x0) | 0x8, 0x4
0x9              | hit                               | 0x8, 0x4
0xA              | hit                               | 0x8, 0x4
0xB              | hit                               | 0x8, 0x4
0xC              | “cold miss”, load 0xC (evict 0x4) | 0x8, 0xC
Processor → L1 cache (32 KB) → L2 cache (256 KB) → L3 cache (20 MB) → DRAM (64 GB)
▪ I suspect you will find that modern computers have tremendously more processing power
than you might realize, if you just use it efficiently!