
Lecture 1:

Why Parallelism?
Why Efficiency?
Parallel Computing
Stanford CS149, Fall 2023
Hello!

Course staff: Prof. Kayvon, Prof. Olukotun
TAs: James, Minfei, Yasmine, Senyang, Zhenbang, Neha, Michael, Jensen, Shiv, Tom
One common definition

A parallel computer is a collection of processing elements that cooperate to solve problems quickly.

We care about performance and we care about efficiency. We're going to use multiple processors to get it.


DEMO 1
(CS149 Fall 2023’s first parallel program)



Speedup

One major motivation for using parallel processing: achieving a speedup.

For a given problem:

    speedup(P processors) = execution time (using 1 processor) / execution time (using P processors)
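An illustrative example (numbers mine, not from the slide): if one processor solves a problem in 10 seconds and four processors solve it in 4 seconds, then speedup(4 processors) = 10 / 4 = 2.5, short of the ideal 4.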


Class observations from demo 1
▪ Communication limited the maximum speedup achieved
  - In the demo, the communication was telling each other the partial sums (see the sketch below)
▪ Minimizing the cost of communication improved speedup
  - Moved students ("processors") closer together (or let them shout)
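A minimal C sketch of the demo's partial-sums pattern (illustrative code, not the actual demo), assuming pthreads and a made-up array of N ones: each "processor" is a thread that sums its own chunk, and the only communication is handing the partial sum back to the main thread.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1024   /* problem size (assumed) */
    #define P 4      /* number of "processors" */

    static int data[N];
    static long partial[P];

    static void* worker(void* arg) {
        long id = (long)arg;                 // which "processor" am I?
        long sum = 0;
        for (long i = id * (N / P); i < (id + 1) * (N / P); i++)
            sum += data[i];                  // local work: no communication
        partial[id] = sum;                   // communicate one partial sum
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) data[i] = 1;
        pthread_t t[P];
        for (long i = 0; i < P; i++)
            pthread_create(&t[i], NULL, worker, (void*)i);
        long total = 0;
        for (int i = 0; i < P; i++) {
            pthread_join(t[i], NULL);
            total += partial[i];             // main thread combines the sums
        }
        printf("sum = %ld\n", total);
        return 0;
    }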


DEMO 2
(scaling up to four “processors”)



Class observations from demo 2
▪ Imbalance in work assignment limited speedup
  - Some students ("processors") ran out of work to do (went idle), while others were still working on their assigned task
▪ Improving the distribution of work improved speedup (one approach is sketched below)
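One common remedy, sketched below (my illustration, not the demo's code): dynamic assignment, where each thread repeatedly grabs the next unit of work from a shared counter instead of receiving a fixed chunk up front, so no thread sits idle while work remains. Here do_task is a hypothetical per-task work function.

    #include <stdatomic.h>

    #define NUM_TASKS 1000

    extern void do_task(int t);        // hypothetical unit of work
    static atomic_int next_task;       // counter shared by all worker threads

    void worker(void) {
        for (;;) {
            int t = atomic_fetch_add(&next_task, 1);  // claim the next task
            if (t >= NUM_TASKS)
                break;                 // nothing left: this worker stops
            do_task(t);                // fast and slow tasks balance out
        }
    }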


DEMO 3
(massively parallel execution)



Class observations from demo 3
▪ The problem I just gave you has a significant amount of communication compared to computation
▪ Communication costs can dominate a parallel computation, severely limiting speedup


Course theme 1:
Designing and writing parallel programs ... that scale!

▪ Parallel thinking
  1. Decomposing work into pieces that can safely be performed in parallel
  2. Assigning work to processors
  3. Managing communication/synchronization between the processors so that it does not limit speedup
▪ Abstractions/mechanisms for performing the above tasks
  - Writing code in popular parallel programming languages


Course theme 2:
Parallel computer hardware implementation: how parallel computers work

▪ Mechanisms used to implement abstractions efficiently
  - Performance characteristics of implementations
  - Design trade-offs: performance vs. convenience vs. cost
▪ Why do I need to know about hardware?
  - Because the characteristics of the machine really matter (recall the speed-of-communication issues in the earlier demos)
  - Because you care about efficiency and performance (you are writing parallel programs, after all!)


Course theme 3:
Thinking about efficiency

▪ FAST != EFFICIENT
▪ Just because your program runs faster on a parallel computer does not mean it is using the hardware efficiently
  - Is 2x speedup on a computer with 10 processors a good result? (see the note below)
▪ Programmer's perspective: make use of provided machine capabilities
▪ HW designer's perspective: choosing the right capabilities to put in the system (performance/cost, where cost = silicon area? power? etc.)
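A standard way to quantify this (the definition is conventional, not stated on the slide): efficiency = speedup / processor count. A 2x speedup on 10 processors is only 20% efficiency, while the same 2x speedup on 2 processors would be 100% efficiency.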


Course logistics



Getting started
▪ The course web site
  - https://cs149.stanford.edu
▪ Textbook
  - There is no course textbook (the internet is plenty good these days); see the course web site for suggested references


Four programming assignments
▪ Assignment 1: ISPC programming on multi-core CPUs
▪ Assignment 2: scheduling a task graph
▪ Assignment 3: writing a renderer in CUDA on NVIDIA GPUs
▪ Assignment 4: chat149: flash-attention transformers for a mini language model
▪ Optional assignment 5: (can be used to boost a prior grade) topics TBD: programming FPGAs, multi-core graph processing

Programming assignments can (optionally) be done with a partner. We realize finding a partner can be stressful. Fill out our partner request form by Thursday 11:59pm and we will find you a partner!


Written assignments

▪ Every two weeks we will have a take-home written assignment, graded on effort only
▪ Written assignments contain modified versions of previous exam questions, so they:
  - Give you practice with key course concepts
  - Provide practice for the style of questions you will see on an exam


Commenting and contributing to lectures
The website supports commenting on a per-slide basis



Participation (comments)
▪ You are asked to submit one well-thought-out comment per lecture
  - Only two comments per week
  - We expect you to submit within the same calendar week as the lectures (no credit for submitting all comments at the end of the quarter when you are studying for the final)
▪ Why do we ask you to write?
  - Because writing is a way many good architects and systems designers force themselves to think (explaining clearly and thinking clearly are highly correlated!)
▪ But take it seriously; there is a participation component to the final grade
What we are looking for in comments
▪ Try to explain the slide (as if you were trying to teach your classmate while studying for an exam)
- “The instructor said this, but if you think about it this way instead it makes much more sense... ”
▪ Explain what is confusing to you:
- “What I’m totally confused by here was...”
▪ Challenge classmates with a question
- For example, make up a question you think might be on an exam.
▪ Provide a link to an alternate explanation
- “This site has a really good description of how multi-threading works...”
▪ Mention real-world examples
- For example, describe all the parallel hardware components in the PS5
▪ Constructively respond to another student’s comment or question
- “@segfault23, are you sure that is correct? I thought that Prof. Kayvon said…”
▪ It is OKAY (and even encouraged) to address the same topic (or repeat someone else’s summary,
explanation or idea) in your own words
- “@funkysenior23’s point is that the overhead of communication...”
Grades
58% Programming assignments (4)
8% Written assignments (5)
16% Midterm exam
- An evening in-person exam on Nov 14th
16% Final exam
- During the university-assigned slot: Dec 14th, 3:30pm
2% Asynchronous participation (website comments)



Late days
▪ You get eight late days for the quarter
  - For use on programming and written assignments
▪ The idea of late days is to give you the flexibility to handle almost all events that arise throughout the quarter
  - Work from other classes, falling behind, most illnesses, athletic/extracurricular events…
  - We expect to give extra late days only under exceptional circumstances
▪ Requests for additional late days for exceptional circumstances should be made days in advance if possible.


Why parallelism?



Some historical context: why avoid parallel processing?
▪ Single-threaded CPU performance doubling ~ every 18 months
▪ Implication: working to parallelize your code was often not worth the time
  - Software developer does nothing, code gets faster next year. Woot!

[Plot: relative CPU performance vs. year]
Image credit: Olukotun and Hammond, ACM Queue 2005
Until ~15 years ago: two significant reasons for processor performance improvement

1. Exploiting instruction-level parallelism (superscalar execution)
2. Increasing CPU clock frequency


What is a computer program?



Here is a program written in C

#include <stdio.h>

int main(int argc, char** argv) {

    int x = 1;

    for (int i = 0; i < 10; i++) {
        x = x + x;
    }

    printf("%d\n", x);

    return 0;
}


What is a program? (from a processor's perspective)

A program is just a list of processor instructions! Compiling the C code on the previous slide produces an instruction sequence like this:

_main:
100000f10: pushq %rbp
100000f11: movq %rsp, %rbp
100000f14: subq $32, %rsp
100000f18: movl $0, -4(%rbp)
100000f1f: movl %edi, -8(%rbp)
100000f22: movq %rsi, -16(%rbp)
100000f26: movl $1, -20(%rbp)
100000f2d: movl $0, -24(%rbp)
100000f34: cmpl $10, -24(%rbp)
100000f38: jge 23 <_main+0x45>
100000f3e: movl -20(%rbp), %eax
100000f41: addl -20(%rbp), %eax
100000f44: movl %eax, -20(%rbp)
100000f47: movl -24(%rbp), %eax
100000f4a: addl $1, %eax
100000f4d: movl %eax, -24(%rbp)
100000f50: jmp -33 <_main+0x24>
100000f55: leaq 58(%rip), %rdi
100000f5c: movl -20(%rbp), %esi
100000f5f: movb $0, %al
100000f61: callq 14
100000f66: xorl %esi, %esi
100000f68: movl %eax, -28(%rbp)
100000f6b: movl %esi, %eax
100000f6d: addq $32, %rsp
100000f71: popq %rbp
100000f72: retq
Kind of like the instructions in a recipe for your favorite meals. Mmm, carne asada.


What does a processor do?



A processor executes instructions

Professor Kayvon's Very Simple Processor:

Fetch/Decode: determines what instruction to run next

ALU (Execution Unit): performs the operation described by an instruction, which may modify values in the processor's registers or the computer's memory

Execution Context: registers (R0, R1, R2, R3) maintain program state: they store the values of variables used as inputs and outputs to operations


One example instruction: add two numbers

add R0 ← R0, R1
"Please add the contents of register R0 to the contents of register R1 and put the result of the addition into register R0."

Step 1: The processor gets the next program instruction from memory (figures out what the processor should do next).

Step 2: Get the operation's inputs from the registers.
  Contents of R0 input to execution unit: 32
  Contents of R1 input to execution unit: 64

Step 3: Perform the addition operation: the execution unit performs the arithmetic, and the result is 96.

Step 4: Store the result 96 back to register R0. (Register state afterward: R0: 96, R1: 64, R2: 0xff681080, R3: 0x80486412)
Execute program

My very simple processor executes one instruction per clock:

    ld  r0, addr[r1]
    mul r1, r0, r0
    mul r1, r1, r0
    ...
    st  addr[r2], r0




Review of how computers work…

What is a computer program? (from a processor's perspective)
It is a list of instructions to execute!

What is an instruction?
It describes an operation for a processor to perform. Executing an instruction typically modifies the computer's state.

What do I mean when I talk about a computer's "state"?
The values of program data, which are stored in a processor's registers or in memory.


Let's consider a very simple piece of code

a = x*x + y*y + z*z

Consider the following five-instruction program. Assume registers R0 = x, R1 = y, R2 = z.

1  mul R0, R0, R0
2  mul R1, R1, R1
3  mul R2, R2, R2
4  add R0, R0, R1
5  add R3, R0, R2

R3 now stores the value of program variable 'a'.

This program has five instructions, so it will take five clocks to execute, correct? Can we do better?


What if up to two instructions can be performed at once?

a = x*x + y*y + z*z
Assume registers R0 = x, R1 = y, R2 = z

1  mul R0, R0, R0
2  mul R1, R1, R1
3  mul R2, R2, R2
4  add R0, R0, R1
5  add R3, R0, R2

Two-processor schedule:

Time (clocks)   Processor 1          Processor 2
1               1. mul R0, R0, R0    2. mul R1, R1, R1
2               3. mul R2, R2, R2    4. add R0, R0, R1
3               5. add R3, R0, R2

R3 now stores the value of program variable 'a'. The program completes in three clocks.

What does it mean for our parallel schedule to "respect program order"?


What about three instructions at once?

a = x*x + y*y + z*z
Assume registers R0 = x, R1 = y, R2 = z

Three-processor schedule:

Time (clocks)   Processor 1          Processor 2          Processor 3
1               1. mul R0, R0, R0    2. mul R1, R1, R1    3. mul R2, R2, R2
2               4. add R0, R0, R1
3               5. add R3, R0, R2

R3 now stores the value of program variable 'a'.


Instruction level parallelism (ILP) example

a = x*x + y*y + z*z

[Dependency tree: the three multiplies x*x, y*y, and z*z sit at the top and are independent (ILP = 3); the add of the first two products comes next (ILP = 1); the final add producing 'a' comes last (ILP = 1).]
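The same dependency structure, written in C with explicit temporaries (an illustrative sketch; x, y, and z are assumed to be floats already in scope):

    // The three multiplies are mutually independent: ILP = 3 at this step.
    float t0 = x * x;
    float t1 = y * y;
    float t2 = z * z;
    // The adds form a chain: ILP = 1 at each of the remaining steps.
    float t3 = t0 + t1;   // must wait for t0 and t1
    float a  = t3 + t2;   // must wait for t3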


Superscalar processor execution

a = x*x + y*y + z*z
Assume registers R0 = x, R1 = y, R2 = z

1  mul R0, R0, R0
2  mul R1, R1, R1
3  mul R2, R2, R2
4  add R0, R0, R1
5  add R3, R0, R2

Idea #1:
Superscalar execution: the processor automatically finds* independent instructions in an instruction sequence and executes them in parallel on multiple execution units!

In this example: instructions 1, 2, and 3 can be executed in parallel without impacting program correctness (on a superscalar processor that determines that no dependencies exist between them). But instruction 4 must be executed after instructions 1 and 2, and instruction 5 must be executed after instruction 4.

* Or the compiler finds independent instructions at compile time and explicitly encodes dependencies in the compiled binary.
Superscalar processor

This processor can decode and execute up to two instructions per clock.

[Diagram: out-of-order control logic feeding two Fetch/Decode units, two execution units (Exec 1, Exec 2), and a shared execution context]


Aside: Old Intel Pentium 4 CPU

Image credit: http://ixbtlabs.com/articles/pentium4/index.html


A more complex example

Program (a sequence of instructions; comments show each value during execution):

PC  Instruction
00  a = 2
01  b = 4
02  tmp2 = a + b        // 6
03  tmp3 = tmp2 + a     // 8
04  tmp4 = b + b        // 8
05  tmp5 = b * b        // 16
06  tmp6 = tmp2 + tmp4  // 14
07  tmp7 = tmp5 + tmp6  // 30
08  if (tmp3 > 7)
09      print tmp3
    else
10      print tmp7

[Instruction dependency graph: 00 and 01 feed 02, 04, and 05; 02 feeds 03 and 06; 04 feeds 06; 05 and 06 feed 07; 03 feeds 08; 08 branches to 09 or 10.]


Diminishing returns of superscalar execution

Most available ILP is exploited by a processor capable of issuing four instructions per clock. (There is little performance benefit from building a processor that can issue more.)

[Plot: speedup vs. instruction issue capability of the processor (0-16 instructions/clock); the curve flattens beyond about four instructions per clock.]
Source: Culler & Singh (data from Johnson 1991)
ILP tapped out + end of frequency scaling

▪ Processor clock rate stops increasing
▪ No further benefit from ILP

[Plot of transistor density, clock frequency, power, and instruction-level parallelism (ILP) over time]
Image credit: "The Free Lunch is Over" by Herb Sutter, Dr. Dobb's 2005
The "power wall"

Power consumed by a transistor:
- Dynamic power ∝ capacitive load × voltage² × frequency
- Static power: transistors burn power even when inactive, due to leakage

High power = high heat. Power is a critical design constraint in modern processors.

TDP
Apple M1 (laptop):                    13 W
Intel Core i9 10900K (desktop CPU):   95 W
NVIDIA RTX 4090 GPU:                  450 W
Mobile phone processor:               1/2 - 2 W
World's fastest supercomputer:        megawatts
Standard microwave oven (for scale):  900 W

Source: Intel, NVIDIA, Wikipedia, Top500.org


Power draw as a function of clock frequency

Dynamic power ∝ capacitive load × voltage² × frequency
Static power: transistors burn power even when inactive, due to leakage
The maximum allowed frequency is determined by the processor's core voltage.

Image credit: "Idontcare", posted at http://forums.anandtech.com/showthread.php?t=2281195
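An illustrative back-of-the-envelope calculation (assumptions mine, not from the slide): since the maximum frequency scales roughly with core voltage, dynamic power grows roughly as frequency cubed in this regime. Lowering frequency and voltage by 20% each cuts dynamic power to about 0.8 × 0.8² ≈ 0.51 of the original, i.e., roughly half the power for a 20% drop in clock rate.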
Single-core performance scaling

The rate of single-instruction-stream performance scaling has decreased (almost to zero):
1. Frequency scaling limited by power
2. ILP scaling tapped out

Architects are now building faster processors by adding more execution units that run in parallel (or units that are specialized for a specific task: like graphics, or audio/video playback).

Software must be written to be parallel to see performance gains. No more free lunch for software developers!

[Plot of transistor density, clock frequency, power, and ILP over time]
Image credit: "The Free Lunch is Over" by Herb Sutter, Dr. Dobb's 2005
Example: multi-core CPU

Intel "Comet Lake" 10th generation Core i9 10-core CPU (2020)

[Die photo: Core 1 through Core 10]


One thing you will learn in this course
▪ How to write code that efficiently uses the resources in a modern multi-core CPU
▪ Example: assignment 1 (coming up!)
  - Running on a quad-core Intel CPU
  - Four CPU cores
  - AVX SIMD vector instructions + hyper-threading (we'll talk about these terms next time!)
  - Baseline: single-threaded C program compiled with -O3
  - Parallelized program that uses all parallel execution resources on this CPU… ~32-40x faster!


AMD Ryzen Threadripper 3990X
64 cores, 4.3 GHz

Eight 8-core chiplets


NVIDIA AD102 GPU
GeForce RTX 4090 (2022)
76 billion transistors

18,432 fp32 multipliers organized in 144 processing blocks (called SMs)


GPU-accelerated supercomputing

Frontier (at Oak Ridge National Lab), world's #1 in Fall 2022:
- 9472 x 64-core AMD CPUs (606,208 CPU cores)
- 37,888 Radeon GPUs
- 21 megawatts
Mobile parallel processing

Power constraints also heavily influence the design of mobile systems.

Apple A15 Bionic (in iPhone 13, 14):
- 15 billion transistors
- 6-core CPU: 2 "big" CPU cores + 4 "small" CPU cores
- Multi-core GPU (5 GPU blocks)

Image credit: TechInsights Inc.


Mobile parallel processing

Raspberry Pi 3
Quad-core ARM A53 CPU



But in modern computing
software must be more than just parallel…

IT MUST ALSO BE EFFICIENT



Parallel + specialized HW
▪ Achieving high efficiency will be a key theme in this class

▪ We will discuss how modern systems not only use many processing units, but also
utilize specialized processing units to achieve high levels of power efficiency



Specialized processing is ubiquitous in mobile systems

Apple A15 Bionic (in iPhone 13, 14):
- 15 billion transistors
- 6-core CPU: 2 "big" CPU cores + 4 "small" CPU cores
- Apple-designed multi-core GPU
- Neural Engine (NPU) for DNN acceleration + image/video encode/decode processor + motion (sensor) processor

Image credit: TechInsights Inc.


Specialization for datacenter-scale applications

Google TPU pods
TPU = Tensor Processing Unit: a specialized processor for ML computations

Image credit: TechInsights Inc.
Specialized hardware to accelerate DNN inference/training

Huawei Kirin NPU, Google TPU3, GraphCore IPU, Apple Neural Engine, Intel Deep Learning Inference Accelerator, SambaNova Cardinal SN10, NVIDIA Ampere GPU with Tensor Cores, Cerebras Wafer Scale Engine
Achieving efficient processing almost always comes down to accessing data efficiently.


What is memory?

[Figure: a processor connected to memory]


A program's memory address space

▪ A computer's memory is organized as an array of bytes
▪ Each byte is identified by its "address" in memory (its position in this array)
  (We'll assume memory is byte-addressable)

"The byte stored at address 0x8 has the value 32."
"The byte stored at address 0x10 (16) has the value 128."

In the illustration below, the program's memory address space is 32 bytes in size (so valid addresses range from 0x0 to 0x1F).

Address  Value
0x0      16
0x1      255
0x2      14
0x3      0
0x4      0
0x5      0
0x6      6
0x7      0
0x8      32
0x9      48
0xA      255
0xB      255
0xC      255
0xD      0
0xE      0
0xF      0
0x10     128
...
0x1F     0
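A tiny C example (mine, not from the lecture) that makes the "array of bytes" view concrete by printing the address and value of every byte in a small array:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint8_t mem[8] = {16, 255, 14, 0, 0, 0, 6, 0};
        for (int i = 0; i < 8; i++)
            printf("byte at address %p holds value %d\n",
                   (void*)&mem[i], mem[i]);
        return 0;
    }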


Load: an instruction for accessing the contents of memory

ld R0 ← mem[R2]
"Please load the four-byte value in memory starting from the address stored by register R2 and put this value into register R0."

Register state before the load: R0: 96, R1: 64, R2: 0xff681080, R3: 0x80486412

Memory contents:
...
0xff68107c: 1024
0xff681080: 42
0xff681084: 32
0xff681088: 0
...

(Since R2 holds 0xff681080 and the value stored at that address is 42, the load places 42 in R0.)


Terminology
▪ Memory access latency
  - The amount of time it takes the memory system to provide data to the processor
  - Example: 100 clock cycles, or 100 nsec

[Figure: a data request traveling from the processor to memory; latency ~ 2 sec in the illustration]


Stalls
▪ A processor "stalls" (can't make progress) when it cannot run the next instruction in an instruction stream, because future instructions depend on a previous instruction that is not yet complete.
▪ Accessing memory is a major source of stalls:

    ld  r0, mem[r2]
    ld  r1, mem[r3]
    add r0, r0, r1   // dependency: the 'add' cannot execute until the data at
                     // mem[r2] and mem[r3] has been loaded from memory

▪ Memory access times: ~100s of cycles
  - Memory "access time" is a measure of latency
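A hypothetical C sketch (mine, not from the lecture) contrasting a load chain that stalls with loads whose latencies can overlap:

    // Dependent chain: each load's address comes from the previous load's
    // result, so the processor must stall on every access (pointer chasing).
    long chase(const long* next, long start, int steps) {
        long i = start;
        for (int s = 0; s < steps; s++)
            i = next[i];
        return i;
    }

    // Independent accesses: every address is known in advance, so many loads
    // can be in flight at once and their latencies overlap.
    long sum_array(const long* a, int n) {
        long total = 0;
        for (int i = 0; i < n; i++)
            total += a[i];
        return total;
    }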


What are caches?
▪ Recall memory is just an array of values
▪ And a processor has instructions for moving data from memory into registers (load) and storing data from registers into memory (store)

[Figure: processor (Fetch/Decode, ALU, execution context) connected to the 32-byte memory from the earlier address-space figure]
What are caches?
▪ A cache is a hardware implementation detail that does not impact the output of a program, only its performance
▪ A cache is on-chip storage that maintains a copy of a subset of the values in memory
▪ If an address is stored "in the cache," the processor can load/store to this address more quickly than if the data resides only in DRAM
▪ Caches operate at the granularity of "cache lines." In the figure, the cache:
  - Has a capacity of 2 lines
  - Each line holds 4 bytes of data

[Figure, an implementation of the memory abstraction: processor with a data cache in front of DRAM. Cache line 0 holds the line at address 0x4 (values 0, 0, 6, 0); line 1 holds the line at address 0xC (values 255, 0, 0, 0).]
Cache example 1

Array of 16 bytes in memory (addresses 0x0-0xF, values as in the earlier figure).

Assume:
- Total cache capacity of 8 bytes
- Cache with 4-byte cache lines (so 2 lines fit in cache)
- Least recently used (LRU) replacement policy

Access sequence, in time order:

Address accessed   Cache action            Cache state (after load completes)
0x0                "cold miss", load 0x0   line 0x0
0x1                hit                     line 0x0
0x2                hit                     line 0x0
0x3                hit                     line 0x0
0x2                hit                     line 0x0
0x1                hit                     line 0x0
0x4                "cold miss", load 0x4   lines 0x0, 0x4
0x1                hit                     lines 0x0, 0x4

There are two forms of "data locality" in this sequence:

Spatial locality: loading data in a cache line "preloads" the data needed for subsequent accesses to different addresses in the same line, leading to cache hits.

Temporal locality: repeated accesses to the same address result in hits.
Cache example 2

Same assumptions: 16-byte array in memory, 8-byte cache with 4-byte lines (2 lines fit in cache), LRU replacement policy.

Access sequence, in time order:

Address accessed   Cache action                            Cache state (after load completes)
0x0                "cold miss", load 0x0                   line 0x0
0x1                hit                                     line 0x0
0x2                hit                                     line 0x0
0x3                hit                                     line 0x0
0x4                "cold miss", load 0x4                   lines 0x0, 0x4
0x5                hit                                     lines 0x0, 0x4
0x6                hit                                     lines 0x0, 0x4
0x7                hit                                     lines 0x0, 0x4
0x8                "cold miss", load 0x8 (evict 0x0)       lines 0x8, 0x4
0x9                hit                                     lines 0x8, 0x4
0xA                hit                                     lines 0x8, 0x4
0xB                hit                                     lines 0x8, 0x4
0xC                "cold miss", load 0xC (evict 0x4)       lines 0x8, 0xC
0xD                hit                                     lines 0x8, 0xC
0xE                hit                                     lines 0x8, 0xC
0xF                hit                                     lines 0x8, 0xC
0x0                "capacity miss", load 0x0 (evict 0x8)   lines 0x0, 0xC
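A classic C illustration of these two ideas (my example, not from the lecture): summing a matrix row by row touches consecutive addresses and uses every byte of each cache line fetched, while summing column by column jumps far apart in memory on every access.

    #define N 1024
    static float a[N][N];

    // Row-major traversal: consecutive accesses touch consecutive addresses,
    // so each fetched cache line is fully used (good spatial locality).
    float sum_by_rows(void) {
        float s = 0.0f;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    // Column-major traversal: consecutive accesses are N * sizeof(float)
    // bytes apart, so nearly every access touches a different cache line.
    float sum_by_cols(void) {
        float s = 0.0f;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }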
Caches reduce the length of stalls (reduce memory access latency)
▪ Processors run efficiently when they access data that is resident in caches
▪ Caches reduce memory access latency when processors access data that they have recently accessed! *

* Caches also provide high-bandwidth data transfer
The implementation of the linear memory address space abstraction on a modern computer is complex

The instruction "load the value stored at address X into register R0" might involve a complex sequence of operations by multiple data caches and access to DRAM.

[Figure: processor → L1 cache (32 KB) → L2 cache (256 KB) → L3 cache (20 MB) → DRAM (64 GB)]

Common organization: a hierarchy of caches: level 1 (L1), level 2 (L2), level 3 (L3)
- Smaller-capacity caches near the processor → lower latency
- Larger-capacity caches farther away → higher latency
Data access times (Kaby Lake CPU)

                           Latency (number of cycles at 4 GHz)
Data in L1 cache           4
Data in L2 cache           12
Data in L3 cache           38
Data in DRAM (best case)   ~248
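A back-of-the-envelope illustration using the conventional average-memory-access-time formula (the formula and the hit rate are my additions, not the slide's): AMAT = hit time + miss rate × miss penalty. If 95% of loads hit in the 4-cycle L1 and the remaining 5% go all the way to DRAM (~248 cycles), the average load still costs about 4 + 0.05 × 248 ≈ 16 cycles, four times the L1 hit latency.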


Summary
▪ Today, single-thread-of-control performance is improving very slowly
  - To run programs significantly faster, programs must utilize multiple processing elements or specialized processing hardware
  - Which means you need to know how to reason about and write parallel and efficient code
▪ Writing parallel programs can be challenging
  - Requires problem partitioning, communication, synchronization
  - Knowledge of machine characteristics is important
  - In particular, understanding data movement!
▪ I suspect you will find that modern computers have tremendously more processing power than you might realize, if you just use it efficiently!


Welcome to CS149!
▪ Get signed up on the website
▪ Find yourself a partner! (remember, we can help you)

Course staff: Prof. Kayvon, Prof. Olukotun
TAs: James, Minfei, Yasmine, Senyang, Zhenbang, Neha, Michael, Jensen, Shiv, Tom
