
Coding for Cache Optimization

ECE 565
Performance Optimization & Parallelism
Duke University, Fall 2024
Motivation

• Memory Wall
– CPU speed and memory speed have grown at disparate rates
• CPU frequencies are much faster than memory frequencies
• Memory access takes many CPU cycles
– Hundreds, in fact!
– The latency of a load from memory will be in the 60-80ns range

• Cache hierarchy
– Caches are an integral part of current processor designs
• To reduce the impact of long memory latencies
– Cache hierarchies are often multiple levels today
• L1, L2, L3, sometimes L4
• Levels are larger and slower further down the hierarchy

ECE 565 – Fall 2022 2


Motivation (2)

• Cache hierarchy works on principle of locality


– Temporal locality – recently referenced memory addresses are
likely to be referenced again soon
– Spatial locality – memory addresses near recently referenced
memory addresses are likely to be referenced soon
• Locality allows a processor to retrieve a large portion of
memory references from the cache hierarchy
• Thus the programs we write should exhibit good locality!
– Good news – this is typically true of well-written programs

ECE 565 – Fall 2022 3


Locality Example

• Similar to the loop scenarios we’ve discussed –

int A[N];
int i, sum = 0;
for (i = 0; i < N; i++) {
  sum = sum + A[i];
}

• Data locality for sum and elements of array A[]


– Temporal locality for sum, spatial locality for A[]
• Code locality for program instructions
– Loops create significant inherent code temporal locality
– Sequential streams of instructions create spatial locality

ECE 565 – Fall 2022 4


Single Slide Cache Reminder

(Diagram: Core → L1 $ → L2 Cache → L3 Cache → Memory; cache blocks, e.g. 64B, move between the levels.)

uint64_t A[N];
for (int i = 0; i < N; i++) {
  A[i] = seed;
}

• What happens when the CPU loads A[0] for the first time?
– Assume 32B cache blocks
– Fetch the block holding A[0]–A[3] (4 x 8B values)
• What happens if the CPU next loads A[5]?
– Fetch A[4]–A[7]; cache blocks are aligned

ECE 565 – Fall 2022 5
Important Cache Performance Metrics

• Miss Ratio
– Ratio of cache misses to total cache references
– Typically less than 10% for L1 cache, < 1% for an L2 cache
• Hit Time
– Time to deliver a line in the cache to the processor
– 2-3 CPU cycles for L1, 15-20 cycles for L2, ~40 cycles for L3
– 60-80ns for main memory (hundreds of cycles)
– Related concept is “load-to-use” time
• # of CPU cycles from the execution of a load instruction until
execution of an instruction that depends on the load value

• Miss Penalty
– Time required to access a line from the next level of the hierarchy
• Average access time = hit time + (miss rate * miss penalty)
ECE 565 – Fall 2022 6
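As a small illustration of the average-access-time formula above, here is a minimal sketch; the hit times and (local) miss ratios are assumed example values, not measurements of any particular machine:

#include <stdio.h>

/* Average memory access time (AMAT), applied recursively per level:
 * AMAT = hit_time + miss_ratio * (time to service the miss one level down).
 * All numbers below are illustrative assumptions. */
int main(void) {
  double l1_hit = 3.0,  l1_miss = 0.05;  /* ~3 cycles,  5% local miss ratio  */
  double l2_hit = 18.0, l2_miss = 0.20;  /* ~18 cycles, 20% local miss ratio */
  double l3_hit = 40.0, l3_miss = 0.30;  /* ~40 cycles, 30% local miss ratio */
  double mem    = 200.0;                 /* ~200 cycles to main memory       */

  double amat = l1_hit + l1_miss * (l2_hit + l2_miss * (l3_hit + l3_miss * mem));
  printf("average access time: %.2f cycles\n", amat);
  return 0;
}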
Cache Friendly Code

• Strongly related to the benefits we discussed for certain loop transformations
• Examples:
– Cold cache, 4-byte words, 4-word cache blocks

// Version 1: i outer, j inner (row-major traversal)
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum += a[i][j];
  }
}
Miss rate = 1/4 = 25%

// Version 2: j outer, i inner (column-major traversal)
for (j = 0; j < N; j++) {
  for (i = 0; i < N; i++) {
    sum += a[i][j];
  }
}
Miss rate = 100%

ECE 565 – Fall 2022 7
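A runnable version of this comparison, as a sketch only (the array size, element type, and timing harness are assumptions, not part of the slide):

#include <stdio.h>
#include <time.h>

/* Sketch: time a full sum over a 2D array in row-major order vs. column-major
 * order. Row-major order walks memory sequentially and reuses each cache
 * block; column-major order jumps a full row (N * sizeof(int) bytes) between
 * accesses. */
#define N 4096

static int a[N][N];   /* ~64 MB, larger than typical on-chip caches */

static double sweep(int row_major) {
  struct timespec t0, t1;
  long sum = 0;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  if (row_major) {
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        sum += a[i][j];
  } else {
    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++)
        sum += a[i][j];
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  if (sum == -1) printf("unreachable\n");   /* keep the loops from being optimized away */
  return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
  for (int i = 0; i < N; i++)               /* initialize so pages are mapped */
    for (int j = 0; j < N; j++)
      a[i][j] = i + j;

  printf("row-major:    %.3f s\n", sweep(1));
  printf("column-major: %.3f s\n", sweep(0));
  return 0;
}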


Reverse-engineering a Cache

• Assume you have a machine


• You do not know its
– Cache sizes (or number of levels of cache)
– Cache block sizes
– Cache associativity
– Latencies
– Data bandwidth
• We can find these out through test programs!
– Write targeted code
– Measure performance characteristics
– Analyze the measurements

ECE 565 – Fall 2022 8


In-Class Exercise

• Code a targeted test program to determine
– # of caches in our machine
– Size of each cache
– Latency of each cache
• Assumptions
– LRU replacement policy in use for each cache
– Cache block size is known
– Each level of the cache hierarchy has a different access latency

(Diagram: P → L1 $ → L2 $ → L3 $ → Main Memory)

ECE 565 – Fall 2022 9


Test Code Summary
• (See links to code files on the class schedule page)
• Repeatedly access elements of a data set of some size
– E.g. an array
– Each memory access should depend on value from prior access
• E.g. pointer chasing
• This exposes memory latency of each access
• Record execution time for this set of memory accesses
– Calculate average latency from
• measured time
• known # of accesses
• When data set size grows larger than the size of a cache level –
– We will see a step in the measured average access latency
– Latency will stay constant while the data set size fits within a cache level
• Loop of repeated memory accesses can be unrolled
– To reduce the interference from loop management instructions
• Access data set w/ fixed stride so each access touches a new cache block
• Can randomly cycle through data set elements to defeat prefetch
– Prefetch could disrupt measurement and blur the transition between cache levels

ECE 565 – Fall 2022 10
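Below is a minimal pointer-chasing sketch in the spirit of the test code described above; it is not the course's posted code, and the data-set sizes and access count are assumed values:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Each array element stores the index of the next element to visit, so every
 * load depends on the previous one and the full access latency is exposed.
 * Visiting the elements as one random cycle helps defeat hardware prefetching. */

#define NUM_ACCESSES (1 << 24)

static double avg_latency_ns(size_t *next, size_t n) {
  /* Build a random single cycle over the n elements. */
  size_t *order = malloc(n * sizeof *order);
  for (size_t i = 0; i < n; i++) order[i] = i;
  for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
    size_t j = (size_t)rand() % (i + 1);
    size_t t = order[i]; order[i] = order[j]; order[j] = t;
  }
  for (size_t i = 0; i < n; i++)
    next[order[i]] = order[(i + 1) % n];
  free(order);

  struct timespec t0, t1;
  size_t p = 0;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < NUM_ACCESSES; i++)
    p = next[p];                                    /* dependent load chain */
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  if (p == (size_t)-1) printf("unreachable\n");     /* keep the loop live */
  return ns / NUM_ACCESSES;
}

int main(void) {
  /* Data set sizes from 16KB to 64MB, doubling each time. */
  for (size_t bytes = 1 << 14; bytes <= 1 << 26; bytes <<= 1) {
    size_t *next = malloc(bytes);
    printf("%9zu B : %6.2f ns/access\n", bytes, avg_latency_ns(next, bytes / sizeof(size_t)));
    free(next);
  }
  return 0;
}

Plotting the reported ns/access against the data-set size should reproduce the step pattern shown on the next slide.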


Example Results

(Figure: measured average access latency, in CPU cycles, vs. data set size in bytes, from 16KB to 2MB. The latency is flat within an L1 region at the smallest sizes, steps up to an L2 region, and steps up again to an L3 region as the data set outgrows each cache level.)

* Measured on an Intel Core i7 CPU @ 2.4 GHz


* Program compiled with gcc -O3
ECE 565 – Fall 2022 11
Cache Access Patterns
(Diagram: example access pattern with stride = 4 words over a data set of 24 words.)

• Vary the data set accessed by our code


– As data set grows larger than a cache level, performance drops
• Performance can be measured as latency or bandwidth

• Vary the stride of data accessed by our code


– Affects spatial locality provided by cache blocks
– If stride is less than the size of a cache line –
• The first access to a block may miss, but the following accesses to that block hit
– If stride is greater than or equal to the size of a cache line –
• Every access touches a new block, so sequential accesses are slow

ECE 565 – Fall 2022 12
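A stride-varying sketch, under assumed buffer and stride sizes (again, not the course's posted code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sweep a fixed buffer that is larger than the last-level cache with
 * different strides and report the average time per access. Strides smaller
 * than the 64B cache block reuse each fetched block; strides of a block or
 * more touch a new block on every access. Hardware prefetching of these
 * regular strides will blunt the effect, as noted earlier. */
#define BUF_BYTES (64ul * 1024 * 1024)

int main(void) {
  volatile char *buf = malloc(BUF_BYTES);
  for (size_t i = 0; i < BUF_BYTES; i++) buf[i] = (char)i;   /* warm / page in */

  for (size_t stride = 8; stride <= 1024; stride *= 2) {
    struct timespec t0, t1;
    size_t accesses = 0;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < BUF_BYTES; i += stride) {
      sum += buf[i];
      accesses++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("stride %4zu B: %.2f ns/access (sum=%ld)\n", stride, ns / accesses, sum);
  }
  return 0;
}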


Memory Access Latency vs. Bandwidth

• Thus far, we’ve focused mostly on latency
– Hit time for a cache level or memory
– E.g. we put together example code using pointer chasing
• Stresses the latency of each access
• Only one memory access in flight at a given time

• Cache and memory bandwidth is also important
– Bandwidth is a rate
• Bytes per cycle
• GB per second
– Bandwidth gets smaller at lower levels of the memory hierarchy
• Just as latency grows larger
– Some code is throughput sensitive, not latency sensitive
• Code performance would improve with higher access bandwidth
– Even if access latency increased

(Diagram: P → L1 $ → L2 $ → L3 $ → Main Memory)
ECE 565 – Fall 2022 13
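For contrast with the pointer-chasing code, here is a rough read-bandwidth sketch with independent loads (the buffer size and reporting format are assumptions):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Stream through a large buffer with independent loads so that many cache
 * misses can be outstanding at once, then report GB/s rather than ns per
 * access. */
#define BUF_BYTES (256ul * 1024 * 1024)
#define N_ELEMS   (BUF_BYTES / sizeof(uint64_t))

int main(void) {
  uint64_t *buf = malloc(BUF_BYTES);
  for (size_t i = 0; i < N_ELEMS; i++) buf[i] = i;   /* initialize / page in */

  struct timespec t0, t1;
  uint64_t sum = 0;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < N_ELEMS; i++)
    sum += buf[i];                 /* independent loads: misses can overlap */
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf("read bandwidth: %.2f GB/s (sum=%llu)\n",
         BUF_BYTES / sec / 1e9, (unsigned long long)sum);
  free(buf);
  return 0;
}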


Matrix Multiplication

• Common operation in scientific applications


• Significant interaction with cache & memory subsystem

| 1  2  3  4 |   | 1  5  9 13 |   | 30  .  .  . |
| 5  6  7  8 | x | 2  6 10 14 | = |  .  .  .  . |
| 9 10 11 12 |   | 3  7 11 15 |   |  .  .  .  . |
|13 14 15 16 |   | 4  8 12 16 |   |  .  .  .  . |

30 = 1*1 + 2*2 + 3*3 + 4*4   (row 1 of A dotted with column 1 of B)

• Recall our memory layout discussion


– E.g. C/C++ uses row-major order
– 2D array is allocated as a linear array in memory

ECE 565 – Fall 2022 14


Matrix Multiplication Implementation

double A[N][N], B[N][N], C[N][N];
int i, j, k;
double sum;

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum = 0;
    for (k = 0; k < N; k++) {
      sum += A[i][k] * B[k][j];
    }
    C[i][j] = sum;
  }
}

(Diagram: A[i][k] sweeps along row i of A, B[k][j] walks down column j of B, and the result is written to C[i][j].)

• 3 Loops – i, j, k
– 6 ways to arrange the loops and multiply the matrices
• O(N³) total operations
– Each element of A and B is read N times
– N values are summed for each output element of C

ECE 565 – Fall 2022 15


Cache Analysis for Matrix Multiplication

• Each matrix element is 64 bits (a double)


• Assumptions:
– N is very large (cache cannot fit more than one row/column)
– Cache block size = 64 bytes (8 matrix elements per block)
• Consider access pattern for i,j,k loop structure
(Diagram: with the i-j-k ordering, A is swept along row i, B is walked down column j, and a single element of C is reused across the k loop.)

– A=good spatial locality; C=good temporal locality; B=poor locality

ECE 565 – Fall 2022 16


Matrix Multiplication

double A[N][N], B[N][N], C[N][N];
int i, j, k;
double sum;

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    sum = 0;
    for (k = 0; k < N; k++) {
      sum += A[i][k] * B[k][j];
    }
    C[i][j] = sum;
  }
}

Misses per inner-loop iteration: A = 0.125, B = 1, C = 0 (1.125 total)

• i-j-k
– Memory accesses for each inner loop iteration
• 2 loads: element A[i][k] and element B[k][j]
– A[i][k] access will be a cache miss once every 8 iterations (8B element, 64B block)
– B[k][j] access will be cache miss every iteration
• j-i-k cache miss behavior same as i-j-k
ECE 565 – Fall 2022 17
Matrix Multiplication

double A[N][N], B[N][N], C[N][N];
int i, j, k;
double tmp;

for (k = 0; k < N; k++) {
  for (i = 0; i < N; i++) {
    tmp = A[i][k];
    for (j = 0; j < N; j++) {
      C[i][j] += tmp * B[k][j];
    }
  }
}

Misses per inner-loop iteration: A = 0, B = 0.125, C = 0.125 (0.25 total)

• k-i-j
– Memory accesses for each inner loop iteration
• 2 loads: element C[i][j] and element B[k][j]; 1 store: element C[i][j]
– C[i][j] access will be a cache miss once every 8 iterations
– B[k][j] access will be a cache miss once every 8 iterations
• i-k-j cache miss behavior same as k-i-j
ECE 565 – Fall 2022 18
Matrix Multiplication

double A[N][N], B[N][N], C[N][N];
int i, j, k;
double tmp;

for (j = 0; j < N; j++) {
  for (k = 0; k < N; k++) {
    tmp = B[k][j];
    for (i = 0; i < N; i++) {
      C[i][j] += tmp * A[i][k];
    }
  }
}

Misses per inner-loop iteration: A = 1, B = 0, C = 1 (2 total)

• j-k-i
– Memory accesses for each inner loop iteration
• 2 loads: element C[i][j] and element A[i][k]; 1 store: element C[i][j]
– C[i][j] access will be a cache miss every iteration (column-wise access)
– A[i][k] access will be a cache miss every iteration (column-wise access)
• k-j-i cache miss behavior same as j-k-i
ECE 565 – Fall 2022 19
Matrix Multiplication Summary

• k is innermost loop (i-j-k, j-i-k)
– A = good spatial locality
– C = good temporal locality
– Misses per iteration: 1 + (element sz / block sz) = 1.125

• i is innermost loop (j-k-i, k-j-i)
– B = good temporal locality
– Misses per iteration: 2

• j is innermost loop (k-i-j, i-k-j)
– B, C = good spatial locality
– A = good temporal locality
– Misses per iteration: 2 * (element sz / block sz) = 0.25

ECE 565 – Fall 2022 20
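To see the effect end to end, here is a self-contained sketch comparing the i-j-k and k-i-j orderings from the previous slides; the matrix size and timing harness are assumptions:

#include <stdio.h>
#include <time.h>

/* Both variants compute C = A * B for square N x N matrices of doubles. */
#define N 1024

static double A[N][N], B[N][N], C[N][N];

static double seconds(void) {
  struct timespec t;
  clock_gettime(CLOCK_MONOTONIC, &t);
  return t.tv_sec + t.tv_nsec * 1e-9;
}

static void mm_ijk(void) {               /* ~1.125 misses per inner iteration */
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      double sum = 0;
      for (int k = 0; k < N; k++)
        sum += A[i][k] * B[k][j];
      C[i][j] = sum;
    }
}

static void mm_kij(void) {               /* ~0.25 misses per inner iteration */
  for (int k = 0; k < N; k++)
    for (int i = 0; i < N; i++) {
      double tmp = A[i][k];
      for (int j = 0; j < N; j++)
        C[i][j] += tmp * B[k][j];
    }
}

int main(void) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      A[i][j] = 1.0; B[i][j] = 1.0; C[i][j] = 0.0;
    }

  double t0 = seconds(); mm_ijk(); double t1 = seconds();
  /* Reset C before the second variant, since k-i-j accumulates into it. */
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) C[i][j] = 0.0;
  double t2 = seconds(); mm_kij(); double t3 = seconds();

  printf("i-j-k: %.2f s   k-i-j: %.2f s   (C[0][0]=%.0f)\n",
         t1 - t0, t3 - t2, C[0][0]);
  return 0;
}

Compiled with gcc -O3, the k-i-j version should run noticeably faster at large N, consistent with the miss counts above.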


Other Types of Caching

• Main memory is a cache for disk


– Operates at a physical page granularity
• TLB is a cache for page table
– Translation Lookaside Buffer
– Operates at a page table entry granularity

ECE 565 – Fall 2022 21


Virtual Address Space
(Diagram: virtual address space from 0x00000000 to 0xffffffff, containing, from bottom to top: Text, User Heap, Shared Libraries, User Stack, and Kernel Space.)

• Each process thinks it has access to the full address space provided by the machine
– 4GB on 32-bit computers
– ~16-256TB on 64-bit computers
– Illusion provided by the OS
• Every process has its own virtual address space
– Contents visible only to that process
• Address space for even one process is larger than physical memory
• How does it fit?
ECE 565 – Fall 2022 22
Physical Memory as a Cache for Disk
(Diagram: the virtual address spaces of Process 0 through Process N are mapped onto a single set of physical addresses.)

ECE 565 – Fall 2022 23


Physical Memory as a Cache for Disk

(Diagram: the virtual address spaces of Process 0 through Process N are mapped onto a single set of physical addresses.)

Two key performance aspects:
1) Which addresses from the various processes should we keep in physical memory?
2) How do we do the mapping between a virtual address and its address in physical memory?

ECE 565 – Fall 2022 24


What To Keep in Physical Memory?

• Want to service memory accesses from DRAM, not disk


– That is, if the access misses in all caches already
– Disk is thousands of times slower than DRAM
• ~60-80ns to several milliseconds

• Locality still rules the day


– Just as we discussed for caches
• Want to capture spatial and temporal locality in memory
– Temporal: retain recently accessed addresses in physical mem
– Spatial: fetch nearby addresses into physical mem on disk access

ECE 565 – Fall 2022 25


What To Keep in Physical Memory?
• OS manages the physical memory and disk accesses
– Physical memory management is software-based
– Operates on physical memory in units of pages
• Pages have some size of 2^P bytes (e.g. 4KB)
• Much larger than cache block size due to huge latency of disk
– OS treats physical memory as a fully associative cache
• A virtual page can be placed in any physical page
– Page replacement policies may be complex
• SW can maintain and track much more state than HW
• Since accessing a page from disk is so slow there is lots of time!
• Physical memory holds the large-scale working set of a program
– Working set size less than physical mem size
• Good performance for a process after memory is warmed up
– Working set sizes of all active processes greater than physical mem size
• Can cause a severe performance problem
• Called thrashing: pages are continuously swapped between memory and disk

ECE 565 – Fall 2022 26


Metrics

• Page hit
– Memory reference to an address that is stored in physical mem
• Page miss (page fault)
– Reference to an address that is not in physical memory
– Misses are expensive
• Access to disk
• Software is involved in managing the process

ECE 565 – Fall 2022 27


Physical Memory as a Cache for Disk
(Diagram: the virtual address spaces of Process 0 through Process N are mapped onto physical addresses — this looks complicated!)

ECE 565 – Fall 2022 28
What is Stored Where in Physical Memory?

• Need to remember current location for every virtual page


– Since physical mem is managed as fully associative cache
• Solution is called a page table
– Maps virtual pages to physical pages
– Per-process software data structure
(Diagram: a per-process page table with entries PTE 0 through PTE 5. Each entry holds a valid bit and either a physical page number or a disk address. Entries with the valid bit set point to virtual pages cached in physical memory; entries with the valid bit clear point to pages stored on disk or are null (unallocated).)
ECE 565 – Fall 2022 29
Page Table Management

• Page tables are very large


– One entry per page
– For a 32-bit address space:
• Assume 4 GB total virtual memory (2^32 bytes)
• Assume 4KB pages (2^12 bytes)
• 2^32 / 2^12 = 2^20 entries
• PTE is 4B in the x86 architecture, so 2^20 * 4B = 2^22 bytes (4MB)
• And that’s just for one process!

• Keep portions of the page table in memory; rest on disk


– The frequently and recently accessed portions, that is

ECE 565 – Fall 2022 30


But That’s Not Quite Enough

(Diagram: Processor Chip containing the CPU and MMU. The CPU issues a virtual address, the MMU translates it to a physical address, and that physical address is used by the L1 I$/D$, the L2 $, and physical memory.)

• Physical addresses are needed for cache lookups


– Beginning at the L1 cache
• PTE is needed to turn a VA into PA
– PTE is located in memory (at best)
– Memory access required for every load or store?

ECE 565 – Fall 2022 31


Translation Lookaside Buffer (TLB)

• TLB is a very fast cache of PTEs


– Located inside the MMU of a processor
– Go directly from virtual page to physical page address
• Hierarchy of TLBs in current processor designs
– Separate instruction & data L1 TLBs, L2 TLB

(Diagram: Processor Chip as before, now with a TLB inside the MMU. The CPU issues a virtual address, the MMU/TLB translates it to a physical address used by the L1 I$/D$, the L2 $, and physical memory.)

ECE 565 – Fall 2022 32


TLB Reach

• TLB Reach
– Amount of memory accessible from the TLB
– Should cover the working set size of a process
– (# TLB entries) * (Page size)
• For example
– 64 TLB entries in L1 DTLB * 64KB pages = 4MB reach

ECE 565 – Fall 2022 33


Increasing TLB Reach

• What if we need to cover larger working sets?


• We can’t increase the number of TLB entries
– Well, we could wait a few years for a newer processor
• We can increase the page size
– Most modern architectures support a set range of page sizes
– From 4KB to 16MB
• Example - hugepages
– Consult your favorite OS manual to turn on hugepages
– E.g. libhugetlbfs
– RHEL example:
    $ grep Hugepagesize /proc/meminfo
    Hugepagesize:       2048 kB
    $ grep HugePages_Total /proc/meminfo
    HugePages_Total:       0
    $ sysctl -w vm.nr_hugepages=128
    $ grep HugePages_Total /proc/meminfo
    HugePages_Total:     128
ECE 565 – Fall 2022 34
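One way a program can use the reserved pool directly is mmap with the Linux MAP_HUGETLB flag; this is a sketch assuming Linux and a 2MB huge page size (libhugetlbfs and transparent hugepages are alternative routes):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Request anonymous memory backed by the pre-reserved huge page pool (see the
 * sysctl above). The length must be a multiple of the huge page size, and the
 * call fails if no huge pages are reserved. */
#define HUGE_PAGE (2UL * 1024 * 1024)     /* assumed 2MB huge page size */
#define LEN       (4 * HUGE_PAGE)

int main(void) {
  void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    perror("mmap(MAP_HUGETLB)");          /* e.g. HugePages_Total is still 0 */
    return 1;
  }
  memset(p, 0, LEN);                      /* touch the memory */
  printf("mapped %lu bytes backed by huge pages\n", LEN);
  munmap(p, LEN);
  return 0;
}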
Memory Systems
(Figure: server diagram highlighting the memory DIMM slots.)

ECE 565 – Fall 2022 36


Typical Server Architecture

(Diagram: a CPU (multi-core chip + caches) with PCIe I/O links to external components (network, storage, etc.), I/O links to other CPUs in the same socket (often proprietary I/O), and channels down to main memory. All of it runs under a single OS image that sees all CPU cores and memories.)
ECE 565 – Fall 2022 37
Memory Bandwidth (DRAM)

• Currently, main memory is often DRAM
– DDR (Double Data Rate) standards defined by JEDEC
– DDR4 common in current server CPUs
– New servers offer DDR5
– Various speed grades available
• More on this in a minute
• Connected to the CPU with channels
– e.g. 4, 8, 16 (uses chip I/O pins)

(Diagram: CPU (multi-core chip + caches) → Mem Controller → Main Memory)
• Memory controller schedules requests from the CPU
– Reads from cache misses
– Writes from cache evictions of dirty cache blocks

ECE 565 – Fall 2022 38


DRAM Performance
• Latency: we’ve discussed before
– 60-70ns load-to-use latency in current server CPUs is common
– This would be an uncontended load (no other traffic on the chip)
• Bandwidth (peak): defined by several factors
– Number of off-chip DDR channels (e.g. 4, 8, …)
– Bit width of each channel (e.g. 8 bytes)
– Channel frequency (e.g. 3200 MHz)
• Bandwidth (effective)
– Defined by ability of the memory controller to schedule requests
to utilize the DRAM devices and DDR channels efficiently
– 80% of peak B/W is a typical reasonable utilization
– Could be a bit better for some controllers or programs
• E.g. mostly read traffic compared to an even mix of read-write
ECE 565 – Fall 2022 39
Server DRAM B/W Example

ECE 565 – Fall 2022 40


Server DRAM B/W Example (2)

• 6 DDR4 memory channels per chip; 8B channel width


• 2666 = 2.666 Giga-transfers per second (GT/s)
• Peak DDR4 B/W per-chip:
– 6 channels * 8 B/channel * 2.666 GT/s = 127.97 GB/s

• DDR4 supports up to 3200 MT/s


• What would be B/W with 8 channels and 3200 DDR4?
– 8 * 8 * 3.2 = 204.8 GB/s

• Remember this is peak; Program would see ~80% of this


ECE 565 – Fall 2022 41
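The same peak-bandwidth arithmetic as a tiny sketch; the 80% effective-utilization factor is the rule of thumb from the previous slide:

#include <stdio.h>

/* Peak DRAM bandwidth = channels * bytes per channel * transfer rate.
 * Inputs are the example numbers from this slide. */
static double peak_gbs(int channels, int bytes_per_channel, double gt_per_s) {
  return channels * bytes_per_channel * gt_per_s;
}

int main(void) {
  printf("6 x 8B x 2.666 GT/s: peak %.2f GB/s, ~%.1f GB/s effective\n",
         peak_gbs(6, 8, 2.666), 0.80 * peak_gbs(6, 8, 2.666));
  printf("8 x 8B x 3.200 GT/s: peak %.2f GB/s, ~%.1f GB/s effective\n",
         peak_gbs(8, 8, 3.200), 0.80 * peak_gbs(8, 8, 3.200));
  return 0;
}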
More on Memory B/W
• Speaking of what a program would see…
• Remember B/W is a rate (# of bytes or accesses per time)
• Programs that generate many outstanding memory
requests can stress bandwidth more than latency
• Would we see peak Mem B/W w/ a program on 1 core?
– No
– Memory system is provisioned to provide bandwidth to all cores
• What defines B/W we could get from 1 core?
– Max number of in-flight memory requests a core can sustain
– Typically will be defined by size of Cache Miss handling structures
• e.g. size of Read Queue or MSHRs (Miss status handling registers)

ECE 565 – Fall 2022 42


Bandwidth from 1 Core
• Let’s look at an example:
– Suppose cache block size is 64 bytes, chip frequency is 2.4 GHz
– Suppose per-core L2 can support 32 pending cache misses
– Suppose average latency is 200 cycles
• Remember there is contention now so each access may take longer
– B/W = ((32 misses * 64 B/miss) / 200 cycles) * 2.4 GHz
• 24.576 GB/s would be max B/W for a single core

ECE 565 – Fall 2022 43
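The same single-core bound written out; all inputs are the assumed example values from this slide:

#include <stdio.h>

/* Max single-core bandwidth = (in-flight misses * bytes per miss) / miss
 * latency, converted from bytes per cycle to GB/s. */
int main(void) {
  double pending_misses = 32;     /* pending L2 misses the core can sustain */
  double block_bytes    = 64;     /* bytes brought in per miss              */
  double latency_cycles = 200;    /* average (contended) miss latency       */
  double freq_ghz       = 2.4;    /* chip frequency                         */

  double bytes_per_cycle = pending_misses * block_bytes / latency_cycles;
  printf("max single-core B/W ~= %.3f GB/s\n", bytes_per_cycle * freq_ghz);
  return 0;
}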


Other Memory Technologies
• Other memory technologies are emerging
• For use along with DRAM or as a replacement
• E.g. GPUs today use High Bandwidth Memory (HBM)
– Also defined by standards, similarly to DRAM
– Provides higher bandwidth (via wide interfaces and die stacking)
– But maximum capacity is much lower
• HBM2 / HBM2E
– Rates quoted in terms of giga-transfers per second (GT/s)
– HBM2: 128B interface (wide!), up to 2 GT/s = 256 GB/s per “stack”
– e.g. NVIDIA Volta GPU:
• 4 HBM2 stacks per GPU (16 GB total) for 900 GB/s per GPU

• HBM3 is current standard


ECE 565 – Fall 2022 44
