
Cache Writing & Performance

 Today we’ll finish up with caches; we’ll cover:

— Writing to caches: keeping memory consistent & write-allocation.
— Quantifying the benefits of different cache designs, and seeing how caches affect overall performance.
— Getting data into the cache before we need it, with prefetching.



Four important questions

1. When we copy a block of data from main memory to the cache, where exactly should we put it?

2. How can we tell if a word is already in the cache, or if it has to be fetched from main memory first?

3. Eventually, the small cache memory might fill up. To load a new block from main RAM, we’d have to replace one of the existing blocks in the cache... which one?

4. How can write operations be handled by the memory system?

 Previous lectures answered the first three. Today, we consider the fourth.



Writing to a cache
 Writing to a cache raises several additional issues
 First, let’s assume that the address we want to write to is already
loaded in the cache. We’ll assume a simple direct-mapped cache:
Cache:
  Index  V  Tag    Data
  ...
  110    1  11010  42803
  ...

Memory:
  Address    Data
  ...
  1101 0110  42803
  ...

 If we write a new value to that address, we can store the new data in the cache and avoid an expensive main memory access, but the cache and main memory are now inconsistent:

Mem[1101 0110] = 21763

Cache:
  Index  V  Tag    Data
  ...
  110    1  11010  21763
  ...

Memory:
  Address    Data
  ...
  1101 0110  42803
  ...



Write-through caches

 A write-through cache solves the inconsistency problem by forcing all writes to update both the cache and the main memory:

Mem[1101 0110] = 21763

Cache:
  Index  V  Tag    Data
  ...
  110    1  11010  21763
  ...

Memory:
  Address    Data
  ...
  1101 0110  21763
  ...

 This is simple to implement and keeps the cache and memory consistent.
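To make the policy concrete, here is a minimal C sketch of a write-through store for a direct-mapped cache. The whole model (line count, block size, the store_byte_writethrough function) is a hypothetical illustration, not any real design:

#include <stdint.h>

#define NUM_LINES  256
#define BLOCK_SIZE 16

typedef struct {
    int      valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} line_t;

static line_t  cache[NUM_LINES];
static uint8_t memory[1 << 20];   /* modeled main memory */

/* Write-through: update the cache on a hit, and ALWAYS update memory. */
void store_byte_writethrough(uint32_t addr, uint8_t value)
{
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_LINES;
    uint32_t tag    = (addr / BLOCK_SIZE) / NUM_LINES;
    uint32_t offset = addr % BLOCK_SIZE;
    line_t *l = &cache[index];

    if (l->valid && l->tag == tag)
        l->data[offset] = value;  /* keep the cached copy consistent */
    memory[addr] = value;         /* every store pays the memory cost */
}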

 Why is this not so good?



Write-back caches
 In a write-back cache, the memory is not updated until the cache block needs to be replaced (e.g., when loading data into a full cache set).
 For example, we might write some data to the cache at first, leaving it inconsistent with the main memory, as shown before.
— The cache block is marked “dirty” to indicate this inconsistency.

Mem[1101 0110] = 21763

Cache:
  Index  V  Dirty  Tag    Data
  ...
  110    1  1      11010  21763
  ...

Memory:
  Address    Data
  ...
  1000 1110  1225
  1101 0110  42803
  ...

 Subsequent reads to the same memory address will be serviced by the cache, which contains the correct, updated data.
Finishing the write back
 We don’t need to store the new value back to main memory unless the cache block gets replaced.
 e.g., on a read from Mem[1000 1110], which maps to the same cache block, the modified cache contents will first be written to main memory:

Cache:
  Index  V  Dirty  Tag    Data
  ...
  110    1  1      11010  21763
  ...

Memory:
  Address    Data
  ...
  1000 1110  1225
  1101 0110  21763
  ...

 Only then can the cache block be replaced with the data from address 1000 1110 (142):

Cache:
  Index  V  Dirty  Tag    Data
  ...
  110    1  0      10001  1225
  ...

Memory:
  Address    Data
  ...
  1000 1110  1225
  1101 0110  21763
  ...
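A minimal C sketch of this two-step replacement, reusing the hypothetical direct-mapped model from the write-through example (types repeated so the sketch stands alone; none of this is a real design):

#include <stdint.h>
#include <string.h>

#define NUM_LINES  256
#define BLOCK_SIZE 16

typedef struct {
    int      valid, dirty;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} line_t;

static line_t  cache[NUM_LINES];
static uint8_t memory[1 << 20];

/* Bring the block containing addr into the cache, evicting (and, if
 * dirty, writing back) whatever block currently occupies its slot. */
line_t *fill_block(uint32_t addr)
{
    uint32_t index = (addr / BLOCK_SIZE) % NUM_LINES;
    uint32_t tag   = (addr / BLOCK_SIZE) / NUM_LINES;
    line_t *l = &cache[index];

    if (l->valid && l->tag != tag) {
        if (l->dirty) {
            /* Step 1: write the modified block back to main memory. */
            uint32_t old_base = (l->tag * NUM_LINES + index) * BLOCK_SIZE;
            memcpy(&memory[old_base], l->data, BLOCK_SIZE);
        }
        l->valid = 0;
    }
    if (!l->valid) {
        /* Step 2: only now load the requested block from memory. */
        memcpy(l->data, &memory[addr - addr % BLOCK_SIZE], BLOCK_SIZE);
        l->valid = 1;
        l->dirty = 0;
        l->tag   = tag;
    }
    return l;
}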



Write misses

 A second scenario is if we try to write to an address that is not already contained in the cache; this is called a write miss.
 Let’s say we want to store 21763 into Mem[1101 0110], but we find that address is not currently in the cache:

Cache:
  Index  V  Tag    Data
  ...
  110    1  00010  123456
  ...

Memory:
  Address    Data
  ...
  1101 0110  6378
  ...

 When we update Mem[1101 0110], should we also load it into the cache?



Allocate on write

 An allocate-on-write strategy loads the newly written data into the cache:

Mem[1101 0110] = 21763

Cache:
  Index  V  Tag    Data
  ...
  110    1  11010  21763
  ...

Memory:
  Address    Data
  ...
  1101 0110  21763
  ...

 If that data is needed again soon, it will be available in the cache.
 This is generally the baseline behavior of processors.
 What about the following?

for (int i = 0; i < LARGE; i++)
    a[i] = i;
Non-temporal stores (write-around/write-no-allocate)
 For code where the stored values won’t get used in the near future, like:

for (int i = 0; i < LARGE; i++)
    a[i] = i;

 There is no point in putting these values in the cache.
 With a write-around policy, the write operation goes directly to main memory without affecting the cache:

Mem[1101 0110] = 21763

Cache:
  Index  V  Tag    Data
  ...
  110    1  00010  123456
  ...

Memory:
  Address    Data
  ...
  1101 0110  21763
  ...

 Some modern processors with write-allocate caches provide special store instructions, called non-temporal stores, that do this.
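For example, on x86 the SSE2 intrinsic _mm_stream_si32 issues such a store. A minimal sketch, assuming an x86 target compiled with SSE2 enabled (the fill_array name is ours):

#include <emmintrin.h>

void fill_array(int *a, int n)
{
    for (int i = 0; i < n; i++)
        _mm_stream_si32(&a[i], i);  /* store that bypasses the cache */
    _mm_sfence();  /* fence: make the streamed stores globally visible */
}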
Real Designs



First Observations
 Split Instruction/Data caches:
— Pro: No structural hazard between the IF & MEM stages
  • A single-ported unified cache stalls fetch during a load or store
— Con: Static partitioning of the cache between instructions & data
  • Bad if the working sets are unequal: e.g., code/DATA or CODE/data

 Cache Hierarchies:
— Trade-off between access time & hit rate
  • L1 cache can focus on fast access time (with an okay hit rate)
  • L2 cache can focus on a good hit rate (with an okay access time)
— Such hierarchical design is another “big idea”
— We saw this in section.

CPU → L1 cache → L2 cache → Main Memory



Opteron Vital Statistics

CPU → L1 cache → L2 cache → Main Memory
 L1 Caches: Instruction & Data
— 64 kB
— 64 byte blocks
— 2-way set associative
— 2 cycle access time

 L2 Cache:
— 1 MB
— 64 byte blocks
— 4-way set associative
— 16 cycle access time (total, not just miss penalty)

 Memory
— 200+ cycle access time
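To see how these numbers combine, we can compute the average memory access time (AMAT). The miss rates below are made up for illustration; only the latencies come from the table above:

AMAT = L1 time + L1 miss rate × (L2 time + L2 miss rate × memory time)
     = 2 + 0.05 × (16 + 0.20 × 200)
     = 2 + 0.05 × 56
     = 4.8 cycles

Even small miss rates matter: with these numbers, the 5% of accesses that miss in L1 more than double the average access time.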



Comparing cache organizations
 Like many architectural features, caches are evaluated
experimentally.
— As always, performance depends on the actual instruction mix,
since different programs will have different memory access
patterns.
— Simulating or executing real applications is the most accurate
way to measure performance characteristics.
 The graphs on the next few slides illustrate the simulated miss
rates for several different cache designs.
— Again, lower miss rates are generally better, but remember that the miss rate is just one component of average memory access time and execution time.
— We will do some cache simulations on the MPs.



Associativity tradeoffs and miss rates
 As we saw last time, higher associativity means more complex
hardware.
 But a highly-associative cache will also exhibit a lower miss rate.
— Each set has more blocks, so there’s less chance of a conflict
between two addresses which both belong in the same set.
 This graph shows the miss rates decreasing as the associativity
increases.

[Graph: miss rate (0% to 12%) vs. associativity (one-way, two-way, four-way, eight-way); the miss rate decreases as associativity increases]



Cache size and miss rates
 The cache size also has a significant impact on performance.
— The larger a cache is, the less chance there will be of a
conflict.
 This graph depicts the miss rate as a function of both the cache
size and its associativity.

[Graph: miss rate (0% to 15%) vs. associativity (one-way to eight-way), with one curve per cache size: 1 KB, 2 KB, 4 KB, 8 KB; larger caches and higher associativity both lower the miss rate]



Block size and miss rates
 Finally, Figure 7.12 on p. 559 shows miss rates relative to the
block size and overall cache size.
— Smaller blocks do not take maximum advantage of spatial
locality.

[Graph: miss rate (0% to 40%) vs. block size (4, 16, 64, 256 bytes), with one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB]



What happens on a cache miss?

 Can’t do the write back (into the register file) until the data is fetched.
— Easiest thing to do is stall immediately.
  • Sub-optimal if the data isn’t used right away.

 Optimization: non-blocking cache.
— Remember the miss and which register it should write into, & continue.
— Stalls when:
  • the data is needed, or
  • too many misses are outstanding.

 Exploit this by “hoist”ing loads up from their uses:

lw  $t0, 64($a0)
…
add $v0, $t0, $t1

 But…
— it uses up a register,
— for potentially many cycles (~100 to memory), and
— we might be guessing what will be accessed.
Software Prefetching

 Most modern architectures provide special software prefetch instructions.
— They look like loads without destination registers:
  • e.g., on SPIM, lw $0, 64($a0)  # “write” to the zero register
— These are hints to the processor:
  • “I think I might use the cache block containing this address.”
  • Hardware will try to move the block into the cache.
  • But hardware can ignore the hint (if busy).
— Useful for fetching data ahead of its use:

for (int i = 0; i < LARGE; i++) {
    prefetch A[i+16];   // prefetch 16 iterations ahead
    computation A[i];
}
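In C, this pseudocode can be written with GCC/Clang’s __builtin_prefetch builtin; this sketch assumes one of those compilers, and the distance of 16 and the loop body are illustrative stand-ins:

#define PREFETCH_DISTANCE 16

void compute(int *A, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&A[i + PREFETCH_DISTANCE]);  /* hint only */
        A[i] = 2 * A[i] + 1;  /* stand-in for "computation A[i]" */
    }
}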
Prefetching, cont.
 Remember this graph?

int array[SIZE];
int A = 0;
for (int i = 0; i < 200000; ++i) {
    for (int j = 0; j < SIZE; ++j) {
        A += array[j];
    }
}

[Graph: “Actual Data from remsun2.ews.uiuc.edu”: running time vs. array size (0 to 10000)]



Hardware Stream Prefetching
int array[SIZE];
int A = 0;
for (int i = 0; i < 200000; ++i) {
    for (int j = 0; j < SIZE; ++j) {
        A += array[j];
    }
}

 The inner loop has a very simple access pattern:
— A, A+4, A+8, A+12, …
— This is what’s called a stream.

 We can easily build hardware to recognize streams.
 If we get a pair of sequential misses (blocks X, X+1), predict a stream.
— Fetch the next two blocks (X+2, X+3).

 Continue fetching the stream if the prefetched blocks are accessed.
— If X+2 is read/written, prefetch X+4 …

 As confidence in the stream increases, increase the number of outstanding prefetches.
— If we get to X+8, have prefetches outstanding for X+9, X+10, X+11, X+12, X+13.

 Can learn strides as well (X, X+16, X+32, …) and descending streams (X, X-1, X-2, …).
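A toy software model of the stream detector just described; every name and threshold here is hypothetical, and a real prefetcher is a small hardware state machine, not C code:

#include <stdint.h>

static uint64_t last_miss = 0;  /* block number of the previous miss */
static uint64_t next_pref = 0;  /* next block to prefetch (0 = no stream) */
static int      depth     = 0;  /* how far ahead we may run */

static void prefetch_block(uint64_t blk) { (void)blk; /* would enqueue a fetch */ }

void on_block_access(uint64_t blk, int was_miss)
{
    if (next_pref != 0 && blk + depth >= next_pref) {
        /* Access landed in prefetched territory: confidence grows,
         * so raise the number of outstanding prefetches. */
        if (depth < 8)
            depth++;
        while (next_pref <= blk + depth)
            prefetch_block(next_pref++);
    } else if (was_miss) {
        if (blk == last_miss + 1) {
            /* Two sequential misses (X, X+1): predict a stream and
             * fetch the next two blocks (X+2, X+3). */
            depth = 2;
            next_pref = blk + 1;
            prefetch_block(next_pref++);
            prefetch_block(next_pref++);
        }
        last_miss = blk;
    }
}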
PC-based HW Prefetching
 What about the following?

for (int i = 0; i < LARGE; i++) {
    C[i] = A[i] + B[i];
}

 3 separate streams
— Might confuse a naïve prefetcher.

 Observation: A, B, and C are accessed by different instructions.
 Learn a stream for each instruction.
 Modern x86 chips do both stream and PC-based stride prefetching in HW.
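A toy model of a PC-indexed stride table; the table size, confidence thresholds, and names are all made up for illustration:

#include <stdint.h>

#define TABLE_SIZE 64

typedef struct {
    uint64_t pc;         /* which load/store instruction this entry tracks */
    uint64_t last_addr;  /* address of that instruction's previous access */
    int64_t  stride;     /* last observed address delta */
    int      confidence; /* consecutive accesses with a matching stride */
} entry_t;

static entry_t table[TABLE_SIZE];

static void issue_prefetch(uint64_t addr) { (void)addr; /* would enqueue it */ }

/* Called once per memory access with the instruction's PC and address.
 * Because each instruction gets its own entry, the A, B, and C streams
 * above are tracked independently and do not confuse each other. */
void on_access(uint64_t pc, uint64_t addr)
{
    entry_t *e = &table[pc % TABLE_SIZE];

    if (e->pc != pc) {                   /* new instruction: reset the entry */
        e->pc = pc;
        e->last_addr = addr;
        e->stride = 0;
        e->confidence = 0;
        return;
    }

    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride != 0 && stride == e->stride) {
        if (e->confidence < 4)
            e->confidence++;
        if (e->confidence >= 2)          /* stride looks stable: run ahead */
            issue_prefetch(addr + (uint64_t)(e->confidence * stride));
    } else {
        e->confidence = 0;               /* pattern broke: start over */
    }
    e->stride = stride;
    e->last_addr = addr;
}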



So what do we need SW prefetching for?

 Non-stride accesses!
 Like linked data structures:
— lists, arrays of pointers, etc.

 Consider:

element_t *A[SIZE];
for (int i = 0; i < SIZE; i++) {
    process(A[i]);
}
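Hardware cannot guess the pointer values, but software can read them early. A minimal sketch using GCC/Clang's __builtin_prefetch; the look-ahead distance of 8 is an arbitrary illustrative choice:

void process_all(element_t *A[], int n)
{
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(A[i + 8]);  /* the pointer value IS the future address */
        process(A[i]);
    }
}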



Summary

 Writing to a cache poses a couple of interesting issues.
— Write-through and write-back policies keep the cache consistent with main memory in different ways for write hits.
— Write-around and allocate-on-write are two strategies to handle write misses, differing in whether updated data is loaded into the cache.

 Hardware prefetching handles most streams and strides.
— We’ll talk later about one limitation.

 Software prefetching is useful for linked data structures.
— Must be added by the programmer (or a very smart compiler).

 Next time, we’ll look at cache-conscious programming.

