CS 211: Computer Architecture Cache Memory Design

The document discusses the objectives and topics that will be covered in the CS 211: Computer Architecture course. The course will cover memory design including cache design, multiprocessing concepts like multi-core processors and parallel processing, embedded systems, reconfigurable architectures, and memory technologies. It will discuss how memory is organized using a memory hierarchy to facilitate fast access for the processor, including caches, main memory, and different levels of storage like disks. The goal is to address how instructions can be executed faster and how the processor can access operands from memory in one clock cycle.
Course Objectives: Where are we?

CS 211: Part 2
• Discussions thus far
¾ Processor architectures to increase the processing speed
¾ Focused entirely on how instructions can be executed faster
¾ Have not addressed the other components that go into putting it all together
¾ Other components: Memory, I/O, Compiler
• Next:
¾ Memory design: how memory is organized to facilitate fast access
¾ Focus on cache design
¾ Compiler optimization: how the compiler generates 'optimized' machine code
¾ With a look at techniques for ILP
¾ We have seen some techniques already, and will cover some more in memory design before getting to a formal treatment of compilers
¾ Quick look at I/O and Virtual Memory

CS 211: Part 3
• Multiprocessing concepts
¾ Multi-core processors
¾ Parallel processing
¾ Cluster computing
• Embedded Systems – another dimension
¾ Challenges and what's different
• Reconfigurable architectures
¾ What are they? And why?
Memory
• In our discussions (on the MIPS pipeline, superscalar, EPIC) we've constantly been assuming that we can access our operand from memory in 1 clock cycle…
¾ This is possible, but it's complicated
¾ We'll now discuss how this happens
• We'll talk about…
¾ Memory Technology
¾ Memory Hierarchy
¾ Caches
¾ Memory
¾ Virtual Memory

Memory Technology
• Memory comes in many flavors
¾ SRAM (Static Random Access Memory)
¾ Like a register file; once data is written to SRAM its contents stay valid – no need to refresh it
¾ DRAM (Dynamic Random Access Memory)
¾ Like leaky capacitors – data is stored by charging memory cells to their max values; the charge slowly leaks and will eventually be too low to be valid – therefore refresh circuitry rewrites the data and charges the cells back to max
¾ Static RAM is faster but more expensive
¾ Cache uses static RAM
¾ ROM, EPROM, EEPROM, Flash, etc.
¾ Read-only memories – store the OS
¾ Disks, Tapes, etc.
• Difference in speed, price and "size"
¾ Fast is small and/or expensive
¾ Large is slow and/or cheap

Is there a problem with DRAM?
[Figure: Processor-DRAM memory gap (latency), 1980-2000. CPU performance ("Moore's Law") improves ~60%/yr (2X/1.5yr) while DRAM latency improves ~9%/yr (2X/10yrs); the processor-memory performance gap grows ~50% per year.]

Why Not Only DRAM?
• Can lose data when no power
¾ Volatile storage
• Not large enough for some things
¾ Backed up by storage (disk)
¾ Virtual memory, paging, etc.
• Not fast enough for processor accesses
¾ Takes hundreds of cycles to return data
¾ OK in very regular applications
¾ Can use SW pipelining, vectors
¾ Not OK in most other applications
The principle of locality…
• …says that most programs don't access all code or data uniformly
¾ e.g. in a loop, a small subset of instructions might be executed over and over again…
¾ …& a block of memory addresses might be accessed sequentially…
• This has led to "memory hierarchies"
• Some important things to note:
¾ Fast memory is expensive
¾ Levels of memory are usually smaller/faster than the previous level
¾ Levels of memory usually "subset" one another
¾ All the stuff in a higher level is in some level below it

Levels in a typical memory hierarchy
[Figure: the levels of a typical memory hierarchy, from registers down to disk.]

Memory Hierarchies
"always reuse a good idea"
• Key Principles
¾ Locality – most programs do not access code or data uniformly
¾ Smaller hardware is faster
• Goal
¾ Design a memory hierarchy "with cost almost as low as the cheapest level of the hierarchy and speed almost as fast as the fastest level"
¾ This implies that we be clever about keeping more likely used data as "close" to the CPU as possible
• Levels provide subsets
¾ Anything (data) found in a particular level is also found in the next level below
¾ Each level maps from a slower, larger memory to a smaller but faster memory

The Full Memory Hierarchy
[Figure, redrawn as a table; upper levels are faster, lower levels are larger.]

Level                      Capacity     Access Time            Cost                      Transfer unit (staging)      Managed by
CPU Registers              100s Bytes   <10s ns                –                         instr. operands, 1-8 bytes   prog./compiler
Cache (our current focus)  K Bytes      10-100 ns              1-0.1 cents/bit           blocks, 8-128 bytes          cache cntl
Main Memory                M Bytes      200ns-500ns            $.0001-.00001 cents/bit   pages, 4K-16K bytes          OS
Disk                       G Bytes      10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit   files, Mbytes                user/operator
Tape                       infinite     sec-min                10^-8 cents/bit           –                            –
Propagation delay bounds where memory can be placed
[Figure: memory technologies and their speeds, e.g. Double Data Rate (DDR) SDRAM; XDR planned at 3 to 6 GHz.]

Cache: Terminology
• Cache is the name given to the first level of the memory hierarchy encountered once an address leaves the CPU
¾ Takes advantage of the principle of locality
• The term cache is also now applied whenever buffering is employed to reuse items
• Cache controller
¾ The HW that controls access to the cache or generates requests to memory
• No cache in 1980 PCs to 2-level cache by 1995..!

What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
¾ Registers "a cache" on variables – software managed
¾ First-level cache a cache on second-level cache
¾ Second-level cache a cache on memory
¾ Memory a cache on disk (virtual memory)
¾ TLB a cache on page table
¾ Branch-prediction a cache on prediction information?
[Figure: pyramid from Proc/Regs through L1-Cache, L2-Cache and Memory down to Disk, Tape, etc. – bigger going down, faster going up.]

Caches: multilevel
[Figure: CPU → L1 cache (16~32KB, 1~2 pclk latency) → L2 cache (~256KB, ~10 pclk latency) → L3 cache (~4MB, ~50 pclk latency) → Main Memory.]
A brief description of a cache
• Cache = next level of the memory hierarchy up from the register file
¾ All values in the register file should be in the cache
• Cache entries usually referred to as "blocks"
¾ Block is the minimum amount of information that can be in the cache
¾ fixed-size collection of data, retrieved from memory and placed into the cache
• Processor generates a request for data/inst, first look up the cache
• If we're looking for an item in the cache and find it, we have a cache hit; if not, a cache miss

Terminology Summary
• Hit: data appears in a block in the upper level (i.e. block X in cache)
¾ Hit Rate: fraction of memory accesses found in the upper level
¾ Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (i.e. block Y in memory)
¾ Miss Rate = 1 - (Hit Rate)
¾ Miss Penalty: extra time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on Alpha 21264)
[Figure: blocks move between the upper-level memory (Blk X) and the lower-level memory (Blk Y), to and from the processor.]

Definitions
• Locating a block requires two attributes:
¾ Size of block
¾ Organization of blocks within the cache
• Block size (also referred to as line size)
¾ Granularity at which the cache operates
¾ Each block is a contiguous series of bytes in memory and begins on a naturally aligned boundary
¾ E.g.: cache with 16-byte blocks
¾ each contains 16 bytes
¾ first byte aligned to 16-byte boundaries in the address space
¾ low-order 4 bits of the address of the first byte would be 0000
¾ Smallest usable block size is the natural word size of the processor
¾ else an access could be split across blocks, which slows down translation

Cache Basics
• Cache consists of block-sized lines
¾ Line size typically a power of two
¾ Typically 16 to 128 bytes in size
• Example (see the sketch below)
¾ Suppose block size is 128 bytes
¾ Lowest seven bits determine the offset within the block
¾ Read data at address A = 0x7fffa3f4
¾ Address belongs to the block with base address 0x7fffa380
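A minimal C sketch of the block/offset arithmetic in the example above. The 128-byte block size and the address 0x7fffa3f4 come from the slide; the program itself is illustrative, not part of the course material:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint32_t block_size = 128;             /* power of two: 7 offset bits   */
    const uint32_t addr       = 0x7fffa3f4;      /* address from the example      */

    uint32_t offset = addr & (block_size - 1);   /* low 7 bits: offset in block   */
    uint32_t base   = addr & ~(block_size - 1);  /* block-aligned base address    */

    printf("base = 0x%08x, offset = %u\n", base, offset);  /* 0x7fffa380, 116 */
    return 0;
}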

Memory Hierarchy
• Placing the fastest memory near the CPU can result in increases in performance
• Consider the number of cycles the CPU is stalled waiting for a memory access – memory stall cycles
¾ CPU execution time = (CPU clk cycles + Memory stall cycles) * clk cycle time
¾ Memory stall cycles = number of misses * miss penalty = IC * (memory accesses/instruction) * miss rate * miss penalty

Unified or Separate I-Cache and D-Cache
• Two types of accesses:
¾ Instruction fetch
¾ Data fetch (load/store instructions)
• Unified Cache
¾ One large cache for both instructions and data
¾ Pros: simpler to manage, less hardware complexity
¾ Cons: how to divide the cache between data and instructions? Confuses the standard Harvard architecture model; optimizations difficult
• Separate Instruction and Data cache
¾ Instruction fetch goes to the I-Cache
¾ Data access from Load/Stores goes to the D-cache
¾ Pros: easier to optimize each
¾ Cons: more expensive; how to decide on the sizes of each

Cache Design--Questions
• Q1: Where can a block be placed in the upper level?
¾ block placement
• Q2: How is a block found if it is in the upper level?
¾ block identification
• Q3: Which block should be replaced on a miss?
¾ block replacement
• Q4: What happens on a write?
¾ Write strategy

Where can a block be placed in a cache?
• 3 schemes for block placement in a cache:
¾ Direct mapped cache:
¾ Block (or data to be stored) can go to only 1 place in the cache
¾ Usually: (Block address) MOD (# of blocks in the cache)
¾ Fully associative cache:
¾ Block can be placed anywhere in the cache
¾ Set associative cache:
¾ "Set" = a group of blocks in the cache
¾ Block mapped onto a set & then the block can be placed anywhere within that set
¾ Usually: (Block address) MOD (# of sets in the cache)
¾ If n blocks in a set, we call it n-way set associative
Where can a block be placed in a cache?
[Figure: an 8-block cache organized as fully associative, direct mapped, and 2-way set associative (sets 0-3). Memory block 12 can go anywhere in the fully associative cache, only into block 4 (12 mod 8) in the direct mapped cache, and anywhere in set 0 (12 mod 4) in the set associative cache. A small code illustration follows below.]

Associativity
• If you have associativity > 1 you have to have a replacement policy
¾ FIFO
¾ LRU
¾ Random
• "Full" or "full-map" associativity means you check every tag in parallel and a memory block can go into any cache block
¾ Virtual memory is effectively fully associative
¾ (But don't worry about virtual memory yet)
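A small C illustration of the three placement rules for memory block 12 from the figure above. The 8-frame cache and the 2-way/4-set split are the slide's numbers; the program itself is just a sketch:

#include <stdio.h>

int main(void) {
    int block      = 12;
    int num_blocks = 8;   /* direct mapped: 8 frames               */
    int num_sets   = 4;   /* 2-way set associative: 8 / 2 = 4 sets */

    printf("direct mapped    -> frame %d\n", block % num_blocks);  /* frame 4 */
    printf("2-way set assoc. -> set   %d\n", block % num_sets);    /* set 0   */
    /* fully associative: block 12 may go into any of the 8 frames */
    return 0;
}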

Cache Organizations
• Direct Mapped vs Fully Associative
¾ Direct mapped is not flexible enough; if X(mod K) = Y(mod K) then X and Y cannot both be located in the cache
¾ Fully associative allows any mapping, implies all locations must be searched to find the right one – expensive hardware
• Set Associative
¾ Compromise between direct mapped and fully associative
¾ Allow many-to-few mappings
¾ On lookup, a subset of the address bits is used to generate an index
¾ BUT the index now corresponds to a set of entries which can be searched in parallel – a more efficient hardware implementation than fully associative, but due to the flexible mapping it behaves more like fully associative

Associativity
• If total cache size is kept the same, increasing the associativity increases the number of blocks per set
¾ Number of simultaneous compares needed to perform the search in parallel = number of blocks per set
¾ An increase by a factor of 2 in associativity doubles the number of blocks per set and halves the number of sets
Large Blocks and Subblocking
• Large cache blocks can take a long time to refill
¾ refill the cache line critical word first
¾ restart the cache access before the complete refill
• Large cache blocks can waste bus bandwidth if the block size is larger than the spatial locality
¾ divide a block into subblocks
¾ associate separate valid bits with each subblock
[Figure: a cache line holding a tag and several subblocks, each with its own valid bit (v).]

Block Identification: How is a block found in the cache
• Since we have many-to-one mappings, need a tag
• Caches have an address tag on each block that gives the block address.
¾ E.g.: if slot zero in the cache contains tag K, the value in slot zero corresponds to block zero from the area of memory that has tag K
¾ Address consists of <tag t, block b, offset o>
¾ Examine the tag in slot b of the cache:
¾ if it matches t, then extract the value from slot b in the cache
¾ else use the memory address to fetch the block from memory, place a copy in slot b of the cache, replace the tag with t, use o to select the appropriate byte

How is a block found in the cache?
• Caches have an address tag on each block frame that provides the block address
¾ The tag of every cache block that might have the entry is examined against the CPU address (in parallel! – why?)
• Each entry usually has a valid bit
¾ Tells us if the cache data is useful/not garbage
¾ If the bit is not set, there can't be a match…
• How does the address provided to the CPU relate to the entry in the cache?
¾ Entry divided between block address & block offset…
¾ …and further divided between tag field & index field (a sketch in code follows below)

How is a block found in the cache?
Block Address = Tag | Index, followed by the Block Offset
• Block offset field selects data from the block
¾ (i.e. address of the desired data within the block)
• Index field selects a specific set
¾ Fully associative caches have no index field
• Tag field is compared against the stored tag for a hit
• Could we compare on more of the address than the tag?
¾ Not necessary; checking the index is redundant
¾ Used to select the set to be checked
¾ Ex.: an address stored in set 0 must have 0 in the index field
¾ Offset not necessary in the comparison – the entire block is present or not, and all block offsets must match
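A hedged C sketch of the <tag, index, offset> split just described. The block size (16 bytes) and the number of sets (4) are assumptions chosen for illustration; they are not parameters given on the slide:

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 4   /* log2(block size)     -- assumed 16-byte blocks */
#define INDEX_BITS  2   /* log2(number of sets) -- assumed 4 sets         */

int main(void) {
    uint32_t addr   = 0x7fffa3f4;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}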
Cache Memory Structures
[Figure: three lookup structures – an indexed/direct-mapped memory (k-bit index, 2^k blocks), an associative memory (CAM: no index, the key is compared against every entry), and an N-way set-associative memory (k-bit index, 2^k × N blocks).]

Direct Mapped Caches
[Figure: the address is split into tag | idx | block offset; a decoder selects the indexed line, the stored tag is compared against the address tag for a match, and a multiplexor selects the word within the block.]

Cache Block Size
• Each cache block (or cache line) has only one tag but can hold multiple "chunks" of data
¾ reduces tag storage overhead
¾ In 32-bit addressing, a 1-MB direct-mapped cache has 12 bits of tags
¾ the entire cache block is transferred to and from memory all at once
¾ good for spatial locality: if you access address i, you will probably want i+1 as well (prefetching effect)
• Address split (MSB to LSB): tag | block index (B bits) | block offset (b bits)
¾ Block size = 2^b; Direct Mapped Cache Size = 2^(B+b)

Fully Associative Cache
[Figure: the address is split into tag | block offset; the tag is compared against every entry in parallel (associative search) and a multiplexor selects the word.]
N-Way Set Associative Cache
[Figure: the address is split into tag | index | block offset; the index selects a set, the tags of the N ways (banks) in that set are compared in parallel (associative search within the set), and a multiplexor selects the word from the matching way. Cache Size = N × 2^(B+b).]

Which block should be replaced on a cache miss?
• If we look something up in the cache and the entry is not there, we generally want to get the data from memory and put it in the cache
¾ B/c the principle of locality says we'll probably use it again
• Direct mapped caches have 1 choice of what block to replace
• Fully associative or set associative offer more choices
• Usually 2 strategies:
¾ Random – pick any possible block and replace it
¾ LRU – stands for "Least Recently Used"
¾ Why not throw out the block not used for the longest time
¾ Usually approximated, not much better than random – i.e. 5.18% vs. 5.69% miss rate for a 16KB 2-way set associative cache

LRU Example
• 4-way set associative
¾ Need 4 values (2 bits) for the counter: 3 = most recently used, 0 = the next victim (a sketch in code follows below)
• Initial state (counter, block address):
0 0x00004000
1 0x00003800
2 0xffff8000
3 0x00cd0800
• Access 0xffff8004 (hit):
0 0x00004000
1 0x00003800
3 0xffff8000
2 0x00cd0800
• Access 0x00003840 (hit):
0 0x00004000
3 0x00003800
2 0xffff8000
1 0x00cd0800
• Access 0x00d00008 (miss): replace the entry with counter 0, then update the counters:
3 0x00d00000
2 0x00003800
1 0xffff8000
0 0x00cd0800
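A minimal C sketch of the 2-bit-counter LRU scheme traced above. The data structure and helper names are made up for illustration; the block addresses and access sequence are the slide's:

#include <stdio.h>
#include <stdint.h>

#define WAYS 4

typedef struct { uint32_t tag[WAYS]; int ctr[WAYS]; } lru_set_t;

static void touch(lru_set_t *s, int way) {
    /* Every way with a higher counter moves down; touched way becomes MRU. */
    for (int i = 0; i < WAYS; i++)
        if (s->ctr[i] > s->ctr[way]) s->ctr[i]--;
    s->ctr[way] = WAYS - 1;
}

static void access_block(lru_set_t *s, uint32_t tag) {
    for (int i = 0; i < WAYS; i++)
        if (s->tag[i] == tag) { touch(s, i); return; }   /* hit */
    for (int i = 0; i < WAYS; i++)
        if (s->ctr[i] == 0) {                            /* miss: evict LRU way */
            s->tag[i] = tag;
            touch(s, i);
            return;
        }
}

int main(void) {
    lru_set_t s = { {0x00004000, 0x00003800, 0xffff8000, 0x00cd0800},
                    {0, 1, 2, 3} };
    access_block(&s, 0xffff8000);   /* hit  (0xffff8004 falls in this block)  */
    access_block(&s, 0x00003800);   /* hit  (0x00003840)                      */
    access_block(&s, 0x00d00000);   /* miss (0x00d00008): evicts 0x00004000   */
    for (int i = 0; i < WAYS; i++)
        printf("%d 0x%08x\n", s.ctr[i], s.tag[i]);       /* matches the slide */
    return 0;
}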
Approximating LRU
• LRU is too complicated
¾ Access and possibly update all counters in a set on every access (not just on replacement)
• Need something simpler and faster
¾ But still close to LRU
• NMRU – Not Most Recently Used
¾ The entire set has one MRU pointer
¾ Points to the last-accessed line in the set
¾ Replacement: randomly select a non-MRU line
¾ Something like a FIFO will also work

What happens on a write?
• FYI most accesses to a cache are reads:
¾ Used to fetch instructions (reads)
¾ Most instructions don't write to memory
¾ For DLX only about 7% of memory traffic involves writes
¾ Translates to about 25% of cache data traffic
• Make the common case fast! Optimize the cache for reads!
¾ Actually pretty easy to do…
¾ Can read the block while comparing/reading the tag
¾ Block read begins as soon as the address is available
¾ If a hit, the data is just passed right on to the CPU
• Writes take longer. Any idea why?

What happens on a write?
• Generically, there are 2 kinds of write policies:
¾ Write through (or store through)
¾ With write through, information is written to the block in the cache and to the block in lower-level memory
¾ Write back (or copy back)
¾ With write back, information is written only to the cache. It will be written back to lower-level memory when the cache block is replaced
• The dirty bit:
¾ Each cache entry usually has a bit that specifies whether a write has occurred in that block or not…
¾ Helps reduce the frequency of writes to lower-level memory upon block replacement

What happens on a write?
• Write back versus write through:
¾ Write back advantageous because:
¾ Writes occur at the speed of the cache and don't incur the delay of lower-level memory
¾ Multiple writes to a cache block result in only 1 lower-level memory access
¾ Write through advantageous because:
¾ Lower levels of memory have the most recent copy of the data
• If the CPU has to wait for a write, we have a write stall
¾ 1 way around this is a write buffer
¾ Ideally, the CPU shouldn't have to stall during a write
¾ Instead, data is written to the buffer, which sends it to the lower levels of the memory hierarchy
What happens on a write?
• What if we want to write and the block we want to write to isn't in the cache?
• There are 2 common policies:
¾ Write allocate: (or fetch on write)
¾ The block is loaded on a write miss
¾ The idea behind this is that subsequent writes will be captured by the cache (ideal for a write back cache)
¾ No-write allocate: (or write around)
¾ The block is modified in the lower level and not loaded into the cache
¾ Usually used for write-through caches
¾ (subsequent writes still have to go to memory)

Write Policies: Analysis
• Write-back
¾ Implicit priority order to find the most up to date copy
¾ Requires much less bandwidth
¾ Careful about dropping updates due to losing track of dirty bits
• What about multiple levels of cache on the same chip?
¾ Use write through for on-chip levels and write back for off-chip
¾ SUN UltraSparc, PowerPC
• What about multi-processor caches?
¾ Write back gives better performance but also leads to cache coherence problems
¾ Need separate protocols to handle this problem… later

Write Policies: Analysis
• Write through
¾ Simple
¾ Correctness easily maintained and no ambiguity about which copy of a block is current
¾ Drawback is the bandwidth required; memory access time
¾ Must also decide whether to fetch and allocate space for the block to be written
¾ Write allocate: fetch such a block and put it in the cache
¾ Write-no-allocate: avoid the fetch, and install blocks only on read misses
¾ Good for cases of streaming writes which overwrite data

Modeling Cache Performance
• CPU time equation… again!
• CPU execution time = (CPU clk cycles + Memory stall cycles) * clk cycle time
• Memory stall cycles = number of misses * miss penalty = IC * (memory accesses/instruction) * miss rate * miss penalty
Cache Performance – Simplified Models
• Hit rate/ratio r = number of requests that are hits / total number of requests
• Cost of memory access = r*Ch + (1-r)*Cm
¾ Ch is the cost/time from the cache, Cm is the cost/time on a miss – fetch from memory
• Extend to multiple levels
¾ Hit ratios for level 1, level 2, etc.
¾ Access times for level 1, level 2, etc.
¾ r1*Ch1 + r2*Ch2 + (1 - r1 - r2)*Cm (a sketch in code follows below)

Average Memory Access Time
AMAT = HitTime + (1 - h) x MissPenalty
• Hit time: basic time of every access.
• Hit rate (h): fraction of accesses that hit
• Miss penalty: extra time to fetch a block from the lower level, including the time to replace the block and deliver it to the CPU
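A quick C illustration of the multi-level cost model above. The hit ratios and access costs are made-up numbers chosen for illustration, not values from the slides:

#include <stdio.h>

int main(void) {
    double r1 = 0.90, r2 = 0.08;            /* hit ratios at level 1 and level 2  */
    double c1 = 1.0, c2 = 10.0, cm = 100.0; /* access costs in clock cycles       */

    double cost = r1 * c1 + r2 * c2 + (1.0 - r1 - r2) * cm;
    printf("average cost per access = %.2f cycles\n", cost);  /* 3.70 */
    return 0;
}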

Memory stall cycles
• Memory stall cycles: number of cycles that the processor is stalled waiting for a memory access
• Performance in terms of mem stall cycles
¾ CPU time = (CPU cycles + Mem stall cycles) * Clk cycle time
¾ Mem stall cycles = number of misses * miss penalty
= IC * (Misses/Inst) * Miss Penalty
= IC * (Mem accesses/Inst) * Miss Rate * Miss Penalty
¾ Note: Read and Write misses combined into one miss rate

• Miss-oriented approach to memory access:
CPUtime = IC × (CPI_Execution + (MemAccess/Inst) × MissRate × MissPenalty) × CycleTime
CPUtime = IC × (CPI_Execution + (MemMisses/Inst) × MissPenalty) × CycleTime
¾ CPI_Execution includes ALU and Memory instructions
When do we get a miss?
• Instruction
¾ Fetch instruction – not found in cache
¾ How many instructions?
• Data access
¾ Load and Store instructions
¾ Data not found in cache
¾ How many data accesses?

• Separating out the Memory component entirely
¾ AMAT = Average Memory Access Time
¾ CPI_AluOps does not include memory instructions
CPUtime = IC × ((AluOps/Inst) × CPI_AluOps + (MemAccess/Inst) × AMAT) × CycleTime
AMAT = HitTime + MissRate × MissPenalty
= (HitTime_Inst + MissRate_Inst × MissPenalty_Inst) + (HitTime_Data + MissRate_Data × MissPenalty_Data)

Impact on Performance
• Suppose a processor executes at
¾ Clock = 200 MHz (5 ns per cycle),
¾ Ideal (no misses) CPI = 1.1
¾ Inst mix: 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get a 50 cycle miss penalty
• Suppose that 1% of instructions get the same miss penalty
• CPI = ideal CPI + average stalls per instruction

Impact on Performance..contd
• CPI = ideal CPI + average stalls per instruction =
1.1 (cycles/ins)
+ [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
+ [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
= (1.1 + 1.5 + .5) cycle/ins = 3.1
• AMAT = (1/1.3) x [1 + 0.01x50] + (0.3/1.3) x [1 + 0.1x50] = 2.54
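A small C check of the arithmetic in the example above (all numbers are taken from the slide):

#include <stdio.h>

int main(void) {
    double ideal_cpi   = 1.1;
    double data_stalls = 0.30 * 0.10 * 50;   /* 1.5 cycles/inst */
    double inst_stalls = 1.00 * 0.01 * 50;   /* 0.5 cycles/inst */
    double cpi = ideal_cpi + data_stalls + inst_stalls;

    /* 1.3 memory accesses per instruction: 1 fetch + 0.3 data accesses */
    double amat = (1.0/1.3) * (1 + 0.01*50) + (0.3/1.3) * (1 + 0.10*50);

    printf("CPI = %.1f, AMAT = %.2f cycles\n", cpi, amat);  /* 3.1 and 2.54 */
    return 0;
}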
Cache Performance: Memory access equations
• Using what we defined previously, we can say:
¾ Memory stall clock cycles =
Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
• Often, reads and writes are combined/averaged:
¾ Memory stall cycles = Memory accesses x Miss rate x Miss penalty (approximation)
• Also possible to factor in instruction count to get a "complete" formula:
¾ CPU time = IC x (CPI_exec + Mem. Stall Cycles/Instruction) x Clk

System Performance
• CPU time = IC * CPI * clock
¾ CPI depends on memory stall cycles
• CPU time = (CPU execution clock cycles + Memory stall clock cycles) * Clock cycle time
• Average memory access time = hit time + miss rate * miss penalty
¾ CPUs with a low CPI and high clock rates will be significantly impacted by cache miss rates (details in book)
• Improve performance
¾ = Decrease Memory stall cycles
¾ = Decrease Average memory access time/latency (AMAT)

Next: How to Improve Cache Performance?

AMAT = HitTime + MissRate × MissPenalty
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

• Appendix C: Basic Cache Concepts
• Chapter 5: Cache Optimizations
• Project 2: Study performance of benchmarks (project 1 benchmarks) using different cache organizations
Cache Examples

A 4-entry direct mapped cache with 4 data words/block. The 10-bit physical address is divided into a Tag (6 bits), Index (2 bits) and Offset (2 bits). Each cache entry has a valid bit (V), a dirty bit (D), a tag, and four data words (offsets 00, 01, 10, 11). After the accesses below, the entry at index 10 holds tag 101010 and the data words 35, 24, 17, 25.

1. Assume we want to read the following data words:

Tag    | Index | Offset   Address holds data
101010 | 10    | 00       35
101010 | 10    | 01       24
101010 | 10    | 10       17
101010 | 10    | 11       25

All of these physical addresses have the same tag, so they all map to the same cache entry.

2. If we read 101010 10 01 we want to bring data word 24 into the cache. Where would this data go? The index is 10, therefore the data word will go somewhere into the 3rd block of the cache (make sure you understand the terminology). More specifically, the data word goes into the 2nd position within the block, because the offset is '01'.

3. The principle of spatial locality says that if we use one data word, we'll probably use some data words that are close to it – that's why our block size is bigger than one data word. So we fill in the data word entries surrounding 101010 10 01 as well.

4. Therefore, if we get this pattern of accesses when we start a new program:
1.) 101010 10 00
2.) 101010 10 01
3.) 101010 10 10
4.) 101010 10 11
after we do the read for 101010 10 00 (word #1), we automatically get the data for words #2, 3 and 4. What does this mean? Accesses (2), (3) and (4) ARE NOT COMPULSORY MISSES.

5. What happens if we get an access to location 100011 | 10 | 11 (holding data 12)? The index bits tell us we need to look at cache block 10, so we compare the tag of this address – 100011 – to the tag associated with the current entry in that cache block – 101010. These DO NOT match. Therefore, the data associated with address 100011 10 11 IS NOT VALID. What we have here could be:
• A compulsory miss (if this is the 1st time the data was accessed)
• A conflict miss (if the data for address 100011 10 11 was present, but kicked out by 101010 10 00 – for example)

6. What if we change the way our cache is laid out – but so that it still holds 16 data words? One way we could do this would be to divide the physical address as Tag (6 bits), Index (1 bit), Offset (3 bits) – 2 cache blocks of 8 data words each. All of the following are true:
• This cache still holds 16 words
• Our block size is bigger – therefore this should help with compulsory misses
• The number of cache blocks has DECREASED
• This will INCREASE the # of conflict misses
7. What if we get the same pattern of accesses we had before? With the new layout the accesses become (note the different # of bits for offset and index now):
1.) 101010 1 000
2.) 101010 1 001
3.) 101010 1 010
4.) 101010 1 011
Note that there is now more data associated with a given cache block. However, we now have only 1 bit of index. Therefore, any address that comes along with a tag different from '101010' and a 1 in the index position is going to result in a conflict miss.

But we could also make our cache look like this: Tag (6 bits), Index (3 bits), Offset (1 bit) – 8 cache blocks of 2 data words each. Again, let's assume we want to read the following data words:

Tag    | Index | Offset   Address holds data
101010 | 100   | 0        35
101010 | 100   | 1        24
101010 | 101   | 0        17
101010 | 101   | 1        25

Assuming that all of these accesses occur for the 1st time (and occur sequentially), accesses (1) and (3) would result in compulsory misses, and accesses (2) and (4) would result in hits because of spatial locality. (The final state of the cache has tag 101010 in blocks 100 and 101.)

Note that by organizing a cache in this way, conflict misses will be reduced: there are now more blocks in the cache that the 10-bit physical address can map to.

8. All of these caches hold exactly the same amount of data – 16 different word entries. As a general rule of thumb, "long and skinny" caches help to reduce conflict misses, "short and fat" caches help to reduce compulsory misses, but a cross between the two is probably what will give you the best (i.e. lowest) overall miss rate.

But what about capacity misses? What's a capacity miss?
• The cache is only so big. We won't be able to store every block accessed in a program – we must swap them out!
• Can avoid capacity misses by making the cache bigger

Thus, to avoid capacity misses, we'd need to make our cache physically bigger – i.e. 32 word entries instead of 16. FYI, this will change the way the physical address is divided. Given our original pattern of accesses, we'd have Tag (5 bits), Index (3 bits), Offset (2 bits) – note the smaller tag and bigger index:
1.) 10101 | 010 | 00 = 35
2.) 10101 | 010 | 01 = 24
3.) 10101 | 010 | 10 = 17
4.) 10101 | 010 | 11 = 25

A toy simulation of the original 4-entry cache follows below.
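A toy C model of the 4-entry direct-mapped cache with 4 words per block used in the examples (tags only, no data array; the helper names are made up for illustration):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NBLOCKS 4

typedef struct { bool valid; uint32_t tag; } line_t;

static line_t cache[NBLOCKS];

static bool access_addr(uint32_t addr) {
    uint32_t index = (addr >> 2) & 0x3;   /* bits [3:2] of the 10-bit address */
    uint32_t tag   = addr >> 4;           /* bits [9:4]                       */
    line_t *l = &cache[index];
    if (l->valid && l->tag == tag) return true;   /* hit              */
    l->valid = true;                              /* miss: fill block */
    l->tag   = tag;
    return false;
}

int main(void) {
    /* The four sequential reads from example 4: 101010 10 00 .. 101010 10 11 */
    uint32_t addrs[] = { 0x2A8, 0x2A9, 0x2AA, 0x2AB };
    for (int i = 0; i < 4; i++)
        printf("access 0x%03x -> %s\n", addrs[i],
               access_addr(addrs[i]) ? "hit" : "miss");
    /* The first access misses (compulsory); the remaining three hit. */
    return 0;
}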
End of Examples

Next: Cache Optimization
• Techniques for minimizing AMAT
¾ Miss rate
¾ Miss penalty
¾ Hit time
• Role of compiler?

Improving Cache Performance

Memory stall cycles (recap)
• Memory stall cycles: number of cycles that the processor is stalled waiting for a memory access
• Performance in terms of mem stall cycles
¾ CPU time = (CPU cycles + Mem stall cycles) * Clk cycle time
¾ Mem stall cycles = number of misses * miss penalty
= IC * (Misses/Inst) * Miss Penalty
= IC * (Mem accesses/Inst) * Miss Rate * Miss Penalty
¾ Note: Read and Write misses combined into one miss rate

• Miss-oriented approach to memory access:
CPUtime = IC × (CPI_Execution + (MemAccess/Inst) × MissRate × MissPenalty) × CycleTime
CPUtime = IC × (CPI_Execution + (MemMisses/Inst) × MissPenalty) × CycleTime
¾ CPI_Execution includes ALU and Memory instructions
• Separating out the Memory component entirely
¾ AMAT = Average Memory Access Time
¾ CPI_AluOps does not include memory instructions
CPUtime = IC × ((AluOps/Inst) × CPI_AluOps + (MemAccess/Inst) × AMAT) × CycleTime
AMAT = HitTime + MissRate × MissPenalty
= (HitTime_Inst + MissRate_Inst × MissPenalty_Inst) + (HitTime_Data + MissRate_Data × MissPenalty_Data)

How to Improve Cache Performance?

AMAT = HitTime + MissRate × MissPenalty
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
• We will look at some basic techniques for 1, 2 and 3 today
• Next class we look at some more advanced techniques
Cache Misses
• Latencies at each level are determined by technology
• Miss rates are a function of
¾ Organization of the cache
¾ Access patterns of the program/application
• Understanding the causes of misses enables the designer to realize shortcomings of the design and discover creative cost-effective solutions
¾ The 3Cs model is a tool for classifying cache misses based on underlying cause

Reducing Miss Rate – 3C's Model
• Compulsory (or cold) misses
¾ Due to the program's first reference to a block – not in the cache so it must be brought to the cache (cold start misses)
¾ These are fundamental misses – cannot do anything about them
• Capacity
¾ Due to insufficient capacity – if the cache cannot contain all the blocks needed, capacity misses will occur because of blocks being discarded and later retrieved
¾ Not fundamental; a by-product of cache organization
¾ Increasing capacity can reduce some misses
• Conflict
¾ Due to imperfect allocation; if the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. Also called interference or collision misses
¾ Not fundamental; a by-product of cache organization; fully associative can eliminate conflict misses

3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate vs. cache size (1–128 KB) for 1-way through 8-way associativity, broken down into conflict, capacity and compulsory components.]

Miss Rate Reduction Strategies
• Increase block size – reduce compulsory misses
• Larger caches
¾ Larger size can reduce capacity, conflict misses
¾ Larger block size for fixed total size can lead to more capacity misses
¾ Can reduce conflict misses
• Higher associativity
¾ Can reduce conflict misses
¾ No effect on cold misses
• Compiler controlled pre-fetching (faulting/non-faulting)
• Code reorganization
¾ Merging arrays
¾ Loop interchange (row-column order)
¾ Loop fusion (2 loops into 1)
¾ Blocking
(1) Larger cache block size
• Easiest way to reduce the miss rate is to increase the cache block size
¾ This will help eliminate what kind of misses?
• Helps improve the miss rate b/c of the principle of locality:
¾ Temporal locality says that if something is accessed once, it'll probably be accessed again soon
¾ Spatial locality says that if something is accessed, something nearby it will probably be accessed
¾ Larger block sizes help with spatial locality
• Be careful though!
¾ Larger block sizes can increase the miss penalty!
¾ Generally, larger blocks reduce the # of total blocks in the cache

Larger cache block size (graph comparison)
[Figure: miss rate vs. block size (16–256 bytes) for 1K, 4K, 16K, 64K and 256K caches, assuming total cache size stays constant for each curve. Why this trend?]

Larger cache block size (example)
• Assume that to access the lower level of the memory hierarchy you:
¾ Incur a 40 clock cycle overhead
¾ Get 16 bytes of data every 2 clock cycles
¾ I.e. get 16 bytes in 42 clock cycles, 32 in 44, etc…
• Using the miss-rate data below, which block size has the minimum average memory access time?

Miss rates by block size and cache size:
Block Size | 1K     | 4K    | 16K   | 64K   | 256K
16         | 15.05% | 8.57% | 3.94% | 2.04% | 1.09%
32         | 13.34% | 7.24% | 2.87% | 1.35% | 0.70%
64         | 13.76% | 7.00% | 2.64% | 1.06% | 0.51%
128        | 16.64% | 7.78% | 2.77% | 1.02% | 0.49%
256        | 22.01% | 9.51% | 3.29% | 1.15% | 0.49%

Larger cache block size (ex. continued…)
• Recall that Average memory access time = Hit time + Miss rate X Miss penalty
• Assume a cache hit otherwise takes 1 clock cycle – independent of block size
• So, for a 16-byte block in a 1-KB cache…
¾ Average memory access time = 1 + (15.05% X 42) = 7.321 clock cycles
• And for a 256-byte block in a 256-KB cache…
¾ Average memory access time = 1 + (0.49% X 72) = 1.353 clock cycles
• Rest of the data is included on the next slide…
Larger cache block size (ex. continued)
Average memory access time (clock cycles) by block size and cache size; the lowest time in each column is the best block size for that cache size (a short program that recomputes this table follows below):

Block Size | Miss Penalty | 1K     | 4K    | 16K   | 64K   | 256K
16         | 42           | 7.321  | 4.599 | 2.655 | 1.857 | 1.485
32         | 44           | 6.870  | 4.186 | 2.263 | 1.594 | 1.308
64         | 48           | 7.605  | 4.360 | 2.267 | 1.509 | 1.245
128        | 56           | 10.318 | 5.357 | 2.551 | 1.571 | 1.274
256        | 72           | 16.847 | 7.847 | 3.369 | 1.828 | 1.353

Note: all of these block sizes are common in processors today.

Larger cache block sizes (wrap-up)
• We want to minimize cache miss rate & cache miss penalty at the same time!
• Selection of block size depends on the latency and bandwidth of lower-level memory:
¾ High latency, high bandwidth encourage large block size
¾ The cache gets many more bytes per miss for a small increase in miss penalty
¾ Low latency, low bandwidth encourage small block size
¾ Twice the miss penalty of a small block may be close to the penalty of a block twice the size
¾ A larger # of small blocks may reduce conflict misses
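A short C program that recomputes the table above from the miss rates and the miss-penalty rule (40 cycles + 2 cycles per 16 bytes of block); all numbers are from the slides:

#include <stdio.h>

int main(void) {
    int blocks[] = {16, 32, 64, 128, 256};
    /* miss rates (%) for cache sizes 1K, 4K, 16K, 64K, 256K */
    double miss[5][5] = {
        {15.05, 8.57, 3.94, 2.04, 1.09},
        {13.34, 7.24, 2.87, 1.35, 0.70},
        {13.76, 7.00, 2.64, 1.06, 0.51},
        {16.64, 7.78, 2.77, 1.02, 0.49},
        {22.01, 9.51, 3.29, 1.15, 0.49},
    };
    for (int i = 0; i < 5; i++) {
        int penalty = 40 + 2 * (blocks[i] / 16);   /* 42, 44, 48, 56, 72 */
        printf("%3dB (penalty %2d):", blocks[i], penalty);
        for (int j = 0; j < 5; j++)
            printf(" %7.3f", 1.0 + miss[i][j] / 100.0 * penalty);  /* AMAT */
        printf("\n");
    }
    return 0;
}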

Higher associativity
• Higher associativity can improve cache miss rates…
• Note that an 8-way set associative cache is…
¾ …essentially a fully-associative cache
• Helps lead to the 2:1 cache rule of thumb:
¾ It says: a direct mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2
• But, diminishing returns set in sooner or later…
¾ Greater associativity can cause increased hit time

Associativity
[Figure: 3Cs miss rate vs. cache size (1–128 KB) for 1-way through 8-way associativity, broken into conflict, capacity and compulsory components.]
Larger Block Size (fixed size & assoc)
[Figure: miss rate vs. block size (16–256 bytes) for 1K–256K caches. Larger blocks reduce compulsory misses, but at a fixed cache size they increase conflict misses.]
• What else drives up block size?

Cache Size
[Figure: 3Cs miss rate vs. cache size (1–128 KB) for 1-way through 8-way associativity.]
• Old rule of thumb: 2x size => 25% cut in miss rate
• What does it reduce?

How to Improve Cache Performance?

AMAT = HitTime + MissRate × MissPenalty
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Improving Performance: Reducing Cache Miss Penalty
• Multilevel caches –
¾ second and subsequent level caches can be large enough to capture many accesses that would have gone to main memory, and are faster (therefore less penalty)
• Critical word first and early restart –
¾ don't wait for the full cache block to be loaded; send the critical word first, restart the CPU and continue the load
• Priority to read misses over write misses
• Merging write buffer –
¾ if the address of a new entry matches that of one in the write buffer, combine them
• Victim Caches –
¾ cache discarded blocks elsewhere
¾ Remember what was discarded in case it is needed again
¾ Insert a small fully associative cache between the cache and its refill path
¾ This "victim cache" contains only blocks that were discarded as a result of a cache miss (replacement policy)
¾ Check the victim cache in case of a miss before going to the next lower level of memory

Early restart and critical word 1st
• With this strategy we're going to be impatient
¾ As soon as some of the block is loaded, see if the data is there and send it to the CPU
¾ (i.e. we don't wait for the whole block to be loaded)
• There are 2 general strategies:
¾ Early restart:
¾ As soon as the word gets to the cache, send it to the CPU
¾ Critical word first:
¾ Specifically ask for the needed word 1st, make sure it gets to the CPU, then get the rest of the cache block data

Victim caches
• 1st of all, what is a "victim cache"?
¾ A victim cache temporarily stores blocks that have been discarded from the main cache (usually not that big) – due to conflict misses
• 2nd of all, how does it help us?
¾ If there's a cache miss, instead of immediately going down to the next level of the memory hierarchy we check the victim cache first
¾ If the entry is there, we swap the victim cache block with the actual cache block
• Research shows:
¾ Victim caches with 1-5 entries help reduce conflict misses
¾ For a 4KB direct mapped cache, victim caches removed 20% - 95% of conflict misses!

Victim caches
[Figure: the CPU address is checked against the main cache tags and, on a miss, against the small fully associative victim cache before the request goes through the write buffer to lower level memory.]
Multi-Level caches
• The processor/memory performance gap makes us consider:
¾ If we should make caches faster to keep pace with CPUs
¾ If we should make caches larger to overcome the widening gap between CPU and main memory
• One solution is to do both:
¾ Add another level of cache (L2) between the 1st level cache (L1) and main memory
¾ Ideally L1 will be fast enough to match the speed of the CPU while L2 will be large enough to reduce the penalty of going to main memory

Second-level caches
• This will of course introduce a new definition for average memory access time:
¾ Hit time_L1 + Miss Rate_L1 * Miss Penalty_L1
¾ Where, Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 * Miss Penalty_L2
¾ So the 2nd level miss rate is measured on 1st level cache misses…
• A few definitions to avoid confusion:
¾ Local miss rate:
¾ # of misses in the cache divided by the total # of memory accesses to this cache – specifically Miss Rate_L2
¾ Global miss rate:
¾ # of misses in the cache divided by the total # of memory accesses generated by the CPU – specifically Miss Rate_L1 * Miss Rate_L2

Second-level caches (example)
• Example:
¾ In 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the various miss rates?
¾ Miss Rate L1 (local or global): 40/1000 = 4%
¾ Miss Rate L2 (local): 20/40 = 50%
¾ Miss Rate L2 (global): 20/1000 = 2%
• Note that the global miss rate is very similar to the single cache miss rate of the L2 cache
¾ (if the L2 size >> L1 size)
• The local miss rate is not a good measure of secondary caches – it's a function of the L1 miss rate
¾ Which can vary by changing the L1 cache
¾ Use the global cache miss rate when evaluating 2nd level caches!

Second-level caches (some "odds and ends" comments)
• The speed of the L1 cache will affect the clock rate of the CPU while the speed of the L2 cache will affect only the miss penalty of the L1 cache
¾ Which of course could affect the CPU in various ways…
• 2 big things to consider when designing the L2 cache are:
¾ Will the L2 cache lower the average memory access time portion of the CPI?
¾ If so, how much will it cost?
¾ In terms of HW, etc.
• 2nd level caches are usually BIG!
¾ Usually L1 is a subset of L2
¾ Should have few capacity misses in the L2 cache
¾ Only worry about compulsory and conflict for optimizations…
Second-level caches (example)
• Given the following data…
¾ 2-way set associativity increases hit time by 10% of a CPU clock cycle
¾ Hit time for the L2 direct mapped cache is: 10 clock cycles
¾ Local miss rate for the L2 direct mapped cache is: 25%
¾ Local miss rate for the L2 2-way set associative cache is: 20%
¾ Miss penalty for the L2 cache is: 50 clock cycles
• What is the impact of using a 2-way set associative cache on our miss penalty?

Second-level caches (example)
• Miss penalty_Direct mapped L2 =
¾ 10 + 25% * 50 = 22.5 clock cycles
• Adding the cost of associativity increases the hit cost by only 0.1 clock cycles
• Thus, Miss penalty_2-way set associative L2 =
¾ 10.1 + 20% * 50 = 20.1 clock cycles
• However, we can't have a fraction for a number of clock cycles (i.e. 10.1 isn't possible!)
• We'll either need to round up to 11 or optimize some more to get it down to 10. So…
¾ 10 + 20% * 50 = 20.0 clock cycles or
¾ 11 + 20% * 50 = 21.0 clock cycles (both better than 22.5)
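A small C check of the calculation above (L1 miss penalty = L2 hit time + L2 local miss rate × L2 miss penalty); the numbers are from the slide:

#include <stdio.h>

int main(void) {
    double l2_penalty = 50.0;

    double direct       = 10.0 + 0.25 * l2_penalty;   /* 22.5 cycles */
    double assoc        = 10.1 + 0.20 * l2_penalty;   /* 20.1 cycles */
    double rounded_down = 10.0 + 0.20 * l2_penalty;   /* 20.0 cycles */
    double rounded_up   = 11.0 + 0.20 * l2_penalty;   /* 21.0 cycles */

    printf("direct-mapped L2: %.1f cycles\n", direct);
    printf("2-way L2:         %.1f (or %.1f / %.1f with a whole-cycle hit time)\n",
           assoc, rounded_down, rounded_up);
    return 0;
}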

(5) Second level caches (some final random comments)
• We can reduce the miss penalty by reducing the miss rate of the 2nd level caches using techniques previously discussed…
¾ I.e. higher associativity or pseudo-associativity are worth considering b/c they have a small impact on 2nd level hit time
¾ And much of the average access time is due to misses in the L2 cache
• Could also reduce misses by increasing the L2 block size
• Need to think about something called the "multilevel inclusion property":
¾ In other words, all data in the L1 cache is always in L2…
¾ Gets complex for writes, and what not…

Hardware prefetching
• This one should intuitively be pretty obvious:
¾ Try and fetch blocks before they're even requested…
¾ This could work with both instructions and data
¾ Usually, prefetched blocks are placed either:
¾ Directly in the cache (what's a downside to this?)
¾ Or in some external buffer that's usually a small, fast cache
• Let's look at an example: (the Alpha AXP 21064)
¾ On a cache miss, it fetches 2 blocks:
¾ One is the new cache entry that's needed
¾ The other is the next consecutive block – it goes in a buffer
¾ How well does this buffer perform?
¾ A single-entry buffer catches 15-25% of misses
¾ With a 4-entry buffer, the hit rate improves about 50%
Hardware prefetching example
• What is the effective miss rate for the Alpha using instruction prefetching?
• How much larger of an instruction cache would the Alpha need to match the average access time if prefetching were removed?
• Assume:
¾ It takes 1 extra clock cycle if the instruction misses the cache but is found in the prefetch buffer
¾ The prefetch hit rate is 25%
¾ The miss rate for the 8-KB instruction cache is 1.10%
¾ Hit time is 2 clock cycles
¾ Miss penalty is 50 clock cycles

Hardware prefetching example
• We need a revised memory access time formula:
¾ Say: Average memory access time_prefetch =
¾ Hit time + miss rate * prefetch hit rate * 1 + miss rate * (1 – prefetch hit rate) * miss penalty
• Plugging in numbers to the above, we get:
¾ 2 + (1.10% * 25% * 1) + (1.10% * (1 – 25%) * 50) = 2.415
• To find the miss rate with equivalent performance, we start with the original formula and solve for miss rate:
¾ Average memory access time_no prefetching = Hit time + miss rate * miss penalty
¾ Results in: (2.415 – 2) / 50 = 0.83%
• The calculation suggests the effective miss rate of prefetching with the 8KB cache is 0.83%
• Actual miss rates: 16KB = 0.64% and 8KB = 1.10%
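A small C check of the prefetching arithmetic above (numbers from the slide):

#include <stdio.h>

int main(void) {
    double hit_time  = 2.0, miss_rate = 0.011;
    double pf_hit    = 0.25, penalty  = 50.0;

    double amat_pf = hit_time
                   + miss_rate * pf_hit * 1.0             /* found in prefetch buffer */
                   + miss_rate * (1.0 - pf_hit) * penalty;
    double equiv_miss_rate = (amat_pf - hit_time) / penalty;

    printf("AMAT with prefetching = %.3f cycles\n", amat_pf);              /* 2.415 */
    printf("equivalent miss rate  = %.2f%%\n", 100.0 * equiv_miss_rate);   /* 0.83  */
    return 0;
}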

Compiler-controlled prefetching
• It's also possible for the compiler to tell the hardware that it should prefetch instructions or data
¾ It (the compiler) could have values loaded into registers – called register prefetching
¾ Or, the compiler could just have data loaded into the cache – called cache prefetching
• Getting things from lower levels of memory can cause faults – if the data is not there…
¾ Ideally, we want prefetching to be "invisible" to the program; so often nonbinding/nonfaulting prefetching is used
¾ With the nonfaulting scheme, faulting instructions are turned into no-ops
¾ With the "faulting" scheme, data would be fetched (as "normal")

Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks – in software
• Instructions
¾ Reorder procedures in memory so as to reduce conflict misses
¾ Profiling to look at conflicts (using tools they developed)
• Data
¾ Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
¾ Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
¾ Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
¾ Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
Compiler optimizations – merging arrays
• This works by improving spatial locality
• For example, some programs may reference multiple arrays of the same size at the same time
¾ Could be bad:
¾ Accesses may interfere with one another in the cache
• A solution: generate a single, compound array…

/* Before: 4 sequential arrays */
int tag[SIZE];
int byte1[SIZE];
int byte2[SIZE];
int dirty[SIZE];

/* After: 1 array of structures */
struct merge {
  int tag;
  int byte1;
  int byte2;
  int dirty;
};
struct merge cache_block_entry[SIZE];

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improve spatial locality

Compiler optimizations – loop interchange
• Some programs have nested loops that access memory in non-sequential order
¾ Simply changing the order of the loops may make them access the data in sequential order…
• What's an example of this?

/* Before */
for (j = 0; j < 100; j = j + 1)
  for (k = 0; k < 5000; k = k + 1)
    x[k][j] = 2 * x[k][j];

/* After */
for (k = 0; k < 5000; k = k + 1)
  for (j = 0; j < 100; j = j + 1)
    x[k][j] = 2 * x[k][j];

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality. (But who really writes loops like this???)
Compiler optimizations – loop fusion
• This one's pretty obvious once you hear what it is…
• Seeks to take advantage of:
¾ Programs that have separate sections of code that access the same arrays in different loops
¾ Especially when the loops use common data
¾ The idea is to "fuse" the loops into one common loop
• What's the target of this optimization?

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j]; }

2 misses per access to a & c vs. one miss per access; improves temporal locality

Compiler optimizations – blocking
• This is probably the most "famous" of compiler optimizations to improve cache performance
• Tries to reduce misses by improving temporal locality
• To get a handle on this, you have to work through code on your own
¾ Homework!
• This is used mainly with arrays!
• Simplest case?? Row-major access

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { r = 0;
    for (k = 0; k < N; k = k+1) {
      r = r + y[i][k]*z[k][j];
    }
    x[i][j] = r;
  }

• Two Inner Loops:
¾ Read all NxN elements of z[]
¾ Read N elements of 1 row of y[] repeatedly
¾ Write N elements of 1 row of x[]
• Capacity misses are a function of N & cache size:
¾ 2N^3 + N^2 => (assuming no conflict; otherwise …)
• Idea: compute on a BxB submatrix that fits in the cache (see the sketch below)
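A sketch of the blocked ("after") version hinted at above, written in the same fragment style as the slide's code. B is the blocking factor; for brevity this sketch assumes N is a multiple of B and that x[][] starts out zeroed so partial sums can be accumulated:

/* After (blocked): operate on B x B submatrices so the pieces of x, y and z
   touched inside the inner loops fit in the cache */
for (jj = 0; jj < N; jj = jj + B)
  for (kk = 0; kk < N; kk = kk + B)
    for (i = 0; i < N; i = i + 1)
      for (j = jj; j < jj + B; j = j + 1)
      { r = 0;
        for (k = kk; k < kk + B; k = k + 1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }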
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
[Figure: performance improvement (roughly 1x–3x) from merged arrays, loop interchange, loop fusion and blocking on benchmarks including vpenta, gmty, btrix, mxm, cholesky (nasa7), tomcatv, spice and compress.]

Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing the hit time
• Again, recall our average memory access time equation:
¾ Hit time + Miss Rate * Miss Penalty
¾ We've talked about reducing the Miss Rate and the Miss Penalty – Hit time can also be a big component…
• On many machines cache accesses can affect the clock cycle time – so making this small is a good thing!

Small and simple caches
• Why is this good?
¾ Generally, smaller hardware is faster – so a small cache should help the hit time…
¾ If an L1 cache is small enough, it should fit on the same chip as the actual processing logic…
¾ The processor avoids time going off chip!
¾ Some designs compromise and keep the tags on chip and the data off chip – allows for a fast tag check and >> memory capacity
¾ Direct mapping also falls under the category of "simple"
¾ Relates to the point above as well – you can check the tag and read the data at the same time!
Avoid address translation during cache indexing
• This problem centers around virtual addresses. Should we send the virtual address to the cache?
¾ In other words, we have Virtual caches vs. Physical caches
¾ Why is this a problem anyhow?
¾ Well, recall from OS that a processor usually deals with processes
¾ What if process 1 uses a virtual address xyz and process 2 uses the same virtual address?
¾ The data in the cache would be totally different! – called aliasing
• Every time a process is switched logically, we'd have to flush the cache or we'd get false hits
¾ Cost = time to flush + compulsory misses from the empty cache
• I/O must interact with caches, so we need to deal with virtual addresses

Separate Instruction and Data Cache
• Multilevel cache is one option for design
• Another view:
¾ Separate the instruction and data caches
¾ Instead of a Unified Cache, have a separate I-cache and D-cache
¾ Problem: What size does each have?

Separate I-cache and D-cache example:
• We want to compare the following:
¾ A 16-KB data cache & a 16-KB instruction cache versus a 32-KB unified cache

Miss rates:
Size  | Inst. Cache | Data Cache | Unified Cache
16 KB | 0.64%       | 6.47%      | 2.87%
32 KB | 0.15%       | 4.82%      | 1.99%

• Assume a hit takes 1 clock cycle to process
• Miss penalty = 50 clock cycles
• In the unified cache, a load or store hit takes 1 extra clock cycle b/c having only 1 cache port = a structural hazard
• 75% of accesses are instruction references
• What's the avg. memory access time in each case?

example continued…
• 1st, need to determine the overall miss rate for the split caches:
¾ (75% x 0.64%) + (25% x 6.47%) = 2.10%
¾ This compares to the unified cache miss rate of 1.99%
• We'll use the average memory access time formula from a few slides ago but break it up into instruction & data references
• Average memory access time – split cache =
¾ 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50)
¾ (75% x 1.32) + (25% x 4.235) = 2.05 cycles
• Average memory access time – unified cache =
¾ 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50)
¾ (75% x 1.995) + (25% x 2.995) = 2.24 cycles
• Despite a higher miss rate, the access time is faster for the split cache!
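A small C check of the split vs. unified numbers above (75% instruction references, 1-cycle hit, 50-cycle miss penalty, and 1 extra cycle for data hits in the unified cache because of the single port):

#include <stdio.h>

int main(void) {
    double penalty = 50.0;
    double split   = 0.75 * (1 + 0.0064 * penalty)
                   + 0.25 * (1 + 0.0647 * penalty);
    double unified = 0.75 * (1 + 0.0199 * penalty)
                   + 0.25 * (1 + 1 + 0.0199 * penalty);
    printf("split:   %.2f cycles\n", split);     /* 2.05 */
    printf("unified: %.2f cycles\n", unified);   /* 2.24 */
    return 0;
}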
Reducing Time to Hit in Cache: Trace Cache
• Trace caches
¾ ILP technique
¾ A trace cache finds a dynamic sequence of instructions, including taken branches, to load into a cache block
¾ Branch prediction is folded into the cache

The Trace Cache Proposal
[Figure: basic blocks A–G. The static code layout follows the path taken only 10% of the time, while 90% of the dynamic execution follows the other path; an I-cache stores instructions along static line boundaries, whereas the trace cache stores them along dynamic trace-line boundaries.]
Cache Summary
• Cache performance crucial to overall performance
• Optimize performance
¾ Miss rates
¾ Miss penalty
¾ Hit time
• Software optimizations can lead to improved performance

• Next . . . Code Optimization in Compilers
