
Fundamentals of Computer Systems

Caches

Stephen A. Edwards and Martha A. Kim
Columbia University
Spring 2012

Illustrations Copyright © 2007 Elsevier

Computer Systems

Performance depends on which is slowest: the processor or the memory system.

[Figure: a processor connected to memory via CLK, MemWrite, WE, Address, WriteData, and ReadData signals.]

Memory Speeds Haven't Kept Up

[Plot: relative performance vs. year, 1980-2005, log scale from 1 to 100,000. The CPU curve climbs by orders of magnitude while the memory curve improves only slowly.]

Our single-cycle memory assumption has been wrong since 1980.

Hennessy and Patterson. Computer Architecture: A Quantitative Approach. 3rd ed., Morgan Kaufmann, 2003.

Your Choice of Memories

[Figure: fast, cheap, large -- pick two. On-chip SRAM is fast, commodity DRAM is cheap and large, and only a supercomputer's memory is both fast and large.]

Memory Hierarchy

The fundamental trick to making a big memory appear fast:

Technology   Cost ($/GB)   Access Time (ns)   Density (Gb/cm²)
SRAM         30 000        0.5                0.00025
DRAM         10            100                1
Flash        2             300                16
Hard Disk    0.1           10 000 000         —

Flash access time is for reads; writing is much, much slower.

A Modern Memory Hierarchy

My desktop machine (AMD Phenom 9600: quad-core, 2.3 GHz, 1.1-1.25 V, 95 W, 65 nm):

Level            Size             Technology
L1 Instruction   64 K per core    SRAM
L1 Data          64 K per core    SRAM
L2               512 K per core   SRAM
L3               2 MB             SRAM
Memory           4 GB             DRAM
Disk             500 GB           Magnetic

Temporal Locality

What path do your eyes take when you read this? Did you look at the drawings more than once?

[Figure: a page from Euclid's Elements.]

Spatial Locality

If you need something, you may also need something nearby

Memory Performance

Hit: data is found in that level of the memory hierarchy.
Miss: data is not found; look in the next level.

    Hit Rate = Number of hits / Number of accesses
    Miss Rate = Number of misses / Number of accesses

    Hit Rate + Miss Rate = 1

The expected access time E_L for a memory level L with latency t_L and miss rate M_L:

    E_L = t_L + M_L × E_(L+1)

Memory Performance Example

Two-level hierarchy: cache and main memory. A program executes 1000 loads and stores; 750 of these are found in the cache. What are the cache hit and miss rates?

    Hit Rate = 750 / 1000 = 75%
    Miss Rate = 1 − 0.75 = 25%

If the cache takes 1 cycle and the main memory 100, what's the expected access time?

    Expected access time of main memory: E_1 = 100 cycles
    Access time for the cache: t_0 = 1 cycle
    Cache miss rate: M_0 = 0.25

    E_0 = t_0 + M_0 × E_1 = 1 + 0.25 × 100 = 26 cycles
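The arithmetic above can be sketched in a few lines of Python (the function name is my own, not from the slides):

```python
# Expected access time E0 = t0 + M0 * E1 for a two-level hierarchy,
# reproducing the worked example above.

def expected_access_time(t_cache, miss_rate, t_next_level):
    # Every access pays the cache latency; misses also pay the next level.
    return t_cache + miss_rate * t_next_level

hits, accesses = 750, 1000
m0 = 1 - hits / accesses               # miss rate = 0.25
e0 = expected_access_time(1, m0, 100)  # 1 + 0.25 * 100
print(e0)                              # 26.0 cycles
```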

Cache

The highest levels of the memory hierarchy. Fast: level 1 typically has a 1-cycle access time. With luck, it supplies most data.

Cache design questions:
  What data does it hold?    Recently accessed data
  How is data found?         A simple address hash
  What data is replaced?     Often the oldest

What Data is Held in the Cache?

An ideal cache always correctly guesses what you want before you want it. A real cache is never that smart.

Caches exploit

Temporal locality: copy newly accessed data into the cache, replacing the oldest if necessary.
Spatial locality: copy nearby data into the cache at the same time. Specifically, always read and write a block at a time (e.g., 64 bytes), never a single byte.

A Direct-Mapped Cache

This simple cache has 8 sets, 1 block per set, and 4 bytes per block. To simplify answering "is this memory in the cache?", each byte is mapped to exactly one set: address 0x00000000 goes to set 0, 0x00000004 to set 1, ..., 0x0000001C to set 7, then 0x00000020 wraps around to set 0 again.

[Figure: a 2^30-word main memory mapped onto a 2^3-word cache; every word, from 0x00000000 up through 0xFFFFFFFC, lands in set (address >> 2) mod 8.]

Direct-Mapped Cache Hardware

Memory address bits:
  1-0:  byte offset within block
  4-2:  set number
  31-5: block tag

The cache is an 8-entry × (1 + 27 + 32)-bit SRAM: each set holds a valid bit V, a 27-bit tag, and 32 bits of data.

Cache hit if, in the set selected by the address, the block is valid (V = 1) and the stored tag matches address bits 31-5.

[Figure: the 27-bit tag is compared against the selected set's tag to produce Hit, and the set's 32-bit data is read out.]
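The bit slicing described above can be written out directly (a Python illustration of mine, not part of the slides):

```python
# Split a 32-bit address for this 8-set direct-mapped cache:
# bits 1:0 = byte offset, bits 4:2 = set number, bits 31:5 = tag.

def split_address(addr):
    byte_offset = addr & 0x3        # bits 1:0
    set_number = (addr >> 2) & 0x7  # bits 4:2 (8 sets)
    tag = addr >> 5                 # bits 31:5 (27 bits)
    return tag, set_number, byte_offset

# 0xFFFFFFE4 lands in set 1: tag 0x7FFFFFF, set 1, byte offset 0.
print(split_address(0xFFFFFFE4))
```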

Direct-Mapped Cache Behavior

A dumb loop: repeat 5 times: load from 0x4; load from 0xC; load from 0x8.

        li    $t0, 5
l1:     beq   $t0, $0, done
        lw    $t1, 0x4($0)
        lw    $t2, 0xC($0)
        lw    $t3, 0x8($0)
        addiu $t0, $t0, -1
        j     l1
done:

Assuming the cache starts empty, what's the miss rate?

[Figure: cache state when reading 0x4 the last time. Sets 1, 2, and 3 are valid with tag 00...00, holding mem[0x00...04], mem[0x00...08], and mem[0x00...0C].]

Access sequence: 4 C 8 | 4 C 8 | 4 C 8 | 4 C 8 | 4 C 8
Hit/miss:        M M M | H H H | H H H | H H H | H H H

Miss rate = 3/15 = 0.2 = 20%
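A tiny simulator (my sketch, not the authors' code) confirms the 20% figure:

```python
# Direct-mapped cache with 8 sets and one 4-byte block per set,
# replaying the dumb loop's load addresses.

def miss_rate(trace, num_sets=8):
    tags = [None] * num_sets                 # one stored tag (or None) per set
    misses = 0
    for addr in trace:
        block = addr >> 2                    # 4-byte blocks
        idx, tag = block % num_sets, block // num_sets
        if tags[idx] != tag:                 # miss: fill the set
            misses += 1
            tags[idx] = tag
    return misses / len(trace)

trace = [0x4, 0xC, 0x8] * 5                  # the dumb loop's loads
print(miss_rate(trace))                      # 0.2
```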

Direct-Mapped Cache: Conflict

A dumber loop: repeat 5 times: load from 0x4; load from 0x24.

        li    $t0, 5
l1:     beq   $t0, $0, done
        lw    $t1, 0x4($0)
        lw    $t2, 0x24($0)
        addiu $t0, $t0, -1
        j     l1
done:

Both 0x4 and 0x24 map to set 1, so each load evicts the other. Assuming the cache starts empty, what's the miss rate?

[Figure: cache state: only set 1 is valid, holding whichever of mem[0x00...04] and mem[0x00...24] was loaded last.]

Access sequence: 4 24 | 4 24 | 4 24 | 4 24 | 4 24
Hit/miss:        M M  | M M  | M M  | M M  | M M

Miss rate = 10/10 = 1 = 100%. Oops.

These are conflict misses: two recently accessed addresses map to the same cache set.

No Way! Yes Way! A 2-Way Set Associative Cache

Memory address bits:
  1-0:  byte offset within block
  3-2:  set number (4 sets)
  31-4: 28-bit block tag

[Figure: each of the 4 sets has two ways, each with its own valid bit, 28-bit tag, and 32 bits of data. Both ways' tags are compared against the address tag in parallel; Hit = Hit1 OR Hit0, and a multiplexer selects the data from whichever way hit.]

2-Way Set Associative Behavior

The dumber loop again: repeat 5 times: load from 0x4; load from 0x24.

        li    $t0, 5
l1:     beq   $t0, $0, done
        lw    $t1, 0x4($0)
        lw    $t2, 0x24($0)
        addiu $t0, $t0, -1
        j     l1
done:

Assuming the cache starts empty, what's the miss rate?

Access sequence: 4 24 | 4 24 | 4 24 | 4 24 | 4 24
Hit/miss:        M M  | H H  | H H  | H H  | H H

Miss rate = 2/10 = 0.2 = 20%. Associativity reduces conflict misses.

[Figure: set 1 holds both blocks at once: mem[0x00...04] in one way and mem[0x00...24] (tag 00...10) in the other.]
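Extending the earlier sketch to set associativity with LRU replacement (again my own illustration, using Python's OrderedDict to track recency) reproduces both results:

```python
from collections import OrderedDict

# Set-associative cache with LRU replacement and 4-byte blocks.
def miss_rate(trace, num_sets, ways):
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr >> 2
        s, tag = sets[block % num_sets], block // num_sets
        if tag in s:
            s.move_to_end(tag)               # hit: mark most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)        # evict the least recently used
            s[tag] = True
    return misses / len(trace)

trace = [0x4, 0x24] * 5                      # the dumber loop's loads
print(miss_rate(trace, num_sets=8, ways=1))  # 1.0: direct-mapped conflict
print(miss_rate(trace, num_sets=4, ways=2))  # 0.2: both blocks fit in set 1
```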

An Eight-Way Fully Associative Cache

[Figure 8.11: a single set with eight ways, each holding its own valid bit, tag, and data.]

No conflict misses: only compulsory or capacity misses. Either very expensive or slow because of all the associativity.

Exploiting Spatial Locality: Larger Blocks

This cache has 2 sets, 1 block per set (direct mapped), and 4 words per block.

Memory address bits:
  1-0:  byte offset within block
  3-2:  block (word) offset
  4:    set number
  31-5: 27-bit tag

Example: address 0x8000009C splits into tag 100...100 (bits 31-5), set 1 (bit 4), block offset 11 (bits 3-2), and byte offset 00 (bits 1-0).

[Figure: each set holds a valid bit, a 27-bit tag, and four 32-bit words; the block offset selects one of the four words through a multiplexer, and Hit requires V = 1 and a tag match.]

Direct-Mapped Cache Behavior w/ 4-Word Block

The dumb loop: repeat 5 times: load from 0x4; load from 0xC; load from 0x8.

        li    $t0, 5
l1:     beq   $t0, $0, done
        lw    $t1, 0x4($0)
        lw    $t2, 0xC($0)
        lw    $t3, 0x8($0)
        addiu $t0, $t0, -1
        j     l1
done:

Assuming the cache starts empty, what's the miss rate?

[Figure 8.14: cache state when reading 0xC. Set 0 is valid with tag 00...00, holding mem[0x00...00] through mem[0x00...0C].]

Access sequence: 4 C 8 | 4 C 8 | 4 C 8 | 4 C 8 | 4 C 8
Hit/miss:        M H H | H H H | H H H | H H H | H H H

Miss rate = 1/15 ≈ 0.067 = 6.7%

Larger blocks reduce compulsory misses by exploiting spatial locality.
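The same simulator sketch with 16-byte (4-word) blocks shows the effect of spatial locality (my illustration, not the authors' code):

```python
# Direct-mapped cache: 2 sets, one 4-word (16-byte) block per set.
# A miss fills the whole block, so neighboring words hit afterwards.

def miss_rate(trace, num_sets=2, block_bytes=16):
    tags = [None] * num_sets
    misses = 0
    for addr in trace:
        block = addr // block_bytes          # which 16-byte block
        idx, tag = block % num_sets, block // num_sets
        if tags[idx] != tag:
            misses += 1
            tags[idx] = tag
    return misses / len(trace)

trace = [0x4, 0xC, 0x8] * 5                  # all three words share block 0
print(miss_rate(trace))                      # 1 miss in 15, about 0.067
```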

Stephen's Desktop Machine Revisited

On-chip caches of the AMD Phenom 9600 (quad-core, 2.3 GHz, 1.1-1.25 V, 95 W, 65 nm):

Cache   Size             Sets   Ways     Block
L1I     64 K per core    512    2-way    64-byte
L1D     64 K per core    512    2-way    64-byte
L2      512 K per core   512    16-way   64-byte
L3      2 MB             1024   32-way   64-byte

Intel On-Chip Caches

Chip          Year   Freq (MHz)   L1 Data        L1 Instr            L2
80386         1985   16-25        none           none                off-chip
80486         1989   25-100       8K unified     —                   off-chip
Pentium       1993   60-300       8K             8K                  off-chip
Pentium Pro   1995   150-200      8K             8K                  256K-1M (MCM)
Pentium II    1997   233-450      16K            16K                 256K-512K (Cartridge)
Pentium III   1999   450-1400     16K            16K                 256K-512K
Pentium 4     2001   1400-3730    8-16K          12k µop trace cache 256K-2M
Pentium M     2003   900-2130     32K            32K                 1M-2M
Core 2 Duo    2005   1500-3000    32K per core   32K per core        2M-6M
