
ECE 2300

Digital Logic & Computer Organization


Spring 2018

More Caches
Measuring Performance

Lecture 22: 1
Announcements

• HW7 due tomorrow 11:59pm

• Prelab 5(c) due Saturday 3pm

• Lab 6 (last one) released

• HW8 (last one) to be released tonight

Another LRU Replacement Example
• 2-way set associative; (*) = LRU block (1 bit in this case)

Block     Cache   Hit/    Cache contents after access
address   index   miss    Set 0                         Set 1
0         0       miss    Mem[0]
4         0       miss    Mem[0] (*)    Mem[4]
2         0       miss    Mem[2]        Mem[4] (*)
6         0       miss    Mem[2] (*)    Mem[6]
8         0       miss    Mem[8]        Mem[6] (*)
0         0       miss    Mem[8] (*)    Mem[0]
4         0       miss    Mem[4]        Mem[0] (*)
2         0       miss    Mem[4] (*)    Mem[2]
6         0       miss    Mem[6]        Mem[2] (*)
8         0       miss    Mem[6] (*)    Mem[8]
2         0       miss    Mem[2]        Mem[8] (*)
6         0       miss    Mem[2] (*)    Mem[6]
2         0       hit     Mem[2]        Mem[6] (*)
0         0       miss    Mem[2] (*)    Mem[0]

Color code on the slide: cold miss, conflict miss, capacity miss
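
The hit/miss column above can be reproduced with a short simulation. This is a minimal sketch, not the course's reference code: it assumes the cache has two sets of two ways each (which is what the Set 0 / Set 1 columns suggest) and that the cache index is the block address modulo the number of sets. The function name `simulate_lru` and its parameters are made up for illustration.

```python
def simulate_lru(block_addresses, num_sets=2, ways=2):
    """Return 'hit'/'miss' for each access to a set-associative LRU cache."""
    # Each set is a list of block addresses, ordered from least to most
    # recently used (assumed layout: 2 sets x 2 ways).
    sets = [[] for _ in range(num_sets)]
    results = []
    for block in block_addresses:
        lines = sets[block % num_sets]   # cache index = block address mod num_sets
        if block in lines:
            lines.remove(block)          # hit: re-appended below as most recent
            results.append("hit")
        else:
            if len(lines) == ways:
                lines.pop(0)             # miss in a full set: evict the LRU block
            results.append("miss")
        lines.append(block)              # accessed block is now most recently used
    return results


trace = [0, 4, 2, 6, 8, 0, 4, 2, 6, 8, 2, 6, 2, 0]
outcomes = simulate_lru(trace)
print(outcomes.count("miss"), "misses,", outcomes.count("hit"), "hit")
```

Every block address in the trace is even, so all accesses land in set 0 and compete for its two ways, which is why only the second-to-last access hits.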


What About Writes?
• Where do we put the result of a store?

• Cache hit (block is in cache)
– Write new data value to the cache
– Also write to memory (write through)
– Don’t write to memory (write back)
  • Requires an additional dirty bit for each cache block
  • Writes back to memory when a dirty cache block is evicted

• Cache miss (block is not in cache)
– Allocate the line (bring it into the cache): write allocate
– Write to memory without allocation: no write allocate, or write around

Write Through Example
• Assume write allocate
• Size of each block is 8 bytes
• Cache holds 2 blocks
• Memory holds 8 blocks
• Memory address (6 bits): 2 tag bits | 1 index bit | 3 byte offset bits
• Each cache entry (index 0 and index 1) holds: V (valid bit) | tag | data
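
The address layout can be sketched in code as follows. This is only an illustration of the 2/1/3-bit split described above; the name `split_address` and the two constants are not from the lecture.

```python
OFFSET_BITS = 3  # 8-byte blocks
INDEX_BITS = 1   # the cache holds 2 blocks, one per index


def split_address(addr):
    """Split a 6-bit byte address into (tag, index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(split_address(0b011100))  # (1, 1, 4): tag 01, index 1, offset 100
```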

Write Through
• Registers: R0 = 333, R1 = 444, R2 = 555, R3 = 666
• Initial memory, listed as block address: word at offset 000, word at offset 100
  000: 100, 110    001: 120, 130    010: 140, 150    011: 160, 170
  100: 180, 190    101: 200, 210    110: 220, 230    111: 240, 250
• The cache starts empty (V = 0 at index 0 and index 1)
• Cache data below is listed in the same order as memory (offset 000 first)

• M[000000] <= R0: miss (index 0 is invalid)
– Allocate: load block 000 into index 0 (V = 1, tag 00, data 100, 110)
– Write 333 to the cache (data 333, 110) and to memory (block 000 = 333, 110)

• M[000100] <= R1: hit (index 0, tag 00)
– Write 444 to the cache (data 333, 444) and to memory (block 000 = 333, 444)

• M[010000] <= R2: miss (index 0 holds tag 00, not 01)
– Allocate: replace the entry with block 010 (V = 1, tag 01, data 140, 150)
– Write 555 to the cache (data 555, 150) and to memory (block 010 = 555, 150)

• M[011100] <= R3: miss (index 1 is invalid)
– Allocate: load block 011 into index 1 (V = 1, tag 01, data 160, 170)
– Write 666 to the cache (data 160, 666) and to memory (block 011 = 160, 666)

• Final state
– Cache index 0: V = 1, tag 01, data 555, 150
– Cache index 1: V = 1, tag 01, data 160, 666
– Memory: 000 = 333, 444; 010 = 555, 150; 011 = 160, 666; all other blocks unchanged
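
The write-through walkthrough can be replayed with a small simulation. This is a sketch under stated assumptions rather than the lecture's own code: it assumes 4-byte words (two per 8-byte block), a direct-mapped cache with two entries, and write allocate. The function name `run_write_through` and the data layout are hypothetical.

```python
def run_write_through(stores, memory):
    """Apply (address, value) stores to a 2-entry direct-mapped cache.

    memory maps block address -> [word at offset 0, word at offset 4]
    and is updated in place. Returns (hit/miss outcomes, final cache).
    """
    cache = [None, None]              # each entry: None (invalid) or (tag, words)
    outcomes = []
    for addr, value in stores:
        block = addr >> 3             # drop the 3 byte-offset bits
        index, tag = block & 1, block >> 1
        word = (addr & 0b111) // 4    # assumed 4-byte words within the block
        entry = cache[index]
        if entry is not None and entry[0] == tag:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            # Write allocate: fetch the block. The old block is simply
            # dropped, because memory already holds its latest contents.
            cache[index] = (tag, list(memory[block]))
        cache[index][1][word] = value  # update the cache ...
        memory[block][word] = value    # ... and memory on every store
    return outcomes, cache


memory = {b: [100 + 20 * b, 110 + 20 * b] for b in range(8)}
stores = [(0b000000, 333), (0b000100, 444), (0b010000, 555), (0b011100, 666)]
outcomes, cache = run_write_through(stores, memory)
print(outcomes)
print(memory[0], memory[2], memory[3])
```

Memory is current after every store, so an eviction never needs to copy data back.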

Write Back Example
• Assume write allocate
• Size of each block is 8 bytes
• Cache holds 2 blocks
• Memory holds 8 blocks
• Memory address (6 bits): 2 tag bits | 1 index bit | 3 byte offset bits
• Each cache entry (index 0 and index 1) holds: V (valid bit) | D (dirty bit) | tag | data

Write Back
• Registers: R0 = 333, R1 = 444, R2 = 555, R3 = 666
• Initial memory, listed as block address: word at offset 000, word at offset 100
  000: 100, 110    001: 120, 130    010: 140, 150    011: 160, 170
  100: 180, 190    101: 200, 210    110: 220, 230    111: 240, 250
• The cache starts empty (V = 0, D = 0 at index 0 and index 1)
• Cache data below is listed in the same order as memory (offset 000 first)

• M[000000] <= R0: miss (index 0 is invalid)
– Allocate: load block 000 into index 0 (V = 1, D = 0, tag 00, data 100, 110)
– Write 333 to the cache only and set the dirty bit (D = 1, data 333, 110)
– Memory block 000 still holds 100, 110

• M[000100] <= R1: hit (index 0, tag 00)
– Write 444 to the cache only (D = 1, data 333, 444)
– Memory block 000 still holds 100, 110

• M[010000] <= R2: miss (index 0 holds tag 00, not 01)
– The evicted block is dirty, so write it back first (memory block 000 = 333, 444)
– Allocate: load block 010 into index 0 (V = 1, D = 0, tag 01, data 140, 150)
– Write 555 to the cache only and set the dirty bit (D = 1, data 555, 150)
– Memory block 010 still holds 140, 150

• M[011100] <= R3: miss (index 1 is invalid)
– Allocate: load block 011 into index 1 (V = 1, D = 0, tag 01, data 160, 170)
– Write 666 to the cache only and set the dirty bit (D = 1, data 160, 666)
– Memory block 011 still holds 160, 170

• Final state
– Cache index 0: V = 1, D = 1, tag 01, data 555, 150
– Cache index 1: V = 1, D = 1, tag 01, data 160, 666
– Memory: 000 = 333, 444; all other blocks unchanged (blocks 010 and 011 are stale until their cache entries are evicted)
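
The write-back walkthrough can be replayed the same way. This is again a sketch under assumptions (4-byte words, direct-mapped cache with two entries, write allocate), with a hypothetical function name `run_write_back`. It differs from write through in the dirty bit and the write-back on eviction.

```python
def run_write_back(stores, memory):
    """Apply (address, value) stores to a 2-entry direct-mapped write-back cache.

    memory maps block address -> [word at offset 0, word at offset 4]
    and is only updated when a dirty block is evicted.
    """
    cache = [None, None]   # each entry: None (invalid) or dict(tag, dirty, words)
    outcomes = []
    for addr, value in stores:
        block = addr >> 3             # drop the 3 byte-offset bits
        index, tag = block & 1, block >> 1
        word = (addr & 0b111) // 4    # assumed 4-byte words within the block
        entry = cache[index]
        if entry is not None and entry["tag"] == tag:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            if entry is not None and entry["dirty"]:
                # Evicting a dirty block: copy it back to memory first.
                old_block = (entry["tag"] << 1) | index
                memory[old_block] = list(entry["words"])
            # Write allocate: fetch the new block, initially clean.
            entry = {"tag": tag, "dirty": False, "words": list(memory[block])}
            cache[index] = entry
        entry["words"][word] = value  # the store updates the cache only
        entry["dirty"] = True         # memory is now out of date for this block
    return outcomes, cache


memory = {b: [100 + 20 * b, 110 + 20 * b] for b in range(8)}
stores = [(0b000000, 333), (0b000100, 444), (0b010000, 555), (0b011100, 666)]
outcomes, cache = run_write_back(stores, memory)
print(outcomes)
print(memory[0], memory[2], memory[3])
```

Only block 000 reaches memory in this trace. The stores to blocks 010 and 011 stay in the cache until those entries are evicted.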

Cache Hierarchy
• Time to get a block from memory is so long that performance suffers even with a low miss rate

• Example: 3% miss rate, 100 cycles to main memory
– 0.03 × 100 = 3 extra cycles on average to access instructions or data

• Solution: Add another level of cache

Pipeline with a Cache Hierarchy
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. The instruction fetch stage reads an L1 instruction cache (KB) and the memory stage accesses an L1 data cache (KB). Both L1 caches are backed by an L2 cache (MB), which is backed by main memory (GB).]

Cache Hierarchy
• Level 1 (L1) instruction and data caches
– Small, but very fast
• Level 2 (L2) cache handles L1 misses
– Larger and slower than L1, but much faster than main memory
– L1 data are also present in L2
• Main memory handles L2 cache misses

• Example: assume 1 cycle to access L1 (3% miss rate), 10 cycles to L2, 10% L2 miss rate, 100 cycles to main memory
– How many cycles on average for instruction/data access?
  1 + 0.03 × (10 + 0.1 × 100) = 1.6 cycles
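
The average access time arithmetic can be checked with a few lines of code. This is a minimal sketch: the function name `average_access_cycles` is illustrative, and the L1-only case assumes the same 1-cycle L1 access as the two-level example.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty):
    """Average cycles per access: hit time plus miss rate times miss penalty."""
    return hit_cycles + miss_rate * miss_penalty


# L1 only: every L1 miss goes to main memory (100 cycles).
l1_only = average_access_cycles(1, 0.03, 100)

# L1 + L2: an L1 miss costs the L2 access (10 cycles) plus, for the 10% of
# those accesses that also miss in L2, the trip to main memory (100 cycles).
l2_penalty = average_access_cycles(10, 0.10, 100)
with_l2 = average_access_cycles(1, 0.03, l2_penalty)

print(f"L1 only: {l1_only:.2f} cycles, L1 + L2: {with_l2:.2f} cycles")
```

The same function is applied at each level, since the miss penalty of one level is the average access time of the level below it.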

How Do We Measure Performance?
• Execution time: The time between the start and completion of a program (or task)

• Throughput: Total amount of work done in a given time

• Improving performance means
– Reducing execution time, or
– Increasing throughput

CPU Execution Time
• Amount of time the CPU takes to run a program

• Derivation

CPU time = I × CPI × CT

– I: number of instructions in the program
– CPI: average number of cycles per instruction
– CT: clock cycle time (1/frequency)
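
As a quick illustration of how the three factors combine, the sketch below multiplies them out. The numbers (one billion instructions, a CPI of 1.5, a 2 GHz clock) are hypothetical and chosen only to make the arithmetic easy; they do not come from the lecture.

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """CPU time = instruction count x cycles per instruction x cycle time."""
    cycle_time = 1.0 / clock_hz       # clock cycle time is 1/frequency
    return instruction_count * cpi * cycle_time


# Hypothetical program: 1e9 instructions, average CPI of 1.5, 2 GHz clock.
t = cpu_time_seconds(1_000_000_000, 1.5, 2_000_000_000)
print(f"{t:.2f} s")
```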

Instruction Count (I)
• Total number of instructions in the given
program

• Factors
– Instruction set
– Mix of instructions chosen by the compiler

Cycle Time (CT)
• Clock period (1/frequency)

• Factors
– Instruction set
– Structure of the processor and memory hierarchy

Cycles Per Instruction (CPI)
• Average number of cycles required to execute
each instruction

• Factors
– Instruction set
– Mix of instructions chosen by the compiler
– Ordering of the instructions by the compiler
– Structure of the processor and memory hierarchy

Processor Organization
Impact on CPI (Example 1)
[Pipeline diagram over clock cycles CC1 to CC9; each instruction passes through IM, Reg, ALU, DM, Reg]

ADD  R1,R2,R3
OR   R4,R1,R3
SUB  R5,R2,R1
AND  R6,R1,R2
ADDI R7,R7,3

• With forwarding: reduced stall cycles
• Lower CPI, potentially reduced execution time

Processor Organization
Impact on CPI (Example 2)
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, in which an equality comparator (=?) and the sign bit feed the control unit (CU) that generates the control signals]

• Only one delay slot needed with branch resolved in ID
• Lower CPI

Compiler Impact on CPI (Example 3)
[Pipeline diagram over clock cycles CC1 to CC9; each instruction passes through IM, Reg, ALU, DM, Reg]

   BEQ  R2,R3,X
   NOP -> ADDI R7,R7,3    (branch delay slot)
   OR   R4,R1,R3
   SUB  R5,R2,R1
X: AND  R6,R1,R2
   ADDI R7,R7,3
   ...

• Filling the branch delay slot with a useful instruction

A Rough Breakdown of CPI
• CPIbase is the base CPI in an ideal scenario where instruction fetches and data memory accesses incur no extra delay

• CPImemhier is the (additional) CPI spent accessing the memory hierarchy when a miss occurs in the caches

• CPItotal is the overall CPI
– CPItotal = CPIbase + CPImemhier

Impact of L1 Caches
• With L1 caches
– L1 instruction cache miss rate = 2%
– L1 data cache miss rate = 4%
– Miss penalty = 100 cycles (access main memory)
– 20% of all instructions are loads, 10% are stores

• CPImemhier = 0.02 × 100 + 0.3 × 0.04 × 100 = 3.2

Impact of L1+L2 Caches
• With L1 and L2 caches
– L1 instruction cache miss rate = 2%
– L1 data cache miss rate = 4%
– L2 access time = 15 cycles
– L2 miss rate = 25%
– L2 miss penalty = 100 cycles (access main memory)
– 20% of all instructions are loads, 10% are stores

• CPImemhier = 0.02 × (15 + 0.25 × 100) + 0.30 × 0.04 × (15 + 0.25 × 100) = 1.28
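
Both CPImemhier calculations (L1 only, and L1 with L2) follow the same pattern, sketched below. The function name `cpi_memhier` and its parameter names are illustrative. The sketch assumes, as the slides' formulas do, that every instruction is fetched once and that only loads and stores access the data cache.

```python
def cpi_memhier(i_miss_rate, d_miss_rate, mem_instr_fraction, miss_penalty):
    """Extra cycles per instruction caused by L1 cache misses."""
    fetch_stalls = i_miss_rate * miss_penalty                       # every instruction is fetched
    data_stalls = mem_instr_fraction * d_miss_rate * miss_penalty   # loads and stores only
    return fetch_stalls + data_stalls


loads_and_stores = 0.20 + 0.10   # 20% loads + 10% stores

# L1 only: an L1 miss pays the full 100-cycle trip to main memory.
l1_only = cpi_memhier(0.02, 0.04, loads_and_stores, 100)

# L1 + L2: an L1 miss pays 15 cycles for L2, plus 100 more on the 25% that miss in L2.
l2_penalty = 15 + 0.25 * 100
with_l2 = cpi_memhier(0.02, 0.04, loads_and_stores, l2_penalty)

print(f"L1 only: {l1_only:.2f}, L1 + L2: {with_l2:.2f}")
```

Adding the L2 cache shrinks the effective L1 miss penalty from 100 cycles to 40, which is what drives CPImemhier from 3.2 down to 1.28.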

Before Next Class
• H&H 8.4

Next Time

Virtual Memory
