Lecture 15
Carnegie Mellon
Register %rdx
• Forward from the memory stage
Register %rax
• Forward from the execute stage
Bypass Paths
Decode Stage
• Forwarding logic selects valA and valB
• Normally from register file
• Forwarding: get valA or valB from later pipeline stage
Forwarding Sources
• Execute: valE
• Memory: valE, valM
• Write back: valE, valM
Out-of-order Execution
• Compiler could do this, but has limitations
• Generally done in hardware
Long-latency instruction: the load r3 = MEM[r0] forces the pipeline to stall.

  In-order:         Out-of-order:
  r0 = r1 + r2      r0 = r1 + r2
  r3 = MEM[r0]      r3 = MEM[r0]
  r4 = r3 + r6      r7 = r5 + r1
  r7 = r5 + r1      …
  …                 r4 = r3 + r6
Out-of-order Execution

Is this reordering correct?
  In-order:         Reordered:
  r0 = r1 + r2      r0 = r1 + r2
  r3 = MEM[r0]      r3 = MEM[r0]
  r4 = r3 + r6      r6 = r5 + r1
  r6 = r5 + r1      …
  …                 r4 = r3 + r6
No: r6 = r5 + r1 now writes r6 before r4 = r3 + r6 has read the old value (a write-after-read hazard).

And is this one correct?
  In-order:         Reordered:
  r0 = r1 + r2      r0 = r1 + r2
  r3 = MEM[r0]      r3 = MEM[r0]
  r4 = r3 + r6      r4 = r5 + r1
  r4 = r5 + r1      …
  …                 r4 = r3 + r6
No: the two writes to r4 now complete in the wrong order, leaving r4 holding r3 + r6 instead of r5 + r1 (a write-after-write hazard).
Ideal Memory
• Low access time (latency)
• High capacity
• Low cost
• High bandwidth (to support multiple accesses in parallel)
The Problem
• Ideal memory’s requirements oppose each other
• Bigger is slower
  • A bigger memory takes longer to determine the location of the requested data
[Figure: a memory array with an n-bit Address input, CE (chip enable) and WE (write enable) control signals, and a Content port.]
Non-volatile Memories
• DFF, DRAM, and SRAM are volatile memories
  • They lose their contents when powered off
• Non-volatile memories retain data even when powered off
  • Flash (~5 years retention)
  • Hard disk (~5 years)
  • Tape (~15-30 years)
Summary of Trade-Offs
• Faster is more expensive (dollars and chip area)
  • SRAM: < $10 per megabyte
  • DRAM: < $1 per megabyte
  • Hard disk: < $1 per gigabyte
• Larger capacity is slower
  • Flip-flops / small SRAM: sub-nanosecond
  • SRAM: KB ~ MB, ~nanoseconds
  • DRAM: gigabytes, ~50 nanoseconds
  • Hard disk: terabytes, ~10 milliseconds
Memory Hierarchy
• Fundamental tradeoff
  • Fast memory: small
  • Large memory: slow
• Balance latency, cost, size, and bandwidth

[Figure: CPU → Registers (DFF) → Cache (SRAM) → Main Memory (DRAM) → Hard Disk]
L1 cache (SRAM): ~32 KB, ~nanoseconds
L2 cache (SRAM): 512 KB ~ 1 MB, many nanoseconds
L3 cache (SRAM): …
Hard disk: 100 GB, ~10 milliseconds
[Figure: a quad-core die: CORE 0-3, each with a private L1 cache (L1 CACHE 0-3) and L2 cache (L2 CACHE 0-3); a SHARED L3 CACHE; and a DRAM memory controller with a DRAM interface connecting to off-chip DRAM modules.]
My Desktop
My Server
[Figure: RF (DFF) → Cache (SRAM) → Main Memory (DRAM) → Disk]
[Figure: the CPU and its registers, a cache ($), and memory.]
Registers vs. Cache
• If the data is in memory, the hardware may keep a copy of it in the cache to speed up access.
Locality
• Temporal locality: recently referenced items are likely to be referenced again in the near future
• Spatial locality: items with nearby addresses tend to be referenced close together in time
Locality Example

int sum_array(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

• Data references
  • Spatial locality: array elements are referenced in succession (a stride-1 reference pattern)
  • Temporal locality: the variable sum is referenced on every iteration
• Instruction references
  • Spatial locality: instructions are referenced in sequence
  • Temporal locality: the loop body is executed repeatedly
Cache Illustrations

[Figure: the CPU sits above a cache (small but fast) holding blocks 8, 9, 14, and 3; below is memory (big but slow) holding blocks 0-15.]

Hit: the data at address 14 is needed, so the CPU requests it. Address 14 is in the cache: hit! The cache supplies the data directly.

Miss: the data at address 12 is needed, so the CPU requests it. Address 12 is not in the cache, so the block is fetched from memory, address 12 is stored in the cache, and the data is supplied to the CPU.
Hit Rate = # Hits / # Accesses
A Simple Cache

[Figure: a 4-entry cache (locations 00-11, each with a Content field and a Valid? bit) alongside a 16-location memory (addresses 0000-1111).]

• 16 memory locations, 4 cache locations
  • A cache location is also called a cache line
• Every cache line has a valid bit, indicating whether that line contains valid data; all valid bits are 0 initially
• For now, assume cache-line size == memory-location size == 1 B
• Assume each memory location can reside in only one cache line
• The cache is smaller than memory (obviously)
  • Thus, not all memory locations can be cached at the same time
Cache Placement
• Given a memory address, say 0x0001, we want to put the data stored there into the cache; where does the data go?
[Figure: the cache address (CA) is formed from the low-order bits of the memory address: CA = addr[1:0].]
Direct-Mapped Cache

[Figure: the 4-line cache, each line now holding a tag (addr[3:2]) alongside its content; CA = addr[1:0]; comparing the stored tag with addr[3:2] of the requested address produces the Hit? signal.]

• Direct-mapped cache
  • CA = ADDR[1], ADDR[0]
  • Always use the low-order address bits
• Multiple addresses can be mapped to the same cache location
  • E.g., 0010 and 1010
• How do we differentiate between different memory locations that are mapped to the same cache location?
  • Add a tag field for that purpose
  • What should the tag field be? ADDR[3] and ADDR[2] in this particular example