UNIT 2: Cache Optimization
5.1 Introduction
The five classic components of a computer: control and datapath (together forming the processor), memory, input, and output.
2
Memory Hierarchy
Levels of the memory hierarchy (upper levels are smaller and faster; lower levels are larger and slower):

Level            Typical capacity   Access time   Unit of transfer to next level
CPU registers    ~500 bytes         0.25 ns
Cache            64 KB              1 ns          blocks
Main memory      512 MB             100 ns        pages
Disk (I/O)       100 GB             5 ms          files
3
5.2 ABCs of Caches
• Cache:
– In this textbook it mainly means the first level of the memory
hierarchy encountered once the address leaves the CPU
– the term is also applied whenever buffering is employed to reuse commonly
occurring items, e.g., file caches, name caches, and so on
• Principle of Locality:
– Programs access a relatively small portion of the address space at
any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced, it will
tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items
whose addresses are close by tend to be referenced soon
(e.g., straight-line code, array access)
4
Memory Hierarchy: Terminology
5
Cache Measures
CPU execution time including cache performance:
  CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access
  Memory stall clock cycles = Number of misses × Miss penalty
    = IC × (Misses/Instruction) × Miss penalty
    = IC × (Memory accesses/Instruction) × Miss rate × Miss penalty
    = IC × Reads per instruction × Read miss rate × Read miss penalty
      + IC × Writes per instruction × Write miss rate × Write miss penalty
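As an illustration, a minimal C sketch of how these formulas combine (all parameter values below are hypothetical, chosen only for illustration):

/* Sketch of the stall-cycle formulas above; the constants are illustrative. */
#include <stdio.h>

int main(void) {
    double ic                 = 1e9;  /* instruction count                 */
    double cpi_ideal          = 1.0;  /* CPI with no memory stalls         */
    double accesses_per_instr = 1.5;
    double miss_rate          = 0.02;
    double miss_penalty       = 25.0; /* clock cycles                      */
    double cycle_time_ns      = 1.0;

    /* Memory stall cycles = IC * (accesses/instr) * miss rate * miss penalty */
    double stall_cycles = ic * accesses_per_instr * miss_rate * miss_penalty;

    /* CPU time = (CPU clock cycles + memory stall cycles) * clock cycle time */
    double cpu_cycles  = ic * cpi_ideal;
    double cpu_time_ns = (cpu_cycles + stall_cycles) * cycle_time_ns;

    printf("stall cycles = %.3g, CPU time = %.3g ns\n", stall_cycles, cpu_time_ns);
    return 0;
}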
6
P.395 Example
Example: Assume we have a computer where the CPI is 1.0 when all memory accesses
hit in the cache. The only data accesses are loads and stores, and these total 50% of
the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%,
how much faster would the computer be if all memory accesses were cache hits?
Answer:
(A) If all accesses hit in the cache, CPI = 1.0 and there are no memory stalls, so
  CPU time(A) = (IC × CPI + 0) × Clock cycle time = IC × Clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we must add the memory stall cycles:
  Memory stall cycles = IC × (Memory accesses/Instruction) × Miss rate × Miss penalty
    = IC × (1 + 50%) × 2% × 25 = 0.75 × IC
so CPU time(B) = (IC + 0.75 × IC) × Clock cycle time
               = 1.75 × IC × Clock cycle time
The performance ratio is the inverse of the ratio of CPU execution times:
  CPU time(B) / CPU time(A) = 1.75
The computer with no cache misses is 1.75 times faster.
7
Four Memory Hierarchy Questions
Q1 (block placement):
Where can a block be placed in the upper level?
Q2 (block identification):
How is a block found if it is in the upper level?
Q3 (block replacement):
Which block should be replaced on a miss?
Q4 (write strategy):
What happens on a write?
8
Q1(block placement): Where can a block be placed?
Direct mapped: (Block number) mod (Number of blocks in cache)
Set associative: (Block number) mod (Number of sets in cache)
– # of sets = (# of blocks in cache) / n
– n-way: n blocks in a set
– 1-way = direct mapped
Fully associative: # of sets = 1 (a block may be placed anywhere; see the sketch below)
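A minimal C sketch of the three placement policies above (block_addr, num_blocks, and num_sets are the obvious cache parameters; this is illustrative, not any particular machine's hardware):

/* Sketch: where a block may be placed under each policy above. */
unsigned direct_mapped_frame(unsigned block_addr, unsigned num_blocks) {
    return block_addr % num_blocks;   /* exactly one candidate frame          */
}
unsigned set_associative_set(unsigned block_addr, unsigned num_sets) {
    return block_addr % num_sets;     /* any of the n ways of this set        */
}
/* Fully associative: num_sets == 1, so every block maps to the single set
 * and may occupy any frame in the cache. */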
9
Simplest Cache: Direct Mapped (1-way)
Figure: a 16-block memory (block numbers 0–F) mapping onto a 4-block direct-mapped cache (cache indices 0–3). Each memory block has only one place it can appear in the cache; the mapping is usually
(Block address) MOD (Number of blocks in cache)
so, for example, memory blocks 1, 5, 9, and D all map to cache index 1.
10
Example: 1 KB Direct Mapped Cache, 32B Blocks
For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
Figure: address breakdown for this cache – bits 31–10 are the Cache Tag (example value 0x50), bits 9–5 are the Cache Index (example 0x01), and bits 4–0 are the Byte Select (example 0x00). Each of the 32 cache entries stores a Valid bit, the Cache Tag (kept as part of the cache "state"), and a 32-byte data block (bytes 0–31 in entry 0, bytes 32–63 in entry 1, ..., up to byte 1023 in entry 31).
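A small C sketch of the address split used in this example (the helper name split_address and the struct are illustrative only; the bit widths follow from the 1 KB / 32-byte-block parameters above):

/* Sketch: splitting a 32-bit byte address for a 1 KB direct-mapped cache
 * with 32-byte blocks (offset = 5 bits, index = 5 bits, tag = 22 bits). */
#include <stdint.h>

typedef struct {
    uint32_t tag;     /* bits 31..10 */
    uint32_t index;   /* bits  9..5  */
    uint32_t offset;  /* bits  4..0  */
} cache_addr;

static cache_addr split_address(uint32_t addr) {
    cache_addr a;
    a.offset = addr & 0x1F;          /* byte select within the 32-byte block */
    a.index  = (addr >> 5) & 0x1F;   /* selects one of 32 cache entries      */
    a.tag    = addr >> 10;           /* compared against the stored tag      */
    return a;
}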
11
Q2 (block identification): How is a block found?
Three portions of an address in a set-associative or direct-mapped cache
Block offset selects the desired data from the block, the index field selects
the set, and the tag field is compared against the CPU address for a hit
• Use the Cache Index to select the cache set
• Check the Tag on each block in that set
– No need to check index or block offset
– A valid bit is added to the Tag to indicate whether or not this entry
contains a valid address
• Select the desired bytes using Block Offset
12
Example: Two-way set associative cache
• Cache Index selects a “set” from the cache
• The two tags in the set are compared in parallel
• Data is selected based on the tag result
Figure: address breakdown (Cache Tag: bits 31–10, example 0x50; Cache Index: bits 9–5, example 0x01; Byte Select: bits 4–0, example 0x00). The cache index selects one set; the valid bits and tags of both blocks in the set (Set0 and Set1) are compared with the address tag in parallel, the comparison results are ORed to form the Hit signal, and a multiplexer driven by the tag comparison selects the data from the hitting cache block.
13
Disadvantage of Set Associative Cache
• N-way Set Associative Cache vs. Direct Mapped Cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss
• In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
– Possible to assume a hit and continue. Recover later if miss.
(Figure: the same two-way set-associative organization as above, emphasizing that the data multiplexer is controlled by the tag comparison that produces the Hit signal.)
14
Q3 (block replacement): Which block should be
replaced on a cache miss?
■ Easy for direct mapped – hardware decisions are simplified:
  only one block frame is checked, and only that block can be replaced
■ Set associative or fully associative:
  there are many blocks to choose from on a miss
■ Three primary strategies for selecting the block to be replaced:
  • Random: a candidate block is selected at random
  • LRU: the Least Recently Used block is replaced
  • FIFO: First In, First Out (the oldest block is replaced)

Data cache misses per 1000 instructions for various replacement strategies:

            2-way                   4-way                   8-way
Size    LRU    Random  FIFO    LRU    Random  FIFO    LRU    Random  FIFO
16 KB   114.1  117.3   115.5   111.7  115.1   113.3   109.0  111.8   110.4
64 KB   103.4  104.3   103.9   102.4  102.3   103.1   99.7   100.5   100.3
256 KB  92.2   92.1    92.5    92.1   92.1    92.5    92.1   92.1    92.5

There is little difference between LRU and random for the largest cache size, with
LRU outperforming the others for smaller caches. FIFO generally outperforms
random for the smaller cache sizes.
15
Q4(write strategy): What happens on a write?
Reads dominate processor cache accesses.
E.g., writes are 7% of overall memory traffic but 21% of data cache accesses.
Two options when writing to the cache:
• Write through – the information is written to both the block in the cache and
the block in the lower-level memory.
• Write back – the information is written only to the block in the cache.
The modified cache block is written to main memory only when it is replaced.
To reduce the frequency of writing back blocks on replacement, a dirty
bit is used to indicate whether the block was modified in the cache
(dirty) or not (clean). If the block is clean, it is not written back, since the
lower level already holds identical information.
Pros and cons (a minimal sketch of the two write-hit paths follows below):
• Write through: simple to implement, and the cache is always clean, so read
misses never cause writes to the lower level.
• Write back: writes occur at the speed of the cache, and multiple writes within
a block require only one write to the lower-level memory.
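A minimal C sketch contrasting the two write-hit policies; the cache_block structure, the 64-byte block size, and the memory_write() helper are hypothetical, for illustration only:

/* Sketch of the two write-hit policies above. */
typedef struct {
    int valid;
    int dirty;                 /* used only by write back                   */
    unsigned tag;
    unsigned char data[64];    /* hypothetical 64-byte block                */
} cache_block;

void memory_write(unsigned addr, unsigned char byte);   /* lower-level write */

void write_hit_write_through(cache_block *b, unsigned addr, unsigned char byte) {
    b->data[addr & 63] = byte;   /* update the cache ...                     */
    memory_write(addr, byte);    /* ... and the lower level on every write   */
}

void write_hit_write_back(cache_block *b, unsigned addr, unsigned char byte) {
    b->data[addr & 63] = byte;   /* update only the cache                    */
    b->dirty = 1;                /* block is written back when it is evicted */
}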
16
Write Stall and Write Buffer
• When the CPU must wait for writes to complete during write through, the CPU is
said to write stall
• A common optimization to reduce write stalls is a write buffer, which
allows the processor to continue as soon as the data are written to the
buffer, thereby overlapping processor execution with memory updating
(Figure: the processor writes into the cache and into a write buffer; the write buffer sits between the cache and DRAM.)
• A Write Buffer is needed between the Cache and Memory
– Processor: writes data into the cache and the write buffer
– Memory controller: write contents of the buffer to memory
• Write buffer is just a FIFO:
– Typical number of entries: 4
17
Write-Miss Policy: Write Allocate vs. Not Allocate
18
Write-Miss Policy Example
• 64 KB cache, 2-way set associative, 64-byte blocks, 40-bit physical address
• Number of blocks = 64 KB / 64 B = 1K blocks; number of sets = 1K / 2 = 512 = 2^9,
  so the index field is <9> bits
• Block size 64 bytes = 2^6, so the block offset field is <6> bits
• Tag = 40 - 9 - 6 = <25> bits (25 + 9 + 6 = 40-bit physical address)
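A small C sketch that reproduces this field-width arithmetic (assuming the 40-bit physical address used above):

/* Sketch: computing offset/index/tag widths for the cache above. */
#include <stdio.h>
#include <math.h>

int main(void) {
    unsigned cache_bytes = 64 * 1024;
    unsigned block_bytes = 64;
    unsigned assoc       = 2;
    unsigned addr_bits   = 40;

    unsigned blocks = cache_bytes / block_bytes;     /* 1024 blocks */
    unsigned sets   = blocks / assoc;                /* 512 sets    */
    unsigned offset = (unsigned)log2(block_bytes);   /* 6 bits      */
    unsigned index  = (unsigned)log2(sets);          /* 9 bits      */
    unsigned tag    = addr_bits - index - offset;    /* 25 bits     */

    printf("offset=%u index=%u tag=%u\n", offset, index, tag);
    return 0;
}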
21
Impact of Memory Access on CPU Performance
22
Impact of Cache Organizations on CPU Performance
Example 1: What is the impact of two different cache organizations (direct
mapped vs. 2-way set associative) on the performance of a CPU?
– Ideal CPI = 2.0 (ignoring memory stalls)
– Clock cycle time is 1.0 ns
– Avg. memory references per instruction is 1.5
– Cache size: 64 KB, block size: 64 bytes
– For set-associative, assume the clock cycle time is stretched 1.25 times to
accommodate the selection multiplexer
– Cache miss penalty is 75 ns
– Hit time is 1 clock cycle
– Miss rate: direct mapped 1.4%; 2-way set-associative 1.0%. Calculate AMAT
and then processor performance.
Answer:
• Avg. memory access time (1-way) = 1.0 + (0.014 × 75) = 2.05 ns
  Avg. memory access time (2-way) = 1.0 × 1.25 + (0.010 × 75) = 2.00 ns
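A minimal C sketch that finishes the calculation, computing both AMAT and CPU time per instruction from CPU time/instr = CPI × clock cycle + (memory refs/instr) × miss rate × miss penalty; the constants are taken from Example 1, and substituting the Example 2 values works the next exercise the same way:

/* Sketch: AMAT and CPU time per instruction for both organizations. */
#include <stdio.h>

static double cpu_time_per_instr(double cpi, double cycle_ns,
                                 double refs_per_instr,
                                 double miss_rate, double penalty_ns) {
    return cpi * cycle_ns + refs_per_instr * miss_rate * penalty_ns;
}

int main(void) {
    double amat_1way = 1.0 * 1.0  + 0.014 * 75.0;   /* hit time + miss term */
    double amat_2way = 1.0 * 1.25 + 0.010 * 75.0;   /* stretched cycle time */

    double t_1way = cpu_time_per_instr(2.0, 1.0,  1.5, 0.014, 75.0);
    double t_2way = cpu_time_per_instr(2.0, 1.25, 1.5, 0.010, 75.0);

    printf("AMAT: 1-way %.2f ns, 2-way %.2f ns\n", amat_1way, amat_2way);
    printf("CPU time/instr: 1-way %.3f ns, 2-way %.3f ns\n", t_1way, t_2way);
    return 0;
}

With these numbers the direct-mapped cache has the slightly worse AMAT but the better CPU time per instruction, because the stretched clock cycle of the 2-way design penalizes every instruction, not just the misses.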
23
• Example 2: What is the impact of two different cache organizations
(direct mapped vs. 2-way set associative) on the performance of a
CPU?
– Ideal CPI = 1.6 (ignoring memory stalls)
– Clock cycle time is 0.35 ns
– Avg. memory references per instruction is 1.4
– Cache size: 128 KB, block size: 64 bytes
– For set-associative, assume the clock cycle time is stretched 1.35
times to accommodate the selection multiplexer
– Cache miss penalty is 65 ns
– Hit time is 1 clock cycle
– Miss rate: direct mapped 2.1%; 2-way set-associative 1.9%.
Calculate AMAT and then processor performance.
24
Summary of Performance Equations
25
Types of Cache Misses
• Compulsory (cold-start, process-migration, or first-reference misses):
– The very first access to a block cannot be in the cache, so the block must
be brought into the cache.
– These misses occur even in an infinite cache.
• Capacity:
– If the cache cannot contain all the blocks accessed by the program, blocks must be
discarded and later retrieved.
– These misses occur even in a fully associative cache.
– Solution: increase the cache size; if the upper-level memory is too small, the
program may thrash.
• Conflict (collision):
– Multiple memory locations map to the same cache location when the
block placement strategy is set associative or direct mapped.
– There are no conflict misses with full associativity.
– Solution 1: increase the cache size
– Solution 2: increase associativity
26
6 Basic Cache Optimizations
Figure: miss rate broken down by type (compulsory, capacity, conflict) versus cache size (1–128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, with the capacity and compulsory components marked. (A second axis in the original figure shows block sizes of 16–256 bytes.)
30
Reducing Cache Miss Penalty
Time to handle a miss is becoming more and more the
controlling factor. This is because of the great improvement in
speed of processors as compared to the speed of memory.
31
4: Multilevel Caches
• Approaches
– Make the cache faster to keep pace with the speed of CPUs
– Make the cache larger to overcome the widening gap
L1: fast hits, L2: fewer misses
• L2 Equations
Average memory access time = Hit time(L1) + Miss rate(L1) × Miss penalty(L1)
  Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2)
  Average memory access time = Hit time(L1)
    + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
  Hit time(L1) << Hit time(L2) << ... << Hit time(Mem)
  Miss rate(L1) > Miss rate(L2) > ...
Definitions:
– Local miss rate – misses in this cache divided by the total number of memory
accesses to this cache (Miss rate(L1) for the 1st-level cache, Miss rate(L2) for the 2nd-level cache)
– Global miss rate – misses in this cache divided by the total number of memory
accesses generated by the CPU (Miss rate(L1) for L1; Miss rate(L1) × Miss rate(L2) for L2)
• The global miss rate indicates what fraction of the memory accesses that leave the CPU
go all the way to memory.
32
Design of L2 Cache
•Size
– Since everything in L1 cache is likely to be in L2 cache, L2 cache
should be much bigger than L1
•Whether data in L1 is in L2
– novice approach: design L1 and L2 independently
– multilevel inclusion: L1 data are always present in L2
• Advantage: easy for consistency between I/O and cache (checking L2 only)
• Drawback: L2 must invalidate all L1 blocks that map onto the 2nd-level
block to be replaced => slightly higher 1st-level miss rate
• e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
– multilevel exclusion: L1 data is never found in L2
• A cache miss in L1 results in a swap(not replacement) of blocks between L1
and L2
• Advantage: prevent wasting space in L2
• e.g., AMD Athlon: 64 KB L1 and 256 KB L2
33
Example: Suppose that in 1000 memory references there are 40
misses in the first level cache and 20 misses in the second level
cache. What are the various miss rates? Assume the miss
penalty from the L2 cache to memory is 200 clock cycles, the hit
time of the L2 cache is 10 clock cycles, the hit time of L1 is 1
clock cycle, and there are 1.5 memory references per
instruction. What is the average memory access time and
average stall cycles per instruction?
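One way to carry out the calculation the example asks for is sketched below in C; every number comes directly from the problem statement above:

/* Sketch: local/global miss rates, AMAT, and average stalls per instruction. */
#include <stdio.h>

int main(void) {
    double refs            = 1000.0;  /* memory references */
    double l1_misses       = 40.0;
    double l2_misses       = 20.0;
    double l1_hit_time     = 1.0;     /* clock cycles */
    double l2_hit_time     = 10.0;
    double l2_miss_penalty = 200.0;
    double refs_per_instr  = 1.5;

    double l1_miss_rate        = l1_misses / refs;       /* local = global = 4% */
    double l2_local_miss_rate  = l2_misses / l1_misses;  /* 50%                 */
    double l2_global_miss_rate = l2_misses / refs;       /* 2%                  */

    double amat = l1_hit_time +
                  l1_miss_rate * (l2_hit_time + l2_local_miss_rate * l2_miss_penalty);
    double stalls_per_instr = refs_per_instr * (amat - l1_hit_time);

    printf("L1 miss rate %.1f%%, L2 local %.1f%%, L2 global %.1f%%\n",
           100 * l1_miss_rate, 100 * l2_local_miss_rate, 100 * l2_global_miss_rate);
    printf("AMAT = %.1f cycles, stalls/instr = %.1f cycles\n",
           amat, stalls_per_instr);
    return 0;
}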
34
5: Giving Priority to Read Misses over Writes
• Serve reads before writes have been completed
• Write through with write buffers
SW R3, 512(R0) ; M[512] <- R3 (cache index 0)
LW R1, 1024(R0) ; R1 <- M[1024] (cache index 0)
LW R2, 512(R0) ; R2 <- M[512] (cache index 0)
Problem: write through with write buffers can cause RAW conflicts between main
memory reads on cache misses and buffered writes
– If we simply wait for the write buffer to empty, the read miss penalty may
increase (by 50% on the old MIPS 1000)
– Instead, check the write buffer contents before the read; if there are no
conflicts, let the memory access continue
• Write back
Suppose a read miss will replace a dirty block
– Normal: write the dirty block to memory, then do the read
– Instead: copy the dirty block to a write buffer, do the read, and then
do the write
– The CPU stalls less, since it can restart as soon as the read is done
35
6: Avoiding address translation during cache indexing
•Two tasks: indexing the cache and comparing addresses
•virtually vs. physically addressed cache
–virtual cache: use virtual address (VA) for the cache
–physical cache: use physical address (PA) after translating virtual address
•Challenges to virtual cache
1. Protection: page-level protection (RW/RO/Invalid) must be checked
– normally it is checked as part of the virtual-to-physical address translation
– solution: copy the protection information from the TLB into an additional field and check
it on every access to the cache
2. Context switching: the same VA in different processes refers to different PAs,
which would require the cache to be flushed
– solution: widen the cache address tag with a process-identifier tag (PID)
3.Synonyms or aliases: two different VA for the same PA
–inconsistency problem: two copies of the same data in a virtual cache
–hardware antialiasing solution: guarantee every cache block a unique PA
–Alpha 21264: check all possible locations. If one is found, it is invalidated
–software page-coloring solution: forcing aliases to share some address bits
–Sun’s Solaris: all aliases must be identical in last 18 bits => no duplicate PA
4.I/O: typically use PA, so need to interact with cache (see Section 5.12)
36
37
Virtually indexed, physically tagged cache
Figure: three cache organizations.
– Conventional organization: the TLB translates the VA, then a physically addressed cache (and L2, then memory) is accessed with the PA.
– Virtually addressed cache: the cache is accessed with the VA and translation is done only on a miss, but this raises the synonym problem.
– Virtually indexed, physically tagged: cache access is overlapped with VA translation, which requires the cache index to remain invariant across translation.
38
• One alternative that gets the best of both virtual and physical
caches is to use part of the page offset – the part that is
identical in both the virtual and physical addresses – to index the cache.
• While the cache is being read using that index, the
virtual part of the address is translated, and the tag match uses
physical addresses.
• This allows the cache read to begin immediately, and yet the tag
comparison is still done with physical addresses.
39
40
Virtual Memory
• Virtual memory (VM) allows programs to have the illusion
of a very large memory that is not limited by physical
memory size
– Makes main memory (DRAM) act like a cache for secondary
storage (magnetic disk)
– Otherwise, application programmers would have to move data into and out of main
memory themselves
– That is how virtual memory was first proposed
• Virtual memory also provides the following functions
– Allowing multiple processes to share physical memory in a
multiprogramming environment
– Providing protection for processes (compare the Intel 8086: without VM,
applications can overwrite the OS kernel)
– Facilitating program relocation in the physical memory space
VM Example
42
Virtual Memory and Cache
• VM address translation provides a mapping from the
virtual address used by the processor to the physical address
in main memory and secondary storage.
43
Virtual Memory and Cache
45
4 Qs for Virtual Memory
• Q1: Where can a block be placed in Main Memory?
– Miss penalty for virtual memory is very high => Full
associativity is desirable (so allow blocks to be placed
anywhere in the memory)
– Have software determine the location while accessing
disk (10M cycles enough to do sophisticated
replacement)
47
Virtual-Physical Translation
• A virtual address consists of a virtual page
number and a page offset.
• The virtual page number gets translated to a
physical page number.
• The page offset is not changed
A 48-bit virtual address is split into a 36-bit virtual page number and a 12-bit page offset; after translation, the 45-bit physical address consists of a 33-bit physical page number and the unchanged 12-bit page offset.
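A minimal C sketch of this translation step (lookup_page_table() is a hypothetical helper standing in for the page-table walk):

/* Sketch: the page offset passes through unchanged while the virtual page
 * number is replaced by the physical page number. */
#include <stdint.h>

uint64_t lookup_page_table(uint64_t vpn);   /* returns the physical page number */

uint64_t translate(uint64_t va) {
    uint64_t offset = va & 0xFFF;               /* low 12 bits, unchanged      */
    uint64_t vpn    = va >> 12;                 /* 36-bit virtual page number  */
    uint64_t ppn    = lookup_page_table(vpn);   /* 33-bit physical page number */
    return (ppn << 12) | offset;
}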
48
Address Translation Via Page Table
49
TLB: Improving Page Table Access
• We cannot afford to access the page table for every
access, including cache hits (otherwise the cache itself
makes no sense)
• Again, use cache to speed up accesses to page
table! (cache for cache?)
• The TLB (translation lookaside buffer) stores
frequently accessed page table entries
• A TLB entry is like a cache entry (a minimal sketch follows below)
– Tag holds portions of virtual address
– Data portion holds physical page number, protection
field, valid bit, use bit, and dirty bit (like in page table
entry)
– Usually fully associative or highly set associative
– Usually 64 or 128 entries
• Access page table only for TLB misses
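A minimal C sketch of such a TLB entry and a fully associative lookup (the field sizes and the 64-entry count are illustrative; real hardware compares all entries in parallel rather than in a loop):

/* Sketch of a fully associative TLB with the fields listed above. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;
    bool     dirty;
    bool     use;        /* reference bit for replacement       */
    uint8_t  prot;       /* page-level protection (RW/RO/...)   */
    uint64_t vpn_tag;    /* virtual page number used as the tag */
    uint64_t ppn;        /* physical page number                */
} tlb_entry;

#define TLB_ENTRIES 64
tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit; on a miss the page table must be walked. */
bool tlb_lookup(uint64_t vpn, uint64_t *ppn_out) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn_tag == vpn) {
            tlb[i].use = true;
            *ppn_out = tlb[i].ppn;
            return true;
        }
    }
    return false;        /* TLB miss */
}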
50
TLB Characteristics
• The following are characteristics of TLBs
– TLB size : 32 to 4,096 entries
– Block size : 1 or 2 page table entries (4 or 8 bytes
each)
– Hit time: 0.5 to 1 clock cycle
– Miss penalty: 10 to 30 clock cycles (go to page table)
– Miss rate: 0.01% to 0.1%
– Associativity: fully associative or set associative
– Write policy : Write back (replace infrequently)
51
52
• Selecting a page size
Ø The size of the page table is inversely proportional to the page
size; memory can therefore be saved by making the pages
bigger.
Ø A larger page size can allow larger caches with fast cache hit
times.
Ø Transferring larger pages to or from secondary storage, possibly
over a network, is more efficient than transferring smaller pages.
Ø The number of TLB entries is restricted, so a larger page size
means that more memory can be mapped efficiently, thereby
reducing the number of TLB misses.
53
54
10 Advanced Cache Optimizations
• Reducing hit time
  1. Small and simple caches
  2. Way prediction
• Increasing cache bandwidth
  3. Pipelined caches
  4. Multibanked caches
  5. Nonblocking caches
• Reducing miss penalty
  6. Critical word first
  7. Merging write buffers
• Reducing miss rate
  8. Compiler optimizations
• Reducing miss penalty or miss rate via parallelism
  9. Hardware prefetching
  10. Compiler prefetching
55
01: Small and Simple Caches
•A time-consuming portion of a cache hit is using the index portion of the
address to read the tag memory and then compare it to the address.
Guideline: smaller hardware is faster
– Why does the Alpha 21164 have an 8 KB instruction cache and an 8 KB data cache
plus a 96 KB second-level cache?
• Small data cache and thus fast clock rate
Guideline: simpler hardware is faster
– Direct-mapped, on-chip caches can overlap the tag check with
the transmission of data, reducing hit time.
– Lower associativity reduces hit time and power because fewer cache
lines are accessed.
•General design:
– small and simple cache for 1st-level cache
– Keeping the tags on chip and the data off chip for 2nd-level caches
The emphasis recently is on fast clock time while hiding L1 misses with
dynamic execution and using L2 caches to avoid going to memory.
56
02. Fast Hit Times Via Way Prediction
• How to combine fast hit time of direct-mapped with lower conflict
misses of 2-way SA cache?
• Way prediction: keep extra bits in cache to predict “way” (block
within set) of next cache access.
– The multiplexer is set early to select the predicted block; only 1 tag
comparison is done that cycle (in parallel with reading the data)
– On a miss, the other blocks are checked for matches in the next cycle
– Added to each block of the cache are predictor bits, which
indicate which block to try on the next cache access.
– If the prediction is correct, the hit time is fast; otherwise the
predictor bits are changed, at a latency of one extra clock cycle.
• Accuracy > 90% for 2-way (popular), > 80% for 4-way
• Drawback: CPU pipeline harder if hit time is variable-length
57
03: Pipelined Cache Access to increase cache
Bandwidth
Simply pipeline cache access:
– a 1st-level cache hit takes multiple clock cycles, giving a fast clock cycle time
and high bandwidth
• Advantage: fast cycle time; the cost is slow hits (in clock cycles)
Example: the pipeline for accessing instructions from I-cache
– Pentium: 1 clock cycle
– Pentium Pro ~ Pentium III: 2 clocks
– Pentium 4 & Intel Core i7: 4 clocks
•Drawback: Increasing the number of pipeline stages leads to
– greater penalty on mispredicted branches and
– more clock cycles between the issue of the load and the use of the data
58
04: Non-blocking Caches to increase cache
Bandwidth
• For processors with out-of-order completion, the processor need not stall on a
data cache miss.
Ø E.g., the processor can continue fetching from the I-cache during a data-cache miss,
instead of ignoring the processor's requests.
• Blocking Caches
o When a "miss" occurs, CPU stalls until the data cache successfully
finds the missing data.
• Non-blocking or lockup-free caches
o Allow the CPU to continue being productive (such as continue fetching
instructions) while the "miss" resolves – “hit under miss” – reduces
miss penalty
• Effective miss penalty can further be reduced if cache can overlap
multiple misses: a “hit under multiple misses” or “miss under miss” (Intel
Core i7)
• Difficult to judge the impact of any single miss and hence to calculate the
AMAT.
05: Multibanked Caches – Increasing Cache Bandwidth Via Multiple Banks
• Rather than treating cache as single monolithic block, divide into
independent banks to support simultaneous accesses
– E.g., Arm Cortex-A8 - L2 has one to 4 banks
– Intel Core i7 – 4 banks in L1 (2 memory accesses per clock
cycle), 8 banks in L2 cache
• Works best when accesses naturally spread themselves across the banks;
the mapping of addresses to banks affects the behavior of the memory
system.
• A simple mapping that works well is sequential interleaving
– Spread block addresses sequentially across the banks
– E.g., bank i holds all blocks whose block address mod n equals i
61
Single-bank cache vs. multibank cache with simultaneous access. With the simple sequential-interleaving mapping across four banks:
  Bank #0 handles block addresses where (block address) mod 4 = 0: 0, 4, 8, 12, ...
  Bank #1 handles block addresses where (block address) mod 4 = 1: 1, 5, 9, 13, ...
  Bank #2 handles block addresses where (block address) mod 4 = 2: 2, 6, 10, 14, ...
  Bank #3 handles block addresses where (block address) mod 4 = 3: 3, 7, 11, 15, ...
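A one-line C sketch of this mapping:

/* Sketch of sequential interleaving: block b is served by bank (b mod n). */
unsigned bank_for_block(unsigned block_addr, unsigned num_banks) {
    return block_addr % num_banks;   /* e.g. blocks 1, 5, 9, 13 -> bank 1 */
}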
06: Critical Word First / Early Restart (Reducing Miss Penalty)
• Don’t wait for full block before restarting CPU
• Early restart – as soon as the requested word of the block arrives, send it
to the CPU and continue execution
– Spatial locality means the CPU tends to want the next sequential word, so it may
still pay to fetch the rest of the block
• Critical word first – request the missed word from memory first and
send it to the CPU right away; let the CPU continue while the rest of the
block is filled
– Because blocks are long today, critical word first is widely used
73
Figure: timelines for servicing a miss – an unoptimized cache waits for the whole block to be filled, early restart resumes the CPU as soon as the requested word arrives, and critical word first fetches the requested word before the rest of the block.
97
07: Merging Write Buffers
Figure: a write buffer without merging – each entry holds only one valid word, so 75% of the buffer space is wasted; merging combines writes to the same block into a single entry.
08: Compiler Optimizations – Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory
every 100 words; improved spatial locality
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{ a[i][j] = 1/b[i][j] * c[i][j];
d[i][j] = a[i][j] + c[i][j];}
Before fusion there are 2 misses per access to a and c; after fusion only one miss
per access, improving temporal locality (fetched data are reused before being discarded)
09. Reducing Misses by Hardware
Prefetching of Instructions & Data
Figure: internal organization of a DRAM chip – a 2,048 × 2,048 memory array of cells addressed by word lines and bit lines, with address inputs A0–A10, a column decoder, and sense amps & I/O driving the data-in/data-out pins.
Quest for DRAM Performance
CS136 121
A Dynamic Memory Chip
Figure: asynchronous DRAM access – the Row Address Strobe (RAS) latches the row address, the Column Address Strobe (CAS) latches the column address into the column address latch and decoder, and data appear on pins D7–D0.
124
Synchronous DRAMs
Figure: internal organization of a synchronous DRAM – row and column address latches (with a column address counter), row and column decoders around the cell array, read/write circuits and latches, a mode register and timing control clocked externally, data input and output registers, and control inputs RAS, CAS, R/W, and CS. The accompanying timing diagram shows a clocked burst read in which data words D0–D3 are transferred on successive clock cycles after RAS and CAS.
• Double-Data-Rate SDRAM
• Standard SDRAM performs all actions on the
rising edge of the clock signal.
• DDR SDRAM accesses the cell array in the
same way, but transfers the data on both
edges of the clock.
• The cell array is organized in two banks. Each
can be accessed separately.
• DDR SDRAMs and standard SDRAMs are
most efficiently used in applications where
block transfers are prevalent.
127
DRAM Named by Peak Chip Xfers / Sec
DIMM Named by Peak DIMM MBytes / Sec
CS136 129
Protection : Virtual Memory
CS136 131
Protection : Virtual Machines (VM)
• VMs include all emulation methods that provide a
standard software interface.
• System VMs provide a complete system-level environment
at the binary ISA level; the presented ISA usually matches
the underlying hardware – e.g., VMware, IBM VM/370.
• They present the illusion that the users of a VM have an entire
computer to themselves, including a copy of the OS.
• Normally a single OS owns all the hardware, but with VMs multiple
OSes all share the hardware resources.
• The VMM (hypervisor) is the software that supports VMs; it determines
how to map virtual resources to physical resources.
• The underlying hardware platform is the "host," and its resources are
shared among the "guest" VMs.
CS136 132
Protection : Virtual Machines (VM)
CS136 133
Protection : Virtual Machines (VM)
Requirements of a VMM:
• The VMM presents a software interface to guest software and must
isolate the state of the guests from one another.
• It must protect itself from guest software.
• Guest software should behave on a VM exactly as if it
were running on the native hardware.
• Guest software should not be able to change the allocation of
real system resources directly.
• The VMM must control access to privileged state,
address translation, I/O, exceptions, and interrupts.
• The VMM must be at a higher privilege level than the
guest VM.
CS136 134