
Memory Hierarchy-II

1
Memory Hierarchy

o Motivation
m Exploiting locality to provide a large, fast and
inexpensive memory
2
Cache Basics
o Cache is a high speed buffer between
CPU and main memory
o Memory is divided into blocks
m Q1: Where can a block be placed in the upper
level? (Block placement)
m Q2: How is a block found if it is in the upper
level? (Block identification)
m Q3: Which block should be replaced on a
miss? (Block replacement)
m Q4: What happens on a write? (Write strategy)

3
Q1: Block Placement
o Fully associative, direct mapped, set
associative
m Example: Block 12 placed in 8 block cache:
n Mapping = Block Number Modulo Number Sets
n Direct mapped: (12 mod 8) = 4, so block 12 can go only in cache block 4
n 2-way set associative: (12 mod 4) = 0, so block 12 can go in either block of set 0
n Fully associative: block 12 can go in any of the 8 cache blocks

[Figure: an 8-block cache (blocks 0-7) above a 32-block memory (blocks 0-31), showing the three placements of block 12]
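A minimal sketch of the placement arithmetic (the C program and its constants are illustrative, not from the slides):

#include <stdio.h>

/* Where memory block 12 lands in an 8-block cache under the three policies.
 * Mapping = block number modulo number of sets. */
int main(void) {
    unsigned block = 12, cache_blocks = 8;

    unsigned dm_block = block % cache_blocks;        /* direct mapped: 8 sets of 1 block */
    unsigned sa_set   = block % (cache_blocks / 2);  /* 2-way: 4 sets of 2 blocks        */

    printf("direct mapped: block %u -> cache block %u\n", block, dm_block); /* 4 */
    printf("2-way assoc:   block %u -> set %u\n", block, sa_set);           /* 0 */
    printf("fully assoc:   block %u -> any of the %u blocks\n", block, cache_blocks);
    return 0;
}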

4
Q2: Block Identification
o Tag on each block
m No need to check index or block offset
o Increasing associativity ⇒ shrinks index ⇒ expands tag

[Figure: address breakdown — the Block Address consists of Tag and Index, followed by the Block Offset]
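A small sketch of splitting an address into tag, index and offset (the 32-bit address and the 6-bit offset / 7-bit index widths are assumptions for illustration):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr = 0x12345678;        /* example memory address          */
    const uint32_t offset_bits = 6;    /* 64-byte blocks (assumed)        */
    const uint32_t index_bits  = 7;    /* 128 sets (assumed)              */

    uint32_t offset = addr & ((1u << offset_bits) - 1);
    uint32_t index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
    uint32_t tag    = addr >> (offset_bits + index_bits);

    /* only the tag is stored and compared; the index selects the set, the offset the byte */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}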

5
Q3: Block Replacement
o Easy for direct-mapped caches
o Set associative or fully associative:
m Random
n Easy to implement
m LRU (Least Recently Used)
n Relying on past to predict future, hard to implement
m FIFO
n Sort of approximate LRU
m Not Recently Used
n Maintain reference bits and dirty bits; clear reference bits
periodically; divide all blocks into four categories; choose one
from the lowest category
m Optimal replacement?
n Label the blocks in cache by the number of instructions to be
executed before that block is referenced. Then choose a
block with the highest label
n Unrealizable!
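A minimal sketch of LRU within one set, using a per-block timestamp (the 4-way layout and the timestamp field are assumptions, not the slides' design):

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

struct cache_line { bool valid; uint32_t tag; uint64_t last_used; };

/* Return the way to evict: the least recently used block in the set. */
int lru_victim(const struct cache_line set[WAYS]) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    return victim;
}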
6
Q4: Write Strategy
Policy
  Write-Through: data written to the cache block is also written to lower-level memory
  Write-Back: write data only to the cache; update the lower level when the block falls out of the cache
Implementation
  Write-Through: easy; Write-Back: hard
Do read misses produce writes?
  Write-Through: no; Write-Back: yes
Do repeated writes make it to the lower level?
  Write-Through: yes; Write-Back: no
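A toy sketch of the two policies on a write hit (the one-block "cache" and flat memory array are assumptions made only to keep the example small):

#include <stdbool.h>
#include <stdint.h>

static uint32_t memory[1024];
static struct { uint32_t addr, data; bool dirty; } cache_block;

void write_hit(uint32_t addr, uint32_t data, bool write_through) {
    cache_block.addr = addr;
    cache_block.data = data;           /* always update the cache block         */
    if (write_through)
        memory[addr] = data;           /* write-through: update lower level too */
    else
        cache_block.dirty = true;      /* write-back: defer until eviction      */
}

void evict_block(void) {
    if (cache_block.dirty)             /* write-back pays the write here        */
        memory[cache_block.addr] = cache_block.data;
    cache_block.dirty = false;
}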

7
Write Buffers
[Diagram: Processor → Cache → Write Buffer → Lower-Level Memory]

Write-through cache: the write buffer holds data awaiting write-through to lower-level memory.

Q. Why a write buffer? A. So the CPU doesn't stall.
Q. Why a buffer, why not just one register? A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or check the write buffer before the read and perform the read only when there is no conflict.
8
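A sketch of the "check the write buffer before the read" option mentioned above (the 4-entry buffer and its field names are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss, scan the write buffer first; if the address is pending,
 * forward its data and skip the memory read, avoiding the RAW hazard. */
bool wb_forward(uint32_t addr, uint32_t *data_out) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data_out = write_buffer[i].data;
            return true;               /* conflict found: serve from the buffer */
        }
    }
    return false;                      /* no conflict: safe to read lower level */
}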
Cache Performance
o Average memory access time
m Time_total mem access = N_hit × T_hit + N_miss × T_miss
                        = N_mem access × T_hit + N_miss × T_miss penalty

m AMAT = T_hit + miss rate × T_miss penalty

o Miss penalty: time to replace a block from the lower
level, including the time to deliver it to the CPU
m Access time: time to reach the lower level (latency)
m Transfer time: time to transfer the block (bandwidth)
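A one-line helper mirroring the AMAT formula (a sketch; the values in the comment match the example on the next slide):

/* AMAT in ns: amat(1, 0.44, 60) == 27.4 and amat(2, 0.37, 60) == 24.2 */
double amat(double t_hit, double miss_rate, double t_miss_penalty) {
    return t_hit + miss_rate * t_miss_penalty;
}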

9
Performance Example
o Two data caches (assume one clock cycle for hit)
m I: 8KB, 44% miss rate, 1ns hit time
m II: 64KB, 37% miss rate, 2ns hit time
m Miss penalty: 60ns; 30% of instructions are memory accesses

m AMAT_I = 1ns + 44% × 60ns = 27.4ns

m AMAT_II = 2ns + 37% × 60ns = 24.2ns

m Larger cache ⇒ smaller miss rate but longer T_hit ⇒ reduced AMAT

10
Miss Penalty in OOO Environment
o In processors with out-of-order execution
m Memory accesses can overlap with other
computation
m Latency of memory accesses is not always
fully exposed

m E.g. 8KB cache, 44% miss rate, 1ns hit time,
miss penalty: 60ns, only 70% exposed on average

m AMAT = 1ns + 44% × (60ns × 70%) = 19.5ns

11
Cache Performance Optimizations
o Performance formulas
m AMAT = T_hit + miss rate × T_miss penalty
o Reducing miss rate
m Change cache configurations, compiler optimizations
o Reducing hit time
m Simple cache, fast access and address translation
o Reducing miss penalty
m Multilevel caches, read and write policies
o Taking advantage of parallelism
m Cache serving multiple requests simultaneously
m Prefetching

12
Cache Miss Rate
o Three C’s
o Compulsory misses (cold misses)
m The first access to a block: miss regardless of cache
size
o Capacity misses
m Cache too small to hold all data needed
o Conflict misses
m More blocks mapped to a set than the associativity
o Reducing miss rate
m Larger block size (compulsory)
m Larger cache size (capacity, conflict)
m Higher associativity (conflict)
m Compiler optimizations (all three)
13
Miss Rate vs. Block Size

o Larger blocks: compulsory misses reduced, but may increase conflict misses or even capacity misses if the cache is small; may also increase miss penalty
14
Reducing Cache Miss Rate
o Larger cache
m Fewer capacity misses
m Fewer conflict misses
n Implies less competition for the same set (a similar effect to higher associativity)
m Has to balance hit time, energy consumption, and cost
o Higher associativity
m Fewer conflict misses
m Miss rate (2-way, size X) ≈ Miss rate (direct-mapped, size 2X)

m Similarly, need to balance hit time and energy consumption: diminishing returns on reducing conflict misses

15
Reducing Cache Miss Penalty
o A difficult decision is
m whether to make the cache hit time fast, to keep pace with the high
clock rate of processors,
m or to make the cache large to reduce the gap between the
processor accesses and main memory accesses.

o Solution:
m Use multi-level cache:
n The first-level cache can be small enough to match a fast clock
cycle time.
n Higher-level caches can be large enough to capture many
accesses that would otherwise go to main memory.
n Multilevel caches are more power-efficient than a single
aggregated cache.

16
Compiler Optimizations for Cache
o Increasing locality of programs
m Temporal locality, spatial locality
o Rearrange code
m Targeting instruction cache directly
m Reorder instructions based on the set of data accessed
o Reorganize data
m Padding to eliminate conflicts (see the sketch after this list):
n Change the address of two variables such that they do not map to
the same cache location
n Change the size of an array via padding
m Group data that tend to be accessed together in one block
o Example optimizations
m Merging arrays, loop interchange, loop fusion
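A minimal sketch of the padding idea from the list above (the sizes and the 16-int pad are assumptions, and it assumes the two arrays are laid out consecutively):

#define N   (64 * 1024)
#define PAD 16                /* 16 ints = one 64-byte block (assumed)      */

int a[N + PAD];               /* the pad shifts where b starts, so a[i] and */
int b[N];                     /* b[i] no longer map to the same cache sets  */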

17
Merging Arrays
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
o Improve spatial locality
m If val[i] and key[i] tend to be accessed together
o Reducing conflicts between val & key
18
Loop Interchange
o Idea: switching the nesting order of two or
more loops

m Sequential accesses instead of striding through memory; improved spatial locality (see the sketch below)
o Safety of loop interchange
m Need to preserve true data dependences
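A representative sketch in the style of the other examples (the array x, its size N, and the doubling operation are assumptions):

#define N 1024
double x[N][N];

/* Before: the inner loop strides through memory (C arrays are row-major) */
void scale_before(void) {
    for (int j = 0; j < N; j = j+1)
        for (int i = 0; i < N; i = i+1)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: sequential accesses, improved spatial locality */
void scale_after(void) {
    for (int i = 0; i < N; i = i+1)
        for (int j = 0; j < N; j = j+1)
            x[i][j] = 2 * x[i][j];
}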

19
Loop Fusion
o Takes multiple compatible loop nests and
combines their bodies into one loop nest
m Is legal if no data dependences are reversed
o Improves locality directly by merging accesses to
the same cache line into one loop iteration
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
20
Seminar

o Pipelining Cache

o Prefetching Cache

21
Main Memory Background
o Main memory performance
m Latency: cache miss penalty
n Access time: time between request and word arrives
n Cycle time: time between requests
m Bandwidth: matters for multiprocessors, I/O, and large-block miss penalties
o Main memory technology
m Memory is DRAM: Dynamic Random Access Memory
n Dynamic since needs to be refreshed periodically
n Requires data to be written back after being read
n Concerned with cost per bit and capacity
m Cache is SRAM: Static Random Access Memory
n Concerned with speed and capacity

22
Memory vs. Virtual Memory
o Analogy to cache
m Size: cache << memory << address space
m Both provide big and fast memory - exploit locality

m Both need a policy - 4 memory hierarchy questions

o Difference from cache


m Cache primarily focuses on speed
m VM facilitates transparent memory management
n Providing large address space
n Sharing, protection in multi-programming environment

23
Four Memory Hierarchy Questions
o Where can a block be placed in main memory?
m OS allows block to be placed anywhere: fully
associative
n No conflict misses;
o Which block should be replaced?
m An approximation of LRU: true LRU too costly and
adds little benefit
n A reference bit is set if a page is accessed
n The bit is shifted into a history register periodically
n When replacing, choose the page with the smallest value in its
history register
o What happens on a write?
m Write back: write through is prohibitively expensive

24
Four Memory Hierarchy Questions
o How is a block found in main memory?
m Use page table to translate virtual address into
physical address
• 32-bit virtual address, page size: 4KB, 4 bytes per page table entry; page table size?
• (2^32 / 2^12) × 2^2 = 2^22 bytes, or 4MB

25
Fast Address Translation
o Motivation
m Page table is too large to be stored in cache
n May even span multiple pages itself
m Multiple page table levels
o Solution: exploit locality and cache recent
translations

[Figure: example address translation through four page table levels]

26
Fast Address Translation
o TLB: translation look-aside buffer
m A special fully-associative cache for recent translation
m Tag: virtual address
m Data: physical page frame number, protection field,
valid bit, use bit, dirty bit

o Translation
m Send the virtual address to all tags
m Check for protection violations
m The matching tag sends the physical page frame number
m Combine with the page offset to get the full physical address
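A sketch of a TLB entry and a fully associative lookup (the 64-entry size, 4KB pages, and field widths are assumptions, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    bool     valid, use, dirty;   /* valid, use and dirty bits     */
    uint8_t  prot;                /* protection field              */
    uint32_t vpn;                 /* tag: virtual page number      */
    uint32_t pfn;                 /* data: physical frame number   */
};
static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a hit and writes the full physical address. */
bool tlb_translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> 12, offset = vaddr & 0xFFF;   /* 4KB pages        */
    for (int i = 0; i < TLB_ENTRIES; i++) {               /* compare all tags */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            tlb[i].use = true;
            *paddr = (tlb[i].pfn << 12) | offset;          /* combine with offset */
            return true;
        }
    }
    return false;                                          /* miss: walk the page table */
}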
27
Virtual Memory and Cache
o Physical cache: index cache using physical
address
m Always address translation before accessing cache
m Simple implementation, performance issue

o Virtual cache: index cache using virtual address to avoid translation
m Address translation only @ cache misses
m Issues
n Protection: copy protection info to each block
n Context switch: add PID to address tag
n Synonym/alias -- different virtual addresses map to the same
physical address
l Check multiple places, or enforce aliases to be identical in a
fixed number of address bits (page coloring)

28
Virtual Memory and Cache
o Physical cache (PIPT)
[Diagram: Processor Core → VA → TLB → PA → Physical Cache → miss → Main Memory; a hit returns the cache line to the core]
• Slow: translation on every access

o Virtual cache (VIVT)
[Diagram: Processor Core → VA → Virtual Cache → miss → TLB → Main Memory; a hit returns the cache line to the core]
• Protection bits missing from the cache
• Context switch enforces a cache flush
• Aliasing issue


29
Virtually-Indexed Physically-Tagged
o Virtually-indexed physically-tagged cache
m Use the page offset (identical in virtual & physical
addresses) to index the cache
m Associate physical address of the block as the
verification tag
m Perform cache reading and tag matching with the
physical address at the same time
m Issue: cache size is limited by page size (the length of
offset bits)
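One way to express that limit as a check (a sketch; the 4KB page, 32KB cache, and 8 ways are assumptions):

#include <assert.h>

#define PAGE_SIZE  4096
#define CACHE_SIZE (32 * 1024)
#define WAYS       8

int main(void) {
    /* the index + offset bits must fit in the page offset,
     * so each way can cover at most one page */
    assert(CACHE_SIZE / WAYS <= PAGE_SIZE);
    return 0;
}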

30
Advantages of Virtual Memory
o Translation
m Program can be given a consistent view of memory,
even though physical memory is scrambled
m Only the most important part of program (“Working
Set”) must be in physical memory
o Protection
m Different threads/processes protected from each other
m Different pages can be given special behavior
n Read only, invisible to user programs, etc.
m Kernel data protected from user programs
m Very important for protection from malicious programs

o Sharing
m Can map same physical page to multiple users
31
