Memory Hierarchy

Chapter Three
Memory Technology and Optimizations

• All computers use DRAM (dynamic random-access memory) for main memory and SRAM (static random-access memory) for caches.
• Using SRAM addresses the need to minimize access time to caches.
Memory
• SRAM:
– Value is stored on a pair of inverting gates
– Very fast but takes up more space than DRAM (4 to 6 transistors per bit)
• DRAM:
– Value is stored as a charge on a capacitor (must be refreshed)
– Much smaller per bit but slower than SRAM (by a factor of 5 to 10)
Dynamic RAM
• Bits stored as charge in capacitors
• Charges leak
• Need refreshing even when powered
• Simpler construction
• Smaller per bit
• Less expensive
• Need refresh circuits
• Slower
• Main memory
• Essentially analogue
– Level of charge determines value
Dynamic RAM Structure
SDRAM
• There have been multiple improvements to the DRAM design.
– A clock signal was added, making the design synchronous (SDRAM).
– The data bus transfers data on both the rising and falling edges of the clock (DDR SDRAM).
– The second generation of DDR memory (DDR2) scales to higher clock frequencies.
– DDR3 and DDR4 are currently being used.
SDRAM
• SDRAMs allow a burst transfer mode in which multiple transfers can occur without specifying a new column address.
• In burst mode, 8 or more 16-bit transfers can occur without sending any new addresses.
• To get more bandwidth from the memory as DRAM density increased, SDRAMs were made wider.
• SDRAMs introduced banks to help with power management, improve access time, and allow interleaved and overlapped accesses to different banks.
Static RAM
• Bits stored as on/off switches
• No charges to leak
• No refreshing needed when powered
• More complex construction
• Larger per bit
• More expensive
• Does not need refresh circuits
• Faster
• Cache
• Digital
– Uses flip-flops
Static RAM Structure
Memory Hierarchy: How Does it Work?
• Temporal Locality (Locality in Time):
– The memory hierarchy keeps the most recently accessed data items closer to the processor because, chances are, the processor will access them again soon.
• Spatial Locality (Locality in Space):
– Not only do we move the item that has just been accessed to the upper level, but we also move the data items that are adjacent to it (see the sketch below).
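A minimal C sketch of how a single loop nest exhibits both kinds of locality; the array a, its size N, and the loop bounds are made up for illustration.

/* Toy example: summing a 2D array. */
#include <stdio.h>

#define N 1024

static double a[N][N];

int main(void) {
    double sum = 0.0;
    /* Row-major traversal: consecutive values of j touch adjacent memory
       locations (spatial locality), and sum is reused on every iteration
       (temporal locality), so it stays at the top of the hierarchy. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("%f\n", sum);
    return 0;
}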
Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality:
• Present the user with as much memory as is available in the cheapest technology.
• Provide access at the speed offered by the fastest technology.
Cache
• Small amount of fast memory
• Sits between main memory and CPU
• May be located on CPU chip or module
How to Improve Cache Performance?
• Cache optimizations
– 1. Reduce the miss rate
– 2. Reduce the miss penalty
– 3. Reduce the time to hit in the cache

AMAT = Hit Time + Miss Rate × Miss Penalty
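For example, with assumed, illustrative numbers: a hit time of 1 cycle, a miss rate of 2%, and a miss penalty of 100 cycles give AMAT = 1 + 0.02 × 100 = 3 cycles.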


Where Do Misses Come From?
• Classifying Misses: the 3 Cs
– Compulsory — The first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
– Capacity — If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
– Conflict — If the block-placement strategy is set associative or direct mapped, conflict misses will occur because a block can be discarded and later retrieved if too many blocks map to its set (a small address-mapping example follows this list).
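A minimal C sketch of why conflict misses arise in a direct-mapped cache; the cache geometry and addresses are assumptions chosen for illustration. Two blocks whose addresses are exactly one cache size apart map to the same index and evict each other, even when the cache is far from full.

#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BLOCKS  256            /* a 16 KB direct-mapped cache */

static unsigned index_of(uint32_t addr) {
    return (addr / BLOCK_BYTES) % NUM_BLOCKS;
}

int main(void) {
    uint32_t a = 0x00010000;                       /* arbitrary block address */
    uint32_t b = a + NUM_BLOCKS * BLOCK_BYTES;     /* one cache size further on */
    /* Both map to the same index, so alternating accesses conflict. */
    printf("index(a) = %u, index(b) = %u\n", index_of(a), index_of(b));
    return 0;
}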
Advanced Optimizations of Cache Performance
Average memory access time = Hit time + Miss rate × Miss penalty
• We can classify advanced cache optimizations into five categories:
• 1. Reducing the hit time—Small and simple first-level caches and way prediction (decreases power).
• 2. Increasing cache bandwidth—Pipelined caches, multibanked caches, and nonblocking caches. These have varying impacts on power consumption.
Cont…
• 3. Reducing the miss penalty—Critical word first and merging write buffers.
• 4. Reducing the miss rate—Compiler optimizations (reduce power consumption).
• 5. Reducing the miss penalty or miss rate via parallelism—Hardware prefetching and compiler prefetching (increase power consumption).
Hit Time Reduction Technique: Small and Simple Caches
• Smaller hardware is faster => a small cache helps the hit time
• Keep the cache small enough to fit on the same chip as the processor (avoid the time penalty of going off-chip)
• Direct-mapped caches can overlap the tag check with the transmission of the data, effectively reducing hit time.
• Keep the cache simple
– Use a direct-mapped cache: it overlaps the tag check with the transmission of data
• Lower levels of associativity will usually reduce power because fewer cache lines must be accessed.
Small and Simple First-Level Caches
Way Prediction to Reduce Hit Time
• How can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
• Way Prediction: extra bits are kept to predict the way or block within a set
– The mux is set early to select the desired block
– Only a single tag comparison is performed
– What if the prediction misses? => check the other blocks in the set
– Used in the Alpha 21264
• 1 cc if the predictor is correct, 3 cc if not
• Effectiveness: prediction accuracy is 85%
– Used in the MIPS 4300 embedded processor to lower power
– A software sketch of the idea follows below
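A minimal software sketch of the way-prediction idea, assuming a 2-way set-associative cache with made-up sizes and field names; this illustrates the algorithm, not the Alpha 21264 or MIPS 4300 hardware.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 64
#define NUM_WAYS 2

struct cache_set {
    uint32_t tag[NUM_WAYS];
    bool     valid[NUM_WAYS];
    int      predicted_way;            /* extra bit(s) kept per set */
};

static struct cache_set cache[NUM_SETS];

/* Returns the way that hit, or -1 on a miss (64-byte blocks assumed). */
int lookup(uint32_t addr) {
    uint32_t set = (addr >> 6) % NUM_SETS;
    uint32_t tag = addr >> 12;
    struct cache_set *s = &cache[set];

    /* Fast path: compare only the predicted way's tag (1 cycle if correct). */
    int w = s->predicted_way;
    if (s->valid[w] && s->tag[w] == tag)
        return w;

    /* Slow path: check the other way(s); costs extra cycles, and the
       predictor is updated on a hit. */
    for (int other = 0; other < NUM_WAYS; other++) {
        if (other == w) continue;
        if (s->valid[other] && s->tag[other] == tag) {
            s->predicted_way = other;
            return other;
        }
    }
    return -1;                         /* miss: fetch from the next level */
}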
Pipelined Access and Multibanked Caches to Increase Bandwidth
• These optimizations increase cache bandwidth either
– By pipelining the cache access or
– By widening the cache with multiple banks to allow
multiple accesses per clock.
• These optimizations are primarily targeted at L1,
where access bandwidth constrains instruction
throughput.
• Multiple banks are also used in L2 and L3 caches, but
primarily as a power-management technique.
Transferring blocks to/from memory
(Figure: three CPU/cache/memory organizations: a. one-word-wide memory, b. four-word-wide memory, c. interleaved memory with four banks, bank0 to bank3.)
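A minimal C sketch of the interleaved organization in part c; the bank count and word size are assumptions. Consecutive words map to consecutive banks, so a block transfer can overlap accesses to different banks.

#include <stdio.h>
#include <stdint.h>

#define NUM_BANKS  4
#define WORD_BYTES 4

/* With simple word interleaving, consecutive words go to consecutive banks. */
static unsigned bank_of(uint32_t addr)        { return (addr / WORD_BYTES) % NUM_BANKS; }
static unsigned offset_in_bank(uint32_t addr) { return (addr / WORD_BYTES) / NUM_BANKS; }

int main(void) {
    /* Eight consecutive words cycle through bank0..bank3 twice. */
    for (uint32_t addr = 0; addr < 8 * WORD_BYTES; addr += WORD_BYTES)
        printf("addr %2u -> bank %u, row %u\n",
               (unsigned)addr, bank_of(addr), offset_in_bank(addr));
    return 0;
}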
Nonblocking Caches to Increase Cache Bandwidth
• For pipelined computers that allow out-of-order execution, the processor need not stall on a data cache miss.
– The processor can continue fetching instructions from the instruction cache while waiting for the data cache to return the data.
Nonblocking Caches to Increase Cache Bandwidth
• A nonblocking cache allows the data cache to continue to supply cache hits during a miss
– requires F/E (Full/Empty) bits on registers or out-of-order execution
– requires multi-bank memories
• “hit under miss” reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
• “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
– Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise it cannot be supported)
– The Pentium Pro allows 4 outstanding memory misses
Value of Hit Under Miss for SPEC
Critical Word First and Early Restart to Reduce Miss Penalty
• The processor needs just one word of the block at a time.
• This strategy is based on impatience: do not wait for the full block to be loaded before sending the requested word and restarting the processor.
– Early restart — As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
– Critical word first — Request the missed word first from memory and send it to the CPU as soon as it arrives; generally useful only with large blocks.
• Beneficial when we have long cache lines (blocks); a small sketch of the wrap-around fill order follows.
• If the processor simply wants the next sequential word, early restart may not be useful.
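A minimal C sketch of the wrap-around fill order used with critical word first; the block size and the choice of critical word are made-up values.

#include <stdio.h>

#define WORDS_PER_BLOCK 8

int main(void) {
    int critical = 5;   /* the word within the block that the CPU actually needs */
    /* The missed (critical) word is requested first; the rest of the block
       follows, wrapping around to the beginning. */
    printf("fill order:");
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf(" %d", (critical + i) % WORDS_PER_BLOCK);
    printf("\n");       /* prints: 5 6 7 0 1 2 3 4 */
    return 0;
}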
Merging Write Buffer to Reduce Miss Penalty
• Write-through caches rely on write buffers
– On a write, the data and full address are written into the buffer; the write is finished from the CPU’s perspective
– Problem: the CPU stalls when the write buffer is full
• Write merging (a small sketch follows)
– If the buffer contains other modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write-buffer entry. If so, the new data are combined with that entry.
– Multiword writes are faster than single-word writes => reduces write-buffer stalls
• Is this applicable to I/O addresses?
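A minimal C sketch of write merging; the entry layout, sizes, and function names are assumptions for illustration, not a particular processor's write buffer.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ENTRIES     4
#define BLOCK_WORDS 4                /* each entry covers one aligned 4-word block */

struct wb_entry {
    bool     valid;
    uint32_t block_addr;             /* address of the aligned block */
    uint32_t data[BLOCK_WORDS];
    bool     word_valid[BLOCK_WORDS];
};

static struct wb_entry buf[ENTRIES];

/* Returns true if the write was accepted (merged or placed in a free entry),
   false if the buffer is full and the CPU would have to stall. */
bool write_buffer_put(uint32_t addr, uint32_t value) {
    uint32_t block = addr & ~(uint32_t)(BLOCK_WORDS * 4 - 1);
    unsigned word  = (addr >> 2) % BLOCK_WORDS;

    /* Write merging: if a valid entry already holds this block, combine. */
    for (int i = 0; i < ENTRIES; i++) {
        if (buf[i].valid && buf[i].block_addr == block) {
            buf[i].data[word] = value;
            buf[i].word_valid[word] = true;
            return true;
        }
    }
    /* Otherwise take a free entry. */
    for (int i = 0; i < ENTRIES; i++) {
        if (!buf[i].valid) {
            memset(&buf[i], 0, sizeof buf[i]);
            buf[i].valid = true;
            buf[i].block_addr = block;
            buf[i].data[word] = value;
            buf[i].word_valid[word] = true;
            return true;
        }
    }
    return false;                    /* buffer full: stall */
}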
Compiler Optimizations to Reduce Miss Rate
• Reduction comes from software without any hardware changes.
• McFarling reduced cache misses by 75% (8KB, direct-mapped, 4-byte blocks) in software
• Instructions => Reorder procedures in memory so as to reduce conflict misses
• Data
– Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
– Blocking: improve temporal locality by accessing “blocks” of data repeatedly instead of going down whole columns or rows
Loop Interchange
• Motivation: some programs have nested loops that access data in nonsequential order
• Solution: simply exchanging the nesting of the loops can make the code access the data in the order it is stored
• This reduces misses by improving spatial locality; reordering maximizes use of the data in a cache block before it is discarded
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

• Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
• Reduces misses if the arrays do not fit in the cache.
Blocking
• Motivation: multiple arrays, some accessed by rows and
some by columns
• Storing the arrays row by row (row major order) or
column by column (column major order) does not help:
both rows and columns are used in every iteration of the
loop (Loop Interchange cannot help)
• Solution: instead of operating on entire rows and columns
of an array, blocked algorithms operate on submatrices or
blocks
– maximize accesses to the data loaded into the cache before
the data is replaced
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   r = 0;
        for (k = 0; k < N; k = k+1) {
            r = r + y[i][k]*z[k][j]; };
        x[i][j] = r;
    };

• Two Inner Loops:
• Read all N×N elements of z[]
• Read N elements of 1 row of y[] repeatedly
• Write N elements of 1 row of x[]
• Capacity Misses - a function of N & Cache Size:
• 2N³ + N² => (assuming no conflict; otherwise …)
• Idea: compute on a B×B submatrix that fits
Blocking Example (cont’d)
/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B,N); j = j+1)
            {   r = 0;
                for (k = kk; k < min(kk+B,N); k = k+1) {
                    r = r + y[i][k]*z[k][j]; };
                x[i][j] = x[i][j] + r;
            };

• B is called the Blocking Factor
• Capacity Misses drop from 2N³ + N² to N³/B + 2N²
• Conflict Misses Too?
Before and after Blocking
Hardware Prefetching to Reduce Miss Penalty or Miss Rate
• E.g., Instruction Prefetching
– The Alpha 21064 fetches 2 blocks on a miss
– The extra block is placed in a “stream buffer”
– On a miss, check the stream buffer (a small sketch follows)
• Works with data blocks too:
– Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty
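A minimal C sketch of a sequential stream buffer; the depth, names, and replacement policy are assumptions for illustration rather than the Alpha 21064 design.

#include <stdbool.h>
#include <stdint.h>

#define STREAM_DEPTH 4

static uint32_t stream[STREAM_DEPTH];   /* prefetched block addresses (FIFO) */
static int      stream_len = 0;

/* Fill the buffer with sequential blocks starting at `block`. */
static void prefetch_from(uint32_t block) {
    for (int i = 0; i < STREAM_DEPTH; i++)
        stream[i] = block + i;          /* issue sequential prefetches to memory */
    stream_len = STREAM_DEPTH;
}

/* Called on a cache miss for `block`.  Returns true if the stream buffer
   supplies the block (much cheaper than going to memory), false otherwise. */
bool on_cache_miss(uint32_t block) {
    if (stream_len > 0 && stream[0] == block) {
        /* Hit at the head of the buffer: shift the FIFO and prefetch the
           next sequential block into the freed slot. */
        for (int i = 1; i < stream_len; i++)
            stream[i - 1] = stream[i];
        stream[stream_len - 1] = block + STREAM_DEPTH;
        return true;
    }
    /* Buffer miss: flush it and restart prefetching after the missed block. */
    prefetch_from(block + 1);
    return false;
}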
Reading Assignment
• Reducing Misses/Penalty by Software Prefetching
Data
• Using HBM to Extend the Memory Hierarchy
