Memory Hierarchy

Chapter Three
Memory Technology and Optimizations

• All computers use DRAM (dynamic random-access memory) for main memory and SRAM (static random-access memory) for caches.
• Using SRAM addresses the need to minimize access time to caches.
Memory
• SRAM:
– Value is stored on a pair of inverting gates
– Very fast but takes up more space than DRAM (4 to 6 transistors per bit)
• DRAM:
– Value is stored as a charge on a capacitor (must be refreshed)
– Much smaller per bit but slower than SRAM (by a factor of 5 to 10)
Dynamic RAM
• Bits stored as charge in capacitors
• Charges leak
• Need refreshing even when powered
• Simpler construction
• Smaller per bit
• Less expensive
• Need refresh circuits
• Slower
• Main memory
• Essentially analogue
– Level of charge determines value
Dynamic RAM Structure
SDRAM
• There have been multiple improvements to the DRAM design.
– A clock signal was added, making the design synchronous (SDRAM).
– The data bus transfers data on both the rising and falling edges of the clock (DDR SDRAM).
– The second generation of DDR memory (DDR2) scales to higher clock frequencies.
– DDR3 and DDR4 are currently being used.
SDRAM
• SDRAMs allow a burst transfer mode in which multiple transfers can occur without specifying a new column address.
• In burst mode, 8 or more 16-bit transfers can occur without sending any new addresses.
• To get more bandwidth from the memory as DRAM density increased, SDRAMs were made wider.
• SDRAMs introduced banks to help with power management, improve access time, and allow interleaved and overlapped accesses to different banks.
Static RAM
• Bits stored as on/off switches
• No charges to leak
• No refreshing needed when powered
• More complex construction
• Larger per bit
• More expensive
• Does not need refresh circuits
• Faster
• Cache
• Digital
– Uses flip-flops
Static RAM Structure
Memory Hierarchy: How Does it Work?
• Temporal Locality (Locality in Time):
– The memory hierarchy keeps the most recently accessed data items closer to the processor because, chances are, the processor will access them again soon.
• Spatial Locality (Locality in Space):
– Not only do we move the item that has just been accessed to the upper level, but we also move the data items that are adjacent to it (see the sketch below).
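A minimal C sketch of how a single loop nest exhibits both kinds of locality; the array a, its size N, and the loop bounds are made up for illustration.

/* Toy example: summing a 2D array. */
#include <stdio.h>

#define N 1024

static double a[N][N];

int main(void) {
    double sum = 0.0;
    /* Row-major traversal: consecutive values of j touch adjacent memory
       locations (spatial locality), and sum is reused on every iteration
       (temporal locality), so it stays at the top of the hierarchy. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("%f\n", sum);
    return 0;
}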
Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality:
• Present the user with as much memory as is available in the cheapest technology.
• Provide access at the speed offered by the fastest technology.
Cache
• Small amount of fast memory
• Sits between main memory and CPU
• May be located on CPU chip or module
How to Improve Cache Performance?
• Cache optimizations
– 1. Reduce the miss rate
– 2. Reduce the miss penalty
– 3. Reduce the time to hit in the cache

AMAT = Hit Time + Miss Rate × Miss Penalty
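For example, with assumed, illustrative numbers: a hit time of 1 cycle, a miss rate of 2%, and a miss penalty of 100 cycles give AMAT = 1 + 0.02 × 100 = 3 cycles.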


Where Do Misses Come From?
• Classifying Misses: the 3 Cs
– Compulsory — The first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
– Capacity — If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
– Conflict — If the block-placement strategy is set associative or direct mapped, conflict misses will occur because a block can be discarded and later retrieved if too many blocks map to its set (a small address-mapping example follows this list).
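A minimal C sketch of why conflict misses arise in a direct-mapped cache; the cache geometry and addresses are assumptions chosen for illustration. Two blocks whose addresses are exactly one cache size apart map to the same index and evict each other, even when the cache is far from full.

#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BLOCKS  256            /* a 16 KB direct-mapped cache */

static unsigned index_of(uint32_t addr) {
    return (addr / BLOCK_BYTES) % NUM_BLOCKS;
}

int main(void) {
    uint32_t a = 0x00010000;                       /* arbitrary block address */
    uint32_t b = a + NUM_BLOCKS * BLOCK_BYTES;     /* one cache size further on */
    /* Both map to the same index, so alternating accesses conflict. */
    printf("index(a) = %u, index(b) = %u\n", index_of(a), index_of(b));
    return 0;
}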
Advanced Optimizations of Cache Performance
Average memory access time = Hit time + Miss rate × Miss penalty
• We can classify advanced cache optimizations into five categories:
• 1. Reducing the hit time—Small and simple first-level caches and way prediction (decreases power).
• 2. Increasing cache bandwidth—Pipelined caches, multibanked caches, and nonblocking caches. These have varying impacts on power consumption.
Cont…
• 3. Reducing the miss penalty—Critical word first and merging write buffers.
• 4. Reducing the miss rate—Compiler optimizations (reduce power consumption).
• 5. Reducing the miss penalty or miss rate via parallelism—Hardware prefetching and compiler prefetching (increase power consumption).
Hit Time Reduction Technique: Small and Simple Caches
• Smaller hardware is faster => a small cache helps the hit time
• Keep the cache small enough to fit on the same chip as the processor (avoid the time penalty of going off-chip)
• Direct-mapped caches can overlap the tag check with the transmission of the data, effectively reducing hit time.
• Keep the cache simple
– Use a direct-mapped cache: it overlaps the tag check with the transmission of data
• Lower levels of associativity will usually reduce power because fewer cache lines must be accessed.
Small and Simple First-Level Caches
Way Prediction to Reduce Hit Time
• How can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
• Way Prediction: extra bits are kept to predict the way or block within a set
– The mux is set early to select the desired block
– Only a single tag comparison is performed
– What if the prediction misses? => check the other blocks in the set
– Used in the Alpha 21264
• 1 cc if the predictor is correct, 3 cc if not
• Effectiveness: prediction accuracy is 85%
– Used in the MIPS 4300 embedded processor to lower power
– A software sketch of the idea follows below
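A minimal software sketch of the way-prediction idea, assuming a 2-way set-associative cache with made-up sizes and field names; this illustrates the algorithm, not the Alpha 21264 or MIPS 4300 hardware.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 64
#define NUM_WAYS 2

struct cache_set {
    uint32_t tag[NUM_WAYS];
    bool     valid[NUM_WAYS];
    int      predicted_way;            /* extra bit(s) kept per set */
};

static struct cache_set cache[NUM_SETS];

/* Returns the way that hit, or -1 on a miss (64-byte blocks assumed). */
int lookup(uint32_t addr) {
    uint32_t set = (addr >> 6) % NUM_SETS;
    uint32_t tag = addr >> 12;
    struct cache_set *s = &cache[set];

    /* Fast path: compare only the predicted way's tag (1 cycle if correct). */
    int w = s->predicted_way;
    if (s->valid[w] && s->tag[w] == tag)
        return w;

    /* Slow path: check the other way(s); costs extra cycles, and the
       predictor is updated on a hit. */
    for (int other = 0; other < NUM_WAYS; other++) {
        if (other == w) continue;
        if (s->valid[other] && s->tag[other] == tag) {
            s->predicted_way = other;
            return other;
        }
    }
    return -1;                         /* miss: fetch from the next level */
}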
Pipelined Access and Multibanked Caches to Increase Bandwidth
• These optimizations increase cache bandwidth either
– By pipelining the cache access or
– By widening the cache with multiple banks to allow
multiple accesses per clock.
• These optimizations are primarily targeted at L1,
where access bandwidth constrains instruction
throughput.
• Multiple banks are also used in L2 and L3 caches, but
primarily as a power-management technique.
Transferring blocks to/from memory
(Figure: three CPU/cache/memory organizations: a. one-word-wide memory, b. four-word-wide memory, c. interleaved memory with four banks, bank0 to bank3.)
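A minimal C sketch of the interleaved organization in part c; the bank count and word size are assumptions. Consecutive words map to consecutive banks, so a block transfer can overlap accesses to different banks.

#include <stdio.h>
#include <stdint.h>

#define NUM_BANKS  4
#define WORD_BYTES 4

/* With simple word interleaving, consecutive words go to consecutive banks. */
static unsigned bank_of(uint32_t addr)        { return (addr / WORD_BYTES) % NUM_BANKS; }
static unsigned offset_in_bank(uint32_t addr) { return (addr / WORD_BYTES) / NUM_BANKS; }

int main(void) {
    /* Eight consecutive words cycle through bank0..bank3 twice. */
    for (uint32_t addr = 0; addr < 8 * WORD_BYTES; addr += WORD_BYTES)
        printf("addr %2u -> bank %u, row %u\n",
               (unsigned)addr, bank_of(addr), offset_in_bank(addr));
    return 0;
}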
Nonblocking Caches to Increase Cache Bandwidth
• For pipelined computers that allow out-of-order execution, the processor need not stall on a data cache miss.
– The processor can continue fetching instructions from the instruction cache while waiting for the data cache to return the data.
Nonblocking Caches to Increase Cache Bandwidth
• A nonblocking cache allows the data cache to continue to supply cache hits during a miss
– requires F/E (Full/Empty) bits on registers or out-of-order execution
– requires multi-bank memories
• “hit under miss” reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
• “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
– Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise it cannot be supported)
– The Pentium Pro allows 4 outstanding memory misses
Value of Hit Under Miss for SPEC
Critical Word First and Early Restart to Reduce Miss Penalty
• The processor needs just one word of the block at a time.
• This strategy is based on impatience: do not wait for the full block to be loaded before sending the requested word and restarting the processor.
– Early restart — As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
– Critical word first — Request the missed word first from memory and send it to the CPU as soon as it arrives; generally useful only with large blocks.
• Beneficial when we have long cache lines (blocks); a small sketch of the wrap-around fill order follows.
• If the processor simply wants the next sequential word, early restart may not be useful.
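A minimal C sketch of the wrap-around fill order used with critical word first; the block size and the choice of critical word are made-up values.

#include <stdio.h>

#define WORDS_PER_BLOCK 8

int main(void) {
    int critical = 5;   /* the word within the block that the CPU actually needs */
    /* The missed (critical) word is requested first; the rest of the block
       follows, wrapping around to the beginning. */
    printf("fill order:");
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf(" %d", (critical + i) % WORDS_PER_BLOCK);
    printf("\n");       /* prints: 5 6 7 0 1 2 3 4 */
    return 0;
}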
Merging Write Buffer to Reduce Miss Penalty
• Write-through caches rely on write buffers
– On a write, the data and full address are written into the buffer; the write is finished from the CPU’s perspective
– Problem: the CPU stalls when the write buffer is full
• Write merging (a small sketch follows)
– If the buffer contains other modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write-buffer entry. If so, the new data are combined with that entry.
– Multiword writes are faster than single-word writes => reduces write-buffer stalls
• Is this applicable to I/O addresses?
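A minimal C sketch of write merging; the entry layout, sizes, and function names are assumptions for illustration, not a particular processor's write buffer.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ENTRIES     4
#define BLOCK_WORDS 4                /* each entry covers one aligned 4-word block */

struct wb_entry {
    bool     valid;
    uint32_t block_addr;             /* address of the aligned block */
    uint32_t data[BLOCK_WORDS];
    bool     word_valid[BLOCK_WORDS];
};

static struct wb_entry buf[ENTRIES];

/* Returns true if the write was accepted (merged or placed in a free entry),
   false if the buffer is full and the CPU would have to stall. */
bool write_buffer_put(uint32_t addr, uint32_t value) {
    uint32_t block = addr & ~(uint32_t)(BLOCK_WORDS * 4 - 1);
    unsigned word  = (addr >> 2) % BLOCK_WORDS;

    /* Write merging: if a valid entry already holds this block, combine. */
    for (int i = 0; i < ENTRIES; i++) {
        if (buf[i].valid && buf[i].block_addr == block) {
            buf[i].data[word] = value;
            buf[i].word_valid[word] = true;
            return true;
        }
    }
    /* Otherwise take a free entry. */
    for (int i = 0; i < ENTRIES; i++) {
        if (!buf[i].valid) {
            memset(&buf[i], 0, sizeof buf[i]);
            buf[i].valid = true;
            buf[i].block_addr = block;
            buf[i].data[word] = value;
            buf[i].word_valid[word] = true;
            return true;
        }
    }
    return false;                    /* buffer full: stall */
}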
Compiler Optimizations to Reduce Miss Rate
• Reduction comes from software without any hardware changes.
• McFarling reduced cache misses by 75% (8KB, direct-mapped, 4-byte blocks) in software
• Instructions => Reorder procedures in memory so as to reduce conflict misses
• Data
– Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
– Blocking: improve temporal locality by accessing “blocks” of data repeatedly instead of going down whole columns or rows
Loop Interchange
• Motivation: some programs have nested loops that access data in nonsequential order
• Solution: simply exchanging the nesting of the loops can make the code access the data in the order it is stored
• This reduces misses by improving spatial locality; reordering maximizes use of the data in a cache block before it is discarded
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

• Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
• Reduces misses if the arrays do not fit in the cache.
Blocking
• Motivation: multiple arrays, some accessed by rows and
some by columns
• Storing the arrays row by row (row major order) or
column by column (column major order) does not help:
both rows and columns are used in every iteration of the
loop (Loop Interchange cannot help)
• Solution: instead of operating on entire rows and columns
of an array, blocked algorithms operate on submatrices or
blocks
– maximize accesses to the data loaded into the cache before
the data is replaced
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   r = 0;
        for (k = 0; k < N; k = k+1) {
            r = r + y[i][k]*z[k][j]; };
        x[i][j] = r;
    };

• Two Inner Loops:
• Read all N×N elements of z[]
• Read N elements of 1 row of y[] repeatedly
• Write N elements of 1 row of x[]
• Capacity Misses - a function of N & Cache Size:
• 2N³ + N² => (assuming no conflict; otherwise …)
• Idea: compute on a B×B submatrix that fits
Blocking Example (cont’d)
/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B,N); j = j+1)
            {   r = 0;
                for (k = kk; k < min(kk+B,N); k = k+1) {
                    r = r + y[i][k]*z[k][j]; };
                x[i][j] = x[i][j] + r;
            };

• B is called the Blocking Factor
• Capacity Misses drop from 2N³ + N² to N³/B + 2N²
• Conflict Misses Too?
Before and after Blocking
Hardware Prefetching to Reduce Miss Penalty or Miss Rate
• E.g., Instruction Prefetching
– The Alpha 21064 fetches 2 blocks on a miss
– The extra block is placed in a “stream buffer”
– On a miss, check the stream buffer (a small sketch follows)
• Works with data blocks too:
– Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty
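A minimal C sketch of a sequential stream buffer; the depth, names, and replacement policy are assumptions for illustration rather than the Alpha 21064 design.

#include <stdbool.h>
#include <stdint.h>

#define STREAM_DEPTH 4

static uint32_t stream[STREAM_DEPTH];   /* prefetched block addresses (FIFO) */
static int      stream_len = 0;

/* Fill the buffer with sequential blocks starting at `block`. */
static void prefetch_from(uint32_t block) {
    for (int i = 0; i < STREAM_DEPTH; i++)
        stream[i] = block + i;          /* issue sequential prefetches to memory */
    stream_len = STREAM_DEPTH;
}

/* Called on a cache miss for `block`.  Returns true if the stream buffer
   supplies the block (much cheaper than going to memory), false otherwise. */
bool on_cache_miss(uint32_t block) {
    if (stream_len > 0 && stream[0] == block) {
        /* Hit at the head of the buffer: shift the FIFO and prefetch the
           next sequential block into the freed slot. */
        for (int i = 1; i < stream_len; i++)
            stream[i - 1] = stream[i];
        stream[stream_len - 1] = block + STREAM_DEPTH;
        return true;
    }
    /* Buffer miss: flush it and restart prefetching after the missed block. */
    prefetch_from(block + 1);
    return false;
}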
Reading Assignment
• Reducing Misses/Penalty by Software Prefetching
Data
• Using HBM to Extend the Memory Hierarchy
