CHAPTER 2 Memory Hierarchy Design & APPENDIX B. Review of Memory Hierarchy

1. The document discusses optimizations for cache performance, including direct-mapped caches for faster access times, way prediction to reduce hit time, pipelined and nonblocking caches to increase bandwidth, and multibanked caches.
2. It also covers virtual memory systems that provide protection between processes using address translation via page tables; virtual memory allows each process to have its own unique address space.
3. Memory hierarchy design must also consider crosscutting issues such as protection, coherency of cached data between processors, and how instruction set architectures support virtualization and virtual memory.

CHAPTER 2 Memory Hierarchy Design
&
APPENDIX B. Review of Memory Hierarchy
by:
Fara Mutia, Hansen Hendra, Rachmadio Noval, & Julyando
2.2 Ten Advanced Optimizations for Cache Performance
1. Small and Simple First-Level Caches to Reduce Hit Time and Power

The less hardware that is necessary to implement a cache, the shorter the critical path through the hardware.
Direct-mapped is faster than set associative for both reads and writes. In particular, tag checking can overlap data transmission, as there is only one piece of data for each index.
Figures: impact of cache size and associativity on hit time; energy consumption per read vs. cache size and associativity.
2. Way Prediction to Reduce Hit Time

Way prediction: the multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle, in parallel with reading the cache data.
With way prediction, extra bits are kept in the cache to predict the way, or block within the set, of the next cache access.
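A minimal sketch of the idea, assuming a 2-way set-associative cache; the structure and names (Cache, predicted_way, probe) are illustrative, not taken from the slides or any particular processor:

#include <stdint.h>

#define SETS 64
#define WAYS 2

typedef struct {
    uint64_t tag[SETS][WAYS];
    int      valid[SETS][WAYS];
    int      predicted_way[SETS];   /* extra prediction bits kept with the cache */
} Cache;

/* Returns 0 for a fast hit (predicted way correct), 1 for a slow hit after a
   misprediction, -1 for a miss. 'set' must be less than SETS. */
int probe(Cache *c, uint64_t set, uint64_t tag) {
    int p = c->predicted_way[set];
    if (c->valid[set][p] && c->tag[set][p] == tag)
        return 0;                        /* single tag compare, overlapped with the data read */
    for (int w = 0; w < WAYS; w++) {
        if (w != p && c->valid[set][w] && c->tag[set][w] == tag) {
            c->predicted_way[set] = w;   /* retrain the prediction bits */
            return 1;                    /* hit, but with extra latency */
        }
    }
    return -1;
}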
3. Pipelined Cache Access to Increase Cache Bandwidth

This optimization simply pipelines cache access, so that the effective latency of a first-level cache hit can be multiple clock cycles, giving a fast clock cycle time and high bandwidth but slow hits.
4. Nonblocking Caches to Increase Cache Bandwidth

Nonblocking cache: the data cache is allowed to continue to supply cache hits during a miss, meaning that the processor can continue fetching instructions from the instruction cache while waiting for the data cache to return the missing data.
5. Multibanked Caches to Increase Bandwidth

Rather than treat the cache as a single monolithic block, we can divide
it into independent banks that can support simultaneous accesses.

Figure: four-way interleaved cache banks using block addressing.
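A minimal sketch of the bank-selection rule under block addressing; the block size, bank count, and the helper name bank_of are illustrative assumptions:

#include <stdint.h>

#define BLOCK_SIZE 64u   /* bytes per cache block (assumed)         */
#define NUM_BANKS  4u    /* four-way interleaving, as in the figure */

/* Consecutive blocks map to consecutive banks, so sequential accesses
   can proceed in parallel across the banks. */
static inline unsigned bank_of(uint64_t addr) {
    uint64_t block_addr = addr / BLOCK_SIZE;       /* block addressing */
    return (unsigned)(block_addr % NUM_BANKS);
}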
6. Critical Word First and Early Restart to Reduce Miss
Penalty

Critical word first: request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while the rest of the words in the block are filled in.

Early restart: fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution.
7. Merging Write Buffer to Reduce Miss Penalty

Write-through caches rely on write buffers, as all stores must be sent to the next lower level of the hierarchy. If the write buffer is empty, the data and the full address are written in the buffer, and the write is finished from the processor's perspective; the processor continues working while the write buffer prepares to write the word to memory. If the buffer already holds other modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write buffer entry; if so, the new data are merged with that entry (write merging).
Figure: the four writes are merged into a single buffer entry with write merging; without it, the buffer is full even though three-fourths of each entry is wasted.
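A minimal sketch of the merging check, assuming a small buffer of block-sized entries; the layout and names (WBEntry, buffer_store) are illustrative, not a real controller design:

#include <stdint.h>
#include <string.h>

#define ENTRIES    4
#define BLOCK_SIZE 32

typedef struct {
    int      valid;
    uint64_t block_addr;              /* aligned block address of the entry */
    uint8_t  data[BLOCK_SIZE];
    uint32_t byte_valid;              /* which bytes of the entry hold data  */
} WBEntry;

/* Returns 1 if the store was merged or placed, 0 if the buffer is full
   (in which case the processor would have to stall). */
int buffer_store(WBEntry buf[ENTRIES], uint64_t addr, uint8_t value) {
    uint64_t block  = addr / BLOCK_SIZE;
    unsigned offset = (unsigned)(addr % BLOCK_SIZE);

    for (int i = 0; i < ENTRIES; i++) {                 /* address matches a valid entry? */
        if (buf[i].valid && buf[i].block_addr == block) {
            buf[i].data[offset] = value;                /* merge into the existing entry  */
            buf[i].byte_valid |= 1u << offset;
            return 1;
        }
    }
    for (int i = 0; i < ENTRIES; i++) {                 /* otherwise take a free entry    */
        if (!buf[i].valid) {
            memset(&buf[i], 0, sizeof buf[i]);
            buf[i].valid = 1;
            buf[i].block_addr = block;
            buf[i].data[offset] = value;
            buf[i].byte_valid = 1u << offset;
            return 1;
        }
    }
    return 0;
}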
8. Compiler Optimization to Reduce Miss Rate
Loop Interchange

This technique reduces misses by improving spatial locality: reordering the loops maximizes use of the data in a cache block before it is discarded. The original code would skip through memory in strides of 100 words, while the revised version accesses all the words in one cache block before going to the next block (see the sketch below).
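A minimal sketch of the transformation in C, assuming a row-major 5000 x 100 array; the concrete dimensions and function names are illustrative, chosen only to match the stride of 100 words mentioned above:

#define ROWS 5000
#define COLS 100

static int x[ROWS][COLS];

/* Before: the inner loop walks down a column, so successive accesses are
   100 words apart and each one touches a different cache block. */
void scale_before(void) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After loop interchange: successive accesses are adjacent in memory, so a
   whole cache block is used before it is evicted. */
void scale_after(void) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}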
9. Hardware Prefetching of Instructions and Data to
Reduce Miss Penalty or Miss Rate

Nonblocking caches effectively reduce the miss penalty by overlapping execution with memory access. Another approach is to prefetch items before the processor requests them. Both instructions and data can be prefetched, either directly into the caches or into an external buffer that can be more quickly accessed than main memory.
10. Compiler-Controlled Prefetching to Reduce Miss
Penalty or Miss Rate

An alternative to hardware prefetching is for the compiler to insert prefetch instructions to request data before the processor needs it. There are two flavors of prefetch (see the sketch below):

- Register prefetch loads the value into a register.
- Cache prefetch loads data only into the cache, not into a register.
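A minimal sketch of a cache prefetch inserted ahead of use, assuming a GCC/Clang toolchain for the __builtin_prefetch intrinsic; the prefetch distance of 16 elements is an illustrative assumption:

/* Sum an array, prefetching data a fixed distance ahead of the current element. */
double sum_with_prefetch(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read prefetch, low temporal locality */
        s += a[i];
    }
    return s;
}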
2.3 Memory Technology and Optimizations
Memory Technology

Main memory performance measures: latency and bandwidth
▶ Latency affects the cache miss penalty
▶ Bandwidth affects multiprocessors and I/O

Increasing memory bandwidth is easier than reducing latency; this is why multilevel caches are implemented: the first level provides low latency, while lower levels can use larger block sizes to exploit the available bandwidth.
Memory latency is determined by access time and cycle time:

Access time:
▶ the time between when a read is requested and when the desired word arrives
Cycle time:
▶ the minimum time between unrelated requests to memory
SRAM and DRAM

1. DRAM (main memory)
Must be re-written after being read (hence "dynamic")
Must be periodically refreshed (every 8 ms)
Address lines are multiplexed:
Row access strobe (RAS): upper half of the address
Column access strobe (CAS): lower half of the address
2. SRAM (cache)
Requires 6 transistors per bit
DRAM vs SRAM

DRAM                                  SRAM
Loses data if we don't refresh it     Retains data while power is supplied
1 transistor per bit                  Several transistors per bit
Slower                                Faster
Cheap                                 Expensive
Size up to 16 GB                      Size up to 12 MB
DRAM Growth

Nowadays there is DDR4: 16 Gbit chips at 3200 MHz
(e.g., Corsair LPX 32 GB (2 x 16 GB) 3200 MHz C16 DDR4 DRAM)

Moore's law: a new chip with 4 times the capacity every 3 years.
DRAMs obeyed this rule for 20 years (1980-2000), but due to manufacturing challenges, even from 2006-2010 capacity grew only 2 times in 4 years.
Optimization of Memory(1)

1. Improvement of memory performance inside the DRAM chip
-> Add a clock signal to the DRAM interface. This produced synchronous DRAM (SDRAM), so there is no overhead to synchronize with the controller.
-> Wider DRAM interfaces, from a 4-bit transfer mode to 16-bit buses in the latest DDR2 and DDR3.
Optimization of Memory(2)

2. Reducing power
Power consumption depends on read/write activity and on standby power; both depend mainly on the operating voltage:

DDR2: 1.5-1.8 V
DDR3: 1.35-1.5 V
DDR4: 1-1.2 V
Optimization of Memory(3)

3. Flash memory: a type of EEPROM
Normally a read-only memory that can be erased; it holds data without any power.
Flash is much slower than SDRAM but faster than disk.
2.4 Protection: Virtual Memory and Virtual Machines
Virtual Memory

Protection:
▶ Keeps processes in their own memory space
▶ Prevents programs from corrupting each other

Role of virtual memory:
▶ Provide a TLB to translate addresses
▶ Provide mechanisms to limit memory accesses
▶ Provide user mode and supervisor mode
Virtual memory to protect programs

Each program has its own page table, mapping its pages to unique physical addresses (a minimal sketch of this translation follows).
Program address space in Linux

Kernel space   Programs are not allowed to access it; an access attempt will crash the program
Stack          Used when procedure calls happen
Libraries      Shared data that might be used by many programs
Heap           Where the program allocates data
Data           Static variables
Text           The program binary

Figure: example of a Linux program's virtual memory layout.
2.5 Crosscutting Issues: The Design of Memory
Hierarchies
Protection and Instruction Set Architecture

Protection is a joint effort of architecture and operating systems, but architects had to modify some awkward details of existing instruction set architectures when virtual memory became popular. For example, to support virtual memory in the IBM 370, architects had to change the successful IBM 360 instruction set architecture that had been announced just 6 years before.
Historically, IBM mainframe hardware and VMM took three steps to
improve performance of virtual machines:
1. Reduce the cost of processor virtualization.
2. Reduce interrupt overhead cost due to the virtualization.
3. Reduce interrupt cost by steering interrupts to the proper VM
without invoking VMM.
Coherency of Cached Data

The frequency of the cache coherency problem is different for multiprocessors than I/O.
Multiple data copies are a rare event for I/O—one to be avoided whenever possible—but
a program running on multiple processors will want to have copies of the same data in
several caches. Performance of a multiprocessor program depends on the performance
of the system when sharing data.

The I/O cache coherency question is this: Where does the I/O occur in the computer—
between the I/O device and the cache or between the I/O device and main memory? If
input puts data into the cache and output reads data from the cache, both I/O and the
processor see the same data. The difficulty in this approach is that it interferes with the
processor and can cause the processor to stall for I/O. Input may also interfere with the
cache by displacing some information with new data that are unlikely to be accessed
soon.
2.6 Putting It All Together: Memory Hierarchies in the ARM Cortex-A8 and Intel Core i7
ARM Cortex-A8
Intel Core i7
2.7 Fallacies and Pitfalls
Fallacies and Pitfalls
• Fallacy : predicting cache performance of one program from
another
• Pitfall :
– Simulating enough instructions to get accurate performance measures
of the memory hierarchy
• Trying to predict performance of a large cache using a small trace
• Program’s locality behavior is not constant over the run of the entire program
• Program’s locality behavior may vary depending on the input

– Not delivering high memory bandwidth in a cache-based system


– Implementing a virtual machine monitor on an instruction set
architecture that wasn’t designed to be virtualizable
2.8 Concluding Remarks: Looking Ahead
• Some increases in DRAM bandwidth have been
achieved : to have multiple overlapped accesses per
bank
• Decreases in access time have come much more slowly, partly because voltage levels have been lowered to limit power consumption
• New approaches to memory are being explored:
– MRAMs (magnetic storage of data)
– Phase change RAMs (PCRAM, PCME, PRAM) : glass that can
be changed between amorphous and crystalline states
Appendix B.1 Review of memory hierarchy
Cache performance review


Memory hierarchy basic question (1)

1. Where can a block be placed in the upper level? How is a block found if it is in the cache?

Placement is based on the mapping (a code sketch follows the figure):
Direct mapped: (block address) mod (number of cache blocks)
Fully associative: the block can go in any cache block
Set associative: (block address) mod (number of sets), then in any block within that set

Figure: block placement in the cache.
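A minimal sketch of the address decomposition behind these mapping rules, with an assumed 64-byte block and 128 sets; the helper names are illustrative:

#include <stdint.h>

#define BLOCK_SIZE 64u    /* bytes per block -> 6 offset bits (assumed) */
#define NUM_SETS   128u   /* 128 sets        -> 7 index bits (assumed)  */

static inline uint64_t block_addr(uint64_t addr) { return addr / BLOCK_SIZE; }
static inline uint64_t set_index(uint64_t addr)  { return block_addr(addr) % NUM_SETS; }
static inline uint64_t tag_bits(uint64_t addr)   { return block_addr(addr) / NUM_SETS; }

/* Direct mapped is the special case of one block per set; fully associative
   is the special case of a single set (any block can go anywhere). */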
Memory hierarchy basic question (2)

2. Which block should be replaced on a cache miss?

Direct mapped: there is no choice; the block's position depends only on (block address) mod (number of cache blocks).
For fully associative and set associative caches:
1. Random
2. First in, first out (FIFO)
3. Least recently used (LRU)

LRU replacement:
The block that has gone the longest without being accessed is replaced.

Procedure (see the sketch below):
1. On every access to a cache block, stamp it with an incrementing counter value.
2. When a cache block must be replaced, look for the lowest number among the candidate blocks (which depend on whether the cache is fully or set associative).
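A minimal sketch of this counter-based LRU for one set of a set-associative cache; the Set structure and function names are illustrative:

#include <stdint.h>

#define WAYS 4

typedef struct {
    uint64_t tag[WAYS];
    uint64_t last_used[WAYS];   /* "increment number" recorded on each access */
    int      valid[WAYS];
} Set;

/* On every access, stamp the touched way with a globally increasing counter. */
static void touch(Set *s, int way, uint64_t *counter) {
    s->last_used[way] = ++(*counter);
}

/* On a miss, the victim is an empty way if one exists, otherwise the valid
   way with the lowest (oldest) stamp. */
static int choose_victim(const Set *s) {
    for (int w = 0; w < WAYS; w++)
        if (!s->valid[w]) return w;
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (s->last_used[w] < s->last_used[victim]) victim = w;
    return victim;
}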
Memory hierarchy basic question (3)

3. What happens on a write?

It depends on the write policy (see the sketch below):
Write through – the data are written both to the block in the cache and to main memory.
Write back – the data are written only to the block in the cache; the modified block is written to main memory when it is replaced.
Write back uses a dirty bit to indicate whether the block has been modified and therefore must be written back when it is replaced.
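A minimal sketch contrasting the two policies on a write hit; the Line structure and the byte-level "memory" parameter are simplifications for illustration only:

typedef struct {
    unsigned char data[64];   /* one cache block          */
    int dirty;                /* used only by write back  */
} Line;

/* Write through: update the cache block and main memory on every store. */
void write_through(Line *line, int offset, unsigned char value, unsigned char *memory) {
    line->data[offset] = value;
    memory[offset] = value;
}

/* Write back: update only the cache block and mark it dirty; memory is
   updated later, when the block is replaced. */
void write_back(Line *line, int offset, unsigned char value) {
    line->data[offset] = value;
    line->dirty = 1;
}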
Appendix B.2 Cache Performance
Cache Performance
• Measure of memory hierarchy performance :
Average memory access time = Hit time + Miss rate x Miss Penalty
Miss penalty : extra time to handle a miss

Example:
Cache: 16 KB instruction + 16 KB data
Hit = 1 cycle, miss penalty = 50 cycles
75% of memory references are instruction fetches
Miss rate info (shown in the table):

Cache Size   I-Cache   D-Cache
1 KB         3.0%      24.6%
2 KB         2.3%      20.6%
4 KB         1.8%      16.0%
8 KB         1.1%      10.2%
16 KB        0.6%      6.5%
32 KB        0.4%      4.8%
64 KB        0.1%      3.8%

• Find the miss rates and AMAT
Cache miss rates:
• Miss rate, instruction (16 KB) = 0.6%
• Miss rate, data (16 KB) = 6.5%
• Overall miss rate = 0.75 * 0.6% + 0.25 * 6.5% = 2.075%

Average memory access time:
AMAT = 0.75 * (1 + 0.6% * 50) + 0.25 * (1 + 6.5% * 50) ≈ 2.04 cycles
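A small sketch that reproduces this arithmetic, using C simply as a calculator; the function name amat is illustrative:

#include <stdio.h>

/* Average memory access time = hit time + miss rate * miss penalty. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    double i_amat = amat(1.0, 0.006, 50.0);   /* 1 + 0.6% * 50 = 1.30 */
    double d_amat = amat(1.0, 0.065, 50.0);   /* 1 + 6.5% * 50 = 4.25 */
    printf("overall miss rate = %.3f%%\n", 0.75 * 0.6 + 0.25 * 6.5);   /* 2.075%     */
    printf("AMAT = %.2f cycles\n", 0.75 * i_amat + 0.25 * d_amat);     /* about 2.04 */
    return 0;
}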

Appendix B.3 Six Basic Cache Optimizations
Six Basic Cache Optimizations
AMAT = Hit time + MR x MP

Improving Cache Performance


1. Reducing MR
2. Reducing MP
3. Reducing time to hit in cache
Reducing Miss Rate
Classifying misses:
• Compulsory – the first access to a block is not in the cache, so the block
must be brought into the cache
• Capacity – will occur due to blocks being discarded and later retrieved, if
the cache cannot contain all the blocks needed during the execution of a
program
• Conflict – occur in set associative or direct mapped caches because a
block can be discarded and later retrieved if too many blocks map to its
set

How to reduce misses?
• Compulsory misses – cannot be reduced, except by prefetching (larger block sizes have a similar effect, because more data is brought into the cache on each miss, but they increase the miss penalty)
• Capacity misses – increase the cache size
• Conflict misses – increase associativity
Reducing Miss Penalty

Reducing Hit Time
Use simpler caches:
• Use direct-mapped caches
• Pipelining writes : check tag for write hit, then actually write data
Summary
Appendix B.4 Virtual Memory
At any instant in time computers are running multiple processes, each
with its own address space. It would be too expensive to dedicate a full
address space worth of memory for each process, especially since
many processes use only a small part of their address space.

Hence, there must be a means of sharing a smaller amount of physical memory among many processes. One way to do this, virtual memory, divides physical memory into blocks and allocates them to different processes.
Virtual memory is made so that it automatically manages the two
levels of memory hierarchy represented by main memory and
secondary storage.
Appendix B.5 Protection of Virtual Memory
Protection can be escalated, depending on the apprehension of the computer designer or the purchaser. One way to strengthen protection is to add rings to the processor protection structure, expanding memory access protection from two levels (user and kernel) to many more.

Like a military classification system of top secret, secret, confidential, and unclassified, concentric rings of security levels allow the most trusted to access
anything, the second most trusted to access everything except the innermost level,
and so on. The “civilian” programs are the least trusted and, hence, have the most
limited range of accesses. There may also be restrictions on what pieces of
memory can contain code—execute protection—and even on the entrance point
between the levels.
As the designer’s apprehension escalates to trepidation, these simple
rings may not suffice. Restricting the freedom given a program in the
inner sanctum requires a new classification system. Instead of a
military model, the analogy of this system is to keys and locks: A
program can’t unlock access to the data unless it has the key.

For these keys, or capabilities, to be useful, the hardware and operating system must be able to explicitly pass them from one
program to another without allowing a program itself to forge them.
Such checking requires a great deal of hardware support if time for
checking keys is to be kept low.
B.6 Fallacies and Pitfalls
Even a review of memory hierarchy has fallacies and pitfalls!

Pitfall:
• Too small an address space (e.g., the PDP-11, by DEC and Carnegie Mellon University, had 16-bit addresses, compared with the 24- to 31-bit addresses of the IBM 360 and the 32-bit addresses of the VAX).
This is a problem because it:
- limits the program length (a program must fit in less than 2^(address size) bytes, only 64 KB with 16-bit addresses)
- determines the minimum width of anything that can contain an address: the PC, registers, memory words, and effective address arithmetic
Pitfall :
• Ignoring the impact of the operating system on the performance of the
memory hierarchy
Pitfall :
• Relying on the operating systems to change the page size over time
