
COMPUTER ORGANIZATION AND DESIGN
Fifth Edition
The Hardware/Software Interface

Chapter 5
Large and Fast: Exploiting Memory Hierarchy

5.1 Introduction

Principle of Locality

Programs access a small proportion of their address space at any time

Temporal locality
  Items accessed recently are likely to be accessed again soon
  e.g., instructions in a loop, induction variables

Spatial locality
  Items near those accessed recently are likely to be accessed soon
  e.g., sequential instruction access, array data
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 2

Taking Advantage of Locality

Memory hierarchy
  Store everything on disk
  Copy recently accessed (and nearby) items from disk to smaller DRAM memory
    Main memory
  Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
    Cache memory attached to CPU


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 3

Memory Hierarchy Levels

Block (aka line): unit of copying
  May be multiple words

If accessed data is present in upper level
  Hit: access satisfied by upper level
    Hit ratio: hits/accesses

If accessed data is absent
  Miss: block copied from lower level
    Time taken: miss penalty
    Miss ratio: misses/accesses = 1 - hit ratio
  Then accessed data supplied from upper level

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 4

5.2 Memory Technologies

Memory Technology

Static RAM (SRAM)
  0.5ns - 2.5ns, $2000 - $5000 per GB
Dynamic RAM (DRAM)
  50ns - 70ns, $20 - $75 per GB
Magnetic disk
  5ms - 20ms, $0.20 - $2 per GB
Ideal memory
  Access time of SRAM
  Capacity and cost/GB of disk
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5

DRAM Technology

Data stored as a charge in a capacitor

Single transistor used to access the charge


Must periodically be refreshed

Read contents and write back


Performed on a DRAM row

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 6

Advanced DRAM Organization

Bits in a DRAM are organized as a rectangular array
  DRAM accesses an entire row
  Burst mode: supply successive words from a row with reduced latency

Double data rate (DDR) DRAM
  Transfer on rising and falling clock edges

Quad data rate (QDR) DRAM
  Separate DDR inputs and outputs


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 7

DRAM Generations
Year    Capacity    $/GB
1980    64Kbit      $1500000
1983    256Kbit     $500000
1985    1Mbit       $200000
1989    4Mbit       $50000
1992    16Mbit      $15000
1996    64Mbit      $10000
1998    128Mbit     $4000
2000    256Mbit     $1000
2004    512Mbit     $250
2007    1Gbit       $50

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 8

DRAM Performance Factors

Row buffer
  Allows several words to be read and refreshed in parallel

Synchronous DRAM
  Allows for consecutive accesses in bursts without needing to send each address
  Improves bandwidth

DRAM banking
  Allows simultaneous access to multiple DRAMs
  Improves bandwidth

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 9

Increasing Memory Bandwidth

4-word wide memory
  Miss penalty = 1 + 15 + 1 = 17 bus cycles
  Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle

4-bank interleaved memory
  Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
  Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 10

6.4 Flash Storage

Flash Storage

Nonvolatile semiconductor storage
  100× - 1000× faster than disk
  Smaller, lower power, more robust
  But more $/GB (between disk and DRAM)

Chapter 6 Storage and Other I/O Topics 11

Flash Types

NOR flash: bit cell like a NOR gate
  Random read/write access
  Used for instruction memory in embedded systems

NAND flash: bit cell like a NAND gate
  Denser (bits/area), but block-at-a-time access
  Cheaper per GB
  Used for USB keys, media storage, ...

Flash bits wear out after 1000s of accesses
  Not suitable for direct RAM or disk replacement
  Wear leveling: remap data to less-used blocks
Chapter 6 Storage and Other I/O Topics 12

6.3 Disk Storage

Disk Storage

Nonvolatile, rotating magnetic storage

Chapter 6 Storage and Other I/O Topics 13

Disk Sectors and Access

Each sector records

Sector ID
Data (512 bytes, 4096 bytes proposed)
Error correcting code (ECC)

Used to hide defects and recording errors

Synchronization fields and gaps

Access to a sector involves

Queuing delay if other accesses are pending


Seek: move the heads
Rotational latency
Data transfer
Controller overhead
Chapter 6 Storage and Other I/O Topics 14

Disk Access Example

Given
  512B sector, 15,000rpm, 4ms average seek time, 100MB/s transfer rate,
  0.2ms controller overhead, idle disk

Average read time
  4ms seek time
  + 1/2 / (15,000/60) = 2ms rotational latency
  + 512 / 100MB/s = 0.005ms transfer time
  + 0.2ms controller delay
  = 6.2ms

If actual average seek time is 1ms
  Average read time = 3.2ms


Chapter 6 Storage and Other I/O Topics 15
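The arithmetic above can be reproduced with a short calculation. This is a minimal sketch assuming the slide's parameters (512B sector, 15,000 rpm, 4 ms average seek, 100 MB/s transfer, 0.2 ms controller overhead); the variable names are illustrative.

#include <stdio.h>

int main(void) {
    double seek_ms       = 4.0;                         /* average seek time            */
    double rpm           = 15000.0;
    double rotation_ms   = 0.5 / (rpm / 60.0) * 1000.0; /* half a rotation, on average  */
    double transfer_ms   = 512.0 / 100e6 * 1000.0;      /* 512 B at 100 MB/s            */
    double controller_ms = 0.2;

    double read_ms = seek_ms + rotation_ms + transfer_ms + controller_ms;
    printf("Average read time = %.3f ms\n", read_ms);                 /* ~6.2 ms */

    /* With a measured average seek of 1 ms (locality + OS scheduling): */
    printf("With 1 ms seek    = %.3f ms\n", read_ms - seek_ms + 1.0); /* ~3.2 ms */
    return 0;
}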

Disk Performance Issues

Manufacturers quote average seek time
  Based on all possible seeks
  Locality and OS scheduling lead to smaller actual average seek times

Smart disk controller allocates physical sectors on disk
  Presents logical sector interface to host
  SCSI, ATA, SATA

Disk drives include caches
  Prefetch sectors in anticipation of access
  Avoid seek and rotational delay
Chapter 6 Storage and Other I/O Topics 16

5.3 The Basics of Caches

Cache Memory

Cache memory
  The level of the memory hierarchy closest to the CPU

Given accesses X1, ..., Xn-1, Xn
  How do we know if the data is present?
  Where do we look?

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 17

Direct Mapped Cache

Location determined by address


Direct mapped: only one choice

(Block address) modulo (#Blocks in cache)

#Blocks is a
power of 2
Use low-order
address bits

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 18

Tags and Valid Bits

How do we know which particular block is


stored in a cache location?

Store block address as well as the data


Actually, only need the high-order bits
Called the tag

What if there is no data in a location?

Valid bit: 1 = present, 0 = not present


Initially 0

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 19

Cache Example

8-blocks, 1 word/block, direct mapped


Initial state

Index   V   Tag   Data
000     N
001     N
010     N
011     N
100     N
101     N
110     N
111     N

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 20

Cache Example
Word addr   Binary addr   Hit/miss   Cache block
22          10 110        Miss       110

Index   V   Tag   Data
000     N
001     N
010     N
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 21

Cache Example
Word addr   Binary addr   Hit/miss   Cache block
26          11 010        Miss       010

Index   V   Tag   Data
000     N
001     N
010     Y   11    Mem[11010]
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 22

Cache Example
Word addr   Binary addr   Hit/miss   Cache block
22          10 110        Hit        110
26          11 010        Hit        010

Index   V   Tag   Data
000     N
001     N
010     Y   11    Mem[11010]
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 23

Cache Example
Word addr   Binary addr   Hit/miss   Cache block
16          10 000        Miss       000
3           00 011        Miss       011
16          10 000        Hit        000

Index   V   Tag   Data
000     Y   10    Mem[10000]
001     N
010     Y   11    Mem[11010]
011     Y   00    Mem[00011]
100     N
101     N
110     Y   10    Mem[10110]
111     N
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 24

Cache Example
Word addr   Binary addr   Hit/miss   Cache block
18          10 010        Miss       010

Index   V   Tag   Data
000     Y   10    Mem[10000]
001     N
010     Y   10    Mem[10010]
011     Y   00    Mem[00011]
100     N
101     N
110     Y   10    Mem[10110]
111     N
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 25

Address Subdivision

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 26

Example: Larger Block Size

64 blocks, 16 bytes/block

To what block number does address 1200 map?
  Block address = 1200/16 = 75
  Block number = 75 modulo 64 = 11

Address fields (32-bit byte address):
  Tag: bits 31-10 (22 bits)
  Index: bits 9-4 (6 bits)
  Offset: bits 3-0 (4 bits)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 27
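The mapping above can be sketched in a few lines of C. This is illustrative only, assuming the 64-block, 16-byte-block configuration on this slide; a hardware cache would do the same bit selection with wiring rather than shifts and masks.

#include <stdio.h>
#include <stdint.h>

/* 64 blocks of 16 bytes each: 4 offset bits, 6 index bits, 22 tag bits */
#define OFFSET_BITS 4
#define INDEX_BITS  6

int main(void) {
    uint32_t addr = 1200;

    uint32_t block_addr = addr >> OFFSET_BITS;                   /* 1200 / 16  = 75 */
    uint32_t index      = block_addr & ((1u << INDEX_BITS) - 1); /* 75 mod 64  = 11 */
    uint32_t tag        = addr >> (OFFSET_BITS + INDEX_BITS);
    uint32_t offset     = addr & ((1u << OFFSET_BITS) - 1);

    printf("block address = %u, index = %u, tag = %u, offset = %u\n",
           block_addr, index, tag, offset);
    return 0;
}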

Block Size Considerations

Larger blocks should reduce miss rate

Due to spatial locality

But in a fixed-sized cache
  Larger blocks ⇒ fewer of them
    More competition ⇒ increased miss rate
  Larger blocks ⇒ pollution

Larger miss penalty
  Can override benefit of reduced miss rate
  Early restart and critical-word-first can help
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 28

Cache Misses

On cache hit, CPU proceeds normally


On cache miss

Stall the CPU pipeline


Fetch block from next level of hierarchy
Instruction cache miss

Restart instruction fetch

Data cache miss

Complete data access

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 29

Write-Through

On data-write hit, could just update the block in


cache

But then cache and memory would be inconsistent

Write through: also update memory


But makes writes take longer

e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles
  Effective CPI = 1 + 0.1 × 100 = 11

Solution: write buffer

Holds data waiting to be written to memory


CPU continues immediately

Only stalls on write if write buffer is already full


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 30

Write-Back

Alternative: On data-write hit, just update


the block in cache

Keep track of whether each block is dirty

When a dirty block is replaced

Write it back to memory


Can use a write buffer to allow replacing block
to be read first

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 31

Write Allocation

What should happen on a write miss?


Alternatives for write-through

Allocate on miss: fetch the block


Write around: don't fetch the block

Since programs often write a whole block before


reading it (e.g., initialization)

For write-back

Usually fetch the block

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 32

Example: Intrinsity FastMATH

Embedded MIPS processor

Split cache: separate I-cache and D-cache

12-stage pipeline
Instruction and data access on each cycle
Each 16KB: 256 blocks × 16 words/block
D-cache: write-through or write-back

SPEC2000 miss rates

I-cache: 0.4%
D-cache: 11.4%
Weighted average: 3.2%
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 33

Example: Intrinsity FastMATH

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 34

Main Memory Supporting Caches

Use DRAMs for main memory

Fixed width (e.g., 1 word)


Connected by fixed-width clocked bus

Example cache block read

Bus clock is typically slower than CPU clock

1 bus cycle for address transfer


15 bus cycles per DRAM access
1 bus cycle per data transfer

For 4-word block, 1-word-wide DRAM

Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles


Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 35
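A small calculation, assuming the bus-cycle model on this slide (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per bus data transfer), reproduces the miss penalties and bandwidths for the one-word-wide case here and for the wider and interleaved organizations shown earlier. Variable names are illustrative.

#include <stdio.h>

int main(void) {
    int words = 4;                                    /* 4-word cache block           */

    int narrow      = 1 + words * 15 + words * 1;     /* 1-word-wide DRAM: 65 cycles  */
    int wide        = 1 + 15 + 1;                     /* 4-word-wide memory: 17       */
    int interleaved = 1 + 15 + words * 1;             /* 4 banks, overlapped: 20      */

    printf("1-word-wide: %d cycles, %.2f B/cycle\n", narrow,      16.0 / narrow);
    printf("4-word-wide: %d cycles, %.2f B/cycle\n", wide,        16.0 / wide);
    printf("interleaved: %d cycles, %.2f B/cycle\n", interleaved, 16.0 / interleaved);
    return 0;
}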

5.4 Measuring and Improving Cache Performance

Measuring Cache Performance

Components of CPU time
  Program execution cycles
    Includes cache hit time
  Memory stall cycles
    Mainly from cache misses

With simplifying assumptions:
  Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                      = (Instructions / Program) × (Misses / Instruction) × Miss penalty
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 36

Cache Performance Example

Given
  I-cache miss rate = 2%
  D-cache miss rate = 4%
  Miss penalty = 100 cycles
  Base CPI (ideal cache) = 2
  Load & stores are 36% of instructions

Miss cycles per instruction
  I-cache: 0.02 × 100 = 2
  D-cache: 0.36 × 0.04 × 100 = 1.44

Actual CPI = 2 + 2 + 1.44 = 5.44
  Ideal CPU is 5.44/2 = 2.72 times faster

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 37
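The same numbers can be checked with a short sketch; the parameters are those given above and the variable names are illustrative.

#include <stdio.h>

int main(void) {
    double base_cpi     = 2.0;
    double icache_miss  = 0.02;   /* 2% of instruction fetches miss */
    double dcache_miss  = 0.04;   /* 4% of data accesses miss       */
    double ld_st_frac   = 0.36;   /* loads/stores per instruction   */
    double miss_penalty = 100.0;  /* cycles                         */

    double i_stalls = icache_miss * miss_penalty;               /* 2.00 */
    double d_stalls = ld_st_frac * dcache_miss * miss_penalty;  /* 1.44 */
    double cpi      = base_cpi + i_stalls + d_stalls;           /* 5.44 */

    printf("CPI = %.2f, ideal cache would be %.2fx faster\n", cpi, cpi / base_cpi);
    return 0;
}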

Average Access Time

Hit time is also important for performance


Average memory access time (AMAT)

AMAT = Hit time + Miss rate Miss penalty

Example

CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  AMAT = 1 + 0.05 × 20 = 2ns
    2 cycles per instruction

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 38
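A one-line calculation with the parameters above confirms the result (names illustrative).

#include <stdio.h>

int main(void) {
    double hit_time = 1.0, miss_rate = 0.05, miss_penalty = 20.0;  /* in cycles      */
    double amat = hit_time + miss_rate * miss_penalty;             /* 2 cycles = 2ns */
    printf("AMAT = %.1f cycles\n", amat);
    return 0;
}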

Performance Summary

When CPU performance increases
  Decreasing base CPI
    Greater proportion of time spent on memory stalls
  Increasing clock rate
    Miss penalty becomes more significant
    Memory stalls account for more CPU cycles

Can't neglect cache behavior when evaluating system performance
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 39

Associative Caches

Fully associative

Allow a given block to go in any cache entry


Requires all entries to be searched at once
Comparator per entry (expensive)

n-way set associative

Each set contains n entries


Block number determines which set

(Block number) modulo (#Sets in cache)

Search all entries in a given set at once


n comparators (less expensive)
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 40

Associative Cache Example

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 41

Spectrum of Associativity

For a cache with 8 entries

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 42

Associativity Example

Compare 4-block caches
  Direct mapped, 2-way set associative, fully associative
  Block access sequence: 0, 8, 0, 6, 8

Direct mapped

Block addr   Cache index   Hit/miss   Cache content after access
                                      [0]      [1]   [2]      [3]
0            0             miss       Mem[0]
8            0             miss       Mem[8]
0            0             miss       Mem[0]
6            2             miss       Mem[0]         Mem[6]
8            0             miss       Mem[8]         Mem[6]

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 43

Associativity Example

2-way set associative

Block addr   Cache index   Hit/miss   Cache content after access
                                      Set 0              Set 1
0            0             miss       Mem[0]
8            0             miss       Mem[0]   Mem[8]
0            0             hit        Mem[0]   Mem[8]
6            0             miss       Mem[0]   Mem[6]
8            0             miss       Mem[8]   Mem[6]

Fully associative

Block addr   Hit/miss   Cache content after access
0            miss       Mem[0]
8            miss       Mem[0]   Mem[8]
0            hit        Mem[0]   Mem[8]
6            miss       Mem[0]   Mem[8]   Mem[6]
8            hit        Mem[0]   Mem[8]   Mem[6]

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 44
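The three tables above can be reproduced by a small simulator. The sketch below models an LRU set-associative cache with 4 blocks in total; ways = 1 gives the direct-mapped case and ways = 4 the fully associative case. It is illustrative only, not code from the text.

#include <stdio.h>

#define NUM_BLOCKS 4

/* LRU set-associative cache with NUM_BLOCKS blocks and 'ways' blocks per set. */
static void simulate(int ways, const int *refs, int n) {
    int sets = NUM_BLOCKS / ways;
    int tag[NUM_BLOCKS], valid[NUM_BLOCKS], last_used[NUM_BLOCKS];
    int misses = 0;
    for (int i = 0; i < NUM_BLOCKS; i++) valid[i] = 0;

    printf("%d-way (%d set%s):\n", ways, sets, sets == 1 ? "" : "s");
    for (int r = 0; r < n; r++) {
        int block = refs[r];
        int base  = (block % sets) * ways;      /* first entry of the chosen set */
        int entry = -1;

        for (int w = 0; w < ways; w++)          /* look for the block in the set */
            if (valid[base + w] && tag[base + w] == block) entry = base + w;

        printf("  block %d: %s\n", block, entry >= 0 ? "hit" : "miss");

        if (entry < 0) {                        /* miss: use an invalid entry if any, else LRU */
            misses++;
            entry = base;
            for (int w = 0; w < ways; w++) {
                int e = base + w;
                if (!valid[e]) { entry = e; break; }
                if (last_used[e] < last_used[entry]) entry = e;
            }
            valid[entry] = 1;
            tag[entry]   = block;
        }
        last_used[entry] = r;                   /* record most recent use for LRU */
    }
    printf("  total: %d misses / %d accesses\n\n", misses, n);
}

int main(void) {
    int refs[] = {0, 8, 0, 6, 8};               /* block access sequence from the slides */
    int n = (int)(sizeof refs / sizeof refs[0]);
    simulate(1, refs, n);                       /* direct mapped: 5 misses          */
    simulate(2, refs, n);                       /* 2-way set associative: 4 misses  */
    simulate(NUM_BLOCKS, refs, n);              /* fully associative: 3 misses      */
    return 0;
}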

How Much Associativity

Increased associativity decreases miss


rate

But with diminishing returns

Simulation of a system with 64KB


D-cache, 16-word blocks, SPEC2000

1-way: 10.3%
2-way: 8.6%
4-way: 8.3%
8-way: 8.1%
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 45

Set Associative Cache Organization

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 46

Replacement Policy

Direct mapped: no choice


Set associative

Prefer non-valid entry, if there is one


Otherwise, choose among entries in the set

Least-recently used (LRU)

Choose the one unused for the longest time

Simple for 2-way, manageable for 4-way, too hard


beyond that

Random

Gives approximately the same performance


as LRU for high associativity
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 47

Multilevel Caches

Primary cache attached to CPU
  Small, but fast

Level-2 cache services misses from primary cache
  Larger, slower, but still faster than main memory

Main memory services L-2 cache misses

Some high-end systems include L-3 cache

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 48

Multilevel Cache Example

Given
  CPU base CPI = 1, clock rate = 4GHz
  Miss rate/instruction = 2%
  Main memory access time = 100ns

With just primary cache
  Miss penalty = 100ns/0.25ns = 400 cycles
  Effective CPI = 1 + 0.02 × 400 = 9

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 49

Example (cont.)

Now add L-2 cache
  Access time = 5ns
  Global miss rate to main memory = 0.5%

Primary miss with L-2 hit
  Penalty = 5ns/0.25ns = 20 cycles

Primary miss with L-2 miss
  Extra penalty = 400 cycles

CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4

Performance ratio = 9/3.4 = 2.6
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 50
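The with/without-L2 comparison can be checked with a short calculation, assuming the parameters given above (names are illustrative).

#include <stdio.h>

int main(void) {
    double cycle_ns    = 0.25;   /* 4 GHz clock                        */
    double l1_miss     = 0.02;   /* primary misses per instruction     */
    double l2_time_ns  = 5.0;    /* L-2 access time                    */
    double mem_time_ns = 100.0;  /* main memory access time            */
    double global_miss = 0.005;  /* misses that go all the way to DRAM */

    double mem_penalty = mem_time_ns / cycle_ns;   /* 400 cycles */
    double l2_penalty  = l2_time_ns / cycle_ns;    /*  20 cycles */

    double cpi_l1only = 1.0 + l1_miss * mem_penalty;                            /* 9.0 */
    double cpi_l1l2   = 1.0 + l1_miss * l2_penalty + global_miss * mem_penalty; /* 3.4 */

    printf("L1 only: CPI = %.1f\n", cpi_l1only);
    printf("L1 + L2: CPI = %.1f  (%.1fx faster)\n", cpi_l1l2, cpi_l1only / cpi_l1l2);
    return 0;
}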

Multilevel Cache Considerations

Primary cache
  Focus on minimal hit time

L-2 cache
  Focus on low miss rate to avoid main memory access
  Hit time has less overall impact

Results
  L-1 cache usually smaller than a single cache
  L-1 block size smaller than L-2 block size
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 51

Interactions with Advanced CPUs

Out-of-order CPUs can execute


instructions during cache miss

Pending store stays in load/store unit


Dependent instructions wait in reservation
stations

Independent instructions continue

Effect of miss depends on program data


flow

Much harder to analyse


Use system simulation
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 52

Interactions with Software

Misses depend on
memory access
patterns
Algorithm behavior
Compiler
optimization for
memory access

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 53

Software Optimization via Blocking

Goal: maximize accesses to data before it


is replaced
Consider inner loops of DGEMM:
for (int j = 0; j < n; ++j)
{
  double cij = C[i+j*n];
  for (int k = 0; k < n; k++)
    cij += A[i+k*n] * B[k+j*n];
  C[i+j*n] = cij;
}
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 54

DGEMM Access Pattern

C, A, and B arrays
older accesses
new accesses

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 55

Cache Blocked DGEMM


#define BLOCKSIZE 32

void do_block (int n, int si, int sj, int sk,
               double *A, double *B, double *C)
{
  for (int i = si; i < si+BLOCKSIZE; ++i)
    for (int j = sj; j < sj+BLOCKSIZE; ++j)
    {
      double cij = C[i+j*n];              /* cij = C[i][j] */
      for (int k = sk; k < sk+BLOCKSIZE; k++)
        cij += A[i+k*n] * B[k+j*n];       /* cij += A[i][k]*B[k][j] */
      C[i+j*n] = cij;                     /* C[i][j] = cij */
    }
}

void dgemm (int n, double* A, double* B, double* C)
{
  for (int sj = 0; sj < n; sj += BLOCKSIZE)
    for (int si = 0; si < n; si += BLOCKSIZE)
      for (int sk = 0; sk < n; sk += BLOCKSIZE)
        do_block(n, si, sj, sk, A, B, C);
}

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 56
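As written, do_block assumes n is a multiple of BLOCKSIZE and that the matrices are stored in column-major order (element (i,j) at offset i + j*n). A minimal, hypothetical driver might look like this, assuming dgemm from the slide above is in the same file:

#include <stdlib.h>

int main(void) {
    int n = 64;                               /* must be a multiple of BLOCKSIZE */
    double *A = calloc(n * n, sizeof(double));
    double *B = calloc(n * n, sizeof(double));
    double *C = calloc(n * n, sizeof(double));
    /* ... fill A and B ... */
    dgemm(n, A, B, C);                        /* C += A * B, one block at a time */
    free(A); free(B); free(C);
    return 0;
}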

Blocked DGEMM Access Pattern

Unoptimized

Blocked

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 57

5.5 Dependable Memory Hierarchy

Dependability

Service accomplishment
  Service delivered as specified

Service interruption
  Deviation from specified service

Transitions: Failure (accomplishment → interruption), Restoration (interruption → accomplishment)

Fault: failure of a component
  May or may not lead to system failure

Chapter 6 Storage and Other I/O Topics 58

Dependability Measures

Reliability: mean time to failure (MTTF)


Service interruption: mean time to repair (MTTR)
Mean time between failures

MTBF = MTTF + MTTR

Availability = MTTF / (MTTF + MTTR)


Improving Availability

Increase MTTF: fault avoidance, fault tolerance, fault


forecasting
Reduce MTTR: improved tools and processes for
diagnosis and repair

Chapter 6 Storage and Other I/O Topics 59

The Hamming SEC Code

Hamming distance
  Number of bits that are different between two bit patterns

Minimum distance = 2 provides single-bit error detection
  E.g., parity code

Minimum distance = 3 provides single error correction, 2-bit error detection

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 60

Encoding SEC

To calculate Hamming code:
  Number bits from 1 on the left
  All bit positions that are a power of 2 are parity bits
  Each parity bit checks certain data bits:

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 61

Decoding SEC

Value of parity bits indicates which bits are


in error

Use numbering from encoding procedure


E.g.

Parity bits = 0000 indicates no error


Parity bits = 1010 indicates bit 10 was flipped

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 62
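A minimal sketch of the procedure for 8 data bits (a (12,8) Hamming code): parity bits sit at the power-of-2 positions, and the syndrome is the XOR of the failing check positions, so a syndrome of 10 indicates bit 10 was flipped, as in the example above. The helper names and word width are illustrative, not from the text.

#include <stdio.h>
#include <stdint.h>

#define CODE_BITS 12   /* 8 data bits + 4 parity bits at positions 1, 2, 4, 8 */

static int bit(uint32_t w, int pos) { return (w >> (pos - 1)) & 1; }

/* Parity bit at position p covers every position whose binary form has bit p set. */
static uint32_t set_parity(uint32_t code) {
    for (int p = 1; p <= CODE_BITS; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= CODE_BITS; pos++)
            if ((pos & p) && pos != p) parity ^= bit(code, pos);
        if (parity) code |= 1u << (p - 1); else code &= ~(1u << (p - 1));
    }
    return code;
}

static uint32_t encode(uint8_t data) {
    uint32_t code = 0;
    int pos = 1;
    for (int d = 0; d < 8; d++) {
        while ((pos & (pos - 1)) == 0) pos++;      /* skip parity positions */
        if ((data >> d) & 1) code |= 1u << (pos - 1);
        pos++;
    }
    return set_parity(code);
}

/* Syndrome: 0 means no error, otherwise it is the position of the flipped bit. */
static int syndrome(uint32_t code) {
    int s = 0;
    for (int p = 1; p <= CODE_BITS; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= CODE_BITS; pos++)
            if (pos & p) parity ^= bit(code, pos);
        if (parity) s |= p;
    }
    return s;
}

int main(void) {
    uint32_t code = encode(0x5A);
    printf("syndrome (clean)          = %d\n", syndrome(code));               /*  0 */
    printf("syndrome (bit 10 flipped) = %d\n", syndrome(code ^ (1u << 9)));   /* 10 */
    return 0;
}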

SEC/DED Code

Add an additional parity bit for the whole word


(pn)

Make Hamming distance = 4


Decoding:

Let H = SEC parity bits

H even, pn even, no error

H odd, pn odd, correctable single bit error

H even, pn odd, error in pn bit

H odd, pn even, double error occurred

Note: ECC DRAM uses SEC/DED with 8 bits protecting each 64 bits
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 63

5.6 Virtual Machines

Virtual Machines

Host computer emulates guest operating system and machine resources
  Improved isolation of multiple guests
  Avoids security and reliability problems
  Aids sharing of resources

Virtualization has some performance impact
  Feasible with modern high-performance computers

Examples
  IBM VM/370 (1970s technology!)
  VMWare
  Microsoft Virtual PC
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 64

Virtual Machine Monitor

Maps virtual resources to physical resources
  Memory, I/O devices, CPUs

Guest code runs on native machine in user mode
  Traps to VMM on privileged instructions and access to protected resources

Guest OS may be different from host OS

VMM handles real I/O devices
  Emulates generic virtual I/O devices for guest


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 65

Example: Timer Virtualization

In native machine, on timer interrupt
  OS suspends current process, handles interrupt, selects and resumes next process

With Virtual Machine Monitor
  VMM suspends current VM, handles interrupt, selects and resumes next VM

If a VM requires timer interrupts
  VMM emulates a virtual timer
  Emulates interrupt for VM when physical timer interrupt occurs
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 66

Instruction Set Support

User and System modes

Privileged instructions only available in system mode
  Trap to system if executed in user mode

All physical resources only accessible using privileged instructions
  Including page tables, interrupt controls, I/O registers

Renaissance of virtualization support
  Current ISAs (e.g., x86) adapting


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 67

5.7 Virtual Memory

Virtual Memory

Use main memory as a cache for secondary (disk) storage
  Managed jointly by CPU hardware and the operating system (OS)

Programs share main memory
  Each gets a private virtual address space holding its frequently used code and data
  Protected from other programs

CPU and OS translate virtual addresses to physical addresses
  VM block is called a page
  VM translation miss is called a page fault
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 68

Address Translation

Fixed-size pages (e.g., 4K)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 69

Page Fault Penalty

On page fault, the page must be fetched


from disk

Takes millions of clock cycles


Handled by OS code

Try to minimize page fault rate

Fully associative placement


Smart replacement algorithms

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 70

Page Tables

Stores placement information

If page is present in memory

Array of page table entries, indexed by virtual


page number
Page table register in CPU points to page table
in physical memory
PTE stores the physical page number
Plus other status bits (referenced, dirty, ...)

If page is not present

PTE can refer to location in swap space on disk


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 71

Translation Using a Page Table

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 72

Mapping Pages to Storage

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 73

Replacement and Writes

To reduce page fault rate, prefer least-recently used (LRU) replacement

Reference bit (aka use bit) in PTE set to 1 on


access to page
Periodically cleared to 0 by OS
A page with reference bit = 0 has not been
used recently

Disk writes take millions of cycles

Block at once, not individual locations


Write through is impractical
Use write-back
Dirty bit in PTE set when page is written
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 74

Fast Translation Using a TLB

Address translation would appear to require


extra memory references

One to access the PTE


Then the actual memory access

But access to page tables has good locality

So use a fast cache of PTEs within the CPU


Called a Translation Look-aside Buffer (TLB)
Typical: 16-512 PTEs, 0.5-1 cycle for hit, 10-100 cycles for miss, 0.01%-1% miss rate
Misses could be handled by hardware or software

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 75

Fast Translation Using a TLB

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 76
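Combining the last few slides, a software model of translation might look like the sketch below: check a small fully associative TLB first, fall back to a single-level page table on a TLB miss, and report a page fault if the PTE is invalid. The sizes, field names, and replacement choice are illustrative, not taken from the text.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   12              /* 4 KB pages                     */
#define TLB_ENTRIES 16

typedef struct { bool valid; uint32_t ppn; } pte_t;             /* page table entry */
typedef struct { bool valid; uint32_t vpn, ppn; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static pte_t *page_table;           /* indexed by virtual page number; set up by the "OS" */

/* Returns true and fills *paddr on success; false means page fault. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)            /* TLB hit: no extra memory access */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].ppn << PAGE_BITS) | offset;
            return true;
        }

    pte_t pte = page_table[vpn];                     /* TLB miss: read PTE from memory  */
    if (!pte.valid)
        return false;                                /* page fault: OS must fetch page  */

    int victim = vpn % TLB_ENTRIES;                  /* simple (non-LRU) replacement    */
    tlb[victim] = (tlb_entry_t){ true, vpn, pte.ppn };
    *paddr = (pte.ppn << PAGE_BITS) | offset;
    return true;
}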

TLB Misses

If page is in memory

Load the PTE from memory and retry


Could be handled in hardware

Or in software

Can get complex for more complicated page table


structures
Raise a special exception, with optimized handler

If page is not in memory (page fault)

OS handles fetching the page and updating


the page table
Then restart the faulting instruction
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 77

TLB Miss Handler

TLB miss indicates either
  Page present, but PTE not in TLB
  Page not present

Must recognize TLB miss before destination register is overwritten
  Raise exception

Handler copies PTE from memory to TLB
  Then restarts instruction
  If page not present, page fault will occur
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 78

Page Fault Handler

Use faulting virtual address to find PTE


Locate page on disk
Choose page to replace

If dirty, write to disk first

Read page into memory and update page


table
Make process runnable again

Restart from faulting instruction

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 79

TLB and Cache Interaction

If cache tag uses


physical address

Need to translate
before cache lookup

Alternative: use virtual


address tag

Complications due to
aliasing

Different virtual
addresses for shared
physical address

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 80

Memory Protection

Different tasks can share parts of their


virtual address spaces

But need to protect against errant access


Requires OS assistance

Hardware support for OS protection

Privileged supervisor mode (aka kernel mode)


Privileged instructions
Page tables and other state information only
accessible in supervisor mode
System call exception (e.g., syscall in MIPS)
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 81

5.8 A Common Framework for Memory Hierarchies

The Memory Hierarchy

The BIG Picture

Common principles apply at all levels of the memory hierarchy
  Based on notions of caching

At each level in the hierarchy
  Block placement
  Finding a block
  Replacement on a miss
  Write policy

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 82

Block Placement

Determined by associativity
  Direct mapped (1-way associative)
    One choice for placement
  n-way set associative
    n choices within a set
  Fully associative
    Any location

Higher associativity reduces miss rate
  Increases complexity, cost, and access time


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 83

Finding a Block

Associativity            Location method                               Tag comparisons
Direct mapped            Index                                         1
n-way set associative    Set index, then search entries within set     n
Fully associative        Search all entries                            #entries
                         Full lookup table                             0

Hardware caches
  Reduce comparisons to reduce cost

Virtual memory
  Full table lookup makes full associativity feasible
  Benefit in reduced miss rate
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 84

Replacement

Choice of entry to replace on a miss
  Least recently used (LRU)
    Complex and costly hardware for high associativity
  Random
    Close to LRU, easier to implement

Virtual memory
  LRU approximation with hardware support

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 85

Write Policy

Write-through
  Update both upper and lower levels
  Simplifies replacement, but may require write buffer

Write-back
  Update upper level only
  Update lower level when block is replaced
  Need to keep more state

Virtual memory
  Only write-back is feasible, given disk write latency
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 86

Sources of Misses

Compulsory misses (aka cold start misses)
  First access to a block

Capacity misses
  Due to finite cache size
  A replaced block is later accessed again

Conflict misses (aka collision misses)
  In a non-fully associative cache
  Due to competition for entries in a set
  Would not occur in a fully associative cache of the same total size
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 87

Cache Design Trade-offs


Design change            Effect on miss rate           Negative performance effect
Increase cache size      Decrease capacity misses      May increase access time
Increase associativity   Decrease conflict misses      May increase access time
Increase block size      Decrease compulsory misses    Increases miss penalty. For very
                                                       large block size, may increase
                                                       miss rate due to pollution.

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 88

5.9 Using a Finite State Machine to Control A Simple Cache

Cache Control

Example cache characteristics
  Direct-mapped, write-back, write allocate
  Block size: 4 words (16 bytes)
  Cache size: 16 KB (1024 blocks)
  32-bit byte addresses
  Valid bit and dirty bit per block
  Blocking cache
    CPU waits until access is complete

Address fields (32-bit byte address):
  Tag: bits 31-14 (18 bits)
  Index: bits 13-4 (10 bits)
  Offset: bits 3-0 (4 bits)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 89

Interface Signals

CPU <-> Cache signals
  Read/Write, Valid
  Address (32 bits)
  Write Data (32 bits)
  Read Data (32 bits)
  Ready

Cache <-> Memory signals
  Read/Write, Valid
  Address (32 bits)
  Write Data (128 bits)
  Read Data (128 bits)
  Ready

Multiple cycles per access

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 90

Finite State Machines

Use an FSM to
sequence control steps
Set of states, transition
on each clock edge

State values are binary


encoded
Current state stored in a
register
Next state
= fn (current state,
current inputs)

Control output signals


= fo (current state)
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 91

Cache Controller FSM


Could partition
into separate
states to
reduce clock
cycle time

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 92
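One common structure for such a controller uses four states (Idle, Compare Tag, Write-Back, Allocate). A plausible rendering in C is sketched below; the input signal names are illustrative, not taken from the book's figure.

/* Sketch of a four-state controller for the simple blocking,
   write-back, write-allocate cache described above.            */
typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } state_t;

typedef struct {
    int cpu_valid;      /* CPU has a pending request            */
    int hit;            /* tag match and valid bit set          */
    int dirty;          /* victim block has been modified       */
    int mem_ready;      /* memory finished the current transfer */
} inputs_t;

state_t next_state(state_t s, inputs_t in) {
    switch (s) {
    case IDLE:                        /* wait for a valid CPU request        */
        return in.cpu_valid ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:                 /* hit: done; miss: evict or refill    */
        if (in.hit)   return IDLE;
        return in.dirty ? WRITE_BACK : ALLOCATE;
    case WRITE_BACK:                  /* write old block to memory first     */
        return in.mem_ready ? ALLOCATE : WRITE_BACK;
    case ALLOCATE:                    /* fetch new block, then re-check tag  */
        return in.mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;
}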

5.10 Parallelism and Memory Hierarchies: Cache Coherence

Cache Coherence Problem

Suppose two CPU cores share a physical address space
  Write-through caches

Time step   Event                 CPU A's cache   CPU B's cache   Memory
0                                                                 0
1           CPU A reads X         0                               0
2           CPU B reads X         0               0               0
3           CPU A writes 1 to X   1               0               1

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 93

Coherence Defined

Informally: Reads return most recently written value

Formally:
  P writes X; P reads X (no intervening writes)
    ⇒ read returns written value
  P1 writes X; P2 reads X (sufficiently later)
    ⇒ read returns written value
    c.f. CPU B reading X after step 3 in example
  P1 writes X, P2 writes X
    ⇒ all processors see writes in the same order
    End up with the same final value for X


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 94

Cache Coherence Protocols

Operations performed by caches in multiprocessors to ensure coherence
  Migration of data to local caches
    Reduces bandwidth for shared memory
  Replication of read-shared data
    Reduces contention for access

Snooping protocols
  Each cache monitors bus reads/writes

Directory-based protocols
  Caches and memory record sharing status of blocks in a directory
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 95

Invalidating Snooping Protocols

Cache gets exclusive access to a block when it is to be written
  Broadcasts an invalidate message on the bus
  Subsequent read in another cache misses
    Owning cache supplies updated value

CPU activity          Bus activity        CPU A's cache   CPU B's cache   Memory
                                                                          0
CPU A reads X         Cache miss for X    0                               0
CPU B reads X         Cache miss for X    0               0               0
CPU A writes 1 to X   Invalidate for X    1                               0
CPU B reads X         Cache miss for X    1               1               1

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 96

Memory Consistency

When are writes seen by other processors
  "Seen" means a read returns the written value
  Can't be instantaneous

Assumptions
  A write completes only when all processors have seen it
  A processor does not reorder writes with other accesses

Consequence
  P writes X then writes Y
    ⇒ all processors that see new Y also see new X
  Processors can reorder reads, but not writes
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 97

5.13 The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies

Multilevel On-Chip Caches

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 98

2-Level TLB Organization

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 99

Supporting Multiple Issue

Both have multi-banked caches that allow


multiple accesses per cycle assuming no
bank conflicts
Core i7 cache optimizations

Return requested word first


Non-blocking cache

Hit under miss


Miss under miss

Data prefetching
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 100

5.14 Going Faster: Cache Blocking and Matrix Multiply

DGEMM

Combine cache blocking and subword parallelism

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 101

5.15 Fallacies and Pitfalls

Pitfalls

Byte vs. word addressing
  Example: 32-byte direct-mapped cache, 4-byte blocks
    Byte 36 maps to block 1
    Word 36 maps to block 4

Ignoring memory system effects when writing or generating code
  Example: iterating over rows vs. columns of arrays
  Large strides result in poor locality
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 102
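The two mappings above can be checked with a couple of lines; the "word 36" case uses the word address directly as the block address, since a 4-byte block holds exactly one word.

#include <stdio.h>

int main(void) {
    int cache_bytes = 32, block_bytes = 4;
    int num_blocks  = cache_bytes / block_bytes;      /* 8 blocks */

    int byte_addr = 36;
    int word_addr = 36;                               /* word address, i.e. byte 144 */

    printf("byte 36 -> block %d\n", (byte_addr / block_bytes) % num_blocks); /* 1 */
    printf("word 36 -> block %d\n", word_addr % num_blocks);                 /* 4 */
    return 0;
}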

Pitfalls

In multiprocessor with shared L2 or L3 cache
  Less associativity than cores results in conflict misses
  More cores ⇒ need to increase associativity

Using AMAT to evaluate performance of out-of-order processors
  Ignores effect of non-blocked accesses
  Instead, evaluate performance by simulation
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 103

Pitfalls

Extending address range using segments
  E.g., Intel 80286
  But a segment is not always big enough
  Makes address arithmetic complicated

Implementing a VMM on an ISA not designed for virtualization
  E.g., non-privileged instructions accessing hardware resources
  Either extend ISA, or require guest OS not to use problematic instructions
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 104

5.16 Concluding Remarks

Concluding Remarks

Fast memories are small, large memories are slow
  We really want fast, large memories
  Caching gives this illusion

Principle of locality
  Programs use a small part of their memory space frequently

Memory hierarchy
  L1 cache <-> L2 cache <-> DRAM memory <-> disk

Memory system design is critical for multiprocessors
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 105
