
COMPUTER ORGANIZATION AND DESIGN
Fifth Edition
The Hardware/Software Interface

Chapter 5
Large and Fast: Exploiting Memory Hierarchy

5.1 Introduction

Principle of Locality

Programs access a small proportion of their address space at any time

Temporal locality
  Items accessed recently are likely to be accessed again soon
  e.g., instructions in a loop, induction variables

Spatial locality
  Items near those accessed recently are likely to be accessed soon
  e.g., sequential instruction access, array data
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 2

Taking Advantage of Locality

Memory hierarchy
  Store everything on disk
  Copy recently accessed (and nearby) items from disk to smaller DRAM memory
    Main memory
  Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
    Cache memory attached to CPU


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 3

Memory Hierarchy Levels

Block (aka line): unit of copying
  May be multiple words

If accessed data is present in upper level
  Hit: access satisfied by upper level
    Hit ratio: hits/accesses

If accessed data is absent
  Miss: block copied from lower level
    Time taken: miss penalty
    Miss ratio: misses/accesses = 1 - hit ratio
  Then accessed data supplied from upper level

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 4

5.2 Memory Technologies

Memory Technology

Static RAM (SRAM)
  0.5ns - 2.5ns, $2000 - $5000 per GB
Dynamic RAM (DRAM)
  50ns - 70ns, $20 - $75 per GB
Magnetic disk
  5ms - 20ms, $0.20 - $2 per GB
Ideal memory
  Access time of SRAM
  Capacity and cost/GB of disk
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5

DRAM Technology

Data stored as a charge in a capacitor

Single transistor used to access the charge


Must periodically be refreshed

Read contents and write back


Performed on a DRAM row

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 6

Advanced DRAM Organization

Bits in a DRAM are organized as a rectangular array
  DRAM accesses an entire row
  Burst mode: supply successive words from a row with reduced latency

Double data rate (DDR) DRAM
  Transfer on rising and falling clock edges

Quad data rate (QDR) DRAM
  Separate DDR inputs and outputs


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 7

DRAM Generations
Year    Capacity    $/GB
1980    64Kbit      $1500000
1983    256Kbit     $500000
1985    1Mbit       $200000
1989    4Mbit       $50000
1992    16Mbit      $15000
1996    64Mbit      $10000
1998    128Mbit     $4000
2000    256Mbit     $1000
2004    512Mbit     $250
2007    1Gbit       $50

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 8

DRAM Performance Factors

Row buffer
  Allows several words to be read and refreshed in parallel

Synchronous DRAM
  Allows for consecutive accesses in bursts without needing to send each address
  Improves bandwidth

DRAM banking
  Allows simultaneous access to multiple DRAMs
  Improves bandwidth

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 9

Increasing Memory Bandwidth

4-word wide memory
  Miss penalty = 1 + 15 + 1 = 17 bus cycles
  Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle

4-bank interleaved memory
  Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
  Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 10

6.4 Flash Storage

Flash Storage

Nonvolatile semiconductor storage
  100× - 1000× faster than disk
  Smaller, lower power, more robust
  But more $/GB (between disk and DRAM)

Chapter 6 Storage and Other I/O Topics 11

Flash Types

NOR flash: bit cell like a NOR gate
  Random read/write access
  Used for instruction memory in embedded systems

NAND flash: bit cell like a NAND gate
  Denser (bits/area), but block-at-a-time access
  Cheaper per GB
  Used for USB keys, media storage, ...

Flash bits wear out after 1000s of accesses
  Not suitable for direct RAM or disk replacement
  Wear leveling: remap data to less-used blocks
Chapter 6 Storage and Other I/O Topics 12

6.3 Disk Storage

Disk Storage

Nonvolatile, rotating magnetic storage

Chapter 6 Storage and Other I/O Topics 13

Disk Sectors and Access

Each sector records

Sector ID
Data (512 bytes, 4096 bytes proposed)
Error correcting code (ECC)

Used to hide defects and recording errors

Synchronization fields and gaps

Access to a sector involves

Queuing delay if other accesses are pending


Seek: move the heads
Rotational latency
Data transfer
Controller overhead
Chapter 6 Storage and Other I/O Topics 14

Disk Access Example

Given
  512B sector, 15,000rpm, 4ms average seek time, 100MB/s transfer rate,
  0.2ms controller overhead, idle disk

Average read time
  4ms seek time
  + 1/2 / (15,000/60) = 2ms rotational latency
  + 512 / 100MB/s = 0.005ms transfer time
  + 0.2ms controller delay
  = 6.2ms

If actual average seek time is 1ms
  Average read time = 3.2ms


Chapter 6 Storage and Other I/O Topics 15
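The arithmetic above can be reproduced with a short calculation. This is a minimal sketch assuming the slide's parameters (512B sector, 15,000 rpm, 4 ms average seek, 100 MB/s transfer, 0.2 ms controller overhead); the variable names are illustrative.

#include <stdio.h>

int main(void) {
    double seek_ms       = 4.0;                         /* average seek time            */
    double rpm           = 15000.0;
    double rotation_ms   = 0.5 / (rpm / 60.0) * 1000.0; /* half a rotation, on average  */
    double transfer_ms   = 512.0 / 100e6 * 1000.0;      /* 512 B at 100 MB/s            */
    double controller_ms = 0.2;

    double read_ms = seek_ms + rotation_ms + transfer_ms + controller_ms;
    printf("Average read time = %.3f ms\n", read_ms);                 /* ~6.2 ms */

    /* With a measured average seek of 1 ms (locality + OS scheduling): */
    printf("With 1 ms seek    = %.3f ms\n", read_ms - seek_ms + 1.0); /* ~3.2 ms */
    return 0;
}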

Disk Performance Issues

Manufacturers quote average seek time
  Based on all possible seeks
  Locality and OS scheduling lead to smaller actual average seek times

Smart disk controller allocates physical sectors on disk
  Presents logical sector interface to host
  SCSI, ATA, SATA

Disk drives include caches
  Prefetch sectors in anticipation of access
  Avoid seek and rotational delay
Chapter 6 Storage and Other I/O Topics 16

5.3 The Basics of Caches

Cache Memory

Cache memory
  The level of the memory hierarchy closest to the CPU

Given accesses X1, ..., Xn-1, Xn
  How do we know if the data is present?
  Where do we look?

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 17

Direct Mapped Cache

Location determined by address


Direct mapped: only one choice

(Block address) modulo (#Blocks in cache)

#Blocks is a
power of 2
Use low-order
address bits

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 18

Tags and Valid Bits

How do we know which particular block is


stored in a cache location?

Store block address as well as the data


Actually, only need the high-order bits
Called the tag

What if there is no data in a location?

Valid bit: 1 = present, 0 = not present


Initially 0

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 19

Cache Example

8-blocks, 1 word/block, direct mapped


Initial state

Index   V   Tag   Data
000     N
001     N
010     N
011     N
100     N
101     N
110     N
111     N

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 20

Cache Example
Word addr   Binary addr   Hit/miss   Cache block
22          10 110        Miss       110

Index   V   Tag   Data
000     N
001     N
010     N
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 21

Cache Example
Word addr   Binary addr   Hit/miss   Cache block
26          11 010        Miss       010

Index   V   Tag   Data
000     N
001     N
010     Y   11    Mem[11010]
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 22

Cache Example
Word addr   Binary addr   Hit/miss   Cache block
22          10 110        Hit        110
26          11 010        Hit        010

Index   V   Tag   Data
000     N
001     N
010     Y   11    Mem[11010]
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 23

Cache Example
Word addr   Binary addr   Hit/miss   Cache block
16          10 000        Miss       000
3           00 011        Miss       011
16          10 000        Hit        000

Index   V   Tag   Data
000     Y   10    Mem[10000]
001     N
010     Y   11    Mem[11010]
011     Y   00    Mem[00011]
100     N
101     N
110     Y   10    Mem[10110]
111     N
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 24

Cache Example
Word addr   Binary addr   Hit/miss   Cache block
18          10 010        Miss       010

Index   V   Tag   Data
000     Y   10    Mem[10000]
001     N
010     Y   10    Mem[10010]
011     Y   00    Mem[00011]
100     N
101     N
110     Y   10    Mem[10110]
111     N
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 25

Address Subdivision

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 26

Example: Larger Block Size

64 blocks, 16 bytes/block

To what block number does address 1200 map?
  Block address = 1200/16 = 75
  Block number = 75 modulo 64 = 11

Address fields (32-bit byte address):
  Tag: bits 31-10 (22 bits)
  Index: bits 9-4 (6 bits)
  Offset: bits 3-0 (4 bits)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 27
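The mapping above can be sketched in a few lines of C. This is illustrative only, assuming the 64-block, 16-byte-block configuration on this slide; a hardware cache would do the same bit selection with wiring rather than shifts and masks.

#include <stdio.h>
#include <stdint.h>

/* 64 blocks of 16 bytes each: 4 offset bits, 6 index bits, 22 tag bits */
#define OFFSET_BITS 4
#define INDEX_BITS  6

int main(void) {
    uint32_t addr = 1200;

    uint32_t block_addr = addr >> OFFSET_BITS;                   /* 1200 / 16  = 75 */
    uint32_t index      = block_addr & ((1u << INDEX_BITS) - 1); /* 75 mod 64  = 11 */
    uint32_t tag        = addr >> (OFFSET_BITS + INDEX_BITS);
    uint32_t offset     = addr & ((1u << OFFSET_BITS) - 1);

    printf("block address = %u, index = %u, tag = %u, offset = %u\n",
           block_addr, index, tag, offset);
    return 0;
}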

Block Size Considerations

Larger blocks should reduce miss rate

Due to spatial locality

But in a fixed-sized cache
  Larger blocks ⇒ fewer of them
    More competition ⇒ increased miss rate
  Larger blocks ⇒ pollution

Larger miss penalty
  Can override benefit of reduced miss rate
  Early restart and critical-word-first can help
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 28

Cache Misses

On cache hit, CPU proceeds normally


On cache miss

Stall the CPU pipeline


Fetch block from next level of hierarchy
Instruction cache miss

Restart instruction fetch

Data cache miss

Complete data access

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 29

Write-Through

On data-write hit, could just update the block in


cache

But then cache and memory would be inconsistent

Write through: also update memory


But makes writes take longer

e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles
  Effective CPI = 1 + 0.1 × 100 = 11

Solution: write buffer

Holds data waiting to be written to memory


CPU continues immediately

Only stalls on write if write buffer is already full


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 30

Write-Back

Alternative: On data-write hit, just update


the block in cache

Keep track of whether each block is dirty

When a dirty block is replaced

Write it back to memory


Can use a write buffer to allow replacing block
to be read first

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 31

Write Allocation

What should happen on a write miss?


Alternatives for write-through

Allocate on miss: fetch the block


Write around: don't fetch the block

Since programs often write a whole block before


reading it (e.g., initialization)

For write-back

Usually fetch the block

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 32

Example: Intrinsity FastMATH

Embedded MIPS processor

Split cache: separate I-cache and D-cache

12-stage pipeline
Instruction and data access on each cycle
Each 16KB: 256 blocks × 16 words/block
D-cache: write-through or write-back

SPEC2000 miss rates

I-cache: 0.4%
D-cache: 11.4%
Weighted average: 3.2%
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 33

Example: Intrinsity FastMATH

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 34

Main Memory Supporting Caches

Use DRAMs for main memory

Fixed width (e.g., 1 word)


Connected by fixed-width clocked bus

Example cache block read

Bus clock is typically slower than CPU clock

1 bus cycle for address transfer


15 bus cycles per DRAM access
1 bus cycle per data transfer

For 4-word block, 1-word-wide DRAM

Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles


Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 35
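A small calculation, assuming the bus-cycle model on this slide (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per bus data transfer), reproduces the miss penalties and bandwidths for the one-word-wide case here and for the wider and interleaved organizations shown earlier. Variable names are illustrative.

#include <stdio.h>

int main(void) {
    int words = 4;                                    /* 4-word cache block           */

    int narrow      = 1 + words * 15 + words * 1;     /* 1-word-wide DRAM: 65 cycles  */
    int wide        = 1 + 15 + 1;                     /* 4-word-wide memory: 17       */
    int interleaved = 1 + 15 + words * 1;             /* 4 banks, overlapped: 20      */

    printf("1-word-wide: %d cycles, %.2f B/cycle\n", narrow,      16.0 / narrow);
    printf("4-word-wide: %d cycles, %.2f B/cycle\n", wide,        16.0 / wide);
    printf("interleaved: %d cycles, %.2f B/cycle\n", interleaved, 16.0 / interleaved);
    return 0;
}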

5.4 Measuring and Improving Cache Performance

Measuring Cache Performance

Components of CPU time
  Program execution cycles
    Includes cache hit time
  Memory stall cycles
    Mainly from cache misses

With simplifying assumptions:
  Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                      = (Instructions / Program) × (Misses / Instruction) × Miss penalty
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 36

Cache Performance Example

Given
  I-cache miss rate = 2%
  D-cache miss rate = 4%
  Miss penalty = 100 cycles
  Base CPI (ideal cache) = 2
  Load & stores are 36% of instructions

Miss cycles per instruction
  I-cache: 0.02 × 100 = 2
  D-cache: 0.36 × 0.04 × 100 = 1.44

Actual CPI = 2 + 2 + 1.44 = 5.44
  Ideal CPU is 5.44/2 = 2.72 times faster

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 37
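The same numbers can be checked with a short sketch; the parameters are those given above and the variable names are illustrative.

#include <stdio.h>

int main(void) {
    double base_cpi     = 2.0;
    double icache_miss  = 0.02;   /* 2% of instruction fetches miss */
    double dcache_miss  = 0.04;   /* 4% of data accesses miss       */
    double ld_st_frac   = 0.36;   /* loads/stores per instruction   */
    double miss_penalty = 100.0;  /* cycles                         */

    double i_stalls = icache_miss * miss_penalty;               /* 2.00 */
    double d_stalls = ld_st_frac * dcache_miss * miss_penalty;  /* 1.44 */
    double cpi      = base_cpi + i_stalls + d_stalls;           /* 5.44 */

    printf("CPI = %.2f, ideal cache would be %.2fx faster\n", cpi, cpi / base_cpi);
    return 0;
}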

Average Access Time

Hit time is also important for performance


Average memory access time (AMAT)

AMAT = Hit time + Miss rate Miss penalty

Example

CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  AMAT = 1 + 0.05 × 20 = 2ns
    2 cycles per instruction

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 38
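A one-line calculation with the parameters above confirms the result (names illustrative).

#include <stdio.h>

int main(void) {
    double hit_time = 1.0, miss_rate = 0.05, miss_penalty = 20.0;  /* in cycles      */
    double amat = hit_time + miss_rate * miss_penalty;             /* 2 cycles = 2ns */
    printf("AMAT = %.1f cycles\n", amat);
    return 0;
}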

Performance Summary

When CPU performance increases
  Decreasing base CPI
    Greater proportion of time spent on memory stalls
  Increasing clock rate
    Miss penalty becomes more significant
    Memory stalls account for more CPU cycles

Can't neglect cache behavior when evaluating system performance
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 39

Associative Caches

Fully associative

Allow a given block to go in any cache entry


Requires all entries to be searched at once
Comparator per entry (expensive)

n-way set associative

Each set contains n entries


Block number determines which set

(Block number) modulo (#Sets in cache)

Search all entries in a given set at once


n comparators (less expensive)
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 40

Associative Cache Example

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 41

Spectrum of Associativity

For a cache with 8 entries

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 42

Associativity Example

Compare 4-block caches
  Direct mapped, 2-way set associative, fully associative
  Block access sequence: 0, 8, 0, 6, 8

Direct mapped

Block addr   Cache index   Hit/miss   Cache content after access
                                      [0]      [1]   [2]      [3]
0            0             miss       Mem[0]
8            0             miss       Mem[8]
0            0             miss       Mem[0]
6            2             miss       Mem[0]         Mem[6]
8            0             miss       Mem[8]         Mem[6]

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 43

Associativity Example

2-way set associative

Block addr   Cache index   Hit/miss   Cache content after access
                                      Set 0              Set 1
0            0             miss       Mem[0]
8            0             miss       Mem[0]   Mem[8]
0            0             hit        Mem[0]   Mem[8]
6            0             miss       Mem[0]   Mem[6]
8            0             miss       Mem[8]   Mem[6]

Fully associative

Block addr   Hit/miss   Cache content after access
0            miss       Mem[0]
8            miss       Mem[0]   Mem[8]
0            hit        Mem[0]   Mem[8]
6            miss       Mem[0]   Mem[8]   Mem[6]
8            hit        Mem[0]   Mem[8]   Mem[6]

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 44
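The three tables above can be reproduced by a small simulator. The sketch below models an LRU set-associative cache with 4 blocks in total; ways = 1 gives the direct-mapped case and ways = 4 the fully associative case. It is illustrative only, not code from the text.

#include <stdio.h>

#define NUM_BLOCKS 4

/* LRU set-associative cache with NUM_BLOCKS blocks and 'ways' blocks per set. */
static void simulate(int ways, const int *refs, int n) {
    int sets = NUM_BLOCKS / ways;
    int tag[NUM_BLOCKS], valid[NUM_BLOCKS], last_used[NUM_BLOCKS];
    int misses = 0;
    for (int i = 0; i < NUM_BLOCKS; i++) valid[i] = 0;

    printf("%d-way (%d set%s):\n", ways, sets, sets == 1 ? "" : "s");
    for (int r = 0; r < n; r++) {
        int block = refs[r];
        int base  = (block % sets) * ways;      /* first entry of the chosen set */
        int entry = -1;

        for (int w = 0; w < ways; w++)          /* look for the block in the set */
            if (valid[base + w] && tag[base + w] == block) entry = base + w;

        printf("  block %d: %s\n", block, entry >= 0 ? "hit" : "miss");

        if (entry < 0) {                        /* miss: use an invalid entry if any, else LRU */
            misses++;
            entry = base;
            for (int w = 0; w < ways; w++) {
                int e = base + w;
                if (!valid[e]) { entry = e; break; }
                if (last_used[e] < last_used[entry]) entry = e;
            }
            valid[entry] = 1;
            tag[entry]   = block;
        }
        last_used[entry] = r;                   /* record most recent use for LRU */
    }
    printf("  total: %d misses / %d accesses\n\n", misses, n);
}

int main(void) {
    int refs[] = {0, 8, 0, 6, 8};               /* block access sequence from the slides */
    int n = (int)(sizeof refs / sizeof refs[0]);
    simulate(1, refs, n);                       /* direct mapped: 5 misses          */
    simulate(2, refs, n);                       /* 2-way set associative: 4 misses  */
    simulate(NUM_BLOCKS, refs, n);              /* fully associative: 3 misses      */
    return 0;
}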

How Much Associativity

Increased associativity decreases miss


rate

But with diminishing returns

Simulation of a system with 64KB


D-cache, 16-word blocks, SPEC2000

1-way: 10.3%
2-way: 8.6%
4-way: 8.3%
8-way: 8.1%
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 45

Set Associative Cache Organization

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 46

Replacement Policy

Direct mapped: no choice


Set associative

Prefer non-valid entry, if there is one


Otherwise, choose among entries in the set

Least-recently used (LRU)

Choose the one unused for the longest time

Simple for 2-way, manageable for 4-way, too hard


beyond that

Random

Gives approximately the same performance


as LRU for high associativity
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 47

Multilevel Caches

Primary cache attached to CPU
  Small, but fast

Level-2 cache services misses from primary cache
  Larger, slower, but still faster than main memory

Main memory services L-2 cache misses

Some high-end systems include L-3 cache

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 48

Multilevel Cache Example

Given
  CPU base CPI = 1, clock rate = 4GHz
  Miss rate/instruction = 2%
  Main memory access time = 100ns

With just primary cache
  Miss penalty = 100ns/0.25ns = 400 cycles
  Effective CPI = 1 + 0.02 × 400 = 9

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 49

Example (cont.)

Now add L-2 cache
  Access time = 5ns
  Global miss rate to main memory = 0.5%

Primary miss with L-2 hit
  Penalty = 5ns/0.25ns = 20 cycles

Primary miss with L-2 miss
  Extra penalty = 400 cycles

CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4

Performance ratio = 9/3.4 = 2.6
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 50
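The with/without-L2 comparison can be checked with a short calculation, assuming the parameters given above (names are illustrative).

#include <stdio.h>

int main(void) {
    double cycle_ns    = 0.25;   /* 4 GHz clock                        */
    double l1_miss     = 0.02;   /* primary misses per instruction     */
    double l2_time_ns  = 5.0;    /* L-2 access time                    */
    double mem_time_ns = 100.0;  /* main memory access time            */
    double global_miss = 0.005;  /* misses that go all the way to DRAM */

    double mem_penalty = mem_time_ns / cycle_ns;   /* 400 cycles */
    double l2_penalty  = l2_time_ns / cycle_ns;    /*  20 cycles */

    double cpi_l1only = 1.0 + l1_miss * mem_penalty;                            /* 9.0 */
    double cpi_l1l2   = 1.0 + l1_miss * l2_penalty + global_miss * mem_penalty; /* 3.4 */

    printf("L1 only: CPI = %.1f\n", cpi_l1only);
    printf("L1 + L2: CPI = %.1f  (%.1fx faster)\n", cpi_l1l2, cpi_l1only / cpi_l1l2);
    return 0;
}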

Multilevel Cache Considerations

Primary cache
  Focus on minimal hit time

L-2 cache
  Focus on low miss rate to avoid main memory access
  Hit time has less overall impact

Results
  L-1 cache usually smaller than a single cache
  L-1 block size smaller than L-2 block size
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 51

Interactions with Advanced CPUs

Out-of-order CPUs can execute


instructions during cache miss

Pending store stays in load/store unit


Dependent instructions wait in reservation
stations

Independent instructions continue

Effect of miss depends on program data


flow

Much harder to analyse


Use system simulation
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 52

Interactions with Software

Misses depend on
memory access
patterns
Algorithm behavior
Compiler
optimization for
memory access

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 53

Software Optimization via Blocking

Goal: maximize accesses to data before it


is replaced
Consider inner loops of DGEMM:
for (int j = 0; j < n; ++j)
{
  double cij = C[i+j*n];
  for (int k = 0; k < n; k++)
    cij += A[i+k*n] * B[k+j*n];
  C[i+j*n] = cij;
}
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 54

DGEMM Access Pattern

C, A, and B arrays
older accesses
new accesses

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 55

Cache Blocked DGEMM


#define BLOCKSIZE 32

void do_block (int n, int si, int sj, int sk,
               double *A, double *B, double *C)
{
  for (int i = si; i < si+BLOCKSIZE; ++i)
    for (int j = sj; j < sj+BLOCKSIZE; ++j)
    {
      double cij = C[i+j*n];              /* cij = C[i][j] */
      for (int k = sk; k < sk+BLOCKSIZE; k++)
        cij += A[i+k*n] * B[k+j*n];       /* cij += A[i][k]*B[k][j] */
      C[i+j*n] = cij;                     /* C[i][j] = cij */
    }
}

void dgemm (int n, double* A, double* B, double* C)
{
  for (int sj = 0; sj < n; sj += BLOCKSIZE)
    for (int si = 0; si < n; si += BLOCKSIZE)
      for (int sk = 0; sk < n; sk += BLOCKSIZE)
        do_block(n, si, sj, sk, A, B, C);
}

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 56
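As written, do_block assumes n is a multiple of BLOCKSIZE and that the matrices are stored in column-major order (element (i,j) at offset i + j*n). A minimal, hypothetical driver might look like this, assuming dgemm from the slide above is in the same file:

#include <stdlib.h>

int main(void) {
    int n = 64;                               /* must be a multiple of BLOCKSIZE */
    double *A = calloc(n * n, sizeof(double));
    double *B = calloc(n * n, sizeof(double));
    double *C = calloc(n * n, sizeof(double));
    /* ... fill A and B ... */
    dgemm(n, A, B, C);                        /* C += A * B, one block at a time */
    free(A); free(B); free(C);
    return 0;
}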

Blocked DGEMM Access Pattern

Unoptimized

Blocked

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 57

5.5 Dependable Memory Hierarchy

Dependability

Service accomplishment
  Service delivered as specified

Service interruption
  Deviation from specified service

Transitions: Failure (accomplishment → interruption), Restoration (interruption → accomplishment)

Fault: failure of a component
  May or may not lead to system failure

Chapter 6 Storage and Other I/O Topics 58

Dependability Measures

Reliability: mean time to failure (MTTF)


Service interruption: mean time to repair (MTTR)
Mean time between failures

MTBF = MTTF + MTTR

Availability = MTTF / (MTTF + MTTR)


Improving Availability

Increase MTTF: fault avoidance, fault tolerance, fault


forecasting
Reduce MTTR: improved tools and processes for
diagnosis and repair

Chapter 6 Storage and Other I/O Topics 59

The Hamming SEC Code

Hamming distance
  Number of bits that are different between two bit patterns

Minimum distance = 2 provides single-bit error detection
  E.g., parity code

Minimum distance = 3 provides single error correction, 2-bit error detection

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 60

Encoding SEC

To calculate Hamming code:
  Number bits from 1 on the left
  All bit positions that are a power of 2 are parity bits
  Each parity bit checks certain data bits:

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 61

Decoding SEC

Value of parity bits indicates which bits are


in error

Use numbering from encoding procedure


E.g.

Parity bits = 0000 indicates no error


Parity bits = 1010 indicates bit 10 was flipped

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 62
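A minimal sketch of the procedure for 8 data bits (a (12,8) Hamming code): parity bits sit at the power-of-2 positions, and the syndrome is the XOR of the failing check positions, so a syndrome of 10 indicates bit 10 was flipped, as in the example above. The helper names and word width are illustrative, not from the text.

#include <stdio.h>
#include <stdint.h>

#define CODE_BITS 12   /* 8 data bits + 4 parity bits at positions 1, 2, 4, 8 */

static int bit(uint32_t w, int pos) { return (w >> (pos - 1)) & 1; }

/* Parity bit at position p covers every position whose binary form has bit p set. */
static uint32_t set_parity(uint32_t code) {
    for (int p = 1; p <= CODE_BITS; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= CODE_BITS; pos++)
            if ((pos & p) && pos != p) parity ^= bit(code, pos);
        if (parity) code |= 1u << (p - 1); else code &= ~(1u << (p - 1));
    }
    return code;
}

static uint32_t encode(uint8_t data) {
    uint32_t code = 0;
    int pos = 1;
    for (int d = 0; d < 8; d++) {
        while ((pos & (pos - 1)) == 0) pos++;      /* skip parity positions */
        if ((data >> d) & 1) code |= 1u << (pos - 1);
        pos++;
    }
    return set_parity(code);
}

/* Syndrome: 0 means no error, otherwise it is the position of the flipped bit. */
static int syndrome(uint32_t code) {
    int s = 0;
    for (int p = 1; p <= CODE_BITS; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= CODE_BITS; pos++)
            if (pos & p) parity ^= bit(code, pos);
        if (parity) s |= p;
    }
    return s;
}

int main(void) {
    uint32_t code = encode(0x5A);
    printf("syndrome (clean)          = %d\n", syndrome(code));               /*  0 */
    printf("syndrome (bit 10 flipped) = %d\n", syndrome(code ^ (1u << 9)));   /* 10 */
    return 0;
}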

SEC/DED Code

Add an additional parity bit for the whole word


(pn)

Make Hamming distance = 4


Decoding:

Let H = SEC parity bits

H even, pn even, no error

H odd, pn odd, correctable single bit error

H even, pn odd, error in pn bit

H odd, pn even, double error occurred

Note: ECC DRAM uses SEC/DED with 8 bits protecting each 64 bits
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 63

5.6 Virtual Machines

Virtual Machines

Host computer emulates guest operating system and machine resources
  Improved isolation of multiple guests
  Avoids security and reliability problems
  Aids sharing of resources

Virtualization has some performance impact
  Feasible with modern high-performance computers

Examples
  IBM VM/370 (1970s technology!)
  VMWare
  Microsoft Virtual PC
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 64

Virtual Machine Monitor

Maps virtual resources to physical resources
  Memory, I/O devices, CPUs

Guest code runs on native machine in user mode
  Traps to VMM on privileged instructions and access to protected resources

Guest OS may be different from host OS

VMM handles real I/O devices
  Emulates generic virtual I/O devices for guest


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 65

Example: Timer Virtualization

In native machine, on timer interrupt
  OS suspends current process, handles interrupt, selects and resumes next process

With Virtual Machine Monitor
  VMM suspends current VM, handles interrupt, selects and resumes next VM

If a VM requires timer interrupts
  VMM emulates a virtual timer
  Emulates interrupt for VM when physical timer interrupt occurs
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 66

Instruction Set Support

User and System modes

Privileged instructions only available in system mode
  Trap to system if executed in user mode

All physical resources only accessible using privileged instructions
  Including page tables, interrupt controls, I/O registers

Renaissance of virtualization support
  Current ISAs (e.g., x86) adapting


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 67

5.7 Virtual Memory

Virtual Memory

Use main memory as a cache for secondary (disk) storage
  Managed jointly by CPU hardware and the operating system (OS)

Programs share main memory
  Each gets a private virtual address space holding its frequently used code and data
  Protected from other programs

CPU and OS translate virtual addresses to physical addresses
  VM block is called a page
  VM translation miss is called a page fault
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 68

Address Translation

Fixed-size pages (e.g., 4K)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 69

Page Fault Penalty

On page fault, the page must be fetched


from disk

Takes millions of clock cycles


Handled by OS code

Try to minimize page fault rate

Fully associative placement


Smart replacement algorithms

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 70

Page Tables

Stores placement information

If page is present in memory

Array of page table entries, indexed by virtual


page number
Page table register in CPU points to page table
in physical memory
PTE stores the physical page number
Plus other status bits (referenced, dirty, ...)

If page is not present

PTE can refer to location in swap space on disk


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 71

Translation Using a Page Table

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 72

Mapping Pages to Storage

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 73

Replacement and Writes

To reduce page fault rate, prefer least-recently used (LRU) replacement

Reference bit (aka use bit) in PTE set to 1 on


access to page
Periodically cleared to 0 by OS
A page with reference bit = 0 has not been
used recently

Disk writes take millions of cycles

Block at once, not individual locations


Write through is impractical
Use write-back
Dirty bit in PTE set when page is written
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 74

Fast Translation Using a TLB

Address translation would appear to require


extra memory references

One to access the PTE


Then the actual memory access

But access to page tables has good locality

So use a fast cache of PTEs within the CPU


Called a Translation Look-aside Buffer (TLB)
Typical: 16-512 PTEs, 0.5-1 cycle for hit, 10-100 cycles for miss, 0.01%-1% miss rate
Misses could be handled by hardware or software

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 75

Fast Translation Using a TLB

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 76
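Combining the last few slides, a software model of translation might look like the sketch below: check a small fully associative TLB first, fall back to a single-level page table on a TLB miss, and report a page fault if the PTE is invalid. The sizes, field names, and replacement choice are illustrative, not taken from the text.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   12              /* 4 KB pages                     */
#define TLB_ENTRIES 16

typedef struct { bool valid; uint32_t ppn; } pte_t;             /* page table entry */
typedef struct { bool valid; uint32_t vpn, ppn; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static pte_t *page_table;           /* indexed by virtual page number; set up by the "OS" */

/* Returns true and fills *paddr on success; false means page fault. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)            /* TLB hit: no extra memory access */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].ppn << PAGE_BITS) | offset;
            return true;
        }

    pte_t pte = page_table[vpn];                     /* TLB miss: read PTE from memory  */
    if (!pte.valid)
        return false;                                /* page fault: OS must fetch page  */

    int victim = vpn % TLB_ENTRIES;                  /* simple (non-LRU) replacement    */
    tlb[victim] = (tlb_entry_t){ true, vpn, pte.ppn };
    *paddr = (pte.ppn << PAGE_BITS) | offset;
    return true;
}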

TLB Misses

If page is in memory

Load the PTE from memory and retry


Could be handled in hardware

Or in software

Can get complex for more complicated page table


structures
Raise a special exception, with optimized handler

If page is not in memory (page fault)

OS handles fetching the page and updating


the page table
Then restart the faulting instruction
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 77

TLB Miss Handler

TLB miss indicates either
  Page present, but PTE not in TLB
  Page not present

Must recognize TLB miss before destination register is overwritten
  Raise exception

Handler copies PTE from memory to TLB
  Then restarts instruction
  If page not present, page fault will occur
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 78

Page Fault Handler

Use faulting virtual address to find PTE


Locate page on disk
Choose page to replace

If dirty, write to disk first

Read page into memory and update page


table
Make process runnable again

Restart from faulting instruction

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 79

TLB and Cache Interaction

If cache tag uses


physical address

Need to translate
before cache lookup

Alternative: use virtual


address tag

Complications due to
aliasing

Different virtual
addresses for shared
physical address

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 80

Memory Protection

Different tasks can share parts of their


virtual address spaces

But need to protect against errant access


Requires OS assistance

Hardware support for OS protection

Privileged supervisor mode (aka kernel mode)


Privileged instructions
Page tables and other state information only
accessible in supervisor mode
System call exception (e.g., syscall in MIPS)
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 81

5.8 A Common Framework for Memory Hierarchies

The Memory Hierarchy

The BIG Picture

Common principles apply at all levels of the memory hierarchy
  Based on notions of caching

At each level in the hierarchy
  Block placement
  Finding a block
  Replacement on a miss
  Write policy

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 82

Block Placement

Determined by associativity
  Direct mapped (1-way associative)
    One choice for placement
  n-way set associative
    n choices within a set
  Fully associative
    Any location

Higher associativity reduces miss rate
  Increases complexity, cost, and access time


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 83

Finding a Block

Associativity            Location method                               Tag comparisons
Direct mapped            Index                                         1
n-way set associative    Set index, then search entries within set     n
Fully associative        Search all entries                            #entries
                         Full lookup table                             0

Hardware caches
  Reduce comparisons to reduce cost

Virtual memory
  Full table lookup makes full associativity feasible
  Benefit in reduced miss rate
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 84

Replacement

Choice of entry to replace on a miss
  Least recently used (LRU)
    Complex and costly hardware for high associativity
  Random
    Close to LRU, easier to implement

Virtual memory
  LRU approximation with hardware support

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 85

Write Policy

Write-through
  Update both upper and lower levels
  Simplifies replacement, but may require write buffer

Write-back
  Update upper level only
  Update lower level when block is replaced
  Need to keep more state

Virtual memory
  Only write-back is feasible, given disk write latency
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 86

Sources of Misses

Compulsory misses (aka cold start misses)
  First access to a block

Capacity misses
  Due to finite cache size
  A replaced block is later accessed again

Conflict misses (aka collision misses)
  In a non-fully associative cache
  Due to competition for entries in a set
  Would not occur in a fully associative cache of the same total size
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 87

Cache Design Trade-offs


Design change            Effect on miss rate           Negative performance effect
Increase cache size      Decrease capacity misses      May increase access time
Increase associativity   Decrease conflict misses      May increase access time
Increase block size      Decrease compulsory misses    Increases miss penalty. For very
                                                       large block size, may increase
                                                       miss rate due to pollution.

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 88

5.9 Using a Finite State Machine to Control A Simple Cache

Cache Control

Example cache characteristics
  Direct-mapped, write-back, write allocate
  Block size: 4 words (16 bytes)
  Cache size: 16 KB (1024 blocks)
  32-bit byte addresses
  Valid bit and dirty bit per block
  Blocking cache
    CPU waits until access is complete

Address fields (32-bit byte address):
  Tag: bits 31-14 (18 bits)
  Index: bits 13-4 (10 bits)
  Offset: bits 3-0 (4 bits)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 89

Interface Signals

CPU <-> Cache signals
  Read/Write, Valid
  Address (32 bits)
  Write Data (32 bits)
  Read Data (32 bits)
  Ready

Cache <-> Memory signals
  Read/Write, Valid
  Address (32 bits)
  Write Data (128 bits)
  Read Data (128 bits)
  Ready

Multiple cycles per access

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 90

Finite State Machines

Use an FSM to
sequence control steps
Set of states, transition
on each clock edge

State values are binary


encoded
Current state stored in a
register
Next state
= fn (current state,
current inputs)

Control output signals


= fo (current state)
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 91

Cache Controller FSM


Could partition
into separate
states to
reduce clock
cycle time

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 92
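One common structure for such a controller uses four states (Idle, Compare Tag, Write-Back, Allocate). A plausible rendering in C is sketched below; the input signal names are illustrative, not taken from the book's figure.

/* Sketch of a four-state controller for the simple blocking,
   write-back, write-allocate cache described above.            */
typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } state_t;

typedef struct {
    int cpu_valid;      /* CPU has a pending request            */
    int hit;            /* tag match and valid bit set          */
    int dirty;          /* victim block has been modified       */
    int mem_ready;      /* memory finished the current transfer */
} inputs_t;

state_t next_state(state_t s, inputs_t in) {
    switch (s) {
    case IDLE:                        /* wait for a valid CPU request        */
        return in.cpu_valid ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:                 /* hit: done; miss: evict or refill    */
        if (in.hit)   return IDLE;
        return in.dirty ? WRITE_BACK : ALLOCATE;
    case WRITE_BACK:                  /* write old block to memory first     */
        return in.mem_ready ? ALLOCATE : WRITE_BACK;
    case ALLOCATE:                    /* fetch new block, then re-check tag  */
        return in.mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;
}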

5.10 Parallelism and Memory Hierarchies: Cache Coherence

Cache Coherence Problem

Suppose two CPU cores share a physical address space
  Write-through caches

Time step   Event                 CPU A's cache   CPU B's cache   Memory
0                                                                 0
1           CPU A reads X         0                               0
2           CPU B reads X         0               0               0
3           CPU A writes 1 to X   1               0               1

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 93

Coherence Defined

Informally: Reads return most recently written value

Formally:
  P writes X; P reads X (no intervening writes)
    ⇒ read returns written value
  P1 writes X; P2 reads X (sufficiently later)
    ⇒ read returns written value
    c.f. CPU B reading X after step 3 in example
  P1 writes X, P2 writes X
    ⇒ all processors see writes in the same order
    End up with the same final value for X


Chapter 5 Large and Fast: Exploiting Memory Hierarchy 94

Cache Coherence Protocols

Operations performed by caches in multiprocessors to ensure coherence
  Migration of data to local caches
    Reduces bandwidth for shared memory
  Replication of read-shared data
    Reduces contention for access

Snooping protocols
  Each cache monitors bus reads/writes

Directory-based protocols
  Caches and memory record sharing status of blocks in a directory
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 95

Invalidating Snooping Protocols

Cache gets exclusive access to a block when it is to be written
  Broadcasts an invalidate message on the bus
  Subsequent read in another cache misses
    Owning cache supplies updated value

CPU activity          Bus activity        CPU A's cache   CPU B's cache   Memory
                                                                          0
CPU A reads X         Cache miss for X    0                               0
CPU B reads X         Cache miss for X    0               0               0
CPU A writes 1 to X   Invalidate for X    1                               0
CPU B reads X         Cache miss for X    1               1               1

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 96

Memory Consistency

When are writes seen by other processors
  "Seen" means a read returns the written value
  Can't be instantaneous

Assumptions
  A write completes only when all processors have seen it
  A processor does not reorder writes with other accesses

Consequence
  P writes X then writes Y
    ⇒ all processors that see new Y also see new X
  Processors can reorder reads, but not writes
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 97

5.13 The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies

Multilevel On-Chip Caches

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 98

2-Level TLB Organization

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 99

Supporting Multiple Issue

Both have multi-banked caches that allow


multiple accesses per cycle assuming no
bank conflicts
Core i7 cache optimizations

Return requested word first


Non-blocking cache

Hit under miss


Miss under miss

Data prefetching
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 100

5.14 Going Faster: Cache Blocking and Matrix Multiply

DGEMM

Combine cache blocking and subword parallelism

Chapter 5 Large and Fast: Exploiting Memory Hierarchy 101

5.15 Fallacies and Pitfalls

Pitfalls

Byte vs. word addressing
  Example: 32-byte direct-mapped cache, 4-byte blocks
    Byte 36 maps to block 1
    Word 36 maps to block 4

Ignoring memory system effects when writing or generating code
  Example: iterating over rows vs. columns of arrays
  Large strides result in poor locality
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 102
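The two mappings above can be checked with a couple of lines; the "word 36" case uses the word address directly as the block address, since a 4-byte block holds exactly one word.

#include <stdio.h>

int main(void) {
    int cache_bytes = 32, block_bytes = 4;
    int num_blocks  = cache_bytes / block_bytes;      /* 8 blocks */

    int byte_addr = 36;
    int word_addr = 36;                               /* word address, i.e. byte 144 */

    printf("byte 36 -> block %d\n", (byte_addr / block_bytes) % num_blocks); /* 1 */
    printf("word 36 -> block %d\n", word_addr % num_blocks);                 /* 4 */
    return 0;
}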

Pitfalls

In multiprocessor with shared L2 or L3 cache
  Less associativity than cores results in conflict misses
  More cores ⇒ need to increase associativity

Using AMAT to evaluate performance of out-of-order processors
  Ignores effect of non-blocked accesses
  Instead, evaluate performance by simulation
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 103

Pitfalls

Extending address range using segments
  E.g., Intel 80286
  But a segment is not always big enough
  Makes address arithmetic complicated

Implementing a VMM on an ISA not designed for virtualization
  E.g., non-privileged instructions accessing hardware resources
  Either extend ISA, or require guest OS not to use problematic instructions
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 104

5.16 Concluding Remarks

Concluding Remarks

Fast memories are small, large memories are slow
  We really want fast, large memories
  Caching gives this illusion

Principle of locality
  Programs use a small part of their memory space frequently

Memory hierarchy
  L1 cache <-> L2 cache <-> DRAM memory <-> disk

Memory system design is critical for multiprocessors
Chapter 5 Large and Fast: Exploiting Memory Hierarchy 105
