100% found this document useful (1 vote)

134 views61 pages

Computer Structure - Memory

Uploaded by

Lotso Chik

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (1 vote)

134 views61 pages

Computer Structure - Memory

Uploaded by

Lotso Chik

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 61

dce

2013

COMPUTER ARCHITECTURE CE2013

Faculty of Computer Science and Engineering Department of Computer Engineering Vo Tan Phuong http://www.cse.hcmut.edu.vn/~vtphuong

BK
TP.HCM

dce
2013

Chapter 5
Memory

Computer Architecture Chapter 4.2

2013, CE

dce
2013

Presentation Outline
Random Access Memory and its Structure Memory Hierarchy and the need for Cache Memory The Basics of Caches Cache Performance and Memory Stall Cycles

Improving Cache Performance

Multilevel Caches

Computer Architecture Chapter 5

2013, CE

dce
2013

Random Access Memory

Large arrays of storage cells Volatile memory
Hold the stored data as long as it is powered on

Random Access
Access time is practically the same to any data on a RAM chip

Output Enable (OE) control signal

Specifies read operation
n

RAM
Address Data

Write Enable (WE) control signal

Specifies write operation

m OE WE

2n m RAM chip: n-bit address and m-bit data

Computer Architecture Chapter 5

2013, CE

dce
2013

Memory Technology
Requires 6 transistors per bit Requires low power to retain bit

Static RAM (SRAM) for Cache

Dynamic RAM (DRAM) for Main Memory

One transistor + capacitor per bit Must be re-written after being read

Must also be periodically refreshed

Each row can be refreshed simultaneously

Address lines are multiplexed

Upper half of address: Row Access Strobe (RAS) Lower half of address: Column Access Strobe (CAS)
Computer Architecture Chapter 5 2013, CE 5

dce
2013

Static RAM Storage Cell

Static RAM (SRAM): fast but expensive RAM 6-Transistor cell with no static current

Typically used for caches

Provides fast access time

Word line Vcc

Cell Implementation:
Cross-coupled inverters store bit Two pass transistors
bit bit

Typical SRAM cell

Row decoder selects the word line

Pass transistors enable the cell to be read and written
Computer Architecture Chapter 5 2013, CE 6

dce
2013

Dynamic RAM Storage Cell

Word line

Dynamic RAM (DRAM): slow, cheap, and dense memory Typical choice for main memory Cell Implementation:
1-Transistor cell (pass transistor) Trench capacitor (stores bit)
Pass Transistor

Bit is stored as a charge on capacitor Must be refreshed periodically Refreshing for all memory rows
bit

Capacitor

Because of leakage of charge from tiny capacitor

Typical DRAM cell

Reading each row and writing it back to restore the charge

Computer Architecture Chapter 5

2013, CE

dce
2013

Typical DRAM Packaging

Legend
Ai CAS Dj NC OE RAS WE Address bit i Column address strobe Data bit j No connection Output enable Row address strobe Write enable

24-pin dual in-line package for 16Mbit = 222 4 memory 22-bit address is divided into
11-bit row address 11-bit column address Interleaved on same address lines

Vss D4 D3 CAS OE A9 A8 A7 A6 A5 A4 Vss 24 23 22 21 20 19 18 17 16 15 14 13

10 11

Vcc D1 D2 WE RAS NC A10 A0 A1 A2 A3 Vcc

Computer Architecture Chapter 5 2013, CE 8

dce
2013

Typical Memory Structure

Row Decoder

Row decoder
Row address

Select row to read/write

...

Column decoder
Select column to read/write

2r 2c m bits Cell Matrix

Cell Matrix
2D array of tiny memory cells
Sense/write amplifiers
Data
m

Sense/Write amplifiers
Sense & amplify data on read Drive bit line with data in on write

Row Latch 2c m bits

...
Column Decoder
c

Same data lines are used for data in/out

Computer Architecture Chapter 5

Column address

2013, CE

dce
2013

DRAM Operation
Latch and decode row address to enable addressed row

Row Access (RAS)

Small change in voltage detected by sense amplifiers
Latch whole row of bits Sense amplifiers drive bit lines to recharge storage cells

Column Access (CAS) read and write operation

Latch and decode column address to select m bits m = 4, 8, 16, or 32 bits depending on DRAM package On read, send latched bits out to chip pins On write, charge storage cells to required value Can perform multiple column accesses to same row (burst mode)
Computer Architecture Chapter 5 2013, CE

dce
2013

Burst Mode Operation

Row address is latched and decoded A read operation causes all cells in a selected row to be read Selected row is latched internally inside the SDRAM chip Column address is latched and decoded Selected column data is placed in the data output register Column address is incremented automatically Multiple data items are read depending on the block length

Block Transfer

Fast transfer of blocks between memory and cache Fast transfer of pages between memory and disk

Computer Architecture Chapter 5

2013, CE

dce
2013

Trends in DRAM
Chip size 64 Kbit 256 Kbit 1 Mbit 4 Mbit 16 Mbit 64 Mbit Type DRAM DRAM DRAM DRAM DRAM Row access 170 ns 150 ns 120 ns 100 ns 80 ns Column access 75 ns 50 ns 25 ns 20 ns 15 ns Cycle Time New Request 250 ns 220 ns 190 ns 165 ns 120 ns

Year Produced 1980 1983 1986 1989 1992 1996

SDRAM
SDRAM DDR1 DDR1 DDR2 DDR2 DDR3 DDR3

70 ns
70 ns 65 ns 60 ns 55 ns 50 ns 35 ns 30 ns

12 ns
10 ns 7 ns 5 ns 5 ns 3 ns 1 ns 0.5 ns

110 ns
100 ns 90 ns 80 ns 70 ns 60 ns 37 ns 31 ns
2013, CE 12

1998
2000 2002 2004 2006 2010 2012

128 Mbit
256 Mbit 512 Mbit 1 Gbit 2 Gbit 4 Gbit 8 Gbit

Computer Architecture Chapter 5

dce
2013

SDRAM and DDR SDRAM

Added clock to DRAM interface

SDRAM is Synchronous Dynamic RAM

SDRAM is synchronous with the system clock

Older DRAM technologies were asynchronous As system bus clock improved, SDRAM delivered higher performance than asynchronous DRAM

DDR is Double Data Rate SDRAM

Like SDRAM, DDR is synchronous with the system clock, but the difference is that DDR reads data on both the rising and falling edges of the clock signal
Computer Architecture Chapter 5 2013, CE 13

dce
2013

Transfer Rates & Peak Bandwidth

Standard Name Memory Bus Clock 100 MHz 167 MHz 200 MHz 333 MHz 400 MHz 533 MHz 533 MHz 667 MHz 800 MHz Millions Transfers per second 200 MT/s 333 MT/s 400 MT/s 667 MT/s 800 MT/s 1066 MT/s 1066 MT/s 1333 MT/s 1600 MT/s Module Name PC-1600 PC-2700 PC-3200 PC-5300 PC-6400 PC-8500 PC-8500 PC-10600 PC-12800 Peak Bandwidth 1600 MB/s 2667 MB/s 3200 MB/s 5333 MB/s 6400 MB/s 8533 MB/s 8533 MB/s 10667 MB/s 12800 MB/s

DDR-200 DDR-333 DDR-400 DDR2-667 DDR2-800 DDR2-1066 DDR3-1066 DDR3-1333 DDR3-1600

DDR4-3200

1600 MHz

3200 MT/s

PC-25600

25600 MB/s

1 Transfer = 64 bits = 8 bytes of data

Computer Architecture Chapter 5 2013, CE 14

dce
2013

DRAM Refresh Cycles

Refresh cycle is about tens of milliseconds Refreshing is done for the entire memory

Each row is read and written back to restore the charge

Some of the memory bandwidth is lost to refresh cycles
Voltage for 1 Threshold voltage 0 Stored 1 Written Refreshed Refreshed Refreshed

Refresh Cycle
Time

Voltage for 0

Computer Architecture Chapter 5

2013, CE

dce
2013

Expanding the Data Bus Width

Memory chips typically have a narrow data bus We can expand the data bus width by a factor of p
Use p RAM chips and feed the same address to all chips Use the same Output Enable and Write Enable control signals

OE Address Data

OE Address

...
m

Address Data

Data
..

Data width = m p bits

Computer Architecture Chapter 5 2013, CE 16

dce
2013

Next . . .

Random Access Memory and its Structure Memory Hierarchy and the need for Cache Memory The Basics of Caches Cache Performance and Memory Stall Cycles

Improving Cache Performance

Multilevel Caches

Computer Architecture Chapter 5

2013, CE

dce
2013

Processor-Memory Performance Gap

CPU Performance: 55% per year, slowing down after 2004 Performance Gap DRAM: 7% per year

1980 No cache in microprocessor 1995 Two-level cache on microprocessor

Computer Architecture Chapter 5 2013, CE 18

dce
2013

The Need for Cache Memory

Processor operation takes less than 1 ns

Widening speed gap between CPU and main memory

Main memory requires more than 50 ns to access

Each instruction involves at least one memory access

One memory access to fetch the instruction

A second memory access for load and store instructions

Memory bandwidth limits the instruction execution rate Cache memory can help bridge the CPU-memory gap Cache memory is small in size but fast
Computer Architecture Chapter 5 2013, CE

dce
2013

Typical Memory Hierarchy

Typical size < 1 KB Access time < 0.5 ns

Registers are at the top of the hierarchy

Level 1 Cache (8 64 KB)

Access time: 1 ns
Microprocessor Registers

L2 Cache (512KB 8MB)

Access time: 3 10 ns
Faster

L1 Cache
Bigger
20

Main Memory (4 16 GB)

Access time: 50 100 ns

L2 Cache Memory Bus Main Memory I/O Bus Magnetic or Flash Disk

Disk Storage (> 200 GB)

Access time: 5 10 ms

Computer Architecture Chapter 5

2013, CE

dce
2013

Principle of Locality of Reference

At any time, only a small set of instructions & data is needed

Programs access small portion of their address space Temporal Locality (in time)
If an item is accessed, probably it will be accessed again soon Same loop instructions are fetched each iteration Same procedure may be called and executed many times

Spatial Locality (in space)

Tendency to access contiguous instructions/data in memory Sequential execution of Instructions Traversing arrays element by element

Computer Architecture Chapter 5

2013, CE

dce
2013

What is a Cache Memory ?

Stores the subset of instructions & data currently being accessed

Small and fast (SRAM) memory technology

Used to reduce average access time to memory

Caches exploit temporal locality by
Keeping recently accessed data closer to the processor

Caches exploit spatial locality by

Moving blocks consisting of multiple contiguous words

Goal is to achieve
Fast speed of cache memory access
Balance the cost of the memory system
Computer Architecture Chapter 5 2013, CE 22

dce
2013

Cache Memories in the Datapath

Imm
Imm16

E
32

ALU result 32

0 1

Instruction

RA RB RW

Instruction

Rt 5

BusB

Address

0 1 2 3

A L U
1 0 32

0 32

Address Data_out
1

BusW
32

Rd2

Rd3

Data_in

0 1

clk
Instruction Block

Block Address

D-Cache miss

I-Cache miss

I-Cache miss or D-Cache miss causes pipeline to stall Interface to L2 Cache or Main Memory

Computer Architecture Chapter 5

2013, CE

Data Block

Rd4

WB Data 23

I-Cache

Rs 5

BusA

ALUout

D-Cache

dce
2013

Almost Everything is a Cache !

In computer architecture, almost everything is a cache! Registers: a cache on variables software managed First-level cache: a cache on second-level cache Second-level cache: a cache on memory Memory: a cache on hard disk
Stores recent programs and their data
Hard disk can be viewed as an extension to main memory

Branch target and prediction buffer

Cache on branch target and prediction information

Computer Architecture Chapter 5

2013, CE

dce
2013

Next . . .

Random Access Memory and its Structure Memory Hierarchy and the need for Cache Memory The Basics of Caches Cache Performance and Memory Stall Cycles

Improving Cache Performance

Multilevel Caches

Computer Architecture Chapter 5

2013, CE

dce
2013

Four Basic Questions on Caches

Block placement Direct Mapped, Set Associative, Fully Associative

Q1: Where can a block be placed in a cache?

Q2: How is a block found in a cache?

Block identification Block address, tag, index

Q3: Which block should be replaced on a miss?

Block replacement FIFO, Random, LRU

Q4: What happens on a write?

Write strategy Write Back or Write Through (with Write Buffer)
Computer Architecture Chapter 5 2013, CE 26

dce
2013

Block Placement: Direct Mapped

Block: unit of data transfer between cache and memory Direct Mapped Cache:
A block can be placed in exactly one location in the cache
In this example: Cache index = least significant 3 bits of Memory address
000 001 010 011 100 101 110 111

Cache

Computer Architecture Chapter 5

00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100 01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 11110 11111
2013, CE

Main Memory
27

dce
2013

Direct-Mapped Cache
Block Address Tag Index offset

A memory address is divided into

Block address: identifies block in memory Block offset: to access bytes within a block

A block address is further divided into

Index: used for direct cache access Tag: most-significant bits of block address Index = Block Address mod Cache Blocks

V Tag

Block Data

Tag must be stored also inside cache

For block identification
=
Data Hit

A valid bit is also required to indicate

Whether a cache block is valid or not
Computer Architecture Chapter 5

2013, CE

dce
2013

Direct Mapped Cache contd

Index is used to access cache block
Block Address Tag Index offset

Cache hit: block is stored inside cache

Address tag is compared against stored tag
If equal and cache block is valid then hit Otherwise: cache miss
V Tag Block Data

If number of cache blocks is 2n

n bits are used for the cache index

If number of bytes in a block is 2b

b bits are used for the block offset

If 32 bits are used for an address

32 n b bits are used for the tag

=
Data Hit
2013, CE 29

Cache data size = 2n+b bytes

Computer Architecture Chapter 5

dce
2013

Mapping an Address to a Cache Block

Consider a direct-mapped cache with 256 blocks Block size = 16 bytes

Example

Compute tag, index, and byte offset of address: 0x01FFF8AC

Solution
32-bit address is divided into:

Block Address 20 8 4

Tag

Index offset

4-bit byte offset field, because block size = 24 = 16 bytes

8-bit cache index, because there are 28 = 256 blocks in cache 20-bit tag field

Byte offset = 0xC = 12 (least significant 4 bits of address)

Cache index = 0x8A = 138 (next lower 8 bits of address)

Tag = 0x01FFF (upper 20 bits of address)

Computer Architecture Chapter 5

2013, CE

dce
2013

Example on Cache Placement & Misses

Cache is initially empty, Block size = 16 bytes The following memory addresses (in decimal) are referenced: 1000, 1004, 1008, 2548, 2552, 2556. Map addresses to cache blocks and indicate whether hit or miss
23 Tag 5 4

Consider a small direct-mapped cache with 32 blocks

Solution:
1000 = 0x3E8 1004 = 0x3EC 1008 = 0x3F0 2548 = 0x9F4 2552 = 0x9F8 2556 = 0x9FC

Index offset

cache index = 0x1E cache index = 0x1E cache index = 0x1F cache index = 0x1F cache index = 0x1F cache index = 0x1F

Miss (first access) Hit Miss (first access) Miss (different tag) Hit Hit

Computer Architecture Chapter 5

2013, CE

dce
2013

Fully Associative Cache

A block can be placed anywhere in cache no indexing If m blocks exist then

m comparators are needed to match tag Cache data size = m 2b bytes
V Tag Block Data V Tag Block Data V Tag Block Data
Address
Tag offset

V Tag Block Data

mux

m-way associative
Computer Architecture Chapter 5

Hit

Data

2013, CE

dce
2013

Set-Associative Cache

A set is a group of blocks that can be indexed A block is first mapped onto a set
Set index = Block address mod Number of sets in cache

If there are m blocks in a set (m-way set associative) then

m tags are checked in parallel using m comparators

If 2n sets exist then set index consists of n bits

Cache data size = m 2n+b bytes (with 2b bytes per block)
Without counting tags and valid bits

A direct-mapped cache has one block per set (m = 1)

A fully-associative cache has one set (2n = 1 or n = 0)
Computer Architecture Chapter 5 2013, CE 33

dce
2013

Set-Associative Cache Diagram

Address
Tag Index offset

V Tag Block Data

m-way set-associative

Hit

mux Data

Computer Architecture Chapter 5

2013, CE

dce
2013

Write Policy
Writes update cache and lower-level memory
Cache control bit: only a Valid bit is needed Memory always has latest data, which simplifies data coherency Can always discard cached data when a block is replaced

Write Through:

Write Back:
Writes update cache only Cache control bits: Valid and Modified bits are required

Modified cached data is written back to memory when replaced

Multiple writes to a cache block require only one write to memory Uses less memory bandwidth than write-through and less power However, more complex to implement than write through
Computer Architecture Chapter 5 2013, CE 35

dce
2013

Write Miss Policy

What happens on a write miss? Write Allocate:

Allocate new block in cache
Write miss acts like a read miss, block is fetched and updated

No Write Allocate:
Send data to lower-level memory
Cache is not modified

Typically, write back caches use write allocate

Hoping subsequent writes will be captured in the cache

Write-through caches often use no-write allocate

Reasoning: writes must still go to lower level memory

Computer Architecture Chapter 5

2013, CE

dce
2013

Write Buffer
Permits writes to occur without stall cycles until buffer is full

Decouples the CPU write from the memory bus writing

Write-through: all stores are sent to lower level memory

Write buffer eliminates processor stalls on consecutive writes

Write-back: modified blocks are written when replaced

Write buffer is used for evicted blocks that must be written back

The address and modified data are written in the buffer

The write is finished from the CPU perspective

CPU continues while the write buffer prepares to write memory

If buffer is full, CPU stalls until buffer has an empty entry

Computer Architecture Chapter 5

2013, CE

dce
2013

What Happens on a Cache Miss?

Cache sends a miss signal to stall the processor Decide which cache block to allocate/replace
One choice only when the cache is directly mapped
Multiple choices for set-associative or fully-associative cache

Transfer the block from lower level memory to this cache

Set the valid bit and the tag field from the upper address bits

If block to be replaced is modified then write it back

Modified block is moved into a Write Buffer Otherwise, block to be replaced can be simply discarded

Restart the instruction that caused the cache miss Miss Penalty: clock cycles to process a cache miss
Computer Architecture Chapter 5 2013, CE 38

dce
2013

Replacement Policy

Which block to be replaced on a cache miss? No selection alternatives for direct-mapped caches m blocks per set to choose from for associative caches Random replacement
Candidate blocks are randomly selected One counter for all sets (0 to m 1): incremented on every cycle On a cache miss replace block specified by counter

First In First Out (FIFO) replacement

Replace oldest block in set

One counter per set (0 to m 1): specifies oldest block to replace

Counter is incremented on a cache miss

Computer Architecture Chapter 5

2013, CE

dce
2013

Replacement Policy contd

Replace block that has been unused for the longest time Order blocks within a set from least to most recently used

Least Recently Used (LRU)

Update ordering of blocks on each cache hit

With m blocks per set, there are m! possible permutations

Pure LRU is too costly to implement when m > 2

m = 2, there are 2 permutations only (a single bit is needed) m = 4, there are 4! = 24 possible permutations LRU approximation is used in practice

For large m > 4, Random replacement can be as effective as LRU

Computer Architecture Chapter 5 2013, CE 40

dce
2013

Comparing Random, FIFO, and LRU

10 SPEC2000 benchmarks on Alpha processor Block size of 64 bytes

Data cache misses per 1000 instructions

LRU and FIFO outperforming Random for a small cache

Little difference between LRU and Random for a large cache

LRU is expensive for large associativity (# blocks per set)

Random is the simplest to implement in hardware

2-way Size 16 KB 64 KB LRU Rand FIFO LRU 4-way Rand FIFO LRU 8-way Rand FIFO

114.1 117.3 115.5 103.4 104.3 103.9

111.7 115.1 113.3 102.4 102.3 103.1

109.0 111.8 110.4 99.7 100.5 100.3

256 KB

92.2

92.1

92.5

92.1

92.5

92.1

92.5

Computer Architecture Chapter 5

2013, CE

dce
2013

Next . . .

Random Access Memory and its Structure Memory Hierarchy and the need for Cache Memory The Basics of Caches Cache Performance and Memory Stall Cycles

Improving Cache Performance

Multilevel Caches

Computer Architecture Chapter 5

2013, CE

dce
2013

Hit Rate and Miss Rate

= Hits / (Hits + Misses)

Hit Rate

Miss Rate = Misses / (Hits + Misses) I-Cache Miss Rate = Miss rate in the Instruction Cache D-Cache Miss Rate = Miss rate in the Data Cache Example:
Out of 1000 instructions fetched, 150 missed in the I-Cache 25% are load-store instructions, 50 missed in the D-Cache What are the I-cache and D-cache miss rates?

I-Cache Miss Rate = 150 / 1000 = 15% D-Cache Miss Rate = 50 / (25% 1000) = 50 / 250 = 20%

Computer Architecture Chapter 5

2013, CE

dce
2013

Memory Stall Cycles

When fetching instructions from the Instruction Cache (I-cache) When loading or storing data into the Data Cache (D-cache)

The processor stalls on a Cache miss

Memory stall cycles = Combined Misses Miss Penalty

Miss Penalty: clock cycles to process a cache miss Combined Misses = I-Cache Misses + D-Cache Misses

I-Cache Misses = I-Count I-Cache Miss Rate

D-Cache Misses = LS-Count D-Cache Miss Rate LS-Count (Load & Store) = I-Count LS Frequency Cache misses are often reported per thousand instructions
Computer Architecture Chapter 5 2013, CE 44

dce
2013

Memory Stall Cycles Per Instruction

Combined Misses Per Instruction Miss Penalty

Memory Stall Cycles Per Instruction =

Miss Penalty is assumed equal for I-cache & D-cache

Miss Penalty is assumed equal for Load and Store Combined Misses Per Instruction = I-Cache Miss Rate + LS Frequency D-Cache Miss Rate Therefore, Memory Stall Cycles Per Instruction = I-Cache Miss Rate Miss Penalty + LS Frequency D-Cache Miss Rate Miss Penalty
Computer Architecture Chapter 5 2013, CE 45

dce
2013

Example on Memory Stall Cycles

Instruction count (I-Count) = 106 instructions 30% of instructions are loads and stores

Consider a program with the given characteristics

D-cache miss rate is 5% and I-cache miss rate is 1%

Miss penalty is 100 clock cycles for instruction and data caches Compute combined misses per instruction and memory stall cycles

Combined misses per instruction in I-Cache and D-Cache

1% + 30% 5% = 0.025 combined misses per instruction Equal to 25 misses per 1000 instructions

Memory stall cycles

0.025 100 (miss penalty) = 2.5 stall cycles per instruction
Total memory stall cycles = 106 2.5 = 2,500,000
Computer Architecture Chapter 5 2013, CE 46

dce
2013

CPU Time with Memory Stall Cycles

CPU Time = I-Count CPIMemoryStalls Clock Cycle

CPIMemoryStalls = CPIPerfectCache + Mem Stalls per Instruction CPIPerfectCache = CPI for ideal cache (no cache misses) CPIMemoryStalls = CPI in the presence of memory stalls Memory stall cycles increase the CPI
Computer Architecture Chapter 5 2013, CE 47

dce
2013

Example on CPI with Memory Stalls

Cache miss rate is 2% for instruction and 5% for data 20% of instructions are loads and stores

A processor has CPI of 1.5 without any memory stalls

Cache miss penalty is 100 clock cycles for I-cache and D-cache

What is the impact on the CPI? Answer:

Instruction data

Mem Stalls per Instruction =0.02100 + 0.20.05100 = 3

CPIMemoryStalls = 1.5 + 3 = 4.5 cycles per instruction CPIMemoryStalls / CPIPerfectCache = 4.5 / 1.5 = 3

Processor is 3 times slower due to memory stall cycles

CPINoCache = 1.5 + (1 + 0.2) 100 = 121.5 (a lot worse)
Computer Architecture Chapter 5 2013, CE 48

dce
2013

Average Memory Access Time

AMAT = Hit time + Miss rate Miss penalty

Average Memory Access Time (AMAT) Time to access a cache for both hits and misses Example: Find the AMAT for a cache with
Cache access time (Hit time) of 1 cycle = 2 ns Miss penalty of 20 clock cycles Miss rate of 0.05 per access

Solution:
AMAT = 1 + 0.05 20 = 2 cycles = 4 ns

Without the cache, AMAT will be equal to Miss penalty = 20 cycles

Computer Architecture Chapter 5 2013, CE 49

dce
2013

Next . . .

Random Access Memory and its Structure Memory Hierarchy and the need for Cache Memory The Basics of Caches Cache Performance and Memory Stall Cycles

Improving Cache Performance

Multilevel Caches

Computer Architecture Chapter 5

2013, CE

dce
2013

Improving Cache Performance

Average Memory Access Time (AMAT) AMAT = Hit time + Miss rate * Miss penalty Used as a framework for optimizations Reduce the Hit time
Small and simple caches

Reduce the Miss Rate

Larger cache size, higher associativity, and larger block size

Reduce the Miss Penalty

Multilevel caches

Computer Architecture Chapter 5

2013, CE

dce
2013

Small and Simple Caches

Fast clock rate demands small and simple L1 cache designs

Hit time is critical: affects the processor clock cycle Small cache reduces the indexing time and hit time
Indexing a cache represents a time consuming portion Tag comparison also adds to this hit time

Direct-mapped overlaps tag check with data transfer

Associative cache uses additional mux and increases hit time

Size of L1 caches has not increased much

L1 caches are the same size on Alpha 21264 and 21364
Same also on UltraSparc II and III, AMD K6 and Athlon Reduced from 16 KB in Pentium III to 8 KB in Pentium 4
Computer Architecture Chapter 5 2013, CE 52

dce
2013

Classifying Misses Three Cs

Conditions under which misses occur Compulsory: program starts with no block in cache
Also called cold start misses Misses that would occur even if a cache has infinite size

Capacity: misses happen because cache size is finite

Blocks are replaced and then later retrieved Misses that would occur in a fully associative cache of a finite size

Conflict: misses happen because of limited associativity

Limited number of blocks per set Non-optimal replacement algorithm

Computer Architecture Chapter 5

2013, CE

dce
2013

Classifying Misses contd

Compulsory misses are independent of cache size
Miss Rate

Very small for long-running programs

1-way
2-way

14% 12%
10%

Capacity misses decrease as capacity increases

4-way

8% 8-way 6%

Conflict misses decrease as associativity increases

Capacity

Data were collected using LRU replacement

Compulsory

4%
2% 0

128 KB
2013, CE 54

Computer Architecture Chapter 5

dce
2013

Larger Size and Higher Associativity

Increasing cache size reduces capacity misses It also reduces conflict misses
Larger cache size spreads out references to more blocks

Drawbacks: longer hit time and higher cost Larger caches are especially popular as 2nd level caches Higher associativity also improves miss rates
Eight-way set associative is as effective as a fully associative

Computer Architecture Chapter 5

2013, CE

dce
2013

Larger Block Size

Simplest way to reduce miss rate is to increase block size However, it increases conflict misses if cache is small
25% 20%
Reduced Compulsory Misses Increased Conflict Misses

1K 4K 16K 64K 64-byte blocks are common in L1 caches 128-byte block are common in L2 caches

Miss Rate

15% 10% 5%

256K 32 64 128 256 16 0%

Block Size (bytes)

Computer Architecture Chapter 5

2013, CE

dce
2013

Next . . .

Random Access Memory and its Structure Memory Hierarchy and the need for Cache Memory The Basics of Caches Cache Performance and Memory Stall Cycles

Improving Cache Performance

Multilevel Caches

Computer Architecture Chapter 5

2013, CE

dce
2013

Multilevel Caches
Keep pace with processor speed

Top level cache should be kept small to Adding another cache level
Can reduce the memory gap
Can reduce memory bus loading
I-Cache D-Cache

Unified L2 Cache Main Memory

Local miss rate

Number of misses in a cache / Memory accesses to this cache

Miss RateL1 for L1 cache, and Miss RateL2 for L2 cache

Global miss rate

Number of misses in a cache / Memory accesses generated by CPU Miss RateL1 for L1 cache, and Miss RateL1 Miss RateL2 for L2 cache
Computer Architecture Chapter 5 2013, CE 58

dce
2013

Power 7 On-Chip Caches [IBM 2010]

32KB I-Cache/core 32KB D-Cache/core 3-cycle latency 256KB Unified L2 Cache/core 8-cycle latency 32MB Unified Shared L3 Cache Embedded DRAM 25-cycle latency to local slice

Computer Architecture Chapter 5

2013, CE

dce
2013

Multilevel Cache Policies

Multilevel Inclusion
L1 cache data is always present in L2 cache A miss in L1, but a hit in L2 copies block from L2 to L1 A miss in L1 and L2 brings a block into L1 and L2 A write in L1 causes data to be written in L1 and L2

Typically, write-through policy is used from L1 to L2

Typically, write-back policy is used from L2 to main memory
To reduce traffic on the memory bus

A replacement or invalidation in L2 must be propagated to L1

Computer Architecture Chapter 5

2013, CE

dce
2013

Multilevel Cache Policies contd

L1 data is never found in L2 cache Prevents wasting space
Cache miss in L1, but a hit in L2 results in a swap of blocks Cache miss in both L1 and L2 brings the block into L1 only

Multilevel exclusion

Block replaced in L1 is moved into L2

Example: AMD Athlon

Same or different block size in L1 and L2 caches

Choosing a larger block size in L2 can improve performance
However different block sizes complicates implementation Pentium 4 has 64-byte blocks in L1 and 128-byte blocks in L2
Computer Architecture Chapter 5 2013, CE 61

ECE 554 Computer Architecture Main Memory Spring 2013
No ratings yet
ECE 554 Computer Architecture Main Memory Spring 2013
35 pages
Using BAPI To Migrate Material Master - BAPI's in SAP - SAPNuts
100% (1)
Using BAPI To Migrate Material Master - BAPI's in SAP - SAPNuts
10 pages
Joachim Von Zur Gathen - CryptoSchool-Springer (2015) PDF
100% (2)
Joachim Von Zur Gathen - CryptoSchool-Springer (2015) PDF
889 pages
Cryptography Lecture 1: Principles and History
100% (1)
Cryptography Lecture 1: Principles and History
71 pages
ECE 448 - FPGA and ASIC Design With VHDL: George Mason University
100% (1)
ECE 448 - FPGA and ASIC Design With VHDL: George Mason University
76 pages
Skript QFT2
No ratings yet
Skript QFT2
107 pages
Ch4 - Tabulation Method
100% (1)
Ch4 - Tabulation Method
36 pages
The Quine-McCluskey Method
100% (1)
The Quine-McCluskey Method
15 pages
Sequential Circuit Analysis: State Tables State Diagrams
100% (1)
Sequential Circuit Analysis: State Tables State Diagrams
17 pages
Integer Factorization
No ratings yet
Integer Factorization
57 pages
Sequential Circuit
100% (1)
Sequential Circuit
51 pages
Big-Bang Cosmology: Hitoshi Murayama 129A F2002 Semester
100% (1)
Big-Bang Cosmology: Hitoshi Murayama 129A F2002 Semester
66 pages
Exploring Digital Logic Withlogisim-Evolution
100% (1)
Exploring Digital Logic Withlogisim-Evolution
279 pages
Edwin L. Woollett - Maxima by Example
100% (1)
Edwin L. Woollett - Maxima by Example
514 pages
QM2
100% (1)
QM2
317 pages
3.1 Static Random Access Memory (SRAM)
No ratings yet
3.1 Static Random Access Memory (SRAM)
6 pages
Thesis Hisham PDF
No ratings yet
Thesis Hisham PDF
151 pages
Gretl Guide
No ratings yet
Gretl Guide
336 pages
Theory of Computation
No ratings yet
Theory of Computation
279 pages
What Is JCL?
100% (2)
What Is JCL?
25 pages
d2l en PDF
No ratings yet
d2l en PDF
995 pages
An Operating System Vade Mecum - Raphael Finkel
No ratings yet
An Operating System Vade Mecum - Raphael Finkel
362 pages
LLVM+GHV Thesis
No ratings yet
LLVM+GHV Thesis
73 pages
Intermediate Code Generation
No ratings yet
Intermediate Code Generation
62 pages
Elementary Cellular Automata
No ratings yet
Elementary Cellular Automata
16 pages
Cisc vs. Risc
No ratings yet
Cisc vs. Risc
53 pages
Report Multiprocessor Scheduling Algorithm Implementation Using Genetic Algorithms
No ratings yet
Report Multiprocessor Scheduling Algorithm Implementation Using Genetic Algorithms
98 pages
Numerical Analysis Introduction
100% (1)
Numerical Analysis Introduction
24 pages
Target Code Generation: Utkarsh Jaiswal 11CS30038
No ratings yet
Target Code Generation: Utkarsh Jaiswal 11CS30038
15 pages
Unit-Iv: Mathematics, Biology and Computers For Chemists
No ratings yet
Unit-Iv: Mathematics, Biology and Computers For Chemists
90 pages
An Implementation of Smart Contracts by PDF
No ratings yet
An Implementation of Smart Contracts by PDF
9 pages
Group Theory - QMII 2017: 1.1 Reminder
No ratings yet
Group Theory - QMII 2017: 1.1 Reminder
10 pages
Components, Frameworks, Patterns Ralph E. Johnson
No ratings yet
Components, Frameworks, Patterns Ralph E. Johnson
23 pages
ARM AssyLang PDF
100% (1)
ARM AssyLang PDF
0 pages
Practical Implementation of A Dependently Typed Functional Programming Language
100% (1)
Practical Implementation of A Dependently Typed Functional Programming Language
270 pages
IISc eDRAM Ravi 2014 PDF
No ratings yet
IISc eDRAM Ravi 2014 PDF
75 pages
Cfengine 3 Concepts Guide
No ratings yet
Cfengine 3 Concepts Guide
62 pages
Module 3 Problem Solving and Reasoning
No ratings yet
Module 3 Problem Solving and Reasoning
31 pages
Z-Functional Programming in Haskell
No ratings yet
Z-Functional Programming in Haskell
26 pages
Assembler New Manual PDF
100% (1)
Assembler New Manual PDF
159 pages
Pauli and Heisenberg
100% (1)
Pauli and Heisenberg
22 pages
InTech-Types of Machine Learning Algorithms
No ratings yet
InTech-Types of Machine Learning Algorithms
30 pages
USB: Universal Serial Bus
100% (3)
USB: Universal Serial Bus
38 pages
Barbara Liskov, Programming With Abstract Data Types
100% (1)
Barbara Liskov, Programming With Abstract Data Types
10 pages
Parallel
No ratings yet
Parallel
13 pages
Synposis FPGA Synthesis User Guide
No ratings yet
Synposis FPGA Synthesis User Guide
484 pages
Wolfe, Cave-Guided Search. An Alternative To The Feature Integration Model of Visual Search
No ratings yet
Wolfe, Cave-Guided Search. An Alternative To The Feature Integration Model of Visual Search
15 pages
Combinatorics of Finite Sets
From Everand
Combinatorics of Finite Sets
Ian Anderson
4.5/5 (1)
Chapter05 Memory PDF
No ratings yet
Chapter05 Memory PDF
62 pages
Memory: Computer Architecture and Assembly Language
No ratings yet
Memory: Computer Architecture and Assembly Language
15 pages
Chapter 5-The Memory System
100% (1)
Chapter 5-The Memory System
80 pages
Chapter 5-The Memory System
No ratings yet
Chapter 5-The Memory System
80 pages
Fundamental Concepts
No ratings yet
Fundamental Concepts
64 pages
Onur 740 Fall11 Lecture25 Mainmemory
No ratings yet
Onur 740 Fall11 Lecture25 Mainmemory
50 pages
The Memory System
No ratings yet
The Memory System
50 pages
Mc9211unit 5 PDF
No ratings yet
Mc9211unit 5 PDF
89 pages
Memory System
No ratings yet
Memory System
70 pages
Memory Organization
No ratings yet
Memory Organization
24 pages
L05 Memory
No ratings yet
L05 Memory
45 pages
04 - Computer Memory Systems
No ratings yet
04 - Computer Memory Systems
91 pages
EECS 150 - Components and Design Techniques For Digital Systems Lec 16 - Storage: DRAM, SDRAM
No ratings yet
EECS 150 - Components and Design Techniques For Digital Systems Lec 16 - Storage: DRAM, SDRAM
26 pages
The Memory System: Deepak John, Department Computer Applications, SJCET-Pala
No ratings yet
The Memory System: Deepak John, Department Computer Applications, SJCET-Pala
63 pages
"E-Learning-Woinbo": Noida Institute of Engineering AND Technology (Mca in Stitute) Greater Noida
No ratings yet
"E-Learning-Woinbo": Noida Institute of Engineering AND Technology (Mca in Stitute) Greater Noida
10 pages
Toolbox Other Editing Commands: Twitter
No ratings yet
Toolbox Other Editing Commands: Twitter
1 page
Home Automation System
No ratings yet
Home Automation System
4 pages
EOD Checklist
0% (1)
EOD Checklist
3 pages
User Guide
No ratings yet
User Guide
17 pages
Glassfish Architectures
No ratings yet
Glassfish Architectures
5 pages
Lab Report: Subject Dynamics and Control (Lab)
No ratings yet
Lab Report: Subject Dynamics and Control (Lab)
4 pages
Welcome To The Easypcbuilder $600 (Usd) Value Gaming PC Build Guide!
No ratings yet
Welcome To The Easypcbuilder $600 (Usd) Value Gaming PC Build Guide!
98 pages
Cocomo 81: Project Size Project Type
No ratings yet
Cocomo 81: Project Size Project Type
6 pages
Veritas Cluster
No ratings yet
Veritas Cluster
16 pages
Temenos PGM T24
50% (2)
Temenos PGM T24
5 pages
CMDB Template
100% (1)
CMDB Template
27 pages
Super Market Billing System: Project Progress Report-2
No ratings yet
Super Market Billing System: Project Progress Report-2
6 pages
Lecture01.Beamer 14093
No ratings yet
Lecture01.Beamer 14093
104 pages
GA Standard Ecommerce GTM Recipe and Blueprint
No ratings yet
GA Standard Ecommerce GTM Recipe and Blueprint
9 pages
FreeRTOS Reference Manual V10.0.0
100% (2)
FreeRTOS Reference Manual V10.0.0
400 pages
Chapter 5: Protecting Information Resources Multiple Choice
No ratings yet
Chapter 5: Protecting Information Resources Multiple Choice
4 pages
Open Source Data Mining
No ratings yet
Open Source Data Mining
5 pages
Fpga Implementation of A License Plate Recognition Soc Using Automatically Generated Streaming Accelerators
No ratings yet
Fpga Implementation of A License Plate Recognition Soc Using Automatically Generated Streaming Accelerators
8 pages
User'S Guide: OTP/Flash Writer EM78 Series
No ratings yet
User'S Guide: OTP/Flash Writer EM78 Series
52 pages
Epic Script
No ratings yet
Epic Script
2 pages
MCA Sem I To V (Science)
No ratings yet
MCA Sem I To V (Science)
31 pages
Foglight Install Unix Embedded Mysql
No ratings yet
Foglight Install Unix Embedded Mysql
65 pages
Assignment 4 (Crashing)
No ratings yet
Assignment 4 (Crashing)
1 page
SharePoint 2013 Trace Matrix
No ratings yet
SharePoint 2013 Trace Matrix
3 pages
Application Services Governance
No ratings yet
Application Services Governance
9 pages
2006 SEWorld Jowers Buckley Reilly
No ratings yet
2006 SEWorld Jowers Buckley Reilly
16 pages
Decision Maths 1 Algorithms
No ratings yet
Decision Maths 1 Algorithms
7 pages