Section 4 - Chapter 5 - Revised

Chapter 5 of CPEG 340 discusses the essential aspects of memory in embedded systems, including types of memory such as ROM, RAM, and their variations. It covers concepts like write ability, storage permanence, and the memory hierarchy, emphasizing the trade-offs between speed, cost, and capacity. The chapter also explains memory access mechanisms and performance metrics like hit and miss rates in cache memory systems.

CPEG 340: Embedded Systems Design

Chapter 5 – Memories
Introduction

• Embedded system’s functionality aspects


– Processing
• processors
• transformation of data
– Storage
• memory
• retention of data
– Communication
• buses
• transfer of data
Memory: basic concepts

• Stores a large number of bits
– m × n memory: m words of n bits each
– k = log2(m) address input signals
– n data input/output signals
– e.g., a 4,096 × 8 memory:
• 32,768 bits
• 12 address input signals
• 8 input/output data signals
• Memory access
– r/w: selects read or write
– enable: read or write only when asserted
– multiport: multiple accesses to different locations simultaneously
[Figure: memory external view. A 2^k × n read-and-write memory with inputs r/w, enable, and address lines A0..Ak-1, and data lines Q0..Qn-1; m words, n bits per word.]
Write ability / storage permanence

• Traditional ROM/RAM distinctions
– ROM
• read only, bits stored without power
– RAM
• read and write, loses stored bits without power
• Traditional distinctions blurred
– Advanced ROMs can be written to
• e.g., EEPROM
– Advanced RAMs can hold bits without power
• e.g., NVRAM
• Write ability
– manner and speed with which a memory can be written
• Storage permanence
– ability of memory to hold stored bits after they are written
[Figure: write ability and storage permanence of memories, showing relative degrees along each axis (not to scale). Storage permanence, highest to lowest: ideal memory (mask-programmed ROM); life of product (OTP ROM); tens of years (EPROM, EEPROM, FLASH); battery life, ~10 years (NVRAM); near zero (SRAM/DRAM). NVRAM and above are nonvolatile. Write ability, lowest to highest: during fabrication only (mask-programmed ROM); external programmer, only one time (OTP ROM); external programmer, 1,000s of cycles (EPROM); external programmer or in-system, 1,000s of cycles (EEPROM); external programmer or in-system, block-oriented writes, 1,000s of cycles (FLASH); in-system, fast writes, unlimited cycles (SRAM/DRAM, NVRAM).]
Write ability

• Ranges of write ability


– High end
• processor writes to memory simply and quickly
• e.g., RAM
– Middle range
• processor writes to memory, but slower
• e.g., FLASH, EEPROM (electrically erasable)
– Lower range
• special equipment, “programmer”, must be used to write to memory
• e.g., EPROM (erasable PROM), OTP (one-time programmable) ROM
– Low end
• bits stored only during fabrication
• e.g., Mask-programmed ROM
• In-system programmable memory
– Can be written to by a processor in the embedded system using the
memory
– Memories in high end and middle range of write ability
Storage permanence
• Range of storage permanence
– High end
• essentially never loses bits
• e.g., mask-programmed ROM
– Middle range
• holds bits days, months, or years after memory’s power source turned off
• e.g., NVRAM (non-volatile RAM)
– Lower range
• holds bits as long as power supplied to memory
• e.g., SRAM
– Low end
• begins to lose bits almost immediately after written
• e.g., DRAM
• Nonvolatile memory
– Holds bits after power is no longer supplied
– High end and middle range of storage permanence
ROM: “Read-Only” Memory

• Nonvolatile memory
• Can be read from, but not written to, by a processor in an embedded system
• Traditionally written to, “programmed”, before inserting into the embedded system
• Uses
– Store software program for general-purpose processor
• program instructions can be one or more ROM words
– Store constant data needed by system
– Implement combinational circuit
[Figure: ROM external view. A 2^k × n ROM with enable, address inputs A0..Ak-1, and data outputs Q0..Qn-1.]
Example: 8 × 4 ROM

• Horizontal lines = words; vertical lines = data
• Lines connected only at circles (programmable connections, wired-OR)
• The 3×8 decoder sets word 2’s line to 1 if the address input is 010
• Data lines Q3 and Q1 are set to 1 because there is a “programmed” connection with word 2’s word line
• Word 2 is not connected with data lines Q2 and Q0
• Output is 1010
[Figure: 8 × 4 ROM internal view. An enable input and 3×8 decoder on A2, A1, A0 drive word lines word 0..word 7; programmable connections join word lines to wired-OR data lines Q3..Q0.]
Implementing a combinational function

• Any combinational circuit of n functions of the same k variables can be implemented with a 2^k × n ROM

Truth table (inputs a, b, c as the address; outputs y, z as the stored word):

a b c | y z
0 0 0 | 0 0
0 0 1 | 0 1
0 1 0 | 0 1
0 1 1 | 1 0
1 0 0 | 1 0
1 0 1 | 1 1
1 1 0 | 1 1
1 1 1 | 1 1

[Figure: 8 × 2 ROM with enable and address inputs a, b, c; words 0..7 store 00, 01, 01, 10, 10, 11, 11, 11; outputs are y and z.]
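The ROM-as-truth-table idea can be sketched in software: the k inputs are concatenated into an address, and the stored word is the n outputs. The `rom_lookup` helper below is an illustrative sketch, not from the slides.

```python
# The 8 x 2 ROM from the truth table above: one 2-bit word (y, z) per row.
ROM = [0b00, 0b01, 0b01, 0b10, 0b10, 0b11, 0b11, 0b11]

def rom_lookup(a, b, c):
    """Return (y, z) by addressing the ROM with inputs a, b, c."""
    address = (a << 2) | (b << 1) | c     # inputs concatenated as the address
    word = ROM[address]                   # the stored 2-bit word
    return (word >> 1) & 1, word & 1      # split into y and z

print(rom_lookup(0, 1, 1))  # word 3 = 0b10 -> (1, 0)
```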
Mask-programmed ROM

• Connections “programmed” at fabrication


– set of masks
• Lowest write ability
– only once
• Highest storage permanence
– bits never change unless damaged
• Typically used for final design of high-volume systems
OTP ROM: One-time programmable ROM

• Connections “programmed” after manufacture by user


– user provides file of desired contents of ROM
– file input to machine called ROM programmer
– each programmable connection is a fuse
– ROM programmer blows fuses where connections should not exist
• Very low write ability
– typically written only once and requires ROM programmer device
• Very high storage permanence
– bits don’t change unless reconnected to programmer and more fuses
blown
• Commonly used in final products
– cheaper, harder to inadvertently modify
EPROM: Erasable programmable ROM
• Programmable component is a MOS transistor
– transistor has a “floating” gate surrounded by an insulator
– (a) negative charges form a channel between source and drain, storing a logic 1
– (b) a large positive voltage (+15 V) at the gate causes negative charges to move out of the channel and get trapped in the floating gate, storing a logic 0
– (c) (erase) shining UV rays on the surface of the floating gate for 5–30 min causes the negative charges to return to the channel from the floating gate, restoring the logic 1
– (d) an EPROM package has a quartz window through which UV light can pass
• Better write ability
– can be erased and reprogrammed thousands of times
• Reduced storage permanence
– program lasts about 10 years but is susceptible to radiation and electric noise
• Typically used during design development
[Figure: (a)–(c) transistor cross-sections showing source, drain, and floating gate at 0 V and +15 V; (d) an EPROM package with its quartz window.]
EEPROM: Electrically erasable
programmable ROM
• Programmed and erased electronically
– typically by using higher than normal voltage
– can program and erase individual words
• Better write ability
– can be in-system programmable with built-in circuit to provide higher
than normal voltage
– writes very slow due to erasing and programming
• “busy” pin indicates to processor EEPROM still writing
– can be erased and programmed tens of thousands of times
• Similar storage permanence to EPROM (about 10 years)
• Far more convenient than EPROMs, but more expensive
Flash Memory

• Extension of EEPROM
– Same floating gate principle
– Same write ability and storage permanence
• Fast erase
– Large blocks of memory erased at once, rather than one word at a time
– Blocks typically several thousand bytes large
• Writes to single words may be slower
– Entire block must be read, word updated, then entire block written back
• Used with embedded systems storing large data items in
nonvolatile memory
– e.g., digital cameras, TV set-top boxes, cell phones
RAM: “Random-access” memory

• Typically volatile memory
– bits are not held without a power supply
• Read and written to easily by the embedded system during execution
• Internal structure more complex than ROM
– a word consists of several memory cells, each storing 1 bit
– each input and output data line connects to each cell in its column
– rd/wr is connected to every cell
– when a row is enabled by the decoder, each cell has logic that stores the input data bit when rd/wr indicates write, or outputs the stored bit when rd/wr indicates read
[Figure: external view. A 2^k × n read-and-write memory with r/w, enable, address lines A0..Ak-1, and data lines Q0..Qn-1. Internal view: a 4 × 4 RAM with data inputs I3..I0, enable, a 2×4 decoder on A0 and A1, one memory cell per bit, rd/wr routed to every cell, and outputs Q3..Q0.]
Basic types of RAM

• SRAM: Static RAM
– memory cell uses a flip-flop (cross-coupled inverters) to store the bit
– requires 6 transistors
– holds data as long as power is supplied
• DRAM: Dynamic RAM
– memory cell uses a MOS transistor and capacitor to store the bit
– more compact than SRAM
– “refresh” required due to capacitor leakage
• a word’s cells are refreshed when read
– typical refresh rate: every 15.625 microseconds
– slower to access than SRAM
[Figure: memory cell internals. SRAM cell with Data and Data' lines and word line W; DRAM cell with Data line, word line W, a transistor, and a capacitor.]
RAM variations

• PSRAM: Pseudo-static RAM


– DRAM with built-in memory refresh controller
– Popular low-cost high-density alternative to SRAM
• NVRAM: Nonvolatile RAM
– Holds data after external power removed
– Battery-backed RAM
• SRAM with own permanently connected battery
• writes as fast as reads
• no limit on number of writes unlike nonvolatile ROM-based memory
– SRAM with EEPROM or flash
• stores complete RAM contents on EEPROM or flash before power turned off
Composing memory
• Memory size needed often differs from the size of readily available memories
• When the available memory is larger, simply ignore unneeded high-order address bits and higher data lines
• When the available memory is smaller, compose several smaller memories into one larger memory
– connect side-by-side to increase the width of words
– connect top to bottom to increase the number of words
• an added high-order address line selects the smaller memory containing the desired word, using a decoder
– combine both techniques to increase the number and width of words
[Figure: increasing the number of words: two 2^m × n ROMs stacked, with a 1×2 decoder on address line Am asserting one ROM's enable, forming a 2^(m+1) × n ROM on A0..Am-1 with outputs Qn-1..Q0. Increasing width: three 2^m × n ROMs side-by-side sharing A0..Am-1 and enable, forming a 2^m × 3n ROM with outputs Q3n-1..Q0. Combining both increases the number and width of words.]
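The top-to-bottom composition can be sketched in software, with the added high-order address bit playing the role of the 1×2 decoder. The names `rom_low`, `rom_high`, and `composed_read` are illustrative, not from the slides.

```python
# Composing two 2^m x n memories into one 2^(m+1) x n memory.
M = 4  # words per small memory (m = 2), small for illustration

rom_low  = [0xA0, 0xA1, 0xA2, 0xA3]   # holds words 0..3
rom_high = [0xB0, 0xB1, 0xB2, 0xB3]   # holds words 4..7

def composed_read(address):
    """Read one word from the composed 2*M-word memory."""
    select = address // M   # high-order bit: which small memory to enable
    offset = address % M    # low-order bits: word within that memory
    return (rom_high if select else rom_low)[offset]

print(hex(composed_read(5)))  # word 5 lives in rom_high[1] -> 0xb1
```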


Memory Interface
• System performance depends on:
– Processor performance
– Memory system performance
[Figure: memory interface. Processor and memory share CLK; the processor drives MemWrite (the memory's WE), Address, and WriteData; the memory returns ReadData.]
The Memory Access Problem
• Up until now, assumed memory could be accessed
in 1 clock cycle
• But that hasn’t been true since the 1980s
Memory System Challenge
• Make memory system appear as fast as processor
• Use a hierarchy of memories
• Ideal memory:
– Fast
– Cheap (inexpensive)
– Large (capacity)

But we can only choose two!


Memory Hierarchy

Technology | Cost / GB  | Access time     | Hierarchy level
SRAM       | ~ $10,000  | ~ 1 ns          | Cache
DRAM       | ~ $100     | ~ 100 ns        | Main memory
Hard disk  | ~ $1       | ~ 10,000,000 ns | Virtual memory

(Speed increases toward the top of the hierarchy; size increases toward the bottom.)
Memory hierarchy

• Main memory
– large, inexpensive, slow memory stores entire program and data
• Cache
– small, expensive, fast memory stores a copy of likely-accessed parts of the larger memory
– can be multiple levels of cache
[Figure: the hierarchy from the processor outward: registers, cache, main memory, disk, tape.]
Cache
• Usually designed with SRAM
– faster but more expensive than DRAM
• Usually on same chip as processor
– space limited, so much smaller than off-chip main memory
– faster access (1 cycle vs. several cycles for main memory)
• Cache operation:
– Request for main memory access (read or write)
– First, check cache for copy
• cache hit
– copy is in cache, quick access
• cache miss
– copy not in cache, read address and possibly its neighbors into cache
• Several cache design choices
– cache mapping, replacement policies, and write techniques
Intel Pentium III Die
What data is held in the cache?
• Ideally, cache anticipates data needed by processor and
holds it in cache
• But impossible to predict future. So, we use past to
predict future – via temporal and spatial locality:
• Temporal locality
– Locality in time; keep recently accessed data in cache.
– If data was accessed recently, it is likely to be accessed again
soon. Next time it’s accessed, it’s available in cache.
• Spatial locality
– Locality in space; copy neighboring data into cache too. Recall a
block is the basic unit of copying in cache and may consist of
multiple words.
– If data used recently, likely to use nearby data soon
Memory Performance
• Hit: data is found in that level of the memory hierarchy
• Miss: data is not found (must go to the next level)

Hit Rate = # hits / # memory accesses = 1 – Miss Rate
Miss Rate = # misses / # memory accesses = 1 – Hit Rate

• Average memory access time (AMAT): the average time it takes for the processor to access data

AMAT = t_cache + MR_cache × (t_MM + MR_MM × t_VM)
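As a quick sketch, the AMAT formula can be coded as a small helper; the function name is illustrative, and the defaults just collapse the hierarchy when there is no virtual-memory level.

```python
def amat(t_cache, mr_cache, t_mm, mr_mm=0.0, t_vm=0.0):
    """Average memory access time: cache, then main memory, then VM."""
    return t_cache + mr_cache * (t_mm + mr_mm * t_vm)

# With a 1-cycle cache, 37.5% miss rate, and 100-cycle main memory
# (the numbers used in Example 2 below):
print(amat(1, 0.375, 100))  # -> 38.5
```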
Memory Performance Example 1
• A program has 2,000 load and store instructions
• 1,250 of these data values found in cache
• The rest are supplied by other levels of memory hierarchy
• What are the hit and miss rates for the cache?
Memory Performance Example 1
• A program has 2,000 load and store instructions
• 1,250 of these data values found in cache
• The rest are supplied by other levels of memory hierarchy
• What are the hit and miss rates for the cache?

Hit Rate = 1250/2000 = 0.625


Miss Rate = 750/2000 = 0.375 = 1 – Hit Rate
Memory Performance Example 2
• Suppose processor has 2 levels of hierarchy: cache and
main memory
• t_cache = 1 cycle, t_MM = 100 cycles
• What is the AMAT of the program from Example 1?
Memory Performance Example 2
• Suppose processor has 2 levels of hierarchy: cache and
main memory
• t_cache = 1 cycle, t_MM = 100 cycles
• What is the AMAT of the program from Example 1?

AMAT = t_cache + MR_cache × t_MM
     = [1 + 0.375 × 100] cycles
     = 38.5 cycles
Types of Misses

• Compulsory: first time data is accessed


• Capacity: cache too small to hold all data of interest
• Conflict: data of interest maps to same location in
cache

Miss penalty: time it takes to retrieve a block from


lower level of hierarchy
Cache terminology
• Capacity (C):
– the number of data bytes a cache stores
• Cache block
– the basic unit of cache storage. May contain multiple
words/bytes.
• Cache line
– Same as cache block. Remember, this is not the same as a row of
cache
• Block size (b):
– bytes of data brought into cache at once
• Number of blocks (B = C/b):
– the number of blocks in the cache
Cache terminology
• Cache set
– A row in the cache; the number of blocks in a set is determined
by the layout of the cache (direct mapped, set-associative, fully
associative)
• Degree of associativity (N):
– number of blocks in a set
• Number of sets (S = B/N):
– each memory address maps to exactly one cache set
• Tag:
– A unique identifier for a group of data. Because different
memory blocks may map to the same cache block, the tag bits
are used to differentiate between them.
• Valid:
– A bit of information that indicates whether the data in the block is
valid (1) or not (0)
How is data found?
• Far fewer cache addresses than memory addresses
• Cache organized into S sets
• Each memory address maps to exactly one set
• Cache is categorized by number of blocks in a set:
– Direct mapped: 1 block per set
– N-way set associative: N blocks per set
– Fully associative: all cache blocks are in a single set

• Let us examine each organization for a cache with:


– Capacity (C = 8 words)
– Block size (b = 1 word)
– So, number of blocks (B = 8)
– Also, let us assume a MIPS-like byte-addressable memory with
32 address bits
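For this example configuration, the address split can be sketched directly: with 4-byte words there are 2 byte-offset bits, with 8 one-block sets there are 3 set bits, leaving 27 tag bits. The `split_address` helper is illustrative, not from the slides.

```python
BYTE_OFFSET_BITS = 2   # 4 bytes per word
SET_BITS = 3           # 8 sets, one block per set (direct mapped)

def split_address(addr):
    """Split a 32-bit byte address into (tag, set index, byte offset)."""
    offset = addr & ((1 << BYTE_OFFSET_BITS) - 1)
    set_index = (addr >> BYTE_OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (BYTE_OFFSET_BITS + SET_BITS)
    return tag, set_index, offset

# Address 0x24 (byte 36 = word 9) maps to set 9 mod 8 = 1, tag 1:
print(split_address(0x24))  # -> (1, 1, 0)
```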
Locating data in the cache
Loading data into the cache
Direct Mapped Cache
• Main memory address divided into fields
– Tag
• compared with the tag stored in the cache at the address indicated by the index
• if tags match, check valid bit
– Set index
• cache address; number of bits determined by cache size
– Offset
• used to find a particular word/byte in the cache line
• Valid bit
– indicates whether data in the slot has been loaded from memory
[Figure: address split into Tag | Set Index | Offset; each cache entry holds V (valid), T (tag), and D (data); the stored tag is compared (=) with the address tag.]
Direct Mapped Cache
[Figure: a 2^30-word main memory mapped onto a 2^3-word cache. Word addresses 0x00...00 through 0x00...1C fill sets 0 (000) through 7 (111); address 0x00...20 wraps back to set 0, 0x00...24 to set 1, and so on up through 0xFF...FC. Every eighth word maps to the same set.]
Direct Mapped Cache Hardware
[Figure: the 32-bit memory address splits into a 27-bit tag, a 3-bit set index, and a byte offset (00). The set index addresses an 8-entry × (1 + 27 + 32)-bit SRAM holding V, Tag, and Data. The stored 27-bit tag is compared (=) with the address tag; Hit = tags equal AND valid, and the 32-bit Data is output on a hit.]
Direct Mapped Cache Performance

# MIPS assembly code
        addi $t0, $0, 5
loop:   beq  $t0, $0, done
        lw   $t1, 0x4($0)
        lw   $t2, 0xC($0)
        lw   $t3, 0x8($0)
        addi $t0, $t0, -1
        j    loop
done:

[Figure: address 0x00...04 splits into tag 00...00, set 001, byte offset 00; the cache's eight sets (V, Tag, Data) start out invalid.]

Miss Rate = ?

Direct Mapped Cache Performance

[Figure: after the loop, sets 1, 2, and 3 hold mem[0x00...04], mem[0x00...08], and mem[0x00...0C] with valid = 1; the other sets remain invalid.]

Miss Rate = 3/15 = 20%
The three misses are compulsory misses on the first iteration; temporal locality makes the remaining accesses hits.
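A minimal simulation of this direct-mapped cache (a hypothetical `run` helper, not from the slides) reproduces the 3/15 miss rate on the loop's data accesses.

```python
def run(addresses, num_sets=8):
    """Direct-mapped cache of 8 one-word blocks; returns (misses, accesses)."""
    cache = {}                      # set index -> stored tag
    misses = 0
    for addr in addresses:
        word = addr >> 2            # drop the 2-bit byte offset
        set_index, tag = word % num_sets, word // num_sets
        if cache.get(set_index) != tag:
            misses += 1
            cache[set_index] = tag  # fill (or replace) the block
    return misses, len(addresses)

# 5 loop iterations, each loading 0x4, 0xC, 0x8:
trace = [0x4, 0xC, 0x8] * 5
print(run(trace))  # -> (3, 15): only the first-iteration compulsory misses
```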
Direct Mapped Cache: Conflict

# MIPS assembly code
        addi $t0, $0, 5
loop:   beq  $t0, $0, done
        lw   $t1, 0x4($0)
        lw   $t2, 0x24($0)
        addi $t0, $t0, -1
        j    loop
done:

[Figure: address 0x00...24 splits into tag 00...01, set 001, byte offset 00; all eight sets start out invalid.]

Miss Rate = ?

Direct Mapped Cache: Conflict

[Figure: mem[0x00...04] and mem[0x00...24] both map to set 1, so each load evicts the other; all other sets remain invalid.]

Miss Rate = 10/10 = 100%
Every access is a conflict miss: the two addresses map to the same set and keep evicting each other.
Fully Associative Cache
• Complete main memory address (tag) stored with each cache block
• All tags stored in the cache are simultaneously compared with the desired address
• Valid bit and offset same as in direct mapping; no set index, since the number of sets = 1
• No conflict misses
• Expensive to build
[Figure: address split into Tag | Offset; every (V, T, D) entry's tag is compared (=) in parallel.]

Fully Associative Cache
• The cache for our example would appear as follows: one set, eight blocks per set.
[Figure: a single row of eight (V, Tag, Data) entries.]
Set Associative Cache
• Compromise between direct mapping and fully associative mapping
• Set index same as in direct mapping
• Each cache set contains the contents and tags of 2 or more memory address locations
• Tags of that set are simultaneously compared, as in fully associative mapping
• A cache with set size N is called N-way set-associative
– 2-way, 4-way, 8-way are common
[Figure: address split into Tag | Set Index | Offset; each set holds two (V, T, D) entries whose tags are compared (=) in parallel.]
N-Way Set Associative Cache
[Figure: 2-way set-associative hardware. The memory address splits into a 28-bit tag, a 2-bit set index, and a byte offset (00). The set index selects one set in each of Way 1 and Way 0 (each entry: V, Tag, Data). Both stored 28-bit tags are compared (=) with the address tag, producing Hit1 and Hit0; Hit = Hit1 OR Hit0, and a multiplexer selects the 32-bit Data from the hitting way.]
N-Way Set Associative Performance

# MIPS assembly code
        addi $t0, $0, 5
loop:   beq  $t0, $0, done
        lw   $t1, 0x4($0)
        lw   $t2, 0x24($0)
        addi $t0, $t0, -1
        j    loop
done:

[Figure: a 2-way set-associative cache (Way 1 and Way 0; each entry V, Tag, Data) with all four sets initially invalid.]

Miss Rate = ?

N-Way Set Associative Performance

Miss Rate = 2/10 = 20%
Associativity reduces conflict misses: the two addresses now share set 1, one in each way.

[Figure: after the loop, set 1 holds mem[0x00...24] (tag 00...10) in Way 1 and mem[0x00...04] (tag 00...00) in Way 0; sets 0, 2, and 3 remain invalid.]
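The 2-way result can likewise be checked with a small simulation (a hypothetical `run_2way` helper, assuming LRU replacement within each set).

```python
def run_2way(addresses, num_sets=4):
    """2-way set-associative cache, 8 one-word blocks, LRU within a set."""
    sets = {s: [] for s in range(num_sets)}  # set -> tags in LRU order
    misses = 0
    for addr in addresses:
        word = addr >> 2                 # drop the 2-bit byte offset
        s, tag = word % num_sets, word // num_sets
        ways = sets[s]
        if tag in ways:
            ways.remove(tag)             # hit: will re-append as most recent
        else:
            misses += 1
            if len(ways) == 2:           # set full: evict least recently used
                ways.pop(0)
        ways.append(tag)                 # tag is now the most recently used
    return misses, len(addresses)

trace = [0x4, 0x24] * 5                  # 5 iterations, 2 loads each
print(run_2way(trace))  # -> (2, 10): both addresses coexist in set 1
```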
Spatial Locality?
• Increase block size:
– block size b = 4 words
– C = 8 words
– direct mapped (1 block per set)
– number of blocks B = C/b = 8/4 = 2
[Figure: the memory address splits into a 27-bit tag, a 1-bit set index, a 2-bit block offset, and a byte offset (00). Each of the two sets (Set 1, Set 0) holds V, Tag, and four 32-bit words; the block offset (11, 10, 01, 00) selects one word through a multiplexer, and the tag comparison (=) with the valid bit produces Hit and Data.]
Cache with Larger Block Size
[Figure: direct-mapped cache hardware with two sets of 4-word blocks; the 27-bit tag is compared against the stored tag of the selected set, and the 2-bit block offset (11, 10, 01, 00) selects one of the four 32-bit words via a multiplexer to produce Hit and Data.]
Direct Mapped Cache Performance

        addi $t0, $0, 5
loop:   beq  $t0, $0, done
        lw   $t1, 0x4($0)
        lw   $t2, 0xC($0)
        lw   $t3, 0x8($0)
        addi $t0, $t0, -1
        j    loop
done:

[Figure: the two-set, 4-word-block direct-mapped cache with both sets initially invalid.]

Direct Mapped Cache Performance

Miss Rate = 1/15 = 6.67%
Larger blocks reduce compulsory misses through spatial locality: the first miss brings the whole block mem[0x00...00] through mem[0x00...0C] into Set 0, and every later access hits.

[Figure: address 0x00...0C splits into tag 00...00, set 0, block offset 11, byte offset 00. After the loop, Set 0 holds mem[0x00...0C], mem[0x00...08], mem[0x00...04], and mem[0x00...00] with valid = 1; Set 1 remains invalid.]
Cache-replacement policy
• Cache is too small to hold all data of interest at one time
• Technique for choosing which block to replace
– when fully associative cache is full
– when set-associative cache’s set is full
• Direct mapped cache has no choice
• Random
– replace block chosen at random
• LRU: least-recently used
– replace block not accessed for longest time
• FIFO: first-in-first-out
– push block onto queue when accessed
– choose block to replace by popping queue
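The LRU policy can be sketched for a single cache set; this is an illustrative sketch (the `LRUSet` class is not from the slides), using an ordered mapping to track recency.

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement (front = least recently used)."""
    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.blocks = OrderedDict()              # tag -> data

    def access(self, tag, fetch):
        """Return the block for `tag`, evicting the LRU block if full."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)         # hit: mark most recent
        else:
            if len(self.blocks) == self.num_ways:
                self.blocks.popitem(last=False)  # miss on full set: evict LRU
            self.blocks[tag] = fetch(tag)        # fill from the next level
        return self.blocks[tag]

ways = LRUSet(2)
for tag in [0, 2, 0, 5]:            # the final miss evicts tag 2, not tag 0
    ways.access(tag, fetch=lambda t: f"block {t}")
print(list(ways.blocks))  # -> [0, 5]
```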
LRU Replacement

# MIPS assembly
lw $t0, 0x04($0)
lw $t1, 0x24($0)
lw $t2, 0x54($0)

[Figure: a 2-way set-associative cache with four sets (3..0); each way holds V, Tag, and Data, and each set has a U bit recording which way was least recently used. Snapshots (a) and (b) start out blank.]

LRU Replacement

[Figure: (a) after the first two loads, set 1 holds mem[0x00...24] (tag 00...010) in Way 1 and mem[0x00...04] (tag 00...000) in Way 0, with U = 0. (b) the load of 0x54 also maps to set 1 and must evict a block; Way 0 held the least recently used block, so mem[0x00...54] (tag 00...101) replaces it and U becomes 1.]
Cache write techniques
• When written to, the data cache must also update main memory
• Write-through
– write to main memory whenever cache is written to
– easiest to implement
– processor must wait for slower main memory write
– potential for unnecessary writes
• Write-back
– Processor performs write operation directly into cache and
marks the blocks as “dirty”; the main memory is not
immediately or directly updated
– main memory only written when “dirty” block replaced
– extra dirty bit for each block set when cache block written to
– reduces number of slow main memory writes
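The write-back idea can be sketched for a single block; the `WriteBackBlock` class is an illustrative sketch, not from the slides. Repeated writes touch only the cache, and main memory is written once, when the dirty block is evicted.

```python
class WriteBackBlock:
    """One cache block with a dirty bit, as used by a write-back cache."""
    def __init__(self):
        self.data, self.dirty = None, False

    def write(self, value):
        self.data, self.dirty = value, True    # write to cache only; mark dirty

    def evict(self, main_memory, address):
        if self.dirty:                         # main memory written only now
            main_memory[address] = self.data
        self.data, self.dirty = None, False

mem = {}
blk = WriteBackBlock()
blk.write(10)
blk.write(20)           # write-through would have cost two memory writes here
blk.evict(mem, 0x100)   # write-back performs a single write, on eviction
print(mem)  # -> {256: 20}
```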
Cache Organization Recap
• Capacity: C
• Block size: b
• Number of blocks in cache: B = C/b
• Number of blocks in a set: N
• Number of Sets: S = B/N
Organization            | Number of Ways (N) | Number of Sets (S = B/N)
Direct Mapped           | 1                  | B
N-Way Set Associative   | 1 < N < B          | B/N
Fully Associative       | B                  | 1
Cache impact on system performance

• Most important parameters in terms of performance:


– Total size of cache
• total number of data bytes cache can hold
• tag, valid, and other housekeeping bits are not included in the total, but must be considered when
determining the size of the SRAM
– Degree of associativity
– Data block size
• Larger caches achieve lower miss rates but higher access cost
– e.g.,
• 2 Kbyte cache: miss rate = 15%, hit cost = 2 cycles, miss cost = 20 cycles
– avg. cost of memory access = (0.85 * 2) + (0.15 * 20) = 4.7 cycles
• 4 Kbyte cache: miss rate = 6.5%, hit cost = 3 cycles, miss cost will not change
– avg. cost of memory access = (0.935 * 3) + (0.065 * 20) = 4.105 cycles (improvement)
• 8 Kbyte cache: miss rate = 5.565%, hit cost = 4 cycles, miss cost will not change
– avg. cost of memory access = (0.94435 * 4) + (0.05565 * 20) = 4.8904 cycles (worse)
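The three averages above can be checked with one line of arithmetic each; the helper name is illustrative.

```python
def avg_access_cost(miss_rate, hit_cost, miss_cost=20):
    """Average cost of a memory access, in cycles."""
    return (1 - miss_rate) * hit_cost + miss_rate * miss_cost

print(round(avg_access_cost(0.15, 2), 4))      # 2 KB cache -> 4.7
print(round(avg_access_cost(0.065, 3), 4))     # 4 KB cache -> 4.105
print(round(avg_access_cost(0.05565, 4), 4))   # 8 KB cache -> 4.8904
```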
Cache performance trade-offs
• Larger caches have lower miss rates, longer access times
• Improving cache hit rate without increasing size
– Increase block size
• Bigger blocks reduce compulsory misses
• Bigger blocks increase conflict misses
– Change set-associativity
• Greater associativity reduces conflict misses
[Figure: cache miss rate (0 to 16%) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity; miss rate falls as cache size and associativity increase.]
Multilevel Caches

• Larger caches have lower miss rates, longer access


times
• Expand the memory hierarchy to multiple levels of
caches
• Level 1: small and fast (e.g. 16 KB, 1 cycle)
• Level 2: larger and slower (e.g. 256 KB, 2-6 cycles)
• Even more levels are possible
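The AMAT formula extends naturally to two cache levels: L1 misses go to L2, and L2 misses go to main memory. The numbers below are illustrative assumptions, not from the slides.

```python
def amat_two_level(t_l1, mr_l1, t_l2, mr_l2, t_mm):
    """AMAT with an L1 cache, an L2 cache, and main memory."""
    return t_l1 + mr_l1 * (t_l2 + mr_l2 * t_mm)

# e.g., a 1-cycle L1 with 10% miss rate, a 4-cycle L2 where 20% of the
# L1 misses also miss, and a 100-cycle main memory:
print(round(amat_two_level(1, 0.10, 4, 0.20, 100), 2))  # -> 3.4
```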
Intel Pentium III Die
End of Course!
