Week 10

The document discusses memory organization and technology, focusing on DRAM and SRAM, their characteristics, and the trade-offs between speed, density, and cost. It explains the memory hierarchy in modern systems, emphasizing the importance of locality and caching to optimize memory access times. Various cache designs and their operations are also covered, highlighting the impact of associativity on cache performance.

Memory Organization and Memory Technology
Readings
• Digital Design and Computer Architecture – David Harris & Sarah Harris
  • Chapter 8
Memory Technology: DRAM and SRAM
Memory Technology: DRAM

• Dynamic random access memory
• Capacitor charge state indicates the stored value
  • Whether the capacitor is charged or discharged indicates a stored 1 or 0
• DRAM cell: 1 capacitor + 1 access transistor (gated by the row enable line, connected to the bitline)
• Capacitor leaks through the RC path
  • DRAM cell loses charge over time
  • DRAM cell needs to be refreshed periodically
Memory Technology: SRAM
• Static random access memory
• Two cross-coupled inverters store a single bit
  • Feedback path enables the stored value to persist in the “cell”
• 4 transistors for storage
• 2 transistors for access (gated by the row select line, connected to bitline and _bitline)
DRAM vs. SRAM
• DRAM
• Slower access (capacitor)
• Higher density (1T 1C cell)
• Lower cost
• Requires refresh (power, performance, circuitry)
• Manufacturing requires putting capacitor and logic together

• SRAM
• Faster access (no capacitor)
• Lower density (6T cell)
• Higher cost
• No need for refresh
• Manufacturing compatible with logic process (no capacitor)
Memory Hierarchy and Caches
The Memory Hierarchy
Memory in a Modern System

[Die photo: a modern multicore chip with CORE 0-3, a private L2 cache per core (L2 CACHE 0-3), a SHARED L3 CACHE, and a DRAM memory controller / DRAM interface connecting to off-chip DRAM banks]
Ideal Memory
• Zero access time (latency)
• Infinite capacity
• Zero cost
• Infinite bandwidth (to support multiple accesses in parallel)
The Problem
• Ideal memory’s requirements oppose each other

• Bigger is slower
  • Bigger → takes longer to determine the location

• Faster is more expensive
  • Memory technology: SRAM vs. DRAM vs. Disk vs. Tape

• Higher bandwidth is more expensive
  • Need more banks, higher frequency, or faster technology
The Problem
• Bigger is slower
  • SRAM, 512 Bytes, sub-nanosec
  • SRAM, KByte~MByte, ~nanosec
  • DRAM, Gigabyte, ~50 nanosec
  • Hard Disk, Terabyte, ~10 millisec

• Faster is more expensive (dollars and chip area)
  • SRAM, < $10 per Megabyte
  • DRAM, < $1 per Megabyte
  • Hard Disk, < $1 per Gigabyte
  • These sample values (circa 2011) scale with time
Why Memory Hierarchy?
• We want both fast and large

• But we cannot achieve both with a single level of memory

• Idea: Have multiple levels of storage (progressively bigger and slower as the levels are farther from the processor) and ensure most of the data the processor needs is kept in the fast(er) level(s)
The Memory Hierarchy

[Memory hierarchy pyramid: the level closest to the processor is fast, small, and faster per byte (move what you use there); lower levels are big but slow and cheaper per byte (back up everything there). With good locality of reference, memory appears as fast as the fast level and as large as the large level.]
Memory Hierarchy
• Fundamental tradeoff
  • Fast memory: small
  • Large memory: slow
• Idea: Memory hierarchy
  [Diagram: CPU + register file (RF) ↔ Cache ↔ Main Memory (DRAM) ↔ Hard Disk]
• Latency, cost, size, and bandwidth vary across the levels
Locality
• One’s recent past is a very good predictor of one’s near future.

• Temporal Locality: If you just did something, it is very likely that you will do the same thing again soon
  • since you are here today, there is a good chance you will be here again and again regularly

• Spatial Locality: If you did something, it is very likely you will do something similar/related (in space)
  • every time I find you in this room, you are probably sitting close to the same people
Memory Locality
• A “typical” program has a lot of locality in memory references
  • typical programs are composed of “loops”

• Temporal: A program tends to reference the same memory location many times, all within a small window of time

• Spatial: A program tends to reference a cluster of memory locations at a time
  • most notable examples:
    1. instruction memory references
    2. array/data structure references
  (a short C sketch after this slide illustrates both)
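As an illustration that is not part of the original slides, the following C sketch shows both kinds of locality in one loop: the loop's instructions and scalar variables are reused every iteration (temporal), and the array elements sit at consecutive addresses (spatial).

#include <stdio.h>

int main(void) {
    int a[1024];
    long sum = 0;

    for (int i = 0; i < 1024; i++)
        a[i] = i;

    /* Temporal locality: the loop's instructions, i, and sum are
       re-referenced on every iteration within a short time window.
       Spatial locality: a[0], a[1], a[2], ... occupy consecutive
       addresses, so fetching one cache block brings in its neighbors. */
    for (int i = 0; i < 1024; i++)
        sum += a[i];

    printf("sum = %ld\n", sum);
    return 0;
}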
Caching Basics: Exploit Temporal Locality
• Idea: Store recently accessed data in automatically managed fast memory (called cache)
  • Anticipation: the data will be accessed again soon

• Temporal locality principle
  • Recently accessed data will be accessed again in the near future
  • This is what Maurice Wilkes had in mind:
    • Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. on Electronic Computers, 1965.
    • “The use is discussed of a fast core memory of, say 32000 words as a slave to a slower core memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory.”
Caching Basics: Exploit Spatial Locality
• Idea: Store addresses adjacent to the recently accessed one in automatically managed fast memory
  • Logically divide memory into equal-size blocks
  • Fetch to cache the accessed block in its entirety
  • Anticipation: nearby data will be accessed soon

• Spatial locality principle
  • Nearby data in memory will be accessed in the near future
  • E.g., sequential instruction access, array traversal
  • This is what the IBM 360/85 implemented
    • 16 Kbyte cache with 64 byte blocks
    • Liptay, “Structural aspects of the System/360 Model 85 II: the cache,” IBM Systems Journal, 1968.
The Bookshelf Analogy
• Book in your hand
• Desk
• Bookshelf
• Boxes at home
• Boxes in storage

• Recently-used books tend to stay on the desk
  • Comp Arch books, books for classes you are currently taking
  • Until the desk gets full
• Adjacent books in the shelf are needed around the same time
  • If I have organized/categorized my books well in the shelf
Caching in a Pipelined Design
• The cache needs to be tightly integrated into the pipeline
• Ideally, access in 1 cycle so that load-dependent operations do not stall
• High frequency pipeline → Cannot make the cache large
• But, we want a large cache AND a pipelined design
• Idea: Cache hierarchy

[Diagram: CPU + register file (RF) ↔ Level 1 Cache ↔ Level 2 Cache ↔ Main Memory (DRAM)]
A Note on Manual vs. Automatic Management

• Manual: Programmer manages data movement across levels
  -- too painful for programmers on substantial programs
  • still done in some embedded processors (on-chip scratch pad SRAM in lieu of a cache) and GPUs (called “shared memory”)

• Automatic: Hardware manages data movement across levels, transparently to the programmer
  ++ programmer’s life is easier
  • the average programmer doesn’t need to know about it
  • You don’t need to know how big the cache is and how it works to write a “correct” program! (What if you want a “fast” program?)
A Modern Memory Hierarchy
• Register file: 32 words, sub-nsec (manual/compiler register spilling)
• L1 cache: ~32 KB, ~nsec
• L2 cache: 512 KB ~ 1 MB, many nsec      (L1 through L3: automatic HW cache management)
• L3 cache: .....
• Main memory (DRAM): GB, ~100 nsec
• Swap disk: 100 GB, ~10 msec             (automatic demand paging)
(These levels together implement the flat memory abstraction seen by the program.)
Hierarchical Latency Analysis
• A given memory hierarchy level i has a technology-intrinsic access time of ti. The perceived access time Ti is longer than ti
• Except for the outermost level, when looking for a given address there is
  • a chance (hit-rate hi) you “hit” and the access time is ti
  • a chance (miss-rate mi) you “miss” and the access time is ti + Ti+1
  • hi + mi = 1
• Thus
  Ti = hi·ti + mi·(ti + Ti+1)
  Ti = ti + mi·Ti+1

• hi and mi are defined to be the hit-rate and miss-rate of just the references that missed at Li-1
Hierarchy Design Considerations
• Recursive latency equation
  Ti = ti + mi·Ti+1
• The goal: achieve desired T1 within allowed cost
• Ti ≈ ti is desirable

• Keep mi low
  • increasing capacity Ci lowers mi, but beware of increasing ti
  • lower mi by smarter cache management (replacement: anticipate what you don’t need; prefetching: anticipate what you will need)

• Keep Ti+1 low
  • faster lower hierarchies, but beware of increasing cost
  • introduce intermediate hierarchies as a compromise
Intel Pentium 4 Example
• P4, 3.6 GHz
• L1 cache: C1 = 16K, t1 = 4 cyc int / 9 cyc fp
• L2 cache: C2 = 1024 KB, t2 = 18 cyc int / 18 cyc fp
• Main memory: t3 = ~50 ns or 180 cyc

• if m1 = 0.1,  m2 = 0.1  → T1 = 7.6,  T2 = 36
• if m1 = 0.01, m2 = 0.01 → T1 = 4.2,  T2 = 19.8
• if m1 = 0.05, m2 = 0.01 → T1 = 5.00, T2 = 19.8
• if m1 = 0.01, m2 = 0.50 → T1 = 5.08, T2 = 108
  (the C sketch below evaluates these cases with the recursive latency equation)

• Notice
  • best-case latency is not 1
  • worst-case access latencies run into 200+ cycles
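A minimal C sketch (my own, not from the slides) that evaluates the recursive latency equation Ti = ti + mi·Ti+1 bottom-up with the Pentium 4 integer-access numbers above; it reproduces the table's values up to rounding (e.g., T1 = 7.6 and T2 = 36 for m1 = m2 = 0.1).

#include <stdio.h>

/* Evaluate T_i = t_i + m_i * T_{i+1} from the last level upward.
   t[]: intrinsic access times (cycles), m[]: miss rates per level,
   n: number of levels; the outermost level is assumed to always hit. */
static double perceived_latency(const double t[], const double m[],
                                int n, double T_out[]) {
    double T_next = t[n - 1];            /* outermost level: T = t */
    T_out[n - 1] = T_next;
    for (int i = n - 2; i >= 0; i--) {
        T_out[i] = t[i] + m[i] * T_next;
        T_next = T_out[i];
    }
    return T_out[0];
}

int main(void) {
    /* t1 = 4 (L1, int), t2 = 18 (L2), t3 = 180 cycles (main memory) */
    double t[3] = { 4.0, 18.0, 180.0 };
    double cases[4][2] = { {0.1, 0.1}, {0.01, 0.01}, {0.05, 0.01}, {0.01, 0.5} };
    for (int c = 0; c < 4; c++) {
        double m[3] = { cases[c][0], cases[c][1], 0.0 };
        double T[3];
        perceived_latency(t, m, 3, T);
        printf("m1=%.2f m2=%.2f -> T1=%.2f T2=%.1f\n", m[0], m[1], T[0], T[1]);
    }
    return 0;
}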
Cache Basics and Operation
Readings
• Digital Design and Computer Architecture – David Harris & Sarah Harris
  • Chapter 8
Caching Basics
◼Block (line): Unit of storage in the cache
  ❑Memory is logically divided into cache blocks that map to locations in the cache

◼On a reference:
  ❑HIT: If in cache, use cached data instead of accessing memory
  ❑MISS: If not in cache, bring block into cache
    ◼ Maybe have to kick something else out to do it

◼Some important cache design decisions
  ❑Placement: where and how to place/find a block in cache?
  ❑Replacement: what data to remove to make room in cache?
  ❑Granularity of management: large or small blocks? Subblocks?
  ❑Write policy: what do we do about writes?
  ❑Instructions/data: do we treat them separately?
Cache Abstraction and Metrics

[Diagram: Address → Tag Store (is the address in the cache? + bookkeeping) and Data Store (stores memory blocks) → Hit/miss? and Data]

• Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)

• Average memory access time (AMAT)
  = ( hit-rate * hit-latency ) + ( miss-rate * miss-latency )
  (see the short C sketch below)
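A hedged illustration, not from the slides: the AMAT formula above as a small C function. The hit and miss latencies used in main() are made-up example values, not figures from the lecture.

#include <stdio.h>

/* AMAT = hit_rate * hit_latency + miss_rate * miss_latency,
   with miss_rate = 1 - hit_rate.  Latencies in cycles. */
static double amat(double hit_rate, double hit_latency, double miss_latency) {
    return hit_rate * hit_latency + (1.0 - hit_rate) * miss_latency;
}

int main(void) {
    /* Illustrative numbers only: a 1-cycle hit, a 100-cycle miss. */
    printf("hit rate 0.90 -> AMAT = %.1f cycles\n", amat(0.90, 1.0, 100.0));
    printf("hit rate 0.99 -> AMAT = %.2f cycles\n", amat(0.99, 1.0, 100.0));
    return 0;
}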
A Basic Hardware Cache Design

• We will start with a basic hardware cache design

• Then, we will examine a multitude of ideas to


make it better
Blocks and Addressing the Cache
◼Memory is logically divided into fixed-size blocks

◼Each block maps to a location in the cache, determined by the index bits in the address
  ❑used to index into the tag and data stores
  [8-bit address = tag (2 bits) | index (3 bits) | byte in block (3 bits)]
  (a C sketch below extracts these fields)

◼Cache access:
  1) index into the tag and data stores with the index bits of the address
  2) check the valid bit in the tag store
  3) compare the tag bits of the address with the stored tag in the tag store

◼If a block is in the cache (cache hit), the stored tag should be valid and match the tag of the block
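A small C sketch (an illustration, not from the slides) that splits the 8-bit address of this example into its tag (2 bits), index (3 bits), and byte-in-block (3 bits) fields; the example address 0xB6 is arbitrary.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* 8-bit address = tag (2 bits) | index (3 bits) | byte-in-block (3 bits),
       matching the 64-byte, 8-block, 8-byte-block cache of this example. */
    uint8_t addr = 0xB6;                        /* example address 1011 0110 */
    unsigned byte_in_block = addr & 0x7;        /* low 3 bits   */
    unsigned index         = (addr >> 3) & 0x7; /* next 3 bits  */
    unsigned tag           = (addr >> 6) & 0x3; /* top 2 bits   */
    printf("addr=0x%02X tag=%u index=%u byte=%u\n",
           addr, tag, index, byte_in_block);
    return 0;
}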
Direct-Mapped Cache: Placement and Access
• Assume byte-addressable memory: 256 bytes, 8-byte blocks → 32 blocks (Block 00000 through Block 11111 in main memory)
• Assume cache: 64 bytes, 8 blocks
  • Direct-mapped: A block can go to only one location
• Addresses with the same index contend for the same location
  • Cause conflict misses
• Address = tag (2 bits) | index (3 bits) | byte in block (3 bits)

[Diagram: the index selects one entry of the tag store (valid bit V + tag) and one block of the data store; the stored tag is compared (=?) with the address tag → Hit?, and a MUX selects the byte in block → Data]
Direct-Mapped Caches
• Direct-mapped cache: Two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time
  • One index → one entry

• Can lead to a 0% hit rate if more than one block accessed in an interleaved manner maps to the same index
  • Assume addresses A and B have the same index bits but different tag bits
  • A, B, A, B, A, B, A, B, … → conflict in the cache index
  • All accesses are conflict misses
Set Associativity
• Addresses 0 and 8 always conflict in a direct mapped cache
• Instead of having one column of 8 blocks, have 2 columns of 4 blocks (2 ways per SET)

[Diagram: the index selects a set; the tag store holds two entries (V + tag) per set, each compared (=?) with the address tag; logic combines the comparisons into Hit?, and MUXes select the hit way and the byte in block → Data]

• Address = tag (3 bits) | index (2 bits) | byte in block (3 bits)
• Key idea: Associative memory within the set
  + Accommodates conflicts better (fewer conflict misses)
  -- More complex, slower access, larger tag store
Higher Associativity
• 4-way set associative

[Diagram: the tag store holds four entries per set, each compared (=?) with the address tag; logic produces Hit?, and MUXes select the hit way in the data store and the byte in block]

+ Likelihood of conflict misses even lower
-- More tag comparators and wider data mux; larger tags
Full Associativity
• Fully associative cache
  • A block can be placed in any cache location

[Diagram: the tag store holds one entry (V + tag) per block, all eight compared (=?) with the address tag in parallel; logic produces Hit?, and MUXes select the hit block in the data store and the byte in block]
Associativity (and Tradeoffs)
• Degree of associativity: How many blocks can map to the same index (or set)?

• Higher associativity
  ++ Higher hit rate
  -- Slower cache access time (hit latency and data access latency)
  -- More expensive hardware (more comparators)

• Diminishing returns from higher associativity
  [Plot: hit rate vs. associativity; hit rate rises with associativity but flattens out]
Cache Examples
Cache Terminology
• Capacity (C):
  • the number of data bytes the cache stores
• Block size (b):
  • bytes of data brought into the cache at once
• Number of blocks (B = C/b):
  • number of blocks in the cache
• Degree of associativity (N):
  • number of blocks in a set
• Number of sets (S = B/N):
  • each memory address maps to exactly one cache set
How is data found?
• Cache organized into S sets

• Each memory address maps to exactly one set

• Caches categorized by number of blocks in a set:
  • Direct mapped: 1 block per set
  • N-way set associative: N blocks per set
  • Fully associative: all cache blocks are in a single set

• Examine each organization for a cache with:
  • Capacity (C = 8 words)
  • Block size (b = 1 word)
  • So, number of blocks (B = 8)
  (the C sketch below computes B and S for each organization)
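As a quick check (not in the original slides), this C sketch computes B = C/b and S = B/N for the three organizations examined next, using the example parameters C = 8 words and b = 1 word.

#include <stdio.h>

int main(void) {
    int C = 8, b = 1;              /* capacity and block size, in words */
    int B = C / b;                 /* number of blocks = 8 */
    int ways[3] = { 1, 2, B };     /* direct mapped, 2-way, fully associative */
    const char *name[3] = { "direct mapped", "2-way set associative",
                            "fully associative" };
    for (int i = 0; i < 3; i++) {
        int N = ways[i];
        int S = B / N;             /* number of sets */
        printf("%-24s N=%d blocks/set, S=%d sets\n", name[i], N, S);
    }
    return 0;
}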
Direct Mapped Cache
[Diagram: a 2^30-word main memory next to a 2^3-word (8-entry) cache. Each word address maps to one cache set using its set bits: mem[0x00...00] → Set 0 (000), mem[0x00...04] → Set 1 (001), mem[0x00...08] → Set 2 (010), mem[0x00...0C] → Set 3 (011), ..., mem[0x00...1C] → Set 7 (111); the mapping then wraps, so mem[0x00...20] → Set 0, mem[0x00...24] → Set 1, and so on up to mem[0xFF...FC].]
Direct Mapped Cache Hardware
[Diagram: the 32-bit memory address is split into Tag (27 bits), Set (3 bits), and Byte Offset (2 bits, 00 for word accesses). The set bits index an 8-entry x (1 + 27 + 32)-bit SRAM holding a valid bit V, a 27-bit Tag, and a 32-bit Data word; the stored tag is compared with the address tag to produce Hit, and the data word is read out → Data. A C sketch below shows the set/tag computation for the addresses used next.]
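A small C sketch, added for illustration, that computes the set index (address bits [4:2]) and tag (the remaining upper bits) for the word addresses used in the following examples; it confirms that 0x4, 0x8, 0xC map to sets 1, 2, 3 and that 0x24 collides with 0x4 in set 1.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Set index = bits [4:2] of the byte address (3 set bits above the
       2-bit byte offset); tag = the remaining upper 27 bits. */
    uint32_t addrs[] = { 0x4, 0x8, 0xC, 0x24 };
    for (int i = 0; i < 4; i++) {
        uint32_t a = addrs[i];
        uint32_t set = (a >> 2) & 0x7;
        uint32_t tag = a >> 5;
        printf("addr 0x%02X -> set %u, tag 0x%X\n", a, set, tag);
    }
    return 0;
}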
Direct Mapped Cache Performance
[Cache state after the loop: the address 0x00...04 decodes to Tag 00...00, Set 001, Byte Offset 00; Sets 1, 2, 3 hold mem[0x00...04], mem[0x00...08], mem[0x00...0C] with Tag 00...00 and V = 1; all other sets are invalid.]

# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0xC($0)
      lw   $t3, 0x8($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = ?
Direct Mapped Cache Performance
[Same cache state as above: Sets 1, 2, 3 hold mem[0x00...04], mem[0x00...08], mem[0x00...0C] with Tag 00...00.]

# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0xC($0)
      lw   $t3, 0x8($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = 3/15 = 20%
Temporal locality: only the first access to each word misses (compulsory misses)
Direct Mapped Cache: Conflict
[Cache state after the loop: the address 0x00...24 decodes to Tag 00...01, Set 001, Byte Offset 00, so it maps to the same set as 0x00...04; Set 1 holds whichever of mem[0x00...04] / mem[0x00...24] was loaded last; all other sets are invalid.]

# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0x24($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = ?
Direct Mapped Cache: Conflict
[Same cache state as above: Set 1 alternately holds mem[0x00...04] and mem[0x00...24], each load evicting the other.]

# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0x24($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = 10/10 = 100%
Conflict misses: 0x4 and 0x24 map to the same set and keep evicting each other
(a C sketch below replays both loops through a small direct-mapped cache simulator)
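A minimal direct-mapped cache simulator in C, a sketch under this example's assumptions (8 one-word blocks, word-aligned loads) and not part of the original slides. Replaying the two loops above reproduces the quoted miss rates: 3/15 = 20% for the {0x4, 0xC, 0x8} loop and 10/10 = 100% for the {0x4, 0x24} loop.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define SETS 8   /* 8 one-word blocks, direct mapped */

static double miss_rate(const uint32_t *trace, int n) {
    int valid[SETS] = {0};
    uint32_t tag[SETS];
    int misses = 0;
    for (int i = 0; i < n; i++) {
        uint32_t set = (trace[i] >> 2) & (SETS - 1);
        uint32_t t   = trace[i] >> 5;
        if (!valid[set] || tag[set] != t) {   /* miss: fill the block */
            misses++;
            valid[set] = 1;
            tag[set] = t;
        }
    }
    return (double)misses / n;
}

int main(void) {
    uint32_t loop1[15], loop2[10];
    uint32_t body1[3] = { 0x4, 0xC, 0x8 };    /* 5 iterations of 3 loads */
    uint32_t body2[2] = { 0x4, 0x24 };        /* 5 iterations of 2 loads */
    for (int it = 0; it < 5; it++) {
        memcpy(&loop1[3 * it], body1, sizeof body1);
        memcpy(&loop2[2 * it], body2, sizeof body2);
    }
    printf("loop 1 miss rate = %.0f%%\n", 100.0 * miss_rate(loop1, 15)); /* 20%  */
    printf("loop 2 miss rate = %.0f%%\n", 100.0 * miss_rate(loop2, 10)); /* 100% */
    return 0;
}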
N-Way Set Associative Cache
[Diagram: the 32-bit memory address is split into Tag (28 bits), Set (2 bits), and Byte Offset (2 bits, 00). The set bits select one of 4 sets; each set has Way 1 and Way 0, each holding V, a 28-bit Tag, and 32-bit Data. Two comparators (=) check the address tag against both ways, producing Hit1 and Hit0; Hit is their OR, and a MUX driven by Hit1 selects the hit way's data → Data]
N-way Set Associative Performance
# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0x24($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = ?

[Cache state after the loop: Set 1 holds mem[0x00...24] (Tag 00...10) in Way 1 and mem[0x00...04] (Tag 00...00) in Way 0; Sets 0, 2, 3 are invalid.]
N-way Set Associative Performance
# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0x24($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = 2/10 = 20%
Associativity reduces conflict misses

[Cache state after the loop: Set 1 holds mem[0x00...24] (Tag 00...10) in Way 1 and mem[0x00...04] (Tag 00...00) in Way 0; Sets 0, 2, 3 are invalid.]
Fully Associative Cache
• No conflict misses

• Expensive to build

[Diagram: a single set of eight ways, each holding V, Tag, and Data]
Spatial Locality?
• Increase block size:
  • Block size, b = 4 words
  • C = 8 words
  • Direct mapped (1 block per set)
  • Number of blocks, B = C/b = 8/4 = 2

[Diagram: the 32-bit memory address is split into Tag (27 bits), Set (1 bit), Block Offset (2 bits), and Byte Offset (2 bits, 00). The set bit selects one of 2 sets (Set 1, Set 0); each set holds V, a 27-bit Tag, and four 32-bit data words. The tag comparison (=) produces Hit, and the block offset (00/01/10/11) drives a MUX that selects one of the four words → Data]
Direct Mapped Cache Performance
# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0xC($0)
      lw   $t3, 0x8($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = ?

[Cache state after the loop: the address 0x00...0C (Tag 00...00, Set 0, Block Offset 11) hits in Set 0, which holds the whole block mem[0x00...00] through mem[0x00...0C]; Set 1 is invalid.]
Direct Mapped Cache Performance
# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0xC($0)
      lw   $t3, 0x8($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = 1/15 = 6.67%
Larger blocks reduce compulsory misses through spatial locality

[Cache state after the loop: Set 0 holds the block mem[0x00...00] through mem[0x00...0C] with Tag 00...00; Set 1 is invalid.]
Types of Misses
• Compulsory: first time data is accessed

• Capacity: cache too small to hold all data of interest

• Conflict: data of interest maps to the same location in the cache

• Miss penalty: time it takes to retrieve a block from a lower level of the hierarchy
Capacity Misses
• Cache is too small to hold all data of interest at one time
  • If the cache is full and the program tries to access data X that is not in the cache, the cache must evict data Y to make room for X
  • A capacity miss occurs if the program then tries to access Y again
• X will be placed in a particular set based on its address
  • In a direct mapped cache, there is only one place to put X
  • In an associative cache, there are multiple ways within the set where X could go

• How to choose Y to minimize the chance of needing it again?
  • Least recently used (LRU) replacement: the least recently used block in a set is evicted when the cache is full
  (a sketch of an N-way cache with LRU replacement follows below)
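To tie replacement together with the earlier associativity and block-size examples, here is a sketch (my own, not from the slides) of a parameterizable set-associative cache with LRU replacement. With the example parameters it reproduces the 2-way result (2/10 = 20%) and the 4-word-block result (1/15 ≈ 6.7%).

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define MAX_WAYS 8
#define MAX_SETS 8

/* N-way set-associative cache with LRU replacement.
   capacity_words / (block_words * ways) sets; 1 word = 4 bytes. */
typedef struct {
    int sets, ways, block_words;
    int      valid[MAX_SETS][MAX_WAYS];
    uint32_t tag[MAX_SETS][MAX_WAYS];
    long     last_use[MAX_SETS][MAX_WAYS];
    long     clock;
} Cache;

static void cache_init(Cache *c, int capacity_words, int block_words, int ways) {
    memset(c, 0, sizeof *c);
    c->sets = capacity_words / (block_words * ways);
    c->ways = ways;
    c->block_words = block_words;
}

static int cache_access(Cache *c, uint32_t byte_addr) {   /* 1 = hit, 0 = miss */
    uint32_t block = byte_addr / (4u * c->block_words);
    uint32_t set   = block % c->sets;
    uint32_t tag   = block / c->sets;
    int victim = -1;
    c->clock++;
    for (int w = 0; w < c->ways; w++)
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->last_use[set][w] = c->clock;                /* hit: refresh LRU */
            return 1;
        }
    for (int w = 0; w < c->ways; w++)                      /* prefer an empty way */
        if (!c->valid[set][w]) { victim = w; break; }
    if (victim < 0) {                                      /* else evict LRU way */
        victim = 0;
        for (int w = 1; w < c->ways; w++)
            if (c->last_use[set][w] < c->last_use[set][victim])
                victim = w;
    }
    c->valid[set][victim] = 1;                             /* miss: fill victim */
    c->tag[set][victim] = tag;
    c->last_use[set][victim] = c->clock;
    return 0;
}

static double run(int capacity_words, int block_words, int ways,
                  const uint32_t *body, int body_len, int iters) {
    Cache c;
    int misses = 0, total = body_len * iters;
    cache_init(&c, capacity_words, block_words, ways);
    for (int it = 0; it < iters; it++)
        for (int i = 0; i < body_len; i++)
            misses += !cache_access(&c, body[i]);
    return (double)misses / total;
}

int main(void) {
    uint32_t conflict[2] = { 0x4, 0x24 };
    uint32_t stream[3]   = { 0x4, 0xC, 0x8 };
    /* 2-way, 8 words, 1-word blocks: 0x4 and 0x24 now coexist -> 2/10 = 20% */
    printf("2-way:        %.0f%%\n", 100.0 * run(8, 1, 2, conflict, 2, 5));
    /* direct mapped, 8 words, 4-word blocks: one compulsory miss -> 1/15 = 6.7% */
    printf("4-word block: %.1f%%\n", 100.0 * run(8, 4, 1, stream, 3, 5));
    return 0;
}

Setting ways = 1 and block_words = 1 recovers the simpler direct-mapped simulator shown earlier.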
Cache block size and cache associativity
