
http://inst.eecs.berkeley.edu/~cs152

CS 152/252A Computer Architecture and Engineering
Sophia Shao
Lecture 7 – Memory II
Intel Kills Optane Memory Business,
Pays $559 Million Inventory Write-Off
Intel used Optane memory to create
both storage and memory products,
and it has long been rumored to be on
the chopping block. At its debut in
2015, Intel and partner Micron touted
the underlying tech, 3D XPoint, as
delivering 1000x the performance and
1000x the endurance of NAND
storage, and 10x the density of DRAM.

https://www.tomshardware.com/news/intel-kills-optane-memory-business-for-good
https://www.intel.com/content/www/us/en/developer/videos/disrupting-the-storage-memory-hierarchy.html
Last time in Lecture 6
§ Dynamic RAM (DRAM) is the main form of main memory storage in use today
– Holds values on small capacitors, which need refreshing (hence "dynamic")
– Slow multi-step access: precharge, read row, read column
§ Static RAM (SRAM) is faster but more expensive
– Used to build on-chip memory for caches
§ Cache holds small set of values in fast memory (SRAM)
close to processor
– Need to develop search scheme to find values in cache, and replacement
policy to make space for newly accessed locations
§ Caches exploit two forms of predictability in memory
reference streams
– Temporal locality, same location likely to be accessed again soon
– Spatial locality, neighboring location likely to be accessed soon

2
Recap: Replacement Policy
In an associative cache, which line from a set should be
evicted when the set becomes full?
• Random
• Least-Recently Used (LRU)
• LRU cache state must be updated on every access
• A true LRU implementation is only feasible for small sets (2-way)
• Pseudo-LRU binary tree often used for 4-8 way
• First-In, First-Out (FIFO) a.k.a. Round-Robin
• Used in highly associative caches
• Not-Most-Recently Used (NMRU)
• FIFO with exception for most-recently used line or lines

This is a second-order effect. Why?

Replacement only happens on misses


3
Pseudo-LRU Binary Tree
§ For 2-way cache, on a hit, single LRU bit is set to point to
other way
§ For 4-way cache, need 3 bits of state. On cache hit, on
path down tree, set all bits to point to other half. On miss,
bits say which way to replace

[Figure: 4-way pseudo-LRU binary tree. A root bit selects between the two halves of the set, and one bit per half selects between Way 3/Way 2 and Way 1/Way 0.]
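As a concrete illustration of this scheme, here is a minimal C sketch of one 4-way pseudo-LRU set. The 3-bit encoding and the helper names (plru4_touch, plru4_victim) are illustrative choices, not necessarily the exact hardware encoding used in lecture:

```c
#include <stdio.h>
#include <stdint.h>

/* 3 pseudo-LRU bits per set for a 4-way cache.
 * Convention (an illustrative choice): each bit points toward the
 * less-recently-used side.
 *   bit 0 (mask 1): 1 = right half (ways 2/3) is LRU, 0 = left half (ways 0/1)
 *   bit 1 (mask 2): within the left pair,  1 = way 1 is LRU, 0 = way 0
 *   bit 2 (mask 4): within the right pair, 1 = way 3 is LRU, 0 = way 2
 */
typedef struct { uint8_t bits; } plru4_t;

/* On a hit (or fill) of 'way', set the bits on the path down the tree
 * to point to the other half at each level.                           */
static void plru4_touch(plru4_t *s, int way) {
    if (way < 2) {                       /* ways 0/1 live in the left half  */
        s->bits |= 1u;                   /* root now points right           */
        if (way == 0) s->bits |= 2u; else s->bits &= ~2u;
    } else {                             /* ways 2/3 live in the right half */
        s->bits &= ~1u;                  /* root now points left            */
        if (way == 2) s->bits |= 4u; else s->bits &= ~4u;
    }
}

/* On a miss, follow the bits down the tree to pick the victim way. */
static int plru4_victim(const plru4_t *s) {
    if (s->bits & 1u)                    /* root points right */
        return (s->bits & 4u) ? 3 : 2;
    else                                 /* root points left  */
        return (s->bits & 2u) ? 1 : 0;
}

int main(void) {
    plru4_t set = { 0 };
    int hits[] = { 0, 2, 1 };            /* hypothetical access pattern */
    for (int i = 0; i < 3; i++) plru4_touch(&set, hits[i]);
    /* Way 3 was never touched, so it should be chosen as the victim. */
    printf("victim after hits to ways 0,2,1: way %d\n", plru4_victim(&set));
    return 0;
}
```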

4
CPU-Cache Interaction
(5-stage pipeline)

[Figure: the classic 5-stage pipeline (fetch, decode/register fetch, ALU, memory, writeback) with the primary instruction cache in the fetch stage and the primary data cache in the memory stage. The entire CPU is stalled on a data cache miss; cache refill data comes from the lower levels of the memory hierarchy via the memory controller.]

5
Improving Cache Performance

Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty

To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty

To reduce hit time, use the biggest cache that doesn't increase hit time past 1 cycle (approx. 8-32KB in modern technology). [Design issues are more complex with deeper pipelines and/or out-of-order superscalar processors.]
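A quick worked example of the AMAT formula in C, using made-up numbers (1-cycle hit time, 5% miss rate, 40-cycle miss penalty) to show how each of the three levers moves the average:

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Illustrative numbers only: 1-cycle hit, 5% miss rate, 40-cycle penalty. */
    double base = amat(1.0, 0.05, 40.0);            /* 1 + 0.05*40 = 3.0 cycles */

    /* Each lever from the slide, improved in isolation: */
    double faster_hit   = amat(0.5, 0.05, 40.0);    /* reduce hit time      */
    double fewer_misses = amat(1.0, 0.02, 40.0);    /* reduce miss rate     */
    double cheap_misses = amat(1.0, 0.05, 20.0);    /* reduce miss penalty  */

    printf("baseline AMAT    = %.2f cycles\n", base);          /* 3.00 */
    printf("halved hit time  = %.2f cycles\n", faster_hit);    /* 2.50 */
    printf("2%% miss rate     = %.2f cycles\n", fewer_misses); /* 1.80 */
    printf("halved penalty   = %.2f cycles\n", cheap_misses);  /* 2.00 */
    return 0;
}
```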
6
Causes of Cache Misses: The 3 C’s
Compulsory: first reference to a line (a.k.a. cold
start misses)
– misses that would occur even with infinite cache
Capacity: cache is too small to hold all data needed
by the program
– misses that would occur even under perfect
replacement policy
Conflict: misses that occur because of collisions
due to line-placement strategy
– misses that would not occur with ideal full associativity

7
Effect of Cache Parameters on Performance
§ Larger cache size
+ reduces capacity and conflict misses
- hit time will increase
§ Higher associativity
+ reduces conflict misses
- may increase hit time

§ Larger line size
+ reduces compulsory misses
- increases conflict misses and miss penalty

8
Figure B.9 Total miss rate (top) and
distribution of miss rate (bottom) for
each size cache according to the three
C's for the data in Figure B.8. The top
diagram shows the actual data cache
miss rates, while the bottom diagram
shows the percentage in each category.
(Space allows the graphs to show one
extra cache size than can fit in Figure
B.8.)

© 2018 Elsevier Inc. All rights reserved.


Recap: Line Size and Spatial Locality
A line is the unit of transfer between the cache and memory

[Figure: a 4-word cache line (Tag, Word0, Word1, Word2, Word3), i.e. b = 2 word-offset bits.]

The CPU address is split into a Line Address (32-b bits) and an Offset (b bits), where 2^b = line size, a.k.a. block size (in bytes).

Larger line size has distinct hardware advantages


• less tag overhead
• exploit fast burst transfers from DRAM
• exploit fast burst transfers over wide busses

What are the disadvantages of increasing line size?


Fewer lines => more conflicts. Can waste bandwidth.
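A small C sketch of the address split shown above, assuming a hypothetical 64-byte line (b = 6 byte-offset bits) rather than the 4-word example in the figure:

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical parameters (not from the slide): 64-byte lines => b = 6. */
#define OFFSET_BITS 6u
#define LINE_BYTES  (1u << OFFSET_BITS)

int main(void) {
    uint32_t addr = 0x1234ABCDu;                  /* example 32-bit address      */

    uint32_t offset    = addr & (LINE_BYTES - 1); /* low b bits: byte within line */
    uint32_t line_addr = addr >> OFFSET_BITS;     /* high 32-b bits: which line   */

    printf("addr=0x%08X  line=0x%07X  offset=%u\n",
           (unsigned)addr, (unsigned)line_addr, (unsigned)offset);
    return 0;
}
```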

10
Figure B.10 Miss rate versus block size for five different-sized caches.
Note that miss rate actually goes up if the block size is too large relative to the
cache size. Each line represents a cache of different size. Figure B.11 shows
the data used to plot these lines. Unfortunately, SPEC2000 traces would take
too long if block size were included, so these data are based on SPEC92 on a
DECstation 5000 (Gee et al. 1993).
© 2019 Elsevier Inc. All rights reserved. 11
Write Policy Choices
§ Cache hit:
– write-through: write both cache & memory
• Generally higher traffic but simpler pipeline & cache design
– write-back: write cache only, memory is written only when the
entry is evicted
• A dirty bit per line further reduces write-back traffic
• Must handle 0, 1, or 2 accesses to memory for each load/store
§ Cache miss:
– no-write-allocate: only write to main memory
– write-allocate (aka fetch-on-write): fetch into cache

§ Common combinations:
– write-through and no-write-allocate
– write-back with write-allocate
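The two common combinations above can be sketched behaviorally as below. This is a toy model (a tiny direct-mapped cache with one-word lines and made-up sizes), not a real design; it only shows when memory versus the cache gets updated:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NLINES   4
#define MEMWORDS 64

static uint32_t mem[MEMWORDS];
static struct { bool valid, dirty; uint32_t tag, data; } cache[NLINES];

/* write-through + no-write-allocate: memory is always updated,
 * the cache is updated only if the line is already present.      */
static void store_wt_nwa(uint32_t addr, uint32_t val) {
    uint32_t idx = addr % NLINES, tag = addr / NLINES;
    mem[addr] = val;                               /* always write memory */
    if (cache[idx].valid && cache[idx].tag == tag)
        cache[idx].data = val;                     /* update on hit only  */
}

/* write-back + write-allocate: the write goes to the cache; a miss
 * first evicts (writing back if dirty) and fetches the line.        */
static void store_wb_wa(uint32_t addr, uint32_t val) {
    uint32_t idx = addr % NLINES, tag = addr / NLINES;
    if (!(cache[idx].valid && cache[idx].tag == tag)) {   /* miss              */
        if (cache[idx].valid && cache[idx].dirty)         /* write back victim */
            mem[cache[idx].tag * NLINES + idx] = cache[idx].data;
        cache[idx].valid = true;                          /* allocate + fetch  */
        cache[idx].tag   = tag;
        cache[idx].data  = mem[addr];
        cache[idx].dirty = false;
    }
    cache[idx].data  = val;                               /* write cache only  */
    cache[idx].dirty = true;
}

int main(void) {
    store_wt_nwa(10, 111);       /* memory is updated immediately       */
    store_wb_wa(20, 222);        /* memory stays stale until eviction   */
    printf("mem[10]=%u (write-through), mem[20]=%u (write-back, not yet evicted)\n",
           (unsigned)mem[10], (unsigned)mem[20]);
    return 0;
}
```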

12
Write Performance
[Figure: direct-mapped cache organization. The address is split into a t-bit Tag, a k-bit Index selecting one of 2^k lines, and a b-bit Offset; the stored tag is compared with the address tag to produce HIT and the write enable (WE) for the selected Data word or byte.]

13
Reducing Write Hit Time
Problem: Writes take two cycles in memory stage, one
cycle for tag check plus one cycle for data write if hit
Solutions:
§ Design data RAM that can perform read and write in one
cycle, restore old value after tag miss
§ Pipelined writes: Hold write data for store in single buffer
ahead of cache, write cache data during next store’s tag check
§ Fully-associative (CAM Tag) caches: Word line only enabled if
hit

14
Pipelining Cache Writes
[Figure: pipelined cache write datapath. The address and store data from the CPU are split into Tag, Index, and Store Data; a Delayed Write Address / Delayed Write Data buffer sits ahead of the data array, the tag comparison produces Hit?, and load data is returned to the CPU.]
Data from a store hit is written into the data portion of the cache during the tag access of the subsequent store.

15
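A behavioral sketch of that delayed-write idea, with tag checking omitted for brevity; the one-entry buffer and the names used here are illustrative assumptions:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* A store that hits is held in a one-entry buffer and written into the
 * data array while the *next* store does its tag check. Loads must
 * check the buffer so they see the not-yet-written data.               */

#define NWORDS 16
static uint32_t data_array[NWORDS];

static struct { bool valid; uint32_t idx, val; } delayed = { false, 0, 0 };

static void store(uint32_t idx, uint32_t val) {
    if (delayed.valid)                          /* previous store's data is     */
        data_array[delayed.idx] = delayed.val;  /* written during this tag check */
    delayed.idx   = idx;                        /* hold this store until the    */
    delayed.val   = val;                        /* next one comes along         */
    delayed.valid = true;
}

static uint32_t load(uint32_t idx) {
    if (delayed.valid && delayed.idx == idx)    /* forward from the buffer */
        return delayed.val;
    return data_array[idx];
}

int main(void) {
    store(3, 42);                               /* held in the delayed buffer    */
    printf("load(3) right after store = %u\n", (unsigned)load(3));      /* 42    */
    store(5, 7);                                /* now word 3 is really written  */
    printf("data_array[3] after next store = %u\n", (unsigned)data_array[3]);
    return 0;
}
```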
CS152 Administrivia
§ HW 2 out
– Due Tuesday 2/21
§ Lab 1 due Thursday 2/9

§ Thursday: Special guest lecture on prefetching from Apple CPU designers

§ Attend discussions/OHs.

§ We discuss the historical context of different hardware designs.
– It helps us understand why they were designed that way.
– But you don’t need to memorize them. This is not a history class ;)

16
CS252 Administrivia
§ Start thinking of class projects and forming teams
– Teams of 2-3
§ RISC vs CISC discussion this week.

CS252 17
Write Buffer to Reduce Read Miss Penalty

[Figure: the CPU (with register file) sends writes into a write buffer that sits between the L1 data cache and a unified L2 cache.]

The write buffer holds evicted dirty lines (write-back cache) OR all writes (write-through cache).
Processor is not stalled on writes, and read misses can go ahead of
write to main memory
Problem: Write buffer may hold updated value of location needed by a read miss
Simple solution: on a read miss, wait for the write buffer to go empty
Faster solution: Check write buffer addresses against read miss addresses, if no
match, allow read miss to go ahead of writes, else, return value in write buffer
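The faster solution can be sketched as an address comparison against every valid write-buffer entry; the structure, sizes, and names below are illustrative assumptions, not a real design:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* On a read miss, compare the miss address against every valid
 * write-buffer entry. If none match, the read may bypass the buffered
 * writes; if one matches, return the buffered value instead of waiting
 * for the buffer to drain.                                             */

#define WB_ENTRIES 4

typedef struct { bool valid; uint32_t addr, data; } wb_entry_t;
static wb_entry_t write_buffer[WB_ENTRIES];

/* Returns true and the forwarded data if the miss address hits in the
 * write buffer; returns false if the read miss may proceed to memory.  */
static bool wb_check(uint32_t miss_addr, uint32_t *forwarded) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == miss_addr) {
            *forwarded = write_buffer[i].data;
            return true;
        }
    }
    return false;
}

int main(void) {
    write_buffer[1] = (wb_entry_t){ true, 0x1000, 99 };  /* one pending write */

    uint32_t v;
    if (wb_check(0x1000, &v))
        printf("read of 0x1000 forwarded from write buffer: %u\n", (unsigned)v);
    if (!wb_check(0x2000, &v))
        printf("read of 0x2000 may go ahead of buffered writes\n");
    return 0;
}
```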

18
Reducing Tag Overhead with Sub-Blocks
§ Problem: Tags are too large, i.e., too much overhead
– Simple solution: Larger lines, but miss penalty could be large.
§ Solution: Sub-block placement (aka sector cache)
– A valid bit added to units smaller than full line, called sub-blocks
– Only read a sub-block on a miss
– If a tag matches, is the word in the cache?

Tag   Sub-block valid bits
100   1 1 1 1
300   1 1 0 0
204   0 1 0 1
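A small C sketch of the lookup implied by the table above: a hit requires both a tag match and the valid bit of the requested sub-block. The data structure here is an illustrative assumption:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define SUBBLOCKS 4

/* Entries mirror the example table: tags 100, 300, 204 with four
 * sub-block valid bits each.                                        */
typedef struct { uint32_t tag; bool v[SUBBLOCKS]; } sector_line_t;

static sector_line_t lines[] = {
    { 100, { true,  true,  true,  true  } },
    { 300, { true,  true,  false, false } },
    { 204, { false, true,  false, true  } },
};

/* A tag match alone is not enough: the requested sub-block must be valid. */
static bool sector_hit(uint32_t tag, int subblock) {
    int n = (int)(sizeof lines / sizeof lines[0]);
    for (int i = 0; i < n; i++)
        if (lines[i].tag == tag)
            return lines[i].v[subblock];
    return false;
}

int main(void) {
    printf("tag 300, sub-block 1: %s\n", sector_hit(300, 1) ? "hit" : "miss");
    printf("tag 300, sub-block 2: %s\n", sector_hit(300, 2) ? "hit" : "miss");
    printf("tag 204, sub-block 0: %s\n", sector_hit(204, 0) ? "hit" : "miss");
    return 0;
}
```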

19
Multilevel Caches
Problem: A memory cannot be both large and fast
Solution: a hierarchy of caches of increasing size at each level

CPU L1$ L2$ DRAM

Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction (MPI) = misses in cache / number of instructions
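A short worked example of these definitions in C, with made-up numbers (5% L1 miss rate, 40% L2 local miss rate) and illustrative latencies:

```c
#include <stdio.h>

/* Suppose the L1 misses on 5% of CPU accesses and the L2 misses on 40%
 * of the accesses it sees (its *local* miss rate). Then the L2 *global*
 * miss rate is 0.05 * 0.40 = 2% of all CPU accesses.                    */

int main(void) {
    double l1_miss_rate       = 0.05;   /* also L1's global miss rate     */
    double l2_local_miss_rate = 0.40;   /* misses / accesses reaching L2  */

    double l2_global = l1_miss_rate * l2_local_miss_rate;

    /* AMAT with illustrative latencies: 1-cycle L1 hit, 10-cycle L2 hit,
     * 100-cycle DRAM access.                                             */
    double amat = 1.0 + l1_miss_rate * (10.0 + l2_local_miss_rate * 100.0);

    printf("L2 global miss rate = %.3f\n", l2_global);   /* 0.020 */
    printf("AMAT = %.2f cycles\n", amat);                /* 3.50  */
    return 0;
}
```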

20
Figure B.14 Miss rates versus cache size for multilevel caches. Second-level caches
smaller than the sum of the two 64 KiB first-level caches make little sense, as reflected in
the high miss rates. After 256 KiB the single cache is within 10% of the global miss rates.
The miss rate of a single-level cache versus size is plotted against the local miss rate
and global miss rate of a second-level cache using a 32 KiB first-level cache. The L2
caches (unified) were two-way set associative with LRU replacement. Each had split L1
instruction and data caches that were 64 KiB two-way set associative with LRU
replacement. The block size for both L1 and L2 caches was 64 bytes. Data were
collected as in Figure B.4. © 2019 Elsevier Inc. All rights reserved. 21
Presence of L2 influences L1 design
§ Use smaller L1 if there is also L2
– Trade increased L1 miss rate for reduced L1 hit time
– Backup L2 reduces L1 miss penalty
– Reduces average access energy
§ Use simpler write-through L1 with on-chip L2
– Write-back L2 cache absorbs write traffic, doesn’t go off-chip
– At most one L1 miss request per L1 access (no dirty victim write
back) simplifies pipeline control
– Simplifies coherence issues
– Simplifies error recovery in L1 (can use just parity bits in L1 and
reload from L2 when parity error detected on L1 read)

22
Inclusion Policy
§ Inclusive multilevel cache:
– Inner cache can only hold lines also present in outer
cache
– External coherence snoop access need only check
outer cache
§ Exclusive multilevel caches:
– Inner cache may hold lines not in outer cache
– Swap lines between inner/outer caches on miss
– Used in AMD Athlon with 64KB primary and 256KB
secondary cache
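A toy illustration of the swap in an exclusive hierarchy, using single-line "caches" purely to show the movement of lines; this is a sketch under those assumptions, not AMD's actual mechanism:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* On an L1 miss that hits in L2, the requested line moves up to L1 and
 * the L1 victim moves down into L2, so each line lives in exactly one
 * of the two caches.                                                    */

typedef struct { bool valid; uint32_t tag; } line_t;

static line_t l1 = { true, 0xA };   /* L1 currently holds line A */
static line_t l2 = { true, 0xB };   /* L2 currently holds line B */

static void l1_miss_swap(uint32_t want_tag) {
    if (l2.valid && l2.tag == want_tag) {   /* hit in L2: swap the lines   */
        line_t victim = l1;
        l1.valid = true;
        l1.tag   = want_tag;                /* requested line moves to L1  */
        l2 = victim;                        /* L1 victim moves down to L2  */
    }
}

int main(void) {
    l1_miss_swap(0xB);                      /* CPU asks for line B */
    printf("L1 holds 0x%X, L2 holds 0x%X\n",
           (unsigned)l1.tag, (unsigned)l2.tag);   /* B and A */
    return 0;
}
```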

23
Itanium-2 On-Chip Caches
(Intel/HP, 2002)

Level 1: 16KB, 4-way s.a., 64B line, quad-port (2 load + 2 store), single-cycle latency

Level 2: 256KB, 4-way s.a., 128B line, quad-port (4 load or 4 store), five-cycle latency

Level 3: 3MB, 12-way s.a., 128B line, single 32B port, twelve-cycle latency

24
Power 7 On-Chip Caches [IBM 2009]
32KB L1 I$/core
32KB L1 D$/core
3-cycle latency

256KB Unified L2$/core
8-cycle latency

32MB Unified Shared L3$
Embedded DRAM (eDRAM)
25-cycle latency to local slice

25
IBM z196 Mainframe Caches 2010

§ 96 cores (4 cores/chip, 24 chips/system)
– Out-of-order, 3-way superscalar @ 5.2GHz
§ L1: 64KB I-$/core + 128KB D-$/core
§ L2: 1.5MB private/core (144MB total)
§ L3: 24MB shared/chip (eDRAM) (576MB total)
§ L4: 768MB shared/system (eDRAM)

26
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and
Berkeley CS252 computer architecture courses created by
my collaborators and colleagues:
– Arvind (MIT)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)

27
