
Memory Hierarchy Design

Contents
1. Memory hierarchy
1. Basic concepts
2. Design techniques
2. Caches
1. Types of caches: Fully associative, Direct mapped, Set associative
2. Ten optimization techniques
3. Main memory
1. Memory technology
2. Memory optimization
3. Power consumption
4. Memory hierarchy case studies: Opteron, Pentium, i7.
5. Virtual memory
6. Problem solving

Introduction
 Programmers want very large memory with low latency
 Fast memory technology is more expensive per bit than slower memory
 Solution: organize the memory system as a hierarchy
  the entire addressable memory space is available in the largest, slowest memory
  incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
 Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  gives the illusion of a large, fast memory being presented to the processor



Memory hierarchy

Processor  L1 Cache  L2 Cache  L3 Cache  Main Memory  Hard Drive or Flash

Latency increases and capacity grows (KB, MB, GB, TB) moving down the hierarchy, away from the processor.
PROCESSOR
L1: I-Cache, D-Cache
L2: U-Cache
L3: U-Cache
Main Memory

I-Cache  instruction cache
D-Cache  data cache
U-Cache  unified cache

Different functional units fetch information from the I-cache and the D-cache: the decoder and scheduler operate with the I-cache, while the integer execution unit and the floating-point unit communicate with the D-cache.


Introduction
Memory hierarchy

 Example: my PowerBook
  Intel Core i7, 2 cores, 2.8 GHz
  L2 cache: 256 KB/core
  L3 cache: 4 MB
  Main memory: 16 GB  two DDR3 8 GB modules at 1.6 GHz
  Disk: 500 GB


Introduction
Processor/memory cost-performance gap



Introduction
Memory hierarchy design

 Memory hierarchy design becomes more crucial with recent multi-core processors
 Aggregate peak bandwidth grows with the number of cores:
  Intel Core i7 can generate two references per core per clock
  with four cores and a 3.2 GHz clock:
   12.8 billion (4 cores x 3.2 GHz) 128-bit instruction references/second +
   25.6 billion (2 x 4 cores x 3.2 GHz) 64-bit data references/second
   = 409.6 GB/s!
  DRAM bandwidth is only 6% of this (25 GB/s)
 Requires:
  multi-port, pipelined caches
  two levels of cache per core
  a shared third-level cache on chip
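A tiny C check of the peak-bandwidth arithmetic above (the core count, clock rate, and reference widths are taken from the bullets; the program only illustrates the calculation):

#include <stdio.h>

int main(void) {
    double cores = 4, clock_ghz = 3.2;
    /* references per second, in billions */
    double inst_refs = cores * clock_ghz;        /* 12.8 G 128-bit instruction refs */
    double data_refs = 2 * cores * clock_ghz;    /* 25.6 G  64-bit data refs        */
    /* 128 bits = 16 bytes, 64 bits = 8 bytes */
    double gb_per_s = inst_refs * 16 + data_refs * 8;
    printf("aggregate peak demand: %.1f GB/s\n", gb_per_s);   /* 409.6 GB/s */
    return 0;
}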


Introduction
Performance and power

 High-end microprocessors have >10 MB of on-chip cache
 The cache consumes a large fraction of the chip area and power budget


Introduction
Memory hierarchy basics

 When a word is not found in the cache, a miss occurs
 On a miss, fetch the word from the lower level in the hierarchy
  higher-latency reference
  the lower level may be another cache or the main memory
  fetch the entire block consisting of several words
   takes advantage of spatial locality
  place the block into the cache, in any location within its set, determined by the address:
   (block address) MOD (number of sets)


Placement problem

(Figure: blocks of the large main memory must be mapped into the much smaller cache memory.)


Placement policies
 Main memory has a much larger capacity than the cache
 A placement policy defines the mapping between main memory and the cache
  i.e., where to put a block in the cache


Fully associative cache

(Figure: memory blocks 031 mapping to cache blocks 07.)

A block can be placed in any location in the cache.
Direct mapped cache

(Figure: memory blocks 031 mapping to cache blocks 07.)

A block can be placed in ONLY a single location in the cache:

(Block address) MOD (Number of blocks in cache)

Example: block 12 maps to cache block 12 MOD 8 = 4.
Set associative cache

(Figure: memory blocks 031 mapping to a cache organized as 4 sets of 2 blocks each.)

A block can be placed in one of n locations in an n-way set associative cache:

(Block address) MOD (Number of sets in cache)

Example: block 12 maps to set 12 MOD 4 = 0.
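A minimal C sketch of the two placement computations above; the 8-block cache and the 4-set, 2-way geometry are the ones shown in the figures, and the macro names are illustrative:

#include <stdio.h>
#include <stdint.h>

#define NUM_BLOCKS 8   /* direct mapped: 8 one-block sets            */
#define NUM_SETS   4   /* 2-way set associative: 4 sets of 2 blocks  */

int main(void) {
    uint32_t block_address = 12;

    /* Direct mapped: exactly one possible cache block. */
    uint32_t dm_block = block_address % NUM_BLOCKS;   /* 12 MOD 8 = 4 */

    /* Set associative: one possible set, any way within it. */
    uint32_t sa_set = block_address % NUM_SETS;       /* 12 MOD 4 = 0 */

    printf("direct mapped: block %u -> cache block %u\n", block_address, dm_block);
    printf("set associative: block %u -> set %u (either way of the set)\n", block_address, sa_set);
    return 0;
}

A fully associative cache needs no index computation at all: the block may go into any of the 8 cache blocks, so every tag must be compared on a lookup.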
Introduction
Memory hierarchy basics

 n blocks per set => n-way set associative
  Direct-mapped cache => one block per set
  Fully associative => one set

 Writing to the cache: two strategies
  Write-through: immediately update lower levels of the hierarchy
  Write-back: update lower levels of the hierarchy only when a modified block is replaced in the cache
 Both strategies use a write buffer to make writes asynchronous
Dirty bit
 Two types of caches
  Instruction cache: I-cache
  Data cache: D-cache
 The dirty bit indicates whether the cache block has been written to (modified)
 No dirty bit is needed for
  I-caches
  write-through D-caches
 A dirty bit is needed for
  write-back D-caches
Write back

(Figure: CPU writes go to the D-cache; main memory is updated only when a dirty block is evicted.)


Write through cache

(Figure: CPU writes go to the cache and to main memory at the same time.)
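A toy C sketch contrasting the two write policies for a single cache line; the structure layout and function names are illustrative, not from the slides:

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 32

struct cache_line {
    bool     valid;
    bool     dirty;                 /* meaningful only for write-back */
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};

/* Stand-in for the next level of the hierarchy. */
static void write_block_to_memory(uint32_t tag, const uint8_t *data) {
    (void)tag; (void)data;
}

/* Write-through: every store updates the cache AND the lower level. */
static void store_write_through(struct cache_line *line, uint32_t offset, uint8_t value) {
    line->data[offset] = value;
    write_block_to_memory(line->tag, line->data);   /* typically buffered in a write buffer */
}

/* Write-back: a store only marks the line dirty; memory is updated at eviction. */
static void store_write_back(struct cache_line *line, uint32_t offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;
}

static void evict_write_back(struct cache_line *line) {
    if (line->valid && line->dirty)
        write_block_to_memory(line->tag, line->data); /* flush the modified block */
    line->valid = false;
    line->dirty = false;
}

int main(void) {
    struct cache_line line = { .valid = true, .tag = 0x12 };
    store_write_through(&line, 0, 0xAB);   /* memory updated immediately       */
    store_write_back(&line, 1, 0xCD);      /* memory updated only at eviction  */
    evict_write_back(&line);
    return 0;
}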


Cache organization
 A cache row (line) contains:
  Tag  part of the address of the data fetched from main memory
  Data block  the data fetched from main memory
  Flags: valid, dirty

 A memory address is split (MSB to LSB) into:
  tag  the most significant bits of the address
  index  selects the cache row the data has been put in
  block offset  selects the desired data within the stored data block in the cache row
Cache organization

(Figure: the CPU address is split into tag <21 bits>, index <6 bits>, and block offset <5 bits>. Each cache row stores a valid bit <1>, a tag <21>, and a data block <256 bits>. The index selects a row, a comparator checks the stored tag against the address tag, and a multiplexor selects the requested word from the data block.)
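A small C sketch of the address split, using the bit widths from the figure (a 32-bit address with a 21-bit tag, 6-bit index, and 5-bit block offset); the constant names and the example address are illustrative:

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 5    /* 32-byte blocks                    */
#define INDEX_BITS  6    /* 64 cache rows                     */
/* the remaining 21 bits of the 32-bit address form the tag   */

int main(void) {
    uint32_t addr   = 0xDEADBEEF;                        /* example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}

On a lookup, the index selects the row, the stored tag is compared with the address tag (and the valid bit is checked), and the offset selects the word within the 256-bit data block.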


Introduction
Cache misses
 Miss rate  fraction of cache accesses that result in a miss

 Causes of misses
  Compulsory  the first reference to a block
  Capacity  blocks discarded because the cache is full and later retrieved
  Conflict  the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache


Introduction
Cache misses

 Speculative and multithreaded processors may execute other instructions during a miss
  reduces the performance impact of misses


Introduction
Basic cache optimizations techniques
 Larger block size
 Reduces compulsory misses
 Increases capacity and conflict misses, increases miss penalty
 Larger total cache capacity to reduce miss rate
 Increases hit time, increases power consumption
 Higher associativity
 Reduces conflict misses
 Increases hit time, increases power consumption
 Higher number of cache levels
 Reduces overall memory access time
 Give priority to read misses over writes
 Reduces miss penalty
 Avoid address translation in cache indexing
 Reduces hit time



Advanced Optimizations
Advanced optimizations

 The ten optimizations are grouped by goal:
  reducing the hit time
  increasing cache bandwidth
  reducing the miss penalty
  reducing the miss rate
  reducing the miss penalty or miss rate via parallelism


Advanced Optimizations
Ten advanced optimizations



1) Fast hit times via small and simple L1 caches

 Critical timing path:
  addressing the tag memory, then
  comparing tags, then
  selecting the correct set
 Direct-mapped caches can overlap tag compare and transmission of data
 Lower associativity reduces power because fewer cache lines are accessed


Advanced Optimizations
L1 size and associativity

Access time vs. size and associativity



Advanced Optimizations
L1 size and associativity

Energy per read vs. size and associativity



Advanced Optimizations
2) Fast hit times via way prediction
 How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
 Way prediction: keep extra bits in the cache to predict the way (block within the set) of the next cache access
  the multiplexor is set early to select the predicted block; only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data
  on a mis-prediction, check the other blocks for matches in the next clock cycle
 Drawback: the CPU pipeline is harder to design if a hit can take 1 or 2 cycles
 Prediction accuracy
  > 90% for two-way
  > 80% for four-way
  I-cache has better accuracy than D-cache
 First used on the MIPS R10000 in the mid-90s; also used on the ARM Cortex-A8
 Extension  way selection: use the prediction to decide which block to access, not just which tag to compare; increases the mis-prediction penalty
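A rough C sketch of way prediction for a 2-way set-associative cache; the structure, sizes, and the 1-vs-2-cycle return convention are illustrative, not taken from any particular design:

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 64
#define WAYS      2

struct cache_set {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
    uint8_t  predicted_way;     /* the extra prediction bits kept per set */
};

static struct cache_set sets[NUM_SETS];

/* Returns 1 for a fast hit in the predicted way, 2 for a slower hit found in the
 * following cycle, 0 for a miss.  Only one tag comparison happens in the first
 * cycle, in parallel with the data read. */
static int lookup(uint32_t tag, uint32_t index) {
    struct cache_set *s = &sets[index];
    uint8_t w = s->predicted_way;

    if (s->valid[w] && s->tag[w] == tag)
        return 1;                                  /* prediction correct */

    for (uint8_t other = 0; other < WAYS; other++) {
        if (other != w && s->valid[other] && s->tag[other] == tag) {
            s->predicted_way = other;              /* retrain the predictor */
            return 2;                              /* hit, but one cycle later */
        }
    }
    return 0;                                      /* miss */
}

int main(void) {
    sets[3].valid[1] = true;
    sets[3].tag[1] = 0x42;
    return lookup(0x42, 3);     /* mis-predicts the way, hits in the next cycle */
}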


Advanced Optimizations
3) Increase cache bandwidth by pipelining
 Pipelining the cache improves bandwidth but increases latency
 More clock cycles between the issue of the load and the use of the data
 Examples:
  Pentium: 1 cycle
  Pentium Pro  Pentium III: 2 cycles
  Pentium 4  Core i7: 4 cycles
 Increases the branch mis-prediction penalty
 Makes it easier to increase associativity


4) Increase cache bandwidth: non-blocking caches
 Pipelined processors that allow out-of-order execution should not stall during a data cache miss
 A non-blocking (lockup-free) cache allows the data cache to continue to supply hits during a miss
  requires additional bits on registers or out-of-order execution
  requires multi-bank memories
 Hit under miss reduces the effective miss penalty by continuing to work during the miss instead of ignoring CPU requests
 Hit under multiple miss (miss under miss) may further lower the effective miss penalty by overlapping multiple misses
  significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  requires multiple memory banks (otherwise it cannot be supported)
 The Pentium Pro allows 4 outstanding memory misses
Advanced Optimizations
Nonblocking caches
 Like pipelining the memory system  allow hits before previous misses complete
  hit under miss
  hit under multiple miss
 Important for hiding memory latency
 The L2 cache must support this
 In general, processors can hide the L1 miss penalty but not the L2 miss penalty


http://csg.csail.mit.edu/6.S078
Advanced Optimizations
5) Independent banks; interleaving
 Organize the cache as independent banks to support simultaneous access
  ARM Cortex-A8 supports 14 banks for L2
  Intel i7 supports 4 banks for L1 and 8 banks for L2
 Interleave banks according to block address


Advanced Optimizations
6) Early restart and critical word first
 Goal: reduce the miss penalty
 Don't wait for the full block to arrive before restarting the CPU
  Early restart  as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
   spatial locality  the CPU tends to want the next sequential word anyway, so the benefit of early restart alone is unclear
  Critical word first  request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in
 With the long blocks popular today, critical word first is widely used


7) Merging write buffer to reduce miss penalty
 A write buffer allows the processor to continue while waiting for the write to complete in memory
 If the buffer already contains modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write-buffer entry
  if so, the new data are combined with that entry
 Increases the effective block size of writes for a write-through cache when writes are to sequential words or bytes, since multiword writes are more efficient for memory
 The Sun T1 (Niagara) processor, among many others, uses write merging
Advanced Optimizations
Merging write buffer
 When storing to a block that is already pending in the write buffer, update that write-buffer entry (see the sketch below)
 Reduces stalls due to a full write buffer
 Do not apply write merging to I/O addresses

(Figure: write buffer contents without and with write merging.)
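A toy C sketch of the merging check described above; the entry layout, sizes, and function names are illustrative:

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES  4
#define BLOCK_BYTES 8            /* one 8-byte block image per entry */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;         /* address of the aligned block     */
    uint8_t  data[BLOCK_BYTES];
    uint8_t  byte_valid;         /* bitmap of bytes holding new data */
};

static struct wb_entry write_buffer[WB_ENTRIES];

/* Merge a store into an existing entry if possible, otherwise claim a free one.
 * Returns false when the buffer is full and the processor would stall. */
static bool wb_store(uint64_t addr, uint8_t value) {
    uint64_t block  = addr & ~(uint64_t)(BLOCK_BYTES - 1);
    unsigned offset = (unsigned)(addr & (BLOCK_BYTES - 1));

    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].block_addr == block) {
            write_buffer[i].data[offset] = value;          /* write merging */
            write_buffer[i].byte_valid |= (uint8_t)(1u << offset);
            return true;
        }
    }
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!write_buffer[i].valid) {                      /* allocate a new entry */
            write_buffer[i].valid        = true;
            write_buffer[i].block_addr   = block;
            write_buffer[i].data[offset] = value;
            write_buffer[i].byte_valid   = (uint8_t)(1u << offset);
            return true;
        }
    }
    return false;                                          /* buffer full: stall */
}

int main(void) {
    wb_store(0x1000, 0xAA);     /* allocates an entry for block 0x1000      */
    wb_store(0x1001, 0xBB);     /* merges into the same entry (same block)  */
    return 0;
}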


8) Reduce misses by compiler optimizations
 McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
 Instructions
  reorder procedures in memory so as to reduce conflict misses
  use profiling to look at conflicts (using tools they developed)
 Data
  Merging arrays: improve spatial locality by using a single array of compound elements instead of 2 arrays
  Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  Loop fusion: combine 2 independent loops that have the same looping and some variables in common
  Blocking: improve temporal locality by accessing blocks of data repeatedly instead of going down whole columns or rows
Advanced Optimizations
Compiler optimizations

 Loop interchange
  swap nested loops to access memory in sequential order
 Blocking
  instead of accessing entire rows or columns, subdivide matrices into blocks
  requires more memory accesses but improves the locality of the accesses


Merging arrays example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */


struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key improves spatial locality.
Loop interchange example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improves spatial locality.
Loop fusion example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    { a[i][j] = 1/b[i][j] * c[i][j];
      d[i][j] = a[i][j] + c[i][j]; }

Before fusion: 2 misses per access to a & c; after: one miss per access  improves temporal locality.
Blocking example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    { r = 0;
      for (k = 0; k < N; k = k+1)
        r = r + y[i][k]*z[k][j];
      x[i][j] = r;
    };

 Two inner loops:
  read all N x N elements of z[]
  read N elements of 1 row of y[] repeatedly
  write N elements of 1 row of x[]
 Capacity misses are a function of N and the cache size
Blocking example
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B,N); j = j+1)
        { r = 0;
          for (k = kk; k < min(kk+B,N); k = k+1)
            r = r + y[i][k]*z[k][j];
          x[i][j] = x[i][j] + r;
        };

 B is called the blocking factor
 Capacity misses drop from 2N³ + N² to 2N³/B + N²
 Conflict misses too?
Snapshot of arrays x, y, z when N = 6 and i = 1

 The age of access to the array elements is indicated by shade:
  white  not yet touched
  light  older access
  dark  newer access
 In the "before" algorithm the elements of y and z are read repeatedly to calculate x. Compare with the next slide, which shows the "after" access patterns. Indices i, j, and k are shown along the rows and columns.
Reducing conflict misses by blocking

(Figure: miss rate vs. blocking factor for a direct-mapped cache and a fully associative cache.)

 Conflict misses in caches that are not fully associative, as a function of the blocking size
 Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, even though both fit in the cache
Summary of compiler optimizations to reduce cache misses (by hand)

(Figure: performance improvement, roughly 1x to 3x, for vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress, broken down by merged arrays, loop interchange, loop fusion, and blocking.)
Advanced Optimizations
9) Hardware prefetching

 Fetch two blocks on a miss: the requested block and the next sequential block

(Figure: Pentium 4 pre-fetching performance.)


Advanced Optimizations
10) Compiler prefetching

 Insert prefetch instructions before the data is needed
 Non-faulting: a prefetch doesn't cause exceptions

 Register prefetch
  loads data into a register
 Cache prefetch
  loads data into the cache

 Combine with loop unrolling and software pipelining
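A small C sketch of cache prefetching in a loop. It uses the GCC/Clang __builtin_prefetch intrinsic, which the slides do not name, and the prefetch distance of 16 elements is an illustrative guess that would normally be tuned:

#include <stddef.h>

/* Sum an array, prefetching the element that will be needed a few iterations ahead. */
double sum_with_prefetch(const double *a, size_t n) {
    const size_t distance = 16;            /* illustrative prefetch distance */
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i + distance < n)
            __builtin_prefetch(&a[i + distance], 0 /* read */, 3 /* keep in cache */);
        sum += a[i];
    }
    return sum;
}

Because each prefetch costs an instruction, it pays off only when the saved miss time exceeds the issue overhead, which is exactly the trade-off raised on the next slide.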


Reducing misses by software prefetching
 Data prefetch
  register prefetch: load data into a register (HP PA-RISC loads)
  cache prefetch: load data into the cache (MIPS IV, PowerPC, SPARC V9)
  special prefetching instructions cannot cause faults; a form of speculative execution
 Issuing prefetch instructions takes time
  is the cost of issuing prefetches < the savings in reduced misses?
  wider superscalar issue reduces the difficulty of finding issue bandwidth for prefetches
Advanced Optimizations
Summary

