
Compiler Optimizations and Prefetching

BY:

ADITYA P. P. PRASETYO, S. Kom., MT.


Ten Advanced Optimizations

– Small and simple first-level caches

– Critical timing path:
– addressing the tag memory, then
– comparing the tags, then
– selecting the correct set
– Direct-mapped caches can overlap tag compare and transmission of data
– Lower associativity reduces power because fewer cache lines are accessed
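A minimal sketch in C of the address split that makes this overlap possible; the block and index sizes below are illustrative assumptions, not values from the slide:

    #include <stdint.h>

    #define BLOCK_BITS 6   /* 64-byte blocks (assumed) */
    #define INDEX_BITS 9   /* 512 lines -> 32 KB direct-mapped cache (assumed) */

    /* In a direct-mapped cache the index selects exactly one line, so the
     * data read can begin while the tag comparison is still in flight. */
    static inline uint64_t cache_index(uint64_t addr) {
        return (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    }
    static inline uint64_t cache_tag(uint64_t addr) {
        return addr >> (BLOCK_BITS + INDEX_BITS);
    }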

2
L1 Size and Associativity

(Figure: access time vs. size and associativity.)


3
L1 Size and Associativity

(Figure: energy per read vs. size and associativity.)


4
Way Prediction

– To improve hit time, predict the way to pre-set the mux

– Mis-prediction gives a longer hit time
– Prediction accuracy
– > 90% for two-way
– > 80% for four-way
– I-cache has better accuracy than D-cache
– First used on the MIPS R10000 in the mid-90s
– Used on the ARM Cortex-A8
– Extend to predict the block as well
– “Way selection”
– Increases the mis-prediction penalty
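A minimal C sketch of the idea, with all sizes and names assumed: each set remembers the way that hit last time, the predicted way is checked first, and a mis-predict pays extra latency while retraining the predictor:

    #define WAYS 2
    #define SETS 256

    static unsigned predicted_way[SETS];        /* per-set way predictor */
    static unsigned tags[SETS][WAYS];           /* stored tags           */

    /* Returns the hit latency in cycles (0 = miss); cycle counts assumed. */
    int wp_access(unsigned set, unsigned tag) {
        unsigned w = predicted_way[set];
        if (tags[set][w] == tag)
            return 1;                           /* predicted way hit: fast */
        for (unsigned i = 0; i < WAYS; i++)     /* slow path: try the rest */
            if (i != w && tags[set][i] == tag) {
                predicted_way[set] = i;         /* retrain the predictor   */
                return 2;                       /* hit with extra latency  */
            }
        return 0;                               /* miss */
    }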

5
Pipelining Cache

– Pipeline cache access to improve bandwidth
– Examples:
– Pentium: 1 cycle
– Pentium Pro – Pentium III: 2 cycles
– Pentium 4 – Core i7: 4 cycles

– Increases branch mis-prediction penalty

– Makes it easier to increase associativity

6
Nonblocking Caches

– Allow hits before previous misses complete
– “Hit under miss”
– “Hit under multiple miss”
– L2 must support this
– In general, processors can hide an L1 miss penalty but not an L2 miss penalty
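A minimal C sketch of hit-under-miss bookkeeping using miss status holding registers (MSHRs); the MSHR count and all names here are assumptions:

    #define MSHRS 4

    struct mshr { int valid; unsigned long block; };
    static struct mshr mshrs[MSHRS];

    /* Returns 1 if the access can proceed (a hit, or a miss that an MSHR
     * can track while later hits continue), 0 if all MSHRs are busy and
     * the cache must stall. */
    int nb_lookup(unsigned long block, int hit) {
        if (hit) return 1;                      /* hits bypass pending misses  */
        for (int i = 0; i < MSHRS; i++)
            if (mshrs[i].valid && mshrs[i].block == block)
                return 1;                       /* merge with outstanding miss */
        for (int i = 0; i < MSHRS; i++)
            if (!mshrs[i].valid) {
                mshrs[i].valid = 1;             /* allocate a new miss entry   */
                mshrs[i].block = block;
                return 1;
            }
        return 0;                               /* structural stall            */
    }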

7
Multibanked Caches
– Organize the cache as independent banks to support simultaneous access
– ARM Cortex-A8 supports 1-4 banks for L2
– Intel i7 supports 4 banks for L1 and 8 banks for L2

– Interleave banks according to block address
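Sequential interleaving can be sketched in one line of C; the block size and bank count are parameters here, not values from the slide:

    /* Consecutive block addresses map to consecutive banks. */
    unsigned bank_of(unsigned long addr, unsigned block_size, unsigned n_banks) {
        return (addr / block_size) % n_banks;
    }
    /* e.g. with 64-byte blocks and 4 banks: blocks 0,1,2,3,4 -> banks 0,1,2,3,0 */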

8
Critical Word First, Early Restart

– Critical word first

– Request the missed word from memory first
– Send it to the processor as soon as it arrives
– Early restart
– Request words in normal order
– Send the missed word to the processor as soon as it arrives

– Effectiveness of these strategies depends on the block size and the likelihood of another access to the portion of the block that has not yet been fetched
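A minimal C sketch of the critical-word-first return order (wrap-around fill); the 8-word block in the comment is an assumed example:

    /* Fill order for a block of words_per_block words when the word at
     * index `critical` missed: critical word first, then wrap around. */
    void cwf_order(unsigned critical, unsigned words_per_block, unsigned order[]) {
        for (unsigned i = 0; i < words_per_block; i++)
            order[i] = (critical + i) % words_per_block;
    }
    /* e.g. critical = 5, 8-word block -> fill order 5 6 7 0 1 2 3 4 */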
9
Merging Write Buffer

– When storing to a block that is already pending in the write buffer, update that write-buffer entry
– Reduces stalls due to a full write buffer
– Do not apply to I/O addresses
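A minimal C sketch of the merge rule; the entry count, words per entry, and all names are assumptions:

    #define ENTRIES 4
    #define WORDS   4   /* words per buffer entry (one block, assumed) */

    struct wb_entry { int valid; unsigned long block; unsigned long data[WORDS]; };
    static struct wb_entry wb[ENTRIES];

    /* Returns 1 if the store was buffered, 0 if the buffer is full (stall). */
    int wb_write(unsigned long block, unsigned word, unsigned long value) {
        for (int i = 0; i < ENTRIES; i++)
            if (wb[i].valid && wb[i].block == block) {
                wb[i].data[word] = value;   /* merge into the pending entry */
                return 1;
            }
        for (int i = 0; i < ENTRIES; i++)
            if (!wb[i].valid) {             /* allocate a fresh entry */
                wb[i].valid = 1;
                wb[i].block = block;
                wb[i].data[word] = value;
                return 1;
            }
        return 0;                           /* buffer full: stall */
    }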

(Figure: write-buffer entries without merging vs. with merging.)

10
Compiler Optimizations

– Loop Interchange
– Swap nested loops to access memory in sequential order (see the sketch below)

– Blocking
– Instead of accessing entire rows or columns, subdivide matrices into blocks
– Requires more memory accesses but improves locality of accesses
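A minimal sketch of loop interchange on a 5000×100 row-major array (the classic example; x and the bounds are illustrative):

    /* Before: x is row-major, so this strides through memory, touching a
     * different cache block on nearly every access. */
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

    /* After interchange: accesses are sequential, so every word in a
     * fetched block is used before the block is replaced. */
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];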

11
Reducing Cache Misses:
5. Compiler Optimizations
– Blocking: improve temporal and spatial locality
a) Multiple arrays are accessed in both ways (i.e., row-major and column-major); such orthogonal accesses cannot be helped by the earlier methods
b) Concentrate on submatrices, or blocks

c) All N*N elements of Y and Z are accessed N times and each element of X is accessed once. Thus, there are N³ operations and 2N³ + N² memory words accessed! Capacity misses are a function of N and the cache size in this case.
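The counts in (c) describe the standard matrix-multiply kernel X = Y*Z; a minimal version of that kernel for reference:

    /* Unblocked X = Y*Z: over the whole computation every element of Y and
     * Z is read up to N times, giving up to 2N^3 + N^2 memory words touched
     * for N^3 multiply-add operations. */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }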

14
Reducing Cache Misses:
5. Compiler Optimizations (cont’d)

– Blocking: improve temporal and spatial locality

a) To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B*B, where B is called the blocking factor.
b) The total number of memory words accessed drops to 2N³/B + N²
c) Blocking exploits a combination of spatial (Y) and temporal (Z) locality.
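A minimal sketch of the blocked kernel; min() stands for a small helper returning the lesser of its arguments, and x is assumed zeroed beforehand since partial sums accumulate across kk tiles:

    /* Blocked X = Y*Z with blocking factor B: each B*B tile of Z stays in
     * the cache while a strip of Y streams past it, cutting the words
     * accessed to roughly 2N^3/B + N^2. */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B, N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B, N); k = k+1)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;   /* accumulate across kk tiles */
                }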

15
Hardware Prefetching
– Fetch two blocks on a miss (the requested block plus the next sequential block): overlap memory access with execution by fetching data items before the processor requests them. A minimal model follows.
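A minimal C model of next-block prefetching into a one-entry stream buffer; all names and the single-entry buffer are assumptions:

    /* On a miss to block b, the demand block is fetched and block b+1 is
     * staged in the stream buffer, so a sequential miss hits there instead
     * of paying the full memory latency. */
    static unsigned long stream_buf;
    static int stream_buf_valid = 0;

    void on_miss(unsigned long block) {
        /* ... fetch the demand block itself ... */
        stream_buf = block + 1;        /* prefetch the next sequential block */
        stream_buf_valid = 1;
    }

    int stream_buf_hit(unsigned long block) {
        return stream_buf_valid && block == stream_buf;
    }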

(Figure: Pentium 4 pre-fetching.)
16
Compiler Prefetching
– Insert prefetch instructions before the data is needed
– Non-faulting: a prefetch doesn’t cause exceptions

– Register prefetch
– Loads data into a register
– Cache prefetch
– Loads data into the cache

– Combine with loop unrolling and software pipelining
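As a concrete example of a non-faulting cache prefetch, GCC and Clang provide the __builtin_prefetch builtin; the prefetch distance of 8 iterations below is an assumption that would be tuned to the miss latency:

    /* Prefetch 8 iterations ahead (distance assumed); the arguments are
     * (address, rw: 0 = read, temporal-locality hint 0-3). */
    void scale(double *a, int n) {
        for (int i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8], 0, 1);
            a[i] = 2.0 * a[i];
        }
    }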
17
Reducing Cache Miss Penalty:
Compiler-Controlled Prefetching

 Compiler inserts prefetch instructions

 An Example
for(i:=0; i<3; i:=i+1)
for(j:=0; j<100; j:=j+1)
a[i][j] := b[j][0] * b[j+1][0]
 16-byte blocks, 8KB cache, direct-mapped, write back, 8-byte elements;
what kind of locality, if any, exists for a and b?
a. For a, 3 rows of 100 elements are visited; spatial locality: with two
8-byte elements per 16-byte block, even-indexed elements miss and
odd-indexed elements hit, leading to 3*100/2 = 150 misses
b. For b, column 0 of 101 rows is visited; no spatial locality, but there is
temporal locality: the element b[j+1][0] is reused as b[j][0] in the next
inner-loop iteration, and the same elements are reused in every outer-loop
iteration. That gives 100 misses for b[j+1][0] when i = 0 plus 1 miss for
b[0][0] at j = 0, for a total of 101 misses
 Assume the miss penalty is large (100 cycles), so data must be prefetched
at least 7 iterations in advance. Splitting the loop into two, we have

18
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching

 Assuming that each iteration of the original loop consumes 7 cycles and
there are no conflict or capacity misses, it consumes a total of 7*300
iteration cycles + 251*100 cache-miss cycles = 27,200 cycles;
 With prefetching instructions inserted:
for(j:=0; j<100; j:=j+1){
prefetch(b[j+7][0]);
prefetch(a[0][j+7]);
a[0][j] := b[j][0] * b[j+1][0];};
for(i:=1; i<3; i:=i+1)
for(j:=0; j<100; j:=j+1){
prefetch(a[i][j+7]);
a[i][j] := b[j][0] * b[j+1][0]}
19
Reducing Cache Miss Penalty:
3. Compiler-Controlled Prefetching (cont’d)

 An Example (continued)
 the first loop consumes 9 cycles per iteration (due to the two prefetch
instructions) and iterates 100 times for a total of 900 cycles,
 the second loop consumes 8 cycles per iteration (due to the single
prefetch instruction) and iterates 200 times for a total of 1,600 cycles,
 during the first 7 iterations of the first loop, array a incurs 4 cache
misses and array b incurs 7 cache misses, for a total of (4+7)*100 = 1,100
cache-miss cycles,
 during the first 7 iterations of the second loop for i = 1 and i = 2,
array a incurs 4 cache misses each, for a total of (4+4)*100 = 800 cache-miss
cycles; array b does not incur any cache miss in the second loop!
 Total cycles consumed: 900+1,600+1,100+800 = 4,400
 Prefetching improves performance by 27,200/4,400 ≈ 6.2 times!

20
Summary

21
THANKS

22
