Limits on ILP
IIIT Bhubaneswar, M.Tech (CSE), 2012-2014
Achieving Parallelism
Techniques
Scoreboarding / Tomasulo's algorithm
Pipelining
Speculation
Branch prediction
But how much more performance could we theoretically get?
How much ILP exists?
How much more performance could we realistically get?
Methodology
Assume an ideal (and impractical) processor
Add limitations one at a time to measure individual impact
Consider ILP limits on a hypothetically practical processor
Analysis performed on benchmark programs with varying characteristics
Hardware Model: Ideal Machine
Remove all limitations on ILP
Register renaming
Infinite number of registers available, eliminating all WAW and WAR hazards
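As a rough illustration, the sketch below (a minimal Python model with a hypothetical three-operand (dest, src1, src2) instruction format) shows how giving every write a fresh physical register removes WAW and WAR hazards while preserving true RAW dependences.

```python
# Minimal register-renaming sketch. Instruction format is a hypothetical
# (dest, src1, src2) triple of architectural register names.
def rename(trace):
    mapping = {}          # architectural register -> current physical register
    next_phys = 0
    renamed = []
    for dest, src1, src2 in trace:
        # Sources read the latest physical name, so true RAW dependences survive.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        # Every write gets a fresh physical register, so a later write to the
        # same architectural register (WAW) or an earlier read of it (WAR)
        # no longer conflicts with this instruction.
        mapping[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((mapping[dest], s1, s2))
    return renamed

# Two writes to r1 (a WAW hazard) end up in distinct physical registers,
# while the read of r1 in the second instruction still sees the first write.
print(rename([("r1", "r2", "r3"), ("r4", "r1", "r5"), ("r1", "r6", "r7")]))
```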
Branch prediction
Perfect, including targets for jumps
Memory address alias analysis
All addresses known exactly, a load can be moved before a store provided that the addresses are not identical
Perfect caches
All memory accesses take 1 clock cycle
Infinite resources
No limit on number of functional units, buses, etc.
Ideal Machine
All control dependences removed by perfect branch prediction
All structural hazards removed by infinite resources
All that remains are true data dependences (RAW hazards)
All functional unit latencies: one clock cycle
Any dynamic instruction can execute in the cycle after its predecessor executes
Initially we assume the processor can issue an unlimited number of instructions at once, looking arbitrarily far ahead in the computation
Experimental Method
Programs compiled and optimized with the standard MIPS optimizing compiler
The programs were instrumented to produce a trace of instruction and data references over the entire execution
Each instruction was then rescheduled as early as its true data dependences would allow, with no control dependences constraining the schedule (see the sketch below)
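A minimal sketch of that rescheduling idea, assuming a hypothetical (destination, sources) trace format: with perfect prediction, renaming, and unit latencies, each instruction executes in the cycle after the last of its producers, so the achievable IPC is the trace length divided by the depth of the longest RAW chain.

```python
# Ideal-machine rescheduling sketch: only true (RAW) dependences constrain the
# schedule, all latencies are one cycle, and there are no control, structural,
# or name dependences. Trace entries are hypothetical (dest, [sources]) pairs.
def ideal_schedule(trace):
    ready = {}            # register -> cycle in which its value is produced
    cycles = []
    for dest, srcs in trace:
        issue = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = issue
        cycles.append(issue)
    depth = max(cycles, default=0)
    return len(trace) / depth if depth else 0.0   # instructions per cycle

# Three independent two-instruction chains collapse into two cycles: IPC = 3.0
trace = [("a", []), ("b", ["a"]), ("c", []), ("d", ["c"]), ("e", []), ("f", ["e"])]
print(ideal_schedule(trace))
```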
Benchmark Programs: SPEC92 (integer and floating point)
ILP on the Ideal Processor
How close could we get to the ideal?
The perfect processor must do the following:
1. Look arbitrarily far ahead to find a set of instructions to issue, predicting all branches perfectly
2. Rename all registers to avoid WAR and WAW hazards
3. Resolve data dependences
4. Resolve memory dependences
5. Have enough functional units for all ready instructions
Limiting the Instruction Window
Limit window size to n (no longer arbitrary)
Window = number of instructions that are candidates for concurrent execution in a cycle
Window size determines
Instruction storage needed within the pipeline
Maximum issue rate
Number of operand comparisons needed for dependence checking, which is O(n²)
Detecting dependences among n instructions requires roughly n(n−1) operand comparisons: some 4 million comparisons for a window of 2000 instructions, versus 2450 comparisons to issue 50 instructions
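To make the quadratic growth concrete, the count below assumes each instruction's operands are compared against every other instruction in the window; this is an illustrative model of the cost, not the exact hardware structure.

```python
# Illustrative count of operand comparisons for dependence checking in an
# n-instruction window, assuming each instruction is checked against every
# other one: n * (n - 1), i.e. O(n^2) growth.
def comparisons(window_size):
    return window_size * (window_size - 1)

print(comparisons(2000))   # 3,998,000 -> "some 4 million" comparisons
print(comparisons(50))     # 2,450 comparisons
```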
Effect of Reduced Window Size
Integer programs do not have as much parallelism as floating-point programs
The floating-point programs are scientific codes whose parallelism is largely loop-level parallelism
Instructions that can execute in parallel across loop iterations cannot be found with small window sizes without compiler help
From now on, assume:
Window size of 2000
Maximum of 64 instructions issued per cycle
Effects of Branch Prediction
So far, all branch outcomes are known before the first instruction executes
This is difficult to achieve in hardware or software
Consider 5 alternatives
1. Perfect
2. Tournament predictor ((2,2) prediction scheme with 8K entries)
3. Standard (non-correlating) 2-bit predictor with 512 2-bit entries (see the sketch below)
4. Static (profile-based)
5. None (parallelism limited to within the current basic block)
No penalty for mispredicted branches except for unseen parallelism
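For concreteness, here is a sketch of alternative 3, the standard non-correlating 2-bit scheme: the 512-entry table size matches the slide, while the PC indexing and initial counter state are assumptions for illustration.

```python
# Standard 2-bit saturating-counter branch predictor: one counter per table
# entry, indexed by low-order branch-address bits (indexing is an assumption).
class TwoBitPredictor:
    def __init__(self, entries=512):
        self.table = [1] * entries      # counters 0..3; values >= 2 predict taken
        self.entries = entries

    def predict(self, pc):
        return self.table[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch taken 9 times and then not taken mispredicts only twice
# (warm-up plus the loop exit), which is why loops are easy to predict.
p = TwoBitPredictor()
mispredicts = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x40) != taken:
        mispredicts += 1
    p.update(0x40, taken)
print(mispredicts)   # 2
```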
Branch Prediction Accuracy
Branch Prediction
Accurate prediction is critical to finding ILP
Loops are easy to predict
Independent instructions are separated by many branches in the integer programs and doduc
From now on, assume a tournament predictor
Also assume 2K-entry jump and return predictors
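A return predictor is commonly built as a small return-address stack; the sketch below shows that common scheme for illustration and is not necessarily the exact structure assumed in the study (the 16-entry depth echoes the later "realizable" configuration).

```python
# Return-address-stack sketch: calls push their return address, and a return
# is predicted by popping the most recent entry.
class ReturnStack:
    def __init__(self, depth=16):
        self.stack, self.depth = [], depth

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)             # overflow discards the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnStack()
ras.on_call(0x1004)                        # outer call pushes its return address
ras.on_call(0x2008)                        # nested call
print(hex(ras.predict_return()))           # 0x2008: innermost return predicted first
print(hex(ras.predict_return()))           # 0x1004
```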
Effect of Finite Registers
What if we no longer have infinite registers?
We might then have WAW or WAR hazards
The Alpha 21264 had 41 extra integer and 41 extra FP renaming registers
The IBM Power5 has 88 extra integer and 88 extra FP renaming registers
Effect of Renaming Registers
Renaming Registers
Not a big difference in integer programs
They are already limited by branch prediction and window size, so there are not many speculative paths on which renaming becomes the problem
Many registers are needed to hold live variables in the more predictable floating-point programs
Significant jump in parallelism at 64 renaming registers
We will assume 256 integer and 256 FP registers available for renaming
Imperfect Alias Analysis
Memory references can have dependences too, which so far we have assumed can be eliminated
So far, memory alias analysis has been perfect
Consider 3 models
Global/stack perfect: idealized static program analysis; heap references are assumed to conflict
Inspection: a simpler, realizable compiler technique limited to inspecting base registers and constant offsets (see the sketch below)
None: all memory references are assumed to conflict
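As a rough sketch of the inspection model, with a hypothetical (base register, constant offset) representation of each address: two references are proven independent only when they share a base register but use different offsets, or when their bases are known to address disjoint areas (the stack-pointer/global-pointer pair below is an illustrative example); everything else conservatively conflicts.

```python
# Inspection-style disambiguation sketch: addresses are hypothetical
# (base_register, constant_offset) pairs.
def may_alias(ref_a, ref_b):
    base_a, off_a = ref_a
    base_b, off_b = ref_b
    if base_a == base_b:
        # Same base register: independent exactly when the offsets differ.
        return off_a == off_b
    # Different base registers: assume conflict unless the bases are known to
    # point to disjoint areas (e.g. stack pointer vs. global pointer).
    disjoint = {("sp", "gp"), ("gp", "sp")}
    return (base_a, base_b) not in disjoint

print(may_alias(("sp", 8), ("sp", 16)))   # False: provably different addresses
print(may_alias(("r4", 0), ("r5", 0)))    # True: must conservatively conflict
```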
Effects of Imperfect Alias Analysis
Memory Disambiguation
fpppp and tomcatv use no heap, so for them the global/stack-perfect model is as good as perfect analysis
Perfect analysis is better by roughly a factor of 2, implying that better compiler or dynamic analysis could expose more parallelism
Alias analysis has a big impact on the amount of parallelism
Dynamic memory disambiguation is constrained by:
Each load address must be compared with all in-flight stores (see the sketch below)
The number of references that can be analyzed each clock cycle
The amount of load/store buffering, which determines how far a load/store instruction can be moved
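A minimal sketch of the first constraint, assuming plain integer addresses: a load may bypass earlier stores only if its address mismatches every in-flight store, and any store whose address is still unknown blocks the reordering.

```python
# Dynamic memory disambiguation sketch: before a load is moved ahead of older
# stores, its address is compared against every in-flight store address.
def load_can_issue(load_addr, older_store_addrs):
    for store_addr in older_store_addrs:
        if store_addr is None:         # store address not yet computed
            return False               # conservative: cannot reorder past it
        if store_addr == load_addr:    # true memory (RAW) dependence
            return False
    return True

print(load_can_issue(0x1000, [0x2000, 0x3000]))   # True: no conflicting store
print(load_can_issue(0x1000, [0x2000, None]))     # False: unresolved store blocks
```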
What is realizable?
If our hardware improves, what may be realizable in the near future?
Up to 64 instruction issues per clock (roughly 10 times the issue width in 2005)
A tournament predictor with 1K entries and a 16-entry return predictor
Perfect disambiguation of memory references, done dynamically (ambitious but possible if window size is small)
Register renaming with 64 int and 64 FP registers
Performance on Hypothetical CPU
Hypothetical CPU
Ambitious/impractical hardware assumptions
Unrestricted issue (particularly of memory ops)
Single-cycle operations
Perfect caches
Other directions
Data value prediction and speculation
Address value prediction and speculation
Speculation on multiple paths
Simpler processor with larger cache and higher clock rate vs. more emphasis on ILP with a slower clock and smaller cache
Thank You