Chapter 3: Instruction-Level Parallelism and Its Exploitation

Instruction Parallelism Examples
● Loop-level parallelism
– Loop unrolling (compiler)
– Dynamic unrolling (superscalar scheduling)
● Data parallelism
– Vector computers
● Cray X1, X1E, X2; NEC SX-9
– SIMT
● GPUs
– SIMD
● Short SIMD (SSE, AVX, Intel Phi)
Introduction
● Instruction level parallelism = ILP
– (potential) overlap among instructions
● First universal ILP: pipelining (since 1985)
● Two approaches to ILP
– Discover and exploit parallelism in hardware
● Dominant in server and desktop market segments
● Not used in PMD segment due to energy constraints
– May be changing with Cortex-A9
– Software-based discovery at compile time
● Technical markets, scientific computing, HPC
● Itanium is an example of aggressive software discovery
– But mostly abandoned by the majority of server makers

Types of Dependences
● Data dependences
● Name dependences
● Control dependences
Instruction-Level Parallelism Basics
● Goal: minimize CPI (maximize IPC)
● In a pipelined processor
– CPI = ideal CPI + stalls (overheads):
● Structural stalls
● Data hazard stalls
● Control stalls
● Fundamental unit for extracting parallelism is
– Basic block (block of instructions between branches)
– Branches disrupt analysis; add runtime dependence
● But typical basic blocks are small
– 3-6 instructions (15%-25% branch frequency)
● Optimizing across branches is a must
– Examples: loop-level parallelism, data parallelism (SIMD)

Data Dependence: Basics
● Not all instructions can be executed in parallel
● Data dependent instructions have to be executed “in order”
– Data dependence is a property of the code
● Instruction j is data dependent on instruction i if
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k and k is data dependent on i
● Pipeline interlocks
– With interlocks, data dependence causes a hazard and stall
– Without interlocks, data dependence prohibits the compiler from scheduling instructions with overlap
● Data dependence conveys:
– Possibility of a hazard (= negative side-effect if not “in order”)
– The required order of instructions
– Upper bound on achievable parallelism
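The CPI decomposition above can be made concrete with a small numeric sketch; the stall frequencies below are invented for illustration, not taken from the slides:

```c
#include <assert.h>

/* Hypothetical pipeline: ideal CPI of 1.0 plus the average stall
 * cycles per instruction contributed by each hazard class.
 * All stall numbers here are illustrative, not measured. */
double cpi(double ideal, double structural, double data, double control) {
    return ideal + structural + data + control;
}
```

For example, with 0.05 structural, 0.25 data-hazard, and 0.15 control stall cycles per instruction, `cpi(1.0, 0.05, 0.25, 0.15)` gives roughly 1.45, i.e. an IPC of about 0.69.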
Data Dependence Example
double F0, *R1

Loop:  Load.D    F0, 0(R1)       ; F0 = array element     F0 = R1[0]
       Add.D     F4, F0, F2      ; add scalar in F2       F4 = F0 + F2
       Store.D   F4, 0(R1)       ; store result           R1[0] = F4
       Add.I     R1, R1, #-8     ; decrement by 8B        R1 -= 1
       Branch.NE R1, R2, Loop    ; if (R1 != R2) goto Loop

Data Hazards
● Hazards exist as a result of data or name dependence
– Overlap of dependent (and nearby) instructions could change access order to instructions' operands
– Avoiding hazards ensures program order
● Possible data hazards
– RAW (read after write) ← true data dependence
● Instruction i: write to x
● Instruction j: read from x
– WAW (write after write) ← output dependence
● Instruction i: write to x
● Instruction j: write to x
– WAR (write after read) ← antidependence
● Instruction i: read from x
● Instruction j: write to x
● RAR is not a hazard
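In C, the assembly loop above corresponds to adding a scalar to every element of a double array, walking backwards from the end (a sketch; the variable names mirror the register names):

```c
#include <assert.h>

/* C equivalent of the Load.D/Add.D/Store.D loop: walk the array
 * backwards from a[n-1] down to a[0], adding the scalar f2.
 * Every iteration carries a RAW chain: load -> add -> store. */
void add_scalar(double *a, int n, double f2) {
    for (int i = n - 1; i >= 0; --i)
        a[i] += f2;   /* F0 = R1[0]; F4 = F0 + F2; R1[0] = F4 */
}
```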
Data Dependence Details
● Overcoming data dependence
– Maintain the dependence by preventing the hazard
● By compiler or by hardware scheduler
– Eliminate dependence by transforming the code
● A separate topic
– Dependences that flow through memory
● Is R4[100] the same as R6[20]?
● Is R4[20] the same as R6[20]?

Control Dependence
● Control dependence determines ordering of an instruction with respect to a branch instruction
– The order must be preserved
– The execution should be conditional
● Example:
– if p1 { S1 }
– if p2 { S2 }
– S1 is control dependent on p1
– S2 is control dependent on p2
– S2 is not control dependent on p1
● Branches create barriers in code for potential code motion
● It might be possible to violate control dependence but preserve correct execution with extra hardware
– Speculative execution
Name Dependence
● Name dependence occurs when two instructions use the same register (or memory location) but there is no flow of data between them
● Two types of name dependence
– Antidependence: instruction i reads what instruction j writes
● Store.D F4, 0(R1)
● Add.I R1, R1, #-8
– Output dependence: instructions i and j both write
● Renaming is the common technique to deal with name dependence
– Register renaming
– Shadow registers

Control Dependence Examples
● Exception handling
– Add R2, R3, R4
– Branch.equal0 R2, L1
– Load R1, 0(R2)
– L1: NoOp
– No data dependence between Branch and Load
– Load from wrong R2 could cause exception:
● int *r2 = r3 + r4; y = r2 ? r2[0] : 0;
● Data flow
– Add R1, R2, R3
– Branch.equal0 R4, L
– Subtract R1, R5, R6
– L: NoOp
– Or R7, R1, R8
Control Dependence: Software Speculation
● Ignoring control dependence may be possible after code analysis (liveness property)
– Add R1, R2, R3
– Branch.eq0 R12, Skip
– Subtract R4, R5, R6
– Add R5, R4, R9
– Skip: Or R7, R8, R9
– ; R4 is not used again (is dead)

Original vs. Unrolled Loop
● Original loop:
F0 = R1[0]
F4 = F0 + F2
R1[0] = F4
R1 -= 1
if (R1 != R2) goto Loop
● Loop unrolled 4 times:
F0 = R1[0]
F4 = F0 + F2
R1[0] = F4
F6 = R1[-1]
F8 = F6 + F2
R1[-1] = F8
F10 = R1[-2]
F12 = F10 + F2
R1[-2] = F12
F14 = R1[-3]
F16 = F14 + F2
R1[-3] = F16
R1 -= 4
if (R1 != R2) goto Loop
● New registers: F6, F8, ...
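The same transformation in C, as a sketch: the scalar-add loop unrolled four times, assuming for simplicity that the trip count is a multiple of 4:

```c
#include <assert.h>

/* 4x unrolled version: one backward loop iteration now processes
 * four elements, so the counter update and the branch execute
 * once per four array elements. Assumes n is a multiple of 4. */
void add_scalar_unrolled(double *a, int n, double f2) {
    for (int i = n - 1; i >= 3; i -= 4) {
        a[i]     += f2;   /* F4  = F0  + F2 */
        a[i - 1] += f2;   /* F8  = F6  + F2 */
        a[i - 2] += f2;   /* F12 = F10 + F2 */
        a[i - 3] += f2;   /* F16 = F14 + F2 */
    }
}
```

Each copy of the body uses its own temporaries (F6, F8, ... in the slide), which is what removes the false dependences between copies.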
Compiler Techniques for Exposing ILP
● Unscheduled loop:
Loop: Load F0, 0(R1)
      Stall
      Add F4, F0, F2
      Stall
      Stall
      Store F4, 0(R1)
      Add.I R1, R1, #-8
      Branch.NoEq R1, R2, Loop
● Scheduled loop:
Loop: Load F0, 0(R1)
      Add.I R1, R1, #-8
      Add F4, F0, F2
      Stall
      Stall
      Store F4, 0(R1)
      Branch.NoEq R1, R2, Loop
● Assumed latencies:
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

Loop Unrolling + Pipeline Scheduling
● Unrolled loop:
F0 = R1[0]
F4 = F0 + F2
R1[0] = F4
F6 = R1[-1]
F8 = F6 + F2
R1[-1] = F8
F10 = R1[-2]
F12 = F10 + F2
R1[-2] = F12
F14 = R1[-3]
F16 = F14 + F2
R1[-3] = F16
R1 -= 4
if (R1 != R2) goto Loop
● Unrolled and scheduled loop:
F0 = R1[0]
F6 = R1[-1]
F10 = R1[-2]
F14 = R1[-3]
F4 = F0 + F2
F8 = F6 + F2
F12 = F10 + F2
F16 = F14 + F2
R1[0] = F4
R1[-1] = F8
R1 -= 4
R1[2] = F12
R1[1] = F16
if (R1 != R2) goto Loop
Loop Unrolling Overview
● Loop unrolling simply copies the body of the loop multiple times; each copy operates on a new loop index
● Benefits
– Fewer branch instructions
● Less pressure on branch predictor
– Increased basic block size
● Potential for more parallelism
– Fewer instructions executed
● For example: fewer increments of the loop counter
● Downsides
– Greater register pressure
– Increased use of instruction cache
● Could spill the instruction cache and cause cache thrashing

Unrolling with Generic Loops
● Given: for (k=0; k < N; ++k)
– Let's unroll 4 times
– But what if N is not divisible by 4?
● Solution:
– First, run the N%4 leftover iterations: for (k=0; k < N%4; ++k)
– Then loop over groups of 4:
for (k = N%4; k < N; k += 4)
// unrolled 4 times for k, k+1, k+2, k+3
● Refer to Chapter 4 and the technique called
– Strip mining
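A minimal C sketch of this split, shown for an array sum: the remainder loop runs first, then the 4-way unrolled loop covers the rest:

```c
#include <assert.h>

/* Sum an array with 4-way unrolling. The first loop handles the
 * n % 4 leftover elements so that the unrolled loop always runs
 * over a multiple of 4 elements (compare: strip mining). */
double sum_unrolled(const double *a, int n) {
    double s = 0.0;
    int k = 0;
    for (; k < n % 4; ++k)          /* remainder: 0 .. n%4-1 */
        s += a[k];
    for (; k < n; k += 4)           /* groups of 4 */
        s += a[k] + a[k + 1] + a[k + 2] + a[k + 3];
    return s;
}
```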
Branch Prediction
● Instead of waiting for the branch to finish executing
– Try to predict its behavior and act upon the prediction
● Requirements
– Prediction must be cheaper than executing the branch instruction
● Usually based on a few bits of information
– There has to be a way of dealing with wrong predictions
● Beware of exceptions etc.
● Simple predictor
– Keep a bit (or two) for a (fixed) number of branches
– Every time a branch is taken, increase the count
● If N consecutive executions resulted in “branch taken”, then act as if the next branch will be taken
– If N consecutive executions resulted in “branch not taken”, then start predicting “not taken”

Dynamic Scheduling Basics
● Simple techniques can only eliminate some data dependence stalls
– Pipeline scheduling by compiler
– Forwarding and bypassing
● Dynamic scheduling adds another level, adding parallelism while maintaining the data flow
– Some dependences are not known until runtime
– The same binaries can run efficiently without recompilation
– Compiler might not know the details of the micro-architecture
– There could be unpredictable delays: multi-level caches
● Disadvantages
– Substantial increase in hardware complexity
– Exception handling (imprecise exceptions)
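A 2-bit saturating-counter version of the simple predictor can be sketched as follows; the table size and the PC-indexing scheme are arbitrary illustrative choices, not from the slides:

```c
#include <assert.h>
#include <stdint.h>

/* 2-bit saturating counters: states 0,1 predict not-taken and
 * 2,3 predict taken, so two consecutive mispredictions are
 * needed to flip the prediction. Indexed by low PC bits. */
#define BP_ENTRIES 1024
static uint8_t bp_table[BP_ENTRIES];   /* all start at 0 (not taken) */

int bp_predict(uint32_t pc) {
    return bp_table[pc % BP_ENTRIES] >= 2;   /* 1 = predict taken */
}

void bp_update(uint32_t pc, int taken) {
    uint8_t *c = &bp_table[pc % BP_ENTRIES];
    if (taken && *c < 3) ++*c;        /* saturate at 3 */
    else if (!taken && *c > 0) --*c;  /* saturate at 0 */
}
```

After one taken outcome the counter sits at 1 and still predicts not-taken; a second taken outcome moves it to 2 and flips the prediction.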
Correlating Branch Predictors
● Observation (based on existing codes)
– Branches are correlated with each other
● Application: correlating branch predictors
– Instead of keeping track of each branch individually, look also at the recent M branches
● (M, N) predictor
– Uses behavior of last M branches
● Total of 2^M possible histories, each with its own predictor
– Each predictor has N bits
● Advantages
– Better prediction yield (always test on your own code!)
– Little hardware required to implement it

Dynamic Scheduling Details
● Dynamic scheduling breaks the “in order” execution
– Out-of-order execution
● Incoming instructions are rearranged; delays are unknown until runtime
– Out-of-order completion
● Retired instructions' order depends on code, execution, delays
● New hazards to deal with
– WAR
● Possibility of overwriting a value that has not been read yet
– Load F0, 0(R1) // a load from memory may be stalled for many cycles
– Load R1, #1 // load of a constant takes only a few cycles
– WAW
● Writing twice to the same location
– RAW hazards are still a problem
● They always occur since they are “true data dependences”
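An (M, N) predictor with M = 2 and N = 2 can be sketched like this; the global history selects one of 2^M counter tables, and the table size is again an illustrative choice:

```c
#include <assert.h>
#include <stdint.h>

/* (2, 2) correlating predictor sketch: a 2-bit global history of
 * the last M=2 branch outcomes selects one of 2^M tables; each
 * entry is an N=2-bit saturating counter. Sizes are illustrative. */
#define M_HIST  2
#define ENTRIES 256
static uint8_t cp_table[1 << M_HIST][ENTRIES];
static uint8_t cp_hist;                 /* last M outcomes, 1 bit each */

int cp_predict(uint32_t pc) {
    return cp_table[cp_hist][pc % ENTRIES] >= 2;   /* 1 = taken */
}

void cp_update(uint32_t pc, int taken) {
    uint8_t *c = &cp_table[cp_hist][pc % ENTRIES];
    if (taken && *c < 3) ++*c;
    else if (!taken && *c > 0) --*c;
    /* shift the outcome into the global history */
    cp_hist = ((cp_hist << 1) | (taken & 1)) & ((1 << M_HIST) - 1);
}
```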
Tournament Branch Predictors
● Problem
– Branches might be badly mispredicted when moving between program scopes
– The branch prediction information from the inner scope is inadequate for the outer scope
● Observation
– There is locality in branching
● Inner and outer loops, etc.
● Solution
– Combine local and global information
● Typical predictors
– Size: 8K-32K bits
– Local predictors unchanged
● Examples: DEC Alpha, AMD Phenom and Opteron

Dynamic Scheduling and Hazards
● F0 = F2 / F4
● F6 = F0 + F8      – true data dependence (RAW) on F0
● R1[0] = F6        – true data dependence (RAW) on F6
● F8 = F10 – F14    – antidependence (WAR) on F8, read by the add
● F6 = F10 * F8     – output dependence (WAW) on F6, written by the add; antidependence (WAR) on F6, read by the store
Register Renaming Example
Before renaming:          After renaming:
● F0 = F2 / F4            ● F0 = F2 / F4
● F6 = F0 + F8            ● S = F0 + F8
● R1[0] = F6              ● R1[0] = S
● F8 = F10 – F14          ● T = F10 – F14
● F6 = F10 * F8           ● F6 = F10 * T
● Only RAW hazards remain

Pipelined Execution
Clock cycles
Instruction  1      2       3       4       5       6       7       8
Instr. I     fetch  decode  exe     mem     write
Instr. I+1          fetch   decode  exe     mem     write
Instr. I+2                  fetch   decode  exe     mem     write
Instr. I+3                          fetch   decode  exe     mem
Instr. I+4                                  fetch   decode  exe
Instr. I+5                                          fetch   decode
Instr. I+6                                                  fetch

With a stall:
Instr. I     fetch  decode  exe     mem     write
Instr. I+1          fetch   decode  exe     mem     write
Instr. I+2                  fetch   decode  exe     mem     write
Instr. I+3                          stall   fetch   decode  exe
Instr. I+4                                          fetch   decode
Instr. I+5                                                  fetch

Deeper pipeline (s = stall):
I1  F1 F2 R  X1 X2 X3 D1 D2 T  W
I2  F1 F2 R  X1 X2 X3 X4 D1 D2 T  W
I3  F1 F2 R  X1 s  D1 s  s  D2 T  W
Register Renaming Details
● Register renaming provided by reservation stations (RS)
● Each entry of RS contains
– Instruction
– Buffered operand values (when available)
– References to instructions in RS that will provide values
● Operation
– RS fetches and buffers an operand when available
● Might bypass a register
– Pending instructions indicate the RS where they send their output
● Results broadcast on result bus (Common Data Bus)
– Only the last output updates the register file
– Upon instruction issue, registers are renamed with references to RS
● There may be more RS than registers!

Tomasulo's Algorithm
● Tomasulo's approach allows...
– Out-of-order execution (as in scoreboarding in ARM A8)
● Unlike scoreboarding, Tomasulo can handle anti- and output dependences by renaming (four-issue Intel i7)
– Extension to handle speculation
● In Tomasulo, each instruction goes through three steps
– Issue
● FIFO queue maintains correct data flow
● Transfer instruction to RS if available, or structural stall
● Rename registers to eliminate WAR and WAW (stall if no data)
– Execute
● Monitor bus for new data and distribute it to waiting RS (RAW)
● Execute instructions in functional units when operands available
– Write result (other RS, registers, store buffers)
● Store buffer waits for address, value and memory unit(s)
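The reservation-station fields described above can be sketched as a C structure; the field names are my own but mirror the Vj/Vk/Qj/Qk/A notation used in the tables that follow:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* One reservation-station entry, Tomasulo-style. Qj/Qk name the
 * RS that will produce an operand (0 = no producer pending);
 * Vj/Vk buffer the operand values once they are available. */
typedef struct {
    bool     busy;     /* entry in use */
    uint8_t  op;       /* opcode of the buffered instruction */
    double   vj, vk;   /* operand values, valid when qj/qk == 0 */
    uint16_t qj, qk;   /* producing RS tags; 0 means value ready */
    uint32_t addr;     /* A field: load/store address */
} rs_entry;

/* An instruction may begin executing once both operands are
 * buffered values, i.e. no producer tag is still outstanding. */
bool rs_ready(const rs_entry *e) {
    return e->busy && e->qj == 0 && e->qk == 0;
}
```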
Tomasulo Approach
● Introduced by Robert Tomasulo
– Implemented in IBM 360/91 in its floating-point unit
– IBM 360/91 had long memory and floating-point delays
● Only 4 floating-point registers
– Binary compatibility was important for IBM customers
● Modern processors use a variation of Tomasulo's approach
– Also in use is a simpler algorithm called scoreboarding

Example Dynamic Execution: All Issued
Instruction status:
                 Issue  Execute  Write result
Load f6, (r2)    x      x        x
Load f2, (r3)    x      x
Mult f0,f2,f4    x
Sub f8,f2,f6     x
Div f9,f0,f6     x
Add f6,f8,f2     x

Reservation stations:
        Busy  Op    Vj  Vk       Qj     Qk     A
Load1   no
Load2   yes   load                             Reg[r3]
Add1    yes   sub       Mem[r2]  Load2
Add2    yes   add                Add1   Load2
Add3    no
Mult1   yes   mul       Reg[f4]  Load2
Mult2   yes   div       Mem[r2]  Mult1

Register status:
f0: Mult1   f2: Load2   f6: Add2   f8: Add1   f9: Mult2
Example Dynamic Execution: Mult Ready
Instruction status:
                 Issue  Execute  Write result
Load f6, (r2)    x      x        x
Load f2, (r3)    x      x        +
Mult f0,f2,f4    x      +
Sub f8,f2,f6     x      +        +
Div f9,f0,f6     x
Add f6,f8,f2     x      +        +

Reservation stations:
        Busy  Op    Vj       Vk       Qj     Qk  A
Load1   no
Load2   no
Add1    no
Add2    no
Add3    no
Mult1   yes   mul   Mem[r3]  Reg[f4]
Mult2   yes   div            Mem[r2]  Mult1

Register status:
f0: Mult1   f2:   f6:   f8:   f9: Mult2

Reorder Buffer
● Another set of (invisible to programmer) registers for intermediate results
● ROB registers hold data after instruction completion but before instruction commit
● Each ROB entry (register) contains additional fields:
– Instruction type
● Branch (no destination), store (memory destination), register op (ALU or register destination)
– Destination
● Register number (for loads or ALU ops) or memory address (for stores)
– Value
– Ready
● Indicates whether the supplying instruction completed its execution
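The four ROB-entry fields listed above map naturally onto a small C structure; the names are illustrative, and real designs carry more state:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* A reorder-buffer entry with the four fields listed above. */
typedef enum { ROB_BRANCH, ROB_STORE, ROB_REGOP } rob_type;

typedef struct {
    rob_type type;    /* branch, store, or register operation */
    uint32_t dest;    /* register number or memory address */
    double   value;   /* result, valid once ready is set */
    bool     ready;   /* producing instruction finished executing */
} rob_entry;

/* The entry at the head of the ROB may commit only once the
 * instruction that supplies its value has completed execution. */
bool rob_can_commit(const rob_entry *head) {
    return head->ready;
}
```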
Hardware Speculation Basics
● After dealing with data dependences, control dependences become an issue
– Branch prediction is not as effective with multiple instructions in-flight
● If predicted “taken”, conditional instructions are fetched and issued
– Speculation allows the processor to proceed almost as if the branch was not there
● Conditional instructions are fetched, issued, and executed
● Hardware speculation comprises
– Dynamic branch prediction
– Speculative execution: instructions are executed and possibly undone
– Dynamic scheduling
● More basic blocks available after branches are speculated out of the instruction stream

Reorder Buffer in Action
● Issue
– Has to wait until a ROB entry is available (in addition to an RS entry)
● Execute
– Results from the Common Data Bus will have to end up in the ROB
● Write result
– Results have to be copied to the ROB
● Commit (also called completion, graduation)
– Normal commit: store result in destination, mark ROB entry as empty
– Store commit: destination is a memory location
– Branch commit:
● If correct prediction: no action needed
● If incorrect prediction: ROB result is thrown away and instructions restarted at the correct branch point
Hardware Speculation Components
● Additional step in instruction execution
– Issue, Execute, Write result, Commit
● Reorder buffer (ROB)
● Handling of...
– Mispredictions
– Mis-speculations
– Exceptions

Reorder Buffer Exception Handling
● Exceptions are not recognized until they are ready to commit
● ROB records exceptions
– On mispredictions: flush the exception
– Upon reaching the head of the ROB: raise the exception
Speculation at Compile Time
● Original code:
if (x==0) {
  a += 1
  b += 1
  c += 1
}
● Speculate a, undo if wrong:
a += 1
if (x==0) {
  b += 1
  c += 1
} else {
  a -= 1
}
● Speculate into copies, commit conditionally:
a_copy = a+1
b_copy = b+1
c_copy = c+1
if (x==0) {
  a = a_copy
  b = b_copy
  c = c_copy
}

VLIW Disadvantages
● Static parallelism
– Must be discovered and exploited early
● Preferably by the compiler
● Potential for intermediate representations, bytecodes
● Large code size
– Parallelism relies on large basic blocks
– Clever encoding or on-the-fly decompression may be needed
● Lack of hazard detection in lockstep execution
● Binary compatibility
– Take code from a 2-issue VLIW to a (next-gen) 3-issue VLIW
– Add a single ALU unit to the new processor and the old code will not take advantage of it
– New (wider, with more functions) processors could change instruction encoding
● ISA must provide for future hardware expansions
Multiple Issue Execution
● All the techniques presented so far lead to ideal CPI = 1
● For CPI to go below 1, multiple instructions need to be retired most of the time
– Too many stalls can quickly increase CPI above 1
– See Amdahl's law
● Most common flavors of multiple issue processors
– Statically scheduled processors
● In-order execution
● Examples: MIPS, ARM
– VLIW (very long instruction word) processors
● Each cycle issues multiple (fixed number of) instructions
● Examples: DSPs, Itaniums, some GPUs
– Dynamically scheduled superscalar processors
● Out-of-order execution
● Examples: Intel i3-7, AMD Phenom, IBM POWER7

Tomasulo Recap
[Figure: Tomasulo datapath — instruction queue (0x0: Load F2, 0(R1); 0x1: Mul F0, F2, F4; 0x2: ...), register file F0-F4, reservation stations (Mul v1, v2, v3; Load v1, v2; empty RS entries), a memory buffer, Multiplier/Divider and Adder/Subtractor functional units, all connected by the Common Data Bus to the memory hierarchy]
VLIW Processor Basics
● How many instructions per cycle?
– Two-issue is commonplace
– Four-issue is manageable
● Scheduling techniques
– Local
● Basic blocks
– Global
● Across branches
– Trace
● VLIW-specific
● Extensive loop unrolling to generate large basic blocks
● Disadvantages
– Static parallelism, large code size, lack of hazard detection for lockstep execution, binary compatibility

Multiple Issue Taxonomy
Common name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Mostly in the embedded space: MIPS and ARM (Cortex-A8)
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution but no speculation | None at present
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Out-of-order execution with speculation | Intel Core i3-i7; AMD Phenom; IBM POWER7
VLIW/LIW | Static | Primarily software | Static | All hazards determined and indicated by compiler (often implicitly) | Most examples are in signal processing, such as TI C6x. Also some GPUs
EPIC (Explicitly Parallel Instruction Computing) | Primarily static | Primarily software | Mostly static | All hazards determined and indicated explicitly by the compiler | Itanium
VLIW Processors Basic Design
● Package multiple operations into one instruction
– Instruction bundles
● Example VLIW processor
– One integer instruction (or branch)
– Two independent floating-point operations
– Two independent memory references
– Notice: there are restrictions on the instructions
● There must be enough parallelism in code to fill the available slots
– Compiler: aggressive loop unrolling
– Programmer: program restructuring

Return Address Predictor
● Branch prediction deals with conditional branches
● Most unconditional branches come from function returns
● But the same function can be called from multiple sites
– This may cause the branch prediction buffer to forget about the return address from previous calls
● Solution
– Create a return address buffer organized as a stack
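A return-address stack can be sketched in a few lines of C; the depth is an illustrative choice, and this version simply drops pushes on overflow rather than wrapping around:

```c
#include <assert.h>
#include <stdint.h>

/* Return-address stack sketch: calls push the return address,
 * returns predict by popping. Depth is illustrative; on overflow
 * the newest entries are dropped rather than wrapping. */
#define RAS_DEPTH 16
static uint32_t ras[RAS_DEPTH];
static int ras_top;   /* number of valid entries */

void ras_call(uint32_t return_addr) {
    if (ras_top < RAS_DEPTH)
        ras[ras_top++] = return_addr;
}

uint32_t ras_return(void) {
    return ras_top > 0 ? ras[--ras_top] : 0;   /* 0 = no prediction */
}
```

Because the stack mirrors the call nesting, a function called from several different sites still has its return correctly predicted, which a single branch-prediction buffer entry cannot do.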
Modern Microarchitectures
● Combine:
– Dynamic scheduling
– Multiple issue
– Speculation
● Two approaches to dealing with dependences
– Assign reservation stations and update pipeline control table
in half clock cycles
● Only supports 2 instructions/clock
– Design logic to handle any possible dependences between
the instructions
● Notice: design complexity
– Hybrid approaches
● New bottleneck:
– Issue logic
Modern Multiple Issue
● Limit the complexity of a single instruction “bundle”
– Limit bundle size
– Limit classes of instruction in a bundle
● One integer, two floating-point
● With limited size, all dependences in a bundle can be
examined
● Dependences from a small bundle can also be fully encoded
in RS
● Another bottleneck:
– Completion/commit unit
– Need multiple such units to keep up with incoming instructions