CS 352H: Computer Systems Architecture

Topic 10: Instruction Level Parallelism (ILP)

October 6-8, 2009

University of Texas at Austin, CS352H: Computer Systems Architecture, Fall 2009, Don Fussell

This lecture discusses techniques for increasing instruction-level parallelism (ILP) in processors: deeper pipelining, multiple issue (static and dynamic), speculation, loop unrolling, and dynamic pipeline scheduling. The key techniques are executing multiple instructions simultaneously through deeper pipelining with multiple functional units, issuing instructions out of order to avoid stalls while preserving program semantics, and executing instructions speculatively to hide latencies.
Instruction-Level Parallelism (ILP)

Pipelining: executing multiple instructions in parallel.

To increase ILP:
- Deeper pipeline: less work per stage ⇒ shorter clock cycle
- Multiple issue: replicate pipeline stages ⇒ multiple pipelines; start multiple instructions per clock cycle
  - CPI < 1, so use instructions per cycle (IPC) instead
  - E.g., a 4 GHz, 4-way multiple-issue machine peaks at 16 BIPS: peak CPI = 0.25, peak IPC = 4
  - But dependencies reduce this in practice

Multiple Issue

Static multiple issue:
- Compiler groups instructions to be issued together and packages them into "issue slots"
- Compiler detects and avoids hazards

Dynamic multiple issue:
- CPU examines the instruction stream and chooses instructions to issue each cycle
- Compiler can help by reordering instructions
- CPU resolves hazards using advanced techniques at runtime

Speculation

"Guess" what to do with an instruction:
- Start the operation as soon as possible
- Check whether the guess was right
  - If so, complete the operation
  - If not, roll back and do the right thing
- Common to static and dynamic multiple issue

Examples:
- Speculate on branch outcome: roll back if the path taken is different
- Speculate on a load: roll back if the location is updated

Compiler/Hardware Speculation

Compiler can reorder instructions:
- e.g., move a load before a branch (sketched below)
- Can include "fix-up" instructions to recover from an incorrect guess

Hardware can look ahead for instructions to execute:
- Buffer results until it determines they are actually needed
- Flush buffers on incorrect speculation
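
A minimal sketch of the compiler side, with hypothetical registers and labels (not from the slides): the load is hoisted above the branch that used to precede it, hiding its latency when the guess is right.

      # Before: the load waits behind the branch
      #       beq  $t1, $t2, Else
      #       lw   $t0, 0($s0)     # long-latency load on the fall-through path
      # After: the compiler hoists the load above the branch
              lw   $t0, 0($s0)     # speculative load: starts before the branch resolves
              beq  $t1, $t2, Else  # guess: fall-through path is taken
              ...                  # right guess: $t0 is ready when needed
      Else:   ...                  # wrong guess: $t0 is simply unused here; if the
                                   # load had overwritten a live register, fix-up
                                   # code would restore it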

Speculation and Exceptions

What if an exception occurs on a speculatively executed instruction?
- e.g., a speculative load before a null-pointer check

Static speculation:
- Can add ISA support for deferring exceptions (sketched below)

Dynamic speculation:
- Can buffer exceptions until instruction completion (which may not occur)
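
A minimal sketch of deferred exceptions, using hypothetical mnemonics (lw.s, chk.s) modeled loosely on speculative-load/check designs such as IA-64; these are not real MIPS instructions:

              lw.s  $t0, 0($s0)      # hypothetical speculative load: a fault
                                     # poisons $t0 instead of raising an exception
              beq   $s0, $zero, Skip # the null-pointer check the load was hoisted above
              chk.s $t0, Fixup       # hypothetical check: if $t0 is poisoned,
                                     # branch to fix-up code that redoes the load
      Skip:   ...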

Static Multiple Issue

Compiler groups instructions into "issue packets":
- A group of instructions that can be issued in a single cycle
- Determined by the pipeline resources required

Think of an issue packet as a very long instruction specifying multiple concurrent operations ⇒ Very Long Instruction Word (VLIW).

Scheduling Static Multiple Issue

Compiler must remove some/all hazards:
- Reorder instructions into issue packets
- No dependencies within a packet
- Possibly some dependencies between packets
  - This varies between ISAs; the compiler must know the rules!
- Pad with nop if necessary

MIPS with Static Dual Issue

Two-issue packets:
- One ALU/branch instruction
- One load/store instruction
- 64-bit aligned
  - ALU/branch first, then load/store
  - Pad an unused slot with nop

Address   Instruction type   Pipeline stages
n         ALU/branch         IF  ID  EX  MEM WB
n + 4     Load/store         IF  ID  EX  MEM WB
n + 8     ALU/branch             IF  ID  EX  MEM WB
n + 12    Load/store             IF  ID  EX  MEM WB
n + 16    ALU/branch                 IF  ID  EX  MEM WB
n + 20    Load/store                 IF  ID  EX  MEM WB
MIPS with Static Dual Issue

[Figure: the static dual-issue MIPS datapath.]
Hazards in the Dual-Issue MIPS

More instructions executing in parallel means more hazards.

EX data hazard:
- Forwarding avoided stalls in the single-issue pipeline
- Now an ALU result can't be used by a load/store in the same packet:
      add $t0, $s0, $s1
      lw  $s2, 0($t0)
- These must be split into two packets, effectively a stall (sketched below)

Load-use hazard:
- Still one cycle of use latency, but it now covers two instructions
- More aggressive scheduling is required
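
A minimal sketch of the packet split for the add/lw pair above, following the slot format of the earlier slides (nop fills each unused slot):

      ALU/branch slot         Load/store slot
      add $t0, $s0, $s1       nop                # packet 1: produce $t0
      nop                     lw  $s2, 0($t0)    # packet 2: $t0 forwarded to the load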

Scheduling Example

Schedule this for the dual-issue MIPS:

Loop: lw   $t0, 0($s1)        # $t0 = array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0

      ALU/branch              Load/store          Cycle
Loop: nop                     lw $t0, 0($s1)      1
      addi $s1, $s1, -4       nop                 2
      addu $t0, $t0, $s2      nop                 3
      bne  $s1, $zero, Loop   sw $t0, 4($s1)      4

(The sw offset becomes 4($s1) because the addi that decrements $s1 has been moved above it.)

IPC = 5/4 = 1.25 (cf. peak IPC = 2)

Loop Unrolling

Replicate the loop body to expose more parallelism:
- Reduces loop-control overhead

Use different registers per replication:
- Called "register renaming"
- Avoids loop-carried "anti-dependencies": a store followed by a load of the same register
- Aka "name dependence": reuse of a register name

(The unrolled loop is sketched below; the scheduled version follows on the next slide.)
Loop Unrolling Example

      ALU/branch              Load/store          Cycle
Loop: addi $s1, $s1, -16      lw $t0, 0($s1)      1
      nop                     lw $t1, 12($s1)     2
      addu $t0, $t0, $s2      lw $t2, 8($s1)      3
      addu $t1, $t1, $s2      lw $t3, 4($s1)      4
      addu $t2, $t2, $s2      sw $t0, 16($s1)     5
      addu $t3, $t3, $s2      sw $t1, 12($s1)     6
      nop                     sw $t2, 8($s1)      7
      bne  $s1, $zero, Loop   sw $t3, 4($s1)      8

IPC = 14/8 = 1.75

Closer to the peak of 2, but at the cost of extra registers and code size.

Dynamic Multiple Issue

"Superscalar" processors:
- CPU decides whether to issue 0, 1, 2, … instructions each cycle, avoiding structural and data hazards
- Avoids the need for compiler scheduling
  - Though it may still help
- Code semantics are ensured by the CPU

Dynamic Pipeline Scheduling

Allow the CPU to execute instructions out of order to avoid stalls, but commit results to registers in order.

Example:
      lw   $t0, 20($s2)
      addu $t1, $t0, $t2     # must wait for the lw to produce $t0
      sub  $s4, $s4, $t3     # independent of the load
      slti $t5, $s4, 20

The CPU can start the sub while the addu is waiting for the lw.

Dynamically Scheduled CPU

[Figure: block diagram of a dynamically scheduled CPU. An in-order issue unit preserves dependencies as it dispatches instructions to reservation stations, which hold pending operands. Functional-unit results are also sent to any waiting reservation stations. A reorder buffer holds results for in-order register writes and can supply operands for issued instructions.]
Register Renaming

Reservation stations and the reorder buffer effectively provide register renaming (sketched below).

On instruction issue to a reservation station:
- If an operand is available in the register file or reorder buffer:
  - It is copied to the reservation station
  - It is no longer required in the register, which can be overwritten
- If an operand is not yet available:
  - It will be provided to the reservation station by a functional unit
  - A register update may not be required
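
A minimal sketch (hypothetical code, not from the slides) of the name dependences that renaming removes; reusing $t0 creates WAR/WAW hazards even though no data flows between the two uses:

      lw   $t0, 0($s1)       # first use of the name $t0
      addu $t1, $t0, $s2     # reads $t0 (RAW: a real dependence)
      lw   $t0, 4($s1)       # reuses the name $t0: WAW with the first lw,
                             # WAR with the addu, but no real data flow
      addu $t2, $t0, $s2

With renaming, the two lw results live in different physical locations (reservation-station/reorder-buffer entries), so the second lw can execute before the addu has read its operand.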

Speculation

Predict branches and continue issuing:
- Don't commit until the branch outcome is determined

Load speculation:
- Avoid load and cache-miss delay
  - Predict the effective address
  - Predict the loaded value
  - Load before completing outstanding stores
  - Bypass stored values to the load unit (sketched below)
- Don't commit the load until the speculation is cleared
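
A minimal sketch of store-to-load bypassing (hypothetical code, not from the slides): once the addresses are known to match, the load's value can come straight from the buffered store, before the store reaches memory:

      sw   $t0, 0($s1)       # store still buffered, not yet committed to memory
      lw   $t1, 0($s1)       # same address: the store buffer can bypass $t0
                             # directly to the load unit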

Why Do Dynamic Scheduling?

Why not just let the compiler schedule code?
- Not all stalls are predictable
  - e.g., cache misses
- Can't always schedule around branches
  - Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

Does Multiple Issue Work?

Yes, but not as much as we'd like:
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing (sketched below)
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well
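
A minimal illustration of the aliasing problem (hypothetical code, not from the slides): neither the compiler nor the hardware can safely move the load above the store unless it can prove the two addresses differ:

      sw   $t0, 0($s3)       # store through one pointer
      lw   $t1, 0($s4)       # load through another pointer; if $s4 == $s3,
                             # the load must see the stored value, so it
                             # cannot be hoisted above the sw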

Power Efficiency

Complexity of dynamic scheduling and speculation requires power. Multiple simpler cores may be better.

Microprocessor   Year  Clock rate  Pipeline  Issue  Out-of-order/  Cores  Power
                                   stages    width  speculation
i486             1989  25 MHz      5         1      No             1      5 W
Pentium          1993  66 MHz      5         2      No             1      10 W
Pentium Pro      1997  200 MHz     10        3      Yes            1      29 W
P4 Willamette    2001  2000 MHz    22        3      Yes            1      75 W
P4 Prescott      2004  3600 MHz    31        3      Yes            1      103 W
Core             2006  2930 MHz    14        4      Yes            2      75 W
UltraSparc III   2003  1950 MHz    14        4      No             1      90 W
UltraSparc T1    2005  1200 MHz    6         1      No             8      70 W
The Opteron X4 Microarchitecture

[Figure: Opteron X4 microarchitecture block diagram; 72 physical registers.]
The Opteron X4 Pipeline Flow

[Figure: pipeline flow for integer operations. FP is 5 stages longer.]

- Up to 106 RISC-ops in progress

Bottlenecks:
- Complex instructions with long dependencies
- Branch mispredictions
- Memory access delays
Concluding Remarks

Multiple issue and dynamic scheduling (ILP):
- Dependencies limit achievable parallelism
- Complexity leads to the power wall

