Outline: Code Scheduling

The document outlines topics related to instruction scheduling and code optimization. It discusses modern processor architectures, delay slots, instruction scheduling techniques like list scheduling and dealing with resource constraints. It also covers scheduling across basic blocks, tracing scheduling, scheduling for loops, loop unrolling, and software pipelining. Simple machine models, execution models, and techniques for handling branch instructions like branch delay slots are also explained. Filling the branch delay slot by moving an earlier non-branch instruction is recommended.


Code Scheduling
Kostis Sagonas, Spring 2006

Outline

• Modern architectures
• Delay slots
• Introduction to instruction scheduling
• List scheduling
• Resource constraints
• Interaction with register allocation
• Scheduling across basic blocks
• Trace scheduling
• Scheduling for loops
• Loop unrolling
• Software pipelining

Simple Machine Model

• Instructions are executed in sequence
  – Fetch, decode, execute, store results
  – One instruction at a time
• For branch instructions, start fetching from a different location if needed
  – Check branch condition
  – Next instruction may come from a new location given by the branch instruction

Simple Execution Model

5-stage pipeline: fetch, decode, execute, memory, write back

  Fetch: get the next instruction
  Decode: figure out what that instruction is
  Execute: perform the ALU operation, or the address calculation in a memory operation
  Memory: do the memory access in a memory operation
  Write Back: write the results back

Execution Models

Model 1 (instructions do not overlap), time runs left to right

    Inst 1   IF DE EXE MEM WB
    Inst 2                    IF DE EXE MEM WB

Model 2 (pipelined)

    Inst 1   IF DE EXE MEM WB
    Inst 2      IF DE EXE MEM WB
    Inst 3         IF DE EXE MEM WB
    Inst 4            IF DE EXE MEM WB
    Inst 5               IF DE EXE MEM WB
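The difference between the two execution models can be put into numbers. The sketch below is only an illustration: it assumes that Model 1 starts an instruction after the previous one has left the pipeline, that Model 2 starts one instruction per cycle, and that nothing ever stalls. The function names are made up for this example.

```python
def cycles_without_overlap(n_instructions: int, n_stages: int = 5) -> int:
    """Model 1: the next instruction starts only after the previous one finishes."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int = 5) -> int:
    """Model 2: one instruction enters the pipeline per cycle, with no stalls."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles, every later one adds a cycle.
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    for n in (2, 5, 100):
        print(n, cycles_without_overlap(n), cycles_pipelined(n))
```

For the five instructions drawn under Model 2 this gives 9 cycles instead of 25, which is why the rest of the material concentrates on keeping the pipeline free of stalls.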
Handling Branch Instructions

Problem: We do not know the location of the next instruction until later
  – after DE in jump instructions
  – after EXE in conditional branch instructions

    Branch      IF DE EXE MEM WB
    ???            IF DE EXE MEM WB
    ???               IF DE EXE MEM WB
    Next inst            IF DE EXE MEM WB

What to do with the middle 2 instructions?

1. Stall the pipeline in case of a branch until we know the address of the next instruction
  – wasted cycles

    Branch      IF DE EXE MEM WB
    Next inst            IF DE EXE MEM WB

Handling Branch Instructions

What to do with the middle 2 instructions?

2. Delay the action of the branch
  – Make the branch take effect only after two instructions
  – The two instructions following the branch get executed regardless of the branch

    Branch               IF DE EXE MEM WB
    Next seq inst           IF DE EXE MEM WB
    Next seq inst              IF DE EXE MEM WB
    Branch target inst            IF DE EXE MEM WB

Branch Delay Slot(s)

MIPS has a branch delay slot
  – The instruction after a conditional branch gets executed even if the code branches to the target
  – Fetching from the branch target takes place only after that

    ble r3, foo
    ...              <- branch delay slot

What instruction to put in the branch delay slot?

Filling the Branch Delay Slot

Simple solution: put a no-op
  – Wasted instruction, just like a stall

    ble r3, lbl
    noop             <- branch delay slot

Move an instruction from above the branch

    Before:              After:
    prev_instr           ble r3, lbl
    ble r3, lbl          prev_instr     <- branch delay slot

• The moved instruction executes iff the branch executes
  – So, get the instruction from the same basic block as the branch
  – Don't move a branch instruction!
• The instruction needs to be moved over the branch
  – The branch must not depend on the result of the instruction
Filling the Branch Delay Slot

Move an instruction dominated by the branch instruction

    ble r3, lbl
    dom_instr        <- branch delay slot
    ...
  lbl:
    instr

Move an instruction from the branch target
  – Instruction dominated by target
  – No other ways to reach target (if so, take care of them)
  – If conditional branch, the moved instruction should not have a lasting effect if the branch is not taken

    ble r3, lbl
    instr            <- branch delay slot
    ...
  lbl:
    instr

Load Delay Slots

Problem: Results of the loads are not available until the end of the MEM stage

    Load          IF DE EXE MEM WB
    Use of load      IF DE EXE MEM WB

If the value of the load is used... what to do?
• Always stall one cycle
• Stall one cycle if the next instruction uses the value
  – Need hardware to do this
• Have a delay slot for load
  – The new value is only available after two instructions
  – If the next instr. uses the register, it will get the old value

    Load          IF DE EXE MEM WB
    ???              IF DE EXE MEM WB
    Use of load         IF DE EXE MEM WB

Example

Assume 1 cycle delay on branches and 1 cycle latency for loads.

Original code:

    r2 = *(r1 + 4)
    r3 = *(r1 + 8)
    r4 = r2 + r3
    r5 = r2 - 1
    goto L1

With no-ops in the load and branch delay slots:

    r2 = *(r1 + 4)
    r3 = *(r1 + 8)
    noop
    r4 = r2 + r3
    r5 = r2 - 1
    goto L1
    noop

With r5 = r2 - 1 in place of the first no-op:

    r2 = *(r1 + 4)
    r3 = *(r1 + 8)
    r5 = r2 - 1
    r4 = r2 + r3
    goto L1
    noop

Final code after delay slot filling (r4 = r2 + r3 in the branch delay slot):

    r2 = *(r1 + 4)
    r3 = *(r1 + 8)
    r5 = r2 - 1
    goto L1
    r4 = r2 + r3
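The last step of the example, moving an instruction from above the branch into the branch delay slot, can be sketched in a few lines. This is a minimal illustration and not a production pass: the `Instr` record, the `independent` test and `fill_branch_delay_slot` are names invented here, only register dependencies are checked (memory dependencies are ignored), and the block is assumed to be a single basic block that ends with its branch.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]        # register written, if any
    srcs: Tuple[str, ...]      # registers read
    is_branch: bool = False


NOOP = Instr("noop", None, ())


def independent(a: Instr, b: Instr) -> bool:
    """True if a and b share no true, anti or output register dependence."""
    if a.dest is not None and (a.dest in b.srcs or a.dest == b.dest):
        return False
    if b.dest is not None and b.dest in a.srcs:
        return False
    return True


def fill_branch_delay_slot(block):
    """block: the instructions of one basic block, ending with its branch."""
    *body, branch = block
    assert branch.is_branch
    # Scan upwards for an instruction that may be moved past everything below it.
    for i in range(len(body) - 1, -1, -1):
        candidate = body[i]
        later = body[i + 1:] + [branch]
        if all(independent(candidate, other) for other in later):
            return body[:i] + body[i + 1:] + [branch, candidate]
    return body + [branch, NOOP]   # nothing movable: the slot is wasted


block = [
    Instr("r2 = *(r1 + 4)", "r2", ("r1",)),
    Instr("r3 = *(r1 + 8)", "r3", ("r1",)),
    Instr("r5 = r2 - 1", "r5", ("r2",)),
    Instr("r4 = r2 + r3", "r4", ("r2", "r3")),
    Instr("goto L1", None, (), True),
]
filled = [instr.text for instr in fill_branch_delay_slot(block)]
print(filled)
```

Starting from the code in which `r5 = r2 - 1` already sits after the second load, the sketch moves `r4 = r2 + r3` behind `goto L1`, matching the final code of the example.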

From a Simple Machine Model to a Real Machine Model

• Many pipeline stages
  – MIPS R4000 has 8 stages
• Different instructions take different amounts of time to execute
  – mult 10 cycles
  – div 69 cycles
  – ddiv 133 cycles
• Hardware to stall the pipeline if an instruction uses a result that is not ready

Real Machine Model cont.

• Most modern processors have multiple execution units (superscalar)
  – If the instruction sequence is correct, multiple operations will take place in the same cycle
  – Even more important to have the right instruction sequence
Instruction Scheduling

Goal: Reorder instructions so that pipeline stalls are minimized

Constraints on Instruction Scheduling:
  – Data dependencies
  – Control dependencies
  – Resource constraints

Data Dependencies

• If two instructions access the same variable, they can be dependent
• Kinds of dependencies
  – True: write → read
  – Anti: read → write
  – Output: write → write
• What to do if two instructions are dependent?
  – The order of execution cannot be reversed
  – Reduces the possibilities for scheduling
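The three kinds of dependence can be told apart mechanically from the registers each instruction writes and reads. The following sketch assumes a made-up representation in which an instruction is a `(dest, srcs)` pair; the function name `dependences` and the third sample instruction are illustrative and do not come from the slides.

```python
def dependences(first, second):
    """Kinds of register dependence from `first` to a later instruction `second`.

    Each instruction is a (dest, srcs) pair: the register it writes (or None)
    and the set of registers it reads.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = []
    if dest1 is not None and dest1 in srcs2:
        kinds.append("true")      # write -> read
    if dest2 is not None and dest2 in srcs1:
        kinds.append("anti")      # read -> write
    if dest1 is not None and dest1 == dest2:
        kinds.append("output")    # write -> write
    return kinds


load_a = ("r2", {"r1"})          # r2 = *(r1 + 4)
add = ("r4", {"r2", "r3"})       # r4 = r2 + r3
load_b = ("r2", {"r5"})          # r2 = *(r5 + 4), hypothetical

print(dependences(load_a, add))     # ['true']
print(dependences(add, load_b))     # ['anti']
print(dependences(load_a, load_b))  # ['output']
```

An empty result means the pair may be reordered as far as registers are concerned; memory accesses need the separate analyses listed under "Computing Data Dependencies".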

Computing Data Dependencies

• For basic blocks, compute dependencies by walking through the instructions
• Identifying register dependencies is simple
  – is it the same register?
• For memory accesses
  – simple: base + offset1 ?= base + offset2
  – data dependence analysis: a[2i] ?= a[2i+1]
  – interprocedural analysis: global ?= parameter
  – pointer alias analysis: p1 ?= p

Representing Dependencies

• Using a dependence DAG, one per basic block
• Nodes are instructions, edges represent dependencies

    1: r2 = *(r1 + 4)
    2: r3 = *(r1 + 8)
    3: r4 = r2 + r3
    4: r5 = r2 - 1

[Figure: dependence DAG with edges 1 → 3, 1 → 4 and 2 → 3, each labeled 2]

Edge is labeled with latency:
  v(i → j) = delay required between initiation times of i and j minus the execution time required by i

Example

    1: r2 = *(r1 + 4)
    2: r3 = *(r2 + 4)
    3: r4 = r2 + r3
    4: r5 = r2 - 1

[Figure: dependence DAG over nodes 1 to 4 with latency-labeled edges]

Another Example

    1: r2 = *(r1 + 4)
    2: *(r1 + 4) = r3
    3: r3 = r2 + r3
    4: r5 = r2 - 1

[Figure: dependence DAG over nodes 1 to 4 with latency-labeled edges]
Control Dependencies and Resource Constraints

• For now, let's worry only about basic blocks
• For now, let's look at simple pipelines

Example

                                Results available in
    1: LA   r1,array            1 cycle
    2: LD   r2,4(r1)            1 cycle
    3: AND  r3,r3,0x00FF        1 cycle
    4: MULC r6,r6,100           3 cycles
    5: ST   r7,4(r6)
    6: DIVC r5,r5,100           4 cycles
    7: ADD  r4,r2,r5            1 cycle
    8: MUL  r5,r2,r4            3 cycles
    9: ST   r4,0(r1)

Executing the instructions in program order (st = stall):

    1 2 3 4 st st 5 6 st st st 7 8 9

14 cycles!

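The 14-cycle figure can be reproduced by issuing the nine instructions in program order and stalling whenever a source register is not ready yet. This is a small sketch under stated assumptions: one instruction issues per cycle, only true (write → read) register dependencies cause stalls, and the stores are given a latency of 1 because the table leaves it blank. `PROGRAM` and `in_order_issue` are names chosen for this example.

```python
# (text, destination register or None, source registers, cycles until result is available)
PROGRAM = [
    ("LA   r1,array",     "r1", [],           1),
    ("LD   r2,4(r1)",     "r2", ["r1"],       1),
    ("AND  r3,r3,0x00FF", "r3", ["r3"],       1),
    ("MULC r6,r6,100",    "r6", ["r6"],       3),
    ("ST   r7,4(r6)",     None, ["r7", "r6"], 1),   # latency assumed
    ("DIVC r5,r5,100",    "r5", ["r5"],       4),
    ("ADD  r4,r2,r5",     "r4", ["r2", "r5"], 1),
    ("MUL  r5,r2,r4",     "r5", ["r2", "r4"], 3),
    ("ST   r4,0(r1)",     None, ["r4", "r1"], 1),   # latency assumed
]


def in_order_issue(program):
    """Return the issue cycle of every instruction when issued in program order."""
    ready_at = {}   # register -> first cycle in which its new value can be read
    issue = []
    cycle = 0
    for _text, dest, srcs, latency in program:
        cycle += 1                                   # next free issue slot
        cycle = max([cycle] + [ready_at.get(r, 0) for r in srcs])
        issue.append(cycle)
        if dest is not None:
            ready_at[dest] = cycle + latency
    return issue


issue_cycles = in_order_issue(PROGRAM)
print(issue_cycles)        # [1, 2, 3, 4, 7, 8, 12, 13, 14]
print(issue_cycles[-1])    # 14
```

The gaps between cycles 4 and 7 and between cycles 8 and 12 are the two and three stall cycles shown in the example.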
List Scheduling Algorithm

• Idea
  – Do a topological sort of the dependence DAG
  – Consider when an instruction can be scheduled without causing a stall
  – Schedule the instruction if it causes no stall and all its predecessors are already scheduled
• Optimal list scheduling is NP-complete
  – Use heuristics when necessary

List Scheduling Algorithm

• Create a dependence DAG of a basic block
• Topological Sort
    READY = nodes with no predecessors
    Loop until READY is empty
      Schedule each node in READY when no stalling
      READY += nodes whose predecessors have all been scheduled

Heuristics for selection

Heuristics for selecting from the READY list
  1. pick the node with the longest path to a leaf in the dependence graph
  2. pick a node with the most immediate successors
  3. pick a node that can go to a less busy pipeline (in a superscalar implementation)

Heuristics for selection

Pick the node with the longest path to a leaf in the dependence graph

Algorithm (for node x):
  – If x has no successors, d_x = 0
  – d_x = MAX(d_y + c_xy) for all successors y of x
    (c_xy is the latency label on the edge x → y)

Use reverse breadth-first visiting order

Pick a node with the most immediate successors

Algorithm (for node x):
  – f_x = number of successors of x

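Both priorities are easy to compute once the dependence DAG is available. The sketch below uses the edges and latencies of the nine-instruction example as they can be read from its DAG figure; treat that edge list as an assumption of this illustration. Instead of the reverse breadth-first order mentioned above, it uses a memoized recursion, which yields the same d values on a DAG. The names `EDGES`, `NODES` and `priorities` are invented here.

```python
# (x, y): latency c_xy on the edge x -> y, as read from the example DAG
EDGES = {(1, 2): 1, (2, 7): 1, (4, 5): 3, (6, 7): 4, (7, 8): 3, (7, 9): 1}
NODES = list(range(1, 10))


def priorities(nodes, edges):
    """Return (d, f): longest path to a leaf and number of immediate successors."""
    succs = {n: {} for n in nodes}
    for (x, y), c in edges.items():
        succs[x][y] = c

    d = {}

    def longest_path(x):
        if x not in d:
            # A node without successors gets d = 0.
            d[x] = max((longest_path(y) + c for y, c in succs[x].items()), default=0)
        return d[x]

    for n in nodes:
        longest_path(n)
    f = {n: len(succs[n]) for n in nodes}
    return d, f


d, f = priorities(NODES, EDGES)
print(sorted(d.items()))
print(sorted(f.items()))
```

With these edges node 6 gets the largest d (7) and node 7 the largest f (2), the values annotated on the nodes of the worked example.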
Example

                                Results available in
    1: LA   r1,array            1 cycle
    2: LD   r2,4(r1)            1 cycle
    3: AND  r3,r3,0x00FF        1 cycle
    4: MULC r6,r6,100           3 cycles
    5: ST   r7,4(r6)
    6: DIVC r5,r5,100           4 cycles
    7: ADD  r4,r2,r5            1 cycle
    8: MUL  r5,r2,r4            3 cycles
    9: ST   r4,0(r1)

[Figure: dependence DAG. Edges (latency): 1 → 2 (1), 2 → 7 (1), 4 → 5 (3), 6 → 7 (4), 7 → 8 (3), 7 → 9 (1); node 3 has no edges]

Example: list scheduling step by step

Node priorities (d = longest path to a leaf, f = number of immediate successors):

    node   1  2  3  4  5  6  7  8  9
    d      5  4  0  3  0  7  3  0  0
    f      1  1  0  1  0  1  2  0  0

    Step   READY             Scheduled   Schedule so far
    1      { 6, 1, 4, 3 }    6           6
    2      { 1, 4, 3 }       1           6 1
    3      { 2, 4, 3 }       2           6 1 2
    4      { 7, 4, 3 }       4           6 1 2 4
    5      { 7, 3, 5 }       7           6 1 2 4 7
    6      { 3, 5, 8, 9 }    3           6 1 2 4 7 3
    7      { 5, 8, 9 }       5           6 1 2 4 7 3 5
    8      { 8, 9 }          8           6 1 2 4 7 3 5 8
    9      { 9 }             9           6 1 2 4 7 3 5 8 9

A node joins READY once all its predecessors have been scheduled: 2 after step 2, 7 after step 3, 5 after step 4, 8 and 9 after step 5.

Example

                                Results available in
    1: LA   r1,array            1 cycle
    2: LD   r2,4(r1)            1 cycle
    3: AND  r3,r3,0x00FF        1 cycle
    4: MULC r6,r6,100           3 cycles
    5: ST   r7,4(r6)
    6: DIVC r5,r5,100           4 cycles
    7: ADD  r4,r2,r5            1 cycle
    8: MUL  r5,r2,r4            3 cycles
    9: ST   r4,0(r1)

    Program order:     1 2 3 4 st st 5 6 st st st 7 8 9     14 cycles
      vs.
    List scheduling:   6 1 2 4 7 3 5 8 9                      9 cycles
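The 9-cycle schedule can be reproduced with a compact list scheduler for a single-issue machine. This is a sketch under assumptions rather than the course's reference implementation: the edge latencies are the ones read from the example DAG, an instruction j may issue once `start[i] + c_ij <= cycle` holds for every predecessor i, the longest-path heuristic comes first, the successor count second, and remaining ties go to the lower instruction number. All names are invented for this example.

```python
EDGES = {(1, 2): 1, (2, 7): 1, (4, 5): 3, (6, 7): 4, (7, 8): 3, (7, 9): 1}
NODES = list(range(1, 10))


def list_schedule(nodes, edges):
    """Return {node: issue cycle} for a machine that issues one instruction per cycle."""
    preds = {n: {} for n in nodes}
    succs = {n: {} for n in nodes}
    for (x, y), c in edges.items():
        preds[y][x] = c
        succs[x][y] = c

    d = {}

    def longest_path(x):
        if x not in d:
            d[x] = max((longest_path(y) + c for y, c in succs[x].items()), default=0)
        return d[x]

    def priority(x):
        # Smaller tuple = higher priority: long path, many successors, low number.
        return (-longest_path(x), -len(succs[x]), x)

    start = {}
    ready = [n for n in nodes if not preds[n]]
    cycle = 0
    while ready:
        cycle += 1
        issuable = [n for n in ready
                    if all(start[p] + c <= cycle for p, c in preds[n].items())]
        if not issuable:
            continue                     # unavoidable stall in this cycle
        n = min(issuable, key=priority)
        start[n] = cycle
        ready.remove(n)
        # Successors become READY once all of their predecessors are scheduled.
        ready += [s for s in succs[n] if all(p in start for p in preds[s])]
    return start


start = list_schedule(NODES, EDGES)
order = sorted(start, key=start.get)
print(order)                  # [6, 1, 2, 4, 7, 3, 5, 8, 9]
print(max(start.values()))    # 9
```

In cycle 4 node 7 is already in READY but would stall on the result of node 6, so the scheduler issues node 4 instead, just as in the step-by-step example.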

Resource Constraints

• Modern machines have many resource constraints
• Superscalar architectures:
  – can run a few operations in parallel
  – but have constraints

Resource Constraints of a Superscalar Processor

Example:
  – 1 integer operation
      ALUop dest, src1, src2    # in 1 clock cycle
  In parallel with
  – 1 memory operation
      LD dst, addr              # in 2 clock cycles
      ST src, addr              # in 1 clock cycle

List Scheduling Algorithm with Resource Constraints

• Represent the superscalar architecture as multiple pipelines
  – Each pipeline represents some resource
• Example
  – One single-cycle ALU unit
  – One two-cycle pipelined memory unit

    ALUop
    MEM 1
    MEM 2

List Scheduling Algorithm with Resource Constraints

• Create a dependence DAG of a basic block
• Topological Sort
    READY = nodes with no predecessors
    Loop until READY is empty
      Let n ∈ READY be the node with the highest priority
      Schedule n in the earliest slot that satisfies precedence + resource constraints
      Update READY

Example

    1: LA  r1,array
    2: LD  r2,4(r1)
    3: AND r3,r3,0x00FF
    4: LD  r6,8(sp)
    5: ST  r7,4(r6)
    6: ADD r5,r5,100
    7: ADD r4,r2,r5
    8: MUL r5,r2,r4
    9: ST  r4,0(r1)

[Figure: dependence DAG. Edges (latency): 1 → 2 (1), 2 → 7 (2), 4 → 5 (2), 6 → 7 (1), 7 → 8 (1), 7 → 9 (1); node 3 has no edges]

    node   1  2  3  4  5  6  7  8  9
    d      4  3  0  2  0  2  1  0  0
    f      1  1  0  1  0  1  2  0  0

READY = { 1, 6, 4, 3 }: instruction 1 is scheduled first, on the ALUop pipeline.

Example: list scheduling with resource constraints, step by step

    Step   READY             Scheduled   Pipeline
    1      { 1, 6, 4, 3 }    1           ALUop
    2      { 2, 6, 4, 3 }    2           MEM
    3      { 6, 4, 3 }       6           ALUop
    4      { 4, 7, 3 }       4           MEM
    5      { 7, 3, 5 }       7           ALUop
    6      { 3, 5, 8, 9 }    3           ALUop
    7      { 5, 8, 9 }       5           MEM
    8      { 8, 9 }          8           ALUop
    9      { 9 }             9           MEM

Final contents of the pipelines, in time order:

    ALUop:  1 6 3 7 8
    MEM 1:  4 2 5 9
    MEM 2:  4 2
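The resource-constrained variant differs from plain list scheduling in one place: the chosen node goes into the earliest cycle in which both its predecessors' latencies and its pipeline allow it. The sketch below makes several assumptions of its own: the edge latencies are those read from the example DAG, each pipeline accepts one new instruction per cycle (the memory unit is pipelined, so only its first stage is tracked), and ties in priority go to the lower instruction number. `UNIT` and `schedule_with_resources` are names invented for this illustration.

```python
EDGES = {(1, 2): 1, (2, 7): 2, (4, 5): 2, (6, 7): 1, (7, 8): 1, (7, 9): 1}
NODES = list(range(1, 10))
# Pipeline used by each instruction: loads and stores go to MEM, the rest to ALU.
UNIT = {1: "ALU", 2: "MEM", 3: "ALU", 4: "MEM", 5: "MEM",
        6: "ALU", 7: "ALU", 8: "ALU", 9: "MEM"}


def schedule_with_resources(nodes, edges, unit):
    """Return {node: issue cycle}, one issue per pipeline per cycle."""
    preds = {n: {} for n in nodes}
    succs = {n: {} for n in nodes}
    for (x, y), c in edges.items():
        preds[y][x] = c
        succs[x][y] = c

    d = {}

    def longest_path(x):
        if x not in d:
            d[x] = max((longest_path(y) + c for y, c in succs[x].items()), default=0)
        return d[x]

    start = {}
    busy = {u: set() for u in set(unit.values())}   # cycles already taken per pipeline
    ready = [n for n in nodes if not preds[n]]
    while ready:
        n = min(ready, key=lambda x: (-longest_path(x), -len(succs[x]), x))
        ready.remove(n)
        # Earliest cycle allowed by precedence constraints ...
        cycle = max([1] + [start[p] + c for p, c in preds[n].items()])
        # ... pushed later until the pipeline has a free slot.
        while cycle in busy[unit[n]]:
            cycle += 1
        start[n] = cycle
        busy[unit[n]].add(cycle)
        ready += [s for s in succs[n] if all(p in start for p in preds[s])]
    return start


start = schedule_with_resources(NODES, EDGES, UNIT)
alu = sorted((n for n in NODES if UNIT[n] == "ALU"), key=start.get)
mem = sorted((n for n in NODES if UNIT[n] == "MEM"), key=start.get)
print("ALUop:", alu)   # [1, 6, 3, 7, 8]
print("MEM  :", mem)   # [4, 2, 5, 9]
```

Unlike the single-issue scheduler, this version never advances a global clock: instruction 3 is picked late (its priority is 0) but still lands in an early free ALU slot, which is how it ends up between 6 and 7 in the ALUop row.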

Register Allocation and Instruction Scheduling

• If register allocation is performed before instruction scheduling
  – the choices for scheduling are restricted

Example

    1: LD  r2,0(r1)
    2: ADD r3,r3,r2
    3: LD  r2,4(r5)
    4: ADD r6,r6,r2

[Figure: dependence DAG for instructions 1 to 4 with latency-labeled edges]

    ALUop:  2 4
    MEM 1:  1 3
    MEM 2:  1 3

Example

    1: LD  r2,0(r1)
    2: ADD r3,r3,r2
    3: LD  r2,4(r5)
    4: ADD r6,r6,r2

Anti-dependence: instruction 3 overwrites r2, which instruction 2 still has to read.

How about using a different register?

    1: LD  r2,0(r1)
    2: ADD r3,r3,r2
    3: LD  r4,4(r5)
    4: ADD r6,r6,r4

[Figure: dependence DAG without the edge between instructions 2 and 3]

    ALUop:  2 4
    MEM 1:  1 3
    MEM 2:  1 3

Register Allocation and Instruction Scheduling

• If register allocation is performed before instruction scheduling
  – the choices for scheduling are restricted
• If instruction scheduling is performed before register allocation
  – register allocation may spill registers
  – will change the carefully done schedule!!!

Scheduling across basic blocks

• Number of instructions in a basic block is small
  – Cannot keep multiple units with long pipelines busy by just scheduling within a basic block
• Need to handle control dependencies
  – Scheduling constraints across basic blocks
  – Scheduling policy

Moving across basic blocks

Downward to adjacent basic block

[Figure: block A with successor blocks B and C]

A path to B that does not execute A?

Moving across basic blocks

Upward to adjacent basic block

[Figure: blocks B and C with common successor block A]

A path from C that does not reach A?

Control Dependencies

Constraints in moving instructions across basic blocks

    if ( . . . )                 if ( . . . )
        a = b op c                   d = *(a1)

Not allowed if e.g.

    if (c != 0)                  if (valid_address(a1))
        a = b / c                    d = *(a1)

Trace Scheduling

• Find the most common trace of basic blocks
  – Use profile information
• Combine the basic blocks in the trace and schedule them as one block
• Create compensating (clean-up) code if the execution goes off-trace

Trace Scheduling

[Figure: control flow graph with blocks A; B, C; D; E; F, G; H. The selected trace is A, B, D, E, G, H.]
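Selecting the most common trace from profile information can be sketched as a greedy walk that always follows the most frequently taken outgoing edge. Everything specific in the example is an assumption: the edge list mirrors the control flow graph of the figure as far as it can be read, the execution counts are invented, and `hottest_trace` is a name chosen here. Real trace schedulers use more careful selection rules.

```python
# Hypothetical profile: (source block, destination block) -> execution count
PROFILE = {
    ("A", "B"): 90, ("A", "C"): 10,
    ("B", "D"): 90, ("C", "D"): 10,
    ("D", "E"): 100,
    ("E", "F"): 20, ("E", "G"): 80,
    ("F", "H"): 20, ("G", "H"): 80,
}


def hottest_trace(entry, profile):
    """Follow the most frequently executed edge from each block until none is left."""
    trace = [entry]
    while True:
        out = {dst: count for (src, dst), count in profile.items()
               if src == trace[-1] and dst not in trace}   # never revisit a block
        if not out:
            return trace
        trace.append(max(out, key=out.get))


trace = hottest_trace("A", PROFILE)
print(trace)   # ['A', 'B', 'D', 'E', 'G', 'H']
```

The blocks left out of the trace (C and F with these counts) are where compensating code is needed when execution goes off-trace.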

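Trace Scheduling

[Figure: the trace A, B, D, E, G, H, scheduled as one block]

Large Basic Blocks via Code Duplication

• Creating large extended basic blocks by duplication
• Schedule the larger blocks

[Figure: before duplication, A branches to B and C, which join at D, followed by E; after duplication, B and C each continue into their own copy of D and E]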

Scheduling for Loops

• Loop bodies are typically small
• But a lot of time is spent in loops due to their iterative nature
• Need better ways to schedule loops

Loop Example

Machine:
  – One load/store unit
      • load 2 cycles
      • store 2 cycles
  – Two arithmetic units
      • add 2 cycles
      • branch 2 cycles (no delay slot)
      • multiply 3 cycles
  – Both units are pipelined (initiate one op each cycle)

Source Code

    for i = 1 to N
        A[i] = A[i] * b

Assembly Code

    loop:
        ld  r6, (r2)
        mul r6, r6, r3
        st  r6, (r2)
        add r2, r2, 4
        ble r2, r5, loop

Loop Example

Assembly Code

    loop:
        ld  r6, (r2)
        mul r6, r6, r3
        st  r6, (r2)
        add r2, r2, 4
        ble r2, r5, loop

Schedule (9 cycles per iteration)

[Figure: reservation table for one iteration, showing ld (2 cycles), mul (3 cycles), st (2 cycles), add (2 cycles) and ble (2 cycles)]

Loop Unrolling

Oldest compiler trick of the trade:
Unroll the loop body a few times

Pros:
  – Creates a much larger basic block for the body
  – Eliminates a few loop bounds checks
Cons:
  – Much larger program
  – Setup code (# of iterations < unroll factor)
  – Beginning and end of the schedule can still have unused slots

Loop Example

    Original:                    Unrolled twice:
    loop:                        loop:
        ld  r6, (r2)                 ld  r6, (r2)
        mul r6, r6, r3               mul r6, r6, r3
        st  r6, (r2)                 st  r6, (r2)
        add r2, r2, 4                add r2, r2, 4
        ble r2, r5, loop             ld  r6, (r2)
                                     mul r6, r6, r3
                                     st  r6, (r2)
                                     add r2, r2, 4
                                     ble r2, r5, loop

Schedule (8 cycles per iteration)

[Figure: reservation table for the unrolled loop body]

Loop Unrolling

• Rename registers
  – Use different registers in different iterations

Loop Example

    Before renaming:             After renaming:
    loop:                        loop:
        ld  r6, (r2)                 ld  r6, (r2)
        mul r6, r6, r3               mul r6, r6, r3
        st  r6, (r2)                 st  r6, (r2)
        add r2, r2, 4                add r2, r2, 4
        ld  r6, (r2)                 ld  r7, (r2)
        mul r6, r6, r3               mul r7, r7, r3
        st  r6, (r2)                 st  r7, (r2)
        add r2, r2, 4                add r2, r2, 4
        ble r2, r5, loop             ble r2, r5, loop

Loop Unrolling

• Rename registers
  – Use different registers in different iterations
• Eliminate unnecessary dependencies
  – again, use more registers to eliminate true, anti and output dependencies
  – eliminate dependent-chains of calculations when possible

Loop Example

    Before:                      After renaming the address register:
    loop:                        loop:
        ld  r6, (r2)                 ld  r6, (r1)
        mul r6, r6, r3               mul r6, r6, r3
        st  r6, (r2)                 st  r6, (r1)
        add r2, r2, 4                add r2, r1, 4
        ld  r7, (r2)                 ld  r7, (r2)
        mul r7, r7, r3               mul r7, r7, r3
        st  r7, (r2)                 st  r7, (r2)
        add r2, r2, 4                add r1, r2, 4
        ble r2, r5, loop             ble r1, r5, loop

Loop Example

    Before:                      After breaking the add chain:
    loop:                        loop:
        ld  r6, (r1)                 ld  r6, (r1)
        mul r6, r6, r3               mul r6, r6, r3
        st  r6, (r1)                 st  r6, (r1)
        add r2, r1, 4                add r2, r1, 4
        ld  r7, (r2)                 ld  r7, (r2)
        mul r7, r7, r3               mul r7, r7, r3
        st  r7, (r2)                 st  r7, (r2)
        add r1, r2, 4                add r1, r1, 8
        ble r1, r5, loop             ble r1, r5, loop

Loop Example

    loop:
        ld  r6, (r1)
        mul r6, r6, r3
        st  r6, (r1)
        add r2, r1, 4
        ld  r7, (r2)
        mul r7, r7, r3
        st  r7, (r2)
        add r1, r1, 8
        ble r1, r5, loop

Schedule (4.5 cycles per iteration)

[Figure: reservation table in which the ld, mul, st and add operations of the two unrolled iterations overlap]
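At source level, the unrolled and renamed loop corresponds to the following sketch. It is written in Python purely for illustration: `scale` is the original loop, `scale_unrolled` mirrors the final assembly (two independent temporaries in the role of r6 and r7, one induction update per pass), and the cleanup step stands for the setup code needed when the iteration count is not a multiple of the unroll factor. The function names are invented here.

```python
def scale(a, b):
    """Original loop: A[i] = A[i] * b for every element."""
    for i in range(len(a)):
        a[i] = a[i] * b


def scale_unrolled(a, b):
    """The same loop unrolled by a factor of two."""
    n = len(a)
    i = 0
    while i + 1 < n:
        x0 = a[i]          # ld  r6, (r1)
        x1 = a[i + 1]      # ld  r7, (r2): independent of x0 thanks to renaming
        a[i] = x0 * b      # mul / st for the first copy of the body
        a[i + 1] = x1 * b  # mul / st for the second copy of the body
        i += 2             # add r1, r1, 8: one induction update for both copies
    if i < n:              # cleanup for an odd number of iterations
        a[i] = a[i] * b


data = [1, 2, 3, 4, 5]
expected = list(data)
scale(expected, 3)
scale_unrolled(data, 3)
print(data == expected)   # True
```

The two loads and the two multiplies inside the unrolled body no longer depend on each other, which is what lets the scheduler overlap them.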

Software Pipelining

• Try to overlap multiple iterations so that the slots will be filled
• Find the steady-state window so that:
  – all the instructions of the loop body are executed
  – but from different iterations

Loop Example

Assembly Code

    loop:
        ld  r6, (r2)
        mul r6, r6, r3
        st  r6, (r2)
        add r2, r2, 4
        ble r2, r5, loop

Schedule (a numeric suffix marks a later iteration)

    ld ld1 ld2 st ld3 st1 ld4 st2 ld5 st3 ld6
    ld ld1 ld2 st ld3 st1 ld4 st2 ld5 st3
    mul mul1 mul2 ble mul3 ble1 mul4 ble2 mul5
    mul mul1 mul2 ble mul3 ble1 mul4 ble2
    mul mul1 mul2 mul3 mul4
    add add1 add2 add3
    add add1 add2 add3

Loop Example

Assembly Code

    loop:
        ld  r6, (r2)
        mul r6, r6, r3
        st  r6, (r2)
        add r2, r2, 4
        ble r2, r5, loop

Schedule (2 cycles per iteration)

Steady-state window:

    ld3   st1
    st    ld3
    mul2  ble
    mul2
    mul1
    add1
    add

Loop Example

4 iterations are overlapped
  – values of r3 and r5 don't change
  – 4 regs for &A[i] (r2)
  – each addr. incremented by 4*4
  – 4 regs to keep value A[i] (r6)
  – Same registers can be reused after 4 of these blocks; generate code for 4 blocks, otherwise need to move

Software Pipelining

• Optimal use of resources
• Need a lot of registers
  – Values in multiple iterations need to be kept
• Issues in dependencies
  – Executing a store instruction in an iteration before the branch instruction is executed for a previous iteration (writing when it should not have)
  – Loads and stores are issued out-of-order (need to figure out dependencies before doing this)
• Code generation issues
  – Generate pre-amble and post-amble code
  – Multiple blocks so no register copy is needed
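The structure of a software-pipelined loop (pre-amble, steady state, post-amble) can be shown at source level. The sketch below is an illustration written for this purpose, not the code a compiler would emit: it overlaps only three stages (load, multiply, store) of `A[i] = A[i] * b`, keeps the in-flight values in two local variables, and falls back to the plain loop for fewer than two elements. The function name is invented here.

```python
def scale_software_pipelined(a, b):
    """A[i] = A[i] * b with load, multiply and store taken from different iterations."""
    n = len(a)
    if n < 2:                      # too short to fill the pipeline
        for i in range(n):
            a[i] = a[i] * b
        return

    # Pre-amble: start iterations 0 and 1.
    loaded = a[0]                  # ld  of iteration 0
    product = loaded * b           # mul of iteration 0
    loaded = a[1]                  # ld  of iteration 1

    # Steady state: each pass finishes iteration i, continues i+1 and starts i+2.
    for i in range(n - 2):
        a[i] = product             # st  of iteration i
        product = loaded * b       # mul of iteration i + 1
        loaded = a[i + 2]          # ld  of iteration i + 2

    # Post-amble: drain the two iterations still in flight.
    a[n - 2] = product
    a[n - 1] = loaded * b


values = [1, 2, 3, 4, 5]
scale_software_pipelined(values, 10)
print(values)   # [10, 20, 30, 40, 50]
```

The load of element i + 2 is issued before the stores of elements i + 1 and i + 2; that reordering is safe here only because the elements are distinct, which is the dependence question raised in the list above.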
