Code Scheduling
Spring 2006

Outline
• Modern architectures
• Delay slots
• Introduction to instruction scheduling
• List scheduling
• Resource constraints
• Interaction with register allocation
• Scheduling across basic blocks
• Trace scheduling
• Scheduling for loops
• Loop unrolling
• Software pipelining
Handling Branch Instructions
Problem: We do not know the location of the next instruction until later
– after DE in jump instructions
– after EXE in conditional branch instructions

Branch      IF  DE  EXE MEM WB
???             IF  DE  EXE MEM WB
???                 IF  DE  EXE MEM WB
Next inst               IF  DE  EXE MEM WB

What to do with the middle 2 instructions?
1. Stall the pipeline in case of a branch until we know the address of the next instruction
– wasted cycles

Branch      IF  DE  EXE MEM WB
Next inst               IF  DE  EXE MEM WB
Kostis Sagonas, Spring 2006
Filling the Branch Delay Slot
Simple Solution: Put a no-op
• Wasted instruction, just like a stall

Filling the Branch Delay Slot
Move an instruction from above the branch
    prev_instr          (original position, above the branch)
    ble r3, lbl
    prev_instr          ← Branch delay slot (position after the move)
Filling the Branch Delay Slot
Move an instruction dominated by the branch instruction
    ble r3, lbl
    dom_instr           ← Branch delay slot

Filling the Branch Delay Slot
Move an instruction from the branch target
– Instruction dominated by target
– No other ways to reach target (if so, take care of them)
– If conditional branch, the moved instruction should not have a lasting effect if the branch is not taken
    lbl:
    instr
Example
Original code:
    r2 = *(r1 + 4)
    r3 = *(r1 + 8)
    r4 = r2 + r3
    r5 = r2 - 1
    goto L1

With a no-op in each delay slot:
    r2 = *(r1 + 4)
    r3 = *(r1 + 8)
    noop
    r4 = r2 + r3
    r5 = r2 - 1
    goto L1
    noop

After moving r5 = r2 - 1 into the slot after the second load:
    r2 = *(r1 + 4)
    r3 = *(r1 + 8)
    r5 = r2 - 1
    r4 = r2 + r3
    goto L1
    noop

Final code after delay slot filling:
    r2 = *(r1 + 4)
    r3 = *(r1 + 8)
    r5 = r2 - 1
    goto L1
    r4 = r2 + r3
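The last step of the example (moving an instruction from above the branch into its delay slot) can be sketched in Python. This is only an illustration under simplifying assumptions: instructions are strings of the form `dst = expr` or a branch, registers are named `r<number>`, stores and memory dependences are ignored, and the helper names `defs_uses`, `independent` and `fill_branch_delay_slot` are made up for this sketch.

```python
import re

REG = re.compile(r"r\d+")

def defs_uses(instr):
    """Return (registers written, registers read) of one instruction string."""
    if "=" in instr:
        lhs, rhs = instr.split("=", 1)
        return set(REG.findall(lhs)), set(REG.findall(rhs))
    # Branches and no-ops write no register; a branch may read some.
    return set(), set(REG.findall(instr))

def independent(a, b):
    """True if a and b share no true, anti or output dependence."""
    da, ua = defs_uses(a)
    db, ub = defs_uses(b)
    return not (da & ub or ua & db or da & db)

def fill_branch_delay_slot(block):
    """block = [..., branch, 'noop']; try to replace the no-op."""
    *body, branch, slot = block
    if slot != "noop":
        return block
    # Look for the latest instruction that can legally move below the branch.
    for i in range(len(body) - 1, -1, -1):
        cand = body[i]
        if cand == "noop":
            continue
        later = body[i + 1:] + [branch]
        if all(independent(cand, x) for x in later):
            return body[:i] + body[i + 1:] + [branch, cand]
    return block  # nothing movable: keep the no-op

block = ["r2 = *(r1 + 4)", "r3 = *(r1 + 8)", "r5 = r2 - 1",
         "r4 = r2 + r3", "goto L1", "noop"]
print(fill_branch_delay_slot(block))
```

For the block above, the sketch moves `r4 = r2 + r3` behind `goto L1`, which is the final code of the example. A branch that reads a register written by the only candidate (for instance `r3 = r3 - 1` followed by `ble r3, lbl`) keeps its no-op.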
Instruction Scheduling
Goal: Reorder instructions so that pipeline stalls are minimized

Constraints on Instruction Scheduling:
– Data dependencies
– Control dependencies
– Resource constraints

Data Dependencies
• If two instructions access the same variable, they can be dependent
• Kinds of dependencies
– True: write → read
– Anti: read → write
– Output: write → write
• What to do if two instructions are dependent?
– The order of execution cannot be reversed
– Reduces the possibilities for scheduling
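The three kinds of data dependence can be told apart from the sets of registers each instruction writes and reads. The following Python sketch shows one way to do that; the function name `classify` and the representation of an instruction as a `(writes, reads)` pair are assumptions made for this illustration.

```python
def classify(first, second):
    """Dependences of `second` on `first`, where `first` comes earlier.

    Each instruction is a pair (set of registers written, set of registers read).
    """
    w1, r1 = first
    w2, r2 = second
    kinds = []
    if w1 & r2:
        kinds.append("true")    # write -> read
    if r1 & w2:
        kinds.append("anti")    # read -> write
    if w1 & w2:
        kinds.append("output")  # write -> write
    return kinds

# Hypothetical three-instruction sequence:
#   a: r5 = r5 / 100
#   b: r4 = r2 + r5
#   c: r5 = r2 * r4
a = ({"r5"}, {"r5"})
b = ({"r4"}, {"r2", "r5"})
c = ({"r5"}, {"r2", "r4"})

print(classify(a, b))  # b reads the r5 written by a
print(classify(b, c))  # c reads r4 from b and overwrites the r5 that b reads
print(classify(a, c))  # c overwrites r5, which a both reads and writes
```

An empty result means the two instructions may be reordered as far as registers are concerned; memory accesses would need a separate check.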
Control Dependencies and Resource Constraints
• For now, let’s worry only about basic blocks
• For now, let’s look at simple pipelines

Example
                               Results available in
1: LA   r1,array               1 cycle
2: LD   r2,4(r1)               1 cycle
3: AND  r3,r3,0x00FF           1 cycle
4: MULC r6,r6,100              3 cycles
5: ST   r7,4(r6)
6: DIVC r5,r5,100              4 cycles
7: ADD  r4,r2,r5               1 cycle
8: MUL  r5,r2,r4               3 cycles
9: ST   r4,0(r1)

Execution in program order (st = stall):
1 2 3 4 st st 5 6 st st st 7 8 9
14 cycles!
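The 14-cycle count can be reproduced with a small simulation of in-order issue. The sketch below is an illustration, not part of the slides: it assumes one instruction issues per cycle, an instruction waits until all of its source registers are available, and stores (which have no listed latency) produce no register result. The names `PROGRAM` and `in_order_issue_cycles` are made up here.

```python
# (name, destination register or None, source registers, result latency in cycles)
PROGRAM = [
    ("LA",   "r1", [],           1),
    ("LD",   "r2", ["r1"],       1),
    ("AND",  "r3", ["r3"],       1),
    ("MULC", "r6", ["r6"],       3),
    ("ST",   None, ["r7", "r6"], 0),  # latency unused: a store writes no register
    ("DIVC", "r5", ["r5"],       4),
    ("ADD",  "r4", ["r2", "r5"], 1),
    ("MUL",  "r5", ["r2", "r4"], 3),
    ("ST",   None, ["r4", "r1"], 0),
]

def in_order_issue_cycles(program):
    """Return the cycle in which each instruction issues, in program order."""
    ready_at = {}   # register -> first cycle in which its value can be read
    cycle = 1       # next free issue cycle
    issued = []
    for _name, dest, srcs, latency in program:
        # Stall until every source operand is available.
        issue = max([cycle] + [ready_at.get(r, 0) for r in srcs])
        issued.append(issue)
        if dest is not None:
            ready_at[dest] = issue + latency
        cycle = issue + 1
    return issued

cycles = in_order_issue_cycles(PROGRAM)
print(cycles)       # issue cycle of instructions 1..9
print(cycles[-1])   # total length of the schedule
```

The gaps in the issue cycles correspond to the stalls before instructions 5 and 7 in the timeline above.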
List Scheduling Algorithm
• Idea
– Do a topological sort of the dependence DAG
– Consider when an instruction can be scheduled without causing a stall
– Schedule the instruction if it causes no stall and all its predecessors are already scheduled
• Finding an optimal schedule is NP-complete
– Use heuristics when necessary
Example
                               Results available in
1: LA   r1,array               1 cycle
2: LD   r2,4(r1)               1 cycle
3: AND  r3,r3,0x00FF           1 cycle
4: MULC r6,r6,100              3 cycles
5: ST   r7,4(r6)
6: DIVC r5,r5,100              4 cycles
7: ADD  r4,r2,r5               1 cycle
8: MUL  r5,r2,r4               3 cycles
9: ST   r4,0(r1)

Dependence DAG (edge labels are latencies in cycles):
1 → 2 (1)    4 → 5 (3)
2 → 7 (1)    6 → 7 (4)
7 → 8 (3)    7 → 9 (1)

Node labels:
node   1   2   3   4   5   6   7
d      5   4   0   3   0   7   3
f      1   1   0   1   0   1   2

Example (scheduling steps)
READY                  Schedule so far
{ 6, 1, 4, 3 }
{ 1, 4, 3 }            6
{ 2, 4, 3 }            6 1
{ 7, 4, 3 }            6 1 2
{ 7, 3, 5 }            6 1 2 4
{ 3, 5, 8, 9 }         6 1 2 4 7
{ 5, 8, 9 }            6 1 2 4 7 3
{ 8, 9 }               6 1 2 4 7 3 5
{ 9 }                  6 1 2 4 7 3 5 8
{ }                    6 1 2 4 7 3 5 8 9
Example
Program order:     1 2 3 4 st st 5 6 st st st 7 8 9     14 cycles
vs.
List scheduling:   6 1 2 4 7 3 5 8 9                    9 cycles
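The list scheduling example can be replayed with a short Python sketch. Several things here are assumptions rather than content of the slides: the edge latencies are the ones read off the dependence DAG figure, the priority of a node is taken to be the longest latency-weighted path to a leaf (which agrees with the d labels in the figure), ties are broken by instruction number, and the names `EDGES` and `list_schedule` are made up.

```python
from collections import defaultdict

# (predecessor, successor): latency in cycles, as read off the DAG figure.
EDGES = {(1, 2): 1, (4, 5): 3, (2, 7): 1, (6, 7): 4, (7, 8): 3, (7, 9): 1}

def list_schedule(nodes, edges):
    """Issue at most one instruction per cycle; return (order, issue cycles)."""
    succs, preds = defaultdict(list), defaultdict(list)
    for (p, s), w in edges.items():
        succs[p].append((s, w))
        preds[s].append((p, w))

    depth = {}
    def d(n):
        # Longest latency-weighted path from n to a leaf of the DAG.
        if n not in depth:
            depth[n] = max((w + d(s) for s, w in succs[n]), default=0)
        return depth[n]

    issue, order = {}, []
    ready = [n for n in nodes if not preds[n]]
    cycle = 1
    while ready:
        # Instructions whose operands are available in this cycle (no stall).
        can_issue = [n for n in ready
                     if all(issue[p] + w <= cycle for p, w in preds[n])]
        if can_issue:
            n = min(can_issue, key=lambda x: (-d(x), x))  # highest d first
            ready.remove(n)
            issue[n] = cycle
            order.append(n)
            for s, _ in succs[n]:
                if all(p in issue for p, _ in preds[s]):
                    ready.append(s)
        cycle += 1  # an empty can_issue list means a stall cycle
    return order, issue

order, issue = list_schedule(range(1, 10), EDGES)
print(order)                 # scheduling order
print(max(issue.values()))   # schedule length in cycles
```

Under these assumptions the sketch picks instruction 4 in cycle 4 because instruction 7, although it has the same priority, would still have to wait for the result of 6. The resulting order and length match the comparison above.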
Resource Constraints
• Modern machines have many resource constraints
• Superscalar architectures:
– can run a few operations in parallel
– but have constraints

Resource Constraints of a Superscalar Processor
Example:
– 1 integer operation
    ALUop dest, src1, src2    # in 1 clock cycle
In parallel with
– 1 memory operation
    LD dst, addr              # in 2 clock cycles
    ST src, addr              # in 1 clock cycle
List Scheduling Algorithm with Resource Constraints
• Create a dependence DAG of a basic block
• Topological Sort
    READY = nodes with no predecessors
    Loop until READY is empty
        Let n ∈ READY be the node with the highest priority

Example
1: LA  r1,array
2: LD  r2,4(r1)
3: AND r3,r3,0x00FF
4: LD  r6,8(sp)
5: ST  r7,4(r6)
6: ADD r5,r5,100
7: ADD r4,r2,r5
8: MUL r5,r2,r4
9: ST  r4,0(r1)

Dependence DAG (edge labels are latencies in cycles):
1 → 2 (1)    4 → 5 (2)
2 → 7 (2)    6 → 7 (1)
7 → 8 (1)    7 → 9 (1)

Node labels:
node   1   2   3   4   5   6   7   8   9
d      4   3   0   2   0   2   1   0   0
f      1   1   0   1   0   1   2   0   0
Example (scheduling steps)
READY                  Node placed
{ 6, 4, 3 }            6   (1 and 2 are already placed)
{ 4, 7, 3 }            4
{ 7, 3, 5 }            7
{ 3, 5, 8, 9 }         3
{ 5, 8, 9 }            5
{ }                    (after 8 and 9 are placed)

Final reservation table:
Cycle    1   2   3   4   5
ALUop    1   6   3   7   8
MEM 1    4   2   5       9
MEM 2        4   2
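A resource-aware variant of the scheduler can be sketched with a reservation table. As before, this is an illustration under assumptions, not the algorithm from the slides: edge latencies are the ones read off the DAG figure, priority is the longest latency-weighted path to a leaf, ties go to the lower instruction number, a load occupies MEM 1 in its issue cycle and MEM 2 in the next one, a store occupies only MEM 1, and the names `EDGES`, `KIND`, `USES` and `schedule` are made up.

```python
from collections import defaultdict

# (predecessor, successor): latency in cycles, as read off the DAG figure.
EDGES = {(1, 2): 1, (4, 5): 2, (2, 7): 2, (6, 7): 1, (7, 8): 1, (7, 9): 1}
# Functional unit class of each instruction of the example.
KIND = {1: "alu", 2: "ld", 3: "alu", 4: "ld", 5: "st",
        6: "alu", 7: "alu", 8: "alu", 9: "st"}
# Resources occupied by each class as (resource, offset from issue cycle).
USES = {"alu": [("ALUop", 0)],
        "ld": [("MEM 1", 0), ("MEM 2", 1)],
        "st": [("MEM 1", 0)]}

def schedule(nodes, edges, kind):
    """Place each node at the earliest cycle where operands and units are free."""
    succs, preds = defaultdict(list), defaultdict(list)
    for (p, s), w in edges.items():
        succs[p].append((s, w))
        preds[s].append((p, w))

    depth = {}
    def d(n):
        # Longest latency-weighted path from n to a leaf of the DAG.
        if n not in depth:
            depth[n] = max((w + d(s) for s, w in succs[n]), default=0)
        return depth[n]

    busy = set()   # reserved (resource, cycle) slots
    start = {}     # node -> issue cycle
    ready = [n for n in nodes if not preds[n]]
    while ready:
        n = min(ready, key=lambda x: (-d(x), x))   # highest priority first
        ready.remove(n)
        # Earliest cycle allowed by the data dependences.
        cycle = max((start[p] + w for p, w in preds[n]), default=1)
        # Delay further while a needed resource slot is taken.
        while any((res, cycle + off) in busy for res, off in USES[kind[n]]):
            cycle += 1
        busy.update((res, cycle + off) for res, off in USES[kind[n]])
        start[n] = cycle
        for s, _ in succs[n]:
            if all(p in start for p, _ in preds[s]):
                ready.append(s)
    return start

start = schedule(range(1, 10), EDGES, KIND)
print(sorted(start.items()))   # (instruction, issue cycle) pairs
print(max(start.values()))     # schedule length in cycles
```

With these assumptions the placement agrees with the final reservation table: ALU operations 1, 6, 3, 7, 8 in cycles 1 to 5, and memory operations 4, 2, 5, 9 in cycles 1, 2, 3 and 5. The sketch happens to place 4 before 6, while the slides place 6 first; both have the same priority and use different units, so the table is the same.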
Register Allocation and Instruction Scheduling
• If register allocation is performed before instruction scheduling
– the choices for scheduling are restricted

Example
1: LD  r2,0(r1)
2: ADD r3,r3,r2
3: LD  r2,4(r5)
4: ADD r6,r6,r2
[Figure: dependence DAG and ALUop / MEM 1 / MEM 2 reservation table for this code]

Anti-dependence: instruction 3 overwrites r2, which instruction 2 still has to read, so 3 cannot be moved above 2.
How about using a different register?

Example
1: LD  r2,0(r1)
2: ADD r3,r3,r2
3: LD  r4,4(r5)
4: ADD r6,r6,r4
[Figure: dependence DAG and ALUop / MEM 1 / MEM 2 reservation table for the renamed code]
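The renaming step in this example can be sketched as a single pass over the block. This is a simplified illustration with made-up names (`rename_registers`, `free_regs`); it assumes that the registers in `free_regs` are not used anywhere in the block and that a renamed register is not needed under its old name after the block.

```python
def rename_registers(instrs, free_regs):
    """Give a fresh register to every write that would reuse a touched register.

    instrs: list of (opcode, destination register, list of source registers).
    """
    free = list(free_regs)
    current = {}      # original register name -> register now holding its value
    touched = set()   # registers read or written by earlier instructions
    renamed = []
    for op, dest, srcs in instrs:
        srcs = [current.get(s, s) for s in srcs]   # read the latest copies
        new_dest = dest
        if dest in touched and free:
            # Writing dest here would create an anti or output dependence.
            new_dest = free.pop(0)
        current[dest] = new_dest
        touched.update(srcs)
        touched.add(new_dest)
        renamed.append((op, new_dest, srcs))
    return renamed

block = [("LD",  "r2", ["r1"]),
         ("ADD", "r3", ["r3", "r2"]),
         ("LD",  "r2", ["r5"]),
         ("ADD", "r6", ["r6", "r2"])]

for instr in rename_registers(block, ["r4"]):
    print(instr)
```

On the example block with `r4` as the only free register, the second load and its use are rewritten to `r4`, as in the renamed code above. When no free register is left, the sketch keeps the original name and the dependence remains.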
Register Allocation and Instruction Scheduling
• If register allocation is performed before instruction scheduling
– the choices for scheduling are restricted
• If instruction scheduling is performed before register allocation
– register allocation may spill registers
– the spill code will change the carefully done schedule!
Moving across basic blocks
Upward to adjacent basic block
[Figure: blocks B and C above block A]
A path from C that does not reach A?

Control Dependencies
Constraints in moving instructions across basic blocks

    if ( . . . )
        a = b op c
Not allowed if e.g.
    if (c != 0)
        a = b / c

    if ( . . . )
        d = *(a1)
Not allowed if e.g.
    if (valid_address(a1))
        d = *(a1)

[Figure: control-flow graph with blocks A; B, C; D; E; F, G; H]
Trace Scheduling
[Figure: the blocks A, B, D, E, G, H of the control-flow graph, followed by further frames showing blocks A, B, C, D, E]
Loop Example
Machine:
– One load/store unit
• load 2 cycles
• store 2 cycles
– Two arithmetic units
• add 2 cycles
• branch 2 cycles (no delay slot)
• multiply 3 cycles
– Both units are pipelined (initiate one op each cycle)

Loop Example
Source Code
    for i = 1 to N
        A[i] = A[i] * b

Assembly Code
loop:
    ld  r6, (r2)
    mul r6, r6, r3
    st  r6, (r2)
    add r2, r2, 4
    ble r2, r5, loop
Loop Unrolling
loop:
    ld  r6, (r2)
    mul r6, r6, r3
    st  r6, (r2)
    add r2, r2, 4
    ld  r6, (r2)
    mul r6, r6, r3
    st  r6, (r2)
    add r2, r2, 4
    ble r2, r5, loop

• Rename registers
– Use different registers in different iterations

Loop Example
loop:
    ld  r6, (r2)
    mul r6, r6, r3
    st  r6, (r2)
    add r2, r2, 4
    ld  r7, (r2)
    mul r7, r7, r3
    st  r7, (r2)
    add r2, r2, 4
    ble r2, r5, loop
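The same transformation can be shown at source level. The Python sketch below unrolls the loop `A[i] = A[i] * b` twice and uses two temporaries, `t0` and `t1`, in the role of the renamed registers r6 and r7. The function names and the clean-up step for an odd number of elements are additions of this sketch; the assembly above does not show such a step.

```python
def scale(a, b):
    """Reference loop: A[i] = A[i] * b for every element."""
    for i in range(len(a)):
        a[i] = a[i] * b

def scale_unrolled(a, b):
    """Same loop unrolled by 2, with one temporary per unrolled copy."""
    n = len(a)
    i = 0
    while i + 1 < n:
        t0 = a[i]          # first copy of the body (plays the role of r6)
        t1 = a[i + 1]      # second copy uses its own temporary (r7)
        a[i] = t0 * b
        a[i + 1] = t1 * b
        i += 2             # one index update and one branch per two elements
    if i < n:              # clean-up when the element count is odd
        a[i] = a[i] * b

xs = [1, 2, 3, 4, 5]
scale_unrolled(xs, 10)
print(xs)
```

Because the two copies use different temporaries, the loads of both copies can be issued before either multiply finishes, which is the freedom the scheduler needs.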
Loop Example
loop:
    ld  r6, (r1)
    mul r6, r6, r3
    st  r6, (r1)
    add r2, r1, 4
    ld  r7, (r2)
    mul r7, r7, r3
    st  r7, (r2)
    add r1, r1, 8
    ble r1, r5, loop

Schedule (4.5 cycles per iteration)
[Figure: reservation table of the unrolled loop body. Load/store unit rows: ld ld st st. Arithmetic unit rows: mul mul ble; mul mul; add add.]
Software Pipelining
• Optimal use of resources
• Needs a lot of registers
– Values from multiple iterations need to be kept at the same time
• Issues in dependencies
– A store instruction of one iteration may execute before the branch instruction of a previous iteration has executed (writing when it should not have)
– Loads and stores are issued out of order (need to figure out dependencies before doing this)
• Code generation issues
– Generate pre-amble and post-amble code
– Multiple blocks so no register copy is needed
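The shape of software-pipelined code (pre-amble, steady-state loop, post-amble) can be illustrated at source level for the loop `A[i] = A[i] * b`. The Python sketch below is only an analogy under stated assumptions: each loop trip stores the result of iteration i, multiplies for iteration i+1 and loads for iteration i+2. A real compiler would do this on machine instructions, and the name `scale_pipelined` is made up.

```python
def scale_pipelined(a, b):
    """Return a new list with every element multiplied by b."""
    n = len(a)
    out = list(a)
    if n < 2:
        # Too few iterations to fill the pipeline: use the plain loop.
        return [x * b for x in out]

    # Pre-amble: start iterations 0 and 1.
    loaded = out[0]          # load,     iteration 0
    product = loaded * b     # multiply, iteration 0
    loaded = out[1]          # load,     iteration 1

    # Steady state: three different iterations are in flight per trip.
    for i in range(n - 2):
        out[i] = product         # store,    iteration i
        product = loaded * b     # multiply, iteration i + 1
        loaded = out[i + 2]      # load,     iteration i + 2

    # Post-amble: drain the last two iterations.
    out[n - 2] = product
    out[n - 1] = loaded * b
    return out

print(scale_pipelined([1, 2, 3, 4], 5))
```

The variables `loaded` and `product` hold values that belong to different iterations at the same time, which is why software pipelining needs additional registers.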