Instruction Scheduling: List Scheduling, Trace Scheduling, Loop Unrolling & Software Pipelining
Outline
Overview of Instruction Scheduling
List Scheduling
Resource Constraints
Interaction with Register Allocation
Scheduling across Basic Blocks
Trace Scheduling
Scheduling for Loops
Loop Unrolling
Software Pipelining
Pedro Diniz 2
[email protected]
CSCI 565 - Compiler Design Spring 2016
[Figure: timing diagrams of a five-stage pipeline (IF, DE, EXE, MEM, WB) showing overlapped instructions, stalled instructions (???), and a branch followed by the next sequential instructions and the branch target instruction]
Constraints On Scheduling
Data Dependences
Inherent in the code
Control Dependences
Inherent in the code
Resource Constraints
Representing Dependences
Using a dependence DAG, one per Basic Block
Nodes are instructions, edges represent dependences
1: r2 = *(r1 + 4)
2: r3 = *(r1 + 8)
3: r4 = r2 + r3
4: r5 = r2 - 1
[Figure: dependence DAG with edges 1 → 3, 2 → 3 and 1 → 4, each labeled 2]
Edge is labeled with Latency:
v(i → j) = delay required between the initiation times of i and j, minus the execution time required by i
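As a rough illustration of how such a DAG could be built, the sketch below walks the four-instruction block once and records an edge from the last writer of a register to each later reader. The tuple encoding of the instructions, the names `BLOCK` and `build_dag`, and the two latency constants are assumptions made for this sketch; the load label 2 is chosen to match the edge labels on the slide, and only true (read-after-write) dependences are tracked.

```python
# Hypothetical encoding of the block: number -> (dest, sources, is_load).
BLOCK = {
    1: ("r2", ["r1"], True),         # r2 = *(r1 + 4)
    2: ("r3", ["r1"], True),         # r3 = *(r1 + 8)
    3: ("r4", ["r2", "r3"], False),  # r4 = r2 + r3
    4: ("r5", ["r2"], False),        # r5 = r2 - 1
}

LOAD_LATENCY = 2  # assumed edge label for a value produced by a load
ALU_LATENCY = 1   # assumed edge label for a value produced by an ALU op


def build_dag(block):
    """Return {(producer, consumer): latency} for read-after-write dependences."""
    last_writer = {}  # register -> number of the instruction that last wrote it
    edges = {}
    for num in sorted(block):
        dest, sources, _ = block[num]
        for reg in sources:
            if reg in last_writer:
                producer = last_writer[reg]
                produced_by_load = block[producer][2]
                edges[(producer, num)] = LOAD_LATENCY if produced_by_load else ALU_LATENCY
        last_writer[dest] = num
    return edges


dag = build_dag(BLOCK)
print(dag)  # {(1, 3): 2, (2, 3): 2, (1, 4): 2}
```

Register r1 is never written inside the block, so instructions 1 and 2 have no incoming edges and may be issued in either order.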
Resource Constraints
Modern Machines Have Many Resource Constraints
Superscalar Architectures:
Can Execute a Few Operations Concurrently
But have constraints
Example:
1 integer operation
ALUop dst, src1, src2 # in 1 clock cycle
In parallel with
1 memory operation
LD dst, addr # in 2 clock cycles
ST src, addr # in 1 clock cycle
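One common way to track constraints like these is a reservation table that records which resource is busy in which cycle. The sketch below is only an illustration under stated assumptions: the resource names `ALUop`, `MEM 1` and `MEM 2`, the idea that a 2-cycle load occupies two pipelined memory stages, and the helper `earliest_slot` are all choices made here, not something the slide specifies. Data dependences are ignored; only resource conflicts are checked.

```python
# Assumed reservation patterns: (resource, cycle offset from the issue cycle).
PATTERNS = {
    "ALUop": [("ALUop", 0)],
    "LD": [("MEM 1", 0), ("MEM 2", 1)],  # 2-cycle load modeled as two pipelined stages
    "ST": [("MEM 1", 0)],                # 1-cycle store
}


def earliest_slot(table, op, start=0):
    """Find the first cycle >= start where op fits, reserve its resources, return the cycle."""
    cycle = start
    while any((res, cycle + off) in table for res, off in PATTERNS[op]):
        cycle += 1
    for res, off in PATTERNS[op]:
        table[(res, cycle + off)] = op
    return cycle


table = {}
issue = [earliest_slot(table, op) for op in ["LD", "LD", "ALUop", "ST"]]
print(issue)  # [0, 1, 0, 2]
```

The ALU operation shares cycle 0 with the first load because they use different resources, while the store has to wait until the first memory stage is free.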
List Scheduling
Example
[Figure: list-scheduling walk-through on a dependence DAG with nodes 1 to 9. Edges (latency): 1 → 2 (1), 2 → 7 (1), 6 → 7 (4), 7 → 8 (3), 7 → 9 (3), 4 → 5 (3); node 3 has no edges.]
Node labels: 1: d=5 f=1, 2: d=4 f=1, 3: d=0 f=0, 4: d=3 f=1, 5: d=0 f=0, 6: d=7 f=1, 7: d=3 f=2, 8: d=0 f=0, 9: d=0 f=0
READY list over time: { 6, 1, 4, 3 }, { 1, 4, 3 }, { 4, 3 }, { 2, 4, 3 }, { 4, 3 }, { 7, 4, 3 }, { 7, 3 }, { 7, 3, 5 }, { 3, 5 }, { 3, 5, 8, 9 }, { 5, 8, 9 }, { 8, 9 }, { 9 }, { }
Resulting schedule: 6 1 2 4 7 3 5 8 9
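The walk-through above can be reproduced with a small single-issue list scheduler. This is a sketch under assumptions rather than the lecture's exact algorithm: the edge set is read off the example DAG, d is taken to be the longest latency-weighted path from a node to a leaf, f the number of immediate successors, and the priority is (d, then f, then lower node number). Those readings are consistent with the d and f labels in the example, but they are not spelled out in the extracted text.

```python
# Edges (producer, consumer) -> latency, as read off the example DAG (assumed).
# Node 3 has no edges.
EDGES = {(1, 2): 1, (2, 7): 1, (6, 7): 4, (7, 8): 3, (7, 9): 3, (4, 5): 3}
NODES = range(1, 10)

succs = {n: [] for n in NODES}
preds = {n: [] for n in NODES}
for (u, v), lat in EDGES.items():
    succs[u].append((v, lat))
    preds[v].append((u, lat))


def longest_path(n):
    """d(n): longest latency-weighted path from n to any leaf."""
    return max((lat + longest_path(v) for v, lat in succs[n]), default=0)


d = {n: longest_path(n) for n in NODES}
f = {n: len(succs[n]) for n in NODES}


def list_schedule():
    """Issue at most one node per cycle, picking the ready node with the largest (d, f)."""
    start = {}   # node -> issue cycle
    order = []
    cycle = 0
    while len(order) < len(NODES):
        ready = [
            n for n in NODES
            if n not in start
            and all(u in start and start[u] + lat <= cycle for u, lat in preds[n])
        ]
        if ready:
            best = max(ready, key=lambda n: (d[n], f[n], -n))
            start[best] = cycle
            order.append(best)
        cycle += 1
    return order, start


order, start = list_schedule()
print(order)  # [6, 1, 2, 4, 7, 3, 5, 8, 9]
```

Node 7 cannot issue before cycle 4 because of the latency-4 edge from node 6, which is why node 4 is placed ahead of it even though both have d=3.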
Resource Constraints
[Figure: resource reservation table filled in step by step while scheduling. Final row contents: ALUop: 2 5 6; MEM 1: 1 4 3 7; MEM 2: 1 4]
Interaction with Register Allocation
Example
1: LD r2, 0(r1)
2: ADD r3,r3,r2
3: LD r2,4(r5)
4: ADD r6,r6,r2
[Figure: dependence DAG and resource reservation table (ALUop, MEM 1, MEM 2) for the four instructions]
Anti-Dependence between 3 and 2: instruction 3 overwrites r2, which instruction 2 still has to read.
There is really no data flowing...
How to fix this?
How about using a different Register?
Example
1: LD r2, 0(r1)
2: ADD r3,r3,r2
3: LD r4,4(r5)
4: ADD r6,r6,r4
[Figure: dependence DAG and resource reservation table after renaming r2 to r4 in instructions 3 and 4]
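To make the effect of renaming concrete, the following sketch classifies the register dependences in the four-instruction example before and after instruction 3 is given a different destination register. The `(dest, sources)` encoding and the function name `dependences` are assumptions for illustration; memory dependences are not modeled.

```python
def dependences(block):
    """List (kind, earlier, later, register) tuples for a straight-line block.

    block is a list of (dest, sources); instructions are numbered from 1.
    """
    found = []
    last_writer = {}  # register -> instruction that last wrote it
    readers = {}      # register -> instructions that read it since that write
    for num, (dest, sources) in enumerate(block, start=1):
        for reg in sources:
            if reg in last_writer:                    # read after write
                found.append(("true", last_writer[reg], num, reg))
        for reader in readers.get(dest, []):          # write after read
            found.append(("anti", reader, num, dest))
        if dest in last_writer:                       # write after write
            found.append(("output", last_writer[dest], num, dest))
        for reg in sources:
            readers.setdefault(reg, []).append(num)
        last_writer[dest] = num
        readers[dest] = []  # the new value has no readers yet
    return found


original = [("r2", ["r1"]), ("r3", ["r3", "r2"]), ("r2", ["r5"]), ("r6", ["r6", "r2"])]
renamed = [("r2", ["r1"]), ("r3", ["r3", "r2"]), ("r4", ["r5"]), ("r6", ["r6", "r4"])]

print(dependences(original))
print(dependences(renamed))
```

With the original registers the block contains an anti dependence from 2 to 3 and an output dependence from 1 to 3 on r2; after renaming, only the two true dependences (1 to 2 and 3 to 4) remain, so the two load/add pairs can be interleaved freely.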
Scheduling across Basic Blocks
[Figure: control-flow graphs containing basic blocks B and C]
Trace Scheduling
Find the most common Trace of Basic Blocks
Use profiling information
Combine the Basic Blocks in the trace and schedule them as one Block
Create clean-up code if the execution goes off-trace
[Figure: trace scheduling on a control-flow graph with basic blocks A through H]
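The first step, finding the most common trace from profile data, could be sketched as a greedy walk that always follows the most frequently taken outgoing edge. The control-flow graph, the edge counts in `EDGE_COUNTS`, and the function `grow_trace` are hypothetical and exist only for this illustration; real trace schedulers typically also grow traces backward and must generate compensation code, which is not shown.

```python
# Hypothetical profile: (source block, destination block) -> times the edge was taken.
EDGE_COUNTS = {
    ("A", "B"): 90, ("A", "C"): 10,
    ("B", "D"): 90, ("C", "D"): 10,
    ("D", "E"): 100,
    ("E", "F"): 80, ("E", "G"): 20,
    ("F", "H"): 80, ("G", "H"): 20,
}


def grow_trace(seed, edge_counts, visited):
    """Grow a trace forward from seed along the hottest edge to an unvisited block."""
    trace = [seed]
    visited.add(seed)
    block = seed
    while True:
        out = {
            dst: count
            for (src, dst), count in edge_counts.items()
            if src == block and dst not in visited
        }
        if not out:
            return trace
        block = max(out, key=out.get)
        trace.append(block)
        visited.add(block)


visited = set()
main_trace = grow_trace("A", EDGE_COUNTS, visited)
side_trace = grow_trace("C", EDGE_COUNTS, visited)
print(main_trace)  # ['A', 'B', 'D', 'E', 'F', 'H']
print(side_trace)  # ['C']
```

Blocks left off the main trace (here C and G) form their own short traces; these are the places where execution can leave or rejoin the main trace and where clean-up code would be needed.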
Scheduling Loops
Loop bodies are small
But a lot of time is spent in loops due to the large number of iterations
Need better ways to schedule loops
Loop Example
Machine Model
One load/store unit
load 2 cycles
store 2 cycles
Two arithmetic units
add 2 cycles
branch 2 cycles (no delay slot)
multiply 3 cycles
Both units are pipelined (initiate one op each cycle)
Source Code
for i = 1 to N
A[i] = A[i] * b
Assembly Code
loop:
ld r6, (r2)
mul r6, r6, r3
st r6, (r2)
add r2, r2, 4
ble r2, r5, loop
Loop Unrolling
Unroll the Loop Body a few times
Pros:
Create a much larger basic block for the body
Eliminate a few loop bounds checks
Cons:
Much larger program
Setup code (# of iterations < unroll factor)
Beginning and end of the schedule can still have unused slots
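At source level, unrolling by a factor of two might look like the sketch below, written for the loop A[i] = A[i] * b used in these slides. The function names and the temporaries `t0` and `t1` are made up for this illustration; the separate temporaries play the role of using different registers in different copies of the body, and the trailing `if` is the extra code needed when the trip count is not a multiple of the unroll factor.

```python
def scale(a, b):
    """Reference loop: A[i] = A[i] * b."""
    for i in range(len(a)):
        a[i] = a[i] * b


def scale_unrolled_by_2(a, b):
    """Same computation with two copies of the body per loop test."""
    n = len(a)
    i = 0
    while i + 1 < n:          # one bounds check per two elements
        t0 = a[i] * b         # copy 1 of the body
        t1 = a[i + 1] * b     # copy 2 uses its own temporary
        a[i] = t0
        a[i + 1] = t1
        i += 2
    if i < n:                 # leftover iteration when n is odd
        a[i] = a[i] * b


data = [1, 2, 3, 4, 5]
scale_unrolled_by_2(data, 3)
print(data)  # [3, 6, 9, 12, 15]
```

The two multiplies in the unrolled body are independent of each other, which is what gives a scheduler room to overlap them.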
Loop Example
loop:
ld r6, (r2)
mul r6, r6, r3
st r6, (r2)
add r2, r2, 4
ble r2, r5, loop
Loop Example
loop:
ld r6, (r2)
mul r6, r6, r3
st r6, (r2)
add r2, r2, 4
ld r6, (r2)
mul r6, r6, r3
st r6, (r2)
add r2, r2, 4
ble r2, r5, loop
Loop Unrolling
Rename Registers
Use Different Registers in Different Loop Iterations
Loop Example
loop:
ld r6,(r2)
mul r6, r6, r3
st r6,(r2)
add r2, r2, 4
ld r7,(r2)
mul r7, r7, r3
st r7,(r2)
add r2, r2, 4
ble r2, r5, loop
Loop Example
loop:
ld r6,(r1)
mul r6, r6, r3
st r6,(r1)
add r2, r1, 4
ld r7,(r2)
mul r7, r7, r3
st r7,(r2)
add r1, r2, 4
ble r1, r5, loop
Loop Example
loop:
ld r6,(r1)
mul r6, r6, r3
st r6,(r1)
add r2, r1, 4
ld r7,(r2)
mul r7, r7, r3
st r7,(r2)
add r1, r1, 8
ble r1, r5, loop
Software Pipelining
Try to overlap Multiple Iterations so that the Slots will be filled
Find the Steady-State Window so that:
All the instructions of the loop body are executed
But from different iterations
Loop Example
Assembly Code
loop:
ld r6, (r2)
mul r6, r6, r3
st r6, (r2)
add r2, r2, 4
ble r2, r5, loop
Schedule
[Figure: pipelined schedule of overlapped iterations, one row per functional-unit stage and one column per cycle; a numeric suffix marks the iteration. Load/store rows: ld, ld1, ld2, st, ld3, st1, ld4, st2, ld5, st3, ld6. Multiply/branch rows: mul, mul1, mul2, ble, mul3, ble1, mul4, ble2, mul5. Add rows: add, add1, add2, add3.]
Loop Example
4 Iterations are Overlapped
Values of r3 and r5 don't change
4 regs for &A[i] (r2)
Each address is incremented by 4*4
4 regs to keep the value A[i] (r6)
Same registers can be reused after 4 of these blocks; generate code for 4 blocks, otherwise register-to-register moves are needed
[Figure: steady-state window of the schedule containing ld3, st1, st, mul2, ble, mul1, add1 and add]
Software Pipelining
Optimal use of Resources
Need a lot of Registers
Values in multiple iterations need to be kept separated
Issues with Dependences:
Executing a store instruction in an iteration before the branch instruction of a previous iteration has executed (writing when it should not have)
Loads and stores are issued out of order (need to figure out dependences before doing this)
Code Generation Issues:
Generate pre-amble and post-amble code
Multiple blocks so no register copy is needed
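The pre-amble / steady state / post-amble structure can be illustrated at source level for the loop A[i] = A[i] * b. The sketch below assumes a three-stage split (load, multiply, store) and uses made-up names (`scale_pipelined`, `loaded`, `product`); it only shows how work from three different iterations ends up in one kernel body, since sequential Python does not actually execute the stages in parallel.

```python
def scale_pipelined(a, b):
    """A[i] = A[i] * b with load, multiply and store taken from different iterations."""
    n = len(a)
    if n < 2:                       # too short to fill the pipeline
        for i in range(n):
            a[i] = a[i] * b
        return
    # Pre-amble: fill the pipeline.
    loaded = a[0]                   # ld,  iteration 0
    product = loaded * b            # mul, iteration 0
    loaded = a[1]                   # ld,  iteration 1
    # Steady state: one store, one multiply and one load per pass.
    for i in range(n - 2):
        a[i] = product              # st,  iteration i
        product = loaded * b        # mul, iteration i + 1
        loaded = a[i + 2]           # ld,  iteration i + 2
    # Post-amble: drain the pipeline.
    a[n - 2] = product              # st,  iteration n - 2
    a[n - 1] = loaded * b           # mul and st, iteration n - 1


values = [1, 2, 3, 4, 5]
scale_pipelined(values, 10)
print(values)  # [10, 20, 30, 40, 50]
```

Keeping `loaded` and `product` as separate variables mirrors the need to keep values from different iterations in different registers.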
Summary
Overview of Instruction Scheduling
List Scheduling
Resource Constraints
Interaction with Register Allocation
Scheduling across Basic Blocks
Trace Scheduling
Scheduling for Loops
Loop Unrolling
Software Pipelining