DSP compilation
Weidong Shi
Supervisor: Dr. Kenneth Mackenzie
YAMACRAW PROJECT
Compiling Procedure
Scheduling
Constraints:
- precedence
- functional unit capacity
- communication capacity*
- output dependency*
* optional constraint
Functional Unit Assignment
Register Move Scheduling
Constraints:
- functional unit capacity
- communication capacity
Register moves are used to handle output dependencies and to shift values between register files.
Register Allocation
Final Instruction Selection & Code Generation
Pipelining Memory Data Access
- branch scheduling
- prologue
- epilogue
- loop counter
- all other necessary instructions
Retrieved data are used, and produced results are stored, as soon as they become available.
Register Spill
Algorithms have been developed for all steps except step 5, which has to be done by hand.
Scheduling
Simple DSP Specification
#M1 multiplier <opname, pipeline cycle, execution cycle>
M1 = functional_unit { ops: <intmul,1,2>, <floatmul,1,3>; }
D1 = functional_unit { ops: <ldint,1,5>, <stint,1,1>, <intadd,1,1>; }
#shared cross path from register file B
PcrossB = path { connections: <RegFileB,M1>, <RegFileB,L1>, <RegFileB,S1>, <RegFileB,D1>; capacity: 1; }
[Figure: block diagram of the DSP with its functional units (M, L, S, D), register files RegFileA and RegFileB, and Memory]
The number of functional units and the shared data paths constitute architecture constraints that can be considered in scheduling.
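As an illustration of how a functional-unit line of this specification could be read by a tool, here is a minimal Python sketch. The function name `parse_functional_unit` and the returned data layout are assumptions made for this example; the slides do not describe the actual parser.

```python
import re

def parse_functional_unit(line):
    """Parse 'NAME = functional_unit { ops: <op,pipe,exec>, ...; }'.

    Returns (unit name, {opname: (pipeline cycle, execution cycle)}).
    """
    name = line.split("=", 1)[0].strip()
    triples = re.findall(r"<\s*(\w+)\s*,\s*(\d+)\s*,\s*(\d+)\s*>", line)
    ops = {op: (int(pipe), int(execute)) for op, pipe, execute in triples}
    return name, ops

m1 = parse_functional_unit("M1 = functional_unit { ops: <intmul,1,2>, <floatmul,1,3>;}")
d1 = parse_functional_unit(
    "D1 = functional_unit { ops: <ldint, 1,5>, <stint, 1, 1>, <intadd, 1,1>;}"
)
print(m1)
print(d1)
```

The per-operation tuples give the scheduler the latency it needs for the precedence constraint, and the set of unit names gives the capacity limits.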
Flow Graph - 2nd Order IIR
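[Figure: flow graph of the 2nd-order IIR filter from input x[n] to output y[n], with numbered nodes, delay elements D, and coefficients b0, b1, b2, a1, a2]
#control variables used for scheduling
In Node: 1
Out Node: 8
IPB Lower Bound: 3
IPB Upper Bound: 5
T: 14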
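For reference, the computation that such a flow graph describes can be sketched in a few lines of Python. The sign convention for the feedback coefficients a1 and a2 is an assumption (the slide shows only the graph), and the function name `iir2` and the sample coefficients are hypothetical.

```python
def iir2(x, b0, b1, b2, a1, a2):
    """y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] + a1*y[n-1] + a2*y[n-2]."""
    x1 = x2 = y1 = y2 = 0.0  # delay elements (the D boxes), initially zero
    out = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2
        x2, x1 = x1, xn  # shift the input delay line
        y2, y1 = y1, yn  # shift the output delay line
        out.append(yn)
    return out

# Impulse response with a single feedback tap (hypothetical coefficients).
impulse = iir2([1.0, 0.0, 0.0], b0=1.0, b1=0.0, b2=0.0, a1=0.5, a2=0.0)
print(impulse)
```

Each multiplication and addition in the loop body corresponds to one node of the flow graph; the feedback through the delays is what bounds the iteration period from below.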
Determine Node Resource Set
Assign a set of functional units to each node for allocation.
N1 (X): Integer Multipliers, Sn1 = {M1, M2}
N2 (+): Integer Adders, Sn2 = {L1, S1, L2, S2, D1, D2}
Variables: $N_{1,s,t}$ for $s \in S_{n1},\ t \in T$ and $N_{2,s,t}$ for $s \in S_{n2},\ t \in T$
Insert new nodes to represent communication operations. A set of data paths is assigned to each node for allocation.
N1 (X): Integer Multipliers, Sn1 = {M1, M2}
N3: Data Paths, Sn3 = {M1xRegA, M2xRegB}
N4: Data Paths, Sn4 = {RegAxL1, PcrossB, RegAxS1, PcrossA, ...}
N2: Integer Adders, Sn2 = {L1, S1, L2, S2, D1, D2}
Simplified flow graph: node resource sets are reduced, and nodes whose resource sets impose no constraint are deleted.
N1 (X): Integer Multipliers, Sn1 = {M1, M2}
N3: Data Paths, Sn3 = {M1xRegA, M2xRegB}
N4: Data Paths, Sn4 = {PcrossA, PcrossB, PwithinA, PwithinB}
N2: Integer Adders, Sn2 = {L1, S1, L2, S2, D1, D2}
Represent Constraints
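Precedence Constraint
N1 (X) -> N2 (+)
$$\sum_{s \in S_{n2}} \sum_{t \in T} t\,N_{2,s,t} - \sum_{s \in S_{n1}} \sum_{t \in T} t\,N_{1,s,t} \ge \sum_{s \in S_{n1}} \sum_{t \in T} C_s\,N_{1,s,t} - D \cdot IPB_i$$
$C_s$: execution cycle of functional unit $s$; $N_{k,s,t}$: variable for node Nk on resource $s$ at time $t$
Consistency Constraint
N1 (X) -> N4 -> N2 (+)
Integer Multipliers Sn1 = {M1, M2}
Data Paths Sn4 = {PcrossA, PcrossB, PwithinA, PwithinB}
Integer Adders Sn2 = {L1, S1, L2, S2, D1, D2}
$$\sum_{t \in T} N_{1,M1,t} = \sum_{t \in T} N_{4,PcrossA,t} + \sum_{t \in T} N_{4,PwithinA,t}$$
$$\sum_{t \in T} N_{1,M2,t} = \sum_{t \in T} N_{4,PcrossB,t} + \sum_{t \in T} N_{4,PwithinB,t}$$
$$\sum_{t \in T} N_{4,PcrossA,t} + \sum_{t \in T} N_{4,PwithinB,t} = \sum_{t \in T} N_{2,L2,t} + \sum_{t \in T} N_{2,S2,t} + \sum_{t \in T} N_{2,D2,t}$$
$$\sum_{t \in T} N_{4,PcrossB,t} + \sum_{t \in T} N_{4,PwithinA,t} = \sum_{t \in T} N_{2,L1,t} + \sum_{t \in T} N_{2,S1,t} + \sum_{t \in T} N_{2,D1,t}$$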
Represent Constraints
Functional Unit Constraint
$$\sum_{i \in Q_t} N_{i,M1,t} \le 1 + 2(1 - IPB_i)$$
$Q_t$: set of nodes with M1 assigned; repeat for each IPB and time $t$
$$\sum_{i \in Q_t} N_{i,M2,t} \le 1 + 2(1 - IPB_i)$$
$Q_t$: set of nodes with M2 assigned; repeat for each IPB and time $t$
...
Communication Capacity Constraint
$$\sum_{i \in Q_t} N_{i,PcrossA,t} \le 1 + 2(1 - IPB_i)$$
$Q_t$: set of communication nodes with PcrossA assigned; repeat for each IPB and time $t$
$$\sum_{i \in Q_t} N_{i,PcrossB,t} \le 1 + 2(1 - IPB_i)$$
$Q_t$: set of communication nodes with PcrossB assigned; repeat for each IPB and time $t$
...
Objective Function
Minimize IPB (iteration period bound).
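To make the constraints concrete, the sketch below searches for the smallest IPB by brute force instead of solving the integer program. It is only an illustration under stated assumptions: the function names, the toy graph, and the per-type unit counts are hypothetical, functional units are assumed to be fully pipelined (pipeline cycle 1), and communication nodes are left out.

```python
from itertools import product

def fits(nodes, units, start, ipb):
    """Capacity check: ops of one type sharing a slot (time mod IPB) must not exceed the unit count."""
    used = {}
    for name, (kind, _latency) in nodes.items():
        key = (kind, start[name] % ipb)
        used[key] = used.get(key, 0) + 1
        if used[key] > units[kind]:
            return False
    return True

def min_ipb(nodes, edges, units, lower, upper, horizon):
    """Return (IPB, start times) for the smallest feasible IPB, or None.

    nodes: {name: (unit type, execution cycles)}
    edges: (src, dst, number of delays on the edge)
    """
    names = list(nodes)
    for ipb in range(lower, upper + 1):
        for times in product(range(horizon), repeat=len(names)):
            start = dict(zip(names, times))
            precedence_ok = all(
                start[dst] - start[src] >= nodes[src][1] - delays * ipb
                for src, dst, delays in edges
            )
            if precedence_ok and fits(nodes, units, start, ipb):
                return ipb, start
    return None

# Toy recurrence: a 2-cycle multiply feeds a 1-cycle add, which feeds back through one delay.
nodes = {"mul": ("M", 2), "add": ("L", 1)}
edges = [("mul", "add", 0), ("add", "mul", 1)]
result = min_ipb(nodes, edges, {"M": 1, "L": 1}, lower=1, upper=5, horizon=6)
print(result)
```

Exhaustive search is only practical for tiny graphs; it is used here because it shows the precedence and capacity checks directly, without a solver.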
Pipelining Data Access
Pipelining data access to minimize register use.
MPY reg1, rega, reg5
ADD regb, reg2, rege
MPY reg3, reg8, reg6
||ADD regc, 4, regf
ADD reg4, regd, reg8
Hypothetical machine:
- 2 cycle load, 1 cycle add, 2 cycle mul
- two register files, with one load/store unit for each register file
Values to be stored: reg5, reg6, rege, regf
Loaded values in: reg1, reg2, reg3, reg4, rega, regb, regc, regd
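Reservation table (columns: Register Source, Register Dest, Register Source, Register Dest, LD/ST, LD/ST)
First register file: Register Source reg1, reg2, reg3, reg4; Register Dest reg5, reg6
Second register file: Register Source rega, regb, regc, regd; Register Dest rege, regf
LD/ST (first unit): LD reg1, ST reg5, ST reg6
LD/ST (second unit): LD rega, ST rege, ST regf
LD reg2 conflicts with ST reg5, and LD regb conflicts with ST rege. Insert a new time slot and move instructions.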
Pipelining Data Access
Reservation table (columns: Register Source, Register Dest, Register Source, Register Dest, LD/ST, LD/ST)
First register file: reg2, reg3, reg4, reg1, reg5, reg6
Second register file: regb, regc, regd, rega, rege, regf
LD/ST (first unit): LD reg1, ST reg5, LD reg2, ST reg6
LD/ST (second unit): LD rega, ST rege, LD regb, ST regf
ST reg5 conflicts with LD reg1, and ST regf conflicts with LD regb. ST reg5 replaces LD reg1. Insert a new time slot to take LD reg1 and move instructions.
Reservation table (columns: Register Source, Register Dest, Register Source, Register Dest, LD/ST, LD/ST)
First register file: reg2, reg3, reg4, reg6, reg1, reg5
Second register file: regb, regc, regd, rege, regf, rega
LD/ST (first unit): ST reg5, LD reg1, LD reg2, ST reg6
LD/ST (second unit): LD rega, ST rege, LD regb, ST regf
ST rege conflicts with LD rega, and ST reg6 conflicts with LD reg2. ST rege replaces LD rega. Insert a new time slot to take LD rega and move instructions.
Pipelining Data Access
Reservation table (columns: Register Source, Register Dest, Register Source, Register Dest, LD/ST, LD/ST)
First register file: reg2, reg3, reg4, reg6, reg1, reg5
Second register file: regb, regc, regd, rege, regf, rega
LD/ST (first unit): ST reg5, LD reg1, LD reg1, ST reg6, LD reg2
LD/ST (second unit): LD regb, ST regf, ST rege, LD rega
ST regf conflicts with LD rega. ST regf replaces LD rega. Insert a new time slot to take LD rega and move instructions.
Reservation table (columns: Register Source, Register Dest, Register Source, Register Dest, LD/ST, LD/ST)
First register file: reg2, reg3, reg4, reg6, reg1, reg5
Second register file: regb, regc, regd, rege, regf, rega
LD/ST (first unit): ST reg5, LD reg1, ST reg6, LD reg2
LD/ST (second unit): ST rege, ST regf, LD regb, LD rega
ST reg6 conflicts with LD reg1. ST reg6 replaces LD reg1. Insert a new time slot to take LD reg1 and move instructions.
Pipelining Data Access
Reservation table (columns: Register Source, Register Dest, Register Source, Register Dest, LD/ST, LD/ST)
First register file: reg2, reg3, reg4, reg6, reg1, reg5
Second register file: regb, regc, regd, rege, regf, rega
LD/ST (first unit): LD reg4, ST reg6, LD reg1, ST reg5, LD reg2, LD reg3
LD/ST (second unit): LD rega, ST rege, ST regf, LD regd, LD regb, LD regc
After pipelining, loaded values are used as soon as they arrive and produced values are saved as soon as they are available. The cycle overhead is smaller if only loads or only stores are pipelined.
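The conflict handling walked through above can be sketched for a single LD/ST unit as follows. This is a simplified illustration, not the author's algorithm: the function name `place` is hypothetical, the unit is modelled as a plain list of time slots, and the choice to put the displaced operation into a new slot just before the contested one is an assumption.

```python
def place(unit_slots, cycle, op):
    """Put a memory op on one LD/ST unit at `cycle`.

    If the slot is taken, the new op replaces the old one and a new time
    slot is inserted to hold the displaced op; later slots move by one.
    """
    if unit_slots[cycle] is None:
        unit_slots[cycle] = op
    else:
        displaced = unit_slots[cycle]
        unit_slots[cycle] = op
        unit_slots.insert(cycle, displaced)
    return unit_slots

slots = ["LD reg1", None, "LD reg2"]
place(slots, 1, "ST reg6")  # free slot: no new time slot needed
place(slots, 0, "ST reg5")  # conflict with LD reg1: a new time slot is inserted
print(slots)
```

A full implementation would shift the instructions of every functional unit when a slot is inserted, since a new time slot lengthens the whole loop kernel.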
Allocation of Loop Carried Registers
ADD reg1, reg2, reg3 ||MPY reg4, x, reg5
MV reg5, reg6 ||ADD reg6, reg9, reg4
LD reg7 ||LD reg8
LD reg2 ||LD reg9
ADD reg7, reg8, reg9 ||ADD reg3, x, reg1
[Figure: Logical Register Live Range, showing the live ranges of reg1 to reg9]
Hypothetical machine with 6 physical registers. LD: 2 execution cycles; ADD: 1 execution cycle; MPY: 2 execution cycles.
Allocation of Loop Carried Registers
Physical registers are allocated from left to right for each cycle.
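Cycles: 1 2 3 4 5
Logical register:   reg1  reg2  reg3  reg4  reg5  reg6  reg7  reg8  reg9
Physical register:  A     B     A     C     D     E     B     F     F
Contents of each physical register:
regA: reg1, reg3
regB: reg2 (from cycle 1), reg7 (from cycle 3)
regC: reg4
regD: reg5
regE: reg6
regF: reg9 (from cycle 1), reg8 (from cycle 3)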
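A minimal sketch of left-to-right allocation over live ranges is shown below. The live ranges in the example are hypothetical (the slide's chart did not survive), the function name `allocate` is an assumption, and the wrap-around of loop-carried ranges across iterations is not modelled.

```python
def allocate(live_ranges, physical):
    """Assign each logical register the leftmost physical register that is free.

    live_ranges: {logical: (first cycle, last cycle)}, both inclusive, cycles start at 1
    physical: physical register names in left-to-right order
    """
    busy_until = {p: 0 for p in physical}
    assignment = {}
    # Visit logical registers in order of the cycle where their live range starts.
    for logical, (start, end) in sorted(live_ranges.items(), key=lambda kv: kv[1][0]):
        free = [p for p in physical if busy_until[p] < start]
        if not free:
            raise RuntimeError(f"no free register for {logical}: spill needed")
        assignment[logical] = free[0]
        busy_until[free[0]] = end
    return assignment

ranges = {"r1": (1, 2), "r2": (1, 3), "r3": (3, 4), "r4": (4, 5)}
mapping = allocate(ranges, ["A", "B"])
print(mapping)
```

When no physical register is free the sketch simply raises an error; that is the point where a spill decision would be made.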
Register Spill
There is a cost associated with each physical register allocation. Always allocate the physical register with the minimal cost. Sometimes a spill is inevitable, namely when every allocation has a positive cost.
[Figure: live ranges of reg1 to reg5 on physical registers A to D; the register for reg5 is still to be chosen]
Cost of allocating A, B, C, D to logical register 5
Register  Cost
A         0
B         reg2 spill cost + reg5 spill cost
C         min(reg3 spill cost, reg5 spill cost)
D         infinity
Register Spill
[Figure: live ranges of reg1 to reg5 on physical registers A to D; the register for reg5 is still to be chosen]
Cost of allocating A, B, C, D to logical register 5
Register  Cost
A         0
B         reg2 spill cost
C         min(reg3 spill cost, reg5 spill cost)
D         infinity
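The minimal-cost rule can be sketched as below. The numeric spill costs are hypothetical values chosen for illustration, and the function name `pick_register` is an assumption; only the shape of the cost table comes from the slide.

```python
import math

def pick_register(costs):
    """Return (physical register with minimal cost, whether a spill is needed)."""
    best = min(costs, key=costs.get)
    if math.isinf(costs[best]):
        raise RuntimeError("no physical register can be allocated")
    return best, costs[best] > 0

spill = {"reg2": 4, "reg3": 3, "reg5": 2}  # hypothetical spill costs
costs = {
    "A": 0,
    "B": spill["reg2"],
    "C": min(spill["reg3"], spill["reg5"]),
    "D": math.inf,
}
choice = pick_register(costs)
print(choice)

# If A were unavailable, every remaining allocation has a positive cost,
# so a spill becomes inevitable and the cheapest one is taken.
forced = pick_register(dict(costs, A=math.inf))
print(forced)
```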
Register Spill
Each register spill produces at least one extra LD and one extra ST. These new data access operations have to be pipelined before the register allocation routine is executed again.
Register Allocation -> Register Spill -> Pipelining Data Access -> Register Allocation
Flow Graph-6th Order IIR
[Figure: flow graph of the 6th-order IIR filter, with numbered nodes up to N24 consisting of adders (+), multipliers (X), and delay elements (D)]
Scheduling (6th Order IIR )
      MUL              ADD
T1    19(1), 22(1)     8(-3)
T2    12(-2), 21(1)    0(-3), 5(-3), 17(-2)
T3    6(-2), 15(0)     1(-3), 10(-2), 24(0)
T4    3(-2), 4(-2)
T5    11(0), 20(1)     2(-4), 9(-2), 23(1)
T6    7(-1), 16(0)     13(-2), 14(0)
In each entry X(y), X is the operation number in the flow graph and y is the iteration number.
Functional Unit Assignment (6th Order IIR )
Assignment 1: This one has many cross-path violations.
      M1       M2       L1       L2       S1
T1    19(1)    22(1)    8(-3)
T2    12(-2)   21(1)    0(-3)    5(-3)    17(-2)
T3    6(-2)    15(0)    1(-3)    10(-2)   24(0)
T4    3(-2)    4(-2)
T5    11(0)    20(1)    2(-4)    9(-2)    23(1)
T6    7(-1)    16(0)    13(-2)   14(0)

Assignment 2: only one cross-path violation (one of the best cases).
      M1       M2       Adders (L1, L2, S1, S2)
T1    22(1)    19(1)    8(-3)
T2    21(1)    12(-2)   0(-3), 5(-3), 17(-2)
T3    6(-2)    15(0)    1(-3), 10(-2), 24(0)
T4    4(-2)    3(-2)
T5    20(1)    11(0)    2(-4), 9(-2), 23(1)
T6    16(0)    7(-1)    13(-2), 14(0)
Communication cost can be minimized when functional units are assigned to operations. Note that assignment 2 does not give a better cycle count than assignment 1, because only one cross-path constraint violation is left in assignment 1 after register moves are inserted to deal with output dependency. The rest of this presentation is based on assignment 1.
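One way to compare assignments is to count the cycles in which a cross path is oversubscribed. The sketch below assumes that units whose names end in 1 read register file A and units ending in 2 read register file B, that each cross path has capacity 1 (as in the PcrossB specification), and that every cross-file operand uses the path in the consumer's cycle. The function names and the toy data are hypothetical.

```python
from collections import Counter

def side(unit):
    """Register file side of a unit: M1, L1, S1, D1 -> A; M2, L2, S2, D2 -> B."""
    return "A" if unit.endswith("1") else "B"

def cross_path_violations(edges, unit_of, cycle_of, capacity=1):
    """Return {(path, cycle): uses} for every cross path used more than `capacity` times."""
    uses = Counter()
    for producer, consumer in edges:
        src, dst = side(unit_of[producer]), side(unit_of[consumer])
        if src != dst:
            # The value lives in file `src` and is read over that file's cross path.
            uses[(f"Pcross{src}", cycle_of[consumer])] += 1
    return {key: n for key, n in uses.items() if n > capacity}

edges = [("m1", "a1"), ("m2", "a2"), ("m3", "a3")]
unit_of = {"m1": "M1", "m2": "M1", "m3": "M2", "a1": "L2", "a2": "S2", "a3": "L1"}
cycle_of = {"a1": 3, "a2": 3, "a3": 3}
violations = cross_path_violations(edges, unit_of, cycle_of)
print(violations)
```

In the toy data two consumers on the B side read register file A in the same cycle, which is the kind of conflict that a register move to the other file resolves.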
New Flow Graph With Pseudo-registers
[Figure: the 6th-order IIR flow graph redrawn with pseudo-registers Reg0 to reg50 attached to the inputs and outputs of its nodes]
Resolve Violation of Output Dependency and Cross-path Constraint by Register Move
Problem: Output dependency is violated at reg2, reg5, reg6, reg14, reg9, reg15, reg19, reg20, and reg23 (shown as red dashed lines in the previous flow graph).
Solution: Old values are moved to new pseudo-registers (small blue ovals) before they are overwritten.
Problem: At time T3, both operation 10 and operation 15 load values from register file A via the cross-path.
Solution: The value used by operation 10 is pre-fetched to register file B by a register move.
Schedule Register Moves (6th Order IIR )
      M1       M2       L1       L2       S1       S2       D1       D2
T1    19(1)    22(1)    8(-3)    MV9      MV7
T2    12(-2)   21(1)    0(-3)    5(-3)    17(-2)            MV1
T3                      MV10              MV6
T4    6(-2)    15(0)    1(-3)    10(-2)   24(0)
T5    3(-2)    4(-2)             MV4
T6                      MV2      MV3
T7    11(0)    20(1)    2(-4)    9(-2)    23(1)
T8    7(-1)    16(0)    13(-2)   14(0)    MV5      MV8

Move   Src     Dest
MV1    reg2    reg37
MV2    reg5    reg43
MV3    reg6    reg44
MV4    reg14   reg45
MV5    reg9    reg46
MV6    reg15   reg47
MV7    reg19   reg48
MV8    reg20   reg49
MV9    reg23   reg50
MV10   reg2    reg39
Coalesce redundant register moves (only ten moves are left after this is done). Schedule the register moves, inserting new time slots if necessary. The most restrictive register moves are scheduled first.
Assembly 1 - loop kernel (6th Order IIR )
MPY .m1 reg2, reg35, reg12
||MPY .m2 reg39, reg30, reg14
||ADD .l1 reg46, reg21, reg22
||MV .l2x reg23, reg50
||MV .s1x reg19, reg48

MPY .m1 reg37, reg33, reg10
||MPY .m2 reg39, reg36, reg13
||ADD .l1x reg0, reg18, reg1
||ADD .l2x reg8, reg22, reg23
||ADD .s1 reg44, reg47, reg16
||MV .d1 reg2, reg37

MV .l1x reg2, reg39
||MV .s1 reg15, reg47

MPY .m1 reg2, reg26, reg4
||MPY .m2 reg34, reg39, reg11
||ADD .l1 reg1, reg3, reg2
||ADD .l2x reg43, reg16, reg17
||ADD .s1x reg45, reg7, reg15

MPY .m1 reg2, reg25, reg3
||MPY .m2 reg31, reg2, reg8
||MV .l2x reg14, reg45

MV .l1x reg5, reg43
||MV .l2x reg6, reg44

MPY .m1 reg2, reg27, reg5
||MPY .m2 reg39, reg29, reg7
||ADD .l1 reg37, reg50, reg24
||ADD .l2x reg17, reg4, reg18
||ADD .s1x reg12, reg13, reg19

MPY .m1 reg2, reg32, reg9
||MPY .m2 reg39, reg28, reg6
||ADD .l1 reg10, reg49, reg21
||ADD .l2 reg11, reg48, reg20
||MV .s1 reg9, reg46
||MV .s2x reg20, reg49
Eight cycle assembly based on the previous reservation table.
Assembly 2 (pipelining data access) - loop kernel (6th Order IIR )
[Figure: reg51 and reg52 point to the coefficient arrays, reg53 to the input array, and reg55 to the output array]

MPY .m1 reg2, reg35, reg12
||MPY .m2 reg39, reg30, reg14
||ADD .l1 reg46, reg21, reg22
||MV .l2x reg23, reg50
||MV .s1x reg19, reg48
||LDH .d1 *reg51, reg25
||LDH .d2 *reg52, reg31

MPY .m2 reg39, reg36, reg13
||ADD .l1x reg0, reg18, reg1
||ADD .l2x reg8, reg22, reg23
||ADD .s1 reg44, reg47, reg16
||MV .d1 reg2, reg37

MV .l1x reg2, reg39
||MV .s1 reg15, reg47
||MPY .m1 reg37, reg33, reg10
||LDH .d1 *+reg51[1], reg27
||LDH .d2 *+reg52[1], reg29
||SUB .s2 reg54, 1

MPY .m1 reg2, reg26, reg4
||MPY .m2 reg34, reg39, reg11
||ADD .l1 reg1, reg3, reg2
||ADD .l2x reg43, reg16, reg17
||ADD .s1x reg45, reg7, reg15
||LDH .d1 *+reg51[2], reg32
||LDH .d2 *+reg52[2], reg28
||B .s2

LDH .d1 *+reg51[3], reg35
||LDH .d2 *+reg52[3], reg30

MPY .m1 reg2, reg25, reg3
||MPY .m2 reg31, reg2, reg8
||MV .l2x reg14, reg45
||LDH .d2 *+reg52[4], reg36
||LDH .d1 *reg53++, reg0

MV .l1x reg5, reg43
||MV .l2x reg6, reg44
||LDH .d1 *+reg51[4], reg33

MPY .m1 reg2, reg27, reg5
||MPY .m2 reg39, reg29, reg7
||ADD .l1 reg37, reg50, reg24
||ADD .l2x reg17, reg4, reg18
||ADD .s1x reg12, reg13, reg19
||LDH .d1 *+reg51[5], reg26
||LDH .d2 *+reg52[5], reg34

MPY .m1 reg2, reg32, reg9
||MPY .m2 reg39, reg28, reg6
||ADD .l1 reg10, reg49, reg21
||ADD .l2 reg11, reg48, reg20
||MV .s1 reg9, reg46
||MV .s2x reg20, reg49
||STH .d2 reg24, *reg55++
Pipeline data access (loads and stores) to minimize register use. Retrieved data are used, and produced results are stored, as soon as they become available.
Assembly 3 (register allocation) - loop kernel (6th Order IIR )
Pseudo-register to physical register map (x: no physical register listed):
reg0 A0     reg1 A0     reg2 A12    reg3 A7     reg4 A0     reg5 A10    reg6 B0
reg7 B6     reg8 B3     reg9 A1     reg10 A8    reg11 B6    reg12 A2    reg13 B13
reg14 B3    reg15 A6    reg16 A14   reg17 B5    reg18 B13   reg19 A0    reg20 B8
reg21 A1    reg22 A2    reg23 B1    reg24 A2    reg25 A14   reg26 A6    reg27 A10
reg28 B0    reg29 B0    reg30 B0    reg31 B7    reg32 A3    reg33 A5    reg34 B5
reg35 A2    reg36 B1    reg37 A13   reg38 x     reg39 B12   reg40 x     reg41 x
reg42 x     reg43 B7    reg44 A14   reg45 A9    reg46 A3    reg47 A5    reg48 B2
reg49 A11   reg50 A3    reg51 A4    reg52 B4    reg53 A15   reg54 B14   reg55 B15

MPY .m1 A12, A2, A2
||MPY .m2 B12, B0, B3
||ADD .l1 A3, A1, A2
||MV .l2x B1, A3
||MV .s1x A0, B2
||LDH .d1 *A4, A14
||LDH .d2 *B4, B7

MPY .m2 B12, B1, B13
||ADD .l1x A0, B13, A0
||ADD .l2x B3, A2, B1
||ADD .s1 A14, A5, A14
||MV .d1 A12, A13

MV .l1x A12, B12
||MV .s1 A6, A5
||MPY .m1 A13, A5, A8
||LDH .d1 *+A4[1], A10
||LDH .d2 *+B4[1], B0
||SUB .s2 B14, 1

MPY .m1 A12, A6, A0
||MPY .m2 B5, B12, B6
||ADD .l1 A0, A7, A12
||ADD .l2x B7, A14, B5
||ADD .s1x A9, B6, A6
||LDH .d1 *+A4[2], A3
||LDH .d2 *+B4[2], B0
||B .s2

LDH .d1 *+A4[3], A2
||LDH .d2 *+B4[3], B0

MPY .m1 A12, A14, A7
||MPY .m2 B7, A12, B3
||MV .l2x B3, A9
||LDH .d2 *+B4[4], B1
||LDH .d1 *A15++, A0

MV .l1x A10, B7
||MV .l2x B0, A14
||LDH .d1 *+A4[4], A5

MPY .m1 A12, A10, A10
||MPY .m2 B12, B0, B6
||ADD .l1 A13, A3, A2
||ADD .l2x B5, A0, B13
||ADD .s1x A2, B13, A0
||LDH .d1 *+A4[5], A6
||LDH .d2 *+B4[5], B5

MPY .m1 A12, A3, A1
||MPY .m2 B12, B0, B0
||ADD .l1 A8, A11, A1
||ADD .l2 B6, B2, B8
||MV .s1 A1, A3
||MV .s2x B8, A11
||STH .d2 A2, *B15++
Nine-cycle loop kernel after physical register allocation. Note that the prologue and epilogue are not shown.
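The step from pseudo-registers to physical registers is a textual substitution once the allocation map is known. A minimal sketch, assuming the map is held in a dictionary; the function name `apply_allocation` is hypothetical and only a few entries of the map are included:

```python
import re

def apply_allocation(line, reg_map):
    """Replace every pseudo-register name (regN) in an assembly line by its physical register."""
    return re.sub(r"\breg\d+\b", lambda m: reg_map[m.group(0)], line)

# A few entries taken from the allocation map of this kernel.
reg_map = {"reg2": "A12", "reg35": "A2", "reg12": "A2", "reg51": "A4", "reg25": "A14"}

mpy = apply_allocation("MPY .m1 reg2, reg35, reg12", reg_map)
ldh = apply_allocation("||LDH .d1 *reg51, reg25", reg_map)
print(mpy)
print(ldh)
```

The word-boundary pattern keeps reg2 from matching inside reg25, which matters because the pseudo-register names share prefixes.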