Talk Gatech DSP Compilation 2000

The document discusses the process of compiling DSP programs. It involves 6 main steps: 1) scheduling, 2) functional unit assignment, 3) register move scheduling, 4) register allocation, 5) instruction selection and code generation, and 6) register spill handling. It provides examples of scheduling and assigning operations to functional units for a 6th order IIR filter program. It also describes resolving output dependency and cross-path constraints through register moves and introducing pseudo-registers.

Uploaded by

larryshi
Copyright
© Attribution Non-Commercial (BY-NC)
DSP compilation

Weidong Shi
Supervisor: Dr. Kenneth Mackenzie
YAMACRAW PROJECT

Compiling Procedure

Scheduling
Constraints:
- precedence
- functional unit capacity
- communication capacity*
- output dependency*
*optional constraint

Functional Unit Assignment

Register Move Scheduling
Constraints:
- functional unit capacity
- communication capacity
Register moves are used to handle output dependencies and to shift values between register files.

Register Allocation

Final Instruction Selection & Code Generation
- branch scheduling
- prologue
- epilogue
- loop counter
- all other necessary instructions

Pipelining Memory Data Access
Retrieved data (produced results) are used (stored) immediately after becoming available.

Register Spill

Algorithms have been developed for all steps except step 5, which has to be done by hand.

Scheduling

Simple DSP Specification

#M1 multiplier <opname, pipeline cycle, execution cycle>
M1 = functional_unit { ops: <intmul,1,2>, <floatmul,1,3>; }
D1 = functional_unit { ops: <ldint,1,5>, <stint,1,1>, <intadd,1,1>; }
#shared cross path from register file B
PcrossB = path { connections: <RegFileB,M1>, <RegFileB,L1>, <RegFileB,S1>, <RegFileB,D1>; capacity: 1; }

[Figure: block diagram of the DSP showing the functional units (labels M1, L1, S1, D1, M2, L2, S3, D2), the register files RegFileA and RegFileB, and Memory.]

The number of functional units and the shared data paths constitute architecture constraints that can be considered in scheduling.
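A scheduler could hold such a specification as plain data. The sketch below mirrors only the M1, D1, and PcrossB entries shown above; the class names, field names, and the helper `units_supporting` are assumptions made for illustration and are not part of the original specification language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    pipeline_cycles: int
    execution_cycles: int


@dataclass(frozen=True)
class FunctionalUnit:
    name: str
    ops: tuple

    def latency(self, opname):
        """Execution cycles of `opname` on this unit."""
        for op in self.ops:
            if op.name == opname:
                return op.execution_cycles
        raise KeyError(opname)


@dataclass(frozen=True)
class Path:
    name: str
    connections: tuple  # (source register file, destination unit) pairs
    capacity: int       # transfers allowed per cycle


# Entries copied from the specification on the slide.
M1 = FunctionalUnit("M1", (Op("intmul", 1, 2), Op("floatmul", 1, 3)))
D1 = FunctionalUnit("D1", (Op("ldint", 1, 5), Op("stint", 1, 1), Op("intadd", 1, 1)))
PCROSS_B = Path(
    "PcrossB",
    (("RegFileB", "M1"), ("RegFileB", "L1"), ("RegFileB", "S1"), ("RegFileB", "D1")),
    1,
)


def units_supporting(opname, units):
    """Hypothetical helper: names of the units that can execute `opname`."""
    return [u.name for u in units if any(op.name == opname for op in u.ops)]
```

With this representation, the node resource set of an operation is simply the result of `units_supporting` for its opcode.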

Flow Graph - 2nd Order IIR


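[Figure: flow graph of a 2nd-order IIR filter with input x[n], output y[n], delay elements D, coefficient multipliers b0, b1, b2, a1, a2, and numbered operation nodes.]

#control variables used for scheduling
In Node: 1
Out Node: 8
IPB Lower Bound: 3
IPB Upper Bound: 5
T: 14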

Determine Node Resource Set


Assign a set of functional units to each node for allocation.

[Figure: multiplier node N1 (X) feeding adder node N2 (+)]
N1: Integer Multipliers, Sn1 = {M1, M2}; variables $N_{1,s,t}$ with $s \in S_{n1}$, $t \in T$
N2: Integer Adders, Sn2 = {L1, S1, L2, S2, D1, D2}; variables $N_{2,s,t}$ with $s \in S_{n2}$, $t \in T$

Insert new nodes to represent communication operations. A set of data paths is assigned to each node for allocation.

[Figure: N1 (X) -> N3 -> N4 -> N2 (+)]
N1: Integer Multipliers, Sn1 = {M1, M2}
N3: Data Paths, Sn3 = {M1xRegA, M2xRegB}
N4: Data Paths, Sn4 = {RegAxL1, PcrossB, RegAxS1, PcrossA, ...}
N2: Integer Adders, Sn2 = {L1, S1, L2, S2, D1, D2}

Simplified flow graph: node resource sets are reduced, and nodes whose resource sets are not constraints are deleted.

[Figure: N1 (X) -> N3 -> N4 -> N2 (+)]
N1: Integer Multipliers, Sn1 = {M1, M2}
N3: Data Paths, Sn3 = {M1xRegA, M2xRegB}
N4: Data Paths, Sn4 = {PcrossA, PcrossB, PwithinA, PwithinB}
N2: Integer Adders, Sn2 = {L1, S1, L2, S2, D1, D2}

Represent Constraints
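Precedence Constraint

[Figure: multiplier node N1 (X) feeding adder node N2 (+)]

$$\sum_{s \in S_{n2}} \sum_{t \in T} t \, N_{2,s,t} - \sum_{s \in S_{n1}} \sum_{t \in T} t \, N_{1,s,t} \ge \sum_{s \in S_{n1}} \sum_{t \in T} C_s \, N_{1,s,t} - D \cdot IPB$$

$C_s$: execution cycle of functional unit $s$

Consistency Constraint

[Figure: N1 (X) -> N4 -> N2 (+)]
Integer Multipliers Sn1 = {M1, M2}
Data Paths Sn4 = {PcrossA, PcrossB, PwithinA, PwithinB}
Integer Adders Sn2 = {L1, S1, L2, S2, D1, D2}

$$\sum_{t \in T} N_{1,\mathrm{M1},t} = \sum_{t \in T} N_{4,\mathrm{PcrossA},t} + \sum_{t \in T} N_{4,\mathrm{PwithinA},t}$$

$$\sum_{t \in T} N_{1,\mathrm{M2},t} = \sum_{t \in T} N_{4,\mathrm{PcrossB},t} + \sum_{t \in T} N_{4,\mathrm{PwithinB},t}$$

$$\sum_{t \in T} N_{4,\mathrm{PcrossA},t} + \sum_{t \in T} N_{4,\mathrm{PwithinB},t} = \sum_{t \in T} N_{2,\mathrm{L2},t} + \sum_{t \in T} N_{2,\mathrm{S2},t} + \sum_{t \in T} N_{2,\mathrm{D2},t}$$

$$\sum_{t \in T} N_{4,\mathrm{PcrossB},t} + \sum_{t \in T} N_{4,\mathrm{PwithinA},t} = \sum_{t \in T} N_{2,\mathrm{L1},t} + \sum_{t \in T} N_{2,\mathrm{S1},t} + \sum_{t \in T} N_{2,\mathrm{D1},t}$$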

Represent Constraints
Functional Unit Constraint

$$\sum_{i \in Q_t} N_{i,\mathrm{M1},t} \le 1 + 2\,(1 - IPB_i)$$
$Q_t$: set of nodes with M1 assigned; repeat for each IPB and time $t$

$$\sum_{i \in Q_t} N_{i,\mathrm{M2},t} \le 1 + 2\,(1 - IPB_i)$$
$Q_t$: set of nodes with M2 assigned; repeat for each IPB and time $t$

...

Communication Capacity Constraint

$$\sum_{i \in Q_t} N_{i,\mathrm{PcrossA},t} \le 1 + 2\,(1 - IPB_i)$$
$Q_t$: set of communication nodes with PcrossA assigned; repeat for each IPB and time $t$

$$\sum_{i \in Q_t} N_{i,\mathrm{PcrossB},t} \le 1 + 2\,(1 - IPB_i)$$
$Q_t$: set of communication nodes with PcrossB assigned; repeat for each IPB and time $t$

...

Objective Function
Minimize IPB (iteration period bound).
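To make the objective concrete, the sketch below searches for the smallest feasible IPB on a tiny graph. It is not the integer program from the slides: it swaps in plain exhaustive search, and it models only the precedence constraint and a per-type functional unit capacity (no data paths). The function name `min_ipb`, the three-node example graph, and its latencies are assumptions chosen for illustration.

```python
from itertools import product


def min_ipb(nodes, edges, capacity, latency, ipb_lo, ipb_hi, horizon):
    """Return (ipb, start_times) for the smallest feasible IPB, or None.

    nodes:    {node name: unit type}
    edges:    [(src, dst, delays)] where delays counts D elements on the edge
    capacity: {unit type: number of units}
    latency:  {unit type: execution cycles}
    """
    names = list(nodes)
    for ipb in range(ipb_lo, ipb_hi + 1):
        for times in product(range(horizon), repeat=len(names)):
            t = dict(zip(names, times))
            # Precedence: t_dst - t_src >= C_src - D * IPB
            if not all(t[v] - t[u] >= latency[nodes[u]] - d * ipb
                       for u, v, d in edges):
                continue
            # Capacity: operations folded onto the same slot (t mod IPB)
            # must not exceed the number of units of that type.
            usage = {}
            for n in names:
                key = (nodes[n], t[n] % ipb)
                usage[key] = usage.get(key, 0) + 1
            if all(count <= capacity[kind] for (kind, _), count in usage.items()):
                return ipb, t
    return None


# Hypothetical example: two multiplies feed one add; the add result goes
# back to the first multiply through one delay element.
nodes = {"m1": "mul", "m2": "mul", "a1": "add"}
edges = [("m1", "a1", 0), ("m2", "a1", 0), ("a1", "m1", 1)]
capacity = {"mul": 1, "add": 1}
latency = {"mul": 2, "add": 1}
result = min_ipb(nodes, edges, capacity, latency, ipb_lo=2, ipb_hi=5, horizon=6)
```

Here the recurrence through the delay forces an IPB of at least 3 (multiply latency 2 plus add latency 1 over one delay), so the search rejects IPB = 2 and succeeds at IPB = 3. Exhaustive search is only workable for graphs of this size; the ILP formulation is what scales to the filters discussed in the talk.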

Pipelining Data Access

Pipelining data access to minimize register use.

    MPY reg1, rega, reg5
    ADD regb, reg2, rege
    MPY reg3, reg8, reg6
 || ADD regc, 4, regf
    ADD reg4, regd, reg8

Hypothetical machine:
- 2-cycle load, 1-cycle add, 2-cycle mul
- two register files, with one load/store unit for each register file

Values to be stored: reg5, reg6, rege, regf
Loaded values: reg1, reg2, reg3, reg4, rega, regb, regc, regd

[Table: reservation table with columns Register Source | Register Dest | Register Source | Register Dest | LD/ST | LD/ST. LD/ST entries: ST reg6, LD reg1, ST reg5, LD rega, ST rege, ST regf.]

LD reg2 conflicts with ST reg5, and LD regb conflicts with ST rege. Insert a new time slot and move the instructions.

Pipelining Data Access


[Table: reservation table with columns Register Source | Register Dest | Register Source | Register Dest | LD/ST | LD/ST. LD/ST entries: LD reg1, ST reg5, LD reg2, ST reg6, LD rega, ST rege, LD regb, ST regf.]

ST reg5 conflicts with LD reg1, and ST regf conflicts with LD regb. ST reg5 replaces LD reg1. Insert a new time slot to take LD reg1 and move the instructions.

[Table: same columns. LD/ST entries: ST reg5, LD reg1, LD rega, ST rege, LD reg2, LD regb, ST regf, ST reg6.]

ST rege conflicts with LD rega, and ST reg6 conflicts with LD reg2. ST rege replaces LD rega. Insert a new time slot to take LD rega and move the instructions.

Pipelining Data Access


[Table: reservation table with columns Register Source | Register Dest | Register Source | Register Dest | LD/ST | LD/ST. LD/ST entries: ST reg5, LD reg1, LD reg1, ST reg6, LD reg2, LD regb, ST regf, ST rege, LD rega.]

ST regf conflicts with LD rega. ST regf replaces LD rega. Insert a new time slot to take LD rega and move the instructions.

[Table: same columns. LD/ST entries: ST reg5, ST rege, ST regf, LD reg1, ST reg6, LD reg2, LD regb, LD rega.]

ST reg6 conflicts with LD reg1. ST reg6 replaces LD reg1. Insert a new time slot to take LD reg1 and move the instructions.

Pipelining Data Access


[Table: final reservation table with columns Register Source | Register Dest | Register Source | Register Dest | LD/ST | LD/ST. LD/ST entries: LD reg4, ST reg6, LD reg1, LD rega, ST reg5, ST rege, ST regf, LD regd, LD reg2, LD reg3, LD regb, LD regc.]

After pipelining, loaded values are used, and produced values are stored, as soon as they become available. The cycle overhead is smaller if only the loads or only the stores are pipelined.
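The conflict-detection step of this procedure can be sketched as follows. Each load is placed so that its value arrives exactly when it is first used, each store is placed in the cycle its value becomes available, and any cycle that asks more of the single load/store unit than it can do is reported. The function names and the cycle numbers in the example are hypothetical; only the 2-cycle load latency comes from the machine described above, and the slot-insertion step itself is not modeled.

```python
LOAD_LATENCY = 2  # cycles, as on the hypothetical machine in the slides


def ideal_slots(first_use, ready):
    """Map cycle -> memory operations wanted in that cycle.

    first_use: {register: cycle in which the loaded value is first read}
    ready:     {register: cycle in which the value to store is available}
    """
    slots = {}
    for reg, use in first_use.items():
        slots.setdefault(use - LOAD_LATENCY, []).append(f"LD {reg}")
    for reg, cycle in ready.items():
        slots.setdefault(cycle, []).append(f"ST {reg}")
    return slots


def conflicts(slots, capacity=1):
    """Cycles whose requests exceed the load/store unit capacity."""
    return {c: ops for c, ops in sorted(slots.items()) if len(ops) > capacity}


# Hypothetical timing for one load/store unit.
wanted = ideal_slots(first_use={"reg1": 2, "reg2": 3}, ready={"reg5": 1})
clashes = conflicts(wanted)
```

With these numbers LD reg2 and ST reg5 both want cycle 1, which is the kind of conflict the slides resolve by inserting a new time slot and moving instructions.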

Allocation of Loop Carried Registers

Logical Register Live Range

    ADD reg1, reg2, reg3
 || MPY reg4, x, reg5
    MV  reg5, reg6
 || ADD reg6, reg9, reg4
    LD  reg7
 || LD  reg8
    LD  reg2
 || LD  reg9
    ADD reg7, reg8, reg9
 || ADD reg3, x, reg1

[Figure: live-range chart for logical registers reg1 to reg9]

Hypothetical machine with 6 physical registers. LD: 2 execution cycles; ADD: 1 execution cycle; MPY: 2 execution cycles.

Allocation of Loop Carried Registers


Physical registers are allocated from left to right for each cycle.

[Figure: live-range chart over cycles 1 to 5, annotated with the allocated physical registers]

Logical   reg1  reg2  reg3  reg4  reg5  reg6  reg7  reg8  reg9
Physical  A     B     A     C     D     E     B     F     F

Cycle  regA        regB  regC  regD  regE  regF
1      reg1, reg3  reg2  reg4  reg5  reg6  reg9
3                  reg7                    reg8
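A left-to-right allocation over live ranges can be sketched in the style of a linear scan. This is a simplification and not the routine from the talk: it treats live ranges as plain intervals and ignores the wrap-around of loop-carried values across iterations. The function name `allocate` and the live ranges in the example are assumptions.

```python
def allocate(live_ranges, physical):
    """Assign physical registers to logical ones, scanning cycles left to right.

    live_ranges: {logical register: (first cycle, last cycle)}, inclusive
    physical:    list of physical register names, in preference order
    Returns {logical register: physical register, or None if none is free}.
    """
    free = list(physical)
    active = []  # (last cycle, physical register) for live allocations
    assignment = {}
    ordered = sorted(live_ranges.items(), key=lambda kv: (kv[1][0], kv[0]))
    for logical, (start, end) in ordered:
        # Release registers whose live range ended before this one starts.
        for item in [a for a in active if a[0] < start]:
            active.remove(item)
            free.append(item[1])
        free.sort(key=physical.index)
        if not free:
            assignment[logical] = None  # a spill would be needed here
            continue
        reg = free.pop(0)
        assignment[logical] = reg
        active.append((end, reg))
    return assignment


# Hypothetical live ranges: reg3 can reuse the register freed by reg1.
example = allocate({"reg1": (1, 2), "reg2": (1, 3), "reg3": (3, 4)}, ["A", "B"])
```

In the example reg1 and reg3 share physical register A, which is the same kind of sharing the table above shows for reg1 and reg3.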

Register Spill
There is a cost associated with each physical register allocation. Always allocate the physical register with the minimal cost. Sometimes a spill is inevitable, namely when all allocations have positive cost.

[Figure: live-range chart for reg1 to reg5 with physical registers A, B, C, D; the allocation of reg5 is marked "?"]

Cost of allocating A, B, C, D to logical register 5

Physical register  Cost
A                  0
B                  reg2 spill cost + reg5 spill cost
C                  min(reg3 spill cost, reg5 spill cost)
D                  infinity

Register Spill
[Figure: live-range chart for reg1 to reg5 with physical registers A, B, C, D; the allocation of reg5 is marked "?"]

Cost of allocating A, B, C, D to logical register 5

Physical register  Cost
A                  0
B                  reg2 spill cost
C                  min(reg3 spill cost, reg5 spill cost)
D                  infinity
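The minimal-cost rule can be written down in a few lines. The sketch below picks the cheapest physical register and reports whether that choice implies a spill, which is the case when the minimal cost is positive. The numeric spill costs are invented for illustration, and `pick_register` is a hypothetical name; the cost expressions follow the first of the two cost tables.

```python
import math


def pick_register(costs):
    """Return (cheapest physical register, whether a spill is required)."""
    reg = min(costs, key=costs.get)
    return reg, costs[reg] > 0


# Hypothetical spill costs for the logical registers involved.
spill = {"reg2": 4, "reg3": 2, "reg5": 3}

costs = {
    "A": 0,
    "B": spill["reg2"] + spill["reg5"],
    "C": min(spill["reg3"], spill["reg5"]),
    "D": math.inf,  # allocation not possible
}

choice = pick_register(costs)

# If A were not available, every remaining option has positive cost.
without_a = pick_register({k: v for k, v in costs.items() if k != "A"})
```

With A available the allocation is free; without it the cheapest option is C, and a spill becomes inevitable.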

Register Spill
Each register spill produces at least one extra LD and one extra ST. These new data access operations have to be pipelined before the register allocation routine is executed again.

[Figure: loop between Register Allocation, Pipelining Data Access, and Register Spill]

Flow Graph-6th Order IIR


[Figure: flow graph of the 6th-order IIR filter, built from adder nodes (+), multiplier nodes (X), and delay elements (D); the operation nodes are labeled N0 through N24.]

Scheduling (6th Order IIR )


      T1          T2                T3               T4         T5              T6
MUL   19^1 22^1   12^-2 21^1        6^-2 15^0        3^-2 4^-2  11^0 20^1       7^-1 16^0
ADD   8^-3        0^-3 5^-3 17^-2   1^-3 10^-2 24^0             2^-4 9^-2 23^1  13^-2 14^0

In each entry X^y, X represents the operation number in the flow graph and y represents the iteration number.

Functional Unit Assignment (6th Order IIR )


Entries are written operation^iteration.

Assignment 1: this one has many cross-path violations.

      T1     T2      T3      T4     T5     T6
M1    19^1   12^-2   6^-2    3^-2   11^0   7^-1
M2    22^1   21^1    15^0    4^-2   20^1   16^0
L1    8^-3   0^-3    1^-3           2^-4   13^-2
L2           5^-3    10^-2          9^-2   14^0
S1           17^-2   24^0           23^1

Assignment 2: only one cross-path violation (one of the best cases).

      T1     T2      T3      T4     T5     T6
M1    22^1   21^1    6^-2    4^-2   20^1   16^0
M2    19^1   12^-2   15^0    3^-2   11^0   7^-1

[Rows L1, L2, S1, and S2 of Assignment 2: the placement of the additions is not recoverable from the extracted slide.]

Communication cost can be minimized when functional units are assigned to operations. Note that assignment 2 does not have a better cycle count than assignment 1, because only one cross-path constraint violation is left for assignment 1 after register moves are inserted to deal with output dependency. The rest of this presentation is based on assignment 1.
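Counting cross-path violations for a candidate assignment can be sketched as below. The model is an assumption made for illustration: units whose name ends in 1 sit on side A, the others on side B, and each cross path carries `capacity` operands per time slot. The function names and the operand data are hypothetical; the example loosely follows the T3 situation resolved later in the talk, where operations 10 and 15 both read from register file A across the cross path.

```python
from collections import Counter


def side(unit):
    """Assumed naming rule: M1, L1, S1, D1 are on side A; the rest on side B."""
    return "A" if unit.endswith("1") else "B"


def cross_path_violations(placement, operand_sides, capacity=1):
    """Count cross-path reads beyond capacity, per time slot and source side.

    placement:     {operation: (time slot, functional unit)}
    operand_sides: {operation: [register file side of each operand]}
    """
    uses = Counter()
    for op, (slot, unit) in placement.items():
        for src in operand_sides[op]:
            if src != side(unit):
                uses[(slot, src)] += 1
    return sum(max(0, n - capacity) for n in uses.values())


placement = {"op10": ("T3", "L2"), "op15": ("T3", "M2"), "op1": ("T3", "L1")}
before = cross_path_violations(
    placement, {"op10": ["A"], "op15": ["A", "B"], "op1": ["A"]}
)
# After a register move has copied op10's operand into register file B:
after = cross_path_violations(
    placement, {"op10": ["B"], "op15": ["A", "B"], "op1": ["A"]}
)
```

A search over assignments could use such a count as its cost function, keeping the candidate with the fewest violations.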

New Flow Graph With Pseudo-registers


[Figure: the 6th-order IIR flow graph (nodes N0 through N24) with a pseudo-register assigned to every value; the labels run from reg0 to reg50.]

Resolve Violation of Output Dependency and Cross-path Constraint by Register Move


Problem: Output dependency is violated at reg2, reg5, reg6, reg14, reg9, reg15, reg19, reg20, and reg23 (shown as red dashed lines in the previous flow graph).
Solution: Old values are moved to new pseudo-registers (small blue ovals) before they are overwritten.

Problem: At time T3, both operation 10 and operation 15 load values from register file A via the cross-path.
Solution: The value used by operation 10 is prefetched into register file B by a register move.
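The first solution, saving an old value before it is overwritten, can be sketched on a small instruction list. The helper below inserts a move in front of the overwriting instruction and redirects the readers that still need the old value to the new pseudo-register. The tuple encoding of instructions, the function name, and the two-instruction example are assumptions; only the register pair reg2/reg37 is taken from the talk.

```python
def save_before_overwrite(code, reg, overwrite_at, stale_readers, new_reg):
    """Insert `MV reg -> new_reg` before the instruction that overwrites `reg`.

    code:          list of (opcode, destination, sources) tuples
    overwrite_at:  index of the instruction that overwrites `reg`
    stale_readers: indices of instructions that must still see the old value
    """
    out = []
    for i, (op, dest, srcs) in enumerate(code):
        if i == overwrite_at:
            out.append(("MV", new_reg, (reg,)))
        if i in stale_readers:
            srcs = tuple(new_reg if s == reg else s for s in srcs)
        out.append((op, dest, srcs))
    return out


# Hypothetical fragment: the MPY belongs to an earlier iteration and still
# needs the value reg2 held before the ADD overwrote it.
code = [
    ("ADD", "reg2", ("reg1", "reg3")),
    ("MPY", "reg10", ("reg2", "reg33")),
]
fixed = save_before_overwrite(code, "reg2", overwrite_at=0,
                              stale_readers={1}, new_reg="reg37")
```

The ADD still writes reg2, while the MPY now reads the saved copy in reg37.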

Schedule Register Moves (6th Order IIR )


Entries are written operation^iteration; MV1 to MV10 are register moves.

      T1     T2      T3     T4      T5     T6    T7     T8
M1    19^1   12^-2          6^-2    3^-2         11^0   7^-1
M2    22^1   21^1           15^0    4^-2         20^1   16^0
L1    8^-3   0^-3    MV10   1^-3           MV2   2^-4   13^-2
L2    MV9    5^-3           10^-2   MV4    MV3   9^-2   14^0
S1    MV7    17^-2   MV6    24^0                 23^1   MV5
S2                                                      MV8
D1           MV1
D2

Move   Src     Dest
MV1    reg2    reg37
MV2    reg5    reg43
MV3    reg6    reg44
MV4    reg14   reg45
MV5    reg9    reg46
MV6    reg15   reg47
MV7    reg19   reg48
MV8    reg20   reg49
MV9    reg23   reg50
MV10   reg2    reg39

Redundant register moves are coalesced (only ten moves are left after this is done). Register moves are then scheduled, with new time slots inserted if necessary. The most restrictive register moves are scheduled first.

Assembly 1 - loop kernel (6th Order IIR )


     MPY .m1   reg2, reg35, reg12
  || MPY .m2   reg39, reg30, reg14
  || ADD .l1   reg46, reg21, reg22
  || MV  .l2x  reg23, reg50
  || MV  .s1x  reg19, reg48

     MPY .m1   reg37, reg33, reg10
  || MPY .m2   reg39, reg36, reg13
  || ADD .l1x  reg0, reg18, reg1
  || ADD .l2x  reg8, reg22, reg23
  || ADD .s1   reg44, reg47, reg16
  || MV  .d1   reg2, reg37

     MV  .l1x  reg2, reg39
  || MV  .s1   reg15, reg47

     MPY .m1   reg2, reg26, reg4
  || MPY .m2   reg34, reg39, reg11
  || ADD .l1   reg1, reg3, reg2
  || ADD .l2x  reg43, reg16, reg17
  || ADD .s1x  reg45, reg7, reg15

     MPY .m1   reg2, reg25, reg3
  || MPY .m2   reg31, reg2, reg8
  || MV  .l2x  reg14, reg45

     MV  .l1x  reg5, reg43
  || MV  .l2x  reg6, reg44

     MPY .m1   reg2, reg27, reg5
  || MPY .m2   reg39, reg29, reg7
  || ADD .l1   reg37, reg50, reg24
  || ADD .l2x  reg17, reg4, reg18
  || ADD .s1x  reg12, reg13, reg19

     MPY .m1   reg2, reg32, reg9
  || MPY .m2   reg39, reg28, reg6
  || ADD .l1   reg10, reg49, reg21
  || ADD .l2   reg11, reg48, reg20
  || MV  .s1   reg9, reg46
  || MV  .s2x  reg20, reg49

Eight-cycle assembly based on the previous reservation table.

Assembly 2 (pipelining data access) - loop kernel (6th Order IIR )


[Figure: address registers. reg51 and reg52 point into the coefficient arrays, reg53 into the input array, and reg55 into the output array.]

     MPY .m1   reg2, reg35, reg12
  || MPY .m2   reg39, reg30, reg14
  || ADD .l1   reg46, reg21, reg22
  || MV  .l2x  reg23, reg50
  || MV  .s1x  reg19, reg48
  || LDH .d1   *reg51, reg25
  || LDH .d2   *reg52, reg31

     MPY .m2   reg39, reg36, reg13
  || ADD .l1x  reg0, reg18, reg1
  || ADD .l2x  reg8, reg22, reg23
  || ADD .s1   reg44, reg47, reg16
  || MV  .d1   reg2, reg37

     MV  .l1x  reg2, reg39
  || MV  .s1   reg15, reg47
  || MPY .m1   reg37, reg33, reg10
  || LDH .d1   *+reg51[1], reg27
  || LDH .d2   *+reg52[1], reg29
  || SUB .s2   reg54, 1

     MPY .m1   reg2, reg26, reg4
  || MPY .m2   reg34, reg39, reg11
  || ADD .l1   reg1, reg3, reg2
  || ADD .l2x  reg43, reg16, reg17
  || ADD .s1x  reg45, reg7, reg15
  || LDH .d1   *+reg51[2], reg32
  || LDH .d2   *+reg52[2], reg28
  || B   .s2

     LDH .d1   *+reg51[3], reg35
  || LDH .d2   *+reg52[3], reg30

     MPY .m1   reg2, reg25, reg3
  || MPY .m2   reg31, reg2, reg8
  || MV  .l2x  reg14, reg45
  || LDH .d2   *+reg52[4], reg36
  || LDH .d1   *reg53++, reg0

     MV  .l1x  reg5, reg43
  || MV  .l2x  reg6, reg44
  || LDH .d1   *+reg51[4], reg33

     MPY .m1   reg2, reg27, reg5
  || MPY .m2   reg39, reg29, reg7
  || ADD .l1   reg37, reg50, reg24
  || ADD .l2x  reg17, reg4, reg18
  || ADD .s1x  reg12, reg13, reg19
  || LDH .d1   *+reg51[5], reg26
  || LDH .d2   *+reg52[5], reg34

     MPY .m1   reg2, reg32, reg9
  || MPY .m2   reg39, reg28, reg6
  || ADD .l1   reg10, reg49, reg21
  || ADD .l2   reg11, reg48, reg20
  || MV  .s1   reg9, reg46
  || MV  .s2x  reg20, reg49
  || STH .d2   reg24, *reg55++

Data access (loads and stores) is pipelined to minimize register use. Retrieved data (produced results) are used (stored) immediately after becoming available.

Assembly 3 (register allocation) - loop kernel (6th Order IIR )


Pseudo-register to physical register mapping (x: no physical register listed):

reg0=A0    reg1=A0    reg2=A12   reg3=A7    reg4=A0    reg5=A10   reg6=B0    reg7=B6
reg8=B3    reg9=A1    reg10=A8   reg11=B6   reg12=A2   reg13=B13  reg14=B3   reg15=A6
reg16=A14  reg17=B5   reg18=B13  reg19=A0   reg20=B8   reg21=A1   reg22=A2   reg23=B1
reg24=A2   reg25=A14  reg26=A6   reg27=A10  reg28=B0   reg29=B0   reg30=B0   reg31=B7
reg32=A3   reg33=A5   reg34=B5   reg35=A2   reg36=B1   reg37=A13  reg38=x    reg39=B12
reg40=x    reg41=x    reg42=x    reg43=B7   reg44=A14  reg45=A9   reg46=A3   reg47=A5
reg48=B2   reg49=A11  reg50=A3   reg51=A4   reg52=B4   reg53=A15  reg54=B14  reg55=B15

     MPY .m1   A12, A2, A2
  || MPY .m2   B12, B0, B3
  || ADD .l1   A3, A1, A2
  || MV  .l2x  B1, A3
  || MV  .s1x  A0, B2
  || LDH .d1   *A4, A14
  || LDH .d2   *B4, B7

     MPY .m2   B12, B1, B13
  || ADD .l1x  A0, B13, A0
  || ADD .l2x  B3, A2, B1
  || ADD .s1   A14, A5, A14
  || MV  .d1   A12, A13

     MV  .l1x  A12, B12
  || MV  .s1   A6, A5
  || MPY .m1   A13, A5, A8
  || LDH .d1   *+A4[1], A10
  || LDH .d2   *+B4[1], B0
  || SUB .s2   B14, 1

     MPY .m1   A12, A6, A0
  || MPY .m2   B5, B12, B6
  || ADD .l1   A0, A7, A12
  || ADD .l2x  B7, A14, B5
  || ADD .s1x  A9, B6, A6
  || LDH .d1   *+A4[2], A3
  || LDH .d2   *+B4[2], B0
  || B   .s2

     LDH .d1   *+A4[3], A2
  || LDH .d2   *+B4[3], B0

     MPY .m1   A12, A14, A7
  || MPY .m2   B7, A12, B3
  || MV  .l2x  B3, A9
  || LDH .d2   *+B4[4], B1
  || LDH .d1   *A15++, A0

     MV  .l1x  A10, B7
  || MV  .l2x  B0, A14
  || LDH .d1   *+A4[4], A5

     MPY .m1   A12, A10, A10
  || MPY .m2   B12, B0, B6
  || ADD .l1   A13, A3, A2
  || ADD .l2x  B5, A0, B13
  || ADD .s1x  A2, B13, A0
  || LDH .d1   *+A4[5], A6
  || LDH .d2   *+B4[5], B5

     MPY .m1   A12, A3, A1
  || MPY .m2   B12, B0, B0
  || ADD .l1   A8, A11, A1
  || ADD .l2   B6, B2, B8
  || MV  .s1   A1, A3
  || MV  .s2x  B8, A11
  || STH .d2   A2, *B15++

Nine-cycle loop kernel after physical register allocation. Note that the prologue and epilogue are not shown.
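Once the mapping from pseudo-registers to physical registers is known, rewriting the kernel is a textual substitution. The sketch below shows one way to do it with a regular expression; the function name is hypothetical, and the mapping holds only the few entries the two example lines need.

```python
import re


def apply_allocation(line, mapping):
    """Replace every pseudo-register name (regN) in `line` by its physical register."""
    return re.sub(r"\breg\d+\b", lambda m: mapping[m.group(0)], line)


# A few entries from the allocation table (pseudo-register -> physical register).
mapping = {
    "reg2": "A12", "reg35": "A2", "reg12": "A2",
    "reg51": "A4", "reg27": "A10",
}

mpy = apply_allocation("MPY .m1 reg2, reg35, reg12", mapping)
ldh = apply_allocation("LDH .d1 *+reg51[1], reg27", mapping)
```

The word-boundary anchors keep reg2 from matching inside reg27, and a pseudo-register missing from the mapping raises a KeyError instead of passing through silently.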
