
Optimized Code Generation of Multiplication-free Linear Transforms

Mahesh Mehendale
Texas Instruments (India) Ltd.
Golf View Homes, Wind Tunnel Road, Bangalore 560017, INDIA

G. Venkatesh, S.D. Sherlekar
Dept. of Computer Sc. and Engg., Indian Institute of Technology, Powai, Mumbai 400076, INDIA

33rd Design Automation Conference (DAC 96), June 1996, Las Vegas, NV, USA

Abstract

We present code generation of multiplication-free linear transforms targeted to single-register DSP architectures such as the TMS320C2x/C5x. We first present an algorithm to generate optimized code from a DAG representation. We then present techniques that transform a DAG so as to minimize the number of nodes and the accumulator-spills. We then introduce the concept of spill-free DAGs and present an algorithm for synthesizing such DAGs. The results for the Walsh-Hadamard, Haar and Slant transforms show a 25% to 40% reduction in the cycle count using our techniques.

1 Introduction
Many signal processing applications such as image transforms[1] and error correction/detection involve matrix multiplication of the form Y = A*X, where X and Y are the input and output vectors and A is the transformation matrix whose elements are 1, -1 and 0. In this paper we present optimized code generation of these transforms targeted to programmable Digital Signal Processors. While some of the code optimization techniques presented in this paper are generic, our focus is primarily on single-register, accumulator-based DSP architectures such as the TMS320C2x[2] and TMS320C5x[3].

Code optimization techniques discussed in the literature[4-7] address the problems of instruction selection and scheduling, register allocation and storage assignment to minimize code size and/or the number of cycles. These techniques operate on a DAG representation of the code being optimized and can be applied to implement multiplication-free linear transforms. However, the amount of optimization achieved using these techniques is limited by the initial DAG representation. Much higher gains are possible by optimizing the DAG itself.

One approach to optimizing the DAG is to minimize the number of additions by utilizing the redundancy in the computation of two or more outputs. In this paper we present an algorithm for minimizing the number of additions, which is based on the algorithm presented in [8]. Our approach is different from that presented in [9] and is based on iterative elimination of two-element common subexpressions in the transformation matrix.

For single-register architectures, the minimum number of additions in most cases does not translate into the minimum number of execution cycles. The common subexpressions typically cause accumulator spills which result in `load' and `store' overhead. In this paper we present four DAG transformations to reduce the number of cycles by minimizing accumulator spills.

Another approach to DAG optimization is to minimize the number of additions under the constraint of zero accumulator spills. In this paper we present an algorithm for spill-free implementation of multiplication-free linear transforms in the minimum number of execution cycles.

The paper is organized as follows. In section 2, we present the target architecture model, which is a suitable abstraction of the 'C2x/'C5x architectures. We then present an algorithm for integrated instruction scheduling and register+memory allocation. In section 3, we present the algorithm for minimizing the number of additions. In section 4, we present DAG optimizing transformations to optimize code in terms of the number of cycles. In section 5, we present the algorithm for spill-free implementation requiring the minimum number of cycles. We present results for the Walsh-Hadamard, Haar and Slant transforms in section 6 and conclude in section 7 with our future work.

2 Code generation from DAG representation

2.1 Target Architecture Model
The code generation techniques presented in this paper are targeted to the architecture model shown in figure 1. This model is a suitable abstraction of the 'C2x and 'C5x and shows the datapath of interest to multiplication-free linear transforms. As can be seen from the figure, the target architecture is a single-register, non-commutative machine in which the available operations are (a small C model follows the list):
1. <acc> <- <acc> .op. <mem>, <shift>  (instructions ADD and SUB)
2. <acc> <- <mem>, <shift>  (instruction LAC)
3. <mem> <- <acc>, <shift>  (instruction SAC)
4. <acc> <- -<acc>  (instruction NEG)
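For concreteness, the following is a minimal C model of this datapath (our sketch; the names acc and mem and the word widths are illustrative assumptions, not taken from the 'C2x/'C5x user's guides):

    #include <stdint.h>

    static int32_t acc;        /* the single accumulator */
    static int32_t mem[256];   /* data memory            */

    /* 1. <acc> <- <acc> .op. <mem>, <shift>   (ADD / SUB) */
    void ADD(int m, int shift) { acc += mem[m] << shift; }
    void SUB(int m, int shift) { acc -= mem[m] << shift; }
    /* 2. <acc> <- <mem>, <shift>              (LAC)       */
    void LAC(int m, int shift) { acc = mem[m] << shift; }
    /* 3. <mem> <- <acc>, <shift>              (SAC)       */
    void SAC(int m, int shift) { mem[m] = acc << shift; }
    /* 4. <acc> <- -<acc>                      (NEG)       */
    void NEG(void) { acc = -acc; }

Under this model, Y1 = X1 + X2 becomes LAC(x1,0); ADD(x2,0); SAC(y1,0), counted as one cycle per instruction as in the rest of the paper.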
[Figure 1: Target Architecture Model]

[Figure 2: DAGs for 4x4 Walsh-Hadamard transform; (a) Initial DAG, (b) Optimized DAG]
2.2 Code Generation Rules
One of the inputs to the code generator is a DAG representation of the desired computation. The output and the intermediate nodes represent either an ADD or a SUBTRACT operation and have a fanin of 2.

The other input to the code generator is a sequence in which the nodes of the DAG need to be evaluated. Given the sequence and the DAG, the following rules are used to generate the code. Let the `current' node be the latest evaluated node and the `new' node be the node for which code is being generated.
1. If the `current' node is not one of the fanin nodes of the `new' node, save the `current' node (SAC instruction), load the left fanin node of the `new' node (LAC instruction) and ADD/SUBTRACT the right fanin node of the `new' node.
2. If the `current' node is the left fanin node of the `new' node, ADD/SUBTRACT the right fanin node of the `new' node.
3. If the `current' node is the right fanin node of the `new' node and the `new' node function is SUBTRACT, negate the `current' node (NEG instruction) and ADD the left fanin node of the `new' node.
4. If the `new' node is an output node or an intermediate node with fanout >= 2, store the new node (SAC instruction) before proceeding with the next node.

For a given DAG, the code size, and consequently the number of cycles, depends on the sequence in which the nodes are evaluated. The code optimization problem thus maps onto the problem of finding an optimum sequence of DAG node evaluations.

2.3 Computation Scheduling Algorithm
We now present an algorithm for scheduling the DAG computations for the minimum number of cycles. The algorithm uses the following knowledge-base derived from the code generation rules presented in sub-section 2.2.
1. A node can be scheduled for computation only if both its fanin nodes are already computed or are input nodes.
2. The computation of output nodes and of intermediate nodes with fanout >= 2 always needs to be stored, irrespective of the next computation node.
3. If the `current' node is one of the fanin nodes of the `new' node, it avoids an accumulator-spill and hence reduces the `store' and `load' overhead.
These factors are used to assign weights to the candidate nodes at each iteration of the scheduling algorithm, and the node with the highest weight is selected. Here is the overall algorithm:

    scheduled-node-list = {}; current-node = nil
    while (no.-of-scheduled-nodes < total-no.-of intermediate+output nodes) {
        /* build candidate-node-list */
        candidate-node-list = {}
        for all (node_i not in scheduled-node-list) {
            if ((node_i.left-fanin and node_i.right-fanin) in (input + scheduled node-list))
                candidate-node-list += node_i
        }
        /* assign weights to the candidate-nodes */
        for (each node_i in candidate-node-list) {
            node_i.weight = 1
            if ((node_i in output-node-list) .or. (node_i.fanout >= 2))
                node_i.weight++
            if ((node_i.left-fanin = current-node) .or.
                ((node_i.right-fanin = current-node) .and. (node_i.op = ADD)))
                node_i.weight += 2
            if (node_i.fanout-node.right-fanin in scheduled-node-list)
                node_i.weight += 2
        }
        /* schedule the node with the highest weight */
        find (node_m in candidate-node-list) such that node_m.weight is maximum
        scheduled-node-list += node_m
        current-node = node_m
    }

3 Minimizing Number of Additions
The amount of optimization achievable using the computation scheduling algorithm is limited by the DAG representation. In this section we present an algorithm that minimizes the number of additions, and effectively the number of nodes in the DAG, to compute the multiplication-free linear transform.

Consider the 4x4 Walsh-Hadamard transform[1] shown below. The DAG representation of this computation is shown in figure 2(a). It has 12 nodes (i.e. 12 additions+subtractions).

    [Y1]   [ 1  1  1  1 ] [X1]
    [Y2] = [ 1 -1  1 -1 ] [X2]
    [Y3]   [ 1  1 -1 -1 ] [X3]
    [Y4]   [ 1 -1 -1  1 ] [X4]

From the DAG it can be observed that there is some redundancy in the computation. For example, the sub-computation (X1+X2) is used to compute both Y1 and Y3. Similarly, the sub-computation (X1-X2) is used to compute both Y2 and Y4. The total number of additions can be reduced by pre-computing such common sub-computations. The common sub-computations are of two types (a counting sketch in C follows the list):
1. CS++, in which the elements in 2 columns of the matrix are both 1 or both -1 for more than 1 row (e.g. X12+, columns 1,2 for rows 1 and 3).
2. CS+-, in which the elements in 2 columns of the matrix are +1,-1 or -1,+1 for more than 1 row (e.g. X12-, columns 1,2 for rows 2 and 4).
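The occurrence counts for a given column pair can be computed with a few lines of C (our sketch; it assumes the matrix entries are restricted to -1, 0 and +1, and a row width of 8 covering the transforms considered here):

    /* Count CS++ (same_sign = 1) or CS+- (same_sign = 0) occurrences
       for columns j and k of an n_rows x 8 matrix A.                  */
    int count_cs(int n_rows, const int A[][8], int j, int k, int same_sign)
    {
        int count = 0;
        for (int r = 0; r < n_rows; r++) {
            if (A[r][j] == 0 || A[r][k] == 0)
                continue;                          /* pair not used in this row */
            if ((A[r][j] == A[r][k]) == same_sign)
                count++;
        }
        return count;
    }

For the 4x4 Walsh-Hadamard matrix above, columns 1 and 2 give a count of 2 for both types, matching the X12+ and X12- examples.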
[Figure 3: Tree to Chain Conversion]

[Figure 4: Serializing a Butterfly]

[Figure 5: Fanout reduction and Merging]
The amount of reduction due to pre-computing a common sub-computation depends on the number of rows in which it appears. The DAG minimization algorithm uses a steepest descent approach[8] which during each iteration pre-computes the common sub-computation with the highest occurrence count. This is done by constructing a common sub-computation graph whose nodes represent the columns of the transformation matrix. There are 2 arcs between every two nodes, representing the CS++ and CS+- type common sub-computations. The arcs are assigned weights indicating the number of occurrences of the sub-computations.

Once the sub-computation corresponding to the highest-weighted arc is identified, the transformation matrix is updated to reflect the pre-computation. This is done by adding a new column to the transformation matrix and suitably updating the matrix elements. The common sub-computation graph is also updated by adding a new node and re-computing the arc weights. This procedure is repeated until no arc has a weight of two or more.

Figure 2(b) shows the DAG for the 4x4 Walsh-Hadamard transform with the minimum number of additions+subtractions. It has 8 nodes (8 additions+subtractions) compared to the 12 nodes of the original DAG shown in figure 2(a).

4 DAG Optimizing Transformations
We used the scheduling algorithm discussed in section 2 to schedule the DAGs shown in figures 2(a) and 2(b). While the DAG shown in figure 2(a) requires 20 cycles, the DAG in figure 2(b) requires 22 cycles to compute the transform, even though it has 4 fewer nodes. Clearly, fewer nodes do not always translate into fewer cycles. The main reason the DAG in figure 2(b) requires more cycles is that all its intermediate nodes have fanout >= 2. For single-register or accumulator-based architectures, such intermediate nodes result in accumulator spilling, and consequently in `store' and `load' overhead.

In this section we present four DAG transformations that minimize the accumulator-spill and hence the number of execution cycles.

4.1 Tree to Chain conversion
This transform converts a `tree' structure in a DAG to a `chain' structure. This eliminates the need to store the intermediate computations and hence reduces the number of cycles (see the C sketch after this section). Figure 3 shows an example of this transform. While the DAG with a `tree' structure requires 7 cycles to compute the output, the transformed `chain' structure performs the computation in 5 cycles.

4.2 Serializing a butterfly
Many image transform DAGs have `butterfly' structures that perform computations of the type (Y1 = X1 + X2, Y2 = X1 - X2). Such butterfly structures can be serialized by computing one of the butterfly outputs in terms of the other output, using a SHIFT operation, which, when performed along with an ADD or SUBTRACT, does not cost an additional cycle. Figure 4 shows the serialized DAGs, which require 5 cycles compared to the 6 cycles required for the butterfly computation.

As can be seen from the figure, there are two ways of serializing a butterfly, depending on whether Y1 is computed in terms of Y2 (Y1 = Y2 + 2*X2) or Y2 is computed in terms of Y1 (Y2 = Y1 - 2*X2). The choice of the transform depends on the context in which the butterfly appears in the overall DAG.

4.3 Fanout reduction
Since intermediate nodes with fanout >= 2 result in accumulator-spilling, this transformation reduces the fanout of an intermediate node in a DAG. Unlike the first 2 transforms, this transform increases the number of nodes in the DAG by 1. Figure 5 shows an example of this transformation applied to a 4-input, 2-output DAG. It can be noted that the fanout of the intermediate node T1 in the transformed DAG is 1 (i.e. 1 less than in the original DAG). While the original DAG has 3 nodes and requires 8 cycles, the transformed DAG has 4 nodes but requires only 7 cycles.

4.4 Merging
Merging is another transform that reduces the fanout of intermediate nodes. Unlike the earlier transforms, this transform does not itself reduce the number of cycles. However, it restructures the DAG so that the other transformations can be applied to the modified DAG. Figure 5 also shows an example of the `merging' transformation applied to the 4-input, 2-output DAG.

Figure 6 shows how these transformations can be applied to the DAG in figure 2(b). The resultant DAG requires 16 cycles (6 cycles fewer) to compute the 4x4 Walsh-Hadamard transform.

The amount of optimization possible using these transforms depends on the sequence in which the nodes are selected and on the choice of transformations applied. One approach is to search the DAG for potential nodes for transformation and, for a selected node, apply the transformation that results in the most saving. This greedy approach does not always give the optimum solution. We are currently developing an algorithm that explores a wider search space to arrive at a DAG that requires the fewest cycles.
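The effect of tree-to-chain conversion (section 4.1) can be written out in plain C (our sketch; the instruction sequences in the comments follow figure 3):

    /* Tree-shaped evaluation: the subtotal T1 must be spilled. */
    int y1_tree(int x1, int x2, int x3, int x4)
    {
        int t1 = x1 + x2;     /* LAC X1; ADD X2; SAC T1        */
        int t2 = x3 + x4;     /* LAC X3; ADD X4                */
        return t2 + t1;       /* ADD T1; SAC Y1    (7 cycles)  */
    }

    /* Chain-shaped evaluation: the running sum stays in the accumulator. */
    int y1_chain(int x1, int x2, int x3, int x4)
    {
        return ((x1 + x2) + x3) + x4;
                              /* LAC X1; ADD X2; ADD X3; ADD X4; SAC Y1   (5 cycles) */
    }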
[Figure 6: Optimizing DAG using transformations]

[Figure 7: Spill-free DAG synthesis]

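Butterfly serialization (section 4.2) looks as follows in the same C style (our sketch; the << 1 models the free shift available on ADD/SUB, cf. figure 4):

    /* Plain butterfly: 6 cycles (LAC; ADD; SAC; LAC; SUB; SAC). */
    void butterfly(int x1, int x2, int *y1, int *y2)
    {
        *y1 = x1 + x2;
        *y2 = x1 - x2;
    }

    /* Serialized butterfly: Y2 reuses Y1, 5 cycles
       (LAC X1; ADD X2; SAC Y1; SUB X2,1; SAC Y2).   */
    void butterfly_serial(int x1, int x2, int *y1, int *y2)
    {
        *y1 = x1 + x2;
        *y2 = *y1 - (x2 << 1);   /* Y2 = Y1 - 2*X2 */
    }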

5 Synthesis of Spill-free DAGs
A DAG that can be scheduled without any accumulator-spills provides certain advantages. Firstly, it simplifies code generation. Secondly, since there are no accumulator-spills, no intermediate storage is required, thus reducing the memory requirements to implement the transform. The DAG in figure 2(a) is an example of such a DAG. It however requires 20 cycles to compute the transform. It can be noted that the final optimized DAG in figure 6 is also a spill-free DAG. This DAG however requires just 16 cycles. The main reason for the reduced cycle count is the fact that this DAG uses pre-computed outputs along with the inputs. For example, Y3 is computed in terms of Y1, X3 and X4, and Y4 is computed in terms of Y2, X3 and X4. Instead of generating a DAG with the minimum number of additions and then applying transformations, such a DAG can be directly generated from the transformation matrix, if the sequence of output computation is known.

We now present an algorithm that arrives at an optimum sequence of output computations such that the resultant spill-free DAG requires the fewest cycles. Our algorithm operates on a graph whose nodes represent the outputs. The nodes are of three types, corresponding to
1. the most recently computed output
2. other outputs that are already computed
3. outputs that are yet to be computed

Each node in the graph has an edge (self loop) that starts and ends in itself. These self-loops are assigned costs which are given by the number of cycles required to compute the output independently (i.e. without using any of the precomputed outputs). There are also edges from every `already computed' output node to all the `yet to be computed' nodes. Each edge is assigned a cost given by the number of cycles required to compute the `yet to be computed' output in terms of the `already computed' output. Our algorithm uses the steepest descent approach, which at every stage selects the output that results in the minimum incremental cost. In case more than one output has the same lowest incremental cost, one output is selected randomly. Once an output is selected, it is marked as the most recently computed output. All the edges between this node and the already-computed nodes are deleted, and new edges are added between this node and the other `yet to be computed' nodes. The newly added edges are then assigned appropriate costs. This process is repeated to cover all the outputs. The overall algorithm is given below:

    already-computed-output-list = {}
    most-recently-computed-output = nil
    /* construct initial graph and compute edge costs */
    for (i = 0; i < no-of-outputs; i++) {
        edge[i,i].cost = no. of non-zero entries in row-i + 1 }
    repeat {
        find the edge E(M,N) with the lowest cost
        if (M == N) { /* self loop */
            generate DAG to compute output(N) in terms of only inputs
        } else {
            generate DAG to compute output(N) in terms of inputs and output(M)
        }
        /* update the graph */
        delete edge E(N,N)
        for each node (i in already-computed-output-list) { delete edge E(i,N) }
        already-computed-output-list += N
        for each node (i in yet-to-be-computed-output-list)
            { E(most-recently-computed-output, i).cost++ }
        most-recently-computed-output = N
        for each node (i in yet-to-be-computed-output-list) {
            add edge E(N,i)
            E(N,i).cost = no. of mis-matches between rows N and i
                          of the transformation matrix }
    } until (yet-to-be-computed-output-list == {})

Figure 7 shows each iteration of the algorithm applied to the 4x4 Walsh-Hadamard transform matrix, and the resultant DAG. It can be noted that the resultant DAG is spill-free and requires just 14 cycles to compute the transform.
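The two cost measures used when building the graph translate directly into C (our sketch; the row width of 8 is an assumption covering the transforms considered here):

    /* Self-loop cost: cycles to compute output i from the inputs alone. */
    int self_cost(int n_cols, const int A[][8], int i)
    {
        int nonzero = 0;
        for (int c = 0; c < n_cols; c++)
            if (A[i][c] != 0) nonzero++;
        return nonzero + 1;
    }

    /* Edge cost E(m,n): number of mismatches between rows m and n. */
    int edge_cost(int n_cols, const int A[][8], int m, int n)
    {
        int mismatches = 0;
        for (int c = 0; c < n_cols; c++)
            if (A[m][c] != A[n][c]) mismatches++;
        return mismatches;
    }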
6 Results
The code generation algorithm presented in section 2, the algorithm for minimizing the number of additions+subtractions presented in section 3 and the algorithm for synthesizing spill-free DAGs presented in section 5 have all been implemented in C under Unix.

We compared the code generated by our algorithm with that generated using the optimizing C compiler for the TMS320C5x. The DAGs for the 4x4 Walsh-Hadamard transform shown in figures 2(a), 2(b) and 7 were converted to an equivalent C program and compiled with the highest optimization level. The generated code, which used indirect addressing, was converted to use direct addressing, thus reducing the number of cycles. Table I below shows the comparison in terms of the number of cycles, assuming that the program and data are available in on-chip memories.

    DAG          'C5x C compiler    Code Generator
                 (no. of cycles)    (no. of cycles)
    Fig. 2(a)          20                 20
    Fig. 2(b)          22                 22
    Fig. 7             19                 14

    Table I: Code generator vs 'C5x C Compiler

[Figure 8: DAGs for 8x8 Walsh-Hadamard transform]

[Figure 9: DAGs for 8x8 Walsh-Hadamard transform]
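The spill-free evaluation order behind the figure 7 DAG can be rendered in C as follows (our sketch: the Y3 and Y4 dependencies are the ones quoted in section 5, while the exact Y2 relation is our inference from the matrix; the cycle counts in the comments add up to the 14 cycles reported above):

    /* Spill-free 4x4 Walsh-Hadamard: each output reuses an earlier one,
       so the accumulator is never spilled to a temporary.               */
    void wh4_spill_free(const int x[4], int y[4])
    {
        y[0] = x[0] + x[1] + x[2] + x[3];         /* Y1 from inputs only    (5) */
        y[2] = y[0] - (x[2] << 1) - (x[3] << 1);  /* Y3 = Y1 - 2*X3 - 2*X4  (3) */
        y[1] = y[2] - (x[1] << 1) + (x[2] << 1);  /* Y2 = Y3 - 2*X2 + 2*X3  (3) */
        y[3] = y[1] - (x[2] << 1) + (x[3] << 1);  /* Y4 = Y2 - 2*X3 + 2*X4  (3) */
    }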
The results show that our code generator generates as compact code as the 'C5x C compiler for the first 2 DAGs. It does better in the case of the DAG in figure 7. The main reason for this is that the C compiler during its optimization phase modifies the DAG and in the process generates code with more cycles.

In sections 3, 4 and 5 we have presented results for the 4x4 Walsh-Hadamard transform in terms of minimizing the number of additions, optimizing transformations and synthesis of spill-free DAGs. We now present results for the 8x8 Walsh-Hadamard transform, the 8x8 Haar transform and the 4x4 Slant transform.

[Figure 10: DAGs for 8x8 Haar transform]
The 8x8 Walsh-Hadamard transform[1] matrix is given by:

    [ 1  1  1  1  1  1  1  1 ]
    [ 1 -1  1 -1  1 -1  1 -1 ]
    [ 1  1 -1 -1  1  1 -1 -1 ]
    [ 1 -1 -1  1  1 -1 -1  1 ]
    [ 1  1  1  1 -1 -1 -1 -1 ]
    [ 1 -1  1 -1 -1  1 -1  1 ]
    [ 1  1 -1 -1 -1 -1  1  1 ]
    [ 1 -1 -1  1 -1  1  1 -1 ]

The direct computation of this transform requires 56 additions+subtractions and the corresponding code executes in 72 cycles. The number of additions+subtractions can be minimized to 24 using the algorithm presented in section 3. The resultant DAG is shown in figure 8. The code corresponding to this DAG requires 64 cycles. We applied optimizing transformations to serialize all the butterflies in this DAG. The resultant DAG is also shown in figure 8. This DAG also has 24 nodes, but the corresponding code requires 52 cycles.

We synthesized a spill-free DAG for the 8x8 Walsh-Hadamard transform using the algorithm presented in section 5. The resultant DAG is shown in figure 9. The DAG has 35 nodes and the corresponding code requires 44 cycles. The results so far indicate that for both the 4x4 and 8x8 Walsh-Hadamard transforms, the spill-free DAGs result in the most efficient code. We modified the DAG for the 8x8 Walsh-Hadamard transform to extract a common sub-computation (X5 + X6 + X7 + X8). The resultant DAG is also shown in figure 9. This DAG has 32 nodes and it does result in one accumulator spill. The code corresponding to this DAG requires 42 cycles (2 fewer than the spill-free DAG).

The 8x8 Haar transform[1] matrix is given by:

    [ 1  1  1  1  1  1  1  1 ]
    [ 1  1  1  1 -1 -1 -1 -1 ]
    [ 1  1 -1 -1  0  0  0  0 ]
    [ 0  0  0  0  1  1 -1 -1 ]
    [ 1 -1  0  0  0  0  0  0 ]
    [ 0  0  1 -1  0  0  0  0 ]
    [ 0  0  0  0  1 -1  0  0 ]
    [ 0  0  0  0  0  0  1 -1 ]

The direct computation of this transform requires 24 additions+subtractions and the corresponding code executes in 40 cycles. The number of additions+subtractions can be minimized to 14 using the algorithm presented in section 3. The resultant DAG is shown in figure 10. The code corresponding to this DAG requires 39 cycles. We applied optimizing transformations to serialize all the butterflies in this DAG. The resultant DAG is also shown in figure 10. This DAG also has 14 nodes, but the corresponding code requires 30 cycles.

We synthesized a spill-free DAG for the 8x8 Haar transform using the algorithm presented in section 5. The resultant DAG has 20 nodes and the corresponding code requires 32 cycles.
The 4x4 Slant transform[1] can be transformed into a 4x8 multiplication-free transform as shown below (the 1/(2*sqrt(5)) and 1/sqrt(5) scale factors are absorbed into the input vector):

    [Y1]   [ 1  1  1  1 ] [X1/(2*sqrt(5))]
    [Y2] = [ 3  1 -1 -3 ] [X2/(2*sqrt(5))]
    [Y3]   [ 1 -1 -1  1 ] [X3/(2*sqrt(5))]
    [Y4]   [ 1 -3  3 -1 ] [X4/(2*sqrt(5))]

         [ 1  1  1  1   0  0  0  0 ] [X1/(2*sqrt(5))]
         [ 1  1 -1 -1   1  0  0 -1 ] [X2/(2*sqrt(5))]
       = [ 1 -1 -1  1   0  0  0  0 ] [X3/(2*sqrt(5))]
         [ 1 -1  1 -1   0 -1  1  0 ] [X4/(2*sqrt(5))]
                                     [X1/sqrt(5)    ]
                                     [X2/sqrt(5)    ]
                                     [X3/sqrt(5)    ]
                                     [X4/sqrt(5)    ]

The direct computation of the 4x8 transform requires 16 additions+subtractions and the corresponding code executes in 24 cycles. The number of additions+subtractions can be minimized to 12 using the algorithm presented in section 3. The code corresponding to the resultant DAG requires 26 cycles.

Interestingly, the spill-free DAG can be synthesized directly from the 4x4 matrix with elements 1, -1, 3 and -3. The four outputs can be computed as

    Y1 = X1 + X2 + X3 + X4
    Y2 = Y1 + (X1 << 1) - (X3 << 1) - (X4 << 2)
    Y3 = Y2 - (X1 << 1) - (X2 << 1) + (X4 << 2)
    Y4 = Y3 - (X2 << 1) + (X3 << 2) - (X4 << 1)

The DAG for the above computation has 12 nodes and requires 17 cycles. The results presented so far are summarized in the table below (Ns - number of nodes, Cs - number of cycles):

    Transform    Initial      Min. adds    Serialized    Spill-free
                                           b'flies
                 Ns    Cs     Ns    Cs     Ns    Cs      Ns    Cs
    4x4 Walsh    12    20      8    22      8    19       9    14
    8x8 Walsh    56    72     24    64     24    52      35    44
    8x8 Haar     24    40     14    39     14    30      20    32
    4x4 Slant    16    24     12    26     12    23      12    17

7 Conclusion and Future Work
In this paper we have presented techniques for optimized code generation of multiplication-free linear transforms. The code generation is targeted to single-register, accumulator-based DSP architectures such as the TMS320C2x and TMS320C5x. We have presented a code generation algorithm that performs integrated scheduling and register allocation so as to minimize accumulator spills and generate code that executes in the minimum number of cycles. Since the quality of the generated code is limited by the initial DAG representation, we have presented techniques for optimizing DAG representations of multiplication-free linear transforms. We have presented an algorithm, based on iterative elimination of common sub-computations, that minimizes the number of additions. We have shown that the resultant DAG, though optimized in terms of the number of nodes, does not result in the most optimized code on a single-register machine. We have presented four optimizing transformations that optimize code by minimizing accumulator-spills. These transformations utilize the `shift' operations which can be performed on the data being added to or subtracted from the accumulator, without any clock cycle overhead. Finally, we have presented a new approach to DAG optimization that is based on synthesizing spill-free DAGs. This technique has been found to give promising results for most of the multiplication-free linear transforms that we have experimented with.

We have presented results for the 4x4 Walsh-Hadamard, 8x8 Walsh-Hadamard, 8x8 Haar and 4x4 Slant transforms. The code generated using these optimization techniques requires 25% to 40% fewer cycles. We are currently evaluating the effectiveness of these techniques for error correcting/detecting codes and are also looking at optimized code generation of FIR filters on architectures that do not support a hardware multiplier.

As part of our future work, we are developing an algorithm to automate the DAG optimization by applying the various transformations. As a first step we have developed an algorithm that searches the DAG for all butterfly patterns and serializes them by appropriately selecting one of the two possible options.

Our algorithm for synthesizing spill-free DAGs currently uses only one already-computed output to compute a new output. Higher gains are possible if more than one already-computed output is used. We are in the process of enhancing our synthesis process to comprehend such possibilities.

We are also looking at code optimization of multiplication-free linear transforms on multiple-register architectures. We believe that the algorithm for minimizing the number of additions+subtractions can result in significant code optimization for register-rich architectures.

References
[1] Anil K. Jain, "Fundamentals of Digital Image Processing", Prentice Hall Inc., 1989.
[2] TMS320C2x User's Guide, Texas Instruments, 1993.
[3] TMS320C5x User's Guide, Texas Instruments, 1993.
[4] A. Aho, R. Sethi and J. Ullman, "Compilers: Principles, Techniques and Tools", Addison-Wesley, 1986.
[5] Stan Liao et al., "Code Optimization Techniques for Embedded DSP Microprocessors", DAC 1995.
[6] Stan Liao et al., "Instruction Selection Using Binate Covering for Code Size Optimization", ICCAD-95.
[7] Stan Liao et al., "Storage Assignment to Decrease Code Size", ACM Conference on Programming Language Design and Implementation, 1995.
[8] Mahesh Mehendale, G. Venkatesh and S.D. Sherlekar, "Synthesis of Multiplier-less FIR Filters with Minimum Number of Additions", ICCAD-95.
[9] M. Potkonjak et al., "Efficient Substitution of Multiple Constant Multiplications by Shifts and Additions using Iterative Pairwise Matching", DAC 1994, pp. 189-194.
