Energy-Efficient Backend Compiler Design for Embedded Systems
Wen-Tsong Shiue is with the Silicon Metrics Corporation, 12710 Research Blvd., Suite 300, Austin, TX 78759 USA (telephone: 512-651-1503, e-mail: shiue@ieee.org).

Abstract - Most research to date on energy minimization in DSP processors has focused on hardware solutions. This paper examines the software-based factors affecting performance and energy consumption for architecture-aware compilation. We focus on providing support for one architectural feature of DSPs that makes code generation difficult, namely the use of multiple data memory banks. This feature increases memory bandwidth by permitting multiple data memory accesses to occur in parallel when the referenced variables belong to different data memory banks and the registers involved conform to a strict set of conditions. We present novel instruction scheduling, register allocation, and memory allocation algorithms that attempt to maximize performance, minimize energy, and therefore maximize the benefit of this architectural feature. Experimental results demonstrate that our algorithms generate high-performance, low-energy code for DSP architectures with multiple data memory banks. Our algorithms improved performance and energy consumption by 48.3% and 66.6%, respectively, in our benchmark examples.

Index Terms - Architecture-aware Compiler Design, High Performance and Low Power Design, Instruction Scheduling, Register Allocation, Memory Allocation.

I. INTRODUCTION
Currently, there is a high demand for DSP processors with low power/energy in many areas such as telecommunications, information technology and automotive industries. This demand stems from the fact that low power consumption is important for reliability and low cost production as well as device portability and miniaturization.
In the last decade we have seen the proliferation of electronic equipment like never before. As these systems are becoming increasingly portable, the minimization of power consumption has become an important criterion in system design. In order to design a system with low energy and high performance, it is important to analyze all the components of the system platform. Since a large portion of the functionality of today's systems is in the form of software, it is important to estimate and minimize the software component of the energy cost and to maximize the software component of the performance [1][2].
Although dedicated hardware can provide significant speed and power consumption advantages for signal processing applications, extensive programmability is becoming an increasingly desirable feature of implementation platforms for VLSI signal processing. Increasingly shorter life cycles for consumer products have fueled the trend toward tighter time-to-market windows, which in turn has caused intense competition among DSP product vendors and forced the rapid evolution of embedded technology. As a consequence of these effects, designers are often forced to begin architecture design and system implementation before the specification of a product is fully completed. For example, a portable communication product is often designed before the signal transmission standards under which it will operate are finalized, or before the full range of standards that will be supported by the product is agreed upon. In such an environment, late changes in the design cycle are mandatory. The need to quickly make such late changes requires the use of software.
Although the flexibility offered by software is critical in DSP applications, the implementation of production quality DSP software is an extremely complex task. The complexity arises from the diversity of critical constraints that must be satisfied. Typically these constraints involve stringent requirements on metrics such as latency, throughput, power consumption, code size, and data storage requirements [3].
DSPs are a special kind of processor that is primarily designed to implement signal-processing algorithms efficiently. Code generation for DSPs is more involved than for general-purpose processors. This is because DSP processors have non-homogeneous register sets, a number of specialized functional units, restricted connectivity, limited addressing, and highly irregular datapaths. It is a well-known fact that the quality of compilers for embedded DSP systems is generally unacceptable with respect to code density, performance, and power consumption. This is because the compilation techniques for general-purpose architectures being used do not adapt well to the irregularity of DSP architectures.
We address the problem of code generation for DSP systems on a chip. In such systems, the amount of silicon devoted to program ROM is limited, so the application software must be sufficiently dense. In addition, the software must be written to meet various high-performance and low energy constraints. Unfortunately, current compiler
technologies are unable to generate high-quality code for DSPs, whose architectures are highly irregular. Thus, designers often resort to programming application software in assembly, a labor-intensive task.
In this paper, we present novel instruction scheduling and register and memory allocation algorithms for one particular architectural feature, namely multiple data memory banks. This feature increases memory bandwidth by permitting multiple data memory accesses to occur in parallel. This happens when the referenced variables belong to different banks and the registers involved conform to a strict set of conditions. Furthermore, the instruction set architecture (ISA) of these DSPs requires the programmer to encode in a limited number of long instruction words all the data memory accesses that are to be performed in parallel, thus assisting in the generation of dense code.
Instruction scheduling techniques that use a list-based method have been around since the mid-1980s [4], and list scheduling is the most popular method of scheduling basic blocks. Trace scheduling is an optimization technique that selects a sequence of basic blocks as a trace, and schedules the operations from the trace together [5]. Percolation scheduling [6] looks at the whole program and tries to improve the parallelism of the code. The idea that register allocation can be viewed as a graph-coloring problem has been around since the early 1970s, but Chaitin et al. [7] were the first to actually implement it in a compiler. Briggs [8] came up with some modifications to Chaitin-style allocation, the most important idea being the optimization of variable selection for register spilling.
Most of the previous work on reducing power and energy consumption in DSP processors has focused on hardware solutions to the problem. However, embedded systems designers frequently have no control over the hardware aspects of the predesigned processors with which they work, and so software-based power and/or energy minimization techniques play a useful role in meeting design constraints. Recently, new research directions in reducing power consumption have begun to address the issues of arranging software at the instruction level to help reduce power consumption [10][11]. Previous improvements with software re-arrangements include the value locality of registers [10] and the swapping of operands for the Booth multiplier [11]. This new direction raises an interesting issue of compiler participation in software re-arrangements for reducing power consumption for applications and systems.
The rest of the paper is organized as follows. Section 2 describes the DSP architectural features with multiple data memory banks. Section 3 describes the novel algorithms of instruction scheduling to reduce the number of cycles and the register pressure. Section 4 describes the examples that illustrate our algorithms. Section 5 describes register and memory allocation. Section 6 describes the benchmark results. Section 7 concludes the paper.

II. DSP ARCHITECTURAL FEATURES WITH MULTIPLE DATA MEMORY BANKS
Our approach for increasing the packing efficiency has been tested on DSP architectural features with multiple data memory banks, which can be characterized as Dual-Load-Execute (DLE) architectures. Examples of DLE processors include Analog Devices' ADSP21xx family, NEC's u7701x family, Motorola's 56xxx family, and Fujitsu's Elixir family. These processors support parallel execution of an ALU operation and two data move (data load or data store) operations in the same cycle.

A. DSP Architectures
The DSP architectural units of interest are the data arithmetic logic unit (Data ALU), the address generation unit (AGU) and the X/Y data memory banks [9]. The Data ALU contains hardware specialized for performing fast multiply-accumulate (MAC) operations. It consists of four 24-bit input registers named X0, X1, Y0, and Y1, and two 56-bit accumulators named A and B. The source operands for all ALU operations must be input registers or accumulators, and the destination operand must always be an accumulator. Two 24-bit buses named XDB and YDB permit two input registers or accumulators to be read or written in conjunction with the execution of an ALU operation. As a result, three operations may be executed simultaneously in one instruction cycle. The Address Generation Unit (AGU) contains two sets of 16-bit register files, one consisting of address registers R0, R1, R2, and R3 and offset registers N0, N1, N2, and N3, and the other consisting of address registers R4, R5, R6, and R7 and offset registers N4, N5, N6, and N7. The X/Y Data Memory Banks unit contains two 512-word x 24-bit memory banks, which allow a total of two data memory accesses to occur in parallel.
The ISA of the above DSPs assists in the generation of dense, high-bandwidth code by requiring the programmer to encode all operations that are to execute in parallel during each instruction cycle in either one or two 24-bit instruction words. Specifically, up to two move operations and one Data ALU operation may be encoded in these words. A move in this case refers to a memory access (load or store), a register transfer (moving of data from an input register to an accumulator, or vice versa), or an immediate load (loading of a 24-bit constant into an input register or accumulator). However, due to the nature of the M56000 micro-architecture, only the following pairs of move operations may be performed in parallel: (i) two memory accesses, (ii) a memory access and a register transfer, and (iii) a register transfer and a load immediate.
Consider the following parallel move specification that simultaneously (i) loads a datum into X1 from the X memory bank at the address stored in R0 and (ii) loads a datum into Y1 from the Y memory bank at the address stored in R4: {MOVE X:(R0),X1 Y:(R4),Y1}.
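The restriction on which pairs of move operations may share a cycle (Section II.A) can be written down compactly. The following Python sketch is illustrative only: the kind labels `mem`, `reg`, and `imm` and the function `can_pair` are hypothetical names introduced here, not part of any assembler or of the tool flow described in this paper.

```python
# Kinds of MOVE operation, as classified in Section II.A (labels are hypothetical).
MEM = "mem"   # memory access (load or store)
REG = "reg"   # register transfer
IMM = "imm"   # immediate load

# The three pairs listed as legal within one instruction cycle.
ALLOWED_PAIRS = {
    frozenset([MEM]),        # (i) two memory accesses
    frozenset([MEM, REG]),   # (ii) a memory access and a register transfer
    frozenset([REG, IMM]),   # (iii) a register transfer and a load immediate
}

def can_pair(kind_a, kind_b):
    """Return True if two MOVE kinds may be encoded in the same cycle."""
    return frozenset([kind_a, kind_b]) in ALLOWED_PAIRS

for pair in [(MEM, MEM), (MEM, REG), (REG, IMM), (REG, REG), (MEM, IMM)]:
    print(pair, can_pair(*pair))
```

The sketch checks only the kind of each move. It does not check that two memory accesses refer to different banks or that the registers involved are compatible with those banks.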
PAPER IDENTIFICATION NUMBER: 315
III. INSTRUCTION SCHEDULING
Our instruction-scheduling algorithm, based on list scheduling, directly supports packing, because the above DSP supports simultaneous execution of multiple operations. Packing is efficient in terms of performance, because it always leads to a reduction in the cycle count of programs. Another important feature of packing is that it also tends to reduce the amount of energy consumed during program execution. In practice, packing has the potential to reduce energy consumption by more than half.

A. Instruction Level Energy Model
The average power P consumed by a processor while running a certain program is given by P = I*Vdd, where I is the average current and Vdd is the supply voltage. The energy consumed by a program, E, is given by E = P*T, where T is the execution time of the program. This, in turn, is given by T = N*delta, where N is the number of cycles and delta is the cycle period. Since a common application of DSP embedded systems is in the portable space, where energy is stored in a battery, energy consumption is the focus of our attention. Now, Vdd and delta are known and fixed. Therefore, E is proportional to the product of I and N. Given the number of execution cycles, N, for a program, we only need to measure the average current, I, in order to calculate E. The product of I and N is, therefore, the measure used to compare the energy cost of programs in this analysis. The energy model, taken from [2], is based on measuring the current consumed by individual instructions using an oscilloscope and estimating the energy consumed by a block of software through the calculation of a weighted sum.
List scheduling might give good results in terms of reducing the number of cycles required for execution, but it cannot guarantee an optimal scheduling of the code. If the required number of registers is very high, then register spills may decrease the performance of the final code. In addition to the notion of priority, our data dependence graph (DDG) also includes the lifetime of each variable so that register pressure can also be reduced in the scheduling.

IV. EXAMPLES AND ALGORITHMS

A. Example One
Consider the C and corresponding uncompacted symbolic assembly code shown in Figure 1(a) and Figure 1(b). Figure 1(c) shows the DDG with two weighting factors on each node. The first weighting factor in the tuple is the value of depth, which is counted from the bottom node for each branch. The node with a higher value has higher priority. For instance, nodes V0 and V1 have the highest priority to be selected in the ready set shown in Figure 3.
Note that the nodes marked * in the ready set are ALU operations such as ADD and MPY. One ALU operation can be executed with one or two MOVE operations in the same cycle. Two ALU operation nodes, though, cannot be allowed to execute in the same cycle. The second weighting factor in the tuple is the lifetime of the register variables. For instance, node V0 is dependent upon register r0, which is alive during cycles 1 through 3 (see Figure 2); thus, the interval live time is 3-1 = 2. In Figure 2 the initial code is executed in 11 cycles and the number of registers required is 2 using the lifetime analysis. Next, we construct the ready set based on the as-soon-as-possible (ASAP) scheduling scheme for each node. The total is now 6 cycles for the unscheduled nodes in the ready set.

(a)
a=b+c;

(b)
MOVE b,r0 -- V0
MOVE c,r1 -- V1
ADD r0,r1 -- V2
MOVE r1,a -- V3
MOVE e,r2 -- V4
MOVE f,r3 -- V5
ADD r2,r3 -- V6
MOVE r3,r4 -- V7
MOVE a,r5 -- V8
MPY r4,r5,r6 -- V9
MOVE r6,d -- V10

Figure 1. (a) C code, (b) uncompacted assembly code, and (c) DDG with tuple of (depth, lifetime) on the node.

Figure 2. Before scheduling. (Need 11 cycles and 2 registers)

Cycle | Instruction nodes
1 | V0, V1, V4, V5
2 | V2*, V6*
3 | V3, V7
4 | V8
5 | V9*
6 | V10
Figure 3. Unscheduled nodes in ready set. (Nodes marked * are ALU operations.)
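The ASAP ready set of Figure 3 and the depth component of the (depth, lifetime) tuple can both be derived from the dependences in Figure 1(b). The sketch below is a minimal illustration in Python; the predecessor lists are our reading of the register and memory dependences in the assembly listing, and the function names `asap_levels` and `depths` are hypothetical.

```python
# Predecessors of each DDG node, read off the register and memory
# dependences in Figure 1(b) (an assumption, since the DDG drawing is lost).
PREDS = {
    "V0": [], "V1": [], "V2": ["V0", "V1"], "V3": ["V2"],
    "V4": [], "V5": [], "V6": ["V4", "V5"], "V7": ["V6"],
    "V8": ["V3"], "V9": ["V7", "V8"], "V10": ["V9"],
}

def asap_levels(preds):
    """Earliest cycle (1-based) in which each node could issue."""
    level = {}
    def visit(n):
        if n not in level:
            level[n] = 1 + max((visit(p) for p in preds[n]), default=0)
        return level[n]
    for n in preds:
        visit(n)
    return level

def depths(preds):
    """Depth of each node counted from the bottom node (bottom node = 1)."""
    succs = {n: [] for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    depth = {}
    def visit(n):
        if n not in depth:
            depth[n] = 1 + max((visit(s) for s in succs[n]), default=0)
        return depth[n]
    for n in preds:
        visit(n)
    return depth

# Group nodes by ASAP cycle to obtain the ready set.
ready = {}
for node, cycle in asap_levels(PREDS).items():
    ready.setdefault(cycle, []).append(node)
for cycle in sorted(ready):
    print(cycle, ready[cycle])
print(depths(PREDS))
```

Under these assumed dependences the grouping matches the six rows of Figure 3, and V0 and V1 receive the largest depth, in line with the statement that they have the highest priority.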
Figure 4. After scheduling based on our scheduling algorithm. (Need 6 cycles and 3 registers).

Figure 5. After scheduling based on RANDOM choice. (Need 7 cycles and 4 registers).

Figure 4 shows the nodes scheduled by our algorithm to exploit the DSP architecture. The number of registers is 3. Note that the number of cycles and the number of registers trade off against each other. It is cost efficient to reduce the cycles instead of increasing the number of registers. However, for DSP processors, the number of registers is limited, so we need to develop the best scheduling algorithm to minimize the number of registers (i.e. reduce the register pressure) during the scheduling process. Figure 5 shows that for the random choice, the number of cycles is 7 and the number of registers is 4. This demonstrates that the algorithm developed is very important at this stage.
We borrowed the energy model from [2] to count the energy consumption. In Figure 4, total current, I, is 780mA and total cycles, N, are 6. So, the energy cost is I*N = 780*6 = 4,680. In Figure 5, the scheduling based on random choice has an energy cost of I*N = 850*7 = 5,950.
Before scheduling, in Figure 1, the total current is 1080mA and the number of cycles is 11. So, the energy cost is I*N = 11,880. After scheduling, in Figure 4, based on our algorithm, the number of cycles is reduced from 11 to 6 (45% reduction in cycles) and the energy cost is reduced from 11,880 to 4,680 (60% reduction in energy consumption). This demonstrates that our scheduling algorithm generates high-performance, low-energy code with minimum register pressure for the DSP processors.

B. Example Two
Again, consider the C and corresponding uncompacted symbolic assembly code shown in Figure 6(a) and Figure 6(b). Figure 6(c) shows the DDG with two weighting factors on each node.

(a)
v=a*b+c;

(b) (surviving rows)
MAC r0,r1,r2 -- V3
MOVE r2,v -- V4
MOVE d,r3 -- V5
MOVE e,r4 -- V6
MOVE f,r5 -- V7
MAC r3,r4,r5 -- V8

Figure 6. (a) C code, (b) uncompacted assembly code, and (c) DDG with tuple of (depth, lifetime) on the node.

Our algorithm focuses on reducing the number of cycles and the register pressure at the same time. This helps to do register labeling in the next stage. Besides that, due to the code compaction, the number of cycles is minimized, which further reduces the energy consumption.
Before scheduling, in Figure 6, the total current is 1040mA and the number of cycles is 10, giving an energy cost of I*N = 10,400. After scheduling, in Figure 7, based on our algorithm, the number of cycles is reduced from 10 to 5 (50% reduction in cycles) and the energy cost is reduced from 10,400 to 3,400 (67.3% reduction in energy consumption). The number of registers is only 4. With random choice, the number of cycles is increased from 5 to 6 and the number of registers is increased
from 4 to 6 (see Figure 8). This implies that it degrades the performance and increases the energy consumption. Furthermore, it needs more registers. In Figure 7, total current, I, is 680mA and total cycles, N, are 5. So, the energy cost is I*N = 680*5 = 3,400. In Figure 8, the scheduling based on random choice has an energy cost of I*N = 780*6 = 4,680.

[Figure 7: schedule for Example Two after our scheduling algorithm; table and caption not recoverable.]

Current per cycle: 120mA, 120mA, 120mA, 160mA, 170mA, 90mA
Figure 8. After scheduling based on RANDOM choice. (Need 6 cycles and 6 registers).

C. Our Algorithms
Algorithm
1. Construct the Data Dependence Graph (DDG).
2. Make the tuple (P, L), where P is the depth value of the node, and L is the lifetime of the node.
3. Find the initial ready set R (list scheduling).
4. While not isempty(R) do
   The node with the larger value of P has the higher priority. (reducing the number of cycles and energy cost)
   If tie in value of P:
      If there are pairs of nodes
      else
         The pair of nodes having the same lifetime value, L, has higher priority. (reducing the register pressure)
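The depth-priority part of the algorithm can be tried out on Example One. The Python sketch below is a simplified illustration and not the authors' implementation: it uses only the depth value P as priority, breaks ties by listing order rather than by lifetime (the lifetime tie-break rule is only partly legible above), and enforces just the resource limit of one ALU operation and two MOVE operations per cycle. The DDG encoding and the names `LIMIT`, `depth_from_bottom`, and `list_schedule` are assumptions made for the example.

```python
# Example One DDG (our reading of Figure 1(b)): node -> (kind, predecessors).
DDG = {
    "V0": ("MOVE", []), "V1": ("MOVE", []),
    "V2": ("ALU", ["V0", "V1"]), "V3": ("MOVE", ["V2"]),
    "V4": ("MOVE", []), "V5": ("MOVE", []),
    "V6": ("ALU", ["V4", "V5"]), "V7": ("MOVE", ["V6"]),
    "V8": ("MOVE", ["V3"]), "V9": ("ALU", ["V7", "V8"]),
    "V10": ("MOVE", ["V9"]),
}
LIMIT = {"ALU": 1, "MOVE": 2}  # one ALU operation plus up to two moves per cycle

def depth_from_bottom(ddg):
    """P value: longest path from a node down to a bottom node (bottom = 1)."""
    succs = {n: [] for n in ddg}
    for n, (_, preds) in ddg.items():
        for p in preds:
            succs[p].append(n)
    memo = {}
    def visit(n):
        if n not in memo:
            memo[n] = 1 + max((visit(s) for s in succs[n]), default=0)
        return memo[n]
    return {n: visit(n) for n in ddg}

def list_schedule(ddg):
    """Greedy list scheduling; returns one list of nodes per cycle."""
    prio = depth_from_bottom(ddg)
    done, schedule = set(), []
    while len(done) < len(ddg):
        # Ready set: unscheduled nodes whose predecessors ran in earlier cycles.
        ready = [n for n in ddg
                 if n not in done and all(p in done for p in ddg[n][1])]
        ready.sort(key=lambda n: -prio[n])  # stable sort: ties keep listing order
        used, cycle = {"ALU": 0, "MOVE": 0}, []
        for n in ready:
            kind = ddg[n][0]
            if used[kind] < LIMIT[kind]:
                used[kind] += 1
                cycle.append(n)
        done.update(cycle)
        schedule.append(cycle)
    return schedule

for i, cycle in enumerate(list_schedule(DDG), start=1):
    print(i, cycle)
```

With these assumptions the sketch packs Example One into 6 cycles, the cycle count reported for Figure 4. It does not model register lifetimes, so it says nothing about the register counts quoted in the figures.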
memory banks in DSP processors.

Algorithm
Each register must be labeled X0, X1, Y0, Y1, A or B, while each variable must be labeled X-MEM or Y-MEM.
If two variables and two registers are assigned to the same cycle, then
o The two variables must be labeled differently.
o If a variable is labeled X-MEM, then its register must be labeled X0, X1, A or B.
o If a variable is labeled Y-MEM, then its register must be labeled Y0, Y1, A or B.
The operands of ALU operations must be registers X0, Y0 or accumulators A/B (or X1, Y1, A/B). The destination operand must always be an accumulator.

VI. BENCHMARK RESULTS
In our benchmark results, performance is improved by 48.3% on average and energy consumption is reduced by 66.6% on average (see Figure 9). Figure 9 shows the unscheduled assembly code, the optimized assembly code (i.e., assembly code that has been scheduled and whose registers and memory variables have been allocated), and the total required current, the number of cycles, and the energy cost for both.

Figure 9 (surviving rows)

Example 2
Unscheduled: Total Cycles: 10 cycles; Energy Cost: 10,400
Optimized: Total Cycles: 5 cycles; Energy Cost: 3,400

Example 3: {c=a+b; f=d+e;}
Unscheduled assembly code:
MOVE a,r0 -- 90mA
MOVE b,r1 -- 90mA
ADD r0,r1 -- 100mA
MOVE r1,c -- 90mA
MOVE d,r2 -- 90mA
MOVE e,r3 -- 90mA
ADD r2,r3 -- 100mA
MOVE r3,f -- 90mA
Total Current: 740mA; Total Cycles: 8 cycles; Energy Cost: 5,920

After instruction scheduling and register and memory allocation:
MOVE X:a,X0 Y:b,A -- 120mA
ADD X0,A X:e,B Y:d,Y0 -- 150mA
ADD Y0,B A,X:c -- 140mA
MOVE B,X:f -- 90mA
Total Current: 500mA; Total Cycles: 4 cycles
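The energy figures quoted for the examples all follow from the cost measure I*N of Section III.A, so they can be re-derived with a few lines of arithmetic. The Python sketch below does this; the function names `energy_cost` and `reduction` are hypothetical, and the inputs are the total currents and cycle counts stated in the text for Examples One and Two and in Figure 9 for Example 3.

```python
def energy_cost(total_current_ma, cycles):
    """Relative energy cost I*N (Vdd and the cycle period are fixed)."""
    return total_current_ma * cycles

def reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before

# (total current in mA, cycles) before and after optimization.
EXAMPLES = {
    "Example One": ((1080, 11), (780, 6)),
    "Example Two": ((1040, 10), (680, 5)),
    "Example 3": ((740, 8), (500, 4)),
}

for name, (before, after) in EXAMPLES.items():
    e0, e1 = energy_cost(*before), energy_cost(*after)
    print(f"{name}: cost {e0} -> {e1}, "
          f"cycles -{reduction(before[1], after[1]):.1f}%, "
          f"energy -{reduction(e0, e1):.1f}%")
```

For Examples One and Two this reproduces the costs 11,880, 4,680, 10,400, and 3,400 given in the text. The optimized cost for Example 3 (500*4) is obtained from the same formula; that entry is not legible in the source figure.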
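The labeling rules of the register and memory allocation algorithm lend themselves to a mechanical check. The Python sketch below is one possible encoding, offered under assumptions: the tuple representation of a move as (bank, register), the reading of the ALU rule as "all source operands from {X0, Y0, A, B} or all from {X1, Y1, A, B}", and the function names `check_parallel_moves` and `check_alu` are ours, not the paper's.

```python
X_REGS = {"X0", "X1", "A", "B"}   # registers allowed with an X-MEM variable
Y_REGS = {"Y0", "Y1", "A", "B"}   # registers allowed with a Y-MEM variable
ACCUMULATORS = {"A", "B"}

def check_parallel_moves(moves):
    """Check the memory moves of one cycle against the labeling rules.

    `moves` is a list of (bank, register) pairs, with bank "X" or "Y".
    Returns a list of violation messages (empty if the cycle is legal).
    """
    errors = []
    for bank, reg in moves:
        allowed = X_REGS if bank == "X" else Y_REGS
        if reg not in allowed:
            errors.append(f"{reg} cannot be used with {bank}-MEM")
    if len(moves) == 2 and moves[0][0] == moves[1][0]:
        errors.append("both variables are labeled with the same memory bank")
    return errors

def check_alu(sources, dest):
    """ALU rule (our reading): sources drawn from X0/Y0/A/B or from
    X1/Y1/A/B, and the destination must be an accumulator."""
    srcs = set(sources)
    group0 = {"X0", "Y0"} | ACCUMULATORS
    group1 = {"X1", "Y1"} | ACCUMULATORS
    return dest in ACCUMULATORS and (srcs <= group0 or srcs <= group1)

# First optimized instruction of Example 3, as we read it from Figure 9:
# a is in X-MEM and goes to X0, b is in Y-MEM and goes to A.
print(check_parallel_moves([("X", "X0"), ("Y", "A")]))
print(check_parallel_moves([("X", "Y0"), ("X", "A")]))
print(check_alu(["X0", "A"], "A"))
```

A checker of this kind could be run over each cycle of a candidate schedule before register and memory labels are committed; it does not search for a labeling itself.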