2008 IEEE/IFIP International Conference on Embedded and Ubiquitous Computing

Application Specific Low Power ALU Design

Yu Zhou and Hui Guo


School of Computer Science & Engineering, University of New South Wales
Sydney, Australia
Email: {zhouyu, huig}@cse.unsw.edu.au

978-0-7695-3492-3/08 $25.00 © 2008 IEEE
DOI 10.1109/EUC.2008.81

Abstract

Power consumption is a critical design issue in embedded processor design. One of the common components in a processor is the Arithmetic and Logic Unit (ALU). ALUs are usually designed as a combinational logic circuit containing a number of functional components for different arithmetic and logic operations. An ALU can be constructed with a tree or a chain structure.

Existing approaches often achieve power reduction at the cost of increased design complexity, which results in delay and area overheads. In this paper, we present a customization approach for the chain-structure-based ALU design that repositions functional components in the chain. The approach can be easily integrated into a processor design environment to effectively reduce ALU power consumption for a given application.

We have applied our approach to a set of benchmarks. Our experimental results show that the power savings range from 43.5% to 49.6%; on average, a 46.9% reduction in ALU power can be achieved. Most importantly, this is achieved at the cost of neither hardware complexity nor processor performance, and the implementation is extremely straightforward.

Keywords: Low Power ALU, Application-Specific Design

1 Introduction

The stringent requirement for low power consumption has been a major issue in most embedded processor designs. Power reduction in the processor can be pursued at different design levels: transistor sizing [4][14] and threshold voltage scaling [5][8] at the semiconductor chip design level, clock gating [17][18] and power gating [2][9] at the logic and register transfer level, and Dynamic Voltage Scaling (DVS) [13][15][19] at the system level. Power reduction can also be performed on individual functional components of the processor.

One typical component in the processor is the Arithmetic and Logic Unit (ALU). An ALU provides common functions for arithmetic and logic operations. It contains a set of functional components. Each functional component is responsible for one type of operation. For example, an adder performs additions.

[Figure 1. Typical Designs (a) Tree Structure (b) Chain Structure]

There are two typical ALU design structures: tree structure and chain structure. An example is shown in Figure 1, where the ALU contains four functional components for addition, logic and, logic xor, and logic or. With the tree structure (see Figure 1(a)), all functional components are connected in parallel to the multiplexor (denoted as MUX in the figure); the multiplexor selects one result as the ALU output among the results from all functional components. In the chain structure (Figure 1(b)), functional components are concatenated through a series of small multiplexors; each multiplexor takes a subset of functional components and passes the result to the output. Therefore, some operations in the chain structure take more levels of multiplexor transmission to the output than others.
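The difference in multiplexor levels between the two structures can be made concrete with a small sketch. The Python below is illustrative only and is not part of the original design flow. The function names `chain_mux_levels` and `tree_mux_levels` are hypothetical, and the chain is assumed to be built from 2-to-1 multiplexors in which the first two components share the first multiplexor, which is how Figure 1(b) appears to be drawn.

```python
def chain_mux_levels(order):
    """Number of 2-to-1 multiplexors each component's result crosses.

    `order` lists the components from the far end of the chain to the
    output, e.g. ["add", "and", "xor", "or"] for Figure 1(b).
    """
    n = len(order)
    if n < 2:
        raise ValueError("a chain needs at least two components")
    levels = {}
    for idx, name in enumerate(order):
        # The first two components share the first multiplexor, so both
        # cross all n-1 multiplexors; each later component crosses one fewer.
        levels[name] = (n - 1) - max(idx - 1, 0)
    return levels


def tree_mux_levels(order):
    """In the tree structure every result crosses the single n-to-1 multiplexor."""
    return {name: 1 for name in order}


if __name__ == "__main__":
    figure_1b = ["add", "and", "xor", "or"]
    print(chain_mux_levels(figure_1b))  # {'add': 3, 'and': 3, 'xor': 2, 'or': 1}
    print(tree_mux_levels(figure_1b))   # {'add': 1, 'and': 1, 'xor': 1, 'or': 1}
```

Under these assumptions the add and and results cross three multiplexors, xor crosses two and or crosses one, whereas every result in the tree crosses exactly one.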
The tree structure often demands more area than the chain structure, but its operation is usually faster, as will be shown by our simulation results in Section 3.

In a modern processor design, where the processor is pipelined into a number of stages, the speed of the processor is determined by the longest delay, which is associated with the critical path. Usually, the ALU is not on the critical path (as will be demonstrated in Section 3). Therefore, some processor design tools, such as ASIPMeister [11][1], use the chain structure for the ALU to save area.

For the chain structure, there are a variety of options for placing functional components. So far, the functional component placement in the chain has been chosen arbitrarily in processor design. To the best of our knowledge, no work related to this design issue has been reported. In fact, placing a functional component differently in the chain structure may cause different power consumption. For example, swapping the add component and the or component in Figure 1(b) may favor some applications and can save a considerable amount of ALU power.

In this paper, we investigate the effect of functional component placement in the chain structure on power consumption; we propose a functional component placement approach to reduce power consumption for a given application; and we implement the approach in the hardware description model and integrate it into the ASIPMeister processor design tool [11]. Our experiments on a set of benchmarks show that, on average, 46.9% of ALU power can be saved.

Compared with existing power reduction approaches, which often incur high area and/or delay overheads and require considerable implementation effort, our approach is almost cost free and is extremely simple to implement.

1.1 Related Work

Power reduction in digital system design has been studied for many decades. Power consumption in a digital system (with CMOS technology) consists of two types: dynamic power from logic signal switching activities and static power from transistor leakage currents.

Approaches for dynamic power reduction can be classified into three categories, each with a different focus: reducing switching capacitance, reducing switching frequency, or reducing supply voltage.

Similarly, existing approaches for leakage power reduction can be classified by their power reduction strategies: reducing supply voltage, reducing circuit size, reducing operating temperature, or increasing transistor threshold voltage.

Approaches in each category can be further grouped, based on the design level at which they are applied.

Transistor sizing [4][14] is one approach to reducing switching capacitance at the low design level. Also applied at the low design level is the approach of altering the transistor threshold voltage [5][8]. Increasing the threshold voltage reduces leakage power consumption.

At the logic gate level, technology mapping is often targeted for power reduction. Technology mapping automatically constructs a gate-level design representation based on a given logic gate library. Finding a mapping that minimizes the total power consumption under the delay and cost constraints is an NP-hard problem. Some heuristic algorithms [6][10][12] have been proposed to search for the optimal mapping for a design.

Moving to the Register Transfer Level (RTL) and system level, clock gating and power gating are used. Clock gating [17][18] prevents the clock signal from reaching idle functional units so that unnecessary switching activity in the idle functional units is avoided, hence saving dynamic power. A clear description of clock gating can be found in [17], and a good example of using logic gating is given by Tiwari et al. [18]. Power gating [2][9], on the other hand, disconnects the power supply from unused components to eliminate both their dynamic and leakage power consumption. The benefits and costs of power gating are investigated in [9]. An application of the approach at the system level can be found in [2].

There have been many other approaches that apply multiple techniques to reduce power consumption. One popular approach is Dynamic Voltage Scaling (DVS) [13][15][19]. DVS scales the supply voltage and clock frequency during circuit operation. The theoretical and practical limits of the DVS approach are discussed in [19].

1.2 Paper Organization

The rest of the paper is organized as follows. Section 2 discusses the functional component placement in the chain structure design and develops a customization approach for an application specific low power ALU. Section 3 presents the experimental setup, followed by the simulation results and related discussions. Section 4 concludes the paper.

2 Chain Structure Design and Power Reduction

Without loss of generality, we refer to the chain structure as in Figure 2, where there are n functional components concatenated by 2-to-1 multiplexors.

All functional components in the structure operate for any calculation request, but at most one of them executes the required calculation; the other components operate for nothing except exercising and consuming power. We use execution to indicate a component operation that responds to a calculation request and whose result will ripple through a chain of multiplexors to the output.

[Figure 2. General Chain Structure]
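A behavioural sketch may help to picture the difference between operating and executing. The Python below is not the authors' VHDL model. It assumes a 32-bit datapath and the four components of Figure 1, and the names `COMPONENTS` and `chain_alu` are hypothetical. Every component computes a result for every request, and only the requested result is selected as the values ripple through the 2-to-1 multiplexors.

```python
import operator

MASK = 0xFFFFFFFF  # assumed 32-bit datapath

# Hypothetical functional components, named after Figure 1.
COMPONENTS = {
    "add": lambda a, b: (a + b) & MASK,
    "and": operator.and_,
    "xor": operator.xor,
    "or": operator.or_,
}


def chain_alu(op, a, b, order):
    """Evaluate `op` on a chain whose components are listed far end first."""
    # Every component operates on every request ...
    results = {name: COMPONENTS[name](a, b) for name in order}
    # ... but only the requested component "executes": its result is the
    # one selected as the chain ripples toward the output.
    carried = results[order[0]]
    selected = order[0] == op
    for name in order[1:]:
        if name == op:
            # This multiplexor selects its local component.
            carried = results[name]
            selected = True
        # Otherwise the multiplexor passes on the value carried so far.
    if not selected:
        raise ValueError(f"unknown operation: {op}")
    return carried


if __name__ == "__main__":
    print(chain_alu("add", 5, 3, ["add", "and", "xor", "or"]))  # 8
    print(chain_alu("add", 5, 3, ["or", "xor", "and", "add"]))  # 8
```

The result is the same for any ordering of the components, so in this model repositioning components changes only how far a selected result has to travel, not what the ALU computes.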

The dynamic power consumption of the design is determined by the operational activity of the circuit. At the functional component level, the operational activity can be represented by

$$ OP = \sum_{i=1}^{n} O_i + \sum_{j=1}^{n-1} O_{mux_j}, \qquad (1) $$

where $O_i$ is the operational activity of component i, and $O_{mux_j}$ the operational activity of multiplexor j.

Given an input to the design, the operational activity of each of the functional components is fixed; changing component positions in the chain will not affect the value of $O_i$ in Equation (1). However, the execution of a component will lead to a different chain of subsequent multiplexor operations, depending on the component's position in the chain. A component positioned far away from the output causes more operations than when it is close to the output; the more frequently the component executes, the higher the operational activity it generates.

It is very important to note that when the design structure is implemented with a synthesis design tool, such as Synopsys Design Compiler, the chain of multiplexors may be realized with an additional disable function, as illustrated in Figure 3, where Figure 3(a) shows the passing paths of the chain structure and Figure 3(b) is the control logic. The low level implementation of the multiplexor not only includes the function of selecting one of the inputs as the output, but also integrates a gating logic that disables the propagation of the inputs to the output. Given an ALU operation, some passing paths can be blocked, which effectively reduces unnecessary signal switching.

[Figure 3. Implementation Example of Chain Structure: (a) Passing Paths (b) Control Logic]

2.1 ALU Customization

We can customize the ALU design by identifying frequent functional components and placing them close to the output.

The execution frequency of a functional component is obtained from instruction frequencies. The frequency of an instruction is how often the instruction is executed; it is measured as a percentage of the total number of instructions executed for a given application. Some instructions correspond to ALU operations, such as add (addition), xor (logic xor), and beq (branch on equal). They use the ALU for calculation. Some instructions do not involve the ALU, such as jmp (jump to an instruction memory location). Therefore, we partition the instruction set into ALU instructions and non-ALU instructions. The ALU instructions are further grouped according to the functional component they actually use. For example, instructions add and sub belong to the same group since they use the adder component.

The execution frequency of a component is the sum of the frequencies of the instructions associated with the functional component.

Algorithm 1 summarizes the functional component placement approach. In the algorithm, different weights of power consumption are assigned to different functional components. The more complicated the component, the higher the weight. Given two functional components of the same operational frequency but with different design complexities, we place the component with the higher weight closer to the output. The algorithm is self explanatory. Elaboration is omitted.

Algorithm 1 Functional Component Placement in Chain Structure

    /* Given the execution trace and the set of functional components, FC */
    step 1: obtain instruction execution frequency, IF;
    step 2: obtain the functional component execution frequency based on IF;
    step 3:
        /* Find the location for each component in the chain */
        /* Start from the closest level to the output */
        current_level = 1;
        /* If there are more than 2 components in FC */
        while |FC| > 2 do
            S <= most frequent components in FC;
            FC <= FC − S;
            /* Repeat if there are multiple such components */
            while S ≠ ∅ do
                get the component of highest weight, fc in S;
                /* Assign the component to the current level */
                level(fc) = current_level;
                /* Go to one level further from the output */
                current_level++;
                S <= S − fc;
            end while
        end while
        /* Assign the last two components to the farthest level */
        level(FC) = current_level;

2.2 Design Environment Integration

The customization technique can be integrated into a processor design environment. A general design platform is given in Figure 4.

The design flow starts from a given application written in a high level programming language. A target machine architecture is selected and the program is compiled for the target machine. Based on the instruction set of the machine architecture, a hardware description model for the processor is developed by either commercial or in-house development tools. Algorithm 1 is applied to customize the ALU in the processor. The processor model is then updated with the customized ALU. Since the update does not functionally affect the processor and instruction set architecture, no modification is required to the instruction code. Next, the new processor model for the application code is simulated. The design is synthesized using a synthesis tool. Based on the synthesis, the design is evaluated.

[Figure 4. Processor Design Flow with ALU Customization (blocks: C program, Compilation, Profiling, Processor Model Generation, ALU Model Customization, Processor Model Update, Design Simulation, Simulation Results)]

3 Experimental Setup and Simulation Results

To verify our placement approach, we first developed a small stand-alone ALU for full design space exploration. The ALU contains only four functional components, for addition, logic xor, logic and, and pass (passing one input to the output), respectively. The experimental setup is given in Figure 5(a), where the loop exhausts all possible placements in the chain design.

[Figure 5. Experimental Setup (a) ALU Design Exploration (b) Processor Design with ALU Customization]

For each iteration, a new ALU model with a different component placement is generated. Its function is verified by ModelSim [7]. Next, the design is synthesized with Synopsys [16] Design Compiler based on the tsl18fs120 library, and the related power consumption is estimated by Synopsys PrimePower. The three tools (ModelSim, Design Compiler and PrimePower) together form the testbench for each design evaluation.

Later, we integrated the customization technique into ASIPMeister [11], a processor design tool. The related experimental setup is given in Figure 5(b). In this experimental environment, the processor design for a given application is automatically generated and the functionality of the processor model is systematically verified.

We choose the Portable Instruction Set Architecture (PISA) [3] as the target processor instruction set architecture. ASIPMeister is used to automatically generate the processor VHDL model. We use Simplescalar [3] to compile the application program and to profile the program execution. Based on Algorithm 1, the ALU model in the processor is customized. Finally, the processor with the customized ALU is evaluated using the same testbench described in the setup for design space exploration.
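The customization step based on Algorithm 1 and the exhaustive exploration loop of Figure 5(a) can be sketched together as follows. This is a simplified illustration, not the authors' implementation. The two nested loops of Algorithm 1 are collapsed into a single sort. The cost model (each execution crosses as many multiplexors as its chain level) is an assumption standing in for synthesis and power estimation. All function names, execution counts and weights are hypothetical.

```python
from itertools import permutations


def level_for_rank(rank, n):
    """Chain level for the rank-th component (rank 1 is closest to the output).

    The two components farthest from the output share level n-1.
    """
    return min(rank, max(n - 1, 1))


def place_components(freq, weight):
    """Return {component: level}, with level 1 closest to the output."""
    # Most frequent first; equal frequencies are ordered by higher weight,
    # then by name so the result is deterministic.
    ranked = sorted(freq, key=lambda c: (-freq[c], -weight[c], c))
    n = len(ranked)
    return {c: level_for_rank(r, n) for r, c in enumerate(ranked, start=1)}


def mux_activity(levels, freq):
    """Assumed cost: every execution crosses `level` multiplexors."""
    return sum(freq[c] * levels[c] for c in freq)


def exhaustive_minimum(freq):
    """Evaluate all n! placements and return the lowest assumed cost."""
    n = len(freq)
    costs = []
    for order in permutations(freq):  # order[0] is closest to the output
        levels = {c: level_for_rank(r, n) for r, c in enumerate(order, start=1)}
        costs.append(mux_activity(levels, freq))
    return min(costs)


if __name__ == "__main__":
    # Hypothetical execution counts and relative power weights.
    freq = {"add": 60, "xor": 20, "and": 10, "pass": 10}
    weight = {"add": 4, "xor": 2, "and": 1, "pass": 0}
    levels = place_components(freq, weight)
    print(levels)  # {'add': 1, 'xor': 2, 'and': 3, 'pass': 3}
    print(mux_activity(levels, freq), exhaustive_minimum(freq))  # 160 160
```

For this sample data the frequency-ordered placement reaches the same cost as the best of the 24 exhaustively evaluated placements. The sketch says nothing about real power figures, which depend on the synthesized netlist and the input data.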
[Figure 6. Design Area of Different Component Placements (X-axis: Design, chain structure and tree structure; Y-axis: Area (gates))]

[Figure 7. Design Delay of Different Component Placements (X-axis: Design, chain structure and tree structure; Y-axis: Delay (ns))]

[Figure 8. Design Exploration (X-axis: Component Placement Structure in ALU, Y-axis: ALU Power Consumption); one plot each for the "add", "and", "xor" and "pass" operations]

3.1 Simulation Results

3.1.1 ALU Design Space Exploration

As aforementioned, a reduced ALU of four functions was used for exploring designs of all possible placements in order to verify the effectiveness of our component placement approach. The chain structure of the ALU is modeled using the VHDL if-then-else statement. A different order of the ALU calculations in the if-then-else statement corresponds to a different functional component placement in the chain structure.

Given four functional components, there are 4! = 24 different placements. We use the notation C1-C2-C3-C4 to denote a placement arrangement. For example, add-pass-xor-and represents a placement where the component for and is placed in the closest position to the output, and add and pass are positioned at the far end of the chain from the output. This notation appears in Figures 6, 7 and 8 for the chain designs with different component placements.

The area and delay for designs with different placements are plotted in Figures 6 and 7, respectively. The tree structures (described with the case statement in VHDL) were also designed for comparison.

As can be seen from Figure 6, chain designs have lower area cost than the tree structure.

From Figure 7, we can see that most chain designs have longer delay than the tree structure. But for some designs, the delay is smaller than that of the tree structure. This can be explained as follows: the critical path of the tree structure is the longest functional component plus the 4-to-1 multiplexer, while the critical path in the chain structure is the maximal value of the delays of all functional components to the output. The delay of the chain varies with the component placement. Placing the longest component closest

to the output reduces the maximal delay. The adder is the longest component; therefore, when it is positioned next to the output in the chain, the overall delay is reduced.

It is worth noting that the power consumption is closely related to the input. Different inputs will result in different power consumption; inputs with high switching frequencies are very likely to bring about high power consumption. Since we are not interested in the effect of inputs on power consumption, we used the same set of random input data of operations for different design structures to eliminate its effect on our component placement approach.

To check whether the placement affects the power consumption, we carried out a series of tests with the same sequence of input data. Each test runs the ALU with a single type of operation. The power consumption for a given type of operation with different placements is shown in each plot of Figure 8, where power is measured in Watts.

Based on the simulation results, we can see that for a given type of operation, the power consumption varies with the component placement in the chain. It reaches the minimal level when the related functional component is placed closest to the output. There are 3! = 6 such cases for each type of operation. For instance, when only additions are performed (see the top plot in Figure 8), the six designs with the adder closest to the output (as highlighted in the plot) have the lowest power consumption. Similar observations (refer to the rest of the plots in Figure 8) can be made for the other operations. The results therefore verify our placement customization approach.

It must be noted that the power savings here are not significant. This is because of the low bit-switching activity of the inputs used in the simulation. Nevertheless, it does not affect our investigation of the effect of different placements on power consumption.

3.1.2 Application Specific ALU

To see the effectiveness of our approach in real applications, we applied the ALU customization technique to the processor design for a set of benchmarks in the design environment shown in Figure 5(b). With ASIPMeister, the multiplier and divider are separated from the ALU. Therefore, none of the ALU designs in the experiment include those functional components.

ASIPMeister models the ALUs with two chains: a chain of functional components, as has been discussed, and a chain of different inputs to the adder for different types of additions, such as a + b and a − b. Therefore, we applied Algorithm 1 jointly to both chains.

For each design, the functional component execution frequencies were obtained from the Simplescalar profiler.

Based on the set of benchmarks we used in our experiments, the five most used ALU operations in each application are addition, and, or, xor and nor. Their relative frequencies are displayed in Figure 9.

[Figure 9. Operation Frequency (X-axis: Benchmark; Y-axis: Operation Frequency, 0% to 100%; series: Add(%), And(%), Or(%), Xor(%), Nor(%))]

As can be seen from Figure 9, for all applications, addition has the highest frequency, dominating the other ALU operations. Therefore, for all designs, the adder is placed closest to the output.

We measured the ALU power consumption for each design. To verify whether the ALU customization affects the processor speed, we also evaluated the processor clock speed (determined by the critical path delay) for each design. The results are given in Table 1.

Table 1. Customized ALU in Application Specific Processors

Application        design      qsort   aes     crc     des     RC4     rsa     dijkstra  stringsearch  sha     average
CPU clk time (ns)  non-custom  7.77    7.77    7.77    7.77    7.77    7.77    7.77      7.77          7.77
CPU clk time (ns)  custom      7.77    7.77    7.77    7.77    7.77    7.77    7.77      7.77          7.77
ALU power (mW)     non-custom  1.0170  0.7826  1.2880  1.6590  1.0110  1.1160  1.1170    1.4010        1.3980
ALU power (mW)     custom      0.5557  0.4418  0.6534  0.8367  0.5214  0.6132  0.6073    0.7403        0.7302
pow. red. (%)                  45.4    43.5    49.3    49.6    48.4    45.1    45.6      47.2          47.8    46.9

For each application (row 1, columns 3-11 in the table), we have two designs: the normal non-custom design, taken directly from ASIPMeister, and the design with the ALU customized (see the label custom in the table). The processor clock time is given in rows 2 and 3. Power consumption for the ALU in the processor is given in rows 4 and 5, with the reduction rate presented in the last row.

From the table, we can see that the CPU clock time remains unchanged throughout all designs, which demonstrates that the ALU is not on the critical path and that differing functional component placements in the ALU do not affect the processor clock speed. However, with our customization approach, the ALU power can be reduced by 43.5% to 49.6%; on average, 46.9% of ALU power can be saved compared to the non-custom designs.

It is worth noting that the design results are partially affected by the low level implementation (such as logic mapping and layout) based on the synthesis tool and design library. But the effect of different design models at the high level described in HDL can still be observed in the final synthesized results, and the proposed design approach can be verified.

4 Conclusions

In this paper, we discussed the effect of component placement on the chain structure design. We found that the

order of functional components in the chain affects power consumption. To reduce the power consumed by the chain, the frequently operating component should be positioned close to the output of the chain.

We developed a VHDL customization approach for the low power ALU design. The customization is extremely simple. It entails neither additional control logic nor modification to the interface of the ALU model in the processor design. It only repositions the functional components in the ALU chain structure by swapping the order of ALU operations in the related if-then-else statement in the hardware description model. The approach can be readily integrated into an existing tool that uses a hardware description language. We implemented the approach in an Application Specific Processor design tool, ASIPMeister [11], which is available to the public.

Our experiments on a set of benchmarks have shown that, on average, a 46.9% ALU power reduction can be achieved with our design approach and that the power saving comes at no cost to processor performance.

The component placement approach may be applicable to other designs with a similar chain structure (such as floating-point ALUs), which will be studied in the future.

References

[1] Asip-meister. (http://www.eda-meister.org/asip-meister/).
[2] K. Agarwal, H. Deogun, D. Sylvester, and K. Nowka. Power gating with multiple sleep modes. In Proceedings of the 7th ACM/IEEE International Symposium on Quality Electronic Design, January 2006.
[3] T. Austin, E. Larson, and D. Ernst. Simplescalar: An infrastructure for computer system modeling. Computer, 35(2):59–67, 2002.
[4] M. Borah, R. M. Owens, and M. J. Irwin. Transistor sizing for low power cmos circuits. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 15:665–671, 1996.
[5] B. H. Calhoun, F. A. Honore, and A. Chandrakasan. Design methodology for fine-grained leakage control in mtcmos. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 104–109, 2003.
[6] D. Chen, J. Cong, F. Li, and L. He. Low power technology mapping for fpga architectures with dual supply voltages. In Proceedings of the ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, pages 109–117, 2004.
[7] M. Graphics. Modelsim. (http://www.model.com/).
[8] Y.-T. Ho and T.-T. Hwang. Low power design using dual threshold voltage. In Proceedings of the Asia and South Pacific Design Automation Conference, pages 205–208, 2004.
[9] H. Jiang, M. Marek-Sadowska, and S. Nassif. Benefits and costs of power-gating technique. In Proceedings of the 2005 IEEE International Conference on Computer Design: VLSI in Computers and Processors, pages 559–566, 2005.
[10] H. Li, S. Katkoori, and W.-K. Mak. Power minimization algorithms for lut-based fpga technology mapping. ACM Trans. Design Autom. Electr. Syst., 9:33–51, 2004.
[11] M. I. M, S. Higaki, Y. Takeuchi, A. Kitajima, M. Imai, J. Sato, and A. Shiomi. Peas-iii: An asip design environment. In Proceedings of the 2000 IEEE International Conference on Computer Design, pages 430–436, 2000.
[12] R. A. Rutenbar, L. R. Carley, R. Zafalon, and N. Dragone. Low-power technology mapping for mixed-swing logic. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 291–294, 2001.
[13] L. Shang, L. Peh, and N. Jha. Dynamic voltage scaling with links for power optimization of interconnection networks. In Proceedings of International Symposium on High-Performance Computer Architecture, pages 91–102, 2003.
[14] J. M. Shyu, A. Sangiovanni-Vincentelli, J. Fishburn, and A. Dunlop. Optimization-based transistor sizing. IEEE Journal of Solid-State Circuits, 23:400–409, 1988.
[15] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. D. Micheli. Dynamic voltage scaling and power management for portable systems. In Proceedings of the 38th Conference on Design Automation, pages 524–529, 2001.
[16] Synopsys. Synopsys design compiler. (http://www.synopsys.com/).
[17] C. Thimmannagari. CPU Design: Answers to Frequently Asked Questions. Springer, 2005.
[18] V. Tiwari, S. Malik, and P. Ashar. Guarded evaluation: Pushing power management to logic synthesis/design. In Proceedings of the 9th International Symposium on Low Power Design, pages 221–226, 1995.
[19] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner. Theoretical and practical limits of dynamic voltage scaling. In Proceedings of the 41st Conference on Design Automation, pages 868–873, 2004.
