Zhou 2008
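[Figure: multiplexor chain diagram, panel (a). Recoverable labels: functional components fun1, fun2, fun3, ..., funn; multiplexors MUX 1, 2, ..., n-1; enable signals en1 and en2; condition cond1; example operations add, and, xor on inputs A and B; result R.]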
where Oi is the operational activity of component i, and Omuxj is the operational activity of multiplexor j.

Given an input to the design, the operational activity of each functional component is fixed; changing the component positions in the chain does not affect the value of Oi in Formula 1. However, the execution of a component leads to a different chain of subsequent multiplexor operations, depending on the component's position in the chain. A component positioned far from the output causes more multiplexor operations than one positioned close to the output; the more frequently the component executes, the higher the operational activity it generates.

Note that when the design structure is implemented with a synthesis tool, such as Synopsys Design Compiler, the chain of multiplexors may be realized with an additional disable function, as illustrated in Figure 3, where Figure 3(a) shows the passing paths of the chain structure and Figure 3(b) shows the control logic. The low-level implementation of the multiplexor not only selects one of the inputs as the output, but also integrates gating logic that disables the propagation of the inputs to the output. Given an ALU operation, some passing paths can be blocked, which effectively reduces unnecessary signal switching.

[...] executed for a given application. Some instructions correspond to ALU operations, such as add (addition), xor (logic xor), and beq (branch on equal); they use the ALU for calculation. Other instructions do not involve the ALU, such as jmp (jump to an instruction memory location). Therefore, we partition the instruction set into ALU instructions and non-ALU instructions. The ALU instructions are further grouped according to the functional component they actually use. For example, the instructions add and sub belong to the same group since both use the adder.

The execution frequency of a component is the sum of the frequencies of the instructions associated with that functional component.

Algorithm 1 summarizes the functional component placement approach. In the algorithm, different power-consumption weights are assigned to different functional components: the more complicated the component, the higher the weight. Given two functional components with the same operational frequency but different design complexities, we place the component with the higher weight closer to the output. The algorithm is self-explanatory; elaboration is omitted.

2.2 Design Environment Integration
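As a rough illustration of the placement approach that Algorithm 1 summarizes, the following Python sketch derives component execution frequencies from instruction frequencies and then assigns chain levels. It is a sketch under stated assumptions, not the authors' implementation: the instruction counts, the instruction-to-component map (apart from add and sub sharing the adder), the weights, and all function names are hypothetical, and the guard that stops the inner loop once two components remain is also an assumption.

```python
# Hypothetical instruction-to-component map. Only "add"/"sub" -> adder comes
# from the text; the other groupings are assumptions. Non-ALU instructions
# such as "jmp" are simply absent from the map.
COMPONENT_OF = {"add": "adder", "sub": "adder", "xor": "xor", "and": "and", "or": "or"}

# Hypothetical power weights: a more complex component gets a higher weight.
WEIGHT = {"adder": 4, "xor": 2, "and": 1, "or": 1}


def component_frequency(instr_freq):
    """Sum instruction frequencies per functional component (steps 1 and 2)."""
    freq = {comp: 0 for comp in WEIGHT}
    for instr, count in instr_freq.items():
        comp = COMPONENT_OF.get(instr)
        if comp is not None:  # skip non-ALU instructions
            freq[comp] += count
    return freq


def place_components(freq, weight):
    """Assign chain levels; level 1 is the level closest to the output (step 3)."""
    remaining = set(freq)
    level = {}
    current = 1
    while len(remaining) > 2:
        top = max(freq[c] for c in remaining)
        # Components tied at the highest frequency, heaviest first.
        tied = sorted((c for c in remaining if freq[c] == top),
                      key=lambda c: (-weight[c], c))
        for comp in tied:
            if len(remaining) <= 2:  # assumed guard: keep two for the last level
                break
            level[comp] = current
            remaining.remove(comp)
            current += 1
    for comp in remaining:  # the last two components share the farthest level
        level[comp] = current
    return level


# Hypothetical instruction execution counts from a profiled trace.
trace = {"add": 50, "sub": 10, "xor": 25, "and": 25, "or": 5, "jmp": 30}
freq = component_frequency(trace)
levels = place_components(freq, WEIGHT)
print(freq)
print(levels)
```

In this made-up trace, xor and and execute equally often, so the tie is broken by weight and xor is placed closer to the output.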
Algorithm 1 Functional Component Placement in Chain Structure
/* Given the execution trace and the set of functional components, FC */
step 1: obtain instruction execution frequency, IF;
step 2: obtain the functional component execution frequency based on IF;
step 3:
/* Find the location for each component in the chain */
/* Start from the closest level to the output */
current_level = 1;
while |FC| > 2 do
    S <= most frequent components in FC;
    /* Repeat if there are multiple such components */
    while S ≠ φ do
        get the component of highest weight, fc in S;
        /* Assign the component to the current level */
        level(fc) = current_level;
        remove fc from S and from FC;
        current_level = current_level + 1;
    end while
end while
/* Assign the last two components to the farthest level */
level(FC) = current_level;

[Figure 4. Processor Design Flow with ALU Customization. Recoverable labels: C program, Compilation, Profiling, Processor Model Generation, ALU Model Customization, Processor Model Update, Design Simulation, Simulation Results.]

[...] development tools. Algorithm 1 is applied to customize the ALU in the processor. The processor model is then updated with the customized ALU. Since the update does not functionally affect the processor and instruction set architecture, no modification is required to the instruction code. Next, the new processor model for the application code is simulated. The design is synthesized using a synthesis tool. Based on the synthesis, the design is evaluated.

3 Experimental Setup and Simulation Results

To verify our placement approach, we first developed a small stand-alone ALU for full design space exploration. The ALU contains only four functional components, one each for addition, logic xor, logic and, and pass (passing one input to the output). The experimental setup is given in Figure 5(a), where the loop exhausts all possible placements in the chain design.

[Figure 5. Experimental Setup (a) ALU Design Exploration (b) Processor Design with ALU Customization. Recoverable labels: C program, Simplescalar compiler, PISA ISA, ASIPMeister, ALU VHDL generator, ALU model, ALU operation and operands, customized processor, testbench (ModelSim, Synopsys Design Compiler, PrimePower), simulation results: area, delay, power.]

For each iteration, a new ALU model with a different component placement is generated. Its function is verified by ModelSim [7]. Next, the design is synthesized with Synopsys [16] Design Compiler based on the tsl18fs120 library, and the related power consumption is estimated by Synopsys PrimePower. The three tools (ModelSim, Design Compiler and PrimePower) together form the testbench for each design evaluation.

Later, we integrated the customization technique into ASIPMeister [11], a processor design tool. The related experimental setup is given in Figure 5(b). In this experimental environment, the processor design for a given application is automatically generated and the functionality of the processor model is systematically verified.

We choose the Portable Instruction Set Architecture (PISA) [3] as the target processor instruction set architecture. ASIPMeister is used to automatically generate the processor VHDL model. We use Simplescalar [3] to compile the application program and to profile the program execution. Based on Algorithm 1, the ALU model in the processor is customized. Finally, the processor with the customized ALU is evaluated using the same testbench described in the setup for design space exploration.
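The exploration loop of Figure 5(a), in which a new ALU model is generated for every placement, might look like the sketch below. It enumerates all orderings of the four components and emits a VHDL if/elsif fragment for each, on the assumption that the first branch of the statement synthesizes to the multiplexor nearest the output. The signal names (op, a, b, r), the OP_ constants, and the function names are hypothetical, and the ModelSim, Design Compiler, and PrimePower steps are not modeled.

```python
from itertools import permutations

# Hypothetical VHDL right-hand sides for the four components named in the text.
# The signal names a and b, and the use of "+" (which needs a numeric type),
# are assumptions made for illustration.
EXPRESSION = {
    "add": "a + b",
    "xor": "a xor b",
    "and": "a and b",
    "pass": "a",
}


def vhdl_chain(order):
    """Return an if/elsif/else fragment with order[0] as the first branch."""
    lines = []
    for i, name in enumerate(order):
        if i == 0:
            lines.append(f"if op = OP_{name.upper()} then")
        elif i < len(order) - 1:
            lines.append(f"elsif op = OP_{name.upper()} then")
        else:
            lines.append("else")  # the last component takes the default branch
        lines.append(f"    r <= {EXPRESSION[name]};")
    lines.append("end if;")
    return "\n".join(lines)


def all_placements(components):
    """Yield every ordering of the components, as the exploration loop does."""
    yield from permutations(components)


placements = list(all_placements(EXPRESSION))
print(len(placements))            # number of candidate orderings
print(vhdl_chain(placements[0]))  # fragment for the first ordering
```

Each generated fragment would then be dropped into the ALU model and passed to the testbench. Swapping two entries of the ordering corresponds to swapping two branches of the if-then-else statement, which leaves the ALU interface unchanged.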
[Figures: results for the stand-alone ALU over the explored designs, with a legend distinguishing "chain structure" and "tree structure". Recoverable panels: design area (axis "Area (gates)", ticks 200 to 214, horizontal axis "Design"); "Design Delay" (axis "Delay (ns)", ticks 4.1 to 4.35, horizontal axis "Design"); and four panels titled For "add" Operation, For "and" Operation, For "xor" Operation, and For "pass" Operation (ticks 1.90E-05 to 2.70E-05). Of the captions, only the fragment "...nent Placements" survives.]
[...] to the output reduces the maximal delay. The adder has the longest delay; therefore, when it is positioned next to the output in the chain, the overall delay is reduced.

It is worth noting that the power consumption is closely [...] power consumption; inputs with high switching frequencies are very likely to bring about high power consumption. Since we are not interested in the effect of inputs on power consumption, we used the same set of random input data of operations for the different design structures to eliminate its effect on our component placement approach.

[...] sequence of input data. Each test runs the ALU with a single [...]

[...] and, or, xor and nor of each application. Their relative frequencies are displayed in Figure 9.

[Figure: ALU Operation Frequency. Vertical axis: Operation Frequency (%); horizontal axis: Benchmark; series: Add(%), And(%), Or(%), Xor(%), Nor(%).]
Table 1. Customized ALU in Application Specific Processors

Metric                   Design      qsort   aes     crc     des     RC4     rsa     dijkstra  stringsearch  sha     average
CPU clk time (ns)        non-custom  7.77    7.77    7.77    7.77    7.77    7.77    7.77      7.77          7.77
CPU clk time (ns)        custom      7.77    7.77    7.77    7.77    7.77    7.77    7.77      7.77          7.77
ALU power (mW)           non-custom  1.0170  0.7826  1.2880  1.6590  1.0110  1.1160  1.1170    1.4010        1.3980
ALU power (mW)           custom      0.5557  0.4418  0.6534  0.8367  0.5214  0.6132  0.6073    0.7403        0.7302
ALU power reduction (%)              45.4    43.5    49.3    49.6    48.4    45.1    45.6      47.2          47.8    46.9
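The power reduction row of Table 1 can be re-derived from the two ALU power rows. The short Python check below assumes the reduction is defined as (non-custom - custom) / non-custom and that the reported average is the mean of the per-application percentages; the variable and function names are illustrative only.

```python
# ALU power values (mW) copied from Table 1.
APPS = ["qsort", "aes", "crc", "des", "RC4", "rsa", "dijkstra", "stringsearch", "sha"]
NON_CUSTOM_MW = [1.0170, 0.7826, 1.2880, 1.6590, 1.0110, 1.1160, 1.1170, 1.4010, 1.3980]
CUSTOM_MW = [0.5557, 0.4418, 0.6534, 0.8367, 0.5214, 0.6132, 0.6073, 0.7403, 0.7302]


def power_reduction(before_mw, after_mw):
    """Relative power saving in percent (assumed definition)."""
    return 100.0 * (before_mw - after_mw) / before_mw


# Per-application reduction, rounded to one decimal place as in the table.
reductions = {
    app: round(power_reduction(before, after), 1)
    for app, before, after in zip(APPS, NON_CUSTOM_MW, CUSTOM_MW)
}
# Mean of the rounded per-application percentages.
average = round(sum(reductions.values()) / len(reductions), 1)

for app, value in reductions.items():
    print(f"{app:13s} {value:5.1f}")
print(f"{'average':13s} {average:5.1f}")
```

Under these assumptions the computed values agree with the table, including the 46.9% average.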
[...] order of functional components in the chain affects power consumption. To reduce the power consumed by the chain, the frequently operating components should be positioned close to the output of the chain.

We developed a VHDL customization approach for low-power ALU design. The customization is extremely simple. It entails neither additional control logic nor modification to the interface of the ALU model in the processor design. It only repositions the functional components in the ALU chain structure by swapping the order of ALU operations in the related if-then-else statement in the hardware description model. The approach can be readily integrated into an existing tool that uses a hardware description language. We implemented the approach in an Application Specific Processor design tool, ASIPMeister [11], which is available to the public.

Our experiments on a set of benchmarks have shown that, on average, a 46.9% reduction in ALU power can be achieved with our design approach, and that the power saving comes at no cost to processor performance.

The component placement approach may be applicable to other designs with a similar chain structure (such as floating-point ALUs), which will be studied in the future.

References

[1] Asip-meister. (http://www.eda-meister.org/asip-meister/).
[2] K. Agarwal, H. Deogun, D. Sylvester, and K. Nowka. Power gating with multiple sleep modes. In Proceedings of the 7th ACM/IEEE International Symposium on Quality Electronic Design, January 2006.
[3] T. Austin, E. Larson, and D. Ernst. Simplescalar: An infrastructure for computer system modeling. Computer, 35(2):59–67, 2002.
[4] M. Borah, R. M. Owens, and M. J. Irwin. Transistor sizing for low power CMOS circuits. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 15:665–671, 1996.
[5] B. H. Calhoun, F. A. Honore, and A. Chandrakasan. Design methodology for fine-grained leakage control in MTCMOS. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 104–109, 2003.
[6] D. Chen, J. Cong, F. Li, and L. He. Low power technology mapping for FPGA architectures with dual supply voltages. In Proceedings of the ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, pages 109–117, 2004.
[7] M. Graphics. Modelsim. (http://www.model.com/).
[8] Y.-T. Ho and T.-T. Hwang. Low power design using dual threshold voltage. In Proceedings of the Asia and South Pacific Design Automation Conference, pages 205–208, 2004.
[9] H. Jiang, M. Marek-Sadowska, and S. Nassif. Benefits and costs of power-gating technique. In Proceedings of the 2005 IEEE International Conference on Computer Design: VLSI in Computers and Processors, pages 559–566, 2005.
[10] H. Li, S. Katkoori, and W.-K. Mak. Power minimization algorithms for LUT-based FPGA technology mapping. ACM Trans. Design Autom. Electr. Syst., 9:33–51, 2004.
[11] M. I. M, S. Higaki, Y. Takeuchi, A. Kitajima, M. Imai, J. Sato, and A. Shiomi. Peas-iii: An asip design environment. In Proceedings of the 2000 IEEE International Conference on Computer Design, pages 430–436, 2000.
[12] R. A. Rutenbar, L. R. Carley, R. Zafalon, and N. Dragone. Low-power technology mapping for mixed-swing logic. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 291–294, 2001.
[13] L. Shang, L. Peh, and N. Jha. Dynamic voltage scaling with links for power optimization of interconnection networks. In Proceedings of International Symposium on High-Performance Computer Architecture, pages 91–102, 2003.
[14] J. M. Shyu, A. Sangiovanni-Vincentelli, J. Fishburn, and A. Dunlop. Optimization-based transistor sizing. IEEE Journal of Solid-State Circuits, 23:400–409, 1988.
[15] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. D. Micheli. Dynamic voltage scaling and power management for portable systems. In Proceedings of the 38th Conference on Design Automation, pages 524–529, 2001.
[16] Synopsys. Synopsys design compiler. (http://www.synopsys.com/).
[17] C. Thimmannagari. CPU Design: Answers to Frequently Asked Questions. Springer, 2005.
[18] V. Tiwari, S. Malik, and P. Ashar. Guarded evaluation: Pushing power management to logic synthesis/design. In Proceedings of the 9th International Symposium on Low Power Design, pages 221–226, 1995.
[19] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner. Theoretical and practical limits of dynamic voltage scaling. In Proceedings of the 41st Conference on Design Automation, pages 868–873, 2004.