Power Analysis Methodology and Objectives for TI Wireless Platforms

Dario Cardini

Texas Instruments, 821 avenue Jack Kilby, 06271 Villeneuve-Loubet

[email protected]

ABSTRACT

Power consumption is today a key performance metric that must be carefully controlled and
optimized during system on chip implementation. For many applications this is needed to limit IR drop,
electromagnetic interference (EMI), package dissipation and shortened battery life, and to
allow cost reductions (e.g. the ability to choose a lower cost package).

A complex power reduction strategy, able to minimize both dynamic and leakage components, is
required for mobile phones, whose evolution is challenged by the need to improve performance
while minimizing power consumption. This constraint is becoming more and more important for the
most advanced Texas Instruments wireless platforms, which will give their customers the
capability to play three dimensional games, access the internet and watch good quality TV on a
mobile phone.

The purpose of this paper is to show the use of PrimePower and DesignPower as part of the
methodology used at TI Nice to estimate and reduce dynamic power in a wireless system on a chip.
1.0 Introduction

This paper covers the methodology used at TI Nice to estimate and reduce dynamic power in a
system on chip implementation for mobile processor applications.

The first part covers power analysis with PrimePower. The main objectives of the hierarchical
power analysis are to:
. detect potential candidates for power reduction
. detect potential problems
. verify the relevance of the power reduction mechanisms implemented
. estimate the power consumption for each scenario
The second part of this paper covers the power optimization flow at TI. The main objectives of
the power reduction mechanism are to:
. find quick power reduction solutions for critical modules
. explore the possibility of implementing an automatic power reduction flow

The paper is divided into the following sections:

1) Introduction about power reduction and analysis techniques
2) Power consumption reduction in TI flow
3) Advantages of PrimePower power analysis flow
4) Summary of tcl procedures developed
5) Example of the results
6) Application of Power-Compiler to perform power optimization in a critical subsystem
7) Example of the results

2.0 Introduction about power reduction and analysis techniques

A detailed discussion of the theory of power consumption in integrated circuits is outside the
scope of this work. Nevertheless some important concepts are summarized in this paragraph in
order to make the paper accessible to readers not specialized in this field.
The components of IC power are shown in eq. (1).

PTOT = PDYN + PLEAK (1)

Where
PTOT = total power
PDYN = dynamic power, dissipated when the circuit is active, that is, every time the voltage on a
net changes due to an input stimulus.
PLEAK = leakage power, dissipated even when a gate is not switching. Leakage is caused by parasitic
effects, such as reduced thresholds that prevent gates from turning off completely, thus generating
source to drain leakage currents.
Dynamic power components are shown in eq. (2).

PDYN = PSW + PINT (2)

SNUG Europe 2004 2 Power analysis methodology and objectives


Where
PSW = switching power, which is the power dissipated by a cell to charge and discharge the load
connected to its output.

PINT = internal power. This is the power dissipated by charging or discharging any capacitance
internal to the cell. It includes short circuit power, which is the power dissipated during the
brief phase of a transition when both P and N type transistors are on simultaneously and current
flows from Vdd to Gnd.
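
As an illustration of eq. (1) and (2), the following Python sketch combines the power components for a small set of nets. The switching term uses the usual first-order textbook model (0.5 * C * Vdd^2 per toggle), which is an assumption of this example and not a formula taken from this paper. The Net type, the function names and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Net:
    cap_farads: float      # load capacitance seen by the driving cell (assumed value)
    toggle_rate_hz: float  # transitions per second, rising and falling counted together


def switching_power(nets, vdd):
    """First-order switching power in watts (textbook model, assumed here).

    A full charge/discharge cycle dissipates C * Vdd^2, i.e. 0.5 * C * Vdd^2
    per toggle on average.
    """
    return sum(0.5 * n.cap_farads * vdd ** 2 * n.toggle_rate_hz for n in nets)


def total_power(p_sw, p_int, p_leak):
    """Combine the components as in eq. (2) and then eq. (1)."""
    p_dyn = p_sw + p_int   # eq. (2): PDYN = PSW + PINT
    return p_dyn + p_leak  # eq. (1): PTOT = PDYN + PLEAK


if __name__ == "__main__":
    # One 10 fF net driven by a 100 MHz clock (two toggles per period).
    nets = [Net(cap_farads=10e-15, toggle_rate_hz=200e6)]
    for vdd in (1.2, 0.6):
        print(f"Vdd = {vdd} V -> PSW = {switching_power(nets, vdd):.3e} W")
```

In this model halving Vdd reduces the switching power by a factor of four, while halving the load capacitance or the toggle rate halves it.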

Reducing chip power consumption means reducing one or more of its components.
Even if dynamic power is the predominant consumption factor during normal operation, the
reduction of leakage power is of key importance for battery powered applications. The
advanced techniques developed at TI for this purpose are not discussed in this paper, because they
are radically different from the ones used to minimize dynamic power and even an
introduction to them would require too much space.

Several techniques have been developed in order to reduce dynamic power. They can be grouped
into the following categories:
. Voltage reduction
. Capacitance reduction
. Frequency/gate activity reduction
These techniques work at different levels of abstraction. Capacitance and voltage reduction are
mainly achieved at the process level, through the development of logic cell libraries designed for low
power applications. Frequency, or more generally gate activity reduction methodologies, can be
applied during different phases of the ASIC design flow (architectural specification, RTL coding,
implementation). The last part of this paragraph focuses on these techniques.

1) Module level clock gating: all the clocks of a module are stopped during
inactivity periods, thus switching off the whole clock trees at module level. The capability to
switch off the input clocks of a module can be controlled statically or dynamically. Static
control is aimed at reducing power consumption in the module idle modes. If for example a
peripheral is not used, it makes sense to switch off its clock completely. Dynamic control is
aimed at reducing the power consumption in functional modes. The control logic monitors the activity
of the module and, if a condition of no activity is detected, the clocks are switched off.
When the start of activity is detected, the clocks are reactivated. Normally a
counter with a programmable threshold is implemented in order to switch off the clocks only after
a set time has elapsed since the detection of the inactivity condition. Module level clock gating
is specified at architectural level. It can be handled at RTL coding level
and/or at synthesis level with Power Compiler or other commercial tools. An important factor that
affects its efficiency is the position in the hierarchy where the gating structure is instantiated. If the
centralized clock generator contains the clock gating structure, it is possible to prevent the whole
clock tree from toggling when it is not needed, including the branch that goes from the clock
generator to the peripheral. If the gating cell is instantiated inside the module, it can only stop the
branch of the clock tree internal to the module.

2) Register level clock gating: during inactivity periods, the clock is stopped selectively for groups of
flip-flops depending on their activity. It is based on the synchronous load enable functionality: the
enable which normally controls a multiplexing function on the data path of a register is changed
into an enable which controls the clock of the register. When the data has to be held, the clock is
gated, so that the functionality is equivalent and power is saved due to the decreased local clock
frequency. Register level clock gating can be handled at RTL coding level and/or at synthesis level
with Power Compiler or other commercial tools. In fig. 1 both types of clock gating are shown.
CK_GAT_INST_1 is an example of module level clock gating. CK_GAT_INST_2 and
CK_GAT_INST_3 are examples of register level gating. The connections needed to guarantee
clock controllability for test purposes are not shown in the picture.

Fig. 1: module level and register level gating schemes

3) Operand isolation: this technique inserts isolation logic in order to prevent the inputs of data-path
operators from switching when their output is not used. This happens for example when the
output is connected to the input of a register whose clock is gated. Operand isolation is suitable
for computation structures like multipliers and complex adders. After inserting isolation logic, it
must be checked that the timing performance is still acceptable. Operand isolation can be applied
at RTL coding level and/or at synthesis level with Power Compiler.

4) Gate level power optimization: this technique is used at implementation level and is based on
two synthesis compile phases, the first performing the traditional timing and area optimization,
the second performing timing, power and area optimization (leakage and dynamic power).
It is recommended to use switching activity information back-annotated from simulation, so that
the power optimization reflects the real activity of nets.
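
The equivalence behind register level clock gating (technique 2 above) can be sketched with a small cycle-based model in Python. This is only a behavioral illustration under simplifying assumptions (one register, ideal gating cell, no test logic); the function names and the stimulus are hypothetical and do not come from the TI flow.

```python
def run_mux_enable(data, enable, init=0):
    """Register clocked on every cycle; a mux recirculates q when enable is 0."""
    q, trace, clock_edges = init, [], 0
    for d, en in zip(data, enable):
        clock_edges += 1        # the clock pin toggles on every cycle
        q = d if en else q      # the mux selects new data or the held value
        trace.append(q)
    return trace, clock_edges


def run_gated_clock(data, enable, init=0):
    """Same register, but the enable gates the clock instead of driving a mux."""
    q, trace, clock_edges = init, [], 0
    for d, en in zip(data, enable):
        if en:                  # the clock edge reaches the register only when enabled
            clock_edges += 1
            q = d
        trace.append(q)
    return trace, clock_edges


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 1, 0, 0]
    enable = [1, 0, 0, 1, 0, 0, 0, 1]
    mux_trace, mux_edges = run_mux_enable(data, enable)
    gated_trace, gated_edges = run_gated_clock(data, enable)
    print("same register contents:", mux_trace == gated_trace)
    print("clock edges seen by the register:", mux_edges, "vs", gated_edges)
```

Both versions hold the same register contents on every cycle, while the gated version sees a clock edge only when the enable is active, which is where the saving in local clock activity comes from.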

3.0 Power consumption reduction in TI flow

This paragraph shows the importance of dynamic power consumption reduction methodologies in
the flow adopted at TI Nice for the development of a mobile processor chip. The principles of the
hierarchical power estimation flow are explained.

Several steps are needed to minimize power consumption at module level.


1) Power consumption reduction is addressed at architectural level. A key part of each module
functional specification is the description of the mechanisms used to optimize power. Module
level and register level clock gating are extensively used. Part of the validation plan is the
definition of scenarios aimed at power characterization. As a minimum, the modes
defined for each module are high-consumption, normal and idle.
2) RTL coding is a crucial phase for power reduction. Careful coding can significantly improve
module power consumption by turning off logic when not needed.
3) RTL functional simulation of all power scenarios is performed.
4) Synthesis takes special care of area, which also benefits power consumption.
5) Gate-level simulation is performed and a vcd file storing activity information for all internal nodes
is written for each scenario defined for the module.
6) Power estimation at module level is performed with the following purposes:
- Detect potential issues in the gating implementation. The efficiency of the module level
gating is checked by comparing the power consumption when auto-gating is enabled and
when it is disabled. The frequency is measured for all gating elements and compared
with the functional (ungated) frequency.
- Identify the areas that are hot from a power point of view in order to optimize them if the
consumption is excessive. Hierarchical power analysis is performed going down into the
logical hierarchy as far as needed to have an accurate understanding of the main
sources of power consumption. From this analysis a set of suggestions to reduce the
consumption is produced and given to the designers.
- Estimate the power consumption of each module. This is important to know whether the module
meets its power budget and to have a first estimate of the chip consumption based on
module figures before a chip level study can be performed. Moreover the
complexity of some architectures makes the simulation of complete algorithms impractical
at chip level, and in those cases chip level figures are corrected taking into account module
power characterization.
- Keep power consumption under control during the development phase of the module. The
area is controlled together with the power.
- Provide feedback about the power scenario simulation. In the case of a normal activity
scenario the frequency is measured for all sequential elements in order to make sure that
they are clocked at the frequencies defined in the power scenario specification. If
some sequential elements have a low (or zero) toggling rate, their list is provided to the
designers in order to make sure that there are no bugs either in the module
configuration/functionality or in the testbench.
7) Incremental power reduction. Starting from the power analysis results, methods to further reduce
power are identified. The preferred way to implement them is at RTL coding level. Nevertheless a
high degree of flexibility is allowed, and alternative techniques that make it possible to meet
the power budget are also evaluated. In some cases it has been chosen to use Power-Compiler
register gating and this has proved to be a good solution both from an implementation time and
from a power improvement point of view. Only the classic automatic clock gating insertion at
register level with Power Compiler has been used for the moment. This feature already shows
good results. The other features will be tested and used in the near future to improve the results.

Most of the module level power reduction techniques are also used at chip level (architecture, RTL
coding, synthesis). Power scenarios are defined for the main chip applications and they are used to
estimate power consumption based on simulation activity information.
Running full chip gate-level simulations which are able to represent the application
behavior accurately is often impractical due to the excessively long run times. However it is of major
importance to estimate the chip level power consumption in the idle modes, in order to detect and
solve problems in the power reduction structures and to know the power figures in those modes,
which often set limits to battery life. The analysis at chip level for those scenarios follows the same
strategy used at module level.

4.0 Advantages of PrimePower power analysis flow

The PrimePower flow is described in fig. 2.

Fig. 2: PrimePower flow

The main steps are briefly discussed below.

1) Perform gate-level simulation and write a vcd file storing activity information for all internal
nodes. If the netlist is post-layout, use the sdf.
2) Read design and libraries, link.
3) Read the vcd file. Gzipped vcds can also be read. The tool performs a check on the
quality of the vcd annotation.
4) Set operating conditions and properly constrain the module pinout by setting transition times for
inputs and loads for outputs.
5) If a post-layout netlist is analyzed, read the back-annotated capacitance information.
6) Calculate power. The tool performs a check on the quality of the libraries.

In the case of post-clock tree insertion module level analysis, an additional flow is needed to
generate module level netlist, sdf and capacitance information from chip level netlist and spef file.
This is depicted in fig. 3.

1) The module level netlist is written using dc_shell.
2) The module level sdf is written based on the chip level spef and netlist using the pt_shell
write_sdf command.
3) The module level capacitance file is written based on the chip level spef and netlist using pt_shell
write_physical_annotations -format dctcl -parasitics. The capacitance information is written in
the set_load format.
It is important to point out that the sdf and load files used for power characterization are written
using nominal conditions.

Fig. 3: flow for the creation of module level netlist/sdf/capacitance file from chip level
netlist/spef

PrimePower's main advantages are summarized below.

1) Ease of use. The environment is the same as PrimeTime. The use of tcl as programming
language and the set of basic attributes that are common to dc_shell, pt_shell and pp_shell
make it possible to write procedures for one of those tools and to reuse (part or all of) them for
the others.
2) A number of attributes at design and cell (hierarchical/library) level can be accessed. This
makes it possible to extend the set of native commands, thus making the tool more powerful, and
to overcome some of its limitations. Some procedures are released together with the tool. For
example report_register_power writes a custom report for power consumption on all registers
in the design.
3) Capability to perform a full chip analysis.

4) Capability to perform peak power analysis. This is an important feature as peak power
determines wire size for power and ground rails and impacts noise margin and reliability
analysis.
5) Capability to write an fsdb database (waveform debugger).
6) From release 2003.12 it is possible to run the procedure that detects the peak power during
test, provided that a vcd generated from a suitable scan simulation is read.
7) Capability to run a mode based analysis (reset, test, active/sleep) from a single vcd file.

The PrimePower limitations we have experienced are summarized below.

1) Pre-layout analysis relies on the usage of wire-load models. This can limit the accuracy of the
results, especially for high-fanout nets like the clock trees. In order to overcome this inherent
limitation we wrote a set of procedures aimed at creating more realistic estimations for clock
trees.
2) Currently the tool does not generate reports that put together power, clock frequency and area
information. As the capability to correlate these three parameters is of key importance to
identify the main areas of improvement for power consumption, we wrote a set of tcl
procedures which are able to create readable, compact and exhaustive reports.
3) The vcd file size can be very large, especially at chip level or for long simulations of large
modules. Synopsys suggests a way to overcome the limit set on file size by operating systems,
based on the creation of a UNIX FIFO. Another way is to connect PrimePower directly to the
simulator and run the simulator as a slave, thus avoiding writing to and reading from disk. This
limitation will be removed in 2004. In the near future the tool will be able to use a vector free
analysis and power macro models.

5.0 Summary of tcl procedures developed


This paragraph summarizes the procedures we developed in order to extend PrimePower
capabilities and to meet the targets of the TI power analysis flow. These procedures can be grouped
into the following categories:

1) Study the clock:


- find_clock_roots: finds the roots of the clocks starting from the clock pins of all sequential
elements. It goes through buffers and inverters and stops when it finds a cell that is neither a
buffer nor an inverter. In order to be able to go through gating cells (and through every type
of cell having more than one input) the set_pass_through_pin command can be used. The
output of the procedure is a list storing the hierarchical names of all clock roots. A separate
report is also generated with the fanout (number of sequential elements
driven) of each root. This procedure can be run in dc_shell-t, pt_shell, pp_shell.
- Notes: It is important to point out that this approach is different from the one adopted to
trace clocks in the synopsys procedure report_clock_power. While find_clock_roots is
based just on netlist topology, report_clock_power relies on the definition of clocks by the
command create_clock.
- set_pass_through_pin: it makes it possible for the find_clock_roots procedure to go through
every kind of cell. It requires as argument the hierarchical name of the (input) pin which has
to be traced back. This procedure can be run inside dc_shell, pt_shell, pp_shell.

- get_clock_frequency: it measures the frequency of a clock. It is applied to the list generated
by find_clock_roots and it requires as additional argument the duration in ns of the
simulation. This procedure can be run inside pp_shell.
- estimate_clock_tree_power: it estimates the power consumption of a clock tree and it is
used in pre-clock tree expansion power estimations. In a TI gate-level pre clock-tree netlist
there is one clock tree cell for each clock root. These are special cells with infinite driving
capability, considered as start points by the clock tree insertion tool. The problem is that the
power of a clock tree is represented as the power of one of those special cells and depends
on the capacitance of the net connected to its output. As the fanout of this net can vary from
a few to several tens of thousands of sequential elements and its capacitance is estimated based
on the wire load model, the result can be far from realistic. A solution to this problem is first
to estimate the clock tree structure and then its power. The procedure relies on a simple formula
based on a geometrical series to estimate the number of buffers of a clock tree as a function
of the number of sequential elements it has to drive. Then the total capacitance of the clock
tree can be estimated considering that the last stage drives flip-flops and all the other stages
drive clock tree buffers, and assuming a statistically representative capacitance per flip-flop
clock pin, per clock buffer input pin and per clock net. Once all of this is known, the clock
tree consumption can be obtained with good approximation, provided that the parameters
used in the formulas are set to values derived from meaningful statistics collected on post
clock tree insertion data. This procedure can be run inside pp_shell.
- Notes: It is important to point out that the usage of this procedure is different from that of the
synopsys procedure report_clock_power, which can be used only on post-clock tree
expansion netlists.
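
The topological tracing performed by find_clock_roots can be illustrated with a minimal Python sketch on a toy netlist. This is not the tcl procedure itself: the netlist format, the cell type names and the way pass-through pins are declared (a mapping from cell name to the input pin to follow, instead of a hierarchical pin name) are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy netlist: cell name -> (cell type, {input pin: driving cell}).
NETLIST = {
    "ckgen":    ("CLKGEN", {}),
    "en_logic": ("AND2", {}),
    "buf_a":    ("BUF", {"A": "ckgen"}),
    "inv_b":    ("INV", {"A": "buf_a"}),
    "gate_1":   ("CKGATE", {"CK": "buf_a", "EN": "en_logic"}),
    "buf_c":    ("BUF", {"A": "gate_1"}),
    "ff_1":     ("DFF", {"CK": "inv_b"}),
    "ff_2":     ("DFF", {"CK": "buf_c"}),
    "ff_3":     ("DFF", {"CK": "buf_c"}),
}


def find_clock_roots(netlist, pass_through=None):
    """Return {clock root: number of sequential elements driven}.

    Starting from the clock pin of every flip-flop, walk backwards through
    buffers and inverters. Stop at the first other cell, unless that cell is
    listed in pass_through, in which case the named input pin is followed.
    """
    pass_through = pass_through or {}
    fanout = Counter()
    for cell_type, pins in netlist.values():
        if cell_type != "DFF":
            continue
        cell = pins["CK"]
        while True:
            driver_type, driver_pins = netlist[cell]
            if driver_type in ("BUF", "INV"):
                cell = driver_pins["A"]
            elif cell in pass_through:
                cell = driver_pins[pass_through[cell]]
            else:
                break
        fanout[cell] += 1
    return dict(fanout)


if __name__ == "__main__":
    print(find_clock_roots(NETLIST))
    print(find_clock_roots(NETLIST, pass_through={"gate_1": "CK"}))
```

Without pass-through information the gating cell is reported as a root of its own. Declaring its clock input as a pass-through pin merges its fanout into the upstream root, which mirrors the role the text gives to set_pass_through_pin.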

2) Study memory consumption:


- analyze_memory_clock: it finds the clock pin(s) of each memory and determines its
frequency
- get_memory_power: it finds all memories instantiated in the design and calculates the power
consumption of each of them using two different methods. It gets power consumption as it is
measured by pp_shell and it calculates power consumption based on Cpd (power
dissipation capacitance from memory datasheet) and frequency. The two results are printed
together with information about frequency and Cpd for each memory instance. This helps to
understand the accuracy of estimated power consumption for memories.
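
The second method used by get_memory_power can be sketched as follows. The sketch assumes the usual datasheet relation P = Cpd * Vdd^2 * f, which is not spelled out in this paper; the function names and the numeric values are hypothetical.

```python
def cpd_power_mw(cpd_pf, vdd, freq_mhz):
    """Power from the power dissipation capacitance (assumed relation Cpd * Vdd^2 * f).

    Units: pF * V^2 * MHz = 1e-12 * 1e6 W = 1e-6 W = 1e-3 mW.
    """
    return cpd_pf * vdd ** 2 * freq_mhz * 1e-3


def compare_memory_power(measured_mw, cpd_pf, vdd, freq_mhz):
    """Return the Cpd-based estimate and the relative deviation of the measured value."""
    estimate = cpd_power_mw(cpd_pf, vdd, freq_mhz)
    return estimate, (measured_mw - estimate) / estimate


if __name__ == "__main__":
    # Hypothetical memory: Cpd = 20 pF, Vdd = 1.2 V, clocked at 100 MHz.
    estimate, deviation = compare_memory_power(3.168, 20.0, 1.2, 100.0)
    print(f"Cpd estimate: {estimate:.3f} mW, deviation of measured value: {deviation:+.1%}")
```

Printing both figures side by side for each memory instance, as the procedure does, makes a large deviation easy to spot.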

3) Study module consumption:


- report_power_and_area: based on the results of some of the previously mentioned
procedures and calling a Perl script which makes use of the CPAN Spreadsheet-WriteExcel
package, it writes an Excel spreadsheet where the performance of the module from a power
point of view is summarized (see fig. 4). The following information is reported down to a
hierarchical depth selectable by parameter:
- area measured in number of gates and as a percentage of the total module area
- number of combinational instances
- number of instances of sequential elements
- total power consumption measured in mW and represented as a percentage of the total
module power consumption
- power consumption components measured in mW: dynamic and leakage power
- dynamic power components measured in mW: switching and internal power
- power consumption due to X transitions and to glitches.
- Clock tree information: for each clock there is the measured frequency, the power
consumption estimated by PrimePower and the power consumption estimated by the
procedure estimate_clock_tree_power.

4) Generate statistical reports:


- report_frequency_statistic_at_clock_pins: it calculates for each sequential element the
frequency measured at its clock pin and writes a histogram representing for each measured
frequency the percentage of sequential elements which are running at that frequency.
- report_statistic_power: it prints for each leaf cell instantiated in the design a report with
the number of instances of this cell and the average power consumption per cell. It also calculates
the power consumption of the whole sequential and combinational logic.
- report_leaves_power: it accepts the parameters MAX and MIN and prints all instances
which have a consumption < MIN or > MAX.
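
The frequency measurement and the histogram report can be sketched in Python as follows. The paper only states that the simulation duration in ns is required; deriving the frequency from a toggle count per clock pin is an assumption of this sketch, and the pin names and counts are hypothetical.

```python
from collections import Counter


def frequency_mhz(toggle_count, duration_ns):
    """Average frequency of a clock pin: two toggles make one period."""
    cycles = toggle_count / 2
    return cycles / duration_ns * 1e3  # cycles per ns -> MHz


def frequency_histogram(toggles_by_pin, duration_ns, ndigits=1):
    """Percentage of sequential elements running at each (rounded) frequency."""
    freqs = [round(frequency_mhz(t, duration_ns), ndigits)
             for t in toggles_by_pin.values()]
    counts = Counter(freqs)
    return {f: 100.0 * n / len(freqs) for f, n in sorted(counts.items())}


if __name__ == "__main__":
    # Hypothetical toggle counts collected over a 10 us simulation.
    toggles = {"ff_a/CK": 2000, "ff_b/CK": 2000, "ff_c/CK": 600, "ff_d/CK": 0}
    for freq, percent in frequency_histogram(toggles, duration_ns=10000).items():
        print(f"{freq:7.1f} MHz : {percent:5.1f} % of sequential elements")
```

Elements that land in the 0 MHz bin are the ones whose list would be passed to the designers, as described in section 3.0.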

6.0 Examples of the results

In this paragraph some examples of the power estimation results we obtained are shown.
Fig. 4, 5 and 6 show, split into three parts, a report written by the procedure
report_power_and_area (based on the post-synthesis netlist of a 600 Kgates module
and a vcd coming from a unit-delay normal activity simulation).

Pwr top
Inst. Name  Ref. Name   Area (gates)  Area (%)  Seq    Combo  Tot Pwr (mW)  Tot Pwr (%)
inst_top    module_top  568628        100       34428  97080  58.4447       100

Pwr hierarchical modules (level 1)

Inst. Name  Ref. Name   Area (gates)  Area (%)  Seq    Combo  Tot Pwr (mW)  Tot Pwr (%)
inst_1      module_1    455           0.080017  22     105    23.2055       39.7050545
inst_2      module_2    146449.5      25.75489  8475   27429  17.8338       30.513973
inst_3      module_3    346157        60.87583  21494  56988  15.3287       26.227699
inst_4      module_4    9451.25       1.662115  599    1858   0.508296      0.86970418
inst_5      module_5    9490.75       1.669061  411    1542   0.246066      0.42102363
inst_6      module_6    8627.75       1.517293  466    1288   0.22993       0.39341463
inst_7      module_7    8680.75       1.526613  466    1473   0.229542      0.39275075
inst_8      module_8    2468.25       0.434071  166    176    0.165636      0.28340637
inst_9      module_8    2559.5        0.450119  163    471    0.14927       0.25540383
inst_10     module_10   2072.75       0.364518  139    393    0.118696      0.20309113
inst_11     module_11   1929.5        0.339326  131    250    0.0812457     0.13901295
inst_12     module_12   3685.25       0.648095  271    536    0.0742047     0.12696566
inst_13     module_13   4331.5        0.761746  266    761    0.0627734     0.10740649
inst_14     module_14   1911.75       0.336204  121    320    0.0529146     0.09053789
inst_15     module_15   2926.75       0.514704  150    585    0.0425643     0.07282833
inst_16     module_16   1351.25       0.237633  47     435    0.0221187     0.03784552
inst_17     module_17   1699.25       0.298833  113    275    0.0217017     0.03713202
inst_18     module_18   3127.75       0.550052  212    444    0.01834       0.03138009
inst_19     module_19   2803.75       0.493073  210    384    0.0173547     0.02969422
inst_20     module_20   1978          0.347855  133    256    0.0131995     0.0225846
inst_21     module_21   1878          0.330269  128    230    0.00981249    0.01678936
inst_22     module_22   2314.75       0.407076  103    434    0.00782887    0.01339535
inst_23     module_23   2170          0.38162   142    422    0.00487564    0.00834231
Fig. 4: example of report written by report_power_and_area procedure. Area versus total
power results are shown. The column “Area (gates)” shows the area measured as number of
equivalent gates. The column “Area (%)” shows the area of the (sub)module as a percentage
of the total top module area. The columns “Seq” and “Combo” show the number of
instances of sequential and combinational cells. Sequential cells are all cells having the
synopsys attribute “is_sequential” equal to true. These can be flip-flops, latches, memories
and gating cells. The column “Tot Pwr (%)” shows the total power of the (sub)module as a
percentage of the total power of the top module. In the case of the “module top” shown in
fig. 4 there are no memories.
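
The percentage columns of fig. 4 can be reproduced from the raw area and power columns. The short Python sketch below does this for the three largest power consumers, using the figures of fig. 4; the data structure and function names are only illustrative.

```python
# Raw figures copied from fig. 4 (area in equivalent gates, power in mW).
TOP = {"area": 568628, "power_mw": 58.4447}
SUBMODULES = {
    "module_1": {"area": 455, "power_mw": 23.2055},
    "module_2": {"area": 146449.5, "power_mw": 17.8338},
    "module_3": {"area": 346157, "power_mw": 15.3287},
}


def percent_of_top(value, top_value):
    """Express a sub-module figure as a percentage of the top module figure."""
    return 100.0 * value / top_value


def shares(submodules, top):
    """Return {name: (area %, power %)} for each sub-module."""
    return {
        name: (percent_of_top(m["area"], top["area"]),
               percent_of_top(m["power_mw"], top["power_mw"]))
        for name, m in submodules.items()
    }


if __name__ == "__main__":
    result = shares(SUBMODULES, TOP)
    for name, (area_pct, power_pct) in result.items():
        print(f"{name}: {area_pct:9.5f} % of area, {power_pct:8.4f} % of power")
    total = sum(power_pct for _, power_pct in result.values())
    print(f"three modules together: {total:.2f} % of the total power")
```

Sorting such a table by the ratio between power share and area share is one simple way to spot modules whose consumption is out of proportion to their size.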

Reading this kind of report gives a deep understanding of the module power consumption sources and
power reduction mechanisms. Some considerations are pointed out below.
1) Three sub-modules module_1, module_2 and module_3 are responsible for ~ 96 % of the
power consumption of module_top.
2) The sub-module module_1, which consumes ~ 40 % of the total module power, has only 0.08
% of the total area. This is due to the fact that module_1 contains the instances of the two main
clock tree cells inst_cktree_17 and inst_cktree_21 and their power consumption is estimated
based on the wire-load of the two giant nets they drive, as previously discussed.
3) The percentage of power consumption due to X transitions is low. This has to be checked
because X transition consumption is estimated statistically based on the value of the variable
pwr_xpower_scale_factor, which by default is set to 0.5 (which means that a transition to or
from X is assumed to consume ½ of the power of a normal transition). The higher the
percentage of X power, the lower the accuracy of the power estimation.

Pwr top
Inst. Name  Ref. Name   Dyn Pwr   Leak Pwr   Switch Pwr  Int Pwr   Xtran Pwr  Glitch Pwr
inst_top    module_top  58.4357   0.0089712  25.0266     33.4092   0.542325   0.0095086

Pwr hierarchical modules (level 1)

Inst. Name  Ref. Name   Dyn Pwr   Leak Pwr   Switch Pwr  Int Pwr   Xtran Pwr  Glitch Pwr
inst_1      module_1    23.2055   7.813E-06  23.1304     0.07508   0          0
inst_2      module_2    17.8313   0.0024923  0.67612     17.1552   0.21778    0.0021937
inst_3      module_3    15.3235   0.0051712  1.1836      14.1399   0.316245   0.0071285
inst_4      module_4    0.508146  0.0001502  0.0027282   0.505418  7.107E-05  1.471E-05
inst_5      module_5    0.245894  0.000172   0           0.245894  0          0
inst_6      module_6    0.229791  0.0001389  0           0.229791  0          0
inst_7      module_7    0.229404  0.0001382  0           0.229404  0          0
inst_8      module_8    0.165577  5.887E-05  0.0082827   0.157294  0          0
inst_9      module_8    0.149228  4.229E-05  0.0045069   0.144721  0.0022168  5.817E-05
inst_10     module_10   0.118673  2.308E-05  0           0.118673  0          0
inst_11     module_11   0.081211  3.468E-05  0.0001038   0.081107  0          0
inst_12     module_12   0.074141  6.342E-05  0.0046356   0.069506  0.0041609  7.218E-06
inst_13     module_13   0.062686  8.714E-05  0.0001321   0.062554  1.531E-05  0
inst_14     module_14   0.052881  3.36E-05   0.0085809   0.0443    0.001779   9.754E-05
inst_15     module_15   0.042507  5.693E-05  0           0.042507  0          0
inst_16     module_16   0.022098  2.06E-05   0.0067376   0.015361  0          8.738E-06
inst_17     module_17   0.021672  2.983E-05  0.0001082   0.021564  0          0
inst_18     module_18   0.018279  6.112E-05  0.0001201   0.018159  0          0
inst_19     module_19   0.017314  4.116E-05  7.05E-05    0.017243  0          0
inst_20     module_20   0.013165  3.5E-05    0.000113    0.013052  0          0
inst_21     module_21   0.009779  3.368E-05  0.0001271   0.009652  0          0
inst_22     module_22   0.007781  4.824E-05  0           0.007781  0          0
inst_23     module_23   0.004849  2.692E-05  0           0.004849  0          0
Fig. 5: example of the report written by report_power_and_area procedure. Power
components are shown. All power values are measured in mW. Xtran power is the power
consumption of a 0 -> X, 1 -> X, X -> 0, X -> 1 transition.
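
The top level row of fig. 5 can be checked against eq. (1) and (2), and the weight of the X transition power can be quantified, with the Python sketch below. The figures are copied from fig. 4 and fig. 5. The assumption that the reported X power scales linearly with pwr_xpower_scale_factor is made here only for illustration, and the function names are hypothetical.

```python
# Top level row of fig. 5 (all values in mW) and total power from fig. 4.
TOP_ROW = {
    "dyn": 58.4357, "leak": 0.0089712,
    "switch": 25.0266, "internal": 33.4092,
    "xtran": 0.542325, "glitch": 0.0095086,
}
TOTAL_MW = 58.4447


def check_components(row, total, tol=1e-3):
    """Check eq. (2) (dyn = switch + internal) and eq. (1) (total = dyn + leak)."""
    dyn_ok = abs(row["switch"] + row["internal"] - row["dyn"]) < tol
    total_ok = abs(row["dyn"] + row["leak"] - total) < tol
    return dyn_ok, total_ok


def x_power_share(row, total):
    """X transition power as a percentage of the total power."""
    return 100.0 * row["xtran"] / total


def x_power_for_scale(row, new_scale, default_scale=0.5):
    """X power for another scale factor, assuming linear scaling (illustrative)."""
    return row["xtran"] * new_scale / default_scale


if __name__ == "__main__":
    print("eq. (2) and eq. (1) hold within rounding:", check_components(TOP_ROW, TOTAL_MW))
    print(f"X transition share: {x_power_share(TOP_ROW, TOTAL_MW):.2f} % of total power")
    print(f"X power if the scale factor were 1.0: {x_power_for_scale(TOP_ROW, 1.0):.4f} mW")
```

With an X share below 1 % of the total, even the extreme choices of the scale factor would move the total by less than 1 % under this linear assumption.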

4) The clock-tree consumption is discussed in fig. 6. Inst_cktree_17 makes use of a module auto-
gating mechanism (see par.) that for the level of activity of this scenario is able to reduce the
clock frequency from 100 MHz to ~ 30 MHz. Due to functional reasons for inst_cktree_21 it
has not been possible to implement any module level auto-gating but only the one controlled by
the central clock generation unit which is not based on activity but on a configuration register (it
is intended to reduce power in the idle modes). The higher consumption reflects this functional
limitation.
5) Clock tree power values estimated by estimate_clock_tree_power are in good agreement
with post-clock tree insertion figures. The parameters of the estimate_clock_tree_power
procedure have been tuned to obtain a good correlation between the estimated data and the
data collected from power estimations on post-layout designs (timing simulation with sdf annotation,
power analysis using post detailed routing netlist/spef).
6) The considerations in point 4) also explain the consumption of the two modules
module_3 and module_2: module_3, clocked by inst_cktree_17, has 60 % of the area and
only ~ 26 % of the power consumption, while module_2, clocked by inst_cktree_21, has 25 % of the
area and ~ 30 % of the power consumption. This is also good evidence of the efficiency of the
auto-gating mechanism implemented.
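
The comparison between module_2 and module_3 can be made more explicit by computing the power per gate of each module. The Python sketch below uses the area and power figures of fig. 4 and the measured clock frequencies of fig. 6; the helper name and the choice of nW per gate as a metric are ours and are only indicative, since the power of the clock tree root cells is accounted for in module_1.

```python
# Area (gates) and total power (mW) from fig. 4, measured clock frequency (MHz) from fig. 6.
MODULES = {
    "module_2": {"area_gates": 146449.5, "power_mw": 17.8338, "clock_mhz": 100.0178571},
    "module_3": {"area_gates": 346157, "power_mw": 15.3287, "clock_mhz": 29.96428571},
}


def power_density_nw_per_gate(module):
    """Average power per equivalent gate, in nW (1 mW = 1e6 nW)."""
    return module["power_mw"] * 1e6 / module["area_gates"]


if __name__ == "__main__":
    d2 = power_density_nw_per_gate(MODULES["module_2"])
    d3 = power_density_nw_per_gate(MODULES["module_3"])
    f_ratio = MODULES["module_2"]["clock_mhz"] / MODULES["module_3"]["clock_mhz"]
    print(f"module_2: {d2:.1f} nW/gate, module_3: {d3:.1f} nW/gate")
    print(f"power density ratio: {d2 / d3:.2f}, measured clock frequency ratio: {f_ratio:.2f}")
```

The power density ratio between the two modules is of the same order as the ratio of their measured clock frequencies, which is consistent with the explanation given in points 4) and 6).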

Cktree Inst. Name   pp Tot Pwr (mW)   Estim. Pwr (mW)   Fanout   Meas. f (MHz)
…….                 ……                ……                ……       ……
inst_cktree_1       0.00180858        0.002602476       62       3
inst_cktree_2       0.000605202       0.000431747       9        3.428571429
inst_cktree_3       0.00143007        0.001888894       35       3.857142857
inst_cktree_4       0.000586371       0.000300705       4        5
inst_cktree_5       0.000604331       0.000300705       4        5
inst_cktree_6       0.000586371       0.000300705       4        5
inst_cktree_7       0.000604908       0.000300705       4        5
inst_cktree_8       0.000604331       0.000300705       4        5
inst_cktree_9       0.000696599       0.000300705       4        5
inst_cktree_10      0.000639595       0.000300705       4        5
inst_cktree_11      0.000604522       0.000300705       4        5
inst_cktree_12      0.000586371       0.000300705       4        5
inst_cktree_13      0.000639595       0.000300705       4        5
inst_cktree_14      0.00654134        0.00997016        116      6.142857143
inst_cktree_15      0.0144159         0.014049771       132      7.607142857
inst_cktree_16      0.00328363        0.00141761        4        23.57142857
inst_cktree_17      11.4223           7.609887031       18151    29.96428571
inst_cktree_18      0.0993862         0.109155569       78       100.0178571
inst_cktree_19      0.0839546         0.090962974       65       100.0178571
inst_cktree_20      0.112329          0.124549303       89       100.0178571
inst_cktree_21      11.7173           7.822815778       5590     100.0178571
inst_cktree_22      0.0470463         0.047580633       34       100.0178571
inst_cktree_23      0.0839455         0.090962974       65       100.0178571
clkin                                 0.004886147       2        100.0178571
Total               23.64869442       15.96423403       34428
Fig. 6: example of report written by report_power_and_area procedure. Clock tree analysis,
frequency measurement and power estimation are shown. The column “pp Tot Pwr (mW)”
shows the value of the PrimePower attribute pp_total_power (in mW) for each clock root.
The column “Estim. Pwr (mW)” shows the estimated power of each clock tree after
expansion. The column “Fanout” shows the number of sequential elements which are driven
by the clock root in the pre-layout netlist. The column “Meas. f (MHz)” shows the measured
frequency.

Fig. 7 shows the instantaneous power behavior.

Fig. 7: example of instantaneous power diagrams

This picture shows why the combination of module level and register level gating is efficient. A
register level gating scheme has been implemented for both modules, which reduces the average
level of power consumption. Module A also uses module level gating, which creates the “holes” in
the power consumption (only leakage remains), thus dramatically reducing the average
consumption.
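
The effect of the “holes” on the average can be illustrated with a simple two-state model: the module draws its active power while its clock is running and only leakage while it is gated at module level. The Python sketch below is an illustration under stated assumptions; the power values and the 40 % active fraction are hypothetical and are not read from fig. 7.

```python
def average_power(active_mw: float, leakage_mw: float, active_fraction: float) -> float:
    """Time-averaged power of a module that is either active or fully clock-gated."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return active_fraction * active_mw + (1.0 - active_fraction) * leakage_mw

# Hypothetical numbers, for illustration only.
register_gating_only = average_power(active_mw=10.0, leakage_mw=0.1, active_fraction=1.0)
with_module_gating = average_power(active_mw=10.0, leakage_mw=0.1, active_fraction=0.4)

print(f"register gating only: {register_gating_only:.2f} mW")
print(f"with module gating:   {with_module_gating:.2f} mW")
```

In this model the average scales almost linearly with the fraction of time the module clock is running, because leakage is small compared with the active power.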

7.0 Application of Power Compiler to perform power optimization in a critical subsystem

As explained in par. 3, in the TI flow clock gating techniques are already applied at the
architectural and rtl coding level. Nevertheless, Power-Compiler has proved to be an effective
solution, especially attractive when a fast register gating implementation is needed for an
externally coded IP. Interesting considerations about clock gating implementation with
Power-Compiler can be found in reference [4]. The following discussion focuses on one specific
aspect that must be controlled in order to have the tool implement robust, fast and efficient
solutions.
To avoid discovering problems caused by the gating implementation only at the very late stage of
clock tree insertion, this step of the flow must be prevented from creating timing violations
(clock gating setup check) on the enable of the gating cell. As shown in fig. 8, clock tree
buffers create a delay between the output of the clock gating cell and the clock input of the
sequential elements. The more sequential elements are driven by one gating cell, the higher the
number of clock tree buffers and the higher the delay. If this delay is not seen by synthesis,
the logic that generates the enable can be poorly optimized, leading to timing violations after
clock tree implementation. The solution is to set a suitable latency so that the clock edge at
the gating cell pin arrives earlier than the clock edge at the flip-flop pins; synthesis then
sees the correct constraint and properly optimizes the enable generation logic. Nevertheless, if
the enable function is so complex that even a good optimization generates an excessive logic
depth, synthesis might be unable to meet timing: in these cases we prevent the sequential
elements from being gated.
The following flow can be used:
- Perform a first analyze-elaborate, letting Power-Compiler gate every sequential element that
the tool is able to gate.
- Set the gating parameters and a suitable latency value.
- Perform compile and generate timing reports through the enable of all the gating elements
that have been instantiated by Power-Compiler.
- Generate a list of sequential elements to be excluded from gating through the use of the
Power-Compiler set_clock_gating_signals -exclude command.
- Re-run the flow from elaborate and check that timing is met for all gating elements.
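
The selection step of this flow can be sketched as a filter on the enable slack reported after compile. The Python example below is only an illustration under stated assumptions: the cell and register names, the slack values and the helper registers_to_exclude are hypothetical. In the real flow the slacks come from the timing reports and the resulting list is passed to the exclusion command.

```python
from typing import Dict, List, Tuple

def registers_to_exclude(gating_cells: Dict[str, Tuple[float, List[str]]],
                         margin_ns: float) -> List[str]:
    """Collect the registers behind every gating cell whose enable slack is below margin_ns."""
    excluded: List[str] = []
    for _cell, (slack_ns, registers) in sorted(gating_cells.items()):
        if slack_ns < margin_ns:
            excluded.extend(registers)
    return excluded

# Hypothetical data: gating cell -> (enable slack in ns, registers it gates).
gating_cells = {
    "cg_a": (1.8, ["a_reg[0]", "a_reg[1]"]),
    "cg_b": (0.4, ["b_reg[0]"]),
    "cg_c": (-0.2, ["c_reg[0]", "c_reg[1]"]),
}

print(registers_to_exclude(gating_cells, margin_ns=1.0))  # registers lost with a 1 ns margin
print(registers_to_exclude(gating_cells, margin_ns=0.0))  # registers lost with no margin
```

A larger margin moves more registers to the exclusion list, which is the trade-off between robustness and power saving discussed in the rest of this section.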

Fig. 8: impact of clock tree insertion on clock gating setup check.

This choice has the disadvantage of reducing the power saving, because fewer elements are gated.
Fig. 9 shows the results obtained for a module of ~ 350000 gates (~ 22400 flip-flop instances).

num. of gated flip-flops   ff gated/total (%)   margin (ns)
13673                      61.04017857          0
13299                      59.37053571          0.5
13117                      58.55803571          0.7
12923                      57.69196429          1
12715                      56.76339286          1.5
12467                      55.65625             2
Fig. 9: number of gated flip-flops versus timing margin on the enable of the gating cell
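
The percentages in fig. 9 are consistent with a total of 22400 flip-flop instances. The Python sketch below recomputes them and shows how many gated flip-flops are given up as the margin grows; the variable names are illustrative.

```python
TOTAL_FLIP_FLOPS = 22400  # ~ 22400 flip-flop instances, as stated in the text

# (margin in ns, number of gated flip-flops), taken from fig. 9
results = [(0.0, 13673), (0.5, 13299), (0.7, 13117),
           (1.0, 12923), (1.5, 12715), (2.0, 12467)]

baseline = results[0][1]  # gated flip-flops with no margin
for margin_ns, gated in results:
    gated_pct = 100.0 * gated / TOTAL_FLIP_FLOPS
    lost = baseline - gated
    print(f"margin {margin_ns:.1f} ns: {gated_pct:.2f} % gated, "
          f"{lost} fewer than with no margin")
```

Going from no margin to a 1 ns margin gives up 750 gated flip-flops, about 3.3 % of all flip-flops in the module.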

Once the maximum number of sequential elements controlled by one gating cell has been chosen,
the best compromise between timing margin and number of gated flip-flops depends on the clock
buffer speed and on the clock tree depth, and therefore on the technology and on the clock tree
implementation flow.
For the module mentioned above a maximum value of 128 was selected, which means 2-3 levels of
clock tree buffers; this allowed a safe choice of 1 ns timing margin with ~ 57.7 % of the
flip-flops gated.

8.0 Example of results

The power reduction made possible by the use of Power-Compiler on module_3 is shown in the
following table.

module      Power consumption without      Power consumption with       Gain %
            register clock gating (mW)     register clock gating (mW)
Module_3    21.50                          15.33                        28.7
Fig. 10: example of power consumption reduction due to use of Power-Compiler register
gating
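
The gain in fig. 10 is the reduction relative to the consumption without register clock gating. A minimal Python check of the figure (the function name is illustrative):

```python
def power_gain_percent(without_gating_mw: float, with_gating_mw: float) -> float:
    """Relative power reduction, in percent of the consumption without gating."""
    return 100.0 * (without_gating_mw - with_gating_mw) / without_gating_mw

gain = power_gain_percent(without_gating_mw=21.50, with_gating_mw=15.33)
print(f"{gain:.1f} %")  # prints "28.7 %"
```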

9.0 Conclusions and Recommendations

This paper shows the methodology used at TI Nice to estimate and reduce dynamic power in a
system on chip implementation for mobile processor applications.

The main focus is on the adoption of PrimePower to estimate power consumption and to check the
correctness and efficiency of the gating structures implemented.

An original and effective methodology has been created in order to significantly improve the
accuracy of pre-clock tree insertion power estimations. This is based on a topological algorithm that
detects all the clock roots and on a set of procedures that measure clock frequency, estimate the
structure of each clock tree depending on its fanout and estimate clock tree power consumption.
The parameters used by these procedures come from a database where statistical information
collected after clock tree insertion is stored.

A new kind of report has been created, automatically written in Excel format, in which area,
power and frequency information is stored for a programmable number of hierarchical levels. This
provides deep and readable information that makes it possible to identify candidates for power
reduction.

Other extensions to PrimePower capabilities are the procedures that create dedicated reports
(e.g. memory consumption) and statistical reports.

Power-Compiler register gating has proved to be effective, in particular in cases when a fast
solution is needed. A key factor that affects the robustness of this solution inside the
complete ASIC flow, including clock tree implementation, is the timing margin on the enable of
the gating cell. Ensuring a good margin, if necessary by excluding from gating the sequential
elements for which this is not possible, guarantees a safe implementation. More extensive usage
of Power-Compiler capabilities will be the object of future studies.

10.0 Acknowledgements

I would like to thank all the TI colleagues who provided me with a huge amount of interesting
data for power estimation. Special thanks to Eric Bouet for his continuous and accurate support
and for suggesting the idea of writing this paper.

11.0 References

[1] “A multi-level approach to low-power IC design”
IEEE Spectrum (vol.35, num.2, feb. 1998)
J. Frenkil
[2] “Overcoming Power Compiler limitations to optimize clock gating”
SNUG Europe 2003
S. Haas
[3] Power Compiler User Guide
Synopsys
[4] Prime Power User Guide
Synopsys

12.0 Appendix 1

List of the pictures:


Fig. 1: module level and register level gating schemes
Fig. 2: primepower flow
Fig. 3: flow for the creation of module level netlist/sdf/capacitance file from chip level netlist/spef
Fig. 4: example of report written by report_power_and_area procedure. Area versus total power
results are shown.
Fig. 5: example of the report written by report_power_and_area procedure. Power components are
shown.
Fig. 6: example of report written by report_power_and_area procedure. Clock tree analysis,
frequency measurement and power estimation are shown
Fig. 7: example of instantaneous power diagrams
Fig. 8: impact of clock tree insertion on clock gating setup check.
Fig. 9: number of gated flip-flops versus timing margin on the enable of the gating cell
Fig. 10: example of power consumption reduction due to use of Power-Compiler register gating

13.0 Appendix 2

This appendix provides more details about the clock tree estimation methodology described
previously.
A high-level description of the algorithm that detects the clock roots follows.

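# High-level sketch: get_driver, get_ref_name and the is_* predicates are user procedures.
# get_input_pin is an assumed helper that returns the input pin of the cell owning a pin.
set coll_clock_pins [all_registers -clock_pins]

foreach_in_collection clock_pin $coll_clock_pins {
    # Start from the pin that drives the register clock pin.
    set driver_pin_hier_name [get_driver $clock_pin]
    set driver_cell [get_ref_name $driver_pin_hier_name]
    # Walk back through buffers, inverters and transparent cells.
    while {[is_a_buffer $driver_cell] || [is_an_inverter $driver_cell]
           || [is_transparent $driver_cell]} {
        set driver_pin_hier_name [get_driver [get_input_pin $driver_pin_hier_name]]
        set driver_cell [get_ref_name $driver_pin_hier_name]
    }
    # The first driver that cannot be traversed is a clock root: add one to its fanout.
    if {![info exists a_clock_roots($driver_pin_hier_name)]} {
        set a_clock_roots($driver_pin_hier_name) 0
    }
    incr a_clock_roots($driver_pin_hier_name)
}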

The variable a_clock_roots has the hierarchical names of the clock root pins as keys and their
fanout as values. The procedure is_transparent is based on the set_pass_through_pin procedure,
which sets the property “pass_through_pin” on a hierarchical pin name. This can be used to trace
clock paths back to their first source, going through gates such as multiplexers or integrated
gating cells. The procedure is_a_buffer is based on a list of buffers that can be customized
depending on the purpose of the clock root detection. When clock roots are detected in order to
estimate power consumption, normally no transparent pins are defined and the clock tree cells
are not part of the list of buffers. This way, the clock path is decomposed into a series of
levels of subtrees, each one with its own fanout information. Once the clock structure is known,
the frequency of each root is calculated from the toggle number and the simulation duration. At
this point the estimate_clock_tree_power procedure has all the information needed to estimate
the power consumption of each subtree of the clock network.
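
As an illustration of the two steps described above, the following Python sketch traces each register clock pin back to its root on a toy netlist and then converts a toggle count into a frequency. It is a sketch under stated assumptions, not the PrimePower procedures themselves: the netlist dictionary, the cell names and the function names are hypothetical, and the frequency formula assumes that the toggle number counts both rising and falling edges.

```python
from collections import Counter
from typing import Dict, Set

def find_clock_roots(driver_of: Dict[str, str], clock_pins: Set[str],
                     traversable: Set[str]) -> Counter:
    """Count, for each clock root, the register clock pins that it drives.

    driver_of maps a register clock pin or a traversable cell to whatever drives it;
    traversable holds the cells (buffers, inverters, ...) to walk back through.
    """
    roots: Counter = Counter()
    for pin in clock_pins:
        driver = driver_of[pin]
        while driver in traversable:
            driver = driver_of[driver]
        roots[driver] += 1
    return roots

def clock_frequency_mhz(toggle_count: int, duration_us: float) -> float:
    """Frequency in MHz, assuming two toggles (rise and fall) per clock period."""
    return toggle_count / (2.0 * duration_us)

# Toy netlist: two registers behind a buffer on clk_a, one register directly on clk_b.
driver_of = {"ff1/CK": "buf1", "ff2/CK": "buf1", "buf1": "clk_a", "ff3/CK": "clk_b"}
roots = find_clock_roots(driver_of, {"ff1/CK", "ff2/CK", "ff3/CK"}, traversable={"buf1"})
print(dict(roots))

# Illustrative values only: 5601 toggles in 28 us correspond to about 100.02 MHz.
print(f"{clock_frequency_mhz(5601, 28.0):.4f} MHz")
```

Changing the contents of the traversable set plays the role of customizing the list of buffers: with an empty set every driver is reported as the root of its own subtree, which corresponds to the decomposition into subtree levels used for power estimation.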
