Power Analysis Methodology and Objectives for TI Wireless Platform
Wireless Platforms
Dario Cardini
ABSTRACT
Power consumption is today a key performance aspect which must be carefully controlled and
optimized during system on chip implementation for many applications, in order to minimize IR drop,
electromagnetic interference (EMI) and package dissipation, to avoid shortened battery life, and to
allow cost reduction solutions (e.g. the ability to choose a lower cost package).
A complex power reduction strategy, able to minimize both dynamic and leakage components, is
required for mobile phones, whose evolution is challenged by the need to improve performance
while minimizing power consumption. This constraint is becoming more and more important for the
most advanced Texas Instruments wireless platforms, which will provide their customers with the
capability to play three-dimensional games, access the internet and watch good quality TV on a mobile
phone.
The purpose of this paper is to show how PrimePower and DesignPower are used as part of the
methodology adopted at TI Nice to estimate and reduce dynamic power in a wireless system on a chip.
1.0 Introduction
This paper covers the methodology used at TI Nice to estimate and reduce dynamic power in a
system on chip implementation for mobile processor applications.
The first part covers power analysis with PrimePower. The main objectives of the hierarchical
power analysis are:
. detect potential candidates for power reduction
. detect potential problems
. verify the relevance of the power reduction mechanism implemented
. estimate the power consumption for each scenario
The second part of this paper covers the power optimization flow at TI. The main objectives of
the power reduction mechanism are:
. find quick power reduction solutions for critical modules
. explore the possibility to implement an automatic power reduction flow
A detailed discussion of the theory of power consumption in integrated circuits is outside the
scope of this work. Nevertheless, some important concepts are summarized in this paragraph in
order to make the paper understandable also to readers who are not specialized in this field.
The components of IC power are shown in eq. (1).
PTOT = PDYN + PLEAK (1)
Where
PTOT = total power
PDYN = dynamic power, dissipated when the circuit is active. This means every time the voltage on a
net changes due to an input stimulus.
PLEAK = leakage power, dissipated when a gate is not switching. Leakage is caused by parasitic
effects, like reduced thresholds that prevent gates from completely turning off, thus generating source
to drain leakage currents.
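Dynamic power components are shown in eq. (2).
PDYN = PSW + PINT (2)
Where
PSW = switching power. This is the power dissipated by charging or discharging the load capacitance
at the output of the cell.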
PINT = internal power. This is the power dissipated by charging or discharging any capacitance
internal to the cell. It includes short circuit power, which is the power dissipated during transients in
the short phase when both P and N type transistors are on simultaneously and current flows from
Vdd to Gnd.
Reducing chip power consumption means reducing one or more of its components.
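As a quick numerical illustration, the sketch below assumes the usual additive decomposition (total = dynamic + leakage, dynamic = switching + internal) and checks it against the top-level figures reported later in this paper for inst_top. The class and field names are hypothetical, and the values are used as reported, without assuming a unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerBreakdown:
    """Power components of one instance (hypothetical container)."""
    switching: float  # charging/discharging of load capacitance
    internal: float   # cell-internal power, including short circuit power
    leakage: float    # dissipated when the gate is not switching

    @property
    def dynamic(self) -> float:
        # Assumed form of eq. (2): dynamic = switching + internal
        return self.switching + self.internal

    @property
    def total(self) -> float:
        # Assumed form of eq. (1): total = dynamic + leakage
        return self.dynamic + self.leakage


# Values for inst_top copied from the report excerpt in this paper.
inst_top = PowerBreakdown(switching=25.0266, internal=33.4092, leakage=0.0089712)
reported_dynamic = 58.4357

print(f"computed dynamic power: {inst_top.dynamic:.4f}")
print(f"reported dynamic power: {reported_dynamic:.4f}")
print(f"leakage share of total: {inst_top.leakage / inst_top.total:.4%}")
```

The computed and reported dynamic power agree to within rounding of the report columns, which supports the assumed decomposition for this scenario.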
Even if dynamic power is the predominant consumption factor during normal operation, the
reduction of leakage power is of key importance for battery powered applications. The
advanced techniques developed at TI for this purpose are not discussed in this paper, because they
are radically different from the ones used to minimize dynamic power and even a short
introduction to them would require too much space.
Several techniques have been developed in order to reduce dynamic power. They can be grouped
into the following categories:
. Voltage reduction
. Capacitance reduction
. Frequency/gate activity reduction
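The three categories map onto the textbook first-order model of CMOS switching power, P = a * C * V^2 * f (a: switching activity per clock cycle, C: switched capacitance, V: supply voltage, f: clock frequency). The sketch below is that approximation only, not the model used by PrimePower, and every number in it is an illustrative assumption rather than TI data.

```python
def switching_power(activity: float, cap_farads: float,
                    vdd_volts: float, freq_hz: float) -> float:
    """First-order CMOS switching power in watts: P = a * C * V^2 * f."""
    return activity * cap_farads * vdd_volts ** 2 * freq_hz


# Illustrative baseline (assumed values).
base = dict(activity=0.2, cap_farads=1e-9, vdd_volts=1.2, freq_hz=100e6)
p_base = switching_power(**base)

# Apply one reduction lever at a time.
scenarios = {
    "baseline": base,
    "voltage 1.2 V -> 1.0 V": {**base, "vdd_volts": 1.0},
    "capacitance -20 %": {**base, "cap_farads": 0.8e-9},
    "activity halved": {**base, "activity": 0.1},
}

results = {name: switching_power(**args) for name, args in scenarios.items()}
for name, power in results.items():
    print(f"{name:24s} {power * 1e3:7.3f} mW ({power / p_base:.0%} of baseline)")
```

Because the voltage term is quadratic, a supply reduction gives a larger relative saving than the same relative reduction in capacitance or activity.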
These techniques work at different levels of abstraction. Capacitance and voltage reduction are
mainly achieved at the process level, through the development of logic cell libraries designed for low
power applications. Frequency, or more generally gate activity reduction methodologies, can be
applied during different phases of the ASIC design flow, (architectural specification, RTL coding,
implementation). The last part of this paragraph focuses on these techniques.
1) Module level clock gating: all the clocks of a module are stopped during inactivity periods,
thus switching off the whole clock trees at module level. The capability to switch off the input
clocks of a module can be controlled statically or dynamically. Static control is aimed at
reducing power consumption in the module idle modes: if, for example, a peripheral is not used,
it makes sense to switch off its clock completely. Dynamic control is aimed at reducing power
consumption in functional modes: the control logic monitors the activity of the module and
switches the clocks off when a condition of no activity is detected. When the start of activity
is detected, the clocks are reactivated. Normally a counter with a programmable threshold is
implemented, so that the clocks are switched off only after a determined time has elapsed since
the inactivity condition was detected. Module level clock gating is specified at architectural
level and can be handled at RTL coding level and/or at synthesis level with Power Compiler or
other commercial tools. An important factor that
affects its efficiency is the position in the hierarchy where the gating structure is instantiated. If the
centralized clock generator contains the clock gating structure, it is possible to prevent the whole
clock tree from toggling when it is not needed, including the branch that goes from the clock
generator to the peripheral. If the gating cell is instantiated inside the module, it can only stop the
branch of the clock tree internal to the module.
3) Operand isolation: this technique inserts isolation logic in order to prevent the inputs of data-path
operators from switching when their output is not used. This happens for example when the
output is connected to the input of a register whose clock is gated. Operand isolation is suitable
for computation structures like multipliers and complex adders. After inserting isolation logic, it
must be checked that the timing performance is still acceptable. Operand isolation can be applied
at RTL coding level and/or at synthesis level with Power Compiler.
4) Gate level power optimization: this technique is used at implementation level and it is based on
two synthesis compile phases, the first one performing the traditional timing and area optimization,
the second one performing timing, power and area optimization (leakage and dynamic power).
It is recommended to use switching activity information back-annotated from simulation, so that
the power optimization reflects the real activity of the nets.
This paragraph shows the importance of dynamic power consumption reduction methodologies in
the flow adopted at TI Nice for the development of a mobile processor chip. The principles of the
hierarchical power estimation flow are explained.
Most of the module level power reduction techniques are also used at chip level (architecture, RTL
coding, synthesis). Power scenarios are defined for the main chip applications and they are used to
estimate power consumption based on simulation activity information.
Running full chip gate-level simulations which are able to represent the application behavior
accurately is often impractical due to the excessively long run times. However, it is of major
importance to estimate the chip level power consumption in the idle modes, in order to detect and
solve problems in the power reduction structures and to know the power figures in those modes,
which often set limits to battery life. The analysis at chip level for those scenarios follows the same
strategy used at module level.
In the case of post-clock tree insertion module level analysis, an additional flow is needed to
generate module level netlist, sdf and capacitance information from chip level netlist and spef file.
This is depicted in fig. 3.
Fig. 3: flow for the creation of module level netlist/sdf/capacitance file from chip level
netlist/spef
In this paragraph some examples of the power estimation results we obtained are shown.
Fig. 4, 5 and 6 show the parts of a report (based on the post-synthesis netlist of a 600 Kgates module
and a vcd coming from a unit-delay normal activity simulation) written by the procedure
report_power_and_area.
By reading this kind of report, a deep understanding of the module power consumption sources and
power reduction mechanisms can be reached. Some considerations are pointed out in the following.
1) Three sub-modules module_1, module_2 and module_3 are responsible for ~ 96 % of the
power consumption of module_top.
2) The sub-module module_1, which consumes ~ 40 % of the total module power, has only 0.08
% of the total area. This is due to the fact that module_1 contains the instances of the two main
clock tree cells inst_cktree_17 and inst_cktree_21 and their power consumption is estimated
based on the wire-load of the two giant nets they drive, as previously discussed.
3) The percentage of power consumption due to X transitions is low. This has to be checked
because X transition consumption is estimated statistically based on the value of the variable
pwr_xpower_scale_factor which by default is set to 0.5, which means that a transition to and
Pwr top
Inst. Name   Ref. Name    Dyn Pwr   Leak Pwr    Switch Pwr   Int Pwr   Xtran Pwr   Glitch Pwr
inst_top     module_top   58.4357   0.0089712   25.0266      33.4092   0.542325    0.0095086
4) The clock-tree consumption is discussed in fig. 6. The clock tree inst_cktree_17 makes use of a
module auto-gating mechanism (see par.) that, for the level of activity of this scenario, is able to
reduce the clock frequency from 100 MHz to ~ 30 MHz. For functional reasons it has not been
possible to implement any module level auto-gating for inst_cktree_21, but only the gating controlled by
the central clock generation unit, which is not based on activity but on a configuration register (it
is intended to reduce power in the idle modes). The higher consumption reflects this functional
limitation.
5) Clock tree power values estimated by estimate_clock_tree_power are in good agreement
with post-clock tree insertion figures. The parameters of the estimate_clock_tree_power
procedure have been tuned to obtain good correlation between the estimated data and the
data collected by estimating power on post-layout designs (timing simulation with sdf annotation,
power analysis using post detailed routing netlist/spef).
6) The considerations at point 4) also explain the consumption of the two modules
module_3 and module_2: module_3, clocked by inst_cktree_17, has 60 % of the area and
only ~ 25 % of the power consumption. module_2, clocked by inst_cktree_21 has 25 % of the
Cktree Inst. Name   pp Tot Pwr (mW)   Estim. Pwr (mW)   Fanout   Meas. f (MHz)
…….                 ……                ……                ……       ……
inst_cktree_1       0.00180858        0.002602476       62       3
inst_cktree_2       0.000605202       0.000431747       9        3.428571429
inst_cktree_3       0.00143007        0.001888894       35       3.857142857
inst_cktree_4       0.000586371       0.000300705       4        5
inst_cktree_5       0.000604331       0.000300705       4        5
inst_cktree_6       0.000586371       0.000300705       4        5
inst_cktree_7       0.000604908       0.000300705       4        5
inst_cktree_8       0.000604331       0.000300705       4        5
inst_cktree_9       0.000696599       0.000300705       4        5
inst_cktree_10      0.000639595       0.000300705       4        5
inst_cktree_11      0.000604522       0.000300705       4        5
inst_cktree_12      0.000586371       0.000300705       4        5
inst_cktree_13      0.000639595       0.000300705       4        5
inst_cktree_14      0.00654134        0.00997016        116      6.142857143
inst_cktree_15      0.0144159         0.014049771       132      7.607142857
inst_cktree_16      0.00328363        0.00141761        4        23.57142857
inst_cktree_17      11.4223           7.609887031       18151    29.96428571
inst_cktree_18      0.0993862         0.109155569       78       100.0178571
inst_cktree_19      0.0839546         0.090962974       65       100.0178571
inst_cktree_20      0.112329          0.124549303       89       100.0178571
inst_cktree_21      11.7173           7.822815778       5590     100.0178571
inst_cktree_22      0.0470463         0.047580633       34       100.0178571
inst_cktree_23      0.0839455         0.090962974       65       100.0178571
clkin                                 0.004886147       2        100.0178571
Total               23.64869442       15.96423403       34428
Fig. 6: example of report written by report_power_and_area procedure. Clock tree analysis,
frequency measurement and power estimation are shown. The column “pp Tot Pwr (mW)”
shows the value of the PrimePower attribute pp_total_power (in mW) for each clock root.
The column “Estim. Pwr (mW)” shows the estimated power of each clock tree after
expansion. The column “Fanout” shows the number of sequential elements which are driven
by the clock root in the pre-layout netlist. The column “Meas. f (MHz)” shows the measured
frequency.
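One way to read the figures for the two large clock trees is to normalize the PrimePower total by fanout and compare the result with the measured frequency. The sketch below uses only values copied from fig. 6; the normalization itself is an illustration and is not part of report_power_and_area.

```python
# (pp total power in mW, fanout, measured frequency in MHz), copied from fig. 6.
CLOCK_ROOTS = {
    "inst_cktree_17": (11.4223, 18151, 29.96428571),
    "inst_cktree_21": (11.7173, 5590, 100.0178571),
}


def power_per_flop_uw(total_mw: float, fanout: int) -> float:
    """Clock tree power per driven sequential element, in microwatts."""
    return total_mw * 1000.0 / fanout


p17, n17, f17 = CLOCK_ROOTS["inst_cktree_17"]
p21, n21, f21 = CLOCK_ROOTS["inst_cktree_21"]

ratio_power = power_per_flop_uw(p21, n21) / power_per_flop_uw(p17, n17)
ratio_freq = f21 / f17

print(f"power per flop, inst_cktree_17: {power_per_flop_uw(p17, n17):.3f} uW")
print(f"power per flop, inst_cktree_21: {power_per_flop_uw(p21, n21):.3f} uW")
print(f"per-flop power ratio: {ratio_power:.2f}")
print(f"measured frequency ratio: {ratio_freq:.2f}")
```

In this data the per-flop power ratio and the measured frequency ratio are both about 3.3, which is consistent with the auto-gating of inst_cktree_17 accounting for most of the difference between the two trees.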
As explained in par. 3, in the TI flow clock gating techniques are already applied at architectural
and RTL coding level. Nevertheless, Power-Compiler has proved to be an effective solution, especially
attractive when a fast register gating implementation is needed for an externally coded IP. Interesting
considerations about clock gating implementation with Power-Compiler can be found in reference
[4]. In the following, the discussion is focused on a specific aspect that is important to control in
order to have the tool implement robust, fast and efficient solutions.
In order to avoid discovering problems caused by the gating implementation only at the very late
stage of clock tree insertion, it is mandatory to prevent this step of the flow from creating timing
violations (clock gating setup check) on the enable of the gating cell. As shown in fig. 8, clock tree
buffers create a delay between the output of the clock gating cell and the clock input of the sequential
elements. The more sequential elements are driven by one gating cell, the higher the number
of clock tree buffers and the higher the delay. If this delay is not seen by synthesis, the logic that
generates the enable can be poorly optimized, thus leading to timing violations after clock tree
implementation. The solution is to set a suitable latency so that the clock edge at the gating cell pin
arrives earlier than the clock edge at the flip-flop pins, so that synthesis sees the correct
constraint and properly optimizes the enable generation logic. Nevertheless, if the enable
function is so complex that even a good optimization results in an excessive logic depth, synthesis
might be unable to meet timing: in these cases we prevent the sequential elements from being gated.
The following flow can be used:
- Perform first analyze-elaborate letting Power-Compiler gate every sequential element that
the tool is able to gate.
- Set gating parameters and suitable latency value
- Perform compile and generate timing reports through the enable of all the gating elements
that have been instantiated by Power-Compiler.
- Generate a list of sequential elements to be excluded from gating through the use of Power-
Compiler set_clock_gating_signals -exclude command.
- Re-run the flow from elaborate and check that timing is met for all gating elements.
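The exclusion step of this flow can be sketched as a small filter over the timing results. The sketch assumes that the slack on each gating cell enable and the registers driven by each gating cell have already been extracted from the timing reports into dictionaries; the function name, the data layout and the instance names are hypothetical, and the resulting list would still have to be passed to the tool by the user's own scripts.

```python
from typing import Dict, List


def registers_to_exclude(
    enable_slack_ns: Dict[str, float],
    gated_registers: Dict[str, List[str]],
    required_margin_ns: float = 0.0,
) -> List[str]:
    """Return registers whose gating cell enable path has too little slack."""
    excluded: List[str] = []
    for gating_cell in sorted(enable_slack_ns):
        if enable_slack_ns[gating_cell] < required_margin_ns:
            # Every register behind a failing gating cell is left ungated.
            excluded.extend(gated_registers.get(gating_cell, []))
    return excluded


# Hypothetical data standing in for parsed timing reports.
slack = {"cg_0": 0.42, "cg_1": -0.18, "cg_2": 0.05}
gated = {
    "cg_0": ["u_core/r0", "u_core/r1"],
    "cg_1": ["u_dma/cnt_0", "u_dma/cnt_1"],
    "cg_2": ["u_uart/st_0"],
}

exclude_list = registers_to_exclude(slack, gated)
exclude_with_margin = registers_to_exclude(slack, gated, required_margin_ns=0.1)
print(exclude_list)
print(exclude_with_margin)
```

Requiring a positive margin instead of zero slack excludes more registers, so it makes the result more robust to clock tree insertion at the cost of a smaller power saving.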
This choice has the disadvantage of reducing the power saving, because fewer elements
are gated. In fig. 9 the results obtained for a module of ~ 350000 gates (~ 22400 flip-flop instances)
are shown.
The choice of the maximum number of sequential elements controlled by one gating cell is a
compromise that depends on clock buffer speed and clock tree depth, and therefore on
the technology and on the clock tree implementation flow.
In the case of the module previously mentioned a maximum value of 128, which corresponds to 2-3
levels of clock tree buffers, was selected, allowing a safe choice of 1 ns timing margin with 57 % of
flip-flops gated.
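The relation between the maximum fanout of a gating cell and the number of buffer levels can be illustrated with a simple balanced-tree model. The sketch assumes that each buffer drives at most a fixed number of loads; the per-buffer fanout limits of 8 and 16 and the 0.15 ns buffer delay are illustrative assumptions, not figures from the TI flow.

```python
def buffer_levels(fanout: int, max_loads_per_buffer: int) -> int:
    """Smallest number of levels L such that max_loads_per_buffer**L >= fanout."""
    if fanout < 1 or max_loads_per_buffer < 2:
        raise ValueError("fanout must be >= 1 and max_loads_per_buffer >= 2")
    levels, reach = 0, 1
    while reach < fanout:
        reach *= max_loads_per_buffer
        levels += 1
    return levels


def gating_latency_ns(fanout: int, max_loads_per_buffer: int,
                      buffer_delay_ns: float) -> float:
    """Estimated delay from the gating cell output to the flip-flop clock pins."""
    return buffer_levels(fanout, max_loads_per_buffer) * buffer_delay_ns


for loads in (8, 16):
    levels = buffer_levels(128, loads)
    latency = gating_latency_ns(128, loads, buffer_delay_ns=0.15)
    print(f"128 flip-flops, {loads} loads per buffer: "
          f"{levels} levels, ~{latency:.2f} ns")
```

Under these assumptions a limit of 128 flip-flops per gating cell gives 2 or 3 buffer levels, and the estimated delay is the kind of value that would be used as the latency offset seen by synthesis.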
The power reduction made possible by the use of Power-Compiler on module_3 is shown in the
following table.
This paper shows the methodology used at TI Nice to estimate and reduce dynamic power in a
system on chip implementation for mobile processor applications.
The main focus is on the adoption of PrimePower for power consumption estimation and for
checking the correctness and efficiency of the gating structures implemented.
An original and effective methodology has been created in order to significantly improve the
accuracy of pre-clock tree insertion power estimations. This is based on a topological algorithm that
detects all the clock roots and on a set of procedures that measure clock frequency, estimate the
structure of each clock tree depending on its fanout and estimate clock tree power consumption.
The parameters used by these procedures come from a database where post-clock tree insertion
statistical information is stored.
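The clock root detection step can be sketched as a backward traversal from each flip-flop clock pin through transparent cells until a non-transparent driver is reached. This is one possible formulation, not the procedures used at TI: the netlist model (one driver per cell, acyclic clock network), the cell kinds and the helper names are assumptions.

```python
from collections import defaultdict
from typing import Dict, NamedTuple, Optional


class Cell(NamedTuple):
    kind: str              # "ff", "buf", "inv", "gate", "port", ...
    driver: Optional[str]  # cell driving this cell's clock/input pin


TRANSPARENT_KINDS = {"buf", "inv"}


def is_transparent(cell: Cell) -> bool:
    """Cells the traversal passes through (stand-in for the real check)."""
    return cell.kind in TRANSPARENT_KINDS


def find_clock_roots(cells: Dict[str, Cell]) -> Dict[str, int]:
    """Map each clock root to the number of flip-flops it drives."""
    roots: Dict[str, int] = defaultdict(int)
    for cell in cells.values():
        if cell.kind != "ff" or cell.driver is None:
            continue
        current = cell.driver
        # Walk back through buffers and inverters to the first real source.
        while is_transparent(cells[current]) and cells[current].driver is not None:
            current = cells[current].driver
        roots[current] += 1
    return dict(roots)


# Hypothetical netlist: one gated branch and one flip-flop on the raw clock.
netlist = {
    "clkin": Cell("port", None),
    "gate_a": Cell("gate", "clkin"),
    "buf_1": Cell("buf", "gate_a"),
    "buf_2": Cell("buf", "buf_1"),
    "ff_0": Cell("ff", "buf_2"),
    "ff_1": Cell("ff", "buf_1"),
    "ff_2": Cell("ff", "clkin"),
}

clock_roots = find_clock_roots(netlist)
print(clock_roots)
```

In this model a clock gating cell stops the traversal, so each gating cell becomes its own root with its own fanout, which matches the per-root rows of the clock tree report.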
A new kind of report has been created, automatically written in Excel format, where area, power and
frequency information is stored for a programmable number of hierarchical levels. This provides very
deep and readable information that makes it possible to identify candidates for power reduction.
Other extensions to PrimePower capabilities are the procedures to create dedicated reports (e.g.
memory consumption) and statistical reports.
2.0 Acknowledgements
I would like to thank all the TI colleagues who provided me with a huge amount of interesting data for
power estimation. Special thanks to Eric Bouet for his continuous and accurate support and for
suggesting the idea of writing this paper.
3.0 References
4.0 Appendix 1
5.0 Appendix 2
This appendix provides some more details about the clock tree estimation methodology previously
described.
In the following a high-level description of the algorithm to detect clock roots can be found.
The variable a_clock_roots has as keys the hierarchical names of the clock root pins and as values
their fanout. The procedure is_transparent is based on the set_pass_through_pin procedure that sets