Power Management Techniques for Soft IP

Peter Greenhalgh
CPU Group
ARM
[email protected]

ABSTRACT
Both dynamic power and static leakage power are becoming significant design issues in
130nm and 90nm technology processes. High power consumption not only reduces
battery life in mobile devices, but also requires more costly IC packaging to deal with
heat dissipation. Designers are required to consider power management techniques
throughout the design flow. However, power management in soft IP presents a
difficult challenge as any power savings must be compatible across a range of target
process technologies and design tool flows.

This paper covers design techniques that can be used in soft IP (using the
ARM1176JZF-S™ and ARM926EJ-S™ processors as examples) to manage dynamic
and static power while remaining compatible with current tool flows and multiple
technologies. For dynamic power reduction, RTL clock gating, architectural clock
gating and Dynamic Voltage and Frequency Scaling are examined, while static power
consumption is addressed with a Dormant Power Mode.

SNUG Europe 2004 -1 - greenhalgh_final.doc


Table of Contents

1 – Introduction
2 – RTL Clock Gating
3 – Architectural Clock Gating
4 – Comparing the Benefits of RTL Clock Gating and Architectural Clock Gating
5 – Dormant Power Mode
6 – Dynamic Voltage and Frequency Scaling
7 – Summary
8 – Acknowledgements

Table of Figures
Figure 2.1 – Circuit with no clock gating (re-circulating mux)
Figure 2.2 – Circuit with clock gating using an integrated clock gating cell
Figure 3.1 – Architectural clock gating
Figure 3.2 – Architectural clock gate in the ARM1176JZF-S processor
Figure 4.1 – Power consumption of different clock gating approaches on the
ARM926EJ-S processor using Dhrystone loop 4
Figure 5.1 – Dormant Mode voltage domains and clamping logic
Figure 6.1 – ARM1176JZF-S RTL Hierarchy for Dynamic Voltage and Frequency
Scaling
Figure 6.2 – Power and energy benefit from Dynamic Voltage and Frequency Scaling

1.0 Introduction
Chip power consumption is increasing at technology process geometries of 130nm
and below due to greater transistor density. High power consumption reduces battery
life in mobile devices, requires more costly IC packaging to deal with heat dissipation
and can affect long term reliability of the part.

To reduce the upward trend in power consumption, custom circuit techniques that are
library and technology specific have been developed in combination with library and
technology independent approaches.

Soft IP presents a further challenge because the target technology is not known when
the IP is designed. The IP must therefore be designed to support the introduction of
custom logic as well as to incorporate technology independent techniques.

Furthermore, at process geometries of 90nm and below, static leakage power is
becoming increasingly significant. Therefore the IP should provide techniques to
mitigate both static and dynamic power consumption.

The ARM1176JZF-S and ARM926EJ-S microprocessors are designs delivered to
licensees in a synthesisable soft IP form. The ARM926EJ-S processor supports both
RTL and architectural clock gating. At the time of writing the ARM1176JZF-S is the
latest ARM microprocessor and in addition to RTL and architectural clock gating also
has support for a Dormant Power mode and dynamic voltage and frequency scaling.
These power saving techniques have been chosen to complement any custom circuit
or implementation approaches that a licensee may have developed to save power.

2.0 RTL Clock Gating

RTL clock gating has been supported by tools such as Synopsys Power Compiler for
several years to reduce dynamic power. This design approach uses an enable signal
coded into the RTL to control whether or not a group of registers is updated.

In most cases a clock gate is inferred by Power Compiler to replace the re-circulating
mux that is coded into the RTL. The following diagrams demonstrate a circuit with
no clock gating (figure 2.1) and one using an integrated clock gating cell (figure 2.2):

[Diagram labels: Data, Clock, Enable Generation Logic, re-circulating mux (inputs 0 and 1), registers (D, Q).]

Figure 2.1 – Circuit with no clock gating (re-circulating mux).

[Diagram labels: Data, Clock, Enable Generation Logic, Scan over-ride, Latch, Integrated Clock Gating Cell, registers (D, Q).]

Figure 2.2 – Circuit with clock gating using an integrated clock gating cell.
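
The equivalence between the two circuits can be illustrated with a small behavioural model. This is only a sketch in Python rather than RTL: the function names, the example data and the cycle-by-cycle abstraction are assumptions made for illustration, and the model ignores the latch and scan over-ride inside a real integrated clock gating cell.

```python
def run_recirculating_mux(data, enable, reset_value=0):
    """Register clocked every cycle; a mux feeds Q back when enable is low."""
    q, trace, clock_edges = reset_value, [], 0
    for d, en in zip(data, enable):
        clock_edges += 1        # the register sees every clock edge
        q = d if en else q      # mux: 1 selects new data, 0 re-circulates Q
        trace.append(q)
    return trace, clock_edges


def run_clock_gated(data, enable, reset_value=0):
    """Register whose clock is suppressed whenever enable is low."""
    q, trace, clock_edges = reset_value, [], 0
    for d, en in zip(data, enable):
        if en:                  # the gated clock only pulses when enabled
            clock_edges += 1
            q = d
        trace.append(q)
    return trace, clock_edges


# Hypothetical stimulus: new data is captured on three of eight cycles.
data = [5, 7, 7, 9, 2, 2, 2, 4]
enable = [1, 0, 0, 1, 0, 0, 0, 1]

mux_trace, mux_edges = run_recirculating_mux(data, enable)
gated_trace, gated_edges = run_clock_gated(data, enable)
```

Both models hold the same register contents on every cycle, but the gated register receives a clock edge only on the cycles where the enable is high.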

RTL clock gating saves power both in the switching of the registers being gated and
also in the clock network between the clock gate and the registers. Depending on the
cell library up to 50% of the capacitance of the clock tree occurs in the clock network
between the clock gate and the registers and in the clock input pins of the registers.
Given that it is typical for the clock tree to consume around 33% of the power of the
standard cell logic, significant power savings can be made by preventing this network
from switching.
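
The two figures quoted above can be combined into a rough upper bound. The sketch below assumes that clock tree power divides in proportion to capacitance, which the text does not state, and it ignores the additional saving from the registers themselves not switching. The function names and the `gated_off_fraction` parameter are hypothetical.

```python
def gated_network_power_share(clock_tree_share=0.33,
                              downstream_capacitance_share=0.50):
    """Share of standard cell power drawn by the clock network that sits
    between the clock gates and the register clock pins (upper bound)."""
    return clock_tree_share * downstream_capacitance_share


def clock_network_saving(gated_off_fraction):
    """Saving if the gates are closed for the given fraction of cycles."""
    return gated_off_fraction * gated_network_power_share()


upper_bound = gated_network_power_share()           # about 0.165
saving_if_gated_half_the_time = clock_network_saving(0.5)
```

Under these assumptions, at most about one sixth of the standard cell power is exposed to RTL clock gating through the clock network alone, and the realised saving scales with how often the enables are low.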

However, RTL clock gating requires discipline from the designer in identifying
suitable control signals for the clock gate. Where there is naturally a re-circulating
mux this is not an issue. In other cases, a reasonable number of registers (usually four
or more) needs to be identified for the gain in dynamic power from clock gating to
overcome the static leakage penalty of the clock gate and the logic that generates the
enable signal. Nevertheless, RTL clock gating is an effective way of reducing
dynamic power and the ARM1176JZF-S and ARM926EJ-S processors use this
approach on the majority of their registers.

A further benefit of RTL clock gating is that the removal of the re-circulating mux
from the datapath will result in a potential performance improvement on the register-
to-register path. However, there is a penalty in that the enable signal for the clock
gate must be set up to the clock gate some time period before the end of the cycle to
allow the clock to propagate from the clock gate to the register. This can make the
enable generation logic complicated to design from a timing perspective, especially
since the criticality of the enable signal is not accurately known until after Clock Tree
Synthesis has been performed on a placed design.

3.0 Architectural Clock Gating

Architectural clock gating is a more coarse-grained method of gating the clock to
registers and can be used either at a block level (e.g. a DMA block) or at the top level
of the design.

Architectural clock gating provides an incremental benefit over register clock gating.
Indeed if the target block already uses a large amount of RTL clock gating, the main
benefit will be in gating out a larger section of the clock tree.

For example, in the following diagram, Block 2 is architecturally clock gated and
Block 1 is not. When the architectural clock gate is in operation the RTL clock gating
inside of Block 2 should already have disabled the clock. Therefore the incremental
power saving of architectural clock gating over RTL clock gating is on the free
running registers (where either a suitable enable signal could not be identified or there
are too few registers that can be grouped together to make clock gating worthwhile)
and on the clock network leading to the RTL clock gating cells.

[Diagram labels: IP Block containing Block 1 and Block 2, CLK, Architectural Clock Gate, RTL Clock Gate, RTL Clock Gating, registers (D, Q).]

Figure 3.1 – Architectural clock gating.

Because an architectural clock gate exists higher up the clock tree than a conventional
RTL clock gate the enable signal has a smaller portion of the cycle to setup to the
clock gate. Therefore the enable must come directly out of a register or traverse a
very small number of gates. This can limit the usefulness of architectural clock gates
for circuits that require complex enables.

On the ARM1176JZF-S microprocessor, architectural clock gating is used to remove
the main clock (CLKIN) from the entire microprocessor. The enable signal that
controls this is based on the following conditions:

• There are no instructions being executed in the integer core or the Vector
Floating Point unit.
• There are no operations in progress on the bus interface unit.
• There are no DMA operations in progress into the internal Tightly Coupled
Memories.
• The ARM1176JZF-S processor is not in debug state.
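
The four conditions above amount to a single AND of idle indications. The following sketch shows that combination only; the class and field names are hypothetical and are not signal names from the ARM1176JZF-S RTL.

```python
from dataclasses import dataclass


@dataclass
class ProcessorActivity:
    # Hypothetical status flags mirroring the conditions listed in the text.
    integer_core_busy: bool
    vfp_busy: bool
    bus_interface_busy: bool
    tcm_dma_busy: bool
    in_debug_state: bool


def main_clock_may_stop(activity: ProcessorActivity) -> bool:
    """The main clock can be gated only when every condition holds."""
    return not (
        activity.integer_core_busy
        or activity.vfp_busy
        or activity.bus_interface_busy
        or activity.tcm_dma_busy
        or activity.in_debug_state
    )


idle = ProcessorActivity(False, False, False, False, False)
dma_running = ProcessorActivity(False, False, False, True, False)
```

A single outstanding activity, such as a DMA transfer into a Tightly Coupled Memory, is enough to keep the clock running.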

The problem with soft IP is that the IP creator does not know what the insertion delay
of the clock will be, as this is a function of the target library and the quality of the
clock tree insertion tool. The risk is that if the clock tree insertion delay approaches
one clock cycle, the enable could arrive at the architectural clock gate at the same time
as the next clock edge, thereby causing a glitch on the clock. The prudent approach is
therefore for the IP creator to design clock circuitry that tolerates any clock tree
insertion delay.

On the ARM1176JZF-S processor a glitch is prevented by using an unbalanced, free-
running version of the main clock called FREECLKIN to synchronise the enable
signal coming into the architectural clock gate.

[Diagram labels: SoC, ARM1176JZ(F)-S, SoC Clock, CLKIN, FREECLKIN, Unbalanced Clock Tree, Balanced Clock Tree, Architectural Clock Gate (Integrated Clock Gating Cell) with Enable, CLK In and CLK Out, Integer Pipeline Empty Logic, Pipeline Empty Signal, Synchronisation Circuitry, Synchronised Enable Signals.]

Figure 3.2 – Architectural clock gate in the ARM1176JZF-S microprocessor.

Provided that the FREECLKIN signal is not balanced with respect to the CLKIN signal,
the clock tree can have any insertion delay and there is no risk of a glitch.
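
The insertion delay problem can be made concrete with a simplified slack calculation. This is a sketch under stated assumptions, not a sign-off timing model: all delay values are hypothetical, the latch inside the integrated clock gating cell and hold checks are ignored, and the FREECLKIN delay is simply assumed to be small and independent of the balanced tree.

```python
def enable_slack_ns(period_ns, launch_clock_delay_ns,
                    enable_path_ns, gate_setup_ns):
    """Slack of the enable at a clock gate placed at the root of the tree.

    Time zero is a clock edge at the clock gate input. The register that
    launches the enable sees that edge launch_clock_delay_ns later, and the
    enable must settle gate_setup_ns before the next edge reaches the gate.
    """
    arrival_ns = launch_clock_delay_ns + enable_path_ns
    return period_ns - gate_setup_ns - arrival_ns


PERIOD_NS = 4.0        # hypothetical clock period
ENABLE_PATH_NS = 0.5   # hypothetical clock-to-Q plus logic delay
GATE_SETUP_NS = 0.2    # hypothetical setup requirement at the gate

# Enable launched from a register on the balanced tree: the launch delay is
# the insertion delay, which the IP creator cannot know in advance.
balanced_slack = {
    delay: enable_slack_ns(PERIOD_NS, delay, ENABLE_PATH_NS, GATE_SETUP_NS)
    for delay in (1.0, 2.5, 3.5)
}

# Enable launched from a register on the unbalanced FREECLKIN: the launch
# delay no longer depends on the insertion delay of the balanced tree.
FREECLKIN_DELAY_NS = 0.3   # hypothetical
freeclkin_slack = enable_slack_ns(PERIOD_NS, FREECLKIN_DELAY_NS,
                                  ENABLE_PATH_NS, GATE_SETUP_NS)
```

In this model the slack on the balanced tree shrinks as the insertion delay grows and turns negative as it approaches the clock period, whereas the FREECLKIN path keeps the same slack whatever the balanced tree looks like.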

4.0 Comparing the Benefits of RTL Clock Gating and Architectural Clock Gating

A normalized comparison of the power benefits of mixing the different clock gating
approaches is shown using the ARM926EJ-S microprocessor running loop 4 of the
industry standard Dhrystone benchmark. The ARM926EJ-S microprocessor was used
to produce these figures rather than the ARM1176JZF-S microprocessor as the use of
architectural clock gating is an implementation choice with the ARM926EJ-S
processor whereas it is a fundamental part of the clocking strategy of the
ARM1176JZF-S processor, and cannot be removed.

The architectural clock gating approach used in the ARM926EJ-S processor differs
from the ARM1176JZF-S processor in that as well as being able to gate off the entire
clock from the microprocessor there are clock gates on eleven other major blocks in
the design. These include the coprocessor interface, the instruction caches, the data
caches, the instruction tightly coupled memories, the data tightly coupled memories,
and the main integer core. Therefore the ARM926EJ-S processor is ideal for
comparing the benefit of block level architectural clock gating with RTL clock gating.

Note that each clock gating experiment required a complete Physical Compiler
synthesis and Astro CTS/route and consequently there will be minor variations
between each run:

Clock gating approach                   Normalized power
No Clock Gating                         100%
Architectural Clock Gating Only          94%
RTL Clock Gating Only                    74%
RTL and Architectural Clock Gating       69%

Figure 4.1 – Power consumption of different clock gating approaches on the
ARM926EJ-S microprocessor using Dhrystone loop 4

Dhrystone is a reasonably strenuous benchmark for a microprocessor because it
typically fits completely inside of the caches and therefore requires no processor bus
transactions. Indeed, Figure 4.1 shows that only a small benefit is gained from the
coarse-grained architectural clock gating approach. This is because during the fourth
loop of Dhrystone the main integer core and caches are running continuously and only
a few blocks (such as the coprocessor interface) are idle and can be clock gated out.
However, the combination of RTL and architectural clock gating still provides a useful
31% reduction in power compared to using no clock gating at all.
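
The savings implied by Figure 4.1 can be recomputed directly from the normalized values. The dictionary keys and the helper name below are illustrative only; the percentages are the ones reported in the figure.

```python
# Normalized power from Figure 4.1 (no clock gating = 100).
normalized_power = {
    "none": 100,
    "architectural_only": 94,
    "rtl_only": 74,
    "rtl_and_architectural": 69,
}


def saving_percent(baseline, scheme):
    """Percentage reduction in power of `scheme` relative to `baseline`."""
    return 100.0 * (1 - normalized_power[scheme] / normalized_power[baseline])


total_saving = saving_percent("none", "rtl_and_architectural")
architectural_alone = saving_percent("none", "architectural_only")
architectural_on_top_of_rtl = saving_percent("rtl_only", "rtl_and_architectural")
```

Measured against a design that already uses RTL clock gating, adding architectural clock gating removes a little under 7% of the remaining power for this workload, which is consistent with the incremental role described in section 3.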

Note that the power savings discussed in this section apply only to dynamic
power. Yet static power consumption is becoming increasingly significant at
technology process geometries of 90nm and below. There are many circuit
techniques to deal with static power consumption, but there are also technology
independent approaches that an IP creator can use. The use of a Dormant Power
Mode is one such technique.

5.0 Dormant Power Mode

To combat static leakage power it may be advantageous to remove power from large
blocks of logic that are not in use, provided they can return to full functionality
within a relatively short time after power is restored. For example the ARM1176JZF-S
microprocessor implements a power down mode (Dormant Mode) which allows the
power to be removed from the standard cells while the main cache RAMs remain
powered up. The standard cell logic in the processor can therefore be powered down
when no useful work is occurring, yet state can be restored relatively quickly
when required, without the latency and power penalty of refilling the processor caches
from external memory.

Depending on the leakage characteristics of the cache RAMs, Dormant Mode can be
highly advantageous for leakage reduction. Indeed, if leakage management is of
critical importance, nominal or even high threshold transistors can be used in the
caches while the logic is implemented using mixed threshold standard cells for
optimum performance.

However, in the case of the ARM1176JZF-S microprocessor, Dormant Mode is not
trivial to implement. The complexities of Dormant Mode can be split into a hardware
component and a software component.

5.1 Hardware Considerations

To implement Dormant Mode successfully, the following must be considered:

• All inputs (including the clock) into the RAMs must be clamped when the
standard cell power is removed to prevent erroneous data being written to the
RAMs. A signal called RAMCLAMP is used for this purpose and the design
is in normal operation when RAMCLAMP is set to logic zero.
• All clamping logic must be placed only in the RAM power domain.
• Clocks must be held at a known state during Dormant Mode and a rising edge
avoided when coming out of Dormant Mode to prevent erroneous data being
clocked into the cache RAMs.
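
The role of RAMCLAMP can be sketched with a one-bit model. The text does not state the level to which inputs are clamped, so forcing them to 0 is an assumption here, as are the function name and the use of `None` to stand for an input driven from the unpowered standard cell domain.

```python
from typing import Optional


def clamp(signal: Optional[int], ramclamp: int) -> Optional[int]:
    """Clamp cell powered from the RAM voltage domain.

    RAMCLAMP = 0: normal operation, the signal passes through unchanged.
    RAMCLAMP = 1: the output is forced to 0 whatever the input is doing.
    Clamping to 0 is an assumption; the source does not give the level.
    """
    return 0 if ramclamp else signal


UNDRIVEN = None  # input from the standard cell domain after power removal

ram_clock = clamp(UNDRIVEN, ramclamp=1)
ram_write_enable = clamp(UNDRIVEN, ramclamp=1)
normal_data_bit = clamp(1, ramclamp=0)
```

Because the clamp output depends only on RAMCLAMP once it is asserted, the cell has to be powered from the RAM domain, which is why the clamping logic must not be placed with the standard cells.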

The following diagram illustrates the Dormant Mode voltage domains and clamping
logic.

[Diagram labels: SoC, ARM1176JZF-S, VDD Cell Domain, Standard Cell Logic, VDD RAM Domain, Clamping Logic, RAM Blocks, RAMCLAMP, Clock Signal, Data Signal, Clamped Data Signal, Clamped Clock Signal.]

Figure 5.1 – Dormant Mode voltage domains and clamping logic

5.2 Software Considerations

Because power is removed from the standard cells in Dormant Mode, all
microprocessor configuration will be lost. To enter and return from Dormant Mode
successfully, the following must be copied to and from external memory:

• ARM general purpose and status registers.
• CP15 (system control coprocessor) and CP14 (debug) registers.
• DMA state.
• VFP (floating point coprocessor) state.
• Cache state bits for the cache RAMs.
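
The save and restore sequence implied by the list above can be outlined as follows. This is a schematic model only: real entry and exit code runs on the processor itself, and the group names, example register values and dictionary representation are all hypothetical.

```python
import copy

# One entry per item in the list above; the names are illustrative.
STATE_GROUPS = (
    "core_registers",
    "cp15_cp14_registers",
    "dma_state",
    "vfp_state",
    "cache_state_bits",
)


def enter_dormant(processor, external_memory):
    """Copy every state group out, then model the loss of standard cell power."""
    for group in STATE_GROUPS:
        external_memory[group] = copy.deepcopy(processor[group])
    for group in STATE_GROUPS:
        processor[group] = None   # contents are lost once power is removed


def exit_dormant(processor, external_memory):
    """Copy every state group back after power has been restored."""
    for group in STATE_GROUPS:
        processor[group] = copy.deepcopy(external_memory[group])


processor = {
    "core_registers": {"r0": 0x1234, "cpsr": 0x13},
    "cp15_cp14_registers": {"control": 0x1},
    "dma_state": {"channel0_active": False},
    "vfp_state": {"s0": 1.5},
    "cache_state_bits": {"line0_valid": True},
}
snapshot = copy.deepcopy(processor)
external_memory = {}

enter_dormant(processor, external_memory)
state_lost = all(processor[group] is None for group in STATE_GROUPS)
exit_dormant(processor, external_memory)
```

Any group omitted from the save step would simply be absent after wake-up, which is why the list has to cover all state held in the standard cell domain.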

Depending on the cache write policy (write through/write back) and the AMBA™
AXI processor bus frequency, entry and exit from Dormant Mode takes between 5,000
and 10,000 clock cycles. This cycle overhead means that accurate modelling of the
software environment is required to get the best performance and static leakage power
saving from Dormant Mode. However, the voltage domain divide between the RAMs
and the standard cell logic was chosen purely to simplify the implementation process.
It would be possible to place the registers used for storing the cache state bits for the
RAMs in the RAM voltage domain thereby improving the latency to return from
Dormant Mode. Naturally, this would complicate implementation and possibly
reduce performance, but the benefit in reduced latency may make it worthwhile.
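
Whether Dormant Mode pays off for a given idle period can be estimated with a break-even calculation. The sketch below is a first-order model with entirely hypothetical numbers: it treats the quoted cycle count as the combined entry and exit overhead, charges that overhead at a fixed active power, and credits the leakage saving only for the time actually spent powered down.

```python
def overhead_seconds(overhead_cycles, clock_hz):
    """Time spent saving and restoring state."""
    return overhead_cycles / clock_hz


def dormant_saves_energy(idle_s, overhead_s, leakage_saved_w, overhead_power_w):
    """True if the leakage energy saved exceeds the energy spent on entry/exit."""
    powered_down_s = max(idle_s - overhead_s, 0.0)
    energy_saved_j = leakage_saved_w * powered_down_s
    energy_spent_j = overhead_power_w * overhead_s
    return energy_saved_j > energy_spent_j


# Hypothetical operating conditions, not figures from the paper.
CLOCK_HZ = 200e6
LEAKAGE_SAVED_W = 0.005
OVERHEAD_POWER_W = 0.100

worst_case_overhead_s = overhead_seconds(10_000, CLOCK_HZ)   # 50 microseconds

short_idle_worthwhile = dormant_saves_energy(
    0.0005, worst_case_overhead_s, LEAKAGE_SAVED_W, OVERHEAD_POWER_W)
long_idle_worthwhile = dormant_saves_energy(
    0.010, worst_case_overhead_s, LEAKAGE_SAVED_W, OVERHEAD_POWER_W)
```

With these example values a 0.5 ms idle period is too short to recover the overhead while a 10 ms period is not, which illustrates why the software environment has to be modelled before deciding when to enter Dormant Mode.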

6.0 Dynamic Voltage and Frequency Scaling


To achieve further improvements in power reduction without resorting to custom
circuit techniques, Dynamic Voltage and Frequency Scaling can be used.

Dynamic Voltage and Frequency Scaling is effective because of the following two
facts:

• The amount of energy required to complete a task is proportional to the
square of the supply voltage.
• The maximum frequency of any CMOS circuit is proportional to the supply
voltage.

So if the supply voltage is decreased there is a square-law reduction in energy to
complete a given task. However, the task takes longer to complete because of the
linear reduction in frequency. Therefore, the principal gain with Dynamic Voltage
and Frequency Scaling is with respect to dynamic power consumption. However, any
reduction in supply voltage also results in a proportional reduction in static power.
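
The two proportionalities can be combined into relative scaling factors. The sketch below applies them as stated, as idealised first-order relationships for dynamic power only; the function name and the returned keys are illustrative.

```python
def dvfs_scaling(voltage_ratio):
    """Relative dynamic energy, frequency, run time and power.

    voltage_ratio is the new supply voltage divided by the original one.
    Energy per task scales with the square of the voltage and maximum
    frequency scales linearly with it, so power (energy per unit time)
    scales with the cube.
    """
    energy = voltage_ratio ** 2
    frequency = voltage_ratio
    return {
        "energy": energy,
        "frequency": frequency,
        "run_time": 1 / frequency,
        "power": energy * frequency,
    }


half_voltage = dvfs_scaling(0.5)
```

In this idealised model, halving the supply voltage quarters the energy per task, doubles the run time and cuts dynamic power to one eighth. Real silicon departs from it, for example because the minimum usable voltage does not fall linearly with frequency.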

The ARM1176JZF-S microprocessor supports Dynamic Voltage and Frequency
Scaling as a part of the design. Although splitting the design up into multiple voltage
domains is mostly an implementation problem, the structure of the RTL can crucially
affect compatibility with EDA tools.

To ease implementation the RTL was designed using the following rules:

1 – The top-level logical hierarchy should correlate with the voltage domains in the
design.
2 – Clocks should not cross voltage domains inside of the IP.
3 – All data signals that cross between voltage domains inside the processor or on the
processor interface should be synchronized and level shifted.
4 – All outputs from the core voltage domain should be clamped to allow the power to
be removed (Dormant Mode).
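
Rules 2 and 3 lend themselves to a mechanical check over a list of domain crossings. The sketch below is a hypothetical lint-style checker, not an EDA tool flow: the `Crossing` record, its fields and the example signal names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Crossing:
    # A signal that crosses between two voltage domains (hypothetical record).
    name: str
    is_clock: bool
    synchronized: bool
    level_shifted: bool


def check_crossings(crossings):
    """Return (signal name, message) pairs for violations of rules 2 and 3."""
    violations = []
    for crossing in crossings:
        if crossing.is_clock:
            violations.append((crossing.name, "rule 2: clock crosses a voltage domain"))
            continue
        if not crossing.synchronized:
            violations.append((crossing.name, "rule 3: data signal is not synchronized"))
        if not crossing.level_shifted:
            violations.append((crossing.name, "rule 3: data signal is not level shifted"))
    return violations


example_crossings = [
    Crossing("axi_read_data", is_clock=False, synchronized=True, level_shifted=True),
    Crossing("debug_request", is_clock=False, synchronized=True, level_shifted=False),
    Crossing("bus_clock", is_clock=True, synchronized=False, level_shifted=True),
]
violations = check_crossings(example_crossings)
```

A check of this kind is only practical when rule 1 holds, because the crossings can then be read off the top-level hierarchy rather than traced through the design.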

By following these rules the ARM1176JZF-S microprocessor looks as follows in a
SoC:

[Diagram labels: Vram, Vcore and Vsoc domains; ARM1176JZ(F)-S; RAMs; BIST; Coprocessor; Core logic; Embedded Trace Macrocell; AMBA AXI Register Slices (Vcore) and (Vsoc); Level Shift and Clamp Logic; Energy Management Controller; Interrupt Controller; Debug Logic; Clock & Reset Signals; AMBA AXI Buses (including clocks and resets).]

Figure 6.1 – ARM1176JZF-S RTL Hierarchy for Dynamic Voltage and Frequency
Scaling

The AMBA AXI Register Slices are a method of pipelining the main processor buses.
AXI allows register slices to be placed in the interconnect with no impact at all on the
available bandwidth. By splitting these register slices and placing one half in the core
voltage domain and one half in the SoC voltage domain the synchronous timing
interface to the ARM1176JZF-S microprocessor from the SoC looks the same no
matter what the core voltage is. This approach ensures that the first rule is met and
that the top-level logical hierarchy of the ARM1176JZF-S microprocessor correlates
with the voltage domains being implemented. Furthermore, the ARM1176JZF-S
microprocessor conforms to the second rule since the AMBA AXI clocks required for
the bus interface unit are only used inside the Vsoc module of the AMBA AXI register
slice.

Other interfaces in the design from the core voltage domain to the SoC voltage
domain (such as the debug interface) are asynchronous at the boundary of the
microprocessor and are already handled by synchronization circuitry internally. This
means that the asynchronous interfaces only need to be level shifted and not
resynchronized.

Where possible, it is advantageous for implementation complexity and maximum
power saving to keep peripherals such as coprocessors in the core voltage domain.
However, if the peripherals also communicate with blocks of IP in the SoC voltage
domain then some effort may be required from the IP integrator to ensure that timing
and functionality are maintained across the voltage boundary.

In order for Dynamic Voltage and Frequency Scaling to work in a SoC environment,
there must be some form of energy management controller on the SoC that controls
the voltage domains in the SoC. Also, the operating system must understand when
the voltage and frequency can scale depending on the current and future workload of
the microprocessor.
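
One simple policy an operating system might use is to pick the slowest operating point that still meets a deadline. The sketch below is a hypothetical governor, not a description of any ARM or operating system component; the four frequencies echo the ARM926EJ-S test chip measurements reported in this section, and treating them as selectable operating points is an assumption.

```python
# Candidate clock frequencies in MHz (assumed to be selectable at run time).
OPERATING_POINTS_MHZ = (75, 150, 225, 300)


def lowest_sufficient_frequency(cycles_needed, deadline_s,
                                points_mhz=OPERATING_POINTS_MHZ):
    """Slowest frequency that finishes the work in time.

    Falls back to the fastest point if no point can meet the deadline.
    """
    for mhz in sorted(points_mhz):
        if cycles_needed / (mhz * 1e6) <= deadline_s:
            return mhz
    return max(points_mhz)


light_load = lowest_sufficient_frequency(cycles_needed=1.2e6, deadline_s=0.010)
heavy_load = lowest_sufficient_frequency(cycles_needed=4.0e6, deadline_s=0.010)
```

The energy management controller would then be asked to move the core voltage domain to the minimum voltage that supports the chosen frequency.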

The benefits of Dynamic Voltage and Frequency Scaling are substantial. For
example, on an ARM926EJ-S test chip fabricated on the TSMC CL013G process, the
following measurements were obtained from the test silicon when running the
Dhrystone benchmark:

Operating point      Power    Energy
300MHz, 1.21V        100%     100%
225MHz, 1.03V         56%      75%
150MHz, 0.81V         23%      46%
75MHz, 0.69V           9%      36%

Figure 6.2 – Power and energy benefit from Dynamic Voltage and Frequency Scaling

The voltage numbers in Figure 6.2 represent the minimum voltage that
would still allow the benchmark to successfully complete at the target frequency on
the test silicon.

Figure 6.2 clearly shows a large benefit from reducing the supply voltage.
However, the energy figures are possibly more useful as they account for the increased
time to execute a task due to the reduction in frequency.

Even though the benchmark took twice as long to run at 150MHz as at 300MHz, this
may not be an issue in some applications, and the 54% reduction in energy required to
complete the task is highly desirable.
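
The power and energy columns of Figure 6.2 can be cross-checked against each other, and the energy column compared with the square-law relationship from earlier in this section. The sketch below assumes that run time scales inversely with frequency, which holds for a benchmark that runs from the caches; the helper names are illustrative.

```python
# (frequency in MHz, minimum voltage in V, power %, energy %) from Figure 6.2.
MEASURED = [
    (300, 1.21, 100, 100),
    (225, 1.03, 56, 75),
    (150, 0.81, 23, 46),
    (75, 0.69, 9, 36),
]


def energy_from_power(freq_mhz, power_pct, ref_mhz=300):
    """Energy = power x time, with run time assumed proportional to 1/frequency."""
    return power_pct * ref_mhz / freq_mhz


def ideal_energy_pct(voltage, ref_voltage=1.21):
    """Energy predicted by the square-law dependence on supply voltage."""
    return 100 * (voltage / ref_voltage) ** 2


derived = [energy_from_power(f, p) for f, _, p, _ in MEASURED]
ideal = [ideal_energy_pct(v) for _, v, _, _ in MEASURED]
```

The energy derived from the power column matches the reported energy to within rounding. The square-law prediction tracks the measurements closely at 225MHz and 150MHz and is somewhat optimistic at 75MHz; the paper does not give a reason for that gap.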

Dynamic Voltage and Frequency Scaling is a complex technique which requires
consideration from the RTL design stage all the way through implementation,
integration and the operating system. However, the energy savings of Dynamic
Voltage and Frequency Scaling easily offset these complications.

7.0 Summary

The techniques that have been described in this paper to save power (static and
dynamic) are complementary and applicable across all types of soft IP. They range
from the highly automated (RTL clock gating) to the more complex (dynamic voltage
and frequency scaling), but each can provide incremental savings in power across all
technology processes, thereby extending battery life in mobile devices and also
potentially reducing IC package costs. It is up to the IP creator to decide which
techniques are required depending on the target application and to then put in place
the infrastructure to allow them to be successfully implemented and integrated into a
system chip.

8.0 Acknowledgements
Thanks to everyone in ARM who reviewed this paper prior to release.
