Power Management Techniques For Soft IP
Peter Greenhalgh
CPU Group
ARM
ABSTRACT
Both dynamic power and static leakage power are becoming significant design issues in
130nm and 90nm technology processes. High power consumption not only reduces
battery life in mobile devices, but also requires more costly IC packaging to deal with
heat dissipation. Designers are required to consider power management techniques
throughout the design flow. However, power management in soft IP presents a
difficult challenge as any power savings must be compatible across a range of target
process technologies and design tool flows.
This paper covers design techniques for managing dynamic and static power in soft IP,
using the ARM1176JZF-S™ and ARM926EJ-S™ processors as examples. The techniques
are compatible with current tool flows and portable across multiple technologies.
For dynamic power reduction, RTL clock gating, architectural clock gating and
Dynamic Voltage and Frequency Scaling are examined, while static power
consumption is addressed with a Dormant Power Mode.
1 – Introduction
4 – Comparing the Benefits of RTL Clock Gating and Architectural Clock Gating
7 – Summary
8 – Acknowledgements
Table of Figures
Figure 2.1 – Circuit with no clock gating (re-circulating mux)
Figure 2.2 – Circuit with clock gating using an integrated clock gating cell
Figure 3.1 – Architectural clock gating
Figure 3.2 – Architectural clock gate in the ARM1176JZF-S processor
Figure 4.1 – Power consumption of different clock gating approaches on the
ARM926EJ-S processor using Dhrystone loop 4
Figure 5.1 – Dormant Mode voltage domains and clamping logic
Figure 6.1 – ARM1176JZF-S RTL Hierarchy for Dynamic Voltage and Frequency
Scaling
Figure 6.2 – Power and energy benefit from Dynamic Voltage and Frequency Scaling
To counter the upward trend in power consumption, custom circuit techniques that are
library and technology specific have been developed alongside library and
technology independent approaches.
Soft IP presents a further challenge because the technology in which the IP will be
implemented is not known in advance. The IP must therefore be designed to support
the introduction of custom logic as well as to incorporate technology independent
techniques.
In most cases a clock gate is inferred by Power Compiler to replace the re-circulating
mux that is coded into the RTL. The following diagrams demonstrate a circuit with
no clock gating (figure 2.1) and one using an integrated clock gating cell (figure 2.2):
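[Diagram not reproduced. Labelled elements: Data, Clock, Enable Generation Logic, a two-input mux and D-type registers.]
Figure 2.1 – Circuit with no clock gating (re-circulating mux).
[Diagram not reproduced. Labelled elements: Data, Clock, Enable Generation Logic, Latch, Integrated Clock Gating Cell and D-type registers.]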
Figure 2.2 – Circuit with clock gating using an integrated clock gating cell.
RTL clock gating saves power both in the switching of the registers being gated and
in the clock network between the clock gate and the registers. Depending on the
cell library, up to 50% of the capacitance of the clock tree lies in the clock network
between the clock gate and the registers and in the clock input pins of the registers.
Given that the clock tree typically consumes around 33% of the power of the
standard cell logic, significant power savings can be made by preventing this network
from switching.
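The two percentages quoted above can be combined into a rough upper bound on the saving available from the clock network alone. The sketch below is only an illustration: the function name and the `gated_fraction` parameter (the share of the downstream network that is actually stopped, averaged over time) are assumptions introduced here, not quantities measured in this paper.

```python
def clock_network_saving(clock_tree_share=0.33, downstream_share=0.50,
                         gated_fraction=1.0):
    """Estimate the fraction of standard cell power saved in the clock network.

    clock_tree_share: clock tree power as a fraction of standard cell logic
                      power (around 33% according to the text).
    downstream_share: fraction of clock tree capacitance located after the
                      clock gates (up to 50% according to the text).
    gated_fraction:   hypothetical fraction of that downstream network which
                      is stopped, averaged over time (1.0 is the best case).
    """
    for value in (clock_tree_share, downstream_share, gated_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all arguments must lie between 0 and 1")
    return clock_tree_share * downstream_share * gated_fraction


# Best case: every gated branch of the clock network is permanently stopped.
upper_bound = clock_network_saving()
print(f"Upper bound on clock network saving: {upper_bound:.1%}")  # 16.5%
```

In practice the saving is lower than this bound, because registers are only gated for part of the time and the saving in the registers themselves is not counted here.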
However, RTL clock gating requires discipline from the designer in identifying
suitable control signals for the clock gate. Where there is naturally a re-circulating
mux this is not an issue. In other cases, a reasonable number of registers (usually four
or more) must share the same enable for the gain in dynamic power from clock gating
to outweigh the static leakage penalty of the clock gate and the logic that generates
the enable signal. Nevertheless, RTL clock gating is an effective way of reducing
dynamic power, and the ARM1176JZF-S and ARM926EJ-S processors use this
approach on the majority of their registers.
A further benefit of RTL clock gating is that the removal of the re-circulating mux
from the datapath will result in a potential performance improvement on the register-
to-register path. However, there is a penalty in that the enable signal for the clock
gate must be set up to the clock gate some time period before the end of the cycle to
allow the clock to propagate from the clock gate to the register. This can make the
enable generation logic complicated to design from a timing perspective, especially
since the criticality of the enable signal is not accurately known until after Clock Tree
Synthesis has been performed on a placed design.
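The timing penalty described above can be expressed as a simple budget. This is a minimal sketch: the function name, the parameter names and the example delays are hypothetical, and a real flow would take these values from static timing analysis after Clock Tree Synthesis.

```python
def enable_budget_ns(period_ns, gate_to_register_ns, gate_setup_ns=0.0):
    """Time available for the enable logic, measured from the clock edge
    that launches the enable (as seen at the register clock pins).

    The next clock edge passes through the clock gate 'gate_to_register_ns'
    before it reaches the registers, so the enable must be stable at the
    gate that much earlier, less the setup time of the gate itself.
    """
    if period_ns <= 0:
        raise ValueError("clock period must be positive")
    if gate_to_register_ns < 0 or gate_setup_ns < 0:
        raise ValueError("delays must not be negative")
    return period_ns - gate_to_register_ns - gate_setup_ns


# Hypothetical example: a 300 MHz clock and two clock gate positions.
period = 1000.0 / 300.0  # clock period in ns

# A clock gate close to the registers leaves most of the cycle for the enable.
near_gate = enable_budget_ns(period, gate_to_register_ns=0.3, gate_setup_ns=0.1)

# A clock gate high in the clock tree leaves far less.
high_gate = enable_budget_ns(period, gate_to_register_ns=2.5, gate_setup_ns=0.1)

print(f"Budget with gate near registers: {near_gate:.2f} ns")
print(f"Budget with gate high in tree:   {high_gate:.2f} ns")
```

A budget that is zero or negative means the enable cannot meet timing at that clock gate, whatever the enable logic looks like.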
Architectural clock gating provides an incremental benefit over register clock gating.
Indeed if the target block already uses a large amount of RTL clock gating, the main
benefit will be in gating out a larger section of the clock tree.
For example, in the following diagram, Block 2 is architecturally clock gated and
Block 1 is not. When the architectural clock gate is in operation the RTL clock gating
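[Diagram not reproduced. Labelled elements: IP Block containing Block 1 and Block 2, CLK, Architectural Clock Gate, RTL Clock Gate, RTL Clock Gating and D-type registers.]
Figure 3.1 – Architectural clock gating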
Because an architectural clock gate sits higher up the clock tree than a conventional
RTL clock gate, the enable signal has a smaller portion of the cycle in which to set up
to the clock gate. The enable must therefore come directly out of a register or traverse
a very small number of gates. This can limit the usefulness of architectural clock gates
for circuits that require complex enables.
• There are no instructions being executed in the integer core or the Vector
Floating Point unit.
• There are no operations in progress on the bus interface unit.
• There are no DMA operations in progress into the internal Tightly Coupled
Memories.
• The ARM1176JZF-S processor is not in debug state.
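Assuming the four conditions listed above must all hold before the architectural clock gate may stop the clock, the decision reduces to a single Boolean expression. The sketch below models it in Python purely for illustration; the class and field names are hypothetical and do not correspond to signal names in the ARM1176JZF-S RTL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessorActivity:
    """Hypothetical activity flags, one per condition in the list above."""
    core_or_vfp_executing: bool   # integer core or VFP unit has instructions
    bus_interface_busy: bool      # operations in progress on the bus interface
    tcm_dma_busy: bool            # DMA in progress into the TCMs
    in_debug_state: bool          # processor is in debug state


def clock_may_be_gated(activity: ProcessorActivity) -> bool:
    """Return True only when every source of activity is idle."""
    return not (activity.core_or_vfp_executing
                or activity.bus_interface_busy
                or activity.tcm_dma_busy
                or activity.in_debug_state)


idle = ProcessorActivity(False, False, False, False)
dma_running = ProcessorActivity(False, False, True, False)
print(clock_may_be_gated(idle))         # True
print(clock_may_be_gated(dma_running))  # False
```

Because the result is a simple combination of a few status flags, an enable of this form can be registered, which suits the tight timing of a clock gate placed high in the clock tree.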
The problem with soft IP is that the IP creator does not know what the insertion delay
of the clock will be, as this is a function of the target library and the quality of the
clock tree insertion tool. The risk is that, if the clock tree insertion delay approaches
one clock cycle, the enable could arrive at the architectural clock gate at the same time
as the next clock edge, thereby causing a glitch on the clock. The prudent approach
is therefore for the IP creator to design clock circuitry that tolerates any clock tree
insertion delay.
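[Diagram not reproduced. Labelled elements: FREECLKIN, Synchronisation Circuitry, Integer Pipeline Empty Logic and D-type registers.]
Figure 3.2 – Architectural clock gate in the ARM1176JZF-S processor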
Providing the FREECLKIN signal is not balanced with respect to the CLKIN signal,
the clock tree can have any insertion delay and there is no risk of a glitch.
A normalized comparison of the power benefits of mixing the different clock gating
approaches is shown using the ARM926EJ-S microprocessor running loop 4 of the
industry standard Dhrystone benchmark. The ARM926EJ-S microprocessor was used
to produce these figures rather than the ARM1176JZF-S microprocessor as the use of
architectural clock gating is an implementation choice with the ARM926EJ-S
processor whereas it is a fundamental part of the clocking strategy of the
ARM1176JZF-S processor, and cannot be removed.
The architectural clock gating approach used in the ARM926EJ-S processor differs
from that of the ARM1176JZF-S processor in that, as well as being able to gate off the
entire clock from the microprocessor, there are clock gates on eleven other major blocks
in the design. These include the coprocessor interface, the instruction caches, the data
caches, the instruction tightly coupled memories, the data tightly coupled memories,
and the main integer core. The ARM926EJ-S processor is therefore ideal for
comparing the benefit of block level architectural clock gating with RTL clock gating.
Note that each clock gating experiment required a complete Physical Compiler
synthesis and Astro CTS/route and consequently there will be minor variations
between each run:
Note that the power savings discussed in this section apply only to dynamic
power. Static power consumption, however, is becoming increasingly significant at
technology process geometries of 90nm and below. There are many circuit
techniques for dealing with static power consumption, but there are also technology
independent approaches that an IP creator can use. The use of a Dormant Power
Mode is one such technique.
Depending on the leakage characteristics of the cache RAMs, Dormant Mode can be
highly advantageous for leakage reduction. Indeed, if leakage management is of
critical importance, nominal or even high threshold transistors can be used in the
caches while the logic is implemented using mixed threshold standard cells for
optimum performance.
• All inputs (including the clock) into the RAMs must be clamped when the
standard cell power is removed to prevent erroneous data being written to the
RAMs. A signal called RAMCLAMP is used for this purpose and the design
is in normal operation when RAMCLAMP is set to logic zero.
• All clamping logic must be placed only in the RAM power domain.
• Clocks must be held at a known state during Dormant Mode and a rising edge
avoided when coming out of Dormant Mode to prevent erroneous data being
clocked into the cache RAMs.
The following diagrams illustrate the Dormant Mode voltage domains and clamping
logic.
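[Diagram not reproduced. Labelled elements: SoC, ARM1176JZF-S, RAM Blocks, Clamped Data Signal, Clamped Clock Signal.]
Figure 5.1 – Dormant Mode voltage domains and clamping logic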
Because power is removed from the standard cells in Dormant Mode, all
microprocessor configuration is lost. To enter and return from Dormant Mode
successfully, the following must therefore be copied to and from external memory:
Depending on the cache write policy (write through/write back) and the AMBA™
AXI processor bus frequency, entry and exit from Dormant Mode take between 5,000
and 10,000 clock cycles. This cycle overhead means that accurate modelling of the
software environment is required to get the best performance and static leakage power
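To put the cycle overhead in context, it can be converted to wall-clock time for a given clock frequency. The sketch below is illustrative only: the function name is hypothetical and the 300 MHz clock is an assumed example value, not a figure given for Dormant Mode in this paper.

```python
def dormant_transition_time_us(cycles, clock_mhz):
    """Convert a Dormant Mode entry/exit overhead in clock cycles to
    microseconds (cycles divided by MHz gives microseconds)."""
    if cycles < 0:
        raise ValueError("cycle count must not be negative")
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / clock_mhz


# Range quoted in the text, evaluated at an assumed 300 MHz clock.
ASSUMED_CLOCK_MHZ = 300.0
for cycles in (5_000, 10_000):
    time_us = dormant_transition_time_us(cycles, ASSUMED_CLOCK_MHZ)
    print(f"{cycles} cycles at {ASSUMED_CLOCK_MHZ:.0f} MHz = {time_us:.1f} us")
```

Idle periods that are short compared with this overhead are unlikely to benefit from Dormant Mode, which is why the behaviour of the software needs to be understood before the mode is used.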
Dynamic Voltage and Frequency Scaling is effective because of the following two
facts:
To ease implementation the RTL was designed using the following rules:
1 – The top-level logical hierarchy should correlate with the voltage domains in the
design.
2 – Clocks should not cross voltage domains inside of the IP.
3 – All data signals that cross between voltage domains inside the processor or on the
processor interface should be synchronized and level shifted.
4 – All outputs from the core voltage domain should be clamped to allow the power to
be removed (Dormant Mode).
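Rules 2 and 3 lend themselves to an automated check on a list of signals that cross voltage domain boundaries. The following is a minimal sketch of such a check, not part of any tool flow described in this paper; the `Crossing` record, its field names and the example signal names are all hypothetical. Rules 1 and 4 are not checked.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Crossing:
    """Hypothetical description of one signal between two voltage domains."""
    name: str
    source_domain: str
    dest_domain: str
    is_clock: bool = False
    synchronized: bool = False
    level_shifted: bool = False


def check_crossings(crossings: List[Crossing]) -> List[str]:
    """Return a list of rule violations for signals that change domain."""
    violations = []
    for c in crossings:
        if c.source_domain == c.dest_domain:
            continue  # signal stays inside one voltage domain
        route = f"{c.source_domain} -> {c.dest_domain}"
        if c.is_clock:
            violations.append(f"{c.name}: clock crosses {route} (rule 2)")
            continue
        if not c.synchronized:
            violations.append(f"{c.name}: not synchronized, {route} (rule 3)")
        if not c.level_shifted:
            violations.append(f"{c.name}: not level shifted, {route} (rule 3)")
    return violations


# Hypothetical signal list used only to exercise the checker.
example = [
    Crossing("bus_rdata", "Vsoc", "Vcore", synchronized=True, level_shifted=True),
    Crossing("dbg_req", "Vsoc", "Vcore", synchronized=True, level_shifted=False),
    Crossing("bus_clk", "Vsoc", "Vcore", is_clock=True),
    Crossing("core_internal", "Vcore", "Vcore"),
]
for violation in check_crossings(example):
    print(violation)
```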
[Diagram not reproduced. Labelled elements: voltage domains Vram, Vcore and Vsoc; RAMs; ARM1176JZ(F)-S; Level Shift and Clamp Logic; BIST; Energy Management Controller; AMBA AXI Buses (including clocks and resets); clock and reset, interrupt and debug blocks.]
Figure 6.1 – ARM1176JZF-S RTL Hierarchy for Dynamic Voltage and Frequency
Scaling
The AMBA AXI Register Slices are a method of pipelining the main processor buses.
AXI allows register slices to be placed in the interconnect with no impact at all on the
available bandwidth. By splitting these register slices and placing one half in the core
voltage domain and one half in the SoC voltage domain the synchronous timing
interface to the ARM1176JZF-S microprocessor from the SoC looks the same no
matter what the core voltage is. This approach ensures that the first rule is met and
that the top-level logical hierarchy of the ARM1176JZF-S microprocessor correlates
with the voltage domains being implemented. Furthermore, the ARM1176JZF-S
microprocessor conforms to the second rule, since the AMBA AXI clocks required for
the bus interface unit are only used inside the Vsoc module of the AMBA AXI register
slice.
Other interfaces in the design from the core voltage domain to the SoC voltage
domain (such as the debug interface) are asynchronous at the boundary of the
microprocessor and are already handled by synchronization circuitry internally. This
means that the asynchronous interfaces only need to be level shifted and not
resynchronized.
The benefits of Dynamic Voltage and Frequency Scaling are substantial. For
example, on an ARM926EJ-S test chip fabricated on the TSMC CL013G process, the
following measurements were obtained from the test silicon when running the
Dhrystone benchmark:
[Bar charts not reproduced. Two panels, Power and Energy, each with a horizontal axis running from 0% to 100%.]
Figure 6.2 – Power and energy benefit from Dynamic Voltage and Frequency Scaling
The voltage numbers in the diagrams above represent the minimum voltage that
would still allow the benchmark to successfully complete at the target frequency on
the test silicon.
The diagrams clearly show a large benefit from reducing the supply voltage.
However, the energy diagram is possibly more useful, as it accounts for the increased
time taken to execute a task when the frequency is reduced.
Even though the benchmark took twice as long to run at 150MHz as at 300MHz, this
may not be an issue in some applications, and the 54% reduction in the energy required
to complete the task is highly desirable.
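Since energy is average power multiplied by execution time, the two figures quoted above also imply an average power ratio. The short calculation below is a sketch that assumes the 54% energy reduction and the doubling of run time refer to the same pair of operating points (150MHz and 300MHz); the function name is hypothetical.

```python
def average_power_ratio(energy_ratio, time_ratio):
    """Average power ratio implied by an energy ratio and a run time ratio,
    using energy = average power * time."""
    if time_ratio <= 0:
        raise ValueError("time ratio must be positive")
    return energy_ratio / time_ratio


energy_ratio = 1.0 - 0.54  # energy at 150MHz relative to 300MHz
time_ratio = 2.0           # the benchmark takes twice as long at 150MHz

power_ratio = average_power_ratio(energy_ratio, time_ratio)
print(f"Average power at 150MHz relative to 300MHz: {power_ratio:.0%}")  # 23%
```

Under that assumption the average power falls by roughly 77% while the energy per task falls by 54%, which is why the energy figure is the fairer measure of the benefit to battery life.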
The techniques described in this paper for saving power (static and dynamic) are
complementary and applicable across all types of soft IP. They range from the
highly automated (RTL clock gating) to the more complex (Dynamic Voltage
and Frequency Scaling), but each can provide incremental savings in power across all
technology processes, thereby extending battery life in mobile devices and
potentially reducing IC package costs. It is up to the IP creator to decide which
techniques are required for the target application and then to put in place
the infrastructure that allows them to be successfully implemented and integrated into a
system chip.
8.0 Acknowledgements
Thanks to everyone in ARM who reviewed this paper prior to release.