Low-Power Design For Embedded Processors

Bill Moyer

Invited Paper

Abstract—Minimization of power consumption in portable and battery-powered embedded systems has become an important aspect of processor and system design. Opportunities for power optimization and tradeoffs emphasizing low power are available across the entire design hierarchy. A review of low-power techniques applied at many levels of the design hierarchy is presented, and an example of a low-power processor architecture is described, along with some of the design decisions made in implementation of the architecture.

Keywords—Circuit design, clock distribution, clock gating, CMOS circuits, CPU microarchitecture, instruction set design, low-power architecture, low-power design, low-power synthesis, low-power systems, power dissipation, power minimization, power optimization, RISC, state assignment, system design.

I. INTRODUCTION

The increasing prominence of portable electronics and consumer-oriented devices has become a fundamental driving factor in the design of new computational elements in CMOS very large-scale integration (VLSI) systems on a chip. As the focus shifts away from tethered desktop computing to the mobile appliance, a rethinking of design optimizations traditionally targeting ever-increasing performance goals and high clock rates at almost any cost is required in order to optimize battery life and extend the utility of these devices. The trend in the desktop world of continuous growth in complexity and size of the underlying CPU in terms of instruction issue strategies and the supporting microarchitecture needs to be re-examined for these devices, as the tradeoffs in energy consumption versus the improved performance obtained may dictate a different set of design choices. Power consumption arises as a third axis in the optimization space, in addition to the traditional speed (performance) and area (cost) dimensions.

Improvements in circuit density and the corresponding increase in heat generation must be addressed even for high-end desktop systems. Current trends in technology scaling of CMOS circuits cannot be reliably sustained without addressing power consumption issues. Environmental concerns relating to energy consumption by computers and other electrical equipment are another reason for interest in low-power designs and design techniques.

Low-power design can be an important element in lowering system cost as well. Smaller packages, batteries, and reduced thermal management overhead result in less costly products, with higher reliability as an added benefit. Size, available power budget, and weight of a device are important metrics, and to a large extent, the power source is the primary determinant of these metrics. Energy-efficient designs maximize the useful lifetime of this source, while attempting to meet throughput and peak performance requirements of the overall application. Power-efficient design implies that the system minimizes the peak demands on this source, thus improving its operating efficiency. The rate of energy use can have a dramatic effect on the amount of energy available from a battery source as well as its cost [1], [2]; thus, there is value in minimizing not only average power consumption, but peak power consumption as well. Portable product utility is constrained by the physical size and weight of the power source. Current battery technologies, such as Nickel–Metal Hydride systems, are available in "AA" sizes with a capacity of 1600 mAh at a nominal voltage of 1.2 V. For a portable device containing a pair of these cells, a run-time between charges of approximately 4 h is possible when the system is dissipating 1 W of average power. For a device to remain usable for a month between charges, the average power dissipation must drop below 5 mW. For systems with an active duty cycle of 10%, the power consumed by the entire system when active must be less than 50 mW, several orders of magnitude below today's notebook computing devices.

Opportunities for design tradeoffs emphasizing low power are available across the entire spectrum of the overall design process for a portable system, and are effectively applied at many levels of the design hierarchy. From algorithm selection to silicon process technology details, opportunities abound. Generally speaking, the higher the level of abstraction, the greater the opportunity for power savings. Much research as well as practical development has occurred in the

Manuscript received December 29, 2000; revised June 10, 2001.
The author is with Motorola Inc., Austin, TX 78729 USA.
Publisher Item Identifier S 0018-9219(01)09683-9.

1576 PROCEEDINGS OF THE IEEE, VOL. 89, NO. 11, NOVEMBER 2001
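The battery figures quoted above follow directly from the stored energy of the pack. A minimal sketch of that arithmetic, using an ideal energy model that ignores the rate-dependent capacity effects the text notes can be dramatic:

```python
def runtime_hours(capacity_mah, voltage_v, cells, avg_power_w):
    """Hours of operation from a battery pack at a given average power draw.

    Ideal model: delivered energy = cells * capacity * nominal voltage,
    ignoring discharge-rate effects on usable capacity.
    """
    energy_wh = cells * (capacity_mah / 1000.0) * voltage_v
    return energy_wh / avg_power_w

# Two "AA" NiMH cells: 1600 mAh at a nominal 1.2 V each.
hours_at_1w = runtime_hours(1600, 1.2, 2, 1.0)      # 3.84 h, i.e. ~4 h
hours_at_5mw = runtime_hours(1600, 1.2, 2, 0.005)   # 768 h, about a month
```

The same model confirms the duty-cycle point: at 10% active duty, a 5 mW average budget allows roughly 50 mW while active, assuming negligible standby power.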
past 30 or so years regarding low-power design. In the last decade, popularity of the subject has produced a wealth of technical information [3]–[7], as well as annual international symposia and workshops dedicated to the latest research and developments [8]–[10].

While the bulk of commercial activity addressing low-power processor systems has focused on well-known clocked CMOS design styles, important research and commercial work in the area of asynchronous logic design techniques continues as an alternative approach to lowering power dissipation in systems. These techniques may also provide a solution to the increasing problem of clock management and distribution as device frequencies approach and even exceed 1 GHz. While not the focus of this paper, the interested reader is referred to the overview presented by Hauck [11] as a starting point for asynchronous design styles.

II. POWER DISSIPATION IN CMOS CIRCUITS

Power dissipated in CMOS circuits consists of several components, as indicated in (1)

P_total = P_switch + P_shortcircuit + P_static + P_leakage.  (1)

The individual components represent the power required to charge or switch a capacitive load (P_switch), short circuit power consumed during output transitions of a CMOS gate as the input switches (P_shortcircuit), static power consumed by the device (P_static), and leakage power consumed by the device (P_leakage). Components P_switch and P_shortcircuit are present when a device is actively changing state, while the components P_static and P_leakage are present regardless of state changes. The largest active component, P_switch, is defined as

P_switch = C · V_dd · ΔV · α · f  (2)

where C represents the capacitance being switched, V_dd is the supply voltage, ΔV corresponds to the change in voltage level of the switched capacitance, α represents a switching activity factor based on the probability of an output transition, and f represents the frequency of operation. The product α · C is also referred to as the effective switched capacitance, or C_eff. In most circuits, ΔV is equal to V_dd, so (2) is commonly written as

P_switch = C_eff · V_dd² · f.  (3)

The P_shortcircuit term occurs due to the overlapped conductance of both the PMOS and NMOS transistors forming a CMOS logic gate as the input signal transitions. This term has a complicated derivation, but in simplified form can be written as [12]

P_shortcircuit = I_mean · V_dd  (4)

where I_mean represents the average current drawn during the input transition. P_shortcircuit is minimized for a single gate with short input rise and fall times, and with long output transition times, thus presenting a tradeoff in device sizing. When a set of gates is considered, it is generally optimal to target equal input and output transition times. For large devices such as input–output (I/O) buffers or clock drivers, special design considerations are often used to minimize the overlap current [13]. For properly sized and ratioed gates, the contribution to overall dynamic power due to P_shortcircuit is on the order of 10%–20%, although this factor may increase with increased device scaling [14].

P_static is not usually a factor in pure CMOS designs, since static current is not drawn by a CMOS gate, but certain circuit structures such as sense amplifiers, voltage references, and constant current sources do exist in CMOS systems and contribute to overall power.

P_leakage is due to leakage currents from reverse-biased PN junctions associated with the source and drain of MOS transistors, as well as subthreshold conduction currents. The junction leakage component is proportional to device area and temperature. The subthreshold leakage component is strongly dependent on device threshold voltages, and becomes an important factor as power supply voltage scaling is used to lower power. For systems with a high ratio of standby operation to active operation, P_leakage may be the dominant factor in determining overall battery life.

Minimization of these components of power dissipation is important in designing low-power systems, and there are complex interactions that require tradeoffs to be made involving each. Active power minimization involves reducing the magnitude of each of the components in (3). With its quadratic contribution in the power equation, reduction of supply voltage is an obvious candidate technique for power reduction, and can be applied to an entire design. Reducing supply voltage by a factor of two ideally results in a factor of four reduction in P_switch. There are limitations to simple supply voltage scaling, however, since the performance of a gate is reduced as V_dd is lowered, due to the reduced saturation current available to charge and discharge load capacitance. Gate delay dependence on V_dd is approximated [15] by

T_d ∝ V_dd / (V_dd − V_t)².  (5)

The energy-delay product is minimized when V_dd is approximately 3·V_t. Reducing V_dd from 1.8 V (a typical value for 0.18 μm technology) to 1.2 V results in an approximate 50% decrease in performance while using only 44% of the power. This is a useful point of leverage if performance goals can still be met. It would seem that reducing the threshold voltage of the devices and, thus, a corresponding reduction in V_dd offers a path to arbitrarily low power consumption. Unfortunately, there are practical limits to the degree that V_t can be lowered, due to reduced noise margins and since exponentially increased leakage current becomes a limiting factor in its contribution to total power [16]. Controllability of variations in V_t is also an issue in manufacturing, and provides a lower bound on supply voltage scaling [17]. A methodology for selecting supply and threshold voltage targets is further described in [18].
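The supply-scaling tradeoff expressed by (3) and (5) can be checked numerically. A small sketch, assuming a threshold voltage of 0.45 V — an illustrative value for a 0.18 μm process, not one given in the text:

```python
def switching_power(c_eff, vdd, f):
    """Dynamic switching power per eq. (3): P = C_eff * Vdd^2 * f."""
    return c_eff * vdd ** 2 * f

def gate_delay(vdd, vt):
    """Gate delay proportionality per eq. (5): Td ~ Vdd / (Vdd - Vt)^2."""
    return vdd / (vdd - vt) ** 2

# Scale Vdd from 1.8 V down to 1.2 V, Vt assumed to be 0.45 V.
power_ratio = switching_power(1.0, 1.2, 1.0) / switching_power(1.0, 1.8, 1.0)
delay_ratio = gate_delay(1.2, 0.45) / gate_delay(1.8, 0.45)
# power_ratio ~= 0.444 (the 44% figure); delay_ratio ~= 2.16 (roughly half speed)
```

The quadratic power win against the roughly 2x delay penalty is exactly the leverage the text describes when throughput targets can still be met.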
III. DESIGN TECHNIQUES FOR POWER REDUCTION

Power reduction techniques may be applied at all levels of the system design hierarchy. As noted in [19], these levels include algorithmic, architectural, logic and circuit, and device technology. A brief description of each is given, followed by some specific examples. This section is not intended to be exhaustive.

A. Algorithmic

Algorithmic-level power reduction techniques focus on minimizing the number of operations, weighted by the cost of those operations. Selection of an algorithm is generally based on details of an underlying implementation, such as the energy cost of an addition versus a logical operation, the cost of a memory access, and whether locality of reference, both spatial and temporal, can be maximized. The presence and structure of cache memory, for example, may cause a different set of operations to be selected, since the cost of a memory access relative to an arithmetic operation changes. In general, reducing the number of operations to be performed is a first-order goal, although in some situations, recomputation of an intermediate result may be cheaper than spilling it to and reloading it from memory. Techniques used by optimizing compilers, such as strength reduction, common subexpression elimination, and optimizations to minimize memory traffic, are also useful in most circumstances in reducing power. Loop unrolling may also be of benefit, as it minimizes loop overhead and creates the potential for intermediate result reuse.

Number representations offer another area for algorithmic power tradeoffs. For example, the choice of a fixed-point or a floating-point representation for data types can make a significant difference in power consumption during arithmetic operations. Selection of sign-magnitude versus two's complement representation for certain signal processing applications can result in significant power reduction if the input samples are uncorrelated and dynamic range is minimized [20]. Operator precision, or bit length, is another tradeoff that can be selected to minimize power at the expense of accuracy. For some floating-point algorithms, full precision can be avoided, and mantissa and exponent width reduced below the standard 23 and 8 bits, respectively, of single-precision IEEE floating point. In [21], the authors show that for an interesting set of applications involving speech recognition, pattern classification, and image processing, mantissa bit width may be reduced by more than 50%, to 11 bits, with no corresponding loss of accuracy. In addition to improved circuit delays, energy consumption of the floating-point multiplier was reduced 20%–70% for mantissa reductions to 16 and 8 bits, respectively. Truncation of low-order bits of partial sum terms when performing a 16-bit fixed-point multiplication has been shown to result in power savings of 30%, due mainly to reduction in area [22]. Adaptive bit truncation techniques for performing motion estimation in a portable video encoder are shown to save 70% of the power over a full bit-width implementation [23].

B. Architectural

At the architectural and microarchitectural level, instruction set design and exploitation of parallelism and pipelining are important in minimizing power consumption. Architecture-driven voltage scaling as a method for power reduction is presented in [19]. The approach is based on lowering voltage to reduce power consumption, and then applying parallelism and/or pipelining to maintain throughput as the speed of a function unit is decreased. This type of approach is useful if enough parallelism exists at the application level to keep the pipeline full, but trades off increased latency and additional area overhead in the form of duplicated structures (parallelism) or pipeline register overhead (pipelining). For general-purpose CPU development, exploiting pipelining and parallelism is important for improved performance. Increases in latency due to deeper pipelining affect the metric of instructions per clock due to data dependencies and control flow dependencies. In the search for maximum overall performance, complicated value prediction schemes and speculative fetch and execution of unresolved branch target instruction streams are often employed for deeply pipelined processors designed for highest performance, in order to reduce dependency-related stalls. The overhead for these schemes results in extra energy consumption, and additionally, incorrect speculation results in discarding of operations, an additional waste of energy. Low-power designs tend to avoid these deeply pipelined approaches unless the amount of speculation is limited, the overhead for speculation is low, and the accuracy of speculation is high. Meeting required performance for an application without overdesigning a solution is a fundamental optimization. Additional circuitry designed to dynamically extract more parallelism can actually be detrimental, since the power consumption overhead of this logic is not generally controllable, and will be present even when the additional parallelism is absent from the application.

C. Logic and Circuit Level

Many techniques for power reduction are available at the logic and circuit levels. Most focus on reducing the effective switched capacitance, C_eff in (3). Others focus on reduced signal swing, thus avoiding the quadratic dependence on supply voltage.

Static and dynamic (clocked) logic families are both utilized in CMOS designs. Depending on signal probabilities, one or the other may offer reduced effective switched capacitance. For a two-input NAND gate, assuming a uniform distribution of input values, the probability of the output being 0 (P_0) is 0.25 (both inputs are 1) and of being 1 (P_1) is 0.75. For a static gate, the probability of a power-consuming 0 → 1 output transition (P_0→1) is then 0.1875 (P_0 · P_1). For the dynamic gate with the output precharged to logic 1, power is consumed whenever the output was previously a 0. Relative to a static gate, the probability of a power-consuming transition is higher (0.25), and power is consumed even when the logical value of the output remains 0, which is not the case for the static version. The dynamic version typically has
Fig. 1. Glitching in static logic and restructuring for elimination.

lower input capacitance, however, by a factor of 2 to 3, since PMOS devices are not driven by logic inputs; thus, C_eff for the dynamic gate may be much lower, even though it has a higher activity factor. For a wider static gate, such as a four-input NAND, P_0 = 0.0625, and P_0→1 is 0.0586. For the dynamic version, P_0→1 = 0.0625. Increasing the number of inputs leads to a lower probability of an output transition. On the other hand, input capacitive loading increases if delay time is held constant, since larger transistors must be used. Intrinsic capacitance of the gate also increases. The power consumed in distributing the precharging signal to the dynamic gate must also be considered. A number of different logic families (both static and dynamic) have been proposed in the literature, including variants of complementary pass transistor logic (CPL) and differential cascode voltage switch logic (DCVSL), offering area, speed, and power tradeoffs. An extensive review of the many types of clocked and static logic families may be found in [24].

Static logic may suffer from hazards (or glitches) that result in unnecessary power consumption due to differences in gate input arrival times. These differences in arrival times may cause multiple output transitions, resulting in a value for α that is greater than 1. As an example, the output of the simple two-input circuit in Fig. 1 has an unnecessary signal transition from high → low → high due to the difference in arrival times of inputs X and Y. This hazard may be propagated through additional logic levels and result in multiple gate output transitions before the circuit resolves to a final state, even if the final state is unchanged from the previous state. As the number of logic levels increases in a combinational circuit, the probability of unequal path delays from input to output increases, thus increasing the potential for glitching. Logic restructuring and path delay balancing may be used to reduce glitch power, which can be responsible for 20% of overall dynamic power consumption in combinational circuits [25]. Fig. 1 shows a restructured circuit realizing the same logic function with reduced glitching. Path delay balancing may be performed either by resizing individual logic gates to equalize path delay, or by insertion of additional logic elements in faster paths. Since both methods can result in additional switched capacitance, they must be used judiciously. Dynamic logic does not suffer from glitch power, since all inputs must be valid before the gate evaluates.

Fig. 2. Equivalent logic mappings with different power costs.

Technology mapping of logic functions to gates may choose to optimize power at the expense of area. A robust standard cell library for low power will include gates with a variety of logic functions as well as multiple drive strengths for each function. Complex gates (AND–OR–INVERT, OR–AND–INVERT, etc.), NAND and NOR gates with inverted inputs, and a rich set of storage elements provide synthesis tools with the flexibility to optimize power consumption. Transition probabilities of the logic being mapped are used in conjunction with loading models of the library elements to select a mapping of the desired Boolean function onto a set of gates in the library which minimizes power, subject to meeting a set of delay constraints. Fig. 2 shows an example of differences in mapping a four-input AND function. In the example, mapping (a) consumes more power than mapping (b) due to differences in the total transition probabilities of the three two-input gates. Improvements averaging 10% on a set of benchmarks were obtained in [26] by using power instead of area as the minimization criterion. Their algorithm resulted in an area increase of 12%, showing that minimized area does not necessarily result in minimum power. A similar result is reported in [27], where average power dissipation is reduced by 21% with a corresponding 13% increase in area. Hiding high-probability switching nodes inside complex gates is used to minimize total switched capacitance.

Synthesis techniques using a hybrid library composed of static CMOS gates in conjunction with pass logic cells have also been shown to be effective in improving power dissipation [28]. Reordering of equivalent inputs of gates and reordering of transistors in complex gates are also techniques available to reduce power. Fig. 3 shows transistor diagrams of a complex gate, with an example of input reordering and transistor reordering. Input and transistor ordering affect the amount of switched internal capacitance of the gate, and also affect the speed of the gate and its static power dissipation. In general, input signals with a high probability of being off are placed nearest the output node of the gate, subject to timing constraints being met, and signals with a high probability of being on are placed nearest the supply node.

Fig. 4. Clock gating.
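The effect behind Fig. 2 — equivalent mappings with different switching costs — can be reproduced by summing the static transition probability P_0→1 = P_0 · P_1 over the gates of two mappings of a four-input AND. This sketch is not the specific mapping pair of the figure; it assumes independent, uniform inputs and equal capacitive load on every node, so that total transition probability stands in for power:

```python
from itertools import product

def node_p01(node_fn, n_inputs=4):
    """P(0->1) = P0 * P1 for one node, exhaustive over uniform inputs."""
    ones = sum(node_fn(bits) for bits in product((0, 1), repeat=n_inputs))
    p1 = ones / 2 ** n_inputs
    return (1 - p1) * p1

# Chain mapping: n1 = a&b, n2 = n1&c, n3 = n2&d
chain_nodes = [
    lambda b: b[0] & b[1],
    lambda b: b[0] & b[1] & b[2],
    lambda b: b[0] & b[1] & b[2] & b[3],
]
# Balanced-tree mapping: n1 = a&b, n2 = c&d, n3 = n1&n2
tree_nodes = [
    lambda b: b[0] & b[1],
    lambda b: b[2] & b[3],
    lambda b: b[0] & b[1] & b[2] & b[3],
]
chain_total = sum(node_p01(f) for f in chain_nodes)  # 91/256 ~= 0.356
tree_total = sum(node_p01(f) for f in tree_nodes)    # 111/256 ~= 0.434
```

Under these assumptions the chain mapping switches less in total than the balanced tree (its internal nodes sit at 0 more often), while the tree has the shorter critical path — the same kind of power-versus-delay choice a power-aware technology mapper makes.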
Fig. 5. Precomputation structure.

A drawback of the DETFF (double edge-triggered flip-flop) relative to the single-edge version is that the duty cycle of the clock is now a factor in determining cycle time. A comprehensive comparison of various DETFF implementations is provided in [35].

Retiming of sequential circuits and pipelined datapath logic is a technique traditionally used to increase the operating speed of a circuit by balancing the delay of each stage of logic in the circuit. Registers are moved either forward or back along combinational logic paths until the total delay between registers is equalized. As the registers are moved, the number of required registers may increase or decrease based on the number of signals crossing the register boundary. Also, combinational logic optimization opportunities may occur as new logic groups are exposed, thus further improving the circuit speed. The balanced circuit may then be operated at a lower frequency or voltage, thus reducing power consumption further.

One observation made in [36] is that propagation of unnecessary switching activity due to glitches can be halted by insertion of a register in a combinational logic path. The register output will transition once per clock cycle at most, even if the input makes multiple transitions. By placing registers at high-fanout nodes, switched capacitance can be minimized, assuming that the additional capacitive load created in adding the register is low enough relative to the original load, and the original node had multiple transitions per cycle. Retiming for low power is an approach that attempts to minimize glitch power in a pipeline by moving the registers forming the pipeline to positions that optimally minimize switching activity in the logic network. Since delay of the pipeline stages must be considered, only a subset of nodes in the circuit are candidates for register placement, i.e., those nodes which would not violate delay constraints. Additionally, there is a desire to minimize the number of registers due to area costs as well as the additional clock power consumed.

Precomputation is an optimization technique for sequential circuits which minimizes switching activity by selectively precomputing the output values of a logic circuit before they are required, and then using the computed values to minimize switching activity by disabling inputs to the logic circuit. The precomputed values are then substituted for the original logic circuit output values. Precomputation logic uses a small subset of the original input signals to generate simple logic functions that indicate that the original logic function is either True or False, respectively. By keeping these functions simple, the overhead associated with precomputation is minimized. In addition, the original logic function may be simplified, since a portion of it is being handled by the precomputation logic itself, and the terms for this portion may be assigned as don't-cares for the original function. Fig. 5 shows one variant of a precomputation circuit implementing a logic function F. In Fig. 5, the logic function is implemented by precomputing a simple subset of the input combinations for which F is True and for which F is False. When either of these precomputation blocks is active, the inputs to the larger combinational block computing the remaining terms of F are blocked, and the larger block remains quiescent. The precomputation logic then forces the output of function F to "1" or "0," respectively.

As has been seen with other power saving techniques, increased area is traded for reduced power. In [37], the authors report power savings of 11%–66% using precomputation on a number of combinational logic circuits. Methods for generating the precomputation functions are also described.

Guarded evaluation is a similar technique that relies on input blocking for transition reduction [38]. Transparent latches are added to the inputs of existing logic and are appropriately disabled when the logic output can be determined without new input values being driven from the disabled latches. This technique is common in the design of datapath functions in low-power processors, as will be described later.

For synthesized portions of a design using gates from a predetermined library, gate sizing should be performed when possible to ensure that no noncritical circuit path is overly fast. Gate size selection is typically based on output loading, and fanout ranges of 3–8 are typical. As fanout increases, delay increases but dynamic power is reduced. Care must be taken not to increase fanout to the degree that signal rise and fall times become an issue in increased short circuit power. Custom portions of a design have an additional degree of freedom in that individual transistors may be sized to minimize power. Algorithms have been developed to size individual transistors in a design to minimize delay, power, or the power-delay product within an area constraint. Edge rate constraints are also considered [39].

D. Device Technology

At the device level, threshold voltage selection plays an important role in the tradeoff between performance and leakage power. Supply and threshold voltage selection was discussed earlier [16]–[18]. Alternative process technologies to bulk CMOS, such as silicon on insulator (SOI), may be attractive due to lowered parasitic capacitance and reduced body effect. Dual device threshold technologies are also an approach to lowering power consumption. High-threshold devices may be used in noncritical delay paths, while reserving low-threshold devices for speed-critical paths, thus minimizing standby power consumption. A methodology for selection of individual device sizes and thresholds to optimize speed and standby power goals is described in
[40]. An alternate approach for standby power reduction is to raise the threshold of all devices while in standby mode by providing a transistor well-biasing circuit.

IV. EMBEDDED PROCESSOR EXAMPLE

Low-power embedded processors fall into several categories. At the extreme low-power range, these are typically 8-bit CPUs with power dissipation measured in microwatts, which power devices such as digital watches, calculators, and other long-life devices. In the midrange, 16- and 32-bit processors power handheld devices with dissipation measured in milliwatts. Higher performance 32-bit processors dissipating watts of power cover high-end applications, such as notebook computers.

In the midrange of performance, one example of a 32-bit processor architecture designed specifically for portable and low-power applications is the Motorola M·CORE family. This architecture and its implementations were designed from the ground up to address low-power embedded applications with a range of power and performance constraints, but targeted initially at midrange applications requiring tens to hundreds of MIPS of performance while dissipating tens to hundreds of milliwatts of power. Cost is an important factor that cannot be ignored in the design of a commercial, high-volume application, and cost considerations were balanced with power optimizations in both the architecture definition and implementation aspects. Some details of the architecture and implementations are described in the following subsections.

A. Instruction Set Design, Programmer's Model

At the architectural level, the specification of an instruction set can have a large effect on system power dissipation as well as performance. As is to be expected, there are tradeoffs to be made. RISC, CISC, and VLIW architectures are examples of approaches to instruction set design, each with its own merits. For low-cost systems, instruction code density is an important factor, since the cost of instruction memory is directly related to the size of the binary images of the programs embedded into the system. CISC designs typically provide good code density due to the complexity of individual instructions and their use of variable-length instruction formats. Traditional RISC and VLIW instruction sets trade code density for simplified decoding and straightforward instruction fetch units. While code density remains high with CISC approaches, the complications in control circuitry for fetching, decoding, and sequencing tend to cause increased overhead in power, and either cost or performance tends to suffer.

For a low-power focus, the desire is to have as large a percentage of power consumption as possible devoted to the fundamental computational operations required by the algorithm being executed. Fetch, decode, and sequencing of instructions represent overhead associated with managing the computational process, and the choice of execution model can reduce the sequencing overhead significantly. Typically a load–store (or register–register) model is chosen, in which operations are performed using a set of general-purpose registers, and the only operations on memory are loads and stores. Ease of decoding and the ability to pipeline operations with low control overhead are advantages. The increased instruction fetch bandwidth required represents a drawback, as the typical RISC instruction is encoded as a 32-bit word. Average instruction lengths for CISC architectures with variable-length instructions are on the order of 22–24 bits, and these instructions have more semantic content than a RISC instruction. They typically support operations on memory directly, via a set of complex addressing modes.

An instruction set design based on a fixed-length 16-bit instruction format was selected for the M·CORE architecture, as well as a RISC load–store model with a 16-entry general-purpose register file, where the only operations performed on memory are loads and stores. The ISA departs from a pure RISC approach in several areas to achieve improved code density, such as support for instructions that save and restore a group of general-purpose registers to and from memory. Relative to a 32-bit ISA, the limitations of 16-bit instructions cause longer execution pathlengths due to limitations on the size of immediate fields, effective address offsets, and a 2-operand instruction format in which one of the source registers also serves as the destination. Using compiler-driven instruction definition during development minimized these limitations. Trace analysis was used to minimize instruction bandwidth requirements, and instructions were selected to minimize the overhead for common code sequences.

The instruction set supports byte, halfword, and word (32-bit) data types, and a complete set of logical, shift, bit manipulation, and arithmetic operations that operate on a register and either another register or a 5-bit immediate field. Load and store instructions provide a single base register plus scaled 4-bit displacement addressing mode. A single condition code bit is defined, and conditional branch instructions test the value of this bit for either true or false. Branch instructions support an 11-bit displacement field, sufficient to satisfy 98% of all displacements. Providing multiple compare instructions allows any Boolean relationship of variables to be generated, and requires less precious opcode space than providing conditional branch instructions that test for multiple conditions, due to the size of branch displacements. Sizes of immediate fields are limited, so special instructions are provided for generation of commonly occurring constants. Constants from 0–128, all powers of two, and all powers of two − 1 are available directly in the ISA. Larger arbitrary values are either synthesized with a pair of instructions, or are loaded from memory as 32-bit constants with a PC-relative load word instruction (LRW). A single storage location for these large constants may be referenced by multiple LRWs, thus amortizing the storage cost. Conditional move, increment, decrement, and clear
task, and an approach that reduces the power in these areas instructions are provided to eliminate some branches. A
is important. Traditional RISC architectures define a fixed- complete description of the MCORE processor architecture
length instruction that is not highly encoded, thus reducing and ISA can be found in [41] and [42].
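The constant-generation rules above can be modeled in a few lines. This is an illustrative sketch of the selection logic a compiler might apply, assuming the ranges quoted in the text; the category names are invented here and are not M·CORE mnemonics.

```python
def materialize(value: int) -> str:
    """Classify how a 32-bit constant could be generated under the
    rules described in the text (illustrative model only, not the
    actual M-CORE encoder)."""
    if 0 <= value <= 128:                         # small immediate (range per text)
        return "single instruction: immediate"
    if value > 0 and value & (value - 1) == 0:    # power of two
        return "single instruction: bit generate"
    if value > 0 and value & (value + 1) == 0:    # power of two minus one (mask)
        return "single instruction: bit mask"
    # Anything else costs two instructions, or one LRW plus a shared
    # 32-bit literal-pool entry.
    return "instruction pair, or PC-relative LRW from a shared literal pool"

print(materialize(100))         # small constant
print(materialize(1 << 20))     # power of two
print(materialize(0xFFFF))      # 2**16 - 1
print(materialize(0x12345678))  # arbitrary value
```

Under a model like this, the compiler would fall back to the literal pool only for arbitrary wide constants, which is where the LRW amortization described above applies.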
1582 PROCEEDINGS OF THE IEEE, VOL. 89, NO. 11, NOVEMBER 2001
By careful selection of instruction semantics and immediate/displacement widths, we find that object code compiled for this ISA is less than 70% the size of code for a typical 32-bit RISC, which results in a significant cost advantage. The penalty in terms of pathlength increase (number of instructions executed) across a variety of embedded applications is on the order of 15%–20% relative to a 32-bit RISC instruction encoding. Similar conclusions were reached in [43]. From a power perspective, this means memory traffic (in bytes) is reduced dramatically since instructions are 16 bits in length. In spite of the greater number of instructions executed, the overall power consumption is reduced, since on-chip instruction memory power consumption is typically 1.5–2× greater than that of the CPU in our designs, and instruction memory traffic has been reduced by 40%.

Other advantages related to power and performance are realized. For designs utilizing cache memory, the instruction cache capacity is effectively doubled, since approximately twice as many instructions can be stored. Cache miss rates of typically sized embedded cache designs (4–32 kB) may be reduced 30%–50% with this effect. Given that accessing the next level of the memory hierarchy can result in factors of 20× or more greater power consumption due to traversing chip boundaries, this reduction in miss rate is significant. In cacheless designs where memory is embedded on-chip, the power consumption of memory is reduced due to the reduced capacity requirement. For on-chip memories or caches, a 32-bit data path is typically provided, which results in double the effective fetch bandwidth relative to a 32-bit instruction word, allowing instruction memory to be accessed every other cycle on average, even with a target of single-cycle instruction execution. For low-cost designs where instruction memory is off chip, the ability to fetch a pair of instructions at a time across a 32-bit interface reduces effective memory latency. Even a narrow 16-bit interface path results in greatly reduced performance degradation relative to a wider instruction word.

After selecting the set of operations to minimize code size and execution pathlength and defining the instruction formats, the task of encoding the opcodes remained. We performed an initial encoding assignment and then iterated it to reduce the number of terms and literals in a two-level programmable logic array targeted for controlling a processor data path which implemented the data operations defined by the instruction set, as well as control of an instruction prefetch and program counter unit. By viewing this task as a state assignment problem for sequential logic minimization, each instruction opcode is assigned to a state. A Moore-machine model was used in which control outputs are a function of present state only. Inputs to the state machine are the next instruction opcode, and all states are completely interconnected via an exhaustive set of edges. Next-state equations are ignored, since they are a function of only the inputs, not current state. By casting the opcode assignment problem in this fashion, state assignment tools were used to automate the process. This process was iterated as the control signal requirements were altered to further minimize area. Often, multiple equivalent control sets can be used to obtain the desired function. As an example, to implement the logical NOT instruction, we can either exclusive-OR the source value with −1 in a logical unit, or we may perform a subtract from 0 with inverted carry-in in the add unit. Since the energy used by the logical unit is lower than that of the adder, it is the obvious first choice. In some circumstances, however, utilizing the adder results in lower overall energy usage, since it may allow additional reduction in control circuitry transitions by collapsing control terms in the output equations of the control decoder. This is particularly true when the instruction or function in question has a low dynamic frequency of execution. Compiler-directed feedback was used to determine the best tradeoffs between control decoder power and execution unit power in a number of instances.

In addition to area minimization, minimizing control unit power consumption is desired. This was done by instrumenting an instruction set simulator to capture the frequency of execution of all instructions, as well as instruction pairs. Opcodes were ordered by frequency and by frequency of execution pairs, and an initial state assignment was performed on the most frequently occurring instructions, with the objective of assigning adjacent states to frequently occurring instruction pairs. The remainder of the state assignments were made with automated state assignment tools. We achieved control section power savings of approximately 15% with this approach to opcode assignment for our baseline machine, with no increase in area.

Beyond just CPU power reduction, system-level power savings are supported by the ISA with three low-power operating mode instructions. The WAIT, DOZE, and STOP instructions are provided to enable a system to be placed in increasingly lower power modes as appropriate for operating conditions. When the CPU encounters one of these instructions, it completes all previous instructions in the pipeline, finishes all outstanding prefetch operations, and then enters a state where internal clocks are gated off. A pair of control outputs that encode the present operating mode is driven to the rest of the system to allow specific low-power operating conditions to be defined by the system designer. The CPU will exit these modes and resume normal operation once a pending wakeup request is recognized. As an example of system use, the WAIT mode might be used to disable only the CPU, while keeping system PLLs and peripherals active. If there is not expected to be a need for processing for a longer period of time, the DOZE mode might be defined to disable PLLs and certain peripherals that are unnecessary in that mode. Wakeup from this state would entail a longer period of time. The STOP mode can be used to enter a deep power-down state in which all clocks are stopped at the system level, and the power supply voltage is either reduced or totally switched off to major subsystems.

B. CPU Microarchitecture

While many processor implementation techniques in extremely high-end designs are focused on extracting all possible instruction-level parallelism, these techniques tend to have a correspondingly high level of power inefficiency.
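The instruction-traffic figures quoted in Section IV-A are mutually consistent, as a quick back-of-the-envelope check shows. The ratios below are taken from the text; the midpoint pathlength penalty is an assumption for illustration.

```python
# Back-of-the-envelope check of the Section IV-A claims: a 16-bit ISA
# executes ~15%-20% more instructions than a 32-bit RISC, but each
# instruction is half as wide, so total instruction bytes fetched drop.
pathlength_ratio = 1.175        # assumed midpoint of the 15%-20% penalty
width_ratio = 2 / 4             # 2-byte instructions vs. 4-byte instructions
traffic_ratio = pathlength_ratio * width_ratio
reduction = 1 - traffic_ratio
print(f"relative traffic: {traffic_ratio:.3f}, reduction: {reduction:.3f}")
```

The result, roughly a 41% reduction in instruction bytes fetched, matches the ~40% traffic reduction reported in the text.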
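System software is responsible for mapping the WAIT, DOZE, and STOP modes described in Section IV-A onto operating conditions. A hypothetical policy sketch follows; the thresholds and the mode-to-hardware mapping in the comments are invented for illustration, and in a real system they would be derived from PLL relock and supply-ramp latencies.

```python
def pick_low_power_mode(expected_idle_us: float) -> str:
    """Hypothetical tiered-mode policy: deeper modes save more power
    but incur longer wakeup latency, so they only pay off when the
    expected idle period is long enough. Thresholds are illustrative,
    not taken from any M-CORE system."""
    if expected_idle_us < 100:
        return "WAIT"   # gate CPU clocks; PLLs and peripherals stay active
    if expected_idle_us < 100_000:
        return "DOZE"   # also disable PLLs and unneeded peripherals
    return "STOP"       # stop all clocks; reduce or remove supply voltage

print(pick_low_power_mode(50))         # brief idle
print(pick_low_power_mode(5_000))      # longer idle, relock latency tolerable
print(pick_low_power_mode(2_000_000))  # extended idle
```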
Fig. 6. Instruction buffer supporting the unified bus architecture.
Fig. 8. Address adder input and output gating.
majority of input value cases, but not for all cases. By doing so, critical path timing is not affected by the delay circuit, but a large majority of calculations are completed prior to the delay interval.

Precomputation is used in the address calculation logic for load and store instructions to detect when the displacement field of the instruction is zero. In this case, no addition of the displacement to the base register is required, so the adder is bypassed and remains idle. Fig. 8 shows the bypass logic. From our measurements, almost 50% of loads and stores have a zero displacement field.

Floorplanning of the datapath elements was driven by switching activity measurements gathered while executing a set of embedded benchmarks. Relative utilization of each function unit was coupled with capacitive loading of the units, and block placement was performed to minimize the overall bus loading and effective switching frequency of the source and destination buses. Source and destination bus segmentation allowed infrequently utilized function units to be placed farther from the centralized register file, and decoupled such that bus segments attached to these units are only driven when the function unit is required for instruction execution.

Branch instruction execution is well recognized as a critical factor in processor performance due to the need to discard operations following a taken branch and then refill the instruction pipeline. Due to the relatively high frequency of change-of-flow events (≈18% in our systems), branch acceleration techniques are an important element of a microarchitecture. Increased complexity must be balanced with increased power consumption. Our initial implementations of the M·CORE ISA include a dedicated branch adder specifically designed for high-speed branch target calculations. Since the branch displacement is limited to a signed 11-bit value, a specialized design results in faster target calculations than a generalized 32-bit adder. We implement an aggressive branch-taken instruction timing of two cycles, with single-cycle branch-not-taken timing. No branch prediction logic is required in this approach. This allows increased performance and less wasted processing to occur, but has some significant circuit timing implications. One implication is that the input values to the branch adder cannot be completely clock gated based on fully decoding a branch instruction and still meet timing requirements. Instead, we perform an incomplete decoding of the branch opcode by combining it with the infrequently executed encodings for Add and Subtract with Carry (ADDC, SUBC). This reduces the gating function delay to a single gate.

Noting that many embedded applications spend a significant portion of execution inside program loops, we have augmented the branch unit with a program loop folding capability that captures information about program loops for use during successive iterations of the loop. This loop folding hardware removes the need to fetch, decode, or execute loop branches for most loop iterations, thus increasing performance and lowering power [44].

C. Clock Distribution

Clock power represents a large portion of overall power consumption in a design optimized for low power. Efficient distribution and gating mechanisms are essential for reducing this component. By adopting a two-level structure for this task, improved power efficiency is realized. The first level of logic is used to align clock edges at various points in the circuitry and to generate two nonoverlapped phases. These nonoverlapped phases are then distributed to a set of clock regenerating cells with clock gating control inputs. Gating of clocks can be performed at both levels depending on the granularity required. Fig. 9 illustrates the two-level approach.

Fig. 9. Two-level clock distribution network.

Clock tree generation algorithms have been adopted that allow an unbalanced H-tree structure to be used. As opposed
to traditional clock trees that balance loading at all nodes of the H-tree by adding additional routing capacitance to short nodes, a loading algorithm is adopted that resizes intermediate devices in the clock aligner and regenerator cells to obtain balanced delays. These intermediate device nodes have much lower capacitance than the routing, so small increases in capacitance can be used to obtain the balancing, thus resulting in lower wasted power relative to the traditional tree generation approaches. Delayed clocks are also produced using a similar sizing approach. Fig. 10 illustrates the clock aligner structure and the balancing nodes.

Fig. 10. Clock aligner and low-capacitance balancing nodes.

V. CONCLUSION

Low-power design requires attacking the power dissipation problem at all levels of the design hierarchy. No single target will be sufficient to extract the efficiency required for future handheld products. Voltage scaling has limits that will require additional advanced techniques to be applied at the algorithmic and architectural levels for additional power savings. Dynamic voltage scaling based on system loading and processing requirements is an emerging technique with great promise. Clock power optimization will remain a challenge as higher frequencies and increased pipelining are applied to extract increased performance. Parallelism must be efficiently extracted without sacrificing the goal of low power. Software generation strategies that are based on power cost functions will be increasingly common in these future systems. Even though a broad range of power-reducing techniques has been proposed, the challenge still remains to integrate them into a design flow in which power plays as large a role as performance.

REFERENCES

[1] R. A. Powers, "Batteries for low power electronics," Proc. IEEE, vol. 83, pp. 687–693, Apr. 1995.
[2] J. Costello, "Choosing the right battery to power the portable product," Electron. Prod., Dec. 1992.
[3] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design. Norwell, MA: Kluwer, 1995.
[4] A. Bellaouar and M. I. Elmasry, Low Power Digital CMOS Design: Circuits and Systems. Norwell, MA: Kluwer, 1996.
[5] J. M. Rabaey and M. Pedram, Eds., Low Power Design Methodologies. Norwell, MA: Kluwer, 1996.
[6] M. S. Elrabaa, I. S. Abu-Khater, and M. I. Elmasry, Advanced Low-Power Digital Circuit Techniques. Norwell, MA: Kluwer, 1997.
[7] L. Benini and G. De Micheli, Dynamic Power Management: Design Techniques and CAD Tools. Norwell, MA: Kluwer, 1998.
[8] Power Driven Microarchitecture Workshop, in conjunction with ISCA '98, Barcelona, Spain, June 28, 1998.
[9] IEEE Alessandro Volta Memorial Workshop Low-Power Design, Como, Italy, Mar. 1999.
[10] Int. Symp. Low Power Electronics and Design, Rapallo/Portofino Coast, Italy, July 2000.
[11] S. Hauck, "Asynchronous design methodologies: An overview," Univ. Washington, Tech. Rep. UW-CSE-93-05-07, May 1993.
[12] A. Bellaouar and M. I. Elmasry, Low Power Digital CMOS Design: Circuits and Systems. Norwell, MA: Kluwer, 1996, ch. 4, pp. 135–137.
[13] H. J. M. Veendrick, "Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits," IEEE J. Solid-State Circuits, pp. 468–473, Aug. 1984.
[14] S. R. Vemuru and N. Scheinberg, "Short circuit power dissipation estimation for CMOS logic gates," IEEE Trans. Circuits Syst., pp. 762–765, Nov. 1994.
[15] A. Bellaouar and M. I. Elmasry, Low Power Digital CMOS Design: Circuits and Systems. Norwell, MA: Kluwer, 1996, ch. 3, p. 90.
[16] D. Liu and C. Svensson, "Trading speed for low power by choice of supply and threshold voltages," IEEE J. Solid-State Circuits, pp. 10–17, Jan. 1993.
[17] S. W. Sun and P. Tsui, "Limitation of CMOS supply-voltage scaling by MOSFET threshold variation," in Proc. CICC, 1994.
[18] C. Chen, J. Shott, J. Burr, and J. D. Plummer, "CMOS technology scaling for low voltage low power applications," in Proc. IEEE Symp. Low Power Electronics, San Diego, CA, Oct. 10–12, 1994, pp. 56–57.
[19] A. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE J. Solid-State Circuits, pp. 473–484, Apr. 1992.
[20] A. Chandrakasan and R. W. Brodersen, "Minimizing power consumption in digital CMOS circuits," Proc. IEEE, vol. 83, pp. 498–523, Apr. 1995.
[21] Y. F. Tong, R. A. Rutenbar, and D. F. Nagle, "Minimizing floating-point power dissipation via bit-width reduction," in Power Driven Microarchitecture Workshop, in conjunction with ISCA '98, Barcelona, Spain, June 28, 1998, pp. 114–118.
[22] M. J. Schulte, J. E. Stine, and J. G. Jansen, "Reduced power dissipation through truncated multiplication," in Proc. IEEE Alessandro Volta Memorial Workshop Low-Power Design, Como, Italy, Mar. 1999, pp. 61–69.
[23] Z.-L. He, K.-K. Chan, C.-Y. Tsui, and M. L. Liou, "Low power motion estimation design using adaptive pixel truncation," in Int. Symp. Low Power Electronics and Design, Monterey, CA, Aug. 1997, pp. 167–172.
[24] K. Bernstein et al., High Speed CMOS Design Styles. Norwell, MA: Kluwer, 1998, ch. 2–3, pp. 51–131.
[25] A. Raghunathan, S. Dey, and N. K. Jha, "Register transfer level power optimization with emphasis on glitch analysis and reduction," IEEE Trans. Computer-Aided Design, pp. 1114–1131, Aug. 1999.
[26] V. Tiwari, P. Ashar, and S. Malik, "Technology mapping for low power," in 30th ACM/IEEE Design Automation Conf., 1993, pp. 74–79.
[27] C.-Y. Tsui, M. Pedram, and A. M. Despain, "Technology decomposition and mapping targeting low power dissipation," in 30th ACM/IEEE Design Automation Conf., 1993, pp. 68–73.
[28] M. Gallant and D. Al-Khalili, "Synthesis of low power circuits using combined pass logic and CMOS topologies," in 10th Int. Conf. Microelectronics, Dec. 1998, pp. 59–62.
[29] W.-Z. Shen, J.-Y. Lin, and F.-W. Wang, "Transistor reordering rules for power reduction in CMOS gates," in Proc. ASP-DAC '95, 1995, pp. 1–6.
[30] R. P. Llopis and M. Sachdev, "Low power, testable dual edge triggered flip-flops," in Int. Symp. Low Power Electronics and Design, Aug. 1996, pp. 341–345.
[31] D. Chen, M. Sarrafzadeh, and G. Yeap, "State encoding of finite state machines for low power design," in Proc. ISCAS '95, vol. 3, pp. 2309–2312.
[32] A. G. M. Strollo, E. Napoli, and D. De Caro, "New clock-gating techniques for low-power flip-flops," in Int. Symp. Low Power Electronics and Design, July 2000, pp. 114–119.
[33] H. Kojima, S. Tanaka, and K. Sasaki, "Half-swing clocking scheme for 75% power saving in clocking circuitry," IEEE J. Solid-State Circuits, pp. 432–435, Apr. 1995.
[34] J.-C. Kim, S.-H. Lee, and H.-J. Park, "A high-speed 50% power-saving half-swing clocking scheme for flip-flop with complementary gate and source drive," in ICVC '99 6th Int. Conf. VLSI and CAD, 1999, pp. 574–577.
[35] W. M. Chung and M. Sachdev, "A comparative analysis of dual edge triggered flip-flops," in 2000 Canadian Conf. Electrical and Computer Engineering, vol. 1, 2000, pp. 564–568.
[36] J. Monteiro, S. Devadas, and A. Ghosh, "Retiming sequential circuits for low power," in Int. Conf. Computer-Aided Design, Nov. 1993, pp. 398–402.
[37] J. Monteiro, S. Devadas, and A. Ghosh, "Sequential logic optimization for low power using input-disabling precomputation architectures," IEEE Trans. Computer-Aided Design, vol. 17, pp. 279–284, Mar. 1998.
[38] V. Tiwari, S. Malik, and P. Ashar, "Guarded evaluation: Pushing power management to logic synthesis/design," IEEE Trans. Computer-Aided Design, vol. 17, pp. 1051–1060, Oct. 1998.
[39] A. Dharchoudhury et al., "Transistor-level sizing and timing verification of domino circuits in the PowerPC microprocessor," in Proc. IEEE Int. Conf. Computer Design, Oct. 1997, pp. 143–148.
[40] S. Sirichotiyakul et al., "Standby power minimization through simultaneous threshold voltage selection and circuit sizing," in Design Automation Conf. '99, 1999, pp. 436–441.
[41] M·CORE Reference Manual: Motorola Inc., 1997.
[42] B. Moyer and J. Arends, "RISC gets small," Byte Mag., Feb. 1998.
[43] J. Bunda et al., "16-bit vs. 32-bit instructions for pipelined microprocessors," in Proc. 20th Int. Symp. Computer Architecture, 1993, pp. 237–246.
[44] L. H. Lee, J. Scott, B. Moyer, and J. Arends, "Low-cost branch folding for embedded applications with small tight loops," in MICRO-32, Nov. 1999, pp. 103–111.

Bill Moyer (Member, IEEE) received the B.S.E.E. degree from Rice University and the M.S.E.E. degree from the University of Texas. He is a Motorola Fellow and Distinguished Innovator active in low-power architecture and system design. He has been involved with the design and development of 32-bit microprocessors in Motorola's 68000, 88000, PowerPC, and M·CORE processor families, and embedded systems employing them, for over 20 years, and holds more than 65 patents covering a broad range of microprocessor, memory, and embedded system topics. He has also co-authored a number of low-power oriented publications. He is senior architect for the M·CORE processor family, and is responsible for the definition and design of several family members.