POWER AWARE DESIGN METHODOLOGIES
edited by
Massoud Pedram
University of Southern California
and
Jan M. Rabaey
University of California, Berkeley
CONTRIBUTORS xvii
PREFACE xix
1. INTRODUCTION 1
MASSOUD PEDRAM AND JAN RABAEY
1.1 INTRODUCTION 1
1.2 SOURCES OF POWER CONSUMPTION 2
1.3 LOW-POWER VERSUS POWER-AWARE DESIGN 2
1.4 POWER REDUCTION MECHANISMS IN CMOS CIRCUITS 3
1.5 POWER REDUCTION TECHNIQUES IN MICROELECTRONIC
SYSTEMS 4
1.6 BOOK ORGANIZATION AND OVERVIEW 5
1.7 SUMMARY 7
2. CMOS DEVICE TECHNOLOGY TRENDS FOR POWER-
CONSTRAINED APPLICATIONS 9
DAVID J. FRANK
2.1 INTRODUCTION 9
2.2 CMOS TECHNOLOGY SUMMARY 11
2.2.1 Current CMOS Device Technology 11
2.2.2 ITRS Projections 13
2.3 SCALING PRINCIPLES AND DIFFICULTIES 15
2.3.1 General Scaling 16
2.3.2 Characteristic Scale Length 18
2.3.3 Limits to Scaling 20
Chapter 1
Introduction
Massoud Pedram¹, Jan Rabaey²
¹University of Southern California; ²University of California, Berkeley
Abstract: This chapter provides the motivation for power-aware design, reviews the main sources of power dissipation in CMOS VLSI circuits, points to a number of circuit- and system-level techniques that improve the power efficiency of a design, and finally provides an overview of the book's content. The chapter
concludes with a list of key challenges for designing low-power circuits or
achieving high power efficiency in designs.
Key words: Low-power design, power-aware design, low-power circuit techniques, energy
efficiency, CMOS devices, Moore’s Law, technology scaling, static power,
dynamic power, voltage scaling, power management, reconfigurable
processors, design methodologies, design tools.
1.1 INTRODUCTION
Many different metrics have been used to capture the notion of “power and
timing efficiency.” The most commonly used ones are (average) power,
power per MIPS (million instructions per second), energy, energy-delay
product, energy-delay squared, peak power, and so on. The choice of which design metric to use during circuit or system optimization is strongly dependent on the application at hand.
Aside from technology scaling, reducing only the supply voltage for a given
technology enables significant reduction in power consumption. However,
voltage reduction comes at the expense of slower gate speeds. So, there is a
tradeoff between circuit speed and power consumption. By dynamically
adjusting the supply voltage to the minimum needed to operate at an
operating frequency that meets the computational requirements of the circuit,
one can reduce the power consumption of digital CMOS circuits down to the
minimum required. This technique is referred to as dynamic voltage scaling.
Notice that the rules for analog circuits are quite different from those for digital circuits. Indeed, downscaling the supply voltage does not
automatically decrease analog power consumption.
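To make the trade-off concrete, the sketch below (Python; the alpha-power delay model and every constant in it are illustrative assumptions, not values from this chapter) bisects for the lowest supply voltage that still sustains a required clock frequency and reports the resulting dynamic power, C·Vdd²·f.

```python
# Dynamic voltage scaling sketch: find the minimum Vdd for a target frequency.
# Delay follows an alpha-power law; all constants here are illustrative.
VT, ALPHA, C_EFF = 0.3, 1.3, 1e-9  # threshold [V], exponent, switched cap [F]

def max_freq(vdd: float, k: float = 2e9) -> float:
    """Highest sustainable clock frequency at supply vdd [Hz] (assumed model)."""
    return k * (vdd - VT) ** ALPHA / vdd

def min_vdd_for(f_req: float) -> float:
    """Bisect for the lowest Vdd whose max_freq meets f_req."""
    lo, hi = VT + 0.01, 1.2
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if max_freq(mid) < f_req else (lo, mid)
    return hi

for f in (1.0e9, 0.5e9, 0.1e9):
    v = min_vdd_for(f)
    print(f"f = {f/1e9:.1f} GHz -> Vdd = {v:.2f} V, "
          f"P_dyn = {C_EFF * v * v * f:.3f} W")
```

Because power falls with both the voltage squared and the frequency, running slower at a reduced voltage is far more efficient than running fast and idling.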
It has become apparent that the voltage scaling approach is insufficient
by itself. One must also focus on advanced design tools and methodologies
that address the power issues. The list of these issues is lengthy: power grid
sensors must operate on a tiny battery for many months and must be able to
communicate wirelessly with each other. They must also be able to increase
their compute power when and if needed (performance on demand) and must dissipate nearly zero energy during long idle periods. This
scenario poses a number of unique challenges that require power-awareness
at all levels of the communication hierarchy, from the link layer to media
access to routing protocols, as well as power-efficient hardware design and
application software.
Another emerging trend in embedded systems is that they are being
networked to communicate, often wirelessly, with other devices. In such
networked systems, the energy cost of wireless communications often
dominates that of computations. Furthermore, in many networked systems,
the energy-related metric of interest is the lifetime of the entire system, as
opposed to power consumption at individual nodes. A technique that
consumes less average power but results in a high variance in power
consumption where a small number of nodes see a large energy drain is
undesirable. Conventional approaches to power efficiency in computational
nodes (e.g., dynamic power management and dynamic voltage scaling) need
to be extended to work in the context of a networked system of nodes.
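A toy calculation (Python; the node counts, battery capacity, and power draws are invented for illustration) makes the point: a schedule with lower average power can still yield a much shorter system lifetime if one relay node drains its battery quickly.

```python
# Toy comparison: average power vs. network lifetime; all numbers invented.
battery_j = 1000.0                       # per-node battery capacity [J]
uniform = [1.0] * 10                     # every node draws 1 W (illustrative)
skewed = [0.5] * 9 + [4.0]               # lower average, one heavily-used relay

for name, loads in (("uniform", uniform), ("skewed", skewed)):
    avg = sum(loads) / len(loads)
    lifetime = min(battery_j / p for p in loads)  # first node death ends service
    print(f"{name:7s}: avg power {avg:.2f} W, system lifetime {lifetime:.0f} s")
# The skewed network averages 0.85 W (less than uniform's 1.0 W) yet dies
# four times sooner, 250 s vs. 1000 s, because one node drains quickly.
```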
circuits, deal with the impact of reduced supply voltage on the power
consumption of high-speed analog to digital converters (ADC). A
comparison with the power consumption of published high-speed analog to
digital converters will also be presented.
In Chapter 6, C-W. Kim and S-M. Kang of IBM Microelectronics
Division and the University of California - Santa Cruz describe techniques to
reduce power consumption in both the clock tree and the flip-flops. Clock-
gating and logic-embedding techniques are also presented as effective
power-saving techniques, followed by a low-power clock buffer design.
In Chapter 7, H. Yasuura and H. Tomiyama of Kyushu University
introduce several design techniques to reduce the wasteful power
consumption by redundant bits in a datapath. The basic approach is datapath
width adjustment. It is shown that during hardware design, using the result
of bit-width analysis, one can determine the minimal length of registers, the
size of operation units, and the width of memory words on the datapath of a
system in order to eliminate the wasteful power consumption by the
redundant bits.
In Chapter 8, G-Y. Wei, M. Horowitz, and J. Kim of Harvard University
and Stanford University provide a brief overview of high-speed link design
and describe some of the power vs. performance tradeoffs associated with
various design choices. The chapter then investigates various techniques that
a designer may employ in order to reduce power consumption. Three link designs and link building blocks from the literature serve as examples to illustrate energy-efficient implementations of these techniques.
In Chapter 9, D. Marculescu and R. Marculescu of Carnegie Mellon
University present a design exploration methodology that is meant to
discover the power/performance tradeoffs that are available at both the
system and microarchitectural levels of design abstraction.
In Chapter 10, N. Vijaykrishnan, M. Kandemir, A. Sivasubramaniam,
and M. J. Irwin of Pennsylvania State University describe the design of
energy estimation tools that support both software and architectural
experimentation within a single framework. This chapter presents the details
of two different architectural simulators targeted at superscalar and VLIW
architectures. Finally, techniques that optimize the hardware-software
interaction from an energy perspective are illustrated.
In Chapter 11, M. Srivastava of the University of California - Los
Angeles describes communication-related sources of power consumption
and network-level power-reduction and energy-management techniques in
the context of wirelessly networked systems such as wireless multimedia and
wireless sensor networks. General principles behind power-aware protocols
and resource management techniques at various layers of networked systems
1.7 SUMMARY
REFERENCES
[1] J. Rabaey and M. Pedram, Low Power Design Methodologies, Kluwer Academic
Publishers, 1996.
Chapter 2
CMOS Device Technology Trends for Power-
Constrained Applications
David J. Frank
IBM T. J. Watson Research Center
Abstract: CMOS device technology has scaled rapidly for nearly three decades and has
come to dominate the electronics world. Because of this scaling, CMOS
circuits have become extremely dense, and power dissipation has become a
major design consideration. Although industry projections call for at least
another 10 years of progress, this progress will be difficult and is likely to be
strongly constrained by power dissipation.
This chapter describes the present state of CMOS technology and the scaling
principles that drive its progress. The physical effects that hinder this scaling
are also examined in order to show how these effects interact with the practical
constraints imposed by power dissipation. It is shown that scaling does not
have a single end. Rather, each application has a different optimal end point
that depends on its power dissipation requirements. A brief overview of some
of the novel device options for extending the limits of scaling is also provided.
2.1 INTRODUCTION
Although the basic concept of the Field Effect Transistor (FET) was
invented in 1930 [1], it was not until 1960 that it was first reduced to practice in silicon by Kahng and Atalla [2]. Since then, development has been
been very rapid. The Si MOSFET has been incorporated into integrated
circuits since the early 1970s, and progress has followed an exponential
behaviour that has come to be known as Moore’s Law [3]. The device
dimensions have been shrinking at an exponential rate, and the circuit
complexity and industry revenues have been growing exponentially.
The third section details the ideas of CMOS scaling that should enable this progress and then
discusses the physical effects that limit this scaling. The fourth section goes
into the optimization of CMOS technology for power-constrained operation
and uses this analysis to provide an estimate of how the limits of scaling
vary with application. The fifth section highlights exploratory CMOS
technology directions that may enable further scaling, and the final section is
a conclusion.
A retrograde doping profile (doping that is low at the surface and increases with
depth) reduces the transverse electric field in the channel (improving
mobility), while at the same time reducing two-dimensional effects by
improving the shielding of the drain potential from the channel region.
Shallow angled ion implantation results in super-halo doping profiles near
the source and drain regions that partially cancel 2D-induced threshold
voltage shifts, resulting in less roll-off.
The drawing in Figure 2.1 does not show the wiring levels, but the wires
are clearly essential in creating large integrated circuits, and substantial
technological progress is occurring there too. Today most of the wire is
copper because of its low resistivity and reduced electromigration. The
wire-to-wire capacitance is reduced by the use of fluorinated silicate glass
(FSG) for the insulator, with permittivity k=3.7, and also by taking
advantage of the low resistivity of copper to somewhat reduce the aspect
ratio of the wires [4]. Even lower k materials may be in use soon. It is
common practice to use a hierarchy of wiring sizes, from very fine wires at
minimum lithographic dimension on the bottom to large “fat” wires on the
top, which helps to keep wire delay under control [9].
In addition to bulk CMOS, partially-depleted silicon-on-insulator (PD-SOI) CMOS is also available [10]. As shown in Figure 2.2, this technology
is similar in many ways to bulk technology, but there are also some
important differences. The main difference is that PD-SOI CMOS is built on
a thin layer of Si, 150 nm thick, on top of an insulating layer. For
partially-depleted SOI, the silicon is thick enough that it is not fully depleted
by the gate bias (20-50 nm, depending on doping). The buried oxide (BOX)
layer is typically 150-250 nm thick and completely insulates the device layer
from the substrate. It is formed in one of two ways: (a) by heavy
implantation of oxygen into a Si substrate followed by high temperature
annealing (SIMOX) or (b) by a wafer bonding and layer transfer technique
[11]. As a result of this insulating layer, the body of the SOI device is
floating and the source- and drain-to-body junction capacitances are
significantly reduced. Both of these effects can increase digital switching
speed, although the detailed advantages depend on circuit configuration [12].
Since the body is floating, the usual bulk MOSFET body-effect
dependencies on source-to-substrate voltage are absent. These are replaced
by floating-body effects such as history-dependent body bias and increased
output conductance (kink-effect) caused by the injection of majority carriers
into the body by impact ionization of channel carriers in the high-field drain
region.
Significant changes to the MOSFET will probably be necessary to reach the outer years of the roadmap. Some of these exploratory device concepts are discussed in Section 5. As indicated in the
table, it is essential that the effective gate insulator thickness decrease very
significantly to achieve these highly scaled devices. As will be shown in the
next section, this cannot be accomplished by simply thinning the silicon dioxide, since the
tunneling leakage current would be too high. Consequently, thicker gate
insulators with a higher dielectric constant than silicon dioxide are critical in
order to reduce the tunneling current while still yielding equivalent
thicknesses down to below 1 nm. Silicon oxynitrides are the first step in this
direction, and other materials with still higher k that can satisfy all the
reliability and interface requirements are under investigation.
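The benefit of a high-k insulator can be seen from the equivalent-oxide-thickness relation EOT = t_phys · (k_SiO2 / k_high-k); the short sketch below (Python; the k = 20 film is a hypothetical example, not a material endorsed by this chapter) shows how a physically thick film can remain electrically thinner than 1 nm.

```python
# Equivalent oxide thickness (EOT) of a high-k gate stack; a minimal sketch.
K_SIO2 = 3.9  # relative permittivity of SiO2

def eot_nm(t_phys_nm: float, k_highk: float) -> float:
    """SiO2-equivalent electrical thickness of a gate dielectric."""
    return t_phys_nm * K_SIO2 / k_highk

# A hypothetical k = 20 film, 4 nm thick physically, is electrically
# equivalent to ~0.78 nm of SiO2: below the sub-1 nm target in the text,
# while staying physically thick enough to suppress direct tunneling.
print(f"EOT = {eot_nm(4.0, 20.0):.2f} nm")
```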
Another difficulty with the roadmap projections is the power dissipation,
which is increasing in future technology because of incomplete voltage
scaling and the higher subthreshold currents associated with the lower threshold voltages needed to maintain performance. It will be shown in Section 4 that the
leakage dissipation leads to optimum scaling limits that vary depending on
application. This situation is partially captured in the roadmap projections in
the assumption that low power technology will lag several years behind high
performance technology. The higher currents associated with future
technology also lead to reliability problems. Electromigration-induced
defects in wiring are a serious issue that must be addressed, as are the gate
insulator defects that are triggered by the high tunneling currents through the
insulator [13]. In addition, the extremely high currents may inhibit the use
of burn-in (the stressing of chips at high voltage and temperature to
eliminate early fails).
Although the chip sizes are not expected to increase significantly, wafer
sizes are expected to increase in order to reduce manufacturing cost. 300
mm diameter Si wafers are expected to be in use in production within the
next few years, and still larger wafers (perhaps 450 mm) are being
considered for the future.
under "generalized scaling" in the third column of Table 2.2. Increasing the
electric field necessitates increasing the amount of doping and also increases
the power dissipation, but it does reduce the need to scale the supply voltage. The main disadvantage of this form of scaling is the increased power, but another
problem is that the increasing electric field diminishes the long-term
reliability and durability of the FET. Indeed, this reliability concern forces
the use of lower supply voltages for smaller devices even when power
dissipation is not an issue [15].
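For reference, the generalized scaling rules referred to above can be summarized as follows (a sketch in the spirit of [14], with a dimensional scaling factor α > 1 and a field scaling factor ε ≥ 1; ε = 1 recovers constant-field scaling):

```latex
% Generalized scaling rules (a sketch; the exact exponents depend on the
% current model assumed, e.g. fully velocity-saturated vs. long-channel).
\begin{align*}
\text{linear dimensions } (L,\ W,\ t_{ox}) &\;\to\; 1/\alpha,\\
\text{voltages } (V_{dd},\ V_t) &\;\to\; \epsilon/\alpha,\\
\text{electric field} &\;\to\; \epsilon,\\
\text{dynamic power density} &\;\to\; \epsilon^{2}\ \text{to}\ \epsilon^{3}.
\end{align*}
```

Holding the voltage constant corresponds to setting ε = α, so the dynamic power density then rises as α², the quadratic increase discussed in Section 4.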
Super-halo doping profiles help control the threshold voltage roll-off. They can also improve the drain-induced
barrier-lowering (DIBL) curve by shifting the peak barrier in the channel
closer to the source, thus making the subthreshold current less sensitive to
drain voltage. As presently practiced, it appears that super-halo doping can lower the design point down to about 1.5 times the characteristic scale length while still maintaining a ±30% gate-length tolerance.
Another way to reach a smaller design point is to improve the processing so that the required tolerance decreases. For example, if the tolerance could be improved to <±10%, it might be possible to reduce the design point to below 1.0 times the scale length [18].
The effects that limit scaling can be broadly lumped into four categories:
quantum mechanical, atomistic, thermodynamic, and practical. There are
two types of quantum mechanical effects: confinement effects and
tunnelling effects. In conventional bulk and PD-SOI MOSFETs, quantum
confinement causes the average position of the channel carriers to be moved
a little farther from the interface. This weakens the effect of gate
insulator scaling by adding ~0.2 nm to EOT, but it is not a major concern. In
some of the novel device structures that are considered for the future,
however, confinement effects may play a more important role. At present,
quantum tunnelling of carriers through the energy barrier in the device is
generally a more important problem. This tunnelling results in leakage
current that increases power dissipation and decreases logic operating
margins.
Atomistic effects are due to the discreteness of matter. The primary
concern here is that there are a very small number of dopant atoms in a
highly scaled MOSFET, and statistical variations in the exact number of
dopant atoms can give rise to unacceptably large variations in terminal
characteristics. There may also be atomistic effects associated with
roughness scattering along the Si-insulator interface, but these have only
begun to be explored [19].
Thermodynamic effects are perhaps the most important, and they take several forms. First, the subthreshold behaviour of MOSFETs is governed by Boltzmann statistics. Because of the thermal distribution of carriers, the leakage current only falls off exponentially below the threshold voltage. The temperature of the carriers determines the subthreshold slope and thus limits the scaling of the threshold voltage, which cannot be scaled below some multiple of kT/q without incurring excessive leakage current, where k is Boltzmann's constant and T is the temperature. Since the threshold voltage is limited, supply voltage scaling is also limited. In addition, the theoretical minimum supply voltage for self-consistent logic is also determined by the subthreshold slope. The second thermodynamic effect is the leakage current that flows through the drain-to-body junction when the FET is in the “off” state. Presently this leakage primarily occurs in the form
of indirect band-to-band tunnelling through defects and deep traps in the
depletion region, which often dominates over direct tunnelling, and is a
problem in DRAM and ultra-low power circuits in which even very tiny
currents are important. But since this current is strongly dependent on the
electric field [28], it is expected that it will become problematic even for
high performance logic when the body doping reaches the 10^19 cm^-3 regime.
Since direct band-to-band tunnelling depends on conduction band states being lined up with valence band states, it can be avoided in bulk MOSFETs when Vds - Vbs is kept sufficiently small, where Vds is the drain-to-source voltage and Vbs is the body-to-source voltage. Thus, tunnelling-free operation requires forward body bias exceeding the supply voltage, Vdd. At low temperature this might be an interesting option [6], but it is unlikely that it would be applied to anything except very high-performance computing.
Finally, it is possible for current to tunnel directly from source to drain
through the channel barrier. This effect has been studied both theoretically
and experimentally and has been observed for channel lengths below 20 nm,
especially at low temperature [29]. Most recent analyses show that such
tunnelling only becomes problematic at room temperature for channel
lengths below ~10 nm [5]. Since such short channel lengths will necessarily
be associated with very high performance, high power density applications,
this extra tunnelling current should be comparatively negligible in cases of
interest.
The primary atomistic effect that may limit scaling is the discreteness of the
dopant atoms. The average concentration of doping is quite well controlled
by the usual ion implantation and annealing processes, but these processes
do not control the exact placement of each dopant. The resulting
randomness at the atomic scale causes spatial fluctuations in the local doping
concentration, resulting in device-to-device variation in MOSFET threshold
voltages. Within a few years it will be readily possible to make FETs whose
threshold voltages are controlled by fewer than 100 dopant atoms. The
uncertainty in the number of dopants, N, in any given device is expected to vary as the square root of N, in keeping with Poisson statistics, so that the fractional uncertainty, sqrt(N)/N = 1/sqrt(N), and, hence, the threshold variation, may become quite large, making the design of robust circuits very difficult. This is especially true when one considers that the large number of devices on a chip creates a statistical tail out to about 6 sigma. Since, by the same reasoning, the fractional uncertainty grows as the device area shrinks, narrow devices are most affected by this effect.
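The Poisson argument is easy to quantify; the snippet below (Python, with illustrative dopant counts) shows how quickly the fractional spread grows as the average number of dopants N shrinks.

```python
import math

# Poisson statistics of discrete dopants: sigma_N = sqrt(N), so the
# fractional spread is 1/sqrt(N); threshold variation scales similarly.
def fractional_spread(n_dopants: float) -> float:
    return 1.0 / math.sqrt(n_dopants)

for n in (1000, 100, 25):                 # illustrative average dopant counts
    one_sigma = fractional_spread(n)
    print(f"N = {n:4d}: 1-sigma spread = {one_sigma:5.1%}, "
          f"6-sigma tail = {6 * one_sigma:5.1%}")
# At N = 100 (the regime cited in the text) the 1-sigma spread is already
# 10%, i.e. a ~60% worst-case spread across the 6-sigma tail of a large chip.
```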
state degradation effects. The worst-case logic inputs must be identified, and
then the supply voltage must be adjusted so that even in the worst cases the
output state ranges are consistent with the inputs. Figure 2.8 shows an
example of using “eye” diagrams to determine the minimum fundamental
supply voltage for a simple CMOS 4-input NAND gate. Part (a) shows the
best- and worst-case bias conditions, which determine the upper and lower
bounds of the logic states. Figure 2.8(b) shows a case where Vdd is above the minimum. The logic swing is “large,” and the “eye” diagram shows a small
amount of noise margin between the lowest-switching gate with only one
input changing and the highest-switching gate with all of its inputs changing.
The output state ranges for the low and high logic levels are isolated and
self-consistent, even though the range of input states does create some
spread.
When the logic swing is reduced too far (Figure 2.8(d)), the lowest and
highest curves no longer cross, indicating that there is no self-consistent
solution for the two logic levels. The lack of a self-consistent state means that
operating a long chain of such logic gates can result in the loss of the logic
signal [6]. Figure 2.8(c) shows the minimum logic swing condition: the
lowest and highest curves are exactly tangent at their intersection points (and
the noise margin is exactly zero).
Using this type of minimum logic swing condition, other logic families
and fan-ins have also been evaluated, and the minimum supply voltage is
found to vary roughly as ln(FI) for conventional devices in their
exponential regime, where FI is the fan-in. Since the lowest voltage results
occur for FETs in their subthreshold regime, where they present their
If threshold voltage were scaled according to the constant field scaling rules
in Table 2.2, power density due to the dissipation of the dynamic switching
energy (for irreversible computation) would remain constant but power
density due to subthreshold leakage would rise exponentially. On the other
hand, if one halts the scaling of voltage to prevent increasing subthreshold dissipation (in effect setting ε = α in the generalized scaling rules), then the power density associated with dynamic switching rises quadratically with scaling. Consequently, if providing power or removing heat is costly or inconvenient, then the thermodynamically determined subthreshold MOSFET behavior forces one into an optimization situation. The optimum Vdd and Vt for a given application need to be set so as to minimize the
power dissipation while providing the desired speed. This sort of
optimization has been well studied, especially in the low power regime
[41][39][40], where the effects of process and supply variations are quite
important. The results of a study by Frank, et al. [40] are shown in Figure
2.10 as an example. Each point in the figure represents an independent
optimization of both the supply voltage and the threshold voltage. These
results, which include realistic tolerances, illustrate the dependence of the
optimum design points on activity factor and logic depth. As can be seen,
the optimum voltage can readily drop below 1 V and can even approach 0.5
V under some circumstances. These particular optimizations are for
static CMOS arithmetic circuits, but the optimal voltages are not expected to
vary much as technology is scaled (assuming the delay target is also scaled).
Since the optimum voltages depend strongly on activity factor and logic depth, a wide range of Vdd and Vt values are needed to satisfy the requirements
of a range of applications. Note that these supply voltages are much larger
than the theoretical minimum supply voltages for subthreshold logic
discussed in the previous section.
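The flavor of such an optimization can be captured in a few lines (Python; the subthreshold swing, leakage prefactor, capacitance, and the alpha-power delay model are all assumed, illustrative values rather than the realistic tolerances used in [40]): minimize dynamic-plus-leakage power over a (Vdd, Vt) grid subject to a cycle-time constraint.

```python
import numpy as np

# Toy joint (Vdd, Vt) optimization in the spirit of [40]; all constants assumed.
S = 0.085                 # subthreshold swing [V/decade]
I0, K = 1e-4, 1e-4        # leakage prefactor [A], drive-current factor
C = 1e-15                 # switched capacitance per gate [F]
ACTIVITY, LOGIC_DEPTH, F_CLK = 0.1, 20, 1e9

def gate_delay(vdd: float, vt: float) -> float:
    """Alpha-power-law delay model (alpha = 1.3, assumed)."""
    return C * vdd / (K * (vdd - vt) ** 1.3)

best = None
for vdd in np.arange(0.3, 1.5, 0.01):
    for vt in np.arange(0.05, vdd - 0.05, 0.01):
        if LOGIC_DEPTH * gate_delay(vdd, vt) > 1.0 / F_CLK:
            continue                      # violates the cycle-time target
        power = ACTIVITY * C * vdd**2 * F_CLK + I0 * 10 ** (-vt / S) * vdd
        if best is None or power < best[0]:
            best = (power, vdd, vt)

p, vdd, vt = best
print(f"optimum: Vdd = {vdd:.2f} V, Vt = {vt:.2f} V, P = {p:.2e} W/gate")
```

With these assumed constants the optimum lands near Vdd = 0.5 V, in line with the range Figure 2.10 reports for low-activity designs.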
There are many aspects to this optimization analysis that deserve comment,
but only a few can be discussed here: the question of whether the power
targets are achievable, considerations involved in mixing high performance
FETs with lower performance FETs, and some comments about the
uncertainties of the calculations. For discussion of various other issues, see
[6] and [17].
Most of the optimizations in Table 2.3 assume that 60-70% of the power
is dissipated by the switching activity of the circuitry. At the very highest
power density this requires extremely active, heavily loaded
circuits, such as clock drivers, data bus drivers, or off-chip I/O drivers.
Random logic circuits made from these most-scaled FETs would be unlikely
to use so much dynamic power, so it is likely that the fraction of static power
in such circuits would be much higher, perhaps with static power comparable to or even exceeding the dynamic power.
Moving down the power scale to less aggressive technology, it should be
relatively easy for even low power technology to reach the needed active power densities. For circuits at the low power end of the design
space, the challenge is to get the active power down to the required levels.
This is primarily a matter of circuit and system design and is largely the
subject of this book. Some of the more obvious approaches include the
following. (1) Since chips consist of a mixture of circuit blocks with varying
activity, one can average the more active circuits over the less active areas
and over large areas of lower dissipation SRAM or DRAM, thus reducing
the overall power density as much as an order of magnitude. (2) The clock
frequency can be reduced to just barely meet the throughput requirements,
which may enable a further reduction in Vdd, although Vdd cannot be made too low.
The implant used for the third well has a substantial lateral spread, but it may well be
very useful on a macro-to-macro scale.
2.5.2 Strained Si
The next more complex exploratory device structure is the fully-depleted
SOI (FD-SOI) MOSFET, which is illustrated in Figure 2.11(b). When
compared to Figure 2.2, one can see that FD-SOI is very similar to PD-SOI
except that the Si layer is much thinner. Typically, the Si layer should be
less than about half the depletion depth of a corresponding bulk device, to
guarantee that the layer remains fully depleted over the full range of gate
voltage. Under these circumstances, the floating-body effect of PD-SOI is
almost entirely eliminated except at very high drain voltages [8].
FD-SOI has long been studied because of its potential advantages over
bulk technology [49]. Various investigators have shown, however, that FD-
SOI has fairly poor scaling characteristics because there are no carriers or
conductors on the back side to screen the drain electric field [51] [50].
Recent simulations also indicate that it has worse short channel effects than
double-gate MOSFETs with the same Si thickness, as shown in Figure 2.13.
To achieve the same roll-off characteristics as DG-FETs, the FD-SOI Si
layers must be reduced to less than half the thickness of the DG-FET layers.
Nevertheless, FD-SOI does have some advantages compared to PD-SOI.
Floating body effects are eliminated, making circuit design easier. Parasitic
drain capacitance is reduced because the depth of the drain-to-body junction
is greatly reduced. The subthreshold slope is improved, making it possible
to scale Vt and Vdd further. For example, recent experiments on 50 nm gate
shorter interconnects and lower wiring capacitance, all of which are useful
for low energy computing [17].
Will some type of double-gate device eventually supplant bulk CMOS?
It is difficult to predict at this point in time. Discrete doping issues may very
well prohibit bulk designs below 20 nm, which greatly increases the DG-
FETs’ advantage. On the other hand, DG-FET design points below 20 nm
probably require halo-like roll-off compensation and metal gates with suitable workfunctions to set Vt, neither of which is a known process.
Furthermore, DG-FET currents are likely to be degraded because most
geometries are expected to suffer from self-heating effects, like other SOI
devices.
There is one more technology option that should be considered for high
performance computing. This is the possibility of running high performance
processors at low temperatures, perhaps 100-150 K. This option does not
require significant device modifications, yet it addresses many of the issues
that limit conventional scaling. First, the threshold voltages should be able to scale with the operating temperature T, since the subthreshold swing scales with T; according to Eq. 4, this would keep the off-current constant. As a result, the supply voltages can also scale. Following this type of scaling, dynamic power dissipation varies as T^2, while the energy required for ideal refrigeration only varies as (T_R - T)/T per unit of heat removed, where T_R is the temperature at which heat leaves the system, e.g., ~350 K. Thus, even taking into account the
inefficiency of real heat pumps, it should be possible to break even on the
total room temperature power dissipation.
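A back-of-envelope check of this break-even claim (Python; the 100 W baseline, the 30%-of-Carnot heat-pump efficiency, and the simple V ∝ T power model are all assumed): chip power scales as T², the ideal refrigerator spends (T_R - T)/T joules per joule of heat removed, and the totals below indeed trend downward as T drops.

```python
# Rough energy accounting for low-temperature CMOS; all constants assumed.
T0, T_R = 300.0, 350.0     # baseline operating temp and heat-rejection temp [K]

def total_power(t: float, p_chip_300k: float = 100.0,
                carnot_fraction: float = 0.3) -> float:
    """Chip power plus refrigerator input power [W].

    carnot_fraction: fraction of the ideal coefficient of performance that a
    real heat pump achieves (an assumed, typical value).
    """
    p_chip = p_chip_300k * (t / T0) ** 2          # dynamic power with V ~ T
    refrigeration = p_chip * (T_R - t) / t        # ideal (Carnot) pump work
    return p_chip + refrigeration / carnot_fraction

for t in (300, 200, 150, 100):
    print(f"T = {t:3d} K: chip + cooling = {total_power(t):6.1f} W")
```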
Furthermore, low temperature improves the mobility of the transistors
and lowers the resistance of the wires, both of which increase performance.
The use of lower voltage supplies would also lower the tunnelling current
through the gate insulator significantly, which would enable further scaling.
Another advantage would be that the reliability of the circuits, which is a
great concern for future technologies, would be greatly enhanced at low
temperature, since most failure mechanisms are at least partially thermally
activated and would therefore be highly suppressed. At low temperature,
DRAM retention time would probably increase so much that it could be
treated as non-volatile, possibly enabling different types of memory design.
Finally, for bulk devices low temperature might enable the use of forward
body bias, which could lower the transverse field, improving the mobility,
and shrink the depletion depth, enabling further scaling.
2.6 SUMMARY
ACKNOWLEDGEMENT
This work has benefited greatly from many useful discussions with co-
workers and colleagues, including Bob Dennard, Wilfried Haensch, Ken
Rim, Ed Nowak, Paul Solomon, Yuan Taur, and H.-S. Philip Wong.
REFERENCES
[1] J. E. Lilienfeld. Method and apparatus for controlling electric currents. U.S. Patent
1745175, 1930.
[2] D. Kahng and M. M. Atalla, “Silicon–silicon dioxide field induced surface devices,”
Presented at IRE Solid-State Device Res. Conf., Pittsburgh, PA, June 1960.
[3] P. K. Bondyopadhyay, “Moore’s law governs the silicon revolution,” Proc. IEEE, 86, pp. 78-81, Jan. 1998.
[4] Semiconductor Industry Association (SIA). International Technology Roadmap for
Semiconductors, 2001 Edition. Austin, Texas: SEMATECH, USA., 2706 Montopolis
Drive, Austin, Texas 78741, USA (http://public.itrs.net), 2001.
[5] Y. Naveh and K. K. Likharev, “Modeling of 10-nm-scale ballistic MOSFETs,” IEEE
Elec. Dev. Lett., 21, pp. 242-244, 2000.
[6] D. J. Frank, R. H. Dennard, E. Nowak, P. M. Solomon, Y. Taur and H.-S. P. Wong,
“Device scaling limits of Si MOSFETs and their application dependencies,” in Proc.
IEEE, 89, pp. 259-288, 2001.
[7] Y. Taur, D. Buchanan, W. Chen, D. Frank, K. Ismail, S.-H. Lo, G. Sai-Halasz, R.
Viswanathan, H.-J. C. Wann, S. Wind and H.-S. Wong, “CMOS scaling into the
nanometer regime,” in Proc. IEEE, 85, pp. 486–504, April 1997.
[8] H.-S. P. Wong, D. J. Frank, P. M. Solomon, H.-J. Wann and J. Welser, “Nanoscale
CMOS,” in Proc. IEEE, 87, pp. 537-570, 1999.
[9] G. A. Sai-Halasz, “Performance trends in high-end processors,” in Proc. IEEE, 83, pp. 20, Jan.
1995.
[10] J. W. Sleight, P. R. Varekamp, N. Lustig, J. Adkisson, A. Allen, O. Bula, X, Chen, T.
Chou, W. Chu, J. Fitzsimmons, A. Gabor, S. Gates, P. Jamison, M. Khare, L. Lai, J. Lee,
S. Narasimha, J. Ellis-Monaghan, K. Peterson, S. Rauch, S. Shukla, P. Smeys, T.-C. Su,
J. Quinlan, A. Vayshenker, B. Ward, S. Womack, E. Barth, G. Blery, C. Davis, R.
Ferguson, R. Goldblatt, E. Leobandung, J. Welser, I. Yang and P. Agnello, “A high
performance SOI CMOS technology with a 70 nm silicon film and with a
second generation low-k Cu BEOL,” In IEDM Tech. Dig., pp. 245-248, 2001.
[11] A. J. Auberton-Hervé, “SOI: materials to systems,” In 1996 IEDM Tech. Dig., pp. 3, 1996.
[12] R. Puri and C. T. Chuang, “SOI digital circuits: design issues,” In Thirteenth Int. Conf.
VLSI Design, 2000., pp. 474 -479, 2000.
[13] J. H. Stathis, “Physical and predictive models of ultrathin oxide reliability in CMOS devices and circuits,” IEEE Trans. Device and Materials Reliability, 1(1), pp. 43-59, March 2001.
[14] R.H. Dennard, F.H. Gaensslen, H.N. Yu, V.L. Rideout, E. Bassous and A.R. LeBlanc,
“Design of Ion-implanted MOSFETs with very small physical dimensions,” Jour. Solid
St. Circuits, SC-9, pp. 256-268, 1974.
[15] B. Davari, R. H. Dennard and G. G. Shahidi, “CMOS scaling, the next ten years,” In Proc. IEEE, 83, pp. 595-606, 1995.
[16] D. J. Frank, Y. Taur and H.-S. P. Wong, “Generalized scale length for two-dimensional
effects in MOSFET's,” IEEE Elec. Dev. Lett., 19, pp. 385-387,1998.
[17] D. J. Frank, “Power-constrained CMOS scaling limits,” IBM J. Res. Devel., 46(2/3),
March/May 2002.
[18] P. M. Solomon and I. J. Djomehri, “Overscaling, design for the future,” IBM Research
Report, RC22379, Jan. 2002.
[19] A. Asenov, S. Kaya and J. H. Davies, “Intrinsic threshold voltage fluctuations in decanano
MOSFETs due to local oxide thickness variations,” IEEE Trans. Electron Devices,
49(1), pp. 112 -119, Jan. 2002.
[20] W. Athas, N. Tzartzanis, W. Mao, L. Peterson, R. Lal, K. Chong, Joong-Seok Moon, L.
Svensson and M. Bolotski, “The design and implementation of a low-power clock-
powered microprocessor,” IEEE J. Solid-State Circuits, 35(11), pp. 1561 -1570, Nov.
2000.
[21] D. J. Frank, “Comparison of high speed voltage-scaled conventional and adiabatic
circuits,” In 1996 Int. Symp. Low Power Electronics and Design (ISLPED), Digest of
Tech. Papers, pp. 377, 1996.
[22] S. M. Sze. Physics of Semiconductor Devices, 2nd Edition. John Wiley & Sons, 1981.
[23] S.-H. Lo, D.A. Buchanan, Y. Taur and W. Wang, “Quantum-mechanical modeling of
electron tunneling current from the inversion layer of ultra-thin-oxide nMOSFET's,”
IEEE Electron Dev. Lett., 18, pp. 209, 1997.
[24] J. Robertson, “Band offsets of wide-band-gap oxides and implications for future
electronic devices,” J. Vacuum Science and Technology B, 18(3), pp. 1785-1791,2000.
[25] M. Fischetti, D. Neumayer and E. Cartier, “Effective electron mobility in Si inversion
layers in MOS systems with a high-k insulator: the role of remote phonon scattering," J.
Appl. Phys., 90(9), pp. 4587, 2001.
[26] D. Barlage, R. Arghavani, G. Dewey, M. Doczy, B. Doyle, J. Kavalieros, A. Murthy, B.
Roberds, P. Stokley and R. Chau, “High-frequency response of 100nm integrated CMOS
transistors with high-k gate dielectrics,” In IEDM Tech. Dig., pp. 231-234, 2001.
[27] H.-S. P. Wong, “Beyond the Conventional Transistor,” IBM J. Res. Devel., 46(2/3),
March/May 2002.
[28] Y. Taur, C. H. Wann and D. J. Frank, “25 nm CMOS Design Considerations,” In IEDM
Tech. Dig., pp. 789-792, 1998.
[29] H. Kawaura, T. Sakamoto and T. Baba, “Direct source-drain tunneling current in
subthreshold region of sub-10-nm gate EJ-MOSFETs,” In 1999 Si Nanoelectronics
Workshop Abstracts, pp. 26-27, 1999.
[30] A. Asenov and S. Saini, “Random dopant fluctuation resistant decanano MOSFET
architectures,” In 1999 Si Nanoelectronics Workshop Abstracts, pp. 84-85, June 1999.
[31] D. J. Frank, Y. Taur, M. Ieong and H.-S. P. Wong, “Monte Carlo modeling of threshold
variation due to dopant fluctuations,” In Symp. VLSI Technol., pp. 169-170, 1999.
[32] H.-S. P. Wong, Y. Taur and D. J. Frank, “Discrete random dopant distribution effects in nanometer-scale MOSFETs,” Microelectronics Reliability, 38, pp. 1447-1456, 1998.
[33] H.-S. P. Wong and Y. Taur, “Three-dimensional ‘atomistic’ simulation of discrete random dopant distribution effects in sub-0.1 µm MOSFETs,” In IEDM Tech. Dig., pp. 705-708, 1993.
[34] E. Buturla, J. Johnson, S. Furkay and P. Cottrell, “A new 3-D device simulation formulation,” In NASCODE VI: Sixth International Conf. on the Numerical Analysis of Semiconductor Devices and Integrated Circuits, Boole Press, Dublin, pp. 291, 1989.
Chapter 3
Low Power Memory Design
Abstract: This chapter describes techniques and issues for power aware design of
memories. The focus is on non-volatile flash memories, non-volatile
ferroelectric memories and embedded DRAMs, which are becoming
increasingly important in the Wireless/Internet era.
Key words: Semiconductor memory, low power memory design, DRAM, SRAM, flash
memory, FeRAM.
3.1 INTRODUCTION
Since the flash concept was first reported by Masuoka et al. [4] at the 1984
IEDM, several different cell variations and many circuit techniques have
been developed and commercialized, and in turn, applications for flash
memories have proliferated greatly. The flash cells, e.g., NOR [5], NAND
[6], DINOR [7], AND [8], SanDisk cell [9], and SST cell [10], are divided into two groups: one aims at fast random access, and the other aims at high bit density. A NOR flash cell is typical of the former group. Figure
3.3(a) illustrates a cross-sectional view of a stacked gate flash memory cell.
The memory cell has no serially-connected transistors and a relatively large
cell current of 50-100 µA for its high-speed sensing operation. Early
applications of the NOR flash were for program code storage in PC BIOS,
disk drives, automotive engines, and so on, which require a random access
time of less than 100ns. Recently, the NOR flash has also been used in
handheld digital equipment, e.g., cellular phones, messaging pagers, and
flash-embedded logic devices. In addition to a fast access time, the NOR
flash memories are required to have low-power consumption for longer
battery life and low-voltage operation in accordance with the lowering of the
minimum supply voltage of the other logic and analog devices mounted on
the same board or merged on the same chip. Figure 3.1 shows the trend of
the supply voltage of flash memories. Originally, NOR-based flash devices
had two supply voltages: 5V for read and 12V for program and erase [5]. In
1989, the first 5V-only NOR flash memory was introduced [11], and the
2.7V-only NOR flash was reported seven years later [12]. Currently, 1.8V-
only NOR flash memories [13] are used for 1.8V systems.
On the other hand, a NAND flash cell, as shown in Figure 3.6(a), is a
typical cell aimed at high-bit density. The NAND string has two select
transistors, one source line, and one bit-line contact for eight to thirty-two
series-connected flash cells [6]. Thus, the NAND flash memory has the
smallest cell size, and so the bit-cost is the lowest among flash memories.
Although the target of the NAND flash memory was originally replacement
of magnetic hard disks and floppy disks [6], recently other applications have
expanded, and NAND flash memories are used in digital cameras and solid-
state audio players. Using data compression techniques such as MPEG-1
Audio Layer 3 (MP3), flash memory cards supporting 64MB capacity can
store 1-2 hours of CD-quality music. Digital cameras and silicon audio
players utilize flash memory cards such as SmartMedia™ on which only one
or two NAND flash memory chips are mounted, and SD Card™ on which a
flash controller and some flash memories are mounted. In accordance with
the lowering of the supply voltage of flash controller devices to sub-2V, the
NAND flash memory and the NOR flash memory are required to operate at
the same supply voltage as that of the controller devices for simplicity in the
low-power systems.
This section reviews several control schemes and circuits peculiar to
flash memories, such as charge pump circuits, level shifters, and sense amplifiers, for low-power NOR and NAND flash memories, as shown in Figure 3.2.
The threshold voltages of the erased cells are lowered. The sense amp senses the memory
cell data by means of the cell current. When the cell current flows through
the accessed memory cell, the sense amp outputs the data 1. On the other
hand, when the NOR cell turns off at the word-line voltage of about 5V, the
sense amp outputs the data 0. The program and erase voltages can hardly be scaled with the supply voltage. This is because the reliability of flash memory cells strongly depends on the tunnel oxide thickness and the inter-poly dielectric thickness, which therefore cannot be reduced much further [15]. As will be
discussed in the charge pump section, the power efficiency for charge pump
circuits, which generate voltages higher than the supply voltage on chip for
reading, programming, and erasing data stored in the flash memories,
decreases in accordance with the lowering of the supply voltage under the
condition of the constant cell operation voltages. Therefore, it is very
important to reduce the operation voltages and load currents from the
viewpoint of power. Several low-power design techniques for NOR flash
memories are reviewed below.
Split-gate flash memories can have much higher program efficiency than standard
NOR flash memories [18].
Optimization
Figure 3.10(a) shows the Dickson charge pump circuit [28]. The equivalent
circuit is illustrated in Figure 3.10(b) and presents the dynamic characteristic
[29]. In order to design the charge pump circuit, it is necessary to have an
optimization theory that takes into consideration the dynamics of the circuit
to accelerate the rise time of the output voltage even at a low supply voltage
[29]. Another optimization is required to obtain the maximum output current
for a given circuit area [30]. The former optimization is applied to the
programming word-line charge pump and the erase voltage generating
charge pump for both NOR and NAND flash memories and the latter is
applied to the word-line pump for reading and the bit-line pump for
programming in NOR flash memories. Figure 3.10(c) summarizes the
optimized number of stages and the required capacitance per stage for a
required rise time for the former optimization and for a required output
current for the latter optimization.
Figure 3.11 shows the dependence of the rise time, current consumption,
and power on the boosted voltage and supply voltage for the charge pump
circuit [29]. The power is proportional to the number of stages, which, in
turn, is inversely proportional to Vdd-Vt, where Vt is the threshold voltage
of the transfer transistor. Therefore, the threshold voltage degrades the
power efficiency of the charge pump circuit for low-voltage flash memories.
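The standard Dickson model makes this dependence explicit; the sketch below (Python; the 0.8 V transfer-transistor threshold, 20 MHz clock, 10 pF stage capacitors, 10 µA load, and 10 V target are all illustrative numbers) shows how the required number of stages, and hence the pump power, grows as Vdd approaches Vt.

```python
# Dickson charge pump steady state (ideal, no stray capacitance), after [28][29].
def dickson_vout(vdd: float, vt: float, n_stages: int,
                 i_load: float, f_clk: float, c_stage: float) -> float:
    """Output voltage: input diode plus n stages, each degraded by Vt and droop."""
    per_stage = (vdd - vt) - i_load / (f_clk * c_stage)
    return (vdd - vt) + n_stages * per_stage

def stages_needed(vdd: float, v_target: float = 10.0) -> int:
    n = 1
    while dickson_vout(vdd, vt=0.8, n_stages=n, i_load=10e-6,
                       f_clk=20e6, c_stage=10e-12) < v_target:
        n += 1
    return n

for vdd in (3.3, 2.5, 1.8):
    print(f"Vdd = {vdd} V -> {stages_needed(vdd)} stages for a 10 V level")
```

Since the input power is roughly proportional to the number of stages, this reproduces the 1/(Vdd-Vt) efficiency degradation described above.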
Low voltage charge pump
In order to reduce the effective threshold voltage of the transfer gate, a four-
phase clock pump (Figure 3.12) [31] and floating-well pump [32] have been
developed. In the four-phase clock pump, gate overdrive can be realized so
that the transfer gate operates in the linear region. In the floating-well pump, the well of the transfer transistors is in a floating state, so that the transfer
transistors do not suffer from the body effect. A disadvantage of these charge
pump circuits is the low clock frequency compared with that of the
conventional two-phase clock Dickson pump.
Pool method for read voltage generation and low standby-power system
For a 3V-only NOR flash memory, a capacitor-switched booster circuit or
kicker scheme was used to generate a read voltage higher than the supply
voltage by switching the connection state of one or more boosting capacitors
with the load capacitor from parallel to series, synchronized with the address
transition detection (ATD) signal, as illustrated in the first column of Figure
3.13 [33].
For sub-2V or lower supply voltage flash memories, the Dickson charge
pump circuit is better than the capacitor-switched booster circuit from the
viewpoint of active power [30]. The only disadvantage of the Dickson pump
compared with the capacitor-switched booster is the high standby current. In
order to reduce the standby current to a sufficiently low level, the detection current and the operation current in the detector have to be of the order of 100 nA, even for a Dickson pump. Under such low current conditions, the
conventional negative feedback system cannot be used because the operation
speed in the detector is so slow that the word-line voltage cannot be stably
controlled. A low standby-power system was developed as illustrated in the
third column of Figure 3.13. The standby pump operates until the counter
counts a previously determined number, so as not to overshoot the word-line voltage, while drawing a very low standby current [35].
The conventional level shifter cannot operate at low supply voltages because of the small current flowing through its PMOS transistors due to the drastic reduction in gate overdrive, whereas the low-voltage low-level shifter can operate even at 1V. The low-voltage low-level shifter is composed of three parts: the latch holding the negative erasing voltage, two coupling capacitors
connected with the latched nodes, and the drivers inverting the latch. The
drivers can have sufficient driving currents to invert the latch via the
coupling capacitors. Other types of level shifters were proposed in [37].
transistor, which prevents the leakage current from flowing in the level
shifter during the inactive state. A diode-connected intrinsic transistor
without channel implantation is used to improve the positive-feedback
efficiency of the local booster when selected for operation. However, the
conventional level shifter cannot operate at a supply voltage below 2.4V, as
shown in the figure. The figure also illustrates a low-voltage NMOS high-
level shifter composed of only low-Vt high-voltage transistors, which
virtually eliminates leakage current from the boosted voltage and which
operates even at a supply voltage of 1.4V [27].
of the bit-line when the bit-line is discharged by the cell current in the case of 1-data. Because the bit-line keeps the precharged level in the case of 0-data, the clamp transistor is in an off-state, resulting in a high voltage at the sense node. In this sense amp, the voltage swing of the bit-line can be reduced so that fast random access can be achieved [41].
As shown in the charge pump section, unless the internal high voltages for
read, program, and erase operations are scaled with the supply voltage, the
power in the power system would increase (instead of decrease) due to
degradation in power efficiency. On the other hand, the power dissipated in
the low-voltage logic gates and the internal and external buses can be reduced in proportion to the square of the supply voltage. The dominant component of the power in the program and erase operations is the former, whereas that in the read operation is the latter. Therefore, the effectiveness of the supply voltage
reduction on the power reduction depends on the duty factor of read cycles
to rewrite cycles. As shown in Figure 3.21, the power for both NOR and
NAND flash memories can be reduced for the standard use of the NOR and
NAND flash memories with a read duty of greater than 50% [27].
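A crude model of this duty-factor argument (Python; the reference powers and the assumed 1/Vdd pump-efficiency penalty are invented for illustration) reproduces the qualitative break-even near a 50% read duty:

```python
# Average flash power vs. supply voltage, split into read and rewrite parts.
# Read/logic power improves as Vdd^2; pump-dominated rewrite power worsens
# roughly as 1/Vdd (assumed), since the pump efficiency degrades at low Vdd.
def avg_power_mw(read_duty: float, vdd: float, vdd_ref: float = 3.3,
                 p_read_ref: float = 20.0, p_rw_ref: float = 20.0) -> float:
    p_read = p_read_ref * (vdd / vdd_ref) ** 2
    p_rewrite = p_rw_ref * (vdd_ref / vdd)
    return read_duty * p_read + (1.0 - read_duty) * p_rewrite

for duty in (0.9, 0.5, 0.1):
    p18, p33 = avg_power_mw(duty, 1.8), avg_power_mw(duty, 3.3)
    verdict = "wins" if p18 < p33 else "loses"
    print(f"read duty {duty:.0%}: {p18:5.1f} mW @1.8V vs {p33:5.1f} mW @3.3V "
          f"-> low voltage {verdict}")
```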
The ferroelectric crystal has two stable states for the position of the central Ti/Zr atom: an upper site or a lower site. The position of the central atom can be flipped by applying an external electric field, and the position is unchanged even when the applied electric field is turned off. Information is stored as a non-volatile remanent polarization. As shown
in Figure 3.23, the polarization has a hysteresis, and there are two stable states, “0” and “1,” at an applied voltage of 0. As those two states are “stable,” even when the access transistor is turned on, no free charge is read. In order to read the stored polarization in the film, application of a voltage to the capacitor is necessary, in contrast to a DRAM read operation.
Figure 3.24 shows the read operation of FeRAM. In the read operation,
(1) the bit-line is floating and precharged to 0V. Then (2) the word line is
turned on and (3) the plate line is driven to Vdd to apply a voltage to the ferroelectric capacitor. Just after the plate line drive, nearly the full Vdd appears across the capacitor. Then charge is shared by the ferroelectric capacitor and the bit-line. Therefore, the read-out signal can be obtained as the intersection between the ferroelectric capacitor’s hysteresis curve and the load line of the bit-line capacitance (a line with negative slope). If the data is
“1,” the read-out voltage is higher than that for “0.” In 2T-2C mode, which
stores 1 bit using a pair of 1T-1C cells connected to complimentary bit-lines
as shown in Figure 3.25, the signal is the difference between the “0” and
“1.” In 1T-1C mode, 1 bit is stored in a single 1T-1C cell, so a reference voltage must be supplied in between the “0” and “1” levels, and accordingly the signal is half or less of that in the 2T-2C mode. As shown in Figure 3.24, the stored “1”
data is destroyed in the read cycle; therefore a restore operation is necessary.
Figure 3.26 shows the restore and write operation. When the
read operation is finished, the plate line is activated; therefore the “0” data
can be restored to the cell. However, for the “1” cell, the voltage applied to
the cell capacitor is 0V, because plate line level is “H” and the bit-line (Cell
capacitor node) is also “H”. Therefore in order to restore “1” data, the plate
line should be pulled down. Thus, in the FeRAM restore cycle, the “0” data restore and the “1” data restore are done with separate timing. The write operation is similar to the restore. In the write operation, the data on the bit-lines are forced by the write buffers via the data bus.
For stable operation at low supply voltage, it is essential to get a larger read-
out signal. As described in the previous section, the read-out signal depends
on the bit-line capacitance and hysteresis curve of the cell capacitor. In
DRAM where the cell capacitor has linear capacitor characteristics, the read-
out signal linearly decreases with the bit-line capacitance. Therefore, the
smaller bit-line capacitance is always better from the viewpoint of the signal
magnitude. However, with the non-linear hysteresis characteristics of the
ferroelectric capacitor, there is an optimal bit-line capacitance and cell
capacitance ratio. Figure 3.27 shows an example of the dependence of the read-out signal on the bit-line capacitance. The read-out signal has a peak value around a bit-line capacitance of 150 fF in this particular case. As shown, it is
important to optimize the bit-line capacitance in the low voltage FeRAM
design.
Theoretically, the entire cell plate can be driven at the same time. This
common plate scheme is very area efficient, but there are two problems.
First, the cell plates of unselected cells are also driven. Then the voltage of
the storage node of the unselected cells is boosted, but, due to the parasitic
capacitances of the storage node, the voltage cannot be as high as the cell
plate voltage. So the unselected cells experience a disturbance voltage across
As shown in the previous section, the speed of a FeRAM with a driven cell
plate is nearly determined by the cell plate drive time. The non-driven cell
plate line scheme, which is similar to the DRAM operation, was proposed in
[50]. The cell plate is set to a stable 1/2 Vdd, as shown in Figure 3.31. The bit line is precharged to 0V. In the read operation, when the word line of the selected cell is turned on, a 1/2 Vdd bias is applied to the target cell, and the signal can be read out to the bit line. A drawback of the non-driven cell
plate line scheme is that it requires refresh cycles. As shown in Figure 3.32, the word lines of the unselected cells are cut off, and the storage nodes of the unselected cells are then discharged to the p-well bias of 0V by the leakage current of the diffusion layer. A plate line voltage of 1/2 Vdd is then applied across the cell capacitors, which degrades the cell data. So, just like DRAM cells, these cells need to be refreshed at some intervals. This reduces FeRAM’s inherent advantage of non-volatility.
Combining DRAM and logic gates on one chip has been pursued since the 1980s. However, the first realistic demonstration was Toshiba’s 72K gate array with 1 Mbit DRAM [53]. Since then, intensive work has been carried out in this area.
access transistor can be turned on earlier than when “H” data is stored in the
cell [59].
The Vdd/Vss hybrid precharge scheme [60] was proposed from the power-aware design point of view. Figure 3.37 shows the proposed array and operation waveforms. The sense amplifiers are divided into two groups: one is precharged to Vdd and the other is precharged to Vss. In the precharge cycle, before the bit lines are fully precharged, the two groups of sense amplifiers are connected so as not to waste charge. Then the two groups of sense amplifiers are disconnected and are fully precharged to Vdd and Vss, respectively.
Figure 3.38 compares the sensing time and the bit-line charging and discharging power consumption of the half-Vdd precharge and the Vdd/Vss hybrid precharge schemes. As seen in the figure, at low supply voltages the half-Vdd precharge scheme becomes marginal, while the Vdd/Vss hybrid precharge scheme is favourable from the viewpoint of power consumption when compared with the full-Vdd precharge scheme.
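The saving can be estimated with a one-line charge budget (Python; the bit-line capacitance, voltage, and array size are illustrative): letting the Vdd- and Vss-precharged groups share charge first means the supply only has to deliver the second half of the swing.

```python
# Charge drawn from the supply per precharge cycle; constants are illustrative.
C_BL, VDD, N_LINES = 200e-15, 1.5, 1024  # per-line cap [F], supply [V], lines/group

def supply_charge(hybrid: bool) -> float:
    if hybrid:
        # After shorting the two groups, every line sits near Vdd/2 for free;
        # the Vdd group is then topped up over only half the swing.
        return N_LINES * C_BL * (VDD / 2)
    # Conventional precharge: fully discharged lines are charged 0 -> Vdd.
    return N_LINES * C_BL * VDD

q_conv, q_hyb = supply_charge(False), supply_charge(True)
print(f"conventional: {q_conv*1e9:.2f} nC, hybrid: {q_hyb*1e9:.2f} nC "
      f"({1 - q_hyb/q_conv:.0%} less supply charge)")
```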
3.5 SUMMARY
This chapter has reviewed power-aware memory design techniques. The operation voltage has almost reached 1.5V. For further operation-voltage reduction there are many challenges, including device physics limitations such as the threshold voltage limit [61] and the gate dielectric tunnelling current limit. There is quite a lot of work to be done in this area.
REFERENCES
[1] J. M. Rabaey and M. Pedram, Low Power Design Methodologies, Kluwer Academic Publishers, pp. 201-251, 1996.
[2] K. Itoh et al., “Trends in Low-Power RAM Circuit Technologies,” in Proc. IEEE, vol. 83, pp. 524-543, 1995.
[3] D. D. Buss, “Technology in the Internet Age,” ISSCC Digest of Technical Papers, pp. 18-
19, 2002.
[4] F. Masuoka et al., “A new flash EEPROM cell using triple polysilicon technology,”
Technical Digest of IEDM, pp. 464-7,1984.
[5] V. N. Kynett et al., “An in-system reprogrammable 256K CMOS Flash memory,” ISSCC
Digest of Technical papers, pp. 132-3, 1988.
[6] F. Masuoka et al.,“New ultra high density EPROM and flash EEPROM cell with NAND
structure cell,” Technical Digest of IEDM, pp. 552-5, 1987.
[7] H. Onoda et al., “A novel cell structure suitable for a 3V operation, sector erase Flash
memory,” Technical Digest of IEDM, pp. 599-602, 1992.
[8] H. Kume et al., “A contactless memory cell technology for a 3V only 64Mb
EEPROM,” Technical Digest of IEDM, pp. 991-3,1992.
[9] S. Mehrotra et al., “Serial 9Mb Flash EEPROM for solid state disk applications,”
Symposium on VLSI Circuits, pp. 24-5, 1992.
[10] S. Kianian et al., “A novel 3 volts-only, small sector erase, high density Flash,” Symposium on VLSI Technology, pp. 71-2, 1994.
[11] S. D'Arrigo et al., “A 5 V-only 256K bit CMOS Flash EEPROM,” ISSCC Digest of
Technical papers, pp. 132-3, 1989.
[12] J. C. Chen et al., “A 2.7V only 8Mb x16 NOR Flash Memory,” Symposium on VLSI
Circuits, pp. 172-3, 1996.
[13] S. Atsumi et al., “A Channel-Erasing 1.8V Only 32Mb NOR Flash EEPROM with a Bit-
Line Direct Sensing Scheme,” ISSCC Digest of Technical Papers, pp. 814-5, 2000.
[14] A. Brand et al., “Novel read disturb failure mechanism induced by FLASH cycling,”
Reliability Physics Symposium. 1993. 31st International Annual Proceedings, pp. 127-32,
1993.
[15] K. Naruke et al., “Stress induced leakage current limiting to scale down EEPROM tunnel
oxide thickness,” Technical Digest of IEDM, pp. 424-7, 1988.
[16] J. D. Bude et al., “Secondary Electron flash-a high performance, low power flash
technology for 0.35um and below,” IEDM Technical Digest, pp. 279-82, 1997.
[17] D. Esseni et al., “Trading-off programming speed and current absorption in flash
memories with the ramped-gate programming technique,” IEEE Transactions on Electron
Devices, vol. 47, no. 4, pp. 828 –834, Apr. 2000.
[18] A. T. Wu et al., “A source-side injection erasable programmable read-only-memory (SI-
EPROM) device,” IEEE Electron Device Letter, vol.EDL-7, no.9 pp.540-2, Sep. 1986.
[19] Y. Okuda et al., "A 0.9V operation 2-transistor flash memory for embedded logic LSIs," Symposium on VLSI Technology, Digest of Technical Papers, pp. 21-2, Jun. 1999.
[20] T. Ikehashi et al., "A 60ns access 32kByte 3-transistor flash for low power embedded applications," Symposium on VLSI Circuits, Digest of Technical Papers, pp. 162-5, Jun. 2000.
[21] T. Tanaka et al., "A quick intelligent program architecture for 3V-only NAND-EEPROMs," Symposium on VLSI Circuits, Digest of Technical Papers, pp. 20-1, 1992.
[22] K. Imamiya et al., "A 35ns-Cycle-Time 3.3V-Only 32Mb NAND Flash EEPROM," ISSCC Digest of Technical Papers, pp. 130-1, 1995.
[23] K. D. Suh et al., "A 3.3V 32Mb NAND Flash Memory with Incremental Step Pulse Programming Scheme," ISSCC Digest of Technical Papers, pp. 128-9, 1995.
[24] S. Satoh et al., "A novel isolation-scaling technology for NAND EEPROMs with the minimized program disturbance," Technical Digest of IEDM, pp. 291-4, 1997.
[25] K. Takeuchi et al., "A source-line programming scheme for low-voltage operation NAND flash memories," IEEE Journal of Solid-State Circuits, vol. 35, no. 5, pp. 672-81, May 2000.
[26] T. Tanaka et al., "A quick intelligent page-programming architecture and a shielded bitline sensing method for 3V-only NAND flash memory," IEEE Journal of Solid-State Circuits, vol. 29, no. 11, pp. 1366-73, Nov. 1994.
[27] T. Tanzawa, "Low-voltage circuit design for high-performance flash memories," IEEE Journal of Solid-State Circuits, to be published.
[28] J. F. Dickson, "On-chip high-voltage generation in MNOS integrated circuits using an improved voltage multiplier technique," IEEE Journal of Solid-State Circuits, vol. SC-11, no. 3, pp. 374-378, Jun. 1976.
[29] T. Tanzawa and T. Tanaka, "A dynamic analysis of the Dickson charge pump circuit," IEEE Journal of Solid-State Circuits, vol. 32, no. 8, pp. 1231-40, Aug. 1997.
[30] T. Tanzawa and S. Atsumi, "Optimization of word-line booster circuits for low-voltage flash memories," IEEE Journal of Solid-State Circuits, vol. 34, no. 8, pp. 1091-8, Aug. 1999.
[31] A. Umezawa et al., "A 5V-only operation 0.6um Flash EEPROM with row decoder scheme in triple-well structure," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1540-1546, Nov. 1992.
[32] K. Sawada et al., "An on-chip high-voltage generator circuit for EEPROMs with a power supply voltage below 2V," Symposium on VLSI Circuits, Digest of Technical Papers, pp. 75-76, Jun. 1995.
[33] Y. Miyawaki et al., "A new erasing and row decoding scheme for low supply voltage operation 16-Mb/64-Mb Flash Memories," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 583-8, Apr. 1992.
[34] T. Tanzawa et al., "Circuit techniques for a 1.8V-only NAND flash memory," IEEE Journal of Solid-State Circuits, vol. 37, no. 1, pp. 84-9, Jan. 2002.
[35] T. Tanzawa et al., "Word-line voltage generating system for low-power low-voltage flash memories," IEEE Journal of Solid-State Circuits, vol. 36, no. 1, pp. 55-63, Jan. 2001.
[36] T. Tanzawa et al., "High voltage transistor scaling circuit techniques for high-density negative-gate channel-erasing NOR flash memories," IEEE Non-Volatile Semiconductor Memory Workshop, Digest of Technical Papers, Aug. 2001.
[37] N. Otsuka and M. Horowitz, "Circuit techniques for 1.5-V power supply flash memory," IEEE Journal of Solid-State Circuits, vol. 32, no. 8, pp. 1217-30, Aug. 1997.
[38] T. Tanzawa et al., "Design of a sense circuit for low-voltage flash memories," IEEE Journal of Solid-State Circuits, vol. 35, no. 10, Oct. 2000.
[39] B. Pathak et al., "A 1.8V 64Mb 100MHz flexible read-while-write flash memory," ISSCC Digest of Technical Papers, pp. 32-3, Feb. 2001.
[40] H. Nakamura et al., "A novel sense amplifier for flexible voltage operation NAND flash memories," Symposium on VLSI Circuits, Digest of Technical Papers, pp. 71-2, Jun. 1995.
[41] K. Imamiya et al., "A 256Mb NAND flash with shallow trench isolation technology," ISSCC Digest of Technical Papers, pp. 112-3, Feb. 1999.
[42] S. Atsumi et al., "Fast programmable 256K read only memory with on-chip test circuits," IEEE Journal of Solid-State Circuits, vol. SC-20, no. 1, pp. 422-7, Feb. 1985.
[43] J. L. Moll and Y. Tarui, "A new solid state memory resistor," IEEE Trans. Electron Devices, vol. ED-10, pp. 338-9, 1963.
[44] J. T. Evans et al., IEEE Journal of Solid-State Circuits, vol. SC-23, no. 5, pp. 1171-1175.
[45] R. Womack et al., "A 16Kb ferroelectric nonvolatile memory with a bit parallel architecture," ISSCC Digest of Technical Papers, pp. 242-3, Feb. 1989.
[46] S. S. Eaton et al., "A ferroelectric nonvolatile memory," ISSCC Digest of Technical Papers, pp. 130-1, Feb. 1988.
[47] T. Sumi et al., "A 256Kb nonvolatile ferroelectric memory at 3V and 100ns," ISSCC Digest of Technical Papers, pp. 268-9, Feb. 1994.
[48] D. Takashima et al., "A sub-40ns random access chain FRAM architecture with 7ns cell-plate-line drive," ISSCC Digest of Technical Papers, pp. 102-3, Feb. 1999.
[49] D. Takashima et al., "A 76mm2 8Mb chain ferroelectric memory," ISSCC Digest of Technical Papers, pp. 40-1, Feb. 2001.
[50] H. Koike et al., "A 60-ns 1-Mb nonvolatile ferroelectric memory with a nondriven cell plate line write/read scheme," IEEE Journal of Solid-State Circuits, vol. 31, pp. 1625-1634, Nov. 1996.
[51] G. Braun et al., "A robust 8f2 ferroelectric RAM cell with depletion device (DeFeRAM)," IEEE Journal of Solid-State Circuits, vol. 35, pp. 691-700, May 2000.
[52] H.-B. Kang et al., "A hierarchy bitline boost scheme for sub-1.5V operation and short precharge time on high density FeRAM," ISSCC Digest of Technical Papers, pp. 158-9, Feb. 2002.
[53] K. Sawada et al., "A 72-K CMOS channelless gate array with embedded 1-Mbit dynamic RAM," CICC Digest, pp. 20.3.1-4, May 1988.
[54] S. Miyano et al., "A 1.6GB/s Data-Transfer-Rate 8-Mb embedded DRAM," ISSCC Digest of Technical Papers, pp. 300-1, Feb. 1995.
[55] K. Itoh et al., "Limitations and challenges of multigigabit DRAM chip design," IEEE Journal of Solid-State Circuits, vol. 32, pp. 624-634, May 1997.
[56] T. Yabe et al., "A configurable DRAM macro design for 2112 derivative organization to be synthesized using a memory generator," ISSCC Digest of Technical Papers, pp. 72-3, Feb. 1998.
[57] T. Nishikawa et al., "A 60MHz 240mW MPEG-4 video-phone LSI with 16Mb embedded DRAM," ISSCC Digest of Technical Papers, pp. 130-1, Feb. 2000.
[58] A. Kahn et al., "A 150MHz graphic rendering processor with 256Mb embedded DRAM," ISSCC Digest of Technical Papers, pp. 150-151, Feb. 2001.
[59] J. Barth et al., "A 300MHz multi-banked eDRAM macro featuring GND sense, bit-line twisting and direct reference cell write," ISSCC Digest of Technical Papers, pp. 156-7, Feb. 2002.
[60] H. Nakano et al., "A dual layer bitline DRAM array with Vcc/Vss hybrid precharge for multi-gigabit DRAMs," Symposium on VLSI Circuits, pp. 190-1, 1996.
[61] Y. Oowaki et al., "A sub-0.1um circuit design with substrate-over-biasing," ISSCC Digest of Technical Papers, pp. 88-9, Feb. 1998.
Chapter 4
Low-Power Digital Circuit Design
Tadahiro Kuroda
Keio University
Abstract: Circuit techniques for power-aware design are presented, including techniques
for a variable supply voltage, a variable threshold voltage, multiple supply
voltages, multiple threshold voltages, a low-voltage SRAM, a conditional flip-
flop, and an embedded DRAM.
Key words: Supply voltage, threshold voltage, variable, multiple, substrate bias, low-
voltage SRAM, conditional flip-flop, embedded DRAM.
4.1 INTRODUCTION
CMOS power dissipation has been increasing as a direct consequence of the
increase in power density brought about by device scaling [1]. Constant-voltage
scaling was employed until the early 1990s, during which time power density
rapidly increased, roughly in proportion to the cube of the device scaling
factor k, resulting in a fourfold increase in power dissipation every three
years. Recently, constant-field scaling has been applied to deal with the
power problem. Power density still increases, roughly in proportion to k,
leading to a doubling of the power dissipation every 6.5 years. It can be
assumed that the power dissipation of CMOS chips will keep increasing
steadily as a natural result of device scaling.
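Written as growth laws, the two rates quoted above are (the exponents in k are the commonly cited first-order scaling relations, not equations taken from this chapter):

\[ P(t)\;\propto\;4^{\,t/3\,\mathrm{yr}} \quad\text{(constant-voltage era)},\qquad P(t)\;\propto\;2^{\,t/6.5\,\mathrm{yr}} \quad\text{(constant-field era)} \]

Ideal constant-field scaling would keep power density constant; the residual growth reflects larger dies, faster clocks, and non-ideal voltage scaling.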
Future computer and communications technology, on the other hand, will
require further reduction in power dissipation [2]. Ubiquitous computing is
the next generation in information technology where computers and
communications will be scaled further, merged together, and materialized in
consumer applications. Computers will be invisible behind broadband
networks as servers, while terminals will come closer to people, even as
wearable or implantable devices. IC chips will be implanted everywhere so
that things can think and talk for sophisticated human-computer interactions.
One of the key technologies needed to reach this end is low-power technology.
Circuit design techniques for the second and third approaches, as well as
theoretical models for quantitative understanding will be discussed in detail.
Figure 4.1 depicts equi-power (solid lines) and equi-speed (broken lines)
curves on the VDD-Vth plane, calculated by using equations (4.1) and (4.2) [1].
A rectangle in the figure illustrates the ranges of VDD change and Vth
fluctuation that should be taken into account. This rectangle is a design
window, because all the circuit specifications should be satisfied everywhere
within the rectangle for yield-conscious design. In the design window, the
circuit speed becomes the slowest at the upper-left corner S, while the power
dissipation becomes the highest at the lower-right corner P. The equi-speed
and equi-power curves are normalized at the corners S and P, so that the
amount by which speed and power must be improved or degraded, compared to the
typical condition, can be calculated by sliding and sizing the design window
on the VDD-Vth plane.
Recently, the range of body bias has been extended from reverse to
forward. Forward substrate bias is used during active operation in order to
lower Vth for high-speed operation, and zero substrate bias is used during
standby mode in order to raise Vth for low leakage. The substrate biasing
technique has begun to be applied to high-end products such as
microprocessors and communications chips for low-power, high-speed
operation [10][11].
An embedded DC-DC converter can vary the power supply voltage. If
both VDD and the clock frequency are dynamically varied in response to
computational load demands, the energy per operation can be reduced during
periods of low computational load, while retaining peak throughput when
required. This strategy, called dynamic voltage scaling (DVS), was first
applied to a MIPS-compatible RISC core in 1998 [7]. The measured performance
in MIPS/W was improved by a factor of more than two compared with that of a
conventional design. In 2000, a DVS processor with an ARM8 core was reported [12].
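As a concrete, if simplified, illustration of the DVS policy (all operating points, the capacitance value, and the workload below are hypothetical, and this is not the controller of [7] or [12]): pick the slowest operating point that still meets the deadline, and pay the quadratically smaller CV^2-per-cycle switching energy.

#include <stdio.h>

/* Illustrative DVS policy: choose the lowest frequency/voltage pair
   that still meets the deadline; switching energy ~ Ceff * VDD^2 per
   cycle. All numbers are made-up for illustration.                  */
typedef struct { double f_mhz, vdd; } op_point;

static const op_point pts[] = {
    {100, 1.1}, {200, 1.3}, {300, 1.5}, {400, 1.8}   /* ascending f */
};

int main(void) {
    double cycles = 2.0e6;       /* work to finish                   */
    double deadline_ms = 8.0;    /* time budget                      */
    double ceff_nf = 1.0;        /* assumed switched capacitance     */
    for (unsigned i = 0; i < sizeof pts / sizeof pts[0]; i++) {
        double t_ms = cycles / (pts[i].f_mhz * 1e3);
        if (t_ms <= deadline_ms) {        /* slowest point that fits */
            double e_uj = ceff_nf * pts[i].vdd * pts[i].vdd * cycles * 1e-3;
            printf("run at %.0f MHz / %.1f V: %.2f ms, ~%.0f uJ\n",
                   pts[i].f_mhz, pts[i].vdd, t_ms, e_uj);
            return 0;
        }
    }
    printf("no operating point meets the deadline\n");
    return 0;
}

Here the 300 MHz / 1.5 V point meets the 8 ms deadline while using (1.5/1.8)^2, i.e. about 69%, of the per-cycle switching energy of the fastest point; this quadratic saving is what the MIPS/W improvements quoted above exploit.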
96 Low-Power Digital Circuit Design
4.2.2 Dual VDD
There are three ways to save power dissipation while maintaining the maximum
operating frequency by utilizing surplus timing in non-critical paths: 1)
employing multiple power supplies to lower the supply voltage of non-critical
gates, 2) employing multiple threshold voltages to reduce leakage current, and
3) employing multiple transistor widths to reduce circuit capacitance.
Clustered voltage scaling, which employs two power supplies, is discussed
first.
The optimal lower supply voltage VDDL should be used to minimize the power
dissipation of circuits. A theory for determining the optimal VDDL is
described in [18]. According to the theory, the power reduction ratio R can be
calculated as a function of VDDL when p(t) is provided, where p(t) represents
the normalized number of paths whose delay is t at the original, higher supply
voltage. The power ratio R is calculated for five representative shapes of p(t).
An MPEG-4 video codec is designed using an EDA tool for clustered voltage
scaling [19] at various values of VDDL, and the power dissipation is
monitored. As shown in Figure 4.5, the experimental results show good
agreement with the theory when a lambda-shaped p(t) is assumed. Power
dissipation is reduced by about 40%.
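A first-order version of such a power-ratio model (a sketch in the spirit of, but not identical to, the theory of [18]): if a fraction q of the total switched capacitance can be clustered onto the lower supply VDDL, with VDDH the original supply, then

\[ R \;\approx\; (1-q) + q\left(\frac{V_{DDL}}{V_{DDH}}\right)^{2} \]

The optimal VDDL balances the quadratic saving against the shrinking fraction q of paths slow enough to tolerate it; with, say, q = 0.6 and VDDL/VDDH = 0.6, R = 0.62, consistent in magnitude with the roughly 40% reduction quoted above.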
Two MPEG-4 video codec chips were developed using the two approaches:
controlling VDD and Vth, and employing two supply voltages [20]. The power
dissipation of the chips was simulated and measured. By optimizing VDD and
Vth, the power supply voltage can be lowered from 3.3V to 2.5V, so that power
dissipation is reduced by 43% in all the circuits. By employing one more
supply of 1.75V for non-critical circuits, power dissipation is further
reduced by 25%, for a total reduction of 55% compared to the conventional
design at 3.3V.
4.2.3 Multiple Vth
Among the combinations of power supplies and threshold voltages that make the
total delay of a path equal to the cycle time, power dissipation is minimized
when the optimal combination is applied. The ratio of the chip leakage current
with multiple threshold voltages to that with a single threshold voltage can
be expressed in terms of the total gate width of the pMOS and nMOS transistors
at each threshold voltage and of the supply to which their sources are
connected. In a typical design, where buffer sizes and the number of repeaters
are optimally chosen, delay and transistor width are mostly proportional, and
the leakage-current ratio is calculated accordingly.
The chip leakage current ratio can be computed in the same way. For a design
whose power is dominated by static dissipation due to low Vth, reducing the
leakage current by more than one order of magnitude is very effective.
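The order-of-magnitude leverage comes straight from the standard subthreshold-current model (general device physics, not one of this chapter's numbered equations):

\[ I_{off} \;\propto\; W\cdot 10^{-V_{th}/S},\qquad S \approx 80\text{-}100\ \mathrm{mV/decade} \]

so raising the Vth of non-critical transistors by only 0.1-0.2V cuts their leakage by one to two decades.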
For other designs, where the leakage current is already suppressed to a fairly
small amount, the leakage-current reduction can be converted into a reduction
of AC power: Vth and, accordingly, VDD are lowered to the point where the chip
leakage current is the same as in the single-Vth design. As a result, AC power
is reduced by about 20%.
4.2.3.4 Summary
Among these limitations, the bit-line leakage problem is becoming the most
crucial. The measured cell current and bit-line leakage current under the
worst-case data pattern of an SRAM with 256 rows, fabricated in a CMOS
technology, are depicted in Figure 4.11. The rapid increase of the bit-line
leakage at low Vth degrades operation speed and finally causes operation
errors. If the leakage is to be kept below an acceptable level, Vth should be
higher than 0.35V, considering a ±0.1V fluctuation. As illustrated in Figure
4.12, Vth has been about 23% of VDD in SRAMs and around 15% in logic circuits
in recent high-speed device generations. However, it is predicted that Vth
cannot be scaled further in upcoming technologies because of excessive
bit-line leakage, which becomes three times as large in the worst case.
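To see why 256 rows make this so severe, note that all unselected cells on a bit line leak in parallel; with the subthreshold model above (an illustrative calculation, not measured data):

\[ I_{BL} \;=\; (n-1)\,I_{off} \;=\; 255\; I_{0}\,10^{-V_{th}/S} \]

A 0.1V reduction of Vth therefore multiplies the aggregate leakage of the 255 off cells by about 10x (for S = 100 mV/decade), rapidly eroding the read current of the single selected cell.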
Even with the additional operation for leakage detection in the precharge
cycle, write recovery can be completed with little speed penalty, as shown in
Figure 4.14(c), since P3 assists P1 with the precharge operation while
"/comp" is low.
The capacitance associated with CAP7 should be carefully designed: a
capacitance that is too small will cause charge sharing and reduce the
injection current due to coupling noise from the source of P4, whereas a
capacitance that is too large will increase the detection time and hence the
precharge cycle time.
The simulated delay time at which the potential difference between the two
bit-lines reaches 100mV is plotted in Figure 4.15. If the budget for the
bit-line delay is 0.5ns, the bit-line leakage must be kept very small without
the compensation, whereas it can be substantially larger with the proposed
compensation scheme. This advantage corresponds to a 0.1V reduction in Vth. It
is also found from Figure 4.12 that, when the BLC scheme is employed, Vth can
continue to be scaled in future technologies as it was previously.
The part of a chip with the highest switching activity is the circuitry where
the clock comes in and out. A flip-flop consumes considerable power through
clock toggling. Typically, one-fourth to one-half of the total power
dissipation of a chip is consumed by flip-flops.
L-level because the Q output is now on the same level as the D input. In this
way, COD-F/F generates the self-aligned pulsed clock internally and
operates reliably. When CKI is a short pulse, a latch circuit can operate like
an edge-triggered flip-flop. Therefore, COD-F/F consists of the clock-gating
circuit and the latch circuit, which reduces area and power penalties. Since
the internal clock is generated and distributed only in a cell, and the pulse
width is self-aligned by the Q transition, no distortion problem occurs.
The dependence of the power dissipation on the data switching probability, pt,
is shown in Figure 4.19. For example, at a pt of 0.3, COD-F/F consumes 50%
less power than the conventional flip-flop. The lower the pt, the lower the
power dissipation. The power penalty due to the clock-gating circuit is almost
cancelled out by the circuit reduction from a flip-flop to a latch. For pt
less than 0.95, the COD-F/F dissipates less power than the conventional
flip-flop. Since pt in logic circuits is around 0.3 on average, and 0.5 at
most, the conventional flip-flop should always be replaced by COD-F/F as long
as the delay penalty can be accepted. The characteristics of COD-F/F are
presented in Table 4.1.
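The two data points quoted above (50% saving at pt = 0.3, break-even at pt = 0.95) pin down a simple linear power model; the sketch below merely interpolates them, assuming the conventional flip-flop's power is dominated by clocking and therefore independent of pt.

#include <stdio.h>

/* Power of COD-F/F relative to a conventional flip-flop, linearly
   interpolated from the two data points quoted in the text:
   P(0.3) = 0.50 and P(0.95) = 1.00 (conventional FF normalized to 1). */
static double cod_ff_power(double pt) {
    const double slope = (1.00 - 0.50) / (0.95 - 0.30);
    return 0.50 + slope * (pt - 0.30);
}

int main(void) {
    for (double pt = 0.0; pt <= 1.0; pt += 0.1)
        printf("pt = %.1f  relative power = %.2f\n", pt, cod_ff_power(pt));
    return 0;
}

At the average activity of 0.3 this reproduces the 50% saving, and the model crosses 1.0 only at pt = 0.95, matching the break-even point above.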
reduced by cycle-time borrowing. The more cycle time is borrowed, the greater
the chance that the succeeding cycle in the next pipeline stage must in turn
borrow cycle time. In this way, cycle-time borrowing propagates, resulting in
larger area and power penalties due to the NS-F/Fs. Since the cell size of an
NS-F/F grows with its negative setup time, the logic synthesis tool
automatically maps the smallest NS-F/F that satisfies the timing requirements
after area optimization, so as to minimize the penalties. The fact that
NS-F/F#4 is not used exclusively provides some support for the assumption
that the flip-flops are appropriately mapped to minimize the penalties. Since
the clock edge is locally shifted by various amounts, hold-time violations
should be given careful attention. A logic synthesis tool can fix these
problems automatically. The area penalty is on the order of several
percentage points.
As for the low-power DCT, power dissipation is reduced by 24% for the random
picture and 51% for the still picture while maintaining an 80MHz maximum
operating frequency. The area is 12% larger as a result of gate sizing that
compensates for the delay increase of COD-F/F. The power breakdown analyzed by
simulation is shown in Figure 4.22. The "others" category includes
combinational logic, SRAMs, and clock trees. The power difference between
Conv-DCT and LP-DCT comes from the difference in the power dissipation of the
flip-flops. HS-DCT operates at a maximum frequency 25% higher than that of the
conventional DCT. Its area penalty is due to the increased cell size of the
NS-F/Fs and to the gate sizing needed to meet tight timing constraints. The
power penalty is mainly due to the delay circuit in the NS-F/F.
F/F-Blending explores better trade-offs between power, delay, and area.
Since neither the RTL design nor the timing constraints have to be modified,
designers can use the technique without detailed knowledge of the chip. The
technique also shortens the design turn-around time. There is no need to
change the device technology and, hence, there is no increase in process cost.
4.5 SUMMARY
REFERENCES
[1] T. Kuroda and T. Sakurai, "Overview of low-power ULSI circuit techniques," IEICE Trans. on Electronics, vol. E78-C, no. 4, pp. 334-344, April 1995.
[2] T. Kuroda, "CMOS design challenges to power wall," in Proc. of International Microprocesses and Nanotechnology Conference, pp. 6-7, Nov. 2001.
[3] T. Kuroda, "Low power CMOS design challenges," IEICE Trans. Electronics, vol. E84-C, no. 8, pp. 1021-1028, Aug. 2001.
[4] A. Chandrakasan, S. Sheng, and R. Brodersen, "Low-power CMOS digital design," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473-484, Apr. 1992.
[5] K. Nose and T. Sakurai, "Optimization of VDD and VTH for low-power and high-speed applications," in Proc. of ASP-DAC, pp. 469-474, Jan. 2000.
using clustered voltage scaling with variable supply-voltage scheme," IEEE Journal of Solid-State Circuits, vol. 33, no. 11, pp. 1772-1780, Nov. 1998.
[21] M. Hamada, Y. Ootaguro, and T. Kuroda, "Utilizing surplus timing for power reduction," in Proc. of CICC 2001, pp. 89-92, May 2001.
[22] H. Tanaka et al., "A precise on-chip voltage generator for a gigascale DRAM with a negative word-line scheme," IEEE Journal of Solid-State Circuits, vol. 34, pp. 1084-1090, Aug. 1999.
[23] H. Kawaguchi et al., "Dynamic leakage cut-off scheme for low-voltage SRAM's," in Symp. on VLSI Circuits Dig. Tech. Papers, pp. 140-141, June 1998.
[24] K. Agawa, H. Hara, T. Takayanagi, and T. Kuroda, "A bit-line leakage compensation scheme for low-voltage SRAM's," IEEE Journal of Solid-State Circuits, vol. 36, no. 5, May 2001.
[25] M. Hamada, T. Terazawa, T. Higashi, S. Kitabayashi, S. Mita, Y. Watanabe, M. Ashino, H. Hara, and T. Kuroda, "Flip-flop selection technique for power-delay trade-off," in ISSCC '99 Dig. Tech. Papers, pp. 270-271, Feb. 1999.
[26] B. Kong, S. Kim, and Y. Jun, "Conditional-capture flip-flop technique for statistical power reduction," in ISSCC '00 Dig. Tech. Papers, pp. 290-291, Feb. 2000.
[27] T. Nishikawa, M. Takahashi, M. Hamada, T. Takayanagi, H. Arakida, N. Machida, H. Yamamoto, T. Fujiyoshi, Y. Matsumoto, O. Yamagishi, T. Samata, A. Asano, T. Terazawa, K. Ohmori, J. Shirakura, Y. Watanabe, H. Nakamura, S. Minami, and T. Kuroda, "A 60MHz 240mW MPEG-4 video-phone LSI with 16Mb embedded DRAM," in ISSCC Dig. Tech. Papers, pp. 230-231, Feb. 2000.
Chapter 5
Low Voltage Analog Design
Abstract: The current trend towards low-voltage, low-power design is mainly driven by
two important aspects: the growing demand for long-life autonomous portable
equipment (cellular phones, PDAs, etc.) and the technological limitations of
high-performance VLSI systems (heat dissipation). These two forces are now
combined as portable equipment grows to encompass high-throughput
intensive products such as portable computers and cellular phones. The most
efficient way to reduce the power consumption of digital circuits is to reduce
the supply voltage since the average power consumption of CMOS digital
circuits is proportional to the square of the supply voltage. The resulting
performance loss can be overcome for standard CMOS technologies by
introducing more parallelism and/or modifying the process and optimizing it
for low-voltage operation. The rules for analog circuits are quite different than
those applied to digital circuits. It will be shown that the downscaling of the
supply voltage does not automatically decrease the analog power consumption.
After a general introduction on the limits to low power for analog circuits, an
extensive part of this chapter will deal with the impact of reduced supply
voltage on the power consumption of high-speed analog to digital converters
(ADC). It will be shown that power consumption will not decrease and, even
worse, will increase in future submicron technologies. This trend will be
shown and solutions will be offered at the end of this chapter. A comparison
with the power consumption of published high-speed analog to digital
converters will also be presented.
Key words: Low voltage operation, analog circuits, low power analog design, ADC,
matching, voltage scaling.
5.1 INTRODUCTION
The limits discussed so far are fundamental, since they depend neither on
the technology nor on the choice of the supply voltage. However, a number of
obstacles or technological limitations stand in the way of approaching these
limits in practical circuits, and ways to reduce the effect of these various
limitations can be found at all levels of analog design, ranging from the
device to the system level:
- reducing the noise bandwidth). Parasitic capacitances are therefore very bad
for minimum power consumption.
- Power is increased if the signal at any node corresponding to a functional
pole (a pole within the bandwidth) has a voltage amplitude smaller than the
supply voltage. Thus, care must be taken to amplify the signal as early as
possible to its maximum possible voltage value. Using current-mode circuits
with reduced voltage swings is therefore not a good approach to reducing
power, as long as the energy is supplied by a voltage source.
- The presence of additional noise sources implies an increase in power
consumption. These include the 1/f noise of the active devices and the noise
coming from the power supply or generated on chip by other blocks of the
circuit.
- When capacitive loads are used, the supply current I necessary to obtain a
given bandwidth is inversely proportional to the gm/I ratio of the active
device. The small value of gm/I inherent to MOS transistors operating in
strong inversion may therefore cause an increase in power consumption.
- The need for precision (e.g., in high-speed flash ADCs, as explained further
on) leads to the use of larger dimensions for active and passive components,
with a resulting increase in parasitic capacitances and power.
- All switched capacitors must be clocked at a frequency higher than twice the
signal frequency. The power consumed by the clock driver itself may be
dominant in some applications.
The reason why deeper submicron technologies use lower supply voltages will be
shown in the second part of this chapter, which deals with the influence of
technology downscaling on the power consumption of high-speed flash ADCs. In
this section, a general overview is given of the implications of reducing the
supply voltage.
According to the equation for the minimum analog power consumption, reducing
the analog supply voltage while preserving the same bandwidth and dynamic
range (DR) has no fundamental effect on the minimum power consumption.
However, this absolute limit was obtained by neglecting the possible
limitation of the bandwidth B due to the limited transconductance gm of the
active device. The maximum value of B is proportional to gm/C. It can be
shown that, by replacing the capacitor value C by gm/B, the following equation
can be written:
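The often-quoted form of this limit, and the gm-limited correction, can be sketched as follows (textbook assumptions: rail-to-rail swing, a single pole, thermal noise kT/C only; the constants should not be taken literally):

\[ P_{min} \;\approx\; 8\,kT\cdot B\cdot DR \]

which contains neither the supply voltage nor any technology parameter. If, however, the bandwidth is transconductance-limited, so that \( B \propto g_m/(2\pi C) \) and \( C = g_m/(2\pi B) \), then keeping the same DR with a swing bounded by VDD requires

\[ g_m \;\propto\; \frac{B\cdot DR}{V_{DD}^{2}}, \qquad P \;=\; V_{DD}\,I \;\propto\; \frac{B\cdot DR}{V_{DD}}\cdot\frac{I}{g_m} \]

so that, for a fixed gm/I ratio of the active device, lowering VDD increases rather than decreases the required analog power.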
the technology used [2]), it directly influences the DNL and INL
characteristics of the A/D converter. Therefore, the first step in the design
of a flash converter consists of deriving an offset-voltage standard deviation
that guarantees, with high probability, that the design complies with a given
performance specification (high yield). The yield of the analog part (e.g.,
the ADC) of a mixed-mode chip must be much higher than the overall yield,
because of the relatively small area contribution of the analog part (see
Figure 5.5).
Consider the offset voltages of all the comparators to be independent random
variables that follow a normal distribution. Monte Carlo simulations have
been used to estimate the design yield as a function of the offset-voltage
standard deviation. (Closed-form expressions also exist for calculating the
yield as a function of the offset standard deviation [3].) The results
obtained for a 6-bit converter that must comply with a DNL specification of
0.5 LSB and an INL specification of 1.0 LSB are presented in Figure 5.6. In
order to design a 6-bit converter with an acceptable yield, the comparator
offset standard deviation should not exceed 0.15 times the least significant
bit. The next section presents models to calculate the offset standard
deviation of the comparator. Together with this information, the trade-off
between speed, power, and accuracy will be shown in Section 3.
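A minimal sketch of such a Monte Carlo experiment (pass/fail criteria exactly as quoted above, |DNL| <= 0.5 LSB and |INL| <= 1.0 LSB; offsets are drawn in LSB units against ideal reference levels; this is an illustration, not the closed-form method of [3]):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NCOMP   63       /* 2^6 - 1 comparators in a 6-bit flash ADC */
#define NTRIALS 10000    /* simulated converters per sigma value     */

static double gauss(void) {                    /* Box-Muller, N(0,1) */
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * 3.14159265358979 * u2);
}

int main(void) {
    for (double sigma = 0.05; sigma < 0.31; sigma += 0.05) {
        int pass = 0;
        for (int t = 0; t < NTRIALS; t++) {
            double trip[NCOMP];
            int ok = 1;
            for (int k = 0; k < NCOMP; k++)    /* thresholds in LSB  */
                trip[k] = (k + 1) + sigma * gauss();
            for (int k = 0; k < NCOMP && ok; k++) {
                if (fabs(trip[k] - (k + 1)) > 1.0) ok = 0;     /* INL */
                if (k + 1 < NCOMP &&
                    fabs(trip[k + 1] - trip[k] - 1.0) > 0.5) ok = 0; /* DNL */
            }
            pass += ok;
        }
        printf("sigma = %.2f LSB -> yield = %.1f%%\n",
               sigma, 100.0 * pass / NTRIALS);
    }
    return 0;
}

Sweeping sigma shows how steeply the yield collapses once the offset spread approaches a sizable fraction of an LSB, which is the behavior Figure 5.6 quantifies.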
dependence for a minimum-size device. The critical distances obtained are
very large compared to the typical size of an analog circuit. Therefore, the
distance dependence of the parameter mismatch will be neglected in the
following sections.
To end this section and introduce the next one, which deals with the
trade-off between speed, accuracy, and power, the mismatch equations for the
differential pair configuration shown in Figure 5.9 are deduced. After
substituting the mismatch equations (5.1) and (5.2), the offset voltage can be
written in terms of the mismatch parameters of the technology used.
From (5.7) it can be concluded that the current and threshold matching depend
on the technology mismatch parameters, and that the relative importance of
threshold mismatch and current mismatch depends on the gate overdrive voltage.
A corner gate overdrive voltage is defined at which the effects of the
threshold-voltage and current-factor mismatch on the gate voltage or drain
current are of equal size (see Table 5.1 for values):
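In the usual Pelgrom-style notation, the relations behind this statement can be sketched as follows (a reconstruction consistent with [2] and with the discussion above, not a verbatim copy of (5.7)):

\[ \sigma(\Delta V_T) = \frac{A_{VT}}{\sqrt{WL}},\qquad \frac{\sigma(\Delta\beta)}{\beta} = \frac{A_{\beta}}{\sqrt{WL}} \]

\[ \sigma^{2}(V_{os}) = \sigma^{2}(\Delta V_T) + \left(\frac{V_{GS}-V_T}{2}\right)^{2}\frac{\sigma^{2}(\Delta\beta)}{\beta^{2}} \;\;\Rightarrow\;\; (V_{GS}-V_T)_{corner} = \frac{2A_{VT}}{A_{\beta}} \]

Below the corner overdrive the threshold term dominates the offset; above it the current-factor term does.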
The trend in analog circuit design has always been towards higher speed,
higher accuracy, and lower power drain. However, it will be shown that the
speed-accuracy-power trade-off is limited only by technology parameters, more
specifically by the mismatch parameters of the technology, and not by the
noise level shown in Figure 5.2. This has already been demonstrated by Kinget
et al. [6].
The only way to overcome this problem is to use offset-compensation or
auto-zero techniques (analog or digital, background or foreground). However,
such compensation techniques require calibration phases during which normal
system operation is interrupted, and the offset voltages of the building
blocks are sampled and dynamically stored in a memory. This reduces the
maximum processing speed and requires a lot of extra chip overhead for
calibration and replica circuits. In many high-speed, low-power circuits,
interruption of the system cannot be tolerated, or the required continuous
operation is too long for the offset correction to remain valid. Therefore,
the accuracy depends completely on the matching performance of the
technology. The bit accuracy that can be achieved is proportional to the
matching of the transistors. To improve the system accuracy, larger devices
are required; but, at the same time, the capacitive loading of the circuit
nodes increases, and more power is required to attain a given speed
performance. This can easily be derived for a typical differential input
stage of a high-speed ADC. The speed performance of such a topology is
approximated by:
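A plausible reconstruction of that speed expression and of the resulting trade-off (first-order, in the spirit of [6]; the exact prefactors depend on the topology):

\[ BW \;=\; \frac{g_m}{2\pi C_L},\qquad C_L \;\propto\; W L\,C_{ox},\qquad \sigma(\Delta V_T) \;=\; \frac{A_{VT}}{\sqrt{WL}} \]

Eliminating WL through the accuracy requirement \( \sigma(\Delta V_T) \le \sigma_{max} \) gives, to first order,

\[ P \;\propto\; V_{DD}\,(V_{GS}-V_T)\; BW\;\frac{C_{ox}\,A_{VT}^{2}}{\sigma_{max}^{2}} \]

Power grows linearly with speed and quadratically with accuracy, and the technology enters only through the product Cox*AVT^2. Since AVT scales roughly with tox while Cox scales as 1/tox, that product shrinks with tox, which is the scaling advantage examined next.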
What happens to this trade-off when technology scales down? Several scaling
trends have been proposed, but "constant field scaling" is probably the one
most used in the microelectronics community. Several scaling trends will now
be discussed and used to explore the impact of these scaling issues on the
previously deduced speed-power-accuracy trade-off.
To reduce the short-channel effects in deep-submicron transistors, the oxide
thickness is scaled down together with the minimum transistor length. As shown
in the second section, the threshold mismatch parameter is proportional to the
oxide thickness. As a consequence, the threshold mismatch parameter decreases
as technology scales down. The gate-oxide capacitance, on the other hand,
increases when technology scales down (it is inversely proportional to the
oxide thickness). The achievable matching for a given gate capacitance thus
improves as technology scales down, and as a result the trade-off becomes
better. This means that for the same speed and accuracy, less power is needed
when technology is scaled down.
However, the maximal supply voltage also decreases for smaller oxide
thicknesses (see Figure 5.10), so that smaller signal levels have to be used
(a 0.25-um technology uses a 2.5 V supply, a 0.18-um technology uses a 1.8 V
supply).
When the supply voltage becomes smaller, the input swing of the differential
pair decreases, leading to a smaller value for the least significant bit. As a
result, the maximum allowable offset also decreases. Consequently, the scaling
advantage for the trade-off at smaller technology line-widths is reduced.
Moreover, the increasing substrate doping levels in deeper submicron
technologies make the parasitic drain-to-bulk and source-to-bulk capacitances
relatively more and more important compared to the gate-oxide capacitance.
This effect is clearly seen in Figure 5.11, where both capacitances are
plotted as a function of the minimum technology length:
Because the two ADCs have the same resolution, the following equation
can be proven2:
So, to achieve the same speed and accuracy, the power in the downscaled
technology is smaller, because of the improved matching of this technology.
Now, some modifications will be made to these equations to include the
supply-voltage scaling and the relatively increasing importance of the
drain-bulk capacitance compared to the gate-oxide capacitance. Normally, the
input range of the ADC is made as large as possible. The assumption made here
is that the least significant bit of the converter scales down together with
the supply voltage, leading to a smaller allowable mismatch:
2
Index 1 is used for the older technology and index 2 for the scaled technology.
The speed equation can be rewritten, now including the drain-bulk
capacitance.
3
The typical assumption of tox = L/50 has been used in this equation.
b) Case 2: Idem as case 1, but now with drain-bulk capacitance scaling. The
extra load on the driver transistors leads to a slightly increasing straight
line.
c) Case 3: No supply voltage scaling, but with drain-bulk capacitance
scaling. The improving matching properties lead to a decreasing power
consumption of the implemented converter.
To conclude, the expected power decrease is counteracted by the more
stringent mismatch requirement and the relatively increasing drain-bulk
capacitance.
When technology scales further, the drain-bulk capacitance becomes dominant,
leading to the following equation:
which makes the case even worse (power increases as shown in Figure
5.13). This is because the scaling of the supply voltage is no longer
Thus, the power consumption ratio (for equal gate-overdrive voltages) is:
The power consumption trend does not stay the same (as in the case where only
the settling behavior was included) but has a sub-linear slope. This is due to
the introduction of the slewing behavior.
This trend is plotted in Figure 5.15 as a function of the technology and for
three different gate-overdrive voltages (slew rate, settling behavior, and
supply-voltage scaling are all included). One can clearly see that for
smaller gate-overdrive voltages the power-increase turning point is pushed
towards smaller technologies. This is intuitively understood because the
supply-voltage scaling is then advantageous for the power consumption, since
the circuit spends a longer time in slewing behavior.
Lowering the gate-overdrive voltage leads to a remarkable conclusion. It
indicates that for future ADCs, a behavior close to the linear behavior is
preferable for the implementation and power consumption of high-speed ADCs.
Analog circuits are not the only ones that suffer from the decreasing power
supply voltage and mismatch; digital circuits also suffer from mismatch
between identical devices, e.g., offsets in an SRAM cell. Because of the
enormous economic impact of digital circuits, more effort may be spent on
extensive research to achieve much better mismatch parameters in future
technologies. Here, for once, digital demands go hand in hand with analog
demands. Another technological adaptation is the use of dual-oxide processes,
which can handle the higher supply voltages necessary to achieve the required
dynamic range in data converters.
To compare the developed equations with published data, Figure 5.17 shows the
figure of merit of several published 6-bit converters versus their
implementation technology.
One clearly sees good agreement between the equations and the published data.
The averaging technique is also a good candidate for circumventing this
speed-power-accuracy trade-off.
5.6 SUMMARY
REFERENCES
[1] E. A. Vittoz, "Future of analog in the VLSI environment," Proc. ISCAS, pp. 1372-1375, May 1990.
[2] M. Pelgrom et al., "Matching properties of MOS transistors," IEEE Journal of Solid-State Circuits, vol. 24, no. 5, pp. 1433-1439, Oct. 1989.
[3] M. J. M. Pelgrom, A. C. J. v. Rens, M. Vertregt, and M. Dijkstra, "A 25-Ms/s 8-bit CMOS A/D converter for embedded application," IEEE Journal of Solid-State Circuits, vol. 29, no. 8, Aug. 1994.
[4] J. Bastos et al., "Mismatch characterization of small size MOS transistors," Proc. IEEE Int. Conf. on Microelectronic Test Structures, vol. 8, pp. 271-276, 1995.
[5] W. M. C. Sansen and K. R. Laker, "Design of Analog Integrated Circuits and Systems," McGraw-Hill International Editions, 1994.
[6] P. Kinget and M. Steyaert, "Impact of transistor mismatch on the speed-accuracy-power trade-off of analog CMOS circuits," Proc. CICC, May 1996.
[7] I. Mehr and D. Dalton, "A 500-Msample/s 6-bit Nyquist-rate ADC for disk-drive read-channel applications," IEEE Journal of Solid-State Circuits, Sept. 1999.
[8] E. Lauwers and G. Gielen, "A power estimation model for high-speed CMOS A/D converters," Proc. DATE, March 1999.
[9] Q. Huang et al., "The impact of scaling down to deep submicron on CMOS RF circuits," IEEE Journal of Solid-State Circuits, vol. 33, no. 7, July 1998.
[10] K. Kattmann and J. Barrow, "A technique for reducing differential non-linearity errors in flash A/D converters," ISSCC Dig. Tech. Papers, pp. 170-171, Feb. 1991.
[11] Abidi et al., "A 6-bit, 1.3-GHz CMOS ADC," ISSCC, San Francisco, Feb. 2001.
[12] P. Scholtens et al., "A 6-bit, 1.6-GHz CMOS flash ADC," to be presented at ISSCC, San Francisco, Feb. 2002.
[13] G. Geelen, "A 6b 1.1-Gsample/s CMOS A/D converter," ISSCC, San Francisco, Feb. 2001.
[14] K. Bult and A. Buchwald, "An embedded 240-mW 10-b 50-Ms/s CMOS ADC in 1 mm2," IEEE Journal of Solid-State Circuits, vol. 32, pp. 1887-1895, Dec. 1997.
[15] G. Hoogzaad and R. Roovers, "A 65-mW, 10-bit, 40-Ms/s BiCMOS Nyquist ADC in 0.8 mm2," IEEE Journal of Solid-State Circuits, Dec. 1999.
[16] Y.-T. Wang and B. Razavi, "An 8-bit, 150-MHz CMOS A/D converter," Proc. CICC, pp. 117-120, May 1999.
[17] M. Flynn and B. Sheahan, "A 400-Msample/s 6b CMOS folding and interpolating ADC," ISSCC, 1998.
[18] S. Tsukamoto et al., "A CMOS 6b 400-Msample/s ADC with error correction," ISSCC, 1998.
[19] K. Nagaraj et al., "A 700-Msample/s 6b read-channel A/D converter with 7b servo mode," ISSCC, Feb. 2000.
[20] K. Sushihara, "A 6b 800-Msample/s CMOS A/D converter," ISSCC, Feb. 2000.
[21] D. Dalton et al., "A 200-MSPS 6-bit flash ADC in CMOS," IEEE Journal of Solid-State Circuits, Nov. 1998.
[22] R. Roovers and M. Steyaert, "A 6-bit, 160-mW, 175-MS/s A/D converter," IEEE Journal of Solid-State Circuits, July 1996.
[23] Y. Tamba and K. Yamakido, "A CMOS 6b 500-Msample/s ADC for a hard disk read channel," ISSCC, 1999.
Chapter 6
Low Power Flip-Flop and Clock Network Design
Methodologies in High-Performance System-on-a-
Chip
Abstract: In many VLSI (very large scale integration) chips, the power dissipation of the
clocking system that includes clock distribution network and flip-flops is often
the largest portion of total chip power consumption. In the near future, this
portion is likely to dominate total chip power consumption due to higher clock
frequencies and the trend toward deeper pipelines. Thus it is important to
reduce power consumption in both the clock tree and the flip-flops.
Traditionally, two approaches have been used: 1) to reduce power consumption
in the clock tree, several low-swing-clock flip-flops and double-edge
flip-flops have been introduced; 2) to reduce power consumption in flip-flops,
conditional-capture, clock-on-demand, and data-transition look-ahead
techniques have been developed.
In this chapter these flip-flops are described with their pros and cons. Then, a
circuit technique that integrates these two approaches is described along with
simulation results. Finally, clock gating and logic embedding techniques are
explained as powerful power saving techniques, followed by a low-power
clock buffer design.
Key words: Flip-flop, small-swing, low-power, clock tree, statistical power saving, clock
gating, double edge-triggered, logic embedding, clock buffer.
6.1 INTRODUCTION
[2]. The need for cheap packaging will require further reduction in power
consumption. Heat sinks required for high-power chips occupy a large
amount of space, and a cooling fan causes extra power consumption. Also,
low-power systems-on-a-chip (SoCs) are needed to meet the market demand
for portable equipment, such as cellular phones, laptop computers, personal
digital assistants (PDAs), and, soon, wearable computers.
In the near future, L di/dt noise is another important factor that demands
low power consumption in high-performance microprocessors [3]. At the clock
edge, a large amount of power supply current is required instantaneously.
However, inductance in the power rails limits the ability to deliver the
current fast enough, leading to core-voltage droop. For example, when a 1-GHz
microprocessor has a 1.6 V core voltage, a 2 pH package inductance, and a
di/dt of 80 A/ns, the first inductive voltage droop will be about 160 mV, 10%
of the core voltage. If a 10-GHz microprocessor has a 0.6 V core voltage, a
0.5 pH package inductance, and a di/dt of 1000 A/ns, the first inductive
voltage droop will be 500 mV, that is, 83.3% of the core voltage. To suppress
L di/dt noise, various power-saving techniques are essential in future chip
design.
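Both droop figures follow directly from V = L(di/dt), as a quick check shows:

\[ V = L\,\frac{di}{dt}:\qquad 2\,\mathrm{pH}\times 80\,\mathrm{A/ns} = (2\times10^{-12})(8\times10^{10}) = 0.16\ \mathrm{V},\qquad 0.5\,\mathrm{pH}\times 1000\,\mathrm{A/ns} = 0.5\ \mathrm{V} \]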
In many VLSI chips, the power dissipation in the clocking system that
includes clock distribution network and flip-flops is often the largest portion
of total chip power consumption, as shown in Figure 6.1 [4][5][6][7]. This is
due to the fact that the activity ratio of the clock signal is unity and the
interconnect length of the clock trees has increased significantly. In Figure
6.1, hashed bars represent the power consumption in the clock distribution
network (clock tree and clock buffers), and a dark bar represents the power
dissipation in the clock network and storage elements (latches and flip-
flops). The design trend for using more pipeline stages for high throughput
increases the number of flip-flops in a chip. With deeper-pipeline design,
clocking system power consumption can be more than 50% of the total chip
power consumption, and the portion will increase as the clock frequency
goes up. The clock frequency of microprocessors has doubled every two to
three years, as reported in the literature. In a recent high-frequency
microprocessor, the clocking system consumed 70% of the total chip power [7].
Thus, it is important to reduce power consumption in both the clock trees and
the flip-flops.
As the clock frequency doubles every two to three years and the number of
gates per cycle decreases with deeper pipeline design, flip-flop insertion
overhead increases significantly. To minimize the flip-flop insertion
overhead, high-performance flip-flop design is crucial in high-speed SoC
design. Both HLFF by H. Partovi and SDFF by F. Klass, shown in Figure
6.2(a) and (b), have been known as the fastest flip-flops [8][9].
Both of them are based on a short-pulse-triggered latch design and include an
internal short-pulse generator. For example, the front-end of HLFF is a pulse
generator, and its back-end is a latch that captures the pulse generated in
the front-end. Figure 6.3 illustrates the short-pulse generation in HLFF. At
the rising edge of the CK signal, CKbd is in the "Hi" state and goes "Lo"
after a 3-inverter delay (tp). Hence, a virtual short pulse, PC in Figure 6.3,
is applied to the front-end of HLFF. During the short time tp, the 3 stacked
NMOS transistors in the front-end will conduct if D is "Hi," and the 3 stacked
NMOS transistors in the back-end will conduct if D is "Lo." The small
transparency window of HLFF is closely related to its hold time. Hence, the
minimum delay (3 inverter delays) between flip-flops should be guaranteed to
avoid hold-time violations. HLFF has several advantages: small D-Q delay,
negative setup time, and logic embedding with small penalty.
SDFF has characteristics similar to HLFF. A back-to-back inverter pair is
added at the internal node for robust operation. The back-end latch has only
two stacked NMOS transistors, which enables SDFF to operate faster than HLFF.
A NAND gate is used for conditional shutoff, which is robust with respect to
variations of the sampling window compared to the unconditional
rest of this section is organized as follows. Section 6.3.1 describes the
low-power transmission-gate master-slave flip-flop and a modified flip-flop.
In Section 6.3.2, four statistical power reduction techniques are explained.
Sections 6.3.3 and 6.3.4 explain power-saving methodologies for clock
networks, namely small-swing clocking and double-edge triggering. In Section
6.3.5, low-swing double-edge-triggered flip-flops are presented, which combine
the good features of both techniques described in Sections 6.3.3 and 6.3.4.
Finally, simulation results are compared in Section 6.3.6.
A master-slave latch pair with a two-phase clock can form a flip-flop. The
transmission-gate master-slave latch pair (TGFF) used in the PowerPC 603 is
shown in Figure 6.4 [23]. A schematic of a modified flip-flop is shown in
Figure 6.5 [24].
Flip-flops with the statistical power saving techniques of Section 6.3.2 use
full-swing clock signals, which cause significant power consumption in the
clock tree. One of the most efficient ways to save power in the clock network
is to reduce the voltage swing of the distributed clock signal.
Figures 6.10, 6.11, and 6.12 show several small-swing clocking flip-flops and
their multi-phase or single-phase clock signals.
The single clock flip-flop (SCFF) can operate with a small-swing clock
without a leakage-current problem, because the clock (as shown in Figure
6.12(a)) drives no PMOS transistors. It can also use a simple clocking scheme
similar to that of Figure 6.11(b) with a lower clock-swing level; indeed, the
peak value of the clock signal in SCFF can be reduced to half [16].
While its single clock phase is advantageous, a drawback of SCFF lies in its
long latency: it samples data at the rising edge of the clock signal and
transfers the sampled data at the falling edge. This long latency becomes a
bottleneck for high-performance operation.
Another efficient way to save power in the clock network is to halve the
frequency of the distributed clock signal by double-edge triggering.
Double-edge-triggered flip-flops (DETFFs) can reduce the power dissipation in
the clock tree, ideally by half. This requires a 50% duty ratio of the clock
in order not to suffer any performance degradation in the system. However, it
is not easy to achieve both a 50% duty ratio and the same amount of clock skew
at the rising and falling edges of the clock. Therefore, when these non-ideal
penalties are considered, the clock frequency should be adjusted as shown in
Figure 6.13.
and output node Q starts to ramp up. This deep dip at node Q is due to
different signal-path delays and cannot be avoided in this DETFF.
signal, N5 and N6 are turned on to sample data during this phase. Hence, the
clock frequency in equation (6.2) can be lowered to half and, accordingly,
the clock-network power consumption can be reduced by 50%.
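In the usual first-order clock-power expression the two techniques compose multiplicatively (a sketch that ignores driver and short-circuit power, and assumes the low-swing clock is driven from a dedicated supply):

\[ P_{clk} \;=\; C_{clk}\,V_{swing}^{2}\,f_{clk} \]

Double-edge triggering halves f_clk, and a half-swing clock divides the V_swing^2 term by four, so in principle the combination can cut clock-distribution power by nearly a factor of eight.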
Figure 6.18 shows the concept of the proposed clocking scheme, and
Figure 6.19 shows equivalent implementation methods. With type A, the timing
skew between CKd and CKdb can be minimized by tuning the transistor sizes of
the inverters. For type B, a pulsed-clock signal can be generated by an
additional pulsed-clock generator. Although the inverter overhead is removed
in LSDFF, the degraded pulse amplitude and width may be a problem for
clock-signal propagation. Type C is considered the best method for removing
timing skew, at the cost of some additional power consumption.
The operation of LSDFF is explained next. In Figure 6.17, prior to the rising
edge of the clock signal CK, N3~N6 are off. When the input changes to "Hi,"
node Y is discharged to "Lo" through NMOS transistor N1, and node X retains
the previous data value "Hi." After the rising edge of CK, N3 and N4 are on,
and node X is discharged to "Lo." Node X drives the gate of P2, which in turn
charges the output node Q to "Hi." When the input changes to "Lo," node X is
charged to "Hi" through PMOS transistor P1, and node Y retains the previous
data value "Lo." After the rising edge of CK, N3 and N4 are on, and node Y is
charged, finally being pulled to the full "Hi" level by P3. Node Y drives the
gate of N2 to discharge the output node Q to "Lo." The operation at the
falling edge of CK can be explained in a similar manner.
Figure 6.21(a) shows that LSDFF has the lowest power consumption when the
input pattern does not change, whereas HLFF and SDFF still incur high power
consumption even when the input stays "Hi." For an average input switching
activity of 0.3, the power consumption of LSDFF is 28.6%~49.6% lower than
that of conventional flip-flops, as shown in Figure 6.21(a), mainly due to the
halved clock frequency and the elimination of unnecessary internal-node
transitions. The power-delay product is also reduced accordingly.
Simple logic elements can be embedded into LSDFF to reduce the overall delay
within a pipeline stage. With logic embedded in LSDFF, the overall circuit
performance can be optimized by saving a gate in critical paths. Embedding
logic inside the flip-flop will become more important in terms of power and
performance due to reduced cycle times and increased flip-flop insertion
overhead. Table 6.3 shows that the speedup factor of logic embedded in LSDFF
over discrete logic ranges from 1.33 to 1.49. SDFF can include a logic
function inside the flip-flop even more easily than LSDFF, because the input
data D feeds only one NMOS gate, as shown in Figure 6.2(b). Hence,
logic-embedded SDFF can increase the overall performance significantly, as
seen in [9].
for the clock tree can be optimized to reduce skew [39]. Further research is
needed to reduce power consumption.
The small input capacitance of a flip-flop has become more important in
multi-GHz SoCs. A large input capacitance requires a bigger driver in the
preceding combinational logic block, and if the required driver gain is too
big, the size of the gate that feeds the driver must be increased as well.
This ripple effect may increase the size of a single pipeline stage by up to
50%, because of the reduced number of gates in a single pipeline stage in
multi-GHz SoCs. A 50%-larger combinational logic block also consumes 50%
more power.
As process technology shrinks, more soft errors will occur in flip-flops and
other circuits because of alpha particles emitted from chip packages and
energetic particles induced by cosmic rays. To reduce this soft-error rate,
the drain/source area in the feedback path should be increased [33], which in
turn consumes more power. Hence, power-aware soft-error minimization
techniques will become more important.
(Notes to the comparison table: the compared designs include those by Gago et
al., by Hossain et al., and by Mishra et al.; NA = not available because an
internal clock buffer is not needed; one entry denotes the power consumption
of the combinational block.)
6.6 SUMMARY
Flip-flop design plays an important role in reducing cycle time and power
consumption. This chapter focused on power-saving techniques in flip-flops
and the clock distribution network. To summarize, the following features
should be considered to reduce the clocking power of the chip.
REFERENCES
[31] D. Markovic, B. Nikolic, and R. Brodersen, "Analysis and design of low-energy flip-flops," in Proc. IEEE Int. Symp. Low-Power Electronics and Design, Aug. 2001, pp. 52-55.
[32] J.-S. Wang, P.-H. Yang, and D. Sheng, "Design of a 3-V 300-MHz low-power 8-b x 8-b pipelined multiplier using pulse-triggered TSPC flip-flops," IEEE J. Solid-State Circuits, vol. 35, no. 4, pp. 583-592, Apr. 2000.
[33] T. Karnik, B. Bloechel, K. Soumyanath, V. De, and S. Borkar, "Scaling trends of cosmic rays induced soft errors in static latches beyond 0.18um," in Proc. IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2001, pp. 61-62.
[34] D. Brooks, P. Bose, S. Schuster, H. Jacobson, P. Kudva, A. Buyuktosunoglu, V. Zyuban, M. Gupta, and P. Cook, "Power-aware microarchitecture: design and modeling challenges for next-generation microprocessors," IEEE Micro, vol. 20, no. 6, pp. 26-44, Nov.-Dec. 2000.
[35] V. Adler and E. G. Friedman, "Repeater design to reduce delay and power in resistive interconnect," in Proc. IEEE Int. Symp. Circuits and Systems, May 1997, pp. 2148-2151.
[36] A. Vittal and M. Marek-Sadowska, "Low-power buffered clock tree design," IEEE Trans. Computer-Aided Design, vol. 16, no. 9, pp. 965-975, Sep. 1997.
[37] P. Gronowski, "Designing high performance microprocessors," in Proc. IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 1997, pp. 51-54.
[38] M. Gowan, L. Biro, and D. Jackson, "Power considerations in the design of the Alpha 21264 microprocessor," in Proc. Design Automation Conf., June 1998, pp. 726-731.
[39] C. Chu and D. F. Wong, "An efficient and optimal algorithm for simultaneous buffer and wire sizing," IEEE Trans. Computer-Aided Design, vol. 18, no. 9, pp. 1297-1304, Sep. 1999.
Chapter 7
Power Optimization by Datapath Width Adjustment
Abstract: Datapath width is an important design parameter for power optimization. The
datapath width significantly affects the area and power consumption of
processors, memories, and circuits. By analyzing the required bit-width of
the variables, the datapath width can be optimized for power minimization.
Several concepts and techniques for power minimization by datapath-width
adjustment are summarized.
7.1 INTRODUCTION
Since the datapath width, that is, the bit width of the buses and operational
units in a system, strongly affects the size of the circuits and memories in a
system, the power consumption of a system also depends on the width of the
datapath.
In hardware design, designers are very sensitive to the width of the
datapath. By carefully analyzing the requirements on the datapath width,
designers determine the length of registers and the datapath width so as to
minimize chip area and power consumption. In processor-based system design, it
is difficult for programmers to change the datapath width for each program. A
system designer fixes the datapath width of the system when he/she chooses a
processor. On the other hand, each application requires a different accuracy
of computation, which is dictated by the specifications of the input/output
signals and algorithms, and the required datapath width sometimes differs from
the width of the processor's datapath.
Table 7.1 shows the bit width of each variable in an MPEG-2 video decoder
program [1]. The program is written in C with over 6,000 lines, and 384
variables are declared as int type. Fifty variables are used as flags, and
only 1 bit is required for each of them during the computation. Only 35% of
the total bits of these 384 variables are actually used in the computation;
65% are useless.
This section shows the relationship between datapath width and power
consumption. The datapath width directly affects the power consumption of
buses, operation units (such as adders, ALUs, and multipliers), and registers.
It is also related to the size of the data and instruction memories of
processor-based systems.
The relations between datapath width and power consumption are summarized as
follows:
1. Shorter registers and operation units reduce the switching count and the
leakage current of extra bits on the datapath.
2. A smaller circuit size implies a smaller capacitance on each wire.
3. The datapath width is closely related to the size of the data and
instruction memories of processor-based systems. This relationship is not
monotonic.
The datapath width is directly related to the area of the datapath and the
memories. The area of circuits and memories is in turn closely related to
power consumption because of load capacitance.
Assume a processor-based system whose datapath width can be changed by system
designers. As the datapath width is reduced, the area and power consumption of
the processor decrease almost linearly because of the reduction in the size of
registers, buses, and operation units. The size of the memory, which also
strongly affects the power consumption of the system, changes drastically with
the selection of the datapath width.
Generally, narrowing the datapath width reduces the area and power of
the processor, but degrades the performance. The number of execution
cycles increases, since some single-precision operations should be replaced
with double or more precision operations in order to preserve the accuracy of
the computation. Single-precision operations are those whose precision is
smaller than that of the datapath width. For example, an addition of two 32-
bit data is a single-precision operation on processors whose datapath width is
int x, y, z; /* n-bit values, each split into m-bit halves */
z_low = x_low + y_low; /* m-bit addition, producing a carry */
z_high = x_high + y_high + carry; /* m-bit addition absorbing the carry */
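For concreteness, the following self-contained C sketch (illustrative only; the function name and the choice m = 16, n = 32 are assumptions, not details from the text) emulates a 32-bit addition on a hypothetical 16-bit datapath by splitting each operand into 16-bit halves and propagating the carry between them:

#include <stdint.h>
#include <stdio.h>

/* Emulate a 32-bit add using only 16-bit operations (m = 16). */
uint32_t add32_on_16bit_datapath(uint32_t x, uint32_t y)
{
    uint16_t x_low = x & 0xFFFF, x_high = x >> 16;
    uint16_t y_low = y & 0xFFFF, y_high = y >> 16;

    uint32_t low_sum = (uint32_t)x_low + y_low;   /* up to 17 bits wide   */
    uint16_t carry   = (uint16_t)(low_sum >> 16); /* carry out of low half */
    uint16_t z_low   = (uint16_t)low_sum;
    uint16_t z_high  = x_high + y_high + carry;   /* wraps like hardware   */

    return ((uint32_t)z_high << 16) | z_low;
}

int main(void)
{
    printf("%u\n", add32_on_16bit_datapath(70000u, 2u)); /* prints 70002 */
    return 0;
}

The extra instructions of the low/high sequence are exactly the source of the increase in execution cycles mentioned above.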
To adjust the datapath width for power reduction, information on the bit-width
requirement of each variable is essential. Popular programming
languages, however, provide no means of expressing detailed information on the bit
width of variables. System designers and programmers are not concerned
with the size of variables beyond the selection of data types such as int,
short, and char in the C language.
In the design phase of algorithms and programs, designers want to
concentrate their attention on the design of system functionality. Information
on the bit width of variables has low priority, though it is very useful for
power optimization. It is therefore desirable for the bit width of each variable to be
inferred automatically from the descriptions of algorithms and programs.
The bit-width analysis problem is defined as follows:
For a given program, a set of input data, and requirements on computation
accuracy (e.g., quality of output), find the bit width of every variable in the
program that keeps sufficient information during the computation for the given
input data set while satisfying the accuracy requirement.
Several bit-width analysis techniques have been developed [4][5][6][7].
Using the techniques, an ordinary program is automatically analyzed, and
the bit width of each variable required for computation is specified in the
program. Thus, programmers do not have to care about the variable bit
width.
There exist two approaches to analyzing variable bit widths [4]. One is
dynamic analysis, in which one executes the program and monitors the value of
each variable. The other is static analysis, in which the variable bit
widths are derived by formal rules without executing the program.
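A minimal sketch of the dynamic approach (illustrative only; the cited tools [4] are more elaborate) is to instrument each assignment so that the largest magnitude ever taken by a variable is recorded, from which its required bit width follows:

#include <stdint.h>

/* Track the maximum magnitude observed for one monitored variable. */
typedef struct { uint64_t max_abs; } BitTrack;

static void track(BitTrack *t, int64_t value)
{
    uint64_t mag = (value < 0) ? (uint64_t)(-value) : (uint64_t)value;
    if (mag > t->max_abs) t->max_abs = mag;
}

/* Bits needed for all observed values of a signed variable. */
static int required_bits(const BitTrack *t)
{
    int bits = 1;                      /* at least 1 bit, e.g., for flags  */
    uint64_t v = t->max_abs;
    while (v > 0) { bits++; v >>= 1; } /* magnitude bits plus a sign bit   */
    return bits;
}

The result is, of course, only valid for the input data sets that were exercised, which is why dynamic analysis is combined with static analysis as described next.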
In static analysis, rules to compute the ranges of variables resulting from the
basic operations are prepared. The analysis is performed both
forward and backward [7]. In the forward analysis, for an assignment
statement with arithmetic operations, the range of the variable on the left-hand side is
calculated from the ranges of the variables and constants on the right-hand side,
according to the rules applicable to the operations involved. Starting from the ranges
of the input data, one can calculate the range of every variable in the program by a
technique of symbolic simulation. The backward analysis proceeds analogously from
the ranges of the outputs.
For example, consider the following addition statement:
z = x + y
If the ranges of x and y are [0, 2000] and [30, 500], respectively, the
range of z is [30, 2500]. Thus, 12 bits are required for variable z.
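The forward rule for addition can be captured in a few lines of C. The sketch below (illustrative, with hypothetical names) propagates [lo, hi] ranges and derives the unsigned bit width, reproducing the 12-bit result above:

#include <stdio.h>

typedef struct { long lo, hi; } Range;

/* Forward rule for z = x + y: interval endpoints add. */
static Range range_add(Range x, Range y)
{
    Range z = { x.lo + y.lo, x.hi + y.hi };
    return z;
}

/* Unsigned bits needed for the largest value in the range. */
static int range_bits(Range r)
{
    int bits = 1;
    unsigned long v = (unsigned long)r.hi;
    while (v > 1) { bits++; v >>= 1; }
    return bits;
}

int main(void)
{
    Range x = { 0, 2000 }, y = { 30, 500 };
    Range z = range_add(x, y);
    printf("z in [%ld, %ld], %d bits\n", z.lo, z.hi, range_bits(z)); /* 12 bits */
    return 0;
}

Analogous rules for subtraction, multiplication, and the other operators let the analysis walk the whole program by symbolic simulation.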
Static analysis is an efficient method for determining variable bit widths.
In many cases, however, the value assigned to a variable cannot be
predicted unless the program is executed, for example in the presence of unbounded
loops, and static analysis alone is insufficient. As a solution to this problem,
dynamic analysis is used in combination with static analysis.
The performance, chip area (including CPU, data RAM, and instruction ROM), and energy
consumption were estimated while varying the datapath width.
Two ADPCM decoder ASICs were designed. The designs are direct hardware
implementations of an ADPCM decoder and do not include processor
cores. The design started from an ADPCM decoder program written in C,
which is part of the DSPstone benchmark suite. Next, the required bit
width of the variables in the program was statically analyzed. The analysis
results are shown in Table 7.2. There are eight int-type variables in the
program, all of which are 32 bits wide in the original. The results show that no
variable requires a precision of 32 bits or more; the largest
variable requires only 18 bits.
Based on these results, two ASICs for the ADPCM decoder were designed:
one with a 32-bit datapath (ADPCM 32) and the other with an 18-bit one
(ADPCM 18). Since no high-level synthesis tool was available, we manually
designed the ASICs in VHDL. Logic synthesis was then performed with
Synopsys Design Compiler and a 0.5µm standard-cell technology.
In the third case study, an MPEG-2 video decoder was examined [1]. The
MPEG-2 decoder program was obtained from the MPEG Software
Simulation Group. In this design, Bung-DLX and Valen-C were also used.
The original program consists of over 6,000 lines of C code. We analyzed
the required bit width of its 384 int-type variables; the results are
summarized in Table 7.1. Based on the results, we translated the C program
into a Valen-C program.
The datapath width of the processor was varied from 17 bits to 40 bits,
and the performance (in terms of execution cycles), gate count, and energy
consumption were estimated. The results are depicted in Figure 7.9. From the
figure, one can see that the chip area increases in a monotonic fashion with
the datapath width. The number of execution cycles is minimized at a 28-bit datapath
and does not decrease further for larger bit widths. Note that smaller datapath
bit widths have shorter critical-path delays. This means that, in the MPEG-2
example, performance is maximized at a 28-bit datapath. Energy
consumption is also minimized at 28 bits. For datapaths narrower than 28 bits,
more energy is required because of the larger number of execution cycles. On the other
hand, for datapaths wider than 28 bits, wasteful switching activity on the datapath
increases, and extra energy is consumed.
For SOC design, the bit width of the data computed in a system is one of the
most important design parameters related to the performance, power, and cost of
the system. The datapath width and the size of the memories strongly depend on the
bit width of the data. System designers often spend much time analyzing the
bit width of the data required in the computation of a system. Hardware
designers of portable multimedia devices reduce the datapath width [11].
Programmers of embedded systems sometimes work hard to adjust
the bit width of variables while keeping the accuracy of the computation. By
controlling the datapath width, one can reduce area and power consumption
drastically. Furthermore, one can choose the computation precision actually
required by each application to further optimize an application-specific design.
Figure 7.10 shows the flow of the presented quality-driven design (QDD)
methodology for video decoders. In the first phase of a system design, the
implementation of the system functionality and optimization for the general
constraints of performance, power, and cost are performed. Initial designs are
written in a high-level language, such as C, in which most variables are assumed
to be 32 bits. After the functional design is validated and verified, a second phase
of application-specific optimization is performed. In this phase, the bit width of the
variables in the application program is analyzed, the design parameters are tuned,
the output quality and computation precision are adapted, and datapath-width
adjustment is performed under the given quality constraint. Using QDD, one
can design various video applications with different video quality from the
same basic algorithm.
In QDD, both the higher and the lower bits of data can be reduced. Based on the
requirements on output quality, lower bits of data may be omitted in the
datapath-width adjustment (see Figure 7.11). This means that there is
potential for further energy reduction by decreasing the computation accuracy.
7.7 SUMMARY
REFERENCES
[1] Y. Cao and H. Yasuura, "A system-level energy minimization using datapath
optimization," International Symposium on Low Power Electronics and Design, August
2001.
[2] B. Shackleford, et al, "Memory-CPU size optimization for embedded system designs," in
Proc. of 34th Design Automation Conference (34th DAC), June 1997.
[3] T. Ishihara and H. Yasuura, "Programmable power management architecture for power
reduction," IEICE Trans. on Electronics, vol. E81-C, no. 9, pp. 1473-1480, September
1998.
[4] H. Yamashita, H. Yasuura, F. N. Eko, and Y. Cao, "Variable size analysis and
validation of computation quality," in Proc. of Workshop on High-Level Design
Validation and Test (HLDVT'00), Nov. 2000.
[5] M. Stephenson, J. Babb, and S. Amarasinghe, "Bitwidth analysis with application to
silicon compilation," Conf. Programming Language Design and Implementation, June
2000.
[6] M.-A. Cantin and Y. Savaria, "An automatic word length determination method," in
Proc. of the IEEE International Symposium on Circuits and Systems, pp. V53-V56, May
2001.
[7] S. Mahlke, R. Ravindran, M. Schlansker, R. Schreiber, and T. Sherwood, "Bitwidth
cognizant architecture synthesis of custom hardware accelerators," IEEE Trans. CAD,
vol. 20, no. 11, pp. 1355–1371, Nov. 2001.
[8] H. Yasuura, H. Tomiyama, A. Inoue, and F. N. Eko, "Embedded system design using
soft-core processor and Valen-C," J. Info. Sci. Eng., vol. 14, pp. 587-603, Sept. 1998.
[9] F. N. Eko, et al., "Soft-core processor architecture for embedded system design," IEICE
Trans. Electronics, vol. E81-C, no. 9, pp. 1416-1423, Sep. 1998.
[10] A. Inoue, et al., "Language and compiler for optimizing datapath widths of embedded
systems," IEICE Trans. Fundamentals, vol. E81-A, no. 12, pp. 2595-2604, Dec. 1998.
[11] C.N. Taylor, S. Dey, and D. Panigrahi, "Energy/latency/image quality tradeoffs in
enabling mobile multimedia communication," in Software Radio: Technologies and
Services, E. Del Re, Ed., Springer-Verlag, January 2001.
[12] Y. Cao and H. Yasuura, "Video quality modeling for quality-driven design," in Proc. of
the 10th Workshop on Synthesis and System Integration of Mixed Technologies
(SASIMI 2001), Oct. 2001.
Chapter 8
Energy-Efficient Design of High-Speed Links
Key words: High-speed I/O, serial links, parallel links, phase-locked loop, delay-locked
loop, clock data recovery, low-power, energy-efficient, power-supply
regulator, voltage scaling, digital, mixed-signal, CMOS.
8.1 INTRODUCTION
Figure 8.3 presents eye diagrams for ideal and real links, where the x-
axis spans two bit times in order to show both rising and falling transitions
of the data signal. For a random data sequence, there are both falling and
rising transitions at each bit interval. While the data levels and bit intervals
are clearly defined in the ideal case, real systems suffer from process
variability, environmental changes, and various noise sources that interact
with the signal to blur (or close) the eye. Notice that the high and low
voltage levels are no longer well-defined levels but occur over ranges. The
same holds true for the transition times. Qualitatively, larger eye openings
represent more reliable links. Quantitatively, one can apply two metrics to
measure eye quality: voltage margin and timing margin. The vertical eye
opening, measured in the middle of the eye, determines how much voltage margin the
receiver has in deciding whether the received signal is a high or a low
level. The horizontal opening provides a measure of how well the receiver
can distinguish one data bit from the next. Due to the finite slope of the edge
transitions, a reduction in voltage margin also leads to narrower timing
margins.
Besides environmental variation and noise in the transceiver circuits,
there are non-idealities in the channel that degrade signal quality. Therefore,
an eye diagram at the receiver presents a more realistic picture of link
performance than one measured at the transmitter. Unfortunately, even
measuring at the receiver does not provide the whole picture. There can be
voltage and timing offsets in the receiver, and the designer must subtract
these offsets from the measured margins. Furthermore, since the
measurement occurs over a finite time interval, it cannot fully capture the
effects of unbounded random noise sources (e.g., thermal noise, 1/f noise,
device noise, etc.) that are represented by probabilistic distributions with
infinite tails. So instead of relying only on margins, designers quantify link
reliability in terms of the bit-error rate (BER), the probability that
any given bit is received in error. This probability is an exponential
function of the excess signal margin divided by the RMS value of the
random noise sources [12]. Increasing margins and reducing noise improves
BER but may come at the expense of higher power consumption. Therefore,
understanding and making the right trade-offs between performance and
power is important. Let us take a look at what some of these trade-offs are
by reviewing the operation of the link components, beginning with the
transmitter.
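For a Gaussian noise model this relationship takes a standard form (a textbook expression in the spirit of [12], not reproduced in this excerpt); with voltage margin $V_m$ and RMS noise $\sigma_n$,

$$\mathrm{BER} \approx Q\!\left(\frac{V_m}{\sigma_n}\right), \qquad Q(x) = \frac{1}{\sqrt{2\pi}}\int_x^{\infty} e^{-t^{2}/2}\,dt \;\le\; \tfrac{1}{2}\,e^{-x^{2}/2},$$

which makes the exponential dependence explicit: each additional $\sigma_n$ of margin lowers the error probability by orders of magnitude, and conversely, giving up margin to save power costs reliability quickly.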
8.2.2 Transmitter
The transmitter converts binary data into electrical signals that propagate
through an impedance-controlled channel (or transmission line) to a receiver
at the opposite end. This conversion must be done with accurate signal levels
and timing for a reliable high-speed communication link. Link designers
commonly use high-impedance current-mode drivers in single-ended or
differential configurations, and there are various choices for terminating the
signals through the impedance-controlled channel. This subsection
investigates these different transmitter options and looks at how they impact
power/energy consumption. Lastly, controlling the slew rate of the
transmitted signal is desirable for minimizing noise coupling into the
channel. Since lower noise solutions enable lower power, this section
presents several techniques for slew-rate controlled transmitters. The
discussion will start with a single-ended high-impedance driver.
data, and the ability to vary the transmitted level enables lower power
dissipation. In the case of parallel links, several channels may share a single
reference line, and its overhead can be amortized across
them all. For serial links, a reference voltage line may also be used, but
designers will more commonly use a differential signaling scheme where a
pair of wires carries complementary signals. Two implementations are
illustrated in Figure 8.5. One uses a differential pair with a single current
source that sets the output swing. The other implements a pair of single-
ended transmitters, each transmitting complementary data. The drawback of
using a differential pair arises from the reduced gate overdrive on the output
devices. Using larger devices can enable the same current drive at the
expense of larger capacitive loading on both the inputs and outputs that can
limit bandwidth and increase power.
So far, it has been seen that reducing noise can lead to lower power link
designs. Package and connector non-idealities can be another source of
noise. High-frequency energy in the transmitted signal can interact with
parasitic RLC tanks to cause ringing in the line and coupling (cross talk) into
adjacent lines. Therefore, high-speed link designs often limit the edge rate of
transmitted signals to mitigate these effects. Implementing edge-rate control
is fairly straightforward and several examples can be found in the literature.
There are two general approaches used to implement edge-rate control. The
technique illustrated in Figure 8.6(a) limits the slew rate of signals by
controlling the RC time constant of the driver’s input signal [14]. This can
be achieved by adjusting the capacitive loading or by changing the drive
strength of the preceding predriver buffer and thereby varying its effective
output resistance. In so doing, the edge rate of the signal slews
at a controlled rate. Another technique, presented in Figure
8.6(b), breaks the driver input into smaller parallel segments and slews the
output by driving the segments in succession with some delay (often
implemented with an RC delay line) [15]. Care must be taken to guarantee
that the time constants of the signal slew are fixed in proportion to the
symbol rate. Since both the RC time constant of the predriver and that of the delay
elements depend on process and operating environment, some mechanism for
controlling them is required. Time constants can be controlled manually or
with a simple control loop that relies on a process- and environment-
monitoring circuit. An inverter-based ring oscillator is a good example of
such a circuit [14]. The oscillation period of the ring is directly related to
process and environmental conditions. Therefore, by counting the
oscillations over a known period, a digital control loop can converge to the
appropriate slew-rate settings for the symbol rate. A system-level approach
that utilizes such knowledge of the process and
environmental conditions of a chip can be extended to other parts of the link
interface to enable energy-efficient designs [13][16]; it is discussed in
more detail in Sections 3 and 4.
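The counting-based calibration just described might look as follows in pseudo-C (a minimal sketch; the hardware hooks, the monotonic effect of the setting, and the threshold constants are assumptions, not details from the chapter):

#include <stdint.h>

/* Hypothetical hardware hooks: read the ring-oscillator cycle count
 * accumulated over a fixed reference window, and write the predriver
 * drive-strength (slew-rate) setting. */
extern uint32_t ring_osc_count(void);      /* higher count = faster silicon */
extern void     set_slew_setting(uint8_t); /* larger setting = stronger drive */

#define TARGET_COUNT 1000u  /* count corresponding to the desired edge rate */
#define TOLERANCE       8u

/* One convergence step: fast silicon gets a weaker predriver (slower
 * edges), slow silicon a stronger one, keeping the edge rate roughly
 * fixed in proportion to the symbol rate. */
void slew_cal_step(uint8_t *setting)
{
    uint32_t count = ring_osc_count();
    if (count > TARGET_COUNT + TOLERANCE && *setting > 0)
        (*setting)--;
    else if (count + TOLERANCE < TARGET_COUNT && *setting < 255)
        (*setting)++;
    set_slew_setting(*setting);
}

Calling slew_cal_step periodically implements the simple digital control loop described in the text; the dead band (TOLERANCE) prevents dithering once the loop has converged.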
8.2.3 Receiver
At the opposite end of the channel, a receiver circuit deciphers the incoming
analog signals into digital data bits. This block commonly consists of a
differential sampling circuit that samples the data in the middle of the
received symbol and amplifies the low-swing signal to binary levels. Single-
ended signaling connects the signal line to one input of the differential pair
while the other is set to a reference voltage to which the signal is compared.
Differential signaling connects each signal line to each side of the input
buffer. So, the effective voltage swing seen by the receiver is much greater
for differential signaling than single-ended signaling for the same swing
magnitudes. This effect enables differential signaling to require smaller
voltage swings, which can lead to lower power consumption.
differences between the core loop’s clock rate and the data rate of the
received signal. This ability to compensate for frequency differences is
important for high-speed links because the opposite ends of a transceiver
may not share a common clock source.
Although clock generation for the transmitter and the receiver was
introduced separately, the transmitter and receiver for different
channels reside on the same die and may therefore share some of the clock-
generating components. More specifically, the core loop described for the timing
recovery of a receiver may also serve as the clock generator for an adjacent
transmitter [25]. Such sharing of components not only reduces circuit
redundancy, but also obviates issues arising from having multiple loops on the
same substrate4. Moreover, on-chip clock generation and distribution is a
significant source of power consumption in high-speed links and efforts to
reduce this power can enable a much more energy-efficient design.
4. When multiple PLLs are integrated onto the same substrate, they may suffer from
injection locking if not isolated from one another; this can be a significant source of clock
jitter [56].
8.3 APPROACHES FOR ENERGY EFFICIENCY
Now that we have an understanding of how some of the different design
choices affect the energy efficiency of high-speed link designs, this section
further investigates approaches specifically targeted at improving energy
efficiency. Energy consumption has been a growing concern in building
large digital systems (e.g., microprocessors) and has led to several
advancements in reducing power consumption [30][31][32]. Since high-speed
links are by nature mixed-signal designs (consisting of both digital and
analog circuits), one can leverage many of the observations and techniques
applied to digital systems to build energy-efficient links. One approach can
be as simple as taking advantage of the next-generation process technology
to enable lower energy consumption at the same performance. Parallelism
is another technique that digital designers have used to reduce power without
sacrificing performance. This section looks at several forms of parallelism
that are also possible in link design. Lastly, adaptive power-supply
regulation, a technique that has enabled energy-efficient digital systems, is
introduced and its application to the design of high-speed links is presented.
8.3.1 Parallelism
Parallelism has often been used in large digital systems as a way to achieve
higher performance while consuming less power at the expense of larger
area. Breaking up a complex serial task into simpler parallel tasks enables
faster and/or lower power operation in the parallel tasks. For links, the goal
is to reduce power consumption in the overall design without sacrificing bit
rate. An obvious way to parallelize an interface is to utilize multiple links to
achieve the desired aggregate data throughput (i.e., parallel links). Parallel
links can operate at lower bit rates in order to mitigate channel non-idealities
(e.g., skin and dielectric loss, and cross talk) and enable an energy-efficient
interface. However, this pin-level parallelism comes at the expense of pin
and channel resources, which are not always abundant in many
The clock rate of a chip limits link performance when the bit rate is equal to
the clock frequency. Even with aggressive pipelining to reduce the critical
path delay in the datapath, there is a minimum clock cycle time required to
distribute and drive the clock signal across the chip. As seen in Figure 8.9,
where the clock cycle time is expressed in terms of fanout-of-4 (FO4)
inverter delays5 on the x-axis, as the cycle time shrinks, the clock
experiences amplitude attenuation as it propagates through a chain of inverters [34].
The minimum cycle time that can be propagated is roughly 6 inverter delays.
Transmitting at this clock rate limits the bit rate to less than 1 Gb/s for the
technology considered. However, higher bit rates are desirable in high-speed
links, and, therefore, transmitting several bits within a clock cycle is required
for higher data rates.
5. A fanout-of-4 inverter delay is the delay of an inverter driving a load equivalent to four
times its own input capacitance. A fanout of 4 is used since that is the optimal fanout for
implementing a ramp-up buffer chain to drive a large capacitive load with minimum delay.
Designers must trade off the matching properties of the delay elements and clock
distribution circuits used against the power and performance targets sought.
One can also break the signal voltage swing into smaller segments to
encode multiple bits of data in one transmitted symbol. Pulse-amplitude
modulation (PAM) is a technique that enables higher bit rates without the
need for higher clock rates and has been demonstrated in several high-speed
link designs [28][7]. It relies on parallel transmitters that drive the channel by
encoding multiple bits into different voltage levels within a symbol, as shown
by the example of a PAM-4 implementation in Figure 8.10. One of the
advantages of PAM is that the energy of the symbols transmitted down the
channel occupies a lower frequency spectrum than that of binary transmission at the
same bit rate. Hence, the signal experiences less distortion and loss through the
channel. Unfortunately, encoding bits into multiple amplitude levels reduces
voltage margins, and, therefore, this scheme is more susceptible to cross talk
[37].
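To make the encoding concrete, the sketch below (illustrative only; the level spacing and the Gray-coded mapping are common choices, not details taken from the referenced designs) maps two data bits onto one of four transmit levels each symbol time:

/* PAM-4: 2 bits per symbol, four equally spaced levels.
 * Gray coding keeps adjacent levels one bit apart, so a
 * single-level error corrupts only one bit. */
static const double pam4_level[4] = { -3.0, -1.0, +1.0, +3.0 };

/* Map bit pair (b1 b0) to a level index: 00->0, 01->1, 11->2, 10->3. */
static int gray_index(unsigned b1, unsigned b0)
{
    unsigned g = (b1 << 1) | b0;
    return (int)(g ^ (g >> 1));   /* Gray-to-binary conversion */
}

double pam4_encode(unsigned b1, unsigned b0)
{
    return pam4_level[gray_index(b1, b0)];
}

The reduced eye height is visible directly in the table of levels: with the same total swing, adjacent PAM-4 levels are separated by one third of the binary separation, which is the margin penalty noted above.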
The approaches for enabling more energy-efficient link designs
investigated so far have relied on the ability to reduce clock rates in order to
reduce power consumption without sacrificing bit rate. They can all leverage
the quadratic dependence of energy on supply voltage and trade circuit speed for
lower energy consumption. A dynamic voltage-scaling technique called adaptive power-
supply regulation extends this idea to maximize energy efficiency by
adjusting the supply voltage with respect not only to speed but also to
process and environmental conditions. It is described next.
The pursuit of reduced energy consumption in large digital systems has led
to a technique called adaptive power-supply regulation, or dynamic voltage scaling.
6. A fanout-of-4 inverter is an inverter that drives another inverter with four times its own
input capacitance.
The delay of a fanout-of-4 inverter6 as a function of supply voltage for a
representative CMOS process is shown in Figure 8.11. Assuming that the critical path
delay of a digital system is a function of some number of inverter delays
[40], the normalized frequency of operation versus supply voltage can be
found by inverting and normalizing the inverter's delay; this is also presented
in Figure 8.11. The frequency of operation achievable by a chip is roughly
linear in the supply voltage.
To understand what this relationship means for power, the delay data can
be applied to the dynamic power equation (equation 8.1), and the resulting
normalized power is plotted against normalized frequency for two supply-
voltage configurations in Figure 8.12. Given a fixed supply voltage, power
consumption is proportional to frequency, resulting in a straight line in this
figure, and reducing frequency lowers power consumption. Moreover, since
longer gate delays can be tolerated when the required operating frequency is
reduced, the circuit can operate at a lower supply voltage when running at lower
frequencies. Hence, by reducing both frequency and supply voltage, power
consumption drops dramatically, in proportion to frequency cubed.
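The cubic dependence follows directly from the dynamic power equation (equation 8.1 is not reproduced in this excerpt; its standard form is assumed below) together with the roughly linear frequency-voltage relationship noted above:

$$P_{dyn} = \alpha\, C\, V_{DD}^{2}\, f, \qquad f \propto V_{DD} \;\;\Rightarrow\;\; P_{dyn} \propto f \cdot f^{2} = f^{3}.$$

For example, halving both the frequency and, correspondingly, the supply voltage reduces dynamic power by roughly a factor of eight, versus only a factor of two if the frequency alone is halved at a fixed supply.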
In addition to the energy savings made possible by adaptively regulating the
power supply down to lower levels at lower frequencies, there is a potential
for saving energy lost to inefficiencies in conventional designs that
operate off a fixed supply voltage. Variability in circuit performance due
to process and temperature variations requires conventional designs to
incorporate overhead voltage margins to guarantee proper operation under
worst-case conditions, because circuit delay depends strongly
on process parameters and temperature. This overhead translates into excess
energy consumption under typical conditions.
Dynamically scaling the supply also offers several properties that enable the
designer to replace several precision analog circuit blocks with digital gates.
This is especially appealing for future process technologies that aggressively
scale both voltage and feature size. Section 4.2 describes a serial link design
that adaptively regulates its supply voltage to enable energy-efficient
operation.
8.4 EXAMPLES
Clock generation for both the transmitter and the receiver is a critical component
that sets the performance of high-speed links. The study and implementation
of PLLs and DLLs have been extensive over the past few decades, with special
attention placed on minimizing jitter. As mentioned earlier, the VCO in a
PLL is especially sensitive to noise, which led to the development of
self-biased differential delay elements by Maneatis [48], which have good
power-supply noise rejection properties. In recent years, a slightly different
approach to building PLLs and DLLs with good noise rejection properties
has emerged [47]. This approach relies on a linear regulator to drive simple
delay elements comprised of inverters. The delay of these inverters is
controlled directly through their supply voltage instead of by modulating
current or capacitive loading. The high power-supply rejection at the
output of the regulator isolates the control node from noise on the power
supply lines. In addition to its low-jitter characteristics, this approach eliminates
the static current of the delay elements, also enabling lower-power operation. This
section highlights the particular challenges that supply-regulated delay
elements present to the design of PLLs and DLLs. Implementation details of
a linear regulator and charge pump that are common to both PLL and DLL
designs are described, showing how one can build low-jitter loops whose
power consumption and bandwidth track with frequency.
8.4.1.1 DLL
In order to build PLLs and DLLs with robust operation over a wide range of
frequencies, one would like to have their bandwidths track the operating
frequency. Then, the loop parameters can be optimized to the lowest-jitter
settings [22]. Taking a look at the stability requirements for each loop
elucidates some of the challenges of using supply-regulated inverters as
delay elements. The transfer function of a DLL can be modeled with a single
dominant pole as

$$H(s) = \frac{1}{1 + s/\omega_{p}} \qquad (8.6)$$

where $\omega_{p}$ represents the dominant pole frequency (also equivalent to the loop
bandwidth). Ideally, $\omega_{p}$ should track the operating frequency $\omega_{ref}$, with the loop bandwidth
kept 10-20x below the operating frequency so that the fixed delay
around the loop results in only a small negative phase shift. The delay $t_{D}$ of the
supply-regulated delay line can be modeled by the following equation:

$$t_{D} = \frac{N\, C_{b}\, V_{ctl}}{I_{D}}, \qquad I_{D} \propto (V_{ctl} - V_{t})^{\alpha} \qquad (8.7, 8.8)$$

where N is the number of stages in the delay line, and $C_{b}$ is the capacitive
load seen by each delay stage. Taking the derivative with respect to the control
(supply) voltage $V_{ctl}$ yields the following expression for delay-line gain:

$$K_{DL} = \frac{\partial t_{D}}{\partial V_{ctl}} \qquad (8.9)$$

where $\alpha$ can vary from 1 to 2. Plugging equations (8.7) and (8.9) into
equation (8.6) yields a ratio between $\omega_{p}$ and $\omega_{ref}$ that is nominally fixed across the operating
range. In order to satisfy the constraint that the damping factor $\zeta$ be constant with frequency, the
loop-filter resistor can be implemented with active components. In a conventional
design, the control voltage is a combination of the aggregate charge stored
on the loop filter capacitor plus the instantaneous voltage across the filter
resistor. This is analogous to an implementation where the voltage on the
capacitor is buffered through a unity-gain amplifier and then augmented by
the instantaneous voltage formed by a second charge pump and the
amplifier's output impedance [48]. Now, simply changing the second
charge pump's current varies the effective loop resistance. The resulting loop
configuration is shown in Figure 8.15. The VCO consists of five inverter
buffers in a ring, and an amplifier converts the VCO output to full CMOS
levels to drive the phase-frequency detector (PFD). The output of the PFD
drives two charge pumps. [47] shows that the resulting loop has a bandwidth $\omega_{n}$
and damping factor $\zeta$ governed by nominally fixed ratios involving the
charge-pump currents and the capacitive load $C_{b}$ of each buffer stage. Hence, robust
operation is possible over a wide frequency range by keeping $\omega_{n}/\omega_{ref}$ and $\zeta$
nominally fixed, and this scheme enables the optimal scaling of loop
dynamics to minimize jitter. Like the DLL, the current consumption of the
loop components track with operating frequency to enable lower power
consumption when operating at lower frequencies.
These supply-regulated loops are key building blocks of energy-efficient links. The next
example extends the idea of regulating the supply voltage beyond the delay elements
to the entire serial-link interface.
Figure 8.16 illustrates the block diagram of multiple serial links with an
adaptive power-supply regulator and local clock generators. The adaptive
power-supply regulator adjusts the supply voltage using digital sliding
control [46] so that the reference VCO oscillates at the desired operating
frequency. Sliding control is a nonlinear control mechanism widely used in
electrical machine and power-converter control systems [49].
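A minimal sketch of such a digital sliding controller is given below (illustrative pseudo-C; the hardware hooks, target count, and unit step are assumptions, and the regulator in [46] is considerably more refined). The controller slides the supply code up or down based only on the sign of the frequency error of the reference VCO:

#include <stdint.h>

/* Hypothetical hardware hooks. */
extern uint32_t vco_cycles_per_window(void); /* reference VCO count */
extern void     set_supply_code(uint16_t);   /* DAC code -> V_DD    */

#define TARGET_CYCLES 5000u   /* count at the desired frequency */

/* Sliding control: drive the system toward the switching surface
 * (frequency error = 0) using only the sign of the error, then
 * chatter tightly around it. */
void supply_slide_step(uint16_t *code)
{
    if (vco_cycles_per_window() < TARGET_CYCLES)
        (*code)++;            /* too slow: raise the supply */
    else if (*code > 0)
        (*code)--;            /* too fast: lower the supply */
    set_supply_code(*code);
}

Because the decision depends only on the sign of the error, the loop is robust to process and temperature variation; the cost is a small bounded ripple on the supply around the operating point.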
In the absence of a dedicated parallel clock signal, each serial link must
recover timing information from the data stream. Figure 8.18 illustrates the
clock-recovery PLL implemented. A duplicate set of data receivers sampling
the edges instead of the center of the data eye enables phase detection but
provides only binary information on the phase. Hence, PLLs with binary
phase detectors are bang-bang controlled [51], and they must have a low loop
gain to keep the resulting phase dithering small.
Circuit techniques targeting small area and low power are presented in [19] and [53], which demonstrate a high-speed
link, implemented in a CMOS technology, that operates at 4 Gb/s
while dissipating 127 mW. This link design example also multiplexes several
bits within a clock period to achieve high bit rates, but instead of
multiplexing at the transmitter output, multiplexing is performed further
back in the transmit path in order to reduce clock energy. In order to attain
the speed necessary in the circuitry following the mux point, lower voltage
swings are used in the signal paths. The design also implements a DLL with
supply-regulated inverters to generate low-jitter clocks while reducing power
consumption. Clock recovery is achieved with a dual-loop design similar to
the design described in Section 2.4. Lastly, a capacitively trimmed receiver
enables reliable operation at very low signal levels by compensating for
device offsets. Since the DLL design used for clock generation is similar to
the supply-regulated designs previously described in this section, the design
of the transmitter and receiver will be the focus here.
8.4.3.1 Transmitter
8.4.3.2 Receiver
8.5 SUMMARY
REFERENCES
[1] G. Besten, “Embedded low-cost 1.2Gb/s inter-IC serial data link in 0.35µm CMOS,”
IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 250-251.
[2] M. Fukaishi et al, “A 20Gb/s CMOS multi-channel transmitter and receiver chip set for
ultra-high resolution digital display,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech.
Papers, Feb 2000, pp. 260-261.
[3] S. Sidiropoulos et al, “A CMOS 500Mbps/pin synchronous point to point interface,”
IEEE Symposium on VLSI Circuits, June 1994.
[4] T. Tanahashi et al, “A 2Gb/s 21CH low-latency transceiver circuit for inter-processor
communication,” IEEE Int’l Solid-State Circuits Conference Dig. Tech. Papers, Feb.
2001, pp. 60-61.
[5] P. Galloway et al, “Using creative silicon technology to extend the useful life of
backplane and card substrates at 3.125 Gbps and beyond,” High-Performance System
Design Conference, 2001.
7. Of course, one cannot ignore the effects of wire parasitics, which do not scale quite as
nicely and are now what limit high-speed digital circuit performance [55].
[6] R. Gu et al, “A 0.5-3.5 Gb/s low-power low-jitter serial data CMOS transceiver,” IEEE
Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1999, pp. 352-353.
[7] J. Sonntag et al, “An adaptive PAM-4 5 Gb/s backplane transceiver in 0.25µm CMOS,”
IEEE Custom Integrated Circuits Conference, to be published 2002.
[8] Y.M. Greshishchev et al, “A fully integrated SiGe receiver IC for 10Gb/s data rate,”
IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 52-53.
[9] J.P. Mattia et al, “A 1:4 demultiplexer for 40Gb/s fiber-optic applications,” IEEE Int’l
Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 64-65.
[10] Reese et al, “A phase-tolerant 3.8Gb/s data-communication router for multi-processor
supercomputer backplane,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, pp.
296-297, Feb. 1994.
[11] E. Yeung et al, “A 2.4Gb/s/pin simultaneous bidirectional parallel link with per pin skew
compensation,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp.
256-257.
[12] J. Proakis, M. Salehi, Communications Systems Engineering, Prentice Hall, New Jersey,
1994.
[13] G. Wei et al, “A variable-frequency parallel I/O interface with adaptive power-supply
regulation,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, Nov. 2000, pp. 1600-
1610.
[14] B. Lau et al, “A 2.6Gb/s multi-purpose chip to chip interface,” IEEE Int’l Solid-State
Circuits Conf. Dig. Tech. Papers, Feb 1998, pp. 162-163.
[15] A. DeHon et al, “Automatic impedance control,” 1993 IEEE Int’l Solid-State Circuits
Conf. Dig. Tech. Papers, pp. 164-5, Feb. 1993.
[16] J. Kim et al, “Adaptive supply serial links with sub-1V operation and per-pin clock
recovery,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2002.
[17] K. Donnelly et al, “A 660 MB/s interface megacell portable circuit in 0.3µm-0.7µm
CMOS ASIC,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, pp. 290-291,
Feb. 1996.
[18] S. Sidiropoulos et al, “A 700-Mb/s/pin CMOS signalling interface using current
integrating receivers,” IEEE Journal of Solid-State Circuits, May 1997, pp. 681-690.
[19] M. -J. E. Lee et al, “Low-power area efficient high speed I/O circuit techniques,” IEEE
Journal of Solid-State Circuits, vol. 35, Nov. 2000, pp. 1591-1599.
[20] F.M. Gardner, “Charge-pump phase-lock loops,” IEEE Transactions on
Communications, vol. 28, no. 11, Nov. 1980, pp. 1849-1858.
[21] M. Johnson, “A variable delay line PLL for CPU-coprocessor synchronization,” IEEE
Journal of Solid-State Circuits, vol. 23, no. 5, Oct. 1988, pp. 1218-1223.
[22] M. Mansuri et al, “Jitter optimization based on phase-locked-loop design parameters,”
IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2002.
[23] M. Horowitz et al, “High-speed electrical signalling: Overview and limitations,” IEEE
Micro, vol. 18, no. 1, Jan.-Feb. 1998, pp.12-24.
[24] S. Sidiropoulos and M. Horowitz, “A semi-digital dual delay-locked loop,” IEEE
Journal of Solid-State Circuits, Nov. 1997, pp. 1683-1692.
[25] K. -Y. K. Chang et al, “A 0.4-4Gb/s CMOS quad transceiver cell using on-chip regulated
dual-loop PLLs,” IEEE Symposium on VLSI Circuits, accepted for publication June
2002.
[26] W.J. Dally et al. Digital Systems Engineering, Cambridge University Press, 1998.
[27] W. J. Dally et al, “Transmitter equalization for 4-Gbps signalling” IEEE Micro, Jan.-
Feb. 1997. vol. 17, no. 1, pp. 48-56.
[28] R. Farjad-Rad et al, “A CMOS 8-Gb/s 4-PAM serial link transceiver,” IEEE
Symposium on VLSI Circuits Dig. Tech. Papers, pp. 41-44.
[29] A. Fiedler et al, “A 1.0625 Gbps transceiver with 2X oversampling and transmit pre-
emphasis,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1997, pp. 238-
239.
[30] A.P. Chandrakasan et al, Low Power Digital CMOS Design. Norwell, MA: Kluwer
Academic, 1995.
[31] D. Dobberpuhl, “The design of a high performance low power microprocessor,” IEEE
Int’l Symposium on Low Power Electronics and Design Dig. Tech. Papers, Aug. 1996,
pp. 11-16.
[32] M. Horowitz, “Low power processor design using self-clocking,” Workshop on Low-
Power Electronics, 1993.
[33] J. Zerbe et al, “A 2Gb/s/pin 4-PAM parallel bus interface with transmit crosstalk
cancellation, equalization, and integrating receivers,” IEEE Int’l Solid-State Circuits
Conf. Dig. Tech. Papers, Feb. 2001, pp. 66-67.
[34] C. -K. Yang, “Design of high-speed serial links in CMOS,” Ph.D. dissertation, Stanford
University, Stanford, CA, December 1998.
[35] D. Weinlader et al, “An eight channel 36GSample/s CMOS timing analyzer,” IEEE Int’l
Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 170-171.
[36] K. Yang, “A scalable 32Gb/s parallel data transceiver with on-chip timing calibration
circuits,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 258-
259.
[37] H. Johnson, “Multi-level signaling,” DesignCon, Feb. 2000.
[38] T. Burd et al, “A dynamic voltage scaled microprocessor system,” IEEE Int’l Solid-State
Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 294-295.
[39] P. Macken, M. Degrauwe, M. Van Paemel and H. Oguey, “A voltage reduction technique
for digital systems,” IEEE Int’l Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1990,
pp. 238-239.
[40] G. Wei et al, “A full-digital, energy-efficient adaptive power supply regulator,” IEEE
Journal of Solid-State Circuits, vol. 34, no. 4, April 1999, pp. 520-528.
[41] A. P. Chandrakasan et al, “Data driven signal processing: An approach for energy
efficient computing,” IEEE Int’l Symposium on Low Power Electronics and Design Dig.
Tech. Papers, Aug. 1996, pp. 347-352.
[42] V. Gutnik et al, “An efficient controller for variable supply voltage low power
processing,” IEEE Symposium on VLSI Circuits Dig. Tech. Papers, June 1996, pp. 158-
159.
[43] L. Nielsen et al, “Low-power operation using self-timed circuits and adaptive scaling of
supply voltage,” IEEE Trans. VLSI Systems, vol. 2, pp. 391-397, Dec. 1994.
[44] A. J. Stratakos, “High-efficiency low-voltage DC-DC conversion for portable
applications,” Ph.D. dissertation, University of California, Berkeley, CA, Dec. 1998.
[45] K. Suzuki et al, “A 300 MIPS/W RISC core processor with variable supply-voltage
scheme in variable threshold-voltage CMOS,” Proceedings of the IEEE Custom
Integrated Circuits Conference, May 1997, pp. 587-590.
[46] J. Kim et al, “A digital adaptive power-supply regulator using sliding control,” IEEE
Symposium on VLSI Circuits Dig. Tech. Papers, June 2001.
[47] S. Sidiropoulos et al, “Adaptive bandwidth DLL’s and PLL’s using regulated-supply
CMOS buffers,” IEEE Symposium on VLSI Circuits Dig. Tech. Papers, June 2000.
[48] J.G. Maneatis, “Low-Jitter process independent DLL and PLL based on self-biased
techniques,” IEEE Journal of Solid-State Circuits, vol. 28, no. 12, Dec. 1993.
[49] F. Bilaovic et al, “Sliding modes in electrical machines control systems,” IEEE Int’l
Symp. on Industrial Electronics Conference Proceedings, 1992, pp. 73-78.
[50] G. Wei et al, “A low power switching power supply for self-clocked systems,” IEEE
Symposium on Low Power Electronics, Oct. 1996, pp. 313-317.
[51] R.C. Walker et al, “A two-chip 1.5-GBd serial link interface,” IEEE Journal of Solid-
State Circuits, vol. 27, no. 12, Dec. 1992, pp. 1805-1811.
[52] F.M. Gardner, “Frequency granularity in digital phase-lock loops,” IEEE Transactions
on Communications, vol. 44, no. 6, June 1996, pp. 749-758.
[53] M. -J. E. Lee et al, “An 84-mW 4-Gb/s clock and data recovery circuit for serial link
applications,” IEEE Symposium on VLSI Circuits Dig. Tech. Papers, June 2001.
[54] L. Geppert, “Transmeta’s magic show [microprocessor chips],” IEEE Spectrum, vol. 37,
no. 5, May 2000, pp. 26-33.
[55] R. Ho et al, “Interconnect scaling implications for CAD,” IEEE/ACM Int’l Conf.
Computer Aided Design Dig. Tech. Papers, Nov. 1999, pp. 425-429.
[56] P. Larsson, “Measurement and analysis of PLL jitter caused by digital switching noise,”
IEEE Journal of Solid-State Circuits, July 2001, vol. 37, no. 7, pp. 1113-1119.
[57] J.G. Maneatis, “Precise delay generation using coupled oscillators,” Ph.D. dissertation,
Stanford University, Stanford, CA, June 1994.
Chapter 9
System and Microarchitectural Level Power
Modeling, Optimization, and Their Implications in
Energy Aware Computing
Abstract: While it is recognized that power consumption has become the limiting factor
in keeping up with increasing performance trends, static or point solutions for
power reduction are beginning to reach their limits. System level
power/performance design exploration for exposing available trade-offs and
achievable limits for various metrics of interest has become an indispensable
step in the quest for shortening the time-to-market for today’s complex
systems. Of particular interest are fast methods for power and performance
analysis that can guide the design process of portable information systems. At
the same time, support is needed at the microarchitectural level for efficient
design exploration for low power or application-driven fine grain power
management. Energy-aware computing is intended to provide a solution to
how various power-reduction techniques can be used and orchestrated such
that the best performance can be achieved within a given power budget, or the
best power efficiency can be obtained under prescribed performance
constraints. The paradigm of energy-aware computing is intended to fill the
gap between gate/circuit-level and system-level power management techniques
by providing more power-management levels and application-driven
adaptability.
9.1 INTRODUCTION
Power consumption has become the limiting factor not only for portable,
embedded applications but also for high-performance and desktop systems.
While there has been notable growth in the use and application of these
systems, their design process has become increasingly difficult due to
increasing design complexity and shortening time-to-market.
Real-time systems are traditionally designed based on worst-case analysis, where the
correctness of the system depends not only on the logical results of the computation
but also on the time at which the results are produced [4][5]. Despite its great
potential for embedded system design, the area of average-case analysis has received
little attention [6][7][8][9]. It is, however, becoming more and more
important, and abstract representations that provide quantitative
measures of power/performance estimates will play a central part in it. Tools
based on analytic solutions for application-architecture modeling and
performance evaluation are becoming extremely important due to their
potential to significantly shorten the design cycle and allow a better
exploration of the design space [10]. The methodology presented here
complements the existing results for worst-case timing analysis and is distinct
from other approaches for performance analysis based on rate analysis [11],
time separation between events [12], and the adaptation process [5]. Along the
same lines of using formal models, Thoen and Catthoor [13] address the
problem of efficiently generating embedded software starting from a model
of the behavior of the system being designed. The mathematical properties
of the models are used to drive the synthesis process, with the main objective
of reaching an optimal solution while guaranteeing strict timing constraints.
On the other hand, the existing tools for high-level performance
modeling that can be used in embedded systems design, like Ptolemy [14]
and POLIS [15], focus on application modeling but do not support explicit
mapping of application models onto models of architectures. These tools
share a simulation-based strategy for performance evaluation, so they may
require prohibitively long simulation times on real examples. El Greco [16]
and Cadence VCC [17] provide a simulation environment for modeling and
validating the functionality of complex heterogeneous systems. Finally, the
tools recently proposed in [18] are centered on the idea of platform-based
design. The applications are modeled as Kahn process networks that are
further used to perform performance evaluation via simulation.
In what follows, a methodology for system-level power/performance
analysis based on Stochastic Automata Networks (SANs) [19] is presented.
While the methodology described is completely general, the focus of our
attention is on portable embedded multimedia systems. These systems are
characterized by “soft” real-time constraints; hence, as opposed to safety-critical
systems, their average behavior is far more important than their worst-case
behavior. Moreover, due to data dependencies, their computational
requirements show such a large spectrum of statistical variations that
designing them based on the worst-case behavior (typically, orders of
magnitude larger than the actual execution time [5]) would result in
completely inefficient systems.
Relying on the Y-chart design methodology, the focus in what follows
is on the application-architecture modeling process for embedded
multimedia systems. The big picture of a formal methodology for
performance estimation is presented in Figure 9.1. Following the principle
of orthogonalization of concerns during the design process, separate SAN
models are built for the applications and for the architectures. Next, the abstract
model of an application is mapped onto a family of architectures (a platform),
and the power-performance figures are evaluated to see how well suited the
platform (and the chosen set of design parameters) is for the target application.
This process can be re-iterated with a different set of parameters until
convergence to the desired result.
This global vision has several unique features. First, the methodology is
based on integrating the power-performance metrics into the system-level
design: the performance metrics that are developed become an
integral part of the design process, which helps system designers quickly
find the right architecture for the target application.
The SAN-based modeling process involves two major steps: 1) SAN model construction and 2) SAN model
evaluation. The following sections briefly describe these two steps.
The dynamics of a SAN are captured by an infinitesimal generator8, whose entry
$q_{ij}$ gives the transition probability (rate) from state i to state j.
8. This generator is the analogue of the transition probability matrix in discrete-time
Markov chains.
[21]. The descriptor is still written as in equation (9.2), but now its elements
can be functions. In this case, each tensor product that incorporates matrices
with functional entries is replaced with a sum of tensor products of matrices
that incorporate only average numerical entries.
Once the SAN model is obtained, one needs to calculate its steady-state
solution. This is simply expressed by the solution of the equation $\pi Q = 0$,
subject to $\sum_{i} \pi_{i} = 1$, where the global state space is the product of the
state spaces of the individual automata, each factor contributed by
the i-th automaton [20]. Note that this is far better than the brute-force
approach of building and solving the global Markov chain directly.
As shown in Figure 9.2, the decoder consists of the baseline unit, the motion
vector (MV) recovery and motion compensation units, and the associated buffers [22]. The
baseline unit contains the VLD (Variable Length Decoder), IQ/IZZ (Inverse
Quantization and Inverse Zigzag), and IDCT (Inverse Discrete Cosine
Transform) units, as well as the buffers. During the modeling process, each
This modeling step starts with an abstract specification of the platform (e.g.,
Stateflow) and produces a SAN model that reflects the behavior of that
particular specification. A library of generic blocks that can be combined in
a bottom-up fashion to model sophisticated behaviors is constructed.
The generic building blocks model different types of resources in an
architecture, such as processors, communication resources, and memory
resources. Defining a complex architecture thus becomes as easy as
instantiating building blocks from a library and interconnecting them.
Compared to the laborious work of writing fully functional architecture
models (in Verilog/VHDL), this can save the designer a significant amount
of time and therefore enable exploration of alternative architectures.
Architecture modeling shares many ideas with the application modeling
that was just discussed. In Figure 9.4 a few simple generic building blocks
are illustrated. In Figure 9.4(a), the generic model of a CPU is represented
based on a power-saving architecture. Normal-mode is the normal operating
mode when every on-chip resource is functional. StopClock mode offers the
greatest power savings and, consequently, the least functionality. Finally,
Figure 9.4(b) describes a typical memory model.
9.4.4 Mapping
After the application and the architecture models are generated, the next step
is to map the application onto architecture and then evaluate the model using
the analytical procedure in Section 3. To make the discussion more specific,
let us consider the following design problem.
The design problem
Assume that we have to decide how to configure a platform that can work in
four different ways: one configuration that has three identical CPUs operating at clock
frequency $f$ (then each process can run on its own processor) and another
three configurations where we can use only one physical CPU but have the
freedom of choosing its speed among the values $f$, $2f$, and $3f$.
Mapping the simple VLD-IDCT/IQ processes in Figure 9.3 onto a
platform with a single CPU is illustrated in Figure 9.5. Because the two
processes9 now have to share the same CPU, some of the local transitions
become synchronizing/functional transitions.
9. For simplicity, the second consumer process (for the MV unit) is not explicitly represented
in this figure.
For all the experiments, the following parameters were used: five-slot buffers
(that is, n = 5, where one entry in the buffer represents one block of the 64
DCT coefficients that are needed for one IDCT operation).
For a platform with three separate CPUs, the analysis is quite simple: most of
the time the system will run in the CPU-active state, or will be either waiting
for a buffer or writing into a buffer. The average buffer-length
values are 1.57 and 0.53 for the MV and baseline unit buffers, respectively.
This is in sharp contrast with the worst-case scenario assumption, under which the
lengths would be 4 across all runs.
For a platform with a single CPU, the probability distribution values for
all the components of the system are given in Figure 9.6. The first column in
these diagrams shows the probability of the processes waiting for their
respective packets to arrive. The second column shows the probability of the
process waiting for the CPU to become available (because the CPU is shared
among all three processes). The third column represents the probability of
the processes actively using the CPU to accomplish their specific tasks. The
fourth column shows the probability of the process being blocked because
the buffer is not available (either it is full/empty or being used by another
process). The fifth column shows the probability of the processes writing
into their corresponding buffers.
Run 1 represents the “reference” case where the CPU operates at frequency $f$,
while the second and the third runs represent the cases when the CPU
speed is $2f$ and $3f$, respectively. For instance, in run 1, the Producer (VLD)
is waiting for its packets with probability 0.01, waiting for CPU access
with probability 0.3, decoding with probability 0.4, waiting for the buffer
with probability 0.27, and finally writing the data into the buffer with
probability 0.02.
Looking at the probability distribution values of the MV and baseline unit
buffers10 (Figure 9.7), one can see that a bottleneck may appear because of
the MV buffer. More precisely, the system is overloaded at run 1, balanced at
run 2, and under-utilized at run 3. The average buffer lengths in runs 1, 2, and
3 are:
MV buffer: 3.14, 1.52, and 1.15
baseline unit buffer: 0.81, 0.63, and 0.54
respectively. Since the average length of the buffers is proportional to the
average waiting time (and therefore directly impacts the system
performance), one can see that, based solely on the performance figures, the best
choice would be a single CPU with speed $3f$. Also, notice how different the
average values (e.g., 1.15 and 0.54, respectively) are compared to the value 4
assumed under the worst-case scenario.
10. The columns in the buffer diagrams show the distribution of the buffer occupancy, ranging from 0
(empty) to 4 (full).
The average power consumption of the SAN model can then be expressed as

$$P_{avg} = \sum_{i} P_{i}\,\pi_{i} + \sum_{i \neq j} P_{ij}\,r_{ij}$$

where $P_{i}$ and $P_{ij}$ represent the power consumption per state and per
transition, respectively, $\pi_{i}$ is the steady-state probability of state i, and
$r_{ij}$ is the transition rate associated with the transitions between states i and j.
Having already determined the solution of equation (9.3), the value $\pi_{i}$ (for a
particular i) can be found by summing up the appropriate components of the
global probability vector $\pi$. The $P_{i}$ and $P_{ij}$ costs are determined during an
off-line pre-characterization step where other proposed techniques can be
successfully applied [24][25].
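As a small numerical illustration (entirely hypothetical values; the state/transition costs and rates stand in for the pre-characterized ones), the average power of a two-state model can be evaluated directly from the formula above:

#include <stdio.h>

#define NSTATES 2

int main(void)
{
    /* Steady-state probabilities (from solving pi * Q = 0). */
    double pi[NSTATES] = { 0.7, 0.3 };
    /* Pre-characterized per-state power and per-transition costs/rates. */
    double P_state[NSTATES]          = { 1.2, 0.1 };          /* watts */
    double P_trans[NSTATES][NSTATES] = { {0, 0.05}, {0.05, 0} };
    double rate[NSTATES][NSTATES]    = { {0, 2.0 }, {1.0,  0} };

    double p_avg = 0.0;
    for (int i = 0; i < NSTATES; i++) {
        p_avg += P_state[i] * pi[i];                 /* state term */
        for (int j = 0; j < NSTATES; j++)
            if (i != j)
                p_avg += P_trans[i][j] * rate[i][j]; /* transition term */
    }
    printf("Average power: %.3f W\n", p_avg);
    return 0;
}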
To obtain the power values, the MPEG-2 decoder was simulated using
the Wattch [26] architectural simulator, which estimates CPU power
consumption based on a suite of parameterized power models. More
precisely, the simulation of the MPEG-2 decoder was monitored and the power
values were extracted.
Metrics that address the problem of low power and high performance together are needed. To this
end, the energy-delay product per committed instruction (EDPPI)
has been proposed as a measure that characterizes both the
performance and the power efficiency of a given architecture. Such a measure
can identify microarchitectural configurations that keep the power
consumption to a minimum without significantly affecting the performance.
In addition to classical metrics (such as EPC and EPI), this measure can be
used to assess the efficiency of different power-optimization techniques and
to compare different configurations as far as power consumption is
concerned.
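The defining formula is lost in this excerpt; a natural form consistent with the name is

$$\mathrm{EDPPI} = \frac{E_{total} \times t_{exec}}{N_{committed}},$$

that is, the energy-delay product of a run normalized by the number of committed instructions, so that speculative (squashed) work inflates the metric rather than hiding inside it.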
Most microarchitectural-level power modeling tools for high-
performance processors consider a typical superscalar, out-of-order
configuration, based on the reservation station model (Figure 9.1). This
structure is used in modern processors like Pentium Pro and PowerPC 604.
The main difference between this structure and the one used in other
processors (e.g., MIPS R10000, DEC Alpha 21264, HP PA-8000) is that the
reorder buffer holds speculative values and the register file holds only
committed, non-speculative data, whereas in the latter processors both
speculative and non-speculative values reside in the register file. However, the
wake-up, select, and bypass logic are common to both types of architectures,
and, as pointed out in [27], their complexity increases significantly with
increasing issue widths and window sizes.
As expected, there is an intrinsic interdependency between processor
complexity and performance, power consumption and power density. It has
been noted [30] that increasing issue widths must go hand in hand with
increasing instruction window sizes to provide significant performance
gains. In addition, it has been shown that the complexity [31] (and thus the
power requirements) of today’s processors has to be characterized in terms
of issue width (that is, number of instructions fetched, dispatched, and
executed in parallel), instruction window size (that is, the window of
instructions that are dynamically reordered and scheduled for achieving
higher parallelism), and pipeline depth, which is directly related to the
operating clock frequency.
One of the most widely used microarchitectural power simulators for
superscalar, out-of-order processors is Wattch [26], which has been
developed using the infrastructure offered by SimpleScalar [32].
SimpleScalar performs fast, flexible, and accurate simulation of modern
processors that implement a derivative of the MIPS-IV architecture [33] and
support superscalar, out-of-order execution, which is typical for today’s
high-end processors. The power estimation engine of Wattch is based on the SimpleScalar architecture, but in addition, it supports detailed cycle-accurate information for all modules, including datapath elements, memory, and CAM structures.
While estimation accuracy is important for all modules inside the core
processor, it is recognized that up to 40-45% of the power budget goes into
the global clock power [37]. Thus, accurate estimation of the global clock
power is particularly important for evaluating power values of different core
processor configurations. Specifically, the global clock power can be estimated as a function of the die size and the number of pipeline registers [30][38]:

$$P_{clk} = \left( N_{reg}\,C_{reg} + C_{w}\,L_{glob} + k_{loc}\,C_{w}\,L_{loc} \right) V_{DD}^{2}\,f_{clk}$$

where the first term accounts for the register load and the second and third terms account for the global and local wiring capacitance ($k_{loc}$ is a term that depends on the local routing algorithm used, h is the depth of the H-tree, and the global wire length $L_{glob}$ follows from the die size and h). $C_{reg}$ is the nominal input capacitance seen at each clocked register, and $C_{w}$ is the wire capacitance per unit length, while $N_{reg}$ is the number of pipeline registers, for p pipeline stages.
To estimate the die size and number of clocked pipeline registers, the
microarchitectural configuration can be used as follows:
¹¹ A basic block is a straight-line piece of code ending at any branch or jump instruction.
For all but one of the benchmarks (gcc), the optimal configuration (i.e., the lowest energy-delay product, EDPPI) is achieved for IW = 8 and WS = 32. Although the energy is not minimized in these cases, the penalty in performance is less than in other cases with similar energy savings.
Most solutions to the power problem are static in nature, and they do not
allow for adaptation to the application. As described in the previous section,
there is wide variation in processor resource usage among various
applications. In addition, the execution profile of most applications indicates
that there is also wide variation in resource usage from one section of an
application’s code to another. For example, Figure 9.16 shows the execution
profile of the epic benchmark (part of the MediaBench suite) on a typical
workload on an 8-way issue processor. We can see several regions of code
execution characterized by high IPC values lasting for approximately two
million cycles each; towards the end we see regions of code with much
lower IPC values. The quantity and organization of the processor's resources
will also affect the overall execution profile and the energy consumption. As
seen before, low-end configurations consume higher energy per instruction
due to their inherently high CPI; high-end configurations also tend to have
high energies due in part to resource usage and in part to the power
consumption of unused modules. The ideal operating point is somewhere in
between.
Combining the above two ideas, the optimal operating point for each region of code can be found in terms of processor resources. The goal is to identify the correct configuration for each code region in terms of various processor resources to optimize the overall energy consumption. Such an approach allows fine-grained power management at the processor level.
To detect tightly coupled regions of code, one can resort to the hotspot
detection mechanism described in the previous section. Once a hotspot has
been detected, an optimum configuration for that hotspot needs to be
determined. A configuration is a unique combination of several processor parameters under control. As has been shown before, the size of the issue
window WS and the effective pipeline width IW are the two factors that most
dramatically affect the performance and energy consumption of the
processor. Hence, changing the configuration of the processor would mean
setting different values for WS and IW. The optimum is defined as that
configuration which leads to the least energy dissipated per committed
instruction. This is equivalent to the power-delay product per committed
instruction (the inverse of MIPS per Watt), which is a metric used for
characterizing the power-performance trade-off for a given processor.
When a hotspot is detected, a finite state machine (FSM) walks the processor
through all possible configurations for a fixed number of instructions in each
state of the FSM. The instruction count register (ICR) is used to
keep a count of the number of instructions retired by the processor, and it is
initialized with the number of instructions to be profiled in each
configuration. During each cycle, the ICR is decremented by the number of
instructions retired in that cycle. When ICR reaches zero, the power register
is sampled to obtain a figure proportional to the total energy dissipated. If
there were n parameters of the processor to vary, exhaustive testing of all
configurations would mean testing all points in the n-dimensional lattice for
a fixed number of instructions. If we use a set of configurations with four values of IW and three values of WS, we have a total of 4 x 3 = 12 configurations, requiring an FSM of only 12 states.
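A minimal sketch of this profiling loop is shown below, assuming a hypothetical processor interface with a reconfigure() call and a per-cycle step() that returns the instructions retired and the energy consumed in that cycle; the particular IW and WS value sets are illustrative choices consistent with the 4 x 3 = 12 configuration count.

```python
# Sketch of the hardware profiling FSM: one state per (IW, WS) pair,
# each profiled for a fixed number of retired instructions.

from itertools import product

PROFILE_INSNS = 100_000  # instructions profiled per configuration (assumed)

def profile_hotspot(processor, iw_values=(2, 4, 6, 8), ws_values=(16, 32, 64)):
    """Walk all 12 configurations; return the one with the lowest
    energy per committed instruction."""
    best_cfg, best_epi = None, float('inf')
    for iw, ws in product(iw_values, ws_values):       # one FSM state each
        processor.reconfigure(issue_width=iw, window_size=ws)
        icr, energy = PROFILE_INSNS, 0.0               # instruction count register
        while icr > 0:
            retired, cycle_energy = processor.step()   # simulate one cycle
            icr -= retired                             # ICR decremented by retired insns
            energy += cycle_energy                     # accumulate power register
        epi = energy / PROFILE_INSNS                   # sample when ICR hits zero
        if epi < best_epi:
            best_cfg, best_epi = (iw, ws), epi
    return best_cfg
```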
The lowest configuration now allows the issue logic to run at 3.6 V. Assuming that energy dissipation is proportional to the square of the supply voltage, the savings in energy dissipated in the issue logic amount to about 48%.
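For instance, if the nominal supply were 5 V (an assumed value, chosen only because it is consistent with the quoted 48%), the savings follow directly from the quadratic dependence:

$$1 - \left(\frac{3.6\ \mathrm{V}}{5.0\ \mathrm{V}}\right)^{2} \approx 1 - 0.52 = 0.48$$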
The power and energy savings for a typical 8-way processor with hotspot detection and hardware power profiling are shown in Figures 9.17 and 9.18. In Figure 9.17, there are four values indicated for each application. The Dyn value represents the power obtained by doing dynamic microarchitecture resource scaling, assuming a 10% energy overhead for unused units.
9.9 SUMMARY
REFERENCES
Key words: Simulation tools, energy estimation, kernel energy consumption, compiler
optimizations, architectural optimizations
10.1 INTRODUCTION
These simulators derive their energy estimates from the activity of application programs during each cycle and from detailed capacitive models for the components activated. A key distinction between these different
simulators is in the degree of estimation accuracy and estimation speed. For
example, the SimplePower energy simulator [13] employs transition-
sensitive energy models for the datapath functional units. The SimplePower core
accesses a table containing the switch capacitance for each input transition of
the functional unit exercised. Table 10.1 shows the structure of such a table
for an n-input functional unit.
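A minimal sketch of such a transition-sensitive lookup is given below; the table entries, operand widths, and supply voltage are illustrative assumptions rather than SimplePower's actual characterization data.

```python
# Transition-sensitive energy model sketch: a table maps (previous
# inputs, current inputs) of a functional unit to switched capacitance.

VDD = 1.8  # supply voltage in volts (assumed)

# Toy table for a 2-bit-input unit; capacitance values in farads.
switch_cap_table = {
    (0b00, 0b01): 1.2e-12,
    (0b00, 0b11): 2.3e-12,
    (0b01, 0b11): 1.1e-12,
    # ... one entry per input transition of the unit
}

def access_energy(prev_inputs, curr_inputs):
    """Energy of one functional-unit access: E = C_switched * Vdd^2."""
    c = switch_cap_table.get((prev_inputs, curr_inputs), 0.0)
    return c * VDD * VDD
```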
Analytical memory energy models, by contrast, estimate the energy consumed per access and do not accommodate
the energy differences found in sequences of accesses. Since energy
consumption is impacted by switching activity, two sequential memory
accesses may exhibit different address decoder energy consumptions.
However, for memories, the energy consumed by the memory core and sense
amplifiers dominates these transition-related differences. Thus, simple
analytical energy models for memories have proven to be quite reliable.
Another approach to evaluating energy estimates at the architectural level
exploits the correlation between performance and energy metrics. These
techniques [17][18] use performance counters present in many current
processor architectures to provide runtime energy estimates.
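In its simplest form, such counter-based estimation reduces to a weighted sum of event counts over a sampling interval; the events and per-event energy weights below are assumptions for illustration.

```python
# Counter-based runtime energy estimate: energy ~ sum of weighted events.

WEIGHTS_NJ = {              # energy per event in nanojoules (assumed)
    'insns_retired': 0.4,
    'l1_misses':     2.0,
    'l2_misses':     15.0,
    'branches':      0.6,
}

def estimate_energy_nj(counters):
    """counters: dict mapping event name -> count over one interval."""
    return sum(w * counters.get(event, 0) for event, w in WEIGHTS_NJ.items())

# Example: estimate_energy_nj({'insns_retired': 1_000_000, 'l1_misses': 20_000})
```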
Most of the current architectural energy-estimation tools focus mainly on
the dynamic power consumption and do not account for leakage energy
accurately. Leakage modeling is especially important in future architectures
since the leakage current per transistor is increasing in conjunction with the
increasing number of transistors on a chip. Leakage energy can be modeled
based on empirical data similar to dynamic energy. As leakage currents in
functional units are dependent on the inputs, it is possible to either employ a
more accurate table lookup mechanism or an average leakage current value
that can enable a faster estimation speed. Memory elements can be modeled
analytically using the size of the memory and the characterization of an
individual cell. However, leakage energy modeling at a higher abstraction
level in an architectural simulator is a challenging task and requires more
effort. New abstractions to capture the influence of various factors such as stacking, temperature, and circuit style, as well as new leakage control mechanisms, are in their infancy.
The MXS CPU and the memory subsystem simulators are modified to
trace accesses to their different components. This enables the simulations to
be analyzed using the Timing Trees [20] mechanism provided by SimOS.
MXS is used to obtain detailed information about the processor. However,
the MXS CPU simulator does not report detailed statistics about the memory
subsystem behavior. Due to this limitation in SimOS, Mipsy is used to obtain
this information.
Since disk systems can be a significant part of the power budget in
workstations and laptops, a disk power model is also incorporated into
SimOS to study the overall system power consumption. SimOS models an HP97560 disk. This disk is not state-of-the-art and does not support any low-
power modes. Therefore, a layer is incorporated on top of the existing disk
model to simulate the TOSHIBA MK3003MAN [21] disk, a more
representative modern disk that supports a variety of low-power modes. The
state machine of the operating modes implemented for this disk is shown in
Figure 10.2. The disk transitions from the IDLE state to the ACTIVE state
on a seek operation. The time taken for the seek operation is reported by the
disk simulator of SimOS. This timing information is used to calculate the
energy consumed when transitioning from the IDLE to the ACTIVE state. In
the IDLE state, the disk keeps spinning. A transition from the IDLE state to
the STANDBY state involves spinning the disk down. This operation incurs
a performance penalty. In order to service an I/O request when the disk is in
the STANDBY state, the disk has to be spun back up to the ACTIVE state.
This operation incurs both a performance and energy penalty. The SLEEP
state is the lowest power state for this disk. The disk transitions to this state
via an explicit command from the operating system.
It is assumed that the spin up and spin down operations take the same
amount of time and that the spin down operation does not consume any
power. This model also assumes that the transition from the ACTIVE to the
IDLE state takes zero time and power, as in [22]. Currently, the SLEEP state
is not utilized. The timing modules of SimOS are suitably modified to
accurately capture mode transitions. While it is clear that modeling a disk is
important from the energy perspective, the features of a low-power disk can
also influence the operating system routines such as the idle process running
on the processor core. Hence, a disk model helps to characterize the
processor power more accurately. During the I/O operations, energy is
consumed in the disk. Furthermore, as the process requesting the I/O is
blocked, the operating system schedules the idle process to execute.
Therefore, energy is also consumed in both the processor and the memory
subsystem.
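The sketch below captures this disk model under the stated assumptions: spin-up and spin-down take equal time, spin-down consumes no power, the ACTIVE-to-IDLE transition is free, and the SLEEP state is unused. The power values and spin time are placeholders, not MK3003MAN data-sheet numbers.

```python
# Disk power state machine sketch (IDLE <-> ACTIVE <-> STANDBY).

P_ACTIVE, P_IDLE, P_STANDBY = 2.3, 1.0, 0.2  # watts (assumed)
T_SPIN = 2.0                                  # spin-up time in seconds (assumed)

class DiskModel:
    def __init__(self):
        self.state, self.energy_j = 'IDLE', 0.0

    def advance(self, seconds):
        power = {'ACTIVE': P_ACTIVE, 'IDLE': P_IDLE, 'STANDBY': P_STANDBY}
        self.energy_j += power[self.state] * seconds

    def seek(self, seek_time_s):
        if self.state == 'STANDBY':             # must spin back up first:
            self.energy_j += P_ACTIVE * T_SPIN  # energy and performance
                                                # penalty before servicing I/O
        self.state = 'ACTIVE'
        self.advance(seek_time_s)               # energy of the seek itself
        self.state = 'IDLE'                     # free transition back to IDLE

    def spin_down(self):
        self.state = 'STANDBY'                  # spin-down itself costs nothing
```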
SoftWatt uses analytical power models. A post-processing approach is
taken to calculate the power values. The simulation data is read from the log
files, pre-processed, and input to the power models. This approach results in
the loss of per-cycle information as data is sampled and dumped to the
simulation log file at a coarse granularity. However, there is no slowdown in
the simulation time beyond that incurred by SimOS itself. This is particularly
critical due to the time-consuming nature of MXS simulations. The only
exception to this rule is the disk energy model, where energy consumption is
measured during simulation to accurately account for the mode transitions.
This measurement incurs very little simulation overhead. SoftWatt employs a simple conditional clocking model: it assumes that full power is consumed if any of the ports of a unit is accessed; otherwise, no power is consumed.
The per-access costs of the cache structures are calculated based on the
models presented in [15][16]. The clock generation and distribution network
is modeled using the technique proposed in [23], which has an error margin
of 10%. The associative structures of the processor are modeled as in
[15][24].
An important and difficult task in the design of architectural energy
simulators is the validation of their estimates. Due to the flexibility provided
by the architectural tools in evaluating different configurations, even
choosing a configuration to validate is challenging. A common approach
used in several works is to validate estimates of configurations similar to
commercial processors for which published data sheets are available [15]. As
an example, in order to validate the entire CPU model, SoftWatt is
configured to calculate the maximum CPU power of the R10000 processor.
In comparison to the maximum power dissipation of 30W reported in the
R10000 data sheet [25], SoftWatt reports 25.3W. As detailed circuit-level
information is not available at this level, generalizations made in the
analytical power models that do not, for example, capture the system-level
interconnect capacitances result in an estimation error.
Table 10.2 gives the baseline configuration of SoftWatt that was used for the
experiments in this section. The SPEC JVM98 benchmarks [30] were chosen
for conducting this characterization study. Java applications are known to
exercise the operating system more than traditional benchmark suites [31].
Thus, they form an interesting suite to characterize for power in a power
simulator like SoftWatt that models the operating system.
Figure 10.4 presents the overall power budget of the system including the
disk. This model is the baseline disk configuration and gives an upper bound
of its power consumption. It can be observed that, when no power-related
optimizations are performed, the disk is the single-largest consumer of
power in the system.
By including the IDLE state in the disk configuration, the dominance of
the disk in the power budget decreases from 34% to 23% as shown in Figure
10.5. This optimization provides significant power-savings and also alters
the overall picture. Now the L1 I-cache and the clock dominate the power
profile.
In addition, the results reveal the potential for power optimizations when
executing the kernel idle process. Whenever the operating system does not
have any process to run, it schedules the idle process. Though this has no
performance implications, over 5% of the system energy is consumed during
this period. This energy consumption can be reduced by transitioning the
CPU and the memory subsystem to a low-power mode or by even halting the
processor, instead of executing the idle process.
10.4.2.1 Superblock
Frequently executed paths through the code are selected and optimized at the
expense of the less frequently executed paths [27]. Instead of inserting
bookkeeping instructions where two traces join, part of the trace is
duplicated so that the original copy can be optimized. This scheduling scheme provides
an easier way to find parallelism beyond the basic block boundaries. This is
especially true for control intensive benchmarks because the parallelism
within a basic block is very limited.
10.4.2.2 Hyperblock
The idea is to group many basic blocks from different control flow paths into
a single manageable block for compiler optimization and scheduling using
if-conversion [28].
10.5 SUMMARY
From a software perspective, it is not sufficient to account only for the user-code energy estimate, since operating system routines can consume a significant portion of the energy. This could cause significant overestimation of battery life for the executing application. From a hardware perspective, the experiments
indicate the importance of accounting for peripheral devices such as the disk
in estimating overall energy budget. As optimizations on one component can
have negative ramifications on other components, simulation tools should
provide an estimate for the entire system in order to evaluate the real impact
of such optimizations.
Finally, a VESIM energy estimation framework built on top of the
Trimaran tool set for a VLIW architecture was presented. This framework
was used to show the impact of architectural and compiler optimizations on
energy efficiency. As power consumption continues to be the major limiter
to more powerful and faster designs, there is a need for further explorations
of such software- and architectural-level optimizations.
ACKNOWLEDGMENT
The authors wish to acknowledge the contributions of the students from the
Microsystems Design Group at Penn State who have worked on several
projects reported in this chapter. We would like to especially acknowledge the
contributions of Wu Ye, Hyun Suk Kim, Sudhanva Gurumurthi and Soontae
Kim.
REFERENCES
[1] D. Brooks and M. Martonosi, “Dynamic thermal management for high-performance
microprocessors,” In Proceedings of the Seventh International Symposium on High
Performance Computer Architecture, January 2001.
[2] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez, “Reducing Power in
High-Performance Microprocessors,” In Proceedings of the Design Automation
Conference, June 1998.
[3] M. Irwin, M. Kandemir, N. Vijaykrishnan, and A. Sivasubramaniam, “A Holistic
approach to system level energy optimization,” In Proceedings of the International
Workshop on Power and Timing Modeling, Optimization, and Simulation, September
2000.
[4] D. Marculescu, R. Marculescu, and M. Pedram, “Information theoretic measures of
energy consumption at register transfer level,” In Proceedings of 1995 International
Symposium on Low Power Design, pp. 81, April 1995.
[5] J. M. Rabaey and M. Pedram, “Low power design methodologies,” Kluwer Academic
Publishers, Inc., 1996.
[6] S. Powell and P. Chau, “Estimating power dissipation of VLSI signal processing chips: the PFA technique,” In VLSI Signal Processing IV, pp. 250, 1990.
[7] N. Kumar, S. Katkoori, L. Rader, and R. Vemuri, “Profile-driven behavioral synthesis
for low power VLSI systems,” IEEE Design and Test Magazine, pp. 70, Fall 1995.
[8] D. Liu and C. Svensson, “Power consumption estimation in CMOS VLSI chips,” IEEE
Journal of Solid State Circuits, pp. 663, June 1994.
[9] P. Landman and J. Rabaey, “Activity-sensitive architectural power analysis,” IEEE
Transaction on CAD, TCAD-15(6), pp. 571, June 1996.
[10] H. Mehta, R. M. Owens, and M. J. Irwin, “Energy characterization based on clustering,”
In Proceedings of the 33rd Design Automation Conference, pp. 702, June 1996.
[11] Q. Wu, Q. Qiu, M. Pedram, and C-S. Ding, “Cycle-accurate macro-models for RT-level
power analysis,” IEEE Transactions on VLSI Systems, 6(4), pp. 520, December 1998.
[12] L. Benini, A. Bogliolo, M. Favalli, and G. De Micheli, “Regression models for
behavioral power estimates,” In Proceedings of International Workshop on Power,
Timing Modeling, Optimization and Simulation, pp. 179, September 1996.
[13] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. Irwin, “The design and use of
SimplePower: a cycle-accurate energy estimation tool,” In Proceedings of the Design
Automation Conference, June 2000.
[14] S. Gurumurthi, A. Sivasubramaniam, M. J. Irwin, N. Vijaykrishnan, M. Kandemir, T. Li,
and L. K. John, “Using complete machine simulation for software power estimation: The
SoftWatt Approach,” In Proceedings of the International Symposium on High
Performance Computer Architecture, Feb 2002.
[15] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: a framework for architectural-level
power analysis and optimizations,” In Proceedings of the 27th International Symposium
on Computer Architecture, June 2000.
[16] M. B. Kamble and K. Ghose, “Analytical energy dissipation models for low power
caches,” In Proceedings of the International Symposium on Low Power Electronic
Design, pp. 143–148, August 1997.
[17] R. Joseph, D. Brooks, and M. Martonosi, “Runtime power measurements as a foundation for evaluating power/performance tradeoffs,” In Proceedings of the Workshop on Complexity-Effective Design, June 2001.
[18] I. Kadayif, T. Chinoda, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and A.
Sivasubramaniam, “vEC: virtual energy counters,” In Proceedings of the ACM
SIGPLAN/SIGSOFT Workshop on Program Analysis for Software Tools and
Engineering, June 2001.
[19] K. C. Yeager, “The MIPS R10000 superscalar microprocessor,” IEEE Micro, 16(2), pp. 28-40, April 1996.
[20] S. A. Herrod, “Using complete machine simulation to understand computer system
behavior,” PhD thesis, Stanford University, February 1998.
[21] Toshiba Storage Devices Division, http://www.toshiba.com/.
[22] K. Li, R. Kumpf, P. Horton, and T. E. Anderson, “Quantitative Analysis of Disk Drive
Power Management in Portable Computers,” Technical Report CSD-93-779, University
of California, Berkeley, 1994.
[23] D. Duarte, N. Vijaykrishnan, M. J. Irwin, and M. Kandemir, “Formulation and validation
of an energy dissipation model for the clock generation circuitry and distribution
networks,” In Proceedings of the 2001 VLSI Design Conference, 2001.
[24] S. Palacharla, N. P. Jouppi, and J. E. Smith, “Complexity-effective superscalar
processors,” In Proceedings of the 24th International Symposium on Computer
Architecture, 1997.
Chapter 11
Power-aware Communication Systems
Mani Srivastava
University of California, Los Angeles
Abstract: Battery-operated systems usually operate as part of larger networks where they
wirelessly communicate with other systems. Conventional techniques for low-
power design, with their focus on circuits and computation logic in a system,
are at best inadequate for such networked systems. The reason is twofold.
First, the energy cost associated with wireless communications dominates the
energy cost associated with computation. Being dictated primarily by a totally
different set of laws (Shannon and Maxwell), communication energy, unlike
computation energy, does not even benefit much from Moore's Law. Second,
designers are interested in network-wide energy-related metrics, such as
network lifetime, which techniques focused on computation at a single system
cannot address. Therefore, in order to go beyond low-power techniques
developed for stand-alone computing systems, this chapter describes
communication-related sources of power consumption and network-level
power-reduction and energy-management techniques in the context of
wirelessly networked systems such as wireless multimedia and wireless sensor
networks. General principles behind power-aware protocols and resource
management techniques at various layers of networked systems - physical,
link, medium access, routing, transport, and application - are presented. Unlike
their conventional counterparts that only manage bandwidth to achieve
performance, power-aware network protocols also manage the energy
resource. Their goal is not just a reduction in the total power consumption.
Rather, power-aware protocols seek a trade-off between energy and
performance via network-wide power management to provide the right power
at the right place and the right time.
11.1 INTRODUCTION
A simple energy model for a radio with a specified modulation scheme and data rate can be obtained by setting $E_{elec,rx}$ and $E_{elec,tx}$ as constant energy/bit terms due to electronics at the Rx and Tx, while treating $E_{RF}$ as an energy/bit term at the Tx that is proportional to $r^{n}$, where r is the radio range. Clearly, for wireless communications over large r, the communication energy will be dominated by the RF term $E_{RF}$ as the transmit power is set to be large, while for short r the electronic power terms ($E_{elec,rx}$ and $E_{elec,tx}$) would dominate as the transmit power is set to be small. For example, typical state-of-the-art numbers reported in the literature for Bluetooth-class radios are 50 nJ/bit for the electronic power terms and 100 pJ/bit/m² for the RF power term [12]. Therefore, for radios designed for short ranges (e.g., personal area networks), the electronic power consumption dominates the energy spent on communication, while at larger ranges (e.g., wireless LANs, cellular systems) the RF power consumption dominates.
Besides transmit and receive states, radios can be in two other states:
sleep and idle. In the sleep state, the radio is essentially off and consumes
little or no power. In the idle state, the radio is listening for data packet
arrival but is not actively receiving or transmitting data. Traditionally, the
idle-state power consumption of the radio is often assumed to be
insignificant, and the energy spent on communication is counted as the
energy spent on the data packets actually received or transmitted. In reality,
the idle-state power consumption is almost the same as in the receive mode,
and ignoring it can lead to fallacious conclusions about the relative merits of
the various protocols and power management strategies [13]. For example,
the TR1000 radio transceiver from RF Monolithics is a radio commonly
used in wireless sensor networks. This low-power radio has a data rate of 2.4
kbps and uses On-Off Keying (OOK) modulation. For a range of 20 m, its
power consumption is 14.88 mW, 12.50 mW, 12.36 mW, and 0.016 mW in
the transmit, receive, idle, and sleep states, respectively. Clearly, idle
listening is not cheap!
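A first-order sketch of this energy model, using the 50 nJ/bit electronics figure quoted above together with an assumed amplifier coefficient and path-loss exponent, makes the crossover between the two regimes easy to explore:

```python
# Energy per bit over a link of range r: electronics + range-dependent RF.

E_ELEC_TX = 50e-9    # J/bit, transmitter electronics (figure from the text)
E_ELEC_RX = 50e-9    # J/bit, receiver electronics (figure from the text)
EPS_AMP   = 100e-12  # J/bit/m^n, RF amplifier coefficient (assumed)
N_PATH    = 2        # path-loss exponent (assumed free-space value)

def energy_per_bit(r_m):
    """Total energy to move one bit across a link of r_m meters."""
    rf = EPS_AMP * (r_m ** N_PATH)      # dominates for large ranges
    return E_ELEC_TX + E_ELEC_RX + rf   # electronics dominate short ranges

# energy_per_bit(5)   -> electronics-dominated (short-range radio)
# energy_per_bit(500) -> RF-dominated (WLAN/cellular-scale range)
```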
Consider, for example, Bluetooth-class radios at a 1 Mbps data rate and a $10^{-5}$ bit error rate (BER). RF analog circuits, whose power consumption does not vary much with data rate, in turn dominate the electronic power in these radios. An implication of
this is that using more energy-efficient but lower data rate modulation
schemes, a strategy that is effective for long-range communications, does not
help with short-range communications. Rather, using a high data rate but
energy-inefficient modulation and then shutting the radio down might be
more effective. A hindrance in the case of small data packet sizes is the long
time overhead and the resulting wasted energy in current radios as they
transition from shutdown to the active state.
The discussion in the previous section reveals that the sources of power
consumption in communications are quite diverse and different from sources
such as capacitive charging/discharging, leakage current, etc. that are the
basis for power reduction and management techniques developed for
processors, ASICs, etc. While some of these techniques can certainly be
used to address the electronic power consumption during communication,
much of the power consumption during communication lies beyond their
reach.
This has led to the recognition that one needs power reduction and
management techniques for wireless communications that specifically target
(i) new sources of power consumption, such as RF power, (ii) new
opportunities for power-performance trade-off, such as choices of
modulation and protocols, and (iii) new problems, such as how to wake up a
sleeping radio when the wake-up event is at a remote node.
The remainder of this chapter describes a selection of such techniques
that have been developed. These techniques seek to make communication
more power-efficient by reducing the number of raw bits sent across per
useful information bit that needs to be communicated, or by reducing the
amount of power needed to transmit a raw bit, or by a combination of the
two. The goal of many of the techniques presented is not mere power
reduction but rather power awareness whereby power consumption is
dynamically traded off against other metrics of system performance such as
throughput, network coverage, accuracy of results, etc. This is done by
intelligent adaptation of control knobs offered by the system components
such as the radio or a protocol.
In the case of digital and analog processing, the various power-reduction
and management techniques have been classified according to whether they
are technology-level techniques (e.g., lowering threshold voltages), circuit-
level techniques (e.g., low supply voltage), architecture-level techniques
(e.g., shutdown or dynamic voltage scaling by an operating system), or
algorithm-level techniques (e.g., power-efficient signal processing
algorithms). Classifying according to technology, circuit, architecture, and
algorithm levels is not appropriate in the case of communications, and a
better way is to classify according to the layer of the communication
protocol stack that a technique impacts. While the seven-layer OSI protocol
stack is the standard for networked systems, the various techniques presented
in this chapter are classified into two broad classes: lower-layer (physical,
link, MAC) and higher-layer (routing, transport, application) techniques.
Historically, the purpose of layering has been to hide information across layers.
With equations (11.6) and (11.8), the expression in equation (11.2) for
total energy spent in communicating a raw bit becomes an explicit function
of the modulation level:
Together equations (11.1) and (11.9) give the trade-off between the
energy and the delay in sending a bit. A similar trade-off exists for other
modulation schemes, such as Phase Shift Keying (PSK) and Pulse Amplitude Modulation (PAM), with appropriate definitions of f(b) and the associated constants. In general, DMS is applicable to other scalable modulation schemes as well.
Although the discussion so far has assumed that the modulation level can
be varied continuously, in reality the analysis presented is valid only for
integer values of b. In the case of QAM, the expressions are exact only for
even integers but are reasonable approximations when b is odd [11].
DMS exploits the effect that, by varying the modulation level, energy can be traded
off versus delay, as explained above. On the left side of the curves in Figure
11.4, lowering b reduces the energy, at the cost of an increased delay.
Scaling to the right of the point of minimum energy clearly does not make
sense, as both energy and delay would increase. The operating region of
DMS, therefore, corresponds to the portion of the curves to the left of their
energy minimum points. In this region, modulation scaling is superior to
radio shutdown, even when the overhead associated with waking up a
sleeping radio is ignored, because with shutdown the total energy per bit
will not change. The energy-delay curves are convex so that a uniform
stretching of the transmissions is most energy efficient, similar to what has
been observed for DVS [10] where the energy vs. speed curve is convex as
well.
From Figure 11.4, note also that DMS is more useful for situations where the RF transmit energy is large, or in other words, where the transmit power dominates the electronics power. This is true except for wireless communication systems
with a very short range.
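The shape of these curves can be reproduced with a simple QAM-style model in which the RF term per bit grows as (2^b - 1)/b while the electronics term falls as 1/b; the constants below are illustrative, not the chapter's calibrated values.

```python
# DMS sketch: energy and delay per bit as the modulation level b varies.

C_RF   = 1.0e-9   # J per symbol at the RF cost scale (assumed)
C_ELEC = 20e-9    # J per symbol for the electronics (assumed)
T_SYM  = 1e-6     # fixed symbol duration in seconds (assumed)

def per_bit(b):
    """Energy (J) and delay (s) to send one bit at b bits/symbol."""
    energy = (C_RF * (2 ** b - 1) + C_ELEC) / b
    delay = T_SYM / b
    return energy, delay

for b in (2, 4, 6, 8):  # sweeping b traces out the convex energy-delay curve
    e, d = per_bit(b)
    print(f"b={b}: {e * 1e9:.1f} nJ/bit, {d * 1e6:.2f} us/bit")
```

With these placeholder constants the energy minimum falls at b = 4; scaling to still-lower b would increase both energy and delay, which is exactly why the operating region of DMS lies to the left of the minimum.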
Dynamic code scaling
Another radio-level control knob is DCS, the scaling of the forward error
correcting code that is used. Coding introduces extra traffic that is
characterized by the rate of the code, which is the ratio of the size of the
original data to the size of the coded data. Using a radio with a given symbol
rate, the use of a lower-rate code results in larger time but lower energy that
is needed to get the same number of information bits across at a specified bit
error rate. For example, consider a wireless channel with additive white
Gaussian noise, average signal power constraint P, and noise power N. As
shown in [19], under optimal coding, the RF energy spent to reliably
transmit each information bit is proportional to $s\left(2^{1/(\kappa s)} - 1\right)$, where s is the number of raw symbol transmissions needed to send each information bit, and $\kappa$ is a function of the BER and the modulation scheme that gives the ratio between the number of information bits that are reliably transmitted per symbol to the channel capacity in bits/symbol. This quantity decreases monotonically with s, or, equivalently, the energy taken to transmit each information bit decreases monotonically as the
time allowed to transmit that bit is increased. Indeed, as [19] mentions, for
practical values of SNR for a wireless link, there is a 20x dynamic range
over which the energy per information bit can be varied by changing the
transmission time. Similar energy-delay behavior is observed for real-life
sub-optimal codes as well. Figure 11.5 shows the energy vs. delay behavior
of a real-life multi-rate convolutional code from [20].
Lastly, note that the energy-delay curves due to DCS are also convex
(besides being monotonically decreasing) just as is the case for DVS and
DMS, so that a uniform stretching of the transmissions is most energy
efficient as observed in [10].
The scheduler combines all three scaling factors to get the overall
modulation that is used for the current packet.
There is only one independent parameter left, which can be solved for from the constraint on the desired average data rate, expressed as an average number of bits per symbol [22]. Thus, the thresholds only depend on the
statistics of the wireless channel, which can be estimated online. One no
longer has to know the exact behavior of the channel over time to achieve
the energy-optimal scheduling policy.
Figure 11.7 shows the simulated performance of this radio power
management scheme for different values of the average throughput
constraint. The basic parameters are the same as in the real-time energy-aware packet-scheduling scheme in the preceding subsection, with a discrete set of possible modulation levels spanning the allowed range in fixed steps of bits/symbol. The time correlation of the channel is characterized by a Doppler rate of 50 Hz, an update rate of 1 kHz was selected for the channel estimation, and the maximum possible transmit power is 1 W.
Curve 1 in Figure 11.7 plots the behavior of the “loading in time”
scheduling policy described here. It is superior to scaling with “constant b”
(curve 2), where the modulation is uniformly slowed down based on the
average throughput, but channel variations are not taken into account. The
difference between curve 2 and curve 3, which shows the same uniform
scaling in a non time-varying channel, illustrates the performance
degradation associated with channel variations. Beyond a certain number of bits/symbol, one resorts to shutdown, and both these curves flatten out, which is as expected from the earlier discussion on DMS. However, curve 1 keeps on decreasing.
In addition to DMS and DCS there are other radio-level control knobs that
one can exploit for power management. In fact, the interaction between
performance and energy at the radio level is much more complex than for
CPUs, with many more run-time variables. The raw radio-level performance
is a function of three variables: the BER, the RF transmit power, and the raw
data rate. The modulation scheme and bit-level coding choices both decide
where the radio operates in this three-dimensional space. In DCS and DMS
the BER is kept constant and the other two variables are traded off. One can
certainly imagine more sophisticated control knobs that trade-off among all
the three variables simultaneously and are under the control of an energy-
aware scheduler, although no such scheme has yet been reported in the
literature.
The situation, however, is even more complex because rarely is an
application interested in low-level data throughput or BER. Rather, the radio
is separated from the applications by layers of protocols that execute
functions such as packetizing the application data as payloads of packets
with headers, performing packet-level error control such as retransmission of
lost and corrupted packets, and packet-level forward error coding. The real
measure of performance is the rate at which application-level data is reliably
getting across. This is often called the goodput, which is a function of not
only the raw data rate and the BER, but also the nature of the intervening
protocols and the packet structure they impose. If one were to trade energy
for goodput, many other control knobs become available which depend on
the protocols used. One such control knob, described below, is the
adaptation of the length of the frames in which the application data is sent.
Another knob is the adaptation of packet-level error control [23].
In order to send data bits over a wireless link, the bits are grouped into link-
layer frames (often called MAC frames) and scheduled for transmission by
the MAC mechanism. Typically, higher-layer packets, such as IP datagrams,
are fragmented to fit into these link-layer frames and reassembled at the
receiver. However, when the underlying channel is variable, operating with a
fixed frame length is inefficient, and it is better to adapt it to the momentary
channel condition instead [24].
Each frame has a cyclic redundancy check (CRC) to determine whether it
contains errors. Although adaptive frame-level forward error correction
could be treated in conjunction with frame length adaptation, we restrict
ourselves to the simpler frame-level error detection case here. Since there is
no correction capability, a single bit error leads to the entire frame being
dropped. Therefore, smaller frames have a higher chance of making it
through. Each frame, however, contains a fixed header overhead, such that in
relative terms this overhead increases with decreasing frame length. The
length of the frame’s payload and header field are denoted by L and H,
respectively. For a point-to-point communication link, the crucial metric of
performance is the “goodput,” which is the actual data rate G offered to the
higher layers [23]. It takes into account the fact that header overhead and frame losses both reduce the rate at which useful data gets across:

$$G = R\;\frac{L}{L+H}\;(1-\mathrm{BER})^{L+H}$$

where R is the raw link data rate. For a given transmit RF power, the energy per good application-level bit would be proportional to the inverse of the goodput expression above. Therefore, it is more energy efficient if the frame length L is selected so that the goodput G is maximized for a given radio and channel condition. The data field size that maximizes the goodput, and therefore minimizes the energy spent per good application-level bit, is given by equation (11.13):

$$L^{*} = \frac{-H + \sqrt{H^{2} - \dfrac{4H}{\ln(1-\mathrm{BER})}}}{2}$$
When the BER varies slowly, i.e., over a timescale sufficiently larger
than the frame transmission time, these expressions correspond to the
optimal values at each moment in time. By estimating the BER over time,
the frame length settings can track the channel variations by adapting it
according to equation (11.13).
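A sketch of this adaptation, using the optimal-payload expression of equation (11.13) as reconstructed above, is shown below; the header size is an assumed value.

```python
# Frame length adaptation sketch: recompute L* as the BER estimate changes.

import math

def optimal_payload_bits(H, ber):
    """Payload maximizing G = L/(L+H) * (1-BER)^(L+H), per eq. (11.13)."""
    lam = -math.log(1.0 - ber)  # per-bit loss rate, -ln(1 - BER)
    return (-H + math.sqrt(H * H + 4.0 * H / lam)) / 2.0

H_BITS = 64                     # header length in bits (assumed)
for ber in (1e-5, 1e-4, 1e-3):  # track the channel as the estimate varies
    print(f"BER={ber:g}: L* ~ {optimal_payload_bits(H_BITS, ber):.0f} bits")
```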
A straightforward approach to frame length adaptation would be to
directly estimate the BER at regular intervals via bit error measurement, and
set L accordingly. In order to obtain an accurate estimate, the BER has
to be averaged over a large time window, which severely limits the
responsiveness of the adaptation. Therefore, it is better to use lower-level
channel parameters, measured by the radio, that indicate the quality of the
channel and can be used to estimate BER and thus the appropriate frame
length.
More results on frame length adaptation can be found in [24].
One important class of network-level techniques comprises those based on the idea that not all nodes in an ad hoc network need to have their radios
active all the time for multi-hop packet forwarding. Many nodes can have
their radios put to sleep, or shutdown, without hurting the overall
communication functioning of the network. Since node shutdown impacts
the topology of the network, these techniques are also called topology
management approaches. Shutdown is an important approach because the
only way to save power consumption in the communication subsystem is to
completely turn off the node’s radio, since the idle mode is as power hungry
as the receive mode and, in case of short range radios, as power hungry (or
more) as the transmit mode as well. However, as soon as a node powers
down its radio, it is essentially disconnected from the rest of the network
topology and, therefore, can no longer perform packet relaying.
Recently, several schemes that seek to trade excess node density in ad hoc
networks for energy have appeared in the literature. Two of the first ones
were GAF [13] and SPAN [29]. These techniques operate under the
assumption that a constant network capacity needs to be maintained at all
times and try to do so by shutting redundant nodes down. No use is made of
the knowledge of the overall state of the networked application. So, for
example, whether a network of wireless sensors is monitoring or actively
communicating data, these techniques try to provide the same capacity.
With SPAN a limited set of nodes forms a multi-hop forwarding
backbone that tries to preserve the original capacity of the underlying ad-hoc
network. Other nodes transition to sleep states more frequently as they no
longer carry the burden of forwarding data of other nodes. To balance out
energy consumption, the backbone functionality is rotated between nodes,
and as such, there is a strong interaction with the routing layer.
Geographic Adaptive Fidelity (GAF) exploits the fact that nearby nodes
can perfectly and transparently replace each other in the routing topology.
The sensor network is subdivided into small grids, such that nodes in the
same grid are equivalent from a routing perspective. At each point in time,
only one node in each grid is active, while the others are in the energy-
saving sleep mode. Substantial energy gains are, however, only achieved in sufficiently dense networks. From the grid and node locations depicted in Figure 11.8, one can calculate that r should satisfy $r \le R/\sqrt{5}$, where R is the transmission range of a node. The average number of nodes in a grid is $M = N r^{2}/L^{2}$, where N is the total number of nodes in a field of size L x L. The average number of neighbors of a node will be $n = N\pi R^{2}/L^{2}$, so that one gets $M = (r^{2}/\pi R^{2})\,n$. From now on, assume that $r = R/\sqrt{5}$, so that $M = n/(5\pi)$.
Since all nodes in a grid are equivalent from a routing perspective, this
redundancy can be used to increase the network lifetime. GAF only keeps
one node awake in each grid, while the other nodes turn their radio off. To
balance out the energy consumption, the burden of traffic forwarding is
rotated between nodes. For analysis, one can ignore the unavoidable time
overlap of this process associated with the handoff. If there are m nodes in a
grid, each node will (ideally) only turn its radio on 1/m of the time and,
therefore, will last m times longer. When distributing nodes over the sensor
field, some grids will not contain any nodes at all. Let $\sigma$ be the fraction of used grids, i.e., those that have at least one node. As a result, the average number of nodes in the used grids is $M' = M/\sigma$.
The lifetime of each node in the grid is increased by the same factor M'. As a result, the average lifetime of a grid, i.e., the time that at least one node in the grid is still alive, is given by equation (11.16):

$$T_{grid} = M'\,T_{node}$$

where $T_{node}$ is the lifetime of an individual node.
Note that the average node power and the average grid lifetime, which are averages over all of the grids, only
depend on M’ and not on the exact distribution of nodes in the used grids. Of
course, the variance of both the node power and the grid lifetime depends on
the distribution. If one had full control over the network deployment, one
could ensure that every used grid has exactly M’ nodes. This would
minimize the power and lifetime variance.
The top curve in Figure 11.9 shows how GAF trades off energy with node
density in a specific scenario. The simulation results are close to the results
from the analysis presented above. The scenario is for a network with 100
nodes, each with radio range of 20 m, and a square area of size L x L in
which the nodes are uniformly deployed. The size L is chosen such that the
average number of one-hop neighbors of a node is 20 and leads to L = 79.27
m. The MAC protocol is a simplified version of the 802.11 Wireless LAN
MAC protocol in its Distributed Coordination Function mode. The radio data
rate is 2.4 kbps. The node closest to the top left corner detects an event and
sends 20 information packets of 1040 bits to the data sink with an inter-
packet spacing of 16 seconds. The data sink is the sensor node located
closest to the bottom right corner of the field. The average path length
observed is between 6 and 7 hops. The results are averages over 100
simulation runs.
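The grid arithmetic above can be checked directly against the numbers quoted for this scenario (N = 100 nodes, R = 20 m, L = 79.27 m):

```python
# GAF grid arithmetic for the simulated scenario in the text.

import math

N, R, L = 100, 20.0, 79.27
r = R / math.sqrt(5)                       # grid side satisfying r <= R/sqrt(5)
neighbors = N * math.pi * R ** 2 / L ** 2  # average one-hop neighbors (~20)
M = N * r ** 2 / L ** 2                    # average nodes per grid (~1.27)

print(f"r = {r:.2f} m, neighbors = {neighbors:.1f}, M = {M:.2f}")
# With M' nodes in each used grid, every node is awake only 1/M' of the
# time, so node (and hence grid) lifetime is extended by a factor of M'.
```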
For the same network scenario as in the previous subsection for GAF, and with a single data transfer (so that the data-transfer rate is the inverse of the simulation time), the two plots in Figure 11.12 show the normalized average set-up latency per hop as a function of the node density.
Now consider two different cases for sending data from node A to node
B that is distance r away: direct routing and multi-hop. The first case is
direct routing, where the transmit power of node A is set so that its range is r. In the second case, the data is relayed over N intermediate hops, each covering a distance r/N.
So, when is multi-hop better? If N is given, one can show that multi-hop routing leads to lower energy if the following condition is satisfied:

$$r^{n} > \frac{(N-1)\,\alpha}{\varepsilon\left(1 - N^{1-n}\right)}$$

where $\alpha$ is the electronics energy per hop (transmitter plus receiver) and $\varepsilon\,r^{n}$ is the RF energy needed to cover a range r.
If, on the other hand, one is allowed to choose N, then the optimum N for multi-hop routing is given by

$$N_{opt} = r\left(\frac{(n-1)\,\varepsilon}{\alpha}\right)^{1/n}$$
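Under the per-hop energy model used in the reconstruction above, and with placeholder constants, the comparison can be sketched as follows:

```python
# Direct vs. multi-hop energy sketch: alpha = per-hop electronics energy,
# eps * d**n = RF energy to cover distance d. Values are placeholders.

ALPHA = 100e-9   # J/bit per hop, tx + rx electronics (assumed)
EPS   = 100e-12  # J/bit/m^n (assumed)
N_EXP = 2        # path-loss exponent (assumed)

def e_multihop(r, hops):
    """Energy per bit over `hops` equal hops spanning total distance r."""
    return hops * (ALPHA + EPS * (r / hops) ** N_EXP)

def optimal_hops(r):
    """Unconstrained optimum: N = r * ((n-1) * eps / alpha)**(1/n)."""
    return r * ((N_EXP - 1) * EPS / ALPHA) ** (1.0 / N_EXP)

r = 1000.0
n_opt = max(1, round(optimal_hops(r)))
print(f"direct: {e_multihop(r, 1) * 1e6:.1f} uJ/bit; "
      f"N* = {optimal_hops(r):.1f}; "
      f"multi-hop at N*: {e_multihop(r, n_opt) * 1e6:.2f} uJ/bit")
```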
11.6 SUMMARY
ACKNOWLEDGEMENTS
The author would like to acknowledge the contributions of his current and
past students at UCLA’s Networked and Embedded Systems Laboratory
(http://nesl.ee.ucla.edu) on whose research this chapter is based. In particular,
the research by Andreas Savvides, Athanassios Boulis, Curt Schurgers, Paul
Lettieri, Saurabh Ganeriwal, Sung Park, Vijay Raghunathan, and Vlasios
Tsiatsis has played a significant role in formulating the ideas expressed in
this chapter.
REFERENCES
[1] Chandrakasan, A., Sheng, S., Brodersen, R., “Low-power CMOS digital design,” IEEE Journal of Solid-State Circuits, vol. 27, pp. 473-484, Dec 1992.
[2] Benini, L., Bogliolo, A., De Micheli, G., “A survey of design techniques for system-level
dynamic power management,” IEEE Transactions on CAD, pp. 813-833, June 1999.
[3] Raghunathan, V., Schurgers, C., Park, S., Srivastava, M., “Energy aware microsensor
networks,” IEEE Signal Processing, vol. 19, no. 2, pp. 40-50, March 2002.
[4] Nielsen, L., Niessen, C., Sparsø, J., van Berkel, K., “Low power operation using self-
timed circuits and adaptive scaling of the supply voltage,” IEEE Trans. on VLSI Systems,
Vol.2, No.4, pp. 391-397, Dec 1994.
[5] Pottie, G.J., Kaiser, W.J., “Wireless integrated network sensors,” Communications of the
ACM, vol.43, (no.5), pp.51-58, May 2000.
[6] Srivastava, M., Chandrakasan, A., Brodersen, R., “Predictive system shutdown and other
architectural techniques for energy efficient programmable computation,” IEEE Trans.
on VLSI Systems, vol. 4, no. 1, pp. 42-55, March 1996.
[7] Gruian, F., “Hard real-time scheduling for low energy using stochastic data and DVS
processor,” ACM ISLPED '01, pp. 46-51, Huntington Beach, CA, August 2001.
[8] Raghunathan, V., Spanos, P., Srivastava, M., “Adaptive power-fidelity in energy aware
wireless systems,” RTSS'01, pp. 106-115, London, UK, December 2001.
[9] Weiser, M., Welch, B., Demers, A., Shenker, S., “Scheduling for reduced CPU energy,”
USENIX Symposium on Operating Systems Design and Implementation, pp. 13-23, Nov
1994.
[10] Yao, F., Demers, A., Shenker, S., “A scheduling model for reduced CPU energy,” 36th
Annual Symposium on Foundations of Computer Science, Milwaukee, WI, pp. 374-385,
Oct 1995.
[11] Proakis, J., “Digital Communications,” McGraw-Hill Series in Electrical and Computer
Engineering, 3rd Edition, 1995.
[12] Heinzelman, W., Chandrakasan, A., Balakrishnan, H., “Energy-efficient communication
protocol for wireless microsensor networks,” HICSS 2000, pp. 3005-3014, Maui, HI,
Jan. 2000.
[13] Xu, Y., Heidemann, J., Estrin, D., “Geography-informed energy conservation for ad hoc
routing,” Proceedings of the Seventh Annual International Conference on Mobile
Computing and Networking, pp. 70-84, Rome, Italy, July 2001.
[14] Wang, A., Cho, S-H., Sodini, C.G., Chandrakasan, A.P., “Energy-efficient modulation
and MAC for asymmetric microsensor systems,” ACM ISLPED, pp. 106-111,
Huntington Beach, CA, August 2001.
[15] Savvides, A., Park, S., M. Srivastava, “On modeling networks of wireless micro-
sensors,” ACM SIGMETRICS 2001, pp. 318-319, Cambridge, MA, June 2001.
[16] Srivastava, M.B. “Design and Optimization of Networked Wireless Information
Systems,” IEEE Computer Society Workshop on VLSI, pp. 71-76, April 1998.
[17] Schurgers, C., Aberthorne, O., Srivastava, M., “Modulation scaling for energy aware
communication systems,” ACM ISLPED'01, pp. 96-99, Huntington Beach, CA, August
2001.
[18] Schurgers, C., Raghunathan, V., Srivastava, M., “Modulation scaling for real-time
energy aware packet scheduling,” Globecom'01, pp. 3653-3657, San Antonio, TX,
November 2001.
[19] Prabhakar, B., Biyikoglu, E., Gamal, A., “Energy-efficient transmission over a wireless
link via lazy packet scheduling,” Infocom’01, pp. 386-394, April 2001.
[20] Frenger, P., Orten, P., Ottosson, T., Svensson, A., “Multi-rate convolutional codes,”
Tech. Report No. 21, Chalmers University of Technology, Sweden, April 1998.
[21] Jeffay, K., Stanat, D., Martel, C., “On non-preemptive scheduling of periodic and
sporadic tasks,” RTSS’91, San Antonio, TX, pp. 129-139, Dec. 1991.
[22] Schurgers, C., Srivastava, M., “Energy efficient wireless scheduling: adaptive loading in
time,” WCNC’02, Orlando, FL, March 2002.
[23] Lettieri, P., Fragouli, C., Srivastava, M.B., “Low power error control for wireless links,”
ACM MobiCom '97, Budapest, Hungary, pp. 139-150, Sept. 1997.
[24] Lettieri, P., Srivastava, M.B., “Adaptive frame length control for improving wireless link
throughput, range, and energy efficiency,” IEEE INFOCOM'98 Conference on Computer
Communications, vol. 2, pp. 564-571, March 1998.
[25] Sivalingam, K.M., Chen, J.-C., Agrawal, P., Srivastava, M.B., “Design and analysis of
low-power access protocols for wireless and mobile ATM networks,” ACM/Baltzer Wireless Networks, vol. 6, no. 1, pp. 73-87, February 2000.
[26] Sohrabi, K., Gao, J., Ailawadhi, V., Pottie, G.J., “Protocols for self-organization of a
wireless sensor network,” IEEE Personal Communications, vol.7, (no.5), pp. 16-27, Oct.
2000.
[27] Woo, A., Culler, D., “A transmission control scheme for media access in sensor
networks,” Proceedings of the Seventh Annual International Conference on Mobile
Computing and Networking, pp. 221-235, Rome, Italy, July 2001.
[28] Ye, W., Heidemann, J., Estrin, D., “An energy-efficient MAC protocol for wireless
sensor networks,” IEEE INFOCOM'02 Conference on Computer Communications, June
2002.
[29] Chen, B., Jamieson, K., Balakrishnan, H., Morris, R. “Span: an energy-efficient
coordination algorithm for topology maintenance in ad hoc wireless networks,”
MobiCom 2001, Rome, Italy, pp. 70-84, July 2001.
[30] Schurgers, C., Tsiatsis, V., Ganeriwal, S., and Srivastava, M., “Topology management
for sensor networks: exploiting latency and density,” The Third ACM International
Symposium on Mobile Ad Hoc Networking and Computing (ACM Mobihoc 2002),
Lausanne, Switzerland, June 2002.
[31] Chang, J.-H., Tassiulas, L., “Energy conserving routing in wireless ad-hoc networks,”
IEEE INFOCOM’00 Conference on Computer Communications, Tel Aviv, Israel, pp.
22-31, March 2000.
[32] Singh, S., Woo, M., Raghavendra, C.S., “Power-aware routing in mobile ad hoc
networks,” Proceedings of the Fourth Annual ACM/IEEE International Conference on
Mobile Computing and Networking, pp. 181-190, Dallas, Texas, October 1998.
[33] Schurgers, C., Srivastava, M., “Energy efficient routing in sensor networks,” Proc.
Milcom, pp. 357-361, Vienna, VA, October 2001.
[34] Guo, C., Zhong, L., Rabaey, J., “Low-power distributed MAC for ad hoc sensor radio
networks,” IEEE Globecom’01, pp. 2944-2948, San Antonio, TX, Nov 2001.
Chapter 12
Power-Aware Wireless Microsensor Networks
Rex Min, Seong-Hwan Cho, Manish Bhardwaj, Eugene Shih, Alice Wang,
Anantha Chandrakasan
Massachusetts Institute of Technology
Key words: Sensor networks, energy dissipation, power awareness, energy scalability,
communication vs. computation tradeoff, StrongARM SA-1100, leakage
current, processor energy model, radio energy model, dynamic voltage scaling,
adjustable radio modulation, adaptive forward error correction, media access
control, multihop routing, data aggregation, energy-quality scalability, low
power transceiver, FIR filtering.
12.1 INTRODUCTION
A processor-based node would also include RAM and flash ROM for data and program storage and an operating system with light memory and computational overhead. Code for the relevant data processing algorithms and communication protocols is stored in ROM. In order to deliver data or control
messages to neighboring nodes, data is passed to the node’s radio subsystem.
Finally, power for the node is provided by the battery subsystem with DC-
DC conversion to provide the voltages required by the aforementioned
components.
It is instructive to consider the power consumption characteristics of a
microsensor node in three parts: the sensing circuitry, the digital processing,
and the radio transceiver. The sensing circuitry, which consists of the
environmental sensors and the A/D converter, requires energy for bias
currents, as well as amplification and analog filtering. Its power dissipation
is relatively constant while on, and improvements to its energy-efficiency
depend on increasing integration and skilled analog circuit design. This
section considers the energies of the remaining two sections—digital
computation and radio transmission—and their relationship to the
operational characteristics of a microsensor node.
A node’s digital processing circuits are typically used for digital signal
processing of gathered data and for implementation of the protocol stack.
Energy consumed by digital circuits consists of dynamic and static dissipation as follows:

$$E_{digital} = C_{switched}\,V_{DD}^{2} + V_{DD}\,I_{leak}\,t$$

where $C_{switched}$ is the total capacitance switched by the computation, $I_{leak}$ is the leakage current, and t is the time the circuit remains powered.
The issues of static power and the shutdown cost, two key concerns for the
node’s digital circuits, emerge analogously in the node’s radio. The energy
consumption of the radio consists of static power dissipated by the analog
electronics (analogous to leakage in the digital case, except that these bias
currents serve to stabilize the radio) and the radiated RF energy. The radiated
energy, which scales with transmitted distance as $r^{2}$ to $r^{4}$ depending on environmental conditions, has historically dominated radio energy. For closely
packed microsensors, however, the radio electronics are of greater concern.
The average power consumption of a microsensor radio can be described by:

$$P_{radio} = N_{tx}\left[P_{tx}\left(T_{on,tx}+T_{st}\right)+P_{out}\,T_{on,tx}\right] + N_{rx}\left[P_{rx}\left(T_{on,rx}+T_{st}\right)\right]$$

where $N_{tx/rx}$ is the number of times per second the transmitter or receiver is used, $P_{tx/rx}$ is the power consumed by the transmitter or receiver electronics, $P_{out}$ is the output transmit power, $T_{on,tx/rx}$ is the transmit or receive on-time, and $T_{st}$ is the start-up time of the transceiver.
Using a digitally adjustable DC-DC converter, the SA-1100 can adjust its
own core voltage to demonstrate energy-quality tradeoffs with DVS. In
Figure 12.6a, the latency (an inverse of quality) of the computation is
shown to increase as the energy decreases, given a fixed computational
workload.
In Figure 12.6b, the quality of a FIR filtering algorithm is varied by scaling the number of filter taps. As the number of taps, and hence the quality of the filter output, is reduced, the energy consumed scales down accordingly.
A general-purpose multiplier is designed for a maximum problem size, such as 64 bits per input. In practice, however, typical inputs to the
multiplier are far smaller. Calculating, for instance, an 8-bit multiplication
on a 64-bit multiplier can lead to serious energy inefficiencies due to
unnecessary digital switching on the high bits. The problem size of the
multiplication is a source of operational diversity, and large, monolithic
multiplier circuits are not sufficiently energy-scalable.
An architectural solution to input bit-width diversity is the incorporation
of additional, smaller multipliers of varying sizes, as illustrated in Figure
12.7. Incoming multiplications are routed to the smallest multiplier that can
compute the result, reducing the energy overhead of unused bits. An
ensemble of point systems, each of which is energy-efficient for a small
range of inputs, takes the place of a single system whose energy
consumption does not scale as gracefully with varying inputs. The size and
composition of the ensemble is an optimization problem that accounts for the
probabilistic distribution of the inputs and the energy overhead of routing
them [22]. In short, an ensemble of systems improves power-awareness for
digital architectures with a modest cost in chip area. As process technologies
continue to shrink digital circuits, this area trade-off will be increasingly
worthwhile.
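A sketch of the ensemble idea: route each multiplication to the smallest
adequate multiplier. The per-multiplier energies below are hypothetical,
chosen only to grow with operand width.

# Ensemble of point systems for multiplication: pick the smallest
# multiplier wide enough for the operands. Energies are hypothetical.

ENSEMBLE = [  # (width_bits, energy_per_multiply_joules)
    (8, 1e-12), (16, 4e-12), (32, 16e-12), (64, 64e-12),
]

def multiply_energy(a, b):
    """Energy to compute a*b on the smallest adequate multiplier."""
    width = max(a.bit_length(), b.bit_length())
    for w, energy in ENSEMBLE:
        if width <= w:
            return energy
    raise ValueError("operand wider than largest multiplier")

# An 8-bit multiply costs far less here than on the monolithic 64-bit unit.
print(multiply_energy(200, 93), multiply_energy(2 ** 60, 3))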
The modulation scheme used by the radio is another important trade-off that
can strongly impact the energy consumption of the node. As evidenced by
equation (12.2), one way to increase the energy efficiency of communication
is to reduce the transmission time of the radio. This can be accomplished by
sending multiple bits per symbol, that is, by using M-ary modulation. Using
M-ary modulation, however, will increase the circuit complexity and power
consumption of the radio. In addition, when M-ary modulation is used, the
efficiency of the power amplifier is also reduced. This implies that more
power will be needed to obtain reasonable levels of transmit output power.
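The time/power tradeoff can be sketched as follows: at a fixed symbol rate,
M-ary modulation cuts the on-time by a factor of $\log_2 M$ but adds
electronics overhead power. All of the numbers below are illustrative.

import math

def tx_energy(n_bits, symbol_rate, p_electronics, p_overhead, m=2):
    """Energy (J) to send n_bits with M-ary modulation at a fixed symbol rate."""
    t_on = n_bits / (symbol_rate * math.log2(m))  # fewer symbols for larger M
    return (p_electronics + p_overhead) * t_on

bits, rs = 10_000, 1e6
e_bin = tx_energy(bits, rs, p_electronics=0.08, p_overhead=0.00, m=2)
e_16 = tx_energy(bits, rs, p_electronics=0.08, p_overhead=0.04, m=16)
print(f"binary: {e_bin * 1e6:.0f} uJ, 16-ary: {e_16 * 1e6:.0f} uJ")

Whether M-ary wins thus depends on whether the shortened on-time outweighs
the added circuit power and the reduced power-amplifier efficiency.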
The architecture of a generic binary modulation scheme is shown in
Figure 12.8(a), where the modulation circuitry is integrated together with
the frequency synthesizer [23], [17]. To transmit data using this architecture,
the VCO can be either directly or indirectly modulated. The architecture of a
radio that uses M-ary modulation is shown in Figure 12.8(b). Here, the data
encoder parallelizes serially input bits and then passes the result to a digital-
to-analog converter (DAC). The analog values produced serve as output
levels for the in-phase (I) and quadrature (Q) components of the output
signal.
The last two terms of equation (12.6) can be ignored, since the power
consumed by the data encoder and the DAC is negligible compared to the
power of the frequency synthesizer. A comparison of the energy consumption
of binary modulation and M-ary modulation is shown in Figure 12.9. In the
figure, the ratio of the energy consumption of M-ary modulation to the
energy consumption of binary modulation is plotted versus the power
overhead of the additional M-ary modulation circuitry.
In any protocol stack, the link layer has a variety of purposes. One of the
tasks of the link layer is to specify the encodings and length limits on
packets such that messages can be sent and received by the underlying
physical layer. The link layer is also responsible for ensuring reliable data
transfer. This section discusses the impact of variable-strength error control
on the energy consumption of a microsensor node. An additional and similar
exploration of the impact of adapting packet size and error control on system
energy efficiency is available in [24].
The level of reliability provided by the link layer will depend on the
needs of the application and on user-specified constraints. In many wireless
sensor networks, such as machine monitoring and vehicle detection
networks, the actual data will need to be transferred with an extremely low
probability of error.
In a microsensor application, it is assumed that objects of interest have
high mobility (e.g., moving vehicles) and nodes are immobile. Thus, the
coherence time of the channel is not much larger than the signaling time of
the radio. Given this scenario, the nodes can be assumed to be communicating
over a frequency non-selective, slow Rayleigh fading channel with additive
white Gaussian noise. This is a reasonable channel model to use for
communication at 2.4 GHz where line-of-sight communication is not always
possible.
Consider one node transmitting data to another over such a channel using
the radio described in Section 12.2.3. The radio presented uses non-coherent
binary frequency-shift keying (FSK) as the modulation scheme. For
purposes of comparison, the best achievable probability of error using raw,
non-coherent binary FSK over a slowly fading Rayleigh channel will be
presented. Let $P_b$ be a function of the received energy per bit to noise power
ratio $\gamma_b = E_b/N_0$. In general, $P_b = f(\gamma_b)$, where $\gamma_b$ is a random variable for a fading
channel. It is shown in [25] that the probability of error using non-coherent,
orthogonal binary FSK is $P_b = 1/(2 + \bar{\gamma}_b)$, where $\bar{\gamma}_b$ is the average
received $E_b/N_0$.
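This closed form makes the cost of fading concrete. A two-line check of the
average SNR needed for a target error rate, assuming the $P_b = 1/(2+\bar{\gamma}_b)$
expression above:

import math

def required_snr_db(p_b):
    """Average Eb/N0 (dB) for target bit error probability p_b."""
    gamma_bar = 1.0 / p_b - 2.0   # invert Pb = 1/(2 + gamma_bar)
    return 10.0 * math.log10(gamma_bar)

for p_b in (1e-2, 1e-3, 1e-4):
    print(f"Pb={p_b:.0e}: need ~{required_snr_db(p_b):.1f} dB average Eb/N0")

Each decade of error-rate improvement costs roughly 10 dB of average SNR
in this channel, which is exactly why error-control coding becomes
attractive at low error probabilities.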
Unfortunately, this does not directly tell us the amount of transmit power
that is required to achieve a certain probability of error. To determine the
transmit power as a function of $P_b$ requires consideration of the radio
implementation. In general, $\bar{\gamma}_b$ can be converted to a required transmit
power using a link budget of the form

$P_{out} = \bar{\gamma}_b \cdot N_0 R_b \cdot L_{path} \cdot F$

where $N_0$ is the noise power spectral density, $R_b$ is the bit rate, $L_{path}$ is the
path loss between the transmitter and receiver, and $F$ is the receiver noise
figure. When convolutional coding is applied, the bit error probability after
decoding can be upper-bounded by the union bound [25]:

$P_b < \sum_{d=d_{free}}^{\infty} \beta_d P(d)$
Here, d represents the Hamming distance between some path in the trellis
decoder and the all-zero path, the coefficients $\beta_d$ can be obtained from the
expansion of the first derivative of the transfer function, P(d) is the first-
event error probability, and $d_{free}$ is the minimum free distance [25]. Figure
12.10 plots the resulting bit error probability for codes with varying rates
and constraint lengths K.
Figure 12.11 plots the measured energy per useful bit required to
decode 1/2-rate and 1/3-rate convolutional codes with varying constraint length
on the SA-1100. Two observations can be derived from these graphs. First,
the energy consumption scales exponentially with the constraint length. This
is to be expected since the number of states in the trellis increases
exponentially with constraint length. Second, the energy consumption
appears independent of the coding rate. This is reasonable since the rate only
affects the number of bits sent over the transmission. A lower-rate code does
not necessarily increase the computational energy since the number of states
in the Viterbi decoder is unaffected. In addition, the cost of reading the data
from memory is dominated by the updating of the survivor path registers in
the Viterbi algorithm. The size of the registers is proportional to the
constraint length and is not determined by the rate. Therefore, given two
convolutional codes with the same constraint length K but different rates,
the per-bit energy to decode them is the same, even though more bits are
transmitted when the lower-rate code is used.
Given the data in Figure 12.11, the convolutional code that minimizes
the energy consumed by communication can be determined for a given
probability of error $P_b$. In Figure 12.12, the total energy per information bit
is plotted against $P_b$.
Figure 12.12 shows that the energy per bit using no coding is lower than
that for coding above a certain probability of error. The reason for this
result is that the energy of computation, i.e., decoding, dominates the
energy used by the radio for high probabilities of error. For example,
assuming the model described in equation (12.9), the communication energy
to transmit and receive per useful bit for a representative convolutional
code is 85 nJ/bit. On the other hand, the energy to decode the same code
on the SA-1100 is measured to be 2200 nJ per bit.
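The tradeoff can be summarized numerically. In the sketch below, the decode
energy grows as $2^K$ with constraint length; the base constant is chosen
only so that the K = 6 case lands near the 2200 nJ/bit measurement quoted
above, and all values are illustrative.

def total_energy_per_bit(e_radio, k=None, e_decode_base=35e-9):
    """Radio energy plus Viterbi decode energy per useful bit (J)."""
    if k is None:
        return e_radio                       # uncoded: no decoding cost
    return e_radio + e_decode_base * 2 ** k  # trellis states grow as 2^K

print(total_energy_per_bit(80e-9))           # uncoded, high-Pb regime
print(total_energy_per_bit(85e-9, k=6))      # coded: decoding dominates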
At lower probabilities of error, the power amplifier energy begins to
dominate. At these ranges, codes with greater redundancy have better
performance. These results imply that coding the data is not always the best
operational policy for energy-efficient operation. While it may appear that
this result is solely due to the inefficiency of the SA-1100 in performing
error correction coding, this result holds even for more efficient
implementations of Viterbi decoding.
Since using the SA-1100 to perform Viterbi decoding is energy ineffi-
cient, using a dedicated integrated circuit solution to perform decoding is
preferred. The power characteristics of dedicated Viterbi decoders can then
be examined within the same energy-scalable framework.
This section moves upward in the protocol stack to consider the design of
power-aware media access (MAC) layers and routing protocols. For
maximal energy efficiency, the operational policies of the MAC and routing
protocols must be tailored to the energy consumption characteristics of the
hardware and the nature of the sensing application.
nodes in the network. Due to the finite error among the sensors' reference
clocks, the base station must send synchronization (SYNC) packets to avoid
collisions among transmitted packets. Hence, the receiver circuitry of each
sensor must be activated periodically to receive the SYNC signals. As
explained in Section 12.2.3 , the receiver consumes more power than the
transmitter. Thus, it is necessary to reduce the average number of times the
receiver is active.
The number of times the receiver needs to be active depends on the
guard time $t_g$, the minimum time difference between two time slots in the
same frequency band, as shown in Figure 12.15. During $t_g$, no sensor is
scheduled to transmit any data. Thus, a larger guard time will reduce the
probability of packet collisions and, thus, reduce the frequency of SYNC
signals and the receive energy they require.
If two slots in the same frequency band are separated by $t_g$, it will
take $t_g/\delta$ seconds for these two packets to collide, where $\delta$ is the percent
difference between the two sensors' clocks. Hence, the sensors must be
resynchronized at least $\delta/t_g$ times every second. In other words, the
average number of times the receiver is active per second can be written
as $N_{rx} = \delta/t_g$. Assuming that the total slot time available is fixed, a
formula can be derived relating $t_g$ to the latency requirement of the
transmitted packet.
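A small sketch of this relationship, using the $t_g$ and $\delta$ introduced above;
the receiver power and wake duration are illustrative values.

def sync_rx_power(t_guard_s, delta, p_rx=0.1, t_wake_s=1e-3):
    """Average receiver power (W) spent listening for SYNC packets."""
    n_rx_per_s = delta / t_guard_s   # required resynchronization rate
    return n_rx_per_s * p_rx * t_wake_s

# 50 ppm relative clock error: longer guard times cut SYNC listening power.
for t_g in (1e-3, 10e-3, 100e-3):
    print(f"t_g = {t_g * 1e3:5.0f} ms -> {sync_rx_power(t_g, 50e-6) * 1e6:.2f} uW")

Of course, a larger guard time consumes slot time and increases packet
latency, which bounds $t_g$ from above.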
Data in a sensor network are subject to two primary operations: the for-
warding of data to a remote base station and the aggregation of multiple
streams into a single, higher-quality stream. This section considers an
energy-efficient approach for performing the first of these two essential
functions.
Multihop forwarding utilizes several intervening nodes acting as relays to
prevent any node from having to spend too much transmit energy. A scheme
that transports data between two nodes such that the overall rate of energy
dissipation is minimized is called a minimum energy relay. The proper
placement of nodes for minimum energy relay can be derived by considering
the energy required for a generalized multihop relay.
To aid the presentation of the analysis, the total energy required to
transmit and receive a packet of data over a distance d is represented as

$E_{link}(d) = E_{tx}(d) + E_{rx} = (\alpha_{11} + \alpha_2 d^n) + \alpha_{12}$

where $\alpha_{11}$ and $\alpha_{12}$ capture the fixed per-bit energy of the transmitter and
receiver electronics, respectively, $\alpha_2 d^n$ is the radiated energy needed to
cover distance d, and n is the path loss exponent. The receive-energy term is
dropped at node A to account for the fact that node A, the initiator of the
relay, need not spend any energy receiving. The receive energy needed at B is
disregarded because it is fixed regardless of the number of intervening
relays.
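Under the link-energy form above, minimizing total relay energy over equally
spaced hops yields a characteristic hop distance. The sketch below uses
hypothetical per-bit constants, combining the electronics energies into a
single $\alpha_1 = \alpha_{11} + \alpha_{12}$.

def characteristic_distance(alpha1, alpha2, n):
    """Hop distance minimizing energy/meter for cost alpha1 + alpha2*d^n."""
    return (alpha1 / (alpha2 * (n - 1))) ** (1.0 / n)

def relay_energy(total_d, hops, alpha1, alpha2, n):
    """Per-bit energy to cover total_d meters with equally spaced hops."""
    d = total_d / hops
    return hops * (alpha1 + alpha2 * d ** n)

a1, a2, n = 180e-9, 10e-12, 2   # J/bit, J/bit/m^n, path-loss exponent
print(f"characteristic distance ~ {characteristic_distance(a1, a2, n):.0f} m")
for hops in (1, 2, 4, 8):
    print(hops, f"{relay_energy(1000, hops, a1, a2, n) * 1e6:.1f} uJ/bit")

With these constants, adding relays sharply reduces the per-bit energy until
the hop length approaches the characteristic distance, after which the fixed
electronics cost of each extra hop dominates.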
station for processing. Figure 12.20 depicts the energy required for the first
approach compared to the energy required for the second approach. As the
distance from the sensor to the base station increases, it is more energy-
efficient to perform signal processing locally, at the sensor cluster.
before transmitting the data to the clusterhead. The clusterhead performs the
beamforming and LOB estimation. Since the FFTs are parallelized, the clock
speed and voltage supply of both the FFTs and the beamforming can be low-
ered. For example, if the FFTs at the sensor nodes are run at 0.85 V and 74
MHz while the beamforming algorithm is run at 1.17 V and 162 MHz, then,
the ability to scale the energy consumption of the entire system in response
to changes in the environment, the state of the network, and protocol and
application parameters in order to maximize system lifetime and reduce
global energy consumption. Thus, all layers of the system, including the
algorithms, operating system, and network protocols, can adaptively
minimize energy usage.
The primary component of the data and control processing subsystem is the
StrongARM SA-1110 microprocessor. Selected for its low-power con-
sumption, performance, and static CMOS design, the SA-1110 runs at a
clock speed of 59 MHz to 206 MHz. The processing subsystem also includes
RAM and flash ROM for data and program storage. A multi-threaded
operating system running on the SA-1110 has been customized to allow software to scale
the energy consumption of the processor. Code for the algorithms and
protocols is stored in ROM.
Data from the StrongARM that is destined for neighboring nodes is
passed to the radio subsystem of the node via a 16-bit memory interface. A
Xilinx FPGA performs additional protocol processing and data recovery.
The primary component of the radio is a Bluetooth-compatible commercial
single-chip 2.4 GHz transceiver [17] with an integrated frequency
synthesizer. The on-board phase-locked loop (PLL), transmitter chain, and
receiver chain can be shut off via software or hardware control for energy
savings. To transmit data, an external voltage-controlled oscillator (VCO) is
directly modulated, providing simplicity at the circuit level and reduced
power consumption at the expense of limits on the amount of data that can
be transmitted continuously. The radio module, with two different power
amplifiers, is capable of transmitting at 1 Mbps at a range of up to 100 m.
Finally, power for the node is provided by the battery subsystem via a
single 3.6 V DC source with an energy capacity of approximately 1500
mAh. Switching regulators generate 3.3 V and adjustable 0.9-2.0 V supplies
from the battery. The 3.3 V supply powers all digital components on the
sensor node with the exception of the processor core. The core is powered
separately by a digitally adjustable switching regulator that can provide 0.9 V
to 2.0 V in thirty discrete increments. The digitally-adjustable voltage allows
the SA-1110 to control its own core voltage, enabling the use of the dynamic
voltage scaling technique discussed in Section 12.3.1. This feedback loop
governing processor voltage is illustrated in Figure 12.26.
This chapter has focused on hardware and algorithmic enablers for energy-
efficient microsensor networks. The final step in the design hierarchy—the
design of an application programming interface (API) and development tools
that will bring the functionality of the network into the hands of users—is an
emerging field of research. An ideal node API would expose the power-
aware operation of the node without sacrificing the abstraction of low-level
functionality. The API would enable an application to shut down or throttle
the performance of each hardware component on the node. Top-level API
calls directed at the network as a single entity would allow quality and
performance to be set and dynamically adjusted, allowing the network to
manage global energy consumption through energy-quality tradeoffs.
12.7 SUMMARY
A microsensor network that can gather and transmit data for years demands
nodes that operate with remarkable energy efficiency. The properties of
VLSI hardware, such as leakage and the start-up time of radio electronics,
must be considered for their impact on system energy, especially during long
idle periods. Nodes must take advantage of operational diversity by
gracefully scaling back energy consumption, so that the node performs just
enough computation—and no more—to meet an application’s specific needs.
All levels of the communication hierarchy, from the link layer to media
access to protocols for routing and clustering, must be tuned for the
hardware and application. Careful attention to the details of energy
consumption at every point in the design process will be the key enabler for
dense, robust microsensor networks that deliver maximal system lifetime in
the most challenging and operationally diverse environments.
REFERENCES
[1] K. Bult et al., “Low power systems for wireless microsensors,” Proc. ISLPED ’96, pp.
17-21, August 1996.
[2] D. Estrin, R. Govindan, J. Heidemann, and S. Kumar, “Next century challenges: scalable
coordination in sensor networks,” Proc. ACM MobiCom'99, pp. 263-270, August 1999.
[3] G. Asada, et al., “Wireless integrated network sensors: low power systems on a chip,”
Proc. ESSCIRC '98, 1998.
[4] J. Kahn, R. Katz, and K. Pister, “Next century challenges: mobile networking for smart
dust,” Proc. ACM MobiCom '99, pp. 271-278, August 1999.
[5] N. Priyantha, A. Chakraborty, and H. Balakrishnan, “The cricket location-support sys-
tem,” Proc. MobiCom '00, pp. 32-43, August 2000.
[6] J. Rabaey et al., “PicoRadio supports ad hoc ultra-low power wireless networking,”
Computer, vol. 33, no. 7, July 2000, pp. 42-48
[7] F. Op’t Eynde et al., “A fully-integrated single-chip SOC for Bluetooth,” Proc. ISSCC
2001, Feb. 2001, pp. 196-197, 446.
[8] V. Tiwari and S. Malik, “Power analysis of embedded software: A first approach to soft-
ware power minimization,” IEEE Trans. on VLSI Systems, Vol. 2, December 1994.
[9] R. Powers, “Advances and trends in primary and small secondary batteries,” IEEE Aero-
space and Electronics Systems Magazine, vol. 9, no. 4, April 1994, pp. 32-36.
[10] L. Nord, and J. Haartsen, The Bluetooth Radio Specification and the Bluetooth Baseband
Specification, Bluetooth, 1999-2000.
[11] A. Wang, W. Heinzelman, and A. Chandrakasan, “Energy-scalable protocols for battery-
operated microsensor networks,” Proc. IEEE SiPS '99, October 1999.
[12] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, A Systems Perspective,
2nd edition, Reading, Mass.: Addison-Wesley, 1993, pp. 236, 313-317.
[13] V. De and S. Borkar, “Technology and design challenges for low power and high perfor-
mance,” Proc. ISLPED '99, pp. 163-168, August 1999.
[14] Advanced RISC Machines Ltd., Advanced RISC Machines Architectural Reference
Manual, Prentice Hall, New York, 1996.
[15] R. Min, T. Furrer, and A. Chandrakasan, “Dynamic voltage scaling techniques for dis-
tributed microsensor networks,” Proc. WVLSI '00, April 2000.
[16] M. Perrott, T. Tewksbury, and C. Sodini, “27 mW CMOS Fractional-N synthesizer/mod-
ulator IC,” Proc. ISSCC 1997, pp. 366-367, February 1997.
[17] National Semiconductor Corporation, LMX3162 Evaluation Notes and Datasheet, April
1999.
[18] J. Goodman, A. Dancy, and A.P. Chandrakasan, “An energy/security scalable encryption
processor using an embedded variable voltage DC/DC Converter,” IEEE Journal of
Solid-State Circuits, Vol. 33, No. 11, November 1998.
[19] G. Wei and M. Horowitz, “A low power switching supply for self-clocked systems,”
Proc. ISLPED 1996.
[20] K. Govil, E. Chan, and H. Wasserman, “Comparing algorithms for dynamic speed-set-
ting of a low-power CPU,” Proc. MobiCom '95, August 1995.
[21] T. Pering, T. Burd, and R. Brodersen, “The simulation and evaluation of dynamic volt-
age scaling algorithms,” Proc. ISLPED '98, August 1998.
[22] M. Bhardwaj, R. Min and A. Chandrakasan, “Power-aware systems,” Proc. of 34th
Asilomar Conference on Signals, Systems, and Computers, November 2000.
[23] N. Filiol, T. Riley, C. Plett, and M. Copeland, “An agile ISM band frequency synthesizer
with built-in GMSK data modulation,” IEEE Journal of Solid-State Circuits, vol. 33, pp.
998-1008, July 1998.
[24] P. Lettieri and M. B. Srivastava, “Adaptive frame length control for improving wireless
link throughput, range, and energy efficiency,” Proc. INFOCOM '98, pp. 564-571,
March 1998.
[25] J. Proakis, Digital Communications. New York City, New York: McGraw-Hill, 4th ed.,
2000.
[26] M. Bhardwaj, “Power-aware systems,” SM Thesis, Department of EECS, Massachusetts
Institute of Technology, 2001.
[27] W. Heinzelman, A. Chandrakasan, and H. Balakrishnan, “Energy-efficient communica-
tion protocol for wireless microsensor networks,” Proc. HICSS 2000, January 2000.
[28] K. Yao, et al., “Blind beamforming on a randomly distributed sensor array system,”
IEEE Journal on Selected Topics in Communications, Vol. 16, No. 8, October 1998.
[29] S. Haykin, J. Litva, and T. Shepherd, Radar Array Processing. Springer-Verlag, 1993.
[30] A. Sinha and A. Chandrakasan, “Energy aware software,” in Proc. VLSI Design '00, pp.
50-55, Jan. 2000.
[31] A. Wang, S.-H. Cho, C. Sodini, and A. Chandrakasan, “Energy efficient modulation and
MAC for asymmetric RF microsensor systems,” Proc. ISLPED 2001, August 2001.
Chapter 13
Circuit and System Level Power Management
Abstract: This chapter describes the concept of dynamic power management (DPM),
which is a methodology used to decrease the power consumption of a system.
In DPM, a system is dynamically reconfigured to lower the power
consumption while meeting some performance requirement. In other words,
depending on the necessary performance and the actual computation load, the
system or some of its blocks are turned off or their performance is lowered.
This chapter reviews several approaches to system-level DPM, including fixed
time-out, predictive shut-down or wake-up, and stochastic methods. In
addition, it presents the key ideas behind circuit-level power management
including clock gating, power gating and precomputation logic. The chapter
concludes with a description of several runtime mechanisms for leakage power
control in VLSI circuits.
13.1 INTRODUCTION
it up. The power consumed in the ACTIVE state is typically much higher
than that in the STANDBY state. Therefore, putting the components in the
STANDBY state when their outputs are not being used can save power.
Because the transition from one state to another consumes some energy,
there is a minimum idle time below which no power can be saved. Assume
$E_{A\to S}$ denotes the energy consumed in the transition from the ACTIVE to the
STANDBY state, and $T_{A\to S}$ denotes the time for this transition. $E_{S\to A}$ and
$T_{S\to A}$ are defined similarly. Furthermore, assume $P_A$ and $P_S$ denote the power
consumption values in the ACTIVE and the STANDBY states, respectively.
In a power-managed system, the component is switched from the ACTIVE
state to the STANDBY state if it has been idle for some period of time. The
minimum value of the idle time for which the switch saves energy is calculated as:

$T_{min} = \dfrac{E_{A\to S} + E_{S\to A} - P_A(T_{A\to S} + T_{S\to A})}{P_A - P_S}$
Figure 13.1 shows two power states of a hard disk. In the ACTIVE state
(A) the power consumption is 10 mW, while in the STANDBY state (S) the
power consumption is 0 mW. It takes one second to switch the hard disk
from the ACTIVE state to the STANDBY state. Once in the STANDBY
state, it takes two seconds to switch back to the ACTIVE state. Note that
switching between the two states consumes energy: 10 mJ for switching from
the ACTIVE state to the STANDBY state and 40 mJ when switching back to
the ACTIVE state. $T_{min}$ is two seconds for this system. If the idle time is less
than $T_{min}$, switching to the STANDBY state will increase the energy
consumption of the system. Otherwise, it will reduce the energy
consumption. If the components of a system have a high $T_{min}$, DPM may not
be effective in reducing the power.
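The hard-disk numbers can be checked directly against the formula above:

def t_min(e_as, e_sa, t_as, t_sa, p_active, p_standby):
    """Minimum idle time (s) for which entering STANDBY saves energy."""
    return (e_as + e_sa - p_active * (t_as + t_sa)) / (p_active - p_standby)

# 10 mJ / 1 s to sleep, 40 mJ / 2 s to wake, 10 mW active, 0 mW standby:
print(t_min(10e-3, 40e-3, 1.0, 2.0, 10e-3, 0.0))   # -> 2.0 seconds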
The transition latency from the ACTIVE state to the STANDBY state
may correspond to storing register values inside the system memory so that
they can be restored at a later time. This requires some amount of time.
and turn them on and off based on the prediction. High prediction accuracy
makes it possible to significantly reduce the power at the expense of a small
increase in the latency of the system. On the other hand, if the accuracy is
low, both the latency and the power consumption of the system might
increase.
The DPM algorithm can be implemented in software and/or hardware. In
either case, there is a power cost associated with running the algorithm. It is very
important to take this power consumption into account while selecting an
algorithm. If the DPM algorithm is implemented in software, the load on the
core processor of the system increases. This can increase the response time
of the system. Implementing the DPM algorithm in hardware decreases the
load on the processor, but this comes at the expense of less flexibility.
DPM algorithms can be divided into two different categories: adaptive
and non-adaptive. Adaptive algorithms change the policy by which they
manage the system over time based on the change in the load of the system.
In this sense, they are capable of handling workloads that are unknown a
priori or are non-stationary. Non-adaptive algorithms use a fixed policy, that
is, they implicitly assume that, as a function of the system state, the same
decision is taken at every time instance. Adaptive algorithms tend to perform
better than non-adaptive algorithms, but they are more complex.
A simple DPM policy may employ a greedy method, which turns the system
off as soon as it is not performing any useful task. The system is
subsequently turned on when a new service request is received. The
advantage of this method is its simplicity, but it has the following
disadvantages:
1. The policy does not consider the energy consumed by switching from
ACTIVE to STANDBY state. Therefore, it may put the system in
STANDBY even when it has been idle for a short period of time, only to
have to turn the system back to the ACTIVE state in order to provide
service to an incoming request. This can increase the overall power
consumption of the system.
2. After receiving a new service request, it often takes some time for the
system to wake up and be ready to provide the required service.
Therefore, the response time of the system increases. This increased
latency is not desirable or cannot be tolerated in many cases.
Under a fixed time-out policy, the system is shut down as soon as the
elapsed time after performing the last task exceeds a time-out value $T_{to}$. This
means that the system stays in the ACTIVE state for $T_{to}$ seconds and consumes a high
amount of power without performing any useful task. To decrease the power
consumption, a predictive shut-down technique can be employed as first
proposed in [3]. In this technique, the previous history of the system is used
to decide when to go from the ACTIVE to the STANDBY state. A non-
linear regression equation is used to predict the expected idle time based on
the previous behavior of the system. If the expected idle time is long enough,
the system is turned off. Otherwise, the system remains in the ACTIVE state.
The disadvantage of this technique is that there is no method to
automatically find the regression equation.
Another predictive method measures the busy time of the system and
decides whether or not to shut the system down based on the measurement.
If the busy time is less than a threshold, the system is shut down. Otherwise,
it is left in the ACTIVE state. This method performs well for systems that
have burst-like loads. In such systems, short periods of operation are usually
followed by long periods of inactivity. Networks of sensors and wireless
terminals are two examples of such systems.
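A toy comparison of the greedy policy (time-out of zero) against fixed
time-outs on a hypothetical trace of idle gaps, reusing the hard-disk cost
model from Section 13.1 (transition times are ignored for simplicity):

E_TRANS = 50e-3    # total sleep + wake transition energy (J)
P_ACTIVE = 10e-3   # active power (W); standby power taken as zero

def idle_energy(gap_s, timeout_s):
    """Energy over one idle gap under a fixed time-out shutdown policy."""
    if gap_s <= timeout_s:
        return P_ACTIVE * gap_s            # never sleeps during this gap
    return P_ACTIVE * timeout_s + E_TRANS  # waits out the time-out, then sleeps

gaps = [0.5, 1.0, 8.0, 0.2, 30.0, 2.5]     # hypothetical idle-gap trace (s)
for timeout in (0.0, 2.0, 5.0):            # timeout 0.0 is the greedy policy
    total = sum(idle_energy(g, timeout) for g in gaps)
    print(f"timeout = {timeout:3.1f} s: {total * 1e3:.0f} mJ")

On this trace the greedy policy loses to a 2-second time-out because it pays
the transition energy even for gaps far shorter than the break-even time.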
The methods described so far increase the system response time. This may
not be acceptable in many cases. Hwang et al. [4] proposed the predictive
wake-up method to decrease the performance penalty. In this method if the
time spent in the STANDBY state is more than a threshold, the system goes
to the ACTIVE state. As a result there will be no performance penalty for
requests coming after the threshold. On the other hand, high amounts of
power are consumed while the system does not perform any tasks but is still
in the ACTIVE state.
Heuristic policies cannot achieve the best power-delay tradeoff for a system.
They can account for the time varying and uncertain nature of the workload,
but have difficulty accounting for the variability and state dependency of the
service speeds and transition times of the many complex components that a
system may contain. Hence, it is desirable to develop a stochastic model of a
power-managed system and then find the optimal DPM policies under
various workload statistics.
The problem of finding a stochastic power management policy that
minimizes the total power dissipation under a performance constraint (or
alternatively, maximizes the system performance under a power constraint)
is of great interest to system designers. This problem is often referred to as
the policy optimization (PO) problem.
12. The generator matrix for a CTMDP is equivalent to the transition matrix for a DTMDP.
and also in the amount of required energy and the latency for transferring to
the active states.
Figure 13.4 shows the Markov process model of an SR with two states,
$s_0$ and $s_1$. When the SR is in state $s_0$, it generates a request every $1/\lambda_0$ ms
on average. Similarly, when it is in state $s_1$, it generates a request every
$1/\lambda_1$ ms. So, assuming that the request inter-arrival time follows an
exponential distribution, the request generation rates in the two states are
$\lambda_0$ and $\lambda_1$, respectively. Furthermore, assume that the time needed for the
SR to switch from one operation state to another is a random variable with
an exponential distribution. In particular, when the SR is in state $s_0$, the
expected time for it to switch to state $s_1$ is $1/\mu_0$ ms (i.e., its transition rate
is $\mu_0$), and when the SR is in state $s_1$, the expected time for it to switch to
state $s_0$ is $1/\mu_1$ ms (i.e., its transition rate is $\mu_1$).
Based on the source of the power that they reduce, component-level power
management techniques can be divided into two categories: techniques that
reduce the dynamic power and techniques that reduce the leakage power.
Because dynamic power has been the dominant source of the power
dissipation in VLSI circuits and systems to date, a significant effort has been
expended on decreasing it. Dynamic power is consumed every time the
output of a gate is changed and its average value can be computed using the
following formula (assuming that all transitions are full rail-to-rail
transitions):

$P_{dyn} = \alpha\,C\,V^2 f$

where C is the capacitive load of the gate, V is the supply voltage, f is the
clock frequency, and $\alpha$ is the switching activity. To decrease the dynamic
power, any of the parameters in the formula—namely capacitance, supply
voltage, frequency, and switching activity—may be reduced. In the next
subsections, several techniques are introduced that decrease the dynamic
power consumption by decreasing one or more of the parameters in the above
formula.
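A direct evaluation of the formula, with illustrative gate parameters, shows
why voltage is the most valuable knob:

def dynamic_power(alpha, c_load, vdd, freq):
    """Average dynamic power (W) of one gate, full rail-to-rail swings."""
    return alpha * c_load * vdd ** 2 * freq

base = dynamic_power(alpha=0.1, c_load=10e-15, vdd=1.5, freq=500e6)
half_v = dynamic_power(alpha=0.1, c_load=10e-15, vdd=0.75, freq=500e6)
print(f"{base * 1e6:.2f} uW -> {half_v * 1e6:.2f} uW at half the voltage")

Halving the voltage quarters the power, whereas halving frequency,
capacitance, or activity only halves it.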
Figure 13.6 illustrates how clock gating can be used to decrease the
switching activity in a circuit. If the enable signal is one, the circuit works as
usual.
In Figure 13.10, making Enable equal to zero freezes the clock signal. As
a result, the switching activity on all clock drivers and modules in its fanout
is eliminated.
Up to this point it has been shown how clock gating can be done at the
gate level. It is also possible to perform clock gating in the Hardware
Description Language (HDL) specification of the circuit. In the remainder of
this section several methods for clock gating at the HDL level are described,
namely, register substitution and code separation.
Register substitution replaces registers that have enable signals with
gated-clock registers. Figure 13.11 shows part of a Verilog description and
its gated-clock version. In the gated-clock version, an always statement has
been used to generate a synchronized glitch-free enable signal (i.e., l_ena).
The register is clocked using the AND of the original clock and the
generated enable signal.
In the code separation method proposed by Raghavan et al. [14], parts of the
Verilog code that are conditionally executed are identified and separated.
Then, clock gating is used for each part.
Figure 13.12 shows a part of a second Verilog description and its gated-
clock version. In the original description, the first and the second statements
inside the always loop are executed at each positive-edge clock, while the
last statement is executed conditionally. Thus, the last statement may be
separated from the rest and can be transformed using a clock gating
technique.
DVFS techniques can be divided into two categories, one for non real-
time operation and the other for real-time operation. The most important step
in implementing DVFS is prediction of the future workload, which allows
one to choose the minimum required voltage/frequency levels while
satisfying key constraints on energy and QoS. As proposed in [15] and [16],
a simple interval-based scheduling algorithm can be used in non-real-time
operation. This is because there is no hard timing constraint. As a result,
some performance degradation due to workload misprediction is allowed.
The defining characteristic of the interval-based scheduling algorithm is that
uniform-length intervals are used to monitor the system utilization in the
previous intervals and thereby set the voltage level for the next interval by
extrapolation. This algorithm is effective for applications with predictable
computational workloads such as audio or other digital signal processing
intensive applications [17]. Although the interval-based scheduling
algorithm is simple and easy to implement, it often predicts the future
workload incorrectly when a task’s workload exhibits a large variability.
One typical example of such a task is MPEG decoding. In MPEG decoding,
because the computational workload varies greatly depending on each frame
type, frequent load mispredictions may result in a decrease in the frame rate,
which in turn means a lower QoS in MPEG.
There are also many ways to apply DVFS in real-time application
scenarios. In general, some information is given by the application itself, and
the OS can use this information to implement an effective DVFS technique.
In [18], an intra-task voltage scheduling technique was proposed in which
the application code is split into many segments and the worst-case
execution time of each segment (which is obtained by static timing analysis)
is used to find a suitable voltage for the next segment. A method using a
software feedback loop was proposed in [19]. In this scheme, a deadline for
each time slot is provided. Furthermore, the actual execution time of each
slot is usually shorter than the given deadline, which means that a slack time
exists. The authors calculated the operating frequency of the processor for
the next time slot depending on the slack time generated in the current slot
and the worst-case execution time of each slot.
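A minimal interval-based scheduler in the spirit of [15] and [16] can be
sketched in a few lines; the speed levels and the utilization trace are
assumptions for illustration.

SPEEDS = [0.25, 0.5, 0.75, 1.0]   # normalized frequency/voltage steps

def next_speed(prev_utilization):
    """Slowest speed that would have absorbed the last interval's load."""
    for s in SPEEDS:
        if prev_utilization <= s:
            return s
    return 1.0

speed = 1.0
for util_at_full_speed in [0.2, 0.3, 0.9, 0.4, 0.1]:
    seen = min(util_at_full_speed / speed, 1.0)  # load seen at current speed
    speed = next_speed(seen)
    print(f"observed {seen:.2f} -> next interval at {speed:.2f}x speed")

The clipped 100% reading in the trace is exactly the misprediction problem
noted above: once the processor is slowed, a bursty task saturates the
interval and the true demand becomes unobservable.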
level and the CPU clock frequency. In the second case, the total energy
consumed by the CPU is one quarter of that in the first case, since halving
the supply voltage (and the clock frequency) quarters the switching energy
for the same number of cycles. Clearly, there is a 75% energy saving as a
result of lowering the supply voltage. This saving is achieved in spite of a
“perfect” (i.e., immediate and with no overhead) power-down of the CPU in
the first case. This energy saving is achieved without sacrificing the QoS
because the given deadline is met. An energy saving of 89% is achieved when
the supply voltage and the clock frequency are scaled down to one-third of
their original values.
incoming frame whereas the FI part remains constant regardless of the frame
type. In the proposed DVFS scheme, the FI part is used as a “buffer zone” to
compensate for the prediction error that may occur during the FD part. This
scheme allows the authors to obtain a significant energy saving without any
notable QoS degradation.
Although the DVFS method is currently a very effective way to reduce
the dynamic power, it is expected to become less effective as the process
technology scales down. The current trend of lowering the supply voltage in
each generation decreases the leeway available for changing the supply
voltage. Another problem is that the delay of the circuit becomes a sub-linear
function of the voltage for small supply voltages. Hence, the actual power
saving becomes sub-quadratic.
13.3.1.3 Pre-computation
The current trend of lowering the supply voltage with each new technology
generation has helped reduce the dynamic power consumption of CMOS
logic gates. Supply voltage scaling increases the gate delays unless the
threshold voltages of the devices are scaled down accordingly.
The most natural way of lowering the leakage power dissipation of a VLSI
circuit in the STANDBY state is to turn off its supply voltage. This can be
done by using one PMOS transistor and one NMOS transistor in series with
the transistors of each logic block to create a virtual ground and a virtual
power supply as depicted in Figure 13.21. Notice that in practice only one
transistor is necessary. Because of their lower on-resistance, NMOS
transistors are usually used.
In the ACTIVE state, the sleep transistor is on. Therefore, the circuit
functions as usual. In the STANDBY state, the transistor is turned off, which
disconnects the gate from the ground. Note that to lower the leakage, the
threshold voltage of the sleep transistor must be large. Otherwise, the sleep
transistor will have a high leakage current, which will make the power gating
less effective. Additional savings may be achieved if the width of the sleep
transistor is smaller than the combined width of the transistors in the pull-
down network. In practice, dual-$V_t$ CMOS or Multi-Threshold CMOS
(MTCMOS) is used for power gating [29]. In these technologies, there are
several types of transistors with different $V_t$ values. Transistors with a low
$V_t$ are used to implement the logic, while high-$V_t$ devices are used as sleep
transistors.
To guarantee the proper functionality of the circuit, the sleep transistor
has to be carefully sized to decrease its voltage drop while it is on. The
voltage drop in the sleep transistor decreases the effective supply voltage of
the logic gate. Also, it increases the threshold of the pull-down transistors
due to the body effect. This increases the high-to-low transition delay of the
circuit. Using a large sleep transistor can solve this problem. On the other
hand, using a large sleep transistor increases the area overhead and the
dynamic power consumed for turning the transistor on and off. Note that
because of this dynamic power consumption, it is not possible to save power
for short idle periods. There is a minimum duration of the idle time below
which power saving is impossible. Increasing the size of the sleep transistors
increases this minimum duration.
Since using one transistor for each logic gate results in a large area and
power overhead, a single transistor may be used for each group of gates, as
depicted in Figure 13.22.
The size of the sleep transistor in Figure 13.22 should be larger than the
one used in Figure 13.21. To find the optimum size of the sleep
transistor, it is necessary to find the vector that causes the worst case delay in
the circuit. This requires simulating the circuit under all possible input
values, a task that is not possible for large circuits.
In [29], the authors describe a method to decrease the size of sleep
transistors based on the mutual exclusion principle. In their method, they
first size the sleep transistors to achieve delay degradation less than a given
percentage for each gate. Notice that this guarantees that the total delay of
the circuit will be degraded by less than the given percentage. In fact the
actual degradation can be as much as 50% smaller. The reason for this is that
sleep transistors degrade only the high-to-low transitions and at each cycle
only half of the gates switch from high to low. Now the idea is that if two
gates switch at different times (i.e., their switching windows are non-
overlapping), then their corresponding sleep transistors can be shared.
Using mutual exclusion at the gate level is not practical for large circuits.
To handle large circuits, the mutual exclusion principle may be used at a
larger level of granularity. In this case, a single sleep transistor is used for
each module or logic block. The size of this sleep transistor is calculated
according to the number of logic gates and complexity of the block. Next the
sleep transistors for different blocks are combined as described before. This
method enables one to “hide” the details of the blocks; thus, large circuits can
be handled. However, in this case, the sizes of sleep transistors may be sub-
optimal.
Power gating is a very effective method for decreasing the leakage
power. However, it suffers from the following drawbacks:
One of the methods proposed for decreasing the leakage current is using
reverse-body bias to increase the threshold voltage of transistors in the
STANDBY state [30]. The threshold voltage of a transistor can be calculated
from the following standard expression:

$V_t = V_{t0} + \gamma\left(\sqrt{2\phi_F + V_{SB}} - \sqrt{2\phi_F}\right)$

where $V_{t0}$ is the zero-bias threshold voltage, $\gamma$ is the body-effect
coefficient, $2\phi_F$ is the surface potential at strong inversion, and $V_{SB}$ is the
source-to-body voltage. Increasing $V_{SB}$ through reverse-body biasing raises
$V_t$ and thereby reduces the subthreshold leakage.
The leakage current of a logic gate is a strong function of its input values.
The reason is that the input values affect the number of OFF transistors in
the NMOS and PMOS networks of a logic gate.
Table 13.2 shows the leakage current of a two-input NAND gate built in
a CMOS technology with a 0.2 V threshold voltage and a 1.5 V
supply voltage. Input A is the one closer to the output of the gate.
The minimum leakage current of the gate corresponds to the case when
both its inputs are zero. In this case, both NMOS transistors in the NMOS
network are off, while both PMOS transistors are on. The effective
resistance between the supply and the ground is the resistance of two OFF
NMOS transistors in series. This is the maximum possible resistance. If one
of the inputs is zero and the other is one, the effective resistance will be the
same as the resistance of one OFF NMOS transistor. This is clearly smaller
than the previous case. If both inputs are one, both NMOS transistors will be
on. On the other hand, the PMOS transistors will be off. The effective
resistance in this case is the resistance of two OFF PMOS transistors in
parallel. Clearly, this resistance is smaller than the other cases.
In the NAND gate of Table 13.2 the maximum leakage is about three
times higher than the minimum leakage. Note that there is a small difference
between the leakage current of the A=0, B=1 vector and the A=1, B=0
vector. The reasons are the difference in the size of the NMOS transistors
and the body effect. This data in fact illustrates the “stack effect,” i.e., the
phenomenon whereby the leakage current through a stack of two or more
OFF transistors is significantly smaller than that of a single OFF device.
Other logic gates exhibit a similar leakage current behavior with respect
to the applied input pattern. As a result, the leakage current of a circuit is a
strong function of its input values. Abdollahi et al. [35] use this fact to
reduce leakage current. They formulate the problem of finding the minimum
leakage vector (MLV) using a series of Boolean Satisfiability problems.
Using this vector to drive the circuit while in the STANDBY state, they
reduce the circuit leakage. It is possible to achieve a moderate reduction in
leakage using this technique, but the reduction is not as high as the one
achieved by the power gating method. On the other hand, the MLV method
does not suffer from many of the shortcomings of the other methods. In
particular,
1. No modification in the process technology is required.
2. No change in the internal logic gates of the circuit is necessary.
3. There is no reduction in voltage swing.
4. Technology scaling does not have a negative effect on its
effectiveness or its overhead. In fact the stack effect becomes
stronger with technology scaling as DIBL worsens.
The first three facts make it very easy to use this method in existing designs.
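For intuition, the MLV search can be illustrated by brute force on a toy
two-gate netlist; the per-state NAND leakage values are hypothetical, shaped
like Table 13.2 (an all-zero input stack leaks least). Abdollahi et al. use SAT
formulations precisely because such enumeration is infeasible for real
circuits.

from itertools import product

NAND_LEAK = {(0, 0): 10, (0, 1): 25, (1, 0): 22, (1, 1): 30}  # nA, hypothetical

def nand(a, b):
    return 1 - (a & b)

def circuit_leakage(x0, x1, x2):
    """Toy netlist: g1 = NAND(x0, x1); output gate = NAND(g1, x2)."""
    g1 = nand(x0, x1)
    return NAND_LEAK[(x0, x1)] + NAND_LEAK[(g1, x2)]

best = min(product((0, 1), repeat=3), key=lambda v: circuit_leakage(*v))
print("MLV:", best, "-> leakage:", circuit_leakage(*best), "nA")

Note that driving the first gate to its minimum-leakage state forces its
output to one, so the second gate cannot reach its own (0, 0) minimum—the
logic-dependency limitation that motivates the internal-signal control
discussed next.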
Further reduction in leakage may be achieved by modifying the internal
logic gates of a circuit. Note that due to logic dependencies of the internal
signals, driving a circuit with its MLV does not guarantee that the leakage
currents of all its logic gates are at minimum values. Therefore, when in the
STANDBY state, if, by some means, values of the internal signals are also
controlled, even higher leakage savings can be achieved. One way to control
the value of an internal signal (line) of a circuit is to replace the line with a
2-to-1 multiplexer [36]. The multiplexer is controlled by the SLEEP signal
whereas its data inputs are the incoming signal and either a ZERO or ONE
value decided by the leakage current minimization algorithm. The output is
the outgoing signal. Since one input of the multiplexer is a constant value,
the multiplexer can be replaced by an AND or an OR gate. Figure 13.24
shows a small circuit and its modified version where the internal signal line
can explicitly be controlled during the STANDBY state.
13.4 SUMMARY
ACKNOWLEDGEMENT
The authors would like to thank Afshin Abdollahi, Kihwan Choi, Chang-
woo Kang, Peng Rong, and Qing Wu for their contributions to this chapter.
REFERENCES
[1] Intel, Microsoft, Toshiba, Advanced configuration and power interface specification,
http://www.acpi.info/.
[2] IBM, "2.5-Inch Travelstar Hard Disk Drive," 1998.
[3] M. Srivastava, A. P. Chandrakasan, R. W. Brodersen, “Predictive system shutdown and
other architectural techniques for energy efficient programmable computation,” IEEE
Trans. on VLSI Systems, vol. 4, no. 1, pp. 42-55, 1996.
[4] C. Hwang, A. C.-H. Wu, “A predictive system shutdown method for energy saving of
event-driven computation,” Proc. International Conference on Computer-Aided Design
of Integrated Circuits and Systems, Vol. 16, pp. 28-32, November 1997.
[5] L. Benini, A. Bogliolo, G. A. Paleologo and G. De Micheli, “Policy optimization for
dynamic power management,” IEEE Trans. on Computer-Aided Design of Integrated
Circuits and Systems, Vol. 18, pp. 813-833, June 1999.
[6] E. Chung, L. Benini and G. De Micheli, “Dynamic power management for non
stationary service requests,” Proc. Design and Test in Europe Conference, March 1999,
pp. 77-81.
[7] Q. Qiu, Q. Wu and M. Pedram, “Stochastic modeling of a power-managed system-
construction and optimization,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, pp. 1200-1217, October 2001.
[8] T. Simunic, L. Benini, P. Glynn, and G. De Micheli, “Event-driven power management,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.
20, pp. 840-857, July 2001.
[9] Q. Wu, Q. Qiu, and M. Pedram, “Dynamic power management of complex systems
using generalized stochastic Petri nets,” Proc. Design Automation Conference, June
2000, pp. 352-356.
[10] Q. Qiu, Q. Wu and M. Pedram, “Dynamic power management in a mobile multimedia
system with guaranteed quality-of-service,” Proc. Design Automation Conference,
June 2001, pp. 834-839.
[11] M. Pedram and Q. Wu, "Design considerations for battery-power electronics," Proc.
Design Automation Conference, June 1999, pp. 861-866.
[12] Thomas F. Fuller, Marc Doyle and John Newman, "Relaxation phenomena in lithium-
ion-insertion cells," Journal of Electrochemical Society, Vol. 141, April 1994.
[13] P. Rong and M. Pedram, “Battery-aware power management based on CTMDPs,”
Technical Report, Department of Electrical Engineering, University of Southern
California, No. 02-06, May 2002.
[14] N. Raghavan, V. Akella and S. Bakshi, “Automatic insertion of gated clocks at register
transfer level,” Proc. 12th International Conference on VLSI Design, January 1999.
[15] M. Weiser, B. Welch, A. Demers, and S. Shenker, “Scheduling for reduced CPU
energy,” in Proc. First Symposium on Operating Systems Design Implementation, 1994,
pp. 13-23.
[16] K. Govil, E. Chan, and H. Wasserman, “Comparing algorithms for dynamic speed-
setting of a low power CPU,” Proc. First International Conference on Mobile
Computing Networking, 1995, pp. 13-25.
[17] A. Chandrakasan, V. Gutnik, and T. Xanthopoulos, “Data driven signal processing: an
approach to energy efficient computing,” Proc. International Symposium on Low Power
Electronics and Design, August 1996, pp.347-352.
[18] D. Shin, J. Kim, and S. Lee, “Low-energy intra-task voltage scheduling using static
timing analysis,” Proc. Design Automation Conference, June 2001, pp. 438-443.
[19] S. Lee and T. Sakurai, “Run-time power control scheme using software feedback loop
for low-power real-time applications,” Proc. Asia South-Pacific Design Automation
Conference, January 2000, pp. 381-386.
[20] B. Razavi, RF Microelectronics, Prentice Hall, 1997.
[21] O. Y-H Leung , C-W Yue , C-Y Tsui, R. S. Cheng, “Reducing power consumption of
turbo code decoder using adaptive iteration with variable supply voltage,” Proc.
International Symposium on Low Power Electronics and Design, August 1999, pp. 36-
41.
[22] F. Gilbert, A. Worm, N. When, “Low power implementation of a turbo-decoder on
programmable architectures,” Proc. Asia South-Pacific Design Automation Conference,
January 2001, pp. 400-403.
[23] T. Pering, T. Burd, and R. Brodersen, “The simulation and evaluation of dynamic
voltage scaling algorithms,” Proc. International Symposium on Low Power Electronics
and Design, August 1998, pp.76-81.
[24] K. Choi, K. Dantu and M. Pedram, “Frame-based dynamic voltage and frequency scaling
for a MPEG decoder,” Technical Report, Department of Electrical Engineering,
University of Southern California, No. 02-07, May 2002.
[25] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou,
“Precomputation-Based Sequential Logic Optimization for Low Power,” Proc.
International Conference on Computer-Aided Design, November 1994, pp. 74-81.
[26] C-F. Yeap, “Leakage current in low standby power and high performance devices: trends
and challenges,” Proc. International Symposium on Physical Design, April 2002, pp. 22-
27.
[27] Semiconductor Industry Association, International Technology Roadmap for
Semiconductors, 2001 edition, http://public.itrs.net/.
[28] B. Sheu, D. Scharfetter, P. Ko, and M. Jeng, "BSIM: Berkeley short-channel IGFET
model for MOS transistors," IEEE Journal of Solid State Circuits, Vol. 22, August 1987,
pp. 558-566.
[29] J. T. Kao, A. P. Chandrakasan, "Dual-threshold voltage techniques for low-power digital
circuits,” IEEE Journal of Solid-State Circuits, Vol. 35, July 2000, pp. 1009-1018.
[30] K. Seta, H. Hara, T. Kuroda, et al., “50% active-power saving without speed degradation
using standby power reduction (SPR) circuit,” IEEE International Solid-State Circuits
Conf., February 1995, pp. 318-319.
[31] S-M. Kang and Y. Leblebici, CMOS Digital Integrated Circuits, McGraw-Hill, second
edition, 1999.
[32] A. Keshavarzi, S. Narendra, S. Borkar, V. De, and K. Roy, “Technology scaling
behavior of optimum reverse body bias for standby leakage power reduction in CMOS
IC's,” Proc. International Symposium on Low Power Electronics and Design, August
1999, pp. 252-254.
[33] V. De and S. Borkar, “Low power and high performance design challenges in future
technologies,” Proc. the Great Lakes Symposium on VLSI, 2000, pp. 1-6.
[34] T. Kuroda, T. Fujita, F. Hatori, and T. Sakurai, “Variable threshold-voltage CMOS
technology,” IEICE Transactions on Fundamentals of Electronics, Communications and
Computer Sciences, vol. E83-C, November 2000, pp. 1705-1715.
[35] A. Abdollahi, F. Fallah, M. Pedram, “Minimizing leakage current in VLSI circuits,”
Technical Report, Department of Electrical Engineering, University of Southern
California, No. 02-08, May 2002.
[36] A. Abdollahi, F. Fallah, M. Pedram, “Runtime mechanisms for leakage current reduction
in CMOS VLSI circuits,” Proc. International Symposium on Low Power Electronics and
Design, August 2002.
Chapter 14
Tools and Methodologies for Power Sensitive Design
Jerry Frenkil
Sequence Design, Inc.
Key words: Low power design, power analysis, power estimation, power optimization,
computer aided design, power sensitive design, power modeling, power tools.
14.1 INTRODUCTION
13. It should be noted that power consumption and power dissipation are not synonymous. For
a more detailed discussion of the differences, please refer to the section on power
measurement types. All discussions of power in this chapter will refer to power
consumption. It should also be noted that static CMOS is the target semiconductor
technology and circuit topology for the calculations and tools described in this chapter.
Power tools can generally be classified along two axes: function and
abstraction level. Function refers to the expected operation of the tool, such
as analysis, modeling, or optimization, while abstraction level refers to the
level of detail present in the input design description.
Along the function axis, the most fundamental tool is the analysis tool.
This type of tool estimates or calculates the amount of power consumed by
the design of interest. An analysis tool may be used alone or it may be used
as the internal calculation engine for other types of power tools, such as
optimizers, modelers, or other derivative power tools. For example, an
optimizer takes a design and, while holding functionality and performance
constant, makes various transformations to reduce power. In most cases,
optimizers need an internal analysis engine in order to judge whether or not a
given transformation actually reduces power. Modelers utilize an analysis
engine internally in order to compute a circuit’s power characteristics to
produce a power model for a higher abstraction level. The fourth category is
derivatives. These types of tools target the effects of power on other
parameters, such as the current flow through power distribution networks or
the effects of power on circuit timing.
Each of these functions may be performed on different design
abstractions such as transistor, gate (or logic), register transfer (or RTL),
behavior, or system. Here abstraction refers specifically to the level at
which the design is described. For example, a netlist of interconnected
transistors is at the transistor level while a netlist of interconnected logic
primitives, such as NAND gates, flip-flops, and multiplexors is at the gate
level. A design at the RT-level is written in a hardware description language
such as Verilog or VHDL with the register storage explicitly specified, thus
functionality is defined per clock cycle. A behavioral description may also
be written in Verilog, VHDL, or even C, but in this case the abstraction is
“higher” as register storage is implied and functionality may be specified
across clock cycles, or without any reference to a clock cycle at all. The
highest abstraction level is the system level. At this level many of the details
are omitted but the functionality is described by the interrelationship of a
number of high-level blocks or functions.
At whatever abstraction level the tool operates, the data input requirements
are generally the same, although the forms of the data will vary. Equation
(14.3) shows that both technology-related information such as capacitances
and currents and environmental information such as voltages and activities
are required, in addition to the design itself. However, the form of the data,
and which data is input and which data is derived, varies according to how
the specific tools operate.
A given design will consume differing amounts of power depending upon its
environment. For example, a particular microprocessor running at 100 MHz
in one system will consume less power than the same microprocessor
running in a different system at 150 MHz.
Environmental data can be grouped into three major categories: voltage
supplies, activities, and external loads. Data in each of these categories must
be specified in order to accurately calculate a design’s power consumption.
Supply voltage is represented by the V term in equation (14.3) and is
usually specified as a global parameter. Some designs may utilize multiple
supply voltages, in which case a different value of V must be assigned to
each section as appropriate.
The capacitive loads of a design’s primary outputs, represented by the
capacitance term in equation (14.3), must also be specified. As the values for
these loads can be rather large, often in the range of 20 to 100 pF, these
capacitances can contribute a substantial amount to a design’s total power
consumption.
Activity data is represented by the f in equation (14.3). For transistor
level tools, the activities of the circuit’s primary inputs are specified as
waveforms with particular shapes, timings, and voltages. The activities of
intermediate nodes are derived by the tools as a function of the circuit’s
operation, with the power calculations being performed concurrently.
Higher-level tools also require the specification of the activities on the
primary inputs, but only the frequency or toggle counts are required.
However, in addition these tools require the activities on all the other nodes
in the design, and this data is usually generated by a separate logic
simulation of the design. For these higher-level power tools, the power is
calculated by post-processing the nodal activity data that was previously
generated. A typical activity-file record looks as follows:
Top.m1.u3 0.4758 15 14
Here, the first field contains the node name, the second field contains the
effective duty cycle, the third field contains the number of rising transitions,
and the fourth field contains the number of falling transitions.
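As a rough sketch of how a power tool might consume such a record, the following Python fragment converts the toggle counts into average switching power; the net capacitance, supply voltage, and simulation interval are assumed values, not data from the text above.
# Sketch: average switching power from one activity record, using
# E = 0.5 * C * Vdd^2 per transition. All parameters are assumed.
def record_power(line, c_net, vdd, t_sim):
    node, _duty, n_rise, n_fall = line.split()
    toggles = int(n_rise) + int(n_fall)
    energy = 0.5 * c_net * vdd ** 2 * toggles  # joules over t_sim
    return node, energy / t_sim                # watts

node, p = record_power("Top.m1.u3 0.4758 15 14",
                       c_net=20e-15, vdd=1.8, t_sim=1e-6)
print(node, p)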
The primary motivation for using activity data instead of .vcd data is that of
file size; .vcd files can easily require gigabytes of storage. On the other
hand, activity files are less useful in calculating instantaneous currents since
they do not maintain the temporal relationships between signals as is done
with .vcd data. While all common HDL simulators produce .vcd files
directly, few produce activity files directly, in which case the .vcd data must
be converted into activity data format either through the use of external
utilities or power tools’ internal conversion routines.
cell {
Name: INV
Function: ZN = !I
Pin { Name = I; Direction = in; Capacitance = cap in F }
Pin { Name = ZN; Direction = out; Capacitance = cap in F}
Power_Event { Condition = 01 I; Energy = energy in J }
Power_Event { Condition = 10 I; Energy = energy in J }
Power_state { Condition = 0 I; Power = power in W }
Power_state { Condition = 1 I; Power = power in W }
}
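A tool reading such a model might evaluate it roughly as follows; this Python sketch uses invented energy and power numbers in place of characterized library data.
# Sketch: evaluating an event/state power model like the INV cell
# above. Energy and power values are placeholders, not real data.
INV_MODEL = {
    "events": {("I", "01"): 5.0e-15,   # J per rising edge on I
               ("I", "10"): 4.0e-15},  # J per falling edge on I
    "states": {("I", "0"): 2.0e-9,     # W of static power, I low
               ("I", "1"): 1.5e-9},    # W of static power, I high
}

def cell_power(model, event_counts, state_times, t_total):
    # Dynamic energy from counted events plus static energy from
    # time spent in each state, averaged over the total interval.
    dyn = sum(model["events"][e] * n for e, n in event_counts.items())
    stat = sum(model["states"][s] * t for s, t in state_times.items())
    return (dyn + stat) / t_total      # average power in watts

p = cell_power(INV_MODEL,
               event_counts={("I", "01"): 120, ("I", "10"): 120},
               state_times={("I", "0"): 0.6e-6, ("I", "1"): 0.4e-6},
               t_total=1e-6)
print(p)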
Footnote 14: The load capacitance CL in its full composition is partly technology data and partly design data (except for the
case of a primary chip output, as described above, when the total load is offchip – in this
case CL is considered to be environmental data). The fanout capacitance for a
given net would be determined by the sum of the fanout input capacitances. The
fanout count comes from design data, while the amount of input capacitance per fanout is
considered to be technology data.
Power_Event { { Condition = 10 I }
  { Input_trans_time = 3.00e-02 4.00e-01 1.50e+00 3.00e+00 }
  { Output_cap = 3.50e-04 3.85e-02 1.47e-01 3.11e-01 }
  { Energy = 1.11e-02 1.92e-02 4.59e-02 8.23e-02
             7.31e-02 7.70e-02 9.60e-02 1.28e-01
             2.45e-01 2.48e-01 2.59e-01 2.82e-01
             5.06e-01 5.09e-01 5.16e-01 5.33e-01 }
}
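Between table points, tools typically interpolate. The Python sketch below performs a bilinear lookup in the 4x4 energy table above; the orientation (rows indexed by input transition time, columns by output capacitance) is an assumption on our part.
# Sketch: bilinear interpolation into the energy table above,
# assuming rows follow Input_trans_time and columns Output_cap.
import bisect

TRANS = [3.00e-02, 4.00e-01, 1.50e+00, 3.00e+00]
CAP = [3.50e-04, 3.85e-02, 1.47e-01, 3.11e-01]
ENERGY = [
    [1.11e-02, 1.92e-02, 4.59e-02, 8.23e-02],
    [7.31e-02, 7.70e-02, 9.60e-02, 1.28e-01],
    [2.45e-01, 2.48e-01, 2.59e-01, 2.82e-01],
    [5.06e-01, 5.09e-01, 5.16e-01, 5.33e-01],
]

def lookup(trans, cap):
    # Locate the bounding cell, clamping at the table edges.
    i = min(max(bisect.bisect_right(TRANS, trans) - 1, 0), len(TRANS) - 2)
    j = min(max(bisect.bisect_right(CAP, cap) - 1, 0), len(CAP) - 2)
    u = (trans - TRANS[i]) / (TRANS[i + 1] - TRANS[i])
    v = (cap - CAP[j]) / (CAP[j + 1] - CAP[j])
    return ((1 - u) * (1 - v) * ENERGY[i][j] + u * (1 - v) * ENERGY[i + 1][j]
            + (1 - u) * v * ENERGY[i][j + 1] + u * v * ENERGY[i + 1][j + 1])

print(lookup(1.0, 0.10))   # energy for an intermediate point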
CELL ND2X1 {
AREA = 9.98e+00;
PIN A { DIRECTION = input ; CAPACITANCE = 4.04e-03; }
PIN B { DIRECTION = input ; CAPACITANCE = 3.84e-03; }
PIN Y { DIRECTION = output ; CAPACITANCE = 0.00e+00; }
FUNCTION { BEHAVIOR { Y = (!(A&&B)); } }
}
}
VECTOR ( !A && !B ) { POWER = 3607.79 {UNIT = 1.0e-12;} }
VECTOR ( !A && B ) { POWER = 3643.09 {UNIT = 1.0e-12;} }
VECTOR ( A && !B ) { POWER = 9973.64 {UNIT = 1.0e-12;} }
VECTOR ( A && B ) { POWER = 1219.60 {UNIT = 1.0e-12;} }
}
Footnote 15: This model is complete in the sense that it models all significant dynamic and static power
consuming events. However, there are four usually insignificant non-zero dynamic power
consuming events that are not represented: rising and falling transitions on each of the two
inputs that do not result in a change on the output (because the other input is in the low state). Also
not shown are timing and noise data, which would be needed for timing and noise-margin
calculations and power vs. performance optimizations.
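Given state-dependent static power values like the VECTOR entries above, an analyzer can weight them by the probability of each input state. A minimal Python sketch follows, with assumed (not extracted) signal probabilities.
# Sketch: average static power from the VECTOR states above,
# weighted by assumed probabilities for inputs A and B.
LEAK_PW = {(0, 0): 3607.79, (0, 1): 3643.09,   # power values in pW
           (1, 0): 9973.64, (1, 1): 1219.60}   # (UNIT = 1.0e-12 W)

p_a, p_b = 0.5, 0.5                 # assumed P(A=1), P(B=1)
avg_pw = sum(LEAK_PW[(a, b)]
             * (p_a if a else 1 - p_a)
             * (p_b if b else 1 - p_b)
             for (a, b) in LEAK_PW)
print("average leakage ~ %.3f nW" % (avg_pw * 1e-3))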
RMS measurements are used when a single value is needed that describes
relatively long-term behavior while at the same time paying special attention
to the peak values. Such is the case for evaluating electromigration current
limits, which depend on the average value of the current as well as its
peaks. This is especially true for current flow in signal lines, which is bi-
directional and hence has a very small average current value [7].
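The distinction is easy to see numerically. In the Python sketch below, a hypothetical bi-directional signal-line current averages to nearly zero even though its RMS value, the quantity relevant to such electromigration limits, is substantial.
# Sketch: average vs. RMS of a bi-directional signal-line current.
# The sample values (in mA) are invented for illustration.
import math

samples = [0.8, -0.8, 0.6, -0.6, 0.0, 0.9, -0.9, 0.0]
avg = sum(samples) / len(samples)
rms = math.sqrt(sum(i * i for i in samples) / len(samples))
print("avg = %.3f mA, rms = %.3f mA" % (avg, rms))   # 0.000 vs ~0.673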
Transistor level tools are generally the most accurate and the most familiar
to IC designers. In fact, accuracy and the well-accepted abstraction are their
primary advantages. Nonetheless, these tools have significant issues in their
applicability to Power Sensitive Design: capacity and run-time
characteristics limit their use to moderately sized circuits, or to limited
numbers of simulation vectors for larger circuits. However, perhaps the biggest
limitation is that one must have a transistor level design to analyze before
these tools can be used; in other words, a design must be completed in large
part before these tools can be effective.
These tools are utilized primarily in two different use models. The first is
for characterizing circuit elements in order to create timing and power
models for use at the higher abstraction layers. The second is for verifying,
with the highest levels of accuracy, that the completed transistor level design
meets the targeted power specifications.
Transistor level analysis tools provide the bedrock for IC design and most IC
designers understand and rely upon them. There are two different classes of
transistor level analysis tools, generalized circuit simulators and switch level
simulators.
Generalized circuit simulators are used for many different purposes
including timing analysis for digital and analog circuits, transmission line
analysis, and circuit characterization. These types of tools are regarded as the
reference standard for all other analysis approaches, since they are designed to model
transistor behavior, at its most fundamental level, as accurately as possible and
are capable of producing any of the three types of power measurements. The
use of a circuit simulator for power analysis is simply one of its many
applications. The primary example of this type of tool is SPICE, of which
there are many variants [3].
Switch level simulators are constructed differently from circuit
simulators. Whereas the latter utilize many detailed equations to model
transistor behavior under a wide variety of conditions, switch level
simulators model each transistor as a non-ideal switch with certain electrical
properties. This modeling simplification results in substantial capacity and
run-time improvements over circuit simulators with only a slight decrease in
accuracy. For most digital circuits, this approach to electrical simulation at
the transistor level is very effective for designs too large to be handled by
SPICE [8]. The leading examples of switch-level power simulators are
NanoSim [9] and its predecessor PowerMill, from Synopsys.
Transistor-level power optimizers trade timing slack for lower power by
automatically resizing devices: transistors along non-critical paths are downsized
until the timing margin is used up, at which point other paths are considered
for optimization. After all paths have been considered for optimization, a
new transistor level netlist is produced containing resized transistors. Power
reductions in the range of 25% per module are possible compared to un-
optimized circuitry. An example of a transistor-level power optimizer is
AMPS from Synopsys [10].
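The following Python fragment sketches the greedy, slack-driven flavor of such an optimizer; it is a toy illustration with placeholder sensitivity numbers, not a description of the actual AMPS algorithm.
# Toy sketch of slack-driven downsizing. Each device off the
# critical path is shrunk while the remaining slack absorbs the
# added delay. 'dP_dw' and 'dD_dw' are assumed sensitivities.
def downsize(devices, slack, delta_w=0.25, w_min=0.5):
    for dev in sorted(devices, key=lambda d: -d["dP_dw"]):
        while dev["w"] > w_min:
            added_delay = dev["dD_dw"] * delta_w
            if added_delay > slack:
                break
            dev["w"] -= delta_w      # shrink: power drops, delay grows
            slack -= added_delay
    return devices, slack

devs = [{"w": 2.0, "dP_dw": 5.0, "dD_dw": 1.0},
        {"w": 1.0, "dP_dw": 2.0, "dD_dw": 3.0}]
print(downsize(devs, slack=2.0))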
One of the most important types of power tools is the power grid analyzer.
Such a tool analyzes a design’s power delivery network to determine how
much voltage drop occurs at different points on the network [13] and to
evaluate the network’s reliability in terms of sensitivity to electromigration.
A transistor-level power grid analyzer is composed of three key elements:
a transistor-level simulator (either a circuit or switch-level
simulator, as described above), a matrix solver, and a graphical interface.
The simulator portion is used to compute the circuit’s logical and electrical
response to the applied stimulus and to feed each transistor’s power sinking
or sourcing characteristics to the matrix solver. The matrix solver, in turn,
converts each transistor’s power (or equivalent current) characteristics into
nodal voltages along the power distribution network by calculating the
branch currents in each section of the power grid. These results are then
displayed in the GUI, usually as a color-coded geographic display of the
design.
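The solver step amounts to nodal analysis of the extracted rail resistances. Below is a minimal numpy sketch for a three-node rail fed from a single pad; segment resistances and tap currents are invented for illustration.
# Minimal sketch of the matrix-solver step: nodal analysis of a
# pad -- n1 -- n2 -- n3 rail, with per-node tap currents taken
# from the simulator. All values are illustrative.
import numpy as np

g = 1.0 / 0.05          # 50 mOhm per rail segment -> conductance
G = np.array([[2 * g, -g, 0.0],     # conductance matrix; the pad is
              [-g, 2 * g, -g],      # held at Vdd (zero IR drop)
              [0.0, -g, g]])
I = np.array([0.010, 0.015, 0.020])  # tap currents (A) drawn at nodes
Vdd = 1.8

v_drop = np.linalg.solve(G, I)       # IR drop below Vdd at each node
print(Vdd - v_drop)                  # node voltages along the rail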
The inputs to a power grid analyzer are the extracted transistor level
netlist along with an extracted RC network for the power and ground rails.
A stimulus is required to excite the circuit, although the form of the
excitation could be either static or dynamic. In the former case, the resulting
analysis would be a DC analysis. For a transient analysis of the power
distribution network, a dynamic stimulus is required.
The outputs of a power grid analyzer are a graphical display illustrating
the gradient of voltages along the power rails, and optionally, a graphical
display of current densities at different points on the rails. This latter display
is used to highlight electromigration violations, as electromigration
sensitivity is a function of current density.
RailMill from Synopsys is an example of such a tool operating at the
transistor level [14].
Gate-level optimization tools are also available. These tools usually take the
form of additional cost-function optimization routines within logic
synthesizers such that power consumption is optimized at the same time as
timing and area during the process of converting RTL descriptions to logic
gates. The primary advantage of these optimizers is that of push-button
automation – the tools automatically search for power saving opportunities
and then implement the changes without violating any timing constraints.
The amount of power reduced varies widely, ranging from a few percent to
more than 25% for the synthesized logic depending on many factors, such as
the end application and the degree to which the original design had been
optimized [18].
Several types of power-saving transformations can be performed at the
gate level and these transformations can generally be categorized into two
groups – those that alter the structure of the netlist and those that maintain it.
The former group includes such transformations as clock gating, operator
isolation, logic restructuring, and pin swapping. Of these, clock gating is
usually the most effective and most widely utilized transformation.
Identifying clock gating opportunities is relatively straightforward, as is
the transformation. The netlist is searched for registers that are configured to
recirculate data when not being loaded with new data. These registers have
a two-to-one multiplexor in front of the D input. These structures are
replaced, as shown in Figure 14.6, with a clock-gated register that is
clocked only when enabled to load new data into the register. The result is
often a substantial reduction in dynamic power in the clock network. (In
practice, the clock gating logic is usually more complicated than a simple
AND gate. The additional complexity arises out of two requirements: first, to
ensure that the clock gating logic will not glitch, and second, to make the
clock gating logic testable. For example, the latch used in the gated register
in Figure 14.6 serves to prevent the AND gate output from glitching.
Additional logic beyond that shown would be required to make this circuitry
testable.)
Note that simulation data is not required to identify clock gating
opportunities. However, depending on the actual activities, gating a clock
can actually cause power consumption to increase – this is the case when the
register being clock-gated switches states often. Nonetheless, the attraction
of gating clocks “blindly” is that the overall design flow is simplified, as
meaningful simulations often take many hours.
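The structural search itself can be sketched in a few lines. The following Python fragment scans a toy netlist encoding (invented for illustration) for the recirculating-mux pattern and reports the register, the mux, and the candidate enable signal.
# Sketch: find registers whose D input is fed by a mux that
# recirculates the register's own output. The netlist encoding
# and the mux pin order (select first) are assumptions.
netlist = {
    "u1": {"type": "MUX2", "ins": ["load_en", "new_data", "q5"], "out": "d5"},
    "u2": {"type": "DFF", "ins": ["clk", "d5"], "out": "q5"},
}

def clock_gating_candidates(nl):
    out = []
    for name, cell in nl.items():
        if cell["type"] != "DFF":
            continue
        d_net = cell["ins"][1]
        for mname, mux in nl.items():
            if (mux["type"] == "MUX2" and mux["out"] == d_net
                    and cell["out"] in mux["ins"]):
                out.append((name, mname, mux["ins"][0]))  # enable signal
    return out

print(clock_gating_candidates(netlist))  # [('u2', 'u1', 'load_en')]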
It is often desirable to reduce power without altering the netlist structure,
and two gate-level transformations are often employed to do this – cell re-
sizing and dual-Vt cell swapping. The former technique is employed to
reduce dynamic power and is similar to the sizing technique employed at the
transistor level – cells off of the critical timing path are downsized to the
extent that timing is not violated, using different drive strength cells from the
library. Dual-Vt cell swapping is a similar transformation in that positive
timing slack is traded off for reduced power by replacing cells that are not on
the critical path. In this case, the target is leakage power reduction and the
replacement cells have different timing characteristics by virtue of a
different threshold voltage implant as opposed to a different size or drive
strength. To utilize dual-Vt cell swapping, however, a second library is
required that is composed of cells identical to those in the original library
except for the utilization of a different, higher threshold voltage. Both of
these techniques are most effectively performed once actual routing
parasitics are known, while the logic restructuring techniques are best
performed pre-route or pre-placement.
Figure 14.7 illustrates the effects on the path delay distribution of trading
off slack path delays for reduced power, as achieved by either cell re-sizing
or dual-Vt cell swapping [19].
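A greedy version of the dual-Vt swap can be sketched as follows; the leakage savings and delay penalties are placeholder numbers, and a single shared slack budget stands in for the per-path slack bookkeeping a real optimizer performs.
# Toy sketch of slack-driven dual-Vt swapping: replace leaky
# low-Vt cells off the critical path with slower high-Vt versions,
# best savings-per-delay first, while slack absorbs the penalty.
def dual_vt_swap(cells, slack):
    order = sorted(cells, key=lambda c: -c["leak_save"] / c["delay_penalty"])
    saved = 0.0
    for c in order:
        if c["delay_penalty"] <= slack:
            slack -= c["delay_penalty"]
            saved += c["leak_save"]
            c["vt"] = "high"
    return saved, slack

cells = [{"leak_save": 4e-9, "delay_penalty": 20e-12},
         {"leak_save": 1e-9, "delay_penalty": 30e-12}]
print(dual_vt_swap(cells, slack=40e-12))   # (4e-09, 2e-11)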
Several gate-level power optimizers are commercially available:
PowerCompiler from Synopsys [20], the Low-power Synthesis Option for
BuildGates from Cadence [21], and Physical Studio from Sequence [22].
The first two work as add-on optimization routines to standard synthesizers
so that power consumption is optimized at the same time as timing and area
when the synthesizers are converting RTL descriptions to logic gates. The
latter functions as a post-place-and-route power optimizer utilizing extracted
wiring parasitics.
The most prevalent power grid analysis tool architecture relies upon the
computation of an average power value for each cell, based on a gate level
simulation as might be performed by a gate-level power analyzer. These
power values are then converted to DC current sources and are attached to
the power grid at the appropriate points per cell. This information, along
with the extracted resistive network, is fed to a matrix solver to compute DC
branch currents and node voltages. This type of analysis is referred to as a
static, or DC, analysis, since the currents used in the voltage analysis are
assumed to be time invariant in order to simplify the tool architecture and
analysis type. Mars-Rail from Avanti [24] and VoltageStorm-SoC [25] are
examples of gate-level static power grid analysis tools, although the latter
also incorporates some transistor-level analysis capabilities.
Power grid planning tools assist in the creation of the power grid before,
or during, placement and routing. The idea is to design and customize the
power grid to the specifics of the design’s power consumption characteristics
and floor plan. Power grid planners require as inputs an estimate of the
design’s power consumption broken down to some level of detail, a floor
plan for the design, and information about the resistive characteristics of the
manufacturing process routing layers and vias. The tool then produces a
route of the power and ground networks with estimates of the voltage
variations along the network. An example of this type of tool is Iota’s
PowerPlanner [26].
The third type of derivative gate-level power tool is a power-sensitive
delay calculator. Conventional delay calculators compute timing delays
based on the well-known factors of input transition time and output loading,
assuming no voltage variation along the power rails. However, voltage
variations do occur, in both space and time, and they affect signal timing. More
recent delay calculators, such as that found in ShowTime from Sequence
Design, incorporate the effects of localized power supply variations when
computing delays [27].
The register transfer level, or RTL, is the abstraction level at which much of
IC functional design is performed today. As such, analysis and optimization
at this level is especially important.
Two other considerations account for the significance of tools at this
level. The first is that this is the level at which designers think about
architecting their logic, and it is also at this point that many opportunities for
major power savings exist. Secondly, the design-analyze-modify loop is
much faster at this level than at the lower abstraction levels.
Footnote 17: An instance in this case is an operator, or several lines of code implementing an inferred
function, such as an adder or multiplier, decoder or multiplexor.
Similar to their brethren at the RT-level, behavior-level tools are used during
the design creation process. Modules or entire designs can be described at
this level. Two different motivations exist for describing designs at this
abstraction level.
The analysis and optimization of power at the system level involves the
global consideration of voltages and frequencies. For example, system-level
concerns about battery types and availability often dictate the voltages at
which portable devices must operate. Similarly, thermal and mechanical
issues in laptops often limit microprocessor operating frequencies to less
than their maximum operating capability. Accordingly, the proper analysis
of these and related concerns often has the highest impact on the overall
power characteristics and success of the target design.
Unfortunately, little in the way of design automation is available for
addressing these concerns. In fact, the most prevalent software tool used in
this arena is the spreadsheet.
Spreadsheets are fast, flexible, and generally well understood. In fact,
spreadsheets were adopted for chip-level power estimation prior to the
emergence of dedicated power analysis tools [30]. The capabilities that
made spreadsheets applicable to chip-level power analysis are also
applicable to system-level analysis – ease of use, modeling flexibility, and
customizability. Unfortunately, the disadvantages are also applicable – error
prone nature, wide accuracy variance, and manual interface.
Nonetheless, spreadsheets such as Microsoft’s Excel are used to model
entire systems. System components and sub-blocks are modeled with
customized equations using parameters such as supply voltage, operating
frequency, and effective switched capacitances. Technology data may or
may not be explicitly parameterized, but if it is, it is typically derived from
data-book information published by the component and technology vendors.
Spreadsheets are most often utilized at the very earliest stages of system
design, when rough cuts at power budgets and targets are needed and before
much detailed work has begun. As the design progresses and descends
through the various abstraction levels, and as more capable and automated
tools become applicable, spreadsheet usage typically wanes in the face of
more reliable and more automated calculations.
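A spreadsheet power model of this kind reduces to a handful of formulas. The Python sketch below mirrors one: each component is assigned an effective switched capacitance, a supply voltage, and a frequency (all values invented), and P = Ceff * Vdd^2 * f is summed across the system.
# Sketch of a spreadsheet-style system power estimate. All
# parameter values are hypothetical stand-ins for data-book data.
components = {
    # name: (effective switched capacitance F, Vdd V, frequency Hz)
    "cpu_core": (200e-12, 1.8, 200e6),
    "sram": (80e-12, 1.8, 200e6),
    "io_and_bus": (300e-12, 3.3, 50e6),
}

total = 0.0
for name, (ceff, vdd, f) in components.items():
    p = ceff * vdd * vdd * f
    total += p
    print("%-12s %7.2f mW" % (name, p * 1e3))
print("%-12s %7.2f mW" % ("total", total * 1e3))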
The conventional design view of power is that there are two primary design
activities, analysis and minimization, the latter often known as Low Power
Design. An example of this would be the design of the StrongARM
microprocessor, whose design target was low power (less than a watt) with
good performance [31].
But low power processors are not the only designs in which power is a
concern. Consider the 21264 Alpha microprocessor, which consumes 72 W
while being clocked at 600 MHz [32]. Designers of this device had to
consider many power-related, or power-sensitive, issues such as package
thermal characteristics, temperature calculations and thermal gradients,
power bus sizing, di/dt transient currents, and noise margin analysis [33] in
addition to the various power-savings techniques that prevented this machine
from consuming even more power.
Rolled together, the consideration of all these issues, power
minimization, and the analysis and management of those parameters affected
by power, constitutes Power Sensitive Design.
A power-sensitive design methodology confines iteration to efficient,
abstraction-specific loops. Thus the design that is fed forward to
the lower abstraction levels is much less likely to be fed back for reworking,
and the analysis performed at the lower levels becomes less of a design
effort and more of a verification effort. The key concept is to identify, as
early as possible, the design parameters and trade-offs that are required to
meet the project’s power specs. In this way, it is ensured that the design
being fed forward is fundamentally capable of achieving the power targets.
14.10 SUMMARY
REFERENCES
[1] D. Singh, et al., “Power conscious CAD tools and methodologies: a perspective,”
Proceedings of the IEEE, Apr. 1995, pp. 570-594.
[2] V. De, et al., “Techniques for leakage power reduction,” in A. Chandrakasan, et al.,
editors, Design of High Performance Microprocessor Circuits, IEEE Press, New York,
Chapter 3, 2001.
[3] Star-HSpice Data Sheet, Avanti Corporation, Fremont, CA, 2002.
[4] “Liberty user guide,” Synopsys, Inc. V1999.
[5] “Advanced library format for ASIC technology, cells, and blocks,” Accelera, V2.0,
December 2000.
[6] IEEE 1481 Standard for Delay & Power Calculation Language Reference Manual.
[7] J. Clement, “Electromigration reliability,” in A. Chandrakasan, et al., editors, Design of
High Performance Microprocessor Circuits, IEEE Press, New York, Chapter 20, 2001.
[8] A. Deng, “Power analysis for CMOS/BiCMOS circuits,” in Proceedings of the 1994
International Workshop on Low Power Design, Apr. 1994.
[9] NanoSim Data Sheet, Synopsys, Inc., Mountain View, CA, 2001.
[10] AMPS Data Sheet, Synopsys, Inc., Mountain View, CA, 1999.
[11] S. M. Kang, “Accurate simulation of power dissipation in VLSI Circuits,” in IEEE
Journal of Solid State Circuits, vol. 21, Oct. 1986, pp. 889-891.
[12] SiliconSmart CR Data Sheet, Silicon Metrics Corporation, Austin, TX, 2000.
[13] S. Panda, et al., “Model and analysis for combined package and on-chip power grid
simulation,” in Proceedings of the 2000 International Symposium on Low Power
Electronics and Design, Jul. 2000.
[14] RailMill Data Sheet, Synopsys, Inc., Mountain View, CA, 2000.
[15] B. George, et al., “Power analysis and characterization for semi-custom design,” in
Proceedings of the 1994 International Workshop on Low Power Design, Apr. 1994.
[16] PowerTheater Reference Manual, Sequence Design, Inc., Santa Clara, CA, 2001.
[17] PrimePower Data Sheet, Synopsys, Inc., Mountain View, CA, 2002.
[18] S. Iman and M. Pedram, “POSE: power optimization and synthesis environment,” in
Proceedings of the 33rd Design Automation Conference, Jun. 1996.
[19] S. Narendra, et al., “Scaling of stack effect and its application for leakage reduction,”
Proceedings of the 2001 International Symposium on Low Power Electronics and
Design, Aug. 2001.
[20] PowerCompiler Data Sheet, Synopsys, Inc., Mountain View, CA, 2001.
[21] Low-Power Synthesis Option for BuildGates and PKS Data Sheet, Cadence, Inc., San
Jose, CA, 2001.
[22] Physical Studio Reference Manual, Sequence Design, Inc., Santa Clara, CA, 2002.
[23] Synchronous SRAM Memory Core Family, TSMC 0.1 Process Datasheet, Virage
Logic, Fremont, CA, 2000.
[24] Mars-Rail Data Sheet, Avanti Corporation, Fremont, CA, 2002.
[25] VoltageStorm-SoC Data Sheet, Simplex Solutions, Inc, Sunnyvale, CA, 2001.
[26] PowerPlanner Data Sheet, Iota Technology, Inc., San Jose, CA. 2001.
[27] Showtime Reference Manual, Sequence Design, Inc., Santa Clara, CA, 2002.
[28] Orinoco Data Sheet, Offis Systems and Consulting GmbH, 2001.
[29] F. Catthoor, “Unified low-power design flow for data-dominated multi-media and
telecom applications,” Kluwer Academic Publishers, Boston, 2000.
[30] J. Frenkil, “Power dissipation of CMOS ASICs,” in Proceedings of the IEEE
International ASIC Conference, Sep. 1991.
[31] D. Dobberpuhl, “The design of a high performance low power microprocessor,” in
Proceedings of the 1996 International Symposium on Low Power Electronics and
Design, Aug. 1996.
[32] B. Gieseke, et al., “A 600MHz superscalar RISC Microprocessor with out-of-order
execution,” in ISSCC Digest of Technical Papers, Feb. 1997, pp. 176-177.
[33] P. Gronowski, et al., “High performance microprocessor design,” in IEEE Journal of
Solid State Circuits, vol. 33, May 1998, pp. 676-686.
[34] P. Landman, et al., “An integrated CAD Environment for low-power design,” in IEEE
Design and Test of Computers, vol. 13, Summer 1996, pp. 72-82.
Chapter 15
Reconfigurable Processors
Abstract: Energy considerations are at the heart of important paradigm shifts in next-
generation designs, especially in the systems-on-a-chip era. With multimedia and
communication functions becoming more and more prominent, coming up
with low-power solutions for these signal-processing applications is a clear
must. Harvard-style architectures, as used in traditional signal processors,
incur a significant overhead in power dissipation. It is therefore worthwhile to
explore novel and different architectures and to quantify their impact on
energy efficiency. Recently, reconfigurable programmable engines have
received a lot of attention. In this chapter, the opportunity for substantial
power reduction by using hybrid reconfigurable processors will be explored.
With the aid of an extensive example, it will be demonstrated that power
reductions of orders of magnitude are attainable.
15.1 INTRODUCTION
While most of the literature of the last decade has focused on power
dissipation, it is really minimization of the energy dissipation in the presence
of performance constraints that we are interested in. For real-time fixed-rate
applications such as DSP, energy and power metrics are freely
interchangeable as the rate is a fixed design constraint. In multi-task
computation, on the other hand, both energy and energy-delay metrics are of
interest.
While traditionally the voltages were fixed over a complete design, it is fair
to state that, increasingly, voltage can be considered a parameter that
can vary depending upon the location on the die and dynamically over time.
Many researchers have explored this in recent years, and the potential
benefits of varying supply voltages are too large to ignore. This is especially
the case in light of the increasing importance of leakage currents.
Matching the desired supply voltage to a task can be accomplished in
different ways. For a hardware module with a fixed functionality and
performance requirement, the preferred voltage can be set statically (e.g. by
choosing from a number of discrete voltages available on the die).
Computational resources that are subject to varying computational
requirements have to be enclosed in a dynamic voltage loop that regulates
the voltage (and the clock) based on the dialed performance level [4]. This
concept is called adaptive voltage scaling.
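The core of such a loop is simply choosing the lowest voltage whose achievable frequency still meets the dialed performance level. Here is a Python sketch using a textbook delay approximation; the constants are illustrative, not data for any real process.
# Sketch of a dynamic voltage loop: pick the lowest supply whose
# achievable frequency meets the requested performance level.
# Model and constants are assumed: f_max(V) ~ k * (V - Vt)^2 / V.
def min_voltage(f_required, vdd_levels, vt=0.4, k=2.0e9):
    for v in sorted(vdd_levels):
        f_max = k * (v - vt) ** 2 / v
        if f_max >= f_required:
            return v, f_max
    v = max(vdd_levels)
    return v, k * (v - vt) ** 2 / v

v, f = min_voltage(600e6, vdd_levels=[0.8, 1.0, 1.2, 1.5, 1.8])
print("run at %.1f V (supports %.0f MHz)" % (v, f / 1e6))
# Energy per operation scales roughly as V^2, so each step down
# in voltage pays off quadratically.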
In a typical programmable processor, only a small fraction of the energy is
spent on the real purpose of the design, i.e. computation. The rest is wasted
in overhead functions such as clock distribution, instruction fetching and
decoding, busing, caching, etc. Energy-efficient design should strive to make
this overhead as small as possible, which can be accomplished by sticking to
a number of basic guidelines (the low-energy roadmap):
Match architecture and computation to a maximum extent
Preserve locality and regularity inherent in the algorithm
Exploit signal statistics and data correlations
Energy (and performance) should only be delivered on demand, i.e.
an unused hardware module should consume no energy whatsoever.
This is most easily achievable in ASIC implementations, and it hence
comes as no surprise that dedicated custom implementations yield the best
solutions in terms of the traditional cost functions such as power, delay, and
area (PDA). Indeed, it is hard to beat a solution that is optimized to perform
solely a single well-defined task. However, rapid advances in portable
computing and communication devices require implementations that must
not only be highly energy-efficient, but they must also be flexible enough to
support a variety of multimedia services and communication capabilities.
With the dramatic increase in design complexity and mask cost, reuse of
components has become an essential requirement. The
required flexibility dictates the use of programmable processors in
implementing the increasingly sophisticated digital signal processing
algorithms that are widely used in portable multimedia terminals. However,
compared to custom, application-specific solutions, programmable
processors often incur stiff penalties in energy efficiency and performance.
It is our contention that adhering strictly to the low-energy
roadmap can lead to programmable architectures that consume dramatically
less power than the traditional programmable engines. Reconfigurable
architectures that program by restructuring the interconnections between
modules are especially attractive in that respect, especially because they
allow for obtaining an adequate match between computational and
architectural granularity.
It has been observed at numerous sites that this model is too confining and that
other programmable or configurable architectures should be considered as
well. This was inspired by the success of programmable logic (FPGA) to
implement a number of computationally intensive tasks at performance
levels or costs that were substantially better than what could be achieved
with traditional processors [5]. While intrinsically not very efficient, FPGAs
have the advantage that a computational problem can be directly mapped to
the underlying gate structure, hence avoiding the inherent overhead of fixed-
word length, fixed-instruction-set processors. Configurable logic represents
an alternative architecture model, where programming is performed at a
lower level of granularity.
A number of results in that respect have been reported. The most in-depth analysis of the
efficiency and application space of FPGAs for computational tasks was
reported by Andre Dehon [6], who derived an analytical model for area and
performance as a function of architecture parameters (such as data-path
width w, number of instructions stored per processing element c, number of
data words stored per processing element d), and application parameters
(such as word length and path length — the number of sequential
instructions required per task). Figure 15.2 plots one of the resulting
measures of the model, the efficiency: the ratio of the area cost of running an
application of a given word length on an architecture whose word length matches it,
versus running it on an architecture with word length w. As can be observed,
the processor excels at larger word lengths and path lengths, while the FPGA
is better suited for tasks with smaller word and path lengths.
Limiting the configurable architecture space to just those two
architectures has proven to be too restrictive and misses major opportunities
to produce dramatic reductions in the PDA space. Potential expansions can
go in a number of directions:
By changing the architecture word length w — sharing the
programming overhead over a number of bits. This increases the
PDA efficiency if the application word length matches the
architecture word length.
By changing the data storage d — this introduces the potential for
local buffering and storing data.
By changing the number of resources r — this makes it possible to
implement multiple operations on the PE by providing concurrent
units (programming in the space domain).
By changing the number of contexts c — this makes it possible to
implement multiple operations on the PE by time-multiplexing
(programming in the time domain).
By reducing the flexibility f, i.e. the set of operations that can be
performed on the processing element, thereby making it more dedicated
towards a certain task.
Definitions:
The flexibility index of a processing element (PE) is defined as the
ratio of the number of logical operations that can be performed on
the PE versus the total set of possible logical operations. PEs that
can perform all logical operations, such as general-purpose
processors and FPGAs, have a flexibility index equal to 1 (under the
condition that the instruction memory is large enough). Dedicated
units such as adders or multipliers have a flexibility close to 0, but
tend to score considerably better in the PDA space.
The remainder of this chapter will be devoted to the latter category. The
possible trade-offs will be discussed based on a comparison between some
emerging approaches. One architectural template, proposed in the Berkeley
Pleiades project, is discussed in more detail.
15.5.1 Concept
15.5.2 Architecture
Table 15.1 shows the energy profile of the VSELP speech coding
algorithm, running on Maia. Six kernels were mapped onto the satellite
processors. The rest of the algorithm is executed on the ARM8 control
processor. The control processor is also responsible for configuring the
satellite processors and the communication network. The energy overhead of
this configuration code running on the control processor is included in the
energy consumption values of the kernels. In other words, the energy values
listed in Table 15.1 for the kernels include contributions from the satellite
processors as well as from the control processor executing configuration code.
The power dissipation of Maia when running VSELP is 1.8 mW. The lowest
power dissipation reported in the literature to date is 17 mW for a
programmable signal processor executing the Texas Instruments
TMS320LC54x instruction set, implemented in a CMOS process and
running at 63 MHz with a 1.0-V supply voltage [12]. The energy efficiency
of Maia is roughly a factor of six better than that of this reference
processor.
In the low-energy roadmap, it was outlined that adjusting the supply voltage
to the computational requirements can lead to dramatic energy savings. The
distributed nature of the Pleiades architecture makes it naturally suited to
exploit some of the opportunities offered by dynamic voltage scaling. Most
of the co-processors perform a single task with a well-defined workload. For
these, a single operational voltage, carefully chosen to meet the required
performance, is sufficient.
15.7 SUMMARY
REFERENCES
[1] J. Borel, “Technologies for multimedia systems on a chip,” Proc. IEEE ISSCC
Conference 1997, pp. 18-21, San Francisco, February 1997.
[2] J. Rabaey and A. Sangiovanni-Vincentelli, “System-on-a-chip – a platform perspective,”
Keynote presentation, Proceedings Korean Semiconductor Conference, February 2002.
[3] T. Claessen, “First time right silicon, but... to the right specification,” Keynote Design
Automation Conference 2000, Los Angeles.
[4] T. Burd, T. Pering, A. Stratakos, and R. Brodersen, “A dynamic voltage scaled
microprocessor system,” IEEE ISSCC Dig. Tech. Papers, pp. 294-295, Feb. 2000.
[5] J. Villasenor and W. Mangione-Smith, “Configurable Computing,” Scientific American,
pp. 66-73, June 1997.
[6] A. DeHon, “Reconfigurable architectures for general purpose computing,” Technical
Report 1586, MIT Artificial Intelligence Laboratory, September 1996.
[7] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan
Kaufmann Publishers, San Mateo, 1990.
[8] Silicon after 2010, DARPA ISAT study group, August 1997.
[9] Virtex-II Pro Platform FPGAs,
http://www.xilinx.com/xlnx/xil_prodcat_landing_page.jsp?title=Virtex-II+Pro+FPGAs,
Xilinx, Inc.
[10] H. Zhang, V. Prabhu, V. George, M. Wan, M. Benes, A. Abnous, and J. Rabaey, “A 1 V
heterogeneous reconfigurable processor ic for baseband wireless applications,” Proc.
ISSCC, pp. 68-69, February 2000.
[11] S. Hauck et al, “Triptych - an FPGA architecture with integrated logic and routing,”
Proc. 1992 Brown/MIT Conference, pp 26-32, March 1992.
[12] A. Yeung and J. Rabaey, “A 2.4 GOPS data-driven reconfigurable multiprocessor IC for
DSP,” Proc. IEEE ISSCC Conference 1995, pp. 108-109, San Francisco, 1995.
[13] H. Zhang, V. George, J. Rabaey, “Low-swing on-chip signaling techniques:
effectiveness and robustness,” IEEE Transactions on VLSI Systems, vol. 8 (no.3), pp.
264-272, June 2000.
Chapter 16
Energy-Efficient System-Level Design
Abstract: The complexity of current and future integrated systems requires a paradigm
shift towards component-based design technologies that enable the integration
of large computational cores, memory hierarchies and communication
channels as well as system and application software onto a single chip.
Moving from a set of case studies, we give an overview of energy-efficient
system-level design, emphasizing a component-based approach.
16.1 INTRODUCTION
The Emotion Engine [4][5] was designed by Sony and Toshiba to support 3-
D graphics for the PlayStation 2 game console. From a functional viewpoint,
the design objective was to enable real-time synthesis of realistic animated
scenes in three dimensions. To achieve the desired degree of realism,
physical modeling of objects and their interactions, as well as 3-D geometry
transformations, are required. Power budget constraints are essentially set by
cost considerations: the shelf price of a game console should be lower than
US$ 500, thus ruling out expensive packaging and cooling. Furthermore,
game consoles should be characterized by the low cost of ownership,
robustness with respect to a wide variety of operating conditions, and
minimal maintenance. All of these requirements conflict with high power
dissipation. These challenges were met by following two fundamental design
guidelines: (i) integration of most of the critical communication, storage, and
computation on a single SoC, and (ii) architectural specialization for a
specific class of applications.
The architecture of the Emotion Engine is depicted in Figure 16.1. The
system integrates three independent processing cores and a few smaller I/O
controllers and specialized coprocessors. The main CPU, the master
controller, is a superscalar RISC processor with a floating-point
coprocessor. The other two cores are floating-point vector processing units.
The first vector unit, VPU0, performs physical modeling computations,
while the second, VPU1, performs geometry transformations; a dedicated path
and I/O port between VPU1 and the rendering engine carries its results. In contrast, VPU0
receives data from the CPU (as a coprocessor). For this reason data transferred
from/to the unit is staged in the CPU's scratch-pad memory and transferred to
VPU0 via DMA on a shared, 128-bit interconnection bus. The bus supports
transfers among the three main processors, the coprocessors, and I/O blocks
(e.g., for interfacing with high-bandwidth RDRAM).
The Emotion Engine was fabricated in a deep-submicron CMOS technology with a reduced
drawn gate length for improved switching speed. The CPU and the VPUs are
clocked at 250 MHz, and the external interfaces at 125 MHz. The chip
contains 10.5 million transistors and can sustain 5 GFLOPs, at a power
consumption of 15 W. Clearly, such a power consumption is not adequate for portable,
battery-operated equipment; however it is much lower than that of a general-
purpose microprocessor with similar FP performance (in the same
technology).
The energy efficiency of the Emotion Engine stems from several factors.
First, it contains many fast SRAM memories, providing adequate bandwidth
for localized data transfers without the high energy cost implied by cache
memories. Meanwhile, instruction and data caches have been kept
small, and it is up to the programmer to develop tight inner loops that
minimize misses. Second, the architecture provides an extremely aggressive
degree of parallelism without pushing the envelope for maximum clock
speed. Privileging parallelism with respect to sheer speed is a well-known
low-power design technique [6]. Third, parallelism is explicit in hardware
and software (the various CPUs have well-defined tasks), and it is not
compromised by centralized hardware structures that impose unacceptable
global communication overhead. The only global communication channel
(the on-chip system bus) is bypassed by dedicated ports for high-bandwidth
point-to-point communication (e.g., between VPU1 and the rendering
hardware). Finally, the SoC contains many specialized coprocessors for
common functions (e.g., MPEG2 video decoding), which unloads the
processors and achieves very high energy efficiency and locality.
Specialization is also fruitfully exploited in the micro-architecture of the
programmable processors, which natively support a large number of
application-specific instructions.
In contrast with the Emotion Engine, the MPEG4 video codec SoC described
by Takahashi et al. [7] has been developed specifically for the highly power-
constrained mobile communications market. Baseband processing for a
multimedia-enabled 3G wireless terminal encompasses several complex
signal-processing functions. Differences in clock speed and voltage supply
account for a difference in power consumption of, roughly, a factor of 2,
which becomes a factor of 4 if one discounts area (i.e., focuses on power
density). The residual 15-times difference is due to the
different transistor usage (the MPEG4 core is dominated by embedded
DRAM, which has low power density), and to architecture, circuit, and
technology optimizations. This straightforward comparison convincingly
demonstrates the impact of power-aware system design techniques and the
impressive flexibility of CMOS technology.
Digital audio is a large market where system cost constraints are extremely
tight. For this reason, several companies are actively pursuing single-chip
solutions based on embedded memory for the on-chip storage of sound
samples [8] [9]. The main challenges are the cost per unit area of
semiconductor memory, and the power dissipation of the chip, which should
be as low as possible to reduce the cost of batteries (e.g., primary Lithium
vs. rechargeable Li-Ion).
The single-chip voice recorder and player developed by Borgatti and
coauthors [10] stores recorded audio samples on embedded FLASH
memory. The chip was originally implemented in a mature CMOS technology with a
3.0 V supply, and it is a typical example of an SoC designed for a single
application. The main building blocks (Figure 16.3) are: a microcontroller
unit (MCU), a speech coder and decoder, and an embedded FLASH
memory. A distinguishing feature of the system is the use of a multi-level
storage scheme to increase the speech recording capacity of the FLASH.
Speech samples are first digitized then compressed with a simple waveform
coding technique (adaptive-differential pulse-code modulation) and finally
stored in FLASH memory, 4 bits per cell.
A 4-bit-per-cell density requires 16 different threshold levels for the FLASH
cells. Accurate threshold programming and readout requires mixed-signal
circuitry in the memory write and read paths. The embedded FLASH macro
contains 8 Mcells. It is divided into 128 sectors that can be independently
erased. Each sector contains 64-K cells, which can store 32 Kbytes in
multilevel mode. Memory read is performed through an 8-bit, two-step
analog-to-digital converter.
The energy cost per access represents the effort needed to fetch (or store) a data unit from (to) the
memory. The main objective of energy-efficient memory design is to
minimize the overall energy cost for accessing memory within performance
and memory size constraints. Hierarchical organizations reduce memory
power by exploiting non-uniformity (or locality) in access.
Consider the exploration of a cache design space along four parameters:
(i) cache size, in powers of two; (ii) cache line size, from 4 to 32 bytes, in
powers of two; (iii) associativity (1, 2, 4, and 8); and (iv) off-chip memory
size, from a 2 Mbit SRAM to a 16 Mbit SRAM.
The exhaustive exploration of the cache organization for minimum
energy for an MPEG decoding application results in an energy-optimal
cache organization with cache size 64 bytes, line size 4 bytes, 8-way set
associative. Notice that this is a very small memory, almost fully associative
(only two lines). For this organization, the execution time is 142,000 cycles.
In contrast, exploration for maximum performance yields a cache size of 512
bytes, a line size of 16 bytes, and 8-way set associativity. Notice that this
cache is substantially larger than the energy-optimal one. In this case, the
execution time is reduced to 121,000 cycles, but the total memory energy
increases.
One observes that the second cache dominates the first one for size, line
size, and associativity; hence, it has the larger hit rate. This is consistent
with the fact that performance strongly depends on miss rate. On the other
hand, if external memory access power is not too large with respect to cache
access (as in this case), some hit rate can be traded for decreased cache
energy. This justifies the fact that a small cache with a large miss rate is
more power-efficient than a large cache with a smaller miss rate.
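The exploration itself is a straightforward exhaustive sweep. The Python sketch below enumerates a design space of the kind described above; the miss-rate and energy functions are crude invented stand-ins for the trace-driven models a real exploration would use.
# Sketch: exhaustive cache exploration for minimum energy, using
# E = accesses * (E_cache + miss_rate * E_ext). Models are toys.
import itertools

ACCESSES, E_EXT = 100_000, 5e-9          # external access energy (J)

def miss_rate(size, line, assoc):        # invented behavioral model
    return min(1.0, 40.0 / (size * (1 + 0.1 * assoc) * line ** 0.3))

def e_cache(size, line, assoc):          # invented per-access energy
    return 0.1e-9 * (size / 64) ** 0.5 * (1 + 0.15 * assoc)

best = None
for size, line, assoc in itertools.product(
        [64, 128, 256, 512], [4, 8, 16, 32], [1, 2, 4, 8]):
    m = miss_rate(size, line, assoc)
    energy = ACCESSES * (e_cache(size, line, assoc) + m * E_EXT)
    if best is None or energy < best[0]:
        best = (energy, size, line, assoc)
print(best)   # energy-optimal configuration under this toy model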
The example shows that energy cannot generally be reduced as a
byproduct of performance optimization. On the other hand, architectural
solutions originally devised for performance optimization are often
beneficial in terms of energy. Generally, when locality of access is
improved, both performance and energy tend to improve. This fact is heavily
exploited in software optimization techniques.
Dedicated power models exist for the various memory structures; the model of Chakrabarti and co-authors [16] focuses on cache memories. Zyuban and Kogge [21]
study register files; Coumeri and Thomas [22] analyze embedded SRAMs;
Juan et al. [23] study translation look-aside buffers.
Example 16.1 has shown an instance of a typical design space and the
result of the relative exploration. An advantage of explorative techniques is
that they allow for concurrent evaluation of multiple cost functions such as
performance and area. The main limitation of the explorative approach is
that it requires extensive data collection, which provides a posteriori insight.
In order to limit the number of simulations, only a relatively small set of
architectures can be tested and compared.
Memory power can also be reduced by restructuring the hierarchy
levels. This option does not just imply the straightforward addition of extra
levels of caching.
A first class of techniques is based on the insertion of “ad-hoc” memories
between existing hierarchy levels. This approach is particularly useful for
instruction memory, where access locality is very high. Pre-decoded
instruction buffers [28] store instructions in critical loops in a pre-decoded
fashion, thereby decreasing both fetch and decode energy. Loop caches [29]
store the most frequently executed instructions (typically contained in small
loops) and can bypass even the first-level cache. Notice that these additional
memories would not be useful for performance if the first-level cache can be
accessed in a single cycle. On the contrary, performance can be slightly
worsened because the access time for the loop cache is on the critical path of
the memory system.
Another approach is based on the replacement of one or more levels of
caches with more energy-efficient memory structures. Such structures are
usually called scratch-pad buffers and are used to store a portion of the off-
chip memory, in an explicit fashion. In contrast with caches, reads and writes
to the scratch-pad memory are controlled explicitly by the programmer.
Clearly, allocation of data to the scratch pad should be driven by profiling
and statistics collection. These techniques are particularly effective in
application-specific systems, which run an application mix whose memory
profiles can be studied a priori, thus providing intuitive candidates for the
addresses to be put into the buffer. The work by Panda et al. [30][31] is
probably the most comprehensive effort in this area.
As technology improves and device sizes scale down, the energy spent on
processing and storage components decreases. On the other hand, the energy
for global communication does not scale down. On the contrary, projections
based on current delay optimization techniques for global wires [41] show
that global communication on chip will require increasingly higher energy
consumption.
The chip interconnect has to be considered and designed as an on-chip
network, called a micro-network [42]. As for general network design, a
layered abstraction of the micro-network (shown in Figure 16.6) can help us
analyze the design problems and find energy-efficient communication
solutions. Next, micro-network layers are considered in a bottom-up fashion.
First, the problems due to the physical propagation of signals on chip are
analyzed. Then general issues related to network architectures and control
protocols are considered. Protocols are considered independently from their
implementation, from the physical to the transport layers. The discussion of
higher-level layers is postponed until Section 5. Last, we close this section
by considering techniques for energy-efficient communication on micro-
networks.
Due to the limitations at the physical level and to the high bandwidth
requirement, it is likely that SoC design will use network architectures
similar to those used for multi-processors. Whereas shared medium (e.g.,
bus-based) communication dominates today's chip designs, scalability
reasons make it reasonable to believe that more general network topologies
will be used in the future. In this perspective, micro-network design entails
the specification of network architectures and control protocols [46]. The
architecture specifies the topology and physical organization of the
interconnection network, while the protocols specify how to use network
resources during system operation.
The data-link layer abstracts the physical layer as an unreliable digital
link, where the probability of bit errors is non-null (and increases as
technology scales down). Furthermore, reliability can be traded for energy
[45][47]. The main purpose of data-link protocols is to increase the
reliability of the link up to a minimum required level, under the assumption
that the physical layer by itself is not sufficiently reliable.
An additional source of errors is contention in shared-medium networks.
Contention resolution is fundamentally a non-deterministic process because
it requires synchronization of a distributed system, and for this reason it can
be seen as an additional noise source. In general, non-determinism can be
virtually eliminated at the price of some performance penalty. For instance,
centralized bus arbitration in a synchronous bus eliminates contention-
induced errors, at the price of a substantial performance penalty caused by
the slow bus clock and by bus request/release cycles.
Future high-performance shared-medium on-chip micro-networks may
evolve in the same direction as high-speed local area networks, where
contention for a shared communication channel can cause errors, because
two or more transmitters are allowed to send data on a shared medium
concurrently. In this case, provisions must be made for dealing with
contention-induced errors.
An effective way to deal with errors in communication is to packetize
data. If data is sent on an unreliable channel in packets, error containment
and recovery is easier because the effect of the errors is contained by packet
boundaries, and error recovery can be carried out on a packet-by-packet
basis. At the data-link layer, error correction can be achieved by applying
error-detecting or error-correcting codes on a packet basis.
Circuit switching and packet switching are just two extremes of a spectrum,
with many hybrid solutions in between [58].
16.6 SOFTWARE
Systems have several software layers running on top of the hardware. Both
system and application software programs are considered here.
Software does not consume energy per se, but it is the execution and
storage of software that requires energy consumption by the underlying
hardware. Software execution corresponds to performing operations on
hardware, as well as storing and transferring data. Thus software execution
involves power dissipation for computation, storage, and communication.
Moreover, storage of computer programs in semiconductor memories
requires energy (e.g., refresh of DRAMs, static power for SRAMs).
The energy budget for storing programs is typically small (with the
choice of appropriate components) and predictable at design time.
Nevertheless, reducing the size of the stored programs is beneficial. This can
be achieved by compilation (see Section 6.2.2) and code compression. In the
latter case, the compiled instruction stream is compressed before storage. At
run time, the instruction stream is decompressed on the fly. Besides reducing
the storage requirements, instruction compression reduces the data traffic
between memory and processor and the corresponding energy cost. (See also
Section 4.5.) Several approaches have been devised to reduce instruction
fetch-and-store overhead, as surveyed in [11]. The following subsections
focus mainly on system-level design techniques to reduce the power
consumption associated with the execution of software.
In other words, the system software must support the dynamic power
management (DPM) of its components as well as dynamic information-flow
management.
A basic assumption of DPM is that workloads are non-uniform over
time. Such an assumption is valid for most systems, both when considered in
isolation and when inter-networked. A second assumption of DPM is that it
is possible to predict, with a certain degree of confidence, the fluctuations of
workload. Workload observation and prediction should not consume
significant energy.
Designing power-managed systems encompasses several tasks, including
the selection of power-manageable components with appropriate
characteristics, determining the power management policy [35], and
implementing the policy at an appropriate level of system software. DPM
was described in a previous chapter; this chapter considers only the
relations between DPM policy implementation and system software.
A power management policy is an algorithm that observes requests and
states of one or more components and issues commands related to frequency
and voltage settings. In particular, the power manager can turn on/off the clock
and/or the power supply to a component. Whereas policies can be
implemented in hardware (as a part of the control-unit of a component),
software implementations achieve much greater flexibility and ease of
integration. Thus a policy can be seen as a program that is executed at run-
time by the system software.
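As a concrete illustration of a policy "as a program", the following Python sketch implements a fixed-timeout rule: the component is shut down once it has been idle longer than an assumed break-even time (the 0.5 s figure is hypothetical; real values come from component data).
# Sketch: fixed-timeout shutdown policy executed by system software.
import time

class TimeoutPolicy:
    def __init__(self, t_breakeven=0.5):
        self.t_be = t_breakeven
        self.last_request = time.monotonic()
        self.state = "on"

    def on_request(self):                  # called on component traffic
        self.last_request = time.monotonic()
        if self.state == "off":
            self.state = "on"              # wake-up (costs time/energy)

    def tick(self):                        # called periodically
        idle = time.monotonic() - self.last_request
        if self.state == "on" and idle > self.t_be:
            self.state = "off"             # issue shutdown command
        return self.state

policy = TimeoutPolicy()
policy.on_request()
print(policy.tick())                       # 'on' right after a request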
The simplest implementation of a policy is by a filter driver, i.e., by a
program attached to the software driver of a specific component. The driver
monitors the traffic to/from the component and has access to the component
state. Nevertheless, the driver has a limited view of other components. Thus
such an implementation of power management may suffer from excessive
locality.
Power management policies can be implemented in system kernels and
be tightly coupled to process management. Indeed, process management has
knowledge of currently-executing tasks and tasks coming up for execution.
Process managers also know which components (devices) are needed by
each task. Thus, policy implementation at this level of system software
enjoys both a global view and an outlook of the system operation in the near
future. Predictive component wake-up is possible with the knowledge of
upcoming tasks and required components.
The system software can be designed to improve the effectiveness of
power management. Power management exploits idle times of components.
The system software scheduler can sequence tasks for execution with the
additional goal of clustering component operation, thus achieving fewer but
longer idle periods. Experiments with implementing DPM policies at
different levels of system software [60] have shown increasing energy
savings as the policies have deeper interaction with the system software
functions.
The energy cost of executing a program depends on its machine code and on
the corresponding micro-architecture, if one excludes the intervention of the
operating system in the execution (e.g., swapping). Thus, for any given
micro-architecture, the energy cost is tied to the machine code.
There are two important problems of interest: software design and
software compilation. Software design affects energy consumption because
the style of the software source program (for any given function) affects the
energy cost. For example, the probability of swapping depends on
appropriate array dimensioning while considering the hardware storage
resources. As a second example, the use of specific constructs, such as
guarded instructions instead of branching constructs for the ARM
architecture [6], may significantly reduce the energy cost. Several efforts
have addressed the problem of automatically re-writing software programs to
increase their efficiency. Other efforts have addressed the generation of
Source-level transformations
Recently, several researchers have proposed source-to-source transformations
to improve software code quality and, in particular, to reduce energy
consumption.
Some transformations are directed toward using storage arrays more
efficiently [13][63]. Others exploit the notion of value locality. Value
locality is defined as the likelihood of a previously-seen value recurring
repeatedly within a physical or logical storage location [64]. With value
locality information, reusing previous computations can reduce the
computational cost of a program.
Researchers have shown that value locality can be exploited in various
ways depending on the target system architecture. In [65], common-case
specialization was proposed for hardware synthesis using loop unrolling and
algebraic reduction techniques. In [66][64], value prediction was proposed to
reduce load/store operations by modifying a general-purpose
microprocessor. Some authors [67] considered redundant computation, i.e.,
performing the same computation for the same operand value. Redundant
computation can be avoided by reusing results from a result cache.
Unfortunately, some of these techniques are architecture dependent, and thus
cannot be used within a general-purpose software synthesis utility.
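For instance, the result-cache idea of [67] can be sketched as a software
memoization wrapper around a pure function (the function name and cache
organization below are invented):

/* Hedged sketch of a software result cache: a recurring operand
 * value, as suggested by value locality, skips the computation. */
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SIZE 64

extern int32_t expensive_fn(int32_t x);     /* the pure computation */

static struct { int32_t arg, result; bool valid; } cache[CACHE_SIZE];

int32_t cached_fn(int32_t x)
{
    unsigned slot = (uint32_t)x % CACHE_SIZE;   /* direct-mapped */
    if (cache[slot].valid && cache[slot].arg == x)
        return cache[slot].result;              /* reuse: no work */
    cache[slot].arg = x;
    cache[slot].result = expensive_fn(x);
    cache[slot].valid = true;
    return cache[slot].result;
}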
Example 2 Consider the source code in Figure 16.7 (a), and the first call of
procedure foo in procedure main. If the first parameter a were 0 for all
cases, this procedure could be reduced to procedure sp_foo by partial
evaluation, as shown in Figure 16.7 (b).
In reality, the value of parameter a is not always 0, and the call to
procedure foo cannot be substituted by procedure sp_foo. Instead, it can be
replaced by a branching statement that selects an appropriate procedure
call, depending on the result of the common value detection (CVD). The
CVD procedure is named cvd_foo in Figure 16.7 (b). This transformation step
is called source code alteration. Its effectiveness depends on the
frequency with which a takes the common value 0.
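Since Figure 16.7 is not reproduced here, the following sketch only suggests
the shape of the transformation; the bodies of foo and sp_foo are invented
placeholders.

/* Hedged sketch of common value detection: sp_foo is foo partially
 * evaluated for a == 0, and cvd_foo dispatches on the common value.
 * Function bodies are invented placeholders. */
int foo(int a, int b)
{
    return a * b + b;            /* general case                    */
}

int sp_foo(int b)
{
    return b;                    /* foo specialized for a == 0      */
}

int cvd_foo(int a, int b)
{
    if (a == 0)                  /* common value detected           */
        return sp_foo(b);        /* cheaper specialized version     */
    return foo(a, b);            /* fall back to the general case   */
}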
Software libraries
Software engineers working on embedded systems often use software
libraries, such as those developed by standards groups (e.g., MPEG) or by
system companies (e.g., Intel's multimedia library for the SA-1110 and TI's
library for the TI'54x DSP). Embedded operating systems typically provide a
choice from a number of math and other libraries [69]. When a set of pre-
optimized libraries is available, the designer has to choose the elements that
perform best for a given section of the code. Such a manual optimization is
error-prone and should be replaced by automated library insertion techniques
that can be seen as part of software synthesis.
For example, consider a section of code that calls the log function. The
library may contain four different software implementations: double, float,
fixed point using a simple bit-manipulation algorithm [93][89], and fixed
point using polynomial expansion. Each implementation offers a different
trade-off among accuracy, performance, and energy.
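As a flavor of such an implementation, the sketch below computes a crude
fixed-point log2 for Q16.16 operands, combining bit manipulation for the
exponent with a short polynomial for the mantissa; the coefficients and
format are illustrative and not taken from any particular library.

/* Hedged sketch: crude fixed-point log2 for Q16.16 inputs (x > 0).
 * Accuracy is on the order of 1-2%; coefficients are illustrative. */
#include <stdint.h>

int32_t log2_q16(uint32_t x)
{
    int exp = 0;
    while (x >= 0x20000u) { x >>= 1; exp++; }   /* normalize to  */
    while (x <  0x10000u) { x <<= 1; exp--; }   /* [1.0, 2.0)    */
    int64_t t = (int64_t)x - 0x10000;           /* fraction, Q16 */
    /* log2(1+t) ~ t*(1.4427 + t*(-0.7213 + t*0.2776)), Horner form */
    int64_t p = (int64_t)(0.2776 * 65536);
    p = (int64_t)(-0.7213 * 65536) + ((p * t) >> 16);
    p = (int64_t)( 1.4427 * 65536) + ((p * t) >> 16);
    return (int32_t)(((int64_t)exp << 16) + ((p * t) >> 16));
}

A double-precision call would be far more accurate but, on a processor
without floating-point hardware, far more expensive in time and energy.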
Thus, automating the use of software libraries entails two major
tasks. First, characterize the library element implementations in terms of the
criteria of interest. This can be achieved by analyzing the corresponding
instruction flow for a given architecture. Second, recognize the sections of
code that can be replaced effectively by library elements.
In the case of computation-intensive basic blocks of data flow, code
manipulation techniques based on symbolic algebra have been shown to be
effective both in optimizing the computation by reshaping the data flow and
in automatically mapping it to library elements. Moreover, these
tasks can be fully automated. These methods are based on the premise that in
several application domains (e.g., multimedia) computation can be reduced
to the evaluation of polynomials with fixed-point precision. The loss in
accuracy is usually compensated by faster evaluation and lower energy
consumption. Such polynomials can then be manipulated algebraically using
symbolic techniques similar to those used by tools such as Maple.
Polynomial representations of computation can also be decomposed into
sequences of operations to be performed by software library elements or
elementary instructions. Such a decomposition can be driven by energy
and/or performance minimization goals. Recent experiments have shown
large energy gains on applications such as MP3 decoding [70].
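A toy illustration of such a manipulation (the polynomial is invented for
this purpose):

\[
p(x) \;=\; x^4 + 2x^2 + 1 \;=\; (x^2 + 1)^2 .
\]

A direct Horner evaluation of the expanded form costs four multiplications,
whereas the symbolically factored form costs two (one to form x^2, one to
square x^2 + 1) or, equivalently, two calls to a library squaring routine;
an energy-driven decomposition would select whichever variant is cheapest on
the target.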
Most software compilers consist of three layers: the front-end, the machine-
independent optimization, and the back-end. The front-end is responsible for
parsing and performing syntax and semantic analysis, as well as for
generating an intermediate form, which is the object of many machine-
independent optimizations [71]. The back-end is specific to the hardware
architecture, and it is often called code generator or codegen. Typically,
energy-efficient compilation is performed by introducing specific
transformations in the back-end, because they are directly related to the
underlying architecture. Nevertheless, some machine-independent
optimizations can be useful in general to reduce energy consumption [72].
An example is selective loop unrolling, which reduces the loop overhead but
is effective only if the loop body is small enough. Another example is software
pipelining, which decreases the number of stalls by fetching instructions
from different iterations. A third example is removing tail recursion, which
eliminates the stack overhead.
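For instance, tail-recursion removal (hedged sketch on an invented function)
rewrites a recursive accumulation as a loop, eliminating the per-call stack
traffic:

/* Hedged sketch of tail-recursion removal on an invented function:
 * the recursive form pays a stack frame per element; the rewritten
 * loop keeps the accumulator in a register. */
int sum_rec(const int *a, int n, int acc)
{
    if (n == 0)
        return acc;
    return sum_rec(a + 1, n - 1, acc + a[0]);   /* tail call */
}

int sum_loop(const int *a, int n)
{
    int acc = 0;
    while (n-- > 0)
        acc += *a++;             /* no per-iteration stack overhead */
    return acc;
}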
The main tasks of a code generator are instruction selection, register
allocation, and scheduling. Instruction selection is the task of choosing
instructions, each performing a fragment of the computation. Register
allocation is the task of allocating data to registers; when all registers are in
use, data is spilled to the main memory. Spills are usually undesirable
because of the performance and energy overhead of saving temporary
information in the main memory. Instruction scheduling is the task of
ordering instructions in a linear sequence. When considering compilation for general-
purpose microprocessors, instruction selection and register allocation are
often achieved by dynamic programming algorithms [71], which also
generate the order of the instructions. When considering compilers for
application-specific architectures (e.g., DSPs), the compiler back-end is
often more complex, because of irregular structures such as inhomogeneous
register sets and connections. As a result, instruction selection, register
allocation, and scheduling are intertwined problems that are much harder to
solve [73].
Energy-efficient compilation exploiting instruction selection was
proposed by Tiwari et al. [74] and tied to software analysis and the
determination of base energy costs for operations. Tiwari proposed an
instruction selection algorithm based on the classical dynamic-programming
tree covering [71], in which the instruction weights are the energy costs.
Experimental results showed that this algorithm yields results similar to
the traditional algorithm, because energy weights do not differ much in
practice.
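A skeletal view of this idea, with an invented two-pattern instruction set:
classical tree covering by dynamic programming, where each pattern carries
its measured energy cost instead of a cycle count.

/* Hedged sketch of dynamic-programming tree covering with energy
 * weights: each node of the expression tree picks the cover that
 * minimizes its own energy plus that of the uncovered subtrees.
 * The pattern set (ADD, MUL, MAC) and costs are invented. */
#include <stddef.h>

typedef struct node {
    char op;                     /* '+', '*', or 0 for a leaf */
    struct node *l, *r;
} node_t;

#define E_ADD  3                 /* hypothetical energy of ADD */
#define E_MUL  9                 /* hypothetical energy of MUL */
#define E_MAC 10                 /* MAC covers '+' over a '*'  */

static int min_i(int a, int b) { return a < b ? a : b; }

int best_energy(const node_t *n)
{
    if (n == NULL || n->op == 0)
        return 0;                /* operands are free here */
    int lhs = best_energy(n->l), rhs = best_energy(n->r);
    if (n->op == '*')
        return E_MUL + lhs + rhs;
    /* '+': either a plain ADD, or a MAC that also covers a '*'
     * child, skipping that child's own MUL cost. */
    int cost = E_ADD + lhs + rhs;
    if (n->r != NULL && n->r->op == '*')
        cost = min_i(cost, E_MAC + lhs +
                     best_energy(n->r->l) + best_energy(n->r->r));
    return cost;
}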
Instruction scheduling is an enumeration of the instructions consistent
with the partial order induced by data and control flow dependencies.
Such compiler optimizations have greater impact on SoCs because SoCs have
very heterogeneous memory architectures and often expose memory transfers
to the programmer, as outlined in the case studies (this is rarely done in
general-purpose processors).
The quest for very low software energy costs leads to the crafting and tuning
of very specific application programs. Thus, a reasonable question is: why
not let the application programs finely control the service levels and energy
cost of the underlying hardware components? There are typically two
objections to such an approach. First, application software should be
independent of the hardware platform for portability reasons. Second, system
software typically supports multiple tasks. When a task controls the
hardware, unfair resource utilization and deadlocks may become serious
problems.
For these reasons, it has been suggested [82] that application programs
contain system calls that request the system software to control a hardware
component, e.g., by turning it on or shutting it down, or by requesting a
specific frequency and/or voltage setting. The request can be accepted or
denied by the operating system, which has access to the task schedule
information and to the operating levels of the components. The advantage of
this approach is that OS-based power management receives detailed service
request information from the applications and is thus in a position to make
better decisions.
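An interface in the spirit of [82] might look roughly like the hypothetical
call below; names and semantics are invented, and the OS remains free to deny
the request based on its global view.

/* Hedged sketch of an application-level power request API; the
 * system call and its arguments are invented for illustration. */
typedef enum { PM_OFF, PM_SLEEP, PM_ON } pm_state_t;

/* Hypothetical system call: returns 0 if granted, -1 if denied. */
extern int pm_request(int device_id, pm_state_t state,
                      unsigned freq_khz, unsigned supply_mv);

void codec_task(void)
{
    /* Ask for a low-frequency, low-voltage operating point; the OS
     * may refuse if other tasks still need full performance. */
    if (pm_request(/*device_id=*/3, PM_ON, 50000, 900) != 0) {
        /* request denied: continue at the current setting */
    }
}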
Another approach is to let the compiler extract the power management
requests directly from the application programs at compile time, by
analyzing the code. Compiler-directed power management
has been investigated for variable-voltage, variable-speed systems. A
compiler can analyze the control-data flow graph of a program to find paths
where execution time is much shorter than the worst-case. It can then insert
voltage downscaling directives at the entry points of such paths, thereby
slowing down the processor (and saving energy) only when there is
sufficient slack [83].
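In outline, the inserted directive could look like the hypothetical fragment
below: the compiler proves that one path finishes well before the deadline
and scales the clock (and voltage) down on entry to it.

/* Hedged sketch of compiler-directed voltage scaling in the spirit
 * of [83]; set_speed() is a hypothetical directive inserted where
 * static analysis finds slack against the deadline. */
extern void set_speed(unsigned percent);     /* hypothetical knob */
extern void light_work(void), heavy_work(void);

void process(int easy_case)
{
    if (easy_case) {
        set_speed(50);           /* short path: ample slack, so    */
        light_work();            /* run at half speed and voltage  */
        set_speed(100);          /* restore before leaving         */
    } else {
        heavy_work();            /* worst-case path: full speed    */
    }
}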
16.7 SUMMARY
Digital systems with very low energy consumption require the use of
components that exploit all features of the underlying technologies (as
described in the previous chapters) and the realization of an effective
interconnection of such components. Network technologies will play a major
role in the design of future SoCs, as the communication among components
will be realized as a network on chip. Micro-network architectural choices
and control protocol design will be key to achieving high performance and
low energy consumption.
A large, perhaps dominant, share of the SoC design effort is spent writing
software, because the operation of programmable components can be
tailored to specific needs by means of embedded software. System software
must be designed to orchestrate the concurrent operation of the on-chip
components and the network. Dynamic power management and information-
flow management are implemented at the system software level, thus adding
to the complexity of its design. Eventually, application software design,
synthesis, and compilation will be crucial tasks in realizing low-energy
implementations.
Because of the key challenges presented in this book, SoC design
technologies will remain a central engineering problem, deserving
substantial human and financial resources for research and development.
REFERENCES
[1] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, M. Horowitz, “Smart Memories: a
modular reconfigurable architecture,” IEEE International Symposium on Computer
Architecture, pp. 161-171, June 2000.
[2] D. Patterson, et al., “A Case for intelligent RAM,” IEEE Micro, vol. 17, no. 2, pp. 34-44,
March-April 1997.
[3] A. Shubat, “Moving the market to embedded memory,” IEEE Design & Test of Computers,
vol. 18, no. 3, pp. 16-27, May-June 2001.
[4] M. Suzuoki et al., “A Microprocessor with a 128-bit CPU, Ten Floating-Point MACs,
Four Floating-Point Dividers, and an MPEG-2 Decoder,” IEEE Journal of Solid-State
Circuits, vol. 34, no. 11, pp. 1608-1618, Nov. 1999.
[5] A. Kunimatsu et al., “Vector Unit Architecture for Emotion Synthesis,” IEEE Micro, vol.
20, no. 2, pp. 40-47, March-April 2000.
[6] L. Benini, G. De Micheli, “System-Level Power Optimization: Techniques and Tools,”
ACM Transactions on Design Automation of Electronic Systems, vol. 5, no. 2, pp. 115-
192, April 2000.
[7] M. Takahashi et al., “A 60-MHz 240-mW MPEG-4 Videophone LSI with 16-Mb
embedded DRAM,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, pp. 1713-
1721, Nov. 2000.
[8] H. V. Tran et al., “A 2.5-V, 256-level nonvolatile analog storage device using EEPROM
technology,” IEEE International Solid-State Circuits Conference, pp. 270-271, Feb.
1996.
[9] G. Jackson et al., “An Analog Record, playback and processing system on a chip for
mobile communications devices,” IEEE Custom Integrated Circuits Conference, pp. 99-
102, San Diego, CA, May 1999.
[10] M. Borgatti et al., “A 64-Min Single-Chip Voice Recorder/Player Using Embedded 4-
b/cell FLASH Memory,” IEEE Journal of Solid-State Circuits, vol. 36, no. 3, pp. 516-
521, March 2001.
[11] A. Macii, L. Benini, M. Poncino, Memory Design Techniques for Low Energy Embedded
Systems, Kluwer, 2002.
[12] Gartner, Inc., Final 2000 Worldwide Semiconductor Market Share, 2000.
[13] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle,
Custom Memory Management Methodology: Exploration of Memory Organization for
Embedded Multimedia System Design, Kluwer, 1998.
[14] D. Lidsky, J. Rabaey, “Low-power design of memory intensive functions,” IEEE
Symposium on Low Power Electronics, San Diego, CA, pp. 16-17, September 1994.
[15] P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A.
Vandecappelle, P. G. Kjeldsberg, “Data and memory optimization techniques for
embedded systems,” ACM Transactions on Design Automation of Electronic Systems,
vol. 6, no. 2, pp. 149-206, April 2001.
[16] W. Shiue, C. Chakrabarti, “Memory exploration for low power, embedded systems,”
DAC-36: ACM/IEEE Design Automation Conference, pp. 140-145, June 1999.
[17] L. Su, A. Despain, “Cache design trade-offs for power and performance optimization: A
case study,” ACM/IEEE International Symposium on Low Power Design, pp. 63-68,
April 1995.
[18] M. Kamble, K. Ghose, “Analytical energy dissipation models for low-power caches,”
ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 143-
148, August 1997.
[19] U. Ko, P. Balsara, A. Nanda, “Energy optimization of multilevel cache architectures for
RISC and CISC processors,” IEEE Transactions on VLSI Systems, vol. 6, no. 2, pp. 299-
308, June 1998.
[20] R. Bahar, G. Albera, S. Manne, “Power and performance tradeoffs using various caching
strategies,” ACM/IEEE International Symposium on Low Power Electronics and Design,
pp. 64-69, Aug. 1998.
[21] V. Zyuban, P. Kogge, “The energy complexity of register files,” ACM/IEEE
International Symposium on Low Power Electronics and Design, pp. 305-310, Aug.
1998.
[22] S. Coumeri, D. Thomas, ”Memory modeling for system synthesis,” ACM/IEEE
International Symposium on Low Power Electronics and Design, pp. 179-184, Aug.
1998.
[23] T. Juan, T. Lang, J. Navarro, “Reducing TLB power requirements,” ACM/IEEE
International Symposium on Low Power Electronics and Design, pp. 196-201, August
1997.
[24] A. Farrahi, G. Tellez, M. Sarrafzadeh, “Memory segmentation to exploit sleep mode
operation,” ACM/IEEE Design Automation Conference, pp. 36-41, June 1995.
[25] A. González, C. Aliagas, M. Valero, “A Data-cache with multiple caching strategies tuned
to different types of locality,” ACM International Conference on Supercomputing, pp.
338-347, July 1995.
[64] M. Lipasti, C. Wilkerson, and J. Shen, “Value locality and load value prediction,”
ASPLOS, pp. 138-147, 1996.
[65] G. Lakshminarayana, A. Raghunathan, K. Khouri, N. Jha, and S. Dey, “Common-case
computation: a high-level technique for power and performance optimization,” Design
Automation Conference, pp. 56-61, 1999.
[66] K. Lepak and M. Lipasti, “On the value locality of store instructions,” ISCA, pp. 182-
191, 2000.
[67] S. E. Richardson, “Caching function results: faster arithmetic by avoiding unnecessary
computation,” Technical Report, Sun Microsystems Laboratories, 1992.
[68] E. Y. Chung, L. Benini and G. De Micheli, “Automatic source code specialization for
energy reduction,” ACM/IEEE International Symposium on Low Power Electronics and
Design (ISLPED), pp. 80-83, 2000.
[69] J. Crenshaw, Math Toolkit for Real-Time Programming, CMP Books, Kansas, 2000.
[70] A. Peymandoust and G. De Micheli, “Complex library mapping for embedded
software using symbolic algebra,” ACM/IEEE Design Automation Conference (DAC), 2002.
[71] A. Aho, R. Sethi, J. Ullman, Compilers: Principles, Techniques and Tools, Addison-
Wesley, 1988.
[72] H. Mehta, R. Owens, M. Irwin, R. Chen, D. Ghosh, “Techniques for low energy
software,” International Symposium on Low Power Electronics and Design, pp. 72-75,
Aug. 1997.
[73] G. Goossens, P. Paulin, J. Van Praet, D. Lanneer, W. Guerts, A. Kifli and C. Liem,
“Embedded software in real-time signal processing systems: design technologies,”
Proceedings of the IEEE, vol. 85, no. 3, pp. 436-454, March 1997.
[74] V. Tiwari, S. Malik, A. Wolfe, “Power analysis of embedded software: a first step
towards software power minimization,” IEEE Transactions on VLSI Systems, vol. 2,
no. 4, pp. 437-445, Dec. 1994.
[75] M. Lorenz, R. Leupers, P. Marwedel, T. Drager, G. Fettweis, “Low-energy DSP code
generation using a genetic algorithm,” IEEE International Conference on Computer
Design, pp. 431-437, Sept. 2001.
[76] V. Tiwari, S. Malik, A. Wolfe, M. Lee, “Instruction level power analysis and
optimization of software,” Journal of VLSI Signal Processing, vol. 13, no. 1-2, pp. 223-
233, 1996.
[77] C. Su, C. Tsui, A. Despain, “Saving power in the control path of embedded processors,”
IEEE Design and Test of Computers, vol. 11, no. 4, pp. 24-30, Winter 1994.
[78] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994.
[79] M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley,
1996.
[80] M. Kandemir, N. Vijaykrishnan, M. Irwin, W. Ye, “Influence of compiler optimizations
on system power,” IEEE Transactions on VLSI Systems, vol. 9, no. 6, pp. 801-804, Dec.
2001.
[81] H. Kim, M. Irwin, N. Vijaykrishnan, M. Kandemir, “Effect of compiler optimizations on
memory energy,” IEEE Workshop on Signal Processing Systems, pp. 663-672, 2000.
[82] Y. Lu, L. Benini and G. De Micheli, “Requester-aware power reduction,”
International Symposium on System Synthesis (ISSS), pp. 18-23, 2000.
[83] D. Shin, J. Kim, “A profile-based energy-efficient intra-task voltage scheduling
algorithm for hard real-time applications,” IEEE International Symposium on Low-
Power Electronics and Design, pp. 271-274, Aug. 2001.
[84] S. Chirokoff and C. Consel, “Combining program and data specialization,” ACM
SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation
(PEPM '99), pp. 45-59, San Antonio, Texas, January 1999.
[85] D. Ditzel, “Transmeta's Crusoe: Cool chips for mobile computing,” Hot Chips
Symposium.
[86] R. Ho, K. Mai, M. Horowitz, “The future of wires,” Proceedings of the IEEE, January
2001.
[87] K. Lahiri, A. Raghunathan, G. Lakshminarayana, S. Dey, “Communication architecture
tuners: a methodology for the design of high-performance communication architectures
for systems-on-chip,” IEEE/ACM Design Automation Conference, pp. 513-518, 2000.
[88] H. Mehta, R. M. Owens, M. J. Irwin, “Some issues in gray code addressing,” Great
Lakes Symposium on VLSI, pp. 178-180, March 1996.
[89] Red Hat, Linux-ARM Math Library Reference Manual.
[90] T. Theis, “The future of interconnection technology,” IBM Journal of Research and
Development, vol. 44, no. 3, pp. 379-390, May 2000.
[91] A. Wolfe, “Issues for low-power CAD tools: a system-level design study,” Design
Automation for Embedded Systems, vol. 1, no. 4, pp. 315-332, 1996.
[92] International Technology Roadmap for Semiconductors, http://public.itrs.net/
[93] Cygnus Solutions, eCos Reference Manual, 1999.
[94] D. Bertsekas, R. Gallager, Data Networks, Prentice Hall, 1991.
[95] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor,” IEEE
Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
Index
adaptive forward error correction, 335
adaptive power-supply regulation, 202, 215, 218, 228, 232, 237
ADC, 121, 125, 128, 133, 138, 145, 148
adjustable radio modulation, 335
ADPCM, 190, 192, 193, 194, 483
algorithm, 118
    A/D conversion, 126
    beamforming, 362
    block-formation, 289
    data processing, 339
    dynamic programming, 508
    FIR filtering, 344
    greedy, 381
    instruction selection, 508
    leakage current minimization, 407
    local routing, 260
    network control, 494
    non-adaptive, 377
    power-reduction, 304
    routing, 494
    scheduling, 305, 393
    simple bit manipulation, 506
    speech coding, 468
    static, 337
    Viterbi, 352
application software, 476, 498, 503, 510
architectural optimizations, 485
architecture
    agile, 460
    hardware, 181, 285, 301, 338, 343, 364, 453, 455
    reconfigurable, 470
    software, 285, 377
bandwidth optimization, 490, 491
battery, 1, 3, 8, 31, 53, 94, 151, 273, 293, 298, 335, 368, 370, 386, 392, 410, 427, 442, 446, 474, 499
battery-operated, 297
beamforming, 362, 365
behavior-level, 441, 442
bit-line capacitance, 63, 75, 77
bit-width analysis, 181, 187, 193, 198
body bias, 13, 24, 35, 44, 48, 95, 401, 405, 406, 411
body effect, 38, 65, 402, 407
body factor, 130
Boltzmann distribution, 16, 26
bus, 33, 52, 76, 186, 238, 444, 461, 479, 481, 493, 495, 502, 509, 514
characteristic distance, 361
charge pump circuits, 54, 56, 65, 67
charge sharing, 110, 132
chip multiprocessor, 480
clock buffer, 151, 153, 157, 171, 174, 177
clock data recovery, 201
clock gating, 110, 151, 155, 159, 167, 172, 373, 388, 392, 396, 433, 440, 442, 481
clock network, 153, 156, 160, 164, 167, 171, 174, 177, 433
clock synthesis, 212
clock tree, 116, 151, 154, 160, 162, 173, 177, 389, 439
clustering, 280, 294, 361, 371, 485, 501
CMOS
    circuit, 10, 305, 345
    gate, 2, 202, 409
    NAND, 27, 28, 418, 423, 432, 475
    NOR, 159, 475, 484
    scaling, 201
    technology, 2, 10, 15
CMOS technology
    projection, 13
combinational logic, 163, 174
compiler, 181, 189, 285, 289, 291, 295, 435, 508, 515
computation accuracy, 181, 187, 197
conditional flip-flop, 91, 110
constraint, 43, 227, 272, 277, 314, 350, 352, 363, 380, 393, 395, 409, 454
    area, 201
    average data rate, 313
    average wait time, 384
    energy, 332, 393
    latency, 364
    memory size, 486
    performance, 197, 242, 262, 272, 313, 379, 385, 454
    power, 3, 5, 21, 26, 31, 34, 43, 309
    quality of service, 197, 198, 337, 393
    re-use of hardware and software components, 242
    stability, 224, 226
    system cost, 482
    timing, 3, 114, 242, 243, 311, 428
content-addressable memory, 259, 260
control protocols, 491, 493
cross talk, 209, 211, 215, 218
data aggregation, 335, 337, 362, 363
data-link layer, 493
datapath width adjustment, 182, 190
D-cache, 260, 265, 266, 489
delay-locked loop, 201, 212, 237
derivative gate-level tools, 436
design
    methodologies, 3, 278, 474
    platform-based, 7, 243, 273, 451, 463
    tools, 3, 414, 438
design automation, 413, 414, 421, 443
discrete doping effect, 33, 45
DLL, 212, 223, 224, 227, 229, 232, 238
double edge-triggered, 151, 154, 157
double-edge triggered flip-flops, 154, 164, 174
DPM. See dynamic power management
DRAM, 24, 33, 44, 51, 74, 75, 77, 82, 88, 89, 118, 480, 481
    cell, 74
    indirect band-to-band tunnelling, 24
    low voltage operation, 83
    retention time, 44
dynamic datapath-width adjustment, 181, 187
dynamic power management, 299, 305, 310, 313, 332, 373, 377, 409, 410, 500
electromigration, 15, 448
embedded DRAM, 51, 82, 91, 117, 481, 511
embedded systems, 5, 184, 188, 196, 242, 274, 298, 302, 320, 332, 476, 488, 503, 505
energy band, 23
energy dissipation, 269, 271, 277, 335, 359, 361, 363, 365, 382, 384, 392, 426
energy estimation, 277, 285, 293
energy minimization, 186, 509
energy scalability, 335, 338, 343, 354, 369
energy-aware computing, 241, 274
energy-aware packet scheduling, 306, 310, 312
energy-efficient, 201, 204, 210, 214, 222, 228, 235, 238, 261, 303, 318, 330, 337, 346, 353, 358, 363, 473, 484, 488, 494, 500, 510, 515
energy-quality scalability, 335
ensemble of point systems, 346
environmental data, 421
FDMA, 355, 356, 359
feedback, 35, 66, 143, 159, 174, 177, 221, 230, 368, 393, 411, 444
Fermi potential, 405
finite state machine, 270
fixed time-out, 378
flash, 51, 53, 61, 67, 71, 86, 124, 127, 133, 146, 149, 339, 367
flash memory
    cell operation, 55
    NAND, 53, 59, 67
    NOR, 53, 58, 66, 71
    NOR flash memory, 66
flat-band voltage, 131
flip-flop
    dynamic power, 174
frame length adaptation, 316, 317
frequency scaling, 500
gate insulator, 11, 15, 18, 21, 23, 30, 32, 39, 43, 45
gate-level tools, 431, 438, 439
generator matrix, 380
greedy, 378
Hamming distance, 350, 509
hardware, 281
    profiling, 269
hazard, 389