Low Power VLSI Design
UNIT I
Low Power VLSI Design
By
Dr. R. Nakkeeran
Associate Professor
Low Power Design –
An Emerging Discipline
[Figure: Moore's 1965 plot of log2 of the number of components per integrated function versus year, 1959 to 1975. Source: Electronics, April 19, 1965]
[Figure annotation: 234M transistors in a die size of 221 mm2]
Evolution in Complexity
[Figure: Transistor counts (K transistors, log scale from 1 to 1 billion) versus year, 1975 to 2010 (projected), for the 8086, 80286, i386, i486, Pentium, Pentium Pro, Pentium II and Pentium III. Source: Intel]
Courtesy, Intel
Moore’s law in Microprocessors
[Figure: Transistor count (log scale, 0.001 to 1000) versus year, 1970 to 2010, for the 4004 through the P6]
Transistors on Lead Microprocessors double every 2 years
Courtesy, Intel
Die Size Growth
[Figure: Die size (mm, log scale) versus year, 1970 to 2010, for the 4004 through the P6]
• ~7% growth per year
• ~2X growth in 10 years
Courtesy, Intel
Frequency
[Figure: Frequency (MHz, log scale) versus year, 1970 to 2010, for the 4004 through the P6; annotation: doubles every 2 years]
Lead Microprocessors frequency doubles every 2 years
Courtesy, Intel
Power Dissipation
[Figure: Power (Watts, log scale) versus year, 1971 to 2000, for the 4004 through the P6]
Courtesy, Intel
Power will be a major problem
[Figure: Power (Watts, log scale) versus year, 1971 to 2008, for the 4004 through the Pentium, with projected points at 500 W, 1.5 kW, 5 kW and 18 kW]
Courtesy, Intel
Power density
[Figure: Power density (W/cm2, log scale) versus year, 1970 to 2010, for the 4004 through the P6, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle and the Sun's surface]
Courtesy, Intel
Not Only Microprocessors
[Figure: Cell phone power breakdown: small-signal RF, power RF, and digital baseband (DSP + MCU). Data from Texas Instruments]
Important (Wireless)
Technology Trends
“Spectral Efficiency”:
More bits/m3
Rapidly increasing
transistor density
Rapidly declining
system cost
Important (Wireless)
Technology Trends
The battery technology alone will not solve the low power problem
Basics
• Power supply provides energy for charging and discharging wires and
transistor gates. The energy supplied is stored and then dissipated as heat.
• Power is the rate of work being done with respect to time, i.e., the rate of energy being used:
$P = dw/dt$, or $P = E/t$. Unit: Watts = Joules/second
• If a differential amount of charge $dq$ is given a differential increase in energy
$dw$, the potential of the charge is increased by $V = dw/dq$
• By definition of current: $I = dq/dt$
• $dw/dt = (dw/dq)(dq/dt) = P = V I$. A very practical formulation!
• Total energy: $w = \int^{t} P \, dt$
Basics
• Warning! In everyday language, the term
“power” is used incorrectly in place of “energy”
• Power is not energy
• Power is not something you can run out of
• Power can not be lost or used up
• It is not a thing, it is merely a rate
• It can not be put into a battery any more than
velocity can be put in the gas tank of a car
This is how electric tea pots work ...
[Circuit: a 1 V source across a 1 Ohm resistor]
20 W rating: Maximum power
the package is able to
transfer to the air. Exceed
rating and resistor burns.
Cooling an iPod nano ...
Like a resistor, iPod relies on passive
transfer of heat from case to the
air
If iPod nano used 5W all the time, its battery would last 15 minutes ...
Powering an iPod nano (2005 edition)
Battery has 1.2 W-hour rating: can supply 1.2 W of power for 1 hour
1.2 W-hour / 5 W = 0.24 hour, about 15 minutes
85 mW for music
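The battery arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `battery_life_hours` is made up for illustration, and it assumes a constant load and an ideal battery that delivers its full rated energy.

```python
def battery_life_hours(capacity_wh: float, load_w: float) -> float:
    """Hours an ideal battery of the given watt-hour rating lasts at a constant load."""
    if load_w <= 0:
        raise ValueError("load must be positive")
    return capacity_wh / load_w


CAPACITY_WH = 1.2  # rating quoted on the slide

# The two loads quoted on the slide: a hypothetical 5 W and 85 mW for music.
for label, load_w in [("5 W all the time", 5.0), ("85 mW for music", 0.085)]:
    hours = battery_life_hours(CAPACITY_WH, load_w)
    print(f"{label}: {hours:.2f} h ({hours * 60:.0f} min)")
```

Energy (watt-hours) divided by power (watts) gives time, which is why the rating is an energy figure and not a power figure.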
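Switching energy dissipated while a capacitance $c$ is charged from 0 to $V_{dd}$ between $t_0$ and $t_1$:
$E_{sw} = \int_{t_0}^{t_1} P(t)\,dt = \int_{t_0}^{t_1} (V_{dd} - v)\, i(t)\,dt = \int_{t_0}^{t_1} (V_{dd} - v)\, c\, \frac{dv}{dt}\,dt$
$= c V_{dd} \int_{0}^{V_{dd}} dv - c \int_{0}^{V_{dd}} v\,dv = c V_{dd}^2 - \frac{1}{2} c V_{dd}^2 = \frac{1}{2} c V_{dd}^2$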
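The switching-energy integral can be checked numerically. The sketch below assumes the pull-up path behaves like a plain resistor `r` (an idealization that is not on the slide) and integrates the charging of a capacitor with forward Euler steps; all names and component values are illustrative.

```python
def charging_energies(c=1e-12, vdd=1.0, r=1e3, steps=200_000, n_tau=20.0):
    """Charge capacitor c from 0 V toward vdd through resistor r.

    Returns (energy drawn from the supply, energy dissipated in r,
    energy stored on c), all in joules, using forward Euler integration
    over n_tau RC time constants.
    """
    dt = n_tau * r * c / steps
    v = 0.0
    e_supply = 0.0
    e_dissipated = 0.0
    for _ in range(steps):
        i = (vdd - v) / r                    # current through the pull-up path
        e_supply += vdd * i * dt             # energy delivered by the supply
        e_dissipated += (vdd - v) * i * dt   # the (Vdd - v) * i term of the integral
        v += i * dt / c                      # capacitor voltage update
    e_stored = 0.5 * c * v * v
    return e_supply, e_dissipated, e_stored


e_supply, e_dissipated, e_stored = charging_energies()
print(f"supply {e_supply:.3e} J, dissipated {e_dissipated:.3e} J, stored {e_stored:.3e} J")
```

Under these assumptions the dissipated energy comes out close to c Vdd^2 / 2 regardless of the value chosen for `r`, which is the point of the derivation: the loss depends on the capacitance and the voltage swing, not on the resistance of the charging path.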
[Figures: current I versus Vin and Vout for an inverter; diode characteristic]
• 10-20% of total chip power
• ~1 nW/gate, a few mW/chip
Other Sources of Energy Consumption
• Consumption caused by “DC leakage current” (Ids leakage):
[Figure: Ids versus Vgs, with Ioff flowing below Vth; inverter with Vin = 0 and Vout = Vdd]
• Transistor source/drain conductance never turns off all the way
• Low-voltage processes are much worse
• This source of power consumption is becoming increasingly
significant as process technology scales down
• For 90nm chips around 10-20% of total power consumption
• Estimates put it at up to 50% for 65nm
Controlling Energy Consumption: What
Control Do You Have as a Designer?
• Largest contributing component to CMOS power consumption is
switching power:
[Figure annotations: trend due to reduced V and C (length and width of Cs decrease, but plate distance gets smaller); recent slope reduced because V is scaled less aggressively]
Device Engineers Trade Speed and Power
We can reduce $C V^2$ (Pactive) by lowering Vdd
MIS structure
Physics of Power Dissipation in CMOS
FET Devices
• For an ideal MIS diode, the energy difference ψms between
the metal work function ψm and the semiconductor work
function ψs is zero
CMOS Gate Power equations
• $P = C_L V_{DD}^2 f_{0 \to 1} + t_{sc} V_{DD} I_{peak} f_{0 \to 1} + V_{DD} I_{leakage}$
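The three terms of the gate power equation (switching, short-circuit and leakage) can be evaluated separately to see which one dominates. The sketch below is illustrative only: the function name and every numeric value are assumptions chosen for the example, not data from the slides.

```python
def cmos_gate_power(c_load, vdd, f_01, t_sc, i_peak, i_leak):
    """Return (switching, short-circuit, leakage) power in watts.

    c_load : load capacitance C_L (F)
    vdd    : supply voltage V_DD (V)
    f_01   : frequency of 0->1 output transitions (Hz)
    t_sc   : duration of the short-circuit current pulse (s)
    i_peak : peak short-circuit current (A)
    i_leak : leakage current (A)
    """
    p_switching = c_load * vdd ** 2 * f_01
    p_short_circuit = t_sc * vdd * i_peak * f_01
    p_leakage = vdd * i_leak
    return p_switching, p_short_circuit, p_leakage


# Hypothetical gate: 10 fF load, 1.2 V supply, 100 MHz transition rate.
parts = cmos_gate_power(c_load=10e-15, vdd=1.2, f_01=100e6,
                        t_sc=50e-12, i_peak=20e-6, i_leak=10e-9)
for name, value in zip(("switching", "short-circuit", "leakage"), parts):
    print(f"{name:14s} {value:.3e} W")
print(f"{'total':14s} {sum(parts):.3e} W")
```

With these made-up numbers the switching term is about ten times the short-circuit term, and only the leakage term is independent of the transition frequency.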
• If the applied voltage is increased sufficiently, the
bands bend far enough that level Ei at the surface
crosses over to the other side of level EF
• This is brought about by the tendency of carriers to
occupy states with the lowest total energy
• In the present condition of inversion the level Ei
bends to be closer to level Ec and electrons
outnumber holes at the surface
• Ei at the surface now is below EF by an amount of energy
equal to 2 ΨB , where ΨB is the potential difference between
the Fermi level EF and the intrinsic Fermi level Ei in the
bulk.
Threshold voltage
• $V_T = \frac{2d}{\varepsilon_i} \sqrt{q \varepsilon_s N_A \psi_B \left(1 - e^{-2\beta\psi_B}\right)} + 2\psi_B$
Variation of minority concentration in the channel of a MOSFET biased in
weak inversion
Threshold voltage
• Drain-induced barrier lowering (DIBL) is the basis for
a number of more complex models of the threshold
voltage shift
• It refers to the decrease in threshold voltage due to the
depletion region charges in the potential barrier between
the source and the channel at the semiconductor surface
• A recent model adopts a quasi two-dimensional approach
to solving the two-dimensional Poisson equation
• dEx/dx at each point (x, y) can be replaced with the
average of its value at (0, y) and at (W, y)
Short channel effect
• The minimum value of the surface potential increases
with decreasing channel length and increasing VDS
Subsurface Drain-Induced Barrier
Lowering (Punchthrough)
• The punchthrough voltage $V_{PT}$ is defined as the value of
$V_{DS}$ at which $I_{D,st}$ reaches some specific magnitude
with $V_{GS} = 0$
• The parameter VPT can be roughly approximated as
the value of VDS for which the sum of the widths of
the source and the drain depletion regions becomes
equal to L
Power Dissipation in CMOS
• The first ICs ever fabricated used a PMOS process. This
is due to the simplicity of fabrication of a p-channel
enhancement mode MOS field-effect transistor (PMOST)
with threshold voltage VTp < 0
• The charge mobility factor caused the move to the NMOS
process
• Then change to CMOS because of the power dissipation
problem
• This advantage of CMOS over NMOS has proven to be
important enough that the shortcomings of CMOS are
overlooked
• The CMOS process is more complex than the NMOS process,
CMOS requires the use of guard rings to get around the
latch-up problem, and CMOS circuits require more
transistors than the equivalent NMOS circuits
CMOS inverter
Short-Circuit Dissipation
• The short-circuit dissipation of the gate varies
with the output load and the input signal slope
• The short-circuit dissipation decreases linearly
(roughly) in both absolute terms and a fraction
of the total dissipation as the output load is
increased to a critical value and then it will
increase again rapidly
Short Circuit Current
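• For simplicity, a symmetrical inverter (i.e., $\beta_N = \beta_P = \beta$
and $V_{Tn} = -V_{Tp} = V_T$) and a symmetrical input signal
(rise time = fall time = $\tau$) are considered
• $I = \frac{\beta}{2} (V_{in} - V_T)^2$ for $0 \le I \le I_{max}$
• $I_{mean} = \frac{1}{T} \int_0^T I(t)\,dt = 2 \cdot \frac{2}{T} \int_{t_1}^{t_2} \frac{\beta}{2} (V_{in}(t) - V_T)^2\,dt$
• Assuming the rising and falling portions of the
input voltage waveform to be linear ramps,
• $V_{in}(t) = \frac{V_{DD}}{\tau} t$, so that $t_1 = \frac{V_T}{V_{DD}} \tau$ and $t_2 = \frac{\tau}{2}$
• $I_{mean} = 2 \cdot \frac{2}{T} \int_{(V_T/V_{DD})\tau}^{\tau/2} \frac{\beta}{2} \left(\frac{V_{DD}}{\tau} t - V_T\right)^2 dt$
• Let $\theta = \frac{V_{DD}}{\tau} t - V_T$, so that $dt = \frac{\tau}{V_{DD}}\,d\theta$
• $I_{mean} = \frac{2\beta}{T} \frac{\tau}{V_{DD}} \int_0^{V_{DD}/2 - V_T} \theta^2\,d\theta$
• $I_{mean} = \frac{1}{12} \frac{\beta}{V_{DD}} (V_{DD} - 2V_T)^3 \frac{\tau}{T}$
• The short-circuit power dissipation of an
unloaded inverter is
• $P_{SC} = V_{DD} I_{mean} = \frac{\beta}{12} (V_{DD} - 2V_T)^3 \frac{\tau}{T}$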
• If the inverter is lightly loaded, causing output rise and fall
times that are relatively shorter than the input rise and fall
times, the short-circuit dissipation increases to become
comparable to dynamic dissipation
• To minimize dissipation, an inverter should be designed in
such a way so that the input rise and fall times are about equal
to the output rise and fall times
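The mean short-circuit current of the unloaded symmetrical inverter can be cross-checked by integrating the square-law current numerically over a linear input ramp. This is a sketch under the stated assumptions (equal betas, equal threshold magnitudes, linear ramps of duration tau); carrying out the integral analytically for such a ramp gives a (V_DD - 2 V_T)^3 dependence, which is the closed form the code compares against. Function names and device values are illustrative.

```python
def mean_short_circuit_current(beta, vdd, vt, tau, period, steps=100_000):
    """Numerically integrate I = beta/2 * (Vin - VT)^2 over one input ramp.

    The current flows from t1 = (VT/VDD)*tau until Vin reaches VDD/2 at
    t2 = tau/2. By symmetry the second half of the ramp contributes the
    same charge, and there are two ramps (rise and fall) per period,
    hence the factor 2 * 2 / period.
    """
    t1 = vt / vdd * tau
    t2 = tau / 2.0
    dt = (t2 - t1) / steps
    charge = 0.0
    for k in range(steps):
        t = t1 + (k + 0.5) * dt          # midpoint rule
        vin = vdd * t / tau              # linear input ramp
        charge += 0.5 * beta * (vin - vt) ** 2 * dt
    return 2.0 * (2.0 / period) * charge


def mean_short_circuit_current_closed_form(beta, vdd, vt, tau, period):
    """Closed form obtained by evaluating the same integral analytically."""
    return beta / (12.0 * vdd) * (vdd - 2.0 * vt) ** 3 * tau / period


# Hypothetical device: beta = 100 uA/V^2, VDD = 5 V, VT = 1 V,
# 1 ns input ramps, 10 ns period.
args = dict(beta=1e-4, vdd=5.0, vt=1.0, tau=1e-9, period=1e-8)
i_num = mean_short_circuit_current(**args)
i_ref = mean_short_circuit_current_closed_form(**args)
print(f"numerical {i_num:.6e} A, closed form {i_ref:.6e} A")
print(f"short-circuit power {args['vdd'] * i_ref:.3e} W")
```

The result scales linearly with tau/T, so slow input edges relative to the period raise the short-circuit dissipation.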
Dynamic Dissipation
• Assuming that the input Vin is a square wave
having a period T and that the rise and fall
times of the input are much less than the
repetition period, the dynamic dissipation is
given by
• $P_D = C_L V_{DD}^2 / T$
Energy per transition
• When $V = V_{DD}$, $E_{0 \to 1} = C_L V_{DD}^2$
• Since the energy stored in a capacitor with
capacitance $C_L$ and voltage $V_{DD}$ across its
plates is $C_L V_{DD}^2 / 2$, the rest of the energy,
another $C_L V_{DD}^2 / 2$, is converted into heat
Principles of Low-Power Design
• Using the lowest possible supply voltage
• Using the smallest geometry, highest frequency
devices but operating them at the lowest possible
frequency
• Using parallelism and pipelining to lower required
frequency of operation
• Power management by disconnecting the power
source when the system is idle
• Designing systems to have lowest requirements on
subsystem performance for the given user level
functionality
Fundamental Limits
• The limit from thermodynamic principles results
from the need to have, at any node with an equivalent
resistor R to the ground, the signal power Ps exceed
the available noise power Pavail
• The quantum theoretic limit on low power comes
from the Heisenberg uncertainty principle. In order to
be able to measure the effect of a switching transition
of duration $\Delta t$, it must involve an energy greater than
$h/\Delta t$:
• $P \ge h/(\Delta t)^2$, where $h$ is Planck’s constant
• Finally the fundamental limit based on
electromagnetic theory results in the velocity
of propagation of a high-speed pulse on an
interconnect to be always less than the speed
of light in free space, c0:
• L/τ≦ c0 where L is the length of the
interconnect and τ is the interconnect transit
time
Material Limits
• The attributes of a semiconductor material that
determine the properties of a device built with
the material are
• Carrier mobility μ
• Carrier saturation velocity σs
• Self-ionizing electric field strength Ec
• Thermal conductivity K
• Consider an SOI structure by surrounding the
above generic device in a hemispherical shell
of SiO2 of radius ri, indicating a two-order-of-
magnitude reduction in thermal conductivity
• The response time of the global interconnect
circuit is
τ= (2.3 Rtr + Rint) Cint where Rtr is the
output resistance of the driving transistor and
Rint and Cint are the total resistance and
capacitance, respectively, of the global
interconnect.
System Limits
• The architecture of the chip
• The power-delay product of the CMOS
technology used to implement the chip
• The heat removal capacity of the chip package
• The clock frequency
• Its physical size
Energy characterization
• Transition-sensitive energy models
– Single energy tables
• Bit independent modules e.g., flipflops
– Multiple energy tables
• Large bit dependent modules e.g., 32-b adders
• Large multi-element modules e.g., register files
– Transition sensitive energy equations
– System level interconnect capacitance values
• Analytical energy models
– Cache and main memory
Transition-sensitive energy model
• Must first design and layout a functional unit and
then simulate it to capture switch capacitances
– Bit independent – bus lines, pipeline registers
• One bit switching does not affect other bit slices’ operations
• Bit dependent – ALU, decoders
• Once constructed, the models can be reused in
simulations of other architectures built with the same
technology
Architectural Level Analysis
Considerations
• Very computationally efficient
– Requires predefined analytical and transition-
sensitive energy characterization models
– Requires design only to RTL (with some idea as to
the kind of functional units planned)
– Coarse grain – use of gated clocks implicit
• Reasonably accurate (within 5% - 15% of
SPICE)
Dynamic Power Consumption
Energy/transition = $C_L V_{dd}^2$
Power = Energy/transition × f = $C_L V_{dd}^2 f$
Dynamic Power Consumption - Extended
Ultra Low Power System Design
• Power minimization approaches:
• Run at minimum allowable voltage
• Minimize effective switching capacitance
Choice of Logic Style
Choice of Logic Style
$C_{EFF} = \frac{3}{16} C_L$
Operating at the
Lowest Possible Voltage!
Operating at the
Lowest Possible Voltage!
• Desire to operate at lowest possible speeds (using low supply
voltages)
• Use Architecture optimization to compensate for slower
operation
Reducing Vdd
Lowering Vdd Increases Delay
Parallel Data Path
Pipelined Data Path
A Simple Data Path : Summary
Computational Complexity of DCT Algorithms
Power Down Techniques
• Concept of Dynamic
Frequency Scaling (DFS)
Energy-efficient Software Coding
• Potential for power reduction via software modification is
relatively unexploited.
• Code size and algorithmic efficiency can significantly affect
energy dissipation
• Pipelining at software level- VLIW coding style
• Examples -
Power Hunger – Clock Network (Always
Ticking)
• H-Tree – design deficiencies based on Elmore delay model
• PLL – every designer (digital or analog) should have the
knowledge of PLL
• Multiple frequencies in chips/systems – by PLL
• Low main frequency, But
• Jitter and Noise, Gain and Bandwidth, Pull-in and
Lock Time, Stability …
• Local time zone
• Self-Timed
• Asynchronous => Use Gated Clocks, Sleep Mode
Power Analysis in the Design Flow
Scaling of MOS transistors
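[Figure: scaled MOS transistor; annotations: 1.2 nm scale, tox, doping increased by a factor of S]
• Standby power bottleneck: $P_{off} \propto e^{-qV_t/(mkT)}$
[Figure annotations: higher performance; increasing leakage]
[Figure: interconnect geometry]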
w: width of interconnect (layer dependent)
s: spacing between interconnects in the same layer
h: dielectric thickness (spacing between interconnects in two vertically
adjacent layers)
l: length of interconnect
t: thickness of interconnect
S. Reda EN160 SP’07
Constant thickness scaling versus reduced thickness
scaling
[Figure: scaled interconnect lengths (l and S) under the two scaling options; annotation: bottleneck]
[Figure: repeaters required to buffer Itanium global interconnects]
Unit II
Power estimation
Simulation Power analysis
• Computer simulation has been applied to VLSI design
for several decades
• Simulation programs operate on mathematical models
which mimic the physical laws and properties of the
object under simulation
• Simulation is used for
– Functional verification
– Performance
– Cost
– Reliability
– Power analysis
• Digital logic simulation examples
– VHDL (Very High Speed Integrated Circuit Hardware Description
Language)
– Verilog
Computing resources and analysis accuracy
at various abstraction levels
$P_{stat} = \sum_{\text{gate } g} \sum_{\text{state } s} P(g, s) \frac{T(g, s)}{T}$
$P = K_1 f_{in} + K_2 f_{out}$
Power dissipation based on Component
operations
• In this model the power dissipation is expressed in terms
of the frequency of some primitive operations of an
architecture component
• The power dissipation is given by
$P = K_1 f_{read} + K_2 f_{write}$
$P = \frac{p_0 + p_1 + \dots + p_N}{N}$
where p refers to the power samples
Basic Idea
• View signals as random processes
Prob{s(t) = 1} = p1
p0 = 1 – p1
Source of Inaccuracy
[Figure: three waveforms with clock period 1/fck, each with p1 = 0.5]
• p1 = 0.5, $P = 0.5\,C V^2 f_{ck}$
• p1 = 0.5, $P = 0.33\,C V^2 f_{ck}$
• p1 = 0.5, $P = 0.167\,C V^2 f_{ck}$
Switching Frequency
Number of transitions per unit time:
$T = \frac{N(t)}{t}$
$T = \lim_{t \to \infty} \frac{N(t)}{t}$
Static Signal Probabilities
• Observe signal for interval t0 + t1
– Signal is 1 for duration t1
– Signal is 0 for duration t0
– Signal probabilities:
• p1 = t1/(t0 + t1)
• p0 = t0/(t0 + t1) = 1 – p1
Static Transition Probabilities
• Transition probabilities:
• T01 = p0 Prob{signal is 1 | signal was 0} = p0 p1
• T10 = p1 Prob{signal is 0 | signal was 1} = p1 p0
• T = T01 + T10 = 2 p0 p1 = 2 p1 (1 – p1)
• Transition density: T = 2 p1 (1 – p1)
• Transition frequency: f = T/ 2
• Power = $C V^2 T / 2$ (correct formula)
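The static quantities on this slide can be computed directly from a sampled waveform. The sketch below (function names are illustrative; one sample per clock period is assumed) evaluates p1 and the static estimate T = 2 p1 (1 – p1), and also counts the toggles actually present, so the two can be compared for waveforms that share the same p1.

```python
def signal_probability(bits):
    """Fraction of samples in which the signal is 1 (p1)."""
    return sum(bits) / len(bits)


def static_transition_density(p1):
    """Static estimate T = 2 * p1 * (1 - p1), assuming independent samples."""
    return 2.0 * p1 * (1.0 - p1)


def measured_transition_density(bits):
    """Toggles per clock period counted directly from the samples."""
    toggles = sum(a != b for a, b in zip(bits, bits[1:]))
    return toggles / (len(bits) - 1)


fast = [0, 1] * 6          # toggles every clock period
slow = [0] * 6 + [1] * 6   # toggles once in twelve periods

for name, wave in (("fast", fast), ("slow", slow)):
    p1 = signal_probability(wave)
    print(f"{name}: p1 = {p1:.2f}, "
          f"static T = {static_transition_density(p1):.3f}, "
          f"measured T = {measured_transition_density(wave):.3f}")
```

Both waveforms have p1 = 0.5 and therefore the same static estimate, yet their measured toggle rates differ widely, because the static formula ignores how successive samples depend on each other.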
Static Transition Frequency
[Figure: static transition frequency f = p1(1 – p1) versus p1, with a maximum of 0.25]
Inaccuracy in Transition Density
[Figure: three waveforms with clock period 1/fck, each with p1 = 0.5]
• p1 = 0.5, T = 1.0
• p1 = 0.5, T = 4/6
• p1 = 0.5, T = 1/6
Cause for Error and Correction
• Probability of transition is not independent of
the present state of the signal
• Consider probability p01 of a 0→1 transition,
• Then p01 ≠ p0 p1
• We can write p1 = (1 – p1) p01 + p1 p11, so that
$p_1 = \frac{p_{01}}{1 - p_{11} + p_{01}}$
Correction (Cont.)
• Since p11 + p10 = 1, i.e., given that the signal
was previously 1, its present value can be
either 1 or 0
• Therefore,
$p_1 = \frac{p_{01}}{p_{10} + p_{01}}$
This uniquely gives signal probability as a
function of transition probabilities
Transition and Signal Probabilities
[Figure: signal waveform with clock period 1/fck]
Probabilities: p0, p1, p00, p01, p10, p11
• p01 + p00 =1
• p11 + p10 = 1
• p0 = 1 – p1
• $p_1 = \frac{p_{01}}{p_{10} + p_{01}}$
Transition Density
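• T = p0 p01 + p1 p10 = 2 p1 p10 = 2 p0 p01
• This reduces to T = 2 p1 (1 – p1) only when p01 = p1 and p10 = p0, i.e., when each value is independent of the previous one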
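The conditional transition probabilities can be estimated from a sampled waveform and fed into the formulas above. The sketch below treats the sample list as one period of a repeating waveform (an assumption made so that the numbers of 0→1 and 1→0 transitions balance); the function name is illustrative, and the waveform must contain both values.

```python
def transition_stats(bits):
    """Estimate p01, p10, p1 and transition density from one period of samples.

    The list is treated as cyclic: the last sample is followed by the first.
    """
    n = len(bits)
    pairs = [(bits[i], bits[(i + 1) % n]) for i in range(n)]
    n0 = sum(1 for prev, _ in pairs if prev == 0)
    n1 = n - n0
    p01 = sum(1 for pair in pairs if pair == (0, 1)) / n0   # Prob{1 | was 0}
    p10 = sum(1 for pair in pairs if pair == (1, 0)) / n1   # Prob{0 | was 1}
    p1 = p01 / (p10 + p01)          # signal probability from transition probabilities
    density = 2.0 * p1 * p10        # T = 2 p1 p10 = 2 p0 p01
    return p01, p10, p1, density


wave = [0, 0, 0, 1, 1, 1] * 2       # toggles twice every six clock periods
p01, p10, p1, density = transition_stats(wave)
print(f"p01 = {p01:.3f}, p10 = {p10:.3f}, p1 = {p1:.3f}, T = {density:.3f}")
print(f"static estimate 2 p1 (1 - p1) = {2 * p1 * (1 - p1):.3f}")
```

For this waveform the transition-probability route gives T = 1/3, which matches a direct count of two toggles per six periods, while the static estimate stays at 0.5.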
Power Calculation
• Power can be estimated if transition density is
known for all signals.
• Calculation of transition density requires
– Signal probabilities
– Transition densities for primary inputs; computed
from vector statistics
Signal Probabilities
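• AND gate with input probabilities x1, x2: output probability x1 x2
• OR gate with input probabilities x1, x2: output probability x1 + x2 – x1 x2
• Inverter with input probability x1: output probability 1 – x1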
Signal Probabilities
[Figure: circuit with inputs x1, x2, x3, each with signal probability 0.5; internal node x1 x2]
y = 1 – (1 – x1 x2) x3
  = 1 – x3 + x1 x2 x3
  = 0.625

X1 X2 X3 | Y
 0  0  0 | 1
 0  0  1 | 0
 0  1  0 | 1
 0  1  1 | 0
 1  0  0 | 1
 1  0  1 | 0
 1  1  0 | 1
 1  1  1 | 1

Ref: K. P. Parker and E. J. McCluskey, “Probabilistic Treatment of General
Combinational Networks,” IEEE Trans. on Computers, vol. C-24, no. 6, pp. 668-670,
June 1975.
Correlated Signal Probabilities
[Figure: circuit with inputs x1, x2, each with signal probability 0.5; x2 fans out to both gates]
y = 1 – (1 – x1 x2) x2
  = 1 – x2 + x1 x2 x2
  = 1 – x2 + x1 x2
  = 0.75

X1 X2 | Y
 0  0 | 1
 0  1 | 0
 1  0 | 1
 1  1 | 1
Correlated Signal Probabilities
[Figure: circuit with inputs x1, x2, each with signal probability 0.5; internal node x1 + x2 – x1 x2]
y = (x1 + x2 – x1 x2) x2
  = x1 x2 + x2 x2 – x1 x2 x2
  = x1 x2 + x2 – x1 x2
  = x2
  = 0.5

X1 X2 | Y
 0  0 | 0
 0  1 | 1
 1  0 | 0
 1  1 | 1
Observation
• Numerical computation of signal probabilities
is accurate for fanout-free circuits.
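The effect of reconvergent fanout on numerical probability propagation can be demonstrated by comparing gate-by-gate propagation with exhaustive enumeration. The sketch below uses the two-input circuit y = NOT(NOT(x1 AND x2) AND x2) from the correlated-probability example; the helper names are illustrative.

```python
from itertools import product


def p_and(a, b):
    """Output probability of an AND gate with independent inputs."""
    return a * b


def p_not(a):
    """Output probability of an inverter."""
    return 1.0 - a


def exact_probability(func, input_probs):
    """Sum the probabilities of all input combinations for which func is 1."""
    total = 0.0
    for values in product((0, 1), repeat=len(input_probs)):
        weight = 1.0
        for value, prob in zip(values, input_probs):
            weight *= prob if value else 1.0 - prob
        if func(*values):
            total += weight
    return total


def circuit(x1, x2):
    """y = NOT(NOT(x1 AND x2) AND x2); x2 fans out and reconverges."""
    return int(not ((not (x1 and x2)) and x2))


x1 = x2 = 0.5
# Gate-by-gate propagation wrongly treats the two x2 branches as independent.
naive = p_not(p_and(p_not(p_and(x1, x2)), x2))
exact = exact_probability(circuit, [x1, x2])
print(f"naive propagation: {naive}, exact: {exact}")
```

Numerical propagation multiplies x2 by itself as if the two branches were independent signals, which gives 0.625 here, whereas the symbolic simplification x2 x2 = x2 and the enumeration both give 0.75.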
Remedies
• Use Shannon’s expansion theorem to compute
signal probabilities.
• Use Boolean difference formula to compute
transition densities.
Shannon’s Expansion Theorem
• C. E. Shannon, “A Symbolic Analysis of Relay and
Switching Circuits,” Trans. AIEE, vol. 57, pp. 713-
723, 1938.
• Consider:
• Boolean variables, X1, X2, . . . , Xn
• Boolean function, F(X1, X2, . . . , Xn)
• Then F = Xi F(Xi=1) + Xi’ F(Xi=0)
• Where
• Xi’ is complement of Xi
• Cofactors, F(Xi=j) = F(X1, X2, . . , Xi=j, . . , Xn), j = 0 or 1
Expansion About Two Inputs
• F = Xi Xj F(Xi=1, Xj=1) + Xi Xj’ F(Xi=1, Xj=0)
+ Xi’ Xj F(Xi=0, Xj=1) + Xi’ Xj’ F(Xi=0, Xj=0)
• In general, a Boolean function can be expanded
about any number of input variables.
• Expansion about k variables will have $2^k$ terms.
Correlated Signal Probabilities
[Figure: circuit with inputs X1, X2; internal node X1 X2; X2 reconverges at the output]
Y = X1 X2 + X2’

X1 X2 | Y
 0  0 | 1
 0  1 | 0
 1  0 | 1
 1  1 | 1

Shannon expansion about the reconverging input:
Y = X2 Y(X2=1) + X2’ Y(X2=0)
  = X2 (X1) + X2’ (1)
Correlated Signals
• When the output function is expanded about all
reconverging input variables,
• All cofactors correspond to fanout-free circuits.
• Signal probabilities for cofactor outputs can be calculated without
error.
• A weighted sum of cofactor probabilities gives the correct
probability of the output.
• For two reconverging inputs:
f = xixj f(Xi=1, Xj=1) + xi(1-xj) f(Xi=1, Xj=0)
+ (1-xi)xj f(Xi=0, Xj=1) + (1-xi)(1-xj) f(Xi=0, Xj=0)
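The weighted sum of cofactor probabilities can be sketched for the single reconverging input X2 of Y = X1 X2 + X2’. In the code below, gate-by-gate propagation is applied to each cofactor, where X2 is a constant and no correlation remains; the function names are illustrative.

```python
def propagate(x1, x2_branch_a, x2_branch_b):
    """Gate-by-gate probability of Y = NOT(NOT(X1 AND X2) AND X2).

    The two fanout branches of X2 are passed separately so that the
    function can also evaluate cofactors, where both are the same constant.
    """
    nand_out = 1.0 - x1 * x2_branch_a
    return 1.0 - nand_out * x2_branch_b


def shannon_probability(x1, x2):
    """P(Y) = x2 * P(Y | X2=1) + (1 - x2) * P(Y | X2=0)."""
    cofactor_1 = propagate(x1, 1.0, 1.0)   # Y(X2=1) = X1
    cofactor_0 = propagate(x1, 0.0, 0.0)   # Y(X2=0) = 1
    return x2 * cofactor_1 + (1.0 - x2) * cofactor_0


x1 = x2 = 0.5
print("naive propagation:", propagate(x1, x2, x2))
print("Shannon expansion:", shannon_probability(x1, x2))
```

With both input probabilities at 0.5, the expansion returns 0.75, in agreement with the truth table (three of four rows are 1), while direct propagation returns 0.625.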
Example
[Figure: example circuit showing a supergate and its point of reconvergence, annotated with signal probabilities]
Signal entropy
• The entropy of a set of logic signals is a measure of
its randomness
• Entropy correlates to the average switching frequency
of the signals
• A skewed occurrence probability gives a low
entropy measure
• Active signal switching maximizes the entropy
of the signals
• These observations prompt the idea of using signal
entropy for power estimation
Power estimation of combinational
logic using entropy analysis
Entropy
Entropy based approach
– Entropy: Measure of uncertainty in a random variable
– Entropy H of a random variable x is given by
$H(x) = p \log_2 \frac{1}{p} + (1 - p) \log_2 \frac{1}{1 - p}$
• p: probability of x being 1
Recall that
$P_{avg} \propto D_{avg} \cdot GE \cdot C_{avg}$
Davg: Average node switching activity
GE: Gate equivalents
Cavg: Average gate capacitance
Entropy
Hypothesis
Can Davg be estimated only from knowledge of input and
output behavior?
• Answer: Yes!
$P_{avg} \propto H \cdot GE \cdot C_{avg}$
Entropy H is given by
$H \approx \frac{2/3}{n + m} \left( H(X) + H(Y) \right)$
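A small sketch of the entropy calculation follows. It implements the binary entropy of a single signal and the averaged form H = (2/3)/(n + m) (H(X) + H(Y)). Two assumptions are made here that the slides do not state: n and m are taken to be the numbers of input and output bits, and H(X) and H(Y) are approximated by summing per-bit entropies, which treats the bits as independent.

```python
import math


def binary_entropy(p):
    """H(x) = p log2(1/p) + (1 - p) log2(1/(1 - p)), with H = 0 at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log2(1.0 / p) + (1.0 - p) * math.log2(1.0 / (1.0 - p))


def average_entropy(input_probs, output_probs):
    """(2/3)/(n + m) * (H(X) + H(Y)), with H(X), H(Y) as sums of bit entropies.

    Assumption: bits are independent, so the sums are upper bounds on the
    true word-level entropies.
    """
    n, m = len(input_probs), len(output_probs)
    h_in = sum(binary_entropy(p) for p in input_probs)
    h_out = sum(binary_entropy(p) for p in output_probs)
    return (2.0 / 3.0) / (n + m) * (h_in + h_out)


print("H(0.5) =", binary_entropy(0.5))   # most random signal
print("H(0.1) =", binary_entropy(0.1))   # skewed signal, lower entropy
print("block estimate:", average_entropy([0.5] * 4, [0.5, 0.5]))
```

A skewed signal probability lowers the entropy, which is the property that makes entropy usable as a proxy for average switching activity.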
Entropy
• Entropy Based Power Estimation Methodology:
– Run a structural RTL simulation to measure
input/output entropies
– Using input/output entropies, estimate Pavg for the
combinational block
– Use other techniques to estimate latch and clock
power
UNIT-III
Low Power Design Circuit level
Power consumption in circuits
• The power reduction techniques at the circuit level are quite limited if
compared with the other techniques at higher abstraction levels
• At the circuit level , percentage power reduction in the teens is
considered good
• However, circuit techniques can have major impact because some
circuits, especially in cell- based design, are repeated thousands of times
on a chip
• Therefore, circuit techniques with a small percentage improvement
should not be overlooked
• Circuits designed manually can often be analyzed in great detail
• This allows us to optimize the circuit speed, power and area to suit our
specification
• One important circuit technique is the reduction of operating voltage
– The general rule is to select the lowest voltage that is acceptable
Transistor and Gate Sizing
• At the circuit level, transistors are the basic building blocks
and a digital circuit can be viewed as a network of transistors
with some additional parasitic elements such as capacitors
and resistors
• Transistor sizes are the most important factor affecting the
quality (i.e., area and power dissipation) of a circuit
• Some studies assume that the sizing problem is a convex
function and linear programming can be used to solve the
sizing problem optimally
• Another problem encountered in cell based design is gate
sizing
• The goal is to choose a set of gate sizes that best fits the
design constraints
Sizing an Inverter Chain
• The simplest transistor sizing problem is that of an inverter
chain
• The general design problem is to drive a large capacitive load
without excessive delay, area and power requirements
• A large inverter is required to drive the large capacitive load
at the final stage
[Figure: Flip flop with self clock gating]
[Figure: Power dissipation of self gating flip flop and regular flip flop]
Combinational flip flop
• One way to reduce circuit size is to associate logic gates with
a flip flop to produce a combinational flip flop
• Combinational flip flops are efficient because they are able to
eliminate or share redundant gates
• In terms of area, power and delay, combinational flip flops are
desirable but they increase the design complexity
Double Edge Triggered flip flop
• Most flip flops in use today are called single edge triggered
flip flop (SETFF) because the data are loaded at only one
clock edge, either rising or falling
• A flip flop can also be designed so that it loads data at both
rising and falling clock edges
Double Edge Triggered flip flop
• Compared to SETFF, the double edge triggered flip flop
(DETFF) requires slightly more transistors to implement
• The flip flop retains its data when the clock signal is not
toggling
Low Power Digital Cell Library
• Over the years, the major VLSI design focus has shifted from
masks, to transistors, to gates and to register transfer level
• Undoubtedly, the quality of gate level circuit synthesized
depends on the quality of the cell library
• Cell Sizes and Spacing
– In the top-down cell based design methodology, the tradeoff among
power, area and delay is performed by selecting the appropriate sizes
of the cells
– Therefore, an important attribute of a good low power
cell library is the availability of a wide range of cell sizes for
commonly used gates
– Further, one way to reduce the library cell count without too much
compromise in quality is to offer more size selections for gates that
are commonly used than for those that are less likely to be used
Low Power Digital Cell Library
• Varieties of Boolean Functions
– The lack of varieties of Boolean functions in a cell library can
result in inferior circuits to be generated
– For example, if the Boolean function $Y = \overline{A} B$ were to be
implemented and the inverted input cells are not available, the
logic synthesis system has to use an INVERTER and an AND
gate to implement the function
• The binary codes in the state bubbles indicate the state encoding
• The labels at the state transition edges represent the probabilities
that transition will occur at any given clock cycle
• The sum of all edge probabilities equals unity
• The expected number of state bit transitions E[M] is given by the
sum of products of edge probabilities and their associated number
of bit flips as dictated by the encoding
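The expected number of state bit transitions E[M] is a sum of products that is easy to compute once the edge probabilities and an encoding are known. The sketch below uses a made-up four-state machine (state names, edge probabilities and both encodings are hypothetical, chosen only so that the probabilities sum to one) and compares a binary-count encoding with a Gray-code encoding.

```python
def expected_bit_flips(edges, encoding):
    """E[M]: sum over edges of probability times Hamming distance of the codes.

    edges    : {(source_state, destination_state): probability per clock cycle}
    encoding : {state: bit string}
    """
    total = 0.0
    for (src, dst), prob in edges.items():
        flips = sum(a != b for a, b in zip(encoding[src], encoding[dst]))
        total += prob * flips
    return total


# Hypothetical state machine: a four-state cycle plus a self-loop on A.
edges = {
    ("A", "A"): 0.2,
    ("A", "B"): 0.2,
    ("B", "C"): 0.2,
    ("C", "D"): 0.2,
    ("D", "A"): 0.2,
}
binary_code = {"A": "00", "B": "01", "C": "10", "D": "11"}
gray_code = {"A": "00", "B": "01", "C": "11", "D": "10"}

print("E[M], binary encoding:", expected_bit_flips(edges, binary_code))
print("E[M], Gray encoding:  ", expected_bit_flips(edges, gray_code))
```

For this example the Gray encoding lowers E[M] because every edge on the cycle flips a single bit. The self-loop contributes nothing under either encoding.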
State Machine Encoding
• However, a state encoding with the lowest E[M] may not be the
one that results in the lowest overall power dissipation
• The reason is that the particular encoding may require more gates
in the combinational logic, resulting in more signal transitions and
power
• The synthesized area and power dissipation of some randomly
encoded state machines are shown below
Precomputation Logic
• Precomputation logic optimization is a method to trade area for
power in a synchronous digital circuit
• The principle of precomputation logic is to identify logical
conditions at some inputs to a combinational logic block under which
the output is invariant to the remaining inputs
• Since those input values do not affect the output, the input
transitions can be disabled to reduce switching activities
• One variant of precomputation logic is shown below
Precomputation Logic
• Let R1 and R2 be registers with a common clock feeding a
combinational logic circuit with a known Boolean function f(x)
• Due to the nature of the function f(x), there may be some
conditions under which the output of f(x) is independent of the
logic value of R2
• Under such conditions, we can disable the register loading of R2 to
avoid causing unnecessary switching activities, thus conserving
power
• The Boolean function f(x) is correctly computed because it
receives all required values from R1
• To generate the load disable signal to R2, a precomputation
Boolean function g(x) is required to detect the condition at which
f(x) is independent of R2
• g(x) depends on the input signals of R1 only because the load
disable condition is independent of R2; otherwise f(x) will depend
on the inputs of R2 when the load disable signal is active
Binary comparator function using Precomputation Logic
• Assuming uncorrelated input bits with uniform random
probabilities where every bit has an equal probability of 0 or 1
• There is a 50% probability that $A_n \oplus B_n = 1$, so the register R2 is
disabled in 50% of the clock cycles
• Therefore, with only one additional 2-input XOR gate, we have
reduced the signal switching activities of the 2n – 2 least significant
bits at R2 to half of their original expected switching frequency
Binary comparator function using Precomputation Logic
• Also, when the load disable signal is asserted, the combinational
logic of the comparator has fewer switching activities because the
outputs of R2 are not switched
• The extra power required to compute $A_n \oplus B_n$ is negligible compared
to the power saving, even for a moderate size of n
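The comparator precomputation can be checked exhaustively for a small word size. The sketch below assumes the comparator computes A > B on unsigned n-bit words (the exact function in the missing figure is not recoverable, so this is an assumption) and models the load-disable signal as the XOR of the two most significant bits; function names are illustrative.

```python
from itertools import product


def comparator(a, b):
    """Reference function: 1 if A > B for unsigned integers."""
    return int(a > b)


def precomputed_comparator(a, b, n):
    """Return (output, load_disabled) for n-bit unsigned operands.

    When the most significant bits differ, they alone decide the result,
    so the register holding the lower 2n - 2 bits need not be loaded.
    """
    msb_a = (a >> (n - 1)) & 1
    msb_b = (b >> (n - 1)) & 1
    load_disable = msb_a ^ msb_b          # precomputation function g
    if load_disable:
        return msb_a, True                # A > B exactly when A's MSB is 1
    mask = (1 << (n - 1)) - 1             # MSBs equal: compare the lower bits
    return int((a & mask) > (b & mask)), False


def disabled_fraction(n):
    """Fraction of all operand pairs for which the lower bits are not needed."""
    pairs = list(product(range(1 << n), repeat=2))
    disabled = 0
    for a, b in pairs:
        out, off = precomputed_comparator(a, b, n)
        assert out == comparator(a, b)    # precomputation must not change the result
        disabled += off
    return disabled / len(pairs)


print("fraction of cycles with R2 disabled (n = 4):", disabled_fraction(4))
```

With uniformly distributed operands the lower bits are unnecessary for exactly half of all pairs, matching the 50% figure on the slide.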
Alternate Precomputation Architectures
• The precomputation scheme based on Shannon's decomposition uses the fact
that a Boolean function f(x1,…,xn) can be decomposed with
respect to the variable xi as follows
$f(x_1, \ldots, x_n) = x_i f_{x_i} + \bar{x}_i f_{\bar{x}_i}$
[Figure: A precomputation architecture based on Shannon’s decomposition]
Latch based Precomputation Architecture
• The latch based precomputation architecture is shown below
• This architecture is also called guarded evaluation because some
inputs to the logic block C2 are isolated when the signals are not
required, to avoid unnecessary transition
• Transmission gates may be used in place of the latches if the
charge storage and noise immunity conditions permit
– Order(n) is the filter order, ESB is the stopband energy of the filter, and alpha is a constant
– Variable N is a constant by which the filter order is adjusted at each sample
– Q(n) represents the noise energy measured at a particular output sample n
Switching Activity Reduction
• Switching activities are the biggest cause of power dissipation in
most CMOS digital systems
• In order to do computation, switching activities cannot be avoided
• However, some switching activities do not contribute to the actual
computation and should be eliminated
• The suppression of switching activities always involves some
tradeoff decisions
• In general hardware logic is required to suppress unwanted
switching activities and the additional logic itself consumes power
• Guarded Evaluation
– It is a technique to reduce switching activities by adding latches
or blocking gates at the inputs of a combinational module if the
inputs are not used
Guarded Evaluation
• As shown in figure below the result of multiplication may or may
not be used depending on the condition selection of the multiplexer
Associative Transformation
Operator Reduction
• Also, the direct and transformed flow graphs of the coordinate rotation
operation are shown below
Loop Unrolling
• The control data flow graphs of DSP systems often contain loops,
as a result of recursive computations
• An important technique for flow graph transformation is to unroll
the loop
• Loop unrolling is a method to apply parallelism to the computation
• Consider a simple recursive computation of an IIR filter
• After loop unrolling, two output values are produced using two
input values in a single computation cycle
• The unrolled computation structure more than doubles the
amount of computation
Loop Unrolling
• The data flow graphs of the original and unrolled computation are
illustrated below
Loop Unrolling
• If the unrolled computation structure is implemented directly, the
power efficiency should be worse than the original implementation
because of the increased computation
• However, the unrolled structure allows us to apply pipelining by
adding pipeline registers on the graph edges crossing the vertical
dashed line
• With pipelining, the longest delay path of the unrolled graph is
identical to the original path (i.e., a multiplication followed by an
addition)
• Since the unrolled graph produces two outputs simultaneously, it
can be implemented with half the operating frequency of the
original path
• Thus, its critical delay is identical but the operating frequency is
halved
• This allows us to lower the operating voltage to improve the
overall system power efficiency
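Loop unrolling can be illustrated on a first-order recursion. The filter equation from the slide is not recoverable here, so the sketch below assumes y[n] = x[n] + a·y[n–1]; substituting the recursion into itself gives y[n] = x[n] + a·x[n–1] + a²·y[n–2], which lets one iteration produce two outputs from two inputs. Function names are illustrative.

```python
def iir_direct(x, a, y_init=0.0):
    """Original recursion: one output per iteration, y[n] = x[n] + a * y[n-1]."""
    y = []
    prev = y_init
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y


def iir_unrolled(x, a, y_init=0.0):
    """Unrolled recursion: two outputs per iteration.

    Both outputs of an iteration depend only on the two new inputs and on
    the last output of the previous iteration, so the loop can run at
    half the sample rate.
    """
    if len(x) % 2:
        raise ValueError("this sketch expects an even number of samples")
    y = []
    prev2 = y_init
    a_squared = a * a
    for k in range(0, len(x), 2):
        y_first = x[k] + a * prev2
        y_second = x[k + 1] + a * x[k] + a_squared * prev2
        y.extend((y_first, y_second))
        prev2 = y_second
    return y


samples = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0]
print(iir_direct(samples, 0.5))
print(iir_unrolled(samples, 0.5))
```

Both functions produce the same output sequence. The unrolled form does more arithmetic per pair of outputs, which is why it only pays off once pipelining and a lower supply voltage are applied as described above.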
UNIT-III
Power Reduction in Clock Networks
Introduction
• In a synchronous digital chip, the clock signal is
generally the one with the highest frequency
• The clock signal typically drives a large load because it
has to reach many sequential elements distributed
throughout the chip
• Therefore, clock signals have been a notorious source of
power dissipation because of high frequency and load
• It has been observed that clock distribution can take up to
40% of the total power dissipation of a high performance
microprocessor
• Many techniques have been devoted to the power
efficiency of clock generation and distribution
Clock Gating
• Clock gating is the most popular method for power reduction of
clock signals
• When the clock signal of a functional module (ALUs, memories,
etc.) is not required for some extended period, we use a gating
function (NAND or NOR gate) to turn off the clock feeding the
module
• The gating signal should be enabled or disabled at a much slower
rate compared to the clock frequency
• Clock gating saves power by reducing unnecessary clock activities
inside the gated module
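The effect of gating can be illustrated with a small count of the clock edges that reach a module. This is only a sketch: the activity pattern is hypothetical and the power cost of the gating logic itself is ignored.

```python
def gated_clock_edges(enable_per_cycle):
    """Count rising clock edges that reach a module.

    enable_per_cycle[i] is True when the module needs the clock in cycle i.
    Returns (edges_without_gating, edges_with_gating).
    """
    total = len(enable_per_cycle)
    delivered = sum(1 for enabled in enable_per_cycle if enabled)
    return total, delivered


# Hypothetical activity pattern: module busy for 20 cycles, idle for 80.
pattern = [True] * 20 + [False] * 80
ungated, gated = gated_clock_edges(pattern)
saving = 1 - gated / ungated
print(f"clock edges: {ungated} -> {gated}, clock activity saved: {saving:.0%}")
```

The enable pattern changes only once in 100 cycles here, which matches the requirement that the gating signal toggles much more slowly than the clock.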
Reduced Swing Clock
• Referring to the equation $P = C V^2 f$, the most attractive parameter for
power reduction is the voltage swing V, due to its quadratic effect
• Also, it is difficult to reduce the load capacitance or frequency of
clock signals for performance reasons
• Consider a 5 V digital CMOS chip with an N-transistor threshold
voltage of 0.8 V
• An N-transistor will turn on if the clock signal is above 0.8 V
• Hence, we can limit the swing of the N-transistor clock signal
to the range 0 to 2.5 V (half swing), and the on-off characteristics
of all N-transistors remain digitally identical
• A similar observation can be made for the clock signal feeding a
P-transistor
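As a rough worked example, assume the clock power follows $P = C V^2 f$ with V taken as the clock swing and with C and f unchanged. This is an idealization that ignores the overhead of the circuit generating the reduced swing. Halving the swing from 5 V to 2.5 V then gives:

```latex
\frac{P_{\text{half}}}{P_{\text{full}}}
  = \frac{C\,(2.5)^2 f}{C\,(5)^2 f}
  = \left(\frac{2.5}{5}\right)^2
  = 0.25
```

Under these assumptions the clock power drops to about one quarter of its full-swing value.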
Reduced Swing Clock
• The figure below illustrates the clock waveform of the half swing
clocking scheme
Reduced Swing Clock
• Generating the half swing clock signal is also relatively simple
using the charge sharing principle with stacked inverters
• The circuit to generate the clock waveform is shown below
Reduced Swing Clock
• The capacitances C1, C2, C3 and C4 are the parasitic capacitances of
the circuit
• CA and CB are add-on capacitances much larger than the parasitic
capacitances
• From the principle of charge sharing, when CLK is low the voltage
at VH is given by
• The circuit relies on the parasitic loading of the clock lines CP and
CN to achieve the charge sharing effect
• The capacitance CP and CN should be equalized to obtain the
proper half swing waveform
Oscillator Circuit for Clock Generation
• The simplest oscillator circuit is shown below
• This is common for off chip clock signals because they drive very
large capacitance
Frequency Division and Multiplication
• The off-chip clock signal runs at a slower speed and an on-chip
PLL circuit is used to multiply the frequency to the
desired rate
• The slower signal also eases off-chip signal
distribution in terms of electromagnetic interference and
reliability
• The frequency multiplier N is a trade-off between power
dissipation and PLL circuit complexity
• Larger values of N lead to lower power dissipation but
increase the design complexity and the performance required of the
PLL circuit
Algorithm & Architectural Level
Methodologies
Agenda
• Recap
• Power reduction on
– Gate level
– Architecture level
– Algorithm level
– System level
Recap: Problems of Power Dissipation
Continuously increasing performance
demands
Increasing power dissipation of
technical devices
Today: power dissipation is a main
problem
Recap: Consumption in CMOS
Water analogy:
Voltage (Volt, V)       Water pressure (bar)
Current (Ampere, A)     Water quantity per second (liter/s)
Energy                  Amount of water
[Figure: load capacitance CL switched between 0 and 1; power (Watts) versus time for Approach 1 and Approach 2. Energy is the area under the curve]
Recap: Levels of Optimization
Level          Savings    Speed     Error
System         > 70 %     Seconds   > 50 %
Architecture   25-40 %    Minutes   15-30 %
(after Massoud Pedram)
[Figure: example system with ALU, MEM and MP3 blocks]
Recap: Logic Restructuring
Logic restructuring: changing the topology of a logic
network to reduce transitions
Recap: Input Ordering
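• AND gate: P01 = (1 - PA PB) * PA PB
• Signal probabilities: P(A) = 0.5, P(B) = 0.2, P(C) = 0.1
• X = A AND B, F = X AND C: P01 at X = (1 - 0.5 x 0.2) * (0.5 x 0.2) = 0.09
• X = B AND C, F = X AND A: P01 at X = (1 - 0.2 x 0.1) * (0.2 x 0.1) = 0.0196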
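The input-ordering numbers can be reproduced with a short script. It is a sketch that assumes statistically independent inputs, which the AND formula on this slide already relies on; the function name is illustrative.

```python
def and_transition_probability(p_inputs):
    """P(0->1) at the output of an AND gate with independent inputs.

    p_inputs are the probabilities that each input is 1.
    """
    p_one = 1.0
    for p in p_inputs:
        p_one *= p
    return (1.0 - p_one) * p_one


p = {"A": 0.5, "B": 0.2, "C": 0.1}
x_ab = and_transition_probability([p["A"], p["B"]])  # A and B combined first
x_bc = and_transition_probability([p["B"], p["C"]])  # B and C combined first
f_out = and_transition_probability(list(p.values()))  # final output F
print(round(x_ab, 4), round(x_bc, 4), round(f_out, 4))
```

The activity at the final output F is the same for both orderings; only the internal node X differs, so combining the low-probability signals first lowers the total switching.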
Recap: Glitching
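[Figure: gate network with inputs A, B and C, internal node X and output Z, analysed with a unit-delay model to show glitching]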
Design Layer: Gate Level
• Basic elements:
– Logic gates
– Sequential elements (flipflops, latches)
• Behavior of elements is described in libraries
Dynamic Power and Device Size
• Device sizing (= changing gate width)
– Affects input capacitance Cin
– Affects load capacitance Cload
– Affects dynamic power consumption Pdyn
• Optimal fanout factor f for Pdyn is smaller than for performance
(especially for large loads)
– e.g., for Cload = 20, Cin = 1:
fcircuit = 20
fopt_energy = 3.53
fopt_performance = 4.47
• For low power: avoid oversizing (f too big)
[Figure: normalized energy (0 to 1.5) versus fanout factor f (1 to 7) for fcircuit = 1, 2, 5, 10 and 20]
VDD versus Delay and Power
[Figure: relative delay td (left axis, 0 to 6) and relative dynamic power Pdyn (right axis, 0 to 10) versus supply voltage VDD (0.8 to 2.4)]
Multiple VDD
• Main ideas:
– Use of different supply voltages within the same design
– High VDD for critical parts (high performance needed)
– Low VDD for non-critical parts (only low performance demands)
• At design phase:
– Determine critical path(s) (see the Data Paths slide below)
– High VDD for gates on those paths
– Lower VDD on the other gates (in non-critical paths)
– For low VDD: prefer gates that drive large capacitances (yields the largest
energy benefits)
• Usually two different VDD (but more are possible)
Multiple VDD cont’d
• Level converters:
– Necessary when a module at the lower supply drives a gate at the higher supply
(step-up)
– If a gate supplied with VDDL drives a gate supplied with VDDH,
then the PMOS never turns off
– Possible implementation:
• Cross-coupled PMOS transistors
• NMOS transistors operate on the reduced supply VDDL
– No need for level converters for a
step-down change in voltage
– Reducing the overhead:
• Conversions at register boundaries
• Embedding of the level conversion inside the flipflop
[Figure: level converter circuit with supply VDDH, input Vin and output Vout]
Data Paths
• Data propagate through different data paths between registers (flipflops -
FF)
• Paths mostly differ in propagation delay times
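• Frequency of clock signal (CLK) depends on the path with the longest delay
(the critical path)
[Figure: several register-to-register paths between flipflops (FF)]
[Figure: gate G1 with inputs A and B drives Y; gate G2 takes Y and C. Timing diagram: once all inputs of G1 have arrived, G1 evaluates; when G1 is ready, all inputs of G2 have arrived]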
Multiple VDD in Data Paths
• Minimum energy consumption when all logic paths are critical (same
delay)
• Possible Algorithm: clustered voltage-scaling
– Each path starts with VDDH and switches to VDDL (blue gates) when slack
is available
– Level conversion in flipflops at end of paths
Design Layer: Architecture Level
Clock Gating
• Most popular method for power reduction of clock signals and functional
units
• Gate off clock to idle functional units
• Logic for generation of the disable signal is necessary
– Higher complexity of control logic
– Higher power consumption
– Timing is critical to avoid clock glitches at the OR gate output
– Additional gate delay on the clock signal
[Figure: register (Reg) feeding a functional unit; the register clock is the OR of clock and disable]
Clock Gating cont’d
[Figure: flip-flop with data input D, output Q and clock CLK]
Clock Gating cont’d
[Figure: combinational logic with primary inputs (PI) and primary outputs (PO), flip-flops, and clock activation logic with a latch gating CLK]
Clock Gating: Example
• Without clock gating: 30.6 mW
• With clock gating: 8.5 mW
• 90% of flipflops clock-gated
[Figure: bar chart of power in mW (axis 0 to 25); chip blocks DEU, VDE, MIF, DSP/HIF and 896 Kb SRAM]
Recap: VDD versus Delay and Power
[Figure: relative delay td (left axis, 0 to 6) and relative dynamic power Pdyn (right axis, 0 to 10) versus supply voltage VDD (0.8 to 2.4)]
A Reference Datapath
[Figure: Input, Register, Combinational logic (Cref), Register, Output; registers clocked by CLK]
• Supply voltage = Vref
• Total capacitance switched per cycle = Cref
• Clock frequency = fclk
• Power consumption: $P_{ref} = C_{ref} V_{ref}^2 f_{clk}$
Parallel Architecture
• Each copy processes every Nth input and operates at reduced voltage
• Supply voltage: VN ≤ Vref
• N = degree of parallelism
[Figure: an input register clocked at fclk feeds N copies of register plus combinational logic (Copy 1, Copy 2, ...), each clocked at fclk/N by a multiphase clock; an N-to-1 multiplexer and output register deliver the output]
• Functionality:
[Figure: timing of Data and CLK]
Parallel Architecture: Example
• Reference Data path (for example)
Parallel Architecture: Example cont’d
• The clock rate can be reduced by half with the same throughput: fpar = fref / 2
• Vpar = Vref / 1.7, Cpar = 2.15 Cref
• $P_{par} = (2.15\,C_{ref})\,(V_{ref}/1.7)^2\,(f_{ref}/2) \approx 0.36\,P_{ref}$
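The power ratio for the parallel datapath can be checked numerically. This is a sketch based only on P = C V^2 f and the ratios quoted on the slide; the function name is illustrative. With the rounded factor 1.7 the product evaluates to about 0.37, so the quoted 0.36 presumably comes from an unrounded voltage ratio.

```python
def relative_power(c_ratio, v_ratio, f_ratio):
    """P / Pref for P = C * V^2 * f, with each quantity given relative to the reference."""
    return c_ratio * v_ratio ** 2 * f_ratio


# Two-way parallel datapath: 2.15x capacitance, voltage divided by 1.7, half the clock rate.
p_par = relative_power(c_ratio=2.15, v_ratio=1 / 1.7, f_ratio=0.5)
print(f"Ppar / Pref = {p_par:.3f}")
```

The capacitance more than doubles, yet the quadratic dependence on voltage still yields a net saving of roughly 63%.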
Pipelined Architecture: Example
Approximate Trend
[Figure: approximate trends for an N-parallel processor and an N-stage pipeline processor]
Guarded Evaluation
[Figure: two versions of a multiplier with inputs A and B, output C and a condition signal; the guarded version adds a latch controlled by the condition]
Precomputation
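[Figure: precomputation architecture. Precomputed inputs feed register R1 and the precomputation logic g(X); gated inputs feed register R2, which has a load-disable input; the combinational logic f(X) produces the outputs]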
Precomputation: Design Issues
• Design steps
1. Selection of precomputation architecture
2. Determination of precomputed and gated inputs (Register R1 should
be much smaller than R2)
3. Search for a good implementation of g(X)
4. Evaluation of potential energy savings based on input statistics (if
savings not sufficient go to step 2 or 3 and try again)
• Also works for multiple output functions where g(X) is the
product of gj(X) over all j
Precomputation: Example
• Binary Comparator
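[Figure: n-bit binary comparator computing A > B. The most significant bits An and Bn are held in R1; the remaining bits An-1 ... A1 and Bn-1 ... B1 are held in R2, whose load-disable input is driven by precomputation logic on An and Bn (labelled An = Bn)]
• Can achieve up to 75% power reduction with 3% area overhead
and 1 to 5 additional gate delays in the worst-case path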
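The comparator precomputation can be sketched behaviourally as follows. The assumptions are unsigned operands and the rule that the load of R2 is disabled whenever the most significant bits differ; the function name and the 4-bit width are illustrative.

```python
def compare_with_precomputation(a, b, n):
    """Return (a > b, r2_loaded) for n-bit unsigned a and b.

    If the MSBs differ, they alone decide the result and the register
    holding the lower bits (R2) does not need to be loaded.
    """
    msb_a = (a >> (n - 1)) & 1
    msb_b = (b >> (n - 1)) & 1
    if msb_a != msb_b:
        return msb_a == 1, False          # R2 load disabled
    mask = (1 << (n - 1)) - 1
    return (a & mask) > (b & mask), True  # lower bits must be compared


n = 4
results = [compare_with_precomputation(a, b, n)
           for a in range(2 ** n) for b in range(2 ** n)]
disabled = sum(1 for _, loaded in results if not loaded)
print(f"R2 load disabled for {disabled} of {len(results)} input pairs")
```

For uniformly distributed inputs the lower-bit register is idle in half of the cycles, which is where the saving comes from; real savings depend on the input statistics.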
Adder Design
• Various algorithms exist to implement an integer adder
– Ripple, select, skip (x2), Look-ahead, conditional-sum.
– Each with its own characteristics of timing and power consumption.
[Figure: block diagrams of adder structures built from full adders (FA), including a carry look-ahead block]
Adder Design
Adder type                   Energy (pJ)   Delay (ns)
Ripple Carry                 117           54.27
Constant Width Carry Skip    109           28.38
Variable Width Carry Skip    126           21.84
Carry Lookahead              171           17.13
Carry Select                 216           19.56
Conditional Sum              304           20.05
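One way to compare the adders is the energy-delay product. This metric is not used on the slide; it is introduced here only as a commonly used figure of merit, applied to the tabulated values.

```python
# (energy in pJ, delay in ns) taken from the adder table
adders = {
    "Ripple Carry": (117, 54.27),
    "Constant Width Carry Skip": (109, 28.38),
    "Variable Width Carry Skip": (126, 21.84),
    "Carry Lookahead": (171, 17.13),
    "Carry Select": (216, 19.56),
    "Conditional Sum": (304, 20.05),
}

# Energy-delay product for each adder, in pJ*ns
edp = {name: energy * delay for name, (energy, delay) in adders.items()}

for name, value in sorted(edp.items(), key=lambda item: item[1]):
    print(f"{name:28s} {value:8.1f} pJ*ns")

best = min(edp, key=edp.get)
lowest_energy = min(adders, key=lambda name: adders[name][0])
fastest = min(adders, key=lambda name: adders[name][1])
```

The lowest-energy adder, the fastest adder and the adder with the best energy-delay product are three different designs in this table, so the choice depends on the design target.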
Bus Power
• Buses are a significant source of power dissipation
– 50% of dynamic power for interconnect switching (Magen, SLIP 04)
– MIT Raw processor’s on-chip network consumes 36% of total chip power
(Wang et al. 2003)
• Caused by:
– High switching activities
– Large capacitive loading
Bus Power Reduction
Reducing Shared Resources
Reducing Shared Resources cont’d
• Bus segmentation
– Another way to reduce shared buses
– Control of bus segment by controller blocks (B)
[Figure: a shared bus compared with a segmented bus whose segments are connected through controller blocks (B)]
Design Layer: Algorithm Level
• Base elements:
– Functions
– Procedures
– Processes
– Control structures
• Description of design behavior
Coding styles
Source-code Transformations
• Minimize power-consuming activity:
– Computation
A*B + A*C  →  A*(B+C)
– Communication
Before: for (c = 1..N) { receive (A); B = c*A }
After:  receive (A); for (c = 1..N) { B = c*A }
– Storage
Before: for (c = 1..N) B[c] = A[c]*D[c]
        for (c = 1..N) F[c] = B[c]-1
After:  for (c = 1..N) F[c] = A[c]*D[c]-1
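The storage transformation can be written out as runnable code. This is a sketch in Python with illustrative function names and data; it only shows that the fused loop gives the same result without the intermediate array B.

```python
def two_loops(a, d):
    """Original form: intermediate array b is written, then read back."""
    b = [a[c] * d[c] for c in range(len(a))]
    f = [b[c] - 1 for c in range(len(a))]
    return f


def fused_loop(a, d):
    """Transformed form: one loop, no intermediate array."""
    return [a[c] * d[c] - 1 for c in range(len(a))]


a = [1, 2, 3]
d = [4, 5, 6]
print(two_loops(a, d), fused_loop(a, d))
```

Removing B saves N memory writes and N memory reads, which is the storage activity the slide targets.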
Datapath Energy Consumption
[Figure: switched capacitance (nF, 0 to 14000) for bubble.c, heap.c and quick.c, broken down into register file, pipeline registers, functional unit and others]
Adaptive Dynamic Voltage Scaling (DVS)
• Slow down processor to fill idle time
• More allowed delay → lower operating voltage
Adaptive DVS: Example
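[Figure: speed versus time between T1 and T2. Left: the task runs at full speed and the processor then idles. Right: the task runs at reduced speed and fills the interval. Same work, lower energy]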
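The "same work, lower energy" argument can be illustrated numerically. The sketch assumes that dynamic energy per cycle scales with V^2 and that the supply voltage can be lowered in proportion to the clock speed; the latter is an idealization. The workload size and busy fraction are hypothetical.

```python
def task_energy(cycles, v_rel):
    """Relative dynamic energy: cycles * V^2 (switched capacitance normalized to 1)."""
    return cycles * v_rel ** 2


cycles = 1_000_000     # hypothetical workload in clock cycles
busy_fraction = 0.5    # at full speed the task fills half of the interval

full_speed = task_energy(cycles, v_rel=1.0)
# Slow down so the task just fills the interval; voltage assumed to scale with speed.
scaled = task_energy(cycles, v_rel=busy_fraction)
ratio = scaled / full_speed
print(f"energy ratio (scaled / full speed): {ratio:.2f}")
```

The number of cycles is unchanged, so the saving comes entirely from the lower voltage made possible by the longer allowed delay.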
Design Layer: System Level
• Basic Elements:
– Complex modules
– Processors
– Calculation and control units
– Sensors
[Figure: example system with ALU, MEM and MP3 blocks]
Dynamic Power Management
• Systems are:
– Designed to deliver peak performance, but …
– Not needing peak performance most of the time
• Components are idle sometimes
• Dynamic power management (DPM):
– Puts idle components into low-power non-operational
states
• Power manager:
– Observes and controls the system
– Power consumption of power manager is negligible
Processor Sleep Modes
• Software power control - power management
DOZE    Most units stopped except on-chip cache memory (cache coherency)
NAP     Cache also turned off, PLL still on, time out or external interrupt to resume
SLEEP   PLL off, external interrupt to resume
• Deeper sleep mode consumes less power
• Deeper sleep mode requires more latency to resume
Processor Sleep Modes: Example
• PowerPC sleep modes
Mode                   66 MHz    80 MHz
No power mgmt          2.18 W    2.54 W
Dynamic power mgmt     1.89 W    2.20 W
DOZE                   307 mW    366 mW
NAP                    113 mW    135 mW
SLEEP                  89 mW     105 mW
SLEEP without PLL      18 mW     19 mW
SLEEP without clock    2 mW      2 mW
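To see what the sleep modes mean for average power, the 66 MHz figures can be weighted by the time spent in each mode. This is a sketch: the usage profile is hypothetical and the energy and latency of mode transitions are ignored.

```python
# Power at 66 MHz in watts, from the PowerPC sleep-mode table
POWER_66MHZ_W = {
    "run": 1.89,     # running with dynamic power management
    "doze": 0.307,
    "nap": 0.113,
    "sleep": 0.089,
}


def average_power(time_fractions):
    """Time-weighted average power; the fractions must sum to 1."""
    if abs(sum(time_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(POWER_66MHZ_W[mode] * frac for mode, frac in time_fractions.items())


# Hypothetical usage profile (not from the slides)
profile = {"run": 0.2, "doze": 0.1, "nap": 0.2, "sleep": 0.5}
avg = average_power(profile)
print(f"average power: {avg:.3f} W")
```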
Transmeta LongRun
• Applies adaptive DVS
• LongRun policies:
– Detection of different workload scenarios
– Based on runtime performance information
• After detection, corresponding adaptation of:
– Processor supply voltage
– Processor frequency
– Clock frequency always within limits required by supply voltage to avoid clock
skew problems
• Use of hard-coded core frequency/voltage operating points
Transmeta LongRun cont’d
[Figure: % of max power consumption (0 to 100) versus frequency (300 to 1000 MHz), with the typical operating region and the peak performance region marked]
Operating points:
Frequency (MHz)   300    433    533    667    800    900    1000
Voltage (V)       0.80   0.87   0.95   1.05   1.15   1.25   1.30
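The shape of the power curve can be approximated from the operating points. The sketch assumes that power is dominated by dynamic power proportional to f V^2; leakage and other effects are ignored, so the values need not match the plotted curve exactly.

```python
# (frequency in MHz, supply voltage in V) from the LongRun operating points
operating_points = [
    (300, 0.80), (433, 0.87), (533, 0.95), (667, 1.05),
    (800, 1.15), (900, 1.25), (1000, 1.30),
]


def relative_dynamic_power(points):
    """Return (f, v, f*V^2 relative to the highest operating point) for each point."""
    f_max, v_max = points[-1]
    p_max = f_max * v_max ** 2
    return [(f, v, f * v ** 2 / p_max) for f, v in points]


rel = relative_dynamic_power(operating_points)
for f, v, r in rel:
    print(f"{f:5d} MHz  {v:.2f} V  {100 * r:5.1f} % of max dynamic power")
```

Under this model the lowest operating point delivers 30% of the peak frequency for roughly 11% of the peak dynamic power.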
Transmeta LongRun: Example
Battery aware design
• Non-linear effects influence the lifetime of batteries
• “Rate Capacity”
– If discharging currents are higher than allowed (rated current),
the real capacity falls below the nominal capacity
• “Battery Recovery”
– Pulsed discharge increases nominal capacity
– Based on recovery times
– (as long as there is no rate capacity effect)
[Figure: capacity (mAh, 0 to 1000) versus discharge current (mA); standard capacity 1000 mAh at the rated current of 125 mA]
[Figure: available charge and discharge current versus time, with idle times between discharge pulses]
Battery aware design cont’d
Diffusion model from Rakhmatov, Vrudhula et al.
Battery aware design: Example 1
• Performance of a bipolar lead-acid battery subjected to six current
impulses. Pulse length=3 ms, rest period=22 ms.
Source: LaFollette, “Design and performance of high specific power, pulsed discharge, bipolar lead
acid batteries”, 10th Annual Battery Conference on Applications and Advances, Long Beach, pp. 43–
47, January 1995.
Battery aware design: Example 2
[Figure: two discharge current profiles, current in mA]
[Table columns: profile, average current (mA), battery lifetime (ms), specific energy (Wh/kg)]
Backup
FSM: Clock-Gating
• Moore machine: Outputs depend only on the
state variables.
– If a state has a self-loop in the state transition
graph (STG), then clock can be stopped whenever
a self-loop is to be executed.
[Figure: state transition graph with states Si and Sk, edge labels Xi/Zk and Xk/Zk, and a self-loop]
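The self-loop condition can be sketched in software by counting the cycles in which the next state equals the current state. The state machine and input sequence below are hypothetical and serve only as an illustration.

```python
def count_gated_cycles(next_state, start, inputs):
    """Count cycles in which the state register clock could be stopped.

    next_state maps (state, input) -> state. A cycle is gateable when the
    transition is a self-loop, because the state does not change.
    """
    state, gated = start, 0
    for x in inputs:
        target = next_state[(state, x)]
        if target == state:
            gated += 1
        state = target
    return gated


# Hypothetical two-state machine: stays in WAIT until input 1 arrives.
stg = {
    ("WAIT", 0): "WAIT", ("WAIT", 1): "RUN",
    ("RUN", 0): "WAIT", ("RUN", 1): "RUN",
}
inputs = [0, 0, 0, 1, 1, 0, 0]
gated = count_gated_cycles(stg, "WAIT", inputs)
print(f"{gated} of {len(inputs)} cycles are self-loops")
```

In hardware the comparison is made by clock activation logic derived from the state transition graph, and its own power has to be weighed against the saving.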
Trend: Interconnects
Interconnects
Propagation delays of
global wires will be a
multiple of the clock cycle.
Bus Multiplexing
Resource Sharing and Activity II
Bus Multiplexing
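[Figure: sources S1 and S2 connected to destinations D1 and D2, once with separate buses and once with a shared multiplexed bus]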
Correlated Data Streams
[Figure: plot over bit position 14 (MSB) to 0 (LSB) for correlated data streams]
Disadvantages of Bus Multiplexing
• If data bus is shared, advantages of data
correlation are lost (bus carries samples from
two uncorrelated data streams)
• Bus sharing should not be used for positively
correlated data streams
• Bus sharing may prove advantageous in a
negatively correlated data stream (where
successive samples switch sign bits) - more
random switching
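The loss of correlation on a shared bus can be illustrated by counting bit transitions. The two streams below are hypothetical slowly varying sequences, and a 16-bit two's complement bus is assumed.

```python
def count_toggles(words, width=16):
    """Total number of bit transitions between consecutive bus words."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(words, words[1:]))


# Two slowly varying (positively correlated) streams, one positive and one negative
stream_1 = [100, 101, 102, 103]
stream_2 = [-100, -101, -102, -103]

# Dedicated buses: each stream keeps its own sample-to-sample correlation
dedicated = count_toggles(stream_1) + count_toggles(stream_2)

# Shared bus: samples of the two streams are interleaved
interleaved = [w for pair in zip(stream_1, stream_2) for w in pair]
shared = count_toggles(interleaved)

print(f"dedicated buses: {dedicated} toggles, shared bus: {shared} toggles")
```

On the shared bus almost all upper bits flip at every transfer because consecutive words come from different streams, which is the effect described in the first bullet.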
Adaptive DVS cont’d
• Implementation
[Figure: block diagram with a FIFO input buffer, a workload filter, a power-speed control knob and a variable power-speed system]