
Signal Processing (E.g. For Multimedia and Wireless Communications)

This document discusses two types of computation and their optimization for power consumption. Stream-based signal processing benefits from real-time throughput, while general purpose processing benefits from higher overall speeds. Architecture and circuit-level optimizations can significantly improve energy efficiency by tailoring designs for specific computation types. These include lowering supply voltages, which reduces energy quadratically but increases delays.

Uploaded by

Harinath Reddy
Copyright
© Attribution Non-Commercial (BY-NC)

Lecture 2 - 225 C
Architecture and System Level Optimization of Power Consumption

Two Kinds of Computation

• Signal processing (e.g. for multimedia and wireless communications)
    • Stream-based computation
    • No advantage in obtaining throughput in excess of the real-time constraint
• General purpose processing (for downloaded code)
    • Bursty - mostly idle with bursts of computation
    • Faster is better

Potential of computation specific energy optimization

• Conventional general purpose processors
    • Clock rate is everything ... somehow we'll get the power in and out
    • 10-100 watts, 100-1000 Mops = 0.01 Mops/mW
• Energy optimized but general purpose
    • Keep the generality, but reduce the energy as much as possible - e.g. StrongARM
    • 0.5 watts, 130 Mops = 0.3 Mops/mW
• Energy optimized and dedicated
    • 100 Mops/mW

Switching Energy

[Figure: CMOS inverter with supply Vdd, input Vin, output Vout, and load capacitance CL]

Energy/transition: $E = C_L V_{dd}^2$
Power = Energy/transition * f: $P = C_L V_{dd}^2 f$

Power-Delay Product

[Figure: normalized power-delay product vs. Vdd (volts), log scale, for a 51 stage ring oscillator and an 8-bit adder; quadratic dependence]

$P \times t_d = E_t = C_L V_{dd}^2$
$\dfrac{E(V_{dd}=2)}{E(V_{dd}=5)} = \dfrac{C_L \cdot 2^2}{C_L \cdot 5^2}$, so $E(V_{dd}=2) \approx 0.16\, E(V_{dd}=5)$

Strong function of voltage ($V^2$ dependence).
Relatively independent of logic function and style.

Normalized Delay vs. Supply Voltage

[Figure: normalized delay vs. Vdd (volts), 2.0 µm technology; curves for multiplier, clock generator, ring oscillator, microcoded DSP chip, adder, and adder (SPICE)]

$T_d = \dfrac{C_L \cdot V_{dd}}{I}$

Lowering Vdd reduces energy but increases delays
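The quadratic dependence of switching energy on supply voltage can be checked numerically. The following is a minimal Python sketch, assuming a normalized load capacitance; the function names are illustrative and do not come from the lecture.

```python
def switching_energy(c_load, vdd):
    """Energy per transition: E = CL * Vdd^2."""
    return c_load * vdd ** 2


def switching_power(c_load, vdd, freq):
    """Power = energy per transition * switching frequency."""
    return switching_energy(c_load, vdd) * freq


c_load = 1.0  # normalized (assumption); it cancels in the ratio below
ratio = switching_energy(c_load, 2.0) / switching_energy(c_load, 5.0)
print(f"E(Vdd=2) / E(Vdd=5) = {ratio:.2f}")
```

Dropping the supply from 5 V to 2 V therefore leaves about 16% of the original energy per transition, at the cost of the longer delay shown in the delay plot.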
Architecture Trade-offs - Reference Datapath

[Figure: reference datapath with latches A, B, and C clocked at 1/T, an adder, and a comparator (A>B). Area = 636 x 833 µ²]

Critical path delay ⇒ $T_{adder} + T_{comparator}$ (= 25 ns) ⇒ $f_{ref}$ = 40 MHz
Total capacitance being switched = $C_{ref}$
$V_{dd} = V_{ref}$ = 5V
Power for reference datapath = $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$
from [Chandrakasan92] (IEEE JSSC)

Parallel Datapath

[Figure: two copies of the datapath (latches A, B, C, adder, comparator) clocked at 1/2T, with a MUX selecting the output at 1/T. Area = 1476 x 1219 µ²]

The clock rate can be reduced by half with the same throughput ⇒ $f_{par} = f_{ref} / 2$
$V_{par} = V_{ref} / 1.7$, $C_{par} = 2.15\, C_{ref}$
$P_{par} = (2.15\, C_{ref}) (V_{ref}/1.7)^2 (f_{ref}/2) \approx 0.36\, P_{ref}$

The More Parallel the Better??

[Figure: normalized power vs. Vdd (volts, 1.00 to 5.00) at fixed throughput, running from the "Minimal Area" point to the "Minimal Power" point]

Capacitance overhead starts to dominate at "high" levels of parallelism and results in an optimum voltage

Pipelined Datapath

[Figure: pipelined datapath with latches A, B, C1, C2 and pipeline latch P between the adder and the comparator (A>B), all clocked at 1/T. Area = 640 x 1081 µ²]

Critical path delay is less ⇒ $\max[T_{adder}, T_{comparator}]$
Keeping clock rate constant: $f_{pipe} = f_{ref}$
Voltage can be dropped ⇒ $V_{pipe} = V_{ref} / 1.7$
Capacitance slightly higher: $C_{pipe} = 1.15\, C_{ref}$
$P_{pipe} = (1.15\, C_{ref}) (V_{ref}/1.7)^2 f_{ref} \approx 0.39\, P_{ref}$
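The parallel and pipelined power figures both follow from P = C · V² · f. A minimal Python sketch, assuming the scaled supply is 2.9 V (approximately Vref/1.7 for Vref = 5 V); `relative_power` is an illustrative helper, not part of the lecture material.

```python
def relative_power(cap_ratio, v, f_ratio, v_ref=5.0):
    """P / Pref for P = C * V^2 * f, with C and f given relative to the reference."""
    return cap_ratio * (v / v_ref) ** 2 * f_ratio


# Assumed scaled supply voltage: roughly Vref / 1.7 with Vref = 5 V.
v_scaled = 2.9

p_parallel = relative_power(cap_ratio=2.15, v=v_scaled, f_ratio=0.5)
p_pipelined = relative_power(cap_ratio=1.15, v=v_scaled, f_ratio=1.0)

print(f"parallel:  {p_parallel:.2f} Pref")
print(f"pipelined: {p_pipelined:.2f} Pref")
```

Both architectures buy the voltage reduction differently: parallelism halves the clock rate at more than twice the capacitance, while pipelining keeps the clock rate and adds only latch capacitance.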

Architecture Summary for a Simple Datapath

Architecture type                                 Voltage   Area   Power
Simple datapath (no pipelining or parallelism)    5V        1      1
Pipelined datapath                                2.9V      1.3    0.39
Parallel datapath                                 2.9V      3.4    0.36
Pipeline-Parallel                                 2.0V      3.7    0.2

Algorithmic Transformations

[Figure: first-order recursive filter with input XN, output YN, coefficient A, and delay D, shown before and after loop unrolling; the unrolled version processes XN-1 and XN together with delay 2D]

Original: Ceff = effective normalized mult-add capacitance = 1, Voltage = 5, Throughput = 1, Power = 25
After loop unrolling: Ceff = 2, Voltage = 5, Throughput = 2, Power = 25

Loop-unrolling does not reduce power consumption
from [Chandrakasan95] (IEEE TCAD)
Loop Unrolling Enables Other Transformations

[Figure: the unrolled filter after algebraic transformations and constant propagation (coefficients A and A²), and the same structure after pipelining (added delays D)]

After algebraic transformations & constant propagation: Ceff = 3, Voltage = 3.7, Throughput = 2, Power = 20 (20% reduction)
After pipelining: Ceff = 3, Voltage = 2.9, Throughput = 2, Power = 12.5 (x2 reduction)

Speed vs. Power Optimization

[Figure: speedup, capacitance, voltage, and power (fixed throughput) vs. unrolling factor (1 to 7)]

Area can be traded for higher throughput or lower power
ARBITRARY SPEEDUP vs. FINITE POWER REDUCTION
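The power numbers quoted for the loop-unrolling example can be reproduced from P = Ceff · V² · f. The sketch below assumes that an unrolled iteration producing two samples runs at half the iteration rate, so that sample throughput is held fixed; that reading, and the helper name, are assumptions rather than something stated on the slides.

```python
def normalized_power(c_eff, voltage, samples_per_iteration):
    """P = Ceff * V^2 * f, with iteration rate f = 1 / samples_per_iteration
    (assumption: the sample throughput is the same in every case)."""
    return c_eff * voltage ** 2 / samples_per_iteration


cases = {
    "original": normalized_power(1, 5.0, 1),
    "unrolled": normalized_power(2, 5.0, 2),
    "unrolled + algebraic": normalized_power(3, 3.7, 2),
    "unrolled + algebraic + pipelined": normalized_power(3, 2.9, 2),
}
for name, power in cases.items():
    print(f"{name:34s} {power:5.1f}")
```

Under this assumption the results agree with the slide values (25, 25, 20, 12.5) to within rounding: unrolling alone changes nothing, and the savings come only once the shortened critical path is spent on a lower supply voltage.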

Multiple Supply Voltage Systems: Filter Example

[Figure: scheduled filter data-flow graph of multiplications (*) and additions (+) over control steps 1-10, with operations assigned to 5V, 3V, and 2.4V supplies]

Power (5V) / Power (5V, 3V, 2.4V) = 1.5
from [Raje95]
Similar approach to logic design proposed in [Usami95]

Time-multiplexed Architectures

[Figure: parallel busses for I,Q (I0 I1 I2 and Q0 Q1 Q2, period T) vs. a time-shared bus for I,Q (I0 Q0 I1 Q1, period T/2); plots of signal value vs. time (sample number) for each case]

Can destroy signal correlations and increase the switching activity

Optimizing Multiplications

A = IN * 0011
B = IN * 0111

Computed independently:
A = (IN >> 4 + IN >> 3)
B = (IN >> 4 + IN >> 3 + IN >> 2)

Sharing the common sub-expression:
A = (IN >> 4 + IN >> 3)
B = (A + IN >> 2)

[Figure: # of shift-add operations (6 to 16) vs. αq (1.10 to 1.25), for "Only Scaling" and "Scaling & Common Sub-expression"]

Number Representation

[Figure: transition probability vs. bit number (0 to 14) for two's complement and sign-magnitude, each with rapidly varying and slowly varying signals]

Sign-extension activity significantly reduced using sign-magnitude representation
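The shift-add decomposition with a shared sub-expression can be sketched in Python. The coefficients are read here as binary fractions (0.0011 and 0.0111), which is an interpretation consistent with the shift amounts on the slide; the function names and the input value are hypothetical. Note that in Python `+` binds tighter than `>>`, so each shift needs its own parentheses.

```python
def scale_direct(x):
    """A = x * 0.0011b and B = x * 0.0111b computed independently (3 additions)."""
    a = (x >> 4) + (x >> 3)
    b = (x >> 4) + (x >> 3) + (x >> 2)
    return a, b, 3  # third value: number of additions used


def scale_shared(x):
    """Same products, reusing A as a common sub-expression of B (2 additions)."""
    a = (x >> 4) + (x >> 3)
    b = a + (x >> 2)
    return a, b, 2


x = 1024  # hypothetical input sample
print(scale_direct(x))
print(scale_shared(x))
```

Both versions return identical values for A and B; only the operation count, and hence the switched capacitance, differs.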
Two's Complement vs. Sign-Magnitude

[Figure: transition activity vs. bit number (0 to 12) at the datapath sums for the two's complement and sign-magnitude implementations]

Two's complement datapath has a significantly higher glitching activity

Reducing Activity by Reordering Inputs

[Figure: two orderings of a two-adder chain with inputs IN, IN >> 7, and IN >> 8, related by associativity & commutativity; transition probability vs. bit number (0 to 14) for SUM1 and SUM2 in each ordering]

30% reduction in switching energy
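The effect of number representation on switching activity can be illustrated by counting bit toggles for a small signal that crosses zero. This is a minimal Python sketch; the 16-bit word length and the sample sequence are hypothetical choices, not data from the lecture.

```python
def twos_complement(value, bits=16):
    """Encode a signed integer as a two's complement bit pattern."""
    return value & ((1 << bits) - 1)


def sign_magnitude(value, bits=16):
    """Encode a signed integer as a sign bit followed by the magnitude."""
    sign = 1 << (bits - 1) if value < 0 else 0
    return sign | abs(value)


def count_transitions(words):
    """Total number of bit toggles between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


samples = [1, -1, 2, -2, 1, -1]  # hypothetical small signal crossing zero
tc = count_transitions([twos_complement(s) for s in samples])
sm = count_transitions([sign_magnitude(s) for s in samples])
print("two's complement toggles:", tc)
print("sign-magnitude toggles:  ", sm)
```

In two's complement every sign change flips all of the sign-extension bits, whereas in sign-magnitude only the sign bit and a few low-order magnitude bits toggle.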

Resource Sharing Can Increase Activity

[Figure: Counter 1 and Counter 2 driving separate buses (BUS1, BUS2) or one shared bus; # of bus transitions per cycle (0 to 10) vs. skew between counter outputs (0 to 250), with the no-bus-sharing level marked]

No bus sharing: number of bus transitions per cycle $= 2\,(1 + 1/2 + 1/4 + \dots + 1/128) \approx 4$

Memory Architecture

[Figure: 4-bit display interface. Serial access: memory cell array with row decoding, addressed at f, Voltage = 3V. Parallel access: memory cell array reading 8 nibbles per access, latched at f/8 and multiplexed out at f, Voltage = 1.1V]
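The figure of about 4 bus transitions per cycle without sharing can be checked by counting toggles of a binary counter directly. The sketch below assumes 8-bit counters, which matches the series ending at 1/128; the helper name is illustrative.

```python
def average_toggles(bits=8):
    """Average number of output bits that toggle per increment of a binary counter,
    measured over one full period including the wrap-around."""
    period = 1 << bits
    total = sum(bin(i ^ ((i + 1) % period)).count("1") for i in range(period))
    return total / period


per_counter = average_toggles(8)
print(f"one 8-bit counter:  {per_counter:.3f} transitions/cycle")
print(f"two separate buses: {2 * per_counter:.3f} transitions/cycle")
```

Bit k of a counter toggles once every 2^k cycles, which is where the geometric series comes from. On a shared bus the two counter values alternate, so the activity depends on the skew between them rather than on this sequential correlation.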

General Purpose computing - Do we just optimize power?

$\text{Power} = \dfrac{\text{Energy}}{\text{Operation}} \times \dfrac{\text{Operations}}{\text{Second}}$

NO! What is important?

Operations per Battery Life:
Minimize Energy Consumed per Operation
and
Operations per Second:
Maximize Throughput ≡ Operations/second

The complete subsystem should be optimized

Generic System Topology:

[Figure: battery-powered (AA cells) system on a shared bus: CPU (0.2-2 W), support ICs such as crystal, PLD, and PROM (0.1-2 W), glue logic (0.05-0.3 W), main memory (0.2-2 W), and I/O interface (0.05-1 W)]

Power dissipation is distributed
Proposed Design Methodology - (Tom Burd, Anthony Stratakos and Trevor Pering)

Instruction Set Architecture
Energy efficient system organization
Dynamically adjust throughput to user's needs
Apply energy efficient circuit and architecture design techniques
⇒ Energy Efficient Processor System

Demonstration Vehicle

Redesign the InfoPad processor subsystem

[Figure: ARM60 processor, clock oscillator, PLD, I/O interface, and 128k x 8 SRAM on the processor bus, with power labels of 45 mW, 120 mW, 400 mW, 45 mW, 40 mW, and 600 mW]

Current System: 10 MIPS @ 1.2W

Processor Usage Model

[Figure: desired throughput vs. time: bursts of compute-intensive and low-latency computation on top of background and high-latency computation, below a ceiling set by the top speed of the processor]

Simplest Approach: Compute ASAP

[Figure: delivered throughput vs. time at 80 MIPS, with the excess throughput marked]

Wake up → Compute ASAP → Go to idle/sleep mode
Not always computing
Always high throughput
Always high energy

Another Approach: Reduce Clock Frequency

[Figure: delivered throughput vs. time, reduced below 80 MIPS; fCLK set by user, e.g. the PowerBook control panel frequency setting (Slow ... Fast)]

• Energy remains unchanged... while throughput & power scale down with fCLK
• Reducing power dissipation not always equivalent to reducing energy consumption

Clock rate reduction doesn't help energy consumption

• Energy is independent of clock rate
    • Number of operations = $N_{ops}$
    • Energy/operation = $C V^2$
    • Total energy = $C V^2 \cdot N_{ops}$
• Reducing the clock rate only degrades throughput, with no savings in battery life, unless the voltage is changed
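The difference between reducing power and reducing energy can be made concrete with a short calculation. This is a minimal Python sketch, assuming one operation per clock cycle; the workload size, switched capacitance, clock rates, and supply voltages are hypothetical illustration values.

```python
def run_task(n_ops, cap, vdd, f_clk):
    """Return (energy, run time, average power) for n_ops operations,
    assuming one operation per clock cycle and energy/operation = C * V^2."""
    energy = cap * vdd ** 2 * n_ops
    run_time = n_ops / f_clk
    return energy, run_time, energy / run_time


n_ops, cap = 1_000_000, 1e-9  # hypothetical workload and switched capacitance (F)
fast = run_task(n_ops, cap, vdd=3.3, f_clk=80e6)
slow = run_task(n_ops, cap, vdd=3.3, f_clk=10e6)  # clock reduced, same voltage
low_v = run_task(n_ops, cap, vdd=1.2, f_clk=10e6)  # clock and voltage reduced

for name, (e, t, p) in [("fast", fast), ("slow", slow), ("slow + low Vdd", low_v)]:
    print(f"{name:15s} energy = {e * 1e3:6.2f} mJ  time = {t * 1e3:6.1f} ms  power = {p:.3f} W")
```

Slowing the clock at a fixed voltage lowers the power but leaves the energy for the task unchanged; only the run with a lower supply voltage draws less energy from the battery.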
Dynamic Voltage Scaling

[Figure: delivered throughput vs. time, following the demand up to the peak]

Dynamically scale energy with clock rate
Extend battery life by up to 10x with the same hardware
Key: Process scheduler determines operating point.

Scale Energy with Throughput, fCLK

[Figure: energy (Watts/MIP) vs. throughput (∝ fCLK), both normalized from 0 to 1.0. Constant supply voltage (3.3V): reduce throughput & fCLK. Reduced supply voltage (down to 1.2V), circuit speed tracks fCLK: reduce energy/operation, ~10x energy reduction]

Normalized data (simulated, 0.6 µm process)

Minimal Hardware Implementation

Modify existing DC-DC converter [Stratakos] feedback loop
• 10 msec per frequency transition
• Clock tracks over process and temp.

[Figure: a frequency register, set by a load special register instruction, is compared with the ring oscillator frequency fCLK; the difference ∆f drives the DC-DC converter, which sets VDD]

Add Register to ISA

DVS in Practice

Fixed Throughput, Energy/operation
Throughput = 10 MIPS: Energy/op. = 1 nJ/inst. (10 mW)
Throughput = 80 MIPS: Energy/op. = 9 nJ/inst. (720 mW)

Occasionally Demand Peak Throughput
Peak Throughput = 80 MIPS
Average Energy/op. ≈ 1 nJ/inst
(Peak throughput 11% of the time... average energy/op = 2 nJ/inst)
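The operating-point numbers above can be cross-checked with a few lines of arithmetic. The sketch below loudly assumes that the 11% figure is the fraction of instructions executed at the peak operating point (weighting by wall-clock time instead would give a higher average); the function names are illustrative.

```python
def power_mw(mips, nj_per_inst):
    """Power in mW: (1e6 inst/s per MIPS) * (1e-9 J per nJ) = 1e-3 W per MIPS*nJ."""
    return mips * nj_per_inst


def average_energy_per_op(peak_fraction, e_peak_nj=9.0, e_low_nj=1.0):
    """Average nJ/instruction when `peak_fraction` of the instructions (assumption:
    instructions, not time) run at the peak point and the rest at the low point."""
    return peak_fraction * e_peak_nj + (1.0 - peak_fraction) * e_low_nj


print(power_mw(10, 1.0), "mW at 10 MIPS")
print(power_mw(80, 9.0), "mW at 80 MIPS")
print(f"{average_energy_per_op(0.11):.2f} nJ/inst average")
```

Under that assumption the average comes out just under 2 nJ/inst, which is the sense in which occasional peak demand costs far less than running at the peak operating point all the time.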

Main Memory: IC Design

Use existing low-power memory block [Burstein]
4 kByte Block: 3.2 mm², 0.6 µm
Access time = 22 ns
Energy/access = 120 pJ

Design 64 kByte IC:
Access time ~ 40 ns
Energy/access ~ 300 pJ
5-10x better than commercial

Key: SRAM must be DVS Compatible.

Main Memory: Architecture

Standard memory architecture design
[Figure: 32-bit bus assembled from four 8-bit-wide memories]

Proposed memory architecture design
[Figure: 32-bit bus connected to four 32-bit-wide memories]

Only activate one SRAM → power reduced by 4x
Micro-power bus driver makes extra load negligible power
Self-timed Approach for Eliminating Glitching

[Figure: memory block with cells, row decode, sense amp, and output enable (OEN) driving a 32-bit output; states INACTIVE, Block Selected, and Output Valid. Output remains tri-stated until senseamp/latch has resolved data]

Enable tri-state drivers after sense-amp outputs are valid to eliminate glitching on the data-bus.

Glitch Free, Low Swing RAM Bitslice

[Figure: RAM bitslice schematic with precharge (PRE) devices, column select/cascode amp (SEL0, SEL1), sense-amp, and Data Out, for bitline pairs B0 and B1. Bitlines precharged to Vdd - Vtn]

Achievable energy levels

10 MIPS, 1 nJ/inst. (10 mW) ⇔ 80 MIPS, 9 nJ/inst. (720 mW)

[Figure: energy per instruction by block: DC-DC converter 100 pJ, LP-ARM CPU 500 pJ, processor bus << 100 pJ, I/O interface 100 pJ, 0.5 MB SRAM (8 ICs) 300 pJ]

Improves energy efficiency by an order of magnitude

Critical circuit - High efficiency DC-DC conversion using a switching regulator

[Figure: buck converter. Pass device M1 (gate Vg1) and synchronous rectifier M2 (gate Vg2) drive node Vx from Vin (with Cin); Cx at node X, filter inductor Lf (current ILf), filter capacitor Cf, and load RL at Vdd]

Arbitrary Vdd (< Vin) generated using the Buck converter
$V_{dd} = V_{in} \times$ Duty Cycle at Node X

Chief sources of inefficiencies:
⇒ Conduction loss ($I^2 R$)
⇒ Switching loss ($C_x V_{in}^2 f_s$ and $L_s I^2 f_s$)
⇒ Gate-drive loss ($C_g V_{in}^2 f_s$)
from [Stratakos94] (IEEE PESC)

Soft-Switching Eliminates $C_x V^2 f$ Loss

[Figure: buck converter with pass device M1 (gate Vgp), rectifier M2 (gate Vgn), Cx, Lf, Cf, and RL; waveforms over time of ILf relative to Iout, Vx, |Vgsp|, and Vgsn, marking PASS DEVICE ON, RECTIFIER ON, the dead-time when neither FET conducts, and the point where the current reverses]

Lf charges and discharges Cx
FETS ARE SWITCHED WITH VDS = 0

What happens when Iout changes?

[Figure: Vgn and Vx waveforms for two cases. Iout ↓ ⇒ Cx discharges slowly: rectifier discharges Cx. Iout ↑ ⇒ Cx discharges quickly: body diode conduction]

Inverter node transition times depend on Iout
Typical schemes use fixed dead time set by gate delays
Adaptive Dead-time Control Needed for varying Iout
Switcher Design: Power Transistor Sizing

[Figure: normalized FET losses vs. gate-width W. Gate-drive loss $P_{gd} = a f_s W$ rises with W, conduction loss $P_{cl} = b/W$ falls with W, and $P_{total} = P_{gd} + P_{cl}$ has its minimum at $W_{opt}$]

Minimize $P_{total} = P_{gate\text{-}drive} + P_{conduction\ loss}$

$W_{opt} = \sqrt{\dfrac{b}{a \cdot f_s}}$

Low Voltage Support Circuitry: Level Converter

[Figure: level converter schematic between VddL and VddH with input VIN, output VOUT, output enable OEN, and transistors M1-M4 (W/L labels 4/3, 4/3, 24/2, 8/2, 24/2, 4/2); signal levels 0 ↔ VddL and 0 ↔ VddH]

Tri-stateable output driver
Compatibility with 3.3V/5V standard components
(VOH)IN = 1.1V to 5V and (VOH)OUT = 1.1V to 5V
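The optimum gate width follows from setting the derivative of the total loss to zero, d(Ptotal)/dW = a·fs − b/W² = 0. A minimal Python sketch of that trade-off; the coefficient values and switching frequency are hypothetical and chosen only to make the numbers easy to read.

```python
import math


def fet_losses(width, a, b, f_s):
    """Total loss = gate-drive loss (a * f_s * W) + conduction loss (b / W)."""
    return a * f_s * width + b / width


def optimal_width(a, b, f_s):
    """Width where d(Ptotal)/dW = a * f_s - b / W^2 = 0."""
    return math.sqrt(b / (a * f_s))


a, b, f_s = 2e-9, 8.0, 1e6  # hypothetical coefficients and switching frequency
w_opt = optimal_width(a, b, f_s)
print(f"W_opt = {w_opt:.1f}, P_total = {fet_losses(w_opt, a, b, f_s):.3f}")
```

At the optimum the gate-drive and conduction losses are equal, so a wider or narrower device only shifts loss from one term to the other while raising the total.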

Other uses of adaptive DC-DC converters

• Adaptive supplies
• Self-timed circuits
• Adaptation to varying algorithmic workloads

Adaptive Power Supply Voltages

[Figure: self-timed processor between input and output FIFOs and registers (REG); a power control block sets the supply VDD(t)]

Exploit Data Dependent Computation Times To Vary the Supply
from [Nielsen94] (IEEE Transactions on VLSI Systems)

But Self-timed Circuits are Expensive...

[Figure: differential self-timed logic gate between Vdd rails with inputs IN, INB and outputs OUT, OUTB]

Guaranteed transition for every operation: $\alpha_{0 \to 1} = 1$
Use Synchronous DSP instead

Critical path based voltage optimization

[Figure: feedback loop with equivalent critical path circuits, a reference supply VDD_Ref, a comparator, and the regulated voltage supplied to the DSP]

Feedback adjusts the regulated voltage to the point where the equivalent critical path is about to fail
from [Macken90]
Case Study: A Portable Multimedia I/O Terminal

[Figure: terminal block diagram with antenna, radio modem, protocol module (2mW), video decompression module (2mW), text/graphics frame-buffer module (1mW), pen digitizer, and speech codec]

Protocol, ECC, Buffering, Video Decompression, and I/O
(InfoPad Terminal Developed at U.C. Berkeley)

Chipset Summary (1.2-µm, Vt = 0.7-0.9V)

Chip Description                            Area (mm x mm)   Minimum Supply Voltage   Power at 1.5V
Protocol                                    9.4 x 9.1        1.1V                     1.9mW
Frame-buffer SRAM (for 640x480 display)     7.8 x 6.5        1.1V                     1mW
Video Controller                            6.7 x 6.4        1.1V                     150 µW
Luminance Decompression                     8.5 x 6.7        1.1V                     115 µW
Chrominance Decompression                   8.5 x 9.0        1.1V                     100 µW
Color Space Conversion and Triple DAC       4.1 x 4.7        1.3V                     1.1mW

from [Chandrakasan94]

Video Decompression Module

[Figure: block diagram. The video controller (demultiplex, NTSC timing, frame-buffer control, LUT control, variable sized packets, synchronization) feeds luminance decompression (ping-pong frame-buffer, lookup table), which produces Y, and chrominance decompression (ping-pong frame-buffer, lookup table), which produces I and Q; the color space translator converts digital YIQ to analog RGB]

100 µWatts compared to commercial 1 Watt - Why??

Digital YIQ -> Digital RGB

$\begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} D_{11} & D_{12} & D_{13} \\ D_{21} & D_{22} & D_{23} \\ D_{31} & D_{32} & D_{33} \end{bmatrix} \begin{bmatrix} Y \\ I \\ Q \end{bmatrix}$

Optimized matrix multiplication (6 mults -> 8 adds)
• Hardwired shift-add operations
• Coefficient scaling to minimize shift-add operations
• Exploit multiple coefficients multiplied with the same input

Color Space Translator and Triple DAC

Key Features:
Digital YIQ -> Analog RGB
Optimized Multiplications
Number Representation
Optimized Time-sharing
Integrated low-voltage DACs

Power @ 1.3V: 0.93mW
Clock Rate: 2.5MHz
Size: 4.1mm x 4.7mm
1.2µm technology

[Figure: chip layout with blocks IN, MATRIX COMPUTATION, ADD TREE, SATURATION, DACR, DACG, DACB]

Power reduction approaches which make up the factor of 10,000 improvement

Design Consideration      Approach                                         Power Reduction
Frequency                 14MHz->2.5MHz                                    5.6
Supply Voltage            5V->1.5V                                         11
Library Optimization      Minimum Sized Devices, Single Phase Clocking     2-3
Matrix Multiplication     Hardwired Shift-add, Coefficient Optimization    7
Resource Allocation       Fully Parallel Implementation                    1.5-2
Number Representation     Sign-Magnitude                                   1.2
Off Chip Drivers          Integrate Processing and DAC                     1.4
Bitwidth                  8bits->6bits                                     1.3
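The individual reduction factors in the table can be combined to see how they add up. The sketch below assumes the factors are independent and simply multiply, which the slides do not state explicitly; the tuple layout is an illustrative choice.

```python
import math

# (design consideration, low estimate, high estimate) as listed in the table
factors = [
    ("frequency 14 MHz -> 2.5 MHz", 5.6, 5.6),
    ("supply voltage 5 V -> 1.5 V", 11, 11),
    ("library optimization", 2, 3),
    ("matrix multiplication", 7, 7),
    ("resource allocation", 1.5, 2),
    ("number representation", 1.2, 1.2),
    ("off-chip drivers", 1.4, 1.4),
    ("bitwidth 8 -> 6 bits", 1.3, 1.3),
]

# Assumption: the factors are independent, so the combined reduction is their product.
low = math.prod(f[1] for f in factors)
high = math.prod(f[2] for f in factors)
print(f"combined reduction: {low:,.0f}x to {high:,.0f}x")
print(f"frequency ratio: {14 / 2.5:.1f}, voltage ratio squared: {(5 / 1.5) ** 2:.1f}")
```

The first two entries follow directly from P ∝ V² f. Under the independence assumption the listed factors multiply to a few thousand, within the same order of magnitude as the quoted factor of 10,000.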
Summary

Signal statistics can be exploited to minimize the number of transitions required to perform a given function

Architectural voltage scaling is a key technique for low-voltage operation

Variable power supply reduces power and buffering trades latency for power

Orders of magnitude of power reduction are possible
