Signal Processing (E.g. For Multimedia and Wireless Communications)
Lecture 2 - 225 C
Architecture and System Level Optimization of Power Consumption

• Signal processing (e.g. for multimedia and wireless communications)
  • Stream based computation
  • No advantage in obtaining throughput in excess of the real-time constraint
• General purpose processing (for downloaded code)
  • Bursty - mostly idle with bursts of computation
  • Faster is better
• 100 Mops/mW
NORMALIZED POWER-DELAY PRODUCT
[Figure: normalized power-delay product vs. Vdd (volts) for a 51 stage ring oscillator and an 8-bit adder.]
$P \times t_d = E_t = C_L V_{dd}^2$
Quadratic dependence: $\frac{E(V_{dd}=2)}{E(V_{dd}=5)} = \frac{C_L (2)^2}{C_L (5)^2}$, so $E(V_{dd}=2) \approx 0.16\,E(V_{dd}=5)$

NORMALIZED DELAY
[Figure: normalized delay vs. Vdd (volts) for a multiplier, clock generator, ring oscillator, microcoded DSP chip, adder and adder (SPICE); 2.0 µm technology.]
$T_d = \frac{C_L V_{dd}}{I}$

Strong function of voltage ($V^2$ dependence).
Lowering Vdd reduces energy but increases delays.
Relatively independent of logic function and style.
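As a quick check of the quadratic dependence above, the following sketch evaluates E = C_L · Vdd² at the two supply voltages quoted on the slide. The function name and the unit load capacitance are illustrative assumptions; the capacitance cancels in the ratio.

```python
def switching_energy(c_load, vdd):
    """Energy per transition, E = C_L * Vdd^2 (the power-delay product)."""
    return c_load * vdd ** 2

c_load = 1.0  # arbitrary units (assumed); cancels in the ratio
ratio = switching_energy(c_load, 2.0) / switching_energy(c_load, 5.0)
print(f"E(Vdd=2) / E(Vdd=5) = {ratio:.2f}")
```

The ratio is 4/25, which is the 0.16 figure on the slide.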
Architecture Trade-offs - Reference Datapath
[Figure: reference datapath. Inputs A, B and C are held in latches A, B and C clocked at 1/T and feed an adder and a comparator (output A>B).]
Area = 636 x 833 µm²
⇒ fref = 40 MHz

Parallel Datapath
[Figure: parallel datapath. Two copies of the adder and comparator, each with latches A, B and C clocked at 1/2T; a MUX clocked at 1/T selects the output.]
[Figure: normalized power vs. Vdd (volts) at fixed throughput, with the "Minimal Area" and "Minimal Power" points marked.]
Capacitance overhead starts to dominate at “high” levels of parallelism and results in an optimum voltage.

[Figure: pipelined datapath with latches A, B, C1, C2 and P, an adder and a comparator (output A>B), clocked at 1/T.]
Area = 640 x 1081 µm²
Critical path delay is less ⇒ $\max[T_{adder}, T_{comparator}]$
Keeping clock rate constant: $f_{pipe} = f_{ref}$
Voltage can be dropped ⇒ $V_{pipe} = V_{ref} / 1.7$
Capacitance slightly higher: $C_{pipe} = 1.15\,C_{ref}$
$P_{pipe} = (1.15\,C_{ref})(V_{ref}/1.7)^2 f_{ref} \approx 0.39\,P_{ref}$
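The pipelined power estimate above follows from P = C · V² · f with each factor expressed relative to the reference datapath. A minimal sketch, with a function name and argument names chosen here for illustration only:

```python
def relative_power(cap_scale, voltage_scale, freq_scale=1.0):
    """P / P_ref for P = C * V^2 * f, each factor given relative to the reference."""
    return cap_scale * voltage_scale ** 2 * freq_scale

# Slide values: C_pipe = 1.15 C_ref, V_pipe = V_ref / 1.7, f_pipe = f_ref
p_pipe = relative_power(cap_scale=1.15, voltage_scale=1 / 1.7)
print(f"P_pipe / P_ref = {p_pipe:.3f}")
```

This evaluates to just under 0.40, which agrees with the slide's ≈ 0.39 up to rounding.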
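[Figure: signal flow graphs from X_N through an adder to Y_N; delay label 2D.]
[Table header: Architecture type, Voltage, Area, Power; rows not recovered.]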
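Loop Unrolling Enables Other Transformations
[Figure: unrolled signal flow graph from X_N to Y_N with multipliers by A and A^2, shown before and after pipelining (a delay D is inserted between the two adders).]

Speed vs. Power Optimization
[Figure not recovered.]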
[Figure: scheduled data-flow graph (control steps 6 to 10) of adders and multipliers, with operations assigned to different supply voltages.]
Power (5V) / Power (5V, 3V, 2.4V) = 1.5, from [Raje95]
Similar approach to logic design proposed in [Usami95]

[Figure: signal value vs. time (sample number), two panels; signals labelled I and Q.]
Can destroy signal correlations and increase the switching activity
A = IN * 0011
B = IN * 0111
B = (IN >> 4 + IN >> 3 + IN >> 2) ⇒ B = (A + IN >> 2)
[Figure: # of shift-add operations.]

Two’s Complement / Sign Magnitude
[Figure: transition probability for two’s complement and sign-magnitude representations; rapidly varying signals.]
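The shared-subexpression form of B above can be checked with a few lines of integer arithmetic. This is a sketch under the assumption that the coefficients 0011 and 0111 are binary fractions (so A = IN >> 4 + IN >> 3); the function name is hypothetical. Note that in Python `>>` binds less tightly than `+`, so each shift is parenthesized.

```python
def scale_pair(x):
    """Return (A, B computed directly, B computed by reusing A) for input x."""
    a = (x >> 4) + (x >> 3)                    # A = IN * 0.0011 (2 shifts, 1 add)
    b_direct = (x >> 4) + (x >> 3) + (x >> 2)  # B = IN * 0.0111 (3 shifts, 2 adds)
    b_shared = a + (x >> 2)                    # reuse A: 1 extra shift, 1 extra add
    return a, b_direct, b_shared

for sample in (0, 37, 200, 255):
    a, b_direct, b_shared = scale_pair(sample)
    print(sample, a, b_direct, b_shared)
```

Both forms of B give the same result for every input, while the shared form needs one adder less when A is computed anyway.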
Two’s Complement vs. Sign-Magnitude
[Figure: transition probability for the two’s complement and sign-magnitude representations.]

Reducing Activity by Reordering Inputs
[Figure: two orderings of an adder chain with inputs IN and IN >> 8, producing SUM1 and SUM2; transition probability of SUM1 and SUM2 for each ordering.]
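The two’s complement vs. sign-magnitude comparison above can be illustrated by counting bit toggles between consecutive samples. This is a minimal sketch: the 8-bit word width, the helper names and the sample sequence (a small signal alternating around zero) are all illustrative assumptions, and values are assumed to fit in the word.

```python
def twos_complement(value, bits=8):
    """Encode a signed integer as a two's complement bit pattern."""
    return value & ((1 << bits) - 1)

def sign_magnitude(value, bits=8):
    """Encode a signed integer as a sign bit plus magnitude."""
    sign = 1 << (bits - 1) if value < 0 else 0
    return sign | abs(value)

def toggles(samples, encode, bits=8):
    """Total number of bit transitions between consecutive encoded samples."""
    words = [encode(s, bits) for s in samples]
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

samples = [1, -1, 1, -1]  # rapidly varying around zero (assumed test signal)
tc = toggles(samples, twos_complement)
sm = toggles(samples, sign_magnitude)
print(f"two's complement: {tc} toggles, sign-magnitude: {sm} toggles")
```

For this sequence every sign change flips seven bits in two’s complement but only the sign bit in sign-magnitude.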
[Figure labels: Row Decoding, MEMORY, Counter 2, 4 bit display interface, No Bus-sharing; plot of transition activity vs. skew between counter outputs; Voltage = 3V and Voltage = 1.1V.]
• Minimize Energy Consumed per Operation
• Operations per Second: Maximize Throughput ≡ Operations/second

Power dissipation is distributed
[Figure: system power breakdown; recoverable entries: "... and Main Memory" 0.2-2 W, "I/O Interface" 0.05-1 W.]
Proposed Design Methodology - (Tom Burd, Anthony Stratakos and Trevor Pering)

Demonstration Vehicle

[Figure: two timelines (axis: time).]
Wake up → Compute ASAP → Go to idle/sleep mode
Not always computing
Background and high-latency computation
Always high throughput
Always high energy
Dynamic Voltage Scaling
[Figure: delivered throughput vs. time, with the peak level marked.]
Dynamically scale energy with clock rate
Extend battery life by up to 10x with the same hardware
Key: Process scheduler determines operating point.

Scale Energy with Throughput, fCLK
[Figure: energy (Watts/MIP) vs. throughput (∝ fCLK), both normalized to 1.0. Curves: constant supply voltage (3.3V), and reduced supply voltage where circuit speed tracks fCLK (down to 1.2V); ~10x energy reduction.]
Reduce throughput & fCLK, reduce energy/operation
Normalized data (simulated, 0.6 µm process)
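The idea that the process scheduler determines the operating point can be sketched as a lookup over available (clock rate, supply voltage) pairs. The table entries and function names below are hypothetical, not data from the slides, and energy per operation is assumed to scale as Vdd², following the energy relation used earlier in the lecture.

```python
# Hypothetical (relative clock rate, supply voltage) pairs; not measured data.
OPERATING_POINTS = [(0.1, 1.2), (0.5, 2.0), (1.0, 3.3)]

def pick_operating_point(required_throughput):
    """Return the slowest (f_clk, vdd) pair that still meets the required throughput."""
    for f_clk, vdd in sorted(OPERATING_POINTS):
        if f_clk >= required_throughput:
            return f_clk, vdd
    raise ValueError("required throughput exceeds peak")

def relative_energy_per_op(vdd, vdd_max=3.3):
    """Energy/operation relative to the full supply, assuming E is proportional to Vdd^2."""
    return (vdd / vdd_max) ** 2

f_clk, vdd = pick_operating_point(0.3)
print(f"f_clk = {f_clk}, Vdd = {vdd} V, "
      f"energy/op = {relative_energy_per_op(vdd):.2f} of peak")
```

Under the Vdd² assumption alone, going from 3.3V to 1.2V gives roughly a 7.6x reduction in energy per operation; the ~10x on the slide is its own simulated figure.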
Use existing low-power memory block [Burstein]
3.2 mm², 0.6 µm, 4 kByte block
Access time = 22 ns
Energy/access = 120 pJ
[Figure: timing diagram with signals PRE, Sense, Block Selected and Output Valid.]
Enable tri-state drivers after sense-amp outputs are valid to eliminate glitching on the data-bus.

Standard memory architecture design
[Figure: 32-bit memory built from four 8-bit columns; precharge (PRE) devices to Vdd, column select / cascode amp and sense-amp, select lines SEL0 and SEL1, bitline pairs B0 and B1.]
Bitlines precharged to Vdd - Vtn
[Figure: gate-drive waveforms |Vgsp| and Vgsn vs. t, with the PASS DEVICE ON and RECTIFIER ON intervals marked.]
Dead-time when neither FET conducts
Current reverses
Lf charges and discharges Cx
FETs are switched with VDS = 0

[Figure labels: Rectifier Discharges Cx; Body Diode Conduction.]
Inverter node transition times depend on Iout
Typical schemes use fixed dead time set by gate delays
Adaptive dead-time control needed for varying Iout
Switcher Design: Power Transistor Sizing
[Figure: normalized FET losses vs. gate-width; $P_{total} = P_{gd} + P_{cl}$, with the minimum at $W_{opt}$.]
Minimize $P_{total} = P_{gate\text{-}drive} + P_{conduction\ loss}$

Low Voltage Support Circuitry: Level Converter
[Figure: level converter between VddL and VddH with input VIN, output VOUT, enable OEN and transistors M1 to M4; W/L labels 4/3, 4/3, 24/2, 8/2, 24/2, 4/2.]
Tri-stateable output driver
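The power transistor sizing trade-off above (Ptotal = Pgate-drive + Pconduction loss) can be sketched with a simple loss model. The model is an assumption, not taken from the slides: gate-drive loss is taken as proportional to gate width W (gate capacitance grows with W) and conduction loss as inversely proportional to W (on-resistance falls with W). The coefficients are hypothetical.

```python
import math

def total_loss(width, k_gate, k_cond):
    """P_total = P_gate-drive + P_conduction for a FET of the given gate width."""
    return k_gate * width + k_cond / width

def optimal_width(k_gate, k_cond):
    """Width that minimizes total_loss: solve d/dW (k_gate*W + k_cond/W) = 0."""
    return math.sqrt(k_cond / k_gate)

k_gate, k_cond = 2.0, 8.0  # hypothetical loss coefficients
w_opt = optimal_width(k_gate, k_cond)
print(f"W_opt = {w_opt}, P_total = {total_loss(w_opt, k_gate, k_cond)}")
```

Under this model the two loss terms are equal at the optimum width, and the total rises on either side of it.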
• Adaptive supplies
• Self-timed circuits
• Adaptation to varying algorithmic workloads
[Figure: self-timed processor between FIFOs and registers (REG); a power supply control block sets VDD(t); from [Nielsen94] (IEEE Transactions on VLSI Systems).]
But Self-timed Circuits are Expensive...
[Figure: differential logic gate with inputs IN and INB and outputs OUT and OUTB, supplied from Vdd.]
Guaranteed transition for every operation: $\alpha_{0 \to 1} = 1$
Use Synchronous DSP instead

Critical path based voltage optimization
[Figure: feedback loop from [Macken90]; a comparator (+/-) checks the equivalent critical path signal, with VDD_Ref as reference, and sets the regulated voltage to the DSP.]
Feedback adjusts the regulated voltage to the point where the equivalent critical path is about to fail
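The critical path based feedback above can be sketched as a loop that lowers the supply until a replica of the critical path would miss the clock period. The delay follows Td = C_L · Vdd / I from earlier in the lecture; the drive-current model I = k (Vdd - Vt)², the parameter values and the function names are assumptions made for illustration.

```python
def critical_path_delay(vdd, c_load=1.0, k=1.0, vt=0.7):
    """Td = C_L * Vdd / I, with I = k * (Vdd - Vt)^2 as an assumed device model."""
    current = k * (vdd - vt) ** 2
    return c_load * vdd / current

def regulate(clock_period, vdd=5.0, step=0.01, vt=0.7):
    """Lower Vdd in small steps while the replica path still meets the clock period."""
    while vdd - step > vt and critical_path_delay(vdd - step, vt=vt) <= clock_period:
        vdd -= step
    return vdd

v_min = regulate(clock_period=1.0)
print(f"regulated Vdd is about {v_min:.2f} V")
```

The loop stops at the last voltage where the replica path still meets timing; one step lower and it would fail.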
Case Study: A Portable Multimedia I/O Terminal

Chipset Summary (1.2-µm, Vt = 0.7-0.9V)
[Figure: chipset block diagram. Video controller (demultiplex, NTSC timing, frame-buffer control, LUT control, variable sized packets, synchronization); luminance decompression (ping-pong frame-buffer, lookup table) producing Y; chrominance decompression (ping-pong frame-buffer, lookup table) producing I and Q; color space translator, digital YIQ to analog RGB.]

$\begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} D_{11} & D_{12} & D_{13} \\ D_{21} & D_{22} & D_{23} \\ D_{31} & D_{32} & D_{33} \end{bmatrix} \begin{bmatrix} Y \\ I \\ Q \end{bmatrix}$
Optimized matrix multiplication (6 mults -> 8 adds)
• Hardwired shift-add operations
• Coefficient scaling to minimize shift-add operations
• Exploit multiple coefficients multiplied with the same input

100 µWatts compared to commercial 1 Watt - Why??
Digital YIQ -> Analog RGB
Key Features:
• Optimized Multiplications
• Number Representation
• Optimized Time-sharing
• Integrated low-voltage DAC’s
Power @ 1.3V: 0.93mW
Clock Rate: 2.5MHz
Size: 4.1mm x 4.7mm
1.2µm technology
[Figure: chip layout with blocks IN, MATRIX COMPUTATION, ADD TREE, SATURATION, DACR, DACG and DACB.]

Design Consideration     Approach                                          Power Reduction
Frequency                14MHz->2.5MHz                                     5.6
Supply Voltage           5V->1.5V                                          11
Library Optimization     Minimum Sized Devices, Single Phase Clocking      2-3
Matrix Multiplication    Hardwired Shift-add, Coefficient Optimization     7
Resource Allocation      Fully Parallel Implementation                     1.5-2
Number Representation    Sign-Magnitude                                    1.2
Off Chip Drivers         Integrate Processing and DAC                      1.4
Bitwidth                 8bits->6bits                                      1.3
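Two rows of the table above can be checked directly, and the factors can be combined into an overall estimate. The sketch assumes that power scales linearly with frequency and quadratically with supply voltage, and that the individual reductions are independent and multiply; the slides do not state the latter, so the product is only a rough indication.

```python
import math

# Rows that follow from P = C * V^2 * f
freq_factor = 14 / 2.5        # 14MHz -> 2.5MHz
volt_factor = (5 / 1.5) ** 2  # 5V -> 1.5V

# (low, high) power reduction per design consideration, as listed in the table
factors = {
    "Frequency": (5.6, 5.6),
    "Supply Voltage": (11, 11),
    "Library Optimization": (2, 3),
    "Matrix Multiplication": (7, 7),
    "Resource Allocation": (1.5, 2),
    "Number Representation": (1.2, 1.2),
    "Off Chip Drivers": (1.4, 1.4),
    "Bitwidth": (1.3, 1.3),
}

low = math.prod(lo for lo, _ in factors.values())
high = math.prod(hi for _, hi in factors.values())
print(f"frequency: {freq_factor:.1f}x, voltage: {volt_factor:.1f}x")
print(f"combined reduction: about {low:.0f}x to {high:.0f}x")
```

The frequency and voltage rows reproduce the 5.6 and 11 in the table, and the combined product comes out at a few thousand under the independence assumption.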
Summary