180mV Low Voltage FFT Processor Paper On IEEE
16.4
degraded by three leaking devices and reduced swing input voltages (due to imperfect inverters). Alternatively, a transmission
gate XOR has fewer parallel devices which improves subthreshold performance at worst-case input vectors. Additionally, having
both NMOS and PMOS in the pull-up and pull-down reduces the
effects of process variations on minimum voltage operation.
Sneak leakage paths between standard cells are minimized by
introducing inverters and buffers and by carefully analyzing
interfaces between standard cells. Stacking multiple devices
significantly reduces drive current in subthreshold operation, so subthreshold transmission-gate MUXes cannot be directly cascaded. Datapath and control circuits for the subthreshold
FFT processor are developed by minimizing stacked devices,
reducing parallel leakage, and avoiding sneak leakage paths.
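To make the parallel-leakage argument concrete, the following is a first-order sketch, not the authors' model. It uses the textbook subthreshold relation in which current scales as exp(Vgs/(n*kT/q)), assumes equal-width devices, and ignores DIBL and stack effects. The slope factor n = 1.5 and the function name on_off_ratio are illustrative assumptions.

```python
import math

THERMAL_VOLTAGE = 0.026  # kT/q near room temperature, in volts
SLOPE_FACTOR = 1.5       # assumed subthreshold slope factor n (hypothetical value)


def on_off_ratio(vdd, n_leaking=1, n_driving=1, n=SLOPE_FACTOR):
    """Ratio of total drive current to total leakage current at a node.

    Each driving device has Vgs = vdd and each leaking device has Vgs = 0.
    All devices are assumed to have equal width; DIBL and stack effects
    are ignored, so this is only a first-order estimate.
    """
    single_device_ratio = math.exp(vdd / (n * THERMAL_VOLTAGE))
    return n_driving * single_device_ratio / n_leaking


if __name__ == "__main__":
    for vdd in (0.10, 0.18, 0.35):
        one = on_off_ratio(vdd, n_leaking=1)
        three = on_off_ratio(vdd, n_leaking=3)
        print(f"VDD={vdd:.2f} V: 1 leaker -> {one:.1f}, 3 leakers -> {three:.1f}")
```

Under these assumptions, three parallel leaking devices cut the effective on/off ratio to one third of the single-device value. That matters far more at 100 to 180mV, where the ratio is already small, than at nominal supply voltages.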
Memory design using subthreshold operation is challenging.
Conventional SRAM designs will not function at low voltage due
to reduced Ion/Ioff and bitline leakage that depends on the values
stored in memory. For read access in deep subthreshold operation, the bitline is segmented using a MUX-based hierarchical
approach (Fig. 16.4.4). The MUX selectors are the read-address inputs, and the data from the memories is passed hierarchically through the MUXes to the output. The MUXes are
designed to ensure a high Ion/Ioff at each level of hierarchy by
avoiding parallel leakage and stack effects. The simulation in
Fig. 16.4.4 contrasts operation of the hierarchical read bitline
with a conventional read bitline. The MUXes can be daisy-chained and arrayed for compact layout. The same hierarchical
design is used to create subthreshold Twiddle ROMs. A latch-based circuit is used for reliable write access at very low voltages
and across process corners (Fig. 16.4.4).
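The hierarchical read path described above can be sketched behaviorally. This is a minimal model that assumes a binary tree of 2:1 MUXes, with the least significant read-address bit selecting at the first level; the text does not state the MUX fan-in or the bit ordering, and the names mux2 and hierarchical_read are hypothetical.

```python
def mux2(sel, in0, in1):
    """Behavioral 2:1 multiplexer: returns in1 when sel is 1, else in0."""
    return in1 if sel else in0


def hierarchical_read(words, address):
    """Read words[address] through a tree of 2:1 muxes.

    Assumes a power-of-two number of stored words. Address bit 0 drives the
    first mux level, bit 1 the second, and so on, so each level halves the
    number of candidate words until one remains.
    """
    n = len(words)
    assert n > 0 and n & (n - 1) == 0, "word count must be a power of two"
    assert 0 <= address < n, "address out of range"

    level = list(words)
    bit = 0
    while len(level) > 1:
        sel = (address >> bit) & 1
        level = [mux2(sel, level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        bit += 1
    return level[0]


if __name__ == "__main__":
    stored = [0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17]
    print(hex(hierarchical_read(stored, 5)))
```

In this model, each output node is fed by only two sources at any level rather than by every cell on a shared bitline. That is the property the text relies on to keep Ion/Ioff high at each level of the hierarchy.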
The low-voltage FFT containing 627k transistors is fabricated in
a standard 0.18µm 6M CMOS process. It is fully functional at
128- to 1024-point FFT lengths, 8b and 16b precision, for supply voltages of 180 to 900mV and for clock frequencies of 164Hz to 6MHz.
The minimum supply voltage is 180mV, where it dissipates
90nW. Figure 16.4.5 is an oscilloscope plot of outputs from the
FFT chip functioning at 180mV. The optimal operating point is
where energy is minimized and is a function of activity factor
and process technology. For this chip, the optimal operating point is at 350mV
with a clock frequency of 9.6kHz, as shown in Fig. 16.4.6, which
plots the energy and the performance for a 16b, 1024-point
FFT as a function of VDD. As previously reported, a low-power
FFT processor implemented in a 0.7µm process dissipates
3.4µJ when performing one 1024-point CVFFT at 1.1V [3]. The
energy used by this FFT processor to compute one 16b, 1024-point
RVFFT at the optimal operating point is 155nJ. Figure
16.4.7 shows a die photo of the IC, which occupies 2.6mm x 2.1mm.
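The dependence of the minimum-energy point on activity factor can be illustrated with a toy model. This is a sketch under stated assumptions and is not fitted to the measured data: energy per operation is taken as switching energy plus leakage energy, with the subthreshold on/off current ratio set to exp(VDD/(n*kT/q)). The parameters leak_factor, c_eff and n = 1.5 are hypothetical. The delay expression holds only in subthreshold, so the high-voltage side of the curve is qualitative.

```python
import math

N_PHI_T = 1.5 * 0.026  # assumed slope factor n times thermal voltage kT/q, in volts


def energy_per_op(vdd, activity, leak_factor=100.0, c_eff=1.0):
    """Normalized energy per operation: switching plus leakage.

    Switching energy is activity * c_eff * vdd**2.
    Leakage energy is leakage power times operation time. With operation
    time proportional to c_eff * vdd / I_on and I_off / I_on equal to
    exp(-vdd / (n * phi_t)), this becomes
    leak_factor * c_eff * vdd**2 * exp(-vdd / (n * phi_t)).
    """
    switching = activity * c_eff * vdd ** 2
    leakage = leak_factor * c_eff * vdd ** 2 * math.exp(-vdd / N_PHI_T)
    return switching + leakage


def optimal_vdd(activity, v_min=0.18, v_max=0.90, step=0.001):
    """Grid search for the supply voltage that minimizes energy_per_op."""
    steps = int(round((v_max - v_min) / step))
    candidates = [v_min + i * step for i in range(steps + 1)]
    return min(candidates, key=lambda v: energy_per_op(v, activity))


if __name__ == "__main__":
    for activity in (0.1, 0.5):
        v = optimal_vdd(activity)
        print(f"activity={activity}: minimum-energy VDD ~ {v * 1000:.0f} mV")
```

In this model, lowering VDD reduces switching energy quadratically but lengthens each operation exponentially, so leakage energy eventually dominates and an interior minimum appears. A higher activity factor shifts that minimum toward lower VDD, which is consistent with the statement that the optimal point depends on activity factor and process technology.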
Acknowledgments:
The authors thank J. Cline for her help with the multiplier design. We
also thank B. Calhoun and Prof. K.C. Smith for valuable feedback on the
paper. This effort is sponsored by DARPA Power Aware Computing and
Communications (PAC/C) and the Air Force Research Laboratory, under
agreement number F33615-02-2-4005. A. Wang is supported by an Intel
PhD Fellowship.
References:
[1] A. Wang and A. Chandrakasan, "Energy-Aware Architectures for a
Real-Valued FFT Implementation," ISLPED 2003, pp. 360-365, Aug.
2003.
[2] J. Burr and J. Shott, "A 200mV Self-Testing Encoder/Decoder Using
Stanford Ultra-Low-Power CMOS," ISSCC Dig. Tech. Papers, pp. 84-85,
Feb. 1994.
[3] B. Baas, "A Low-Power, High-Performance, 1024-Point FFT Processor,"
IEEE J. Solid-State Circuits, vol. 34, no. 3, pp. 380-387, Mar. 1999.
[Figure 16.4.1 block diagram; labels recovered from the extraction: FFT length, bit precision, clk, enable, data ready, Control Logic, A address, B address, W address, MSB, Data Memory, Bank Parity Even, Bank Parity Odd, Twiddle ROMs (W), Butterfly Datapath (A, B, X, Y), adders enabled for one multiplier mode only, input gating logic, Baugh-Wooley multiplier. Numeric values in the labels were lost.]
Figure 16.4.1: RVFFT architecture that enables scalability in bit-precision and FFT length,
and includes circuits which can scale down to 180 mV operation.
[Figure 16.4.2 plot; labels recovered from the extraction: Minimum Voltage Operation, VDD (mV), Wp, Wp(max) at the SF corner, Wp(min) at the FS corner. Axis values were lost.]
Figure 16.4.2: Sizing trade-off for an inverter at the minimum operating voltage with process variation
considerations given Wn=0.44µm (simulation).
[Figure 16.4.3 schematics and waveform; labels recovered from the extraction: standard-cell library tiny XOR, subthreshold XOR, inputs A and B, output Z, leakage current, drive current, weak drive current, voltage level at Z (mV). Axis values were lost.]
Figure 16.4.3: The effect of parallel leakage is compounded at ultra-low voltages, as shown by the standard-cell tiny
XOR gate for the inputs A=1 and B=0 at VDD=100mV. Parallel leakage is reduced in the subthreshold XOR gate, which
functions better at 100mV.
[Figure 16.4.4 schematics and waveform; labels recovered from the extraction: WWL, Mux, A, M (device labels, indices lost), worst-case leakage, RBL, latch-based write, hierarchical mux-based read, mux RBL, conventional RBL. Axis values were lost.]
Figure 16.4.4: The MUX-based hierarchical-read access works reliably at 100mV in simulation compared to
a conventional read bitline (RBL).
[Figure 16.4.5 oscilloscope plot; labels recovered from the extraction: output clock, dataout[] (index lost), data ready.]
Figure 16.4.5: Oscilloscope plot showing outputs from the RVFFT chip at 180 mV operation.
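[Figure 16.4.6 plot; labels recovered from the extraction: Energy (nJ), clock frequency, VDD (mV). Axis values were lost.]
Figure 16.4.6: Energy and FFT clock frequency for 16b, 1024-point RVFFT as a function of VDD.
[Die photo labels recovered from the extraction: Data Memory, Control logic, Butterfly Datapath, Twiddle ROMs.]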