Advanced VLSI Technology
VLSI
Edited by
Zhongfeng Wang
In-Tech
intechweb.org
Published by In-Teh
Olajnica 19/2, 32000 Vukovar, Croatia
Abstracting and non-profit use of the material is permitted with credit to the source. Statements and
opinions expressed in the chapters are those of the individual contributors and not necessarily those of
the editors or publisher. No responsibility is accepted for the accuracy of information contained in the
published articles. The publisher assumes no liability for any damage or injury to persons or
property arising out of the use of any materials, instructions, methods or ideas contained inside. After
this work has been published by In-Teh, authors have the right to republish it, in whole or part, in any
publication of which they are an author or editor, and to make other personal use of the work.
© 2010 In-teh
www.intechweb.org
Additional copies can be obtained from:
[email protected]
VLSI,
Edited by Zhongfeng Wang
p. cm.
ISBN 978-953-307-049-0
Preface
Integrated circuit (IC) technology entered the era of very-large-scale integration (VLSI)
in the 1970s, when thousands of transistors were integrated into a single chip. Since then,
the transistor counts and clock frequencies of state-of-the-art chips have grown by orders of
magnitude. Nowadays we are able to integrate more than a billion transistors into a single
device. However, the term "VLSI" remains in common use, despite efforts many years ago to
coin the term ultra-large-scale integration (ULSI) for finer distinctions. In
the past two decades, advances in VLSI technology have led to an explosion in the computer
and electronics world. VLSI integrated circuits are used everywhere in our everyday life,
including microprocessors in personal computers, image sensors in digital cameras, network
processors in Internet switches, communication devices in smartphones, embedded
controllers in automobiles, and so on.
VLSI covers many phases of the design and fabrication of integrated circuits. A complete VLSI
design process often involves system definition, architecture design, register transfer
language (RTL) coding, pre- and post-synthesis design verification, timing analysis, and chip
layout for fabrication. As process technology scales down, it has become a trend to integrate
many complicated systems into a single chip, which is called system-on-chip (SoC) design.
In addition, advanced VLSI systems often require high-speed circuits to meet the ever increasing
demand for data processing. For instance, the Ethernet standard has evolved from 10 Mbps to
10 Gbps, and the specification for 100 Gbps Ethernet is underway. On the other hand, with
the growing popularity of smartphones and mobile computing devices, low-power VLSI
systems have become critically important. Therefore, engineers are facing new challenges in
designing highly integrated VLSI systems that meet both high performance requirements and
stringent low power budgets.
The goal of this book is to elaborate on state-of-the-art VLSI design techniques at multiple
levels. At the device level, researchers have studied the properties of nano-scale devices and
explored possible new materials for future very high speed, low-power chips. At the circuit level,
interconnect has become a contemporary design issue for nano-scale integrated circuits.
At system level, hardware-software co-design methodologies have been investigated to
coherently improve the overall system performance. At architectural level, researchers have
proposed novel architectures that have been optimized for specific applications as well as
efficient reconfigurable architectures that can be adapted for a class of applications.
As VLSI systems become more and more complex, it is a great challenge, and a significant task,
for experts to keep up with the latest signal processing algorithms and associated architecture
designs. This book meets this challenge by providing a collection of advanced algorithms
in conjunction with their optimized VLSI architectures, such as Turbo codes, Low Density
Parity Check (LDPC) codes, and the advanced video coding standards MPEG-4/H.264. Each
of the selected algorithms is presented with a thorough description together with research
studies towards efficient VLSI implementations. No book is expected to cover every possible
aspect of VLSI exhaustively. Our goal is to provide the design concepts through those
selected studies, and the techniques that can be adopted into many other current and future
applications.
This book is intended to cover a wide range of VLSI design topics, both general design
techniques and state-of-the-art applications. It is organized into four major parts:
▪ Part I focuses on VLSI design for image and video signal processing systems, at both
algorithmic and architectural levels.
▪ Part II addresses VLSI architectures and designs for cryptography and error correction
coding.
▪ Part III discusses general SoC design techniques as well as system-level design optimization
for application-specific algorithms.
▪ Part IV is devoted to circuit-level design techniques for nano-scale devices.
It should be noted that the book is not a tutorial for beginners to learn general VLSI design
methodology. Instead, it should serve as a reference book for engineers to gain the knowledge
of advanced VLSI architecture and system design techniques. Moreover, this book also
includes many in-depth and optimized designs for advanced applications in signal processing
and communications. Therefore, it is also intended as a reference text for graduate students
and researchers pursuing in-depth study of specific topics.
The editors are most grateful to all coauthors for their contributions to the chapters in their
respective areas of expertise. We would also like to acknowledge all the technical editors for
their support and great help.
Contents
Preface V
14. Efficient Built-in Self-Test for Video Coding Cores: A Case Study
on Motion Estimation Computing Array 285
Chun-Lung Hsu, Yu-Sheng Huang and Chen-Kai Chen
17. On the Efficient Design & Synthesis of Differential Clock Distribution Networks 331
Houman Zarrabi, Zeljko Zilic, Yvon Savaria and A. J. Al-Khalili
1. Introduction
Wireless data transmission and high-speed image processing devices have generated a need
for efficient transform methods that can be implemented in a VLSI environment. After the
discovery of the compactly supported discrete wavelet transform (DWT) (Daubechies, 1988;
Smith & Barnwell, 1986) many DWT-based data and image processing tools have
outperformed the conventional discrete cosine transform (DCT) -based approaches. For
example, in JPEG2000 Standard (ITU-T, 2000), the DCT has been replaced by the
biorthogonal discrete wavelet transform. In this book chapter we review DWT structures
intended for VLSI architecture design. In particular, we describe methods for constructing shift
invariant analytic DWTs.
3. Lifting BDWT
The BDWT is most commonly realized by a ladder-type network called the lifting scheme
(Sweldens, 1998). The procedure consists of sequential down- and up-lifting steps, and the
reconstruction of the signal is made by running the lifting network in reverse order (Fig. 2).
Efficient lifting BDWT structures have been developed for VLSI design (Olkkonen et al.,
2005). The analysis and synthesis filters can be implemented with integer arithmetic using
only register shifts and summations. However, the lifting DWT runs sequentially, and this
may be a speed-limiting factor in some applications (Huang et al., 2005). Another drawback
for the VLSI architecture is related to the reconstruction filters, which run in reverse
order, so that two different VLSI realizations are required. In the following we show that the
lifting structure can be replaced by more effective VLSI architectures. We describe two
different approaches: the discrete lattice wavelet transform and the sign modulated BDWT.
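To make the lifting procedure concrete, the following is a minimal sketch of the reversible 5/3 lifting pair (JPEG2000-style predict and update steps), using only integer shifts and additions; the function names and the index-clamping boundary treatment are illustrative assumptions, not details from the text.

```python
def lift_53_forward(x):
    """One level of the 5/3 lifting BDWT on an even-length integer signal.
    Only shifts and adds are used, so integers map to integers."""
    s, d = list(x[0::2]), list(x[1::2])        # polyphase split: even / odd samples
    # predict (down-lifting): odd sample minus the average of its even neighbours
    d = [d[i] - ((s[i] + s[min(i + 1, len(s) - 1)]) >> 1) for i in range(len(d))]
    # update (up-lifting): even sample plus a rounded quarter of the detail sum
    s = [s[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(len(s))]
    return s, d                                 # scaling and wavelet coefficients

def lift_53_inverse(s, d):
    """Reconstruction: run the same lifting steps in reverse order with opposite signs."""
    s = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(len(s))]
    d = [d[i] + ((s[i] + s[min(i + 1, len(s) - 1)]) >> 1) for i in range(len(d))]
    x = [0] * (len(s) + len(d))
    x[0::2], x[1::2] = s, d
    return x
```

Because each step is individually invertible, perfect reconstruction holds for any consistent boundary treatment; note, however, that analysis and synthesis run the steps in opposite orders, which is exactly why two different realizations are needed.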
In matrix form, the perfect reconstruction requirement for the DLWT analysis and synthesis filters is

$$\begin{bmatrix} T_0 & L_0 \\ T_1 & L_1 \end{bmatrix} \begin{bmatrix} W_0 & R_0 \\ W_1 & R_1 \end{bmatrix} = \begin{bmatrix} z^{-k} & 0 \\ 0 & z^{-k} \end{bmatrix} \qquad (3)$$

This is satisfied if we set $W_0 = L_1$, $W_1 = -T_1$, $R_0 = -L_0$ and $R_1 = T_0$. The perfect reconstruction condition then follows from the diagonal elements of (3) as

$$T_0(z)\,L_1(z) - T_1(z)\,L_0(z) = z^{-k} \qquad (4)$$
There exist many approaches to the design of DLWT structures obeying (4), for
example via the Parks-McClellan-type algorithm. The DLWT network is especially efficient
for designing half-band transmission and lattice filters (see details in Olkkonen & Olkkonen,
2007a). For VLSI design it is essential to note that in the lattice structure all computations are
carried out in parallel. Moreover, all BDWT structures designed via the lifting scheme can be
transferred to the lattice network (Fig. 3). For example, Fig. 4 shows the DLWT equivalent
of the lifting BDWT structure consisting of down- and up-lifting steps (Fig. 2). The VLSI
implementation is flexible due to the parallel filter blocks in the analysis and synthesis parts.
Fig. 4. The DLWT equivalent of the lifting BDWT structure described in Fig. 2.
Fig. 6 describes the general BDWT structure using the sign modulator. The VLSI design simplifies
to the construction of two parallel biorthogonal filters and the sign modulator. It should be
pointed out that the scaling and wavelet filters can still be efficiently implemented using the
lifting scheme or the lattice structure. The same biorthogonal DWT/IDWT filter module
can be used in the decomposition and reconstruction of the signal, e.g. in a video compression
unit. Especially in bidirectional data transmission, the DWT/IDWT transceiver has many
advantages compared with two separate transmitter and receiver units. The same VLSI
module can also be used to construct multiplexer-demultiplexer units. Due to the symmetry of
the scaling and wavelet filter coefficients, a fast convolution algorithm can be used for the
implementation of the filter modules (see details in Olkkonen & Olkkonen, 2008).
Fig. 6. The BDWT structure using the scaling and wavelet filters and the sign modulator.
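As a sketch of how the coefficient symmetry enables a cheaper convolution, the folded form below pairs mirrored taps so that each pair costs one multiplication instead of two; the zero-padded boundaries and the function name are illustrative assumptions.

```python
def folded_fir(x, h):
    """'Same'-length convolution of x with a symmetric odd-length filter h,
    folding mirrored taps to roughly halve the multiplications."""
    K = len(h) // 2                        # symmetry h[k] == h[2K - k] is assumed
    xp = [0] * K + list(x) + [0] * K       # zero padding at the boundaries
    y = []
    for n in range(len(x)):
        w = xp[n:n + len(h)]
        acc = h[K] * w[K]                  # centre tap
        for k in range(K):
            acc += h[k] * (w[k] + w[len(h) - 1 - k])   # one shared multiply per pair
        y.append(acc)
    return y
```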
$$\sum_{n} n^{m}\, h_1[n] = 0; \qquad m = 0, 1, \ldots, M-1 \qquad (8)$$
Discrete Wavelet Transform Structures for VLSI Architecture Design
The HBF structure can be implemented using the lifting scheme (Fig. 7). The functioning of
the compression coder can be explained by writing the input signal via the polyphase
components

$$X(z) = X_e(z^2) + z^{-1} X_o(z^2) \qquad (12)$$

where $X_e(z)$ and $X_o(z)$ denote the even and odd sequences. We may present the wavelet
coefficients as

$$W(z) = [X(z)\,H_1(z)]_{\downarrow 2} = z^{-2} X_o(z) - A(z)\,X_e(z) \qquad (13)$$

$A(z)$ works as an approximating filter yielding an estimate of the odd data points based on
the even sequence. The wavelet sequence $W(z)$ can be interpreted as the difference between
the odd points and their estimate. In the tree structured compression coder the scaling sequence
$S(z)$ is fed to the next stage. In many VLSI applications, for example image compression,
the input signal consists of integer-valued sequences. By rounding or truncating the
output of the $A(z)$ filter to integers, the compressed wavelet sequence $W(z)$ is integer-
valued and can be efficiently coded, e.g. using the Huffman algorithm. It is essential to note that
this integer-to-integer transform still has the perfect reconstruction property (2).
Fig. 7. The lifting structure for the HBF wavelet filter designed for the VLSI compression
coder.
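The integer-to-integer property claimed for the coder can be sketched as follows: the synthesis stage adds back exactly the rounded prediction that the analysis stage subtracted, so rounding never breaks invertibility. The two-tap predictor `a` below is a hypothetical placeholder, not the A(z) filter designed in the chapter.

```python
def analyze(x, a=(0.5, 0.5)):
    """Polyphase split of an even-length integer signal into the even part
    and the integer wavelet sequence w = odd - round(prediction from even)."""
    xe, xo = list(x[0::2]), list(x[1::2])
    w = [xo[i] - round(a[0] * xe[i] + a[1] * xe[min(i + 1, len(xe) - 1)])
         for i in range(len(xo))]
    return xe, w

def synthesize(xe, w, a=(0.5, 0.5)):
    """Perfect reconstruction: add back the identical rounded prediction."""
    xo = [w[i] + round(a[0] * xe[i] + a[1] * xe[min(i + 1, len(xe) - 1)])
          for i in range(len(w))]
    x = [0] * (len(xe) + len(xo))
    x[0::2], x[1::2] = xe, xo
    return x
```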
which guarantees smoothness and shift invariance. Selesnick (2002) observed that if two
parallel CQF banks are constructed so that the impulse responses of the scaling
filters are half-sample delayed versions of each other, $h_0[n]$ and $h_0[n - 0.5]$, the
corresponding wavelets are Hilbert transform pairs. In the z-transform domain we should then be
able to construct the scaling filters $H_0(z)$ and $z^{-0.5} H_0(z)$. However, the constructed scaling
filters do not possess coefficient symmetry, and in multi-scale analysis the phase nonlinearity
disturbs the spatial timing and prevents accurate statistical correlations between different
scales. In the following we describe shift invariant BDWT structures especially designed
for VLSI applications.
where the $c_k$ coefficients are designed so that the frequency response approximately follows

$$D(\omega) = e^{-j\omega/2} \qquad (15)$$

Recently, half-delay B-spline filters, which have an ideal phase response, have been introduced.
The method yields linear phase and shift invariant transform coefficients and can
be adapted to any of the existing BDWTs (Olkkonen & Olkkonen, 2007b). The half-sample
delayed scaling and wavelet filters and the corresponding reconstruction filters are

$$\tilde{H}_0(z) = D(z)\, H_0(z)$$
$$\tilde{H}_1(z) = D^{-1}(z)\, H_1(z)$$
$$\tilde{G}_0(z) = D^{-1}(z)\, G_0(z)$$
$$\tilde{G}_1(z) = D(z)\, G_1(z) \qquad (16)$$

The half-delayed BDWT filter bank obeys the perfect reconstruction condition (2). The B-spline
half-delay filters have the IIR structure

$$D(z) = \frac{A(z)}{B(z)} \qquad (17)$$

which can be implemented by the inverse filtering procedure (see details in Olkkonen &
Olkkonen, 2007b).
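The inverse-filtering realization of (17) is specific to the B-spline design; as a simple stand-in (not the authors' filter), a first-order Thiran allpass also approximates a half-sample delay and illustrates the IIR idea in miniature. The coefficient a = (1 - d)/(1 + d) is the standard first-order Thiran choice.

```python
import cmath

def thiran_halfdelay_response(w, d=0.5):
    """Frequency response of the first-order Thiran allpass
    D(z) = (a + z^-1) / (1 + a z^-1), approximating a d-sample delay."""
    a = (1.0 - d) / (1.0 + d)          # a = 1/3 for a half-sample delay
    z1 = cmath.exp(-1j * w)            # z^-1 evaluated on the unit circle
    return (a + z1) / (1.0 + a * z1)
```

Being allpass, the magnitude is exactly 1 at all frequencies; only the phase approximates the ideal half-sample delay, and the approximation degrades toward ω = π.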
For example, the impulse response $h_0[n] = [-1\;\;0\;\;9\;\;16\;\;9\;\;0\;\;{-1}]/32$ has a fourth-order zero at
$\omega = \pi$ and $h_1[n] = [1\;\;0\;\;{-9}\;\;16\;\;{-9}\;\;0\;\;1]/32$ has a fourth-order zero at $\omega = 0$. In the tree structured
HBF DWT the wavelet sequences $w_a[n]$ are analytic. A key feature is that the odd coefficients of the
analytic signal $w_a[2n+1]$ can be reconstructed from the even coefficient values $w_a[2n]$. This
avoids the need to use any reconstruction filters. The HBFs (18) are symmetric with respect
to $\omega = \pi/2$. Hence, the energy in the frequency range $0 \le \omega \le \pi$ is equally divided between the
scaling and wavelet filters, and the energies (absolute values) of the scaling and wavelet
coefficients are statistically comparable. The computation of the analytic signal via the
Hilbert transform requires FFT-based signal processing. However, efficient FFT chips
are available for VLSI implementation. In many respects this method outperforms
the previous nearly shift invariant DWT structures.
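The fourth-order zeros quoted above can be verified numerically through the vanishing-moment condition (8); the moment routine, the choice of the centre tap as origin, and taking the first tap of h0 as -1 are illustrative assumptions.

```python
def moments(h, M, origin):
    """First M moments sum_n (n - origin)^m h[n] of a filter, m = 0..M-1."""
    return [sum((n - origin) ** m * c for n, c in enumerate(h)) for m in range(M)]

# wavelet filter from the text (scaled by 32): four vanishing moments at omega = 0
h1 = [1, 0, -9, 16, -9, 0, 1]
# scaling filter (first tap taken as -1): modulating by (-1)^n moves its
# fourth-order zero from omega = pi to omega = 0, where condition (8) applies
h0 = [-1, 0, 9, 16, 9, 0, -1]
g = [(-1) ** n * c for n, c in enumerate(h0)]
```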
Filtering the real-valued signal $X(z)$ with the Hilbert transform filter $H(z)$ results in an analytic
signal $[1 + jH(z)]X(z)$, whose magnitude response is zero on the negative side of the frequency
spectrum. For example, an integer-valued half-delay filter $D(z)$ for this purpose is obtained
by the B-spline transform (Olkkonen & Olkkonen, 2007b). The frequency response of the
Hilbert transform filter designed by the fourth-order B-spline (Fig. 9) shows a maximally flat
magnitude spectrum. The phase spectrum corresponds to the ideal Hilbert transformer (19).
Fig. 9. Magnitude and phase spectra of the Hilbert transform filter yielded by the fourth
order B-spline transform.
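The FFT-based computation of the analytic signal mentioned above can be sketched with the usual one-sided spectrum construction; a plain O(n^2) DFT is used here for clarity, whereas a VLSI realization would use an FFT core. Names are illustrative.

```python
import cmath

def analytic_signal(x):
    """Analytic signal via the discrete Fourier transform: double the positive
    frequency bins, zero the negative ones, keep DC and Nyquist, then invert."""
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
         for k in range(N)]
    for k in range(N):
        if k == 0 or (N % 2 == 0 and k == N // 2):
            continue                   # DC and Nyquist bins stay unchanged
        if k < (N + 1) // 2:
            X[k] *= 2                  # positive frequencies doubled
        else:
            X[k] = 0                   # negative frequencies removed
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]
```

For a pure cosine input the result is cos + j sin, i.e. the real part reproduces the signal and the imaginary part is its Hilbert transform.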
8. Conclusion
In this book chapter we have described BDWT constructions especially tailored for the VLSI
environment. Most of the VLSI designs in the literature are focused on the biorthogonal 9/7
filters, which have non-integer coefficients and are usually implemented using the lifting scheme
(Sweldens, 1998). However, the lifting BDWT needs two different filter banks for the analysis
and synthesis parts. The speed of the lifting BDWT is also limited due to the sequential
lifting steps. In this work we showed that the lifting BDWT can be replaced by the lattice
structure (Olkkonen & Olkkonen, 2007a). The two-channel DLWT filter bank (Fig. 3) runs
in parallel, which significantly increases the channel throughput. A significant advantage
compared with previous maximally decimated filter banks is that the DLWT structure
allows the construction of half-band lattice and transmission filters. In the tree structured
wavelet transform, the half-band filtered scaling coefficients introduce no aliasing when they are
fed to the next scale. This is an essential feature when the frequency components in each
scale are considered, for example in electroencephalography analysis.
The VLSI design of the BDWT filter bank is simplified essentially by implementing the sign
modulator unit (Fig. 5), which eliminates the need for constructing separate reconstruction
filters. The biorthogonal DWT/IDWT transceiver module uses only two parallel filter
structures. Especially in bidirectional data transmission, the DWT/IDWT module offers
several advantages compared with separate transmit and receive modules, such as
reduced size, low power consumption, and easier synchronization and timing requirements. For
the VLSI designer the DWT/IDWT module appears as a "black box", which readily fits
the data under processing. This may help overcome the relatively big barrier between wavelet
theory and practical VLSI and microprocessor applications. As a design example we
described the construction of the compression coder (Fig. 7), which can be used to compress
integer-valued data sequences, e.g. those produced by analog-to-digital converters.
It is well documented that the real-valued DWTs are not shift invariant: small fractional
time-shifts may introduce significant differences in the energy of the wavelet coefficients.
Kingsbury (2001) showed that the shift invariance is improved by using two parallel filter
banks, which are designed so that the wavelet sequences constitute real and imaginary parts
of the complex analytic wavelet transform. The dual-tree discrete wavelet transform (DT-
DWT) has been shown to outperform the real-valued DWT in a variety of applications such
as denoising, texture analysis, speech recognition, processing of seismic signals and
neuroelectric signal analysis (Olkkonen et al. 2006). Selesnick (2002) made an observation
that a half-sample time-shift between the scaling filters in parallel CQF banks is enough to
produce the analytic wavelet transform, which is nearly shift invariant. In this work we
described the shift invariant DT-BDWT bank (16) based on the half-sample delay filter. It
should be pointed out that the half-delay filter approach yields wavelet bases which are
Hilbert transform pairs, but the wavelet sequences are only approximately shift invariant. In
multi-scale analysis the complex wavelet sequences should be shift invariant. This
requirement is satisfied in the Hilbert transform-based approach (Fig. 8), where the signal in
every scale is Hilbert transformed yielding strictly analytic and shift invariant transform
coefficients. The procedure needs FFT-based computation (Olkkonen et al. 2007c), which
may be an obstacle in many VLSI realizations. To avoid this we described a Hilbert
transform filter for constructing the shift invariant DT-BDWT bank (23). In contrast to the
half-delay filter bank approach (16), the perfect reconstruction condition (2) is attained using
IIR-type Hilbert transform filters, which yield analytic wavelet sequences.
9. References
Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Commun. Pure
Appl. Math., Vol. 41, 909-996.
Huang, C.T., Tseng, P.C. & Chen, L.G. (2005). Analysis and VLSI architecture for 1-D and 2-
D discrete wavelet transform. IEEE Trans. Signal Process. Vol. 53, No. 4, 1575-1586.
ITU-T (2000) Recommend. T.800-ISO DCD15444-1: JPEG2000 Image Coding System.
International Organization for Standardization, ISO/IEC JTC1 SC29/WG1.
Kingsbury, N.G. (2001). Complex wavelets for shift invariant analysis and filtering of
signals. J. Appl. Comput. Harmonic Analysis. Vol. 10, 234-253.
Olkkonen, H., Pesola, P. & Olkkonen, J.T. (2005). Efficient lifting wavelet transform for
microprocessor and VLSI applications. IEEE Signal Process. Lett. Vol. 12, No. 2, 120-
122.
Olkkonen, H., Pesola, P., Olkkonen, J.T. & Zhou, H. (2006). Hilbert transform assisted
complex wavelet transform for neuroelectric signal analysis. J. Neuroscience Meth.
Vol. 151, 106-113.
Olkkonen, J.T. & Olkkonen, H. (2007a). Discrete lattice wavelet transform. IEEE Trans.
Circuits and Systems II. Vol. 54, No. 1, 71-75.
10 VLSI
Olkkonen, H. & Olkkonen, J.T. (2007b). Half-delay B-spline filter for construction of shift-
invariant wavelet transform. IEEE Trans. Circuits and Systems II. Vol. 54, No. 7, 611-
615.
Olkkonen, H., Olkkonen, J.T. & Pesola, P. (2007c). FFT-based computation of shift invariant
analytic wavelet transform. IEEE Signal Process. Lett. Vol. 14, No. 3, 177-180.
Olkkonen, H. & Olkkonen, J.T. (2008). Simplified biorthogonal discrete wavelet transform
for VLSI architecture design. Signal, Image and Video Process. Vol. 2, 101-105.
Selesnick, I.W. (2002). The design of approximate Hilbert transform pairs of wavelet bases.
IEEE Trans. Signal Process. Vol. 50, No. 5, 1144-1152.
Smith, M.J.T. & Barnwell, T.P. (1986). Exact reconstruction for tree-structured subband
coders. IEEE Trans. Acoust. Speech Signal Process. Vol. 34, 434-441.
Sweldens, W. (1998). The lifting scheme: A construction of second generation wavelets.
SIAM J. Math. Anal. Vol. 29, 511-546.
High Performance Parallel Pipelined Lifting-based VLSI Architectures
for Two-Dimensional Inverse Discrete Wavelet Transform
1. Introduction
Two-dimensional discrete wavelet transform (2-D DWT) has evolved as an effective and
powerful tool in many applications especially in image processing and compression. This is
mainly due to its better computational efficiency achieved by factoring wavelet transforms
into lifting steps. The lifting scheme facilitates a high speed and efficient implementation of
the wavelet transform and makes it attractive for both high-throughput and low-power
applications (Lan & Zheng, 2005).
The DWT considered in this work is part of a wavelet-based compression system such as
JPEG2000. Fig. 1 shows a simplified compression system. In this system, the function of the
2-D FDWT is to decompose an N×M image into subbands, as shown in Fig. 2 for 3-level
decomposition. This process decorrelates the highly correlated pixels of the original image.
That is, the decorrelation process reduces the spatial correlation among adjacent pixels of
the original image so that they become amenable to compression.
After transmission to a remote site, the original image must be reconstructed from the
decorrelated image. The task of reconstructing and completely recovering the original
image from the decorrelated image is performed by the inverse discrete wavelet transform
(IDWT).
The decorrelated image shown in Fig. 2 can be reconstructed by the 2-D IDWT as follows. First,
it reconstructs in the column direction subbands LL3 and LH3, column-by-column, to recover
the L3 decomposition. Similarly, subbands HL3 and HH3 are reconstructed to obtain the H3
decomposition. Then the L3 and H3 decompositions are combined row-wise to reconstruct
subband LL2. This process is repeated in each level until the whole image is reconstructed
(Ibrahim & Herman, 2008).
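The column-then-row reconstruction order described above can be sketched in a few lines; for brevity the 1-D synthesis step is an unnormalized Haar pair (s = a + b, d = a - b) standing in for the 5/3 or 9/7 filters, and all names are illustrative.

```python
def inv_pair(s, d):
    """Invert the unnormalized Haar pair s = a + b, d = a - b
    (a stand-in for the 5/3 or 9/7 synthesis filters)."""
    return (s + d) / 2, (s - d) / 2

def idwt2_level(LL, LH, HL, HH):
    """One level of 2-D IDWT: the column processor combines LL/LH -> L and
    HL/HH -> H column-wise, then the row processor interleaves L and H row-wise."""
    half_rows, cols = len(LL), len(LL[0])
    rows = 2 * half_rows
    L = [[0.0] * cols for _ in range(rows)]
    H = [[0.0] * cols for _ in range(rows)]
    for c in range(cols):                          # column processor (CP)
        for r in range(half_rows):
            L[2 * r][c], L[2 * r + 1][c] = inv_pair(LL[r][c], LH[r][c])
            H[2 * r][c], H[2 * r + 1][c] = inv_pair(HL[r][c], HH[r][c])
    img = [[0.0] * (2 * cols) for _ in range(rows)]
    for r in range(rows):                          # row processor (RP)
        for c in range(cols):
            img[r][2 * c], img[r][2 * c + 1] = inv_pair(L[r][c], H[r][c])
    return img
```

Repeating this level by level (recovering LL2 from the level-3 subbands, then LL1, then the image) reproduces the multi-level reconstruction described above.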
The reconstruction process described above implies that the task of reconstruction can be
achieved using two processors (Ibrahim & Herman, 2008). The first processor (the column-
processor) computes column-wise to combine subbands LL and LH into L and subbands HL
and HH into H, while the second processor (the row-processor) computes row-wise to
combine L and H into the next-level subband. The decorrelated image shown in Fig. 2 is
assumed to be residing in an external memory with the same format.
In this chapter, parallelism is explored to best meet real-time applications of the 2-D DWT with
demanding requirements. The single pipelined architecture developed by Ibrahim & Herman
(2008) is extended to 2- and 4-parallel pipelined architectures for both the 5/3 and 9/7
inverse algorithms to achieve speedup factors of 2 and 4, respectively. The advantage of the
proposed architectures is that the total temporary line buffer (TLB) size does not increase
from that of the single pipelined architecture when the degree of parallelism is increased.
[Fig. 2: subband arrangement after 3-level decomposition, with LL3, HL3, LH3, HH3 at level 3, HL2, LH2, HH2 at level 2, and HL1, LH1, HH1 at level 1.]
3. Scan methods
The hardware complexity and hence the memory required for 2-D DWT architecture in
general depends on the scan method adopted for scanning external memory. Therefore, the
scan method shown in Fig. 5 is proposed for both the 5/3 and 9/7 CPs. Fig. 5 (A) is formed for
illustration purposes by merging subbands LL and LH together, where subband LL
coefficients occupy even rows and subband LH coefficients occupy odd rows, while Fig. 5 (B)
is formed by merging subbands HL and HH together.
According to the scan method shown in Fig. 5, the CPs of both the 5/3 and 9/7 should scan external
memory column-by-column. However, to allow the RP, which operates on data generated
by the CP, to work in parallel with the CP as early as possible, the first two columns of
coefficients of A (LL+LH) are interleaved in execution with the first two columns of
coefficients of B (HL+HH) in the first run. In all subsequent runs, two columns are interleaved,
one from A with another from B.
Interleaving of 4 columns in the first run takes place as follows. First, coefficients LL0,0 and
LH0,0 from the first column of (A) are scanned. Second, coefficients HL0,0 and HH0,0 from
the first column of (B) are scanned, then LL0,1 and LH0,1 from the second column of (A)
followed by HL0,1 and HH0,1 from the second column of (B). The scanning process then
returns to the first column of (A) to repeat the process, and so on.
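The first-run interleaving just described can be written out as a small enumeration; the subband labels and tuple layout are illustrative.

```python
def first_run_scan(n_row_pairs):
    """First-run scan order: for each row pair, columns 0 and 1 of A (LL+LH)
    are interleaved with columns 0 and 1 of B (HL+HH)."""
    order = []
    for r in range(n_row_pairs):
        for c in (0, 1):
            order += [("LL", r, c), ("LH", r, c)]   # column c of A
            order += [("HL", r, c), ("HH", r, c)]   # column c of B
    return order
```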
Fig. 3. 5/3 synthesis algorithm’s DDGs for (a) odd and (b) even length signals
Fig. 4. 9/7 synthesis algorithm’s DDGs for (a) odd and (b) even length signals
The advantage of the interleaving process is that it not only speeds up the computations by
allowing the two processors to work in parallel earlier, but also reduces the internal memory
requirement between the CP and RP to a few registers.
The scan method illustrated in Fig. 5 for 5/3 and 9/7 CP along with the DDGs suggest that
the RP should scan coefficients generated by CP according to the scan method illustrated in
Fig. 6. This figure is formed for illustration purposes by merging L and H decompositions
even though they are actually separate. In Fig. 6, L’s coefficients occupy even columns,
while H’s coefficients occupy odd columns. In the first run, the RP’s scan method shown in
Fig. 6 requires considering the first four columns for scanning as follows. First, coefficients
L0,0 and H0,0 from row 0 followed by L1,0 and H1,0 from row 1 are scanned. Then the scan
returns to row 0 and scans coefficients L0,1 and H0,1 followed by L1,1 and H1,1. This
process is repeated as shown in Fig. 6 until the first run completes.
In the second run, coefficients of columns 4 and 5 are considered for scanning by RP as
shown in Fig. 6, whereas in the third run, coefficients of columns 6 and 7 are considered for
scanning and so on.
According to the 9/7 DDGs, the RP's scan method shown in Fig. 6 will generate only one
low output coefficient each time it processes 4 coefficients in the first run, whereas the 5/3 RP
will generate two output coefficients. However, in all subsequent runs, both the 9/7 and 5/3
RPs generate two output coefficients.
[Fig. 6: the RP's scan method over rows 0-4 and columns 0-6, with runs 1 and 2 indicated.]
4. Approach
In (Ibrahim & Herman, 2008), to ease the architecture development, the strategy adopted
was to divide the details of the development into two steps, each having less information to
handle. In the first step, the DDGs were looked at from the outside, as specified by the
dotted boxes in the DDGs, in terms of the input and output requirements. It can be observed
that the DDGs for the 5/3 and 9/7 are identical when looked at from the outside, taking
into consideration only the input and output requirements, but differ in the internal details.
Based on this observation, the first level, called the external architecture, which is identical for
both the 5/3 and 9/7 and consists of a column-processor (CP) and a row-processor (RP), was
developed. In the second step, the internal details of the DDGs for the 5/3 and 9/7 were
considered separately for the development of the processors' datapath architectures, since
the DDGs internally define and specify the internal structure of the processors.
In this chapter, the first level, the external architectures for 2-parallel and 4-parallel, are
developed for both the 5/3 and 9/7 inverse algorithms. Then the processors' datapath
architectures developed in (Ibrahim & Herman, 2008) are modified to fit into the two
proposed parallel architectures' processors.
where l = 2, 3, 4, ... denotes 2-, 3-, and 4-parallel, and $t_p$ is the stage critical path delay of a
k-stage pipelined processor. The clock frequency $f_2$ is determined from Eq. (3) as

$$f_2 = 2k / t_p \qquad (4)$$

The architecture scans the external memory with frequency $f_2$ and operates with
frequency $f_2/2$. Each time, two coefficients are scanned through the two buses labeled bus0
and bus1. The two new coefficients are loaded into the CP1 or CP2 latches Rt0 and Rt1 every
time clock $f_2/2$ makes a negative or a positive transition, respectively.
Fig. 7. (a) Proposed 2-parallel pipelined external architecture for the 5/3 and 9/7 combined.
(b) Waveform of the clocks.
On the other hand, both RP1 and RP2 latches Rt0 and Rt1 simultaneously load new data
from the CP1 and CP2 output latches each time clock $f_2/2$ makes a negative transition.
The dataflow for the 5/3 2-parallel architecture is shown in Table 1, where CPs and RPs are
assumed to be 4-stage pipelined processors. The dataflow for the 9/7 2-parallel architecture is
similar, in all runs, to the 5/3 dataflow except in the first, where RP1 and RP2 of the 9/7
architecture would each generate one output coefficient every other clock cycle, with reference to
clock $f_2/2$. The reason is that each 4 coefficients of a row processed in the first run by RP1
or RP2 of the 9/7 would require, according to the DDGs, two successive low coefficients
from the first level of the DDGs, labeled Y(2n), in order to carry out node 1 computations in
the second level, labeled Y(2n+1). In Table 1, the output coefficients in Rt0 of both RP1 and
RP2 represent the output coefficients of the 9/7 in the first run.
The strategy adopted for scheduling memory columns for CP1 and CP2 of the 5/3 and 9/7
2-parallel architectures, which are scanned according to the scan method shown in Fig. 5, is
as follows. In the first run, both the 5/3 and 9/7 2-parallel architectures are scheduled to
execute 4 columns of memory, two from each of (A) and (B) of Fig. 5. The first two columns
of Fig. 5 (A) are executed in an interleaved manner by CP1, while the first two columns of
Fig. 5 (B) are executed by CP2, also in an interleaved fashion, as shown in the dataflow Table
1. In all other runs, 2 columns are scheduled for execution at a time. One column from (A) of
Fig. 5 will be scheduled for execution by CP1, while another from (B) of Fig. 5 will be
scheduled for CP2. However, if the number of columns in (A) and (B) of Fig. 5 is not equal, then
the last run will consist of only one column of (A). In that case, the last column is scheduled in
CP1 only, but its output coefficients will be executed by both RP1 and RP2. The reason is
that if the last column were scheduled for execution by both CP1 and CP2, they would yield more
coefficients than can be handled by both RP1 and RP2.
On the other hand, scheduling of RP1 and RP2 of the 5/3 and 9/7 2-parallel architectures occurs
according to the scan method shown in Fig. 6. In this scheduling strategy, all even- and
odd-numbered rows in Fig. 6 will be scheduled for execution by RP1 and RP2, respectively. In the
first run, 4 coefficients from each 2 successive rows will be scheduled for RP1 and RP2,
whereas in all subsequent runs, two coefficients of each 2 successive rows will be scheduled
for RP1 and RP2, as shown in Fig. 6. However, if the number of columns in Fig. 6 is odd, which
occurs when the numbers of columns in (A) and (B) of Fig. 5 are not equal, then the last run would
require scheduling one coefficient from each 2 successive rows to RP1 and RP2, as shown in
column 6 of Fig. 6.
In general, all coefficients that belong to even-numbered columns in Fig. 6 are
generated by CP1, and all that belong to odd-numbered columns are generated by CP2. For
example, in the first run, CP1 will first generate two coefficients labeled L0,0 and L1,0 that
belong to locations 0,0 and 1,0 in Fig. 6, while CP2 will generate coefficients H0,0 and H1,0
that belong to locations 0,1 and 1,1. Then coefficients in locations 0,0 and 0,1 are executed by
RP1, while coefficients in locations 1,0 and 1,1 are executed by RP2. Second, CP1 will
generate two coefficients for locations 0,2 and 1,2, while CP2 generates two coefficients for
locations 0,3 and 1,3. Then coefficients in locations 0,2 and 0,3 are executed by RP1, while
coefficients in locations 1,2 and 1,3 are executed by RP2. The same process is repeated in the
next two rows, and so on.
In the second run, first, CP1 generates coefficients of locations 0,4 and 1,4, whereas CP2
generates coefficients of locations 0,5 and 1,5 in Fig. 6. Then coefficients in locations 0,4 and
0,5 are executed by RP1, while coefficients in locations 1,4 and 1,5 are executed by RP2. This
process is repeated until the run completes. However, in the event that the last run processes
only one column of (A), CP1 would first generate coefficients of locations 0,m and 1,m,
where m refers to the last column. Then the coefficient of location 0,m is passed to RP1, while
the coefficient of location 1,m is passed to RP2. Next, CP1 would generate
coefficients of locations 2,m and 3,m. Then 2,m is passed to RP1 and 3,m to RP2, and so on.
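The producer/consumer pattern in the last two paragraphs reduces to a parity rule, sketched here with a hypothetical helper (not part of the architecture itself):

```python
def producer_consumer(row, col):
    """For a coefficient at location (row, col) in Fig. 6: even-numbered
    columns are generated by CP1 and odd-numbered columns by CP2, while
    even-numbered rows are executed by RP1 and odd-numbered rows by RP2."""
    cp = "CP1" if col % 2 == 0 else "CP2"
    rp = "RP1" if row % 2 == 0 else "RP2"
    return cp, rp
```

For example, L0,0 at location 0,0 comes from CP1 and is consumed by RP1, while H1,1 at location 1,1 comes from CP2 and is consumed by RP2, matching the first-run description above.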
In the following, the dataflow shown in Table 1 for the 2-parallel pipelined architecture will be
explained. The first run, which ends at cycle 16 in the table, requires scheduling four
columns as follows. In the first clock cycle, with reference to clock f2, coefficients LL0,0 and
High Performance Parallel Pipelined Lifting-based VLSI Architectures
for Two-Dimensional Inverse Discrete Wavelet Transform 19
LH0,0 from the first column of LL3 and LH3 in the external memory, respectively, are
scanned and loaded into CP1 latches Rt0 and Rt1 by the negative transition of
clock f2/2. The second clock cycle scans coefficients HL0,0 and HH0,0 from the first column
of HL3 and HH3, respectively, through the buses labeled bus0 and bus1 and loads them into
CP2 latches Rt0 and Rt1 by the positive transition of clock f2/2. In the third clock cycle, the
scanning process returns to the second column of subbands LL3 and LH3 in the external
memory, scans coefficients LL0,1 and LH0,1, respectively, and loads them into CP1
latches Rt0 and Rt1 by the negative transition of clock f2/2. The fourth clock cycle scans
coefficients HL0,1 and HH0,1 from the second column of HL3 and HH3, respectively, and
loads them into CP2 latches Rt0 and Rt1. The scanning process then returns to the first
column in subbands LL3 and LH3 to repeat the process until the first run is completed.
In cycle 9, CP1 generates its first two output coefficients, L0,0 and L1,0, which belong to the L3
decomposition, and loads them into its output latches Rtl0 and Rtl1, respectively, by the
negative transition of clock f2/2. In cycle 10, CP2 generates its first two output coefficients,
H0,0 and H1,0, which belong to the H3 decomposition, and loads them into its output latches
Rth0 and Rth1, respectively, by the positive transition of clock f2/2.
In cycle 11, the contents of Rtl0 and Rth0 are transferred to RP1 input latches Rt0 and Rt1,
respectively. The same clock cycle also transfers the contents of Rtl1 and Rth1 to RP2 input
latches Rt0 and Rt1, respectively, while the two coefficients L0,1 and L1,1 generated by CP1
during the cycle are loaded into Rtl0 and Rtl1, respectively, by the negative transition of
clock f2/2.
In cycle 21, RP1 and RP2 each yield their first two output coefficients, which are loaded
into their respective output latches by the negative transition of clock f2/2. The contents of
these output latches are then transferred to external memory, where they are stored in the
first 2 memory locations of rows 0 and 1. The dataflow of the first run then proceeds as
shown in Table 1. The second run begins at cycle 19 and yields its first 4 output coefficients
at cycle 37.
Ck f2 | CP | CP1 & CP2 input latches Rt0, Rt1 | CP1 output latches Rtl0, Rtl1 | CP2 output latches Rth0, Rth1 | RP1 input latches Rt0, Rt1 | RP2 input latches Rt0, Rt1 | Output latches of RP1, RP2: Rt0, Rt1
1 1 LL0,0
LH0,0
2 2 HL0,0
HH0,0
3 1 LL0,1
LH0,1
4 2 HL0,1
HH0,1
5 1 LL1,0
LH1,0
6 2 HL1,0
HH1,0
7 1 LL1,1
LH1,1
8 2 HL1,1
HH1,1
9 1 LL2,0 L0,0 L1,0
LH2,0
10 2 HL2,0 H0,0
HH2,0 H1,0
11 1 LL2,1 L0,1 L1,1 L0,0 L1,0
LH2,1 H0,0 H1,0
RUN 1
12 2 HL2,1 H0,1
HH2,1 H1,1
13 1 LL3,0 L2,0 L3,0 L0,1 L1,1
LH3,0 H0,1 H1,1
14 2 HL3,0 H2,0
HH3,0 H3,0
15 1 LL3,1 L2,1 L3,1 L2,0 L3,0
LH3,1 H2,0 H3,0
16 2 HL3,1 H2,1
HH3,1 H3,1
17 1 ------ ------- L4,0 L5,0 L2,1 L3,1
- H2,1 H3,1
18 2 ------ ------- H4,0
H5,0
19 1 LL0,2 L4,1 L5,1 L4,0 L5,0
LH0,2 H4,0 H5,0
20 2 HL0,2 H4,1
HH0,2 H5,1
RUN 2
24 2 HL2,2 H6,1
HH2,2 H7,1
25 1 LL3,2 ----- ------ L6,1 L7,1 X2,0 X2,1 X3,0
LH3,2 H6,1 H7,1 X3,1
26 2 HL3,2 ----- -----
HH3,2
27 1 L0,2 L1,2 ----- ---- ----- -- X2,2 ---- X3,2
- ---- -----
28 2 H0,2
H1,2
29 1 L2,2 L3,2 L0,2 L1,2 X4,0 X4,1 X5,0
H0,2 H1,2 X5,1
30 2 H2,2
H3,2
31 1 L4,2 L5,2 L2,2 L3,2 X4,2 ---- X5,2
H2,2 H3,2 -----
32 2 H4,2
H5,2
33 1 L6,2 L7,2 L4,2 L5,2 X6,0 X6,1 X7,0
H4,2 H5,2 X7,1
34 2 H6,2
H7,2
35 1 L6,2 L7,2 X6,2 ---- X7,2
H6,2 H7,2 -----
36 2
37 1 X0,3 X0,4 X1,3
X1,4
38 2
39 1 X2,3 X2,4 X3,3
X3,4
Table 1. Dataflow for 2-parallel 5/3 architecture
4.2 Modified CPs and RPs for 5/3 and 9/7 2-parallel external architecture
Each CP of the 2-parallel external architecture is required to execute two columns in an
interleaved fashion in the first run and one column in all other runs. Therefore, the 5/3
processor datapath developed in (Ibrahim & Herman, 2008) should be modified as shown in
Fig. 8 by adding one more stage between stages 2 and 3 of the 5/3 2-parallel external
architecture to allow interleaving of two columns, as described in the dataflow of Table 1.
Through the two multiplexers labeled mux, the processor selects between executing 2
columns and one column. Thus, in the first run, the multiplexers' control signal labeled
s is set to 1 to allow interleaved execution, and to 0 in all other runs. The modified 9-stage CP
for the 9/7 2-parallel external architecture can be obtained by cascading two copies of Fig. 8.
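The effect of the added mux stage can be pictured behaviorally: with s = 1 the CP consumes two columns alternately, with s = 0 a single column sequentially. The following is a sketch of the consumption order only, not of the datapath itself:

```python
def cp_consumption_order(columns, s):
    """Order in which the modified CP of Fig. 8 consumes coefficients.

    s = 1 (first run): two columns are interleaved, one coefficient from
    each in turn.  s = 0 (all other runs): one column, taken in order.
    """
    if s == 1:
        c0, c1 = columns
        order = []
        for x0, x1 in zip(c0, c1):
            order += [x0, x1]
        return order
    (c0,) = columns
    return list(c0)
```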
On the other hand, RP1 and RP2 of the proposed 2-parallel architecture for the 5/3 and 9/7 are
required to scan coefficients of the H and L decompositions generated by CP1 and CP2
according to the scan method shown in Fig. 6. In this scan method, all even-numbered rows
are executed by RP1 and all odd-numbered rows are executed by RP2. That is, while RP1
is executing row 0 coefficients, RP2 will be executing row 1 coefficients, and so on. In
addition, examining the DDGs for the 5/3 and 9/7 shows that applying the scan method of Fig.
6 requires the inclusion of temporary line buffers (TLBs) in RP1 and RP2 of the proposed
2-parallel external architecture, as follows. In the first run, the fourth input coefficient of each
row in the DDGs and the output coefficients labeled X(2) in the 5/3 DDGs and those labeled
Y"(2), Y"(1), and X(0) in the 9/7 DDGs, generated by considering 4 input coefficients in each
row, should be stored in TLBs, since they are required in the next run's computations.
Similarly, in the second run, the sixth input coefficient of each row and the output
coefficients labeled X(4) in the 5/3 DDGs and those labeled Y"(4), Y"(3), and X(2) in the 9/7
DDGs, generated by considering 2 input coefficients in each row, should be stored in TLBs.
Accordingly, the 5/3 requires the addition of 2 TLBs, each of size N, whereas the 9/7 requires
4 TLBs, each of size N. However, since the 2-parallel architecture consists of
two RPs, each 5/3 RP will have 2 TLBs each of size N/2 and each 9/7 RP will have 4 TLBs
each of size N/2, as shown in Fig. 9. Fig. 9 (a) represents the modified 5/3 RP, while (a)
and (b) together represent the modified 9/7 RP for the 2-parallel architecture.
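The TLB bookkeeping above can be summarized numerically. A sketch, where the filter strings and the dictionary layout are ours:

```python
def tlb_requirements(filt, N, num_rps=2):
    """TLB storage implied by the Fig. 6 scan: the 5/3 needs 2 TLBs of
    size N in total and the 9/7 needs 4; split across num_rps row
    processors, each RP holds the same count of TLBs of size N // num_rps."""
    tlbs = {"5/3": 2, "9/7": 4}[filt]
    size = N // num_rps
    return {"tlbs_per_rp": tlbs, "tlb_size": size,
            "total_words": tlbs * num_rps * size}
```

For an N = 512 image, each 5/3 RP holds 2 TLBs of 256 words, for 1024 words of temporary storage in total; the 9/7 doubles that.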
To gain more insight into the operation of the two RPs, the dataflow for 5/3 RP1 is given in Table
2 for the first and second runs. Note that the stage 1 input coefficients in Table 2 are exactly the
same input coefficients of RP1 in Table 1. In the first run, TLBs are only written, but in the
second run and in all subsequent runs, TLBs are read in the first half cycle and written in the
second half cycle. In cycle 15, Table 2 shows that coefficient H0,1 is stored in the first
location of TLB1, while coefficient H2,1 is stored in the second location in cycle 19, and so on.
Run 2 starts at cycle 29. In cycle 30, the first location of TLB1, which contains coefficient
H0,1, is read during the first half cycle of clock f2/2 and is loaded into Rd1 by the positive
transition of the clock, whereas coefficient H0,2 is written into the same location in the
second half cycle. Then, the negative transition of clock f2/2 transfers the contents of Rd1 to
Rt2 in stage 2.
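The read-then-write discipline of the TLBs can be modeled with a toy class (illustrative only; the class and method names are ours):

```python
class TLB:
    """Toy temporary line buffer.  In run 1 a location is only written;
    from run 2 onward each access returns the old value (read in the
    first half cycle) and then overwrites it (written in the second half)."""
    def __init__(self, size):
        self.mem = [None] * size

    def access(self, addr, new_value, first_run=False):
        old = None if first_run else self.mem[addr]  # read, first half cycle
        self.mem[addr] = new_value                   # write, second half cycle
        return old
```

For example, location 0 first stores H0,1 during run 1; in run 2 the same access returns H0,1 for Rd1 while H0,2 replaces it, mirroring the cycle 30 behavior described above.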
Fig. 9. Modified RP for 2-parallel architecture: (a) 5/3; (a) & (b) 9/7
Table 2. Dataflow for 5/3 RP1 for the first and second runs
In Fig. 9 (a), the control signal, s, of the two multiplexers labeled mux is set to 1 during run 1
to pass R0 of both stages 2 and 3, whereas in all other runs it is set to 0 to pass the coefficients
stored in TLB1 and TLB2.
The dataflow for the 4-parallel 9/7 external architecture is similar in all runs to the 5/3 dataflow except in the
first run, where the RPs of the 9/7 architecture, specifically RP3 and RP4, generate a pattern of
output coefficients different from that of the 5/3. RP3 and RP4 of the 9/7 architecture
generate two output coefficients every clock cycle, with reference to clock f4b, as follows. Suppose,
at cycle n, the first two coefficients X(0,0) and X(1,0) generated by RP3 and RP4,
respectively, are loaded into output latch Rt0 of both processors. Then, in cycle n+1, RP3 and
RP4 generate coefficients X(2,0) and X(3,0), followed by coefficients X(4,0) and X(5,0) in cycle
n+2, and so on. Note that these output coefficients are the coefficients generated by RP1 and
RP2 in Table 3.
The strategy used for scheduling memory columns for the CPs of the 5/3 and 9/7 4-parallel
architectures, which resembles the one adopted for the 2-parallel architecture, is as follows. In the
first run, both the 5/3 and 9/7 4-parallel architectures are scheduled to execute 4 columns of
memory, two from (A) and the other two from (B), both of Fig. 5. Each CP is assigned
one column of memory coefficients, as illustrated in the first run of the dataflow
shown in Table 3, whereas in all subsequent runs, 2 columns at a time are scheduled for
execution by the 4 CPs. One column from Fig. 5 (A) is assigned to both CP1 and CP3, while
the other, from Fig. 5 (B), is assigned to both CP2 and CP4, as shown in the second run of
Table 3. However, if the number of columns in (A) and (B) of Fig. 5 is not equal, then the last
run will consist of only one column of (A). In that case, the last column's coefficients are
scheduled in both CP1 and CP3, as shown in the third run of Table 3, since an attempt to
execute the last column using all 4 CPs would result in more coefficients being generated than
can be handled by the 4 RPs.
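Analogously to the 2-parallel case, this 4-CP schedule can be sketched in Python (illustrative only; names are ours):

```python
def schedule_columns_4p(cols_A, cols_B):
    """Column schedule for the 4 CPs (sketch).  Run 1: one column of (A)
    each for CP1 and CP3, one of (B) each for CP2 and CP4.  Later runs:
    one (A) column shared by CP1/CP3 and one (B) column shared by
    CP2/CP4.  An unpaired final (A) column goes to CP1/CP3 only."""
    a, b = list(cols_A), list(cols_B)
    runs = [{"CP1": a[0], "CP3": a[1], "CP2": b[0], "CP4": b[1]}]
    a, b = a[2:], b[2:]
    while a and b:
        ca, cb = a.pop(0), b.pop(0)
        runs.append({"CP1": ca, "CP3": ca, "CP2": cb, "CP4": cb})
    if a:
        ca = a.pop(0)
        runs.append({"CP1": ca, "CP3": ca, "CP2": None, "CP4": None})
    return runs
```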
On the other hand, the scheduling of row coefficients for the RPs, which takes place according to the scan
method shown in Fig. 6, can be understood by examining the dataflow shown in Table 3. At
cycle 13, CP1 generates its first two output coefficients, labeled L0,0 and L1,0, which
correspond to locations 0,0 and 1,0 in Fig. 6, respectively. In cycle 14, CP2 generates its first
two output coefficients, H0,0 and H1,0, which correspond to locations 0,1 and 1,1 in Fig. 6,
respectively. In cycle 15, CP3 generates its first two coefficients, L0,1 and L1,1, which
correspond to locations 0,2 and 1,2 in Fig. 6, respectively. In cycle 16, CP4 generates its first
two output coefficients, H0,1 and H1,1, which correspond to locations 0,3 and 1,3 in Fig. 6.
Note that L0,0, H0,0, L0,1, and H0,1 represent the first 4 coefficients of row 0 in Fig. 6,
whereas L1,0, H1,0, L1,1, and H1,1 represent the first 4 coefficients of row 1.
In cycles 17 and 18, the first two rows' coefficients are scheduled for the RPs as shown in Table 3,
while the CPs are generating the coefficients of the next two rows, row 2 and row 3. Table 3 shows that
the first 4 coefficients of row 0 are scheduled for execution by RP1 and RP3, while the first 4
coefficients of row 1 are scheduled for RP2 and RP4. In addition, note that all coefficients
generated by CP4, which belong to column 3 in Fig. 6, are required in the second run's
computations, according to the DDGs. Therefore, this requires the inclusion of a TLB of
size N/4 in each of the 4 RPs to store these coefficients. The second run, however, requires
these coefficients to be stored in the 4 TLBs in a certain way, as follows. Coefficients H0,1 and
H1,1 generated by CP4 in cycle 16 should be stored in the first location of the TLB of RP1 and
RP2, respectively. These two coefficients are passed to their respective TLBs through
the input latches of RP1 and RP2 labeled Rt2, as shown in cycle 17 of Table 3, whereas
coefficients H2,1 and H3,1 generated by CP4 at cycle 20 should be stored in the first location
of the TLB of RP3 and RP4, respectively. These two coefficients are passed to their respective
TLBs for storage through the input latches of RP3 and RP4 labeled Rt1, as shown in cycle 22
of Table 3. Similarly, coefficients H4,1 and H5,1 generated by CP4 at cycle 24 should be
stored in the second location of the TLB of RP1 and RP2, respectively, and so on. Note that these
TLBs are labeled TLB1 in Fig. 12.
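The alternation just described follows a simple pattern, sketched below (hypothetical helper; indices are ours):

```python
def column3_tlb_destination(pair_index):
    """Where the pair_index-th pair of column-3 coefficients produced by
    CP4 in the first run is stored: even pairs go to the TLB1s of RP1/RP2,
    odd pairs to RP3/RP4, and the TLB location advances every two pairs."""
    rps = ("RP1", "RP2") if pair_index % 2 == 0 else ("RP3", "RP4")
    return rps, pair_index // 2
```

Pair 0 (H0,1/H1,1) lands in location 0 of RP1/RP2, pair 1 (H2,1/H3,1) in location 0 of RP3/RP4, pair 2 (H4,1/H5,1) in location 1 of RP1/RP2, matching the text.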
Fig. 10. (a) Proposed 2-D IDWT 4-parallel pipelined external architecture for combined 5/3 and 9/7 (b) Waveforms of the clocks
CK f4 | CP | CPs input latches Rt0, Rt1 | CPs 1 & 3 output latches Rtl0, Rtl1 | CPs 2 & 4 output latches Rth0, Rth1 | RPs 1 & 3 input latches (RP, Rt0, Rt1, Rt2) | RPs 2 & 4 input latches (RP, Rt0, Rt1, Rt2) | RPs 1 & 3 output latches Rt0, Rt1 | RPs 2 & 4 output latches Rt0, Rt1
1 1 LL0,0
LH0,0
2 2 HL0,0
HH0,0
3 3 LL0,1
LH0,1
4 4 HL0,1
HH0,1
5 1 LL1,0
LH1,0
6 2 HL1,0
HH1,0
7 3 LL1,1
LH1,1
8 4 HL1,1
HH1,1
9 1 LL2,0
LH2,0
10 2 HL2,0
HH2,0
11 3 LL2,1
LH2,1
12 4 HL2,1
HH2,1
13 1 LL3,0 L0,0
LH3,0 L1,0
14 2 HL3,0 H0,0
HH3,0 H1,0
15 3 LL3,1 L0,1
LH3,1 L1,1
16 4 HL3,1 H0,1
RUN 1
HH3,1 H1,1
17 1 LL4,0 --- L2,0 1 L0,0 2 L1,0
--- L3,0 H0,0 H0,1 H1,0 H1,1
18 2 HL4,0 -- H2,0 3 L0,1 4 L1,1
---- H3,0 H0,1 H0,0 H1,1 H1,0
19 3 LL4,1 --- L2,1
--- L3,1
20 4 HL4,1 -- H2,1
---- H3,1
21 1 LL0,2 L4,0 1 L2,0 2 L3,0
LH0,2 L5,0 H2,0 ----- H3,0 -----
RUN 2
LH1,2 L5,1
24 4 HL1,2 H4,1
HH1,2 H5,1
25 1 LL2,2 L6,0 1 L4,0 2 L5,0
LH2,2 L7,0 H4,0 H4,1 H5,0 H5,1
26 2 HL2,2 H6,0 3 L4,1 4 L5,1
HH2,2 H7,0 H4,1 H4,0 H5,1 H5,0
27 3 LL3,2 L6,1
LH3,2 L7,1
28 4 HL3,2 H6,1
HH3,2 H7,1
29 1 LL4,2 --- L8,0 ---- 1 L6,0 2 L7,0
--- - H6,0 ----- H7,0 -----
30 2 HL4,2 -- H8,0 --- 3 L6,1 4 L7,1
---- -- H6,1 H6,0 H7,1 H7,0
31 3 ------- --- L8,1 ----
---- -
32 4 ------- --- H8,1 ---
---- --
33 1 LL0,3 L0,2 1 L8,0 2 ----- --- X0,0 --- X1,0 ---
LH0,3 L1,2 H8,0 H8,1 - ---- - --
34 2 ----- -- H0,2 3 L8,1 4 ----- --- X0,1 X1,1
----- H1,2 H8,1 H8,0 - ---- X0,2 X1,2
35 3 LL1,3 L2,2
LH1,3 L3,2
36 4 ------ --- H2,2
---- H3,2
37 1 LL2,3 L4,2 1 L0,2 2 L1,2 X2,0 --- X3,0 ---
LH2,3 L5,2 H0,2 ----- H1,2 ----- - --
38 2 ------- --- H4,2 3 L2,2 4 L3,2 X2,1 X3,1
---- H5,2 H2,2 ----- H3,2 ----- X2,2 X3,2
39 3 LL3,3 L6,2
LH3,3 L7,2
40 4 ------- --- H6,2
---- H7,2
41 1 LL4,3 --- L8,2 ---- 1 L4,2 2 L5,2 X4,0 --- X5,0 ---
--- - H4,2 ----- H5,2 ----- - --
RUN 3
Table 3. Dataflow for 4-parallel 5/3 architecture
At cycle 33, RP1 and RP2 yield their first output coefficients X0,0 and X1,0, respectively,
which must be stored in the external memory locations 0,0 and 1,0, respectively. Note that
indexes of each output coefficient indicate external memory location where the coefficient
should be stored.
The second run, which requires scheduling two columns for execution by the CPs, starts at cycle
21. In cycle 33, CP1 generates its first two output coefficients, L0,2 and L1,2, which belong to
locations 0,4 and 1,4 in Fig. 6, respectively. In cycle 34, CP2 generates coefficients H0,2 and
H1,2 which belong to locations 0,5 and 1,5 in Fig. 6. In cycle 35, CP3 generates coefficients
L2,2 and L3,2, which belong to locations 2,4 and 3,4 in Fig. 6, whereas in cycle 36, CP4
generates coefficients H2,2 and H3,2 that belong to locations 2,5 and 3,5 in Fig. 6. From the
above description it is clear that these 8 coefficients are distributed along 4 rows, 0 to 3 in
Fig. 6 with each row having 2 coefficients. Table 3 shows that in cycle 37, the two coefficients
of row 0, L0,2 and H0,2, and the two coefficients of row 1, L1,2 and H1,2 are scheduled for
RP1 and RP2, respectively, while coefficients L4,2 and L5,2 generated by CP1 during the
cycle are loaded into Rtl0 and Rtl1, respectively. In cycle 38, the two coefficients of row 2,
L2,2 and H2,2, and that of row 3 L3,2 and H3,2 are scheduled for RP3 and RP4, respectively,
while coefficients H4,2 and H5,2 generated by CP2 during the cycle are loaded into Rth0
and Rth1, respectively. In cycle 53, RP1 and RP2 generate the first output coefficients of the
second run.
4.4 Column and row processors for 5/3 and 9/7 4-parallel external architecture
The 5/3 and 9/7 processor datapath architectures proposed in (Ibrahim & Herman,
2008) were developed assuming the processors scan external memory either row by row or
column by column. However, the CPs and RPs of the 4-parallel architecture are required to scan
external memory according to the scan methods shown in Figs. 5 and 6, respectively. The 4-
parallel architecture, in addition, introduces a requirement for interactions among the
processors in order to accomplish their task. Therefore, the processor datapath
architectures proposed in (Ibrahim & Herman, 2008) should be modified based on the scan
methods, taking into consideration also the requirement for interactions, so that they fit
into the 4-parallel architecture's processors. Thus, in the following, the modified CPs will be developed
first, followed by the RPs.
These 3 coefficients are required, according to the DDGs, to compute the high
coefficient, X(1).
In cycle 31, the positive transition of clock f4a transfers Rt0 and Rt1 in stage 2 of CP3 and Rt0
in stage 2 of CP1 to stage 3 latches Rt0, Rt1, and Rt2 of CP3, respectively, to compute the
second high coefficient, X(3), while coefficient X(6), calculated in stage 1 of CP3, and the
coefficient in Rt1 are loaded into Rt0 and Rt1 of stage 2. As indicated in cycle 31 of Table 3,
no coefficients are loaded into CP3 first stage latches.
In cycle 33, the negative transition of clock f4a loads the first high coefficient, X(1), calculated
in stage 3 of CP1, and Rt0, which contains X(0), into CP1 output latches Rt0 and Rt1,
respectively. Coefficients X(0) and X(1) are labeled L0,2 and L1,2 in Table 3. The same
negative transition of clock f4a transfers the contents of Rt0 and Rt1 in stage 2 of CP1 and Rt0 in
stage 2 of CP3 to Rt0, Rt1, and Rt2 in stage 3 of CP1, respectively, to compute the third
high coefficient, X(5), and so on.
Fig. 12. (a) Modified 5/3 RPs 1 and 3 for 4-parallel External Architecture
In the following, the dataflow of the processor datapath architecture shown in Fig. 12 (a)
will be described in detail, starting from cycle 17 in Table 3. The detailed description will
enable us to fully understand the behavior of the processor. In cycle 17, the negative
transition of clock f4a (consider it the first cycle of f4a) loads coefficients L0,0 and H0,0, which
represent the first two coefficients of row 0 in Fig. 6, and H0,1 into RP1 first stage latches
Rt0, Rt1, and Rt2, respectively. During the positive (high) pulse of clock f4a, the coefficient in Rt2
is stored in the first location of TLB1.
In cycle 18, Table 3 shows that the negative transition of clock f4b (consider it the first cycle of
f4b) loads coefficients L0,1, H0,1, and H0,0 of row 0 into RP3 first stage latches Rt0, Rt1, and Rt2,
respectively. In cycle 21, the negative transition of the second cycle of clock f4a transfers
contents of RP1 latches Rt0 and Rt1 of the first stage to stage 2 latches Rt0 and Rt1,
respectively, to compute the first low coefficient, X0,0, while loading two new coefficients
L2,0 and H2,0, which are the first two coefficients of row 2 in Fig. 6, into RP1 first stage
latches Rt0 and Rt1, respectively.
Fig. 12. (b) Modified 9/7 RPs 1 and 3 for 4-parallel External Architecture
During the second cycle of clock f4a no data are stored in TLB1 of RP1.
In cycle 22, the negative transition of the second cycle of clock f4b transfers contents of RP3
latches Rt0, Rt1 and Rt2 of the first stage to stage 2 latches Rt0, Rt1 and Rt2 to compute the
second low coefficient of row 0, X0,2, while loading two new coefficients L2,1, H2,1, and
High Performance Parallel Pipelined Lifting-based VLSI Architectures
for Two-Dimensional Inverse Discrete Wavelet Transform 35
H2,0 into RP3 first stage latches Rt0, Rt1 and Rt2, respectively. During the second cycle of
clock f4b, the positive pulse stores content of Rt1 of the first stage in the first location of TLB1
of RP3.
In cycle 25, the negative transition of the third cycle of clock f4a loads coefficient X0,0
computed in stage 2 of RP1, into Rt0 of stage 3 and transfers contents of Rt1 and Rt0 of stage
1 into Rt1 and Rt0 of stage 2 in order to compute the first low coefficient, X2,0 of row 2
labeled X(0) in the 5/3 DDGs, while loading new coefficients L4,0, H4,0, and H4,1 of row 4
in Fig. 6, into RP1 first stage latches Rt0, Rt1 and Rt2, respectively. During the third cycle of
clock f4a, the positive pulse stores Rt2 of the first stage in the second location of TLB1 of RP1.
In cycle 26, the negative transition of the third cycle of clock f4b loads coefficient X0,2
calculated in stage 2 of RP3 into both Rt0 in stage 3 of RP3 and Rd3 in stage 3 of RP1, while
content of Rt2 in stage 2 of RP3 is transferred to Rt1 of stage 3 and that of Rt0 in stage 3 of
RP1 to Rd1 in stage 3 of RP3. The same negative transition of clock f4b also transfers Rt0, Rt1,
and Rt2 of stage 1 into Rt0, Rt1, and Rt2 of stage 2, respectively, to compute the second low
coefficient, X2,2 of row 2, while loading new coefficients L4,1, H4,1, and H4,0 of row 4 into
RP3 first stage latches Rt0, Rt1, and Rt2, respectively. During the third cycle of clock f4b,
coefficient in Rt1 is not stored in TLB1 of RP3. It is important to note that Rd3 in stage 3 of
RP1, which holds coefficient X0,2, will be stored in the first location of TLB2 by the positive
pulse of the third cycle of clock f4a.
In cycle 29, the fourth cycle's negative transition of clock f4a transfers Rt0 in stage 3 of RP1,
which contains X0,0, to Rt0 of stage 4, while X2,0 calculated in stage 2 is loaded into Rt0 of
stage 3. The same negative transition of clock f4a transfers Rt1 and Rt0 of stage 1 to Rt1 and
Rt0 of stage 2, respectively, to compute the first low coefficient of row 4, X4,0, and loads new
coefficients L6,0 and H6,0 into Rt0 and Rt1 of stage 1, respectively. During the fourth cycle
of clock f4a, the coefficient in Rt2 of stage 1 is not stored in TLB1 of RP1.
In cycle 30, the negative transition of the fourth cycle of clock f4b transfers contents of Rt0,
Rt1, and Rd1 in stage 3 of RP3 to Rt0, Rt1, and Rt2 of the next stage, respectively, to compute
the first high coefficient, X0,1 of row 0. While coefficient X2,2 calculated in stage 2 of RP3 is
transferred to both Rt0 in stage3 of RP3 and Rd3 in stage 3 of RP1 and coefficient X2,0 in Rt0,
in stage 3 of RP1, is loaded into Rd1 in stage 3 of RP3, whereas Rt2 of stage 2 is transferred
to Rt1 in stage 3 of RP3. The same negative transition of clock f4b transfers contents of Rt0,
Rt1, and Rt2 of stage 1 to Rt0, Rt1, and Rt2 of stage 2 to compute the second low coefficient,
X4,0 of row 4, while loading new coefficients L6,1, H6,1, and H6,0 of row 6 into RP3 first
stage latches Rt0, Rt1, and Rt2, respectively. During the fourth cycle’s positive pulse of clock
f4b, content of Rt1 in stage 1 of RP3 will be stored in the second location of TLB1 and content
of Rt0, X2,2, in stage 3 of RP3 will be stored in the first location of TLB2, while content of
Rd3, X2,2, in stage 3 of RP1 will not be stored in TLB2 of RP1.
In cycle 33, the negative transition of the fifth cycle of clock f4a transfers content of Rt0, X0,0,
in stage 4 of RP1 to RP1 output latch Rt0, as first output coefficient and Rt0 of stage 3
holding coefficient X2,0 to Rt0 of stage 4, while coefficient X4,0 calculated in stage 2 is
loaded into Rt0 of stage 3. The same negative transition of clock f4a transfers contents of Rt0
and Rt1 of stage 1 to Rt0 and Rt1 of stage 2 to compute the first low coefficient, X6,0 of row
6, while loading new coefficients L8,0, H8,0, and H8,1 of row 8 into RP1 first stage latches
Rt0, Rt1, and Rt2, respectively. During the fifth cycle of clock f4a, the positive pulse of the
clock stores content of Rt2 in stage 1 of RP1 into the third location of TLB1.
In cycle 34, the negative transition of the fifth cycle of clock f4b transfers content of Rt0, X0,2,
and the high coefficient, X0,1 computed in stage 4 of RP3 to RP3 output latches Rt0 and Rt1,
respectively, while loading contents of Rt0, Rt1, and Rd1 of stage 3 into Rt0, Rt1, and Rt2 of
stage 4 to compute the first high coefficient of row 2, X2,1. Furthermore, the same negative
transition of clock f4b also transfers coefficient X4,2 calculated in stage 2 of RP3 to both Rt0 in
stage 3 of RP3 and Rd3 in stage 3 of RP1. It also transfers coefficient X4,0 in Rt0, in stage 3 of
RP1, and content of Rt2 in stage 2 of RP3 to stage 3 of RP3 latches Rd1 and Rt1, respectively,
and contents of Rt0, Rt1, and Rt2 of stage 1 to Rt0, Rt1, and Rt2 in stage 2 of RP3, while
loading new coefficients L8,1, H8,1, and H8,0 into Rt0, Rt1, and Rt2 of stage 1. Content of
Rd3, X4,2 in stage 3 of RP1 will be stored in the second location of TLB2 by the positive
pulse of the fifth cycle of clock f4a.
In cycle 37, the second run of the RPs begins when the negative transition of the sixth cycle
of clock f4a loads two new coefficients, L0,2 and H0,2, which are the fifth and sixth
coefficients of row 0, into first stage latches Rt0 and Rt1 of RP1, respectively. During the first
half (low pulse) of the sixth cycle, the first location of TLB1 is read and loaded into Rd1 by the
positive transition of clock f4a, whereas during the second half cycle, the content of Rt1 is
written into the first location of TLB1.
In cycle 38, the negative transition of the sixth cycle of clock f4b loads two new coefficients
L2,2 and H2,2, the fifth and sixth coefficients of row 2, into RP3 first stage latches Rt0 and
Rt1, respectively. The first half cycle of clock f4b reads the first location of TLB1 and loads it
into Rd3 by the positive transition of the clock, whereas the second half cycle writes content
of Rt1 in the same location of TLB1.
In cycle 41, the negative transition of the seventh cycle of clock f4a transfers contents of Rt0,
Rt1, and Rd1 of stage 1 to stage 2 of RP1 latches Rt0, Rt1, and Rt2, respectively, to compute
the third low coefficient, X0,4 of row 0, while loading two new coefficients L4,2 and H4,2 of
row 4 into RP1 first stage latches Rt0 and Rt1 respectively.
Note that during run 2 all RPs execute independently, with no interactions among them. In
addition, in the first run, if the first coefficient generated by stage 2 of RP3 is stored in TLB2
of RP1, then the second coefficient should be stored in TLB2 of RP3, and so on. TLB1, TLB3,
and TLB4 of both RP1 and RP3 are handled similarly. Furthermore, during the whole
period of run 1, the control signals of the three extension multiplexers labeled muxe0, muxe1,
and muxe2 in RP1 should be set to 0, according to Table 4 of (Ibrahim & Herman, 2008), whereas
those in RP3 should be set to normal, as shown in the second line of Table 4, since RP3
executes normal computations during that period. However, in the second run and in all
subsequent runs except the last run, the extension multiplexers' control signals in all RPs are
set to normal. Moreover, the multiplexer labeled muxco in stage 4 is needed only in the
combined 5/3 and 9/7 architecture; otherwise, it can be eliminated, and the Rt2 output can be
connected directly to the Rt0 input of the next stage in the case of the 9/7, whereas in the 5/3, Rt0 is
connected directly to output latch Rt0. In the combined architecture, signal sco of muxco is
set to 0 if the architecture is to perform the 5/3 and to 1 if it is to perform the 9/7.
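The control-signal settings described above can be condensed into a small sketch (last-run boundary settings from Table 4 of Ibrahim & Herman (2008) are not reproduced here and are deliberately left out):

```python
def extension_mux_controls(run_index, rp):
    """muxe0..muxe2 settings: in run 1 the three signals in RP1 are forced
    to 0 while RP3 runs 'normal'; from run 2 onward (excluding the last
    run, not modeled) every RP runs 'normal'."""
    if run_index == 0 and rp == "RP1":
        return (0, 0, 0)
    return "normal"

def sco(filt):
    """muxco select in the combined architecture: 0 for 5/3, 1 for 9/7."""
    return 0 if filt == "5/3" else 1
```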
It is very important to note that when the RP executes its last set of input coefficients,
according to the 9/7 DDGs for odd and even length signals shown in Fig. 4, it will not yield all
the output coefficients expected by the last run. For example, in the DDG for odd length
signals shown in Fig. 4 (a), when the last input coefficient, labeled Y8, is applied to the RP, it will
yield output coefficients X5 and X6. To get the last remaining two coefficients, X7 and X8, the
RP must execute another run, which will be the last run, in order to compute the remaining
two output coefficients. Similarly, when the last two input coefficients labeled Y6 and Y7 in
the DDG for even length signals shown in Fig. 4 (b) are applied to the 9/7 RP, it yields
output coefficients X3 and X4. To obtain the remaining output coefficients X5, X6, and X7,
two more runs should be executed by the RP according to the DDG. The first run will yield X5
and X6, whereas the last run will yield X7. The details of the computations that take place
during each of these runs can be determined by examining the relevant area of the DDGs.
5. Performance Evaluation
In order to evaluate the performance of the two proposed parallel pipelined architectures in
terms of speedup, throughput, and power as compared with the single pipelined architecture
proposed in (Ibrahim & Herman, 2008), consider the following. Assume subbands HH, HL,
LH, and LL of each level are equal in size. The dataflow table for the single pipelined
architecture (Ibrahim & Herman, 2008) shows that \tau_1 = 20 clock cycles are needed to
yield the first output coefficient. Then, the total number of output coefficients in the first run
of the Jth level reconstruction can be estimated as

    N/2^{J-1}    (6)

and the total number of cycles in the first run is given by

    2N/2^{J-1}    (7)

The total time, T_1, required to yield n pairs of output coefficients for the Jth level
reconstruction on the single pipelined architecture can be estimated as

    T_1 = [\tau_1 + 2N/2^{J-1} + 2(n - (1/2)N/2^{J-1})] t_p/2k
        = (\tau_1 + N/2^{J-1} + 2n) t_p/2k    (8)
On the other hand, the dataflow for the 2-parallel pipelined architecture shows that
\tau_2 = 21 clock cycles are required to yield the first 2 pairs of output coefficients. The total
number of paired output coefficients in the first run of the Jth level reconstruction on the
2-parallel architecture can be estimated as

    (3/2) N/2^{J-1}    (9)

and the total number of 2-paired output coefficients is given by

    (3/4) N/2^{J-1}    (10)

while the total number of cycles in the first run is

    2N/2^{J-1}    (11)

Note that the total number of paired output coefficients of the first run in each level of
reconstruction starting from the first level can be written as

    (3/2)N, (3/2)N/2, (3/2)N/4, ..., (3/2)N/2^{J-1}

where the last term is Eq. (9).
The total time, T_2, required to yield n pairs of output coefficients for the Jth level
reconstruction of an NxM image on the 2-parallel architecture can be estimated as

    T_2 = [\tau_2 + 2N/2^{J-1} + 2(n/2 - (3/4)N/2^{J-1})] t_p/2k    (12)

    T_2 = (\tau_2 + (1/2)N/2^{J-1} + n) t_p/2k    (13)

The term 2(n/2 - (3/4)N/2^{J-1}) in (12) represents the total number of cycles of run 2 and all
subsequent runs.
The speedup factor, S_2, is then given by

    S_2 = T_1/T_2 = (\tau_1 + N/2^{J-1} + 2n) / (\tau_2 + (1/2)N/2^{J-1} + n)

For large n, the above equation reduces to

    S_2 = 2((1/2)N/2^{J-1} + n) / ((1/2)N/2^{J-1} + n) = 2    (14)

Eq. (14) implies that the 2-parallel architecture is 2 times faster than the single pipelined
architecture.
Similarly, the dataflow for the 4-parallel pipelined architecture shows that \tau_4 = 33 clock
cycles are needed to yield the first two output coefficients. From the dataflow table of the
4-parallel architecture it can be estimated that RP1 and RP2, in the first run of the Jth level
reconstruction, together yield (1/2)N/2^{J-1} pairs of output coefficients, whereas RP3 and
RP4 together yield N/2^{J-1} pairs of output coefficients, a total of (3/2)N/2^{J-1} pairs of
output coefficients in the first run. The total number of cycles in run 1 is then given by

    4(N/2^{J-1})/2 = 2N/2^{J-1}    (15)

Thus, the total time, T_4, required to yield n pairs of output coefficients for the Jth level
reconstruction of an NxM image on the 4-parallel architecture can be estimated as

    T_4 = [\tau_4 + 2N/2^{J-1} + (n - (3/2)N/2^{J-1})] t_p/4k    (16)

    T_4 = (\tau_4 + (1/2)N/2^{J-1} + n) t_p/4k    (17)

The term (n - (3/2)N/2^{J-1}) in (16) represents the total cycles of run 2 and all subsequent runs.
For large n, S_4 = T_1/T_4 reduces to 4. Thus, the 4-parallel architecture is 4 times faster than
the single pipelined architecture.
The throughput, H, defined as the number of output coefficients generated per unit time, can
be written for each architecture as

    H(single) = n / [(\tau_1 + N/2^{J-1} + 2n) t_p/2k]    (18)

The maximum throughput, Hmax, occurs when n is very large (n -> infinity); thus, with
f_p = 1/t_p,

    Hmax(single) = 2nk f_p / 2n = k f_p    (19)

    H(2-parallel) = n / [(\tau_2 + (1/2)N/2^{J-1} + n) t_p/2k]

    Hmax(2-parallel) = 2kn f_p / n = 2k f_p    (20)

    H(4-parallel) = n / [(\tau_4 + (1/2)N/2^{J-1} + n) t_p/4k]

    Hmax(4-parallel) = 4kn f_p / n = 4k f_p    (21)
Thus, the throughputs of the 2-parallel and the 4-parallel pipelined architectures have
increased by factors of 2 and 4, respectively, as compared with the single pipelined
architecture.
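As a numerical sanity check of the time and throughput expressions above, the short Python
sketch below (illustrative only; the chapter itself gives no code, and N, J, k, and t_p are
assumed values, apart from the latencies tau1 = 20, tau2 = 21, tau4 = 33 taken from the
dataflow tables) evaluates T1, T2, and T4 for a large n and confirms the limiting factors of 2
and 4.

```python
# Numeric check of Eqs. (8), (13), (17) and the large-n limits in (14) and (19)-(21).
# N, J, k, t_p are illustrative assumptions, not values from the chapter.

def T1(n, N, J, tau1=20, tp=1.0, k=1):
    # Eq. (8): single pipelined architecture, n output pairs, level-J reconstruction
    return (tau1 + N / 2**(J - 1) + 2 * n) * tp / (2 * k)

def T2(n, N, J, tau2=21, tp=1.0, k=1):
    # Eq. (13): 2-parallel architecture
    return (tau2 + 0.5 * N / 2**(J - 1) + n) * tp / (2 * k)

def T4(n, N, J, tau4=33, tp=1.0, k=1):
    # Eq. (17): 4-parallel architecture
    return (tau4 + 0.5 * N / 2**(J - 1) + n) * tp / (4 * k)

N, J, n = 512, 3, 10**7           # large n so the latency terms vanish
S2 = T1(n, N, J) / T2(n, N, J)    # Eq. (14): tends to 2
S4 = T1(n, N, J) / T4(n, N, J)    # tends to 4
H1 = n / T1(n, N, J)              # tends to k*f_p   (Eq. 19)
H2 = n / T2(n, N, J)              # tends to 2*k*f_p (Eq. 20)
H4 = n / T4(n, N, J)              # tends to 4*k*f_p (Eq. 21)
print(round(S2, 3), round(S4, 3), round(H2 / H1, 3), round(H4 / H1, 3))
```

The throughput ratios H2/H1 and H4/H1 reproduce the factors of 2 and 4 stated above.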
On the other hand, the power consumption of the l-parallel pipelined architecture as compared
with the single pipelined architecture can be obtained as follows. Let P_1 and P_l denote the
power consumption of the single and l-parallel architectures without the external memory,
and P_m1 and P_ml denote the power consumption of the external memory for the single and
l-parallel architectures, respectively. The power consumption of a VLSI architecture can be
estimated as

    P = C_total V_0^2 f

where C_total denotes the total capacitance of the architecture, V_0 is the supply voltage, and f
is the clock frequency. Then, with clock frequencies f_1 = 2k/t_p and f_l = l k/t_p,

    P_1 = C_total V_0^2 f_1,    P_l = l C_total V_0^2 f_l

and

    P_l / P_1 = (l C_total V_0^2 f_l) / (C_total V_0^2 f_1) = l f_l / f_1
              = l (l k/t_p) / (2k/t_p) = l^2/2    (22)
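The power ratio above can be checked numerically; the sketch below is a minimal Python
illustration (the chapter gives no code), with k and t_p as arbitrary assumed values and the
clock frequencies taken from the cycle-time factors of the time equations.

```python
# Sketch of the power ratio P_l/P_1. The clock frequencies follow from the
# cycle-time factors of the time equations: f1 = 2k/tp for the single
# architecture and fl = l*k/tp for the l-parallel one; k and tp are
# arbitrary illustrative values.

def power_ratio(l, k=1, tp=1.0):
    f1 = 2 * k / tp          # single pipelined architecture clock frequency
    fl = l * k / tp          # l-parallel architecture clock frequency
    return l * fl / f1       # P_l/P_1 = l * fl / f1 = l**2 / 2

print(power_ratio(2), power_ratio(4))  # 2.0 8.0
```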
6. Comparisons
Table 5 provides comparison results of the proposed architectures with the most recent
architectures in the literature. The architecture proposed in (Lan & Zheng, 2005) achieves a
critical path of one multiplier delay using a very large number of pipeline registers (52
registers). In addition, it requires a total line buffer of size 6N, which is a very expensive
memory component, while the proposed architectures require only 4N. In (Rahul & Preeti,
2007), a critical path delay of Tm + Ta is achieved through an optimal dataflow graph, but it
requires a total line buffer of size 10N.
In (Wang et al., 2007), by rewriting the lifting-based DWT of the 9/7, the critical path delay
of the pipelined architecture has been reduced to one multiplier delay, but it requires a
total line buffer of size 5.5N. In addition, it requires real floating-point multipliers with long
delay that cannot be implemented using the arithmetic shift method (Qing & Sheng, 2005).
(Qing & Sheng, 2005) has illustrated that the multipliers used for the scale factor k and the
coefficients α, β, γ, and δ of the 9/7 filter can be implemented in hardware using
only two adders. Moreover, the folded architecture, which uses one module to perform both
the predictor and update steps, in fact increases the hardware complexity, e.g., through the
use of several multiplexers, and the control complexity. In addition, using one module for
both the predictor and update steps implies that the two steps have to be sequenced, which
slows down the computation process.
In the 2-parallel architecture proposed in (Bao & Yong, 2007), writing results generated by
CPs into MM (main memory) and then switching them out to external memory for the next
level of decomposition is a real drawback, since the external memory in real-time applications,
e.g., in a digital camera, actually consists of charge-coupled devices, which can only be
scanned. In addition, it requires a total line buffer of size 5N for 5/3 and 7N for 9/7, while
the proposed architectures require 2N and 4N for 5/3 and 9/7, respectively. The architecture
also requires the use of several FIFO buffers in the datapath, which are complex and very
expensive memory components, while the proposed architectures require no such memory
components.
The 2-parallel architecture proposed in (Cheng et al., 2006) requires a total line buffer of size
5.5N and the use of two-port RAM to implement FIFOs, whereas the proposed architectures
require only single-port RAM.
7. Conclusions
In this chapter, two high performance parallel VLSI architectures for 2-D IDWT are
proposed that meet high-speed, low-power, and low memory requirements for real-time
applications. The two parallel architectures achieve speedup factors of 2 and 4 as compared
with single pipelined architecture.
• The advantages of the proposed parallel architectures are:
i. They require only a total temporary line buffer (TLB) of size 2N and 4N in 5/3 and 9/7,
respectively.
ii. The scan method adopted not only reduces the internal memory between CPs and
RPs to a few registers, but also reduces the internal memory or TLB size in the CPs
to a minimum and allows the RPs to work in parallel with the CPs earlier in the
computation.
iii. The proposed architectures are simple to control, and their control algorithms can
be readily developed.
8. References
Bao-Feng, L. & Yong, D. (2007). "FIDP: A novel architecture for lifting-based 2D DWT in
JPEG2000," MMM (2), Lecture Notes in Computer Science, vol. 4352, pp. 373-382,
Springer, 2007.
Cheng-Yi, X.; Jin-Wen, T. & Jian, L. (2006). "Efficient high-speed/low-power line-based
architecture for two-dimensional discrete wavelet transforms using lifting scheme,"
IEEE Trans. on Circuits and Systems for Video Technology, Vol. 16, No. 2, February
2006, pp. 309-316.
Dillen, G.; Georis, B.; Legat, J.-D. & Cantineau, O. (2003). "Combined Line-based Architecture
for the 5-3 and 9-7 Wavelet Transform of JPEG2000," IEEE Trans. on Circuits and
Systems for Video Technology, Vol. 13, No. 9, Sep. 2003, pp. 944-950.
Ibrahim Saeed Koko & Herman Agustiawan (2008). "Pipelined lifting-based VLSI
architecture for two-dimensional inverse discrete wavelet transform," Proceedings
of the IEEE International Conference on Computer and Electrical Engineering,
ICCEE 2008, Phuket Island, Thailand.
Lan, X. & Zheng, N. (2005). "Low-Power and High-Speed VLSI Architecture for Lifting-
Based Forward and Inverse Wavelet Transform," IEEE Trans. on Consumer
Electronics, Vol. 51, No. 2, May 2005, pp. 379-385.
Qing-Ming Yi & Sheng-Li Xie (2005). "Arithmetic shift method suitable for VLSI
implementation of the CDF 9/7 discrete wavelet transform based on lifting scheme,"
Proceedings of the Fourth Int. Conf. on Machine Learning and Cybernetics,
Guangzhou, August 2005, pp. 5241-5244.
Rahul, J. & Preeti, R. (2007). "An efficient pipelined VLSI architecture for lifting-based 2D
discrete wavelet transform," ISCAS 2007, IEEE, pp. 1377-1380.
Wang, C.; Wu, Z.; Cao, P. & Li, J. (2007). "An efficient VLSI architecture for lifting-based
discrete wavelet transform," IEEE International Conference on Multimedia and
Expo, 2007, pp. 1575-1578.
Contour-Based Binary Motion Estimation Algorithm and VLSI Design
for MPEG-4 Shape Coding 43
1. Introduction
MPEG-4 is a new international standard for multimedia communication [1]. It provides a set
of tools for object-based coding of natural and synthetic video/audio. MPEG-4 also enables
content-based functionalities by introducing the concept of the video object plane (VOP), and
such a content-based representation is a key to enabling interactivity with objects for a variety
of multimedia applications. The VOP is composed of texture components (YUV) and an
alpha component [2]-0. The texture component contains the colorific information of the video
object, and the alpha component contains the information identifying its pixels: the pixels
inside an object are opaque, and the pixels outside the object are
transparent. MPEG-4 supports a content-based representation by allowing the coding of the
alpha component along with the object texture and motion information. Therefore, MPEG-4
shape coding becomes the key technology for supporting content-based video coding.
MPEG-4 shape coding mainly comprises the following coding algorithms: binary motion
estimation/motion compensation (BME/BMC), context-based arithmetic encoding
(CAE), size conversion, mode decision, and so on. As the full search (FS) algorithm is adopted
for MPEG-4 shape coding, most of the computational complexity is due to binary motion
estimation (BME). From the profiling of shape coding in Fig. 1, it can be seen that BME
contributes 90% of the total computational complexity of the MPEG-4 shape encoder. It is well
known that an effective and popular technique to reduce the temporal redundancy in BME,
called block-matching motion estimation, has been widely adopted in various video coding
standards, such as MPEG-2 0, H.263 [5] and MPEG-4 shape coding [1]. In block-matching
motion estimation, the most accurate strategy is the full search algorithm, which exhaustively
evaluates all possible candidate motion vectors over a predetermined neighborhood search
window to find the global minimum block distortion position.
Fast BME algorithms for MPEG-4 shape coding were presented in several previous papers
[6]-[8]. Among these techniques, our previous work, contour-based binary motion
estimation (CBBME), largely reduced the computational complexity of shape coding [9]. It
exploits the properties of boundary search for block-matching motion estimation and adds
the diamond search pattern for further improvement. This algorithm can reduce the
number of search points to 0.6% of that of the full search method, which is
described in the MPEG-4 verification model (VM) [2].
In contrast with algorithm-level developments, architecture-level designs for shape coding
are relatively few. Generally, a complete shape coding method includes different
types of algorithms: the CAE part needs bit-level operations, whereas the binary
motion estimation part needs a high-speed search method. With this combination of
algorithms in shape coding, the implementation is not as straightforward as might be
expected, and it poses some challenges, especially for architecture design. Since MPEG-4
shape coding has high-computation and high-data-traffic properties, it is a good candidate
for an efficient VLSI architecture design. Most of the literature has focused
on the main computation-expensive part, BME, to improve its
performance [10]. Additionally, CAE is also an important part for architecture design and is
discussed in [11]-[12], which utilize the multi-symbol technique to accelerate the arithmetic
coding performance. As regards complete MPEG-4 shape coding, some designs
utilized an array processor to perform the shape coding algorithm [13]-[15], while others
used a pipelined architecture [16]. They can reach relatively high performance at the
expense of high-cost and high-complexity architectures. All of them intuitively apply
the full search algorithm for easy realization of the architecture design. However, the
algorithm-level achievement of a large reduction in computational complexity is attractive
and not negligible. This demonstrates that, without the support of an efficient algorithm, a
straightforward implementation based on the full search algorithm can hardly reach a
cost-effective design.
In this paper, we propose a fast BME algorithm, diamond boundary search (DBS), for
MPEG-4 shape coding to reduce the number of search points. By using the properties of
block-matching motion estimation in shape coding and the diamond search pattern, we can
skip a large number of search points in BME. Simulation results show that the proposed
algorithm can reduce the number of search points to 0.6% of that of the
full search method described in the MPEG-4 verification model (VM) [2]. Compared
with the other fast BME algorithms in [6]-[7], the proposed BME algorithm uses fewer search
points, especially in high-motion video sequences, such as 'Bream' and 'Foreman'. We also
present an efficient architecture design for MPEG-4 shape coding. This architecture is
elaborated based on our fast shape coding algorithm with binary motion estimation. Since this
block-matching motion estimation achieves high performance based on the
information of a boundary mask, the dedicated architecture needs some optimization
to reduce memory accesses and processing cycles. Experimental results also
demonstrate performance equal to that of the full-search-based approach. This paper
contributes a comprehensive exploration of cost-effective architecture design for shape
coding, and is organized as follows.
In Section 2 the binary motion estimation in shape coding is described. We describe the
highlights of the proposed fast BME algorithm for MPEG-4 shape coding in Section 3. The
design exploration on CBBME is described in Section 4. In Section 5, the architecture design
based on this BME algorithm is proposed. In Section 6, we present the implementation
results and give some comparisons. Finally we summarize the conclusions in Section 7.
If more than one MVS minimizes SAD with an identical value, the MVDS that minimizes the
code length of MVDS is selected. If more than one MVS minimizes SAD with an identical
value and an identical code length of MVDS, the MVDS with the smaller vertical element is
selected. If the vertical elements are also the same, the MVDS with the smaller horizontal
element is selected. After binary motion estimation, the motion compensated block is
constructed from the 16x16 BAB with a border of width 1 around it (bordered MC BAB).
Then, context-based arithmetic encoding (CAE) is adopted for shape coding.
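The three-level tie-break above is a lexicographic comparison, which can be sketched as
follows. This is a hypothetical Python illustration; the candidate tuples and their MVDS code
lengths are made up for the example.

```python
# Sketch of the tie-breaking rule for choosing MVS among candidates with equal
# SAD. Candidates are (sad, code_len, mvds_y, mvds_x) tuples; code_len stands
# in for the VLC code length of MVDS (hypothetical values).

def select_mvs(candidates):
    # min() over the tuples applies the rules in order: smallest SAD first,
    # then shortest MVDS code, then smaller vertical, then smaller horizontal.
    return min(candidates)

cands = [(120, 5, 2, -1), (120, 3, 1, 4), (120, 3, 1, -2)]
print(select_mvs(cands))  # (120, 3, 1, -2)
```

Python's built-in tuple ordering performs exactly this field-by-field comparison, so no
custom comparator is needed.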
Fig. 3. The correlation between the current pixel and adjacent pixels, where the gray part
denotes pixels outside the VOP.
Fig. 4. Example of a boundary mask for shape coding. The white area denotes boundary pixels.
A suggested implementation of the proposed boundary search (BS) algorithm for shape
coding proceeds as follows:
Step 1. Perform a pixel loop over the entire reference VOP. If pixel (x,y) is a boundary
pixel, set the mask at (x,y) to '1'. Otherwise, set the mask at (x,y) to '0'.
Step 2. Perform a pixel loop over the entire current BAB. If pixel (i,j) is a boundary
pixel, set (i,j) as the "reference point" and terminate the pixel loop. This step is
illustrated in Fig. 6(b). Therefore, there is only one reference point in the current
BAB.
Step 3. For each search point within the ±16 search range, check the pixel (x+i, y+j), which
is aligned with the "reference point" of the current BAB. If the mask
value at (x+i, y+j) is '1', which means that the reference point is on the
boundary of the reference VOP, the procedure will compute the SAD of the search
point (x, y). Otherwise, the SAD of the search point (x, y) will not be computed, and
the processing continues at the next position. Fig. 6(a) shows an example of this
step. The search points (x1, y1) and (x2, y2) will be skipped by this procedure,
while the SAD will be computed at (x3, y3).
Step 4. When all the search points within the ±16 search range have been processed, the MV
that minimizes the SAD is taken as MVS. Fig. 7 illustrates the overall scheme of the
proposed BS algorithm for MPEG-4 shape coding.
In the worst case, the proposed BS algorithm needs (256 + (16+1)²) determinations to check
whether a pixel is a boundary pixel. For each non-skipped search point, the SAD, obtained
by 256 exclusive-OR operations and 255 additions, is taken as the distortion
measure. However, with the BS algorithm the number of non-skipped search points is
reduced significantly, and the additional computational load due to the BS algorithm is
negligible.
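Steps 1-4 can be sketched as follows. This is an illustrative Python model, not the authors'
implementation: the BAB size (4x4) and search range (±2) are scaled down from the real 16x16
and ±16, the boundary test is a simple 4-neighbour check, and the binary SAD is computed by
XOR of aligned alpha pixels.

```python
def is_boundary(frame, x, y):
    # opaque pixel with a transparent (or out-of-frame) 4-neighbour
    h, w = len(frame), len(frame[0])
    if frame[y][x] == 0:
        return False
    return any(not (0 <= x + dx < w and 0 <= y + dy < h) or frame[y + dy][x + dx] == 0
               for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))

def sad(ref, cur, ox, oy):
    # binary SAD: XOR of aligned pixels; out-of-frame reference pixels read as 0
    h, w = len(ref), len(ref[0])
    return sum(((ref[oy + j][ox + i] if 0 <= ox + i < w and 0 <= oy + j < h else 0)
                ^ cur[j][i])
               for j in range(len(cur)) for i in range(len(cur[0])))

def bs_search(ref, cur, bx, by, rng=2):
    # Step 1: boundary mask over the whole reference frame
    mask = [[is_boundary(ref, x, y) for x in range(len(ref[0]))] for y in range(len(ref))]
    # Step 2: the first boundary pixel of the current BAB becomes the "reference point"
    ri, rj = next((i, j) for j in range(len(cur)) for i in range(len(cur[0]))
                  if is_boundary(cur, i, j))
    best, checked = None, 0
    for v in range(-rng, rng + 1):
        for u in range(-rng, rng + 1):
            x, y = bx + u + ri, by + v + rj
            # Step 3: skip this candidate unless the reference point lands on a mask '1'
            if not (0 <= x < len(ref[0]) and 0 <= y < len(ref)) or not mask[y][x]:
                continue
            checked += 1
            best = min(best or (1 << 30, 0, 0), (sad(ref, cur, bx + u, by + v), v, u))
    return best, checked   # Step 4: (MinSAD, mv_y, mv_x) and the non-skipped count

# toy 3x3 square object; the current BAB matches the reference at zero displacement
ref = [[1 if 2 <= x <= 4 and 2 <= y <= 4 else 0 for x in range(8)] for y in range(8)]
cur = [[0, 0, 0, 0], [0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]]
print(bs_search(ref, cur, 1, 1))
```

On this toy input the minimum SAD of 0 is found at displacement (0, 0), and only the
candidates whose reference point falls on the object contour are evaluated.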
Fig. 5. Example of an effective search area, which has been proposed in reference 0.
We combine the proposed BS algorithm with diamond-shaped zones, called DBS, and give
different thresholds (Thn) for each search zone for further improvement. The procedure is
explained below.
Step 1. Construct diamond-shaped zones around MVPS within the ±16 search window.
Set n=1.
Step 2. Calculate the SAD for each search point in zone n. Let MinSAD be the smallest SAD
so far.
Step 3. If MinSAD ≦ Thn, go to Step 4. Otherwise, set n=n+1 and go to Step 2.
Step 4. The motion vector is chosen according to the block corresponding to MinSAD.
    WSAD = W1·SAD + W2·(|mvds_x| + |mvds_y|)    (1)

    SAD = Σ_i Σ_j |p_{t-1}(i+u, j+v) − p_t(i, j)|    (2)

where mvds_x is the MVDS in the horizontal direction and mvds_y is the MVDS in the vertical
direction. W1 and W2 denote the weighting values for SAD and MVDS, respectively. The
WSAD is evaluated at every search point, and the BAB with the minimum WSAD is selected as
the final matching block.
Fig. 9 shows the algorithm flow of CBBME. First, we construct diamond-shaped zones around
the motion vector predictor for shape (MVPS) within a shrinking search range. Second, we
calculate the WSAD for each search point. Let MinWSAD be the smallest WSAD so far. If
MinWSAD is smaller than or equal to the threshold (Th) of the current search zone, then the
motion vector is chosen according to the block corresponding to MinWSAD; otherwise, the
search moves to the next position and calculates the WSAD again.
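The flow above can be sketched as follows. This is an illustrative Python model: the SAD
surface, the per-zone thresholds, and the zone construction are hypothetical; MVPS is placed
at the origin so that each candidate MV equals its MVDS; and W1 = 1, W2 = 0.5 follow the
hardware-friendly weighting values discussed later in the chapter.

```python
# Sketch of the CBBME decision flow: diamond-shaped zones around MVPS,
# WSAD per Eq. (1), and an early exit once MinWSAD <= Th for the current zone.
# SAD values are fed in precomputed; thresholds are hypothetical.

def wsad(sad_val, mvds_x, mvds_y, w1=1.0, w2=0.5):
    return w1 * sad_val + w2 * (abs(mvds_x) + abs(mvds_y))

def diamond_zone_search(sad_of, mvps, thresholds, rng=4):
    px, py = mvps
    best = None
    for n, th in enumerate(thresholds, start=1):
        # zone n = candidates at Manhattan distance n-1 from MVPS
        zone = [(px + dx, py + dy)
                for dx in range(-rng, rng + 1) for dy in range(-rng, rng + 1)
                if abs(dx) + abs(dy) == n - 1]
        for mv in zone:
            # MVPS is at the origin here, so mv doubles as its MVDS
            cand = (wsad(sad_of(mv), mv[0], mv[1]), mv)
            best = min(best or cand, cand)
        if best and best[0] <= th:   # MinWSAD <= Th_n: stop expanding zones
            return best
    return best

# toy distortion surface with its minimum at (1, 0)
sads = {(0, 0): 6, (1, 0): 1, (-1, 0): 7, (0, 1): 5, (0, -1): 6}
sad_of = lambda mv: sads.get(mv, 9)
print(diamond_zone_search(sad_of, (0, 0), thresholds=[4, 4, 4]))
```

Here zone 1 fails its threshold, zone 2 finds WSAD = 1.5 at MV (1, 0), and the search stops
without visiting zone 3.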
Fig. 11. Block diagram of the proposed BME architecture.
In each SSB, 9 checking pixels are included. In addition, in order to detect boundary pixels
and obtain the bordered MC-BAB, a border with a width of one pixel around the 42×42
search area, called the bordered search area, is applied. As depicted in Fig. 13, the pixels in the
grey area are the border pixels. This range indicates the total pixels needed in memory.
The Boundary Pixel Detector first selects a boundary pixel (i,j) as the "reference point" in the
16×16 current BAB. Based on the reference point, a 27×27 detected region is generated as
shown in Fig. 14. Each pixel in the detected region, called a checking pixel, denotes whether
the correlative search position is a non-skipped search position or not. If the checking pixel
belongs to a boundary pixel, the correlative search position is denoted as a non-skipped search
position. Thereafter, the Boundary Pixel Detector detects the 9 checking pixels that are relative
to the coding SSB in the detected region. As shown in Fig. 14, the pixels in the dark area are
used for detecting boundary pixels. If none of the checking pixels in the light grey area is a
boundary pixel, it means that all search positions in the SSB are skipped search positions.
Therefore, the coding SSB will not be processed in the PE Array. Otherwise, the PE Array
calculates the WSAD of these 9 search positions in the coding SSB.
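The SSB-level skipping can be sketched as follows. This is an illustrative Python model: the
toy detected region and boundary layout are made up, and only the one-checking-pixel-per-
search-position rule from the text is modeled.

```python
# Sketch of SSB-level skipping: the 27x27 detected region is split into 3x3
# sub-search-blocks (SSBs); an SSB is dispatched to the PE array only if at
# least one of its 9 checking pixels is a boundary pixel. The detected region
# is passed in as a 27x27 0/1 matrix (toy data here).

def nonskipped_ssbs(detected):
    todo = []
    for by in range(0, 27, 3):          # 9x9 = 81 SSBs in total
        for bx in range(0, 27, 3):
            if any(detected[by + j][bx + i] for j in range(3) for i in range(3)):
                todo.append((bx // 3, by // 3))
    return todo

# toy detected region: boundary pixels form a small diamond around (13, 13)
det = [[1 if abs(x - 13) + abs(y - 13) == 2 else 0 for x in range(27)]
       for y in range(27)]
print(len(nonskipped_ssbs(det)), "of 81 SSBs processed")
```

On this toy contour only a handful of the 81 SSBs survive the check, mirroring the large
reduction reported for CBBME in Table 6.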
Fig. 13. (a) 44×44 bordered search area, (b) 16×16 current BAB.
Fig. 14. The relation between search positions and detected region.
Fig. 15. Architecture of (a) PE Array (b) PE element (c) CAS unit.
The weighted data is calculated by adding the absolute MVDS in both the horizontal and
vertical directions and shifting right by one bit. From the analysis in our previous paper [9],
the ratio W2/W1 does not make a large difference between 0.5 and 0.8, and in that range
WSAD is indeed better than SAD. To reduce the computational complexity, W1 and W2 in (1)
are set to 1 and 0.5, respectively. The architecture of the PE element is shown in Fig.
15(b). It produces the WSAD as the sum of the weighted data and the SAD, where Wn is the
WSAD value of PE element n, for n=1 to 9.
The architecture of the Compare and Selection (CAS) module is shown in Fig. 15(c). It finds
the smallest WSAD and its MVS in the coding SSB and feeds back the smallest WSAD as the
input for the next SSB.
It consists of three major units: down-sample, up-sample, and accepted quality (ACQ)
detector. The bordered BAB is read from the BAB buffer and down-sampled to 4×4 and 8×8
BABs in "SC Buffer_0" and "SC Buffer_1", respectively. Then, the 4×4 BAB is up-sampled to
SC Buffer_1, and the 8×8 BAB is up-sampled to the ACQ detector by the up-sample unit. The
ACQ detector calculates the conversion error between the original BAB and the BAB that has
been down-sampled and reconstructed by the up-sample unit. The ACQ detector also
determines the conversion ratio. In the down-sample procedure, several pixels are
down-sampled to one pixel, while in the up-sample procedure interpolated pixels are
produced between original pixels. To compute the value of an interpolated pixel, a border
with a width of two around the current BAB is used to obtain the neighboring pixels (A~L),
and the unknown pixels are extended from the outermost pixels inside the BAB. The template
and pixel relationship used for the up-sampling operation can be found in the MPEG-4
standard and [15].
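The round trip through down-sampling and up-sampling that the ACQ detector scores can be
sketched as follows. Note that the 2×2-majority down-sampling and nearest-neighbour
up-sampling used here are simplified stand-ins for the normative MPEG-4 filters built on the
A~L neighbour template.

```python
# Sketch of the size-conversion loop: down-sample the BAB, up-sample it back,
# and score the conversion error that the ACQ detector thresholds. The filters
# below are simplified assumptions, not the normative MPEG-4 ones.

def down2(bab):
    # each 2x2 block maps to 1 when at least two of its pixels are opaque
    n = len(bab) // 2
    return [[1 if (bab[2*j][2*i] + bab[2*j][2*i+1]
                   + bab[2*j+1][2*i] + bab[2*j+1][2*i+1]) >= 2 else 0
             for i in range(n)] for j in range(n)]

def up2(bab):
    # nearest-neighbour up-sampling (the standard interpolates via A~L pixels)
    return [[bab[j // 2][i // 2] for i in range(2 * len(bab))]
            for j in range(2 * len(bab))]

def conv_error(a, b):
    # conversion error = number of mismatching binary pixels
    return sum(x ^ y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

bab = [[1 if 2 <= i < 6 and 2 <= j < 6 else 0 for i in range(8)] for j in range(8)]
err = conv_error(bab, up2(down2(bab)))
print(err)  # 0: a blocky 4x4 square survives the round trip exactly
```

A block-aligned object reconstructs exactly (error 0), so the ACQ check would accept the
smaller representation; ragged contours would accumulate mismatches instead.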
Since the implementation of the down-sample unit and the ACQ detector is relatively simple,
we only address the design of the up-sample unit here. The block diagram of the up-sample
unit is shown in Fig. 17. Due to the window-like slicing operations, the up-sampling can be
easily mapped onto a delay-line model. A delay-line model is used to obtain the pixels A~L,
which determine the interpolated pixels (P1~P4). Based on the pixels A~L, four 4-bit
threshold values are obtained from "Table CF". "UP_PE" generates the corresponding values
for comparison. After comparison between the threshold values and the values from
"UP_PE", the four interpolated pixels (P1~P4) are stored in the "Shift Register" and output
later.
Fig. 17. Block diagram of up-sample unit.
Fig. 18. (a) Block diagram of CAE module. (b). Illustration of the Shift Register.
As mentioned before, for pixel-by-pixel processing in CAE, the raster scan order is basically
used. Since most of the execution time is spent on context generation in the CAE, the "Shift
Register" is used to obtain the context, and the related operation is illustrated in Fig. 18(b)
[20]. Data in the shift registers can be effectively reused, and thus redundant data
accesses can be removed. In Fig. 18(b), all rectangles represent registers. Pixels in the
current BAB and MC-BAB are first loaded into the Shift Register and then shifted left one
bit at every clock cycle. The registers in the context box are arranged such that the various
contexts can be obtained. In intra-CAE mode, the first three rows of the Shift Register are used
to store the current BAB. In inter-CAE mode, the first two rows of the Shift Register are used
to store the current BAB and the last three rows are used to store the MC-BAB. Therefore, the
context (cx) and the coded bit (bit) are obtained from the Shift Register every cycle.
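Raster-order context generation can be sketched as follows. The 10-pixel template offsets are
an assumption patterned on the MPEG-4 intra template (two rows above plus two pixels to
the left) and should be checked against the standard; out-of-BAB pixels simply read as 0 here,
whereas the real coder uses a bordered BAB.

```python
# Sketch of raster-order intra-CAE context generation. The offsets below are an
# assumed layout of the 10-pixel intra template; the result is the 10-bit index
# cx into the probability table.

INTRA_OFFSETS = [(-1, 0), (-2, 0),                              # c0, c1: left
                 (2, -1), (1, -1), (0, -1), (-1, -1), (-2, -1),  # c2..c6: row above
                 (1, -2), (0, -2), (-1, -2)]                     # c7..c9: two rows above

def context(bab, x, y):
    cx = 0
    for k, (dx, dy) in enumerate(INTRA_OFFSETS):
        px, py = x + dx, y + dy
        bit = bab[py][px] if 0 <= px < len(bab[0]) and 0 <= py < len(bab) else 0
        cx |= bit << k
    return cx  # 10-bit context index

bab = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
print(context(bab, 2, 2))
```

In hardware, the shift-register arrangement of Fig. 18(b) delivers these ten template bits in a
single cycle instead of re-reading them from memory, which is exactly the data reuse the text
describes.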
(Three plots, for the 'Bream', 'News', and 'Foreman' sequences, of bitrate versus W2/W1 over
the range 0.4 to 0.8, comparing SAD and WSAD as the distortion measure.)
Fig. 19. Performance comparisons of SAD and WSAD using full search algorithm.
It can be seen that using WSAD as the distortion measure takes fewer bits than using SAD,
except for "News" at W2/W1 = 0.4. This is because the "News" sequence is a low-motion
video sequence, and WSAD is of no benefit in such sequences. From the results of
Fig. 19, W1 and W2 in (1) are determined as 10 and 7, respectively.
Based on these weighting values, the average number of bits to represent the shape per VOP
is shown in Table 2. The percentages compared to the results of full search, which uses SAD
as the distortion measure, are also shown in Table 2. An important contribution is that the
WSAD achieves some improvement in bit rate, and it can compensate for the inaccuracy of
the motion vector when a fast search algorithm is used.
Fig. 20 shows the comparison of the various search algorithms. It can be seen that the
proposed BS and DBS algorithms take fewer search points than the FS algorithm and the
algorithms described in [6]-[7].
(Three plots, for the 'Bream', 'News', and 'Foreman' sequences, of the number of search points
per frame over 100 frames for the FS, Ref. [6], Ref. [7], BS, and DBS algorithms.)
Fig. 20. Performance comparisons of various search algorithms.
Table 3 shows the number of search points (SP), and Table 4 shows the average number of bits
to represent the shape per VOP. The percentages relative to the results of the full search in the
MPEG-4 VM are also shown in these tables. "BS" denotes the proposed algorithm without the
diamond search pattern and "DBS" denotes the proposed algorithm using the diamond search
pattern. Table 5 shows the runtime simulation results of the various BME algorithms. It is
noted that the BME algorithm of [6] takes much more computational complexity because of
the generation of the mask for the effective search area.
Sequence   FS (SP)      Ref. [6] SP (%)      Ref. [7] SP (%)     BS SP (%)         DBS SP (%)
Bream      10,504,494   6,495,838 (61.84)    1,560,221 (14.85)   367,115 (3.49)    67,232 (0.64)
News       747,054      403,131 (53.96)      6,568 (0.88)        24,923 (3.36)     2,048 (0.27)
Foreman    9,085,527    5,093,409 (56.06)    1,564,551 (17.22)   287,272 (3.16)    57,214 (0.63)
Table 3. Total search points for the various search algorithms.
Compared with the full search method, the proposed fast BME algorithm (BS) needs only
3.5% of the search points and takes an equal bit rate at the same quality. By using the
diamond-shaped zones, the proposed algorithm (DBS) needs only 0.6% of the search points.
Compared with the other fast BME algorithms in [6]-[7], our algorithm uses fewer search
points, especially in high-motion video sequences, such as 'Bream' and 'Foreman'.
Sequence   FS (runtime)   Ref. [6] (%)        Ref. [7] (%)      BS (%)            DBS (%)
Bream      57,718.67      62,776.36 (108.76)  7,782.83 (13.48)  8,678.02 (15.04)  1,738.24 (3.01)
News       3,830.13       4,202.36 (109.72)   35.05 (0.92)      574.04 (14.99)    22.16 (0.58)
Foreman    44,877.19      52,574.24 (117.15)  7,507.74 (16.73)  6,487.52 (14.46)  1,703.88 (3.80)
Table 5. Runtime simulation results of the various search algorithms.
Table 6 shows the number of non-skipped SSBs per PE. Thanks to the proposed CBBME
algorithm, the number of non-skipped SSBs is largely reduced. It can be seen that the average
number of non-skipped SSBs is much smaller than the total number of SSBs, which is 81 per
PE. In the worst case, the additional non-skipped SSBs are usually at positions with large
motion vectors, which tend not to be the adopted MV.
Therefore, in the binary motion estimator, the number of non-skipped SSBs is limited to a
maximum of 32, determined from experimentation. For the average situation, we set the
number to 12. In our architecture, the processing cycles for the various BAB types are shown
in Fig. 21.
Fig. 21. Processing cycles for various BAB types.
In total, seven BAB modes are described. The numbers in parentheses in Fig. 21 indicate the
latency of each module as average/maximum cycle counts. Notice that the processing cycles
of BME and CAE depend on the content of the BAB. To complete the processing of one BAB
in the worst-case scenario, our architecture requires 1708 clock cycles:
19 clock cycles for mode decision, 35 clock cycles for identifying MVPS, 264 clock cycles for
size conversion, 780 clock cycles for BME, and 610 clock cycles for inter CAE.
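The worst-case cycle budget quoted above is a plain sum over the per-module latencies, which
the one-line Python check below reproduces.

```python
# Arithmetic check of the worst-case per-BAB cycle budget quoted in the text.
latency = {"mode decision": 19, "MVPS": 35, "size conversion": 264,
           "BME & BMC": 780, "inter CAE": 610}
total = sum(latency.values())
print(total)  # 1708 clock cycles per BAB in the worst case
```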
Actually, few works have explored the architecture design of shape coding. In [10], only the
BME was designed; we extract our data for the BME part as a comparison in Table 7. For the
whole shape coding, Table 8 illustrates the results of different architectures. For our design,
the sizes of the BAB buffer and SR buffer are 16×16 and 44×44, respectively. The average and
maximum numbers of non-skipped SSBs are determined as 12 and 32 from experiments.
Table 8 also compares the proposed architecture with the previous works in [16] and [21].
The design in [16] adopts a data-dispatch technique and is named data-dispatch-based BME
(DDBME). The design in [21] is Natarajan's architecture, modified from Yang's BME-based
1-D systolic array [22]; it uses an extra memory, the SAP module, to perform the bit shifting
and bit packing needed to align the BAB, which also incurs a computation overhead. In our
design, the Boundary Pixel Detector aligns the boundary of the BAB, so no SAP memory is
needed. Furthermore, the proposed CBBME design needs less data transfer and lower latency
to obtain one motion vector than [16] and [21], because it skips redundant searches. For one
BAB processing in the worst case, our design also requires fewer cycles than [16] on the same
non-pipelined basis: only 56% of the cycles of [16] are needed in our approach.
Fig. 22(a) shows the synthesized gate count of each module, and Fig. 22(b) shows the chip
layout; the design is implemented in synthesizable Verilog HDL. There are 7 synchronous
RAMs in the chip: two 1600×16-bit RAMs for the frame buffer, two 48×22-bit RAMs for the
SR buffer, one 32×20-bit RAM for the SC buffer, one 32×18-bit RAM for the MC buffer, and
one 32×20-bit RAM for the BAB buffer. The chip features are summarized in Table 9. The
total gate count is 40765. The chip size is 2.4×2.4 mm² in TSMC 0.18μm CMOS technology,
and the maximum operating frequency is 53 MHz.
(Module gate counts in Fig. 22(a): BME 10523; Size Conversion 4453; CAE 6836; Prob. Table
for CAE 4221; VLC 2459.)
Fig. 22. (a) Synthesized gate count of each module. (b) Chip layout of shape coding encoder.
Die size: 2.4×2.4 mm²
Core size: 1.4×1.4 mm²
Clock rate: 53 MHz
7. Conclusion
MPEG-4 provides a widely adopted object-based coding technique. As applications migrate
from frame-based coding to object-based coding, the complexity of shape coding becomes a
major concern. In this paper we propose a fast binary motion estimation algorithm using a
diamond search pattern for shape coding, together with an efficient architecture for MPEG-4
shape coding. By exploiting the properties of shape information and diamond-shaped zones,
we reduce the number of search points significantly, with a proportional reduction of
computational complexity. The experimental results show that the proposed method can
reduce the number of search points of BME for shape coding to only 0.6% compared with
that of the full search method described in the MPEG-4 verification model. Moreover, the
fast algorithm achieves the same bit rate at the same quality as the full search algorithm. The
proposed algorithm is simple, efficient, and suitable for real-time software and hardware
applications. The architecture is based on the boundary search fast algorithm, which greatly
reduces the computational complexity. We also apply center-biased motion vector
distribution and search range shrinking for further improvement. In this paper we report a
comprehensive exploration of each module of the shape coding encoder. Our architecture
fully exploits the advantages of the proposed fast algorithm with a high-performance and
regular design. The results show that our design substantially reduces memory accesses and
processing cycles. The worst-case number of clock cycles for one binary alpha block is only
1708, far less than in other designs. The system architecture is implemented in synthesizable
Verilog HDL with TSMC 0.18μm CMOS technology. The chip size is 2.4×2.4 mm² and the
maximum operating frequency is 53 MHz.
8. Acknowledgements
This work was supported by the CIC and the National Science Council of Taiwan, R.O.C.
under Grant NSC97-2220-E-008-001.
9. References
B. Natarajan, V. Bhaskaran, and K. Konstantinides, “Low-complexity block-based motion
estimation via one-bit transforms,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp.
702-707, Aug. 1997.
D. Yu, S. K. Jang, and J. B. Ra, “A fast motion estimation algorithm for MPEG-4 shape
coding,” IEEE Int. Conf. Image Processing, vol. 1, pp. 876-879, 2000.
E. A. Al_Qaralleh, T. S. Chang, and K. B. Lee, “An efficient binary motion estimation
algorithm and its architecture for MPEG-4 shape encoding,” IEEE Trans. Circuits
Syst. Video Technol., vol. 16, no. 7, pp. 859-868, Jul. 2006.
G. Feygin, P. Glenn and P. Chow, “Architectural advances in the VLSI implementation of
arithmetic coding for binary image compression,” Proc. Data Compression Conf.
(DCC’94), pp. 254-263, 1994.
G. Sullivan and T. Wiegand, “Rate-Distortion optimization for video compression”, IEEE
Signal Processing Magazine, pp. 74-90, Nov. 1998.
H. C. Chang, Y. C. Chang, Y. C. Wang, W. M. Chao, and L. G. Chen, “VLSI architecture
design for MPEG-4 shape coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 12,
pp. 741-751, Sep. 2002.
ISO/IEC 13818-2, “Information technology-generic coding of moving pictures and associated
audio information-Video,” 1994.
ISO/IEC JTC1/SC29/WG11 N 2502a, “Generic coding of audio-visual objects: Visual
14496-2,” Atlantic City Final Draft IS, Dec. 1998.
ISO/IEC JTC1/SC29/WG11 N 3908, “MPEG-4 video verification model version 18.0,” Jan.
2001.
Memory-Efficient Hardware Architecture of 2-D Dual-Mode Lifting-Based Discrete Wavelet
Transform for JPEG2000
1. Introduction
Discrete wavelet transform (DWT) has been adopted in a wide range of applications,
including speech analysis, numerical analysis, signal analysis, image coding, pattern
recognition, computer vision, and biometrics (Mallat, 1989). It can be considered as a multi-
resolution decomposition of a signal into several components with different frequency
bands. Moreover, DWT is a powerful tool for signal processing applications, such as
JPEG2000 still image compression, denoising, region-of-interest coding, and watermarking.
Real-time processing requires low memory access and low computational complexity.
Implementations of two-dimensional (2-D) DWT can be classified as convolution-based
operation (Mallat, 1989) (Marino, 2000) (Vishwanath et al., 1995) (Wu. & Chen, 2001) and
lifting-based operation (Sweldens, 1996). Since the convolution-based implementations of
DWT have high computational complexity and large memory requirements, lifting-based
DWT has been presented to overcome these drawbacks (Sweldens, 1996) (Daubechies &
Sweldens, 1998). The lifting-based scheme can provide low-complexity solutions for
image/video compression applications, such as JPEG2000 (Lian et al., 2001), Motion-
JPEG2000 (Seo & Kim, 2007), MPEG-4 still image coding, and MC-EZBC (Ohm, 2005) (Chen
& Woods, 2004). However, real-time 2-D DWT for multimedia applications is still difficult
to achieve; hence, efficient transformation schemes for real-time applications are in high
demand. Performing 2-D (or multi-dimensional) DWT requires many computations and a
large block of transpose memory for storing intermediate signals, with a long latency time.
This work presents new algorithms and hardware architectures to improve the critical issues
in 2-D dual-mode (supporting 5/3 lossless and 9/7 lossy coding) lifting-based discrete
wavelet transform (LDWT). The proposed 2-D dual-mode LDWT architecture has the merits
of low transpose memory, low latency, and regular signal flow, making it suitable for VLSI
implementation. For an N×N image, the transpose memory requirement of the 2-D 5/3 mode
LDWT is 2N and that of the 2-D 9/7 mode LDWT is 4N.
Low transpose memory requirement is a priority concern in spatial-frequency domain
implementation. Generally, raster scan signal flow operations are popular in N×N 2-D DWT,
and under this approach the memory requirement ranges from 2N to N² (Diou et al., 2001)
(Andra et al., 2002) (Chen & Wu, 2002) (Chen, 2002) (Chiang & Hsia, 2005) (Jung & Park,
2005) (Vishwanath et al., 1995) (Huang et al., 2005) (Mei et al., 2006) (Huang et al, 2005) (Wu
& Lin, 2005) (Lan et al., 2005) (Wu. & Chen, 2001) in 2-D 5/3 and 9/7 mode LDWT. To
reduce the amount of transpose memory, the memory access must be redirected. In our
approach, the signal flow is revised from row-wise only to mixed row- and column-wise,
and a new approach, called the interlaced read scan algorithm (IRSA), is used to reduce the
amount of transpose memory. With IRSA, the transpose memory size is 2N or 4N (5/3 or
9/7 mode) for an N×N DWT.
on parallel and pipelined schemes to increase the operation speed. For hardware
implementation we replace multipliers with shifters and adders to accomplish high
hardware utilization. This 2-D LDWT has the characteristics of high hardware utilization,
low memory requirement, and regular signal flow. A 256×256 2-D dual-mode LDWT was
designed and simulated in Verilog HDL, and further synthesized with the Synopsys Design
Compiler using a TSMC 0.18μm 1P6M CMOS process.
smaller hardware complexity. Jung et al. (Jung & Park, 2005) presented an efficient VLSI
architecture of a dual-mode LDWT that supports lossy and lossless compression in JPEG2000.
Marino (Marino, 2000) proposed a high-speed/low-power pipelined architecture for the
direct 2-D DWT by four-subband transforms performed in parallel. The architecture of
(Huang et al., 2002) implements 2-D DWT with only transpose memory by using the
recursive pyramid algorithm (RPA). The architecture in (Vishwanath et al., 1995) has an
average computing time of N² for all DWT levels, but uses many multipliers and adders.
Varshney et al.
(Varshney et al., 2007) presented energy efficient single-processor and fully pipelined
architectures for 2-D 5/3 lifting-based JPEG2000. The single processor performs both row-
wise and column-wise processing simultaneously to achieve the 2-D transform with 100%
hardware utilization. Tan et al. (Tan & Arslan, 2003) presented a shift-accumulator
arithmetic logic unit architecture for 2-D lifting-based JPEG2000 5/3 DWT. This architecture
has an efficient memory organization, which uses a small amount of embedded memory for
processing and buffering. Those architectures achieve multi-level decomposition using an
interleaving scheme that reduces the size of memory and the number of memory accesses,
but have slow throughput rates and inefficient hardware utilization. Seo et al. (Seo & Kim,
2007) proposed a processor that can handle any tile size, and supports both 5/3 and 9/7
filters for Motion-JPEG2000. Huang et al. (Huang et al, 2005) proposed a generic RAM-based
architecture with high efficiency and feasibility for 2-D DWT. Wu et al. (Wu & Lin, 2005)
presented a high-performance and low-memory architecture to implement a 2-D dual-mode
LDWT. The pipelined signal path of their architecture is regular and practical. Lan et al. (Lan
et al., 2005) proposed a scheme that can process two lines simultaneously by processing two
pixels in a clock period. Wu et al. (Wu. & Chen, 2001) proposed an efficient VLSI architecture
for direct 2-D LDWT, in which the poly-phase decomposition and coefficient folding are
adopted to increase the hardware utilization. Despite these improvements to existing
architectures, further improvements in the algorithm and architecture are still needed. Some
VLSI architectures of 2-D LDWT try to reduce the transpose memory
requirements and communication between the processors (Chen & Wu, 2002) (Andra et al.,
2000) (Andra et al., 2002) (Diou et al., 2001) (Chen, 2002) (Chiang & Hsia, 2005) (Tan &
Aslan, 2002) (Jiang & Ortega, 2001) (Lian et al., 2001) (Jung & Park, 2005) (Chen, 2004)
(Huang et al., 2005) (Daubechies & Sweldens, 1998) (Marino, 2000) (Vishwanath et al., 1995)
(Taubman & Marcellin, 2001) (Marcellin et al., 2000) (Mei et al., 2006) (Varshney et al., 2007)
(Huang et al., 2004) (Tan & Arslan, 2003) (Seo & Kim, 2007) (Huang et al., 2005) (Wu & Lin,
2005) (Lan et al., 2005) (Wu. & Chen, 2001), however these hardware architectures still need
large transpose memory.
for (j = 1 to J)
  for (n = 0 to N/2^j − 1)
  {
    X_H^j(n) = Σ_{i=0}^{k−1} G(i) · X_L^{j−1}(2n − i),  (3)
    X_L^j(n) = Σ_{i=0}^{k−1} H(i) · X_L^{j−1}(2n − i),  (4)
  }
where j denotes the current resolution level, k the number of filter taps, X_H^j(n) the nth
high-pass DWT coefficient at the jth level, X_L^j(n) the nth low-pass DWT coefficient at the
jth level, and N the length of the original input sequence. Fig. 1 shows a 3-level 1-D DWT
decomposition using Mallat’s algorithm.
Memory-Efficient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 73
The downsampling operation is then applied to the filtered results. A pair of filters are
applied to the signal to decompose the image into the low-low (LL), low-high (LH), high-
low (HL), and high-high (HH) wavelet frequency bands. Fig. 2 illustrates the basic 2-D DWT
operation and the transformed result which is composed of two cascading 1-D DWTs. The
image is first analyzed horizontally to generate two subimages. The information is then sent
into the second 1-D DWT to perform the vertical analysis to generate four subbands, and
each with a quarter of the size of the original image. Considering an image of size N×N, each
band is subsampled by a factor of two, so that each wavelet frequency band contains
N/2×N/2 samples. The four subbands can be integrated to generate an output image with
the same number of samples as the original one.
Most image compression applications can reapply the above 2-D wavelet decomposition
repeatedly to the LL subimage, each time forming four new subband images, to minimize
the energy in the lower frequency bands.
g(z) = ge(z²) + z⁻¹·go(z²),
h(z) = he(z²) + z⁻¹·ho(z²). (5)
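The polyphase split in (5) can be checked numerically: split an example filter into its even and odd components and evaluate both sides at a test point (the taps below are arbitrary, not a real wavelet filter):

```python
def polyphase_split(g):
    """Split filter taps g = [g0, g1, g2, ...] (coefficients of z^0, z^-1, ...)
    into even- and odd-phase components, per (5)."""
    return g[0::2], g[1::2]

def evaluate(taps, z):
    # Evaluate the Laurent polynomial sum_n taps[n] * z^(-n) at a test point z.
    return sum(c * z ** (-n) for n, c in enumerate(taps))

g = [1.0, 2.0, 3.0, 4.0, 5.0]   # example taps (assumption, for illustration)
ge, go = polyphase_split(g)
z = 1.7
lhs = evaluate(g, z)
rhs = evaluate(ge, z * z) + z ** (-1) * evaluate(go, z * z)
print(abs(lhs - rhs) < 1e-9)  # True: g(z) = ge(z^2) + z^-1 * go(z^2)
```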
P(z) = [ he(z)  ge(z)
         ho(z)  go(z) ]. (6)
The Euclidean algorithm recursively finds the greatest common divisors of the even and
odd parts of the original filters. Since h(z) and g(z) form a complementary filter pair, P(z) can
be factorized into (7):
P(z) = ∏_{i=1}^{m} ( [1  si(z); 0  1] · [1  0; ti(z)  1] ) · [k  0; 0  1/k], (7)
where si(z) and ti(z) are Laurent polynomials corresponding to the prediction and update
steps, respectively, and k is a nonzero constant. Therefore, the filter bank can be factorized
into three lifting steps.
As illustrated in Fig. 3, a lifting-based scheme has the following four stages:
1) Split phase: The original signal is divided into two disjoint subsets. Significantly, the
variable Xe denotes the set of even samples and Xo denotes the set of odd samples. This
phase is also called lazy wavelet transform because it does not decorrelate the data but only
subsamples the signal into even and odd samples.
2) Predict phase: The predicting operator P is applied to the subset Xo to obtain the wavelet
coefficients d[n] as in (8).
d[n]=Xo[n]+P(Xe[n]). (8)
3) Update phase: Xe[n] and d[n] are combined to obtain the scaling coefficients s[n] after an
update operator U as in (9).
s[n]=Xe[n]+U(d[n]). (9)
4) Scaling: In the final step, the normalization factor is applied on s[n] and d[n] to obtain the
wavelet coefficients. For example, (10) and (11) describe the implementation of the 5/3
integer lifting analysis DWT and are used to calculate the odd (high-pass) and even
coefficients (low-pass), respectively.
d*[n] = X(2n+1) − ⌊(X(2n) + X(2n+2))/2⌋, (10)
s*[n] = X(2n) + ⌊(d*[n−1] + d*[n] + 2)/4⌋. (11)
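A minimal sketch of the split/predict/update stages for the 5/3 integer lifting analysis. The update step s[n] = x(2n) + ⌊(d[n−1] + d[n] + 2)/4⌋ (the standard JPEG2000 form) and the clamping boundary extension are assumptions for illustration:

```python
def lifting_53_analysis(x):
    """One level of the 5/3 integer lifting analysis DWT:
    split -> predict (high-pass d) -> update (low-pass s).
    Clamping boundary extension is an assumption for this sketch."""
    ext = lambda i: x[min(max(i, 0), len(x) - 1)]   # clamp sample index
    xe = x[0::2]                                    # even samples
    xo = x[1::2]                                    # odd samples
    # Predict: d[n] = x(2n+1) - floor((x(2n) + x(2n+2)) / 2)
    d = [xo[n] - ((ext(2 * n) + ext(2 * n + 2)) // 2) for n in range(len(xo))]
    # Update: s[n] = x(2n) + floor((d[n-1] + d[n] + 2) / 4)  (assumed form)
    de = lambda i: d[min(max(i, 0), len(d) - 1)]
    s = [xe[n] + ((de(n - 1) + de(n) + 2) // 4) for n in range(len(xe))]
    return s, d

s, d = lifting_53_analysis([10, 12, 14, 16, 18, 20])
print(s, d)  # [10, 14, 18] [0, 0, 1]
```

On a linear ramp the predictor is exact, so the detail coefficients are zero except at the clamped boundary.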
Although the lifting-based scheme has low complexity, its long and irregular signal paths
cause the major limitation for efficient hardware implementations. Additionally, the
increasing number of pipelined registers increases the internal memory size of the 2-D DWT
architecture. The 2-D LDWT uses a vertical 1-D LDWT subband decomposition and a
horizontal 1-D LDWT subband decomposition to find the 2-D LDWT coefficients. Therefore,
the memory requirement dominates the hardware cost and architectural complexity of 2-D
LDWT. Fig. 4 shows the 5/3 mode 2-D LDWT operation. The default wavelet filters in
JPEG2000 are dual-mode (5/3 and 9/7 modes) LDWT (Taubman & Marcellin, 2001). The
lifting-based steps associated with the dual-mode wavelets are shown in Figs. 5 and 6,
respectively. Assuming that the original signals are infinite in length, the first lifting stage is
first applied to perform the DWT.
Fig. 4. 5/3 mode 2-D LDWT operation. (a) The block diagram flow of a traditional 2-D DWT.
(b) Detailed processing flow.
Fig. 5 shows the lifting-based step associated with the wavelet algorithm. The original
signals including s0, d0, s1, d1, s2, d2, … are the input pixel sequences. If the original signals
are infinite in length, then the first-stage lifting is applied to update the odd index data s0,
s1, …. In (12), 1/2 and Hi denote the first-stage lifting parameter and outcome,
respectively. Equation (12) shows the operation of the 5/3 integer LDWT (Wu &
Lin, 2005) (Martina & Masera, 2007) (Hsia & Chiang, 2008):
Hi = [(Si + Si+1)×α + di]×K0,
Li = [(Hi + Hi−1)×β + Si]×K1. (12)
Together with the high-frequency lifting parameter α and the input signal, we can find the
first-stage high-frequency wavelet coefficient Hi. Once Hi is found, it is combined with the
low-frequency parameter β and the input signal to obtain the second-stage low-frequency
wavelet coefficient Li. The third- and fourth-stage lifting results can be found in a similar
manner.
Similar to the 1-level 1-D 5/3 mode LDWT, the calculation of a 1-level 1-D 9/7 mode LDWT
is shown in (13).
ai = (Si + Si+1)×α + di,
bi = (ai + ai−1)×β + Si,
Hi = [(bi + bi+1)×γ + ai]×K0,
Li = [(Hi + Hi−1)×δ + bi]×K1. (13)
(Varshney et al., 2007) (Huang et al., 2004) (Tan & Arslan, 2003) (Seo & Kim, 2007) (Huang et
al., 2005) (Wu & Lin, 2005) (Lan et al., 2005) (Wu. & Chen, 2001). A 2-D DWT is composed of
two 1-D DWTs and a block of transpose memory. In the conventional approach, the size of
the transpose memory is equal to the size of the processed image signal. Fig. 7(a) shows the
concept of the proposed dual-mode LDWT architecture, which consists of signal
arrangement unit, processing element, memory unit, and control unit, as shown in Fig. 7(b).
The outputs are fed to the 2-D LDWT four-subband coefficients, HH, HL, LH, and LL. The
proposed architecture is described in detail in this section, and we focus on the 2-D dual-
mode LDWT.
Compared to the computation unit, the transpose memory is the main overhead in the 2-D
DWT. The block diagram of a conventional 2-D DWT is shown in Fig. 4. Without loss of
generality, the 2-D 5/3 mode LDWT is considered for the description of the 2-D LDWT. If
the image dimension is N×N, during the transformation we need a large block of transpose
memory (on the order of N²) to store the DWT coefficients after the first-stage 1-D DWT
decomposition. The second-stage 1-D DWT then uses the stored data to compute the 2-D
DWT coefficients of the four subbands (Chen & Wu, 2002) (Andra et al., 2000) (Andra et al.,
2002) (Diou et al., 2001) (Chen, 2002) (Varshney et al., 2007) (Huang et al., 2004) (Tan &
Arslan, 2003) (Seo & Kim, 2007) (Huang et al., 2005) (Wu & Lin, 2005) (Lan et al., 2005)
(Wu. & Chen, 2001). The computation and the memory accesses take time, so the latency is
long. Since a memory of size N² is large, we use the interlaced read scan algorithm (IRSA) to
reduce the required transpose memory to the order of 2N or 4N (5/3 or 9/7 mode).
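The memory comparison above amounts to the following accounting (a sketch restating the figures quoted in the text; the function names are just labels):

```python
# Transpose-memory accounting for an N x N image.
def naive_transpose_memory(N):
    # Conventional 2-D DWT: all first-stage 1-D coefficients are buffered.
    return N * N

def irsa_transpose_memory(N, mode):
    # IRSA (per the text): 2N for 5/3 mode, 4N for 9/7 mode.
    return {"5/3": 2 * N, "9/7": 4 * N}[mode]

N = 256
print(naive_transpose_memory(N),
      irsa_transpose_memory(N, "5/3"),
      irsa_transpose_memory(N, "9/7"))  # 65536 512 1024
```

For the 256×256 prototype, the buffer shrinks from 64K coefficients to 512 (5/3) or 1024 (9/7).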
(a)
(b)
Fig. 7. The system block diagram of the proposed 2-D DWT. (a) 2-D dual-mode LDWT. (b)
Block diagram of the proposed system architecture.
Without loss of generality, let us take a 6×6-pixel image to describe the 2-D 5/3 mode LDWT
operation and IRSA. Fig. 8 shows the operation diagram of the 2-D 5/3 mode LDWT
operations on a 6×6 image. In Fig. 8, x(i,j), i = 0 to 5 and j = 0 to 5, represents the original
image signal. The two leftmost columns are the left boundary extension columns, and the
rightmost column is the right boundary extension column. The details of the boundary
extension were described in the previous section. The left half of Fig. 8 shows the first stage
1-D DWT operations. The right half of Fig. 8 shows the second stage 1-D DWT operations
for finding the four subband coefficients, HH, HL, LH, and LL. In the first stage 1-D DWT,
three pixels are used to find a 1-D high-frequency coefficient. For example, x(0,0), x(0,1), and
x(0,2) are used to find the high-frequency coefficient b(0,0), b(0,0) = [x(0,0) + x(0,2)]/2 +
x(0,1). To calculate the next high-frequency coefficient b(0,1), we need pixels x(0,2), x(0,3),
and x(0,4). Here x(0,2) is used to calculate both b(0,0) and b(0,1) and is called the overlapped
pixel. The low-frequency coefficient is calculated using two consecutive high-frequency
coefficients and the overlapped pixel. For example, b(0,0) and b(0,1) are combined with
x(0,2) to find the low-frequency coefficient c(0,1): c(0,1) = [b(0,0) + b(0,1)]/4 + x(0,2). The calculated
high-frequency coefficients, b(i,j), and low-frequency coefficients, c(i,j), are then used in the
second stage 1-D DWT to calculate the four subband coefficients, HH, HL, LH, and LL.
In the second stage 1-D DWT of Fig. 8, the first HH coefficient, HH(0,0), is calculated by
using b(0,2), b(0,1), and b(0,0), HH(0,0) = [b(0,0) + b(0,2)]/2 + b(0,1). The other HH
coefficients can be computed in the same manner using three column consecutive b(i,j)
signals. Two column-consecutive HH coefficients share an overlapped b(i,j) signal; for
example, b(0,3) is the overlapped signal for computing HH(0,0) and HH(0,1). Computing an
HL coefficient needs two column-consecutive HH coefficients and an overlapped b(i,j)
signal. For example, HL(0,1) is computed from HH(0,0), HH(0,1), and b(0,3): HL(0,1) =
[HH(0,0) + HH(0,1)]/4 + b(0,3). The LH coefficients are computed from the c(i,j) signal, and
each LH coefficient needs the calculation of three c(i,j) signals. For example, LH(0,1) is
computed from c(0,2), c(0,3), and c(0,4): LH(0,1) = [c(0,2) + c(0,4)]/2 + c(0,3). Two
column-consecutive LH coefficients share an overlapped c(i,j) signal; for example, c(0,3) is
the overlapped signal for computing LH(0,0) and LH(0,1). Computing an LL coefficient
needs two column-consecutive LH coefficients and an overlapped c(i,j) signal. For example,
LL(0,1) is computed from LH(0,0), LH(0,1), and c(0,2): LL(0,1) = [LH(0,0) + LH(0,1)]/4 +
c(0,2). The detailed calculation equations for the four subband coefficients are summarized in
the following equations:
HH(h,v) = x(2h+1,2v+1) + (1/4)·Σ_{s=0}^{1} Σ_{t=0}^{1} x(2h+2s, 2v+2t)
          + (−1/2)·Σ_{s=−1}^{2} x(2h+|s|, 2v+|−1+s|). (14)
LL(h,v) = (1/4)[LH(h,v−1) + LH(h,v)] + c(h,2v)
        = (1/4)[LH(h,v−1) + LH(h,v)] + (1/4)[b(h−1,2v) + b(h,2v)] + x(2h,2v)
        = (1/4)[LH(h,v−1) + LH(h,v)] + (1/4)[(−1/2)x(2h−2,2v) + x(2h−1,2v)
          + (−1)x(2h,2v) + x(2h+1,2v) + (−1/2)x(2h+2,2v)] + x(2h,2v). (17)
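The first-stage row computation of the worked example can be sketched as follows; the clamping boundary handling here is an assumption standing in for the boundary extension columns described above:

```python
def first_stage_row(x_row):
    """First-stage 1-D coefficients for one row, per the worked example:
    b(j) = (x(2j) + x(2j+2))/2 + x(2j+1)   (high-frequency)
    c(j) = (b(j-1) + b(j))/4 + x(2j)       (low-frequency)
    Index clamping at the boundaries is an assumption for this sketch."""
    n = len(x_row)
    px = lambda i: x_row[min(max(i, 0), n - 1)]
    b = [(px(2 * j) + px(2 * j + 2)) / 2 + px(2 * j + 1) for j in range(n // 2)]
    pb = lambda j: b[min(max(j, 0), len(b) - 1)]
    c = [(pb(j - 1) + pb(j)) / 4 + px(2 * j) for j in range(n // 2)]
    return b, c

row = [1, 2, 3, 4, 5, 6]
b, c = first_stage_row(row)
print(b[0] == (row[0] + row[2]) / 2 + row[1])  # True: matches b(0,0) above
```

The second-stage coefficients (HH, HL, LH, LL) follow the same two-step pattern applied down the columns of b and c.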
Fig. 10. The detail operations of the first stage 1-D DWT.
(a)
(b)
Fig. 11. The detailed operations of the second stage 1-D DWT. (a) The HF (HH and HL) part
operations. (b) The LF (LH and LL) part operations.
Fig. 13. The operation of the signal arrangement unit (for example, the IRSA signal in N1).
For the low-frequency coefficients calculation we need two high-frequency coefficients and
an original pixel. Internal register R4 is used to store the original even pixel (N1) and
internal register R9 is used to store the original odd pixel (N2). We can simply shift the
content of R3 to R4 after the MAC operation. The FIFO is used to store the high-frequency
coefficients to calculate the low-frequency coefficients. Register R5 has two functions: 1) to
store the high-frequency coefficients for the low-frequency coefficient calculation, 2) to be
used as a signal buffer for MAC. MAC needs time to compute the signal, and the output of
MAC cannot directly feed the result to the output or the following operation may be
incorrect due to the synchronization problems. R5 acts as an output buffer for MAC to
prevent the error in the following operations. In the 5/3 integer lifting-based operations,
MAC is used to find the results of the high-frequency output, (a1+a3)/2 + a2, and the low
frequency output, (a1+a3)/4+a2. There are two multiplication coefficients, 1/2 and 1/4. To
save hardware, we can use shifters to implement the 1/2 and 1/4 multiplications.
Therefore the MAC needs adders, a 2’s complementer, and shifters. The MAC block diagram
is shown in Fig. 14, where a1, a2, and a3 are the inputs, “>>” denotes the right shifter, and
the 2’s complement converter handles negation.
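The shift-based multiplications can be sketched as follows (integer inputs are assumed; the negative-coefficient path through the 2's complement converter is omitted for brevity):

```python
def mac_53(a1, a2, a3, low_pass):
    """5/3 MAC via shifts: (a1+a3)/2 + a2 (high-frequency output)
    or (a1+a3)/4 + a2 (low-frequency output).
    Right shifts replace the 1/2 and 1/4 multiplications."""
    shift = 2 if low_pass else 1
    return ((a1 + a3) >> shift) + a2

print(mac_53(6, 5, 10, low_pass=False))  # (6+10)>>1 + 5 = 13
print(mac_53(6, 5, 10, low_pass=True))   # (6+10)>>2 + 5 = 9
```

Replacing the two constant multiplications with single-bit and two-bit right shifts is what lets the MAC be built from adders and shifters only.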
Fig. 15. The block diagram of the second stage 1-D LDWT.
Fig. 16. Signal merging process for the signal arrangement unit.
The signal processing of the second stage 1-D DWT is shown in Fig. 19. The 2×4 signals in
each second stage 1-D DWT are then processed, after which HH, HL, LH, and LL are
generated, each with 2×2 signal data.
The complete architecture of the 2-D LDWT is shown in Fig. 19. The complete 2-D LDWT
consists of four parts: two sets of the first stage 1-D DWT, two sets of the second stage 1-D
DWT, a control unit, and a MAC unit.
(a)
(b)
Fig. 17. The input signal sequences. (a) IN1 read signal of even row in zig-zag orders. (b) IN2
read signal of odd row in zig-zag orders.
(a)
(b)
Fig. 18. The signal process of the two stage LDWT. (a) First stage 1-D LDWT. (b) Second
stage 1-D LDWT.
According to (12) and (13), the proposed IRSA architecture can also be applied to the 9/7
mode LDWT. Fig. 20 illustrates the approach. From Figs. 10 and 11 in Section 3, the original
signals (denoted as black circles) for both 5/3 and 9/7 modes LDWT can be processed by the
same IRSA for the first stage 1-D DWT operation. The high-frequency signals (denoted as grey
circles) and the correlated low-frequency signals together with the results of the first stage are
used to compute the second stage 1-D DWT coefficients. Compared to the 9/7 mode LDWT
computation, the 5/3 mode LDWT is much easier for computation, and the registers
arrangement in Figs. 12 and 15 is simple. For 9/7 mode LDWT implementation with the same
system architecture of 5/3 mode LDWT, we have to do the following modifications: 1) The
control signals of the MUX in Figs. 12 and 15 must be modified. We have to rearrange the
registers for the MAC block to process the 9/7 parameters. 2) The wavelet coefficients of the
dual-mode LDWT are different. The coefficients are α= 1/2 and β=1/4 for 5/3 mode LDWT,
but the coefficients are α= −1.586134142, β= −0.052980118, γ= +0.882911075, and δ=
+0.443506852 for 9/7 mode LDWT. For calculation simplicity and good precision, we can use
the integer approach proposed by Huang et al. (Huang et al., 2004) and Martina et al. (Martina
& Masera, 2007) for 9/7 mode LDWT calculation. Similar to the multiplication implementation
by shifters and adders in the 5/3 mode LDWT, we can adopt the shifter approach proposed
in (Huang et al., 2005) to implement the 9/7 mode LDWT. 3) According to the
characteristics of the 9/7 mode LDWT, the control unit in Fig. 19(b) must be modified
accordingly.
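The four lifting steps of (13) can be sketched with the coefficients quoted above. K0 = K1 = 1 and the clamping boundary handling are illustrative assumptions, since the scaling constants are not specified here:

```python
# 9/7 lifting parameters as quoted in the text.
ALPHA, BETA = -1.586134142, -0.052980118
GAMMA, DELTA = 0.882911075, 0.443506852

def lifting_97_analysis(s, d, K0=1.0, K1=1.0):
    """Four lifting steps of (13) on even samples s and odd samples d
    (equal-length lists assumed). Index clamping is an assumption."""
    n = len(d)
    cs = lambda seq, i: seq[min(max(i, 0), len(seq) - 1)]
    a = [(cs(s, i) + cs(s, i + 1)) * ALPHA + d[i] for i in range(n)]
    b = [(cs(a, i) + cs(a, i - 1)) * BETA + s[i] for i in range(n)]
    H = [((cs(b, i) + cs(b, i + 1)) * GAMMA + a[i]) * K0 for i in range(n)]
    L = [((cs(H, i) + cs(H, i - 1)) * DELTA + b[i]) * K1 for i in range(n)]
    return H, L

s = [1.0, 2.0, 3.0, 4.0]   # even samples (example data)
d = [1.5, 2.5, 3.5, 4.5]   # odd samples (example data)
H, L = lifting_97_analysis(s, d)
print(len(H), len(L))  # 4 4
```

Only the multiplier constants and the two extra lifting stages differ from the 5/3 data path, which is why the same IRSA signal flow can serve both modes.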
(a)
(b)
Fig. 19. The complete 2-D DWT block diagram. (a) DSP diagram of the 2-D LDWT. (b) System
diagram of the 2-D LDWT.
Fig. 20. The processing procedures of 2-D dual-mode LDWTs under the same IRSA
architecture.
The multi-level DWT computation can be implemented in a similar manner by the high-
performance 1-level 2-D LDWT. For the multi-level computation, this architecture needs
N²/4 off-chip memory. As illustrated in Fig. 21, the off-chip memory is used to temporarily
store the LL subband coefficients for the next iteration of computations. The second-level
computation requires N/2 counters and N/2 FIFOs for the control unit. The third-level
computation requires N/4 counters and N/4 FIFOs. Generally, the jth-level computation
needs N/2^(j−1) counters and N/2^(j−1) FIFOs.
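The per-level control resources follow directly from N/2^(j−1); a sketch for N = 256:

```python
# Counters and FIFOs needed by the control unit at each decomposition level j:
# level j needs N / 2^(j-1) of each, per the text.
N = 256
resources = {j: N // 2 ** (j - 1) for j in (2, 3, 4)}
print(resources)  # {2: 128, 3: 64, 4: 32}
```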
Architecture         | Transpose memory¹ (bytes) | Computation time²                   | Adders | Multipliers
Ours                 | 2N                        | (3/4)N² + (3/2)N + 7                | 8      | 0
Diou et al., 2001    | 3.5N                      | ---                                 | 12     | 6
Andra et al., 2002   | 3.5N                      | (N²/2) + N + 5                      | 8      | 4
Chen & Wu, 2002      | 2.5N                      | N² + 5                              | 6      | 4
Chen, 2002           | 3N                        | (N²/2) + N                          | 5      | 0
Chiang & Hsia, 2005  | N²/4 + 5N                 | N²                                  | 4      | 0
Mei et al., 2006     | 2N                        | (N²/2) + N                          | 8      | 0
Huang et al., 2005   | 3.5N                      | ---                                 | ---    | ---
Wu & Lin, 2005       | 3.5N                      | 10 + (4/3)N²[1−(1/4)] + 2N[1−(1/2)] | ---    | 6
¹ Transpose memory size is used to store frequency coefficients in the 1-L 2-D DWT.
² Computing time represents the time used to compute an image of size N×N.
³ The image is assumed to be of size N×N.
Table 2. Comparisons of the 2-D architectures for 5/3 LDWT.
Architecture            | Transpose memory (bytes) | Computation time                    | Adders | Multipliers
Ours                    | 4N                       | (3/4)N² + (3/2)N + 7                | 16     | 0
Andra et al., 2002      | N²                       | 4N²/3 + 2                           | 8      | 4
Jung & Park, 2005       | 12N                      | N²                                  | 12     | 9
Chen, 2004¹             | N²/4 + LN + L            | N²/2 ~ (2/3)N                       | 4L     | 4L
Vishwanath et al., 1995 | 22N                      | N²                                  | 36     | 36
Huang et al., 2005      | 14N                      | ---                                 | 16     | 12
Huang et al, 2005       | 5.5N                     | ---                                 | 16     | 10
Wu & Lin, 2005          | 5.5N                     | 22 + (4/3)N²[1−(1/4)] + 6N[1−(1/2)] | 8      | 6
Lan et al., 2005        | ---                      | ---                                 | 32     | 20
Wu. & Chen, 2001        | N² + 4N + 4              | 2N²/3                               | 16     | 16
¹ L: the filter length.
Table 3. Comparisons of the 2-D architectures for 9/7 LDWT.
7. Conclusions
This work presents a new architecture to reduce the transpose memory requirement in 2-D
LDWT. The proposed architecture has a mixed row- and column-wise signal flow, rather
than purely row-wise as in traditional 2-D LDWT. Further we propose a new approach,
interlaced read scan algorithm (IRSA), to reduce the transpose memory for a 2-D dual-mode
LDWT. The proposed 2-D architectures offer a better trade-off among transpose memory,
output latency, control complexity, and regularity of the memory access sequence than
previous architectures. The proposed architecture reduces the transpose memory significantly to a
memory size of only 2N or 4N (5/3 or 9/7 mode) and reduces the latency to (3/2)N+3 clock
cycles. Due to the regularity and simplicity of the IRSA LDWT architecture, a dual-mode
(5/3 and 9/7) 256×256 2-D LDWT prototyping chip was designed in TSMC 0.18μm 1P6M
standard CMOS technology. The 5/3 and 9/7 filters with different lifting steps are realized
by cascading the four modules (split, predict, update, and scaling phases). The prototyping
chip takes 29,196 gates and can operate at 83 MHz. The method is applicable to any
DWT-based signal compression standard, such as JPEG2000, Motion-JPEG2000, MPEG-4
still texture object decoding, and wavelet-based scalable video coding (SVC).
8. References
Andra, K.; Chakrabarti, C. & Acharya, T. (2000). A VLSI architecture for lifting-based
wavelet transform, IEEE Workshop on Signal Processing Systems, (October 2000) pp.
70-79.
Andra, K.; Chakrabarti, C. & Acharya, T. (2002). A VLSI architecture for lifting-based
forward and inverse wavelet transform, IEEE Transactions on Signal Processing, Vol.
50, No.4, (April 2002) pp. 966-977.
Chen, P.-Y. (2002). VLSI implementation of discrete wavelet transform using the 5/3 filter,
IEICE Transactions on Information and Systems, Vol. E85-D, No.12, (December 2002)
pp. 1893-1897.
Chen, P.-Y. (2004). VLSI implementation for one-dimensional multilevel lifting-based
wavelet transform, IEEE Transactions on Computer, Vol. 53, No. 4, (April 2004) pp.
386-398.
Chen, P. & Woods, J. W. (2004). Bidirectional MC-EZBC with lifting implementation, IEEE
Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 10, (October
2004) pp. 1183-1194.
Chen, S.-C. & Wu, C.-C. (2002). An architecture of 2-D 3-level lifting-based discrete wavelet
transform, VLSI Design/ CAD Symposium, (August 2002) pp. 351-354.
Chiang, J.-S. & Hsia, C.-H. (2005). An efficient VLSI architecture for 2-D DWT using lifting
scheme, IEEE International Conference on Systems and Signals, (April 2005) pp. 528-
531.
Christopoulos, C.; Skodras, A. N. & Ebrahimi, T. (2000). The JPEG2000 still image coding
system: An overview, IEEE Trans. on Consumer Electronics, Vol. 46, No. 4,
(November 2000) pp. 1103-1127.
Daubechies, I. & Sweldens, W. (1998). Factoring wavelet transforms into lifting steps, The
Journal of Fourier Analysis and Applications, Vol. 4, No.3, (1998) pp. 247-269.
Diou, C.; Torres, L. & Robert, M. (2001). An embedded core for the 2-D wavelet transform,
IEEE on Emerging Technologies and Factory Automation Proceedings, Vol. 2, (October
2001) pp. 179-186.
Habibi, A. & Hershel, R. S. (1974). A unified representation of differential pulse code
modulation (DPCM) and transform coding systems, IEEE Transactions on
Communications, Vol. 22, No. 5, (May 1974) pp. 692-696.
Hsia, C.-H. & Chiang, J.-S. (2008). New memory-efficient hardware architecture of 2-D dual-
mode lifting-based discrete wavelet transform for JPEG2000, IEEE International
Conference on Communication Systems, (November 2008) pp. 766-772.
Huang, C.-T.; Tseng, P.-C. & Chen, L.-G. (2002). Efficient VLSI architecture of lifting-based
discrete wavelet transform by systematic design method, IEEE International
Symposium Circuits and Systems, Vol. 5, (May 2002) pp. 26-29.
Huang, C.-T.; Tseng, P.-C. & Chen, L.-G. (2004). Flipping structure: An efficient VLSI
architecture for lifting-based discrete wavelet transform, IEEE Transactions on Signal
Processing, Vol. 52, No. 4, (April 2004) pp. 1080-1089.
Huang, C.-T.; Tseng, P.-C. & Chen, L.-G. (2005). VLSI architecture for lifting-based shape-
adaptive discrete wavelet transform with odd-symmetric filters, Journal of VLSI
Signal Processing Systems, Vol. 40, No. 2, (June 2005) pp.175-188.
Huang, C.-T.; Tseng, P.-C. & Chen, L.-G. (2005). Analysis and VLSI architecture for 1-D and
2-D discrete wavelet transform, IEEE Transactions on Signal Processing, Vol. 53, No.
4, (April 2005) pp. 1575-1586.
Huang, C.-T.; Tseng, P.-C. & Chen, L.-G. (2005). Generic RAM-based architecture for two-
dimensional discrete wavelet transform with line-based method, IEEE Transactions
on Circuits and Systems for Video Technology, Vol. 15, No. 7, (July 2005) pp. 910-919.
ISO/IEC 15444-1 JTC1/SC29 WG1. (2000). JPEG 2000 Part 1 Final Committee Draft Version 1.0,
Information Technology.
ISO/IEC JTC1/SC29/WG1 Wgln 1684 (2000). JPEG 2000 Verification Model 9.0.
ISO/IEC 15444-1 JTC1/SC29 WG1. (2000). Motion JPEG2000, ISO/IEC ISO/IEC 15444-3,
Information Technology.
ISO/IEC JTC1/SC29 WG11. (2001), Coding of Moving Pictures and Audio, Information
Technology.
Jiang, W. & Ortega, A. (2001). Lifting factorization-based discrete wavelet transform based
architecture design, IEEE Transactions on Circuits and Systems for Video Technology,
Vol. 11, No. 5, (May 2001) pp. 651-657.
Jung, G.-C. & Park, S.-M. (2005). VLSI implement of lifting wavelet transform of JPEG2000
with efficient RPA (recursive pyramid algorithm) realization, IEICE Transactions on
Fundamentals, Vol. E88-A, No. 12, (December 2005) pp. 3508-3515.
Kondo, H. & Oishi, Y. (2000). Digital image compression using directional sub-block DCT,
International Conference on Communications Technology, Vol. 1, (August 2000) pp. 985-
992.
Lan, X.; Zheng, N. & Liu, Y. (2005). Low-power and high-speed VLSI architecture for lifting-
based forward and inverse wavelet transform, IEEE Transactions on Consumer
Electronics, Vol. 51, No. 2, (May 2005) pp. 379-385.
Li, W.-M.; Hsia, C.-H. & Chiang, J.-S. (2009). Memory-efficient architecture of 2-D dual-
mode lifting scheme discrete wavelet transform for Motion-JPEG2000, IEEE
International Symposium on Circuits and Systems, (May 2009) pp. 750-753.
Memory-Efficient Hardware Architecture of 2-D Dual-Mode Lifting-Based
Discrete Wavelet Transform for JPEG2000 97
Lian, C.-J.; Chen, K.-F.; Chen, H.-H. & Chen, L.-G. (2001). Lifting based discrete wavelet
transform architecture for JPEG2000, IEEE International Symposium on Circuits and
Systems, Vol. 2, (May 2001) pp. 445-448.
Mallat, S. G. (1989). A theory for multi-resolution signal decomposition: The wavelet
representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11,
No. 7, (July 1989) pp. 674-693.
Mallat, S. G. (1989). Multi-frequency channel decompositions of images and wavelet models,
IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-37, No. 12,
(December 1989) pp. 2091-2110.
Marcellin, M. W.; Gormish, M. J. & Skodras, A. N. (2000). JPEG2000: The new still picture
compression standard, ACM Multimedia Workshops, (September 2000) pp. 45-49.
Marino, F. (2000). Efficient high-speed/low-power pipelined architecture for the direct 2-D
discrete wavelet transform, IEEE Transactions on Circuits and Systems II, Vol. 47, No.
12, (December 2000) pp. 1476-1491.
Martina, M. & Masera, G. (2007). Folded multiplierless lifting-based wavelet pipeline, IET
Electronics Letters, Vol. 43, No. 5, (March 2007) pp. 27-28.
Mei, K.; Zheng, N. & van de Wetering, H. (2006). High-speed and memory-efficient VLSI
design of 2-D DWT for JPEG2000, IET Electronics Letters, Vol. 42, No. 16, (August
2006) pp. 907-908.
Ohm, J.-R. (2005). Advances in scalable video coding, Proceedings of The IEEE, Invited Paper,
Vol. 93, No. 1, (January 2005) pp. 42-56.
Richardson, I. (2003). H.264 and MPEG-4 Video Compression, John Wiley & Sons Ltd.
Seo, Y.-H. & Kim, D.-W. (2007). VLSI architecture of line-based lifting wavelet transform for
Motion JPEG2000, IEEE Journal of Solid-State Circuits, Vol. 42, No. 2, (February 2007)
pp. 431-440.
Sweldens, W. (1996). The lifting scheme: A custom-design construction of biorthogonal
wavelets, Applied and Computational Harmonic Analysis, Vol. 3, No. 15, (1996) pp. 186-
200.
Tan, K.C.B. & Arslan, T. (2001). Low power embedded extension algorithm for the lifting
based discrete wavelet transform in JPEG2000, IET Electronics Letters, Vol. 37, No.
22, (October 2001) pp.1328-1330.
Tan, K.C.B. & Arslan, T. (2003). Shift-accumulator ALU centric JPEG 2000 5/3 lifting based
discrete wavelet transform architecture, IEEE International Symposium on Circuits
and Systems, Vol. 5, (May 2003) pp. V161-V164.
Taubman, D. & Marcellin, M. W. (2001). JPEG2000 image compression fundamentals, standards,
and practice, Kluwer Academic Publisher.
Varshney, H.; Hasan, M. & Jain, S. (2007). Energy efficient novel architecture for the lifting-
based discrete wavelet transform, IET Image Process, Vol. 1, No. 3, (September 2007)
pp.305-310.
Vishwanath, M.; Owens, R. M. & Irwin, M. J. (1995). VLSI architecture for the discrete
wavelet transform, IEEE Transactions on Circuits and Systems II, Vol. 42, No. 5, (May
1995) pp. 305-316.
Weeks, M. & Bayoumi, M. A. (2002). Three-dimensional discrete wavelet transform
architectures, IEEE Transactions on Signal Processing, Vol. 50, No. 8, (August 2002) pp.
2050-2063.
Wu, B.-F. & Lin, C.-F. (2005). A high-performance and memory-efficient pipeline
architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec,
IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 12,
(December 2005) pp. 1615-1628.
Wu, P.-C. & Chen, L.-G. (2001). An efficient architecture for two-dimensional discrete
wavelet transform, IEEE Transactions on Circuits and Systems for Video Technology,
Vol. 11, No. 4, (April 2001) pp. 536-545.
Full HD JPEG XR Encoder Design for Digital Photography Applications 99
1. Introduction
Multimedia applications, such as radio, audio, camera phones, digital still cameras, camcorders,
and mobile broadcasting TV, are more and more popular in our lives with the progress of image
sensors, communications, VLSI manufacturing, and image/video coding standards. With this rapid
progress of image sensors, display devices, and computing engines, image coding standards
are used everywhere in digital photography applications, which have merged into
our daily lives through camera phones, digital still cameras, blogs, and many other applications.
Many advanced multimedia applications require image compression technology with a
higher compression ratio and better visual quality. High quality, high compression rates of
digital images, and low computational cost are important factors in many areas of consumer
electronics, ranging from digital photography to consumer display equipment such as
digital still cameras and digital frames. These requirements usually involve computationally
intensive algorithms imposing trade-offs between quality, computational resources, and
throughput.
For high-quality digital image applications, the extension of color range has become
more important in consumer products. In the past, the digital cameras and display
equipment in the consumer market typically had 8 bits of information per channel. Today the
situation is quite different: digital cameras and desktop display panels in the consumer market
now have at least 12 bits of information per channel. If the information per channel of a
digital image is still compressed into 8 bits, 4 or more bits of information per channel are lost
and the quality of the digital image is limited. Due to the improvement of display
equipment, JPEG XR is designed for high dynamic range (HDR) and high
definition (HD) photo sizes. JPEG XR, organized by the ISO/IEC Joint
Photographic Experts Group (JPEG) Standard Committee, is a new still image coding
standard derived from Windows Media Photo (Srinivasan et al., 2007; Srinivasan et
al., 2008; Schonberg et al., 2008). The goal of JPEG XR is to support the greatest possible level
of image dynamic range and color precision, and keep the device implementations of the
encoder and decoder as simple as possible.
For the compression of digital images, JPEG, the first international image coding standard for
continuous-tone natural images defined by the Joint Photographic Experts Group, was finalized in
1992 (ITU, 1992). JPEG is a well-known image compression format today because of the
popularity of digital still cameras and the Internet, but it has limitations in satisfying the rapid
progress of consumer electronics. Another image coding standard, JPEG2000 (ISO/IEC,
2000), was finalized in 2001. Unlike the JPEG standard, a Discrete Cosine Transform
(DCT) based coder, JPEG2000 uses a Discrete Wavelet Transform (DWT) based coder.
JPEG2000 not only enhances the compression, but also includes many new
features, such as quality scalability, resolution scalability, region of interest, and
lossy/lossless coding in a unified framework. However, the design of JPEG2000 is much
more complicated than the JPEG standard. The core techniques and computation complexity
comparisons of these two image coding standards are shown in (Huang et al., 2005).
To satisfy the demand for high-quality image compression with lower computation complexity,
the new JPEG XR compression algorithm is discussed and implemented with a VLSI
architecture. JPEG XR has high encoding efficiency and versatile functions. The XR in JPEG
XR stands for extended range, meaning that JPEG XR supports an extended range of
information per channel. The image quality of JPEG XR is nearly equal to that of JPEG2000 at
the same bit-rate, while the computation complexity is much lower than JPEG2000, as shown in
Table 1.
Efficient system-level architecture design is more important than module design,
since system-level improvements have more impact on performance, power, and memory
bandwidth than module-level improvements. In this chapter, the new JPEG XR
compression standard is introduced, and the analysis and architecture design of a JPEG XR
encoder are proposed. A comparison was made to analyze the compression performance
of JPEG2000 and JPEG XR. Fig. 1 shows the peak signal-to-noise ratio (PSNR) results under
several different bitrates. The test color image is the 512x512 Baboon. The image quality of JPEG
XR is very close to that of JPEG2000; the PSNR difference between JPEG XR and JPEG2000
is under 0.5 dB. Fig. 2 shows the subjective views at an 80 times compression ratio. The block
artifact of the JPEG image in Fig. 2 is easily observed, while JPEG XR demonstrates
acceptable quality by applying the pre-filter function. For the architecture design, a
4:4:4 1920x1080 JPEG XR encoder is proposed. From simulation and analysis, entropy
coding is the most computationally intensive part of the JPEG XR encoder. We first propose a
timing schedule of the pipeline architecture to speed up the entropy encoding module. To
reduce the memory bandwidth and maximize the silicon area efficiency, we also
propose a data reuse technique, which reduces the memory bandwidth of memory
accesses by 33%. The hardware design of the JPEG XR encoder has
been implemented with a cell-based IC design flow and verified on an
FPGA platform. This JPEG XR chip design can be used for digital photography applications
to achieve low computation, low storage, and high dynamic range features.
Technology        Operations (GOPs)
JPEG2000          4.26
JPEG XR           1.2
Table 1. Machine cycles comparison between JPEG XR and JPEG 2000.
Fig. 1. PSNR comparison of JPEG XR (overlap parameter O = 0, 1, 2) and JPEG 2000 for the 512x512 Baboon image (PSNR in dB versus bitrate in bpp, 0 to 7 bpp).
Fig. 2. Image Quality Comparison with (a) JPEG XR and (b) JPEG.
The coding flow of JPEG XR is shown in Fig. 3. JPEG XR has lower computation cost
in each module and a simpler coding flow, while maintaining similar PSNR quality at the same
bitrate compared with the coding flow of JPEG2000. Hence, JPEG XR is very suitable
for implementation in dedicated hardware to deal with HD photo size images for the
HDR display requirement. The JPEG XR encoder architecture design is presented in the
following sections.
This paper is organized as follows. In Section 2, we present the fundamentals of JPEG XR.
Section 3 describes the characteristics of our proposed architecture design of JPEG XR
encoder. Section 4 shows the implementation results. Finally, a conclusion is given in
Section 5.
2. JPEG-XR
The JPEG XR image compression standard has many options for different purposes. In the
following section, the fundamentals of JPEG XR are introduced.
(Figure: the input image is MB-aligned and partitioned into tiles, the tiles into macroblocks, and each macroblock into blocks; the transform coefficients can also be arranged in frequency mode.)
2.4 Pre-filter
There are three overlapping choices for the pre-filter function: non-overlapping, one-level
overlapping, and two-level overlapping. (Pan et al., 2008) discusses the trade-offs of the three
overlapping choices. The non-overlapping condition is used for the fastest encoding and
decoding. It is efficient in low compression ratio mode or lossless mode; however, this
mode potentially introduces blocking artifacts in low-bitrate images. The one-level overlapping
function, compared to non-overlapping, gives a higher compression ratio but needs
additional time for the overlapping operation. The two-level overlapping has the highest
computation complexity; its PSNR and objective image quality are better than the other two
in the low-bitrate region. The pre-filter function is recommended for
both high image quality and further compression ratio considerations at high quantization
levels. For high image quality, the pre-filter function eliminates the blocking artifacts to
which visual quality is sensitive.
for each 4x4 block in the first-part process. The low-frequency coefficient of these four
coefficients is placed at the top-left position. After the first-part process, the DC coefficients
of the 16 4x4 blocks are collected as a 4x4 DC block. The second-part process operates on the 4x4
DC block from the first-part process. The second-part 2-D 4x4 PCT is built from three
operators: 2x2 T_h, T_odd, and T_odd_odd. After the second-part process, the 16 DC coefficients
are represented as DC and AD coefficients.
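The two-part structure described above (a first-stage transform on each 4x4 block, then a second-stage transform on the 4x4 block of collected DCs) can be sketched as follows. Note that toy_transform below is a deliberately simplified placeholder, not the actual T_h/T_odd/T_odd_odd operators of JPEG XR; only the two-stage data flow is illustrated.

```python
def toy_transform(block44):
    # placeholder 4x4 "transform": DC = sum of the 16 samples,
    # AC = differences against the first sample (NOT the real PCT operators)
    flat = [v for row in block44 for v in row]
    return sum(flat), [v - flat[0] for v in flat[1:]]

def two_stage_pct(mb):
    # mb: 16x16 macroblock given as a list of 16 rows of 16 samples
    dcs, acs = [], []
    for by in range(4):            # first part: transform the 16 inner 4x4 blocks
        for bx in range(4):
            block = [row[4 * bx:4 * bx + 4] for row in mb[4 * by:4 * by + 4]]
            dc, ac = toy_transform(block)
            dcs.append(dc)
            acs.append(ac)
    dc_block = [dcs[4 * i:4 * i + 4] for i in range(4)]
    dc, ad = toy_transform(dc_block)   # second part: transform the 4x4 DC block
    return dc, ad, acs                 # 1 DC, 15 AD, and 16x15 AC coefficients
```

For a flat macroblock all AD and AC coefficients vanish and the single DC carries the whole signal, which is the point of the hierarchical decomposition.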
2.6 Quantization
Quantization is the process of rescaling the coefficients after applying the transform
process. The quantization uses the quantization value to divide the PCT-transformed
coefficients and round them to integer values. For the lossless coding mode, the quantization value =
1; for the lossy coding mode, the quantization value > 1. The quantization in JPEG XR uses
integer operations. The advantage of integer operations is that precision is kept after scaling
operations, and only shift operations are needed to perform the divisions. The
quantization parameter is allowed to differ across the high pass band, low pass band, and DC
band; it varies according to the sensitivity of human vision in the different
coefficient bands. Fig. 7 shows an example in which pixels are transformed by the two-stage PCT and
quantized by the quantization process.
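The divide-and-round behaviour described above can be modelled as below. The mapping from JPEG XR quantization parameters to actual scaling factors is more involved; here qp is treated as a plain integer divisor, which is an illustrative assumption.

```python
def quantize(coeff, qp):
    # qp = 1: lossless mode, the coefficient passes through unchanged
    if qp == 1:
        return coeff
    # lossy mode: integer divide-and-round toward the nearest value;
    # for a power-of-two qp the division reduces to a shift in hardware
    sign = -1 if coeff < 0 else 1
    return sign * ((abs(coeff) + qp // 2) // qp)

def dequantize(level, qp):
    # decoder-side rescaling back to coefficient magnitude
    return level * qp
```

The quantize/dequantize round trip is exact for qp = 1 and loses at most qp/2 per coefficient otherwise.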
Fig. 9. Example of (a) prediction of AD coefficients and (b) prediction model and AC
coefficients.
2.7 Prediction
There are three directions of DC prediction: LEFT, TOP, and LEFT-AND-TOP. As shown in
Fig. 8, the JPEG XR prediction rules for the DC coefficient use the DC coefficients of the left MB and
the top MB to decide the direction of DC prediction. The DC prediction direction can be decided
by the pseudo code described in Fig. 8: after comparing the H_weight and the V_weight,
the prediction direction is decided. At the boundary MBs of the image, the purple MBs only
predict from the left MB and the gray MBs only predict from the top MB.
The AD/AC block can be predicted from its TOP or LEFT block, as shown in Fig. 9. If the prediction
direction is LEFT, the prediction relationship of the AD coefficients is as shown in Fig.
9(a). The AD prediction direction follows the DC prediction model described in Fig. 8. The
computation after the prediction decision for AC is similar to that for AD and can reduce the
coefficient values of the block. Fig. 9(b) shows the prediction relationship of the AC
coefficients when the prediction direction is TOP.
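A simplified model of the direction decision is sketched below. The exact weighting in the standard's pseudo code (Fig. 8) differs; comparing horizontal and vertical DC gradients of the neighbouring MBs, as done here, is an illustrative assumption rather than the normative rule.

```python
def dc_predict_direction(dc_left, dc_top, dc_topleft):
    h_weight = abs(dc_topleft - dc_left)   # horizontal DC gradient
    v_weight = abs(dc_topleft - dc_top)    # vertical DC gradient
    if h_weight < v_weight:
        return "LEFT"          # left neighbour looks like the closer predictor
    if v_weight < h_weight:
        return "TOP"
    return "LEFT_AND_TOP"      # predict from the mean of both neighbours

def predict_dc(dc, dc_left, dc_top, direction):
    # the encoder transmits only the residual after prediction
    if direction == "LEFT":
        return dc - dc_left
    if direction == "TOP":
        return dc - dc_top
    return dc - (dc_left + dc_top) // 2
```

Subtracting a well-chosen neighbour shrinks the coefficient magnitudes, which is what makes the subsequent entropy coding effective.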
will increase extra bits for the Run-Level Encode (RLE) coding. The RLE function adds the
overhead coefficients into the bitstream length. The Levels, the Runs before
nonzero coefficients, and the number of overhead coefficients are counted for RLE. The
RLE block is then encoded first, while Huffman tables of different sizes are used to optimize the bit
allocation. After processing the RLE algorithm, the RLE results and FlexBits
are packetized.
Because the functions of stage 1 operate on 4x4 blocks with no feedback
information, a three-stage pipeline architecture is used for stage 1: color conversion (CC), pre-
filter, and PCT (including quantization). For the memory allocation, the color conversion requires 6
blocks per column, the pre-filter requires 5 blocks, and the PCT including quantization uses
4 blocks to execute its function. At the end of the PCT in Fig. 15, the DC block can be processed to
remove one pipeline bubble when the next new pipeline stage starts. Hence, the well-
arranged timing schedule eliminates the additional clock cycles needed to process the DC
block.
The architecture for the PCT including quantization is shown in Fig. 16. Additional registers are
implemented to buffer the related coefficients for the next pipeline processing. The left
multiplexer selects the inputs for the two-part PCT process. Initially, the pre-filtered input
coefficients are selected to process the PCT algorithm; then the yellow block (DC) is
processed after the 16 blocks have been computed. The quantization stage de-multiplexes the
DC, low pass band AD, and high pass band AC coefficients to the suitable processing elements. The
processed data are arranged into the quantized coefficients of the Y, U, V SRAM blocks for
the prediction operation.
(Fig. 17: prediction architecture — a 1440x4-byte SRAM (SRAM 3) buffers the top AD block, two 768x4-byte SRAMs (SRAM 1 and SRAM 2) hold the DC/AD/AC blocks, and pipeline buffers, registers, and the CBP predictor connect the stage 1 input to the stage 3 output.)
3.3 Prediction
The quantized data stored in the Y, U, V SRAM blocks are processed with the subtract operation
of the prediction algorithm. Three SRAM blocks are used in this design. One 1440x4-byte
SRAM is used to buffer the 1 DC and 3 AD coefficients for the TOP AD prediction
in the prediction decision, so that regeneration of these data is
unnecessary when they are selected in the prediction mode. In Fig. 17, two 768x4-byte
SRAMs are used to save the quantized coefficients of the current block and the predicted
coefficients for the current block.
(Fig. 20: the cell-based design flow — specification, algorithm analysis, architecture design, HDL design and simulation, logic synthesis with DfT considerations, scan-chain insertion and fault-coverage check (Tetra-MAX), power/floor planning and timing-driven P&R (Apollo), LPE, gate-level and post-layout simulation, and finally fabrication, with iteration whenever the specification, coverage, or utilization targets are not met.)
In order to increase the throughput and decrease the delay of the critical path, we use a three-
sub-pipeline-stage architecture to implement the entropy encoding module, as described in
(Chien et al., 2009). Because of the feedback path, stage 1 includes the modules of the feedback
path and the feedback information generator. Stage 2 includes the modules that encode the
RLE information into symbol-based data. The bit-wise packetizer module of stage 3 processes
the symbol-based data bit by bit. By doing this, we can increase the throughput about 3
times with a well-arranged pipeline timing schedule, as shown in Fig. 19.
The adaptive scan block counts the number of non-zero coefficients to decide
whether the scan order should be exchanged or not. After processing by the adaptive
scan block, the coefficients that can be represented within ModelBits are coded by the
FlexBits table. The other coefficients are split to generate the FlexBits and Level. After
processing by the RLE algorithm, the Level and Run choose the suitable Huffman table to
generate the RLE codeword. The RLE results are translated into codewords by different
Huffman tables. The Huffman encoder in JPEG XR is different from those of other standards: it is
composed of many small Huffman tables and can adaptively choose the best option among these
tables to obtain a smaller code size for Run and Level. Many codewords are produced after the
Huffman encoding. In order to increase the throughput, the codeword concentrating
architecture is proposed. Linked to the output of the above operation, the whole RLE
codeword and RLE code size are delivered to the packetizer. The packetizer architecture
is modified from that of (Agostini et al., 2002) by combining the RLE codeword and
the FlexBits to generate the JPEG XR compressed file. A more detailed design is described in
(Pan et al., 2008).
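The non-zero counting and scan-order exchange can be modelled as a count-and-bubble update. The thresholds and reset rules of the real adaptive scan are omitted here; this is a simplified sketch of the adaptation idea only.

```python
def adaptive_scan_update(order, totals, coeffs):
    # order[k]  : coefficient index read at scan position k
    # totals[k] : how often scan position k produced a nonzero coefficient
    for k in range(len(order)):
        if coeffs[order[k]] != 0:
            totals[k] += 1
            # bubble a busier position one step toward the front so that
            # frequently nonzero coefficients are scanned earlier
            if k > 0 and totals[k] > totals[k - 1]:
                totals[k], totals[k - 1] = totals[k - 1], totals[k]
                order[k], order[k - 1] = order[k - 1], order[k]
    return order, totals
```

After a few blocks, coefficient positions that are usually nonzero migrate to the front of the scan, which shortens the runs seen by the RLE stage.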
4. Implementations
This design is implemented to verify the proposed VLSI architecture of the JPEG XR encoder,
and it is also verified on an FPGA platform. Detailed implementation results for each module
on the FPGA prototype system are shown in (Pan et al., 2008). The prototype is used
to test the conformance bitstreams for certification.
A three-stage MB pipelining of the 4:4:4 lossless JPEG XR encoder was proposed to improve the
processing capacity and hardware utilization. In our design, extra registers are used to increase the
pipeline stages to achieve the specification in modules such as the color conversion, PCT/
quantization, and adaptive encode blocks, and on-chip SRAM blocks are used to store
the reused data processed by the prediction module to eliminate off-chip memory accesses. For
the entropy encoding module, the timing schedule and pipelining are carefully designed; the
proposed architecture of the entropy encoding module increases the total throughput about
three times. We use a TSMC 0.18 μm CMOS 1P6M process to implement the JPEG XR encoder.
Our design flow is a standard cell-based chip design flow, shown in Fig. 20.
Test consideration is also an important issue in chip design; therefore, the
scan chain and built-in self-test (BIST) are included in our chip. The synthesized chip
layout is shown in Fig. 21, the implementation results are shown in Table 2, and the power
dissipation distribution is shown in Fig. 22.
5. Conclusion
Compared with JPEG2000, the coding flow of JPEG XR is simpler and has lower
complexity at similar PSNR quality for the same bit rate. Hence, JPEG XR is very
suitable for implementation in dedicated hardware to manage HD photo size
images for the HDR display requirement. In this paper, we first analyzed the
comparison of JPEG XR with other image standards, and then a three-stage MB pipelining
was proposed to improve the processing capacity and hardware utilization. We also made
considerable efforts on the module designs. The timing schedules and pipelining of the color
conversion, pre-filter, and PCT & quantization modules are carefully designed. In order to avoid
accessing the coefficients from off-chip memory, an on-chip SRAM is designed to buffer the
coefficients for the prediction module with only a small area overhead. The pre-filter and PCT
functions were designed to reduce 33.3% of the memory accesses to off-chip memory. For the
entropy coding, we designed a codeword concentrating architecture to increase the throughput of
the RLE algorithm, and the adaptive encode and packetizer modules efficiently provide the coding
information required for packing the bitstream. Based on this research result, we contribute
a VLSI architecture for a 1920x1080 HD photo size JPEG XR encoder design. Our proposed
design can be used in devices that need a powerful and advanced still image
compression chip, such as next-generation HDR displays, digital still cameras, digital frames,
digital surveillance, mobile phones, cameras, and other digital
photography applications.
6. References
B. Crow, Windows Media Photo: A new format for end-to-end digital imaging, Windows
Hardware Engineering Conference, 2006.
C.-H. Pan; C.-Y. Chien; W.-M. Chao; S.-C. Huang & L.-G. Chen, Architecture design of full
HD JPEG XR encoder for digital photography applications, IEEE Trans. Consu. Elec.,
Vol. 54, Issue 3, pp. 963-971, Aug. 2008.
C.-Y. Chien; S.-C. Huang; C.-H. Pan; C.-M. Fang & L.-G. Chen, Pipelined Arithmetic
Encoder Design for Lossless JPEG XR Encoder, IEEE Intl. Sympo. on Consu. Elec.,
Kyoto, Japan, May 2009.
D. D. Giusto & T. Onali. Data Compression for Digital Photography: Performance
comparison between proprietary solutions and standards, IEEE Conf. Consu. Elec.,
pp. 1-2, 2007.
D. Schonberg; S. Sun; G. J. Sullivan; S. Regunathan; Z. Zhou & S. Srinivasan, Techniques for
enhancing JPEG XR / HD Photo rate-distortion performance for particular fidelity
metrics, Applications of Digital Image Processing XXXI, Proceedings of SPIE, vol. 7073,
Aug. 2008.
ISO/IEC JTC1/SC29/WG1. JPEG 2000 Part I Final Committee Draft, Rev. 1.0, Mar. 2000.
ITU. T.81 : Information technology - Digital compression and coding of continuous-tone still
images. 1992.
L.V. Agostini; I.S. Silva & S. Bampi, Pipelined Entropy Coders for JPEG compression,
Integrated Circuits and System Design, 2002.
S. Groder, Modeling and Synthesis of the HD Photo Compression Algorithm, Master Thesis,
2008.
S. Srinivasan; C. Tu; S. L. Regunathan & G. J. Sullivan, HD Photo: a new image coding
technology for digital photography, Applications of Digital Image Processing XXX,
Proceedings of SPIE, vol. 6696, Aug. 2007.
S. Srinivasan; Z. Zhou; G. J. Sullivan; R. Rossi; S. Regunathan; C. Tu & A. Roy, Coding of
high dynamic range images in JPEG XR / HD Photo, Applications of Digital Image
Processing XXXI, Proceedings of SPIE, vol. 7073, Aug. 2008.
Y.-W. Huang; B.-Y. Hsieh; T.-C. Chen & L.-G. Chen, Analysis, Fast Algorithm, and VLSI
Architecture Design for H.264/AVC Intra Frame Coder, IEEE Trans. Circuits Syst.
Video Technol., vol. 15, no. 3, pp. 378-401, Mar. 2005.
The Design of IP Cores in Finite Field for Error Correction 115
1. Introduction
In recent studies, the bandwidth of communication channels, the reliability of information
transfer, and the performance of data storage devices have become the major design factors in
digital transmission/storage systems. In consideration of those factors, there are many
algorithms to detect or remove the noise from the communication channel and storage media,
such as the cyclic redundancy check (CRC) and error-correcting codes (Peterson & Weldon, 1972;
Wicker, 1995). The former, a hash function proposed by Peterson and Brown (Peterson &
Brown, 1961), is applied in hard disks and networks for error detection; the latter is
a type of channel coding algorithm that recovers the original data from corrupted data
against various failures. Normally, the scheme adds redundant code(s) to the original data
to provide reliability functions such as error detection or error correction. The background
of this chapter involves the mathematics of algebra, coding theory, and so on.
In terms of the design of reliable components in hardware and/or software
implementations, a large proportion of finite field operations is used in most related
applications. Moreover, the frequently used finite field operations are usually simplified and
reconstructed as hardware modules for high-speed and efficient features, to replace
slow software modules or huge look-up tables (a fast software computation). Therefore, we
introduce those common operations and some techniques for circuit simplification in
this chapter. The finite field operations are additions, multiplications, inversions, and
constant multiplications, and the techniques include circuit simplification, resource-sharing
methods, etc. Furthermore, designers may use mathematical techniques such as group
isomorphism and basis transformation to achieve the minimum hardware complexities of
those operations; however, it takes a great deal of time and effort to search for the optimal designs.
To solve this problem, we propose computer-aided functions which can be used to
analyze the hardware speed/complexity and then provide the optimal parameters for the IP
design.
This chapter is organized as follows: In Section 2, the mathematical background of finite
field operations is presented. The VLSI implementation of those operations is described in
Section 3. Section 4 provides some techniques for simplification of VLSI design. The use of
element   b3 b2 b1 b0     element   b3 b2 b1 b0
0         0  0  0  0      α^7       0  1  1  1
α^0       0  0  0  1      α^8       1  1  1  0
α^1       0  0  1  0      α^9       0  1  0  1
α^2       0  1  0  0      α^10      1  0  1  0
α^3       1  0  0  0      α^11      1  1  0  1
α^4       1  0  0  1      α^12      0  0  1  1
α^5       1  0  1  1      α^13      0  1  1  0
α^6       1  1  1  1      α^14      1  1  0  0
Table 1. The standard basis expression for all elements of E = GF(2^4)
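The entries of Table 1 can be regenerated by repeated multiplication by α with reduction modulo the relation polynomial. The sketch below assumes the basis of Table 1 is built from x^4 + x^3 + 1, which is consistent with the bit patterns shown (e.g., α^4 = α^3 + 1).

```python
def gf16_powers():
    # α^0 .. α^14 as 4-bit integers (b3 b2 b1 b0), with α a root of
    # x^4 + x^3 + 1 (binary 1 1001 = 0x19)
    elems, a = [], 1
    for _ in range(15):
        elems.append(a)
        a <<= 1                # multiply by α
        if a & 0x10:           # a degree-4 term appeared:
            a ^= 0x19          # reduce by x^4 + x^3 + 1
    return elems
```

For example, α^4 = 0b1001 and α^7 = 0b0111, matching Table 1, and the 15 nonzero elements are all distinct, as they must be for a primitive element.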
Let β ∈ GF(2^m) be represented in a normal basis {β, β^2, β^(2^2), ..., β^(2^(m-1))}, and let the binary vector (b_0, b_1, ..., b_(m-1)) represent the coefficients of an element A = Σ_{i=0}^{m-1} b_i β^(2^i). Since β^(2^m) = β by Fermat's little theorem, we have

A^(2^i) = (b_(m-i), b_(m-i+1), ..., b_(m-1), b_0, b_1, ..., b_(m-i-1)).

That is, the squaring operations (2^i-th powers) are simply cyclic shifts of the coefficient vector, which can be realized by rewiring in
hardware, with low complexity for practical applications (Fenn et al., 1996).
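The cyclic-shift property of 2^i-th powers can be sketched directly on the coefficient vector:

```python
def nb_power2(bits, i):
    # A^(2^i) in a normal basis: the coefficient vector (b0, ..., b_{m-1})
    # is rotated so that position j receives b_{(j - i) mod m}
    m = len(bits)
    return [bits[(j - i) % m] for j in range(m)]
```

Since the shift is mere rewiring, repeated squaring costs no gates in hardware, and rotating m times returns the original element.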
q(x) = (x + alpha)(x + alpha^(2^4)) = x^2 + (alpha + alpha^(2^4))x + alpha^17 = x^2 + alpha^34 x + alpha^17.    (1)

Note that alpha^17 is an element of the ground field GF(2^4). In order to represent the elements of the ground field GF(2^4), we use the constant term of q(x), namely alpha^17, as the basis element. An element A is expressed in GF((2^4)^2) as

A = a0 + a1 alpha,    (2)

where a0, a1 in GF(2^4). We can express each aj in GF(2^4) using alpha^17 as the basis element:

aj = a_j0 + a_j1 alpha^17 + a_j2 alpha^34 + a_j3 alpha^51.    (3)

Hence

A = a0 + a1 alpha = (a00 + a01 alpha^17 + a02 alpha^34 + a03 alpha^51) + (a10 alpha + a11 alpha^18 + a12 alpha^35 + a13 alpha^52).    (4)

Next, substitute the terms alpha^(17i+j) for j = 0, 1 and i = 0, 1, 2, 3 by the relation polynomial f(x) = x^8 + x^4 + x^3 + x^2 + 1 as follows:

alpha^17 = alpha^7 + alpha^4 + alpha^3,    alpha^34 = alpha^6 + alpha^3 + alpha^2 + alpha,    alpha^51 = alpha^3 + alpha,
alpha^18 = alpha^5 + alpha^3 + alpha^2 + 1,    alpha^35 = alpha^7 + alpha^4 + alpha^3 + alpha^2,    alpha^52 = alpha^4 + alpha^2.    (5)
By substituting the above terms in expression Equation (4), we obtain the representation of
A = a0 + a1 alpha + a2 alpha^2 + a3 alpha^3 + a4 alpha^4 + a5 alpha^5 + a6 alpha^6 + a7 alpha^7.    (6)
The relationship between the terms a_h for h = 0, 1, ..., 7 and a_ji for j = 0, 1 and i = 0, 1, 2, 3 determines an 8-by-8 conversion matrix T (Sunar et al., 2003). The first row of the matrix T is obtained by gathering the constant terms on the right-hand side of Equation (4) after the substitution, which gives the constant coefficient on the left-hand side, i.e., the term a0. A simple inspection shows that a0 = a00 + a11. Proceeding row by row, we obtain the 8x8 matrix T, and this matrix gives the representation of an element in the binary field GF(2^8) given its representation in the composite field GF((2^4)^2) as follows:
[a0]   [1 0 0 0 0 1 0 0] [a00]
[a1]   [0 0 1 1 1 0 0 0] [a01]
[a2]   [0 0 1 0 0 1 1 1] [a02]
[a3] = [0 1 1 1 0 1 1 0] [a03]    (7)
[a4]   [0 1 0 0 0 0 1 1] [a10]
[a5]   [0 0 0 0 0 1 0 0] [a11]
[a6]   [0 0 1 0 0 0 0 0] [a12]
[a7]   [0 1 0 0 0 0 1 0] [a13]
The inverse transformation, i.e., the conversion from GF( 2 8 ) to GF(( 2 4 ) 2 ) , requires
computing the T 1 matrix. We can use Gauss-Jordan Elimination to derive the T 1 matrix as
follows:
[a00]   [1 0 0 0 0 1 0 0] [a0]
[a01]   [0 0 1 0 1 1 1 0] [a1]
[a02]   [0 0 0 0 0 0 1 0] [a2]
[a03] = [0 0 0 1 0 1 1 1] [a3]    (8)
[a10]   [0 1 0 1 0 1 0 1] [a4]
[a11]   [0 0 0 0 0 1 0 0] [a5]
[a12]   [0 0 1 0 1 1 1 1] [a6]
[a13]   [0 0 0 0 1 0 0 1] [a7]
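The two conversion matrices can be sanity-checked mechanically: over GF(2) their product must be the identity. A small sketch with the matrices of Equations (7) and (8) transcribed as 0/1 lists:

```python
# Verify that T and T^-1 of Equations (7)-(8) are mutual inverses over GF(2).
T = [
    [1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
]
Tinv = [
    [1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 0, 0, 1],
]

def gf2_matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] & y[k][j] for k in range(n)) & 1 for j in range(n)]
            for i in range(n)]

identity = [[int(i == j) for j in range(8)] for i in range(8)]
assert gf2_matmul(Tinv, T) == identity
assert gf2_matmul(T, Tinv) == identity
print("T and T^-1 verified as mutual inverses over GF(2)")
```

For example, the composite field element with a00 = 1 and all other coordinates 0 maps to the constant 1 in GF(2^8), as the first column of T shows.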
2.1.4 The basis transformation between standard basis and normal basis
The normal basis has some good features in hardware, but the standard basis is used in most popular designs, so finding the transformation between them is an important topic (Lu, 1997); we use GF(2^4) as an example to illustrate it. Suppose GF(2^4) is constructed with the relation p(x) = x^4 + x^3 + 1, which is a primitive polynomial. Let p(alpha) = 0, so that B1 = {alpha^0, alpha^1, alpha^2, alpha^3} forms a standard basis, and let beta = alpha^3.
The Design of IP Cores in Finite Field for Error Correction 119
The set {beta^1, beta^2, beta^4, beta^8} is linearly independent, so B2 = {beta^1, beta^2, beta^4, beta^8} forms a normal basis. There exists a matrix T such that B2^T = T B1^T and B1^T = T^-1 B2^T. The matrices T and T^-1 are listed as follows.
    [0 0 1 1]           [0 0 1 1]
T = [0 1 0 1],   T^-1 = [1 0 1 0].    (9)
    [0 1 1 1]           [0 1 1 0]
    [1 1 1 1]           [1 1 1 0]
Multiplying the coordinate vector of an element with respect to B1 by T yields its coordinate vector with respect to B2, and T^-1 performs the conversion in the opposite direction.
Division in a finite field can be performed by multiplicative inversion, since B/A = B A^-1. For example, consider the inversion in GF(2^8): A^-1 = A^(2^8 - 2), and one can obtain this as shown in Fig. 1.
Consider squaring in the standard basis. Since each coefficient a_i is in GF(2), we have a_i^2 = a_i, and thus A^2 = a0 + a1 x^2 + ... + a_(m-1) x^(2(m-1)). The terms with power not less than m can be re-expressed in the standard basis, so we can perform the square operation with some finite field additions, i.e., XOR gates. For instance, let E = GF(2^4) be constructed by f(x) = x^4 + x + 1, and let A = a0 + a1 x + a2 x^2 + a3 x^3 in E; then A^2 = a0 + a1 x^2 + a2 x^4 + a3 x^6. The two terms x^4 and x^6 can be substituted by x + 1 and x^3 + x^2 according to the relation f(x). We have A^2 = a0 + a1 x^2 + a2 (x + 1) + a3 (x^3 + x^2), or A^2 = (a0 + a2) + a2 x + (a1 + a3) x^2 + a3 x^3. The same approach applies to a field of any size.
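The XOR-only squaring circuit can be checked exhaustively. The sketch below squares by spreading the bits to even positions and then reducing modulo f(x) = x^4 + x + 1 (note x^4 reduces to x + 1 and x^6 to x^3 + x^2), and compares against the closed-form coordinates (a0 + a2, a2, a1 + a3, a3):

```python
F = 0b10011  # f(x) = x^4 + x + 1

def gf_sq(a):
    # square by spreading bits to even positions, then reduce modulo f(x)
    r = 0
    for i in range(4):
        if (a >> i) & 1:
            r ^= 1 << (2 * i)
    for i in range(6, 3, -1):        # reduce degrees 6 .. 4
        if (r >> i) & 1:
            r ^= F << (i - 4)
    return r

for a in range(16):
    a0, a1, a2, a3 = ((a >> i) & 1 for i in range(4))
    formula = (a0 ^ a2) | (a2 << 1) | ((a1 ^ a3) << 2) | (a3 << 3)
    assert gf_sq(a) == formula       # the XOR-gate expression matches
print("A^2 = (a0+a2) + a2 x + (a1+a3) x^2 + a3 x^3 verified for all 16 elements")
```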
Consider the multiplication in the standard basis: C = A B, where A = SUM_{i=0}^{m-1} a_i alpha^i and B = SUM_{i=0}^{m-1} b_i alpha^i, and the product is

P = (SUM_{i=0}^{m-1} a_i alpha^i)(SUM_{i=0}^{m-1} b_i alpha^i) = SUM_{i=0}^{2m-2} p_i alpha^i.

Note that every element in GF(2^m) satisfies the relation f(x) described in Section 2.1.1, so the terms with order not less than m, namely alpha^m, alpha^(m+1), ..., alpha^(2m-2), can be substituted by linear combinations of the standard basis {1, alpha, ..., alpha^(m-1)}. Thus, one can observe that the multiplier uses m^2 AND gates and about O(m^2) XOR gates, including those used in the substitution of the high-order terms.
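A direct (schoolbook) standard basis multiplier follows this recipe literally: m^2 partial products (the AND gates), combined and reduced through f(x) with XORs. A sketch for GF(2^8) using the chapter's polynomial p(x) = x^8 + x^4 + x^3 + x^2 + 1:

```python
M = 8
F = 0b100011101  # p(x) = x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a, b):
    # m^2 partial products (AND gates), combined with XORs ...
    r = 0
    for i in range(M):
        if (b >> i) & 1:
            r ^= a << i
    # ... then substitution of the high-order terms alpha^m .. alpha^(2m-2)
    for i in range(2 * M - 2, M - 1, -1):
        if (r >> i) & 1:
            r ^= F << (i - M)
    return r
```

For example, gf_mul(0x80, 0x02) exercises exactly one reduction step: alpha^7 * alpha = alpha^8 = alpha^4 + alpha^3 + alpha^2 + 1 = 0x1D.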
Consider the multiplication in a normal basis: C = A B, where A = SUM_{i=0}^{m-1} a_i beta^(2^i), B = SUM_{i=0}^{m-1} b_i beta^(2^i), and C = SUM_{i=0}^{m-1} c_i beta^(2^i). Denote a = (a_0, ..., a_(m-1)) and b = (b_0, ..., b_(m-1)). We have

c_(m-1) = a M b^T,    (11)

where M is the m-by-m matrix over GF(2) whose (i, j) entry is the coefficient of beta^(2^(m-1)) in the normal basis expression of beta^(2^i + 2^j); the remaining coordinates c_(m-1-i) are obtained from the same function by cyclically shifting a and b by i positions. Using Equation (11), the bit-serial Massey-Omura multiplier can be designed as follows:
[Block diagram: shift registers A and B feed an AND-XOR plane, which produces the output coordinate c_(m-1-i) in clock cycle i.]
Fig. 1. The Massey-Omura bit-serial multiplier
In Fig. 1, the two shift registers perform the squaring operation in the normal basis, and the complexity of the AND-XOR plane is about O(m), related to the number of nonzero elements in M. Therefore, the Massey-Omura multiplier is suitable for the design of area-limited circuits.
3.2 Inverse
In general, the inverse circuit has the largest area and time complexity among the finite field operations. There are two main methods to implement the finite field inverse: multiplicative inversion and inversion based on the composite field. The first method decomposes the inversion into multiplications and squarings, and the optimal way of decomposing was proposed by Itoh and Tsujii (Itoh & Tsujii, 1988). The latter is based on the composite field, is suited for area-limited circuits, and has been widely used in many applications.
By Fermat's little theorem, the multiplicative inverse of A in GF(2^m) is equal to A^(2^m - 2). Based on the fact that 2^m - 2 = 2(2^(m-1) - 1), Itoh and Tsujii reduced the number of required multiplications to O(log m), based on the binary decomposition of the exponent. Suppose m - 1 = SUM_{n=0}^{b-1} a_n 2^n, where a_n is in GF(2) and a_(b-1) = 1, and let [a_(b-1) ... a_1 a_0]_2 denote the corresponding decimal number. Since a_(b-1) = 1, we have m - 1 = 2^(b-1) + [a_(b-2) ... a_1 a_0]_2, and the following facts hold:

2^(m-1) - 1 = (2^(2^(b-1)) - 1) 2^([a_(b-2) ... a_1 a_0]_2) + (2^([a_(b-2) ... a_1 a_0]_2) - 1)
            = (2^(2^(b-2)) + 1)(2^(2^(b-2)) - 1) 2^([a_(b-2) ... a_1 a_0]_2) + (2^([a_(b-2) ... a_1 a_0]_2) - 1).    (12)

The remaining term expands in the same manner:

2^([a_(b-2) ... a_1 a_0]_2) - 1 = a_(b-2) (2^(2^(b-2)) - 1) 2^([a_(b-3) ... a_1 a_0]_2) + (2^([a_(b-3) ... a_1 a_0]_2) - 1).    (13)

Repeating the expansion until the exponent is exhausted expresses 2^(m-1) - 1 entirely in terms of factors of the form (2^(2^i) + 1)(2^(2^i) - 1) and multiplications by powers of 2.    (14)

Since raising an element to the power 2^(2^i) costs only squarings, A^(2^(m-1) - 1) can be computed with floor(log2(m-1)) + w(m-1) - 1 general multiplications, where w( ) denotes the Hamming weight, and A^-1 is then obtained with one final squaring.
The composite field inversion is designed as shown in Fig. 2. Obviously, one can observe that the inversion in GF(2^m) is executed by several operations that all lie in GF((2^(m/2))^2); thus the total gate count can be reduced.
[Fig. 2: block diagram of the composite field inversion circuit, with inputs a0, a1 and field constants p0, p1, and outputs b0, b1.]
beta_8 = gamma_1 + gamma_0,  beta_4 = gamma_2 + gamma_0,  beta_2 = gamma_2 + gamma_1 + gamma_0,  beta_1 = gamma_3 + gamma_2 + gamma_1 + gamma_0.    (15)

It takes 7 XOR gates for the straightforward implementation. However, if one calculates the summation t = gamma_2 + gamma_1 + gamma_0 first, then beta_2 = t and beta_1 = gamma_3 + t. Therefore, the number of XOR gates is reduced to 5. Although this example is at the bit level, the idea is also effective in other design stages. Consider another example from the previous section: when we form the components Delta = a0^2 + a0 a1 p1 + p0 a1^2 and b0 = (a0 + a1 p1) Delta^-1, it takes three 2-input adders in the two expressions. Suppose we form the component a0 + a1 p1 first, so that Delta = a0 (a0 + a1 p1) + p0 a1^2; then the number of 2-input adders is reduced from 3 to 2. Therefore, the resource-sharing idea can be applied throughout the design.
[Fig. 3: chart of XOR count (ranging from 133 to 185) and delay (in XOR-gate delays, 4 to 7) of the multiplier for each irreducible f(x).]
Fig. 3. The statistics of area for the multiplier vs. f(x)
[Fig. 4: chart of area (in XOR gates, ranging from 588 to 784) and f(x) weight for the inverse circuit for each irreducible f(x).]
Fig. 4. The statistics of area for the inverse vs. f(x)
GF(2^8) irreducible polynomials (# = 30):
1B 1D 2B 2D 39 3F 4D 5F 63 65 69 71 77 7B 87 8B 8D 9F A3 A9 B1 BD C3 CF D7 DD E7 F3 F5 F9
Primitive polynomials (# = 16):
1D 2B 2D 4D 5F 63 65 69 71 87 8D A9 C3 CF E7 F5
Table 4. The irreducible and primitive polynomials in GF(2^8)
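The counts in Table 4 can be reproduced by brute force. The sketch below tests all 256 degree-8 candidates, using the standard facts that a degree-8 polynomial p over GF(2) is irreducible iff x^(2^8) = x (mod p) and gcd(x^(2^4) + x, p) = 1, and primitive iff, in addition, x has multiplicative order 255 = 3 * 5 * 17:

```python
# Enumerate irreducible and primitive degree-8 polynomials over GF(2).
# Polynomials are integers, e.g. 0x11D = x^8 + x^4 + x^3 + x^2 + 1.
def pmod(a, m):
    dm = m.bit_length() - 1
    while a.bit_length() - 1 >= dm:
        a ^= m << (a.bit_length() - 1 - dm)
    return a

def pmulmod(a, b, m):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = pmod(a << 1, m)
        b >>= 1
    return r

def xpow(e, m):                       # x^e mod m by square-and-multiply
    r, base = 1, 2
    while e:
        if e & 1:
            r = pmulmod(r, base, m)
        base = pmulmod(base, base, m)
        e >>= 1
    return r

def pgcd(a, b):
    while b:
        a, b = b, pmod(a, b)
    return a

def irreducible(p):
    return xpow(256, p) == 2 and pgcd(xpow(16, p) ^ 2, p) == 1

def primitive(p):
    return all(xpow(255 // q, p) != 1 for q in (3, 5, 17))

irr = [p for p in range(0x100, 0x200) if irreducible(p)]
prim = [p for p in irr if primitive(p)]
print(len(irr), len(prim))   # -> 30 16
```

The hex values in Table 4 are the low bytes of these integers (e.g. 1D stands for 0x11D, the chapter's p(x)).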
Secondly, the CAD searches all possible combinations using the proposed algorithm shown in Table 6. The algorithm serves as a function for finding transformation matrices, as shown in Table 7. After gathering all the results, we can choose the better parameters from the list of analyzed results for the hardware design of a new IP.
Step 3: Compute the T^-1 matrix = ([(alpha^7)^T (alpha^6)^T ... (alpha^0)^T]_8x8)^-1.
Step 4: Substitute all nonzero elements of GF((2^4)^2) into the following equation and check it.
input:  p(x) = x^8 + x^4 + x^3 + x^2 + 1, p(x) over GF(2^8)
        f1(x) = x^4 + x + 1, with f1(omega) = omega^4 + omega + 1 = 0
        f2(x) = x^2 + x + omega^3; f1(x), f2(x) construct GF((2^4)^2)
output:
    [0 0 1 0 0 0 0 0]          [1 1 0 1 0 1 0 0]
    [0 1 1 0 0 1 0 0]          [0 1 0 1 1 0 1 0]
    [0 0 0 1 1 0 1 0]          [1 0 0 0 0 0 0 0]
T = [1 0 0 1 0 0 0 0],  T^-1 = [1 1 0 0 0 1 0 0]
    [0 0 0 1 0 1 1 1]          [1 0 0 0 0 1 1 0]
    [0 1 0 1 0 1 0 0]          [1 0 0 1 1 0 1 0]
    [1 1 0 0 1 1 0 0]          [1 1 1 1 0 0 1 0]
    [1 1 0 1 1 1 1 1]          [0 0 1 1 0 1 0 1]
Table 7. The resulting transformation matrices between GF(2^8) and GF((2^4)^2)
Because different parameters give VLSI outputs with huge variation, and it seems impossible to run all parameters by hand, engineers should be provided with a CAD tool implementing the analysis algorithm and its results. In Fig. 5, a designer uses a CAD tool with a Windows interface to find better parameters for the S-box. The CAD provides the complexity information of the multipliers and of the inverse in GF(2^8). In this figure, the designer chooses the fourth result; the estimated complexity of the inverse is shown at the top right of the figure, and the choice of multiplier is shown below the inverse information. Therefore, the CAD tool helps the designer choose the better parameters efficiently.
6. Summary
In this chapter, we introduce the common concepts of finite field regarding its applications in
error correcting coding, cryptography, and others, including the mathematical background,
some important designs for multiplier and inversion, and the idea to utilize the computer-
aided functions to find better parameters for IP or system design. The summary of this chapter
is as follows:
1. Introducing the basic finite field operations and their hardware designs: Those common
operations include addition, multiplication, squaring, inversion, basis transformation, and so
on. The VLSI designs of those operations may be understood through the mathematical
background provided in Section 2. From the mathematical background, one should realize the
benefits and the processes of the transformation between two isomorphic finite fields.
2. Using some techniques to simplify the circuits: We have introduced some useful techniques to reduce the area cost in VLSI design, such as the resource-sharing method, the utilization of different parameters, and the substitution of an isomorphic field for the field in use. The first technique is widely used in various design stages. The latter two techniques depend on the parameters used: different parameters lead to different hardware implementation results. However, it seems infeasible to analyze all possible parameters manually.
3. Using the composite field inversion: Composite field inversion is used in the finite field
inversion due to its superiority in hardware implementation. The main idea is to consider the
use of intermediate fields and decompose the inversion on the original field into several
operations on smaller fields. This method has been used in the AES S-box design to minimize
the area cost.
4. Calculating the transformation matrices between isomorphic finite fields. It is well known
that finite fields of the same order are isomorphic, and this implies the existence of
transformation matrices. Finding the optimal one is important in the investigation of the VLSI
designs. Two methods are presented; one is to change the relation polynomial, and the other is
to use the composite field. An algorithm to calculate the transformation matrices is provided in
Section 5, and it can be used to find the optimal one.
5. Using computer-aided design to search for better parameters: A good hardware CAD tool provides fast search and sufficient information for the designer, enabling fast and accurate designs. The computer-aided function based on the proposed algorithms in Section 5 is one example. When the order of the finite field grows large, the number of isomorphic fields increases rapidly, which makes exhaustive search almost impossible; the proposed CAD can then support engineers in making the best choices.
7. Conclusion
In this chapter, we use the concept of composite fields for the CAD designs, which can
support the VLSI designer to calculate the optimal parameters for finite field inversion.
8. Acknowledgments
This work was supported in part by the National Science Council, Taiwan, under grant NSC 96-2623-7-214-001.
9. References
Dinh, A.V.; Palmer, R.J.; Bolton, R.J. & Mason, R. (2001). A low latency architecture for
computing multiplicative inverses and divisions in GF(2m). IEEE Transactions on
Circuits and Systems II: Analog and Digital Signal Processing, Vol. 48, No. 8, pp. 789-
793, ISSN: 1057-7130
Fenn, S.T.J.; Benaissa, M. & Taylor, D. (1996). Fast normal basis inversion in GF(2m).
Electronics Letters, Vol. 32, No. 17, pp. 1566-1567, ISSN: 0013-5194
Hsiao, S.-F.; Chen, M.-C. & Tu, C.-S. (2006). Memory-free low-cost designs of
advanced encryption standard using common subexpression elimination for
subfunctions in transformations. IEEE Transactions on Circuits and Systems I: Regular
Papers, Vol. 53, No. 3, pp. 615–626, ISSN: 1549-8328
Itoh, T. & Tsujii, S. (1988). A fast algorithm for computing multiplicative inverses in GF(2m) using normal bases. Information and Computation, Vol. 78, No. 3, pp. 171-177, ISSN: 0890-5401
Jing, M.-H.; Chen, Z.-H.; Chen, J.-H. & Chen, Y.-H. (2007). Reconfigurable system for high-
speed and diversified AES using FPGA. Microprocessors and Microsystems, Vol. 31,
No. 2, pp. 94-102, ISSN: 0141-9331
Lidl, R. & Niederreiter, H. (1986). Introduction to finite fields and their applications, Cambridge
University Press, ISBN: 9780521460941
Lu, C.-C. (1997). A search of minimal key functions for normal basis multipliers. IEEE
Transactions on Computers, Vol. 46, No. 5, pp.588–592, ISSN: 0018-9340
Morioka, S. & Satoh, A. (2003). An optimized S-box circuit architecture for low power AES
design. Revised Papers from the 4th International Workshop on Cryptographic Hardware
and Embedded Systems, Lecture Notes in Computer Science, Vol. 2523, pp. 172–186,
ISBN: 3-540-00409-2, August, 2002, Redwood Shores, California, USA
Peterson, W.W. & Brown, D.T. (1961). Cyclic Codes for Error Detection, Proceedings of the
IRE, Vol. 49, No. 1, pp. 228-235, ISSN: 0096-8390
Peterson, W.W. & Weldon, E.J. (1972). Error-Correcting Codes, 2nd edition, The MIT Press, Cambridge, MA
Sunar, B.; Savas, E. & Koc, C.K., (2003). Constructing composite field representations for
efficient conversion. IEEE Transactions on Computer, Vol. 52, No. 11, pp. 1391-1398,
ISSN: 0018-9340
Wang, C.C.; Truong, T.K.; Shao, H.M.; Deutsch, L.J.; Omura, J.K.; & Reed, I.S. (1985). VLSI
architecture for computing multiplications and inverses in GF(2m). IEEE
Transactions on Computers, Vol. 34, No. 8, pp. 709-716, ISSN: 0018-9340
Wicker, S.B. (1995). Error Control Systems for Digital Communication and Storage, Prentice Hall,
ISBN: 0-13-308941-X, US
Scalable and Systolic Gaussian Normal Basis Multipliers
over GF(2m) Using Hankel Matrix-Vector Representation 131
1. Introduction
Efficient design and implementation of finite field multipliers have received high attention
in recent years because of their applications in elliptic curve cryptography (ECC) and error
control coding (Denning, 1983; Rhee, 1994; Menezes, Oorschot & Vanstone, 1997). Although
channel codes and cryptographic algorithms both make use of the finite field GF(2m), the
field orders needed differ dramatically: channel codes are typically restricted to arithmetic
with field elements which are represented by up to eight bits, whereas ECC rely on field
sizes of several hundred bits. The majority of publications concentrate on finite field
architectures for relatively small fields suitable for implementation of channel codes. In
finite field GF(2m), multiplication is one of the most important and time-consuming
computations. Since cryptographic applications (Menezes, Oorschot & Vanstone, 1997) are
the Diffie-Hellman key exchange algorithm based on the discrete exponentiation over
GF(2m), the methods of computing exponentiation over GF(2m) based on Fermat’s theorem
are performed by the repeated multiply-square algorithm. Therefore, to provide the high
performance of the security function, the efficient design of high-speed algorithms and
hardware architectures for computing multiplication is required and considered.
There are three popular basis representations, termed polynomial basis (PB), normal basis
(NB), and dual basis (DB). Each basis representation has its own advantages. The normal
basis multiplication is generally selected for cryptography applications, because the
squaring of an element of GF(2m) is simply a right cyclic shift of its coordinates. NB multiplication based on the selection of a key function was discovered by Massey and Omura (1986). For the elliptic curve digital signature algorithm (ECDSA) in IEEE Standard P1363
(2000) and National Institute of Standards and Technology (NIST) (2000), Gaussian normal
basis (GNB) is defined to implement the field arithmetic operation. The GNB is a special
class of normal basis, which exists for every positive integer m not divisible by eight. The
GNB for GF(2m) is determined by an integer t, and is called the type-t Gaussian normal
basis. However, since the complexity of a type-t GNB multiplier is proportional to t (Reyhani-Masoleh, 2006), small values of t are generally chosen to ensure that the field multiplication is implemented efficiently.
132 VLSI
Finite field multipliers are classified as either parallel or serial architectures. Bit-serial multipliers (Reyhani-Masoleh & Hasan, 2005; Lee & Chang, 2004) require less area, but are slow, taking m clock cycles to carry out the multiplication of two elements. Conversely, bit-parallel multipliers (Lee, Lu & Lee, 2001; Hasan, Wang & Bhargava, 1993; Kwon, 2003; Lee & Chiou, 2005) tend to be faster, but have higher hardware costs. Recently, various multipliers (Lee1, 2003; Lee, Horng & Jou, 2005; Lee, 2005; Lee2, 2003) have focused on bit-parallel architectures with optimal gate count. However, the previously mentioned bit-parallel multipliers show a computational complexity of O(m2) operations in GF(2); canonical and dual basis architectures are lower bounded by m2 multiplications and m2 - 1 additions, and normal basis ones by m2 multiplications and 2m2 - m additions. A multiplication in GF(2) can be realized by a two-input AND gate, and an addition by a two-input XOR gate. For this reason, it is attractive to provide architectures with low computational complexity for efficient hardware implementations.
Digit-serial/scalable multiplier architectures exist to improve the trade-off between throughput performance and hardware complexity. A scalable architecture requires both elements A and B to be separated into n = ceil(m/d) sub-word data, while a digit-serial architecture only requires one of the elements to be separated into sub-word data. Both architectures exploit scalability to handle growing amounts of work in a graceful manner, or to be readily enlarged. In (Tenca & Koc, 1999), a unit is considered scalable if it can be reused or replicated in order to generate long-precision results independently of the data path precision for which the unit was originally designed. Various digit-serial multipliers have recently been developed in (Paar, Fleischmann & Soria-Rodriguez, 1999; Kim & Yoo, 2005; Kim, Hong & Kwon, 2005; Guo & Wang, 1998; Song & Parhi, 1998; Reyhani-Masoleh & Hasan, 2002). Song and Parhi (1998) proposed MSD-first and LSD-first digit-serial PB multipliers using Horner's rule. By partitioning the structure of two-dimensional arrays, efficient digit-serial PB multipliers are found in (Kim & Yoo, 2005; Kim, Hong & Kwon, 2005; Guo & Wang, 1998). The major feature of these architectures is that they combine both serial and parallel algorithms.
For the large word lengths commonly found in cryptography, the bit-serial approach is rather slow, while bit-parallel realizations require large circuit area and power consumption. The performance of elliptic curve cryptosystems strongly depends on the implementation of finite field arithmetic. By employing the Hankel matrix-vector representation, a new GNB multiplication algorithm over GF(2m) is presented. Utilizing the basic characteristics of the MSD-first and LSD-first schemes (Song & Parhi, 1998), it is shown that the proposed GNB multiplication can be decomposed into n(n+1) Hankel matrix-vector multiplications. The proposed scalable GNB multipliers include one d-by-d Hankel multiplier, two registers, and one final reduction polynomial circuit. The results reveal that, if the selected digit size is d >= 4 bits, the proposed architecture has less time-space complexity than traditional digit-serial systolic multipliers, and can thus yield an optimum architecture for GNB multipliers over GF(2m). To further save both time and space complexity, the proposed scalar multiplication algorithm with Hankel matrix-vector representation can also be realized as a scalable and systolic architecture for the polynomial basis and dual basis of GF(2m).
The rest of this paper is structured as follows. Section 2 briefly reviews a conventional NB
multiplication algorithm and a Hankel matrix-vector multiplication. Section 3 proposes the
two GNB multipliers, based on Hankel matrix-vector representation, to yield a scalable and
systolic architecture. Section 4 introduces the modified GNB multiplier. Section 5 analyzes
our proposed GNB multiplier in the term of the time-area complexity. Finally, conclusions
are drawn in Section 6.
2. Preliminaries
2.1 Gaussian normal basis multiplication
The finite field GF(2m) is well known to be viewable as a vector space of dimension m over GF(2). A set N = {beta, beta^2, ..., beta^(2^(m-1))} is called a normal basis of GF(2m), and beta is called a normal element of GF(2m). Any element A in GF(2m) can be represented as

A = SUM_{i=0}^{m-1} a_i beta^(2^i) = (a0, a1, ..., am-1).    (1)

Let A = (a0, a1, ..., am-1) and B = (b0, b1, ..., bm-1) indicate two normal basis elements in GF(2m), and let C = (c0, c1, ..., cm-1) in GF(2m) represent their product, i.e., C = AB. Coordinate ci of C can then be represented by
ci = A^(i) M (B^(i))^T    (2)

where A^(i) denotes a right cyclic shift of the element A by i positions. To compute the
multiplication matrix M, one can see in (IEEE Standard P1363, 2000; Reyhani-Masoleh &
Hasan, 2003). When the multiplication matrix M is found, the NB multiplication algorithm is
described as follows:
Algorithm 1: (NB multiplication) (IEEE Standard P1363, 2000)
Input: A=(a0, a1,…, am−1) and B=(b0, b1,…, bm−1) GF(2m)
Output: C=(c0, c1,…, cm−1)=AB
1. initial: C=0
2. for i = 0 to m − 1 {
3. ci = A M B^T
4. A=A(1) and B=B(1)
5. }
6. output C=(c0, c1,…, cm−1)
Applying Algorithm 1, Massey and Omura (1986) first proposed bit-serial NB multiplier.
The complexity of the normal basis N, represented by CN, is the number of nonzero ij
values in M, and determines the gate count and time delay of the NB multiplier. It is shown
in (Mullin, Onyszchuk, Vanstone & Wilson, 1988/1989) that CN for any normal basis of GF(2m) is greater than or equal to 2m-1. For an efficient and simple implementation, a normal basis is chosen such that CN is as small as possible. Two types of optimal normal bases (ONB), type-1 and type-2, exist in GF(2m) if CN = 2m-1. However, such ONBs do not exist for all m.
Definition 1. Let p = mt + 1 be a prime number and gcd(mt/k, m) = 1, where k denotes the multiplicative order of 2 modulo p. Let beta be a primitive p-th root of unity. The type-t Gaussian
normal basis (GNB) is generated by the element gamma = beta + beta^(2^m) + beta^(2^(2m)) + ... + beta^(2^((t-1)m)), which yields a normal basis N for GF(2m) over GF(2).
Significantly, GNBs exist for GF(2m) whenever m is not divisible by 8. By adopting Definition
1, each element A of GF(2m) can also be given as

A = a0 gamma + a1 gamma^2 + ... + a(m-1) gamma^(2^(m-1))
  = a0 (beta + beta^(2^m) + ... + beta^(2^(m(t-1)))) + a1 (beta^2 + beta^(2^(m+1)) + ... + beta^(2^(m(t-1)+1))) + ... + a(m-1) (beta^(2^(m-1)) + beta^(2^(2m-1)) + ... + beta^(2^(mt-1))).    (3)

From Equation (3), the type-t GNB can be represented by the set {beta, beta^2, ..., beta^(2^(m-1)), beta^(2^m), beta^(2^(m+1)), ..., beta^(2^(2m-1)), ..., beta^(2^(m(t-1))), ..., beta^(2^(mt-1))}. Since beta is a primitive p-th
root of unity, we have

beta^i = beta^(i mod p), with beta^p = 1.    (4)
Thus, the GNB can alternatively be represented in the redundant basis R = {beta, beta^2, ..., beta^(p-1)}. The field element A can then be defined by the following formula:

A = a_F(1) beta + a_F(2) beta^2 + ... + a_F(p-1) beta^(p-1),    (5)

where F(2^i 2^(mj) mod p) = i for 0 <= i <= m-1 and 0 <= j <= t-1.
Example 1. Let A = (a0, a1, a2, a3, a4) be an NB element of GF(2^5), and let gamma = beta + beta^10 be used to generate the NB. Applying the redundant representation, the field element A can be represented by A = a0 beta + a1 beta^2 + a3 beta^3 + a2 beta^4 + a4 beta^5 + a4 beta^6 + a2 beta^7 + a3 beta^8 + a1 beta^9 + a0 beta^10.
Observing this representation, when the field element A is given in a type-t normal basis of GF(2m), its coefficients are t-fold duplicates of the coefficients of the original normal basis element A = (a0, a1, ..., am-1). Thus, by using the function F(2^i 2^(mj) mod p) = i, the field element A = (a_F(1), a_F(2), ..., a_F(p-1)) in the redundant representation can be translated into the representation A = (a0, ..., a0, a1, ..., a1, ..., am-1, ..., am-1), in which each coefficient appears t times. Therefore, the redundant basis is easily converted into the normal basis element, without extra hardware implementations.
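The index function F and the duplication pattern of Example 1 can be reproduced in a few lines (m = 5, t = 2, p = 11):

```python
# Reproduce Example 1: the redundant-basis index function F for the type-2
# GNB of GF(2^5), where m = 5, t = 2, p = mt + 1 = 11.
m, t = 5, 2
p = m * t + 1
F = {}
for i in range(m):
    for j in range(t):
        F[(2 ** i * 2 ** (m * j)) % p] = i

coeffs = [F[k] for k in range(1, p)]   # subscripts of a_F(1) .. a_F(p-1)
print(coeffs)
```

The printed pattern [0, 1, 3, 2, 4, 4, 2, 3, 1, 0] matches the coefficients a0, a1, a3, a2, a4, a4, a2, a3, a1, a0 of Example 1, and each index appears exactly t = 2 times.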
Let A=(a0, a1,…, am−1) and B=(b0, b1,…, bm−1) indicate two normal basis elements in GF(2m),
and C=(c0, c1,…, cm−1) represent their product, i.e., C=AB. Coordinate ci of C can then be
calculated as in the following formula: (IEEE Standard P1363, 2000)
c0 = G(A, B) = SUM_{j=1}^{p-2} a_F(j+1) b_F(p-j).    (6)
5. }
6. Output C=(c0,c1,…,cm−1)
Definition 3. Let i, j, m be positive integers with 0 <= i, j <= m-1. One can define the following function sigma(i, j):

sigma(i, j) = <(j - i)/2>,        for i + j even,
sigma(i, j) = <-(i + j + 1)/2>,   for i + j odd,

where <q> denotes q mod m.
Let i denote a fixed integer with 0 <= i <= m-1; one verifies that the map j -> k = sigma(i, j) is a permutation of the complete set {0, 1, ..., m-1}. For instance, Table 1 presents the relationship between j and k with k = sigma(i, j), where m = 7, i = 2, and 0 <= j <= 6. Therefore, by substituting j = sigma(i, j) into Equation (6), the product C can be denoted as
C = SUM_{i=0}^{m-1} ( SUM_{j=0}^{m-1} h_(sigma(i,j)+i) b_(sigma(i,j)) ) beta^(2^i).    (8)
j 0 1 2 3 4 5 6
k = sigma(i,j)   6 5 0 4 1 3 2
Table 1. The relationship between j and k for i=2
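Definition 3 and the table above are easy to cross-check; note that both branches of sigma are exact integer divisions:

```python
def sigma(i, j, m):
    # sigma(i, j) of Definition 3; <q> denotes q mod m
    if (i + j) % 2 == 0:
        return ((j - i) // 2) % m        # exact: j - i is even here
    return ((-(i + j + 1)) // 2) % m     # exact: i + j + 1 is even here

row = [sigma(2, j, 7) for j in range(7)]
print(row)   # Table 1 for m = 7, i = 2
```

For every fixed i the map j -> sigma(i, j) indeed permutes {0, 1, ..., m-1}, which is what lets the multiplier reindex the sum of Equation (6).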
Example 2: Let A = (a0, a1, a2, a3, a4) and h = (h0, h1, h2, h3, h4, h5, h6, h7, h8) represent two vectors, let H be the Hankel matrix defined by h (i.e., H[i][j] = h(i+j)), and let C = (c0, c1, c2, c3, c4) be the product HA. By applying Equation (8), the product can be derived as follows:

c0 = a0 h0 + a1 h1 + a2 h2 + a3 h3 + a4 h4
c1 = a0 h1 + a1 h2 + a2 h3 + a3 h4 + a4 h5
c2 = a0 h2 + a1 h3 + a2 h4 + a3 h5 + a4 h6
c3 = a0 h3 + a1 h4 + a2 h5 + a3 h6 + a4 h7
c4 = a0 h4 + a1 h5 + a2 h6 + a3 h7 + a4 h8
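A minimal GF(2) model of this Hankel matrix-vector product, with & and ^ standing for the AND and XOR gates of the systolic cells:

```python
# GF(2) model of C = H*A where H is the m x m Hankel matrix H[i][j] = h[i+j].
def hankel_mul(h, a, m):
    c = []
    for i in range(m):
        bit = 0
        for j in range(m):
            bit ^= a[j] & h[i + j]   # AND gate feeding an XOR accumulation
        c.append(bit)
    return c

h = [1, 0, 1, 1, 0, 0, 1, 0, 1]      # h_0 .. h_8
print(hankel_mul(h, [1, 0, 0, 0, 0], 5))   # the unit vector picks h_0 .. h_4
```

A unit vector at position k selects the anti-diagonal band h_k .. h_(k+4), which is exactly the defining property of a Hankel matrix.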
Figure 1 depicts the bit-parallel systolic Hankel multiplier given by Example 2. The multiplier comprises 25 cells, including 14 U-cells and 11 V-cells. Every U(i,j) cell (Figure 2(a)) contains one 2-input AND gate, one 2-input XOR gate and three 1-bit latches to realize ci = ci + a(sigma(i,j)) h(sigma(i,j)+i). Each V(i,j) cell (Figure 2(b)) is formed by one 2-input AND gate, one 2-input XOR gate and four 1-bit latches to realize the operation ci = ci + a(sigma(i,j)) h(sigma(i,j)+i) or ci = ci + a(sigma(i,j)) h(sigma(i,j)+i+m). As stated in the above cell operations, the latency is m clock cycles, and the computation time per cell is that of one 2-input AND gate and one 2-input XOR gate.
[Fig. 1: the bit-parallel systolic Hankel multiplier array with inputs h0 .. h8 and a0 .. a4 and outputs c0 .. c4. Fig. 2: the U-cell and V-cell circuits, realizing ci = ci + a(sigma(i,j)) h(sigma(i,j)+i), with the V cell optionally using h(sigma(i,j)+i+m).]
Let A and B denote two type-t GNB elements in GF(2m), where beta represents the root of x^p + 1. Assume that the chosen digit size is d bits and n = ceil(p/d); both elements A and B can then be expressed as follows:

A = SUM_{i=0}^{n-1} Ai beta^(di),    B = SUM_{i=0}^{n-1} Bi beta^(di),

where

Ai = SUM_{j=0}^{d-1} a_F(di+j) beta^j,    Bi = SUM_{j=0}^{d-1} b_F(di+j) beta^j.
Based on the partial multiplication for determining AB0, the partial product can be denoted by

AB0 = A0 B0 + A1 B0 beta^d + ... + A(n-1) B0 beta^(d(n-1)).    (9)

Each term Ai B0, of degree 2d-2, is the core computation of Equation (9). In a general multiplication, let us define that Ai B0 is formed by

Ai B0 = Si + Di beta^d,  for 0 <= i <= n-1,    (10)

where

Si = s(i,0) + s(i,1) beta + ... + s(i,d-1) beta^(d-1),
Di = d(i,0) + d(i,1) beta + ... + d(i,d-2) beta^(d-2),
s(i,j) = SUM_{k=0}^{j} a_F(id+k) b_F(j-k),  for 0 <= j <= d-1,
d(i,j) = SUM_{k=j+1}^{d-1} a_F(id+k) b_F(d+j-k),  for 0 <= j <= d-2.
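The split of Equation (10) is simply the low/high split of a d-by-d carry-less product. The sketch below verifies the s(i,j) and d(i,j) convolution formulas exhaustively for d = 4 (the F( ) index map is dropped for clarity):

```python
# Check Equation (10): Ai*B0 = Si + Di*beta^d is the low/high split of a
# d x d carry-less product; a, b below stand for d-bit operands.
d = 4

def clmul(a, b):
    r = 0
    for k in range(d):
        if (b >> k) & 1:
            r ^= a << k
    return r

for a in range(1 << d):
    for b in range(1 << d):
        prod = clmul(a, b)
        s, hi = prod & ((1 << d) - 1), prod >> d
        for j in range(d):            # s_j = XOR_{k=0..j} a_k b_{j-k}
            sj = 0
            for k in range(j + 1):
                sj ^= ((a >> k) & 1) & ((b >> (j - k)) & 1)
            assert ((s >> j) & 1) == sj
        for j in range(d - 1):        # d_j = XOR_{k=j+1..d-1} a_k b_{d+j-k}
            dj = 0
            for k in range(j + 1, d):
                dj ^= ((a >> k) & 1) & ((b >> (d + j - k)) & 1)
            assert ((hi >> j) & 1) == dj
print("Equation (10) split verified for all", (1 << d) ** 2, "operand pairs")
```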
Output: C = SUM_{i=0}^{p-1} c_F(i) beta^i = A B0 mod (beta^p + 1)
...
6. C = X mod (beta^p + 1)
7. return C = (c_F(0), c_F(1), ..., c_F(p-1))
Algorithm 3 for determining AB0 includes two core operations, namely the Hankel multiplication and the reduction polynomial beta^p + 1, as illustrated in Figure 3. The proposed partial multiplier architecture in Figure 3 operates by the following procedure.
Step 1: From Equation (14), the Hankel matrix H(n-i) is defined by the coefficients of A. Here A = (a_F(0), a_F(1), ..., a_F(p-1)) is first converted to the Hankel vector h(n-i) = (a_F(d(i-1)+1), a_F(d(i-1)+2), ..., a_F(d(i+1)-1)), and the result is stored in the register H(n-i).
Step 2: Applying the bit-parallel systolic Hankel multiplier shown in Figure 1, Figure 3 shows the AB0 computation. For each Hankel multiplication in Step 4 of Algorithm 3, the result of AB0 is accumulated in the register X.
Step 3: After (n+1) Hankel multiplications, the result is reduced by the polynomial beta^p + 1.
Generally, the computation of ABi for 0 <= i <= n-1 can be obtained by the following formula:

ABi = hn Bi + h(n-1) Bi beta^d + ... + h0 Bi beta^(dn).    (16)

The above equation indicates that each ABi computation can be decomposed into (n+1) Hankel multiplications. As mentioned above, the two GNB multipliers are described in the following subsections.
[Fig. 3: datapath with registers H0 .. Hn holding the Hankel vectors derived from A, a bit-parallel systolic Hankel multiplier fed by Bi, a (p+d)-bit X register, and a final reduction C = X mod (beta^p + 1).]
Fig. 3. The proposed scalable and systolic architecture for computing AB0
Applying Equations (16) and (17), the proposed LSD-first scalable systolic GNB
multiplication is addressed as follows:
1.4 C = 0
2. Multiplication step:
2.1 for i = 0 to n-1 do
2.2   C = C + PM(A, Bi) (where PM(A, Bi) is as referred to in Algorithm 3)
2.3   A = A beta^d mod (beta^p + 1)
2.4 endfor
3. Basis conversion step:
3.1 C = (c0, ..., c0, ..., cm-1, ..., cm-1) <- (c_F(0), c_F(1), ..., c_F(p-1)), each ci repeated t times
4. Return (c0, c1, ..., cm-1)
The proposed LSDGNB scalable multiplication algorithm is split into an n-loop of partial multiplications. Figure 4 depicts the LSDGNB multiplier based on the proposed partial multiplier in Fig. 3. Both NB elements A and B are initially transformed into the redundant basis given by Equation (4) and stored in registers A and B, respectively. In round 0 (see Figure 4), the systolic array in Figure 3 is adopted to compute C = A^(0) B0, and the result is stored in register C. In round 1, the element A must be cyclically shifted to the right by d digits, and the result produced by the systolic array is added to the register C of round 0. The first round, which determines the latency, requires d + n clock cycles. Each subsequent round requires a latency of n + 1 clock cycles. Finally, the entire multiplication requires a latency of d + n(n+1) clock cycles. The critical propagation delay of every cell is the total delay of one 2-input AND gate, one 2-input XOR gate and one 1-bit latch.
[Fig. 4: the LSDGNB multiplier datapath, ending with a basis conversion from the p-bit redundant representation to the m-bit result C.]
Fig. 4. The proposed LSD-first scalable systolic GNB multiplier over GF(2m)
1.4 C = 0
2. Multiplication step:
2.1 for i = 1 to n do
2.2   C = C beta^d mod (beta^p + 1) + PM(A, B(n-i)), where PM(A, B(n-i)) is as referred to in Algorithm 3
2.3 endfor
3. Basis conversion step:
3.1 C = (c0, ..., c0, ..., cm-1, ..., cm-1) <- (c_F(0), c_F(1), ..., c_F(p-1)), each ci repeated t times
4. return (c0, c1, ..., cm-1)
Algorithm 5 presents the MSD-first scalable multiplication, and Figure 5 presents the entire
GNB multiplier architecture. Comparing the two GNB multiplier architectures: in the
LSDGNB multiplier, before each round of computation the element A must be updated by
A^(id) = A^((i−1)d)·x^d mod (x^p + 1), whereas in the MSDGNB multiplier, after each round
of computation the result C must be updated by C^(id) = C^((i−1)d)·x^d mod (x^p + 1).
Notably, both operations A^(id) and C^(id) represent a right cyclic shift by id positions.
Hence, the two proposed architectures have the same time and space complexity.
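The MSD-first schedule can be modelled the same way (again an illustrative sketch with a stand-in pm(), not the hardware): by Horner's rule, the accumulator C is cyclically shifted by d positions each round instead of A, mirroring the C = C·x^d + PM(A, B(n−i)) recurrence of Algorithm 5.

```python
# Toy model of the MSD-first schedule: C = C * x^d + A * B_digit,
# starting from the most significant digit of B (mod x^p + 1 over GF(2)).

def pm(a, digit, p):
    c = [0] * p
    for j, bj in enumerate(digit):
        if bj:
            for i in range(p):
                c[(i + j) % p] ^= a[i]
    return c

def msd_mul(a, b, p, d):
    digits = [b[i:i + d] for i in range(0, p, d)]
    c = [0] * p
    for digit in reversed(digits):               # most significant digit first
        c = [c[(k - d) % p] for k in range(p)]   # C = C * x^d mod (x^p + 1)
        c = [x ^ y for x, y in zip(c, pm(a, digit, p))]
    return c
```

Because the per-round work is identical and only the shifted operand differs (A in the LSD-first case, C here), the two schedules produce the same product with the same cost.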
Fig. 5. The proposed MSD-first scalable systolic GNB multiplier over GF(2m)
where Bi = Σ_{j=0}^{d−1} bF(id+j)·x^j and n = (mt+1)/d. From Equation (14), using
LSD-first multiplication,

A^(j) = A·x^j mod (x^p + 1) = Σ_{i=0}^{p−1} aF(i+j)·x^i, for 0 ≤ j ≤ d−1,

and c0,F(i) = Σ_{j=0}^{d−1} bF(j)·aF(i+j), for 0 ≤ i ≤ p−1.
In (Wu, Hasan, Blake & Gao, 2002), it is shown that, after converting from the GNB
representation of (Reyhani-Masoleh & Hasan, 2005) back to the normal basis, the minimum
representation of A has a Hamming weight of at most mt/2 if m is even and (mt−t+2)/2 if m is
odd. Assume that the coordinates of the partial product C0 in Equation (19) are selected
as q = dk consecutive coordinates satisfying the corresponding normal basis representation,
where q ≥ mt/2 if m is even and q ≥ (mt−t+2)/2 if m is odd. Then the partial product AB0 can
be calculated by
AB0 = h(k−1)·B0 + h(k−2)·B0·x^d + … + h0·B0·x^{(k−1)d}.
Similarly,
AB1·x^d = hk·B1 + h(k−1)·B1·x^d + … + h1·B1·x^{(k−1)d}
AB2·x^{2d} = h(k+1)·B2 + hk·B2·x^d + … + h2·B2·x^{(k−1)d}
⋮
AB(n−1)·x^{(n−1)d} = h(k+n−2)·B(n−1) + h(k+n−3)·B(n−1)·x^d + … + h(n−1)·B(n−1)·x^{(k−1)d}
Thus, the modified multiplication requires only nk Hankel multiplications. As stated above,
the modified LSD-first scalable multiplication proceeds as follows:
Algorithm 6. (modified LSDGNB scalable multiplication)
Input: A = (a0, a1, …, a(m−1)) and B = (b0, b1, …, b(m−1)) are two normal basis elements in GF(2m)
Output: C = (c0, c1, …, c(m−1)) = AB
1. Initial step:
1.1 A = (aF(0), aF(1), …, aF(p−1)) ← (a0, a1, …, a(m−1))
1.2 B = (bF(0), bF(1), …, bF(p−1)) ← (b0, b1, …, b(m−1))
1.3 B = Σ_{i=0}^{n−1} Bi·x^{di}, where n = p/d and Bi = Σ_{j=0}^{d−1} bF(id+j)·x^j
1.4 C = Σ_{i=0}^{k−1} Ci·x^{di} = 0, where k = q/d and Ci = Σ_{j=0}^{d−1} cF(id+j)·x^j
1.5 All Hankel vectors hi for 0 ≤ i ≤ n+k−1 are converted from the redundant basis
representation of A.
2. Multiplication step:
2.1 for i = 0 to k−1 do
2.2 for j = 0 to n−1 do
2.3 C(k−1−i) = C(k−1−i) + H(i+j)·Bj
2.4 endfor
2.5 endfor
3. Basis conversion step:
3.1 C = (c0, …, c0, …, c(m−1), …, c(m−1)) ← (cF(0), cF(1), …, cF(q−1))
4. Return (c0, c1, …, c(m−1))
Applying Algorithm 6, Figure 6 shows a LSDGNB multiplier using the redundant
representation. The circuit includes two register files, one d × d Hankel multiplier, and one
summation circuit. In the initial step, registers H and B are loaded by Steps 1.5 and 1.3,
respectively. Each register Hi comprises (2d−1)-bit latches, and each register Bi comprises
d-bit latches. The operation of a d × d Hankel matrix-vector multiplier has been described
in the previous section. In Figure 6, the MUX is responsible for shifting register H, and the
SW is in charge of shifting the outcome of the GNB multiplication. As mentioned above, the
total number of Hankel multiplications is reduced from n(n+1) to nk, where k = mt/2d if m is
even and k = (mt−t+2)/2d if m is odd. In the configuration of Figure 6, the modified
multiplier requires no final polynomial-reduction circuit. Hence, the modified multiplier has
lower time- and space-complexity as compared to Figures 4 and 5.
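For reference, the d × d Hankel matrix-vector product at the core of the circuit can be sketched in a few lines (illustrative code, GF(2) arithmetic): a Hankel matrix satisfies H[i][j] = h[i+j], so it is fully determined by 2d−1 entries, which is why each register Hi needs only (2d−1)-bit latches.

```python
# d x d Hankel matrix-vector product over GF(2): out[i] = XOR over j of h[i+j] & b[j].

def hankel_mv(h, b):
    d = len(b)
    assert len(h) == 2 * d - 1   # a Hankel matrix is defined by 2d-1 values
    out = []
    for i in range(d):
        acc = 0
        for j in range(d):
            acc ^= h[i + j] & b[j]
        out.append(acc)
    return out
```

For example, hankel_mv([1, 0, 1, 1, 0], [1, 1, 0]) evaluates the 3 × 3 Hankel matrix with rows (1,0,1), (0,1,1), (1,1,0) against the vector, giving [1, 1, 0].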
Fig. 6. The modified LSD-first scalable systolic GNB multiplier over GF(2m)
Table 2 compares the circuits of the proposed scalable multipliers with those of other
non-scalable (bit-parallel) multipliers. Table 3 lists the total latency of the proposed
multipliers. According to that table, the proposed multipliers for a type-2 GNB save about
40% latency compared to Kwon's (2003) and Lee & Chiou's (2005) multipliers, and those for a
type-1 GNB save about 60% latency compared to the Lee-Lu-Lee multipliers (2001). Since the
digit size d can be selected to minimize the total latency, the proposed multipliers
have low hardware complexity and low latency.
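As a quick numerical check of these latency formulas, consider the type-2 GNB of GF(2²³³) (m = 233, t = 2, m odd) with an illustrative digit size d = 8, taking p = mt + 1 as implied by n = (mt+1)/d earlier in the section:

```python
import math

m, t, d = 233, 2, 8                        # type-2 GNB of GF(2^233); d = 8 is illustrative
p = m * t + 1                              # redundant-basis length
n = math.ceil(p / d)                       # number of digits of B
k = math.ceil((m * t - t + 2) / (2 * d))   # m odd case
latency_basic = d + n * (n + 1)            # LSD/MSD-first multipliers (Figures 4 and 5)
latency_modified = d + n * k               # modified multiplier (Figure 6)
print(latency_basic, latency_modified)     # 3548 1778
```

At this digit size the modified schedule roughly halves the cycle count, consistent with the reduction from n(n+1) to nk Hankel multiplications.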
Fig. 7. Comparisons of the time-area complexity for various digit-serial multipliers over
GF(2²³³)
Fig. 8. Comparisons of transistor count for various digit-serial multipliers over GF(2²³³)
By applying cut-set systolization techniques (Kung, 1988), various digit-serial systolic
multipliers have recently been reported in (Kim, Hong & Kwon, 2005; Guo & Wang, 1998); they
consist of n identical processing elements (PEs) to improve the trade-off between throughput
performance and hardware complexity. Each PE requires a maximum propagation delay of
Tmax = (d−1)(TA+TiX+TM)+TA+TiX, where TA, TiX and TM denote the propagation delays of a
2-input AND gate, an i-input XOR gate and a 2-to-1 multiplexer, respectively. The maximum
propagation delay in each PE is large if the selected digit size d is large. The proposed
scalable systolic architectures do not have this problem, since the propagation delay of
each PE is independent of the selected digit size d. Applying Horner's rule, Song and Parhi
(1998) suggested MSD-first and LSD-first digit-serial multipliers. These digit-serial
multipliers separate only one of the input signals A and B into n = m/d sub-word data. Our
proposed architectures separate both input signals into n sub-word data, in which one of the
input elements is translated into the Hankel vector representation. The proposed LSD-first
and MSD-first scalable multiplication algorithms require n(n+1) Hankel multiplications, and
the modified multiplication algorithm demands only nk Hankel multiplications, where k = mt/2d
if m is even and k = (mt−t+2)/2d if m is odd. Using a single Hankel multiplier to implement
our proposed scalable multipliers, we obtain O(d²) space complexity, while other digit-serial
multipliers require O(md) space complexity, as seen in Tables 4 and 5.
To compare time-area complexity, the transistor count based on a standard CMOS VLSI
realization is employed. The basic logic gates (2-input XOR, 2-input AND, 1×2 SW, MUX and
1-bit latch) are assumed to be composed of 6, 6, 6, 6 and 8 transistors, respectively
(Kang & Leblebici, 1999). Real circuits (STMicroelectronics, http://www.st.com) such as the
M74HC86 (XOR gate, TX = 12 ns typ.), M74HC08 (AND gate, TA = 7 ns typ.), M74HC279 (SR latch,
TL = 13 ns typ.) and M74HC257 (MUX, TM = 11 ns typ.) are employed for comparing time
complexity in this chapter. In the finite field GF(2²³³), Figures 7 and 8 show how our
proposed scalable multipliers compare with the corresponding digit-serial multipliers (Kim,
Hong & Kwon, 2005; Guo & Wang, 1998; Reyhani-Masoleh & Hasan, 2002). When the selected
digit size d ≥ 4, the proposed scalable multipliers have lower time-area complexity than the
two reported digit-serial PB multipliers (Kim, Hong & Kwon, 2005; Guo & Wang, 1998), as
shown in Figure 7. When the selected digit size d ≥ 8, the modified scalable multiplier has
lower time-area complexity than the corresponding digit-serial NB multiplier
(Reyhani-Masoleh & Hasan, 2002). In terms of transistor count, Figure 8 reveals that our
scalable multipliers have lower space complexity than the reported digit-serial multipliers
(Kim, Hong & Kwon, 2005; Guo & Wang, 1998; Reyhani-Masoleh & Hasan, 2002).
Multipliers        Guo & Wang (1998)            Kim, Hong & Kwon (2005)    Figure 5         Figure 6
Basis              polynomial                   polynomial                 Gaussian normal  Gaussian normal
Architecture       digit-serial                 digit-serial               scalable         scalable
Total complexity:
  2-input XOR      2ed²                         2ed²                       d²+d+p           d²+d
  2-input AND      e(2d²+d)                     e(2d²+d)                   d²               d²
  1-bit latch      10d+5Pde                     10d+1+4.5Pd+Pe             3.5d²+5p+3nd     Q
  1×2 SW           0                            0                          p                d
  MUX              2ed                          2ed                        0                1
Critical path      Pipelined:                   Pipelined:                 TA+TX            TA+TX
                   d(TA+2TX+2TM)/(P+1)          d(TA+TX+2TM)/(P+1)
                   Non-pipelined:               Non-pipelined:
                   TA+3TX+(d−1)(TA+2TX+2TM)     TA+TX+(d−1)(TA+TX+2TM)
Latency            Pipelined: (3+P)e            Pipelined: (3+P)e          d+n(n+1)         d+nk
                   Non-pipelined: 3e            Non-pipelined: 3e
Note: k = q/d, e = m/d; Q = 3.5d²+nd+(2d−1)(n+k−1); TM denotes the 2-to-1 MUX gate delay;
P+1 is the number of pipelining stages inside each basic cell (Kim, Hong & Kwon, 2005;
Guo & Wang, 1998).
Table 4. Comparison of various digit-serial systolic multipliers over GF(2m)
6. Conclusions
This work presents new multiplication algorithms for the GNB of GF(2m) to realize LSD-first
and MSD-first scalable multipliers. The fundamental difference between our designs and
other digit-serial multipliers described in the literature is the use of a Hankel
matrix-vector representation to achieve scalable multiplication architectures. In the
generic field, the GNB multiplication can be decomposed into n(n+1) Hankel multiplications.
By using the relationship between the GNB and the NB, we modify the LSD-first scalable
multiplication algorithm to decrease the number of Hankel multiplications from n(n+1) to nk,
where k = mt/2d if m is even and k = (mt−t+2)/2d if m is odd. Our analysis shows that, in
the finite field GF(2²³³), if the selected digit size d ≥ 8, the proposed scalable
multipliers have lower time-area complexity than existing digit-serial multipliers for the
polynomial basis and normal basis of GF(2m). Our proposed scalable multiplication
algorithms are highly flexible and suitable for implementing all types of GNB
multiplication. Finally, the proposed architectures offer good trade-offs between area and
speed for implementing cryptographic schemes in embedded systems.
7. References
Denning, D.E.R.(1983). Cryptography and Data Security, Reading, MA: Addison-Wesley.
Rhee, M.Y. (1994). Cryptography and Secure Communications, McGraw-Hill, Singapore.
Menezes, A. Oorschot, P. V. & Vanstone, S. (1997). Handbook of Applied Cryptography, CRC
Press, Boca Raton, FL.
Massey, J.L. & Omura, J.K. (1986). Computational method and apparatus for finite field
arithmetic. U.S. Patent No. 4,587,627.
Reyhani-Masoleh, A. & Hasan, M.A. (2005). Low complexity word-level sequential normal
basis multipliers. IEEE Transactions on Computers, Vol. 54, No.2.
Lee, C.Y. & Chang, C.J. (2004). Low-complexity linear array multiplier for normal basis of
type-II, IEEE International Conference on Multimedia and Expo, Vol. 3, pp. 1515-1518.
Lee, C.Y., Lu, E.H. & Lee, J.Y. (2001). Bit-parallel systolic multipliers for GF(2m) fields
defined by all-one and equally-spaced polynomials. IEEE Transactions on Computers,
Vol. 50, No. 5, pp. 385-393.
Hasan, M.A., Wang, M.Z. & Bhargava, V.K. (1993). A modified Massey-Omura parallel
multiplier for a class of finite fields. IEEE Transactions on Computers, Vol. 42, No. 10,
pp. 1278-1280.
Kwon, S. (2003). A low complexity and a low latency bit parallel systolic multiplier over
GF(2m) using an optimal normal basis of type II. Proceedings of 16th IEEE Symp.
Computer Arithmetic, pp. 196-202.
Lee, C.Y. & Chiou, C.W. (2005). Design of low-complexity bit-parallel systolic Hankel
multipliers to implement multiplication in normal and dual bases of GF(2m). IEICE
Transactions on Fundamentals, vol. E88-A, no.11, pp. 3169-3179.
IEEE Standard P1363 (2000). IEEE Standard Specifications for Public-Key Cryptography.
National Inst. of Standards and Technology,(2000). Digital Signature Standard, FIPS
Publication 186-2.
Reyhani-Masoleh, A. (2006). Efficient algorithms and architectures for field multiplication
using Gaussian normal bases. IEEE Transactions on Computers, Vol. 55, No. 1, pp.34-
47.
Lee, C.Y. (2003a). Low-latency bit-parallel systolic multiplier for irreducible x^m + x^n + 1
with gcd(m,n) = 1. IEICE Transactions on Fundamentals, Vol. E86-A, No. 11, pp. 2844-2852.
Lee, C.Y., Horng, J.S. & Jou, I.C. (2005). Low-complexity bit-parallel systolic Montgomery
multipliers for special classes of GF(2m). IEEE Transactions on Computers, vol. 54,
no.9, pp. 1061-1070.
Lee, C.Y. (2005). Systolic architectures for computing exponentiation and multiplication over
GF(2m) using polynomial ring basis. Journal of LungHwa University, vol. 19, pp.87-98.
Lee, C.Y. (2003b). Low complexity bit-parallel systolic multiplier over GF(2m) using
irreducible trinomials. IEE Proceedings - Computers and Digital Techniques, Vol. 150,
pp. 39-42.
Paar, C., Fleischmann, P. & Soria-Rodriguez, P. (1999). Fast arithmetic for public-key
algorithms in Galois fields with composite exponents. IEEE Transactions on
Computers, vol. 48, no.10, pp. 1025-1034.
Kim, N.Y. & Yoo, K.Y. (2005). Digit-serial AB2 systolic architecture in GF(2m). IEE
Proceedings - Circuits, Devices and Systems, Vol. 152, No. 6, pp. 608-614.
Kang, S.M. & Leblebici, Y. (1999). CMOS Digital Integrated Circuits Analysis and Design,
McGrawHill.
Logic selection guide: STMicroelectronics, <http://www.st.com>.
Kim, C.H., Hong, C.P. & Kwon, S. (2005). A digit-serial multiplier for finite field GF(2m).
IEEE Transactions on VLSI, Vol. 13, No. 4, pp. 476-483.
Guo, J.H. & Wang, C.L. (1998). Digit-serial systolic multiplier for finite fields GF(2m). IEE
Proc.-Comput. Digit. Tech., Vol. 145, No. 2, pp. 143-148, March.
Kung, S.Y. (1988). VLSI array processors, Englewood Cliffs, NJ: Prentice-Hall.
Wu, H., Hasan, M.A., Blake, I.F. & Gao, S. (2002). Finite field multiplier using redundant
representation. IEEE Transactions on Computers, Vol. 51, No. 11, pp. 1306-1316.
Mullin, R.C., Onyszchuk, I.M., Vanstone, S.A. & Wilson, R. M. (1988/1989). Optimal Normal
Bases in GF(pn). Discrete Applied Math., vol. 22, pp.149-161.
Reyhani-Masoleh, A. & Hasan, M.A. (2003). Fast normal basis multiplication using general
purpose processors. IEEE Transactions on Computers, Vol. 52, No. 11, pp. 1379-1390.
150 VLSI
Song, L. & Parhi, K.K. (1998). Low-energy digit-serial/parallel finite field multipliers. Journal
of VLSI Signal Processing , Vol.19, pp.149-166.
Tenca, A.F. & Koc, C.K. (1999). A scalable architecture for Montgomery multiplication.
Proceedings of Cryptographic Hardware and Embedded System (CHES 1999), No. 1717 in
Lecture Notes in Computer Science, pp. 94-108.
Reyhani-Masoleh, A. & Hasan, M.A. (2002). Efficient digit-serial normal basis multipliers
over GF(2m). IEEE International Conference on Circuits and Systems.
High-Speed VLSI Architectures for Turbo Decoders
Turbo codes, among the most attractive near-Shannon-limit error correction codes, have
drawn tremendous attention in both academia and industry since their invention in the early
1990s. In this chapter, we discuss high-speed VLSI architectures for Turbo decoders.
First of all, we will explore joint algorithmic and architectural level optimization techniques
to break the high speed bottleneck in recursive computation of state metrics for soft-input
soft-output decoders. Then we will present area-efficient parallel decoding schemes and
associated architectures that aim to linearly increase the overall decoding throughput with
sub-linearly increased hardware overhead.
Keywords: Turbo code, MAP algorithm, parallel decoding, high speed, VLSI.
1. Introduction
Error correction codes are an essential component in digital communication and data
storage systems, ensuring robust operation of digital applications. Turbo codes, invented
by Berrou (1993), are among the two most attractive near-optimal (i.e., near-Shannon-limit)
error correction codes. As a matter of fact, Turbo codes have been adopted in several
industrial standards, such as 3rd and post-3rd generation cellular wireless systems (3GPP,
3GPP2, and 3GPP LTE), wireless LAN (802.11a), WiMAX (broadband wireless, IEEE 802.16e) and
the European DAB and DVB (digital audio broadcasting and digital video broadcasting)
systems.
One key feature associated with Turbo code is the iterative decoding process, which enables
Turbo code to achieve outstanding performance with moderate complexity. However, the
iterative process directly leads to low throughput and long decoding latency. To obtain a
high decoding throughput, a large number of computation units has to be instantiated for
each decoder, which results in a large chip area and high power consumption. Meanwhile,
the growing market of wireless and portable computing devices as well as the increasing
desire to reduce packaging costs have directed industry to focus on compact low-power
circuit implementations. This tug-of-war highlights the challenge and calls for innovations
on Very Large Scale Integration (VLSI) design of high-data rate Turbo decoders that are
both area and power efficient.
152 VLSI
For general ASIC design, there are two typical ways to increase the system throughput: 1)
raise the clock speed, and 2) increase the parallelism. In this chapter, we will tackle the high
speed Turbo decoder design in these two aspects. Turbo code decoders can be based on
either the maximum a posteriori probability (MAP) algorithm proposed in Bahl (1974) (or its
approximate variants) or the soft-output Viterbi algorithm (SOVA) proposed in Hagenauer
(1989) (or its modified versions). Either algorithm, however, involves recursive computation
of state metrics, which forms the bottleneck in high-speed integrated circuit design, since
conventional pipelining techniques cannot simply be applied to raise the effective clock
speed. Look-ahead pipelining in Parhi (1999) may be applicable, but the introduced
hardware overhead can be intolerable. On the other hand, parallel processing can be
effective in increasing the system throughput. Unfortunately, direct application of this
technique will cause hardware and power consumption to increase linearly, which is against
the requirement of modern portable computing devices.
In this chapter, we will focus on MAP-based Turbo decoder design, since MAP-based Turbo
decoders significantly outperform SOVA-based Turbo decoders in terms of bit error rate
(BER). In addition, MAP decoders are more challenging than SOVA decoders in high speed
design (Wang 2007). Interested readers are referred to Yeo (2003) and Wang (2003c) for high
data-rate SOVA or Turbo/SOVA decoder design. The rest of the chapter is organized as
follows. In Section 2, we give background information about Turbo codes and discuss
simple serial decoder structure. In Section 3, we will address high speed recursion
architectures for MAP decoders. Both Radix-2 and Radix-4 recursion architectures are
investigated. In Section 4, we present area-efficient parallel processing schemes and
associated parallel decoding architectures. We conclude the chapter in Section 5.
Fig. 1. Typical serial Turbo decoder: one SISO decoder with an input buffer, an interleaver
memory and an address generator.
A typical Turbo encoder consists of two recursive systematic convolutional (RSC) encoders
and an interleaver between them (Wang 1999). The source data are encoded by the first RSC
encoder in sequential order while its interleaved sequence is encoded by the second RSC
encoder. The original source bits and parity bits generated by two RSC encoders are sent out
in a time-multiplexed way. Interested readers are referred to Berrou (1993) for details. Turbo
code usually works with large block sizes for the reason that the larger the block size, the
better the performance in general. In order to facilitate iterative decoding, the received data
of a whole decoding block have to be stored in a memory, whose size is proportional to the
Turbo block size. Hence, Turbo decoders usually require large memory storage. Therefore
serial decoding architectures are widely used in practice.
A typical serial Turbo decoder architecture is shown in Fig. 1. It has only one soft-input soft-
output (SISO) decoder, which works in a time-multiplexed way as proposed in Suzuki
(2000). Each iteration is decomposed into two decoding phases, i.e., the sequential decoding
phase, in which the data are processed in sequential order, and the interleaved decoding phase,
in which the source data are processed in an interleaved order. Both the MAP algorithm
proposed in Berrou (1993) and the SOVA proposed in Hagenauer (1989) can be employed for the
SISO decoding.
The serial Turbo decoder includes two memories: one stores the received soft symbols and is
called the input buffer or receiver buffer; the other stores the extrinsic information and
is denoted the interleaver memory. The extrinsic information is fed back as the a priori
information for the next decoding. The input buffer is normally indispensable. With regard
to the interleaver memory, either two ping-pong buffers can be used to complete the one
Load and one Write operation required to process each information bit within one cycle, or
one single-port memory can be employed to fulfil the two required operations within two
clock cycles. A memory-efficient architecture was presented by Wang (2003a) and Parhi,
which can process both Read and Write operations in the same cycle using single-port
memories with the aid of small buffers.
The Turbo decoder works as follows. The SISO decoder takes its soft inputs (the received
systematic bit ys and the received parity bit y1p or y2p) from the input buffer, and the
a priori information from the interleaver memory. It outputs the log-likelihood ratio
LLR(k) and the extrinsic information Lex(k) for the k-th information bit in the decoding
sequence. The extrinsic information is sent back as the new a priori information for the
next decoding. The interleaving and de-interleaving processes are completed in an efficient
way: the data are loaded according to the current decoding sequence. For instance, the
extrinsic information is loaded in sequential order in the sequential decoding phase and in
interleaved order in the interleaved decoding phase. After processing, the new extrinsic
information is written back to the original places. In this way, no de-interleave
pattern is required for Turbo decoding.
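The write-back scheme can be demonstrated with a toy model (the permutation and value tagging below are illustrative, not from the chapter): because each extrinsic value is written back to the address it was read from, the next phase always finds its a priori value at the expected address, and no inverse permutation ever needs to be stored.

```python
# Toy demo: in-place write-back makes de-interleaving implicit.
import random

N = 8
pi = list(range(N))
random.shuffle(pi)                      # hypothetical interleaver pattern

mem = [("seq", k) for k in range(N)]    # after a sequential phase, address k
                                        # holds bit k's extrinsic value

# Interleaved phase: position k of the interleaved sequence is source bit
# pi[k]; read its a-priori value from address pi[k] and write the new
# extrinsic value back to the same address.
for k in range(N):
    assert mem[pi[k]] == ("seq", pi[k])
    mem[pi[k]] = ("int", pi[k])

# The next sequential phase again finds bit k's value at address k.
assert all(mem[k] == ("int", k) for k in range(N))
```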
The MAP algorithm is commonly implemented in the log domain, and is thus called Log-MAP
(Wang 1999). The Log-MAP algorithm involves recursive computation of forward state metrics
(simply called α metrics) and backward state metrics (simply called β metrics). The
log-likelihood ratio is computed from the two types of state metrics and the associated
branch metrics (denoted γ metrics). Due to the different recursion directions in computing
the α and β metrics, a straightforward implementation of the Log-MAP algorithm would not
only consume a large memory but also introduce a large decoding latency. The sliding window
approach was proposed by Viterbi (1998) to mitigate this issue. In this case, pre-backward
recursion operations are introduced as the warm-up process for the real backward recursion.
For clarity, we denote the pre-backward recursion computing unit as the β′ unit and the
real (effective or valid) backward recursion unit as the β unit. The details of the Log-MAP
algorithm will not be given in this chapter; interested readers are referred to Wang (1999).
The timing diagram for typical sliding-window-based Log-MAP decoding is shown in Fig. 2,
where SB1, SB2, etc., stand for consecutive sub-blocks (i.e., sliding windows). The branch
metrics are computed together with the pre-backward recursion, though this is not shown in
the figure for simplicity and clarity.
The structure of a typical serial Turbo decoder based on the Log-MAP algorithm is shown in
Fig. 3, where the soft-output unit computes the LLR and the extrinsic information (denoted
Lex), and the interleaver memory stores the extrinsic information for the next (phase of)
decoding. It can be seen from the figure that both the branch metric unit (BMU) and the
soft-output unit (SOU) can be pipelined for high-speed applications. Due to the recursive
computation, however, the three state-metric computation units form the high-speed
bottleneck: the conventional pipelining technique cannot raise the effective processing
speed unless one MAP decoder is used to process more than one Turbo code block or
sub-block, as discussed in Lee (2005). Among the various high-speed recursion architectures
in the literature, such as Lee (2005), Urard (2004), Boutillon (2003), Miyouchi (2001) and
Bickerstaff (2003), the designs presented in Urard (2004) and Bickerstaff (2003) are the
most attractive. In Urard (2004), an offset-add-compare-select (OACS) architecture is
proposed to replace the traditional add-compare-select-offset (ACSO) architecture. In
addition, the look-up table (LUT) is simplified to a 1-bit output, and the computation of
the absolute value is avoided by introducing the reverse difference of the two competing
path (or state) metrics. An approximately 17% speedup over the traditional Radix-2 ACSO
architecture was reported. With one-step look-ahead operation, a Radix-4 ACSO
architecture was reported. With one-step look-ahead operation, a Radix-4 ACSO
architecture can be derived. Practical Radix-4 architectures such as Miyouchi (2001) and
Bickerstaff (2003) always involve approximations in order to achieve higher effective
speedup. For instance, the following approximation is adopted in Bickerstaff (2003):
max*(max*(A, B), max*(C, D)) ≈ max*(max(A, B), max(C, D)), (1)
where
max*(A, B) = max(A, B) + log(1 + e^−|A−B|). (2)
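The max* operator of equation (2) and the approximation of equation (1) are easy to check numerically. The sketch below compares the exact two-level max* combination with its approximated form for some arbitrary example metrics; the bound reflects the fact that each dropped correction term is at most log 2.

```python
import math

def max_star(a, b):
    """Jacobian logarithm: max(a, b) plus a small correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

A, B, C, D = 1.3, -0.2, 0.7, 2.1           # arbitrary example metrics
exact = max_star(max_star(A, B), max_star(C, D))
approx = max_star(max(A, B), max(C, D))    # approximation of equation (1)
assert 0 <= exact - approx < math.log(2)   # dropped corrections only shrink it
```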
This Radix-4 architecture can generally improve the processing speed (equivalent to twice of
its clock speed) by over 40% over the traditional Radix-2 architecture, and it has de facto the
highest processing speed among all existing (MAP decoder) designs found in the literature.
However, the hardware will be nearly doubled compared to the traditional ACSO
architecture presented in Urard (2004).
In this section, we will first present an advanced Radix-2 recursion architecture based on
algorithmic transformation, approximation and architectural level optimization, which can
achieve comparable processing speed as the state-of-the-art Radix-4 design while having
significantly lower hardware complexity. Then we discuss an improved Radix-4 architecture
that is 32% faster than the best existing approach.
It is known from the Log-MAP algorithm that all three recursion units have similar
architectures, so we will focus our discussion on the design of the α units. The traditional
design for α computation is illustrated in Fig. 4, where the ABS block computes the absolute
value of its input and the LUT (look-up table) block implements the nonlinear function
log(1 + e^−x), where x > 0. For simplicity, only one branch (i.e., one state) is drawn.
The overflow approach (Wu 2001) is assumed for normalization of the state metrics, as used
in conventional Viterbi decoders.
It can be seen that the computation of the recursive loop consists of three multi-bit
additions, the computation of an absolute value and a random logic block implementing the
LUT. As there is only one delay element in each recursive loop, the traditional retiming
technique in Denk (1998) cannot be used to reduce the critical path.
In this work, we propose an advanced Radix-2 recursion architecture, shown in Fig. 5.
First, we introduce a difference metric for each competing pair of state metrics (e.g., α0
and α1 in Fig. 4) so that the front-end addition and subtraction operations can be performed
simultaneously, reducing the computation delay of the loop. Secondly, we employ a
generalized LUT (see GLUT in Fig. 5) that efficiently avoids the computation of the absolute
value without introducing another subtraction operation as in Urard (2004). Thirdly, we
move the final addition to the input side, as in the OACS architecture in Boutillon (2003),
and utilize a one-stage carry-save structure to convert a 3-number addition into a 2-number
addition. Finally, we make an intelligent approximation to further reduce the critical path.
The following equations are assumed for the considered recursive computation shown in
Fig. 5:
Each state metric is split into two terms (e.g., α0 = α0A + α0B); similarly, the
corresponding difference metric is also split into two terms:

Δ01A[k] = α0A[k] − α1A[k],
Δ01B[k] = α0B[k] − α1B[k]. (5)

In this way, the original add-and-compare operation is converted into an addition of three
numbers, i.e.,

(α0 + γ0) − (α1 + γ3) = (γ0 − γ3) + Δ01A + Δ01B, (6)

where γ0 − γ3 is computed by the BMU and the time index [k] is omitted for simplicity. In
addition, the difference between the two outputs of the two GLUTs can be neglected.
Extensive simulations show that this small approximation does not cause any performance
loss in Turbo decoding over either AWGN channels or Rayleigh fading channels. This can be
explained simply as follows. If one competing path metric (e.g., p0 = α0 + γ0) is
significantly larger than the other (e.g., p1 = α1 + γ3), the GLUT outputs will not change
the decision anyway, owing to their small magnitudes. On the other hand, if the two
competing path metrics are so close that adding or removing one GLUT value may change the
decision (e.g., from p0 > p1 to p1 > p0), picking either survivor (p0 or p1) should not
make a big difference.
At the input side, a small circuit, shown in Fig. 6, is employed to convert an addition of
3 numbers into an addition of 2 numbers, where FA and HA represent a full-adder and a
half-adder respectively, XOR stands for an exclusive-OR gate, and d0 and d1 correspond to
the 2-bit output of the GLUT. The state metrics and branch metrics are represented with 9
and 6 bits, respectively, in this example. The sign extension is applied only to the branch
metrics. It should be noted that an extra addition operation (see the dashed adder boxes)
is required to integrate each state metric before storing it into the memory.
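The carry-save idea can be illustrated on plain integers (an illustrative model of the FA/HA row, not the exact circuit): one 3:2 compression stage turns three addends into a sum word and a carry word whose two-operand addition gives the original three-operand sum.

```python
# One carry-save (3:2 compression) stage.
def carry_save(a, b, c):
    s = a ^ b ^ c                               # per-bit sum without carries
    carry = ((a & b) | (b & c) | (a & c)) << 1  # per-bit majority, weight x2
    return s, carry

s, carry = carry_save(3, 5, 9)
assert s + carry == 3 + 5 + 9   # the pair is summed by a single 2-input adder
```

Only the final 2-number addition needs a carry-propagate adder, which is why the conversion shortens the recursive loop.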
The generalized LUT (GLUT) structure is shown in Fig. 7, where the computation of the
absolute value is eliminated by folding the sign bit into two logic blocks, Ls2 and ELUT:
the Ls2 block detects whether the absolute value of the input is less than 2.0, and the
ELUT block is a small LUT with 3-bit inputs and 2-bit outputs. It can be derived that
Z = S·b7·b4·b3 + S′·(b7′·b4′·b3′), where the prime denotes logical complement. It was
reported in Gross (1998) that using only two output values for the LUT caused a performance
loss of just 0.03 dB relative to the floating-point simulation for a 4-state Turbo code.
The approximation is

f(x) = 0.375 if |x| < 2, and f(x) = 0 otherwise,

where x and f(x) stand for the input and the output of the LUT, respectively. In this
approach, we only need to check whether the absolute value of the input is less than 2,
which can be performed by the Ls2 block in Fig. 7. A drawback of this method is that its
performance is significantly degraded if only two bits are kept for the fractional part
of the state metrics, which is generally the case.
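Assuming the two LUT output values are 0.375 and 0 with a threshold of 2.0 (the constant choice commonly attributed to Gross (1998); the exact values in the chapter's Table 1 may differ), the quality of this two-level approximation of log(1 + e^−|x|) is easy to bound numerically:

```python
import math

def f_exact(x):
    return math.log1p(math.exp(-abs(x)))

def f_two_level(x):
    return 0.375 if abs(x) < 2.0 else 0.0   # assumed two-level LUT

worst = max(abs(f_exact(i / 100) - f_two_level(i / 100))
            for i in range(-800, 801))
assert worst < 0.32   # worst case is log(2) - 0.375, about 0.318, at x = 0
```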
In our design, both the inputs and outputs of the LUT are quantized to 4 levels; the
details are shown in Table 1. The inputs to the ELUT are treated as a 3-bit signed binary
number. The outputs of the ELUT are ANDed with the output of the Ls2 block; thus, if the
absolute value of the input is greater than 2.0, the output of the GLUT is set to 0.
Otherwise the ELUT output is the final output.
The ELUT can be implemented with combinational logic for high-speed applications. Its
computation latency is smaller than that of the Ls2 block, so the overall latency of the
GLUT is almost the same as that of the simplified method discussed above, where the total
delay consists of one MUX delay and the computation delay of Ls2.
After all of the above optimizations, the critical path of the recursive architecture is
reduced to two multi-bit additions, one 2:1 MUX operation and one 1-bit addition, which
saves nearly two multi-bit adder delays compared to the traditional ACSO architecture.
Detailed comparisons are given in subsection 3.4.
(The Radix-4 recursion equations and the corresponding signal labels of Fig. 8 are not
reproduced here; in a Radix-4 recursion each state metric at time k is computed directly
from the state metrics at time k-2, merging two trellis steps per cycle.)
Fig. 8. The Radix-4 architecture proposed by Lucent Bell Labs: Arch-L.
This approximation is reported to have a 0.04 dB performance loss compared to the original
Log-MAP algorithm. The architecture to implement the above computation is shown in Fig.
8 for convenience in later discussion. As can be seen, the critical path consists of four multi-
bit adder delays, one generalized LUT delay (note: the LUT1 block includes the absolute-
value computation and a normal LUT operation) and one 2:1 MUX delay.
(The signal labels of Fig. 9 are not reproduced here; each state metric at time k is split into
an A term and a B term.)
Intuitively, a Turbo decoder employing this new approximation should have the same
decoding performance as one using equation (7). Although directly implementing (11) does
not shorten the critical path by itself, we can exploit the techniques developed in
subsection 3.2. The new architecture is shown in Fig. 9. Here we split each
state metric into two terms and we adopt the same GLUT structure as we did before. In
addition, a similar approximation is incorporated as with Arch-A. In this case, the outputs
from GLUT are not involved in the final stage comparison operation. It can be observed that
the critical path of the new architecture is close to 3 multi-bit adder delay. To compensate
for all the approximation introduced, the extrinsic information generated by the MAP
decoder based on this new Radix-4 architecture should be scaled by a factor around 0.75.
It can be observed that the proposed Radix-2 architecture has processing speed comparable
to the Radix-4 architecture proposed by Bell Labs at significantly lower complexity, while
the improved Radix-4 architecture is 32% faster with only 9% extra hardware. This amount
of hardware overhead should be negligible compared to an entire Turbo decoder. It can also
be seen that the new Radix-4 architecture achieves a twofold speedup over the traditional
Radix-2 recursion architecture.
162 VLSI
We have performed extensive simulations for Turbo codes using the original MAP and
using various approximations. Fig. 10 shows the BER (bit-error-rate) performance of a rate-
1/3, 8-state Turbo code with a block size of 512 bits, using different MAP architectures. The
simulations were undertaken under the assumption of AWGN channel and BPSK signaling.
A maximum of 8 iterations was performed. More than 40 million random information bits
were simulated for both Eb/No=1.6 dB and Eb/No=1.8dB cases. It can be noted from Fig.
10 that there is no observable performance difference between the true MAP algorithm and
the two approximation methods associated with the proposed recursion architectures,
whereas the approximation employed in Urard (2004) causes approximately 0.2 dB of
performance degradation in general.
Fig. 10. Performance comparisons between the original MAP and some approximations.
High-Speed VLSI Architectures for Turbo Decoders 163
We argue that the proposed Radix-4 recursion architecture is optimal for high-speed MAP
decoders. Any (significantly) faster recursion architecture (e.g., a possible Radix-8
architecture) will come at the expense of significantly increased hardware. On the other hand,
when the target throughput is moderate, these fast recursion architectures can be used to
reduce power consumption because of their dramatically reduced critical paths.
The major challenge lies in real implementation. As each memory (receiver memory or
extrinsic information memory) needs to support multiple data (for multiple component soft-
input soft-output decoders) at the same cycle, memory access conflicts are inevitable unless
M-port (M = parallelism level) memory is used, which contradicts the original low-
complexity design objective. Two different solutions are proposed in this work. First, if the
interleaver pattern is free to design, we can adopt dividable interleavers, which inherently
ensure no memory access conflicts after proper memory partitioning. For practical
applications wherein Turbo code interleaver patterns are fixed, e.g., 3GPP and 3GPP2, a
more generic
solution is introduced in this chapter. By introducing some small buffers for data and
addresses as well, we are able to avoid memory access conflict under any practical random
interleavers. Combining all the proposed techniques, it is estimated that a 200 Mb/s Turbo
decoder is feasible with current CMOS technology at moderate complexity.
Fig. 12. b) Multi-level parallel Turbo decoding scheme based on sliding-window approach.
A simple area-efficient 2-level parallel Turbo decoding scheme is shown in Fig. 12 a), where
a sequence of data is divided into two segments of equal length. There is an overlap
of length 2D between the two segments, where D equals the sliding window size if the Log-
MAP (or MAX-Log-MAP) algorithm is employed in the SISO decoder or the survivor length
if SOVA is employed, as proposed in Wang (2003c). The two SISO decoders work on two
different data segments (with overlap) at the same time. Fig. 12 b) shows (a portion of) a
sliding-window-based area-efficient multi-level parallel Turbo decoding data flow, where
SISO decoders responsible for two adjacent data segments are processing data in reverse
directions in order to reuse some of the previously computed branch metrics and state
metrics. For other parallel decoding schemes with various trade-offs, interested readers are
referred to Wang (2001) and Zhang (2004).
In this section, we will focus on area-efficient two-level parallel decoding schemes. It can be
observed from Fig. 12.a) that two data accesses per cycle are required for both the receiver
buffer and the interleaver memory (assuming two ping-pong buffers are used for the
interleaver memory). Using a dual-port memory is definitely not an efficient solution since a
normal dual-port memory consumes as much area as two single-port memories with the
same memory size.
Memory partitioning can be done in various ways. An easy and yet effective way is to
partition the memory according to a number of least significant bits (lsb’s) of the memory
address. To partition a memory into two segments, the single least significant bit (lsb)
suffices: one of the resulting segments contains the data with even addresses and the other
the data with odd addresses.
The memory partitioning can also be done in less regular ways; e.g., to partition the memory
into 2 segments, let the 1st memory segment contain data with addresses b2b1b0 ∈ {0, 2, 5, 7}
and the 2nd segment contain data with addresses b2b1b0 ∈ {1, 3, 4, 6}, where b2b1b0 denotes
the 3 lsb’s. Depending on the applications, different partitioning schemes may lead to
different hardware requirements.
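The lsb-based partition can be sketched as follows; `segment_of` is a hypothetical helper name, not part of any referenced design.

```python
def segment_of(addr, num_segments=2):
    """Map a memory address to a segment using its least significant
    bits; num_segments must be a power of two."""
    return addr % num_segments

# Even/odd partition of an 8-word memory according to the single lsb:
even = [a for a in range(8) if segment_of(a) == 0]
odd = [a for a in range(8) if segment_of(a) == 1]
```

With `num_segments=2` this is exactly the even/odd split described above; larger powers of two use correspondingly more lsb's.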
For the two memory segments, it is possible, in principle, to support two data accesses
within one cycle. However, it is generally impossible to find an efficient partitioning scheme
to ensure the target addresses (for both Load and Write operations) are always located in
different segments at each cycle because the Turbo decoder processes the data in different
orders during different decoding phases. Consider a simple example: given a sequential
data sequence {0, 1, 2, 3, 4, 5, 6, 7}, assume the interleaved data sequence is {2, 5, 7, 3, 1, 0,
6, 4}. Suppose we partition the memory into two segments according to the sequential
decoding phase, i.e., one segment contains data {0, 1, 2, 3} and the other {4, 5, 6, 7}.
During the sequential phase, SISO-1 works on data set {0, 1, 2, 3} and SISO-2 works on {4, 5,
6, 7}. So there is no data conflict (note: the overlap between two segments for parallel
decoding is ignored in this simple example). However, during the interleaved decoding
phase, SISO-1 will process data in set {2, 5, 7, 3} and SISO-2 will process data in set {1, 0, 6,
4}, both in sequential order. It is easy to see that, at the first cycle, both SISO decoders
require data located in the first segment (1st and 2nd data in the input data sequence). Thus a
memory access conflict occurs. More detailed analysis can be found in Wang (2003b). The
memory access issue in parallel decoding can be much worse when multiple rates and
multiple block sizes of Turbo codes are supported in a specific application, e.g., in 3GPP
CDMA systems.
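The conflict in the example above can be reproduced with a short sketch; `count_conflicts` is a hypothetical helper used only for illustration.

```python
def count_conflicts(seq1, seq2, segment):
    """Count cycles in which both SISO decoders address the same
    memory segment (i.e., a memory access conflict)."""
    return sum(segment(a) == segment(b) for a, b in zip(seq1, seq2))

# Partition according to the sequential phase: segment 0 holds {0,1,2,3}.
segment = lambda a: 0 if a < 4 else 1

# Sequential phase: SISO-1 reads {0,1,2,3}, SISO-2 reads {4,5,6,7}.
assert count_conflicts([0, 1, 2, 3], [4, 5, 6, 7], segment) == 0

# Interleaved phase: the interleaved sequence {2,5,7,3,1,0,6,4} is split
# between the decoders, and conflicts appear (both need segment 0 at t=0).
assert count_conflicts([2, 5, 7, 3], [1, 0, 6, 4], segment) == 2
```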
In principle, the data access conflict problem can be avoided if the Turbo interleaver is
specifically designed. A generalized even-odd interleaver is defined as follows:
All even indexed data are interleaved to odd addresses,
All odd indexed data are interleaved to even addresses.
Back to the previous example, an even-odd interleaver may have an interleaved data
sequence as {3, 6, 7, 4, 1, 0, 5, 2}.
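The even-odd property of this example interleaver can be checked mechanically; `is_even_odd_interleaver` is a hypothetical helper, assuming the i-th entry gives the interleaved address of the i-th data.

```python
def is_even_odd_interleaver(pi):
    """True when every even index maps to an odd address and every odd
    index maps to an even address (index + address is always odd)."""
    return all((i + pi[i]) % 2 == 1 for i in range(len(pi)))

assert is_even_odd_interleaver([3, 6, 7, 4, 1, 0, 5, 2])      # example above
assert not is_even_odd_interleaver([2, 5, 7, 3, 1, 0, 6, 4])  # earlier example
```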
With an even-odd interleaver, we can partition each memory into two segments: one
contains the even-addressed data and the other the odd-addressed data.
Fig. 13. Data processing in either decoding phase with an even-odd interleaver.
As shown in Fig. 13, the data are processed in the order {even address, odd address, even,
odd, ...} in Phase A (i.e., the sequential phase) and in the order {odd address, even address,
odd, even, ...} in Phase B (i.e., the interleaved phase). If we let SISO-1 and SISO-2 start
processing from different memory segments, then there would be no data access conflict
during either decoding phase. In other words, it is guaranteed that both SISO decoders
always access data located in different memory segments.
More complicated interleavers can be designed to support multiple (M>=2) data accesses
per cycle. The interested readers are referred to Wang (2003c), He (2005), Giulietti (2002),
Bougard (2003), and Kwak (2003) for details.
In real applications, the Turbo interleaver is usually not free to design; for example, it is
fixed in WCDMA and CDMA2000 systems. Thus, a generic parallel decoding architecture is
desired to accommodate all possible applications. Here we propose an efficient memory
arbitration scheme to resolve the problem of data access conflict in parallel Turbo decoding.
The fundamental concept is to partition one single-port memory into S (S>=2) segments and
use B (B >=2) small buffers to assist reading from or writing data back to the memory, where
S does not have to be the same as B. For design simplicity, both S and B are normally chosen
to be a power of 2, and S is best chosen to be a multiple of B. In this chapter, we will consider
only one simple case, i.e., S=2 and B=2. All other cases with B=2 can be easily extended
from the illustrated case. However, for cases with B>2, the control circuitry will be much
more complicated.
For the receiver buffer, only the Load operation is involved, while both Load and Write
operations are required for the interleaver memory. So we will focus our discussion on the
interleaver memory.
With regard to the interleaver memory, we assume two ping-pong buffers (i.e., RAM1 and
RAM2 shown in Fig. 14) are used to ensure that one Load and one Write operation can be
completed within one cycle. Each buffer is partitioned into two segments: Seg-A contains
even-addressed data and Seg-B contains odd-addressed data. For simplicity, we use Seg-A1
to represent the even-addressed part of RAM1 and Seg-B1 to represent the odd-addressed
part of RAM1. Similar notations are used for RAM2 as well.
An area-efficient 2-parallel Turbo decoding architecture is shown in Fig. 14. The interleaver
address generator (IAG) generates two addresses, one for each SISO decoder, at each
cycle. They must belong to one of the following cases: (1) two even addresses, (2) two odd
addresses, and (3) one even and one odd address. In Case 1, both addresses are put into Read
Address Buffer 1 (RAB1). In Case 2, both addresses are put into Read Address Buffer 2
(RAB2). In Case 3, the even address goes to RAB1 and the odd address goes to RAB2.
A small buffer called Read Index Buffer (RIB) is introduced in this architecture. This buffer
is basically a FIFO (first-in first-out). Two bits are stored at each entry. The distinct four
values represented by the two bits have following meanings:
00: 1st and 2nd even addresses for SISO-1 and SISO-2 respectively,
11: 1st and 2nd odd addresses for SISO-1 and SISO-2 respectively,
01: the even address is used for SISO-1 and the odd address for SISO-2,
10: the odd address is used for SISO-1 and the even address for SISO-2.
After decoding for a number of cycles (e.g., 4~8 cycles), both RAB1 and RAB2 will hold a
small amount of data. Then the data from the tops of RAB1 and RAB2 are used as the
addresses to Seg-A1 and Seg-B1, respectively. After one cycle, two previously stored
extrinsic information symbols are loaded into Load Data Buffer 1 (LDB1) and Load Data
Buffer 2 (LDB2). The data stored in the RIB are then used to direct the outputs from both
load buffers to feed both SISO decoders. For example, if the entry at the top of the RIB is 00,
then LDB1 outputs two data to SISO-1 and SISO-2 respectively. The detailed actions of the
address buffers and data buffers controlled by the RIB for loading data are summarized in
Table 3.
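The address-routing rules (Cases 1-3) and the RIB coding can be sketched as follows; `dispatch` is a hypothetical single-cycle model for illustration, not the actual control logic.

```python
def dispatch(addr1, addr2):
    """Route one cycle's interleaver addresses (SISO-1, SISO-2) to the
    read address buffers and produce the 2-bit RIB code.

    Even addresses (Seg-A) go to RAB1, odd addresses (Seg-B) to RAB2;
    the RIB code records the even/odd pattern, e.g. '00' for Case 1.
    """
    rab1, rab2 = [], []
    for a in (addr1, addr2):
        (rab1 if a % 2 == 0 else rab2).append(a)
    return rab1, rab2, '%d%d' % (addr1 % 2, addr2 % 2)

assert dispatch(4, 6) == ([4, 6], [], '00')   # Case 1: two even addresses
assert dispatch(3, 5) == ([], [3, 5], '11')   # Case 2: two odd addresses
assert dispatch(4, 7) == ([4], [7], '01')     # Case 3: one of each
```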
Fig. 15. An example of loading data from the memory segments to the SISO decoders:
(a) contents of the address buffers, (b) contents of the data buffers, (c) data output to the
SISO decoders. (The cycle-by-cycle buffer contents are not reproduced here.)
A simple example is shown in Fig. 15 to illustrate the details of loading data from memory
to SISO decoders, where, for instance, A3 (B5) denotes the address of data to be loaded or
the data itself corresponding to memory segment A (segment B) for the 3rd (5th) data in the
processing sequence. Fig. 15 (a) shows the details of feeding two data addresses to the two
address buffers at every cycle. Fig. 15 (b) shows the details of feeding one data to each data
buffer at every cycle. The details of outputting the required two data to both SISO decoders
are shown in Fig. 15 (c). As can be seen, both SISO1 and SISO2 can get their required data
that may be located in the same memory segment at the same cycle (e.g., t=7 in this
example). A tiny drawback of this approach is that a small fixed latency is introduced.
Fortunately this latency is negligible compared with the overall decoding cycles for a whole
Turbo code block in most applications.
At each cycle, two extrinsic information symbols are generated from the two SISO decoders.
They may both feed Write Data Buffer 1 (WDB1), or both feed Write Data Buffer 2 (WDB2),
or one feeds WDB1 and the other feeds WDB2. The actual action is controlled by the
delayed output from the RIB.
Similar to loading data, after the same delay (e.g., 4~8 cycles) from the first output of either
SISO decoder, the data from WDB1 and WDB2 will be written back to Seg-A2 and Seg-B2,
respectively. In the next decoding phase, RAM1 and RAM2 will exchange roles, so some
extra MUXes must be employed to facilitate this functional switch.
The RIB can be implemented with a shift register, as it has a regular data flow at both ends,
while a 3-port memory suffices to implement the remaining buffers.
It has been found from our cycle-accurate simulations that it is sufficient to choose a buffer
length of 25 for all buffers if the proposed 2-parallel architecture is applied in either
WCDMA or CDMA2000 systems. The maximum Turbo block size for WCDMA system is
approximately 5K bits. The lowest code rate is 1/3. Assume both the received soft inputs
and the extrinsic information symbols are expressed as 6 bits per symbol.
The overall memory requirement is 5K*(2+3)*6 = 150K bits. The total overhead of the
small buffers is approximately 25*3*2*(3*6+2*6) + 25*3*13*2 ≈ 6K bits, where the factor 3
accounts for the 3 memory ports and 13 is the number of binary bits in each address.
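The estimate above can be checked by direct arithmetic; this is a sketch, and the 1024-bit "K" convention is an assumption.

```python
# Main memory: 5K-symbol block, (2+3) stored copies per symbol, 6 bits each.
block = 5 * 1024
main_bits = block * (2 + 3) * 6               # 150K bits

# Buffer overhead: 25-entry buffers, 3 memory ports, 13-bit addresses.
data_bits = 25 * 3 * 2 * (3 * 6 + 2 * 6)      # load and write data buffers
addr_bits = 25 * 3 * 13 * 2                   # the two address buffers
overhead = data_bits + addr_bits              # about 6K bits
ratio = 100.0 * overhead / main_bits          # about 4 percent
```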
It can be seen that the overhead is about 4% of total memory. With CDMA2000 systems, the
overhead occupies an even smaller percentage in the total hardware because the maximum
Turbo block size is several times larger.
For WCDMA systems, it is reasonable to assume that a Log-MAP decoder consumes less
than 15% of the total hardware of a serial Turbo decoder. Thus, we can achieve
twice the throughput with the proposed 2-parallel decoding architecture while spending
less than 20% hardware overhead. If SOVA is employed in the SISO decoder, the overhead
could be less than 15%. In either case, the percentage of the overhead will be even smaller in
CDMA2000 systems.
Assuming at most 6 iterations are performed, the proposed 2-level parallel architecture, if
implemented with TSMC 0.18 um technology, can achieve a minimum decoding throughput
of 370M*2/(6*2) > 60 Mbps. If a state-of-the-art CMOS technology (e.g., 65 nm CMOS) is
used, we can easily achieve a 100 Mbps data rate with the proposed 2-parallel decoding
architecture. If a 4-parallel architecture is employed, an over-200 Mbps data rate can be
obtained, though a direct extension of the proposed 2-parallel architecture will significantly
complicate the control circuitry.
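The throughput figure follows directly from the stated numbers; the 370 MHz clock and one decoded bit per SISO per cycle are assumptions read off the formula in the text.

```python
clock_mhz = 370          # assumed achievable clock in TSMC 0.18 um
parallelism = 2          # two SISO decoders working concurrently
half_iterations = 6 * 2  # 6 iterations, each with two half-iterations

throughput_mbps = clock_mhz * parallelism / half_iterations
assert throughput_mbps > 60   # matches 370M*2/(6*2) > 60 Mbps
```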
It is worthwhile to note that, when the target throughput is moderate, the proposed area-
efficient parallel Turbo decoding architecture can also be used for low-power design: it can
work at a much lower clock frequency, so the supply voltage can be reduced significantly.
6. Conclusions
Novel fast recursion architectures for Log-MAP decoders have been presented in this chapter.
Experimental results have shown that the proposed fast recursion architectures can increase
the processing speed significantly over traditional designs while maintaining the decoding
performance. As a more power-efficient approach to increasing the throughput of Turbo
decoders, area-efficient parallel Turbo decoding schemes have been addressed. A hardware-
efficient 2-parallel decoding architecture for generic applications is presented in detail. It has
been shown that twice the throughput of a serial decoding architecture can be obtained with
an overhead of less than 20% of an entire Turbo decoder. The proposed memory
partitioning techniques together with the efficient memory arbitration schemes can be
extended to multi-level parallel Turbo decoding architectures as well.
7. References
3rd Generation Partnership Project (3GPP), Technical specification group radio access
network, multiplexing and channel coding (TS 25.212 version 3.0.0),
http://www.3gpp.org.
3rd Generation Partnership Project 2 (3GPP2), http://www.3gpp2.org.
A. Giulietti et al. (2002), Parallel Turbo code interleavers: Avoiding collisions in
accesses to storage elements, Electron. Lett., vol. 38, no. 5, pp. 232–234, Feb.
2002.
A. J. Viterbi. (1998). An intuitive justification of the MAP decoder for convolutional codes,
IEEE J. Select. Areas Commun., vol.16, pp. 260-264, February 1998.
A. Raghupathy. (1998). Low power and high speed algorithms and VLSI architectures for
error control coding and adaptive video scaling, Ph.D. dissertation, Univ. of
Maryland, College Park, 1998.
B. Bougard et al. (2003). A scalable 8.7 nJ/bit 75.6 Mb/s parallel concatenated
convolutional (Turbo-) codec, in IEEE ISSCC Dig. Tech. Papers, 2003, pp.
152–153.
C. Berrou, A. Glavieux & P. Thitimajshima. (1993). Near Shannon limit error correcting
coding and decoding: Turbo codes, in Proc. IEEE ICC'93, pp. 1064-70.
E. Boutillon, W. Gross & P. Gulak. (2003). VLSI architectures for the MAP algorithm, IEEE
Trans. Commun., Volume 51, Issue 2, Feb. 2003, pp. 175 – 185.
E. Yeo, S. Augsburger, W. Davis & B. Nikolic. (2003). A 500-Mb/s soft-output Viterbi
decoder, IEEE Journal of Solid-State Circuits, Volume 38, Issue 7, July 2003,
pp:1234 – 1241.
H. Suzuki, Z. Wang & K. K. Parhi. (2000). A K=3, 2Mbps Low Power Turbo Decoder for 3rd
Generation W-CDMA Systems, in Proc. IEEE 1999 Custom Integrated Circuits Conf.
(CICC), 2000, pp 39-42.
J. Hagenauer & P. Hoher. (1989). A Viterbi algorithm with soft decision outputs and its
applications, IEEE GLOBECOM, Dallas, TX, USA, Nov. 1989, pp 47.1.1-7.
J. Kwak & K. Lee (2002). Design of dividable interleaver for parallel decoding in
turbo codes, Electronics Letters, vol. 38, issue 22, pp. 1362-64, Oct. 2002.
J. Kwak, S. M. Park, S. S. Yoon & K. Lee (2003). Implementation of a parallel Turbo
decoder with dividable interleaver, ISCAS'03, vol. 2, pp. II-65–II-68, May
2003.
K. K. Parhi. (1999). VLSI Digital signal Processing Systems, John Wiley & Sons, 1999.
L. Bahl, J. Cocke, F. Jelinek & J. Raviv. (1974). Optimal decoding of linear codes for
minimizing symbol error rate, IEEE Trans. Inf. Theory, vol. IT-20, pp. 284-287,
March 1974.
M. Bickerstaff, L. Davis, C. Thomas, D. Garrett & C. Nicol. (2003). A 24 Mb/s radix-4
LogMAP Turbo decoder for 3GPP-HSDPA mobile wireless, in Proc. IEEE ISSCC
Dig. Tech. Papers, 2003, pp. 150–151.
P. Urard et al. (2004). A generic 350 Mb/s Turbo codec based on a 16-state Turbo decoder, in
Proc. IEEE ISSCC Dig. Tech. Papers, 2004, pp. 424–433.
S. Lee, N. Shanbhag & A. Singer. (2005). A 285-MHz pipelined MAP decoder in 0.18 um
CMOS, IEEE J. Solid-State Circuits, vol. 40, no. 8, Aug. 2005, pp. 1718 – 1725.
T. Miyauchi, K. Yamamoto & T. Yokokawa. (2001). High-performance programmable SISO
decoder VLSI implementation for decoding Turbo codes, in Proc. IEEE Global
Telecommunications Conf., vol. 1, 2001, pp. 305–309.
T.C. Denk & K.K. Parhi. (1998). Exhaustive Scheduling and Retiming of Digital Signal
Processing Systems, IEEE Trans. Circuits and Syst. II, vol. 45, no.7, pp. 821-838, July
1998
W. Gross & P. G. Gulak. (1998). Simplified MAP algorithm suitable for implementation of
Turbo decoders, Electronics Letters, vol. 34, no. 16, pp. 1577-78, Aug. 1998.
Y. Wang, C. Pai & X. Song. (2002). The design of hybrid carry-lookahead/carry-select
adders, IEEE Trans. Circuits and Syst. II, vol.49, no.1, Jan. 2002, pp. 16 -24
Y. Wu, B. D. Woerner & T. K. Blankenship, Data width requirement in SISO decoding with
module normalization, IEEE Trans. Commun., vol. 49, no. 11, pp. 1861–1868, Nov.
2001.
Y. Zhang & K. Parhi (2004). Parallel Turbo decoding, Proceedings of the 2004 International
Symposium on Circuits and Systems (ISCAS04), Volume 2, 23-26 May 2004, pp: II -
509-12 Vol.2.
Z Wang, Y. Tan & Y. Wang. (2003b). Low Hardware Complexity Parallel Turbo Decoder
Architecture, ISCAS’2003, vol. II, pp. 53-56, May 2003.
Z. He, S. Roy & P. Fortier (2005). High-speed and low-power design of parallel
Turbo decoder circuits and systems, IEEE International Symposium on
Circuits and Systems (ISCAS'05), 23-26 May 2005, pp. 6018–6021.
Z. Wang & K. Parhi. (2003a). Efficient Interleaver Memory Architectures for Serial Turbo
Decoding, ICASSP’2003, vol. II, pp. 629-32, May 2003.
Z. Wang & K. Parhi. (2003c). High Performance, High Throughput Turbo/SOVA Decoder
Design, IEEE Trans. on Commun., vol. 51, no 4, April 2003, pp. 570-79.
Z. Wang, H. Suzuki & K. K. Parhi. (1999). VLSI implementation issues of Turbo decoder
design for wireless applications, in Proc. IEEE Workshop on Signal Process. Syst.
(SiPS), 1999, pp. 503-512.
Z. Wang, Z. Chi & K. K. Parhi. (2001). Area-Efficient High Speed Decoding Schemes for
Turbo/MAP Decoders, in Proc. IEEE ICASSP'2001, pp 2633-36, vol. 4, Salt Lake
City, Utah, 2001.
Z. Wang. (2000). Low complexity, high performance Turbo decoder design, Ph.D.
dissertation, University of Minnesota, Aug. 2000.
Z. Wang. (2007). High-Speed Recursion Architectures for MAP-Based Turbo Decoders, in
IEEE Trans. on VLSI Syst., vol. 15, issue 4, pp: 470-74, Apr. 2007.
Ultra-High Speed LDPC Code Design and Implementation 175
X9
1. Introduction
Digital communications are ubiquitous and provide tremendous benefits to
everyday life. Error Correction Codes (ECC) are widely applied in modern digital
communication systems. Low-Density Parity-Check (LDPC) code, invented by Gallager
(1962) and rediscovered by MacKay (1996), is one of the two most promising near-optimal
error correction codes in practice. Since its rediscovery, significant improvements have been
achieved on the design and analysis of LDPC codes to further enhance the communication
system performance. Due to its outstanding error-correcting performance (Chung 2001),
LDPC code has been widely considered in next generation communication standards such
as IEEE 802.16e, IEEE 802.3an, IEEE 802.11n, and DVB-S2. An LDPC code is characterized by
a sparse parity-check matrix. One key feature associated with LDPC codes is the iterative
decoding process, which enables an LDPC decoder to achieve outstanding performance with
moderate complexity. However, the iterative process directly leads to large hardware
consumption and low throughput. Thus efficient Very Large Scale Integration (VLSI)
implementation of high data rate LDPC decoder is very challenging and critical in practical
applications.
A satisfactory LDPC decoder usually means good error-correction performance, low
hardware complexity and high throughput. The various existing design methods
usually encounter the limitations of high routing overhead, large message-memory
requirements or long decoding latency. Implementing the decoder directly in its inherent
parallel manner may give the highest decoding throughput, but for large codeword lengths
(e.g., larger than 1000 bits) the complex interconnection needed to avoid routing conflicts
may take up more than half of the chip area. Both serial and partly parallel VLSI
architectures are also well studied nowadays. However, none of these approaches is good
for very high throughput (i.e., multi-Gb/s) applications.
In this chapter, we present the construction of a new class of implementation-oriented LDPC
codes, namely shift-LDPC codes, to address these issues jointly. Shift-LDPC codes have
been shown to perform as well as computer-generated random codes when some
optimization rules are followed. A specific high-speed decoder architecture targeting
multi-Gb/s applications, namely the shift decoder architecture, is developed for this class of
codes. In
contrast with conventional decoder architectures, the shift decoder architecture has three
major merits:
1) Memory efficiency. By exploiting special features of the min-sum decoding algorithm,
the proposed architecture stores the messages in a compact way, which normally leads to
approximately 50% savings in message memory over conventional designs for high-rate
codes.
2) Low routing complexity. By introducing novel check-node information transfer
mechanisms, the complex message passing between variable nodes and check nodes is
replaced by regular communication between check nodes, so the complex global
interconnection networks can be replaced by local wires. One important fact is that in the
new architecture, check node processing units have few and fixed connections with variable
node processing units.
3) A high level of decoding parallelism. The architecture can normally exploit more
parallelism in the decoding algorithm than conventional partially parallel decoder
architectures. In addition, the decoding parallelism can be increased linearly and the critical
path can be significantly reduced by proper pipelining.
The chapter is organized as follows. Section 2 gives an overview of LDPC codes, with the
main attention being paid to the decoding algorithm and decoder architecture design.
Section 3 introduces the code construction of shift-LDPC codes and Section 4 presents the
decoder design with shift decoder architecture and demonstrates the benefits of proposed
techniques. In Section 5, we consider the application of shift architecture to some well
known LDPC codes such as RS-based LDPC codes and QC-LDPC codes. Section 6 concludes
the chapter.
f1 f2 f3 f4
c1 c2 c3 c4 c5 c6 c7 c8
(a)
H = [ 1 0 1 0 1 0 1 0
      1 0 0 1 0 1 0 1
      0 1 1 0 0 1 1 0
      0 1 0 1 1 0 0 1 ]
(b)
Fig. 1. An LDPC code example. (a) Tanner graph. (b) Parity check matrix.
The typical LDPC decoding algorithm is the Sum-Product (or belief propagation) algorithm.
After variable nodes are initialized with the channel information, the decoding messages are
iteratively computed by all the variable nodes and check nodes and exchanged through the
edges between the neighbouring nodes (Kschischang 2004).
The modified min-sum decoding algorithm studied in Guilloud (2003), Chen (2005) and
Zhao (2005) is similar to the Sum-Product algorithm, with a simplified check node process.
It has some implementation advantages over the Sum-Product algorithm, such as lower
computation complexity and no need for knowledge of the SNR (signal-to-noise ratio) for
AWGN channels. In the modified min-sum decoding algorithm, the check node computes
the check-to-variable messages Rcv as follows:
Rcv = α · ( ∏_{n ∈ N(c)\v} sign(Lnc) ) · min_{n ∈ N(c)\v} |Lnc|        (1)
where α is a scaling factor around 0.75, Lnc is the variable-to-check message from variable
node n, and N(c) denotes the set of variable nodes that participate in the c-th check node.
The variable node computes the variable-to-check messages Lvc as follows:
Lvc = Iv + Σ_{m ∈ M(v)\c} Rmv        (2)
where M(v)\c denotes the set of check nodes connected to the variable node v excluding the
check node c, and Iv denotes the intrinsic message of variable node v.
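Equations (1) and (2) can be sketched directly in Python; this is a floating-point reference model for illustration, not the fixed-point hardware.

```python
def check_node_update(L, alpha=0.75):
    """Check-to-variable messages Rcv of equation (1): for each target
    variable node v, the product of the signs and the minimum magnitude
    of all other incoming messages, scaled by alpha."""
    R = []
    for v in range(len(L)):
        others = [L[n] for n in range(len(L)) if n != v]
        sign = 1
        for x in others:
            if x < 0:
                sign = -sign
        R.append(alpha * sign * min(abs(x) for x in others))
    return R

def variable_node_update(R, intrinsic):
    """Variable-to-check messages Lvc of equation (2): the intrinsic
    message plus all incoming Rmv except the target check node's own."""
    total = intrinsic + sum(R)
    return [total - R[c] for c in range(len(R))]

# Example: a degree-3 check node and a degree-2 variable node.
assert check_node_update([2.0, -1.0, 4.0]) == [-0.75, 1.5, -0.75]
assert variable_node_update([0.5, -0.25], 1.0) == [0.75, 1.5]
```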
congestion problem. Darabiha (2008) uses bit-serial architectures and message broadcasting
technique to reduce the number of wires in fully parallel decoders. Sharifi Tehrani (2008)
presents a stochastic decoder design where the messages or probabilities are converted to
streams of stochastic bits and complex probability operations can be performed on
stochastic bits using simple bit-serial structures.
More general methods to alleviate the routing problem reduce the parallelism (by using
partially parallel or serial processing) and use storage elements to store the intermediate
messages passed along the edges of the graph (e.g., see Mansour 2002, Yeo 2003, Cocco
2004). Various approaches are investigated in the literature at both
code design and hardware implementation levels. One approach is to design
“implementation-aware” codes (e.g., see Boutillon 2000, Zhang 2002, Mansour 2003a, Liao
2004, Zhang 2004, Sha 2009). In this approach, instead of randomly choosing the locations of
ones in the parity-check matrix at the code design stage, the parity-check matrix of an LDPC
code is decided with constraints allowing a suitable structure for decoder implementation
and providing acceptable decoding performance. In these cases, the problem becomes how
to reduce the complexity incurred by the message storage elements and how to speedup the
decoding process. For example, one important subclass of LDPC codes with regular
structure is quasi-cyclic (QC) LDPC codes, which have received the most attention and have
been adopted by many industry standards. Due to their regular structure, QC-LDPC codes
lend themselves conveniently to efficient hardware implementations: not only do the partly
parallel decoder architectures for QC-LDPC codes require simpler control logic (e.g., see
Chen 2004b, Wang 2007), but the encoders can also be efficiently built with shift registers
(e.g., see Li 2006). Construction of QC-LDPC codes with good error performance is
therefore of both theoretical and practical interest. Various construction methods for QC-
LDPC codes have been proposed and satisfactory error-correcting performance has been
reported (e.g., see Fossorier 2004, Chen 2004a).
On the other hand, LDPC decoders can be implemented with a programmable architecture
or processor, which lends itself to software-defined radio (SDR). SDR offers the flexibility
to support codes with different block lengths and rates; however, Seo (2007) shows that the
throughput of SDR-based LDPC decoders is usually low. In addition to digital decoders,
continuous-time analog implementations have also been considered for LDPC codes.
Compared to their digital counterparts, analog decoders offer improvements in speed or
power. However, because of the complex and technology-dependent design process, the
analog approach has only been considered for very short error-correcting codes, for example,
the (32, 8) LDPC code in Hemati (2006).
In the following, a new ensemble of LDPC codes, called shift-LDPC codes, will be
introduced to mitigate the above decoder implementation problems.
The 1's in the shift-LDPC code parity-check matrix are arranged as in the example in Fig. 1.
First, the 1's in the leftmost submatrices are arranged randomly under the constraint of
keeping exactly one "1" in each row and one "1" in each column; thus the submatrix Hi1 is a
permutation of the identity matrix. Next, in each submatrix to the right, the 1's are
cyclically shifted up by one position, with the "1" at the top wrapping around to the bottom.
In this way we obtain a (c, t)-regular shift-LDPC code.
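As an illustration, the construction above can be sketched in a few lines of Python/NumPy; the function name and the toy parameters below are ours, not from the chapter:

```python
import numpy as np

def shift_ldpc_matrix(c, t, q, seed=None):
    """Sketch of a (c, t)-regular shift-LDPC parity-check matrix of size
    (c*q) x (t*q): in each block row the leftmost submatrix is a random
    permutation matrix, and every submatrix to its right is the previous
    one cyclically shifted up by one position."""
    rng = np.random.default_rng(seed)
    H = np.zeros((c * q, t * q), dtype=np.uint8)
    for i in range(c):
        perm = rng.permutation(q)  # H_i1: one "1" per row and per column
        for j in range(t):
            for col in range(q):
                # j upward shifts move the "1" from row r to row (r - j) mod q
                H[i * q + (perm[col] - j) % q, j * q + col] = 1
    return H

H = shift_ldpc_matrix(c=2, t=3, q=4, seed=0)  # a (12, 4) (2, 3)-style toy code
```

Every row of the resulting matrix has weight t and every column has weight c, as required for a (c, t)-regular code.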
It should be noted that the shift-LDPC code is specifically designed for hardware
implementation: the shift decoder architecture is designed first, and the code construction is
then derived from it. This follows the "decoder-first code design" methodology proposed
in Boutillon (2000).
The shift decoder architecture can also be extended to other classes of structured LDPC
codes. For example, QC-LDPC codes can be converted to shift-architecture compliant codes
with a proper matrix permutation, as will be explained later in the chapter.
Fig. 3. Performance of the (1008, 504) (3, 6) shift-LDPC code and the (8192, 7168) (4, 32)
shift-LDPC code compared with same-length WiMAX standard codes and computer-generated
random codes
The leftmost q columns are processed first, then the next q columns, and so on, so that
q columns are processed concurrently in each clock cycle. The column process and the row
process are interleaved. The whole check node process is divided into t steps. In each clock
cycle, the q VNUs receive M check-to-variable messages and compute M variable-to-check
messages, so that each of the M CNUs receives one message and performs one step of the
check node process. With this decoding schedule, one iteration finishes in t clock cycles,
which is normally much faster than traditional partially parallel decoder architectures.
By combining this decoding schedule with the min-sum algorithm, the decoding process can
be expressed as below:
7: for each of the q columns processed in this cycle:
8:     Lvc = Iv + Σ_{m∈M(v)} Rmv − Rcv
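The variable-node update in this listing can be captured by a one-line helper; the function and argument names below are illustrative, not from the chapter:

```python
def variable_to_check(I_v, R_v, c):
    """Min-sum variable-node update: L_vc equals the intrinsic value I_v plus
    all incoming check-to-variable messages R_mv except the one from check c."""
    return I_v + sum(R_v.values()) - R_v[c]

L = variable_to_check(I_v=1.0, R_v={0: 0.5, 1: -0.25}, c=0)  # 1.0 + (-0.25) = 0.75
```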
With this arrangement, few messages need to be exchanged between CNUs and VNUs, and
thus the complexity of the shuffle networks can be significantly reduced. This will be shown next.
Fig. 5. Overall shift decoder block diagram. Messages are iteratively exchanged between
VNUs and CNUs through the shuffle network
For random codes, the shuffle network routing complexity is normally intolerable, while for
the shift-LDPC codes introduced above, the shuffle networks become very simple. Through
the introduction of the CNU communication network, we can ensure that each CNU receives
its variable-to-check messages from, and transmits its check-to-variable messages to, a fixed
VNU during the entire decoding process. Therefore, the shuffle network connecting CNUs
and VNUs consists of only M·b wires, where each message is assumed to be quantized to b bits
(b is typically chosen as 6). In contrast, the number of wires required by the original LDPC
decoding algorithm is M·t·b, and for high-rate codes t can be as large as 32.
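For the (8192, 7168) code (M = 1024 check rows, t = 32, b = 6), the wire counts quoted above work out as follows:

```python
# Shuffle-network wire counts for the (8192, 7168) (4, 32) code.
M, t, b = 1024, 32, 6           # check nodes, row weight, message bits
shift_wires = M * b             # shift decoder: one fixed VNU per CNU
original_wires = M * t * b      # original algorithm: t messages per CNU
print(shift_wires, original_wires)  # 6144 196608
```

a 32-fold reduction in shuffle-network wiring.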
Fig. 5 explains why each CNU always computes with the messages from a fixed VNU
during the entire decoding process. With the simple shuffle network, CNU i is connected
with VNU x. The row process of each check node is separated into t steps, with one variable-
to-check message being processed in each step. In the first clock cycle of an iteration, having
received the message from VNU x, CNU i performs the first step of the i-th row process. After
that, CNU i passes the i-th row process intermediate result to CNU i+1 through the CNU
communication network. Meanwhile, CNU i receives the (i-1)-th row process intermediate
result from CNU i-1. In the second clock cycle of the iteration, CNU i still receives the
variable-to-check message from VNU x. This message and the row process intermediate
result now held in CNU i both correspond to row i-1. Therefore, CNU i can perform
the second step of the (i-1)-th row process. At the same time, CNU i+1 is performing the second
step of the i-th row process. For variable node processing, VNU x receives the check-to-variable
messages sequentially, for rows i, i-1, i-2, and so on. In the same way, the CNU communication
network also ensures that VNU x only needs to receive its messages from CNU i.
Ultra-High Speed LDPC Code Design and Implementation 183
To show the decoding procedure more clearly, we illustrate the whole process using the
simple (12, 4) (2, 3) shift-LDPC code of Fig. 1 as an example. The procedure for
one entire iteration is shown in Fig. 6. It can be seen that the CNU communication network is
separated into two independent networks, the CNU communication network (intra iteration) and
the CNU communication network (inter iterations), which transfer the row processing results
within an iteration and between iterations, respectively.
The connections between CNUs and VNUs are fixed and simplified: CNU 1 connects with
VNU 4, CNU 2 connects with VNU 2, and so on. These connections are based on the positions
of the 1's in the leftmost submatrices.
One iteration consists of three steps, or three clock cycles (the row weight t equals 3). In each
step/clock cycle, each CNU generates a check-to-variable message, passes the message to its
connected VNU, and processes a variable-to-check message received from its connected
VNU. At the start of an iteration, in the first clock cycle, CNU 1 processes the data
of row 1, CNU 2 processes the data of row 2, and so on. After this one-step row process, the row
process intermediate results are shifted between CNUs through the CNU communication network
(intra iteration): the result corresponding to row 1 is passed to CNU 2, the result corresponding
to row 2 is passed to CNU 3, etc.
In the second clock cycle, CNU 2 still transmits to and receives messages from VNU 2. This time
the variable-to-check message it receives corresponds to row 1, so it can perform one
step of the row-1 process. Likewise, CNU 3 performs one step of the row-2 process, CNU 4
performs one step of the row-3 process, and so on. The third step is performed in the same way.
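The schedule in this walkthrough can be summarized by a small helper; the formula below is our reading of the shift-by-one schedule, with CNUs, cycles, and rows all 1-indexed:

```python
def row_processed(cnu, cycle, M):
    """Row whose process CNU `cnu` advances in clock `cycle` when the
    intermediate results move one CNU down the chain per cycle."""
    return (cnu - cycle) % M + 1

# The (12, 4) (2, 3) example with M = 4 CNUs and t = 3 cycles per iteration:
assert row_processed(1, 1, 4) == 1  # cycle 1: CNU 1 works on row 1
assert row_processed(2, 2, 4) == 1  # cycle 2: CNU 2 has received row 1's result
assert row_processed(4, 2, 4) == 3  # ... and CNU 4 works on row 3
```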
Once the iteration is finished, the row process results are transferred back to the appropriate
CNUs for check-to-variable message generation in the next iteration. The results are
transferred through the CNU communication network (inter iterations), which is a one-to-one
communication network as shown in Fig. 6.
As a result, the communication between CNUs and VNUs required by the original LDPC
decoding algorithm is decomposed into three kinds of connections: the simplified
connection between CNUs and VNUs, the CNU communication network (intra iteration), and
the CNU communication network (inter iterations).
184 VLSI
In the following, we introduce the CNU architecture for the min-sum decoding algorithm,
using the (8192, 7168) (4, 32) shift-LDPC code as the design example.
Because only one variable-to-check message is processed at a time, the computation in the
CNU has very low complexity. The magnitude comparison part compares the magnitude of the
input message with the current row process intermediate result to update the minimum
magnitudes, sign, and index. The updated row process intermediate result is then registered
and passed to the next CNU neighbour through the CNU communication network (intra iteration).
The message computation part selects the proper message magnitude according to the index
value and computes the sign of the message. At the beginning of each iteration, it performs
the message scaling (α = 0.75).
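A behavioural sketch of this CNU in Python (the names and the test vector are ours; the min-sum bookkeeping itself follows the description above):

```python
def cnu_update(state, mag, sign, idx):
    """One serial CNU step: keep the smallest and second-smallest input
    magnitudes, the index of the smallest, and the running XOR of signs."""
    min1, min2, min_idx, s = state
    if mag < min1:
        min1, min2, min_idx = mag, min1, idx
    elif mag < min2:
        min2 = mag
    return (min1, min2, min_idx, s ^ sign)

def cnu_output(state, idx, alpha=0.75):
    """Scaled check-to-variable magnitude for edge `idx`: the edge holding the
    overall minimum gets the second minimum, every other edge gets the minimum."""
    min1, min2, min_idx, _ = state
    return alpha * (min2 if idx == min_idx else min1)

state = (float("inf"), float("inf"), None, 0)
for i, m in enumerate([0.8, 0.2, 0.5, 0.9]):  # |message| per step, all signs positive
    state = cnu_update(state, m, 0, i)
```

After the t serial steps, `cnu_output(state, 1)` returns 0.75 × 0.5 (the second minimum, since edge 1 supplied the minimum), while every other edge receives 0.75 × 0.2.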
Fig. 9. Architecture of the variable node processing unit. StoT module converts from the
sign-magnitude format to two’s complement format. TtoS is for reverse conversion
Design                             | CNU comb. logic (gates) | VNU comb. logic (gates) | Total comb. logic (gates) | Frequency (MHz) | Throughput per iteration (Gbps) | Message area (mm²) | Total area (mm²)
One parallel level                 | 148                     | 953                     | 395K                      | 160             | 40.96                           | 6.1                | 10
One parallel level + pipelining    | 148                     | 1410                    | 512K                      | 317             | 76.4                            | 6.1                | 11.3
Double parallel level              | 342                     | 953                     | 838K                      | 151             | 77.3                            | 6.1                | 14
Double parallel level + pipelining | 342                     | 1410                    | 1072K                     | 317             | 153                             | 6.1                | 16.7
Table 2. Synthesis result for (8192, 7168) (4, 32) shift LDPC decoder under 0.18μm
technology
Table 2 shows the synthesis result of four design examples: one-level parallel design, one-
level parallel with pipelining, two-level parallel design, and two-level parallel with
pipelining. Due to the large number of CNUs and VNUs (submatrix size q = 256), the
bottom-up synthesis strategy is applied. These results are achieved with 0.18μm technology.
It can be seen that, by applying pipelining, the clock frequency can be almost doubled, and
the overhead is 81 registers per CNU. The one-level parallel design with pipelining can
achieve the best trade-off between speed and hardware complexity.
The message storage is implemented with registers. The transferred messages take
(16×2+32)×1024 = 65536 bits. In comparison, a traditional partially parallel architecture
(e.g., Wang (2007)) needs to store 8192×4×6 = 196608 bits in total, so the shift architecture
saves 67% of the message memory. The initial intrinsic information takes 8192×6 = 49152 bits,
which is the same in both cases.
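The arithmetic is easy to check (note that 8192 × 4 × 6 evaluates to 196608):

```python
transfer_bits = (16 * 2 + 32) * 1024  # messages held by the shift decoder
traditional_bits = 8192 * 4 * 6       # all edge messages: 8192 columns, weight 4, 6 bits
intrinsic_bits = 8192 * 6             # initial intrinsic information (both cases)
saving = 1 - transfer_bits / traditional_bits
print(transfer_bits, traditional_bits, intrinsic_bits, round(saving, 2))
# 65536 196608 49152 0.67
```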
To further examine the efficiency of the shift decoder architecture, the backend design of the
sample (8192, 7168) (4, 32) shift-LDPC decoder was completed. The technology used for
implementation is a 0.18 μm CMOS process with 6 metal layers. Fig. 10 shows the floor-plan
and layout of the decoder chip, with a die size of 4.1 mm × 4.1 mm and a logic density of 70%.
The placement of CNUs is critical to reducing the routing congestion of CNU
communication networks. The CNU array located in the centre of the chip is specially
arranged: the 1024 CNUs are aligned with 31 CNUs per row, so that the communications
between CNUs can be locally routed.
Fig. 10. The floor-plan and layout of the (8192, 7168) (4, 32) shift LDPC decoder
Table 3 shows the decoder implementation results compared with some other LDPC
decoder architectures. The throughput achieved here is 5.1 Gb/s at a maximum of 15
iterations. A parameter, "hardware efficiency", defined as (throughput × iterations / area),
is used to evaluate the efficiency of each architecture. The area metrics are scaled to
resemble 90 nm CMOS results (65 nm by a factor of 2 and 180 nm by a factor of 1/4). From
this table, we can see that the proposed design achieves more than 70% improvement in
hardware efficiency compared with an advanced existing design, even though the higher
clock speeds afforded by more advanced CMOS technologies are not considered in this
comparison; otherwise the improvement would be even more significant. The shift decoder
architecture is therefore very efficient for high-speed LDPC decoder implementation.
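The metric and its technology scaling can be reproduced as below; the scaling factors are the ones quoted above, and the figures for the other designs would come from Table 3:

```python
def hardware_efficiency(throughput_gbps, iterations, area_mm2, node_nm):
    """Throughput x iterations / area, with the area scaled to a 90 nm
    equivalent (180 nm scaled by 1/4, 65 nm scaled by 2)."""
    scale = {180: 0.25, 90: 1.0, 65: 2.0}[node_nm]
    return throughput_gbps * iterations / (area_mm2 * scale)

# This design: 5.1 Gb/s at up to 15 iterations, 4.1 mm x 4.1 mm die in 180 nm
eff = hardware_efficiency(5.1, 15, 4.1 * 4.1, 180)
```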
G = ⎡ 1   1        1        ⋯  1    1   0 ⎤ ,   (3)
    ⎣ 1   α^(q-2)  α^(q-3)  ⋯  α^2  α   1 ⎦
we can get the extended RS code Cb with q^2 codewords in total. Let v be a nonzero codeword in
Cb with weight q, for example v = (1, α^(q-2), ···, α^2, α, 1). Then the set Cb(0) = {cv : c ∈ GF(q)} of q
codewords forms a subcode of Cb. The set Cb(0) always contains the all-zero codeword and q-1
codewords of weight q, no matter which weight-q codeword v is chosen.
Partition Cb into q additive cosets, Cb(0), Cb(1), …, Cb(q-1), based on the subcode Cb(0). Each coset
Cb(i) is composed of q codewords, W0(i), W1(i), …, Wq-1(i), each of length q,
as below:
          ⎡ W0(i)    ⎤   ⎡ w0,0     w0,1     ⋯  w0,q-1    ⎤
Cb(i)  =  ⎢ W1(i)    ⎥ = ⎢ w1,0     w1,1     ⋯  w1,q-1    ⎥ .   (4)
          ⎢   ⋮      ⎥   ⎢   ⋮                            ⎥
          ⎣ Wq-1(i)  ⎦   ⎣ wq-1,0   wq-1,1   ⋯  wq-1,q-1  ⎦
Then the cosets are arranged together to get the q^2 × q matrix Hrs:

        ⎡ Cb(0)    ⎤
Hrs  =  ⎢ Cb(1)    ⎥ .   (5)
        ⎢   ⋮      ⎥
        ⎣ Cb(q-1)  ⎦
Replace each symbol α^i in Hrs with a location vector

z(α^i) = (z0, z1, z2, …, zq-1),   (6)

where the i-th component zi = 1 and all other components equal zero; the exponent of a
symbol thus gives the position of the "1" in its location vector. Finally,
choose a dv × dc subarray from Hrs to get a (dv, dc)-regular LDPC code. Equivalently:
randomly select dv cosets, then select dc columns from them, and finally replace each
symbol with its location vector to obtain a sparse parity-check matrix H.
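The location-vector replacement can be sketched as below; we assume the q field elements are indexed 0..q-1 and that the symbol with index s maps to the unit vector whose "1" sits in position s (one possible indexing convention):

```python
import numpy as np

def expand_location_vectors(symbols, q):
    """Replace each symbol of an m x n matrix over GF(q) by its length-q
    location vector, yielding an m x (n*q) binary parity-check matrix."""
    m, n = symbols.shape
    H = np.zeros((m, n * q), dtype=np.uint8)
    for r in range(m):
        for c in range(n):
            H[r, c * q + symbols[r, c]] = 1
    return H
```

Because each symbol contributes exactly one "1", the expanded matrix is sparse: every row has weight dc, and stacking dv cosets gives column weight dv.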
A coset example, generated in GF(8), is given here to show the matrix properties clearly:
⎡ α^4  α^6  α^7  α^3  α    α^5  0    α   ⎤
⎢ 0    α^4  α^6  α^7  α^3  α    α^5  α^2 ⎥
⎢ α^5  0    α^4  α^6  α^7  α^3  α    α^3 ⎥
⎢ α    α^5  0    α^4  α^6  α^7  α^3  α^4 ⎥ .   (7)
⎢ α^3  α    α^5  0    α^4  α^6  α^7  α^5 ⎥
⎢ α^7  α^3  α    α^5  0    α^4  α^6  α^6 ⎥
⎢ α^6  α^7  α^3  α    α^5  0    α^4  α^7 ⎥
⎣ α^2  α^2  α^2  α^2  α^2  α^2  α^2  0   ⎦

(Entries are powers of α, with α^7 = 1 in GF(8); "0" denotes the zero element.)
Examination of this matrix and of equation (4) reveals the following four properties:
1) Each column is composed of the q elements of the Galois field GF(q).
2) The qth row is special: it consists of a single value α^i, except for the last entry, which is "0".
3) The qth column is special: it always consists of "1, α, α^2, …, α^(q-2), 0".
4) Apart from the special row and the special column, the remaining (q-1) × (q-1) matrix is a
special type of circulant in which each row is the right cyclic shift of the row above it, while
the first row is the right cyclic shift of the last row.
Careful readers will notice that property 4) is exactly the shift property introduced earlier
in the chapter. Thus the shift decoder architecture can be applied to this kind of RS-based
LDPC code. Fig. 11 shows the location vector replacement in the RS-based LDPC code
construction and the mapping of the parity-check matrix to decoder hardware. The chosen
code sample is a (32, 24) (1, 4) RS-based LDPC code generated in GF(8).
Fig. 11. The location vector replacement in RS-based LDPC code construction and the
mapping of the parity check matrix to decoder hardware
Fig. 12. Block diagram of the shift decoder design for the matrix defined in Fig. 11
Fig. 12 shows the block diagram of the decoder designed for the matrix defined in Fig. 11. It can
be decomposed into two parts: the processing units and the message storage registers. There is
one row result register (RR) corresponding to each CNU. Each RR is composed of current-
iteration row result registers (RRC) and last-iteration row result registers (RRL); both
contain the minimum value, the second-minimum value, the minimum value index, and the sign bit.
Since the offset value in the qth row is fixed at "2", the qth row's message always comes from VNU
2. CNU 7 is solely in charge of the qth row's processing, and RR 7 stores the processing result of
the qth row, which corresponds to the special-row property 2).
With the shift decoder architecture, two decoder examples are designed. The target code is
the (2048, 1723) (6, 32) LDPC code generated from the (64, 32, 2) RS code; a 6 × 32 sub-array at
the upper left corner of Hrs is selected.
Table 4 shows the synthesis results, under 90 nm CMOS technology, of two design examples:
a basic design and a design with four times the parallelism. Compared to the basic design, the
latter achieves approximately 3× the decoding speed at 2× the hardware cost. Table 4 also
compares the shift architecture decoders with a state-of-the-art bit-serial fully parallel LDPC
decoder (Darabiha (2008)) and a partially parallel LDPC decoder (Chen (2003)). Comparing
the throughput-to-area ratio metric, it can be clearly seen that the shift-architecture-based
designs are highly hardware-efficient.
Table 4. Synthesis results of (2048, 1723) RS-LDPC code decoder under shift architecture and
comparisons with other designs
Fig. 13. The construction of QC-LDPC parity check matrix Hqc: an array of circulant
submatrices
Fig. 14. The transformation from the sample QC-LDPC parity check matrix into a shift like
LDPC code through column permutation
Fig. 14 illustrates the way to transform the QC-LDPC code parity-check matrix into a shift-
type matrix whose decoder can be implemented with the shift decoder architecture. First,
the first q columns are distributed to the t block columns of Hqcs1 in a round-robin fashion.
Then the second q columns are permuted in the same way, and so on, until all columns are
distributed into the new matrix Hqcs1. Careful readers will notice that a new property is
introduced in the new matrix: in some submatrices there are multiple "1"s in one row. This
newly introduced property requires a special CNU design which processes multiple messages
instead of one at each step, causing some additional logic delay in the CNU. In Cui (2008a),
the authors proposed an efficient matrix permutation optimization method to minimize the
maximum row weight; normally it is no larger than 2. In addition, in Cui (2008a) the
authors optimized the column layered decoding algorithm and applied it to the shift decoder
architecture. In this way, the number of iterations needed to converge can be reduced and the
decoding throughput further increased.
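One concrete reading of the round-robin column distribution is the following; the resulting order is only a sketch, and the optimized permutations of Cui (2008a) differ:

```python
def round_robin_columns(n, t):
    """New column order in which block column b collects the original
    columns b, b+t, b+2t, ... (round-robin over t block columns)."""
    return [j for b in range(t) for j in range(b, n, t)]

order = round_robin_columns(12, 3)
# block column 0 <- columns 0, 3, 6, 9; block 1 <- 1, 4, 7, 10; block 2 <- 2, 5, 8, 11
```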
There are other matrix transformation methods to apply the extended shift decoder
architecture to QC-LDPC codes. For example, the quasi-cyclic matrix can be transformed
into a row shift construction by row permutations, as shown in Fig. 15. First, the first q rows
are distributed to the 4 block rows of Hqcs2 in a round-robin fashion (i.e., rows A-H of Hqc are
distributed to rows 1, 5, 9, 13, 2, 6, 10 and 14 of Hqcs2). Then the second q rows are permuted
in the same way, and so on, until all rows are distributed into the new matrix Hqcs2. The
quasi-cyclic matrix is converted to the form:
          ⎡ A1        A2        A3        ⋯  At       ⎤
Hqcrs  =  ⎢ A1^(1)    A2^(1)    A3^(1)    ⋯  At^(1)   ⎥ ,   (8)
          ⎢   ⋮                                       ⎥
          ⎣ A1^(q/m)  A2^(q/m)  A3^(q/m)  ⋯  At^(q/m) ⎦

where Ai^(k) denotes the submatrix Ai with its rows cyclically shifted by k positions.
Fig. 15. The transformation from the sample QC-LDPC parity check matrix into a shift like
LDPC code through row permutation
After the matrix conversion, the row shift property can be exploited to apply a row shift
decoder architecture, just like the column shift architecture presented above. In this case, the
row layered decoding algorithm should be applied; fortunately, the row layered algorithm
can increase the convergence speed significantly (Mansour (2003b)). Interested readers are
referred to Cui (2008b) for more information.
           ⎡ I    I          I            ⋯   I              ⎤
           ⎢ I    α          α^2          ⋯   α^(t-1)        ⎥
Harray  =  ⎢ I    α^2        α^4          ⋯   α^(2(t-1))     ⎥ ,   (9)
           ⎢ ⋮                                               ⎥
           ⎣ I    α^(c-1)    α^(2(c-1))   ⋯   α^((c-1)(t-1)) ⎦
where I is a q × q identity matrix, q is a prime number, and α is a permutation matrix
representing a single cyclic right (or left) shift. For example, a (2209, 2024) (4, 47) array LDPC
code can be constructed with q = 47. We use this code in the decoder explanation.
Array LDPC codes have a shift property similar to that of shift-LDPC codes, so some
modifications of the CNU communication network suffice for the decoder implementation.
Fig. 16 shows the modified CNU connections. The CNUs are grouped into four
CNU groups, each with 47 members: Group 1 contains CNU-1 to CNU-47, in charge of the
row operations of rows 1~47; Group 2 contains CNU-48 to CNU-94, in charge of the row
operations of rows 48~94; and so on. There is no communication between groups. The
difference between array LDPC codes and shift-LDPC codes is that the shift values vary
from 0 to 3 instead of being fixed at 1. In addition, for this special kind of LDPC code, the
CNU communication network (intra iteration) is exactly the same as the CNU communication
network (inter iterations), so these two networks can be merged into one simple network and
the routing complexity further reduced. As a result, except for the CNUs in Group 1, the 47
CNUs in each group form a chain in which each CNU only communicates with its two
neighbors. For example, in CNU group 3, CNU-i only receives data from CNU-(i-2) and
delivers data to CNU-(i+2).
A sample decoder is designed based on an ALTERA EP2C35 FPGA. A 50 MHz clock speed and
120 Mbps throughput (20 iterations) are achieved with 23k logic elements and 26k memory
bits. Interested readers are referred to Sha (2006) for more information.
Fig. 16. The CNU communication network designed for (2209, 2024) (4, 47) array LDPC code
6. Concluding Remarks
Since MacKay's (1997) rediscovery of LDPC codes, LDPC decoder design has experienced
considerable development and enhancement over the last ten years, and various
optimization techniques for both decoding algorithms and decoder designs have been
developed. This chapter provided a brief overview of the problems in existing LDPC
decoder designs and presented a novel shift-structured decoder design approach to tackle
these problems. The codes suited to this kind of decoder architecture are called shift-LDPC
codes. Several decoder examples were discussed to illustrate the effectiveness of the proposed
architecture. In addition, it was shown in the chapter that some popular classes of LDPC
codes, such as RS-based LDPC codes and QC-LDPC codes, can be implemented with the shift
decoder architecture through simple matrix permutations or architecture extensions. In
conclusion, the presented shift decoder architecture is a strong candidate for future
high-speed communication system designs.
7. References
A. Darabiha, A. C. Carusone, & F. R. Kschischang. (2008). Power Reduction Techniques for
LDPC Decoders, IEEE Journal of Solid-State Circuits, vol. 43, no. 8, pp. 1835-1845
A. J. Blanksby & C. J. Howland. (2002). A 690-mW 1-Gb/s 1024-b, rate-1/2 Low-Density
Parity-Check code decoder, IEEE J. Solid-State Circuits, vol. 37, no. 3, pp. 404-412
D. J. C. MacKay & R. M. Neal. (1996). Near Shannon limit performance of low density parity
check codes, Electron. Lett., vol. 32, pp. 1645-1646
E. Boutillon, J. Castura, & F. R. Kschischang, (2000). Decoder-First Code Design, Proceedings
of the 2nd International Symposium on Turbo Codes and Related Topics, Brest, France,
pp. 459-462
F. Guilloud, E. Boutillon & J.L. Danger. (2003). λ-Min Decoding Algorithm of Regular and
Irregular LDPC Codes, Proc. 3rd International Symposium on Turbo Codes and Related
Topics, pp. 451-454
E. Liao, E. Yeo & B. Nikolic. (2004). Low-density parity-check code constructions for
hardware implementation, Proc. IEEE Int. Conf. on Commun. vol. 5, pp. 2573-2577
E. Yeo, B. Nikolic & V. Anantharam. (2003) Iterative decoder architectures, IEEE Commun.
Mag., vol. 41, pp. 132-140
F. R. Kschischang, B. J. Frey & H. A. Loeliger. (2001). Factor graphs and the sum-product
algorithm, IEEE Trans. Inf. Theory, vol. 47, pp. 498-519
G. Liva, S. Song, L. Lan, Y. Zhang, S. Lin, & W. E. Ryan. (2006). Design of LDPC Codes: A
Survey and New Results. J. Comm. Software and Systems, vol. 2, pp. 191
I. Djurdjevic, J. Xu, K. Abdel-Ghaffar, & S. Lin. (2004). Construction of low-density parity-
check codes based on Reed-Solomon codes with two information symbols, IEEE
Commu. Lett., vol. 8, no. 7, pp. 317-319
J. Chen, A. Dholakia, E. Eleftheriou, M. P. C. Fossorier, & X. Hu. (2005). Reduced-complexity
decoding of LDPC codes, IEEE Trans. Commun., vol. 53, pp. 1288-1299
J. L. Fan. (2000). Array codes as low-density parity-check codes, Proc. 2nd Int. Symp. Turbo
Codes and Related Topics Brest, France, pp. 543
J. Sha, M. Gao, Z. Zhang, L. Li, Z. Wang. (2006). An FPGA Implementation of array LDPC
decoder, IEEE Asia Pacific Conference on Circuits and Systems, pp. 1675-1678
J. Sha, Z. Wang, M. Gao, & L. Li. (2009). Multi-Gb/s LDPC Code Design and Implementation,
IEEE Trans. on VLSI Systems, vol. 17, no. 2, pp. 262-268
LAN/MAN CSMA/CD Access Method, IEEE 802.3 Standard. Available online at
http://standards.ieee.org/getieee802/802.3.html
J. Zhao, F. Zarkeshvari, & A. H. Banihashemi. (2005). On implementation of min-sum
algorithm and its modifications for decoding Low-Density Parity-Check (LDPC)
codes, IEEE Trans. Commun., vol. 53, no. 4, pp. 549-554
L. Chen, J. Xu, I. Djurdjevic & S. Lin. (2004a) Near-Shannon limit quasi-cyclic low-density
parity-check codes, IEEE Trans. Commun, vol. 52, pp. 1038
M. Cocco, J. Dielissen, M. Heijligers, A. Hekstra, & J. Huisken. (2004). A scalable architecture
for LDPC decoding, Proc. Design, Automation and Test in Europe, vol. 3, pp. 88-93
M. M. Mansour & N. R. Shanbhag. (2002). Low power VLSI decoder architecture for LDPC
codes, Proc. IEEE Int. Symp. on Low Power Electron. Design, pp. 284-289
M. M. Mansour & N. R. Shanbhag, (2003a). Architecture-Aware Low-Density Parity-Check
Codes, Proc. IEEE ISCAS, pp. 57-60
M. M. Mansour & N. R. Shanbhag. (2003b). High throughput LDPC decoders, IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 11, pp. 976-996
M.M. Mansour & N. R. Shanbhag. (2006). A 640-Mb/s 2048-bit programmable LDPC
decoder chip, IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 684- 698
M. P. C. Fossorier. (2004). Quasi-cyclic low-density parity-check codes from circulant
permutation matrices, IEEE Trans. Inf. Theory, vol. 50, no. 8, pp. 1788–1793
R. G. Gallager. (1962). Low-density parity-check codes, IRE Transactions on Information
Theory, vol. IT-8, pp. 21-28
R. M. Tanner. (1981). A recursive approach to low complexity codes, IEEE Trans. Inf. Theory,
vol. IT-27, pp. 533-547
S. Hemati, A. Banihashemi, & C. Plett. (2006). A 0.18 µm analog min-sum iterative decoder
for a (32,8) low-density parity-check (LDPC) code, IEEE J. Solid-State Circuits, vol.
41, pp. 2531–2540
S. Seo, T. Mudge, Y. Zhu & C. Chakrabarti. (2007). Design and analysis of LDPC decoders
for software defined radio, Proc. IEEE Workshop on Signal Processing Systems,
Shanghai, China, pp.210–215
S. Sharifi Tehrani, S. Mannor & W. J. Gross. (2008). Fully Parallel Stochastic LDPC Decoders,
IEEE Trans. Signal Processing, vol. 56, no. 11, pp. 5692-5703
S. Y. Chung, G. D. Forney, T. J. Richardson & R. Urbanke. (2001). On the design of low-
density parity-check codes within 0.0045 dB of the Shannon limit, IEEE Commun.
Lett., vol. 5, pp. 58-60
T. Brack, M. Alles, T. Lehnigk-Emden, F. Kienle, N. Wehn, & L. Fanucci. (2007). Low
Complexity LDPC Code Decoders for Next Generation Standards, Proc. Design,
Automation and Test in Europe, pp. 1-6
T. Zhang & K. Parhi. (2002). A 54 Mbps (3,6)-regular FPGA LDPC decoder, Proc. IEEE SiPS,
pp. 127-132
T. Zhang & K. K. Parhi. (2004). Joint (3,k)-regular LDPC code and decoder/encoder design,
IEEE Trans. Signal Process., vol. 52, no. 4, pp.1065–1079
Y. Chen & D. Hocevar. (2003). A FPGA and ASIC implementation of rate 1/2, 8088-b
irregular low density parity check decoder, Proc. IEEE GLOBECOM, San Francisco,
CA, pp. 113–117
Y. Chen & K. K. Parhi. (2004b). Overlapped message passing for quasi-cyclic low density
parity check codes, IEEE Trans. Circuits Syst. I, vol. 51, pp. 1106-1113
Z.-W. Li, L. Chen, L.-Q. Zeng, S. Lin & W. H. Fong. (2006). Efficient encoding of quasi-cyclic
low-density parity-check codes, IEEE Transactions on Communications, vol. 54, no. 1,
pp. 71-81
Z. Cui, Z. Wang, X. Zhang & Q. Jia. (2008a). Efficient decoder design for high-throughput
LDPC decoding, APCCAS, pp. 1640-1643
Z. Cui, Z. Wang & Y. Liu. (2008b). High-throughput layered LDPC decoding architecture,
IEEE Trans. VLSI Systems, vol. 17, no. 4, pp. 582-587
Z. Wang & Z. Cui. (2007). Low-complexity high-speed decoder design for quasi-cyclic LDPC
codes, IEEE Trans. on VLSI Systems, vol. 15, no. 1, pp. 104-114
A Methodology for Parabolic Synthesis 199
10
1. Introduction
In relatively recent research of the history of science interpolation theory, in particular of
mathematical astronomy, revealed rudimentary solutions of interpolation problems date
back to early antiquity (Meijering, 2002). Examples of interpolation techniques originally
conceived by ancient Babylonian as well as early-medieval Chinese, Indian, and Arabic
astronomers and mathematicians can be linked to the classical interpolation techniques
developed in Western countries from the 17th until the 19th century. The available historical
material has not yet given a reason to suspect that the earliest known contributors to
classical interpolation theory were influenced in any way by mentioned ancient and
medieval Eastern works. For the classical interpolation theory it is justified to say that there
is no single person who did so much for this field as Newton. Therefore, Newton deserves
the credit for having put classical interpolation theory on a foundation. In the course of the
18th and 19th century Newton’s theories were further studied by many others, including
Stirling, Gauss, Waring, Euler, Lagrange, Bessel, Laplace, and Everett. Whereas the
developments until the end of 19th century had been impressive, the developments in the
past century have been explosive. Another important development from the late 1800s is the
rise of approximation theory. In 1885, Weierstrass justified the use of approximations by
establishing the so-called approximation theorem, which states that every continuous
function on a closed interval can be approximated uniformly to any prescribed accuracy by
a polynomial. In the 20th century two major extensions of classical interpolation theory is
introduced: firstly the concept of the cardinal function, mainly due to E. T. Whittaker, but
also studied before him by Borel and others, and eventually leading to the sampling
theorem for band limited functions as found in the works of J. M. Whittaker, Kotel'nikov,
Shannon, and several others, and secondly the concept of oscillatory interpolation,
researched by many and eventually resulting in Schoenberg's theory of mathematical
splines.
A look-up table (Tang, 1991) is simple and fast, and is straightforward for low-precision
computations of f(x), i.e., when x only has a few bits. However, when performing high-
precision computations, a single look-up table implementation is impractical due to the huge
table size and the long execution time.
Approximations using only polynomials have the advantage of being ROM-less, but they
can impose large computational complexity and delay (Muller, 2006). By introducing
table-based methods into the polynomial methods, the computational complexity can be
reduced and the delays decreased to some extent (Muller, 2006).
The CORDIC (COordinate Rotation DIgital Computer) algorithm (Volder, 1959) (Andrata,
1998) has been used for these applications since it is faster than a pure software approach.
However, CORDIC is an iterative method and therefore slow, which makes it insufficient for
this kind of application.
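For reference, the iteration that makes CORDIC slow is a simple rotation loop (a textbook sketch, not tied to any particular hardware implementation):

```python
import math

def cordic_sin_cos(theta, n=32):
    """Circular-mode CORDIC: rotate (x, y) by +/- atan(2^-i) each iteration
    until the residual angle z reaches zero; valid for |theta| < ~1.74 rad."""
    K = 1.0
    for i in range(n):                     # pre-compute the constant gain
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = K, 0.0, theta
    for i in range(n):                     # n sequential shift-and-add iterations
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** (-i), y + d * x * 2.0 ** (-i)
        z -= d * math.atan(2.0 ** (-i))
    return y, x                            # (sin(theta), cos(theta))
```

Each iteration contributes roughly one bit of precision, so high precision means many sequential iterations; this latency is what motivates the parallel approach below.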
The proposed methodology of parabolic synthesis (Hertz & Nilsson, 2008) develops
functions that perform an approximation of original functions in hardware. The architecture
of the processing part of the methodology uses parallelism to reduce the execution time. For
the development of approximations of functions, the parabolic synthesis methodology is
applied, using only low-complexity operations that are simple to implement in hardware.
2. Methodology
Fig. 1. Example of a normalized function, in this case sin(πx/2).
2.1 Normalizing
The purpose of the normalization is to facilitate the hardware implementation by limiting
the numerical range. The normalization has to ensure that the values are in the interval
0 ≤ x < 1 on the x-axis and 0 ≤ y < 1 on the y-axis. The coordinates of the starting point shall
be (0,0); furthermore, the ending point shall have coordinates smaller than (1,1), and the
function must be strictly concave or strictly convex throughout the interval. An example of
such a function, called the original function forg(x), is shown in Fig. 1.
The procedure when developing sub-functions is to divide the original function f_org(x) by
the first sub-function s1(x). This division generates the first function f1(x), as shown in (2).
The first sub-function s1(x) is chosen to be feasible for hardware, according to the
methodology described in (4). In the same manner the following functions fn(x) are
generated, as shown in (3).

f1(x) = f_org(x) / s1(x)    (2)

fn(x) = fn-1(x) / sn-1(x)    (3)
First sub-function
The first sub-function s1(x) is developed by dividing the original function f_org(x) by x as an
approximation.
As shown in Fig. 2, there are two possible results after dividing the original function by x:
one where f(x) > 1 and one where f(x) < 1.
The first sub-function s1(x) is given in (4). To approximate these functions, the factor
1 + c1·(1 - x) is used. The first sub-function s1(x) is thus a multiplication of x and
1 + c1·(1 - x), which results in a second-order parabolic function:

s1(x) = x · (1 + c1 · (1 - x))    (4)

In (4) the coefficient c1 is determined as the limit, when x goes to 0, of the original function
divided by x, minus 1, according to (5):

c1 = lim_{x→0} ( f_org(x) / x ) - 1    (5)
Second sub-function
The first function f1(x) is calculated according to (2), and the result of this operation is a
function whose appearance is similar to a parabolic function, as shown in Fig. 3.
Fig. 3. Example of the first function f1(x) compared with sub-function s2(x).
The second sub-function s2(x) is chosen according to the methodology as a second-order
parabolic function, see (6):

s2(x) = 1 + c2 · (x - x²)    (6)

In (6) the coefficient c2 is chosen to satisfy that the quotient between the function f1(x) and
the second sub-function s2(x) is equal to 1 when x is equal to 0.5, see (7).
c2 = 4 · ( f1(1/2) - 1 )    (7)
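The chain (2), (4), (5) and (7) can be followed numerically for the sine example of Fig. 1. The following Python sketch is only an illustration (it is not part of the chapter's hardware flow); it reproduces the coefficient value 0.400858 used later in (17):

```python
import math

def f_org(x):
    # Example original function from Fig. 1: sin(pi*x/2).
    return math.sin(math.pi * x / 2)

# (5): c1 is the limit of f_org(x)/x as x -> 0, minus 1.
# For sin(pi*x/2) that limit is pi/2.
c1 = math.pi / 2 - 1

def f1(x):
    # (2) with (4): f1(x) = f_org(x) / s1(x), s1(x) = x*(1 + c1*(1 - x)).
    return f_org(x) / (x * (1 + c1 * (1 - x)))

# (7): c2 = 4*(f1(1/2) - 1).
c2 = 4 * (f1(0.5) - 1)
print(round(c2, 6))   # 0.400858
```

The recovered value agrees with the coefficient printed in (17), which confirms that (5) and (7) fully determine the first two sub-functions.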
Thereby the second function f2(x), will get a shape of a lying S, as shown in Fig. 4.
When developing the third sub-function s3(x), the function f2(x) is split into two parabolic
functions, where the first is restricted to the interval 0 ≤ x < 0.5 and the second to the
interval 0.5 ≤ x < 1.0. By splitting the function we get strictly convex and concave functions
in each interval. The intervals can be chosen differently, but that will lead to more complex
hardware, as shown in section 3.
To indicate which interval each partial function is valid for, the subscript index is extended
with an index m, which gives the partial functions the appearance fn,m(x).
fn(x) = { fn,0(x),               0 ≤ x < 1/2^(n-1)
          fn,1(x),               1/2^(n-1) ≤ x < 2/2^(n-1)
          ...
          f_{n,2^(n-1)-1}(x),    (2^(n-1)-1)/2^(n-1) ≤ x < 1     (8)
In equation (8) it is shown how the function fn(x) is divided into partial functions fn,m(x)
when n > 2.
As shown in (8), the number of partial functions doubles for each order n > 1, i.e. the
number of partial functions is 2^(n-1). From these partial functions, the corresponding sub-
functions are developed. Analogous to the function fn(x), the sub-function sn+1(x) will
have partial sub-functions sn+1,m(x). In equation (9) it is shown how the sub-function sn(x) is
divided into partial functions when n > 2.
sn(x) = { sn,0(xn),              0 ≤ x < 1/2^(n-2)
          sn,1(xn),              1/2^(n-2) ≤ x < 2/2^(n-2)
          ...
          s_{n,2^(n-2)-1}(xn),   (2^(n-2)-1)/2^(n-2) ≤ x < 1     (9)
Note that in (9), in the partial sub-functions, x has been changed to xn. The change to xn is a
normalization to the corresponding interval, which simplifies the hardware
implementation of the parabolic function. To simplify the normalization, xn is selected as x
scaled by a power of 2, with the integer part removed. The normalization of x is therefore
done by multiplying x by 2^(n-2), which in hardware is n-2 left shifts with the integer part
dropped; this gives xn as the fractional part (frac(·)) of 2^(n-2)·x, as shown in (10).
xn = frac( 2^(n-2) · x )    (10)
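The shift-and-mask reading of (10) can be sketched in Python for an assumed unsigned fixed-point format; the 16-bit word length below is illustrative only, not taken from the chapter:

```python
def normalize(x_bits, n, width=16):
    # x is an unsigned fixed-point fraction with 'width' bits: x = x_bits / 2**width.
    # Multiplying by 2**(n-2) is n-2 left shifts; masking off everything above
    # 'width' bits drops the integer part, leaving xn = frac(2**(n-2) * x).
    return (x_bits << (n - 2)) & ((1 << width) - 1)

x_bits = int(0.8125 * 2**16)          # x = 0.8125
x3 = normalize(x_bits, 3) / 2**16     # frac(2 * 0.8125)
print(x3)                             # 0.625
```

Because the operation is just a shift with the integer part discarded, it costs no arithmetic hardware at all, which is exactly why (10) chooses a power of 2 as the scaling factor.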
As in the second sub-function s2(x), a second-order parabolic function is used as an
approximation in each interval of the function fn-1(x), as shown in (11).
sn,m(xn) = 1 + cn,m · (xn - xn²)    (11)

cn,m = 4 · ( fn-1,m( (2·(m+1) - 1) / 2^(n-1) ) - 1 )    (12)
After the approximation part the result is transformed into its desired form.
3. Hardware Implementation
For the hardware implementation two’s complement representation (Parhami, 2000) is used.
The implementation is divided into three hardware parts, preprocessing, processing, and
postprocessing as shown in Fig. 5, which was introduced by (P.T.P. Tang, 1991), (Muller,
2006).
3.1 Preprocessing
In this part the incoming operand v is normalized to prepare the input to the processing
part, according to section 2.1.
If the approximation is implemented as a block in a system the preprocessing part can be
taken into consideration in the previous blocks, which implies that the preprocessing part
can be excluded.
3.2 Processing
In the processing part the approximation of the original function is computed directly, in
either an iterative or a parallel hardware architecture.
The three equations (4), (6) and (11) have the same structure, which means that the
approximation can be implemented as an iterative architecture, as shown in Fig. 6.
The benefit of the iterative architecture is the small chip area, whereas the disadvantage is a
longer computation time.
The advantages of the parallel hardware architecture are a short critical path and fast
computation, at the price of a larger chip area. The principle of the parallel hardware
architecture for four sub-functions is shown in Fig. 7.
To increase the throughput even more, pipeline stages can be implemented in the parallel
hardware architecture.
In the sub-functions (4), (6) and (11), x² and xn² are recurring operations. Since the square
operation xn² in the parallel hardware architecture is a partial result of x², a unique squarer
has been developed. Fig. 8 describes the algorithm that performs the squaring and delivers
the partial products of xn².
The squaring algorithm for the partial products xn² can be simplified as shown in Fig. 9.
In Fig. 8 and Fig. 9, the squaring algorithm that produces the partial products xn² is shown.
The first partial product p is the square of the least significant bit of x. The second partial
product q is the square of the two least significant bits of x. The partial product r is the
square of the three least significant bits of x, and s is the square of x. The squaring
operation is performed with unsigned numbers. When analyzing the squarer in Fig. 8 and
Fig. 9, a strong resemblance to a bit-serial squarer (Ienne & Viredaz, 1994) (Pekmestzi et al.,
2001) was found. By introducing registers in the design of the bit-serial squarer, the partial
results of xn² are easily extracted. The squaring algorithm can thus be simplified to a single
addition when computing each partial product.
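The definition of the partial products p, q, r and s can be illustrated with a small Python sketch. This only models what the partial results are (squares of progressively wider least-significant slices of x), not the bit-serial hardware that produces them:

```python
def partial_squares(x, width=4):
    # Squares of the 1, 2, ..., width least significant bits of the unsigned
    # operand x, mirroring the partial results p, q, r, s that the modified
    # bit-serial squarer makes available.
    mask, out = 0, []
    for i in range(width):
        mask |= 1 << i
        out.append((x & mask) ** 2)
    return out

p, q, r, s = partial_squares(0b1011)   # x = 11
print(p, q, r, s)                      # 1 9 9 121
```

In the parallel architecture this is what allows the single squarer for x² to also feed the normalized squares xn² needed by the higher-order sub-functions.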
From (4), (6) and (11) it is found that only the coefficient values differ when
implementing different unary functions. This implies that different unary functions can be
realized in the same hardware in the processing part, just by using different sets of
coefficients.
Since the methodology calculates an approximation of the original function, the error
relative to the desired precision can be both positive and negative. In particular, if the
accuracy of the approximation falls short of the desired precision, the word length may have
to be increased compared with the word length otherwise needed to accomplish the desired
precision. If the order of the last used sub-function is n > 1, the precision can be improved by
optimizing one or more coefficients c2 in (7) or cn,m in (12). The optimization of the
coefficients will minimize the error in the last used sub-function and can thereby reduce the
word length needed to accomplish the desired accuracy. Such coefficient optimization is
performed numerically by computer simulation.
3.3 Postprocessing
The postprocessing part transforms the value to the output result z. If the approximation is
implemented as a block in a system the postprocessing part can be taken into consideration
in the following blocks, which implies that the postprocessing part can be excluded.
4.1 Preprocessing
Fig. 10. The function f(v) before normalization and the original function forg(x).
To satisfy that the values used in the processing part are in the interval 0 ≤ x < 1, the
incoming operand v is expressed as x multiplied by π/2, as shown in (13):

v = (π/2) · x    (13)

To normalize the function f(v) = sin(v), v is substituted according to (13), which gives the
original function f_org(x) in (14):

f_org(x) = sin( (π/2) · x )    (14)
In Fig. 10 the f(v) function is shown together with the original function forg(x).
4.2 Processing
For the processing part, sub-functions are developed according to the proposed
methodology. For the first sub-function s1(x), the coefficient c1 is defined according to (5),
which gives c1 = π/2 - 1. The resulting sub-function is shown in (15):

s1(x) = x + (π/2 - 1) · (x - x²)    (15)
To develop the second sub-function s2(x), the coefficient c2 is defined according to (7). The
determined value of the coefficient is shown in (17):

s2(x) = 1 + 0.400858 · (x - x²)    (17)

f2(x) = f1(x) / s2(x)    (18)
To develop the third sub-function s3(x), the second function f2(x) is divided into its two
partial functions as shown in (8). The third order of sub-functions is thereby divided into
two sub-functions, where s3,0(x3) is restricted to the interval 0 ≤ x < 0.5 and s3,1(x3) is
restricted to the interval 0.5 ≤ x < 1.0, according to (9). A normalization of x to x3 is done to
simplify the hardware implementation, as described in (10).
For each sub-function, the corresponding coefficients c3,0 and c3,1 are determined. These
coefficients are determined according to (12), by which higher-order sub-functions can also
be developed. The determined values of the coefficients are shown in (19).
s3,0(x3) = 1 - 0.0122452 · (x3 - x3²),   0 ≤ x < 0.5    (19)
f3(x) = f2(x) / s3(x)    (20)
To develop the fourth sub-function s4(x), the third function f3(x) is divided into its four
partial functions as shown in (8). The fourth order of sub-functions is thereby divided into
four sub-functions, where s4,0(x4) is restricted to the interval 0 ≤ x < 0.25, s4,1(x4) to the
interval 0.25 ≤ x < 0.5, s4,2(x4) to the interval 0.5 ≤ x < 0.75 and s4,3(x4) to the interval
0.75 ≤ x < 1.0, according to (9). A normalization of x to x4 is done to simplify the hardware
implementation, as described in (10).
For each sub-function, the corresponding coefficients c4,0, c4,1, c4,2 and c4,3 are determined.
These coefficients are determined according to (12), by which higher-order sub-functions
can also be developed. The determined values of the coefficients are shown in (21).
s4,0(x4) = 1 - 0.00223363 · (x4 - x4²),   0 ≤ x < 0.25    (21)
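The coefficient values can be reproduced by following (2) to (12) numerically for the sine example. The Python sketch below is only an illustration; note that the computed coefficients come out negative, with magnitudes matching the printed values 0.0122452 and 0.00223363 (the signs are easily lost in print, so treat the signs here as this recomputation's result):

```python
import math

# Follow the chain f1 = f_org/s1, f2 = f1/s2, f3 = f2/s3 and evaluate (12)
# at the interval midpoints (2*(m+1) - 1) / 2**(n-1).
def f_org(x):
    return math.sin(math.pi * x / 2)

c1 = math.pi / 2 - 1

def f1(x):
    return f_org(x) / (x * (1 + c1 * (1 - x)))

c2 = 4 * (f1(0.5) - 1)

def f2(x):
    return f1(x) / (1 + c2 * (x - x ** 2))

c30 = 4 * (f2(0.25) - 1)        # (12) with n = 3, m = 0, midpoint 1/4

def f3(x):                      # partial function valid on 0 <= x < 0.5
    x3 = math.modf(2 * x)[0]    # x3 = frac(2*x), see (10)
    return f2(x) / (1 + c30 * (x3 - x3 ** 2))

c40 = 4 * (f3(0.125) - 1)       # (12) with n = 4, m = 0, midpoint 1/8
print(c30, c40)  # magnitudes close to 0.0122452 and 0.00223363
```

This kind of numerical chain is also the natural starting point for the coefficient optimization discussed in section 3.2.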
No postprocessing is needed since the result from the processing part already has the right size.
4.3 Optimization
If no more sub-functions are to be developed the precision of the approximation can be
further improved by optimization of coefficients c4,0, c4,1, c4,2 and c4,3. As shown in Fig. 12
sub-function s4,3(x) in the interval 0.75 ≤ x < 1.0 has the largest relative error. When
performing an optimization of sub-function s4,3(x) in the interval 0.75 ≤ x < 1.0 it was found
that the word length in the computations could be reduced from 17 bits to 16 bits.
4.4 Architecture
In Fig. 11, the architecture of the approximation of the sine function using the proposed
methodology is shown.
The x² block in Fig. 11 is the specially designed squarer described in Fig. 8 and Fig. 9, which
delivers the partial results q, q3 and q4 used in the following blocks. In the x-q block, the
partial result q from the x² block is subtracted from x. The result r from the x-q block is then
used in the two following blocks as shown in Fig. 11. The x+(c1·r) block computes s1(x),
the 1+(c2·r) block computes s2(x), the 1+(c3·(x3-q3)) block computes s3(x) and the
1+(c4·(x4-q4)) block computes s4(x). Note that in the blocks for sub-functions s3(x) and
s4(x), the individual index m addresses the MUX that selects the coefficients in the block.
4.6 Precision
In Fig. 12 the resulting precision when using one to four sub-functions is shown. A decibel
scale is used to visualize the precision, since binary numbers and dB work well together: a
factor of 2 corresponds to 20·log10(2) ≈ 6 dB, so 6 dB corresponds to 1 bit, which makes the
result simpler to interpret. As shown in Fig. 12, the relative error decreases with the number
of sub-functions used. With 4 sub-functions the accuracy is better than 14 bits, which would
result in a latency of at least 14 adders if the CORDIC algorithm were used instead.
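The bits-to-decibels conversion used in Fig. 12 can be checked directly (a one-line illustration, not part of the chapter's flow):

```python
import math

bits = 14
relative_error = 2.0 ** -bits          # error level of a 14-bit-accurate result
db = 20 * math.log10(relative_error)   # about 6.02 dB per bit
print(round(db, 1))                    # -84.3
```

So an approximation plotted at roughly -84 dB in Fig. 12 corresponds to about 14 bits of accuracy.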
Fig. 12. Estimation of the relative error between the original function and different numbers
of sub-functions.
As shown in Fig. 12, the relative error decreases with the number of sub-functions used.
However, the delay increases with the number of sub-functions, as shown in Table 1.
5. Comparison
The most common methods used when implementing approximations of unary functions
in hardware are look-up tables, polynomials, table-based methods with polynomials and
CORDIC. Computation by table look-up is attractive since memory is much denser than
random logic in VLSI realizations. However, since the size of the look-up table grows
exponentially with increasing word length, both the table size and the execution time
become intolerable. Computation by polynomials is attractive since it is ROM-less. The
disadvantage is that it can impose large computational complexities and delays.
Algorithm                                        Range
Function         f(v) = sin(v)                   0 ≤ v < π/2
Preprocessing    x = (2/π) · v                   0 ≤ x < 1
Processing       y = sin((π/2) · x)              0 ≤ y < 1
Postprocessing   z = y                           0 ≤ z < 1
Table 2. The algorithm for the sine function.
Since cos(v) = sin(π/2 - v), for the approximation of the cosine function x is substituted
with 1 - x in the preprocessing part of the approximation for the sine function.
Algorithm                                        Range
Function         f(v) = cos(v)                   0 ≤ v < π/2
Preprocessing    x = 1 - (2/π) · v               0 ≤ x < 1
Processing       y = sin((π/2) · x)              0 ≤ y < 1
Postprocessing   z = y                           0 ≤ z < 1
Table 3. The algorithm for the cosine function.
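The cosine algorithm of Table 3 can be checked against a library cosine with a short Python sketch (illustrative only; the real design evaluates the sine processing step with the parabolic synthesis hardware, not with a library call):

```python
import math

def cos_alg(v):
    # Table 3: preprocess x = 1 - (2/pi)*v, process y = sin((pi/2)*x),
    # postprocess z = y.
    x = 1 - (2 / math.pi) * v
    y = math.sin(math.pi / 2 * x)
    return y

for v in (0.0, 0.3, 1.0, 1.5):
    assert abs(cos_alg(v) - math.cos(v)) < 1e-12
print("ok")
```

The check works because the pre- and postprocessing steps reduce the cosine exactly to the identity cos(v) = sin(π/2 - v), so the same processing hardware serves both functions.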
Algorithm                                        Range
Function         f(v) = arcsin(v)                0 ≤ v < 1/√2
Preprocessing    x = √2 · v                      0 ≤ x < 1
Processing       y = (4/π) · arcsin(x/√2)        0 ≤ y < 1
Postprocessing   z = (π/4) · y                   0 ≤ z < π/4
Table 4. The algorithm for the arcsine function.
Algorithm                                        Range
Function         f(v) = arccos(v)                0 ≤ v < 1/√2
Preprocessing    x = √2 · v                      0 ≤ x < 1
Processing       y = (4/π) · arcsin(x/√2)        0 ≤ y < 1
Postprocessing   z = (π/4) · (1 - y) + π/4       π/4 < z ≤ π/2
Table 5. The algorithm for the arccosine function.
Algorithm                                        Range
Function         f(v) = tan(v)                   0 ≤ v < π/4
Preprocessing    x = (4/π) · v                   0 ≤ x < 1
Processing       y = tan((π/4) · x)              0 ≤ y < 1
Postprocessing   z = y                           0 ≤ z < 1
Table 6. The algorithm for the tangent function.
Algorithm                                        Range
Function         f(v) = arctan(v)                0 ≤ v < 1
Preprocessing    x = v                           0 ≤ x < 1
Processing       y = (4/π) · arctan(x)           0 ≤ y < 1
Postprocessing   z = (π/4) · y                   0 ≤ z < π/4
Table 7. The algorithm for the arctangent function.
Algorithm                                        Range
Function         f(v) = log2(v)                  1 ≤ v < 2
Preprocessing    x = v - 1                       0 ≤ x < 1
Processing       y = log2(1 + x)                 0 ≤ y < 1
Postprocessing   z = y                           0 ≤ z < 1
Table 8. The algorithm for the logarithm function.
Algorithm                                        Range
Function         f(v) = 2^v                      0 ≤ v < 1
Preprocessing    x = v                           0 ≤ x < 1
Processing       y = 2^x - 1                     0 ≤ y < 1
Postprocessing   z = 1 + y                       1 ≤ z < 2
Table 9. The algorithm for the exponential function.
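The identities behind Tables 8 and 9 are easy to verify numerically. The sketch below is illustrative only; in the real design the processing steps y = log2(1 + x) and y = 2^x - 1 are the normalized functions approximated by parabolic synthesis, not library calls:

```python
import math

def log2_alg(v):
    # Table 8: x = v - 1, y = log2(1 + x), z = y.
    x = v - 1
    return math.log2(1 + x)

def exp2_alg(v):
    # Table 9: x = v, y = 2**x - 1, z = 1 + y.
    y = 2 ** v - 1
    return 1 + y

assert abs(log2_alg(1.7) - math.log2(1.7)) < 1e-12
assert abs(exp2_alg(0.3) - 2 ** 0.3) < 1e-12
print("ok")
```

Both algorithms keep the processing-part input and output in [0, 1), which is exactly the normalization that section 2.1 requires.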
Algorithm                                        Range
Function         f(v) = 1/(1 + v)                0.5 ≤ v ≤ 1
Preprocessing    x = 2 · (1 - v)                 0 ≤ x ≤ 1
Processing       y = 3x/(4 - x)                  0 ≤ y ≤ 1
Postprocessing   z = y/6 + 1/2                   1/2 ≤ z ≤ 2/3
Table 10. The algorithm for the division function.
Algorithm                                        Range
Function         f(v) = √(1 + v)                 1 ≤ v < 2
Preprocessing    x = v - 1                       0 ≤ x < 1
Processing       y = (√(2 + x) - √2)/(√3 - √2)   0 ≤ y < 1
Postprocessing   z = √2 + (√3 - √2) · y          √2 ≤ z < √3
Table 11. The algorithm for the square root function.
7. Conclusions
A novel methodology for implementing approximations of unary functions, such as
trigonometric functions, logarithmic functions, as well as square root and division
functions, in hardware has been introduced. The architecture of the processing part
automatically gives a high degree of parallelism. The methodology for developing the
approximation algorithms is founded on parabolic synthesis. Since the methodology is, in
addition, founded on operations that are simple to implement in hardware, such as
addition, shifts and multiplication, the hardware implementation is simple to perform.
Thanks to the parallelism of parabolic synthesis, one of the most important characteristics of
the resulting hardware is a short critical path and fast computation. The structure of the
methodology also assures an area-efficient hardware implementation, and the methodology
is suitable for automatic synthesis.
8. References
B. Parhami (2000), Computer Arithmetic, Oxford University Press Inc., ISBN: 0-19-512583-5,
198 Madison Avenue, New York, New York 10016, USA.
E. Hertz, P. Nilsson (2008), A Methodology for Parabolic Synthesis of Unary Functions for
Hardware Implementation, Proc. of the 2nd International Conference on Signals,
Circuits and Systems, SCS08_9.pdf, pp 1-6, ISBN-13: 978-1-4244-2628-7, Hammamet,
Tunisia, Nov. 2008, Inst. of Elec. and Elec. Eng. Computer Society, 445 Hoes Lane -
P.O.Box 1331, Piscataway, NJ 08855-1331, United States.
E. Hertz, P. Nilsson (2009), Parabolic Synthesis Methodology Implemented on the Sine
Function, Proc. of the 2009 IEEE International Symposium on Circuits and Systems, pp
253-256, ISBN: 978-1-4244-3828-0, Taipei, Taiwan, May. 2009.
E. Meijering (2002), A Chronology of Interpolation From Ancient Astronomy to Modern
Signal and Image Processing, Proceedings of the IEEE, vol. 90, no. 3, March 2002, pp.
319-342, ISSN: 00189219, Institute of Electrical and Electronics Engineers Inc.
J. E. Volder (1959), The CORDIC Trigonometric Computing Technique, IRE Transactions on
Electronic Computers, vol. EC-8, no. 3, 1959, pp. 330–334.
11

Fully Systolic FFT Architectures for Giga-sample Applications
1. Introduction
This chapter presents a technique for designing architectures that execute Fast Fourier
Transform (FFT) algorithms involving a large number of complex points and that target
real-time applications (Ersoy, 1997). Hitherto published techniques and FFT architectures
mainly include ASIC designs (Thompson, 1983; Wold and Despain, 1984; He and Torkelson,
1996; Choi et al., 2003; Uzun et al., 2005; Bouguezel et al., 2004, 2006; Jo & Sunwoo, 2005;
Chang & Nguyen, 2006; Yang et al., 2006; Lin et al., 2005; Takala & Punkka, 2006; Wang &
Li, 2007; Reisis & Vlassopoulos, 2006), which vary with respect to the level of parallelism,
the throughput rate, the latency, the hardware cost and the power consumption. The most
common ASIC architectures are the fully unfolded FFT realizations (Rabiner & Gold)
utilizing large memory arrays between their successive stages, and the latency- and
memory-efficient cascade FFT topologies (Thompson, 1983; He & Torkelson, 1996, 1998).
The cascade solutions, though, as well as the high-radix techniques, lead to complicated
designs in the case of architectures parameterized with respect to the pipelining of
computations, the FFT size and the data length.
This chapter describes a technique to design efficient, very high speed, deeply-pipelined
FFT architectures maximizing throughput and keeping control and memory organizations
simple compared to the cascade and the fully unfolded FFT architectures. Moreover, the
design is proven more efficient comparing to the previously mentioned architectures in
terms of scalability, maximum operating frequency and consequently, in terms of power
consumption, pipeline depth and data and/or twiddle bit-widths. The technique improves
the latency and the memory requirements -particularly for large input data sets- of systolic
FFT architectures by combining three (3) Radix-4 circuits to result in a 64-point FFT engine.
The efficiency of organizing 64-point FFT engines based on Radix-4 FFT engines is shown by
a 4096-complex point design. This architecture requires only two dual memory banks of
4096 words each and on a Xilinx Virtex II FPGA performs at 200 MHz to sustain a
throughput of 4096 points/20.48 us. The design implemented on a high performance 0.13
um, 1P8M CMOS (standard cell) process from UMC achieved a worst-case (0.9V, 125 C)
post-route frequency of 604.5 MHz, while consuming 4.4 Watts. It is interesting to point out
that the design exceeded the 1 GHz frequency (rate of 1 GSample/sec) for typical conditions
(1.0V, 25C).
Towards designing FFT architectures with large input data sets, we consider the 4096
complex point FFT architecture as a core. The core constitutes the basis of FFT architectures
computing transforms of 16K, 64K and 256K complex points. These architectures,
implemented in the 0.13 um CMOS process, perform at 352, 256 and 188 MHz worst-case
(0.9V, 125 C) post-route frequencies respectively. The 16K and the 64K architectures have a
four-point parallel input/output, achieving throughputs of 1.4 and 1 GSample/sec respectively.
Further, this chapter will present a technique, which allows the parallelization of the
memory accesses in hardware implementations of the FFT algorithm. This technique enables
each processor to perform a radix-b butterfly by loading the b-tuple data from b memory
banks in parallel, then by operating on the b data and finally, by storing the resulting b-
tuple in b memory banks in parallel. Hence, the speedup and the throughput increase by b.
Techniques parallelizing the FFT accesses are reported in (Johnson, 1992; Ma, 1999; Reisis &
Vlassopoulos, 2006). We describe the technique in (Reisis & Vlassopoulos, 2006), which is
developed for arbitrary radix and straightforward to implement.
The chapter is organized in three technical sections. Section 2 shows how to organize radix-4
computations to result in a radix-4^3 (equivalent to radix-64) computation and describes the
fully systolic 4096-point architecture. Section 3 describes the 16K, 64K and 256K point
architectures. Section 4 presents the access parallelization technique in the FFT architectures
and section 5 concludes the chapter.
where W_N = e^(-j·2π/N) are the twiddle factors and denote the N-th primitive root of unity
(Oppenheim, 1975). The architecture presented in this section is based on a four-
dimensional index map and on a R4 decomposition of the DFT series. The derivation of the
Radix-4^3 algorithm lies in three (3) steps of the cascade decomposition. In the framework
of these 3 steps, the linear mapping transforms into a four-dimensional index map [10,4] as
follows:
n = n1 + (N/64)·n2 + (N/16)·n3 + (N/4)·n4
k = 64·k1 + 16·k2 + 4·k3 + k4    (2)
X(64·k1 + 16·k2 + 4·k3 + k4) =
  Σ_{n1=0}^{N/64-1} Σ_{n2=0}^{3} Σ_{n3=0}^{3} Σ_{n4=0}^{3}
  x(n1 + (N/64)·n2 + (N/16)·n3 + (N/4)·n4) · W_N^(n·k)    (3)
W_N^(k·n) = W_N^([n1 + (N/64)·n2 + (N/16)·n3 + (N/4)·n4]·[64·k1 + 16·k2 + 4·k3 + k4])
          = (-j)^(n2·k2 + n3·k3 + n4·k4) · W_16^(n3·k4) · W_64^(n2·(4·k3 + k4))
            · W_N^(n1·(16·k2 + 4·k3 + k4)) · W_(N/64)^(n1·k1)    (4)
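A quick Python check (illustrative only) confirms that the four-dimensional index map in (2) enumerates every input index n in [0, N) exactly once, here for N = 4096:

```python
# Enumerate n = n1 + (N/64)*n2 + (N/16)*n3 + (N/4)*n4 over the index ranges
# of (3) and verify the map is a bijection onto [0, N).
N = 4096
seen = {n1 + (N // 64) * n2 + (N // 16) * n3 + (N // 4) * n4
        for n1 in range(N // 64)
        for n2 in range(4)
        for n3 in range(4)
        for n4 in range(4)}
print(seen == set(range(N)))   # True
```

This bijection is what guarantees that the decomposition in (3) visits each input sample exactly once.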
Applying equation 4 to equation 3 and expanding the summation over index n4 yields
X(64·k1 + 16·k2 + 4·k3 + k4) =
  Σ_{n1=0}^{N/64-1} Σ_{n2=0}^{3} Σ_{n3=0}^{3}
  B_(N/4)^(k4)(n1 + (N/64)·n2 + (N/16)·n3)
  · (-j)^(n2·k2 + n3·k3) · W_16^(n3·k4) · W_64^(n2·(4·k3 + k4))
  · W_N^(n1·(16·k2 + 4·k3 + k4)) · W_(N/64)^(n1·k1)    (5)
where B_(N/4)^(k4)(n1 + (N/64)·n2 + (N/16)·n3) denotes the first butterfly unit and, with
a = n1 + (N/64)·n2 + (N/16)·n3, can be written as

B_(N/4)^(k4)(a) = x(a) + (-j)^(k4)·x(a + N/4) + (-1)^(k4)·x(a + N/2) + j^(k4)·x(a + 3N/4)    (6)
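The butterfly in (6) is simply a 4-point DFT over samples N/4 apart. A Python sketch (an illustration, not the hardware description) checks it against the DFT definition:

```python
import cmath

def radix4_butterfly(x0, x1, x2, x3, k4):
    # B from (6): the inputs are four samples taken N/4 apart.
    return x0 + (-1j) ** k4 * x1 + (-1) ** k4 * x2 + (1j) ** k4 * x3

# The butterfly must agree with the 4-point DFT sum_n x[n]*exp(-2j*pi*n*k/4),
# since W4 = -j.
x = [1 + 2j, -0.5j, 3.0, 0.25 + 1j]
for k in range(4):
    ref = sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4))
    assert abs(radix4_butterfly(*x, k) - ref) < 1e-9
print("ok")
```

Because the coefficients are only ±1 and ±j, this stage needs no multipliers, which is the key property exploited by the radix-4 decomposition.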
Expanding equation 5 with respect to the next summation with index n3 yields
X(64·k1 + 16·k2 + 4·k3 + k4) =
  Σ_{n1=0}^{N/64-1} Σ_{n2=0}^{3}
  H_(N/16)^(k3 k4)(n1 + (N/64)·n2)
  · (-j)^(n2·k2) · W_64^(n2·(4·k3 + k4))
  · W_N^(n1·(16·k2 + 4·k3 + k4)) · W_(N/64)^(n1·k1)    (7)
where H_(N/16)^(k3 k4)(n1 + (N/64)·n2) is the secondary butterfly structure and, with
b = n1 + (N/64)·n2, can be expressed as

H_(N/16)^(k3 k4)(b) = B_(N/4)^(k4)(b)
  + (-j)^(k3) · W_16^(k4) · B_(N/4)^(k4)(b + N/16)
  + (-1)^(k3) · W_8^(k4) · B_(N/4)^(k4)(b + N/8)
  + j^(k3) · W_8^(k4) · W_16^(k4) · B_(N/4)^(k4)(b + 3N/16)    (8)
Finally, expanding the summation of equation 7 with regard to index n2 provides a set
of 64 DFTs of length N/64:

X(64·k1 + 16·k2 + 4·k3 + k4) =
  Σ_{n1=0}^{N/64-1} T_(N/64)^(k2 k3 k4)(n1)
  · W_N^(n1·(16·k2 + 4·k3 + k4)) · W_(N/64)^(n1·k1)    (9)
where T_(N/64)^(k2 k3 k4)(n1) represents the third butterfly and is expressed according to
equation 10:

T_(N/64)^(k2 k3 k4)(n1) = H_(N/16)^(k3 k4)(n1)
  + (-j)^(k2) · W_16^(k3) · W_64^(k4) · H_(N/16)^(k3 k4)(n1 + N/64)
  + (-1)^(k2) · W_8^(k3) · W_32^(k4) · H_(N/16)^(k3 k4)(n1 + N/32)
  + j^(k2) · W_64^(3·(4·k3 + k4)) · H_(N/16)^(k3 k4)(n1 + 3N/64)    (10)
Equations 5 to 10 describe a radix-64 based FFT. Further, equations 6, 8 and 10 describe the
internal structure of the radix-64 butterfly based on three radix-4 butterflies, which is
therefore called Radix-4^3 (R4^3).
The FFT architecture with R4^3 stages and an input of N complex points uses log4(N) - 1
complex multipliers. With each R4 using 3 complex adders to produce 1 result/cycle, the
architecture has 3·log4(N) complex adders. The memory size is (N/3)·log4(N).
Compared to the cascaded R2^2 (He & Torkelson, 1998) and R2^4 (Oh & Lim, 2005), the
proposed unfolded R4^3 has an equal number of multipliers (Table 1). The R4^3 has fewer
complex adders (3·log4(N)) than the cascaded designs (4·log4(N)) but uses a larger memory.
The R4^3 FFT design achieves the least number of multipliers and adders in the literature,
equal to that in (Bidet et al., 1995), and although it requires larger memory, it uses simple
control and can achieve high-frequency performance, as shown in the following subsection.
2.3 Architecture
Figure 2 depicts the architecture of the 4096-point FFT. The implementation of the 4096-
point FFT using R43 butterflies includes: first the use of a 2-dimensional index map based on
R64, second the decomposition of the 4096-point series into two sums using the R64
butterfly and finally the replacement of each R64 with a R43. Consequently, the architecture
consists of two R43 engines, two 4096-word dual bank memory modules, one 4096 point
read-only memory storing the W4096 twiddles and one complex multiplier. The control is
local to each module and a global sync signal synchronizes the modules. The throughput is
1 complex-point/clock-cycle.
The overall design allows the optimization of the architecture, either globally, or locally: The
architecture can be improved with respect to the operating frequency, or the area, or both
without alterations in the control, the scheduling of the 4K buffers, the registers and the
memories within the R43 engines. Hence, we can modify the pipelined computations and
exploit the specifics of each technology to maximize the operating frequency.
The following subsections describe the details of the R4 processor used as the basis of the
R43, the R43 architecture, the addressing scheme used for each R43 engine and discuss the
performance of the entire FFT architecture.
the opposite in order to perform the correct butterfly operations. Figure 4 depicts the
architecture of the Accumulator module. The accumulator consists of 8 registers and 3
add/sub units. The four ”A” registers are used as input registers storing every 4-tuple of
data, on which the R4 butterfly operation will be applied. The four ”B” registers are used as
an intermediate stage holding the 4-tuple (while the next 4-tuple of input data is shifted into
the ”A” registers) and loading the data in parallel to the add/sub units. The add-sub units
form an adder-tree architecture, thus avoiding the feedback loops that common
accumulators use.
The data flow of the Radix-4 engine starts with the data entering the Radix-4 engine as a
word serial input stream at a rate of one complex-point/cycle. During each period of 4
cycles four consecutive input data are shifted in the 4 input registers (Reg A). In the fifth
cycle this 4-tuple is latched in the 4 ”storage” registers (Reg B), while the next input 4-tuple
is starting its input to the ”A” registers. During the sixth cycle the 4-tuple enters in parallel
the adder-tree to produce the first of the four R4 results. During the following 3 cycles, the
adder-tree uses as input the 4-tuple stored in the ”B” registers and it produces the remaining
3 of the four R4 results, one result/cycle, by following the Radix-4 computational flow.
Therefore, the total latency of each accumulator is 5+2k cycles, where k is the latency of each
add/sub unit, k = 5 in the implementation.
A control unit synchronizes the operations of the ’Swap’ module and the accumulator. A 2-
bit counter generates the necessary signals, controlling the add/sub units in order to
perform the correct additions and/or subtractions according to the Radix-4 schema.
calculations. The bit width of each part is 16 and 20 bits at the output of the first and second
R4 stages respectively. At the output of the final (third) R4 stage, the real and imaginary
parts are truncated to provide an output of 14 bits each. The twiddle factors are 18 bits real
and 18 bits imaginary. By applying the above mentioned data format, the implementation of
the 4096 point FFT with fixed point calculations is almost as accurate as a floating point
implementation given the dynamic range of 84 dB. The fixed point FFT’s outputs are within
a margin of +/- (1) compared to the output of a floating point FFT whose values are
normalized (divided by N = 4096) and truncated to 14 bit integers.
Fully Systolic FFT Architectures for Giga-sample Applications 229
power consumption. The use of typical process parameters (1V, 25C) results in exceeding
the 1GHz post-route frequency mark (data rate 1 GSample/sec), making the proposed
architecture the fastest standard-cell 4096 complex point FFT implementation reported in
the literature. In addition, a second 4096 complex point engine has been implemented in the
same standard cell library, this time using deeper RAMs (×16 configuration in Table 2).
Power consumption in this case was substantially reduced to 722.8 mW from the 4.4 W of
the ×64 configuration. Figure 6 depicts the VLSI layout of the Radix − 43 engine, while figure
7 depicts the final layouts of the 4K FFT, of both the ×16 and the ×64 implemented.
The 4K FFT has been designed for an experiment involving a frequency analyzer for a
bandwidth of 200 MHz. The band has been divided into four sub-bands and each sub-band
has been accommodated by a 4K FFT architecture. The FPGAs perform at 102.5 MHz on a
18-layer board which has a compact-PCI interface performing at 51.25 MHz. The task is to
perform FFT and use a ”Threshold” filter to identify the frequencies of high power within
each sub-band. The expected output set includes at most 10 frequencies per sub-band, per
FFT. After the prototype completion the 4K FFT has been delivered as an IP core with
specifications achieving 5 times the performance of the FPGA prototype. The 16K, 64K and
256K have been realized as IP cores for research purposes.
The proposed architecture is evaluated by comparing its characteristics to relevant
published results. Tables 3 and 4 present a comparison of the proposed FFT architecture
with related results, which we distinguish into two categories.
In the first category we compare the hitherto published architectures executing FFT
algorithms up to 128 points to the features of the R−43 playing the role of a complete 64
complex point FFT architecture. This comparison is shown in Table 3. The second category
includes the architectures solving FFTs of size 1024 to 4096 points presented in Table 4. The
comparison includes FFT size, word length, algorithm, FFT architecture, technology process,
voltage, area, power, maximum operating frequency and sustained throughput in both
MSamples/s and Gbits/s. Since the FFT designs vary with respect to FFT size, algorithm
and architecture we have also included the Normalized Area (Bidet et al., 1995), in order to
evaluate the silicon cost. Moreover, we compare the efficiency (performance/cost) of the
proposed design to the related results by using the fraction Sustained
Throughput/Normalized Area.
The proposed FFT architecture achieves the highest sustained throughput compared to all
the other designs. Furthermore, the efficiency expressed as the fraction Sustained
Throughput/Normalized Area of the proposed design is the highest considering both small
input size (Table 3) and large input size FFT architectures (Table 4). Note that, in both
categories the architectures R − 43 and the proposed 4K occupy more area than their
competitors respectively. This is a penalty though in achieving the highest throughput
possible.
Also note that (Lin et al., 2005) performs transformations of only 128 points, and there is no
provision taken so that it will constitute a core for scalable architectures with respect to FFT
size. The 64-point Fourier transform chip, presented in (Maharatna et al, 2004) operates at 20
MHz with latency 3.85 us, comparing to the R43 processor performing a 64 complex point
FFT while operating at a 200 MHz clock frequency with latency 0.32 us.
The architecture described in (Lenart & Owall, 2003) is a 2K complex point FFT processor
which achieves maximum operating frequency of 76MHz and sustains a throughput of 2048
points/26us. The design presented in (Cortes et al., 2006) implements a 2K/4K/8K
multimode FFT and achieves 9 MHz clock frequency, at a computation time of up to 450us.
232 VLSI
Characteristics                    | (Lenart & Owall, 2003) | (Cortes et al., 2006) | (Swartzlander, 2007) | Proposed Design
FFT size                           | 2K    | 2K/4K/8K | 4K             | 4K
Word Length (bit)                  | 10    | 16       | 16             | 14
Algorithm                          | R-2^2 | R-2^2    | R-2            | R-4^3
FFT Architecture                   | SDF   | SDF      | Split Systolic | Unrolled
Process (um)                       | 0.35  | 0.35     | 0.25           | 0.13
Voltage (V)                        | N/A   | 3.3      | N/A            | 0.9
Area (mm2)                         | 6     | 18.7     | N/A            | 13.48
Normalized Area (mm2)              | 1.58  | 4.9      | N/A            | 25.84
Power (mW)                         | N/A   | 114.65   | 260            | 4414.1
Fmax (MHz)                         | 76    | 9.1      | 100            | 604.5
Sustained Throughput (MSamples/s)  | 75.8  | 18.2     | 200            | 1052
Sustained Throughput (Gb/s)        | 1.5   | 0.582    | 6.4            | 29.5
Throughput (Gb/s) / Norm. Area     | 0.94  | 0.11     | N/A            | 1.14
Table 4. Comparison table with FFT designs from 1K-4K
Finally, a single ASIC chip, systolic FFT processor, developed by the Mayo Foundation
computes 4096-point FFTs sustaining a throughput of 200 Ms/s (Swartzlander, 2007).
Considering FPGA implementations, the corresponding XILINX designs (www.xilinx.com)
achieve equal maximum operating frequency of 200MHz, but occupy considerably larger
chip area than the R43 approach. Also note that, ALTERA designs (www.altera.com) utilize
FFT cores with FFT length varying from 64 points up to 4K points. They demonstrate a
maximum operating frequency of 300 MHz. Among the ALTERA FFT designs we compare
the 64 point FFT at 300 MHz to the R43 performance, which, realized on the same ALTERA
FPGA (ALTERA STRATIX II EP2S30F484C3), achieves an operating frequency of 350 MHz.
Setting n = n_1 + (N/4096) n_2 and k = 4096 k_1 + k_2, with N = 16384:

kn = 4096 n_1 k_1 + n_1 k_2 + N n_2 k_1 + (N/4096) n_2 k_2  ⇒  W_N^{kn} = W_4^{n_1 k_1} · W_{4·4096}^{n_1 k_2} · W_{4096}^{n_2 k_2}   (12)
According to equation 12, the 4K points FFT can be extended to 16K points. This is
accomplished by first performing four 4K point FFT transforms. Next, the data is multiplied
by the twiddle factors that correspond to a 16K points FFT and finally, a radix-4 stage
completes the 16K point FFT computation.
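To make the recombination concrete, the following sketch (our own illustration, not the chapter's VLSI implementation; numpy's FFT stands in for the four 4K engines) computes a 16K-point FFT exactly as described: four 4K FFTs on the decimated input streams, twiddle multiplication, and a final radix-4 stage, verified against a direct 16K FFT.

```python
import numpy as np

def fft16k_from_4k(x):
    """16384-point FFT built from four 4096-point FFTs, twiddle
    multiplication, and a final radix-4 stage (decimation in time)."""
    N, M = 16384, 4096
    # Step 1: four 4K FFTs on the decimated subsequences x[n1 + 4*n2]
    sub = np.array([np.fft.fft(x[n1::4]) for n1 in range(4)])   # shape (4, 4096)
    # Step 2: multiply by the 16K twiddle factors W_N^(n1*k2)
    k2 = np.arange(M)
    sub = sub * np.exp(-2j * np.pi * np.outer(np.arange(4), k2) / N)
    # Step 3: radix-4 stage combining the four partial results
    X = np.empty(N, dtype=complex)
    for k1 in range(4):
        w4 = np.exp(-2j * np.pi * np.arange(4) * k1 / 4)
        X[M * k1 + k2] = (w4[:, None] * sub).sum(axis=0)
    return X

rng = np.random.default_rng(0)
x = rng.standard_normal(16384) + 1j * rng.standard_normal(16384)
assert np.allclose(fft16k_from_4k(x), np.fft.fft(x))
```

The output index mapping X[4096·k1 + k2] follows directly from the index substitution used in the decomposition.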
The architecture is presented in figure 8. There are four 4K FFT blocks operating in parallel.
The architecture has a four complex point input per cycle. The 16K FFT architecture was
implemented in a high performance, 0.13um, 1Poly-8Copper layer standard cell technology
from TSMC. A flat back-end flow was used in which the design was first synthesized to
gates using Synopsys Design Compiler and optimized for a frequency of 300 MHz. The
uniquified netlists was then read into Cadence SoC Encounter where floorplanning, power
planning, clock-tree-synthesis, placement, routing and IPO routing took place. Finally, the
design was brought back into the top-level for final top-level placement, routing and timing
analysis. Table 5 shows the Implementation Routing Results and figure 8 depicts the VLSI
Cells for the 16K FFT. The throughput is 1.4 Gs/sec (39.2 Gbits/sec).
The 16K architecture implemented on the Xilinx Virtex 5 (-2) achieves operating frequency
of 250MHz, occupies 12264 slices and sustains a throughput of 1Gs/s (28 Gbits/sec).
65535 416384 - 1
X[k ] ∑ nk
x[ n ]W65536 ∑x[n]W nk
4 16384 (13)
n 0 n 0
N
Setting n n 1 n 2 and k k 1 16384 k 2 :
16384
N n 1k 1
k n 16384 n 1k 1 n 1k 2 Nn 2 k 1 k 2 W4kn
16384 W4 W4n16384
1k 2 n2k 2
W16384
16384
Therefore, the transform becomes
3 16383
n 1k 1 n 2 k 2 n 1k 2
X[k ] ∑ W4 ∑
n 0
x[ n ]W16384 W416384
(14)
n 1 0 2
According to equation 14, the 16K points FFT can be extended to 64K points. This is
accomplished by first performing a 16K point FFT transform. Next, the data is multiplied by
the twiddle factors that correspond to a 64K points FFT and finally, a R4 stage completes the
64K point FFT computation, as shown in figure 10. The 64K FFT has been VLSI (and FPGA)
implemented by using the 16K parallel/parallel computation with an R4 stage with four
parallel inputs and outputs. The architecture has a post-routing frequency of 256 MHz with a
throughput of 1 Gs/sec (28 Gbits/sec). Figure 11 depicts the VLSI Cell for the 64K FFT
design. The 64K has been implemented on the Xilinx Virtex 5 (-2), achieving an operating
frequency of 125 MHz and using 13461 slices. The architecture has four parallel data inputs
and outputs and sustains a throughput of 500 Ms/s (14 Gbits/sec).
Fig. 10. Block diagram of the 64K complex point FFT Architecture
Fully Systolic FFT Architectures for Giga-sample Applications 237
In the case of the very large 256K FFT, tool capacity mandated the use of a hierarchical flow.
Following the same front-end synthesis process (again optimized for 300 MHz), the
optimized netlist was read into Cadence SoC Encounter where partitioning was first
performed. This process created six instances of the 256K memory block (for a total of 1.5
MB of on-chip SRAM) which were individually placed and routed. The same process was
performed for the R43 engines and the twiddle ROM block. Note that completing the entire
FFT computation requires a 3-frame latency plus the computation latency within the three
R43 engines (360 cycles), for a total of 786792 cycles, which is approximately 4.1 msec. The
256K FFT architecture has a post-routing frequency of 188 MHz. Finally, the design was brought back into the top-level for
final top-level placement, routing and timing analysis. Figure 13 depicts the VLSI Cell for
the 256K FFT design and Table 6 shows the VLSI Implementation Routing Results of the 64K
and 256K FFT designs.
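As a quick sanity check on these figures (a sketch; we assume a frame is counted as one cycle per point, which makes the quoted cycle counts line up exactly):

```python
# Latency sanity check for the 256K FFT, using figures quoted in the text.
N = 256 * 1024                 # 262144 complex points per frame
frame_cycles = 3 * N           # 3-frame latency, assuming one cycle per point
r43_cycles = 360               # computation latency within the three R43 engines
total_cycles = frame_cycles + r43_cycles
print(total_cycles)            # 786792 cycles, as quoted

f_clk = 188e6                  # post-routing clock frequency (Hz)
latency_ms = 1e3 * total_cycles / f_clk
print(round(latency_ms, 2))    # roughly the 4.1 msec quoted in the text
```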
The memory banks are arranged according to the scheme described in figure 14. Since we
are using a radix-b processor, the DFT decomposition consists of log_b N stages. If we set
p = log_b N, then we consider a Pb[b→b] processor realizing each stage of the
algorithm. The stages are indexed from 0 to log_b N − 1. Between two consecutive stages i
and i+1 there is a memory of size 2N divided into two sets of size N each. We will
denote the memory between stage i and stage i+1 as memory with index i. The memories
among the FFT stages are indexed from 0 to log_b N − 2. There are memories indexed as −1 and
log_b N − 1 serving as input and output memories respectively. The memory of each set i is
divided into b memory banks with indices 0 to b−1.
At each clock cycle, the Pb[b→b] processor of stage i loads from its ''read'' memory (memory
i−1) b elements that form a transform b-tuple. These are the elements whose address is of
the form [a_{p−1}, ..., a_{i+1}, l, a_{i−1}, ..., a_0], where the digits [a_0, ..., a_{i−1}, a_{i+1}, ..., a_{p−1}] are the same for
all elements in the b-tuple and l ranges from 0 to b−1. These elements will be loaded in
parallel if they are stored in distinct memory banks. During the next ((i + 1)th) stage of the
algorithm, these b elements have the same a_{i+1} address digit and hence they belong to
different transformation b-tuples. These elements must be stored in the processor’s write
memory so that they can be read in parallel from the processor at stage i + 1.
Fig. 14. Generic arrangement of a Pb[b→b] processor and its write memories
The straightforward approach is to distribute the N elements to the b memory banks using
their ith address digit. Then the processor at stage i can load them in parallel. However, if we
try to store the same b elements of each b-tuple to the memory banks at stage i+1 according
to the (i+1)th digit of their address, we will notice that all the elements in each b-tuple must
be stored in the same memory bank. Since we cannot store more than one element in a
memory bank at each clock cycle, this situation constitutes a “conflict”.
To illustrate the situation where a “conflict” occurs, consider a 16 point transform that is
based on P4[ 4→4 ] processors. The address of each element in the input stream is [a 1a 0 ] ,
where a i = 0 ,1,2 ,3. Figure 15 shows the first stage of the example. The first 4-tuple to be
transformed consists of the elements [(0, 0) (0, 1) (0, 2) (0, 3)]. Note that these elements can
be read in parallel from the set of memory banks of stage i − 1, but must be stored in the
same memory bank of memory i.
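The conflict is easy to reproduce numerically. A minimal sketch of the 16-point, b = 4 example (function names are ours, for illustration):

```python
b, p = 4, 2          # radix 4, 16-point transform, addresses [a1 a0]

def digit(n, i):
    """i-th radix-b digit of address n."""
    return (n // b**i) % b

def bank_by_digit(n, i):
    """Straightforward banking: element n goes to bank a_i."""
    return digit(n, i)

# First 4-tuple of stage 0: elements with (a1, a0) = (0,0), (0,1), (0,2), (0,3)
tuple0 = [0, 1, 2, 3]

# Reading for stage 0 (banked by digit 0): all four banks are distinct.
read_banks = [bank_by_digit(n, 0) for n in tuple0]
assert sorted(read_banks) == [0, 1, 2, 3]

# Storing the same elements for stage 1 (banked by digit 1): all collide.
write_banks = [bank_by_digit(n, 1) for n in tuple0]
assert write_banks == [0, 0, 0, 0]   # a "conflict": one bank, one write port
```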
The first technique for parallelizing the memory accesses in FFT architectures (Johnson,
1992) describes a hardware architecture designed for in-place computation of the FFT
algorithm. This technique uses r (r being the radix) banks and permutes the output data of
each processor, to be written in the same memory locations within each bank as the input
data have been read. (Ma, 1999) describes a technique for radix-2 based FFT by using queues
at the input of each memory to rearrange the data before they are stored. Johnson’s
technique can be extended to the more general case of a pipelined FFT implementation at
the cost of complex addressing hardware implementation. (Thomas & Yelick, 1999) present
a technique that is used in vector processors. This technique permutes the input data prior
to the radix calculations in order to maximize the efficiency of the algorithm. These
permutations are performed within the vector registers (In-register transpose). This
functionality has been implemented by extending the instruction set of the vector processor
with two instructions that shift the data within a register, so that each permutation is
performed by using a register to register copy and a shift instruction.
All the FFT techniques (Johnson, 1992; Ma, 1999), with b parallel memory accesses, involve
at each stage b banks for reading and b banks for writing the N FFT data with bank size
N/b. The in-place case uses the same b banks (each bank of size N/b) for both read and
write. To compare the hardware complexity of the previously published results, we first
consider designs that are based on queues, such as the one described in (Ma, 1999).
Fig. 15. First stage of a 16-point transform using P4[ 4→4 ] processors
Such designs require O(b²) 2w-bit-wide registers, where w is the word length, arranged in
b queues with each queue having b − 1 registers; b is the processor radix or, in the general
case, the number of parallel read/write accesses. Each set of b−1 registers is arranged so that
b−1 elements are loaded in parallel while one element is written into the memory bank. The
other b−1 elements are shifted, one element per memory access cycle, into each memory
bank. The technique in (Johnson, 1992) uses fixed hardware to first perform a
permutation on the input data; this hardware can then apply a permutation at
each FFT stage. More specifically, the approach of (Johnson, 1992) gives only a subset of the
possible solutions. This subset of solutions considers only in-place implementations (ψ = 1
and memory size N). The circuit in (Johnson, 1992) is optimized for implementing these
solutions with respect to the VLSI area. The address generation is based on a circuit
involving a tree of modulo-b adders with O( b 2 ) exclusive-OR gates.
In this section we describe a technique (Reisis & Vlassopoulos, 2006) that can be used both
for pipelined architectures and in-place implementations (memory size equal to N). We
follow a different approach than (Johnson, 1992), leading to a different proof and providing a
general solution which includes that of (Johnson, 1992). Moreover, the presented technique
results in a simpler and improved hardware implementation compared to (Johnson, 1992).
The presented solution shows improved performance with respect to latency and
hardware cost compared to other solutions (Ma, 1999; Suter & Stevens, 1998; He &
Torkelson; Thomas & Yelick, 1999), while it provides the same throughput as these. Our
technique is based on memory address permutations and can be realized using look-up
tables, with each stage table occupying O(b²) bits. The presented approach results in a set of
permutations which can be applied in the design of pipelined FFT architectures with parallel
memory accesses per stage. A subset of these permutations accommodates the in-place
implementations.
Proof of Lemma 1: Each set S′_l contains the data with indices [a_{p−1} ... a_{i+2} f_{a_{i+1}}(l) l a_{i−1} ... a_0],
where a_{i+1} = 0, ..., b − 1 and l is fixed. To prove (1), assume that there exists a set that contains
more than N/b elements. Then, there is at least one element with index
[a_{p−1} ... a_{i+2} f_{a_{i+1}}(l′) l″ a_{i−1} ... a_0], where l′ is different from l″. Since such an element does not
belong to the set S′_{l′}, we conclude that each set cannot contain more than N/b elements.
Similarly, to prove (2), assume that two elements of the sets S_q and S_r, respectively, whose
indices differ in the ith digit, are distributed into the same set S′_s. Then, for these elements,
[a_{p−1} ... a_{i+2} f_{a_{i+1}}(q) l a_{i−1} ... a_0] = [a_{p−1} ... a_{i+2} f_{a_{i+1}}(r) l a_{i−1} ... a_0] ⇒ f_{a_{i+1}}(q) = f_{a_{i+1}}(r). Since the
f_j are invertible, f_{a_{i+1}}^{−1}(f_{a_{i+1}}(q)) = f_{a_{i+1}}^{−1}(f_{a_{i+1}}(r)) ⇒ q = r, which contradicts our initial assumption
that q ≠ r. The following Lemma shows that these functions exist.
Lemma 2: There exists a set of functions f_j : {0, ..., b − 1} → {0, ..., b − 1}, j = 0, ..., b − 1 and b ≥ 2,
such that the conditions of Lemma 1 are satisfied.
Proof of Lemma 2: Let M = {0, . . . , b − 1}. Further, let C(M) denote the set of cyclic
permutations of length b on the set M. These permutations are of the form

f_p: (0, 1, ..., b − 1) ↦ (p mod b, (1 + p) mod b, ..., (b − 1 + p) mod b),  i.e.  f_p(x) = (x + p) mod b,

for p = 0, . . . , b − 1. Now, these permutations are invertible and there are exactly b such
permutations. To prove that

f_i(x) ≠ f_j(x)   (15)

for i ≠ j and i, j = 0, . . . , b − 1, it suffices to show that (x + i) mod b ≠ (x + j) mod b. This is
true because (x + i) ≡ (x + j) (mod b) would imply i ≡ j (mod b), and hence i = j, since
0 ≤ i, j ≤ b − 1. Therefore equation 15 holds.
Theorem 1: Let N = b p . Assume that we have a Pb[ b→b ] processor. Then, a radix-b based
FFT having log b N stages can use b memory banks at each stage, such that all the read and
write operations are performed in parallel.
Proof: Let [a_{p−1} ... a_0], 0 ≤ a_i ≤ b − 1, be the address of each element according to its initial
position within the input data set of the algorithm. The ith stage of the algorithm has
arranged the N elements in b memory banks according to the ith digit of their address,
where a_i = const and a_{i+1} = 0, . . . , b − 1, since this permutation yields a b-tuple whose
elements differ only in the (i + 1)th digit. To prove that these elements are stored in b distinct
memory banks, assume that this is not the case, i.e. there exist two distinct
elements, q_1 = [a_{p−1} ... a_{i+2} a_i ... a_0][f_{a_{i+1}}(a_i)] and q_2 = [a′_{p−1} ... a′_{i+2} a′_i ... a′_0][f_{a′_{i+1}}(a′_i)], that are
in the same b-tuple and at the same time are stored in the same memory bank. Since q_1 and
q_2 are in the same b-tuple, a_j = a′_j for j ≠ i + 1. Further, f_{a′_{i+1}}(a_i) = f_{a_{i+1}}(a_i),
which is true only when a′_{i+1} = a_{i+1}. Therefore, the elements q_1 and q_2 coincide.
To conclude the proof we show that the forward and inverse permutations at all stages
result in a correct FFT algorithm. We proceed by induction on the number of stages. During
the first stage of the FFT the data are written in the input memory banks (memory −1)
according to [a_{p−1} ... a_1][a_0]. The outputs of the butterfly are written to the first intermediate
memory (memory 0) using the permutation [a_{p−1} ... a_2 a_0][f_{a_1}(a_0)], a_0 = 0, ..., b − 1. We assume
that the functions f_j are identical for all stages, although this need not be the case. The only
actual restriction is to use a permutation P_i for writing the data at stage i and its inverse
permutation P_i^{−1} for retrieving the data at stage i+1. Applying the inverse permutation we
obtain [a_{p−1} ... a_2 a_0][f_{a_1}^{−1}(a_0)], with a_0 = constant and a_1 = 0, . . . , b − 1. The set {f_{a_1}^{−1}(a_0)}
consists of b distinct elements whose addresses differ only in the digit a_1, and which
therefore constitute a valid transform b-tuple. Further, assume that in the ith stage the
elements are stored according to equation 17. Using the same reasoning as above, we can see
that the inverse transform yields a valid b-tuple and the proof is complete. Note that the
write operation of the elements in the first set of banks (memory 0) does not require any
permutation on the input data set. The elements are written to the bank corresponding to
their 0th (zero) address digit, in radix-b notation. Similarly, during the final step of the
algorithm, the data can be written in the same addresses and banks as those from which they
have been read, since no computation follows.
The current section has shown the technique and the properties of the permutations
required for the parallel access of each b-tuple at each FFT stage. An engineer can realize the
technique by choosing among the straightforward solutions, e.g. the cyclic permutations. An
interesting research topic is to identify those permutations which can be realized with
minimal interconnection and address generation circuits and thus lower the VLSI cost.
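The cyclic-permutation solution is straightforward to verify in software. The sketch below (helper names are ours) assigns the element with address digits [a_{p-1} ... a_0] in memory i to bank (a_i + a_{i+1}) mod b, i.e. f_{a_{i+1}}(a_i) using the cyclic permutations of Lemma 2, and checks that every write tuple of stage i and every read tuple of stage i+1 lands in b distinct banks:

```python
b, p = 4, 3               # radix 4, N = 64 points
N = b ** p

def digit(n, i):
    """i-th radix-b digit of address n."""
    return (n // b**i) % b

def bank(n, i):
    """Bank of element n in memory i: f_{a_(i+1)}(a_i) = (a_i + a_(i+1)) mod b,
    using the cyclic permutations of Lemma 2."""
    return (digit(n, i) + digit(n, i + 1)) % b

# For each intermediate memory i (between stage i and stage i+1):
#  - stage i writes b-tuples whose addresses differ only in digit i;
#  - stage i+1 reads b-tuples whose addresses differ only in digit i+1.
for i in range(p - 1):
    for base in range(N):
        write_banks = {bank(base - digit(base, i) * b**i + l * b**i, i)
                       for l in range(b)}
        read_banks = {bank(base - digit(base, i + 1) * b**(i + 1) + l * b**(i + 1), i)
                      for l in range(b)}
        assert len(write_banks) == b and len(read_banks) == b   # conflict-free
print("all accesses conflict-free")
```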
5. Conclusion
The present chapter has shown a technique to design FFT architectures for real-time
applications which involve inputs with a large number of complex points. The technique is
based on combining three consecutive R4 stages to realize a R64 computation. The resulting R43 as
well as the systolic architectures, which utilize a R43 as a stage for executing 4K, 16K, 64K
and 256K point FFTs, have been shown to provide higher throughput compared to hitherto
published architectures solving the corresponding transformations. Moreover, the R43 and
4K FFT architectures achieve the highest ratio of throughput to normalized area.
Furthermore, the chapter has proven a technique for parallelizing the memory access for
each butterfly radix-b computation so that the throughput can be improved further by a
factor of b.
6. References
A. Cortes, I. Velez, I. Zalbide, A. Irizar, J.F. Sevillano, "An FFT Core for DVB-T/DVB-H
Receivers," ICECS'06, page(s): 102-105, December 2006.
A. Oppenheim, R. Schafer Digital Signal Processing. Prentice Hall, 1975.
B. G. Jo and M. H. Sunwoo, ”New Continuous-Flow Mixed-Radix (CFMR) FFT Processor
Using Novel In-Place Strategy,” IEEE Trans. Circuits Syst. I, vol.52, no.5, May 2005.
B. Suter and K. S. Stevens ”A Low Power, High Performance approach for Time-Frequency
/ Time-Scale Computations,” Proceedings SPIE98 Conference on Advanced Signal
Processing Algorithms, Architectures and Implementations VIII. Vol. 3461, pp. 86-
90, July 1998.
C. D. Thompson, “Fourier Transforms in VLSI”, IEEE Transactions on Computers, Vol. 32,
1047 - 1057, 1983
D. Harper, “Block , Multistride Vector, and FFT Accesses in Parallel Memory Systems”, IEEE
Trans. on Parallel and Distributed Systems, Vol. 2, No. 1, pp. 43 - 51, January 1991
D. Reisis, N.Vlassopoulos, ”Address Generation Techniques for Conflict Free Parallel
Memory Accessing in FFT Architectures” ICECS, pp.1188-1191, December 2006.
E. Bidet, D. Castelain, C. Joanblanq and P. Stenn “A fast single-chip implementation of 8192
complex point FFT” IEEE Journ. of SSC, 30(3):300-305, Mar. 1995.
E. H. Wold and A. M. Despain ”Pipeline and Parallel FFT Processors for VLSI
Implementations,” IEEE Transactions on Computers, vol. C-33, 1984.
Earl E. Swartzlander, Jr, "Systolic FFT Processors: A Personal Perspective," Journal of VLSI
Signal Processing, June 2007.
I. S. Uzun, A. Amira and A. Bouridane, ”FPGA implementations of Fast Fourier Transforms
for real-time signal and image processing,” IEEE Vision, Image and Signal
Processing, 2005.
J. Lee, J. Lee, M. H. Sunwoo, S. Moh and S. Oh ”A DSP Architecture for High-Speed FFT in
OFDM Systems,” ETRI Journal, 2002.
J. Takala and K. Punkka, "Scalable FFT Processors and Pipelined Butterfly Units," Journal of
VLSI Signal Processing 43, 113-123, 2006.
J. Y. OH and M. S. Lim, ”New Radix-2 to the 4th Power Pipeline FFT Processor,” IEICE
Trans. Electron., VOL. E88-C, NO. 8, August 2005.
J.W. Cooley and J.W. Tukey, “An algorithm for the machine computation of complex
Fourier series”, Mathematics of Computation, 1965
K. Babionitakis, K. Manolopoulos, K. Nakos, N. Vlassopoulos, D. Reisis, V. Chouliaras, “A
High Performance VLSI FFT Architecture”, 13th IEEE International Conference on
Electronics, Circuits and Systems, pp. 810-813, December 2006
K. Maharatna, E. Grass, and U. Jagdhold, ”A 64-Point Fourier Transform Chip for High-
Speed Wireless LAN Applications Using OFDM,” IEEE Journal of Solid State
Circuits, VOL. 39, NO. 3, March 2004.
K. Manolopoulos, K. Nakos, D. Reisis, N.Vlassopoulos, V.A. Chouliaras ”High Performance
16K, 64K, 256K complex points VLSI Systolic FFT Architecture” ICECS, pp. 146-149,
December 2007.
L. G. Johnson, “Conflict Free Memory Addressing for Dedicated FFT Hardware”, IEEE
Transactions on Circuits and Systems - II: Analog and Digital Signal Processing,
Vol. 39, No. 5, May 1992
246 VLSI
L. R. Rabiner and B. Gold ”Theory and Application of Digital Signal Processing,” Prentice-
Hall.
L. Yang, K. Zhang, H. Liu, J. Huang and S. Huang ”An Efficient Locally Pipelined FFT
Processor,” IEEE Trans. Circuits Syst. II, vol. 53, no. 7, July 2006.
N. Hu, O. Ersoy, “Fast Computation of Real Discrete Fourier Transform for Any Number of
Data Points”, IEEE Transactions on Circuits and Systems, Vol. 38, No. 11, pp. 1280 -
1292, November 1991
O.K . Ersoy, Fourier-Related Transforms, Fast Algorithms and Applications. Englewood
Cliffs, NJ:Prentice Hall, 1997.
R. Thomas and K. Yelick, “Efficient FFTs on IRAM”, Proceedings of the 1st Workshop on
Media Processors and DSPs, 1999
S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, ”A New Radix-2/8 FFT Algorithm for
Length−q × 2m DFTs,” IEEE Trans. Circuits Syst. I, vol. 51, no. 9, September 2004.
S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, ”New Radix-(2×2×2)/(4×4×4) and Radix-
(2×2×2)/(8×8×8) DIF FFT Algorithms for 3-D DFT,” IEEE Trans. Circuits Syst. I,
vol. 53, no. 2, February 2006.
S. Choi, G. Govindu, J. W. Jang, V. K. Prasanna ”Energy-Efficient and Parameterized
Designs of Fast Fourier Transforms on FPGAs,” The 28th International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), April 2003.
S. He and M. Torkelson ”A New Approach to Pipeline FFT Processor,” Proceedings of the
IPPS, 1996.
S. He and M. Torkelson ”Design and Implementation of a 1024-point Pipeline FFT
Processor,” IEEE 1998 Custom Integrated Circuits.
S.S. Wang and C.S. Li, "An Area-Efficient Design of Variable-Length Fast Fourier Transform
Processor," Journal of VLSI Signal Processing, March 2007.
T. Lenart and V. Owall, ”A 2048 Complex Point FFT Processor Using a Novel Data Scaling
Approach,” IEEE ISCAS 2003.
W. H. Chang, T. Nguyen, ”An OFDM-Specified Lossless FFT Architecture,” IEEE Trans.
Circuits Syst. I, vol. 53, no. 6, June 2006.
www.altera.com
www.xilinx.com
Y. Ma, “An Effective Memory Addressing Scheme for FFT Processors”, IEEE Transactions
on Signal Processing, Vol. 47, No. 3, 907 - 911, May 1999
Y. Ma, L. Wanhammar, "A Hardware Efficient Control of Memory Addressing for High-
Performance FFT Processors", IEEE Transactions on Signal Processing, Vol. 48, No.
3, 917 - 921, March 2000.
Y.N. Lin, H.Y. Liu, and C.Y. Lee, "A 1-GS/s FFT/IFFT Processor for UWB Applications,"
IEEE Journ. of SSC, vol. 40, Issue 8, Aug. 2005.
Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) IIR Space-time Filters 247
12
1. Introduction
Plane-waves are far-field solutions to (1) the vector wave equation, for the case of
electromagnetic waves, (2) the scalar wave equation, for the case of longitudinal pressure
waves in seismic, acoustic, and ultrasonic systems, as well as to (3) linear surface waves, such
as those created by dropping a pebble into the still waters of a pond. Far-field beamforming
refers to the highly-selective directional enhancement of propagating spatio-temporal plane-
waves based on their directions-of-arrival (DOA).
Far-field broadband beamforming for smart antenna array applications is currently receiving
much attention, mainly due to the continuously increasing availability of digital
programmable logic and custom silicon fabrication technologies that are gradually enabling
the typically high levels of real-time computational throughput necessitated by such DSP-
based broadband smart antenna arrays.
α_1 x + α_2 y + α_3 z − ct = σ,  α_k ∈ ℝ, k = 1, 2, 3,  and  Σ_{k=1}^{3} α_k^2 = 1   (1)

w(x, y, z, ct) = w_PW(α_1 x + α_2 y + α_3 z − ct),  Σ_{k=1}^{3} α_k^2 = 1   (2)
and therefore have the property that they are constant-valued in each of the hyper-planes
(1): that is, for each σ = const. Equivalently, for each value of σ, w_PW(σ) is a
corresponding 4D iso-surface in (x, y, z, ct). In Fig. 1, we show the 4D plane wave
w_PW(α_1 x + α_2 y + α_3 z − ct) in the 3D spatial domain (x, y, z) ∈ ℝ^3 as an iso-plane which, by
simple 3D geometry, is at perpendicular distance ct from the origin, as shown. In 3D, we may
therefore visualize the 4D space-time plane wave of equation (2) as an infinite set of such
iso-planes w(σ), each of which is propagating in (x, y, z) over time t with speed c in a
direction normal to the iso-planes. Depending on the 1D spectral properties of the c-scaled
temporal signal w_PW(ct), the plane wave might be temporally-narrowband or temporally-
broadband.
Note that, for the case of the ideal plane wave, the region of support (ROS) of equation (2) in
(x, y, z, ct) ∈ ℝ^4 extends, in general, to infinity in at least some directions in ℝ^4. Equation
(2) represents either the electric or magnetic field of the plane wave in 4D space-time. In this
chapter, we are only concerned with the values of the 4D plane wave signal as received on a
straight line in (x, y, z). Therefore, we consider only the special case of the 2D space-time
representation for which equation (2) reduces to the form w(x, 0, 0, ct) = w_PW(α_1 x − ct) for
signals on the x-axis. With the DOA in 3D space defined by the angles
θ_o (measured on the x-z plane) and φ_o as shown in Fig. 1, it is easily shown that (2) may be
written in the form

w(x, y, z, ct) = w_PW(sin θ_o cos φ_o x + cos θ_o cos φ_o z + sin φ_o y − ct)   (3)

with −π/2 ≤ θ_o, φ_o ≤ π/2, from which it follows that the corresponding 2D space-time
plane wave signal received on the x-axis is given by
w(x, ct) = w_PW(sin θ_o cos φ_o x − ct)   (4)
As shown in Fig. 2, the space-time direction of the 2D space-time plane wave is defined by
the normal to these contours and is given by a line in the 2D frequency-domain (Gunaratne
and Bruton; Khademi)
Fig. 2. Propagating plane-wave in 3D space (a), 2D spatial view on the y = 0 plane (b), 2D
spatio-temporal DOA (c), and region of support (ROS) in the 2D frequency-domain, aligned
along the spatio-temporal DOA (d).
which passes through the origin and subtends an angle to the ω_ct axis. It lies on the ω_ct axis
for broadside DOAs and on ω_ct = ω_x for end-fire DOAs. Importantly therefore, the ROS of
all 2D space-time electromagnetic plane wave signals propagating at speed c cannot lie
outside the 90-degree wide 2D fan-shaped region |ω_x| ≤ |ω_ct| in (ω_x, ω_ct).
w(x, ct) = Σ_{k=0}^{M−1} w_PW,k(x sin θ_k − ct) + n_v(x, ct)   (8)

where sin θ_k = sin θ_{o,k} cos φ_{o,k}, and where n_v(x, ct) represents 2D space-time noise. The
Fourier transform of (8) is therefore given by

W(e^{jω_x}, e^{jω_ct}) = Σ_{k=0}^{M−1} W_PW,k(e^{jω_x}, e^{jω_ct}) + N_v(e^{jω_x}, e^{jω_ct})   (9)

where N_v(ω_x, ω_ct) is the 2D Fourier transform of n_v(x, ct). Typically, n_v(x, ct) corresponds
to non-plane-wave electromagnetic propagating interference or other sources of 2D
broadband noise, modelled as additive white Gaussian noise (AWGN). Therefore, the 2D
ROS of the noise spectrum N_v(ω_x, ω_ct) is typically uniform throughout the 2D
frequency-domain (ω_x, ω_ct) ∈ ℝ^2. The ROS of W(e^{jω_x}, e^{jω_ct}) therefore consists of the
uniform ROS of N_v(ω_x, ω_ct) and of M lines through the origin, where the orientation of
each line is given by the M different angles tan^{−1}(sin θ_k).
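The line-shaped ROS can be checked numerically. The sketch below (sampling grid and parameters are illustrative choices of ours, with unit spacing and c = 1) synthesizes a sampled 2D space-time plane wave and verifies that its 2D DFT energy concentrates at bins on the expected line through the origin:

```python
import numpy as np

# Hypothetical example parameters (not from the chapter):
Nx, Nt = 64, 64
alpha = 0.5          # sin(theta_o)*cos(phi_o), the effective DOA term
f = 8 / Nt           # temporal frequency, chosen to fall on exact DFT bins

n1 = np.arange(Nx)[:, None]   # spatial sample index (x)
n2 = np.arange(Nt)[None, :]   # temporal sample index (ct)
w = np.cos(2 * np.pi * f * (alpha * n1 - n2))   # w(x, ct) = w_PW(x sin(theta) - ct)

W = np.abs(np.fft.fft2(w))
peak = np.unravel_index(np.argmax(W), W.shape)
# The two conjugate peaks lie on the line omega_x = -alpha * omega_ct,
# i.e. at bins (f*alpha*Nx, -f*Nt mod Nt) and the mirror image.
assert peak in {(4, 56), (60, 8)}
```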
For notational convenience in the rest of the chapter, we will use ω_1 = ω_x as the spatial
frequency variable and ω_2 = ω_ct as the temporal frequency variable corresponding to
space-time ct.
The focus of this chapter is on the design and real-time hardware implementation of a first-
order 2D IIR beam digital plane-wave filter.
Methods have been proposed and implemented for significantly reducing the required
number of antennas without significantly reducing performance, for wireless
communications and other applications. These methods also lead to much reduced
arithmetic complexity of the filter and are based on allowing a controlled amount of
multidimensional spatial aliasing and thereby spatial under-sampling, as reported in
(Khademi and Bruton; Madanayake, Hum and Bruton).
T(s_1, s_2) = Y(s_1, s_2) / W(s_1, s_2) = R / (R + L_1 s_1 + L_2 s_2)   (10)

T(jω_1, jω_2) = Y(jω_1, jω_2) / W(jω_1, jω_2) = R / (R + j(L_1 ω_1 + L_2 ω_2))   (11)
From (5), the network under consideration is 2D resonant on the 2D line-shaped region

L_1 ω_1 + L_2 ω_2 = 0   (12)

passing through the frequency-origin (Note: in 2D, capacitors are not required to induce
resonance). At all finite frequencies where (12) is satisfied (i.e. throughout the 2D passband),
network energy resonates between the two inductance elements and T(jω_1, jω_2) is unity. By
choosing L_1 = cos ψ and L_2 = sin ψ, 0 ≤ ψ ≤ 90°, we can orient the axis of the 2D passband to
the angle ψ. A typical response is shown in Fig. 3. The shape of the 2D gain |T(jω_1, jω_2)| of
the filter may be envisaged in 2D frequency space by noting that L_1 ω_1 + L_2 ω_2 = Δ describes,
for constant Δ, a line that is parallel to the 2D passband and along which |T(jω_1, jω_2)| is
constant and less than unity. Importantly, |T(jω_1, jω_2)| decreases monotonically with
increasing values of |Δ|, with the two -3dB lines, having gain 0.707, given by Δ = ±R. We
make the following summary observations (see (Bruton and Bartley, 1985) for details):
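The resonance condition (12) and the ±R placement of the -3dB lines can be checked directly from (11). A minimal sketch (R and the beam angle are arbitrary example values of ours):

```python
import numpy as np

R, psi = 0.05, np.deg2rad(30.0)     # example values, not from the chapter
L1, L2 = np.cos(psi), np.sin(psi)

def T(w1, w2):
    """First-order 2D transfer function T(jw1, jw2) = R / (R + j(L1*w1 + L2*w2))."""
    return R / (R + 1j * (L1 * w1 + L2 * w2))

# On the resonant line L1*w1 + L2*w2 = 0 the gain is exactly unity...
w1 = 2.0
w2 = -L1 * w1 / L2
assert np.isclose(abs(T(w1, w2)), 1.0)

# ...and on the parallel lines L1*w1 + L2*w2 = +/-R the gain is 1/sqrt(2), -3 dB.
assert np.isclose(abs(T(w1, w2 + R / L2)), 1 / np.sqrt(2))
```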
3.2 The Transfer-functions of the First-order Beam Filter in the 2D s- and z- Domains
Although the inverse 2D Laplace transform of equation (10) yields a continuous-domain
partial differential equation for the input-output transfer-function, practical
implementations have so far been in the discrete-domain of 2D finite-difference equations,
implemented in the form of digital circuits. Transformation to the discrete-domain is
achieved by applying the normalized 2D bilinear transform (2D BLT) s_k = (z_k − 1) / (z_k + 1),
k = 1, 2, to equation (10), which leads, after considerable algebraic manipulation (Bruton and
Bartley, 1985), to the 2D z-transform transfer function
$$H(z_1,z_2) \;=\; T(s_1,s_2)\Big|_{s_k=\frac{z_k-1}{z_k+1}} \;=\; \frac{(1+z_1^{-1})(1+z_2^{-1})}{1+b_{10}z_1^{-1}+b_{01}z_2^{-1}+b_{11}z_1^{-1}z_2^{-1}} \;=\; \frac{Y(z_1,z_2)}{W(z_1,z_2)} \qquad (13)$$

where $W(z_1,z_2) = \mathcal{Z}_{2D}\{w(n_1\Delta x, n_2 c T_{CLK})\}$ and $Y(z_1,z_2) = \mathcal{Z}_{2D}\{y(n_1\Delta x, n_2 c T_{CLK})\}$, respectively, and where $b_{ij} = \big(R + (-1)^i L_1 + (-1)^j L_2\big)/\big(R + L_1 + L_2\big)$. The above application of the 2D BLT, which is a conformal mapping between the 2D Laplace- and 2D z-domains, results in a distortion of the high-frequency part of the 2D passband, known as bilinear warping, that leads to a practical limitation at the upper frequencies 0.5π ≤ |ω2| ≤ π of the beam-shaped passband. The effects of this limitation may be avoided by suitable temporal and/or spatial over-sampling of the input signal.
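The geometry of the beam passband can be sketched numerically. The following Python fragment (all parameter values are illustrative, and the unit-passband normalization R/(R + L1 + L2) is an assumption of this sketch) evaluates the direct-form 2D frequency response of (13) at a point near the passband axis and at a stopband point:

```python
import numpy as np

# Sketch of the direct-form beam filter of Eq. (13) with L1 = cos(theta),
# L2 = sin(theta). The factor R/(R + L1 + L2) normalizes the passband
# gain to unity (an assumption of this sketch).
def beam_filter_response(w1, w2, R=0.05, theta=np.radians(30)):
    L1, L2 = np.cos(theta), np.sin(theta)
    b = {(i, j): (R + (-1)**i * L1 + (-1)**j * L2) / (R + L1 + L2)
         for i in (0, 1) for j in (0, 1)}
    z1, z2 = np.exp(1j * w1), np.exp(1j * w2)
    num = (1 + 1 / z1) * (1 + 1 / z2)
    den = 1 + b[(1, 0)] / z1 + b[(0, 1)] / z2 + b[(1, 1)] / (z1 * z2)
    return R / (R + L1 + L2) * num / den

# On the resonant line cos(theta)*w1 + sin(theta)*w2 = 0 the gain is
# close to unity at low frequencies, where bilinear warping is small.
theta = np.radians(30)
w2 = 0.1 * np.pi
w1 = -np.tan(theta) * w2        # point near the 2D passband axis
print(abs(beam_filter_response(w1, w2, theta=theta)))   # close to 1
print(abs(beam_filter_response(0.5 * np.pi, w2)))       # far from the line: small gain
```

Because the 2D BLT is an exact conformal mapping, the evaluated digital response equals the analogue response at the warped frequencies tan(ω1/2), tan(ω2/2), which is how the off-axis bending of the beam at high |ω2| arises.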
For example, here we shall employ a temporal over-sampling factor of 2, for which we show in Fig. 4 the corresponding 'weakly-warped' magnitude response of the discrete-domain frequency-response transfer-function over the useful range |ω2| ≤ 0.5π.
Fig. 4. A beam-shaped response that is warped by the 2D BLT, shown in the usable range |ω1| ≤ π and −0.5π ≤ ω2 ≤ 0.5π. The beam shape at frequencies 0.5π ≤ |ω2| ≤ π is not used in our application because the beam shape is significantly off-axis, due to bilinear warping. The interested reader is referred to (Madanayake and Bruton; Bruton, 2003) for details.
Radio-Frequency (RF) Beamforming Using Systolic FPGA-based
Two Dimensional (2D) IIR Space-time Filters 257
$$H(z_1,z_2) \;=\; \frac{1+z_2^{-1}}{1+\gamma z_2^{-1}-\beta\,(1+z_2^{-1})\,\dfrac{z_1^{-1}}{1+z_1^{-1}}} \;=\; \frac{Y(z_1,z_2)}{W(z_1,z_2)} \qquad (14)$$

where we require

$$\beta = \frac{2\cos\theta}{R+\cos\theta+\sin\theta}, \qquad \gamma = 1-\frac{2\sin\theta}{R+\cos\theta+\sin\theta} \qquad (15)$$
Note that the passband gain in (14) is scaled by the constant $(R+L_1+L_2)/R$ relative to the direct-form case (Bertschmann, Bartley and Bruton; Madanayake, Hum et al.; Liu and Bruton, 1989; Madanayake and Bruton, 2007; Madanayake, 2008), and is ignored in the following because it is not of practical significance. Re-writing the direct-form transfer-function using the spatial-differential operator results in just two filter coefficients ($\beta$ and $\gamma$) in the denominator of (14) instead of three, implying a 33% reduction in the number of parallel hardware multipliers required in circuit realizations, relative to direct-form realizations.
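A quick numerical cross-check of this coefficient reduction is possible. The sketch below (the definitions of β and γ follow (15); the values of R and θ are illustrative) confirms that the two-coefficient denominator of (14) and the three-coefficient direct-form denominator of (13) describe the same transfer function at an arbitrary 2D frequency point:

```python
import numpy as np

# Hybrid-form coefficients of Eq. (15) (values of R and theta are
# illustrative assumptions).
R, theta = 0.05, np.radians(40)
L1, L2 = np.cos(theta), np.sin(theta)
beta = 2 * L1 / (R + L1 + L2)
gamma = 1 - 2 * L2 / (R + L1 + L2)

# Direct-form coefficients expressed through beta and gamma:
# b10 = 1 - beta, b01 = gamma, b11 = b10 + b01 - 1.
b10, b01 = 1 - beta, gamma
b11 = b10 + b01 - 1

w1, w2 = 0.3, -0.2                      # arbitrary test frequency (rad)
z1, z2 = np.exp(1j * w1), np.exp(1j * w2)

H_direct = ((1 + 1 / z1) * (1 + 1 / z2)
            / (1 + b10 / z1 + b01 / z2 + b11 / (z1 * z2)))
H_hybrid = ((1 + 1 / z2)
            / (1 + gamma / z2 - beta * (1 + 1 / z2) * (1 / z1) / (1 + 1 / z1)))

print(np.isclose(H_direct, H_hybrid))   # True: identical transfer functions
print(abs(gamma) <= 1)                  # True: the z2-pole at -gamma is stable
```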
Research on novel systolic-array architectures for 2D/3D IIR frequency-planar digital plane-wave filters for beamforming applications has led to field-programmable gate-array (FPGA) based single-chip multiprocessor implementations capable of real-time operation at a sustained arithmetic throughput of one frame per clock cycle (OFPCC), a requirement for real-time plane-wave filtering at RF using linear or rectangular arrays of antenna elements (Hum, Madanayake and Bruton; Madanayake, Bruton and Comis; Madanayake and Bruton; Madanayake and Bruton; Madanayake, Hum et al.; Madanayake, Hum and Bruton; Madanayake, 2004; Madanayake, 2008; Madanayake and Bruton, 2008). The OFPCC throughput rate, required for multi-GHz implementations, arises because the signals of interest are of ultra-wide RF bandwidth, which leads to Nyquist sample rates that are at least twice the full RF bandwidth of the signal.
The beamformers therefore directly sample RF signals from the antennas without down-conversion (or bandpass sampling), leading to frame sample rates in the GHz range. Such excessively-high frame sample rates (multiple GHz) make software-based realizations infeasible using traditional DSP technologies. Our research indicates (Madanayake and Bruton; Madanayake, 2008) that massively-parallel, synchronously-clocked, speed-optimized,
fully-pipelined systolic-array processors are currently the best available solution for the
broadband real-time DSP-based radio-frequency (RF) beamforming applications using
sampled antenna arrays (Arnold Van Ardenne; Ellingson,1999; Liberti Jr. and
Rappaport,1999; Weem, Noratos and Popovic,1999; Frederick, Wang and Itoh,2002; Do-
Hong and Russer,2004; Rodenbeck, Sang-Gyu, Wen-Hua et al.,2005; Madanayake,2008;
Devlin,Spring 2003).
The PPCMs that comprise the systolic-array processor are fully-parallel, speed-maximized,
fully-pipelined, multi-input-multi-output (MIMO) processors, each consisting of 2 input
ports and 2 output ports. A PPCM at spatial location n1 has its input port A connected to the ADC at location n1 and input port B connected to the output port C of the PPCM at location n1 − 1. Port D provides the computed output signal y(n1Δx, n2cTCLK) for spatial location n1.
$$(1+z_1^{-1})(1+z_2^{-1})\,W(z_1,z_2) \;=\; \big[(1+z_1^{-1})(1+\gamma z_2^{-1}) - \beta\,z_1^{-1}(1+z_2^{-1})\big]\,Y(z_1,z_2), \qquad (17)$$
leading to
$$Y(z_1,z_2) \;=\; \left[\,W(z_1,z_2) + \beta\,\frac{z_1^{-1}}{1+z_1^{-1}}\,Y(z_1,z_2)\right]\frac{1+z_2^{-1}}{1+\gamma z_2^{-1}} \qquad (18)$$
Multiplying both sides by $z_2^{-p}$ yields the required form
Fig. 7. Interconnections between PPCMs, shown here in the mixed domain (n1, z2), leading to the massively-parallel systolic-array processor implementation of the beam plane-wave filter.
$$Y(z_1,z_2)\,z_2^{-p} \;=\; \left[\,W(z_1,z_2)\,z_2^{-p} + \beta\,\frac{z_1^{-1}}{1+z_1^{-1}}\,Y(z_1,z_2)\,z_2^{-p}\right]\frac{1+z_2^{-1}}{1+\gamma z_2^{-1}} \qquad (19)$$
Computing the inverse z1-transform of (19) under spatial ZICs, we obtain the 2D mixed-domain (n1, z2) form, given by

$$Y(n_1,z_2)\,z_2^{-p} \;=\; \big[\,W(n_1,z_2)\,z_2^{-p} + \beta\,z_2^{-p}\,U(n_1,z_2)\big]\,\frac{1+z_2^{-1}}{1+\gamma z_2^{-1}} \qquad (20)$$

where

$$U(n_1,z_2) \;=\; Y(n_1-1,z_2)\,z_2^{-p} - U(n_1-1,z_2)\,z_2^{-p} \qquad (21)$$
and where $z_2^{-p}$ is the z2-transform of the internal pipelining delays at each PPCM. Because the depth of pipelining is arbitrary, the numerator of (20) can be pipelined at will using straightforward 1D FIR filter pipelining methods, noting that this 1D FIR section has two terms, W(n1, z2) and U(n1, z2), which are obtained using digital ports A at node n1 and B at node n1, respectively. Equations (20) and (21) describe the 2-input-2-output z2-domain transfer-functions of a PPCM at location 0 ≤ n1 ≤ N − 1. The hybrid-form signal flow-graph is thereby obtained, and is shown in Fig. 6. The first 3 PPCMs in the systolic-array are shown in Fig. 7 as an interconnection of processors.
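A behavioural check of this decomposition can be made in the time domain. The sketch below (with the pipelining delay p set to 0 for clarity; the values of R and θ are illustrative, and the β, γ definitions follow (15)) runs the per-PPCM recursion against the direct-form 2D difference equation and confirms that they produce identical outputs:

```python
import numpy as np

# Per-PPCM recursion of Eqs. (20)-(21) with p = 0, checked against the
# direct-form 2D difference equation of Eq. (13). R and theta are
# illustrative assumptions.
R, theta = 0.1, np.radians(35)
L1, L2 = np.cos(theta), np.sin(theta)
beta = 2 * L1 / (R + L1 + L2)
gamma = 1 - 2 * L2 / (R + L1 + L2)
b10, b01 = 1 - beta, gamma
b11 = b10 + b01 - 1

N1, N2 = 6, 32                           # spatial nodes, time samples
rng = np.random.default_rng(1)
w = rng.normal(size=(N1, N2))            # input w(n1, n2)
W = np.pad(w, ((1, 0), (1, 0)))          # zero initial conditions

# Direct form: three feedback multipliers per output sample.
yd = np.zeros((N1 + 1, N2 + 1))
for n1 in range(1, N1 + 1):
    for n2 in range(1, N2 + 1):
        yd[n1, n2] = (W[n1, n2] + W[n1 - 1, n2] + W[n1, n2 - 1] + W[n1 - 1, n2 - 1]
                      - b10 * yd[n1 - 1, n2] - b01 * yd[n1, n2 - 1]
                      - b11 * yd[n1 - 1, n2 - 1])

# Systolic per-node form: u(n1) = y(n1-1) - u(n1-1) (the port-B path),
# then the 1D IIR section (1 + z2^-1)/(1 + gamma*z2^-1) applied to
# w(n1) + beta*u(n1).
yh = np.zeros((N1, N2))
u_prev = np.zeros(N2)
y_prev = np.zeros(N2)
for n1 in range(N1):
    u = y_prev - u_prev
    s = w[n1] + beta * u
    y = np.zeros(N2)
    for n2 in range(N2):
        y[n2] = (s[n2] + (s[n2 - 1] if n2 > 0 else 0.0)
                 - gamma * (y[n2 - 1] if n2 > 0 else 0.0))
    yh[n1] = y
    u_prev, y_prev = u, y

print(np.allclose(yd[1:, 1:], yh))       # True: the two realizations agree
```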
The hybrid-form signal-flow graph does not allow pipelining of A4, because only one unit-delay buffer is available in the first-order feedback loop, and that buffer is usually absorbed inside the parallel logic of the multiplier γ. Provided all feed-forward paths are fully pipelined, the critical path delay of the hybrid-form PPCM cannot be reduced beyond $T_{CPD} = T_{Mul} + T_{A/S}$, where $T_{Mul}$ and $T_{A/S}$ are the propagation delays of a parallel multiplier and adder/subtractor circuit, respectively. The maximum speed of operation for a hybrid-form PPCM is therefore limited to $F_{CLK} = 1/T_{CPD}$ unless additional speed-optimization methods, based on look-ahead optimization, are employed. This method is discussed in the next section.
Fig. 8. Signal flow graph of a hybrid-form PPCM having 12 cycles of pipeline latency
(arbitrarily chosen for the purpose of demonstration). The 12 clock-cycles of additional
pipelining can be used as required to reduce the critical-path delay (CPD) of the systolic-
array.
The pipelined version, having p = 12, of the hybrid-form PPCM is shown in Fig. 8. We now describe look-ahead speed-optimization of the internal 1D temporal IIR digital filter section having transfer-function $\dfrac{1+z_2^{-1}}{1+\gamma z_2^{-1}}$, which enables much greater levels of real-time throughput at the cost of additional circuit complexity.
In section 4.4, we described an example for which 10 additional delays are distributed in the forward (that is, FIR, also known as feed-forward) signal paths of the PPCM, such that the critical path delay of the PPCM (and therefore, of the systolic-array) is reduced to the latency of a multiply-add operation, denoted $T_{CPD}$. The speed bottleneck for this example lies within the first-order feed-back IIR filter, which has a simple real pole at $z_2 = -\gamma$, where it may be shown that $|\gamma| \le 1$ for passive filter network prototypes. Because this pole is within (or on) the unit circle $|z_2| = 1$, the 1D IIR filter section is unconditionally stable (ignoring effects due to finite precision).
Let us further assume that our objective is to halve the critical path delay using look-ahead optimization of the IIR section. This can be achieved by increasing the number of internal delays in the first-order feedback loop to 2 (causing the feedback loop to increase in order): this may be easily achieved by multiplying both numerator and denominator of $\dfrac{1+z_2^{-1}}{1+\gamma z_2^{-1}}$ by $(1-\gamma z_2^{-1})$, leading to the 2nd-order section, given by

$$\frac{(1+z_2^{-1})(1-\gamma z_2^{-1})}{1-\gamma^2 z_2^{-2}}$$

leading to a new critical path delay in the feedback loop $T_{CPD,LA} = T_{CPD}/2$, implying an almost 100% increase in the maximum speed of operation (Parhi; Parhi; Parhi and Messerschmitt; Parhi and Messerschmitt; Sundarajan and Parhi; Parhi and Messerschmitt, 1989; Parhi, 1991;
Parhi, 1999). This "look-ahead" speed-maximization process may be repeated: for example, a second application yields a 4th-order feedback loop with denominator $1-\gamma^4 z_2^{-4}$, which allows the multiplier $\gamma^4$ to consist of 3 levels of internal pipelining, while the fourth delay can be used in the 2-input adder that completes the feedback loop (Madanayake and Bruton; Madanayake, 2008). The additional terms in the numerator that appear due to the application of look-ahead speed-maximization lead to additional circuit complexity – this is
the price for the extensive gain in real-time throughput, which is 300% for 4th-order feedback loops. The additional arithmetic circuits that appear in the feed-forward sections can be easily pipelined by increasing the depth of pipelining p as required. In our example, we have increased the depth of pipelining up to p = 22, which allows 3-level pipelining of all additional adders/subtractors and multipliers in the PPCM.
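The look-ahead transformation can be illustrated in software. The sketch below (the coefficient value g is illustrative) implements the first-order section y[n] = x[n] + x[n−1] − g·y[n−1] and its clustered look-ahead equivalent, whose feedback now spans two delays and can therefore be retimed across the loop multiplier:

```python
# First-order IIR section (1 + z^-1)/(1 + g*z^-1) and its clustered
# look-ahead equivalent (1 + z^-1)(1 - g*z^-1)/(1 - g^2*z^-2).
# The value of g is an illustrative assumption (|g| < 1 for stability).
def iir_first_order(x, g):
    out = []
    for n, xn in enumerate(x):
        xp = x[n - 1] if n >= 1 else 0.0
        yp = out[n - 1] if n >= 1 else 0.0
        out.append(xn + xp - g * yp)            # y[n] = x[n] + x[n-1] - g*y[n-1]
    return out

def iir_lookahead(x, g):
    out = []
    for n, xn in enumerate(x):
        x1 = x[n - 1] if n >= 1 else 0.0
        x2 = x[n - 2] if n >= 2 else 0.0
        y2 = out[n - 2] if n >= 2 else 0.0
        # y[n] = x[n] + (1-g)*x[n-1] - g*x[n-2] + g^2*y[n-2]
        out.append(xn + (1 - g) * x1 - g * x2 + g * g * y2)
    return out

g = 0.6
x = [1.0] + [0.0] * 9                           # unit impulse
a, b = iir_first_order(x, g), iir_lookahead(x, g)
print(all(abs(p - q) < 1e-12 for p, q in zip(a, b)))   # True: identical responses
```

The feedback recursion of the second form touches only y[n−2], so the multiply by g² has a full extra clock cycle available for internal pipelining, which is the source of the speed-up discussed above.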
The logic design flow starts with the Xilinx System Generator (XSG) design tool, which is a plug-in for Matlab/Simulink. We chose XSG as our FPGA design tool, although conventional design methods based on hardware description languages such as VHDL or Verilog may also be attempted. The modular, regular nature of the systolic-array, together with the complicated pipelines and dataflow structure, makes the use of a graphical FPGA design method such as XSG easier, compared to text-based design tools. We note, however, that XSG in the end leads to synthesizable VHDL (or Verilog), which is subsequently processed by conventional FPGA logic synthesis tools such as the Xilinx Synthesis Tool (XST) or Synplify Pro.
The following example assumes input signals obtained from 4-bit A/D converters. Preliminary studies show that 3-bit A/D converters are quite sufficient for ultra-wideband wireless communications applications. Our choice of 4 bits in our A/D converters results from 1-bit overdesign, mainly as a margin of safety, in order to ensure good performance
Fig. 9. Xilinx FPGA circuit for a hybrid-form PPCM having 12 cycles of pipeline latency
(corresponding to the signal-flow graph in Fig. 8).
Fig. 10. First 4 PPCMs of a hybrid-form systolic-array FPGA circuit showing inter-PPCM interconnections. The FPGA circuit is tested on-chip using stepped hardware co-simulation with a 2D unit impulse input at PPCM #1 and the inputs of PPCMs #2, #3, …, #21 set to zero, leading to the measured 2D impulse response h(n1Δx, n2cTCLK). A bit-true, cycle-accurate FPGA circuit simulation of the 2D impulse response is available in the Matlab variables simout, simout1, …, simout20, and the measured on-chip FPGA circuit responses are available in the Matlab variables h0, h1, …, h20.
from a real-world application. The multiplier coefficients are assumed to be 12 bits, with the binary point at position 10. All other registers, including quantized outputs from multipliers and adder/subtractor blocks, are fixed at 14 bits, with the binary point assumed at position 10. The design of a PPCM is shown in Fig. 9, followed by the systolic-array, in Fig. 10. The finite-precision values at various locations on the PPCM signal flow graph can be further optimized against various requirements, but this is not attempted here, because we are only interested in giving our readers a basic design overview of the hybrid-form systolic-array processor.
The FPGA circuit was tested, using on-chip hardware-in-the-loop co-simulation, with Matlab/Simulink, XSG, and FUSE, using the XtremeDSP Kit-4 device installed in the 5V 32-bit PCI slot of the host PC. Figure 11 shows the measured 2D magnitude frequency response of an example beam filter having a spatial DOA of 25° and bandwidth parameter R = 0.02, computed for 21 spatial samples and 256 time samples of the impulse response. The "uneven" nature of the measured response is attributed to quantization effects and magnitude sensitivity, for which a comprehensive study remains as useful future research.
6. Conclusions
The above new systolic implementation of a 2D IIR frequency-beam filter transfer function
has promising engineering applications for the directional enhancement of a propagating
broadband space-time plane-wave received on an array of sensors. A particularly important
case is the use of an array of broadband antennas for the directional enhancement (that is,
beamforming) of ultra-wideband electromagnetic plane-waves.
maximum throughput, because at OFPCC, the clock rate is equal to the frame rate in these
architectures) and low computational complexity.
A design example for the proposed systolic-array processor architecture has been described
using a Xilinx Virtex-4 Sx35 FPGA device, and the Matlab/Simulink based FPGA design
tool called Xilinx System Generator. The example FPGA implementation of the 2D IIR
frequency- beam filter was tested on-chip using the hardware-in-the-loop verification
method called ‘hardware co-simulation’, and the on-chip 2D unit-impulse response was
measured, which in turn led to measured 2D frequency response results that confirm correct
implementation of the hardware.
Although the FPGA-based example is generally too slow for microwave imaging applications, it serves as a validation of the proposed OFPCC systolic-array processor and can be used in its current form for slower applications in audio, ultrasound, and lower radio frequencies (of up to approximately 100 MHz). Finally, promising new VLSI implementation platforms are described here, which may eventually enable the proposed architecture to operate at the required multi-GHz clock frequency for real-time ultra-wideband digital smart antenna array applications.
7. References
Agathoklis, P. and L. T. Bruton (1983). "Practical-BIBO stability of N-dimensional discrete
systems." Proc. Inst. Elec. Eng. 130, Pt. G(6): 236-242.
Anonymous (2007). Arrix FPOA Overview. Available online at http://www.mathstar.com.
Anonymous (2008). Using High-Performance FPGAs for Advanced Radio Signal Processing. Available online at http://www.achronix.com.
Arnold Van Ardenne. The Technology Challenges for the Next Generation Radio
Telescopes. Perspectives on Radio Astronomy - Technologies for Large Antenna
Arrays, Netherlands Foundation for Research in Astronomy.
Bertschmann, R. K., N. R. Bartley and L. T. Bruton A 3-D integrator-differentiator double-
loop (IDD) filter for raster-scan video processing. IEEE Intl. Symp. on Circuits and
Systems, ISCAS'95.
Bolle, M. (1994). A Closed-form Design Method for 3-D Recursive Cone Filters IEEE
International Conference on Acoustics, Speech, and Signal Processing, ICASSP.
Bruton, L. T. (2003). "Three-dimensional cone filter banks." IEEE Trans. on Circuits and
Systems I: Fundamental Theory and Applications 50(2): 208-216.
Bruton, L. T. and N. R. Bartley (1985). "Three-dimensional image processing using the
concept of network resonance." IEEE Trans. on Circuits and Systems 32(7): 664-672.
Dansereau, D. (2003). 4D Light Field Processing and its Application to Computer Vision.
Electrical and Computer Engineering. Calgary, University of Calgary. MSc.
Dansereau, D. and L. T. Bruton (2007). "A 4-D Dual-Fan Filter Bank for Depth Filtering in
Light Fields." Signal Processing, IEEE Transactions on 55(2): 542-549.
Devlin, M. (Spring 2003) "How to Make Smart Antenna Arrays." Xcell Journal Online
Do-Hong, T. and P. Russer (2004). Signal Processing for Wideband Smart Antenna Array
Applications. IEEE Microwave Magazine. 5.
13

A VLSI Architecture for Output Probability Computations of HMM-based Recognition Systems
1. Introduction
Mobile embedded systems with natural human interfaces, such as speech recognition, lip
reading, and gesture recognition, are required for the realization of future ubiquitous
computing. Recognition tasks can be implemented either on processors (CPUs and DSPs) or
dedicated hardware (ASICs). Although processor-based approaches offer flexibility, real-
time recognition tasks using state-of-the-art recognition algorithms exceed the performance
level of current embedded processors, and require modern high-performance processors
that consume far more power than dedicated hardware. Dedicated hardware, which is
optimized for low-power, real-time recognition tasks, is more suitable for implementing
natural human interfaces in low power mobile embedded systems. VLSI architectures
optimized for recognition tasks with low power dissipation have been developed.
Yoshizawa et al. investigated a block-wise parallel processing method for output probability computations of continuous hidden Markov models (HMMs), and proposed a low-power, high-speed VLSI architecture. Output probability computations are the most time-consuming part of HMM-based recognition systems. Mathew et al. developed low-power
accelerators for the SPHINX 3 speech recognition system, and also developed perception
accelerators for embedded systems. In this chapter, we present a fast and memory efficient
VLSI architecture for output probability computations of continuous HMMs using a new
block-wise parallel processing method. We show block-wise frame parallel processing
(BFPP) for output probability computations and present an appropriate VLSI architecture
for its implementation. Compared with a conventional block-wise state parallel processing
(BSPP) architecture, when there are a sufficient number of HMM states for accurate
recognition, the BFPP architecture requires fewer registers and processing elements (PEs),
and less processing time. The PEs used in the BFPP architecture are identical to those used
in the BSPP architecture. From a VLSI architectural viewpoint, a comparison shows the
efficiency of the BFPP architecture through efficient use of registers for storing input feature
vectors and intermediate results during computation. The remainder of this chapter is
organized as follows: the structure of HMM based recognition systems is described in
Fig. 1. Basic structure of HMM-based recognition hardware
Section 2, BFPP and BFPP-based VLSI architecture are introduced in Section 3, the
evaluation of the BFPP architecture is described in Section 4, and conclusions are presented
in Section 5.
Fig. 2. Flowchart of output probability computations for V HMMs (nested loops: Loop A over p, Loop B over j, Loop C over t, Loop D over v).
$$\log b_j(O_t) \;=\; \omega_j + \sum_{p=1}^{P} \sigma_{jp}\,(o_{tp}-\mu_{jp})^2, \quad 1 \le j \le N,\; 1 \le t \le T, \qquad (1)$$

where ωj, μjp and σjp are the factors of the Gaussian probability density function (Yoshizawa et al., 2006).
The output probability computation circuit computes log bj(Ot) based on Eq. (1), where all HMM parameters ωj, μjp and σjp are stored in ROM, and the input frames are stored in RAM. The values of T, N, P and the number of HMMs V differ for each recognition system. For a recent isolated word recognition system (Yoshizawa et al., 2006; Yoshizawa et al., 2004), T, N, P and V are 86, 32, 38, and 800, respectively, and for another word recognition system (Yoshizawa et al., 2002), T, N, P and V are 89, 12, 16 and 100, respectively. For a continuous speech recognition system (Mathew et al., 2003a), T, N, P and V are approximately 20, 10, 40, and 50, respectively. Different applications require different output probability computation circuit architectures. A flowchart of output probability computations for V HMMs is shown in Fig. 2. Output probabilities are obtained by T × N × P × V repetitions of the partial computation of log bj(Ot). Each partial computation of log bj(Ot) performs four arithmetic operations for Eq. (1): a subtraction (a = otp − μjp), two multiplications (b = a · a · σjp), and an addition (acc = acc + b, where the initial value of acc is ωj).
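The four-operation inner step above can be sketched directly (the parameter values below are random placeholders, not trained HMM parameters):

```python
import numpy as np

# Partial computation of Eq. (1): a subtraction, two multiplications,
# and an accumulating addition per dimension p.
def log_b(omega_j, mu_j, sigma_j, o_t):
    acc = omega_j                    # initial value of acc is omega_j
    for p in range(len(o_t)):
        a = o_t[p] - mu_j[p]         # a   = otp - mu_jp
        b = a * a * sigma_j[p]       # b   = a * a * sigma_jp
        acc = acc + b                # acc = acc + b
    return acc

rng = np.random.default_rng(0)
P = 38                               # feature dimension, as in the text
o = rng.normal(size=P)
mu = rng.normal(size=P)
sigma = rng.normal(size=P)
ref = -1.5 + np.sum(sigma * (o - mu) ** 2)
print(abs(log_b(-1.5, mu, sigma, o) - ref) < 1e-9)   # True
```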
Fig. 3. Flowchart of output probability computations using the conventional BSPP: Loop B is expanded into an N-parallel computation in which PE1, PE2, ..., PEN simultaneously compute log b1(Ot), log b2(Ot), ..., log bN(Ot).
et al. (Yoshizawa et al., 2006; Yoshizawa et al., 2004; Yoshizawa et al., 2002). In this method, the set of input frames is called a block, and HMM parameters are effectively shared between different input frames in the computation. N-parallel computation is performed by their BPP. In this chapter, we classify two types of BPP according to the data flow of output probability computations: block-wise frame parallel processing (BFPP) and block-wise state parallel processing (BSPP). A block can be seen as a set of M (≤ T) input frames, whose elements are Ot', 1 ≤ t' ≤ M. M of the T input frames are processed per block. BFPP performs arithmetic operations on locally stored input frames, which are O1, O2, ..., OM, and output probability computations for multiple frames are carried out simultaneously. On the other hand, a block can also be seen as an M × P matrix whose elements are ot'p, 1 ≤ t' ≤ M, 1 ≤ p ≤ P. BSPP performs arithmetic operations on an input sequence, which is o11, ..., o1P, o21, ..., o2P, ..., oM1, ..., oMP, and output probability computations for multiple states are carried out simultaneously.

The BPP proposed by Yoshizawa et al. (Yoshizawa et al., 2006; Yoshizawa et al., 2004; Yoshizawa et al., 2002) is classified as a BSPP. In this chapter, we present BFPP for output probability computations. M/2-parallel computations are performed by our BFPP.
A flowchart of the output probability computations with the conventional BSPP (Yoshizawa et al., 2006; Yoshizawa et al., 2004; Yoshizawa et al., 2002) is shown in Fig. 3. PEi represents the i-th processing element, which computes log bi(Ot) by a subtraction, an addition, and two multiplications for Eq. (1). Loop B (Fig. 2) is expanded as shown in Fig. 3, and log b1(Ot), log b2(Ot), ..., and log bN(Ot) are computed simultaneously with N PEs, where otp is fed to the N PEs in Loop A. In addition to the N-state parallel computation, the same HMM parameters μjp's, σjp's, and ωj's, 1 ≤ j ≤ N, 1 ≤ p ≤ P, are used repeatedly during Loop C in Fig. 3.
A flowchart of the output probability computation with BFPP is shown in Fig. 4. The PEs in
Fig. 4. Flowchart of output probability computation using BFPP
Figs. 4 and 3 are identical, but different in number. Loop C in Fig. 2 is partially expanded in Fig. 4, and log bj(Ot'+1), log bj(Ot'+2), ..., and log bj(Ot'+M/2) are computed simultaneously with M/2 PEs in Loop C1, where μjp and σjp are fed to the M/2 PEs in Loop A. In addition to the M/2-frame parallel computations, log bj(Ot'+M/2+1), log bj(Ot'+M/2+2), ..., and log bj(Ot'+M) are also computed with the same M/2 PEs. In this double M/2-parallel computation, the same HMM parameters μjp and σjp are used twice, because the parameters are independent of t. In addition to the M/2-parallel computations, Loop D (Fig. 2) is divided into Loops D1 and D2 (Fig. 4). The same input frames Ot'+1, Ot'+2, ..., and Ot'+M are used repeatedly during Loop D1, because the input frames are independent of v.
Fig. 5. BFPP architecture: ROM (μ, σ, ω), RAM (O), register arrays Regμ, Regσ, Regω and RegO, and M/2 PEs performing a double M/2-parallel computation over input frames Ot'+1, ..., Ot'+M.
HMM parameter ωj and intermediate results. Reg stores the computed output probabilities for a Viterbi scorer. Each PEi consists of two adders and two multipliers, which are used for computing ωj + Σ_{p=1}^{P} σjp(ot'p − μjp)².
Figure 6 shows the flowchart of output probability computations using the BFPP architecture. The computation starts by reading M input frames from RAM and storing them in RegO in Loop C1; these are Ot'+1, Ot'+2, ..., Ot'+M/2, Ot'+M/2+1, Ot'+M/2+2, ..., Ot'+M. The HMM parameters of the v-th HMM are read from ROM and stored in Regμ, Regσ and Regω; these are μ11, σ11, and ω1. The value of all registers in Regω is set to ω1. For the first half of the stored input frames Ot'+1, Ot'+2, ..., and Ot'+M/2, M/2 intermediate results are simultaneously computed with the stored μ11, σ11, and ω1 by the M/2 PEs, where the HMM parameters are shared by all PEs. At the same time, an HMM parameter μ of the v-th HMM is read from ROM and stored in Regμ. Then, for the other half of the stored input frames Ot'+M/2+1, Ot'+M/2+2, ..., and Ot'+M, M/2 intermediate results are simultaneously computed with the same μ11, σ11, and ω1 by the M/2 PEs. At the same time, an HMM parameter σ of the v-th HMM is read from ROM and stored in Regσ. In this double M/2-parallel computation, the
Fig. 6. Flowchart of computations using the BFPP architecture
same HMM parameters μ11, σ11, and ω1 are used twice. In the next double M/2-parallel computation, the next stored HMM parameters μ and σ are used twice. M output probabilities log bj(Ot'+1), log bj(Ot'+2), ..., and log bj(Ot'+M) of the v-th HMM are obtained by Loop A. The obtained results are transferred from Regω to the output register for starting the next output probability computation, log bj+1(Ot'+1), log bj+1(Ot'+2), ..., log bj+1(Ot'+M), of the v-th HMM. The stored results are fed to the Viterbi scorer. The MN output probabilities of the v-th HMM are obtained by Loop B. MNL output probabilities of HMMs v'+1, v'+2, ..., v'+L are obtained by Loop D1 with the same M input frames Ot'+1, Ot'+2, ..., and Ot'+M.
Fig. 7. Conventional BSPP architecture: ROM (μ, σ, ω), RAM (O), register arrays Regμ, Regσ and Regω, and N PEs performing an N-parallel computation.

The MNL(T/M) output probabilities of HMMs v'+1, v'+2, ..., v'+L are obtained by Loop C1, and finally the MNL(T/M)(V/L) output probabilities of all HMMs are obtained by Loop D2.
4. Evaluation
We compared the proposed BFPP VLSI architecture with the BSPP architecture (Fig. 7) (Yoshizawa et al., 2006; Yoshizawa et al., 2004; Yoshizawa et al., 2002). The BSPP architecture consists of three register arrays and N PEs. Regμ and Regσ store the HMM parameters μjp and σjp, respectively, and Regω stores the HMM parameter ωj and intermediate results. The PEs in Figs. 7 and 5 are identical.
Figure 8 shows the flowchart of the computations of the BSPP architecture. The computation starts by reading all 2NP + N HMM parameters of the v-th HMM from ROM and storing them in Regμ, Regσ, and Regω in Loop D. For input otp, the intermediate results are computed with the stored HMM parameters by N PEs. N output probabilities log b1(Ot), log b2(Ot), ..., log bN(Ot)
of the HMM are obtained by Loop A. The obtained results are fed to a Viterbi scorer.

Fig. 8. Flowchart of computations using the BSPP architecture.

NT output probabilities of the v-th HMM are obtained by Loop C with the same HMM parameters.
The NTV output probabilities of all HMMs are obtained by Loop D.
Table 1 shows the register size of the BSPP and BFPP architectures, where xμ, xσ, xo, and xf represent the bit lengths of μjp, σjp, otp, and the output of a PE, respectively. N, P, and M are the number of HMM states, the dimension of the input feature vector (frame), and the number of input frames in a block, respectively.
Table 2 shows the processing time for computing output probabilities of V HMMs with the
BFPP and BSPP architectures, where T and L are the number of input frames and the
number of HMMs whose output probabilities are computed with the same input frames
during Loop D1 of Fig. 6, respectively.
Fig. 9. Evaluation of the BSPP and BFPP performance, and the value of M of the BFPP (N =
32, P = 38, T = 86, V = 800)
Table 3 shows the register size, the processing time, and the number of PEs for computing output probabilities of 800 HMMs, where we assume that N = 32, P = 38, T = 86, xμ = 8, xσ = 8, xo = 8, xf = 24, and V = 800, the same values used in a recent circuit design for isolated word recognition (Yoshizawa et al., 2006; Yoshizawa et al., 2004). We also assume that M = 44 and L = 5 for the BFPP architecture. The PEs used in the BSPP and BFPP architectures are identical. Compared with the BSPP architecture, the BFPP architecture has fewer registers, requires less processing time, and has fewer PEs. From the VLSI architecture viewpoint, this is because the register size of the BFPP architecture is independent of N, and its PEs can repeatedly use the same input frames. The BFPP architecture has fewer wait cycles for
reading data from ROM before parallel computations, 586,240 (= (V/L)(PM + LN)⌈T/M⌉), than the BSPP architecture, which has 1,971,200 (= V(2NP + N)).
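The two wait-cycle counts quoted above follow directly from the loop structures; a quick check with the example parameters:

```python
import math

# Wait-cycle counts for ROM reads before parallel computation, using the
# example parameters N=32, P=38, T=86, V=800 and the BFPP choices M=44, L=5.
N, P, T, V, M, L = 32, 38, 86, 800, 44, 5
bspp_wait = V * (2 * N * P + N)                            # V(2NP + N)
bfpp_wait = (V // L) * (P * M + L * N) * math.ceil(T / M)  # (V/L)(PM + LN)ceil(T/M)
print(bspp_wait)   # 1971200
print(bfpp_wait)   # 586240
```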
Fig. 9 shows the processing time and the number of PEs of the BFPP and BSPP architectures, and the value of M of the BFPP architecture. The processing time and the number of PEs of the BFPP architecture are less than those of the BSPP architecture when M = 44 (Fig. 9). From a logic design viewpoint, the register arrays of the BSPP and BFPP architectures are designed with flip-flops or on-chip multi-port memories of different sizes. The data paths are designed with identical PEs, but different in number. The control paths of these architectures are designed as shown in the flowcharts of Figs. 8 and 6. The data path delay is the same for both the BSPP and BFPP designs, equal to the delay time of one PE. The delay times of the control paths differ between the two, but the control path delay is small compared with the data path delay.
5. Conclusions
We presented BFPP for output probability computations and presented an appropriate VLSI
architecture for its implementation. BFPP performs arithmetic operations to locally stored
input frames, and output probability computations for multiple frames are carried out
simultaneously. Compared with the conventional BSPP architecture, when the number of
HMM states is large enough for accurate recognition, the BFPP architecture requires fewer
registers and PEs, and less processing time. In terms of the VLSI architecture, a fast and
memory efficient VLSI architecture for output probability computations of HMM-based
recognition systems has been presented. A logic design, a Viterbi scorer for the BFPP architecture, and a reconfigurable architecture supporting both the BSPP and BFPP architectures are our future work.
6. References
B. Mathew, A. Davis & Z. Fang (2003a). Perception Coprocessors for Embedded Systems,
Proc. of Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia), pp. 109-
116, 2003.
B. Mathew, A. Davis & Z. Fang (2003b). A Low-Power Accelerator for the SPHINX 3 Speech
Recognition System, Proc. of Int'l Conf. on Compilers, Architecture and Synthesis for
Embedded Systems, pp. 210-219, 2003.
S. Yoshizawa, Y. Miyanaga & N. Yoshida (2002). On a High-Speed HMM VLSI Module with
Block Parallel Processing, IEICE Trans. Fundamentals (Japanese Edition), Vol. J85-A,
No. 12, pp. 1440-1450, 2002.
S. Yoshizawa, N. Wada, N. Hayasaka & Y. Miyanaga (2004). Scalable Architecture for Word
HMM-Based Speech Recognition, Proc. of 2004 IEEE Int'l Symposium on Circuits and
Systems (ISCAS'04), pp. 417-420, 2004.
S. Yoshizawa, N. Wada, N. Hayasaka & Y. Miyanaga (2006). Scalable Architecture for Word
HMM-Based Speech Recognition and VLSI Implementation in Complete System,
IEEE Transactions on Circuits and Systems, Vol. 53, No. 1, pp. 70-77, 2006.
X. Huang, F. Alleva, H. W. Hon, M. Y. Hwang, K. F. Lee & R. Rosenfeld (1992). The SPHINX-II
speech recognition system: an overview, Computer Speech and Language, Vol. 7(2),
pp. 137-148, 1992.
14
Efficient Built-in Self-Test for Video Coding Cores: A Case Study on Motion Estimation Computing Array
1. Introduction
In recent years, multimedia applications have become more flexible and powerful with the
development of semiconductor, digital signal processing (DSP), and communication
technology. The latest video standard, H.264/AVC/MPEG-4 Part 10 (Advanced Video
Coding) (Wiegand, 2003), is regarded as the next-generation video compression standard
(VCS). In video compression standards, the motion estimation computing array (MECA) is
the most computationally demanding component in a video encoder/decoder (Kuhn, 1999;
Komarek and Pirsch, 1989): about 60-90% of the total computation time is consumed by
motion estimation. Additionally, the motion estimation algorithm used profoundly
influences the visual quality of reconstructed images. More accurate predictions increase the
compression ratio and improve the peak signal-to-noise ratio (PSNR) at a given bit rate.
Since the motion estimation algorithm is not specified in the video coding standards, many
algorithms are applied on different hardware platforms, system frequencies, operating
voltages, and power budgets.
On the other hand, due to rapid advances in semiconductor fabrication technology, a
large number of transistors can be integrated on a single chip. However, integrating a large
number of processors on a single chip increases the logic-per-pin ratio, which drastically
reduces the controllability and observability of the logic on the chip. Consequently, testing
such highly complex and dense circuits becomes very difficult and expensive (Lu et al.,
2005).
For a commercial chip, the VCS must incorporate design-for-testability (DFT), especially in
the MECA. The objective of DFT is to increase the ease with which a device can be tested, to
guarantee high system reliability. Many DFT approaches have been developed, such as ad
hoc, structured, and built-in self-test (BIST) (Wu et al., 2007; Nagle et al., 1989; McCluskey,
1985; Kung et al., 1995; Touba and McCluskey, 1997). Among these, BIST has an obvious
advantage in that it reduces the need for expensive test equipment, since the circuit/chip
and its tester are implemented in the same circuit/chip. In short, BIST can generate test
stimuli and analyze test responses without outside support, making tests and diagnoses of
digital systems quick and effective.
Thus, this chapter proposes a BIST design with minimal performance penalty and
significantly smaller area overhead. In normal mode, the BIST circuit is inactive and does
not deliver test patterns to the AD-PEs; each AD-PE in the MECA performs its normal
operation to determine the SAD values. In test mode, the BIST circuits, comprising a test
pattern generator (TPG) and an output response analyzer (ORA), test the array itself. In
terms of results, after employing the presented BIST design, the circuit guarantees 100%
fault coverage with low test application time at low area overhead. The experimental results
prove the effectiveness and value of this work.
SAD(i, j) = Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} | C(k, l) − R(i+k, j+l) |   (1)
where C(k, l) and R(i+k, j+l) represent the current frame's and the search region's
macroblock pixels, respectively (Pirsch et al., 1995). Block matching in the MECA is
performed by a sequential exploration of the search region, while the computations
themselves are performed in parallel (see Fig. 1). The MECA, the parallel architecture for
computing the SAD value, and its corresponding AD-PE structure are shown in Fig. 1(a)
and (b), respectively. Each AD-PE stores the values of C(k, l) and R(i+k, j+l) and receives the
value corresponding to the current position of the reference block in the search region. In
other words, the AD-PEs perform the subtraction and the absolute-value computation. Each
AD-PE then adds its result to the partial result coming from the upper AD-PE (see
Fig. 1(b)). The partial results are accumulated along the columns, and a linear array of
adders performs the horizontal summation of the row sums to compute SAD(i, j). For each
position (i, j) of the reference block, the M-PE checks whether the matching cost, SAD(i, j), is
smaller than the previous smallest SAD value; if so, it updates the register that stores the
smallest SAD value.
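As a software sketch of Eq. (1) and of the M-PE's running-minimum update, the following illustrates full-search block matching; the frame contents and sizes below are illustrative assumptions, not values from the chapter.

```python
def sad(cur, ref, i, j, N):
    """Sum of absolute differences (Eq. 1) between the N x N current
    block and the reference block at offset (i, j)."""
    return sum(abs(cur[k][l] - ref[i + k][j + l])
               for k in range(N) for l in range(N))

def full_search(cur, ref, N, search):
    """Sequentially explore the search region; the running minimum and
    its update play the role of the M-PE."""
    best_sad, best_pos = float("inf"), None
    for i in range(search):
        for j in range(search):
            s = sad(cur, ref, i, j, N)
            if s < best_sad:              # M-PE: keep the smaller SAD
                best_sad, best_pos = s, (i, j)
    return best_sad, best_pos

cur = [[1, 2], [3, 4]]
ref = [[9, 9, 9], [9, 1, 2], [9, 3, 4]]
print(full_search(cur, ref, N=2, search=2))   # (0, (1, 1)): exact match
```

In the hardware, the inner `sad` loop is what the AD-PE array and adder tree compute in parallel; only the outer exploration of (i, j) is sequential.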
Fig. 1. The generic architecture of motion estimation and structure of AD-PE
BIST therefore reduces the need for expensive test equipment and testing time. The targeted
fault model, the test pattern generator, and the output response analyzer are described as
follows.
In addition, to achieve higher controllability and observability for a single AD-PE, each AD-PE
uses ripple-carry adders (RCAs) for the absolute-difference computation and the addition
unit, as shown in Fig. 3. In Fig. 3, the multiplexers (mux) are added for testability, and the
signal sel selects between normal and test mode. In normal mode, the frame data deliver
the pixel values to the RCAs to produce the partial SAD value. In test mode, the test
patterns from the test pattern generator are delivered to the RCAs for testing.
Efficient Built-in Self-Test for Video Coding Cores: A Case Study
on Motion Estimation Computing Array 289
Fig. 3. The AD-PE with testable RCAs: muxes controlled by sel select between frame data and TPG patterns, and dmuxes route the results to the next AD-PE or to the ORA.
In this chapter, the proposed BIST design includes three modes. 1) Normal mode: each
AD-PE performs its normal function, which is to determine the minimum SAD value. 2) Test
mode: the test pattern generator delivers the test patterns to each AD-PE for testing, and the
test results are compressed by the output response analyzer for analysis. 3) Analyze mode:
the output response analyzer uses the compressed signature to determine whether each
AD-PE in the MECA architecture is fault-free.
The proposed test pattern generators for the two RCA lengths (8 bits and 12 bits) are shown
in Figs. 4 and 5, respectively. As an example, the test patterns for the 8-bit RCA are listed in
Table 1; they achieve 100% single stuck-at fault coverage when the LFSRs deliver them to
the RCA. The test pattern generator of the 8-bit RCA comprises four LFSRs. First, LFSR0
generates and delivers the test patterns to the three inputs of FA0 in the sequence
111→011→001→100→010→101→110; the carry into the next stage is then generated by the
adder itself. Second, LFSR1 feeds the inputs of FA1 with the sequence
00→10→01→10→11→11→01. Similarly, LFSR2 and LFSR3 deliver the sequences
10→11→11→01→00→10→01 and 11→01→00→10→01→10→11 to the inputs of FA2 and
FA3, respectively. The inputs an and bn must match up with the different carries, as shown
in Table 1; in other words, the propagation of the test patterns generated by the LFSRs is as
shown in Fig. 6.
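The 100% single stuck-at claim can be checked in software. The sketch below fault-simulates one gate-level full adder against the seven patterns LFSR0 applies to FA0; the two-XOR, two-AND, one-OR netlist is an assumption about the RCA cell, not taken from the chapter.

```python
# The seven three-bit patterns LFSR0 applies to FA0 (all combinations
# except 000), and the internal lines of the assumed full-adder netlist.
PATTERNS = [(1, 1, 1), (0, 1, 1), (0, 0, 1), (1, 0, 0),
            (0, 1, 0), (1, 0, 1), (1, 1, 0)]
LINES = ["a", "b", "cin", "s1", "c1", "c2", "sum", "cout"]

def full_adder(a, b, cin, fault=None):
    """Evaluate the netlist; `fault` is (line_name, stuck_value) or None."""
    def drv(name, val):                      # drive a line, honoring a stuck-at
        return fault[1] if fault and fault[0] == name else val
    a, b, cin = drv("a", a), drv("b", b), drv("cin", cin)
    s1 = drv("s1", a ^ b)
    c1 = drv("c1", a & b)
    c2 = drv("c2", s1 & cin)
    return drv("sum", s1 ^ cin), drv("cout", c1 | c2)

def fault_coverage():
    """Fraction of single stuck-at faults detected by PATTERNS."""
    faults = [(line, v) for line in LINES for v in (0, 1)]
    detected = [f for f in faults
                if any(full_adder(*p) != full_adder(*p, fault=f)
                       for p in PATTERNS)]
    return len(detected) / len(faults)

print(fault_coverage())   # 1.0 -> 100% single stuck-at coverage
```

All 16 single stuck-at faults of this netlist are detected by the seven patterns, which is consistent with the coverage the chapter reports for the LFSR-driven RCA cells.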
Fig. 4. The proposed test pattern generator for 8 bits RCA
Fig. 5. The proposed test pattern generator for 12 bits RCA
Fig. 6. The propagation of the test patterns generated by the LFSRs.
The ORA compresses the test responses into a signature, which is used to check whether the
CUT is fault-free. Testing with a signature analyzer (MISR) has the merits of simplicity and
low hardware cost, because the MISR does not need to store the entire set of test responses.
The MISR of the ORA has two kinds of input data: the AD-PE output data, passed through
XOR gates, and the test patterns (see Fig. 7). The MISR checks the results and propagates the
final result to the last module.
Fig. 7. The MISR of the ORA.
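The signature compaction performed by the MISR can be sketched in software; the 8-bit width and the feedback tap positions below are illustrative assumptions, chosen only to show the principle that a single faulty response changes the final signature.

```python
def misr_step(state, data, taps=(0, 2, 3, 4), width=8):
    """One clock of the MISR: shift, XOR the feedback bit into the
    tapped stages, and XOR the parallel response word into the register."""
    fb = (state >> (width - 1)) & 1
    nxt = (state << 1) & ((1 << width) - 1)
    for t in taps:
        nxt ^= fb << t
    return nxt ^ (data & ((1 << width) - 1))

def signature(responses, width=8):
    """Compact a whole response stream into one signature."""
    state = 0                          # register is reset before testing
    for r in responses:
        state = misr_step(state, r, width=width)
    return state

good = signature([0x12, 0x34, 0x56, 0x78])
bad = signature([0x12, 0x34, 0x57, 0x78])   # one flipped response bit
print(good != bad)                           # True: the signature differs
```

Only the final signature has to be compared against a stored golden value, which is why the ORA needs no per-pattern response storage.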
The operation mode, normal or test, is determined by the test controller. In normal mode,
the circuit under test performs its normal operation to determine the SAD values, whereas
in test mode the MECA architecture tests itself: the test patterns are generated by the TPG,
which is built from four LFSRs, and delivered to each AD-PE. The test results of each
AD-PE are analyzed by the ORA until all AD-PEs have been tested and analyzed
completely.
Fig. 8. Flowchart of the test controller: each AD-PE is tested in turn, any failing AD-PE is recorded, and testing ends when all AD-PEs have been tested.
4. Results discussion
The proposed BIST design was realized in Verilog HDL and synthesized with Synopsys
Design Compiler. Performance comparisons in terms of area overhead, fault coverage, and
number of test patterns are presented here to verify the performance of the proposed BIST
design.
The design was carried out top-down at the gate level in Quartus II by means of waveform
simulation, and it passed both the unit test and the integrated test. Figure 9 shows the
simulated waveforms of an AD-PE (described in Section 2). Figure 10 shows the
functionality of the test pattern generator. In test mode, initial test patterns are scanned in,
an AD-PE is tested, and test responses are scanned out; all related control signals are
generated by the controller. The fault coverage of an AD-PE reaches 100% with 7 test
patterns.
Fig. 9. Simulated waveforms of the AD-PE.
Fig. 10. Simulated waveforms of the test pattern generator (LFSR seed values).
Compared with previous work (Li et al., 2004), the number of test patterns, the pin
overhead, the test time, and the fault coverage are listed in Table 2.
Using the proposed BIST design and fault models, the single stuck-at fault coverage of each
AD-PE reaches 100%, which proves the validity of the test patterns. The area overhead of
the MECA architecture including BIST is 0.2%, which is tolerable in industry.
5. Conclusions
This chapter described a BIST design for the MECA architecture in video coding systems. In
test mode, test patterns are generated by the TPG and scanned into each AD-PE of the
MECA architecture for testing, and the test responses are scanned out; all control signals are
generated by the controller. Experimental results show that the area overhead of the BIST
architecture for the motion estimation architecture is less than 1%, and the fault coverage of
each AD-PE reaches 100%, which proves the validity of the test patterns. Moreover, the
BIST structure can easily be designed and applied to the MECA architecture, showing that
the simplified BIST design of the AD-PE is reasonable.
6. References
Abramovici, M.; Breuer, M. A. & Friedman, A. D. (1990). Digital Systems Testing and
Testable Design. Boston, MA: Computer Science Press, 1990.
Gallagher, P.; Chickermane, V.; Gregor, S. & Pierre, T. S. (2001). A Building Block BIST
Methodology for SOC Designs: A Case Study, Proceedings of International Test
Conference, pp. 111-120, Oct. 2001.
Komarek, T. & Pirsch, P. (1989). Array Architectures for Block Matching Algorithms, IEEE
Transactions on Circuits and Systems, Vol. 36, No. 2, pp. 1301-1308, Oct. 1989.
Kuhn, P. (1999). Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4
Motion Estimation. New York, NY: Kluwer Academic Publishers, 1999.
Kung, C. P.; Huang, C. J. & Lin, C. S. (1995). Fast Fault Simulation for BIST Applications,
Proceedings of the Fourth Asian Test Symposium, pp. 93-99, 1995.
Li, D.; Hu, M. & Mohamed, O. (2004). Built-In Self-Test Design of Motion Estimation
Computing Array, Proceedings of IEEE Northeast Workshop on Circuits and Systems
(NEWCAS'04), June 2004.
Lu, S. K.; Shih, J. S. & Huang, S. C. (2005). Design-for-Testability and Fault-Tolerant
Techniques for FFT Processors, IEEE Trans. VLSI Systems, Vol. 13, No. 6, pp. 732-741,
June 2005.
McCluskey, E. J. (1985). Built-In Self-Test Technique, IEEE Design and Test of Computers, Vol.
2, No. 2, pp. 29-36, Apr. 1985.
Nagle, H. T.; Roy, S. C.; Hawkins, C. F.; Macnamer, M. G. & Fritzemeier, R. R. (1989). Design
for Testability and Built-In Self-Test: A Review, IEEE Transactions on Industrial
Electronics, Vol. 36, No. 2, pp. 129-140, May 1989.
Pirsch, P.; Demassieux, N. & Gehrke, W. (1995). VLSI Architectures for Video Compression:
A Survey, Proceedings of the IEEE, Vol. 83, No. 2, pp. 220-246, Feb. 1995.
15
SOC Design for Speech-to-Speech Translation
1. Introduction
In today’s globalised world, information exchange between different languages is
indispensable. Accordingly, speech-to-speech translation research [1]-[6] has grown into a
leading-edge technology enabling multilingual human-to-human and human-to-machine
interaction. In previous work, two different architectures have been adopted for
speech-to-speech translation research [4],[15]: a conventional sequential architecture
and a fully integrated architecture. The sequential architecture is composed of a speech
recognition system followed by a linguistic (or non-linguistic) text-to-text translation system
and a text-to-speech system [1],[2],[5],[16-20]. The integrated architecture combines speech
feature models and translation models in a manner similar to that used for speech
recognition. This integration enables efficient translation by searching for an optimal
word sequence of the target language through an integrated network such as a finite-state
transducer [4],[21] or other networks [22],[23].
Corresponding to these two architectures, there are two system implementations: client-
server-based systems and stand-alone handheld systems. A critical shortcoming of
client-server-based speech translation systems is that they depend on a server computer; in
other words, they are not available anytime or anywhere. An obvious solution is to build
portable stand-alone speech-to-speech translation handheld devices. To mention just a few:
Isotani et al. [7] used a Pocket PC PDA with a 206 MHz StrongARM/64 MB RAM to build a
speech-to-speech translation system for use in various situations while traveling. Waibel et
al. [8] adopted a Pocket PC PDA with a 400 MHz StrongARM/64 MB RAM to construct a
speech-to-speech translation system for medical interviews. Watanabe et al. [9] developed a
mobile device running on a 400 MHz Pentium II-class processor/192 MB RAM that supports
speech-to-speech translation in various situations during travel abroad. However, these
works show that real-time speech-to-speech translation on a resource-limited device is still a
problem.
Table 1 shows a comparison among related speech-to-speech translation systems, including
JANUS [1], Verbmobil [2], and EUTRANS [4]. Clearly, the client-server-based architecture
can work in real time but is not portable, while the stand-alone system is portable but lacks
real-time performance. Therefore, a VLSI solution is presented by realizing the entire
speech-to-speech translation algorithm within a single chip. This SOC chip requires only a
few peripheral components for complete operation, and is characterized by small size, low
cost, real-time operation, and high reliability. The construction of this chip is accomplished
in two main phases: the software simulation phase and the SOC design phase.
In the software simulation phase, the simulation is based on the multiple-translation
spotting (MTS) method, a kind of integration of speech analysis and language translation
[23]. The proposed MTS approach goes directly from speech to speech, without the
language models used by other automatic speech recognition (ASR) approaches. By
identifying speech features, translation primarily stays in the speech modality and does not
pass through a textual modality. The proposed MTS method not only retrieves the optimal
multiple-translation spotting template, but also extracts the appropriate target patterns.
With the extracted patterns, the target speech can be generated by a concatenation-based
waveform segment synthesis method.
In the SOC design phase, besides a cost-efficient programmable core used for system control
and non-computation-intensive tasks, three specific hardware cores were designed to
perform cepstrum extraction, template retrieval, and target pattern extraction. Moreover,
the A/D converter (ADC) and D/A converter (DAC) are also designed.
The rest of this paper is organised as follows. Section 2 gives the software simulation of the
proposed speech-to-speech translation system. Section 3 discusses the SOC architecture for
the MTS-based speech-to-speech translation system. Finally, a short conclusion is provided
in Section 4.
System | Vocabulary size | Response time | Platform
ATR-MATRIX [5] | 13,000 | 0.1 sec for TDMT | PC-based platform (server-client)
Isotani et al. [7] | 20,000 (English), 50,000 (Japanese) | Slower than VR5500 processor | Pocket PC PDA (206 MHz StrongARM/64 MB RAM)
Watanabe et al. [9] | 10,000 (English), 50,000 (Japanese) | 2~3 sec | Mobile PC (400 MHz Pentium II-class processor/192 MB RAM)
Table 1. A comparison among related speech-to-speech translation systems
(1) The gray blocks list the translation spotting results.
300 VLSI
The target speech is generated with the waveform similarity overlap-and-add (WSOLA)
algorithm [24]. WSOLA introduces a tolerance on the desired time-warping function to
ensure signal continuity at waveform segment joins. With a proper window length and
timing tolerance, WSOLA usually produces high-quality time-scaled speech [25],[26].
Therefore, the system can generate high phonetic/prosodic quality in the translated speech
output. The advantages of this method are the small computational cost of the generation
process and the high intelligibility of the generated speech. The following subsections
further discuss the details of the kernel spotting algorithm within the speech translation.
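A minimal pure-Python sketch of the WSOLA idea described above: within a tolerance around each nominal read position, the segment that best correlates with the natural continuation of what was last written is chosen before overlap-adding. The window length, hop, and tolerance values are illustrative assumptions, not parameters from the chapter.

```python
import math

def wsola(x, rate, win=64, tol=16):
    """Time-scale x by `rate` (> 1 shortens the output); win and tol
    are illustrative parameter choices."""
    hop = win // 2
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / win) for n in range(win)]
    n_out = int(len(x) / rate)
    out = [0.0] * (n_out + win)
    prev = 0                       # start of the previously copied segment
    m = 0
    while m * hop + win < n_out:
        nom = int(m * hop * rate)  # nominal read position
        ref = prev + hop           # natural continuation of the last segment
        best_c, best_d = None, 0
        for d in range(-tol, tol + 1):   # tolerance on the warping function
            s = nom + d
            if s < 0 or s + win > len(x) or ref + win > len(x):
                continue
            c = sum(x[s + i] * x[ref + i] for i in range(win))
            if best_c is None or c > best_c:
                best_c, best_d = c, d
        start = max(0, min(nom + best_d, len(x) - win))
        for i in range(win):       # Hann-windowed overlap-add
            out[m * hop + i] += w[i] * x[start + i]
        prev = start
        m += 1
    return out[:n_out]
```

Because the correlation search keeps consecutive segments phase-aligned, the joins stay continuous even though the read positions deviate from the nominal warping function.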
Speech feature input: "Is a single room still available for tonight"; a hypothesized Mandarin Chinese speech pattern set: 還有 嗎 一 間 單人房 今晚.
Fig. 1. An example of MTS process for the direction of English-to-Mandarin Chinese.
d_A(l, k, j) = d(l, k, j) + min_{k−2 ≤ m ≤ k} d_A(l−1, m, j),   (1)
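In software, this recursion is a dynamic-programming accumulation with the local path constraint m ∈ [k−2, k]; the initialization and the local distances below are illustrative assumptions.

```python
def accumulate(d):
    """d[l][k]: local distortion between input frame l and template frame k.
    Returns the accumulated-distortion table dA of Eq. (1)."""
    L, K = len(d), len(d[0])
    INF = float("inf")
    dA = [[INF] * K for _ in range(L)]
    dA[0] = list(d[0])                       # assumed initialization
    for l in range(1, L):
        for k in range(K):
            # minimum over the three allowed predecessors m = k-2, k-1, k
            prev = min(dA[l - 1][m] for m in range(max(0, k - 2), k + 1))
            dA[l][k] = d[l][k] + prev
    return dA

d = [[1, 2, 3], [4, 1, 2], [9, 9, 1]]
print(accumulate(d)[2][2])   # 3: the path through local costs 1, 1, 1
```

The three predecessors per cell are exactly the three dA inputs of the processing element shown in Fig. 7.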
After ranking all the templates, the hypothesized spotting template is decided from the Top
N candidates with minimum distortion by
v̂ = arg max_{1 ≤ v ≤ N} Σ_{j=1}^{J} v_j   (4)
According to the decided v̂-th template from (4), the target patterns {t_j^v̂}, j = 1, ..., J, can
be obtained from {τ_j^v̂}, j = 1, ..., J. With the target speech patterns determined, these
waveforms are rearranged with adequate overlapping portions to generate speech with the
waveform similarity overlap-and-add (WSOLA) algorithm.
When retrieving the Top 5 templates, Table 3 shows that the spotting accuracy of Sp2 and
Sp3 drops by 10 to 15 percent. A given spotting template is called a match when it captures
the same intention as the input speech. In addition, as the research presented in [31] shows,
speaking rate has a significant effect on recognition accuracy, so adaptation methods for the
duration models of spotting templates are needed. In the proposed approach, as the
template or vocabulary size increases, the growing number of spotting templates leads to
more speech feature vectors, and hence more similarities occur in the speech spotting
measurement, causing false spotting results and lowering spotting accuracy. By collecting
more speech databases, the system can apply speaker-dependent or speaker-independent
HMMs to MTS for more robust speech translation. Speech translation performance is also
degraded by noise. Related work on minimizing the effects of noise on system performance
is presented in [29],[30] and could be applied to the proposed system in the future.
To judge the generated translations from the matched templates, the subjective sentence
error rate (SSER) of [3] is used, with three bilingual evaluators classifying the target
generation results into three categories. Following the SSER evaluation method, good (G),
understandable (U), and bad (B) levels of translation quality are scored as 1.0, 0.5, and 0.0,
respectively. The understandable translation rate is calculated by

rate = [1.0 × (no. of G level) + 0.5 × (no. of U level)] / (no. of tests) × 100%   (5)
The results revealed that the proposed approach achieves roughly 89% and 92%
understandable translation rates from (5) for the Mandarin Chinese-to-English and the
English-to-Mandarin Chinese translations, respectively.
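Equation (5) in code, with hypothetical evaluator labels standing in for real judgments:

```python
def understandable_rate(levels):
    """Understandable translation rate of Eq. (5); each evaluated output
    is labeled 'G' (1.0), 'U' (0.5), or 'B' (0.0)."""
    score = {"G": 1.0, "U": 0.5, "B": 0.0}
    return 100.0 * sum(score[l] for l in levels) / len(levels)

print(understandable_rate(["G", "G", "U", "B"]))   # 62.5
```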
In the following paragraphs, we discuss the three specific processing hardware cores and
the ADC/DAC of our speech-to-speech translation system.
Fig. 2. Block diagram of the SOC architecture for the proposed speech-to-speech translation
system.
For each frame of the windowed speech signal, autocorrelation analysis is performed by the
following formula:
R(k) = Σ_{i=k}^{N−1} x(i−k) x(i),  0 ≤ k ≤ P,   (8)
Fig. 3. An example illustrating the autocorrelation calculation procedure.
where R(k) is the k-th autocorrelation coefficient and P is the order of the LPC analysis,
chosen as 10 in this paper.
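Equation (8) translates directly into code; the short frame below is an arbitrary example.

```python
def autocorrelation(x, P=10):
    """R(k) of Eq. (8) for one windowed frame, k = 0..P."""
    N = len(x)
    return [sum(x[i - k] * x[i] for i in range(k, N)) for k in range(P + 1)]

print(autocorrelation([1, 2, 3], P=2))   # [14, 8, 3]
```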
The most efficient method known for solving this particular system of equations is the
Levinson-Durbin recursion, which can be formulated as follows [33]:
E^(0) = R(0)   (10)

for 1 ≤ i ≤ P:

temp = R(i) − Σ_{j=1}^{i−1} a_j^(i−1) R(i−j)   (11)
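As a sketch of the full recursion: (10) and (11) above are its initialization and the numerator of the reflection coefficient, and the remaining update steps below follow the standard form of the recursion in [33].

```python
def levinson_durbin(R, P):
    """Solve the LPC normal equations from autocorrelations R(0)..R(P)."""
    a = [0.0] * (P + 1)               # a[1..P] become the LPC coefficients
    E = R[0]                          # E^(0) = R(0)              -- Eq. (10)
    for i in range(1, P + 1):
        temp = R[i] - sum(a[j] * R[i - j] for j in range(1, i))  # Eq. (11)
        k = temp / E                  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):         # standard coefficient update
            new_a[j] = a[j] - k * a[i - j]
        a, E = new_a, E * (1.0 - k * k)
    return a[1:], E

# An AR(1)-like autocorrelation sequence: only a1 should be non-zero.
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], P=2)
print(coeffs)   # [0.5, 0.0]
```

The single divide per order (temp / E) is the division the text proposes to realize by prune-and-search rather than a dedicated divider.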
Only a small number of divisions are needed in the LPC computation, so it is not
economical to prepare extra hardware for them. Instead, the division operations can be
performed by prune-and-search [34] on the existing hardware, without an individual
divider.
Fig. 4. The architecture for the LPC computation: a register file (R(0)~R(10), quotient, constants, E, K, temp, a1~a10) feeding shared multipliers and adders through multiplexers.
where C_n is the n-th cepstral coefficient, a_n is the n-th LPC coefficient, and P is the
cepstrum order, which is set equal to the LPC order in this paper.
The memory requirement for the constants (1 − m/n) can be reduced as follows. There are
n−1 constants for any n, so the total number of distinct constants is
1+2+3+...+(P−1) = P(P−1)/2. Considering the pattern of the constant array, let
k_{m,n} = (1 − m/n); then we can show that

k_{n−i,n} = 1 − k_{i,n},   (19)

for i = 1, 2, ..., ⌊(n−1)/2⌋, and if n is even then k_{n/2,n} = 1/2.
Equation (19) implies that k_{i,n} is equal to the complement of k_{n−i,n}; therefore, the size
of the constant array can be reduced by half. This algorithm is implemented by a store-and-
accumulate technique, and the architecture is described in Fig. 5. We need two storage
elements to hold the LPC coefficients and the previous-order cepstral coefficients. In
addition, an accumulator is necessary for computing and totaling the main expression,
(1 − m/n) a_m C_{n−m}, to yield the cepstral coefficients defined in (18b).
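The recursion and the half-sized constant table can be sketched as follows; treating (18b) as C_n = a_n + Σ_{m=1}^{n−1} (1 − m/n) a_m C_{n−m} is an assumption based on the constants described above.

```python
def cepstrum(a, P):
    """LPC -> cepstral coefficients using C_n = a_n +
    sum_{m=1}^{n-1} k_{m,n} a_m C_{n-m}, with k_{m,n} = 1 - m/n
    taken from a half-sized constant table (Eq. (19))."""
    # store only the lower half of each row of constants
    table = {(n, m): 1.0 - m / n
             for n in range(2, P + 1) for m in range(1, (n - 1) // 2 + 1)}
    def k(m, n):
        if n % 2 == 0 and m == n // 2:
            return 0.5                       # midpoint of an even row
        if m <= (n - 1) // 2:
            return table[(n, m)]
        return 1.0 - table[(n, n - m)]       # complement, Eq. (19)
    C = [0.0] * (P + 1)
    for n in range(1, P + 1):
        C[n] = a[n] + sum(k(m, n) * a[m] * C[n - m] for m in range(1, n))
    return C[1:]

print(cepstrum([0.0, 0.5, 0.25, 0.1], P=3))   # C1 = 0.5, C2 = 0.375, C3 ~ 0.267
```

The table holds only ⌊(n−1)/2⌋ entries per row; every other constant is recovered as a complement, halving the ROM as the text describes.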
The dedicated architectures for the autocorrelation analysis, the LPC analysis, and the
conversion from LPC to cepstrum are designed individually. However, since the three
procedures are performed sequentially, a resource-sharing technique is applied so that the
cepstrum extraction core needs only one multiplier and one adder.
The nodes in each column are divided into different blocks according to the patterns they
belong to; thus, the DG in Fig. 6(c) consists of three blocks corresponding to three different
patterns.
To further reduce the size of the DG, a vertical projection is performed in Fig. 6(c). This
projection transforms the three-block DG into a one-block DG (see Fig. 6(e)). Because the
frame numbers of the blocks differ, the largest frame number is taken as the frame number
of the new one-block DG. To deal with the discrepancy of different frame numbers within
different patterns, multiplexers are added. In addition, two extra registers are added in
front of each node to replace the two eliminated blocks. Figure 6(e) is then modified into
Fig. 6(g) to obtain a regular wire connection.
Figures 6(b), 6(d), 6(f) and 6(h) provide another example of the use of horizontal and
vertical projections. The frame number and the register number of a given node differ
between the DGs in Figs. 6(g) and 6(h). To combine the two DGs into a single one, the
largest frame number and the largest register number between them are chosen (see
Fig. 6(i)). The multiplexers added in front of each node provide the selection among the
one-block DGs.
Fig. 6. Horizontal and vertical projections of the template retrieval space: (a), (b) 2-D original template retrieval spaces; (c), (d) 1-D template retrieval spaces after horizontal projection; (e), (f) one-block DGs after vertical projection; (g), (h) modified one-block DGs; (i) the final merged DG.
where R_i denotes the i-th cepstral coefficient in the R-th frame of the template, U_i
represents the i-th cepstral coefficient in the U-th frame of the input speech, and P is the
cepstrum order. The cepstral coefficients of the input speech and the templates are stored in
RAM0 and ROM1, respectively. The distortion unit accesses the cepstral coefficients from
the two memories and accumulates the distortion in register R.
Fig. 7. Architecture of the processing element.
The template retrieval space is divided into blocks, numbered from 0 to 23. Blocks in the
same row belong to the same pattern and have the same number of nodes (frames). For
example, blocks 2, 5, 8, ..., 23 belong to the third pattern of this template. The pattern
extraction method used in this work can be regarded as a two-level address decoding
process. At the first level, the blocks are the addressing units; each block consists of frame
nodes, which become the addressing units at the second level. If b denotes the block index
and k denotes the local node index in a block, each node can also be referred to by the pair
(b, k). For example, node 64 can be referred to as node (21, 1).
The decision word is generated for each block and is denoted by

D = {d_0, d_1, d_2, ..., d_{K−1}, e},   (21)

where K is the largest node count among all blocks, e is the pattern decision information
from the first node of a block, and d_k, 0 ≤ k ≤ K−1, is the decision information from the
k-th node of a block. We use e to indicate the source pattern of an external (between-patterns)
transition, and d_k to indicate the source node of an internal (within-pattern) transition. Let
K_j denote the number of nodes in the j-th pattern.
Pattern extraction starts from the end node and recursively updates the local node index
and the block index to construct the best path in the template retrieval space. The block
index b is updated by

b = b − J, for an internal transition,
b = b − J + new_e − old_e, for an external transition,   (22)

where J is the number of patterns in this template, new_e is the current value of the decision
information e, and old_e is its previous value.
The local node index k is updated by

k = k, for an internal transition with d_k = 0,
k = k − 1, for an internal transition with d_k = 1,
k = k − 2, for an internal transition with d_k = 2,
k = K_e − 1, for an external transition.   (23)
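One backtracking step of (22) and (23) can be written as a small function; the calling convention below (passing the transition type, d_k, and K_e explicitly) is an illustrative assumption.

```python
def next_index(b, k, J, transition, dk=0, Ke=None, new_e=0, old_e=0):
    """Apply one update of Eq. (22) (block index) and Eq. (23)
    (local node index) during pattern extraction."""
    if transition == "internal":
        return b - J, k - dk               # dk is 0, 1 or 2
    # external (between-patterns) transition
    return b - J + new_e - old_e, Ke - 1

print(next_index(21, 1, J=3, transition="internal", dk=1))   # (18, 0)
```

Iterating this step from the end node until a block in the first column is reached traces the best path backwards through the template retrieval space.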
Fig. 8. An example of the two-level block-node addressing for the first pattern: nodes 0, 9, 18, 27, 36, 45, 54, 63 correspond to (0,0), (3,0), (6,0), (9,0), (12,0), (15,0), (18,0), (21,0), and nodes 1, 10, 19, 28, 37, 46, 55, 64 correspond to (0,1), (3,1), (6,1), (9,1), (12,1), (15,1), (18,1), (21,1).
Fig. 9. Flowchart of the pattern extraction procedure: starting from the end node (b, k), the indices are updated by (22) and (23) until a block in the first column of the block-node addressing is reached.
Fig. 10. The architecture of the fully differential successive approximation ADC using a
single reference voltage.
To reduce the area of the successive approximation register, this work presents a simplified
non-redundant successive approximation register (SSAR). The basic architecture of the
SSAR is a multiple-input n-bit shift register, shown in Fig. 11; each stage consists of a
general D flip-flop and a two-input multiplexer. The function of the SSAR is shown in
Fig. 12, where "1" is the input token. Whenever the SSAR is triggered, each register of the
SSAR goes through three modes: (1) before the token has passed, the register value is "0";
(2) while the token stays at a certain register, its value is changed to "1" and it receives the
result of the comparator; (3) after the token has passed, the value determined by the
comparator result is held until the whole conversion is done. This function can be
implemented as follows.
Therefore, while the value of the k-th register of the SSAR is still "0", the Q_{k+1} selected
by the multiplexer is connected to D_k. When the k-th register receives the token, Q_k
changes to "1" and the CMP input is selected. The comparator result Z_k is then held in the
k-th register until the whole conversion is done, which is implemented by gating the clock
CLK_k. The combination of all of the above implementations makes it possible to improve
the accuracy and save chip area.
[Fig. 11 schematic: a chain of multiplexed registers from the MSB to the LSB — D flip-flops D2/Q2 (clocked by CLK2), D1/Q1 (CLK1), D0/Q0 (CLK0), plus a final stage Dp/Qp — where each two-input multiplexer selects between the next stage's output and the comparator output CMP.]

            Q2    Q1    Q0    Qp
1st cycle   1     0     0     0
2nd cycle   Z2    1     0     0
3rd cycle   Z2    Z1    1     0
4th cycle   Z2    Z1    Z0    1
Fig. 12. The function of simplified non-redundant SAR.
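The token behavior of Fig. 12 can be simulated with a short sketch (illustrative only, not the authors' implementation; the strings "Z2", "Z1", "Z0" stand for the latched comparator results, and the extra Qp stage that finally collects the token is omitted):

```python
def ssar_cycles(n, z):
    # Simulate the n-bit SSAR: a token sweeps from the MSB register to the
    # LSB register.  Each register holds "0" until the token arrives, becomes
    # "1" while it holds the token, then latches the comparator result z[i]
    # and holds it for the rest of the conversion.
    state = ["0"] * n            # state[0] is the MSB register
    rows = []
    for i in range(n):
        state[i] = "1"           # token arrives at register i
        rows.append(list(state))
        state[i] = z[i]          # comparator result latched as the token leaves
    rows.append(list(state))     # final conversion result
    return rows
```

For n = 3 and comparator results ["Z2", "Z1", "Z0"] this reproduces the rows of Fig. 12: [1,0,0], [Z2,1,0], [Z2,Z1,1] and finally [Z2,Z1,Z0].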
4. Summary
Speech-to-speech machine translation is a prospective application of speech and language technology. This work presents an MTS-based speech-to-speech translation system between Mandarin Chinese and English. The proposed MTS approach achieves about a 90% understandable-translation rate on average. Toward a portable, real-time speech-to-speech translation system, this work also proposes an SOC realization. The architecture design is based on the semi-ASIC technique, which incorporates a cost-efficient programmable core along with specific hardware accelerators: an LPC extraction core, a template retrieval core, and a pattern extraction core. Besides, an ADC and a DAC are also included. An SRV-based fully differential successive approximation ADC with a reduced SSAR is designed. These hardware cores construct a complete speech-to-speech translation SOC. The proposed SOC chip is the first one dedicated to speech-to-speech translation.
16
A Novel De Bruijn Based Mesh Topology for Networks-on-Chip
1. Introduction
The mesh topology is the most dominant topology for today’s regular tile-based NoCs. It is
well known that mesh topology is very simple. It has low cost and consumes low power.
During the past few years, much effort has been made toward understanding the
relationship between power consumption and performance for mesh based topologies
(Srivasan et al., 2004). Despite the advantages of meshes for on-chip communication, some packets may suffer from long latencies due to the lack of short paths between remotely located nodes. A number of previous works have tried to tackle this shortcoming by adding application-specific links between distant nodes in the mesh (Ogras & Marculescu, 2005), by bypassing some intermediate nodes with express channels (Dally, 1991), or by using other topologies with a lower diameter (Sabbaghi-Nadooshan et al., 2008).
The fact that the de Bruijn network has a logarithmic diameter and a cost equal to the linear
array topology motivated us to evaluate it as an underlying topology for on-chip networks.
De Bruijn topology is a well-known network structure which was initially proposed by de
Bruijn (de Bruijn, 1946) as an efficient topology for parallel processing. Samanathan (Samanathan & Pradhan, 1989) showed that de Bruijn networks are suitable for VLSI implementation, and
several other researchers have studied topological properties, routing algorithms, efficient
VLSI layout and other important aspects of the de Bruijn networks (Park & Agrawal, 1995;
Ganesan & Pradhan, 2003).
In this chapter, we propose a two-dimensional de Bruijn based mesh topology (2D DBM for
short) for NoCs. We will compare equivalent mesh and 2D DBM architectures using the two
most important factors, network latency and power consumption. A routing scheme for the
2D DBM network has been developed and the performance and power consumption of the
two networks under similar working conditions have been evaluated using simulation
experiments. Simulation results show that the proposed network can outperform its
equivalent popular mesh topology in terms of network performance and energy dissipation.
The in-degree and the out-degree of each node are equal to k; therefore, the degree of each node is equal to 2k. The diameter of a de Bruijn graph is equal to n, which is optimal. The de Bruijn graph also has a simple routing algorithm. The case k = 2 is the most popular de Bruijn network and is the one used in this study. Due to the logarithmic (optimal) diameter and the simple routing algorithm, it can be expected that the traffic on the channels of the network will be less than in other networks, resulting in better performance (Ganesan & Pradhan, 2003).
Examples of de Bruijn networks are illustrated in Fig. 1. Several researchers have studied the
topological properties (Liu & Lee, 1993 ; Mao & Yang , 2000) and efficient VLSI layout
(Samanathan & Pradhan, 1989; Chen et al., 1993) of the de Bruijn networks. Moreover, the
scalability problem of de Bruijn networks is addressed in (Liu & Lee, 1993). In a de Bruijn network, data is circulated from node to node until it reaches its destination. Each node has two outgoing (incoming) connections to (from) other nodes via shuffle (rotate left by one bit) and shuffle-exchange (rotate left by one bit and complement the LSB) operations to neighboring nodes (Louri & Sung, 1995). Owing to the fact that these connections are unidirectional, the degree of the network is the same as that of the one-dimensional mesh (or linear array) network. The diameter of a de Bruijn network of size N, that is, the distance between nodes 0 and N−1, is equal to log2(N).
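The shuffle/shuffle-exchange connectivity and the logarithmic diameter described above can be checked with a short sketch (a hypothetical helper, not from the chapter; nodes are n-bit integers):

```python
from collections import deque

def successors(v, n):
    # Shuffle: rotate the n-bit address of v left by one bit.
    mask = (1 << n) - 1
    shuffle = ((v << 1) | (v >> (n - 1))) & mask
    # Shuffle-exchange: the same rotation with the LSB complemented.
    return shuffle, shuffle ^ 1

def diameter(n):
    # Longest shortest path over all ordered pairs, by BFS from every node.
    worst = 0
    for src in range(1 << n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for w in successors(u, n):
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        worst = max(worst, max(dist.values()))
    return worst
```

Here diameter(3) and diameter(4) return 3 and 4, i.e. log2(N) for the 8- and 16-node networks of Fig. 1.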
Fig. 1. The de Bruijn network with (a) 8 nodes and (b) 16 nodes
Fig. 2. A 2D DBM with 64 nodes composed of eight 8-node de Bruijn networks (as shown in Fig. 1a) along each dimension
The 2D DBM networks have some interesting topological properties that motivate us to
consider them as a suitable candidate for on-chip network architectures. The most important
property of 2D DBM networks is that while the number of links in a 2D DBM and an equal-
sized mesh are exactly the same, the network diameter of this network is less than that of the
mesh. More precisely, the diameters of a 2D DBM and a mesh are 2log2(N^0.5) and 2(N^0.5 − 1), respectively, where N represents the network size.
Establishing the new links removes the links between some adjacent nodes (for example, the 1-to-0, 2-to-1, and 3-to-4 connections in Fig. 1) and increases their distance by one hop. In this network, however, the distance between many node pairs is decreased by one or more hops compared to a mesh, and this can lead to a considerable reduction in the average inter-node distance in the network.
The 2D DBM links are unidirectional, and at most 8 unidirectional links are used per node. This is equal to the number of links connected to a node in a mesh (which has 4 bidirectional links). Since the node degree of a topology has an important contribution to (and usually acts as the dominant factor of) the network cost, the proposed topology can achieve a lower average distance than a 2D mesh at almost the same cost. However, we will discuss the area overhead due to the longer links in the 2D DBM in the next sections.
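Under the assumption (per Fig. 2) that the 2D DBM applies the de Bruijn connections independently along each row and each column, the diameter and average-distance claims can be checked by exhaustive BFS (an illustrative sketch, not the authors' code):

```python
from collections import deque

def db_succ(v, n):
    # de Bruijn successors of an n-bit node: shuffle and shuffle-exchange.
    mask = (1 << n) - 1
    s = ((v << 1) | (v >> (n - 1))) & mask
    return s, s ^ 1

def distances(nodes, succ):
    # All-pairs shortest paths by BFS; returns (diameter, average distance).
    total, count, worst = 0, 0, 0
    for src in nodes:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for w in succ(u):
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        worst = max(worst, max(dist.values()))
        total += sum(dist.values())
        count += len(dist) - 1
    return worst, total / count

n = 3                                    # 8 nodes per dimension -> 8x8 network
N = 1 << n
nodes = [(x, y) for x in range(N) for y in range(N)]

def dbm_succ(p):
    # Unidirectional de Bruijn links along the row and along the column.
    x, y = p
    return [(s, y) for s in db_succ(x, n)] + [(x, s) for s in db_succ(y, n)]

def mesh_succ(p):
    # Bidirectional links of the ordinary 2D mesh.
    x, y = p
    return [(x + dx, y + dy)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < N and 0 <= y + dy < N]
```

For the 64-node case, distances(nodes, dbm_succ) gives a diameter of 6 = 2log2(√64), versus 14 = 2(√64 − 1) for distances(nodes, mesh_succ), with a correspondingly lower average distance.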
Park (Park & Agrawal, 1995) has decomposed the de Bruijn network into two graphs, increasing and decreasing. In the ith stage, the MSB of the current node is compared with the (n−i)th bit of the destination node; if they are the same, a shuffle cycle is used, otherwise a shuffle-exchange is used. If the path switches from the increasing graph to the decreasing graph, the virtual channel number is incremented by 1. It is proved that this algorithm for the de Bruijn network is deadlock-free (Park & Agrawal, 1995). Park (Park & Agrawal, 1995) has shown that for a de Bruijn network with N nodes (N=2n) this kind of routing requires
3. Simulation Results
3.1 Simulator
To simulate the proposed NoC topology, we have used an interconnection network simulator developed based on the POPNET simulator (Popnet, 2007) with the Orion power library embedded in it. Orion is a library that models the power consumption of interconnection networks (Wang et al., 2002). Providing detailed power characteristics of the
network elements, Orion enables the designers to make rapid power performance tradeoffs
at the architecture level (Wang et al., 2002). As mentioned in (Wang et al., 2002), the total
energy each flit consumes at a specified node and its outgoing link is given by
As mentioned before, we have also revised the routing algorithm in (Park & Agrawal, 1995) to have a balanced use of virtual channels. Fig. 5 compares the performance of the original routing algorithm and the new routing algorithm in the 8×8 2D DBM using 2 virtual channels with
messages of 32 flits. As can be seen in the figure, the new algorithm exhibits better
performance in terms of average message latency.
Figures 6-9 compare power consumption and the performance of simple 2D mesh and 2D
DBM NoCs under various traffic patterns, network sizes and message lengths. In Fig. 6 and
Fig. 7, the average message latency is displayed as a function of message generation rate at
each node for the 8×8 and 16×16 networks under deterministic routing. As can be seen in the
figures, the 2D DBM NoC achieves a reduction in message latency with respect to the popular
2D mesh network for the full range of network load under various traffic patterns (especially
in uniform traffic). Note that for the matrix-transpose traffic load, it is assumed that 30% of the messages generated at a node are of matrix-transpose type (i.e. node (x,y) sends the message to node (y,x)) and the remaining 70% of messages are sent to other nodes uniformly. For the hotspot traffic load, a hotspot rate of 16% is assumed (i.e. each node sends 16% of its messages to the hotspot node — node (4,4) in the 8×8 network and node (8,8) in the 16×16 network — and the remaining 84% of messages are sent to other nodes uniformly). Note that increasing the network size causes earlier saturation in a simple 2D mesh.
[Plots: average message latency (cycles) versus message generation rate for the 8×8 network with 2 virtual channels; panel (a) for 32-flit and panel (b) for 64-flit messages, with curves for the 2D DBM (bruijn) and the mesh under uniform (u), matrix-transpose (mat), and hotspot (hot) traffic.]
Fig. 6. The average message latency in the 8×8 simple 2D mesh and 2D DBM for different traffic patterns with message sizes of (a) 32 flits and (b) 64 flits
[Plots: average message latency (cycles) versus message generation rate for the 16×16 network with 3 virtual channels; panel (a) for 32-flit and panel (b) for 64-flit messages, with curves for the 2D DBM (bruijn) and the mesh under uniform, matrix-transpose, and hotspot traffic.]
Fig. 7. The average message latency in the 16×16 simple 2D mesh and 2D DBM for different traffic patterns with message sizes of (a) 32 flits and (b) 64 flits
According to the simulation results reported above, the 2D DBM has a better performance
compared to the equivalent simple 2D mesh NoC. The reason is that the average distance a
message travels in the network in a 2D DBM network is lower than that of a simple 2D
mesh. The node degree of the 2D DBM and simple 2D mesh networks (hence the structure
and area of the routers) are the same. However, unlike the simple 2D mesh topology, the 2D
DBM links do not always connect the adjacent nodes and therefore, some links may be
longer than the links in an equivalent mesh. This can lead to an increase in the network area and also create problems in link placement. The latter can be alleviated by using the efficient VLSI layouts proposed for de Bruijn networks (Samanathan & Pradhan, 1989; Chen et al., 1993), as we have done.
Fig. 8 demonstrates the power consumption of the simple 2D mesh and the 2D DBM under a deterministic routing scheme with uniform traffic. It is again the 2D DBM that shows a better behavior before reaching the saturation point. Fig. 9 reports similar results for the hotspot and matrix-transpose traffic patterns in the two networks.
[Plots: power (nJ/cycle) versus message generation rate, with curves for the mesh and the 2D DBM (bruijn) at message sizes of 32 and 64 flits; panel (a) for the 8×8 network and panel (b) for the 16×16 network.]
Fig. 8. Power consumption of the simple 2D mesh and 2D DBM with uniform traffic pattern
and message size of 32 and 64 flits for (a) 8×8 network and (b) 16×16 network
[Plots: power (nJ/cycle) versus message generation rate, with curves for the mesh and the 2D DBM (bruijn) under uniform, hotspot, and matrix-transpose traffic; panel (a) for the 8×8 network and panel (b) for the 16×16 network.]
Fig. 9. Power consumption of the simple 2D mesh and 2D DBM for different traffic patterns and a message size of 32 flits for the (a) 8×8 and (b) 16×16 networks
The results indicate that the power consumption of the 2D DBM network is lower for light to medium traffic loads. The main source of this reduction is the long wires, which bypass some nodes and hence save the power that would be consumed in intermediate routers in an equivalent mesh topology.
Although for low traffic loads the 2D DBM network provides a better power consumption compared to the simple 2D mesh network, it begins to behave differently near the heavy traffic regions.
It is notable that the usual advice for any networked system is not to operate the network near its saturation region (Duato et al., 2005). Considering this, and also the fact that most networks rarely enter such traffic regions, we can conclude that the 2D DBM network can outperform its equivalent mesh network when power consumption is considered.
The area estimation is done based on the hybrid synthesis-analytical area models presented in (Mullins et al., 2006; Kim et al., 2006; Kim et al., 2008). In these papers, the area of the router building blocks is calculated in a 90nm standard cell ASIC technology and then analytically combined to estimate the total router area. Table 1 outlines the parameters. The analytical area models for the NoC and its components are displayed in Table 2. The area of a router is estimated based on the area of the input buffers, network interface queues, and crossbar switch, since the router area is dominated by these components.
The area overhead due to the additional inter-router wires is analyzed by calculating the
number of channels in a mesh-based NoC. An n×n mesh has 2×n×(n-1) channels. The 2D
DBM has the same number of channels as the mesh but with longer wires. In the analysis, the lengths of the packetization and depacketization queues are considered to be 64 flits.
In Table 3, the area overhead of the 2D DBM NoC is calculated for the 8×8 and 16×16 network sizes in a 32-bit wide system. The results show that, in an 8×8 mesh, the total areas of the 2mm links and the routers are 0.0633 mm2 and 0.1089 mm2, respectively. Based on these area estimations, the area of the network part of the 2D DBM shows a 44% increase compared to a simple 2D mesh of equal size. Considering 2mm×2mm processing elements, the increase in the entire chip area is less than 3.5%. Obviously, by increasing the
buffer sizes, the network node/configuration switch area increases, leading to a significant reduction in the relative area overhead of the proposed architecture.
Parameter                                                    Symbol
Flit size                                                    F
Buffer depth                                                 B
No. of virtual channels                                      V
Buffer area (0.00002 mm2/bit (Kim et al., 2008))             Barea
Wire pitch (0.00024 mm (ITRS, 2007))                         Wpitch
No. of ports                                                 P
Network size                                                 N (= n×n)
Packetization queue capacity                                 PQ
Depacketization queue capacity                               DQ
Channel area (0.00099 mm2/bit/mm (Mullins et al., 2006))     Warea
Channel length (2 mm)                                        L
No. of channels                                              Nchannel
Table 1. Parameters
Component            Symbol      Model
Crossbar             RCXarea     Wpitch^2 × P^2 × F^2
Buffer (per port)    RBFarea     Barea × F × V × B
Router               Rarea       RCXarea + P × RBFarea
Network adaptor      NAarea      PQ × Barea + DQ × Barea
Channel              CHarea      F × Warea × L × Nchannel
NoC                  NoCarea     n^2 × (Rarea + NAarea) + CHarea
Table 2. Area analytical model
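The model in Tables 1 and 2 can be turned into a small calculator (a sketch; only the unit constants quoted in Table 1 are from the chapter, and the parameter values used in the example below are illustrative assumptions):

```python
def noc_area(F, B, V, P, n, PQ, DQ, L,
             Barea=0.00002,    # mm^2 per buffer bit (Kim et al., 2008)
             Wpitch=0.00024,   # mm wire pitch (ITRS, 2007)
             Warea=0.00099):   # mm^2 per bit per mm of wire (Mullins et al., 2006)
    # Analytical NoC area in mm^2, following Table 2.
    RCXarea = Wpitch ** 2 * P * P * F * F        # crossbar
    RBFarea = Barea * F * V * B                  # input buffer, per port
    Rarea = RCXarea + P * RBFarea                # router
    NAarea = (PQ + DQ) * Barea                   # network adaptor queues
    Nchannel = 2 * n * (n - 1)                   # channels in an n x n mesh/2D DBM
    CHarea = F * Warea * L * Nchannel            # inter-router wiring
    return n * n * (Rarea + NAarea) + CHarea
```

Note that each 32-bit, 2 mm channel occupies F × Warea × L ≈ 0.0633 mm2, matching the per-link area quoted in the text.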
4. Conclusion
The simple 2D mesh topology has been widely used in a variety of applications especially
for NoC design due to its simplicity and efficiency. However, the de Bruijn network has not
been studied yet as the underlying topology for 2D tiled NoCs. In this chapter, we
introduced the two-dimensional de Bruijn Mesh (2D DBM) network which has the same cost
as the popular mesh, but has a logarithmic diameter. We then conducted a comparative
simulation study to assess the network latency and power consumption of the two
networks. Results showed that the 2D DBM topology improves on the network latency
especially for heavy traffic loads. The power consumption in the 2D DBM network was also
less than that of the equivalent simple 2D mesh NoC.
Finding a VLSI layout for the 2D and 3D DBM networks based on the design considerations of deep sub-micron technology, especially in three-dimensional design, can be a challenging direction for future research in this line.
5. References
http://www.princeton.edu/~lshang/popnet.html, August 2007.
Chen, C.; Agrawal, P. & Burke, JR. (1993). dBCube: A New Class of Hierarchical Multiprocessor Interconnection Networks with Area Efficient Layout, IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 12, pp. 1332-1344.
Dally, WJ. & Seitz, C. (1987). Deadlock-free Message Routing in Multiprocessor
Interconnection Networks, IEEE Trans. on Computers, Vol. 36, No. 5, pp. 547-553.
Dally, WJ. (1991). Express Cubes: Improving the Performance of K-ary N-cube
Interconnection Networks, IEEE Trans. on Computers, Vol. 40, No. 9, pp. 1016-1023.
De Bruijn, NG. (1946). A Combinatorial Problem, Koninklijke Nederlands Akademie van Wetenschappen Proceedings, 49-2, pp. 758–764.
Duato, J. (1995). A Necessary and Sufficient Condition for Deadlock-free Adaptive Routing
in Wormhole Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 6,
No. 10, pp. 1055–1067.
Duato, J.; Yalamanchili, S. & Ni, L. (2005). Interconnection Networks: An Engineering Approach,
Morgan Kaufmann Publishers.
Ganesan, E. & Pradhan, DK. (2003). Wormhole Routing in de Bruijn Networks and Hyper-
de Bruijn Networks, IEEE International Symposium on Circuits and Systems (ISCAS),
pp. 870-873.
ITRS. (2007). International technology roadmap for semiconductors. Tech. rep., International
Technology Roadmap for Semiconductors.
Kiasari, AE.; Sarbazi-Azad, H. & Rezazad, M. (2005). Performance Comparison of Adaptive
Routing Algorithms in the Star Interconnection Network, Proceedings of the 8th
International Conference on High Performance Computing in Asia-Pacific Region
(HPCAsia), pp. 257-264.
Kim, M.; Kim, D. & Sobelman, E. (2006). NoC link analysis under power and performance
constraints, IEEE International Symposium on Circuits and Systems (ISCAS), Greece.
Kim, MM.; Davis, JD.; Oskin, M. & Austin, T. (2008). Polymorphic On-Chip Networks, International Symposium on Computer Architecture (ISCA), pp. 101-112.
Liu, GP. & Lee, KY. (1993). Optimal Routing Algorithms for Generalized de Bruijn Digraph,
International Conference on Parallel Processing, pp. 167-174.
Louri, A. & Sung, H. (1995). An Efficient 3D Optical Implementation of Binary de Bruijn
Networks with Applications to Massively Parallel Computing, Second Workshop on
Massively Parallel Processing Using Optical Interconnections, pp.152-159.
Mao, J. & Yang, C. (2000). Shortest Path Routing and Fault-tolerant Routing on de Bruijn
Networks, Networks, vol.35, pp.207-215.
A Novel De Bruijn Based Mesh Topology for Networks-on-Chip 329
Mullins, R.; West, A. & Moore, S. (2006). The Design and Implementation of a Low-Latency On-Chip Network, Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 164-169.
Ogras, UY. & Marculescu, R. (2005). Application-Specific Network-on-Chip Architecture
Customization via Long-Range Link Insertion, IEEE/ACM Intl. Conf. on Computer
Aided Design, San Jose, pp. 246-253.
Park, H.; Agrawal, DP. (1995). A Novel Deadlock-free Routing Technique for a class of de
Bruijn based Networks, IPPS, pp. 524-531.
Sabbaghi-Nadooshan, R.; Modarressi, M. & Sarbazi-Azad, H. (2008). A Novel high
Performance low power Based Mesh Topology for NoCs, PMEO-2008, 7th
International Workshop on Performance Modeling, Evaluation, and Optimization, pp. 1-7.
Samanathan, MR.; Pradhan, DK. (1989). The de Bruijn Multiprocessor Network: a Versatile
Parallel Processing and Sorting Network for VLSI, IEEE Trans. On Computers, vol.
38, pp.567-581.
Srivasan, K.; Chata, KS. & Konjevad, G. (2004). Linear Programming Based Techniques for
Synthesis of Networks-on-chip Architectures, IEEE International conference on
Computer Design, pp. 422-429.
Wang, H.; Zhu, X.; Peh, L. & Malik, S. (2002). Orion: A Power-Performance Simulator for Interconnection Networks, 35th International Symposium on Microarchitecture (MICRO), Turkey, pp. 294-305.
17
On the Efficient Design & Synthesis of Differential Clock Distribution Networks
1. Introduction
Almost all high-performance VLSI systems in today's technologies are synchronous. These systems use a clock signal to control the flow of data throughout the chip. This greatly facilitates the design process because it provides a global timing framework that allows many different components to operate simultaneously while sharing data. The only price for using synchronous systems is the additional overhead required to generate and distribute the clock signal.
Nearly all on-chip Clock Distribution Networks (CDNs) contain a series of buffers and interconnects that repeatedly regenerate the clock signal on its way from the clock source to the clock sinks. Conventionally, CDNs consisted of only a single-stage buffer driving wires to the clock loads. This is still the case for clock distribution in very small-scale systems; yet contemporary complex systems use multiple buffer stages. A typical clock tree distribution network in modern complex systems is shown in Figure 1. This design is based on the CDNs reported in (O'Mahony et al, 2003; Restle et al, 1998; Vasseghi et al, 1996).
A modern system has thousands of loads to be driven by the clock signal. In CDNs, the loads are grouped together, creating (sub-)blocks. This trend results in a hierarchy in the design of CDNs with three levels/categories of clock distribution, namely global, regional, and local, as shown in Figure 1. At each level of the hierarchy there are buffers associated with that level to regenerate and improve the clock signal.
The global clock distribution connects the global clock buffer to the inputs of the sector buffers. This level of the distribution usually has the longest paths in the CDN because it relays the clock signal from the central point on the die to the sector buffers located throughout the die. The issues in designing the global tree are mostly related to signal integrity, which means maintaining a fast edge rate over long wires while not introducing a large amount of timing uncertainty. Skew and jitter accumulate as the clock signal propagates through the clock network, and both tend to accumulate proportionally to the latency of the path. Because most of the latency occurs in the global clock distribution, it is also a primary source of skew and jitter (Restle et al, 2001). From a design point of view, achieving low timing uncertainty is the most critical challenge at this level.
The regional clock level is defined to be the distribution of clock signals from the sector
buffers to the clock pins. This level is the middle ground between global and local clock
distribution; it does not span as much area as the global level and it does not drive as much
load or consume nearly as much power as the local level.
The local level is the part of the CDN that delivers the clock signal from the clock pins to the loads of the system to be synchronized. This network drives the final loads and hence consumes the most power. As a design challenge, the power at the local level is about one order of magnitude larger than the power in the global and regional levels combined (Restle et al, 2001).
system malfunctioning. Therefore, the timing uncertainty of the clock signal must be estimated and taken into account in the first design stages. The two categories of timing uncertainty in a clock distribution are skew and jitter.
Clock skew refers to the absolute difference in the clock signal's arrival time between two points in a CDN. Clock skew is generally caused by mismatches in devices or interconnect within the clock distribution, or by temperature or voltage variations around the chip. Clock skew has two components: a deterministic component caused by static mismatch (such as imbalanced routing) and a random component caused by device and environmental variations. An ideal clock distribution would have zero skew, which is usually unachievable.
Jitter is another source of dynamic timing uncertainties at a single clock load. The key
measure of jitter for a synchronous system is the period (or cycle-to-cycle) jitter, which is the difference between the nominal cycle time and the actual cycle time: in one cycle the period equals the nominal clock period, while in the next cycle the clock period becomes longer or shorter. The total clock jitter is the sum of the jitter from the clock source and from
the clock distribution. Power supply noise may cause jitter in both the clock source and the
distribution (Herzel et al, 1999).
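The jitter definitions above can be made concrete with a small sketch that extracts period jitter and cycle-to-cycle jitter from a list of rising-edge timestamps; the timestamps and nominal period in the usage note are illustrative:

```python
def period_jitter(edges, t_nominal):
    """Period jitter: deviation of each actual cycle time from the
    nominal clock period, given rising-edge timestamps."""
    periods = [b - a for a, b in zip(edges, edges[1:])]
    return [p - t_nominal for p in periods]

def cycle_to_cycle_jitter(edges):
    """Cycle-to-cycle jitter: difference between consecutive periods."""
    periods = [b - a for a, b in zip(edges, edges[1:])]
    return [q - p for p, q in zip(periods, periods[1:])]
```

For edges at 0.0, 1.0, 2.1 and 3.0 time units with a nominal period of 1.0, the period jitter sequence is approximately [0, +0.1, −0.1].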
The clock network also involves long interconnects, and the large parasitics associated with them contribute to the power consumed by the clock signal. In addition, the clock has the highest switching activity of any net in a chip, which is another reason it consumes a large amount of system power. This power consumption can be as high as 50% of the total power consumption of the chip according to (Zhang et al, 2000). The components of the power consumption of a CDN are static, dynamic and leakage power. The power consumption due to leakage current in CDNs is relatively small. Likewise, keeping proper rise/fall times minimizes the static power consumption. Thus the main portion is the dynamic power consumption, estimated as

Pdyn = α·CL·Vdd²·fclk

where α is the switching activity (α = 1 for a clock net), CL is the total switched capacitance, Vdd is the supply voltage and fclk is the clock frequency.
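The conventional dynamic-power estimate, Pdyn = α·C·Vdd²·f, can be exercised with a one-line calculator; the capacitance, voltage and frequency values in the usage note are illustrative:

```python
def clock_dynamic_power(c_switched, vdd, f_clk, alpha=1.0):
    """Dynamic power P = alpha * C * Vdd^2 * f. For a clock net the
    switching activity alpha is conventionally taken as 1, since the
    clock toggles every cycle."""
    return alpha * c_switched * vdd ** 2 * f_clk
```

For example, 100 pF of switched clock capacitance at Vdd = 1.8 V and 1 GHz dissipates about 0.32 W.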
2.1 Preliminaries
Vdiff =V1-V0
Differential signaling requires more routing, wires and pins than its single-ended counterpart. In return for this increase, differential signaling offers the following advantages over single-ended signaling:
a. A differential system serves as its own reference. The receiver at the far end of the system compares the two signals of the pair to detect the value of the transmitted information. Transmitters are less critical in terms of noise, since the receiver compares the two signals of the pair rather than comparing against a fixed reference. As a result, any noise common to both signals is cancelled.
b. The voltage difference of the signal pair between logic ‘1’ and ‘0’ is:
ΔV=2(V1-V0)
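The common-mode rejection argument in point (a) can be illustrated with a toy receiver model; the voltage levels and the injected noise value below are arbitrary illustrations, not values from the chapter:

```python
def single_ended_rx(v, v_ref=0.9):
    """Single-ended receiver: compares against a fixed reference, so
    common-mode noise shifts the signal relative to v_ref."""
    return 1 if v > v_ref else 0

def differential_rx(v_pos, v_neg):
    """Differential receiver: compares the two wires of the pair, so
    any noise common to both wires cancels in the difference."""
    return 1 if v_pos > v_neg else 0
```

With a logic-0 pair (v_pos, v_neg) = (0.6, 1.2) and 0.5 V of common-mode noise added to both wires, the differential receiver still resolves a 0, while the single-ended receiver misreads the shifted 1.1 V as a logic 1.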
On the Efficient Design & Synthesis of Differential Clock Distribution Networks 335
Fig. 3. A segment of a coupled interconnect (per-unit-length elements: series resistance r·dx, self-inductance l·dx, ground capacitance cg·dx, coupling capacitance cc·dx and mutual inductance lm·dx)
Load             rc       rΔ
Resistor         R        R
Current-mirror   1/gm     −1/(λI)
Cross-coupled    1/gm     1/gm
Table 1. Impedance of differential loads
Fig. 6. (a) Low-Swing: Low-Swing (b) Low-Swing: Full-Swing DT differential buffers
Vdiff_low=Vdd-R·Iss
The above equation implies that, in order to increase the differential voltage swing, the tail current needs to be increased, which strongly affects the power consumption of the DCDN. Note that it is not possible to modify the load (R), as it directly affects the bandwidth of the clock network. Therefore, in previous works, the differential voltage swing had to be increased to reach a sufficient output swing. Correspondingly, a circuit technique that instead reduces the common-mode voltage is proposed to address this design problem.
The proposed technique for differential receiver is given in Figure 7 (Zarrabi, 2006). The
buffer configuration is based on Chappell amplifier as introduced in the previous section.
Attached to the buffer are the level-shifting circuits. The buffer functionality is as follows:
The dashed parts in Figure 7 are the level shifters (also referred to as source followers)
(Razavi, 2001). When the input is applied to the gate terminals of the level shifters, the
outputs are dropped and follow their inputs. In other words, the voltage gain equals one (no
voltage amplification), and the following relations are applicable (Broderson, 2005):
I = (β/2)·(VIN − VOUT − VT)²
VOUT = I·Rs
VOUT = (Rs·β/2)·(VIN − VOUT − VT)²
VIN = VOUT + VT + [2·VOUT/(Rs·β)]^0.5
340 VLSI
The last result shows that VOUT can be derived by solving the final equation iteratively.
However, by making the first order approximation that RS is large enough (especially in
current sources) to make the third term equal to zero, we can conclude:
VOUT= VIN-VT
This shows that the output of the source-follower circuit copies the input at its gate with a shift of one transistor threshold, which is a technology-dependent quantity. The transistor ratios for buffer sizing are the same as those given in (Chappell et al, 1988). However, the total size of the buffer is scaled to minimize the skew.
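The iterative solution of the source-follower relation VIN = VOUT + VT + [2·VOUT/(Rs·β)]^0.5 mentioned above can be sketched with a simple fixed-point loop; the VT, Rs and β values below are illustrative placeholders, not values from the chapter:

```python
import math

def source_follower_vout(vin, vt=0.4, rs=50e3, beta=2e-4, iters=100):
    """Fixed-point iteration VOUT <- VIN - VT - sqrt(2*VOUT/(Rs*beta)),
    starting from the large-Rs approximation VOUT = VIN - VT."""
    vout = max(vin - vt, 0.0)
    for _ in range(iters):
        vout = max(vin - vt - math.sqrt(2.0 * vout / (rs * beta)), 0.0)
    return vout
```

For large Rs·β the correction term vanishes and the result approaches VIN − VT, matching the first-order approximation above.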
The above configuration for the differential receivers helps lower the common-mode (DC
bias) of the internal input transistors of the receiver. Utilizing this design technique, it is
possible to further reduce the differential voltage swing while maintaining a sufficient
output swing at the final nodes.
In order to perform differential voltage scaling in DCDN, previously a new design for level
converter was given. For the case of intermediate buffers, in order to be able to vary the
differential voltage while maintaining the linearity of the buffer, the differential load should
be reconfigured in a way to establish this design goal. In this part, a new configuration for
differential load is proposed which enables us to have linearity in the buffer. Figure 8 shows
the proposed buffer configuration.
The dashed part demonstrates the proposed composite configuration of the differential load.
Such composition enables the circuit to combine both the characteristics of the diode
connected device and triode transistor together to have a linear operating load in various
voltage ranges (Dally et al, 1998). The proposed buffer based on composite differential load
is a technology portable design and can be used in any available design process whereas the
use of resistance is limited to current and future advanced technologies. This portable
design method comes at the price of increase in area and parasitic elements. The transistor
ratios (for buffers sizing) are 1 to 3 which refer to the ratio of pull up to pull down
transistors (L=2Lmin to reduce the channel length modulation effect). The total size of the
buffer is scaled to reach the objective frequency of operation.
As was seen in Section 2.1.2, the effective capacitance associated with each segment of a coupled line, considering both intrinsic and mutual effects, is Ceff = cg + 2cc, since the two lines of a differential pair switch in opposite phases (a switch factor of two applies to the coupling capacitance). The zero-skew tapping point is then derived as follows. Figure 9 shows a schematic of a decoupled clock tree branch in which each line of the branch is a decoupled distributed RC model connected to its sub-tree child, for which
the distributed line propagation delay is given by tint=0.37RintCeff. Each sub-tree is modeled
by a total capacitance Csubtree and total propagation delay tsubtree as shown in Figure 9.
Considering tapping location x, to satisfy the equality of the two branch delays, the
following equation is realized:
tint1 + 0.74·Rint1·Ceffsubtree1 + tsubtree1 = tint2 + 0.74·Rint2·Ceffsubtree2 + tsubtree2    (*)
In the second term on each side of the equality, the interconnect resistance combined with the sub-tree capacitance forms a lumped load, giving the lumped propagation delay 0.74·Rint·Ceffsubtree.
Rewriting interconnect parasitics by per unit length parameters, we have:
Rint1 = r0·x·l,  Cint1 = c0·x·l
Rint2 = r0·(1 − x)·l,  Cint2 = c0·(1 − x)·l
where r0 and c0 are the resistance and capacitance per unit length of the wire, l is the total interconnect length between the two sub-trees, and x is the normalized tapping location. Solving Equation * with respect to x results in:
x = [1.35·(tsubtree2 − tsubtree1) + r0·l·(Ceffsubtree2 + 0.5·c0·l)] / [r0·l·(Ceffsubtree1 + c0·l + Ceffsubtree2)]
In case of (x ≤ 0 or x ≥ 1), elongation would be needed. Elongation is the process of adding
extra wire length to the sub-tree which has less effective capacitance, in order to equalize the
delay of both sub-trees. The length of elongation to maintain zero skew is given by:
L’ = [−20·r0·Ceffsubtree2 + 2·(100·r0²·Ceffsubtree2² + 270·r0·c0·(tint2 − tint1))^0.5] / (20·r0·c0)
This methodology is applied for zero-skew routing in DCDNs. The results given in Section 5.1 validate the efficiency of this methodology.
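A minimal sketch of the tapping-point computation above, with t1 and t2 taken as the sub-tree delays and 1.35 used in its exact form 1/0.74; the parasitic values in the test are arbitrary:

```python
import math

def zero_skew_tap(t_sub1, t_sub2, ceff1, ceff2, r0, c0, l):
    """Normalized tapping location x along a wire of length l joining
    two sub-trees, equalizing the two branch delays
    t_int + 0.74*R_int*Ceff_subtree + t_subtree."""
    num = (t_sub2 - t_sub1) / 0.74 + r0 * l * (ceff2 + 0.5 * c0 * l)
    den = r0 * l * (ceff1 + c0 * l + ceff2)
    return num / den

def elongation(dt, ceff2, r0, c0):
    """Extra wire length L' that contributes a delay of dt through
    0.37*r0*c0*L'**2 + 0.74*r0*ceff2*L', per the closed form above."""
    return (-20 * r0 * ceff2
            + 2 * math.sqrt(100 * (r0 * ceff2) ** 2 + 270 * r0 * c0 * dt)) / (20 * r0 * c0)

def branch_delay(wire_len, ceff, t_sub, r0, c0):
    # Distributed line delay + lumped sub-tree loading + sub-tree delay.
    return 0.37 * r0 * c0 * wire_len ** 2 + 0.74 * r0 * wire_len * ceff + t_sub
```

Plugging the computed x back into the two branch-delay expressions confirms that both branches see the same delay, which is exactly the zero-skew condition of Equation *.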
Fig. 10. Parallel DCDN distribution: a) partitioning the die area into sub-regions (P0–P3), b) locating the clock-root of each region, c) finding the source of the clock network from the clock-sinks
The methodology for parallel synthesis of zero skew DCDNs is as follows. Initially the total
chip area is partitioned into sub-regions (partitioning phase). Later, synthesis of zero skew
differential clock distribution networks is performed on each of the partitioned regions
(local clock distribution phase). In the final stage, the global differential clock network is
routed for each of the previously-extracted clock-roots of the sub-regions (global clock
distribution phase). The obtained source of the clock network can end up anywhere in the
whole chip area (Manhattan surface), regardless of the initial partitioning. The proposed
scenario is illustrated in Figure 10. The proposed method may be implemented using C++
language and the Message Passing Interface (MPI) platform (MPI). A pseudo-code
describing the method is given in Figure 11.
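A minimal sketch of the three-phase flow, in the spirit of the pseudo-code of Figure 11 (not reproduced here): the zero-skew synthesis of each region is replaced by a hypothetical stand-in that returns the centroid of the region's sinks as its clock-root, and a Python process pool stands in for the MPI ranks of the original implementation:

```python
from concurrent.futures import ProcessPoolExecutor

def synth_region(sinks):
    """Stand-in for zero-skew synthesis of one region: returns a
    clock-root location (here simply the centroid of the sinks)."""
    xs, ys = zip(*sinks)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def partition(sinks, nx=2, ny=2):
    """Partitioning phase: split the die area into nx*ny rectangles."""
    xmin, xmax = min(s[0] for s in sinks), max(s[0] for s in sinks)
    ymin, ymax = min(s[1] for s in sinks), max(s[1] for s in sinks)
    buckets = [[] for _ in range(nx * ny)]
    for x, y in sinks:
        i = min(int((x - xmin) / (xmax - xmin + 1e-12) * nx), nx - 1)
        j = min(int((y - ymin) / (ymax - ymin + 1e-12) * ny), ny - 1)
        buckets[j * nx + i].append((x, y))
    return [b for b in buckets if b]

def parallel_dcdn(sinks, workers=None):
    """Local phase over the partitions (in parallel when workers is
    set), then a global phase over the extracted clock-roots."""
    regions = partition(sinks)
    if workers:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            roots = list(pool.map(synth_region, regions))
    else:
        roots = [synth_region(r) for r in regions]
    return synth_region(roots)  # global network over the clock-roots
```

The returned clock source can land anywhere on the Manhattan surface, regardless of the initial partitioning, as the text notes.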
A possible negative side effect of parallel synthesis is the increase in the total wire-length in
the clock network. This could be interpreted as the impact of multi-stage distribution of the
clock network which results in initial local zero-skew clock networks and a final global clock
network routed on top of regional clock networks. In general, this parallel processing
approach results in a clock-tree different from the one routed in a single step, due to die area
partitioning; thus, the characteristics of the new clock tree such as total wire-length and
skew may be slightly different. This proposed methodology is flexible, as it allows having a
hybrid (differential and/or single-ended) distribution of the clock network. The global CDN
could be differential, while the local (lower levels) CDNs could be single-ended to alleviate
routing complexity. It is possible to enhance the global/local distribution algorithm with
refined interconnect models. This methodology is also applicable to all
symmetric/asymmetric clock-trees.
5. Results
In this section, the quantitative results related to the given design and synthesis methods for
DCDNs are given.
Table 3 shows the skew and delay difference for similarly sized, buffered DCDNs based on the conventional (Dally et al, 1998) and the proposed buffers. The proposed buffers show a 25% average delay and skew improvement over conventional buffers in the low-swing differential clocking scheme. Results show that the delays are reduced significantly while skews are
degraded as compared with un-buffered DCDNs. It is believed that skews in buffered clock
networks can be reduced significantly by enhancing the process of buffer insertion. For
instance, differential buffers delay model should be considered when tapping points are
selected in the zero skew DCDN design algorithm.
(Figure: skew and delay for the Single, Differential (SS) and Differential (DS) clocking schemes.)
With regard to the skew sensitivity of the proposed DT DCDNs, two types of external aggressors resulting in random skew are investigated: power-supply variations and crosstalk. Comparisons were made between a similarly designed single-node CDN, single-
spaced DCDNs and double-spaced DCDNs (Figure 12). Benchmark r3 is used for
simulations due to its average characteristic in terms of size and simulation time. For the
low-swing scheme, the power supplies were VddH=1.8V & VddL=0.5V, whereas for full-swing
scheme a single supply voltage (VddH=1.8V) was used. In these experiments, supply voltages were varied by ±10%.
Simulation results show that for both clocking schemes, the single-spaced DCDN is the most
robust design method in the presence of power-supply variations when compared to other
CDNs. Skew variations increase when low-swing clocking is used. Double-spaced DCDN
has less robustness to supply variations. DCDN is seen to have up to 25% less skew
variations in low-swing clocking scheme and up to 9% less skew variations in full-swing
clocking scheme than single-node CDN, in presence of power-supply variations.
Another source of perturbation that causes delay uncertainty in CDNs is crosstalk. For
experiments, a full-swing aggressor is applied to one of the two big children of the clock tree.
The same low-swing and full-swing clocking schemes were considered. Simulation results
show that the single-spaced DCDN shows 6% less skew variations when combined with low
swing clocking scheme and 9% when combined with full swing clocking scheme as
compared to single-node CDN subject to crosstalk.
(Figure: skew variations for different low-swing CDNs vs. supply voltage VddH–VddL, from 1.62–0.45 V to 1.98–0.55 V, and for different full-swing CDNs vs. supply voltage, from 1.62 V to 1.98 V; curves: Single, Differential (SS), Differential (DS); y-axis: % skew variations.)
Fig. 14. Power consumption for voltage scaled DCDNs vs. a single-node CDN (r3).
Figure 14 shows that as the differential swing increases beyond 25% of supply-voltage
(450mV, Vdd=1.8V), the power consumption increases drastically. This emphasizes the
significant impact of large differential voltage swings on the power consumption of the
clock network. For differential voltage swings below 450mV, the power consumption is not
reduced much if the differential swing scaling is further reduced. A lower bound of 10% of
Vdd was imposed on the differential swing to ensure a sufficient noise margin. Another
consideration is that even with a differential swing as low as 10% of Vdd (180mV), the power
consumption of the differential clock network remains almost 30% higher than that of
single-node clock distribution network. Thus, trying to match the power dissipation of a
single-node network by decreasing the swing of differential networks does not appear to be
a viable option. A final observation from Figure 14 is that the DCDN with grounded-gate
loads (GND) consumes less power over a limited region where the differential swing is
large. However, as we will see in the following, this slight reduction in dissipated power
comes at a large price in clock skew variability.
Fig. 15. Peak to peak skew variations in differential voltage scaled DCDNs.
Taking these considerations into account, we consider a design point for which the differential swing is 25% of Vdd (450mV), and we reduce the supply voltage to the point where the differential clock network reaches the same power consumption as the single-node one. HSPICE simulations demonstrate that this occurs for a supply voltage of 1.4V and a differential swing of 450mV. Yet, as can be observed from Figure 16, the variation of clock
skew is still less than that of the comparable single-node CDN. Another interesting point observed during supply voltage scaling in DCDNs is a negligible signal latency difference. This can be justified: as the tail current is lowered to achieve lower differential swings, the differential voltage needed for the differential buffer to switch also decreases. This enables the differential buffer to operate and switch faster than in the case where a greater supply voltage with a greater differential swing is used. Also, as observed from Figure 16 and as discussed previously, the DCDN based only on grounded-gate loads is less resilient.
(Figure legend: Single; Differential (1.8V) composite load; Differential (1.4V) composite load; Differential (1.4V) GND load; annotated regions indicate less power consumption and less variations.)
Fig. 16. HSPICE simulations show fewer variations in the DCDN compared to the single-node CDN for equal nominal power consumption.
Fig. 17. Speed-up approaches its maximum as the size of the clock network increases, for the 2- and 4-processing-node synthesis cases.
6. Conclusions
In this chapter, some techniques for efficient design and synthesis of on-chip Differential
Clock Distribution Networks (DCDNs) were given.
Initially, design techniques were proposed that improve the performance of differential buffers, which in turn improves the performance of DCDNs. This was achieved by introducing configurations for differential buffers based on Dynamic Threshold (DT) transistors. It was shown that, for low supply voltages, they outperform the conventional buffers with a 25% delay reduction. Also, in order to overcome the high power consumption of DCDNs, a circuit configuration was proposed that makes it possible to reduce the differential voltage swings (down to 10% of Vdd), which reduces the power consumption significantly (although it remains about 30% higher than a single-node CDN). Furthermore, by scaling the supply voltage of the system from 1.8V to 1.4V, we reach a design point where the DCDN
Various synthesis techniques were introduced that improve the DCDNs routing to achieve
low (and possibly zero) skew. For this, a line equivalent delay model was suggested by
which it is possible to route DCDNs with low (zero) skew. On average, 97% skew reduction
was obtained utilizing this model compared to the classic Elmore delay model. A
methodology for parallel distribution (routing) of zero skew DCDNs was also proposed.
The method is applicable to all symmetric/asymmetric clock networks with ability for
hybrid implementation (differential and/or single-ended). The proposed method alleviates
the problem of high computational cost of such CDNs in complex VLSI systems. Utilizing
this method, nearly-linear speed-up is achieved for zero skew DCDNs.
In the hierarchy of CDNs in modern high-performance complex systems, DCDNs fit most effectively at the global level; yet they can be used as the sole solution for the clock distribution of the system when noise is the main design issue.
7. References
Anderson, F. E.; Wells, J. S. & Berta, E. Z. (2002). The core clock system on the next
generation Itanium microprocessor, in ISSCC Digest of Technical Papers, pp. 146-7.
Assaderaghi, F.; Sinitsky, D.; Parke, S.A.; Bokor, J.; Ko, P.K. & Hu, Chenming. (1997). Dynamic threshold-voltage MOSFET (DTMOS) for ultra-low voltage VLSI, IEEE Transactions on Electron Devices, Volume 44, Issue 3, pp. 414–422.
Banerjee, Prithviraj. & Xing, Zhaoyun. (1992). A parallel algorithm for zero skew clock tree
routing, International Symposium on Physical Design, pp. 118 – 123.
Banerjee, Prithviraj. (1994). Parallel Algorithms for VLSI Computer-Aided Design, PTR
Prentice Hall, Englewood Cliffs, New Jersey 07632.
Broderson, Bob. (2005). Analog Integrated Circuits, online material, available:
http://bwrc.eecs.berkeley.edu/People/Faculty/rb/.
Chappell, B.A.; Chappell, T.I.; Schuster, S.E.; Segmuller, H.M.; Allan, J.W.; Franch, R.L. & Restle, P.J. (1988). Fast CMOS ECL receivers with 100-mV worst-case sensitivity, IEEE JSSC, Volume 23, Issue 1, pp. 59–67.
Cong, J.; He, L.; Koh, C. K. & Madden, P. (1996). Performance Optimization of VLSI
Interconnect Layout, Integration, the VLSI Journal, vol. 21, pp. 1-94.
Dally, William J. & Poulton, John. (1998). Digital Systems Engineering, Cambridge
University Press.
Hall, S.H.; Hall G.W. & McCall, J.A. (2000). High-Speed Digital system Design, A Handbook
of Interconnect theory and Design Practices. John Wiley & Sons INC.
Herzel, F. & Razavi, B. (1999). A study of oscillator jitter due to supply and substrate noise, IEEE Trans. Circuits and Systems II, Volume 46, pp. 56–62.
Kahng, A.B.; Muddu, S.; Sarto, E. (2000). On switch factor based analysis of coupled RC
interconnects, Design Automation Conference, pp. 79 – 84.
MPI, Message Passing Interface, online: http://www.mpi-forum.org/.
O’Mahony, Frank P. (2003). 10GHz Global Clock Distribution Using Coupled Standing-
wave Oscillators, PhD Dissertation, Stanford University.
Razavi, Behzad. (2001). Design of Analog CMOS Integrated Circuits. Mc Graw Hill.
Restle, P. J. & Deutsch, A. (1998). Designing the best clock distribution network, in Symposium on VLSI Circuits Digest of Technical Papers.
Restle, P.J.; McNamara, T.G.; Webber, D.A.; Camporese, P.J.; Eng, K.F.; Jenkins, K.A.; Allen,
D.H.; Rohn, M.J.; Quaranta, M.P.; Boerstler, D.W.; Alpert, C.J.; Carter, C.A.; Bailey,
R.N.; Petrovick, J.G.; Krauter, B.L. & McCredie, B.D (2001). A clock distribution
network for microprocessors, IEEE J. Solid-State Circuits, vol. 36, no.5, pp. 792-799.
Sekar, D.C. (2005). Clock trees: differential or single ended?, International Symposium on Quality of Electronic Design, pp. 548–553.
Tsay, R. S. (1991). Exact zero skew, in Proc. IEEE Int. Conf. Computer-Aided Design, pp.
336–339, Nov.
Vasseghi, N.; Yeager, K.; Sarto, E. & Seddighnezhad, M. (1996). 200-MHz superscalar RISC
microprocessor, IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1675-1685.
Wikipedia, online: http://en.wikipedia.org/wiki/Phase-locked_loop
Zarrabi, Houman. (2006). On the design and synthesis of differential clock distribution
network, MASc Dissertation, Concordia University.
Zarrabi, Houman; Saaied, Haydar; Al-Khalili, A. J. & Savaria, Yvon. (2006). Zero Skew
Differential Clock Distribution Network, International Symposium on Circuit And
Systems (ISCAS), Greece, Island of Kos.
Zarrabi, Houman; Zilic, Zeljko; Al-Khalili, A. J. & Savaria, Yvon. (2007). A methodology for
parallel synthesis of zero skew differential clock distribution networks, Joint
Conference of MWSCAS/NEWCAS, Montreal, Canada.
Zhang, H.; Varghese George & Rabaey, J. M. (2000). Low-swing on-chip signaling
techniques: effectiveness and robustness, IEEE Trans. on VLSI Syst., Volume 8,
Issue 3, pp. 264 – 272.
Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 353
18
1. Introduction
The proliferation of communication and consumer electronic systems leads to large demand for high-performance and robust analog/mixed-signal circuits. On the other hand,
although deeply scaled CMOS technologies enable greater degrees of semiconductor
integration and lower manufacturing cost, the advancements of technologies also introduce
several new challenges for VLSI circuit design. The increasing parametric variations and
their impacts on circuit performances are becoming key issues which make already complex
circuits even more sophisticated in design practice (Nassif, 2001). Given these barriers, designing robust mixed-signal circuits in deeply scaled CMOS technologies becomes a real challenge for circuit designers.
In this chapter, we propose to solve the problems by first introducing efficient and accurate
modeling techniques for large analog/mixed-signal circuit designs with consideration of
process variations. Powerful statistical dimension reduction techniques are utilized to make
performance modeling of large circuits possible. Then these novel circuit models are used to
achieve efficient parametric system performance analysis. In this way, designers can have full knowledge of the real circuit performance under process variations, which makes it feasible to perform robust system topology design and circuit optimization for large mixed-
signal systems. The efficient modeling framework is also extended for circuit test purposes,
which leads to robust Built-in Self-Test (BIST) circuit design and optimization.
We demonstrate the effectiveness of the proposed ideas with two popular types of mixed-
signal circuit examples, Sigma-Delta A/D converters and Phase-Locked Loops (PLLs).
For Sigma-Delta ADCs, we present a novel parameterized lookup table (LUT) technique for
capturing performances of building blocks in the systems, and use these LUTs to perform
topology trade-off analysis and system optimization. Modeling of circuit level
nonlinearities, adaptive LUT generation and robust system design are explained in detail
with comprehensive experimental results (Yu & Li, 2006 & 2007a).
For PLLs, we discuss building parametric Verilog-A models for charge-pump PLLs and use
these models for high-level performance analysis. In order to handle the large number of parametric variables, a dimension reduction technique is applied to reduce simulation complexity. We apply the obtained system simulation framework to evaluate the
efficiencies of parametric failure detecting of different BIST circuits and perform
optimization based on the experimental results (Yu & Li, 2007b).
The difference between the ideal digital output of the quantizer and the actual analog signal is called quantization noise. The goal of the Sigma-Delta technique is to eliminate this unwanted quantization noise as much as possible. By oversampling the input signal, the modulator moves the majority of the quantization noise out of the signal bandwidth. The principle of noise shaping can be analyzed using transfer functions, which can be obtained using a linear model in the frequency domain. The quantization noise E(z) is modelled as additive noise at the quantizer, and the output of the quantizer Y(z) can be written as
Y(z) = [H(z) / (1 + d·H(z))]·X(z) + [1 / (1 + d·H(z))]·E(z)    (1)
where d is the feedback gain of the DAC, X(z) is the input signal, H(z) is the transfer
function of the loop filter. By configuring H(z) and d, we can have different noise shaping
functions so that the signal-to-noise ratio in the output can be optimized.
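Equation (1) can be explored numerically. With an ideal first-order integrator H(z) = 1/(z − 1) and d = 1 (an illustrative choice, not a topology from the chapter), the noise transfer function 1/(1 + d·H(z)) reduces to 1 − z⁻¹, a high-pass response that pushes quantization noise out of the signal band:

```python
import cmath
import math

def ntf_mag(f_over_fs, H, d=1.0):
    """|1 / (1 + d*H(z))| evaluated on the unit circle
    z = exp(j*2*pi*f/fs): the noise transfer function of equation (1)."""
    z = cmath.exp(2j * math.pi * f_over_fs)
    return abs(1.0 / (1.0 + d * H(z)))

def stf_mag(f_over_fs, H, d=1.0):
    """|H(z) / (1 + d*H(z))|: the signal transfer function of (1)."""
    z = cmath.exp(2j * math.pi * f_over_fs)
    return abs(H(z) / (1.0 + d * H(z)))

H1 = lambda z: 1.0 / (z - 1.0)  # ideal first-order loop filter
```

Near DC the NTF magnitude tends to zero (in-band noise suppressed) while the STF magnitude stays at 1 (here H/(1 + H) = 1/z); at f = fs/2 the NTF magnitude rises to 2.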
The new output of an integrator is determined by its previous state, the new analog input and the output of the DAC, respectively. This property of Sigma-Delta ADCs makes it possible to predict the new integrator output using the previous state and the new input. The previous state of the integrator, the digital feedback and the new analog input are discretized at a set of discrete voltage levels that are used as the indices to the lookup table models.
y[k+1] = F(y[k], x[k+1], d[k+1])    (2)
Fig. 2. Integrator behaviours under clocking
As illustrated in equation (2) and Fig. 2, the output of an integrator is a function of the input
signals and the initial state of the integrator, which are discretized to generate the lookup
table entries. The number of discretization levels depends on the accuracy requirement of
the simulation. The internal circuit node voltage swings can be estimated by the system
architecture. For low-voltage Sigma-Delta ADC designs, the internal voltages can change
from 0 to supply voltage Vdd. To cover the whole range of voltage swing, we discretize the
inputs and outputs of the integrators uniformly at N levels from 0 to Vdd, where N is on the order of 10. The extraction setup for an integrator with a multi-bit DAC implemented in
thermometer code is shown in Fig.3. A large inductor L together with a voltage source Vs is
used to set the initial value of the integrator output. The input of the integrator is also set by
a voltage source Vi. The digital output of the quantizer controls the amount of charge to be
fed back. An m-bit DAC implemented in thermometer code has 2m - 1 threshold voltages.
The digital codes from 0 to 2m - 1 can be represented by counting the number of voltage
sources that are connected to the integrator inputs from a set of voltages sources Vd1, Vd2, …,
Vd(2m-1), the voltages of which are set to be either digital “1” or digital “0”.
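The uniform discretization and the thermometer coding described above can be sketched as follows; the Vdd and N values are illustrative defaults:

```python
def lut_index(v, vdd=1.8, n_levels=10):
    """Map a voltage in [0, Vdd] to one of N uniform LUT index levels."""
    idx = int(round(v / vdd * (n_levels - 1)))
    return max(0, min(n_levels - 1, idx))

def thermometer(code, m):
    """m-bit DAC in thermometer code: 2**m - 1 unit elements, of which
    'code' are driven to digital '1'."""
    n = 2 ** m - 1
    assert 0 <= code <= n
    return [1] * code + [0] * (n - code)
```

The count of high elements in the thermometer word is the digital code, matching the description of counting the voltage sources Vd1 … Vd(2m−1) connected to the integrator inputs.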
The nonlinearities of quantizers can be captured using lookup tables as well. The quantizer acts as a comparator whose input threshold voltage varies depending on the direction in which the input voltage changes. To capture this hysteresis effect accurately, we use transistor-level simulation to find the input threshold voltages at which the digital output switches from 0 to 1 (Voff+) and from 1 to 0 (Voff−), respectively. The quantizer is then modeled as

d[k+1] = 1      if Vin[k+1] > Voff+
d[k+1] = d[k]   if Voff− ≤ Vin[k+1] ≤ Voff+    (3)
d[k+1] = 0      if Vin[k+1] < Voff−
where d[k+1] is the new output of the quantizer, d[k] is the output in the previous clock
cycle. Multi-bit quantizers can be modeled in a similar way since they are built from several
1-bit quantizers.
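Equation (3) translates directly into code; the threshold values below are illustrative placeholders for the extracted Voff+ and Voff−:

```python
def quantizer(vin_next, d_prev, voff_hi=0.55, voff_lo=0.45):
    """1-bit quantizer with hysteresis, per equation (3): switch high
    above Voff+, switch low below Voff-, and hold the previous output
    inside the hysteresis band between them."""
    if vin_next > voff_hi:
        return 1
    if vin_next < voff_lo:
        return 0
    return d_prev
```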
Sigma-Delta ADCs with continuous-time modulators can also be modeled using the proposed technique with minor modification. Continuous-time Sigma-Delta ADCs differ from their discrete-time counterparts in that the integrators are not clocked by the sampling clock, and the input and output of an integrator change throughout a clock period. In order to make lookup-table-based modeling possible, we discretize each clock cycle into M time intervals with a step size dT=T/M. If dT is small enough, then in each small time interval the behaviour of a continuous-time modulator can be approximated using the presented technique; a detailed implementation for continuous-time Sigma-Delta ADCs can be found in (Yu & Li, 2007a).
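The sub-step discretization dT = T/M can be sketched with a forward-Euler integrator step; this is a simplification of the LUT-based evaluation actually used, and the gain, period and waveforms are illustrative:

```python
def ct_integrator_period(y0, x_of_t, T=1.0, M=1000, gain=1.0):
    """Advance a continuous-time integrator across one clock period T
    in M sub-steps of size dT = T/M (forward Euler)."""
    dT = T / M
    y = y0
    for k in range(M):
        y += gain * x_of_t(k * dT) * dT
    return y
```

For a constant unit input over one unit period the output rises by exactly 1; for a ramp input, increasing M drives the result toward the exact integral, illustrating the "dT small enough" condition.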
Since the number of process variables is large, it is not possible to exhaust all the possible
performances under process variations. Here we use parameterized LUT-based models to
capture the impacts of circuit parametric variations that include both environmental and
process variations. In this case, a nonlinear regression model (macromodel) is extracted for
each table entry. In general, a macromodel correlating the input variables and their
responses can be stated as follows:
given n sets of observed responses {y1, y2, …, yn} and n sets of m input variables [x1, x2, …, xm], we can determine a function relating input x and response y as (Low & Director, 1989)

y = f(x1, x2, …, xm)    (4)
The task of constructing each macromodel is achieved by applying the response surface
modeling (RSM) technique where empirical polynomial regression models relating the
inputs and their outputs are extracted by performing nonlinear least square fitting over a
chosen set of input and output data (Box et al., 2005). To systematically control the model accuracy and cost, the design of experiments (DOE) technique is applied to choose the smallest set of data points that satisfies a given modeling requirement. For our circuit
modeling task, the input parameters are the parametric circuit variations and the output is
an entry in the lookup tables. Then, a nonlinear function such as a quadratic function
relating each entry in the tables with the circuit parametric variations can be determined via
regression
ŷ = β̂0 + Σ_{i=1..m} β̂i·xi + Σ_{i=1..m} Σ_{j=1..m} β̂ij·xi·xj    (5)
where xi is the i-th process variable, ŷ is the approximated response, β̂ is the vector of estimated model fitting coefficients, and m is the number of process variables.
The fitting coefficient vector β̂ can be calculated using least-squares fitting of the experimental data as

β̂ = (X^T·X)^(−1)·X^T·Y    (7)
In our implementation, the cube design plan is selected in order to estimate all the first-order and cross-factor second-order coefficients of the input variables.
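Equation (7) can be exercised on a one-variable instance of model (5), y = β0 + β1·x + β11·x²; the normal equations (XᵀX)β = XᵀY are solved here by Gaussian elimination as a from-scratch sketch (production code would use a linear-algebra library, and the data points are fabricated for illustration):

```python
def fit_quadratic_rsm(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b11*x**2 via the normal
    equations beta = (X^T X)^(-1) X^T y."""
    X = [[1.0, x, x * x] for x in xs]            # design matrix
    n = 3
    A = [[sum(X[k][i] * X[k][j] for k in range(len(xs))) for j in range(n)]
         for i in range(n)]                       # X^T X
    b = [sum(X[k][i] * ys[k] for k in range(len(xs))) for i in range(n)]
    # Gaussian elimination with partial pivoting on the 3x3 system
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, n))) / A[r][r]
    return beta
```

Fitting noiseless samples of a known quadratic recovers its coefficients, which is the property the RSM extraction relies on at each lookup-table entry.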
The ranges of all parametric variations are usually obtained from the process
characterization. This information is used to setup the model extraction procedure. In the
cube design plan, each factor takes on two values -1 and +1 to represent the minimum and
the maximum values of the parametric variation. Each factor in the star plan takes on three
levels -a, 0, a, where 0 represents the nominal condition and the level range |a| < 1. As
illustrated in Fig. 4, for each point (i,j) in the lookup table, n simulation runs are conducted using the fractional factorial plan to provide the data required to generate the regression model in equation (5). Once the lookup tables for the specified process variation distributions are
generated, we can perform fast system-level simulation to evaluate the performances under
process variations, and in turn perform the optimization discussed in the following section.
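The cube and star plans described above can be enumerated in a few lines. This is only a sketch of the coded levels; the function names and the choice a = 0.5 are ours, and the fractional-factorial generators actually used in the chapter are not reproduced:

```python
import itertools

def cube_plan(m):
    """Two-level full factorial cube plan: all corners at coded -1/+1."""
    return list(itertools.product((-1, 1), repeat=m))

def star_plan(m, a=0.5):
    """Star plan: +/-a along each axis plus the center (nominal) point."""
    axial = [tuple(s * a if j == i else 0.0 for j in range(m))
             for i in range(m) for s in (-1, 1)]
    return axial + [(0.0,) * m]

# For m = 3 coded factors: 2^3 = 8 cube corners and 2*3 + 1 = 7 star points.
print(len(cube_plan(3)), len(star_plan(3)))
```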
Using our parameterized LUT-based infrastructure, we are able to not only predict the
nominal case design performances but also their sensitivities to parametric variations.
Hence, our technique provides an efficient way for statistical circuit simulation as well as
performance-robustness trade-off analysis. For statistical analysis, a Resolution V 2^(8-2) fractional factorial design plan, which includes 64 runs for the cube design plan and 17 runs for the star design plan, is employed for SDM 1. For SDM 2 and SDM 3, a Resolution VI 2^(6-1) fractional factorial design plan with 45 runs is employed, comprising 32 runs for the cube design plan and 13 runs for the star design plan.
In Table 1, the proposed LUT-based simulator is compared with the transistor-level simulator
(Spectre) in terms of model extraction time, simulation time, and predicted nominal SNDR and
THD values. Once the LUT models are extracted, the LUT-based simulator can be efficiently
employed to perform statistical performance analysis, which is infeasible for the transistor-
level simulator. For the 2nd-order Sigma-Delta ADC with 1-bit quantizer, it only takes 20
minutes to conduct 1,000 LUT-based transient simulations each including 64k clock cycles. For
the same analysis, transistor-level simulation with conventional simulators is expected to take
4,500 hours to complete. In terms of accuracy, the SNDRs and THDs predicted by Spectre and
the LUT simulator are also presented in Table 1. The error of SNDR of our LUT-based
simulator is within 1dB, which demonstrates the accuracy of the proposed technique.
With the powerful LUT-based simulator, we can perform system evaluation very efficiently so
the optimization of system topologies and detailed designs becomes possible. First, we use the optimization of a 2nd-order Sigma-Delta ADC with a multi-bit quantizer as an example by
investigating the impacts of DAC capacitance mismatch. The capacitor mismatch level
decreases as the capacitance increases, so it is of interest to investigate the trade-offs of system
noise performances and area (Pelgrom et al., 1989). Statistical simulations are performed to
analyze the influence of the mismatch of the two internal DACs by sweeping the values of the
three charging capacitors in each DAC. The variation of capacitances is modeled using a
Gaussian distribution with $3\sigma = 1\%$. The distributions of SNDR due to the capacitance mismatch in the two DACs are shown in Fig. 5.
Fig. 5. Histograms of SNDR (dB) under DAC capacitance mismatch (left: mismatch in the first-stage DAC; right: mismatch in the other DAC)
We can see from the two figures that the mismatch of the DAC connected to the first stage
integrator (left figure) has much more influence on the system performance than that of the other DAC (right figure). This can be explained by the fact that the first-stage DAC is connected directly to the system input, so the feedback error caused by the DAC mismatch
will be magnified by the second stage integrator. The result of this analysis indicates that
more attention should be paid to the first stage DAC in the design process.
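The capacitor-sampling step of such a statistical simulation can be sketched as follows. The nominal capacitance, run count, and seed are illustrative, and the LUT-based SNDR evaluation itself is not reproduced:

```python
import numpy as np

# Monte-Carlo sampling sketch: draw the three charging capacitors of a
# DAC from a Gaussian with 3*sigma = 1% of nominal, as in the chapter's
# mismatch sweep. C_NOM and n_runs are hypothetical placeholders.
rng = np.random.default_rng(1)
C_NOM = 1e-12                  # assumed 1 pF unit capacitor
sigma = (0.01 / 3.0) * C_NOM   # 3*sigma equals 1% of nominal
n_runs = 1000

caps = rng.normal(C_NOM, sigma, size=(n_runs, 3))

# Each row would be loaded into the parameterized LUT simulator to
# produce one SNDR sample; here we only sanity-check the statistics.
rel_err = (caps - C_NOM) / C_NOM
print(caps.shape, round(rel_err.std(), 4))
```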
[Histograms of SNDR (dB) with the nominal-case SNDR marked: (a) σ = 1%; (b) σ = 5%]
Fig. 6. SNDR distributions for mismatches of charging and sampling capacitors in SDM 3 (©
[2007] IEEE, from Yu & Li, 2007a)
We can observe from Fig. 6 that the performance distribution deviation increases from 1 dB to 4.5 dB for a capacitor variation of σ = 5%, and that the impact of the mismatch of the charging and sampling capacitors is not as critical as that of the multi-bit DAC, even with σ = 5%. It is
also possible to perform more complete system analysis and optimization using the
proposed parametric LUT-based simulation method depending on the target of the design
(Yu & Li, 2007a).
Due to its mixed-signal nature, the design and optimization of a PLL system is quite complex and costly. For example, a long transient simulation (on the order of hours or days) is needed to obtain the lock-in time behavior of a PLL, which is one of its most important performance metrics. Brute-force optimization by searching the design space with transistor-level simulation is therefore infeasible for PLL systems.
When process variations are considered, the situation becomes more complicated. The
large number of process variables and the correlations between different building blocks
introduce more uncertainties into PLL performance under process variations. In order to utilize the hierarchical simulation method while taking statistical performance distributions into consideration, we propose an efficient macromodeling method to handle this
difficulty. The key aspect of our macromodeling techniques is the extraction of
parameterized behavioral models that can faithfully map the device-level variabilities to
variabilities at the system level, so that the influence of fabrication stage variations can be
propagated to the PLL system performances.
Parameterization can be done for each building block model as follows. First, multiple
behavioral model extractions are conducted at multiple parameter corners, possibly following a particular design-of-experiments (DOE) plan (Box et al., 2005). Then, a
Robust Design and Test of Analog/Mixed-Signal Circuits in Deeply Scaled CMOS Technologies 363
The voltage controlled oscillator is the core component of a PLL. The two mainstream types
of VCOs are LC-tank oscillators and ring oscillators. In a typical VCO model, the dynamic
(response to input change) and static (V-Freq relation) characteristics of the voltage to
frequency transfer are modeled separately first and then combined to form the complete
model. The static VCO characteristic can be written as Fout=f(V’con), where Fout is the
output signal frequency, V’con is the delayed control voltage, and f(.) is a nonlinear
mapping relating the voltage with the frequency. To generate the analytical model, the
mapping function f(.) can be further represented by an n-th order polynomial function.
\[ F_{out} = a_0 + a_1 V'_{con} + a_2 (V'_{con})^{2} + \cdots + a_n (V'_{con})^{n} \qquad (8) \]
where a0, a1, …, an are coefficients of the polynomial. To generate the above polynomial,
multiple VCO steady-state simulations are conducted at different control voltage levels and
a nonlinear regression is performed using the collected simulation data.
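The polynomial extraction of equation (8) amounts to an ordinary least-squares fit. A sketch with synthetic steady-state data (the voltage sweep and the frequency curve below are invented stand-ins for transistor-level simulation results):

```python
import numpy as np

# Sketch of equation (8): fit Fout = a0 + a1*V' + ... + an*V'^n to
# steady-state (control voltage, frequency) pairs.
vcon = np.linspace(0.2, 1.0, 9)                 # control-voltage sweep (V)
fout = 1.0e9 + 0.8e9 * vcon - 0.2e9 * vcon**2   # hypothetical VCO curve

coeffs = np.polyfit(vcon, fout, deg=2)          # highest power first
vco_model = np.poly1d(coeffs)
print(np.allclose(vco_model(vcon), fout, rtol=1e-6))
```

In the chapter a sixth-order fit is used for the real VCO; a second-order fit suffices for this synthetic curve.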
Supposing the control voltage is Vcon, the dynamic behavior of the VCO is modeled by
adding a delay element that produces a delayed version of the control voltage (V’con). The
delay element can be expressed using a linear transfer function H(s) (e.g. a second-order RC
network consisting of two R's and two C's). H(s) can be determined via transistor-level
simulation as follows: a step control voltage is applied to the VCO and the time it takes for
the VCO to reach the steady-state output frequency, or the step-input delay of the VCO, is
recorded. H(s) is then synthesized to give the same step-input delay. The dynamic effect
is usually notable in LC VCOs due to the high-Q LC tank while in ring oscillators this effect
may be neglected.
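The chapter does not give H(s) explicitly. One concrete way to synthesize it, assuming a critically damped two-pole lag H(s) = 1/(1 + τs)², is to bisect for the τ that reproduces the measured step-input delay; the 50 ns delay and the 90% threshold below are illustrative assumptions:

```python
import math

def step_resp(t, tau):
    """Unit-step response of H(s) = 1/(1 + tau*s)^2 at time t."""
    return 1.0 - (1.0 + t / tau) * math.exp(-t / tau)

def fit_tau(t_delay, level=0.9, lo=1e-12, hi=1.0):
    """Bisect for tau so the response reaches `level` at t_delay.

    The response is monotonically decreasing in tau for fixed t, so
    plain bisection converges.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if step_resp(t_delay, mid) > level:
            lo = mid           # still too fast -> increase tau
        else:
            hi = mid
    return 0.5 * (lo + hi)

tau = fit_tau(50e-9)           # hypothetical 50 ns step-input delay
print(abs(step_resp(50e-9, tau) - 0.9) < 1e-6)
```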
The charge pump is mainly built with switching current sources. As illustrated in Fig. 8, the
control signals of the two switches M1 and M2 come from the outputs of the phase and
frequency detector. The currents through M1 and M2 can be switched on and off to provide
desired charge-up or charge-down currents. The existing charge pump macromodels are
very simplistic. Usually, both the charge-up and charge-down currents are modeled as
constant values. A constant mismatch between the two currents may also be considered
(Zou et al., 2006). However, this simple approach is not sufficient to model the behavior of the charge pump accurately. In a real implementation, the current sources are implemented using
transistors, so that the actual output currents will vary according to the voltages across these transistors.
Fig. 8. Modeling of charge pump (© [2007] IEEE, from Yu & Li, 2007b)
In our charge pump model, for each output current, the current vs. Vcon characteristic is divided into two regions. When the output voltage Vcon is close to the supply voltage, switch M1 is biased in the triode region. The charge-up current Iup in the triode region can
be written as
\[ I_{up} = \mu_p C_{ox} \frac{W}{L}\left[(V_{gs}-V_{thp})V_{ds} - 0.5\,V_{ds}^{2}\right] \qquad (9) \]
\[ V_{ds} = V_{dd} - V_{on} - V_{con} \]
where Vdd is the supply voltage, Von is the on-voltage across the switch, Vgs is the gate-
source voltage, $\mu_p$ is the mobility, Cox is the oxide capacitance, W is the width and L is the
length of M1. We can see from Equation (9) that the charge-up current is dependent on the
output voltage Vcon. We use a polynomial to explicitly model such voltage dependency
\[ I_{up} = b_0 + b_1 V_{con} + b_2 V_{con}^{2} + b_3 V_{con}^{3} \qquad (10) \]
where bi are the polynomial coefficients. Similarly, the charge-down current has a strong
Vcon dependency when Vcon is low. This voltage dependency is modeled in a similar
fashion. When M1 and M2 operate in the saturation region, they act as part of the current
mirrors. In this case, constant output current values are assumed while the possible
mismatches between the two are considered in our Verilog-A models.
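A toy piecewise model in the spirit of equations (9) and (10): a constant mirror current while M1 is in saturation, and a cubic roll-off near the supply. Every number below (supply, region boundary, current level, coefficients) is an invented placeholder, and the cubic is written in terms of the distance from the region boundary for convenience:

```python
# Hedged sketch of the two-region charge-up current model. All values
# are illustrative, not taken from the chapter's 90nm design.
VDD = 1.2            # assumed supply voltage (V)
V_TRIODE = 0.9       # assumed boundary where M1 leaves saturation (V)
I_SAT = 100e-6       # assumed saturation charge-up current (A)

# Coefficients chosen so the cubic equals I_SAT at the boundary and
# rolls off toward zero as Vcon approaches VDD.
B = (I_SAT, 0.0, 0.0, -I_SAT / (VDD - V_TRIODE) ** 3)

def i_up(vcon):
    """Charge-up current vs. output voltage Vcon (piecewise model)."""
    if vcon <= V_TRIODE:
        return I_SAT                     # saturation: constant current
    dv = vcon - V_TRIODE                 # cubic region near the supply
    return B[0] + B[1] * dv + B[2] * dv ** 2 + B[3] * dv ** 3

print(i_up(0.5), i_up(1.05) < I_SAT)
```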
The phase detector and the frequency divider are digital circuits so that they are more
amenable to behavioral modeling. The two key parameters of the phase detector and the
frequency divider are the output signal delay and the transition time, which are easy to
extract from transistor-level simulation. The loop filters usually consist of passive RC
elements, which can be directly modeled in Verilog-A simulation.
parameter space, rendering the parametric modeling infeasible. Although the widely used
principal component analysis (Reinsel & Velu, 1998) can be adopted to perform parameter
dimension reduction, its effectiveness may be rather limited since the parameter reduction is
achieved by only considering the statistics of controlling parameters while neglecting the
important correspondence between these parameters and the circuit performances of
interest. As such, the extent to which the parameter reduction can be achieved is not
sufficient for our analog macromodeling problems. To address this difficulty, a more
powerful design-specific dimension reduction technique, which is based on reduced rank
regression (RRR), is developed. This new technique considers the crucial structural
information imposed by the design and has been shown to be quite effective for parametric interconnect modeling problems (Feng et al., 2007).
Denote the covariance between the responses and the parameters as $\mathrm{Cov}(Y, X) = \Sigma_{YX}$. It can be shown that an optimal reduced rank model (in the sense of mean square error) is given as (Reinsel & Velu, 1998)

\[ A_R = U, \qquad B_R = U^{T}\,\Sigma_{YX}\,\Sigma_{XX}^{-1} \qquad (13) \]

where U contains the R normalized eigenvectors corresponding to the R largest eigenvalues of the matrix $D = \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}$. It is important to note that a successful construction of the above reduced rank model indicates that only a smaller set of R new parameters $Z = B_R X$ are critical to Y in a statistical sense, hence facilitating the desired parameter reduction.
It should be noted that the reduced rank regression is only employed as a means for
parameter reduction so as to reduce the complexity of the subsequent parameterized
macromodeling step. Hence, Y in the above equations does not have to be the true
performances of interest and can be just some circuit responses that are highly correlated to
the performances. This flexibility can be exploited to more efficiently collect $\Sigma_{YX}$ through Monte-Carlo sampling if Y are easier to obtain than the true performances in simulation.
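A numpy sketch of the reduced rank construction of equation (13), on synthetic data with an intentionally rank-2 X-to-Y dependence; the dimensions, seed, and residual check are ours:

```python
import numpy as np

# RRR-based parameter reduction sketch: compute B_R from the sample
# covariances and check that Z = B_R x alone explains Y. The data are
# synthetic, with Y depending on a 10-dim X through a rank-2 map.
rng = np.random.default_rng(2)
p, q, R, n = 10, 4, 2, 5000
u, s, vt = np.linalg.svd(rng.normal(size=(q, p)))
L2 = (u[:, :R] * s[:R]) @ vt[:R, :]          # rank-2 coupling matrix
X = rng.normal(size=(n, p))
Y = X @ L2.T + 0.01 * rng.normal(size=(n, q))

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sxx = Xc.T @ Xc / (n - 1)                    # Sigma_XX
Sxy = Xc.T @ Yc / (n - 1)                    # Sigma_XY
Syx = Sxy.T
D = Syx @ np.linalg.inv(Sxx) @ Sxy           # symmetric PSD, q x q
w, V = np.linalg.eigh(D)
U = V[:, np.argsort(w)[::-1][:R]]            # top-R eigenvectors of D
B_R = U.T @ Syx @ np.linalg.inv(Sxx)         # equation (13)
Z = Xc @ B_R.T                               # R critical parameters

# Y should be predictable from Z alone: regress Yc on Z, check residual.
coef, *_ = np.linalg.lstsq(Z, Yc, rcond=None)
resid = Yc - Z @ coef
print(resid.var() / Yc.var() < 0.01)
```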
The complete parameterized PLL macromodel extraction flow is shown in Fig. 9. Every
building block is modeled using Verilog-A for efficient system-level simulation. Each model
parameter for building blocks is expressed as a polynomial in the underlying device-level
variations, such as
\[ P = f(V_{th1}, L_{eff1}, T_{ox1}, \ldots, V_{thn}, L_{effn}, T_{oxn}) \qquad (14) \]
where Vthi, Leffi, Toxi, etc. represent the parameters of the i-th transistor, and f(.) is the nonlinear polynomial function connecting process variations to system performances. f(.) is very difficult to obtain if the number of parameters is large. Hence, RRR-based parameter
reduction is applied, which leads to a set of R new parameters Z that are the most important
variations for the given circuit performances of interest. If R is small, then a new
parameterized model in terms of Z can be easily obtained through conventional nonlinear
regression for the coefficients of equation (11):

\[ C = f(Z_1, Z_2, \ldots, Z_R) \qquad (15) \]
With the hierarchical models and parameter reduction technique, we can also perform built-
in self-test circuit design and optimization, since lengthy transient simulations can be avoided by the proposed method. We will discuss this part in the next section.
Given that process variations and resultant parametric failures will continue to rise in sub-
100-nm technologies, a design-phase PLL BIST development methodology is strongly desired. Such a methodology should facilitate systematic evaluation of parametric variations
of complex PLL system specifications and their relations to specific BIST measurements so
as to enable optimal BIST scheme development. To this end, however, three major
challenges must be addressed: a) Suitable modeling techniques must be developed in order
to enable feasible whole system PLL analysis while considering realistic device-level process
variations and mismatch; b) Device-level parametric variations and mismatch that
contribute to parametric failures form a high-dimensional parameter space, and the resulting curse of dimensionality brings significant modeling and analysis difficulties; and c)
Effective optimization strategies are desired in order to develop optimal BIST schemes.
With the techniques presented in section 3, the first two difficulties have been addressed,
and now we put more effort into the optimization of BIST circuits. The most widely used approach in BIST design is to utilize the existing digital blocks, for example using the frequency divider as a counter and reading out its state in order to detect chip failures (Sunter & Roy, 1999),
(Kim & Soma, 2001), (Hsu & et al., 2005), (Azais & et al., 2003). It is expected that the
frequency divider/counter output will change significantly if there exists a catastrophic
fault. However, parametric failures may produce smaller variations in the readout values.
Hence, they are more difficult to detect and deserve more careful treatments.
We consider three BIST schemes shown in Fig. 10 as potential candidates. Similar in spirit to
the existing BIST schemes, the main idea of the proposed BIST schemes is to control the
charge pump in a way such that the output frequency of the PLL will be altered and the
state of the frequency divider is read out at a certain time instant for failure detection.
Device-level variations and mismatch will perturb the operation of the PLL and can push
the system performances out of the specification window. The same parametric variations
may be reflected in the variations in the readout values of the frequency divider. Parametric
failures can be detected if the states of the frequency divider are strongly correlated with the
design performances.
[Block diagrams of the three BIST scheme candidates: each augments the PLL loop (frequency detector, charge pump, loop filter, VCO, and frequency divider) with muxes, delay elements, and an nT-delayed counter whose contents are read out via the BIST start/read signal.]
Fig. 10. Three BIST scheme candidates (© [2007] IEEE, from Yu & Li, 2007b)
a) BIST scheme 1
The first BIST scheme is similar to the one adopted in (Hsu & et al., 2005). In the normal
operation mode, the reference input and the output of the frequency divider are applied to
the frequency detector to form the closed loop configuration. In the test mode, the output of
the frequency divider is disconnected from the input of the frequency detector. The
reference input and/or its delayed versions are fed through the muxes to the frequency detector, forming an open-loop configuration. The first delay element has a larger delay value
than the second one. To charge up the VCO, the reference input and its delayed version
through delay 2 are applied to the frequency detector. To charge down the VCO, the
delayed versions of the reference input by both delay 1 and delay 2 are selected. The delay
values of the two delay elements determine the phase error introduced at the frequency
detector inputs. Hence, they also dictate the coverage of the VCO tuning range in this BIST
setup. Under typical design values, delay values on the order of tens of reference clock signal periods are required, which may cost significant silicon area to implement.
For all three schemes, the counter read-out signals, which control the start and end
points for a single test run, are generated by passing the reference clock signal Fref through a
series of D flip-flops. As such, the contents of the frequency divider within a defined time
interval are read out.
b) BIST scheme 2
To solve the silicon overhead problem of scheme 1, we propose the second BIST scheme
which employs an inverter to introduce the phase difference. Since this configuration
introduces a constant phase delay of dT at the inputs of the frequency divider, the charge pump experiences the following sequence of operation: charge up → stop → charge up → stop → … until the control voltage of the VCO reaches the full voltage swing.
c) BIST scheme 3
The third BIST scheme is configured as follows: first the PLL is put in a standard closed-loop
configuration and then a standard phase lock test is performed. Once the PLL is locked, the
feedback signal frequency is changed from Fout to 2Fout by using the mux to select the output
of the second-to-last D flip-flop in the divider.
A brief summary of the three BIST schemes: for scheme 1, the area cost is high while the test time is short; for scheme 2, the area cost is low and the test time is also low; for scheme 3, the area cost is low and the test time is medium. The most important aspect for BIST schemes,
however, is the test accuracy. We use the macromodels developed in section 3 to perform
test accuracy evaluation and optimization for these BIST scheme candidates.
take tens of hours to complete, more runtime efficient approaches are needed, especially for
the optimization purpose.
The optimization of a given BIST scheme with n digital outputs is illustrated in the second
half of Fig. 11. Since a BIST scheme may be evaluated many times under different setups
(e.g., the time interval within which the states of the frequency divider are read out) within
the optimization loop, the correlation analysis between the BIST measurements and the PLL performances must be conducted efficiently. This goal is achieved by identifying the critical sources of
variations Zv as in equation (15).
[Flow: build a regression model Ti = f(Zv) relating each BIST measurement T to the reduced process parameters Zv, then generate a large number of measurements using the nonlinear regression model.]
Fig. 11. BIST scheme optimization (© [2007] IEEE, from Yu & Li, 2007b)
Since Zv contains only a reduced set of variations, a nonlinear empirical model
relating each measurement Ti of the given BIST scheme and Zv can be rather efficiently
generated. This is achieved by conducting a few Verilog-A based PLL simulations at
different Zv samples and performing nonlinear regression: Ti= f(Zv). Note that this step does
not incur a high simulation cost since regression models are only built over a low-dimension
parameter space represented by Zv. Using these easily obtained regression models, a large
set of samples for each Ti can be efficiently generated.
To capture the potential nonlinear correspondence between the design performances and
the measurements, Support Vector Machine (SVM) is adopted as an accurate classifier
(Vapnik, 1998). Support Vector Machine is a powerful method to build highly nonlinear
multivariate regression/classification models. In SVM regression, we consider a set of
training data {(x1, y1), (x2, y2), …, (xn, yn)}, where xi is the input vector and yi is the
corresponding output. The input X is mapped into a high dimensional feature space using
nonlinear transformation, then a best fitting function is constructed in this feature space as
\[ y = f(x) = \omega^{T}\varphi(x) + b \qquad (16) \]

where $\varphi(\cdot)$ is the nonlinear transformation, b is the bias term, and $\omega$ represents the model parameters to be determined. Based on this nonlinear function f(.), we can classify the chip as
faulty or not with the BIST circuit outputs.
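As an illustration of this classification step (not the chapter's exact setup), the sketch below trains a minimal linear soft-margin SVM by sub-gradient descent on the hinge loss, standing in for the kernelized SVM of equation (16); the synthetic "measurements" T and the linear pass/fail rule are invented stand-ins for the BIST readouts:

```python
import numpy as np

# Minimal linear soft-margin SVM trained by full-batch sub-gradient
# descent on the regularized hinge loss. Data: synthetic BIST readouts
# labeled by a hypothetical linear pass/fail rule.
rng = np.random.default_rng(3)
n, d = 400, 5
T = rng.normal(size=(n, d))                   # hypothetical readouts T_i
w_true = np.array([1.0, -0.5, 0.3, 0.0, 0.2])
y = np.where(T @ w_true > 0.2, 1.0, -1.0)     # +1 = faulty, -1 = pass

w, b, lam, lr = np.zeros(d), 0.0, 1e-3, 0.1
for _ in range(300):
    viol = y * (T @ w + b) < 1.0              # margin violators
    if viol.any():
        w -= lr * (lam * w - (y[viol][:, None] * T[viol]).mean(0))
        b -= lr * (-y[viol].mean())
    else:
        w -= lr * lam * w                     # weight decay only

pred = np.sign(T @ w + b)
acc = (pred == y).mean()
print(f"training accuracy: {acc:.2f}")
```

A production flow would instead use a kernelized SVM library so that the nonlinear mapping φ(·) of equation (16) is handled implicitly.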
We demonstrate the effectiveness of the proposed method using a PLL design implemented
in 90nm CMOS technology. The frequency versus control voltage curve of the VCO is
extracted using the model of equation (8). First we simulate the VCO for a few clock cycles
and gather the time-domain output response as Y. Then RRR is applied to get a reduced
parameter set Z to represent the important device-level parameters. A parametric model of
the VCO in terms of Z is then built. To model the statistical characteristics of the VCO
accurately, a sixth-order polynomial fitting is used to fit the output frequency vs. control
voltage curve.
The Verilog-A models for other building blocks are extracted in a similar fashion.
Specifically, the charge pump model is generated using a third-order polynomial in the
output voltage for each charge-up/down current. There are a total of 17 Verilog-A model
parameters extracted for the complete PLL design.
We evaluate the effectiveness of three BIST schemes. SVM models are generated as in Fig. 11
to predict the pass/fail status of the chips based on the corresponding BIST outputs. 400
Monte-Carlo simulation samples are generated by conducting PLL system simulation using
Verilog-A macromodels. These data are used to generate the SVM model. To evaluate the
effectiveness of each scheme more reliably, another 100 Monte-Carlo simulations are
carried out and used as the test data for checking the accuracy of the SVM model. The
pass/fail predictions achieved through the three SVM models are compared against the
simulated chip performances, as shown in Fig. 12. Here, the predictions made through the
simulation are labeled as “direct measurement”, and +1 indicates a chip being classified as
“fail”, while -1 indicates the opposite.
[Plot of pass/fail classification (+1 = fail, −1 = pass) versus chip index (1-100) for the direct measurement and BIST schemes 1-3.]
Fig. 12. Pass/fail predictions of three BIST schemes (© [2007] IEEE, from Yu & Li, 2007b)
From Fig. 12 we can see that BIST scheme 1 has only one misclassification. The performance of BIST scheme 2 is verified to be poor, as it can only detect the two faulty chips which may have the largest variations. BIST scheme 3 can detect more failures than scheme 2
but is still not as good as scheme 1.
We further look into the trade-offs between the accuracy and the number of digital outputs
for optimization. This is important since fewer test codes will correspond to a shorter test
time, if a similar accuracy can be achieved. This trade-off analysis is conducted for every scheme in Fig. 13. It can be observed that for a small number of digital outputs, the accuracy
of scheme 2 is actually higher than that of scheme 3. It can also be seen that the accuracy of scheme 3 quickly saturates as the number of outputs increases. In all cases, scheme 1 is the optimal choice.
[Plot of classification error versus the number of test codes (3-9) for the three BIST schemes.]
Fig. 13. Accuracy vs. test structures for three BIST schemes (© [2007] IEEE, from Yu & Li,
2007b)
Other optimizations can be done to improve the BIST schemes using the efficient macromodeling techniques of Section 3 and Section 4 (Yu & Li, 2007b). It is of great benefit to perform such analysis and optimization, taking statistical system performances into consideration in the early design stage, so that the costly design iterations caused by process and environmental variations can be avoided.
5. Conclusion
In this chapter we have discussed the influence of process variations on analog/mixed-signal circuits in deeply scaled CMOS technologies. The performances of two types of popular mixed-signal systems, i.e. Sigma-Delta ADCs and phase-locked loops, were evaluated under process variations. A parameterized lookup table technique and reduced-rank regression with a hierarchical macromodeling method were proposed to fulfil the optimization of these two systems, respectively. We also extended the resulting fast system performance evaluation framework to compare the parametric-failure detection efficiencies of different BIST circuits and to perform test circuit optimization.
6. Acknowledgement
This work was funded in part by the FCRP Focus Center for Circuit & System Solutions
(C2S2), under contract 2003-CT-888.
7. References
Azais, F. et al. (2003). An all-digital DFT scheme for testing catastrophic faults in PLLs.
IEEE Design & Test of Computers, Vol. 20, No. 1, pp. 60 - 67, Jan. 2003
Babii, S. & et al. (1997). MIDAS User Manual, Stanford University, Stanford, CA
Bishop, R. & et al. (1990). Table-based modeling of delta-sigma modulators. IEEE Trans. On
Circuits and Systems, Vol. 37, No. 3, pp. 447-451, Mar. 1990
Box, G.; Hunter, D. & Hunter, W. (2005). Statistics for Experiments: Design, Innovation, and
Discovery, John Wiley & Son, 978-0471718130, Hoboken, NJ
Brauns, G. & et al. (1990). Table-based modeling of delta-sigma modulators using ZSIM.
IEEE Trans. On Computer-aided Design, Vol. 9, No. 2, pp. 142-150, Feb. 1990
Fang, S. & Suyama, K. (1992). User's Manual for SWITCAP2, Columbia University, New
York, NY
Feng, Z.; Yu, G. & Li, P. (2007). Reducing the Complexity of VLSI Performance Variation
Modeling Via Parameter Dimension Reduction. Proceedings of International
Symposium on Quality Electronic Design, pp. 737-742, 978-0769527957, Mar. 2007,
IEEE press, San Jose, CA
Hsu, C.; Lai, Y. & Wang, S. (2005). Built-in self-test for phase-locked loops. IEEE Trans. On
Instrument and Measurement, Vol. 54, No. 3, pp. 996-1002, Jun. 2005
Kim, S. & Soma, M. (2001). An all-digital built-in self-test for high-speed phase-locked loops.
IEEE Trans. On Circuits and Systems II, Vol. 48, No. 2, pp. 141-150, Feb. 2001
Low, K. & Director, S. (1989). An efficient methodology for building macromodels of IC
fabrication processes. IEEE Trans. On Computer-aided Design, Vol. 8, No. 12, pp.
1299-1313, Dec. 1989
Nassif, S. (2001). Modeling and Analysis of Manufacturing Variations. Proceedings of Custom
Integrated Circuits Conference, pp. 223-228, 978-0780365917, May 2001, IEEE press,
San Diego, CA
Norsworthy, S.; Schreier, R. & Temes G. (1997). Delta-Sigma Data Converters: Theory, Design,
and Simulation. IEEE Press, 978-0780310452, Piscataway, NJ
Pelgrom, M.; Duinmaijer, A. & Welbers, A. (1989). Matching properties of MOS transistors.
IEEE Journal of Solid-State Circuits, Vol. 24, No. 5, pp. 1433 - 1440, Oct. 1989
Reinsel, G. & Velu, R. (1998). Multivariate Reduced-Rank Regression, Theory and Applications.
Springer, 978-0387986012, New York, NY
Sunter, S. & Roy A. (1999). BIST for phase-locked loops in digital applications. Proceedings of
International Test Conference, pp. 532-540, 978-0780357531, Sep. 1999, IEEE Press,
Atlantic City, NJ
Vapnik, V. (1998). Statistical Learning Theory. Wiley-Interscience, 978-0471030034, New York,
NY
Yu, G. & Li, P. (2006). Lookup Table Based Simulation and Statistical Modeling of Sigma-
Delta ADCs. Proceedings of Design Automation Conference, pp. 1035-1040, 978-
1595933816, San Francisco, CA, July 2006, IEEE Press
Yu, G. & Li, P. (2007a). Efficient Lookup Table Based Modeling for Robust Design of Σ∆ ADCs. IEEE Trans. on Circuits and Systems – I, Vol. 54, No. 7, Sep. 2007, pp. 1513-1528
Yu, G. & Li, P. (2007b). A Methodology for Systematic Built-in Self-Test of Phase-locked
Loops Targeting at Parametric Failures. Proceedings of International Test Conference,
pp. 1-10, 978-1424411276, Oct. 2007, IEEE Press, Santa Clara, CA
Yu, G. & Li, P. (2008). Yield-aware hierarchical optimization of large analog integrated
circuits. Proceedings of International Conference on Computer-Aided Design, pp. 79-84,
978-1424428199, Nov. 2008, IEEE Press, San Jose, CA
Zou, J.; Mueller, D.; Graeb, H. & Schlichtmann, U. (2006). A CPPLL hierarchical
optimization methodology considering jitter, power and locking time. Proceedings of
Design Automation Conference, pp. 19-24, 978-1595933816, San Francisco, CA, July
2006, IEEE Press
Nanoelectronic Design Based on a CNT Nano-Architecture 375
19
Abstract — Carbon nanotubes (CNTs) and carbon nanotube field effect transistors (CNFETs) have
demonstrated extraordinary properties and are widely expected to be the building blocks of next
generation VLSI circuits. This chapter presents (1) the first purely CNT and CNFET based nano-
architecture, (2) an adaptive configuration methodology for nanoelectronic design based on the
CNT nano-architecture, and (3) robust differential asynchronous circuits as a promising nano-circuit
paradigm.
1. Introduction
Silicon based CMOS technology scaling has driven the semiconductor industry towards cost
minimization and performance improvement in the past five decades, and is rapidly approaching its end (30). On the other hand, nanotechnology has achieved significant progress
in recent years, fabricating a variety of nanometer scale devices, e.g., molecular diodes (44)
and carbon nanotube field effect transistors (CNFETs) (46). This provides new opportunities
for VLSI circuits to achieve continuing cost minimization and performance improvement in a
post-silicon-based-CMOS-technology era.
However, we must overcome a number of significant challenges for practical nanoelectronic
systems, including achieving some of the most critical nanoelectronic design metrics, as follows.
1. Manufacturability. As the minimum layout feature size becomes smaller than the lithography light wavelength, the traditional lithography based manufacturing process can no longer achieve satisfactory resolution, leading to significant process variations. Resolution enhancement and other design for manufacturability techniques become less applicable as scaling continues. Alternatively, nanoelectronic systems are expected to be based on bottom-up self-assembly based manufacturing processes, e.g., molecular beam epitaxy (MBE). Such bottom-up self-assembly manufacturing processes provide regular structures, e.g., perfectly aligned carbon nanotubes (23). Consequently, nanoelectronic systems need to rely on reconfigurability to achieve functionality and reliability (51).
2. Reliability. Technology scaling has led to increasingly significant process and system
runtime variations, including critical dimension variation, dopant fluctuation, electromagnetic
emission, alpha particle radiation, and cosmic ray strikes. Such variations cannot
be avoided by manufacturing process improvement, and are inherent at nanometer
376 VLSI
Fig. 1. The proposed CNT crossbar nano-architecture: layers of orthogonal carbon nanotubes
form a dense array of RDG-CNFETs and programmable interconnects with voltage-controlled
nano-addressing circuits on the boundaries.
scale due to the uncertainty principle of quantum physics. Robust design techniques,
including redundant, adaptive, and resilient design techniques at multiple (architec-
ture, circuit, layout) levels, are needed to achieve a reliable nanoelectronic system (5).
3. Performance. Nanoscale devices have achieved ultra-high performance in the absence
of load; however, the performance bottleneck of nanoelectronic systems lies in global
interconnects. Rent's rule states that the maximum interconnect length scales with the circuit
size in a power law (24), while signal propagation delay across unit length interconnect
increases as technology scales (30). As a result, interconnect design will be critical to
nanoelectronic system performance.
4. Power consumption. As technology scaling leads to increased device density and de-
sign performance, power consumption is also expected to be critical in nanoelectronic
design.
This chapter presents several recent technical advancements towards manufacturable, reli-
able, high performance and low power nanoelectronic systems.
1. The first purely CNT and CNFET based nano-architecture, which is constructed by lay-
ers of orthogonal CNTs with via-forming and gate-forming molecules sandwiched in
between, forming a dense array of reconfigurable double gate carbon nanotube field
effect transistors (RDG-CNFETs) and programmable interconnects. Such a CNT array
is addressed by novel voltage-controlled nano-addressing circuits on the boundaries,
which do not require precise layout design and achieve high yield under aggressive
scaling as well as adaptivity to process variations. Simulation based on CNFET and
molecular device compact models demonstrates superior logic density, reliability,
performance, and power consumption for nano-circuits implemented in this CNT crossbar
based nano-architecture, compared with existing nano-architectures, e.g., the molecular
diode based and MOSFET based ones.
Nanoelectronic Design Based on a CNT Nano-Architecture 377
2. Background
2.1 Existing Nanoscale Devices
The carbon nanotube is one of the most promising candidates for interconnect technology at
nanometer scale, due to its extraordinary electrical current carrying capability, thermal
conductivity, and mechanical strength. A carbon nanotube is a one-atom-thick
graphene sheet rolled up into a cylinder of nanometer-order diameter, which is semiconductive
or metallic depending on its chirality. The cylinder form eliminates boundaries and
boundary-induced scattering, yielding an electron mean free path on the order of micrometers,
compared with a few tens of nanometers in copper interconnects (32). This gives extraordinary
current carrying capacity, achieving a current density on the order of 10^9 A/cm2 (56). However,
large resistance exists at CNT-metal contacts, reducing the performance advantage of
CNTs over copper interconnects (38).
Among various nanotechnology devices, carbon nanotube field effect transistors are the most
promising candidates to replace the current CMOS field effect transistors as the building
blocks of nanoelectronic systems. Three kinds of carbon nanotube based field effect transis-
tors (CNFETs) have been manufactured: (1) A Schottky barrier based carbon nanotube field
effect transistor (SB-CNFET) consists of a metal-nanotube-metal junction, and works on the
principle of direct tunneling through the Schottky barrier formed by direct contact of metal
and semiconducting nanotube. The barrier width is modulated by the gate voltage. This device
has the most mature manufacturing technology to date, while two problems limit its
future: (a) The metal-nanotube contact severely limits current. (b) The ambipolar conduction
makes this device incompatible with conventional circuit design methods. (2) A MOSFET-
like CNFET is made by doping a continuous nanotube on both sides of the gate, thus forming
the source/drain regions. This is a unipolar device with high on-current. (3) A band-to-band
tunneling carbon nanotube field effect transistor (T-CNFET) is made by doping the source
and the drain regions p+ and n+, respectively. This device has low on-current and ultra-low
off-current, making it promising for ultra-low-power applications. It also has the potential
to achieve ultra-fast signal switching with a < 60mV/decade subthreshold slope (46).
Molecular electronic devices are based on two families of molecules: the catenanes which
consist of two or more interlocked rings, and the rotaxanes which consist of one or more
rings encircling a dumbbell-shaped component. These molecules can be switched between
states of different conductivities in a redox (reduction/oxidation) process by applying cur-
rents through them, providing reconfigurability for nanoscale devices (44).
A variety of reconfigurable nanoscale devices have been proposed. Resonant tunneling diodes
based on redox active molecules are configurable on/off (44). Nanowire field effect transistors
with redox active molecules at their gates can be configured to high or low conductance (17).
Spin-RAM devices exhibit high or low conductivity based on the parallel/anti-parallel magnetization configuration of
tky barrier CNFET is configurable to be a p-type FET, an n-type FET, or off, by the electrical
potential of the back gate (25). A double gate field effect transistor with the back gate driven
by a three state RTD memory cell is configurable to be a transistor or an interconnect, reducing
reconfiguration cost of a gate array (4).
Fig. 2. Layout of undifferentiated nanoscale wires (data lines) addressed by microscale wires
(address lines). Lithography defines high- and low-k dielectric regions, yielding field effect
transistors and direct conduction, respectively.
Because process variations are inevitably significant at nanometer scale, these existing nano-
addressing structures achieve limited yield; e.g., there is a certain probability that two nanoscale
wires have identical or similar gate configurations due to process variation. Furthermore,
nanoscale wires are often only partially selected; e.g., they may not achieve the ideal conductivity
when selected, due to process variations such as misalignment, dopant variation, etc.
Front Gate
CNT
Bistable
Molecules
Back Gate
Fig. 3. An n-type MOSFET-like reconfigurable double gate carbon nanotube field effect transistor
(RDG-CNFET).
structures, and (3) an achievable mechanism which precisely addresses an individual CNT in
an array.
In this section, we investigate the first purely CNT and CNFET based nano-architecture, which
is based on a novel RDG-CNFET device, includes a CNT crossbar structure on multiple layers,
and a novel voltage-controlled nano-addressing circuit.
Fig. 4. Compact model of an n-type MOSFET-like reconfigurable double gate carbon nanotube
field effect transistor (RDG-CNFET).
Fig. 5. Carbon nanotube (CNT) layers in the proposed nanoelectronic architecture.
Fig. 6. An RDG-CNFET based Boolean logic a(b + c) implementation.
2. Gate-forming and via-forming molecules on top and on bottom of a CNT segment give
the RDG-CNFET, which is reconfigurable to a via, a short, a MOSFET-like CNFET, or an open.
3. Via-forming molecules both on top and on bottom of a CNT segment give a device
which is reconfigurable to be a stacked via, a simple via, or a double gate FET.
We have the following observations.
Observation 1. Via-forming (electrically bistable) molecules must be present between any two adja-
cent layers.
Observation 2. Gate-forming (redox active) molecules must be present next to each layer for gate
isolation.
Observation 3. Gate-forming (redox active) and via-forming (electrically bistable) molecules need to
be evenly distributed on each layer for performance.
Fig. 8. Schematic of the proposed voltage-controlled nano-addressing circuit.
configured as either a logic gate or an interconnect switch). A pre-determined ratio of logic de-
vices and interconnect switches (e.g., in standard cell designs and FPGA architectures where
cells and routing channels are separated) constrains design optimization and may lead to
inefficient device or interconnect utilization. Allowing an arbitrary ratio of logic gates and
interconnect switches (e.g., as in sea-of-gates designs) provides an increased degree of freedom
for design optimization (4).
The CNT crossbar based nano-architecture is also the first to include multiple routing layers.
Multiple routing layers (as in the current technologies) are necessary for VLSI designs, as
Rent’s rule suggests that the I/O number of a circuit module follows a power law with the gate
number in the module (24). A small routing layer number could lead to infeasible physical
design or significant interconnect detouring, resulting in degraded performance and device
utilization.
on the nanoscale wire along the address line. For example, the i-th nanoscale wire (starting from
Vss) in an array of n equally spaced nanoscale wires has a transistor gate voltage

    Vg(i, n) = (i/n) Vdd + ((n − i)/n) Vss    (1)

on an address line connected to two external voltage sources Vdd and Vss. Here we assume
uniform address lines of negligible external resistance (from the first or the last nanoscale wire
to the nearest external voltage source).
A transistor is on if its gate voltage exceeds the threshold voltage Vg > Vth . A nanoscale
wire is conductive if both transistors on it are on. Because the two address lines provide an
increasing series and a decreasing series of gate voltages respectively, only nanoscale wires at
specific positions in the array are conductive. For example, for Vdda1 = Vdda2 and Vssa1 = Vssa2,
the nanoscale wire in the middle of the array becomes conductive.
In general, to select the i-th data line from the left in an array of n nanoscale wires, the external
voltages need to be such that all the transistors on the right hand side of the i-th data line in
the first address line are off, and all the transistors on the left hand side of the i-th data line in
the second address line are off:
    Vga1(i + 1, n) = (1 − (i + 1)/n) Vdda1 + ((i + 1)/n) Vssa1 < Vth

    Vga2(i − 1, n) = ((i − 1)/n) Vdda2 + (1 − (i − 1)/n) Vssa2 < Vth    (2)
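As a behavioral sketch of Eqs. (1) and (2), the following Python fragment computes the gate-voltage ladders on the two address lines and lists which data lines conduct. The array size, external voltages, and threshold voltage are illustrative assumptions, not measured values:

```python
# Sketch of the voltage-controlled nano-addressing scheme (Eqs. 1-2).
# All parameter values are illustrative assumptions.

def gate_voltage(i, n, v_dd, v_ss):
    """Eq. (1): gate voltage seen by the i-th of n equally spaced
    nanoscale wires along a uniform resistive address line."""
    return (i / n) * v_dd + ((n - i) / n) * v_ss

def selected_wires(n, v_dda1, v_ssa1, v_dda2, v_ssa2, v_th):
    """A wire conducts only if both of its series transistors are on
    (gate voltage above v_th), one per address line. The two address
    lines ramp in opposite directions, so only wires at specific
    positions satisfy both conditions."""
    on = []
    for i in range(1, n + 1):
        vg1 = gate_voltage(i, n, v_dda1, v_ssa1)      # increasing series
        vg2 = gate_voltage(n - i, n, v_dda2, v_ssa2)  # decreasing series
        if vg1 > v_th and vg2 > v_th:
            on.append(i)
    return on

# With symmetric external voltages, only the wire near the middle of
# the array conducts, as noted in the text.
print(selected_wires(8, 1.0, 0.0, 1.0, 0.0, 0.45))
```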
Definition 1. Addressing inaccuracy of a nano-addressing circuit is the offset between the target
data line i and the data line j of maximum current.
    AI = |i − j|    (3)
Definition 2. Addressing resolution of a nano-addressing circuit is the minimum ratio between the
on-current Ion(i) of a target data line i and the off-current Ioff(j) of a non-target data line j (under all
conditions, e.g., different inputs and parametric variations).

    AR = min{ Ion(i) / Ioff(j) }    (4)
In traditional binary decoder based nano-addressing, the achievable addressing resolution de-
pends on the conductance difference between the target data line and other non-target data
lines. There are n non-target data lines at Hamming distance 1 from an n-bit target address,
which have similar if not identical conductances. The presence of parametric variations fur-
ther reduces addressing resolution.
In voltage-controlled nano-addressing, addressing resolution is largely given by the addressing
voltage difference between two adjacent data lines. Applying high voltages leads to a
number of reliability issues, such as electromigration and gate oxide breakdown. Carbon
nanotubes are highly resistant to electromigration, while new materials are needed to enhance
reliability against gate oxide breakdown.
Alternatively, for a given gate voltage difference, the transistor current difference can be improved
by steepening the subthreshold slope. However, MOSFETs and MOSFET-like CNFETs
are limited to an inverse subthreshold slope S (the minimum gate voltage variation
needed to bring a 10× source-drain current increase) of 2.3 kT/q ≈ 60mV/decade at 300K (46).
This calls for the development of novel devices with inverse subthreshold slopes below this limit.
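The 60mV/decade figure follows directly from S = ln(10) kT/q; a quick numeric check:

```python
# Numeric check of the thermal limit on inverse subthreshold slope,
# S = ln(10) * k*T/q ~ 60 mV/decade at 300 K, a fundamental limit for
# MOSFETs and MOSFET-like CNFETs.
import math

k = 1.380649e-23      # Boltzmann constant, J/K
q = 1.602176634e-19   # elementary charge, C
T = 300.0             # temperature, K

S = math.log(10) * k * T / q  # volts per decade of drain current
print(f"S = {S * 1e3:.1f} mV/decade")
```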
Fig. 9. Resistive voltage divider model of an address line: external voltages Vh and Vl address
data lines Di and Dj through resistances Rh and Rl (Rh + ∆R and Rl − ∆R for Dj).
3. Global data line misalignment (i.e., shifting of all data lines), variations of external volt-
age sources, and variations of external wire/contact resistance (between the resistive
voltage divider and the external voltage sources) lead to potential addressing inaccu-
racy (CNT target offset).
4. Individual data line misalignment (shifting) could decrease the difference between the
gate voltages of two adjacent transistors, leading to degraded addressing resolution
(on/off CNT current ratio between two adjacent CNTs).
5. Process variations of the transistors, including width, length, dopant concentration, and
oxide thickness variations, lead to transistor conductivity uncertainty and degraded
addressing resolution.
The nano-addressing scheme needs to achieve a high enough addressing resolution to endure
the above-mentioned parametric variation effects (e.g., by applying high external
addressing voltages, and/or novel CNFETs with < 60mV/decade subthreshold slopes).
After achieving a satisfactory addressing resolution, we need to minimize any addressing
inaccuracy and address the correct CNT data line (Problem 1).
Let us first derive the external voltage offset needed for a data address offset. Suppose for an
address line, the external voltages Vh and Vl address the i-th CNT data line Di . The resistance
between CNT data line Di and the high (low) external address voltage Vh (Vl ) is Rh (Rl ) (Fig.
9).1 We have
    (Rl/(Rh + Rl)) Vh + (Rh/(Rh + Rl)) Vl = Von    (5)
where Von is the voltage needed to address a CNT data line of peak current. Shifting the
external voltages to Vh + ∆V and Vl + ∆V addresses another CNT data line D j . The resistance
between CNT data line D j and the high (low) external address voltage is Rh + ∆R (Rl − ∆R).
We have
    ((Rl − ∆R)/(Rh + Rl)) (Vh + ∆V) + ((Rh + ∆R)/(Rh + Rl)) (Vl + ∆V) = Von    (6)
1 For the first address line, the high external voltage Vh = Vl1 is on the left, the low external voltage
Vl = Vr1 is on the right. For the second address line, the high external voltage Vh = Vr2 is on the right,
the low external voltage Vl = Vl2 is on the left.
As a result,
    ∆V = (∆R/(Rh + Rl)) (Vh − Vl)    (7)
Observation 4. The external voltage offset ∆V is proportional to the resistance offset ∆R between two
CNT data lines, and is proportional to the physical offset ∆L between the two CNT data lines, if the
resistive voltage dividers are uniform (e.g., the CNT data lines are equally spaced and the address lines
have uniform resistivity).
Observation 5. The addressing accuracy given by Method 1 depends only on the uniformity of the
resistive voltage divider, and the time domain variations of the external voltage differences Vl1 − Vr1
and Vl2 − Vr2 . Any time-invariant (e.g., manufacturing process) variations of the external voltages
(Vl1, Vr1, Vl2, and Vr2) or the external address line resistances (from the outermost data lines to the
external voltage sources) do not affect the achievable addressing accuracy.
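A numeric sanity check of Eqs. (5)-(7), with illustrative divider resistances and voltages (not values from the chapter):

```python
# Sanity check of Eq. (7): the external voltage offset needed to shift
# the addressed data line is dV = dR/(Rh+Rl) * (Vh - Vl). All numeric
# values are illustrative assumptions.

def v_on(vh, vl, rh, rl):
    """Eq. (5): tap voltage of the resistive voltage divider."""
    return (rl * vh + rh * vl) / (rh + rl)

rh, rl = 30e3, 70e3   # divider resistances to data line Di (ohms)
vh, vl = 1.2, 0.0     # external address voltages (V)
d_r = 10e3            # resistance offset to the next data line Dj

# Offset predicted by Eq. (7)
d_v = d_r / (rh + rl) * (vh - vl)

# Shifting both external voltages by d_v re-centers the divider on Dj
# (Eq. 6): the tap voltage equals the original Von again.
assert abs(v_on(vh + d_v, vl + d_v, rh + d_r, rl - d_r)
           - v_on(vh, vl, rh, rl)) < 1e-12
print(f"dV = {d_v * 1e3:.0f} mV")
```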
Fig. 10. Cross-section of an RDG-CNFET formed despite CNT misalignment: front gate CNTs
above bistable molecules, an n+ / p+ (or i) / n+ doped channel CNT, a dielectric layer, and redox
active molecules above the back gate CNTs.
Fortunately, we observe that precise alignment between a front gate CNT and a back gate
CNT is not necessarily required as long as the CNT arrays are dense, e.g., with the spacing
between CNTs close to the CNT diameters. In such a case, a double gate field effect transistor
is formed even in the presence of CNT misalignment (Fig. 10). A CNFET channel is formed
by doping the source/drain regions with the front gate CNT on the upper layer as mask. The
resultant CNFET channel aligns with the front gate CNT. A misaligned back gate injects a
weaker electric field into the CNFET channel from a longer distance. A neighboring back gate
may also inject a weak electric field into the channel. This is either tolerated (which needs to
be verified by simulation or testing) or avoided (by reserving the neighboring back gates for
shielding).
The question is then how to find the closest CNT pair on different layers that forms the front
gate and the back gate of an RDG-CNFET (such that we can address them and configure the
RDG-CNFET).
Problem 2 (RDG-CNFET Gate Matching). Given a CNT i on layer l, locate the closest CNT j on
layer l + 2 (or l − 2) such that CNTs i and j form the front gate and the back gate of an RDG-CNFET.
Method 2 solves Problem 2 and finds the closest CNT pairs that form the front gate and the
back gate of an RDG-CNFET.
Once a matching gate is identified, the CNFET can be characterized (by obtaining its I-V
curves). A parasitic CNFET can also be identified by finding the second closest CNT (the one
with the second smallest CNFET conductance in the algorithm), which is either tolerated or
avoided in a nanoelectronic design.
Problem 3. Detect and locate metallic, open and crossover CNTs in a CNT array, which are addressed
on both ends by nano-addressing circuits.
Such metallic, open, and crossover CNTs can be captured in an n × n resistance matrix RCNT,
where each entry RCNT(i, j) gives the resistance of the CNT between the i-th CNT end and the j-th
CNT end on opposite sides of an array of n CNTs (if i ≠ j, RCNT(i, j) gives the resistance
of a crossover CNT; otherwise, RCNT(i, i) gives the i-th CNT's resistance).
Method 3 solves Problem 3 by giving such an n × n resistance matrix RCNT. With this CNT
resistance matrix RCNT, we avoid open CNTs, and consider only semiconductive CNTs, metallic
CNTs, and crossover CNT bundles (as multi-thread cables) for the rest of the calibration
(Methods 4, 5, and 2).
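How such a resistance matrix RCNT might be interpreted can be sketched as follows; the thresholds, matrix values, and function names are illustrative assumptions, not part of Method 3:

```python
# Sketch of interpreting the n x n resistance matrix R_CNT from Method 3.
# Thresholds and matrix entries are illustrative assumptions, not
# calibrated data from the chapter.

R_OPEN = 1e12        # above this, treat the path as open (ohms)
R_METALLIC = 50e3    # below this, treat a CNT as metallic (ohms)

def classify(r_cnt):
    """Return a defect map: diagonal entries describe individual CNTs,
    finite off-diagonal entries indicate crossover CNT bundles."""
    n = len(r_cnt)
    defects = {"open": [], "metallic": [], "semiconductive": [], "crossover": []}
    for i in range(n):
        r = r_cnt[i][i]
        if r >= R_OPEN:
            defects["open"].append(i)
        elif r <= R_METALLIC:
            defects["metallic"].append(i)
        else:
            defects["semiconductive"].append(i)
        for j in range(i + 1, n):
            if r_cnt[i][j] < R_OPEN:   # ends i and j are connected: crossover
                defects["crossover"].append((i, j))
    return defects

# Toy 3x3 matrix: CNT 0 semiconductive, CNT 1 open, CNT 2 metallic,
# with a crossover between ends 0 and 2.
INF = 1e15
r = [[1e6, INF, 2e5],
     [INF, INF, INF],
     [2e5, INF, 1e4]]
print(classify(r))
```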
Problem 4. Detect and locate permanently open or short vias in a CNT crossbar nano-architecture.
Method 4 solves Problem 4 by giving two m × n resistance maps RPmin and RPmax, where
each entry RPmin(i, j) or RPmax(i, j) gives the resistance of an L-shaped path which includes
the i-th CNT segment on the top (bottom) of the CNT crossbar, the j-th CNT segment on the
left (right) of the CNT crossbar, and a programmable via which is configured as conductive
or open, respectively. Given non-open CNTs, these resistance matrices give a defect map for
permanently open or short vias.
Problem 5. Detect and locate permanently open or short CNFETs in a CNT crossbar nano-
architecture.
Shorts between a CNFET gate and its source or drain can be detected by a method similar
to Method 4 but without via programming. Method 5 finds permanent opens or shorts
between the source and the drain of a CNFET by giving an m × n resistance matrix RCNFET.
Once detected and located, these catastrophic defects (metallic, open and crossover CNTs, and
permanently open or short vias and CNFETs) can be accounted for in a correct nano-circuit. Nano-
circuit physical design needs to adapt to the presence of these catastrophic defects, and
will differ from die to die, based on the catastrophic defect maps (RCNT, RPmin, RPmax,
and RCNFET) for each die.
icant parametric variations, must achieve performance with highly resistive CNT intercon-
nects, etc. We discuss nano-circuit design in a CNT crossbar nano-architecture in this section.
Nanoscale computing systems are expected to be subject to prevalent defects and significant
process and environmental variations, inevitably as a result of the uncertainty principle of
quantum physics. For example, the conductance of a CNT or a CNFET is very sensitive to chirality,
diameter, etc. (38). Besides adaptive configuration, nanoscale computing systems need new
computing models or circuit paradigms for reliability enhancement, performance improve-
ment and power consumption reduction.
As technology scales, nanoelectronic computing systems are expected to be based on single
electron devices (the average number of electrons in a transistor channel is approaching one
for the current technologies). In quantum mechanics, the occurrence probability of an electron
is given by its wavefunction, which obeys the Schrödinger equation. How to extract a deterministic
computation result from stochastic events such as electron occurrences is one of the fundamental
problems that we face in designing nanometer scale computing systems. Traditional
computation based on large devices can be modeled as redundancy and threshold based logic
(which includes majority logic). In redundancy and threshold based logic, the error rate is
given by a binomial distribution (the probability of observing m events in an environment of
expecting an average of n independent events). As a result, a minimum signal-to-noise ratio
is required with performance and power consumption implications. Finding a more efficient
reliable computing model is of essential interest in nanoelectronic design.
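The binomial error-rate argument above can be made concrete for majority voting; the per-copy error rate below is an illustrative assumption:

```python
# Sketch of the error rate of redundancy-and-threshold (majority) logic:
# with n independent copies each failing with probability p, the vote is
# wrong when more than half of them fail -- the tail of a binomial
# distribution. n is assumed odd; p is an illustrative assumption.
from math import comb

def majority_error(n, p):
    """P(vote wrong) for an n-way majority vote with per-copy error rate p."""
    return sum(comb(n, m) * p**m * (1 - p)**(n - m)
               for m in range((n // 2) + 1, n + 1))

# Triple modular redundancy with a 1% device error rate:
p3 = majority_error(3, 0.01)   # 3*p^2*(1-p) + p^3
print(f"TMR error rate: {p3:.2e}")
```

Increasing n drives the error rate down exponentially, but at a linear cost in devices and power, which is why the text calls for more efficient reliable computing models.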
Besides stochastic signal occurrence, signal propagation delay variability is another category
of uncertainty in stochastic nanoscale systems. Nanoelectronic design needs to be adaptive to
or resilient in the presence of signal propagation delay variations. Existing techniques (e.g.,
the Razor technology (2; 18), wherein a shadow flip-flop captures a delayed data signal for timing
verification and correction) achieve only limited performance adaptivity; e.g., the circuitry
is adaptive to performance variations only within a given range. Asynchronous circuits have
is adaptive to performance variations only within a given range. Asynchronous circuits have
unlimited adaptivity to performance variations, and are ideal for high performance (enabling
performance scaling in the presence of significant performance variations) and low power (be-
ing event-driven and clockless) nanoelectronic design (e.g., multi-core chips are expected to be
increasingly self-timed, global-asynchronous-locally-synchronous, or totally asynchronous).
However, existing asynchronous design techniques suffer from reliability issues in the presence of soft
errors (e.g., glitches, coupling noises, and radiation or cosmic ray strike induced random noises),
which has limited their applications for decades.
A number of robust design techniques at multiple levels help to enhance reliability and reduce
error rate of a nanoelectronic computing system. At the circuit level, differential signaling and
complementary logic reduces parametric variation effects by exploiting spatial and temporal
correlations (e.g., by correlating m and n for reduced error rate in a binomial distribution) (15;
31; 57). At a higher level, we believe that Error Detection/Correction Code (35) is the key to
the stochastic signal occurrence problem in nanoscale systems (e.g., for lower required signal-
to-noise ratios, which lead to high performance and low power), and needs to be applied more
extensively at a variety of design hierarchy levels. Error correction coding has been applied
widely in today’s memories and wireless communication systems. Proposals also exist for
applying (AN or residue) error detection/correction coding in arithmetic circuits (7).
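As a minimal illustration of the error detection coding advocated here, consider the single-parity-bit code (Hamming distance two, as noted in the footnotes): it detects any single bit error but not all double bit errors. This sketch is illustrative, not a production EDC implementation:

```python
# Single-parity-bit error detection (Hamming distance 2): any single bit
# flip in the codeword is detected; a 2-bit error can escape.

def add_parity(bits):
    """Append an even-parity bit to an n-bit data word."""
    return bits + [sum(bits) % 2]

def check_parity(codeword):
    """Return True if the codeword passes the even-parity check."""
    return sum(codeword) % 2 == 0

word = add_parity([1, 0, 1, 1])
assert check_parity(word)             # clean codeword passes

corrupted = word[:]
corrupted[2] ^= 1                     # single bit soft error...
assert not check_parity(corrupted)    # ...is detected

double = word[:]
double[0] ^= 1
double[3] ^= 1                        # a 2-bit error escapes a distance-2 code
assert check_parity(double)
print("parity checks behave as expected")
```

Codes of Hamming distance larger than k detect any k-bit error, at a higher redundancy cost.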
Fig. 11. A static logic based RDA (robust differential asynchronous) circuit.
2 In general, multiple-bit data can be encoded in a variety of error detection codes (35). A single parity
bit for n-bit data provides a Hamming distance of two which is immune to any single bit error, while
an error detection code of a Hamming distance larger than k is immune to any k-bit error.
3 Alternative implementations (in dynamic circuits, e.g., dual-rail domino, DCVSL, etc.) achieve different
cost and reliability tradeoffs, and are potentially preferable depending on the manufacturing
technology and the environment, e.g., parametric variabilities, soft error rates, etc.
Nanoelectronic Design Based on a CNT Nano-Architecture 397
signals ack.t and ack. f derive from and always appear later than the differential data signals
f and f¯.
At the sender’s end of the interconnect, for sequential elements, flip-flops are preferred over
latches for reliability. A flip-flop is only vulnerable to noise when capturing the signal, while a
latch is vulnerable to noise whenever it is transparent. The flip-flops send out differential data
signals d.t and d. f (which come from the differential combinational logic outputs f and f¯) as
well as a request req signal (which comes from the acknowledgment signal ack.t). At the re-
ceiver end of the interconnect, a group of XOR and AND(NAND) gates verify the differential
data and request signals, and generate two differential validity signals valid(d) and valid(d).
Any single bit soft error or common multiple bit soft errors injected to the interconnects or at
the validity signals will halt the circuit.
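The receiver-side validity check just described can be sketched behaviorally; the function names and gate-level details here are an illustrative reconstruction, not the exact netlist of Fig. 11:

```python
# Behavioral sketch of the receiver-side validity check in the RDA
# circuit: each data bit travels as a differential (dual-rail) pair
# (d.t, d.f), accompanied by a request line.

def bit_valid(dt, df):
    """A dual-rail bit is valid only in states (1,0) or (0,1); XOR
    flags the invalid states (0,0) and (1,1)."""
    return dt ^ df

def receiver_valid(pairs, req):
    """valid(d) asserts only when every differential pair is valid and
    the request signal has arrived; otherwise the stage halts."""
    return bool(req) and all(bit_valid(dt, df) for dt, df in pairs)

data = [(1, 0), (0, 1), (1, 0)]   # three valid dual-rail bits
assert receiver_valid(data, req=1)

# A single bit soft error turns (1,0) into (1,1) or (0,0): the XOR
# check fails, no validity signal is generated, and the circuit halts
# safely instead of computing on corrupted data.
glitched = [(1, 1), (0, 1), (1, 0)]
assert not receiver_valid(glitched, req=1)
print("single-bit soft error detected")
```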
Each sender flip-flop is triggered by two differential acknowledgment signals ack.t and ack. f
as the differential clock signals ck.t = 1 and ck. f = 0, and is reset by two differential reset
signals when rs.t = 1 and rs. f = 0. The differential reset signals come from the downstream
differential acknowledgment signals ack.ti and ack. f i via the Muller C elements. They also
generate the valid and valid signals which trigger the combinational logic computation.
In the presence of multiple fanouts, multiple sets of differential acknowledgment signals will
be sent back to the upstream stage. With the Muller C elements holding the input validity
signals, the early arriving acknowledgment signals are held until the latest acknowledgment
signal arrives from the fanouts. At that time the Muller C elements close the inputs to the
combinational logic block and reset the flip-flop at the upstream stage, which brings all differential
data and request signals d.t, d. f and req as well as the acknowledgment signals ack.t and ack. f
back to the ground, completing an asynchronous communication cycle.
Definition 3 (Single Bit Soft Error). A single bit soft error is a glitch or toggling caused by a single
event upset as a result of an alpha particle or neutron strike from radioactive material or cosmic rays.
Definition 4 (Common Multiple Bit Soft Error). A common multiple bit soft error is glitches or
togglings of the same magnitude and polarity caused by common noises such as capacitive or inductive
interconnect coupling, or spatially correlated transient parametric (e.g., supply voltage, temperature)
variations (19; 39; 60), which have near identical effects on components at close physical proximity.
Theorem 1 (Logic Correctness). A robust differential asynchronous circuit achieves logic correctness
in the event of a single bit soft error or common multiple bit soft errors.
both f and f¯ to the ground. Only when both the valid and valid signals arrive is the
differential combinational logic computation enabled.
3. A single bit soft error or a common multiple bit soft error at the differential combina-
tional logic block or the differential data signals f and f¯ will not raise the ack.t signal
nor lower the ack. f signal.4
4. A single bit soft error or a common multiple bit soft error at the differential acknowl-
edgment signals ack.t and ack. f does not trigger the flip-flop.
5. A single bit soft error or a common multiple bit soft error at the differential reset signals
RS and RS does not reset the flip-flop.
6. A single bit soft error or a common multiple bit soft error at the differential data signals
d.t and d. f and the request req signal leads to invalid data and does not generate a
validity signal.
In summary, in order to make an RDA circuit fail, the glitches must follow certain specific
patterns, e.g., reversing a “01” to a “10”, which is highly unlikely to take place.
Theorem 2 (Timing Correctness). A robust differential asynchronous circuit achieves timing correctness
for any delay variation, given the physical proximity of the circuit components.
until the differential acknowledgment signals ack.t and ack.f reach the upstream stage,
reset the upstream flip-flop, and lower the valid(i) and valid signals, which takes much
longer than the hold time of the flip-flop. Consequently, no hold time constraint is
required.
4. The differential acknowledgment signals ack.t and ack. f arrive after the combinational
logic computation completes and the differential data signals f and f¯ settle to their final
values.
5. After all downstream stages send back acknowledgment signals, the flip-flop is reset,
bringing the differential data d.t and d. f and the request req signals to the ground. The
downstream stage acknowledgment signals are also brought back to the ground as a
result. This completes a four-phase asynchronous communication cycle.
As a result, the proposed robust differential asynchronous circuit is delay insensitive, i.e.,
achieves correct timing (signal arrival time sequence) in the presence of delay variations,
which is critical for nanoelectronic circuits.
6. Experiments
6.1 Voltage-Controlled Nano-Addressing
In this section, we first verify the effectiveness of the proposed voltage-controlled nano-
addressing circuit (Fig. 8) by running SPICE simulation based on the Stanford CNFET com-
pact model (52).
In the proposed voltage-controlled nano-addressing circuit, each nanotube is gated by two N-
type MOSFET-like CNFETs. These CNFETs have a 6.4nm gate width and a 32nm channel length,
as described in the Stanford CNFET compact model. The two CNFETs in each nanotube
are given a voltage drop of Vdd = 1V. The external address voltages are Vdda1 = Vdda2 = 1V,
Vssa1 = Vssa2 = 0. As a result, the CNFETs have complementary gate voltages Vg1 + Vg2 = 1V.
Fig. 12 gives the nanotube currents in the array for different gate voltages at the first address
line. The nanotubes carry a significant current only at specific gate voltages, e.g., reaching
Iout = 5.064mA at gate voltage Vg1 = 0.495V.
With 0 and 1V external voltages, Fig. 12 gives the currents for all the nanotubes in the array.
With larger external voltages, Fig. 12 is extended to give the nanotube currents: any nanotube
with a Vga1 > 1V or Vga2 < 0V gate voltage at the first address line carries zero current.
Addressing resolution is given by the difference of addressing voltages between two adjacent
nanotubes (since MOSFETs and MOSFET-like CNFETs cannot achieve an inverse subthreshold
slope below 60mV/decade). Adjusting the external address voltages minimizes any addressing
inaccuracy due to manufacturing process and system runtime parametric variations.
5 Comparing CNFET and CMOS-FET circuits gives approximately 5× performance improvement (16).
Fig. 12. Nanotube current Iout in mA for CNFET gate voltage Vg1 in the first address line.
Fig. 13. A molecular diode/MOSFET based Boolean logic a(b + c) implementation.
6 The amorphous silicon based anti-fuse technology works with silicon based nanowires (6). Similar
technologies are expected and assumed here for carbon nanotubes.
Table 1. Output voltage and static power consumption with different inputs of RDG-CNFET
and molecular diode based Boolean logic a(b + c) implementations.
Comparing the CNFET based and the molecular diode/CMOS based logic implementations,
we have the following observations.
1. Area: The CNFET based logic implementation takes an area of 2 × 6 = 12 CNFETs and 2 × 3 = 6 vias, while the molecular diode and MOSFET based implementation takes an area of 2 × 4 = 8 molecular diodes and 2 MOSFETs (and two more MOSFETs if an inverter is included at each output to restore the signal voltage swing). Considering that the CNFET based implementation is in complementary logic, and that MOS transistors do not scale well, the CNFET based implementation is expected to achieve superior logic density at a nanometer technology node.
2. Signal reliability: The CNFET based logic implementation achieves full voltage swing at the outputs, while in the diode logic circuit the output swing depends on the inputs, varying between 0.503V and 0.735V in the experiment (Table 1). Additional CMOS circuitry (e.g., an inverter) can be included at each output to restore full voltage swing; however, the reduced signal voltage swing in the diode logic circuit still implies compromised signal reliability.
3. Static power: The CNFET based logic implementation in CMOS logic achieves orders of magnitude lower power consumption compared with the molecular diode and MOSFET based implementation for most input vectors (Table 1).
4. Performance: The CNFET based logic implementation achieves orders of magnitude better timing performance compared with the molecular diode and MOSFET based implementation (Table 2).
In summary, the CNFET based logic implementation achieves superior logic density, reliability, performance, and power consumption compared with the molecular diode and CMOS-FET based Boolean logic implementation.
Table 2. Rising/falling signal propagation delays Dr/Df (ns) (from a to output) for various load capacitances CL (fF) of the RDG-CNFET and molecular diode based Boolean logic a(b + c) implementations. Varying the input signal transition time from 1ps to 100ps leads to no considerable delay difference.
Fig. 14. Signal waveforms in a robust differential asynchronous circuit with no single bit soft
error.
the internal glitches that would be observed at the f and f¯ signals if the validity signals arrive
early.
Fig. 15 gives signal waveforms for the same RDA circuit with the same input signals a and b, while the validity signal valid is delayed by 50ps compared to the complementary validity signal valid¯, representing an early false validity signal (valid¯) or a late arriving validity signal (valid) due to an injected negative glitch at either the valid or the valid¯ signal. We observe that the differential data signals f and f¯ are clamped to the ground until both validity signals settle to their final values valid = 1 and valid¯ = 0. As a result, no logic malfunction is present, while all signals are delayed by 50ps.
Fig. 16 gives signal waveforms for the RDA circuit with a triangle current (of 0.1mA peak current, starting at 160ps and ending at 180ps) injected into the f signal. Comparing with Fig. 14, we observe that such a negative glitch does not lead to any logic error; instead, the arrivals of the ack.t and ack.f signals are postponed by about 40ps, as are all the downstream signals d.t, d.f, and valid(d).
Fig. 17 gives signal waveforms for the RDA circuit with a triangle current (of 0.1mA peak current, starting at 120ps and ending at 140ps) injected into the ack.t signal. The glitch at the ack.t signal does not trigger the flip-flop, and the subsequent signals d.t, d.f, valid(d) and valid¯(d) are not affected.
Fig. 18 gives signal waveforms for the RDA circuit with two identical triangle currents (of
0.1mA peak current starting at 120ps ending at 140ps) injected to both differential acknowl-
Fig. 15. Signal waveforms in a robust differential asynchronous circuit with an early false
valid or a late arriving valid signal.
Fig. 16. Signal waveforms in a robust differential asynchronous circuit with a negative glitch
injected at the f signal.
edgment signals ack.t and ack.f. We observe that the double glitches do not trigger the flip-flop either, and the subsequent signals d.t, d.f, valid(d) and valid¯(d) are not affected.
From these experiments, we observe that the RDA circuit achieves correct logic and correct timing (signal arrival time sequence) in the event of a single bit soft error or common multiple bit soft errors, by temporarily halting the circuit operation until the valid data re-appear.
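The halting behaviour described above can be summarized as a simple clamping rule. The sketch below is an illustrative abstraction, not the chapter's transistor-level RDA circuit; the function name and the rail encoding are assumptions:

```python
# Behavioural sketch of the RDA clamping rule (an abstraction for
# illustration, not the chapter's transistor-level circuit): differential
# data rails are held at ground until the validity pair settles to the
# complementary codeword (valid, valid_b) = (1, 0).

def rda_output(f_in, valid, valid_b):
    """Differential pair (f, f_b) passed downstream.

    While the validity rails disagree with (1, 0) -- during a glitch or
    while one rail is still in transit -- both outputs stay clamped low,
    so a soft error stalls the handshake instead of emitting wrong data."""
    if (valid, valid_b) != (1, 0):
        return (0, 0)            # clamped: no token, downstream waits
    return (f_in, 1 - f_in)      # settled: complementary data rails

# Settled validity: the data token passes through.
assert rda_output(1, valid=1, valid_b=0) == (1, 0)
# Late-arriving valid: clamped, the operation is merely delayed.
assert rda_output(1, valid=0, valid_b=0) == (0, 0)
# Glitch driving both rails high: clamped, never a wrong codeword.
assert rda_output(1, valid=1, valid_b=1) == (0, 0)
print("clamping rule holds for all tested validity states")
```

This mirrors the waveform experiments: a glitch on a validity or acknowledgment rail postpones the token rather than corrupting it.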
7. Conclusion
In this chapter, we have studied the first purely CNT and CNFET based nano-architecture,
which is based on (1) a novel reconfigurable double gate carbon nanotube field effect transistor
(RDG-CNFET) device, (2) a multi-layer CNT crossbar structure with sandwiched via-forming
and gate-forming molecules, and (3) a novel voltage-controlled nano-addressing circuit not
requiring precise layout design, enabling manufacture of nanoelectronic systems in all existing
CMOS circuit design styles.
A complete methodology of adaptive configuration of nanoelectronic systems based on the
CNT crossbar nano-architecture is also presented, including (1) an adaptive nano-addressing
method for the voltage-controlled nano-addressing circuit, (2) an adaptive RDG-CNFET gate
matching method, and (3) a set of catastrophic defect mapping methods, which are specific (to
Fig. 17. Signal waveforms in a robust differential asynchronous circuit with a positive glitch
injected at the ack.t signal.
Fig. 18. Signal waveforms in a robust differential asynchronous circuit with positive glitches
injected at both differential acknowledgment signals ack.t and ack. f .
the CNT crossbar nano-architecture), complete (in detecting and locating all possible catastrophic defects in the CNT crossbar nano-architecture), deterministic (with no probabilistic computation), and efficient (test paths are rows or columns of CNT, or L-shaped CNT paths; CNT open/short defect detection is separated from via/CNFET open/short defect detection; runtime is linear in the number of defect sites). This is a significant improvement compared with previous techniques, which either detect only a single defect (10), or are generic, abstract, probabilistic, and highly complex (8; 33; 58).
We have also examined some of the design challenges and promising techniques for CNT and CNFET based nano-circuits. We identify significant parametric variation effects on logic correctness and timing correctness, and propose robust (differential) asynchronous circuits by applying Error Detection Codes to asynchronous circuit design for noise-immune and delay-insensitive nano-circuits.
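The error-detecting property being exploited can be checked directly: dual-rail (1-of-2) codewords are at Hamming distance 2, so a single bit flip never yields another valid codeword. A minimal sketch (the set representation is an illustrative choice):

```python
# Why dual-rail (1-of-2) encoding behaves as an error-detection code: the
# two data codewords are at Hamming distance 2, so any single-bit upset
# lands on the spacer (0, 0) or the illegal word (1, 1), both of which an
# asynchronous handshake treats as "not yet valid" rather than as data.

CODEWORDS = {(0, 1), (1, 0)}   # logical 0 and logical 1
SPACER = (0, 0)                # "no token" between handshakes

def flip(word, i):
    """Return word with bit i inverted (a single-bit soft error)."""
    bits = list(word)
    bits[i] ^= 1
    return tuple(bits)

for w in CODEWORDS:
    for i in range(len(w)):
        # No single-bit error can turn one codeword into the other.
        assert flip(w, i) not in CODEWORDS
print("every single-bit upset on a codeword is detectable")
```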
SPICE simulation based on compact CNFET and molecular device models demonstrates the superior logic density, reliability, performance and power consumption of the proposed RDG-CNFET based nanoelectronic architecture compared with previously published nanoelectronic architectures, e.g., a hybrid nano-CMOS technology including molecular diodes and MOSFETs. Furthermore, theoretical analysis and SPICE simulation based on 22nm Predictive Technology Models show that RDA circuits achieve much enhanced reliability in logic correctness in the presence of a single bit soft error or common multiple bit soft errors, and in timing correctness in the presence of parametric variations, given the physical proximity of the circuit components.
While nanotechnology development has not yet enabled fabrication of such a system, this chapter has demonstrated the prospective manufacturability, reliability, and performance of a purely carbon nanotube and carbon nanotube transistor based nanoelectronic system. These nanoelectronic system design techniques are expected to be further developed along with nanoscale device fabrication and integration techniques, which are critical to achieve and to improve the proposed nanoelectronic architecture in several aspects, including: (1) the search for electrically bistable molecules of repeated reconfigurability and low contact resistance with carbon nanotubes, (2) the development of etching processes for electrically bistable and redox active molecules with carbon nanotubes as masks, and (3) the manufacture of nanoscale devices of superior subthreshold slope for enhanced nano-addressing resolution, and ultra high performance low power nanoelectronic systems.
8. References
[1] I. Amlani, N. Pimparkar, K. Nordquist, D. Lim, S. Clavijo, Z. Qian and R. Emrick, “Au-
tomated Removal of Metallic Carbon Nanotubes in a Nanotube Ensemble by Electrical
Breakdown,” Proc. IEEE Conference on Nanotechnology, 2008, pp. 239-242.
[2] T. Austin, V. Bertacco, D. Blaauw and T. Mudge, “Opportunities and Challenges for Bet-
ter Than Worst-Case Design,” Asian South Pacific Design Automation Conference, 2005.
[3] A. Bachtold, P. Hadley, T. Nakanishi and C. Dekker, “Logic Circuits with Carbon Nan-
otube Transistors,” Science, 2001, 294(5545), pp. 1317-1320.
[4] P. Beckett, “A Fine-Grained Reconfigurable Logic Array Based on Double Gate Transis-
tors,” International Conference on Field-Programmable Technology, 2002, pp. 260-267.
[5] L. Benini and G. De Micheli, “Networks on chips: A new paradigm for component-based
MPSoC design,” Proc. MPSoC., 2004.
[6] J. Birkner, A. Chan, H. T. Chua, A. Chao, K. Gordon, B. Kleinman, P. Kolze and R. Wong,
“A Very High-Speed Field Programmable Gate Array Using Metal-To-Metal Anti-Fuse
Programmable Elements,” New Hardware Product Introduction at Custom Integrated Circuits
Conference, 1991.
[7] T. J. Brosnan and N. R. Strader II, “Modular Error Detection for Bit-Serial Multiplication,”
IEEE Trans. on Computers, 37(9), 1988, pp. 1043-1052.
[8] J. G. Brown and R. D. Blanton, “CAEN-BIST: Testing the NanoFabric,” Proc. International
Test Conference, 2004, pp. 462-471.
[9] Z. Chen, J. Appenzeller, Y.-M. Lin, J. Sippel-Oakley, A. G. Rinzler, J. Tang, S. J. Wind, P.
M. Solomon and P. Avouris, “An Integrated Logic Circuit Assembled on a Single Carbon
Nanotube,” Science, 2006, 311(5768), pp. 1735.
[10] B. Culbertson, R. Amerson, R. Carter, P. Kuekes and G. Snider, “Defect Tolerance on the
Teramac Custom Computer,” Proc. Symposium on FPGA’s for Custom Computing Machines
(FCCM), 2000, pp. 185-192.
[11] A. DeHon, “Array-Based Architecture for FET-Based, Nanoscale Electronics,” IEEE Trans.
on Nanotechnology, 2(1), pp. 23-32, 2003.
[12] A. DeHon, P. Lincoln and J. E. Savage, “Stochastic Assembly of Sublithographic
Nanoscale Interface,” IEEE Trans. Nanotechnology, 2(3), pp. 165-174, 2003.
[13] A. DeHon and M. J. Wilson, “Nanowire-Based Sublithographic Programmable Logic Ar-
rays,” Proc. FPGA, 2004, pp. 123-132.
406 VLSI
[14] C. Dwyer, L. Vicci, J. Poulton, D. Erie, R. Superfine, S. Washburn and R. M. Taylor, “The
design of DNA self-assembled computing circuitry,” IEEE Trans. on Very Large Scale Inte-
gration (VLSI) Systems, 12(11), pp. 1214-1220, Nov. 2004.
[15] D. J. Deleganes, M. Barany, G. Geannopoulos, K. Kreitzer, M. Morrise, D. Milliron, A.
P. Singh and S. Wijeratne, “Low-Voltage Swing Logic Circuits for a Pentium 4 Processor
Integer Core,” IEEE J. of Solid-State Circuits, 40(1), pp. 36-43, 2005.
[16] J. Deng and H.-S. P. Wong, “A Compact SPICE Model for Carbon Nanotube Field Effect Transistors Including Non-Idealities and Its Application - Part II: Full Device Model and Circuits Performance Benchmarking,” IEEE Trans. Electron Devices, 2007.
[17] X. Duan, Y. Huang and C. M. Lieber, “Nonvolatile Memory and Programmable Logic
from Molecule-Gated Nanowires,” Nano Letters, 2(5), pp. 487-490, 2002.
[18] D. Ernst, N. S. Kim, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge and K. Flautner, “Ra-
zor: Circuit-Level Correction of Timing Errors for Low-Power Operation,” IEEE MICRO
special issue on Top Picks From Microarchitecture Conferences of 2004, 2005.
[19] P. Friedberg, Y. Cao, J. Cain, R. Wang, J. Rabaey, and C. Spanos, “Modeling Within-Die
Spatial Correlation Effects for Process-Design Co-Optimization,” Proc. International Sym-
posium on Quality Electronic Design, pp. 516-521, 2005.
[20] S. C. Goldstein and M. Budiu, “NanoFabrics: Spatial Computing Using Molecular Elec-
tronics,” Proc. International Symposium on Computer Architecture, 2001, pp. 178-191.
[21] B. Gojman, E. Rachlin, and J. E. Savage, “Evaluation of Design Strategies for Stochasti-
cally Assembled Nanoarray Memories,” Journal of Emerging Technologies, 1(2), 2005, pp.
73-108.
[22] J. R. Heath and M. A. Ratner, “Molecular Electronics,” Physics Today, 56(5), pp. 43-49,
2003.
[23] S. J. Kang, C. Kocabas, T. Ozel, M. Shim, N. Pimparkar, M. A. Alam, S. V. Rotkin and J. A.
Rogers, “High-Performance Electronics Using Dense, Perfectly Aligned Arrays of Single-
Walled Carbon Nanotubes,” Nature Nanotechnology, Vol. 2, pp. 230-236, April 2007.
[24] B. S. Landman and R. L. Russo, “On a Pin Versus Block Relationship for Partitions of
Logic Graphs,” IEEE Trans. on Computers, C-20, pp. 1469-1479, 1971.
[25] J. Liu, I. O’Connor, D. Navarro and F. Gaffiot, “Design of a Novel CNTFET-Based Recon-
figurable Logic Gate,” Proc. ISVLSI, 2007, pp. 285-290.
[26] B. Liu, “Reconfigurable Double Gate Carbon Nanotube Field Effect Transistor Based Na-
noelectronic Architecture,” Proc. Asia and South Pacific Design Automation Conference, 2009.
[27] B. Liu, “Adaptive Voltage Controlled Nanoelectronic Addressing for Yield, Accuracy and
Resolution,” Proc. International Symposium on Quality Electronic Design, 2009.
[28] B. Liu, “Robust Differential Asynchronous Nanoelectronic Circuits,” Proc. International
Symposium on Quality Electronic Design, 2009.
[29] B. Liu, “Defect Mapping and Adaptive Configuration of Nanoelectronic Circuits Based
on a CNT Crossbar Nano-Architecture,” Workshop on Nano, Molecular, and Quantum Com-
munications (NanoCom), 2009.
[30] International Technology Roadmap for Semiconductors, http://www.itrs.net/.
[31] A. Maheshwari and W. Burleson, “Differential Current-Sensing for On-Chip Intercon-
nects,” IEEE Trans. on VLSI Systems, 12(12), 2004, pp. 1321-1329.
[32] P. L. McEuen, M. S. Fuhrer and P. Hongkun, “Single-walled Carbon Nanotube Electron-
ics,” IEEE Trans. Nanotechnology, 1(1), pp. 78-85, 2002.
[33] M. Mishra and S. C. Goldstein, “Defect Tolerance at the End of the Roadmap,” Proc.
International Test Conference, 2003, pp. 1201-1211.
Nanoelectronic Design Based on a CNT Nano-Architecture 407
[34] S. Mitra, N. Patil and J. Zhang, “Imperfection-Immune Carbon Nanotube VLSI Logic
Circuits,” Foundations of NANO, 2008.
[35] T. K. Moon, Error Correction Coding: Mathematical Methods and Algorithms, Wiley-
Interscience, 2005.
[36] H. Naeimi and A. DeHon, “A Greedy Algorithm for Tolerating Defective Crosspoints in
NanoPLA Design,” Proc. Intl. Conf. on Field-Programmable Technology, 2004, pp. 49-56.
[37] P. Nguyen, H. T. Ng, T. Yamada, M. K. Smith, J. Li, J. Han and M. Meyyappan, “Direct
Integration of Metal Oxide Nanowire in Vertical Field-Effect Transistor,” Nano Letters,
2004, 4(4), pp. 651-657.
[38] A. Nieuwoudt and Y. Massoud, “Assessing the Implications of Process Variations on Fu-
ture Carbon Nanotube Bundle Interconnect Solutions,” Proc. Intl. Symp. on Quality Elec-
tronic Design, 2007, pp. 119-126.
[39] M. Orshansky, L. Milor, P. Chen, K. Keutzer, C. Hu, “Impact of spatial intrachip gate
length variability on the performance of high-speed digital circuits,” IEEE Trans. on
Computer-Aided Design of Integrated Circuits and Systems, 2002, pp. 544-553.
[40] S. S. P. Parkin, “Spintronics Materials and Devices: Past, Present and Future,” IEEE Inter-
national Electron Devices Meeting (IEDM) Technical Digest, pp. 903-906, 2004.
[41] N. Patil, J. Deng, A. Lin, H.-S. Philip Wong and S. Mitra, “Design Methods for Mis-
aligned and Mispositioned Carbon-Nanotube Immune Circuits,” IEEE Tran. on CAD,
2008, 27(10), pp. 1725-1736.
[42] J. P. Patwardhan, C. Dwyer, A. R. Lebeck and D. J. Sorin, “NANA: A Nano-Scale Active
Network Architecture,” ACM Journal on Emerging Technologies in Computing Systems, 2(1),
pp. 1-30, 2006.
[43] J. P. Patwardhan, V. Johri, C. Dwyer and A. R. Lebeck, “A Defect Tolerant Self-Organizing
Nanoscale SIMD Architecture,” International Conference on Architecture Support for Pro-
gramming Languages and Operating Systems, 2006, pp. 241-251.
[44] A. R. Pease, J. O. Jeppesen, J. F. Stoddart, Y. Luo, C. P. Collier and J. R. Heath, “Switching
Devices Based on Interlocked Molecules,” Acc. Chem. Res., 34, pp. 433-444, 2001.
[45] Predictive Technology Model, http://www.eas.asu.edu/∼ptm/.
[46] A. Raychowdhury and K. Roy, “Carbon Nanotube Electronics: Design of High Perfor-
mance and Low Power Digital Circuits,” IEEE Trans. on Circuits and Systems - I: Funda-
mental Theory and Applications, 54(11), pp. 2391-2401, 2007.
[47] G. S. Rose, A. C. Cabe, N. Gergel-Hackett, N. Majumdar, M. R. Stan, J. C. Bean, L. R. Har-
riott, Y. Yao and J. M. Tour, “Design Approaches for Hybrid CMOS/Molecular Memory
Based on Experimental Device Data,” Proc. Great Lakes Symposium on VLSI, 2006, pp. 2-7.
[48] J. E. Savage, E. Rachlin, A. DeHon, C. M. Lieber and Y. Wu, “Radial Addressing of
Nanowires,” ACM Journal of Emerging Technologies in Computing Systems, 2(2), pp. 129-
154. 2006.
[49] M. S. Schmidt, T. Nielsen, D. N. Madsen, A. Kristensen and P. Bøggild, “Nano-Scale Silicon Structures by Using Carbon Nanotubes as Reactive Ion Masks,” Nanotechnology, 16, pp. 750-753, 2005.
[50] G. S. Snider and R. S. Williams, “Nano/CMOS Architectures Using a Field-
Programmable Nanowire Interconnect,” Nanotechnology, 18(3), 2007.
[51] M. R. Stan, P. D. Franzon, S. C. Goldstein, J. C. Lach and M. M. Ziegler, “Molecular
Electronics: From Devices and Interconnect to Circuits and Architecture,” Proc. of the
IEEE, 91(11), pp. 1940-1957, 2003.
[52] Stanford CNFET Model, http://nano.stanford.edu/models.php.
20
A New Technique of Interconnect Effects Equalization by using Negative Group Delay Active Circuits
1. Introduction
During the last two decades, technological progress in VLSI processes has brought an outstanding development of information technology equipment and thus a great increase in the use of communication services all over the world. As reported by both the
International Technology Roadmap for Semiconductors (ITRS) and the Overall Roadmap Technology Characteristics (ORTC), the exponential reduction of the feature size of electronic chips according to Moore’s law (Moore, 1965) still occurs, together with the exponential increase over time of the number of transistors per unit area. Combined with this shrinking of feature sizes, the on-chip clock frequency increases continually and should exceed 10 GHz in 2010. These ceaseless trends in VLSI circuits have led to more and more
complex interconnect systems, and thus, the implementation of metal multi-layers for intra-
chip interconnects has become a must. In the mid-1980’s, the devices were, thus, composed
of one or two layers of aluminium; in 2011, according to ITRS prediction, chips will consist
of more than ten layers of copper.
Under these conditions, owing to the higher operation speeds, the interconnect propagation delay becomes more and more significant and considerably dominates the logic propagation delay (Deutsch, 1990; Rabay, 1996). Because of sensitivity to parametric variations, the clock and data flows may not be synchronized (Friedman, 1995). This explains why interconnections are so important in determining VLSI system performance.
Besides, simplified models are worth considering in order to reduce the complexity of any study on interconnects. Since the beginning of the 1980s, the modelling of the propagation delay of an interconnect line driven by a CMOS gate has been the subject of numerous papers (Sakurai, 1983 and 1993; Deng & Shiau, 1990). The simplest and most used model of this delay was proposed by Elmore in 1948; it relies on the use of only an RC-line model. Nevertheless, due to the increase in system data rates, this model tends to be insufficiently accurate. Therefore, more accurate models that sometimes take into account the inductive effect (Wyatt, 1987; Ismail et al., 2000) have been proposed.
To solve the problem of clock skew and propagation delays, a technique of signal integrity enhancement based on repeater insertion was proposed by several authors (Adler & Friedman, 1998; Ismail & Friedman, 2000). But, when the signals are significantly attenuated, such a solution may be unable to preserve the data duration and is thus inefficient. These considerations drove us to recently propose a new technique for interconnect-effect equalization (Ravelo, Perennec & Le Roy, 2007a, 2008a and 2009) through the use of negative group delay (NGD) active circuits. As shown in Fig. 1, it consists merely in cascading these NGD circuits at the end of the interconnect line. The possibility of signal recovery with a reduction of the signal rise/fall, settling and propagation delays was theoretically demonstrated and evidenced through simulations in (Ravelo et al., 2007a, 2008a) and confirmed by experiments in (Ravelo et al., 2009).
In fact, evidence of the NGD phenomenon has been provided through theoretical demonstrations and experiments with passive electronic devices (Lucyszyn et al., 1993; Eleftheriades et al., 2003; Siddiqui et al., 2004 and 2005) and active ones (Solli & Chiao, 2002; Kitano et al., 2003; Nakanishi et al., 2002; Munday & Henderson, 2004). As described in several physics domains (Wang et al., 2000; Dogariu et al., 2001; Solli & Chiao, 2002), in the case of a smoothed signal propagating in a device/material that generates NGD, the peak of the output signal and its front edge are both in time advance compared to the input ones. Then, confirmations that this counterintuitive phenomenon is not physically at odds with the causality principle have been provided (Wang et al., 2000; Nakanishi et al., 2002). A literature review shows that the first circuits that exhibited NGD at microwave wavelengths also displayed significant losses, whereas the baseband-operating ones were intrinsically limited to low frequencies. To cope with these issues, we recently reported on the design, test and validation through simulations and experiments of a new and totally integrable topology of NGD active circuit (Ravelo et al., 2007b, 2007c and 2008b); this topology relies on the use of a FET and showed its ability to compensate for losses at microwave frequencies over a broad bandwidth. Transposition of this NGD topology to baseband frequencies allowed us to develop new structures that demonstrated their ability to simultaneously generate an NGD and gain for broad and baseband signals (Ravelo et al., 2008). Then, the idea put forward by Solli and Chiao (Solli & Chiao, 2002) to compensate for degradations introduced by passive systems, such as interconnect lines, by using NGD devices became possible.
As a continuation of these investigations, this chapter deals with further developments of
this technique. Section 2 gives insight into the way this technique works, and briefly
explains the role of the NGD circuit. The theory of interconnect modelling and the definition
of the propagation delay are both recalled in Section 3. The analytical approach and experimental validations of the RC-model equalization are presented in Section 4. The feasibility of the proposed technique, when the inductive effects are taken into account, is dealt with in Section 5. In Section 6, a completely original and fully-integrable topology is proposed by
A New Technique of Interconnect Effects Equalization
by using Negative Group Delay Active Circuits 411
getting rid of inductances to cope with their implementation issues. Thus, the results of simulations, which provided a very good validation of the performances expected from theory, are analysed and discussed. A summary of this chapter is given in the last Section together with proposals for possible future developments.
Fig. 2. Time-domain responses of the ideal system shown in Fig. 1 for a periodic input voltage, vi(t).
In the frequency domain, the degradation between the input, vi, and the output, vl, corresponds to a transfer function, denoted Gl(s), whose gain magnitude and group delay usually verify the following inequalities:
|Gl(jω)|dB < 0 dB,  τl(ω) > 0.  (1)
The output Laplace transform of this interconnect line can be written as:
Vl(s) = Gl(s)·Vi(s).  (2)
As shown by Figs. 1 and 2, this study was aimed at finding a relevant configuration or circuit able to provide a compensated output, vN, (black thick curve) as close as possible to the input signal, vi, (red dashed curve). It means that the following mathematical approximation can be made:
vN(t) ≈ vi(t).  (3)
In theory, for well-matched circuits, the transfer system to be found, Gx(s), must be associated with Gl(s) so that equation (4) is verified:
Gl(s)·Gx(s) = 1.  (4)
According to circuit and system theory, through use of equations (3) and (4), one gets the adequate transfer function:
Gx(s) = 1/Gl(s).  (5)
Consequently, in the frequency domain, the system gain and group delay must be such that:
|Gx(jω)|dB = −|Gl(jω)|dB,  (6)
τx(ω) = −τl(ω).  (7)
So, taking into account the condition expressed in equation (1), the gain and the group delay must respectively satisfy |Gx(jω)|dB > 0 and τx(ω) < 0. Technically, these conditions require cascading a system able to simultaneously exhibit gain and an NGD in baseband. As described by the block diagram of Fig. 3, the whole cascaded system is characterized by its transfer function, G(s), where Gl is the interconnect transfer function, Gx = GNGD, and τ is the whole group delay.
Fig. 3. Block diagram: interconnect passive system, Gl(s), cascaded with the active NGD circuit, GNGD(s): G(s) = Gl(s)·GNGD(s).
In conclusion, the NGD circuit realizes a compensatory function for both group delay and attenuation. Fig. 4 depicts the compensation principle by considering the general frequency behaviour of interconnects or transmission lines. So, prior to conducting a feasibility study of this technique with concrete systems, let us briefly recall the theory on commonly used models of interconnects, i.e. RC- and RLC-circuits, and on propagation delay assessments.
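The compensation targets in equations (6) and (7) can be illustrated numerically. The sketch below assumes a first-order RC model of the line and an arbitrary 5 MHz evaluation frequency; the Rt and C values are borrowed from the experimental circuit of Fig. 8:

```python
import cmath
import math

# Numerical sketch of the compensation targets in equations (6)-(7), using
# a first-order RC model of the line, Gl(jw) = 1/(1 + jw*Rt*C). The Rt, C
# values come from the experimental circuit of Fig. 8; the 5 MHz
# evaluation frequency is an arbitrary illustrative choice.

Rt, C = 33.0, 680e-12          # ohms, farads
tau = Rt * C                   # line time constant

def gl(w):
    """First-order line transfer function at angular frequency w."""
    return 1.0 / (1.0 + 1j * w * tau)

def group_delay(h, w, dw=1.0):
    """tau(w) = -d(arg h)/dw, by a centred finite difference."""
    return -(cmath.phase(h(w + dw)) - cmath.phase(h(w - dw))) / (2 * dw)

w = 2 * math.pi * 5e6
gain_l_db = 20 * math.log10(abs(gl(w)))
tau_l = group_delay(gl, w)

# Compensator targets per equations (6) and (7):
gain_x_db = -gain_l_db         # positive: the NGD circuit must amplify
tau_x = -tau_l                 # negative: it must exhibit NGD

print(f"line:        {gain_l_db:+.2f} dB, group delay {tau_l * 1e9:+.2f} ns")
print(f"compensator: {gain_x_db:+.2f} dB, group delay {tau_x * 1e9:+.2f} ns")
assert gain_x_db > 0 and tau_x < 0
```

At any frequency where the line attenuates and delays, the required compensator gain is positive and its required group delay negative, which is exactly the gain-with-NGD behaviour sought in the rest of the chapter.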
For the time-domain analysis of the structure under study, the input voltage, vi, is assigned as a Heaviside unit step function, Γ(t), of amplitude VM:
vi(t) = VMΓ(t).  (8)
As defined previously, the 50% propagation delay, Tp, is defined as the root of equation (9):
vl(Tp) − VM/2 = 0.  (9)
Gl(s) = 1 / [(1 + sRsCL)cosh(γd) + (Rs/Zc + sCLZc)sinh(γd)],  (10)
where CL is the load capacitance at the output of the line, and where
Zc = √[(Rl + sLl)/(sCl)],  (11)
and
γ = √[(Rl + sLl)sCl],  (12)
are, respectively, the characteristic impedance and the propagation constant of the line. This transfer function is analysed through polynomial expansion, as done for classical linear systems. For simplification, let us deal with the normalized transfer function, gl(s) = Gl(s)/Gl(0), which can be expressed by the rational expression of orders n and m:
gl(s) = (1 + a1s + a2s² + … + ansⁿ) / (1 + b1s + b2s² + … + bmsᵐ),  (13)
where the coefficients, ai (i = {1,...,n}) and bj (j = {1,...,m}) are real numbers, and m and n are
integers. According to the literature (Elmore, 1948; Wyatt, 1987; Ismail et al., 2000), use of
this expression allows one to estimate the 50% propagation delay expressed in equation (9).
Among the existing approximations, it is worth recalling that the simplest and the most used
in the industrial context is the one proposed by Elmore in 1948. It is based on the first-order
consideration of equation (13). Indeed, this estimation of the propagation delay is merely
defined by:
TpElmore = b1 - a1. (14)
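Equation (14) can be exercised with a few lines of code. The coefficient values below are illustrative; the RC case reproduces the lumped-circuit delay of equation (18):

```python
# Minimal illustration of Elmore's estimate, Tp = b1 - a1 (equation (14)),
# from the first-order coefficients of a normalized transfer function
# g(s) = (1 + a1*s + ...)/(1 + b1*s + ...). Coefficient values are
# illustrative.

def elmore_delay(a, b):
    """50% propagation delay estimate from numerator/denominator coefficients."""
    a1 = a[0] if a else 0.0
    return b[0] - a1

# Lumped RC circuit, g(s) = 1/(1 + Rt*C*s): Tp = Rt*C, as in equation (18).
Rt, C = 33.0, 680e-12
print("RC Elmore delay:", elmore_delay([], [Rt * C]), "s")

# A zero in the numerator (a1 > 0) reduces the estimated delay.
print("with a zero:    ", elmore_delay([0.2e-9], [1.0e-9]), "s")
assert elmore_delay([], [Rt * C]) == Rt * C
```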
One should note that this propagation delay is exactly equal to the group delay of the system under consideration at very low frequencies (Vlach et al., 1991). Nevertheless, this formula has proven to become less and less accurate as the operating frequency increases. A new formula was proposed by Wyatt (Wyatt, 1987): it differs from the previous approach only by the values of the coefficients, a1 and b1, which are defined by the reciprocals of the zeros and the poles of the system transfer function:
a1 = −Σi=1..n (1/zi),  (15)
b1 = −Σi=1..m (1/pi),  (16)
where the real numbers, zi and pi, are, respectively, the zeros and the poles of gl(s). It is
worth noting that this approach provides an exact expression of Tp in the case of the RC-
model (Ll = 0).
Fig. 6. Simplified representation of the structure shown in Fig. 5 by considering the first-
order approximation of the transfer function.
Therefore, the driver gate loaded by the distributed transmission line can be made equivalent to a lumped RC-circuit. To simplify the analytical calculation, let us consider the equivalent parameters of the system under study, Rt = Rs + Rld and C = Cld. So, the Elmore propagation delay of this well-known circuit is given by:
TpRC = RtC.  (18)
But it can be shown, through calculation of the unit step response, that the exact value of this quantity is:
TpRC = RtC·ln(2).  (19)
Then, the rise time, denoted trRC, defined as the time needed by the output signal to pass from 10% to 90% of its final value, is written as:
trRC = RtC·ln(9).  (20)
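As a numerical check of equations (18)-(20), the sketch below uses the component values of the experimental circuit described later in the chapter (Rt = 33 Ω, C = 680 pF, from Fig. 8); pairing those values with these formulas here is only an illustrative choice:

```python
import math

# Numerical check of equations (18)-(20), using the component values of the
# experimental circuit described later in the chapter (Rt = 33 ohm,
# C = 680 pF, from Fig. 8).

Rt, C = 33.0, 680e-12
tp_elmore = Rt * C                  # equation (18): Elmore estimate
tp_exact = Rt * C * math.log(2)     # equation (19): exact 50% delay
tr = Rt * C * math.log(9)           # equation (20): 10%-90% rise time

print(f"Elmore Tp  = {tp_elmore * 1e9:.2f} ns")   # ~22.4 ns
print(f"exact Tp   = {tp_exact * 1e9:.2f} ns")    # ~15.6 ns
print(f"rise time  = {tr * 1e9:.2f} ns")          # ~49.3 ns
assert tp_exact < tp_elmore < tr
```

Note how the Elmore estimate overestimates the exact 50% delay of the single-pole response by a factor 1/ln(2) ≈ 1.44.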
Tp = (e^(−2.9ζ^1.35) + 1.48ζ)/ωn.  (24)
To reduce this propagation delay, and as first envisaged by Solli and Chiao (Solli & Chiao, 2002), it could be worth cascading the interconnect line with an NGD active circuit. But a preliminary to this proposal is a detailed study of the resulting device in order to confirm the efficacy of the compensation principle. This will be the focus of the next Section.
4.1 Theory
As explained above, the equalization principle consists in ending the circuit to be compensated, here an RC one, with an NGD active cell, as depicted in Fig. 7. To simplify the analytical approach, the FET is modelled by a voltage-controlled current source with a transconductance, gm, and a drain-source resistance, Rds.
Fig. 7. RC-NGD circuit: RC-circuit cascaded with a basic cell of the NGD active circuit (FET
in feedback with an RL series network) and the corresponding equivalent circuit.
In a first step, let us recall (Ravelo et al., 2007a, 2008a, 2008b, 2009) the transfer function and the group delay expressions of the NGD cell alone (in the dashed box in Fig. 7) at low frequencies:
GNGD(0) = (1 − gmR)Rds/(R + Rds),  (25)
τNGD(0) = L(1 + gmRds)/[(R + Rds)(1 − gmR)].  (26)
On condition that:
1 − gmR < 0,  (27)
the NGD cell exhibits a negative group delay (τNGD(ω→0) < 0) at very low frequencies.
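These low-frequency expressions, GNGD(0) = (1 − gmR)Rds/(R + Rds) and τNGD(0) = L(1 + gmRds)/[(R + Rds)(1 − gmR)], can be evaluated with the component values reported later for the experimental circuit (gm = 226 mS, Rds = 27 Ω; R = 56 Ω, L = 220 nH from Fig. 8); using them together here is an illustrative assumption:

```python
# Numerical evaluation of the low-frequency NGD-cell expressions
# G_NGD(0) = (1 - gm*R)*Rds/(R + Rds) and
# tau_NGD(0) = L*(1 + gm*Rds)/((R + Rds)*(1 - gm*R)),
# with the experimental values reported later in the chapter
# (gm = 226 mS, Rds = 27 ohm; R = 56 ohm, L = 220 nH from Fig. 8).

gm, Rds = 226e-3, 27.0
R, L = 56.0, 220e-9

# NGD condition (27): 1 - gm*R must be negative.
assert 1 - gm * R < 0

g0 = (1 - gm * R) * Rds / (R + Rds)                     # equation (25)
tau0 = L * (1 + gm * Rds) / ((R + Rds) * (1 - gm * R))  # equation (26)

print(f"G_NGD(0)   = {g0:+.2f}  (|gain| > 1, with sign inversion)")
print(f"tau_NGD(0) = {tau0 * 1e9:+.2f} ns  (negative group delay)")
assert abs(g0) > 1 and tau0 < 0
```

With these values the cell provides gain above unity together with a group delay of roughly −1.6 ns, and the negative gain sign is consistent with the later remark that an even number of transistors is preferred when signal inversion must be avoided.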
Then, the transfer function and the gain at low frequencies of the whole RC-NGD circuit
presented in Fig. 7 are, respectively, expressed as:
Rds (1 g m R ) g m Rds Ls ,
G (s) (28)
R Rds Rt (1 g m Rds ) Rt C ( R Rds ) L s Rt CLs 2
Rds (1 g m R) .
G (0) (29)
R Rds Rt (1 g m Rds )
One should note that, because of the mismatch between the RC- and NGD-circuits, the transfer function, G(s), is not equal to GRC(s)·GNGD(s) (Ravelo et al., 2009). The same remark applies to the group delays:
Tp = {[1 + gm(Rt + Rds) + gm²RtRds]L + TpRC(R + Rds)(1 − gmR)} / [(R + Rt + Rds + gmRtRds)(1 + gmRds)].  (32)
So, the delay is reduced (Tp < TpRC) on condition that:
L > (1 − gmR)Rt²C/(1 + gmR).  (33)
But this inequality is automatically verified if the condition expressed in equation (27) is
true, i.e. if the active circuit generates NGD. According to this study, the optimum values of
the NGD circuit can be synthesised as a function of the RC-values.
From equation (31) and inversion of equations (29) and (32), one gets the synthesis relations
expressed in equations (35) and (36):
Otherwise, several stages of NGD cells are needed to generate a whole gain of about unity;
then, as reported in (Ravelo et al., 2008a), it is advised to use equation (38) for the calculation
of the optimal number of cells, n:
T is the time duration of the considered input square pulse. For a real x, the function int(x)
returns the greatest integer lower than or equal to x. It is worth underlining that the
implementation of at least two transistors, or more generally an even number of them, is
preferable when the application under study requires avoiding signal inversion.
Fig. 8. Schematics of the (a) RC- and (b) RC-NGD-circuits (including the biasing network in
thin lines) using a PHEMT FET (ATF-34143, Vgs = 0 V, Vd = 3 V, Id = 110 mA), for Rt = 33 Ω, C
= 680 pF, R = 56 Ω, Ro = 10 Ω, L = 220 nH, Cb = 100 nF and Z0 is the output reference load.
The FET parameters required for the synthesis equations were extracted from the non-linear
model and found to be gm = 226 mS and Rds = 27 Ω. For the RC circuit, the RL values of the
NGD circuit were synthesised from equations (33) and (34). Then, accurate frequency
responses were obtained through combination of circuit simulations (lumped components
and non linear FET model) with electromagnetic co-simulations of the distributed parts by
Momentum Software (from AgilentTM). The values of the available lumped components were
used in a final slight optimisation procedure. The layout of the hybrid planar circuit shown
in Fig. 9 was printed on an 800-µm-thick FR4 substrate of relative permittivity, εr = 4.3; then
the surface-mount chip passive components were set onto the substrate.
420 VLSI
- Frequency-domain analysis: Fig. 10(a) shows that the magnitude |G(f)| of the whole RC-
NGD circuit is kept at about 0 dB up to 40 MHz, and the corresponding group delay, τ(f),
is, as expected, strongly decreased from DC to 15 MHz compared to that of the RC circuit.
Moreover, the frequency responses are only slightly sensitive to lumped-component
tolerance.
Fig. 10. (a) Magnitude and (b) group delay frequency responses issued from simulations
with a Monte-Carlo analysis of the NGD circuit elements R and L (± 5%).
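A minimal Monte-Carlo experiment in the spirit of Fig. 10 can be scripted directly from the transfer function of equation (28). The snippet below is an illustration only (200 trials, uniform ±5% draws on R and L, magnitude evaluated at a single frequency of 10 MHz), not the chapter's simulation setup.

```python
import cmath
import random

# Monte-Carlo sensitivity sketch: the RC-NGD transfer function of eq. (28)
# evaluated with R and L drawn uniformly within +/-5% of the Fig. 8 nominals.
gm, Rds, Rt, C = 0.226, 27.0, 33.0, 680e-12
R0, L0 = 56.0, 220e-9

def G(s, R, L):
    num = Rds * (1 - gm * R) - gm * Rds * L * s
    den = (R + Rds + Rt * (1 + gm * Rds)
           + (Rt * C * (R + Rds) + L) * s + Rt * C * L * s**2)
    return num / den

random.seed(1)
f = 10e6                        # evaluate at 10 MHz
gains = []
for _ in range(200):
    R = R0 * random.uniform(0.95, 1.05)
    L = L0 * random.uniform(0.95, 1.05)
    gains.append(abs(G(2j * cmath.pi * f, R, L)))

spread = (max(gains) - min(gains)) / (sum(gains) / len(gains))
print(f"|G| spread at {f / 1e6:.0f} MHz over +/-5% tolerance: {100 * spread:.1f} %")
```

The gain magnitude stays close to unity with only a few percent of spread, in line with the weak tolerance sensitivity observed in Fig. 10.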
- Time-domain analysis: It is worth underlining that, at first, the signal was measured at the
output of a Rohde & Schwarz SMJ 100A signal generator at the highest available rate, i.e. 25
Msym/s, and then further used in simulations as the input signal. Then, excitation of the
simulated RC- and RC-NGD-circuits with this input square wave pulse (magnitude, VM = 1
V) led to the time-domain simulation results presented in Fig. 11, where the dotted curve
indicates the degradation induced by the RC-circuit alone; moreover, the signal recovery is
evidenced by the thick black curve. It clearly appears that the compensation is significant,
but incomplete. Because of the drain-source current inversion, the output signal is reversed
compared to the input voltage. This is why the plot presented in Fig. 11 is that of -VN.
Fig. 11. Results of time-domain simulations (with a Monte Carlo analysis of the NGD circuit
elements, R and L (± 5%)) for an input square wave pulse (VM = 1 V with a 40 ns data
duration).
It is worth noting that the 50% propagation delay is shortened from 16.5 ns (for the RC
circuit) to about 3 ns (for the RC-NGD circuit), i.e. a relative reduction of at least 81%. A
Monte-Carlo sensitivity analysis with a ±5% variation around the R and L nominal values
showed nearly no change of the RC-NGD circuit output signal (including the leading and
trailing edges), but the final value is slightly affected.
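The 50% propagation delays quoted here can be read off sampled waveforms with a few lines of code. The helper below is hypothetical (it is not from the chapter); the two first-order responses merely stand in for the RC and RC-NGD outputs to exercise it.

```python
import math

# Hypothetical helper: the 50% propagation delay is the time at which a signal
# first crosses half of its final value, interpolated between samples.
def delay_50(t, v, v_final):
    half = 0.5 * v_final
    for i in range(1, len(v)):
        if v[i - 1] < half <= v[i]:
            frac = (half - v[i - 1]) / (v[i] - v[i - 1])
            return t[i - 1] + frac * (t[i] - t[i - 1])
    raise ValueError("waveform never reaches 50% of its final value")

# toy first-order rises standing in for the RC and RC-NGD outputs
t = [i * 0.1e-9 for i in range(600)]
rc = [1 - math.exp(-x / 24e-9) for x in t]        # slow RC-like rise
rcngd = [1 - math.exp(-x / 4.3e-9) for x in t]    # equalized, faster rise

Tp_rc, Tp_ngd = delay_50(t, rc, 1.0), delay_50(t, rcngd, 1.0)
print(f"reduction: {100 * (Tp_rc - Tp_ngd) / Tp_rc:.0f} %")
```

With these stand-in time constants the relative reduction comes out above 80%, the same order as the 16.5 ns → 3 ns figure reported for the measured circuit.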
Fig. 12. Measurements of (a) magnitude and (b) group delay produced by the RC-, NGD-
and RC-NGD-circuits.
One should note that, though the RC-NGD circuit group delay, τ(f), is not fully cancelled
(Fig. 12(b)), it is strongly reduced below 20 MHz thanks to the NGD circuit; moreover, it is
kept below 4 ns up to 80 MHz. In theory, a higher absolute value of NGD could be obtained,
but a compromise between the NGD value, the NGD bandwidth and the gain flatness has to
be found to minimize the overshoot or the ripple in the time domain.
- Time-domain experimental results: Throughout the time-domain simulations and
measurements, the circuit load, Z0, is set at a high impedance value (Z0 = 1 MΩ).
Fig. 13. Time-domain responses with an input square pulse (25-Msym/s rate, 2 ns rise- and
fall-times) and zoom on twice the symbol duration.
Fig. 14. (a) line model: RLC network driven by a logic gate with output resistance, Rs; (b) the
whole circuit composed of the line model compensated by an NGD cell.
5.1 Theory
According to the procedure used in the previous section, the transfer function of the whole
system is:
(i) Compensation of a lumped RLC-circuit for an input signal with 1-ns data duration
Fig. 15 depicts the whole compensated RLC-NGD circuit under study. It consists of a
lumped RLC-circuit terminated by a two-stage NGD active circuit. The FETs used by the latter
(EC-2612, with gm = 98.14 mS and Rds = 116.8 Ω) are from Mimix Broadband™.
Fig. 15. Two-stage NGD active circuit compensating the RLC network (Rt = 100 Ω, Lt = 6 nH,
C = 4.3 pF, R1 = 86 Ω, R2 = 89 Ω, L1 = 5.2 nH, L2 = 2.6 nH and FET/EC-2612).
One should note that the time- and frequency-domain results presented here were obtained
by using the FET linear model provided by this manufacturer and recorded at the biasing
point Vds = 3 V and Ids = 25 mA.
- Frequency-domain results: Figs. 16 (a) and (b) present the magnitudes and the group
delays of the RLC- and NGD-circuits separately, and then, both cascaded (noted RLC-NGD),
as depicted in Fig. 15. Fig. 16(a) shows that the transfer-function magnitude of the overall
circuit (black thick curve) is kept between -4 and 0 dB up to 3 GHz.
Fig. 16. Simulated frequency responses of the RLC, NGD and RLC-NGD circuits: (a)
magnitude and (b) group delay.
Once again, one should note that the value of this whole transfer-function magnitude is
different from the sum (in dB) of the individual magnitudes (G_{RLC-NGD} ≠ G_{RLC}·G_{NGD}) because
of the mismatch between both parts. Figure 16(b) evidences that, thanks to the NGD effect (thin
blue curve), the total group delay of the whole circuit (black thick curve) is less than 68 ps. A
comparison of the group delays respectively produced by the RLC and the RLC-NGD
circuits shows a significant reduction up to about 0.8 GHz with the latter.
- Time-domain results: Figure 17 presents the results of transient simulations run in the case
of a 2-ns periodic input signal, Vi, whose rise-/fall-times are about 92 ps (thin red curve).
The response at the output of the RLC circuit alone, VRLC, is depicted by the degraded
dashed curve, and the corresponding 50% propagation delay, TpRLC, is equal to 304 ps.
Further to the insertion of the NGD circuit, the black thick curve representative of VN, i.e. at
the output of the RLC-NGD circuit, shows improvement with a relative reduction of more
than 85% of propagation delay since Tp is about 44 ps. Moreover, by comparison to Vi, VN
presents neither attenuation, nor overshoot, and the leading and trailing edges are both
improved.
Fig. 17. Time-domain responses produced by simulations of the circuit in Fig. 15 for a 2-ns
periodic trapezoidal input, rise-/fall-times, tr = 92 ps and 50%-duty ratio with zoom on a
half-period (bottom).
As the accuracy of the lumped RLC model of interconnect line used in these simulations
might be insufficient for certain VLSI configurations, the simulation described in the next
paragraph dealt with a distributed RLC-model (Ravelo et al., 2007a).
(ii) Compensation of a distributed RLC-line in the case of an input signal of 5-ns data
duration
Fig. 18 shows the circuit under study; the interconnect line is modelled by an RLC
distributed circuit. The classical RLC-line is driven by a gate of output resistance, Rs,
compensated by a two-stage NGD circuit loaded by another gate with input capacitance, CL.
The transmission line parameters were taken from the ITRS roadmap so that Rl = 76 Ω/cm,
Ll = 5.3 nH/cm and Cl = 2.6 pF/cm for a 0.8-cm-long line. For these values, the synthesis
relations were used together with an optimization process to get an output, VN, close to the
input, Vi. Finally, the component values for the two NGD cells were: R1 = 73 Ω, L1 = 99 nH,
R2 = 102 Ω and L2 = 17 nH.
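For orientation, the per-unit-length figures above translate into the following lumped totals for the 0.8-cm line; the sketch below is simple arithmetic for illustration, and the 0.5RC distributed-Elmore estimate it prints is a rough indication only, not part of the chapter's synthesis procedure.

```python
# Lumped totals for the 0.8-cm ITRS line quoted above
# (Rl = 76 ohm/cm, Ll = 5.3 nH/cm, Cl = 2.6 pF/cm).
Rl, Ll, Cl, length = 76.0, 5.3e-9, 2.6e-12, 0.8   # per-cm values, length in cm

R_tot = Rl * length     # total series resistance (ohm)
L_tot = Ll * length     # total series inductance (H)
C_tot = Cl * length     # total shunt capacitance (F)
elmore = 0.5 * R_tot * C_tot   # classical distributed-RC delay estimate

print(f"R = {R_tot:.1f} ohm, L = {L_tot * 1e9:.2f} nH, "
      f"C = {C_tot * 1e12:.2f} pF, 0.5RC = {elmore * 1e12:.0f} ps")
```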
- Frequency-domain results: Figs. 19(a) and (b) respectively show the transfer function
magnitude and the group delays produced by simulations of the three circuits (RLC-line,
NGD- and RLC-NGD-circuits). In Fig. 19(a), the important attenuation displayed by the
RLC-line is compensated by the NGD cells so that the total gain is kept at about 0 dB up to
about 500 MHz. With respect to the group delay produced by the RLC-line, the one
produced by the RLC-NGD circuit (thick black curve) is strongly reduced, as expected,
thanks to the NGD contribution (thin blue curve). Indeed, at baseband frequencies, the
latter provides a minimal NGD value of around -0.5 ns.
Fig. 19. Frequency responses produced through simulations of the RLC-, NGD-, and RLC-
NGD circuits shown in Fig. 18: magnitude (a) and group delay (b).
- Time-domain results: Time-domain simulations were run using a 10-ns-period input signal
with 0.60-ns rise-/fall-times.
Fig. 20. Time domain output responses of the RLC-line and of the RLC-NGD compensated
structure for a trapezoidal input voltage at a 200 Mbit/s rate.
Fig. 20 evidences a marked degradation of the output waveform, Vl, of the RLC-line
compared to the input waveform. Once again, the insertion of the NGD active circuit allows
a great enhancement illustrated by a notable signal regeneration and a reduction of the
propagation delay of about 1.60 ns (from 1.86 ns to 0.26 ns), accompanied by an
improvement of the signal leading and trailing edges. These results confirm the efficacy
and the reliability of this technique to equalize the degradation induced by either lumped or
distributed RLC-models of interconnect lines.
Let us focus, at first, on the analytical study of the proposed cell according to the RC circuit
values, prior to providing evidence, through simulations, of the compensation of the RC
effects, i.e. signal recovery and delay reduction, by using the proposed NGD topology.
6.1 Theory
According to (Ravelo et al., 2008), the transfer function of the RC-NGD device described in
Fig. 21 is:

G(s) = -\frac{G_{max} R_2 (1 + R_1 C_1 s)}{(1 + R_t C s)[R_1 + R_2 + R_{ds} + R_1 C_1 s (R_2 + R_{ds})]},    (48)
where Gmax is the maximal gain value expressed in equation (37). The minus sign is
explained by the intrinsic FET model, which naturally inverts the output voltage with
respect to the input one. Otherwise, with the same approach as in Section 4.1, from this
transfer function it can be established that the gain at very low frequencies and the Elmore
propagation delay are given by:
G(0) = -\frac{G_{max} R_2}{R_1 + R_2 + R_{ds}},    (49)

T_p = \frac{(R_2 + R_{ds}) T_{pRC} - R_1 C_1 (R_t + R_1)}{R_1 + R_2 + R_{ds}}.    (50)
From expression (49), loss compensation (|G(0)| > 1) is effective on condition that the
following relation between Gmax and the resistance values be met:

G_{max} > 1 + \frac{R_1 + R_{ds}}{R_2}.    (51)
In addition, from expression (50), it can be found that, whatever the values of the RC- and
the NGD-circuit parameters, TP is always lower than TpRC. In other words, the RC-
propagation delay is always reduced in the configuration of Fig. 21. At this stage, it is worth
pointing out that equation (50), despite its usefulness, is an approximation of the 50%
propagation delay as proposed by Elmore. Indeed, according to Ismail and Friedman, a
relative inaccuracy of about 30% is possible. Another limitation appears when the following
condition is satisfied:
R_t C < \frac{R_1 C_1 (R_t + R_1)}{R_2 + R_{ds}}.    (52)
In this case, expression (50) provides a negative value for Tp, which leads to an unrealistic
behaviour that would contradict the causality principle. Indeed, calculation of the exact
expression for Tp from the root of the equation vN(Tp) = VM/2 always gives a positive value
for the total propagation delay.
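This causality remark can be illustrated numerically for a generic two-pole/one-zero response, which is the form of equation (48). In the hypothetical example below, the zero time constant z dominates the pole time constants, so the Elmore-style estimate (sum of pole time constants minus z) goes negative, while bisection on the exact step response still yields a positive 50% crossing; all component values are invented for the illustration.

```python
import math

# Two-pole/one-zero step response: Elmore estimate vs exact 50% crossing.
tau1, tau2, z = 1.0e-9, 0.3e-9, 2.0e-9   # pole and zero time constants (s)

def step(t):
    # unit-DC-gain step response of (1 + z*s) / ((1 + tau1*s)(1 + tau2*s))
    return (1 - (tau1 - z) / (tau1 - tau2) * math.exp(-t / tau1)
              + (tau2 - z) / (tau1 - tau2) * math.exp(-t / tau2))

elmore = tau1 + tau2 - z                 # negative here: unphysical as a delay
lo, hi = 0.0, 10 * tau1                  # bisection bracket for step(t) = 0.5
for _ in range(80):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if step(mid) < 0.5 else (lo, mid)

assert elmore < 0 and lo > 0             # estimate negative, true crossing positive
print(f"Elmore estimate {elmore * 1e12:.0f} ps, exact 50% delay {lo * 1e12:.1f} ps")
```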
Despite these restrictions, equations (49) and (50) are particularly useful for a first analytical
approximation and permit the extraction of the synthesis relations needed to determine the
values of the NGD cell components:
The synthesis relations (53) and (54) are physically realistic under the following conditions:
G_{max} > 1,    (56)

R_2 > \frac{R_{ds}}{G_{max} - 1}.    (57)
Fig. 22. (a) The whole circuit composed of an RC-model compensated by a two-stage NGD
circuit with active biasing; FET (EC-2612, Vd = 3 V, Ids = 30 mA), R = 68 Ω, C = 10 pF, R1 = 142
Ω, R2 = 32 Ω, C1 = C2 = 10 pF, R3 = 100 Ω, R4 = 51 Ω; and (b) the corresponding simulation
results.
An active load bias with the same FETs (EC-2612) was applied to the circuit. No inductance
element is needed in this circuit. Transient simulations run with ideal components gave the
results displayed in Fig. 22(b). Once again, the comparison of the RC- and RC-NGD-outputs
highlights a signal recovery, characterised by a gain of about 1.85 dB at t = T/2, and a
reduction of the propagation delay from TpRC ≈ 47 ps to Tp ≈ 12 ps, i.e. ΔT/TpRC ≈ 74.4% in
relative value.
Time-domain measurements are scheduled and will indicate if further improvements are
required prior to the integration in a final VLSI device.
experimental results in frequency- and time-domains were in very good agreement with
simulations and validated the compensation technique in the case of an input signal with a
25 Mbit/s data rate. Indeed, the reductions of the rise time and the 50% propagation delay
were 71 and 86%, respectively.
In many VLSI systems and particularly in long wires and/or for high data rates or clocks,
the inductive spurious effects can no longer be neglected. So, a more elaborate system
composed of an RLC interconnect model compensated with NGD cells was also studied
analytically in order to check for the validity and efficacy of the equalization technique and
to determine the synthesis relations to be used in further applications. To validate the
approach, a first series of simulations was run with an RLC lumped model for an input
signal at a 1 Gbit/s rate; then, the model used in the second set of simulations was an RLC
distributed line for an input signal at a rate of 200 Mbit/s. These simulations under realistic
conditions confirmed the compensation approach with reduction of the propagation delay
of the same order as previously. Moreover, as observed with the RC-model, the front and
trailing edges both showed great enhancements indicative of a good recovery of the signal
integrity.
Finally, to be able to compensate for interconnect effects in VLSI systems, the proposed
circuits must be compatible with a VLSI integration process. This requirement drove us to
propose improvements of the proposed topology in order to cope with inductance
integration and manufacturing prerequisites. So, a topology with no inductance, but with
the same behaviour and performances as previously was proposed. A theoretical study
provided evidence of its ability to exhibit a negative group delay in baseband together with
gain. Then, time-domain simulations of a two-stage NGD device excited by a 1 Gbit/s-rate
input signal were run to validate the expected compensation approach and check for the
signal recovery.
The implementation of this equalization technique in the case of a VLSI integration process
is expected to allow compensation for spurious effects such as delay and attenuation
introduced by long inter-chip interconnects in SiP and SoC equipment or by long wires and
buses. A preliminary step would be the design and implementation of such a circuit in
MMIC technology and especially by using distributed elements. At this stage, even if
experimentally the NGD cells were not particularly sensitive to noise contribution, it would
be worth comparing this approach and repeater insertion under rough conditions, i.e. long
wires with a significant attenuation, in order to gain key information on their respective
behaviour under conditions of significant noise. As identified in the ITRS roadmap, power
consumption is now one of the major constraints in chip design and has been identified as
one of the top three overall challenges over the last 5 years. Faced with these constraints,
further investigations are needed to accurately evaluate the consumption of the presented
NGD active circuits.
8. References
Adler, V. & Friedman, E. G. (1998). Repeater Design to Reduce Delay and Power in Resistive
Interconnect, IEEE Trans. Circuits Syst. II, Analog and Digital Signal Processing, Vol.
45, No. 5, pp. 607-616.
Bakoglu H. B. & Meindl J. D. (1985). Optimal Interconnection Circuits for VLSI, IEEE Trans.
On Electron. Devices, Vol. 32, No. 5, pp. 903-909.
Barke, E. (1988). Line-to-ground capacitance calculation for VLSI: a comparison, IEEE Trans.
on Computer-Aided Design, Vol. 7, No. 2, pp. 295-298.
Deng, A. C. & Shiau, Y. C. (1990). Generic linear RC delay modeling for digital CMOS
circuits, IEEE Tran. on Computer-Aided Design, Vol. 9, No. 4, pp. 367-376.
Deutsch., A. (1990). High-speed signal propagation on lossy transmission lines, IBM J. Res.
Develop., Vol. 34, No. 4, pp. 601-615.
Deutsch, A.; Kopcsay, G. V.; Restle, P. J.; Smith, H. H.; Katopis, G.; Becker, W. D.; Coteus, P.
W.; Surovic, C. W.; Rubin, B. J.; Dunne, R. P.; Gallo, T. A.; Jenkins, K. A.; Terman, L.
M.; Dennard, R. H.; Sai-Halasz, G.; Krauter, B. L. & Knebel, D. R. (1997). When are
transmission line effects important for on-chip interconnections?, IEEE Trans. on
MTT, Vol. 45, No. 10, pp. 1836-1846.
Deutsch, A.; Kopcsay, G. V.; Surovic, C. W.; Rubin, B. J.; Terman, L. M.; Dunne, Jr., R. P.;
Gallo, T. A. & Dennard, R. H. (1995). Modeling and characterization of long on-chip
interconnections for high-performance microprocessors, IBM J. Res. Develop.,
Vol. 39, No. 5, pp. 547-567.
Dogariu, A.; Kuzmich, A.; Cao, H. & Wang, L. J. (2001). Superluminal light pulse
propagation via rephasing in a transparent anomalously dispersive medium, Optics
Express, Vol. 8, No. 6, pp. 344-350.
Eleftheriades, G. V.; Siddiqui, O. & Iyer, A. K. (2003). Transmission line models for negative
refractive index media and associated implementations without excess resonators,
IEEE MWC Lett., Vol. 13, No. 2, pp. 51-53.
Elmore, W. C. (1948). The transient response of damped linear networks, J. Appl. Phys., Vol.
19, pp. 55-63.
Friedman, E. (1995). Clock distribution networks in VLSI circuits and systems, New York:
IEEE Press.
Grover, F. (1945). Inductance Calculations: Working Formulas and Tables, Instrum. Soc. of
America.
Ismail, Y. I. & Friedman, E. G. (2000). Effects of inductance on the propagation delay and
repeater insertion in VLSI circuits, IEEE Trans. VLSI Syst., Vol. 8, No. 2, pp. 195-206.
Ismail, Y. I.; Friedman, E. G. & Neves, J. L. (2000). Equivalent Elmore delay for RLC trees,
IEEE Trans. Computer-Aided Design, Vol. 19, No. 1, pp. 83-97.
Kitano, M.; Nakanishi, T. & Sugiyama, K. (2003). Negative group delay and superluminal
propagation: an electronic circuit approach, IEEE Journal of Selected Topics in
Quantum Electronics, Vol. 9, No. 1, pp. 43-51.
Krauter, B. & Mehrotra S. (1998). Layout based frequency dependent inductance and
resistance extraction for on-chip interconnect timing analysis, IEEE Design
Automation Conference, Proceedings of the ACM, pp. 303–308.
Lucyszyn, S.; Robertson, I. D. & Aghvami, A. H. (1993). Negative group delay synthesiser,
Electron. Lett., Vol. 29, pp. 798-800.
Moore, G. E. (1965). Cramming more components into integrated circuits, in Electronics, Vol.
38, No. 8, pp. 114-117.
Munday, J. N. & Henderson, R. H. (2004). Superluminal time advance of a complex audio
signal, Appl. Phys. Lett., Vol. 85, pp. 503-504.
Nakanishi, T.; Sugiyama, K. & Kitano, M. (2002). Demonstration of negative group delays in
a simple electronic circuit, American Journal of Physics, Issue 11, Vol. 70, pp. 1117-
1121.
Palit, A. K.; Meyer, V.; Duganapalli, K. K.; Anheier, W. & Schloeffel, J. (2004). Test pattern
generation based on predicted signal integrity loss through reduced order
interconnect model, 16th Workshop Test Methods and Reliability of Circuits and
Systems.
Rabaey, J. M. (1996). Digital integrated circuits, a design perspective, Englewood Cliffs, NJ:
Prentice-Hall.
Ravelo, B. (Dec. 2008). Negative group delay active devices: theory, experimental
validations and applications, Ph.D. thesis (in French), Lab-STICC, UMR CNRS 3192,
University of Brest, France.
Ravelo, B.; Pérennec, A. & Le Roy, M. (2007a). Equalization of interconnect propagation
delay with negative group delay active circuits, 11th IEEE Workshop on SPI, Genova,
Italy, pp. 15-18.
Ravelo, B.; Perennec, A.; Le Roy, M. & Boucher Y. (2007b). Active microwave circuit with
negative group delay, IEEE MWC Lett., Vol. 17, Issue 12, pp. 861-863.
Ravelo, B.; Perennec, A. & Le Roy, M. (2007c). Synthesis of broadband negative group delay
active circuits, IEEE MTT-S Symp. Dig., Honolulu (Hawaii), pp. 2177-2180.
Ravelo, B.; Pérennec, A. & Le Roy, M. (2008a). Application of negative group delay active
circuits to reduce the 50% propagation delay of RC-line model, 12th IEEE Workshop
on SPI, Avignon, France.
Ravelo, B.; Pérennec, A. & Le Roy, M. (2008b). Negative group delay active topologies
respectively dedicated to microwave frequencies and baseband signals, Journal of
the EuMA, Vol. 4, pp. 124-130.
Ravelo, B.; Pérennec, A. & Le Roy, M. (2009). Experimental validation of the RC-interconnect
effect equalization with negative group delay active circuit in planar hybrid
technology, 13th IEEE Workshop on SPI, Strasbourg, pp. 1-4.
Ruehli, A. & Brennan, P. (1975). Capacitance models for integrated circuit metallization
wires, J. of Solid-State Integrated Circuits, Vol. 10, No. 6, pp. 530-536.
Sakurai, T. (1983). Approximation of wiring delay in MOSFET LSI, IEEE J. of Solid State
Circuits, Vol. 18, No. 4, pp. 418-425.
Sakurai, T. (1993). Closed-form expressions of interconnection delay, coupling and crosstalk
in VLSI’s, IEEE Tran. on Electron. Devices, Vol. 40, No. 1, pp. 118-124.
Siddiqui, O. F.; Erickson, S. J.; Eleftheriades, G. V. & Mojahedi, M. (2004). Time-domain
measurement of negative-index transmission-line metamaterials, IEEE Trans. MTT,
Vol. 52, No. 5, pp. 1449-1453.
Siddiqui, O.; Mojahedi, M.; Erickson, S. & Eleftheriades, G. V. (2003). Periodically loaded
transmission line with effective negative refractive index and negative group
velocity, IEEE Tran. on Antennas and Propagation, Vol. 51, No. 10.
Solli, D. & Chiao, R. Y. (2002). Superluminal effects and negative delays in electronics and
their applications, Physical Review E, Issue 5.
Standley D. & Wyatt, J. L. Jr., (1986). Improved signal delay bounds for RC tree networks,
VLSI Memo, No. 86-317, Massachusetts Institute of Technology, Cambridge,
Massachusetts.
Vlach, J.; Barby, J. A.; Vannelli, A.; Talkhan, T. & Shi, C. J. (1991). Group delay as an estimate
of delay in logic, IEEE Trans. Computer-Aided Design, Vol. 10, No. 7, pp. 949-953.
Wang, L. J.; Kuzmich, A. & Dogariu, A. (2000). Gain-assisted superluminal light
propagation, Nature, Vol. 406, pp. 277-279.
Wyatt, J. L. Jr., & Yu, Q. (1984). Signal delay in RC meshes, trees and lines, Proceedings of the
IEEE International Conference on Computer-Aided Design, pp. 15-17.
Wyatt, J. L. Jr. (1985). Signal delay in RC mesh networks, IEEE Tran. Circuits and Systems,
CAS-32(5), pp. 507-510.
Wyatt, J. L. Jr. (1987). Signal propagation delay in RC models for interconnect, Circuit
Analysis, Simulation and Design, Part II: VLSI Circuit Analysis and Simulation, A.
Ruehli, ed., Vol. 3 in the series Advances in CAD for VLSI, North-Holland.
Yu, Q.; Wyatt, J. L. Jr.; Zukowski, C.; Tan, H-N. & O'Brien, P. (1985). Improved bounds on
signal delay in linear RC models for MOS interconnect, Proceedings of the IEEE
International Symposium on Circuits and Systems, pp. 903-906.
Book Embeddings 435
21
Book Embeddings
Saïd Bettayeb
University of Houston Clear Lake
Houston, TX 77058, USA
1. Introduction
Graph embeddings play an important role in interconnection network and VLSI (Very Large
Scale Integration) design. Simulation of one interconnection network by another can be
represented as a graph embedding problem. Determining the number of layers required to
build a VLSI chip, also called book-embedding, is another application of graph embeddings.
In this chapter, we explore the latter problem. After an overview of the results on book
embedding of the hypercube, we present results on book embedding of the k-ary hypercube,
a variant of the hypercube. We also present recently obtained results on the book
embedding of the torus graph.
Graph embeddings have been studied extensively in the literature for the important role
they play in interconnection network and VLSI (Very Large Scale Integration) design
(Bernhart & Kainen, 1979; Bettayeb et al., 1989; Bettayeb & Sudborough, 1992; Bettayeb &
Sudborough, 1989; Chung et al., 1987; Yannakakis, 1989). Simulating an interconnection
network, say A, by another, say B, is represented as a graph embedding problem where the
nodes and edges of A are mapped to nodes and paths of B. A book embedding of a graph G
is the mapping of the nodes of G onto the spine of a book and the edges of G onto pages so
that the edges assigned to the same page do not cross. Determining the minimum number of
pages required for such an embedding is the focus of this chapter. The minimum number of
pages in which a graph G can be embedded is called the pagenumber of G, pg(G).
Determining the pagenumber of an arbitrary graph has been shown to be NP-complete
(Chung et al., 1987; Yannakakis, 1989). It remains NP-complete to determine if an arbitrary
graph can be embedded in two pages. In a 1980 article, Garey et al. (Garey et al., 1980)
proved that determining the pagenumber of an arbitrary graph remains NP-complete even
if we assume that the node embedding part is fixed, i.e. the layout of the nodes is given. An
equally challenging task is the problem of determining the pagewidth, or geometric
thickness, of an arbitrary graph G, which is defined to be the minimum number of layers in
a planar drawing of G.
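The page constraint in the definition above can be stated compactly in code: with the vertices placed on the spine in order 0, 1, ..., n-1, two edges cross, and therefore cannot share a page, exactly when their endpoints interleave. The sketch below is an illustration (the K4 example and helper names are ours, not from the chapter):

```python
from itertools import combinations

# Two spine edges (a,b) and (c,d) cross exactly when a < c < b < d
# (after normalizing so a < b, c < d and a <= c).
def crosses(e, f):
    (a, b), (c, d) = sorted(map(sorted, (e, f)))
    return a < c < b < d

# A set of edges forms a valid page when no two of them cross.
def valid_page(edges):
    return not any(crosses(e, f) for e, f in combinations(edges, 2))

# K4 with spine order 0,1,2,3: edges (0,2) and (1,3) cross, so one page is not
# enough, but the two-page split below works -- indeed pg(K4) = 2.
assert not valid_page([(0, 2), (1, 3)])
assert valid_page([(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]) and valid_page([(1, 3)])
```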
Researchers have been drawn to study this problem and its variations (Chung
et al., 1987; Galil et al., 1989; Yannakakis, 1989) because of its many and diverse applications
such as fault tolerant computing (Chung et al., 1987), graph drawing, and graph separators
(Galil et al., 1989). Other problems remain to be solved such as the relationship of the
pagenumber of a graph and other invariants. Enomoto and Miyauchi (Enomoto & Miyauchi,
1999) considered the case where edges may use more than one page.
The pagenumber of a graph has strong implications in VLSI design. The pagenumber is the
minimum number of layers required to produce a VLSI chip. Another area of VLSI design
that can be described in terms of book embedding is the configuring of processors in the
presence of faults. Given an array of processors, some of which may be faulty, we lay them
in a line. This could be either physical or logical. Running parallel to the line of processors
are bundles of wires. As we scan the line of processors, we activate switches connecting the
good processors to a bundle of wires and bypassing the bad processors. The bundle of wires
act like a stack in that, when a processor requests a connection to another processor, it is
connected to a particular bundle and pushes the other processor connections down by one
wire. When our scan reaches the processor to which it connects, the connection is popped off
the bundle, since the wire is no longer needed, and the other connections are returned to
their original positions. The desired property in this case is the minimization of the number
of bundles required to interconnect all of the good processors in the desired layout. This is
used in the Diogenes method of fault tolerant design as described by Chung, Leighton, and
Rosenberg (Chung et al., 1987). If we take each bundle of wires and represent it as a page,
we have a book embedding. Chung, Leighton, and Rosenberg (Chung et al., 1987) have
studied the book embedding problem for a variety of graphs including trees, grids, X-trees,
Benes networks, complete graphs, and the binary hypercube.
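The bundle-as-stack behaviour described above can be simulated directly. The helper below is a hypothetical sketch (names and the one-connection-per-processor restriction are ours): it scans the processor line left to right, pushes each opening connection onto the bundle, and requires each closing connection to match the top of the stack.

```python
# Diogenes-style check: does a set of connections fit in a single bundle?
# Assumes each processor takes part in at most one connection (sketch only).
def fits_one_bundle(n, connections):
    open_at = {min(a, b): max(a, b) for a, b in connections}
    close_at = {max(a, b): min(a, b) for a, b in connections}
    stack = []
    for p in range(n):                     # scan the line of processors
        if p in close_at:
            if not stack or stack[-1] != close_at[p]:
                return False               # partner buried in the bundle: crossing
            stack.pop()
        if p in open_at:
            stack.append(p)
    return True

assert fits_one_bundle(4, [(0, 3), (1, 2)])      # nested: one bundle suffices
assert not fits_one_bundle(4, [(0, 2), (1, 3)])  # interleaved: needs two bundles
```

Identifying each bundle with a page makes this exactly the one-page (single stack) condition of a book embedding.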
A d-dimensional torus is the d-dimensional mesh with wraparound edges. The wraparound
edges connect the first and last vertex in each dimension. The notation T(a_1 × a_2 × … × a_d)
denotes the d-dimensional torus, where a_i, 1 ≤ i ≤ d, is the size of the i-th dimension.
The binary hypercube of dimension n, denoted by Q(n), has 2^n vertices labeled by the
binary representations of the integers between 0 and 2^n − 1. Two vertices are connected by an
edge if and only if their labels differ in exactly one bit position. The k-ary hypercube of
dimension n, denoted by Q_k(n), has k^n vertices labeled by the k-ary representations of the
integers between 0 and k^n − 1. Two vertices are connected by an edge if and only if their
labels differ in exactly one position by one (modulo k).
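These definitions translate directly into a small generator. The function below is an illustration (the notation Q_k(n) and the helper name are assumptions of this sketch): vertices are n-digit k-ary strings, adjacent when exactly one digit differs by one modulo k.

```python
from itertools import product

# Generator for the k-ary hypercube: vertices are n-tuples over {0,...,k-1},
# adjacent when exactly one coordinate differs by +/-1 modulo k.
def kary_hypercube(k, n):
    vertices = list(product(range(k), repeat=n))
    edges = set()
    for v in vertices:
        for i in range(n):
            for step in (1, k - 1):            # +1 and -1 modulo k
                w = v[:i] + ((v[i] + step) % k,) + v[i + 1:]
                edges.add(frozenset((v, w)))   # undirected edge, deduplicated
    return vertices, edges

# k = 2 gives the binary hypercube: 2^n vertices and n * 2^(n-1) edges.
v, e = kary_hypercube(2, 4)
assert len(v) == 16 and len(e) == 32
# For k > 2 every vertex has two neighbours per dimension: n * k^n edges total.
v3, e3 = kary_hypercube(3, 3)
assert len(v3) == 27 and len(e3) == 81
```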
short edges. Let Gc be the graph obtained from C by adding its chords. If F and F' are two
inner faces of Gc where the long edge of F is a short edge of F', then all vertices of F must
appear between two consecutive vertices of F'. First, let K be the cycle bounding the inner
face of Gc that contains the long edge of C. Its short edges must be the long edges of the
other inner cycles. Lay out the interior of K. Then, expand recursively the inner cycles.
As pointed out in (Yannakakis, 1989), the vertices of a planar graph can be partitioned into
levels. All edges connect either vertices of the same level or vertices of adjacent levels. The
former are called level-edges and the latter binding edges. Level 0 consists of the vertices
forming the cycle K. Laying out the interior of K is accomplished by first laying level 1
vertices and coloring the short edges of K and the binding edges between levels 0 and 1.
Expand recursively the cycles formed by level 1 vertices.
6. Conclusion
In this chapter, we presented results on book embedding and queue embedding of graphs.
The upper bound for the book-embedding of the torus we achieved is 2��� – 3. It is
interesting to know if this could be improved. Heath, Leighton and Rosenberg (Heath et al.
1992) showed that the ternary hypercube has a lower bound �3��� �. It follows that the
lower bound for the torus is exponential in the number of dimensions when the sizes of its
dimensions are odd. The authors in (Heath et al., 1992) conjectured that families of graphs
with large queue number and small page (or stack) number do not exist. In (Bettayeb et al.,
2010), we describe a class of graphs, namely the d-dimensional k-ary modified hypercubes,
which have pagenumber ����. We conjectured that the queue number for such graphs grows
more rapidly than any linear function of the dimension d.
7. References
Bernhart, F. & Kainen, P. C. (1979). The Book Thickness of a Graph, Journal of Combinatorial
Theory, Series B (27), 1979, pp. 320-331.
Bettayeb, S.; (1995). On the K-ary Hypercube. Journal of Theoretical Computer Science 140,
1995, pp. 333-339.
Bettayeb, S.; Heydari, H.; Morales, L.; Sudborough, I.H., (2010). Stack and Queue Layouts
for Toruses and Extended Hypercubes. To appear
Bettayeb, S.; Hoelzeman, D., (2009). Upper and Lower Bounds on the Pagenumber of the
Book Embedding of the k-ary Hypercube. Journal of Digital Information
Management, 7 (1), 2009, pp. 31-35.
Bettayeb, S.; Miller, Z.; Sudborough, I.H., (1992). Embedding Grids into Hypercubes. Journal
of Computer and System Sciences, 45 (3), 1992, pp. 340-366.
Bettayeb, S.; Sudborough, I.H., (1989). Grid Embedding into Ternary Hypercubes. Proc. Of
the ACM South Central Regional Conference, 1989, pp. 62-64.
Buss, J.F.; Shor, P.W., (1984). On the Pagenumber of Planar Graphs. Proc. Of the 16th Annual
Symposium on Theory of Computing, 1984, pp. 98-100.
Chung, F.R.K.; Leighton, F.T.; Rosenberg, A.L. (1983). DIOGENES: A Methodology for
Designing Fault Tolerant Processor Arrays. 13th Conf. on Fault Tolerant Computing,
1983, pp. 26-32.
Chung, F.R.K.; Leighton, F.T.; Rosenberg, A.L. (1987). Embedding Graphs in Books: A
Layout Problem with Application to VLSI Design. SIAM Journal of Algebraic
Discrete Methods, 8 (1), 1987, pp. 33-58.
Dean, A.M.; Hutchinson, J.P. (1991). Relations among Embedding Parameters for Graphs.
Graph Theory, Combinatorics, and Applications, vol. 1, Wiley Interscience Publ., New
York, 1991, pp. 287-296.
Dean, A.M.; Hutchinson, J.P. ; Sheinerman, E.R. (1991). On the Thickness and Arboricity of a
Graph. Journal of Combinatorial Theory Series B, vol. 52, 1991, pp. 147-151.
Dillencourt, M.B.; Eppstein, D.; Hirschberg, D.S. (2000). Geometric Thickness of Complete
Graphs. Journal of Graph Algorithms and Applications. vol. 4, 2000, pp. 5-17.
Enomoto, H.; Miyauchi, M.S., (1999). Embedding Graphs into a Three Page Book with O(M
log N) crossings of edges over the spine. SIAM Journal of Discrete Math. 12, 1999, pp.
337-341.
440 VLSI
Heath, L.S.; Istrail, S. (1992). The Pagenumber of Genus g Graphs is O(g). Journal of the
Association for Computing Machinery, 39, 1992, pp. 479-501.
Heath, L.S; Leighton, F.T.; Rosenberg, A.L. (1992). Comparing Queues and Stacks As
Mechanisms for Laying out Graphs. SIAM Journal of Discrete Mathematics, 5 (3),
1992, pp. 398-412.
Kainen, P.C.; Overbay, S. (2003). Book Embeddings of Graphs and a Theorem of Whitney.
Technical Report GUGU-2/25/03, http:// www.georgetown.edu/faculty/kainen/pbip3.pdf .
Kainen, P.C. (1990). The Book Thickness of a Graph. Congr. Numer., 71, 1990, pp. 127-132.
Yannakakis, M. (1989). Embedding Planar Graphs in Four Pages. Journal of Computer and
System Sciences, 38 (1), 1989, pp. 36-67.
22
VLSI Thermal Analysis and Monitoring
1. Introduction
The evolution of the integrated-circuit industry during the last decade has been so rapid
that it is now possible to integrate complex systems on a single SoC (System on Chip). Due
to aggressive technology scaling, VLSI integration density as well as power density
increases drastically. As micro-scale thermal phenomena research activities are gaining
popularity due to an abundance of SoC and MEMS-based applications, various
measurement techniques are needed to understand the thermal behaviour of a VLSI chip. In
particular, measurement techniques for surface temperature distributions of large VLSI
systems are a highly challenging research topic. Hence, detection of surface thermal peaks is
necessary in modern VLSI circuits: their internal stress due to packaging, combined with
local self-heating, becomes serious and may result in large performance variation, circuit
malfunction and even chip cracking.
One of the important questions in the field of thermal issues of VLSI systems and micro-
systems is how to perform the thermal monitoring, in order to indicate the overheating
situations, without complicated control circuits. The traditional approach consists of placing
many sensors everywhere on the chip; their outputs can then be read simultaneously
and compared with a reference voltage recognized as the overheating level.
The idea of the proposed method is to predict the local temperature and gradient along a
given distance at only a few places on the monitored surface, and to evaluate the obtained
information in order to predict the temperature of the heat source. In the case of SoC
devices there is no room on the layout for a complicated unit performing computations,
but there is also no need for one, as we only want to detect overheating situations.
Detecting these peaks during thermal die monitoring is essential to avoid critical induced
thermo-mechanical stress. Moreover, in most cases the overheating occurs in only one
place.
For example, the power density of high-performance microprocessors has already reached
50 W/cm2 at the 100 nm technology node and will reach 100 W/cm2 at the 50 nm node
(ITRS, 2003). This evolution towards higher integration levels is motivated by the need for
high-performance, lighter and more compact systems with less
power consumption. Meanwhile, to mitigate the overall power consumption, many low
power techniques such as dynamic power management (Wu et al., 2000), clock gating (Oh &
Pedram, 2001), voltage islands (Puri et al., 2003), dual Vdd/Vth (Srivastava et al., 2004) and
power gating (Kao et al., 1997), (Long & He, 2003) have been proposed recently. These techniques,
though helpful to reduce the overall power consumption, may cause significant on-chip
thermal gradients and local hot spots due to different clock/power gating activities and
varying voltage scaling. It has been reported in (Gronowski et al., 1998) that temperature
variations of 30 °C can occur in a high performance microprocessor design. The magnitude of
thermal gradients and associated thermo-mechanical stress is expected to increase further as
VLSI designs move into nanometer processes and multi-GHz frequencies.
Nevertheless, the growth of the dissipated power density has brought a number of critical
thermo-mechanical problems. The heat produced in the structure of VLSI devices is
directed towards its edges, where it is dissipated by radiation, conduction, or convection.
The principal effect of the absence of a good dynamic thermal management is the gradual
and continuous degradation of the quality of performance as well as some other direct
effects on the life cycle of the electronic systems (Lopez-Buedo et al., 2002). Thus, an algorithm for
the detection and the localization of the thermal peaks is extremely important in order to
manage the thermal stress on high-density semiconductor devices. Detecting these peaks
during thermal die monitoring is essential to avoid critical induced thermo-mechanical
stress.
This study presents a VLSI thermal peak monitoring approach, using GDS, for the
development of the SPTDA algorithm. The proposed algorithm, which uses only two sensor
cells, will be formulated so as to facilitate the development of modular architectures using
minimum silicon area for VLSI implementation. The selected architecture will be modelled
in high-level languages, simulated in order to evaluate its performance, and then
implemented on an FPGA (Field-Programmable Gate Array). A closed simulation loop is
used to evaluate the performance of the proposed architectures at each stage. The
architecture of the algorithm will be designed in a modular perspective after separating the
different elementary functions of the algorithm. Hence the design of the SPTDA with a
flexible modular-based architecture will be presented. The architecture is designed in high-
level languages such as Matlab™ – Simulink®, simulated, tested using VHDL and
synthesized using Xilinx™ ISE (Xilinx, 2009) and Altera™ DSP (Altera, 2009) tools. The
simulation and hardware implementation results will be compared to a finite element
method (FEM) temperature prediction of the entire GDS cell configuration.
This chapter will present a packaged VLSI thermal analysis by FEM and a thermal
monitoring approach using the GDS (Gradient Direction Sensors) method. The design of the
surface peaks thermal detector algorithm (SPTDA) with a flexible modular-based
architecture will also be presented.
The proposed algorithm is based on the GDS method for evaluating a single heat source on
the chip surface. The principle of this method is explained in detail in (Wójciak &
Napieralski, 1997). For two sensors A and C placed at a distance a (Fig. 1), the difference
between their output voltages is proportional to the change of the temperature value along
the distance a (Wójciak & Napieralski, 1997). This is true only when the heat source lies
directly on the line AC; in any other case the value of the angle α has to be taken into
account for the proper calculation of ΔT (1).
ΔT/r = (T_C − T_A)/(a·cos α) ∝ (V_C − V_A)/(a·cos α) (1)
where r is the distance from the heat source. In figure 1, we have:
b = AD, AD ∝ T_B − T_A ∝ V_B − V_A (2)
b + c = AE, AE ∝ T_C − T_A ∝ V_C − V_A (3)
In order to obtain information about the angle α, a third sensor should be applied. In the
simplest case the GDS contains only three temperature sensors placed at distance a (fig. 1).
Fig. 1. The 3 sensors cell (0, 30) (Wójciak & Napieralski, 1997)
On the basis of eqs. (2) and (3) and the geometrical dependencies from fig. 1, eq. (4) is
obtained:
tan α = (1/√3)·(2·(V_B − V_A)/(V_C − V_A) − 1) (4)
Using the cell we can obtain information on the temperature distribution and partly on the
position of the heat source. In order to obtain the temperature value of a single punctual
heat source we have to calculate the distance between the sensor and the heat source.
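The angle recovery behind eq. (4) amounts to resolving a uniform gradient from two projected measurements. The following sketch is illustrative, not the authors' circuit: it assumes each voltage difference is proportional to the gradient component along the corresponding sensor pair, with sensor B at an assumed angle theta_B from the AC axis.

```python
import math

def gradient_angle(dV_B, dV_C, theta_B):
    # Assumed model: dV_C ∝ g*a*cos(alpha), dV_B ∝ g*a*cos(alpha - theta_B),
    # hence dV_B/dV_C = cos(theta_B) + sin(theta_B)*tan(alpha).
    # Solve for alpha (valid for |alpha| < pi/2 and dV_C != 0).
    ratio = dV_B / dV_C
    return math.atan((ratio - math.cos(theta_B)) / math.sin(theta_B))
```

Feeding the model's own voltage differences back through this function recovers the original gradient direction, which is exactly the role eq. (4) plays for the specific sensor geometry of fig. 1.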
Fig. 2. Two 3-sensor cells (A, B, C and D, E, F) placed at a distance H, with the isotherms
and the heat flow from an equivalent single heat source at distances R1 and R2 from the
cells.
Two sensor cells are required for this purpose (fig. 2). The cells are placed at a given
distance (H) and each of them gives information about the angle α (α1 and α2) in the
direction of the heat source. Hence, the heat source and the cells form a triangle in which
the length of one side and the values of the angles adjacent to this side are known. This
means that we can calculate the distances between the heat source and the sensors, and
then the temperature gradient along the known distance. By adding it to the temperature
of the sensor we obtain the temperature of the heat source. The two sensor cells A,B,C and
D,E,F are placed in two corners of the monitored layout at the distance H. Hence, the
temperature of the heat source can be obtained by equation (5).
T_S ∝ V_S = (H/a)·(V_C − V_A)·tan α2·(1 + √3·tan α2) / (√3·(1 − tan α1·tan α2)·(tan α1 + tan α2)) + V_A (5)
Figure 2 shows the proposed distribution of the 6 sensors divided into 2 cells located either
on or outside the chip.
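The triangulation described above (one known side H and its two adjacent angles) follows the law of sines. A minimal sketch, under the idealized assumptions that the measured angles are exact and that the gradient is constant between sensor and source (function names and values are illustrative):

```python
import math

def source_distance(H, alpha1, alpha2):
    # Law of sines in the triangle (cell 1, cell 2, source):
    # the side opposite alpha2 is the distance r1 from cell 1 to the source.
    return H * math.sin(alpha2) / math.sin(alpha1 + alpha2)

def source_temperature(T_sensor, grad, r):
    # Linear extrapolation: sensor temperature plus gradient times distance.
    return T_sensor + grad * r
```

As a geometric check, for cells at (0,0) and (4,0) and a source at (1,2), the angles α1 = atan2(2,1) and α2 = atan2(2,3) reproduce the true distance √5 from cell 1.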
[Figures: SPTDA block diagram — computation of tan α1 and tan α2, computation of r1
and r2, estimation of Ts, delay elements, verification module and loop — and a plot of
temperature versus matching intervals.]
Fig. 6. Physical position of the cell sensors (A–F) and the heat sources (HS1, HS2) on the
FEM model, with the layer labels (substrate level, solder, Al level), the distances R1 and R2,
and the convection coefficient.
As illustrated in figure 5 (dimensions not to scale), the WSI device studied is a multilevel
structure with a simple Si (silicon) substrate covered with different layers: Al (aluminum),
solder balls, and molding compound. The packaging assembly was a ceramic BGA. Once the
geometry of the device had been determined and the heat transfer mechanisms quantified, it
was possible to model the system using finite element analysis. Using the computer code
NISA (Numerical Integrated Elements for System Analysis), a 3-D model was created with
more than 100 000 isoparametric thermal shell elements. For the heating computations,
these elements model the 3-D state of heat flow; for the thermal part, each element has the
temperature (T) as the only degree of freedom at each node. Figure 6 shows the heat
sources and the sensor placement on the surface of the processor. That will enable us to
establish an in-situ sensor network to achieve the most homogeneous thermo-mechanical
cartography. In this
study we present a case of a WSI structure with internal heat generation. This configuration
will enable us to simulate intense device activity and to construct a thermal control unit
that will ensure suitable cooling. Moreover, we have to make sure that the temperature
variation across the device structure remains within limits appropriate to the induced
thermo-mechanical stress. Figure 7 shows the position of the sensor cells on the WSI device
for the temperature and stress results.
Fig. 7. Schematic position of the sensor cells (along X1–X4) and the heat source on the WSI
device
Fig. 8. VLSI device transient thermal analysis dissipating multiple localized heat sources
Fig. 9. VLSI device steady state thermal analysis dissipating single localized heat source
In this study, investigations are done for the simplest case: only six temperature sensors
(A, B, C, D, E and F, Figure 2) in the form of two sensor cells and one single power heating
source, in order to validate the prediction with the 3-D FEM model. The simulations have
been carried out for one source placed at the junction surface level. As expected, the peak
temperature profile is located at the centre of the heat source (figure 9). There is subsequent
relaxation of the temperature gradients through the structure, leading to an essentially
uniform temperature variation ΔT. The sensor cells can be placed in any way out of the monitored
area (different distance H between cell 1 and cell 2), but in some cases adequate placement
can simplify the thermal control unit design. In this part the results of thermal peaks can be
very useful for indicating overheating situations and critical thermo-mechanical stress
occurring in the device structure. Hence, Table 1 displays a comparison between the
temperature peaks on the surface for different implementations under the same conditions.
Thus, the FEM results obtained (Figures 10 and 11) are in full agreement with the GDS
predictions.
During implementation, a fixed-point representation of the SPTDA algorithm was used.
After implementation, the same input is directed to the algorithm simultaneously in
Simulink® and on the FPGA board. The results are routed back to Simulink®, multiplexed,
and projected on the same 2-D graph in order to compare the outputs. As many simulations
and co-simulations had preceded the implementation, the result was expected to match the
floating-point versus fixed-point comparison. An optimal frequency close to 100 MHz has
been achieved. Furthermore, a VHDL TB (test bench) was constructed and a force file was
used to stimulate the inputs and to compare them with the algorithm predictions. Hence,
the set of simulations revealed that the estimations generated by the SPTDA algorithm
presented a great concordance with the predictions generated by the finite element method
(FEM) presented in (Lakhsasi et al., 2006; Bougataya et al., 2006).
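The floating-point versus fixed-point comparison mentioned above can be illustrated with a simple quantization sketch; the word length is an assumption for illustration, not the actual implementation's format:

```python
def to_fixed(x, frac_bits):
    # Round x onto a fixed-point grid with frac_bits fractional bits;
    # the quantization error is bounded by half an LSB, i.e. 2**-(frac_bits+1).
    scale = 1 << frac_bits
    return round(x * scale) / scale
```

Comparing the algorithm's outputs computed with `to_fixed`-quantized intermediates against the floating-point reference bounds the hardware error introduced by the chosen word length.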
incorrect result, unless a very large simulation region is used at the expense of very long
simulation run times. A more natural boundary condition is a zero flow condition across
these tiny surfaces (adiabatic boundary conditions) as shown in figure 12. The remaining
boundary conditions to be defined are on the bottom and top surfaces of the VLSI device,
representing the heat sink interfaces. Because the VLSI device is relatively thin and silicon
and solder are good thermal conductors, heat flows mainly in the vertical direction,
so the boundary conditions in both horizontal directions can be considered adiabatic. The
uniform heat removal at the bottom and top is modelled by a heat flux exchange coefficient
h [W/(m2·K)]. The power dissipated by the device is modelled by a heat flux generated
inside the components. The problem formulation is presented graphically in figure 12.
Fig. 12. VLSI device internal thermal boundary conditions (TBCs): a heat flux (surface
power density) at the heat source, adiabatic side walls, and cooling at the top and bottom.
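These boundary conditions (adiabatic sides via the 1-D assumption, convective exchange with coefficient h at the cooled faces, internal heat generation) can be illustrated with a minimal one-dimensional finite-difference sketch. All numerical values are illustrative assumptions, not data from the chapter:

```python
def solve_rod(N=21, L=1e-3, k=150.0, h=1e4, q=1e8, T_amb=25.0):
    # Steady 1-D conduction with uniform volumetric heating q [W/m^3],
    # conductivity k [W/(m K)], and convective exchange h [W/(m^2 K)] at both
    # ends; the side walls are adiabatic by the 1-D assumption.
    dx = L / (N - 1)
    a = [0.0] * N  # sub-diagonal
    b = [0.0] * N  # diagonal
    c = [0.0] * N  # super-diagonal
    d = [0.0] * N  # right-hand side
    # Half-cell energy balances at the convective ends.
    b[0] = k / dx + h; c[0] = -k / dx; d[0] = h * T_amb + q * dx / 2
    for i in range(1, N - 1):
        a[i] = -k / dx; b[i] = 2 * k / dx; c[i] = -k / dx; d[i] = q * dx
    a[N - 1] = -k / dx; b[N - 1] = k / dx + h; d[N - 1] = h * T_amb + q * dx / 2
    # Thomas algorithm (direct tridiagonal solve).
    for i in range(1, N):
        m = a[i] / b[i - 1]
        b[i] -= m * c[i - 1]
        d[i] -= m * d[i - 1]
    T = [0.0] * N
    T[-1] = d[-1] / b[-1]
    for i in range(N - 2, -1, -1):
        T[i] = (d[i] - c[i] * T[i + 1]) / b[i]
    return T
```

For this symmetric problem the analytic solution is a parabola, T(x) = T_amb + qL/(2h) + (q/2k)·x·(L − x), so the discrete midpoint temperature can be checked directly against T_amb + qL/(2h) + qL²/(8k).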
Here, E / (1 – ν) is the composite elastic constant for the different layers and Δα the
difference in the coefficients of thermal expansion (CTE) between the different levels of
packaging (Kobeda et al., 1989). In (Lakhsasi & Skorek, 2002) a method for calculating the
compressive stress at the silicon level has been presented in detail.
However, the induced compressive thermal stress will be combined with the intrinsic
stress due to the fabrication processes and the stress due to the mechanical clamping
mechanism.
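The biaxial mismatch relation just cited, σ = E/(1 − ν)·Δα·ΔT, can be sketched directly; the material values in the example are illustrative assumptions, not the chapter's data:

```python
def thermal_mismatch_stress(E, nu, d_alpha, d_T):
    # Biaxial thermal mismatch stress: sigma = E/(1 - nu) * d_alpha * d_T,
    # with E the Young's modulus [Pa], nu the Poisson ratio, d_alpha the CTE
    # difference [1/K] and d_T the temperature excursion [K].
    return E / (1 - nu) * d_alpha * d_T
```

With assumed values E = 130 GPa, ν = 0.28, Δα = 2·10⁻⁶ 1/K and ΔT = 30 K, the sketch yields a stress on the order of 10 MPa, the same order as the σxx peaks reported below.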
Table 2 gives the thermo-mechanical parameters for the different materials used in the
computation.
The nodal temperatures stored in the thermal part were used to perform a thermal stress
analysis of the device structure. It is assumed that the structure is stress-free at 25 °C. The
presence of a temperature variation throughout the device structure causes deformation due
to thermal contraction and expansion. A thermal stress analysis is performed to calculate
these deflections and associated stresses due to the thermal loading. The computation is
extended to the whole volume of the device structure.
The present approach may be suitable for large VLSI device packaging, because it
potentially allows the combination and integration of both thermal and mechanical control
in a single unit. As an example, the maximum level of stress generated by intense heating of
WSI devices has been evaluated and examined by the GDS method. During the design and
final packaging of large VLSI devices, an in-situ thermo-mechanical control unit must be
implemented to ensure the safe fulfillment of their operating conditions. The nature of such
interface materials must therefore be considered very carefully during WSI design and
packaging. Further research should be focused on elaborating software tools for the
optimization of temperature sensor positions within the area allowed to an IC designer.
Figures 13 and 14 show the thermal stress profile along the X1 and X2 axes. The peak
compressive stress of σxx = -8.5 MPa is reported around the region of the heat source.
Fig. 13. VLSI thermal stress profile according to X1-axis, σxx= -8.5 MPa Maximum
Fig. 14. VLSI thermal stress profile according to X2-axis, σxx= -3.3 MPa Maximum
VLSI layers are very thin, and any imperfection in the structure may lead to cracking and
subsequent shear-initiated delamination at the silicon level. The nature of such interface
materials must therefore be considered very carefully in VLSI device design for intense
applications. Depending on the technology requirements and the definition of failure, the
mechanism of failure may take several forms.
6. Discussion
One of the important questions in the field of thermal issues of VLSI systems and
microsystems is how to perform thermal monitoring, in order to indicate overheating
situations, without complicated control circuits. The traditional approach consists of placing
many sensors everywhere on the chip; their outputs can then be read simultaneously
(Szekely, 1994) and compared with a reference voltage recognized as the overheating
level.
The idea of the proposed algorithm is to predict the local temperature and gradient along a
given distance at only a few places on the monitored surface, and to evaluate the obtained
information in order to predict the temperature of the heat source. In the case of SoC
devices there is no room on the layout for a complicated unit performing computations,
but there is also no need for one, as we only want to detect overheating situations.
Detecting these peaks during thermal die monitoring is essential to avoid critical induced
thermo-mechanical stress. Moreover, in most cases the overheating occurs in only one
place.
In this chapter, a methodology to evaluate and predict the thermal peak of a large VLSI
circuit was presented. The important factors contributing to the device's thermal heating
were characterized. The monitoring approach reported in this chapter can be applied to
predict the thermal stress peak of multilevel structures. Detection of surface thermal peaks
is necessary in modern VLSI circuits: their internal stress due to packaging, combined with
local self-heating, becomes serious and may result in large performance variation, circuit
malfunction and even chip cracking. Also, in this chapter the GDS technique was used to
develop the SPTDA algorithm.
7. Conclusion
This study presented an approach to the thermal and thermo-mechanical stress monitoring
of VLSI chips. The possibility of evaluating thermal peaks and the associated thermal stress
distribution all over the monitored area by using GDS and FEM has been shown.
Furthermore, this study presented an approach to the application of the inverse problem to
thermo-mechanical analysis. The thermal peaks of the investigated source can be obtained
by applying the gradient direction sensors method. Adequate placement of the sensors can
accurately evaluate the thermal peaks and simplify the thermo-mechanical control unit
design. That will enable the chip designer to establish the most homogeneous thermo-
mechanical cartography during operation. The cost of the thermal management of a WSI
device depends heavily upon the efficiency of the chip design. Moreover, the spatial
distribution of the heat sources has a significant effect on the WSI device operation. Hence,
an in-situ thermo-mechanical control unit must be implemented to prevent unexpected
pitfalls.
Another aspect presented in this study is the detection of surface thermal peaks in modern
VLSI circuits: their internal stress due to packaging, combined with local self-heating,
becomes serious and may result in large performance variation, circuit malfunction and
even chip cracking. As an example, in this study the GDS technique was used to develop
the SPTDA algorithm. Several approaches were implemented to achieve a better
performance for the SPTDA operation. In (Wójciak & Napieralski, 1997), a physical circuit
architecture was proposed. However, complicated control components had to be deployed
in the circuit, which is not suitable for today's area constraints, especially if the control
algorithm needs to be placed on-chip, for example on a sensor network node. Thereby,
deploying the SPTDA in a sensor network can be made feasible in the future.
8. References
A. Lakhsasi, A. Skorek, "Dynamic Finite Element Approach for Analyzing Stress and
Distortion in Multilevel Devices," Solid-State Electronics, Pergamon, Elsevier Science
Ltd., vol. 46, no. 6, pp. 925-932, May 2002.
A. Lakhsasi, M. Bougataya, D. Massicotte: Practical approach to gradient direction sensor
method in very large scale integration thermomechanical stress analysis, J. Vac. Sci.
Technol. A 24(3), pp. 758-763, May 2006.
A. Srivastava, D. Sylvester, and D. Blaauw, “Concurrent Sizing, Vdd and Vth Assignment
for Low-Power Design,” in Proc. Design, Automation and Test in Europe, vol. 1,
Feb 2004, pp. 718–719.
Altera corp (http://www.altera.com), 2009.
C. Long and L. He, “Distributed sleep transistor network for power reduction,” in Proc.
Design Automation Conf., 2003.
E. Kobeda et al., "In situ measurements during thermal oxidation of silicon," J. Vac. Sci.
Technol. B 7 (2), Mar/Apr 1989.
ITRS, International Technology Roadmap for Semiconductors (ITRS), 2003.
J. Kao, A. Chandrakasan, and D. Antoniadis, “Transistor sizing issues and tool for multi-
threshold CMOS technology,” in Proc. Design Automation Conf., 1997.
J. Oh and M. Pedram, “Gated clock routing for low-power microprocessor design,” IEEE
Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, pp.
715–722, Jun 2001.
M. Bougataya, A. Lakhsasi and D. Massicotte: Steady State Thermo-mechanical Stress
Prediction for Large VLSI Circuits Using GDS Method. IEEE CCECE06 Proceedings,
ISBN: 1-4244-0038-4, pp. 917-921.
P. Gronowski et al., “High performance microprocessor design,” IEEE J. Solid-State Circuits,
vol. 33, pp. 676–686, May 1998.
Q. Wu, Q. Qiu, and M. Pedram, “Dynamic power management of complex systems using
generalized stochastic Petri nets,” in Proc. Design Automation Conf., Jun 2000.
R. Puri, L. Stok, J. Cohn, D. Kung, D. Pan, D. Sylvester, A. Srivastava, and S. H. Kulkarni,
“Pushing ASIC Performance in a Power Envelope,” in Proc. Design Automation
Conf., 2003.
Sergio Lopez-Buedo, Javier Garrido, and Eduardo I. Boemo: ‘Dynamically Inserting,
Operating, and Eliminating Thermal Sensors of FPGA-Based Systems’, IEEE Transactions
on Components and Packaging Technologies, vol. 25, no. 4, December 2002, pp.
561-566.
V. Szekely, Thermal monitoring of microelectronic structures, Microelectronic J., 25 (1994)
157-170.
W. Wójciak and A. Napieralski, "Thermal monitoring of a single heat source in
semiconductor devices – the first approach," Microelectronics Journal 28 (1997), pp.
313-316.
Xilinx corp (http://www.xilinx.com), 2009.