Khatibzadeh Amir Ali
1-GHZ NOVEL
by
A thesis
in the program
UMI Microform EC 53466
Copyright 2009 by ProQuest LLC
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
Ryerson University requires the signatures of all persons using or
photocopying this thesis. Please sign below, and give address and
date.
Abstract
Amir Ali Khatibzadeh
Ryerson University
This thesis presents the design of a novel 8 x 8-bit multiplier, which can provide better
performance than its counterparts in the sense that it has a fraction of the silicon area,
delay and power consumption of the common architectures such as the conventional
realization. At the circuit level, the pseudo-NMOS full adder cell is chosen among the
several existing full adder cells due to its superior speed and power performance.
The performance of this design has been evaluated by comparing it to those of
recently reported multipliers. The results of the comparison, both in theory and
Acknowledgement
During the journey through the Master's program, support and help from friends, family, and
faculty can be invaluable. To begin with, I cannot stress enough that the single most
important person contributing to the success of the student is the principal dissertation
advisor. With this in mind, I would like to thank Prof. Kaamran Raahemifar for his help
and guidance. His extensive knowledge of circuits and keen insight into VLSI design were
major assets.
(sorry if I missed any of you). I would like to thank Prof. Sridhar Krishnan, chair and
Ryerson, who has put his effort and intelligence to make a very vigorous atmosphere in
More recently, I would like to thank Prof. Vadim Geurkov for serving on my oral exam
committee. His feedback and insightful comments on the material of this thesis are
greatly appreciated.
I would like to express my appreciation to Prof. Alireza Sadeghian for reading this thesis
numerous hours designing VLSI circuits. It has truly been an enlightening experience,
I would also like to thank Haleh Vahedi and Majid Malekan for their technical advice
Last, but certainly not least, I would like to thank my dear friend, Dana, and my family
and loved ones who stood by me throughout these years. I will not name you all, but I am
eternally grateful. Words cannot express the warmth and gratitude I feel for each of you.
Fabrication support through Canadian Microelectronics Corporation (CMC) is also
gratefully acknowledged.
This thesis is dedicated to the late eminent scholar, Professor Abbas Sahab, my
Chapter 1: Introduction
1.1 Motivation
1.2 Applications
1.3 Original Contributions
1.4 Thesis Organization
Chapter 5: Conclusion
5.2 Comparison Results
5.3 Future Work
References
List of Tables
Table 3.8 Transistor dimension in XOR & transmission gate full adder
Table 3.10 Simulation results for the full adders sorted by power consumption
Table 3.11 Simulation results for the full adders sorted by propagation delay
Table 3.12 Simulation results for the full adders sorted by power-delay product
Table 3.13 Delay of the generate, propagate and sum signals of PFA
Table 3.15 Transistor dimension & delay of AND, NAND, XOR and OR gates
sorted by delay
digital multiplier
List of Figures
Fig. 2.6 Block diagram of the n x n multiplier using modified Booth algorithm
Fig. 2.7 Construction of Wallace’s tree for an 8 x 8-bit multiplier, reduction of the
Fig. 3.7 (a) Input patterns used to evaluate the performance of the adders
Fig. 3.7 (b) Input patterns used to evaluate the performance of the adders
Fig. 3.7 (c) Input patterns used to evaluate the performance of the adders
Fig. 3.7 (d) Input patterns used to evaluate the performance of the adders
Fig. 3.11 Block diagram of the 16-bit carry lookahead adder implemented by
Fig. 3.12 Schematic of (a) AND (b) NAND (c) XOR (d) OR gates
Fig. 3.13 Block diagram of the proposed 8 x 8-bit multiplier showing detail
Fig. 4.1 Circuit structure used for simulation of full adder cell
Fig. 4.3 The simulation waveforms showing response of the pseudo full adder
Fig. 4.4 The simulation waveforms showing response of the pseudo full adder
Fig. 4.5 The simulation waveforms showing response of the pseudo full adder
Fig. 4.6 The simulation waveforms showing response of the pseudo full adder
Fig. 4.7 The simulation waveforms showing response of the AND/NAND gate
A = 1, B = 1 and $C_{in}$ = 1
Fig. 4.13 The worst-case delay of 16-bit carry lookahead adder
“11111111” x “11111111”
“11111111” x “11111111”
(“11111111” x “11111111”)
(“10101010” x “01010101”)
Fig. 4.21 Adding two extra pins to $V_{dd}$ and reducing the parasitic inductance
Fig. 4.22 The simulation results of the final outputs against temperature variation
Fig. 4.23 Layout of the pseudo-NMOS full adder (die size of 22 x 8.5 µm²)
Fig. 4.24 Layout of AND gate (die size of 7.9 x 5.6 µm²)
Fig. 4.25 Layout of NAND gate (die size of 5.4 x 5.6 µm²)
Fig. 4.26 Layout of XOR gate (die size of 10.2 x 20.7 µm²)
Chapter 1
Introduction
1.1 Motivation
The core of every microprocessor, digital signal processor (DSP), and data processing
application-specific integrated circuit (ASIC) is its data path. It is often the crucial circuit
component if die area, power consumption, and especially operation speed are of
concern. At the heart of data path and addressing units are arithmetic units, such as
comparators, adders, and multipliers. Finally, one of the basic operations found in most
multipliers are also used in more complex operations like address calculation and
multiplication.
problem in VLSI design. Designing fast and power-efficient multipliers has been of great
theoretical and practical interest to computer scientists and engineers. Several
algorithms and various VLSI implementations have been proposed [1, 2, 3, 4, 5, 6, 7] and
used in practice.
To achieve a high-performance multiplier, it is necessary to operate very efficiently
in terms of the speed and power trade-off at all design levels. Increasing the operating speed
of the circuits to make more computations with lower power consumption is the main
The recent progress in use of Ultra Deep Sub-Micron Devices (UDSM) helps to
improves productivity in ASIC design. Taking all of these into account, implementing
effective and feasible alternative for increasing the performance of the multipliers
substantially. Therefore, this thesis deals with designing a novel multiplier with inherent
1.2 Applications
Wireless communication systems, including third generation cellular radio systems and
wireless Local Area Network Systems (LANS), have become tremendously popular in
recent years. These systems can be implemented using various platforms, such as digital
signal processors, ASICs and Field Programmable Gate Arrays (FPGAs). Most digital
as correlation, convolution, filtering and frequency analysis. These algorithms are used in
applications such as finite impulse response (FIR) filters, infinite impulse response (IIR) filters, discrete
cosine transforms (DCT), and fast Fourier transforms (FFT). Moreover, there has been a
rapid increase in the popularity of portable and wireless electronic devices, such as laptop
computers, portable video players and cellular phones, which rely on embedded digital
processors. Since the goal is to design digital systems for communication applications
at the best performance without sacrificing power, high performance and low power
This thesis presents the design of a novel multiplier architecture, with superior
multipliers.
Wallace and pair-wise algorithms have been reviewed in detail. Among these algorithms,
the pair-wise algorithm has been chosen due to its superior speed of operation.
1) generating four 8-bit numbers ($X_e$, $X_o$, $Y_e$, $Y_o$) using even and odd positions of
2) multiplying these four 8-bit numbers to generate the four 15-bit numbers
($P_{ee}$, $P_{eo}$, $P_{oe}$, $P_{oo}$) known as the even and odd elements of the partial products
($P$).
4) adding the two final 16-bit numbers ($P_e$, $P_o$) and thus generating the product of
In the first step of the design flow, topology selection, six full adder cells based on CMOS
static logic styles are redesigned and examined at the transistor level in a standard 0.18 µm
CMOS technology. The results of the extensive evaluation, which are further presented in
Chapter 3, show that the 14-transistor pseudo-NMOS full adder cell offers a better speed and
The validity of the design strategy is proven by testing the complete multiplier and
measuring the speed and power. All the designs are simulated using the Cadence Computer
Aided Design (CAD) tool in 0.18 µm CMOS technology at a 1.8 V supply voltage.
In summary a speed/power efficient novel multiplier for medium bit width applications is
CMOS logic adders, several topologies are examined to support the final circuit design.
• Power reduction through circuit/logic is achieved by using static style rather than
dynamic style. This causes the architectural level to be free from clock and
related clocking issues such as clock skew and high dynamic power.
transistor chaining, grouping, and signal sequencing in the adder layout which is
proven to provide substantial power saving and speed improvement at no area
penalty.
These original contributions have been published in two conference proceedings [9, 10].
Following the introductory Chapter 1, Chapter 2 describes the basic concept of two’s
VLSI implementation along with the pair-wise multiplication algorithm are introduced
cells further reviewed. The circuit design of the required cells for pair-wise structure is
also discussed.
Chapter 4 is dedicated to the simulation results of individual circuits and cells as well as
the final simulation results of the proposed multiplier. Layout considerations are also
discussed.
Finally, Chapter 5 presents the features of the designed multiplier. A comparative study
of the previous works on multipliers is presented to better evaluate this work. Drawing
conclusions, summarizing the contributions of this thesis, and outlining the directions for
fundamental building block which is being widely used in many Very Large-Scale
security processing and image processing. In addition to their main task, which is
multiplying two binary numbers, multipliers are the nucleus of many other useful
operations such as division and address calculation. In these systems the multipliers are
part of the critical path that determines the overall performance of the system. That is
In parallel with high-speed system design [11], low-power systems [1] are in high demand
because of the fast-growing technologies in mobile communication and computation.
Battery technology does not advance at the same rate as microelectronics technology.
There is a limited amount of power available for mobile systems. Thus, designers are
faced with more constraints: high speed, high throughput, small silicon area and at the
is of great interest.
Current architectures range from small, low performance array to tree multipliers.
Conventional linear array multipliers achieve high performance in a regular structure, but
require a large area of silicon. Tree structures achieve even higher performance than linear
arrays but the tree interconnection is more complex and less regular, making them even
larger than linear arrays. Ideally, one would wish for the speed benefits of a tree structure,
the regularity of an array multiplier, and the small size of a shift-and-add multiplier.
The first section of this Chapter explains the basics of binary multiplication. A review of
the best-known parallel multiplication algorithms is presented in Section 2.3. The pair-wise
multiplication algorithm that has been used in the proposed multiplier is also
described. These algorithms are then briefly compared against each other at the end of
this Chapter.
multiplicand on top of the multiplier. The multiplicand is then multiplied by each digit of
the multiplier beginning with the rightmost, the Least Significant Digit (LSD). Intermediate
results (partial products) are placed one atop the other, offset by one digit to align digits
of the same weight. The final product is determined by summation of all the partial
In the binary number system the digits, called bits, are limited to the set {0, 1}. The result
of multiplying any binary number by a single binary bit is either 0 or the original
number. This makes forming the intermediate partial products simple and efficient.
Summing these partial products is the time-consuming task for binary multipliers. One
logical approach is to form the partial-products one at a time and sum them as they are
generated. This technique works fine but is slow. For applications where this approach
does not provide good enough performance, another approach is used which is known as
parallel multiplication algorithms. In this latter approach all bit-products are generated in
parallel and a multi-operand adder (i.e., an adder tree) is used for their accumulation.
Multipliers that operate based on these algorithms are called parallel multipliers. Parallel
multipliers are becoming the key components in Reduced Instruction Set Computers
(RISCs), DSP and graphic accelerators due to their inherently higher speed of operation.
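The one-at-a-time approach mentioned above can be illustrated with a minimal Python sketch. It is a behavioural model only, not the circuit designed in this thesis; the function name `shift_add_multiply` and the default width of 8 bits are assumptions made for illustration.

```python
def shift_add_multiply(x: int, y: int, n: int = 8) -> int:
    """Multiply two unsigned n-bit integers by sequential shift-and-add.

    Hypothetical helper: one partial product is formed per multiplier bit
    and accumulated immediately, which is simple but needs n steps.
    """
    if not (0 <= x < 2**n and 0 <= y < 2**n):
        raise ValueError("operands must be unsigned n-bit values")
    product = 0
    for i in range(n):                 # start from the least significant bit
        bit = (y >> i) & 1             # multiplier bit of weight 2**i
        partial = x if bit else 0      # a partial product is 0 or the multiplicand
        product += partial << i        # align to its weight and accumulate
    return product
```

A parallel multiplier produces the same result, but generates all bit products at once and reduces them with a multi-operand adder instead of looping over the multiplier bits.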
systems, many algorithms have been proposed to perform multiplication, each offering
different advantages and having tradeoffs in terms of speed, circuit complexity, area and
power consumption. Among the multipliers reported, parallel multipliers have been of
great theoretical and practical interest to VLSI designers not only for their speed of
The structure of all parallel multipliers can be partitioned into three parts performing
b) Carry-free addition.
c) Carry-propagation addition.
These three parts can be implemented using different schemes such as simple AND gate
or Booth algorithm to generate partial products. The carry-free addition task is often
algorithm, which have been used in VLSI implementation of digital multipliers, are
briefly presented. Readers can consult references [11, 12] for more details on parallel
multiplication algorithms.
$$Y = \sum_{i=0}^{n-1} Y_i 2^i. \quad (2.2)$$
$$P = X \cdot Y = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} X_i Y_j 2^{i+j}. \quad (2.3)$$
Each of the partial product terms $P_k = X_i Y_j$ is called a summand. Fig. 2.1 shows an
The summands are generated in parallel with AND gates. Fig. 2.2 shows Braun’s
array multiplier [4]. Such an n x n multiplier requires n x (n - 1) adders and n² AND gates.
The delay of such a multiplier is determined by the delay of the full adder cell and the
final adder in the last row. In the multiplier array a full-adder with balanced carry and
sum delays is desirable because the sum and carry signals are both on the critical path.
For the large arrays, the speed and power of the full adder are both very important.
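As a small illustration of the summand idea, the sketch below builds the n x n matrix of bit products with AND operations and adds them according to their weights. It models only the arithmetic, not the adder array or its delay; the helper names `summands`, `sum_summands` and `braun_cell_counts` are hypothetical.

```python
def summands(x: int, y: int, n: int = 8) -> list[list[int]]:
    """Return the n x n matrix s[i][j] = X_i AND Y_j (bit 0 is the LSB)."""
    return [[((x >> i) & 1) & ((y >> j) & 1) for j in range(n)]
            for i in range(n)]


def sum_summands(s: list[list[int]]) -> int:
    """Add every summand with its weight 2**(i + j)."""
    n = len(s)
    return sum(s[i][j] << (i + j) for i in range(n) for j in range(n))


def braun_cell_counts(n: int) -> dict[str, int]:
    """Cell counts quoted in the text: n*n AND gates and n*(n-1) adders."""
    return {"and_gates": n * n, "adders": n * (n - 1)}
```

For an 8 x 8-bit multiplier the counts are 64 AND gates and 56 adders, which shows how quickly the array grows with operand width.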
Baugh-Wooley is one of the developed algorithms for parallel multiplication, which has
been used in VLSI architectures [12]. Multipliers based on this algorithm are used for
direct multiplication of two’s complement numbers. This direct approach does not need
$$X = -X_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} X_i 2^i, \quad (2.4)$$
is given by
In order to avoid the use of subtractor cells and use only adders, the negative terms
should be transformed. So
$$-X_{n-1} \sum_{i=0}^{n-2} Y_i 2^{n+i-1} = X_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \overline{Y_i}\, 2^{n+i-1} \right). \quad (2.7)$$
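The transformation of a negative cross term into a sum of complemented bits plus constants can be checked numerically. The following Python sketch is only a verification aid under the usual two's complement definition (sign bit of weight -2^(n-1)); the function names are hypothetical and bit lists are given least significant bit first.

```python
def twos_complement_value(bits: list[int]) -> int:
    """Value of an n-bit two's complement number, bits[0] being the LSB."""
    n = len(bits)
    return -bits[n - 1] * 2 ** (n - 1) + sum(bits[i] * 2 ** i for i in range(n - 1))


def negative_term(x_msb: int, y_bits: list[int]) -> int:
    """Negative cross term: -X_{n-1} * sum_i Y_i * 2**(n + i - 1), i = 0..n-2."""
    n = len(y_bits)
    return -x_msb * sum(y_bits[i] * 2 ** (n + i - 1) for i in range(n - 1))


def transformed_term(x_msb: int, y_bits: list[int]) -> int:
    """Same quantity written with complemented bits and two constants."""
    n = len(y_bits)
    complemented = sum((1 - y_bits[i]) * 2 ** (n + i - 1) for i in range(n - 1))
    return x_msb * (-(2 ** (2 * n - 2)) + 2 ** (n - 1) + complemented)
```

Both forms agree for every bit pattern, which is why the array can be built from adders and complemented bit products only.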
From Equation 2.8, it can be seen that the multiplication of two numbers, expressed in
two’s complement representation, can be written in a form which involves only positive
bit products. The product is, then, obtained by adding a constant to the final result. All the
partial product terms to generate the above product are explicitly shown in Fig. 2.3. A
simple reorganization of Fig. 2.3 results in the array of partial products shown in Fig. 2.4,
It can be seen that half adder, full adder, NAND and AND gates are the required elements
This algorithm is suitable for applications where operands with less than 16 bits are to be
processed. Digital filters where small operands are used (e.g., 6, 8 and 12 bits) are examples
of such applications.
Fig. 2.5 shows the array architecture of an 8 x 8-bit Baugh-Wooley two’s complement multiplier.
For operands equal to or greater than 16 bits, the Baugh-Wooley scheme becomes area
consuming and slow. Hence, techniques to reduce the size of the array, while maintaining
2.3.3 The Modified Booth Algorithm
For operands equal to or greater than 16 bits, the modified Booth algorithm [13] has been
extensively used. It is based on encoding the two’s complement operand (i.e., multiplier)
This makes the multiplier faster and uses less hardware (area). For example, the modified
bits, and each group is decoded to generate the correct partial product.
(2.9)
f=0
In Equation 2.10, the terms in brackets assume values from the set {-2, -1, 0, +1, +2}.
The encoding of Y, using the modified Booth algorithm, generates another number with
the following five signed digits: -2, -1, 0, +1, +2. As illustrated in Table 2.1, each
The bits of the multiplier (Y) are partitioned into overlapping 3-bit groups and each
group permits the generation of certain partial products. The five possible multiples of
the multiplicand are generated based on the procedure given in Table 2.2.
The general partial product is related to the multiplicand for each encoded digit by the
relationships presented in Table 2.3. $PP_i$ is the partial product and $PP_n$ is also the sign bit
of the partial product with $P_n = P_{n-1}$ when no shifting of the partial product is performed.
Table 2.1. Partial products selection
$Y_{2i+1}$  $Y_{2i}$  $Y_{2i-1}$   Recoded digit   Operation on X
0           1         0            +1              +1 x X
0           1         1            +2              +2 x X
1           0         0            -2              -2 x X
1           0         1            -1              -1 x X
1           1         0            -1              -1 x X
1           1         1             0               0 x X
Bits are grouped into 3-bit groups overlapping by one bit. A bit with a value of zero is
added on the right side of Y as $Y_{-1}$. So the multiplication of two 8-bit numbers generates
only 4 partial products; the number of partial products is thus reduced by half.
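The grouping into overlapping 3-bit groups can be sketched in Python as follows. This is a behavioural illustration of radix-4 (modified Booth) recoding under the standard digit rule, digit = Y(2i-1) + Y(2i) - 2 x Y(2i+1), with a zero appended to the right of the multiplier; the function names are hypothetical and the code does not model the encoder or decoder circuits.

```python
def booth_radix4_digits(y: int, n: int = 8) -> list[int]:
    """Recode an n-bit two's complement multiplier into n/2 signed digits.

    Each digit lies in {-2, -1, 0, +1, +2}; digit k has weight 4**k.
    """
    if n % 2:
        raise ValueError("n must be even for radix-4 recoding")
    y &= (1 << n) - 1          # keep the n-bit two's complement pattern
    digits = []
    prev = 0                   # the extra zero bit to the right of Y
    for i in range(0, n, 2):
        low = (y >> i) & 1             # bit Y_{2k}
        high = (y >> (i + 1)) & 1      # bit Y_{2k+1}
        digits.append(prev + low - 2 * high)
        prev = high                    # groups overlap by one bit
    return digits


def booth_multiply(x: int, y: int, n: int = 8) -> int:
    """Sum the n/2 partial products selected by the recoded digits."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))
```

For 8-bit operands only four partial products are summed, each being 0, ±X or ±2X shifted by an even number of positions.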
In order to make the array rectangular and thus more regular for VLSI implementation,
the problem of the sign extension must be addressed. This problem is more crucial when
the operand lengths are wide, where each partial product must be sign-extended to the
length of the product. The basic idea is to use two extra bits in the partial product. For the
first partial product, the two additional bits, $PP_{n+1}$ and $PP_{n+2}$, are equal to the sign bit of
For the second partial product, if the first partial product was positive, then the two
additional bits for this second partial product are given by the Equation 2.11, otherwise
and
So it is more interesting to use a third bit F as a flag to indicate whether there is, from the
previous partial, a negative sign bit to be propagated. F, is the flag generated by the first
partial product to the next one. This flag is expressed by the following Boolean equation
F j„ = F j + PP„j, (2.14)
Fig. 2.6 shows the block diagram of an n x n modified Booth multiplier. Furthermore, the
a) The multiplier array containing partial product generation and 1-bit adders.
b) The Booth encoder and the sign extension bits ($PP_{n+2}$, $PP_{n+1}$, $F$).
c) The Booth encoder generates the five signals (0, +1x, +2x, -1x, and -2x) for
Fig. 2.6 Block diagram of the n x n multiplier using modified Booth algorithm
The Booth multiplier exhibits some glitches. The main reason for glitches is the race
condition between the multiplicand and the multiplier due to the Booth encoder.
As seen in the previous section, applying the Booth algorithm reduces the number of
partial products by half. However, for large multipliers such as 32-bit and over, the
number of partial products is still 16 or more. In such cases, better performance is
achieved by adopting the Wallace tree using 4-2 compressors [12]. A 4-2 compressor
accepts 4 numbers and a carry in, and sums them to produce 2 numbers and a carry out.
Fig. 2.7 shows an example of such a tree on partial products of an unsigned 8 x 8-bit
multiplier. Eight partial products are produced. Using 4-2 compressors, two levels of
additions (stages) are needed. The final two summands are added using a fast 16-bit
adder. Some zeros are added to the array. This example shows that the bits which are not
used in the 1st stage (level) jump to the next stage to be combined with the ones produced
by the compressors.
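A single bit slice of a 4-2 compressor can be modelled in Python as shown below. The sketch assumes the common realization with two cascaded full adders, which is one possible implementation and not necessarily the one used in [12]; the function names are hypothetical.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry) of three input bits."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int) -> tuple[int, int, int]:
    """One bit slice of a 4-2 compressor built from two full adders.

    Returns (sum, carry, cout) with
    x1 + x2 + x3 + x4 + cin == sum + 2 * (carry + cout).
    cout depends only on x1..x3, so carries do not ripple along a row.
    """
    s1, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout
```

Because `cout` does not depend on `cin`, a row of such slices reduces four operands to two in constant time, independent of the word length.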
Fig. 2.7 Construction of Wallace’s tree for an 8 x 8-bit multiplier, reduction of the 8 partial
products with 4-2 compressors
Fig 2.8 shows the architecture of the 8 x 8-bit multiplier. As one can see for the first stage
respectively.
To further enhance the performance of Wallace tree multiplier, the modified Booth
algorithm can be used to reduce the number of partial products by half in a carry-save
adder array. This architecture exhibits some irregularities in the layout since it has a
This algorithm is based on generating n-bit numbers using even and odd positions of the
two n-bit numbers [14]. Then, a parallel addition algorithm is used to add up the partial
products.
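The even/odd split can be illustrated with a short Python sketch. It is a numerical model under stated assumptions, not the circuit of this thesis: bits are numbered from 1 at the least significant end, so the "even" number keeps bits 2, 4, 6 and 8 (mask 0xAA) and the "odd" number keeps bits 1, 3, 5 and 7 (mask 0x55); the names `split_even_odd` and `pairwise_partial_products` are hypothetical.

```python
EVEN_MASK = 0xAA  # assumed: bit positions 2, 4, 6, 8 when counting from 1 at the LSB
ODD_MASK = 0x55   # assumed: bit positions 1, 3, 5, 7


def split_even_odd(v: int) -> tuple[int, int]:
    """Return (even, odd): 8-bit numbers keeping only even or odd positions."""
    return v & EVEN_MASK, v & ODD_MASK


def pairwise_partial_products(x: int, y: int) -> dict[str, int]:
    """Four products of the even/odd parts; their sum equals x * y."""
    xe, xo = split_even_odd(x)
    ye, yo = split_even_odd(y)
    return {"ee": xe * ye, "eo": xe * yo, "oe": xo * ye, "oo": xo * yo}
```

Since x = x_e + x_o and y = y_e + y_o, the four products always add up to x * y, and each of them fits in 15 bits for 8-bit operands. In hardware the four terms would be reduced by adder planes rather than by integer multiplication as done here.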
Expanding these terms allows one to see the advantages of writing the multiplication in
2“ (2.20)
x2° (2.21)
$$P_{eo} = 0 \times 2^{14} + (X_8 Y_7) \times 2^{13} + 0 \times 2^{12} + (X_8 Y_5 + X_6 Y_7) \times 2^{11} + 0 \times 2^{10} + \dots + 0 \times 2^{0} \quad (2.22)$$
2 ' + Ox 2° (2.23)
Note that the zero positions in the bit pattern alternate with non-zero summations. The
zero position can be used to hold the carry from the corresponding summation in the non-zero
position. A full adder can be used to calculate each of the sums and the carry out of
the full adder generates the bit in the zero positions. Here, we have considered (A^F^ x 2®)
separately. Spare bits are collected together to form one or two distinct numbers. That is,
we have x 2® + (X2 Y7 + X 7 Y2) x 2’ + {XyYj) x 2®]. These numbers are treated
postponed to the last stage. This algorithm uses adders to convert three k-bit numbers to
two (k + 1)-bit numbers. By using this technique, the partial product numbers are then
summed together via adder planes repeatedly to generate two distinct numbers. At the last
stage the final two partial products are added by a fast adder to speed up multiplication
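The reduction of three numbers to two without carry propagation can be sketched in Python as a carry-save step. This models one adder plane as a row of independent full adders and is an illustration only; the names `carry_save_add` and `reduce_to_two` are hypothetical.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """Reduce three non-negative integers to (sum_word, carry_word).

    Every bit position is handled by an independent full adder, so no carry
    ripples along the word. For k-bit inputs the sum word has at most k bits
    and the carry word at most k + 1 bits, and a + b + c == sum + carry.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (b & c) | (a & c)) << 1
    return sum_word, carry_word


def reduce_to_two(operands: list[int]) -> tuple[int, int]:
    """Apply carry-save steps repeatedly until only two numbers remain."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save_add(ops.pop(), ops.pop(), ops.pop())
        ops.extend([s, c])
    while len(ops) < 2:
        ops.append(0)
    return ops[0], ops[1]
```

Only the final addition of the two remaining numbers needs a carry-propagating adder, which is where a fast adder pays off.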
[Figure: AND generator, full adder planes (1st to 4th level), carry lookahead adder, product]
2.4 Qualitative Comparisons of Parallel Algorithms
In order to choose the appropriate algorithm for the required applications one has to have
a clear view of advantages and drawbacks of different algorithms that have been
The basic array multipliers, such as the Baugh-Wooley scheme, consume low power and
exhibit relatively good performance. However, they are limited to applications with the
process operands with less than 16 bits. For operands of 16 bits and over, the modified
Booth algorithm reduces the number of partial products by half and hence the speed of the
Wooley multiplier due to the circuitry overhead in Booth algorithm. However, by using
circuit techniques one can make this multiplier have a low-power characteristic. The fastest
multipliers adopt the Wallace tree with modified Booth encoding. Due to its
interconnecting wiring a Wallace tree would generally lead to larger power consumption
and area. Hence, it is not recommended for low-power applications. Finally, the pair-wise
multiplier shows faster operation by preventing the carry propagation in the intermediate
the last stage where 2(n-l)-bit numbers are added. By using a fast addition circuitry such
as carry lookahead adder (CLA) at the last stage of pair-wise multiplier one can
circuit-level designs makes pair-wise algorithm a viable candidate for high performance
be used as a test vehicle for the purpose of quantitative evaluation of the pair-wise multiplier.
However, the entire Baugh-Wooley architecture should be redesigned in order to perform a
fair comparison.
Chapter 3
Multiplier Design
In this Chapter, the design of the novel 8 x 8-bit multiplier is described at the circuit level.
The building blocks are identified and the design of the cells based on these building
blocks is then discussed. This Chapter begins with a brief description of some of the
Propagation delay of digital cells: the duration from the moment that the first signal (50%
transition point on the input waveform) reaches the inputs of the cell to the moment that the
last output signal (50% transition point on the output waveforms) reaches the output nodes
[21].
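This definition can be applied to sampled simulation waveforms as in the Python sketch below. It is an illustrative post-processing helper, not part of the Cadence flow used in this thesis; linear interpolation between samples, the 1.8 V supply default and the function names are assumptions.

```python
def first_crossing(times, volts, level, start=0.0):
    """First time >= start at which a sampled waveform crosses `level`.

    Linear interpolation between neighbouring samples is assumed.
    """
    for k in range(1, len(times)):
        v0, v1 = volts[k - 1], volts[k]
        if v0 == v1:
            continue                      # flat segment, no crossing inside it
        if min(v0, v1) <= level <= max(v0, v1):
            frac = (level - v0) / (v1 - v0)
            t = times[k - 1] + frac * (times[k] - times[k - 1])
            if t >= start:
                return t
    raise ValueError("waveform never crosses the requested level")


def propagation_delay(times, inputs, outputs, vdd=1.8):
    """Delay from the first input 50% crossing to the last output 50% crossing."""
    half = vdd / 2.0
    t_in = min(first_crossing(times, v, half) for v in inputs)
    t_out = max(first_crossing(times, v, half, start=t_in) for v in outputs)
    return t_out - t_in
```

For example, with an input that rises linearly between 0 and 1 ns and an output that falls linearly between 1 and 2 ns, the 50% points are 0.5 ns and 1.5 ns, giving a delay of 1 ns.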
Power consumption of digital cells: The value of the power consumption of one cell is
measured individually during testing the circuits. It means that the power consumed by
the other cells in the test circuit is not included in the final measured value. This has been
done by inserting a power meter in the form of Analog Hardware Description Language
(AHDL) block in Cadence CAD tool in the route of the main supply to measure the
power dissipation. This approach has been used as standard power measurement method
Based on the pair-wise algorithm described in Chapter 2, the top-level design of the
proposed multiplier is built as shown in Fig. 2.9. The following decisions were made in
order to implement the pair-wise algorithm. First, due to the inherent speed characteristic of
The power consumption of each element has been taken into account in topology
selection. These points are discussed further in this Chapter where the circuit-level design
of the proposed multiplier is reviewed. Also, several low-power techniques are applied in
layout extraction in order to achieve the power efficient design. These techniques are
In this Section first the elements required in pair-wise multiplier are introduced and then
topology selection for the key elements is briefly presented. The architecture of the pair-wise
multiplier (Fig. 2.9) shows that full adder, half adder, carry lookahead adder, AND,
NAND, OR and XOR gates are the building blocks of the multiplier.
Full adder (FA) is the most critical circuit for two reasons. First, full adders account for a large
percentage of the core propagation delay. Second, full adders ultimately consume a
large percentage of the power in the whole multiplier architecture. In order to select the
FA best suited for high-performance applications, a study was done on the existing FA circuits
[2]. The result of this extensive study has led to the selection of the most speed/power
provided next. First, note that the Boolean expression for a half adder (HA) is:
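$$S = A \oplus B, \quad (3.1)$$
$$C_{out} = A \cdot B, \quad (3.2)$$
and for full adder (FA) is:
$$S = A \oplus B \oplus C_{in}, \quad (3.3)$$
$$C_{out} = A \cdot B + B \cdot C_{in} + A \cdot C_{in}. \quad (3.4)$$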
Table 3.1 (a) Truth table of a full adder (b) Truth table of a half adder
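The half adder and full adder functions can be tabulated with a few lines of Python. The sketch is a logic-level model only (sum as the XOR of the inputs, carry as their majority for the full adder) and says nothing about the transistor-level cells compared in this Chapter; the function names are hypothetical.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (S, C_out) for two input bits."""
    return a ^ b, a & b


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (S, C_out) for two input bits and a carry in."""
    s = a ^ b ^ cin
    cout = (a & b) | (b & cin) | (a & cin)
    return s, cout


def full_adder_truth_table() -> list[tuple[int, int, int, int, int]]:
    """Rows (A, B, C_in, S, C_out) in binary counting order."""
    rows = []
    for v in range(8):
        a, b, cin = (v >> 2) & 1, (v >> 1) & 1, v & 1
        rows.append((a, b, cin, *full_adder(a, b, cin)))
    return rows
```

Setting the carry input of the full adder to zero reproduces the half adder, which matches the idea of deriving a half adder from a full adder by removing the carry-in circuitry.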
The above Boolean expressions can be realized by different circuitries, each with their
own advantages and disadvantages. In the following, a brief review of the result of the
study of the six most well-known CMOS full adder structures is presented. These adders
have been compared across a wide range of static logic styles, which is viable candidate for
They include:
The HA circuits are then generated from the optimized FAs by eliminating the circuitry
Transistor Sizing: Sizing of the transistors in the full adder cells has been carried out in
1) Set all the transistors (NMOS and PMOS) to the minimum length ($L_{min}$) and the
2) Simulate the circuit with all possible input pattern transitions (16 transitions).
3) Consider the transitions with the highest delay and mark the transistors involved
in those transitions.
5) Repeat Steps 2, 3 and 4 until the power-delay product for the cell starts to
increase.
This method guarantees that only the right transistors (in the critical path) are sized in a
proper way. No over-sizing or under-sizing will be incurred, which makes it optimal for
give excellent transistor sizing results, especially for small circuits. Following the same
It should be mentioned that the above transistor sizing method is time-consuming
for structures such as the double pass-transistor full adder. This structure is already of little interest
due to its high transistor count. Therefore, not much effort has been taken to optimize
Complementary CMOS full adder
Complementary CMOS full adder (CMOS) [15] has 28 transistors and its operation is
based on the regular CMOS structure with pull-up and pull-down networks (Fig. 3.1). One of the
advantages of the complementary CMOS full adder cell is its high noise margins and thus
reliable operation at low voltages and arbitrary transistor sizes (ratio-less logic). The
often mentioned, the disadvantage of complementary CMOS full adder cell is the
substantial number of large PMOS transistors resulting in high input loads, more power
consumption and larger silicon area. This adder uses the $C_{out}$ signal to generate Sum, which
weak output driving capability due to series transistors of the output stage.
Complementary pass-transistor full adder
Complementary pass-transistor full adder cell has 32 transistors (Fig. 3.2). Using pass-transistor
logic with CMOS inverters, this circuit features complementary inputs and
outputs. This adder generates many intermediate nodes and their complements in order
to generate the final signals (Sum and $C_{out}$). Having a signal and its complement together
adder cell is not a suitable option for low power applications. In order to lower the power
consumption of the complementary pass-transistor adder, two circuit styles are used. These circuits
have output levels restored with cross-coupled inverters [16] and latches [17].
Due to irregular transistor arrangements and high wiring requirement, layout of this full
Table 3.3 Transistor dimension in complementary pass-transistor full adder
Transistors                                                                              Dimensions
M1, M2, M3, M4, M5, M6, M7, M8, M9, M10, M11, M12,
M13, M14, M15, M16, M19, M20, M21, M22                                                   7.2    1.8
M17, M18, M23, M24                                                                       1.8
M25, M26, M27, M30, M32                                                                  14.4   1.8
M28, M29, M31                                                                            18     1.8
Double pass-transistor full adder cell has 48 transistors and its operation is based on the
double pass-transistor logic in which both NMOS and PMOS logic networks are used
(Fig. 3.3(a) and (b)) [18]. The structure of this cell is similar to its complementary pass-transistor
counterparts, but it uses complementary transistors to keep full swing operation
This eliminates the need for restoration circuitry. One disadvantage of this cell is the
Table 3.4 Transistor dimension in double pass-transistor full adder (Sum)
Transistors                                          W (µm)   L (µm)
M2, M4, M6, M8, M9, M11, M13, M15, M18, M20          0.77     0.18
M1, M3, M5, M7, M10, M12, M14, M16, M17, M19         1.08     0.18
Transistors                                                    W (µm)   L (µm)
M2, M4, M6, M8, M10, M12, M14, M16, M17, M19, M21,
M23, M26, M28                                                  0.77     0.18
M18, M20, M22, M24                                             0.9      0.18
M1, M3, M5, M7, M9, M11, M13, M15, M25, M27                    1.08     0.18
Transmission gate full adder has 20 transistors (Fig. 3.4). This circuit generates (A ⊕ B)
and uses this and its complement as select signals to generate the output signals (Sum
and $C_{out}$) [19]. It also requires complementary input signals (A, B, $C_{in}$) similar to the
complementary CMOS full adder. However, it exhibits better speed than the CMOS full
adder with the same power consumption due to the small transistor stack height [20].
MOST                                                W(µm)   L(µm)
M2, M4, M6, M8, M12, M14, M16, M18, M20             0.7     0.18
M5, M7, M13, M15, M17, M19                          0.9     0.18
M1, M3, M10, M11                                    1.44    0.18
M9                                                  1.8     0.18
Pseudo-NMOS full adder operates based on pseudo-NMOS logic, referred to as a ratioed style.
This cell uses 14 transistors to realize the negative addition function (Fig. 3.5). The
adder) and low transistor count. On the negative side are the static power consumption of
the pull-up transistor and the reduced output voltage swing, which makes this cell
more susceptible to noise. In order to increase the output swing, two CMOS inverters are
added to this circuit, which increases the total transistor count of this cell to 18.
The adder shown in Fig. 3.6 has been developed based on an XOR gate [21] combined
with a transmission gate, and requires a total of 14 transistors [22]. The XOR gate generates
the sum. Using the transmission gate, the second half of the circuit produces the carry out.
This cell occupies less area than the complementary CMOS full adder cell. In
terms of power consumption this adder also performs better. This is due to its low
activity factor and to passing a strong signal through fewer pass-logic gates, unlike the
other cells, where the signal has to go through more logic gates. Having
discussed the high performance of this novel logic, one should note that the irregularity in
the layout of the transmission gate and the large average size of the transistors are the considerable
Table 3.8 Transistor dimensions in XOR & transmission gate full adder
MOST                        W(µm)   L(µm)
M6, M7, M8, M10             0.7     0.18
M3, M4, M12, M14            0.7     0.18
M11, M13                    0.9     0.18
M1, M5, M9                  1.44    0.18
M2                          1.8     0.18
In the following, the techniques for simulation with regard to the input patterns of the full adder
Input Pattern and Output Loading: In order to compare different adders, the input
patterns should be chosen so that they fairly test all cases. An input pattern that
maximizes the power consumption of a given cell could lead to lower power in another,
while another input pattern could reverse the situation due to the different distribution
[Transient waveforms of the three input signals, voltage versus time (s)]
Fig. 3.7 (a) to (d) Input patterns used to evaluate the performance of the adders
A good input pattern for power measurement, leading to a fair comparison of adder cells,
should produce high-frequency switching at both the input and the intermediate nodes. A good example
is the concatenation of the four patterns shown in Fig. 3.7 (a, b, c, d).
Table 3.9 Characteristics of the input signals
          Pattern (a)         Pattern (b)         Pattern (c)         Pattern (d)
Input     T(ns)  P.W.(ns)     T(ns)  P.W.(ns)     T(ns)  P.W.(ns)     T(ns)  P.W.(ns)
A         2      1            4      2            8      4            4      2
B         4      2            8      4            2      1            4      2
Cin       8      4            2      1            4      2            8      4
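The periods and pulse widths of Table 3.9 can be turned into sampled test vectors. The following is only a minimal sketch: it assumes ideal square waves that start in the high state at t = 0 and are sampled once per nanosecond. Neither the starting phase nor the sampling step is stated in the text, and the names `PATTERNS`, `square_wave` and `sample_pattern` are hypothetical.

```python
# (period_ns, pulse_width_ns) for each input, transcribed from Table 3.9.
PATTERNS = {
    "a": {"A": (2, 1), "B": (4, 2), "Cin": (8, 4)},
    "b": {"A": (4, 2), "B": (8, 4), "Cin": (2, 1)},
    "c": {"A": (8, 4), "B": (2, 1), "Cin": (4, 2)},
    "d": {"A": (4, 2), "B": (4, 2), "Cin": (8, 4)},
}


def square_wave(period_ns, width_ns, t_ns, start_high=True):
    """Logic value (0 or 1) of an ideal pulse train at time t_ns.

    Assumption: the pulse occupies the first width_ns of every period
    when start_high is True, and the remainder otherwise.
    """
    in_pulse = (t_ns % period_ns) < width_ns
    return int(in_pulse if start_high else not in_pulse)


def sample_pattern(name, duration_ns=8, step_ns=1):
    """Return a list of (A, B, Cin) vectors sampled every step_ns."""
    signals = PATTERNS[name]
    return [
        tuple(square_wave(*signals[s], t) for s in ("A", "B", "Cin"))
        for t in range(0, duration_ns, step_ns)
    ]


if __name__ == "__main__":
    for name in "abcd":
        print(name, sample_pattern(name))
```

Under these assumptions, pattern (a) visits all eight input combinations within one 8 ns window, since the three periods are 2, 4 and 8 ns.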
As for speed, the input patterns should include all the required input-pattern-to-input-pattern
transitions. The delay of the cell should be measured for each transition. The input pattern
used for the simulation process is a concatenation of the four input patterns shown in Fig.
The test bench used for simulating the adder cell is shown in Fig. 4.1 of Chapter 4, where
the simulation result of the selected adder cell is discussed. The inputs are applied
through buffers (two cascaded inverters), which load the adder cells with more realistic inputs
in terms of slope and driving strength. The outputs are also applied to another adder to
Results of the comparison among adders, sorted by power consumption, are shown in
Table 3.10. The power performance of the second and third adder cells (Fig. 4.1) in the
cascade configuration is more realistic because, in such a case, a high driving
capability of the adder is a must in order to provide the next cell with clean inputs.
Therefore, the power values of either the second or the third full adder can be considered as the
basis for our comparison. These results show that the XOR and transmission gate full adder
exhibits the lowest power consumption and transmission gate CMOS pseudo-NMOS,
also pointed out that this evaluation corresponds to a 1.8 V power supply, and this point
has slightly rearranged the previously reported adder ranking. The impact of supply
current drain. This, in turn, results in higher power consumption in circuits such as
Table 3.10 Simulation results for the full adders sorted by power consumption
Adder cell (1.8 V)                Power (mW)
XOR and transmission gate         0.0203
Transmission gate CMOS            0.0305
Pseudo-NMOS                       0.0341
Complementary CMOS                0.0504
Double pass-transistor            0.0861
Complementary pass-transistor     0.0967
The experimental results of the comparison among adders, sorted by speed, are presented
in Table 3.11. The delay values are measured from the moment the A, B and Cin signals
reach the adder inputs until the last of the Sum and Cout signals reaches the next adder cell
Table 3.11 Simulation results for the full adders sorted by propagation delay
Adder cell                        Delay (ns)
Complementary pass-transistor     0.057
XOR and transmission gate         0.066
Transmission gate CMOS            0.074
Pseudo-NMOS                       0.080
Double pass-transistor            0.091
Complementary CMOS                0.140
Fig. 3.8 shows the delay of an adder. This measurement is based on the definition of the
propagation delay of digital cells, explained at the beginning of this chapter. The input
signals are A = 1, B = 1 and Cin = 1; therefore, the adder response will be Sum = 1
and Cout = 1. Then, the delay between the earliest input signal (Cin) and Sum has been
measured. The delay is also measured between Cin and Cout. This measurement has been
performed at the 50% transition point of the signals (which is 0.9 V in our case of VDD = 1.8
V). The delay values of the pseudo-NMOS adder are shown in Fig. 3.8. It can be seen that
the delays of Sum and Cout are very close in this cell, which avoids any data hazard, and race
[Fig. 3.8: transient waveforms of the pseudo-NMOS adder with annotated delays of 70.09 ps and 73.32 ps]
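The 50% measurement described above can be reproduced on sampled waveform data. The sketch below is an illustration only: it assumes the waveforms are available as lists of time and voltage samples and locates the first crossing of VDD/2 by linear interpolation. The function names are hypothetical and are not taken from the simulation environment used in this work.

```python
def crossing_time(times, volts, level):
    """First time at which the sampled waveform crosses `level`.

    Uses linear interpolation between the two samples that bracket
    the crossing. Raises ValueError if the level is never crossed.
    """
    samples = list(zip(times, volts))
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        brackets = (v0 - level) * (v1 - level) <= 0
        if brackets and v0 != v1:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("waveform never crosses the requested level")


def propagation_delay(times, v_in, v_out, vdd=1.8):
    """Delay between the 50% points of an input and an output waveform."""
    half = vdd / 2.0
    return crossing_time(times, v_out, half) - crossing_time(times, v_in, half)


if __name__ == "__main__":
    # Idealized ramps (time in ps): the input crosses 0.9 V at 50 ps,
    # the output at 150 ps.
    t = [0, 100, 200, 300]
    vin = [0.0, 1.8, 1.8, 1.8]
    vout = [0.0, 0.0, 1.8, 1.8]
    print(propagation_delay(t, vin, vout))
```

The same routine applies to falling edges, because the bracket test only checks that the two samples lie on opposite sides of the threshold.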
The following criteria have been considered in performing the comparison amongst
different adders:
Power-delay product: The power-delay product expresses the compromise between
speed and power consumption. The values of the power-delay product are presented in Table
Area: The transistor count, showing area efficiency and layout productivity must be
Table 3.12 Simulation results for the full adder cells sorted by power-delay product
Adder cell                        Power-delay product (pJ)    Transistor count
XOR and transmission gate         0.00133                     14
Transmission gate CMOS            0.00222                     20
Pseudo-NMOS                       0.00272                     14
Complementary pass-transistor     0.00551                     32
Complementary CMOS                0.00702                     28
Double pass-transistor            0.00783                     48
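As a cross-check of the ranking, the power-delay product can be recomputed from the power values of Table 3.10 and the delay values of Table 3.11. This sketch assumes the power is in mW and the delay in ns, so that the product is in pJ. The recomputed values come out close to, though not always identical to, the tabulated ones, and the resulting order is the same.

```python
# Power (mW) from Table 3.10 and delay (ns) from Table 3.11.
POWER_MW = {
    "XOR and transmission gate": 0.0203,
    "Transmission gate CMOS": 0.0305,
    "Pseudo-NMOS": 0.0341,
    "Complementary CMOS": 0.0504,
    "Double pass-transistor": 0.0861,
    "Complementary pass-transistor": 0.0967,
}
DELAY_NS = {
    "Complementary pass-transistor": 0.057,
    "XOR and transmission gate": 0.066,
    "Transmission gate CMOS": 0.074,
    "Pseudo-NMOS": 0.080,
    "Double pass-transistor": 0.091,
    "Complementary CMOS": 0.140,
}


def power_delay_products():
    """Return {adder cell: power x delay}; mW x ns gives pJ."""
    return {name: POWER_MW[name] * DELAY_NS[name] for name in POWER_MW}


def ranking():
    """Adder cells sorted from lowest to highest power-delay product."""
    pdp = power_delay_products()
    return sorted(pdp, key=pdp.get)


if __name__ == "__main__":
    pdp = power_delay_products()
    for name in ranking():
        print(f"{name:32s} {pdp[name]:.5f} pJ")
```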
The measurements show that the pseudo-NMOS full adder has average values of both power
consumption and delay, while providing a sum signal at a good logic level. This leads to
The pseudo-NMOS adder also has small area occupancy, not only due to the number of
transistors but also because of the size of the PMOS transistors, which is the main issue when it comes
to the layout extraction level. These properties make the pseudo-NMOS circuit amenable to
the use of a lower supply voltage to further reduce the power and at the same time
It is timely to mention that the comparison of the performance of adder cells based on
different logic styles is a very broad area of study and cannot be covered fully in a
small section. Here, identical conditions such as a uniform input pattern, capacitive load
and constant VDD have been used during simulation in order to achieve a fair comparison.
However, other factors such as selecting different geometry and physical designs and
The carry lookahead adder is a viable candidate to resolve the propagation delay problem
by calculating the carry signal in advance based on the input signals. It relies on the fact
b) when one of the two bits is “1” and the Cin (carry-in of the previous stage) is “1”.
in which
$G_i = A_i \cdot B_i$, (3.7)
$P_i = A_i \oplus B_i$. (3.8)
Note that the propagate and generate terms depend only on the input bits. If one uses the
above expression to calculate the carry signal, one does not need to wait for the carry to
ripple through all the previous stages to find its proper value. Thus, comes the main
In the following, the generate and propagate terms are derived for a 4-bit adder.
$C_1 = G_0 + P_0 \cdot C_0$ (3.9)
$C_2 = G_1 + P_1 \cdot G_0 + P_1 \cdot P_0 \cdot C_0$ (3.10)
$C_3 = G_2 + P_2 \cdot G_1 + P_2 \cdot P_1 \cdot G_0 + P_2 \cdot P_1 \cdot P_0 \cdot C_0$ (3.11)
Note that the Cout bit, $C_{i+1}$ of the last stage, will be available after four delays (two gate
delays to calculate the propagate signal and two delays due to the AND and OR gates). The sum
$S_i = A_i \oplus B_i \oplus C_i = P_i \oplus C_i$. (3.13)
Thus, the sum bit will be available after two additional gate delays (due to the XOR gate),
or a total of six gate delays after the input signals $A_i$ and $B_i$ have been applied. The
advantage is that these delays will be the same and independent of the number of bits one
1) The partial full adder, PFA, which generates $G_i$, $P_i$ and $S_i$ as defined by
2) The carry lookahead logic, which generates the Cout bits according to Equations
3.9 to 3.12. The 4-bit adder can then be built by using four PFAs and the carry
The disadvantage of the carry lookahead adder is that the carry logic tends to get quite
complicated for more than 4 bits. Therefore, carry lookahead adders are usually
implemented as 4-bit modules and are used in a hierarchical structure to realize adders
with multiples of 4 bits. A high fan-in OR gate is an unavoidable problem in designing
a 16-bit carry lookahead adder. This is shown in Equation 3.12, where C4 is calculated.
Using a high fan-in logic gate would not only increase the propagation delay, but also
contribute to additional power consumption. In order to resolve these issues, a cascade
of four 4-bit carry lookahead adders has been employed in the design of the 16-bit carry
lookahead adder. The propagation delay of the 16-bit carry lookahead adder in this
architecture is approximately equal to that of a 4-bit ripple carry adder. This is because
the Cout signals have to ripple from one module to the next. This is repeated four
times until the final Cout arrives at the output. Despite the amount of delay, this approach
is more power-efficient.
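The generate/propagate formulation and the cascade of four 4-bit modules can be illustrated with a short behavioural model. This is only a sketch of the logic, not of the transistor-level circuit. The carry expressions follow the pattern of Equations 3.9 to 3.11; the expression used for C4 extends that pattern and is an assumption here, since Equation 3.12 itself is not legible in the text above. The function names `cla4` and `cla16` are hypothetical.

```python
def cla4(a, b, c0):
    """Add two 4-bit numbers with carry lookahead.

    Returns (sum, carry_out). All carries are computed directly from
    the generate (G) and propagate (P) terms, without rippling.
    """
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [A[i] & B[i] for i in range(4)]  # G_i = A_i . B_i
    P = [A[i] ^ B[i] for i in range(4)]  # P_i = A_i xor B_i

    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    # Assumed form of the C4 expression, following the same pattern.
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0])
          | (P[3] & P[2] & P[1] & P[0] & c0))

    carries = [c0, c1, c2, c3]
    total = sum((P[i] ^ carries[i]) << i for i in range(4))  # S_i = P_i xor C_i
    return total, c4


def cla16(a, b, c0=0):
    """16-bit adder built by cascading four 4-bit lookahead modules.

    The carry out of each module ripples into the next module.
    """
    total, carry = 0, c0
    for k in range(4):
        nibble, carry = cla4((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF, carry)
        total |= nibble << (4 * k)
    return total, carry


if __name__ == "__main__":
    print(cla16(12345, 54321))
```

Inside each module the carries are produced in parallel, while the module-to-module carry still ripples four times, which is the delay behaviour described above.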
In the following, an overview of the sub-cells of the 4-bit carry lookahead adder is
given. Figure 3.9 shows the block diagram of the 4-bit carry lookahead adder.
As seen in Fig. 3.9, the partial full adder (PFA) is the first block, where the inputs are fed. As
mentioned earlier, this block generates the propagate, generate and sum signals. Fig. 3.10
shows the gate-level implementation of the PFA. The Sum signal is also generated in this block
according to Equation 3.13. In order to generate Q», signal another XOR gate is needed.
The delays of the signals in the highlighted block of the carry lookahead adder (Fig. 3.9) are
Table 3.13 Delay of the generate, propagate and sum signals of PFA
Output    Delay (ns)
Gi        0.0552
Si        0.0385
CiPi      0.08
The block diagram of the 16-bit carry lookahead adder is shown in Fig. 3.11. Four 4-bit
carry lookahead modules have been used to implement the final stage of the pair-wise
multiplier. The labels on this diagram are based on the outputs of the previous stages.
Fig. 3.11 Block diagram of the 16-bit carry lookahead adder implemented by cascading four 4-bit
carry lookahead modules
3.1.1.3 AND, NAND, OR and XOR Gates
AND, NAND, OR and XOR are the fundamental logic gates, used in most logic circuits
to realize the arithmetic operations. The Boolean expressions for two-input AND,
NAND, OR and XOR gates, followed by their truth tables, are shown in Table 3.14.
$A \cdot B$, (3.5)
$A + B$, (3.6)
$A \oplus B$. (3.7)
by different circuitries. However, the variety of these structures is not as large as that of
adder circuits. Therefore, very common configurations have been used to implement the
required logic tasks. Figure 3.12 (a, b) shows the schematics of the AND and NAND gates that
have been optimized for the required speed in the proposed multiplier [22]. The NAND gate
is composed of two NMOS and two PMOS transistors. An inverter is added to the circuit to generate
the AND function. Several designs of OR and XOR gates have been reported. Each has
its own advantages, such as less delay, and drawbacks, such as poor response to some
particular inputs [20]. Figure 3.12 (c, d) shows schematics of the OR and XOR circuits
The dimensions of the NMOS and PMOS transistors have been modified for the required rise and
[Fig. 3.12 (a) to (d): schematics and transistor dimensions (MOST, W(µm), L(µm)) of the AND, NAND, OR and XOR gates]
In order to increase the productivity in ASIC design, cell design techniques are highly
critical. In cell design, a basic concept is to design uniform circuits that can perform the
same task. In the following, the top level and the circuit level of the required cells in each
Fig. 3.13 Block diagram of the proposed 8 x 8-bit multiplier showing detail of the required cells
AND Generator: As seen in the block diagram of the pair-wise 8 x 8-bit multiplier (Fig.
3.13), the first stage of this architecture is an AND generator. In order to execute the first
step of the pair-wise algorithm discussed in Section 2.3.5, AND combinations of all odd
and even positions of the two 8-bit operands (multiplicand and multiplier) are required. This task is
performed by the AND generator. The block diagram of the AND generator is shown in
Fig. 3.14. This stage consists of four AND planes known as:
XeYe: generating AND combinations of all even bits of both the multiplicand and the
multiplier. The results are: X2Y2, X2Y4, X2Y6, X2Y8, X4Y2, X4Y4, X4Y6, X4Y8, X6Y2, X6Y4, X6Y6,
X6Y8, X8Y2, X8Y4, X8Y6, X8Y8.
XeYo: generating AND combinations of even bits of the multiplicand and odd bits of the
multiplier. The results are: X2Y1, X2Y3, X2Y5, X2Y7, X4Y1, X4Y3, X4Y5, X4Y7, X6Y1, X6Y3, X6Y5,
X6Y7, X8Y1, X8Y3, X8Y5, X8Y7.
XoYe: generating AND combinations of odd bits of the multiplicand and even bits of the
multiplier. The results are: X1Y2, X1Y4, X1Y6, X1Y8, X3Y2, X3Y4, X3Y6, X3Y8, X5Y2, X5Y4, X5Y6,
X5Y8, X7Y2, X7Y4, X7Y6, X7Y8.
XoYo: generating AND combinations of all odd bits of both the multiplicand and the
multiplier. The results are: X1Y1, X1Y3, X1Y5, X1Y7, X3Y1, X3Y3, X3Y5, X3Y7, X5Y1, X5Y3, X5Y5,
X5Y7, X7Y1, X7Y3, X7Y5, X7Y7.
planes consequently constructs AND generation stage. The AND circuit discussed in
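The four AND planes can be modelled behaviourally to confirm that together they contain every bit product needed for the multiplication. The sketch below assumes that bit 1 is the least significant bit, so that the term XiYj carries the weight 2^(i+j-2); this matches Poo(1) = X1Y1 being the lowest product bit. The helper names are hypothetical and the model says nothing about the circuit implementation.

```python
EVEN = (2, 4, 6, 8)
ODD = (1, 3, 5, 7)


def bits(value, width=8):
    """Map bit index 1..width (1 = least significant) to its value."""
    return {i: (value >> (i - 1)) & 1 for i in range(1, width + 1)}


def and_planes(x, y):
    """Return the four AND planes as {(i, j): Xi AND Yj} dictionaries."""
    X, Y = bits(x), bits(y)
    return {
        "XeYe": {(i, j): X[i] & Y[j] for i in EVEN for j in EVEN},
        "XeYo": {(i, j): X[i] & Y[j] for i in EVEN for j in ODD},
        "XoYe": {(i, j): X[i] & Y[j] for i in ODD for j in EVEN},
        "XoYo": {(i, j): X[i] & Y[j] for i in ODD for j in ODD},
    }


def product_from_planes(planes):
    """Sum all AND terms with weight 2**(i + j - 2)."""
    return sum(
        term << (i + j - 2)
        for plane in planes.values()
        for (i, j), term in plane.items()
    )


if __name__ == "__main__":
    planes = and_planes(0b10110101, 0b01101110)
    print({name: len(plane) for name, plane in planes.items()})
    print(product_from_planes(planes) == 0b10110101 * 0b01101110)
```

Each plane holds 16 terms, and the 64 terms together reproduce the ordinary 8 x 8-bit product.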
First Adder Plane: The second stage of the multiplier is the first adder plane, where
partial products (Pee, Peo, Poe, Poo) are generated. Equations 2.20, 2.21, 2.22 and 2.23 show
the different AND combinations of the multiplicand and multiplier bits required for
generating each of the partial products. Fig. 3.16 shows the block diagram of this stage.
[Fig. 3.16: block diagram of the first adder plane with the Pee, Peo, Poe and Poo partial product generators]
This stage consists of four blocks of partial product generators known as:
Pee: generating partial products resulting from multiplication of the even bits of both
follows:
Pee(1) = 0    Pee(2) = 0    Pee(3) = X2Y2    Pee(4) = 0
Pee(11) = Sum[X8Y4 + X6Y6 + X4Y8]    Pee(12) = Cout[X8Y4 + X6Y6 + X4Y8]
Peo: generating partial products resulting from multiplication of even bits of the multiplicand
and odd bits of the multiplier. Peo is a 15-bit number shown by bit number in parentheses
as follows:
Peo(7) = Cout[X6Y1 + X4Y3 + X2Y5]    Peo(8) = Sum[X8Y1 + X6Y3 + X4Y5]
Peo(15) = 0
Poe: generating partial products resulting from multiplication of odd bits of the multiplicand
and even bits of the multiplier. Poe is a 15-bit number shown by bit number in parentheses
as follows:
Poe(1) = 0    Poe(2) = X1Y2    Poe(3) = 0    Poe(4) = Sum[X1Y4 + X3Y2]
Poe(9) = Cout[X1Y8 + X3Y6 + X5Y4 + X7Y2]    Poe(10) = Sum[X3Y8 + X5Y6 + X7Y4]
Poe(13) = Cout[X5Y8 + X7Y6]    Poe(14) = X7Y8    Poe(15) = 0
Poo: generating partial products resulting from multiplication of the odd bits of both
multiplicand and multiplier. Poo is a 15-bit number shown by bit number in parentheses as
follows:
Poo(1) = X1Y1    Poo(2) = 0    Poo(3) = Sum[X1Y3 + X3Y1]
Poo(6) = Cout[X1Y5 + X3Y3 + X5Y1]    Poo(7) = Sum[X1Y7 + X3Y5 + X5Y3 + X7Y1]
Poo(8) = Cout[X1Y7 + X3Y5 + X5Y3 + X7Y1]    Poo(9) = Sum[X3Y7 + X5Y5 + X7Y3]
Poo(14) = 0    Poo(15) = 0
All Pee, Peo, Poe and Poo blocks perform a similar task and have the same number of inputs and
outputs. This makes it possible to employ the same cell for all four partial product (Pee,
Peo, Poe, Poo) generators. This cell has been constructed from two half adders and five
adders.
Fig. 3.17 shows the gate-level diagram of the partial product generator cell. At the circuit
level, the pseudo-NMOS adder has been used to realize these cells. Using adders to convert
three k-bit numbers to two (k + 1)-bit numbers avoids the carry propagation delay in body of
In order to use the 3-to-2 adding technique, it is necessary that no more than three inputs be
used for generating any element of a partial product (Pij). This condition is not met when
the partial product elements are generated by four terms, as happens in Pee(9), Pee(10),
relevant equation). To deal with these extra terms, called spare terms, they are taken out
of the equations and collected together to form two distinct numbers, which are called N
and M. N(i) is a 15-bit number with zero in all even and odd positions except for the
even and odd positions except for the eighth position [M(8)] (0,0,0,0,0,0,0,
X7Y2,0,0,0,0,0,0,0). These two numbers are shown in the block diagram of partial
Now the outputs of the first adder plane are six 15-bit numbers called (Pee, Peo, Poe, Poo, N,
M).
Second & Third Adder Planes: In order to generate the final product of multiplication
(P) of the 8-bit X (multiplicand) and 8-bit Y (multiplier), all the individual partial products
(Pee, Peo, Poe, Poo) generated from summation of even and odd bits of the multiplicand and
multiplier and the two distinct numbers generated by the spare bits (M, N) in the previous stage
This task requires the second and third adder planes. This addition operation has to be
performed bit-by-bit, resulting in carry-out propagation. In order to postpone the carry
propagation delay to the last stage of the multiplier, a 3-to-2 adding technique has been
used. To facilitate this technique, the addition of the four partial products (Pee, Peo, Poe, Poo) and two
extra numbers (M, N) is broken up into two steps as shown in Equation 3.7. These six
At this stage three 15-bit numbers (Pee, Peo, Poe) are converted to two 16-bit numbers (Peoeo,
Peoee) and so are the fourth partial product (Poo) and two distinct numbers (M, N) which
This task can be performed by using a structure similar to that shown in Fig. 3.17 with a total of
14 adders. Note that due to the power and area constraints of the entire architecture, using
a half adder is preferred whenever only two input signals need to be added (i.e. no Cin
Peoeo: result of adding all the odd positions of Pee, Peo and Poe.
Peoeo(1) = 0    Peoeo(2) = 0    Peoeo(3) = Pee(3)    Peoeo(4) = 0
Peoeo(9) = Sum[Pee(9) + Peo(9) + Poe(9)]    Peoeo(10) = Cout[Pee(9) + Peo(9) + Poe(9)]
Peoeo(11) = Sum[Pee(11) + Peo(11) + Poe(11)]    Peoeo(12) = Cout[Pee(11) + Peo(11) + Poe(11)]
Peoeo(13) = Sum[Pee(13) + Peo(13) + Poe(13)]    Peoeo(14) = Cout[Pee(13) + Peo(13) + Poe(13)]
Peoeo(15) = Pee(15)
Peoee: result of adding all the even positions of Pee, Peo and Poe.
Peoee(1) = 0
Peoee(2) = Sum[Peo(2) + Poe(2)]    Peoee(3) = Cout[Peo(2) + Poe(2)]
Peoee(4) = Sum[Peo(4) + Poe(4)]    Peoee(5) = Cout[Peo(4) + Poe(4)]
Peoee(6) = Sum[Pee(6) + Peo(6) + Poe(6)]    Peoee(7) = Cout[Pee(6) + Peo(6) + Poe(6)]
Peoee(8) = Sum[Pee(8) + Peo(8) + Poe(8)]    Peoee(9) = Cout[Pee(8) + Peo(8) + Poe(8)]
Peoee(10) = Sum[Pee(10) + Peo(10) + Poe(10)]    Peoee(11) = Cout[Pee(10) + Peo(10) + Poe(10)]
PooSe(4) = Poo(4)    PooSe(5) = 0    PooSe(6) = Poo(6)
PooSe(7) = 0    PooSe(8) = Sum[Poo(8) + X2Y7 + X7Y2]
PooSe(9) = Cout[Poo(8) + X2Y7 + X7Y2]    PooSe(10) = Poo(10)
PooSo(7) = Sum[Poo(7) + X7Y1]    PooSo(8) = Cout[Poo(7) + X7Y1]
The addition process is completed at this stage, and four 16-bit numbers (Peoee, Peoeo, PooSe,
PooSo), the result of 3-to-2 addition of Pee, Peo, Poe, Poo, M and N, are the outputs of this level.
At the next stage, addition of the four numbers is broken into two steps as shown in
Equation 3.9.
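The 3-to-2 adding technique used in the adder planes can be illustrated at word level. The sketch below is not the even/odd position arrangement described in this chapter: it uses the textbook carry-save form, in which one output word collects the sum bits and the other collects the carry bits shifted left by one position. The grouping of the six inputs follows the text (three words at a time), and the function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """3-to-2 compression of three words into two.

    Every bit position is handled by an independent full adder, so no
    carry propagates between positions. The invariant is
    a + b + c == sum_word + carry_word.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def reduce_six(pee, peo, poe, poo, m, n):
    """Reduce six words to two using three levels of 3-to-2 addition."""
    # Level 1: two independent compressions, six words -> four words.
    s1, c1 = carry_save_add(pee, peo, poe)
    s2, c2 = carry_save_add(poo, m, n)
    # Level 2: three of the four words -> two, leaving three words.
    s3, c3 = carry_save_add(s1, c1, s2)
    # Level 3: the remaining three words -> two.
    return carry_save_add(s3, c3, c2)


if __name__ == "__main__":
    words = (1000, 2000, 3000, 4000, 64, 128)
    pe, po = reduce_six(*words)
    # Only this last addition needs a carry-propagating adder.
    print(pe + po == sum(words))
```

The final addition of the two remaining words is the only step that requires carry propagation, which is why it is left to the fast adder at the end.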
The same technique as the previous stage is used two more times to convert the three
PS1e: result of adding all the even positions of Peoee, Peoeo and PooSe.
PS1e(1) = 0
PS1e(3) = Cout[Peoee(2) + Peoeo(2) + PooSe(2)]
PS1e(4) = Sum[Peoee(4) + Peoeo(4) + PooSe(4)]
PS1e(5) = Cout[Peoee(4) + Peoeo(4) + PooSe(4)]
PS1e(14) = Sum[Peoee(14) + Peoeo(14) +
PS1o: result of adding all the odd positions of Peoee, Peoeo and PooSe.
PS1o(2) = Cout[Peoee(1) + Peoeo(1) + PooSe(1)]
PS1o(4) = Cout[Peoee(3) + Peoeo(3) + PooSe(3)]
PS1o(6) = Cout[Peoee(5) + Peoeo(5) + PooSe(5)]
PS1o(10) = Cout[Peoee(9) + Peoeo(9) + PooSe(9)]
PS1o(11) = Sum[Peoee(11) + Peoeo(11) + PooSe(11)]
At the next parallel adder plane, the two new words from the previous adder plane (PS1e,
PS1o) are added to PooSo via another 3-to-2 adder stage to complete Equation 3.9.
This addition process is carried out similarly to the one in the previous level. The three
input numbers at this level are converted to two new 15-bit numbers called Pe and Po.
Pe: result of adding all the even positions of PooSo, PS1e and PS1o.
Pe(5) = Cout[PooSo(4) + PS1e(4) + PS1o(4)]
Po: result of adding all the odd positions of PooSo, PS1e and PS1o.
Po(14) = Cout[PooSo(13) + PS1e(13) + PS1o(13)]
Po(16) = Cout[PooSo(15) + PS1e(15) + PS1o(15)]
In the last stage of multiplication process, these two final numbers (Pe,Po) need to be
3.6.
Pe + Po = P (3.10)
At the last step, the final two numbers (Pe, Po) are simply added to generate the final product.
This addition needs to be performed fast. Therefore, a carry lookahead structure, known as
In the next chapter the simulation of the major blocks as well as the final simulation
Chapter 4
simulation results of the final stage of the proposed multiplier for certain given inputs are
All circuits including individual cells and entire design have been simulated in Cadence
environment.
Before presenting the simulation results of the individual circuits and designed cells, we
need to introduce the circuit structures that have been used for simulation purposes.
Arranging proper test circuits has a significant impact on increasing ASIC
productivity.
architecture that uses full-adder cells as the building block, a cascade of full adders is
usually utilized. In such cases, a high driving capability of the adder is a must for providing
the next cell with input signals at the proper logic level. With this point in mind, the
circuit structure used to simulate the adder is illustrated in Fig. 4.1. A cascade of four full
adder cells is utilized; the inputs are fed from buffers (two cascaded inverters) to give
more realistic signals, and the outputs are loaded with buffers to give proper loading
conditions [28].
Fig. 4.1 Circuit structure used for simulation of full adder cell
The parasitic effects are, therefore, included in the simulation results. The same structure
has been used to compare the adder cells discussed in topology selection.
Here are the simulation results for the pseudo-NMOS full adder using the test circuit
structure of Fig. 4.1, corresponding to four different input patterns. The input patterns
were already introduced in Chapter 3 when describing the simulation strategy of the adder cells.
Fig. 4.3 shows the Sum and Cout signals of the pseudo-NMOS full adder in response to the input pattern shown in
Fig. 3.7 (a). This pattern covers 6 transitions of the input signals (A, B, Cin). These
transitions are also shown in Table 4.1, corresponding to those in Fig. 4.3.
Fig. 4.3 The simulation waveforms showing the response of the pseudo-NMOS full adder to
input pattern (a)
    Inputs          Outputs
A   B   Cin     Sum   Cout
0   1   1       0     1
1   1   1       1     1
0   0   1       1     0
1   0   1       0     1
0   1   0       1     0
1   1   0       0     1
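The tabulated responses can be checked against the Boolean definition of a full adder. The following sketch encodes the six input vectors and the simulated outputs listed above and compares them with Sum = A xor B xor Cin and Cout = A.B + Cin.(A xor B). It is a logic-level check only and the names are hypothetical.

```python
def full_adder(a, b, cin):
    """Reference full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


# ((A, B, Cin), (Sum, Cout)) rows for input pattern (a), as tabulated above.
PATTERN_A_ROWS = [
    ((0, 1, 1), (0, 1)),
    ((1, 1, 1), (1, 1)),
    ((0, 0, 1), (1, 0)),
    ((1, 0, 1), (0, 1)),
    ((0, 1, 0), (1, 0)),
    ((1, 1, 0), (0, 1)),
]


def mismatches(rows):
    """Return the rows whose outputs differ from the reference adder."""
    return [row for row in rows if full_adder(*row[0]) != row[1]]


if __name__ == "__main__":
    print(mismatches(PATTERN_A_ROWS))  # an empty list means every row agrees
```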
Fig. 4.4 shows the Sum and Cout signals of the pseudo-NMOS full adder in response to the input pattern shown in
Fig. 3.7 (b). This pattern covers 6 transitions of the input signals (A, B, Cin). These
Fig. 4.4 The simulation waveforms showing the response of the pseudo-NMOS full adder to
input pattern (b)
    Inputs          Outputs
A   B   Cin     Sum   Cout
0   1   1       0     1
0   1   0       1     0
1   1   1       1     1
1   1   0       0     1
0   0   1       1     0
0   0   0       0     0
Fig. 4.5 shows the Sum and Cout signals of the pseudo-NMOS full adder in response to the input pattern shown in
Fig. 3.7 (c). This pattern covers 6 transitions of the input signals (A, B, Cin). These
Fig. 4.5 The simulation waveforms showing the response of the pseudo-NMOS full adder to
input pattern (c)
    Inputs          Outputs
A   B   Cin     Sum   Cout
0   1   1       0     1
0   0   1       1     0
0   1   0       1     0
0   0   0       0     0
1   1   1       1     1
1   0   1       0     1
Fig. 4.6 shows the Sum and Cout signals of the pseudo-NMOS full adder in response to the input pattern
shown in Fig. 3.7 (d). This pattern covers 6 transitions of the input signals (A, B, Cin).
These transitions are listed in Table 4.4 in the order in which they appear in Fig. 4.6.
Fig. 4.6 The simulation waveforms showing the response of the pseudo-NMOS full adder to
input pattern (d)
    Inputs          Outputs
A   B   Cin     Sum   Cout
0   0   1       1     0
0   1   1       0     1
1   1   1       1     1
1   0   1       0     1
0   0   0       0     0
0   1   0       1     0
AND/NAND/OR/XOR Gates Simulation Results
Schematics are shown in Fig. 3.15. The test structure used for the simulation is shown in
Fig. 4.2. The input signals have 50% duty cycles with period of 2ns. Figures 4.7 to 4.9
Fig. 4.7 The simulation waveforms showing the response of the AND/NAND gate
4.2 Final Simulation Results
In order to evaluate the performance of the proposed multiplier three dimensions have to
be measured. These dimensions are speed, power consumption and area. In this section
Speed: The speed of the multiplier translates to the minimum interval (frequency) between
two sequential multiplication operations (8-bit x 8-bit) for which the results of
(maximum) delay of the entire design should be measured. By having the worst-case
Equation 4.1.
$f_{min} = 1 / T_{max}$ (4.1)
where $f_{min}$ is the minimum operating frequency of multiplication and $T_{max}$ is the worst-case delay
of the multiplier.
As shown in Fig. 4.10, the operation of the proposed multiplier can be divided into 6 stages
as:
4) 3rd adder level 5) 4th adder level 6) Final adder level (carry lookahead)
Due to the parallel operation (AND and addition) in stages 1 to 5, the delay of one AND gate
can represent the delay of the first stage (AND generation), and the delay of one
pseudo-NMOS adder can represent that of each of stages 2 to 5 (adder levels). Delay of carry lookahead
In order to evaluate the worst-case delay of the entire design, first the worst-case delay of
each stage has been measured and, then, the final worst-case delay of the proposed
[Fig. 4.10: 1st, 2nd, 3rd and 4th adder levels followed by the final adder level]
Approximate worst-case delay (result of pre-layout simulation) = AND Generator + 4 x Full Adder + Carry Lookahead
= 75 ps + 4 x 120 ps + 277 ps = 832 ps
$\tau_{\text{Total}} = \tau_{\text{AND Generator}} + 4 \times \tau_{\text{Adder level}} + \tau_{\text{Final adder stage}}$, (4.2)
where $\tau_{\text{Total}}$ is the worst-case delay of the multiplier, $\tau_{\text{AND Generator}}$ is the worst-case delay of
the AND generator stage, which is equal to the delay of an AND gate, $\tau_{\text{Adder level}}$ is the worst-
case delay of one adder stage, which is equal to the worst-case delay of one pseudo-
NMOS full adder cell, and $\tau_{\text{Final adder stage}}$ is the worst-case delay of the final adder stage, which
is equal to the worst-case delay of the 16-bit carry lookahead adder. The delay of the AND gate can be
simply measured according to the propagation delay definition. Fig. 4.11 shows the delay of
the AND gate.
The worst-case delay is more meaningful for the pseudo-NMOS full adder due to the possibility
of different input combinations. The delay of the pseudo-NMOS adder has been measured
with all input combinations. The worst-case delay occurred when A = 1, B = 1
Fig. 4.12 The worst-case delay of the pseudo-NMOS adder, occurring when A = 1, B = 1 and Cin = 1
In order to measure the worst-case delay of the 16-bit carry lookahead adder, the same
method has been used. Different input transitions have been applied and the delay has been
measured between the input and the last output signals at the 50% transition point. The
are added as it was expected due to rippling signal between every 4-bit carry
lookahead adder modules (remember that the 16-bit carry lookahead adder constructed by
Fig. 4.13 shows the input and output signals in composite format. The delay occurring
Table 4.5 shows the values of the worst-case delay of the AND gate, pseudo-NMOS full adder
Table 4.5 The results of the worst-case delay measurement
Block                                   Worst-case delay (ps)
AND generation (one AND gate)           75
Adder stage (one pseudo-NMOS adder)     120
16-bit carry lookahead adder            111
$\tau_{Total}$ = 832 ps
It should be pointed out that the worst-case delay measured and shown in Table 4.5 is the result of examining each block (AND generator, adder plane and final adder stage) separately. It gives an estimate of the worst-case delay of the entire design but, as one may notice, the pattern causing the worst-case delay can be controlled only for the first two multiplier stages, the “AND Generator” and the “First Adder Level”. By applying the pattern “11111111” as X and “11111111” as Y, the AND generator produces “1” at all of its outputs. Therefore, all the inputs of the next stage, the first adder level, are “1”. This means that the worst-case delay definitely occurs in this stage. It can therefore be guaranteed that the first two stages operate at their slowest speed (worst-case delay), but this is not necessarily the case in the other stages. However, it is not expected that the delay of the entire multiplier exceeds the total worst-case delay ($\tau_{Total}$) that have been
the proposed multiplier. Fig. 4.14 shows the input signals applied to the proposed multiplier. The first curve from the top is the current drawn from the $V_{dd}$ node. Fig. 4.15
[Fig. 4.14: transient-response waveforms of the input signals (X and Y bits) applied to the proposed multiplier]
Fig. 4.15 The post-layout simulation waveforms showing the results of “11111111” x “11111111”
[Fig. 4.16: transient response of the input and output waveforms (outputs P1 to P16) for “11111111” x “11111111”, with the measured delay $\tau_{Total}$ = 793 ps marked; voltage versus time (s)]
To measure the delay occurring during “11111111” x “11111111”, a composite simulation of the input and output waveforms is used. The delay has then been measured in the same manner as previously defined. Fig. 4.16 shows the input and output waveforms, together with the delay ($\tau_{Total}$). The measured delay is 793 ps, which is less than 835 ps, as expected. It is therefore fair to assume that the worst-case delay of the proposed 8 x 8-bit multiplier is less than the sum of the worst-case delays of its individual blocks. This worst-case delay may never occur, but it is used to set a bound on the maximum delay of the multiplier. It translates to the minimum operating frequency of the proposed multiplier, which is determined as 1.19 GHz. That is, the frequency of the
Power: The estimation of the power consumed by a large digital circuit is a complex task. Measuring the power consumption is critical for low-power design, as it permits the designer to optimize power, to meet requirements, and to know the power distribution through the chip. Several heuristic, statistical, and probabilistic methods have been introduced [24,25,26]. These methods become less accurate as the size of the circuit increases. It is better to decompose a large circuit into smaller modules and use these methods to estimate the power consumption of each module. These methods are also very helpful for optimizing the performance of the decomposed modules. One of these approaches has been employed in the circuit-level design of this work in order to meet power efficiency as one of the objectives. The practical aspect of the method is explained in more detail in the topology selection and transistor sizing of the full adder circuitry in Chapter 3.
Nevertheless, in the case of complex systems it is wise to use CAD tools for accurate power consumption measurement.
Generally, power estimation refers to the techniques for estimating the average power dissipation of circuits. There are several power analysis techniques and tools at the circuit, gate, architectural, and behavioural levels of abstraction. The most straightforward method of power estimation is circuit simulation, i.e., performing a circuit simulation of the design and measuring the average current drawn from the supply. Therefore, the average power can be estimated which is the average of summation of the
where $P_{static}$ is the static power consumption, i.e., the power consumed due to leakage and static currents, $P_{short-circuit}$ is the short-circuit power consumed because of the current flowing from
power consumption, which constitutes the majority of the power consumed in CMOS VLSI circuits.
The method used by CAD tools to measure the power consumption is strongly dependent on the input patterns (pattern-dependent technique). The technique is also called dynamic power simulation, which should not be confused with dynamic power. Equation (4.4) shows the dynamic portion of the total power consumption of the digital circuit. This equation is very similar to the algorithm that has been derived to compute the power
(4.4)
where $C_i$ is the load capacitance at node i, $V_i$ is the voltage swing at node i, $\alpha_i$ is the switching activity factor at node i, $f_{clk}$ is the system clock frequency, $V_{dd}$ is the power supply voltage, $V_t$ is the transistor threshold voltage, and $\beta$ is the gain factor of the transistor. The summation is over all the nodes of the circuit, which makes the power estimation a very complex calculation. Changing any of the components in Equation 4.4 would result in a different power consumption value. Some of the components of Equation 4.4 are process-dependent, such as $V_t$ and $V_{dd}$. Other components such as $C_{load}$, $V_{swing}$ are
The two components in Equation 4.4 that depend only on the input pattern are the clock frequency ($f_{clk}$) and the switching factor ($\alpha_i$). The input frequency of the entire system is bounded by the delay of the critical path. Taking into account the approximately 800 ps delay of the critical path that has already been measured, the characteristics of the input signals are determined. The required period for the input signal is at least 800 ps. Assuming a 50% duty cycle as a standard for the input signals, a minimum input pulse width of 400 ps is required. Thus, the frequency of the input signals is set at the minimum value of 1.1 GHz. This brings the first condition for
The switching factor ($\alpha_i$) is the underlying factor of transistor switching. For N periods of $0 \to V_{dd}$ and $V_{dd} \to 0$ transitions, the switching activity $\alpha_i$ determines how many $0 \to V_{dd}$ transitions occur at the capacitive nodes. In other words, $\alpha_i$ represents the probability that a $0 \to V_{dd}$ transition will occur during the period T = 1/f, where f is the frequency of the input signals at the node. Considering the transitions of all internal nodes is a complex task, which is beyond the scope of the present discussion. However, it is clear at this point that choosing a pattern that causes a high number of transitions in one period of the multiplier is a contributing factor to the power consumption value. Hence, this brings another
multiplier the two following conditions are considered to govern the power performance:
1) Applying the input signals with an operating frequency of approximately 1.2 GHz.
2) Applying the input pattern causing the maximum switching activity in the entire design.
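To illustrate why these two conditions govern the measurement, the sketch below evaluates the widely used switching-power expression $P_{dyn} = f_{clk} V_{dd} \sum_i \alpha_i C_i V_i$. This form is an assumption about the dynamic term only (Equation 4.4 may contain further terms), and the node count, capacitances and activity factors are hypothetical. Only the 1.8 V supply and the roughly 800 ps input period are taken from the text.

```python
def dynamic_power(nodes, f_clk_hz, vdd):
    """Switching power in watts.

    nodes: iterable of (alpha, c_load_farads, v_swing_volts) tuples,
    one per capacitive node of the circuit.
    """
    return f_clk_hz * vdd * sum(alpha * c * v for alpha, c, v in nodes)


VDD = 1.8                 # supply voltage from the text (V)
PERIOD_S = 800e-12        # approximate critical-path-limited input period from the text
F_CLK = 1.0 / PERIOD_S    # 1.25 GHz, i.e. on the order of the 1.2 GHz condition


def uniform_nodes(alpha, count=1000, c_load=10e-15, v_swing=VDD):
    """Hypothetical circuit: `count` identical nodes with full voltage swing."""
    return [(alpha, c_load, v_swing)] * count


p_all_switching = dynamic_power(uniform_nodes(1.0), F_CLK, VDD)
p_half_switching = dynamic_power(uniform_nodes(0.5), F_CLK, VDD)
```

In this simple model the power scales linearly with both the clock frequency and the switching activity, which is why the highest allowed input frequency combined with the most active input pattern is expected to give the maximum power reading.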
The power consumed by the entire system has been measured by changing the inputs
transition occurring by these input patterns guarantees charging of the load capacitances at all nodes of the circuit to the maximum (Fig. 4.17). One can therefore expect to observe the maximum power consumed by the multiplier when applying this pattern. This pattern is shown in Fig. 4.14. The waveform of the current drawn by the $V_{dd}$ node while applying this pattern is shown in Fig. 4.17. The average of this current computed by Cadence is 10.5 mA, and the power consumption is measured as 18.09 mW. Many different patterns have been applied to the proposed multiplier, and the delay and power consumption have been recorded (Tables 4.6 and 4.7). The maximum power consumption observed belongs to the multiplication pattern “11111111” x “11111111”, which is expected according to the
Fig. 4.17 The waveform of the current drawn by the $V_{dd}$ node (“11111111” x “11111111”)
To further examine the effect of switching activity in a complex system such as the proposed design, another random pattern has been chosen. The power consumed by the entire system has been measured by applying a pattern causing transitions of X =
value of the power consumed by the multiplier under this pattern is almost the mean of the power consumed by the system when all inputs are set to “0”, which is called the “power down” or “minimum power consumption” value, and the maximum power
$P_{average} = \frac{P_{min} + P_{max}}{2}$ (4.5)
where $P_{min}$ is the power-down value, measured as 19.314 nW when no inputs are applied, and
measured as 10.5 mW. The reason for such an assumption is the equal density of “0” and “1” in the patterns “10101010” and “01010101”, which makes it possible to assume that the capacitances at 50% of all the nodes in the entire system will be charged. The assumed value of the power consumption for this pattern, from Equation 4.5, is 9.054 mW. The actual power consumption measured by Cadence while applying this pattern is 10.347 mW. A difference of about 12% is seen between the assumed power consumption and the actual power consumption measured by Cadence. This difference is mainly due to the power consumed by the interconnections and routing paths.
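The comparison between the assumed and the measured value can be reproduced with a few lines of Python. In this sketch the two power values (9.054 mW and 10.347 mW) are taken from the text, the function names are illustrative, and normalizing the difference to the measured value is an assumption about how the quoted percentage was obtained.

```python
def mean_power_estimate(p_min, p_max):
    """Mid-point estimate in the form of Equation 4.5 (same unit as the inputs)."""
    return 0.5 * (p_min + p_max)


def relative_difference_percent(estimated, measured):
    """Difference between estimate and measurement, relative to the measurement."""
    return abs(measured - estimated) / measured * 100.0


ESTIMATED_MW = 9.054   # value assumed from Equation 4.5, as stated in the text
MEASURED_MW = 10.347   # value measured by the simulator, as stated in the text

difference = relative_difference_percent(ESTIMATED_MW, MEASURED_MW)
print(f"{difference:.1f}%")  # 12.5%
```

The result agrees with the difference of about 12% quoted above.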
Fig. 4.18 shows the current drawn by the $V_{dd}$ node while applying the pattern “10101010” x “01010101”.
[Fig. 4.18: current drawn by the $V_{dd}$ node versus time (s)]
Fig. 4.19 shows the simulation waveforms of the input patterns under the assumption of 50% switching activity compared to the pattern “11111111” x “11111111”. Fig. 4.20 shows the
Fig. 4.19 Waveforms of the input patterns “10101010” x “01010101”
Fig. 4.20 The simulation waveforms of the multiplication product of the input patterns shown in Fig. 4.19
A total of 100 patterns have been examined as inputs to validate the operation of the proposed design. These patterns included 80 random patterns and 20 intentional
Tables 4.6 and 4.7 present the delay and power consumption of the proposed multiplier as
Table 4.7 The numerical results of several random multiplications sorted by delay
X           Dec.  Y           Dec.  P16...P1            Dec.   Delay (ns)  Power Consumption (mW)
“10110100”  180   “00101000”  40    “0001110000100000”  7200   0.662       2.54
“10011101”  157   “00101100”  44    “0001101011111100”  6908   0.662       5.17
“10100101”  165   “00001100”  12    “0000011110111100”  1980   0.662       3.91
“10101101”  173   “00110100”  52    “0010001100100100”  8996   0.670       4.42
“00111101”  61    “00101100”  44    “0000101001111100”  2684   0.689       6.93
“10111101”  189   “00111100”  60    “0010110001001100”  11340  0.703       5.92
“10111001”  185   “00111100”  60    “0010101101011100”  11100  0.703       5.99
“10110101”  181   “00111000”  56    “0010011110011000”  10136  0.704       4.50
“00001100”  12    “00011000”  24    “0000000100100000”  288    0.711       7.23
“00111101”  61    “00111100”  60    “0000111001001100”  3660   0.711       9.83
“00110000”  48    “00010000”  16    “0000001100000000”  768    0.711       9.46
“00011001”  25    “00110100”  52    “0000010100010100”  1300   0.712       7.16
“10100101”  165   “00001100”  12    “0000011110111100”  1980   0.712       3.90
“00111101”  61    “00011100”  28    “0000011010101100”  1708   0.714       4.55
“00100100”  36    “00001000”  8     “0000000100100000”  288    0.716       7.13
“10110001”  177   “00110000”  48    “0010000100110000”  8496   0.729       6.84
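The functional correctness of rows such as those above can be checked against a software reference. The following Python sketch is one possible checker; the function name is illustrative, and the three sample rows are copied from the table.

```python
def check_row(x_bits, y_bits, product_bits):
    """Return True if the 16-bit product string equals X times Y.

    x_bits and y_bits are 8-bit binary strings; product_bits is the
    16-bit output P16...P1 read from the simulation waveforms.
    """
    expected = format(int(x_bits, 2) * int(y_bits, 2), "016b")
    return expected == product_bits


rows = [
    ("10110100", "00101000", "0001110000100000"),  # 180 x 40 = 7200
    ("00111101", "00111100", "0000111001001100"),  # 61 x 60 = 3660
    ("10110001", "00110000", "0010000100110000"),  # 177 x 48 = 8496
]
all_rows_ok = all(check_row(*row) for row in rows)

# Reference output for the all-ones pattern used in the worst-case measurements.
print(format(0b11111111 * 0b11111111, "016b"))  # 1111111000000001
```

The same check can be applied to every examined pattern, so that delay and power figures are only recorded for multiplications whose product is correct.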
With regard to design robustness, the effects of noise have to be evaluated. Noise is one of the main factors determining the stability of the system. In the following, the main sources of noise are described and the performance of the proposed design has been
Noise: One of the main degrading factors in the performance of high-speed VLSI systems is noise, which comes from different sources. Noise can be induced through the supply and ground of the system during switching transitions. This noise is known as Simultaneous Switching Noise (SSN). Another type of noise is thermal noise. However, this noise is
Following the definition of SSN and thermal noise, the robustness of the proposed multiplier has been examined by considering the effects of these noises on the
One of the main sources of noise in digital systems is Simultaneous Switching Noise (SSN). The effects of SSN are receiving more attention as a result of the continuous increase in the integration level on a single chip and in the operating speed. This noise is caused by the large instantaneous current, due to the switching of multiple drivers and switches, through the parasitic inductance at the ground and power nodes. SSN can have dramatic
where $V_n$ is the ground bounce, $L_{vss}$ is the bond wire parasitic inductance and I is the current
Equation 4.6 shows that SSN can be lowered by reducing the parasitic inductance. To reduce the parasitic inductance, multiple pads and pins for the power supply ($V_{dd}$) and ground ($V_{ss}$) are needed. Allocating extra pins to $V_{dd}$ and $V_{ss}$ reduces $L_{vss}$ to $L_{vss,effective}$ due to
Fig. 4.21 Adding two extra pins to $V_{dd}$ and $V_{ss}$ and reducing the parasitic inductance
The standard package No. 68PGA offered by CMC has 36 pins, which allows us to assign 32 pins to the inputs (the 8-bit multiplicand and the 8-bit multiplier) and the outputs (the 16-bit multiplication product). Therefore, 2 extra pins are assigned to $V_{dd}$ and $V_{ss}$ (one extra for each), and this has been done at no extra cost. These multiple pads reduce the parasitic inductance to half.
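The effect of the extra pins can be illustrated numerically. The sketch below assumes the usual ground-bounce relation $V_n = L_{vss} \, dI/dt$ and treats the bond wires as identical, uncoupled inductors in parallel, so that n pins give $L_{vss}/n$. Both are simplifying assumptions, and the inductance and current-step values are hypothetical.

```python
def effective_inductance(l_single_h, n_pins):
    """Effective inductance of n identical, uncoupled bond wires in parallel."""
    if n_pins < 1:
        raise ValueError("at least one pin is required")
    return l_single_h / n_pins


def ground_bounce(l_eff_h, delta_i_a, delta_t_s):
    """Ground bounce V_n = L * dI/dt for a linear current ramp."""
    return l_eff_h * delta_i_a / delta_t_s


L_BOND_H = 5e-9      # hypothetical bond wire inductance (5 nH)
DELTA_I_A = 10e-3    # hypothetical switching current step (10 mA)
DELTA_T_S = 100e-12  # hypothetical switching time (100 ps)

v_one_pin = ground_bounce(effective_inductance(L_BOND_H, 1), DELTA_I_A, DELTA_T_S)
v_two_pins = ground_bounce(effective_inductance(L_BOND_H, 2), DELTA_I_A, DELTA_T_S)
```

Under these assumptions, doubling the number of supply or ground pins halves the ground bounce for the same current transient, in line with the halving of the parasitic inductance stated above.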
The absence of glitches and the strong driving capability of the output signals validate our approach to reducing the impact of SSN. This proves the robustness of the proposed design
electrons in conductors. Equation 4.7 shows the power of this noise as:
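$P_{n(Thermal)} = kT\Delta f$ (4.7)
where k is Boltzmann's constant, T is the absolute temperature, and $\Delta f$ is the bandwidth. It is often preferred to represent this noise as a noise voltage, as shown in Equation 4.8.
$\frac{\overline{V^2_{n(Thermal)}}}{\Delta f} = 4kTR$ (4.8)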
where R is the parasitic resistance. The thermal noise power per Hz is equal throughout the frequency spectrum, depending only on k and T. To examine the effect of this noise on the performance of the entire system, the voltage of the final outputs is simulated over the operating temperature range from -40 °C to 135 °C. The voltage variation within this temperature range with a capacitive load ($C_L$) of 5 pF is 0.15%. This shows that the system is quite robust against temperature variation. Fig. 4.22 shows the simulation
Fig. 4.22 The simulation results of the final outputs against temperature variation
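For orientation, the temperature dependence of the thermal noise voltage itself can be estimated from the standard expression $V_{n,rms} = \sqrt{4kTR\Delta f}$. The Python sketch below is a rough illustration: the resistance and bandwidth are hypothetical values, and only the temperature limits of -40 °C and 135 °C come from the text.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant (exact SI value)


def celsius_to_kelvin(temp_c):
    return temp_c + 273.15


def thermal_noise_vrms(r_ohm, temp_k, bandwidth_hz):
    """RMS thermal noise voltage of a resistance over a given bandwidth."""
    return math.sqrt(4.0 * BOLTZMANN_J_PER_K * temp_k * r_ohm * bandwidth_hz)


R_PARASITIC_OHM = 1e3   # hypothetical parasitic resistance
BANDWIDTH_HZ = 1e9      # hypothetical noise bandwidth

v_cold = thermal_noise_vrms(R_PARASITIC_OHM, celsius_to_kelvin(-40.0), BANDWIDTH_HZ)
v_hot = thermal_noise_vrms(R_PARASITIC_OHM, celsius_to_kelvin(135.0), BANDWIDTH_HZ)
ratio = v_hot / v_cold  # about 1.32, independent of the chosen R and bandwidth
```

The noise voltage grows only with the square root of the absolute temperature, so it stays in the sub-millivolt range for values of this order, far below the 1.8 V logic swing.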
4.3 Layout Considerations
The proposed 8 x 8-bit multiplier has been laid out in 0.18 µm CMOS technology. In the following, layout issues such as floor planning, routing, pads and packaging of the
Considering transistor chaining, grouping, and signal sequencing in our proposed adder layout has been shown to bring substantial power savings and speed improvement at no
3) Minimizing the capacitive load on the $C_{out}$ signal by minimizing the size of those
Fig. 4.23 Layout of the pseudo-NMOS full adder (die size of 22 x 8.5 µm²)
Fig. 4.24 Layout of the AND gate (die size of 7.9 x 5.6 µm²)
Fig. 4.25 Layout of the NAND gate (die size of 5.4 x 5.6 µm²)
Fig. 4.26 Layout of the XOR gate (die size of 10.2 x 20.7 µm²)
Fig. 4.27 Layout of the core of the proposed 8 x 8-bit multiplier (die size of 0.275 x 0.38 mm²), showing the carry lookahead adder, the AND generator and the four adder levels
Following are some details on the routing, pad, package and chip size of the proposed
multiplier.
3 and metal # 4. The input and output signals go through metal # 5. Avoiding long
overlap between neighbouring metal layers will reduce the coupling capacitances.
Pad: I/O digital pads of the TSMC library “tpz973q” from cell “PDIDGZ” have been used for connecting the core to the output. Dummy layers are also added to satisfy the density requirement.
Package: Package 68PGA is used for the chip. This package provides the core chip with 36 pins (9 pins on each side). The total area including the area occupied by the pads is 1.395 x 1.37 mm². The core area of the chip is 0.275 x 0.38 mm². Fig. 4.28 shows the layout of the entire chip.
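The quoted dimensions translate into areas as follows. This small Python sketch uses only the dimensions given above; the variable names are illustrative, and the core-to-chip ratio is a derived figure that is not stated in the text.

```python
def area_mm2(width_mm, height_mm):
    """Rectangular area in square millimetres."""
    return width_mm * height_mm


core_area = area_mm2(0.275, 0.38)    # core of the multiplier
chip_area = area_mm2(1.395, 1.37)    # total area including pads
core_fraction = core_area / chip_area

print(f"core: {core_area:.4f} mm^2")        # core: 0.1045 mm^2
print(f"chip: {chip_area:.3f} mm^2")        # chip: 1.911 mm^2
print(f"core share: {core_fraction:.1%}")   # core share: 5.5%
```

The pad ring therefore dominates the total area, which is typical for a small core placed in a standard pad-limited package.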
Chapter 5
Conclusion
Digital multipliers are one of the crucial blocks of real-time Digital Signal Processing (DSP) applications ranging from digital filtering to image processing. However, speed of operation is not the only consideration; low power dissipation and small chip area are also needed because of the dense packing of transistors in today’s system-on-chip (SoC) applications.
a digital multiplier with a speed of operation over 1 GHz. The three main considerations for the design are high multiplication speed, low power consumption, and a small rectangular chip area.
The performance of the proposed multiplier has met the objectives of this work. The
strategy as follows:
• The critical building blocks of the pair-wise algorithm have been identified and ranked by their impact on the speed, power and area of the entire design. This
• An extensive study of the performance of the six main static full adders has been performed in order to select the most power-speed efficient full adder topology.
• Six full adders have been re-designed through an iterative approach for the sake of proper transistor sizing (this approach has been used in the design of the other
approach guaranteed power reduction at the circuit design level. The full adders
• Speed and power are given equal importance during topology selection by using the power-delay product as the measuring factor. Area and driving capability are also taken into account. The comparison results in choosing the pseudo-NMOS full adder.
Device Characteristics
  Process                                        Five-metal 0.18 µm digital CMOS
  Power supply ($V_{dd}$)                        1.8 V
Chip Characteristics
  Multiplier & multiplicand                      8 bits
  Product                                        16 bits
  Multiplication delay                           666 ~ 793 ps
  Power consumption (power down)                 19.314 nW
  Power consumption @ input frequency 1.1 GHz    18.09 mW
  Average power consumption                      10.347 mW
  Core size                                      0.1045 mm²
  Operating temperature                          -40 °C to +135 °C
The designed multiplier is suitable for high-speed and low-power applications, which
functioning in supply ranged from 1.8 V to 0.09V has proved the suitability of the proposed architecture. The power consumption is 18.09 mW at 1.1 GHz. The design is implemented in TSMC 0.18 µm CMOS technology and analyzed using Cadence’s Spectre
The proposed 8 x 8-bit multiplier is laid out in 0.18 µm CMOS technology and was verified for design rules and matched with the schematics. The total area is 1.395 x 1.37 mm². The results of the post-layout simulations are reasonably consistent with those found in the design process.
In this section, the summarized results of an investigation of recent works on digital multipliers are provided in Table 5.2. This selection has been made based on the novelty of the works. Data are extracted from the IEEE Journal of Solid-State Circuits and the
As seen in this table, the reported multipliers are implemented in different CMOS technologies with different bit widths. This can have a dramatic impact on design criteria such as area, power consumption, and speed throughput. However, it would not be fair to compare the results of the conducted survey on digital multipliers directly, as the technology, bit width, target frequency, and simulation methodologies vary widely. The following discussion provides some indications of multiplier results and
When speed is the main concern, the Booth encoding scheme and Wallace tree reduction show their abilities for large-throughput multipliers. However, the combination of these methodologies with GaAs devices results in a high-speed multiplier that is not feasible for CMOS devices to reach. From this point of view, the proposed multiplier shows its superiority for medium bit width (4 to 8 bits) applications in the speed and power trade-off.
Table 5.2 Summary of the performance of the recent publications on digital multipliers
Author (Year) [Ref.]; Multiplier Structure; Bit-Width; Technology (µm); Power Supply (V); Speed (MHz); Core Area (mm²); Power Consumption (mW)
N. Itoh (2001) [28]; Rectangular-Styled Wallace Tree; 54; 0.18; 1.8; 600; 0.98
J. Butas (2001) [29]; Asynchronous Cross-Pipelined; 16; 0.6; 1.5-5; 59-251; 40.59
S. Kim (2001) [30]; True Single-Phase Adiabatic; 0.5; 2.7; 220; 0.47
J. Lim (2000) [31]; NRERL Serial; 0.6; 2.5; 0.1-1; 2.37
J.S. Wang (2000) [32]; TSPC Flip-Flops; 0.6; 3.3; 300; 0.3; 52.4
A. Smith (1997) [33]; GaAs; 16; 0.6; 0.9; 416; 1.98; 1700
J. Mori (1991) [34]; 4-2 Compressor; 54; 0.5; 3.3; 100; 12.4; 870
K. Yano (1990) [35]; Pass-Transistor; 54; 0.25; 2.5; 227; 12.7
M. Hatamian (1986) [36]; Parallel Pipelined; 2.5; 2.5; 75; 250
In terms of power consumption, asynchronous circuitry and adiabatic logic are viable approaches for applications where speed is not the prime concern. Nevertheless, pass-transistor logic offers both higher speed and lower power consumption (Table 3.12). Also, NMOS reversible energy recovery logic, which is a new reversible adiabatic
has been implemented by this logic, is suitable for applications where energy consumption is the top priority [30]. The proposed multiplier still stands ahead in power consumption compared to the other designs. However, this structure could be more power
Where area is the prime concern, the recent progress in the use of deep sub-micron devices can help to overcome this constraint. It is also possible to reduce the silicon area by
Therefore, it has been recommended that in multiplier performance and area trade-offs,
design of the proposed multiplier the area is considered as one of the criteria in choosing the building blocks. Pseudo-NMOS shows significant area saving due to having only 14
To further improve the performance of the pair-wise multiplier, one needs to consider a way to reduce the critical path delay of the multiplier for longer bit widths with a better trade-off
pipeline latches at appropriate places so that the functionality of the circuit remains unchanged and no appreciable reduction in the throughput occurs, however it takes a very
absence of any feedback loop in this architecture (Fig. 2.9). The advantages of this methodology are manifold. Since the proposed architecture permits pipelining, the operation speed can be considerably increased. This increased speed can be traded for
This approach makes the pair-wise architecture qualitatively a viable configuration for
based on the proper simulation arrangements is required to show the speed and power trade-off.
References
[4] P.Y. Lu, et al, “A 30-MFLOP 32b CMOS Floating-Point Processor,” IEEE Solid-
State Circuit Conf. Proceedings, vol. XXXI, pp. 28-29, February 1988.
[5] W. McAllister and D. Zuras, “An NMOS 64b Floating-Point Chip Set,” IEEE Int.
Solid-State Circuits Conf., pp. 34-35, February 1986.
[7] M. Hatamian and G. L. Cash, “High Speed Signal Processing Pipelining, and VLSI,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 1173-1176, April 1986.
[10] A. Khatibzadeh and K. Raahemifar, “A Study & Comparison of Full Adder Cells Based on the Standard Static CMOS Logics,” in Proc. 2004 Canadian Conf. on Electrical and Computer Engineering (CCECE2004), Niagara, May 2004.
[11] J. Yuan and C. Svensson, “High-Speed CMOS Circuit Technique,” IEEE J. Solid State Circuits, vol. 24, no. 1, February 1989.
[14] K. Raahemifar and M. Ahmadi, “A Fast 32-bit Digital Multiplier,” in Proc. of the 8th IEEE International Conf. on Electronics, Circuits and Systems (ICECAS), Malta, Sept. 2001, pp. 1413-1416.
[15] R. Zimmermann and W. Fichtner, “Low-power Logic Styles: CMOS versus Pass-Transistor Logic,” IEEE J. Solid State Circuits, vol. 32, pp. 1079-90, July 1997.
[16] A. Parameswar et al., “A High Speed, Low Power, Swing Restored Pass-Transistor Logic Based on Multiply and Accumulate Circuit for Multimedia Applications,” IEEE CICC, May 1994, pp. 278-281.
[18] M. Suzuki et al., “A 1.5-ns 32-b CMOS ALU in Double Pass-Transistor Logic,”
IEEE J. Solid State Circuits, vol. 28, no. 11, pp. 1145-1151, November 1993.
[20] A. Shams and M. Bayoumi, “A Modular Approach for Designing Low Power
Adders”, Proc. ASILOMAR, June 1997.
[21] J. M. Wang et al., “New Efficient Design for XOR & XNOR Functions on the
Transistor Level,” IEEE J. Solid State Circuits, vol. 29, no. 7, pp. 780-786, July 1994.
[22] E. Abu-Shama and M. Bayoumi, “A New Cell for Low-Power Adder,” in Proc. Int.
Midwest Symp. Circuits Syst., 1995.
[24] A. Bellaouar and M. Elmasry, “Low-Power Digital VLSI Design Circuit and System,” Kluwer Academic Publishers, 1995.
[27] G. J. Fisher, “An Enhanced Power Meter for SPICE2 Circuit Simulation,” IEEE
Trans. Computer-Aided Design, vol. 7, pp. 641-643, May 1988.
[30] S. Kim, C.H. Ziesler and M. C. Papaefthymiou, “A True Single-Phase 8-bit Adiabatic Multiplier,” Design Automation Conference, 2001, Proceedings, pp. 758-763.
[31] J. Lim, D. Kim, S. Kang and S. Chae, “An 8 x 8-b NRERL Serial Multiplier for Ultra-low-power Application,” IEE Proceedings, vol. 146, pp. 327-333, Dec 2000.
[33] A. B. Smith, N. Burgess, S. Cui and M. Liebelt, “GaAs Multiplier Design for High-Speed DSP Application,” Thirty-first ASILOMAR Conference, 1997.
[34] J. Mori et al., “A 10-ns 54 x 54-bit Parallel Structured Full Array Multiplier with 0.5-µm CMOS Technology,” IEEE J. Solid State Circuits, vol. 26, no. 4, April 1991.