
A 1.8V 1.1-GHZ NOVEL

8 X 8-BIT DIGITAL MULTIPLIER

by

Amir Ali Khatibzadeh


B.Sc. in Electrical and Electronics
Khajeh-e Nasir University of Technology
Tehran, Iran
1996

A thesis

presented to Ryerson University

in partial fulfilment of the

requirement for the degree of

Master of Applied Science

in the program

Electrical & Computer Engineering

Toronto, Ontario, Canada, 2004

© Amir Ali Khatibzadeh


UMI Number: EC 53466

INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy
submitted. Broken or indistinct print, colored or poor quality illustrations and
photographs, print bleed-through, substandard margins, and improper
alignment can adversely affect reproduction.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if unauthorized
copyright material had to be removed, a note will indicate the deletion.

UMI
UMI Microform EC 53466
Copyright 2009 by ProQuest LLC
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346
I hereby declare that I am the sole author of this thesis.

I authorize Ryerson University to lend this thesis to other institutions

or individuals for the purpose of scholarly research.

Amir Ali Khatibzadeh

I further authorize Ryerson University to reproduce this thesis by

photocopying or by other means, in total or in part, at the request of

other institutions or individuals for the purpose of scholarly research.

Amir Ali Khatibzadeh


Ryerson University requires the signatures of all persons using or

photocopying this thesis. Please sign below, and give address and

date.

Abstract
Amir Ali Khatibzadeh

A 1.8 V 1.1 GHz Novel Digital Multiplier


Master of Applied Science in
Electrical & Computer Engineering

Ryerson University

Toronto, Ontario, Canada, 2004

This thesis presents the design of a novel 8 x 8-bit multiplier, which provides better

performance than its counterparts in the sense that it requires a fraction of the silicon area,

delay and power consumption of common architectures such as conventional

linear array multipliers.

At the system level, high performance is obtained by implementing a pair-wise

multiplication algorithm. In addition, a parallel addition algorithm is used to add up the partial

products. Combining these two algorithms results in an efficient cell-based circuit

realization. At the circuit level, a pseudo-NMOS full adder cell is chosen from among

several existing full adder cells due to its superior speed and power performance.

The performance of this design has been evaluated by comparing it to those of

recently reported multipliers. The results of the comparison, both in theory and

simulation, prove the superiority of the proposed multiplier.

Acknowledgement

During the journey through a Master’s program, support and help from friends, family, and

faculty can be invaluable. To begin with, I cannot stress enough that the single most

important person contributing to the success of the student is the principal dissertation

advisor. With this in mind, I would like to thank Prof. Kaamran Raahemifar for his help

and guidance. His extensive knowledge of circuits and keen insight into VLSI design were

major assets.

I am also grateful to the many people who influenced me throughout my career at Ryerson

(sorry if I missed any of you). I would like to thank Prof. Sridhar Krishnan, chair and

director of the Graduate Program in the Department of Electrical and Computer Engineering at

Ryerson, who has devoted his effort and intelligence to creating a very vigorous atmosphere in

Graduate Studies in the department.

More recently, I would like to thank Prof. Vadim Geurkov for serving on my oral exam

committee. His feedback and insightful comments on the material of this thesis are

greatly appreciated.

I would like to express my appreciation to Prof. Alireza Sadeghian for reading this thesis

and for his helpful comments.

My special thanks and acknowledgements go to Shahab Ardalan, with whom I spent

numerous hours designing VLSI circuits. It has truly been an enlightening experience,

and I am thankful for having this opportunity.

I would also like to thank Haleh Vahedi and Majid Malekan for their technical advice

regarding the layout.

Last, but certainly not least, I would like to thank my dear friend, Dana, and my family

and loved ones who stood by me throughout these years. I will not name you all, but I am

eternally grateful. Words cannot express the warmth and gratitude I feel for each of you.

Fabrication support through the Canadian Microelectronics Corporation (CMC) is also

gratefully acknowledged.

This thesis is dedicated to the late eminent scholar, Professor Abbas Sahab, my

grandfather, who has been named the father of Iran’s cartography.


Table of Contents
Abstract
Acknowledgements
List of Tables
List of Figures

Chapter 1: Introduction

1.1 Motivation
1.2 Applications
1.3 Original Contributions
1.4 Thesis Organization

Chapter 2: Basic Concepts of Multiplication

2.1 Multiplication Definition
2.2 Binary Multiplication
2.3 Review of Parallel Multiplication Algorithms
2.3.1 Braun Algorithm
2.3.2 Baugh-Wooley Algorithm
2.3.3 The Modified Booth Algorithm
2.3.4 Wallace Tree Algorithm
2.3.5 The Proposed Pair-Wise Algorithm
2.4 Qualitative Comparisons of Parallel Algorithms

Chapter 3: Multiplier Design

3.1 Pair-Wise Multiplier
3.1.1 Circuit Level Review
3.1.1.1 Full Adder
3.1.1.1.1 Simulation Strategies
3.1.1.1.2 Power Consumption Performance
3.1.1.1.3 Delay Performance
3.1.1.1.4 Performance Comparison
3.1.1.2 Carry Lookahead Adder
3.1.1.3 AND/NAND/OR/XOR Gates
3.1.2 Cell Design

Chapter 4: Simulation Results & Layout Considerations

4.1 Simulation Results of the Individual Circuits
4.2 Final Simulation Results
4.3 Layout Considerations
4.3.1 Layout Strategies
4.3.2 Pads, Package and Chip Size

Chapter 5: Conclusion

5.1 Features of the Designed Multiplier
5.2 Comparison Results
5.3 Future Work

References

List of Tables

Table 2.1 Partial products selection
Table 2.2 Partial product generation process
Table 2.3 Partial product generation relation
Table 3.1(a) Truth table of a full adder
Table 3.1(b) Truth table of a half adder
Table 3.2 Transistor dimension in complementary CMOS full adder
Table 3.3 Transistor dimension in complementary pass-transistor full adder
Table 3.4 Transistor dimension in double pass-transistor full adder (Sum)
Table 3.5 Transistor dimension in double pass-transistor full adder (Cout)
Table 3.6 Transistor dimension in transmission gate full adder
Table 3.7 Transistor dimension in pseudo-NMOS full adder
Table 3.8 Transistor dimension in XOR & transmission gate full adder
Table 3.9 Characteristic of the input signals
Table 3.10 Simulation results for the full adders sorted by power consumption
Table 3.11 Simulation results for the full adders sorted by propagation delay
Table 3.12 Simulation results for the full adders sorted by power-delay product
Table 3.13 Delay of the generate, propagate and sum signals of PFA
Table 3.14 Truth table of AND, NAND, OR and XOR
Table 3.15 Transistor dimension & delay of AND, NAND, XOR and OR gates
Table 4.1 Transitions covered by input pattern (a)
Table 4.2 Transitions covered by input pattern (b)
Table 4.3 Transitions covered by input pattern (c)
Table 4.4 Transitions covered by input pattern (d)
Table 4.5 The results of the worst-case delay measurement
Table 4.6 The numerical results of several intentional multiplications
Table 4.7 The numerical results of several random multiplications sorted by delay
Table 5.1 Performance of the proposed multiplier
Table 5.2 Summary of the performance of the recent publications on digital multiplier

List of Figures

Fig. 2.1 Partial products of an 8 x 8-bit unsigned integer multiplication
Fig. 2.2 Braun’s array multiplier
Fig. 2.3 Illustration of the partial product terms in Baugh-Wooley algorithm
Fig. 2.4 Reorganization of the partial product terms of Fig. 2.3
Fig. 2.5 8 x 8-bit Baugh-Wooley two’s complement regular array
Fig. 2.6 Block diagram of the n x n multiplier using modified Booth algorithm
Fig. 2.7 Construction of Wallace’s tree for an 8 x 8-bit multiplier, reduction of the 8 partial products with 4-2 compressors
Fig. 2.8 Architecture of Wallace’s tree for an 8 x 8-bit multiplier
Fig. 2.9 Block diagram for the 8 x 8-bit pair-wise multiplier
Fig. 3.1 Schematic of complementary CMOS full adder
Fig. 3.2 Schematic of complementary pass-transistor full adder
Fig. 3.3(a) Schematic of double pass-transistor full adder (Sum)
Fig. 3.3(b) Schematic of double pass-transistor full adder (Cout)
Fig. 3.4 Schematic of transmission gate full adder
Fig. 3.5 Schematic of pseudo-NMOS full adder
Fig. 3.6 Schematic of XOR & transmission gate full adder
Fig. 3.7(a) Input patterns used to evaluate the performance of the adders
Fig. 3.7(b) Input patterns used to evaluate the performance of the adders
Fig. 3.7(c) Input patterns used to evaluate the performance of the adders
Fig. 3.7(d) Input patterns used to evaluate the performance of the adders
Fig. 3.8 Propagation delay measurement
Fig. 3.9 Block diagram of 4-bit carry lookahead adder
Fig. 3.10 Gate-level implementation of partial full adder (PFA)
Fig. 3.11 Block diagram of the 16-bit carry lookahead adder implemented by cascading four 4-bit carry lookahead modules
Fig. 3.12 Schematic of (a) AND (b) NAND (c) XOR (d) OR gates
Fig. 3.13 Block diagram of the proposed 8 x 8-bit multiplier showing detail of the required cells
Fig. 3.14 Block diagram of AND generator
Fig. 3.15 Gate-level of the AND plane (XiYj Cell)
Fig. 3.16 Block diagram of partial products generator
Fig. 3.17 Gate-level of one partial product generator (PP Cell)
Fig. 4.1 Circuit structure used for simulation of full adder cell
Fig. 4.2 Circuit structure used for simulation of AND/NAND/OR/XOR gates
Fig. 4.3 The simulation waveforms showing the response of the pseudo full adder to the input pattern (a)
Fig. 4.4 The simulation waveforms showing the response of the pseudo full adder to the input pattern (b)
Fig. 4.5 The simulation waveforms showing the response of the pseudo full adder to the input pattern (c)
Fig. 4.6 The simulation waveforms showing the response of the pseudo full adder to the input pattern (d)
Fig. 4.7 The simulation waveforms showing the response of the AND/NAND gate
Fig. 4.8 The simulation waveforms showing the response of the OR gate
Fig. 4.9 The simulation waveforms showing the response of the XOR gate
Fig. 4.10 The critical path of the proposed multiplier
Fig. 4.11 Delay of AND gate
Fig. 4.12 The worst-case delay of pseudo-NMOS adder occurring when A = 1, B = 1 and C_in = 1
Fig. 4.13 The worst-case delay of 16-bit carry lookahead adder
Fig. 4.14 Input waveforms of X = 11111111 and Y = 11111111
Fig. 4.15 The post layout simulation waveforms showing results of “11111111” x “11111111”
Fig. 4.16 The measured delay of the proposed multiplier corresponding to “11111111” x “11111111”
Fig. 4.17 The waveform of current drawn by V_dd node (“11111111” x “11111111”)
Fig. 4.18 The waveform of current drawn by V_dd node (“10101010” x “01010101”)
Fig. 4.19 Waveform of the input patterns “10101010” x “01010101”
Fig. 4.20 The simulation waveforms of multiplication product of the input patterns shown in Fig. 4.19
Fig. 4.21 Adding two extra pins to V_dd and reducing the parasitic inductance
Fig. 4.22 The simulation results of the final outputs against temperature variation
Fig. 4.23 Layout of the pseudo-NMOS full adder (die size of 22 x 8.5 µm²)
Fig. 4.24 Layout of AND gate (die size of 7.9 x 5.6 µm²)
Fig. 4.25 Layout of NAND gate (die size of 5.4 x 5.6 µm²)
Fig. 4.26 Layout of XOR gate (die size of 10.2 x 20.7 µm²)
Fig. 4.27 Layout of the core of the proposed 8 x 8-bit multiplier (die size of 0.275 x 0.38 mm²)
Fig. 4.28 The proposed 8 x 8-bit multiplier chip

Chapter 1

Introduction

1.1 Motivation

The core of every microprocessor, digital signal processor (DSP), and data processing

application-specific integrated circuit (ASIC) is its data path. It is often the crucial circuit

component if die area, power consumption, and especially operation speed are of

concern. At the heart of data path and addressing units are arithmetic units, such as

comparators, adders, and multipliers. Finally, one of the basic operations found in most

arithmetic components is binary multiplication. Besides simply multiplying two numbers,

multipliers are also used in more complex operations like address calculation and

division. Simpler operations such as magnitude comparison are also based on binary

multiplication.

Multiplication is also very critical if implemented in hardware because it involves an

expensive carry-propagation step, when partial product addition is performed. The

efficient implementation of multiplication operation in an integrated circuit is a key

problem in VLSI design. Designing fast and power-efficient multipliers has been of great

theoretical and practical interest for computer scientists and engineers. Several

algorithms and various VLSI implementations have been proposed [1, 2, 3, 4, 5, 6, 7] and

practically used.

To achieve a high-performance multiplier, it is necessary to operate very efficiently

in terms of the speed and power trade-off at all design levels. Increasing the operating speed

of the circuits to make more computations with lower power consumption is the main

motivation in multiplier design.

The recent progress in the use of Ultra Deep Sub-Micron (UDSM) devices helps to

overcome the area constraint. Employing advanced cell-based architectures constantly

improves productivity in ASIC design. Taking all of these into account, implementing

low-power circuit techniques based on a fast multiplication algorithm is still a

cost-effective and feasible alternative for increasing the performance of multipliers

substantially. Therefore, this thesis deals with designing a novel multiplier with an inherent

high-speed characteristic and power-efficient performance.

1.2 Applications

Wireless communication systems, including third generation cellular radio systems and

wireless Local Area Networks (LANs), have become tremendously popular in

recent years. These systems can be implemented using various platforms, such as digital

signal processors, ASICs and Field Programmable Gate Arrays (FPGAs). Most digital

signal processing systems incorporate a multiplication unit to implement algorithms such

as correlation, convolution, filtering and frequency analysis. These algorithms are used in

applications such as finite impulse response (FIR) filters, infinite impulse response (IIR) filters, discrete

cosine transforms (DCT), and fast Fourier transforms (FFT). Moreover, there has been a

rapid increase in the popularity of portable and wireless electronic devices, such as laptop

computers, portable video players and cellular phones, which rely on embedded digital

processors. Since the goal is to design digital systems for communication applications
that achieve the best performance without sacrificing power, high-performance, low-power

multipliers are essential.

1.3 Original Contributions

This thesis presents the design of a novel multiplier architecture with superior

performance in speed, power consumption and area compared to traditional array

multipliers.

To obtain a high-performance architecture, several leading parallel

multiplication algorithms have been studied and compared. The Braun, Baugh-Wooley,

Wallace and pair-wise algorithms have been reviewed in detail. Among these algorithms,

the pair-wise algorithm has been chosen due to its superior speed of operation.

The 8 x 8-bit multiplier based on this algorithm operates by:

1) generating four 8-bit numbers (X_e, X_o, Y_e, Y_o) using the even and odd positions of

the multiplicand (X) and multiplier (Y).

2) multiplying these four 8-bit numbers to generate the four 15-bit numbers

(P_ee, P_eo, P_oe, P_oo) known as the even and odd elements of the partial products

(P).

3) adding the results of the multiplication of the elements of the partial products. The

addition is performed in four steps by using the 3-to-2 adding technique, which

results in two 15-bit numbers.

4) adding the two final 16-bit numbers (P_e, P_o) and thus generating the product of

the multiplication via a fast carry lookahead adder.
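
The four steps above can be sketched behaviourally in Python. This is only an illustration under stated assumptions: the even and odd operands are assumed to be obtained by masking X and Y with 0x55 and 0xAA (so each stays an 8-bit word with zeros in the unused positions), and ordinary integer addition stands in for the 3-to-2 adders and the carry lookahead adder. The names `pairwise_multiply`, `EVEN_MASK` and `ODD_MASK` are hypothetical and do not come from the thesis.

```python
EVEN_MASK = 0x55  # bit positions 0, 2, 4, 6 (assumed masking scheme)
ODD_MASK = 0xAA   # bit positions 1, 3, 5, 7 (assumed masking scheme)


def pairwise_multiply(x: int, y: int) -> int:
    """Multiply two unsigned 8-bit integers via an even/odd operand split."""
    if not (0 <= x <= 0xFF and 0 <= y <= 0xFF):
        raise ValueError("operands must be unsigned 8-bit integers")

    # Step 1: split each operand into its even- and odd-position bits.
    x_e, x_o = x & EVEN_MASK, x & ODD_MASK
    y_e, y_o = y & EVEN_MASK, y & ODD_MASK

    # Step 2: four partial products; each fits in 15 bits because the
    # largest one is 0xAA * 0xAA = 28900 < 2**15.
    p_ee, p_eo = x_e * y_e, x_e * y_o
    p_oe, p_oo = x_o * y_e, x_o * y_o

    # Steps 3 and 4: accumulate. Plain addition replaces the 3-to-2
    # adders and the final carry lookahead adder of the hardware.
    return p_ee + p_eo + p_oe + p_oo
```

Because X = X_e + X_o and Y = Y_e + Y_o under this masking, the sum of the four partial products equals X x Y for every pair of 8-bit operands.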

In the first step of the design flow, topology selection, six full adder cells based on CMOS

static logic styles are redesigned and examined at transistor level in a standard 0.18 µm

CMOS technology. The results of the extensive evaluation, which are presented in
Chapter 3, prove that the 14-transistor pseudo-NMOS full adder cell offers a better speed and

power trade-off with fewer transistors [8, 9, 10].

The validity of the design strategy is proven by testing the complete multiplier and

measuring the speed and power. All the designs are simulated using the Cadence Computer

Aided Design (CAD) tool in 0.18 µm CMOS technology at a 1.8 V supply voltage.

In summary, a speed/power-efficient novel multiplier for medium bit-width applications is

designed in this thesis. Guided by a quantitative analysis of the characteristics of static

CMOS logic adders, several topologies are examined to support the final circuit design.

The major contributions of the thesis are summarized as follows:

• An in-depth comparative analysis of the characteristics of static CMOS adder

cells is conducted, and useful insights are obtained.

• Power reduction through algorithm selection is achieved by:

a) Minimizing the number of operations and, hence, the number of

hardware resources (half adder cells used wherever possible)

b) Reducing the number of complex operations by transforming

mathematical expressions (cascading four 4-bit carry lookahead adders

instead of implementing a 16-bit carry lookahead structure, which

requires complex logic operations)

• Power reduction at the circuit/logic level is achieved by using a static logic style rather than

a dynamic one. This frees the architecture from the clock and

related clocking issues such as clock skew and high dynamic power.

• Flexibility in delay modeling at the system level, so that modifying the

entire multiplier for different speed requirements is straightforward.

• The performance of the proposed multiplier is further enhanced by considering

transistor chaining, grouping, and signal sequencing in the adder layout, which is
proven to provide substantial power saving and speed improvement at no area

penalty.

These original contributions have been published in two conference proceedings [9, 10].

1.4 Thesis Organization

This thesis consists of five chapters and is organized as follows.

Following the introductory Chapter 1, Chapter 2 describes the basic concepts of two’s

complement multiplication. The best-known parallel multiplication algorithms used in

VLSI implementation, along with the pair-wise multiplication algorithm, are introduced

and a brief qualitative comparison of these algorithms is presented.

In Chapter 3, the top-level design of the pair-wise multiplier is presented first. The topology

selection of the main elements, based on an extensive performance analysis of adder

cells, is then reviewed. The circuit design of the cells required for the pair-wise structure is

also discussed.

Chapter 4 is dedicated to the simulation results of the individual circuits and cells as well as

the final simulation results of the proposed multiplier. Layout considerations are also

discussed.

Finally, Chapter 5 presents the features of the designed multiplier. A comparative study

of previous works on multipliers is presented to better evaluate this work. The chapter

ends by drawing conclusions, summarizing the contributions of this thesis, and outlining

directions for future investigations.


Chapter 2

Basic Concepts of Multiplication

Multiplication is one of the main arithmetic operations. Multipliers represent a

fundamental building block that is widely used in many Very Large-Scale

Integrated (VLSI) systems such as application-specific Digital Signal Processing (DSP)

architectures, microprocessors and systems which implement filtering, encryption,

security processing and image processing. In addition to their main task, which is

multiplying two binary numbers, multipliers are the nucleus of many other useful

operations such as division and address calculation. In these systems the multipliers are

part of the critical path that determines the overall performance of the system. That is

why enhancing the performance of the multiplier is a significant goal.

In parallel with high-speed system design [11], low-power systems [1] are in high demand

because of the fast-growing technologies in mobile communication and computation.

Battery technology does not advance at the same rate as microelectronics technology.

There is a limited amount of power available for mobile systems. Thus, designers are

faced with more constraints: high speed, high throughput, small silicon area and, at the

same time, low power consumption. Therefore, a low-power, high-performance multiplier

is of great interest.

Current architectures range from small, low-performance shift-and-add multipliers to large,

high-performance array and tree multipliers.

Conventional linear array multipliers achieve high performance in a regular structure, but

require a large area of silicon. Tree structures achieve even higher performance than linear
arrays, but the tree interconnection is more complex and less regular, making them even

larger than linear arrays. Ideally, one would wish for the speed benefits of a tree structure,

the regularity of an array multiplier, and the small size of a shift-and-add multiplier.

The first sections of this chapter explain the basics of binary multiplication. A review of

the best-known parallel multiplication algorithms is presented in Section 2.3. The pair-wise

multiplication algorithm that has been used in the proposed multiplier is also

described. These algorithms are then briefly compared against each other at the end of

this chapter.

2.1 Multiplication Definition

Multiplication is defined as “a mathematical operation that at its simplest form is an

abbreviated process of adding an integer to itself a specified number of times.” A number

(multiplicand) is added to itself a number of times as specified by another number

(multiplier) to form a result (product). Multiplication starts with placing the

multiplicand on top of the multiplier. The multiplicand is then multiplied by each digit of

the multiplier, beginning with the rightmost, Least Significant Digit (LSD). Intermediate

results (partial products) are placed one atop the other, offset by one digit to align digits

of the same weight. The final product is determined by summation of all the partial

products. This technique applies equally to any base, including binary.
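
As a minimal sketch of the procedure just described, the Python function below forms one partial product per multiplier digit, starting from the least significant digit, offsets it by the digit weight, and sums the results. The function name `long_multiply` and its `base` parameter are illustrative choices, not part of the thesis.

```python
def long_multiply(multiplicand: int, multiplier: int, base: int = 10) -> int:
    """Digit-by-digit multiplication of two non-negative integers."""
    if base < 2:
        raise ValueError("base must be at least 2")
    if multiplicand < 0 or multiplier < 0:
        raise ValueError("operands must be non-negative")

    product = 0
    weight = 1  # base**k for the k-th multiplier digit
    while multiplier > 0:
        # Peel off the rightmost (least significant) digit first.
        multiplier, digit = divmod(multiplier, base)
        # Partial product, offset by k digit positions.
        partial_product = multiplicand * digit * weight
        product += partial_product
        weight *= base
    return product
```

With `base=2` every digit is 0 or 1, so each partial product is either zero or a shifted copy of the multiplicand, which is the binary case discussed next.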

2.2 Binary Multiplication

In the binary number system the digits, called bits, are limited to the set {0, 1}. The result

of multiplying any binary number by a single binary bit is either 0, or the original

number. This makes forming the intermediate partial products simple and efficient.
Summing these partial products is the time-consuming task for binary multipliers. One

logical approach is to form the partial products one at a time and sum them as they are

generated. This technique works but is slow. For applications where this approach

does not provide good enough performance, another approach, known as

parallel multiplication, is used. In this latter approach all bit products are generated in

parallel and a multi-operand adder (i.e., an adder tree) is used for their accumulation.

Multipliers that operate based on these algorithms are called parallel multipliers. Parallel

multipliers are becoming key components in Reduced Instruction Set Computers

(RISCs), DSPs and graphics accelerators due to their inherently higher speed of operation.

This makes parallel multiplication the main focus of our discussion.
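
The difference between the two approaches can be illustrated with a small Python sketch. It is a software analogy only: the adder tree is assumed here to be a plain binary tree of two-operand additions, whereas hardware adder trees typically use carry-save stages. All function names are hypothetical.

```python
def partial_products(x: int, y: int, n: int = 8) -> list:
    """All n partial products of x * y: 0 or x shifted by the bit position."""
    return [(x << i) if (y >> i) & 1 else 0 for i in range(n)]


def sequential_sum(pps):
    """Accumulate the partial products one at a time; returns (sum, steps)."""
    total, steps = 0, 0
    for pp in pps:
        total += pp
        steps += 1
    return total, steps


def tree_sum(pps):
    """Add the partial products pairwise, level by level; returns (sum, depth)."""
    level, depth = list(pps), 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd operand passes through to the next level
            paired.append(level[-1])
        level, depth = paired, depth + 1
    return level[0], depth
```

For eight partial products the sequential version performs eight dependent accumulation steps, while the tree needs only three levels of additions, which is the source of the speed advantage of parallel multipliers.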

2.3 Review of Parallel Multiplication Algorithms

Since multiplication is one of the most critical operations in many computational

systems, many algorithms have been proposed to perform multiplication, each offering

different advantages and having tradeoffs in terms of speed, circuit complexity, area and

power consumption. Among the reported multipliers, parallel multipliers have been of

great theoretical and practical interest to VLSI designers, not only for their speed of

operation but also for their ease of implementation.

The structure of all parallel multipliers can be partitioned into three parts performing

three major tasks:

a) Partial product generation.

b) Carry-free addition.

c) Carry-propagation addition.

These three parts can be implemented using different schemes, such as simple AND gates

or the Booth algorithm to generate partial products. The carry-free addition task is often

implemented by using a Wallace tree or a redundant binary addition tree.
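
The idea behind carry-free (carry-save) addition can be sketched as follows. This is a generic 3-to-2 reduction written in Python for illustration, not the specific adder arrangement of the thesis; `carry_save_add` and `reduce_operands` are hypothetical names.

```python
def carry_save_add(a: int, b: int, c: int):
    """3-to-2 reduction: three operands in, a sum word and a carry word out.

    Each bit position is handled independently (no carry ripples), and
    a + b + c == sum_word + carry_word always holds.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def reduce_operands(operands):
    """Apply 3-to-2 reductions until at most two operands remain."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save_add(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]
    return ops
```

Only the last two operands need a carry-propagating adder, which is why the carry-propagation addition appears as a separate, final task in the partition above.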

In the following sections, four well-known parallel algorithms, as well as the pair-wise

algorithm, all of which have been used in VLSI implementations of digital multipliers, are

briefly presented. Readers can consult references [11, 12] for more details on parallel

multiplication algorithms.

2.3.1 Braun Algorithm

Consider two unsigned n-bit numbers $X = X_{n-1} \ldots X_1 X_0$ and $Y = Y_{n-1} \ldots Y_1 Y_0$, where

$$X = \sum_{i=0}^{n-1} X_i 2^i, \tag{2.1}$$

$$Y = \sum_{i=0}^{n-1} Y_i 2^i. \tag{2.2}$$

The product $P = P_{2n-1} \ldots P_1 P_0$, which results from multiplying the multiplicand X by

the multiplier Y, can be written in the following form

$$P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} (X_i Y_j) 2^{i+j}. \tag{2.3}$$

Each of the partial product terms $P_k = X_i Y_j$ is called a summand. Fig. 2.1 shows an

example of an 8 x 8-bit multiplication.

The summands are generated in parallel with AND gates. Fig. 2.2 shows Braun’s

array multiplier [4]. Such an n x n multiplier requires $n(n-1)$ adders and $n^2$ AND gates.

The delay of such a multiplier is determined by the delay of the full adder cell and the

final adder in the last row. In the multiplier array, a full adder with balanced carry and
sum delays is desirable because the sum and carry signals are both on the critical path.

For large arrays, the speed and power of the full adder are both very important.
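
The summand formulation can be checked with a short Python sketch. It models only the arithmetic (one AND per bit pair, each weighted by 2^(i+j)), not the adder array or its timing; `braun_multiply` and `braun_cell_count` are hypothetical helper names.

```python
def braun_multiply(x: int, y: int, n: int = 8) -> int:
    """Unsigned n x n multiplication as a weighted sum of AND summands."""
    x_bits = [(x >> i) & 1 for i in range(n)]
    y_bits = [(y >> j) & 1 for j in range(n)]
    # One AND gate per (i, j) pair; the summand carries weight 2**(i + j).
    return sum((x_bits[i] & y_bits[j]) << (i + j)
               for i in range(n) for j in range(n))


def braun_cell_count(n: int) -> dict:
    """Cell counts quoted in the text for an n x n Braun array."""
    return {"and_gates": n * n, "adders": n * (n - 1)}
```

For the 8 x 8-bit case this gives 64 AND gates and 56 adders.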

Fig. 2.1 Partial products of an 8 x 8-bit unsigned integer multiplication

Fig. 2.2 Braun’s array multiplier

2.3.2 Baugh-Wooley Algorithm

Baugh-Wooley is one of the developed algorithms for parallel multiplication, which has

been used in VLSI architectures [12]. Multipliers based on this algorithm are used for

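direct multiplication of two’s complement numbers. This direct approach does not need

any two’s complementing operations prior to multiplication. Using the Baugh-Wooley

algorithm, the product of two numbers X and Y expressed in two’s complement,

$$X = -X_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} X_i 2^i, \tag{2.4}$$

$$Y = -Y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} Y_i 2^i, \tag{2.5}$$

is given by

$$P = X_{n-1} Y_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} X_i Y_j 2^{i+j} - X_{n-1} \sum_{i=0}^{n-2} Y_i 2^{n+i-1} - Y_{n-1} \sum_{i=0}^{n-2} X_i 2^{n+i-1}. \tag{2.6}$$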

In order to avoid the use of subtractor cells and use only adders, the negative terms

should be transformed. So

$$-X_{n-1} \sum_{i=0}^{n-2} Y_i 2^{n+i-1} = X_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \overline{Y_i}\, 2^{n+i-1} \right). \tag{2.7}$$

Using this property in Equation (2.6), the product P becomes

$$P = -2^{2n-1} + \left( \overline{X_{n-1}} + \overline{Y_{n-1}} + X_{n-1} Y_{n-1} \right) 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} X_i Y_j 2^{i+j} + (X_{n-1} + Y_{n-1}) 2^{n-1} + X_{n-1} \sum_{i=0}^{n-2} \overline{Y_i}\, 2^{n+i-1} + Y_{n-1} \sum_{i=0}^{n-2} \overline{X_i}\, 2^{n+i-1}. \tag{2.8}$$

From Equation 2.8, it can be seen that the multiplication of two numbers, expressed in

two’s complement representation, can be written in a form which involves only positive

bit products. The product is, then, obtained by adding a constant to the final result. All the

partial product terms to generate the above product are explicitly shown in Fig. 2.3. A

simple reorganization of Fig. 2.3 results in the array of partial products shown in Fig. 2.4,

which is a modified version of the original Baugh-Wooley algorithm.

It can be seen that half adders, full adders, NAND gates and AND gates are the elements

required by the Baugh-Wooley algorithm to perform two’s complement multiplication.
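
As a sanity check on the transformation, the Python sketch below evaluates the six terms of Equation 2.8, as read from the surrounding derivation (this reading is an assumption, since the printed equation is hard to recover), and compares the result with ordinary signed multiplication. Complemented bits are written as `1 - bit`; the function names are hypothetical.

```python
def to_signed(value: int, n: int) -> int:
    """Interpret an n-bit pattern as a two's complement integer."""
    return value - (1 << n) if value & (1 << (n - 1)) else value


def baugh_wooley_product(x: int, y: int, n: int = 8) -> int:
    """Signed product of two n-bit two's complement patterns.

    Uses only positive bit products plus a constant term.
    """
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    xs, ys = xb[n - 1], yb[n - 1]  # sign bits

    total = -(1 << (2 * n - 1))                                 # constant
    total += ((1 - xs) + (1 - ys) + xs * ys) << (2 * n - 2)     # sign terms
    total += sum((xb[i] & yb[j]) << (i + j)
                 for i in range(n - 1) for j in range(n - 1))   # magnitude bits
    total += (xs + ys) << (n - 1)
    total += xs * sum((1 - yb[i]) << (n + i - 1) for i in range(n - 1))
    total += ys * sum((1 - xb[i]) << (n + i - 1) for i in range(n - 1))
    return total
```

Every variable term is an AND of bits or of complemented bits (hence the AND and NAND gates), and the only negative contribution is the constant.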

Fig. 2.3 Illustration of the partial product terms in Baugh-Wooley algorithm

Fig. 2.4 Reorganization of the partial product terms of Fig. 2.3

This algorithm is suitable for applications where operands with fewer than 16 bits are to be

processed. Digital filters in which small operands are used (e.g., 6, 8 and 12 bits) are examples

of such applications.

Fig. 2.5 shows the array architecture of an 8 x 8-bit Baugh-Wooley two’s complement multiplier.

For operands equal to or greater than 16-bits, the Baugh-Wooley scheme becomes area

consuming and slow. Hence, techniques to reduce the size of the array, while maintaining

the regularity, are required.

Fig. 2.5 8 x 8-bit Baugh-Wooley two’s complement regular array

2.3.3 The Modified Booth Algorithm

For operands equal to or greater than 16 bits, the modified Booth algorithm [13] has been

extensively used. It is based on encoding the two’s complement operand (i.e., the multiplier)

in order to reduce the number of partial products to be added.

This makes the multiplier faster and reduces the hardware (area) it uses. For example, the modified

radix-4 algorithm is based on partitioning the multiplier into overlapping groups of 3

bits, and each group is decoded to generate the correct partial product.

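The multiplier, Y, in two’s complement can be written as:

$$Y = -Y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} Y_i 2^i. \tag{2.9}$$

This can be rewritten as:

$$Y = \sum_{i=0}^{n/2-1} \left( Y_{2i-1} + Y_{2i} - 2 Y_{2i+1} \right) 2^{2i}, \quad \text{with } Y_{-1} = 0. \tag{2.10}$$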

In Equation 2.10, the terms in brackets assume values from the set (-2, -1, 0, +1, +2}.

The encoding of Y using the modified Booth algorithm generates another number with the following five signed digits: -2, -1, 0, +1, +2. As illustrated in Table 2.1, each encoded digit in the multiplier performs a certain operation on the multiplicand X.

The bits of the multiplier (Y) are partitioned into overlapping groups of 3 bits, and each group selects one partial product. The five possible multiples of the multiplicand are generated according to the procedure given in Table 2.2.

The general partial product is related to the multiplicand for each encoded digit by the relationships presented in Table 2.3. $PP_i$ is bit i of the partial product and $PP_n$ is its sign bit, with $PP_n = PP_{n-1}$ when no shifting of the partial product is performed. Note that the partial product is represented on (n + 1) bits.
Table 2.1. Partial products selection

$Y_{2i+1}$  $Y_{2i}$  $Y_{2i-1}$  Recoded Digit  Operation on X
0           0         0            0              0 x X
0           0         1           +1             +1 x X
0           1         0           +1             +1 x X
0           1         1           +2             +2 x X
1           0         0           -2             -2 x X
1           0         1           -1             -1 x X
1           1         0           -1             -1 x X
1           1         1            0              0 x X

Table 2.2. Partial product generation process

Recoded Digit   Operation on X
 0              Add 0 to the partial product
+1              Add X to the partial product
+2              Shift X left one position and add it to the partial product
-1              Add the two’s complement of X to the partial product
-2              Take the two’s complement of X and shift it left one position

Table 2.3. Partial product generation relation

Recoded Digit   Operation on X                                   Added to LSB
 0              $PP_i = 0$ for i = 0, ..., n                     0
+1              $PP_i = X_i$ for i = 0, ..., n                   0
+2              $PP_i = X_{i-1}$ for i = 0, ..., n               0
-1              $PP_i = \overline{X_i}$ for i = 0, ..., n        1
-2              $PP_i = \overline{X_{i-1}}$ for i = 0, ..., n    1

Bits are grouped into 3-bit groups that overlap by one bit. A bit with a value of zero is appended to the right of Y as $Y_{-1}$. The multiplication of two 8-bit numbers therefore generates only 4 partial products, so the number of partial products is halved.
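The recoding of Table 2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the names `booth_radix4_digits` and `booth_multiply` are hypothetical, operands are assumed to be n-bit two's complement integers with n even, and the partial products are summed with ordinary integer arithmetic instead of an adder array.

```python
def booth_radix4_digits(y: int, n: int = 8) -> list:
    """Recode an n-bit two's complement multiplier into n/2 signed digits.

    Digit i is Y_{2i-1} + Y_{2i} - 2*Y_{2i+1} (Table 2.1), with Y_{-1} = 0.
    The list is returned least significant digit first.
    """
    if n % 2:
        raise ValueError("n must be even")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n bits")
    digits = []
    prev = 0  # Y_{-1}, the zero appended to the right of Y
    for i in range(n // 2):
        low = (y >> (2 * i)) & 1        # Y_{2i}
        high = (y >> (2 * i + 1)) & 1   # Y_{2i+1}
        digits.append(prev + low - 2 * high)
        prev = high                     # becomes Y_{2i-1} of the next group
    return digits


def booth_multiply(x: int, y: int, n: int = 8) -> int:
    """Sum the n/2 partial products d_i * X * 4**i selected by the digits."""
    return sum(d * x * 4 ** i for i, d in enumerate(booth_radix4_digits(y, n)))


if __name__ == "__main__":
    print(booth_radix4_digits(6))
    print(booth_multiply(-7, 6))
```

For an 8-bit multiplier the function always returns four digits, which corresponds to the four partial products mentioned above.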

In order to make the array rectangular, and thus more regular for VLSI implementation, the problem of sign extension must be addressed. This problem is more critical when the operands are wide, because each partial product must be sign-extended to the length of the product. The basic idea is to use two extra bits in the partial product. For the first partial product, the two additional bits, $PP_{n+1}$ and $PP_{n+2}$, are equal to the sign bit of the partial product:

$PP_{n+2} = PP_{n+1} = PP_n$.    (2.11)

For the second partial product, if the first partial product was positive, then the two additional bits of this second partial product are given by Equation 2.11; otherwise there are two different cases:

$PP_{n+2} = PP_{n+1} = 1$ if $PP_n = 0$,    (2.12)

and

$PP_{n+2} = 1$, $PP_{n+1} = 0$ if $PP_n = 1$.    (2.13)

It is therefore convenient to use a third bit, F, as a flag that indicates whether a negative sign bit from the previous partial product must be propagated. $F_1$ is the flag generated by the first partial product and passed to the next one. This flag is expressed by the following Boolean equation:

$F_{j+1} = F_j + PP_{n,j}$,    (2.14)

where $PP_{n,j}$ is the sign bit of the j-th partial product.

Fig. 2.6 shows the block diagram of an n x n modified Booth multiplier. The figure also gives an idea of the floorplan of this subsystem.

The diagram is composed of the following blocks:

a) The multiplier array, containing the partial product generation and 1-bit adders.

b) The Booth encoder and the sign extension bits ($PP_{n+2}$, $PP_{n+1}$, F).

c) The Booth encoder generates the five signals (0, +1x, +2x, -1x and -2x) for each group of 3 bits of Y.

d) The final-stage adder, which performs a 2n-bit addition.

[Blocks shown: Booth decoder; partial product generator and adder array with sign-bit extension; n-bit adders with carry; product outputs.]
Fig. 2.6 Block diagram of the n x n multiplier using modified Booth algorithm

The Booth multiplier exhibits some glitches. The main reason for glitches is the race

condition between the multiplicand and the multiplier due to the Booth encoder.

2.3.4 Wallace Tree Algorithm

As seen in the previous section, applying the Booth algorithm reduces the number of partial products by half. However, for large multipliers (32 bits and over), the number of partial products is still 16 or more. In such cases, better performance is achieved by adopting a Wallace tree built from 4-2 compressors [12]. A 4-2 compressor accepts four numbers and a carry-in, and sums them to produce two numbers and a carry-out.
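The behaviour of a single bit slice of a 4-2 compressor can be illustrated as follows. This is a sketch under an assumption the text does not state: the slice is built from two cascaded full adders, which is one common realization. The function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (c & (a ^ b))
    return s, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """One bit slice of a 4-2 compressor (assumed: two cascaded full adders).

    Returns (s, carry, cout). s has the weight of the inputs; carry and cout
    both have twice that weight. cout does not depend on cin, so carries do
    not ripple along a row of compressors.
    """
    s1, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout


if __name__ == "__main__":
    for bits in product((0, 1), repeat=5):
        print(bits, compressor_4_2(*bits))
```

The invariant that makes the cell a compressor is x1 + x2 + x3 + x4 + cin = s + 2 (carry + cout).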

Fig. 2.7 shows an example of such a tree operating on the partial products of an unsigned 8 x 8-bit multiplier. Eight partial products are produced. Using 4-2 compressors, two levels of addition (stages) are needed. The final two summands are added using a fast 16-bit adder. Some zeros are added to the array. This example shows that the bits that are not used in the 1st stage (level) jump to the next stage, where they are combined with the ones produced by the compressors.

Fig. 2.7 Construction of Wallace’s tree for an 8 x 8-bit multiplier: reduction of the 8 partial products with 4-2 compressors

Fig. 2.8 shows the architecture of the 8 x 8-bit multiplier. As one can see, the first stage of the tree requires two blocks, A and B.

[Blocks shown: 1st stage, block A (4 x 8 partial product generators, eight 4-2 compressors); 1st stage, block B (4 x 8 partial product generators, eight 4-2 compressors); 2nd stage, block C (eleven 4-2 compressors); 16-bit adder.]
Fig. 2.8 Architecture of Wallace’s tree for an 8 x 8-bit multiplier


Block A (block B) of the compressors groups the first (last) four partial products.

To further enhance the performance of the Wallace tree multiplier, the modified Booth algorithm can be used to halve the number of partial products in a carry-save adder array. This architecture exhibits some irregularities in its layout, since it has a complicated interconnection scheme. Hence, the interconnection wires affect the speed and power consumption of the adder.

2.3.5 The Proposed Pair-Wise Algorithm

This algorithm is based on generating n-bit numbers from the even and odd bit positions of the two n-bit operands [14]. A parallel addition algorithm is then used to add up the partial products.

Assume that the two operands X and Y are 8-bit numbers as follows:

$X = \langle X_8, X_7, X_6, X_5, X_4, X_3, X_2, X_1 \rangle$,    (2.15)

$Y = \langle Y_8, Y_7, Y_6, Y_5, Y_4, Y_3, Y_2, Y_1 \rangle$.    (2.16)

Each can be represented by the sum of two numbers, namely $X = X_e + X_o$ and $Y = Y_e + Y_o$, which are defined as follows:

$X_e = \langle X_8, 0, X_6, 0, X_4, 0, X_2, 0 \rangle$,    (2.17a)

$X_o = \langle 0, X_7, 0, X_5, 0, X_3, 0, X_1 \rangle$,    (2.17b)

$Y_e = \langle Y_8, 0, Y_6, 0, Y_4, 0, Y_2, 0 \rangle$,    (2.18a)

$Y_o = \langle 0, Y_7, 0, Y_5, 0, Y_3, 0, Y_1 \rangle$.    (2.18b)

Consequently, the product X x Y can be written as

$X \times Y = (X_e + X_o)(Y_e + Y_o) = X_e Y_e + X_e Y_o + X_o Y_e + X_o Y_o = P_{ee} + P_{eo} + P_{oe} + P_{oo}$.    (2.19)
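The decomposition of Equations 2.15 to 2.19 can be checked numerically. The sketch below assumes unsigned 8-bit operands with $X_1$ as the least significant bit, so the even-indexed bits sit at the odd bit offsets of the integer; the mask constants and function names are illustrative choices, not taken from the thesis.

```python
EVEN_MASK = 0b10101010  # keeps X_8, X_6, X_4, X_2 (Equations 2.17a and 2.18a)
ODD_MASK = 0b01010101   # keeps X_7, X_5, X_3, X_1 (Equations 2.17b and 2.18b)


def split_even_odd(v: int):
    """Split an 8-bit value into its even-position and odd-position parts."""
    return v & EVEN_MASK, v & ODD_MASK


def pairwise_terms(x: int, y: int):
    """Return (P_ee, P_eo, P_oe, P_oo) of Equation 2.19."""
    xe, xo = split_even_odd(x)
    ye, yo = split_even_odd(y)
    return xe * ye, xe * yo, xo * ye, xo * yo


if __name__ == "__main__":
    x, y = 0b11010110, 0b01101011
    terms = pairwise_terms(x, y)
    print(terms, sum(terms), x * y)
```

The four terms always add up to the full product, which is the identity that Equation 2.19 states.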

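Expanding these terms shows the advantage of writing the multiplication in the form of Equation 2.19.

$P_{ee} = (X_8Y_8) \times 2^{14} + 0 \times 2^{13} + (X_8Y_6 + X_6Y_8) \times 2^{12} + 0 \times 2^{11} + (X_8Y_4 + X_6Y_6 + X_4Y_8) \times 2^{10} + 0 \times 2^{9} + (X_8Y_2 + X_6Y_4 + X_4Y_6 + X_2Y_8) \times 2^{8} + 0 \times 2^{7} + (X_6Y_2 + X_4Y_4 + X_2Y_6) \times 2^{6} + 0 \times 2^{5} + (X_4Y_2 + X_2Y_4) \times 2^{4} + 0 \times 2^{3} + (X_2Y_2) \times 2^{2} + 0 \times 2^{1} + 0 \times 2^{0}$    (2.20)

$P_{oo} = 0 \times 2^{14} + 0 \times 2^{13} + (X_7Y_7) \times 2^{12} + 0 \times 2^{11} + (X_5Y_7 + X_7Y_5) \times 2^{10} + 0 \times 2^{9} + (X_3Y_7 + X_5Y_5 + X_7Y_3) \times 2^{8} + 0 \times 2^{7} + (X_1Y_7 + X_3Y_5 + X_5Y_3 + X_7Y_1) \times 2^{6} + 0 \times 2^{5} + (X_1Y_5 + X_3Y_3 + X_5Y_1) \times 2^{4} + 0 \times 2^{3} + (X_1Y_3 + X_3Y_1) \times 2^{2} + 0 \times 2^{1} + (X_1Y_1) \times 2^{0}$    (2.21)

$P_{eo} = 0 \times 2^{14} + (X_8Y_7) \times 2^{13} + 0 \times 2^{12} + (X_8Y_5 + X_6Y_7) \times 2^{11} + 0 \times 2^{10} + (X_8Y_3 + X_6Y_5 + X_4Y_7) \times 2^{9} + 0 \times 2^{8} + (X_8Y_1 + X_6Y_3 + X_4Y_5 + X_2Y_7) \times 2^{7} + 0 \times 2^{6} + (X_6Y_1 + X_4Y_3 + X_2Y_5) \times 2^{5} + 0 \times 2^{4} + (X_4Y_1 + X_2Y_3) \times 2^{3} + 0 \times 2^{2} + (X_2Y_1) \times 2^{1} + 0 \times 2^{0}$    (2.22)

$P_{oe} = 0 \times 2^{14} + (X_7Y_8) \times 2^{13} + 0 \times 2^{12} + (X_5Y_8 + X_7Y_6) \times 2^{11} + 0 \times 2^{10} + (X_3Y_8 + X_5Y_6 + X_7Y_4) \times 2^{9} + 0 \times 2^{8} + (X_1Y_8 + X_3Y_6 + X_5Y_4 + X_7Y_2) \times 2^{7} + 0 \times 2^{6} + (X_1Y_6 + X_3Y_4 + X_5Y_2) \times 2^{5} + 0 \times 2^{4} + (X_1Y_4 + X_3Y_2) \times 2^{3} + 0 \times 2^{2} + (X_1Y_2) \times 2^{1} + 0 \times 2^{0}$    (2.23)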

Note that the zero positions in the bit pattern alternate with the non-zero summations. Each zero position can be used to hold the carry from the summation in the adjacent non-zero position: a full adder calculates each of the sums, and the carry-out of the full adder generates the bit in the zero position. A full adder accepts only three bits, so the four-term sums (at $2^8$ in $P_{ee}$, at $2^7$ in $P_{eo}$ and $P_{oe}$, and at $2^6$ in $P_{oo}$) each leave one spare bit that is considered separately. The spare bits are collected together to form one or two distinct numbers; for example, the spare terms at $2^7$ form $(X_2Y_7 + X_7Y_2) \times 2^7$. These numbers are treated separately. Carry propagation is avoided in the body of the multiplication and postponed to the last stage. The algorithm uses full adders to convert three k-bit numbers into two (k + 1)-bit numbers. Using this technique, the partial product numbers are summed repeatedly through adder planes until two distinct numbers remain. At the last stage, these final two numbers are added by a fast adder to speed up the multiplication. This approach is discussed in more detail in Section 3.1.2.
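The idea of postponing carry propagation can be sketched with a generic carry-save reduction. This is not the thesis's adder-plane arrangement: it is a plain 3-to-2 reduction applied to the eight AND-generated rows of an unsigned 8 x 8-bit product, with hypothetical function names, shown only to illustrate how three numbers become two without any carry rippling until the final addition.

```python
def carry_save_add(a: int, b: int, c: int):
    """Reduce three numbers to two with one full adder per bit position.

    The sum word is the bitwise XOR; the carry word is the bitwise majority
    shifted left by one. No carry propagates between bit positions.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def reduce_to_two(operands):
    """Apply 3-to-2 reductions until only two numbers remain."""
    ops = list(operands)
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        ops.extend(carry_save_add(a, b, c))
    return ops


def multiply_carry_save(x: int, y: int, n: int = 8) -> int:
    """Unsigned n x n multiply: AND rows, carry-save reduction, one final add."""
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    first, second = reduce_to_two(rows)
    return first + second  # the only carry-propagating addition


if __name__ == "__main__":
    print(multiply_carry_save(214, 107), 214 * 107)
```

In hardware the final addition would be performed by a fast adder such as a carry lookahead adder; here Python's integer addition stands in for it.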

[Blocks shown, top to bottom: AND generator; full adder planes (1st to 4th level); carry lookahead adder; product.]
Fig. 2.9 Block diagram for the 8 x 8-bit pair-wise multiplier
2.4 Qualitative Comparisons of Parallel Algorithms

In order to choose the appropriate algorithm for a given application, one must have a clear view of the advantages and drawbacks of the algorithms that have been introduced. A brief comparison of the parallel algorithms follows.

The basic array multipliers, such as the Baugh-Wooley scheme, consume low power and exhibit relatively good performance. However, they are limited to applications that process operands of fewer than 16 bits. For operands of 16 bits and over, the modified Booth algorithm halves the number of partial products and hence increases the speed of the multiplier. In this case, the power consumption is comparable to that of the Baugh-Wooley multiplier because of the circuitry overhead of the Booth algorithm. However, circuit techniques can give this multiplier low-power characteristics. The fastest multipliers adopt the Wallace tree with modified Booth encoding. Because of its interconnect wiring, a Wallace tree generally leads to larger power consumption and area. Hence, it is not recommended for low-power applications. Finally, the pair-wise multiplier achieves faster operation by preventing carry propagation in the intermediate stages of the multiplication. This algorithm postpones the carry propagation to the last stage, where 2(n-1)-bit numbers are added. Using fast addition circuitry such as a carry lookahead adder (CLA) at the last stage of the pair-wise multiplier accelerates the multiplication. Besides the high speed and architectural simplicity of this algorithm, employing low-power techniques [1] in the circuit-level design makes the pair-wise algorithm a viable candidate for a high-performance multiplier. The Baugh-Wooley algorithm has been shown to be suitable for medium-size (6- or 8-bit) words [10]. It can be concluded that the Baugh-Wooley multiplier is a suitable test vehicle for the quantitative evaluation of the pair-wise multiplier. However, the entire Baugh-Wooley architecture should be redesigned in order to perform a fair comparison.
Chapter 3

Multiplier Design

In this Chapter, the design of the novel 8 x 8-bit multiplier is described at the circuit level. The building blocks are identified, and the design of the cells based on these building blocks is then discussed. The Chapter begins with a brief description of some of the terms used hereafter to assess the circuits’ performance.

Propagation delay of digital cells: the time from the moment that the first signal (50% transition point on the input waveform) reaches the inputs of the cell to the moment that the last output signal (50% transition point on the output waveform) reaches the output nodes [21].

Power consumption of digital cells: the power consumption of each cell is measured individually while the circuits are tested; that is, the power consumed by the other cells in the test circuit is not included in the final measured value. This is done by inserting a power meter, in the form of an Analog Hardware Description Language (AHDL) block in the Cadence CAD tool, in the path of the main supply to measure the power dissipation. This approach is used as the standard power measurement method throughout this work.

3.1 Pair-Wise Multiplier

Based on the pair-wise algorithm described in Chapter 2, the top-level design of the proposed multiplier is built as shown in Fig. 2.9. The following decisions were made in order to implement the pair-wise algorithm. First, owing to the inherent speed of the pair-wise algorithm, a multiplication frequency of over 1 GHz is targeted in this design. Second, the power consumption of each element has been taken into account in topology selection. These points are discussed further in this Chapter, where the circuit-level design of the proposed multiplier is reviewed. In addition, several low-power techniques are applied during layout in order to achieve a power-efficient design. These techniques are discussed in Section 5 of Chapter 4, where the layout considerations are reviewed.

3.1.1 Circuit-Level Review

In this Section, the elements required in the pair-wise multiplier are first introduced, and then topology selection for the key elements is briefly presented. The architecture of the pair-wise multiplier (Fig. 2.9) shows that the full adder, half adder, carry lookahead adder, and AND, NAND, OR and XOR gates are the building blocks of the multiplier.

3.1.1.1 Full Adder

The full adder (FA) is the most critical circuit for two reasons. First, full adders account for a large percentage of the core propagation delay. Second, full adders consume a large percentage of the power in the whole multiplier architecture. In order to select the FA best suited to high-performance applications, a study was carried out on existing FA circuits [2]. The result of this extensive study guided the selection of the most speed/power-efficient circuitry for the pair-wise multiplier. A summary of the topology selection is provided next. First, note that the Boolean expressions for a half adder (HA) are:

$S = A \oplus B$,    (3.1)

$C_{out} = A \cdot B$,    (3.2)

and for a full adder (FA) are:

$S = A \oplus B \oplus C_{in}$,    (3.3)

$C_{out} = A \cdot B + C_{in} \cdot (A \oplus B)$.    (3.4)

Table 3.1 (a) Truth table of a full adder (b) Truth table of a half adder

(a)  A  B  C_in  Sum  C_out        (b)  A  B  Sum  C_out
     1  1  1     1    1                 1  1  0    1
     1  1  0     0    1                 1  0  1    0
     1  0  1     0    1                 0  1  1    0
     1  0  0     1    0                 0  0  0    0
     0  1  1     0    1
     0  1  0     1    0
     0  0  1     1    0
     0  0  0     0    0
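The truth tables can be regenerated directly from Equations 3.1 to 3.3 and the usual carry expression. The short sketch below is illustrative only; the function names are hypothetical, and the carry is written as A·B + C_in·(A ⊕ B), which is logically equivalent to the majority form.

```python
from itertools import product


def half_adder(a: int, b: int):
    """Equations 3.1 and 3.2: returns (Sum, C_out)."""
    return a ^ b, a & b


def full_adder(a: int, b: int, cin: int):
    """Equation 3.3 for Sum; C_out = A.B + C_in.(A xor B)."""
    p = a ^ b
    return p ^ cin, (a & b) | (cin & p)


if __name__ == "__main__":
    print("A B Cin Sum Cout")
    for a, b, cin in product((1, 0), repeat=3):
        s, cout = full_adder(a, b, cin)
        print(a, b, cin, "  ", s, "  ", cout)
```

A quick consistency check is that A + B + C_in always equals Sum + 2·C_out.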

The above Boolean expressions can be realized by different circuits, each with its own advantages and disadvantages. A brief review follows of a study of six of the best-known CMOS full adder structures. These adders cover a wide range of static logic styles, which are viable candidates for low-power circuit design.

They include:

1. Complementary CMOS full adder cell

2. Complementary pass-transistor full adder cell

3. Double pass-transistor full adder cell

4. Transmission gate CMOS full adder cell

5. Pseudo-NMOS full adder cell

6. XOR and transmission gate full adder cell

The HA circuits are then generated from the optimized FAs by eliminating the circuitry that implements the function of the input carry.
Transistor Sizing: The transistors in the full adder cells were sized in an iterative process consisting of the following steps.

1) Set all the transistors (NMOS and PMOS) to the minimum length ($L_{min}$) and the minimum width ($W_{min}$), where $L_{min}$ = 180 nm and $W_{min}$ = 660 nm in the 0.18 µm CMOS process.

2) Simulate the circuit with all possible input pattern transitions (16 transitions).

3) Consider the transitions with the highest delay and mark the transistors involved in those transitions.

4) Size one of the transistors in this critical path.

5) Repeat Steps 2, 3 and 4 until the power-delay product of the cell starts to increase.

6) Record the transistor sizes corresponding to the minimum power-delay product.

This method ensures that only the right transistors (those in the critical path) are sized. No over-sizing or under-sizing is incurred, which makes the method well suited to optimizing the power-delay product. Although the process is lengthy, it gives excellent transistor sizing results, especially for small circuits. Following the same method with larger circuits takes much longer.
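The sizing procedure above can be summarized as a greedy loop. The sketch below only illustrates the control flow: `size_cell` is a hypothetical name, the step size is an arbitrary assumption, and `toy_simulate` is a made-up stand-in for the circuit simulator that the thesis actually used, with no physical meaning.

```python
def size_cell(widths, simulate, step=0.1, max_steps=100):
    """Greedy sizing loop following Steps 1 to 6.

    widths   -- dict of transistor name -> starting (minimum) width
    simulate -- hypothetical callable: widths -> (power, delay, critical),
                where critical lists the transistors on the slowest transition
    """
    best = dict(widths)
    power, delay, critical = simulate(best)             # Steps 2 and 3
    best_pdp = power * delay
    for _ in range(max_steps):
        trial = dict(best)
        trial[critical[0]] += step                       # Step 4: size one critical transistor
        power, delay, trial_critical = simulate(trial)   # Step 5: simulate again
        if power * delay >= best_pdp:                    # power-delay product started to rise
            break
        best, best_pdp, critical = trial, power * delay, trial_critical
    return best, best_pdp                                # Step 6: sizes at the minimum found


def toy_simulate(widths):
    """Made-up model (assumption): power grows and delay shrinks with width."""
    power = 1.0 + 0.2 * sum(widths.values())
    delay = sum(1.0 / w for w in widths.values())
    critical = sorted(widths, key=widths.get)            # narrowest transistor first
    return power, delay, critical


if __name__ == "__main__":
    start = {"M1": 0.66, "M2": 0.66}
    print(size_cell(start, toy_simulate))
```

The loop stops at the first step that fails to lower the power-delay product, so it finds a local minimum of whatever model `simulate` represents.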

It should be mentioned that the above transistor sizing method is time-consuming for structures such as the double pass-transistor adder. This structure is already of little interest because of its high transistor count. Therefore, little effort was spent on optimizing the transistor sizes for this adder.
Complementary CMOS full adder

The complementary CMOS full adder (CMOS) [15] has 28 transistors, and its operation is based on the regular CMOS structure of pull-up and pull-down networks (Fig. 3.1). One advantage of the complementary CMOS full adder cell is its high noise margins and thus reliable operation at low voltages and arbitrary transistor sizes (ratio-less logic). The layout of CMOS gates is straightforward because of the complementary transistor pairs. An often-mentioned disadvantage of the complementary CMOS full adder cell is the substantial number of large PMOS transistors, which results in high input loads, more power consumption and larger silicon area. This adder uses the $C_{out}$ signal to generate Sum, which produces an unwanted additional delay. Another drawback of CMOS is the relatively weak output driving capability caused by the series transistors of the output stage.

Fig. 3.1 Schematic of complementary CMOS full adder

Table 3.2 Transistor dimensions in complementary CMOS full adder

Transistors                                                         W (µm)  L (µm)
M1, M2, M3, M4, M5, M11, M13, M14, M15, M16, M21, M22, M23, M28     2.14    0.18
M6, M7, M8, M9, M10, M12, M17, M18, M19, M20, M24, M25, M26, M27    1.44    0.18
M24, M25                                                            1.8     0.18
Complementary pass-transistor full adder

The complementary pass-transistor full adder cell has 32 transistors (Fig. 3.2). Using pass-transistor logic with CMOS inverters, this circuit features complementary inputs and outputs. The adder generates many intermediate nodes and their complements in order to produce the final signals (Sum and $C_{out}$). Having a signal and its complement together produces a high rate of switching activity. Therefore, the complementary pass-transistor full adder cell is not a suitable option for low-power applications. In order to lower the power consumption of the complementary pass-transistor adder, two circuit styles are used. These circuits have their output levels restored with cross-coupled inverters [16] and latches [17].

Because of irregular transistor arrangements and high wiring requirements, the layout of this full adder cell family is also neither straightforward nor efficient.

Fig. 3.2 Schematic of complementary pass-transistor full adder

Table 3.3 Transistor dimensions in complementary pass-transistor full adder

Transistors                                                                             W      L
M1, M2, M3, M4, M5, M6, M7, M8, M9, M10, M11, M12, M13, M14, M15, M16, M19, M20, M21, M22  7.2    1.8
M17, M18, M23, M24                                                                             1.8
M25, M26, M27, M30, M32                                                                 14.4   1.8
M28, M29, M31                                                                           18     1.8

Double pass-transistor full adder

The double pass-transistor full adder cell has 48 transistors, and its operation is based on double pass-transistor logic, in which both NMOS and PMOS logic networks are used (Fig. 3.3 (a) and (b)) [18]. The structure of this cell is similar to its complementary pass-transistor counterpart, but it uses complementary transistors to maintain full-swing operation and reduce the power consumption.

This eliminates the need for restoration circuitry. One disadvantage of this cell is the large area it occupies because of the PMOS transistors.

Fig. 3.3(a) Schematic of double pass-transistor full adder (Sum)

Table 3.4 Transistor dimensions in double pass-transistor full adder (Sum)

Transistors                                          W (µm)  L (µm)
M2, M4, M6, M8, M9, M11, M13, M15, M18, M20          0.77    0.18
M1, M3, M5, M7, M10, M12, M14, M16, M17, M19         1.08    0.18

Fig. 3.3(b) Schematic of double pass-transistor full adder ($C_{out}$)

Table 3.5 Transistor dimensions in double pass-transistor full adder ($C_{out}$)

Transistors                                                          W (µm)  L (µm)
M2, M4, M6, M8, M10, M12, M14, M16, M17, M19, M21, M23, M26, M28     0.77    0.18
M18, M20, M22, M24                                                   0.9     0.18
M1, M3, M5, M7, M9, M11, M13, M15, M25, M27                          1.08    0.18

Transmission gate CMOS full adder

The transmission gate full adder has 20 transistors (Fig. 3.4). This circuit generates $A \oplus B$ and uses this signal and its complement as select signals to generate the output signals (Sum and $C_{out}$) [19]. It also requires complementary input signals (A, B, $C_{in}$), similar to the complementary CMOS full adder. However, it exhibits better speed than the CMOS full adder at the same power consumption because of its small transistor stack height [20].

Fig. 3.4 Schematic of transmission gate full adder

Table 3.6 Transistor dimensions in transmission gate full adder

Transistors                                   W (µm)  L (µm)
M2, M4, M6, M8, M12, M14, M16, M18, M20       0.7     0.18
M5, M7, M13, M15, M17, M19                    0.9     0.18
M1, M3, M10, M11                              1.44    0.18
M9                                            1.8     0.18

Pseudo-NMOS full adder

The pseudo-NMOS full adder operates on pseudo-NMOS logic, referred to as a ratioed style. This cell uses 14 transistors to realize the negative addition function (Fig. 3.5). The advantages of the pseudo-NMOS adder are its higher speed (compared with the complementary CMOS full adder) and low transistor count. On the negative side are the static power consumption of the pull-up transistor and the reduced output voltage swing, which makes this cell more susceptible to noise. In order to increase the output swing, two CMOS inverters are added to this circuit, which increases the transistor count of this cell to 18.

Fig. 3.5 Schematic of pseudo-NMOS full adder

Table 3.7 Transistor dimensions in pseudo-NMOS full adder

Transistors                      W (µm)  L (µm)
M7, M12, M13, M14                0.66    0.18
M1, M2, M3, M4, M5, M6, M8       0.77    0.18
M9, M10, M11, M16, M18           1       0.18
M15, M17                         2       0.18

XOR and transmission gate full adder

This adder, shown in Fig. 3.6, is based on an XOR gate [21] combined with transmission gates and requires a total of 14 transistors [22]. The XOR gate generates the sum. Using the transmission gate, the second half of the circuit produces the carry-out. This cell occupies less area than the complementary CMOS full adder cell. In terms of power consumption, this adder also performs better. This is due to its low activity factor and to the fact that a strong signal passes through fewer pass-logic gates, unlike the other cells, in which the signal has to go through more logic gates. Despite the high performance of this logic, one should note that the irregular layout of the transmission gates and the large average transistor size are considerable drawbacks of this circuit.

Fig. 3.6 Schematic of XOR & transmission gate full adder

Table 3.8 Transistor dimensions in XOR & transmission gate full adder

Transistors            W (µm)  L (µm)
M6, M7, M8, M10        0.7     0.18
M3, M4, M12, M14       0.7     0.18
M11, M13               0.9     0.18
M1, M5, M9             1.44    0.18
M2                     1.8     0.18

3.1.1.1.1 Simulation Strategies

The simulation techniques with regard to the input patterns of the full adder and the output loading are presented below.

Input Pattern and Output Loading: In order to compare different adders, the input patterns should test all cases fairly. An input pattern that maximizes the power consumption of one cell could produce less power in another, while a different input pattern could reverse the situation because of the different distribution of capacitances in the two circuits.

[Four transient-response plots, (a) to (d), of the three input signals; voltage from 0 to 1.8 V against time (s).]
Fig. 3.7 (a)-(d) Input patterns used to evaluate the performance of the adders

A good input pattern for power measurement, leading to a fair comparison of adder cells, should rotate the high-frequency signal among the inputs so that the input and intermediate nodes are all exercised. A good example is the concatenation of the four patterns shown in Fig. 3.7 (a, b, c, d).
Table 3.9 Characteristics of the input signals

          Pattern I           Pattern II          Pattern III         Pattern IV
Input     T (ns)  P.W. (ns)   T (ns)  P.W. (ns)   T (ns)  P.W. (ns)   T (ns)  P.W. (ns)
A         2       1           4       2           8       4           4       2
B         4       2           8       4           2       1           4       2
C_in      8       4           2       1           4       2           8       4

P.W. = pulse width, T = period; rise time = 50 ps, fall time = 50 ps

As for speed, the input patterns should contain all the required pattern-to-pattern transitions, and the delay of the cell should be measured for each transition. The input pattern used in the simulations is a concatenation of the four input patterns shown in Fig. 3.7 (a, b, c, d).

The test bench used for simulating the adder cell is shown in Fig. 4.1 of Chapter 4, where the simulation results of the selected adder cell are discussed. The inputs are applied through buffers (two cascaded inverters), which drive the adder cells with more realistic inputs in terms of slope and driving strength. The outputs are also applied to another adder in order to evaluate the driving capability of each cell.

3.1.1.1.2 Power Consumption Performance

The results of the comparison among the adders, sorted by power consumption, are shown in Table 3.10. The power of the second and third adder cells (Fig. 4.1) in the cascade configuration is the more realistic measure, because in that case the adder needs a high driving capability in order to provide the next cell with clean inputs. Therefore, the power values of either the second or the third full adder can be taken as the basis for the comparison. These results show that the XOR and transmission gate full adder exhibits the lowest power consumption, followed in order by the transmission gate CMOS, pseudo-NMOS, complementary CMOS, double pass-transistor and complementary pass-transistor adders.
One can see that the ranking is not necessarily related to the transistor count. It should also be pointed out that this evaluation corresponds to a 1.8 V power supply, which has slightly rearranged the previously reported adder ranking. The impact of supply reduction is an incomplete voltage swing at some internal nodes, leading to a constant current drain. This, in turn, results in higher power consumption in circuits such as the complementary pass-transistor and double pass-transistor adders.

Table 3.10 Simulation results for the full adders sorted by power consumption

Adder cell (1.8 V)               Power (mW)
XOR and transmission gate        0.0203
Transmission gate CMOS           0.0305
Pseudo-NMOS                      0.0341
Complementary CMOS               0.0504
Double pass-transistor           0.0861
Complementary pass-transistor    0.0967

3.1.1.1.3 Delay Performance

The results of the comparison among the adders, sorted by speed, are presented in Table 3.11. The delay values are measured from the moment the A, B and $C_{in}$ signals reach the adder inputs until the last of the Sum and $C_{out}$ signals reaches the inputs of the next adder cell. The cell with the lowest delay is the complementary pass-transistor adder.

Table 3.11 Simulation results for the full adders sorted by propagation delay

Adder cell                       Delay (ns)
Complementary pass-transistor    0.057
XOR and transmission gate        0.066
Transmission gate CMOS           0.074
Pseudo-NMOS                      0.080
Double pass-transistor           0.091
Complementary CMOS               0.140

Fig. 3.8 shows the delay of an adder. This measurement is based on the definition of the propagation delay of digital cells given at the beginning of this chapter. The input signals are A = 1, B = 1 and $C_{in}$ = 1; therefore, the adder response is Sum = 1 and $C_{out}$ = 1. The delay between the earliest input signal ($C_{in}$) and Sum is then measured, as is the delay between $C_{in}$ and $C_{out}$. These measurements are taken at the 50% transition point of the signals (0.9 V in our case, with $V_{dd}$ = 1.8 V). The delay values of the pseudo-NMOS adder are shown in Fig. 3.8. It can be seen that the delays of Sum and $C_{out}$ are very close in this cell, which avoids data hazards and race effects that might otherwise occur later in the proposed architecture.

[Transient-response plot with annotated delays of 70.09 ps and 73.32 ps; voltage (V) against time (s).]
Fig. 3.8 Propagation delay measurement

3.1.1.1.4 Performance Comparisons

The following criteria were considered in comparing the different adders:

Power-delay product: The power-delay product captures the compromise between speed and power consumption. The power-delay values are presented in Table 3.12. The measurements were performed under the same conditions as those recorded in Tables 3.10 and 3.11.

Area: The transistor count, which indicates area efficiency and layout productivity, must be taken into account in choosing the best adder.

Table 3.12 Simulation results for the full adder cells sorted by power-delay product

Adder cell                       Power-delay product   Transistors
XOR and transmission gate        0.00133               14
Transmission gate CMOS           0.00222               20
Pseudo-NMOS                      0.00272               14
Complementary pass-transistor    0.00551               32
Complementary CMOS               0.00702               28
Double pass-transistor           0.00783               48
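The entries of Table 3.12 can be cross-checked against Tables 3.10 and 3.11 by multiplying power by delay for each cell. The sketch below assumes the power values are in mW and the delay values in ns (so the product is in pJ); the delay unit is an assumption, because the table header is not legible in the source. The dictionary names are illustrative.

```python
# Values copied from Table 3.10 (power) and Table 3.11 (delay).
POWER_MW = {
    "XOR and transmission gate": 0.0203,
    "Transmission gate CMOS": 0.0305,
    "Pseudo-NMOS": 0.0341,
    "Complementary CMOS": 0.0504,
    "Double pass-transistor": 0.0861,
    "Complementary pass-transistor": 0.0967,
}
DELAY_NS = {  # unit assumed to be ns
    "Complementary pass-transistor": 0.057,
    "XOR and transmission gate": 0.066,
    "Transmission gate CMOS": 0.074,
    "Pseudo-NMOS": 0.080,
    "Double pass-transistor": 0.091,
    "Complementary CMOS": 0.140,
}

# Power-delay product per cell, then the cells ordered from best to worst.
PDP = {cell: POWER_MW[cell] * DELAY_NS[cell] for cell in POWER_MW}
RANKING = sorted(PDP, key=PDP.get)

if __name__ == "__main__":
    for cell in RANKING:
        print(f"{cell:32s}{PDP[cell]:.5f}")
```

The resulting order matches Table 3.12, and the computed products agree with the tabulated ones to within about 2%.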

The measurements show that the pseudo-NMOS full adder has average values of both power consumption and delay, while providing a sum signal at a good logic level. This leads to an average power-delay product.

The pseudo-NMOS adder also occupies a small area, not only because of its number of transistors but also because of the size of its PMOS transistors, which are the main issue at the layout level. These properties make the pseudo-NMOS circuit amenable to a lower supply voltage, which further reduces the power while maintaining a specified multiplication speed.

It should be mentioned that comparing the performance of adder cells based on different logic styles is a very broad area of study, and it cannot be treated fully in a small section. Here, identical conditions, such as a uniform input pattern, capacitive load and constant $V_{dd}$, were used during simulation in order to achieve a fair comparison. However, other factors, such as different geometries, physical designs and process variations, could be considered as well.

3.1.1.2 Carry Lookahead Adder

The carry lookahead adder is a viable candidate to resolve the propagation delay problem by calculating the carry signal in advance from the input signals. It relies on the fact that a carry signal is generated in two cases:

a) when both input bits ($A_i$, $B_i$) are “1”,

b) when one of the two bits is “1” and the carry-in $C_i$ from the previous stage is “1”.

Thus, one can write

$C_{out} = C_{i+1} = A_i \cdot B_i + (A_i \oplus B_i) \cdot C_i$. (3.5)

The above expression can be rewritten as

$C_{i+1} = G_i + P_i \cdot C_i$, (3.6)

in which

$G_i = A_i \cdot B_i$, (3.7)

$P_i = A_i \oplus B_i$. (3.8)

$G_i$ and $P_i$ are called the generate and propagate terms, respectively [23].

Note that the propagate and generate terms depend only on the input bits. If the above expression is used to calculate the carry signal, there is no need to wait for the carry to ripple through all the previous stages to find its proper value. This is the main advantage of the carry lookahead adder: reduced propagation delay.

In the following, the carry signals of a 4-bit adder are derived in terms of the generate and propagate terms.

$C_1 = G_0 + P_0 C_0$ (3.9)

$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0$ (3.10)

$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$ (3.11)

$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$ (3.12)

Note that $C_{out}$, i.e. $C_{i+1}$ of the last stage, will be available after four gate delays (two gate delays to calculate the propagate signal and two delays due to the AND and OR gates). The sum signal ($S_i$) can be calculated as follows:

$S_i = A_i \oplus B_i \oplus C_i = P_i \oplus C_i$. (3.13)

Thus, the sum bit will be available after two additional gate delays (due to the XOR gate), or a total of six gate delays after the input signals $A_i$ and $B_i$ have been applied. The advantage is that these delays are the same regardless of the number of bits to be added, as opposed to the ripple carry adder.
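The carry and sum expressions above can be checked with a short behavioural sketch. The Python model below is only an illustration of Equations 3.7 to 3.13, not the circuit used in this work; the function name `cla4` and the convention that index 0 is the least significant bit are assumptions made for the example.

```python
from itertools import product

def cla4(a, b, c0=0):
    """Add two 4-bit integers with carry lookahead; returns (sum, carry_out)."""
    A = [(a >> i) & 1 for i in range(4)]  # index 0 is the LSB (assumption)
    B = [(b >> i) & 1 for i in range(4)]
    G = [A[i] & B[i] for i in range(4)]   # generate terms, Eq. 3.7
    P = [A[i] ^ B[i] for i in range(4)]   # propagate terms, Eq. 3.8
    # Every carry is written directly in terms of G, P and c0 (Eqs. 3.9 to 3.12),
    # so no carry has to wait for the previous one.
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))
    C = [c0, c1, c2, c3]
    S = [P[i] ^ C[i] for i in range(4)]   # sum bits, Eq. 3.13
    return sum(S[i] << i for i in range(4)), c4

# Exhaustive comparison with ordinary integer addition.
for a, b, c0 in product(range(16), range(16), range(2)):
    s, c4 = cla4(a, b, c0)
    assert s + (c4 << 4) == a + b + c0
```

The exhaustive loop covers all 512 input combinations, which is feasible only because the module is 4 bits wide.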

The carry lookahead adder can be broken up into two modules:

1) The partial full adder (PFA), which generates $G_i$, $P_i$ and $S_i$ as defined by Equations 3.7, 3.8 and 3.13.

2) The carry lookahead logic, which generates the $C_{out}$ bits according to Equations 3.9 to 3.12. The 4-bit adder can then be built from four PFAs and the carry lookahead logic block.

The disadvantage of the carry lookahead adder is that the carry logic becomes quite complicated for more than 4 bits. Therefore, carry lookahead adders are usually implemented as 4-bit modules and used in a hierarchical structure to realize adders whose widths are multiples of 4 bits. A high fan-in OR gate is an unavoidable problem in designing a 16-bit carry lookahead adder; the growth in fan-in is already visible in Equation 3.12, where $C_4$ is calculated. A high fan-in gate not only increases the propagation delay but also contributes to additional power consumption. To resolve these issues, a cascade of four 4-bit carry lookahead adders has been employed in the design of the 16-bit carry lookahead adder. The propagation delay of the 16-bit carry lookahead adder in this architecture is approximately equal to that of a 4-bit ripple carry adder. This is because the $C_{out}$ signals have to ripple from one module to the next; this is repeated four times until the final $C_{out}$ arrives at the output. Despite this delay, the approach is more power-efficient.
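The cascading scheme described above can be sketched behaviourally as follows. This is a minimal Python model under stated assumptions, not the implemented circuit: the names `cla4_block` and `cla16` are hypothetical, and inside each module the carries are evaluated with the recurrence of Equation 3.6, which gives the same values as the flattened Equations 3.9 to 3.12.

```python
def cla4_block(a, b, cin):
    """One 4-bit carry lookahead module: returns (4-bit sum, carry-out)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]  # generate, Eq. 3.7
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]  # propagate, Eq. 3.8
    carries = [cin]
    for i in range(4):
        # C_{i+1} = G_i + P_i C_i (Eq. 3.6); hardware flattens this recurrence.
        carries.append(g[i] | (p[i] & carries[i]))
    s = sum((p[i] ^ carries[i]) << i for i in range(4))       # Eq. 3.13
    return s, carries[4]

def cla16(a, b, cin=0):
    """16-bit adder built from four cascaded 4-bit modules."""
    result, carry = 0, cin
    for k in range(4):
        nibble_a = (a >> (4 * k)) & 0xF
        nibble_b = (b >> (4 * k)) & 0xF
        # The carry-out of module k ripples into module k + 1.
        s, carry = cla4_block(nibble_a, nibble_b, carry)
        result |= s << (4 * k)
    return result, carry
```

Only the carry between modules ripples, which is why the text compares the delay of this structure with that of a 4-bit ripple carry adder.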

In the following, an overview of the sub-cells of the 4-bit carry lookahead adder is given. Figure 3.9 shows the block diagram of the 4-bit carry lookahead adder.

[Block diagram: four PFA blocks and the carry lookahead logic]

Fig. 3.9 Block diagram of 4-bit carry lookahead adder

As seen in Fig. 3.9, the partial full adder (PFA) is the first block, where the inputs are fed. As mentioned earlier, this block generates the propagate, generate and sum signals. Fig. 3.10 shows the gate-level implementation of the PFA. The sum signal is also generated in this block

according to Equation 3.13. In order to generate Q», signal another XOR gate is needed.

Fig. 3.10 Gate-level implementation of partial full adder (PFA)

The delays of the signals in the highlighted block of the carry lookahead adder (Fig. 3.9) were measured and are shown in Table 3.13.

Table 3.13 Delay of the generate, propagate and sum signals of PFA

Output   Delay (ns)
Gi       0.0552
Si       0.0385
CiPi     0.08

The block diagram of the 16-bit carry lookahead adder is shown in Fig. 3.11. Four 4-bit carry lookahead modules have been used to implement the final stage of the pair-wise multiplier. The labels on this diagram are based on the outputs of the previous stages.

[Block diagram: four cascaded 4-bit carry lookahead adders with outputs S1 to S17]

Fig. 3.11 Block diagram of the 16-bit carry lookahead adder implemented by cascading four 4-bit carry lookahead modules
3.1.1.3 AND, NAND, OR and XOR Gates

AND, NAND, OR and XOR are the fundamental logic gates, used in most logic circuits to realize arithmetic operations. The Boolean expressions for the two-input AND, NAND, OR and XOR gates, followed by their truth tables, are shown in Table 3.14.

$A \cdot B$, (3.5)

$A + B$, (3.6)

$A \oplus B$. (3.7)

Table 3.14 Truth table of AND, NAND, OR and XOR

A   B   A.B   NOT(A.B)   A+B   A⊕B
0   0   0     1          0     0
0   1   0     1          1     1
1   0   0     1          1     1
1   1   1     0          1     0

The Boolean expressions representing the AND/NAND/OR/XOR operations can be realized by different circuits. However, there are not as many variants of these structures as there are of adder circuits. Therefore, very common configurations have been used to implement the required logic functions. Figure 3.12(a, b) shows the schematics of the AND and NAND gates, which have been optimized for the speed required in the proposed multiplier [22]. The NAND gate is composed of two NMOS and two PMOS transistors. An inverter is added to the circuit to generate the AND function. Several designs of OR and XOR gates have been reported. Each has its own advantages, such as lower delay, and drawbacks, such as poor response to some particular inputs [20]. Figure 3.12(c, d) shows the schematics of the XOR and OR circuits. The dimensions of the NMOS and PMOS transistors have been modified to meet the rise and fall times required in the pair-wise multiplier.

[Schematics of the four gates, with outputs A.B, NOT(A.B), A⊕B and A+B]

Fig. 3.12 Schematic of (a) AND (b) NAND (c) XOR (d) OR gates

Table 3.15 Transistor dimensions of AND, NAND, XOR and OR gates

[Table: width W (µm) and length L (µm) of each transistor; L = 0.18 µm throughout]

3.1.2 Cell Design

In order to increase productivity in ASIC design, cell design techniques are critical. A basic concept in cell design is to design uniform circuits that can perform the same task. In the following, the top level and the circuit level of the cells required in each stage are described.

[Block diagram: AND Generator, 1st to 4th Adder Levels (Full Adder Planes), Final Adder Level (Carry Lookahead Adder), Product; sparse bits are marked separately]

Fig. 3.13 Block diagram of the proposed 8 x 8-bit multiplier showing detail of the required cells

AND Generator: As seen in the block diagram of the pair-wise 8 x 8-bit multiplier (Fig. 3.13), the first stage of this architecture is an AND generator. To execute the first step of the pair-wise algorithm discussed in Section 2.3.5, the AND combinations of all odd and even bit positions of the 8-bit multiplicand and the 8-bit multiplier are required. This task is performed by the AND generator, whose block diagram is shown in Fig. 3.14. This stage consists of four AND planes:

XeYe: generates the AND combinations of all even bits of both the multiplicand and the multiplier. The results are: X2Y2, X2Y4, X2Y6, X2Y8, X4Y2, X4Y4, X4Y6, X4Y8, X6Y2, X6Y4, X6Y6, X6Y8, X8Y2, X8Y4, X8Y6, X8Y8.

XeYo: generates the AND combinations of the even bits of the multiplicand and the odd bits of the multiplier. The results are: X2Y1, X2Y3, X2Y5, X2Y7, X4Y1, X4Y3, X4Y5, X4Y7, X6Y1, X6Y3, X6Y5, X6Y7, X8Y1, X8Y3, X8Y5, X8Y7.

XoYe: generates the AND combinations of the odd bits of the multiplicand and the even bits of the multiplier. The results are: X1Y2, X1Y4, X1Y6, X1Y8, X3Y2, X3Y4, X3Y6, X3Y8, X5Y2, X5Y4, X5Y6, X5Y8, X7Y2, X7Y4, X7Y6, X7Y8.

XoYo: generates the AND combinations of all odd bits of both the multiplicand and the multiplier. The results are: X1Y1, X1Y3, X1Y5, X1Y7, X3Y1, X3Y3, X3Y5, X3Y7, X5Y1, X5Y3, X5Y5, X5Y7, X7Y1, X7Y3, X7Y5, X7Y7.

[Block diagram: four AND planes XeYe, XeYo, XoYe and XoYo]

Fig. 3.14 Block diagram of AND generator

Fig. 3.15 shows the gate-level implementation of the AND plane. The combination of the four planes constitutes the AND generation stage. At the circuit level, the AND circuit discussed in Section 3.1.1.3 (Fig. 3.12a) is used.
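The grouping performed by the AND generator can be illustrated with a small behavioural sketch. The Python code below is only an illustration; the function name `and_planes`, the dictionary layout and the assumption that X1 and Y1 are the least significant bits are introduced for the example.

```python
def and_planes(x, y):
    """Return the four AND planes as dicts mapping (i, j) to X_i AND Y_j."""
    X = {i: (x >> (i - 1)) & 1 for i in range(1, 9)}  # X1 assumed to be the LSB
    Y = {j: (y >> (j - 1)) & 1 for j in range(1, 9)}
    even, odd = (2, 4, 6, 8), (1, 3, 5, 7)

    def plane(rows, cols):
        # One AND gate per (multiplicand bit, multiplier bit) pair.
        return {(i, j): X[i] & Y[j] for i in rows for j in cols}

    return {
        "XeYe": plane(even, even),
        "XeYo": plane(even, odd),
        "XoYe": plane(odd, even),
        "XoYo": plane(odd, odd),
    }

planes = and_planes(0b10110101, 0b01101110)
# Each plane holds 16 AND terms, 64 in total for an 8 x 8-bit multiplication.
assert all(len(p) == 16 for p in planes.values())
```

Weighting each term X_i Y_j by 2^(i+j-2) and summing over the four planes gives back the ordinary product, which is a convenient sanity check for the grouping.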

Fig. 3.15 Gate level of the AND plane (XjYj Cell)

First Adder Plane: The second stage of the multiplier is the first adder plane, where the partial products (Pee, Peo, Poe, Poo) are generated. Equations 2.20, 2.21, 2.22 and 2.23 show the different AND combinations of the multiplicand and multiplier bits required for generating each of the partial products. Fig. 3.16 shows the block diagram of this stage.

[Block diagram: the four partial product generators Pee, Peo, Poe and Poo with their output bits, and the sparse bits forming N and M]

Fig. 3.16 Block diagram of partial products generator

This stage consists of four blocks of partial product generators:

Pee: generates the partial products resulting from multiplication of the even bits of both the multiplicand and the multiplier. Pee is a 15-bit number, shown with the bit number in parentheses as follows:

Pee(1) = 0    Pee(2) = 0    Pee(3) = X2Y2    Pee(4) = 0
Pee(5) = Sum[X4Y2 + X2Y4]    Pee(6) = Cout[X4Y2 + X2Y4]
Pee(7) = Sum[X6Y2 + X4Y4 + X2Y6]    Pee(8) = Cout[X6Y2 + X4Y4 + X2Y6]
Pee(9) = Sum[X8Y2 + X6Y4 + X4Y6 + X2Y8]    Pee(10) = Cout[X8Y2 + X6Y4 + X4Y6 + X2Y8]
Pee(11) = Sum[X8Y4 + X6Y6 + X4Y8]    Pee(12) = Cout[X8Y4 + X6Y6 + X4Y8]
Pee(13) = Sum[X8Y6 + X6Y8]    Pee(14) = Cout[X8Y6 + X6Y8]    Pee(15) = X8Y8

Peo: generates the partial products resulting from multiplication of the even bits of the multiplicand and the odd bits of the multiplier. Peo is a 15-bit number, shown with the bit number in parentheses as follows:

Peo(1) = 0    Peo(2) = X2Y1    Peo(3) = 0    Peo(4) = Sum[X4Y1 + X2Y3]
Peo(5) = Cout[X4Y1 + X2Y3]    Peo(6) = Sum[X6Y1 + X4Y3 + X2Y5]
Peo(7) = Cout[X6Y1 + X4Y3 + X2Y5]    Peo(8) = Sum[X8Y1 + X6Y3 + X4Y5 + X2Y7]
Peo(9) = Cout[X8Y1 + X6Y3 + X4Y5 + X2Y7]    Peo(10) = Sum[X8Y3 + X6Y5 + X4Y7]
Peo(11) = Cout[X8Y3 + X6Y5 + X4Y7]    Peo(12) = Sum[X8Y5 + X6Y7]
Peo(13) = Cout[X8Y5 + X6Y7]    Peo(14) = X8Y7
Peo(15) = 0

Poe: generates the partial products resulting from multiplication of the odd bits of the multiplicand and the even bits of the multiplier. Poe is a 15-bit number, shown with the bit number in parentheses as follows:

Poe(1) = 0    Poe(2) = X1Y2    Poe(3) = 0    Poe(4) = Sum[X1Y4 + X3Y2]
Poe(5) = Cout[X1Y4 + X3Y2]    Poe(6) = Sum[X1Y6 + X3Y4 + X5Y2]
Poe(7) = Cout[X1Y6 + X3Y4 + X5Y2]    Poe(8) = Sum[X1Y8 + X3Y6 + X5Y4 + X7Y2]
Poe(9) = Cout[X1Y8 + X3Y6 + X5Y4 + X7Y2]    Poe(10) = Sum[X3Y8 + X5Y6 + X7Y4]
Poe(11) = Cout[X3Y8 + X5Y6 + X7Y4]    Poe(12) = Sum[X5Y8 + X7Y6]
Poe(13) = Cout[X5Y8 + X7Y6]    Poe(14) = X7Y8    Poe(15) = 0

Poo: generates the partial products resulting from multiplication of the odd bits of both the multiplicand and the multiplier. Poo is a 15-bit number, shown with the bit number in parentheses as follows:

Poo(1) = X1Y1    Poo(2) = 0    Poo(3) = Sum[X1Y3 + X3Y1]
Poo(4) = Cout[X1Y3 + X3Y1]    Poo(5) = Sum[X1Y5 + X3Y3 + X5Y1]
Poo(6) = Cout[X1Y5 + X3Y3 + X5Y1]    Poo(7) = Sum[X1Y7 + X3Y5 + X5Y3 + X7Y1]
Poo(8) = Cout[X1Y7 + X3Y5 + X5Y3 + X7Y1]    Poo(9) = Sum[X3Y7 + X5Y5 + X7Y3]
Poo(10) = Cout[X3Y7 + X5Y5 + X7Y3]    Poo(11) = Sum[X5Y7 + X7Y5]
Poo(12) = Cout[X5Y7 + X7Y5]    Poo(13) = X7Y7
Poo(14) = 0    Poo(15) = 0

The Pee, Peo, Poe and Poo blocks all perform a similar task and have the same number of inputs and outputs. This makes it possible to employ the same cell for all four partial product generators (Pee, Peo, Poe, Poo). This cell is constructed from two half adders and five adders.

Fig. 3.17 Gate-level of one partial product generator (PP Cell)

Fig. 3.17 shows the gate-level diagram of the partial product generator cell. At the circuit level, the pseudo-NMOS adder has been used to realize these cells. Using adders to convert three k-bit numbers into two (k + 1)-bit numbers avoids carry propagation delay in the body of the multiplier. This technique is described in the following.

To use the 3-to-2 adding technique, no more than three inputs may be used for generating any element of a partial product. This condition is not met when a partial product element is generated from four terms, as happens in Pee(9), Pee(10), Peo(8), Peo(9), Poe(8), Poe(9), Poo(7) and Poo(8) (see the fourth term in each of the relevant equations). To deal with these extra terms, called sparse terms, they are taken out of the equations and collected to form two distinct numbers, called N and M. N is a 15-bit number with zeros in all positions except the seventh [N(7)], eighth [N(8)] and ninth [N(9)]: (0,0,0,0,0,0,X7Y1,X2Y7,X2Y8,0,0,0,0,0,0). M is a second 15-bit number with zeros in all positions except the eighth [M(8)]: (0,0,0,0,0,0,0,X7Y2,0,0,0,0,0,0,0). These two numbers are shown in the block diagram of the partial products generator (Fig. 3.16).

The outputs of the first adder plane are therefore six 15-bit numbers (Pee, Peo, Poe, Poo, N, M).
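The construction of the six numbers can be sketched behaviourally as follows. The Python model below is an illustration under stated assumptions rather than the implemented cell: position 1 is taken to be the least significant bit, the term X_i Y_j is placed at position i + j - 1, the sparse terms are taken to be X7Y1, X2Y7, X2Y8 (forming N) and X7Y2 (forming M), and all function names are hypothetical.

```python
EVEN, ODD = (2, 4, 6, 8), (1, 3, 5, 7)
# Sparse terms (i, j) removed from the partial products (assumed from the text).
SPARSE = {(2, 8), (2, 7), (7, 2), (7, 1)}

def bit(value, position):
    """Bit at a 1-indexed position; position 1 is assumed to be the LSB."""
    return (value >> (position - 1)) & 1

def partial_product(x, y, rows, cols):
    """One of Pee, Peo, Poe, Poo: Sum bit at position p, Cout bit at p + 1."""
    counts = {}
    for i in rows:
        for j in cols:
            if (i, j) not in SPARSE:
                p = i + j - 1                      # position of the term X_i Y_j
                counts[p] = counts.get(p, 0) + (bit(x, i) & bit(y, j))
    word = 0
    for p, total in counts.items():                # at most three terms per position
        word |= (total & 1) << (p - 1)             # Sum bit
        word |= (total >> 1) << p                  # Cout bit, one position higher
    return word

def first_adder_plane(x, y):
    pee = partial_product(x, y, EVEN, EVEN)
    peo = partial_product(x, y, EVEN, ODD)
    poe = partial_product(x, y, ODD, EVEN)
    poo = partial_product(x, y, ODD, ODD)
    n = (((bit(x, 7) & bit(y, 1)) << 6)            # N(7) = X7Y1
         | ((bit(x, 2) & bit(y, 7)) << 7)          # N(8) = X2Y7
         | ((bit(x, 2) & bit(y, 8)) << 8))         # N(9) = X2Y8
    m = (bit(x, 7) & bit(y, 2)) << 7               # M(8) = X7Y2
    return pee, peo, poe, poo, n, m

# The six numbers always add up to the full product.
assert sum(first_adder_plane(255, 255)) == 255 * 255
```

Within one partial product the Sum bits and the Cout bits fall on positions of opposite parity, so they never collide; this is what allows each position to be handled by an independent adder.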

Second & Third Adder Planes: To generate the final product (P) of the 8-bit multiplicand X and the 8-bit multiplier Y, the individual partial products (Pee, Peo, Poe, Poo), generated by summing the even and odd bit products of the multiplicand and multiplier, and the two distinct numbers formed from the sparse bits (M, N) in the previous stage must be added together.

Pee + Peo + Poe + Poo + M + N = P (3.6)

This task requires the second and third adder planes. The addition has to be performed bit by bit, which results in carry propagation. To postpone the carry propagation delay to the last stage of the multiplier, a 3-to-2 adding technique has been used. To facilitate this technique, the addition of the four partial products (Pee, Peo, Poe, Poo) and the two extra numbers (M, N) is broken up into two steps, as shown in Equation 3.7. The six numbers are divided into two batches of three numbers.

Pee + Peo + Poe + Poo + M + N = (Pee + Peo + Poe) + (Poo + M + N) (3.7)

At this stage the three 15-bit numbers (Pee, Peo, Poe) are converted to two 16-bit numbers (Peoeo, Peoee), and likewise the fourth partial product (Poo) and the two distinct numbers (M, N) are converted to two numbers (PooSe, PooSo).

This task can be performed using a structure similar to that shown in Fig. 3.17, with a total of 14 adders. Note that, due to the power and area constraints of the entire architecture, a half adder is preferred whenever only two input signals need to be added (i.e. no Cin signal exists). The result of this 3-to-2 addition is as follows:

Peoeo: result of adding all the odd positions of Pee, Peo and Poe.

Peoeo(1) = 0    Peoeo(2) = 0    Peoeo(3) = Pee(3)    Peoeo(4) = 0
Peoeo(5) = Sum[Pee(5) + Peo(5) + Poe(5)]    Peoeo(6) = Cout[Pee(5) + Peo(5) + Poe(5)]
Peoeo(7) = Sum[Pee(7) + Peo(7) + Poe(7)]    Peoeo(8) = Cout[Pee(7) + Peo(7) + Poe(7)]
Peoeo(9) = Sum[Pee(9) + Peo(9) + Poe(9)]    Peoeo(10) = Cout[Pee(9) + Peo(9) + Poe(9)]
Peoeo(11) = Sum[Pee(11) + Peo(11) + Poe(11)]    Peoeo(12) = Cout[Pee(11) + Peo(11) + Poe(11)]
Peoeo(13) = Sum[Pee(13) + Peo(13) + Poe(13)]    Peoeo(14) = Cout[Pee(13) + Peo(13) + Poe(13)]
Peoeo(15) = Pee(15)

Peoee: result of adding all the even positions of Pee, Peo and Poe.

Peoee(1) = 0
Peoee(2) = Sum[Peo(2) + Poe(2)]    Peoee(3) = Cout[Peo(2) + Poe(2)]
Peoee(4) = Sum[Peo(4) + Poe(4)]    Peoee(5) = Cout[Peo(4) + Poe(4)]
Peoee(6) = Sum[Pee(6) + Peo(6) + Poe(6)]    Peoee(7) = Cout[Pee(6) + Peo(6) + Poe(6)]
Peoee(8) = Sum[Pee(8) + Peo(8) + Poe(8)]    Peoee(9) = Cout[Pee(8) + Peo(8) + Poe(8)]
Peoee(10) = Sum[Pee(10) + Peo(10) + Poe(10)]    Peoee(11) = Cout[Pee(10) + Peo(10) + Poe(10)]
Peoee(12) = Sum[Pee(12) + Peo(12) + Poe(12)]    Peoee(13) = Cout[Pee(12) + Peo(12) + Poe(12)]
Peoee(14) = Sum[Pee(14) + Peo(14) + Poe(14)]    Peoee(15) = Cout[Pee(14) + Peo(14) + Poe(14)]

PooSe: result of adding all the even positions of Poo, M and N.

PooSe(1) = 0    PooSe(2) = 0    PooSe(3) = 0
PooSe(4) = Poo(4)    PooSe(5) = 0    PooSe(6) = Poo(6)
PooSe(7) = 0    PooSe(8) = Sum[Poo(8) + X2Y7 + X7Y2]
PooSe(9) = Cout[Poo(8) + X2Y7 + X7Y2]    PooSe(10) = Poo(10)
PooSe(11) = 0    PooSe(12) = Poo(12)    PooSe(13) = 0
PooSe(14) = Poo(14)    PooSe(15) = 0

PooSo: result of adding all the odd positions of Poo, M and N.

PooSo(1) = Poo(1)    PooSo(2) = 0    PooSo(3) = Poo(3)
PooSo(4) = 0    PooSo(5) = Poo(5)    PooSo(6) = 0
PooSo(7) = Sum[Poo(7) + X7Y1]    PooSo(8) = Cout[Poo(7) + X7Y1]
PooSo(9) = Sum[Poo(9) + X2Y8]    PooSo(10) = Cout[Poo(9) + X2Y8]
PooSo(11) = Poo(11)    PooSo(12) = 0    PooSo(13) = Poo(13)
PooSo(14) = 0    PooSo(15) = Poo(15)

The addition process at this level is complete, and four 16-bit numbers (Peoee, Peoeo, PooSe, PooSo), the result of the 3-to-2 addition of Pee, Peo, Poe, Poo, M and N, are the outputs of this level. Equation 3.7 is rewritten as:

Pee + Peo + Poe + Poo + M + N = Peoee + Peoeo + PooSe + PooSo. (3.8)
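The 3-to-2 step used at this level, with its split into even and odd positions, can be sketched behaviourally as follows. The Python function below is a minimal illustration, not the adder plane itself; the name `compress_3_to_2`, the `width` parameter and the convention that position 1 is the least significant bit are assumptions made for the example.

```python
def compress_3_to_2(a, b, c, width=16):
    """Reduce three words to two without propagating carries.

    Returns (even_word, odd_word) such that a + b + c == even_word + odd_word.
    The even word collects the Sum bits of the even positions together with
    their Cout bits (which land on odd positions); the odd word does the
    same for the odd positions.
    """
    words = {0: 0, 1: 0}                      # keyed by position parity: 0 = even, 1 = odd
    for p in range(1, width + 1):             # positions are 1-indexed, as in the text
        total = ((a >> (p - 1)) & 1) + ((b >> (p - 1)) & 1) + ((c >> (p - 1)) & 1)
        words[p % 2] |= (total & 1) << (p - 1)   # Sum bit stays at position p
        words[p % 2] |= (total >> 1) << p        # Cout bit moves to position p + 1
    return words[0], words[1]

even_word, odd_word = compress_3_to_2(0x7FFF, 0x7FFF, 0x7FFF)
assert even_word + odd_word == 3 * 0x7FFF
```

Each position is handled by one independent full adder (or half adder when only two inputs are non-zero), so the delay of the whole level is that of a single adder cell regardless of the word length.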

At the next stage the addition of the four numbers is broken into two steps, as shown in Equation 3.9.

Peoee + Peoeo + PooSe + PooSo = (Peoee + Peoeo + PooSe) + PooSo. (3.9)

The same technique as in the previous stage is used two more times. First, the three numbers (Peoee, Peoeo, PooSe) are converted to two numbers (PS1e, PS1o) as follows:

PS1e: result of adding all the even positions of Peoee, Peoeo and PooSe.

PS1e(1) = 0
PS1e(2) = Sum[Peoee(2) + Peoeo(2) + PooSe(2)]    PS1e(3) = Cout[Peoee(2) + Peoeo(2) + PooSe(2)]
PS1e(4) = Sum[Peoee(4) + Peoeo(4) + PooSe(4)]    PS1e(5) = Cout[Peoee(4) + Peoeo(4) + PooSe(4)]
PS1e(6) = Sum[Peoee(6) + Peoeo(6) + PooSe(6)]    PS1e(7) = Cout[Peoee(6) + Peoeo(6) + PooSe(6)]
PS1e(8) = Sum[Peoee(8) + Peoeo(8) + PooSe(8)]    PS1e(9) = Cout[Peoee(8) + Peoeo(8) + PooSe(8)]
PS1e(10) = Sum[Peoee(10) + Peoeo(10) + PooSe(10)]    PS1e(11) = Cout[Peoee(10) + Peoeo(10) + PooSe(10)]
PS1e(12) = Sum[Peoee(12) + Peoeo(12) + PooSe(12)]    PS1e(13) = Cout[Peoee(12) + Peoeo(12) + PooSe(12)]
PS1e(14) = Sum[Peoee(14) + Peoeo(14) + PooSe(14)]    PS1e(15) = Cout[Peoee(14) + Peoeo(14) + PooSe(14)]

PS1o: result of adding all the odd positions of Peoee, Peoeo and PooSe.

PS1o(1) = Sum[Peoee(1) + Peoeo(1) + PooSe(1)]    PS1o(2) = Cout[Peoee(1) + Peoeo(1) + PooSe(1)]
PS1o(3) = Sum[Peoee(3) + Peoeo(3) + PooSe(3)]    PS1o(4) = Cout[Peoee(3) + Peoeo(3) + PooSe(3)]
PS1o(5) = Sum[Peoee(5) + Peoeo(5) + PooSe(5)]    PS1o(6) = Cout[Peoee(5) + Peoeo(5) + PooSe(5)]
PS1o(7) = Sum[Peoee(7) + Peoeo(7) + PooSe(7)]    PS1o(8) = Cout[Peoee(7) + Peoeo(7) + PooSe(7)]
PS1o(9) = Sum[Peoee(9) + Peoeo(9) + PooSe(9)]    PS1o(10) = Cout[Peoee(9) + Peoeo(9) + PooSe(9)]
PS1o(11) = Sum[Peoee(11) + Peoeo(11) + PooSe(11)]    PS1o(12) = Cout[Peoee(11) + Peoeo(11) + PooSe(11)]
PS1o(13) = Sum[Peoee(13) + Peoeo(13) + PooSe(13)]    PS1o(14) = Cout[Peoee(13) + Peoeo(13) + PooSe(13)]
PS1o(15) = Sum[Peoee(15) + Peoeo(15) + PooSe(15)]    PS1o(16) = Cout[Peoee(15) + Peoeo(15) + PooSe(15)]

At the next parallel adder plane, the two new words from the previous adder plane (PS1e, PS1o) are added to PooSo via another 3-to-2 adder stage to complete Equation 3.9. This addition is carried out in the same way as at the previous level. The three input numbers at this level are converted to two new 15-bit numbers called Pe and Po.

Pe: result of adding all the even positions of PooSo, PS1e and PS1o.

Pe(2) = Sum[PooSo(2) + PS1e(2) + PS1o(2)]    Pe(3) = Cout[PooSo(2) + PS1e(2) + PS1o(2)]
Pe(4) = Sum[PooSo(4) + PS1e(4) + PS1o(4)]    Pe(5) = Cout[PooSo(4) + PS1e(4) + PS1o(4)]
Pe(6) = Sum[PooSo(6) + PS1e(6) + PS1o(6)]    Pe(7) = Cout[PooSo(6) + PS1e(6) + PS1o(6)]
Pe(8) = Sum[PooSo(8) + PS1e(8) + PS1o(8)]    Pe(9) = Cout[PooSo(8) + PS1e(8) + PS1o(8)]
Pe(10) = Sum[PooSo(10) + PS1e(10) + PS1o(10)]    Pe(11) = Cout[PooSo(10) + PS1e(10) + PS1o(10)]
Pe(12) = Sum[PooSo(12) + PS1e(12) + PS1o(12)]    Pe(13) = Cout[PooSo(12) + PS1e(12) + PS1o(12)]
Pe(14) = Sum[PooSo(14) + PS1e(14) + PS1o(14)]    Pe(15) = Cout[PooSo(14) + PS1e(14) + PS1o(14)]

Po: result of adding all the odd positions of PooSo, PS1e and PS1o.

Po(1) = Sum[PooSo(1) + PS1e(1) + PS1o(1)]    Po(2) = Cout[PooSo(1) + PS1e(1) + PS1o(1)]
Po(3) = Sum[PooSo(3) + PS1e(3) + PS1o(3)]    Po(4) = Cout[PooSo(3) + PS1e(3) + PS1o(3)]
Po(5) = Sum[PooSo(5) + PS1e(5) + PS1o(5)]    Po(6) = Cout[PooSo(5) + PS1e(5) + PS1o(5)]
Po(7) = Sum[PooSo(7) + PS1e(7) + PS1o(7)]    Po(8) = Cout[PooSo(7) + PS1e(7) + PS1o(7)]
Po(9) = Sum[PooSo(9) + PS1e(9) + PS1o(9)]    Po(10) = Cout[PooSo(9) + PS1e(9) + PS1o(9)]
Po(11) = Sum[PooSo(11) + PS1e(11) + PS1o(11)]    Po(12) = Cout[PooSo(11) + PS1e(11) + PS1o(11)]
Po(13) = Sum[PooSo(13) + PS1e(13) + PS1o(13)]    Po(14) = Cout[PooSo(13) + PS1e(13) + PS1o(13)]
Po(15) = Sum[PooSo(15) + PS1e(15) + PS1o(15)]    Po(16) = Cout[PooSo(15) + PS1e(15) + PS1o(15)]

In the last stage of the multiplication process, these two final numbers (Pe, Po) need to be summed to generate the final product, as shown in Equation 3.10, which is a summarized form of Equation 3.6.

Pe + Po = P (3.10)

This addition needs to be performed fast. Therefore, a carry lookahead structure, known as a fast adder, has been used to speed up the multiplication.
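The complete data flow from the AND terms to Equation 3.10 can be put together in one behavioural sketch. The Python model below only illustrates the sequence of levels described in this chapter and makes several assumptions: position 1 is the least significant bit, the sparse terms are X7Y1, X2Y7 and X2Y8 (in N) and X7Y2 (in M), the final carry lookahead adder is replaced by ordinary integer addition, and all names (`first_plane`, `compress`, `multiply`) are hypothetical.

```python
def bit(value, position):
    """Bit at a 1-indexed position (position 1 assumed to be the LSB)."""
    return (value >> (position - 1)) & 1

def first_plane(x, y):
    """AND generation plus first adder level: Pee, Peo, Poe, Poo, N and M."""
    sparse = {(2, 8): "N", (2, 7): "N", (7, 1): "N", (7, 2): "M"}
    words = {"ee": 0, "eo": 0, "oe": 0, "oo": 0, "N": 0, "M": 0}
    counts = {}
    for i in range(1, 9):
        for j in range(1, 9):
            term, p = bit(x, i) & bit(y, j), i + j - 1
            if (i, j) in sparse:
                words[sparse[(i, j)]] |= term << (p - 1)
            else:
                key = ("eo"[i % 2] + "eo"[j % 2], p)   # e.g. ("eo", 8)
                counts[key] = counts.get(key, 0) + term
    for (name, p), total in counts.items():
        words[name] |= ((total & 1) << (p - 1)) | ((total >> 1) << p)
    return words

def compress(a, b, c):
    """3-to-2 addition split into even and odd positions."""
    even = odd = 0
    for p in range(1, 17):
        total = bit(a, p) + bit(b, p) + bit(c, p)
        word = ((total & 1) << (p - 1)) | ((total >> 1) << p)
        if p % 2:
            odd |= word
        else:
            even |= word
    return even, odd

def multiply(x, y):
    w = first_plane(x, y)
    peoee, peoeo = compress(w["ee"], w["eo"], w["oe"])   # second adder level
    poose, pooso = compress(w["oo"], w["M"], w["N"])
    ps1e, ps1o = compress(peoee, peoeo, poose)           # third adder level
    pe, po = compress(ps1e, ps1o, pooso)                 # fourth adder level
    return pe + po                                       # final adder, Eq. 3.10

assert multiply(255, 255) == 65025
```

Carries are resolved only in the last line; every earlier level keeps the intermediate result as two or more words, which is the property the architecture relies on.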

In the next chapter, the simulation of the major blocks as well as the final simulation results are presented.

Chapter 4

Simulation Results & Layout Considerations


This chapter presents the simulation results for the major designed cells and circuits. The simulation results of the final stage of the proposed multiplier for certain given inputs are then discussed. Layout considerations are explained later in this chapter.

All circuits, including the individual cells and the entire design, have been simulated in the Cadence environment.

4.1 Simulation Results of the Individual Circuits

Before presenting the simulation results of the individual circuits and designed cells, the circuit structure used for simulation is introduced. Arranging proper test circuits has a significant impact on ASIC productivity.

Simulation Circuit Structure: In regular multipliers that use full-adder cells as the building block, such as the proposed architecture, a cascade of full adders is usually utilized. In such cases, a high driving capability of the adder is a must for providing the next cell with input signals at proper logic levels. With this in mind, the circuit structure used to simulate the adder is illustrated in Fig. 4.1. A cascade of four full adder cells is utilized; the inputs are fed from buffers (two cascaded inverters) to give more realistic signals, and the outputs are loaded with buffers to give proper loading conditions [28].

[Block diagram: cascade of four full adder cells with buffered inputs and outputs]

Fig. 4.1 Circuit structure used for simulation of full adder cell

The parasitic effects are, therefore, included in the simulation results. The same structure

has been used to compare the adder cells discussed in topology selection.

[Block diagram: AND/NAND/OR/XOR gates and a full adder cell]

Fig. 4.2 Circuit structure used for simulation of AND/NAND/OR/XOR gates

Full Adder Simulation Results

The simulation results for the pseudo-NMOS full adder, obtained with the test circuit of Fig. 4.1, are presented here for four different input patterns. The input patterns were introduced in Chapter 3 in the description of the simulation strategy for the adder cells. These patterns cover most of the possible input combinations.

Fig. 4.3 shows the Sum and Cout signals of the pseudo-NMOS full adder in response to the input pattern shown in Fig. 3.7(a). This pattern covers 6 transitions of the input signals (A, B, Cin). These transitions are also shown in Table 4.1.

[Waveform plot: transient response of the full adder test circuit]

Fig. 4.3 Simulation waveforms showing the response of the pseudo-NMOS full adder to input pattern (a)

Table 4.1 Transitions covered by input pattern (a)

Inputs        Outputs
A  B  Cin     Sum  Cout
0  1  1       0    1
1  1  1       1    1
0  0  1       1    0
1  0  1       0    1
0  1  0       1    0
1  1  0       0    1

Fig. 4.4 shows the Sum and Cout signals of the pseudo-NMOS full adder in response to the input pattern shown in Fig. 3.7(b). This pattern covers 6 transitions of the input signals (A, B, Cin). These transitions are shown in Table 4.2.

[Waveform plot: transient response of the full adder test circuit]

Fig. 4.4 Simulation waveforms showing the response of the pseudo-NMOS full adder to input pattern (b)

Table 4.2 Transitions covered by input pattern (b)

Inputs        Outputs
A  B  Cin     Sum  Cout
0  1  1       0    1
0  1  0       1    0
1  1  1       1    1
1  1  0       0    1
0  0  1       1    0
0  0  0       0    0

Fig. 4.5 shows the Sum and Cout signals of the pseudo-NMOS full adder in response to the input pattern shown in Fig. 3.7(c). This pattern covers 6 transitions of the input signals (A, B, Cin). These transitions are shown in Table 4.3.

[Waveform plot: transient response of the full adder test circuit]

Fig. 4.5 Simulation waveforms showing the response of the pseudo-NMOS full adder to input pattern (c)

Table 4.3 Transitions covered by input pattern (c)

Inputs        Outputs
A  B  Cin     Sum  Cout
0  1  1       0    1
0  0  1       1    0
0  1  0       1    0
0  0  0       0    0
1  1  1       1    1
1  0  1       0    1

Fig. 4.6 shows the Sum and Cout signals of the pseudo-NMOS full adder in response to the input pattern shown in Fig. 3.7(d). This pattern covers 6 transitions of the input signals (A, B, Cin). These transitions are shown in Table 4.4.

[Waveform plot: transient response of the full adder test circuit]

Fig. 4.6 Simulation waveforms showing the response of the pseudo-NMOS full adder to input pattern (d)

Table 4.4 Transitions covered by input pattern (d)

Inputs        Outputs
A  B  Cin     Sum  Cout
0  0  1       1    0
0  1  1       0    1
1  1  1       1    1
1  0  1       0    1
0  0  0       0    0
0  1  0       1    0
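The expected outputs in Tables 4.1 to 4.4 follow directly from the full adder logic, and can be cross-checked with a few lines of code. The sketch below is a purely logical model in Python (it says nothing about timing or signal levels); the names `full_adder` and `TABLES` are introduced only for this example, and the rows are copied from the four tables.

```python
def full_adder(a, b, cin):
    """Logical full adder: returns (Sum, Cout)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

# Rows of Tables 4.1 to 4.4 as (A, B, Cin, Sum, Cout).
TABLES = {
    "a": [(0, 1, 1, 0, 1), (1, 1, 1, 1, 1), (0, 0, 1, 1, 0),
          (1, 0, 1, 0, 1), (0, 1, 0, 1, 0), (1, 1, 0, 0, 1)],
    "b": [(0, 1, 1, 0, 1), (0, 1, 0, 1, 0), (1, 1, 1, 1, 1),
          (1, 1, 0, 0, 1), (0, 0, 1, 1, 0), (0, 0, 0, 0, 0)],
    "c": [(0, 1, 1, 0, 1), (0, 0, 1, 1, 0), (0, 1, 0, 1, 0),
          (0, 0, 0, 0, 0), (1, 1, 1, 1, 1), (1, 0, 1, 0, 1)],
    "d": [(0, 0, 1, 1, 0), (0, 1, 1, 0, 1), (1, 1, 1, 1, 1),
          (1, 0, 1, 0, 1), (0, 0, 0, 0, 0), (0, 1, 0, 1, 0)],
}

# Every tabulated row must agree with the logical model.
for pattern, rows in TABLES.items():
    for a, b, cin, s, cout in rows:
        assert full_adder(a, b, cin) == (s, cout), (pattern, a, b, cin)
```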

AND/NAND/OR/XOR Gates Simulation Results

The AND/NAND/OR/XOR gates were described in more detail in Chapter 3. Their schematics are shown in Fig. 3.12. The test structure used for the simulation is shown in Fig. 4.2. The input signals have a 50% duty cycle with a period of 2 ns. Figures 4.7 to 4.9 show the simulation results for these gates.

[Waveform plot: transient response of the AND/NAND gate test circuit]

Fig. 4.7 Simulation waveforms showing the response of the AND/NAND gate

[Waveform plot: transient response of the OR gate test circuit]

Fig. 4.8 Simulation waveforms showing the response of the OR gate

[Waveform plot: transient response of the XOR gate test circuit]

Fig. 4.9 Simulation waveforms showing the response of the XOR gate

4.2 Final Simulation Results

In order to evaluate the performance of the proposed multiplier, three quantities have to be measured: speed, power consumption and area. In this section speed and power consumption are estimated.

Speed: The speed of the multiplier is expressed as the minimum interval between two sequential multiplication operations (8-bit x 8-bit) for which the results are correct, or equivalently as the corresponding operating frequency. To determine this frequency, the worst-case (maximum) delay of the entire design should be measured. Given the worst-case delay, the maximum operating frequency of the multiplier can be calculated according to Equation 4.1:

$f_{max} = 1 / \tau_{max}$, (4.1)

where $f_{max}$ is the maximum operating frequency of multiplication and $\tau_{max}$ is the worst-case delay of the multiplier.

As shown in Fig. 4.10, the operation of the proposed multiplier can be divided into 6 stages:

1) AND generation
2) 1st adder level
3) 2nd adder level
4) 3rd adder level
5) 4th adder level
6) Final adder level (carry lookahead)

Due to the parallel operation (AND and addition) in stages 1 to 5, the delay of one AND gate represents the delay of the first stage (AND generation), and the delay of one pseudo-NMOS adder represents the delay of each of stages 2 to 5 (adder levels). The delay of the carry lookahead adder is measured separately.

To evaluate the worst-case delay of the entire design, the worst-case delay of each stage is measured first; the worst-case delay of the proposed multiplier is then calculated by Equation 4.2:

67
[Block diagram: the inputs X and Y feed the AND generator, followed by four full adder planes (1st to 4th adder level) and the carry lookahead adder (final adder level), which produces the product P. Stage delays: AND generator 65~75 ps; each adder level 75~120 ps; carry lookahead adder 240~277 ps.]

Approximate worst-case delay (result of pre-layout simulation) = AND Generator + 4 x Full Adder + Carry Lookahead
= 75 ps + 4 x 120 ps + 277 ps = 832 ps

Fig. 4.10 The critical path of the proposed multiplier

$$ \tau_{Total} = \tau_{AND\,Generator} + 4 \times \tau_{Adder\,level} + \tau_{Final\,adder\,stage} \tag{4.2} $$

where $\tau_{Total}$ is the worst-case delay of the multiplier, $\tau_{AND\,Generator}$ is the worst-case delay of the AND generator stage, which equals the delay of one AND gate, $\tau_{Adder\,level}$ is the worst-case delay of one adder stage, which equals the worst-case delay of one pseudo-NMOS full adder cell, and $\tau_{Final\,adder\,stage}$ is the worst-case delay of the final adder stage, which equals the worst-case delay of the 16-bit carry lookahead adder. The delay of the AND gate is measured directly from the definition of propagation delay. Fig. 4.11 shows the delay of the AND gate.
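As a quick check of Equation 4.2, the sketch below adds up the per-stage worst-case delays quoted in Fig. 4.10 (75 ps, 120 ps, 277 ps) and converts the total into a frequency with Equation 4.1. The function and constant names are illustrative only and do not come from the thesis.

```python
# Worst-case stage delays in picoseconds, as quoted in Fig. 4.10.
AND_GATE_PS = 75
FULL_ADDER_PS = 120
CARRY_LOOKAHEAD_PS = 277
ADDER_LEVELS = 4  # four adder levels on the critical path


def total_worst_case_delay_ps(and_ps, adder_ps, cla_ps, levels=ADDER_LEVELS):
    """Equation 4.2: AND generator + levels x adder + final adder stage."""
    return and_ps + levels * adder_ps + cla_ps


def frequency_ghz(delay_ps):
    """Equation 4.1: f = 1 / tau, with tau in ps and f in GHz."""
    return 1000.0 / delay_ps


tau_total = total_worst_case_delay_ps(AND_GATE_PS, FULL_ADDER_PS, CARRY_LOOKAHEAD_PS)
print(tau_total)                           # 832
print(f"{frequency_ghz(tau_total):.3f}")   # 1.202
```

The sum reproduces the 832 ps estimate; the corresponding frequency is about 1.2 GHz.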

The notion of worst-case delay is more meaningful for the pseudo-NMOS full adder because of the different possible input combinations. The delay of the pseudo-NMOS adder has been measured for all input combinations. The worst-case delay occurs when A = 1, B = 1 and $C_{in}$ = 1, as shown in Fig. 4.12.

[Plot: transient response of the AND gate test schematic, voltage (V) versus time (s); signals A and B.]

Fig. 4.11 Delay of AND gate

[Plot: transient response of the full adder test schematic, voltage versus time (s); signals SUM, Cout and Cin; annotated delay = 120 ps.]

Fig. 4.12 The worst-case delay of the pseudo-NMOS adder, occurring when A = 1, B = 1 and $C_{in}$ = 1

To measure the worst-case delay of the 16-bit carry lookahead adder, the same method has been used. Different input transitions have been applied, and the delay has been measured between the input and the last output signal at the 50% transition point. The worst-case delay is observed when “1111111111111111” and “1111111111111111” are added, as expected, because the carry signal ripples between the 4-bit carry lookahead adder modules (recall that the 16-bit carry lookahead adder is constructed from four 4-bit carry lookahead adder modules).
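The block structure described above can be sketched behaviourally as follows. This is only a functional model under the assumption that each 4-bit module uses the usual generate/propagate signals and hands its carry-out to the next module; it says nothing about the transistor-level circuit or its timing, and the names `cla4` and `add16` are hypothetical.

```python
def cla4(a, b, cin):
    """Add two 4-bit values using generate/propagate signals.

    Returns (4-bit sum, carry out). The carries are computed in a loop here;
    in hardware the same expressions are flattened into lookahead logic.
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]     # generate: a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]   # propagate: a_i XOR b_i
    c = [cin]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))                  # c_{i+1} = g_i + p_i c_i
    s = sum((p[i] ^ c[i]) << i for i in range(4))       # s_i = p_i XOR c_i
    return s, c[4]


def add16(a, b):
    """16-bit adder built from four 4-bit modules chained through their carries."""
    carry, result, block_carries = 0, 0, []
    for blk in range(4):
        nibble_a = (a >> (4 * blk)) & 0xF
        nibble_b = (b >> (4 * blk)) & 0xF
        s, carry = cla4(nibble_a, nibble_b, carry)
        result |= s << (4 * blk)
        block_carries.append(carry)
    return result, carry, block_carries


total, cout, carries = add16(0xFFFF, 0xFFFF)
print(hex(total), cout, carries)  # 0xfffe 1 [1, 1, 1, 1]
```

For the all-ones operands, every module passes a carry to its neighbour, which is the inter-module path the measurement above exercises.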

Fig. 4.13 shows the input and output signals in composite format. The delay occurring

between input and output signals is clearly seen in this figure.

Table 4.5 shows the worst-case delay values of the AND gate, the pseudo-NMOS full adder and the 16-bit carry lookahead adder.

[Plot: transient response of the lookahead adder test schematic, voltage versus time (s); composite view of the inputs and the outputs P16 ... P3.]

Fig. 4.13 The worst-case delay of the 16-bit carry lookahead adder

Table 4.5 The results of the worst-case delay measurement

Stage                                     Worst-case delay (ps)
AND generation (one AND gate)             75
Adder stage (one pseudo-NMOS adder)       120
16-bit carry lookahead adder              277
$\tau_{Total}$                            832

It should be pointed out that the worst-case delays shown in Table 4.5 are the results of examining each block (AND generator, adder plane and final adder stage) separately. Their sum gives an estimate of the worst-case delay of the entire design. However, the input pattern that causes the worst-case delay can be controlled only for the first two multiplier stages, the “AND Generator” and the “First Adder Level”. Applying the pattern “11111111” as X and “11111111” as Y makes the AND generator produce “1” at all of its outputs. Therefore, all inputs of the next stage, the first adder level, are “1”, which means that the worst-case delay definitely occurs in this stage. It can thus be guaranteed that the first two stages operate at their slowest speed (worst-case delay), but this is not necessarily the case in the other stages. Nevertheless, the delay of the entire multiplier is not expected to exceed the total worst-case delay ($\tau_{Total}$) obtained above. This has been examined by applying X = “11111111” and Y = “11111111” to the proposed multiplier. Fig. 4.14 shows these input signals applied to the proposed multiplier. The first curve from the top is the current drawn from the $V_{dd}$ node. Fig. 4.15 shows the post-layout simulation result corresponding to the applied input (Fig. 4.14).
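The claim that the all-ones pattern drives every AND-generator output to “1” can be illustrated with a small behavioural sketch. It assumes the AND generator forms one partial-product bit per pair of input bits, which is the standard array arrangement; the helper names are hypothetical, and the summation does not model the pair-wise adder structure itself.

```python
def partial_products(x, y, bits=8):
    """Row j holds the AND-gate outputs x_i AND y_j for i = 0 .. bits-1."""
    return [[((x >> i) & 1) & ((y >> j) & 1) for i in range(bits)]
            for j in range(bits)]


def product_from_partial_products(rows):
    """Sum the partial-product bits with their binary weights 2^(i+j)."""
    return sum(bit << (i + j)
               for j, row in enumerate(rows)
               for i, bit in enumerate(row))


rows = partial_products(0b11111111, 0b11111111)
print(all(bit == 1 for row in rows for bit in row))  # True
print(product_from_partial_products(rows))           # 65025
```

All 64 partial-product bits are 1 for this pattern, and their weighted sum is 255 x 255 = 65025.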

[Plot: transient response of the multiplier test schematic versus time (s); supply current trace on top, followed by the input traces of X and Y.]

Fig. 4.14 Input waveforms of X = 11111111 and Y = 11111111

[Plot: transient response of the multiplier test schematic, voltage (0.0 to 1.9 V) versus time (s); one trace per product bit, P16 down to P1.]

Fig. 4.15 The post-layout simulation waveforms showing the results of “11111111” x “11111111”

[Plot: transient response of the multiplier test schematic, voltage versus time (s); composite view of the input and the product outputs; annotated $\tau_{Total}$ = 793 ps.]

Fig. 4.16 The measured delay of the proposed multiplier corresponding to “11111111” x “11111111”

To measure the delay occurring during “11111111” x “11111111”, a composite plot of the input and output waveforms is used, and the delay is measured in the same manner as defined previously. Fig. 4.16 shows the input and output waveforms together with the delay ($\tau_{Total}$). The measured delay is 793 ps, which is less than 835 ps, as expected. It is therefore fair to assume that the worst-case delay of the proposed 8 x 8-bit multiplier is less than the sum of the worst-case delays of its individual blocks. This summed worst-case delay may never occur, but it is used to set an upper bound on the delay of the multiplier. It translates into the minimum operating frequency of the proposed multiplier, which is determined as 1.19 GHz. That is, operation of the proposed design is guaranteed at 1.19 GHz.

Power: The estimation of the power consumed by a large digital circuit is a complex task. Measuring the power consumption is critical for low-power design, as it permits the designer to optimize power, to meet requirements, and to know the power distribution across the chip. Several heuristic, statistical, and probabilistic methods have been introduced [24,25,26]. These methods become less accurate as the size of the circuit increases. It is better to decompose a large circuit into smaller modules and use these methods to estimate the power consumption of each module. The methods are also helpful for optimizing the performance of the decomposed modules. One of these approaches has been employed in the circuit-level design of this work in order to meet power efficiency as one of the objectives. The practical aspect of the method is explained in more detail in the topology selection and transistor sizing of the full adder circuitry in Chapter 3.

Nevertheless, in the case of complex systems it is wise to use CAD tools for accurate measurement of the power consumption.

Generally, power estimation refers to the techniques of estimating the average power dissipation of circuits. There are several power analysis techniques and tools at the circuit, gate, architectural, and behavioural levels of abstraction. The most straightforward method of power estimation is circuit simulation, i.e., performing a circuit simulation of the design and measuring the average current drawn from the supply. The average power can then be estimated as the sum of three major components, as shown in Equation 4.3:

$$ P_{total} = P_s + P_d + P_{sc} \tag{4.3} $$

where $P_s$ is the static power consumption, i.e. the power consumed due to leakage and static currents, $P_{sc}$ is the short-circuit power consumed because of the current flowing from the power supply to ground while transistors switch, and $P_d$ is the dynamic power consumption, which constitutes the majority of the power consumed in CMOS VLSI circuits.

The method used by CAD tools to measure the power consumption depends strongly on the input patterns (a pattern-dependent technique). The technique is also called dynamic power simulation, which should not be confused with dynamic power. Equation 4.4 shows the dynamic portion of the total power consumption of a digital circuit. This equation is very similar to the algorithm that has been derived to compute the power consumed by digital circuits in CAD tools such as Spice [27]:

■ (4.4)

where $C_i$ is the load capacitance at node i, $V_i$ is the voltage swing at node i, $\alpha_i$ is the switching activity factor at node i, $f_{clk}$ is the system clock frequency, $V_{dd}$ is the power supply voltage, $V_t$ is the transistor threshold voltage, and $\beta$ is the gain factor of the transistor. The summation is over all the nodes of the circuit, which makes the power estimation a very complex calculation. Changing any of the components in Equation 4.4 results in a different power consumption value. Some of the components of Equation 4.4, such as $V_t$ and $V_{dd}$, are process-dependent. Other components, such as $C_{load}$ and $V_{swing}$, are predetermined by the design requirements.
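To make the node-by-node summation concrete, the sketch below evaluates the standard textbook switching term, the sum over nodes of $\alpha_i C_i V_i V_{dd} f_{clk}$. This is an assumption about the form of the switching component only; it is not a reproduction of Equation 4.4, which may contain further terms (for example terms involving $V_t$ and $\beta$). The node values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    capacitance_f: float  # load capacitance C_i in farads
    swing_v: float        # voltage swing V_i in volts
    activity: float       # switching activity alpha_i, between 0 and 1


def switching_power_w(nodes, vdd_v, f_clk_hz):
    """Textbook switching power: f_clk * V_dd * sum(alpha_i * C_i * V_i)."""
    return f_clk_hz * vdd_v * sum(
        n.activity * n.capacitance_f * n.swing_v for n in nodes
    )


# Two hypothetical nodes with full-rail swing at a 1.8 V supply.
nodes = [Node(10e-15, 1.8, 0.5), Node(20e-15, 1.8, 0.25)]
power = switching_power_w(nodes, vdd_v=1.8, f_clk_hz=1.1e9)
print(f"{power * 1e6:.2f} uW")
```

Raising either the clock frequency or the activity of a node raises the result proportionally, which is why the two input-pattern conditions discussed next matter for the measurement.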

Two components in Equation 4.4 depend only on the input pattern: the clock frequency ($f_{clk}$) and the switching factor ($\alpha_i$). The input frequency of the entire system is bounded by the delay of the critical path. Taking into account the approximately 800 ps delay of the critical path that has already been measured, the characteristics of the input signals are determined. The minimum required period of the input signal is 800 ps. Assuming a 50% duty cycle as a standard for the input signals, a minimum input pulse width of 400 ps is required. Thus, the frequency of the input signals is set at the minimum value of 1.1 GHz. This gives the first condition for the power consumption evaluation.

Switching factor ($\alpha_i$) is the underlying factor of transistor switching. For N periods of 0 → $V_{dd}$ and $V_{dd}$ → 0 transitions, the switching activity $\alpha_i$ determines how many 0 → $V_{dd}$ transitions occur at the capacitive nodes. In other words, $\alpha_i$ represents the probability that a 0 → $V_{dd}$ transition will occur during the period T = 1/f, where f is the frequency of the input signals at the node. Considering the transitions of all internal nodes is a complex task, which is beyond the scope of the present discussion. However, it is clear at this point that choosing a pattern that causes a high number of transitions in one period of the multiplier contributes to the power consumption value. Hence, this gives another condition for the input patterns.
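The definition of switching activity given above can be illustrated by counting rising transitions in a sampled logic trace. This is a minimal sketch that assumes one sample per clock period; the function name is hypothetical.

```python
def switching_activity(samples):
    """Fraction of clock periods in which the node makes a 0 -> 1 transition.

    `samples` holds one logic value (0 or 1) per clock period boundary.
    """
    periods = len(samples) - 1
    if periods <= 0:
        return 0.0
    rising = sum(1 for prev, cur in zip(samples, samples[1:])
                 if prev == 0 and cur == 1)
    return rising / periods


print(switching_activity([0, 1, 0, 1, 0, 1, 0, 1, 0]))  # 0.5
print(switching_activity([1, 1, 1, 1, 1]))              # 0.0
```

A node that toggles every period charges its capacitance in half of the periods, while a node held constant contributes no switching power.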

Therefore, since Cadence is used to measure the power consumption of the proposed multiplier, the following two conditions are considered to govern the power performance:

1) Applying the input signals with an operating frequency of approximately 1.2 GHz.

2) Applying the input pattern causing the maximum switching activity in the entire design.

The power consumed by the entire system has been measured by changing the inputs from $X_1$ = 00000000 → $X_2$ = 11111111 and $Y_1$ = 00000000 → $Y_2$ = 11111111. The transition caused by these input patterns guarantees that the load capacitances at all nodes of the circuit are charged to the maximum (Fig. 4.17). One can therefore expect to observe the maximum power consumed by the multiplier when applying this pattern. This pattern is shown in Fig. 4.14. The waveform of the current drawn by the $V_{dd}$ node while this pattern is applied is shown in Fig. 4.17. The average of this current computed by Cadence is 10.5 mA and the power consumption is measured as 18.09 mW. Many different patterns have been applied to the proposed multiplier, and the delay and power consumption have been recorded (Tables 4.6 and 4.7). The maximum power consumption observed belongs to the multiplication “11111111” x “11111111”, which is expected according to the definition of switching activity in a complex digital system.

[Plot: transient response of the multiplier test schematic; current of the $V_{dd}$ node (mA) versus time (s).]

Fig. 4.17 The waveform of the current drawn by the $V_{dd}$ node (“11111111” x “11111111”)

To further examine the effect of switching activity in a complex system such as the proposed design, another pattern has been chosen. The power consumed by the entire system has been measured by applying a pattern causing the transitions $X_1$ = 00000000 → $X_2$ = 10101010 and $Y_1$ = 00000000 → $Y_2$ = 01010101. The power consumed by the multiplier under this pattern is expected to be close to the mean of two values: the power consumed when all inputs are set to “0”, which is called the “power down” or “minimum power consumption” value, and the maximum power consumption of the system, which occurs when applying “11111111” x “11111111”. This is shown in Equation 4.5.

$$ P_{average} = \frac{P_{min} + P_{max}}{2} \tag{4.5} $$

where $P_{min}$ is the power-down value, measured as 19.314 nW when no inputs are applied, and $P_{max}$ is the power consumed by applying the pattern “11111111” x “11111111”, which is measured as 18.09 mW. The reason for this assumption is the equal density of “0” and “1” in the patterns “10101010” and “01010101”, which makes it possible to assume that the capacitances at 50% of all the nodes in the entire system will be charged. The power consumption assumed for this pattern from Equation 4.5 is therefore 9.054 mW. The actual power consumption measured by Cadence while this pattern is applied is 10.347 mW. A difference of about 12% is seen between the assumed and the actual power consumption. This difference is mainly due to the power consumed by the interconnections and routing paths. Fig. 4.18 shows the current drawn by the $V_{dd}$ node while the pattern “10101010” x “01010101” is applied. The average of this current is computed by Cadence as 5.74 mA.
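The arithmetic behind this comparison can be retraced with the short sketch below. It uses the values reported in this section (19.314 nW, 18.09 mW, 10.347 mW, 5.74 mA, 1.8 V) and assumes that the average power equals the supply voltage times the average supply current; the results agree with the quoted figures only to within rounding, and the names are illustrative.

```python
VDD_V = 1.8              # supply voltage
P_MIN_W = 19.314e-9      # power-down value
P_MAX_W = 18.09e-3       # all-ones pattern
P_MEASURED_W = 10.347e-3 # measured for 10101010 x 01010101
I_AVG_A = 5.74e-3        # average supply current for that pattern


def midpoint_power(p_min, p_max):
    """Equation 4.5: mean of the minimum and maximum power."""
    return (p_min + p_max) / 2


def relative_difference(estimate, measured):
    """Difference between measurement and estimate, relative to the measurement."""
    return (measured - estimate) / measured


estimate = midpoint_power(P_MIN_W, P_MAX_W)
print(f"estimate   = {estimate * 1e3:.2f} mW")
print(f"difference = {relative_difference(estimate, P_MEASURED_W) * 100:.1f} %")
print(f"Vdd * Iavg = {VDD_V * I_AVG_A * 1e3:.2f} mW")
```

The estimate comes out at roughly 9.05 mW, the relative difference at 12 to 13%, and the product of supply voltage and average current lands close to the reported 10.347 mW.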

[Plot: transient response of the multiplier test schematic; current of the $V_{dd}$ node (mA) versus time (s).]

Fig. 4.18 The waveform of the current drawn by the $V_{dd}$ node (“10101010” x “01010101”)

Fig. 4.19 shows the simulation waveforms of the input patterns chosen under the assumption of 50% switching activity relative to the pattern “11111111” x “11111111”. Fig. 4.20 shows the multiplication product of the input patterns of Fig. 4.19.

[Plot: transient response of the multiplier test schematic, voltage (0.0 to 1.8 V) versus time (s); one trace per input bit of X and Y.]

Fig. 4.19 Waveforms of the input patterns “10101010” x “01010101”

[Plot: transient response of the multiplier test schematic, voltage (0.0 to 1.8 V) versus time (s); one trace per product bit, P16 down to P1.]

Fig. 4.20 The simulation waveforms of the multiplication product of the input patterns shown in Fig. 4.19

A total of 100 patterns have been examined as inputs to validate the operation of the proposed design. These patterns include 80 random patterns and 20 intentional patterns that propagate the carry from the 0th bit to the 16th bit.

Tables 4.6 and 4.7 present the delay and power consumption of the proposed multiplier resulting from applying some of the random and intentional patterns.

Table 4.6 The numerical results of several intentional multiplications

X8...X2X1    Dec.   Y8...Y2Y1    Dec.   P16...P2P1            Dec.    Delay (ns)   Power consumption (mW)
“00000000”   0      “00000000”   0      “0000000000000000”    0       0            0.9504
“00000001”   1      “11111111”   255    “0000000011111111”    255     0.78         7.542
“00000011”   3      “11111111”   255    “0000001011111101”    765     0.749        10.71
“00000111”   7      “11111111”   255    “0000011011111001”    1785    0.791        10.962
“00001111”   15     “11111111”   255    “0000111011110001”    3825    0.666        12.456
“00011111”   31     “11111111”   255    “0001111011100001”    7905    0.78         16.362
“00111111”   63     “11111111”   255    “0011111011000001”    16065   0.78         16.936
“01111111”   127    “11111111”   255    “0111111010000001”    32385   0.78         17.744
“11111111”   255    “11111111”   255    “1111111000000001”    65025   0.78         18.09

Table 4.7 The numerical results of several random multiplications sorted by delay

X8...X2X1    Dec.   Y8...Y2Y1    Dec.   P16...P2P1            Dec.    Delay (ns)   Power consumption (mW)
“10110100”   180    “00101000”   40     “0001110000100000”    7200    0.662        2.54
“10011101”   157    “00101100”   44     “0001101011111100”    6908    0.662        5.17
“10100101”   165    “00001100”   12     “0000011110111100”    1980    0.662        3.91
“10101101”   173    “00110100”   52     “0010001100100100”    8996    0.670        4.42
“00111101”   61     “00101100”   44     “0000101001111100”    2684    0.689        6.93
“10111101”   189    “00111100”   60     “0010110001001100”    11340   0.703        5.92
“10111001”   185    “00111100”   60     “0010101101011100”    11100   0.703        5.99
“10110101”   181    “00111000”   56     “0010011110011000”    10136   0.704        4.50
“00001100”   12     “00011000”   24     “0000000100100000”    288     0.711        7.23
“00111101”   61     “00111100”   60     “0000111001001100”    3660    0.711        9.83
“00110000”   48     “00010000”   16     “0000001100000000”    768     0.711        9.46
“00011001”   25     “00110100”   52     “0000010100010100”    1300    0.712        7.16
“10100101”   165    “00001100”   12     “0000011110111100”    1980    0.712        3.90
“00111101”   61     “00011100”   28     “0000011010101100”    1708    0.714        4.55
“00100100”   36     “00001000”   8      “0000000100100000”    288     0.716        7.13
“10110001”   177    “00110000”   48     “0010000100110000”    8496    0.729        6.84

X8...X2X1 = binary representation of the multiplier, Y8...Y2Y1 = binary representation of the multiplicand, P16...P2P1 = binary representation of the multiplication product, Dec. = decimal representation of the multiplicand and multiplier
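The functional correctness of the tabulated products can be cross-checked with a few lines of code. The sketch below takes a handful of rows from Tables 4.6 and 4.7 and confirms that the binary product column equals the product of the two binary operands; it checks the arithmetic only, not the delay or power figures, and the helper name is hypothetical.

```python
# (multiplier bits, multiplicand bits, product bits) copied from Tables 4.6 and 4.7.
ROWS = [
    ("00000011", "11111111", "0000001011111101"),
    ("00001111", "11111111", "0000111011110001"),
    ("11111111", "11111111", "1111111000000001"),
    ("10110100", "00101000", "0001110000100000"),
    ("10011101", "00101100", "0001101011111100"),
]


def row_is_consistent(x_bits, y_bits, p_bits):
    """True when the 16-bit product string equals X times Y."""
    return int(x_bits, 2) * int(y_bits, 2) == int(p_bits, 2)


for x_bits, y_bits, p_bits in ROWS:
    print(int(x_bits, 2), "x", int(y_bits, 2), "=", int(p_bits, 2),
          row_is_consistent(x_bits, y_bits, p_bits))
```

Every row listed here is consistent, and the same check can be extended to the remaining rows of both tables.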

With regard to design robustness, the effects of noise have to be evaluated. Noise is one of the main factors determining the stability of the system. In the following, the main sources of noise are described and the performance of the proposed design is examined with the noise effects taken into account.

Noise: One of the main factors degrading the performance of high-speed VLSI systems is noise, which comes from different sources. Noise can be induced through the supply and ground of the system during switching transitions. This noise is known as Simultaneous Switching Noise (SSN). Another type of noise is thermal noise. Although this noise is not the dominant source, it is unavoidable [23].

After SSN and thermal noise are defined below, the robustness of the proposed multiplier is examined by considering the effects of these noise sources on the performance of the entire design.

One of the main sources of noise in digital systems is Simultaneous Switching Noise (SSN). The effects of SSN are receiving more attention as a result of the continuous increase in the level of integration on a single chip and in operating speed. This noise is caused by the large instantaneous current, due to the switching of multiple drivers and switches, flowing through the parasitic inductance at the ground and power nodes. SSN can have a dramatic impact on the system by:

a) generating glitches on the ground and power supply interconnections

b) decreasing the effective driving strength of the circuit

c) generating output signal distortion

d) reducing the overall noise margin of the system.

The ground bounce, or noise, is quantified by Equation 4.6 as:

$$ V_n = L_{vss} \frac{dI}{dt} \tag{4.6} $$

where $V_n$ is the ground bounce, $L_{vss}$ is the bond wire parasitic inductance and $I$ is the current flowing in the bond wire inductor.

Equation 4.6 shows that SSN can be lowered by reducing the parasitic inductance. To reduce the parasitic inductance, multiple pads and pins for the power supply ($V_{dd}$) and ground ($V_{ss}$) are needed. Allocating extra pins to $V_{dd}$ and $V_{ss}$ reduces $L_{vss}$ to $L_{vss,effective}$ because the parasitic inductors are connected in parallel, as shown in Fig. 4.21.

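[Diagram: two chip outlines with the core in the centre; the left one has a single bond wire inductor on the supply, the right one has two bond wire inductors in parallel.]

Fig. 4.21 Adding two extra pins to $V_{dd}$ and reducing the parasitic inductance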

The standard package No. 68PGA offered by CMC has 36 pins, which allows us to assign 32 pins to the inputs (the 8-bit multiplicand and the 8-bit multiplier) and the outputs (the 16-bit multiplication product). The 2 extra pins are therefore assigned to $V_{dd}$ and $V_{ss}$ (one extra for each), at no extra cost. These multiple pads reduce the parasitic inductance to half.
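The halving of the parasitic inductance, and its effect on ground bounce, can be sketched numerically. The example assumes identical, magnetically uncoupled bond wires and the usual inductive relation V = L dI/dt; the 5 nH inductance, 10 mA current step and 100 ps edge are invented illustration values, not measurements from this work.

```python
def parallel_inductance(l_single_h, n_pins):
    """Effective inductance of n identical, uncoupled bond wires in parallel."""
    return l_single_h / n_pins


def ground_bounce_v(l_h, delta_i_a, delta_t_s):
    """Inductive voltage drop for a current change delta_i over time delta_t."""
    return l_h * delta_i_a / delta_t_s


L_BOND_H = 5e-9      # hypothetical single bond wire inductance
DELTA_I_A = 10e-3    # hypothetical current step
DELTA_T_S = 100e-12  # hypothetical switching edge

one_pin = ground_bounce_v(parallel_inductance(L_BOND_H, 1), DELTA_I_A, DELTA_T_S)
two_pins = ground_bounce_v(parallel_inductance(L_BOND_H, 2), DELTA_I_A, DELTA_T_S)
print(f"one pin : {one_pin:.3f} V")
print(f"two pins: {two_pins:.3f} V")
```

With these assumed numbers the bounce drops by half when a second pin is added, which is the effect the extra supply and ground pins are meant to achieve.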

The absence of glitches and the strong driving capability of the output signals validate our approach to reducing the impact of SSN. This supports the robustness of the proposed design against switching noise.

Thermal noise is another source of noise, which is generated by thermal agitation of

electrons in conductors. Equation 4.7 shows the power of this noise as:

$$ P = kT\Delta f \tag{4.7} $$

where $P$ is the noise power, $k$ is Boltzmann's constant, $T$ is the conductor temperature and $\Delta f$ is the bandwidth. It is often preferred to represent this noise as a noise voltage, as shown in Equation 4.8.

$$ \frac{\overline{V_{n(Thermal)}^{2}}}{\Delta f} = 4kTR \tag{4.8} $$

where $R$ is the parasitic resistance. Thermal noise power per Hz is equal throughout the frequency spectrum and depends only on $k$ and $T$. To examine the effect of this noise on the performance of the entire system in a simple way, the voltage of the final outputs has been simulated over the operating temperature range from -40 °C to 135 °C. The voltage variation within this temperature range, with a capacitive load ($C_L$) of 5 pF, is 0.15%. This shows that the system is quite robust against temperature variation. Fig. 4.22 shows the simulation results of the final outputs against temperature variation.
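For a sense of scale, the sketch below evaluates the standard thermal noise voltage, the square root of 4kTRΔf, at the two ends of the temperature range. The 1 kΩ resistance and 1 GHz bandwidth are assumed values chosen for illustration and are not taken from the design.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23


def thermal_noise_vrms(temp_k, resistance_ohm, bandwidth_hz):
    """RMS thermal noise voltage: sqrt(4 k T R delta_f)."""
    return math.sqrt(4 * BOLTZMANN_J_PER_K * temp_k * resistance_ohm * bandwidth_hz)


def celsius_to_kelvin(temp_c):
    return temp_c + 273.15


R_OHM = 1e3          # hypothetical parasitic resistance
BANDWIDTH_HZ = 1e9   # hypothetical bandwidth

for temp_c in (-40, 135):
    v = thermal_noise_vrms(celsius_to_kelvin(temp_c), R_OHM, BANDWIDTH_HZ)
    print(f"{temp_c:>4} C: {v * 1e6:.1f} uV rms")
```

Under these assumptions the noise stays in the range of a hundred or so microvolts across the whole temperature range, several orders of magnitude below the 1.8 V logic swing.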

[Plot: DC response of the multiplier test schematic; output voltages P16 ... P2 versus temperature (°C).]

Fig. 4.22 The simulation results of the final outputs against temperature variation

4.3 Layout Considerations

The proposed 8 x 8-bit multiplier has been laid out in 0.18 µm CMOS technology. In the following, layout issues such as floor planning, routing, pads and packaging of the adder cell are explained.

4.3.1 Layout Strategies

Taking transistor chaining, grouping, and signal sequencing into account in the layout of the proposed adder has been shown to bring substantial power savings and speed improvement at no area penalty [9].

The following measures have been taken to reduce parasitic effects:

1) Minimizing the use of diffusion as a routing layer, and using the metal II layer instead, to reduce the overall parasitics.

2) Placing the transistors switched by the $C_{in}$ signal close to the output.

3) Minimizing the capacitive load on the $C_{out}$ signal by minimizing the size of those transistors in the Sum gate whose gate signals are connected to $C_{out}$.

4) Using a matching transistor structure to reduce area.

5) Placing the transistor connected to $C_{in}$ closer to the input of the circuit, thereby reducing the input capacitance.

Fig. 4.23 Layout of the pseudo-NMOS full adder (die size of 22 x 8.5 µm²)

Fig. 4.24 Layout of the AND gate (die size of 7.9 x 5.6 µm²)

Fig. 4.25 Layout of the NAND gate (die size of 5.4 x 5.6 µm²)

Fig. 4.26 Layout of the XOR gate (die size of 10.2 x 20.7 µm²)

[Layout view with three labelled regions: Carry Lookahead Adder, AND Generator, Four Adder Levels.]

Fig. 4.27 Layout of the core of the proposed 8 x 8-bit multiplier (die size of 0.275 x 0.38 mm²)

4.3.2 Pads, Package and Chip Size

Following are some details on the routing, pad, package and chip size of the proposed

multiplier.

Routings: To reduce parasitic capacitances, local interconnections use metal # 2, metal #

3 and metal # 4. The input and output signals go through metal # 5. Avoiding long

overlap between neighbouring metal layers will reduce the coupling capacitances.

Pad: I/O digital pads of the TSMC library “tpz973q”, cell “PDIDGZ”, have been used for connecting the core to the output. Dummy layers are also added to satisfy the density requirement.

Package: Package 68PGA is used for the chip. This package provides the core chip with 36 pins (9 pins on each side). The total area, including the area occupied by the pads, is 1.395 x 1.37 mm². The core area of the chip is 0.275 x 0.38 mm². Fig. 4.28 shows the layout of the entire chip.

[Chip layout: the core surrounded by the pad ring, with the X and Y input pads labelled.]

Fig. 4.28 The proposed 8 x 8-bit multiplier chip

Chapter 5

Conclusion

Digital multipliers are one of the crucial blocks of real-time Digital Signal Processing (DSP) applications ranging from digital filtering to image processing. However, speed of operation is not the only consideration; low power dissipation and small chip area are also needed because of the dense packing of transistors in today's system-on-chip (SoC) applications.

This thesis focuses on an application-specific integrated circuit (ASIC) implementation of a digital multiplier with a speed of operation over 1 GHz. The three main considerations for the design are high multiplication speed, low power consumption, and a small rectangular chip area.

5.1 Features of the Designed Multiplier

The performance of the proposed multiplier has met the objectives of this work. The high performance of the proposed multiplier has been achieved by an efficient design strategy, as follows:

• Several multiplication algorithms have been reviewed. Considering speed as the

priority in system-level, pair-wise algorithm has been chosen.

• The critical building blocks of pair-wise algorithm have been identified and

ranked by their impacts on speed, power and area on the entire design. This

emphasizes the full adder importance.

• An extensive study of the performance of the six main static full adders has been performed in order to select the most power-speed efficient full adder topology.

• The six full adders have been re-designed through an iterative approach for the sake of proper transistor sizing (this approach has also been used in the design of the other required elements to avoid under-sizing and over-sizing transistors). This approach guaranteed power reduction at the circuit design level. The full adders have been examined under equal conditions in a realistic circuit structure.

• Speed and power are treated as equally important during topology selection by using the power-delay product as the measuring factor. Area and driving capability are also taken into account. The comparison results in choosing the pseudo-NMOS full adder.

Table 5.1 Performance of the proposed multiplier

Device Characteristics
Process                                          Five-metal 0.18 µm digital CMOS
Power supply ($V_{dd}$)                          1.8 V

Chip Characteristics
Multiplier & multiplicand                        8 bits
Product                                          16 bits
Multiplication delay                             666 ~ 793 ps
Power consumption (power down)                   19.314 nW
Power consumption @ input frequency 1.1 GHz      18.09 mW
Average power consumption                        10.347 mW
Core size                                        0.1045 mm²
Operating temperature                            -40 °C to +135 °C

The designed multiplier is suitable for high-speed and low-power applications, and provides the multiplication product of two 8-bit numbers in approximately 800 ps. Accurate functioning for supply voltages ranging from 1.8 V to 0.09 V has proved the suitability of the proposed architecture. The power consumption is 18.09 mW at 1.1 GHz. The design is implemented in TSMC 0.18 µm CMOS technology and analyzed using Cadence's Spectre with BSIM3v3 device models.

The proposed 8 x 8-bit multiplier is laid out in 0.18 µm CMOS technology and was verified for design rules and matched with the schematics. The total area is 1.395 x 1.37 mm². The results of the post-layout simulations are reasonably consistent with those found in the design process.

5.2 Comparison Results

In this section, the summarized results of a survey of recent works on digital multipliers are provided in Table 5.2. The selection has been made based on the novelty of the works. The data are extracted from the IEEE Journal of Solid-State Circuits and the results are based on measurements of the actual chips.

As it is seen in this Table, the reported multipliers are implemented in different CMOS

technologies with different bit words. This can have dramatic impacts on criteria of

comparison such as power, speed and silicon area.

As it is seen, different approaches in designing multipliers are taken based on different

design dimensions such as area, power consumption, and speed throughput. However, it

will not be fair if the results of the conducted survey on digital multiplier are directly

compared, as the technology, bit width, target frequency, and simulation methodologies

vary widely. The following discussion provides some indications of multiplier results and

this leads us to evaluate the performance of the proposed multiplier.

When speed is the main concern, the Booth encoding scheme and Wallace tree reduction show their strengths for large-throughput multipliers. However, combining these methodologies with GaAs devices results in a high-speed multiplier that is not feasible for CMOS devices to reach. From this point of view, the proposed multiplier shows its superiority in the speed and power trade-off for medium bit-width (4 to 8 bits) applications.

Table 5.2 Summary of the performance of the recent publications on digital multipliers

Columns: Author (Year) [Ref.]; Multiplier structure; Bit-width; Technology (µm); Power supply (V); Speed (MHz); Core area (mm²); Power consumption (mW)

N. Itoh (2001) [28]; Rectangular-Styled Wallace Tree; 54; 0.18; 1.8; 600; 0.98
J. Butas (2001) [29]; Asynchronous Cross-Pipelined; 16; 0.6; 1.5-5; 59-251; 40.59
S. Kim (2001) [30]; True Single-Phase Adiabatic; 0.5; 2.7; 220; 0.47
J. Lim (2000) [31]; NRERL Serial; 0.6; 2.5; 0.1-1; 2.37
J.S. Wang (2000) [32]; TSPC Flip-Flops; 0.6; 3.3; 300; 0.3; 52.4
A. Smith (1997) [33]; GaAs; 16; 0.6; 0.9; 416; 1.98; 1700
J. Mori (1991) [34]; 4-2 Compressor; 54; 0.5; 3.3; 100; 12.4; 870
K. Yano (1990) [35]; Pass-Transistor; 54; 0.25; 2.5; 227; 12.7
M. Hatamian (1986) [36]; Parallel Pipelined; 2.5; 2.5; 75; 250

In terms of power consumption, asynchronous circuitry and adiabatic logic are viable approaches for applications where speed is not the prime concern. Nevertheless, pass-transistor logic has the properties of both higher speed and lower power consumption (Table 3.12). NMOS reversible energy recovery logic, a new reversible adiabatic logic circuit, is also employed in ultra-low-power applications. The serial multiplier implemented with this logic is suitable for applications where energy consumption is the top priority [30]. The proposed multiplier still stands ahead in power consumption compared to the other designs. However, this structure could be more power efficient, specifically for large bit words, if a pipelining technique is employed.

Where area is the prime concern, the recent progress in the use of deep sub-micron devices can help to overcome this constraint. It is also possible to reduce the silicon area with a tighter layout style such as the rectangular Wallace tree [28].

Therefore, it is recommended that in multiplier performance and area trade-offs, the
combination of several parameters, such as feature size and encoding scheme, be
carefully considered. The encoding scheme has a significant effect on the
implementation area. In the design of the proposed multiplier, area is one of the
criteria for choosing the building blocks. Pseudo-NMOS shows a significant area
saving because it has only 14 transistors with a maximum transistor width of 2 µm.

5.3 Future Work

To further improve the performance of the pair-wise multiplier, one needs to reduce
its critical path delay for longer bit widths while achieving a better trade-off
between speed and power consumption.

A well-known technique to reduce the critical path in digital architectures is to
place pipeline latches at appropriate places so that the functionality of the circuit
remains unchanged and no appreciable reduction in throughput occurs. However, it
requires very accurate time scheduling of the pipeline tasks.
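
The effect of a pipeline latch can be illustrated with a minimal behavioural sketch. It assumes a generic two-stage split of an 8 x 8-bit unsigned multiplication (low nibble, then high nibble of the second operand); this split and the names `stage1`, `stage2` and `run_pipeline` are hypothetical and are not the pair-wise architecture itself.

```python
# Behavioural sketch of a two-stage pipelined 8 x 8-bit unsigned multiplier.
# The stage split below is an assumption chosen for illustration only.

def stage1(a, b):
    """First stage: partial product from the low 4 bits of b."""
    return a, b >> 4, a * (b & 0xF)

def stage2(a, b_high, partial):
    """Second stage: add the shifted partial product from the high 4 bits."""
    return partial + ((a * b_high) << 4)

def run_pipeline(operands):
    """Clock operand pairs through both stages, one new pair per cycle."""
    latch = None      # pipeline latch between stage 1 and stage 2
    results = []
    # One extra cycle (None input) flushes the last item out of the latch.
    for item in list(operands) + [None]:
        if latch is not None:
            results.append(stage2(*latch))  # stage 2 reads last cycle's latch
        latch = stage1(*item) if item is not None else None
    return results

pairs = [(255, 255), (12, 34), (200, 7), (0, 99)]
products = run_pipeline(pairs)
assert products == [a * b for a, b in pairs]
```

In this sketch the results equal the unpipelined products, so the functionality is unchanged. Each result appears one cycle later, while one new operand pair is still accepted every cycle, and the clock period is bounded by the slower stage rather than by the sum of both.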

Pipelining can be considered as an alternative design of the pair-wise multiplier
because this architecture has no feedback loop (Fig. 2.9). The advantages of this
methodology are manifold. Since the proposed architecture permits pipelining, the
operating speed can be considerably increased. This increased speed can be traded
for a reduction in supply voltage to achieve a considerable reduction in power
consumption. This approach makes the pair-wise architecture qualitatively a viable
configuration for constant data streams in DSP applications. However, extensive
quantitative evaluation based on proper simulation arrangements is required to show
the speed and power trade-off.
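
The voltage trade mentioned above can be made concrete with the standard first-order model of dynamic switching power, P = a * C * Vdd^2 * f. The sketch below is only an estimate under stated assumptions: the effective capacitance `C_EFF` and the reduced supply of 1.2 V are hypothetical values, and short-circuit and static power are ignored.

```python
# First-order dynamic power model: P = activity * C * Vdd^2 * f.
# C_EFF and the 1.2 V supply are hypothetical illustration values.

def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """Return switching power in watts for the given operating point."""
    return activity * c_eff * vdd ** 2 * freq

C_EFF = 10e-12   # assumed effective switched capacitance (F)
FREQ = 1.1e9     # clock frequency (Hz)

p_nominal = dynamic_power(C_EFF, 1.8, FREQ)  # nominal 1.8 V supply
p_scaled = dynamic_power(C_EFF, 1.2, FREQ)   # assumed reduced supply
ratio = p_scaled / p_nominal                 # equals (1.2 / 1.8) ** 2
```

Under this model, power falls with the square of the supply voltage at a fixed clock frequency, which is why speed gained through pipelining can be spent on a lower supply. Whether a given supply still meets timing would have to be confirmed by circuit simulation.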

References

[1] C. R. Baugh and B. A. Wooley, “A Two’s Complement Parallel Array Multiplication
Algorithm,” IEEE Trans. Computers, vol. C-22, no. 12, pp. 1045-1047, Dec 1973.

[2] J. S. Wang, “A New True-Single-Phase-Clocked Double-Edge-Triggered Flip-Flop
for Low Power VLSI Designs,” in Proc. IEEE ISCAS 1997, pp. 1896-1899.

[3] A. D. Booth, “A Signed Binary Multiplication Technique,” Quarterly J. Mechanics
and Applied Mathematics, vol. 4, no. 2, pp. 236-240, 1951.

[4] P. Y. Lu et al., “A 30-MFLOP 32b CMOS Floating-Point Processor,” IEEE Solid-
State Circuit Conf. Proceedings, vol. XXXI, pp. 28-29, February 1988.

[5] W. McAllister and D. Zuras, “An NMOS 64b Floating-Point Chip Set,” IEEE Int.
Solid-State Circuits Conf., pp. 34-35, February 1986.

[6] B. J. Benschneider et al., “A 50MHz Uniformly Pipelined 64b Floating-Point
Arithmetic Processor,” IEEE Int. Solid-State Circuits Conf., pp. 50-51, February 1989.

[7] M. Hatamian and G. L. Cash, “High Speed Signal Processing, Pipelining, and VLSI,”
in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 1173-1176, April
1986.

[8] A. P. Chandrakasan, S. Sheng and W. Brodersen, “Low-Power CMOS Digital
Design,” IEEE J. Solid State Circuits, vol. 27, No. 4, April 1992.

[9] A. Khatibzadeh and K. Raahemifar, “A 14-Transistor Low Power High-Speed Full
Adder Cell,” in Proc. 2003 Canadian Conf. on Electrical and Computer Engineering
(CCECE2003), Montreal, May 2003.

[10] A. Khatibzadeh and K. Raahemifar, “A Study & Comparison of Full Adder Cells
Based on the Standard Static CMOS Logics,” in Proc. 2004 Canadian Conf. on
Electrical and Computer Engineering (CCECE2004), Niagara, May 2004.

[11] J. Yuan and C. Svensson, “High-Speed CMOS Circuit Technique,” IEEE J. Solid
State Circuits, vol. 24, No. 1, February 1989.

[12] K. Hwang, “Computer Arithmetic: Principles, Architecture, and Design,” John
Wiley and Sons, 1979.

[13] J. J. F. Cavanagh, “Computer Science Series: Digital Computer Arithmetic,”
McGraw-Hill Book Co., 1984.

[14] K. Raahemifar and M. Ahmadi, “A Fast 32-bit Digital Multiplier,” in Proc. of the 8th
IEEE International Conf. on Electronics, Circuits and Systems (ICECAS), Malta, Sept.
2001, pp. 1413-1416.

[15] R. Zimmermann and W. Fichtner, “Low-power Logic Styles: CMOS versus Pass-
Transistor Logic,” IEEE J. Solid State Circuits, vol. 32, pp. 1079-90, July 1997.

[16] A. Parameswar et al., “A High Speed, Low Power, Swing Restored Pass-Transistor
Logic Based on Multiply and Accumulate Circuit for Multimedia Applications,” IEEE
CICC, May 1994, pp. 278-281.

[17] Y. Sasaki et al., “Pass-Transistor Based on Gate Array Architectures,” in 1995
Symp. VLSI Circuits, Dig. Tech. Papers, June 1995, pp. 123-124.

[18] M. Suzuki et al., “A 1.5-ns 32-b CMOS ALU in Double Pass-Transistor Logic,”
IEEE J. Solid State Circuits, vol. 28, no. 11, pp. 1145-1151, November 1993.

[19] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, A System
Perspective, MA: Addison-Wesley, 1993.

[20] A. Shams and M. Bayoumi, “A Modular Approach for Designing Low Power
Adders”, Proc. ASILOMAR, June 1997.

[21] J. M. Wang et al., “New Efficient Design for XOR & XNOR Functions on the
Transistor Level,” IEEE J. Solid State Circuits, vol. 29, no. 7, pp. 780-786, July 1994.

[22] E. Abu-Shama and M. Bayoumi, “A New Cell for Low-Power Adder,” in Proc. Int.
Midwest Symp. Circuits Syst., 1995.

[23] J.M. Rabaey, “Digital Integrated Circuits,” Prentice Hall, 1996.

[24] A. Bellauar and M. Elmasry, “Low-Power Digital VLSI Design Circuit and
System,” Kluwer Academic Publishers, 1995.

[25] S. M. Kang, “Accurate Simulation of Power Dissipation in VLSI Circuits,” IEEE J.
Solid State Circuits, vol. 21, no. 5, pp. 889-891, October 1986.

[26] H. J. M. Veendrick, “Short-Circuit Dissipation of Static CMOS Circuitry and Its
Impact on the Design of Buffer Circuits,” IEEE J. Solid State Circuits, vol. 19, no. 4, pp.
468-473, August 1984.

[27] G. J. Fisher, “An Enhanced Power Meter for SPICE2 Circuit Simulation,” IEEE
Trans. Computer-Aided Design, vol. 7, pp. 641-643, May 1988.

[28] N. Itoh, Y. Naemura, H. Makino, Y. Nakase, T. Yoshihara and Y. Horiba, “A 600-
MHz 54 x 54-bit Multiplier with Rectangular-Styled Wallace Tree,” IEEE J. Solid State
Circuits, vol. 36, No. 2, February 2001.

[29] J. Butas, C. Choy, J. Povanzenec and C. F. Chan, “Asynchronous Cross-Pipelined
Multiplier,” IEEE J. Solid State Circuits, vol. 36, No. 8, August 2001.

[30] S. Kim, C. H. Ziesler and M. C. Papaefthymiou, “A True Single-Phase 8-bit Adiabatic
Multiplier,” Design Automation Conference, 2001, Proceedings, pp. 758-763.

[31] J. Lim, D. Kim, S. Kang and S. Chae, “An 8 x 8-b NRERL Serial Multiplier for
Ultra-low-power Application,” IEE Proceedings, vol. 146, pp. 327-333, Dec 2000.

[32] J. S. Wang, P. H. Yang and D. Sheng, “Design of a 3-V 300-MHz Low-Power 8 x 8-b
Pipelined Multiplier Using Pulse-Triggered TSPC Flip-Flops,” IEEE J. Solid State
Circuits, vol. 35, No. 4, April 2000.

[33] A. B. Smith, N. Burgess, S. Cui and M. Liebelt, “GaAs Multiplier Design for High-
Speed DSP Application,” Thirty-first ASILOMAR Conference, 1997.

[34] J. Mori et al., “A 10-ns 54 x 54-bit Parallel Structured Full Array Multiplier
with 0.5-µm CMOS Technology,” IEEE J. Solid State Circuits, vol. 26, No. 4, April 1991.

[35] K. Yano, “A 3.8-ns CMOS 16 x 16-bit Multiplier Using Complementary Pass-
Transistor Logic,” IEEE J. Solid State Circuits, vol. 25, pp. 388-395, April 1990.

[36] M. Hatamian and G. Cash, “A 70-MHz 8 x 8-bit Parallel Pipelined Multiplier in
2.5-µm CMOS,” IEEE J. Solid State Circuits, vol. SC-21, No. 4, August 1986.
