Fast Multiplication Algorithms
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
By
Gary W. Bewick
February 1994
c Copyright 1994 by Gary W. Bewick
All Rights Reserved
I certify that I have read this dissertation and that in my
opinion it is fully adequate, in scope and in quality, as a
dissertation for the degree of Doctor of Philosophy.
Michael J. Flynn
(Principal Adviser)
Mark A. Horowitz
Constance J. Chang-Hasnain
iii
Abstract
This thesis investigates methods of implementing binary multiplication with the smallest
possible latency. The principal area of concentration is multipliers with lengths of 53
bits, which makes the results suitable for IEEE-754 double precision multiplication.
Low latency demands high performance circuitry and small physical size to limit propagation
delays. VLSI implementations are the only available means of meeting these two
requirements, but efficient algorithms are also crucial. An extension to Booth’s algorithm
for multiplication (redundant Booth) has been developed, which represents partial products
in a partially redundant form. This redundant representation can reduce or eliminate the
time required to produce "hard" multiples (multiples that require a carry propagate addition)
required by the traditional higher order Booth algorithms. This extension reduces the
area and power requirements of fully parallel implementations, yet is as fast as any
multiplication method yet reported.
In order to evaluate various multiplication algorithms, a software tool has been developed
which automates the layout and optimization of parallel multiplier trees. The tool
takes into consideration wire and asymmetric input delays, as well as gate delays, as the tree
is built. The tool is used to design multipliers based upon various algorithms, using
Booth encoded, non-Booth encoded, and the new extended Booth algorithms. The designs
are then compared on the basis of delay, power, and area.
For maximum speed, the designs are based upon a 0.6µ BiCMOS process using emitter
coupled logic (ECL). The algorithms developed in this thesis make possible 53x53
multipliers with a latency of less than 2.6 nanoseconds @ 10.5 Watts and a layout area of
13 mm². Smaller and lower power designs are also possible, as illustrated by an example
with a latency of 3.6 nanoseconds @ 5.8 W and an area of 8.9 mm². The conclusions based
upon ECL designs are extended where possible to other technologies (CMOS).
Crucial to the performance of multipliers are high speed carry propagate adders. A
number of high speed adder designs have been developed, and the algorithms and design
of these adders are discussed.
The implementations developed for this study indicate that traditional Booth encoded
multipliers are superior in layout area, power, and delay to non-Booth encoded multipliers.
Redundant Booth encoding further reduces the area and power requirements. Finally, only
half of the total multiplier delay was found to be due to the summation of the partial
products. The remaining delay was due to wires and carry propagate adder delays.
Acknowledgements
The work presented in this thesis would not have been possible without the assistance and
cooperation of many people and organizations. I would like to thank the people at Philips
Research Laboratories - Sunnyvale, especially Peter Baltus and Uzi Bar-Gadda for their
assistance and support during my early years here at Stanford. I am also grateful to the
people at Sun Microsystems Inc., specifically George Taylor, Mark Santoro and the entire
P200 gang. I would like to extend thanks to the members of my committee, Constance
Chang-Hasnain, Giovanni De Micheli and Mark Horowitz for their time and patience.
Mark, in particular, provided many helpful suggestions for this thesis.
Finally I would like to thank my advisor, colleague, and friend Michael Flynn for
providing guidance and keeping me on track, but also allowing me the freedom to pursue
areas in my own way and at my own pace. Mike was always there when I needed someone
to bounce ideas off of, or needed support, or requested guidance. My years at Stanford
were hard work, sometimes frustrating, but I always had fun.
The work presented in this thesis was supported by NSF under contract MIP88-22961.
Contents
Abstract iv
Acknowledgements vi
1 Introduction 1
1.1 Technology Options 1
1.1.1 CMOS 2
1.1.2 ECL 3
1.2 Technology Choice 5
1.3 Multiplication Architectures 5
1.3.1 Iterative 5
1.3.2 Linear Arrays 6
1.3.3 Parallel Addition (Trees) 6
1.3.4 Wallace Trees 8
1.4 Architectural Choices 9
1.5 Thesis Structure 11
2.2.1 Booth 3 with Fully Redundant Partial Products 22
2.2.2 Booth 3 with Partially Redundant Partial Products 24
2.2.3 Booth with Bias 27
2.2.4 Redundant Booth 3 32
2.2.5 Redundant Booth 4 33
2.2.6 Choosing the Adder Length 39
2.3 Summary 40
4 Implementing Multipliers 68
4.1 Overview 68
4.2 Delay Model 70
4.3 Placement Methodology 71
4.3.1 Partial Product Generator 71
4.3.2 Placing the CSAs 80
4.3.3 Tree Folding 86
4.3.4 Optimizations 89
4.4 Verification and Simulation 94
4.5 Summary 95
6 Conclusions 135
B.3 Ways of Computing the Sticky 145
B.4 An Improved Method 146
B.5 The -1 Constant 149
Bibliography 152
List of Tables
List of Figures
2.17 Summing K, Multiple and Z. 29
2.18 Producing K + 3M in partially redundant form. 31
2.19 Producing other multiples. 32
2.20 16 x 16 redundant Booth 3. 33
2.21 16 bit partially redundant Booth 3 multiply. 34
2.22 Partial product selector for redundant Booth 3. 35
2.23 Producing K + 6M from K + 3M? 36
2.24 A different bias constant for 6M and 3M. 38
2.25 Redundant Booth 3 with 6 bit adders. 39
4.6 Dots that connect to bit 2 of the multiplicand. 74
4.7 Multiplexers with the same arithmetic weight. 75
4.8 Physical placement of partial product multiplexers. 76
4.9 Alignment and misalignment of multiplexers. 77
4.10 Multiplexer placement for 8x8 multiplier. 78
4.11 Aligned partial products. 79
4.12 Geometry for a CSA. 81
4.13 Why half adders are needed. 83
4.14 Transforming two HA’s into a single CSA. 84
4.15 Interchanging a half adder and a carry save adder. 85
4.16 Right hand partial product multiplexers. 87
4.17 Multiplexers folded under. 88
4.18 Embedding CSA within the multiplexers. 90
4.19 Elimination of wire crossing. 91
4.20 Differential inverter. 92
5.16 Power of redundant Booth 3 implementations. 123
5.17 Delay comparison of multiplication algorithms. 126
5.18 Area comparison of multiplication algorithms. 127
5.19 Power comparison of multiplication algorithms. 128
5.20 Floor plan of multiplier chip. 130
5.21 Photo of 53x53 multiplier chip. 131
Chapter 1
Introduction
As the performance of processors has increased, the demand for high speed arithmetic
blocks has also increased. With clock frequencies approaching 1 GHz, arithmetic blocks
must keep pace with the continued demand for more computational power. The purpose
of this thesis is to present methods of implementing high speed binary multiplication. Both
the algorithms used to perform multiplication and the actual implementation procedures
are addressed. The emphasis of this thesis is on minimizing the latency, with
the goal being the implementation of the fastest multiplication blocks possible.
CHAPTER 1. INTRODUCTION 2
1.1.1 CMOS
CMOS (Complementary Metal Oxide Semiconductor) is the primary technology in the
semiconductor industry at the present time. Most high speed microprocessors are implemented
using CMOS. Contemporary CMOS technology is characterized by:

Small minimum sized transistors - allowing for dense layouts, although the interconnect
limits the density.

Large required transistors - In order to drive wires quickly, large width transistors are
needed, since the time to drive a load is given by:

    ∆t = C ∆V / i     (1.2)

where C is the capacitance of the load, ∆V is the voltage swing, and i is the available
drive current.

Large voltage swings - Typical voltage swings for contemporary CMOS are from
3.3 to 5 volts (with even smaller swings on the way). All other things being equal,
equation 1.2 says that a smaller voltage swing will be proportionally faster.
BiCMOS
BiCMOS generally refers to CMOS-BiCMOS where bipolar transistors are used to improve
the driving capability of CMOS logic elements (Figure 1.1). In general this will improve
the driving capability of relatively long wires by about a factor of two [2] [22]. A parallel
multiplier does indeed have some long wires, and the long wires contribute significantly
to the total delay, but the delay is not dominated by the long wires. A large number of
short wires also contribute significantly to delay. The net effect is perhaps a 20 to 30%
improvement in performance. The addition of the bipolar transistors increases the process
complexity significantly and it is not clear that the additional complexity is worth this level
of improvement.
1.1.2 ECL
ECL (emitter coupled logic) [20] uses bipolar transistors exclusively to produce various
logic elements (Figure 1.2). The primary advantage of bipolar transistors is that they have
an exponential turn-on characteristic; that is, the current through the device is exponentially
related to the base-emitter voltage. This allows extremely small voltage swings (0.5V)
in logic elements. Referring back to Equation 1.2, this results in a proportional speed up
in the basic logic element. For highest speed the bipolar transistors must be kept from
saturating, which means that they must be used in a current switching mode. Unlike CMOS
or BiCMOS, ECL logic elements dissipate power even when not switching, resulting in
a very high DC power consumption. The total power consumption is relatively independent
of frequency, so even at extremely high frequencies the power consumption will be about the
same as the DC power consumption. In contrast, CMOS or BiCMOS power increases with
frequency. Even at high frequencies, CMOS probably has a better speed-power product
than ECL, but this depends on the exact nature of the circuitry. A partial solution to the
high power consumption problem of ECL is to build relatively complex gates, for example
building a full adder directly rather than building it from NOR gates. Other methods of
reducing power are described in Chapter 4.
Differential ECL
Differential ECL is a simple variation on regular ECL which uses two wires to represent a
single logic signal, with each wire having half the normal voltage swing. To first order,
this means that differential ECL is approximately twice as fast as ECL (Equation 1.2), but
1.3.1 Iterative
The simplest method of adding a series of partial products is shown in Figure 1.3. It is based
upon an adder-accumulator, along with a partial product generator and a hard wired shifter.
This is relatively slow, because adding N partial products requires N clock cycles. The
easiest clocking scheme is to make use of the system clock, if the multiplier is embedded
in a larger system. The system clock is normally much slower than the maximum speed at
which the simple iterative multiplier can be clocked, so if the delay is to be minimized an
expensive and tricky clock multiplier is needed, or the hardware must be self-clocking.
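The iterative scheme can be modelled in software (a behavioural sketch of the adder-accumulator, not a circuit description; the function names are mine):

```python
def iterative_multiply(multiplicand, multiplier, n=16):
    """Adder-accumulator model of the iterative multiplier: one partial
    product is generated and added per clock cycle, so an n bit multiplier
    costs n cycles."""
    acc = 0
    for cycle in range(n):
        if (multiplier >> cycle) & 1:        # partial product generator
            acc += multiplicand << cycle     # adder plus hard wired shift
    return acc
```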
[Figure 1.3: Iterative multiplier built from a multiplicand register, a partial product generator, an adder, and a right shift accumulator.]

[Figure 1.4: Linear array multiplier: a cascade of partial product generators, adders, and right shifts. Reduces 3 partial products per clock.]
[Figure: Adding partial products with a tree of adders; the example sums the partial products to the product in 3 adder delays.]
faster for larger values of N. On the down side, the extra complexity in the interconnection
of the adders may contribute to additional size and delay.
[Figure: A row of carry save adders (CSAs): each CSA takes bits a, b, c of three input operands and produces two output bits, so a row of CSAs reduces three numbers to two.]
partial products can be added and reduced to 2 numbers without a carry propagate adder. A
single carry propagate addition is only needed in the final step to reduce the 2 numbers to a
single, final product. The general method can be applied to trees and linear arrays alike to
improve the performance.
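The carry save idea can be checked numerically on whole words (an illustrative model, not from the thesis; Python integers stand in for the rows of dots):

```python
def csa(a, b, c):
    """3-2 carry save adder applied across a whole word: three numbers in,
    a sum word and a carry word out, with no carry propagation."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word

def reduce_to_two(partial_products):
    """Compress a list of numbers with CSAs until only two remain; a single
    carry propagate addition then produces the final product."""
    while len(partial_products) > 2:
        a, b, c = partial_products[:3]
        partial_products = partial_products[3:] + list(csa(a, b, c))
    return partial_products
```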
Binary Trees
The tree structure described by Wallace suffers from irregular interconnections and is
difficult to lay out. A more regular tree structure is described by [24], [37], and [30], all
of which are based upon binary trees. A binary tree can be constructed by using a row of
4-2 counters¹, which accepts 4 numbers and sums them to produce 2 numbers. Although
this improves the layout problem, there are still irregularities, an example of which is
shown in Figure 1.7. This figure shows the reduction of 8 partial products in two levels of
4-2 counters to two numbers, which would then be reduced to a final product by a carry
propagate adder. The shifting of the partial products introduces zeros at various places in
the reduction. These zeros represent either hardware inefficiency, if the zeros are actually
added, or irregularities in the tree if special counters are built to explicitly exclude the zeros
from the summation. The figure shows bits that jump levels (gray dots), and more counters
in the row making up the second level of counters (12) than there are in the rows making up
the first level of counters (9). All of these effects contribute to irregularities in the layout,
although it is still more regular than a Wallace tree.
1. 4-2 adders, as used by Santoro [24] and Weinberger [37], are easily constructed from two CSAs; however, in some technologies a more direct method may be faster.
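The footnote's construction of a 4-2 adder from two CSAs can be verified numerically (a word level sketch of my own, not from the thesis):

```python
def csa(a, b, c):
    # 3-2 carry save adder applied across a whole word
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def four_to_two(a, b, c, d):
    """A 4-2 adder built from two cascaded CSAs: four numbers in, two
    numbers out, with the total preserved."""
    s1, c1 = csa(a, b, c)
    return csa(s1, c1, d)
```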
[Figure 1.7: Reduction of 8 partial products by two levels of 4-2 counters, each box representing a single 4-2 counter. Zeros introduced by the shifting of the partial products appear at various positions, and the final output goes to a carry propagate adder.]
For this reason, architectures based upon CSAs will be considered exclusively. To overcome the
wiring complexity of the direct usage of CSAs, an automated tool will be used to implement
multiplier trees. This tool is described in detail in later chapters, and is responsible for
placement, wiring, and optimization of multiplier tree structures.
Chapter 2 - Begins the main contribution of this thesis by reviewing existing partial
product generation algorithms. A new class of algorithms (redundant Booth), a
variation on more conventional algorithms, is described.
Chapter 3 - Presents the design of various carry propagate adders and multiple
generators. Carry propagate adders play a crucial role in the design of high speed
multipliers. After the partial products are reduced as far as possible in a redundant
form, a carry propagate addition is needed to produce the final product. This addition
consumes a significant fraction of the total multiply time.
Chapter 4 - Describes a software tool that has been developed for this thesis, which
automatically produces the layout and wiring of multiplier trees of various sizes and
algorithms. The tool also performs a number of optimizations to reduce the layout
area and increase the speed.
Chapter 6 - Closes the main body of this thesis by noting that the delays of all pieces of a
multiplier are important. In particular, long control wire delays, multiple distribution,
and carry propagate adder delays are at least as important in determining the overall
performance as the partial product summing delay.
Chapter 2
CHAPTER 2. GENERATING PARTIAL PRODUCTS 14
multiplication from the sign handling. The methods are all easily extended to deal with
signed numbers, an example of which is presented in Appendix A.
2.1 Background
[Figure 2.1: Dot diagram for simple 16 x 16 multiplication. The partial products form rows of dots above a double length product row; the partial product selection table reads: multiplier bit 0 selects 0, multiplier bit 1 selects the multiplicand.]
Each dot in the diagram is a placeholder for a single bit, which can be a zero or one. The partial
products are represented by a horizontal row of dots, and the selection method used in
producing each partial product is shown by the table in the upper left corner. The partial
products are shifted to account for the differing arithmetic weight of the bits in the multiplier,
aligning dots of the same arithmetic weight vertically. The final product is represented by
the double length row of dots at the bottom. To further illustrate simple multiplication, an
example using real numbers is shown in Figure 2.2.
[Figure 2.2: Simple 16 x 16 multiplication example. Multiplicand 1001110010110111; each 1 in the multiplier selects a shifted copy of the multiplicand, and the rows sum to the product 10011000010000000001010101100011 = 2554336611 (decimal).]
Roughly speaking, the number of dots (256 for Figure 2.1) in the partial product section
of the dot diagram is proportional to the amount of hardware required (time multiplexing can
reduce the hardware requirement, at the cost of slower operation [25]) to sum the partial
products and form the final product. The latency of an implementation of a particular
algorithm is also related to the height of the partial product section (i.e., the maximum
number of dots in any vertical column) of the dot diagram. This relationship can vary from
logarithmic (tree implementation where interconnect delays are insignificant) to linear
(array implementation where interconnect delays are constant) to something in between
(tree implementations where interconnect delays are significant). But independent of the
implementation, adding fewer partial products is always better.
Finally, the logic which selects the partial products can be deduced from the partial
product selection table. For the simple multiplication algorithm, the logic consists of a
single AND gate per bit as shown in Figure 2.3. This figure shows the selection logic for
a single partial product (a single row of dots). Frequently this logic can be merged directly
into whatever hardware is being used to sum the partial products. This merging can reduce
the delay of the logic elements to the point where the extra time due to the selection elements
can be ignored. However, in a real implementation there will still be interconnect delay
due to the physical separation of the common inputs of each AND gate, and distribution of
the multiplicand to the selection elements.
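The selection logic for the simple algorithm amounts to an AND gate per bit; it can be modelled in software (an illustrative sketch of my own, with the result checked against built-in multiplication):

```python
def partial_products(multiplicand, multiplier, n=16):
    """Simple multiplication: partial product j is the multiplicand ANDed
    with multiplier bit j, shifted left by j to account for its arithmetic
    weight."""
    pps = []
    for j in range(n):
        bit = (multiplier >> j) & 1          # one AND gate per bit
        pps.append((multiplicand if bit else 0) << j)
    return pps
```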
[Figure: Booth 2 dot diagram for a 16 x 16 multiply. Each partial product row carries sign encoding bits S, with the following partial product selection table:

Multiplier Bits   Selection
000               +0
001               + Multiplicand
010               + Multiplicand
011               +2 x Multiplicand
100               -2 x Multiplicand
101               - Multiplicand
110               - Multiplicand
111               -0

S = 0 if the partial product is positive (top 4 entries from table); S = 1 if the partial product is negative (bottom 4 entries from table).]
2.1.3 Booth 3
Actually, Booth’s algorithm can produce shift amounts between adjacent partial products of
greater than 2 [17], with a corresponding reduction in the height and number of dots in the
dot diagram. A 3 bit Booth (Booth 3) dot diagram is shown in Figure 2.7, and an example
is shown in Figure 2.8. Each partial product could be from the set {0, ±M, ±2M, ±3M,
±4M}. All multiples with the exception of ±3M are easily obtained by simple shifting and
complementing of the multiplicand. The number of dots, constants, and sign bits to be
added is now 126 (for the 16 x 16 example) and the height of the partial product section is
now 6.
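Booth 3 recoding can be sketched the same way (my own illustrative code, unsigned operands only): overlapping 4 bit groups of the multiplier produce radix-8 digits in {-4, ..., +4}, each selecting one of the multiples above.

```python
def booth3_digits(x, n=16):
    """Radix-8 Booth recoding: overlapping 4 bit groups (3 new bits plus
    the top bit of the previous group); digit = -4a + 2b + c + prev selects
    a multiple from {0, +-M, +-2M, +-3M, +-4M}."""
    bits = [(x >> i) & 1 for i in range(n)] + [0, 0, 0]
    prev = 0
    digits = []
    for j in range(0, n, 3):
        digits.append(-4 * bits[j + 2] + 2 * bits[j + 1] + bits[j] + prev)
        prev = bits[j + 2]
    return digits

def booth3_multiply(m, x, n=16):
    # only +-3M is a "hard" multiple; every other selection is a shift
    # and/or complement of the multiplicand
    return sum(d * (8 ** j) * m for j, d in enumerate(booth3_digits(x, n)))
```

Note that a 16 bit multiplier yields only six digits, matching the height of 6 quoted above.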
Generation of the multiple 3M (referred to as a hard multiple, since it cannot be obtained
via simple shifting and complementing of the multiplicand) generally requires some kind
of carry propagate adder. This carry propagate adder may increase the latency, mainly
due to the long wires that are required for propagating carries from the less significant
to the more significant bits. Sometimes the generation of this multiple can be overlapped with
an operation which sets up the multiply (for example, the fetching of the multiplier).
Another drawback to this algorithm is the complexity of the partial product selection
[Figure: Booth 2 partial product selection logic: a Booth decoder derives Select M, Select 2M, and negate controls from a 3 bit multiplier group, driving a row of And/Or/Exclusive-Or blocks (one shown, plus 12 more) that produce one partial product from the multiplicand.]
[Figure 2.7: Booth 3 dot diagram for a 16 x 16 multiply, with the following partial product selection table:

Multiplier Bits   Selection             Multiplier Bits   Selection
0000              +0                    1000              -4 x Multiplicand
0001              + Multiplicand        1001              -3 x Multiplicand
0010              + Multiplicand        1010              -3 x Multiplicand
0011              +2 x Multiplicand     1011              -2 x Multiplicand
0100              +2 x Multiplicand     1100              -2 x Multiplicand
0101              +3 x Multiplicand     1101              - Multiplicand
0110              +3 x Multiplicand     1110              - Multiplicand
0111              +4 x Multiplicand     1111              -0

S = 0 if the partial product is positive (left-hand side of table); S = 1 if the partial product is negative (right-hand side of table).]

[Figure 2.8: 16 x 16 Booth 3 multiplication example; final product 10011000010000000001010101100011.]
logic, an example of which is shown in Figure 2.9, along with the extra wiring needed for
routing the 3M multiple.
[Figure 2.9: Booth 3 partial product selection logic: a Booth decoder derives Select M, Select 2M, Select 3M, and Select 4M from a 4 bit multiplier group, driving 1 of 18 multiplexer blocks, each choosing among multiplicand bits k, k-1, k-2 and bit k of 3 x Multiplicand.]
[Figure: Booth 3 dot diagram with fully redundant partial products; each partial product row carries sign encoding bits S.]

[Figure: 16 x 16 Booth 3 multiplication example in fully redundant form, using the multiples -3M, -M, +3M, -4M, -0, and +2M; final product 10011000010000000001010101100011.]

[Figure 2.13: Producing 3M in partially redundant form: M and 2M are summed by a row of small 4 bit adders whose carry outputs (C) are kept separate from the sum bits, leaving large gaps of zeros between the carries.]

[Figure: Negating a multiple in partially redundant form.]
conditions simultaneously.
[Figure: Redundant Booth 3 dot diagram; each partial product has a bias constant
added to it before being summed to form the final product. The bias constant (K) is the same
for both positive and negative multiples1 of a single partial product, but different partial
products can have different bias constants. The only restriction is that K, for a given partial
product, cannot depend on the particular multiple selected for use in producing the partial
product. With this assumption, the constants for each partial product can be added (at
design time!) and the negative of this sum added to the partial products (the Compensation
constant). The net result is that zero has been added to the partial products, so the final
product is unchanged.
1. The entries from the right side of the table in Figure 2.15 will continue to be considered as negative multiples.
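The bias bookkeeping described above is easy to check numerically (a sketch with made-up values; the function name is mine): adding a bias K to each partial product and then adding the negated sum of the K's, the compensation constant, leaves the total unchanged.

```python
def biased_sum(partial_products, bias_constants):
    """Sum partial products after adding a bias constant to each; the
    compensation constant, computed at design time, cancels the biases
    exactly, so the final product is unchanged."""
    compensation = -sum(bias_constants)
    biased = [p + k for p, k in zip(partial_products, bias_constants)]
    return sum(biased) + compensation
```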
The value of the bias constant K is chosen in such a manner that the creation of negative
partial products is a simple operation, as it is for the conventional Booth algorithms. To find
an appropriate value for this constant, consider a multiple in the partially redundant form of
Figure 2.13 and choose a value for K such that there is a 1 in the positions where a "C" dot
appears and zero elsewhere, as shown in the top part of Figure 2.16. The topmost circled
[Figure 2.16: Choosing the bias constant so that it has a 1 in each position where a "C" dot appears: K = 000010001000100000. Each circled group of a "C" dot, the dot below it, and the constant 1 is summed to produce the two dots "X" and "Y", giving K + Multiple in redundant form.]
section enclosing 3 vertical items (two dots and the constant 1) can be summed as per the
middle part of the figure, producing the dots "X" and "Y". The three items so summed can
be replaced by the equivalent two dots, shown in the bottom part of the figure, to produce
a redundant form for the sum of K and the multiple. This is very similar to the simple
redundant form described earlier, in that there are large gaps of zeros in the multiple. The
key advantage of this form is that the value of K - Multiple can be obtained very simply
from the value of K + Multiple.
Figure 2.17 shows the sum of K + Multiple with a value Z which is formed by the bit
by bit complement of the non-zero portions of K + Multiple and the constant 1 in the lsb.
When these two values are summed together, the result is 2K (this assumes proper sign
[Figure 2.17: Summing K + Multiple and Z, where Z is the bit by bit complement of the non-blank components of K + Multiple, with a 1 added in at the lsb. The result is 2K:]

    K + Multiple + Z = 2K
    Z = K - Multiple
    PP = A + B + K
       =   Xn-1 Xn-2 ... Xk ... Xi ... X1 X0
         +    0    0 ... Yk  0 ... Yi 0 ... 0  0
The desired behaviour is to be able to "negate" the partial product PP by complementing all
the bits of X and the non-zero components of Y, and then adding 1. It is not really negation,
because the bias constant K must be the same in both the positive and "negative" forms.
That is:
    "negative" of PP = -(A + B) + K
                     =   ~Xn-1 ~Xn-2 ... ~Xk ... ~Xi ... ~X1 ~X0 + 1
                       +     0     0 ... ~Yk  0 ... ~Yi 0 ...  0   0        (2.1)

(where ~ denotes the bit by bit complement)

    -PP = -(A + B + K)
        =   ~Xn-1 ~Xn-2 ... ~Xk ... ~Xi ... ~X1 ~X0 + 1
          +     1     1 ... ~Yk  1 ... ~Yi 1 ...  1   1  + 1                (2.2)
So all the long strings of 0’s in Y have become long strings of 1’s, as mentioned previously.
The undesirable strings of 1’s can be pulled out and assembled into a separate constant, and
the "negative" of PP can be substituted:

    -PP =   ~Xn-1 ~Xn-2 ... ~Xk ... ~Xi ... ~X1 ~X0 + 1
          +     0     0 ... ~Yk  0 ... ~Yi 0 ...  0   0
          +     1     1 ...   0  1 ...   0 1 ...  1   1  + 1
        = "negative" of PP
          +     1     1 ...   0  1 ...   0 1 ...  1   1  + 1
    -(A + B + K) = -(A + B) + K + 1 1 ... 0 1 ... 0 1 ... 1 1 + 1
             -2K = 1 1 ... 0 1 ... 0 1 ... 1 1 + 1
              2K = 0 0 ... 1 0 ... 1 0 ... 0 0
which again gives the same value for K. The partially redundant form described above
satisfies the two conditions presented earlier: it has the same representation for both
positive and negative multiples, and it is easy to generate the negative given the positive
form.
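The negation recipe can be verified numerically. In the sketch below (my own illustration, not from the thesis), the dense word X covers all bit positions and the sparse Y bits sit at a few fixed positions; the word width and positions are my own choices, with K taken from Figure 2.18.

```python
N = 18                       # width of the dense word (illustrative)
POS = (6, 10, 14)            # positions of the sparse Y bits (illustrative)
K = sum(1 << (p - 1) for p in POS)   # 0b000010001000100000, as in Figure 2.18

def value(x, y):
    """Numeric value of a partially redundant form: dense word x plus
    sparse bits y (a dict mapping position -> bit)."""
    return x + sum(b << p for p, b in y.items())

def negate(x, y):
    """Form K - Multiple from K + Multiple: complement the dense word and
    the sparse bits, then add 1 at the lsb (working modulo 2**N)."""
    xc = (x ^ ((1 << N) - 1)) + 1
    yc = {p: 1 - b for p, b in y.items()}
    return xc, yc

# (K + Multiple) + (K - Multiple) == 2K, modulo the word width
x, y = 0b101010101010, {6: 1, 10: 0, 14: 1}
check = (value(x, y) + value(*negate(x, y))) % (1 << N)
```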
Figure 2.18 shows in detail how the biased multiple K + 3M is produced from M and 2M
using 4 bit adders and some simple logic gates. The simple logic gates will not increase
[Figure 2.18: Producing K + 3M in partially redundant form: M and 2M are summed by a row of 4 bit adders, and simple logic gates combine each adder's carry-out with a constant 1 to form the X and Y dots; K = 000010001000100000.]
the time needed to produce the biased multiple if the carry-out and the least significant bit
from the small adder are available early. This is usually easy to assure. The other required
biased multiples are produced by simple shifting and inverting of the multiplicand as shown
in Figure 2.19. In this figure the bits of the multiplicand (M) are numbered (lsb = 0) so that
[Figure 2.19: Producing the other biased multiples K + 0, K + M, K + 2M, and K + 4M by simple shifting of the multiplicand, with the bits of M numbered from the lsb (= 0); K = 000010001000100000.]
[Dot diagram: redundant Booth 3 partial products for a 16 x 16 multiply, including the
compensation constant.]

The dot count is somewhat higher than that of the Booth 3 algorithm (previously given
as 126). The height² is 7, which is one more than that for the Booth 3 algorithm. Each of
these measures is less than that for the Booth 2 algorithm (although the cost of the small
adders is not reflected in this count).
A detailed example for the redundant Booth 3 algorithm is shown in Figure 2.21. This
example uses 4 bit adders as per Figure 2.18 to produce the multiple K + 3M. All of the
multiples are shown in detail at the top of the figure.
The partial product selectors can be built out of a single multiplexer block, as shown in
Figure 2.22. This figure shows how a single partial product is built out of the multiplicand
and K + 3M generated by logic in Figure 2.18.
²The diagram indicates a single column (20) with height 8, but this can be reduced to 7 by
manipulation of the S bits and the compensation constant.
CHAPTER 2. GENERATING PARTIAL PRODUCTS 34
[Figure 2.21: A detailed 16 x 16 redundant Booth 3 example, showing the compensation
constant and the selected multiples K-3M, K-M, K+3M, K-4M, K-0, and 2M.]

[Figure 2.22: Partial product selector. Each partial product is built from a single row
of identical multiplexer blocks; all corresponding select and invert inputs are wired
together and driven by the Booth decoder.]
[Figure: producing K + 6M from K + 3M. A simple left shift of K + 3M produces
2K + 6M, which is not the desired K + 6M, so simple correction logic (EXOR, AND, and
OR gates on the carry bits) restores the proper bias.]

The costs of this technique are the extra wires that must be routed to the partial product
multiplexers, and the increased complexity of the partial product multiplexers themselves.
[Dot diagram: a redundant Booth 3 algorithm with a carry interval of 6 bits, including
the compensation constant.]

The diagram above shows a redundant Booth 3 algorithm, with a carry interval of 6 bits. Note the accumulation of
dots at certain positions in the dot diagram. In particular, the column forming bit 15 of the
product is now 8 high (vs 7 for a 4 bit carry interval). This accumulation can be avoided by
choosing adder lengths which are relatively prime to the shift amount between neighboring
partial products (in this case, 3). This spreads the Y bits out so that accumulation won’t
occur in any particular column.
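The relatively-prime argument can be checked with a small script (a sketch; the row
count, word length, and carry positions are illustrative):

```python
def carry_columns(rows, width, shift, interval):
    """Columns where the redundant carry (Y) bits land when partial product
    j is shifted left by j*shift bits and a carry bit appears every
    `interval` bits within it."""
    return [j * shift + p
            for j in range(rows)
            for p in range(interval, width, interval)]

# Booth 3: neighboring partial products are shifted by 3 bits.
# A 6 bit carry interval shares the factor 3 with the shift, so every
# carry bit lands in a column divisible by 3 and the dots pile up:
clumped = carry_columns(rows=6, width=18, shift=3, interval=6)
assert all(c % 3 == 0 for c in clumped)

# A 4 bit interval is relatively prime to the shift, so the carry bits
# are spread across all column residues:
spread = carry_columns(rows=6, width=18, shift=3, interval=4)
assert {c % 3 for c in spread} == {0, 1, 2}
```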
2.3 Summary
This chapter has described a new variation on conventional Booth multiplication algo-
rithms. By representing partial products in a partially redundant form, hard multiples can
be computed without a slow, full length carry propagate addition. With such hard multiples
available, a reduction in the amount of hardware needed for summing partial products is
then possible using the Booth 3 multiplication method. Since Booth’s algorithm requires
negative partial products, the key idea in using the partially redundant representation is to
add a carefully chosen constant to each partial product, which allows the partial product to
be easily negated. A detailed evaluation of implementations using this algorithm is pre-
sented in Chapter 5, including comparisons with implementations using more conventional
algorithms.
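The role of the hard multiple can be illustrated with a short recoding sketch (this is
the standard radix-8 Booth recoding, not code from this dissertation; names and widths
are illustrative):

```python
def booth3_digits(y, n):
    """Radix-8 (Booth 3) recoding of an n-bit unsigned multiplier y into
    digits d_i in {-4,...,4} with sum(d_i * 8**i) == y.  Digits of
    magnitude 3 call for the "hard" multiple 3M, which cannot be formed
    by shifting or complementing M alone."""
    bit = lambda k: (y >> k) & 1 if k >= 0 else 0
    return [bit(3*i - 1) + bit(3*i) + 2*bit(3*i + 1) - 4*bit(3*i + 2)
            for i in range(n // 3 + 1)]

digits = booth3_digits(0b1011011, 7)
assert sum(d * 8**i for i, d in enumerate(digits)) == 0b1011011
assert 3 in digits   # this multiplier needs the hard multiple 3M
```

Every digit other than ±3 selects a multiple (0, ±M, ±2M, ±4M) that is just a shift or
complement of the multiplicand; only ±3M requires a carry propagate addition, which is
what the partially redundant representation avoids.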
Chapter 3

Adders for Multiplication
Fast carry propagate adders are important to high performance multiplier design in two
ways. First, an efficient and fast adder is needed to make any "hard" multiples that
are needed in partial product generation. Second, after the partial products have been
summed in a redundant form, a carry propagate adder is needed to produce the final non-
redundant product. Chapter 5 will show that the delay of this final carry propagate sum
is a substantial portion of the total delay through the multiplier, so minimizing the adder
delay can make a significant contribution to improving the performance of the multiplier.
This chapter presents the design of several high performance adders, both general purpose
and specialized. These adders will then be used in Chapter 5 to evaluate overall multiplier
designs.
CHAPTER 3. ADDERS FOR MULTIPLICATION 42
To avoid ambiguity, the symbol + with "sum" written above it will be used to signify
actual addition of binary numbers; a bare + denotes logical OR, and adjacency denotes
AND. The defining equations for the binary addition of A, B, and c0, giving sum S and
cn, will be taken as :

    sk = ak ⊕ bk ⊕ ck                                               (3.1)
    ck+1 = ak bk + ak ck + bk ck                                    (3.2)

    k = 0, 1, ..., n - 1
In developing the algebra of adders, the auxiliary functions p (carry propagate) and g
(carry generate) will be needed, and are defined by a modified version of equation 3.2:
ck+1 = gk + pk ck (3.3)
Combining equations 3.3 and 3.2 gives the definition of g and two possible definitions for p
    gk = ak bk                                                      (3.4)
    pk = ak + bk                                                    (3.5)
       = ak ⊕ bk                                                    (3.6)
Equation 3.3 gives the carry out from a given bit position in terms of the carry-in to
that position. This equation can also be applied recursively to give ck+1 in terms of a lower
order carry. For example, applying (3.3) three times gives ck+1 in terms of ck-2 :

    ck+1 = gk + pk gk-1 + pk pk-1 gk-2 + pk pk-1 pk-2 ck-2          (3.7)

Grouping terms suggests defining generate and propagate signals over a range of bits :

    gk:j = gk + pk gk-1:j                                           (3.8)
    pk:j = pk pk-1:j                                                (3.9)

Equations 3.8 and 3.9 give the carry generate and propagate for the range of bits from k to
j. These equations form the basis for the conventional carry lookahead adder [38].
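The group recursion can be exercised with a short script (a sketch; the serial fold
stands in for the parallel lookahead tree, and the 4 bit width is illustrative):

```python
def group_gp(a_bits, b_bits):
    """Fold the per-bit g and p (equations 3.4 and 3.5) into group G and P
    using the recursion of equations 3.8 and 3.9 (low order bit first)."""
    G, P = 0, 1
    for a, b in zip(a_bits, b_bits):
        g, p = a & b, a | b
        G = g | (p & G)        # g(k:0) = gk + pk g(k-1:0)
        P = p & P              # p(k:0) = pk p(k-1:0)
    return G, P

def carry_out(A, B, n, c0):
    """Group carry-out via equation 3.3 applied to the whole group."""
    a = [(A >> k) & 1 for k in range(n)]
    b = [(B >> k) & 1 for k in range(n)]
    G, P = group_gp(a, b)
    return G | (P & c0)

# The lookahead carry-out agrees with ordinary addition for every input:
for A in range(16):
    for B in range(16):
        for c0 in (0, 1):
            assert carry_out(A, B, 4, c0) == ((A + B + c0) >> 4)
```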
Theorem 1  Let A and B be positive logic binary numbers, each n bits long, and c0 be a
single carry bit. Let S be the n bit sum of A, B, and c0, and let cn be the carry out from the
summation. That is :

    2^n cn + S = A + B + c0

Then :

    2^n ~cn + ~S = ~A + ~B + ~c0

where ~X denotes the bit-wise complement of X, and + here signifies actual addition.
Theorem 1 is simply stating that a positive logic adder is also a negative logic adder. Or in
other words, an adder designed to function with positive logic inputs and outputs will also
be an adder if the inputs and outputs are negative logic.
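Theorem 1 can be checked numerically (a 4 bit sketch):

```python
# Complementing every input and output bit of a binary adder yields
# another valid addition (Theorem 1), checked exhaustively for n = 4.
n = 4
mask = (1 << n) - 1
for A in range(1 << n):
    for B in range(1 << n):
        for c0 in (0, 1):
            total = A + B + c0
            S, cn = total & mask, total >> n
            # add the complemented inputs
            total2 = (A ^ mask) + (B ^ mask) + (c0 ^ 1)
            assert (total2 & mask) == (S ^ mask)   # complemented sum
            assert (total2 >> n) == (cn ^ 1)       # complemented carry-out
```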
    G = g3:0                                                        (3.10)
      = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0                         (3.11)
    P = p3:0
      = p3 p2 p1 p0                                                 (3.12)
The outputs of individual gates are connected via a wire-OR to produce G. The output
stage is formed by gates Z and produces the sum at each bit position by a three way
EXCLUSIVE OR of ak and bk with the carry (ck ) reaching a particular bit. The carry
[Figure 3.1: Block diagram of a 64 bit carry lookahead adder. The A and B operands are
divided into sixteen 4 bit groups, each with group generate and propagate logic
producing G0-G15 and P0-P15; the carry lookahead logic combines these with the
carry-in to form the group carries C0-C15 and the carry-out C16, which feed 4 bit wide
output stages that produce the sum.]
[Figure 3.2: A 4 bit carry lookahead group. Gates Y form the bit g and p signals, gates
W form the spans g2:0, g1:0, g0 and p2:0, p1:0, p0, gates X form the group G and P
(via wire-OR), and gates Z produce the sum bits s3-s0 from the internal carries and cin.]
reaching a particular bit can be related to the group carry-in (cin) by the following :

    ck = gk-1:0 + pk-1:0 cin
The signal cin usually arrives later than the other signals, (since it comes from the global
carry lookahead logic which contains long wires), so the logic needs to be optimized to
minimize the delay along the cin path. This is done by using Shannon’s Expansion Theorem
[27] [28] applied to sk as a function of cin :
    sk = ak ⊕ bk ⊕ ck
       = ak ⊕ bk ⊕ (gk-1:0 + pk-1:0 cin)
       = cin (ak ⊕ bk ⊕ (gk-1:0 + pk-1:0)) + ~cin (ak ⊕ bk ⊕ gk-1:0)       (3.13)
Being primary inputs, ak and bk are available very early, so the half sum ak ⊕ bk = pk
is also available fairly early. The values gk-1:0 and pk-1:0 can be produced using only
locally available signals (that is, signals available within the group). Because the wires
within a group should be fairly short, these signals should also be available rather quickly
(the gates labeled W in Figure 3.2 produce these signals). The detailed circuitry for an
output stage gate which realizes equation 3.13, given ak ⊕ bk (the half sum), with a single
tail current is shown in Figure 3.3. This gate is optimized in such a way that the carry to
output delay is
[Figure 3.3: Output stage circuit. For proper operation, G and P must not both be high.]
much smaller than the delay from the other inputs of the gate.
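Equation 3.13 can be checked as a Boolean identity (a sketch; here g and p stand for
gk-1:0 and pk-1:0):

```python
from itertools import product

def sum_direct(a, b, g, p, cin):
    """sk = ak xor bk xor ck, with ck = g + p*cin (equation 3.3)."""
    ck = g | (p & cin)
    return a ^ b ^ ck

def sum_shannon(a, b, g, p, cin):
    """Equation 3.13: Shannon expansion around the late-arriving cin, so
    cin only selects between two precomputed values."""
    f1 = a ^ b ^ (g | p)    # sum bit assuming cin = 1
    f0 = a ^ b ^ g          # sum bit assuming cin = 0
    return f1 if cin else f0

# The two forms agree on every input combination.
assert all(sum_direct(*v) == sum_shannon(*v)
           for v in product((0, 1), repeat=5))
```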
The first stage of the carry lookahead logic combines the group G and P signals into
supergroup generate and propagate signals, starting with :

    G0:0 = G0
    P0:0 = P0
A gate level implementation of the supergroup G and P using NOR gates and wire-OR is
shown in Figure 3.5. Note that some gates have multiple outputs. These can usually be
obtained by adding multiple emitter followers at the outputs, or by duplicating the gates
in question. The second stage of the carry lookahead logic uses the supergroup G and P
produced in the first stage, along with the carry-in, to make the final group carries, which
are then distributed to the individual group output stages. This process is similar to the
canonic addition described in [36]. The equations relating the super group G and P signals
to the final carries are :
c0 = C
[Figure: the carry lookahead network for the 64 bit adder. Sixteen 4 bit slices supply
the group G and P signals; supergroup lookahead logic and a tree of carry circuits
combine them with the carry-in C to form the group carries c0, c4, ..., c64.]

[Figure 3.5: Gate level implementation of the supergroup G and P signals using NOR
gates and wire-OR.]
(Gi:j and Pi:j denote the generate and propagate signals combined across groups i down
to j.)

    c4  = G0:0 + P0:0 C
    c8  = G1:0 + P1:0 C
    c12 = G2:0 + P2:0 C
    c16 = G3:0 + P3:0 C
    c20 = G4:4 + P4:4 G3:0 + P4:4 P3:0 C
    c24 = G5:4 + P5:4 G3:0 + P5:4 P3:0 C
    c28 = G6:4 + P6:4 G3:0 + P6:4 P3:0 C
    c32 = G7:4 + P7:4 G3:0 + P7:4 P3:0 C
    c36 = G8:8 + P8:8 G7:4 + P8:8 P7:4 G3:0 + P8:8 P7:4 P3:0 C
    c40 = G9:8 + P9:8 G7:4 + P9:8 P7:4 G3:0 + P9:8 P7:4 P3:0 C
    c44 = G10:8 + P10:8 G7:4 + P10:8 P7:4 G3:0 + P10:8 P7:4 P3:0 C
    c48 = G11:8 + P11:8 G7:4 + P11:8 P7:4 G3:0 + P11:8 P7:4 P3:0 C
    c52 = G12:12 + P12:12 G11:8 + P12:12 P11:8 G7:4 + P12:12 P11:8 P7:4 G3:0
        + P12:12 P11:8 P7:4 P3:0 C
    c56 = G13:12 + P13:12 G11:8 + P13:12 P11:8 G7:4 + P13:12 P11:8 P7:4 G3:0
        + P13:12 P11:8 P7:4 P3:0 C
    c60 = G14:12 + P14:12 G11:8 + P14:12 P11:8 G7:4 + P14:12 P11:8 P7:4 G3:0
        + P14:12 P11:8 P7:4 P3:0 C
    c64 = G15:12 + P15:12 G11:8 + P15:12 P11:8 G7:4 + P15:12 P11:8 P7:4 G3:0
        + P15:12 P11:8 P7:4 P3:0 C
[Figure: carry circuit details, forming the final carries from the supergroup G and P
signals and the carry-in C.]

[Figure: 64 bit Ling adder block diagram. Each 4 bit group produces H and I signals in
place of G and P; the carry lookahead logic forms pseudo carries h0 through h15 and
the carry-out C16, which drive the output stages along with p-1 from the adjacent
group.]

A Ling adder retains the same overall structure as the conventional CLA, including
the group and group lookahead logic. The major difference between the Ling scheme and
the conventional CLA is that the group H signal (which replaces the group G signal from
the CLA) is available one stage earlier than the corresponding G signal. Also the group
propagate signal (P) is replaced with a signal that performs an equivalent function in the
Ling method (I).
    G = g3:0
      = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0                         (3.14)

Using the fact that a generate implies an inclusive-or propagate :

    g3 = a3 b3
       = (a3 b3)(a3 + b3)
       = p3+ g3                                                     (3.15)

It is important to note that the equation above is true only if p3 is formed as the inclusive-or
of a3 and b3. The exclusive-or form of p3 will not work! At this point it is assumed that p3
is produced from equation 3.5. That is :

    p3 = p3+ = a3 + b3                                              (3.16)

Substituting into (3.14) and factoring out p3+ :

    G = p3+ g3 + p3+ g2 + p3+ p2 g1 + p3+ p2 p1 g0
      = p3+ (g3 + g2 + p2 g1 + p2 p1 g0)
      = p3+ H
which provides the definition for a new type of group signal, the Ling group pseudo carry
generate. This leads to the general definition for the function h, when computed across a
series of bits :

    gj:k = pj+ hj:k                                                 (3.17)

Or equivalently :

    hj:k = gj + gj-1:k                                              (3.18)
Again referring back to Figure 3.2, G is produced by two stages of logic. The first stage
computes the bit gk and pk, and the second stage computes G from the bit gk and pk . The
Ling pseudo-generate, H, can be produced in a single stage plus a wire-OR. To see this,
expand H directly in terms of the ak and bk inputs, instead of the intermediate gk and pk :
H = a3 b3 + a2 b2 + a1 a2 b1 + a1 b1 b2
+ a0 a1 a2 b0 + a0 a1 b0 b2 + a0 a2 b0 b1 + a0 b0 b1 b2 (3.19)
If negative logic inputs are assumed, then the function H can be computed in a single
INVERT-AND-OR stage. In principle, G can also be realized in a single INVERT-AND-
OR stage, but it will require gates with up to 5 inputs, and 15 outputs must be connected
together in a large wire-OR. Figure 3.8 shows a sample Ling 4 bit group.
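The single stage H expansion and the relation G = p3+ H can be checked exhaustively
with a short script (a sketch; a[k] and b[k] are bit k of the group, bit 3 most
significant):

```python
from itertools import product

def ling_group(a, b):
    """4 bit group signals from equations 3.10, 3.15, and the H definition."""
    g = [x & y for x, y in zip(a, b)]   # generate per bit
    p = [x | y for x, y in zip(a, b)]   # inclusive-or propagate per bit
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    H = g[3] | g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
    return G, H, p[3]

for bits in product((0, 1), repeat=8):
    a, b = bits[:4], bits[4:]
    G, H, p3 = ling_group(a, b)
    # equation 3.19: H directly from the a and b inputs
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    H319 = (a3 & b3) | (a2 & b2) | (a1 & a2 & b1) | (a1 & b1 & b2) \
         | (a0 & a1 & a2 & b0) | (a0 & a1 & b0 & b2) \
         | (a0 & a2 & b0 & b1) | (a0 & b0 & b1 & b2)
    assert H == H319 and G == (p3 & H)
```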
[Figure 3.8: A sample Ling 4 bit group.]

Expanding the pseudo carry h15:0 in terms of 4 bit spans :

    h15:0 = g15 + g14:0
          = g15 + g14:12 + p14:12 g11:8 + p14:12 p11:8 g7:4 + p14:12 p11:8 p7:4 g3:0
          = g15 + g14:12 + p14:12 (g11 + p11 g10:8)
                         + p14:12 p11:8 (g7 + p7 g6:4)
                         + p14:12 p11:8 p7:4 (g3 + p3 g2:0)

Assume that each of p11, p7, and p3 are produced as p11+, p7+, and p3+. Then :

    h15:0 = g15 + g14:12 + p14:12 (p11+ g11 + p11+ g10:8)
                         + p14:12 p11:8 (p7+ g7 + p7+ g6:4)
                         + p14:12 p11:8 p7:4 (p3+ g3 + p3+ g2:0)
          = g15 + g14:12 + p14:11 (g11 + g10:8)
                         + p14:11 p10:7 (g7 + g6:4)
                         + p14:11 p10:7 p6:3 (g3 + g2:0)
          = h15:12 + p14:11 h11:8 + p14:11 p10:7 h7:4 + p14:11 p10:7 p6:3 h3:0
          = h15:12 + i15:12 h11:8 + i15:12 i11:8 h7:4 + i15:12 i11:8 i7:4 h3:0
Note that the indexes on the p terms are slightly different from those on the i terms. Using this
definition of i, the formation of h across multiple groups from the group H and I signals
is exactly the same as the formation of g across multiple groups from the group G and P
signals. Thus, exactly the same group and supergroup lookahead logic can be used for
the Ling adders, as was used in the CLA. Detail for the Ling lookahead logic is shown in
Figure 3.9. The only real difference is that G and P are replaced by I and H, which for a
four bit group are :
    H = h3:0
      = g3 + g2 + p2 g1 + p2 p1 g0
    I = i3:0
      = p2 p1 p0 p-1+
Note that the formation of I requires the p+ from the most significant bit position of the
adjacent group.
One minor nuisance with this implementation of the Ling adder is that the complement
of H is a difficult function to implement. As a result, only a positive logic version of
H is available for use by the first level of the lookahead logic. The fastest realization of
the group I signal is only available in a negative logic form. The first layer of lookahead
circuits (Figure 3.10) must be modified to accept a positive logic H and a negative logic
I. This requires a strange type of NOR gate which has a single inverting input, and from
[Figure 3.9: Ling carry lookahead network. The structure is identical to the
conventional lookahead network, with the group H and I signals replacing G and P,
producing the pseudo carries h0, h4, ..., h60 and the carry-out C64.]

[Figure 3.10: 4 group H and I lookahead logic, accepting a positive logic H and a
negative logic I.]
1 to 3 non-inverting inputs. The circuit for such a strange looking NOR gate is shown in
Figure 3.11.
[Figure 3.11: NOR gate with 1 inverting input and 2 non-inverting inputs.]

At the top of the lookahead tree the final pseudo carry is :

    h60 = h59:0 + i59:0 cin

Computation of the final sum requires the carry (c60), which can be recovered from h60 by
using equations 3.17 and 3.20 :

    c60 = g59:0 + p59:0 cin
        = p59+ h59:0 + p59+ p58:-1 cin
        = p59+ (h59:0 + i59:0 cin)
        = p59+ h60
This result can be used in place of cin in equation 3.13 to modify the logic in the output
stage to produce the proper sum [3] [34].
3.4.1 Multiply by 3
The general idea is to replace the Ling 4 bit group (Figure 3.8) with a 7 bit group which
is specifically optimized for computing 3 times the input operand. The carry lookahead
network remains the same. Because a group now consists of 7 bits, instead of 4 bits, the
lookahead network is smaller, and could (depending on the length required) be fewer stages.
For this discussion, the assumption is that the B operand has been replaced by a shifted
copy of the A operand :
    B = Σ (k = 0 to n-1)  ak 2^(k+1)
      = Σ (k = 1 to n)    ak-1 2^k

so that the bit generate and propagate signals become :

    gk = ak ak-1                                                    (3.21)
    pk = ak + ak-1                                                  (3.22)
Substituting this into the equation for the group G (equation 3.10) gives :

    g3:0 = a3 a2 + a2 a1 + a3 a1 a0 + a2 a0 a-1
This is much simpler than even the Ling expansion (equation 3.19). Sticking with the limit
of gates with no more than 4 inputs, it is possible to compute h6:0 in a single stage :

    h6:0 = a6 a5 + a5 a4 + a4 a3 + a5 a3 a2 + a4 a2 a1 + a5 a3 a1 a0 + a4 a2 a0 a-1
A sample 7 bit times 3 group is shown in Figure 3.12. This section can be interchanged with
the four bit Ling group (Figure 3.8), with the carry lookahead logic remaining unchanged.
Internal carries required for the final sum generation (as per equation 3.13) are produced
directly from the primary inputs according to the following :
    g0:0 = a0 a-1
    g1:0 = a1 a0 + a0 a-1
    g2:0 = a2 a1 + a1 a0 + a2 a0 a-1
    g3:0 = a3 a2 + a2 a1 + a3 a1 a0 + a2 a0 a-1
    g4:0 = a4 a3 + a3 a2 + a4 a2 a1 + a3 a1 a0 + a4 a2 a0 a-1
    g5:0 = a5 a4 + a4 a3 + a5 a3 a2 + a4 a2 a1 + a5 a3 a1 a0 + a4 a2 a0 a-1
Note the significant sharing possible between adjacent g terms, which is taken advantage
of in the implementation.
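Equations 3.21 and 3.22 can be exercised with a bit-serial model (a sketch; the ripple
recursion stands in for the actual lookahead network):

```python
def times3_carries(a_bits):
    """Compute 3*A = A + 2*A using the specialized signals
    gk = ak ak-1 and pk = ak + ak-1 (equations 3.21 and 3.22), a-1 = 0.
    a_bits[k] is bit k of A; the result has two extra high bits."""
    n = len(a_bits)
    a = lambda k: a_bits[k] if 0 <= k < n else 0
    s, c = [], 0                         # c is the running carry ck
    for k in range(n + 2):
        s.append(a(k) ^ a(k - 1) ^ c)    # sk = ak xor ak-1 xor ck
        g, p = a(k) & a(k - 1), a(k) | a(k - 1)
        c = g | (p & c)                  # ck+1 = gk + pk ck
    return s

# The specialized equations really do produce 3*A:
for A in range(256):
    bits = [(A >> k) & 1 for k in range(8)]
    out = times3_carries(bits)
    assert sum(b << k for k, b in enumerate(out)) == 3 * A
```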
[Figure 3.12: A sample 7 bit times 3 group, producing h6:0 and the internal carries
g5:0 through g0:0 directly from the primary inputs, along with p6+ and p-1+ for the
neighboring sections.]

[Figure: output stage circuit for the times 3 adder, using P1 and P0 in place of the
single P input.]

[Figure: a 13 bit section of the times 3 adder, producing sum bits 3a12 through 3a1
from a high order 6 bit group and a low order 7 bit group.]
3.5 Summary
As will be shown in Chapter 5, carry propagate adders play a crucial role in the overall
performance of high speed multipliers. This chapter has described a number of efficient
and high performance adder designs which will be used in the multiplier evaluations in
the following chapters. Although the designs have been specifically tuned for ECL based
adders, the ideas can be applied to other technologies.
Specifically, this chapter has presented an adder design that uses the Ling lookahead
method. This adder has one less stage of logic along the critical path than an adder using the
traditional carry lookahead method. Since the complexity and wire lengths are comparable,
this leads to a faster adder.
Significant hardware reductions (about a 20% reduction in gate count) can be obtained by
designing a specialized adder to compute 3M. Because the basic group size can be made
longer, the performance may also improve, since fewer stages are required for the carry
propagation network.
By carefully optimizing the circuits, an efficient and fast (2 stages of logic) short multiple
generator can also be designed. The speed and efficiency of this block is crucial to the
performance of the redundant Booth 3 multiplication algorithm described in Chapter 2.
Chapter 4
Implementing Multipliers
Chapter 2 described various methods of generating partial products, which then must be
added together to form a final product. Unfortunately, the fastest method of summing
the partial products, a Wallace tree or some other related scheme, requires very complex
wiring. The lengths of these wires can affect the performance, and the wires themselves
take up valuable layout area. Manually wiring a multiplier tree is a laborious process,
which makes it difficult to accurately evaluate different multiplier organizations. To make
it possible to efficiently design many different kinds of multipliers, an automated multiplier
generator that designs the layout of partial product generators and summation networks for
multipliers is described in this chapter. Since the partial product generator and summation
network constitute the bulk of the differences between various multiplication algorithms,
many implementations can be evaluated, providing a systematic approach to multiplier
design. The layouts produced by this tool take into consideration wire lengths and delays
as a multiplier is being produced, resulting in an optimized multiplier layout.
4.1 Overview
The structure of the multiplier generator is shown in Figure 4.1. Inputs to the tool consist
of various high level parameters, such as the length and number of partial products, and the
algorithm to be used in developing the summation network. Separately input to the tool
is technology specific information, such as metal pitches, geometric information about the
primitive cells, such as the size of a CSA, I/O terminal locations, etc., and timing tables,
which have been derived from HSPICE [18]. The output of the tool is an L language
(a layout language) file, which contains cell placement information and a net list which
specifies the cell connections. The L file is then used as input to a commercial IC design tool
(GDT from Mentor Graphics). This commercial tool actually places the cells, and performs
any necessary routing using a channel router. Because most things are table driven, the tool
can quickly be modified to adapt to different technologies or layout tools.

[Figure 4.1: Structure of the multiplier generator. High level inputs, geometric and
technology information, and a cell library feed the generator, which produces an L
language file; GDT then produces the final layout.]
[Figure: delay model of a cell. Inputs pass through a common Main Delay; each output
has its own fixed Output Delay plus a delay proportional to the wire it drives
(Delay 1 through Delay 4).]

Each cell is modeled with a fixed delay from its inputs (Main Delay), and each output
also has a fixed delay (Output Delay)¹.

¹In actual use the Main Delay and the Output Delay are not really needed and in fact are
set to 0.

Each output also has a delay which
is proportional to the length of wire being driven. A factor for the fan-out should also be
included, but is not necessary for multipliers, since all of the CSAs have a fan-out of 1. The
individual delays are determined by running SPICE or HSPICE, as is the proportionality
constant for the wire delay.
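The model just described can be captured in a few lines (a sketch; the class structure
and every constant are illustrative, not values from the tool's timing tables):

```python
class Output:
    def __init__(self, output_delay, wire_delay_per_mm):
        self.output_delay = output_delay            # fixed output delay
        self.wire_delay_per_mm = wire_delay_per_mm  # wire-proportional part

    def delay(self, wire_mm):
        return self.output_delay + self.wire_delay_per_mm * wire_mm

class Cell:
    def __init__(self, main_delay, outputs):
        self.main_delay = main_delay  # fixed delay from the inputs
        self.outputs = outputs        # output name -> Output

    def arrival(self, input_time, out, wire_mm):
        # latest input arrival + main delay + output and wire delay
        return input_time + self.main_delay + self.outputs[out].delay(wire_mm)

csa = Cell(main_delay=0.0,   # Main Delay set to 0, as in actual use
           outputs={"sum": Output(0.1, 0.05),
                    "carry": Output(0.12, 0.05)})
t_sum = csa.arrival(input_time=1.0, out="sum", wire_mm=2.0)
```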
[Figure: general structure of a multiplier. The multiplicand and multiplier feed a
partial product generator, whose partial products feed the summation network.]

[Figure: placement of the first rows of partial product multiplexers for an 8x8
multiplier; bits that share the same select line are placed in a single row.]

The multiplexers for the second row of horizontal dots are then placed immediately underneath the first row
of multiplexers, but shifted one position to the left to account for the additional arithmetic
weight of the second partial product with respect to the first. Bits of the multiplicand that
must connect to diagonal sections are routed in the routing channel and over the columns of
cells using feedthroughs provided in the layout of the individual multiplexers. The outputs
of the multiplexers are then routed to the summation network at the bottom. Note that all
bits of the same arithmetic weight are routed in the same routing channel. This makes the
wiring of the CSAs relatively simple.
Multiplexer Alignment
Early versions of this software tool allowed adjacent bits of a single partial product generator
to be unaligned in the Y direction. For some of the multiplication algorithms, a large number
of shared select wires control the multiplexers that create these bits. If these multiplexers
are aligned in the Y direction (as shown in the top of Figure 4.9), these shared wires
run horizontally in a single metal layer and occupy no vertical wiring channels. If these
multiplexers are instead allowed to be misaligned (the bottom of Figure 4.9), the wires
make vertical jogs in the routing channel, and an additional metal layer will be needed for
the vertical sections. This could cause the channel to expand in width. For this reason, the
current implementation forces all bits in a single partial product to line up in the Y direction.

[Figure 4.9: Multiplexer alignment. Top: adjacent bits of a single partial product
aligned in the Y direction. Bottom: the same bits misaligned, forcing vertical jogs in
the shared select wires.]
An improved version of the program might allow some limited amount of misalignment to
remove "packing spaces". These are areas that are too small to fit anything into, created by
the forced alignment of the multiplexers. The final placement of the multiplexers for the
sample 8x8 multiplier is shown in Figure 4.10.
An alternate approach for organizing the partial product multiplexers, that was not used,
involves aligning the partial products in such a way that selects run horizontally (same as
before), and bits of the multiplicand run vertically (Figure 4.11). Cell feedthroughs are
still required, as a particular bit of the multiplicand may still have to reach multiplexers
that are in two adjacent columns, if the Booth 2 or higher algorithms are being realized.

[Figure 4.10: Final placement of the partial product multiplexers for the sample 8x8
multiplier. Multiplicand bits run diagonally, using feedthroughs provided in the
selectors to hop between routing channels, while selects run horizontally over the
cells.]

[Figure 4.11: An alternate organization of the partial product multiplexers, with the
selects running horizontally (as before) and the bits of the multiplicand running
vertically over columns of multiplexer cells.]

In
addition, the partial product bits in any particular routing channel are of varying arithmetic
weight, requiring unscrambling before being applied to the summation network. This
methodology is used for linear arrays, as the unscrambling can occur in sequence with
the summation of the next partial product. The unscrambling requires about as much
extra wiring as routing the bit of the multiplicand diagonally through the partial product
selectors, which was why this method was not used by the multiplier generator. Aligning
the partial products should have comparable area and performance. Note that this method
requires approximately N (N is the length of the multiplicand) routing channels, whereas
the previous method required about 2N routing channels. The tree folding optimization
(described below) reduces the number of routing channels actually needed in the previous
method to about N. The decision was made to concentrate on the first method because there
are many more partial product output wires (N²) than there are multiple wires (N), and
it will require less power to make N wires a little longer versus N² a little longer. Also,
having wires of the same arithmetic weight emerge from the same routing channel makes
the placement and wiring of the CSAs in the summation network easier.
[Figure: a carry save adder (CSA), with three inputs a, b, c and two outputs, carry and
sum.]

A virtual wire is attached to each unwired CSA or multiplexer output. This wire
extends to the bottom of the placement area. This virtual wire is added because even
if an output is never wired to a CSA, it must eventually connect to the carry propagate
adder placed at the bottom. The virtual wires make outputs that are already
near the bottom more likely to be connected to a CSA input, and outputs that are near
the top (and require a long wire to reach the bottom) less likely to be connected to a
CSA input. As a result, faster outputs (near the bottom) will go through more levels
of CSAs and slow outputs (due to long wires to reach the bottom) will go through
fewer levels of CSAs, improving overall performance.
The propagation delay from the multiplicand or multiplier select inputs to each of
the unwired outputs is computed, using the delay model described earlier. Individual
bits of multiples of the multiplicand or the multiplier select signals are assumed to
be valid at a time determined by a lookup table. This lookup table is determined by
iterative runs of the multiplier generator, which can then allow for wire delays and
possible differences in delays for individual bits of a single partial product.
The output having the fastest propagation delay is chosen as the primary candidate
to be attached to a CSA. A search is then made for two other unwired outputs of the
same arithmetic weight. If two other unwired outputs of the same arithmetic weight
cannot be found, then this output is skipped and the next fastest output is chosen,
etc., until a group of at least 3 wires of the same arithmetic weight are found. If no
group can be found, then this stage of the placement and wiring is finished, and the
algorithm terminates.
A new CSA is placed in the column determined by the arithmetic weight of the group.
The primary candidate is wired to the input of the new CSA which has the longest
input delay. Of the remaining unwired outputs with the same arithmetic weight as the
primary candidate, the two slowest possible outputs are chosen which do not cause
an increase in the output time of the CSA. These outputs are then wired to the other
two inputs of the CSA.
In effect, this is a greedy algorithm, in that it is constantly choosing to add a CSA delay
along the fastest path available. There are other procedures that will be described below
that help the algorithm avoid local minima as it places and wires the CSAs.
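The greedy wiring step above can be sketched as follows. This is an illustrative Python model, not the generator itself: the tuple representation, the uniform pin delays in PIN_DELAYS, and the function name are assumptions made for the sketch.

```python
PIN_DELAYS = (1.2, 1.0, 0.8)  # assumed input-to-output delays, slowest pin first

def place_one_csa(unwired):
    """unwired: list of (weight, ready_time) tuples for the unwired outputs.
    Greedily wires one CSA, mutating the list; returns the CSA placed as
    (weight, input_group, output_time), or None when no group of 3 wires
    of the same arithmetic weight can be found."""
    unwired.sort(key=lambda o: o[1])          # fastest output first
    for primary in unwired:
        peers = [o for o in unwired if o[0] == primary[0] and o is not primary]
        if len(peers) < 2:
            continue                           # need a group of at least 3
        # The primary candidate is wired to the slowest input pin, so it
        # sets the output time of the new CSA.
        out_time = primary[1] + PIN_DELAYS[0]
        # Of the remaining same-weight outputs, take the slowest ones that
        # do not cause an increase in the CSA's output time.
        cand = sorted(peers, key=lambda o: -o[1])
        chosen = []
        for pin in PIN_DELAYS[1:]:
            pick = next((o for o in cand if o[1] + pin <= out_time), None)
            if pick is not None:
                chosen.append(pick)
                cand.remove(pick)
        if len(chosen) < 2:
            continue
        for o in [primary] + chosen:
            unwired.remove(o)
        # The new CSA's sum and carry become unwired outputs themselves.
        unwired.append((primary[0], out_time))      # sum: same weight
        unwired.append((primary[0] + 1, out_time))  # carry: next weight up
        return (primary[0], [primary] + chosen, out_time)
    return None                                # this stage is finished
```

Each call places one CSA along the currently fastest path; repeated calls reproduce the greedy behavior, including termination when no group of three same-weight wires remains.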
This algorithm can run into problems, illustrated by the following example. Refer to
the top of Figure 4.13. The left section shows a collection of dots which represent unwired
outputs. The arithmetic weight of the outputs increases from right to left, with dots that are
vertically aligned being of the same arithmetic weight. The above algorithm will find the
3 outputs in the little box and wire them to a CSA. This will give an output configuration
as shown in the center section. The algorithm will repeat, giving the right section. This
sequence of CSAs will be wired in series – essentially they will be wired as a ripple carry
adder. This is too slow for a high performance implementation. The solution is to use half
adders (HA) to break the ripple carry. As shown in the bottom of Figure 4.13, the first
step uses a CSA, but also a group of half adders to reduce the unwired outputs to the final
desired form in one step.
When and where to place half adders is based upon a heuristic, which comes from the
following observations. These observations are true in the case where the propagation
delay from any input of a CSA to any output is equal and identical to the propagation
delay from any input of a HA to any output, and all delays are independent of wire lengths.

[Figure 4.13: without half adders, the greedy algorithm chains same-weight CSAs in series like a ripple carry adder; with half adders, the outputs are reduced in one step.]
Observation 1 If a group of CSAs and HAs are wired to produce the minimum possible
propagation delay when adding a group of partial products, then there will be at most one
HA for any group of wires with the same arithmetic weight.
Proof: Assume a minimum propagation delay wiring arrangement that has 2 or more
HA’s connected to wires of the same arithmetic weight. Pick any two of the HA’s (left side
of Figure 4.14). The HA's have a propagation delay of δ from any input to any output.

[Figure 4.14: two half adders on wires of the same arithmetic weight replaced by a single CSA with rewired inputs.]

The top HA in the figure has an arrival time of T on the A input, and an arrival time of less than
or equal to T on the B input. Thus, the propagation delay of the top HA is determined by
the A input. Similarly, for the bottom HA the propagation delay is again determined by
the A input arrival time of H, with the assumption that H is less than or equal to T. Such a
configuration can be replaced by a single CSA (right side of Figure 4.14), where the inputs
are rewired as shown. The outputs of the CSA configuration are available at the same
time or before the outputs of the HA configuration, thus the propagation delay of the entire
system cannot be increased. This substitution process can be performed as many times as
needed to reduce the number of HA’s connected to wires of the same arithmetic weight to 1
or 0. To emphasize, Observation 1 is true only when the delay effects of wires are ignored,
and the propagation delay from inputs to outputs on CSAs and HAs is the same for all input
to output combinations. As a result it does not apply to real circuitry, but it is used as a
heuristic to assist in the placement of half adders.
Observation 2 If a group of CSAs and a HA are wired to produce the minimum possible
propagation delay when adding a group of partial products, then the inputs of the HA can
be connected directly to the output of the partial product generator.
Proof : Assume that Observation 1 is applied to reduce the number of HA’s attached to
wires of a specific arithmetic weight to 1. If the HA is not connected directly to a partial
product generator output, then there must be some CSA that is connected directly to a partial
product generator output. This configuration is illustrated by the left side of Figure 4.15.
The arrival times on the A inputs of both the CSA and the HA determine the output times
[Figure 4.15: a half adder and a CSA of the same arithmetic weight; swapping the A and B inputs of the HA with the B and C inputs of the CSA gives the HA the earliest arriving signals.]

of the two counters (a counter refers to either a CSA or a HA). The CSA A input arrives
earlier than the A input on the HA. The two counters can be rewired (right side of
Figure 4.15) such that the A input on the HA arrives
earlier, without increasing propagation delay of the entire system. This process can be
repeated until the HA is attached to the earliest arriving signals, which would be the output
of the partial product generator.
Even though Observations 1 and 2 are not valid in the presence of wire delays and
asymmetric input propagation delays, they can be used as the basis for a heuristic to place
and wire any needed HAs. Half adders are wired as the very first counter in every column,
and the multiplier is then wired as described above. The critical path time of the multiplier is
then determined. Then starting with the most significant arithmetic weight, the half adder is
temporarily removed and the network is rewired. If the critical path time increases, then it is
concluded that a half adder is needed at this weight, and the removed half adder is replaced.
If the critical path time does not increase, then the half adder is removed permanently. The
process is then repeated for each arithmetic weight, giving a list of weights for which half
adders are required.
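The half adder placement heuristic above can be summarized in a few lines. A hypothetical build_network callback stands in for the generator's rewire-and-measure step; it takes the set of weights that receive a leading half adder and returns the resulting critical path time.

```python
def choose_half_adder_weights(weights, build_network):
    """weights: the arithmetic weights of the network, e.g. range(n).
    build_network(keep) is assumed to wire the multiplier with a half adder
    first in each column in `keep` and return the critical path time."""
    keep = set(weights)                  # start with a HA first in every column
    best = build_network(keep)
    for w in sorted(keep, reverse=True): # most significant weight first
        trial = keep - {w}               # temporarily remove this half adder
        t = build_network(trial)
        if t <= best:                    # critical path did not increase:
            keep, best = trial, t        #   remove the half adder permanently
    return keep
```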
[Figure 4.17: Multiplexers folded under. The multiplexer rows to the right of a hinge column are folded back underneath, so pairs of rows share routing channels.]
very little space is wasted. The same scheme can be used on the left half of the layout.
In general, this technique can eliminate almost half of the required routing channels. The
program chooses the hinge point by iteration. The rightmost routing channel is used as the
initial hinge point. The layout is done, and if the area is smaller than any previous layouts,
the hinge point gets moved one column to the left. This continues until the smallest area is
obtained. The method is then repeated for the left side.
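The hinge point search is a simple iterative descent. Here layout_area is a stand-in for performing the full layout with a given hinge column and reporting its area; the function name and interface are assumptions of this sketch.

```python
def find_hinge(num_channels, layout_area):
    """Move the hinge left from the rightmost routing channel while the
    layout area keeps shrinking; returns (hinge_column, best_area)."""
    hinge = num_channels - 1          # initial hinge: rightmost channel
    best = layout_area(hinge)
    while hinge > 0:
        area = layout_area(hinge - 1)
        if area >= best:
            break                     # smaller area no longer obtained: stop
        hinge, best = hinge - 1, area
    return hinge, best
```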
The final result from the summation network emerges folded back upon itself, but some
experiments were done with adder layouts and it seems as though the size and performance
of the final carry propagate adder is not affected significantly by this folding.
4.3.4 Optimizations
There are a number of optimizations which are done as the layout is being developed, to
improve the area or reduce the delay.
Embedded CSAs
To further reduce the number of vertical wiring tracks needed in the routing channels, a CSA
can be moved closer to the outputs that are connected to its inputs. These outputs can come
from either a partial product multiplexer or another CSA. For example, the configuration
shown in the left half of Figure 4.18 takes 3 vertical routing tracks. Moving the CSA to
a location between the outputs requires only 2 routing tracks (right side of Figure 4.18).
To provide space for such movement, the initial placement of the partial product selection
multiplexers has vertical gaps. There are also gaps created by the tree folding as described
previously. As the CSAs are added, checks are made to determine whether a CSA can be
moved into such an area, subject to the constraint that the propagation delay of the path
through this CSA cannot increase. This overly constrains the problem, because not every
CSA is along the system critical path. After the CSAs are all placed, and the critical path is
determined, additional passes are done which attempt to move the CSAs into such locations
to minimize the number of vertical routing channels.
[Figure 4.18: moving a CSA up into a gap between the multiplexers driving its inputs reduces the vertical routing tracks needed from 3 to 2.]
Wire Crossing
Wire crossing elimination is used to improve performance and wiring channel utilization.
The left side of Figure 4.19 illustrates a possible wire crossing. These wire crosses are
[Figure 4.19: a wire crossing between the inputs of two CSAs, eliminated by interchanging the inputs.]
created when a CSA is moved upward in a cell column as described earlier. The inputs can
be interchanged (right side of Figure 4.19), and the width of the routing channel reduced if
the following three conditions are met :
A cycle must not be created by the interchange. That is, there cannot be feedback,
either direct or indirect, from the output of a counter to one of its own inputs.
Each wire crossing eliminated saves 2 routing tracks, allowing possible compression of the
routing channel. The delay may also be reduced since the wires driven by the outputs are
shorter.
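The condition that no direct or indirect feedback may be created amounts to a reachability check on the counter wiring graph. A sketch, with the wiring represented as an adjacency mapping (the data representation is an assumption of this model):

```python
def creates_cycle(drives):
    """drives maps each counter to the set of counters its outputs feed;
    returns True if any counter's output can reach one of its own inputs,
    directly or indirectly."""
    def reachable(node, target, seen):
        for nxt in drives.get(node, ()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reachable(nxt, target, seen):
                    return True
        return False
    return any(reachable(c, c, set()) for c in drives)
```

A proposed interchange would be applied to a copy of the mapping and rejected if this check reports a cycle.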
Differential Wiring
A major performance gain can be obtained by selectively using differential ECL in place
of standard single ended ECL. This optimization is illustrated by the circuit shown in
Figure 4.20. The reference input in the standard gate is replaced by a second wire which
[Figure 4.20: a standard single ended ECL gate and its differential counterpart, in which the reference input is replaced by the complement of the input.]
is always the complement of the input. The addition of the second wire allows the voltage
swing on both wires to be half that of the single ended case, while maintaining the same (or
better) noise margin. The gate delay of the driving gate is halved, as is the wire delay. On
CHAPTER 4. IMPLEMENTING MULTIPLIERS 93
the down side, the area and power consumption of the driving gate is increased, due to the
second output buffer. A larger routing channel may also be needed to accommodate the
extra signal wire required. This optimization is very useful in reducing the delay along
critical paths. Differential wires are introduced according to the following rules:
A candidate wire must lie along a critical path through the multiplier, and it must not
already be a differential wire.
The addition of the second wire must not increase the routing channel width.
If no wire can be found that satisfies both of the above conditions, then find a wire
that satisfies only the first condition and expand the routing channel.
This process is continued until no wires can be found that satisfy the first condition. The
process may also be discontinued prematurely if this is desired.
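The three rules can be expressed as a loop over candidate wires. The dictionary fields and the recompute_critical_path callback are illustrative stand-ins for the generator's internal data, not its actual interfaces.

```python
def add_differential_wires(wires, recompute_critical_path):
    """Convert critical path wires to differential, per the rules above."""
    while True:
        cand = [w for w in wires
                if w["on_critical_path"] and not w["differential"]]
        if not cand:
            return                       # no wire satisfies rule 1: done
        # Prefer a wire whose second conductor fits in the existing routing
        # channel (rule 2); otherwise take any critical path wire and the
        # channel is expanded (rule 3).
        pick = next((w for w in cand if w["fits_in_channel"]), cand[0])
        pick["differential"] = True
        recompute_critical_path(wires)   # the critical path may move
```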
Emitter Follower Elimination

Emitter followers are output buffers that are used to provide gates with better wire driving
capability and also to provide any level shifting that is required to avoid saturating the input
transistors of any gates being driven. For differential output gates, two emitter followers
are needed. All single ended gates require some level shifting to facilitate the production
of the reference voltages. Differential gates do not require such a reference voltage, so
this level shifting may not be required. For short wires, the buffering action of the emitter
follower is also not needed, so these emitter followers can be eliminated, reducing area
and power consumption.
Power Ramping
The delay through a short wire (length < 2mm) is inversely proportional to the current
available to charge or discharge the wire (see Equation 1.2). This provides a direct trade-off
that can be made between the power consumed by an output buffer and the delay through
the wire driven by the buffer. In a full tree multiplier, there are large numbers of wires
that do not lie along the critical path, thus there is the potential for large power savings
CHAPTER 4. IMPLEMENTING MULTIPLIERS 94
by tuning the current in the emitter follower output buffer. In principle, a follower driving
a completely non-critical wire could be ramped to a negligible current. For noise margin
reasons, however, there is a limit to the minimum current powering a follower, so the
practical minimum is about 1/4 of the maximum follower current.
The currents powering non-critical logic gates can also be reduced, increasing the
propagation delay of the gate. The noise margin requirements are different for gates, so
they can be ramped to lower currents than can the emitter followers. The minimum current
is again limited, but this time by the fact that lower currents need larger resistor values in
the current source powering the gate. These larger resistors can consume large amounts of
layout area. Although the resistors can be hidden under routing channels, the practical limit
seems to be about 10KΩ. This again provides a ratio of about 1/4 between the smallest and
largest currents allowed.
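The resulting trade-off can be captured in a small helper. The inverse relation between delay and current follows Equation 1.2 for short wires; the function and parameter names are assumptions of this sketch.

```python
def ramp_follower_current(i_max, delay_at_max, slack):
    """Smallest follower current whose wire delay still meets the deadline,
    assuming delay inversely proportional to current, with a noise margin
    floor of one quarter of the maximum follower current."""
    target = delay_at_max + slack              # delay budget for this wire
    i_needed = i_max * delay_at_max / target   # delay ~ 1/I
    return max(i_needed, i_max / 4.0)          # noise margin floor
```

With a large slack the result bottoms out at one quarter of the maximum current, the practical minimum mentioned above.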
4.4 Verification

The completed summation network is checked for the following structural properties:

All CSAs (carry save adders) have all inputs connected to something (No floating
inputs).
All CSAs in the summation network have all outputs connected to something (No
bits are lost).
All partial product multiplexer outputs are connected to a CSA input (No bits are
lost).
All inputs to a given CSA have the same arithmetic weight (Make sure the correct
things are added).
No input to a given CSA can be driven directly or indirectly by any output from the
same CSA (no feedback).
All wires have exactly one CSA input attached (Each partial product is added no
more than once).
All wires have exactly one output attached, which could come from either a partial
product multiplexer or a CSA (No outputs are tied together).
Addition verification can also be performed by a transistor level simulation of the layout
(see Section 5.5).
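Most of the checks in the list above reduce to counting, for every wire, exactly one driver and exactly one load. A toy netlist model (the data representation is an assumption; the arithmetic weight and feedback checks are omitted here):

```python
def verify_network(csas, mux_outputs, adder_inputs):
    """csas: list of (inputs, outputs) wire name tuples; mux_outputs: wires
    driven by partial product multiplexers; adder_inputs: wires consumed by
    the final carry propagate adder.  Raises AssertionError on a violation."""
    drivers, loads = {}, {}
    for w in mux_outputs:
        drivers[w] = drivers.get(w, 0) + 1
    for ins, outs in csas:
        assert len(ins) == 3 and len(outs) == 2      # no floating CSA pins
        for w in ins:
            loads[w] = loads.get(w, 0) + 1
        for w in outs:
            drivers[w] = drivers.get(w, 0) + 1
    for w in adder_inputs:
        loads[w] = loads.get(w, 0) + 1
    for w in set(drivers) | set(loads):
        assert drivers.get(w, 0) == 1   # no outputs tied together, no floats
        assert loads.get(w, 0) == 1     # each bit is added exactly once
```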
4.5 Summary
An automatic software tool which assembles summation networks for multipliers has been
described. This tool produces placement and routing information for multipliers based upon
a variety of algorithms, using a CSA as the basic reduction block. Most of the algorithms
used in the tool for placement and routing have been developed by the process of trying
many different methods and refining and improving those methods that seem to work. A
number of speed, power and area optimizations have been presented.
Chapter 5 will use this software tool to evaluate implementations using various partial
product generation algorithms. Implementations produced with the tools will then be
compared to other implementations described in the literature.
Chapter 5

Exploring the Design Space
This chapter presents the designs of a number of different multipliers, using the partial
product generation methods described in Chapter 2. The speed, layout area, and power
for multipliers implemented with each of these methods can only be accurately determined
with a complete design, including the final layout. The layout generator described in
Chapter 4 provides a mechanism with which a careful analysis can be performed, as it can
produce a complete layout of the partial product generation and summation network. In
combination with a design for an efficient carry propagate adder and appropriate multiple
generators (both described in Chapter 3), a complete multiplier can be assembled in a mostly
automated manner. Important measures can then be extracted from these complete designs.
The target multiplier for this study is based upon the requirements of IEEE-754 double
precision floating point multiplication [12]. The format for an IEEE double precision
number is shown in Figure 5.1. The IEEE representation stores floating point numbers in a
normalized, sign magnitude format. The fraction is 52 bits long, normalized (leading bit
of 1), with the "1" implied and not stored. This effectively gives a 53 bit fraction. To meet
the accuracy requirements of the standard the full 106 bit product must be computed, even
though only the high order 53 bits will be stored. Although the low order 51 bits are used
only in computing the "sticky bit" (if the low order 51 bits of the product are all 0, then the
"sticky bit" is high; see Appendix B), all of the carries from the low order bits must be
propagated into the high order bits. The critical path and most of the layout area (> 95%)
involved in a floating point multiplication is due to the multiplication of the fractions, so
CHAPTER 5. EXPLORING THE DESIGN SPACE 97
[Figure 5.1: IEEE double precision format: sign s (1 bit), biased exponent e (11 bits), normalized fraction f (52 bits); 64 total bits.]
this is the only area that will be addressed in the sections that follow.
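The fraction handling described above can be illustrated with plain integers, treating each significand as a 53 bit integer with the implied leading 1 included. The sticky bit here follows the text's definition (high when the low order 51 bits are all zero); function and variable names are illustrative.

```python
def fraction_multiply(a, b):
    """a, b: 53 bit significands with the implied leading 1 included.
    Returns the high order bits of the full 106 bit product plus the
    sticky bit as defined in the text."""
    assert a.bit_length() == 53 and b.bit_length() == 53  # normalized inputs
    product = a * b                            # full 106 bit product
    high = product >> 53                       # high order bits to be stored
    sticky = (product & ((1 << 51) - 1)) == 0  # low order 51 bits all zero
    return high, sticky
```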
Since the emphasis of this thesis is on speed, the delay goal for the complete fraction
multiply is 5 nsecs or less.
5.1 Technology
All multiplier designs are based upon an experimental BiCMOS process[15]. A brief
summary of the process technology is shown in Table 5.1. Although this process is
BiCMOS, the test designs use only bipolar ECL logic with 0.5V single ended/0.25V
differential logic swings.
The basic circuits for a CSA and a Booth 2 multiplexer are shown in Figures 5.2 and
5.3. In order to provide some form of delay reference, the propagation delay curves for the
CSA are shown in Figure 5.4. This figure shows the propagation delay vs load capacitance
for a CSA with a 200µA tail current. There are three 0.5V swing single ended curves,
corresponding to an output driven through an emitter follower and 0, 1 or 2 level shifting
diodes. Each emitter follower is powered by a 200µA pulldown current. Four curves are
shown for 0.25V differential swings. The differential @0 output has no emitter followers,
the others have a pair of emitter followers, each powered with 200µA, and 0, 1, or 2 diodes
per follower. In this technology 1 mm of wire corresponds to about 300fF. Figure 5.5
zooms in on the area where the load capacitance is less than 100fF. The dashed vertical line
corresponds to the approximate capacitance that would be seen if another CSA was being
driven through a wire that is twice the CSA height.
[Process summary: bipolar transistors with 16 GHz fT at 200µA; 2KΩ/square polysilicon resistors; 3.3V CMOS.]

[Figures 5.2 and 5.3: circuit schematics for the ECL CSA and the Booth 2 partial product multiplexer, with Select 0X, 1X, and 2X and Invert controls.]
[Figure 5.4: propagation delay (psec) versus load capacitance (0 to 1000fF) for single ended and differential outputs with 0 to 3 level shifting diodes. Figure 5.5: the same curves for load capacitance below 100fF.]
[Figures: overall multiplier structure and timing. The multiplier bits are decoded and the select wires driven in parallel with computing the multiples of the multiplicand and driving the multiple wires; the partial product generator feeds the summation network, and a final carry propagate add produces the product.]
Each partial product is produced by a horizontal row of multiplexers which have common
select controls (the layout tool may fold the row back upon itself). Using the dot diagrams
of Chapter 2, a single horizontal row of dots corresponds to a row of multiplexers (or AND
gates). The select controls are shared by all multiplexers used in selecting a single partial
product, and in the layout scheme adopted here, run horizontally over the multiplexers (refer
back to Chapter 4 for a more detailed description). The select controls are composed
of recoding logic, which use various bits of the multiplier to produce the required decoded
multiplexer signals, such as select Mx1, select Mx2, select Mx3, etc., which are in turn used
to choose a particular multiple of the multiplicand in forming a given partial product. The
decoded multiplexer signals are then fed to buffers which drive the long wires connecting
the multiplexers. The low level circuit design of the output driver for each select takes
advantage of the fact that the selects are mutually exclusive (only one is high at any given
time) to reduce the power consumption. During a multiply operation and after the select
lines have stabilized, exactly one of the select lines will be high. Therefore when the select
lines need to switch, only one wire will be making a high to low transition, so a single
pulldown current source can be shared by all 5 wires, instead of 5 separate pulldown current
sources.
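The mutual exclusivity that the driver design relies on comes directly from the recoding: each overlapping 3 bit group of the multiplier selects exactly one magnitude. A sketch of standard Booth 2 recoding (the table is the usual one; the wire names are illustrative):

```python
BOOTH2 = {  # (b[i+1], b[i], b[i-1]) -> (multiple magnitude, invert)
    (0, 0, 0): (0, 0), (0, 0, 1): (1, 0), (0, 1, 0): (1, 0),
    (0, 1, 1): (2, 0), (1, 0, 0): (2, 1), (1, 0, 1): (1, 1),
    (1, 1, 0): (1, 1), (1, 1, 1): (0, 1),
}

def booth2_selects(b2, b1, b0):
    """Decode one overlapping 3 bit multiplier group into one-hot select
    wires plus an invert control."""
    mag, invert = BOOTH2[(b2, b1, b0)]
    selects = {"sel_0x": mag == 0, "sel_1x": mag == 1, "sel_2x": mag == 2}
    assert sum(selects.values()) == 1   # exactly one select is high
    return selects, invert
```

Each decoded group represents the signed digit -2·b[i+1] + b[i] + b[i-1], which is what the one-hot property follows from.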
Figure 5.8 shows a simplified driver circuit using 2 select output drivers. To expand
this to 5 (or more) select drivers, 3 (or more) additional driver circuits would have to be
added, but they would all share the same pulldown current source shown in the figure.

[Figure 5.8: a simplified driver circuit with two select output drivers sharing a single 1mA pulldown current source.]

To
understand how this circuit works, consider the bottom driver in the figure. There are 4
major components: the driver gate, which connects to the input; an output pullup darlington
formed by T1, T2, and D1 (D1 reduces the gain of the output stage to reduce ringing); an
output pulldown transistor T3; and the shared current source.
transistor T3 look like a diode that connects the shared current source to the output, pulling
the output down very quickly, with the full force of the shared current source (remember
exactly 1 output is high at any one time).
When the input transitions from high to low, all of the gate current is steered through
R2, creating a voltage drop across R2, turning off transistor T3. At the same time there is
no current through R1, therefore no voltage drop across R1, causing the darlington to pull
up very fast. The current through R2 also provides a trickle current through the darlington
to establish the output high voltage.
To reduce the wire delay, the voltage swing on the wires is reduced to 300mV, from
the 500mV nominal swing for the other circuits, without sacrificing noise margin. Since
exactly 1 wire is high at any given time, it can act like a reference voltage to the other 4
wires that are low (or are in transition to a low). As a result, much of the DC noise (such
as voltage drops on the power supply rails) on the 5 select wires becomes common mode
noise, in much the same way that DC noise becomes common mode noise for a differential
driver. This allows a somewhat reduced voltage swing without sacrificing noise margins.
In the comparisons that follow, the recoding time plus the wire delay time is assumed
to be fast enough that it is never in the critical path. Since the layout tool reports back the
actual lengths of the select wires, the power driving the wires is adjusted to assure that
this is the case.
In parallel with the partial product selection, any required multiples of the multiplicand (M)
must be computed and distributed to the partial product multiplexers. The delay can be
separated into two components :
Multiple Distribution : This is the wire delay due to the distribution of the bits of
the multiplicand and any required multiples. These multiples run diagonally across
the partial product multiplexers, so these wires are longer than the selection wires.
Again the wire lengths are available as output from the layout program, and the power
driving the wires can be adjusted (within reason) to give any desired wire delay.
The multiple generation and distribution is constrained to be less than 600 psec, by
adjusting the power used in driving the long wires. This time is determined by the largest
single transistor available in the technology (2mA), the typical wire length for multiples
in driving to the partial product generator, and the delay of a buffering gate for driving
the multiples. When a hard multiple is distributed, this constraint cannot be met (the
hard multiple takes 700 psec to produce because it requires a long length carry propagate
addition), so the driving current is limited to 2mA (largest single transistor available) per
wire and the propagation delay is increased.
This block contributes the bulk of the layout area and power. The software layout program
described in Chapter 4 generates complete layout of this section, providing accurate (within
10% of SPICE) delay, power, and area estimates. In addition the lengths of the select and
multiples wires are also computed.
Since all multipliers being considered in this section are 53x53 bits, producing a 106 bit
product, a 106 bit carry propagate adder is needed. This adder can be considered as a fixed
overhead, since it is the same for all algorithms. Such an adder has been designed and laid
out, using the modified Ling scheme presented in Chapter 3. This adder accepts two 106 bit
input operands and produces the high order 66 bits of the product, plus a single bit which
indicates whether the low order 40 bits of the product are exactly zero. The important
measurables for this adder are shown in Table 5.2. These adder parameters were obtained
assuming a nominal -5V supply at 100°C, driving no load. The timing information is based
on SPICE simulations of the critical path using capacitances extracted from the layout.
Because the adder design was done in a standard cell manner, the wire capacitance was
increased by 50% in the simulation runs to account for possible Miller capacitance between
simultaneously switching, adjacent wires.
Delay
All delays are for the entire multiply operation, not just the summation network time.
Power
The power values shown in the evaluation tables include all of the power necessary to
operate the multiplier.
Layout Area
The area includes all components of the multiplier. The area can also impact the perfor-
mance, in that larger area generally means longer wires and more wire delay.
Fastest : This variation attempts to maximize the speed of the multiplier, ignoring
area and power, except in the way they impact the performance (for example through
wire lengths). Differential wiring is used wherever possible to reduce the critical
path time.
Minimum Area : In this variation, all critical paths are fully powered, single ended
swings. Differential wiring is not used, with the exception that differential, level 0
signals are used if no additional area is needed for the extra wire. This configuration
is close to a traditional ECL implementation, giving the minimum area and minimum
power for a full tree ECL design.
Minimum Width : The goal is to improve the speed of the minimum area variation
by allowing differential wiring wherever the impact on the layout area is negligible.
Differential wiring is used where possible to reduce the critical path time, as long
as the width of the routing channels (and hence the entire layout) does not increase.
The use of differential wiring sometimes requires an extra output buffer, which
increases the height of the layout slightly, so the actual area will be a little more
than the minimum area variation. This variation is interesting in that it shows the
performance increment, with only a small increase in layout area, that is possible
with the selective use of differential wiring.
90% Summation Network Speed : Since the cost of the maximum possible speed
may be quite high (in terms of area and power), an interesting configuration is one in
which the speed of the summation network is not pushed to its absolute maximum,
but instead is only 90% of the maximum speed. That is, the delay of the summation
network in this configuration is Fastest / 0.9.
75% Summation Network Speed : Similar to the 90% speed configuration, except
that the speed of the summation network is pushed only to 75% of the maximum
speed available.
All of the above configurations vary only the speed, power and area of the summation
network. Since there are other components in the complete multiplier (such as adders,
recoders, wire drivers, etc.), the actual effect on the entire system will be reduced.
Referring to Table 5.3 it is obvious that simple multiplication is markedly inferior to the
Booth based algorithms in all important measures. Others have reached different conclu-
sions, such as Santoro [24], Jouppi et al. [13], and Adlietta et al. [1], so some explanation is
in order.
Power - The Santoro and Jouppi implementations are based on CMOS. The power
characteristics are quite different between ECL and CMOS designs, the former being
dominated by static power, the latter almost entirely dynamic power. Consequently,
power consumption measurements based upon one technology probably can not be
applied directly to the other. It seems possible, however, that a CMOS multiplexer
might consume less power than a CMOS CSA, if only because the former has one
output and the latter has two, so Booth encoding may still save power.
[Figures: relative delay, relative area, and relative power of the Simple, Booth 2, Booth 3, and Booth 4 multipliers across the Fastest, Minimum Width, Minimum Area, 90% Speed, and 75% Speed configurations.]
Delay - Simple multiplication is not significantly slower than Booth based implementations. Even though there are twice as many partial products to be added, the
delay through the summation network is basically logarithmic in the number of partial products, minimizing the difference. Any difference can be explained by the
replacement of the top two layers of CSA delay with a single multiplexer delay,
the delay of a CSA and a multiplexer being comparable. Also, the extra area of simple
multiplication contributes to longer wires, and thus longer delays.
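The logarithmic claim is easy to quantify: a 3:2 CSA stage reduces k partial products to about 2k/3, so halving the partial product count removes only a stage or two. The sketch below counts stages with a generic Wallace-style model (not the layout tool's detailed timing):

```python
def csa_stages(n: int) -> int:
    """Number of 3:2 carry save adder stages needed to reduce n
    partial products to the final 2 rows (Wallace-style reduction)."""
    stages = 0
    while n > 2:
        n = 2 * (n // 3) + (n % 3)  # each stage: groups of 3 rows -> 2 rows
        stages += 1
    return stages

# 53 bit multiply: simple multiplication vs Booth 2
print(csa_stages(53))  # 53 partial products (simple) -> 9 stages
print(csa_stages(27))  # 27 partial products (Booth 2) -> 7 stages
```

The two-stage difference matches the observation above that Booth 2 effectively replaces the top two layers of CSA delay.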
Area - The Booth 2 multiplexers used in this study are 24.6 in height, compared
to 31.6 for a CSA (the widths are the same). A one for one replacement of
CSA's with multiplexers, as happens when comparing simple multiplication to Booth
2 multiplication, should result in a modest reduction in total layout area. However,
simple multiplication still requires AND gates for the selection of the partial products.
The actual logic gate can frequently be designed into a CSA, with only a slight
increase in area of the CSA. The wires distributing the multiplicand and the multiplier
throughout the tree still require area, so the partial product selection still requires non-zero layout area. The remaining area difference can be explained by level shifters
that are required for the multiplicand at 2/3 of the inputs of all of the top level CSAs.
Santoro observes that the size of the Booth multiplexers is limited by the horizontal
wire pitch. Figure 5.12 shows a possible CMOS multiplexer. This particular version
has 4 horizontal wires crossing through each row of multiplexers that create a single
partial product (other designs could have from 3 to 5 horizontal wires).
[Figure 5.12: a possible CMOS Booth multiplexer. Select lines (Select 2X, Select 1X, Select 0X, Invert) gate multiplicand bits n and n-1 onto partial product bit n.]
Assuming 5 horizontal wires per partial product, an NxN bit Booth multiplier would have 5N/2
total horizontal wires, whereas simple multiplication would have N horizontal wires.
If a CSA is exactly the same size as a Booth multiplexer, then simple multiplication
would still be larger due to the N horizontal wires needed to control the AND gates
which generate the partial products.
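The wire counting argument can be written out directly; the 5-wires-per-row figure is the assumption carried over from the multiplexer design above:

```python
def horizontal_wires(n: int, booth2: bool, wires_per_pp: int = 5) -> int:
    """Horizontal wires crossing the partial product rows.

    Booth 2 roughly halves the row count but (in the assumed layout)
    needs wires_per_pp select wires per row; simple multiplication
    needs one AND-gate control wire per row."""
    rows = n // 2 if booth2 else n          # Booth 2: ~N/2 rows
    per_row = wires_per_pp if booth2 else 1
    return rows * per_row

print(horizontal_wires(64, booth2=True))    # 5N/2 wires
print(horizontal_wires(64, booth2=False))   # N wires
```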
If the multiplexers are not wire limited, it is extremely unlikely that a multiplexer
will be larger than a CSA, since the circuit is much simpler. Figures 5.2, 5.3, 5.12,
and 5.13 show designs for ECL and CMOS multiplexers and CSAs and clearly the
multiplexers are simpler than the CSAs.
The recoders that drive the select lines which control the multiplexers or AND gates
could explain how it might be possible for simple multiplication to be comparable
(or even smaller) to Booth encoding in CMOS. A relatively small bipolar transistor
drives a large amount of current, so increasing the number of horizontal wires does
not increase the size of the Booth multiplexer select drivers significantly. In contrast,
CMOS will require additional large area transistors to drive the additional long select
wires. The increase in the number of long select wires, from N to 5N/2, may increase
the area of the select generators enough to overcome the modest savings provided by
Booth encoding, if the 5 wire version of the multiplexer is used.
[Figure: a CMOS carry save adder, with inputs a, b, and c and outputs Sum and Carry.]
Returning to Table 5.3, the Booth 4 algorithm has no advantage over the Booth 3
algorithm. The reason for this is that the savings in CSAs that result from the reduction
in partial products is more than made up for by the extra adders required to generate the 2
additional hard multiples. The partial product select multiplexers are also almost twice the
area (80 vs 40.5 in height). Booth 4 may become more competitive if the length of the
multiplication is increased, since the area required for the hard multiple generation grows
linearly with the length, while the area of the summation network grows as the square of
the length. For lengths up to 64 bits, Booth 4 does not seem to be competitive.
In summary, only Booth 2 and Booth 3 seem to be viable algorithms. Booth 2 is
somewhat faster, but Booth 3 is smaller in area and consumes less power.
Delay (nsec)   Area (mm2)   Adder Power (Watts)   Driver Power (Watts)   Total Power (Watts)
0.2            0.53         0.50                  0.76                   1.26
Table 5.4: Delay/Area/Power of 55 Bit Multiple Generator, built from 14 bit Subsections
The model for the short multiple adders will be based upon the actual implementation of
a 14 bit 3X adder. Simple modifications will be made to allow the adder length to vary.
The vital statistics for this 14 bit multiple generator, when it is used to construct a 55 bit
multiple generator, are shown in Table 5.4. The delay does not include the time to drive the
long wires at the output, as this delay is accounted for separately.
Using the method described in Chapter 3, it is possible to build short multiple generators
of up to 14 bits using only two stages of logic. The delay of the longer generators is slightly
more than that of the shorter adders, but the delay difference can be minimized by using
more power along a single wire that propagates an intermediate carry from the low order 8
bits to the high order 6 bits. Although the delay is really a continuous function of the length
of the adder, the difference between adders of similar lengths is minimal, on the order of
50 psec between a length 5 adder and a length 14 adder. Although this is a significant
variation in the adder times (about 25%), it is a very small fraction of the total multiply time
(2% or less). The power consumption per bit is also roughly constant, with the difference
between a length 5 adder and a length 14 adder being about 10 mW. Since most of the power
and delay involved in the multiple generation is in driving the multiples to all the partial
product multiplexers, a more refined model will not be presented. Because the delay and
power differences between the shorter multiple generators and the longer ones are very
small, they will be ignored.
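The construction behind Table 5.4 can be sketched in software: the 3X multiple is formed by adding 1X and 2X in independent 14 bit sections, keeping each section's carry-out as a separate bit instead of propagating it. This is a simplified model of the partially redundant representation of Chapter 2, not the exact circuit encoding:

```python
def redundant_3x(m: int, width: int, interval: int = 14):
    """Compute 3*m in a partially redundant form: independent
    `interval`-bit adder sections, each producing a section sum plus
    one carry bit that is NOT propagated to the next section."""
    x1, x2 = m, m << 1                    # 1X and 2X (a simple shift)
    mask = (1 << interval) - 1
    sections, carries = [], []
    for lo in range(0, width + 2, interval):
        s = ((x1 >> lo) & mask) + ((x2 >> lo) & mask)  # small CPA
        sections.append(s & mask)          # section sum bits
        carries.append(s >> interval)      # carry kept separate
    return sections, carries

def recover(sections, carries, interval: int = 14) -> int:
    """Reassemble the ordinary binary value from the redundant form."""
    total = 0
    for i, (s, c) in enumerate(zip(sections, carries)):
        total += (s << (i * interval)) + (c << ((i + 1) * interval))
    return total

m = (1 << 53) - 1                          # a 53 bit multiplicand
sec, car = redundant_3x(m, 53)
assert recover(sec, car) == 3 * m
```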
Tables 5.5 and 5.6 show the implementation parameters for the redundant Booth 3 algorithm
as the carry interval is varied from 5 bits to 14 bits. The results are also shown graphically
in Figures 5.14, 5.15, and 5.16.
Referring to Tables 5.5 and 5.6 and Figures 5.14, 5.15 and 5.16, some general patterns can be
discerned. Generally, the delay is not dependent on the carry interval. This is due to the
logarithmic nature of the delay of the summation network. There are occasional aberrations
(such as the data for a carry interval of 8), but these are due to the fact that the layout program
happens to stumble upon a particularly good solution under some circumstances. The
area shows a more definite decrease as the carry interval is increased, again a pretty much
expected result, since fewer CSAs and multiplexers are required. A somewhat surprising
result is that the power, like the delay, is mostly independent of the carry interval. The reason
for this is that most of the additional CSAs required as the carry interval is decreased lie off
of the critical path, so these CSAs can be powered down significantly without increasing
the delay. In addition, the summation network has been made so fast that the total delay
is beginning to be dominated by the final carry propagate adder, and the driving of the
long wires that distribute the multiplicand and select control wires through the summation
network, not the delay of the summation network itself. It seems as though any carry
[Figure: delay (psec) versus carry interval (4 to 14 bits) for the Fastest, Minimum Width, Minimum Power, 90% Speed, and 75% Speed configurations.]
Figure 5.14: Delay of redundant Booth 3 implementations.
[Figure: area (mm2) versus carry interval (4 to 14 bits) for the same configurations.]
Figure 5.15: Area of redundant Booth 3 implementations.
[Figure: power (Watts) versus carry interval (4 to 14 bits) for the same configurations.]
Figure 5.16: Power of redundant Booth 3 implementations.
5.5 Fabrication
In order to prove the design flow, a test multiplier was fabricated in the experimental BiCMOS
process described previously. The implementation described here is that of a 53x53 integer
multiplier producing a 106 bit product. Due to pad and area limitations, only the high order
66 bits of the product are computed, with the low order 40 bits encoded into a single "sticky"
bit, using the method described in Appendix B.1. The algorithm used was the redundant
Booth 3 method described in Chapter 2, with 14 bit small adders. CMOS transistors are
used only as capacitors on internal voltage references.
After the entire multiplier was assembled, and the final design rule checks performed,
a net list of the entire multiplier was extracted from the layout and run through a custom
designed simulator built upon the commercial simulator LSIM (from Mentor Graphics).
The simulator works at the transistor and resistor level, and is approximately 3 orders of
magnitude faster than HSPICE at circuit simulation. It is not quite as accurate, and also
provides no timing information. Approximately 3000 carefully selected vectors were run
through the simulated multiplier. The simulation run takes about 10 hours of compute time,
[Figure: relative delay of the Booth 2, Booth 3-14, Booth 3, and Booth 3 Improved implementations for the Fastest, Minimum Width, Minimum Area, 90% Speed, and 75% Speed configurations, with each bar broken down into Booth recoders, summation network, and multiple generator.]
[Figure: relative area of the same implementations and configurations, with the same breakdown.]
[Figure: relative power of the same implementations and configurations, with the same breakdown.]
Figure 5.21: Photo of 53x53 multiplier chip. Die size is 5mm x 3mm.
faster than comparable CMOS designs and also competitive in area. ECL designs consume
high power, so careful circuit design is necessary to minimize the power consumption. With
such care, the power-delay product of ECL designs can be less than a factor of two larger
than CMOS designs.
5.7 Improvements
There are a couple of simple improvements that could be made to the multiplier designs
presented in this chapter. First, because of limitations in the tools, only a 2 layer channel
router was available. A good 3 layer router would have reduced the layout area of all the
multipliers by 10-20%. Second, the power consumption could be reduced by the addition
of a second power supply, with a voltage of -2.5V. Many output emitter followers could be
terminated to this supply rather than the -5V supply, reducing the power consumption in
these drivers by 50%. In general, about 1/3 of the current in the designs could be terminated to
this reduced voltage. This would reduce the total power consumption of all of the designs
by about 15%.
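The 15% figure follows from simple arithmetic: ECL static power is roughly current times voltage drop, so moving a fraction of the current from a 5V drop to a 2.5V drop halves that fraction's power. A sketch with illustrative numbers only:

```python
def power_saving(fraction_moved: float,
                 v_old: float = 5.0, v_new: float = 2.5) -> float:
    """Fractional reduction in total static power when fraction_moved
    of the supply current is terminated to v_new instead of v_old
    (static power ~ current * voltage drop)."""
    return fraction_moved * (1.0 - v_new / v_old)

# 1/3 of the current moved to the -2.5V supply:
print(round(power_saving(1 / 3), 3))   # ~0.167, i.e. roughly 15%
```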
5.9 Summary
Of the conventional partial product generation algorithms considered in this chapter, the
Booth 3 algorithm is the most efficient in power and area, but is slower due to the need
for an expensive carry propagate addition when computing "hard" multiples. The Booth 2
algorithm is the fastest, but is also quite power and area hungry. Other conventional
algorithms, such as Booth 4 and simple multiplication, do not seem to be competitive with
the first two.
Implementations using the redundant Booth 3 algorithm compare quite favorably with
the conventional algorithms. The fastest version of this algorithm is as fast as the Booth 2
algorithm, but provides modest decreases in both power (about 25%) and area (about 15%). The
redundant Booth algorithm also compares favorably with the conventional (non-redundant)
Booth 3 in both area and power, but is faster. Serious consideration should be given to this
class of algorithms.
Wires and input delay variations are important when designing summation networks, if
the highest possible performance is desired. Ignoring these effects can lead to designs that
are not as fast as they could be.
Chapter 6
Conclusions
The primary objective of this thesis has been to present a new type of partial product gener-
ation algorithm (Redundant Booth), to reduce the implementation to practice, and to show
through simulation and design that this algorithm is competitive with other more commonly
used algorithms when used for high performance implementations. Modest improvements
in area (about 15%) and power (about 25%) over more conventional algorithms have been
shown using this algorithm. Although no performance increment has been demonstrated,
this is not terribly surprising given the logarithmic nature of the summation network which
adds the partial products.
Secondarily, this thesis has shown that algorithms based upon the Booth partial product
method are distinctly superior in power and area when compared to non-Booth encoded
methods. This result must be used carefully if applied to other technologies, since different
trade-offs may apply. Partial product methods higher than Booth 3 do not seem to be
worthwhile, since the savings due to the reduction of the partial products do not seem to
justify the extra hardware required for the generation and distribution of the "hard multiples".
This conclusion may not apply for multiplication lengths larger than 64 bits. The reason for
this is that the "hard multiple" logic increases linearly with the multiplication length, but
the summation network hardware increases with the square of the multiplication length.
The use of carry save adders in a Wallace tree (or similar configuration) is so very fast at
summing partial products, that it seems that there is very little performance to be gained by
trying to optimize the architecture of this component further. The delay effects of the other
components of the multiplier now make up about 1/2
of the total delay. This reduces the performance sensitivity of the entire multiplier to small
changes in the summation network delay. As a result, somewhat slower, but more compact
structures (such as linear arrays or hybrid tree/array structures) may be competitive with
the faster tree approaches. Significant improvements in multiplier performance will come
only from using faster circuits, or by using a completely different approach.
[Figure: breakdown of the total multiply delay — final carry propagate adder 30%, summation network 49%, wires 14%, computing and driving the multiples 7%.]
The summation network and partial product generation logic consume most of the power
and area of a multiplier, so there may be more opportunities for improving multipliers by
optimizing summation networks to try to minimize these factors. Reducing the number of
partial products and creating efficient ways of driving the long wires needed in controlling
and providing multiples to the partial product generators are areas where further work may
prove fruitful.
Since wire delays are a substantial fraction of the total delay in both the summation
network and the carry propagate adder, efforts to minimize the area may also improve the
performance. Configuring the CSAs in a linear array arrangement is smaller and has shorter
wires than tree configurations. In the future, if wires become relatively more expensive,
such linear arrays may become competitive with tree approaches. At the present time trees
still seem to be faster.
Finally, good low level circuit design seems to be very important in producing good
multiplier designs. A modest improvement in the design of CSAs is important, because so
many of them are required. From 900 to 2500 carry save adders were used in the designs
presented in this thesis. This thesis has presented a power efficient circuit which can be used
to drive a group of long wires, when it is known that exactly 1 of the wires can be high at
any time. This single circuit reduces the power of the entire multiplier by about 8%, which
seems modest, but it is only a single circuit. Another example where concentrating on the
circuits can pay off is in power ramping non-critical paths, which saves about 30% of the
power at virtually no performance cost. These examples illustrate that good circuit design,
as well as good architectural decisions, are necessary if the best performing multipliers are
to be built.
Appendix A
Sign Extension in Booth Multipliers
This appendix shows how the sign extension constants that are needed when using Booth's
multiplication algorithm are computed. The method will be illustrated for the 16x16 bit
Booth 2 multiplication example given in Chapter 2. Once the basic technique is understood,
it is easily adapted to the higher Booth algorithms and also to the redundant Booth method
of partial product generation. The example will be that of an unsigned multiplication, but
the final section of this appendix will discuss the modifications that are required for signed
arithmetic.
APPENDIX A. SIGN EXTENSION IN BOOTH MULTIPLIERS 139
[Figure A.1: dot diagram of the 16x16 unsigned Booth 2 multiplication, with the multiplier shown along the right edge (Lsb at top, Msb at bottom) and the partial product selection table.]

Partial Product Selection Table
Multiplier Bits    Selection
000                +0
001                + Multiplicand
010                + Multiplicand
011                +2 x Multiplicand
100                -2 x Multiplicand
101                - Multiplicand
110                - Multiplicand
111                -0
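The selection table can be exercised in software. The sketch below recodes an unsigned multiplier into Booth 2 digits by scanning overlapping 3 bit groups (with the implied 0 below the least significant bit) and checks that the selected multiples sum back to the full product; it models the arithmetic only, not the dot diagram's sign extension mechanics:

```python
def booth2_digits(multiplier: int, n: int):
    """Recode an n-bit unsigned multiplier into Booth 2 digits in
    {-2, -1, 0, +1, +2}, one digit per overlapping 3 bit group."""
    table = {0b000: 0, 0b001: +1, 0b010: +1, 0b011: +2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    y = multiplier << 1                   # implied 0 below the lsb
    # groups are (b[2i+1], b[2i], b[2i-1]); one extra digit covers
    # the zero bits above the msb (the "guarantee positive" row)
    return [table[(y >> i) & 0b111] for i in range(0, n + 1, 2)]

def booth2_multiply(a: int, b: int, n: int = 16) -> int:
    """product = sum of digit_i * a * 4^i over the Booth 2 digits."""
    return sum(d * a * (4 ** i) for i, d in enumerate(booth2_digits(b, n)))

assert booth2_multiply(40503, 61237) == 40503 * 61237
```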
[Figure: the same dot diagram with each partial product, assumed negative, sign extended to the full product width with a string of 1s.]
[Figure: the dot diagram after the sign extension strings of 1s are summed, leaving only a few constant bits adjacent to each partial product.]
particular partial product turns out to not be negative. The leading string of ones in that
particular partial product can be converted back to a leading string of zeroes by adding a single
1 at the least significant bit of the string. Referring back to the selection table shown in
Figure A.1, a partial product is positive only if the most significant bit of the select bits for
that partial product is 0. Additionally, a 1 is added into the least significant bit of a partial
product only if it is negative. Figure A.4 illustrates this configuration. The S̄ bits represent
the 1's that are needed to clear the sign extension bits for positive partial products, and the
S bits represent the 1's that are added at the least significant bit of each partial product if it
is negative.
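The conversion of a leading string of 1s back to 0s is easy to check numerically: adding a single 1 at the least significant bit of the string ripples a carry to the top of the word, where it falls off. The field widths below (a 32 bit word with the string starting at bit 17) are chosen for illustration only:

```python
WIDTH = 32        # total word width (illustrative)
TOP = 17          # bit where the presumed sign-extension string starts

mask = (1 << WIDTH) - 1
ones_string = (mask >> TOP) << TOP           # bits TOP..WIDTH-1 all set

pp = 0x1ABCD                                 # a positive 17 bit partial product
with_ones = (ones_string | pp) & mask        # presumed-negative form
cleared = (with_ones + (1 << TOP)) & mask    # add 1 at the string's lsb

assert cleared == pp                         # leading 1s became leading 0s
```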
[Figure A.4: dot diagram showing the S̄ and S bits: an S̄ bit clears the leading 1 string of each positive partial product, and an S bit adds 1 at the least significant bit of each negative partial product. S̄ corresponds to the top 4 entries of the selection table.]
[Figure: the final unsigned configuration, with the selection table annotated: S = 0 if the partial product is positive (top 4 entries of the table), S = 1 if the partial product is negative (bottom 4 entries).]
The most significant partial product (shown at the bottom in all of the preceding
figures), which is necessary to guarantee a positive result, is not needed for signed
multiplication. All that is required is to sign extend the multiplier to fill out the
bits used in selecting the most significant partial product. For the sample 16x16
multiplier, this means that one partial product can be eliminated.
When ±Multiplicand (entries 1, 2, 5 and 6 from the partial product selection table) is
selected, the 17 bit section of the affected partial product is filled with a sign extended
copy of the multiplicand. This sign extension occurs before any complementing that
is necessary to obtain −Multiplicand.
The leading 1 strings, created by assuming that all partial products were negative, are
cleared in each partial product under a slightly different condition. The leading 1's
for a particular partial product are cleared when that partial product is positive. For
signed multiplication this occurs when the multiplicand is positive and the multiplier
select bits choose a positive multiple, and also when the multiplicand is negative and
the multiplier select bits choose a negative multiple. A simple EXCLUSIVE-NOR
between the sign bit of the multiplicand and the high order bit of the partial product
selection bits in the multiplier generates the one to be added to clear the leading 1's
correctly.
The complete 16x16 signed multiplier dot diagram is shown in Figure A.6
[Figure A.6: complete 16x16 signed multiplier dot diagram, with E bits marking the sign extension positions and S bits the low order correction bits of each partial product.]
Appendix B
Efficient Sticky Bit Computation
B.1 Rounding
The material in the preceding chapters of this thesis has dealt with methods and algorithms
for implementing integer multiplication. Chapter 5 briefly explained the format of IEEE
double precision floating point numbers. To convert an integer multiplier into a floating
point multiplier requires 2 modifications to the multiplication hardware:
Exponent adder - This involves a short length (12 bits or less) adder.
Rounding logic - The rounding logic accepts the 106 bit integer product and uses
the low order 53 bits of the product to slightly modify the high order 53 bits of the
product, which then becomes the final 53 bit fraction portion of the product.
The actual rounding process is quite involved, and methods for high speed rounding can be
found in Santoro, Bewick, and Horowitz [23]. The purpose of this appendix is to discuss an
efficient method for computing the "sticky bit", which is required for correct implementation
of the IEEE round to nearest rounding mode, and is also required for computation of the
IEEE "exact" status signal.
APPENDIX B. EFFICIENT STICKY BIT COMPUTATION 145
the actual low order product bits, just the input operands, so the determination can occur
in parallel with the actual multiply operation, removing the sticky computation from the
critical path. The disadvantage of this method is that significant extra hardware is required.
This hardware includes 2 long length priority encoders to count the number of trailing zeros
in the input operands, a small length adder, and a small length comparator. Some hardware
is eliminated, though, in that the actual low order bits of the product are no longer needed,
so part of the carry propagate adder hardware can be eliminated.
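The trailing-zeros method works because the number of trailing zeros in a product is the sum of the trailing zeros of the operands. A sketch for the 53x53 case with a 40 bit sticky group, using the convention of this appendix (ST = 1 when the discarded low order bits are all zero):

```python
def trailing_zeros(x: int) -> int:
    """Trailing zero count (what the priority encoders compute)."""
    assert x > 0
    return (x & -x).bit_length() - 1

def sticky_from_operands(a: int, b: int, low_bits: int = 40) -> int:
    """ST = 1 iff the low `low_bits` bits of a*b are all zero.
    Works because tz(a*b) = tz(a) + tz(b); no product bits needed."""
    return 1 if trailing_zeros(a) + trailing_zeros(b) >= low_bits else 0

# agrees with a direct check on the product's low 40 bits
a, b = 3 << 25, 5 << 20                       # tz = 25 and 20
assert sticky_from_operands(a, b) == 1        # 25 + 20 >= 40
assert (a * b) & ((1 << 40) - 1) == 0         # low bits really are zero
```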
The Santoro rounding paper describes a very clever method of sticky computation which
involves examining the low order bits of the product while it is still in a redundant form,
i.e. before the carry propagate add which computes the final product. If all of the low
order bits of the redundant form are zero, then the sticky bit must be 1, else it must be 0.
This overlaps the sticky computation with the final carry propagate add which computes
the product, removing the sticky from the critical path. Unfortunately, this method only
works for non-Booth encoded multipliers, a significant disadvantage given the results of
Chapter 5. Like the previous method, this scheme also avoids the actual computation of
the low order bits, which provides hardware savings in the final carry propagate adder.
It will be shown that a group of low order result bits will be all zeros if and only if:
The carry-in of 1 is propagated from the lowest order bit across all low order bits
involved in the sticky bit computation, AND
A carry is not generated anywhere by the low order bits involved in the sticky
computation.
Using the terminology of Chapter 3, these two conditions can be stated in a more precise
manner. Assume that the number of low order bits involved in the sticky computation is n.
Then the sticky bit, ST_n, is:

    ST_n = \overline{s_{n-1}} \cdot \overline{s_{n-2}} \cdots \overline{s_0}    (B.1)

where s_{n-1} ... s_0 are the low order product bits. The conjecture is that:

    ST_n = P_0^{n-1}    (B.2)

Proof: By induction on n, the number of bits involved in the sticky computation. For n = 1,
Equation 3.1 gives s_0, and with a carry-in of 1, s_0 becomes:

    s_0 = a_0 \oplus b_0 \oplus 1 = \overline{a_0 \oplus b_0}

so that

    ST_1 = \overline{s_0} = a_0 \oplus b_0    (B.3)

This is the same as Equation 3.6 which defines p_0. To allow the use of a general p_0, that
is where p_0 is computed as either p_0 (EXCLUSIVE-OR) or p_0^+ (OR), p_0 must be ANDed
with \overline{g_0}:

    ST_1 = a_0 \oplus b_0 = p_0 \overline{g_0}
This establishes the result for n=1. To prove for n bits, the result is assumed to be true for
n-1 bits, and then shown to be true for n bits. From Equation B.1:

    ST_n = ST_{n-1} \cdot \overline{s_{n-1}} = P_0^{n-2} \cdot \overline{p_{n-1} \oplus c_{n-1}}

where c_{n-1} is the carry-in to bit n-1. Equations 3.7, 3.8, and 3.9 give c_{n-1}, and since c_0 = 1,
c_{n-1} can be written as:

    c_{n-1} = G_0^{n-2} + P_0^{n-2}

Expanding ST_n with this value of c_{n-1}, the term
P_0^{n-1} G_0^{n-1}
can be dropped, because P_0^{n-1} and G_0^{n-1} are mutually exclusive. If a carry is propagated
across the entire group of low order bits, no carry can be generated in those bits. Then ST_n
becomes:

    ST_n = P_0^{n-1}
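The result ST_n = P_0^{n-1} can be checked directly in software. The sketch below uses the EXCLUSIVE-OR form of propagate, for which the AND with the complement of generate is unnecessary:

```python
import random

def sticky_redundant(a: int, b: int, n: int) -> int:
    """ST_n for the low n bits of a + b + 1 (rounding carry-in of 1):
    those bits are all zero exactly when every low order position
    propagates, i.e. p_i = a_i XOR b_i = 1 for i = 0 .. n-1. With the
    XOR propagate, p_i = 1 already rules out generate at that bit."""
    st = 1
    for i in range(n):
        st &= ((a >> i) & 1) ^ ((b >> i) & 1)
    return st

# cross-check against direct evaluation of the low order sum bits
random.seed(1)
for _ in range(1000):
    a, b, n = random.getrandbits(16), random.getrandbits(16), 8
    direct = 1 if (a + b + 1) & ((1 << n) - 1) == 0 else 0
    assert sticky_redundant(a, b, n) == direct
```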
Appendix C
Negative Logic Adders
The purpose of this appendix is to provide a proof of Theorem 1 of Chapter 3. Although there
are other ways of proving this particular theorem, the proof illustrates the simple manner
in which many of the relationships used in Chapter 3 can be proven using mathematical
induction. It is also possible to prove the correctness of the algorithms presented in
Chapter 2 by induction on the number of "digits" in the multiplier. A digit is a single
group of bits from the multiplier which is recoded using Booth's algorithm and then used
to select a particular partial product.
Theorem 1 Let A and B be positive logic binary numbers, each n bits long, and c_0 be a
single carry bit. Let S be the n bit sum of A, B, and c_0, and let c_n be the carry out from the
summation. That is:

    2^n c_n + S = A + B + c_0

Then:

    2^n \overline{c_n} + \overline{S} = \overline{A} + \overline{B} + \overline{c_0}

APPENDIX C. NEGATIVE LOGIC ADDERS 151

The proof is by induction on n. For n = 1, the sum and carry are s_0 = a_0 \oplus b_0 \oplus c_0 and
c_1 = a_0 b_0 + a_0 c_0 + b_0 c_0, so:

    2 \overline{c_1} + \overline{s_0}
      = 2 \overline{(a_0 b_0 + a_0 c_0 + b_0 c_0)} + \overline{(a_0 \oplus b_0 \oplus c_0)}
      = 2 [ \overline{a_0 b_0} \cdot \overline{a_0 c_0} \cdot \overline{b_0 c_0} ] + \overline{a_0} \oplus \overline{b_0} \oplus \overline{c_0}
      = 2 [ (\overline{a_0} + \overline{b_0})(\overline{a_0} + \overline{c_0})(\overline{b_0} + \overline{c_0}) ] + \overline{a_0} \oplus \overline{b_0} \oplus \overline{c_0}
      = 2 [ \overline{a_0}\,\overline{b_0} + \overline{a_0}\,\overline{c_0} + \overline{b_0}\,\overline{c_0} ] + \overline{a_0} \oplus \overline{b_0} \oplus \overline{c_0}
      = \overline{A} + \overline{B} + \overline{c_0}
To prove by induction on n (the length of the A and B operands), the theorem is assumed
for all operands of length n-1, and with this assumption it is proven to be true for operands
of length n. Proceeding:

    2^n \overline{c_n} + \overline{S}
      = 2^n \overline{(a_{n-1} b_{n-1} + a_{n-1} c_{n-1} + b_{n-1} c_{n-1})} + \sum_{k=0}^{n-1} \overline{s_k} 2^k
      = 2^n [ \overline{a_{n-1}}\,\overline{b_{n-1}} + \overline{a_{n-1}}\,\overline{c_{n-1}} + \overline{b_{n-1}}\,\overline{c_{n-1}} ] + 2^{n-1} \overline{s_{n-1}} + \sum_{k=0}^{n-2} \overline{s_k} 2^k
      = 2^n [ \overline{a_{n-1}}\,\overline{b_{n-1}} + \overline{a_{n-1}}\,\overline{c_{n-1}} + \overline{b_{n-1}}\,\overline{c_{n-1}} ] + 2^{n-1} ( \overline{a_{n-1}} \oplus \overline{b_{n-1}} \oplus \overline{c_{n-1}} ) + \sum_{k=0}^{n-2} \overline{s_k} 2^k
      = 2^{n-1} ( \overline{a_{n-1}} + \overline{b_{n-1}} + \overline{c_{n-1}} ) + \sum_{k=0}^{n-2} \overline{s_k} 2^k
      = 2^{n-1} \overline{a_{n-1}} + 2^{n-1} \overline{b_{n-1}} + 2^{n-1} \overline{c_{n-1}} + \sum_{k=0}^{n-2} \overline{s_k} 2^k

Now use the induction hypothesis to replace the last two terms by the sums of the first n-1
bits of A, B and a single bit carry-in c_0:

    2^n \overline{c_n} + \overline{S}
      = 2^{n-1} \overline{a_{n-1}} + 2^{n-1} \overline{b_{n-1}} + \sum_{k=0}^{n-2} \overline{a_k} 2^k + \sum_{k=0}^{n-2} \overline{b_k} 2^k + \overline{c_0}
      = \sum_{k=0}^{n-1} \overline{a_k} 2^k + \sum_{k=0}^{n-1} \overline{b_k} 2^k + \overline{c_0}
      = \overline{A} + \overline{B} + \overline{c_0}
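Theorem 1 can also be checked numerically: running complemented operands and carry-in through an ordinary adder must produce the complemented sum and carry-out. A brute force check for small operand widths:

```python
def check_theorem1(n: int) -> None:
    """Verify: if 2^n*c_n + S = A + B + c_0, then the same addition gives
    2^n*(NOT c_n) + (NOT S) = (NOT A) + (NOT B) + (NOT c_0), where NOT is
    the bitwise complement within n bits (1 bit for the carries)."""
    mask = (1 << n) - 1
    for a in range(1 << n):
        for b in range(1 << n):
            for c0 in (0, 1):
                total = a + b + c0
                s, cn = total & mask, total >> n
                total_bar = (a ^ mask) + (b ^ mask) + (c0 ^ 1)
                assert total_bar & mask == s ^ mask   # complemented sum
                assert total_bar >> n == cn ^ 1       # complemented carry

check_theorem1(6)   # exhaustive over all 2^6 x 2^6 x 2 cases
```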
[4] Gary Bewick and Michael J. Flynn. Binary Multiplication Using Partially Redundant
Multiples. Technical Report CSL-TR-92-528, Stanford University, June 1992.
[7] L. Dadda. Some Schemes for Parallel Multipliers. Alta Frequenza, 36(5):349–356,
May 1965.
[8] Bob Elkind, Jay Lessert, James Peterson, and Gregory Taylor. A sub 10ns Bipolar
64 Bit Integer/Floating Point Processor Implemented on Two Circuits. In IEEE 1987
Bipolar Circuits and Technology Meeting, pages 101–104, 1987.
BIBLIOGRAPHY 153
[9] Mohamed I. Elmasry. Digital Bipolar Integrated Circuits. John Wiley & Sons, 1983.
[12] IEEE Standard for Binary Floating-Point Arithmetic, 1985. ANSI/IEEE Std 754-1985.
[13] Norman P. Jouppi. MultiTitan Floating Point Unit. In MultiTitan: Four Architecture
Papers. Digital Western Research Laboratory, April 1988.
[14] Earl E. Swartzlander Jr., editor. Computer Arithmetic, volume 1. IEEE Computer
Society Press, 1990.
[16] H. Ling. High-Speed Binary Adder. IBM Journal of Research and Development, 25(2
and 3):156–166, May 1981.
[20] Motorola. MECL System Design Handbook. Motorola Semiconductor Products Inc.,
1988.
[21] Michael S. Paterson and Uri Zwick. Shallow Multiplication Circuits. In 10th Sympo-
sium on Computer Arithmetic, pages 28–34, 1991.
[22] Marc Rocchi, editor. High Speed Digital IC Technologies. Artech House, 1990.
[24] Mark Santoro. Design and Clocking of VLSI Multipliers. PhD thesis, Stanford
University, Oct 1989.
[25] Mark Santoro and Mark Horowitz. SPIM: A Pipelined 64x64b Iterative Array Mul-
tiplier. IEEE International Solid State Circuits Conference, pages 35–36, February
1988.
[26] N. R. Scott. Computer Number Systems & Arithmetic. Prentice-Hall, Inc., Englewood
Cliffs, New Jersey, 1985.
[27] C. E. Shannon. A Symbolic Analysis of Relay and Switching Circuits. Trans. Am.
Inst. Electr. Eng., 57:713–723, 1938.
[28] C. E. Shannon. The Synthesis of Two-Terminal Switching Circuits. Bell Syst. Tech.
J., 28(1), 1949.
[29] J. Sklansky. Conditional Sum Addition Logic. Transactions of the IRE, EC-9(2):226–
230, June 1960.
[30] Naofumi Takagi, Hiroto Yasuura, and Shuzo Yajima. High-speed VLSI Multiplication
Algorithm with a Redundant Binary Addition Tree. IEEE Transactions on Computers,
C–34(9), Sept 1985.
[31] Jeffery Y.F. Tang and J. Leon Yang. Noise Issues in the ECL Circuit Family. Technical
report, Digital Western Research Laboratory, January 1990.
[32] Stamatis Vassiliadis. Six-Stage 64-Bit Adder. IBM Technical Disclosure Bulletin,
30(6):208–212, November 1987.
[33] Stamatis Vassiliadis. Adders With Removed Dependencies. IBM Technical Disclosure
Bulletin, 30(10):426–429, March 1988.
[34] Stamatis Vassiliadis. A Comparison Between Adders with New Defined Carries and
Traditional Schemes for Addition. International Journal of Electronics, 64(4):617–
626, 1988.
[36] S. Waser and M. J. Flynn. Introduction to Arithmetic for Digital Systems Designers.
Holt, Rinehart and Winston, 1982.
[37] A. Weinberger. 4-2 Carry-Save Adder Module. IBM Technical Disclosure Bulletin,
23(8):3811–3814, January 1981.
[39] S. Winograd. On the Time Required to Perform Addition. Journal of the ACM,
12(2):227–285, 1965.
[40] S. Winograd. On the Time Required to Perform Multiplication. Journal of the ACM,
14(4):793–802, 1967.