An Array Multiplier
Logical effort is very effective for comparing large structures to select one of
several quite different overall organizations. Rather than completing a detailed
design of each alternative, we can use the method of logical effort to estimate
the performance of a design sketch. In this extended example, we explore several
designs for a multiplier. We make no claim that we find the best multiplier
design for any situation; we offer it only to illustrate the application of logical
effort.
The example illustrates many of the techniques of logical effort explained in
the book. The reader is assumed to be familiar with logical effort applied to static
gates (Chapters 1 and 4), asymmetric gates (Chapter 6), forks (Chapter 9), and
branches (Chapter 10). An alternative design using domino logic (Chapter 8)
is explored briefly. Methods of building multipliers and adders are introduced
briefly in the example; for a more complete treatment, the reader is invited to
study a text on computer design, such as D. A. Patterson and J. L. Hennessy,
Computer Organization and Design, 2nd edition, Morgan Kaufmann, 1998 (Section 4.6).
A multiplier is an interesting design example because it affords a rich set of
design choices. We can build a multiplier that uses a large array of adders or
an alternative that cycles a smaller array of adders several times to complete the
product. We will use logical effort to seek a design with minimum delay for a
given array configuration.
1 Multiplier Structure
There are many different structures that can be used to implement a multiplier.
Method 1 in Table 1 shows diagrammatically the simple method for multiplication we were all taught in school, modified so that all the numbers use binary
notation. In the illustration, a 5-bit multiplicand is multiplied by a 6-bit multiplier to obtain an 11-bit product. Each row of the array is the product of the
multiplicand and a single bit of the multiplier. Because multiplier bits are either
0 or 1, each row of the array is either 0 or a copy of the multiplicand. The array
is laid out so that each row is shifted to the left to account for the increasing binary significance of the multiplier bits. We sum the rows of the array to obtain
the product. You may wish to verify that the result is correct: in base ten, the
multiplicand is 11, the multiplier 46, and the product 506.
This simple method can be implemented directly in hardware by using five
separate 2-input, 5-bit adders to add the six rows of the array shown in the
table. But this design suffers in two ways. First, it is large because there are a lot
of adders. And second, the delay will be substantial because bits must propagate
through all five adders and because each adder has a carry path to resolve a 5-bit
carry.
Table 1 Methods of multiplying the 5-bit multiplicand 01011 (decimal 11) by the 6-bit multiplier 101110 (decimal 46) to obtain the 11-bit product 00111111010 (decimal 506).
Figure 1 The multiplier array of adder cells. A partial product enters at the
top, the Pij values are added, and a new partial product emerges at the bottom, in
carry-save form. In an actual layout, successive rows would probably be shifted
to the right so that the adder cells would form a perfect rectangular array.
But how does carry-save form avoid carry paths in the array? The answer
is illustrated in Figure 1, laid out to resemble the format shown in Table 1,
Method 3. The previous partial product enters at the top, in carry-save form,
from register cells Tj . That is, each bit of the partial product is represented by
two signals, which must be added to determine the binary value. Both signals
from Tj have binary weight 2^j in the result. The next partial product is produced
at the bottom of the figure. Note that three bits of the result (T1, T2, and T3)
are produced in carry-resolved rather than carry-save form and become a part
of the final product. The remaining five bits of partial product are routed to the
inputs of the registers Tj , j = 0 . . . 4 to become the partial product used as input
in the next iteration of the multiplier. To simplify the diagram, the drawing of
each register Tj is split: its outputs appear at the top of the figure and its inputs
appear at the bottom.
The array consists of three rows of five adder cells each. Each row is responsible for adding a shifted form of the multiplicand to the partial product and
passing the partial product to the row below. Each adder cell contains a 1-bit
full adder with three inputs of equal binary value, labeled a, b, and c. Each cell
has two outputs representing the sum (s) and carry (cry) that result from adding
together the three inputs. The sum output represents the same binary weight as
each of the three inputs. The carry output represents twice the binary weight of
the sum output. As you can see from the figure, each cell combines a sum and
a carry input from cells earlier in the array with a product bit, labeled Pij . Note
that the longest path through the array is three adder cells, corresponding to the
number of rows in the array.
The product bits Pij are generated by combining multiplicand and multiplier bits using an and gate, Pij = Qi · Rj, where Qi are bits of the multiplier and Rj are bits of the multiplicand, again using the notation that bit j has weight 2^j. The
effect is that a row of cells adds 0 to the partial product if the corresponding bit
of the multiplier is 0 and adds the multiplicand if it is 1. The structure of the
array causes the multiplicand to shift to the left, corresponding to the weight
of the multiplier bit. The multiplier Q and multiplicand R are both stored in
registers operated at the same time as the partial-product register T. Of course,
at the end of the first cycle, the multiplier must be shifted three bits to the right
so that in the second cycle the high-order three bits of the multiplier are used
to control the array. Neither the registers Q and R nor the shifting logic appears
in the figure. Also not shown are registers to save the low-order three bits of the
result of the first cycle (T1, T2, and T3).
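The behavior just described is easy to model in software. The following Python sketch (not from the original text; the names and structure are illustrative) keeps the partial product as a (sum, carry) pair of integers, adds three rows per cycle with carry-save adders, and resolves the carries with one ordinary addition at the end; it reproduces the 11 × 46 = 506 example of Table 1.

def carry_save_add(a, b, c):
    """Add three numbers without propagating carries; return (sum, carry)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def multiply(multiplicand, multiplier, rows_per_cycle=3):
    """Model of the iterated carry-save array: each cycle folds rows_per_cycle
    shifted copies of the multiplicand into a (sum, carry) partial product."""
    s, c = 0, 0
    shift = 0
    while multiplier:
        for _ in range(rows_per_cycle):        # one pass through the adder array
            row = multiplicand << shift if multiplier & 1 else 0
            s, c = carry_save_add(s, c, row)
            multiplier >>= 1
            shift += 1
    return s + c                               # final carry-resolving addition

assert multiply(11, 46) == 506                 # the worked example of Table 1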
The design task is to make the long paths through the arrays of and gates and adders operate quickly. In order to obtain results that are somewhat more general than the example, we'll represent the number of bits in the multiplicand by the symbol m and the number of rows in the multiplier array by n. So the longest path through the array has an and gate and n adder cells. Note that the choice of a carry-save representation avoids a carry path that would have a length m + n, which is usually much longer than n, the length of the carry path in the array. By retaining m and n as parameters, we can obtain results that will let us evaluate alternatives to the 3 × 5 array design if we wish to reevaluate the speed-space trade-off.
Figure 2 An adder cell sums three inputs a, b, and c to produce one sum bit s
and one carry bit cry.
Table 2 Logical effort of the majority and parity gate inputs.

    Majority                        Parity
    Input   Logical effort g        Bundle  Logical effort g
    a       2                       a       6
    b       4                       b       12
    c       4                       c       6
    Total   10                      Total   24
2 Adder Cell
A key element in these paths is the adder cell, shown in Figure 2. It consists of
a parity circuit to generate the sum output and a majority circuit to generate
the carry output. Circuits for the majority and parity gates are shown in Figures 4.6(b) and 4.7(b) in the book. In order to reduce the logical effort of the
majority and parity circuits, we have chosen topologically asymmetric forms.
The three inputs of the parity circuit drive different numbers of transistors. One
input is twice as hard to drive as the other two. Similarly, two inputs of the majority circuit are twice as hard to drive as the other one (see Table 2). Thus, some
inputs of these circuits are easier to drive than others.
Circuit details of these two gates can be ignored for now. The parity gate
requires dual-rail (true and complement) inputs. The majority gate inverts the
sense of its inputs, that is, it computes the complement of the majority of its
inputs. However, because of the symmetry of the majority function, the gate
will compute true majority by complementing each of its inputs. We can defer
considering these details until later because they do not enter into logical effort
calculations.
A critical part of the design of the multiplier array is choosing which inputs
of the majority and parity circuits to connect to which outputs of previous
circuits. From a functional point of view, the three inputs to the majority and
parity circuits are interchangeable. From a logical effort point of view, however,
they are not. We are thus faced with a combinatorial problem. Shall we connect
the sum output to the easy-to-drive or harder-to-drive inputs of the majority
and parity circuits of the subsequent adder? Inside each adder cell, should we
connect the hard-to-drive input of the parity circuit to the easy-to-drive input
of the majority circuit to make the input capacitances of the three adder inputs
more equal?
Let us first consider the situation within a single adder cell. As shown in
Figure 2, we will call the load driven by the majority circuit x and the load driven
by the parity circuit y. The relative sizes of these two loads will depend on how
these signals are connected to inputs of subsequent adder cells. The figure also
shows that stages of amplification may be inserted before (or perhaps after) the
logic circuits. We will probably use 2-1 forks to provide the required true and
complement signals, but let's not worry about that just yet. The wiring structure
of the array depicted in Figure 1 tells us that the sum and carry paths in each
cell should have identical delays so that all paths through the entire array that
traverse n adder cells will have the same overall delay. We know that the fastest
design will be one that minimizes the effort along a path through the entire
array and therefore one that also minimizes the effort along a path from inputs to
outputs of an adder cell. Moreover, logical effort tells us the relationship between
the path effort F through the adder, the input capacitances of the three adder
inputs, and the output loads to be driven, x and y.
The adder cell has two different configurations, depending on which inputs
of the majority gate are tied to which inputs of the parity gate (see Figure 3).
These two configurations are named V and W. The following list shows the
input capacitance on each of the three cell inputs (a, b, and c) for the two
configurations. In each case, the effort F is the effort along the paths from the
specified input to either the carry or sum outputs.
    Configuration V                 Configuration W
    Ca = (2x + 6y)/F                Ca = (2x + 12y)/F
    Cb = (4x + 12y)/F               Cb = (4x + 6y)/F
    Cc = (4x + 6y)/F                Cc = (4x + 6y)/F

Figure 3 Two wiring configurations for the adder cell. Numbers inside the
gate symbols are the logical effort of the input or input bundle.
These expressions are derived from the basic equation of logical effort,
F = GH. By way of example, let us consider the a input of configuration V. If
we assume a fraction α of the input capacitance is devoted to the path through the majority gate, we have Fcry = 2x/(αCa), where Fcry is the effort to drive the carry output. Analogously for the sum output, we have Fs = 6y/((1 − α)Ca). Recognizing that the two efforts should be equal for equal delay, F = Fcry = Fs, we find that the branching fraction α drops out (why?), and we obtain F = (2x + 6y)/Ca, the expression shown in the preceding list. The constants 2, 4,
6, and 12 that appear in these expressions are the logical efforts of the various
inputs of the parity and majority circuits. Notice that in configuration W the
two inputs called b and c have identical capacitance.
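The cancellation is easy to confirm numerically. The short sketch below is illustrative only; the particular x and y values are the ones found later for the best configuration. It splits the input capacitance so the two efforts match and checks that F equals (2x + 6y)/Ca regardless of the split.

# Check that the split of input capacitance between the two paths drops out.
Ca, x, y = 1.0, 1.3, 1.0          # example values (the V4 solution found later)
alpha = 2 * x / (2 * x + 6 * y)   # the split that equalizes the two efforts
F_cry = 2 * x / (alpha * Ca)      # effort from input a to the carry output
F_s = 6 * y / ((1 - alpha) * Ca)  # effort from input a to the sum output
assert abs(F_cry - F_s) < 1e-9
assert abs(F_cry - (2 * x + 6 * y) / Ca) < 1e-9
print(F_cry)                      # 8.6 for these values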
Next we need to consider how the sum and carry outputs of a cell are
connected to the a, b, or c inputs of subsequent adders. There are six possible
wiring configurations (see Table 3), characterized by the way in which each input
is connected to the previous cell. These wiring configurations are combined
with the internal configurations described previously (V and W) to yield twelve
combinations, V1 . . . V6 and W1 . . . W6. Because some of the inputs have identical logical effort, some combinations have the same effect. Listing only distinct cases, we have V1 . . . V6, W1, W3, W4 (Table 4).

Table 3 The six possible wiring configurations between adder cells.

    Case    Connections
    1       Pij → a,  cry → b,  s → c
    2       Pij → a,  cry → c,  s → b
    3       Pij → b,  cry → a,  s → c
    4       Pij → b,  cry → c,  s → a
    5       Pij → c,  cry → a,  s → b
    6       Pij → c,  cry → b,  s → a

Table 4 Summary of intercell wiring cases. The fastest design is V4; the slowest is V2.

    Type    Path effort F
    V1      12.0
    V2      14.32
    V3      9.29
    V4      8.61
    V5      14.0
    V6      10.0
    W1      10.0
    W3      11.21
    W4      13.29
Which combination gives the best performance? Because the answer to this
combinatorial problem can be found only by working out the least-delay circuit
for each possible combination of inputs and outputs, we shall write expressions
for capacitance for each of the various combinations. For each combination we
can set x and y equal to the capacitances of the inputs that they drive and derive a
value for F as a consequence. We'll show the steps for combination W4. Because the carry signal is connected to input c of the next stage, we have

x = Cc = (4x + 6y)/F        (1)

Likewise, because the sum signal is connected to input a of the next stage,

y = Ca = (2x + 12y)/F        (2)
These two equations have a consistent solution only for a particular value of F; solving them gives F = 13.29 for combination W4. Carrying out the same calculation for each of the distinct combinations gives the path efforts listed in Table 4.¹ The fastest design is V4, with F = 8.61, so the least delay through an adder cell is

D = N F^(1/N) + P = 2√8.61 + 6 + 1 = 12.9        (3)

based on an estimate of the parasitic delay of the majority or parity gate of 6 p_inv, the addition of p_inv assuming one stage of inverter to make a two-stage design, and the rule of thumb that p_inv = 1.
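If you want to check the whole of Table 4 mechanically, the following Python sketch (mine, not part of the original text) sets up the same pair of load equations for every combination of internal configuration and intercell wiring and solves for the path effort F.

import math

# Logical efforts (gm, gp) seen by each cell input: gm toward the majority gate
# (carry) and gp toward the parity gate (sum), for the two internal wiring
# configurations of Figure 3.
EFFORTS = {
    "V": {"a": (2, 6), "b": (4, 12), "c": (4, 6)},
    "W": {"a": (2, 12), "b": (4, 6), "c": (4, 6)},
}

def path_effort(config, cry_in, sum_in):
    """Largest F satisfying F*x = gm_i*x + gp_i*y and F*y = gm_j*x + gp_j*y,
    where the carry output drives input cry_in and the sum drives sum_in."""
    gm_i, gp_i = EFFORTS[config][cry_in]
    gm_j, gp_j = EFFORTS[config][sum_in]
    # A nontrivial (x, y) requires the determinant of the 2x2 system to vanish:
    # F^2 - (gm_i + gp_j) F + (gm_i*gp_j - gp_i*gm_j) = 0.  Take the larger root.
    b, c = gm_i + gp_j, gm_i * gp_j - gp_i * gm_j
    return (b + math.sqrt(b * b - 4 * c)) / 2

CASES = {1: ("b", "c"), 2: ("c", "b"), 3: ("a", "c"),
         4: ("c", "a"), 5: ("a", "b"), 6: ("b", "a")}   # (cry input, sum input)

for config in "VW":
    for case, (cry_in, sum_in) in CASES.items():
        print(f"{config}{case}: F = {path_effort(config, cry_in, sum_in):.2f}")

Running it lists all twelve combinations; the duplicated values among the W cases show why only W1, W3, and W4 need to be tabulated as distinct.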
We can also compute the load capacitances of the inputs a, b, and c from
the equations prepared for configuration V. Given that F = 8.61, we find that
Ca = y = 1, Cb = 2, and Cc = x = 1.3. Thus, the carry (cry) signals will have
slightly more drive than the sum (s) signals, matched to the c and a input capacitances, respectively. We need to inspect the edges of the adder cell array to ensure that these conventions are acceptable. At the bottom of the array, carry and sum signals enter the partial-product register; at the top of the array, the register drives the a and c inputs of adder cells. So by great good fortune, we simply arrange that the two bits stored for each binary position in register T are carried at different drive strengths: one has input and output drive of 1.0 units and one of 1.3 units!

¹ In general, the delays along two paths will differ if the parasitic delays of the gates on the paths differ. But it so happens that the estimated parasitic delays for the majority and parity gates are both 6 p_inv, so we are justified in asserting that equal efforts along the paths yield equal delays.
The product bits Pij are generated by two-input nand gates that combine multiplier and multiplicand bits, as sketched in Figure 4. Each multiplier bit Qi must drive m nand gates, one in each column of its row, while each multiplicand bit Rj drives only n nand gates, one in each row. Because the two inputs of each nand gate see such different loads, we use an asymmetric nand gate, favoring the more heavily loaded multiplier bit on its a input and choosing the symmetry factor so that the efforts borne by the two paths are equal:

m · ga = n · gb        (4)
Figure 4 The multiplier and multiplicand are combined to produce values Pij
to be summed in the adder array.
Equations 6.2 and 6.3 in the book give expressions for ga and gb for a nand gate. Assuming the usual pullup-to-pulldown width ratio of 2 and taking m = 5 and n = 3, we solve to find a value for the symmetry factor s = 0.28. As a result, ga = 1.13 and gb = 1.86.
Now we can estimate the delay of the circuit. The electrical effort will be 2
because, first, we can assume multiplicand and multiplier bits have the same
drive as register T, which drives the a inputs, with Ca = 1; and, second, the
nand gate must drive input b, with Cb = 2. Thus, we have F = GBH = ga · m · (2) = 1.13 × 5 × 2 = 11.3. This calls for a two-stage design, so

D = N F^(1/N) + P = 2√11.3 + 1 + 2 = 9.7        (5)

It's worth remarking that if we had not used an asymmetric nand gate, the delay would have been 10.3.
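These numbers can be verified with a small calculation. The sketch below is an illustration under the stated assumptions, not code from the original: it uses my reconstruction of the nand-gate effort expressions for a pullup-to-pulldown ratio of 2, finds the symmetry factor by equalizing the efforts borne by the multiplier and multiplicand bits, and then evaluates the two-stage delay.

import math

def nand_efforts(s):
    """Logical efforts of the favored (a) and unfavored (b) inputs of an
    asymmetric two-input nand with symmetry factor s, assuming a
    pullup-to-pulldown width ratio of 2."""
    return (1 / (1 - s) + 2) / 3, (1 / s + 2) / 3

def symmetry_factor(m, n):
    """Choose s so the multiplier bit (m gates on input a) and the multiplicand
    bit (n gates on input b) bear equal effort: m*ga == n*gb."""
    lo, hi = 1e-6, 0.5
    for _ in range(100):                  # simple bisection
        s = (lo + hi) / 2
        ga, gb = nand_efforts(s)
        lo, hi = (lo, s) if ga * m > gb * n else (s, hi)
    return s

s = round(symmetry_factor(5, 3), 2)
ga, gb = nand_efforts(s)
print(s, round(ga, 2), round(gb, 2))                 # 0.28 1.13 1.86
F = ga * 5 * 2                                       # F = G B H for the top-row path
print(round(F, 1), round(2 * math.sqrt(F) + 3, 1))   # 11.3 and a delay of 9.7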
The delay of this structure plays a different role in different parts of the adder
array. In the top row, the delay is on the critical path through the entire array:
the top row of adder cells cannot begin working until the P0j bits have been
computed. However, for all other rows, the delay in computing the product
bits will be far less than the delay of other signals reaching the adder cells
(compare the delay of computing the product bits with the delay of an adder
cell computed in the previous section). This suggests that we might prefer a
different design for all but the top row, using delayed multiplicand bits to drive
the nand gates. In other words, we drive the top row as shown in Figure 4 but
fork off a separate signal to carry the multiplicand bit to other rows. The fork
will have more stages of amplification that take more time, but it will not offer
much load to the multiplicand bit. If we divide the load equally between the top
row nand gates and the fork, the situation would be similar to setting n = 2
in the preceding analysis. If we do so, we find ga = 1.07. This differs very little
from 1.13, illustrating the limited gains available from asymmetric structures.
This analysis lets us assume that ga ≈ 1.1 irrespective of the number of rows, n.
As the number of columns, m, increases, the situation is not as favorable.
The load on a multiplier bit increases, and we must build a suitable string of
amplifiers to drive the bit to all nand gates in the top row as fast as possible.
(Again, other rows are not on the critical path and could get by with less drive.
But the drive for the top row is on the critical path for the whole array.) We
have F = GBH = ga · m · (2) = 2.2m. We know that good designs will bear an effort of about ρ = 3.59 for each stage, so the best number of stages will be N = ln F / ln ρ, thus giving rise to a delay:

D = N F^(1/N) + N p_inv + 1 = 3.6 ln F + 1 = 3.6 ln m + 3.8        (6)
(The term 1 accounts for the extra parasitic delay of a nand gate compared
to that of an inverter.) For example, for m = 32, D = 16.3.
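A few lines of Python (illustrative only) reproduce this estimate, both with the stage count rounded to an integer and with the smooth 3.6 ln m + 3.8 approximation:

import math

RHO = 3.59                      # best stage effort for static gates
P_INV = 1.0

def drive_delay(m, ga=1.1, H=2):
    """Delay of the amplifier string plus nand driving m columns (Equation 6)."""
    F = ga * m * H                                   # path effort, about 2.2 m
    N = max(1, round(math.log(F) / math.log(RHO)))   # best number of stages
    return N * F ** (1 / N) + N * P_INV + 1          # +1: extra parasitic of the nand

for m in (5, 16, 32, 64):
    print(m, round(drive_delay(m), 1), round(3.6 * math.log(m) + 3.8, 1))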
Combining the estimated delay of generating the product bits for the top row (9.7) with the delay of the three adder cells on the longest path (3 × 12.9 = 38.7) gives a total of about 48.4 for one pass through the array.

Is a total delay of 48.4 acceptable? Perhaps this is just the number we were seeking! If it's smaller than our target, we can omit some of the amplifier stages we used to obtain least delay. If it's larger than we intended, we need to make some structural changes to the multiplier, because these delays are estimates of the least possible delay: it may not be possible to achieve such a low delay when design details are considered.
What kinds of structural changes could we contemplate? The delay to compute the Pij bits for the top row of adder cells could be removed entirely by
pipelining, that is, computing these values in the previous cycle and saving them
in a register. Of course, we would need to worry about how these values are prepared for the first iteration of the multiplier. If successful, this change would
reduce the delay to 3 × 12.9 = 38.7.
If still faster cycle time is required, we could consider reducing from 3 to 2 the
number of multiplier bits used in each cycle, thus reducing the number of rows
of adder cells to 2. Of course, this might not speed up the overall multiplication
time because additional iterations would be required in order to use all the
multiplier bits (6 in our example).
We could also consider changing circuit technologies. Rather than using
static gates, we could consider domino gates. We'll defer a detailed explanation
of this choice until a later section, but we know from experience that domino
circuits are likely to be faster than their static counterparts.
It's also important to remark that our analysis is not perfect. We've ignored the design of the partial-product register. If the register cells use multiplexers driven by a two-phase clock, they will have two stages, each with a logical effort of 2, for a combined logical effort of 4. Ideally, two stages should be bearing an effort more like ρ² ≈ 13. So the registers can amplify; that is, their outputs should drive more load than their inputs require. Our design does not exploit this amplification because we would need adder cells with different transistor sizes: the top row would use larger transistors than the bottom row. We've chosen not to attempt this optimization, but it is easy to analyze (see Exercise 3).
5 Circuit Design
In this section, we assume that the structure explored in previous sections has been selected and that it's time to complete the design. We need to insert amplifiers so that paths have the right number of stages, we need to worry about the polarity of various signals, and we need to provide true and complement forms of signals that drive the three-input parity gate in the adder cells.

We've already noticed a problem: the critical path through the adder cell has an effort of F = 8.61, which would suggest a two-stage design, but if we use a 2-1 fork to generate true and complement forms, followed by the parity stage, that path will have an effective length between two and three stages, which is too many.
If we want to have exactly two stages in each adder cell, we can use a dual-rail design, in which every input and output to the cell is carried as a two-wire
bundle containing true and complement forms of the signal. Figure 5 shows a
detailed design of such a cell. The true and complement inputs to the parity
gates appear separately; each input is labeled with its logical effort. The output
loads are 1.0 and 1.3, and the input loads 1, 2, and 1.3, as determined by the
analysis in Section 2. We can verify the effort of one of the paths (e.g., a) by
summing the effort GH of each branch: F = 2(1.3/1) + 3(1/1) + 3(1/1) = 8.6,
which is exactly what our previous analysis determined, so this circuit design bears out the earlier analysis.

Knowing that the effort through the cell should be 8.6 (about 2.9 per stage), it's easy to determine the
transistor sizes of each of the gates. For example, suppose the inverter connected
to the a input has an input capacitance equal to that of a transistor whose width
is 6.4 microns and whose length is the minimum permitted by the fabrication
process. Then the sum of the transistor widths driven by the inverter should be
6.4 × √8.6 = 18.8 microns. This should be divided in proportion to the effort borne by each branch, that is, 2 × 1.3 to the majority gate and 3 × 1 to each of the
two parity gates. Thus, the majority gate should have 5.7 microns of input load
and the parity gates 6.6 microns of input load. The other paths can be analyzed
analogously.
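The same sizing arithmetic, as a quick sketch (the widths are the example's assumed values):

# Apportion the load driven by the inverter on the a input of the dual-rail cell.
stage_effort = 8.6 ** 0.5                    # each of the two stages bears sqrt(8.6)
total_width = round(6.4 * stage_effort, 1)   # total gate width driven, 18.8 um

branch_efforts = {"majority gate": 2 * 1.3, "each parity gate": 3 * 1.0}
total_effort = 2 * 1.3 + 2 * (3 * 1.0)       # 8.6, as computed in the text
for name, effort in branch_efforts.items():
    print(name, round(total_width * effort / total_effort, 1), "um")
# majority gate 5.7 um, each parity gate 6.6 um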
It's interesting to consider the performance of a one-stage design; that is, to remove the inverters from the circuit of Figure 5 and invert the polarity of the inputs. We now have D = 8.6 + 6 = 14.6, compared to 12.9 for the best two-stage design.
One problem with the dual-rail design may prove fatal: it takes a lot of area.
The area of the additional majority and parity gates, which are large, may exceed
that of an alternative single-rail design. The complex wiring topologies of these
gates require lots of room for wires, contacts, and crossovers. If you've ever tried
to lay out a two-input xor gate, you can appreciate the problem!
Figure 5 Design for an adder cell in which all inputs and outputs are carried
in dual-rail bundles.
Figure 6 Design for an adder cell in which all inputs and outputs are carried
in single-rail form.

Figure 6 shows such a single-rail alternative. Within the cell, 2-1 forks generate the true and complement forms needed by the parity gate, and the majority gate, which inverts, is driven from the complement outputs of the forks so that it produces the true carry. If we size the majority and parity gates so that each bears a stage effort of about 2.4, the parity inputs with a logical effort of 3 will have capacitance 3(1.0/2.4) = 1.25, and the input with logical effort of 6 will have capacitance 2.5.
Now we can analyze the forks with known loads. Consider the fork attached to the a signal. We don't know how the input load of 1.0 will divide between the two paths, but we know that if the 2-inverter leg of the fork has input capacitance α, the other has input capacitance 1 − α. The 2-inverter leg is loaded with capacitance 1.25. The 1-inverter leg is loaded with 1 + 1.25 = 2.25. Setting the delays in the two paths equal, we have

2√(1.25/α) + 2 = 2.25/(1 − α) + 1        (7)

We find α = 0.47 and the delay through either path is 5.25. To get the delay of the
entire path, we must add the stage effort of the majority gate (the 2.4 assumed above) and its parasitic delay, 6. The total is thus 5.25 + 2.4 + 6 = 13.65. This is only slightly worse than the best (12.9), and this circuit may be much easier to lay out than the dual-rail form. Although the 2-1 forks look forbidding, the inverters and their wiring aren't hard to lay out.
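Equation 7 is straightforward to solve numerically. The sketch below (illustrative only; it simply bisects on the difference between the two leg delays) reproduces α ≈ 0.47 and the 13.65 total.

# Split the 1.0 units of input capacitance of the a signal between the two legs
# of its 2-1 fork so that the two legs have equal delay (Equation 7).
def fork_delays(alpha, load2=1.25, load1=2.25, p_inv=1.0):
    d2 = 2 * (load2 / alpha) ** 0.5 + 2 * p_inv   # two-inverter leg, two equal stages
    d1 = load1 / (1 - alpha) + p_inv              # one-inverter leg
    return d2, d1

lo, hi = 0.01, 0.99
for _ in range(60):                               # bisection on d2 - d1
    alpha = (lo + hi) / 2
    d2, d1 = fork_delays(alpha)
    lo, hi = (alpha, hi) if d2 > d1 else (lo, alpha)

print(round(alpha, 2))                            # about 0.47
d2, d1 = fork_delays(0.47)                        # evaluate at the rounded value
print(round(d1, 2), round(d1 + 2.4 + 6, 2))       # 5.25 for the fork, 13.65 in total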
6.1 Wiring capacitance
Do the long wires in the multiplier array contribute significant delay? To address
this question, we need to estimate the size of an adder cell. For the first time, we
have to choose an actual transistor size in microns. Let's assume we're using a
0.6 micron process, with a design style in which 1 unit of capacitance (e.g., the
load presented by input a in Figure 6) represents 9 microns of transistor width;
for example, an inverter with pulldown width of 3 microns and pullup width
of 6 microns. By looking at a layout for a two-input xor gate and making some
crude estimates, we can guess that the parity gate could be laid out in a box 20
microns wide and 30 microns high. The majority gate would be 15 microns wide
and the same height (we're assuming power and ground wires run horizontally,
so all the gates should have the same height). The forks would require about 10
microns each. Thus, the cell might fit in a box 65 microns wide and 30 microns
high.
A multiplicand bit wire runs vertically to span three rows of cells, or about
90 microns. Using the rule of thumb that a wire is about 1/10 the capacitance
per unit length of a transistor gate, the wire would offer the same capacitance
as 9 microns of gate width, or about 1 unit of capacitance. This is negligible.
A multiplier bit wire runs horizontally to span five columns of cells, or
about 320 microns. This converts to 32 microns of gate width, or about 4
units of capacitance. This is enough load that we should consider it in our
design. If the inverter driving the horizontal wire (signal Qi in Figure 4) has
an input capacitance of 1 unit (which we assumed) and a stage effort of around
4 (characteristic of minimum-delay paths), then the load capacitance of the five
nand gates can be assumed to total about 4 units. Thus, the wiring capacitance
doubles the assumed load on the inverter. We should redo the analysis for this
part of the circuit when we choose final transistor sizes. We will find that the
greater wire load may necessitate a second inverter in the Qi amplifier. Moreover,
since the nand gates represent a smaller fraction of the Qi load, they may be
increased in size, reducing their efforts while increasing the amplifier effort by
a smaller amount.
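The wire-load estimates can be reproduced in a few lines (all dimensions here are the rough guesses made above, not measured values):

# Rough wire-capacitance estimates for the 3 x 5 array, in "units" where
# 1 unit of input capacitance corresponds to 9 um of transistor gate width.
CELL_W, CELL_H = 65.0, 30.0      # estimated adder-cell footprint, um
UNIT_GATE_WIDTH = 9.0            # um of gate width per unit of capacitance
WIRE_CAP_RATIO = 0.1             # wire cap per um ~ 1/10 of gate cap per um

def wire_units(length_um):
    return length_um * WIRE_CAP_RATIO / UNIT_GATE_WIDTH

print(round(wire_units(3 * CELL_H), 1))   # multiplicand wire: about 1 unit, negligible
print(round(wire_units(5 * CELL_W), 1))   # multiplier wire: about 3.6 units
                                          # (the text rounds this to 4), significant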
6.2 Domino logic
Let's briefly analyze the use of domino logic in the adder cell. (To understand this section, you will need to be familiar with Chapter 8 in the book.) We'll
use the domino gate structure shown in Figure 8.4(a), consisting of a dynamic
stage using a clocked evaluation transistor, followed by a hi-skew inverter. The
dynamic stages will use pulldown networks analogous to those of Figure 8.5
that compute the majority and parity functions. Consider the parity pulldown
network shown in Figure 4.6(b). If we add a series evaluation transistor, well
need to make each of the transistors have width 4 rather than width 3 in order
to match the pulldown characteristics of the reference inverter. Thus, the logical
effort of the a and c inputs (true and complement) of the dynamic circuit is 4/3, and the logical effort of the b inputs is 8/3. An analogous attack on the majority gate of
Figure 4.7(b) finds logical effort of the a input to be 1, while that of the b and c
inputs is 2. We recall that the hi-skew inverter (Figure 7.4) has logical effort 5/6
for rising outputs (Table 7.2).
Figure 7 shows the cell put together. Note that we must carry the inputs and
outputs in dual-rail form because of the nature of domino logic: because the
parity network requires true and complement forms and we must have an even
number of stages along every signal path, we have to carry both forms. Despite
the changes in logical effort from the static gates, we still will consider a V4
internal wiring configuration with relative loads on carry and sum signals in
the ratio 1.3 to 1. We find that the path effort through the cell along the a or c
paths is 3.3, contrasted with 8.61 for the static design. Such a low path effort calls
for a one-stage design, but with domino logic we must have an even number, so
two is the best we can do!
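The 3.3 figure can be confirmed with a short calculation. The sketch below (a sketch under the loading assumptions just described, not code from the text) sums the effort contributions seen by one rail of the a input and compares the result with the static dual-rail cell.

# Path effort from the a (or c) input of the domino cell, with V4-style loading.
HI_SKEW_G = 5.0 / 6.0        # logical effort of the hi-skew inverter, rising output
CRY_LOAD, S_LOAD = 1.3, 1.0  # relative loads on the carry and sum outputs
CA = 1.0                     # input capacitance of each rail of the a bundle

# Each true rail drives one dynamic majority gate (g = 1) and one input of each
# of the two dynamic parity gates (g = 4/3 apiece); a hi-skew inverter follows each.
F_domino = (1 * HI_SKEW_G * CRY_LOAD + 2 * (4.0 / 3.0) * HI_SKEW_G * S_LOAD) / CA
print(round(F_domino, 1))    # about 3.3

# The corresponding figure for the static dual-rail cell of Figure 5:
F_static = (2 * CRY_LOAD + 2 * 3 * S_LOAD) / CA
print(round(F_static, 1))    # 8.6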
What is the delay of this cell? We estimate the parasitic delay of the majority and parity stages is about 3 (why?) and that of the hi-skew gates is 5/6 (Section 8.2.2). So we have

D = N F^(1/N) + P = 2√3.3 + 3 + 5/6 ≈ 7.5

Figure 7 The domino adder cell; as in Figure 5, all inputs and outputs are carried in dual-rail bundles.
Lest you get euphoric over this result, beware that domino circuits, though
fast, require careful design and noise analysis. Making the parity and majority
dynamic gates work properly will probably require secondary precharge transistors on one or more nodes within the pulldown network. A fair amount of
analysis and simulation may be required to demonstrate that the gates work
correctly with sufficiently wide operating margins.
7 Conclusion
The multiplier design example has illustrated some of the strengths of logical
effort:
When a great many design alternatives exist, logical effort can be a simple way
to find the best. The twelve different wiring topologies of the adder cell (two
internal configurations, six external wiring patterns) were easily compared
with logical effort.
Even without detailed design of circuits and transistor sizes, logical effort
gives a delay estimate. We were able to estimate the delay of an n × m array implemented in static gates to be 12.9n + 3.6 ln m + 3.8 (see Equations 3
and 6).
Preliminary delay estimates reveal weaknesses in designs. The time required
to generate the product bits, Pij , is a significant fraction of the total delay.
Exercises
1. The analysis of the single-rail circuit, Figure 6, approximated the sizes for
the majority and parity gates. Work out the best sizes for these gates.
2. Analyze the domino form of the adder cell to determine whether the V4
configuration is the best and what the relative loading of the carry and sum
signals should be.
3. At the end of Section 4, we pointed out that the adder array might be faster if
different rows of adder cells used different designs. Estimate the maximum
speed increase that could be obtained in the given example (n = 3).