Module 1-1


INTRODUCTION TO ASICs
An ASIC (pronounced "a-sick"; bold typeface defines a new term) is an
application-specific integrated circuit; at least, that is what the acronym
stands for. Before we answer the question of what that means, we first look
at the evolution of the silicon chip, or integrated circuit (IC).
Figure 1.1(a) shows an IC package (this is a pin-grid array, or PGA, shown
upside down; the pins will go through holes in a printed-circuit board). People
often call the package a chip, but, as you can see in Figure 1.1(b), the silicon chip
itself (more properly called a die ) is mounted in the cavity under the sealed lid.
A PGA package is usually made from a ceramic material, but plastic packages
are also common.

FIGURE 1.1 An integrated circuit (IC). (a) A pin-grid array (PGA) package.
(b) The silicon die or chip is under the package lid.

The physical size of a silicon die varies from a few millimeters on a side to
over 1 inch on a side, but instead we often measure the size of an IC by the
number of logic gates or the number of transistors that the IC contains. As a
unit of measure a gate equivalent corresponds to a two-input NAND gate (a
circuit that performs the logic function $F = \overline{A \cdot B}$). Often we
just use the term gates instead of gate equivalents when we are measuring chip
size, not to be confused with the gate terminal of a transistor. For example,
a 100 k-gate IC contains the equivalent of 100,000 two-input NAND gates.
The semiconductor industry has evolved from the first ICs of the early 1970s
and matured rapidly since then. Early small-scale integration (SSI) ICs
contained a few (1 to 10) logic gates (NAND gates, NOR gates, and so on),
amounting to a few tens of transistors. The era of medium-scale integration
(MSI) increased the range of integrated logic available to counters and
similar, larger scale, logic functions. The era of large-scale integration
(LSI) packed even larger logic functions, such as the first microprocessors,
into a single chip. The era of very large-scale integration (VLSI) now offers
64-bit microprocessors, complete with cache memory and floating-point
arithmetic units, with well over a million transistors on a single piece of
silicon. As CMOS process technology improves, transistors continue to get
smaller and ICs hold more and more transistors. Some people (especially in
Japan) use the term ultralarge scale integration (ULSI), but most people stop
at the term VLSI; otherwise we have to start inventing new words.
The earliest ICs used bipolar technology and the majority of logic ICs used
either transistor-transistor logic (TTL) or emitter-coupled logic (ECL).
Although invented before the bipolar transistor, the metal-oxide-silicon (MOS)
transistor was initially difficult to manufacture because of problems with the
oxide interface. As these problems were gradually solved, metal-gate n-channel
MOS (nMOS or NMOS) technology developed in the 1970s. At that time MOS
technology required fewer masking steps, was denser, and consumed less power
than equivalent bipolar ICs. This meant that, for a given performance, an MOS
IC was cheaper than a bipolar IC, which led to investment and growth of the
MOS IC market.
By the early 1980s the aluminum gates of the transistors were replaced by
polysilicon gates, but the name MOS remained. The introduction of polysilicon
as a gate material was a major improvement in CMOS technology, making it
easier to make two types of transistors, n-channel MOS and p-channel MOS
transistors, on the same IC: a complementary MOS (CMOS, never cMOS)
technology. The principal advantage of CMOS over NMOS is lower power
consumption. Another advantage of a polysilicon gate was a simplification of
the fabrication process, allowing devices to be scaled down in size.
There are four CMOS transistors in a two-input NAND gate (and a two-input NOR
gate too), so to convert between gates and transistors, you multiply the
number of gates by 4 to obtain the number of transistors. We can also measure
an IC by the smallest feature size (roughly half the length of the smallest
transistor) imprinted on the IC. Transistor dimensions are measured in microns
(a micron, 1 μm, is a millionth of a meter). Thus we talk about a 0.5 μm IC or
say an IC is built in (or with) a 0.5 μm process, meaning that the smallest
transistors are 0.5 μm in length. We give a special label, λ or lambda, to
this smallest feature size. Since lambda is equal to half of the smallest
transistor length, λ ≈ 0.25 μm in a 0.5 μm process. Many of the drawings in
this book use a scale marked with lambda for the same reason we place a scale
on a map.
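As a quick numeric illustration of the two conversions just described (gate
equivalents to transistors, and feature size to lambda), here is a minimal
Python sketch; the function names are ours, not from the text:

# Back-of-the-envelope chip-size arithmetic from this section.
def transistors_from_gates(gate_equivalents):
    # One gate equivalent (a two-input NAND gate) uses 4 CMOS transistors.
    return 4 * gate_equivalents

def lambda_from_feature_size(feature_size_um):
    # Lambda is half the smallest transistor length.
    return feature_size_um / 2.0

print(transistors_from_gates(100_000))  # a 100 k-gate IC: 400000 transistors
print(lambda_from_feature_size(0.5))    # a 0.5 um process: lambda = 0.25 um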
A modern submicron CMOS process is now just as complicated as a submicron
bipolar or BiCMOS (a combination of bipolar and CMOS) process. However,
CMOS ICs have established a dominant position, are manufactured in much
greater volume than any other technology, and therefore, because of the
economy of scale, the cost of a CMOS IC is less than that of a bipolar or
BiCMOS IC with the same function. Bipolar and BiCMOS ICs are still used for
special needs. For example,
bipolar technology is generally capable of handling higher voltages than CMOS.
This makes bipolar and BiCMOS ICs useful in power electronics, cars, telephone
circuits, and so on.
Some digital logic ICs and their analog counterparts (analog/digital converters,
for example) are standard parts , or standard ICs. You can select standard ICs
from catalogs and data books and buy them from distributors. Systems
manufacturers and designers can use the same standard part in a variety of
different microelectronic systems (systems that use microelectronics or ICs).
With the advent of VLSI in the 1980s engineers began to realize the advantages
of designing an IC that was customized or tailored to a particular system or
application rather than using standard ICs alone. Microelectronic system design
then becomes a matter of defining the functions that you can implement using
standard ICs and then implementing the remaining logic functions (sometimes
called glue logic ) with one or more custom ICs . As VLSI became possible you
could build a system from a smaller number of components by combining many
standard ICs into a few custom ICs. Building a microelectronic system with
fewer ICs allows you to reduce cost and improve reliability.
Of course, there are many situations in which it is not appropriate to use a
custom IC for each and every part of a microelectronic system. If you need a
large
amount of memory, for example, it is still best to use standard memory ICs,
either dynamic random-access memory ( DRAM or dRAM), or static RAM (
SRAM or sRAM), in conjunction with custom ICs.
One of the first conferences to be devoted to this rapidly emerging segment of the
IC industry was the IEEE Custom Integrated Circuits Conference (CICC), and
the proceedings of this annual conference form a useful reference to the
development of custom ICs. As different types of custom ICs began to evolve for
different types of applications, these new ICs gave rise to a new term:
application-specific IC, or ASIC. Now we have the IEEE International ASIC
Conference , which tracks advances in ASICs separately from other types of
custom ICs. Although the exact definition of an ASIC is difficult, we shall look at
some examples to help clarify what people in the IC industry understand by the
term.
Examples of ICs that are not ASICs include standard parts such as: memory
chips sold as a commodity item (ROMs, DRAM, and SRAM); microprocessors; and
TTL or TTL-equivalent ICs at SSI, MSI, and LSI levels.
Examples of ICs that are ASICs include: a chip for a toy bear that talks; a chip
for a satellite; a chip designed to handle the interface between memory and a
microprocessor for a workstation CPU; and a chip containing a microprocessor as
a cell together with other logic.
As a general rule, if you can find it in a data book, then it is probably not an
ASIC, but there are some exceptions. For example, two ICs that might or might
not be considered ASICs are a controller chip for a PC and a chip for a modem.
Both of these examples are specific to an application (shades of an ASIC) but are
sold to many different system vendors (shades of a standard part). ASICs such as
these are sometimes called application-specific standard products ( ASSPs ).
Trying to decide which members of the huge IC family are application-specific
is tricky; after all, every IC has an application. For example, people do not
usually consider an application-specific microprocessor to be an ASIC. I shall
describe
how to design an ASIC that may include large cells such as microprocessors, but
I shall not describe the design of the microprocessors themselves. Defining an
ASIC by looking at the application can be confusing, so we shall look at a
different way to categorize the IC family. The easiest way to recognize people is
by their faces and physical characteristics: tall, short, thin. The easiest
characteristics of ASICs to understand are physical ones too, and we shall look at
these next. It is important to understand these differences because they affect
such factors as the price of an ASIC and the way you design an ASIC.
1.1 Types of ASICs
ICs are made on a thin (a few hundred microns thick), circular silicon wafer ,
with each wafer holding hundreds of die (sometimes people use dies or dice for
the plural of die). The transistors and wiring are made from many layers (usually
between 10 and 15 distinct layers) built on top of one another. Each successive
mask layer has a pattern that is defined using a mask similar to a glass
photographic slide. The first half-dozen or so layers define the transistors. The
last half-dozen or so layers define the metal wires between the transistors (the
interconnect ).
A full-custom IC includes some (possibly all) logic cells that are customized
and all mask layers that are customized. A microprocessor is an example of a
full-custom IC; designers spend many hours squeezing the most out of every
last square micron of microprocessor chip space by hand. Customizing all of
the IC features in this way allows designers to include analog circuits,
optimized memory cells, or mechanical structures on an IC, for example.
Full-custom ICs are the most expensive to manufacture and to design. The
manufacturing lead time (the time it takes just to make an IC, not including
design time) is typically eight weeks for a full-custom IC. These specialized
full-custom ICs are often intended for a specific application, so we might
call some of them full-custom ASICs.
We shall discuss full-custom ASICs briefly next, but the members of the IC
family that we are more interested in are semicustom ASICs, for which all of
the logic cells are predesigned and some (possibly all) of the mask layers are
customized. Using predesigned cells from a cell library makes our lives as
designers much, much easier. There are two types of semicustom ASICs that we
shall cover: standard-cell-based ASICs and gate-array-based ASICs. Following
this we shall describe the programmable ASICs, for which all of the logic
cells are predesigned and none of the mask layers are customized. There are
two types of programmable ASICs: the programmable logic device and, the newest
member of the ASIC family, the field-programmable gate array.

1.1.1 Full-Custom ASICs


In a full-custom ASIC an engineer designs some or all of the logic cells, circuits,
or layout specifically for one ASIC. This means the designer abandons the
approach of using pretested and precharacterized cells for all or part of that
design. It makes sense to take this approach only if there are no suitable existing
cell libraries available that can be used for the entire design. This might be
because existing cell libraries are not fast enough, or the logic cells are not small
enough or consume too much power. You may need to use full-custom design if
the ASIC technology is new or so specialized that there are no existing cell
libraries or because the ASIC is so specialized that some circuits must be custom
designed. Fewer and fewer full-custom ICs are being designed because of the
problems with these special parts of the ASIC. There is one growing member of
this family, though, the mixed analog/digital ASIC, which we shall discuss next.
Bipolar technology has historically been used for precision analog functions.
There are some fundamental reasons for this. In all integrated circuits the
matching of component characteristics between chips is very poor, while the
matching of characteristics between components on the same chip is excellent.
Suppose we have transistors T1, T2, and T3 on an analog/digital ASIC. The three
transistors are all the same size and are constructed in an identical fashion.
Transistors T1 and T2 are located adjacent to each other and have the same
orientation. Transistor T3 is the same size as T1 and T2 but is located on the
other side of the chip from T1 and T2 and has a different orientation. ICs are
made in batches called wafer lots. A wafer lot is a group of silicon wafers that are
all processed together. Usually there are between 5 and 30 wafers in a lot. Each
wafer can contain tens or hundreds of chips depending on the size of the IC and
the wafer.
If we were to make measurements of the characteristics of transistors T1, T2, and
T3 we would find the following:
● Transistor T1 will have virtually identical characteristics to T2 on the
same IC. We say that the transistors match well or the tracking between
devices is excellent.
● Transistor T3 will match transistors T1 and T2 on the same IC very well,
but not as closely as T1 matches T2 on the same IC.
● Transistors T1, T2, and T3 will match fairly well with transistors T1, T2,
and T3 on a different IC on the same wafer. The matching will depend on
how far apart the two ICs are on the wafer.
● Transistors on ICs from different wafers in the same wafer lot will not
match very well.
● Transistors on ICs from different wafer lots will match very poorly.

For many analog designs the close matching of transistors is crucial to circuit
operation. For these circuit designs pairs of transistors are used, located adjacent
to each other. Device physics dictates that a pair of bipolar transistors will always
match more precisely than CMOS transistors of a comparable size. Bipolar
technology has historically been more widely used for full-custom analog design
because of its improved precision. Despite its poorer analog properties, the use of
CMOS technology for analog functions is increasing. There are two reasons for
this. The first reason is that CMOS is now by far the most widely available IC
technology. Many more CMOS ASICs and CMOS standard products are now
being manufactured than bipolar ICs. The second reason is that increased levels
of integration require mixing analog and digital functions on the same IC: this
has forced designers to find ways to use CMOS technology to implement analog
functions. Circuit designers, using clever new techniques, have been very
successful in finding new ways to design analog CMOS circuits that can
approach the accuracy of bipolar analog designs.

1.1.2 Standard-Cell-Based ASICs


A cell-based ASIC (cell-based IC, or CBIC, a common term in Japan, pronounced
"sea-bick") uses predesigned logic cells (AND gates, OR gates, multiplexers,
and flip-flops, for example) known as standard cells. We could apply the term
CBIC to any IC that uses cells, but it is generally accepted that a cell-based
ASIC or CBIC means a standard-cell-based ASIC.
The standard-cell areas (also called flexible blocks) in a CBIC are built of
rows of standard cells, like a wall built of bricks. The standard-cell areas
may be used
in combination with larger predesigned cells, perhaps microcontrollers or even
microprocessors, known as megacells . Megacells are also called megafunctions,
full-custom blocks, system-level macros (SLMs), fixed blocks, cores, or
Functional Standard Blocks (FSBs).
The ASIC designer defines only the placement of the standard cells and the
interconnect in a CBIC. However, the standard cells can be placed anywhere on
the silicon; this means that all the mask layers of a CBIC are customized and are
unique to a particular customer. The advantage of CBICs is that designers
save time and money and reduce risk by using a predesigned, pretested, and
precharacterized standard-cell library. In addition each standard cell can be
optimized individually. During the design of the cell library each and every
transistor in every standard cell can be chosen to maximize speed or minimize
area, for example. The disadvantages are the time or expense of designing or
buying the standard-cell library and the time needed to fabricate all layers of the
ASIC for each new design.
Figure 1.2 shows a CBIC (looking down on the die shown in Figure 1.1b, for
example). The important features of this type of ASIC are as follows:
● All mask layers are customized (transistors and interconnect).

● Custom blocks can be embedded.

● Manufacturing lead time is about eight weeks.


FIGURE 1.2 A cell-based ASIC
(CBIC) die with a single
standard-cell area (a flexible
block) together with four fixed
blocks. The flexible block
contains rows of standard cells.
This is what you might see
through a low-powered
microscope looking down on the
die of Figure 1.1(b). The small
squares around the edge of the die
are bonding pads that are
connected to the pins of the ASIC
package.

Each standard cell in the library is constructed using full-custom design methods,
but you can use these predesigned and precharacterized circuits without having to
do any full-custom design yourself. This design style gives you the same
performance and flexibility advantages of a full-custom ASIC but reduces design
time and reduces risk.
Standard cells are designed to fit together like bricks in a wall. Figure 1.3
shows an example of a simple standard cell (it is simple in the sense that it
is not maximized for density, but it is ideal for showing you its internal
construction). Power and ground buses (VDD and GND or VSS) run horizontally on
metal lines inside the cells.
FIGURE 1.3 Looking down on the layout of a standard cell. This cell would be
approximately 25 microns wide on an ASIC with λ (lambda) = 0.25 microns (a
micron is 10⁻⁶ m). Standard cells are stacked like bricks in a wall; the
abutment box (AB) defines the edges of the brick. The difference between the
bounding box (BB) and the AB is the area of overlap between the bricks. Power
supplies (labeled VDD and GND) run horizontally inside a standard cell on a
metal layer that lies above the transistor layers. Each different shaded and
labeled pattern represents a different layer. This standard cell has center
connectors (the three squares, labeled A1, B1, and Z) that allow the cell to
connect to others. The layout was drawn using ROSE, a symbolic layout editor
developed by Rockwell and Compass, and then imported into Tanner Research's
L-Edit.

Standard-cell design allows the automation of the process of assembling an
ASIC. Groups of standard cells fit horizontally together to form rows. The
rows
stack vertically to form flexible rectangular blocks (which you can reshape
during design). You may then connect a flexible block built from several rows of
standard cells to other standard-cell blocks or other full-custom logic blocks. For
example, you might want to include a custom interface to a standard, predesigned
microcontroller together with some memory. The microcontroller block may be a
fixed-size megacell, you might generate the memory using a memory compiler,
and the custom logic and memory controller will be built from flexible
standard-cell blocks, shaped to fit in the empty spaces on the chip.
Both cell-based and gate-array ASICs use predefined cells, but there is a
difference: we can change the transistor sizes in a standard cell to optimize
speed and performance, but the device sizes in a gate array are fixed. This
results in a trade-off in performance and area in a gate array at the silicon
level. The trade-off between area and performance is made at the library level
for a standard-cell ASIC.
Modern CMOS ASICs use two, three, or more levels (or layers) of metal for
interconnect. This allows wires to cross over different layers in the same way that
we use copper traces on different layers on a printed-circuit board. In a two-level
metal CMOS technology, connections to the standard-cell inputs and outputs are
usually made using the second level of metal ( metal2 , the upper level of metal)
at the tops and bottoms of the cells. In a three-level metal technology,
connections may be internal to the logic cell (as they are in Figure 1.3). This
allows for more sophisticated routing programs to take advantage of the extra
metal layer to route interconnect over the top of the logic cells. We shall cover
the details of routing ASICs in Chapter 17.
A connection that needs to cross over a row of standard cells uses a
feedthrough. The term feedthrough can refer either to the piece of metal that
is used to pass a signal through a cell or to a space in a cell waiting to be
used as a feedthrough, which is very confusing. Figure 1.4 shows two
feedthroughs: one in cell A.14 and one in cell A.23.
In both two-level and three-level metal technology, the power buses (VDD and
GND) inside the standard cells normally use the lowest (closest to the transistors)
layer of metal ( metal1 ). The width of each row of standard cells is adjusted so
that they may be aligned using spacer cells . The power buses, or rails, are then
connected to additional vertical power rails using row-end cells at the aligned
ends of each standard-cell block. If the rows of standard cells are long, then
vertical power rails can also be run in metal2 through the cell rows using special
power cells that just connect to VDD and GND. Usually the designer manually
controls the number and width of the vertical power rails connected to the
standard-cell blocks during physical design. A diagram of the power distribution
scheme for a CBIC is shown in Figure 1.4.

FIGURE 1.4 Routing the CBIC (cell-based IC) shown in Figure 1.2. The use of
regularly shaped standard cells, such as the one in Figure 1.3, from a library
allows ASICs like this to be designed automatically. This ASIC uses two
separate layers of metal interconnect (metal1 and metal2) running at right
angles to each other (like traces on a printed-circuit board).
Interconnections between logic cells use the spaces (called channels) between
the rows of cells. ASICs may have three (or more) layers of metal, allowing
the cell rows to touch, with the interconnect running over the top of the
cells.

All the mask layers of a CBIC are customized. This allows megacells (SRAM, a
SCSI controller, or an MPEG decoder, for example) to be placed on the same IC
with standard cells. Megacells are usually supplied by an ASIC or library
company complete with behavioral models and some way to test them (a test
strategy). ASIC library companies also supply compilers to generate flexible
DRAM, SRAM, and ROM blocks. Since all mask layers on a standard-cell
design are customized, memory design is more efficient and denser than for gate
arrays.
For logic that operates on multiple signals across a data bus (a datapath, or
DP) the use of standard cells may not be the most efficient ASIC design style.
Some ASIC library companies provide a datapath compiler that automatically
generates datapath logic. A datapath library typically contains cells such as
adders, subtracters, multipliers, and simple arithmetic and logical units
(ALUs). The connectors of datapath library cells are pitch-matched to each
other so that they fit together. Connecting datapath cells to form a datapath
usually, but not always, results in faster and denser layout than using
standard cells or a gate array.
Standard-cell and gate-array libraries may contain hundreds of different logic
cells, including combinational functions (NAND, NOR, AND, OR gates) with
multiple inputs, as well as latches and flip-flops with different combinations of
reset, preset and clocking options. The ASIC library company provides designers
with a data book in paper or electronic form with all of the functional
descriptions and timing information for each library element.

1.1.3 Gate-Array-Based ASICs


In a gate array (sometimes abbreviated to GA) or gate-array-based ASIC the
transistors are predefined on the silicon wafer. The predefined pattern of
transistors on a gate array is the base array, and the smallest element that
is replicated to make the base array (like an M. C. Escher drawing, or tiles
on a floor) is the base cell (sometimes called a primitive cell). Only the top
few layers of metal, which define the interconnect between transistors, are
defined by the designer using custom masks. To distinguish this type of gate
array from other types of gate array, it is often called a masked gate array
(MGA). The designer
chooses from a gate-array library of predesigned and precharacterized logic cells.
The logic cells in a gate-array library are often called macros . The reason for this
is that the base-cell layout is the same for each logic cell, and only the
interconnect (inside cells and between cells) is customized, so that there is a
similarity between gate-array macros and a software macro. Inside IBM,
gate-array macros are known as books (so that books are part of a library), but
unfortunately this descriptive term is not very widely used outside IBM.
We can complete the diffusion steps that form the transistors and then stockpile
wafers (sometimes we call a gate array a prediffused array for this reason). Since
only the metal interconnections are unique to an MGA, we can use the stockpiled
wafers for different customers as needed. Using wafers prefabricated up to the
metallization steps reduces the time needed to make an MGA, the turnaround
time , to a few days or at most a couple of weeks. The costs for all the initial
fabrication steps for an MGA are shared for each customer and this reduces the
cost of an MGA compared to a full-custom or standard-cell ASIC design.
There are the following different types of MGA or gate-array-based ASICs:
● Channeled gate arrays.

● Channelless gate arrays.


● Structured gate arrays.
The hyphenation of these terms when they are used as adjectives explains their
construction. For example, in the term channeled gate-array architecture, the
gate array is channeled, as will be explained. There are two common ways of
arranging (or arraying) the transistors on an MGA: in a channeled gate array
we
leave space between the rows of transistors for wiring; the routing on a
channelless gate array uses rows of unused transistors. The channeled gate array
was the first to be developed, but the channelless gate-array architecture is now
more widely used. A structured (or embedded) gate array can be either channeled
or channelless but it includes (or embeds) a custom block.

1.1.4 Channeled Gate Array


Figure 1.5 shows a channeled gate array . The important features of this type of
MGA are:
● Only the interconnect is customized.

● The interconnect uses predefined spaces between rows of base cells.

● Manufacturing lead time is between two days and two weeks.

FIGURE 1.5 A channeled gate-array die. The spaces between rows of the base
cells are set aside for interconnect.

A channeled gate array is similar to a CBIC; both use rows of cells separated
by channels used for interconnect. One difference is that the space for
interconnect between rows of cells is fixed in height in a channeled gate
array, whereas the space between rows of cells may be adjusted in a CBIC.

1.1.5 Channelless Gate Array


Figure 1.6 shows a channelless gate array (also known as a channel-free gate
array , sea-of-gates array , or SOG array). The important features of this type of
MGA are as follows:
● Only some (the top few) mask layers are customized (the interconnect).

● Manufacturing lead time is between two days and two weeks.


FIGURE 1.6 A channelless gate-array or
sea-of-gates (SOG) array die. The core
area of the die is completely filled with an
array of base cells (the base array).

The key difference between a channelless gate array and channeled gate array is
that there are no predefined areas set aside for routing between cells on a
channelless gate array. Instead we route over the top of the gate-array devices.
We can do this because we customize the contact layer that defines the
connections between metal1, the first layer of metal, and the transistors. When
we use an area of transistors for routing in a channelless array, we do not make
any contacts to the devices lying underneath; we simply leave the transistors
unused.
The logic density (the amount of logic that can be implemented in a given
silicon area) is higher for channelless gate arrays than for channeled gate
arrays. This is usually attributed to the difference in structure between the
two types of array. In fact, the difference occurs because the contact mask is
customized in a channelless gate array, but is not usually customized in a
channeled gate array. This leads to denser cells in the channelless
architectures. Customizing the contact layer in a channelless gate array
allows us to increase the density of gate-array cells because we can route
over the top of unused contact sites.

1.1.6 Structured Gate Array


An embedded gate array or structured gate array (also known as masterslice or
masterimage ) combines some of the features of CBICs and MGAs. One of the
disadvantages of the MGA is the fixed gate-array base cell. This makes the
implementation of memory, for example, difficult and inefficient. In an
embedded gate array we set aside some of the IC area and dedicate it to a specific
function. This embedded area either can contain a different base cell that is more
suitable for building memory cells, or it can contain a complete circuit block,
such as a microcontroller.
Figure 1.7 shows an embedded gate array. The important features of this type of
MGA are the following:
● Only the interconnect is customized.

● Custom blocks (the same for each design) can be embedded.

● Manufacturing lead time is between two days and two weeks.


FIGURE 1.7 A structured or
embedded gate-array die showing
an embedded block in the upper
left corner (a static random-access
memory, for example). The rest of
the die is filled with an array of
base cells.

An embedded gate array gives the improved area efficiency and increased
performance of a CBIC but with the lower cost and faster turnaround of an MGA.
One disadvantage of an embedded gate array is that the embedded function is
fixed. For example, if an embedded gate array contains an area set aside for a 32
k-bit memory, but we only need a 16 k-bit memory, then we may have to waste
half of the embedded memory function. However, this may still be more efficient
and cheaper than implementing a 32 k-bit memory using macros on a SOG array.
ASIC vendors may offer several embedded gate array structures containing
different memory types and sizes as well as a variety of embedded functions.
ASIC companies wishing to offer a wide range of embedded functions must
ensure that enough customers use each different embedded gate array to give the
cost advantages over a custom gate array or CBIC (the Sun Microsystems
SPARCstation 1 described in Section 1.3 made use of LSI Logic embedded gate
arrays, and the 10K and 100K series of embedded gate arrays were two of LSI
Logic's most successful products).

1.1.7 Programmable Logic Devices


Programmable logic devices ( PLDs ) are standard ICs that are available in
standard configurations from a catalog of parts and are sold in very high volume
to many different customers. However, PLDs may be configured or programmed
to create a part customized to a specific application, and so they also belong to
the family of ASICs. PLDs use different technologies to allow programming of
the device. Figure 1.8 shows a PLD and the following important features that all
PLDs have in common:
● No customized mask layers or logic cells

● Fast design turnaround

● A single large block of programmable interconnect

● A matrix of logic macrocells that usually consist of programmable array
logic followed by a flip-flop or latch
FIGURE 1.8 A programmable
logic device (PLD) die. The
macrocells typically consist of
programmable array logic
followed by a flip-flop or latch.
The macrocells are connected
using a large programmable
interconnect block.

The simplest type of programmable IC is a read-only memory (ROM). The most
common types of ROM use a metal fuse that can be blown permanently (a
programmable ROM or PROM ). An electrically programmable ROM , or
EPROM , uses programmable MOS transistors whose characteristics are altered
by applying a high voltage. You can erase an EPROM either by using another
high voltage (an electrically erasable PROM , or EEPROM ) or by exposing the
device to ultraviolet light ( UV-erasable PROM , or UVPROM ).
There is another type of ROM that can be placed on any ASIC: a
mask-programmable ROM (mask-programmed ROM or masked ROM). A masked ROM is a
regular array of transistors permanently programmed using custom mask
patterns. An embedded masked ROM is thus a large, specialized, logic cell.
The same programmable technologies used to make ROMs can be applied to
more flexible logic structures. By using the programmable devices in a large
array of AND gates and an array of OR gates, we create a family of flexible and
programmable logic devices called logic arrays . The company Monolithic
Memories (bought by AMD) was the first to produce Programmable Array Logic
(PAL ® , a registered trademark of AMD) devices that you can use, for example,
as transition decoders for state machines. A PAL can also include registers
(flip-flops) to store the current state information so that you can use a PAL to
make a complete state machine.
Just as we have a mask-programmable ROM, we could place a logic array as a
cell on a custom ASIC. This type of logic array is called a programmable logic
array (PLA). There is a difference between a PAL and a PLA: a PLA has a
programmable AND logic array, or AND plane , followed by a programmable
OR logic array, or OR plane ; a PAL has a programmable AND plane and, in
contrast to a PLA, a fixed OR plane.
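To make the two-plane structure just described concrete, here is a hedged
Python sketch of a logic array: a programmable AND plane of product terms
feeding an OR plane. In a PAL the OR-plane connections would be fixed by the
manufacturer rather than programmable. The function name and the example
personality below are our own illustration, not from the text.

def pla_eval(inputs, and_plane, or_plane):
    # inputs: dict mapping signal name -> bool.
    # and_plane: list of product terms; each term is a list of
    #            (name, polarity) pairs, ANDed together.
    # or_plane: one list of product-term indices per output, ORed together.
    products = [all(inputs[name] == pol for name, pol in term)
                for term in and_plane]
    return [any(products[i] for i in ids) for ids in or_plane]

# Example personality: F0 = A.B + A'.C and F1 = A'.C
and_plane = [[('A', True), ('B', True)],   # term 0: A AND B
             [('A', False), ('C', True)]]  # term 1: (NOT A) AND C
or_plane = [[0, 1], [1]]
print(pla_eval({'A': True, 'B': False, 'C': True}, and_plane, or_plane))
# -> [False, False]  (A = 1 blocks term 1; B = 0 blocks term 0)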
Depending on how the PLD is programmed, we can have an erasable PLD
(EPLD), or mask-programmed PLD (sometimes called a masked PLD but usually
just PLD). The first PALs, PLAs, and PLDs were based on bipolar technology
and used programmable fuses or links. CMOS PLDs usually employ
floating-gate transistors (see Section 4.3, EPROM and EEPROM Technology).
1.1.8 Field-Programmable Gate Arrays
A step above the PLD in complexity is the field-programmable gate array
(FPGA). There is very little difference between an FPGA and a PLD; an FPGA is
usually just larger and more complex than a PLD. In fact, some companies that
manufacture programmable ASICs call their products FPGAs and some call them
complex PLDs. FPGAs are the newest member of the ASIC family and are rapidly
growing in importance, replacing TTL in microelectronic systems. Even though
an FPGA is a type of gate array, we do not consider the term gate-array-based
ASICs to include FPGAs. This may change as FPGAs and MGAs start to look more
alike.
Figure 1.9 illustrates the essential characteristics of an FPGA:
● None of the mask layers are customized.

● A method for programming the basic logic cells and the interconnect.

● The core is a regular array of programmable basic logic cells that can
implement combinational as well as sequential logic (flip-flops).
● A matrix of programmable interconnect surrounds the basic logic cells.

● Programmable I/O cells surround the core.

● Design turnaround is a few hours.

We shall examine these features in detail in Chapters 4 to 8.

FIGURE 1.9 A field-programmable gate array (FPGA) die. All FPGAs contain a
regular structure of programmable basic logic cells surrounded by
programmable interconnect. The exact type, size, and number of the
programmable basic logic cells varies tremendously.
1.2 Design Flow
Figure 1.10 shows the sequence of steps to design an ASIC; we call this a design
flow . The steps are listed below (numbered to correspond to the labels in
Figure 1.10) with a brief description of the function of each step.

FIGURE 1.10 ASIC design flow.


1. Design entry. Enter the design into an ASIC design system, either using a
hardware description language ( HDL ) or schematic entry .
2. Logic synthesis. Use an HDL (VHDL or Verilog) and a logic synthesis
tool to produce a netlist, a description of the logic cells and their
connections.
3. System partitioning. Divide a large system into ASIC-sized pieces.
4. Prelayout simulation. Check to see if the design functions correctly.
5. Floorplanning. Arrange the blocks of the netlist on the chip.
6. Placement. Decide the locations of cells in a block.
7. Routing. Make the connections between cells and blocks.
8. Extraction. Determine the resistance and capacitance of the interconnect.
9. Postlayout simulation. Check to see the design still works with the added
loads of the interconnect.

CMOS LOGIC
A CMOS transistor (or device) has four terminals: gate, source, drain, and a
fourth terminal that we shall ignore until the next section. A CMOS transistor
is a switch. The switch must be conducting or on to allow current to flow
between the source and drain terminals (using open and closed for switches is
confusing, for the same reason we say a tap is on and not that it is closed).
The transistor source and drain terminals are equivalent as far as digital
signals are concerned; we do not worry about labeling an electrical switch
with two terminals.
● V_AB is the potential difference, or voltage, between nodes A and B in a
circuit; V_AB is positive if node A is more positive than node B.
● Italics denote variables; constants are set in roman (upright) type.
Uppercase letters denote DC, large-signal, or steady-state voltages.
● For TTL the positive power supply is called VCC (V_CC). The 'C' denotes
that the supply is connected indirectly to the collectors of the npn bipolar
transistors (a bipolar transistor has a collector, base, and emitter
corresponding roughly to the drain, gate, and source of an MOS transistor).
● Following the example of TTL we use VDD (V_DD) to denote the positive
supply in an NMOS chip, where the devices are all n-channel transistors and
the drains of these devices are connected indirectly to the positive supply.
The supply nomenclature for NMOS chips has stuck for CMOS.
● VDD is the name of the power supply node or net; V_DD represents its value
(uppercase, since V_DD is a DC quantity). Since V_DD is a variable, it is
italic (words and multiletter abbreviations use roman; thus it is V_DD, but
V_drain).
● Logic designers often call the CMOS negative supply VSS even if it is
actually ground or GND. I shall use VSS for the node and V_SS for the value.
● CMOS uses positive logic: VDD is logic '1' and VSS is logic '0'.
We turn a transistor on or off using the gate terminal. There are two kinds
of CMOS transistors: n-channel transistors and p-channel transistors. An
n-channel transistor requires a logic '1' (from now on I'll just say a '1')
on the gate to make the switch conducting (to turn the transistor on). A
p-channel transistor requires a logic '0' (again, from now on I'll just say a
'0') on the gate to make the switch conducting (to turn the transistor on).
The p-channel transistor symbol has a bubble on its gate to remind us that
the gate has to be a '0' to turn the transistor on. All this is shown in
Figure 2.1(a) and (b).

FIGURE 2.1 CMOS transistors as switches. (a) An n-channel transistor. (b) A
p-channel transistor. (c) A CMOS inverter and its symbol (an equilateral
triangle and a circle).

If we connect an n-channel transistor in series with a p-channel transistor,
as shown in Figure 2.1(c), we form an inverter. With four transistors we can
form a two-input NAND gate (Figure 2.2a). We can also make a two-input NOR
gate (Figure 2.2b). Logic designers normally use the terms NAND gate and
logic gate (or just gate), but I shall try to use the terms NAND cell and
logic cell rather than NAND gate or logic gate in this chapter to avoid any
possible confusion with the gate terminal of a transistor.
FIGURE 2.2 CMOS logic. (a) A two-input NAND logic cell. (b) A two-input
NOR logic cell. The n -channel and p -channel transistor switches implement the
'1's and '0's of a Karnaugh map.
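The following small Python sketch (our own illustration, not from the text)
mimics the switch networks of Figure 2.2: in a NAND cell the n-channel
switches are in series in the pull-down network and the p-channel switches
are in parallel in the pull-up network; a NOR cell is the dual arrangement.

def cmos_nand(a, b):
    pull_down = (a == 1) and (b == 1)  # series n-channel network conducts
    pull_up = (a == 0) or (b == 0)     # parallel p-channel network conducts
    assert pull_down != pull_up        # exactly one network on: no contention
    return 0 if pull_down else 1

def cmos_nor(a, b):
    pull_down = (a == 1) or (b == 1)   # parallel n-channel network conducts
    pull_up = (a == 0) and (b == 0)    # series p-channel network conducts
    assert pull_down != pull_up
    return 0 if pull_down else 1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, cmos_nand(a, b), cmos_nor(a, b))

Note how the complementary pull-up and pull-down networks guarantee that the
output is always driven to exactly one supply rail; this is the origin of the
strong '0' and strong '1' logic levels discussed in Section 2.1.4.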

2.1 CMOS Transistors


2.2 The CMOS Process
2.3 CMOS Design Rules
2.4 Combinational Logic Cells
2.5 Sequential Logic Cells
2.6 Datapath Logic Cells
2.7 I/O Cells
2.8 Cell Compilers
2.9 Summary
2.10 Problems
2.11 Bibliography
2.12 References
2.1 CMOS Transistors
Figure 2.3 illustrates how electrons and holes abandon their dopant atoms,
leaving a depletion region around a transistor's source and drain. The region
between source and drain is normally nonconducting. To make an n-channel
transistor conducting, we must apply a positive voltage V_GS (the gate voltage
with respect to the source) that is greater than the n-channel transistor
threshold voltage, V_tn (a typical value is 0.5 V and, as far as we are
presently concerned, is a constant). This establishes a thin (≈ 50 Å)
conducting channel of electrons under the gate. MOS transistors can carry a
very small current (the subthreshold current, a few microamperes or less)
with V_GS < V_tn, but we shall ignore this. A transistor can be conducting
(V_GS > V_tn) without any current flowing. To make current flow in an
n-channel transistor we must also apply a positive voltage, V_DS, to the
drain with respect to the source. Figure 2.3 shows these connections and the
connection to the fourth terminal of an MOS transistor: the bulk (well, tub,
or substrate) terminal. For an n-channel transistor we must connect the bulk
to the most negative potential, GND or VSS, to reverse bias the bulk-to-drain
and bulk-to-source pn-diodes. The arrow in the four-terminal n-channel
transistor symbol in Figure 2.3 reflects the polarity of these pn-diodes.

FIGURE 2.3 An n-channel MOS transistor. The gate-oxide thickness, T_OX, is
approximately 100 angstroms (0.01 μm). A typical transistor length, L = 2λ.
The bulk may be either the substrate or a well. The diodes represent
pn-junctions that must be reverse-biased.

The current flowing in the transistor is

current (amperes) = charge (coulombs) per unit time (seconds). (2.1)

We can express the current in terms of the total charge in the channel, Q
(imagine taking a picture and counting the number of electrons in the channel
at that instant). If t_f (for time of flight, sometimes called the transit
time) is the time that it takes an electron to cross between source and
drain, the drain-to-source current, I_DSn, is

$$I_{DSn} = Q / t_f. \qquad (2.2)$$

We need to find Q and t_f. The velocity of the electrons, v (a vector), is
given by the equation that forms the basis of Ohm's law:

$$v = \mu_n E, \qquad (2.3)$$

where μ_n is the electron mobility (μ_p is the hole mobility) and E is the
electric field (with units V m⁻¹).
Typical carrier mobility values are μ_n = 500 to 1000 cm² V⁻¹ s⁻¹ and
μ_p = 100 to 400 cm² V⁻¹ s⁻¹. Equation 2.3 is a vector equation, but we shall
ignore the vertical electric field and concentrate on the horizontal electric
field, E_x, that moves the electrons between source and drain. The horizontal
component of the electric field is E_x = V_DS / L, directed from the drain to
the source, where L is the channel length (see Figure 2.3). The electrons
travel a distance L with horizontal velocity v_x = μ_n E_x, so that

$$t_f = \frac{L}{v_x} = \frac{L^2}{\mu_n V_{DS}}. \qquad (2.4)$$

Next we find the channel charge, Q. The channel and the gate form the plates
of a capacitor, separated by an insulator, the gate oxide. We know that the
charge on a linear capacitor, C, is Q = CV. Our lower plate, the channel, is
not a linear conductor. Charge only appears on the lower plate when the
voltage between the gate and the channel, V_GC, exceeds the n-channel
threshold voltage. For our nonlinear capacitor we need to modify the equation
for a linear capacitor to the following:

$$Q = C (V_{GC} - V_{tn}). \qquad (2.5)$$

The lower plate of our capacitor is resistive and conducting current, so that
the potential in the channel, V_GC, varies. In fact, V_GC = V_GS at the
source and V_GC = V_GS - V_DS at the drain. What we really should do is find
an expression for the channel charge as a function of channel voltage and sum
(integrate) the charge all the way across the channel, from x = 0 (at the
source) to x = L (at the drain). Instead we shall assume that the channel
voltage, V_GC(x), is a linear function of distance from the source and take
the average value of the charge, which is thus

$$Q = C [(V_{GS} - V_{tn}) - 0.5 V_{DS}]. \qquad (2.6)$$

The gate capacitance, C, is given by the formula for a parallel-plate
capacitor with length L, width W, and plate separation equal to the
gate-oxide thickness, T_ox. Thus the gate capacitance is

$$C = \frac{W L \varepsilon_{ox}}{T_{ox}} = W L C_{ox}, \qquad (2.7)$$

where ε_ox is the gate-oxide dielectric permittivity. For silicon dioxide,
SiO₂, ε_ox ≈ 3.45 × 10⁻¹¹ F m⁻¹, so that, for a typical gate-oxide thickness
of 100 Å (1 Å = 1 angstrom = 0.1 nm), the gate capacitance per unit area is
C_ox ≈ 3 fF μm⁻².
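As a quick check of that last figure (our arithmetic, using the values in the
text):

eps_ox = 3.45e-11     # F/m, permittivity of SiO2
t_ox = 100e-10        # m, a 100 angstrom gate oxide
c_ox = eps_ox / t_ox  # F/m^2
print(c_ox)           # 3.45e-03 F/m^2
print(c_ox * 1e3)     # = 3.45 fF/um^2 (1 F/m^2 = 1000 fF/um^2), roughly the
                      # 3 fF/um^2 quoted above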

Now we can express the channel charge in terms of the transistor parameters,

$$Q = W L C_{ox} [(V_{GS} - V_{tn}) - 0.5 V_{DS}]. \qquad (2.8)$$

Finally, the drain-source current is

$$I_{DSn} = Q / t_f = \frac{W}{L} \mu_n C_{ox} [(V_{GS} - V_{tn}) - 0.5 V_{DS}] V_{DS} = \frac{W}{L} k'_n [(V_{GS} - V_{tn}) - 0.5 V_{DS}] V_{DS}. \qquad (2.9)$$

The constant k'_n is the process transconductance parameter (or intrinsic
transconductance):

$$k'_n = \mu_n C_{ox}. \qquad (2.10)$$

We also define β_n, the transistor gain factor (or just gain factor), as

$$\beta_n = k'_n (W/L). \qquad (2.11)$$

The factor W/L (transistor width divided by length) is the transistor shape
factor.
Equation 2.9 describes the linear region (or triode region) of operation.
This equation is valid until V_DS = V_GS - V_tn and then predicts that I_DS
decreases with increasing V_DS, which does not make physical sense. At
V_DS = V_GS - V_tn = V_DS(sat) (the saturation voltage) there is no longer
enough voltage between the gate and the drain end of the channel to support
any channel charge. Clearly a small amount of charge remains or the current
would go to zero, but with very little free charge the channel resistance in
a small region close to the drain increases rapidly and any further increase
in V_DS is dropped over this region. Thus for V_DS > V_GS - V_tn (the
saturation region, or pentode region, of operation) the drain current I_DS
remains approximately constant at the saturation current, I_DSn(sat), where

$$I_{DSn(\mathrm{sat})} = (\beta_n / 2)(V_{GS} - V_{tn})^2; \quad V_{GS} > V_{tn}. \qquad (2.12)$$
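Equations 2.9 and 2.12 together give the classic long-channel (square-law)
model. Here is a hedged Python sketch of it, using the G5 numbers quoted
below (k'_n ≈ 90 μA V⁻², V_tn = 0.65 V); it ignores subthreshold current,
velocity saturation, and the other effects discussed in Section 2.1.2.

def ids_n(vgs, vds, w, l, kp=90e-6, vtn=0.65):
    # Drain-source current (A) for an n-channel device.
    if vgs <= vtn:
        return 0.0                    # cut off (ignoring subthreshold current)
    beta = kp * (w / l)               # gain factor, Eq. 2.11
    vds_sat = vgs - vtn               # saturation voltage
    if vds < vds_sat:                 # linear (triode) region, Eq. 2.9
        return beta * ((vgs - vtn) - 0.5 * vds) * vds
    return 0.5 * beta * vds_sat ** 2  # saturation region, Eq. 2.12

# Long-channel device of Figure 2.4(a): W = 60 um, L = 6 um, VGS = VDS = 3 V
print(ids_n(3.0, 3.0, 60e-6, 6e-6))   # about 2.5e-3 A, matching the text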

Figure 2.4 shows the n-channel transistor I_DS versus V_DS characteristics
for a generic 0.5 μm CMOS process that we shall call G5. We can fit Eq. 2.12
to the long-channel transistor characteristics (W = 60 μm, L = 6 μm) in
Figure 2.4(a). If I_DSn(sat) = 2.5 mA (with V_DS = 3.0 V, V_GS = 3.0 V,
V_tn = 0.65 V, T_ox = 100 Å), the intrinsic transconductance is

$$k'_n = \frac{2 (L/W)\, I_{DSn(\mathrm{sat})}}{(V_{GS} - V_{tn})^2} = \frac{2\,(6/60)\,(2.5 \times 10^{-3})}{(3.0 - 0.65)^2} = 9.05 \times 10^{-5}\ \mathrm{A\,V^{-2}} \qquad (2.13)$$

or approximately 90 μA V⁻². This value of k'_n, calculated in the saturation
region, will be different (typically lower by a factor of 2 or more) from the
value of k'_n measured in the linear region. We assumed the mobility, μ_n,
and the threshold voltage, V_tn, are constants, neither of which is true, as
we shall see in Section 2.1.2.
For the p-channel transistor in the G5 process, I_DSp(sat) = -850 μA
(V_DS = -3.0 V, V_GS = -3.0 V, V_tp = -0.85 V, W = 60 μm, L = 6 μm). Then

$$k'_p = \frac{2 (L/W)\, (-I_{DSp(\mathrm{sat})})}{(V_{GS} - V_{tp})^2} = \frac{2\,(6/60)\,(850 \times 10^{-6})}{(-3.0 - (-0.85))^2} = 3.68 \times 10^{-5}\ \mathrm{A\,V^{-2}}. \qquad (2.14)$$

The next section explains the signs in Eq. 2.14.
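The two extractions above are easy to reproduce; this sketch (our arithmetic,
using the measurements quoted in the text) computes both from Eqs. 2.13 and
2.14:

def k_prime(l_over_w, ids_sat, vgs, vt):
    # Process transconductance from a saturation-region measurement.
    return 2 * l_over_w * ids_sat / (vgs - vt) ** 2

k_n = k_prime(6 / 60, 2.5e-3, 3.0, 0.65)    # n-channel
k_p = k_prime(6 / 60, 850e-6, -3.0, -0.85)  # p-channel; -I_DSp(sat) = 850 uA
print(k_n)   # about 9.05e-05 A/V^2, i.e., ~90 uA/V^2
print(k_p)   # about 3.68e-05 A/V^2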


FIGURE 2.4 MOS n-channel transistor characteristics for a generic 0.5 μm
process (G5). (a) A short-channel transistor, with W = 6 μm and L = 0.6 μm
(drawn), and a long-channel transistor (W = 60 μm, L = 6 μm). (b) The 6/0.6
characteristics represented as a surface. (c) A long-channel transistor obeys
a square-law characteristic between I_DS and V_GS in the saturation region
(V_DS = 3 V). A short-channel transistor shows a more linear characteristic
due to velocity saturation. Normally, all of the transistors used on an ASIC
have short channels.

2.1.1 P-Channel Transistors


The source and drain of CMOS transistors look identical; we have to know
which way the current is flowing to distinguish them. The source of an
n-channel transistor is lower in potential than the drain, and vice versa for
a p-channel transistor. In an n-channel transistor the threshold voltage,
V_tn, is normally positive, and the terminal voltages V_DS and V_GS are also
usually positive. In a p-channel transistor V_tp is normally negative and we
have a choice: We can write everything in terms of the magnitudes of the
voltages and currents or we can use negative signs in a consistent fashion.
Here are the equations for a p-channel transistor using negative signs:

$$I_{DSp} = -k'_p (W/L) [(V_{GS} - V_{tp}) - 0.5 V_{DS}] V_{DS}; \quad V_{DS} > V_{GS} - V_{tp} \qquad (2.15)$$
$$I_{DSp(\mathrm{sat})} = -(\beta_p / 2)(V_{GS} - V_{tp})^2; \quad V_{DS} < V_{GS} - V_{tp}.$$

In these two equations V_tp is negative, and the terminal voltages V_DS and
V_GS are also normally negative (and -3 V < -2 V, for example). The current
I_DSp is then negative, corresponding to conventional current flowing from
source to drain of a p-channel transistor (and hence the negative sign for
I_DSp(sat) in Eq. 2.14).

2.1.2 Velocity Saturation


For a deep submicron transistor, Eq. 2.12 may overestimate the drain-source
current by a factor of 2 or more. There are three reasons for this error.
First, the threshold voltage is not constant. Second, the actual length of
the channel (the electrical or effective length, often written as L_eff) is
less than the drawn (mask) length. The third reason is that Eq. 2.3 is not
valid for high electric fields. The electrons cannot move any faster than
about v_maxn = 10⁵ m s⁻¹ when the electric field is above 10⁶ V m⁻¹ (reached
when 1 V is dropped across 1 μm); the electrons become velocity saturated. In
this case t_f = L_eff / v_maxn, the drain-source saturation current is
independent of the transistor length, and Eq. 2.12 becomes

$$I_{DSn(\mathrm{sat})} = W v_{maxn} C_{ox} (V_{GS} - V_{tn}); \quad V_{DS} > V_{DS(\mathrm{sat})} \text{ (velocity saturated)}. \qquad (2.16)$$

We can see this behavior for the short-channel transistor characteristics in
Figure 2.4(a) and (c).
Transistor current is often specified per micron of gate width because of the
form of Eq. 2.16. As an example, suppose I_DSn(sat) / W = 300 μA μm⁻¹ for the
n-channel transistors in our G5 process (with V_DS = 3.0 V, V_GS = 3.0 V,
V_tn = 0.65 V, L_eff = 0.5 μm, and T_ox = 100 Å). Then
E_x ≈ (3 - 0.65) V / 0.5 μm ≈ 5 V μm⁻¹,

$$v_{maxn} = \frac{I_{DSn(\mathrm{sat})} / W}{C_{ox} (V_{GS} - V_{tn})} = \frac{(300 \times 10^{-6})\,(1 \times 10^{6})}{(3.45 \times 10^{-3})\,(3 - 0.65)} = 37{,}000\ \mathrm{m\,s^{-1}} \qquad (2.17)$$

and t_f ≈ 0.5 μm / 37,000 m s⁻¹ ≈ 13 ps.
The value for v_maxn is lower than the 10⁵ m s⁻¹ we expected because the
carrier velocity is also lowered by mobility degradation due to the vertical
electric field, which we have ignored. This vertical field forces the
carriers to keep bumping into the interface between the silicon and the gate
oxide, slowing them down.
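Checking Eq. 2.17 and the transit-time estimate numerically (our arithmetic,
using the G5 values in the text):

ids_sat_per_width = 300e-6 / 1e-6  # A per metre of gate width (300 uA/um)
c_ox = 3.45e-3                     # F/m^2 (from Eq. 2.7 with Tox = 100 A)
vgs, vtn = 3.0, 0.65
l_eff = 0.5e-6                     # m

v_max = ids_sat_per_width / (c_ox * (vgs - vtn))
t_f = l_eff / v_max
print(v_max)       # about 37,000 m/s
print(t_f * 1e12)  # about 13.5 ps, i.e., roughly the 13 ps quoted above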

2.1.3 SPICE Models


The simulation program SPICE (which stands for Simulation Program with
Integrated Circuit Emphasis) is often used to characterize logic cells.
Table 2.1 shows a typical set of model parameters for our G5 process. The
SPICE parameter KP (given in μA V⁻²) corresponds to k'_n (and k'_p). SPICE
parameters VT0 and TOX correspond to V_tn (and V_tp) and T_ox. SPICE
parameter U0 (given in cm² V⁻¹ s⁻¹) corresponds to the ideal bulk mobility
values, μ_n (and μ_p). Many of the other parameters model velocity saturation
and mobility degradation (and thus the effective value of k'_n and k'_p).
TABLE 2.1 SPICE parameters for a generic 0.5 μm process, G5 (0.6 μm drawn
gate length). The n-channel transistor characteristics are shown in
Figure 2.4.
.MODEL CMOSN NMOS LEVEL=3 PHI=0.7 TOX=10E-09 XJ=0.2U TPG=1
+ VTO=0.65 DELTA=0.7 LD=5E-08 KP=2E-04 UO=550 THETA=0.27
+ RSH=2 GAMMA=0.6 NSUB=1.4E+17 NFS=6E+11 VMAX=2E+05
+ ETA=3.7E-02 KAPPA=2.9E-02 CGDO=3.0E-10 CGSO=3.0E-10 CGBO=4.0E-10
+ CJ=5.6E-04 MJ=0.56 CJSW=5E-11 MJSW=0.52 PB=1
.MODEL CMOSP PMOS LEVEL=3 PHI=0.7 TOX=10E-09 XJ=0.2U TPG=-1
+ VTO=-0.92 DELTA=0.29 LD=3.5E-08 KP=4.9E-05 UO=135 THETA=0.18
+ RSH=2 GAMMA=0.47 NSUB=8.5E+16 NFS=6.5E+11 VMAX=2.5E+05
+ ETA=2.45E-02 KAPPA=7.96 CGDO=2.4E-10 CGSO=2.4E-10 CGBO=3.8E-10
+ CJ=9.3E-04 MJ=0.47 CJSW=2.9E-10 MJSW=0.505 PB=1
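To connect the deck above to the textbook symbols, here is a small Python
sketch (our own) pulling out the parameters just discussed. Note that the
n-channel KP of 2E-04 A V⁻² (200 μA V⁻²) is a linear-region value, roughly
twice the 90 μA V⁻² extracted in the saturation region in Eq. 2.13, as the
text warns.

# SPICE parameter -> textbook symbol: KP -> k', VTO -> V_t, UO -> bulk
# mobility, TOX -> gate-oxide thickness.
g5_nmos = {"KP": 2e-4, "VTO": 0.65, "UO": 550, "TOX": 10e-9}
g5_pmos = {"KP": 4.9e-5, "VTO": -0.92, "UO": 135, "TOX": 10e-9}
print(g5_nmos["KP"] / 9.05e-5)  # about 2.2: linear vs. saturation k'_n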

2.1.4 Logic Levels


Figure 2.5 shows how to use transistors as logic switches. The bulk
connection for the n-channel transistor in Figure 2.5(a) and (b) is a p-well.
The bulk connection for the p-channel transistor is an n-well. The remaining
connections show what happens when we try to pass a logic signal between the
drain and source terminals.
FIGURE 2.5 CMOS logic levels. (a) A strong '0'. (b) A weak '1'. (c) A weak
'0'. (d) A strong '1'. (V_tn is positive and V_tp is negative.) The depth of
the channels is greatly exaggerated.

In Figure 2.5(a) we apply a logic '1' (or VDD; I shall use these
interchangeably) to the gate and a logic '0' (V_SS) to the source (we know it
is the source since electrons must flow from this point, since V_SS is the
lowest voltage on the chip). The application of these voltages makes the
n-channel transistor conduct current, and electrons flow from source to
drain.
Suppose the drain is initially at logic '1'; then the n-channel transistor
will begin to discharge any capacitance that is connected to its drain (due
to another logic cell, for example). This will continue until the drain
terminal reaches a logic '0', and at that time V_GD and V_GS are both equal
to V_DD, a full logic '1'. The transistor is strongly conducting now (with a
large channel charge, Q, but there is no current flowing since V_DS = 0 V).
The transistor will strongly object to attempts to change its drain terminal
from a logic '0'. We say that the logic level at the drain is a strong '0'.
In Figure 2.5(b) we apply a logic '1' to the drain (it must now be the drain
since electrons have to flow toward a logic '1'). The situation is now quite
different: the transistor is still on, but V_GS is decreasing as the source
voltage approaches its final value. In fact, the source terminal never gets
to a logic '1'; the source will stop increasing in voltage when V_GS reaches
V_tn. At this point the transistor is very nearly off and the source voltage
creeps slowly up to V_DD - V_tn. Because the transistor is very nearly off,
it would be easy for a logic cell connected to the source to change the
potential there, since there is so little channel charge. The logic level at
the source is a weak '1'. Figure 2.5(c) and (d) show that the state of
affairs for a p-channel transistor is the exact reverse, or complement, of
the n-channel transistor situation.
In summary, we have the following logic levels:
● An n-channel transistor provides a strong '0', but a weak '1'.

● A p-channel transistor provides a strong '1', but a weak '0'.

Sometimes we refer to the weak versions of '0' and '1' as degraded logic levels.
In CMOS technology we can use both types of transistor together to produce
strong '0' logic levels as well as strong '1' logic levels.
2.2 The CMOS Process
Figure 2.6 outlines the steps to create an integrated circuit. The starting material
is silicon, Si, refined from quartzite (with less than 1 impurity in 10¹⁰ silicon
atoms). We draw a single-crystal silicon boule (or ingot) from a crucible
containing a melt at approximately 1500 °C (the melting point of silicon at 1 atm
pressure is 1414 °C). This method is known as Czochralski growth. Acceptor
(p-type) or donor (n-type) dopants may be introduced into the melt to alter the
type of silicon grown.
The boule is sawn to form thin circular wafers (6, 8, or 12 inches in diameter, and
typically 600 µm thick), and a flat is ground (the primary flat), perpendicular to
the <110> crystal axis, as a "this edge down" indication. The boule is drawn so
that the wafer surface is either in the (111) or (100) crystal planes. A smaller
secondary flat indicates the wafer crystalline orientation and doping type. A
typical submicron CMOS process uses p-type (100) wafers with a resistivity of
approximately 10 Ω·cm; this type of wafer has two flats, 90° apart. Wafers are
made by chemical companies and sold to the IC manufacturers. A blank 8-inch
wafer costs about $100.
To begin IC fabrication we place a batch of wafers (a wafer lot ) on a boat and
grow a layer (typically a few thousand angstroms) of silicon dioxide , SiO 2 ,
using a furnace. Silicon is used in the semiconductor industry not so much for the
properties of silicon, but because of the physical, chemical, and electrical
properties of its native oxide, SiO 2 . An IC fabrication process contains a series
of masking steps (that in turn contain other steps) to create the layers that define
the transistors and metal interconnect.
FIGURE 2.6 IC fabrication. Grow crystalline silicon (1); make a wafer (2–3);
grow a silicon dioxide (oxide) layer in a furnace (4); apply liquid photoresist
(resist) (5); mask exposure (6); a cross-section through a wafer showing the
developed resist (7); etch the oxide layer (8); ion implantation (9–10); strip the
resist (11); strip the oxide (12). Steps similar to 4–12 are repeated for each layer
(typically 12–20 times for a CMOS process).

Each masking step starts by spinning a thin layer (approximately 1 µm) of liquid
photoresist (resist) onto each wafer. The wafers are baked at about 100 °C to
remove the solvent and harden the resist before being exposed to ultraviolet (UV)
light (typically less than 200 nm wavelength) through a mask . The UV light
alters the structure of the resist, allowing it to be removed by developing. The
exposed oxide may then be etched (removed). Dry plasma etching etches in the
vertical direction much faster than it does horizontally (an anisotropic etch). Wet
etch techniques are usually isotropic . The resist functions as a mask during the
etch step and transfers the desired pattern to the oxide layer.
Dopant ions are then introduced into the exposed silicon areas. Figure 2.6
illustrates the use of ion implantation. An ion implanter is a cross between a TV
and a mass spectrometer and fires dopant ions into the silicon wafer. Ions can
only penetrate materials to a depth (the range, normally a few microns) that
depends on the closely controlled implant energy (measured in keV, usually
between 10 and 100 keV; an electron volt, 1 eV, is 1.6 × 10⁻¹⁹ J). By using layers
of resist, oxide, and polysilicon we can prevent dopant ions from reaching the
silicon surface and thus block the silicon from receiving an implant. We control
the doping level by counting the number of ions we implant (by integrating the
ion-beam current). The implant dose is measured in atoms/cm² (typical doses are
from 10¹³ to 10¹⁵ cm⁻²). As an alternative to ion implantation we may instead
strip the resist and introduce dopants by diffusion from a gaseous source in a
furnace.
Once we have completed the transistor diffusion layers we can deposit layers of
other materials. Layers of polycrystalline silicon (polysilicon or poly), SiO2,
and silicon nitride (Si3N4), for example, may be deposited using chemical
vapor deposition (CVD). Metal layers can be deposited using sputtering. All
these layers are patterned using masks and photolithography steps similar to
those shown in Figure 2.6.
TABLE 2.2 CMOS process layers.

Mask/layer name       Derivation from drawn layers   Alternative names for mask/layer              MOSIS mask label
n-well                = nwell ¹                      bulk, substrate, tub, n-tub, moat             CWN
p-well                = pwell ¹                      bulk, substrate, tub, p-tub, moat             CWP
active                = pdiff + ndiff                thin oxide, thinox, island, gate oxide        CAA
poly                  = poly                         poly, gate                                    CPG
n-diffusion implant   = grow(ndiff) ²                ndiff, n-select, nplus, n+                    CSN
p-diffusion implant   = grow(pdiff) ²                pdiff, p-select, pplus, p+                    CSP
contact               = contact                      contact cut, poly contact, diffusion contact  CCP and CCA ³
metal1                = m1                           first-level metal                             CMF
metal2                = m2                           second-level metal                            CMS
via2                  = via2                         metal2/metal3 via, m2/m3 via                  CVS
metal3                = m3                           third-level metal                             CMT
glass                 = glass                        passivation, overglass, pad                   COG

Table 2.2 shows the mask layers (and their relation to the drawn layers) for a
submicron, silicon-gate, three-level metal, self-aligned CMOS process. A
process in which the effective gate length is less than 1 µm is referred to as a
submicron process. Gate lengths below 0.35 µm are considered in the
deep-submicron regime.
Figure 2.7 shows the layers that we draw to define the masks for the logic cell of
Figure 1.3. Potential confusion arises because we like to keep layout simple but
maintain a what-you-see-is-what-you-get (WYSIWYG) approach. This means
that the drawn layers do not correspond directly to the masks in all cases.
(a) nwell (b) pwell (c) ndiff (d) pdiff

(e) poly (f) contact (g) m1 (h) via

(i) m2 (j) cell (k) phantom

FIGURE 2.7 The standard cell shown in Figure 1.3. (a)–(i) The drawn layers that
define the masks. The active mask is the union of the ndiff and pdiff drawn
layers. The n-diffusion implant and p-diffusion implant masks are bloated
versions of the ndiff and pdiff drawn layers. (j) The complete cell layout. (k) The
phantom cell layout. Often an ASIC vendor hides the details of the internal cell
construction. The phantom cell is used for layout by the customer and then
instantiated by the ASIC vendor after layout is complete. This layout uses
grayscale stipple patterns to distinguish between layers.

We can construct wells in a CMOS process in several ways. In an n-well process,
the substrate is p-type (the wafer itself) and we use an n-well mask to build the
n-well. We do not need a p-well mask because there are no p-wells in an n-well
process (the n-channel transistors all sit in the substrate, the wafer), but we
often draw the p-well layer as though it existed. In a p-well process we use a
p-well mask to make the p-wells, and the n-wells are the substrate. In a twin-tub
(or twin-well) process, we create individual wells for both types of transistors,
and neither well is the substrate (which may be either n-type or p-type). There
are even triple-well processes used to achieve still more control over the
transistor performance. Whatever process we use, we must connect all the
n-wells to the most positive potential on the chip, normally VDD, and all the
p-wells to VSS; otherwise we may forward-bias the bulk-to-source/drain
pn-junctions. The bulk connections for CMOS transistors are not usually drawn in
digital circuit schematics, but these substrate contacts (well contacts or tub ties)
are very important. After we make the well(s), we grow a layer (approximately
1500 Å) of Si3N4 over the wafer. The active mask (CAA) leaves this nitride
layer only in the active areas that will later become transistors or substrate
contacts. Thus
CAA (mask) = ndiff (drawn) ∪ pdiff (drawn) , (2.18)

where the ∪ symbol represents OR (the union of the two drawn layers, ndiff and pdiff).
Everything outside the active areas is known as the field region, or just field .
Next we implant the substrate to prevent unwanted transistors from forming in
the field region; this is the field implant or channel-stop implant. The nitride over
the active areas acts as an implant mask, and we may use another field-implant
mask at this step also. Following this we grow a thick (approximately 5000 Å)
layer of SiO2, the field oxide (FOX). The FOX will not grow over the nitride
areas. When we strip the nitride we are left with FOX in the areas we do not want
to dope the silicon. Following this we deposit, dope, mask, and etch the poly gate
material, CPG (mask) = poly (drawn). Next we create the doped regions that
form the sources, drains, and substrate contacts using ion implantation. The poly
gate functions like masking tape in these steps. One implant (using phosphorus
or arsenic ions) forms the n-type source/drain for the n-channel transistors and
n-type substrate contacts (CSN). A second implant (using boron ions) forms the
p-type source/drain for the p-channel transistors and p-type substrate contacts
(CSP). These implants are masked as follows:
CSN (mask) = grow (ndiff (drawn)), (2.19)
CSP (mask) = grow (pdiff (drawn)), (2.20)

where grow means that we expand or bloat the drawn ndiff and drawn pdiff
layers slightly (usually by a few λ).
During implantation the dopant ions are blocked by the resist pattern defined by
the CSN and CSP masks. The CSN mask thus prevents the n-type regions from
being implanted with p-type dopants (and vice versa for the CSP mask). As we
shall see, the CSN and CSP masks are not intended to define the edges of the
n-type and p-type regions. Instead these two masks function more like newspaper
that prevents paint from spraying everywhere. The dopant ions are also blocked
from reaching the silicon surface by the poly gates, and this aligns the edges of
the source and drain regions to the edges of the gates (we call this a self-aligned
process). In addition, the implants are blocked by the FOX, and this defines the
outside edges of the source, drain, and substrate contact regions.
The only areas of the silicon surface that are doped n-type are

n-diffusion (silicon) = (CAA (mask) ∩ CSN (mask)) ∩ (¬CPG (mask)) , (2.21)

where the ∩ symbol represents AND (the intersection of two layers) and the ¬
symbol represents NOT.
Similarly, the only regions that are doped p-type are

p-diffusion (silicon) = (CAA (mask) ∩ CSP (mask)) ∩ (¬CPG (mask)) . (2.22)

If the CSN and CSP masks do not overlap, it is possible to save a mask by using
one implant mask (CSN or CSP) for the other type (CSP or CSN). We can do this
by using a positive resist (the pattern of resist remaining after developing is the
same as the dark areas on the mask) for one implant step and a negative resist
(vice versa) for the other step. However, because of the poor resolution of
negative resist and because of difficulties in generating the implant masks
automatically from the drawn diffusions (especially when opposite diffusion
types are drawn close to each other or touching), it is now common to draw both
implant masks as well as the two diffusion layers.
It is important to remember that, even though poly is above diffusion, the
polysilicon is deposited first and acts like masking tape. It is rather like
airbrushing a stripe: you use masking tape and spray everywhere without
worrying about making straight lines. The edges of the pattern will align to the
edge of the tape. Here the analogy ends, because the poly is left in place. Thus,
n-diffusion (silicon) = (ndiff (drawn)) ∩ (¬poly (drawn)) and (2.23)
p-diffusion (silicon) = (pdiff (drawn)) ∩ (¬poly (drawn)) . (2.24)

In the ASIC industry the names nplus, n+, and n-diffusion (as well as the p-type
equivalents) are used in various ways. These names may refer to the drawn
diffusion layer (that we call ndiff), the mask (CSN), or the doped region on the
silicon (the intersection of the active and implant masks, which we call
n-diffusion). Very confusing.
The source and drain are often formed from two separate implants. The first is a
light implant close to the edge of the gate; the second is a heavier implant that
forms the rest of the source or drain region. The separate diffusions reduce the
electric field near the drain end of the channel. Tailoring the device
characteristics in this fashion is known as drain engineering, and a process
including these steps is referred to as an LDD process, for lightly doped drain;
the first light implant is known as an LDD diffusion or LDD implant.
FIGURE 2.8 Drawn layers and an example set of black-and-white stipple
patterns for a CMOS process. On top are the patterns as they appear in layout.
Underneath are the magnified 8-by-8 pixel patterns. If we are trying to simplify
layout we may use solid black or white for contacts and vias. If we have contacts
and vias placed on top of one another we may use stipple patterns or other means
to help distinguish between them. Each stipple pattern is transparent, so that
black shows through from underneath when layers are superimposed. There are
no standards for these patterns.

Figure 2.8 shows a stipple-pattern matrix for a CMOS process. In drawn layout
you can see through the layers: all the stipple patterns are ORed together.
Figure 2.9 shows the transistor layers as they appear in layout (drawn using the
patterns from Figure 2.8) and as they appear on the silicon. Figure 2.10 shows the
same thing for the interconnect layers.

FIGURE 2.9 The transistor layers. (a) A p-channel transistor as drawn in layout.
(b) The corresponding silicon cross section (the heavy lines in part a show the
cuts). This is how a p-channel transistor would look just after completing the
source and drain implant steps.

FIGURE 2.10 The interconnect layers. (a) Metal layers as drawn in layout.
(b) The corresponding structure (as it might appear in a scanning-electron
micrograph). The insulating layers between the metal layers are not shown.
Contact is made to the underlying silicon through a platinum barrier layer. Each
via consists of a tungsten plug. Each metal layer consists of a titanium–tungsten
and aluminum–copper sandwich. Most deep-submicron CMOS processes use
metal structures similar to this. The scale, rounding, and irregularity of the
features are realistic.

2.2.1 Sheet Resistance


Tables 2.3 and 2.4 show the sheet resistance for each conducting layer (in
decreasing order of resistance) for two different generations of CMOS process.
TABLE 2.3 Sheet resistance (1 µm CMOS).

Layer         Sheet resistance   Units
n-well        1.15 ± 0.25        kΩ/square
poly          3.5 ± 2.0          Ω/square
n-diffusion   75 ± 20            Ω/square
p-diffusion   140 ± 40           Ω/square
m1/2          70 ± 6             mΩ/square
m3            30 ± 3             mΩ/square

TABLE 2.4 Sheet resistance (0.35 µm CMOS).

Layer         Sheet resistance   Units
n-well        1 ± 0.4            kΩ/square
poly          10 ± 4.0           Ω/square
n-diffusion   3.5 ± 2.0          Ω/square
p-diffusion   2.5 ± 1.5          Ω/square
m1/2/3        60 ± 6             mΩ/square
metal4        30 ± 3             mΩ/square

The diffusion layers, n-diffusion and p-diffusion, both have a high sheet
resistance, typically from 1–100 Ω/square. We measure resistance in Ω/square
(ohms per square) because, for a fixed thickness of material, it does not matter
what the size of a square is: the resistance is the same. Thus the resistance of a
rectangular shape of a sheet of material may be calculated from the number of
squares it contains times the sheet resistance in Ω/square. We can use diffusion
for very short connections inside a logic cell, but not for interconnect between
logic cells.
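
The squares-times-sheet-resistance rule is simple to apply. The following short Python helper is a sketch using the nominal values from Table 2.3 (the dictionary and function names are ours, purely for illustration):

# Resistance of a rectangular wire = (length/width) * sheet resistance.
# Nominal 1 um CMOS values from Table 2.3, in ohms/square.
SHEET_RESISTANCE = {
    'n-well': 1150.0, 'poly': 3.5, 'n-diffusion': 75.0,
    'p-diffusion': 140.0, 'm1': 0.070, 'm3': 0.030,
}

def wire_resistance(layer, length, width):
    """Any units for length and width, as long as they match."""
    squares = length / width        # a square has no particular size
    return squares * SHEET_RESISTANCE[layer]

print(wire_resistance('n-diffusion', 100, 1))  # 7500 ohms
print(wire_resistance('m1', 100, 1))           # 7 ohms, about 1000 times lower
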
Poly has the next highest resistance after diffusion. Most submicron CMOS
processes use a silicide material (a metallic compound of silicon) that has much
lower resistivity (several Ω/square) than the poly or diffusion layers alone.
Examples are tantalum silicide, TaSi; tungsten silicide, WSi; and titanium silicide,
TiSi. The stoichiometry of these deposited silicides varies. For example, for
tungsten silicide W:Si ≈ 1:2.6.
There are two types of silicide process. In a silicide process only the gate is
silicided. This reduces the poly sheet resistance, but not that of the source–drain.
In a self-aligned silicide (salicide) process, both the gate and the source–drain
regions are silicided. In some processes silicide can be used to connect adjacent
poly and diffusion (we call this feature LI: white metal, local interconnect,
metal0, or m0). LI is useful to reduce the area of ASIC RAM cells, for example.
Interconnect uses metal layers with sheet resistances of tens of mΩ/square, several
orders of magnitude less than the other layers. There are usually several layers of
metal in a CMOS ASIC process, each separated by an insulating layer. The metal
layer above the poly gate layer is the first-level metal (m1 or metal1), the next is
the second-level metal (m2 or metal2), and so on. We can make connections
from m1 to diffusion using diffusion contacts or to the poly using polysilicon
contacts.
After we etch the contact holes a thin barrier metal (typically platinum) is
deposited over the silicon and poly. Next we form contact plugs ( via plugs for
connections between metal layers) to reduce contact resistance and the likelihood
of breaks in the contacts. Tungsten is commonly used for these plugs. Following
this we form the metal layers as sandwiches. The middle of the sandwich is a
layer (usually from 3000 Å to 10,000 Å) of aluminum and copper. The top and
bottom layers are normally titanium–tungsten (TiW, pronounced tie-tungsten).
Submicron processes use chemical–mechanical polishing (CMP) to smooth the
wafers flat before each metal deposition step to help with step coverage.
An insulating glass, often sputtered quartz (SiO2), though other materials are
also used, is deposited between metal layers to help create a smooth surface for
the deposition of the metal. Design rules may refer to this insulator as an
intermetal oxide (IMO), whether it is in fact an oxide or not, or as an interlevel
dielectric (ILD). The IMO may be a spin-on polymer; boron-doped
phosphosilicate glass (BPSG); Si3N4; or sandwiches of these materials
(oxynitrides, for example).
We make the connections between m1 and m2 using metal vias , cuts , or just
vias . We cannot connect m2 directly to diffusion or poly; instead we must make
these connections through m1 using a via. Most processes allow contacts and vias
to be placed directly above each other without restriction, arrangements known as
stacked vias and stacked contacts . We call a process with m1 and m2 a two-level
metal ( 2LM ) technology. A 3LM process includes a third-level metal layer ( m3
or metal3), and some processes include more metal layers. In this case a
connection between m1 and m2 will use an m1/m2 via, or via1 ; a connection
between m2 and m3 will use an m2/m3 via, or via2 , and so on.
The minimum spacing of interconnects, the metal pitch , may increase with
successive metal layers. The minimum metal pitch is the minimum spacing
between the centers of adjacent interconnects and is equal to the minimum metal
width plus the minimum metal spacing.
Aluminum interconnect tends to break when carrying a high current density.
Collisions between high-energy electrons and atoms move the metal atoms over a
long period of time in a process known as electromigration . Copper is added to
the aluminum to help reduce the problem. The other solution is to reduce the
current density by using wider than minimum-width metal lines.
Tables 2.5 and 2.6 show maximum specified contact resistance and via resistance
for two generations of CMOS processes. Notice that an m1 contact in either
process is equal in resistance to several hundred squares of metal.
TABLE 2.5 Contact resistance (1 µm CMOS).

Contact/via type         Resistance (maximum)
m2/m3 via (via2)         5 Ω
m1/m2 via (via1)         2 Ω
m1/p-diffusion contact   20 Ω
m1/n-diffusion contact   20 Ω
m1/poly contact          20 Ω

TABLE 2.6 Contact resistance (0.35 µm CMOS).

Contact/via type         Resistance (maximum)
m2/m3 via (via2)         6 Ω
m1/m2 via (via1)         6 Ω
m1/p-diffusion contact   20 Ω
m1/n-diffusion contact   20 Ω
m1/poly contact          20 Ω

1. If only one well layer is drawn, the other mask may be derived from the drawn
layer. For example, p -well (mask) = not (nwell (drawn)). A single-well process
requires only one well mask.
2. The implant masks may be derived or drawn.
3. Largely for historical reasons the contacts to poly and contacts to active have
different layer names. In the past this allowed a different sizing or process bias to
be applied to each contact type when the mask was made.
2.3 CMOS Design Rules
Figure 2.11 defines the design rules for a CMOS process using pictures. Arrows
between objects denote a minimum spacing, and arrows showing the size of an
object denote a minimum width. Rule 3.1, for example, is the minimum width of
poly (2λ). Each of the rule numbers may have different values for different
manufacturers; there are no standards for design rules. Tables 2.7–2.9 show the
MOSIS scalable CMOS rules. Table 2.7 shows the layer rules for the process
front end, which is the front end of the line (as in production line) or FEOL.
Table 2.8 shows the rules for the process back end (BEOL), the metal
interconnect, and Table 2.9 shows the rules for the pad layer and glass layer.

FIGURE 2.11 The MOSIS scalable CMOS design rules (rev. 7). Dimensions are
in λ. Rule numbers are in parentheses (missing rule sets 11–13 are extensions to
this basic process).
TABLE 2.7 MOSIS scalable CMOS rules version 7: the process front end.

Layer                  Rule     Explanation                                            Value/λ
well (CWN, CWP)        1.1      minimum width                                          10
                       1.2      minimum space (different potential, a hot well)        9
                       1.3      minimum space (same potential)                         0 or 6
                       1.4      minimum space (different well type)                    0
active (CAA)           2.1/2.2  minimum width/space                                    3
                       2.3      source/drain active to well edge space                 5
                       2.4      substrate/well contact active to well edge space       3
                       2.5      minimum space between active (different implant type)  0 or 4
poly (CPG)             3.1/3.2  minimum width/space                                    2
                       3.3      minimum gate extension of active                       2
                       3.4      minimum active extension of poly                       3
                       3.5      minimum field poly to active space                     1
select (CSN, CSP)      4.1      minimum select spacing to channel of transistor ¹      3
                       4.2      minimum select overlap of active                       2
                       4.3      minimum select overlap of contact                      1
                       4.4      minimum select width and spacing ²                     2
poly contact (CCP)     5.1.a    exact contact size                                     2 × 2
                       5.2.a    minimum poly overlap                                   1.5
                       5.3.a    minimum contact spacing                                2
active contact (CCA)   6.1.a    exact contact size                                     2 × 2
                       6.2.a    minimum active overlap                                 1.5
                       6.3.a    minimum contact spacing                                2
                       6.4.a    minimum space to gate of transistor                    2
TABLE 2.8 MOSIS scalable CMOS rules version 7: the process back end.

Layer          Rule    Explanation                                    Value/λ
metal1 (CMF)   7.1     minimum width                                  3
               7.2.a   minimum space                                  3
               7.2.b   minimum space (for minimum-width wires only)   2
               7.3     minimum overlap of poly contact                1
               7.4     minimum overlap of active contact              1
via1 (CVA)     8.1     exact size                                     2 × 2
               8.2     minimum via spacing                            3
               8.3     minimum overlap by metal1                      1
               8.4     minimum spacing to contact                     2
               8.5     minimum spacing to poly or active edge         2
metal2 (CMS)   9.1     minimum width                                  3
               9.2.a   minimum space                                  4
               9.2.b   minimum space (for minimum-width wires only)   3
               9.3     minimum overlap of via1                        1
via2 (CVS)     14.1    exact size                                     2 × 2
               14.2    minimum space                                  3
               14.3    minimum overlap by metal2                      1
               14.4    minimum spacing to via1                        2
metal3 (CMT)   15.1    minimum width                                  6
               15.2    minimum space                                  4
               15.3    minimum overlap of via2                        2
TABLE 2.9 MOSIS scalable CMOS rules version 7: the pads and overglass
(passivation).

Layer         Rule   Explanation                                                Value
glass (COG)   10.1   minimum bonding-pad width                                  100 µm × 100 µm
              10.2   minimum probe-pad width                                    75 µm × 75 µm
              10.3   pad overlap of glass opening                               6 µm
              10.4   minimum pad spacing to unrelated metal2 (or metal3)        30 µm
              10.5   minimum pad spacing to unrelated metal1, poly, or active   15 µm

The rules in Table 2.7 and Table 2.8 are given as multiples of λ. If we use
lambda-based rules we can move between successive process generations just by
changing the value of λ. For example, we can scale 0.5 µm layouts (λ = 0.25 µm)
by a factor of 0.175/0.25 for a 0.35 µm process (λ = 0.175 µm), at least in
theory. You may get an inkling of the practical problems from the fact that the
values for pad dimensions and spacing in Table 2.9 are given in microns and not
in λ. This is because bonding to the pads is an operation that does not scale well.
Often companies have two sets of design rules: one in λ (with fractional λ rules)
and the other in microns. Ideally we would like to express all of the design rules
in integer multiples of λ. This was true for revisions 4–6, but not revision 7 of the
MOSIS rules. In revision 7 rules 5.2.a/6.2.a are noninteger. The original
Mead–Conway NMOS rules include a noninteger 1.5λ rule for the implant layer.
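
The mechanics of lambda scaling are simple enough to show in a few lines of Python. This is only a sketch: the rule names and values are taken from Table 2.7, and a real process migration faces the problems just mentioned.

# Convert lambda-based rules to microns for two process generations.
RULES_LAMBDA = {
    'rule 1.1, well width': 10,
    'rule 3.1, poly width': 2,
    'rule 5.2.a, poly overlap of contact': 1.5,  # a noninteger rule
}

def rules_in_microns(lam):
    """Scale every rule by the value of lambda (in microns)."""
    return {name: v * lam for name, v in RULES_LAMBDA.items()}

for lam in (0.25, 0.175):   # 0.5 um and 0.35 um processes
    print(lam, rules_in_microns(lam))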

1. To ensure source and drain width.


2. Different select types may touch but not overlap.
2.4 Combinational Logic Cells
The AND-OR-INVERT (AOI) and the OR-AND-INVERT (OAI) logic cells are
particularly efficient in CMOS. Figure 2.12 shows an AOI221 and an OAI321
logic cell (the logic symbols in Figure 2.12 are not standards, but are widely
used). All indices (the numbers after AOI or OAI) in the logic cell name greater
than 1 correspond to the inputs to the first level or stage: the AND gate(s) in an
AOI cell, for example. An index of '1' corresponds to a direct input to the
second-stage cell. We write indices in descending order, so it is AOI221 and not
AOI122 (but both are equivalent cells), and AOI32 not AOI23. If we have more
than one direct input to the second stage we repeat the '1'; thus an AOI211 cell
performs the function Z = (A · B + C + D)'. A three-input NAND cell is an
OAI111, but calling it that would be very confusing. These rules are not
standard, but form a convention that we shall adopt and one that is widely used in
the ASIC industry.
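
Since the naming convention is completely regular, a cell name can be turned directly into its logic function. The following Python sketch evaluates an AOI or OAI cell from its list of indices (the helper names are ours, not an industry tool):

# Evaluate AOI/OAI cells from their indices, e.g. (2, 2, 1) for an AOI221.
def aoi(indices, inputs):
    """AOI: first-stage ANDs (one per index), second-stage NOR."""
    it = iter(inputs)
    groups = [[next(it) for _ in range(n)] for n in indices]
    return not any(all(g) for g in groups)

def oai(indices, inputs):
    """OAI is the dual: first-stage ORs, second-stage NAND."""
    it = iter(inputs)
    groups = [[next(it) for _ in range(n)] for n in indices]
    return not all(any(g) for g in groups)

# AOI221: Z = (A . B + C . D + E)'
print(aoi((2, 2, 1), [True, False, False, True, False]))  # True
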
There are many ways to represent the logical operator AND. I shall use the
middle dot and write A · B (rather than AB, A.B, or A ∧ B); occasionally I may
use AND(A, B). Similarly I shall write A + B as well as OR(A, B). I shall use an
apostrophe, like this: A', to denote the complement of A rather than an overbar,
since sometimes it is difficult or inappropriate to use an overbar (vinculum) or
diacritical mark (macron). It is possible to misinterpret AB' as A · (B') rather
than (A · B)' (but the former alternative would be written A · B' in my
convention). I shall be careful in these situations.

FIGURE 2.12 Naming and numbering complex CMOS combinational cells.
(a) An AND-OR-INVERT cell, an AOI221. (b) An OR-AND-INVERT cell, an
OAI321. Numbering is always in descending order.

We can express the function of the AOI221 cell in Figure 2.12(a) as


Z = (A · B + C · D + E)' . (2.25)

We can also write this equation unambiguously as Z = AOI221(A, B, C, D, E),
just as we might write X = NAND(I, J, K) to describe the logic function
X = (I · J · K)'.
This notation is useful because, for example, if we write OAI321(P, Q, R, S, T,
U) we immediately know that U (the sixth input) is the (only) direct input
connected to the second stage. Sometimes we need to refer to particular inputs
without listing them all. We can adopt another convention that letters of the input
names change with the index position. Now we can refer to input B2 of an
AOI321 cell, for example, and know which input we are talking about without
writing
Z = AOI321(A1, A2, A3, B1, B2, C) . (2.26)

Table 2.10 shows the AOI family of logic cells with three indices (with branches
in the family for AOI, OAI, AO, and OA cells). There are 5 types and 14 separate
members of each branch of this family. There are thus 4 × 14 = 56 cells of the
type Xabc, where X = {OAI, AOI, OA, AO} and each of the indices a, b, and c
can range from 1 to 3. We form the AND-OR (AO) and OR-AND (OA) cells by
adding an inverter to the output of an AOI or OAI cell.
TABLE 2.10 The AOI family of cells with three index numbers or less.

Cell type ¹   Cells                       Number of unique cells
Xa1           X21, X31                    2
Xa11          X211, X311                  2
Xab           X22, X33, X32               3
Xab1          X221, X331, X321            3
Xabc          X222, X333, X332, X322      4
Total                                     14

2.4.1 Pushing Bubbles


The AOI and OAI logic cells can be built using a single stage in CMOS using
series–parallel networks of transistors called stacks. Figure 2.13 illustrates the
procedure to build the n-channel and p-channel stacks, using the AOI221 cell as
an example.
FIGURE 2.13 Constructing a CMOS logic cell: an AOI221. (a) First build the
dual icon by using De Morgan's theorems to push inversion bubbles to the
inputs. (b) Next build the n-channel and p-channel stacks from series and
parallel combinations of transistors. (c) Adjust transistor sizes so that the
n-channel and p-channel stacks have equal strengths.

Here are the steps to construct any single-stage combinational CMOS logic cell:
1. Draw a schematic icon with an inversion (bubble) on the last cell (the
bubble-out schematic). Use De Morgan's theorems (a NAND is an OR
with inverted inputs, and a NOR is an AND with inverted inputs) to push
the output bubble back to the inputs (this is the dual icon or bubble-in
schematic).
2. Form the n-channel stack, working from the inputs on the bubble-out
schematic: OR translates to a parallel connection, AND translates to a
series connection. If you have a bubble at an input, you need an inverter.
3. Form the p-channel stack using the bubble-in schematic (ignore the
inversions at the inputs; the bubbles on the gate terminals of the p-channel
transistors take care of these). If you do not have a bubble at the input gate
terminals, you need an inverter (these will be the same input gate terminals
that had bubbles in the bubble-out schematic).
The two stacks are network duals (they can be derived from each other by
swapping series connections for parallel, and parallel connections for series), as
the sketch below illustrates. The n-channel stack implements the strong '0's of
the function and the p-channel stack provides the strong '1's. The final step is to
adjust the drive strength of the logic cell by sizing the transistors.
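
The series–parallel duality is mechanical, so it is easy to model. In the Python sketch below (the tuple encoding is ours, purely for illustration) a stack is a nested structure of series ('s') and parallel ('p') connections, and the p-channel stack is derived from the n-channel stack by swapping the two:

# ('s', ...) is a series connection (AND), ('p', ...) is parallel (OR).
def dual(stack):
    """Swap series and parallel connections to form the dual network."""
    if isinstance(stack, str):       # a single transistor input
        return stack
    kind, *branches = stack
    return ('p' if kind == 's' else 's', *[dual(b) for b in branches])

# n-channel stack of an AOI221, Z = (A . B + C . D + E)':
nstack = ('p', ('s', 'A', 'B'), ('s', 'C', 'D'), 'E')
print(dual(nstack))  # ('s', ('p', 'A', 'B'), ('p', 'C', 'D'), 'E')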

2.4.2 Drive Strength


Normally we ratio the sizes of the n-channel and p-channel transistors in an
inverter so that both types of transistors have the same resistance, or drive
strength. That is, we make βn = βp. At low dopant concentrations and low
electric fields µn is about twice µp. To compensate we make the shape factor,
W/L, of the p-channel transistor in an inverter about twice that of the n-channel
transistor (we say the logic has a ratio of 2). Since the transistor lengths are
normally equal to the minimum poly width for both types of transistors, the ratio
of the transistor widths is also equal to 2. With the high dopant concentrations
and high electric fields in submicron transistors the difference in mobilities is
smaller, typically a factor between 1 and 1.5.
Logic cells in a library have a range of drive strengths. We normally call the
minimum-size inverter a 1X inverter. The drive strength of a logic cell is often
used as a suffix; thus a 1X inverter has a cell name such as INVX1 or INVD1. An
inverter with transistors that are twice the size will be an INVX2. Drive strengths
are normally scaled in a geometric ratio, so we have 1X, 2X, 4X, and
(sometimes) 8X or even higher, drive-strength cells. We can size a logic cell
using these basic rules:
● Any string of transistors connected between a power supply and the output
in a cell with 1X drive should have the same resistance as the n-channel
transistor in a 1X inverter.

● A transistor with shape factor W1/L1 has a resistance proportional to
L1/W1 (so the larger W1 is, the smaller the resistance).

● Two transistors in parallel with shape factors W1/L1 and W2/L2 are
equivalent to a single transistor with shape factor W1/L1 + W2/L2. For
example, a 2/1 in parallel with a 3/1 is a 5/1.

● Two transistors, with shape factors W1/L1 and W2/L2, in series are
equivalent to a single 1/(L1/W1 + L2/W2) transistor.

For example, a transistor with shape factor 3/1 (we shall call this a 3/1) in series
with another 3/1 is equivalent to a 1/((1/3) + (1/3)), or a 3/2. We can use the
following method to calculate equivalent transistor sizes:
● To add transistors in parallel, make all the lengths 1 and add the widths.

● To add transistors in series, make all the widths 1 and add the lengths.

We have to be careful to keep W and L reasonable. For example, a 3/1 in series
with a 2/1 is equivalent to a 1/((1/3) + (1/2)), or a 1/0.83. Since we cannot make a
device 2λ wide and 1.66λ long, a 1/0.83 is more naturally written as 3/2.5. We
like to keep both W and L as integer multiples of 0.5 (equivalent to making W
and L integer multiples of λ), but W and L must be greater than 1.
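
These bookkeeping rules are easy to mechanize. A minimal Python sketch (helper names are ours), using exact fractions so that sizes such as 3/2 print cleanly:

from fractions import Fraction

def parallel(*shapes):
    """Parallel transistors: the W/L shape factors (conductances) add."""
    return sum(Fraction(w, l) for w, l in shapes)

def series(*shapes):
    """Series transistors: the L/W terms (resistances) add."""
    return 1 / sum(Fraction(l, w) for w, l in shapes)

print(parallel((2, 1), (3, 1)))  # 5   -> a 5/1 device
print(series((3, 1), (3, 1)))    # 3/2 -> a 3/2 device
print(series((3, 1), (2, 1)))    # 6/5 -> drawn as 3/2.5
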
In Figure 2.13(c) the transistors in the AOI221 cell are sized so that any string
through the p-channel stack has a drive strength equivalent to a 2/1 p-channel
transistor (we choose the worst case; if more than one transistor in parallel is
conducting then the drive strength will be higher). The n-channel stack is sized
so that it has the drive strength of a 1/1 n-channel transistor. The ratio in this
library is thus 2.
If we were to use four drive strengths for each of the AOI family of cells shown
in Table 2.10, we would have a total of 224 combinational library cells, just for
the AOI family. The synthesis tools can handle this number of cells, but we may
not be able to design this many cells in a reasonable amount of time. Section 3.3,
Logical Effort, will help us choose the most logically efficient cells.

2.4.3 Transmission Gates


Figure 2.14(a) and (b) show a CMOS transmission gate (TG, TX gate, pass
gate, coupler). We connect a p-channel transistor (to transmit a strong '1') in
parallel with an n-channel transistor (to transmit a strong '0').

FIGURE 2.14 CMOS transmission gate (TG). (a) An n-channel and a p-channel
transistor in parallel form a TG. (b) A common symbol for a TG. (c) The
charge-sharing problem.

We can express the function of a TG as


Z = TG(A, S) , (2.27)

but this is ambiguous: if we write TG(X, Y), how do we know if X is connected to
the gates or the sources/drains of the TG? We shall always define TG(X, Y) when
we use it. It is tempting to write TG(A, S) = A · S, but what is the value of Z when
S = '0' in Figure 2.14(a), since Z is then left floating? A TG is a switch, not an
AND logic cell.
There is a potential problem if we use a TG as a switch connecting a node Z that
has a large capacitance, C BIG, to an input node A that has only a small
capacitance C SMALL (see Figure 2.14c). If the initial voltage at A is V SMALL
and the initial voltage at Z is V BIG, when we close the TG (by setting S = '1') the
final voltage on both nodes A and Z is

VF = (C BIG · V BIG + C SMALL · V SMALL) / (C BIG + C SMALL) . (2.28)

Imagine we want to drive a '0' onto node Z from node A. Suppose C BIG = 0.2 pF
(about 10 standard loads in a 0.5 µm process) and C SMALL = 0.02 pF, V BIG = 0 V
and V SMALL = 5 V; then

VF = [(0.2 × 10⁻¹²)(0) + (0.02 × 10⁻¹²)(5)] / [(0.2 × 10⁻¹²) + (0.02 × 10⁻¹²)]
   = 0.45 V . (2.29)

This is not what we want at all: the big capacitor has forced node A to a voltage
close to a '0'. This type of problem is known as charge sharing. We should make
sure that either (1) node A is strong enough to overcome the big capacitor, or
(2) we insulate node A from node Z by including a buffer (an inverter, for
example) between node A and node Z. We must not use charge to drive another
logic cell; only a logic cell can drive a logic cell.
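
Equation 2.28 is simple enough to check numerically. A small Python sketch of the charge-sharing calculation:

# Final voltage after a TG connects two capacitors charged to
# different voltages (Eq. 2.28): total charge / total capacitance.
def charge_share(c_big, v_big, c_small, v_small):
    return (c_big * v_big + c_small * v_small) / (c_big + c_small)

# The example of Eq. 2.29: C_BIG = 0.2 pF at 0 V, C_SMALL = 0.02 pF at 5 V.
vf = charge_share(0.2e-12, 0.0, 0.02e-12, 5.0)
print(round(vf, 2))  # 0.45 V: node A is dragged close to a '0'
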
If we omit one of the transistors in a TG (usually the p -channel transistor) we
have a pass transistor . There is a branch of full-custom VLSI design that uses
pass-transistor logic. Much of this is based on relay-based logic, since a single
transistor switch looks like a relay contact. There are many problems associated
with pass-transistor logic related to charge sharing, reduced noise margins, and
the difficulty of predicting delays. Though pass transistors may appear in an
ASIC cell inside a library, they are not used by ASIC designers.

FIGURE 2.15 The CMOS multiplexer (MUX). (a) A noninverting 2:1 MUX
using transmission gates without buffering. (b) A symbol for a MUX (note how
the inputs are labeled). (c) An IEEE standard symbol for a MUX. (d) A
nonstandard, but very common, IEEE symbol for a MUX. (e) An inverting
MUX with output buffer. (f) A noninverting buffered MUX.

We can use two TGs to form a multiplexer (or multiplexor; people use both
orthographies) as shown in Figure 2.15(a). We often shorten multiplexer to
MUX. The MUX function for two data inputs, A and B, with a select signal S, is

We can write this as Z = A · S' + B · S, since node Z is always connected to one
or the other of the inputs (and we assume both are driven). This is a two-input
MUX (2-to-1 MUX or 2:1 MUX). Unfortunately, we can also write the MUX
function as Z = A · S + B · S', so it is difficult to write the MUX function
unambiguously as Z = MUX(X, Y, Z). For example, is the select input X, Y, or
Z? We shall define the function MUX(X, Y, Z) each time we use it. We must also
be careful to label a MUX if we use the symbol shown in Figure 2.15(b). Symbols
for a MUX are shown in Figure 2.15(b–d). In the IEEE notation 'G' specifies an
AND dependency. Thus, in Figure 2.15(c), G = '1' selects the input labeled '1'.
Figure 2.15(d) uses the common control block symbol (the notched rectangle).
Here, G1 = '1' selects the input labeled '1', and G1 = '0' selects the input labeled
'1̄'. Strictly this form of IEEE symbol should be used only for elements with more
than one section controlled by common signals, but the symbol of Figure 2.15(d)
is used often for a 2:1 MUX.
The MUX shown in Figure 2.15(a) works, but there is a potential charge-sharing
problem if we cascade MUXes (connect them in series). Instead most ASIC
libraries use MUX cells built with a more conservative approach. We could
buffer the output using an inverter (Figure 2.15e), but then the MUX becomes
inverting. To build a safe, noninverting MUX we can buffer the inputs and output
(Figure 2.15f), requiring 12 transistors, or 3 gate equivalents (only the
gate-equivalent counts are shown from now on).
Figure 2.16 shows how to use an OAI22 logic cell (and an inverter) to implement
an inverting MUX. The implementation in equation form (2.5 gates) is

ZN = A' · S' + B' · S
   = [ (A' · S')' · (B' · S)' ]'
   = [ (A + S) · (B + S') ]'
   = OAI22[A, S, B, NOT(S)] . (2.31)

(Both A' and NOT(A) represent an inverter, depending on which representation is
most convenient; they are equivalent.) I often use an equation to describe a cell
implementation.

FIGURE 2.16 An inverting 2:1 MUX based on an OAI22 cell.

The following factors will determine which MUX implementation is best:


1. Do we want to minimize the delay between the select input and the output
or between the data inputs and the output?
2. Do we want an inverting or noninverting MUX?
3. Do we object to having any logic cell inputs tied directly to the
source/drain diffusions of a transmission gate? (Some companies forbid
such transmission-gate inputs since some simulation tools cannot handle
them.)
4. Do we object to any logic cell outputs being tied to the source/drain of a
transmission gate? (Some companies will not allow this because of the
dangers of charge sharing.)
5. What drive strength do we require (and is size or speed more important)?
A minimum-size TG is a little slower than a minimum-size inverter, so there is
not much difference between the implementations shown in Figure 2.15 and
Figure 2.16, but the difference can become important for 4:1 and larger MUXes.

2.4.4 Exclusive-OR Cell


The two-input exclusive-OR (XOR, EXOR, not-equivalence, ring-OR) function
is

A1 ⊕ A2 = XOR(A1, A2) = A1 · A2' + A1' · A2 . (2.32)

We are now using multiletter symbols, but there should be no doubt that A1'
means NOT(A1). We can implement a two-input XOR using a MUX and an
inverter as follows (2 gates):

XOR(A1, A2) = MUX[NOT(A1), A1, A2] , (2.33)

where

MUX(A, B, S) = A · S + B · S' . (2.34)

This implementation buffers only one input and does not buffer the MUX output.
We can use inverter buffers (3.5 gates total) or an inverting MUX so that the
XOR cell does not have any external connections to source/drain diffusions, as
follows (3 gates total):

XOR(A1, A2) = NOT[MUX(NOT[NOT(A1)], NOT(A1), A2)] . (2.35)

We can also implement a two-input XOR using an AOI21 (and a NOR cell),
since

XOR(A1, A2) = A1 · A2' + A1' · A2
            = [ (A1 · A2) + (A1 + A2)' ]'
            = AOI21[A1, A2, NOR(A1, A2)] , (2.36)

(2.5 gates). Similarly, we can implement an exclusive-NOR (XNOR, equivalence)
logic cell using an inverting MUX (and two inverters, 3.5 gates total) or an
OAI21 logic cell (and a NAND cell, 2.5 gates total) as follows (using the MUX
function of Eq. 2.34):

XNOR(A1, A2) = A1 · A2 + NOT(A1) · NOT(A2)
             = NOT[NOT[MUX(A1, NOT(A1), A2)]]
             = OAI21[A1, A2, NAND(A1, A2)] . (2.37)
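
Again, these identities can be checked by enumerating all four input combinations. A Python sketch:

from itertools import product

aoi21 = lambda a, b, c: not ((a and b) or c)   # Z = (A . B + C)'
oai21 = lambda a, b, c: not ((a or b) and c)   # Z = ((A + B) . C)'

for a1, a2 in product([False, True], repeat=2):
    assert aoi21(a1, a2, not (a1 or a2)) == (a1 != a2)    # Eq. 2.36, XOR
    assert oai21(a1, a2, not (a1 and a2)) == (a1 == a2)   # Eq. 2.37, XNOR
print('the AOI21 and OAI21 implementations match XOR and XNOR')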

1. Xabc: X = {AOI, AO, OAI, OA}; a, b, c = {2, 3}; { } means choose one.
2.6 Datapath Logic Cells
Suppose we wish to build an n-bit adder (that adds two n-bit numbers) and to exploit
the regularity of this function in the layout. We can do so using a datapath structure.
The following two functions, SUM and COUT, implement the sum and carry out for a
full adder (FA) with two data inputs (A, B) and a carry in, CIN:

SUM = A ⊕ B ⊕ CIN = SUM(A, B, CIN) = PARITY(A, B, CIN) , (2.38)

COUT = A · B + A · CIN + B · CIN = MAJ(A, B, CIN) . (2.39)

The sum uses the parity function ('1' if there is an odd number of '1's in the inputs).
The carry out, COUT, uses the 2-of-3 majority function ('1' if the majority of the inputs
are '1'). We can combine these two functions in a single FA logic cell, ADD(A[i], B[i],
CIN, S[i], COUT), shown in Figure 2.20(a), where
S[i] = SUM(A[i], B[i], CIN) , (2.40)

COUT = MAJ(A[i], B[i], CIN) . (2.41)

Now we can build a 4-bit ripple-carry adder (RCA) by connecting four of these ADD
cells together as shown in Figure 2.20(b). The i th ADD cell is arranged with the
following: two bus inputs A[i], B[i]; one bus output S[i]; an input, CIN, that is the
carry in from stage (i − 1) below and is also passed up to the cell above as an output;
and an output, COUT, that is the carry out to stage (i + 1) above. In the 4-bit adder
shown in Figure 2.20(b) we connect the carry input, CIN[0], to VSS and use COUT[3]
and COUT[2] to indicate arithmetic overflow (in Section 2.6.1 we shall see why we
may need both signals). Notice that we build the ADD cell so that COUT[2] is
available at the top of the datapath when we need it.
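
A bit-level model of this adder is a useful reference when checking a netlist. The following Python sketch models the ADD cell of Eqs. 2.40–2.41 and the 4-bit RCA of Figure 2.20(b); the function names are ours.

def add_cell(a, b, cin):
    """Full adder: sum = parity of the inputs, carry out = majority."""
    s = a ^ b ^ cin                          # SUM(A, B, CIN)
    cout = (a & b) | (a & cin) | (b & cin)   # MAJ(A, B, CIN)
    return s, cout

def rca(a_bits, b_bits):
    """Ripple-carry adder; bit vectors are LSB first, CIN[0] tied to '0'."""
    carry, s_bits, c_bits = 0, [], []
    for a, b in zip(a_bits, b_bits):
        s, carry = add_cell(a, b, carry)
        s_bits.append(s)
        c_bits.append(carry)
    return s_bits, c_bits

# 7 + 5: S = [0, 0, 1, 1] (LSB first) = 12; COUT[3] = 0 and COUT[2] = 1,
# so the result overflows 4-bit twos complement (see Section 2.6.1).
print(rca([1, 1, 1, 0], [1, 0, 1, 0]))
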
Figure 2.20(c) shows a layout of the ADD cell. The A inputs, B inputs, and S outputs
all use m1 interconnect running in the horizontal direction; we call these data signals.
Other signals can enter or exit from the top or bottom and run vertically across the
datapath in m2; we call these control signals. We can also use m1 for control and m2
for data, but we normally do not mix these approaches in the same structure. Control
signals are typically clocks and other signals common to elements. For example, in
Figure 2.20(c) the carry signals, CIN and COUT, run vertically in m2 between cells. To
build a 4-bit adder we stack four ADD cells, creating the array structure shown in
Figure 2.20(d). In this case the A and B data bus inputs enter from the left and bus S,
the sum, exits at the right, but we can connect A, B, and S to either side if we want.
The layout of buswide logic that operates on data signals in this fashion is called a
datapath. The module ADD is a datapath cell or datapath element. Just as we do for
standard cells, we make all the datapath cells in a library the same height so we can abut
other datapath cells on either side of the adder to create a more complex datapath.
When people talk about a datapath they always assume that it is oriented so that
increasing the size in bits makes the datapath grow in height, upwards in the vertical
direction, and adding different datapath elements to increase the function makes the
datapath grow in width, in the horizontal direction; but we can rotate and position a
completed datapath in any direction we want on a chip.

FIGURE 2.20 A datapath adder. (a) A full-adder (FA) cell with inputs (A and B), a
carry in, CIN, sum output, S, and carry out, COUT. (b) A 4-bit adder. (c) The layout,
using two-level metal, with data in m1 and control in m2. In this example the wiring is
completed outside the cell; it is also possible to design the datapath cells to contain the
wiring. Using three levels of metal, it is possible to wire over the top of the datapath
cells. (d) The datapath layout.

What is the difference between using a datapath, standard cells, or gate arrays? Cells
are placed together in rows on a CBIC or an MGA, but there is generally no
regularity to the arrangement of the cells within the rows; we let software arrange the
cells and complete the interconnect. Datapath layout automatically takes care of most
of the interconnect between the cells, with the following advantages:
● Regular layout produces predictable and equal delay for each bit.

● Interconnect between cells can be built into each cell.

There are some disadvantages of using a datapath:


● The overhead (buffering and routing the control signals, for example) can make a
narrow (small number of bits) datapath larger and slower than a standard-cell (or
even gate-array) implementation.
● Datapath cells have to be predesigned (otherwise we are using full-custom
design) for use in a wide range of datapath sizes. Datapath cell design can be
harder than designing gate-array macros or standard cells.
● Software to assemble a datapath is more complex and not as widely used as
software for assembling standard cells or gate arrays.
There are some newer standard-cell and gate-array tools that can take advantage of
regularity in a design and position cells carefully. The problem is in finding the
regularity if it is not specified. Using a datapath is one way to specify regularity to
ASIC design tools.
2.6.1 Datapath Elements
Figure 2.21 shows some typical datapath symbols for an adder (people rarely use the
IEEE standards in ASIC datapath libraries). I use heavy lines (they are 1.5 point wide)
with a stroke to denote a data bus (that flows in the horizontal direction in a datapath),
and regular lines (0.5 point) to denote the control signals (that flow vertically in a
datapath). At the risk of adding confusion where there is none, this stroke to indicate a
data bus has nothing to do with mixed-logic conventions. For a bus, A[31:0] denotes a
32-bit bus with A[31] as the leftmost or most-significant bit or MSB, and A[0] as the
least-significant bit or LSB. Sometimes we shall use A[MSB] or A[LSB] to refer to
these bits. Notice that if we have an n-bit bus and LSB = 0, then MSB = n − 1. Also, for
example, A[4] is the fifth bit on the bus (counting from the LSB). We use a 'Σ' or 'ADD'
inside the symbol to denote an adder instead of '+', so we can attach '−' or '+/−' to the
inputs for a subtracter or adder/subtracter.

FIGURE 2.21 Symbols for a datapath adder. (a) A data bus is shown by a heavy line
(1.5 point) and a bus symbol. If the bus is n bits wide then MSB = n − 1. (b) An
alternative symbol for an adder. (c) Control signals are shown as lightweight (0.5
point) lines.

Some schematic datapath symbols include only data signals and omit the control
signals, but we must not forget them. In Figure 2.21, for example, we may need to
explicitly tie CIN[0] to VSS and use COUT[MSB] and COUT[MSB − 1] to detect
overflow. Why might we need both of these control signals? Table 2.11 shows the
process of simple arithmetic for the different binary number representations, including
unsigned, signed magnitude, ones complement, and twos complement.
TABLE 2.11 Binary arithmetic.
Representation:
  Unsigned: no change.
  Signed magnitude: if positive then MSB = 0, else MSB = 1.
  Ones complement: if negative then flip bits.
  Twos complement: if negative then {flip bits; add 1}.

3 = : unsigned 0011; signed magnitude 0011; ones complement 0011; twos complement 0011.
−3 = : unsigned NA; signed magnitude 1011; ones complement 1100; twos complement 1101.
zero = : unsigned 0000; signed magnitude 0000 or 1000; ones complement 1111 or 0000; twos complement 0000.
max. positive = : unsigned 1111 = 15; signed magnitude 0111 = 7; ones complement 0111 = 7; twos complement 0111 = 7.
max. negative = : unsigned 0000 = 0; signed magnitude 1111 = −7; ones complement 1000 = −7; twos complement 1000 = −8.

addition, S = A + B = addend + augend (SG(A) = sign of A):
  Unsigned: S = A + B.
  Signed magnitude: if SG(A) = SG(B) then S = A + B, else {if B < A then S = A − B, else S = B − A}.
  Ones complement: S = A + B + COUT[MSB], where COUT is the carry out.
  Twos complement: S = A + B.

addition result (OV = overflow, OR = out of range):
  Unsigned: OR = COUT[MSB], where COUT is the carry out.
  Signed magnitude: if SG(A) = SG(B) then OV = COUT[MSB], else OV = 0 (overflow is impossible).
  Ones complement: OV = XOR(COUT[MSB], COUT[MSB − 1]).
  Twos complement: OV = XOR(COUT[MSB], COUT[MSB − 1]).

sign of the result, S = A + B:
  Unsigned: NA.
  Signed magnitude: if SG(A) = SG(B) then SG(S) = SG(A), else {if B < A then SG(S) = SG(A), else SG(S) = SG(B)}.
  Ones complement: NA.
  Twos complement: NA.

subtraction, D = A − B = minuend − subtrahend:
  Unsigned: D = A − B.
  Signed magnitude: SG(B) = NOT(SG(B)); D = A + B (add as above).
  Ones complement: Z = −B (negate); D = A + Z.
  Twos complement: Z = −B (negate); D = A + Z.

subtraction result (OV = overflow, OR = out of range):
  Unsigned: OR = BOUT[MSB], where BOUT is the borrow out.
  Signed magnitude, ones complement, twos complement: as in addition.

negation, Z = −A (negate):
  Unsigned: NA.
  Signed magnitude: Z = A; SG(Z) = NOT(SG(A)).
  Ones complement: Z = NOT(A).
  Twos complement: Z = NOT(A) + 1.
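
The twos complement overflow rule from the table, OV = XOR(COUT[MSB], COUT[MSB − 1]), can be demonstrated with a few lines of Python (a sketch; bit vectors are LSB first):

def overflow_twos_complement(a_bits, b_bits):
    """Add two LSB-first vectors, return XOR of the top two carries."""
    carry, carries = 0, []
    for a, b in zip(a_bits, b_bits):
        carry = (a & b) | (a & carry) | (b & carry)
        carries.append(carry)
    return carries[-1] ^ carries[-2]

# 7 + 5 = 12 overflows the 4-bit twos complement range (-8 to 7):
print(overflow_twos_complement([1, 1, 1, 0], [1, 0, 1, 0]))  # 1
# 3 + (-3) does not overflow (-3 is 1101, LSB first [1, 0, 1, 1]):
print(overflow_twos_complement([1, 1, 0, 0], [1, 0, 1, 1]))  # 0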

2.6.2 Adders
We can view addition in terms of generate , G[ i ], and propagate , P[ i ], signals.
method 1                              method 2
G[i] = A[i] · B[i]                    G[i] = A[i] · B[i]                    (2.42)
P[i] = A[i] ⊕ B[i]                    P[i] = A[i] + B[i]                    (2.43)
C[i] = G[i] + P[i] · C[i − 1]         C[i] = G[i] + P[i] · C[i − 1]         (2.44)
S[i] = P[i] ⊕ C[i − 1]                S[i] = A[i] ⊕ B[i] ⊕ C[i − 1]         (2.45)

where C[i] is the carry-out signal from stage i, equal to the carry in of stage (i + 1).
Thus, C[i] = COUT[i] = CIN[i + 1]. We need to be careful because C[0] might
represent either the carry in or the carry out of the LSB stage. For an adder we set the
carry in to the first stage (stage zero), C[−1] or CIN[0], to '0'. Some people use delete
(D) or kill (K) in various ways for the complements of G[i] and P[i], but unfortunately
others use C for COUT and D for CIN, so I avoid using any of these. Do not confuse
the two different methods (both of which are used) in Eqs. 2.42–2.45 when forming the
sum, since the propagate signal, P[i], is different for each method.
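
A quick exhaustive check (Python, our own test) confirms that the two methods agree on both the carry and the sum for a single stage:

from itertools import product

for a, b, c_prev in product([0, 1], repeat=3):
    g = a & b
    p1, p2 = a ^ b, a | b            # method 1 and method 2 propagate
    assert g | (p1 & c_prev) == g | (p2 & c_prev)   # same C[i] (Eq. 2.44)
    assert (p1 ^ c_prev) == (a ^ b ^ c_prev)        # same S[i] (Eq. 2.45)
print('both propagate conventions give the same carry and sum')
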
Figure 2.22(a) shows a conventional RCA. The delay of an n-bit RCA is proportional
to n and is limited by the propagation of the carry signal through all of the stages. We
can reduce delay by using pairs of go-faster bubbles to change AND and OR gates to
fast two-input NAND gates as shown in Figure 2.22(a). Alternatively, we can write the
equations for the carry signal in two different ways:

either C[i] = A[i] · B[i] + P[i] · C[i − 1] (2.46)
or     C[i] = (A[i] + B[i]) · (P[i]' + C[i − 1]) , (2.47)

where P[i]' = NOT(P[i]). Equations 2.46 and 2.47 allow us to build the carry chain
from two-input NAND gates, one per cell, using different logic in even and odd stages
(Figure 2.22b):

even stages                               odd stages
C1[i]' = P[i] · C3[i − 1] · C4[i − 1]     C3[i]' = P[i] · C1[i − 1] · C2[i − 1]   (2.48)
C2[i] = A[i] + B[i]                       C4[i]' = A[i] · B[i]                    (2.49)
C[i] = C1[i] · C2[i]                      C[i] = C3[i]' + C4[i]'                  (2.50)

(the carry inputs to stage zero are C3[−1] = C4[−1] = '0'). We can use the RCA of
Figure 2.22(b) in a datapath, with standard cells, or on a gate array.
Instead of propagating the carries through each stage of an RCA, Figure 2.23 shows a
different approach. A carry-save adder (CSA) cell CSA(A1[i], A2[i], A3[i], CIN,
S1[i], S2[i], COUT) has three outputs:

S1[i] = CIN , (2.51)

S2[i] = A1[i] ⊕ A2[i] ⊕ A3[i] = PARITY(A1[i], A2[i], A3[i]) , (2.52)

COUT = A1[i] · A2[i] + [(A1[i] + A2[i]) · A3[i]] = MAJ(A1[i], A2[i], A3[i]) . (2.53)

The inputs A1, A2, and A3 and the outputs S1 and S2 are buses. The input, CIN, is the
carry from stage (i − 1). The carry in, CIN, is connected directly to the output bus S1,
as indicated by the schematic symbol (Figure 2.23a). We connect CIN[0] to VSS. The
output, COUT, is the carry out to stage (i + 1).
A 4-bit CSA is shown in Figure 2.23(b). The arithmetic overflow signal for ones
complement or twos complement arithmetic, OV, is XOR(COUT[MSB],
COUT[MSB − 1]), as shown in Figure 2.23(c). In a CSA the carries are saved at each
stage and shifted left onto the bus S1. There is thus no carry propagation and the delay
of a CSA is constant. At the output of a CSA we still need to add the S1 bus (all the
saved carries) and the S2 bus (all the sums) to get an n-bit result using a final stage that
is not shown in Figure 2.23(c). We might regard the n-bit sum as being encoded in the
two buses, S1 and S2, in the form of the parity and majority functions.
We can use a CSA to add multiple inputs. As an example, an adder with four 4-bit
inputs is shown in Figure 2.23(d). The last stage sums two input buses using a
carry-propagate adder (CPA). We have used an RCA as the CPA in Figure 2.23(d)
and (e), but we can use any type of adder. Notice in Figure 2.23(e) how the two CSA
cells and the RCA cell abut together horizontally to form a bit slice (or slice) and then
the slices are stacked vertically to form the datapath.
FIGURE 2.23 The carry-save adder (CSA). (a) A CSA cell. (b) A 4-bit CSA.
(c) Symbol for a CSA. (d) A four-input CSA. (e) The datapath for a four-input, 4-bit
adder using CSAs with a ripple-carry adder (RCA) as the final stage. (f) A pipelined
adder. (g) The datapath for the pipelined version showing the pipeline registers as well
as the clock control lines that use m2.

We can register the CSA stages by adding vectors of flip-flops as shown in
Figure 2.23(f). This reduces the adder delay to that of the slowest adder stage, usually
the CPA. By using registers between stages of combinational logic we use pipelining to
increase the speed, pay a price of increased area (for the registers), and introduce
latency. It takes a few clock cycles (the latency, equal to n clock cycles for an n-stage
pipeline) to fill the pipeline, but once it is filled, the answers emerge every clock cycle.
Ferris wheels work much the same way. When the fair opens it takes a while (latency)
to fill the wheel, but once it is full the people can get on and off every few seconds.
(We can also pipeline the RCA of Figure 2.20. We add i registers on the A and B
inputs before ADD[i] and add (n − i) registers after the output S[i], with a single
register before each C[i].)
The problem with an RCA is that every stage has to wait to make its carry decision,
C[i], until the previous stage has calculated C[i − 1]. If we examine the propagate
signals we can bypass this critical path. Thus, for example, to bypass the carries for
bits 4–7 (stages 5–8) of an adder we can compute BYPASS = P[4] · P[5] · P[6] · P[7]
and then use a MUX as follows:

C[7] = (G[7] + P[7] · C[6]) · BYPASS' + C[3] · BYPASS . (2.54)

Adders based on this principle are called carry-bypass adders (CBA) [Sato et al.,
1992]. Large, custom adders employ Manchester-carry chains to compute the carries
and the bypass operation using TGs or just pass transistors [Weste and Eshraghian,
1993, pp. 530–531]. These types of carry chains may be part of a predesigned ASIC
adder cell, but are not used by ASIC designers.
Instead of checking the propagate signals we can check the inputs. For example, we can
compute SKIP = (A[i − 1] ⊕ B[i − 1]) + (A[i] ⊕ B[i]) and then use a 2:1 MUX to
select C[i]. Thus,

CSKIP[i] = (G[i] + P[i] · C[i − 1]) · SKIP' + C[i − 2] · SKIP . (2.55)

This is a carry-skip adder [Keutzer, Malik, and Saldanha, 1991; Lehman, 1961].
Carry-bypass and carry-skip adders may include redundant logic (since the carry is
computed in two different ways; we just take the first signal to arrive). We must be
careful that the redundant logic is not optimized away during logic synthesis.
If we evaluate Eq. 2.44 recursively for i = 1, we get the following:

C[1] = G[1] + P[1] · C[0]
     = G[1] + P[1] · (G[0] + P[0] · C[−1])
     = G[1] + P[1] · G[0] , (2.56)

since C[−1] = '0'. This result means that we can look ahead by two stages and
calculate the carry into the third stage (bit 2), which is C[1], using only the first-stage
inputs (to calculate G[0]) and the second-stage inputs. This is a carry-lookahead adder
(CLA) [MacSorley, 1961]. If we continue expanding Eq. 2.44, we find:

C[2] = G[2] + P[2] · G[1] + P[2] · P[1] · G[0] ,

C[3] = G[3] + P[3] · G[2] + P[3] · P[2] · G[1] + P[3] · P[2] · P[1] · G[0] . (2.57)
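
These expansions can be checked against the serial recurrence. A Python sketch that compares the lookahead carries of Eqs. 2.56–2.57 with ripple evaluation over all 4-bit inputs:

from itertools import product

for bits in product([0, 1], repeat=8):
    a, b = bits[:4], bits[4:]
    g = [x & y for x, y in zip(a, b)]
    p = [x | y for x, y in zip(a, b)]
    c, ripple = 0, []
    for gi, pi in zip(g, p):               # C[i] = G[i] + P[i] . C[i-1]
        c = gi | (pi & c)
        ripple.append(c)
    c1 = g[1] | (p[1] & g[0])
    c2 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
    c3 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]))
    assert (c1, c2, c3) == tuple(ripple[1:])
print('lookahead carries match the ripple recurrence')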

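Written out in Verilog, the lookahead terms of Eqs. 2.56 and 2.57 are two-level logic (a sketch; the carry-in is assumed to be folded into g[0]):

module cla4 (g, p, c); // 4-bit carry-lookahead generation (Eqs. 2.56-2.57)
input [3:0] g, p; output [3:0] c;
assign c[0] = g[0];
assign c[1] = g[1] | (p[1] & g[0]);
assign c[2] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]);
assign c[3] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                   | (p[3] & p[2] & p[1] & g[0]);
endmodule

The widening AND-OR terms are exactly the growing gates the next sentence warns about.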
As we look ahead further these equations become more complex, take longer to
calculate, and the logic becomes less regular when implemented using cells with a
limited number of inputs. Datapath layout must fit in a bit slice, so the physical and
logical structure of each bit must be similar. In a standard cell or gate array we are not
so concerned about a regular physical structure, but a regular logical structure
simplifies design. The Brent-Kung adder reduces the delay and increases the regularity
of the carry-lookahead scheme [Brent and Kung, 1982]. Figure 2.24(a) shows a regular
4-bit CLA, using the carry-lookahead generator cell (CLG) shown in Figure 2.24(b).
FIGURE 2.24 The Brent-Kung carry-lookahead adder (CLA). (a) Carry generation in a
4-bit CLA. (b) A cell to generate the lookahead terms, C[0]–C[3]. (c) Cells L1, L2, and
L3 are rearranged into a tree that has less delay. Cell L4 is added to calculate C[2],
which is lost in the translation. (d) and (e) Simplified representations of parts a and c. (f) The
lookahead logic for an 8-bit adder. The inputs, 0–7, are the propagate and carry terms
formed from the inputs to the adder. (g) An 8-bit Brent-Kung CLA. The outputs of the
lookahead logic are the carry bits that (together with the inputs) form the sum. One
advantage of this adder is that delays from the inputs to the outputs are more nearly
equal than in other adders. This tends to reduce the number of unwanted and
unnecessary switching events and thus reduces power dissipation.

In a carry-select adder we duplicate two small adders (usually 4-bit or 8-bit adders,
often CLAs) for the cases CIN = '0' and CIN = '1' and then use a MUX to select the
case that we need: wasteful, but fast [Bedrij, 1962]. A carry-select adder is often used as
the fast adder in a datapath library because its layout is regular.
We can use the carry-select, carry-bypass, and carry-skip architectures to split a 12-bit
adder, for example, into three blocks. The delay of the adder is then partly dependent
on the delays of the MUXes between the blocks. Suppose the delay due to 1 bit in an
adder block (we shall call this a bit delay) is approximately equal to the MUX delay. In
this case it may be faster to make the blocks 3, 4, and 5 bits long instead of equal in
size. Now the delays into the final MUX are equal: 3 bit delays plus 2 MUX delays for
the carry signal from bits 0–6, and 5 bit delays for the carry from bits 7–11. Adjusting
the block size reduces the delay of large adders (more than 16 bits).
We can extend the idea behind a carry-select adder as follows. Suppose we have an n
-bit adder that generates two sums: one sum assumes a carry-in condition of '0', the
other sum assumes a carry-in condition of '1'. We can split this n-bit adder into an i-bit
adder for the i LSBs and an (n − i)-bit adder for the (n − i) MSBs. Both of the smaller
adders generate two conditional sums as well as true and complement carry signals.
The two (true and complement) carry signals from the LSB adder are used to select
between the two (n − i + 1)-bit conditional sums from the MSB adder using 2(n − i + 1)
two-input MUXes. This is a conditional-sum adder (also often abbreviated to CSA)
[Sklansky, 1960]. We can recursively apply this technique. For example, we can split a
16-bit adder using i = 8 and n = 16; then we can split one or both 8-bit adders again,
and so on.
Figure 2.25 shows the simplest form of an n -bit conditional-sum adder that uses n
single-bit conditional adders, H (each with four outputs: two conditional sums, true
carry, and complement carry), together with a tree of 2:1 MUXes (Qi_j). The
conditional-sum adder is usually the fastest of all the adders we have discussed (it is the
fastest when logic cell delay increases with the number of inputs; this is true for all
ASICs except FPGAs).
FIGURE 2.25 The conditional-sum adder. (a) A 1-bit conditional adder that calculates
the sum and carry out assuming the carry in is either '1' or '0'. (b) The multiplexer that
selects between sums and carries. (c) A 4-bit conditional-sum adder with carry input,
C[0].

2.6.3 A Simple Example


How do we make and use datapath elements? What does a design look like? We may
use predesigned cells from a library or build the elements ourselves from logic cells
using a schematic or a design language. Table 2.12 shows an 8-bit conditional-sum
adder intended for an FPGA. This Verilog implementation uses the same structure as
Figure 2.25, but the equations are collapsed to use four or five variables. A basic logic
cell in certain Xilinx FPGAs, for example, can implement two equations of the same
four variables or one equation with five variables. The equations shown in Table 2.12
require three levels of FPGA logic cells (so, for example, if each FPGA logic cell has
a 5 ns delay, the 8-bit conditional-sum adder delay is 15 ns).
TABLE 2.12 An 8-bit conditional-sum adder (the notation is described in Figure 2.25).
module m8bitCSum (C0, a, b, s, C8); // Verilog conditional-sum adder for an FPGA
input C0; input [7:0] a, b; output [7:0] s; output C8;
wire
A7,A6,A5,A4,A3,A2,A1,A0,B7,B6,B5,B4,B3,B2,B1,B0,S8,S7,S6,S5,S4,S3,S2,S1,S0;
wire C0, C2, C4_2_0, C4_2_1, S5_4_0, S5_4_1, C6, C6_4_0, C6_4_1, C8;
assign {A7,A6,A5,A4,A3,A2,A1,A0} = a; assign {B7,B6,B5,B4,B3,B2,B1,B0} = b;
assign s = { S7,S6,S5,S4,S3,S2,S1,S0 };
assign S0 = A0^B0^C0 ; // start of level 1: & = AND, ^ = XOR, | = OR, ! = NOT
assign S1 = A1^B1^(A0&B0|(A0|B0)&C0) ;
assign C2 = A1&B1|(A1|B1)&(A0&B0|(A0|B0)&C0) ;
assign C4_2_0 = A3&B3|(A3|B3)&(A2&B2) ; assign C4_2_1 =
A3&B3|(A3|B3)&(A2|B2) ;
assign S5_4_0 = A5^B5^(A4&B4) ; assign S5_4_1 = A5^B5^(A4|B4) ;
assign C6_4_0 = A5&B5|(A5|B5)&(A4&B4) ; assign C6_4_1 =
A5&B5|(A5|B5)&(A4|B4) ;
assign S2 = A2^B2^C2 ; // start of level 2
assign S3 = A3^B3^(A2&B2|(A2|B2)&C2) ;
assign S4 = A4^B4^(C4_2_0|C4_2_1&C2) ;
assign S5 = S5_4_0& !(C4_2_0|C4_2_1&C2)|S5_4_1&(C4_2_0|C4_2_1&C2) ;
assign C6 = C6_4_0|C6_4_1&(C4_2_0|C4_2_1&C2) ;
assign S6 = A6^B6^C6 ; // start of level 3
assign S7 = A7^B7^(A6&B6|(A6|B6)&C6) ;
assign C8 = A7&B7|(A7|B7)&(A6&B6|(A6|B6)&C6) ;
endmodule

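A few lines of test code (my addition, not part of the book's table) are enough to exercise the module:

module test_m8bitCSum; // minimal simulation harness
reg C0; reg [7:0] a, b; wire [7:0] s; wire C8;
m8bitCSum u1 (C0, a, b, s, C8);
initial begin
  C0 = 0; a = 8'd100; b = 8'd27;
  #10 $display("%d + %d = %d (carry %b)", a, b, s, C8); // expect 127 (carry 0)
  #10 $stop;
end
endmodule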
Figure 2.26 shows the normalized delay and area figures for a set of predesigned
datapath adders. The data in Figure 2.26 is from a series of ASIC datapath cell libraries
(Compass Passport) that may be synthesized together with test vectors and simulation
models. We can combine the different adder techniques, but the adders then lose
regularity and become less suited to a datapath implementation.
FIGURE 2.26 Datapath adders. This data is from a series of submicron datapath
libraries. (a) Delay normalized to a two-input NAND logic cell delay (approximately
equal to 250 ps in a 0.5 μm process). For example, a 64-bit ripple-carry adder (RCA)
has a delay of approximately 30 ns in a 0.5 μm process. The spread in delay is due to
variation in delays between different inputs and outputs. An n-bit RCA has a delay
proportional to n. The delay of an n-bit carry-select adder is approximately
proportional to log₂ n. The carry-save adder delay is constant (but requires a
carry-propagate adder to complete an addition). (b) In a datapath library the area of all
adders is proportional to the bit size.

There are other adders that are not used in datapaths, but are occasionally useful in
ASIC design. A serial adder is smaller but slower than the parallel adders we have
described [Denyer and Renshaw, 1985]. The carry-completion adder is a variable-delay
adder and is rarely used in synchronous designs [Sklansky, 1960].

2.6.4 Multipliers
Figure 2.27 shows a symmetric 6-bit array multiplier (an n-bit multiplier multiplies
two n-bit numbers; we shall say n-bit by m-bit multiplier if the lengths are different).
Adders a0–f0 may be eliminated, which then eliminates adders a1–a6, leaving an
asymmetric CSA array of 30 (5 × 6) adders (including one half adder). An n-bit array
multiplier has a delay proportional to n plus the delay of the CPA (adders b6–f6 in
Figure 2.27). There are two items we can attack to improve the performance of a
multiplier: the number of partial products and the addition of the partial products.
FIGURE 2.27 Multiplication. A 6-bit array multiplier using a final carry-propagate
adder (full-adder cells a6–f6, a ripple-carry adder). Apart from the generation of the
summands this multiplier uses the same structure as the carry-save adder of
Figure 2.23(d).

Suppose we wish to multiply 15 (the multiplicand) by 19 (the multiplier) mentally. It
is easier to calculate 15 × 20 and subtract 15. In effect we complete the multiplication
as 15 × (20 − 1) and we could write this as 15 × 21̄, with the overbar representing a
minus sign. Now suppose we wish to multiply an 8-bit binary number, A, by B =
00010111 (decimal 16 + 4 + 2 + 1 = 23). It is easier to multiply A by the canonical
signed-digit vector ( CSD vector ) D = 00101̄001̄ (decimal 32 − 8 − 1 = 23) since this
requires only three add or subtract operations (and a subtraction is as easy as an
addition). We say B has a weight of 4 and D has a weight of 3. By using D instead of B
we have reduced the number of partial products by 1 (= 4 − 3).
We can recode (or encode) any binary number, B, as a CSD vector, D, as follows
(canonical means there is only one CSD vector for any number):
D[i] = B[i] + C[i] − 2C[i + 1] , (2.58)
where C[i + 1] is the carry from the sum B[i + 1] + B[i] + C[i] (we start with C[0] = 0).

As another example, if B = 011 (B[2] = 0, B[1] = 1, B[0] = 1; decimal 3), then, using
Eq. 2.58,
D[0] = B[0] + C[0] − 2C[1] = 1 + 0 − 2 = −1 ,
D[1] = B[1] + C[1] − 2C[2] = 1 + 1 − 2 = 0 ,
D[2] = B[2] + C[2] − 2C[3] = 0 + 1 − 0 = 1 , (2.59)
so that D = 101̄ (decimal 4 − 1 = 3). CSD vectors are useful to represent fixed
coefficients in digital filters, for example.
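
Equation 2.58 is easy to turn into hardware or software. The following Verilog sketch (my own formulation; the digit vector is split into dp and dm because Verilog has no signed-digit type) recodes an unsigned 8-bit input, with dp[i] = 1 for a digit of +1 and dm[i] = 1 for a digit of −1:

module csd8 (b, dp, dm); // CSD recoding of Eq. 2.58
input [7:0] b; output [8:0] dp, dm;
wire [9:0] bx = {2'b00, b}; // zero-padded input
wire [9:0] c;
assign c[0] = 1'b0; // we start with C[0] = 0
genvar i;
generate for (i = 0; i < 9; i = i + 1) begin : recode
  wire [1:0] t;
  assign t = bx[i] + c[i]; // 0, 1, or 2
  assign c[i+1] = (bx[i+1] + t) >= 2; // the carry rule of Eq. 2.58
  assign dp[i] = (t == 2'd1) & ~c[i+1]; // digit D[i] = +1
  assign dm[i] = (t == 2'd1) & c[i+1];  // digit D[i] = -1
end endgenerate
endmodule

For b = 3 ('b011) this produces dp = 9'b000000100 and dm = 9'b000000001, that is, D = 101̄, matching Eq. 2.59.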
We can recode using a radix other than 2. Suppose B is an (n + 1)-digit two's
complement number,
B = B[0] + B[1]·2 + B[2]·2² + . . . + B[i]·2^i + . . . + B[n−1]·2^(n−1) − B[n]·2^n . (2.60)
We can rewrite the expression for B using the following sleight-of-hand:
B = 2B − B
= −B[0] + (B[0] − B[1])·2 + . . . + (B[i−1] − B[i])·2^i + . . . + (B[n−1] − B[n])·2^n
= (−2B[1] + B[0])·2⁰ + (−2B[3] + B[2] + B[1])·2² + . . .
+ (−2B[i] + B[i−1] + B[i−2])·2^(i−1) + (−2B[i+2] + B[i+1] + B[i])·2^(i+1) + . . .
+ (−2B[n] + B[n−1] + B[n−2])·2^(n−1) . (2.61)

This is very useful. Consider B = 101001 (decimal 9 − 32 = −23, n = 5),
B = 101001
= (−2B[1] + B[0])·2⁰ + (−2B[3] + B[2] + B[1])·2² + (−2B[5] + B[4] + B[3])·2⁴
= ((−2 × 0) + 1)·2⁰ + ((−2 × 1) + 0 + 0)·2² + ((−2 × 1) + 0 + 1)·2⁴ . (2.62)

Equation 2.61 tells us how to encode B as a radix-4 signed-digit vector, E = 1̄2̄1
(decimal −16 − 8 + 1 = −23). To multiply by B encoded as E we only have to perform a
multiplication by 2 (a shift) and three add/subtract operations.
Using Eq. 2.61 we can encode any number by taking groups of three bits at a time and
calculating
E[j] = −2B[i] + B[i−1] + B[i−2] ,
E[j+1] = −2B[i+2] + B[i+1] + B[i] , . . . , (2.63)
where each 3-bit group overlaps by one bit. We pad B with a zero, B[n] . . . B[1] B[0] 0, to
match the first term in Eq. 2.61. If B has an odd number of bits, then we extend the
sign: B[n] B[n] . . . B[1] B[0] 0. For example, B = 01011 (eleven) encodes to E = 11̄1̄ (16
− 4 − 1); and B = 101 encodes to E = 1̄1 (−4 + 1). This is called Booth encoding; it
reduces the number of partial products by a factor of two and thus considerably
reduces the area as well as increasing the speed of our multiplier [Booth, 1951].
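
In a multiplier the Booth-encoded digits of Eq. 2.63 select the partial products. Here is a sketch of one radix-4 digit slice (names are mine; a real datapath cell typically folds the negation into the carry-save array rather than computing it separately):

module booth_digit (grp, a, pp); // one radix-4 Booth partial product
parameter W = 8;
input [2:0] grp; // {B[i+1], B[i], B[i-1]}, groups overlap by one bit
input signed [W-1:0] a; // the multiplicand
output reg signed [W+1:0] pp; // -2A..+2A needs two extra bits
always @(*) case (grp)
  3'b000, 3'b111: pp = 0;          // E = 0
  3'b001, 3'b010: pp = a;          // E = +1
  3'b011:         pp = a <<< 1;    // E = +2
  3'b100:         pp = -(a <<< 1); // E = -2
  3'b101, 3'b110: pp = -a;         // E = -1
endcase
endmodule

Each slice replaces two rows of the partial-product array, which is where the factor-of-two saving comes from.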
Next we turn our attention to improving the speed of addition in the CSA array.
Figure 2.28(a) shows a section of the 6-bit array multiplier from Figure 2.27. We can
collapse the chain of adders a0–f5 (5 adder delays) to the Wallace tree consisting of
adders 5.1–5.4 (4 adder delays) shown in Figure 2.28(b).

FIGURE 2.28 Tree-based multiplication. (a) The portion of Figure 2.27 that calculates
the sum bit, P5, using a chain of adders (cells a0–f5). (b) We can collapse this chain to
a Wallace tree (cells 5.1–5.5). (c) The stages of multiplication.

Figure 2.28(c) pictorially represents multiplication as a sort of golf course. Each link
corresponds to an adder. The holes or dots are the outputs of one stage (and the inputs
of the next). At each stage we have the following three choices: (1) sum three outputs
using a full adder (denoted by a box enclosing three dots); (2) sum two outputs using a
half adder (a box with two dots); (3) pass the outputs directly to the next stage. The two
outputs of an adder are joined by a diagonal line (full adders use black dots, half adders
white dots). The object of the game is to choose (1), (2), or (3) at each stage to
maximize the performance of the multiplier. In tree-based multipliers there are two
ways to do this: working forward and working backward.
In a Wallace-tree multiplier we work forward from the multiplier inputs, compressing
the number of signals to be added at each stage [Wallace, 1960]. We can view an FA as
a 3:2 compressor or (3, 2) counter: it counts the number of '1's on its inputs. Thus, for
example, an input of '101' (two '1's) results in an output '10' (2). A half adder is a (2, 2)
counter. To form P5 in Figure 2.29 we must add 6 summands (S05, S14, S23, S32,
S41, and S50) and 4 carries from the P4 column. We add these in stages 1–7,
compressing from 6:3:2:2:3:1:1. Notice that we wait until stage 5 to add the last carry
from column P4, and this means we expand (rather than compress) the number of
signals (from 2 to 3) between stages 3 and 5. The maximum delay through the CSA
array of Figure 2.29 is 6 adder delays. To this we must add the delay of the 4-bit (9
inputs) CPA (stage 7). There are 26 adders (6 half adders) plus the 4 adders in the CPA.

FIGURE 2.29 A 6-bit Wallace-tree multiplier. The carry-save adder (CSA) requires 26
adders (cells 1–26, six are half adders). The final carry-propagate adder (CPA) consists
of 4 adder cells (27–30). The delay of the CSA is 6 adders. The delay of the CPA is 4
adders.

In a Dadda multiplier (Figure 2.30) we work backward from the final product [Dadda,
1965]. Each stage has a maximum of 2, 3, 4, 6, 9, 13, 19, . . . outputs (each successive
stage is 3/2 times larger, rounded down to an integer). Thus, for example, in
Figure 2.28(d) we require 3 stages (with 3 adder delays, plus the delay of a 10-bit output
CPA) for a 6-bit Dadda multiplier. There are 19 adders (4 half adders) in the CSA plus
the 10 adders (2 half adders) in the CPA. A Dadda multiplier is usually faster and
smaller than a Wallace-tree multiplier.
FIGURE 2.30 The 6-bit Dadda multiplier. The carry-save adder (CSA) requires 20
adders (cells 1–20, four are half adders). The carry-propagate adder (CPA, cells 21–30)
is a ripple-carry adder (RCA). The CSA is smaller (20 versus 26 adders), faster (3
adder delays versus 6 adder delays), and more regular than the Wallace-tree CSA of
Figure 2.29. The overall speed of this implementation is approximately the same as the
Wallace-tree multiplier of Figure 2.29; however, the speed may be increased by
substituting a faster CPA.

In general, the number of stages (and thus the delay, in units of an FA delay, excluding
the CPA) for an n-bit tree-based multiplier using (3, 2) counters is
log₁.₅ n = log₁₀ n / log₁₀ 1.5 ≈ log₁₀ n / 0.176 . (2.64)

Figure 2.31(a) shows how the partial-product array is constructed in a conventional
4-bit multiplier. The Ferrari-Stefanelli multiplier (Figure 2.31b) nests multipliers; the
2-bit submultipliers reduce the number of partial products [Ferrari and Stefanelli,
1969].

FIGURE 2.31 Ferrari-Stefanelli multiplier. (a) A conventional 4-bit array multiplier
using AND gates to calculate the summands, with (2, 2) and (3, 2) counters to sum the
partial products. (b) A 4-bit Ferrari-Stefanelli multiplier using 2-bit submultipliers to
construct the partial-product array. (c) A circuit implementation for an inverting 2-bit
submultiplier.
There are several issues in deciding between parallel multiplier architectures:
1. Since it is easier to fold triangles rather than trapezoids into squares, a
Wallace-tree multiplier is more suited to full-custom layout, but is slightly larger,
than a Dadda multiplier; both are less regular than an array multiplier. For
cell-based ASICs, a Dadda multiplier is smaller than a Wallace-tree multiplier.
2. The overall multiplier speed does depend on the size and architecture of the final
CPA, but this may be optimized independently of the CSA array. This means a
Dadda multiplier is always at least as fast as the Wallace-tree version.
3. The low-order bits of any parallel multiplier settle first and can be added in the
CPA before the remaining bits settle. This allows multiplication and the final
addition to be overlapped in time.
4. Any of the parallel multiplier architectures may be pipelined. We may also use a
variably pipelined approach that tailors the register locations to the size of the
multiplier.
5. Using (4, 2), (5, 3), (7, 3), or (15, 4) counters increases the stage compression
and permits the size of the stages to be tuned. Some ASIC cell libraries contain a
(7, 3) counter (a 2-bit full adder); a (15, 4) counter is a 3-bit full adder. There is a
trade-off in using these counters between the speed and size of the logic cells and
the delay and area of the interconnect.
6. Power dissipation is reduced by the tree-based structures. The simplified
carry-save logic produces fewer signal transitions and the tree structures produce
fewer glitches than a chain.
7. None of the multiplier structures we have discussed take into account the
possibility of staggered arrival times for different bits of the multiplicand or the
multiplier. Optimization then requires a logic-synthesis tool.

2.6.5 Other Arithmetic Systems


There are other schemes for addition and multiplication that are useful in special
circumstances. Addition of numbers using redundant binary encoding avoids carry
propagation and is thus potentially very fast. Table 2.13 shows the rules for addition
using an intermediate carry and sum that are added without the need for carry. For
example,
    binary    decimal   redundant binary   CSD vector
   1010111      87          101̄01̄001̄         101̄01̄001̄     addend
 + 1100101    + 101        + 11̄100111̄       + 01100101     augend
= 10111100    = 188          01001̄1̄10         1̄1̄001̄100     intermediate sum
                            11̄000101̄0         110000000    intermediate carry
                          = 11̄10001̄00      = 101̄001̄100     sum
TABLE 2.13 Redundant binary addition.
A[i] B[i]   A[i−1] B[i−1]                Intermediate sum   Intermediate carry
 1    1     x x                                 0                  1
 1    0     A[i−1]=0/1̄ and B[i−1]=0/1̄           1                  0
 0    1     A[i−1]=1 or B[i−1]=1                1̄                  1
 1    1̄     x x                                 0                  0
 1̄    1     x x                                 0                  0
 0    0     x x                                 0                  0
 0    1̄     A[i−1]=0/1 and B[i−1]=0/1           1̄                  0
 1̄    0     A[i−1]=1̄ or B[i−1]=1̄                1                  1̄
 1̄    1̄     x x                                 0                  1̄

The redundant binary representation is not unique. We can represent 101 (decimal), for
example, by 1100101 (binary and CSD vector) or 11̄100111̄. As another example, 188
(decimal) can be represented by 10111100 (binary), 11̄10001̄00, 101̄001̄100, or
101̄0001̄00 (CSD vector). Redundant binary addition of binary, redundant binary, or
CSD vectors does not result in a unique sum, and addition of two CSD vectors does not
result in a CSD vector. Each n-bit redundant binary number requires a rather wasteful
2n-bit binary number for storage. Thus 101̄ is represented as 010010, for example
(using sign magnitude). The other disadvantage of redundant binary arithmetic is the
need to convert to and from binary representation.
Table 2.14 shows the (5, 3) residue number system. As an example, 11 (decimal) is
represented as [1, 2] residue (5, 3) since 11 mod 5 = 1 and 11 mod 3 = 2. The size of
this system is thus 3 × 5 = 15. We add, subtract, or multiply residue numbers using the
modulus of each digit position, without any carry. Thus:
   4   [4, 1]       12   [2, 0]        3   [3, 0]
+  7 + [2, 1]     −  4 − [4, 1]      × 4 × [4, 1]
= 11 = [1, 2]     =  8 = [3, 2]     = 12 = [2, 0]
TABLE 2.14 The (5, 3) residue number system.
n  residue 5  residue 3    n  residue 5  residue 3    n   residue 5  residue 3
0      0          0        5      0          2        10      0          1
1      1          1        6      1          0        11      1          2
2      2          2        7      2          1        12      2          0
3      3          0        8      3          2        13      3          1
4      4          1        9      4          0        14      4          2

The choice of moduli determines the system size and the computing complexity. The
most useful choices are relative primes (such as 3 and 5). With p prime, numbers of the
form 2^p and 2^p − 1 are particularly useful (numbers of the form 2^p − 1 are
Mersenne numbers) [Waser and Flynn, 1982].
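
Because the channels are independent, a residue adder is just a set of small modular adders side by side. A sketch for the (5, 3) system (the '%' operator keeps the example short; bases this small would normally be implemented as lookup tables):

module rns53_add (a5, b5, a3, b3, s5, s3); // (5, 3) residue addition
input [2:0] a5, b5; input [1:0] a3, b3;
output [2:0] s5; output [1:0] s3;
assign s5 = (a5 + b5) % 5; // residue-5 channel, no carry to the other channel
assign s3 = (a3 + b3) % 3; // residue-3 channel
endmodule

With a5 = 4, a3 = 1 (the number 4) and b5 = 2, b3 = 1 (the number 7) the outputs are s5 = 1 and s3 = 2, the [1, 2] representation of 11 from the example above.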

2.6.6 Other Datapath Operators


Figure 2.32 shows symbols for some other datapath elements. The combinational
datapath cells, NAND, NOR, and so on, and sequential datapath cells (flip-flops and
latches) have standard-cell equivalents and function identically. I use a bold outline (1
point) for datapath cells instead of the regular (0.5 point) line I use for scalar symbols.
We call a set of identical cells a vector of datapath elements in the same way that a bold
symbol, A , represents a vector and A represents a scalar.
FIGURE 2.32 Symbols for datapath elements. (a) An array or vector of flip-flops (a
register). (b) A two-input NAND cell with databus inputs. (c) A two-input NAND cell
with a control input. (d) A buswide MUX. (e) An incrementer/decrementer. (f) An
all-zeros detector. (g) An all-ones detector. (h) An adder/subtracter.

A subtracter is similar to an adder, except in a full subtracter we have a borrow-in
signal, BIN; a borrow-out signal, BOUT; and a difference signal, DIFF:
DIFF = A ⊕ NOT(B) ⊕ NOT(BIN)
= SUM(A, NOT(B), NOT(BIN)) (2.65)
NOT(BOUT) = A · NOT(B) + A · NOT(BIN) + NOT(B) · NOT(BIN)
= MAJ(A, NOT(B), NOT(BIN)) (2.66)

These equations are the same as those for the FA (Eqs. 2.38 and 2.39) except that the B
input is inverted and the sense of the carry chain is inverted. To build a subtracter that
calculates (A − B) we invert the entire B input bus and connect the BIN[0] input to
VDD (not to VSS as we did for CIN[0] in an adder). As an example, to subtract B =
'0011' from A = '1001' we calculate '1001' + '1100' + '1' = '0110'. As with an adder, the
true overflow is XOR(BOUT[MSB], BOUT[MSB − 1]).
We can build a ripple-borrow subtracter (a type of borrow-propagate subtracter), a
borrow-save subtracter, and a borrow-select subtracter in the same way we built these
adder architectures. An adder/subtracter has a control signal that gates the A input with
an exclusive-OR cell (forming a programmable inversion) to switch between an adder
and a subtracter. Some adder/subtracters gate both inputs to allow us to compute (A − B).
We must be careful to connect the control input to the LSB of the carry chain (CIN[0] or
BIN[0]) when changing between addition (connect to VSS) and subtraction (connect to
VDD).
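
A Verilog sketch of such a cell (this version gates the B bus, one common arrangement, so that sub = 1 computes A − B; the same control drives the LSB carry-in):

module addsub (sub, a, b, z, cout); // programmable adder/subtracter
parameter W = 8;
input sub; input [W-1:0] a, b; output [W-1:0] z; output cout;
wire [W-1:0] bx = b ^ {W{sub}}; // programmable inversion of B
assign {cout, z} = a + bx + sub; // CIN[0] = VSS for add, VDD for subtract
endmodule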
A barrel shifter rotates or shifts an input bus by a specified amount. For example if we
have an eight-input barrel shifter with input '1111 0000' and we specify a shift of
'0000 1000' (3, coded by bit position) the right-shifted 8-bit output is '0001 1110'. A
barrel shifter may rotate left or right (or switch between the two under a separate
control). A barrel shifter may also have an output width that is smaller than the input.
To use a simple example, we may have an 8-bit input and a 4-bit output. This situation
is equivalent to having a barrel shifter with two 4-bit inputs and a 4-bit output. Barrel
shifters are used extensively in floating-point arithmetic to align (we call this normalize
and denormalize ) floating-point numbers (with sign, exponent, and mantissa).
A leading-one detector is used with a normalizing (left-shift) barrel shifter to align
mantissas in floating-point numbers. The input is an n -bit bus A, the output is an n -bit
bus, S, with a single '1' in the bit position corresponding to the most significant '1' in
the input. Thus, for example, if the input is A = '0000 0101' the leading-one detector
output is S = '0000 0100', indicating the leading one in A is in bit position 2 (bit 7 is the
MSB, bit zero is the LSB). If we feed the output, S, of the leading-one detector to the
shift select input of a normalizing (left-shift) barrel shifter, the shifter will normalize
the input A. In our example, with an input of A = '0000 0101', and a left-shift of S =
'0000 0100', the barrel shifter will shift A left by five bits and the output of the shifter is
Z = '1010 0000'. Now that Z is aligned (with the MSB equal to '1') we can multiply Z
with another normalized number.
The output of a priority encoder is the binary-encoded position of the leading one in an
input. For example, with an input A = '0000 0101' the leading 1 is in bit position 2
(MSB is bit position 7) so the output of a 4-bit priority encoder would be Z = '0010' (2).
In some cell libraries the encoding is reversed so that the MSB has an output code of
zero; in this case Z = '0101' (5). This second, reversed, encoding scheme is useful in
floating-point arithmetic. If A is a mantissa and we normalize A to '1010 0000' we have
to subtract 5 from the exponent; this exponent correction is equal to the output of the
priority encoder.
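
Behaviorally, both functions are a single scan for the most significant '1'. A sketch (mine) that produces the one-hot output S and the binary position Z together; Z is only valid when the input is nonzero:

module lead_one8 (a, s, z); // leading-one detector plus priority encoder
input [7:0] a; output reg [7:0] s; output reg [2:0] z;
integer i;
always @(*) begin
  s = 8'b0; z = 3'd0;
  for (i = 0; i <= 7; i = i + 1)
    if (a[i]) begin // the last match wins, so the MSB takes priority
      s = 8'b1 << i;
      z = i[2:0];
    end
end
endmodule

For a = 8'b0000_0101 this gives s = 8'b0000_0100 and z = 2, matching the examples above; the reversed encoding is simply 7 − z.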
An accumulator is an adder/subtracter and a register. Sometimes these are combined
with a multiplier to form a multiplier-accumulator ( MAC ). An incrementer adds 1 to
the input bus, Z = A + 1, so we can use this function, together with a register, to negate
a two's complement number, for example. The implementation is Z[ i ] = XOR(A[ i ],
CIN[ i ]), and COUT[ i ] = AND(A[ i ], CIN[ i ]). The carry-in control input, CIN[0],
thus acts as an enable: if it is set to '0' the output is the same as the input.
The implementation of arithmetic cells is often a little more complicated than we have
explained. CMOS logic is naturally inverting, so that it is faster to implement an
incrementer as
Z[ i (even)] = XOR(A[ i ], CIN[ i ]) and COUT[ i (even)] = NAND(A[ i ], CIN[ i ]).
This inverts COUT, so that in the following stage we must invert it again. If we push an
inverting bubble to the input CIN we find that:
Z[ i (odd)] = XNOR(A[ i ], CIN[ i ]) and COUT[ i (odd)] = NOR(NOT(A[ i ]), CIN[ i
]).
In many datapath implementations all odd-bit cells operate on inverted carry signals,
and thus the odd-bit and even-bit datapath elements are different. In fact, all the adder
and subtracter datapath elements we have described may use this technique. Normally
this is completely hidden from the designer in the datapath assembly and any output
control signals are inverted, if necessary, by inserting buffers.
A decrementer subtracts 1 from the input bus; the logical implementation is Z[ i ] =
XOR(A[ i ], CIN[ i ]) and COUT[ i ] = AND(NOT(A[ i ]), CIN[ i ]). The
implementation may invert the odd carry signals, with CIN[0] again acting as an
enable.
An incrementer/decrementer has a second control input that gates the input, inverting
the input to the carry chain. This has the effect of selecting either the increment or
decrement function.
When using the all-zeros detectors and all-ones detectors, remember that, for a 4-bit
number, for example, zero in one's complement arithmetic is '1111' or '0000', and that
zero in sign-magnitude arithmetic is '1000' or '0000'.
A register file (or scratchpad memory) is a bank of flip-flops arranged across the bus;
sometimes these have the option of multiple ports (multiport register files) for read and
write. Normally these register files are the densest logic and hardest to fit in a datapath.
For large register files it may be more appropriate to use a multiport memory. We can
add control logic to a register file to create a first-in first-out register ( FIFO ), or last-in
first-out register ( LIFO ).
In Section 2.5 we saw that the standard-cell version and gate-array macro version of the
sequential cells (latches and flip-flops) each contain their own clock buffers. The
reason for this is that (without intelligent placement software) we do not know where a
standard cell or a gate-array macro will be placed on a chip. We also have no idea of
the condition of the clock signal coming into a sequential cell. The ability to place the
clock buffers outside the sequential cells in a datapath gives us more flexibility and
saves space. For example, we can place the clock buffers for all the clocked elements at
the top of the datapath (together with the buffers for the control signals) and river route
(in river routing the interconnect lines all flow in the same direction on the same layer)
the connections to the clock lines. This saves space and allows us to guarantee the
clock skew and timing. It may mean, however, that there is a fixed overhead associated
with a datapath. For example, it might make no sense to build a 4-bit datapath if the
clock and control buffers take up twice the space of the datapath logic. Some tools
allow us to design logic using a portable netlist . After we complete the design we can
decide whether to implement the portable netlist in a datapath, standard cells, or even a
gate array, based on area, speed, or power considerations.
2.7 I/O Cells
Figure 2.33 shows a three-state bidirectional output buffer (Tri-State ® is a
registered trademark of National Semiconductor). When the output enable (OE)
signal is high, the circuit functions as a noninverting buffer driving the value of
DATAin onto the I/O pad. When OE is low, the output transistors or drivers , M1
and M2, are disconnected. This allows multiple drivers to be connected on a bus.
It is up to the designer to make sure that a bus never has two drivers, a problem
known as contention.
In order to prevent the problem opposite to contention (a bus floating to an
intermediate voltage when there are no bus drivers) we can use a bus keeper or
bus-hold cell (TI calls this Bus-Friendly logic). A bus keeper normally acts like
two weak (low drive-strength) cross-coupled inverters that act as a latch to retain
the last logic state on the bus, but the latch is weak enough that it may be driven
easily to the opposite state. Even though bus keepers act like latches, and will
simulate like latches, they should not be used as latches, since their drive strength
is weak.
Transistors M1 and M2 in Figure 2.33 have to drive large off-chip loads. If we
wish to change the voltage on a C = 200 pF load by 5 V in 5 ns (a slew rate of
1 V ns⁻¹) we will require a current in the output transistors of I_DS = C(dV/dt) =
(200 × 10⁻¹²)(5/(5 × 10⁻⁹)) = 0.2 A or 200 mA.
Such large currents flowing in the output transistors must also flow in the power
supply bus and can cause problems. There is always some inductance in series
with the power supply, between the point at which the supply enters the ASIC
package and the point at which it reaches the power bus on the chip. The inductance is due to the bond
wire, lead frame, and package pin. If we have a power-supply inductance of 2 nH
and a current changing from zero to 1 A (32 I/O cells on a bus switching at 30
mA each) in 5 ns, we will have a voltage spike on the power supply (called
power-supply bounce ) of L(dI/dt) = (2 × 10⁻⁹)(1/(5 × 10⁻⁹)) = 0.4 V.
We do several things to alleviate this problem: We can limit the number of
simultaneously switching outputs (SSOs), we can limit the number of I/O drivers
that can be attached to any one VDD and GND pad, and we can design the output
buffer to limit the slew rate of the output (we call these slew-rate limited I/O
pads). Quiet-I/O cells also use two separate power supplies and two sets of I/O
drivers: an AC supply (clean or quiet supply) with small AC drivers for the I/O
circuits that start and stop the output slewing at the beginning and end of an output
transition, and a DC supply (noisy or dirty supply) for the transistors that handle
large currents as they slew the output.
The three-state buffer allows us to employ the same pad for input and output, as
bidirectional I/O. When we want to use the pad as an input, we set OE low and
take the data from the input buffer, I1. Of course, it is not necessary to have all these
features on every pad: we can build output-only or input-only pads.

FIGURE 2.33 A three-state bidirectional output buffer. When the output enable,
OE, is '1' the output section is enabled and drives the I/O pad. When OE is '0'
the output buffer is placed in a high-impedance state.

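The pad circuit itself is modeled in a couple of lines of Verilog (a behavioral sketch of the figure; DATAout is the output of the input buffer I1):

module bidir_pad (OE, DATAin, DATAout, pad); // three-state bidirectional I/O
input OE, DATAin; output DATAout; inout pad;
assign pad = OE ? DATAin : 1'bz; // drivers disconnected when OE = '0'
assign DATAout = pad; // the input buffer I1 always follows the pad
endmodule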
We can also use many of these output cell features for input cells that have to
drive large on-chip loads (a clock pad cell, for example). Some gate arrays
simply turn an output buffer around to drive a grid of interconnect that supplies a
clock signal internally. With a typical interconnect capacitance of 0.2 pF cm⁻¹, a
grid of 100 cm (consisting of 10 by 10 lines running all the way across a 1 cm
chip) presents a load of 20 pF to the clock buffer.
Some libraries include I/O cells that have passive pull-ups or pull-downs
(resistors) instead of the transistors, M1 and M2 (the resistors are normally still
constructed from transistors with long gate lengths). We can also omit one of the
driver transistors, M1 or M2, to form open-drain outputs that require an external
pull-up or pull-down. We can design the output driver to produce TTL output
levels rather than CMOS logic levels. We may also add input hysteresis (using a
Schmitt trigger) to the input buffer, I1 in Figure 2.33, to accept input data signals
that contain glitches (from bouncing switch contacts, for example) or that are
slow rising. The input buffer can also include a level shifter to accept TTL input
levels and shift the input signal to CMOS levels.
The gate oxide in CMOS transistors is extremely thin (100 Å or less). This leaves
the gate oxide of the I/O cell input transistors susceptible to breakdown from
static electricity ( electrostatic discharge , or ESD ). ESD arises when we or
machines handle the package leads (like the shock I sometimes get when I touch
a doorknob after walking across the carpet at work). Sometimes this problem is
called electrical overstress (EOS) since most ESD-related failures are caused not
by gate-oxide breakdown, but by the thermal stress (melting) that occurs when
the n -channel transistor in an output driver overheats (melts) due to the large
current that can flow in the drain diffusion connected to a pad during an ESD
event.
To protect the I/O cells from ESD, the input pads are normally tied to device
structures that clamp the input voltage to below the gate breakdown voltage
(which can be as low as 10 V with a 100 Å gate oxide). Some I/O cells use
transistors with a special ESD implant that increases breakdown voltage and
provides protection. I/O driver transistors can also use elongated drain structures
(ladder structures) and large drain-to-gate spacing to help limit current, but in a
salicide process that lowers the drain resistance this is difficult. One solution is to
mask the I/O cells during the salicide step. Another solution is to use pnpn and
npnp diffusion structures called silicon-controlled rectifiers (SCRs) to clamp
voltages and divert current to protect the I/O circuits from ESD.
There are several ways to model the capability of an I/O cell to withstand EOS.
The human-body model ( HBM ) represents ESD by a 100 pF capacitor
discharging through a 1.5 kΩ resistor (this is an International Electrotechnical
Commission, IEC, specification). Typical voltages generated by the human body
are in the range of 2–4 kV, and we often see an I/O pad cell rated by the voltage it
can withstand using the HBM. The machine model ( MM ) represents an ESD
event generated by automated machine handlers. Typical MM parameters use a
200 pF capacitor (typically charged to 200 V) discharged through a 25 Ω
resistor, corresponding to a peak initial current of nearly 10 A. The charge-device
model ( CDM , also called device charge-discharge) represents the problem when
an IC package is charged, in a shipping tube for example, and then grounded. If
the maximum charge on a package is 3 nC (a typical measured figure) and the
package capacitance to ground is 1.5 pF, we can simulate this event by charging a
1.5 pF capacitor to 2 kV and discharging it through a 1 Ω resistor.
If the diffusion structures in the I/O cells are not designed with care, it is possible
to construct an SCR structure unwittingly, and instead of protecting the
transistors the SCR can enter a mode where it is latched on and conducting large
enough currents to destroy the chip. This failure mode is called latch-up .
Latch-up can occur if the pn -diodes on a chip become forward-biased and inject
minority carriers (electrons in p -type material, holes in n -type material) into the
substrate. The source-substrate and drain-substrate diodes can become
forward-biased due to power-supply bounce or output undershoot (the cell
outputs fall below VSS) or overshoot (outputs rise to greater than VDD), for
example. These injected minority carriers can travel fairly large distances and
interact with nearby transistors causing latch-up. I/O cells normally surround the
I/O transistors with guard rings (a continuous ring of n -diffusion in an n -well
connected to VDD, and a ring of p -diffusion in a p -well connected to VSS) to
collect these minority carriers. This is a problem that can also occur in the logic
core and this is one reason that we normally include substrate and well
connections to the power supplies in every cell.
2.8 Cell Compilers
The process of hand crafting circuits and layout for a full-custom IC is a tedious,
time-consuming, and error-prone task. There are two types of automated layout
assembly tools, often known as silicon compilers. The first type produces a
specific kind of circuit: a RAM compiler or multiplier compiler, for example.
The second type of compiler is more flexible, usually providing a programming
language that assembles or tiles layout from an input command file, but this is
full-custom IC design.
We can build a register file from latches or flip-flops, but, at 4.5–6.5 gates (18–26
transistors) per bit, this is an expensive way to build memory. Dynamic RAM
(DRAM) can use a cell with only one transistor, storing charge on a capacitor
that has to be periodically refreshed as the charge leaks away. ASIC RAM is
invariably static (SRAM), so we do not need to refresh the bits. When we refer to
RAM in an ASIC environment we almost always mean SRAM. Most ASIC
RAMs use a six-transistor cell (four transistors to form two cross-coupled
inverters that form the storage loop, and two more transistors to allow us to read
from and write to the cell). RAM compilers are available that produce single-port
RAM (a single shared bus for read and write) as well as dual-port and multiport
RAMs. In a multiport RAM the compiler may or may not handle the
problem of address contention (attempts to read and write to the same RAM
address simultaneously). RAM can be asynchronous (the read and write cycles
are triggered by control and/or address transitions asynchronous to a clock) or
synchronous (using the system clock).
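
For example, the behavioral model produced by a RAM compiler for a synchronous single-port SRAM might look like the following sketch (sizes and port names are arbitrary; a real compiler also generates timing and layout views):

module sram_sp (clk, we, addr, din, dout); // synchronous single-port SRAM model
parameter AW = 8, DW = 16; // 256 words of 16 bits
input clk, we; input [AW-1:0] addr; input [DW-1:0] din;
output reg [DW-1:0] dout;
reg [DW-1:0] mem [0:(1<<AW)-1]; // the six-transistor cells, as a reg array
always @(posedge clk) begin
  if (we) mem[addr] <= din; // write port
  dout <= mem[addr]; // registered read (returns old data on a write)
end
endmodule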
In addition to producing layout we also need a model compiler so that we can
verify the circuit at the behavioral level, and we need a netlist from a netlist
compiler so that we can simulate the circuit and verify that it works correctly at
the structural level. Silicon compilers are thus complex pieces of software. We
assume that a silicon compiler will produce working silicon even if every
configuration has not been tested. This is still ASIC design, but now we are
relying on the fact that the tool works correctly and therefore the compiled blocks
are correct by construction .
