Chapter-1
INTRODUCTION TO VLSI
1.1 Very-Large-Scale Integration
1.2 History
During the 1920s, several inventors attempted devices that were intended to control
the current in solid-state diodes and so convert them into triodes. Success, however,
had to wait until after World War II. The wartime effort to improve silicon and
germanium crystals for use as radar detectors led to advances both in fabrication
and in the theoretical understanding of the quantum mechanical states of carriers in
semiconductors, and afterwards the scientists who had been diverted to radar
development returned to solid-state devices. With the invention of the transistor at
Bell Labs in 1947, the field of electronics took a new direction, shifting from
power-hungry vacuum tubes to solid-state devices.
With the small and efficient transistor at hand, electrical engineers of the 1950s
saw the possibility of constructing far more advanced circuits than before. However,
as the complexity of the circuits grew, problems started arising.
One such problem was the size of the circuits. A complex circuit, like a computer,
was dependent on speed. If the components of the computer were too large, or the wires
interconnecting them too long, the electric signals couldn't travel fast enough through
the circuit, making the computer too slow to be effective.
Jack Kilby at Texas Instruments found a solution to this problem in 1958. Kilby's idea
was to make all the components and the chip out of the same block (monolith) of
semiconductor material. When his colleagues returned from vacation, Kilby
presented his new idea to his superiors and was allowed to build a test version of his
circuit. In September 1958, he had his first integrated circuit ready. Although this
first integrated circuit was crude and had some problems, the idea was
groundbreaking. By making all the parts out of the same block of material and adding
the metal needed to connect them as a layer on top of it, there was no more need for
individual discrete components: no wires and components had to be assembled
manually. The circuits could be made smaller, and the manufacturing process could be
automated. From here the idea of integrating all components on a single silicon wafer
came into existence, leading to Small-Scale Integration (SSI) in the early 1960s,
Medium-Scale Integration (MSI) in the late 1960s, Large-Scale Integration (LSI),
and in the early 1980s VLSI, with tens of thousands of transistors on a chip
(later hundreds of thousands, and now millions).
1.3 Developments
The first semiconductor chips held two transistors each. Subsequent advances added
more and more transistors, and, as a consequence, more individual functions or
systems were integrated over time. The first integrated circuits held only a few
devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it
possible to fabricate one or more logic gates on a single device. Now known
retrospectively as small-scale integration (SSI), this gave way, as techniques
improved, to devices with hundreds of logic gates, known as medium-scale integration (MSI).
Further improvements led to large-scale integration (LSI), i.e. systems with at least a
thousand logic gates. Current technology has moved far past this mark and today's
microprocessors have many millions of gates and billions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale
integration above VLSI. Terms like ultra-large-scale integration (ULSI) were used.
But the huge number of gates and transistors available on common devices has
rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of
integration are no longer in widespread use.
One example is a processor whose large transistor count is largely due to its
24 MB L3 cache. Current designs,
unlike the earliest devices, use extensive design automation and automated logic
synthesis to lay out the transistors, enabling higher levels of complexity in the
resulting logic functionality. Certain high-performance logic blocks like the SRAM
(Static Random Access Memory) cell, however, are still designed by hand to ensure
the highest efficiency (sometimes by bending or breaking established design rules to
obtain the last bit of performance by trading stability). VLSI technology is moving
towards radical miniaturization with the introduction of NEMS technology. A lot of
problems need to be sorted out before the transition is actually made.
Structured VLSI design was popular in the early 1980s but later lost popularity with
the advent of placement and routing tools, which waste a lot of area on routing; the
waste is tolerated because of the progress of Moore's Law. When introducing the
hardware description language KARL in the mid-1970s, Reiner Hartenstein coined the
term "structured VLSI design" (originally "structured LSI design"), echoing Edsger
Dijkstra's structured programming approach, which uses procedure nesting to avoid
chaotic, spaghetti-structured programs.
1.4 Challenges
Stricter design rules – Due to lithography and etch issues with scaling,
design rules for layout have become increasingly stringent. Designers must
keep ever more of these rules in mind while laying out custom circuits. The
overhead for custom design is now reaching a tipping point, with many design
houses opting to switch to electronic design automation (EDA) tools to
automate their design process.
Timing/design closure – As clock frequencies tend to scale up, designers are
finding it more difficult to distribute and maintain low clock skew between
these high frequency clocks across the entire chip. This has led to a rising
interest in multicore and multiprocessor architectures, since an overall speedup
can be obtained by lowering the clock frequency and distributing processing.
First-pass success – As die sizes shrink (due to scaling), and wafer sizes go
up (to lower manufacturing costs), the number of dies per wafer increases, and
the complexity of making suitable photomasks goes up rapidly. A mask set for
a modern technology can cost several million dollars. This non-recurring
expense deters the old iterative philosophy involving several "spin-cycles" to
find errors in silicon, and encourages first-pass silicon success. Several design
philosophies have been developed to aid this new design flow, including
design for manufacturing (DFM), design for test (DFT), and Design for X.
Gone are the days when huge computers made of vacuum tubes sat humming in entire
dedicated rooms and could do about 360 multiplications of 10 digit numbers in a
second. Though they were heralded as the fastest computing machines of that time,
they surely don’t stand a chance when compared to the modern day machines.
Modern-day computers are getting smaller, faster, cheaper and more power-efficient
with every passing second. But what drove this change? The whole domain of
computing ushered in a new dawn of electronic miniaturization with the advent of the
point-contact transistor by Bardeen and Brattain (1947-48) and then the bipolar
junction transistor by Shockley (1949) at Bell Laboratories.
Since the invention of the first IC (Integrated Circuit) in the form of a flip-flop by
Jack Kilby in 1958, our ability to pack more and more transistors onto a single chip
has doubled roughly every 18 months, in accordance with Moore's Law. Such
exponential growth has never been seen in any other field, and it still continues
to be a major area of research work.
Fig 1.2 A comparison: First Planar IC (1961) and Intel Nehalem Quad Core Die
The development of microelectronics spans a time even shorter than the average life
expectancy of a human, and yet it has seen as many as four generations. The early 60s
saw the low-density fabrication processes classified under Small-Scale Integration
(SSI), in which the transistor count was limited to about 10. This rapidly gave way to
Medium-Scale Integration (MSI) in the late 60s, when around 100 transistors could be
placed on a single chip.
It was the time when the cost of research began to decline and private firms started
entering the competition, in contrast to the earlier years when the main burden was
borne by the military. Transistor-Transistor Logic (TTL), offering higher integration
densities, outlasted other IC families like ECL and became the basis of the first
integrated circuit revolution. It was the production of this family that gave impetus
to semiconductor giants like Texas Instruments, Fairchild and National
Semiconductors. The early seventies marked the growth of the transistor count to about
1000 per chip, called Large-Scale Integration (LSI).
By the mid-eighties, the transistor count on a single chip had already exceeded
10,000, and hence came the age of Very-Large-Scale Integration, or VLSI. Though many
improvements have been made and the transistor count is still rising, further
generation names like ULSI are generally avoided. It was during this time that TTL
lost the battle to the MOS family, owing to the same problems that had pushed vacuum
tubes into obsolescence: power dissipation and the limit it imposed on the number of
gates that could be placed on a single die.
The second age of the integrated circuit revolution started with the introduction of
the first microprocessor, the 4004, by Intel in 1971, followed by the 8080 in 1974.
Today many companies like Texas Instruments, Infineon, Alliance Semiconductors,
Cadence, Synopsys, Celox Networks, Cisco, Micron Tech, National Semiconductors, ST
Microelectronics, Qualcomm, Lucent, Mentor Graphics, Analog Devices, Intel, Philips,
Motorola and many other firms have been established and are dedicated to various
fields in VLSI like Programmable Logic Devices, Hardware Description Languages,
design tools, embedded systems, etc.
VLSI Design
VLSI today chiefly comprises front-end design and back-end design. Front-end design
includes digital design using an HDL, design verification through simulation and
other verification techniques, design from gates, and design for testability.
Back-end design comprises CMOS library design and its characterization, as well as
physical design and fault simulation.
While simple logic gates might be considered SSI devices, and multiplexers and
parity encoders MSI, the world of VLSI is much more diverse. Generally, the
entire design procedure follows a step-by-step approach in which each design step is
followed by simulation before actually being put onto hardware or moving on to
the next step. The major design steps are different levels of abstraction of the
device as a whole:
3. Functional Design: Defines the major functional units of the system, and hence
facilitates the identification of interconnect requirements between units and the
physical and electrical specifications of each unit. A sort of block diagram is
decided upon, with the number of inputs, outputs and timing decided without any
details of the internal structure.
4. Logic Design: The actual logic is developed at this level. Boolean expressions,
control flow, word widths, register allocation, etc. are developed, and the outcome is
called a Register Transfer Level (RTL) description. This part is implemented with
Hardware Description Languages such as VHDL and/or Verilog. Gate minimization
techniques are employed to find the simplest, or rather the smallest, most effective
implementation of the logic.
5. Circuit Design: While the logic design gives the simplified implementation of
the logic, the realization of the circuit in the form of a netlist is done in this
step. Gates, transistors and interconnects are put in place to make a netlist. This
again is a software step, and the outcome is checked via simulation.
6.1 Circuit Partitioning: Because of the huge number of transistors involved, it is not
possible to handle the entire circuit all at once due to limitations on computational
capabilities and memory requirements. Hence the whole circuit is broken down into
blocks which are interconnected.
6.2 Floor Planning and Placement: Choosing the best layout for each block from the
partitioning step, and for the overall chip, is done here: the interconnect area
between the blocks and the exact positioning on the chip are chosen so as to minimize
area while meeting the performance constraints, through an iterative approach.
6.3 Routing: The quality of placement becomes evident only after this step is
completed. Routing involves the completion of the interconnections between
modules. This is completed in two steps. First, connections are completed between
blocks without taking into consideration the exact geometric details of each wire and
pin. Then, a detailed routing step completes the point-to-point connections between
pins on the blocks.
6.4 Layout Compaction: The smaller the chip size can get, the better it is. In this
design step, the layout is compressed from all directions to minimize the chip area,
thereby reducing wire lengths, signal delays and overall cost.
6.5 Extraction and Verification: The circuit is extracted from the layout and
compared with the original netlist; performance verification, reliability
verification and a check of the layout's correctness are done before the final step
of packaging.
7. Packaging: The chips are put together on a Printed Circuit Board or a Multi
Chip Module to obtain the final finished product.
Initially, design can be done with three different methodologies which provide
different levels of freedom of customization to the programmers. The design methods,
in increasing order of customization support, which also means increased amount of
overhead on the part of the programmer, are FPGAs and PLDs, Standard Cell (Semi
Custom) and Full Custom Design.
While FPGAs have built-in libraries and a fabric with interconnections and blocks
already in place, Semi-Custom design allows the placement of blocks in a user-defined
custom fashion with some independence, while most libraries are still available for
development. Full Custom Design adopts a start-from-scratch approach, where the
designer is required to write the whole set of libraries and also has full control
over block development, placement and routing. This is also the usual progression
from entry-level design to professional design.
VLSI is dominated by CMOS technology, and much like other logic families, it has its
limitations, which have been battled and improved upon over the years. Taking the
example of a processor, the process technology rapidly shrank from 180 nm in 1999 to
65 nm and then 45 nm, with attempts being made to reduce it further (32 nm).
Meanwhile, the die area, which had shrunk initially, is now increasing, since a
larger die combined with greater packing density at a smaller feature size means
more transistors on a chip.
As the number of transistors increases, the power dissipation increases, and so does
the noise. In terms of heat generated per unit area, chips have already neared the
levels of a jet engine nozzle. At the same time, scaling threshold voltages beyond a
certain point poses serious limitations on providing low dynamic power dissipation
at increased complexity. The number of metal layers and the interconnects, both
global and local, also tend to get messy at such nano scales.
On the fabrication front, we are fast approaching the optical limit of
photolithographic processes, beyond which the feature size cannot be reduced without
losing accuracy. This has opened up Extreme Ultraviolet Lithography techniques. The
high-speed clocks used now make it hard to reduce clock skew, imposing tight timing
constraints; this has opened up a new frontier in parallel processing. And above all,
we seem to be fast approaching atom-thin gate oxide thicknesses, where only a single
layer of atoms may serve as the oxide layer in CMOS transistors. New alternatives
like Gallium Arsenide technology are becoming an active area of research owing to
this.
Chapter-2
INTRODUCTION TO ADDERS
2.1 Motivation
To humans, decimal numbers are easy to comprehend and use for performing
arithmetic. However, in digital systems such as a microprocessor, DSP (Digital
Signal Processor) or ASIC (Application-Specific Integrated Circuit), binary numbers
are more pragmatic for a given computation, because two-state binary signals map
directly onto the on/off switching of transistors and can be represented and
processed efficiently in hardware.
Binary adders are among the most essential logic elements within a digital
system. In addition, binary adders are also helpful in units other than the
Arithmetic Logic Unit (ALU), such as multipliers, dividers and memory addressing.
Binary addition is so essential that any improvement in it can result in a
performance boost for any computing system and, hence, help improve the
performance of the entire system.
The major problem for binary addition is the carry chain. As the width of the
input operands increases, the length of the carry chain increases. Figure 2.1
demonstrates an example of an 8-bit binary add operation and how the carry chain is
affected. This example shows that the worst case occurs when the carry travels the
longest possible path, from the least significant bit (LSB) to the most significant
bit (MSB). In order to improve the performance of carry-propagate adders, it is
possible to accelerate the carry chain, but not to eliminate it. Consequently,
digital designers often resort to building faster adders when optimizing a computer
architecture, because adders tend to set the critical path for most computations.
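The worst-case ripple described above can be made concrete with a short sketch.
Python is used here purely for illustration, and the function name is ours, not from
the text:

```python
def carry_chain_length(a: int, b: int, width: int = 8) -> int:
    """Count the longest run of consecutive carries when adding a + b bit by bit."""
    carry, longest, run = 0, 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carry_out = (ai & bi) | (ai & carry) | (bi & carry)
        run = run + 1 if carry_out else 0   # length of the current ripple run
        longest = max(longest, run)
        carry = carry_out
    return longest

# Worst case: the carry travels from the LSB all the way to the MSB.
print(carry_chain_length(0b11111111, 0b00000001))  # → 8
```

Adding 1 to an all-ones operand forces a carry out of every bit position, which is
exactly the longest-path case shown in Figure 2.1.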
The binary adder is the critical element in most digital circuit designs,
including digital signal processors (DSP) and microprocessor data path units. As such,
extensive research continues to be focused on improving the power-delay
performance of the adder. In VLSI implementations, parallel-prefix adders are known
to have the best performance. Reconfigurable logic such as Field Programmable Gate
Arrays (FPGAs) has been gaining popularity in recent years because, for many
practical designs involving mobile DSP and telecommunications applications, it
offers improved performance in terms of speed and power over DSP-based and
microprocessor-based solutions, and a significant reduction in development time and
cost over Application Specific Integrated Circuit (ASIC) designs.
2.3 Research Contributions
The implementations developed in this work help to improve the design of carry
select adders and their associated computing architectures. This has the potential
of impacting many application-specific and general-purpose computer architectures;
consequently, this work can impact the designs of many computing systems, as well as
many areas of engineering and science. The practical issues involved in designing and
implementing carry select adders on FPGAs are described. Several carry select adder
structures are implemented and characterized on an FPGA and compared: the CSLA with
Ripple Carry Adders (RCA) and the CSLA with the Binary to Excess-1 Converter.
Finally, some conclusions and suggestions for improving FPGA designs to enable
better carry select adder performance are given.
Chapter-3
BINARY ADDER SCHEMES
Adders are among the most essential components of digital building blocks, and
their performance becomes ever more critical as technology advances. The problem of
addition involves algorithms in Boolean algebra and their respective circuit
implementations. Algorithmically, there are linear-delay adders like the
ripple-carry adder (RCA), which is the most straightforward but slowest. Adders like
the carry-skip adder (CSKA), carry-select adder (CSLA) and carry-increment adder
(CINA) are linear-based adders with an optimized carry chain, and improve upon the
linear chain within a ripple-carry adder. Carry-lookahead adders (CLA) have
logarithmic delay and have evolved into parallel-prefix structures. Other schemes,
like Ling adders, NAND/NOR adders and carry-save adders, can help improve
performance as well.
This chapter gives background information on the architectures of adder
algorithms. In the following sections, the adders are characterized with a linear
gate model, which is a rough estimate of the complexity of a real implementation.
Although this evaluation method can be misleading for VLSI implementers, such
estimates provide sufficient insight to understand the design trade-offs of the
adder algorithms.
The + in the above equation is the regular add operation. However, in the
binary world, only Boolean algebra works. For add-related operations, AND, OR and
Exclusive-OR (XOR) are required. In the following, a dot between two single-bit
variables, e.g. a . b, denotes 'a AND b'. Similarly, a + b denotes 'a OR b' and
a ^ b denotes 'a XOR b'.
Considering the situation of adding two bits, the sum s and carry c can be expressed
using Boolean operations mentioned above.
si = ai ^ bi
ci+1 = ai . bi
The Equation of ci+1 can be implemented as shown in Figure 3.1. In the figure, there
is a half adder, which takes only 2 input bits. The solid line highlights the critical
path, which indicates the longest path from the input to the output.
Equation of ci+1 can be extended to perform full add operation, where there is a carry
input.
si = ai ^ bi ^ ci
ci+1 = ai . bi + ai . ci + bi . ci
A full adder can be built based on the equations above. The block diagram of a
1-bit full adder is shown in Figure 3.2. The full adder is composed of 2 half adders
and an OR gate for computing the carry-out.
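The half-adder and full-adder equations above can be checked exhaustively with a few
lines of Python (used here only as a verification aid, not as part of the design):

```python
def half_adder(a: int, b: int):
    """s = a XOR b, c = a AND b."""
    return a ^ b, a & b

def full_adder(a: int, b: int, cin: int):
    """Two half adders plus an OR gate for the carry-out, as in Figure 3.2."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2

# Exhaustive check against si = ai ^ bi ^ ci and ci+1 = ai.bi + ai.ci + bi.ci
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert s == a ^ b ^ cin
            assert cout == (a & b) | (a & cin) | (b & cin)
print("full adder matches the Boolean equations")
```

All eight input combinations agree with the sum and carry equations, confirming the
two-half-adder construction.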
These are called carry generate and carry propagate, denoted by gi and pi. Another
literal, called the temporary sum ti, is employed as well. The following relations
hold between the inputs and these literals:
gi = ai . bi
pi = ai + bi
ti = ai ^ bi
where i is an integer and 0 ≤ i < n.
With the help of the literals above, the output carry and sum at each bit can be
written as
ci+1 = gi + pi . ci
si = ti ^ ci
In some literature, the carry-propagate pi is replaced with the temporary sum ti
in order to save logic gates. Here the two terms are kept separate in order to
clarify the concepts. For example, Ling adders use only pi as the carry-propagate.
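The recurrence ci+1 = gi + pi.ci together with si = ti ^ ci fully defines addition,
which a small Python sketch can demonstrate (illustrative only; names are ours):

```python
def gpt(a: int, b: int):
    """Per-bit generate, propagate and temporary sum: g = a.b, p = a+b, t = a^b."""
    return a & b, a | b, a ^ b

def add_with_gp(a_bits, b_bits, c0=0):
    """Compute sums via ci+1 = gi + pi.ci and si = ti ^ ci (bit lists LSB first)."""
    c, sums = c0, []
    for ai, bi in zip(a_bits, b_bits):
        g, p, t = gpt(ai, bi)
        sums.append(t ^ c)       # si = ti ^ ci
        c = g | (p & c)          # ci+1 = gi + pi.ci
    return sums, c

# 6 + 3 = 9: bit lists are LSB first, so 0b0110 is [0, 1, 1, 0]
print(add_with_gp([0, 1, 1, 0], [1, 1, 0, 0]))  # → ([1, 0, 0, 1], 0)
```

The returned sum bits read 1001 (MSB last), i.e. 9, as expected.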
The single-bit carry generate/propagate can be extended to the group versions G
and P. The following equations show the inherent relations.
Gi:k = Gi:j + Pi:j . Gj-1:k
Pi:k = Pi:j . Pj-1:k
where i : k denotes the group term from i through k.
Using the group carry generate/propagate, the carry can be expressed as in the
following equation.
ci+1 = Gi:j + Pi:j . cj
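These group relations can be sanity-checked by folding per-bit (g, p) pairs into a
group (G, P) and comparing the predicted carry-out against real addition. The sketch
below is a checking aid in Python, not part of the text's gate-level model:

```python
def bits(x, n=4):
    """Bits of x, LSB first."""
    return [(x >> i) & 1 for i in range(n)]

def group_gp(g, p):
    """Fold per-bit (g, p), LSB first, into a group (G, P) using
    G = g_hi + p_hi.G_lo and P = p_hi.P_lo."""
    G, P = g[0], p[0]
    for gi, pi in zip(g[1:], p[1:]):
        G, P = gi | (pi & G), pi & P
    return G, P

# Verify c4 = G3:0 + P3:0.c0 against true 4-bit addition, for all inputs.
for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            ab = list(zip(bits(a), bits(b)))
            g = [x & y for x, y in ab]
            p = [x | y for x, y in ab]
            G, P = group_gp(g, p)
            assert G | (P & c0) == (a + b + c0) >> 4
print("group generate/propagate relations verified")
```

The exhaustive check over all 4-bit operands confirms that the group carry
expression reproduces the true block carry-out.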
The simplest way of doing binary addition is to connect the carry-out of the
previous bit to the next bit's carry-in. Each bit takes the carry-in as one of its
inputs and outputs a sum bit and a carry-out bit, hence the name ripple-carry adder.
This type of adder is built by cascading 1-bit full adders. A 4-bit ripple-carry
adder is shown in Figure 3.3. Each trapezoidal symbol represents a single-bit full
adder. At the top of the figure, the carry is rippled through the adder from cin to
cout.
Figure 3.3: Ripple-Carry Adder.
It can be observed in Figure 3.3 that the critical path, highlighted with a solid
line, runs from the least significant bit (LSB) of the input (a0 or b0) to the most
significant bit (MSB) of the sum (sn-1). Assume each simple gate, including AND, OR
and XOR, has a delay of 2Δ, a NOT gate has a delay of 1Δ, and all gates have an area
of 1 unit. Using this analysis, and assuming that each add block is built with a
9-gate full adder, the critical path is calculated as follows.
ai, bi → si = 10Δ
ai, bi → ci+1 = 9Δ
ci → si = 5Δ
ci → ci+1 = 4Δ
The critical path, or worst delay, is
trca = {9 + (n - 2) x 4 + 5}Δ = {4n + 6}Δ
As each bit takes 9 gates, the area is simply 9n for an n-bit RCA.
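The RCA delay and area formulas are easy to tabulate; the Python sketch below simply
encodes the expressions derived above (function names are ours):

```python
def rca_delay(n: int) -> int:
    """Critical path of an n-bit ripple-carry adder in gate delays (Δ):
    9Δ to the first carry, 4Δ per intermediate carry, 5Δ for the final sum."""
    return 9 + (n - 2) * 4 + 5    # simplifies to 4n + 6

def rca_area(n: int) -> int:
    """9 gates per full-adder bit."""
    return 9 * n

print(rca_delay(16), rca_area(16))  # → 70 144
```

For the 16-bit case used later in the chapter, this gives a delay of 70Δ and an area
of 144 units.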
3.3 Carry-Select Adders (CSLA)
Simple adders, like ripple-carry adders, are slow, since the carry has to travel
through every full adder block. There is a way to improve the speed by duplicating
the hardware, exploiting the fact that the carry can only be either 0 or 1. The
method is based on the conditional-sum adder and extended to a carry-select adder.
With two RCAs, each computing the case of one polarity of the carry-in, the sum
can be obtained with a 2x1 multiplexer with the carry-in as the select signal. An
example of a 16-bit carry-select adder is shown in Figure 3.4. In the figure, the
adder is grouped into four 4-bit blocks. The 1-bit multiplexors for sum selection
can be implemented as Figure 3.5 shows. The two carry terms are computed assuming
the carry input is given as a constant 0 or 1.
In Figure 3.4, each pair of adjacent 4-bit blocks utilizes the carry relationship
ci+4 = c0i+4 + c1i+4 . ci
The relationship can be verified with the properties of the group carry
generate/propagate; c0i+4 can be written as
c0i+4 = Gi+4:i + Pi+4:i . 0 = Gi+4:i
Similarly, c1i+4 can be written as
c1i+4 = Gi+4:i + Pi+4:i . 1 = Gi+4:i + Pi+4:i
Then
c0i+4 + c1i+4 . ci = Gi+4:i + (Gi+4:i + Pi+4:i) . ci
                   = Gi+4:i + Gi+4:i . ci + Pi+4:i . ci
                   = Gi+4:i + Pi+4:i . ci
                   = ci+4
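The selection relation ci+4 = c0i+4 + c1i+4 . ci can also be checked exhaustively
for every 4-bit block, using true addition as the reference (Python is used here
only for verification):

```python
# For every 4-bit block and carry-in, selecting between the two precomputed
# carries must reproduce the true block carry-out:
#   c0 = carry-out assuming cin = 0, c1 = carry-out assuming cin = 1,
#   and c_out = c0 + c1.cin (the relation derived above).
for a in range(16):
    for b in range(16):
        c0 = (a + b) >> 4          # block carry-out with cin = 0
        c1 = (a + b + 1) >> 4      # block carry-out with cin = 1
        for cin in (0, 1):
            assert c0 | (c1 & cin) == (a + b + cin) >> 4
print("carry-select relation holds for all 4-bit blocks")
```

Since c1 is never smaller than c0, ORing c0 with (c1 AND cin) selects exactly the
correct precomputed carry, which is what the multiplexer-based selection exploits.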
Figure 3.5: 2-1 Multiplexor.
Varying the number of bits in each group works as well for carry-select adders.
The temporary sums can be defined as follows.
s0i+1 = ti+1 ^ c0i
s1i+1 = ti+1 ^ c1i
The final sum is selected by the carry-in between the temporary sums already
calculated.
si+1 = ~cj . s0i+1 + cj . s1i+1
Assuming the block size is fixed at r bits, the n-bit adder is composed of k
groups of r-bit blocks, i.e. n = r x k. The critical path through the first RCA has
a delay of (4r + 5)Δ from the input to the carry-out, and there are k - 2 blocks
that follow, each with a delay of 4Δ for the carry to go through. The final delay
comes from the multiplexor, which has a delay of 5Δ, as indicated in Figure 3.5.
The total delay for this CSEA is calculated as
tcsea = {4r + 5 + 4(k - 2) + 5}Δ
      = {4r + 4k + 2}Δ
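The CSEA and RCA delay formulas can be compared numerically with a small sketch
(illustrative only; the function names are ours):

```python
def csea_delay(r: int, k: int) -> int:
    """{4r + 4k + 2}Δ: first RCA block, (k - 2) carry hops, final mux."""
    return 4 * r + 4 * k + 2

def rca_delay(n: int) -> int:
    """{4n + 6}Δ for the plain ripple-carry adder."""
    return 4 * n + 6

# The 16-bit example from the text: r = 4, k = 4.
print(csea_delay(4, 4), rca_delay(16))  # → 34 70
```

This reproduces the chapter's 16-bit comparison: 34Δ for the carry-select adder
versus 70Δ for the ripple-carry adder.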
The area can be estimated with (2n - r) FAs, (n - r) multiplexors and (k - 1)
AND/OR logic blocks. As mentioned above, each FA has an area of 9 units and a
multiplexor takes 5 units of area. The total area can be estimated as
9(2n - r) + 2(k - 1) + 4(n - r) = 22n - 13r + 2k - 2
The delay of the critical path in the CSEA is reduced at the cost of increased area.
For example, in Figure 3.4, k = 4, r = 4 and n = 16. The delay for the CSEA is 34Δ,
compared to 70Δ for the 16-bit RCA. The area for the CSEA is 310 units, while the
RCA has an area of 144 units. The delay of the CSEA is about half that of the RCA,
but the CSEA has an area more than twice that of the RCA. Each adder can also be
modified to have variable block sizes, which gives better delay and slightly less
area.
In a carry-skip adder, the carry-out of each block is determined by selecting
between the carry-in and Gi:j using Pi:j. When Pi:j = 1, the carry-in cj is allowed
to get through the block immediately. Otherwise, the carry-out is determined by
Gi:j. The CSKA has less delay in the carry chain at the cost of only a little extra
logic. Further improvement can generally be achieved by making the central block
sizes larger and the two end block sizes smaller.
Assuming the n-bit adder is divided evenly into k r-bit blocks, part of the
critical path runs from the LSB input through the MSB output of the final RCA. The
first delay is from the LSB input to the carry-out, which is (4r + 5)Δ. Then, there
are k - 2 skip logic blocks, each with a delay of 3Δ. Each skip logic block includes
one 4-input AND gate for getting Pi+3:i and one AND/OR logic block. The final RCA
has a delay from input to sum at the MSB, which is (4r + 6)Δ. The total delay is
calculated as follows.
tcska = {4r + 5 + 3(k - 2) + 4r + 6}Δ
      = {8r + 3k + 5}Δ
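The CSKA delay formula, encoded as a small sketch (illustrative only):

```python
def cska_delay(r: int, k: int) -> int:
    """{8r + 3k + 5}Δ: ripple through the first block (4r + 5),
    (k - 2) skip stages at 3Δ each, ripple through the last block (4r + 6)."""
    return (4 * r + 5) + 3 * (k - 2) + (4 * r + 6)

# A 16-bit adder split into four 4-bit blocks:
print(cska_delay(4, 4))  # → 49
```

For n = 16, r = 4, k = 4 this gives 49Δ, sitting between the carry-select adder
(34Δ) and the plain ripple-carry adder (70Δ), which matches the intuition that
skipping saves only the mid-chain ripple.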
The CSKA has n full adders and k - 2 skip logic blocks. Each skip logic block has an
area of 3 units. Therefore, the total area is estimated as 9n + 3(k - 2) = 9n + 3k - 6.
The theory of the CLA is based on the following equations. Figure 3.8 shows an
example of a 16-bit carry-lookahead adder. In the figure, each block is fixed at
4 bits. BCLG stands for Block Carry Lookahead Generator, which generates the
generate/propagate signals in group form. For the 4-bit BCLG, the following
equations apply.
Gi+3:i = gi+3 + pi+3 . gi+2 + pi+3 . pi+2 . gi+1 + pi+3 . pi+2 . pi+1 . gi
Pi+3:i = pi+3 . pi+2 . pi+1 . pi
The group generate takes a delay of 4Δ, which is an OR after an AND; the carry-out
can then be computed as follows.
ci+4 = Gi+3:i + Pi+3:i . ci
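The 4-bit BCLG equations can be verified against true addition with a short sketch
(Python used only as a checking aid; it assumes the per-bit definitions g = a.b and
p = a + b from earlier in the chapter):

```python
def bclg4(g, p):
    """4-bit Block Carry Lookahead Generator: group (G, P) from per-bit
    (g, p) lists, LSB first, per the two-level AND-OR equations."""
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P = p[3] & p[2] & p[1] & p[0]
    return G, P

# Check the block carry-out G + P.cin against real 4-bit addition.
for a in range(16):
    for b in range(16):
        g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]
        p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(4)]
        G, P = bclg4(g, p)
        for cin in (0, 1):
            assert G | (P & cin) == (a + b + cin) >> 4
print("4-bit BCLG equations verified")
```

The exhaustive check confirms that the flattened two-level equations compute the
same group terms as folding the per-bit recurrence.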
Figure 3.8: Carry-Lookahead Adder.
The carry computation also has a delay of 4Δ, an OR after an AND. The 4-bit BCLG
has an area of 14 units.
The critical path of the 16-bit CLA can be traced from the input operand through
one RFA, then three BCLGs, and through the final RFA. That is, the critical path
shown in Figure 3.8 is from a0/b0 to s7. The delay will be the same from a0/b0 to
s11 or s15; the critical path grows logarithmically with the group size.
a0, b0 → s7 = 19Δ
The 16-bit CLA is composed of 16 RFAs and 5 BCLGs, which amounts to an area of
16 x 8 + 5 x 14 = 198 units.
Extending the calculation above, a general estimate for delay and area can be
derived. Assume the CLA has n bits, divided into k groups of r-bit blocks. It
requires ⌈log_r n⌉ logic levels. The critical path starts from the input through
p0/g0 generation, the BCLG logic, and the carry-in to sum at the MSB. The
generation of (p, g) takes a delay of 2Δ. The group version of (p, g) generated by
the BCLG has a delay of 4Δ. For each further BCLG level, there is a 4Δ delay going
up and a 4Δ delay coming back down to the next level, totalling 8Δ per level.
Finally, from ck+r to sk+r, there is a delay of 5Δ. Thus, the total delay is
calculated as follows.
tcla = {2 + 8(⌈log_r n⌉ - 1) + 4 + 5}Δ
     = {3 + 8⌈log_r n⌉}Δ
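The logarithmic CLA delay can be tabulated with a sketch that computes ⌈log_r n⌉ by
integer counting (illustrative only; names are ours):

```python
def levels(n: int, r: int) -> int:
    """Smallest L with r**L >= n, i.e. ceil(log_r n), computed with integers."""
    L, size = 0, 1
    while size < n:
        size *= r
        L += 1
    return L

def cla_delay(n: int, r: int) -> int:
    """{3 + 8*ceil(log_r n)}Δ: 2Δ for (p, g), 8Δ per extra BCLG level,
    4Δ for the first group terms, 5Δ for the final sum."""
    return 2 + 8 * (levels(n, r) - 1) + 4 + 5

print(cla_delay(16, 4))  # → 19
```

For n = 16 and r = 4, there are two levels and the formula gives 19Δ, matching the
a0, b0 → s7 path traced in Figure 3.8.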
Chapter-4
Carry Select Adder
4.1 Introduction
Design of area- and power-efficient high-speed data path logic systems is one of the
most substantial areas of research in VLSI system design. In digital adders, the
speed of addition is limited by the time required to propagate a carry through the
adder. The sum for each bit position in an elementary adder is generated
sequentially, only after the previous bit position has been summed and a carry
propagated into the next position. The CSLA is used in many computational systems to
alleviate the problem of carry propagation delay by independently generating
multiple carries and then selecting a carry to generate the sum. However, the CSLA
is not area-efficient because it uses multiple pairs of Ripple Carry Adders (RCA) to
generate partial sum and carry by considering carry inputs Cin = 0 and Cin = 1; the
final sum and carry are then selected by multiplexers (mux).
The basic idea of this work is to use a Binary to Excess-1 Converter (BEC) instead
of the RCA with Cin = 1 in the regular CSLA, to achieve lower area and power
consumption. The main advantage of this BEC logic comes from its smaller number of
logic gates compared with the n-bit Full Adder (FA) structure. The SQRT CSLA has
been chosen for comparison with the proposed design as it has a more balanced delay,
and requires lower power and area. The delay and area evaluation methodology of the
regular and modified SQRT CSLA are presented.
4.2 Delay and area evaluation methodology of the basic adder blocks
The AND, OR, and Inverter (AOI) implementation of an XOR gate is shown in
Fig.4.1. The gates between the dotted lines are performing the operations in parallel
and the numeric representation of each gate indicates the delay contributed by that
gate. The delay and area evaluation methodology considers all gates to be made up of
AND, OR, and Inverter, each having delay equal to 1 unit and area equal to 1 unit.
We then add up the number of gates in the longest path of a logic block that
contributes to the maximum delay. The area evaluation is done by counting the total
number of AOI gates required for each logic block. Based on this approach, the CSLA
adder blocks of 2:1 mux, Half Adder (HA), and FA are evaluated and listed in Table
4.I.
As stated above, the main idea of this work is to use the BEC instead of the RCA with
Cin = 1 in order to reduce the area and power consumption of the regular CSLA. To
replace the n-bit RCA, an (n + 1)-bit BEC is required. A structure and the function
table of a 4-bit BEC are shown in Fig.4.2 and Table 4.II, respectively.
Fig.4.2. 4-b BEC.
Fig. 4.3 illustrates how the basic function of the CSLA is obtained by using the 4-bit
BEC together with the mux. One input of the 8:4 mux gets the direct inputs (B3, B2, B1,
and B0) and the other input of the mux is the BEC output. This produces the two
possible partial results in parallel, and the mux is used to select either the BEC output
or the direct inputs according to the control signal Cin. The importance of the BEC
logic stems from the large silicon area reduction when CSLAs with a large number
of bits are designed. The Boolean expressions of the 4-bit BEC are listed below (note
the functional symbols: ~ NOT, & AND, ^ XOR):

X0 = ~B0
X1 = B0 ^ B1
X2 = B2 ^ (B0 & B1)
X3 = B3 ^ (B0 & B1 & B2)
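These four expressions implement "add one" on a 4-bit word. A quick behavioral check in plain Python (an illustrative sketch, not part of the original design) confirms they match B + 1 modulo 16:

```python
def bec4(b3, b2, b1, b0):
    """4-bit Binary to Excess-1 Converter: output = input + 1 (mod 16)."""
    x0 = b0 ^ 1                 # ~B0
    x1 = b0 ^ b1                # B0 ^ B1
    x2 = b2 ^ (b0 & b1)         # B2 ^ (B0 & B1)
    x3 = b3 ^ (b0 & b1 & b2)    # B3 ^ (B0 & B1 & B2)
    return (x3 << 3) | (x2 << 2) | (x1 << 1) | x0

# Exhaustive check over all 16 input codes
for v in range(16):
    bits = [(v >> i) & 1 for i in range(4)]
    assert bec4(bits[3], bits[2], bits[1], bits[0]) == (v + 1) % 16
print("BEC matches +1 for all 16 inputs")
```

The gate count (one inverter, two AND gates, three XOR gates) is visibly smaller than a 4-bit ripple chain of full adders, which is the source of the area saving claimed above.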
Fig.4. 4. Regular 16-b SQRT CSLA.
Fig. 4.5. Delay and area evaluation of regular SQRT CSLA: (a) group2, (b)
group3, (c) group4, and (d) group5. F is a Full Adder.
2) Except for group2, the arrival time of the mux selection input is always greater than
the arrival time of the data outputs from the RCAs.
Thus, the delays of group3 to group5 are determined, respectively, as follows:

{c6, sum[6:4]} = c3[t = 10] + mux
{c10, sum[10:7]} = c6[t = 13] + mux
{cout, sum[15:11]} = c10[t = 16] + mux
3) The one set of 2-bit RCA in group2 has 2 FAs for Cin = 1 and the other set has 1
FA and 1 HA for Cin = 0. Based on the area count of Table 4.I, the total number of
gate counts in group2 is determined as follows:
4) Similarly, the estimated maximum delay and area of the other groups in the regular
SQRT CSLA are evaluated and listed in Table 4.III.
TABLE 4.III
DELAY AND AREA COUNT OF REGULAR SQRT CSLA GROUPS
The structure of the proposed 16-bit SQRT CSLA using the BEC for the RCA with Cin = 1 to
optimize the area and power is shown in Fig.4.6. We again split the structure into five
groups. The delay and area estimation of each group are shown in Fig.4.7. The steps
leading to the evaluation are given here.
1) The group2 [see Fig.4.7(a)] has one 2-bit RCA which has 1 FA and 1 HA for Cin =
0. Instead of another 2-bit RCA with Cin = 1, a 3-bit BEC is used which adds one to the
output from the 2-bit RCA. Based on the delay values of Table 4.I, the
arrival time of the selection input c1[time(t) = 7] of the 6:3 mux is earlier than
s3[t = 9] and c3[t = 10] and later than s2[t = 4]. Thus, sum3 and the final c3 (output
from the mux) depend on s3 and the mux, and on the partial c3 (input to the mux) and
the mux, respectively. The sum2 depends on c1 and the mux.
2) For the remaining groups, the arrival time of the mux selection input is always greater
than the arrival time of the data inputs from the BECs. Thus, the delay of the remaining
groups depends on the arrival time of the mux selection input and the mux delay.
Fig.4.6. Modified 16-b SQRT CSLA. The parallel RCA with Cin = 1 is replaced
with the BEC.
Fig.4. 7. Delay and area evaluation of modified SQRT CSLA: (a) group2, (b)
group3, (c) group4, and (d) group5. H is a Half Adder.
3) The area count of group2 is determined as follows:

Gate count = 43 (FA + HA + Mux + BEC)
FA = 13 (1 × 13)
HA = 6 (1 × 6)
AND = 1
NOT = 1
XOR = 10 (2 × 5)
Mux = 12 (3 × 4)
4) Similarly, the estimated maximum delay and area of the other groups of the
modified SQRT CSLA are evaluated and listed in Table 4.IV.
TABLE 4.IV
DELAY AND AREA COUNT OF MODIFIED SQRT CSLA
Comparing Tables 4.III and 4.IV, it is clear that the proposed modified SQRT CSLA
saves 113 gate areas compared with the regular SQRT CSLA, with only an 11-unit
increase in gate delay. To further evaluate the performance, we have resorted to ASIC
implementation and simulation.
TABLE 4.V
COMPARISON OF THE REGULAR AND MODIFIED SQRT CSLA
Chapter-5
LOGIC FORMULATION BASED CSLA
The BEC-based CSLA involves fewer logic resources than the conventional
CSLA, but it has a marginally higher delay. A CSLA based on common Boolean logic
(CBL) has also been proposed. The CBL-based CSLA involves significantly fewer logic
resources than the conventional CSLA, but it has a longer critical path delay (CPD),
which is almost equal to that of the RCA. To overcome this problem, a SQRT-CSLA
based on CBL was proposed. However, the CBL-based SQRT-CSLA design requires more
logic resources and delay than the BEC-based SQRT-CSLA. We observe that logic
optimization largely depends on the availability of redundant operations in the
formulation, whereas adder delay mainly depends on data dependence. In the existing
designs, logic is optimized without giving any consideration to data dependence. In
this brief, we analyze the logic operations involved in the conventional and BEC-based
CSLAs to study the data dependence and to identify redundant logic operations.
Based on this analysis, we propose a logic formulation for the CSLA. The main
contributions of this brief are a logic formulation based on data dependence and
optimized carry generator (CG) and carry select (CS) designs.
The CSLA has two units: 1) the sum and carry generator (SCG) unit and 2)
the sum and carry selection unit. The SCG unit consumes most of the logic resources
of the CSLA and significantly contributes to the critical path. Different logic designs
have been suggested for efficient implementation of the SCG unit. We made a study of
the logic designs suggested for the SCG unit of the conventional and BEC-based CSLAs
by suitable logic expressions. The main objective of this study is to identify redundant
logic operations and data dependence. Accordingly, we remove all redundant logic
operations and sequence the logic operations based on their data dependence.
Fig.5.1. (a) Conventional CSLA; n is the input operand bit-width. (b) The logic
operations of the RCA shown in split form, where HSG, HCG, FSG, and FCG
represent half-sum generation, half-carry generation, full-sum generation, and full-
carry generation, respectively.
As shown in Fig.5.1(a), the SCG unit of the conventional CSLA is composed of two
n-bit RCAs, where n is the adder bit-width. The logic operation of the n-bit RCA is
performed in four stages: 1) half-sum generation (HSG); 2) half-carry generation
(HCG); 3) full-sum generation (FSG); and 4) full-carry generation (FCG). Suppose
two n-bit operands are added in the conventional CSLA; then RCA-1 and RCA-2
generate the n-bit sums (s0 and s1) and output carries (c0out and c1out) corresponding
to input carries Cin = 0 and Cin = 1, respectively. Logic expressions of RCA-1 and
RCA-2 of the SCG unit of the n-bit CSLA are given as (1a)–(1c) and (2a)–(2c).
5.3. Logic Expression of the SCG Unit of the BEC Based CSLA
Fig.5.2. Structure of the BEC-based CSLA; n is the input operand bit-width.
As shown in Fig.5.2, the RCA calculates the n-bit sum s01 and c0out corresponding to
Cin = 0. The BEC unit receives s01 and c0out from the RCA and generates the (n + 1)-bit
excess-1 code. The most significant bit (MSB) of the BEC output represents c1out, and
the n least significant bits (LSBs) represent s11. The logic expressions of the RCA are
the same as those given in (1a)–(1c). The logic expressions of the BEC unit of the n-bit
BEC-based CSLA are given as (3a)–(3d).
We can find from (1a)–(1c) and (3a)–(3d) that, in the case of the BEC-based CSLA,
c11 depends on s01, whereas c11 has no dependence on s01 in the case of the
conventional CSLA. The BEC method therefore increases the data dependence in the
CSLA. We have considered the logic expressions of the conventional CSLA and made a
further study of the data dependence to find an optimized logic expression for the
CSLA.
It is interesting to note from (1a)–(1c) and (2a)–(2c) that the logic expressions of s01
and s11 are identical except for the terms c01 and c11, since s00 = s10 = s0. In addition,
we find that c01 and c11 depend on {s0, c0, Cin}, where c0 = c00 = c10. Since c01 and
c11 have no dependence on s01 and s11, the logic operations of c01 and c11 can be
scheduled before s01 and s11, and the select unit can select one from the set {s01, s11}
for the final sum of the CSLA. We find that a significant amount of logic resource is
spent calculating {s01, s11}, and it is not an efficient approach to reject one sum-word
after the calculation. Instead, one can select the required carry word from the
anticipated carry words {c0, c1} to calculate the final sum. The selected carry word is
added with the half-sum (s0) to generate the final sum (s). Using this method, one has
three design advantages: 1) the calculation of s01 is avoided in the SCG unit; 2) an
n-bit select unit is required instead of an (n + 1)-bit one; and 3) a small output-carry
delay. All these features result in an area-delay- and energy-efficient design for the
CSLA. We have removed all the redundant logic operations of (1a)–(1c) and (2a)–(2c)
and rearranged the logic expressions based on their data dependence.
The proposed logic formulation for the CSLA is given as (4a)–(4g).
Fig. 5.3.(a) Proposed CS adder design, where n is the input operand bit-width, and [∗]
represents delay (in the unit of inverter delay), n = max(t, 3.5n + 2.7). (b) Gate-level
design of the HSG. (c) Gate-level optimized design of (CG0) for input-carry = 0. (d)
Gate-level optimized design of (CG1) for input-carry = 1. (e) Gate-level design of the CS
unit. (f) Gate-level design of the final-sum generation (FSG) unit.
The proposed CSLA is based on the logic formulation given in (4a)–(4g), and its
structure is shown in Fig.5.3(a). It consists of one HSG unit, one FSG unit, one CG
unit, and one CS unit. The CG unit is composed of two CGs (CG0 and CG1)
corresponding to input-carry '0' and '1'. The HSG receives two n-bit operands (A and
B) and generates the half-sum word s0 and half-carry word c0, of width n bits each.
Both CG0 and CG1 receive s0 and c0 from the HSG unit and generate two n-bit
full-carry words c01 and c11 corresponding to input-carry '0' and '1', respectively. The
logic diagram of the HSG unit is shown in Fig.5.3(b). The logic circuits of CG0 and
CG1 are optimized to take advantage of the fixed input-carry bits. The optimized
designs of CG0 and CG1 are shown in Fig.5.3(c) and (d), respectively.
The CS unit selects one final carry word from the two carry words available at
its input using the control signal Cin. It selects c01 when Cin = 0; otherwise, it
selects c11. The CS unit can be implemented using an n-bit 2-to-1 MUX. However, we
find from the truth table of the CS unit that the carry words c01 and c11 follow a
specific bit pattern: if c01(i) = '1', then c11(i) = '1', irrespective of s0(i) and c0(i), for
0 ≤ i ≤ n − 1. This feature is used for logic optimization of the CS unit. The optimized
design of the CS unit is shown in Fig.5.3(e), which is composed of n AND–OR gates.
The final carry word c is obtained from the CS unit. The MSB of c is sent to the output
as Cout, and the (n − 1) LSBs are XORed with the (n − 1) MSBs of the half-sum (s0)
in the FSG unit [shown in Fig.5.3(f)] to obtain the (n − 1) MSBs of the final sum (s).
The LSB of s0 is XORed with Cin to obtain the LSB of s.
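The data flow described above (HSG, then CG0/CG1 in parallel, then CS, then FSG) can be modeled behaviorally. The following plain-Python sketch follows the description in the text, under the stated assumption that each carry generator evaluates the usual carry recurrence over {s0, c0} with its input carry fixed at 0 or 1:

```python
def proposed_csla(a, b, cin, n):
    """Behavioral model of the HSG -> CG0/CG1 -> CS -> FSG data flow."""
    s0 = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]  # half-sum (HSG)
    c0 = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]  # half-carry (HSG)

    # CG0 / CG1: anticipated full-carry words for input-carry 0 and 1
    c01, c11 = [], []
    carry0, carry1 = 0, 1
    for i in range(n):
        carry0 = c0[i] | (s0[i] & carry0)
        carry1 = c0[i] | (s0[i] & carry1)
        c01.append(carry0)
        c11.append(carry1)

    c = c11 if cin else c01      # CS unit: select one carry word with Cin
    cout = c[n - 1]              # MSB of the selected carry word is Cout
    # FSG: LSB of s0 XOR Cin; remaining bits XOR the selected carries
    s = [s0[0] ^ cin] + [s0[i] ^ c[i - 1] for i in range(1, n)]
    return sum(bit << i for i, bit in enumerate(s)), cout

# Exhaustive check against ordinary addition for 4-bit operands
for a in range(16):
    for b in range(16):
        for cin in (0, 1):
            s, cout = proposed_csla(a, b, cin, 4)
            assert (cout << 4) | s == a + b + cin
print("proposed CSLA model matches a + b + cin for all 4-bit cases")
```

Note how the model reflects the stated advantages: no second sum word is ever computed, and the selection acts on carry words of width n rather than sum words of width n + 1.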
The multipath carry propagation feature of the CSLA is fully exploited in the
SQRT-CSLA, which is composed of a chain of CSLAs. CSLAs of increasing size are
used in the SQRT-CSLA to extract the maximum concurrence in the carry
propagation path. Using the SQRT-CSLA design, large-size adders are implemented
with significantly less delay than a single-stage CSLA of the same size. However, the
carry propagation delay between the CSLA stages of the SQRT-CSLA is critical for
the overall adder delay. Due to the early generation of the output carry with the
multipath carry propagation feature, the proposed CSLA design is more favorable
than the existing CSLA designs for an area-delay-efficient implementation of the
SQRT-CSLA. A 16-bit SQRT-CSLA design using the proposed CSLA is shown in
Fig.5.4, where a 2-bit RCA and 2-, 3-, 4-, and 5-bit CSLAs are used. To optimize
adder delay, we have considered the cascaded configurations (2-bit RCA and 2-, 3-,
4-, 6-, 7-, and 8-bit CSLAs) and (2-bit RCA and 2-, 3-, 4-, 6-, 7-, 8-, 9-, 11-, and
12-bit CSLAs) for the 32-bit and the 64-bit SQRT-CSLA, respectively, to demonstrate
the advantage of the proposed CSLA design in the SQRT-CSLA.
Fig.5.4.Proposed 16-bit SQRT-CSLA
Chapter-6
Verilog HDL
In the semiconductor and electronic design industry, Verilog is a hardware description
language (HDL) used to model electronic systems. Verilog HDL, not to be confused
with VHDL (a competing language), is most commonly used in the design,
verification, and implementation of digital logic chips at the register-transfer level of
abstraction. It is also used in the verification of analog and mixed-signal circuits.
6.1 Overview
Verilog's concept of a 'wire' consists of both signal values (4-state: 1, 0, floating,
undefined) and strengths (strong, weak, etc.). This system allows abstract modeling
of shared signal lines, where multiple sources drive a common net. When a wire has
multiple drivers, the wire's (readable) value is resolved by a function of the source
drivers and their strengths.
6.2 History
6.2.1 Beginning
Verilog was the first modern hardware description language to be invented. It was
created by Phil Moorby and Prabhu Goel during the winter of 1983/1984 at
Automated Integrated Design Systems (renamed Gateway Design Automation in
1985) as a hardware modeling language. Gateway Design Automation was purchased
by Cadence Design Systems in 1990. Cadence now has full proprietary rights to
Gateway's Verilog and to Verilog-XL, the HDL simulator that would become the
de facto standard (of Verilog logic simulators) for the next decade. Originally, Verilog
was intended only to describe designs and allow simulation; support for synthesis was
added later.
6.2.2 Verilog-95
With the increasing success of VHDL at the time, Cadence decided to make the
language available for open standardization. Cadence transferred Verilog into the
public domain under the Open Verilog International (OVI) (now known as Accellera)
organization. Verilog was later submitted to IEEE and became IEEE Standard 1364-
1995, commonly referred to as Verilog-95.
In the same time frame Cadence initiated the creation of Verilog-A to put standards
support behind its analog simulator Spectre. Verilog-A was never intended to be a
standalone language and is a subset of Verilog-AMS which encompassed Verilog-95.
Extensions to Verilog-95 were submitted back to IEEE to cover the deficiencies that
users had found in the original Verilog standard. These extensions became IEEE
Standard 1364-2001 known as Verilog-2001.
6.2.3 Verilog-2001
Verilog-2001 is a significant upgrade from Verilog-95. It adds explicit support for
(2's complement) signed nets and variables; previously, code authors had to
perform signed operations using awkward bit-level manipulations (for example, the
carry-out bit of a simple 8-bit addition required an explicit description of the Boolean
algebra to determine its correct value). The same function under Verilog-2001 can be
described more succinctly by one of the built-in operators: +, -, /, *, >>>. A
generate/endgenerate construct (similar to VHDL's generate/endgenerate) allows
Verilog-2001 to control instance and statement instantiation through normal decision
operators (case/if/else). Using generate/endgenerate, Verilog-2001 can instantiate an
array of instances, with control over the connectivity of the individual instances. File
I/O has been improved by several new system tasks. Finally, a few syntax additions
were introduced to improve code readability (e.g. always @*, named parameter
override, C-style function/task/module header declaration).
Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005)
consists of minor corrections, spec clarifications, and a few new language features
(such as the uwire keyword).
Example
module main;
  initial
    begin
      $display("Hello world!");
      $finish;
    end
endmodule
module toplevel(clock, reset);
  input clock;
  input reset;

  reg flop1;
  reg flop2;

  always @ (posedge reset or posedge clock)
    if (reset)
      begin
        flop1 <= 0;
        flop2 <= 1;
      end
    else
      begin
        flop1 <= flop2;
        flop2 <= flop1;
      end
endmodule
The "<=" operator in Verilog is another aspect of its being a hardware description
language as opposed to a normal procedural language. This is known as a "non-
blocking" assignment: its action is scheduled and does not take effect until the end of
the current simulation time step. This means that the order of the assignments is
irrelevant and will produce the same result: flop1 and flop2 will swap values every
clock.
The other assignment operator, "=", is referred to as a blocking assignment. When "="
assignment is used, for the purposes of logic, the target variable is updated
immediately. In the above example, had the statements used the "=" blocking operator
instead of "<=", flop1 and flop2 would not have been swapped. Instead, as in
traditional programming, the compiler would understand to simply set flop1 equal to
flop2 (and subsequently ignore the redundant logic to set flop2 equal to flop1.)
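The difference can be mimicked in ordinary software terms. The following is a loose Python analogy (not full Verilog scheduling semantics): blocking assignment updates each variable immediately, while non-blocking evaluates every right-hand side first and commits all updates together:

```python
# Blocking-style update: each assignment takes effect immediately
flop1, flop2 = 0, 1
flop1 = flop2          # flop1 becomes 1
flop2 = flop1          # reads the NEW flop1, so flop2 stays 1
print(flop1, flop2)    # 1 1 -- no swap, as with "=" in Verilog

# Non-blocking-style update: evaluate all right-hand sides, then commit
flop1, flop2 = 0, 1
new_flop1, new_flop2 = flop2, flop1   # both read the OLD values
flop1, flop2 = new_flop1, new_flop2
print(flop1, flop2)    # 1 0 -- values swapped, as with "<=" in Verilog
```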
parameter size = 5;
parameter length = 20;
output tc;
endmodule
An example of delays:

...
reg a, b, c, d;
wire e;
...
always @(b or e)
  begin
    a = b & e;
    b = a | b;
    #5 c = b;
    d = #6 c ^ e;
  end
The always clause above illustrates the other method of use, i.e. it executes whenever
any of the entities in its sensitivity list (b or e) changes. When one of these changes,
a is immediately assigned a new value, and due to the blocking assignment, b is
assigned a new value afterward (taking into account the new value of a). After a delay
of 5 time units, c is assigned the value of b, and the value of c ^ e is tucked away in an
invisible store. Then, after 6 more time units, d is assigned the value that was tucked
away.
Signals that are driven from within a process (an initial or always block) must be of
type reg. Signals that are driven from outside a process must be of type wire. The
keyword reg does not necessarily imply a hardware register.
Definition of constants
The definition of constants in Verilog supports the addition of a width parameter. The
basic syntax is:

<width in bits>'<base letter><number>

Examples:

12'h123 - hexadecimal 123 (using 12 bits)
20'd44  - decimal 44 (using 20 bits, with 0 extension)
4'b1010 - binary 1010 (using 4 bits)
Synthesizable constructs
There are several statements in Verilog that have no analog in real hardware, e.g.
$display. Consequently, much of the language cannot be used to describe hardware.
The examples presented here are the classic subset of the language that has a direct
mapping to real gates.
// A 2-to-1 mux described with a case statement
reg out;
always @(a or b or sel)
  begin
    case (sel)
      1'b0: out = b;
      1'b1: out = a;
    endcase
  end
The next interesting structure is a transparent latch; it will pass the input to the output
when the gate signal is set for "pass-through", and captures the input and stores it
upon transition of the gate signal to "hold". The output will remain stable regardless
of the input signal while the gate is set to "hold". In the example below the "pass-
through" level of the gate would be when the value of the if clause is true, i.e. gate =
1. This is read "if gate is true, the din is fed to latch_out continuously." Once the if
clause is false, the last value at latch_out will remain and is independent of the value
of din.
reg out;
always @(gate or din)
  if (gate)
    out = din; // Pass-through state
// Note that the else isn't required here. The variable
// out will follow the value of din while gate is high.
// When gate goes low, out will remain constant.
The flip-flop is the next significant template; in Verilog, the D-flop is the simplest,
and it can be modeled as:
reg q;
always @(posedge clk)
  q <= d;
The significant thing to notice in the example is the use of the non-blocking
assignment. A basic rule of thumb is to use <= when there is a posedge or negedge
statement within the always clause.
A variant of the D-flop is one with an asynchronous reset; there is a convention that
the reset state will be the first if clause within the statement.
reg q;
always @(posedge clk or posedge reset)
  if (reset)
    q <= 0;
  else
    q <= d;
The next variant is including both an asynchronous reset and asynchronous set
condition; again the convention comes into play, i.e. the reset term is followed by the
set term.
reg q;
always @(posedge clk or posedge reset or posedge set)
  if (reset)
    q <= 0;
  else if (set)
    q <= 1;
  else
    q <= d;
Note: If this model is used to model a Set/Reset flip flop then simulation errors can
result. Consider the following test sequence of events. 1) reset goes high 2) clk goes
high 3) set goes high 4) clk goes high again 5) reset goes low followed by 6) set going
low. Assume no setup and hold violations.
In this example the always @ statement would first execute when the rising edge of
reset occurs, which would set q to a value of 0. The next time the always block
executes would be the rising edge of clk, which again would keep q at a value of 0.
The always block then executes when set goes high, which, because reset is high,
forces q to remain at 0. This condition may or may not be correct depending on the
actual flip-flop. However, this is not the main problem with this model. Notice that
when reset goes low, set is still high. In a real flip-flop this will cause the output to go
to a 1. However, in this model it will not occur because the always block is triggered
by rising edges of set and reset, not levels. A different approach may be necessary for
set/reset flip-flops.
The final basic variant is one that implements a D-flop with a mux feeding its input.
The mux has a d-input and feedback from the flop itself. This allows a gated load
function.
Note that there are no "initial" blocks mentioned in this description. There is a split
between FPGA and ASIC synthesis tools on this structure. FPGA tools allow initial
blocks where reg values are established instead of using a "reset" signal. ASIC
synthesis tools don't support such a statement. The reason is that an FPGA's initial
state is something that is downloaded into the memory tables of the FPGA. An ASIC
is an actual hardware implementation.
There are two separate ways of declaring a Verilog process. These are the always and
the initial keywords. The always keyword indicates a free-running process. The initial
keyword indicates a process executes exactly once. Both constructs begin execution at
simulator time 0, and both execute until the end of the block. Once an always block
has reached its end, it is rescheduled (again). It is a common misconception to believe
that an initial block will execute before an always block. In fact, it is better to think of
the initial-block as a special-case of the always-block, one which terminates after it
completes for the first time.
// Examples:
initial
  begin
    a = 1; // Assign a value to reg a at time 0
    #1;    // Wait 1 time unit
    b = a; // Assign the value of reg a to reg b
  end
These are the classic uses for these two keywords, but there are two significant
additional uses. The most common of these is an always keyword without the @(...)
sensitivity list. It is possible to use always as shown below:
always
  begin    // Always begins executing at time 0 and NEVER stops
    clk = 0; // Set clk to 0
    #1;      // Wait for 1 time unit
    clk = 1; // Set clk to 1
    #1;      // Wait 1 time unit
  end      // Keeps executing, so continue back at the top of the begin
The always keyword acts similar to the "C" construct while(1) {..} in the sense that it
will execute forever.
The other interesting exception is the use of the initial keyword with the addition of
the forever keyword.
Fork/join
The fork/join pair are used by Verilog to create parallel processes. All statements (or
blocks) between a fork/join pair begin execution simultaneously upon execution flow
hitting the fork. Execution continues after the join upon completion of the longest
running statement or block between the fork and join.
initial
  fork
    $write("A"); // Print char A
    $write("B"); // Print char B
    begin
      #1;          // Wait 1 time unit
      $write("C"); // Print char C
    end
  join
The way the above is written, it is possible to have either the sequences "ABC" or
"BAC" print out. The order of simulation between the first $write and the second
$write depends on the simulator implementation, and may purposefully be
randomized by the simulator. This allows the simulation to contain both accidental
race conditions as well as intentional non-deterministic behavior.
Notice that VHDL cannot dynamically spawn multiple processes in the way Verilog can.
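As a loose software analogy (plain Python threads, not Verilog's scheduler), a fork/join pair resembles starting several tasks and waiting for all of them, including the longest-running one, to finish:

```python
import threading
import time

out = []
lock = threading.Lock()

def task(label, delay=0.0):
    """One parallel branch: optionally wait, then record its label."""
    time.sleep(delay)
    with lock:
        out.append(label)

# "fork": start all branches at (nearly) the same time
threads = [
    threading.Thread(target=task, args=("A",)),
    threading.Thread(target=task, args=("B",)),
    threading.Thread(target=task, args=("C", 0.05)),  # the delayed branch
]
for t in threads:
    t.start()
# "join": continue only after every branch has completed
for t in threads:
    t.join()

print("".join(out))  # "A" and "B" may appear in either order; "C" comes last
```

As in the Verilog fragment, the ordering of the un-delayed branches is not guaranteed, which mirrors the "ABC" or "BAC" outcome described above.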
Race conditions
The order of execution isn't always guaranteed within Verilog. This can best be
illustrated by a classic example. Consider the code snippet below:
initial
  a = 0;
initial
  b = a;
initial
  begin
    #1;
    $display("Value a=%b Value of b=%b", a, b);
  end
What will be printed out for the values of a and b? Depending on the order of
execution of the initial blocks, it could be zero and zero, or alternately zero and some
other arbitrary uninitialized value. The $display statement will always execute after
both assignment blocks have completed, due to the #1 delay.
Operators

Operator type   Operator symbols   Operation performed
-------------------------------------------------------------------------------
Bitwise         ~                  Bitwise NOT (1's complement)
                &                  Bitwise AND
                |                  Bitwise OR
                ^                  Bitwise XOR
                ~^ or ^~           Bitwise XNOR
Logical         !                  NOT
                &&                 AND
                ||                 OR
Reduction       &                  Reduction AND
                ~&                 Reduction NAND
                |                  Reduction OR
                ~|                 Reduction NOR
                ^                  Reduction XOR
                ~^ or ^~           Reduction XNOR
Arithmetic      +                  Addition
                -                  Subtraction
                -                  2's complement
                *                  Multiplication
                /                  Division
                **                 Exponentiation (*Verilog-2001)
Relational      >                  Greater than
                <                  Less than
                >=                 Greater than or equal to
                <=                 Less than or equal to
                ==                 Logical equality (bit-value 1'bX is removed
                                   from comparison)
                !=                 Logical inequality (bit-value 1'bX is removed
                                   from comparison)
                ===                4-state logical equality (bit-value 1'bX is
                                   taken as literal)
                !==                4-state logical inequality (bit-value 1'bX is
                                   taken as literal)
Shift           >>                 Logical right shift
                <<                 Logical left shift
                >>>                Arithmetic right shift (*Verilog-2001)
                <<<                Arithmetic left shift (*Verilog-2001)
Concatenation   { , }              Concatenation
Replication     {n{m}}             Replicate value m for n times
Conditional     ? :                Conditional
Four-valued logic
The IEEE 1364 standard defines a four-valued logic with four states: 0, 1, Z (high
impedance), and X (unknown logic value). For the competing VHDL, a dedicated
standard for multi-valued logic exists as IEEE 1164 with nine levels.
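How 4-state values combine through a gate can be sketched with a tiny truth-table model. The following plain-Python sketch is illustrative only; it encodes the standard behavior of an AND gate, where a 0 on either input forces the output to 0, while any x or z input otherwise yields x:

```python
def v_and(a, b):
    """AND of two 4-state values '0', '1', 'x', 'z' (z behaves as x on input)."""
    if a == '0' or b == '0':
        return '0'   # a controlling 0 forces the output low, even against x/z
    if a == '1' and b == '1':
        return '1'
    return 'x'       # any x or z input otherwise makes the result unknown

# Print the full 4x4 truth table
for a in '01xz':
    print(a, [v_and(a, b) for b in '01xz'])
```

This "controlling value" rule is why an unknown input does not always poison a result: 0 AND x is still a definite 0.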
Chapter-7
FPGA Implementation
Custom ICs are expensive and take a long time to design, so they are useful only
when produced in bulk. FPGAs, by contrast, are easy to implement within a short
time with the help of Computer Aided Design (CAD) tools (because there is no
physical layout process, no mask making, and no IC manufacturing).
Some disadvantages of FPGAs are that they are slow compared to custom ICs, they
cannot handle very complex designs, and they draw more power.
A Xilinx logic block consists of one Look-Up Table (LUT) and one flip-flop.
An LUT is used to implement a number of different functions. The input lines to the
logic block go into the LUT and enable it. The output of the LUT gives the result of
the logic function that it implements, and the output of the logic block is either the
registered or the unregistered output of the LUT.
LUT-based design provides for better logic block utilization. A k-input LUT-based
logic block can be implemented in a number of different ways with a tradeoff between
performance and logic density. An n-LUT can be seen as a direct implementation of a
function truth table: each latch holds the value of the function corresponding to one
input combination. For example, a 2-LUT can be used to implement any of 16
functions, such as AND, OR, A + ~B, etc.
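The truth-table view of an LUT can be illustrated with a small model (plain Python, purely illustrative): a 2-LUT is just a 4-entry table indexed by the two inputs, and there are 2^4 = 16 possible tables, one per 2-input function:

```python
from itertools import product

def lut2(table, a, b):
    """A 2-input LUT: 'table' lists the output for inputs (0,0),(0,1),(1,0),(1,1)."""
    return table[(a << 1) | b]

AND_TABLE = (0, 0, 0, 1)   # a AND b
OR_TABLE  = (0, 1, 1, 1)   # a OR b
A_OR_NOTB = (1, 0, 1, 1)   # a + ~b

# The same lookup mechanism realizes every function, only the stored bits differ
for a, b in product((0, 1), repeat=2):
    assert lut2(AND_TABLE, a, b) == (a & b)
    assert lut2(OR_TABLE, a, b) == (a | b)
    assert lut2(A_OR_NOTB, a, b) == (a | (b ^ 1))

# Every distinct 4-entry table is a distinct 2-input function: 2**4 = 16 of them
print(len(list(product((0, 1), repeat=4))))  # 16
```

Programming the FPGA amounts to loading the right bits into each such table.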
Interconnects
A wire segment can be described as two end points of an interconnection with
no programmable switch between them. A sequence of one or more wire segments in
an FPGA can be termed as a track.
Typically an FPGA has logic blocks, interconnects, and switch blocks (input/output
blocks). Switch blocks lie in the periphery of the logic blocks and interconnect; wire
segments are connected to logic blocks through switch blocks. Depending on the
required design, one logic block is connected to another, and so on.
In this part we give a short introduction to the FPGA design flow. A simplified
version of the design flow is given in the following diagram.

7.2.1 Design Entry
There are different techniques for design entry: schematic based, Hardware
Description Language (HDL) based, and a combination of both. Selection of a method
depends on the design and the designer. If the designer wants to deal more with
hardware, then schematic entry is the better choice. When the design is complex, or
the designer thinks about the design in an algorithmic way, then HDL is the better
choice. Language-based entry is faster but lags in performance and density.
HDLs represent a level of abstraction that can isolate designers from the details of the
hardware implementation. Schematic-based entry gives designers much more
visibility into the hardware; it is the better choice for those who are hardware
oriented. Another, rarely used, method is state-machine entry. It is the better choice
for designers who think of the design as a series of states, but the tools for
state-machine entry are limited. In this documentation we deal with HDL-based
design entry.
7.2.2 Synthesis
Synthesis is the process that translates VHDL/Verilog code into a device netlist
format, i.e. a complete circuit with logical elements (gates, flip-flops, etc.) for the
design. If the design contains more than one sub-design (for example, to implement a
processor we need a CPU as one design element and RAM as another, and so on),
then the synthesis process generates a netlist for each design element. The synthesis
process checks the code syntax and analyzes the hierarchy of the design, which
ensures that the design is optimized for the design architecture the designer has
selected. The resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for
Xilinx® Synthesis Technology (XST)).
7.2.3 Implementation
Translate
Map
Place and Route
Translate:
This process combines all the input netlists and constraints into a logic design file.
This information is saved as an NGD (Native Generic Database) file, which can be
done using the NGDBuild program. Here, defining constraints means assigning the
ports in the design to the physical elements (e.g. pins, switches, buttons) of the
targeted device and specifying the timing requirements of the design. This
information is stored in a file called the UCF (User Constraints File). Tools used to
create or modify the UCF are PACE, Constraint Editor, etc.
Map:
This process divides the whole circuit of logical elements into sub-blocks such that
they can fit into the FPGA logic blocks. That is, the map process fits the logic defined
by the NGD file into the targeted FPGA elements (Configurable Logic Blocks (CLBs)
and Input/Output Blocks (IOBs)) and generates an NCD (Native Circuit Description)
file which physically represents the design mapped to the components of the FPGA.
The MAP program is used for this purpose.
Place and Route:
The PAR program is used for this process. The place and route process places the
sub-blocks from the map process into logic blocks according to the constraints and
connects the logic blocks. For example, if a sub-block is placed in a logic block which
is very near to an IO pin, it may save time but may affect some other constraint; the
tradeoff between all the constraints is taken into account by the place and route
process. The PAR tool takes the mapped NCD file as input and produces a completely
routed NCD file as output, which contains the routing information.
The RTL (Register Transfer Level) schematic can be viewed as a black box after
synthesis of the design is complete. It shows the inputs and outputs of the system. By
double-clicking on the diagram we can see the gates, flip-flops, and MUXes.
Figure 7.13: RTL schematic of Top-level Carry Select Adder(LF)
Figure 7.15: Technology schematic of top-level Carry Select Adder (LF)
Figure 7.17: Internal block of Carry Select Adder (LF)
7.4 Synthesis Report
Logic Utilization
Logic Distribution
Total Gate count for the Design
The device utilization summary gives the number of devices used out of those
available on the targeted part, also expressed as a percentage. The device
utilization for the used device and package, obtained as a result of the synthesis
process, is shown below.
Table 7-2: Synthesis report of the proposed Carry-Select Adder (LF)
Chapter-8
SIMULATION RESULTS
Figure 8-3: Test bench for 16-bit Carry Select Adder (LF)
Chapter-9
CONCLUSION
A simple approach is proposed in this work to reduce the area and power of the SQRT
CSLA architecture. An analysis of the logic operations of the conventional CSLA
eliminated all of its redundant logic operations and led to a new logic formulation
for the CSLA. In the proposed scheme, the carry-select (CS) operation is scheduled
before the calculation of the final sum, which differs from the conventional
approach. The carry words corresponding to input carries ‘0’ and ‘1’ generated by
the proposed CSLA follow a specific bit pattern, which is used for logic
optimization of the CS unit. The fixed input bits of the carry-generation (CG) unit
are also used for logic optimization. Based on this, optimized designs for the CS
and CG units are obtained, and from these optimized logic units an efficient CSLA
design is derived. The proposed CSLA involves significantly less area and delay
than the recently proposed BEC-based CSLA. Owing to its small carry-output delay,
the proposed CSLA is a good candidate for the SQRT adder. The synthesis results
show that the existing BEC-based SQRT-CSLA design involves 48% more ADP
(area-delay product) and consumes 50% more energy than the proposed SQRT-CSLA, on
average, across different bit-widths.
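As a behavioral illustration of carry-select addition with square-root group sizes, the following Python sketch precomputes each group's sum for carry-in 0 and carry-in 1 and then multiplexes on the actual incoming carry. The group widths and function names are illustrative only; this word-level model shows the selection principle, not the gate-level CS/CG optimizations of the proposed design:

```python
def group_add(ga, gb, width, cin):
    """Add one group of `width` bits; return (sum_bits, carry_out)."""
    total = ga + gb + cin
    return total & ((1 << width) - 1), total >> width

def sqrt_csla_16(a, b, cin=0):
    """Word-level model of a 16-bit square-root carry-select adder."""
    widths = [2, 2, 3, 4, 5]              # square-root grouping, LSB group first
    result, shift, carry = 0, 0, cin
    for w in widths:
        ga = (a >> shift) & ((1 << w) - 1)
        gb = (b >> shift) & ((1 << w) - 1)
        s0, c0 = group_add(ga, gb, w, 0)  # candidate sum for carry-in 0
        s1, c1 = group_add(ga, gb, w, 1)  # candidate sum for carry-in 1
        s, carry = (s1, c1) if carry else (s0, c0)  # carry-select mux
        result |= s << shift
        shift += w
    return result, carry
```

In hardware both candidate sums are computed in parallel; only the multiplexer chain ripples across the groups, which is the path the proposed scheme optimizes.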
REFERENCES
[1] K. K. Parhi, VLSI Digital Signal Processing. New York, NY, USA: Wiley, 1998.
[2] A. P. Chandrakasan, N. Verma, and D. C. Daly, “Ultralow-power electronics for
biomedical applications,” Annu. Rev. Biomed. Eng., vol. 10, pp. 247–274, Aug. 2008.
[3] O. J. Bedrij, “Carry-select adder,” IRE Trans. Electron. Comput., vol. EC-11,
no. 3, pp. 340–344, Jun. 1962.
[4] Y. Kim and L.-S. Kim, “64-bit carry-select adder with reduced area,”
Electron. Lett., vol. 37, no. 10, pp. 614–615, May 2001.
[5] Y. He, C. H. Chang, and J. Gu, “An area-efficient 64-bit square root carry select
adder for low power application,” in Proc. IEEE Int. Symp. Circuits Syst., 2005, vol.
4, pp. 4082–4085.
[6] B. Ramkumar and H. M. Kittur, “Low-power and area-efficient carry-select
adder,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 371–
375, Feb. 2012.
[7] I.-C. Wey, C.-C. Ho, Y.-S. Lin, and C. C. Peng, “An area-efficient carry select
adder design by sharing the common Boolean logic term,” in Proc. IMECS, 2012, pp.
1–4.
[8] S. Manju and V. Sornagopal, “An efficient SQRT architecture of carry select
adder design by common Boolean logic,” in Proc. VLSI ICEVENT, 2013, pp. 1–5.
[9] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 2nd ed.
New York, NY, USA: Oxford Univ. Press, 2010.
[10] B. K. Mohanty and S. K. Patel, “Area–delay–power efficient carry-select
adder,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 6, Jun. 2014.