Chapter-1
INTRODUCTION TO VLSI
1.1 Very-Large-Scale Integration
1.2 History
During the 1920s, several inventors attempted devices that were intended to control
the current in solid-state diodes and so convert them into triodes. Success, however,
had to wait until after World War II. The wartime effort to improve silicon and
germanium crystals for use as radar detectors led to advances both in fabrication
and in the theoretical understanding of the quantum mechanical states of carriers in
semiconductors, and afterwards the scientists who had been diverted to radar
development returned to solid-state devices. With the invention of the transistor at
Bell Labs in 1947, the field of electronics took a new direction, shifting from
power-hungry vacuum tubes to solid-state devices.
With the small and efficient transistor at hand, electrical engineers of the 1950s
saw the possibility of constructing far more advanced circuits than before. However,
as the complexity of the circuits grew, problems started arising.
One such problem was the size of the circuits. A complex circuit, like a computer,
was dependent on speed. If the components of the computer were too large, or the wires
interconnecting them too long, the electric signals couldn't travel fast enough through
the circuit, making the computer too slow to be effective.
Jack Kilby at Texas Instruments found a solution to this problem in 1958. Kilby's idea
was to make all the components and the chip out of the same block (monolith) of
semiconductor material. When his colleagues returned from vacation, Kilby
presented his new idea to his superiors and was allowed to build a test version of his
circuit. In September 1958, he had his first integrated circuit ready. Although this
first integrated circuit was crude and had some problems, the idea was
groundbreaking. By making all the parts out of the same block of material and adding
the metal needed to connect them as a layer on top of it, there was no more need for
individual discrete components: no wires and components had to be assembled
manually. The circuits could be made smaller, and the manufacturing process could be
automated. From here the idea of integrating all components on a single silicon wafer
came into existence, leading to Small-Scale Integration (SSI) in the early 1960s,
Medium-Scale Integration (MSI) in the late 1960s, Large-Scale Integration (LSI),
and in the early 1980s VLSI, with tens of thousands of transistors on a chip
(later hundreds of thousands, and now millions).
1.3 Developments
The first semiconductor chips held two transistors each. Subsequent advances added
more and more transistors, and, as a consequence, more individual functions or
systems were integrated over time. The first integrated circuits held only a few
devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it
possible to fabricate one or more logic gates on a single device. Now known
retrospectively as small-scale integration (SSI), this gave way, as techniques
improved, to devices with hundreds of logic gates, known as medium-scale integration (MSI).
Further improvements led to large-scale integration (LSI), i.e. systems with at least a
thousand logic gates. Current technology has moved far past this mark and today's
microprocessors have many millions of gates and billions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale
integration above VLSI. Terms like ultra-large-scale integration (ULSI) were used.
But the huge number of gates and transistors available on common devices has
rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of
integration are no longer in widespread use.
One example is a processor whose large transistor count is largely due to its
24 MB L3 cache. Current designs,
unlike the earliest devices, use extensive design automation and automated logic
synthesis to lay out the transistors, enabling higher levels of complexity in the
resulting logic functionality. Certain high-performance logic blocks like the SRAM
(Static Random Access Memory) cell, however, are still designed by hand to ensure
the highest efficiency (sometimes by bending or breaking established design rules to
obtain the last bit of performance by trading stability). VLSI technology is moving
towards radical miniaturization with the introduction of NEMS technology. A lot of
problems need to be sorted out before the transition is actually made.
Structured VLSI design was popular in the early 1980s but later lost popularity with
the advent of placement and routing tools, which waste a lot of area on routing; the
waste is tolerated because of the progress of Moore's Law. When introducing the
hardware description language KARL in the mid-1970s, Reiner Hartenstein coined the
term "structured VLSI design" (originally "structured LSI design"), echoing Edsger
Dijkstra's structured programming approach, which uses procedure nesting to avoid
chaotic, spaghetti-structured programs.
1.4 Challenges
Stricter design rules – Due to lithography and etch issues with scaling,
design rules for layout have become increasingly stringent. Designers must
keep ever more of these rules in mind while laying out custom circuits. The
overhead for custom design is now reaching a tipping point, with many design
houses opting to switch to electronic design automation (EDA) tools to
automate their design process.
Timing/design closure – As clock frequencies tend to scale up, designers are
finding it more difficult to distribute and maintain low clock skew between
these high frequency clocks across the entire chip. This has led to a rising
interest in multicore and multiprocessor architectures, since an overall speedup
can be obtained by lowering the clock frequency and distributing processing.
First-pass success – As die sizes shrink (due to scaling), and wafer sizes go
up (to lower manufacturing costs), the number of dies per wafer increases, and
the complexity of making suitable photomasks goes up rapidly. A mask set for
a modern technology can cost several million dollars. This non-recurring
expense deters the old iterative philosophy involving several "spin-cycles" to
find errors in silicon, and encourages first-pass silicon success. Several design
philosophies have been developed to aid this new design flow, including
design for manufacturing (DFM), design for test (DFT), and Design for X.
Gone are the days when huge computers made of vacuum tubes sat humming in entire
dedicated rooms and could do about 360 multiplications of 10 digit numbers in a
second. Though they were heralded as the fastest computing machines of that time,
they surely don’t stand a chance when compared to the modern day machines.
Modern-day computers are getting smaller, faster, cheaper and more power-efficient
with every passing second. But what drove this change? The whole domain of
computing ushered in a new dawn of electronic miniaturization with the advent of the
point-contact transistor by Bardeen and Brattain (1947-48) and then the bipolar
junction transistor by Shockley (1949) at Bell Laboratories.
Since the invention of the first IC (Integrated Circuit) in the form of a flip-flop by
Jack Kilby in 1958, our ability to pack more and more transistors onto a single chip
has doubled roughly every 18 months, in accordance with Moore's Law. Such
exponential growth has never been seen in any other field, and it still continues
to be a major area of research work.
Fig 1.2 A comparison: First Planar IC (1961) and Intel Nehalem Quad Core Die
The development of microelectronics spans a time even shorter than the average life
expectancy of a human, and yet it has seen as many as four generations. The early 60s
saw the low-density fabrication processes classified under Small-Scale Integration
(SSI), in which the transistor count was limited to about 10. This rapidly gave way to
Medium-Scale Integration (MSI) in the late 60s, when around 100 transistors could be
placed on a single chip.
It was the time when the cost of research began to decline and private firms started
entering the competition, in contrast to the earlier years when the main burden was
borne by the military. Transistor-Transistor Logic (TTL), offering higher integration
densities, outlasted other IC families like ECL and became the basis of the first
integrated circuit revolution. It was the production of this family that gave impetus
to semiconductor giants like Texas Instruments, Fairchild and National
Semiconductors. The early seventies marked the growth of the transistor count to about
1000 per chip, called Large-Scale Integration (LSI).
By the mid-eighties, the transistor count on a single chip had already exceeded
10,000, and hence came the age of Very-Large-Scale Integration, or VLSI. Though many
improvements have been made and the transistor count is still rising, further
generation names like ULSI are generally avoided. It was during this time that TTL
lost the battle to the MOS family, owing to the same problems that had pushed vacuum
tubes into obsolescence: power dissipation and the limit it imposed on the number of
gates that could be placed on a single die.
The second age of the integrated circuit revolution started with the introduction of
the first microprocessor, the 4004, by Intel in 1971, followed by the 8080 in 1974.
Today many companies like Texas Instruments, Infineon, Alliance Semiconductors,
Cadence, Synopsys, Celox Networks, Cisco, Micron Tech, National Semiconductors, ST
Microelectronics, Qualcomm, Lucent, Mentor Graphics, Analog Devices, Intel, Philips,
Motorola and many other firms have been established and are dedicated to various
fields in VLSI like Programmable Logic Devices, Hardware Description Languages,
design tools, embedded systems, etc.
VLSI Design
VLSI today chiefly comprises front-end design and back-end design. Front-end design
includes digital design using an HDL, design verification through simulation and
other verification techniques, design from gates, and design for testability.
Back-end design comprises CMOS library design and its characterization, as well as
physical design and fault simulation.
While simple logic gates might be considered SSI devices, and multiplexers and
parity encoders MSI, the world of VLSI is much more diverse. Generally, the
entire design procedure follows a step-by-step approach in which each design step is
followed by simulation before actually being put onto hardware or moving on to
the next step. The major design steps are different levels of abstraction of the
device as a whole:
3. Functional Design: Defines the major functional units of the system, and hence
facilitates the identification of interconnect requirements between units and the
physical and electrical specifications of each unit. A sort of block diagram is
decided upon, with the number of inputs, outputs and timing decided without any
details of the internal structure.
4. Logic Design: The actual logic is developed at this level. Boolean expressions,
control flow, word widths, register allocation, etc. are developed, and the outcome is
called a Register Transfer Level (RTL) description. This part is implemented with
Hardware Description Languages such as VHDL and/or Verilog. Gate minimization
techniques are employed to find the simplest, or rather the smallest, most effective
implementation of the logic.
5. Circuit Design: While the logic design gives the simplified implementation of
the logic, the realization of the circuit in the form of a netlist is done in this
step. Gates, transistors and interconnects are put in place to make a netlist. This
again is a software step, and the outcome is checked via simulation.
6.1 Circuit Partitioning: Because of the huge number of transistors involved, it is not
possible to handle the entire circuit all at once due to limitations on computational
capabilities and memory requirements. Hence the whole circuit is broken down into
blocks which are interconnected.
6.2 Floor Planning and Placement: Choosing the best layout for each block from the
partitioning step, and for the overall chip, is done here: the interconnect area
between the blocks and the exact positioning on the chip are chosen so as to minimize
area while meeting the performance constraints, through an iterative approach.
6.3 Routing: The quality of placement becomes evident only after this step is
completed. Routing involves the completion of the interconnections between
modules. This is completed in two steps. First, connections are completed between
blocks without taking into consideration the exact geometric details of each wire and
pin. Then, a detailed routing step completes the point-to-point connections between
pins on the blocks.
6.4 Layout Compaction: The smaller the chip size can get, the better it is. In this
design step, the layout is compressed from all directions to minimize the chip area,
thereby reducing wire lengths, signal delays and overall cost.
6.5 Extraction and Verification: The circuit is extracted from the layout and
compared with the original netlist; performance verification, reliability
verification and a check of the layout's correctness are done before the final step
of packaging.
7. Packaging: The chips are put together on a Printed Circuit Board or a Multi
Chip Module to obtain the final finished product.
Initially, design can be done with three different methodologies which provide
different levels of freedom of customization to the programmers. The design methods,
in increasing order of customization support, which also means increased amount of
overhead on the part of the programmer, are FPGAs and PLDs, Standard Cell (Semi
Custom) and Full Custom Design.
While FPGAs have built-in libraries and a fabric with interconnections and blocks
already in place, Semi-Custom design allows the placement of blocks in a user-defined
custom fashion with some independence, while most libraries are still available for
development. Full Custom Design adopts a start-from-scratch approach, where the
designer is required to write the whole set of libraries and also has full control
over block development, placement and routing. This is also the usual progression
from entry-level design to professional design.
VLSI is dominated by CMOS technology, and much like other logic families, it has its
limitations, which have been battled and improved upon over the years. Taking the
example of a processor, the process technology rapidly shrank from 180 nm in 1999 to
65 nm and then 45 nm, with attempts being made to reduce it further (32 nm).
Meanwhile, the die area, which had shrunk initially, is now increasing, since a
larger die combined with greater packing density at a smaller feature size means
more transistors on a chip.
As the number of transistors increases, the power dissipation increases, and so does
the noise. In terms of heat generated per unit area, chips have already neared the
levels of a jet engine nozzle. At the same time, scaling threshold voltages beyond a
certain point poses serious limitations on providing low dynamic power dissipation
at increased complexity. The number of metal layers and the interconnects, both
global and local, also tend to get messy at such nano scales.
On the fabrication front, we are fast approaching the optical limit of
photolithographic processes, beyond which the feature size cannot be reduced without
losing accuracy. This has opened up Extreme Ultraviolet Lithography techniques. The
high-speed clocks used now make it hard to reduce clock skew, imposing tight timing
constraints; this has opened up a new frontier in parallel processing. And above all,
we seem to be fast approaching atom-thin gate oxide thicknesses, where only a single
layer of atoms may serve as the oxide layer in CMOS transistors. New alternatives
like Gallium Arsenide technology are becoming an active area of research owing to
this.
Chapter-2
INTRODUCTION TO ADDERS
2.1 Motivation
To humans, decimal numbers are easy to comprehend and use for performing
arithmetic. However, in digital systems such as a microprocessor, DSP (Digital
Signal Processor) or ASIC (Application-Specific Integrated Circuit), binary numbers
are more pragmatic for a given computation, because two-state binary signals map
directly onto the on/off switching of transistors and can be represented and
processed efficiently in hardware.
Binary adders are among the most essential logic elements within a digital
system. In addition, binary adders are also helpful in units other than the
Arithmetic Logic Unit (ALU), such as multipliers, dividers and memory addressing.
Binary addition is so essential that any improvement in it can result in a
performance boost for any computing system and, hence, help improve the
performance of the entire system.
The major problem for binary addition is the carry chain. As the width of the
input operands increases, the length of the carry chain increases. Figure 2.1
demonstrates an example of an 8-bit binary add operation and how the carry chain is
affected. This example shows that the worst case occurs when the carry travels the
longest possible path, from the least significant bit (LSB) to the most significant
bit (MSB). In order to improve the performance of carry-propagate adders, it is
possible to accelerate the carry chain, but not to eliminate it. Consequently,
digital designers often resort to building faster adders when optimizing a computer
architecture, because adders tend to set the critical path for most computations.
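The worst-case ripple described above can be made concrete with a short sketch.
Python is used here purely for illustration, and the function name is ours, not from
the text:

```python
def carry_chain_length(a: int, b: int, width: int = 8) -> int:
    """Count the longest run of consecutive carries when adding a + b bit by bit."""
    carry, longest, run = 0, 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carry_out = (ai & bi) | (ai & carry) | (bi & carry)
        run = run + 1 if carry_out else 0   # length of the current ripple run
        longest = max(longest, run)
        carry = carry_out
    return longest

# Worst case: the carry travels from the LSB all the way to the MSB.
print(carry_chain_length(0b11111111, 0b00000001))  # → 8
```

Adding 1 to an all-ones operand forces a carry out of every bit position, which is
exactly the longest-path case shown in Figure 2.1.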
The binary adder is the critical element in most digital circuit designs,
including digital signal processors (DSP) and microprocessor data path units. As such,
extensive research continues to be focused on improving the power-delay
performance of the adder. In VLSI implementations, parallel-prefix adders are known
to have the best performance. Reconfigurable logic such as Field Programmable Gate
Arrays (FPGAs) has been gaining popularity in recent years because, for many
practical designs involving mobile DSP and telecommunications applications, it
offers improved performance in terms of speed and power over DSP-based and
microprocessor-based solutions, and a significant reduction in development time and
cost over Application Specific Integrated Circuit (ASIC) designs.
2.3 Research Contributions
The implementations developed in this work help to improve the design of carry
select adders and their associated computing architectures. This has the potential
of impacting many application-specific and general-purpose computer architectures;
consequently, this work can impact the designs of many computing systems, as well as
many areas of engineering and science. The practical issues involved in designing and
implementing carry select adders on FPGAs are described. Several carry select adder
structures are implemented and characterized on an FPGA and compared: the CSLA with
Ripple Carry Adders (RCA) and the CSLA with the Binary to Excess-1 Converter.
Finally, some conclusions and suggestions for improving FPGA designs to enable
better carry select adder performance are given.
Chapter-3
BINARY ADDER SCHEMES
Adders are among the most essential components of digital building blocks, and
their performance becomes ever more critical as technology advances. The problem of
addition involves algorithms in Boolean algebra and their respective circuit
implementations. Algorithmically, there are linear-delay adders like the
ripple-carry adder (RCA), which is the most straightforward but slowest. Adders like
the carry-skip adder (CSKA), carry-select adder (CSLA) and carry-increment adder
(CINA) are linear-based adders with an optimized carry chain, and improve upon the
linear chain within a ripple-carry adder. Carry-lookahead adders (CLA) have
logarithmic delay and have evolved into parallel-prefix structures. Other schemes,
like Ling adders, NAND/NOR adders and carry-save adders, can help improve
performance as well.
This chapter gives background information on the architectures of adder
algorithms. In the following sections, the adders are characterized with a linear
gate model, which is a rough estimate of the complexity of a real implementation.
Although this evaluation method can be misleading for VLSI implementers, such
estimates provide sufficient insight to understand the design trade-offs of the
adder algorithms.
The + in the above equation is the regular add operation. However, in the
binary world, only Boolean algebra works. For add-related operations, AND, OR and
Exclusive-OR (XOR) are required. In the following, a dot between two single-bit
variables, e.g. a . b, denotes 'a AND b'. Similarly, a + b denotes 'a OR b' and
a ^ b denotes 'a XOR b'.
Considering the situation of adding two bits, the sum s and carry c can be expressed
using Boolean operations mentioned above.
si = ai ^ bi
ci+1 = ai . bi
The Equation of ci+1 can be implemented as shown in Figure 3.1. In the figure, there
is a half adder, which takes only 2 input bits. The solid line highlights the critical
path, which indicates the longest path from the input to the output.
Equation of ci+1 can be extended to perform full add operation, where there is a carry
input.
si = ai ^ bi ^ ci
ci+1 = ai . bi + ai . ci + bi . ci
A full adder can be built based on the equations above. The block diagram of a
1-bit full adder is shown in Figure 3.2. The full adder is composed of 2 half adders
and an OR gate for computing the carry-out.
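The half-adder and full-adder equations above can be checked exhaustively with a few
lines of Python (used here only as a verification aid, not as part of the design):

```python
def half_adder(a: int, b: int):
    """s = a XOR b, c = a AND b."""
    return a ^ b, a & b

def full_adder(a: int, b: int, cin: int):
    """Two half adders plus an OR gate for the carry-out, as in Figure 3.2."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2

# Exhaustive check against si = ai ^ bi ^ ci and ci+1 = ai.bi + ai.ci + bi.ci
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert s == a ^ b ^ cin
            assert cout == (a & b) | (a & cin) | (b & cin)
print("full adder matches the Boolean equations")
```

All eight input combinations agree with the sum and carry equations, confirming the
two-half-adder construction.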
These are called carry generate and carry propagate, denoted by gi and pi. Another
literal, called the temporary sum ti, is employed as well. The following relations
hold between the inputs and these literals:
gi = ai . bi
pi = ai + bi
ti = ai ^ bi
where i is an integer and 0 ≤ i < n.
With the help of the literals above, the output carry and sum at each bit can be
written as
ci+1 = gi + pi . ci
si = ti ^ ci
In some literature, the carry-propagate pi is replaced with the temporary sum ti
in order to save logic gates. Here the two terms are kept separate in order to
clarify the concepts. For example, Ling adders use only pi as the carry-propagate.
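The recurrence ci+1 = gi + pi.ci together with si = ti ^ ci fully defines addition,
which a small Python sketch can demonstrate (illustrative only; names are ours):

```python
def gpt(a: int, b: int):
    """Per-bit generate, propagate and temporary sum: g = a.b, p = a+b, t = a^b."""
    return a & b, a | b, a ^ b

def add_with_gp(a_bits, b_bits, c0=0):
    """Compute sums via ci+1 = gi + pi.ci and si = ti ^ ci (bit lists LSB first)."""
    c, sums = c0, []
    for ai, bi in zip(a_bits, b_bits):
        g, p, t = gpt(ai, bi)
        sums.append(t ^ c)       # si = ti ^ ci
        c = g | (p & c)          # ci+1 = gi + pi.ci
    return sums, c

# 6 + 3 = 9: bit lists are LSB first, so 0b0110 is [0, 1, 1, 0]
print(add_with_gp([0, 1, 1, 0], [1, 1, 0, 0]))  # → ([1, 0, 0, 1], 0)
```

The returned sum bits read 1001 (MSB last), i.e. 9, as expected.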
The single-bit carry generate/propagate can be extended to the group versions G
and P. The following equations show the inherent relations.
Gi:k = Gi:j + Pi:j . Gj-1:k
Pi:k = Pi:j . Pj-1:k
where i : k denotes the group term from i through k.
Using the group carry generate/propagate, the carry can be expressed as in the
following equation.
ci+1 = Gi:j + Pi:j . cj
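These group relations can be sanity-checked by folding per-bit (g, p) pairs into a
group (G, P) and comparing the predicted carry-out against real addition. The sketch
below is a checking aid in Python, not part of the text's gate-level model:

```python
def bits(x, n=4):
    """Bits of x, LSB first."""
    return [(x >> i) & 1 for i in range(n)]

def group_gp(g, p):
    """Fold per-bit (g, p), LSB first, into a group (G, P) using
    G = g_hi + p_hi.G_lo and P = p_hi.P_lo."""
    G, P = g[0], p[0]
    for gi, pi in zip(g[1:], p[1:]):
        G, P = gi | (pi & G), pi & P
    return G, P

# Verify c4 = G3:0 + P3:0.c0 against true 4-bit addition, for all inputs.
for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            ab = list(zip(bits(a), bits(b)))
            g = [x & y for x, y in ab]
            p = [x | y for x, y in ab]
            G, P = group_gp(g, p)
            assert G | (P & c0) == (a + b + c0) >> 4
print("group generate/propagate relations verified")
```

The exhaustive check over all 4-bit operands confirms that the group carry
expression reproduces the true block carry-out.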
The simplest way of doing binary addition is to connect the carry-out of the
previous bit to the next bit's carry-in. Each bit takes the carry-in as one of its
inputs and outputs a sum bit and a carry-out bit, hence the name ripple-carry adder.
This type of adder is built by cascading 1-bit full adders. A 4-bit ripple-carry
adder is shown in Figure 3.3. Each trapezoidal symbol represents a single-bit full
adder. At the top of the figure, the carry is rippled through the adder from cin to
cout.
Figure 3.3: Ripple-Carry Adder.
It can be observed in Figure 3.3 that the critical path, highlighted with a solid
line, runs from the least significant bit (LSB) of the input (a0 or b0) to the most
significant bit (MSB) of the sum (sn-1). Assume each simple gate, including AND, OR
and XOR, has a delay of 2Δ, a NOT gate has a delay of 1Δ, and all gates have an area
of 1 unit. Using this analysis, and assuming that each add block is built with a
9-gate full adder, the critical path is calculated as follows.
ai, bi → si = 10Δ
ai, bi → ci+1 = 9Δ
ci → si = 5Δ
ci → ci+1 = 4Δ
The critical path, or worst delay, is
trca = {9 + (n - 2) x 4 + 5}Δ = {4n + 6}Δ
As each bit takes 9 gates, the area is simply 9n for an n-bit RCA.
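The RCA delay and area formulas are easy to tabulate; the Python sketch below simply
encodes the expressions derived above (function names are ours):

```python
def rca_delay(n: int) -> int:
    """Critical path of an n-bit ripple-carry adder in gate delays (Δ):
    9Δ to the first carry, 4Δ per intermediate carry, 5Δ for the final sum."""
    return 9 + (n - 2) * 4 + 5    # simplifies to 4n + 6

def rca_area(n: int) -> int:
    """9 gates per full-adder bit."""
    return 9 * n

print(rca_delay(16), rca_area(16))  # → 70 144
```

For the 16-bit case used later in the chapter, this gives a delay of 70Δ and an area
of 144 units.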
3.3 Carry-Select Adders (CSLA)
Simple adders, like ripple-carry adders, are slow, since the carry has to travel
through every full adder block. There is a way to improve the speed by duplicating
the hardware, exploiting the fact that the carry can only be either 0 or 1. The
method is based on the conditional-sum adder and extended to a carry-select adder.
With two RCAs, each computing the case of one polarity of the carry-in, the sum
can be obtained with a 2x1 multiplexer with the carry-in as the select signal. An
example of a 16-bit carry-select adder is shown in Figure 3.4. In the figure, the
adder is grouped into four 4-bit blocks. The 1-bit multiplexors for sum selection
can be implemented as Figure 3.5 shows. The two carry terms are computed assuming
the carry input is given as a constant 0 or 1.
In Figure 3.4, each pair of adjacent 4-bit blocks utilizes the carry relationship
ci+4 = c0i+4 + c1i+4 . ci
The relationship can be verified with the properties of the group carry
generate/propagate; c0i+4 can be written as
c0i+4 = Gi+4:i + Pi+4:i . 0 = Gi+4:i
Similarly, c1i+4 can be written as
c1i+4 = Gi+4:i + Pi+4:i . 1 = Gi+4:i + Pi+4:i
Then
c0i+4 + c1i+4 . ci = Gi+4:i + (Gi+4:i + Pi+4:i) . ci
                   = Gi+4:i + Gi+4:i . ci + Pi+4:i . ci
                   = Gi+4:i + Pi+4:i . ci
                   = ci+4
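The selection relation ci+4 = c0i+4 + c1i+4 . ci can also be checked exhaustively
for every 4-bit block, using true addition as the reference (Python is used here
only for verification):

```python
# For every 4-bit block and carry-in, selecting between the two precomputed
# carries must reproduce the true block carry-out:
#   c0 = carry-out assuming cin = 0, c1 = carry-out assuming cin = 1,
#   and c_out = c0 + c1.cin (the relation derived above).
for a in range(16):
    for b in range(16):
        c0 = (a + b) >> 4          # block carry-out with cin = 0
        c1 = (a + b + 1) >> 4      # block carry-out with cin = 1
        for cin in (0, 1):
            assert c0 | (c1 & cin) == (a + b + cin) >> 4
print("carry-select relation holds for all 4-bit blocks")
```

Since c1 is never smaller than c0, ORing c0 with (c1 AND cin) selects exactly the
correct precomputed carry, which is what the multiplexer-based selection exploits.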
Figure 3.5: 2-1 Multiplexor.
Varying the number of bits in each group works as well for carry-select adders.
The temporary sums can be defined as follows.
s0i+1 = ti+1 ^ c0i
s1i+1 = ti+1 ^ c1i
The final sum is selected by the carry-in between the temporary sums already
calculated.
si+1 = ~cj . s0i+1 + cj . s1i+1
Assuming the block size is fixed at r bits, the n-bit adder is composed of k
groups of r-bit blocks, i.e. n = r x k. The critical path through the first RCA has
a delay of (4r + 5)Δ from the input to the carry-out, and there are k - 2 blocks
that follow, each with a delay of 4Δ for the carry to go through. The final delay
comes from the multiplexor, which has a delay of 5Δ, as indicated in Figure 3.5.
The total delay for this CSEA is calculated as
tcsea = {4r + 5 + 4(k - 2) + 5}Δ
      = {4r + 4k + 2}Δ
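The CSEA and RCA delay formulas can be compared numerically with a small sketch
(illustrative only; the function names are ours):

```python
def csea_delay(r: int, k: int) -> int:
    """{4r + 4k + 2}Δ: first RCA block, (k - 2) carry hops, final mux."""
    return 4 * r + 4 * k + 2

def rca_delay(n: int) -> int:
    """{4n + 6}Δ for the plain ripple-carry adder."""
    return 4 * n + 6

# The 16-bit example from the text: r = 4, k = 4.
print(csea_delay(4, 4), rca_delay(16))  # → 34 70
```

This reproduces the chapter's 16-bit comparison: 34Δ for the carry-select adder
versus 70Δ for the ripple-carry adder.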
The area can be estimated with (2n - r) FAs, (n - r) multiplexors and (k - 1)
AND/OR logic blocks. As mentioned above, each FA has an area of 9 units and a
multiplexor takes 5 units of area. The total area can be estimated as
9(2n - r) + 2(k - 1) + 4(n - r) = 22n - 13r + 2k - 2
The delay of the critical path in the CSEA is reduced at the cost of increased area.
For example, in Figure 3.4, k = 4, r = 4 and n = 16. The delay for the CSEA is 34Δ,
compared to 70Δ for the 16-bit RCA. The area for the CSEA is 310 units, while the
RCA has an area of 144 units. The delay of the CSEA is about half that of the RCA,
but the CSEA has an area more than twice that of the RCA. Each adder can also be
modified to have variable block sizes, which gives better delay and slightly less
area.
In a carry-skip adder, the carry-out of each block is determined by selecting
between the carry-in and Gi:j using Pi:j. When Pi:j = 1, the carry-in cj is allowed
to get through the block immediately. Otherwise, the carry-out is determined by
Gi:j. The CSKA has less delay in the carry chain at the cost of only a little extra
logic. Further improvement can generally be achieved by making the central block
sizes larger and the two end block sizes smaller.
Assuming the n-bit adder is divided evenly into k r-bit blocks, part of the
critical path runs from the LSB input through the MSB output of the final RCA. The
first delay is from the LSB input to the carry-out, which is (4r + 5)Δ. Then, there
are k - 2 skip logic blocks, each with a delay of 3Δ. Each skip logic block includes
one 4-input AND gate for getting Pi+3:i and one AND/OR logic block. The final RCA
has a delay from input to sum at the MSB, which is (4r + 6)Δ. The total delay is
calculated as follows.
tcska = {4r + 5 + 3(k - 2) + 4r + 6}Δ
      = {8r + 3k + 5}Δ
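The CSKA delay formula, encoded as a small sketch (illustrative only):

```python
def cska_delay(r: int, k: int) -> int:
    """{8r + 3k + 5}Δ: ripple through the first block (4r + 5),
    (k - 2) skip stages at 3Δ each, ripple through the last block (4r + 6)."""
    return (4 * r + 5) + 3 * (k - 2) + (4 * r + 6)

# A 16-bit adder split into four 4-bit blocks:
print(cska_delay(4, 4))  # → 49
```

For n = 16, r = 4, k = 4 this gives 49Δ, sitting between the carry-select adder
(34Δ) and the plain ripple-carry adder (70Δ), which matches the intuition that
skipping saves only the mid-chain ripple.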
The CSKA has n full adders and k - 2 skip logic blocks. Each skip logic block has an
area of 3 units. Therefore, the total area is estimated as 9n + 3(k - 2) = 9n + 3k - 6.
The theory of the CLA is based on the following equations. Figure 3.8 shows an
example of a 16-bit carry-lookahead adder. In the figure, each block is fixed at
4 bits. BCLG stands for Block Carry Lookahead Generator, which generates the
generate/propagate signals in group form. For the 4-bit BCLG, the following
equations apply.
Gi+3:i = gi+3 + pi+3 . gi+2 + pi+3 . pi+2 . gi+1 + pi+3 . pi+2 . pi+1 . gi
Pi+3:i = pi+3 . pi+2 . pi+1 . pi
The group generate takes a delay of 4Δ, which is an OR after an AND; the carry-out
can then be computed as follows.
ci+4 = Gi+3:i + Pi+3:i . ci
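The 4-bit BCLG equations can be verified against true addition with a short sketch
(Python used only as a checking aid; it assumes the per-bit definitions g = a.b and
p = a + b from earlier in the chapter):

```python
def bclg4(g, p):
    """4-bit Block Carry Lookahead Generator: group (G, P) from per-bit
    (g, p) lists, LSB first, per the two-level AND-OR equations."""
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P = p[3] & p[2] & p[1] & p[0]
    return G, P

# Check the block carry-out G + P.cin against real 4-bit addition.
for a in range(16):
    for b in range(16):
        g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]
        p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(4)]
        G, P = bclg4(g, p)
        for cin in (0, 1):
            assert G | (P & cin) == (a + b + cin) >> 4
print("4-bit BCLG equations verified")
```

The exhaustive check confirms that the flattened two-level equations compute the
same group terms as folding the per-bit recurrence.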
Figure 3.8: Carry-Lookahead Adder.
The carry computation also has a delay of 4Δ, an OR after an AND. The 4-bit BCLG
has an area of 14 units.
The critical path of the 16-bit CLA can be traced from the input operand through
one RFA, then three BCLGs, and through the final RFA. That is, the critical path
shown in Figure 3.8 is from a0/b0 to s7. The delay will be the same from a0/b0 to
s11 or s15; the critical path grows logarithmically with the group size.
a0, b0 → s7 = 19Δ
The 16-bit CLA is composed of 16 RFAs and 5 BCLGs, which amounts to an area of
16 x 8 + 5 x 14 = 198 units.
Extending the calculation above, a general estimate for delay and area can be
derived. Assume the CLA has n bits, divided into k groups of r-bit blocks. It
requires ⌈log_r n⌉ logic levels. The critical path starts from the input through
p0/g0 generation, the BCLG logic, and the carry-in to sum at the MSB. The
generation of (p, g) takes a delay of 2Δ. The group version of (p, g) generated by
the BCLG has a delay of 4Δ. For each further BCLG level, there is a 4Δ delay going
up and a 4Δ delay coming back down to the next level, totalling 8Δ per level.
Finally, from ck+r to sk+r, there is a delay of 5Δ. Thus, the total delay is
calculated as follows.
tcla = {2 + 8(⌈log_r n⌉ - 1) + 4 + 5}Δ
     = {3 + 8⌈log_r n⌉}Δ
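The logarithmic CLA delay can be tabulated with a sketch that computes ⌈log_r n⌉ by
integer counting (illustrative only; names are ours):

```python
def levels(n: int, r: int) -> int:
    """Smallest L with r**L >= n, i.e. ceil(log_r n), computed with integers."""
    L, size = 0, 1
    while size < n:
        size *= r
        L += 1
    return L

def cla_delay(n: int, r: int) -> int:
    """{3 + 8*ceil(log_r n)}Δ: 2Δ for (p, g), 8Δ per extra BCLG level,
    4Δ for the first group terms, 5Δ for the final sum."""
    return 2 + 8 * (levels(n, r) - 1) + 4 + 5

print(cla_delay(16, 4))  # → 19
```

For n = 16 and r = 4, there are two levels and the formula gives 19Δ, matching the
a0, b0 → s7 path traced in Figure 3.8.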
Chapter-4
Carry Select Adder
4.1 Introduction
Design of area- and power-efficient high-speed data path logic systems is one of the
most substantial areas of research in VLSI system design. In digital adders, the
speed of addition is limited by the time required to propagate a carry through the
adder. The sum for each bit position in an elementary adder is generated
sequentially, only after the previous bit position has been summed and a carry
propagated into the next position. The CSLA is used in many computational systems to
alleviate the problem of carry propagation delay by independently generating
multiple carries and then selecting a carry to generate the sum. However, the CSLA
is not area-efficient because it uses multiple pairs of Ripple Carry Adders (RCA) to
generate partial sum and carry by considering carry inputs Cin = 0 and Cin = 1; the
final sum and carry are then selected by multiplexers (mux).
The basic idea of this work is to use a Binary to Excess-1 Converter (BEC) instead
of the RCA with Cin = 1 in the regular CSLA, to achieve lower area and power
consumption. The main advantage of this BEC logic comes from its smaller number of
logic gates compared with the n-bit Full Adder (FA) structure. The SQRT CSLA has
been chosen for comparison with the proposed design as it has a more balanced delay,
and requires lower power and area. The delay and area evaluation methodology of the
regular and modified SQRT CSLA are presented.
4.2 Delay and area evaluation methodology of the basic adder blocks
The AND, OR, and Inverter (AOI) implementation of an XOR gate is shown in
Fig.4.1. The gates between the dotted lines are performing the operations in parallel
and the numeric representation of each gate indicates the delay contributed by that
gate. The delay and area evaluation methodology considers all gates to be made up of
AND, OR, and Inverter, each having delay equal to 1 unit and area equal to 1 unit.
We then add up the number of gates in the longest path of a logic block that
contributes to the maximum delay. The area evaluation is done by counting the total
number of AOI gates required for each logic block. Based on this approach, the CSLA
adder blocks of 2:1 mux, Half Adder (HA), and FA are evaluated and listed in Table
4.I.
As stated above, the main idea of this work is to use the BEC instead of the RCA with
Cin = 1 in order to reduce the area and power consumption of the regular CSLA. To
replace the n-bit RCA, an (n + 1)-bit BEC is required. A structure and the function
table of a 4-bit BEC are shown in Fig.4.2 and Table 4.II, respectively.
Fig.4.2. 4-b BEC.
Fig. 4.3 illustrates how the basic function of the CSLA is obtained by using the 4-bit
BEC together with the mux. One input of the 8:4 mux gets the direct inputs (B3, B2, B1,
and B0) and the other input of the mux is the BEC output. This produces the two
possible partial results in parallel, and the mux is used to select either the BEC output
or the direct inputs according to the control signal Cin. The importance of the BEC
logic stems from the large silicon area reduction when CSLAs with a large number
of bits are designed. The Boolean expressions of the 4-bit BEC are listed below (note
the functional symbols: ~ NOT, & AND, ^ XOR):

X0 = ~B0
X1 = B0 ^ B1
X2 = B2 ^ (B0 & B1)
X3 = B3 ^ (B0 & B1 & B2)
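These four expressions implement "add one" on a 4-bit word. A quick behavioral check in plain Python (an illustrative sketch, not part of the original design) confirms they match B + 1 modulo 16:

```python
def bec4(b3, b2, b1, b0):
    """4-bit Binary to Excess-1 Converter: output = input + 1 (mod 16)."""
    x0 = b0 ^ 1                 # ~B0
    x1 = b0 ^ b1                # B0 ^ B1
    x2 = b2 ^ (b0 & b1)         # B2 ^ (B0 & B1)
    x3 = b3 ^ (b0 & b1 & b2)    # B3 ^ (B0 & B1 & B2)
    return (x3 << 3) | (x2 << 2) | (x1 << 1) | x0

# Exhaustive check over all 16 input codes
for v in range(16):
    bits = [(v >> i) & 1 for i in range(4)]
    assert bec4(bits[3], bits[2], bits[1], bits[0]) == (v + 1) % 16
print("BEC matches +1 for all 16 inputs")
```

The gate count (one inverter, two AND gates, three XOR gates) is visibly smaller than a 4-bit ripple chain of full adders, which is the source of the area saving claimed above.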
Fig.4. 4. Regular 16-b SQRT CSLA.
Fig. 4.5. Delay and area evaluation of regular SQRT CSLA: (a) group2, (b)
group3, (c) group4, and (d) group5. F is a Full Adder.
2) Except for group2, the arrival time of the mux selection input is always greater than
the arrival time of the data outputs from the RCAs.
Thus, the delays of group3 to group5 are determined, respectively, as follows:

{c6, sum[6:4]} = c3[t = 10] + mux
{c10, sum[10:7]} = c6[t = 13] + mux
{cout, sum[15:11]} = c10[t = 16] + mux
3) The one set of 2-bit RCA in group2 has 2 FAs for Cin = 1 and the other set has 1
FA and 1 HA for Cin = 0. Based on the area count of Table 4.I, the total number of
gate counts in group2 is determined as follows:
4) Similarly, the estimated maximum delay and area of the other groups in the regular
SQRT CSLA are evaluated and listed in Table 4.III.
TABLE 4.III
DELAY AND AREA COUNT OF REGULAR SQRT CSLA GROUPS
The structure of the proposed 16-bit SQRT CSLA using the BEC for the RCA with Cin = 1 to
optimize the area and power is shown in Fig.4.6. We again split the structure into five
groups. The delay and area estimation of each group are shown in Fig.4.7. The steps
leading to the evaluation are given here.
1) The group2 [see Fig.4.7(a)] has one 2-bit RCA which has 1 FA and 1 HA for Cin =
0. Instead of another 2-bit RCA with Cin = 1, a 3-bit BEC is used which adds one to the
output from the 2-bit RCA. Based on the delay values of Table 4.I, the
arrival time of the selection input c1[time(t) = 7] of the 6:3 mux is earlier than
s3[t = 9] and c3[t = 10] and later than s2[t = 4]. Thus, sum3 and the final c3 (output
from the mux) depend on s3 and the mux, and on the partial c3 (input to the mux) and
the mux, respectively. The sum2 depends on c1 and the mux.
2) For the remaining groups, the arrival time of the mux selection input is always greater
than the arrival time of the data inputs from the BECs. Thus, the delay of the remaining
groups depends on the arrival time of the mux selection input and the mux delay.
Fig.4.6. Modified 16-b SQRT CSLA. The parallel RCA with Cin = 1 is replaced
with the BEC.
Fig.4. 7. Delay and area evaluation of modified SQRT CSLA: (a) group2, (b)
group3, (c) group4, and (d) group5. H is a Half Adder.
3) The area count of group2 is determined as follows:

Gate count = 43 (FA + HA + Mux + BEC)
FA = 13 (1 × 13)
HA = 6 (1 × 6)
AND = 1
NOT = 1
XOR = 10 (2 × 5)
Mux = 12 (3 × 4)
4) Similarly, the estimated maximum delay and area of the other groups of the
modified SQRT CSLA are evaluated and listed in Table 4.IV.
TABLE 4.IV
DELAY AND AREA COUNT OF MODIFIED SQRT CSLA
Comparing Tables 4.III and 4.IV, it is clear that the proposed modified SQRT CSLA
saves 113 gate areas compared with the regular SQRT CSLA, with only an 11-unit
increase in gate delay. To further evaluate the performance, we have resorted to ASIC
implementation and simulation.
TABLE 4.V
COMPARISON OF THE REGULAR AND MODIFIED SQRT CSLA
Chapter-5
LOGIC FORMULATION BASED CSLA
The BEC-based CSLA involves fewer logic resources than the conventional
CSLA, but it has a marginally higher delay. A CSLA based on common Boolean logic
(CBL) has also been proposed. The CBL-based CSLA involves significantly fewer logic
resources than the conventional CSLA, but it has a longer critical path delay (CPD),
which is almost equal to that of the RCA. To overcome this problem, a SQRT-CSLA
based on CBL was proposed. However, the CBL-based SQRT-CSLA design requires more
logic resources and delay than the BEC-based SQRT-CSLA. We observe that logic
optimization largely depends on the availability of redundant operations in the
formulation, whereas adder delay mainly depends on data dependence. In the existing
designs, logic is optimized without giving any consideration to data dependence. In
this brief, we analyze the logic operations involved in the conventional and BEC-based
CSLAs to study the data dependence and to identify redundant logic operations.
Based on this analysis, we propose a logic formulation for the CSLA. The main
contributions of this brief are a logic formulation based on data dependence and
optimized carry generator (CG) and carry select (CS) designs.
The CSLA has two units: 1) the sum and carry generator (SCG) unit and 2)
the sum and carry selection unit. The SCG unit consumes most of the logic resources
of the CSLA and significantly contributes to the critical path. Different logic designs
have been suggested for efficient implementation of the SCG unit. We made a study of
the logic designs suggested for the SCG unit of the conventional and BEC-based CSLAs
by suitable logic expressions. The main objective of this study is to identify redundant
logic operations and data dependence. Accordingly, we remove all redundant logic
operations and sequence the logic operations based on their data dependence.
Fig.5.1. (a) Conventional CSLA; n is the input operand bit-width. (b) The logic
operations of the RCA shown in split form, where HSG, HCG, FSG, and FCG
represent half-sum generation, half-carry generation, full-sum generation, and full-
carry generation, respectively.
As shown in Fig.5.1(a), the SCG unit of the conventional CSLA is composed of two
n-bit RCAs, where n is the adder bit-width. The logic operation of the n-bit RCA is
performed in four stages: 1) half-sum generation (HSG); 2) half-carry generation
(HCG); 3) full-sum generation (FSG); and 4) full-carry generation (FCG). Suppose
two n-bit operands are added in the conventional CSLA; then RCA-1 and RCA-2
generate the n-bit sums (s0 and s1) and output carries (c0out and c1out) corresponding
to input carries Cin = 0 and Cin = 1, respectively. Logic expressions of RCA-1 and
RCA-2 of the SCG unit of the n-bit CSLA are given as (1a)–(1c) and (2a)–(2c).
5.3. Logic Expression of the SCG Unit of the BEC Based CSLA
Fig.5.2. Structure of the BEC-based CSLA; n is the input operand bit-width.
As shown in Fig.5.2, the RCA calculates the n-bit sum s01 and c0out corresponding to
Cin = 0. The BEC unit receives s01 and c0out from the RCA and generates the (n + 1)-bit
excess-1 code. The most significant bit (MSB) of the BEC output represents c1out, and
the n least significant bits (LSBs) represent s11. The logic expressions of the RCA are
the same as those given in (1a)–(1c). The logic expressions of the BEC unit of the n-bit
BEC-based CSLA are given as (3a)–(3d).
We can find from (1a)–(1c) and (3a)–(3d) that, in the case of the BEC-based CSLA,
c11 depends on s01, whereas c11 has no dependence on s01 in the case of the
conventional CSLA. The BEC method therefore increases the data dependence in the
CSLA. We have considered the logic expressions of the conventional CSLA and made a
further study of the data dependence to find an optimized logic expression for the
CSLA.
It is interesting to note from (1a)–(1c) and (2a)–(2c) that the logic expressions of s01
and s11 are identical except for the terms c01 and c11, since s00 = s10 = s0. In addition,
we find that c01 and c11 depend on {s0, c0, Cin}, where c0 = c00 = c10. Since c01 and
c11 have no dependence on s01 and s11, the logic operations of c01 and c11 can be
scheduled before s01 and s11, and the select unit can select one from the set {s01, s11}
for the final sum of the CSLA. We find that a significant amount of logic resource is
spent calculating {s01, s11}, and it is not an efficient approach to reject one sum-word
after the calculation. Instead, one can select the required carry word from the
anticipated carry words {c0, c1} to calculate the final sum. The selected carry word is
added with the half-sum (s0) to generate the final sum (s). Using this method, one has
three design advantages: 1) the calculation of s01 is avoided in the SCG unit; 2) an
n-bit select unit is required instead of an (n + 1)-bit one; and 3) a small output-carry
delay. All these features result in an area-delay- and energy-efficient design for the
CSLA. We have removed all the redundant logic operations of (1a)–(1c) and (2a)–(2c)
and rearranged the logic expressions based on their data dependence.
The proposed logic formulation for the CSLA is given as (4a)–(4g).
Fig. 5.3.(a) Proposed CS adder design, where n is the input operand bit-width, and [∗]
represents delay (in the unit of inverter delay), n = max(t, 3.5n + 2.7). (b) Gate-level
design of the HSG. (c) Gate-level optimized design of (CG0) for input-carry = 0. (d)
Gate-level optimized design of (CG1) for input-carry = 1. (e) Gate-level design of the CS
unit. (f) Gate-level design of the final-sum generation (FSG) unit.
The proposed CSLA is based on the logic formulation given in (4a)–(4g), and its
structure is shown in Fig.5.3(a). It consists of one HSG unit, one FSG unit, one CG
unit, and one CS unit. The CG unit is composed of two CGs (CG0 and CG1)
corresponding to input-carry '0' and '1'. The HSG receives two n-bit operands (A and
B) and generates the half-sum word s0 and half-carry word c0, of width n bits each.
Both CG0 and CG1 receive s0 and c0 from the HSG unit and generate two n-bit
full-carry words c01 and c11 corresponding to input-carry '0' and '1', respectively. The
logic diagram of the HSG unit is shown in Fig.5.3(b). The logic circuits of CG0 and
CG1 are optimized to take advantage of the fixed input-carry bits. The optimized
designs of CG0 and CG1 are shown in Fig.5.3(c) and (d), respectively.
The CS unit selects one final carry word from the two carry words available at
its input using the control signal Cin. It selects c01 when Cin = 0; otherwise, it
selects c11. The CS unit can be implemented using an n-bit 2-to-1 MUX. However, we
find from the truth table of the CS unit that the carry words c01 and c11 follow a
specific bit pattern: if c01(i) = '1', then c11(i) = '1', irrespective of s0(i) and c0(i), for
0 ≤ i ≤ n − 1. This feature is used for logic optimization of the CS unit. The optimized
design of the CS unit is shown in Fig.5.3(e), which is composed of n AND–OR gates.
The final carry word c is obtained from the CS unit. The MSB of c is sent to the output
as Cout, and the (n − 1) LSBs are XORed with the (n − 1) MSBs of the half-sum (s0)
in the FSG unit [shown in Fig.5.3(f)] to obtain the (n − 1) MSBs of the final sum (s).
The LSB of s0 is XORed with Cin to obtain the LSB of s.
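The data flow described above (HSG, then CG0/CG1 in parallel, then CS, then FSG) can be modeled behaviorally. The following plain-Python sketch follows the description in the text, under the stated assumption that each carry generator evaluates the usual carry recurrence over {s0, c0} with its input carry fixed at 0 or 1:

```python
def proposed_csla(a, b, cin, n):
    """Behavioral model of the HSG -> CG0/CG1 -> CS -> FSG data flow."""
    s0 = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]  # half-sum (HSG)
    c0 = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]  # half-carry (HSG)

    # CG0 / CG1: anticipated full-carry words for input-carry 0 and 1
    c01, c11 = [], []
    carry0, carry1 = 0, 1
    for i in range(n):
        carry0 = c0[i] | (s0[i] & carry0)
        carry1 = c0[i] | (s0[i] & carry1)
        c01.append(carry0)
        c11.append(carry1)

    c = c11 if cin else c01      # CS unit: select one carry word with Cin
    cout = c[n - 1]              # MSB of the selected carry word is Cout
    # FSG: LSB of s0 XOR Cin; remaining bits XOR the selected carries
    s = [s0[0] ^ cin] + [s0[i] ^ c[i - 1] for i in range(1, n)]
    return sum(bit << i for i, bit in enumerate(s)), cout

# Exhaustive check against ordinary addition for 4-bit operands
for a in range(16):
    for b in range(16):
        for cin in (0, 1):
            s, cout = proposed_csla(a, b, cin, 4)
            assert (cout << 4) | s == a + b + cin
print("proposed CSLA model matches a + b + cin for all 4-bit cases")
```

Note how the model reflects the stated advantages: no second sum word is ever computed, and the selection acts on carry words of width n rather than sum words of width n + 1.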
The multipath carry propagation feature of the CSLA is fully exploited in the
SQRT-CSLA, which is composed of a chain of CSLAs. CSLAs of increasing size are
used in the SQRT-CSLA to extract the maximum concurrence in the carry
propagation path. Using the SQRT-CSLA design, large-size adders are implemented
with significantly less delay than a single-stage CSLA of the same size. However, the
carry propagation delay between the CSLA stages of the SQRT-CSLA is critical for
the overall adder delay. Due to the early generation of the output carry with the
multipath carry propagation feature, the proposed CSLA design is more favorable
than the existing CSLA designs for an area-delay-efficient implementation of the
SQRT-CSLA. A 16-bit SQRT-CSLA design using the proposed CSLA is shown in
Fig.5.4, where a 2-bit RCA and 2-, 3-, 4-, and 5-bit CSLAs are used. To optimize
adder delay, we have considered the cascaded configurations (2-bit RCA and 2-, 3-,
4-, 6-, 7-, and 8-bit CSLAs) and (2-bit RCA and 2-, 3-, 4-, 6-, 7-, 8-, 9-, 11-, and
12-bit CSLAs) for the 32-bit and the 64-bit SQRT-CSLA, respectively, to demonstrate
the advantage of the proposed CSLA design in the SQRT-CSLA.
Fig.5.4.Proposed 16-bit SQRT-CSLA
Chapter-6
Verilog HDL
In the semiconductor and electronic design industry, Verilog is a hardware description
language (HDL) used to model electronic systems. Verilog HDL, not to be confused
with VHDL (a competing language), is most commonly used in the design,
verification, and implementation of digital logic chips at the register-transfer level of
abstraction. It is also used in the verification of analog and mixed-signal circuits.
6.1 Overview
Verilog's concept of a 'wire' consists of both signal values (4-state: 1, 0, floating,
undefined) and strengths (strong, weak, etc.). This system allows abstract modeling
of shared signal lines, where multiple sources drive a common net. When a wire has
multiple drivers, the wire's (readable) value is resolved by a function of the source
drivers and their strengths.
6.2 History
6.2.1 Beginning
Verilog was the first modern hardware description language to be invented. It was
created by Phil Moorby and Prabhu Goel during the winter of 1983/1984 at
Automated Integrated Design Systems (renamed Gateway Design Automation in
1985) as a hardware modeling language. Gateway Design Automation was purchased
by Cadence Design Systems in 1990. Cadence now has full proprietary rights to
Gateway's Verilog and to Verilog-XL, the HDL simulator that would become the
de facto standard (of Verilog logic simulators) for the next decade. Originally, Verilog
was intended only to describe designs and allow simulation; support for synthesis was
added later.
6.2.2 Verilog-95
With the increasing success of VHDL at the time, Cadence decided to make the
language available for open standardization. Cadence transferred Verilog into the
public domain under the Open Verilog International (OVI) (now known as Accellera)
organization. Verilog was later submitted to IEEE and became IEEE Standard 1364-
1995, commonly referred to as Verilog-95.
In the same time frame Cadence initiated the creation of Verilog-A to put standards
support behind its analog simulator Spectre. Verilog-A was never intended to be a
standalone language and is a subset of Verilog-AMS which encompassed Verilog-95.
Extensions to Verilog-95 were submitted back to IEEE to cover the deficiencies that
users had found in the original Verilog standard. These extensions became IEEE
Standard 1364-2001 known as Verilog-2001.
6.2.3 Verilog-2001
Verilog-2001 is a significant upgrade from Verilog-95. It adds explicit support for
(2's complement) signed nets and variables; previously, code authors had to
perform signed operations using awkward bit-level manipulations (for example, the
carry-out bit of a simple 8-bit addition required an explicit description of the Boolean
algebra to determine its correct value). The same function under Verilog-2001 can be
described more succinctly by one of the built-in operators: +, -, /, *, >>>. A
generate/endgenerate construct (similar to VHDL's generate/endgenerate) allows
Verilog-2001 to control instance and statement instantiation through normal decision
operators (case/if/else). Using generate/endgenerate, Verilog-2001 can instantiate an
array of instances, with control over the connectivity of the individual instances. File
I/O has been improved by several new system tasks. Finally, a few syntax additions
were introduced to improve code readability (e.g. always @*, named parameter
override, C-style function/task/module header declaration).
Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005)
consists of minor corrections, spec clarifications, and a few new language features
(such as the uwire keyword).
Example
module main;
  initial
    begin
      $display("Hello world!");
      $finish;
    end
endmodule
module toplevel(clock, reset);
  input clock;
  input reset;

  reg flop1;
  reg flop2;

  always @ (posedge reset or posedge clock)
    if (reset)
      begin
        flop1 <= 0;
        flop2 <= 1;
      end
    else
      begin
        flop1 <= flop2;
        flop2 <= flop1;
      end
endmodule
The "<=" operator in Verilog is another aspect of its being a hardware description
language as opposed to a normal procedural language. This is known as a "non-
blocking" assignment: its action is scheduled and does not take effect until the end of
the current simulation time step. This means that the order of the assignments is
irrelevant and will produce the same result: flop1 and flop2 will swap values every
clock.
The other assignment operator, "=", is referred to as a blocking assignment. When "="
assignment is used, for the purposes of logic, the target variable is updated
immediately. In the above example, had the statements used the "=" blocking operator
instead of "<=", flop1 and flop2 would not have been swapped. Instead, as in
traditional programming, the compiler would understand to simply set flop1 equal to
flop2 (and subsequently ignore the redundant logic to set flop2 equal to flop1.)
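The difference can be mimicked in ordinary software terms. The following is a loose Python analogy (not full Verilog scheduling semantics): blocking assignment updates each variable immediately, while non-blocking evaluates every right-hand side first and commits all updates together:

```python
# Blocking-style update: each assignment takes effect immediately
flop1, flop2 = 0, 1
flop1 = flop2          # flop1 becomes 1
flop2 = flop1          # reads the NEW flop1, so flop2 stays 1
print(flop1, flop2)    # 1 1 -- no swap, as with "=" in Verilog

# Non-blocking-style update: evaluate all right-hand sides, then commit
flop1, flop2 = 0, 1
new_flop1, new_flop2 = flop2, flop1   # both read the OLD values
flop1, flop2 = new_flop1, new_flop2
print(flop1, flop2)    # 1 0 -- values swapped, as with "<=" in Verilog
```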
parameter size = 5;
parameter length = 20;
output tc;
endmodule
An example of delays:

...
reg a, b, c, d;
wire e;
...
always @(b or e)
  begin
    a = b & e;
    b = a | b;
    #5 c = b;
    d = #6 c ^ e;
  end
The always clause above illustrates the other method of use, i.e. it executes whenever
any of the entities in its sensitivity list (b or e) changes. When one of these changes,
a is immediately assigned a new value, and due to the blocking assignment, b is
assigned a new value afterward (taking into account the new value of a). After a delay
of 5 time units, c is assigned the value of b, and the value of c ^ e is tucked away in an
invisible store. Then, after 6 more time units, d is assigned the value that was tucked
away.
Signals that are driven from within a process (an initial or always block) must be of
type reg. Signals that are driven from outside a process must be of type wire. The
keyword reg does not necessarily imply a hardware register.
Definition of constants
The definition of constants in Verilog supports the addition of a width parameter. The
basic syntax is:

<width in bits>'<base letter><number>

Examples:

12'h123 - hexadecimal 123 (using 12 bits)
20'd44  - decimal 44 (using 20 bits, with 0 extension)
4'b1010 - binary 1010 (using 4 bits)
Synthesizable constructs
There are several statements in Verilog that have no analog in real hardware, e.g.
$display. Consequently, much of the language cannot be used to describe hardware.
The examples presented here are the classic subset of the language that has a direct
mapping to real gates.
// A 2-to-1 mux described with a case statement
reg out;
always @(a or b or sel)
  begin
    case (sel)
      1'b0: out = b;
      1'b1: out = a;
    endcase
  end
The next interesting structure is a transparent latch; it will pass the input to the output
when the gate signal is set for "pass-through", and captures the input and stores it
upon transition of the gate signal to "hold". The output will remain stable regardless
of the input signal while the gate is set to "hold". In the example below the "pass-
through" level of the gate would be when the value of the if clause is true, i.e. gate =
1. This is read "if gate is true, the din is fed to latch_out continuously." Once the if
clause is false, the last value at latch_out will remain and is independent of the value
of din.
reg out;
always @(gate or din)
  if (gate)
    out = din; // Pass-through state
// Note that the else isn't required here. The variable
// out will follow the value of din while gate is high.
// When gate goes low, out will remain constant.
The flip-flop is the next significant template; in Verilog, the D-flop is the simplest,
and it can be modeled as:
reg q;
always @(posedge clk)
  q <= d;
The significant thing to notice in the example is the use of the non-blocking
assignment. A basic rule of thumb is to use <= when there is a posedge or negedge
statement within the always clause.
A variant of the D-flop is one with an asynchronous reset; there is a convention that
the reset state will be the first if clause within the statement.
reg q;
always @(posedge clk or posedge reset)
  if (reset)
    q <= 0;
  else
    q <= d;
The next variant is including both an asynchronous reset and asynchronous set
condition; again the convention comes into play, i.e. the reset term is followed by the
set term.
reg q;
always @(posedge clk or posedge reset or posedge set)
  if (reset)
    q <= 0;
  else if (set)
    q <= 1;
  else
    q <= d;
Note: If this model is used to model a Set/Reset flip flop then simulation errors can
result. Consider the following test sequence of events. 1) reset goes high 2) clk goes
high 3) set goes high 4) clk goes high again 5) reset goes low followed by 6) set going
low. Assume no setup and hold violations.
In this example the always @ statement would first execute when the rising edge of
reset occurs, which would set q to a value of 0. The next time the always block
executes would be the rising edge of clk, which again would keep q at a value of 0.
The always block then executes when set goes high, which, because reset is high,
forces q to remain at 0. This condition may or may not be correct depending on the
actual flip-flop. However, this is not the main problem with this model. Notice that
when reset goes low, set is still high. In a real flip-flop this will cause the output to go
to a 1. However, in this model it will not occur because the always block is triggered
by rising edges of set and reset, not levels. A different approach may be necessary for
set/reset flip-flops.
The final basic variant is one that implements a D-flop with a mux feeding its input.
The mux has a d-input and feedback from the flop itself. This allows a gated load
function.
Note that there are no "initial" blocks mentioned in this description. There is a split
between FPGA and ASIC synthesis tools on this structure. FPGA tools allow initial
blocks where reg values are established instead of using a "reset" signal. ASIC
synthesis tools don't support such a statement. The reason is that an FPGA's initial
state is something that is downloaded into the memory tables of the FPGA. An ASIC
is an actual hardware implementation.
There are two separate ways of declaring a Verilog process. These are the always and
the initial keywords. The always keyword indicates a free-running process. The initial
keyword indicates a process executes exactly once. Both constructs begin execution at
simulator time 0, and both execute until the end of the block. Once an always block
has reached its end, it is rescheduled (again). It is a common misconception to believe
that an initial block will execute before an always block. In fact, it is better to think of
the initial-block as a special-case of the always-block, one which terminates after it
completes for the first time.
// Examples:
initial
  begin
    a = 1; // Assign a value to reg a at time 0
    #1;    // Wait 1 time unit
    b = a; // Assign the value of reg a to reg b
  end
These are the classic uses for these two keywords, but there are two significant
additional uses. The most common of these is an always keyword without the @(...)
sensitivity list. It is possible to use always as shown below:
always
  begin    // Always begins executing at time 0 and NEVER stops
    clk = 0; // Set clk to 0
    #1;      // Wait for 1 time unit
    clk = 1; // Set clk to 1
    #1;      // Wait 1 time unit
  end      // Keeps executing, so continue back at the top of the begin
The always keyword acts similar to the "C" construct while(1) {..} in the sense that it
will execute forever.
The other interesting exception is the use of the initial keyword with the addition of
the forever keyword.
Fork/join
The fork/join pair are used by Verilog to create parallel processes. All statements (or
blocks) between a fork/join pair begin execution simultaneously upon execution flow
hitting the fork. Execution continues after the join upon completion of the longest
running statement or block between the fork and join.
initial
  fork
    $write("A"); // Print char A
    $write("B"); // Print char B
    begin
      #1;          // Wait 1 time unit
      $write("C"); // Print char C
    end
  join
The way the above is written, it is possible to have either the sequences "ABC" or
"BAC" print out. The order of simulation between the first $write and the second
$write depends on the simulator implementation, and may purposefully be
randomized by the simulator. This allows the simulation to contain both accidental
race conditions as well as intentional non-deterministic behavior.
Notice that VHDL cannot dynamically spawn multiple processes in the way Verilog can.
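As a loose software analogy (plain Python threads, not Verilog's scheduler), a fork/join pair resembles starting several tasks and waiting for all of them, including the longest-running one, to finish:

```python
import threading
import time

out = []
lock = threading.Lock()

def task(label, delay=0.0):
    """One parallel branch: optionally wait, then record its label."""
    time.sleep(delay)
    with lock:
        out.append(label)

# "fork": start all branches at (nearly) the same time
threads = [
    threading.Thread(target=task, args=("A",)),
    threading.Thread(target=task, args=("B",)),
    threading.Thread(target=task, args=("C", 0.05)),  # the delayed branch
]
for t in threads:
    t.start()
# "join": continue only after every branch has completed
for t in threads:
    t.join()

print("".join(out))  # "A" and "B" may appear in either order; "C" comes last
```

As in the Verilog fragment, the ordering of the un-delayed branches is not guaranteed, which mirrors the "ABC" or "BAC" outcome described above.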
Race conditions
The order of execution isn't always guaranteed within Verilog. This can best be
illustrated by a classic example. Consider the code snippet below:
initial
  a = 0;
initial
  b = a;
initial
  begin
    #1;
    $display("Value a=%b Value of b=%b", a, b);
  end
What will be printed out for the values of a and b? Depending on the order of
execution of the initial blocks, it could be zero and zero, or alternately zero and some
other arbitrary uninitialized value. The $display statement will always execute after
both assignment blocks have completed, due to the #1 delay.
Operators

Operator type   Operator symbols   Operation performed
-------------------------------------------------------------------------------
Bitwise         ~                  Bitwise NOT (1's complement)
                &                  Bitwise AND
                |                  Bitwise OR
                ^                  Bitwise XOR
                ~^ or ^~           Bitwise XNOR
Logical         !                  NOT
                &&                 AND
                ||                 OR
Reduction       &                  Reduction AND
                ~&                 Reduction NAND
                |                  Reduction OR
                ~|                 Reduction NOR
                ^                  Reduction XOR
                ~^ or ^~           Reduction XNOR
Arithmetic      +                  Addition
                -                  Subtraction
                -                  2's complement
                *                  Multiplication
                /                  Division
                **                 Exponentiation (*Verilog-2001)
Relational      >                  Greater than
                <                  Less than
                >=                 Greater than or equal to
                <=                 Less than or equal to
                ==                 Logical equality (bit-value 1'bX is removed
                                   from comparison)
                !=                 Logical inequality (bit-value 1'bX is removed
                                   from comparison)
                ===                4-state logical equality (bit-value 1'bX is
                                   taken as literal)
                !==                4-state logical inequality (bit-value 1'bX is
                                   taken as literal)
Shift           >>                 Logical right shift
                <<                 Logical left shift
                >>>                Arithmetic right shift (*Verilog-2001)
                <<<                Arithmetic left shift (*Verilog-2001)
Concatenation   { , }              Concatenation
Replication     {n{m}}             Replicate value m for n times
Conditional     ? :                Conditional
Four-valued logic
The IEEE 1364 standard defines a four-valued logic with four states: 0, 1, Z (high
impedance), and X (unknown logic value). For the competing VHDL, a dedicated
standard for multi-valued logic exists as IEEE 1164 with nine levels.
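How 4-state values combine through a gate can be sketched with a tiny truth-table model. The following plain-Python sketch is illustrative only; it encodes the standard behavior of an AND gate, where a 0 on either input forces the output to 0, while any x or z input otherwise yields x:

```python
def v_and(a, b):
    """AND of two 4-state values '0', '1', 'x', 'z' (z behaves as x on input)."""
    if a == '0' or b == '0':
        return '0'   # a controlling 0 forces the output low, even against x/z
    if a == '1' and b == '1':
        return '1'
    return 'x'       # any x or z input otherwise makes the result unknown

# Print the full 4x4 truth table
for a in '01xz':
    print(a, [v_and(a, b) for b in '01xz'])
```

This "controlling value" rule is why an unknown input does not always poison a result: 0 AND x is still a definite 0.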
Chapter-7
FPGA Implementation
Custom ICs are expensive and take a long time to design, so they are useful only
when produced in bulk. FPGAs, by contrast, are easy to implement within a short
time with the help of Computer Aided Design (CAD) tools (because there is no
physical layout process, no mask making, and no IC manufacturing).
Some disadvantages of FPGAs are that they are slow compared to custom ICs, they
cannot handle very complex designs, and they draw more power.
A Xilinx logic block consists of one Look-Up Table (LUT) and one flip-flop.
An LUT is used to implement a number of different functions. The input lines to the
logic block go into the LUT and enable it. The output of the LUT gives the result of
the logic function that it implements, and the output of the logic block is either the
registered or the unregistered output of the LUT.
LUT-based design provides for better logic block utilization. A k-input LUT-based
logic block can be implemented in a number of different ways with a tradeoff between
performance and logic density. An n-LUT can be seen as a direct implementation of a
function truth table: each latch holds the value of the function corresponding to one
input combination. For example, a 2-LUT can be used to implement any of 16
functions, such as AND, OR, A + ~B, etc.
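The truth-table view of an LUT can be illustrated with a small model (plain Python, purely illustrative): a 2-LUT is just a 4-entry table indexed by the two inputs, and there are 2^4 = 16 possible tables, one per 2-input function:

```python
from itertools import product

def lut2(table, a, b):
    """A 2-input LUT: 'table' lists the output for inputs (0,0),(0,1),(1,0),(1,1)."""
    return table[(a << 1) | b]

AND_TABLE = (0, 0, 0, 1)   # a AND b
OR_TABLE  = (0, 1, 1, 1)   # a OR b
A_OR_NOTB = (1, 0, 1, 1)   # a + ~b

# The same lookup mechanism realizes every function, only the stored bits differ
for a, b in product((0, 1), repeat=2):
    assert lut2(AND_TABLE, a, b) == (a & b)
    assert lut2(OR_TABLE, a, b) == (a | b)
    assert lut2(A_OR_NOTB, a, b) == (a | (b ^ 1))

# Every distinct 4-entry table is a distinct 2-input function: 2**4 = 16 of them
print(len(list(product((0, 1), repeat=4))))  # 16
```

Programming the FPGA amounts to loading the right bits into each such table.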
Interconnects
A wire segment can be described as two end points of an interconnection with
no programmable switch between them. A sequence of one or more wire segments in
an FPGA can be termed as a track.
Typically an FPGA has logic blocks, interconnects, and switch blocks (input/output
blocks). Switch blocks lie in the periphery of the logic blocks and interconnect; wire
segments are connected to logic blocks through switch blocks. Depending on the
required design, one logic block is connected to another, and so on.
In this part we give a short introduction to the FPGA design flow. A simplified
version of the design flow is given in the following diagram.

7.2.1 Design Entry
There are different techniques for design entry: schematic based, Hardware
Description Language (HDL) based, and a combination of both. Selection of a method
depends on the design and the designer. If the designer wants to deal more with
hardware, then schematic entry is the better choice. When the design is complex, or
the designer thinks about the design in an algorithmic way, then HDL is the better
choice. Language-based entry is faster but lags in performance and density.
HDLs represent a level of abstraction that can isolate designers from the details of the
hardware implementation. Schematic-based entry gives designers much more
visibility into the hardware; it is the better choice for those who are hardware
oriented. Another, rarely used, method is state-machine entry. It is the better choice
for designers who think of the design as a series of states, but the tools for
state-machine entry are limited. In this documentation we deal with HDL-based
design entry.
7.2.2 Synthesis
Synthesis is the process that translates VHDL/Verilog code into a device netlist
format, i.e. a complete circuit with logical elements (gates, flip-flops, etc.) for the
design. If the design contains more than one sub-design (for example, to implement a
processor we need a CPU as one design element and RAM as another, and so on),
then the synthesis process generates a netlist for each design element. The synthesis
process checks the code syntax and analyzes the hierarchy of the design, which
ensures that the design is optimized for the design architecture the designer has
selected. The resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for
Xilinx® Synthesis Technology (XST)).
7.2.3 Implementation
Translate
Map
Place and Route
Translate:
This process combines all the input netlists and constraints into a logic design file.
This information is saved as an NGD (Native Generic Database) file, which can be
done using the NGDBuild program. Here, defining constraints means assigning the
ports in the design to the physical elements (e.g. pins, switches, buttons) of the
targeted device and specifying the timing requirements of the design. This
information is stored in a file called the UCF (User Constraints File). Tools used to
create or modify the UCF are PACE, Constraint Editor, etc.
Map:
This process divides the whole circuit of logical elements into sub-blocks such that
they can fit into the FPGA logic blocks. That is, the map process fits the logic defined
by the NGD file into the targeted FPGA elements (Configurable Logic Blocks (CLBs)
and Input/Output Blocks (IOBs)) and generates an NCD (Native Circuit Description)
file which physically represents the design mapped to the components of the FPGA.
The MAP program is used for this purpose.
Place and Route:
The PAR program is used for this process. The place and route process places the
sub-blocks from the map process into logic blocks according to the constraints and
connects the logic blocks. For example, if a sub-block is placed in a logic block which
is very near to an IO pin, it may save time but may affect some other constraint; the
tradeoff between all the constraints is taken into account by the place and route
process. The PAR tool takes the mapped NCD file as input and produces a completely
routed NCD file as output, which contains the routing information.
The RTL (Register Transfer Level) schematic can be viewed as a black box after
synthesis of the design is complete. It shows the inputs and outputs of the system. By
double-clicking on the diagram we can see the gates, flip-flops, and MUXes.
Figure 7.13: RTL schematic of Top-level Carry Select Adder(LF)
Figure 7.15: Technology schematic of top-level Carry Select Adder (LF)
Figure 7.17: Internal block of Carry Select Adder (LF)
7.4 Synthesis Report
Logic Utilization
Logic Distribution
Total Gate count for the Design
The device utilization summary gives the number of devices used out of those
available on the targeted part, also expressed as a percentage. The device
utilization for the used device and package, obtained as a result of the synthesis
process, is shown below.
Table 7-2: Synthesis report of the proposed Carry-Select Adder (LF)
Chapter-8
SIMULATION RESULTS
Figure 8-3: Test bench for 16-bit Carry Select Adder (LF)
Chapter-9
CONCLUSION
A simple approach is proposed in this work to reduce the area and power of the SQRT
CSLA architecture. An analysis of the logic operations of the conventional CSLA
eliminated all of its redundant logic operations and led to a new logic formulation
for the CSLA. In the proposed scheme, the carry-select (CS) operation is scheduled
before the calculation of the final sum, which differs from the conventional
approach. The carry words corresponding to input carries ‘0’ and ‘1’ generated by
the proposed CSLA follow a specific bit pattern, which is used for logic
optimization of the CS unit. The fixed input bits of the carry-generation (CG) unit
are also used for logic optimization. Based on this, optimized designs for the CS
and CG units are obtained, and from these optimized logic units an efficient CSLA
design is derived. The proposed CSLA involves significantly less area and delay
than the recently proposed BEC-based CSLA. Owing to its small carry-output delay,
the proposed CSLA is a good candidate for the SQRT adder. The synthesis results
show that the existing BEC-based SQRT-CSLA design involves 48% more ADP
(area-delay product) and consumes 50% more energy than the proposed SQRT-CSLA, on
average, across different bit-widths.
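As a behavioral illustration of carry-select addition with square-root group sizes, the following Python sketch precomputes each group's sum for carry-in 0 and carry-in 1 and then multiplexes on the actual incoming carry. The group widths and function names are illustrative only; this word-level model shows the selection principle, not the gate-level CS/CG optimizations of the proposed design:

```python
def group_add(ga, gb, width, cin):
    """Add one group of `width` bits; return (sum_bits, carry_out)."""
    total = ga + gb + cin
    return total & ((1 << width) - 1), total >> width

def sqrt_csla_16(a, b, cin=0):
    """Word-level model of a 16-bit square-root carry-select adder."""
    widths = [2, 2, 3, 4, 5]              # square-root grouping, LSB group first
    result, shift, carry = 0, 0, cin
    for w in widths:
        ga = (a >> shift) & ((1 << w) - 1)
        gb = (b >> shift) & ((1 << w) - 1)
        s0, c0 = group_add(ga, gb, w, 0)  # candidate sum for carry-in 0
        s1, c1 = group_add(ga, gb, w, 1)  # candidate sum for carry-in 1
        s, carry = (s1, c1) if carry else (s0, c0)  # carry-select mux
        result |= s << shift
        shift += w
    return result, carry
```

In hardware both candidate sums are computed in parallel; only the multiplexer chain ripples across the groups, which is the path the proposed scheme optimizes.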
REFERENCES
[1] K. K. Parhi, VLSI Digital Signal Processing. New York, NY, USA: Wiley, 1998.
[2] A. P. Chandrakasan, N. Verma, and D. C. Daly, “Ultralow-power electronics for
biomedical applications,” Annu. Rev. Biomed. Eng., vol. 10, pp. 247–274, Aug. 2008.
[3] O. J. Bedrij, “Carry-select adder,” IRE Trans. Electron. Comput., vol. EC-11,
no. 3, pp. 340–344, Jun. 1962.
[4] Y. Kim and L.-S. Kim, “64-bit carry-select adder with reduced area,”
Electron. Lett., vol. 37, no. 10, pp. 614–615, May 2001.
[5] Y. He, C. H. Chang, and J. Gu, “An area-efficient 64-bit square root carry select
adder for low power application,” in Proc. IEEE Int. Symp. Circuits Syst., 2005, vol.
4, pp. 4082–4085.
[6] B. Ramkumar and H. M. Kittur, “Low-power and area-efficient carry-select
adder,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 371–
375, Feb. 2012.
[7] I.-C. Wey, C.-C. Ho, Y.-S. Lin, and C. C. Peng, “An area-efficient carry select
adder design by sharing the common Boolean logic term,” in Proc. IMECS, 2012, pp.
1–4.
[8] S. Manju and V. Sornagopal, “An efficient SQRT architecture of carry select
adder design by common Boolean logic,” in Proc. VLSI ICEVENT, 2013, pp. 1–5.
[9] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 2nd ed.
New York, NY, USA: Oxford Univ. Press, 2010.
[10] B. K. Mohanty and S. K. Patel, “Area–delay–power efficient carry-select
adder,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 6, Jun. 2014.