Low Power Carry Look-Ahead Adder Design Using fLUT FPGA Arithmetic
ABSTRACT
Widely used FPGAs make it possible to build and test the algorithms of other researchers, and they now serve many practical, flexible applications. This work presents new methods to optimally implement and speed up a one-bit Full-Adder (FA). The optimal implementation lets the FA use fewer resources on the Field-Programmable Gate Array (FPGA). The designs draw on the Look-Up Table (LUT) structures and properties documented by Xilinx and Altera, on Full-Adder/Subtractor (FA/S) operation, and on recent innovations in optimal multiplexer design. In the proposed designs and implementations, the LUT is divided into smaller LUTs, which can function as a multiplexer, a memory or a comparator, in order to increase the speed of the FA, reduce the area occupied on the FPGA, and use the resources appropriately. The new designs also have fewer components, which leads to fewer layouts, a real advantage in fabrication. The experiment has shown that the FA in the new methods is faster and uses less space for saving data than the standard FA. The FA also uses fewer CMOS components at the Register-Transfer Level (RTL).
TABLE OF CONTENTS
ABSTRACT
CHAPTER
I Introduction
II Literature Survey
3.1 Non-fLUTs
3.2 fLUTs
VI Conclusion
LIST OF FIGURES
3.5 LUT with balanced adder interaction where both adder inputs are
driven by 5-LUTs.
4.1 Circuit for Sum value: (a) Controlled by Cin & y ; (b) Controlled
by Cin & x
INTRODUCTION
For fast processing and low delay, we explore the possibility of using a field-programmable logic array. Field Programmable Gate Arrays (FPGAs) combine limited cost and reconfigurability with very high integration capability and performance. These characteristics, along with reduced price, make them a valid alternative to the more complex and time-to-market-demanding Application Specific Integrated Circuits (ASICs). Addition is the main operation of each arithmetic circuit, so improving the speed and reducing the area occupancy of adder circuits is still an active research topic. In digital Very Large Scale Integration (VLSI) circuits, full adders are the basic building blocks for all arithmetic operations. Therefore, the adder has a great impact on the performance of circuits that are based on arithmetic operations.
The various existing adder structures, such as the Ripple Carry Adder (RCA), Carry Look Ahead Adder (CLA), Carry Save Adder (CSA), Carry Select Adder (CSEL), Carry Bypass Adder (CBY) and Area Efficient Carry Select Adder (AECSA), are analyzed based on their performance. Some of these structures reduce the area occupied by the circuit at the cost of increased delay, and others reduce the delay at the cost of increased area. The proposed adder structure results in optimized performance: the delay is reduced while the area consumption equals that of a normal adder design. The characteristics of a digital circuit are analyzed mainly in terms of time and area consumption.
Field programmable gate arrays provide an alternative approach to application specific integrated circuit (ASIC) implementation, with features such as large-scale integration, post-production design verification, lower non-recurring costs, and a reconfigurable design approach. FPGAs combine limited cost and reconfigurability with very high integration capability and performance. These characteristics, along with reduced low-volume costs, make them a valid alternative to the more complex and time-to-market-demanding ASICs. Addition is the main operation of each arithmetic circuit, so improving the speed and reducing the area occupancy of adder circuits is still an active research topic. Unfortunately, as is well known, designing efficient adders on an FPGA platform is not trivial. These blocks are highly optimized in terms of speed or area, thereby facilitating efficient realization of complex functions.
One of the major changes in FPGA architecture has been the introduction of the 6-input LUT as a logic element [11, 16]. With this FPGA primitive, logic implementation leads to higher logic densities, resulting in a minimal-depth circuit and hence higher speed, a trend towards which current FPGAs are oriented. Perhaps the biggest issue with 6-input LUTs is their underutilization when implementing a particular logic function, since many logic functions do not require six inputs.
Type 1 consists of an array of discrete LUTs, each depending on a set of input variables $x_i$: $f_i = f_i(x_i)$, where $1 \le i \le k$.
Type 2 is a two-level LUT structure. The first level is identical to Type 1, but the outputs of the first level are combined within the second level. The second-level logic can be arbitrary or restricted to some specific Boolean operations. Given the first-level outputs as $g_i = g_i(x_i)$, a second-level output is defined by $f(x_1, x_2, \ldots, x_k) = g_1(x_1) \circ g_2(x_2) \circ \ldots \circ g_k(x_k)$, where $\circ \in \{\wedge, \vee, \oplus, \equiv, \ldots\}$.
Type 3 describes a structure with two LUTs sharing input variables. In addition to an arbitrary number of common variables $x_2$, each LUT might depend on a set of further disjoint inputs $x_1$ and $x_3$. Therefore, the output can be written as $f_{pair}(x) = \left(f_1(x_1, x_2),\ f_2(x_2, x_3)\right)$, where $x = x_1 \cup x_2 \cup x_3$ and $x_1 \cap x_3 = \emptyset$. Some configuration types and the corresponding parameters for commercial FPGAs are listed in Table 1. Dotted lines indicate that more than one configuration is possible. For example, the XC4000 can implement Type 1 either by one LUT depending on five input signals or by two discrete LUTs each depending on four input variables.
Table 1.1. Configuration types and corresponding parameters of well known FPGAs
It is clear that any attempt to improve the performance of adder architectures using only logic and routing resources is ineffective. To speed up most arithmetic operations relative to their implementation in generic configurable logic, embedded DSP48E slices with high-speed features are also made available within the Virtex-5 devices. DSP slices include dedicated arithmetic circuits, optimized to speed up addition, multiplication, accumulation, MAC (Multiply And ACcumulation) and Boolean logic functions.
A combinational circuit that performs the addition of three bits is called a full adder. Two of the input variables, denoted by a and b, represent the two significant bits to be added. The third input, c, represents the Carry from the previous less significant position. The two output variables are designated as Sum and Carry. The output variables are determined from the arithmetic sum of the input bits. When all the input bits are zero, the output is '0'. The output Sum is equal to '1' when only one input is equal to '1' or all three inputs are equal to '1'. Based on the truth table, the outputs Sum and Carry are described as
$\text{Sum} = a \oplus b \oplus c$ (1)
$\text{Carry} = a \cdot b + b \cdot c + c \cdot a$ (2)
In equations (1) and (2), $\oplus$ represents the Ex-Or operation and $\cdot$ represents the AND operation.
Table:1.2 Truth table of Full Adder
The conventional 1-bit adder circuit is shown in Figure 1. It requires 6 gates (4 logic gates and 2 Ex-Or gates) to implement the 1-bit adder circuit.
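The full-adder behaviour described above can be checked with a short Python sketch. It is only an illustration: the function names `full_adder` and `truth_table` are hypothetical, and the code simply evaluates equations (1) and (2) for all eight input combinations and confirms that they agree with ordinary addition.

```python
from itertools import product

def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (Sum, Carry) for three input bits, following equations (1) and (2)."""
    total = a ^ b ^ c                      # Sum = a XOR b XOR c
    carry = (a & b) | (b & c) | (c & a)    # Carry = a.b + b.c + c.a
    return total, carry

def truth_table() -> list[tuple[int, int, int, int, int]]:
    """Enumerate all input rows as (a, b, c, Sum, Carry)."""
    rows = []
    for a, b, c in product((0, 1), repeat=3):
        s, cy = full_adder(a, b, c)
        # The two output bits together must encode the arithmetic sum a + b + c.
        assert 2 * cy + s == a + b + c
        rows.append((a, b, c, s, cy))
    return rows

if __name__ == "__main__":
    print("a b c | Sum Carry")
    for a, b, c, s, cy in truth_table():
        print(f"{a} {b} {c} |  {s}    {cy}")
```

Running the script prints the eight rows of the full-adder truth table.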
The carry look-ahead adder (CLA) is theoretically one of the fastest methods for addition.
Weinberger and Smith invented the CLA in 1958 [Weinberger58]. The CLA uses
intermediate information to determine in advance if there will be a carry out of a given bit
position. Figure 3-3 shows the truth table for a full adder, including this extra carry
information. For the delete condition, there will be no carry out of the bit position. For
the propagate condition, there will only be a carry out if there is a carry in. For the
generate/propagate condition, there will always be a carry out at that position.
Reduced full adders are used, as they no longer need to compute the carry-out.
Figure 3-4 shows the block diagram for a 4-bit section of a CLA. The CLA block at the
top of the diagram is a set of circuitry that creates generate and propagate signals for a
group of full adders, as well as the carry-out from that group. The following equations
compute each position’s generate and propagate signals:
Generate: $G_i = A_i \cdot B_i$
Propagate: $P_i = A_i + B_i$
Some books instead define the propagate signal as the exclusive-OR of the A and B
signals, but this does not change the result of addition. We chose the above definition
because the implementation of the OR operation is more efficient than that of exclusive-
OR in most technologies. The equations for the carries in a CLA are given by:
$C_1 = G_0 + P_0 \cdot C_{in}$
$C_2 = G_1 + G_0 \cdot P_1 + P_0 \cdot P_1 \cdot C_{in}$
As we attempt to compute the carries further and further in advance, larger gates are
required. For example, computing C3 requires the use of a 4 input AND gate and a 4
input OR gate. Hence, usually the size of the look-ahead logic is limited to 3 carries.
AND Gates with 5/6 inputs would be needed for the next 2 carry signals, which makes
their implementation in CMOS very slow due to the stacked transistors in the pull-up or
pull-down paths. As the carry calculation is performed by the carry look-ahead block, the
one-bit adder equations for a CLA are the reduced full-adder equations because carry
calculation is no longer needed. The reduced full adder performs the operation given by
equation below:
$\Sigma = A \oplus B \oplus C = \bar{A}\bar{B}C + \bar{A}B\bar{C} + A\bar{B}\bar{C} + ABC$
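As a hedged illustration of the look-ahead equations, the following Python sketch models a 4-bit CLA section using the OR form of the propagate signal. The function name `cla4` is hypothetical, and the expressions for the third carry and the group carry-out are the standard expansions of the same recurrence; they are assumptions in the sense that the text above only writes out the first two carries.

```python
def cla4(a: int, b: int, cin: int) -> tuple[int, int]:
    """Add two 4-bit values with look-ahead carries; returns (sum, carry_out)."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [A[i] & B[i] for i in range(4)]   # generate:  Gi = Ai . Bi
    P = [A[i] | B[i] for i in range(4)]   # propagate: Pi = Ai + Bi (OR form)

    # Each carry is a two-level function of G, P and cin only.
    c1 = G[0] | (P[0] & cin)
    c2 = G[1] | (G[0] & P[1]) | (P[0] & P[1] & cin)
    # Assumed standard expansion: needs a 4-input AND and a 4-input OR.
    c3 = G[2] | (G[1] & P[2]) | (G[0] & P[1] & P[2]) | (P[0] & P[1] & P[2] & cin)
    # Group carry-out, expanded the same way (5-input gates).
    c4 = (G[3] | (G[2] & P[3]) | (G[1] & P[2] & P[3])
          | (G[0] & P[1] & P[2] & P[3]) | (P[0] & P[1] & P[2] & P[3] & cin))

    carries = [cin, c1, c2, c3]
    s = 0
    for i in range(4):
        # Reduced full adder: only the sum bit is computed per position.
        s |= (A[i] ^ B[i] ^ carries[i]) << i
    return s, c4

if __name__ == "__main__":
    print(cla4(0b1011, 0b0110, 1))
```

Because the sum bit still uses the exclusive-OR of the operands, choosing OR instead of exclusive-OR for the propagate signal leaves the result of the addition unchanged, which the exhaustive comparison against integer addition confirms.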
CHAPTER 2
LITERATURE SURVEY
EXISTING METHOD
The base FPGA architecture used in this article is designed in a 22-nm CMOS process
and is a heterogeneous architecture with soft logic blocks, simple I/Os, configurable
memories, and fracturable multipliers. The internal connectivity of the blocks is provided
by a 50% depopulated crossbar that connects block inputs and basic logic element (BLE)
outputs to the BLE inputs. We have chosen a depopulated crossbar as this is common in
most commercial devices [2], [5]. The depopulated crossbar is composed of four smaller,
fully populated crossbars as designed by Chiasson and Betz [17]; this depopulation
results in the soft logic block inputs being divided into four groups of ten logically
equivalent pins. The input pins are evenly distributed on the bottom and the right sides of
the logic block, as this simplifies the layout of the FPGA. Table I gives the routing
architecture parameters of the base architecture. In addition to logic blocks, the
architecture includes hard 32k-bit RAM blocks (with configurable width/depth) and DSP
blocks (36 × 36 bit multipliers which can be fractured down to two 18×18 or four 9×9
multipliers). These values are chosen to be in line with the recommendations of prior
research [17], [18].
3.1 Non-fLUTs
Fig. 1 illustrates the baseline non-fracturable soft logic block used in this article, which
contains eight BLEs, 40 general inputs, eight general outputs, one cin pin, and one cout
pin. The BLE consists of a non-fracturable six-input LUT with an optionally registered
output pin. There are cin and cout pins into and out of the BLE, respectively, to drive a
hard adder. The specific details are described in Section III. There is also a fast path from
the flip-flop output to the LUT input. We also consider architectures that do not contain
hardened arithmetic and hence have neither cin nor cout pins. Our choice of following
industry trends on using larger LUTs has interesting implications in terms of the
efficiency of addition. When implementing arithmetic using only 4-LUTs, every bit of
addition requires one LUT for the sum and another LUT for the carry. With 5-LUTs and
larger, a soft implementation of arithmetic can be more efficient. Fig. 2 shows how three
LUTs can implement two bits of addition. With fracturable 6-LUTs, this benefit grows
even larger as fracturing into the two 5-LUTs mode allows implementation of both the 2-
bit carry and a sum operation in a single fracturable 6-LUT.
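One possible mapping of two bits of addition onto three LUTs can be sketched in Python. This is an assumption about the mapping, not a reproduction of Fig. 2: one 3-input LUT produces the low sum bit, and two 5-input LUTs produce the high sum bit and the 2-bit carry-out directly from both operand bit pairs and the carry-in. The helper names (`make_lut`, `read_lut`, `add2`) are hypothetical.

```python
from itertools import product

def make_lut(k, func):
    """Build a k-input LUT as a truth table; the first input is the most significant index bit."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]

def read_lut(lut, *bits):
    """Look up the stored output bit for the given input bits."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return lut[index]

def maj(x, y, z):
    """Majority function, i.e. the carry of a full adder."""
    return (x & y) | (y & z) | (z & x)

# Three LUTs in total for two bits of addition.
SUM0 = make_lut(3, lambda a0, b0, cin: a0 ^ b0 ^ cin)
SUM1 = make_lut(5, lambda a1, b1, a0, b0, cin: a1 ^ b1 ^ maj(a0, b0, cin))
COUT = make_lut(5, lambda a1, b1, a0, b0, cin: maj(a1, b1, maj(a0, b0, cin)))

def add2(a, b, cin):
    """Add two 2-bit values using only LUT reads; returns (sum, carry_out)."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    s0 = read_lut(SUM0, a0, b0, cin)
    s1 = read_lut(SUM1, a1, b1, a0, b0, cin)
    cout = read_lut(COUT, a1, b1, a0, b0, cin)
    return s0 | (s1 << 1), cout

if __name__ == "__main__":
    print(add2(3, 2, 1))
```

With 4-LUTs the same two bits would need four LUTs (a sum and a carry LUT per bit), so the 5-input functions are what make the three-LUT mapping possible in this sketch.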
Fig. 3.1. Base soft logic block consists of eight BLEs connected by a 50% depopulated crossbar. Each BLE consists of an LUT and a flip-flop with fast feedforward and feedback paths reflecting what is commonly found in state-of-the-art FPGAs.
3.2 fLUTs
Fig. 3.2. 5-LUTs and larger allow for more flexibility in technology-mapping addition.
fLUTs change the interaction of soft logic with hard adders and carry chains, so it is important to evaluate their combined effect. As much as possible, our fLUT architectures reuse the same architecture as the non-fracturable case so that we may compare between these architectures. Instead of BLEs, our baseline fLUT architecture uses fracturable BLEs (fBLEs) which, as shown in Fig. 3, contain one fLUT with optionally registered outputs.
Fig. 3.3. Baseline fBLE which contains one fLUT with optionally registered outputs.
Unlike the baseline non-fracturable
architecture (see Fig. 1), which had only one output per BLE, each fBLE has two
independent outputs. Therefore, the internal crossbar inside the soft logic block has an
additional eight inputs (local feedbacks) compared to the non-fracturable case, and the
soft logic block has 16 outputs instead of 8. Fig. 4 shows the baseline fLUT. This fLUT
can operate either as one 6-LUT or two 5-LUTs with four shared inputs. This choice of
shared inputs is between that of a Virtex-style fLUT [13], where all five inputs of the 5-LUTs are shared, and a Stratix-style fLUT [14], where two inputs of the two 5-LUTs are shared.
Fig. 3.4. Baseline fLUT which operates as either one 6-LUT or two 5-LUTs with four
shared inputs.
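The two operating modes of the baseline fLUT can be illustrated with a small Python model. The pin assignment here is an assumption made for the sketch: inputs `i0` to `i3` are the four shared inputs, `i4` is private to the lower 5-LUT, and `i5` is either the 6-LUT select or the private input of the upper 5-LUT. All function names are hypothetical.

```python
from itertools import product

def _index(bits):
    """Pack input bits into a table index (first bit is most significant)."""
    idx = 0
    for bit in bits:
        idx = (idx << 1) | bit
    return idx

def make_lut5(func):
    """Build a 32-entry truth table for a 5-input function."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=5)]

def lut6_mode(lo, hi, i0, i1, i2, i3, i4, i5):
    """One 6-LUT: both halves see i0..i4, and i5 drives the output mux."""
    half = hi if i5 else lo
    return half[_index((i0, i1, i2, i3, i4))]

def dual_lut5_mode(lo, hi, i0, i1, i2, i3, i4, i5):
    """Two 5-LUTs: i0..i3 are shared, i4 and i5 are the private inputs."""
    out_lo = lo[_index((i0, i1, i2, i3, i4))]
    out_hi = hi[_index((i0, i1, i2, i3, i5))]
    return out_lo, out_hi

# Example configuration: sum in one half, carry in the other (unused inputs ignored).
SUM_HALF = make_lut5(lambda a, b, c, _u, _v: a ^ b ^ c)
CARRY_HALF = make_lut5(lambda a, b, c, _u, _v: (a & b) | (b & c) | (c & a))

if __name__ == "__main__":
    print(dual_lut5_mode(SUM_HALF, CARRY_HALF, 1, 1, 0, 0, 0, 0))
```

In the fractured mode, a single fLUT configured this way yields both outputs of a full adder at once, which is the property the fBLE's two independent outputs are meant to expose.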
To evaluate the impact of including hard adders in FPGA logic blocks, we explore
different design choices relating to adder implementation and interaction with the rest of
the logic block.
To ensure we fairly compare various hard adder and carry chain architectures, we
carefully electrically designed two hard adder primitives and hand-optimized them at the
transistor level. The first adder primitive is a basic 1-bit full adder. In a soft logic block,
eight of these full adders are linearly chained together to form a ripple carry chain. Table
III shows the properties of the 1-bit hard full adder used in this article. Area is measured
as minimum width transistor areas (MWTAs), using the transistor drive to area
conversion equations from Chiasson and Betz [17]. The adder circuitry, LUTs, and
routing are all designed with a similar goal of minimizing the area–delay product of the
FPGA, and the cin to cout path of the adder is particularly optimized for delay as it
occurs n−1 times on an n-bit adder.
Fig. 3.5. LUT with balanced adder interaction where both adder inputs are driven by 5-LUTs.
The second adder primitive is a 4-bit CLA. Each logic block contains two of these 4-bit
adders chained in a ripple carry fashion. Table IV shows the properties of the 4-bit CLA
used in this article. The carry lookahead optimization allows for a faster carry path (20
ps) compared to a ripple of four 1-bit adders (44 ps) when performing a 4-bit addition.
The CLA design trades off flexibility (as some bits are wasted if the desired adder length
is not divisible by 4) and area in exchange for speed.
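The trade-off between speed and wasted bits can be made concrete with a rough, first-order estimate. The sketch below uses only the two carry-path figures quoted above (20 ps per 4-bit CLA, 44 ps per four rippled 1-bit adders) and assumes, for illustration only, that delays add linearly along the chain and that sum logic and routing are ignored. The function name `carry_path_estimate` is hypothetical.

```python
CLA_4BIT_PS = 20      # carry path of one 4-bit CLA, from the text
RIPPLE_4BIT_PS = 44   # carry path of four rippled 1-bit adders, from the text

def carry_path_estimate(n_bits: int) -> tuple[int, int, int, float]:
    """Return (cla_blocks, wasted_bits, cla_ps, ripple_ps) for an n-bit adder.

    Assumption: carry delay scales linearly with the number of chained stages.
    """
    cla_blocks = -(-n_bits // 4)              # ceiling division: 4 bits per CLA
    wasted_bits = cla_blocks * 4 - n_bits     # adder bits left unused
    cla_ps = cla_blocks * CLA_4BIT_PS
    ripple_ps = n_bits * RIPPLE_4BIT_PS / 4   # per-bit share of the 44 ps figure
    return cla_blocks, wasted_bits, cla_ps, ripple_ps

if __name__ == "__main__":
    for n in (4, 8, 10, 16):
        print(n, carry_path_estimate(n))
```

For a 10-bit adder, for example, the estimate uses three CLA blocks and leaves two adder bits unused, which is the flexibility cost mentioned in the text.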
Fig. 5 shows one way the LUT and adder (within a BLE) may interact. Here, we make
use of the observation that a 6-LUT is constructed with two 5-LUTs and a mux. If that
mux is dropped, then the adder can be driven by two 5-LUTs where the LUTs share
inputs. If the adder is not used, then another mux can be used to produce the 6-LUT
output. We call this the balanced LUT interaction, and its underlying rationale is that a
symmetric amount of prior logic for each adder input may be the most appropriate
architecture. Example circuits that may benefit from this architecture would be
applications where multiplexers select the inputs to an adder. Similar interaction for the
CLA is shown in Fig. 6. Fig. 7 shows another LUT–adder interaction architecture that we
will explore. Here, the 6-LUT output drives one of the adder inputs and the other adder
input is driven by one of the 6-LUT inputs. As with the previous case, if the adder is not
used, then another mux can be used to select the 6-LUT output. We call this the
unbalanced LUT interaction. We model each additional SRAM-controlled 2-to-1 mux
(one per BLE for the balanced LUT interaction, and two per BLE for the unbalanced
LUT interaction) as having 22 ps of delay and occupying 15 MWTAs (including the
SRAM configuration bit). The underlying rationale for this architecture is that there
might be an advantage to allowing a faster input into one side of the adder, which would
be appropriate when speed was an issue.
PROPOSED METHOD
The truth table for adding two bits with a carry flag is presented as follows:
From truth table 1, a conclusion is reached regarding the output values: the Sum values and the Carry-out flags. This conclusion derives from the idea of an operating (electronic) current: the current is switched on or off to obtain the expected logic value.
Two circuits are built for the sum and the carry-out based on the 2-to-1 multiplexer (MUX2-1). They use the MUX as a switch that is controlled to allow the expected values to pass through.
The circuit for the sum value: the circuit is built by controlling the values with input x or input y, as shown in Table 2 and the corresponding figure.
The circuit for the carry-out flag: by observing the truth table of two-bit addition, new equations are obtained, and it is found that the carry-out flag relies on the control of input x and the carry-in. Similarly, two circuits for the carry-out of addition are illustrated in Table 4.2 and Fig. 4.1.
Two LUT6-1's are used to establish the FA. LUT6-1 is a basic LUT inside the slice. Because the sum value and the carry-out flag above can be computed at the same time, two basic LUT6-1's are used: one for the Sum circuit and one for the Carry-out circuit. From observation of the two truth tables of addition as well as the Carry-out circuits, it is found that they can be combined to obtain an FA/S that performs addition when control = 0, as shown in Fig. 4.2.
Figure 4.2: Diagram Cout with a control pin
Moreover, the final circuit for the FA is obtained by using one LUT (6-input, 2-output) that is divided into two smaller LUTs, one for Sum and one for the Carry-out flag. When the Carry-out flag is equal to 1, the addition indicates an 'overload'. As a result, Sum/Sub and the Carry-out flag can be computed at the same time, as shown in Fig. 4.4.
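A multiplexer-based full adder of the kind described above can be sketched in Python using only 2-to-1 multiplexers and NOT gates. The wiring below is one well-known arrangement and is an assumption: it is not necessarily the exact circuit of the figures, and the names `mux2`, `inv` and `mux_full_adder` are hypothetical.

```python
def mux2(sel: int, d0: int, d1: int) -> int:
    """2-to-1 multiplexer: pass d0 when sel is 0, d1 when sel is 1."""
    return d1 if sel else d0

def inv(x: int) -> int:
    """NOT gate."""
    return x ^ 1

def mux_full_adder(x: int, y: int, cin: int) -> tuple[int, int]:
    """Full adder built only from MUX2-1 and NOT; returns (sum, carry_out)."""
    # p = x XOR y: x selects between y and NOT y.
    p = mux2(x, y, inv(y))
    # sum = p XOR cin: p selects between cin and NOT cin.
    total = mux2(p, cin, inv(cin))
    # carry-out: if x == y the carry equals x, otherwise it equals cin.
    carry = mux2(p, x, cin)
    return total, carry

if __name__ == "__main__":
    for x in (0, 1):
        for y in (0, 1):
            for cin in (0, 1):
                print(x, y, cin, mux_full_adder(x, y, cin))
```

The sum and carry-out multiplexers both depend only on `p` and the primary inputs, so in this arrangement the two outputs can be evaluated in parallel once `p` is available.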
In contrast to the well known ripple-carry design which has a total time delay that is
proportional to the length N of the adder, the carry lookahead design is suited to generate
all carries simultaneously by additional logic circuitry. This results in a constant addition
time, independent of the length of the adder. For a good understanding we give a short
overview of the logical structure of this adder type. A more detailed description can be
found in .
Let $C_{i-1}$ be the carry input to the $i$th bit position and $C_{-1}$ the carry input to the least significant position. Let $S_i$ and $C_i$ be the sum and carry outputs of the $i$th stage. Then, the sum and carry bit can be described as
These equations show that all carry inputs Ci are available simultaneously and all sum
bits Si can be generated in parallel as illustrated in Figure 2 for a 4-bit carry lookahead
adder. As also shown in Figure 2 we distinguish between structure levels GP, CLA and
SUM according to the generated signals within the levels.
Fig. 4.4. Structure of a 4-bit carry lookahead adder
Theoretically, it is possible to build adders of any word length if the CLA unit can be
freely expanded. However, in practice the complexity of the CLA unit is limited. This
leads to a hierarchical structure based on Block-Carry-Lookahead (BCLA) units. Each of
them generates only a limited number of carries and, additionally, BLOCK-
PROPAGATE (P0*) and BLOCK-GENERATE (G0*) signals which are used to evaluate
carries of the following BCLA level. Figure 3 shows a two-level CLA unit of a 16-bit
CLA adder in a 4-bit BCLA unit configuration. Each of the 4-bit BCLA units is suited to evaluate three carries. The most significant carry signal $C_{out}$ can be written as $C_{out} = C_{-1} \cdot P_0^{\max} + G_0^{\max}$, where max refers to the variables of the most significant BCLA unit. The number of BCLA levels is given by
The implementation of hierarchical CLA adders is done in four steps: Adaptive structure
generation, technology mapping, partitioning and placement. We present a method for
realizing all these steps for any SRAM-based FPGA which can be described by the
generic models defined in Section 2. During each design step we will efficiently use
knowledge about the logical structure. As described, the length of a BCLA unit is not
bound to a special value. Therefore, we can write the equations for BLOCK-PROPAGATE, BLOCK-GENERATE and CARRIES
Since the carry lookahead adder is chosen for high-performance addition, the aim of our logic adaptation step is to reduce the BCLA level count to a minimum. According to Eq. (2), this is accomplished by maximizing the bit width of the BCLA units, which increases the number of carry signals evaluated within a unit. As shown in Table 2, the complexity of a BCLA is limited by the implementation of the BLOCK-GENERATE signal, which needs $2 \cdot \text{BCLA}_{bitwidth} - 1$ signal inputs and a number of sum terms equal to $\text{BCLA}_{bitwidth}$. Since this is also valid for the most complex CARRY signal, we can use the same implementation method for both. Therefore, our aim is to determine the configuration of an FPGA device that is suitable to implement the most complex BLOCK-GENERATE signal within a CLB.
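The input and term counts quoted for the BLOCK-GENERATE signal can be checked with a short Python sketch. It assumes the standard expanded form of block generate, G* = G[b-1] + P[b-1]·G[b-2] + ... + P[b-1]···P[1]·G[0], which is not written out in the text; the function names are hypothetical.

```python
def block_generate_terms(bitwidth: int) -> list[list[str]]:
    """List the product terms of the (assumed) expanded BLOCK-GENERATE equation."""
    terms = []
    for j in range(bitwidth - 1, -1, -1):
        # G[j] reaches the block output only if every higher position propagates.
        term = [f"P{i}" for i in range(bitwidth - 1, j, -1)] + [f"G{j}"]
        terms.append(term)
    return terms

def block_generate_cost(bitwidth: int) -> tuple[int, int]:
    """Return (distinct signal inputs, number of terms) for a BCLA of this width."""
    terms = block_generate_terms(bitwidth)
    signals = {name for term in terms for name in term}
    return len(signals), len(terms)

if __name__ == "__main__":
    for width in range(1, 7):
        inputs, n_terms = block_generate_cost(width)
        print(f"bitwidth={width}: inputs={inputs}, terms={n_terms}")
```

For every width the sketch reports 2·bitwidth − 1 inputs and bitwidth terms, matching the counts in the paragraph above; a 4-bit BCLA, for instance, needs 7 inputs, which exceeds a single 6-input LUT.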
Fig. 5.1. LUT Architecture
4. Power Report
Fig. 5.10. Power Report
The experimental results lead to the conclusion that the standard FA and the new n_FA design use the same resources on the FPGA. The proposed n_FA design is 20% to 30% faster than the standard FA, depending on the FPGA series. The proposed n_FAS occupies only 50% of the resources of the standard FA/S, and its speed increases by 28% to 40%.
CHAPTER 6
CONCLUSION
The purpose of this paper is to propose a new FA architecture on an FPGA platform with
two optimization goals. The first is to improve the FA/S ratio in order to increase speed.
The second is to create a new circuit for FA/S to allow for production with fewer gates.
As a result, the proposed design is based on the multiplexer and contains only two types
of components: NOT gate and multiplexer, allowing the design to be easily implemented
on an FPGA chip. Both designs are subjected to detailed testing using different FPGA
series. The experiment revealed that the proposed n_FA performed 20% to 30% faster than the standard FA using the same area resources, and the proposed n_FAS performed 28% to 40% faster than the standard FA/S using only half the resources.