Low Power Carry Look Adder Design Using FLUT’s FPGA Arithmetic(Fpga_arithmetic)_docs

The document discusses a low power carry look adder design using FPGA technology, focusing on optimizing full adder implementations to reduce resource usage and increase speed. It highlights the advantages of using Look-Up-Tables (LUTs) to enhance performance and minimize area occupancy in FPGA designs. Experimental results demonstrate that the proposed methods outperform traditional full adders in terms of speed and resource efficiency.

Uploaded by

mailtochipmatrix

Low Power Carry Look Adder Design Using FLUT’s FPGA Arithmetic

ABSTRACT

Research on CPU algorithms shows the need to improve the full adder (FA) in a new direction. The electronic and programmable devices that are widely used today make it possible to build and test the algorithms of other researchers, and the Field-Programmable Gate Array (FPGA) is now used in many practical, flexible applications. This work presents new methods to optimally implement and speed up a one-bit full adder. The optimal implementation uses fewer FPGA resources than a traditional full adder/subtractor (FA/S). The methods are based on the Look-Up-Table (LUT) structures and properties documented by Xilinx and Altera, on the operation of the FA/S, and on recent innovations in optimal multiplexers. In the proposed designs, a LUT is divided into smaller LUTs, each of which can function as a multiplexer, a memory, or a comparator, in order to increase the speed of the FA, reduce the area occupied on the FPGA, and use resources appropriately. The new designs also have fewer components, which leads to fewer layouts, a real advantage in fabrication. Experiments show that the FA in the new methods is faster and uses less space for storing data than a standard FA. The new FA also uses fewer CMOS devices at the Register-Transfer Level (RTL).
TABLE OF CONTENTS

ABSTRACT

CHAPTER

I Introduction
1.1 Generic FPGA Logic Cell Definition
1.2 Carry Look-Ahead Adder
II Literature Survey
III Existing Work
3.1 Non-fLUTs
3.2 fLUTs
3.3 Hard Adder and Carry Chain Architectures
3.3.1 Adder Primitive
3.3.2 Adder Input Balancing
3.3.3 Carry Chain Flexibility
IV Existing Method
4.1 A New Logical Full Adder
4.2 Proposed Circuits for Full Adder/Subtractor
4.3 Implementing with LUTs
4.4 Proposed CLS Adder using proposed full adder
4.4.1 Logical Structure of the Hierarchical Carry Lookahead Adder
V Results and Discussion
VI Conclusion
LIST OF FIGURES

1.1 Different configuration types of SRAM-based logic blocks
1.2 Conventional 1-bit adder
1.3 4-bit carry look-ahead adder
3.1 Base soft logic block consists of eight BLEs
3.2 5-LUTs and larger allow for more flexibility in technology-mapping addition
3.3 Baseline fBLE which contains one fLUT with optionally registered outputs
3.4 Baseline fLUT which operates as either one 6-LUT or two 5-LUTs with four shared inputs
3.5 LUT with balanced adder interaction where both adder inputs are driven by 5-LUTs
3.6 Four-bit CLA with balanced LUT interaction
4.1 Circuit for Sum value: (a) controlled by Cin & y; (b) controlled by Cin & x
4.2 Diagram Cout with a control pin
4.3 New Full Adder (n_FA)
4.4 Structure of a 4-bit carry lookahead adder
5.1 LUT architecture
5.2 RTL schematics of existing work
5.3 Device summary utilization report
5.4 Timing summary report
5.5 Power report
5.6 Output waveform for existing method
5.7 RTL schematics of proposed work
5.8 Device summary utilization report
5.9 Timing summary report
5.10 Power report
5.11 Output waveform for proposed method


LIST OF TABLES

1.1 Configuration types and corresponding parameters of well known FPGAs
1.2 Truth table of full adder
1.3 Generate and propagate information for a CLA
4.1 Truth table Full-Adder/Subtractor
4.2 Truth table Sum/Sub value
5.12 Parameters comparison summary


CHAPTER 1

INTRODUCTION

For fast implementation and low delay, we explore the use of a field-programmable gate array. Field Programmable Gate Arrays (FPGAs) combine limited cost and reconfigurability with very high integration capability and performance. These characteristics, along with reduced low-volume costs, make them a valid alternative to the more complex and time-to-market-demanding Application Specific Integrated Circuits (ASICs). Addition is the main operation of every arithmetic circuit, so improving the speed and reducing the area occupancy of adder circuits is still an active research topic. In digital Very Large Scale Integration (VLSI) circuits, full adders are the basic building blocks of all arithmetic operations. The adder therefore has a great impact on the performance of circuits that are based on arithmetic operations.

Existing adder structures such as the Ripple Carry Adder (RCA), Carry Look Ahead Adder (CLA), Carry Save Adder (CSA), Carry Select Adder (CSEL), Carry Bypass Adder (CBY) and Area Efficient Carry Select Adder (AECSA) are analyzed based on their performance. Some of these structures reduce the area occupied by the circuit at the cost of increased delay, while others reduce the delay at the cost of increased area. The proposed adder structure optimizes performance: the delay is reduced while the area consumption stays equal to that of a normal adder design. The characteristics of a digital circuit are analyzed mainly in terms of time and area consumption. FPGAs provide an alternative approach to ASIC implementation, with features such as large-scale integration, post-production design verification, lower non-recurring costs, and a reconfigurable design approach. Unfortunately, as is well known, designing efficient adders on an FPGA platform is not trivial. FPGA building blocks are highly optimized in terms of speed or area, which facilitates efficient realization of complex functions. One of the major changes in FPGA architecture has been the introduction of the 6-input LUT as a logic element [11, 16]. With this primitive, logic implementation leads to higher logic densities, resulting in minimal-depth circuits and hence higher speed, a trend toward which current FPGAs are oriented. Perhaps the biggest issue with 6-input LUTs is their underutilization when implementing a particular logic function, since many logic functions do not require six inputs.

1.1 Generic FPGA Logic Cell Definition

An FPGA consists of a two-dimensional array of configurable logic blocks (CLBs). The basic elements of these CLBs are small SRAM cells (Look-Up-Tables, LUTs), which can be used to implement any logic function of up to a specified number of variables with identical cost and delay. Often, each CLB can be configured in different ways. In [6], a specific definition for the Xilinx 3000 Series [5] was presented that considers different configurations. We define a more general model for the logic block structures which is valid for most SRAM-based FPGA architectures. We distinguish between three different configuration types, shown in Figure 1.1.
Fig.1.1. Different configuration types of SRAM-based logic blocks

Type 1 consists of an array of discrete LUTs, each depending on a set of input variables $x_i$: $f_i = f_i(x_i)$, where $1 \le i \le k$.

Type 2 is a two-level LUT structure. The first level is identical to Type 1, but the outputs of the first level are combined within the second level. The second-level logic can be arbitrary or restricted to some specific Boolean operations. Given the first-level outputs $g_i = g_i(x_i)$, a second-level output is defined by $f(x_1, x_2, \ldots, x_k) = g_1(x_1) \circ g_2(x_2) \circ \ldots \circ g_k(x_k)$, where $\circ \in \{\wedge, \vee, \oplus, \equiv, \ldots\}$.

Type 3 describes a structure with two LUTs sharing input variables. In addition to an arbitrary number of common variables $x_2$, each LUT may depend on a set of further disjoint inputs $x_1$ and $x_3$. Therefore, the output can be written as $f_{pair}(x) = (f_1(x_1, x_2), f_2(x_2, x_3))$, where $x = x_1 \cup x_2 \cup x_3$ and $x_1 \cap x_3 = \emptyset$. Some configuration types and the corresponding parameters for commercial FPGAs are listed in Table 1.1. Dotted lines indicate that more than one configuration is possible. For example, the XC4000 can implement Type 1 either by one LUT depending on five input signals or by two discrete LUTs each depending on four input variables.
Table 1.1. Configuration types and corresponding parameters of well known FPGAs

It is clear that any attempt to improve the performance of adder architectures using only digital logic and routing resources is ineffective. To speed up most arithmetic operations relative to their implementation in generic configurable logic, embedded DSP48E slices with high-speed features are also available in Virtex-5 devices. DSP slices include dedicated arithmetic circuits, optimized to speed up addition, multiplication, accumulation, MAC (Multiply and ACcumulate) and Boolean logic functions.

A combinational circuit that adds three bits is called a full adder. Two of the input variables, denoted a and b, represent the two significant bits to be added. The third input, c, represents the carry from the previous, less significant position. The two output variables are designated Sum and Carry and are determined from the arithmetic sum of the input bits. When all the input bits are zero, the output is 0. The output Sum is equal to 1 when only one input is equal to 1 or when all three inputs are equal to 1. Based on the truth table, the outputs Sum and Carry are described as

$\text{Sum} = a \oplus b \oplus c$ (1)

$\text{Carry} = a \cdot b + b \cdot c + c \cdot a$ (2)

In equations (1) and (2), $\oplus$ represents the Ex-Or operation, $\cdot$ represents the AND operation, and $+$ represents the OR operation.
Table:1.2 Truth table of Full Adder

The conventional 1-bit adder circuit is shown in Figure 1.2. It requires 6 gates (2 Ex-Or gates and 4 other logic gates) to implement a 1-bit adder.

Fig. 1.2. Conventional 1-bit adder
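Equations (1) and (2) can be checked by enumerating all eight input combinations. The following is a minimal Python sketch, not part of the original design flow; the function names `full_adder` and `truth_table` are illustrative choices. It regenerates the rows that a full-adder truth table must contain.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (Sum, Carry) for three input bits, following equations (1) and (2)."""
    s = a ^ b ^ c                          # Sum = a XOR b XOR c
    carry = (a & b) | (b & c) | (c & a)    # Carry = a.b + b.c + c.a
    return s, carry


def truth_table() -> list[tuple[int, int, int, int, int]]:
    """List every (a, b, c, Sum, Carry) row in counting order."""
    rows = []
    for a, b, c in product((0, 1), repeat=3):
        s, carry = full_adder(a, b, c)
        rows.append((a, b, c, s, carry))
    return rows


if __name__ == "__main__":
    print("a b c | Sum Carry")
    for a, b, c, s, carry in truth_table():
        print(f"{a} {b} {c} |  {s}    {carry}")
```

Each row satisfies 2·Carry + Sum = a + b + c, which is the arithmetic meaning of the two outputs.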

1.2 Carry Look-Ahead Adder

The carry look-ahead adder (CLA) is theoretically one of the fastest methods for addition.
Weinberger and Smith invented the CLA in 1958 [Weinberger58]. The CLA uses
intermediate information to determine in advance whether there will be a carry out of a given bit
position. Table 1.3 shows the truth table for a full adder, including this extra carry
information. For the delete condition, there is no carry out of the bit position. For
the propagate condition, there is a carry out only if there is a carry in. For the
generate/propagate condition, there is always a carry out at that position.

Table 1.3: Generate and propagate information for a CLA

Figure 1.3: 4-bit carry look-ahead adder. Reduced full adders are used, as they are no longer required to compute the carry-out.

Figure 1.3 shows the block diagram for a 4-bit section of a CLA. The CLA block at the top of the diagram is a set of circuitry that creates generate and propagate signals for a group of full adders, as well as the carry-out from that group. The following equations compute each position's generate and propagate signals:

Generate: $G_i = A_i \cdot B_i$

Propagate: $P_i = A_i + B_i$

Some books instead define the propagate signal as the exclusive-OR of the A and B
signals, but this does not change the result of addition. We chose the above definition
because the implementation of the OR operation is more efficient than that of exclusive-
OR in most technologies. The equations for the carries in a CLA are given by:

$C_1 = G_0 + P_0 \cdot C_{in}$

$C_2 = G_1 + G_0 \cdot P_1 + P_0 \cdot P_1 \cdot C_{in}$

$C_3 = G_2 + G_1 \cdot P_2 + G_0 \cdot P_1 \cdot P_2 + P_0 \cdot P_1 \cdot P_2 \cdot C_{in}$

As we attempt to compute the carries further and further in advance, larger gates are required. For example, computing $C_3$ requires a 4-input AND gate and a 4-input OR gate. Hence, the size of the look-ahead logic is usually limited to 3 carries. AND gates with 5 and 6 inputs would be needed for the next 2 carry signals, which makes their implementation in CMOS very slow due to the stacked transistors in the pull-up or pull-down paths. Because the carry calculation is performed by the carry look-ahead block, the one-bit adders in a CLA are reduced full adders, which no longer need to compute a carry. The reduced full adder performs the operation given by the equation below:

$\Sigma = A \oplus B \oplus C = \bar{A}\,\bar{B}\,C + \bar{A}\,B\,\bar{C} + A\,\bar{B}\,\bar{C} + A\,B\,C$
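The carry equations and the reduced full adder can be exercised together in a small behavioural model. The Python sketch below is an illustration only. The function name `cla4` and the `xor_propagate` switch are assumptions made here, and the group carry-out `c4` is written by extending the same pattern as the three carries given in the text. Checking all inputs confirms the remark that defining propagate as OR or as exclusive-OR gives the same sum.

```python
def cla4(a: int, b: int, cin: int, xor_propagate: bool = False) -> tuple[int, int]:
    """Add two 4-bit values with look-ahead carries; return (sum, carry_out)."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [x & y for x, y in zip(A, B)]                      # G_i = A_i . B_i
    P = [(x ^ y) if xor_propagate else (x | y)             # P_i = A_i + B_i (or XOR)
         for x, y in zip(A, B)]

    # Carries computed directly from G, P and cin, as in the equations above.
    c1 = G[0] | (P[0] & cin)
    c2 = G[1] | (G[0] & P[1]) | (P[0] & P[1] & cin)
    c3 = G[2] | (G[1] & P[2]) | (G[0] & P[1] & P[2]) | (P[0] & P[1] & P[2] & cin)
    # Assumption: group carry-out, obtained by extending the same pattern one step.
    c4 = (G[3] | (G[2] & P[3]) | (G[1] & P[2] & P[3])
          | (G[0] & P[1] & P[2] & P[3]) | (P[0] & P[1] & P[2] & P[3] & cin))

    # Reduced full adders: each bit only computes A XOR B XOR C, no carry.
    carries = [cin, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (A[i] ^ B[i] ^ carries[i]) << i
    return total, c4


if __name__ == "__main__":
    s, cout = cla4(0b1011, 0b0110, 1)
    print(f"sum={s:04b} carry_out={cout}")
```

Note that the sum bit always uses the exclusive-OR of the inputs, even when the propagate signal is built with OR; only the carry logic uses P.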
CHAPTER 2

LITERATURE SURVEY

1. Title: FPGA Implementation of High-Speed Area-Efficient Processor for Elliptic Curve
Point Multiplication Over Prime Field, 2019, Md. Mainul Islam.

Description: Developing a high-speed elliptic curve cryptographic (ECC) processor that
performs fast point multiplication with low hardware utilization is a crucial demand in
the fields of cryptography and network security. This paper presents field-programmable
gate array (FPGA) implementation of a high-speed, low-area, side-channel attacks
(SCAs) resistant ECC processor over a prime field. The processor supports 256-bit point
multiplication on recently recommended twisted Edwards curve, namely, Edwards25519,
which is used for a high-security digital signature scheme called Edwards curve digital
signature algorithm (EdDSA). The paper proposes novel hardware architectures for point
addition and point doubling operations on the twisted Edwards curve, where the
processor takes only 516 and 1029 clock cycles to perform each point addition and point
doubling, respectively. For a 256-bit key, the proposed ECC processor performs single
point multiplication in 1.48 ms, running at a maximum clock frequency of 177.7 MHz in
a cycle count of 262 650 with a throughput of 173.2 kbps, utilizing only 8873 slices on
the Xilinx Virtex-7 FPGA platform, where the points are represented in projective
coordinates. The implemented design is time-area-efficient as it offers fast scalar
multiplication with low hardware utilization without compromising the security level.

2. Title: An Efficient and Flexible Hardware Implementation of the Dual-Field Elliptic
Curve Cryptographic Processor, 2016, Zilong Liu, Dongsheng Liu, and Xuecheng Zou

Description: Elliptic Curve Cryptography (ECC) has been widely used for the digital
signature to ensure the security in communication. It’s important for the ECC processor
to support a variety of ECC standards to be compatible with different security
applications. Thus, a flexible processor which can support different standards and
algorithms is desired. In this paper, an efficient and flexible dual-field ECC processor
using the hardware-software approach is presented. The proposed processor can support
arbitrary elliptic curve. An elaborate Modular Arithmetic Logic Unit (MALU) is
designed. It can perform basic modular arithmetic operations and achieve high efficiency.
Based on our designed instruction set, the processor can be programmed to perform
various point operations based on different algorithms. To demonstrate the flexibility of
our processor, a point multiplication algorithm with power analysis resistance is adopted.
Our design is implemented on a field-programmable gate array (FPGA) platform and
also as an application-specific integrated circuit (ASIC). When implemented in a
55 nm CMOS process, the processor takes between 0.60 ms (163-bit ECC) and 6.75 ms
(571-bit ECC) to finish one point multiplication. Compared to other related works, the
merits of our ECC processor are the high hardware efficiency and flexibility.

3. Title: Low-Cost High-Performance VLSI Architecture for Montgomery Modular
Multiplication, 2015, Shiann-Rong Kuang, Kun-Yi Wu, and Ren-Yao Lu.

Description: This paper proposes a simple and efficient Montgomery multiplication
algorithm such that the low-cost and high-performance Montgomery modular multiplier
can be implemented accordingly. The proposed multiplier receives and outputs the data
with binary representation and uses only one-level carry-save adder (CSA) to avoid the
carry propagation at each addition operation. This CSA is also used to perform operand
precomputation and format conversion from the carry-save format to the binary
representation, leading to a low hardware cost and short critical path delay at the expense
of extra clock cycles for completing one modular multiplication. To overcome the
weakness, a configurable CSA (CCSA), which could be one full-adder or two serial half-
adders, is proposed to reduce the extra clock cycles for operand precomputation and
format conversion by half. In addition, a mechanism that can detect and skip the
unnecessary carry-save addition operations in the one-level CCSA architecture while
maintaining the short critical path delay is developed. As a result, the extra clock cycles
for operand precomputation and format conversion can be hidden and high throughput
can be obtained. Experimental results show that the proposed Montgomery modular
multiplier can achieve higher performance and significant area–time product
improvement when compared with previous designs.

4. Title: Elliptic Curve Cryptography with Efficiently Computable Endomorphisms
and Its Hardware Implementations for the Internet of Things, 2016, Zhe Liu, Johann
Großschädl, Zhi Hu, Kimmo Järvinen, Husen Wang and Ingrid Verbauwhede

Description: Verification of an ECDSA signature requires a double scalar multiplication
on an elliptic curve. In this work, we study the computation of this operation on a twisted
Edwards curve with an efficiently computable endomorphism, which allows reducing the
number of point doublings by approximately 50% compared to a conventional
implementation. In particular, we focus on a curve defined over the 207-bit prime field
$\mathbb{F}_p$ with $p = 2^{207} - 5131$. We develop several optimizations to the operation and we
describe two hardware architectures for computing the operation. The first architecture is
a small processor implemented in 0.13 µm CMOS ASIC and is useful in
resource-constrained devices for the Internet of Things (IoT) applications. The second architecture
is designed for fast signature verifications by using FPGA acceleration and can be used in
the server-side of these applications. Our designs offer various trade-offs and
optimizations between performance and resource requirements and they are valuable for
IoT applications.
5. Title: Energy-Efficient High-Throughput Montgomery Modular Multipliers for RSA
Cryptosystems, 2012, Shiann-Rong Kuang, Jiun-Ping Wang, Kai-Cheng Chang, and
Huan-Wei Hsu

Description: Modular exponentiation in the Rivest, Shamir, and Adleman cryptosystem is usually
achieved by repeated modular multiplications on large integers. To speed up the
encryption/decryption process, many high-speed Montgomery modular multiplication
algorithms and hardware architectures employ carry-save addition to avoid the carry
propagation at each addition operation of the add-shift loop. In this paper, we propose an
energy-efficient algorithm and its corresponding architecture to not only reduce the
energy consumption but also further enhance the throughput of Montgomery modular
multipliers. The proposed architecture is capable of bypassing the superfluous carry-save
addition and register write operations, leading to less energy consumption and higher
throughput. In addition, we also modify the barrel register full adder (BRFA) so that the
gated clock design technique can be applied to significantly reduce the energy
consumption of storage elements in BRFA. Experimental results show that the proposed
approaches can achieve up to 60% energy saving and 24.6% throughput improvement for
1024-bit Montgomery multiplier.

6. Title: Modified Dual-CLCG Method and Its VLSI Architecture for Pseudorandom Bit
Generation, 2019, Amit Kumar Panda and Kailash Chandra Ray

Description: Pseudorandom bit generator (PRBG) is an essential component for securing data during
transmission and storage in various cryptography applications. Among popular existing
PRBG methods such as linear feedback shift register (LFSR), linear congruential
generator (LCG), coupled LCG (CLCG), and dual-coupled LCG (dual-CLCG), the latter
proves to be more secure. This method relies on the inequality comparisons that lead to
generating pseudorandom bit at a non-uniform time interval. Hence, a new architecture of
the existing dual-CLCG method is developed that generates pseudo-random bit at
uniform clock rate. However, this architecture experiences several drawbacks such as
excessive memory usage and high-initial clock latency, and fails to achieve the maximum
length sequence. Therefore, a new PRBG method called “modified dual-CLCG” and
its very large-scale integration (VLSI) architecture are proposed in this paper to mitigate
the aforesaid problems. The novel contribution of the proposed PRBG method is to
generate pseudorandom bit at uniform clock rate with one initial clock delay and
minimum hardware complexity. Moreover, the proposed PRBG method passes all the 15
benchmark tests of NIST standard and achieves the maximal period of $2^n$. The proposed
architecture is implemented using Verilog-HDL and prototyped on the commercially
available FPGA device.

7. Title: An Ultra-Fast Parallel Prefix Adder, Kumar Sambhav Pandey, Dinesh Kumar B,
Neeraj Goel

Description: Parallel Prefix adders are arguably the most commonly used arithmetic units. They have
been extensively investigated at architecture level, register transfer level (RTL), gate
level, circuit level as well as layout level giving rise to a plethora of mathematical
formulations, topologies and implementations. This paper contributes significantly to the
understanding of these parallel prefix adders in a couple of ways. Firstly, it attempts to
describe various such parallel prefix adders in elegant and consistent formulations.
Secondly, a new family of parallel prefix adders is proposed at architecture level. The
estimates of the area-throughput characteristics for an instance of this family are also
presented. While the speeds achieved by this instance match those achieved by
state-of-the-art adders, their area characteristics exhibit up to 26% improvement.

8. Title: Design and FPGA Prototype of 1024-bit Blum-Blum-Shub PRBG
Architecture, Amit Kumar Panda and Kailash Chandra Ray

Description: The necessity of hardware security for internet-of-things applications demands a low
hardware area, high speed and secure pseudorandom bit generator (PRBG). Amongst
various PRBGs, Blum-Blum-Shub (BBS) is the proven cryptographically secure PRBG
because of its large prime factorization problem. The efficient implementation of BBS
method relies on the large integer modular multiplication which makes it computationally
expensive. Montgomery algorithm is a very efficient solution to perform the modular
multiplication which replaces the critical trial division with series of shift and additions.
However, the clock latency and critical path delay are increased with increase of modular
size. Therefore, in this paper, a modified radix-2 iterative Montgomery modular
multiplier is used for efficient hardware implementation of 1024-bit BBS generator. It
replaces two two-operand adders with one three-operand adder. Carry-save adder is the
commonly used technique for three-operand addition which experiences high critical path
delay. Hence, the critical path delay is further reduced by employing a fast parallel prefix
Han-Carlson adder for three-operand addition in the proposed architecture. The proposed
architecture is designed using Verilog HDL and prototyped on the Virtex5 FPGA device.
The physical implementation results report that the proposed 1024- bit BBS architecture
can work at a maximum frequency of 71.2 MHz with overall latency improvement of
93.87%.

9. Title: High-Throughput Modular Multiplication and Exponentiation Algorithms
Using Multibit-Scan–Multibit-Shift Technique, 2014, Abdalhossein Rezai and Parviz
Keshavarzi

Description: Modular exponentiation with a large modulus and exponent is a fundamental operation in
many public-key cryptosystems. This operation is usually accomplished by repeating
modular multiplications. Montgomery modular multiplication has been widely used to
relax the quotient determination. The carry–save adder has been employed to reduce the
critical path. This paper presents and evaluates a new and efficient Montgomery modular
multiplication architecture based on a new digit serial computation. The proposed
architecture relaxes the high-radix partial multiplication to a binary multiplication. It also
performs several multiplications of consecutive zero bits in one clock cycle instead of
several clock cycles. Moreover, the right-to-left and left-to-right modular exponentiation
architectures have been modified to use the proposed modular multiplication architecture
as its structural unit. We provide the implementation results on a Xilinx Virtex 5 FPGA
demonstrating that the total computation time and throughput rate of the proposed
architectures outperform most results so far in the literature.
CHAPTER 3

EXISTING METHOD

The base FPGA architecture used in this article is designed in a 22-nm CMOS process
and is a heterogeneous architecture with soft logic blocks, simple I/Os, configurable
memories, and fracturable multipliers. The internal connectivity of the blocks is provided
by a 50% depopulated crossbar that connects block inputs and basic logic element (BLE)
outputs to the BLE inputs. We have chosen a depopulated crossbar as this is common in
most commercial devices [2], [5]. The depopulated crossbar is composed of four smaller,
fully populated crossbars as designed by Chiasson and Betz [17]; this depopulation
results in the soft logic block inputs being divided into four groups of ten logically
equivalent pins. The input pins are evenly distributed on the bottom and the right sides of
the logic block, as this simplifies the layout of the FPGA. Table I gives the routing
architecture parameters of the base architecture. In addition to logic blocks, the
architecture includes hard 32k-bit RAM blocks (with configurable width/depth) and DSP
blocks (36 × 36 bit multipliers which can be fractured down to two 18×18 or four 9×9
multipliers). These values are chosen to be in line with the recommendations of prior
research [17], [18].

3.1 Non-fLUTs

Fig. 3.1 illustrates the baseline non-fracturable soft logic block used in this article, which
contains eight BLEs, 40 general inputs, eight general outputs, one cin pin, and one cout
pin. The BLE consists of a non-fracturable six-input LUT with an optionally registered
output pin. There are cin and cout pins into and out of the BLE, respectively, to drive a
hard adder. The specific details are described in Section 3.3. There is also a fast path from
the flip-flop output to the LUT input. We also consider architectures that do not contain
hardened arithmetic and hence have neither cin nor cout pins. Our choice of following
industry trends on using larger LUTs has interesting implications in terms of the
efficiency of addition. When implementing arithmetic using only 4-LUTs, every bit of
addition requires one LUT for the sum and another LUT for the carry. With 5-LUTs and
larger, a soft implementation of arithmetic can be more efficient. Fig. 3.2 shows how three
LUTs can implement two bits of addition. With fracturable 6-LUTs, this benefit grows
even larger, as fracturing into the two 5-LUTs mode allows implementation of both the
2-bit carry and a sum operation in a single fracturable 6-LUT.
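The claim that three LUTs can cover two bits of addition can be illustrated with a small software model. The Python sketch below assumes one plausible mapping, which is not necessarily the mapping drawn in the figure: a 3-input LUT for the low sum bit, and two 5-input LUTs for the high sum bit and the 2-bit carry. The helpers `make_lut`, `read_lut` and `add2` are hypothetical names introduced here.

```python
def make_lut(k, func):
    """Build a k-input LUT as a list of 2**k bits; input 0 is the address LSB."""
    table = []
    for addr in range(2 ** k):
        bits = [(addr >> i) & 1 for i in range(k)]
        table.append(func(*bits) & 1)
    return table


def read_lut(table, *bits):
    """Look up the output bit stored for the given input bits."""
    addr = sum(bit << i for i, bit in enumerate(bits))
    return table[addr]


def _carry(x, y, c):
    """Carry out of one bit position (majority of the three bits)."""
    return (x & y) | (c & (x ^ y))


# Assumed mapping: input order of the 5-input LUTs is (a0, b0, a1, b1, cin).
SUM0 = make_lut(3, lambda a0, b0, cin: a0 ^ b0 ^ cin)
SUM1 = make_lut(5, lambda a0, b0, a1, b1, cin: a1 ^ b1 ^ _carry(a0, b0, cin))
COUT = make_lut(5, lambda a0, b0, a1, b1, cin: _carry(a1, b1, _carry(a0, b0, cin)))


def add2(a, b, cin):
    """Add two 2-bit values using only the three LUTs; return (sum, carry_out)."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    s0 = read_lut(SUM0, a0, b0, cin)
    s1 = read_lut(SUM1, a0, b0, a1, b1, cin)
    cout = read_lut(COUT, a0, b0, a1, b1, cin)
    return s0 | (s1 << 1), cout


if __name__ == "__main__":
    print(add2(3, 2, 1))
```

With 4-LUTs the same two bits would need four LUTs (one sum and one carry per bit), so the 5-input case saves one LUT in this mapping.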

Fig. 3.1. Base soft logic block consists of eight BLEs connected by a 50% depopulated
crossbar. Each BLE consists of an LUT and a flip-flop with fast feedforward and feedback
paths reflecting what is commonly found in state-of-the-art FPGAs.

3.2 fLUTs

FPGAs have traditionally used non-fracturable LUTs (non-fLUTs) as described above. However, many
modern commercial FPGA soft logic blocks now employ fracturable LUTs (fLUTs) to obtain the
performance advantages of 6-LUTs with the area advantages of 4-LUTs [19]. Some
academic work has questioned whether the additional flexibility of fLUTs is worth their
cost [20]. It is notable, however, that Zgheib and Ienne [20] did not consider the impact
of hardened arithmetic, which we find to be significant.

Fig. 3.2. 5-LUTs and larger allow for more flexibility in technology-mapping addition

fLUTs change the interaction of soft logic with hard adders and carry chains, so it is
important to evaluate their combined effect. As much as possible, our fLUT architectures
reuse the same architecture as the non-fracturable case so that we may compare between
these architectures. Instead of BLEs, our baseline fLUT architecture uses fracturable
BLEs (fBLEs) which, as shown in Fig. 3.3, contain one fLUT with optionally registered
outputs. Unlike the baseline non-fracturable architecture (see Fig. 3.1), which had only
one output per BLE, each fBLE has two independent outputs. Therefore, the internal
crossbar inside the soft logic block has an additional eight inputs (local feedbacks)
compared to the non-fracturable case, and the soft logic block has 16 outputs instead of 8.
Fig. 3.4 shows the baseline fLUT. This fLUT can operate either as one 6-LUT or two
5-LUTs with four shared inputs. This choice of shared inputs is between that of a
Virtex-style fLUT [13], where all five inputs of the 5-LUTs are shared, and a Stratix-style
fLUT [14], where two inputs of the two 5-LUTs are shared.

Fig. 3.3. Baseline fBLE which contains one fLUT with optionally registered outputs.

Fig. 3.4. Baseline fLUT which operates as either one 6-LUT or two 5-LUTs with four
shared inputs.
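The two operating modes of the baseline fLUT can be sketched as a behavioural model. The Python class below is an illustration under stated assumptions, not the circuit used in the article: the class name `FracturableLUT`, the method names, and the pin ordering are all hypothetical. It stores two 32-entry tables, selects between them with a sixth input in 6-LUT mode, and reads them independently (four shared inputs plus one private input each) in dual 5-LUT mode.

```python
class FracturableLUT:
    """Two 32-entry tables that act as one 6-LUT or as two 5-LUTs."""

    def __init__(self, lower, upper):
        assert len(lower) == 32 and len(upper) == 32
        self.lower = list(lower)
        self.upper = list(upper)

    @staticmethod
    def _addr(bits):
        # Input 0 is the least significant address bit.
        return sum(bit << i for i, bit in enumerate(bits))

    def as_6lut(self, in0, in1, in2, in3, in4, in5):
        """6-LUT mode: in5 drives the mux that picks one of the two 5-LUTs."""
        addr = self._addr((in0, in1, in2, in3, in4))
        return self.upper[addr] if in5 else self.lower[addr]

    def as_two_5luts(self, shared, x_lower, x_upper):
        """Dual mode: four shared inputs plus one private input per 5-LUT."""
        s = tuple(shared)
        assert len(s) == 4
        out_lower = self.lower[self._addr(s + (x_lower,))]
        out_upper = self.upper[self._addr(s + (x_upper,))]
        return out_lower, out_upper


def table5(func):
    """Build a 32-entry table from a 5-input Boolean function."""
    return [func(*[(addr >> i) & 1 for i in range(5)]) & 1 for addr in range(32)]


# Example configuration: a full adder whose sum and carry share inputs (a, b, cin).
# The fourth shared input and both private inputs are unused in this example.
SUM_TABLE = table5(lambda a, b, cin, unused, x: a ^ b ^ cin)
CARRY_TABLE = table5(lambda a, b, cin, unused, x: (a & b) | (b & cin) | (cin & a))
flut = FracturableLUT(SUM_TABLE, CARRY_TABLE)

if __name__ == "__main__":
    print(flut.as_two_5luts((1, 1, 0, 0), 0, 0))
```

In dual mode the example produces both full-adder outputs from a single fLUT, which is the kind of packing that fracturing is meant to enable.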

3.3 HARD ADDER AND CARRY CHAIN ARCHITECTURES

To evaluate the impact of including hard adders in FPGA logic blocks, we explore
different design choices relating to adder implementation and interaction with the rest of
the logic block.

3.3.1. Adder Primitive

To ensure we fairly compare various hard adder and carry chain architectures, we
carefully designed two hard adder primitives at the electrical level and hand-optimized them at the
transistor level. The first adder primitive is a basic 1-bit full adder. In a soft logic block,
eight of these full adders are linearly chained together to form a ripple carry chain. Table
III shows the properties of the 1-bit hard full adder used in this article. Area is measured
as minimum width transistor areas (MWTAs), using the transistor drive to area
conversion equations from Chiasson and Betz [17]. The adder circuitry, LUTs, and
routing are all designed with a similar goal of minimizing the area–delay product of the
FPGA, and the cin to cout path of the adder is particularly optimized for delay as it
occurs n−1 times on an n-bit adder.

Fig. 3.5. LUT with balanced adder interaction where both adder inputs are driven by
5-LUTs.

The second adder primitive is a 4-bit CLA. Each logic block contains two of these 4-bit
adders chained in a ripple carry fashion. Table IV shows the properties of the 4-bit CLA
used in this article. The carry lookahead optimization allows for a faster carry path (20
ps) compared to a ripple of four 1-bit adders (44 ps) when performing a 4-bit addition.
The CLA design trades off flexibility (as some bits are wasted if the desired adder length
is not divisible by 4) and area in exchange for speed.
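
The flexibility trade-off can be made concrete with a small back-of-the-envelope sketch. It uses only the two carry-path figures quoted above (20 ps per 4-bit CLA, 44 ps per four chained 1-bit adders). The assumptions are labeled in the code: the ripple delay is spread uniformly at 11 ps per bit, and inter-block carry links and sum logic are ignored, so the numbers are illustrative rather than a timing model.

```python
import math

CLA_GROUP_BITS = 4                   # width of one hard CLA primitive (from the text)
CLA_CARRY_PS = 20.0                  # carry path through one 4-bit CLA (from the text)
RIPPLE_CARRY_PS_PER_BIT = 44.0 / 4   # assumption: the 44 ps is spread evenly over 4 bits


def cla_groups(n_bits):
    """Number of 4-bit CLA primitives needed for an n-bit addition."""
    return math.ceil(n_bits / CLA_GROUP_BITS)


def wasted_bits(n_bits):
    """Adder bits left unused when n_bits is not a multiple of 4."""
    return cla_groups(n_bits) * CLA_GROUP_BITS - n_bits


def cla_carry_path_ps(n_bits):
    """Carry path through chained CLAs (assumption: no inter-block link delay)."""
    return cla_groups(n_bits) * CLA_CARRY_PS


def ripple_carry_path_ps(n_bits):
    """Carry path through chained 1-bit adders under the same simplification."""
    return n_bits * RIPPLE_CARRY_PS_PER_BIT


for width in (10, 16):
    print(width, wasted_bits(width), cla_carry_path_ps(width), ripple_carry_path_ps(width))
```

Under these simplifications, a 10-bit addition occupies three CLA primitives and leaves two bit positions unused, which is the flexibility cost described above.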

3.3.2 Adder Input Balancing

Fig. 3.5 shows one way the LUT and adder (within a BLE) may interact. Here, we make
use of the observation that a 6-LUT is constructed from two 5-LUTs and a mux. If that
mux is dropped, then the adder can be driven by two 5-LUTs that share their
inputs. If the adder is not used, then another mux can be used to produce the 6-LUT
output. We call this the balanced LUT interaction; its underlying rationale is that a
symmetric amount of prior logic for each adder input may be the most appropriate
architecture. Example circuits that may benefit from this architecture are
applications where multiplexers select the inputs to an adder. A similar interaction for the
CLA is shown in Fig. 3.6. Fig. 7 shows another LUT–adder interaction architecture that we
will explore. Here, the 6-LUT output drives one of the adder inputs and the other adder
input is driven by one of the 6-LUT inputs. As in the previous case, if the adder is not
used, then another mux can be used to select the 6-LUT output. We call this the
unbalanced LUT interaction. We model each additional SRAM-controlled 2-to-1 mux
(one per BLE for the balanced LUT interaction, and two per BLE for the unbalanced
LUT interaction) as having 22 ps of delay and occupying 15 MWTAs (including the
SRAM configuration bit). The underlying rationale for this architecture is that there
may be an advantage in allowing a faster input into one side of the adder, which is
appropriate when speed is critical.
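
The observation that a 6-LUT is two 5-LUTs plus a mux can be checked with a small behavioral model. The sketch below is illustrative only: the function names, the bit ordering of the truth tables, and the `balanced_ble` helper are assumptions made for this example, not part of the architecture description.

```python
def lut(truth_table, inputs):
    """Evaluate a k-input LUT. Bit i of truth_table is the output for the
    input combination whose integer encoding is i (inputs[0] is the LSB)."""
    index = 0
    for position, bit in enumerate(inputs):
        index |= (bit & 1) << position
    return (truth_table >> index) & 1


def mux2(select, when0, when1):
    """2-to-1 multiplexer."""
    return when1 if select else when0


def lut6_from_two_lut5(table64, inputs6):
    """A 6-LUT built from two 5-LUTs that share five inputs, plus one mux
    steered by the sixth input."""
    low_table = table64 & 0xFFFFFFFF
    high_table = (table64 >> 32) & 0xFFFFFFFF
    shared = inputs6[:5]
    return mux2(inputs6[5], lut(low_table, shared), lut(high_table, shared))


def full_adder(a, b, cin):
    """Reference 1-bit full adder, returns (sum, cout)."""
    total = a + b + cin
    return total & 1, total >> 1


def balanced_ble(table_a, table_b, inputs5, cin):
    """Balanced interaction: the mux is dropped and each 5-LUT drives one
    adder input (hypothetical helper for illustration)."""
    return full_adder(lut(table_a, inputs5), lut(table_b, inputs5), cin)


# 0xAAAAAAAA passes input 0 through, 0xCCCCCCCC passes input 1 through,
# so this call adds inputs5[0] and inputs5[1].
print(balanced_ble(0xAAAAAAAA, 0xCCCCCCCC, [1, 1, 0, 0, 0], 0))
```

In the balanced mode both adder operands see one 5-LUT of logic, whereas the 6-LUT mode adds the extra mux level on the output path.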

Fig. 3.6. Four-bit CLA with balanced LUT interaction.

3.3.3 Carry Chain Flexibility


Another class of interesting architectures are those with hardened adders but no dedicated
carry link between logic blocks. Here, both the cin and cout pins are treated as though
they are regular input and output pins, respectively, in the inter-block routing
architecture. Within the logic block, the carry signals maintain the same restricted
connections. For architectures that have a dedicated carry link, the carry link has a delay
of 20 ps. For those without a dedicated cin/cout, we add the usual circuitry to allow them
to access the right and bottom side channels of the logic block. There are a few different
ways to implement the starting location of a multi-bit addition. One can place a mux at
every carry link that can select from logic-0, logic-1, or a carry signal of a previous stage,
but this can incur a significant delay penalty because every carry link must now go
through a mux. Alternatively, one can place these muxes only on selected carry links,
thus minimizing the overhead of excessive muxing, but at the cost of having fewer
locations where an addition may begin. This latter approach is typical in commercial
devices. A third option is to move the responsibility for starting an addition into a
front-end CAD tool: the tool pads the addition with a dummy adder before the LSB
(whose addends are fed by constants), which generates a cin of 0 for addition and of 1
for subtraction. We employ this approach in this article.
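
The dummy-adder padding can be sketched behaviorally as follows. This is a minimal model under stated assumptions, not the CAD tool's implementation: the function names are hypothetical, and the chain's own first carry-in is passed in as `chain_cin` to show that the dummy stage makes its value irrelevant (0 + 0 never produces a carry, 1 + 1 always does).

```python
def full_adder(a, b, cin):
    """Reference 1-bit full adder, returns (sum, cout)."""
    total = a + b + cin
    return total & 1, total >> 1


def ripple_chain(a_bits, b_bits, chain_cin):
    """Plain carry chain with no start-of-chain mux."""
    carry = chain_cin
    sums = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums, carry


def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def padded_add(a, b, width, subtract=False, chain_cin=0):
    """Add or subtract by prepending a dummy stage fed by constants."""
    wanted_cin = 1 if subtract else 0
    b_operand = (~b) & ((1 << width) - 1) if subtract else b
    # Dummy stage before the LSB: both addends tied to the wanted carry-in.
    a_bits = [wanted_cin] + to_bits(a, width)
    b_bits = [wanted_cin] + to_bits(b_operand, width)
    sums, carry_out = ripple_chain(a_bits, b_bits, chain_cin)
    return from_bits(sums[1:]), carry_out  # drop the dummy sum bit


print(padded_add(5, 3, 4))                 # addition
print(padded_add(5, 3, 4, subtract=True))  # subtraction as a + ~b + 1
```

The cost of this approach is one extra adder bit per addition, in exchange for removing the mux from the carry links.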
CHAPTER 4

PROPOSED METHOD

4.1 A New Logical Full Adder

The truth table for the addition of two bits with a carry flag is presented as follows:

Table 4.1: Truth table of the full adder/subtractor

From Table 4.1, conclusions can be drawn about the output values: the sum values and the
carry-out flags. The idea follows the way an electronic circuit operates, switching the
current on or off to obtain the expected logic value.
Two circuits are built for the sum and the carry-out, both based on 2-to-1 multiplexers
(MUX2-1). Each uses a MUX as a switch that is controlled so that the expected value
passes through.

4.2 Proposed Circuits for Full Adder/Subtractor

The circuit for the sum value: A circuit is built in which the control values select
between input x and input y, as shown in Table 4.2 and Fig. 4.1.

Table 4.2: Truth table Sum/Sub value


Figure 4.1: Circuit for Sum value: (a) Controlled by Cin & y; (b) Controlled by Cin & x

        The circuit for the carry-out flag: Observing the truth table for the addition
of two bits yields new equations, and shows that the best solution is to let input x and
the carry-in control the carry-out flag. Similarly, two circuits for the carry-out of the
addition are illustrated in Table 4.2 and Fig. 4.1.
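
One way to realize a full adder from only 2-to-1 multiplexers and NOT gates is sketched below. It is an illustrative formulation chosen to match the description above (sum controlled by Cin and y, carry-out controlled by x and Cin). The exact wiring of the circuits in Fig. 4.1 is not reproduced in the text, so the structure and all function names here are assumptions.

```python
def mux2(select, when0, when1):
    """2-to-1 multiplexer used as a controlled switch."""
    return when1 if select else when0


def inv(bit):
    """NOT gate."""
    return bit ^ 1


def mux_sum(x, y, cin):
    """Sum controlled by cin and y: pass x when y == cin, otherwise NOT x."""
    y_differs_from_cin = mux2(cin, y, inv(y))      # y XOR cin built from one mux
    return mux2(y_differs_from_cin, x, inv(x))


def mux_carry(x, y, cin):
    """Carry-out controlled by x and cin: when x == cin the carry equals x,
    otherwise it equals y."""
    x_differs_from_cin = mux2(cin, x, inv(x))      # x XOR cin built from one mux
    return mux2(x_differs_from_cin, x, y)


# Exhaustive comparison against ordinary integer addition.
for x in (0, 1):
    for y in (0, 1):
        for cin in (0, 1):
            total = x + y + cin
            assert (mux_sum(x, y, cin), mux_carry(x, y, cin)) == (total & 1, total >> 1)
print("mux-based full adder matches x + y + cin for all 8 input combinations")
```

Both outputs depend only on the primary inputs, so in this formulation the sum and the carry-out can be evaluated in parallel.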

4.3 Implementing with LUTs

Two LUT6-1s are used to build the FA. The LUT6-1 is the basic LUT inside a slice.
Because the sum value and the carry-out flag above can be evaluated at the same time, two
basic LUT6-1s are used: one for the sum circuit and one for the carry-out circuit.
Observing the two truth tables of the addition, together with the carry-out circuits for
the addition, shows that they can be combined to obtain an FA/S with control = 0, as
shown in Fig. 4.2.
Figure 4.2: Diagram Cout with a control pin

Moreover, the final circuit for the FA is obtained by using one LUT (6-input, 2-output)
that is divided into two smaller LUTs, one for the sum and one for the carry-out flag.
When the carry-out flag is equal to 1, the addition signals an overflow. As a result, the
Sum/Sub value and the carry-out flag can be computed at the same time, as shown in Fig. 4.3.

Figure 4.3: New Full Adder (n_FA)
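
To illustrate what the two small LUTs hold, the sketch below derives the truth-table contents for the sum and carry-out of a plain full adder. The input ordering (x as bit 0, y as bit 1, Cin as bit 2) and the helper names are assumptions for this example. How the two tables are packed into the two outputs of a specific device's LUT is device-dependent and is not modeled here, nor is the control pin of the FA/S.

```python
def truth_table(function, n_inputs):
    """Pack a Boolean function into an integer: bit i holds the output for
    the input combination encoded as i (first argument is the LSB)."""
    table = 0
    for index in range(1 << n_inputs):
        bits = [(index >> k) & 1 for k in range(n_inputs)]
        if function(*bits):
            table |= 1 << index
    return table


def sum_bit(x, y, cin):
    return x ^ y ^ cin


def carry_bit(x, y, cin):
    return (x & y) | (cin & (x ^ y))


SUM_TABLE = truth_table(sum_bit, 3)
CARRY_TABLE = truth_table(carry_bit, 3)


def dual_output_lut(x, y, cin):
    """Look up both outputs from the same shared inputs."""
    index = x | (y << 1) | (cin << 2)
    return (SUM_TABLE >> index) & 1, (CARRY_TABLE >> index) & 1


print(hex(SUM_TABLE), hex(CARRY_TABLE))  # 0x96 0xe8
```

Because both tables are indexed by the same three inputs, a single lookup step yields the sum and the carry-out together.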


4.4 Proposed CLA Adder Using the Proposed Full Adder

4.4.1 Logical Structure of the Hierarchical Carry Lookahead Adder

In contrast to the well known ripple-carry design which has a total time delay that is
proportional to the length N of the adder, the carry lookahead design is suited to generate
all carries simultaneously by additional logic circuitry. This results in a constant addition
time, independent of the length of the adder. For a good understanding we give a short
overview of the logical structure of this adder type. A more detailed description can be
found in .

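Let $C_{i-1}$ be the carry input to the $i$th bit position and $C_{-1}$ the carry input to
the least significant position. Let $S_i$ and $C_i$ be the sum and carry outputs of the
$i$th stage. Then, the sum and carry bits can be described as

$$S_i = P_i \oplus C_{i-1}, \qquad C_i = G_i + P_i \cdot C_{i-1},$$

where $P_i = A_i \oplus B_i$ and $G_i = A_i \cdot B_i$ are defined as the PROPAGATE and
GENERATE variables, respectively, which can be generated simultaneously from the external
inputs $A_i$ and $B_i$ within the GP level, as illustrated in Fig. 4.4. If we recursively
apply the carry formula given above, we obtain the following set of carry equations:

$$\begin{aligned}
C_0 &= G_0 + P_0 C_{-1} \\
C_1 &= G_1 + P_1 G_0 + P_1 P_0 C_{-1} \\
C_2 &= G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_{-1} \\
C_3 &= G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_{-1}
\end{aligned}$$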

These equations show that all carries $C_i$ are available simultaneously and that all sum
bits $S_i$ can be generated in parallel, as illustrated in Fig. 4.4 for a 4-bit carry
lookahead adder. As also shown in Fig. 4.4, we distinguish between the structure levels
GP, CLA and SUM according to the signals generated within each level.
Fig. 4.4. Structure of a 4-bit carry lookahead adder
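
The three structure levels can be followed in a short behavioral sketch of a 4-bit carry lookahead adder. It models only the logic (standard propagate/generate formulation), not the FPGA mapping, and the function name `cla4` is chosen for this example.

```python
def cla4(a, b, c_in):
    """4-bit carry lookahead adder; returns (4-bit sum, carry-out)."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]

    # GP level: propagate and generate signals from the external inputs.
    p = [x ^ y for x, y in zip(a_bits, b_bits)]
    g = [x & y for x, y in zip(a_bits, b_bits)]

    # CLA level: every carry is a two-level function of P, G and c_in only.
    c0 = g[0] | (p[0] & c_in)
    c1 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c_in)
    c2 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c_in)
    c3 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c_in))

    # SUM level: each sum bit combines its propagate with the incoming carry.
    carries_in = [c_in, c0, c1, c2]
    s_bits = [p[i] ^ carries_in[i] for i in range(4)]
    return sum(bit << i for i, bit in enumerate(s_bits)), c3


print(cla4(9, 7, 0))  # 9 + 7 = 16, i.e. sum 0 with carry-out 1
```

No carry expression refers to another carry, which is what allows all of them to be evaluated in parallel.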

Theoretically, it is possible to build adders of any word length if the CLA unit can be
freely expanded. In practice, however, the complexity of the CLA unit is limited. This
leads to a hierarchical structure based on Block-Carry-Lookahead (BCLA) units. Each of
them generates only a limited number of carries and, additionally, BLOCK-PROPAGATE
($P_0^{*}$) and BLOCK-GENERATE ($G_0^{*}$) signals, which are used to evaluate the
carries of the following BCLA level. Figure 3 shows a two-level CLA unit of a 16-bit
CLA adder in a 4-bit BCLA unit configuration. Each of the 4-bit BCLA units is suited to
evaluate three carries. The most significant carry signal $C_{out}$ can be written as
$C_{out} = C_{-1} \cdot P_{0,\max}^{*} + G_{0,\max}^{*}$, where max refers to the
variables of the most significant BCLA unit. The number of BCLA levels is given by

The implementation of hierarchical CLA adders is done in four steps: adaptive structure
generation, technology mapping, partitioning, and placement. We present a method for
realizing all these steps for any SRAM-based FPGA which can be described by the
generic models defined in Section 2. During each design step we make efficient use of
knowledge about the logical structure. As described, the length of a BCLA unit is not
bound to a particular value. Therefore, we can write the equations for BLOCK-PROPAGATE,
BLOCK-GENERATE and CARRIES

Since the carry lookahead adder is chosen for high-performance addition, the aim of our
logic adaptation step is to reduce the BCLA level count to a minimum. According to Eq.(2),
this is accomplished by maximizing the bit width of the BCLA units, which increases the
number of carry signals evaluated within a unit. As shown in Table 2, the complexity of a
BCLA is limited by the implementation of the BLOCK-GENERATE signal, which needs
2*BCLAbitwidth - 1 signal inputs and a number of sum terms equal to
BCLAbitwidth. Since this is also valid for the most complex CARRY signal, we can use
the same implementation method for both. Therefore, our aim is to determine the
configuration of an FPGA device that is suitable for implementing the most complex
BLOCK-GENERATE signal within a CLB.
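
The input and term counts quoted for the BLOCK-GENERATE signal can be reproduced with a short sketch. It assumes the standard expansion of the block generate, in which bit j contributes the term G_j ANDed with all higher propagates. The function names and the string labels for the signals are illustrative.

```python
def block_generate_terms(width):
    """Terms of the BLOCK-GENERATE signal for a BCLA unit of the given bit
    width; each term is the list of signal names that are ANDed together."""
    terms = []
    for j in range(width - 1, -1, -1):
        higher_propagates = [f"P{k}" for k in range(width - 1, j, -1)]
        terms.append(higher_propagates + [f"G{j}"])
    return terms


def evaluate_block_generate(p_bits, g_bits):
    """Evaluate the same expression on concrete bit values (index 0 = LSB)."""
    width = len(g_bits)
    result = 0
    for j in range(width):
        term = g_bits[j]
        for k in range(j + 1, width):
            term &= p_bits[k]
        result |= term
    return result


for width in (2, 4, 8):
    terms = block_generate_terms(width)
    distinct_inputs = {name for term in terms for name in term}
    print(width, len(distinct_inputs), len(terms))  # inputs = 2*width - 1, terms = width
```

For a 4-bit unit this gives seven inputs (G0 to G3 and P1 to P3) and four terms, so each extra bit of BCLA width adds two inputs and one term to the function that has to fit into the logic block.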
Fig:5.1 LUT Architecture

1. RTL SCHEMATICS OF EXISTING WORK


Fig:5.2 RTL SCHEMATICS OF EXISTING WORK

2. Device Summary Utilization Report

Fig:5.3 Device Summary Utilization Report

3. Timing Summary report

Fig:5.4 Timing Summary report


4. Power Report

Fig:5.5 Power Report

5. Output Waveform for Existing Method


Fig:5.6 Output Waveform for Existing Method
1. RTL SCHEMATICS OF PROPOSED WORK

Fig:5.7 RTL SCHEMATICS OF PROPOSED WORK


2. Device Summary Utilization Report
Fig:5.8 Device Summary Utilization Report

3. Timing Summary report

Fig:5.9 Timing Summary report

4. Power Report
Fig:5.10 Power Report

5. Output Waveform for Proposed Method

Fig:5.11 Output Waveform for Proposed Method


6. Parameters Comparison Summary

S.No   Parameter                   Existing Method   Proposed Method
1.     Slice LUTs                  28                23
2.     Combinational delay (ns)    6.425             6.486
3.     Power (W)                   0.041             0.023

Table:5.12 Parameters Comparison Summary
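
The relative differences between the two columns of the table can be computed directly. The sketch below uses only the six values listed above; the dictionary keys and the helper name are chosen for this example.

```python
existing = {"slice_luts": 28, "delay_ns": 6.425, "power_w": 0.041}
proposed = {"slice_luts": 23, "delay_ns": 6.486, "power_w": 0.023}


def percent_change(old, new):
    """Relative change from old to new in percent (negative means a reduction)."""
    return (new - old) / old * 100.0


changes = {name: percent_change(existing[name], proposed[name]) for name in existing}

for name, value in changes.items():
    print(f"{name}: {value:+.2f}%")
```

For these values the script reports about -17.9% for slice LUTs, about +0.95% for combinational delay, and about -43.9% for power.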

The experimental results lead to the conclusion that the standard FA and the new design use
the same resources on the FPGA. The proposed n_FA design is 20% to 30% faster than the standard
FA, depending on the FPGA series. The proposed n_FAS occupies only 50% of the resources of the
standard FA/S, and its speed increases by 28% to 40%.
CHAPTER 6

CONCLUSION

The purpose of this paper is to propose a new FA architecture on an FPGA platform with
two optimization goals. The first is to improve the FA/S ratio in order to increase speed.
The second is to create a new circuit for the FA/S that can be built with fewer gates.
As a result, the proposed design is based on the multiplexer and contains only two types
of components, NOT gates and multiplexers, so the design can be implemented easily
on an FPGA chip. Both designs were tested in detail on different FPGA
series. The experiments showed that the proposed n_FA performed 20% to 30% faster
than the standard FA using the same area resources, and that the proposed n_FAS performed
28% to 40% faster than the standard FA/S using only half the resources.
