7. Performance Analysis of Wallace Tree Multiplier with Kogge-Stone Adder Using 15-4 Compressor
Abstract:
Modern electronic devices must combine low power dissipation and compact area with high-speed performance. Among the building blocks of a digital system, the multiplier is the most complex and a major source of power dissipation. Approximate computing applied to multiplier design plays a major role in electronic applications such as multimedia, providing faster results at the cost of lower reliability. In this paper, a design approach for a 16-bit Wallace tree approximate multiplier with a 15-4 compressor is considered to provide greater reliability. The 16×16 Wallace tree multiplier is synthesized and simulated using Xilinx ISE 14.5. The multiplier occupies about 15% of the total device area, and its power dissipation and delay are 0.042 μW and 3.125 ns, respectively.
INTRODUCTION
Multipliers are an essential part of digital systems such as arithmetic and logic units and digital signal processors, and they largely determine the power, delay, and area utilization of the system. Hence there is a growing demand for improving multiplier performance [1-4]. A multiplier consists of three stages: partial product generation, partial product reduction, and a final addition stage. The second stage (partial product reduction) consumes the most time and power. Various techniques have been suggested to shorten the multiplier's critical stages. The most popular technique is to use compressors in the partial product reduction stage. A compressor is simply an adder circuit: it takes a number of equally weighted bits, adds them, and produces sum signals. Compressors are commonly used to reduce and accumulate a large number of inputs to a smaller number in parallel. Their main application is within a multiplier, where a large number of partial products have to be summed concurrently. The inner structure of compressors avoids carry propagation: either there are no carry signals, or they arrive at the same time as the internal values [5-7].
To reduce the delay of the second stage, several compressors are needed. Small compressors are useful for designing small multipliers, and different compressor sizes are required depending on the operand width. In this paper, a scheme for delay reduction in a 16-bit Wallace tree multiplier with a 15-4 compressor is considered. To build the 15-4 compressor, a 5-3 compressor is used as the basic module. AND gates generate the partial products; an N-bit multiplier needs N² AND gates. The partial product reduction phase has three major components: half adders, full adders, and 5-3 compressors [8-11]. The final addition is performed by a Kogge-Stone adder. Fig. 1 shows the structure of the 16×16 multiplier. Simulation results show that the approximate multiplier with compressors and a Kogge-Stone adder achieves higher performance than multipliers using compressors with other adders such as parallel adders. The rest of this paper is organized as follows: the design of the approximate 16×16 Wallace tree multiplier is detailed in Section II, and brief notes on the designs of the 15-4 compressor, the 5-3 compressor, and the Kogge-Stone adder are described in the following sections.
1.1 Objective
Most students of electronics engineering are exposed to integrated circuits (ICs) at a very basic level, involving SSI (small-scale integration) circuits like logic gates or MSI (medium-scale integration) circuits like multiplexers and parity encoders. But there is a much bigger world out there, involving miniaturization at levels so great that a micrometer and a microsecond are literally considered huge. This is the world of VLSI: Very Large Scale Integration. This section aims to introduce electronics engineering students to the possibilities and the work involved in this field.
VLSI stands for "Very Large Scale Integration", the field of packing more and more logic devices into smaller and smaller areas. Thanks to VLSI, circuits that would once have taken boards full of space can now fit in a package a few millimetres across. This has opened up opportunities to do things that were not possible before. VLSI circuits are everywhere: your computer, your car, your state-of-the-art digital camera, your cell phone. All of this involves expertise on many fronts within the same field, which we will look at in later sections. VLSI has been around for a long time; there is nothing new about it. But as a side effect of advances in the world of computers, there has been a dramatic proliferation of tools for designing VLSI circuits, and, following Moore's law, the capability of an IC has increased exponentially over the years in terms of computation power, utilization of available area, and yield. The combined effect of these two advances is that people can now put diverse functionality into ICs, opening up new frontiers. Examples are embedded systems, where intelligent devices are put inside everyday objects, and ubiquitous computing, where small computing devices proliferate to such an extent that even the shoes you wear may do something useful like monitoring your heartbeat. These two fields are closely related, and describing them could easily fill another article.
• Speed. Signals can be switched between logic 0 and logic 1 much faster within a chip than between chips. Communication within a chip can occur hundreds of times faster than communication between chips on a printed circuit board. The high speed of on-chip circuits is due to their small size: smaller components and wires have smaller parasitic capacitances to slow the signal down.
• Power consumption. Logic operations within a chip also take much less power. Once again, lower power consumption is largely due to the small size of on-chip circuits: smaller parasitic capacitances and resistances require less power to drive them.
1.4 VLSI and SYSTEMS
These advantages of integrated circuits translate into advantages at the system level:
➢ Smaller physical size. Smallness is frequently an advantage in itself; consider portable televisions or handheld cellular telephones.
➢ Lower power consumption. Replacing a handful of standard parts with a single chip decreases total power consumption. Reducing power consumption has a ripple effect on the rest of the system: a smaller, cheaper power supply can be used; since less power consumption means less heat, a fan may no longer be needed; and a simpler cabinet with less electromagnetic shielding may be feasible.
➢ Reduced cost. Reducing the number of components, the power supply requirements, cabinet costs, and so on will inevitably reduce system cost. The ripple effect of integration is such that the cost of a system built from custom ICs can be less, even though the individual ICs cost more than the standard parts they replace.
Understanding why integrated circuit technology has such a profound influence on the design of digital systems requires understanding both the technology of IC manufacturing and the economics of ICs and digital systems. The growing sophistication of applications continually pushes the design and manufacturing of integrated circuits and electronic systems to new levels of complexity. Perhaps the most striking characteristic of this collection of systems is its variety: as systems become more complex, we build not just a few general-purpose computers but an ever wider variety of special-purpose systems. Our ability to do so is a testament to our growing mastery of both integrated circuit manufacturing and design, but the growing demands of customers continue to test the limits of design and manufacturing.
1.4.1 ASIC
An Application-Specific Integrated Circuit (ASIC) is an integrated circuit (IC) customized for a particular use, rather than intended for general-purpose use. For example, a chip designed exclusively to run a cell phone is an ASIC. Intermediate between ASICs and industry-standard integrated circuits, like the 7400 or the 4000 series, are application-specific standard products (ASSPs).
As feature sizes have shrunk and design tools have improved over the years, the maximum complexity (and hence functionality) possible in an ASIC has grown from 5,000 gates to over one hundred million. Modern ASICs often include entire 32-bit processors, memory blocks including ROM, RAM, EEPROM, and Flash, and other large building blocks. Such an ASIC is often termed an SoC (system-on-chip). Designers of digital ASICs use a hardware description language (HDL), such as Verilog or VHDL, to describe the functionality of ASICs.
Field-programmable gate arrays (FPGAs) are the modern technology for building a breadboard or prototype from standard parts; programmable logic blocks and programmable interconnects allow the same FPGA to be used in many different applications. For smaller designs and/or lower production volumes, FPGAs may be more cost-effective than an ASIC design, even in production.
➢ An application-specific integrated circuit (ASIC) is an integrated circuit (IC) customized for a particular use, rather than intended for general-purpose use.
➢ A structured ASIC falls between an FPGA and a standard-cell-based ASIC.
➢ Structured ASICs are used mainly for mid-volume designs.
➢ The design task for a structured ASIC is to map the circuit onto a fixed arrangement of known cells.
Among different arithmetic blocks, the multiplier is one of the main blocks, which is widely
used in different applications especially signal processing applications. There are two general
architectures for the multipliers, which are sequential and parallel. While sequential architectures
are low power, their latency is very large. On the other hand, parallel architectures (such as
Wallace tree and Dadda) are fast while having high-power consumptions. The parallel
multipliers are used in high-performance applications where their large power consumptions may
create hot-spot locations on the die. Since the power consumption and speed are critical
parameters in the design of digital circuits, the optimizations of these parameters for multipliers
become critically important. Very often, one parameter is optimized subject to a constraint on the other. Specifically, achieving the desired performance (speed) within the limited power budget of portable systems is a challenging task. In addition, providing a given level of reliability may be another obstacle to reaching the system's target performance.
To meet the power and speed specifications, a variety of methods at different design abstraction
levels have been suggested. Approximate computing approaches are based on achieving the
target specifications at the cost of reducing the computation accuracy. The approach may be used
for applications where there is not a unique answer and/or a set of answers near the accurate
result can be considered acceptable. These applications include multimedia processing, machine
learning, signal processing, and other error resilient computations. Approximate arithmetic units
are mainly based on the simplification of the arithmetic units circuits. There are many prior
works focusing on approximate multipliers which provide higher speeds and lower power
consumptions at the cost of lower accuracies. Almost all of the proposed approximate multipliers are based on a fixed level of accuracy at runtime. Runtime accuracy reconfigurability, however, is a useful feature for providing different levels of quality of service during system operation: by reducing the quality (accuracy), the delay and/or power consumption of the unit may be reduced. In addition, some digital systems, such as general-purpose processors, may be utilized for both approximate and exact computation modes. One approach for achieving this feature is to use an approximate unit along with a corresponding correction unit. The correction unit, however, increases the delay, power, and area overhead of the circuit. Also, the error correction procedure may require more than one clock cycle, which could, in turn, slow down the processing further.
In this paper, we present four dual-quality reconfigurable approximate 4:2 compressors, which provide the ability to switch between the exact and approximate operating modes at runtime. The compressors may be utilized in the architectures of dynamic quality-configurable parallel multipliers. The basic structure of the proposed compressors consists of two parts: approximate and supplementary. In the approximate mode, only the approximate part is active, whereas in the exact operating mode, the supplementary part along with some components of the approximate part is invoked.
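To make the dual-mode idea concrete, the Python sketch below models a 4:2 compressor. The exact mode is the standard two-full-adder decomposition; the approximate mode shown is only an illustrative simplification, not the specific circuits proposed in this paper.

```python
def full_adder(x, y, z):
    # One-bit full adder: returns (sum, carry).
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)

def compressor_4_2(x1, x2, x3, x4, cin, exact=True):
    """4:2 compressor returning (sum, carry, cout)."""
    if exact:
        # Exact mode: two cascaded full adders, so that
        # sum + 2*(carry + cout) == x1 + x2 + x3 + x4 + cin.
        s1, cout = full_adder(x1, x2, x3)
        s, carry = full_adder(s1, x4, cin)
        return s, carry, cout
    # Hypothetical approximate mode: cin is ignored and the
    # two carry outputs are approximated with simpler gates.
    s = x1 ^ x2 ^ x3 ^ x4
    carry = (x1 & x2) | (x3 & x4)
    return s, carry, 0
```

In the exact mode the identity sum + 2·(carry + cout) = x1 + x2 + x3 + x4 + cin holds for all 32 input combinations; the approximate mode trades that identity for a shorter critical path, which is the essence of switching quality at runtime.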
Approximate computing
"The need for approximate computing is driven by two factors: a fundamental shift in the nature
of computing workloads, and the need for new sources of efficiency," said Anand Raghunathan, a
Purdue Professor of Electrical and Computer Engineering, who has been working in the field for
about five years. "Computers were first designed to be precise calculators that solved problems
where they were expected to produce an exact numerical value. However, the demand for
computing today is driven by very different applications. Mobile and embedded devices need to
process richer media, and are getting smarter – understanding us, being more context-aware and
having more natural user interfaces. On the other hand, there is an explosion in digital data
searched, interpreted, and mined by data centers."
A growing number of applications are designed to tolerate "noisy" real-world inputs and use
statistical or probabilistic types of computations.
"The nature of these computations is different from the traditional computations where you need
a precise answer," said Srimat Chakradhar, department head for Computing Systems Architecture
at NEC Laboratories America, who collaborated with the Purdue team. "Here, you are looking
for the best match since there is no golden answer, or you are trying to provide results that are of
acceptable quality, but you are not trying to be perfect."
However, today's computers are designed to compute precise results even when it is not
necessary. Approximate computing could endow computers with a capability similar to the
human brain's ability to scale the degree of accuracy needed for a given task. New findings were
detailed in research presented during the IEEE/ACM International Symposium on
Microarchitecture, Dec. 7-11 at the University of California, Davis.
The inability to scale the computation to the required level of accuracy is inherently inefficient and saps energy.
"If I asked you to divide 500 by 21 and I asked you whether the answer is greater than one, you
would say yes right away," Raghunathan said. "You are doing division but not to the full
accuracy. If I asked you whether it is greater than 30, you would probably take a little longer, but
if I ask you if it's greater than 23, you might have to think even harder. The application context
dictates different levels of effort, and humans are capable of this scalable approach, but computer
software and hardware are not like that. They often compute to the same level of accuracy all the
time."
Recently, the researchers have shown how to apply approximate computing to programmable
processors, which are ubiquitous in computers, servers and consumer electronics.
"In order to have a broad impact we need to be able to apply this technology to programmable
processors," Roy said. "And now we have shown how to design a programmable processor to
perform approximate computing."
The researchers achieved this milestone by altering the "instruction set," which is the interface
between software and hardware. "Quality fields" added to the instruction set allow the software
to tell the hardware the level of accuracy needed for a given task. They have created a prototype
programmable processor called Quora based on this approach.
"You are able to program for quality, and that's the real hallmark of this work," lead author
Venkataramani said. "The hardware can use the quality fields and perform energy efficient
computing, and what we have seen is that we can easily double energy efficiency."
In other recent work, led by Chippa, the Purdue team fabricated an approximate "accelerator" for
recognition and data mining.
"We have an actual hardware platform, a silicon chip that we've had fabricated, which is an
approximate processor for recognition and data mining," Raghunathan said. "Approximate
computing is far closer to reality than we thought even a few years ago."
Mahesh, R. and Vinod, A.P., "New Reconfigurable Architectures for Implementing FIR Filters with Low Complexity," IEEE Transactions on CAD of Integrated Circuits and Systems, Vol. 29, No. 2, 2010. In this paper, two reconfigurable architectures for low complexity FIR filters are
proposed namely, Constant Shift Method and Programmable Shift Method. The proposed FIR
filter architecture is capable of operating for different word length filter coefficients without any
overhead in the hardware circuitry. They show that dynamically reconfigurable filters can be
efficiently implemented by using common sub expression elimination algorithms. Design
examples show that the proposed architectures offer good area and power reductions and speed
improvement compared to the best existing reconfigurable FIR filter implementations in the
literature.
Sammy, P. et al., "A Programmable FIR Filter for TV Ghost Cancellation," IEEE Transactions,
1997. In this paper, a compact 64-tap programmable FIR filter suitable for TV ghost cancellation
has been designed by using Carry Save-Add Shift (CSAS) multiplier to achieve area efficiency
and an internally generated self-timed clock to achieve timing efficiency. The prototype chip is
implemented in a die area of 12.6 mm² using a 0.8-μm CMOS process and can operate at up to 18 MHz with a 5-V supply or 14.32 MHz with a 3.6-V supply. It has a 10-bit input word
length, a 14-bit output word length, and an 18-bit internal word length. The chip is suitable for
canceling “short” ghosts such as those present in a cable system, or it can be cascaded to form
longer filters for canceling broadcast TV ghosts.
Ababneh J.I. and Bataineh, M.H. (2007) Linear Phase FIR filter design using particle swarm
optimization and genetic algorithms. Digital Signal Processing. 34
doi:10.1016/j.dsp.2007.05.011. This work presents a high-performance, low-power FIR filter design based on a computation sharing multiplier (CSHM). CSHM specifically targets computation re-use in vector-scalar products and was effectively used in the suggested FIR filter
design. Efficient circuit level techniques namely a new carry select adder and Conditional
Capture Flip-Flop (CCFF), were also used to further improve power and performance. The
suggested FIR filter architecture was implemented in 0.25-μm technology. Experimental results on a 10-tap low-pass CSHM FIR filter showed speed and power improvements of 19% and 17%, respectively.
Bruce, H., et al., "Power Optimization of a Reconfigurable FIR Filter," IEEE Transactions, 2004.
This paper describes power optimization techniques applied to a reconfigurable digital finite
impulse response (FIR) filter used in a Universal Mobile Telephone Service (UMTS) mobile
terminal. Various methods of optimization for implementation were combined to achieve low
cost in terms of power consumption. Each optimization method is described in detail and is
applied to the reconfigurable filter. The optimization methods have achieved a 78.8 % reduction
in complexity for the multipliers in the FIR structure. An automated method for transformation
of coefficient multipliers into bit-shifts is also presented.
Süleyman, S.D. and Andrew, G.D., "Efficient Implementation of Digital Filters Using Novel Reconfigurable Multiplier Blocks," IEEE Transactions, 2004. Reconfigurable Multiplier Blocks (ReMB) offer significant complexity reductions for multiple constant multiplications in time-multiplexed digital filters. The ReMB technique was employed in the
multiplications in time multiplexed digital filters. The ReMB technique was employed in the
implementation of a half-band 32-tap FIR filter on both Xilinx Virtex FPGA and UMC 0.18μm
CMOS technologies. Reference designs had also been built by deploying standard time-
multiplexed architectures and off-the-shelf Xilinx Core Generator system for the FPGA design.
All designs were then compared for their area and delay figures. It was shown that the ReMB technique can significantly reduce the area of the multiplier circuitry and the coefficient store, as well as reduce the delay.
EXISTING SYSTEM
In this section, four designs of a 5-3 approximate compressor are presented. The 5-3 compressor has five primary inputs (X0, X1, X2, X3, and X4) and three outputs (O0, O1, and O2). This compressor uses the counter property: its output depends on the number of ones present at the input, so it is also called a 5-3 counter. In this paper, we call the module a compressor because it compresses five bits into three. We have chosen the 5-3 compressor because it is the basic module of the 15-4 compressor. The error rate and error distance of each design are considered.
Design 1
In this design, output O2 of the 5-3 compressor is approximated first. A logical AND between inputs X3 and X2 matches the accurate output O2 of the conventional 5-3 compressor with an error rate of 18.75%. The following expressions show design 1 of the 5-3 approximate compressor.
Fig1: Initially output O2 of 5-3 compressor is approximated
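The 18.75% error rate can be checked exhaustively. The Python sketch below (the function names are illustrative) compares the exact O2, which is 1 when at least four of the five inputs are 1, against the single-AND approximation X3·X2 over all 32 input combinations.

```python
from itertools import product

def exact_o2(bits):
    # In the exact 5-3 compressor, O2 is the weight-4 bit of the
    # ones count, i.e. 1 iff at least four of the inputs are 1.
    return 1 if sum(bits) >= 4 else 0

def approx_o2(bits):
    # Design 1 replaces O2 with a single AND gate on X3 and X2.
    x0, x1, x2, x3, x4 = bits
    return x3 & x2

mismatches = sum(exact_o2(b) != approx_o2(b)
                 for b in product((0, 1), repeat=5))
error_rate = mismatches / 32  # 6 mismatches out of 32 = 18.75 %
```

The six error cases split into four inputs where X3 = X2 = 1 but fewer than four bits are set, and two inputs where exactly four bits are set but the single zero falls on X3 or X2.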
Design 2
In this design, O2 and O1 are approximated and O0 is kept the same as the original expression. The error distance of all the error cases is either -2 or +2. From the truth table, it can be noted that the pass rate of the approximate O2 is 87.5% when it alone replaces O2 in the 5-3 compressor. Similarly, the pass rate of the approximate O1 is 75% when compared with the O1 output of the exact 5-3 compressor. The expressions for O2 and O1 are modified to obtain the minimum error distance. The overall pass rate of this design is 75%; the output of the compressor differs from the exact output in only eight input cases. In this design, the critical path runs from input X0 to output O0 and involves four XOR gates. This design has the shortest critical path of the proposed designs.
Fig 2: Logic diagram of design 2 approximate 5-3 compressor.
PROPOSED SYSTEM
The multiplier is a substantial part of an electronic device and decides the overall performance of the system [1]. A multiplier consumes a large amount of power and introduces a large delay; to minimize these disadvantages, adders and compressors are used. Reducing the delay of the multiplier has therefore been a main aim in enhancing the performance of digital systems such as DSP processors [8], and many attempts have been made to make multipliers faster. The Wallace tree is an effective hardware realization of a digital circuit that multiplies two numbers while minimizing the number of partial products [4]. In vector processors, several multiplications are performed to exploit data- or loop-level parallelism. High processing speed and low power consumption are the major advantages of this multiplier [2].
Fig. 3 and Fig. 4 show the structure and schematic view of the 16-bit multiplier built with the 15-4 compressor. In this design, each dot denotes a partial product. From the 13th column onwards, 15-4 compressors are used in the multiplier architecture. Column 13 consists of 13 partial products, so two zeros are added to obtain 15 inputs; similarly, one zero is added in the 14th column. Approximate compressors are used in the 13th, 14th, and 15th columns of the multiplier. The partial product reduction phase consists of half adders, full adders, and 5-3 compressors: when the number of bits in a column is 2 or 3, a half adder or a full adder is used, respectively, and a single bit is simply moved to the next level of that column without further processing. This reduction process is repeated until only two rows remain. Finally, the last two rows are summed using a 4-bit Kogge-Stone adder.
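The column-by-column reduction described above can be modelled at the bit level. The Python sketch below is illustrative: it uses only half and full adders (the hardware additionally uses 5-3 and 15-4 compressors in the tall columns) and a plain carry-propagate pass in place of the Kogge-Stone adder for the final rows, so it shows the reduction schedule rather than the exact circuit.

```python
def wallace_multiply(a, b, n=16):
    # Stage 1: partial product generation with n*n AND gates,
    # grouped into columns by bit weight.
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    def full_adder(x, y, z):
        return x ^ y ^ z, (x & y) | (x & z) | (y & z)

    def half_adder(x, y):
        return x ^ y, x & y

    # Stage 2: reduce every column to at most two bits, sending
    # each carry to the next (more significant) column.
    while any(len(col) > 2 for col in cols):
        nxt = [[] for _ in range(len(cols) + 1)]
        for i, col in enumerate(cols):
            while len(col) >= 3:
                s, c = full_adder(col.pop(), col.pop(), col.pop())
                nxt[i].append(s)
                nxt[i + 1].append(c)
            if len(col) == 2:
                s, c = half_adder(col.pop(), col.pop())
                nxt[i].append(s)
                nxt[i + 1].append(c)
            nxt[i].extend(col)  # a single leftover bit passes through
        cols = nxt

    # Stage 3: carry-propagate addition of the remaining rows
    # (performed by the Kogge-Stone adder in the paper's design).
    result, carry = 0, 0
    for i, col in enumerate(cols):
        total = sum(col) + carry
        result |= (total & 1) << i
        carry = total >> 1
    return result | (carry << len(cols))
```

Because half and full adders preserve the arithmetic value of each column, the model reproduces the full 32-bit product for any pair of 16-bit operands, confirming that the reduction schedule loses no information.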
15-4 COMPRESSOR
A compressor is simply an adder circuit. It takes a number of equally weighted bits, adds them, and produces sum signals. Compressors are commonly used to reduce and accumulate a large number of inputs to a smaller number in parallel. They are important parts of a multiplier design, as they strongly influence the speed of the multiplier; their main application is within a multiplier, where a huge number of partial products have to be summed concurrently. High-speed applications such as DSP and image processing need several compressors to perform arithmetic operations. A compressor provides reduced delay compared with conventional adders built from half adders and full adders alone. In the notation 'N-r', 'N' denotes the number of input bits and 'r' denotes the number of output bits that encode the count of 1's among the 'N' inputs. Compressors reduce the number of gates and the delay with respect to other adder circuits. The inner structure of compressors avoids carry propagation: either there are no carry signals, or they arrive at the same time as the internal values. Compressors are widely used in the reduction stage of a multiplier to accumulate partial products concurrently.
This section considers the design of a 15-4 compressor built from approximate 5-3 compressors [5]. The compressor compresses 15 inputs (C0-C14) into 4 outputs (B0-B3) and consists of three phases: the first phase has five full adders, the second phase uses two 5-3 compressors, and the final phase is a 4-bit Kogge-Stone adder. In this design, approximate 5-3 compressors are preferred over accurate 5-3 compressors, as shown in Fig. 5.
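A bit-accurate sketch of the three phases, using exact 5-3 compressors (the paper substitutes approximate ones), can be written as follows. The final Kogge-Stone addition is modelled here as ordinary binary addition of the two weighted partial counts.

```python
def compressor_5_3(bits):
    # Exact 5-3 compressor (counter): encodes the number of ones
    # among five equally weighted inputs as a 3-bit binary value.
    count = sum(bits)
    return count & 1, (count >> 1) & 1, (count >> 2) & 1

def compressor_15_4(bits):
    """Exact 15-4 compressor: bits is a sequence of 15 0/1 values."""
    assert len(bits) == 15
    # Phase 1: five full adders turn 15 bits into 5 sum bits
    # (weight 1) and 5 carry bits (weight 2).
    sums, carries = [], []
    for i in range(5):
        x, y, z = bits[3 * i:3 * i + 3]
        sums.append(x ^ y ^ z)
        carries.append((x & y) | (x & z) | (y & z))
    # Phase 2: one 5-3 compressor per group.
    s0, s1, s2 = compressor_5_3(sums)     # weights 1, 2, 4
    c0, c1, c2 = compressor_5_3(carries)  # weights 2, 4, 8
    # Phase 3: merge the two weighted counts (a 4-bit Kogge-Stone
    # adder in the hardware; plain addition in this model).
    return (s0 | (s1 << 1) | (s2 << 2)) + ((c0 << 1) | (c1 << 2) | (c2 << 3))
```

The output equals the ones count of the 15 inputs for every input pattern: the sum-bit count plus twice the carry-bit count is exactly the popcount, and the maximum value 15 fits in the four output bits B0-B3.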
A. 5-3 Compressor
The 15-4 compressor uses the 5-3 compressor as its basic building block. The 5-3 compressor takes five primary inputs, A0, A1, A2, A3, and A4, and produces three outputs, B0, B1, and B2. The compressor uses the counter property: the number of 1's present at the input decides the output. The design that compresses the given 5 inputs into 3 outputs is called the 5-3 compressor. The error rate of the 5-3 compressor is considered. The design equations of the 5-3 approximate compressor are shown in the following equations, and the logic diagram of the approximate 5-3 compressor is shown in Fig. 6.
In 1973, Peter M. Kogge and Harold S. Stone introduced an efficient, high-performance adder called the Kogge-Stone adder. It is a parallel prefix adder, known for performing the fastest addition for a given design time [9], [10]. Fig. 5 and Fig. 6 show the functional block diagram and RTL view of a 4-bit Kogge-Stone adder. From the ith bits of the inputs, the propagate signals Pi and generate signals Gi are calculated; these signals then produce the output carry signals. To minimize the computation delay, the operation of a prefix adder is divided into three stages:
A. Pre-processing
B. Generation of carry
C. Final processing
A. Pre-Processing
In this stage, the generate and propagate signals are computed for each bit position i of the inputs A and B: Gi = Ai · Bi and Pi = Ai ⊕ Bi.
B. Generation of Carry
In this stage, the carries corresponding to each bit position are calculated, and the operation is executed in parallel. Group carry propagate and generate signals are used as intermediate signals: at each prefix level, a pair (Gi, Pi) is combined with the lower-order pair (Gj, Pj) as G = Gi + (Pi · Gj) and P = Pi · Pj.
C. Final Processing
In the final processing stage, the sum bits are computed from the propagate signals and the carries produced by the prefix network: Si = Pi ⊕ Ci−1, where Ci−1 is the carry into bit i.
Fig. 7. Block diagram of the Kogge-Stone adder
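The three stages map directly onto code. The Python sketch below implements a width-parameterized Kogge-Stone adder with the standard prefix combination (G, P) ∘ (G', P') = (G + P·G', P·P'); the 4-bit case corresponds to the adder used in the 15-4 compressor.

```python
def kogge_stone_add(a, b, n=4):
    # A. Pre-processing: bitwise generate Gi = Ai AND Bi and
    #    propagate Pi = Ai XOR Bi.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]

    # B. Generation of carry: log2(n) parallel-prefix levels, each
    #    combining position i with the pair d places to its right.
    G, P = g[:], p[:]
    d = 1
    while d < n:
        G = [G[i] | (P[i] & G[i - d]) if i >= d else G[i] for i in range(n)]
        P = [P[i] & P[i - d] if i >= d else P[i] for i in range(n)]
        d *= 2

    # C. Final processing: Si = Pi XOR Ci-1 with carry-in 0;
    #    after the prefix tree, G[i] is the carry out of bit i.
    s, carry_in = 0, 0
    for i in range(n):
        s |= (p[i] ^ carry_in) << i
        carry_in = G[i]
    return s | (carry_in << n)  # append the final carry-out
```

Every prefix level runs over all bit positions at once, so the carry chain is resolved in O(log n) logic levels instead of the O(n) levels of a ripple adder, which is the source of the speed advantage over the parallel (ripple) adder used for comparison in this paper.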
The design of the approximate 16-bit Wallace multiplier using the 15-4 compressor has been implemented in HDL using Xilinx ISE 14.5. Simulation results for the overall Wallace tree multiplier architecture are shown in Fig. 9. The area utilization and power consumption of the multiplier design obtained through simulation are tabulated in Table I and Table II, and a snapshot of the simulated delay is shown in Fig. 10. The processing delay of the final addition stage is reduced by using the Kogge-Stone adder.
Table I and Table II describe the area utilization and power parameters of the 16-bit Wallace multiplier. The design shows better results than designs using other adders, giving less area and lower propagation delay.
The approximate 16×16-bit Wallace tree multiplier using the 15-4 compressor architecture has been designed, synthesized for a Spartan-3 XC3S100E board, and simulated in Xilinx ISE 14.5. The performance of the proposed multiplier with the Kogge-Stone adder is compared with the same multiplier architecture using a parallel adder. It can be inferred that the 16×16 multiplier architecture using the 15-4 compressor with the Kogge-Stone adder is faster than the multiplier with a parallel adder. In the future, the performance of the proposed multiplier can be improved further and applied in applications like video and image processing.
REFERENCES
[1] C. S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions on Computers, vol. 13, pp. 14-17, 1964.
[3] C.-H. Lin and I.-C. Lin, "High accuracy approximate multiplier with error correction," in Proc. IEEE 31st Int. Conf. Comput. Design (ICCD), Oct. 2013, pp. 33-38.
[4] D. R. Gandhi and N. N. Shah, "Comparative Analysis for Hardware Circuit Architecture of Wallace Tree Multiplier," IEEE International Conference on Intelligent Systems and Signal Processing, Gujarat, 2013, pp. 1-6.
[5] R. Menon and D. Radhakrishnan, "High performance 5:2 compressor architectures," Proc. IEE Circuits, Devices Syst., vol. 153, no. 5, pp. 447-452, Oct. 2006.
[7] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, "Design and analysis of approximate compressors for multiplication," IEEE Trans. Comput., vol. 64, no. 4, pp. 984-994, Apr. 2015.
[8] Teffi Francis, Tera Joseph, and Jobin K. Antony, "Modified MAC Unit for low power high speed DSP application using multiplier with bypassing technique and optimized adders," 4th ICCCNT, IEEE-31661, 2013.
[9] Sudheer Kumar Yezerla and B. Rajendra Naik, "Design and Estimation of delay, power and area for Parallel prefix adders," in Recent Advances in Engineering and Computational Sciences (RAECS), pp. 1-6, IEEE, 2014.
[10] Y. Choi, "Parallel Prefix Adder Design," Proc. 17th IEEE Symposium on Computer Arithmetic, pp. 90-98, June 2005.
[11] Belle W. Y. Wei and Clark D. Thompson, "Area-Time Optimal Adder Design," IEEE Transactions on Computers, vol. 39, pp. 666-675, May 1990.