
PERFORMANCE ANALYSIS OF WALLACE TREE MULTIPLIER WITH

KOGGE STONE ADDER USING 15-4 COMPRESSOR

Abstract:

The major role of an electronic device is to provide low power dissipation and compact area along with high-speed performance. Among the major modules of a digital system, the multiplier is the most complex one and a main source of power dissipation. Applying approximate computing to multiplier design plays a major role in electronic applications such as multimedia, providing the fastest result at the cost of lower reliability. In this paper, a design approach for a 16-bit Wallace tree approximate multiplier with a 15-4 compressor is considered to provide higher reliability. The 16×16 Wallace tree multiplier is synthesized and simulated using Xilinx ISE 14.5 software. The multiplier occupies about 15% of the total coverage area, and its dissipated power and delay are 0.042 μW and 3.125 ns, respectively.
INTRODUCTION

Multipliers are an essential part of digital systems such as Arithmetic and Logic Units, Digital Signal Processors, etc. They largely determine the performance of the system in terms of power, delay, and area utilization; hence there is an increasing demand to improve multiplier performance [1-4]. A multiplier consists of three stages: partial product generation, partial product reduction, and a final addition stage. The second stage (partial product reduction) consumes the most time and power, so various techniques have been suggested to shorten the multiplier's critical stages. The most popular technique is to use compressors in the partial product reduction stage. A compressor is simply an adder circuit: it takes a number of equally weighted bits, adds them, and produces sum signals. Compressors are commonly used to reduce and accumulate a large number of inputs to a smaller number in a parallel manner. Their main application is within a multiplier, where a large number of partial products have to be summed concurrently. The inner structure of a compressor avoids carry propagation: either there are no carry signals, or they arrive at the same time as the internal values [5-7].
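As a concrete illustration of the carry-free idea above, a full adder can itself be viewed as the smallest (3:2) compressor. The sketch below is not from the paper; it is a minimal Python model of this behaviour.

```python
# A full adder viewed as a 3:2 compressor: three equally weighted input
# bits are compressed into a sum bit (weight 1) and a carry bit
# (weight 2), with no carry propagating between columns.

def compress_3_2(a: int, b: int, c: int):
    """Compress three equally weighted bits into (sum, carry)."""
    s = a ^ b ^ c                        # sum bit, weight 2^0
    carry = (a & b) | (b & c) | (a & c)  # carry bit, weight 2^1
    return s, carry

# Every input combination satisfies a + b + c = sum + 2 * carry:
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cy = compress_3_2(a, b, c)
            assert a + b + c == s + 2 * cy
```

Larger compressors such as the 5-3 and 15-4 modules discussed later generalize this same invariant to more input bits.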

To reduce the delay of the second stage, several compressors are needed. Small compressors are useful for designing small multipliers; in general, the sizes of compressors required depend on the bit width. In this paper, a scheme for delay reduction in a 16-bit Wallace tree multiplier with a 15-4 compressor is considered. To build the 15-4 compressor, a 5-3 compressor is used as the basic module. AND gates are used to generate the partial products: an N-bit multiplier needs N² AND gates. The partial product reduction phase uses three major components, namely the half adder, the full adder, and the 5-3 compressor [8-11]. The final addition is done using a Kogge-Stone adder. Fig. 1 shows the structure of the 16×16 multiplier. Simulation results show that the approximate multiplier with compressors using the Kogge-Stone adder achieves higher performance than multipliers with compressors using other adders such as a parallel adder. The rest of this paper is organized as follows: the design of the approximate 16×16 Wallace tree multiplier, together with brief notes on the design of the 15-4 compressor, 5-3 compressor, and Kogge-Stone adder, is described in Section II.
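The partial-product generation stage described above can be modelled in a few lines of Python. This is an illustrative sketch, not the paper's design files: `partial_products` stands in for the N² AND gates, and `weighted_sum` checks that summing every bit at its column weight reproduces the exact product.

```python
# Sketch of partial-product generation for an N-bit multiplier:
# N*N AND gates produce the bits pp[i][j] = a_j AND b_i, which the
# reduction tree (half adders, full adders, compressors) then compresses.

def partial_products(a: int, b: int, n: int):
    """Return the N x N matrix of partial-product bits for a * b."""
    a_bits = [(a >> j) & 1 for j in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    return [[a_bits[j] & b_bits[i] for j in range(n)] for i in range(n)]

def weighted_sum(pp, n: int) -> int:
    """Summing every bit at weight 2^(i+j) reproduces the exact product."""
    return sum(pp[i][j] << (i + j) for i in range(n) for j in range(n))

# Sanity check for the 16-bit case considered in this paper:
assert weighted_sum(partial_products(51203, 40961, 16), 16) == 51203 * 40961
```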

1.1 Objective
Most students of Electronics Engineering are exposed to Integrated Circuits (ICs) at a very basic level, involving SSI (small-scale integration) circuits like logic gates or MSI (medium-scale integration) circuits like multiplexers, parity encoders, etc. But there is a much bigger world out there, involving miniaturisation at levels so great that a micrometre and a microsecond are literally considered huge! This is the world of VLSI - Very Large Scale Integration. This section aims to introduce Electronics Engineering students to the possibilities and the work involved in this field.
VLSI stands for "Very Large Scale Integration". This is the field which involves packing more and more logic devices into smaller and smaller areas. Thanks to VLSI, circuits that would have taken boardfuls of space can now be put into a small space a few millimetres across! This has opened up a big opportunity to do things that were not possible before. VLSI circuits are everywhere ... your computer, your car, your brand new state-of-the-art digital camera, the cell-phones, and what have you. All this involves a lot of expertise on many fronts within the same field, which we will look at in later sections. VLSI has been around for a long time; there is nothing new about it ... but as a side effect of advances in the world of computers, there has been a dramatic proliferation of tools that can be used to design VLSI circuits. Alongside, obeying Moore's law, the capability of an IC has increased exponentially over the years, in terms of computation power, utilisation of available area, and yield. The combined effect of these two advances is that people can now put diverse functionality into ICs, opening up new frontiers. Examples are embedded systems, where intelligent devices are put inside everyday objects, and ubiquitous computing, where small computing devices proliferate to such an extent that even the shoes you wear may actually do something useful like monitoring your heartbeats! These two fields are closely related, and getting into their description could easily lead to another article.

1.2 DEALING WITH VLSI CIRCUITS


Digital VLSI circuits are predominantly CMOS based. The way normal blocks like latches and gates are implemented is different from what students have seen so far, but the behaviour remains the same. All this miniaturisation brings new things to consider: a lot of thought has to go into actual implementation as well as design. Let us look at some of the factors involved.
1. Circuit Delays. Large, complicated circuits running at very high frequencies have one big problem to tackle - the problem of delays in the propagation of signals through gates and wires, even over areas just a few micrometres across! Operating frequencies are so high that, as the delays add up, they can actually become comparable to the clock period.
2. Power. Another effect of high operating frequencies is increased power consumption. This has a two-fold effect - devices drain batteries faster, and heat dissipation increases. Coupled with the fact that surface areas have decreased, heat poses a major threat to the stability of the circuit itself.
3. Layout. Laying out the circuit components is a task common to all branches of electronics. What is special in our case is that there are many possible ways to do this: there can be multiple layers of different materials on the same silicon, there can be different arrangements of the smaller parts for the same component, and so on.
The power dissipation and speed of a circuit present a trade-off; if we try to optimise one, the other is affected. The choice between the two is determined by the way we choose to lay out the circuit components. Layout can also affect the fabrication of VLSI chips, making it either easy or difficult to implement the components on the silicon.
1.3 INTRODUCTION TO VLSI
Very-large-scale integration (VLSI) is the process of creating integrated circuits by combining thousands of transistor-based circuits into a single chip. VLSI began in the 1970s, when complex semiconductor and communication technologies were being developed. The microprocessor is a VLSI device. The term is no longer as common as it once was, as chips have increased in complexity into the hundreds of millions of transistors. The first semiconductor chips held one transistor each. Subsequent advances added more and more transistors, and, as a consequence, more individual functions or systems were integrated over time. The first integrated circuits held only a few devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it possible to fabricate one or more logic gates on a single device. Now known retrospectively as "small-scale integration" (SSI), improvements in technique led to devices with hundreds of logic gates, known as large-scale integration (LSI). Present technology has moved far past this mark, and modern microprocessors have many millions of gates and hundreds of millions of individual transistors.
As of early 2008, billion-transistor processors were commercially available, an example being Intel's Montecito Itanium chip. This was expected to become more common as semiconductor fabrication moved from the then-current generation of 65 nm processes to the subsequent 45 nm generations (while experiencing new challenges such as increased variation across process corners). Another notable example is NVIDIA's 280 series GPU. This processor is unique in that its 1.4 billion transistor count, capable of a teraflop of performance, is almost entirely dedicated to logic (Itanium's transistor count is largely due to its 24 MB L3 cache). Current designs, as opposed to the earliest devices, use extensive design automation and automated logic synthesis to lay out the transistors, enabling higher levels of complexity in the resulting logic functionality. Certain high-performance logic blocks, like the SRAM cell, are nevertheless still designed by hand to ensure the highest efficiency (sometimes by bending or breaking established design rules to obtain the last bit of performance by trading away stability).
What is VLSI?
VLSI stands for "Very Large Scale Integration". This is the field which involves packing more and more logic devices into smaller and smaller areas.
VLSI
➢ Simply put, an integrated circuit is many transistors on one chip.
➢ Design/manufacturing of extremely small, complex circuitry using modified semiconductor material
➢ An integrated circuit (IC) may contain millions of transistors, yet be only a few mm in size
➢ Applications are wide-ranging: most electronic logic devices
Advantages of ICs over discrete components
While we will concentrate on integrated circuits, the properties of integrated circuits - what we can and cannot efficiently put in an integrated circuit - largely determine the architecture of the entire system. Integrated circuits improve system characteristics in several critical ways. ICs have three key advantages over digital circuits built from discrete components:
• Size. Integrated circuits are much smaller - both transistors and wires are shrunk to micrometre sizes, compared to the millimetre or centimetre scales of discrete components. Small size leads to advantages in speed and power consumption, since smaller components have smaller parasitic resistances, capacitances, and inductances.

• Speed. Signals can be switched between logic 0 and logic 1 much faster within a chip than they can between chips. Communication within a chip can occur hundreds of times faster than communication between chips on a printed circuit board. The high speed of circuits on-chip is due to their small size - smaller components and wires have smaller parasitic capacitances to slow down the signal.
• Power consumption. Logic operations within a chip also take much less power. Once again, lower power consumption is largely due to the small size of circuits on the chip - smaller parasitic capacitances and resistances require less power to drive them.
1.4 VLSI and SYSTEMS
These advantages of integrated circuits translate into advantages at the system level:
➢ Smaller physical size. Smallness is often an advantage in itself - consider portable televisions or handheld cellular telephones.

➢ Lower power consumption. Replacing a handful of standard parts with a single chip decreases total power consumption. Reducing power consumption has a ripple effect on the rest of the system: a smaller, cheaper power supply can be used; since less power consumption means less heat, a fan may no longer be essential; and a simpler cabinet with less electromagnetic shielding may be feasible, too.

➢ Reduced cost. Dropping the number of components, the power supply requirements, cabinet
costs, and so on, will inevitably reduce system cost. The ripple effect of integration is such
that the cost of a system built from custom ICs can be less, even though the individual ICs
cost more than the standard parts they replace.
Understanding why integrated circuit technology has such profound influence on the design of digital systems requires understanding both the technology of IC manufacturing and the economics of ICs and digital systems. The growing sophistication of applications continually pushes the design and manufacturing of integrated circuits and electronic systems to new levels of complexity. And perhaps the most striking characteristic of this collection of systems is its variety - as systems become more complex, we build not a small number of general-purpose computers but an ever wider range of special-purpose systems. Our ability to do so is a testament to our growing mastery of both integrated circuit manufacturing and design, but the increasing demands of customers continue to test the limits of design and manufacturing.
1.4.1 ASIC
An Application-Specific Integrated Circuit (ASIC) is an integrated circuit (IC) customized for a particular use, rather than intended for general-purpose use. For example, a chip designed solely to run a cell phone is an ASIC. Intermediate between ASICs and industry-standard integrated circuits, like the 7400 or the 4000 series, are application-specific standard products (ASSPs).
As feature sizes have shrunk and design tools have improved over the years, the maximum complexity (and hence functionality) possible in an ASIC has grown from 5,000 gates to over 100 million. Modern ASICs often include entire 32-bit processors and memory blocks including ROM, RAM, EEPROM, Flash and other large building blocks. Such an ASIC is often termed an SoC (system-on-a-chip). Designers of digital ASICs use a hardware description language (HDL), such as Verilog or VHDL, to describe the functionality of ASICs.
Field-programmable gate arrays (FPGAs) are the modern-day technology for building a breadboard or prototype from standard parts; programmable logic blocks and programmable interconnects allow the same FPGA to be used in many different applications. For smaller designs and/or lower production volumes, FPGAs may be more cost-effective than an ASIC design, even in production.
➢ An application-specific integrated circuit (ASIC) is an integrated circuit (IC) customized for a particular use, rather than intended for general-purpose use.
➢ A structured ASIC falls between an FPGA and a standard-cell-based ASIC.
➢ Structured ASICs are used mainly for mid-volume designs.
➢ The design task for structured ASICs is to map the circuit onto a fixed arrangement of known cells.
Among different arithmetic blocks, the multiplier is one of the main blocks, which is widely
used in different applications especially signal processing applications. There are two general
architectures for the multipliers, which are sequential and parallel. While sequential architectures
are low power, their latency is very large. On the other hand, parallel architectures (such as
Wallace tree and Dadda) are fast while having high-power consumptions. The parallel
multipliers are used in high-performance applications where their large power consumptions may
create hot-spot locations on the die. Since the power consumption and speed are critical
parameters in the design of digital circuits, the optimizations of these parameters for multipliers
become critically important. Very often, the optimization of one parameter is performed
considering a constraint on the other parameter. Specifically, achieving the desired performance (speed) within the limited power budget of portable systems is a challenging task. In addition, meeting a given level of reliability may be another obstacle to reaching the target system performance.
To meet the power and speed specifications, a variety of methods at different design abstraction
levels have been suggested. Approximate computing approaches are based on achieving the
target specifications at the cost of reducing the computation accuracy. The approach may be used
for applications where there is not a unique answer and/or a set of answers near the accurate
result can be considered acceptable. These applications include multimedia processing, machine
learning, signal processing, and other error resilient computations. Approximate arithmetic units
are mainly based on the simplification of the arithmetic units circuits. There are many prior
works focusing on approximate multipliers which provide higher speeds and lower power
consumption at the cost of lower accuracy. Almost all of the proposed approximate multipliers are based on having a fixed level of accuracy at runtime. Runtime accuracy reconfigurability, however, is considered a useful feature for providing different levels of quality of service during system operation. Here, by reducing the quality (accuracy),
the delay and/or power consumption of the unit may be reduced. In addition, some digital
systems, such as general purpose processors, may be utilized for both approximate and exact
computation modes. An approach for achieving this feature is to use an approximate unit along with a corresponding correction unit. The correction unit, however, increases the delay, power, and area overhead of the circuit. Also, the error-correction procedure may require more than one clock cycle, which could, in turn, slow down the processing further.
In this paper, we present four dual-quality reconfigurable approximate 4:2 compressors, which provide the ability to switch between the exact and approximate operating modes at runtime. The compressors may be utilized in the architectures of dynamic quality-configurable parallel multipliers. The basic structures of the proposed compressors consist of two parts: approximate and supplementary. In the approximate mode, only the approximate part is active, whereas in the exact operating mode the supplementary part, along with some components of the approximate part, is invoked.
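The internal expressions of the proposed dual-quality compressors are not reproduced in this text, so the sketch below (an illustrative Python model, not the authors' design) shows only the standard exact 4:2 compressor that such designs reconfigure: two chained full adders preserving the invariant x1 + x2 + x3 + x4 + cin = sum + 2·(carry + cout).

```python
# Exact 4:2 compressor modelled as two chained full adders. An
# approximate mode would replace part of this logic with a simpler
# (error-prone) expression; here only the exact baseline is shown.
from itertools import product

def full_adder(a, b, c):
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)

def compressor_4_2(x1, x2, x3, x4, cin):
    s1, cout = full_adder(x1, x2, x3)       # first stage
    total, carry = full_adder(s1, x4, cin)  # second stage
    return total, carry, cout               # carry and cout both weight 2

# Exhaustively check the compressor invariant over all 32 input patterns:
for bits in product((0, 1), repeat=5):
    s, c, co = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (c + co)
```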
Approximate computing
"The need for approximate computing is driven by two factors: a fundamental shift in the nature of computing workloads, and the need for new sources of efficiency," said Anand Raghunathan, a Purdue Professor of Electrical and Computer Engineering, who has been working in the field for
about five years. "Computers were first designed to be precise calculators that solved problems
where they were expected to produce an exact numerical value. However, the demand for
computing today is driven by very different applications. Mobile and embedded devices need to
process richer media, and are getting smarter – understanding us, being more context-aware and
having more natural user interfaces. On the other hand, there is an explosion in digital data
searched, interpreted, and mined by data centers."

A growing number of applications are designed to tolerate "noisy" real-world inputs and use
statistical or probabilistic types of computations.

"The nature of these computations is different from the traditional computations where you need
a precise answer," said Srimat Chakradhar, department head for Computing Systems Architecture
at NEC Laboratories America, who collaborated with the Purdue team. "Here, you are looking
for the best match since there is no golden answer, or you are trying to provide results that are of
acceptable quality, but you are not trying to be perfect."

However, today's computers are designed to compute precise results even when it is not
necessary. Approximate computing could endow computers with a capability similar to the
human brain's ability to scale the degree of accuracy needed for a given task. New findings were
detailed in research presented during the IEEE/ACM International Symposium on
Microarchitecture, Dec. 7-11 at the University of California, Davis.

The inability to perform to the required level of accuracy is inherently inefficient and saps
energy.

"If I asked you to divide 500 by 21 and I asked you whether the answer is greater than one, you
would say yes right away," Raghunathan said. "You are doing division but not to the full
accuracy. If I asked you whether it is greater than 30, you would probably take a little longer, but
if I ask you if it's greater than 23, you might have to think even harder. The application context
dictates different levels of effort, and humans are capable of this scalable approach, but computer
software and hardware are not like that. They often compute to the same level of accuracy all the
time."

Purdue researchers have developed a range of hardware techniques to demonstrate approximate computing, showing a potential for improvements in energy efficiency.

The research paper presented during the IEEE/ACM International Symposium on Microarchitecture was authored by doctoral student Swagath Venkataramani; former Purdue
doctoral student Vinay K. Chippa; Chakradhar; Kaushik Roy, Purdue's Edward G. Tiedemann Jr.
Distinguished Professor of Electrical and Computer Engineering; and Raghunathan.

Recently, the researchers have shown how to apply approximate computing to programmable
processors, which are ubiquitous in computers, servers and consumer electronics.

"In order to have a broad impact we need to be able to apply this technology to programmable
processors," Roy said. "And now we have shown how to design a programmable processor to
perform approximate computing."

The researchers achieved this milestone by altering the "instruction set," which is the interface
between software and hardware. "Quality fields" added to the instruction set allow the software
to tell the hardware the level of accuracy needed for a given task. They have created a prototype
programmable processor called Quora based on this approach.
"You are able to program for quality, and that's the real hallmark of this work," lead author
Venkataramani said. "The hardware can use the quality fields and perform energy efficient
computing, and what we have seen is that we can easily double energy efficiency."

In other recent work, led by Chippa, the Purdue team fabricated an approximate "accelerator" for
recognition and data mining.

"We have an actual hardware platform, a silicon chip that we've had fabricated, which is an
approximate processor for recognition and data mining," Raghunathan said. "Approximate
computing is far closer to reality than we thought even a few years ago."

Approximate computing leverages the intrinsic resilience of applications to inexactness in their computations, to achieve a desirable trade-off between efficiency (performance or energy) and
acceptable quality of results. To broaden the applicability of approximate computing, we propose
quality programmable processors, in which the notion of quality is explicitly codified in the
HW/SW interface, i.e., the instruction set. The ISA of a quality programmable processor contains
instructions associated with quality fields to specify the accuracy level that must be met during
their execution. We show that this ability to control the accuracy of instruction execution greatly
enhances the scope of approximate computing, allowing it to be applied to larger parts of
programs. The micro-architecture of a quality programmable processor contains hardware
mechanisms that translate the instruction-level quality specifications into energy savings.
Additionally, it may expose the actual error incurred during the execution of each instruction
(which may be less than the specified limit) back to software. As a first embodiment of quality
programmable processors, we present the design of Quora, an energy efficient, quality
programmable vector processor. Quora utilizes a 3-tiered hierarchy of processing elements that
provide distinctly different energy vs. quality trade-off, and uses hardware mechanisms based on
precision scaling with error monitoring and compensation to facilitate quality programmable
execution. We evaluate an implementation of Quora with 289 processing elements in 45nm
technology. The results demonstrate that leveraging quality-programmability leads to 1.05X-
1.7X savings in energy for virtually no loss (< 0.5%) in application output quality, and 1.18X-
2.1X energy savings for modest impact (<2.5%) on output quality. Our work suggests that quality-programmable processors are a significant step towards bringing approximate computing to the mainstream.
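The Quora hardware itself is not described further here; as a purely illustrative sketch of what "precision scaling with error monitoring" can mean, the hypothetical Python model below truncates operand LSBs before multiplying and reports the resulting error so a caller could check it against a quality target. The function name and interface are invented for illustration, not taken from the paper.

```python
# Hypothetical precision-scaling sketch: zero the lowest `drop_bits` of
# each operand before multiplying (cheaper in hardware), and monitor the
# actual error so software can verify a quality-field target was met.

def scaled_multiply(a: int, b: int, drop_bits: int):
    """Multiply with the lowest `drop_bits` of each operand zeroed."""
    mask = ~((1 << drop_bits) - 1)
    approx = (a & mask) * (b & mask)
    error = a * b - approx   # monitored error (non-negative for a, b >= 0)
    return approx, error

approx, err = scaled_multiply(1000, 1000, 4)   # operands truncated to 992
assert approx == 992 * 992
assert err == 1000 * 1000 - 992 * 992
```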
LITERATURE SURVEY

Mahesh, R. and Vinod, A.P. “New Reconfigurable Architectures for Implementing FIR filters
with Low Complexity” IEEE Transaction of CAD of Integrated Circuits and Systems, Vol. 29,
No.2, 2010. In this paper, two reconfigurable architectures for low complexity FIR filters are
proposed namely, Constant Shift Method and Programmable Shift Method. The proposed FIR
filter architecture is capable of operating for different word length filter coefficients without any
overhead in the hardware circuitry. They show that dynamically reconfigurable filters can be
efficiently implemented by using common sub expression elimination algorithms. Design
examples show that the proposed architectures offer good area and power reductions and speed
improvement compared to the best existing reconfigurable FIR filter implementations in the
literature.

Sammy, P. et al. “A Programmable FIR filter for TV ghost Cancellation” IEEE Transaction on
1997. In this paper, a compact 64-tap programmable FIR filter suitable for TV ghost cancellation
has been designed by using Carry Save-Add Shift (CSAS) multiplier to achieve area efficiency
and an internally generated self-timed clock to achieve timing efficiency. The prototype chip is
implemented in a die area of 12.6 mm2 using a 0.8-pm CMOS process and can operate at up to
18 MHz with a 5-V supply or r 14.32 MHz with a 3.6-V supply. It has a 10-bit input word
length, a 14-bit output word length, and an 18-bit internal word length. The chip is suitable for
canceling “short” ghosts such as those present in a cable system, or it can be cascaded to form
longer filters for canceling broadcast TV ghosts.

Ababneh J.I. and Bataineh, M.H. (2007) Linear Phase FIR filter design using particle swarm
optimization and genetic algorithms. Digital Signal Processing. 34
doi:10.1016/j.dsp.2007.05.011. Presents a high performance and low power FIR filter design,
which is based on computation sharing multiplier (CSHM). CSHM specifically targeted
computation re-use in vector-scalar products and was effectively used in the suggested FIR filter
design. Efficient circuit level techniques namely a new carry select adder and Conditional
Capture Flip-Flop (CCFF), were also used to further improve power and performance. The
suggested FIR filter architecture was implemented in 0.25-μm technology. Experimental results on a 10-tap low-pass CSHM FIR filter showed speed and power improvements of 19% and 17%, respectively.

Bruce, H., et al. “Power Optimization of Reconfigurable FIR filter” IEEE Transaction on 2004.
This paper describes power optimization techniques applied to a reconfigurable digital finite
impulse response (FIR) filter used in a Universal Mobile Telephone Service (UMTS) mobile
terminal. Various methods of optimization for implementation were combined to achieve low
cost in terms of power consumption. Each optimization method is described in detail and is
applied to the reconfigurable filter. The optimization methods have achieved a 78.8 % reduction
in complexity for the multipliers in the FIR structure. An automated method for transformation
of coefficient multipliers into bit-shifts is also presented.

Süleyman, S.D. and Andrew G.D. “Efficient Implementation of Digital filters using novel
Reconfigurable Multiplier Blocks” IEEE Transaction on 2004. Reconfigurable Multiplier Blocks (ReMBs) generally offer significant complexity reductions in multiple constant multiplications in time-multiplexed digital filters. The ReMB technique was employed in the implementation of a half-band 32-tap FIR filter on both Xilinx Virtex FPGA and UMC 0.18-μm
CMOS technologies. Reference designs had also been built by deploying standard time-
multiplexed architectures and off-the-shelf Xilinx Core Generator system for the FPGA design.
All designs were then compared for their area and delay figures. It was shown that the ReMB technique can significantly reduce the area of the multiplier circuitry and the coefficient store, as well as reducing the delay.
EXISTING SYSTEM

DESIGNS OF APPROXIMATE 5-3 COMPRESSORS

In this section, four designs of a 5-3 approximate compressor are presented. The 5-3 compressor has five primary inputs (X0, X1, X2, X3, and X4) and three outputs (O0, O1, and O2). This compressor uses the counter property: the output depends on the number of ones present at the input, so the module is also called a 5-3 counter. In this paper, we call the module a compressor because it compresses five bits into three bits. We have chosen the 5-3 compressor because it is the basic module of the 15-4 compressor. The error rate and error distance of each design are considered.

Design 1

In this design, the output O2 of the 5-3 compressor is approximated first. The logical AND of inputs X3 and X2 matches the accurate output O2 of the conventional 5-3 compressor with an error rate of 18.75%. The following expressions show design 1 of the 5-3 approximate compressor.
Fig. 1: Design 1, in which output O2 of the 5-3 compressor is approximated
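The 18.75% figure is easy to verify exhaustively. The Python check below assumes the exact 5-3 compressor outputs the binary count of its five input ones (so O2 = 1 iff at least four inputs are 1) and compares it against the X3 AND X2 approximation over all 32 input patterns.

```python
# Exhaustive check of design 1's O2 approximation: count the input
# patterns where X3 AND X2 disagrees with the exact O2 (MSB of the
# ones-count), over all 2^5 = 32 combinations.
from itertools import product

errors = 0
for x4, x3, x2, x1, x0 in product((0, 1), repeat=5):
    exact_o2 = 1 if (x0 + x1 + x2 + x3 + x4) >= 4 else 0
    approx_o2 = x3 & x2
    errors += (exact_o2 != approx_o2)

# 6 mismatches out of 32 patterns = 18.75%, matching the stated rate.
assert errors == 6 and errors / 32 == 0.1875
```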

Design 2

In this design, O2 and O1 are approximated while O0 is kept the same as the original expression. The error distance in all the error cases is either -2 or +2. From the truth table, it can be noted that the pass rate of the approximate O2 is 87.5% when it alone replaces O2 in the 5-3 compressor. Similarly, the pass rate of the approximate O1 is 75% when compared with the O1 output of the exact 5-3 compressor. The expressions for O2 and O1 are modified to obtain the minimum error distance. The overall pass rate of this design is 75%: the output of the compressor differs in only eight input cases. In this design, the critical path is between input X0 and output O0, with four XOR gates involved. This design has the shortest critical path among the proposed designs.
Fig. 2: Logic diagram of the design 2 approximate 5-3 compressor.
PROPOSED SYSTEM

WALLACE TREE MULTIPLIER

The multiplier is a substantial part of an electronic device and decides the overall performance of the system [1]. A multiplier generates a large amount of power dissipation and delay; to minimize these disadvantages, adders and compressors are used. Reducing the delay in the multiplier has therefore been a main aim in enhancing the performance of digital systems like DSP processors [8], and many attempts have been made to make multipliers faster. The Wallace tree is an effective hardware realization of a digital circuit that multiplies two numbers while minimizing the number of partial products [4]. In vector processors, several multiplications are performed to exploit data- or loop-level parallelism. High processing speed and low power consumption are the major advantages of this multiplier [2].

Fig. 3. Structure of 16×16 Multiplier using 15-4 compressor

The three stages of Wallace tree multiplier are mentioned below:

1) Partial products generation

2) Partial products reduction


3) Addition at the final stage

Fig. 4. Schematic View of 16×16 Bit Wallace Tree Multiplier

Fig. 3 and Fig. 4 show the structure and schematic view of the 16-bit multiplier built with the 15-4 compressor. In this design, each dot denotes a partial product. From the 13th column onwards, 15-4 compressors are used in the multiplier architecture. Column 13 contains 13 partial products, so two zeros are added to obtain 15 inputs; similarly, one zero is added in the 14th column. Approximate compressors are used in the 13th, 14th and 15th columns of the multiplier. The partial-product reduction phase consists of half adders, full adders and 5-3 compressors: half adders and full adders are used in a column when the number of bits in that column is 2 or 3 respectively. A single bit is simply passed to the subsequent level of the same column without further processing. This reduction process is repeated until only two rows remain. Finally, the last two rows are summed using a 4-bit Kogge-Stone adder.
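The reduction rules above (a full adder for each group of three bits in a column, a half adder for two, pass-through for one, repeated until two rows remain, then a final carry-propagate addition) can be sketched as a Python behavioral model. This is an illustrative carry-save simulation, not the paper's Verilog; the function name `wallace_multiply` is ours.

```python
def wallace_multiply(a, b, n=8):
    # Partial-product generation: cols[w] holds all bits of weight 2**w
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append((a >> i) & (b >> j) & 1)
    # Reduction: repeat until every column holds at most two bits
    while any(len(c) > 2 for c in cols):
        nxt = [[] for _ in range(len(cols) + 1)]   # one extra column for carries
        for w, col in enumerate(cols):
            i = 0
            while len(col) - i >= 3:               # full adder: 3 bits -> sum + carry
                x, y, z = col[i], col[i + 1], col[i + 2]
                nxt[w].append(x ^ y ^ z)
                nxt[w + 1].append((x & y) | (y & z) | (x & z))
                i += 3
            if len(col) - i == 2:                  # half adder: 2 bits -> sum + carry
                x, y = col[i], col[i + 1]
                nxt[w].append(x ^ y)
                nxt[w + 1].append(x & y)
            elif len(col) - i == 1:                # single bit passes through
                nxt[w].append(col[i])
        cols = nxt
    # Final stage: carry-propagate addition of the two remaining rows
    lo = sum(col[0] << w for w, col in enumerate(cols) if len(col) >= 1)
    hi = sum(col[1] << w for w, col in enumerate(cols) if len(col) >= 2)
    return lo + hi

print(wallace_multiply(13, 11, 4))   # -> 143
```

The carry-save invariant (the weighted sum of all column bits always equals a*b) is what guarantees the final two-row addition yields the product.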

15-4 COMPRESSOR

A compressor is essentially an adder circuit: it takes a number of equally weighted bits, adds them, and produces sum signals. Compressors are commonly used to reduce and accumulate a large number of inputs to a smaller number in a parallel manner. They are important parts of multiplier design because they strongly influence the speed of the multiplier; their main application is within a multiplier, where a huge number of partial products must be summed concurrently. High-speed applications such as DSP and image processing need several compressors to perform arithmetic operations. A compressor provides reduced delay over conventional adder trees built from half adders and full adders alone. In the notation 'N-r', 'N' denotes the number of input bits and 'r' denotes the number of output bits needed to encode the count of 1's among the N inputs. The compressor reduces the number of gates and the delay with respect to other adder circuits. The inner structure of a compressor avoids carry propagation: either there are no carry signals, or they arrive at the same time as the internal values. Compressors are widely used in the reduction stage of a multiplier to accumulate partial products concurrently. This section considers the design of a 15-4 compressor built from approximate 5-3 compressors [5]. The compressor compresses 15 inputs (C0-C14) into 4 outputs (B0-B3) and consists of three phases: the first phase has five full adders, the second phase uses two 5-3 compressors, and the final phase is a 4-bit Kogge-Stone adder. In this compressor design, approximate 5-3 compressors are preferred over accurate 5-3 compressors, as shown in Fig. 5.

Fig. 5. Logic Diagram of Approximate 15-4 Compressor [6]
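The three-phase structure can be modeled behaviorally. A Python sketch of the exact (non-approximate) datapath follows; it assumes the five full-adder sum bits (weight 1) feed one 5-3 compressor and the five carry bits (weight 2) feed the other, which is one natural reading of the structure in Fig. 5, and the function names are ours.

```python
def full_adder(a, b, c):
    s = a ^ b ^ c
    cy = (a & b) | (b & c) | (a & c)
    return s, cy

def compressor_5_3(bits):
    # Exact 5-3 compressor: the 3-bit binary count of ones among 5 inputs
    return sum(bits)                      # value 0..5, outputs B2 B1 B0

def compressor_15_4(bits):                # bits: the 15 inputs C0..C14
    # Phase 1: five full adders, 15 bits -> 5 sums (weight 1) + 5 carries (weight 2)
    sums, carries = [], []
    for i in range(0, 15, 3):
        s, cy = full_adder(bits[i], bits[i + 1], bits[i + 2])
        sums.append(s)
        carries.append(cy)
    # Phase 2: one 5-3 compressor per group of five
    s_count = compressor_5_3(sums)        # count at weight 1
    c_count = compressor_5_3(carries)     # count at weight 2
    # Phase 3: final 4-bit addition (a Kogge-Stone adder in the paper)
    return s_count + 2 * c_count          # 4-bit result B3..B0, value 0..15

print(compressor_15_4([1] * 15))          # -> 15
```

Because every phase preserves the weighted sum, the exact 15-4 compressor is just a ones-counter: its 4-bit output equals the number of 1's among the 15 inputs.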

A. 5-3 Compressor

The 15-4 compressor uses the 5-3 compressor as its basic building block. The 5-3 compressor takes five primary inputs, A0-A4, and produces three outputs, B0-B2. The outputs encode the number of 1's present at the inputs, so the compressor behaves as a counter.

Fig. 6. Logic diagram of 5-3 compressor

A circuit that compresses five given inputs into three outputs is called a 5-3 compressor. The error rate of the approximate 5-3 compressor is considered in its design. The design equations of the 5-3 approximate compressor are shown in the following equations, and the logic diagram of the approximate 5-3 compressor is shown in Fig. 6.

B. Kogge Stone Adder

In 1973, Peter M. Kogge and Harold S. Stone introduced an efficient, high-performance adder called the Kogge-Stone adder. It is a parallel prefix adder, known for the fastest addition for a given design time [9], [10]. Fig. 7 and Fig. 8 show the functional block diagram and RTL view of a 4-bit Kogge-Stone adder. From the ith bits of the inputs, the propagate signals 'Pi' and generate signals 'Gi' are calculated; these signals in turn produce the output carry signals. To minimize the computation delay, the operation of a prefix adder is divided into three stages:

A. Pre- processing

B. Generation of Carry.

C. Final processing.

A. Pre-Processing

In this stage, the generate and propagate signals are given by Equations 5 and 6.

B. Generation of carry

In this stage, the carries corresponding to each bit are calculated, and this operation is executed in parallel. Carry propagate and carry generate are used as intermediate signals; their logic equations are shown below.

C. Final Processing

In the final processing stage, the sum and carry-out bits are computed from the propagate and carry signals of the given input bits, using the logic equation for this stage.
Fig. 7. Block diagram of Kogge-Stone adder

Fig. 8. RTL view of Kogge-Stone adder
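Assuming the standard prefix formulation (Pi = Ai xor Bi, Gi = Ai and Bi, with group terms combined as G = Gh or (Ph and Gl) and P = Ph and Pl), the three stages above can be sketched as a Python behavioral model of the 4-bit adder. The function name is ours, not from the paper.

```python
def kogge_stone_add(a, b, width=4):
    # Pre-processing: bitwise propagate and generate signals
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    # Carry generation: prefix tree whose spans double at each level
    G, P = g[:], p[:]
    d = 1
    while d < width:
        for i in range(width - 1, d - 1, -1):   # downward pass keeps lower spans intact
            G[i] |= P[i] & G[i - d]
            P[i] &= P[i - d]
        d *= 2
    # Final processing: sum_i = p_i xor carry_i, where carry_i = G[i-1]
    s = 0
    for i in range(width):
        c = G[i - 1] if i > 0 else 0
        s |= (p[i] ^ c) << i
    return s | (G[width - 1] << width)          # carry-out in the top bit

print(kogge_stone_add(9, 7))                    # -> 16
```

After the prefix loop, G[i] is the group generate over bits 0..i, so the carry into bit i is available in O(log width) logic levels, which is the source of the adder's speed.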


SOFTWARE REQUIREMENTS
5.1 XILINX ISE
XILINX 14.7
Xilinx software is required by both VHDL and Verilog designers to perform synthesis. Any simulated code can be synthesized and configured on an FPGA. The process of converting HDL code into a gate-level netlist is called synthesis; it is a main part of current design flows.

Fig 5.1. Creation of a new project


5.1.1 ALGORITHM OF XILINX
• Click on the Xilinx ISE icon to start Xilinx.
• Create a new project. Figure 5.1 shows the new project creation, where the project name is given and the corresponding location is selected; the same directory is shown as the location of the project.
• Set the displayed properties according to the requirements. Figure 5.2 shows the project settings, where the device and design flow of the project are selected: set the product category to All, the family to Spartan-3E, the device to xc3s100e, and the preferred language to Verilog. After the properties are selected, click Next.
Fig 5.2. Project settings
Create a Verilog source by specifying the required inputs, outputs and buffers; a window is then displayed in which the Verilog code is written and synthesized.

Fig 5.3. Creation of a new source


Figure 5.3 shows the creation of a new source: select the Project menu and then New Source. The new source is created according to the given conditions and requirements.
• Select the source type as Verilog Module.
Fig 5.4. Selection of type of source
Figure 5.4 shows the source-type selection: select Verilog Module as the source type, enter the file name, and note the location of the file on the particular drive. Select Add to project and click Next.

Fig 5.5. Summary of new source


Figure 5.5 shows the summary of the new source, displaying the source type, name and module name along with the port definitions. Click Finish; the editor is displayed, where the program is written and then saved.
• When the Verilog code is complete, check for syntax errors.
• Click on RTL Schematic, then on Technology, and after that view the synthesis report.
Fig 5.6: Checking for the syntax
Figure 5.6 shows the syntax check: select the project source file and, in the Processes window, select Synthesize - XST to display its sub-processes. Right-click Check Syntax and select Run; any errors in the program are displayed in the console window.
Correcting HDL Errors
The syntax of added files and files saved to the project can be verified. The console displays error messages, and parser messages indicate the success or failure of parsing each file. If any module contains a syntax error, correct it before proceeding. An "ERROR" entry in the console indicates the failure and the line number where the syntax problem occurred. The steps for displaying and fixing errors are:
1. Click the file name in the error message of the Console panel or Errors panel; the source code opens in the workspace, and the line with the error is indicated by a yellow arrow icon next to it.
2. Correct any errors in the HDL source file. The comments placed above an error help to fix it.
3. After correcting the errors, go to the File menu and press Save to save the file.
4. The parsing message should then indicate that the file is error free and was checked successfully.
• Verilog synthesis tools can create logic circuit structures directly from a Verilog behavioral description and target them to a selected technology for realization (i.e., convert Verilog to actual hardware).
• Using Verilog, design, simulation and synthesis can be performed on anything from a simple combinational circuit to a complete microprocessor on a chip.
• Verilog HDL is a standard hardware description language with many useful features for hardware design.
• Verilog HDL is a general-purpose hardware description language that is easy to learn and use. Its syntax is similar to that of the C programming language, so designers with C experience can learn Verilog HDL quickly.
• It allows different levels of modeling to be mixed in the same model. Hence a designer can describe hardware using switches, gates, RTL, or behavioral code, and needs to learn only one language for stimulus and hierarchical design.
• Verilog HDL is supported by popular logic-synthesis tools, which leaves designers free to choose the language.
• The Programming Language Interface (PLI) is a feature through which C code can interact with Verilog data structures.
5.1.2 VERILOG HDL
Verilog HDL is a hardware description language that can model digital systems at many levels of abstraction, ranging from the algorithmic level to the gate level and the switch level. The complexity of the systems modeled ranges from a simple gate to a complete digital electronic system. A digital system can be described hierarchically, and timing can be modeled explicitly within the same description.
The Verilog language covers behavioral, dataflow and structural modeling; delays and waveform-generation mechanisms, including response monitoring and verification, are all modeled in one single language. In addition, because the language has a programming language interface, the internal design can be accessed and the simulation run can be controlled from it.
The language defines not only the syntax but also the simulation semantics of each language construct; therefore, models written in this language can be verified by a Verilog simulator. The language inherits many of its operators and symbols from the C programming language. Verilog provides extensive modeling capabilities, some of which are difficult to understand initially; however, the core language is easy to learn and use, and is sufficient to model most applications.
5.1.3 VERILOG CAPABILITIES
The following are the major capabilities of Verilog:
➢ Primitive logic gates such as AND, OR and NAND are built into the language.
➢ User-defined primitives (UDPs) can be created flexibly; such a primitive can be either combinational or sequential logic.
➢ Switch-level modeling primitives such as PMOS and NMOS are also built into the language.
➢ Simple language constructs are available to specify pin-to-pin and path delays and to perform timing verification of a design.
➢ A design can be modeled in three different styles, or in a mixed style: behavioral modeling using procedures; dataflow modeling using continuous assignments; and structural modeling using gate and module instantiation.
➢ Two kinds of data types are present in Verilog HDL: net and register. Net types represent physical connections between structural elements, and register types represent data storage elements.
➢ Verilog is capable of mixed-level modeling within one design: each module can be modeled at a different level, such as switch level, gate level, RTL, or algorithmic level.
➢ Verilog HDL also includes built-in logical operators such as & (bitwise AND) and | (bitwise OR).
➢ High-level language constructs such as if-else, case statements, and loops are available in the language.
➢ The notions of timing and concurrency can be modeled unambiguously.
➢ Powerful file read and write capabilities are provided.
Fig 5.7. Mixed-Level Modeling
➢ The language is nondeterministic in some cases, because different simulators can produce different results for the same model; for example, the ordering of events on the event queue is not defined by the standard.
5.1.4 SYNTHESIS
Synthesis is the process of building a gate-level circuit from a register-transfer-level model described in Verilog HDL. An intermediate step produces a netlist comprising register-transfer-level blocks, such as flip-flops, arithmetic and logic units, and multiplexers, interconnected by wires. A module builder then acquires the predefined components from a library and maps each RTL block onto the user-specified target technology.
Verilog HDL synthesis produces an RTL module from which parameters such as power consumption, delay, area and memory usage are found. The RTL module gives an overview of the project in graphical form.
Having produced the gate-level netlist, a logic optimizer reads the netlist and reduces the circuit so that it satisfies the specified area and timing constraints. These constraints may also be used by the module builder for the appropriate selection or generation of RTL blocks. Here we assume that the target netlist is at the gate level, built from logic gates.
Fig 5.8. Synthesis Process
Figure 5.8 shows the elements of Verilog HDL and the corresponding hardware elements. The Verilog elements are converted into hardware elements using a mapping, or construction, mechanism.
Write the program in Verilog, check for syntax errors, and correct them. Then verify the design and proceed to synthesis, where the RTL and technology schematics can be examined.

Fig 5.9. Typical Design Process


5.1.5 XILINX ISE Design Tools:
Xilinx ISE is the design tool provided by Xilinx; other Xilinx tools are virtually identical for our purposes.
There are four fundamental steps in every digital logic design:
1. Design - the circuit is described by a schematic or by code.
2. Synthesis - the intermediate conversion of a human-readable circuit description into FPGA code format. It includes a syntax check and combines all of the separate design files into a single file.
3. Place & Route - where the layout of the circuit is finalized. This is the translation of the FPGA code into logic gates on the FPGA.
4. Program - the FPGA is updated to reflect the design through the use of programming (.bit) files.
Testbench simulation belongs to the second step. ISE supports a variety of design methodologies, including finite state machines, schematic capture, and HDL (VHDL or Verilog).
5.1.6 MODELSIM
ModelSim SE - High-Performance Simulation and Debug
ModelSim SE is a Linux-, UNIX- and Windows-based simulation and debug environment, combining high performance with the most powerful and intuitive GUI in the industry.
What's New in ModelSim SE?
• Enhanced FSM debug options, including display of state information, transition tables and warning messages, plus added support for FSM multi-state transition coverage.
• Enhanced debugging with hyperlinked navigation between objects and their declarations, and between visited source files.
• The Dataflow window can compute and display all paths from one net to another.
• Enhanced code coverage data management, with fine-grained control of the information in the Source window.
• Support for several IEEE VHDL-2008 features, including source code encryption, and additional support for new VPI types, including packed arrays of struct variables and nets.
ModelSim SE Features:
• Multi-language, high-performance simulation engine
• Verilog, VHDL and SystemVerilog design support
• Code coverage
• SystemVerilog for design
• Integrated debug
• JobSpy regression monitor
• Mixed HDL simulation option
• SystemC option
• Solaris and Linux, 32-bit and 64-bit
• Windows 32-bit
ModelSim SE Benefits:
• High-performance HDL simulation solution for ASIC and FPGA design teams
• The best mixed-language environment and performance in the industry
• Intuitive GUI for efficient interactive or post-simulation debug of RTL and gate-level designs
• Unified reporting and ranking of code coverage for tracking verification progress
• Sign-off support for popular ASIC libraries
• All ModelSim products are 100% standards based
• Award-winning technical support
SIMULATION RESULTS

The approximate 16-bit Wallace multiplier using the 15-4 compressor has been designed in HDL using Xilinx ISE 14.5. The simulation results show the overall architecture of the Wallace tree multiplier, as in Fig. 9. The area utilization and power consumption of the multiplier design were obtained through simulation and are tabulated in Table I and Table II. A snapshot of the delay obtained through simulation is shown in Fig. 10. The processing delay of the final addition level is reduced by using the Kogge-Stone adder.

Fig. 9. Overall Architecture Of 16×16 Bit Wallace Tree Multiplier

Tables I and II describe the area utilization and power parameters of the 16-bit Wallace multiplier. The design shows better results than designs using other adders, giving both less area and lower propagation delay.

TABLE I DEVICE UTILIZATION OF 16×16 WALLACE TREE MULTIPLIER


TABLE II POWER ANALYSIS OF 16×16 WALLACE TREE MULTIPLIER

Fig. 10. Delay analysis of 16×16bit Wallace tree multiplier

TABLE III DESIGN ANALYSIS OF MULTIPLIER


The performance of the 15-4 compressor based approximate 16×16 multiplier is also compared against versions using various other adders at the final stage in place of the Kogge-Stone adder. The comparative results in terms of delay, area and power dissipation are tabulated in Table III.
CONCLUSION

The approximate 16×16-bit Wallace tree multiplier using the 15-4 compressor architecture has been designed and synthesized for the Spartan-3E XC3S100E device and simulated in Xilinx ISE 14.5. The performance of the proposed multiplier with the Kogge-Stone adder is compared with the same multiplier architecture using a parallel adder. It can be inferred that the 16×16 multiplier architecture using the 15-4 compressor with the Kogge-Stone adder is faster than the multiplier with a parallel adder. In future, the performance of the proposed multiplier can be improved further and applied in applications such as video and image processing.
REFERENCES

[1] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Transactions on Computers, vol. 13, pp. 14-17, 1964.

[2] K. Bhardwaj, P. S. Mane, and J. Henkel, "Power- and area-efficient approximate Wallace tree multiplier for error-resilient systems," in Proc. 15th Int. Symp. Quality Electronic Design (ISQED), Mar. 2014, pp. 263-269.

[3] C.-H. Lin and I.-C. Lin, "High accuracy approximate multiplier with error correction," in Proc. IEEE 31st Int. Conf. Computer Design (ICCD), Oct. 2013, pp. 33-38.

[4] D. R. Gandhi and N. N. Shah, "Comparative analysis for hardware circuit architecture of Wallace tree multiplier," in Proc. IEEE Int. Conf. Intelligent Systems and Signal Processing, Gujarat, 2013, pp. 1-6.

[5] R. Menon and D. Radhakrishnan, "High performance 5:2 compressor architectures," Proc. IEE - Circuits, Devices and Systems, vol. 153, no. 5, pp. 447-452, Oct. 2006.

[6] R. Marimuthu, M. Pradeepkumar, D. Bansal, S. Balamurugan, and P. S. Mallick, "Design of high speed and low power 15-4 compressor," in Proc. Int. Conf. Communications and Signal Processing (ICCSP), Apr. 2013, pp. 533-536.

[7] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, "Design and analysis of approximate compressors for multiplication," IEEE Transactions on Computers, vol. 64, no. 4, pp. 984-994, Apr. 2015.
[8] T. Francis, T. Joseph, and J. K. Antony, "Modified MAC unit for low power high speed DSP application using multiplier with bypassing technique and optimized adders," in Proc. 4th ICCCNT, 2013.

[9] S. K. Yezerla and B. R. Naik, "Design and estimation of delay, power and area for parallel prefix adders," in Proc. Recent Advances in Engineering and Computational Sciences (RAECS), 2014, pp. 1-6.

[10] Y. Choi, "Parallel prefix adder design," in Proc. 17th IEEE Symposium on Computer Arithmetic, Jun. 2005, pp. 90-98.

[11] B. W. Y. Wei and C. D. Thompson, "Area-time optimal adder design," IEEE Transactions on Computers, vol. 39, pp. 666-675, May 1990.
