Final Documentation

The document discusses low power design techniques for VLSI circuits. It explains that power consumption is a major concern for portable devices as it affects battery life. Various sources of power dissipation in circuits are described, including dynamic power from switching, short circuit power, and increasing leakage power. The document outlines techniques to reduce each type of power dissipation at different design levels from system to circuit. These include voltage scaling, multi-threshold voltage devices, transistor sizing, and power gating of unused blocks.

Uploaded by

Swamy Nallabelli

CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION

Very-large-scale integration (VLSI) is the process of creating an integrated circuit (IC) by combining thousands of transistors into a single chip. VLSI began in the 1970s, when complex semiconductor and communication technologies were being developed, and it lets IC designers integrate all of these functions into one chip. In electronics, logic synthesis is the process by which an abstract description of desired circuit behaviour, typically at register transfer level (RTL), is turned into a design implementation in terms of logic gates, usually by a computer program called a synthesis tool. Every digital system is built from basic blocks such as addition, shifting and multiplication. Of these operations, multiplication has the greatest influence on the speed of the system, and many performance problems arise from the latency of the multiplication operation.

The important operations in digital signal processing (DSP) are filtering, inner products and spectral analysis. Many of these operations, such as filtering and inner products, are performed with the help of multiplication, so the multiplier plays a crucial role in any DSP system. Multiplication can be viewed as repeated addition. Various types of low-power digital multipliers with high clock frequencies exist. They are used widely in digital image processing and are at the heart of today's mobile communication systems. Recently, because of the increasing demand for battery-powered, high-speed electronic devices, power consumption has become a serious concern in VLSI chips, aggravated by growing non-linear effects. Power consumption also limits the battery life of a device: the output current of a MOS source-coupled multiplier in a differential pair depends on the non-linearity of the bias current (Iss) and the input signal. Various techniques are applied inside and around the multiplier to reduce its power consumption. The advantage of the GDI technique over static CMOS is that it uses fewer transistors, which reduces area and interconnect.

Arithmetic circuits such as multipliers and adders are among the basic components in the design of any communication circuit, and digital multipliers are therefore used in many digital designs. They are fast, reliable and efficient components used to implement arithmetic operations. Power dissipation in a multiplier is an important issue because it contributes heavily to the total power dissipated by the circuit and hence affects device performance. Most digital signal processing (DSP) systems incorporate a multiplication unit to implement algorithms such as correlation, convolution, filtering and frequency analysis. Multipliers are key components of many high-performance systems such as FIR filters, microprocessors and DSP processors.

During the desktop-PC design era, VLSI design efforts focused primarily on optimizing speed to realize computationally intensive real-time functions such as video compression, gaming and graphics. As a result, we have semiconductor ICs integrating complex signal processing modules and graphical processing units to meet our computation and entertainment demands. While these solutions have addressed the real-time problem, they have not addressed the growing demand for portable operation, where a mobile phone must pack all of this functionality without consuming much power. The strict limit on power dissipation in portable electronics such as smartphones and tablet computers must be met by the VLSI chip designer while still meeting the computational requirements. As wireless devices rapidly make their way into the consumer electronics market, a key design constraint for portable operation, namely the total power consumption of the device, must be addressed. Reducing total power consumption in such systems is important because it is desirable to maximize run time with minimum size and weight allocated to batteries. The most important consideration when designing an SoC for portable devices is therefore low-power design.

Is Power Really a Problem?

Scaling the technology node increases power density more than expected. CMOS technology beyond the 65nm node poses a real challenge for any sort of voltage and frequency scaling. Starting from the 120nm node, each new process has inherently higher dynamic and leakage current density with minimal improvement in speed. Between 90nm and 65nm, dynamic power dissipation is almost the same, while leakage per mm² is roughly 5% higher. Low cost continues to drive higher levels of integration, whereas low-cost technological breakthroughs to keep power under control are becoming scarce.

Modern System-on-Chip designs demand more power. In both logic and memory, static power is increasing very fast and dynamic power is also rising, so overall power is increasing dramatically. If semiconductor integration continues to follow Moore's Law, the power density inside chips will exceed even that of a rocket nozzle.

Do We Need To Bother With Power?

Power dissipation is the main constraint when it comes to portability. Mobile device consumers demand more features and longer battery life at a lower cost: about 70% of users cite longer talk and stand-by time as the primary mobile phone feature, and power efficiency is a top 3G requirement for operators. Customers also want smaller and sleeker mobile devices. This requires high levels of silicon integration in advanced processes, but advanced processes have inherently higher leakage current. Hence there is a strong need to reduce leakage current in order to reduce power consumption.

Why Power Matters in an SoC?

Power management matters in a System on Chip due to the following concerns:

a. Packaging and cooling costs
b. Digital noise immunity
c. Battery life (in portable systems)
d. Environmental concerns

Sources of Power Dissipation:

The power dissipation in a circuit can be classified into the three categories described below.

Dynamic power consumption:

Power dissipated when logic transitions cause logic gates to charge and discharge their load capacitance.


Short-circuit current:

In a CMOS logic gate, the P-branch and N-branch are momentarily shorted together as the gate changes state, resulting in short-circuit power dissipation.
Leakage current:

Power dissipated even when the system is powered but idle, for example in standby mode. A MOSFET has many sources of leakage: diode leakage around transistors and n-wells, subthreshold leakage, gate leakage, tunnelling currents, and so on. Leakage grows by roughly 20 times with each new fabrication technology, so effects that were once insignificant are becoming dominant.
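In symbols, the three components add up as follows (a standard first-order CMOS power model, not taken from this document; α is the switching activity factor, C_L the switched load capacitance, and f the clock frequency):

```latex
P_{\text{total}} = P_{\text{dyn}} + P_{\text{sc}} + P_{\text{leak}}
                 = \alpha\, C_L\, V_{DD}^{2}\, f \;+\; I_{\text{sc}}\, V_{DD} \;+\; I_{\text{leak}}\, V_{DD}
```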

Low-Power Design Techniques:


An integrated low power methodology requires optimization at all design
abstraction layers as mentioned below.
1. System: Partitioning, Power down
2. Algorithm: Complexity, Concurrency, Regularity
3. Architecture: Parallelism, Pipelining, Redundancy, Data Encoding
4. Circuit Logic: Logic Styles, Energy Recovery, Transistor Sizing
5. Technology: Threshold Reduction, Multithreshold Devices.

Dynamic power varies as VDD2. So reducing the supply voltage reduces


power dissipation. Also selective frequency reduction technique can be used to
reduce dynamic power. Multi threshold voltage can be used to reduce leakage
power at system level. Transistor resizing can be used to speed-up circuit and reduce
power. Sleep transistors which we will discuss in following tutorials can be used
effectively to reduce standby power. Parallelism and pipelining in system
architecture can reduce power significantly. Clock disabling, power-down of
selected logic blocks, adiabatic computing, software redesign to lower power
dissipation are the other techniques commonly used for low power design.
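The quadratic dependence of dynamic power on VDD can be illustrated with a small numeric sketch. The activity factor, capacitance and frequency values below are illustrative assumptions, not figures from this document:

```python
# Illustrative only: first-order dynamic power model P = alpha * C * Vdd^2 * f.
# The activity factor, load capacitance and frequency below are made-up
# example values, not figures from this document.

def dynamic_power(alpha, c_load, vdd, freq):
    """Switching power of a CMOS node, in watts."""
    return alpha * c_load * vdd ** 2 * freq

p_33 = dynamic_power(0.2, 1e-12, 3.3, 100e6)   # 3.3 V supply
p_18 = dynamic_power(0.2, 1e-12, 1.8, 100e6)   # scaled down to 1.8 V

# Scaling 3.3 V -> 1.8 V cuts dynamic power by (1.8/3.3)^2, roughly 0.30,
# i.e. about a 70% reduction at the same frequency.
print(p_18 / p_33)
```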

VLSI circuit design for low power:

The growing market of portables such as cellular phones, gaming consoles and battery-powered electronic systems demands microelectronic circuits designed for ultra-low power dissipation. As the integration, size and complexity of chips continue to increase, the difficulty of providing adequate cooling may either add significant cost or limit the functionality of the computing systems that use those integrated circuits. As the technology node scales down to 65nm, dynamic power dissipation does not increase much; however, static (leakage) power reaches or exceeds dynamic power levels beyond the 65nm technology node.

Hence, techniques to reduce power dissipation are not limited to dynamic power. This chapter discusses circuit- and logic-level design approaches to minimize dynamic, leakage and short-circuit power dissipation. Power optimization in a processor can be achieved at various abstraction levels. The system, algorithm and architecture levels have large potential for power saving, but these techniques tend to saturate as more functionality is integrated on an IC, so optimization at the circuit and technology levels is also very important for the miniaturization of ICs.

The total power dissipated in a CMOS circuit is the sum of dynamic power, short-circuit power and static (leakage) power. Design for low power implies the ability to reduce all three components of power consumption in CMOS circuits during the development of a low-power electronic product. The sections that follow summarize the most widely used circuit techniques to reduce each of these components in a standard CMOS design.

The switching power dissipation in CMOS digital integrated circuits is a strong function of the power supply voltage, so reducing VDD is a very effective means of limiting power consumption. For a given technology, the circuit designer may use on-chip DC-DC converters and/or separate power pins to achieve this goal. The savings in power dissipation, however, come at a significant cost in increased circuit delay. When considering drastic reduction of the supply voltage below the standard of 3.3 V, timing performance must therefore be examined carefully; the supply voltage can be reduced together with a corresponding scaling of the threshold voltages to compensate for the speed degradation. Although reducing the supply voltage significantly reduces dynamic power dissipation, the inevitable design trade-off is an increase in delay. This interpretation, however, assumes that the switching frequency (i.e., the number of switching events per unit time) remains constant.

If the circuit is always operated at the maximum frequency allowed by its propagation delay, the number of switching events per unit time (i.e., the operating frequency) will drop as the propagation delay grows with the reduction of the supply voltage. The net result is that the dependence of switching power dissipation on the supply voltage becomes stronger than a simple quadratic relationship, as shown in the figure below. It is important to note that this voltage scaling is distinctly different from constant-field scaling, where the supply voltage as well as the critical device dimensions (channel length, gate oxide thickness) and doping densities are scaled by the same factor. Here we examine the effect of reducing the supply voltage for a given technology, so the key device parameters and load capacitances are assumed constant. The propagation delay expressions show that the negative effect of a reduced supply voltage on delay can be compensated for if the threshold voltage of the transistors (VT) is scaled down accordingly, although this approach is limited because the threshold voltage cannot be scaled to the same extent as the supply voltage. When scaled appropriately, reduced threshold voltages allow the circuit to achieve the same speed performance at a lower VDD.

Figure 1.1 Power and delay unit graph

The figure above shows the variation of the propagation delay of a CMOS inverter as a function of the power supply voltage, for different threshold voltage values. Reducing the threshold voltage from 0.8 V to 0.2 V improves the delay at VDD = 2 V by a factor of about 2. The positive influence of threshold voltage reduction on propagation delay is especially pronounced at low supply voltages, VDD < 2 V. It should be noted, however, that using low-VT transistors raises significant concerns about noise margins and sub-threshold conduction: smaller threshold voltages lead to smaller noise margins for CMOS logic gates, and sub-threshold conduction sets a severe limit on how far the threshold voltage can be reduced. For threshold voltages smaller than 0.2 V, leakage due to sub-threshold conduction in stand-by, i.e., when the gate is not switching, may become a very significant component of the overall power consumption. In addition, the propagation delay becomes more sensitive to process-related fluctuations of the threshold voltage. Techniques that can overcome the difficulties associated with low-VT circuits (such as leakage and high stand-by power dissipation) are discussed later.
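The factor-of-two figure quoted above can be reproduced with the simple square-law delay model t_d ∝ VDD/(VDD − VT)². This is a first-order approximation used here only to show the trend; real submicron devices follow the alpha-power law with an exponent below 2:

```python
# Sketch: first-order (square-law) CMOS delay model,
#   t_d ~ Vdd / (Vdd - Vt)^2   (up to a constant technology factor).
# Used only to reproduce the trend described for Figure 1.1, not exact numbers.

def relative_delay(vdd, vt):
    """Propagation delay up to a constant technology factor."""
    return vdd / (vdd - vt) ** 2

speedup = relative_delay(2.0, 0.8) / relative_delay(2.0, 0.2)
print(round(speedup, 2))  # ~2.25: lowering Vt from 0.8 V to 0.2 V
                          # roughly halves the delay at Vdd = 2 V
```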

1.2 DIGITAL SIGNAL PROCESSING

Digital audio, speech recognition, cable modems, radar, high-definition television: these are but a few of the modern computer and communications applications relying on digital signal processing (DSP) and the attendant application-specific integrated circuits (ASICs). As information-age industries constantly reinvent ASIC chips for lower power consumption and higher efficiency, there is a growing need for designers who are current and fluent in VLSI design methodologies for DSP. VLSI digital signal processing covers performance optimization techniques in VLSI signal processing: architecture theory and algorithms, various architectures at the implementation level, and several approaches to the analysis, estimation and reduction of power consumption.

FIR blocks are among the most important blocks in digital signal processing (DSP) design. They are widely used in industry and in digital systems such as automotive electronics, mobile phones, the internet, laptops, speech processing and Bluetooth headsets. The requirements for an electronic system have two major components: one is technology-driven and the other is market-driven. On the technology side, most industries are improving their devices toward greater complexity: more functionality, higher density in order to place millions of transistors on a smaller die area, increased performance and lower power dissipation. On the market side, each new issue must be taken seriously and acted on quickly, since missing the market window can be very costly.

Multiplication is a fundamental operation in most digital signal processing algorithms, used to implement functions such as convolution and filtering. Statistics show that more than 70% of instructions perform addition or multiplication in most microprocessor and DSP algorithms [1]; that is, these operations consume most of the execution time. In a system containing a multiplier, overall performance is usually determined by the performance of the multiplier, since it is the slowest element. Hence, optimizing the speed of the multiplier is a major design issue. Multiplication can be divided into three steps: generating the partial products, reducing the partial products, and a final addition to obtain the product. The speed of multiplication can be improved by reducing the number of generated partial products and/or by increasing the speed at which these partial products are accumulated.
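The three steps above can be sketched in Python. This is a behavioural sketch of unsigned multiplication only; real hardware performs the reduction step with compressor trees rather than sequential addition:

```python
# Sketch of the three multiplication steps for unsigned operands with an
# n-bit multiplier b:
#   1) generate the shifted partial products,
#   2) reduce them (carry-save compressor trees in hardware),
#   3) final carry-propagate addition.

def multiply(a, b, n=8):
    # Step 1: one shifted partial product per multiplier bit.
    partials = [(a if (b >> i) & 1 else 0) << i for i in range(n)]
    # Step 2: reduce the partial products down to two operands.
    reduced = sum(partials[:-1])
    # Step 3: final carry-propagate addition gives the product.
    return reduced + partials[-1]

print(multiply(13, 11))  # 143
```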

The objective of a good multiplier is to provide a compact, high-speed, low-power unit; the multiplier is selected based on the nature of the application in which it is to be used. This work first discusses various adders, such as the Ripple Carry Adder (RCA), Carry Look-Ahead Adder (CLA) and Carry Save Adder (CSA), since adders play a crucial role in multiplier design. Different methods of generating the partial products and accumulating them are also explained in detail. The work then compares different multipliers, such as the array multiplier, modified Booth multiplier, Wallace tree multiplier and modified Booth-Wallace tree multiplier, on the basis of maximum combinational path delay, in order to find the fastest of the four.
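As background for the adder comparison, here is a bit-level sketch of the ripple-carry adder, the simplest of the three. This is illustrative Python, not tied to any design in this work:

```python
# Bit-level ripple-carry adder: each full adder consumes one carry and
# produces the next, so the worst-case delay grows linearly with the width.
# (CLA and CSA structures exist precisely to break this serial carry chain.)

def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x, y, n=8):
    carry, result = 0, 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry  # n-bit sum and the final carry-out

print(ripple_carry_add(200, 100))  # (44, 1): 300 mod 256, with carry-out set
```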

Fast integer multipliers are a key topic in the VLSI design of high-speed microprocessors. Multiplication is one of the basic arithmetic operations; in fact, 8.72% of all instructions in a typical scientific program are multiplies [1]. In addition, multiplication is a long-latency operation: in typical processors, multiplication takes between two and eight cycles [2]. Consequently, high-speed multipliers are critical for processor performance. Processor designers have recognized this and have devoted considerable silicon area to multipliers [3]. Recent advances in integrated circuit fabrication technology have yielded both smaller feature sizes and larger die areas; together, these factors give the processor designer the ability to implement full high-speed floating-point multipliers in silicon. Most advanced digital systems today incorporate a parallel multiplication unit to carry out high-speed arithmetic operations. In many situations, the multiplier lies directly on the critical path, resulting in an extremely high demand on its speed.

In the past, considerable effort went into designing multipliers with higher speed and throughput, resulting in fast multipliers with delays as low as 4.1 ns [4]. However, with the increasing importance of power due to the portability and reliability concerns of electronic devices [5], recent work has started to look into circuit design techniques that lower the power dissipation of multipliers [6]. This work describes the design and fabrication of a 32×32-bit parallel multiplier, based on a 0.13 µm CMOS process, for low-power applications. Pass-transistor (PT) logic is chosen to implement most of the logic functions within the multiplier.

PT logic is emerging as an attractive replacement for conventional static CMOS logic, especially in the design of arithmetic macros, because it requires fewer devices to implement basic logic functions such as the XOR used in arithmetic operations. This translates into lower input gate capacitance and lower power dissipation compared with conventional static CMOS [7]. In the PT circuit implementations reported so far [8], transmission-gate (TG) design techniques, which provide full voltage swings, were widely adopted. Here, several circuits are presented that fully exploit the inherent non-full-swing (NFS) nature of PT logic; these circuits were used as basic building blocks within the multiplier to achieve low-power operation.

1.3 LITERATURE SURVEY:

The following papers were reviewed to assess the current state of the art and future challenges.

W. C. Yeh and C. W. Jen [1] proposed a new Modified Booth Encoding (MBE) scheme as the Partial Product Generator (PPG), implementing the MBE with encoder and decoder logic and optimizing the delay path through them. They also proposed a Multiple-Level Conditional-Sum Adder (MLCSMA) as the Carry Propagate Adder (CPA) to improve the performance of the parallel multiplier; its parallel carry generator network provides carries in parallel to all stages of the full adder. The proposed MBE and MLCSMA algorithms optimize the delay, reducing it by 8% compared with other parallel multipliers. The proposed multiplier can multiply signed operands.

Shiann-Rong Kuang, Jiun-Ping Wang and Cang-Yuan Guo [2] proposed an MBE that generates a regular partial product array. The MBE of [1] uses an extra partial product bit at the least significant bit (LSB) position of each partial product row and generates an irregular array of (n/2)+1 rows; the partial product reduction tree (PPRT) delay and area increase because of the extra bit needed for the negation operation. The MBE of [2] instead generates a regular partial product array with n/2 rows; eliminating one partial product row reduces the delay, area and power of the MBE multiplier, at the cost of additional logic circuits for converting the irregular partial product array into a regular one. Since it uses a regular MBE as the PPG, it reports higher performance than conventional MBE multipliers.

Wang, Shyh-Jye Jou and Chung-Len Lee [3] proposed a well-structured MBE multiplier architecture. They presented improved Booth encoder and Booth selector logic that removes the extra partial product row, as in [2], together with a sparse-tree approach for the two's-complementation operation. Removing the extra partial product row and using the sparse-tree design reduce the area and improve the speed of the signed multiplier.

G. Goto, A. Inoue, R. Ohe, S. Kashiwakura, S. Mitarai, T. Tsuru and T. Izawa [4] proposed a 4.1 ns compact 54×54-bit multiplier utilizing sign-select Booth encoders. To reduce the total transistor count, they designed a sign-select Booth encoding scheme and a 48-transistor 4-2 compressor for the PPRT. The sign-select Booth algorithm simplifies the Booth selector circuit, reducing its transistor count by 45% compared with the MBE schemes of [20, 21]. With the sign-select logic and the 4-2 compressor, their 54×54-bit signed multiplier achieves a multiplication delay of 4.1 ns and a chip area of 1.04 mm × 1.27 mm at a 2.5 V supply, with a 24% reduction in total transistor count.

C. H. Chang, J. Gu and M. Zhang [5] proposed ultra-low-voltage, low-power 4-2 and 5-2 compressors in CMOS logic for fast arithmetic circuits. Their 4-2 compressor uses Exclusive-OR (XOR) gates in three levels with a critical path of 3 units, and their 5-2 compressor has a critical path of 4 units. They also proposed a circuit with a pair of PMOS-NMOS transistors to eliminate the weak logic levels in the XOR and Exclusive-NOR (XNOR) modules, and claimed that the proposed XOR-XNOR module used in the 4-2 and 5-2 compressors can operate at supply voltages as low as 0.6 V.
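The arithmetic behaviour of a 4-2 compressor can be modelled at the bit level. The realization below is a generic textbook one, not the specific transistor-level circuit of [5]:

```python
from itertools import product

# Bit-level model of a 4-2 compressor: four operand bits plus a horizontal
# carry-in are compressed to one sum bit and two weight-2 carries, so that
#   x1 + x2 + x3 + x4 + cin == s + 2 * (carry + cout).

def compressor_4_2(x1, x2, x3, x4, cin):
    p = x1 ^ x2
    t = p ^ x3 ^ x4
    s = t ^ cin                 # weight-1 sum output
    cout = x3 if p else x1      # weight-2 carry, ripples horizontally
    carry = cin if t else x4    # weight-2 carry into the next column
    return s, carry, cout

# Exhaustive check of the arithmetic identity over all 32 input patterns.
assert all(
    x1 + x2 + x3 + x4 + cin == s + 2 * (carry + cout)
    for x1, x2, x3, x4, cin in product((0, 1), repeat=5)
    for s, carry, cout in [compressor_4_2(x1, x2, x3, x4, cin)]
)
print("identity holds for all 32 input patterns")
```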

Huang and Ercegovac [6] proposed the design of a high-performance, low-power left-to-right array multiplier for signed numbers, using a signal flow optimization technique in the full adders of the PPRT based on left-to-right leapfrog (LRLF) signal flow, and splitting the reduction array into upper and lower parts for high performance.

Pouya Asadi and Keivan Navi [7] proposed the design of a novel high-speed 54×54-bit multiplier for signed numbers. They presented a self-timed carry look-ahead adder whose average computation time is proportional to the logarithm of the logarithm of n, and developed a novel 4-2 compressor using PTL that is claimed to be faster than conventional CMOS circuits because the number of gate stages on the critical path is minimized. The proposed multiplier has a delay of 3.4 ns at a 1.3 V supply and is implemented using 42,579 transistors.

Leandro Z. Pieper, Eduardo A. C. da Costa and Sérgio J. M. de Almeida [8] proposed 2's-complement radix-2^m array multipliers built from dedicated modules. Using these modules, 16-, 32- and 64-bit multipliers were implemented and claimed to be more efficient than the modified Booth multiplier. In [11], R. Zlatanovici, Sean Kao and Borivoje Nikolic proposed a fast, energy-efficient single-cycle 64-bit CLA adder. An optimized topology, a sparse radix-4 Ling adder and domino CMOS logic are claimed to achieve a 240 ps delay for the addition of 64-bit operands; fabricated in 90 nm CMOS technology, the adder consumes 260 mW at a 1 V supply.

In [14], A. M. Shams, Tarek K. D and Magdy A. Bayoumi proposed the design and implementation of low-power 1-bit CMOS full adder cells. They presented 20 different 1-bit full adder cells, each with different delay, area and power consumption, so that from this library of full adder cells a circuit designer can pick an adder that meets the requirements.

CHAPTER II

XILINX ISE 14.7


2.1 DESIGN ENTRY:

Design entry is the first step in the ISE design flow. During design entry, you create your source files based on your design objectives. You can create your top-level design file using a hardware description language (HDL) such as VHDL, Verilog or ABEL, or using a schematic. You can use multiple formats for the lower-level source files in your design.

2.2 SYNTHESIS

After design entry and optional simulation, you run synthesis. During this step, VHDL, Verilog or mixed-language designs become netlist files that are accepted as input to the implementation step.

2.3 IMPLEMENTATION

After synthesis, you run design implementation, which converts the logical design into a physical file format that can be downloaded to the selected target device. From Project Navigator, you can run the implementation process in one step, or you can run each of the implementation processes separately. Implementation processes vary depending on whether you are targeting a Field Programmable Gate Array (FPGA) or a Complex Programmable Logic Device (CPLD).

2.4 VERIFICATION

You can verify the functionality of your design at several points in the design flow. You can use simulator software to verify the functionality and timing of your design, or a portion of it. The simulator interprets VHDL or Verilog code into circuit functionality and displays the logical results of the described HDL so that you can determine correct circuit operation. Simulation allows you to create and verify complex functions in a relatively small amount of time. You can also run in-circuit verification after programming your device.

2.5 DEVICE INSTALLATION:

After generating a programming file, you configure your device. During configuration, you generate configuration files and download the programming files from a host computer to a Xilinx device.

2.6 ISIM:

Xilinx ISim is a Hardware Description Language (HDL) simulator that enables you to perform functional and timing simulations for VHDL, Verilog and mixed VHDL/Verilog designs.

2.6.1 LANGUAGE SUPPORT:

Table 1: Language support

Language                Support
VHDL                    IEEE-STD-1076-2000
Verilog                 IEEE-STD-1364-2001
SDF                     Xilinx (NetGen) generated SDF files
VITAL                   VITAL-2000
Mixed VHDL/Verilog      Yes
VHDL FLI/VHPI           No
Verilog PLI             No
SystemVerilog           No

2.6.2 FEATURE SUPPORT:

Table 2: Feature support

Feature                    Support
Incremental Compilation    Yes
Source Code Debugging      Yes
SDF Annotation             Yes
VCD Generation             Yes
SAIF Support               Yes
Hard IP (MGT, PPC, etc.)   Yes
Multi-threading            Yes

2.6.3 Simulation Using ISim:

Now that you have a test bench in your project, you can perform behavioural simulation on the design using ISim. The ISE software has full integration with ISim: it enables ISim to create the work directory, compile the source files, load the design, and perform simulation based on the simulation properties.

To select ISim as your project simulator, do the following:

 In the Hierarchy pane of the Project Navigator Design panel, right-click the device line (xc3s100E-5tq114) and select "Design Properties".
 In the Design Properties dialog box, set the Simulator field to "ISim (VHDL/Verilog)".

2.6.4 Locating the Simulation Processes:

The simulation processes in the ISE software enable you to run simulation on the design using ISim.

To locate the ISim processes, do the following:

 In the View pane of the Project Navigator Design panel, select "Simulation", and select "Behavioural" from the drop-down list.
 In the Hierarchy pane, select the test bench file (e.g., stopwatch_tb).
 In the Processes pane, expand "ISim Simulator" to view the process hierarchy.

The following simulation processes are available:

Check Syntax: checks for syntax errors in the test bench.
Simulate Behavioural Model: starts the design simulation.

2.6.5 Specifying Simulation Properties:

You will perform a behavioural simulation on the stopwatch design after you set the process properties for simulation.

The ISE software allows you to set several ISim properties in addition to the simulation netlist properties. To see the behavioural simulation properties and to modify them for this tutorial, do the following:

 In the Processes pane, expand "ISim Simulator", right-click "Simulate Behavioural Model", and select "Process Properties".
 In the Process Properties dialog box, set the Property display level to "Advanced". This global setting enables you to see all available properties.
 Change the Simulation Run Time to "2000 ns" and click "OK".

2.6.6 Performing Simulation:

After the process properties have been set, you are ready to run ISim to simulate the design. To start the behavioural simulation, double-click "Simulate Behavioural Model". ISim creates the work directory, compiles the source files, loads the design, and performs simulation for the time specified.

The majority of the design runs at 100 Hz and would take a significant amount of time to simulate.

CHAPTER III

VLSI & HARDWARE DESCRIPTION LANGUAGES

3.1. VLSI TECHNOLOGY:

Gone are the days when huge computers made of vacuum tubes sat
humming in entire dedicated rooms and could do about 360 multiplications of 10
digit numbers in a second. Though they were heralded as the fastest computing
machines of that time, they surely don’t stand a chance when compared to the
modern day machines. Modern-day computers are getting smaller, faster, cheaper and more power-efficient with every passing second. But what drove this change? The whole domain of computing ushered into a new dawn of electronic miniaturization with the advent of the point-contact transistor by Bardeen and Brattain (1947-48) and then the bipolar junction transistor by Shockley (1949) at Bell Laboratories.

Figure 3.1: A comparison of the first planar IC (1961) and an Intel Nehalem quad-core die

Since the invention of the first IC (Integrated Circuit) in the form of a flip-flop
by Jack Kilby in 1958, our ability to pack more and more transistors onto a single
chip has doubled roughly every 18 months, in accordance with Moore's Law.
Such exponential development had never been seen in any other field and it still
continues to be a major area of research work.

3.2 History & Evolution of VLSI:

The development of microelectronics spans a period shorter than the average
human life expectancy, and yet it has already seen as many as four generations.
The early 60's saw the low-density fabrication processes classified under
Small Scale Integration (SSI), in which the transistor count was limited to about 10.
This rapidly gave way to Medium Scale Integration (MSI) in the late 60's, when around
100 transistors could be placed on a single chip.

It was the time when the cost of research began to decline and private firms
started entering the competition in contrast to the earlier years where the main
burden was borne by the military. Transistor-Transistor logic (TTL) offering
higher integration densities outlasted other IC families like ECL and became the
basis of the first integrated circuit revolution. It was the production of this family
that gave impetus to semiconductor giants like Texas Instruments, Fairchild and
National Semiconductors. Early seventies marked the growth of transistor count
to about 1000 per chip called the Large Scale Integration.

By the mid-eighties, the transistor count on a single chip had grown into the
hundreds of thousands, and hence came the age of Very Large Scale Integration, or VLSI. Though
many improvements have been made and the transistor count is still rising, further
generation names like ULSI are generally avoided. It was during this time
that TTL lost the battle to the MOS family, owing to the same problems that had
pushed vacuum tubes into obsolescence: power dissipation and the limit it imposed
on the number of gates that could be placed on a single die.

The second age of the integrated circuit revolution started with the
introduction of the first microprocessors: the 4004 by Intel in 1971 and the 8080 in
1974. Today many companies like Texas Instruments, Infineon, Alliance
Semiconductors, Cadence, Synopsys, Celox Networks, Cisco, Micron Tech,
National Semiconductors, ST Microelectronics, Qualcomm, Lucent, Mentor
Graphics, Analog Devices, Intel, Philips, Motorola and many other firms have
been established and are dedicated to the various fields in VLSI, such as
programmable logic devices, hardware description languages and embedded systems.

3.3. Introduction of HDL:

Hardware description languages (HDLs) were originally used mainly to describe logic equations
to be realized in programmable logic devices (PLDs). In the 1990s, HDL usage by
digital systems designers accelerated as PLDs, CPLDs and FPGAs became
inexpensive and commonplace. Designers turned to HDLs as a means to design
individual modules within a system-on-chip.

The important innovations in HDLs occurred in the mid-1980s with the
development of VHDL and Verilog HDL, both of which became popular. There are
several steps in an HDL-based design process, often called the design flow. These
steps are applicable to any HDL-based design process and are shown in the figure.

Figure 3.2: Steps in an HDL Based Design Flow

In any design, specifications are written first. Specifications describe the
functionality, interface and overall architecture of the digital circuit to be
designed. The next step is the actual writing of HDL code for modules, their
interfaces and their internal details. After the code has been written, it has to be
compiled; this step is known as compilation. Here the HDL compiler analyzes the
code for syntax errors and also checks it for compatibility with the other modules
on which it relies.

The most satisfying step is simulation, or verification. The HDL simulator
allows you to define and apply inputs to the design and to observe its outputs
without ever having to build the physical circuit. There are at least two dimensions
to verification. Timing verification checks the circuit's operation including estimated
delays, confirming that the setup, hold and other timing requirements for sequential
devices like flip-flops are met. Functional verification checks the circuit's logical
operation independent of timing considerations; gate delays and other timing
parameters are considered to be zero.

After the verification step, the synthesis process is done in the back-end stage.
There are three basic steps. The first is synthesis: converting the HDL description
into a set of primitives or components that can be assembled in the target
technology; it may generate a list of gates and a netlist that specifies how they
are interconnected.

In the fitting step, a fitter maps the synthesized components onto available
device resources. This may mean selecting microcells or laying down individual
gates in a pattern and finding ways to connect them within the physical constraints
of the FPGA or ASIC die; this is called the place-and-route process. The final step is
post-fitting verification of the fitted circuit. It is only at this stage that the actual
circuit delays due to wire lengths, electrical loading, and other factors can be
calculated with reasonable precision.

3.4 HDL Tool Suites:

HDL tool suite really has several different tools with their own names and
purposes:

 A text editor allows writing, editing and saving an HDL program. It often
contains HDL-specific features, such as recognizing specific file name
extensions, and recognizing HDL reserved words and comments and displaying
them in different colors.
 The compiler is responsible for parsing the HDL program, finding syntax
errors and figuring out what the program really says.
 A synthesizer or synthesis tool targets the design to a specific hardware
technology, such as an FPGA or ASIC.

 The simulator runs the specified input sequence on the described hardware
and determines the values of the hardware's internal signals and its
outputs over a specified period of time.
 The output of the simulator can include waveforms to be viewed using
the waveform editor.
 A schematic viewer may create a schematic diagram corresponding to an
HDL program, based on the intermediate-language output of the compiler.
 A translator targets the compiler's intermediate-language output to a real
device such as a PLD, FPGA or ASIC.

3.5. VHDL:

VHDL stands for "VHSIC hardware description language". VHSIC in turn
stands for "Very High Speed Integrated Circuit".

3.5.1. VHDL Advantages:

The key advantage of VHDL, when used for systems design, is that it
allows the behavior of the required system to be described (modelled) and verified
(simulated) before synthesis tools translate the design into real hardware (gates
and wires). Another benefit is that VHDL allows the description of a concurrent
system. VHDL is a dataflow language, unlike procedural computing languages
such as BASIC, C, and assembly code, which all run sequentially, one instruction
at a time.

A VHDL project is multipurpose. Once created, a calculation block
can be used in many other projects. Moreover, many formational and functional
block parameters can be tuned (capacity parameters, memory size, element base,
block composition and interconnection structure). A VHDL project is also portable.
Created for one element base, a computing-device project can be ported to
another element base, for example VLSI with various technologies. Concurrency,
timing and clocking can all be modelled. VHDL handles asynchronous as well as
synchronous sequential-circuit structures. The logical operation and timing
behaviour of a design can be simulated. VHDL allows for various design
methodologies, both top-down and bottom-up, and is very flexible in its approach
to describing hardware.

3.5.2. VHDL History and Features:

In the mid-1980s, the U.S. Department of Defense (DoD) and the IEEE
sponsored the development of a highly capable hardware description language
called VHDL, and the language was extended in 1993 and again in 2002. Some of the
features of VHDL are:

 Packages are used to provide a collection of common declarations,
constants, and/or subprograms to entities and architectures.
 Generics provide a method for communicating information to an architecture
from the external environment. They are passed through the entity construct.
 Ports provide the mechanism for a device to communicate with its
environment. A port declaration defines the names, types, directions and
possible default values for the signals in a component's interface.
 A configuration is an instruction used to bind component instances to
design entities. In it, we specify which real entity interface and
corresponding architecture body should be used for any component
instances.
 A bus is a group of signals or a particular method of communication.
 A driver is a source for a signal, in that it provides values to be applied to the
signal.
 An attribute is additional information attached to a VHDL object.

3.5.3. VHDL STRUCTURE:

Figure 3.3: VHDL Structure

The VHDL structure, or model, is shown in the figure. A single component
model is composed of one entity and one or more architectures. The entity
represents the interface specification (I/O) of the component. It defines the
component's external view, sometimes referred to as its pins. The architecture(s)
describe(s) the internal implementation of an entity.

3.5.4. Types of Architectures:

There are three general types of architectures. A VHDL model can be
created at different abstraction levels (behavioural, dataflow, structural),
according to a refinement of the starting specification.

3.5.4.1. Dataflow Modelling:

Several additional concurrent statements allow VHDL to describe a circuit
in terms of the flow of data and the operations on it within the circuit. This style is
called a dataflow description or dataflow design. Concurrent statements execute
when data is available on their inputs, and these statements may occur in any order
within the architecture. A common method is to use logic equations to develop a
dataflow description.

3.5.4.2. Structural Modelling:

A structural description can be created from previously described components.
These components can be pulled from a library of parts. A VHDL architecture that uses
components is often called a structural description or structural design, because it
defines the precise interconnection structure of signals and entities that realizes the
entity.

3.5.4.3. Behavioural Modelling:

A behavioural description is one in which the functional, and possibly timing,
characteristics are described using VHDL concurrent statements and processes.
A process is a collection of sequential statements that executes in parallel with other
concurrent statements and processes. Using a process, one can specify a complex
interaction of signals and events in a way that executes in essentially zero
simulated time during simulation, and that gives rise to a synthesized combinational
or sequential circuit that performs the modelled operation directly.

A VHDL process statement can be used anywhere that a concurrent
statement can be used. A process statement is introduced by the keyword process.
A VHDL process is always either running or suspended. The list of signals in the
process definition is called the sensitivity list. All statements within a process
execute in sequential order until the process is suspended by a wait statement.

3.6. Verilog HDL:

Verilog, standardized as IEEE 1364, is a hardware description
language (HDL) used to model electronic systems. It is most commonly used in
the design and verification of digital circuits at the register-transfer level of
abstraction. It is also used in the verification of analog circuits and mixed-signal
circuits.

Hardware description languages such as Verilog differ from


software programming languages because they include ways of describing
propagation time and signal strengths (sensitivity). There are two types of
assignment operators: a blocking assignment (=) and a non-blocking (<=)
assignment. The non-blocking assignment allows designers to describe a state-
machine update without needing to declare and use temporary storage variables.
Since these concepts are part of Verilog’s language semantics, designers could
quickly write descriptions of large circuits in a relatively compact and concise
form. At the time of Verilog's introduction (1984), Verilog represented a
tremendous productivity improvement for circuit designers who were already
using graphical schematic capture software and specially written software
programs to document and simulate electronic circuits.

A Verilog design consists of a hierarchy of modules. Modules
encapsulate design hierarchy, and communicate with other modules through a set
of declared input, output, and bidirectional ports. Internally, a module can contain
any combination of the following: net/variable declarations (wire, reg, integer,
etc.), concurrent and sequential statement blocks, and instances of other modules
(sub-hierarchies). Sequential statements are placed inside a begin/end block and
executed in sequential order within the block. However, the blocks themselves are
executed concurrently, making Verilog a dataflow language.

Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating,
undefined") and signal strengths (strong, weak, etc.). This system allows abstract
modelling of shared signal lines, where multiple sources drive a common net.
When a wire has multiple drivers, the wire's (readable) value is resolved by a
function of the source drivers and their strengths.

3.6.1 Verilog – Modules:

The module is the basic unit of hierarchy in Verilog

 Modules describe:
 Boundaries [module, endmodule].
 Inputs and outputs [ports].
 How it works [behavioral or RTL code].
 Can be a single element or a collection of lower-level modules
 A module can describe a hierarchical design (a module of modules)
 A module should be contained within one file
 The module name should match the file name
 Module fadder resides in a file named fadder.sv
 Multiple modules can reside within one file (not recommended)
 Correctly partitioning a design into modules is critical.

3.6.2. Some Lexical Conventions – Comments:

 Comments are signified the same as in C
 One-line comments begin with "//"
 Multi-line comments start with /* and end with */

3.6.3. Some Lexical Conventions – Identifiers:

 Identifiers are names given to objects so that they may be referenced
 They start with alphabetic characters or an underscore
 They cannot start with a number or dollar sign
 All identifiers are case-sensitive

3.7 Verilog Vs. VHDL:


Verilog and VHDL are hardware description languages that are used to
write programs for electronic chips. These languages are used for electronic
devices that do not share a computer's basic architecture. VHDL is the older of
the two, and is based on Ada and Pascal, thus inheriting characteristics from both
languages. Verilog is relatively recent, and follows the coding methods of the C
programming language.

VHDL is a strongly typed language, and scripts that violate its typing rules
fail to compile. A strongly typed language like VHDL does not allow the
intermixing, or operation, of variables of different classes. Verilog uses
weak typing, which is the opposite of a strongly typed language. Another
difference is case sensitivity. Verilog is case-sensitive, and would not
recognize a variable if the case used is not consistent with what it was previously.
On the other hand, VHDL is not case-sensitive, and users can freely change the
case, as long as the characters in the name, and their order, stay the same. In general,
Verilog is easier to learn than VHDL. This is due, in part, to the popularity of the
C programming language, making most programmers familiar with the
conventions that are used in Verilog. VHDL is a little bit more difficult to learn
and program.

VHDL has the advantage of having a lot more constructs that aid in high-
level modeling and it reflects the actual operation of the device being
programmed. Complex data types and packages are very desirable when
programming big and complex systems that might have a lot of functional parts.
Verilog has no concept of packages, and all programming must be done with the
simple data types that are provided by the language.

Lastly, Verilog lacks the library management of software programming
languages. This means that Verilog will not allow programmers to put needed
modules in separate files that are called during compilation. Large projects in
Verilog might end up as one large, and difficult to trace, file.

3.8 Summary:
1. Verilog is based on C, while VHDL is based on Pascal and Ada.

2. Unlike Verilog, VHDL is strongly typed.

3. Unlike VHDL, Verilog is case sensitive.

4. Verilog is easier to learn compared to VHDL.

5. Verilog has very simple data types, while VHDL allows users to create
more complex data types.

6. Verilog lacks library management like that of VHDL.

The Xilinx ISE tools allow you to use schematics, hardware description
languages (HDLs), and specially designed modules in a number of ways. Schematics
are drawn by using symbols for components and lines for wires. The Xilinx tools are
a suite of software tools used for the design of digital circuits implemented using a
Xilinx Field Programmable Gate Array (FPGA) or Complex Programmable Logic
Device (CPLD).

The design procedure consists of (a) design entry, (b) synthesis and
implementation of the design, (c) functional simulation and (d) testing and
verification. Digital designs can be entered in various ways using the above CAD
tools: using a schematic, using a hardware description language (HDL) such as
Verilog or VHDL, or using a combination of both. In this lab we will only use the
design flow that involves the use of Verilog HDL.

The steps of the design procedure are listed below:

 Create Verilog design input file(s) using template driven editor.


 Compile and implement the Verilog design file(s).
 Create the test-vectors and simulate the design (functional
simulation) without using a PLD (FPGA or CPLD).
 Assign input/output pins to implement the design on a target
device. Download bit stream to an FPGA or CPLD device

A Verilog input file in the Xilinx software environment consists of the following
segments:

Header : module name, list of input and output ports.

Declarations : input and output ports, registers and wires.

Logic Descriptions : equations, state machines and logic functions.

End : endmodule

The Integrated Software Environment (ISE) is the Xilinx design software suite
that allows you to take your design from design entry through Xilinx device
programming.

The ISE project Navigator manages and Processes your design through the
following steps in the ISE design flow.

CHAPTER IV

EXISTING TECHNIQUE
4.1 BOOTH MULTIPLIER

Booth multiplication is a technique, described by A. D. Booth (1950), for
multiplying binary numbers of either sign with a uniform process. This
multiplication process does not depend on any foreknowledge of the signs of the
multiplicand or multiplier. While designing automatic computing machines, the
requirement is that some technique should be available for multiplying two numbers
whose signs are not necessarily positive. This is entirely too laborious to be
performed by a human operator, since a large number of processes exist, and only
some of them prove suitable for implementation with the type of circuits
currently available. The way to solve this problem is to
find the procedure which can be engineered with the minimal available resources.
Many methods have been adopted which are not very satisfactory, for
example,

1. The machine may use the absolute value of a number together with its sign. In
such a representation, it is effortless to perform multiplication and least complicated
to execute division, but the more frequently used operation of subtraction needs
additional circuitry.

2. Signed numbers may be represented in complementary form (mod 2).

Assume that the machine deals with negative numbers by taking their complements
mod 2; then

+m → m ………………………………(4.1.1)

−m → 2 − m …………………………….. (4.1.2)

Henceforth, when two numbers m and t are multiplied, the machine generates the
following results:

+m × +t → +mt …………………………. (4.1.3)

−m × +t → 2t − mt …………………………. (4.1.4)

+m × −t → 2m − mt ……………………… (4.1.5)

−m × −t → 4 − 2m − 2t + mt ………………. (4.1.6)

Equations (4.1.4) to (4.1.6) contain error terms to be dealt with. In order to correct
Equations (4.1.4) to (4.1.6), the following steps are followed:

1. If m is negative, a normal subtraction of 2t from the result is done;

2. If t is negative, subtract 2m from the product in the usual manner.

The application of both these corrections also gives the correct result if both m
and t are negative: both subtractions are in effect and, because the operations are all
mod 2, the added four is in any case ignored by the machine. A process for
the division of signed binary numbers was given by Booth et al. (1946). The
machine examines the signs of both m and t, and this requires careful
engineering of the control sequence and the storage of the signs of m and t in auxiliary
circuits, as described by A. D. Booth and K. H. V. Britten (1947). The
above-mentioned correction operations are highly undesirable. It would therefore be
preferable to have a process that performs multiplication in a uniform manner, without the
necessity of any special devices to examine the signs of the interacting numbers.
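The corrections above can be checked numerically. The sketch below is an illustrative Python model (not part of the original design) that uses the integer analogue of Booth's fractional mod-2 convention, i.e. ordinary two's complement mod 2^n:

```python
def corrected_product(m, t, n=8):
    """Integer analogue of equations (4.1.3)-(4.1.6): multiply the raw
    two's-complement representations (mod 2**n), then apply the
    sign-dependent corrections described in the text."""
    M = m % (1 << n)              # representation: 2**n - |m| when m < 0
    T = t % (1 << n)
    p = M * T                     # raw product, contains error terms
    if m < 0:
        p -= (1 << n) * T         # correction 1: subtract "2t" (here 2**n * T)
    if t < 0:
        p -= (1 << n) * M         # correction 2: subtract "2m" (here 2**n * M)
    # any remaining constant (the "added four") vanishes mod 2**(2n)
    return p % (1 << (2 * n))

# all four sign combinations give the true product mod 2**16
for m, t in [(3, 5), (-3, 5), (3, -5), (-3, -5)]:
    assert corrected_product(m, t) == (m * t) % (1 << 16)
```

Applying both corrections when both operands are negative cancels the remaining constant modulo 2^(2n), mirroring the "added four is ignored" remark above.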

Earlier, a complicated process of multiplication had been suggested by Rey
and Spencer (1950). A simple process was developed by A. D. Booth (1951) for
the multiplication of binary numbers, provided the multiplication starts with the least
significant digit, and is described as follows:

1. For multiplying two numbers m and t together, the nth digit (mn) of m has
to be examined.
2. If mn = 0, mn+1 = 0, the PPs are summed up. The sum is multiplied by
2^-1, i.e., each bit of the result is shifted to the right by one place.
3. If mn = 0, mn+1 = 1, add t into the existing sum of partial products and
multiply by 2^-1, i.e., shift all the bits simultaneously one place to the
right.
4. If mn = 1, mn+1 = 0, the addition of PPs is done. Then t is subtracted from
the sum. This intermediate result is multiplied by 2^-1, i.e., a movement
of one place for every bit to the right.
5. If mn = 1, mn+1 = 1, multiply the sum of partial products by 2^-1, i.e., shift
one place to the right.
6. Do not multiply by 2^-1 at m0 in the above processes.

If m is given to N digits, it is taken that mn+1 = 0 at the start of execution.
In describing this process, it is assumed that all operations are (mod 2).

Thus

m → m0, m1, m2, ….., mN (mn = 0, 1) ………(4.1.7)

= m0·2^0 + m1·2^-1 + m2·2^-2 + ….. + mn·2^-n + ….. + mN·2^-N ………(4.1.8)

Expanding the radix of Booth encoding further decreases the number of partial
products, which leads to less area and power dissipation (G. W. Bewick, 1994; B.
S. Cherkauer and E. G. Friedman, 1997; M. J. Flynn and S. F. Oberman, 2001).
Several investigations aim to design a multiplier with the capacity to
disable idle functional units, with the goal of minimizing
unnecessary energy consumption. S. Kuang and J. Wang (2010) propose a
power-efficient 16x16 Configurable Booth Multiplier (CBM) that supports single
16-bit, single 8-bit, or twin parallel 8-bit multiplication operations. A low-energy
hybrid radix-4/8 multiplier was proposed particularly for portable applications by
S. Kuang, et al. (2011).

Booth’s Algorithm

A: 1 1 1 1 1 1 1 1 1 1

X: 1 1 1 1 1 1 1 1 1 1

Number of bits: 10

Computation

A 1111111111 -1

X x1111111111 -1

Y 0 0 0 0 0 0 0 0 0 -1 recoded multiplier
---------------------------------------------------------------------
Add –A + 0000000001

Shift 00000000001

Shift only 000000000001

Shift only 0000000000001

Shift only 00000000000001

Shift only 000000000000001

Shift only 0000000000000001

Shift only 00000000000000001

Shift only 000000000000000001

Shift only 0000000000000000001

Shift only 00000000000000000001
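The recoding rules and the trace above can be modelled in a few lines of Python (an illustrative sketch; the text's procedure shifts the accumulator right by 2^-1 each step, while this model equivalently shifts the addend left):

```python
def booth_multiply(m, t, n):
    """Radix-2 Booth multiplication of n-bit two's-complement integers.
    Examine multiplier bit pairs (t_i, t_{i-1}), with t_{-1} = 0:
    01 -> add the multiplicand, 10 -> subtract it, 00/11 -> shift only."""
    acc, prev = 0, 0
    for i in range(n):
        bit = (t >> i) & 1
        if (bit, prev) == (0, 1):    # end of a run of 1s: add m
            acc += m << i
        elif (bit, prev) == (1, 0):  # start of a run of 1s: subtract m
            acc -= m << i
        prev = bit
    return acc

assert booth_multiply(-1, -1, 10) == 1   # the (-1) x (-1) trace above
assert booth_multiply(-3, 5, 4) == -15
```

In the worked trace the recoded multiplier has a single nonzero digit (−1 at the LSB), so only one subtraction of A is performed, followed by shifts, yielding +1.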

4.2 MODIFIED BOOTH MULTIPLIER

Multipliers play an important role in today’s digital signal processing and


various other applications. With advances in technology, many researchers have
tried and are trying to design multipliers which offer either of the following design
targets – high speed, low power consumption, regularity of layout and hence less
area or even combination of them in one multiplier thus making them suitable for
various high speed, low power and compact VLSI implementations. The common
multiplication method is the "add and shift" algorithm. In parallel multipliers, the
number of partial products to be added is the main parameter that determines the
performance of the multiplier. To reduce the number of partial products to be added,
the Modified Booth algorithm is one of the most popular algorithms.

To achieve speed improvements, the Wallace tree algorithm can be used to
reduce the number of sequential adding stages. Further, by combining both the Modified
Booth algorithm and the Wallace tree technique, we can obtain the advantages of both
algorithms in one multiplier. However, with increasing parallelism, the number of
shifts between the partial products and intermediate sums to be added will increase,
which may result in reduced speed, an increase in silicon area due to the irregularity of
the structure, and also increased power consumption due to the increase in interconnect
resulting from complex routing. On the other hand, "serial-parallel" multipliers
compromise speed to achieve better performance in area and power consumption.
The selection of a parallel or serial multiplier actually depends on the nature of the
application.

The main objectives of the Booth multiplier are to perform a high-speed operation
with low power consumption. The Booth multiplier combines repeated
addition and shifting arithmetic, which mainly takes place with the help of the encoder and
the partial product generating unit. As the speed and the power consumption of a
multiplier depend upon the partial products, in the Booth multiplier the number of partial
products is reduced to half, which results in an increase in speed and a reduction in power
consumption. The maximum delay can be determined with the help of the sum
of the total delays occurring in the partial product unit. There are mainly three sections
in a Booth multiplier: the Booth encoder, the partial product generating unit and the
adder circuit.

Figure 4.2: Booth Multiplier block diagram

The multiplier is one of the most widely used arithmetic datapath operations in
modern digital design. In state-of-the-art digital signal processing and graphics
applications, multiplication is an important and computationally intensive
operation. The multiplication operation is certainly present in many parts of a digital
system or digital computer, most notably in signal processing, graphics and
scientific computation. The Booth algorithm is a crucial improvement in the design of
signed binary multiplication.

It is a powerful algorithm for signed-number multiplication, which treats
both positive and negative numbers uniformly. For the standard add-shift operation,
each multiplier bit generates one multiple of the multiplicand to be added to the
partial product. If the multiplier is very large, then a large number of multiplicands
have to be added. In this case the delay of the multiplier is determined mainly by the
number of additions to be performed. If there is a way to reduce the number of
additions, the performance will get better.

The block diagram consists of the following sections.

4.2.1. Booth Encoder:

The Booth encoder recodes the input bits so that the result contains more zeros.
This is done by converting the 2's-complement value into a signed-digit
representation. The Booth encoder is a combination of XOR and NAND gates. The
input taken is the multiplier value, and the result is obtained according to the Booth
recoding table. The encoder consists of XOR gates, NAND gates and an inverter
circuit.
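For the radix-4 (modified Booth) case, the recoding can be sketched as follows (an illustrative Python model of the encoding table, not the gate-level encoder itself): digits take values in {-2, -1, 0, +1, +2}, halving the number of partial products.

```python
def radix4_recode(t, n):
    """Recode an n-bit two's-complement multiplier (n even) into
    radix-4 Booth digits d_j = -2*b(2j+1) + b(2j) + b(2j-1)."""
    digits, prev = [], 0
    for i in range(0, n, 2):
        b0 = (t >> i) & 1          # bit 2j of the multiplier
        b1 = (t >> (i + 1)) & 1    # bit 2j+1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1                  # overlap bit for the next digit
    return digits

def recoded_product(m, t, n):
    # one partial product m*d per digit, weighted by 4**j
    return sum(m * d * 4 ** j for j, d in enumerate(radix4_recode(t, n)))

assert radix4_recode(5, 4) == [1, 1]      # 0101 recodes to +1, +1
assert recoded_product(-3, 5, 4) == -15
assert recoded_product(7, -6, 8) == -42
```

Because each digit covers two multiplier bits, an n-bit multiplier yields only n/2 partial products, which is the speed and power advantage claimed above.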

4.2.2 Multiplication Process in Encoder:

Hardware multiplication is performed in the same way as multiplication
done by hand: in the first step the partial products are computed, then they are
shifted appropriately and summed.

Normal multiplication Process:

The simplest multiplication operation is to directly calculate the product of two numbers by hand.

This procedure can be divided into three steps:

1. Partial product generation

2. Partial product reduction

3. Addition.

Let us calculate the 2's-complement product of the two numbers 1101 (−3) and
0101 (5). When computing the product of the two binary numbers we get the result:

1 1 0 1 Multiplicand

x 0 1 0 1 Multiplier

------------------------

1 1 1 1 1 1 0 1 PP1

0 0 0 0 0 0 0 PP2

1 1 1 1 0 1 PP3

+0 0 0 0 0 PP4

1 1 1 1 1 0 0 0 1 = −15 Product

The leftmost (carry-out) bit is discarded. From the above, 1101 is the multiplicand
and 0101 is the multiplier. The intermediate products are the partial products. The final
result is the product (−15). When this method is implemented in hardware, the operation is
to take one bit of the multiplier at a time, from right to left, multiplying the
multiplicand by that single bit and shifting the intermediate product
one position to the left of the earlier intermediate products. All the bits of the partial
products in each column are added to obtain two bits: sum and carry. Finally, the
sum and carry bits in each column have to be summed. The two rows before the
product are called the sum and carry bits.
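The three steps, and the −3 × 5 example above, can be traced with a short script (illustrative Python; the multiplier is assumed non-negative here, as in the example):

```python
def shift_add(m, t, n):
    """Textbook shift-add multiplication: one partial product per
    multiplier bit, sign-extended to 2n bits, shifted and summed.
    Bits carried beyond 2n are discarded, as in the example."""
    width = 2 * n
    mask = (1 << width) - 1
    pps = [((m << i) & mask) if (t >> i) & 1 else 0 for i in range(n)]
    total = sum(pps) & mask               # discard the carry-out bit(s)
    # interpret the 2n-bit result as a signed value
    return (total - (1 << width) if total >> (width - 1) else total), pps

product, pps = shift_add(-3, 5, 4)
assert product == -15
# PP1..PP4 match the rows of the worked example (8-bit, sign-extended)
assert [format(p, "08b") for p in pps] == [
    "11111101", "00000000", "11110100", "00000000"]
```

The masking step plays the role of "discard this bit" in the hand computation: the sum 111110001 loses its ninth bit and the remaining 11110001 reads as −15.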

Multiplication is a most commonly used operation in many computing
systems. In fact, multiplication is nothing but addition, since adding the multiplicand
to itself multiplier-number-of-times gives the product of the multiplier and
multiplicand.

Advantage:

In this method the partial product circuit is simple and easy to implement.
Therefore, it is suitable for the implementation of small multipliers.

4.2.3. Partial Product Generating Unit

In this block the multiplicand bit is multiplied with the output of the encoder
unit with the help of NAND gates, as the NAND gate combines the two input values
to form a partial product output. Here the encoded value is converted to
single-bit partial products, which are later provided to the adder circuit to
generate the output of the Booth multiplier.

4.2.4. Adder:

The adder performs the addition operation. Addition is one of the most
commonly used mathematical operations in microprocessors, digital signal processors, etc. It
can also be used as a building block for the synthesis of all other arithmetic
operations. Therefore, as far as the efficient implementation of an
arithmetic unit is concerned, the binary adder structure becomes a
very essential hardware unit.

4.3 PIPELINED MULTIPLIER

It is readily observed that the data flow is vertical from one row to the next,
and there are no horizontal connections between the cells on a row except for the last
row, where the carry signal has to propagate through all cells. Therefore, ignoring
the last row for the moment, by placing a register on the outputs of each individual
cell, one can achieve a pipelined architecture where the stage delay is equal to the
delay of a 1-bit full adder plus a register. This, however, is not the case because of
the horizontal data flow in the last row of the array. One solution is to use a carry-
lookahead adder to replace the last row. However, the carry-lookahead adder does
not offer a structural regularity compatible with the rest of the array. Besides, as the
word length of the multiplier increases, the delay through the carry-lookahead adder
increases, making it the dominant stage of the pipelined architecture. References
[7] and [8] describe two designs employing only one level of pipelining (two stages)
by placing registers before the last row of the CSA array.

An alternative approach is to perform the addition in the last row of the array
in a bit-by-bit style using the array of half-adders and registers shown in Fig. 2. In
this architecture, starting with the least significant bit, each bit of the result is
determined in one stage of the array. As can be seen, the flow of data is always
horizontal in Fig. 2 and one can convert each column of the array into a horizontal
pipeline stage whose delay is less than that of a l-bit full adder.

Figure 4.3.1 General architecture of a parallel array multiplier.

Figure 4.3.2 Conversion of the last row of above figure to a pipelined architecture

The multiplier described here uses the combination of the two arrays shown in
Figures 4.3.1 and 4.3.2 to yield an array multiplier that is fully pipelined down
to the bit level. For an N x N multiplier, there are 2N pipeline stages in the
architecture. A simple way of decreasing this number to 3N/2 while keeping the
same throughput is discussed. Obviously, different levels of pipelining can be
achieved by combining different numbers of stages of the fully pipelined array
into one stage.

Notice that instead of the half-adder array shown in Figure 4.3.2, one may use a
pipelined implementation of a ripple-carry adder. This results in an array whose
diagonal cells are 1-bit full adders; the rest of the cells in the array are plain
registers. This implementation, like the half-adder array, adds N stages to the
pipeline for an N x N multiplier. It uses almost 20 percent fewer transistors than
the half-adder array; however, because of the pipelining and the clock
distribution, the area consumed by this implementation turns out to be almost the
same as that of the half-adder array. The advantage of the half-adder array is
that the delay of each stage is half the delay of a full-adder stage, meaning that
by combining two stages together we can save N/2 pipeline stages without loss of
throughput. Therefore, although the half-adder array uses slightly more transistors
than the full-adder array, it offers two advantages: 1) reduction of the number of
pipeline stages from 2N to 3N/2; and 2) reduction in area because of the savings
of N/2 pipeline stages and the fact that the corresponding registers, clock
routing, and clock buffers are no longer present.

As shown in Figure 4.3.1, each stage of the parallel array should receive some
partial product inputs. In a nonpipelined array, the partial products are all
generated at the same time and remain present in the array until the multiplication
is done. In a pipelined array, however, there is a new set of partial products
every clock cycle. These partial products are not all used at the same time. For
example, the partial product word for the third pipeline stage should be ready
three clock cycles after it has been generated. This results in a skewing of the
partial product inputs: the inputs to the last stage correspond to the inputs of
the last stage of the array in Figure 4.3.1, while the half-adder array of Figure
4.3.2 does not use any of the partial products. In the actual design, instead of
generating the partial products and then skewing them, the input data bits are
skewed first and then ANDed to produce the partial products. This results in a
50-percent savings in the number of registers required for partial product skewing.
The complete structure of a fully pipelined 8 x 8 multiplier built with this
architecture is composed of registered full adders, registered half adders,
registered AND gates, and plain registers.

CHAPTER V

PROPOSED MULTIPLIER

5.1. Design Implementation:

By implementing the above design on paper, it was found that the overflow bit is
not required; the overflow bit shifts into the product register. To implement the
32-bit register, two product registers are initialized, preg1 and preg2. Preg1 has
the multiplier in the least significant 32 bit positions and zeros in the most
significant 32 bits. Preg2 has the multiplicand in the most significant 32 bit
positions and zeros in the least significant 32 bits. If the least significant bit
of the multiplier product register, preg1, is '1', the multiplicand product
register, preg2, is added to the multiplier product register, and the result stored
in the multiplier product register is shifted right by one bit. If the least
significant bit of the multiplier product register is '0', the bits in the
multiplier product register are shifted right by one bit without adding the
multiplicand product register. This is done 32 times; the result in the multiplier
product register after 32 clock cycles is the final product.

Figure 5.1: Proposed multiplier Architecture

Multiplication is the process of adding an integer to itself a specified number of
times: a number (the multiplicand) is added to itself as many times as specified
by another number (the multiplier) to form the result. The multiplication process
has three main steps: 1. partial product generation; 2. partial product reduction;
3. final addition. For the multiplication of an n-bit multiplicand with an m-bit
multiplier, m partial products are generated and the product formed is n + m bits
long.
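The width claim is easy to verify: the largest n-bit and m-bit unsigned operands give (2^n - 1)(2^m - 1) = 2^(n+m) - 2^n - 2^m + 1, which is always less than 2^(n+m). A quick exhaustive check in Python:

```python
# The product of the largest n-bit and m-bit unsigned values always
# fits in n + m bits, so an n x m multiplier needs an (n + m)-bit output.
for n in range(1, 17):
    for m in range(1, 17):
        assert (2**n - 1) * (2**m - 1) < 2**(n + m)
```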

5.1.1 Partial Product Generation (PPG):

The process of partial product generation can be classified into two types:

i. Simple PPG

In this method, the partial products are generated by multiplying each bit of
the multiplier with the multiplicand using a logical AND gate.

ii. PPG using Radix-4 Modified Booth Recoding

Partial products are generated with radix-4 modified Booth recoding. The speed of
a multiplier can be improved by reducing the number of generated partial products.
Using Booth recoding, only half the number of partial products is generated
compared with simple PPG, which reduces both the area occupied by the hardware and
the time required for execution. O. L. MacSorley proposed the Modified Booth
Algorithm (MBA) in 1961 as a powerful algorithm for the multiplication of signed
numbers, treating both positive and negative numbers uniformly [3], [4].

Each encoded group of multiplier bits selects a multiple of the multiplicand. The
three multiplier bits [Yi+1, Yi, Yi-1] are encoded into one of {-2X, -X, 0, +X,
+2X}: shifting the multiplicand one bit to the left and taking the two's complement
gives -2X; the two's complement of the multiplicand gives -X; the multiplicand
itself is +X; a one-bit left shift of the multiplicand gives +2X; otherwise the
partial product is zero.

The steps of the radix-4 Booth algorithm are as follows:

a. Append '0' to the right of the LSB of the multiplier.
b. Group the multiplier bits in blocks of three, with one bit overlapping, starting
from the LSB.
c. If the multiplier has an odd number of bits, add an extra bit to the left of
the MSB.
d. Examine each block of the multiplier and generate the partial product according
to the radix-4 recoding described above.
e. Each new partial product is added to the previous partial product shifted two
bits to the left, while the multiplier bits are shifted two bits to the right.
Initially the partial product is zero.
f. The partial product is then sign extended.
g. The above operations are repeated n/2 times.
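Steps a-d above can be sketched as a short Python model (an illustration we added; the helper names are ours, and the recoded digits are checked by reconstructing the multiplier's signed value):

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement multiplier into radix-4 Booth
    digits in {-2, -1, 0, +1, +2}: append a '0' to the right of the LSB,
    then examine overlapping 3-bit groups [Yi+1, Yi, Yi-1]."""
    y &= (1 << n) - 1
    padded = y << 1                       # step a: append 0 right of the LSB
    digits = []
    for i in range(0, n, 2):              # steps b, g: n/2 overlapping groups
        group = (padded >> i) & 0b111     # bits [i+2, i+1, i]
        # radix-4 recoding: digit value = -2*b2 + b1 + b0
        digits.append(-2 * (group >> 2) + ((group >> 1) & 1) + (group & 1))
    return digits

def booth_value(digits):
    """Each digit carries a weight of 4**i (the two-bit left shifts, step e)."""
    return sum(d * 4**i for i, d in enumerate(digits))

assert booth_value(booth_radix4_digits(107, 8)) == 107
assert booth_value(booth_radix4_digits(-5, 8)) == -5   # signed values recode too
```

Multiplying each digit by the multiplicand then yields the n/2 partial products, half as many as simple PPG.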

5.2 Partial Product Reduction:

Multipliers consume a large amount of power and incur significant delay during the
addition of partial products. At this stage, most multipliers are designed with
multi-operand adders that can add more than two input operands and produce two
outputs, sum and carry. The Wallace tree method is used in high-speed designs to
add the partial products; it reduces the number of sequential addition stages and
thereby improves speed. The Wallace tree used here is made up of several
compressors that take three or more inputs and produce two outputs of the same
dimension as the inputs. The speed, area, and power consumption of the multiplier
are directly proportional to the efficiency of the compressors. There are various
types of compressors, namely 3:2, 4:2, 5:2, and so on. A Wallace tree with 4:2
compressors is considered here.

i. 3:2 Compressor

A 3:2 compressor takes three inputs X1, X2, X3 and generates two outputs, sum and
carry. The compressor is governed by the basic equation

X1 + X2 + X3 = Sum + 2 * Carry

The 3:2 compressor can also be employed as a full adder cell when the third input
is considered the carry input from the previous compressor block, i.e., X3 = Cin.
The logical expressions for sum and carry are:

Sum = (X1 xor X2) xor X3
Carry = (X1 and X2) or (X3 and (X1 xor X2))
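The governing equation can be checked exhaustively with a tiny Python model of the cell (illustrative only; the function name is ours):

```python
def compressor_3_2(x1, x2, x3):
    """3:2 compressor (full-adder cell): X1 + X2 + X3 = Sum + 2*Carry."""
    s = x1 ^ x2 ^ x3
    carry = (x1 & x2) | (x3 & (x1 ^ x2))
    return s, carry

# Exhaustive check of the governing equation over all single-bit inputs.
for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            s, c = compressor_3_2(x1, x2, x3)
            assert x1 + x2 + x3 == s + 2 * c
```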

ii. 4:2 Compressor

In the so-called 4:2 compressor, four input numbers are compressed into two. The
4:2 compressors have been employed in high-speed multipliers to lower the latency
of the accumulation stage.
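One common way to realize a 4:2 compressor, sketched here in Python for illustration, is to cascade two 3:2 cells; the carry-out of the first stage feeds the neighboring compressor, so the governing relation is X1 + X2 + X3 + X4 + Cin = Sum + 2*(Carry + Cout). The document does not fix the internal structure, so treat this as one possible construction.

```python
def compressor_3_2(x1, x2, x3):
    """Full-adder cell: x1 + x2 + x3 = sum + 2*carry."""
    return x1 ^ x2 ^ x3, (x1 & x2) | (x3 & (x1 ^ x2))

def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor built from two cascaded 3:2 cells. cout depends
    only on x1..x3, so a row of these cells has no long carry chain."""
    s1, cout = compressor_3_2(x1, x2, x3)   # first stage
    s, carry = compressor_3_2(s1, x4, cin)  # second stage
    return s, carry, cout

# Exhaustive check of the governing relation over all 32 input patterns.
for bits in range(32):
    x1, x2, x3, x4, cin = (bits >> k & 1 for k in range(5))
    s, carry, cout = compressor_4_2(x1, x2, x3, x4, cin)
    assert x1 + x2 + x3 + x4 + cin == s + 2 * (carry + cout)
```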

CHAPTER VI
SIMULATION RESULTS
6.1 Booth Multiplier:

Figure 6.1.1: Schematic Diagram of 64 Bit- Booth Multiplier

Figure 6.1.2: RTL Schematic Diagram of 64 Bit- Booth Multiplier

Figure 6.1.3: Technology Schematic Diagram of 64 Bit- Booth Multiplier

Figure 6.1.4: Output Waveform of 64 Bit- Booth Multiplier

6.2 Modified Booth Multiplier:

Figure 6.2.1: Schematic Diagram of 64 Bit- Modified Booth Multiplier

Figure 6.2.2: RTL Schematic Diagram of 64 Bit- Modified Booth Multiplier

Figure 6.2.3: Technology Schematic Diagram of 64 Bit- Modified Booth Multiplier

Figure 6.2.4: Output Waveform of 64 Bit- Modified Booth Multiplier

6.3 Pipelined Multiplier:

Figure 6.3.1: Schematic Diagram of 64 Bit- Pipelined Multiplier

Figure 6.3.2: RTL Schematic Diagram of 64 Bit- Pipelined Multiplier

Figure 6.3.3: Technology Schematic Diagram of 64 Bit- Pipelined Multiplier

Figure 6.3.4: Output Waveform of 64 Bit- Pipelined Multiplier

6.4 PROPOSED MULTIPLIER:

Figure 6.4.1: Schematic Diagram of 64 Bit-Proposed Multiplier

Figure 6.4.2: RTL Schematic Diagram of 64 Bit-Proposed Multiplier

Figure 6.4.3: Technology Schematic Diagram of 64 Bit-Proposed Multiplier

Figure 6.4.4: Output Waveform of 64 Bit-Proposed Multiplier

6.5 COMPARISON RESULTS:

Table 6.1: Area Comparison

MULTIPLIER                   No. of Slices   No. of LUTs   No. of Bonded I/Os
Booth Multiplier             3785            6265          256
Modified Booth Multiplier    3565            6055          256
Pipelined Multiplier         645             1250          256
Proposed Multiplier          350             640           256

Table 6.2: Delay Comparison

MULTIPLIER                   Logic Delay   Routing Delay   Total Delay
Booth Multiplier             142.256 ns    115.568 ns      257.825 ns
Modified Booth Multiplier    125.602 ns    105.859 ns      231.461 ns
Pipelined Multiplier         25.576 ns     16.956 ns       43.532 ns
Proposed Multiplier          8.538 ns      6.224 ns        14.76 ns

CHAPTER VII

CONCLUSION & FUTURE SCOPE

7.1 CONCLUSION:

In this work, 64-bit Booth, Modified Booth, Pipelined, and Proposed multipliers
were designed. These different types of 64-bit multipliers, operating on signed
numbers, were simulated and synthesized using the Xilinx ISE 14.7 tool. Industry
today prefers low-power, high-speed multipliers because of their advantages over
other designs. From the synthesis results, the performance parameters area and
delay were noted, and the four multipliers were compared in terms of LUTs
(representing area) and delay values.

7.2 FUTURE SCOPE:

All the hardware implemented in this dissertation can be extended to fixed-point
and floating-point number systems. The NMBE and MMBE PPG circuits can be used to
generate the partial products. In fixed point, a larger number of bits must be
allocated for precise multiplication; for floating point, the IEEE single-precision
32-bit and double-precision 64-bit formats are available. The synthesis process
was carried out for the PPG, PPRT, and CPA separately. For simplicity, wiring and
load capacitances were not extracted; by synthesizing the whole multiplier circuit
and considering the effects of wiring and load capacitances, a real integrated-
circuit multiplier can be designed. By further splitting the pipeline into up to
seven stages, the pipelined multiplier could be operated with a 30 GHz synchronous
clock signal: since the delay of the NMBE or MMBE is very small while the CLCSA
has the maximum delay and the VCA is close to it, dividing the CLCSA and VCA stages
would allow the pipeline clock frequency to be increased to about 30 GHz.

REFERENCES

[1] S. S. Sinthura, "Implementation and analysis of different 32 bit multipliers
on aspects of power, speed and area," 2nd International Conference on Trends in
Electronics and Informatics (ICOEI), 2018.

[2] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE
Trans. Computers, vol. C-31, pp. 260-264, March 1982.

[3] P. Kogge and H. Stone, "A parallel algorithm for the efficient solution of a
general class of recurrence equations," IEEE Trans. Computers, vol. C-22,
pp. 786-793, Aug. 1973.

[4] R. Zimmermann, "Non-heuristic optimization and synthesis of parallel-prefix
adders," in International Workshop on Logic and Architecture Synthesis,
December 1996, pp. 123-132.

[5] C. Nagendra, M. J. Irwin, and R. M. Owens, "Area-time-power tradeoffs in
parallel adders," IEEE Trans. Circuits Syst. II, vol. 43, pp. 689-702, Oct. 1996.

[6] R. Ladner and M. Fischer, "Parallel prefix computation," Journal of the ACM,
vol. 27, pp. 831-838, October 1980.

[7] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and Their
Synthesis. Hartung-Gorre, 1998.

[8] Y. Choi, "Parallel prefix adder design," Proc. 17th IEEE Symposium on Computer
Arithmetic, pp. 90-98, June 2005.

[9] D. Harris, "A taxonomy of parallel prefix networks," in Conference Record of
the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 2,
Nov. 2003, p. 2217.

[10] N. H. E. Weste and D. Harris, CMOS VLSI Design, 4th ed., Pearson
Addison-Wesley, 2011.

[11] H. Ling, "High-speed binary adder," IBM Journal of Research and Development,
vol. 25, no. 3, p. 156, March 1981.

[12] K. Vitoroulis and A. J. Al-Khalili, "Performance of parallel prefix adders
implemented with FPGA technology," IEEE Northeast Workshop on Circuits and
Systems, pp. 498-501, Aug. 2007.

[13] D. H. K. Hoe, C. Martinez, and J. Vundavalli, "Design and characterization
of parallel prefix adders using FPGAs," IEEE 43rd Southeastern Symposium on System
Theory, pp. 170-174, March 2011.

[14] T. Matsunaga, S. Kimura, and Y. Matsunaga, "Power-conscious syntheses of
parallel prefix adders under bitwise timing constraints," Proc. Workshop on
Synthesis and System Integration of Mixed Information Technologies (SASIMI),
Sapporo, Japan, October 2007, pp. 7-14.

[15] F. E. Fich, "New bounds for parallel prefix circuits," in Proc. 15th Annu.
ACM Symposium on Theory of Computing, 1983, pp. 100-109.

[16] D. Gizopoulos, M. Psarakis, A. Paschalis, and Y. Zorian, "Easily testable
cellular carry lookahead adders," Journal of Electronic Testing: Theory and
Applications, vol. 19, pp. 285-298, 2003.

[17] Y. Choi, "Parallel prefix adder design," Proc. 17th IEEE Symposium on
Computer Arithmetic, pp. 90-98, June 2005.
APPENDIX

Source Code:

64 Bit Booth Multiplier:

module boothmul(X, Y, Z, en);
  input signed [63:0] X, Y;
  input en;
  output signed [127:0] Z;
  reg signed [127:0] Z;
  reg [1:0] temp;
  integer i;
  reg E1;                      // previous multiplier bit (Booth bit)
  reg [63:0] Y1;               // negated multiplicand

  always @ (X, Y, en)
  begin
    Z = 128'd0;
    E1 = 1'd0;
    Y1 = -Y;
    Z[63:0] = X;               // multiplier loaded into the low half
    for (i = 0; i < 64; i = i + 1)
    begin
      temp = {X[i], E1};       // examine the current Booth bit pair
      case (temp)
        2'd2 : Z[127:64] = Z[127:64] + Y1;  // pair 10: subtract Y
        2'd1 : Z[127:64] = Z[127:64] + Y;   // pair 01: add Y
        default : begin end                 // pairs 00 and 11: no operation
      endcase
      Z = Z >> 1;
      Z[127] = Z[126];         // arithmetic shift: replicate the sign bit
      E1 = X[i];
    end
  end
endmodule

64 Bit Modified Booth Multiplier:

module modified_booth_multiplier(x, y, o);
  input [63:0] x;
  input [63:0] y;
  output [127:0] o;
  reg [127:0] o;
  integer i;
  reg [128:0] a;               // {accumulator, multiplier, appended 0}
  reg [63:0] s;                // multiplicand
  reg [63:0] p;

  always @ (x or y)
  begin
    a = 129'd0;
    s = y;
    a[64:1] = x;               // multiplier, with a '0' appended at a[0]
    for (i = 0; i <= 63; i = i + 1)
    begin
      if ((a[1] == 1'b1) & (a[0] == 1'b0))
      begin                    // bit pair 10: subtract the multiplicand
        p = a[128:63];
        a[128:63] = (p - s);
      end
      else if ((a[1] == 1'b0) & (a[0] == 1'b1))
      begin                    // bit pair 01: add the multiplicand
        p = a[128:63];
        a[128:63] = (p + s);
      end
      a[127:0] = a[128:1];     // shift right by one bit
    end
    o[127:0] <= a[128:1];
  end
endmodule

64 Bit Pipelined Multiplier:

module Pipelined_multiplier(start, clock, clear, binput, qinput, carry,
                            acc, qreg, preg);
  input start, clock, clear;
  input [63:0] binput, qinput;
  output carry;
  output [63:0] acc, qreg;
  output [5:0] preg;

  // system registers
  reg carry;
  reg [63:0] acc, qreg, b;
  reg [5:0] preg;                       // iteration counter
  reg [1:0] prstate, nxstate;
  parameter t0 = 2'b00, t1 = 2'b01, t2 = 2'b10, t3 = 2'b11;
  wire z;
  assign z = ~|preg;                    // z = 1 when the counter reaches zero

  // state register
  always @ (negedge clock or negedge clear)
    if (~clear) prstate = t0;
    else prstate = nxstate;

  // next-state logic
  always @ (start or z or prstate)
    case (prstate)
      t0: if (start) nxstate = t1; else nxstate = t0;
      t1: nxstate = t2;
      t2: nxstate = t3;
      t3: if (z) nxstate = t0;
          else nxstate = t2;
    endcase

  // datapath
  always @ (negedge clock)
    case (prstate)
      t0: b <= binput;                  // load the multiplicand
      t1: begin                         // initialize
            acc <= 64'd0;
            carry <= 1'b0;
            preg <= 6'b100000;          // iteration count
            qreg <= qinput;             // load the multiplier
          end
      t2: begin                         // conditional add
            preg <= preg - 6'b000001;
            if (qreg[0])
              {carry, acc} <= acc + b;
          end
      t3: begin                         // shift {acc, qreg} right one bit
            carry <= 1'b0;
            acc <= {carry, acc[63:1]};
            qreg <= {acc[0], qreg[63:1]};
          end
    endcase
endmodule
