Final Documentation
CHAPTER I
INTRODUCTION
1.1 INTRODUCTION
In digital signal processing (DSP), the important operations are filtering, inner
product and spectral analysis. Many of these operations, such as filtering and inner
product, are performed with the help of multiplication, so the multiplier plays a crucial role
in any DSP system. Multiplication is essentially repeated addition. Various types of
low-power digital multipliers with high clock frequencies are available.
They play a wide role in today's digital image processing and are at the heart of today's
mobile communication systems. Recently, because of the increasing demand for battery-powered,
high-speed electronic devices, power consumption has become a very
serious and important factor in VLSI chips, owing to the increase in non-linear
effects. Power consumption also affects the battery life of a device, because the
output current of a MOS source-coupled multiplier in a differential pair depends
upon the non-linearity of the bias current (Iss) and the input signal. Various
techniques are applied internally and externally in the multiplier to reduce its power
consumption. The advantage of the GDI technique over static CMOS is that it uses
fewer transistors, and hence reduces the area and interconnect.
Arithmetic circuits such as multipliers and adders are among the basic
components in the design of any communication circuit. Digital
multipliers are therefore among the most commonly used blocks in digital circuit design.
They are fast, reliable and efficient components utilized to implement arithmetic
operations. The power dissipated in a multiplier is an important issue, as it
contributes significantly to the total power dissipated by the circuit and hence affects the
performance of the device. Most digital signal processing (DSP) systems incorporate a
multiplication unit to implement algorithms such as correlation, convolution,
filtering and frequency analysis. Multipliers are key components of many high-performance
systems such as FIR filters, microprocessors and DSP processors.
During the desktop PC design era, VLSI design efforts focused
primarily on optimizing speed to realize computationally intensive real-time
functions such as video compression, gaming and graphics. As a result, we have
semiconductor ICs integrating various complex signal processing modules and
graphical processing units to meet our computation and entertainment demands.
While these solutions have addressed the real-time problem, they have not
addressed the increasing demand for portable operation, where a mobile phone needs
to pack all of this functionality without consuming much power. The strict limitation on power
dissipation in portable electronics applications such as smartphones and tablet
computers must be met by the VLSI chip designer while still meeting the
computational requirements. As wireless devices rapidly make their way
into the consumer electronics market, a key design constraint for portable operation,
namely the total power consumption of the device, must be addressed. Reducing the
total power consumption in such systems is important, since it is desirable to
maximize the run time with minimum requirements on the size, weight and capacity
allocated to batteries. So the most important factor to consider while designing an SoC
for portable devices is low-power design.
inherently higher dynamic and leakage current density, with minimal improvement
in speed. Between 90 nm and 65 nm, the dynamic power dissipation is almost the same,
whereas the leakage per mm² is about 5% higher. The drive for low cost continues to push
higher levels of integration, whereas low-cost technological breakthroughs that keep
power under control are becoming very scarce.
The power dissipation in a circuit can be classified into three categories, as described
below.
Dynamic power consumption:
This is the power dissipated in charging and discharging the circuit capacitances as
logic gates change state. In CMOS logic, the P-branch and N-branch are also momentarily
shorted as a logic gate changes state, resulting in an additional short-circuit power dissipation.
Leakage current:
This is the power dissipated when the system is in standby mode or not
switching. There are many sources of leakage current in a MOSFET: diode leakage
around transistors and n-wells, subthreshold leakage, gate leakage, tunnelling
currents, etc. Leakage has been increasing by roughly 20 times with each new fabrication
technology, so contributions that were once insignificant are becoming dominant factors.
VLSI circuit design for low power:
power supply voltage with a corresponding scaling of the threshold voltages, in order
to compensate for the speed degradation.
Influence of voltage scaling on power and delay:
Although the reduction of the power supply voltage significantly reduces the
dynamic power dissipation, the inevitable design trade-off is an increase in delay.
However, this interpretation assumes that the switching frequency (i.e., the number
of switching events per unit time) remains constant.
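As a rough numerical illustration of this trade-off, the first-order relations P_dyn = alpha * C * VDD^2 * f and t_d proportional to VDD/(VDD - VT)^2 can be sketched in Python. The load capacitance, clock frequency and activity factor below are illustrative assumptions, not figures from this work:

```python
# First-order CMOS scaling relations (illustrative values only):
# dynamic power  P = alpha * C * Vdd^2 * f
# gate delay     t_d ~ Vdd / (Vdd - Vt)^2   (alpha-power law, alpha = 2)

def dynamic_power(c_load, vdd, freq, activity=0.5):
    """Average dynamic (switching) power in watts."""
    return activity * c_load * vdd**2 * freq

def relative_delay(vdd, vt):
    """Propagation delay, up to a technology-dependent constant."""
    return vdd / (vdd - vt)**2

# Halving Vdd from 2 V to 1 V quarters the dynamic power...
p_2v = dynamic_power(1e-12, 2.0, 100e6)
p_1v = dynamic_power(1e-12, 1.0, 100e6)
# ...but increases delay sharply, unless Vt is scaled down with it.
d_2v = relative_delay(2.0, 0.8)
d_1v_same_vt = relative_delay(1.0, 0.8)
d_1v_low_vt = relative_delay(1.0, 0.2)
```

The numbers confirm the qualitative argument: at VDD = 1 V with VT = 0.8 V the delay explodes, while lowering VT to 0.2 V brings it back near the 2 V figure, at the cost of the sub-threshold leakage discussed below.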
Figure 1.1 Power and delay unit graph
The above figure shows the variation of the propagation delay of a CMOS
inverter as a function of the power supply voltage, for different threshold
voltage values. Reducing the threshold voltage from 0.8 V to 0.2 V can improve
the delay at VDD = 2 V by a factor of 2. The positive influence of threshold voltage
reduction on propagation delay is especially pronounced at low power supply
voltages, for VDD < 2 V. It should be noted, however, that using low-VT
transistors raises significant concerns about noise margins and sub-threshold
conduction. Smaller threshold voltages lead to smaller noise margins for CMOS
logic gates. The sub-threshold conduction current also sets a severe limitation
on reducing the threshold voltage. For threshold voltages smaller than 0.2 V,
leakage due to sub-threshold conduction in stand-by, i.e., when the gate is not
switching, may become a very significant component of the overall power
consumption. In addition, the propagation delay becomes more sensitive to process-related
fluctuations of the threshold voltage. Techniques are therefore needed to
overcome the difficulties (such as leakage and high stand-by power dissipation)
associated with low-VT circuits.
1.2 DIGITAL SIGNAL PROCESSING
FIR blocks are among the most important blocks in the design of digital signal
processing (DSP) systems. They are widely used in industry and in digital systems such as
automotive electronics, mobile phones, the internet, laptops, computers, speech processing
and Bluetooth headsets. The requirements for designing an electronic system consist of two
major components: the first is technology-driven and the second is market-driven.
Regarding the technology-driven component, most industries nowadays are improving
their technology and devices towards greater complexity. This means more
functionality, higher density in order to place millions of transistors on a smaller die
area, increased performance and lower power dissipation. Due to market
demand, each new issue must be taken seriously and quick action has to be taken,
since missing the market window can be very costly.
design issue. The multiplication process can be divided into three steps, namely
generating the partial products, reducing the partial products, and a final addition to
obtain the product. The speed of multiplication can be improved by reducing
the number of generated partial products and/or by increasing the speed at which
these partial products are accumulated.
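The three steps can be sketched for unsigned operands as follows. This Python model only illustrates the flow; a real design would use a Booth encoder for step 1 and a carry-save compressor tree for step 2:

```python
def multiply_unsigned(m, t, bits=8):
    """Textbook three-step multiplication: generate partial products,
    reduce them, then perform the final addition."""
    assert 0 <= m < 2**bits and 0 <= t < 2**bits
    # Step 1: generate one shifted partial product per multiplier bit.
    partials = [(m if (t >> i) & 1 else 0) << i for i in range(bits)]
    # Step 2: reduce the partial products pairwise until two remain
    # (a stand-in for the levels of a carry-save compressor tree).
    while len(partials) > 2:
        a, b = partials.pop(), partials.pop()
        partials.append(a + b)
    # Step 3: final carry-propagate addition.
    return sum(partials)

assert multiply_unsigned(13, 11) == 143
```

Reducing the number of rows produced in step 1 (e.g., by Booth recoding) directly shortens the reduction in step 2, which is why partial-product count dominates multiplier performance.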
Fast integer multipliers are a key topic in the VLSI design of high-speed
microprocessors. Multiplication is one of the basic arithmetic operations. In fact
8.72% of all instructions in a typical scientific program are multiplies [1]. In
addition, multiplication is a long-latency operation. In typical processors,
multiplication takes between two and eight cycles [2]. Consequently, having high-
speed multipliers is critical for the performance of processors. Processor designers
have recognized this and have devoted considerable silicon area for the design of
multipliers [3]. Recent advances in integrated circuit fabrication technology have
resulted in both smaller feature sizes and increased die areas. Together, these factors
have provided the processor designer the ability to fully implement high-speed
floating-point multipliers in silicon. Most advanced digital systems today
incorporate a parallel multiplication unit to carry out high-speed mathematical
operations. In many situations, the multiplier lies directly in the critical-path,
resulting in an extremely high demand on its speed.
In the past, considerable efforts were put into designing multipliers with
higher speed and throughput, which resulted in fast multipliers which can operate
with delay time as low as 4.1 ns [4]. However, with the increasing importance of
the power issue due to the portability and reliability concerns of electronic devices
[5], recent work has started to look into circuit design techniques that will lower the
power dissipation of multipliers [6]. One such work describes the design and fabrication
of a 32×32-bit parallel multiplier, based on a 0.13 µm CMOS process, for low-power
applications, in which pass-transistor (PT) logic is chosen to implement most of the
logic functions within the multiplier.
To assess the current state of the art and the future challenges, the following papers
have been reviewed in the literature survey.
delay and finally reduces the delay by 8% compared with other parallel multipliers.
The proposed multiplier can multiply signed operands.
Wang, Shyh-Jye Jou and Chung-Len Lee [3] have proposed a well-structured
MBE multiplier architecture. In this paper, improved Booth encoder
logic and Booth selector logic are proposed to remove an extra partial
product row, as in paper [2]. The paper also proposes a sparse-tree
approach for the two's-complementation operation. The removal of an extra partial
product row and the sparse-tree design together reduce the
area and improve the speed of the signed multiplier.
area size is 1.04 mm × 1.27 mm at 2.5 V power supply, with reduction in total
number of transistor by 24%.
C. H. Chang, J.G, and M. Zhang [5] have proposed ultra-low-voltage
and low-power 4-2 and 5-2 compressors, implemented in CMOS logic, for fast
arithmetic circuits. They proposed a design of the 4-2 compressor using
exclusive-OR (XOR) logic gates at three levels, with a critical delay path of 3
units, and a 5-2 compressor with a critical delay path of 4 units. They also proposed
a new circuit with a pair of PMOS-NMOS transistors to eliminate the weak logic levels
of the XOR and exclusive-NOR (XNOR) logic modules. They claim that
the proposed XOR-XNOR module used for the implementation of the 4-2 and 5-2
compressors can operate at a supply voltage as low as 0.6 V.
Pouya Asadi and Keivan Navi [7] have proposed the design of a novel
high-speed 54×54-bit multiplier for signed numbers. They presented a self-timed
carry-lookahead adder in which the average computation time is
proportional to the logarithm of the logarithm of n. A novel 4-2 compressor using
PTL was developed and is claimed to be faster than conventional CMOS circuits
because the number of gate stages on the critical path is minimized. The proposed
multiplier has a delay of 3.4 ns at a 1.3 V power supply and is implemented using
42,579 transistors.
64-bit operands and fabricated in 90 nm CMOS technology; it consumes 260 mW at
a 1 V power supply.
CHAPTER II
2.1 DESIGN ENTRY
Design entry is the first step in the ISE design flow. During design entry,
you create your source files based on your design objectives. You can create your
top-level design file using a Hardware Description Language (HDL), such as
VHDL, Verilog, or ABEL, or using a schematic. You can use multiple formats for
the lower-level source files in your design.
2.2 SYNTHESIS
After design entry and optional simulation, you run synthesis. During this
step, VHDL, Verilog, or mixed-language designs become netlist files that are
accepted as input to the implementation step.
2.3 IMPLEMENTATION
After synthesis, you run design implementation, which converts the logical
design into a physical file format that can be downloaded to the selected target device.
From Project Navigator, you can run the implementation process in one step, or you
can run each of the implementation processes separately. Implementation processes vary
depending on whether you are targeting a Field Programmable Gate Array (FPGA)
or a Complex Programmable Logic Device (CPLD).
2.4 VERIFICATION
You can verify the functionality of your design at several points in the
design flow. You can use simulator software to verify the functionality and timing
of your design or a portion of your design. The simulator interprets VHDL or
Verilog code into circuit functionality and displays the logical results of the described
HDL to determine correct circuit operation. Simulation allows you to create and verify
complex functions in a relatively small amount of time. You can also run in-circuit
verification after programming your device.
2.5 DEVICE INSTALLATION:
2.6 ISE:
2.6.1 LANGUAGE SUPPORT:
Language          Support
VHDL              IEEE-STD-1076-2000
Verilog           IEEE-STD-1364-2001
VITAL             VITAL-2000
VHDL FLI/VHPI     No
Verilog PLI       No
SystemVerilog     No
2.6.2 FEATURE SUPPORT:
Feature           Support
Multi-threading   Yes
Now that you have a test bench in your project, you can perform
behavioural simulation on the design. The simulator is fully integrated with the
ISE software, which creates the work directory, compiles the source files,
loads the design, and performs simulation based on the simulation properties.
2.6.4 Locating the Simulation Processes:
The simulation processes in the ISE software enable you to run simulation
on the design.
Check Syntax: this process checks for syntax errors in the test bench.
Simulate Behavioural Model: this process starts the design simulation.
The ISE software allows you to set several simulator properties in addition to the
simulation netlist properties. To see the behavioural simulation properties, and to
modify the properties for this tutorial, do the following:
2.6.6 Performing Simulation:
After the process properties have been set, you are ready to run the simulator.
To start the behavioural simulation, double-click Simulate Behavioural
Model. ISE creates the work directory, compiles the source files, loads the design,
and performs simulation for the time specified.
The majority of the design runs at 100 Hz and would take a significant
amount of time to simulate.
CHAPTER III
Gone are the days when huge computers made of vacuum tubes sat
humming in entire dedicated rooms and could do about 360 multiplications of 10-digit
numbers in a second. Though they were heralded as the fastest computing
machines of their time, they surely don't stand a chance when compared to
modern-day machines. Modern-day computers are getting smaller, faster,
cheaper and more power-efficient every progressing second. But what drove this
change? The whole domain of computing ushered into a new dawn of electronic
miniaturization with the advent of the semiconductor transistor by Bardeen and
Brattain (1947-48) and then the bipolar transistor by Shockley (1949) at the Bell Laboratories.
Figure 3.1: A comparison: the first planar IC (1961) and an Intel Nehalem quad-core die
Since the invention of the first IC (integrated circuit) in the form of a flip-flop
by Jack Kilby in 1958, our ability to pack more and more transistors onto a single
chip has doubled roughly every 18 months, in accordance with Moore's Law.
Such exponential development has never been seen in any other field, and it still
continues to be a major area of research work.
3.2 History & Evolution of VLSI:
It was the time when the cost of research began to decline and private firms
started entering the competition, in contrast to the earlier years when the main
burden was borne by the military. Transistor-transistor logic (TTL), offering
higher integration densities, outlasted other IC families like ECL and became the
basis of the first integrated circuit revolution. It was the production of this family
that gave impetus to semiconductor giants like Texas Instruments, Fairchild and
National Semiconductor. The early seventies marked the growth of transistor counts
to about 1000 per chip, called Large-Scale Integration (LSI).
3.3. Introduction of HDL:
without ever having to build the physical circuit. There are at least two dimensions
to verification. In timing verification, you verify that the circuit operation, including
estimated delays, meets the setup, hold and other timing requirements of sequential
devices such as flip-flops. In functional verification, you study the circuit's logical
operation independent of timing considerations; gate delays and other timing
parameters are considered to be zero.
After the verification step, the synthesis process is done in the back-end stage.
There are three basic steps. The first is synthesis: converting the HDL description
into a set of primitives or components that can be assembled in the target
technology; it may generate a list of gates and a netlist that specifies how they
are interconnected.
An HDL tool suite really has several different tools with their own names and
purposes:
A text editor allows writing, editing and saving an HDL program. It often
contains HDL-specific features, such as recognizing specific file name
extensions, and recognizing HDL reserved words and comments and displaying
them in different colors.
The compiler is responsible for parsing the HDL program, finding syntax
errors and figuring out what the program really says.
A synthesizer or synthesis tool targets the design to a specific hardware
technology, such as an FPGA or ASIC.
The simulator runs the specified input sequence on the described hardware
and determines the values of the hardware's internal signals and its
outputs over a specified period of time.
The output of the simulator can include waveforms to be viewed using
the waveform editor.
A schematic viewer may create a schematic diagram corresponding to an
HDL program, based on the intermediate-language output of the compiler.
A translator targets the compiler's intermediate-language output to a real
device such as a PLD, FPGA or ASIC.
3.5. VHDL:
The key advantage of VHDL, when used for systems design, is that it
allows the behavior of the required system to be described (modelled) and verified
(simulated) before synthesis tools translate the design into real hardware (gates
and wires). Another benefit is that VHDL allows the description of a concurrent
system. VHDL is a dataflow language, unlike procedural computing languages
such as BASIC, C, and assembly code, which all run sequentially, one instruction
at a time.
methodologies, both top-down and bottom-up, and is very flexible in its approach
to describing hardware.
In the mid-1980s, the U.S. Department of Defense (DoD) and the IEEE
sponsored the development of a highly capable hardware description language
called VHDL; it was extended in 1993 and again in 2002. Some of the
features of VHDL are:
3.5.3. VHDL STRUCTURE:
3.5.4.2. Structural Modelling:
assignment. The non-blocking assignment allows designers to describe a state-
machine update without needing to declare and use temporary storage variables.
Since these concepts are part of Verilog’s language semantics, designers could
quickly write descriptions of large circuits in a relatively compact and concise
form. When Verilog was introduced (1984), it represented a
tremendous productivity improvement for circuit designers who were already
using graphical schematic-capture software and specially written software
programs to document and simulate electronic circuits.
Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating,
undefined") and signal strengths (strong, weak, etc.). This system allows abstract
modelling of shared signal lines, where multiple sources drive a common net.
When a wire has multiple drivers, the wire's (readable) value is resolved by a
function of the source drivers and their strengths.
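A toy model of this resolution rule can be written in a few lines. The strength table and function below are a simplified sketch of the Verilog scheme, not its full eight-level strength system:

```python
# Hypothetical resolver for a Verilog-style net: each driver contributes a
# (strength, value) pair; the strongest driver wins, equal-strength conflicts
# give 'x', and a net with no active drivers floats to 'z'.
STRENGTH = {"supply": 7, "strong": 6, "pull": 5, "weak": 3, "highz": 0}

def resolve(drivers):
    """drivers: list of (strength_name, value) with value in '01xz'."""
    active = [(STRENGTH[s], v) for s, v in drivers if v != "z"]
    if not active:
        return "z"                      # floating net
    top = max(st for st, _ in active)
    vals = {v for st, v in active if st == top}
    return vals.pop() if len(vals) == 1 else "x"   # conflict -> unknown

assert resolve([("strong", "1"), ("weak", "0")]) == "1"   # strong wins
assert resolve([("strong", "1"), ("strong", "0")]) == "x" # contention
```

This is what lets a weak pull-up coexist with a strong open-drain driver on the same modelled bus line.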
Modules describe:
Boundaries [module, endmodule].
Inputs and outputs [ports].
How it works [behavioral or RTL code].
A module can be a single element or a collection of lower-level modules.
A module can describe a hierarchical design (a module of modules).
A module should be contained within one file.
The module name should match the file name; module fadder resides in a file named fadder.sv.
Multiple modules can reside within one file (not recommended).
Correctly partitioning a design into modules is critical.
VHDL is a strongly typed language: code that violates its typing rules will not
compile. A strongly typed language like VHDL does not allow the intermixing, or
operation, of variables of different types. Verilog uses weak typing, the opposite
of a strongly typed language. Another difference is case sensitivity. Verilog is case
sensitive and will not recognize a variable if the case used is not consistent with its
declaration. VHDL, on the other hand, is not case sensitive, and users can freely
change the case, as long as the characters in the name, and their order, stay the same.
In general, Verilog is considered easier to learn than VHDL. This is due, in part, to
the popularity of the C programming language, which makes most programmers
familiar with the conventions used in Verilog. VHDL is somewhat more difficult to
learn and program.
VHDL has the advantage of having many more constructs that aid in high-level
modeling, and it reflects the actual operation of the device being
programmed. Complex data types and packages are very desirable when
programming big, complex systems that have many functional parts.
Verilog has no concept of packages, and all programming must be done with the
simple data types provided by the language.
3.8 Summary:
1. Verilog is based on C, while VHDL is based on Pascal and Ada.
5. Verilog has very simple data types, while VHDL allows users to create
more complex data types.
The Xilinx ISE tools allow you to use schematics, hardware description
languages (HDLs), and specially designed modules in a number of ways. Schematics
are drawn by using symbols for components and lines for wires. The Xilinx tools are a
suite of software tools used for the design of digital circuits implemented using a
Xilinx Field Programmable Gate Array (FPGA) or Complex Programmable Logic
Device (CPLD).
The design procedure consists of (a) design entry, (b) synthesis and
implementation of the design, (c) functional simulation and (d) testing and
verification. Digital designs can be entered in various ways using the above CAD
tools: using a schematic, using a hardware description language (HDL) such as
Verilog or VHDL, or using a combination of both. In this lab we will only use the
design flow that involves the use of Verilog HDL.
A Verilog input file in the Xilinx software environment consists of the following
segments:
The Integrated Software Environment (ISE) is the Xilinx design software suite
that allows you to take your design from design entry through Xilinx device
programming.
The ISE Project Navigator manages and processes your design through the
following steps in the ISE design flow.
CHAPTER IV
EXISTING TECHNIQUE
4.1 BOOTH MULTIPLIER
1. The machine may use a sign-and-magnitude representation of a number. In
such a representation, it is effortless to perform multiplication and least complicated
to execute division, but the more frequently used operation of subtraction needs
additional circuitry.
Assume that the machine deals with negative numbers by taking their complements
mod 2. Then
+m → m ………………………………(4.1.1)
-m → 2 - m …………………………….. (4.1.2)
Hence, when two numbers m and t are multiplied, the machine generates the
following results:
(+m) × (+t) → mt …………………………. (4.1.3)
(-m) × (+t) → 2t - mt …………………………. (4.1.4)
(+m) × (-t) → 2m - mt ……………………… (4.1.5)
(-m) × (-t) → 4 - 2m - 2t + mt ………………. (4.1.6)
Equations (4.1.3) to (4.1.6) have negative signs to be dealt with. In order to correct
Equations (4.1.4) to (4.1.6), the following steps are followed.
The application of both of these corrections also gives the correct result when m
and t are both negative: subtraction is in effect and, because all operations are mod 2,
the added four is in any case ignored by the machine. A process for
the division of signed binary numbers was given by Booth et al. (1946). The
machine examines the signs of both m and t, and this requires careful engineering
of the sequencing and the storage of the signs of m and t in auxiliary
circuits, as given by A. D. Booth and K. H. V. Britten (1947). The
above-mentioned correction operations are highly undesirable. Therefore, it would be
convenient to have a process that performs multiplication in a uniform manner, without
the need for any special devices to examine the signs of the operands.
1. For multiplying two numbers m and t together, the nth digit (mn) of m has
to be examined, together with the digit to its right (mn+1).
2. If mn = 0, mn+1 = 0, the partial products are summed up. The sum is multiplied
by 2^-1, i.e., each bit of the result is shifted to the right by one place.
3. If mn = 0, mn+1 = 1, add t to the existing sum of partial products and
multiply by 2^-1, i.e., shift all the bits simultaneously one place to the
right.
4. If mn = 1, mn+1 = 0, the partial products are summed up. Then t is subtracted
from the sum. This intermediate result is multiplied by 2^-1, i.e., every bit
moves one place to the right.
5. If mn = 1, mn+1 = 1, multiply the sum of partial products by 2^-1, i.e., shift
one place to the right.
6. Do not multiply by 2^-1 at m0 in the above processes.
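Steps 1-6 can be modelled in software. The following Python sketch implements the same recoding rules on two's-complement integers; instead of repeatedly halving the accumulator, it doubles the weight of t each step, which is an equivalent integer-exact formulation:

```python
def booth_multiply(m, t, bits=8):
    """Radix-2 Booth multiplication of two signed `bits`-wide integers,
    following steps 1-6 above: scan the multiplier m from LSB to MSB with
    an implicit 0 to the right, add or subtract t on a 01 / 10 bit pair."""
    acc = 0                      # running sum of partial products
    prev = 0                     # m_{n+1}: the bit examined previously
    mu = m & ((1 << bits) - 1)   # two's-complement view of the multiplier
    for i in range(bits):
        bit = (mu >> i) & 1
        if bit == 0 and prev == 1:
            acc += t             # end of a run of 1s: add t
        elif bit == 1 and prev == 0:
            acc -= t             # start of a run of 1s: subtract t
        prev = bit
        # The "multiply by 2^-1" of the text is realised here by doubling
        # t instead of shifting acc right, keeping exact integers.
        t <<= 1
    return acc

assert booth_multiply(-3, 5) == -15
assert booth_multiply(7, -6) == -42
```

Because runs of consecutive 1s in the multiplier produce no operations between their endpoints, signed operands are handled uniformly, with no sign-examining circuitry.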
Thus, for Booth's algorithm with 10-bit operands:
A: 1 1 1 1 1 1 1 1 1 1
X: 1 1 1 1 1 1 1 1 1 1
Number of bits: 10
Computation:
A       1111111111   (-1)
X     × 1111111111   (-1)
Y   0 0 0 0 0 0 0 0 0 -1   recoded multiplier
---------------------------------------------------------------------
Add -A    0000000001
Shift     00000000001
The number of partial products to be added is the main parameter that determines the
performance of the multiplier. To reduce the number of partial products to be added,
the modified Booth algorithm is one of the most popular algorithms.
Figure 4.2: Booth Multiplier block diagram
The multiplier is one of the most widely used arithmetic datapath operations in
modern digital design. In state-of-the-art digital signal processing and graphics
applications, multiplication is an important and computationally intensive
operation. The multiplication operation is present in many parts of a digital
system or digital computer, most notably in signal processing, graphics and
scientific computation. The Booth algorithm is a crucial improvement in the design of
signed binary multiplication.
The block diagram consists of the following sections:
3. Addition.
Let us calculate the product of the two's-complement numbers 1101 (-3) and
0101 (+5). Computing the product of the two binary numbers, we get the result:
        1 1 0 1   Multiplicand
      × 0 1 0 1   Multiplier
------------------------
1 1 1 1 1 1 0 1   PP1
0 0 0 0 0 0 0     PP2
1 1 1 1 0 1       PP3
+ 0 0 0 0 0       PP4
1 1 1 1 1 0 0 0 1 = -15   Product (discard the leftmost bit)
From the above, 1101 is the multiplicand and 0101 is the multiplier. The intermediate
products are the partial products, and the final result is the product (-15). When this
method is implemented in hardware, the operation is to take the multiplier bits one at a
time from right to left, multiply the multiplicand by that single bit, and shift the
intermediate product one position to the left of the earlier intermediate product. All
the bits of the partial products in each column are added to obtain two bits: sum and
carry. Finally, the sum and carry bits in each column have to be summed. The two rows
before the product are called the sum and carry rows.
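The worked example can be reproduced bit-for-bit in software. The helper below sign-extends the multiplicand and truncates each partial product to 8 bits, mirroring the columns above:

```python
# Re-create the worked example: 1101 (-3) x 0101 (+5) with sign-extended
# partial products, all arithmetic in 8-bit two's complement.
MASK = 0xFF
multiplicand, multiplier = 0b1101, 0b0101   # -3 and +5 in 4 bits

def sext4(x):
    """Sign-extend a 4-bit two's-complement value to a Python int."""
    return x - 16 if x & 0b1000 else x

pps = []
for i in range(4):
    bit = (multiplier >> i) & 1
    # Each partial product is the (sign-extended) multiplicand or zero,
    # shifted left by the bit position and truncated to 8 bits.
    pps.append(((sext4(multiplicand) if bit else 0) << i) & MASK)

product = sum(pps) & MASK        # discard the carry out of bit 7
assert pps[0] == 0b11111101      # PP1
assert pps[2] == 0b11110100      # PP3 (shifted by two places)
assert product == 0b11110001     # -15 in 8-bit two's complement
```

Note how the `& MASK` truncation plays the role of "discard this bit" in the hand calculation.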
Advantage:
In this block the multiplicand bit is multiplied with the output of the decoder
unit with the help of a NAND gate, as the NAND gate multiplies the two input values
with each other to form a partial product output. Here the equivalent value is converted
to a single-bit partial product which is later provided to the adder circuit to
generate the output of the Booth multiplier.
4.2.4. Adder:
It is readily observed that the data flow is vertical from one row to the next,
and there are no horizontal connections between the cells in a row, except in the last
row, where the carry signal has to propagate through all cells. Therefore, ignoring
the last row for the moment, by placing a register on the outputs of each individual
cell, one can achieve a pipelined architecture where the stage delay is equal to the
delay of a 1-bit full adder plus a register. This, however, is not the case because of
the horizontal data flow in the last row of the array. One solution is to use a carry-lookahead
adder to replace the last row. However, the carry-lookahead adder does
not offer a structural regularity compatible with the rest of the array. Besides, as the
word length of the multiplier increases, the delay through the carry-lookahead adder
increases, making it the dominant stage of the pipelined architecture. References
[7] and [8] describe two designs employing only one level of pipelining (two stages)
by placing registers before the last row of the CSA array.
An alternative approach is to perform the addition in the last row of the array
in a bit-by-bit style using the array of half-adders and registers shown in Fig. 4.3.2. In
this architecture, starting with the least significant bit, each bit of the result is
determined in one stage of the array. As can be seen, the flow of data is always
horizontal in Fig. 4.3.2, and one can convert each column of the array into a horizontal
pipeline stage whose delay is less than that of a 1-bit full adder.
Figure 4.3.1 General architecture of a parallel array multiplier.
Figure 4.3.2 Conversion of the last row of above figure to a pipelined architecture
The multiplier chip described here uses the combination of the two
arrays shown in Figs. 4.3.1 and 4.3.2 to yield an array multiplier that is fully
pipelined down to the bit level. For an N × N multiplier, there are 2N pipeline
stages in the architecture. A simple way of decreasing this number to 3N/2 while
keeping the same throughput is discussed. Obviously, different levels of pipelining
can be achieved by combining different numbers of stages of the fully pipelined
array into one stage.
Notice that instead of the half-adder array shown in Fig. 4.3.2, one may use a
pipelined implementation of a ripple-carry adder. This will result in an array with
the diagonal cells being 1-bit full adders; the rest of the cells in the array will be
plain registers. This implementation, like the half-adder array, adds N stages to the
pipeline for an N × N multiplier. It uses almost 20 percent fewer transistors than the
half-adder array; however, because of the pipelining and the clock distribution, the
area consumed by this implementation turns out to be almost the same as the half-adder
array. The advantage of the half-adder array is that the delay of each stage is
half the delay of the full-adder stage, meaning that by combining two stages
together we can save N/2 pipeline stages without loss of throughput. Therefore,
although the half-adder array uses slightly more transistors than the full-adder array,
it offers two advantages: 1) reduction of the number of pipeline stages from 2N to
3N/2; and 2) reduction in area because of a saving of N/2 in the number of pipeline
stages and the fact that the corresponding registers, clock routing, and clock buffers
are no longer present.
As shown in Fig. 4.3.1, each stage of the parallel array should receive some
partial product inputs. In a non-pipelined array, the partial products are all generated
at the same time and are present in the array until the multiplication is done. In a
pipelined array, however, there is a new set of partial products every clock cycle.
These partial products are not all used at the same time. For example, the partial
product word for the third pipeline stage should be ready three clock cycles after it
has been generated. This results in the skewing of the partial product inputs in a
manner shown in Fig. 3. In this figure, the inputs to the last stage (#n) correspond
to the inputs of the last stage of the array in Fig. 4.3.1; the array of Fig. 4.3.2 does not
use any of the partial products. The block diagram of Fig. 3 is only for illustration
purposes. In our actual design, instead of generating the partial products and then
skewing them, the input data bits are skewed first and then ANDed to produce the
partial products. This results in a 50-percent saving in the number of registers
required for partial product skewing. Fig. 4 shows the complete structure of a fully
pipelined 8 × 8 multiplier using the architecture described in this section. The
blocks labeled FR, HR, AR, and R represent a registered full adder, registered half-adder,
registered AND gate, and a plain register, respectively.
CHAPTER V
PROPOSED MULTIPLIER
By working through the above design on paper, I found that a separate overflow
bit is not required: the overflow bit simply shifts into the product register. To
implement the 32-bit design I used two initialized product registers, preg1 and
preg2. Preg1 holds the multiplier in the least significant 32 bit positions, with
the most significant 32 bits set to zero. Preg2 holds the multiplicand in the most
significant 32 bit positions, with the least significant 32 bits set to zero. If the
least significant bit of the multiplier product register, preg1, is a ‘1’, the
multiplicand product register, preg2, is added to the multiplier product register,
and the result stored in the multiplier product register is shifted right by one bit.
If the least significant bit of the multiplier product register is a ‘0’, the bits
in the multiplier product register are shifted right by one bit without adding the
multiplicand product register. This is repeated 32 times. The value in the
multiplier product register after 32 clock cycles is the final product.
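As a behavioural check, the register procedure above can be sketched in Python; `shift_add_multiply` is an illustrative name, and Python's unbounded integers stand in for the extra overflow bit that shifts back into the product register.

```python
def shift_add_multiply(multiplier, multiplicand, width=32):
    """Sketch of the proposed shift-add scheme for unsigned operands."""
    preg1 = multiplier             # multiplier in the low 32 bits, high bits zero
    preg2 = multiplicand << width  # multiplicand in the high 32 bits, low bits zero
    for _ in range(width):         # repeated 'width' times (32 clock cycles)
        if preg1 & 1:              # LSB is 1: add preg2 before shifting
            preg1 += preg2         # any carry out becomes the overflow bit
        preg1 >>= 1                # shift the product register right by one bit
    return preg1                   # final product after 'width' iterations

print(shift_add_multiply(123456789, 987654321))
```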
Multiplication is a process of adding an integer to itself a specified number of
times: a number (the multiplicand) is added to itself as many times as specified
by another number (the multiplier) to form the result. The multiplication process
has three main steps: 1. Partial product generation. 2. Partial product reduction.
3. Final addition. For the multiplication of an n-bit multiplicand by an m-bit
multiplier, m partial products are generated and the product formed is n + m bits long.
The process of partial product generation can be further classified into two:
i. Simple PPG
In this method, the partial products are generated by multiplying each bit of
the multiplier with the multiplicand using logical AND gates.
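The AND-gate generation of partial products can be illustrated with a short sketch; the function names are illustrative only, not part of the design.

```python
def simple_ppg(multiplicand, multiplier, m):
    """Generate m partial-product rows: row i is the multiplicand ANDed
    with bit i of the multiplier (that bit replicated across the row)."""
    rows = []
    for i in range(m):
        bit = (multiplier >> i) & 1               # multiplier bit i
        rows.append(multiplicand if bit else 0)   # AND with every multiplicand bit
    return rows

def sum_rows(rows):
    """Final addition: row i carries weight 2**i before summing."""
    return sum(row << i for i, row in enumerate(rows))

rows = simple_ppg(13, 11, 4)   # 13 * 11 with a 4-bit multiplier
print(rows, sum_rows(rows))
```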
ii. Booth recoding PPG
Here the partial products are generated with Radix-4 modified Booth recoding. The
speed of a multiplier can be improved by reducing the number of generated partial
products. With Booth recoding, only half the number of partial products is
generated compared with simple PPG, which reduces both the area occupied by the
hardware and the time required for execution. O. L. MacSorley proposed the
Modified Booth Algorithm (MBA) in 1961 as a powerful algorithm for the
multiplication of signed numbers, treating positive and negative numbers
uniformly [3], [4].
d. Examine each block of the multiplier and generate the partial product using
the table below:
e. The new partial product generated is added to the previous partial product
shifted two bits to the left, while the multiplier bits are shifted two bits to the
right. Initially the partial product is zero.
f. It is then sign-extended.
g. The above operations are repeated n/2 times.
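Steps d–g can be modelled in software. The sketch below, with illustrative function names, recodes a two's-complement multiplier into n/2 signed digits in {-2, -1, 0, +1, +2} and forms the product, matching the claim that Booth recoding halves the number of partial products.

```python
def booth_recode(y, n):
    """Radix-4 modified Booth recoding of an n-bit two's-complement
    multiplier (n even) into n // 2 digits in {-2, -1, 0, 1, 2}."""
    bits = [(y >> i) & 1 for i in range(n)]    # two's-complement bit pattern
    digits, prev = [], 0                       # prev plays the role of y[-1] = 0
    for i in range(0, n, 2):                   # examine each 2-bit block (step d)
        d = bits[i] + prev - 2 * bits[i + 1]   # digit = y[2i] + y[2i-1] - 2*y[2i+1]
        digits.append(d)
        prev = bits[i + 1]
    return digits

def booth_multiply(x, y, n):
    """Each digit's partial product gains two bits of weight per step (step e)."""
    digits = booth_recode(y & ((1 << n) - 1), n)
    return sum(d * x * (4 ** i) for i, d in enumerate(digits))

print(booth_recode(-5 & 0xFF, 8), booth_multiply(7, -5, 8))
```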
Multipliers require a large amount of power and incur most of their delay during
partial product addition. At this stage, most multipliers are designed with
multi-operand adders that are capable of adding more than two input operands and
produce two outputs, sum and carry. The Wallace tree method is used in high-speed
designs to add the partial products. A Wallace tree reduces the number of stages
of sequential partial-product addition, thereby improving speed. The Wallace tree
used here is made up of several compressors that take three or more inputs and
produce two outputs of the same dimension as the inputs. The speed, area and
power consumption of the multiplier are directly proportional to the efficiency
of the compressors. There are various types of compressors, namely 3:2, 4:2, 5:2
and so on. A Wallace tree with 4:2 compressors is considered.
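A 4:2 compressor can be sketched at the bit level as two cascaded full adders; the identity checked below, that the five inputs sum to s + 2·(carry + cout), is the property the Wallace tree relies on. Function names are illustrative.

```python
def full_adder(a, b, cin):
    """3:2 compressor: returns (sum, carry)."""
    s = a ^ b ^ cin
    c = (a & b) | (cin & (a ^ b))
    return s, c

def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor built from two full adders.  cout depends only on
    x1..x3, so the inter-slice carry chain does not ripple."""
    s1, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout

# Exhaustive check of the compressor identity over all 32 input patterns.
for v in range(32):
    bits = [(v >> k) & 1 for k in range(5)]
    s, carry, cout = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
print("ok")
```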
CHAPTER VI
SIMULATION RESULTS
6.1 Booth Multiplier:
Figure 6.1.3: Technology Schematic Diagram of 64-Bit Booth Multiplier
6.2 Modified Booth Multiplier:
Figure 6.2.3: Technology Schematic Diagram of 64-Bit Modified Booth Multiplier
6.3 Pipelined Multiplier:
Figure 6.3.3: Technology Schematic Diagram of 64-Bit Pipelined Multiplier
6.4 PROPOSED MULTIPLIER:
Figure 6.4.3: Technology Schematic Diagram of 64-Bit Proposed Multiplier
6.5 COMPARISON RESULT:
CHAPTER VII
7.1 CONCLUSION:
REFERENCES
[2] R. P. Brent and H. T. Kung, “A regular layout for parallel adders,” IEEE
Trans. Computers, vol. C-31, pp. 260–264, March 1982.
[8] Y. Choi, “Parallel Prefix Adder Design,” Proc. 17th IEEE Symposium
on Computer Arithmetic, pp. 90–98, June 2005.
[11] H. Ling, “High-speed binary adder,” IBM Journal of Research and
Development, vol. 25, no. 3, pp. 156–166, 1981.
[15] F. E. Fich, “New bounds for parallel prefix circuits,” Proc. 15th
Annu. ACM Symp. on Theory of Computing, 1983, pp. 100–109.
[17] Y. Choi, “Parallel Prefix Adder Design,” Proc. 17th IEEE Symposium
on Computer Arithmetic, pp. 90–98, June 2005.
APPENDIX
Source Code:
// 64 x 64 radix-2 Booth multiplier.  The original listing was truncated:
// the module header, the for-loop, and the case arms were reconstructed
// from context and are marked as such below.
module booth_multiplier(en, X, Y, Z);   // reconstructed header
input en;                               // enable input retained from the original (unused)
input [63:0] X, Y;                      // reconstructed declarations
output reg [127:0] Z;                   // reconstructed
reg [63:0] Y1;                          // reconstructed
reg [1:0] temp;                         // reconstructed Booth bit-pair
integer i;
reg E1;
always @ (X or Y)                       // reconstructed
begin
Z = 128'd0;
E1 = 1'd0;
Y1 = - Y;                               // two's complement of the multiplicand
Z[63:0] = X;                            // multiplier loaded into the low half
for (i = 0; i < 64; i = i + 1)          // reconstructed loop
begin
temp = {X[i], E1};                      // reconstructed: current and previous bits
case (temp)
2'b10 : Z[127:64] = Z[127:64] + Y1;     // reconstructed arm: subtract
2'b01 : Z[127:64] = Z[127:64] + Y;      // reconstructed arm: add
default : begin end                     // 00 or 11: no operation
endcase
Z = Z >> 1;
Z[127] = Z[126];                        // arithmetic shift: preserve the sign
E1 = X[i];
end
end
endmodule
module modified_booth_multiplier(x,y,o);
input [63:0]x;
input [63:0]y;
output [127:0]o;
reg [127:0]o;
integer i;
reg [128:0]a;
reg [63:0]s;
reg [63:0]p;
always @ ( x or y)
begin
a = 129'd0;
s = y;
a[64:1] = x;
for ( i = 0 ; ( i <= 63 ) ; i = ( i + 1 ) )
begin
if ({a[1], a[0]} == 2'b10)        // reconstructed condition: 10 -> subtract
begin
p = a[128:65];                    // accumulator slice (64 bits, matching p)
a[128:65] = ( p - s );
end
else if ({a[1], a[0]} == 2'b01)   // reconstructed condition: 01 -> add
begin
p = a[128:65];
a[128:65] = ( p + s );
end
a[127:0] = a[128:1];
end
o[127:0] = a[128:1];
end
endmodule
module Pipelined_multiplier(start,clock,clear,binput,qinput,carry,
acc,qreg,preg);
input start, clock, clear;
input [63:0] binput, qinput;      // reconstructed: operand widths
output carry;
output [63:0] acc, qreg;          // reconstructed
output [5:0] preg;                // reconstructed
//system registers
reg carry;
reg [63:0] b, acc, qreg;          // reconstructed
reg [5:0] preg;                   // reconstructed
reg [1:0] prstate, nxstate;       // reconstructed state registers
parameter t0 = 2'b00, t1 = 2'b01, t2 = 2'b10, t3 = 2'b11;  // reconstructed states
wire z;
assign z = ~|preg;                // z = 1 when the cycle counter reaches zero
//state register; the always wrapper was reconstructed
always @ (negedge clock or negedge clear)
if (~clear) prstate <= t0;
else prstate <= nxstate;
//next-state logic; the t0 and t3 arms were reconstructed from context
always @ (start or prstate or z)
case (prstate)
t0: if (start) nxstate = t1;
else nxstate = t0;
t1: nxstate = t2;
t2: nxstate = t3;
t3: if (z) nxstate = t0;
else nxstate = t2;
endcase
//register-transfer operations; the always wrapper was reconstructed
always @ (negedge clock)
case (prstate)
t0: b<=binput;
t1: begin
acc <= 64'd0;
carry<=1'b0;
preg<=6'b100000;
qreg<=qinput;
end
t2:begin
preg<=preg-6'b000001;
if(qreg[0])
{carry,acc}<=acc+b;
end
t3:begin
carry<=1'b0;
acc<={carry,acc[63:1]};
qreg<={acc[0],qreg[63:1]};
end
endcase
endmodule