Thesis Report
CHAPTER-1
INTRODUCTION
1.1 GENERAL
different stages of the computing stack, ranging from logic and architectures at the
hardware layer all the way up to compiler and programming language at the
software layer. There is an extensive amount of research related to approximations
at both hardware and software layers. Voltage over-scaling and functional
approximation are the two major approximate computing knobs employed at the
hardware level.
processor consumes 70% of the energy in supplying data and instructions, and 6% of
the energy while performing arithmetic only. Therefore, using approximate
arithmetic in such a scenario will not provide much energy benefit when considering
the complete processor. Programmable processors are designed for general-purpose
applications with no application-specific specialization. Therefore, there may not be
many applications that will be able to tolerate errors due to approximate computing.
This also makes general-purpose processors not suited for using approximate
building blocks. Therefore, in this paper, we consider application-specific integrated
circuit implementations of error-resilient applications like image and video
compression. We target the most computationally intensive blocks in these
applications and build them using approximate hardware to show substantial
improvements in power consumption with little loss in output quality. A few works
that focus on low-power design through approximate computing at the algorithm and
architecture levels include algorithmic noise tolerance (ANT), significance-driven
computation (SDC), and non-uniform voltage over-scaling (VOS).
All these techniques are based on the central concept of VOS, coupled with
additional circuitry for correcting or limiting the resulting errors. A fast but
“inaccurate” adder has been implemented in the literature, based on the idea that, on
average, the length of the longest sequence of propagate signals is approximately log
n, where n is the bit-width of the two integers to be added. An error-tolerant adder
has also been proposed that operates by splitting the input operands into accurate
and inaccurate parts. However, neither of these techniques targets logic complexity
reduction.
A power-efficient multiplier architecture has been proposed that uses a 2 × 2
inaccurate multiplier block resulting from Karnaugh map simplification. This paper
considers logic complexity reduction using Karnaugh maps. Shin and Gupta and
Phillips et al. also proposed logic complexity reduction by Karnaugh map
simplification. Other works focus on logic complexity reduction at the gate level.
Other approaches use complexity reduction at the algorithm level to meet real-time
energy constraints.
Proportional reductions in area, power and latency are not achieved as these
approximations target transistor or gate-level truncations, leading to significant
efficiency gains in ASICs but not in FPGAs. The most important factor in
distinguishing ASICs and FPGAs is the way logic functions are realized. The basic
building blocks for generating the required logic, in case of ASICs, are the logic
gates, whereas in case of FPGAs, they are the lookup tables (LUTs) made of SRAM
elements. Therefore, the approximation techniques for FPGAs should be amenable
to the LUT structures instead of aiming at reducing the logic gates.
The basic building blocks of an FPGA are the configurable logic blocks
(CLBs). They are used to implement any kind of logic function using the
switching/routing matrix. Each CLB consists of two slices. The Virtex-7 FPGA family
arranges all the CLBs in columns by deploying the Advanced Silicon Modular Block
(ASMBL) architecture. Each slice in this device consists of four 6-input LUTs, eight
flip-flops, and efficient carry chain logic. The slices act as the main function
generators of the FPGA, and in the Virtex-7 family they are categorized as SLICEL
(logic slices) and SLICEM (memory slices). The lookup tables present in these slices
are 5x2 LUTs (LUT6_2). Each LUT6_2 is fabricated using two LUT5s and a multiplexer. These LUT5s
are the basic SRAM elements which are used to realize the required logic function,
by storing the output sequence of the truth-table in 1-bit memory locations, which are
accessed using the address lines acting as inputs. These LUTs are made accessible
using a wide range of lookup table primitives offered by the Xilinx UNISIM library,
ranging from LUT1 to LUT6. These LUT primitives are instantiated with an INIT
attribute, which is the truth table of the function required based on the input logic.
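The relationship between a logic function and its INIT value can be sketched in Python. This is a minimal model, assuming the usual convention that INIT bit i holds the output for the input combination whose binary encoding is i, with I0 as the least significant address bit. The helper name `lut_init` is hypothetical and is not part of any Xilinx tool.

```python
def lut_init(func, num_inputs):
    """Return the INIT value (as an int) of a LUT that realizes `func`.

    Assumption: INIT bit i stores the output for the input vector whose
    binary encoding is i, with input I0 as the least significant bit.
    """
    init = 0
    for address in range(1 << num_inputs):
        # Unpack the address into the individual 1-bit inputs I0, I1, ...
        inputs = [(address >> k) & 1 for k in range(num_inputs)]
        if func(*inputs) & 1:
            init |= 1 << address
    return init


# Two-input examples: AND is 1 only at address 3, XOR at addresses 1 and 2.
and_init = lut_init(lambda i0, i1: i0 & i1, 2)
xor_init = lut_init(lambda i0, i1: i0 ^ i1, 2)
print(f"AND INIT = 4'h{and_init:X}, XOR INIT = 4'h{xor_init:X}")
```

The same helper works for any input count up to 6, so a 64-bit INIT for a LUT6 can be produced by passing a six-argument function.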
The LUT primitives are used to implement the required logic functions, which
are then compacted and mapped onto the fabric resources available on the FPGA.
Each of these LUT primitives takes in an equivalent number of 1-bit inputs and
produces a unique 1-bit output. However, at the hardware level, each of these
primitives is physically mapped to one of the two LUT5s present in the four
LUT6_2 fabrics in a given slice of the CLB. As per studies, the use of LUT
primitives allows the Xilinx tools to efficiently optimize the combining and mapping
of LUT primitives to reduce the area and latency of the synthesized designs. We use
these LUT primitives to achieve significant area and performance gains for the
approximate adder designs.
Approximate computing trades off accuracy to improve the area, power, and
speed of digital hardware. Many computationally intensive applications such as
video encoding, video processing, and artificial intelligence are error resilient by
Most of the approximate adders in the literature have been designed for ASIC
implementations. These approximate adders use gate or transistor level
optimizations. Recent studies have shown that the approximate adders designed for
ASIC implementations do not yield the same area, power, and speed improvements
when implemented on FPGAs or fail to utilize FPGA resources efficiently to
improve the output quality. This is mainly due to the difference in the way logic
functions are implemented in ASICs and FPGAs.
The basic element of an ASIC implementation is a logic gate, whereas FPGAs use
lookup tables (LUTs) to implement logic functions. Therefore, ASIC based
optimization techniques cannot be directly mapped to FPGAs. FPGAs are widely
used to implement error-tolerant applications using addition and multiplication
operations. The efficiency of FPGA-based implementations of these applications can
be improved through approximate computing. Only a few FPGA-specific
approximate adders have been proposed in the literature. These approximate adders
focus on improving either efficiency or accuracy. Therefore, the design of low-error,
efficient approximate adders for FPGAs is an important research topic.
CHAPTER-2
LITERATURE REVIEW
hardware can process 24 and 40 quad full HD (3840x2160) video frames per
second, respectively. The proposed approximate HEVC intra angular
prediction hardware is the smallest and the second fastest HEVC intra
prediction hardware in the literature. It is ten times smaller and 20% slower
than the fastest HEVC intra prediction hardware in the literature. In this paper,
first, a data reuse technique is used to reduce the amount of computation. Since
some of the HEVC intra angular prediction equations use the same Coeff and
reference pixels, there are identical luminance angular prediction equations for
each PU size. Since different PU sizes may use the same neighboring pixels, there
are also identical luminance angular prediction equations between different
PU sizes. The data reuse technique calculates the common prediction equations for
all luminance angular prediction modes only once and uses the result for the
corresponding modes. Since we use the data reuse technique, instead of
calculating the intra prediction equations of different prediction modes and PUs
separately, we calculate all necessary intra prediction equations together and
use the results for the corresponding prediction modes and PUs.
H. Jiang, C. Liu, L. Liu, F. Lombardi, and J. Han, Aug. 2017 ‘‘A review,
classification, and comparative evaluation of approximate arithmetic
circuits,’’ ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 4, pp. 1–34.
hardware has 2 fewer XOR gates than the proposed_0 hardware. However, its
maximum error is 4, which is 3 more than the maximum error of the proposed_0
hardware.
CHAPTER-3
EXISTING METHOD
The existing method uses an approximate adder with an accurate part at the MSP
side of the adder and an approximate part at the LSP side of the adder.
1. 'A' and 'B' are the inputs, and 'sum' and 'carry' are the outputs.
2. The carry output is 0 unless both inputs are 1.
3. The least significant bit of the result is given by the 'sum' bit.
Sum = A'B + AB'
Carry = AB
The parallel prefix adder (PPA) employs three stages. In the pre-processing stage,
the Propagate and Generate signals are produced. Generate (Gi) and Propagate (Pi)
are calculated from the inputs A and B as follows:
Gi = Ai AND Bi
Pi = Ai XOR Bi
Gi indicates whether a carry is generated from that bit. Pi indicates whether
a carry is propagated from that bit. In the carry generation stage of the PPA, prefix
graphs can be used to describe the tree structure. Here the tree structure
consists of grey cells, black cells, and buffers. When two pairs of generate and
propagate signals (Gm, Pm) and (Gn, Pn) are given as inputs to the carry generation
stage, it computes a pair of group generate and group propagate signals
(Gm:n, Pm:n), which are calculated as follows:
Gm:n = Gm + (Pm . Gn)
Pm:n = Pm . Pn
The black cell computes both the generate and propagate signals as output. It uses
two AND gates and one OR gate. The grey cell computes the generate signal only. It
uses one AND gate and one OR gate. In the post-processing stage, a simple adder
generates the sum. The sum and carry-out are calculated as follows:
Si = Pi XOR Ci-1
Cout = Gn-1 XOR (Pn-1 AND Gn-2)
If Cout is not required, it can be neglected.
Parallel prefix adders are also known as carry tree adders. They pre-compute
propagate and generate signals. These signals are combined using the fundamental
carry operator (fco):
(g1, p1) o (g2, p2) = (g1 + g2.p1, p1.p2)
Because the fundamental carry operator is associative, these operators can be
combined in different ways to form various adder structures. For example, a 4-bit
carry look-ahead generator is given by
C4 = (g4, p4) o [(g3, p3) o [(g2, p2) o (g1, p1)]]
Parallel prefix adders allow parallel operation, resulting in a more
efficient tree structure for this 4-bit example:
C4 = [(g4, p4) o (g3, p3)] o [(g2, p2) o (g1, p1)]
A key advantage of tree-structured adders is that the critical
path due to carry delay is of order log2 N for an N-bit-wide adder.
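The three stages and the fundamental carry operator can be illustrated with a short behavioral model in Python. This is only a sketch: it arranges the operators in a Kogge-Stone pattern, which is one possible prefix tree and is an assumption made here, not a structure named in this chapter. The function names `fco` and `prefix_add` are hypothetical, and the carry-in is assumed to be 0.

```python
def fco(left, right):
    """Fundamental carry operator: (g1, p1) o (g2, p2) = (g1 + g2.p1, p1.p2)."""
    g1, p1 = left
    g2, p2 = right
    return (g1 | (g2 & p1), p1 & p2)


def prefix_add(a, b, n):
    """Add two n-bit integers with a Kogge-Stone style prefix tree."""
    # Pre-processing: per-bit generate (AND) and propagate (XOR) signals.
    gp = [(((a >> i) & 1) & ((b >> i) & 1), ((a >> i) & 1) ^ ((b >> i) & 1))
          for i in range(n)]
    p = [pi for _, pi in gp]  # per-bit propagate, needed again for the sum

    # Carry generation: about log2(n) levels; each level doubles the span
    # of bits covered by every (group generate, group propagate) pair.
    dist = 1
    while dist < n:
        gp = [fco(gp[i], gp[i - dist]) if i >= dist else gp[i]
              for i in range(n)]
        dist *= 2

    # gp[i][0] is now the carry out of bit i; carries[i] is the carry into bit i.
    carries = [0] + [g for g, _ in gp]

    # Post-processing: Si = Pi XOR (carry into bit i).
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total | (carries[n] << n)  # append the carry-out as bit n


print(prefix_add(200, 100, 8))
```

The number of passes through the `while` loop grows with log2 of the bit width, which mirrors the critical-path argument made for tree-structured adders.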
3.5 DISADVANTAGES
Area occupied on the hardware is high.
Energy consumption is high and accuracy is very low.
Delay increases as the computation time of the output increases.
CHAPTER-4
PROPOSED METHOD
tradeoff between power and accuracy and the other on area and accuracy. The
proposed design methodology uses the approximate full adder based n-bit
adder architecture shown in the figure (generalized approximate adder). The n-bit
addition is divided into an m-bit approximate adder in the LSP and an (n−m)-bit
accurate adder in the MSP. Breaking the carry chain at bit position m generally
introduces an error of 2^m in the final sum. The error rate and error magnitude
can be reduced by predicting the carry-in to the MSP (CMSP) more accurately
and by modifying the logic function of the LSP to compensate for the error. The
carry to the accurate part can be predicted using any k-bit input pairs from the
approximate part such that k ≤ m. Most of the existing approximate adders use
k = 1.
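The effect of breaking the carry chain can be checked with a small Python model. This sketch assumes, purely to isolate the carry effect, that the LSP itself is added exactly and only its carry-out is dropped; the function name `split_add` and the values n = 6, m = 3 are illustrative choices, not parameters from the proposed design.

```python
def split_add(a, b, n, m):
    """Add two n-bit values with the carry chain broken at bit position m."""
    low_mask = (1 << m) - 1
    # LSP: exact m-bit sum, but the carry out of the LSP is discarded.
    low = ((a & low_mask) + (b & low_mask)) & low_mask
    # MSP: exact addition of the upper bits with carry-in fixed to 0.
    high = ((a >> m) + (b >> m)) << m
    return high | low


n, m = 6, 3
# Collect every distinct error value over all input pairs.
errors = {(a + b) - split_add(a, b, n, m)
          for a in range(1 << n) for b in range(1 << n)}
print(sorted(errors))
```

Under these assumptions the only possible errors are 0 (no carry was lost) and 2^m (one carry into the MSP was lost), which is the error magnitude that carry prediction aims to remove.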
The FPGA implementation of an accurate adder uses only 2 inputs and 1
output of each 6-input LUT. We propose to utilize the remaining 4 available
but unused inputs of the first LUT of the MSP to predict CMSP. Therefore,
we propose to share the most significant 2 bits of both inputs of the LSP with
the MSP for carry prediction. Sharing more bits of the LSP with the MSP will
increase the probability of correctly predicting CMSP, which will in turn
reduce the error rate. However, this will also increase the area and delay of the
approximate adder.
We analyzed the tradeoff between the accuracy and performance of an FPGA-based
approximate adder for different values of k. For k > 2, the error rate
reduces slightly at the cost of increased area and delay. On the other hand, for
k < 2, the delay improves marginally at the cost of a significant increase in the
error rate. Therefore, we propose using k = 2, as it provides a good balance
between accuracy and performance of approximate adders for FPGAs.
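The influence of k on carry prediction alone can be explored with an exhaustive Python count. This is a sketch under stated assumptions: inputs are uniformly distributed, the k bit pairs are taken from the most significant end of the LSP, and the predicted carry is the carry-out of adding just those k bit pairs with carry-in 0. It measures only how often CMSP is mispredicted, not the overall error rate, area, or delay of the adder, and the name `carry_mispredict_rate` is hypothetical.

```python
def carry_mispredict_rate(m, k):
    """Fraction of m-bit LSP input pairs whose carry-out is mispredicted
    when only the k most significant LSP bit pairs are examined."""
    wrong = 0
    for a in range(1 << m):
        for b in range(1 << m):
            true_carry = (a + b) >> m
            # Prediction: carry-out of the top k bits, carry-in assumed 0.
            predicted = ((a >> (m - k)) + (b >> (m - k))) >> k
            wrong += true_carry != predicted
    return wrong / (1 << (2 * m))


for k in (1, 2, 3):
    print(k, carry_mispredict_rate(8, k))
```

In this simplified model each extra shared bit roughly halves the misprediction rate, so the absolute improvement shrinks as k grows while every additional bit consumes more LUT inputs.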
It uses 2 MSBs of LSP to predict the CMSP, whereas their respective sum bits
are computed using AAd1. AAd1 is only suitable when the Cout of 2-bit
inputs is predicted accurately. Accurate prediction of Cout requires additional
resources or unused LUT inputs. Therefore, to design area efficient
approximate adders for FPGAs, AAd1 is not used in the least-significant m −
2 bits of the LSP. In this paper, we propose two n-bit approximate adders
using the architecture in Fig above.
The two proposed n-bit approximate adders use different approximate
functions for the first m − 2 bits of the LSP. State-of-the-art FPGAs use 6-
input LUTs. These LUTs can be used to implement two 5-input functions. The
complexity of the implemented logic function does not affect performance of
LUT based implementation. A 2-bit adder has 5 inputs and two outputs.
Therefore, a LUT can be used to implement a 2-bit approximate adder.
For an area-efficient FPGA implementation, we propose to split the first m − 2
bits of the LSP into ⌈(m − 2)/2⌉ groups of 2-bit inputs such that each group is
mapped to a single LUT. Each group adds two 2-bit inputs with carry-in using
This modification reduces the absolute error magnitude to 2 in two cases, and
to 1 in the other six cases. The resulting truth table of AAd2 is given in Table
below. The error cases are shown in red. Since AAd2 produces an erroneous
result in 8 out of 32 cases, the error probability of AAd2 is 0.25.
Fig.9 LEADx
Fig.10 APEx
In the proposed APEx, the S0 to Sm−3 outputs are fixed to 1 and Cm−2 is
0. This provides a significant reduction in area and power consumption at the
expense of slight quality loss. It is important to note that this is different from the
bit truncation technique, which fixes both the sum and carry outputs to 0. The
ME of the truncation adder is 2^(m+1) − 2, which is much higher than the ME of APEx
(2^(m−2) − 1). The proposed APEx approximate adder is shown in Fig. 10.
As in LEADx, the critical path of APEx is from the input Am−2 to the
output Sn−1.
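The two maximum error (ME) expressions can be reproduced with an exhaustive Python check. This is a simplified sketch, not a model of the full APEx design: it assumes that every bit above the constant-one positions is added exactly with carry-in 0, so it isolates the effect of fixing S0 to Sm−3 to 1 and ignores AAd1 and carry prediction. The function names and the values n = 6, m = 4 are illustrative.

```python
def truncated_add(a, b, m):
    """Bit truncation: the m least significant bits of both operands are
    dropped, so the low sum bits and the carry into bit m are 0."""
    keep = ~((1 << m) - 1)
    return (a & keep) + (b & keep)


def constant_one_add(a, b, m):
    """The m - 2 least significant sum bits are fixed to 1 and the carry
    into bit m - 2 is 0 (assumption: the upper bits are added exactly)."""
    low = m - 2
    upper = ((a >> low) + (b >> low)) << low
    return upper | ((1 << low) - 1)


def max_error(adder, n, m):
    """Largest absolute error over all pairs of n-bit inputs."""
    return max(abs((a + b) - adder(a, b, m))
               for a in range(1 << n) for b in range(1 << n))


n, m = 6, 4
print(max_error(truncated_add, n, m), 2 ** (m + 1) - 2)
print(max_error(constant_one_add, n, m), 2 ** (m - 2) - 1)
```

Fixing the low sum bits to 1 places the constant output near the middle of the range of possible low-order sums, which is why its worst case is so much smaller than that of truncation.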
Stage 1. Multiply (or, more precisely, AND) each bit of the multiplicand by each
bit of the multiplier, yielding n^2 partial products.
Stage 2. Reduce the number of partial products using layers of FA and
HA blocks.
Stage 3. Add the two n-sets resulting from the previous stage using an n-bit adder.
It should be noted that the second stage is carried out as follows. Three bits of
the same weight enter an FA and, as a result, two bits with different weights are
produced (one bit with the same weight and one bit with a higher weight).
If two bits with the same weight remain, put them into an HA.
If there is just one bit, transfer it to the next layer.
For an 8-bit multiplier, our proposed adders can be used to reduce
the partial products, as shown in fig.12.
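The three stages can be traced with a behavioral Python model. This is a sketch that uses exact full adders and half adders, so it always returns the exact product; it does not model the proposed approximate adders, and the function name `reduce_and_multiply` is hypothetical.

```python
def reduce_and_multiply(a, b, n=8):
    """Multiply two n-bit integers using FA/HA layers on partial products."""
    # Stage 1: AND every multiplicand bit with every multiplier bit.
    # cols[w] holds all partial-product bits of weight 2**w.
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    # Stage 2: apply layers of FAs and HAs until every column has <= 2 bits.
    while any(len(c) > 2 for c in cols):
        nxt = [[] for _ in range(len(cols) + 1)]
        for w, c in enumerate(cols):
            k = 0
            while len(c) - k >= 3:          # three bits -> full adder
                x, y, z = c[k:k + 3]
                nxt[w].append(x ^ y ^ z)                         # same weight
                nxt[w + 1].append((x & y) | (y & z) | (x & z))   # higher weight
                k += 3
            if len(c) - k == 2:             # two bits left -> half adder
                x, y = c[k:k + 2]
                nxt[w].append(x ^ y)
                nxt[w + 1].append(x & y)
            elif len(c) - k == 1:           # single bit -> pass to next layer
                nxt[w].append(c[k])
        cols = nxt

    # Stage 3: add the two remaining rows with an ordinary adder.
    row0 = sum(c[0] << w for w, c in enumerate(cols) if len(c) >= 1)
    row1 = sum(c[1] << w for w, c in enumerate(cols) if len(c) == 2)
    return row0 + row1


print(reduce_and_multiply(13, 11))
```

Replacing the exact FA and HA expressions in stage 2 with approximate versions would turn this into an error-analysis harness for an approximate multiplier.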
Advantages:
Area and Energy consumption are optimized.
Computational delay is reduced with improved accuracy of output.
Applications:
Digital signal processors
Digital image processors
Video processor applications
MAC & Arithmetic circuits
CHAPTER 5
IMPLEMENTATION
that time included a thin film process simulator, decided to acquire Gateway
Design Automation. Along with other Gateway products, Cadence now became the
owner of the Verilog language, and continued to market Verilog as both a language
and a simulator.
At the same time, Synopsys was marketing the top-down design methodology, using
Verilog. This was a powerful combination. In 1990, Cadence recognized that if
Verilog remained a closed language, the pressures of standardization would
eventually cause the industry to shift to VHDL. Consequently, Cadence organized
the Open Verilog International (OVI), and in 1991 gave it the documentation for the
Verilog Hardware Description Language. This was the event which "opened" the
language.
5.1.1 INTRODUCTION
HDL is an abbreviation of Hardware Description Language. Any digital system
can be represented in a REGISTER TRANSFER LEVEL (RTL) and HDLs are used
to describe this RTL.
Verilog is one such HDL and it is a general-purpose language that is easy to learn
and use. Its syntax is similar to C.
The idea is to specify how the data flows between registers and how the
design processes the data.
To define RTL, hierarchical design concepts play a very significant role.
Hierarchical design methodology facilitates the digital design flow with
several levels of abstraction.
Verilog HDL can utilize these levels of abstraction to produce a simplified
and efficient representation of the RTL description of any digital design.
For example, an HDL might describe the layout of the wires, resistors and
transistors on an Integrated Circuit (IC) chip, i.e., the switch level, or it may
describe the design at a higher level in terms of logic gates and flip-flops
in a digital system, i.e., the gate level. Verilog supports all of these
levels.
A design in any hardware description language like Verilog can be approached in two
ways: one is bottom-up design and the other is top-down design.
Bottom-Up Design:
The traditional method of electronic design is bottom-up (designing from transistors
and moving to a higher level of gates and, finally, the system). But with the increase
in design complexity traditional bottom-up designs have to give way to new
structural, hierarchical design methods.
Top-Down Design:
For HDL representation it is convenient and efficient to adopt this design style. A
real top-down design allows early testing, fabrication technology independence, and a
structured system design, and offers many other advantages. But it is very difficult to
follow a pure top-down design. Due to this fact, most designs are a mix of both
methods, implementing some key elements of both design styles.
5.1.3 Features of Verilog HDL
Verilog is case sensitive.
Ability to mix different levels of abstraction freely.
One language for all aspects of design, testing, and verification.
In Verilog, keywords are defined in lower case.
In Verilog, most of the syntax is adopted from the "C" language.
Verilog can be used to model a digital circuit at the algorithm, RTL, gate and
switch levels.
There is no concept of package in Verilog.
It also supports advanced simulation features like TEXTIO, PLI, and UDPs.
Modules. The key idea is to specify behavior, in terms of input, output and timing of
each unit, without specifying its internal structure.
The outcome of functional design is usually a timing diagram or other relationships
between units.
5.2.4 Logic Design:
In this step the control flow, word widths, register allocation, arithmetic operations,
and logic operations of the design that represent the functional design are derived and
tested.
This description is called Register Transfer Level (RTL) description. RTL is
expressed in a Hardware Description Language (HDL), such as VHDL or Verilog.
This description can be used in simulation and verification.
5.2.5 Circuit Design:
The purpose of circuit design is to develop a circuit representation based on the logic
design. The Boolean expressions are converted into a circuit representation by taking
into consideration the speed and power requirements of the original design. Circuit
simulation is used to verify the correctness and timing of each component.
The circuit design is usually expressed in a detailed circuit diagram. This diagram
shows the circuit elements (cells, macros, gates, transistors) and the interconnections
between these elements. This representation is also called a netlist. At each stage,
the logic is verified.
Syntax:
module <module_name> (<module_port_list>);
...
<module internals> // contents of the module
...
endmodule
5.3.1 Instances
A module provides a template from where one can create objects. When a module is
invoked Verilog creates a unique object from the template, each having its own
name, variables, parameters and I/O interfaces. These are known as instances.
5.3.2 Ports:
Ports allow communication between a module and its environment.
All but the top-level modules in a hierarchy have ports.
Ports can be associated by order or by name.
You declare ports to be input, output or inout. The port declaration syntax is:
input [range_val:range_var] list_of_identifiers;
output [range_val:range_var] list_of_identifiers;
inout [range_val:range_var] list_of_identifiers;
5.3.3 Identifiers
Identifiers are user-defined words for variables, function names, module
names, and instance names. Identifiers can be composed of letters, digits, and
the underscore character.
The first character of an identifier cannot be a number. Identifiers can be any
length.
Identifiers are case-sensitive, and all characters are significant.
An identifier that contains special characters, begins with numbers, or has the same
name as a keyword can be specified as an escaped identifier. An escaped identifier
starts with the backslash character(\) followed by a sequence of characters, followed
by white space.
5.3.4 Keywords:
Verilog uses keywords to interpret an input file.
You cannot use these words as user variable names unless you use an escaped
identifier.
Keywords are reserved identifiers, which are used to define language
constructs.
Some of the keywords are always, assign, begin, case, end, endcase, etc.
5.3.5 Data Types:
Verilog Language has two primary data types:
Nets - represents structural connections between components.
Registers- represent variables used to store data.
Every signal has a data type associated with it. Data types are:
Explicitly declared with a declaration in the Verilog code.
5.4.4 Switch Level:
This is the lowest level of abstraction. A module can be implemented in terms
of switches, storage nodes and interconnection between them. However, as
has been mentioned earlier, one can mix and match all the levels of
abstraction in a design. RTL is frequently used for Verilog description that is
a combination of behavioral and dataflow while being acceptable for
synthesis.
5.5 OPERATORS
Relational Operators
Bit-wise Operators
Logical Operators
Reduction Operators
Shift Operators
Concatenation Operator
Conditional Operator
Note: If any operand is x or z, then the result of that test is treated as false (0).
Bitwise operators perform a bit-wise operation on two operands. They take
each bit in one operand and perform the operation with the corresponding bit in the
other operand. If one operand is shorter than the other, it will be extended on the left
side with zeroes to match the length of the longer operand.
groups of bits, and treat all values that are nonzero as “1”. Logical operators are
typically used in conditional (if ... else) statements since they work with expressions.
Reduction operators operate on all the bits of an operand vector and return a
single-bit value. These are the unary (one argument) form of the bit-wise operators.
Shift operators shift the first operand by the number of bits specified by the
second operand. Vacated positions are filled with zeros for both left and right shifts
(There is no sign extension).
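The behavior of the reduction and shift operators described above can be mimicked on fixed-width values in Python. This is only an illustrative model with hypothetical helper names; it covers two-state (0/1) values and does not represent x or z.

```python
def mask(width):
    """All-ones value of the given bit width."""
    return (1 << width) - 1


def reduction_and(value, width):
    """1 only if every bit of the vector is 1 (like unary & in Verilog)."""
    return int((value & mask(width)) == mask(width))


def reduction_or(value, width):
    """1 if at least one bit of the vector is 1 (like unary |)."""
    return int((value & mask(width)) != 0)


def reduction_xor(value, width):
    """Parity of the vector: 1 if the number of 1 bits is odd (like unary ^)."""
    return bin(value & mask(width)).count("1") & 1


def shift_left(value, amount, width):
    """Logical left shift; bits shifted past the width are lost."""
    return (value << amount) & mask(width)


def shift_right(value, amount, width):
    """Logical right shift; vacated positions are filled with zeros."""
    return (value & mask(width)) >> amount


vec = 0b1011
print(reduction_and(vec, 4), reduction_or(vec, 4), reduction_xor(vec, 4))
print(format(shift_left(vec, 1, 4), "04b"), format(shift_right(vec, 1, 4), "04b"))
```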
Select File->New Project to create a new project. This will bring up a new project
window (Figure 15) on the desktop. Fill in the necessary entries as follows:
Project Name: Enter a user-defined name for your new project.
Project Location: The directory on one of your drives where you want to store the
new project. In the above window the project is stored on the C drive, which is not
recommended; the software and the code should not be stored in the same
location.
All project files such as schematics, netlists, Verilog files, VHDL files, etc., will be
stored in a subdirectory with the project name.
In order to open an existing project in Xilinx Tools, select File->Open
Project to show the list of projects on the machine. Choose the project you want and
click OK.
Clicking on NEXT on the above window brings up the following window in fig.17:
Select Verilog Module and in the “File Name:” area, enter the name of the Verilog
source file you are going to create. Also make sure that the option Add to project is
selected so that the source need not be added to the project again. Then click on Next
to accept the entries.
In the Port Name column, enter the names of all input and output pins and
specify the Direction accordingly. A Vector/Bus can be defined by entering
appropriate bit numbers in the MSB/LSB columns. Then click on Next> to get a
window showing all the new source information. If any changes are
to be made, just click on <Back to go back and make changes. If everything is
acceptable, click on Finish > Next > Next > Finish to continue.
Once you click on Finish, the source file will be displayed in the sources
window in the Project Navigator.
If a source has to be removed, just right click on the source file in the Sources in
Project window in the Project Navigator and select remove in that. Then select
Project -> Delete Implementation Data from the Project Navigator menu bar to
remove any related files.
Editing the Verilog source file
The source file will now be displayed in the Project Navigator window (Figure 8).
The source file window can be used as a text editor to make any necessary changes
to the source file. All the input/output pins will be displayed. Save your Verilog
program periodically by selecting the File->Save from the menu. You can also edit
Verilog programs in any text editor and add them to the project directory using “Add
Copy Source”.
In the window below, we write the Verilog code for the
specified design and algorithm.
After writing the code, we proceed to the synthesis report.
mark in front of that, otherwise a tick mark will be placed after each of them to
indicate the successful completion
After synthesis, right-click on Synthesize and click View Text Report to generate
the report of our design.
The schematic diagram of the synthesized Verilog code can be viewed by double
clicking View RTL Schematic under the Synthesize-XST menu in the Process Window.
This is a handy way to debug the code if the output is not meeting our
specifications on the prototype board.
Double clicking it opens the top-level module showing only the input(s) and
output(s), as shown in fig.23.
Next click on the ISim simulator and double click on Behavioral Check Syntax
to check for errors. If no errors are found, double click on Simulate Behavioral
Model to get the output waveforms.
After clicking on Simulate Behavioral Model, the following window will appear.
When the simulation window appears, apply the input values using Force Constant,
or Force Clock if the input is a clock. Specify the simulation period and run for
a certain time; the results will appear as shown in the following window. Verify the
results against the given input values.
CHAPTER 6
RESULTS
Fig. 29 shows the RTL schematic of the circuit. It shows the
implementation logic of the circuit, that is, how data flows in and out of the
circuit. The RTL view shows the representation of the design in terms of generic
symbols such as AND gates, OR gates, adders, multipliers, etc.
Fig. 30 shows the Technology Schematic of the circuit.
The Technology view shows the representation of the design in terms of logic
elements such as LUTs, buffers, I/Os, and other technology-specific components.
Viewing this schematic allows us to see a technology-level representation of the
HDL optimized for a specific Xilinx architecture, which might help us
discover design issues early in the design process.
CHAPTER 7
CONCLUSION
Using the proposed 2-bit adder of proposed design 2 (AAd2 in the base
paper), we design an 8-bit approximate multiplier that generates
approximate results and provides better performance in terms of area and delay.
Our proposed approximate multipliers outperform the prior works and the exact
LUT-based multiplier in terms of delay, power, and area.
The multiplication results are approximate, and the design
provides optimized area and delay.
APPENDIX
LIST OF PUBLICATIONS
[1] M. Karthiga and Dr.S. Mohideen Abdul Kadhar (2022) “Design Methodology of
Error Reduced Approximate Adders for FPGAs”, IJCRT, Vol.10, ISSN: 2320-2882,
UGC Approved Journal,IJCRT2208504
REFERENCES: