KARATSUBA ALGORITHM: A PARADIGM SHIFT IN MULTIPLICATION EFFICIENCY
Mrs. C. Anjani (1), R. Kavya (2), S. Poojitha Reddy (3), A. Lasya Priya (4)
(1) Professor, Department of Electronics and Communication Engineering
(2, 3, 4) UG Students, Sridevi Women's Engineering College
(1, 2, 3, 4) Sridevi Women's Engineering College, Hyderabad, Telangana, India
Emails: (1) [email protected], (2) [email protected], (3) [email protected], (4) [email protected]

Abstract:
The Karatsuba algorithm is a fast multiplication algorithm that uses a divide-and-conquer approach to multiply two n-digit numbers. Here, computing the product takes less time than with conventional multiplication. The divide-and-conquer algorithm reduces the multiplication of two n-digit numbers to three multiplications of n/2-digit numbers and, by repeating this reduction, to at most \(n^{\log_2 3} \approx n^{1.58}\) single-digit multiplications. It is therefore asymptotically faster than the traditional algorithm, which performs \(n^2\) single-digit products. The Karatsuba algorithm was the first multiplication algorithm asymptotically faster than the quadratic "grade school" algorithm. Multiplying large numbers efficiently is an important task; however, the traditional, naive way of multiplying numbers involves multiplying each digit in one number by each digit in the second number, requiring \(n^2\) single-digit computations. As the size of the multiplication increases, the time required by the naive method increases dramatically. To overcome this problem, multipliers and dividers are designed using the Karatsuba algorithm. This algorithm can provide high throughput and high efficiency. It also reduces the time complexity from \(O(n^2)\) to \(O(n^{\log_2 3}) \approx O(n^{1.58})\). The multipliers are designed for field-programmable gate arrays (FPGAs). In this paper we propose pipelined soft multipliers using the Karatsuba algorithm. Experimental results obtained with Xilinx Vivado demonstrate the efficiency of the proposed pipelined multipliers.

I. INTRODUCTION

The growing demand for edge computing has become increasingly evident in the era of the Internet of Things (IoT), spanning a wide range of applications, from bio-signals to advanced image processing. Take, for instance, the ubiquitous presence of wearable health monitoring devices, crucial given that 47% of cardiac diseases, the leading cause of death globally, manifest outside of hospital settings. Similarly, Unmanned Aerial Vehicles (UAVs), such as drones, are proliferating across various domains including object/self-tracking, search and surveillance, agricultural operations, wildlife monitoring, and entertainment.

Field-Programmable Gate Arrays (FPGAs), readily accessible in commercial markets, offer a viable substitute for power-intensive Application-Specific Integrated Circuits (ASICs) in implementing these programs. This is primarily due to FPGAs' rapid prototyping capabilities and their adaptability in post-fabrication datapath adjustments, making them an attractive option for such applications. The adaptability of medical device technology, exemplified by its capacity to adjust to the unique physiological characteristics and fluctuations in heart activity of
individual patients, is paramount. Similarly, parallelizable applications handling substantial data volumes frequently opt for methods that enhance throughput and/or minimize power consumption.

While Application-Specific Integrated Circuits (ASICs) offer high power efficiency for implementing such programs, off-the-shelf Field-Programmable Gate Arrays (FPGAs) have emerged as commercially viable alternatives. Their rapid prototyping and post-fabrication datapath versatility make them capable of keeping pace with the rapid evolution of algorithms, which often outstrips hardware updates. Consider, for example, the need for health monitoring devices to adapt to different patients' physiological traits and changes in heart activity. Moreover, there is a significant demand for high throughput and energy efficiency to accelerate parallelizable applications that continuously process large volumes of data.

Challenges in ASIC-oriented design, state-of-the-art approximation, and DSP blocks have led to the predominant design of multipliers using FPGA technology. These challenges encompass several aspects.

Firstly, while approximation approaches tailored for ASIC platforms have shown promising performance gains, directly transferring them to FPGAs proves challenging due to the differing architectural specifications of the two platforms. Secondly, approximation techniques are often applied to individual kernels within multi-kernel applications. For instance, in JPEG compression, approximation may involve replacing multiplication or division operations with imprecise versions specifically in the DCT (quantization) stage. However, the reported performance gains typically focus on these individual kernels rather than considering the impact on the entire end-to-end application implementation. Thirdly, much attention is directed towards optimizing multiplication operations, while division receives comparatively little. This underscores the need for comprehensive optimization strategies that address both multiplication and division operations in FPGA-based designs.

Figure 1: Comparing area, delay and energy of 8-, 16- and 32-bit multipliers and dividers.

Multiplication is a prevalent operation within bio-signal and visual processing workloads, and FPGAs integrate built-in DSP units to expedite it. Nonetheless, there are three potential reasons why DSP blocks might fail to meet design criteria. Firstly, they may lack sufficient processing power for applications necessitating extensive multiplication or parallel operation, owing to their limited ratio compared to Look-up Tables (LUTs). Additionally, DSP blocks are permanently embedded at fixed locations in FPGAs, escalating routing costs and potentially diminishing performance in certain applications. Lastly, DSP blocks provide only fixed-width multiplication (18×18 bits), so they are a poor fit for operations that need a different precision. Notably, prominent FPGA vendors like Xilinx and Intel provide soft Intellectual Property (IP) cores for operations such as arithmetic.

II. RELATED WORK

A. Customized Convolutional Neural Network for FPGA Platforms [3]:
Image search engines, object detection in mobile robot vision, and a myriad of other applications have embraced Convolutional Neural Networks (CNNs) extensively. Furthermore, edge devices heavily rely on specialized hardware accelerators like FPGAs and Application-Specific Integrated Circuits (ASICs) due to the high memory and computational demands of CNN models. Among these options, FPGA accelerators hold an edge owing to their versatility, low power consumption, and rapid development capabilities, surpassing other specialist hardware accelerators for artificial neural networks. Previous efforts in FPGA acceleration designs predominantly focused on configuring hardware to support CNN model structures. In contrast, the authors leverage reinforcement learning to autonomously search for optimal network designs, empowering users to create custom convolutional neural networks tailored to specialized FPGA hardware.

B. A Real-Time FPGA-Based Accelerator for ECG Analysis and Diagnosis Using Association-Rule Mining [4]:

Telemedicine, which utilizes Information and Communication Technology (ICT) to deliver medical care remotely, emerges as a potential solution to the challenges confronting current healthcare systems. These challenges include catering to an aging population, a rising number of patients, and a shortage of qualified medical professionals. With recent advancements in telemedicine, particularly in wearable ECG monitors, there is a growing demand for more sophisticated and precise automated ECG evaluation and diagnostic systems. To address the need for expedited processing and diagnosis of real-time electrocardiogram (ECG) data, the authors propose a streaming architecture leveraging Field-Programmable Gate Arrays (FPGAs). Early diagnosis can be facilitated through association-rule mining, where ECG features are compared against established rules. Their Bit_Q_Apriori hardware-oriented data-mining method aims to enhance processing speed. The adoption of the right implementation promises improved scalability, effectiveness, throughput, and cost-efficiency compared to competing hardware solutions.
C. Utilising template matching and implementing in FPGA for real-time identification of characteristics in ECG Waves [5]:

The electrocardiogram (ECG) holds immense potential in providing crucial clinical insights into cardiac processes. The authors present an algorithmic approach for real-time identification and characterization of wave peaks in one-lead ECG data. Initially, the ECG data undergoes preprocessing to eliminate power line interference and high-frequency noise. Subsequently, a set of rule bases is established based on slope and polarity using the first 6,000 samples, laying the foundation for detecting R-peaks, P-waves, and T-waves in beats. For that study, the Spartan III FPGA from Xilinx was employed to execute the code. To validate the methodology, 8-bit encoded ECG data was transmitted to the FPGA via the computer's parallel port using a parallel transfer mechanism. The measured sensitivities for P-waves, R-waves, and T-waves were 97.58%, 98.4%, and 97.78%, respectively, with the Xilinx FPGA implementation. Among the various wave characteristics detected, including height, polarity, and duration, an average miss rate of 9.3% was attained. In a clinical setting, a medical expert verified the detected wave patterns, emphasizing the reliability and accuracy of the proposed approach.

D. High Performance Accurate and Approximate Multipliers for FPGA-Based Hardware Accelerators [6]:

Multiplication plays a critical role in various fields, including machine learning and image/video processing. To build high-performance multipliers, FPGA providers offer digital signal processing (DSP) blocks. However, there are constraints on the number and placement of these multipliers on FPGAs, which can lead to additional routing delays and inefficiencies, especially for narrower bit widths. As a solution, FPGA manufacturers also provide multiplier soft IP cores. That work argues that, despite their advantages, FPGA soft multiplication IP cores require better designs to achieve high performance with low resource consumption. The proposed generic, area-optimized, low-latency design improves upon existing soft-core designs by leveraging the architectural attributes of FPGAs, such as fast carry chains and look-up table (LUT) structures. When compared to Xilinx's multiplier LogiCORE IP, the unsigned accurate design can reduce LUT usage by up to 53% across various multiplier sizes, while the signed accurate architecture achieves reductions of up to 25%. Additionally, the unsigned approximate multiplier designs maintain output accuracy while reducing critical path delay (CPD) by up to 51% compared to the LogiCORE IP. The proposed multiplier architecture has enhanced the area and performance of accelerators used in image and video applications. The open-source collection of approximate and exact multipliers is available at https://cfaed.tu-dresden.de/pd-downloads. The stated objectives include facilitating result replication, sparking new inquiries within the FPGA community, and encouraging further study and enhancement in this field.

E. Area-Optimized Low-Latency Approximate Multipliers for FPGA-Based Hardware Accelerators [7]:

The performance advantages of employing ASIC approximation techniques in FPGA-based configurable computing systems are limited by architectural differences between ASICs and FPGAs. That paper introduces a comprehensive solution comprising a freely available library, an efficient design methodology, and an innovative
architecture tailored for approximate multipliers optimized specifically for FPGA-based fabrics. The approach not only enhances output accuracy but also achieves improvements in area, delay, and energy consumption compared to existing approximate multipliers based on ASICs. Notably, the proposed method outperforms the Xilinx Vivado multiplier IP, delivering significant energy savings (up to 67%), reduced latency (53%), and a 30% improvement in area utilization, all while maintaining high accuracy (average relative error < 1%). For those interested in contributing to this field or following its ongoing progress, the library of approximate multipliers is accessible online at https://cfaed.tu-dresden.de/pd-downloads. This advancement opens up new avenues for research within the FPGA community.

F. Proposed light-weight error-reduction scheme:

Streamlining the error-reduction categories could address the overflow issue observed in both INZeD and MBM, as well as reduce the excessive parameter count (e.g., 256 in REALM). RAPID, unlike REALM/SIMDive, partitions the square error region into power-of-two sub-regions. Key factors considered in this partitioning include:

1. Opting for four fractional most-significant bits (MSBs) instead of three for enhanced accuracy.
2. Recommending a reduction in the number of partitions while maintaining four MSBs to conserve resources during parameter selection.
3. Minimizing the variance pattern and error volume in each area by optimizing the estimate of the error-magnitude integral.

We derived these methods from Vivado's resource-usage data and error analysis, illustrated in Figure 2. Our primary research objective has been to find optimal partitioning strategies that minimize the average absolute relative error (ARE). Specifically:

- For 3-coefficient multiplier methods, aggregated sub-region inaccuracies are kept below preset thresholds (e.g., 3% for 5-coefficient schemes and 2.5% for 10-coefficient schemes).
- Error-reduction ratios for each set of sub-intervals are determined using mathematical methods described in [45].

Table I presents the proposed binary multiplier and divider coefficients. Partitioning is implemented using small multiplexers designed in HDL, with the complexity of conditional statements affecting LUT consumption. We introduce three methods to decrease this complexity, limiting the number of coefficients per method to 10.

Additionally, simplifying conditional statements by comparing only the four most significant bits of the fractional parts during division further reduces complexity. Each 6-LUT functions as a 4:1 MUX in hardware, so a 16:1 multiplexer requires one FPGA slice containing four 6-LUTs.

The proposed partitioning mechanism, based on MUXes, maintains scalability compared to REALM and SIMDive, as the resource cost does not increase exponentially with the coefficient count. Our approach demonstrates superior resource-error trade-offs compared to state-of-the-art methods. With ten error coefficients and four MSBs, our method outperforms SIMDive/REALM in terms of LUT usage and achieves an average relative error (ARE) of 0.6%. Refer to Table III for detailed comparisons.
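The error-reduction schemes discussed above correct the error of a Mitchell-style logarithmic multiplier. To make the ARE metric concrete, the following Python sketch models an uncorrected Mitchell multiplier and measures its ARE over all 8-bit operand pairs. It is an illustration only: the function names `mitchell_mul` and `average_relative_error` are hypothetical, the model assumes the standard approximation log2(1 + x) ≈ x, and it does not reproduce RAPID's coefficients, which are not listed in this text.

```python
def mitchell_mul(a: int, b: int) -> int:
    """Approximate a*b for non-negative integers using Mitchell's method."""
    if a == 0 or b == 0:
        return 0
    # Leading-one positions: a = 2^k1 * (1 + x1), b = 2^k2 * (1 + x2)
    k1, k2 = a.bit_length() - 1, b.bit_length() - 1
    # Fractional parts as integers: x1 = f1 / 2^k1, x2 = f2 / 2^k2
    f1, f2 = a - (1 << k1), b - (1 << k2)
    # x1 + x2 written over the common denominator 2^(k1 + k2)
    frac_sum = (f1 << k2) + (f2 << k1)
    one = 1 << (k1 + k2)
    if frac_sum < one:
        return one + frac_sum   # 2^(k1+k2) * (1 + x1 + x2)
    return 2 * frac_sum         # 2^(k1+k2+1) * (x1 + x2)


def average_relative_error(bits: int) -> float:
    """ARE of mitchell_mul over all non-zero operand pairs of the given width."""
    total, count = 0.0, 0
    for a in range(1, 1 << bits):
        for b in range(1, 1 << bits):
            exact = a * b
            total += abs(exact - mitchell_mul(a, b)) / exact
            count += 1
    return total / count


print(f"ARE of uncorrected 8-bit Mitchell multiplier: {average_relative_error(8):.4f}")
```

Uncorrected Mitchell multiplication never overestimates the product, and its worst case is a relative error of 1/9 (for example, 3 × 3 yields 8 instead of 9). Coefficients indexed by the MSBs of the fractional parts are meant to shrink exactly this error.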
Figure 2: Proposed error reduction schemes of RAPID for multiplication and division based on MSBs of fractional parts.

Table 1: Binary representation of error reduction coefficients in 16-bit multiplier and divider.

III. MITCHELL'S APPROXIMATE ALGORITHM

Mitchell's algorithm for multiplying two numbers using logarithms is straightforward. The logarithms of the input numbers are added and the antilogarithm of the sum is determined. The method used to find the logarithm and the antilogarithm impacts the accuracy. Mitchell presented a simple method to approximate the logarithm and antilogarithm calculations: for a number \(N = 2^{k}(1 + x)\) with \(0 \le x < 1\), the logarithm is approximated as \(\log_2 N \approx k + x\).

The existing units efficiently utilize 6-input Look-up Tables (6-LUTs) and fast carry chains to implement Mitchell's approximate algorithms. To address the first challenge, LeAp was introduced as the first logarithmic multiplier tailored specifically for FPGAs. LeAp's design is motivated by the translation of multiplication operations into addition within the logarithmic domain, achieved through Mitchell's algorithm.

Figure 3: Overall structure of multiplier and divider using Mitchell's algorithm

In our FPGA-customized approach, the leading-one detector (LOD) computation relies on 4-bit LODs, configured directly within LUTs. It works as follows:

1. Zero Detection and Leading-One Detection (LOD): Each 4-bit segment of the operands undergoes simultaneous analysis. One LUT serves as a logical OR function to detect the presence of a '1' in the segment (acting as a zero-detection flag). Another 6-LUT is configured as two 5-LUTs to determine the position of the leading one in the 4-bit segment (LOD4-LUT). The resulting bits from these LUTs determine the position of the leading one in the most significant non-zero group through priority logic.
2. Extension to Larger LODs: Similar methods are applied for 16- and 32-bit LODs. For example, in a 16-bit LOD, if the upper half of the operand is zero, the LOD is equal to the lower 8-bit LOD. Otherwise, the position of the leading one is calculated from the upper half.
3. LOD Step Orchestration: In our LeAp architecture, LOD steps are orchestrated through a Finite State Machine (FSM) and executed in at most five clock cycles. To maintain efficiency and minimize registers, the LOD implementation is realized as combinational logic, with critical path analysis guiding balanced partitioning for pipelining.
4. Integer Parts Addition: Each 4-bit addition is handled by one Virtex-7 slice, comprising four 6-LUTs and associated fast carry chains, forming a Carry Look-Ahead Adder (CLA). Extending to 8-bit additions involves
connecting the carry-out from a prior slice to the carry-in of the next.
5. LUT-Optimized Ternary Addition: We optimize ternary addition by configuring FPGA LUTs and carry chain primitives to implement a ternary adder. This aligns with our error reduction approach, allowing the addition of error reduction coefficients alongside fractional parts in a single step, minimizing resource usage. Unlike other methods, where an additional circuit is required to add error-reduction terms to Mitchell's circuits, our approach integrates this process directly, leveraging fixed FPGA primitive delays without additional overhead.

This streamlined approach ensures efficient computation and resource utilization, critical for FPGA-based systems.

In our LeAp approach [17], we focus solely on reducing error factors based on fractional bits, unlike MBM/INZeD [20, 16], where LUT-optimized ternary addition considers the interim outcome of Mitchell Mul/Div. Fortunately, the Xilinx UNISIM library provides the necessary LUTs and fast carry chains for constructing a ternary adder [59]. By carefully configuring FPGA logic units and carry chain elements, we have transformed them into a ternary adder. This enables us to simultaneously incorporate fractional components and error-reduction coefficients while maintaining the same resource footprint, aligning with our error-reduction methodology. At the end of the ternary adder chain, an additional LUT is needed when the sum frac1_i + frac2_i + error coefficient_i + Cin (carry-out from bit to bit) results in three bits. However, compared to the raw version, only one extra bit is required at the most significant bit (MSB) position [19]. The fixed delay of FPGA primitives eliminates the need for extra design effort to integrate the error-reduction term with the fractional components. In contrast, in REALM [45], MBM [20], and INZeD [16], Mitchell's circuit cannot accommodate the error-reduction parameter, or half of it, without relying on an additional circuit that operates based on the intermediate addition/subtraction of fractional parts.

Figure 4: 2-, 3-, and 4-stage pipelined model of multipliers and dividers

IV. KARATSUBA ALGORITHM

The main idea of the Karatsuba algorithm is to reduce the four sub-multiplications of a divide-and-conquer product to three. Additions and subtractions are performed for the remaining computations. For this algorithm, two n-digit numbers are taken as the input and the product of the two numbers is obtained as the output.

The Karatsuba algorithm is a recursive algorithm, since it calls smaller instances of itself during execution. It calls itself only three times on n/2-digit numbers in order to obtain the final product of two n-digit numbers. Let T(n) denote the number of single-digit multiplications required to multiply two n-digit numbers.
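As a worked step, the count T(n) can be derived from the three recursive calls on half-size operands. The derivation below assumes that n is a power of two and that only single-digit products are counted (additions, subtractions and shifts are ignored).

```latex
\[
\begin{aligned}
T(n) &= 3\,T\!\left(\tfrac{n}{2}\right), \qquad T(1) = 1 \\
     &= 3^{2}\,T\!\left(\tfrac{n}{4}\right) = \dots = 3^{k}\,T\!\left(\tfrac{n}{2^{k}}\right) \\
     &= 3^{\log_2 n} \qquad (k = \log_2 n) \\
     &= n^{\log_2 3} \approx n^{1.585}
\end{aligned}
\]
```

The last step uses the identity \(a^{\log_2 b} = b^{\log_2 a}\). The schoolbook method, by comparison, needs \(n^2\) single-digit products.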
Figure 5: Block diagram of multiplier using Karatsuba algorithm

Assume A and B are the two inputs of n bits each. A and B are each divided into two segments, AH, AL and BH, BL. Here AH and BH are the higher-order bits, and AL and BL are the lower-order bits.

\[ AB = (2^{n/2}A_H + A_L)(2^{n/2}B_H + B_L) = 2^{n}A_H B_H + 2^{n/2}(A_H B_L + A_L B_H) + A_L B_L \]

By the Karatsuba multiplier algorithm,

\[ A_H B_L + A_L B_H = (A_H + A_L)(B_H + B_L) - A_H B_H - A_L B_L \]

Therefore, four n/2-bit multiplications are reduced to three n/2-bit multiplications. The time complexity of the Karatsuba multiplication algorithm is \(O(n^{\log_2 3}) \approx O(n^{1.58})\).

A. STEPS INVOLVED IN MULTIPLICATION THROUGH KARATSUBA ALGORITHM

Step 1: Assume n is a power of 2.

Step 2: If n equals 1, use multiplication tables to compute P = AB.

Step 3: If n is greater than 1, split the n-digit numbers in half and represent them using the formulas:

\[ A = 10^{\frac{n}{2}}A_1 + A_2 \]
\[ B = 10^{\frac{n}{2}}B_1 + B_2 \]

where A1, A2, B1, and B2 each have n/2 digits.

Step 4: Compute variables U, V, and W as follows:

\[ U = A_1 B_1 \]
\[ V = A_2 B_2 \]
\[ W = (A_1 + A_2)(B_1 + B_2) \]
\[ Z = W - (U + V) \]

Step 5: Obtain the product P by substituting the values into the formula

\[ P = 10^{n}U + 10^{\frac{n}{2}}Z + V \]
\[ P = 10^{n}(A_1 B_1) + 10^{\frac{n}{2}}(A_1 B_2 + A_2 B_1) + A_2 B_2 \]

Step 6: Recursively call the algorithm by passing the sub-problems (A1, B1), (A2, B2), and (A1 + A2, B1 + B2) separately. Store the returned values in variables U, V, and W, respectively.

In this paper, the performance of the Karatsuba algorithm is investigated for multiplicands and multipliers of 4-, 8-, 16- and 32-bit length. Moreover, the performance of the Karatsuba algorithm is analyzed in terms of the number of multiplications and the total process time.
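The recursive steps above can be sketched in Python as follows. This is an illustrative software model, not the authors' Vivado implementation: it works in base 2 (matching the AH/AL split of the block diagram rather than the decimal form of Steps 3 to 5), and the names `karatsuba`, `counter` and `count_multiplications` are hypothetical. Operand sums that overflow n/2 bits are simply passed to the next level, where a hardware design would need an (n/2 + 1)-bit multiplier.

```python
def karatsuba(a: int, b: int, n: int, counter: list) -> int:
    """Multiply non-negative integers a and b, treated as n-bit operands.

    n must be a power of two (Step 1). counter[0] accumulates the number
    of base-case multiplications performed.
    """
    if n == 1:
        # Step 2: base case. After carries from the sums below, an operand
        # may be slightly wider than one bit; the product is still exact.
        counter[0] += 1
        return a * b

    half = n // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask   # Step 3: A = 2^(n/2)*AH + AL
    b_hi, b_lo = b >> half, b & mask   #         B = 2^(n/2)*BH + BL

    # Step 6: three recursive half-size products instead of four.
    u = karatsuba(a_hi, b_hi, half, counter)                # AH*BH
    v = karatsuba(a_lo, b_lo, half, counter)                # AL*BL
    w = karatsuba(a_hi + a_lo, b_hi + b_lo, half, counter)  # (AH+AL)*(BH+BL)

    z = w - (u + v)                    # Step 4: AH*BL + AL*BH
    return (u << n) + (z << half) + v  # Step 5: P = 2^n*U + 2^(n/2)*Z + V


def count_multiplications(n: int) -> int:
    """Return the base multiplications used for one n-bit by n-bit product."""
    counter = [0]
    a = b = (1 << n) - 1               # all-ones operands exercise the carries
    assert karatsuba(a, b, n, counter) == a * b
    return counter[0]


for bits in (4, 8, 16, 32):
    # Columns: bit length, Karatsuba base multiplications, classical n^2
    print(bits, count_multiplications(bits), bits * bits)
```

For the 4-, 8-, 16- and 32-bit operands studied here, the model uses 9, 27, 81 and 243 base multiplications, against 16, 64, 256 and 1024 for the classical method. These counts ignore the extra additions and subtractions that Karatsuba introduces.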
The applications used for performance analysis are implemented using Vivado.

With the Karatsuba algorithm, the number of multiplications increases with the bit length. In addition, as the number of multiplications rises, the amount of hardware increases. Therefore, the cost of performing the multiplication operation rises. Compared with the classical multiplication method, the Karatsuba algorithm requires fewer multiplications.

Figure 6: Performance analysis of Karatsuba algorithm in terms of the number of multiplications for different bit lengths.

The performance of the Karatsuba algorithm in terms of the total process time for different bit lengths is analyzed as shown in Figure 7.

Figure 7: Performance analysis of Karatsuba algorithm in terms of the total process time for different bit lengths.

The graph illustrates that as the bit length increases, the total processing time also rises. This trend occurs because the number of necessary multiplications grows with the bit length. Furthermore, the total processing time is inversely related to the processing speed: as the former increases, the latter decreases due to the slowdown in multiplication. Compared with the classical multiplication method, the Karatsuba algorithm demonstrates superior performance in terms of total processing time.

V. RESULTS

Figure 8: Simulation results of multipliers using Mitchell's algorithm
Figure 9: RTL view of RAPID

Figure 10: Power, area, and timing of RAPID.

Figure 11: Power, area, and timing of proposed system.

VI. CONCLUSION

In our study, employing fine-grain pipelining, we introduce a pioneering design for an approximate multiplier using the Karatsuba algorithm. Compared to current approaches, which are costly and inefficient, this design incorporates innovative error-reduction techniques, achieving an accuracy of 99.4 percent. For instance, when contrasted with pipelined accurate IPs, the pipelined multiplication and division operations could potentially reduce LUT usage by 36%, enhance performance per watt by 2.3 times, and boost throughput by up to 3.3 times.

Through comprehensive end-to-end testing, the design demonstrates significant enhancements in various applications, such as heartbeat detection (35% improvement), JPEG image compression (33% improvement), and Harris corner detection (45% improvement), across delay, area, and Area-Delay Product (ADP), respectively, relative to accurate kernels and without compromising Quality of Results (QoR). The pipelined design presents an excellent opportunity to expedite the execution of diverse applications that operate on data streams and continuously process vast amounts of data.

Our primary aim is to evaluate the performance of the pipelined mode in various environments, including neural networks, which offer opportunities for SIMD and pipelining. Addressing data dependencies sequentially poses a challenge, often only partially mitigated by processors' out-of-order execution, which fails to fully exploit the potential of pipelining. Consequently, we are developing improved pipelined divider and multiplier implementations to resolve data dependencies and facilitate internal data transfers more efficiently. Notably, intra-unit bypassing would yield faster execution with reduced overhead. Additionally, we aim to create an ALU and
assess its effectiveness in the datapath of soft-core CPUs such as RISC-V.

One promising application is the mantissa multiplier/divider, where division delay can be up to 35 times longer than that of addition operations, consuming over 95% of the floating-point unit's area and power. The surge in popularity of this technology is largely attributed to its widespread adoption in 3D graphics software.

REFERENCES

[1] World Health Organisation. 2018. Cardiovascular diseases (CVDs). https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds). (2018).
[2] P. Kostic. 2017. Heart Disease and Early Heart Attack Care. https://www.bnl.gov/hr/occmed/hpp/linkable_files/pdf/EarlyHeartAttackSymptoms.pdf. (2017).
[3] Y. Yang et al. 2019. FPNet: Customized Convolutional Neural Network for FPGA Platforms. In IEEE International Conference on Field-Programmable Technology (ICFPT).
[4] X. Gu et al. 2016. A Real-Time FPGA-Based Accelerator for ECG Analysis and Diagnosis Using Association-Rule Mining. ACM Transactions on Embedded Computing Systems (TECS), 15, 2.
[5] H.K. Chatterjee et al. 2015. Real-time detection of electrocardiogram wave features using template matching and implementation in FPGA. International Journal of Biomedical Engineering and Technology (IJBET), 17, 3.
[6] S. Ullah et al. 2021. High Performance Accurate and Approximate Multipliers for FPGA-based Hardware Accelerators. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD).
[7] S. Ullah et al. 2018. Area-Optimized Low-Latency Approximate Multipliers for FPGA-Based Hardware Accelerators. In IEEE/ACM Design Automation Conference (DAC).
[8] I. Kuon and J. Rose. 2007. Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 26, 2.
[9] A. Boutros et al. 2018. Embracing Diversity: Enhanced DSP Blocks for Low-Precision Deep Learning on FPGAs. In IEEE International Conference on Field Programmable Logic and Applications (FPL).
[10] S. Lee et al. 2019. Double MAC on a DSP: Boosting the Performance of Convolutional Neural Networks on FPGAs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 38, 5.
[11] Xilinx. 2015. LogiCORE IP Multiplier v12.0. https://www.xilinx.com/support/documentation/ip_documentation/mult_gen/v12_0/pg108-mult-gen.pdf. (2015).
[12] Xilinx. 2016. LogiCORE IP Divider v5.1. https://www.xilinx.com/support/documentation/ip_documentation/div_gen/v5_1/pg151-div-gen.pdf. (2016).