PARADIGM SHIFT IN MULTIPLICATION EFFICIENCY
Mrs. C. Anjani¹, R. Kavya², S. Poojitha Reddy³, A. Lasya Priya⁴
¹ Professor in Department of Electronics and Communication Engineering
²,³,⁴ UG Students of Sridevi Women’s Engineering College
¹,²,³,⁴ Sridevi Women’s Engineering College, Hyderabad, Telangana, India
¹ [email protected], ² [email protected], ³ [email protected], ⁴ [email protected]
[...] time taken by a normal multiplication. The divide-and-conquer algorithm reduces the multiplication of two n-digit numbers to three multiplications of n/2-digit numbers and, by repeating this reduction, to at most n^(log2 3) ≈ n^1.58 single-digit multiplications. It is therefore asymptotically faster than the traditional algorithm, which performs n^2 single-digit products. The Karatsuba algorithm was the first multiplication algorithm asymptotically faster than the quadratic "grade school" algorithm. Multiplying large numbers efficiently is an important task; however, the traditional, naive way of multiplying numbers involves multiplying each digit in one number by each digit in the second number, requiring n^2 single-digit computations. As the size of the multiplication increases, the time required by the naive method increases dramatically. To overcome this problem, multipliers and dividers are designed using the Karatsuba algorithm. This algorithm can provide high throughput and high efficiency. It can also reduce the time complexity from O(n^2) to O(n^(log2 3)) ≈ O(n^1.58). The multipliers are designed using a field-programmable gate array (FPGA). In this paper we propose pipelined soft multipliers using the Karatsuba algorithm. Experimental results obtained with Xilinx Vivado demonstrate the efficiency of the proposed pipelined multipliers using the Karatsuba algorithm.

I. INTRODUCTION
The growing demand for edge computing has [...] health monitoring devices, crucial given that 47% of cardiac diseases, the leading cause of death globally, manifest outside of hospital settings. Similarly, Unmanned Aerial Vehicles (UAVs), such as drones, are proliferating across various domains including object/self-tracking, search and surveillance, agricultural operations, and entertainment.
Various sectors, including entertainment, agriculture, search and surveillance, object/self-tracking, and wildlife monitoring, witness a surge in the utilization of drones and other UAVs. Field-Programmable Gate Arrays (FPGAs), readily accessible in commercial markets, offer a viable substitute for power-intensive Application-Specific Integrated Circuits (ASICs) in implementing these programs. This is primarily due to FPGAs' rapid prototyping capabilities and their adaptability in post-fabrication datapath adjustments, making them an attractive option for such applications. The adaptability of medical device technology, exemplified by its capacity to adjust to the unique physiological characteristics and fluctuations in heart activity of
individual patients, is paramount. Similarly, parallelizable applications handling substantial data volumes frequently opt for methods that enhance throughput and/or minimize power consumption.
[...] for health monitoring devices to adapt to different patients' physiological traits and changes in heart activity.
Moreover, there is a significant demand for high throughput and energy efficiency to accelerate parallelizable applications that continuously process large volumes of data.
Challenges in ASIC, state-of-the-art approximation, and DSP have led to the predominant design of multipliers using FPGA technology. These [...]
[...] the DCT (quantization) stage. However, the reported performance gains typically focus on these individual kernels rather than considering the impact on the entire end-to-end application implementation. Thirdly, although much attention is directed towards optimizing multiplication operations [...]. This underscores the need for comprehensive optimization strategies that address both multiplication and [...]
Multiplication stands as a prevalent operation within bio-signal or visual processing workloads, and FPGAs integrate built-in DSP units to expedite this process. Nonetheless, there exist three potential reasons why DSP blocks might fail to meet design criteria. Firstly, they may lack sufficient processing power for applications necessitating extensive multiplication or parallel operation, owing to their limited ratio compared to Look-up Tables (LUTs). Additionally, DSP blocks are permanently embedded in FPGAs, escalating routing costs and
Image search engines, object detection in mobile robot vision, and a myriad of other applications have embraced Convolutional Neural Networks (CNNs) extensively. Furthermore, edge devices heavily rely on specialized hardware accelerators like FPGAs and Application-Specific Integrated Circuits (ASICs) due to the high memory and computational demands of CNN models. Among these options, FPGA accelerators hold an edge owing to their versatility, low power consumption, and rapid development capabilities, surpassing other specialist hardware accelerators for artificial neural networks. Previous efforts in FPGA acceleration designs predominantly focused on configuring hardware to support CNN model structures. In contrast, our approach leverages reinforcement learning to autonomously search for optimal network designs, empowering users to create custom convolutional neural networks tailored to specialized FPGA hardware.
B. An Accelerator for Real-Time ECG Research and Diagnosis Utilizing Association-Rule Filtering on an FPGA:
Telemedicine, which utilizes Information and Communication Technology (ICT) to deliver medical care remotely, emerges as a potential solution to the challenges confronting current healthcare systems. These challenges include catering to an aging population, a rising number of patients, and a shortage of qualified medical professionals. With recent advancements in telemedicine, particularly in wearable ECG monitors, there is a growing demand for more sophisticated and precise automated ECG evaluation and diagnostic systems. To address the need for expedited processing and diagnosis of real-time electrocardiogram (ECG) data, we propose a streaming architecture leveraging Field-Programmable Gate Arrays (FPGAs). Early diagnosis can be facilitated through association-rule mining, where ECG features are compared against established rules. Our suggested Bit_Q_Apriori hardware-oriented data-mining method aims to enhance processing speed. The adoption of the right implementation promises improved scalability, effectiveness, throughput, and cost-efficiency compared to competing hardware solutions.
C. Utilising template matching and implementing in FPGA for real-time identification of characteristics in ECG Waves:
The electrocardiogram (ECG) holds immense potential in providing crucial clinical insights into cardiac processes. We present an algorithmic approach for real-time identification and characterization of wave peaks in one-lead ECG data. Initially, the ECG data undergoes preprocessing to eliminate power line interference and high-frequency noise. Subsequently, a set of rule bases is established based on slope and polarity using the first 6,000 samples, laying the foundation for detecting R-peaks, P-waves, and T-waves in beats. For this study, we employed the Spartan III FPGA from Xilinx to execute the code. To validate the methodology, 8-bit encoded ECG data was transmitted to the FPGA via the computer's parallel port using a parallel transfer mechanism. The measured sensitivities for P-waves, R-waves, and T-waves were 97.58%, 98.4%, and 97.78%, respectively, owing to [...]
[...] including machine learning and image/video processing. To build high-performance multipliers, FPGA providers offer digital signal processing (DSP) blocks. However, there are constraints on the number and placement of these multipliers on FPGAs, which can lead to additional routing [...]
[...] architecture tailored for approximate multipliers optimized specifically for FPGA-based fabrics. Our approach not only enhances output accuracy but also achieves improvements in area, delay, and energy consumption compared to existing approximate multipliers based on ASICs. Notably, our proposed method outperforms the Xilinx Vivado multiplier IP, delivering significant energy savings (up to 67%), reduced latency (53%), and a 30% improvement in area utilization, all while maintaining high accuracy (average relative error < 1%). For those interested in contributing to this field or witnessing its ongoing progress, our library of approximate multipliers is accessible online at https://cfaed.tudresden.de/pd-downloads. This advancement opens up new avenues for research within the FPGA community.
F. Proposed light-weight error-reduction scheme:
Streamlining the error-reduction categories could address the overflow issue observed in both INZeD and MBM, as well as reduce the excessive parameter count (e.g., 256 in REALM). RAPID, unlike REALM/SIMDive, allocates the squared-off area among power-of-two combinations. Key factors considered in this partitioning include:
1. Opting for four fractional most significant bits (MSBs) instead of three for enhanced accuracy.
2. Recommending a reduction in the number of partitions while maintaining four MSBs to conserve resources during parameter selection.
3. Minimizing variance pattern and error volume in each area by optimizing the estimate of the error-magnitude integral.
[...] error (ARE). Specifically:
- For 3-coefficient multiplier methods, aggregated sub-region inaccuracies are kept below preset thresholds (e.g., 3% for 5-coefficient schemes and 2.5% for 10-coefficient schemes).
- Error-reduction ratios for each set of sub-intervals are determined using mathematical methods described in [45].
Table I presents the proposed binary multiplier and divider coefficients. Partitioning is implemented using small multiplexers designed in HDL, with the complexity of conditional statements affecting LUT consumption. We introduce three methods to decrease this complexity, limiting the number of coefficients per method to 10.
Additionally, simplifying conditional statements by comparing only the four most significant bits of the fractional sections during division further reduces complexity. Each 6-LUT functions as a 4:1 MUX in hardware, so a 16:1 multiplexer requires one FPGA slice containing four 6-LUTs.
The proposed partitioning mechanism, based on MUXes, maintains scalability compared to REALM and SIMDive, as the resource cost does not increase exponentially with the coefficient count. Our approach demonstrates superior resource-error trade-offs compared to state-of-the-art methods. With ten error coefficients and four MSBs, our method outperforms SIMDive/REALM in terms of LUT usage and achieves an average relative error (ARE) of 0.6%. Refer to Table III for detailed comparisons.
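The idea of correcting a logarithmic multiplier with a coefficient chosen by the MSBs of the fractional parts can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions: it uses floating-point arithmetic, the function names (`mitchell_multiply`, `mean_relative_error`) are hypothetical, and the coefficients are taken as the Mitchell error at each region's midpoint rather than the values of Table I or the power-of-two partitioning described above.

```python
def _split(value):
    """Return (k, x) with value = 2**k * (1 + x) and 0 <= x < 1."""
    k = value.bit_length() - 1
    return k, value / (1 << k) - 1.0


def _mitchell_error(x1, x2):
    """Exact error of Mitchell's product, in units of 2**(k1 + k2)."""
    if x1 + x2 < 1.0:
        return x1 * x2
    return (1.0 - x1) * (1.0 - x2)


def mitchell_multiply(a, b, msbs=0):
    """Approximate a * b for positive integers.

    msbs = 0 gives plain Mitchell multiplication. msbs > 0 adds one
    correction coefficient per region; the region is selected by the top
    `msbs` bits of each fractional part (assumed scheme, midpoint values).
    """
    k1, x1 = _split(a)
    k2, x2 = _split(b)
    s = x1 + x2
    # Mitchell: log2(1 + x) is approximated by x, and so is the antilog.
    mantissa = 1.0 + s if s < 1.0 else 2.0 * s
    if msbs > 0:
        regions = 1 << msbs
        # Region midpoints stand in for the exact fractions.
        m1 = (int(x1 * regions) + 0.5) / regions
        m2 = (int(x2 * regions) + 0.5) / regions
        mantissa += _mitchell_error(m1, m2)
    return mantissa * (1 << (k1 + k2))


def mean_relative_error(lo, hi, msbs):
    """Average relative error over all operand pairs in [lo, hi)."""
    total = 0.0
    count = 0
    for a in range(lo, hi):
        for b in range(lo, hi):
            exact = a * b
            total += abs(mitchell_multiply(a, b, msbs) - exact) / exact
            count += 1
    return total / count
```

For example, `mitchell_multiply(3, 3)` returns 8.0 against the exact product 9, which is the worst case of plain Mitchell multiplication (about 11.1%). Comparing `mean_relative_error(32, 128, 0)` with `mean_relative_error(32, 128, 4)` shows how much a region-selected coefficient reduces the average error in this model.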
Figure 2: Proposed error reduction schemes of RAPID for multiplication and division based on MSBs of fractional parts.
Figure 3: Overall structure of multiplier and divider using Mitchell’s algorithm
In our FPGA-customized approach, the LOD computation relies on 4-bit LODs, configured directly within LUTs. Here's how it works:
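One way to picture a leading-one detector (LOD) assembled from 4-bit LODs is the software model below. It is a minimal sketch under assumptions made here (a 16-bit operand by default, and the hypothetical helper names `lod4` and `leading_one`); it models the behaviour only and does not describe the actual LUT configuration of the design.

```python
def lod4(nibble):
    """4-bit leading-one detector: returns (found, position) for a value 0..15."""
    for position in (3, 2, 1, 0):
        if (nibble >> position) & 1:
            return True, position
    return False, 0


def leading_one(value, width=16):
    """Position of the most significant set bit, built from per-nibble lod4 results."""
    if width % 4 != 0:
        raise ValueError("width must be a multiple of 4")
    if not 0 < value < (1 << width):
        raise ValueError("value must be non-zero and fit in `width` bits")
    # Scan nibbles from most to least significant; the first hit wins.
    for index in range(width // 4 - 1, -1, -1):
        found, position = lod4((value >> (4 * index)) & 0xF)
        if found:
            return 4 * index + position
    raise AssertionError("unreachable: value is non-zero")
```

In hardware the per-nibble detectors would operate in parallel and a small selector would pick the highest nibble that reports a hit; the loop above only mimics that priority order.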
[...] connecting the carry-out from a prior slice to the carry-in of the next.
[...] ternary addition by configuring FPGA LUTs and carry chain primitives to implement a ternary adder. This aligns with our error reduction approach, allowing the [...]
This streamlined approach ensures efficient computation [...]
[...] fractional components. In contrast, in REALM [45], MBM [20], and INZeD [16], Mitchell's [...] parameter, or half of it, without relying on an additional circuit that operates based on the intermediate addition/subtraction of fractional [...]
Figure 4: 2-, 3-, and 4-stage pipelined model of multipliers and dividers
\[ A = 10^{\frac{n}{2}} A_1 + A_2 \]
\[ B = 10^{\frac{n}{2}} B_1 + B_2 \]
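The two decompositions above split each n-digit operand into a high half and a low half. As a minimal software sketch (not the hardware design of this paper), the function below applies that split recursively and forms the product from three half-size multiplications, A_1 B_1, A_2 B_2, and (A_1 + A_2)(B_1 + B_2). The function name `karatsuba` and the use of Python integers are assumptions made for illustration.

```python
def karatsuba(a, b):
    """Multiply two non-negative integers using three half-size products."""
    # Base case: a single-digit operand is multiplied directly.
    if a < 10 or b < 10:
        return a * b

    # Split position: half the digit count of the longer operand.
    n = max(len(str(a)), len(str(b)))
    half = n // 2
    base = 10 ** half

    # A = base * A1 + A2 and B = base * B1 + B2.
    a1, a2 = divmod(a, base)
    b1, b2 = divmod(b, base)

    high = karatsuba(a1, b1)  # A1 * B1
    low = karatsuba(a2, b2)   # A2 * B2
    # (A1 + A2)(B1 + B2) - A1*B1 - A2*B2 = A1*B2 + A2*B1
    cross = karatsuba(a1 + a2, b1 + b2) - high - low

    return high * base * base + cross * base + low
```

The cross term is the step that saves a multiplication: a direct expansion of A times B needs four half-size products, whereas here the two mixed products are recovered from a single one plus additions and subtractions.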
The applications used for performance analysis are implemented using Vivado. Under the Karatsuba algorithm, the number of multiplications grows with the bit length. In addition, as the number of multiplications rises, the amount of hardware increases, and so does the cost of performing the multiplication. Even so, the Karatsuba algorithm requires fewer multiplications than the classical multiplication method. The performance of the Karatsuba algorithm in terms of the total process time for different bit lengths is analyzed as shown in Fig. 7.
Figure 7: Performance analysis of Karatsuba algorithm in terms of the total process time for different bit lengths.
The graph illustrates that as the bit length increases, the total processing time also rises. This trend occurs because the number of necessary multiplications escalates in tandem with the bit length. Furthermore, the total processing time inversely correlates with the processing speed; as the former increases, the latter decreases due to the slowdown in multiplication. When juxtaposed with the classical multiplication method, the Karatsuba algorithm demonstrates superior performance in terms of total processing time.
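The gap between the two methods can be made concrete with an idealized count of single-digit multiplications. The sketch below is an assumption-laden model, not the measured Vivado timing: it counts n^2 products for the classical method and three half-size sub-problems per level for Karatsuba, ignoring the extra carry digit in the (A_1 + A_2)(B_1 + B_2) term and all additions. The function names are hypothetical.

```python
def classical_count(n):
    """Single-digit products needed by the grade-school method for n digits."""
    return n * n


def karatsuba_count(n):
    """Idealized single-digit products for Karatsuba on n-digit operands."""
    if n <= 1:
        return 1
    # Three sub-problems of about half the size (rounded up).
    return 3 * karatsuba_count((n + 1) // 2)


if __name__ == "__main__":
    for digits in (4, 8, 16, 32, 64):
        print(digits, classical_count(digits), karatsuba_count(digits))
```

For 64-digit operands this model gives 729 products for Karatsuba against 4096 for the classical method, in line with the growth rates O(n^1.58) and O(n^2).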
V. RESULTS
[...] approaches, which are costly and inefficient, this approach incorporates innovative error-reduction technologies, achieving an accuracy of 99.4 percent. For instance, when contrasted with pipelined accurate IPs, these pipelined multiplication and division operations could potentially reduce LUT usage by 36%, enhance performance per watt by 2.3 times, and boost throughput by up to 3.3 times.
[...] assess its effectiveness in the data-path of soft CPUs like RISC-V.
One promising application is the mantissa multiplier/divider, where division delay can be up to 35 times longer than addition operations, consuming over 95% of the floating-point unit's space and power. The surge in popularity of this technology is largely attributed to its widespread adoption in 3D graphics software.
REFERENCES
[1] World Health Organisation. 2018. Cardiovascular diseases (CVDs). https://www.who.int/ news - room/ fact - sheets/ detail/cardiovascular - diseases-(cvds). (2018).
[2] P. Kostic. 2017. Heart Disease and Early Heart Attack Care. https : / / www . bnl . gov / hr / occmed / hpp / linkable files / pdf / EarlyHeartAttackSymptoms.pdf. (2017).
[3] Y. Yang et al. 2019. FPNet: Customized Convolutional Neural Network for FPGA Platforms. In IEEE International Conference on Field-Programmable Technology (ICFPT).
[4] X. Gu et al. 2016. A Real-Time FPGA-Based Accelerator for ECG Analysis and Diagnosis Using Association-Rule Mining. ACM Transactions on Embedded Computing Systems (TECS), 15, 2.
[5] H.K. Chatterjee et al. 2015. Real-time detection of electrocardiogram wave features using template matching and implementation in FPGA. International Journal of Biomedical Engineering and Technology (IJBET), 17, 3.
[6] S. Ullah et al. 2021. High Performance Accurate and Approximate Multipliers for FPGA-based Hardware Accelerators. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD).
[7] S. Ullah et al. 2018. Area-Optimized Low-Latency Approximate Multipliers for FPGA-Based Hardware Accelerators. In IEEE/ACM Design Automation Conference (DAC).
[8] I. Kuon and J. Rose. 2007. Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 26, 2.
[9] A. Boutros et al. 2018. Embracing Diversity: Enhanced DSP Blocks for Low-Precision Deep Learning on FPGAs. In IEEE International Conference on Field Programmable Logic and Applications (FPL).
[10] S. Lee et al. 2019. Double MAC on a DSP: Boosting the Performance of Convolutional Neural Networks on FPGAs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 38, 5.
[11] Xilinx. 2015. LogiCORE IP multiplier v12.0. https://www.xilinx.com/ support / documentation / ip documentation / mult gen / v12 0 / pg108 - mult-gen.pdf. (2015).
[12] Xilinx. 2016. LogiCORE IP Divider v5.1. https://www.xilinx.com/ support/ documentation/ip documentation/ div gen/ v5 1/ pg151 - div - gen.pdf. (2016).