An Efficient and High-Speed Overlap-Free Karatsuba-Based Finite-Field Multiplier for FPGA Implementation
Abstract: Virtually every form of electronic communication today relies on some kind of
cryptography. Elliptic curve cryptography (ECC), a branch of public-key cryptography, is now
the most widely used technique for implementing cryptographic protocols. In ECC systems,
polynomial multiplication is the operation that requires the most area and time. To make
better use of field-programmable gate arrays (FPGAs), this work introduces a novel hardware
architecture for ECC finite-field multipliers. To evaluate its performance, the proposed
hardware was implemented on several FPGA devices with different operand sizes. Compared
with state-of-the-art works, the proposed method achieves a lower combinational delay and a
lower area-delay product.
An essential part of an ECC hardware implementation is the finite-field multiplier, which is
used to compute the points of the elliptic curve. Throughput and system area are governed by
the size and latency of the multiplier. This has prompted the development and intensive study
of finite-field multipliers for cryptosystems [17]-[26].

The Karatsuba algorithm (KA) is a well-known multiplication method [27]. It seeks to reduce
the number of multiplications by substituting addition operations for some of them.
Unfortunately, because of the algorithm's high time complexity and its recursive nature,
processing time increases and efficiency decreases. As a result, both latency and area are
considered when choosing between the KA and the classical algorithm (CA). The constraints of
the hardware implementation dictate whether the Karatsuba approach can be tuned to increase
speed or to decrease area. Multiplier performance can be improved further with hardware
implementation techniques such as pipelining [29]-[32].

To reduce the combinational latency of the KA, researchers have proposed several methods, one
of which is the overlap-free Karatsuba algorithm (OKA) [33]-[35]. The objective of this
approach is to remove an XOR gate from the critical path of the KA. With this enhancement,
the achievable latency is much lower than with the original KA. However, because of its
non-recursive nature, the CA still achieves better latency than the OKA.

In this paper, we provide a new hardware approach to efficient multiplication that maintains
its performance while avoiding a very large design size. The method is structured like the
OKA but uses a dedicated base unit at the lowest level. The proposed multiplication method
was found to use resources close to those of Karatsuba while running at nearly the same speed
as the CA.

Three algorithms were chosen for FPGA implementation: the OKA, the KA, and the classical
multiplication method. This part of the study gives an overview, in terms of area and
latency, of the FPGA implementations for various operand sizes. It shows that the CA is the
fastest of the three algorithms; for larger operands, however, it requires more lookup tables
(LUTs). In contrast to previous algorithms, the ones that were proposed demonstrated much
higher speed while using significantly less area. The significance of this study is best shown
by two key points. First, we evaluated FPGA implementations of overlap-free Karatsuba binary
polynomial multiplication for various operand sizes and compared the results with theoretical
analysis and with other methods. Second, we proposed an overlap-free, lookup-table-based
method to obtain a fast and efficient polynomial multiplier.

II BACKGROUND
Polynomial multiplication and modular reduction are two common operations in GF($2^n$) that
have a major impact on the efficiency and cost of the system [30]. The efficiency of a
multiplier is usually estimated theoretically from its space consumption and total
multiplication delay, assuming ideal two-input AND and XOR gates [30], [33], [36]. Such
analyses do not consider hardware limits; for example, the limited fan-out of gates is often
ignored. Under this ideal model, the area and delay of the multiplier are obtained simply by
adding up the standard gate areas and delays. These simplifications may not matter in
theoretical studies, but they become crucial in real-world applications, such as when buffers
are necessary due to hardware limitations [37]. This article presents results derived from
both theoretical and practical sources.

A. Classical Algorithm (CA)
A brief summary of the CA for binary polynomial multiplication is given here. We lay the
groundwork with 2- and 4-bit multipliers and then continue to n-bit multipliers. In
GF($2^n$), consider two polynomials of degree one, $A(x) = a_1 x + a_0$ and
$B(x) = b_1 x + b_0$. Their product is
$C(x) = a_1 b_1 x^2 + (a_1 b_0 + a_0 b_1) x + a_0 b_0$. Since we are in GF($2^n$), 1-bit
addition and multiplication are performed with logical XOR and AND, respectively. The data
flow graphs (DFGs) for the first-order (2-bit) and 4-bit multiplier examples are shown in
Fig. 1(a) and (b), respectively.

A simple 2-bit multiplier can be implemented using only one XOR and four AND gates. For a
conventional n-bit multiplier, the gate counts are
$$CA_{AND} = n^2, \qquad CA_{XOR} = (n-1)^2$$
where $CA_{AND}$ and $CA_{XOR}$ are the total numbers of AND and XOR gates, respectively.
Assuming ideal hardware conditions and signal strength (no buffers required), the delay of the
CA multiplier for the given example in Fig. 1(a) is
$$T_A + T_X$$
where $T_A$ and $T_X$ are the delays of the AND and XOR gates.
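As a rough illustration of the classical algorithm summarized above, the following Python sketch multiplies two binary polynomials bit by bit (AND for the partial products, XOR for the additions) and evaluates the ideal two-input gate counts. The function names `clmul` and `ca_gate_counts` are hypothetical, and the encoding (bit i of an integer holds the coefficient of x^i) is an assumption made for the example, not something taken from the paper.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook (classical) product of two polynomials over GF(2).

    Bit i of each integer is the coefficient of x**i (assumed encoding).
    """
    result = 0
    shift = 0
    while b:
        if b & 1:
            # Partial product a * x**shift, accumulated with XOR (addition in GF(2)).
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def ca_gate_counts(n: int) -> tuple[int, int]:
    """(AND, XOR) gate counts of an n-bit classical multiplier, ideal two-input gates."""
    return n * n, (n - 1) ** 2


# (x + 1) * (x + 1) = x^2 + 1 over GF(2)
print(bin(clmul(0b11, 0b11)))  # 0b101
# The 2-bit case: four AND gates and one XOR gate.
print(ca_gate_counts(2))       # (4, 1)
```

The 2-bit result matches the one-XOR, four-AND multiplier mentioned in the text; the sketch models only the arithmetic, not fan-out, buffering, or LUT mapping.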
A new and efficient implementation of the finite-field multiplier is given here. The proposed
implementation strategy is informed by a study of the theoretical area and delay bounds of the
classical, Karatsuba, and overlap-free methods. Using the observed trend as a template,
finite-field multipliers of varying sizes are produced. The hardware resource requirements and
combinational delay are also evaluated for two implementation strategies, namely theoretical
gate-based analysis and FPGA.

Figure 4(a) shows the hardware cost of the binary polynomial multiplication algorithms for
different operand sizes. For small operand sizes, the figure shows that the CA requires fewer
gates than the KA. However, the number of gates required by the CA grows substantially faster
with operand size than it does for Karatsuba and overlap-free. For an operand size of 409
bits, the CA requires over 163% more gates than Karatsuba or overlap-free.

Figure 4(b) shows the total combinational delay, expressed in gate delays, for all three
methods. We started from the assumption that the AND and XOR gate delays are equal,
$T_X = T_A = T_G$. The CA has the smallest delay in this figure, and the delays of the CA and
the overlap-free Karatsuba converge to almost the same value as the operand size grows. In
contrast, the latency of the Karatsuba method increases more rapidly than that of the other
two algorithms. The delays of recursive multipliers with 16, 19, and 233 bits are the same
because the number of levels is the same for each of them. For further detail, see Figure 5,
which depicts the construction of these multipliers.
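To make the recursive structure behind these comparisons concrete, here is a minimal Python sketch of the standard Karatsuba split for binary polynomials: three half-size products instead of four, with XOR standing in for both addition and subtraction. It is the textbook KA, not the overlap-free variant, and the names `clmul` and `karatsuba_gf2` are hypothetical.

```python
def clmul(a: int, b: int) -> int:
    """Classical GF(2) polynomial product, used here only as a reference."""
    result, shift = 0, 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def karatsuba_gf2(a: int, b: int, n: int) -> int:
    """Product of two GF(2) polynomials with at most n coefficients each."""
    if n <= 1:
        return a & b                 # one AND gate at the leaves
    h = (n + 1) // 2                 # width of the low half
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h        # A = a0 + a1 * x^h
    b0, b1 = b & mask, b >> h        # B = b0 + b1 * x^h
    low = karatsuba_gf2(a0, b0, h)
    high = karatsuba_gf2(a1, b1, n - h)
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, h)
    # Cross term a0*b1 + a1*b0 = mid + low + high (all additions are XOR).
    return low ^ ((mid ^ low ^ high) << h) ^ (high << (2 * h))


# Same 4-bit product as the classical method.
print(karatsuba_gf2(0b1011, 0b0110, 4) == clmul(0b1011, 0b0110))  # True
```

Each level of recursion adds XOR stages before and after the sub-products, which is one way to see why the Karatsuba delay grows with the number of levels rather than directly with the operand size.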
The trend for space complexity is rather comparable. Compared with the Karatsuba approach, the
CA requires almost twice as many LUTs at 409 bits. It is anticipated that the disparity will
widen with increasing operand sizes.
The classical technique provides the lowest latency in both theoretical gate-based analysis
and FPGA implementation, followed by the overlap-free and Karatsuba methods. Furthermore, the
overlap-free latency is close to that of Karatsuba on FPGA, although it is closer to the CA in
the theoretical calculations.

For operand sizes less than 409 bits, the area-delay product (ADP) of the classical approach is
lower than that of the other methods. Numerically, a 283-bit multiplier implemented using the
classical method outperforms Karatsuba by 14% and the overlap-free method by 9%. For a 93-bit
multiplier, these figures are 64% and 66%, respectively.

For smaller operand sizes, the trend suggests that the CA is the most efficient approach. Note
also that overlap-free outperforms the KA when the operand size is greater than 93 bits. For
larger operand sizes, the overlap-free technique is likely to remain the most efficient.
Because efficiency leans toward the overlap-free method for large operand sizes, a hybrid
technique can be proposed for finite-field multiplication.

Figure 8 shows the DFG for the OBS technique, the proposed overlap-free-based multiplication
algorithm. The levels up to the maximum level are based on the overlap-free method; at the
first level, however, the classical method is employed.
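The hybrid idea described above (recursive splitting at the upper levels, a classical multiplier at the first level) can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard Karatsuba split in place of the overlap-free one, the threshold `base=8` is an arbitrary choice rather than the paper's base-unit size, and `hybrid_mul` and `and_gate_estimate` are hypothetical names. The estimate counts AND gates only and ignores XOR gates and LUT mapping.

```python
def classical_mul(a: int, b: int) -> int:
    """Classical GF(2) polynomial product used as the base unit."""
    result, shift = 0, 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def hybrid_mul(a: int, b: int, n: int, base: int = 8) -> int:
    """Karatsuba-style recursion that falls back to the classical method
    once the operands have at most `base` coefficients (base >= 1)."""
    if n <= base:
        return classical_mul(a, b)
    h = (n + 1) // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    low = hybrid_mul(a0, b0, h, base)
    high = hybrid_mul(a1, b1, n - h, base)
    mid = hybrid_mul(a0 ^ a1, b0 ^ b1, h, base)
    return low ^ ((mid ^ low ^ high) << h) ^ (high << (2 * h))


def and_gate_estimate(n: int, base: int = 8) -> int:
    """AND gates only: n^2 in a classical base unit, three sub-products above it."""
    if n <= base:
        return n * n
    h = (n + 1) // 2
    return 2 * and_gate_estimate(h, base) + and_gate_estimate(n - h, base)


a = (1 << 232) | 0x1234567
b = (1 << 200) | 0xABCDEF
print(hybrid_mul(a, b, 233) == classical_mul(a, b))  # True
# AND-gate estimate for the hybrid versus the fully classical 233-bit multiplier.
print(and_gate_estimate(233, base=8), 233 * 233)
```

Raising `base` shortens the recursion, which reduces the number of levels, at the cost of larger classical blocks; where the best trade-off lies on a real FPGA depends on how the base unit maps to LUTs.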
This section covers the results of employing the proposed approach and compares them with
other relevant research in this field.

Hardware Multiplier Based on Overlap-Free Approaches
Theoretical bounds for area, latency, and ADP have often been computed to assess an
algorithm's performance. On closer study, it becomes evident that such bounds may not hold
when the algorithm is applied to an FPGA. Since FPGA hardware consumption is determined by
LUTs, the theoretical analysis needs to be revised. The proposed method is based on LUT
implementation, the core component of the FPGA; therefore, its estimates are also more
accurate in terms of actual performance and cost.

Binary polynomial multiplication followed by a modular reduction is a common way to build a
finite-field multiplier. The relative performance and resource usage of the different methods
are shown in Fig. 10(a) and (b). The algorithms were developed using irreducible trinomials
and implemented on an FPGA (Artix-7 XC7A200TTFV1156-2). The multipliers range in size from 93
to 409 bits.

The proposed approach uses only a fraction of the resources, staying much closer to the KA,
and is almost as fast as the classical method (Fig. 10). The following examines this
efficiency, which is the principal advantage of the proposed approach.

Figure 10(c) shows that the ADP of the proposed method is the lowest among the compared
methods. Tables II and III summarize the ADP and speed improvements over the alternative
approaches. Overall, the proposed method achieves 25% better ADP than the classical
algorithm, 31% better than the Karatsuba method, and 25% better than the overlap-free
algorithm.
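The multiply-then-reduce construction with an irreducible trinomial, mentioned above, can be sketched in a few lines of Python. The helper names `reduce_trinomial` and `gf_mul` are hypothetical, and the small trinomial $x^3 + x + 1$ is used only because the result is easy to check by hand; the source does not state which trinomials were used for each operand size.

```python
def clmul(a: int, b: int) -> int:
    """Classical GF(2) polynomial product (any multiplier could be used here)."""
    result, shift = 0, 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def reduce_trinomial(c: int, m: int, k: int) -> int:
    """Reduce polynomial c modulo x^m + x^k + 1, with 0 < k < m."""
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            # x^i = x^(i-m+k) + x^(i-m)  (mod x^m + x^k + 1):
            # clear bit i and toggle the two lower bits.
            c ^= (1 << i) | (1 << (i - m + k)) | (1 << (i - m))
    return c


def gf_mul(a: int, b: int, m: int, k: int) -> int:
    """Finite-field product in GF(2^m) defined by the trinomial x^m + x^k + 1."""
    return reduce_trinomial(clmul(a, b), m, k)


# (x^2 + 1) * (x^2 + x + 1) mod (x^3 + x + 1) = x^2 + x
print(bin(gf_mul(0b101, 0b111, 3, 1)))  # 0b110
```

With a trinomial, each reduced bit costs only two XORs, which is why the reduction step is cheap compared with the polynomial multiplication that precedes it.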
V CONCLUSION
A new finite-field multiplier is proposed in this paper. The proposed method was implemented
on FPGA for varying operand sizes, and its performance metrics were compared with those of
other algorithms. According to the implementation results, the proposed method outperformed
Karatsuba and the OKA by 30% and 20% on average, respectively. The design is also faster than
Karatsuba while using 1% less area, and it is 4% smaller than the overlap-free Karatsuba and
43% smaller than the CA. According to the ADP comparisons, the design outperforms the
classical, Karatsuba, and OKA methods by 25%, 30%, and 25%, respectively. A comparison with
state-of-the-art works showed that the design is more efficient, with greater speed and lower
ADP.

REFERENCES
[1] R. Abu-Salma, M. A. Sasse, J. Bonneau, A. Danilova, A. Naiakshina, and M. Smith,
“Obstacles to the adoption of secure communication tools,” in Proc. IEEE Symp. Secur. Privacy
(SP), May 2017, pp. 137–153.
[2] B. Vembu, A. Navale, and S. Sadhasivan, “Creating secure communication channels between
processing elements,” U.S. Patent 9 589 159, Mar. 7, 2017.
[3] J. Yoo and J. H. Yi, “Code-based authentication scheme for lightweight integrity checking
of smart vehicles,” IEEE Access, vol. 6, pp. 46731–46741, 2018.
[4] K. Shahbazi and S. B. Ko, “Area-efficient nano-AES implementation for Internet-of-Things
devices,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 1, pp. 136–146,
Jan. 2021.
[5] P. Aparna and P. V. V. Kishore, “Biometric-based efficient medical image watermarking in
E-healthcare application,” IET Image Process., vol. 13, no. 3, pp. 421–428, Feb. 2019.