This article has been accepted for publication in IEEE Embedded Systems Letters.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/LES.2024.3399071

Point Multiplication Accelerator for Arbitrary Montgomery Curves
Khalid Javeed, Member, IEEE and David Gregg

K. Javeed is with the Department of Computer Engineering, University of Sharjah, UAE. (email: [email protected]).
D. Gregg is with the School of Computer Science, Trinity College Dublin, Ireland. (email: [email protected]).

Abstract—This letter presents a novel and efficient hardware architecture to accelerate the computation of the point multiplication (PM) primitive over arbitrary Montgomery curves. It is based on a novel double field multiplier (DFM) that computes two field multiplications simultaneously. The DFM uses the interleaved multiplication technique, and it shortens the critical path of the circuit by computing two results at once. It is generic and works for any prime structure and curve parameters over the Montgomery curves. At the system level, a fast scheduling methodology is also presented to execute the field-level operations with the Montgomery ladder (ML) approach. Our ML and DFM designs perform the same operations regardless of the input values, which provides resistance to timing and simple power analysis side-channel attacks. The design is synthesized and implemented on different FPGA platforms. The implementation results confirm that it outperforms the state-of-the-art in terms of area-time product and throughput/slice. To the best of the authors’ knowledge, it is the first fully LUT-based architecture for arbitrary Montgomery curves.

Index Terms—Montgomery curves, FPGA, double modular multiplier, point multiplication

I. INTRODUCTION

ELLIPTIC curve cryptography (ECC) [1], [2] is a sub-class of public key cryptography (PKC) that is outperforming its competitors in providing many security services. This is primarily due to its compact keyspace as compared to other schemes [3]. This makes it a strong and favorable choice to develop confidentiality, integrity, and authentication services in different applications. The performance of these security services heavily relies on the computational speed of point multiplication (PM), which is the chief primitive in the context of ECC. Therefore, an efficient implementation of the PM primitive is a prerequisite for the efficient realization of ECC protocols and associated services. The National Institute of Standards and Technology (NIST) recommended elliptic curves (ECs) over prime fields with special structures. However, it has been reported in [4] that these are not secure, and a new curve, Curve25519, was introduced. Curve25519 is a Montgomery curve (MC) with a special prime structure, where PM can be computed with fewer group operations due to a smaller number of field operations. However, these types of ECs are chained to a specific prime field and therefore lack flexibility.

Implementation of a PM module over any form of EC boils down to the computation of basic arithmetic operations in the given finite field. In the case of affine coordinates (x, y), field inversion (FI) is the most computationally intensive operation. Whereas, using projective coordinates (X, Y, Z), the performance burden is shifted towards the field multiplication (FM) operation. Several fast PM implementations have been proposed using Weierstrass or Montgomery ECs with either standard or general prime structure [5]–[11]. Most of these implementations used projective coordinates (X, Y, Z) after developing efficient FM architectures. To further speed up the computation, hard and softcore IPs of modern FPGAs were utilized, which made them platform-dependent. In addition, some of these are even vulnerable to the most common timing and simple power analysis (SPA) attacks [12]. Robustness against these attacks is a very important feature that must be present in cryptographic devices used for ensuring security services. The main contributions of this letter are:
- An efficient hardware architecture to accelerate the computation of PM over any arbitrary MC for a generic prime field is proposed.
- The proposed PM module is developed on the foundations of a novel double finite field multiplier (DFM).
- Subsequently, dual cores of the DFM are utilized in the development of the PM module. The proposed design is robust, and programmable for curve parameters and for prime value p up to 256-bit.

The rest of the paper is organized as follows: Section II introduces the preliminaries. Section III presents the DFM and PM modules. Implementation results are given in Section IV. Finally, Section V concludes this letter.

II. PRELIMINARIES

The Montgomery form of EC is a preferred choice over the conventional Weierstrass form because it requires fewer field arithmetic operations to compute the underlying group operations. A representation for MC is $\alpha y^2 = x^3 + \beta x^2 + x$. For Curve25519, $\alpha$ is 1, $\beta$ is 486662, and the modulus $p$ is $2^{255} - 19$. The Montgomery ladder (ML) [13] is the most efficient way to compute PM using the differential addition formula, which computes the sum of two points M1 and M2 from the individual points and their difference M1 − M2 [13]. It also resists timing and SPA attacks [5]. The computation of PM is an iterative process. It starts with two points M1 and M2 with a difference of R, and this difference is maintained throughout the computation. In each ML iteration, a combined differential addition and doubling step is performed. This combined step does not require the y-coordinate to be computed for the intermediate points, which speeds up the computation of PM. In total, one iteration of the ladder requires ten FM and eight field addition/subtraction (FAS) operations.
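The iterative process described above can be sketched in software as follows. This is a minimal illustration of a generic x-only Montgomery ladder, not the authors' hardware schedule: it assumes the curve is written as y^2 = x^3 + A*x^2 + x (A = 486662 for Curve25519), and the names `xdbl_add`, `montgomery_ladder`, and `a24` are hypothetical. The Python `if` used for the conditional swap is not constant time; a hardware design would use a multiplexer instead.

```python
def xdbl_add(xr, m1, m2, a24, p):
    """One combined ladder step on projective x-only points (X, Z).

    Returns (2*M1, M1 + M2), where xr is the affine x-coordinate of the
    fixed difference between M1 and M2.
    """
    (x2, z2), (x3, z3) = m1, m2
    a, b = (x2 + z2) % p, (x2 - z2) % p
    c, d = (x3 + z3) % p, (x3 - z3) % p
    aa, bb = a * a % p, b * b % p
    e = (aa - bb) % p                      # equals 4 * X2 * Z2
    da, cb = d * a % p, c * b % p
    doubled = (aa * bb % p, e * (bb + a24 * e) % p)
    added = ((da + cb) ** 2 % p, xr * (da - cb) ** 2 % p)
    return doubled, added


def montgomery_ladder(k, xr, A, p, bits=256):
    """Affine x-coordinate of [k]R on y^2 = x^3 + A*x^2 + x over GF(p)."""
    a24 = (A + 2) * pow(4, p - 2, p) % p   # (A + 2) / 4 mod p
    m1, m2 = (1, 0), (xr % p, 1)           # M1 = point at infinity, M2 = R
    for i in reversed(range(bits)):
        bit = (k >> i) & 1
        if bit:                            # conditional swap (a mux in hardware)
            m1, m2 = m2, m1
        m1, m2 = xdbl_add(xr, m1, m2, a24, p)
        if bit:
            m1, m2 = m2, m1
    X, Z = m1
    return X * pow(Z, p - 2, p) % p        # single field inversion at the end
```

Every loop iteration performs the same doubling and differential addition regardless of the key bit, and the difference between the two running points stays equal to R, which is what makes the y-coordinate unnecessary.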

Authorized licensed use limited to: K. Ramakrishnan Health and Educational Trust. Downloaded on August 02,2024 at 03:46:41 UTC from IEEE Xplore. Restrictions apply.
© 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

III. PROPOSED DFM AND PM MODULES

The FM operation plays a vital role in establishing the computational efficiency of the PM module. Therefore, many FM designs are available, based either on Montgomery multiplication (MM) [14] or on interleaved multiplication (IM) [15]. Modifications of the IM algorithm using higher radix approaches reduced the number of iterations at the cost of a higher critical path delay [11], [16]. The proposed DFM module is based on the IM method, where two FM operations can be executed concurrently with a shorter critical path delay and without a significant increase in resource consumption. Subsequently, the DFM module is utilized in the design of the PM architecture using the ML approach, which has inherent resistance to timing and simple power analysis (SPA) attacks.

A. Proposed DFM Module

The digit-serial version of the IM technique is given in [11], [16]. It requires two quadrupling mod P arithmetic primitives, where each such primitive can be accomplished with two k-bit additions, where $k = \lceil \log_2 p \rceil$. Thus, the critical steps of the algorithm require four additions with some multiplexers used for selection purposes. The field addition or subtraction primitive (±) can be realized with two k-bit additions and a few multiplexers. Therefore, in total, it can be implemented with six k-bit adders to execute a k-bit FM operation, whereas the critical path delay is determined by either of the quadrupling units. A cryptographic engine needs to compute multiple FM instructions by exploiting parallelism to accelerate the PM computational process. Many designs deployed multiple copies of an FM unit to increase performance, which is a major source of overall higher resource consumption.

Our proposed modular multiplication algorithm is shown in Alg. 1. It provides parallel execution of two FM instructions with a reduced critical path delay and without a significant increase in resource consumption. Our analysis shows that the IM in [11], [16] has a data dependence of two additions running through the intermediate products S1 and S2, and another dependence of two additions running through the updates of the accumulator Z. In every iteration of the algorithm, S2 is double S1; therefore, by pre-computing 2 × multiplicand mod p, we can eliminate one quadrupling unit and split the other into two doubling units (DUs). Our proposed method facilitates the simultaneous execution of two pairs of inputs instead of one. An index term is added to each input and intermediate variable to show their association. The algorithm accepts two operand pairs (X[0], Y[0]) and (X[1], Y[1]) and a modulus P. It generates two respective outputs Z[0], Z[1], where Z[0] = (X[0] × Y[0]) mod P and Z[1] = (X[1] × Y[1]) mod P. A fixed modulus value is used in most public key cryptosystems; however, the proposed algorithm can support two different values of P. We unroll the for loop and explicitly split it into two cycles, where the same operations are performed on different sets of operands and intermediate variables in an alternating fashion. By splitting the loop into two cycles, we reduced the dependency length of the loop to only a single addition, as opposed to two additions in [11], [16]. In the first cycle, the doublings of S1[0] and S2[1] are executed in parallel to the Z[0] update and the Z[1] reduction modulo P. Note that in the Z[0] update, only a normal addition/subtraction is executed, while its reduction is delayed and is performed in the second cycle. Similarly, the Z[1] reduction is completed in the first cycle, whereas its normal addition/subtraction is accomplished in the second cycle. Note that these two cycles have no dependency and can run simultaneously subject to the available resources. Let $\beta$ be an upper bound on the total number of iterations, which can be computed as $\beta = \lceil (k + 2)/2 \rceil$, where k is the bit length of P. The total number of cycles is $N_S = 2\beta$, because each iteration of the loop is completed in two cycles.

Algorithm 1: Proposed DFM algorithm
  Input: X[0], X[1], Y[0], Y[1], P
  Output: Z[0] = X[0] × Y[0] mod P, Z[1] = X[1] × Y[1] mod P
   1  Z[0] = 0, Z[1] = 0, S1[0] = X[0], S1[1] = X[1], S2[0] = 2X[0] mod P, S2[1] = 2X[1] mod P
   2  M = k + 3 if k mod 2 = 0; M = k + 2 if k mod 2 = 1
   3  M ← M + 1    // append 0 to the right of the LSB of the multiplier
   4  for (i = 0; i ≤ M − 2; i ← i + 2) do
        // First Cycle
   5    switch (Y[0](i+2:i)) do
   6      when 000 | 111 ⇒ V[0] ← 0
   7      when 001 | 010 | 101 | 110 ⇒ V[0] ← S1[0]
   8      else ⇒ V[0] ← S2[0]
   9    end
  10    S1[0] = 2 × S2[0] mod P
  11    S2[1] = 2 × S1[1] mod P
  12    Z[0] = Z[0] ± V[0]
  13    Z[1] = Z[1] mod P
        // Second Cycle
  14    switch (Y[1](i+2:i)) do
  15      when 000 | 111 ⇒ V[1] ← 0
  16      when 001 | 010 | 101 | 110 ⇒ V[1] ← S1[1]
  17      else ⇒ V[1] ← S2[1]
  18    end
  19    S1[1] = 2 × S2[1] mod P
  20    S2[0] = 2 × S1[0] mod P
  21    Z[1] = Z[1] ± V[1]
  22    Z[0] = Z[0] mod P
  23  end
  24  return Z[0], Z[1]

1) Hardware Architecture and Execution Flow: The proposed hardware architecture to execute the DFM algorithm is shown in Fig. 1. There are two rounds in each iteration of the DFM algorithm, where the same operations are executed on different operands and internal registers. The proposed DFM executes each of these in a single clock cycle, thus each iteration is completed in two clock cycles. It can execute two field multiplications simultaneously, where the critical path delay is split into two by inserting registers between the two rounds. Overall, the DFM design consists of four adders (Adder 1 to Adder 4), six registers (S1[0], S1[1], S2[0], S2[1], Z[0], Z[1]), and multiplexing logic. These components are divided into four sub-modules: two double mod p (DMP1 and DMP2), one integer addition (IA), and one single-bit reduction (SR). The DFM architecture processes two consecutive bits Y[0]_i, Y[0]_{i+1} and Y[1]_i, Y[1]_{i+1} of the multipliers Y[0] and Y[1], respectively. DMP1 and DMP2 compute steps 10 and 11 in the first cycle, whereas IA executes the addition of Z[0] while the reduction of Z[1] is done by SR. In the second cycle, the input operands to these sub-modules are interchanged, as can be seen in steps 19 to 22 of the algorithm. Now the intermediate result (Z[0]) that was added in the previous cycle is reduced.
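A bit-accurate software model can make the two-cycle interleaving of Algorithm 1 easier to follow. The sketch below is an illustrative reading of the algorithm, not the authors' implementation: it assumes both operand pairs share one modulus, that all operands are already reduced below p, that the 3-bit window is a radix-4 Booth digit whose most significant bit selects subtraction, and that "mod P" on the accumulator is a single conditional correction. The helper names (`double_mod`, `reduce_once`, `booth_term`, `dfm`) are hypothetical.

```python
def double_mod(a, p):
    """2*a mod p for 0 <= a < p: one shift, one conditional subtraction."""
    t = a << 1
    return t - p if t >= p else t


def reduce_once(z, p):
    """Single correction step: map z from (-p, 2p) back into [0, p)."""
    if z < 0:
        return z + p
    return z - p if z >= p else z


def booth_term(y_ext, i, s1, s2):
    """Signed partial product selected by bits (i+2 : i) of the multiplier."""
    t = (y_ext >> i) & 0b111
    if t in (0b000, 0b111):
        v = 0
    elif t in (0b011, 0b100):
        v = s2                      # +/- 2 * scaled multiplicand
    else:
        v = s1                      # +/- 1 * scaled multiplicand
    return -v if t & 0b100 else v


def dfm(x, y, p):
    """Return [x[0]*y[0] mod p, x[1]*y[1] mod p]; operands must be < p."""
    k = p.bit_length()
    z = [0, 0]
    s1 = [x[0], x[1]]
    s2 = [double_mod(x[0], p), double_mod(x[1], p)]
    y_ext = [y[0] << 1, y[1] << 1]  # append 0 to the right of the LSB
    m = (k + 3 if k % 2 == 0 else k + 2) + 1
    for i in range(0, m - 1, 2):
        # First cycle: accumulate lane 0, reduce lane 1.
        v0 = booth_term(y_ext[0], i, s1[0], s2[0])
        s1[0] = double_mod(s2[0], p)
        s2[1] = double_mod(s1[1], p)
        z[0] = z[0] + v0
        z[1] = reduce_once(z[1], p)
        # Second cycle: the roles of the two lanes are swapped.
        v1 = booth_term(y_ext[1], i, s1[1], s2[1])
        s1[1] = double_mod(s2[1], p)
        s2[0] = double_mod(s1[0], p)
        z[1] = z[1] + v1
        z[0] = reduce_once(z[0], p)
    # The last window is all zeros, so this final correction is a no-op
    # that is kept only as a safeguard.
    return [z[0], reduce_once(z[1], p)]
```

In this model each cycle contains only doublings, one integer addition, and one single-step correction, which mirrors the claim that the loop-carried dependence is reduced to a single addition per cycle.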

While Z[0] is being reduced, a new result (Z[1]) is added simultaneously. Similarly, after the second cycle, S1 and S2 hold the results 4X[0] mod p and 4X[1] mod p, respectively. Each iteration of the loop is completed in two clock cycles; however, two bits of each multiplier are processed per iteration, so it takes k clock cycles to compute two k-bit FM operations. This amounts to a latency of k/2 cycles for a single FM operation.

Fig. 1: Proposed DFM hardware architecture (sub-modules DMP1, DMP2, IA, and SR built from Adder 1 to Adder 4, registers S1, S2, and Z, and multiplexers; operand selection shown for the 1st and 2nd cycle)

B. PM Module

Our novel PM architecture, based on the DFM module and adopting the ML technique, is shown in Fig. 2. To compute a PM, a standard projective coordinate system is applied, where a single iteration of the ML needs 10 FM and 8 FAS instructions. Our proposed DFM module can execute two FM primitives simultaneously, so to fully exploit the available parallelism, we deployed two cores of the DFM module. In addition to the dual-core DFM module, the architecture consists of one FAS module, a register file, and a main control unit. The proposed architecture can execute four FM and one FAS instructions simultaneously. The register file holds intermediate values, while the control unit takes care of all the operations by activating/de-activating the required modules in the given architecture. The internal architecture of FAS is given in Fig. 3, where it is developed as a two-stage pipeline architecture. It takes two clock cycles to complete either an FA or FS operation, with a throughput of one result per clock cycle. For an FA operation, the first adder performs the addition of the operands while the reduction is done in the second adder. In the case of FS, the first adder performs the subtraction while the correction is achieved by the second adder.

Fig. 2: Proposed PM architecture (Control Unit, Instructions Unit, Registers Unit, DFM1, DFM2, FAS)

Fig. 3: Field addition/subtraction

1) Data Flow and Instructions mapping: Ten FM and eight FAS instructions are required to complete a single iteration of ML. The execution flow of these instructions on the PM architecture is given in Fig. 4. These eighteen instructions are sequenced into three levels l1 to l3. In l1, four additions W1 to W4 and four multiplications W5 to W8 are mapped on the deployed modules and are completed in (k + 5) clock cycles. Level l2 has three additions W9 to W11 and four multiplications W12 to W15 and is done in (k + 4) clock cycles. Finally, the last level l3 has only three instructions, two multiplications and one addition, which are finished in (k + 1) clock cycles. The total latency of a single iteration of ML is (3k + 10) clock cycles with two DFM cores.

Fig. 4: Mapping of a single iteration of Montgomery ladder
  Level l1
    FAS:  W1 = X2 + Z2, W2 = X2 − Z2, W3 = X3 + Z3, W4 = X3 − Z3
    DFM1: W5 = W1 × W1, W6 = W2 × W2
    DFM2: W7 = W3 × W2, W8 = W1 × W4
  Level l2
    FAS:  W9 = W5 − W6, W10 = W8 − W7, W11 = W8 + W7
    DFM1: W12 = W5 × W6, W13 = β × W9
    DFM2: W15 = W10 × W10, W14 = W11 × W11
  Level l3
    FAS:  W16 = W6 + W13
    DFM1: Z3 = X1 × W15
    DFM2: Z2 = W9 × W16

IV. IMPLEMENTATION AND RESULTS

The proposed PM module for 256-bit operand sizes is implemented on Xilinx Virtex-7, Zynq, and Virtex-6 FPGAs using the Xilinx Vivado tool. A software model in Python is developed to capture test vectors used in the functional verification and validation stages. The implementation results and comparisons with other related proposals are shown in Table I. The comparison is based on area occupancy (slices), computation time, area-time product as #slices × computation time (ST), throughput (TP) (PM operations per second), and TP per slice (TPS).
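As an illustration of what such a software model might look like, the sketch below evaluates one ML iteration using the W1 to W16 instruction names of Fig. 4. It is not the authors' model. It assumes that the new X2 and X3 are taken from W12 and W14, that the difference point is affine (Z1 = 1), and that the constant multiplied in W13 equals (A + 2)/4 for a curve y^2 = x^3 + A*x^2 + x, which is the value the doubling formula requires (121666 for Curve25519). The functions `fa`, `fs`, `fm`, `ml_iteration`, and `iteration_cycles` are hypothetical names.

```python
def fa(a, b, p):
    """Field addition: integer add, then one conditional subtraction of p."""
    s = a + b
    return s - p if s >= p else s


def fs(a, b, p):
    """Field subtraction: integer subtract, then one conditional add of p."""
    d = a - b
    return d + p if d < 0 else d


def fm(a, b, p):
    """Field multiplication (software stand-in for one DFM lane)."""
    return a * b % p


def ml_iteration(x1, x2, z2, x3, z3, beta, p):
    """One ladder iteration: (X2:Z2) is doubled, (X3:Z3) becomes the sum."""
    # Level l1: four FAS instructions and four multiplications
    w1 = fa(x2, z2, p)
    w2 = fs(x2, z2, p)
    w3 = fa(x3, z3, p)
    w4 = fs(x3, z3, p)
    w5 = fm(w1, w1, p)
    w6 = fm(w2, w2, p)
    w7 = fm(w3, w2, p)
    w8 = fm(w1, w4, p)
    # Level l2: three FAS instructions and four multiplications
    w9 = fs(w5, w6, p)
    w10 = fs(w8, w7, p)
    w11 = fa(w8, w7, p)
    w12 = fm(w5, w6, p)        # new X2
    w13 = fm(beta, w9, p)
    w14 = fm(w11, w11, p)      # new X3
    w15 = fm(w10, w10, p)
    # Level l3: one FAS instruction and two multiplications
    w16 = fa(w6, w13, p)
    z3_new = fm(x1, w15, p)
    z2_new = fm(w9, w16, p)
    return w12, z2_new, w14, z3_new


def iteration_cycles(k):
    """Clock cycles per ML iteration as the sum of the three level latencies."""
    return (k + 5) + (k + 4) + (k + 1)
```

Calling `ml_iteration` with (X2:Z2) = (x:1) and (X3:Z3) = (1:0) should return the double of the point in the first pair and the point itself in the second pair, which gives a simple way to cross-check the mapping against the closed-form doubling formula.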


TABLE I: Performance comparison of PM module with state of the art on FPGA platforms

| Ref. | Platform | Bits | #Slices | LUTs | Freq. (MHz) | Time (ms) | ST | TP | TPS | ∆TPS | Remarks |
|------|-----------|------|---------|------|-------------|-----------|----|----|-----|------|---------|
| Our | Virtex-7 | 256 | 2.96K | 9.32K | 238 | 0.69 | 2.04 | 1449.47 | 489.5 | - | Dual DFM, MCs, general p |
| Our | Zynq-7020 | 256 | 3.04K | 9.691K | 229 | 0.72 | 2.18 | 1388.8 | 456.8 | - | Dual DFM, MCs, general p |
| Our | Virtex-6 | 256 | 3.15K | 9.73K | 221 | 0.75 | 2.33 | 1351 | 428.8 | - | Dual DFM, MCs, general p |
| [17] | Virtex-6 | 256 | 12.6K | 45.54K | 25 | 0.32 | 4.1 | 3143 | 248.4 | 73% | NIST curve, lacks flexibility |
| [16] | Virtex-7 | 256 | 5.1K | 14.9K | 192 | 0.65 | 3.3 | 1538.46 | 301.66 | 62% | Unified arithmetic, Co-Z |
| [10] | Virtex-7 | 256 | 6.4K | - | 158 | 1.7 | 10.9 | 588.24 | 91.91 | 433% | Parallel units with Co-Z |
| [6] | Virtex-7 | 256 | 7.1K | 24.7K | 187 | 1.01 | 7.2 | 990.1 | 139.45 | 251% | Parallel modules |
| [9] | Zynq-7020 | 256 | 29.7K** | 116.3K* | 232 | 0.2 | 5.94 | 5000 | 168.3 | 171% | DSP blocks using Karatsuba |
| [11] | Virtex-7 | 256 | 6.2K | 18.1K | 195 | 0.7 | 4.3 | 1428.57 | 230.42 | 112% | Parallel units with Co-Z |
| [11] | Virtex-7 | 384 | 7.6K | 24.8K | 157 | 1.94 | 14.7 | 515.46 | 67.82 | | Parallel units with Co-Z |
| [5] | Zynq-7020 | 256 | 7.6K** | 30.3K* | 170.4 | 0.35 | 2.66 | 2857.42 | 375.92 | 22% | MCs, general p |
| [18] | Virtex-7 | 256 | 6.5K | - | 104 | 1.9 | 12.4 | 526.31 | 80.97 | 505% | Unified point operation |
| [8] | Virtex-7 | 256 | 6.8K | 22.14K | 166 | 0.8 | 5.8 | 1250.16 | 187.57 | 161% | Unified point operation |
| [19] | Virtex-6 | 256 | 6.6K | - | 76.3 | 2.83 | 18.7 | 353.36 | 53.54 | 701% | Parallel units with IM |

TP in PM operations per second; * #LUTs: (#DSPs × 619 + LUTs); ** estimated, 1 slice = 4 LUTs; Co-Z: common-Z coordinates; ∆TPS: TPS increase.
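The derived columns of Table I can be recomputed with a few lines of arithmetic. The sketch below is only a reading aid with hypothetical names (`OUR_TPS`, `delta_tps`, `area_time`): it assumes that ∆TPS compares each design against the proposed design on the same FPGA platform, and that ST is the slice count in thousands multiplied by the time in milliseconds. Because the table entries are rounded, values recomputed from the printed slice counts and times may differ slightly from the printed ST, TP, and TPS figures.

```python
# TPS of the proposed design per platform, as listed in Table I.
OUR_TPS = {"Virtex-7": 489.5, "Zynq-7020": 456.8, "Virtex-6": 428.8}


def delta_tps(platform, other_tps):
    """Percentage TPS increase of the proposed design over another design
    on the same platform, rounded to the nearest integer."""
    ours = OUR_TPS[platform]
    return round(100 * (ours - other_tps) / other_tps)


def area_time(slices_k, time_ms):
    """Area-time product: slices (in thousands) times time (in ms)."""
    return slices_k * time_ms
```

For example, `delta_tps("Virtex-6", 248.4)` reproduces the 73% entry for [17], and `delta_tps("Zynq-7020", 375.92)` reproduces the 22% entry for [5].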

Lower ST and higher TPS (TP/#slices) figures are the most desired criteria to establish the higher efficiency of a design. The percentage increase in TPS over the state-of-the-art is also presented in the table.

The PM module on average computes one PM operation in 0.69 ms, 0.72 ms, and 0.75 ms on Virtex-7, Zynq-7020, and Virtex-6 FPGAs with slice occupancy of 2.96K, 3.04K, and 3.15K, delivers TP of 1449.4, 1388.8, and 1351, achieves ST of 2.04, 2.18, and 2.33, and TPS of 489.5, 456.8, and 428.8, respectively. Note that all of the designs compared in Table I, except [5], are based on either Weierstrass or NIST curves. NIST curve designs tend to be faster due to a specific prime structure, but these lack flexibility and can have back-doors [4]. The only generic MC design for 256-bit was proposed in [5]. It is based on redundant signed digit arithmetic used in the implementation of the Montgomery multiplier [14]. However, it utilizes FPGA-dedicated blocks such as DSP slices and BRAMs, which makes it platform-dependent. To the best of the authors’ knowledge, the proposed design is the first design for arbitrary MCs with a complete LUT implementation. This makes it portable to any FPGA family/device, in addition to the advantage of supporting a generic prime. It dominates all the mentioned designs in terms of ST and TPS metrics: it has the lowest ST and highest TPS values in comparison to the state-of-the-art. In terms of ST, it is 1.75, 1.61, 5.34, 3.52, 2.72, 2.10, 1.22, 6.07, 2.84, and 8.02 times better, whereas, in terms of TPS, it is 1.73, 1.62, 5.32, 3.5, 2.71, 2.12, 1.21, 6.05, 2.61, and 8.01 times better as compared to [17], [16], [10], [6], [9], [11], [5], [18], [8], and [19], respectively. Due to our constant-time ML, DFM, and FAS circuits, it resists timing attacks. For all choices, we compute both values and select the result, which provides further resistance to SPA attacks.

V. CONCLUSION

This letter presented a novel hardware architecture to accelerate the PM primitive. It is designed using a novel double modular multiplier circuit that can perform two modular multiplication operations simultaneously. On different FPGA platforms, it delivers significantly better ST and TPS in comparison to other contemporary designs. Therefore, it is a prominent choice as a building block for key exchange and digital signature protocols in both performance-critical and resource-critical applications.

REFERENCES

[1] V. S. Miller, “Use of elliptic curves in cryptography,” in Conference on the theory and application of cryptographic techniques. Springer, 1985, pp. 417–426.
[2] N. Koblitz, “Elliptic curve cryptosystems,” Mathematics of computation, vol. 48, no. 177, pp. 203–209, 1987.
[3] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Communications of the ACM, vol. 21, no. 2, pp. 120–126, 1978.
[4] D. J. Bernstein, T. Lange et al., “Safecurves: choosing safe curves for elliptic-curve cryptography,” Available online at http://safecurves.cr.yp.to, 2013.
[5] D. B. Roy and D. Mukhopadhyay, “High-speed implementation of ECC scalar multiplication in GF(p) for generic Montgomery curves,” IEEE transactions on very large scale integration (VLSI) systems, vol. 27, no. 7, pp. 1587–1600, 2019.
[6] K. Javeed, A. El-Moursy, and D. Gregg, “E2csm: efficient FPGA implementation of elliptic curve scalar multiplication over generic prime field GF(p),” The Journal of Supercomputing, pp. 1–25, 2023.
[7] Y. A. Shah, K. Javeed, M. I. Shehzad, and S. Azmat, “LUT-based high-speed point multiplier for Goldilocks-curve448,” IET Computers & Digital Techniques, vol. 14, no. 4, pp. 149–157, 2020.
[8] K. Javeed and A. El-Moursy, “Area-time efficient point multiplication architecture on twisted Edwards curve over general prime field GF(p),” International Journal of Circuit Theory and Applications.
[9] A. M. Awaludin, H. T. Larasati, and H. Kim, “High-speed and unified ECC processor for generic Weierstrass curves over GF(p) on FPGA,” Sensors, vol. 21, no. 4, p. 1451, 2021.
[10] Y. Hao, S. Zhong, M. Ma, R. Jiang, S. Huang, J. Zhang, and W. Wang, “Lightweight architecture for elliptic curve scalar multiplication over prime field,” Electronics, vol. 11, no. 14, p. 2234, 2022.
[11] K. Javeed, A. El-Mursy, and D. Gregg, “Ec-crypto: Highly efficient area-delay optimized elliptic curve cryptography processor,” IEEE Access, 2023.
[12] P. C. Kocher, “Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems,” in Annual International Cryptology Conference. Springer, 1996, pp. 104–113.
[13] C. Costello and B. Smith, “Montgomery curves and their arithmetic,” Journal of Cryptographic Engineering, vol. 8, no. 3, pp. 227–240, 2018.
[14] P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of computation, vol. 44, no. 170, pp. 519–521, 1985.
[15] G. R. Blakely, “A computer algorithm for calculating the product AB modulo M,” IEEE Transactions on Computers, vol. 100, no. 5, pp. 497–500, 1983.
[16] K. Javeed, “FPGA implementation of area-time aware ECC scalar multiplication core*,” in 2023 30th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2023, pp. 1–4.
[17] X. Hu, X. Li, X. Zheng, Y. Liu, and X. Xiong, “A high-speed processor for elliptic curve cryptography over NIST prime field,” IET Circuits, Devices & Systems, vol. 16, no. 4, pp. 350–359, 2022.
[18] M. M. Islam, M. S. Hossain, M. K. Hasan, M. Shahjalal, and Y. M. Jang, “Design and implementation of high-performance ECC processor with unified point addition on twisted Edwards curve,” Sensors, vol. 20, no. 18, p. 5148, 2020.
[19] T. Kudithi, “An efficient hardware implementation of the elliptic curve cryptographic processor over prime field,” International Journal of Circuit Theory and Applications, 2020.
