This article has been accepted for publication in IEEE Embedded Systems Letters.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/LES.2024.3399071

Point Multiplication Accelerator for Arbitrary Montgomery Curves

Khalid Javeed, Member, IEEE and David Gregg

K. Javeed is with the Department of Computer Engineering, University of Sharjah, UAE. D. Gregg is with the School of Computer Science, Trinity College Dublin, Ireland.

Abstract: This letter presents a novel and efficient hardware architecture to accelerate the computation of the point multiplication (PM) primitive over arbitrary Montgomery curves. It is based on a novel double field multiplier (DFM) that computes two field multiplications simultaneously. The DFM uses the interleaved multiplication technique, and it shortens the critical path of the circuit by computing two results at once. It is generic and works for any prime structure and any curve parameters over the Montgomery curves. At the system level, a fast scheduling methodology is also presented to execute the field-level operations with the Montgomery ladder (ML) approach. Our ML and DFM designs perform the same operations regardless of the input values, which provides resistance to timing and simple power analysis side-channel attacks. The design is synthesized and implemented on different FPGA platforms. The implementation results confirm that it outperforms the state-of-the-art in terms of area-time product and throughput/slice. To the best of the authors’ knowledge, it is the first fully LUT-based architecture for arbitrary Montgomery curves.

Index Terms: Montgomery curves, FPGA, double modular multiplier, point multiplication

I. INTRODUCTION

ELLIPTIC curve cryptography (ECC) [1], [2] is a sub-class of public key cryptography (PKC) that is outperforming its competitors in providing many security services. This is primarily due to its compact keyspace as compared to other schemes [3]. This makes it a strong and favorable choice for developing confidentiality, integrity, and authentication services in different applications. The performance of these security services relies heavily on the computational speed of point multiplication (PM), which is the chief primitive in the context of ECC. Therefore, an efficient implementation of the PM primitive is a prerequisite for the efficient realization of ECC protocols and associated services. The National Institute of Standards and Technology (NIST) recommended elliptic curves (ECs) over prime fields with special structures. However, it has been reported in [4] that these are not secure, and a new curve, Curve25519, was introduced. Curve25519 is a Montgomery curve (MC) with a special prime structure, where PM can be computed with fewer group operations due to a smaller number of field operations. However, these types of ECs are tied to a specific prime field and therefore lack flexibility.

Implementation of a PM module over any form of EC boils down to the computation of basic arithmetic operations in the given finite field. In the case of affine coordinates (x, y), field inversion (FI) is the most computationally intensive operation. With projective coordinates (X, Y, Z), by contrast, the performance burden shifts to the field multiplication (FM) operation. Several fast PM implementations have been proposed using Weierstrass or Montgomery ECs with either a standard or a general prime structure [5]–[11]. Most of these implementations used projective coordinates (X, Y, Z) after developing efficient FM architectures. To further speed up the computation, hard and softcore IPs of modern FPGAs were utilized, which made the designs platform-dependent. In addition, some of them are even vulnerable to the most common timing and simple power analysis (SPA) attacks [12]. Robustness against these attacks is a very important feature that must be provided by cryptographic devices used for ensuring security services. The main contributions in this letter are:
- An efficient hardware architecture to accelerate the computation of PM over any arbitrary MCs for a generic prime field is proposed.
- The proposed PM module is developed on the foundations of a novel double finite field multiplier (DFM).
- Subsequently, dual cores of DFM are utilized in the development of the PM module. The proposed design is robust, and programmable for curve parameters and for prime value p up to 256-bit.

The rest of the paper is organized as follows: Section II introduces the preliminaries. Section III presents the DFM and PM modules. Implementation results are given in Section IV. Finally, Section V concludes this letter.

II. PRELIMINARIES

The Montgomery form of EC is a preferred choice over the conventional Weierstrass form because it requires fewer field arithmetic operations in computing the underlying group operations. A representation for MC is $y^2 = x^3 + \alpha x + \beta$. For Curve25519, β is 486662 and the modulus p is $2^{255} - 19$. The Montgomery ladder (ML) [13] is the most efficient way to compute PM using the differential addition formula. Differential addition computes the sum of two points M1 and M2 from the individual points and their difference M1 − M2 [13]. The ML also resists timing and SPA attacks [5]. The computation of PM is an iterative process. It starts with two points M1 and M2 with a difference of R, and this difference is maintained throughout the computation. In each ML iteration, a combined differential addition and doubling step is performed. This combined step does not require the y-coordinate to be computed for the intermediate points, which speeds up the computation and improves the efficiency of PM. In total, one iteration of the ladder requires ten FM and eight field addition/subtraction (FAS) operations.
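As a rough illustration of the combined differential addition and doubling step, the following Python sketch performs one x-only ladder step with ten multiplications and eight additions/subtractions and wraps it in a ladder loop. It assumes the textbook Montgomery form $B y^2 = x^3 + A x^2 + x$ with the precomputed constant $(A+2)/4$, neither of which is spelled out in the text above. The function names, the default Curve25519 parameters and the intermediate names w1 to w16 (chosen to mirror the labels of the ladder-mapping figure as far as they can be read) are illustrative assumptions, and the Python branch used for the conditional swap would have to be a data-independent select in a constant-time design.

```python
# Illustrative software model only; names and structure are assumptions,
# not the authors' implementation. Requires Python 3.8+ for pow(x, -1, p).
P25519 = 2**255 - 19      # modulus quoted in the text for Curve25519
A25519 = 486662           # Curve25519 coefficient A in B*y^2 = x^3 + A*x^2 + x


def ladder_step(x1, x2, z2, x3, z3, a24, p):
    """Double (x2:z2) and add (x2:z2) + (x3:z3), whose difference has x = x1.

    Uses 10 field multiplications and 8 field additions/subtractions.
    a24 is (A + 2) / 4 mod p.
    """
    w1 = (x2 + z2) % p
    w2 = (x2 - z2) % p
    w3 = (x3 + z3) % p
    w4 = (x3 - z3) % p
    w5 = w1 * w1 % p
    w6 = w2 * w2 % p
    w7 = w3 * w2 % p
    w8 = w1 * w4 % p
    w9 = (w5 - w6) % p
    w10 = (w8 - w7) % p
    w11 = (w8 + w7) % p
    w12 = w5 * w6 % p         # X of the doubled point
    w13 = a24 * w9 % p
    w14 = w11 * w11 % p       # X of the sum
    w15 = w10 * w10 % p
    w16 = (w6 + w13) % p
    new_z3 = x1 * w15 % p     # Z of the sum
    new_z2 = w9 * w16 % p     # Z of the doubled point
    return w12, new_z2, w14, new_z3


def montgomery_ladder(k, x1, A=A25519, p=P25519, bits=255):
    """Return the affine x-coordinate of k*M for a point M with x-coordinate x1."""
    a24 = (A + 2) * pow(4, -1, p) % p
    x2, z2 = 1, 0             # point at infinity
    x3, z3 = x1 % p, 1        # M
    for i in reversed(range(bits)):
        bit = (k >> i) & 1
        if bit:               # a hardware design would use a constant-time swap
            x2, z2, x3, z3 = x3, z3, x2, z2
        x2, z2, x3, z3 = ladder_step(x1, x2, z2, x3, z3, a24, p)
        if bit:
            x2, z2, x3, z3 = x3, z3, x2, z2
    return x2 * pow(z2, -1, p) % p if z2 else 0
```

The loop always executes the same step regardless of the key bit; only the roles of the two working points are exchanged, which is the property the ladder relies on for its timing and SPA resistance. The single inversion at the end converts the projective result back to an affine x-coordinate.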

Authorized licensed use limited to: K. Ramakrishnan Health and Educational Trust. Downloaded on August 02,2024 at 03:46:41 UTC from IEEE Xplore. Restrictions apply.
© 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

III. PROPOSED DFM AND PM MODULES

The FM operation plays a vital role in establishing the computational efficiency of the PM module. Therefore, many FM designs are available, based either on Montgomery multiplication (MM) [14] or on interleaved multiplication (IM) [15]. Modifications of the IM algorithm using higher-radix approaches reduced the number of iterations at the cost of a higher critical path delay [11], [16]. The proposed DFM module is based on the IM method, where two FM operations can be executed concurrently with a shorter critical path delay and without a significant increase in resource consumption. Subsequently, the DFM module is utilized in the design of the PM architecture using the ML approach, which has inherent resistance to timing and simple power analysis (SPA) attacks.

A. Proposed DFM Module

The digit-serial version of the IM technique is given in [11], [16]. It requires two quadrupling mod P arithmetic primitives, where each such primitive can be accomplished with two k-bit additions, where $k = \log_2 p$. Thus, the critical steps of the algorithm require four additions with some multiplexers used for selection purposes. The field addition or subtraction primitive (±) can be realized with two k-bit additions and a few multiplexers. Therefore, in total, a k-bit FM operation can be implemented with six k-bit adders, while the critical path delay is determined by either of the quadrupling units. A cryptographic engine needs to compute multiple FM instructions by exploiting parallelism to accelerate the PM computational process. Many designs deployed multiple copies of an FM unit to increase performance, which is a major source of overall higher resource consumption.

Our proposed novel modular multiplication algorithm is given in Alg. 1. It provides parallel execution of two FM instructions with a reduced critical path delay and without a significant increase in resource consumption. Our analysis shows that the IM in [11], [16] has a data-dependence chain of two additions running through the intermediate products S1 and S2, and another chain of two additions running through the updates of the accumulator Z. In every iteration of the algorithm, S2 is double S1; by pre-computing 2 × multiplicand mod p, we can therefore eliminate one quadrupling unit and split the other into two doubling units (DUs). Our proposed method facilitates the simultaneous execution of two pairs of inputs instead of one. An index term is added to each input and intermediate variable to indicate their association. The algorithm accepts two multiplicands X[0], X[1], two multipliers Y[0], Y[1], and a modulus P. It generates two respective outputs Z[0], Z[1], where Z[0] = (X[0] × Y[0]) mod P and Z[1] = (X[1] × Y[1]) mod P. A fixed modulus value is used in most public key cryptosystems; however, the proposed algorithm can support two different values of P. We unroll the for loop and explicitly split it into two cycles, where the same operations are performed on different sets of operands and intermediate variables in an alternating fashion. By splitting the loop into two cycles, we reduce the dependency length of the loop to only a single addition, as opposed to two additions in [11], [16]. In the first cycle, the doublings of S1[0] and S2[1] are executed in parallel to the Z[0] update and the Z[1] reduction modulo P. Note that in the Z[0] update, only a normal addition/subtraction is executed, while its reduction is delayed and performed in the second cycle. Similarly, the Z[1] reduction is completed in the first cycle, whereas the normal addition/subtraction is accomplished in the second cycle. Note that these two cycles have no dependency and can run simultaneously subject to the available resources. Let β be an upper bound on the total number of iterations, which can be computed as $\beta = \lceil (k+2)/2 \rceil$, where k is the bit length of P. The total number of cycles is $N_S = 2\beta$, because each iteration of the loop is completed in two cycles.

Algorithm 1: Proposed DFM algorithm
Input: X[0], X[1], Y[0], Y[1], P
Output: Z[0] = X[0] × Y[0] mod P, Z[1] = X[1] × Y[1] mod P
 1  Z[0] = 0, Z[1] = 0, S1[0] = X[0], S1[1] = X[1], S2[0] = 2X[0] mod P, S2[1] = 2X[1] mod P
 2  M = k + 3 if k mod 2 = 0; M = k + 2 if k mod 2 = 1
 3  M ← M + 1    // append 0 to right of LSB of b
 4  for (i = 0; i ≤ M − 2; i ← i + 2) do
        // First Cycle
 5      switch (Y[0](i+2:i)) do
 6          when 000 | 111 ⇒ V[0] ← 0
 7          when 001 | 010 | 101 | 110 ⇒ V[0] ← S1[0]
 8          else ⇒ V[0] ← S2[0]
 9      end
10      S1[0] = 2 × S2[0] mod P
11      S2[1] = 2 × S1[1] mod P
12      Z[0] = Z[0] ± V[0]
13      Z[1] = Z[1] mod P
        // Second Cycle
14      switch (Y[1](i+2:i)) do
15          when 000 | 111 ⇒ V[1] ← 0
16          when 001 | 010 | 101 | 110 ⇒ V[1] ← S1[1]
17          else ⇒ V[1] ← S2[1]
18      end
19      S1[1] = 2 × S2[1] mod P
20      S2[0] = 2 × S1[0] mod P
21      Z[1] = Z[1] ± V[1]
22      Z[0] = Z[0] mod P
23  end
24  return Z[0], Z[1]

1) Hardware Architecture and Execution Flow: The proposed hardware architecture to execute the DFM algorithm is shown in Fig. 1. There are two rounds in each iteration of the DFM algorithm, where the same operations are executed on different operands and internal registers. The proposed DFM executes each of these rounds in a single clock cycle, thus each iteration is completed in two clock cycles. It can execute two field multiplications simultaneously, and the critical path delay is split in two by inserting registers between the two rounds. Overall, the DFM design consists of four adders (Adder 1 to Adder 4), six registers (S1[0], S1[1], S2[0], S2[1], Z[0], Z[1]), and multiplexing logic. These components are divided into four sub-modules: two double mod p units (DMP1 and DMP2), one integer addition (IA) unit, and one single-bit reduction (SR) unit. In each iteration, the DFM architecture processes two consecutive bits $Y[0]_i, Y[0]_{i+1}$ and $Y[1]_i, Y[1]_{i+1}$ of the multipliers Y[0] and Y[1], respectively. DMP1 and DMP2 compute steps 10 and 11 in the first cycle, while IA executes the addition of Z[0] and the reduction of Z[1] is done by SR. In the second cycle, the input operands to these sub-modules are interchanged, as can be seen in steps 19 to 22 of the algorithm. Now the intermediate result (Z[0]) that was added in the previous cycle is reduced.
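To make the alternating two-cycle loop of Alg. 1 more concrete, the following Python sketch models it at the functional level. It is a software illustration written under stated assumptions, not the authors' RTL: the helper names (`double_mod`, `reduce_once`, `select`, `dfm`) are hypothetical, the sign of the ± operation is taken from the most significant bit of the three-bit window (the usual radix-4 Booth convention), operands are assumed to lie in [0, P), and the final reduction of Z[1] on return is an addition of this sketch so that both outputs are fully reduced.

```python
# Illustrative functional model of Alg. 1; helper names are hypothetical.

def double_mod(s, p):
    """Doubling unit: 2*s mod p for 0 <= s < p (shift, then one conditional subtract)."""
    d = s << 1
    return d - p if d >= p else d


def reduce_once(z, p):
    """Single-step reduction of z in (-p, 2p) back into [0, p)."""
    if z < 0:
        return z + p
    if z >= p:
        return z - p
    return z


def select(y_ext, i, s1, s2):
    """Decode bits (i+2 : i) of the zero-appended multiplier into a signed addend."""
    w = (y_ext >> i) & 0b111
    if w in (0b000, 0b111):
        v = 0
    elif w in (0b001, 0b010, 0b101, 0b110):
        v = s1
    else:                       # 011 or 100
        v = s2
    return -v if w & 0b100 else v   # assumed sign rule: window MSB set means subtract


def dfm(x0, y0, x1, y1, p):
    """Return (x0*y0 mod p, x1*y1 mod p) using the interleaved two-cycle loop."""
    k = p.bit_length()
    m = (k + 3 if k % 2 == 0 else k + 2) + 1       # steps 2 and 3
    ya, yb = (y0 % p) << 1, (y1 % p) << 1          # append 0 to the right of the LSB
    z = [0, 0]
    s1 = [x0 % p, x1 % p]
    s2 = [double_mod(s1[0], p), double_mod(s1[1], p)]
    for i in range(0, m - 1, 2):                   # i = 0, 2, ..., <= m - 2
        # First cycle: add into Z[0], reduce Z[1]
        v0 = select(ya, i, s1[0], s2[0])
        s1[0] = double_mod(s2[0], p)
        s2[1] = double_mod(s1[1], p)
        z[0] = z[0] + v0
        z[1] = reduce_once(z[1], p)
        # Second cycle: add into Z[1], reduce Z[0]
        v1 = select(yb, i, s1[1], s2[1])
        s1[1] = double_mod(s2[1], p)
        s2[0] = double_mod(s1[0], p)
        z[1] = z[1] + v1
        z[0] = reduce_once(z[0], p)
    return z[0], reduce_once(z[1], p)
```

In this model each accumulator sees a plain addition in one cycle and a single-step correction in the other, so neither cycle chains an addition with a reduction; this is the software counterpart of the shortened dependency length described above.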

While Z[0] is being reduced in the second cycle, the new result (Z[1]) is added simultaneously. Similarly, after the second cycle, S1 and S2 hold the results 4X[0] mod p and 4X[1] mod p, respectively. Each iteration of the loop is completed in two clock cycles; however, since two bits of each multiplier are processed per iteration, k clock cycles are consumed to compute two k-bit FM operations. This amounts to a latency of k/2 cycles for a single FM operation.

Fig. 1: Proposed DFM hardware architecture

B. PM Module

Our novel PM architecture, based on the DFM module and adopting the ML technique, is shown in Fig. 2. To compute a PM, a standard projective coordinate system is applied, where a single iteration of the ML needs 10 FM and 8 FAS instructions. Our proposed DFM module can simultaneously execute two FM primitives, so to fully exploit the available parallelism we deployed two cores of the DFM module. In addition to the dual-core DFM module, the architecture consists of one FAS module, a register file, and a main control unit. The proposed architecture can execute four FM and one FAS instructions simultaneously. The register file holds intermediate values, while the control unit takes care of all the operations by activating/de-activating the required modules in the given architecture. The internal architecture of FAS is given in Fig. 3; it is developed as a two-stage pipeline architecture. It takes two clock cycles to complete either an FA or an FS operation, with a throughput of one operation per clock cycle. For an FA operation, the first adder performs the addition of the operands while the reduction is done in the second adder. In the case of FS, the first adder performs the subtraction while the correction is achieved by the second adder.

Fig. 2: Proposed PM architecture

Fig. 3: Field addition/subtraction

1) Data Flow and Instructions Mapping: Ten FM and eight FAS instructions are required to complete a single iteration of ML. The execution flow of these instructions on the PM architecture is given in Fig. 4. These eighteen instructions are sequenced into three levels l1 to l3. In l1, four additions W1 to W4 and four multiplications W5 to W8 are mapped on the deployed modules and are completed in (k + 5) clock cycles. Level l2 has three additions W9 to W11 and four multiplications W12 to W15 and is done in (k + 4) clock cycles. Finally, the last level l3 has only three instructions, two multiplications and one addition, which are finished in (k + 1) clock cycles. The total latency of a single iteration of ML is therefore (k + 5) + (k + 4) + (k + 1) = (3k + 10) clock cycles with two DFM cores.

Fig. 4: Mapping of a single iteration of Montgomery ladder

| Level | FAS | DFM1 | DFM2 |
|---|---|---|---|
| l1 | W1 = X2 + Z2; W2 = X2 − Z2; W3 = X3 + Z3; W4 = X3 − Z3 | W5 = W1 × W1; W6 = W2 × W2 | W7 = W3 × W2; W8 = W1 × W4 |
| l2 | W9 = W5 − W6; W10 = W8 − W7; W11 = W8 + W7 | W12 = W5 × W6; W13 = β × W9 | W15 = W10 × W10; W14 = W11 × W11 |
| l3 | W16 = W6 + W13 | Z3 = X1 × W15 | Z2 = W9 × W16 |

IV. IMPLEMENTATION AND RESULTS

The proposed PM module for 256-bit operand sizes is implemented on Xilinx Virtex-7, Zynq, and Virtex-6 FPGAs using the Xilinx Vivado tool. A software model in Python was developed to generate the test vectors used in the functional verification and validation stages. The implementation results and comparisons with other related proposals are shown in Table I. The comparison is based on area occupancy (slices), computation time, area-time product defined as #slices × computation time (ST), throughput (TP) (PM operations per second), and TP per slice (TPS).
TABLE I: Performance comparison of PM module with state of the art on FPGA platforms

| Ref. | Platform | Bits | #Slices | LUTs | Freq. (MHz) | Time (ms) | ST | TP | TPS | ∆TPS | Remarks |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Our | Virtex-7 | 256 | 2.96K | 9.32K | 238 | 0.69 | 2.04 | 1449.47 | 489.5 | - | Dual DFM, MCs, general p |
| Our | Zynq-7020 | 256 | 3.04K | 9.691K | 229 | 0.72 | 2.18 | 1388.8 | 456.8 | - | Dual DFM, MCs, general p |
| Our | Virtex-6 | 256 | 3.15K | 9.73K | 221 | 0.75 | 2.33 | 1351 | 428.8 | - | Dual DFM, MCs, general p |
| [17] | Virtex-6 | 256 | 12.6K | 45.54K | 25 | 0.32 | 4.1 | 3143 | 248.4 | 73% | NIST curve, lacks flexibility |
| [16] | Virtex-7 | 256 | 5.1K | 14.9K | 192 | 0.65 | 3.3 | 1538.46 | 301.66 | 62% | Unified arithmetic, Co-Z |
| [10] | Virtex-7 | 256 | 6.4K | - | 158 | 1.7 | 10.9 | 588.24 | 91.91 | 433% | Parallel units with Co-Z |
| [6] | Virtex-7 | 256 | 7.1K | 24.7K | 187 | 1.01 | 7.2 | 990.1 | 139.45 | 251% | Parallel modules |
| [9] | Zynq-7020 | 256 | 29.7K** | 116.3K* | 232 | 0.2 | 5.94 | 5000 | 168.3 | 171% | DSP blocks using Karatsuba |
| [11] | Virtex-7 | 256 | 6.2K | 18.1K | 195 | 0.7 | 4.3 | 1428.57 | 230.42 | 112% | Parallel units with Co-Z |
| [11] | Virtex-7 | 384 | 7.6K | 24.8K | 157 | 1.94 | 14.7 | 515.46 | 67.82 | | Parallel units with Co-Z |
| [5] | Zynq-7020 | 256 | 7.6K** | 30.3K* | 170.4 | 0.35 | 2.66 | 2857.42 | 375.92 | 22% | MCs, general p |
| [18] | Virtex-7 | 256 | 6.5K | - | 104 | 1.9 | 12.4 | 526.31 | 80.97 | 505% | Unified point operation |
| [8] | Virtex-7 | 256 | 6.8K | 22.14K | 166 | 0.8 | 5.8 | 1250.16 | 187.57 | 161% | Unified point operation |
| [19] | Virtex-6 | 256 | 6.6K | - | 76.3 | 2.83 | 18.7 | 353.36 | 53.54 | 701% | Parallel units with IM |

TP in PM operations per second; * #LUTs: (#DSPs × 619 + LUTs); ** estimated, 1 slice = 4 LUTs; Co-Z: common-Z coordinates; ∆TPS: TPS increase.
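The derived columns of Table I follow from the slice count and the computation time. The Python sketch below recomputes ST, TP and TPS for the three rows of the proposed design, and the relative TPS gain against a reference design. The helper names `derived_metrics` and `delta_tps` are hypothetical, the reading of ∆TPS as the gain of the proposed design on the same platform is inferred from the printed numbers rather than stated in the text, and small deviations from the printed figures are to be expected because the table lists rounded times.

```python
def derived_metrics(kslices, time_ms):
    """Return (ST, TP, TPS) from slices in thousands and time in milliseconds."""
    st = kslices * time_ms        # area-time product, K-slices x ms
    tp = 1000.0 / time_ms         # PM operations per second
    tps = tp / kslices            # throughput per thousand slices
    return st, tp, tps


def delta_tps(tps_proposed, tps_reference):
    """Percentage TPS increase of the proposed design over a reference design."""
    return (tps_proposed / tps_reference - 1.0) * 100.0


# (slices in K, time in ms) for the proposed design, as printed in Table I
rows = {
    "Virtex-7": (2.96, 0.69),
    "Zynq-7020": (3.04, 0.72),
    "Virtex-6": (3.15, 0.75),
}

for platform, (kslices, time_ms) in rows.items():
    st, tp, tps = derived_metrics(kslices, time_ms)
    print(f"{platform}: ST={st:.2f} TP={tp:.1f} TPS={tps:.1f}")

# Example: gain over [16], which is also reported on Virtex-7
print(f"gain over [16]: {delta_tps(489.5, 301.66):.0f}%")
```

Recomputing the metrics this way is mainly useful as a consistency check when adding further designs to the comparison, since ST and TPS depend only on the two measured quantities.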

TPS is computed as TP/#slices. Lower ST and higher TPS figures are the most desired criteria to establish the higher efficiency of a design. The percentage increase in TPS over the state-of-the-art is also presented in the table.

The PM module on average computes one PM operation in 0.69 ms, 0.72 ms, and 0.75 ms on Virtex-7, Zynq-7020, and Virtex-6 FPGAs with slice occupancy of 2.96K, 3.04K, and 3.15K, delivers TP of 1449.4, 1388.8, and 1351, achieves ST of 2.04, 2.18, and 2.33, and TPS of 489.5, 456.8, and 428.8, respectively. Note that all of the designs compared in Table I except [5] are based on either Weierstrass or NIST curves. NIST curve designs tend to be faster due to a specific prime structure, but they lack flexibility and can have back-doors [4]. The only generic MC design for 256-bit was proposed in [5]. It is based on redundant signed digit arithmetic used in the implementation of a Montgomery multiplier [14]. However, it utilizes FPGA-dedicated blocks such as DSP slices and BRAMs, which makes it platform-dependent. To the best of the authors’ knowledge, the proposed design is the first design for arbitrary MCs with a complete LUT implementation. This makes it portable to any FPGA family/device, in addition to the advantage of supporting a generic prime. It dominates all the mentioned designs in terms of the ST and TPS metrics: it has the lowest ST and the highest TPS values in comparison to the state-of-the-art. In terms of ST, it is 1.75, 1.61, 5.34, 3.52, 2.72, 2.10, 1.22, 6.07, 2.84, and 8.02 times better, whereas, in terms of TPS, it is 1.73, 1.62, 5.32, 3.5, 2.71, 2.12, 1.21, 6.05, 2.61, and 8.01 times better as compared to [17], [16], [10], [6], [9], [11], [5], [18], [8], and [19], respectively. Due to our constant-time ML, DFM, and FAS circuits, the design resists timing attacks. For all choices, we compute both values and select the result, which provides further resistance to SPA attacks.

V. CONCLUSION

This letter presented a novel hardware architecture to accelerate the PM primitive. It is designed using a novel double modular multiplier circuit that can perform two modular multiplication operations simultaneously. On different FPGA platforms, it delivers significantly better ST and TPS in comparison to other contemporary designs. Therefore, it is a prominent choice as a building block for key exchange and digital signature protocols in both performance-critical and resource-critical applications.

REFERENCES

[1] V. S. Miller, “Use of elliptic curves in cryptography,” in Conference on the theory and application of cryptographic techniques. Springer, 1985, pp. 417–426.
[2] N. Koblitz, “Elliptic curve cryptosystems,” Mathematics of computation, vol. 48, no. 177, pp. 203–209, 1987.
[3] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Communications of the ACM, vol. 21, no. 2, pp. 120–126, 1978.
[4] D. J. Bernstein, T. Lange et al., “SafeCurves: choosing safe curves for elliptic-curve cryptography,” Available online at http://safecurves.cr.yp.to, 2013.
[5] D. B. Roy and D. Mukhopadhyay, “High-speed implementation of ECC scalar multiplication in GF(p) for generic Montgomery curves,” IEEE transactions on very large scale integration (VLSI) systems, vol. 27, no. 7, pp. 1587–1600, 2019.
[6] K. Javeed, A. El-Moursy, and D. Gregg, “E2CSM: efficient FPGA implementation of elliptic curve scalar multiplication over generic prime field GF(p),” The Journal of Supercomputing, pp. 1–25, 2023.
[7] Y. A. Shah, K. Javeed, M. I. Shehzad, and S. Azmat, “LUT-based high-speed point multiplier for Goldilocks-curve448,” IET Computers & Digital Techniques, vol. 14, no. 4, pp. 149–157, 2020.
[8] K. Javeed and A. El-Moursy, “Area-time efficient point multiplication architecture on twisted Edwards curve over general prime field GF(p),” International Journal of Circuit Theory and Applications.
[9] A. M. Awaludin, H. T. Larasati, and H. Kim, “High-speed and unified ECC processor for generic Weierstrass curves over GF(p) on FPGA,” Sensors, vol. 21, no. 4, p. 1451, 2021.
[10] Y. Hao, S. Zhong, M. Ma, R. Jiang, S. Huang, J. Zhang, and W. Wang, “Lightweight architecture for elliptic curve scalar multiplication over prime field,” Electronics, vol. 11, no. 14, p. 2234, 2022.
[11] K. Javeed, A. El-Moursy, and D. Gregg, “EC-Crypto: Highly efficient area-delay optimized elliptic curve cryptography processor,” IEEE Access, 2023.
[12] P. C. Kocher, “Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems,” in Annual International Cryptology Conference. Springer, 1996, pp. 104–113.
[13] C. Costello and B. Smith, “Montgomery curves and their arithmetic,” Journal of Cryptographic Engineering, vol. 8, no. 3, pp. 227–240, 2018.
[14] P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of computation, vol. 44, no. 170, pp. 519–521, 1985.
[15] G. R. Blakley, “A computer algorithm for calculating the product AB modulo M,” IEEE Transactions on Computers, vol. 100, no. 5, pp. 497–500, 1983.
[16] K. Javeed, “FPGA implementation of area-time aware ECC scalar multiplication core,” in 2023 30th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2023, pp. 1–4.
[17] X. Hu, X. Li, X. Zheng, Y. Liu, and X. Xiong, “A high-speed processor for elliptic curve cryptography over NIST prime field,” IET Circuits, Devices & Systems, vol. 16, no. 4, pp. 350–359, 2022.
[18] M. M. Islam, M. S. Hossain, M. K. Hasan, M. Shahjalal, and Y. M. Jang, “Design and implementation of high-performance ECC processor with unified point addition on twisted Edwards curve,” Sensors, vol. 20, no. 18, p. 5148, 2020.
[19] T. Kudithi, “An efficient hardware implementation of the elliptic curve cryptographic processor over prime field,” International Journal of Circuit Theory and Applications, 2020.

6 pages
Chemicals Zetag DATA Beads Magnafloc 156 - 0410
No ratings yet
Chemicals Zetag DATA Beads Magnafloc 156 - 0410
2 pages
Reproductive System
No ratings yet
Reproductive System
8 pages
Validation Methodology On Airbag Deployment Process of Driver Side Airbag
No ratings yet
Validation Methodology On Airbag Deployment Process of Driver Side Airbag
9 pages
Epson L3216 Brochure
No ratings yet
Epson L3216 Brochure
2 pages
Worksheet of The Making of A Scientist
No ratings yet
Worksheet of The Making of A Scientist
1 page
Course Syllabus and Schedule Rubric
No ratings yet
Course Syllabus and Schedule Rubric
2 pages
H2 Chapter 14 Sequences and Series Learning Package 2025
No ratings yet
H2 Chapter 14 Sequences and Series Learning Package 2025
31 pages
19xr Impeller
No ratings yet
19xr Impeller
1 page
2026
No ratings yet
2026
14 pages
2250reozm 0720
No ratings yet
2250reozm 0720
3 pages
Sindhu Rudianto - PDF - Wiratman Wangsadinata .PDF - Ellen M. Rathje - Makalah
No ratings yet
Sindhu Rudianto - PDF - Wiratman Wangsadinata .PDF - Ellen M. Rathje - Makalah
10 pages
E Statement
No ratings yet
E Statement
4 pages
22PAM0062 - INTERMEDIATE ACADEMIC ENGLISH - Part8
No ratings yet
22PAM0062 - INTERMEDIATE ACADEMIC ENGLISH - Part8
20 pages
The Analysis of A Framed Building With Shear Walls Subjected To Horizontal and Vertical Load Is Essentially A Three
No ratings yet
The Analysis of A Framed Building With Shear Walls Subjected To Horizontal and Vertical Load Is Essentially A Three
5 pages
The Emergence and Adoption of E-Logistics System in Supply Chain
No ratings yet
The Emergence and Adoption of E-Logistics System in Supply Chain
7 pages
DDN Budget of Work Mathematics
No ratings yet
DDN Budget of Work Mathematics
12 pages
Emplys Job Satisfaction
No ratings yet
Emplys Job Satisfaction
64 pages
STATIKA Rangka Baja 8x6 M
No ratings yet
STATIKA Rangka Baja 8x6 M
7 pages
SZ715 User Manual
No ratings yet
SZ715 User Manual
4 pages
8114 Um Hu
No ratings yet
8114 Um Hu
37 pages
NTC-s Varios PDF
No ratings yet
NTC-s Varios PDF
5 pages
8semester Result
No ratings yet
8semester Result
1 page
Judgement 12
No ratings yet
Judgement 12
4 pages
Unmasking Japans Work Culture
No ratings yet
Unmasking Japans Work Culture
2 pages
GE2 - Exercise 2.1 Juvine Ramos
No ratings yet
GE2 - Exercise 2.1 Juvine Ramos
4 pages
Study Guide for the Cisco 300-440 ENCC Designing and Implementing Cloud Connectivity Exam.
From Everand
Study Guide for the Cisco 300-440 ENCC Designing and Implementing Cloud Connectivity Exam.
Anand Vemula
No ratings yet
Cortex-M Architecture and Programming Reference: Definitive Reference for Developers and Engineers
From Everand
Cortex-M Architecture and Programming Reference: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet


