Point Multiplication Accelerator For Arbitrary Montgomery Curves
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/LES.2024.3399071
Abstract—This letter presents a novel and efficient hardware architecture to accelerate the computation of the point multiplication (PM) primitive over arbitrary Montgomery curves. It is based on a novel double field multiplier (DFM) that computes two field multiplications simultaneously. The DFM uses the interleaved multiplication technique, and it shortens the critical path of the circuit by computing two results at once. It is generic and works for any prime structure and curve parameters over Montgomery curves. At the system level, a fast scheduling methodology is also presented to execute the field-level operations with the Montgomery ladder (ML) approach. Our ML and DFM designs perform the same operations regardless of the input values, which provides resistance to timing and simple power analysis side-channel attacks. The design is synthesized and implemented on different FPGA platforms. The implementation results confirm that it outperforms the state-of-the-art in terms of area-time product and throughput/slice. To the best of the authors' knowledge, it is the first fully LUT-based architecture for arbitrary Montgomery curves.

Index Terms—Montgomery curves, FPGA, double modular multiplier, point multiplication

I. INTRODUCTION

... field inversion (FI) is the most computationally intensive operation. With projective coordinates (X, Y, Z), in contrast, the performance burden is shifted towards the field multiplication (FM) operation. Several fast PM implementations have been proposed using Weierstrass or Montgomery ECs with either standard or general structure [5]–[11]. Most of these implementations use projective coordinates (X, Y, Z) on top of efficient FM architectures. To further speed up the computation, hard and soft-core IPs of modern FPGAs were utilized, which made the designs platform-dependent. In addition, some of them are vulnerable to the most common timing and simple power analysis (SPA) attacks [12]. Robustness against these attacks is an important feature that must be provided by cryptographic devices used for ensuring security services. The main contributions of this letter are:
- An efficient hardware architecture to accelerate the computation of PM over arbitrary MCs for a generic prime field is proposed.
- The proposed PM module is built on a novel double finite field multiplier (DFM).
- Subsequently, dual cores of DFM are utilized in the ...
Authorized licensed use limited to: K. Ramakrishnan Health and Educational Trust. Downloaded on August 02,2024 at 03:46:41 UTC from IEEE Xplore. Restrictions apply.
© 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Embedded Systems Letters. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/LES.2024.3399071
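The DFM is described as using the interleaved multiplication technique to produce two field products at once. A minimal Python sketch of bit-serial interleaved modular multiplication, and of two such multiplications advancing in lock-step, is given here for orientation. It is a software model under stated assumptions, not the authors' RTL: the function names, the MSB-first scan, and the two conditional subtractions per step are illustrative choices.

```python
def _step(acc: int, x: int, bit: int, p: int) -> int:
    """One interleaved step: double, conditionally add x, reduce below p."""
    acc = (acc << 1) + (x if bit else 0)
    # With acc < p and x < p on entry, the new acc is below 3p,
    # so two conditional subtractions are enough to bring it below p.
    if acc >= p:
        acc -= p
    if acc >= p:
        acc -= p
    return acc


def interleaved_modmul(x: int, y: int, p: int, n: int) -> int:
    """Return (x * y) mod p, scanning the n bits of y from MSB to LSB.

    Assumes 0 <= x < p and 0 <= y < 2**n.
    """
    acc = 0
    for i in reversed(range(n)):
        acc = _step(acc, x, (y >> i) & 1, p)
    return acc


def double_modmul(x0: int, y0: int, x1: int, y1: int, p: int, n: int):
    """Two independent products computed in the same n iterations.

    This mirrors the idea of a double multiplier only at the loop level:
    both accumulators advance once per iteration (hypothetical model).
    """
    acc0 = acc1 = 0
    for i in reversed(range(n)):
        acc0 = _step(acc0, x0, (y0 >> i) & 1, p)
        acc1 = _step(acc1, x1, (y1 >> i) & 1, p)
    return acc0, acc1


if __name__ == "__main__":
    p = 2**255 - 19  # example prime only; any odd prime p works
    print(double_modmul(3, 5, p - 1, p - 2, p, 255))
```

Because no step depends on a special form of p, the same loop works for any prime, which is the property the letter claims for the DFM.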
Fig. 1: Proposed DFM hardware architecture

Fig. 3: Field addition/subtraction
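The PM module schedules one ML iteration as 10 FM and 8 FAS operations, named W1 to W16 in the scheduling figure. A minimal Python sketch of that iteration, usable as a software cross-check, is given here. Several points are assumptions rather than statements from the letter: β is taken to be (A + 2)/4 mod p for a curve By² = x³ + Ax² + x, the updated X2 and X3 are taken to be W12 and W14 (as in the standard x-only ladder formulas), the function names are hypothetical, and the Curve25519 parameters serve only as an example.

```python
def ladder_step(x1, x2, z2, x3, z3, beta, p):
    """One ML iteration following the W1..W16 operation names."""
    # Level l1: four additions/subtractions, four multiplications
    w1 = (x2 + z2) % p
    w2 = (x2 - z2) % p
    w3 = (x3 + z3) % p
    w4 = (x3 - z3) % p
    w5 = w1 * w1 % p
    w6 = w2 * w2 % p
    w7 = w3 * w2 % p
    w8 = w1 * w4 % p
    # Level l2: three additions/subtractions, four multiplications
    w9 = (w5 - w6) % p
    w10 = (w8 - w7) % p
    w11 = (w8 + w7) % p
    w12 = w5 * w6 % p        # assumed to be the new X2
    w13 = beta * w9 % p
    w14 = w11 * w11 % p      # assumed to be the new X3
    w15 = w10 * w10 % p
    # Level l3: one addition, two multiplications
    w16 = (w6 + w13) % p
    z3_new = x1 * w15 % p
    z2_new = w9 * w16 % p
    return w12, z2_new, w14, z3_new


def ladder_x(k, x1, a, p, nbits):
    """x-coordinate of [k]P from the x-coordinate x1 of P (x-only ladder)."""
    beta = (a + 2) * pow(4, p - 2, p) % p   # assumption: beta = (A + 2)/4
    x2, z2, x3, z3 = 1, 0, x1 % p, 1        # (point at infinity, P)
    for i in reversed(range(nbits)):        # fixed iteration count
        bit = (k >> i) & 1
        # Hardware would select with multiplexers; a Python branch is
        # not constant-time and is used here only for readability.
        if bit:
            x2, z2, x3, z3 = x3, z3, x2, z2
        x2, z2, x3, z3 = ladder_step(x1, x2, z2, x3, z3, beta, p)
        if bit:
            x2, z2, x3, z3 = x3, z3, x2, z2
    return x2 * pow(z2, p - 2, p) % p       # single field inversion at the end


if __name__ == "__main__":
    P25519, A25519 = 2**255 - 19, 486662    # example curve parameters only
    print(hex(ladder_x(5, 9, A25519, P25519, 255)))
```

Every iteration executes the same 18 field operations whatever the key bit is; only the operand selection changes.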
B. PM Module

Our novel PM architecture, based on the DFM module and adopting the ML technique, is shown in Fig. 2. To compute a PM, a standard projective coordinate system is applied, where a single iteration of the ML needs 10 FM and 8 FAS instructions. Our proposed DFM module can execute two FM primitives simultaneously, so to fully exploit the available parallelism, we deployed two cores of the DFM module. In addition to the dual-core DFM module, the architecture consists of one FAS module, a register file, and a main control unit. It can execute four FM instructions and one FAS instruction simultaneously. The register file holds intermediate values, while the control unit manages all operations by activating/de-activating the required modules. The internal architecture of the FAS is given in Fig. 3, where it is developed as a two-stage pipeline ...

Scheduling of one ML iteration:
Level l1, FAS: W1 = X2 + Z2, W2 = X2 - Z2, W3 = X3 + Z3, W4 = X3 - Z3
Level l1, DFM1: W5 = W1 × W1, W6 = W2 × W2
Level l1, DFM2: W7 = W3 × W2, W8 = W1 × W4
Level l2, FAS: W9 = W5 - W6, W10 = W8 - W7, W11 = W8 + W7
Level l2, DFM1: W12 = W5 × W6, W13 = β × W9
Level l2, DFM2: W14 = W11 × W11, W15 = W10 × W10
Level l3, FAS: W16 = W6 + W13
Level l3, DFM1: Z3 = X1 × W15
Level l3, DFM2: Z2 = W9 × W16

... whereas level l2 has three additions (W9 to W11) and four multiplications (W12 to W15) and is done in (k + 4) clock cycles. Finally, the last level l3 has only three instructions, two multiplications and one addition, which are finished in (k + 1) clock cycles. The total latency of a single iteration of the ML is (3k + 10) clock cycles with two DFM cores.

IV. IMPLEMENTATION AND RESULTS

The proposed PM module for 256-bit operand sizes is implemented on Xilinx Virtex-7, Zynq, and Virtex-6 FPGAs using the Xilinx Vivado tool. A software model in Python is developed to capture test vectors used in the functional verification and validation stages. The implementation results and comparisons with other related proposals are shown in Table I. The comparison is based on area occupancy (slices), computation time, area-time product as #slices × computation time (ST), throughput (TP) (PM operations per second), and TP per slice
TABLE I: Performance comparison of PM module with state of the art on FPGA platforms

Ref.  Platform   Bits  #Slices  LUTs     Freq. (MHz)  Time (ms)  ST    TP       TPS     ∆TPS  Remarks
Our   Virtex-7   256   2.96K    9.32K    238          0.69       2.04  1449.47  489.5   -     Dual DFM, MCs, general p
Our   Zynq-7020  256   3.04K    9.691K   229          0.72       2.18  1388.8   456.8   -     Dual DFM, MCs, general p
Our   Virtex-6   256   3.15K    9.73K    221          0.75       2.33  1351     428.8   -     Dual DFM, MCs, general p
[17]  Virtex-6   256   12.6K    45.54K   25           0.32       4.1   3143     248.4   73%   NIST curve, lacks flexibility
[16]  Virtex-7   256   5.1K     14.9K    192          0.65       3.3   1538.46  301.66  62%   Unified arithmetic, Co-Z
[10]  Virtex-7   256   6.4K     -        158          1.7        10.9  588.24   91.91   433%  Parallel units with Co-Z
[6]   Virtex-7   256   7.1K     24.7K    187          1.01       7.2   990.1    139.45  251%  Parallel modules
[9]   Zynq-7020  256   29.7K**  116.3K*  232          0.2        5.94  5000     168.3   171%  DSP blocks using Karatsuba
[11]  Virtex-7   256   6.2K     18.1K    195          0.7        4.3   1428.57  230.42  112%  Parallel units with Co-Z
[11]  Virtex-7   384   7.6K     24.8K    157          1.94       14.7  515.46   67.82   -     Parallel units with Co-Z
[5]   Zynq-7020  256   7.6K**   30.3K*   170.4        0.35       2.66  2857.42  375.92  22%   MCs, general p
[18]  Virtex-7   256   6.5K     -        104          1.9        12.4  526.31   80.97   505%  Unified point operation
[8]   Virtex-7   256   6.8K     22.14K   166          0.8        5.8   1250.16  187.57  161%  Unified point operation
[19]  Virtex-6   256   6.6K     -        76.3         2.83       18.7  353.36   53.54   701%  Parallel units with IM

TP in PM operations per second. *: #LUTs = #DSPs × 619 + LUTs. **: estimated, 1 slice = 4 LUTs. Co-Z: common-Z coordinates. ∆TPS: TPS increase.
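The derived columns of Table I can be recomputed from the slice count and the computation time. The short Python sketch below does this for illustration. It assumes that slices are expressed in thousands (so ST is in K slices × ms and TPS is TP per thousand slices), which is what the tabulated numbers suggest; the function names are hypothetical. Because the tabulated inputs are rounded, recomputed values agree with the table only approximately.

```python
def metrics(slices_k: float, time_ms: float):
    """Return (ST, TP, TPS) for a design.

    slices_k : occupied slices in thousands (e.g. 2.96 for 2.96K)
    time_ms  : time of one PM operation in milliseconds
    """
    st = slices_k * time_ms    # area-time product
    tp = 1000.0 / time_ms      # PM operations per second
    tps = tp / slices_k        # throughput per thousand slices
    return st, tp, tps


def delta_tps(tps_ours: float, tps_ref: float) -> float:
    """Percentage TPS increase of the proposed design over a reference."""
    return (tps_ours / tps_ref - 1.0) * 100.0


if __name__ == "__main__":
    # Proposed design on Virtex-7, inputs taken from Table I
    st, tp, tps = metrics(2.96, 0.69)
    print(f"ST={st:.2f}  TP={tp:.1f}  TPS={tps:.1f}")
    # Increase over [16] on the same platform, using the tabulated TPS values
    print(f"dTPS vs [16]: {delta_tps(489.5, 301.66):.0f}%")
```

The proposed design is compared against each reference on the same FPGA family, so the Virtex-6 and Zynq-7020 rows use the corresponding TPS values of the proposed design.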
(TPS) (TP/#slices). Lower ST and higher TPS figures are the most desired criteria to establish the higher efficiency of a design. The percentage increase in TPS over the state-of-the-art is also presented in the table.

The PM module on average computes one PM operation in 0.69 ms, 0.72 ms, and 0.75 ms on Virtex-7, Zynq-7020, and Virtex-6 FPGAs with slice occupancy of 2.96K, 3.04K, and 3.15K, delivers TP of 1449.47, 1388.8, and 1351, and achieves ST of 2.04, 2.18, and 2.33 and TPS of 489.5, 456.8, and 428.8, respectively. Note that all of the designs compared in Table I except [5] are based on either Weierstrass or NIST curves. NIST curve designs tend to be faster due to a specific prime structure, but they lack flexibility and can have back-doors [4]. The only generic MC design for 256-bit was proposed in [5]. It is based on redundant signed digit arithmetic used in the implementation of the Montgomery multiplier [14]. However, it utilizes FPGA-dedicated blocks such as DSP slices and BRAMs, which makes it platform-dependent. To the best of the authors' knowledge, the proposed design is the first design for arbitrary MCs with a complete LUT implementation. This makes it portable to any FPGA family/device in addition to the generic-prime advantage. It dominates all the mentioned designs in terms of ST and TPS metrics, with the lowest ST and highest TPS values in comparison to the state-of-the-art. In terms of ST, it is 1.75, 1.61, 5.34, 3.52, 2.72, 2.10, 1.22, 6.07, 2.84, and 8.02 times better, whereas, in terms of TPS, it is 1.73, 1.62, 5.32, 3.5, 2.71, 2.12, 1.21, 6.05, 2.61, and 8.01 times better as compared to [17], [16], [10], [6], [9], [11], [5], [18], [8], and [19], respectively. Due to our constant-time ML, DFM, and FAS circuits, it resists timing attacks. For all choices, we compute both values and select the result, which provides further resistance to SPA.

V. CONCLUSION

This letter presented a novel hardware architecture to accelerate the PM primitive. It is designed using a novel double modular multiplier circuit that can perform two modular multiplication operations simultaneously. On different FPGA platforms, it delivers significantly better ST and TPS in comparison to other contemporary designs. Therefore, it is a prominent choice as a building block for key exchange and digital signature protocols in both performance- and resource-critical applications.

REFERENCES

[1] V. S. Miller, "Use of elliptic curves in cryptography," in Conference on the theory and application of cryptographic techniques. Springer, 1985, pp. 417–426.
[2] N. Koblitz, "Elliptic curve cryptosystems," Mathematics of computation, vol. 48, no. 177, pp. 203–209, 1987.
[3] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120–126, 1978.
[4] D. J. Bernstein, T. Lange et al., "Safecurves: choosing safe curves for elliptic-curve cryptography," Available online at http://safecurves.cr.yp.to, 2013.
[5] D. B. Roy and D. Mukhopadhyay, "High-speed implementation of ECC scalar multiplication in GF(p) for generic Montgomery curves," IEEE transactions on very large scale integration (VLSI) systems, vol. 27, no. 7, pp. 1587–1600, 2019.
[6] K. Javeed, A. El-Moursy, and D. Gregg, "E2CSM: efficient FPGA implementation of elliptic curve scalar multiplication over generic prime field GF(p)," The Journal of Supercomputing, pp. 1–25, 2023.
[7] Y. A. Shah, K. Javeed, M. I. Shehzad, and S. Azmat, "LUT-based high-speed point multiplier for Goldilocks-curve448," IET Computers & Digital Techniques, vol. 14, no. 4, pp. 149–157, 2020.
[8] K. Javeed and A. El-Moursy, "Area-time efficient point multiplication architecture on twisted Edwards curve over general prime field GF(p)," International Journal of Circuit Theory and Applications.
[9] A. M. Awaludin, H. T. Larasati, and H. Kim, "High-speed and unified ECC processor for generic Weierstrass curves over GF(p) on FPGA," Sensors, vol. 21, no. 4, p. 1451, 2021.
[10] Y. Hao, S. Zhong, M. Ma, R. Jiang, S. Huang, J. Zhang, and W. Wang, "Lightweight architecture for elliptic curve scalar multiplication over prime field," Electronics, vol. 11, no. 14, p. 2234, 2022.
[11] K. Javeed, A. El-Moursy, and D. Gregg, "EC-Crypto: Highly efficient area-delay optimized elliptic curve cryptography processor," IEEE Access, 2023.
[12] P. C. Kocher, "Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems," in Annual International Cryptology Conference. Springer, 1996, pp. 104–113.
[13] C. Costello and B. Smith, "Montgomery curves and their arithmetic," Journal of Cryptographic Engineering, vol. 8, no. 3, pp. 227–240, 2018.
[14] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of computation, vol. 44, no. 170, pp. 519–521, 1985.
[15] G. R. Blakley, "A computer algorithm for calculating the product AB modulo M," IEEE Transactions on Computers, vol. C-32, no. 5, pp. 497–500, 1983.
[16] K. Javeed, "FPGA implementation of area-time aware ECC scalar multiplication core*," in 2023 30th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2023, pp. 1–4.
[17] X. Hu, X. Li, X. Zheng, Y. Liu, and X. Xiong, "A high-speed processor for elliptic curve cryptography over NIST prime field," IET Circuits, Devices & Systems, vol. 16, no. 4, pp. 350–359, 2022.
[18] M. M. Islam, M. S. Hossain, M. K. Hasan, M. Shahjalal, and Y. M. Jang, "Design and implementation of high-performance ECC processor with unified point addition on twisted Edwards curve," Sensors, vol. 20, no. 18, p. 5148, 2020.
[19] T. Kudithi, "An efficient hardware implementation of the elliptic curve cryptographic processor over prime field," International Journal of Circuit Theory and Applications, 2020.