A High-Performance ECC Processor Over Curve448 Based On A Novel Variant of The Karatsuba Formula For Asymmetric Digit Multiplier
Abstract—In this paper, we present a high-performance architecture for elliptic curve cryptography (ECC) over Curve448, which, to the
best of our knowledge, is the fastest implementation of ECC point multiplication over Curve448 to date. First, we introduce a novel
variant of the Karatsuba formula for asymmetric digit multipliers, suitable for typical DSP primitives with asymmetric inputs. It reduces the
number of required DSPs compared to previous work and preserves performance via full parallelization and pipelining. We then
construct a 244-bit pipelined multiplier and an interleaved fast reduction algorithm, yielding a 12-stage pipelined modular
multiplier with four stages of input delay. Additionally, we present an efficient Montgomery ladder scheduling that requires no additional
register. The implementation on Xilinx 7-series FPGAs (Virtex-7, Kintex-7, Artix-7, and Zynq 7020) yields execution times
of 0.12, 0.13, 0.24, and 0.24 ms, respectively. It increases the throughput by 242% compared to the best previous work on Zynq 7020
and by 858% compared to the best previous work on Virtex-7. Furthermore, the proposed architecture achieves an efficiency
improvement of nearly 63% in terms of the Area×Time tradeoff. Lastly, we extend our architecture with well-known side-channel protections such as
scalar blinding, base-point randomization, and continuous randomization.
Index Terms—elliptic curve cryptography (ECC); Curve448; high-speed multiplier; asymmetric Karatsuba; field-programmable gate
array (FPGA)
1 INTRODUCTION
sign employed schoolbook multiplication with interleaved reduction for the underlying modular multiplier. The implementation results on a Xilinx Zynq 7020 Field Programmable Gate Array (FPGA) achieved a throughput of 1087 ECC point multiplications (ECPM) per second and consumed 1580 logic slices and 33 Digital Signal Processor (DSP) blocks. Their design also offered basic side-channel protections, such as scalar blinding and base-point randomization. Furthermore, they extended their previous design with additional protection against horizontal attacks in [11] by adding a re-randomization countermeasure. At the same time, they evaluated their countermeasure with scalar- and base-point-dependent leakage side-channel evaluations.

In [12], Shah et al. proposed a hardware design of Curve448 utilizing LookUp Tables (LUTs) only, which aims to be platform independent. They adopted the redundant-signed-digit (RSD) representation for arithmetic operations and the segmentation approach at the architectural level to reduce the required number of clock cycles for ECPM operations. Their implementation results targeting Virtex-7 achieved a throughput of 869 ECPM per second utilizing 50,143 LUTs.

The proposal by Niasar et al. [13] represents a very recent hardware implementation of Curve448. They investigated three different implementation strategies (i.e., lightweight, area-time efficient, and high-performance architectures) targeting the Xilinx Zynq 7020 FPGA. Their high-performance architecture increased throughput by 12% by executing 1,219 ECPM per second and increased efficiency by 40% in terms of required clock cycles×utilized area compared to the initial work in [10]. They achieved their speed-up by utilizing 81 DSPs for parallelization in the lowest level of Karatsuba computation. To the best of our knowledge, their high-performance variant is the state of the art of Curve448 hardware implementation in terms of ECPM throughput.

The Karatsuba-Ofman formula [14], also known as the Karatsuba formula, has been a widely used method of multiplying two n-bit arbitrary-precision numbers, which reduces the asymptotic complexity to $O(n^{1.585})$ bit operations compared to $O(n^2)$ bit operations for the schoolbook method. However, the recursive nature of the algorithm, which builds higher-precision products from lower-precision ones, leads to extra overhead of additions. In particular, implementing parallel Karatsuba in hardware is problematic in that it increases the critical delay path due to the addition tree, despite using parallel DSP blocks for digit multipliers at lower levels. Therefore, despite reducing the number of required DSP blocks, the overall operating frequency remains low, as shown in the implementation results in [13] and [15].

Awaludin et al. [16] demonstrated a new way of using the Karatsuba formula for a high-speed hardware parallel multiplier without the cost of increasing the critical delay path. The technique combines the schoolbook method and the Karatsuba algorithm with a compressor circuit (i.e., a carry-save-adder tree (CSAT)), despite requiring slightly more DSP blocks than the original Karatsuba method. The presented equation is similar to the method discovered earlier by Khachatrian et al. [17], which was then formalized by [18] as the arbitrary degree variant of Karatsuba (ADK). The method was initially intended to avoid overflow during the accumulation of the partial products on a typical word-based processor (i.e., software implementation), which is technically implemented in an iterative way.

Apart from optimizing the cost of extra addition, the method employed by [16] does not leverage the full capability of DSP blocks (i.e., Xilinx DSP48E1), as it uses a symmetric 16x16-bit digit multiplier. Thus, the use of the Karatsuba formula with the asymmetric feature of DSP blocks remains unexplored. To the best of our knowledge, Roy et al. [19] represents the most recent work that uses the full capability of asymmetric DSP blocks, which reduces the required DSP blocks in the schoolbook method using the nonstandard tiling method. This method is also used by [20] to construct a 257-bit signed multiplier for the hardware implementation of PQC SIKE.

Our contributions: The contributions of this paper are summarized as follows:

1) We present a novel variant of the Karatsuba formula for asymmetric digit multipliers, which reduces DSP block utilization while offering a high-speed multiplier through parallelization and pipelining. To the best of our knowledge, this is the first work considering the full capability of DSP blocks with the Karatsuba algorithm. Furthermore, it can be generalized for broader use in any cryptographic algorithm that employs multiplication.
2) We then present a high-performance ECC processor architecture over Curve448 that, to the best of our knowledge, outperforms the existing architectures in terms of execution time as well as Area×Time efficiency.
3) For the underlying architecture, we propose a 12-stage pipelined modular multiplier with four stages of input delay, which is built from a five-stage 244-bit fully pipelined multiplier with an interleaved fast reduction over the modulus $p = 2^{448} - 2^{224} - 1$.
4) The presented five-stage 244-bit fully pipelined multiplier is constructed from the novel variant of Karatsuba in point 1. At the same time, the interleaved fast reduction is obtained by exploiting the Solinas prime with the golden ratio $\varphi = 2^{224}$.
5) We provide an efficient Montgomery ladder scheduling algorithm that requires no additional temporary register.
6) Lastly, the proposed architecture is extended with side-channel attack countermeasures such as scalar blinding, base-point randomization, and continuous randomization, which are expected to resist vertical and horizontal attacks.

The rest of this paper is organized as follows: Section 2 gives a brief introduction to Curve448 with the underlying group arithmetic and field arithmetic. Section 3 describes the proposed novel variant of the Karatsuba formula for asymmetric digit multipliers. Section 4 presents the proposed hardware architecture of the ECC processor over Curve448. Then, in Section 6, we present our hardware implementation results and compare them to those of the existing methods. Lastly, Section 7 concludes the paper.
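To make the Karatsuba-Ofman trade-off discussed in this introduction concrete, the following minimal Python sketch contrasts one level of schoolbook multiplication (four digit products) with one level of the classical Karatsuba formula (three digit products plus extra additions and subtractions). It illustrates only the textbook symmetric formula, not the asymmetric variant proposed in this paper; the function names and the digit width `n` are illustrative assumptions.

```python
def split(x, n):
    """Split x into (high, low) digits of n bits each."""
    return x >> n, x & ((1 << n) - 1)


def schoolbook_2x2(a, b, n):
    """Two-digit product using four digit multiplications."""
    a1, a0 = split(a, n)
    b1, b0 = split(b, n)
    return ((a1 * b1) << (2 * n)) + ((a1 * b0 + a0 * b1) << n) + a0 * b0


def karatsuba_2x2(a, b, n):
    """Same product using three digit multiplications and extra additions."""
    a1, a0 = split(a, n)
    b1, b0 = split(b, n)
    hi = a1 * b1
    lo = a0 * b0
    # The middle term a1*b0 + a0*b1 is recovered from one extra product.
    mid = (a1 + a0) * (b1 + b0) - hi - lo
    return (hi << (2 * n)) + (mid << n) + lo


def digit_mults(levels, per_level):
    """Digit multiplications after `levels` recursive halvings of the operands."""
    return per_level ** levels


if __name__ == "__main__":
    a, b = 0xDEADBEEF, 0x12345678
    print(schoolbook_2x2(a, b, 16) == a * b, karatsuba_2x2(a, b, 16) == a * b)
    # Three halvings: 4**3 digit products for schoolbook, 3**3 for Karatsuba.
    print(digit_mults(3, 4), digit_mults(3, 3))
```

The saving in multiplications is paid for by the additions and subtractions around `mid`, which is the addition-tree overhead the text identifies as the limiting factor for the critical path in parallel hardware.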
TABLE 1
Comparison of required digit multipliers for different operand widths with existing methods

| Operand Width [bits] | Schoolbook | Schoolbook + Karatsuba [16] | Nonstandard Tiling [19] | Our method |
|---|---|---|---|---|
| 192 | 96 | 90 | 78 | 60 |
| 224 | 140 | 120 | 105 | 88 |
| 256 | 176 | 160 | 136 | 113 |
| 384 | 368 | 360 | 300 | 216 |
| 521 | 682 | 660 | 561 | 433 |
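As a rough cross-check of Table 1, the short Python sketch below recomputes the Schoolbook column and the relative saving of the proposed method. The 24x17-bit unsigned digit size is an assumption of this sketch (it is not stated in the table); it is used only because it reproduces the Schoolbook column for all five operand widths. The counts for the other methods are copied from the table rather than derived.

```python
# (schoolbook, our method) digit-multiplier counts copied from Table 1.
TABLE1 = {
    192: (96, 60),
    224: (140, 88),
    256: (176, 113),
    384: (368, 216),
    521: (682, 433),
}


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def schoolbook_digit_mults(width, wa=24, wb=17):
    """Digit products when one operand is cut into wa-bit digits and the
    other into wb-bit digits (assumed digit sizes, see the lead-in)."""
    return ceil_div(width, wa) * ceil_div(width, wb)


def saving_percent(width):
    """Relative reduction of our method versus schoolbook, in percent."""
    schoolbook, ours = TABLE1[width]
    return 100.0 * (schoolbook - ours) / schoolbook


if __name__ == "__main__":
    for width, (schoolbook, ours) in TABLE1.items():
        model = schoolbook_digit_mults(width)
        print(width, schoolbook, model, ours, round(saving_percent(width), 1))
```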
red points represent a single-digit multiplication, while two black points connected by a line represent a single-digit multiplication that is reduced from two-digit multiplications. Note that some lines may pass through multiple points, yet the relation should satisfy Equation 7. There are some exceptions: when a point is already used as a red point (i.e., due to the Karatsuba multiplication on the counterpart), it cannot be further reduced, even if it satisfies Equation 7. Thus, it remains a red point (including the counterpart), as in our case for points (8,2), (9,2), (8,5), (9,5), (8,2), and (9,2) in Fig. 5. Finally, we obtain lower complexity, as the multiplier requires only 88 DSPs instead of 140 DSPs in the schoolbook method or 105 DSPs in the nonstandard tiling method [19].

Fig. 6 shows the architecture of the 224-bit pipelined multiplier. The architecture contains five fully pipelined stages, which means it can accept a new input on each cycle. Our calculation steps of $C = A \cdot B$ are described as follows:

• Stages 1 and 2: Parallel 16-bit ripple-carry adders (RCAs) are used to compute $b_j - b_l$. The output of each 16-bit RCA is wired to a 25x17-bit signed Multiply-Accumulate (MAC) module, which also has a pre-adder input to compute $a_i - a_k$ before the multiplication stage. At the same time, parallel 24x16-bit signed multiplier (MUL) modules are used to compute $a_i b_l$ and $a_k b_j$. Both the 25x17-bit signed MAC with the pre-adder and the 24x16-bit signed MUL are built from DSP primitives with a three-stage and a two-stage pipeline, respectively, to achieve maximum performance, as recommended in [26] and shown in Fig. 7.

4.1.2 Fast Reduction over $p = 2^{448} - 2^{224} - 1$

We propose a fast reduction technique that is interleaved with the intermediate output of the 244-bit pipelined multiplier, as given in Algorithm 1. The multiplication of A and B, each of which is 448 bits wide, can be decomposed into four 244-bit multiplications. Note that we do not take the Karatsuba approach recommended by [5], since it does not give an advantage in our reduction step; rather, we accept the additional cost of one clock cycle on the pipelined multiplier. We perform a partial reduction of three intermediate results in advance (i.e., $z_4 = (z_1 + z_2 + z_3) \cdot 2^{224} \bmod p$), while the second term (i.e., $z_0 + z_3 \bmod p$) is accumulated with the result of the first reduction step, and we perform the second reduction accordingly (i.e., $C = z_0 + z_3 + z_4 \bmod p$). This technique relies on the following property:

$(a + b) \bmod p = (a + (b \bmod p)) \bmod p$ (8)

Considering the advantage of the Goldilocks modulus $p = 2^{448} - 2^{224} - 1$ and the fact that $2^{448} \equiv 2^{224} + 1 \pmod{p}$, the reduction of $z_4 = T \cdot 2^{224} \bmod p$, where $T = z_1 + z_2 + z_3$, can be performed efficiently, as mentioned in Step 13 of Algorithm 1. Referring to the structure of the RCSA, the actual addition is performed only in the second adder (i.e., 226-bit CCA). Accordingly, the reduction of $C = G \bmod p$, where $G = z_0 + z_3 + z_4$ yields 3 bits of overflow, can be performed efficiently with the RCSA, as mentioned in Steps 18–20 of Algorithm 1. Note that the final reduction might produce a carry at the first adder of the RCSA; this carry needs to be propagated to the second adder at the final step.
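The arithmetic behind this fast reduction can be checked with a small software model. The Python sketch below is a minimal illustration under stated assumptions: it splits the operands into 224-bit halves, uses $2^{448} \equiv 2^{224} + 1 \pmod{p}$ to fold the high parts, and relies on Python big integers in place of the RCSA and the pipelined multiplier. It does not follow the step numbering of Algorithm 1 (which is not reproduced here), and the helper names `mul_phi`, `fold`, and `modmul` are hypothetical.

```python
P = 2**448 - 2**224 - 1      # Goldilocks prime
PHI = 2**224                 # "golden ratio": PHI**2 = PHI + 1 (mod P)
MASK224 = PHI - 1
MASK448 = 2**448 - 1


def mul_phi(t):
    """Return a value congruent to t * 2^224 (mod P) using shifts and adds.

    With t = t1*PHI + t0:  t*PHI = t1*PHI^2 + t0*PHI
                                 = (t0 + t1)*PHI + t1   (mod P).
    """
    t1, t0 = t >> 224, t & MASK224
    return ((t0 + t1) << 224) + t1


def fold(g):
    """Fold bits above 2^448 back in: g1*2^448 + g0 = g0 + g1*(PHI + 1) (mod P)."""
    g1, g0 = g >> 448, g & MASK448
    return g0 + (g1 << 224) + g1


def modmul(a, b):
    """Compute a*b mod P from four half-width products z0..z3."""
    a1, a0 = a >> 224, a & MASK224
    b1, b0 = b >> 224, b & MASK224
    z0, z1, z2, z3 = a0 * b0, a0 * b1, a1 * b0, a1 * b1
    z4 = mul_phi(z1 + z2 + z3)       # partial reduction of the middle terms
    g = z0 + z3 + z4                 # a few bits of overflow above 2^448
    while g >> 448:                  # small, bounded number of folds
        g = fold(g)
    return g - P if g >= P else g


if __name__ == "__main__":
    a = 2**447 + 0x1234567890ABCDEF
    b = P - 2
    print(modmul(a, b) == (a * b) % P)
```

The identity $C \equiv z_0 + z_3 + (z_1 + z_2 + z_3) \cdot 2^{224}$ follows from expanding $(a_1 \varphi + a_0)(b_1 \varphi + b_0)$ and substituting $\varphi^2 \equiv \varphi + 1$, which is why no Karatsuba step is needed at this level.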
Fig. 8. Modular Multiplication Calculation Steps. M, A1, and A2 are a 244-bit multiplier, first RCSA, and second RCSA, respectively.

The Montgomery ladder in Equation 2 requires a conditional swap such that:

$(X_2, Z_2, X_3, Z_3) = (X_{2+b}, Z_{2+b}, X_{3-b}, Z_{3-b})$ (9)

where b is determined by the two most significant bits of the scalar in each iteration, assuming the scalar register is shifted left. A constant-time conditional swap can be implemented easily in hardware, since the updates of $X_2$, $Z_2$, $X_3$, and $Z_3$ are naturally performed in parallel.
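The conditional swap of Equation 9 can be modeled in software with a mask instead of a branch. The Python sketch below is illustrative only: it assumes that the swap control b is the XOR of two adjacent scalar bits, as in the X448 description of RFC 7748 [21], and it uses Python integers, which are not constant-time. The names `cswap` and `swap_controls` are hypothetical.

```python
MASK448 = (1 << 448) - 1


def cswap(swap, u, v):
    """Return (v, u) if swap == 1 and (u, v) if swap == 0, without branching on swap."""
    mask = -swap & MASK448           # all ones when swap == 1, zero otherwise
    dummy = mask & (u ^ v)
    return u ^ dummy, v ^ dummy


def swap_controls(k, bits=448):
    """Yield the swap control b for each ladder iteration, most significant bit first.

    Assumption: b = k_t XOR k_(t+1), i.e. the current scalar bit XORed with
    the previously processed one (zero before the first iteration).
    """
    prev = 0
    for t in reversed(range(bits)):
        k_t = (k >> t) & 1
        yield prev ^ k_t
        prev = k_t


if __name__ == "__main__":
    x2, x3 = 3, 5
    print(cswap(0, x2, x3), cswap(1, x2, x3))
    print(list(swap_controls(0b1011, bits=4)))
```

In hardware the same selection is a bank of multiplexers driven by b, so all four registers are updated in the same cycle regardless of the scalar value.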
Fig. 10. Pipelined Montgomery Ladder Scheduling of Curve448. The total latency is 52 clock cycles, without the requirement of an additional
temporary register. The constant a24 is equal to 39081.
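For reference, the operations that the schedule in Fig. 10 has to accommodate can be written as a plain software model. The Python sketch below follows the X448 ladder as specified in RFC 7748 [21], with the constant a24 = 39081 and ten modular multiplications per step. It does not model the pipeline, the 52-cycle schedule, or the register allocation; the function names are illustrative, the conditional swap is a plain select rather than a constant-time one, and the final inversion uses Fermat's little theorem via Python's built-in `pow`.

```python
P = 2**448 - 2**224 - 1
A24 = 39081
BITS = 448


def select_swap(swap, a, b):
    """Plain (non-constant-time) select used only for this software model."""
    return (b, a) if swap else (a, b)


def ladder_step(x1, x2, z2, x3, z3):
    """One combined double-and-add step: 10 modular multiplications."""
    a = (x2 + z2) % P
    b = (x2 - z2) % P
    aa = a * a % P                       # multiplication 1
    bb = b * b % P                       # multiplication 2
    e = (aa - bb) % P
    c = (x3 + z3) % P
    d = (x3 - z3) % P
    da = d * a % P                       # multiplication 3
    cb = c * b % P                       # multiplication 4
    x3n = pow(da + cb, 2, P)             # multiplication 5
    z3n = x1 * pow(da - cb, 2, P) % P    # multiplications 6 and 7
    x2n = aa * bb % P                    # multiplication 8
    z2n = e * (aa + A24 * e) % P         # multiplications 9 and 10
    return x2n, z2n, x3n, z3n


def x448(k, u):
    """x-coordinate-only scalar multiplication on Curve448 (software model)."""
    # Scalar clamping as in RFC 7748: clear the two lowest bits, set bit 447.
    k = ((k & ~3) | (1 << 447)) & ((1 << 448) - 1)
    x1 = u % P
    x2, z2, x3, z3, swap = 1, 0, x1, 1, 0
    for t in reversed(range(BITS)):
        k_t = (k >> t) & 1
        swap ^= k_t
        x2, x3 = select_swap(swap, x2, x3)
        z2, z3 = select_swap(swap, z2, z3)
        swap = k_t
        x2, z2, x3, z3 = ladder_step(x1, x2, z2, x3, z3)
    x2, x3 = select_swap(swap, x2, x3)
    z2, z3 = select_swap(swap, z2, z3)
    return x2 * pow(z2, P - 2, P) % P    # inversion via z^(p-2)


if __name__ == "__main__":
    alice, bob = 0x1F2E3D4C5B6A7988 << 300, 0x0123456789ABCDEF << 200
    shared_a = x448(alice, x448(bob, 5))     # 5 is the Curve448 base point u
    shared_b = x448(bob, x448(alice, 5))
    print(shared_a == shared_b)
```

A Diffie-Hellman style consistency check is used here instead of fixed test vectors; a hardware testbench would normally compare against the vectors in RFC 7748.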
z-coordinate with λ and uses a modular multiplication to update the x-coordinate accordingly. Hence, this countermeasure can be integrated easily by using an additional multiplication call during the initialization phase of the ECPM operation.

5.1.2 Scalar Blinding

Scalar blinding can be achieved by adding a multiple of the group order #E to the scalar k such that $k_r = k + r \times \#E$, where r is a random value. The correctness of this approach can be proven as follows:

$k_r P = (k + r \times \#E)P = kP + r\Theta = kP$ (11)

Note that the multiplication of point P by the group order #E results in the point at infinity. The computation removes the correlation between the Montgomery ladder swap function and the corresponding bit in scalar k. For ECC over a special prime field (i.e., a Solinas prime), it is recommended to use sufficiently large blinding factors r, as investigated in [27], namely at least half of the field size. Thus, a blinding factor r of 224-bit length yields $k_r$ of 672-bit length. The latency of ECPM is increased accordingly.

5.2 Secure against Horizontal Side-channel Attack

The horizontal side-channel attack is another type of attack in which the attacker observes leakage within a single run of the ECPM operation. Continuous point randomization for each Montgomery ladder step within a single ECPM operation can be applied sequentially to prevent horizontal attacks. It requires two more modular multiplications applied to the intermediate outputs (i.e., $\lambda X_{PA}$ and $\lambda Z_{PA}$) to re-randomize the Montgomery ladder computation.

Hence, enabling horizontal attack protection with continuous point randomization increases the Montgomery ladder time and the total latency. In particular, the Montgomery ladder scheduling in Fig. 10 is extended to 64 cycles. We assume that the random number is provided externally with sufficient throughput, such as by the Random Number Generator (RNG) design proposed by [28].

6 HARDWARE IMPLEMENTATION RESULT AND COMPARISON

The proposed design has been described in SystemVerilog HDL. Synthesis, mapping, placement, and routing were carried out using Xilinx Vivado 2020.1, targeting four modern devices (Xilinx Virtex-7 [XC7VX690T], Kintex-7 [XC7K325T], Artix-7 [XC7A100T], and Zynq 7020 [XC7Z020] FPGA) for a more comprehensive evaluation against other related works. The correctness of the implementation was verified using a testbench with reference to the test vectors provided in RFC 7748 [21].

The results of our ECC processor implementation, as well as those of several related papers over Curve448, are presented in Table 2. We achieve the lowest latency among the proposals targeting the Xilinx Zynq 7020 FPGA, with 0.24 and 0.39 ms for the unprotected and protected designs, respectively.

Additionally, we provide the implementation results on various devices for future reference, such as Artix-7, Kintex-7, and Virtex-7, achieving latencies of 0.24, 0.13, and 0.12 ms for the unprotected design, and 0.40, 0.22, and 0.20 ms for the protected design, respectively. For the unprotected design, our fastest implementation (Virtex-7) requires 7,521 slices, while Kintex-7, Artix-7, and Zynq 7020 utilize 7,210, 6,826, and 6,946 slices, respectively. On all four platforms, we utilize 88 DSPs and no BRAM. As can be inferred from the table, our architecture yields the highest efficiency in terms of the Area×Time and DSP×Time tradeoffs compared to other existing architectures.

TABLE 2
Performance comparison of the proposed High-Performance ECC Processor over Curve448 with existing literature

| Design | Platform | SCA* | Slices | DSP | BRAM | Latency [CCs] | Max. Freq [MHz] | Total Time [ms] | Throughput [OP/s] | Area×Time** [$\times 10^{-3}$] | DSP×Time [$\times 10^{-3}$] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [10] | Zynq 7020 | (-) | 1,580 | 33 | 14 | 328,286 | 357 | 0.92 | 1,087 | 4,490 | 30.36 |
| [10] | Zynq 7020 | (+) | 1,648 | 35 | 14 | 473,926 | 335 | 1.41 | 709 | 7,259 | 49.35 |
| [11] | Zynq 7020 | (+) | 1,985 | 33 | 14 | 499,344 | 341 | 1.46 | 685 | 7,716 | 48.18 |
| [11] | Zynq 7020 | (++) | 2,056 | 33 | 14 | 547,728 | 341 | 1.61 | 621 | 8,623 | 53.13 |
| [12] | Virtex-7 | (-) | 50,143 (LUT) | - | - | 372,742 | 325 | 1.15 | 870 | 14,416*** | - |
| [13] | Zynq 7020 | (-) | 4,354 | 81 | - | 77,702 | 95 | 0.82 | 1,220 | 10,212 | 66.42 |
| [13] | Zynq 7020 | (++) | 4,424 | 81 | - | 133,254 | 95 | 1.40 | 714 | 17,534 | 113.40 |
| This work | Zynq 7020 | (-) | 6,946 | 88 | - | 30,469 | 128 | 0.24 | 4,167 | 3,779 | 21.12 |
| This work | Zynq 7020 | (++) | 6,984 | 88 | - | 49,735 | 126 | 0.39 | 2,564 | 6,156 | 34.32 |
| This work | Artix-7 | (-) | 6,826 | 88 | - | 30,469 | 127 | 0.24 | 4,167 | 3,750 | 21.12 |
| This work | Artix-7 | (++) | 6,934 | 88 | - | 49,735 | 125 | 0.40 | 2,500 | 6,294 | 35.20 |
| This work | Kintex-7 | (-) | 7,210 | 88 | - | 30,469 | 237 | 0.13 | 7,692 | 2,081 | 11.44 |
| This work | Kintex-7 | (++) | 7,269 | 88 | - | 49,735 | 230 | 0.22 | 4,545 | 3,535 | 19.36 |
| This work | Virtex-7 | (-) | 7,521 | 88 | - | 30,469 | 250 | 0.12 | 8,333 | 1,959 | 10.56 |
| This work | Virtex-7 | (++) | 7,666 | 88 | - | 49,735 | 245 | 0.20 | 5,000 | 3,293 | 17.60 |

* (-): no protection, (+): scalar blinding and point randomization countermeasures, (++): scalar blinding and point re-randomization countermeasures
** Area = Slices + DSPs×100
*** Area = LUTs/4 (assuming 1 slice contains 4 LUTs, as mentioned in the specification [24])

To the best of our knowledge, the method by Niasar et al. [13] represents the state-of-the-art high-performance hardware implementation of Curve448. They provide three different designs (i.e., lightweight, area-time efficient, and high-performance); in particular, we compare our proposed design with their high-performance variant. Our proposed design increases the throughput by 242% for the unprotected design and by 259% for the protected design. Their approach is based on the refined Karatsuba formula by Bernstein in [29], employing five levels of Karatsuba computation and parallel multiplication using 81 DSP cores. However, the multilevel Karatsuba approach yields a longer addition tree that increases the critical path delay, limiting their operating frequency to 95 MHz, which is lower than that of our design.

Table 3 provides a detailed performance analysis in comparison with their design. The latency of our architecture outperforms the state of the art in all underlying field arithmetic and ECC group operations. The significant latency improvement is mainly due to the pipelined modular multiplier, which is constructed from a 244-bit fully pipelined multiplier and the proposed fast reduction over $p = 2^{448} - 2^{224} - 1$. Thanks to the novel variant of the Karatsuba formula, we can enable parallelization at the digit multiplication level without the large delay propagation caused by additions in the recursion tree, while offering relatively low DSP block utilization. Although a single modular multiplication operation does not give a significant latency improvement (i.e., 12 cycles compared to 15 cycles), employing multiple operations (i.e., 10 modular multiplications as in Equation 2) results in a significant latency improvement due to pipelining compared to their design (i.e., 52 cycles compared to 158 cycles). Furthermore, all the stages in the 224-bit multiplier are nearly always busy during the Montgomery ladder operation, with a utilization of $\frac{48}{52} \times 100 \approx 92\%$, so the pipeline architecture is used at high efficiency. On the other hand, the modular inversion via FLT consumes almost 18% of the total latency and is considered an inefficient method in our design. This is because the exponentiation $z^{2^{448} - 2^{224} - 3} \bmod p$ requires 462 consecutive modular multiplications, which cannot be parallelized through the pipelined architecture.

TABLE 3
Performance Analysis of Proposed ECC Processor in comparison with State-of-the-art on Zynq 7020 FPGA

| Operations | Clock Cycles: Niasar et al. [13] @95 MHz | Clock Cycles: Our Work @128 MHz |
|---|---|---|
| 1 x Modular Addition | 7 | 2 |
| 1 x Modular Subtraction | 7 | 2 |
| 1 x Modular Multiplication | 15 | 12 |
| 10 x Modular Multiplication | 150 | 48 |
| 1 x Modular Inverse | 6,917 | 5,544 |
| Montgomery Ladder Step | 158 | 52 |
| Single ECC Point Multiplication | 77,702 | 30,469 |
| Total Latency [ms] | 0.82 | 0.24 |

In terms of area, their design has lower slice utilization (i.e., 4,354 slices for the unprotected design and 4,424 slices for the protected design). However, in terms of the Area×Time tradeoff, our design is 63% more efficient for both the unprotected and protected designs. The cost of higher utilization is thus well absorbed by the latency improvement. Note that we use the same assumption as they do, where the area is equivalent to slices + DSPs and each DSP is assumed to be equivalent to 100 slices.

It is worth mentioning that the first hardware implementation of Curve448 was carried out by Sasdrich and Güneysu in [10], who later proposed a protected architecture by considering side-channel attack countermeasures [11]. They demonstrated an evaluation to detect scalar- and base-point-dependent leakage on hardware with side-channel protections (i.e., scalar blinding and point randomization) and showed that their methods are secure against side-channel attacks. Thanks to their results, we also include side-channel protections (i.e., scalar blinding, base-point randomization, and continuous point randomization) in our protected design, yet we present a 313% speed-up compared to their results on the same target device (i.e., Zynq 7020).

Shah et al. [12] proposed a LUT-based implementation
targeting Virtex-7, employing the RSD technique for the arithmetic operations. Their proposed design aimed to be platform independent by using LUTs only, consuming 50,143 LUTs with a throughput of 870 ECPM operations per second, yet our design is 858% faster than their design.

7 CONCLUSIONS

In this paper, we proposed a high-performance ECC processor over Curve448 that outperformed all the previous results in terms of execution time. The implementation on the Xilinx 7-series FPGAs Virtex-7, Kintex-7, Artix-7, and Zynq 7020 yielded execution times of 0.12, 0.13, 0.24, and 0.24 ms, respectively. The speed was obtained by utilizing a novel variant of the Karatsuba formula for asymmetric digit multipliers, constructing a high-throughput 244-bit fully pipelined multiplier. The method combined schoolbook long multiplication and Karatsuba multiplication, allowing its digit multiplications to be performed in parallel while leveraging the full capability of asymmetric DSP blocks. It is worth mentioning that the algorithm also works on arbitrary degrees, which means it can be generalized for wider use in any cryptographic algorithm that requires multiplication. Subsequently, the interleaved fast reduction over $2^{448} - 2^{224} - 1$ was presented, yielding a high-throughput 12-stage modular multiplier with four stages of input delay. Furthermore, we also proposed certain components to maximize the speed gain and the overall performance, such as a low-latency modular adder/subtractor as well as efficient scheduling of the Montgomery ladder. Finally, the proposed architecture was extended with both vertical and horizontal side-channel protection through well-known countermeasures such as scalar blinding, base-point randomization, and continuous randomization.

ACKNOWLEDGMENTS

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2019-0-01343, Regional strategic industry convergence security core talent training business) and supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2021-2020-0-01797) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

REFERENCES

[1] 3GPP, “Security architecture and procedures for 5G system,” Technical Specification (TS) 3GPP TS 33.501 V17.4.1 (2022–01), 2022.
[2] C. Fan, S. Ghaemi, H. Khazaei, and P. Musilek, “Performance evaluation of blockchain systems: A systematic survey,” IEEE Access, vol. 8, pp. 126927–126950, 2020.
[3] A. Langley, M. Hamburg, and S. Turner, “RFC 7748: Elliptic curves for security,” Internet Research Task Force (IRTF), 2016.
[4] D. J. Bernstein, “Curve25519: New Diffie-Hellman speed records,” in International Workshop on Public Key Cryptography. Springer, 2006, pp. 207–228.
[5] M. Hamburg, “Ed448-Goldilocks, a new elliptic curve,” IACR Cryptol. ePrint Arch., vol. 2015, p. 625, 2015.
[6] E. Rescorla, “The transport layer security (TLS) protocol version 1.3,” Internet Requests for Comments, RFC Editor, RFC 8446, August 2018.
[7] L. Chen, D. Moody, A. Regenscheid, and K. Randall, “Recommendations for discrete logarithm-based cryptography: Elliptic curve domain parameters,” National Institute of Standards and Technology, Tech. Rep., 2019.
[8] P. W. Shor, “Algorithms for quantum computation: Discrete logarithms and factoring,” in Proceedings 35th Annual Symposium on Foundations of Computer Science. IEEE, 1994, pp. 124–134.
[9] N. Bindel, U. Herath, M. McKague, and D. Stebila, “Transitioning to a quantum-resistant public key infrastructure,” in International Workshop on Post-Quantum Cryptography. Springer, 2017, pp. 384–405.
[10] P. Sasdrich and T. Güneysu, “Cryptography for next generation TLS: Implementing the RFC 7748 elliptic Curve448 cryptosystem in hardware,” in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 2017, pp. 1–6.
[11] P. Sasdrich and T. Güneysu, “Exploring RFC 7748 for hardware implementation: Curve25519 and Curve448 with side-channel protection,” Journal of Hardware and Systems Security, vol. 2, no. 4, pp. 297–313, 2018.
[12] Y. A. Shah, K. Javeed, M. I. Shehzad, and S. Azmat, “LUT-based high-speed point multiplier for Goldilocks-Curve448,” IET Computers & Digital Techniques, vol. 14, no. 4, pp. 149–157, 2020.
[13] M. B. Niasar, R. Azarderakhsh, and M. M. Kermani, “Efficient hardware implementations for elliptic curve cryptography over Curve448,” in International Conference on Cryptology in India. Springer, 2020, pp. 228–247.
[14] A. A. Karatsuba and Y. P. Ofman, “Multiplication of many-digital numbers by automatic computers,” in Doklady Akademii Nauk, vol. 145, no. 2. Russian Academy of Sciences, 1962, pp. 293–294.
[15] R. Salarifard and S. Bayat-Sarmadi, “An efficient low-latency point-multiplication over Curve25519,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 10, pp. 3854–3862, 2019.
[16] A. M. Awaludin, H. T. Larasati, and H. Kim, “High-speed and unified ECC processor for generic Weierstrass curves over GF(p) on FPGA,” Sensors, vol. 21, no. 4, p. 1451, 2021.
[17] G. H. Khachatrian, M. K. Kuregian, K. R. Ispiryan, and J. L. Massey, “Fast multiplication of integers for public-key applications,” in International Workshop on Selected Areas in Cryptography. Springer, 2001, pp. 245–254.
[18] M. Scott, “Missing a trick: Karatsuba variations,” Cryptography and Communications, vol. 10, no. 1, pp. 5–15, 2018.
[19] D. B. Roy, D. Mukhopadhyay, M. Izumi, and J. Takahashi, “Tile before multiplication: An efficient strategy to optimize DSP multiplier for accelerating prime field ECC for NIST curves,” in Proceedings of the 51st Annual Design Automation Conference, 2014, pp. 1–6.
[20] P. M. C. Massolino, P. Longa, J. Renes, and L. Batina, “A compact and scalable hardware/software co-design of SIKE,” IACR Transactions on Cryptographic Hardware and Embedded Systems, 2020.
[21] A. Langley, M. Hamburg, and S. Turner, “Elliptic curves for security,” Internet Requests for Comments, RFC Editor, RFC 7748, January 2016.
[22] P. L. Montgomery, “Speeding the Pollard and elliptic curve methods of factorization,” Mathematics of Computation, vol. 48, no. 177, pp. 243–264, 1987.
[23] B. Devlin, Blockchain Acceleration Using FPGAs - Elliptic curves, zk-SNARKs, and VDFs, ZCASH Foundation, 2019.
[24] Xilinx, 7 Series FPGAs Data Sheet: Overview, 2020 (accessed January 26, 2022), https://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf.
[25] T. B. Preußer, M. Zabel, and R. G. Spallek, “Accelerating computations on FPGA carry chains by operand compaction,” in 2011 IEEE 20th Symposium on Computer Arithmetic. IEEE, 2011, pp. 95–102.
[26] Xilinx, 7 Series DSP48E1 Slice User Guide, 2018 (accessed December 28, 2020), https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf.
[27] W. Schindler and A. Wiemers, “Efficient side-channel attacks on scalar blinding on elliptic curves with special structure,” in NIST Workshop on ECC Standards, 2015.
[28] A. M. Awaludin, D. Pratama, and H. Kim, “AnyTRNG: Generic, high-throughput, low-area true random number generator based on synchronous edge sampling,” in International Conference on Information Security Applications. Springer, 2021, pp. 157–168.
[29] D. J. Bernstein, “Batch binary Edwards,” in Annual International Cryptology Conference. Springer, 2009, pp. 317–336.