
A High-Performance ECC Processor Over Curve448 Based On A Novel Variant of The Karatsuba Formula For Asymmetric Digit Multiplier

This document summarizes a research paper that presents a high-performance elliptic curve cryptography (ECC) processor architecture for Curve448. It introduces a novel variant of the Karatsuba formula for asymmetric digit multiplication that reduces the number of required digital signal processors compared to previous work. The implementation on FPGAs yields execution times of 0.12-0.24 ms and increases throughput by 242-858% compared to previous work. It also optimizes area-time efficiency by 63%. The architecture includes side-channel protections like scalar blinding and base-point randomization.

A High-performance ECC Processor over Curve448 based on a Novel Variant of the Karatsuba Formula for Asymmetric Digit Multiplier
Asep Muhamad Awaludin, Jonguk Park, Rini Wisnu Wardhani, and Howon Kim.

Abstract—In this paper, we present a high-performance architecture for elliptic curve cryptography (ECC) over Curve448, which to the
best of our knowledge, is the fastest implementation of ECC point multiplication over Curve448 to date. Firstly, we introduce a novel
variant of the Karatsuba formula for asymmetric digit multiplier, suitable for typical DSP primitive with asymmetric input. It reduces the
number of required DSPs compared to previous work and preserves the performance via full parallelization and pipelining. We then
construct a 224-bit pipelined multiplier and an interleaved fast reduction algorithm, yielding a 12-stage pipelined modular
multiplication with four stages of input delay. Additionally, we present an efficient Montgomery ladder scheduling that requires no additional
register. The implementation on Xilinx 7-series FPGAs (Virtex-7, Kintex-7, Artix-7, and Zynq 7020) yields execution times
of 0.12, 0.13, 0.24, and 0.24 ms, respectively. It increases the throughput by 242% compared to the best previous work on Zynq 7020
and by 858% compared to the best previous work on Virtex-7. Furthermore, the proposed architecture achieves an improvement of nearly 63%
in Area×Time efficiency. Lastly, we extend our architecture with well-known side-channel protections such as
scalar blinding, base-point randomization, and continuous randomization.

Index Terms—elliptic-curves cryptography (ECC); Curve448; high-speed multiplier; asymmetric Karatsuba; field-programmable gate
array (FPGA)

• The authors are with the School of Computer Science and Engineering, Pusan National University, Busan 609735, Korea (E-mail: {asep.muhamad11, daiula10, rini.wisnu, howonkim}@pusan.ac.kr).

1 INTRODUCTION

The performance of Public-Key Cryptography has become one of the main factors of interest in recently emerging technologies such as the 5G System [1] and Blockchain [2]. At the same time, there is a growing demand to raise the security level against attacks that could compromise the overall performance. In particular, applications enabled on Internet of Things (IoT) devices suffer from performance degradation due to limited resources on the processing unit. Among asymmetric cryptographic algorithms, Elliptic Curve Cryptography (ECC) has been chosen as the building block in the security protocols of those technologies due to its smaller key size. The Internet Research Task Force (IRTF) [3] recommended Curve25519 [4] and Curve448 [5] for a high level of practical security with 128-bit and 224-bit security levels, respectively, along with inclusion in the Transport Layer Security (TLS) standard 1.3 [6]. Afterward, the National Institute of Standards and Technology (NIST) [7] also included these curves in its standard. Curve448 is a conservatively designed elliptic curve with very competitive performance on a wide variety of platforms, leading to ECC construction issues and to advances in strong cryptanalysis and classical attacks [5]. This research is being performed not only due to the wide use of high-performance ECC processors over Curve448 but also to address the need for a high-security and efficient ECC processor architecture as an imperative step for the emergence of Post-Quantum Cryptography (PQC). We must acknowledge, with the invention of Shor's factorization algorithm [8], that the current state of the cryptographic system will likely soon be compromised by quantum computing; likewise, the classic public key exchange algorithm will soon be replaced. Nevertheless, as long as PQC has not been fully implemented, hybrid mode ECC will still be used in order to sustain the compatibility of industry and government regulations. In hybrid schemes, classical cryptography and PQC work concurrently during the transition to PQC [9], so classical cryptographic systems will still be needed even though PQC development has come a long way. Consequently, designing an ECC processor architecture with a high level of security, high speed, low latency, and high efficiency in every single processing step is crucial.

Generally, optimization can be done through algorithmic improvement to reduce the number of calculation steps for expensive primitive operations such as finite field and group operations. Despite that, keeping a short critical delay path in a hardware implementation is even more challenging because it is limited by the technology and determines the maximum working frequency. Thus, the critical delay requires more attention in hardware than in a software implementation.

Prior work: To the best of our knowledge, there are only a few published results on hardware implementations targeting ECC with a security level above 128 bits, particularly Curve448.

The first hardware implementation of Curve448 was investigated by Sasdrich and Güneysu in [10]. Their design employed schoolbook multiplication with interleaved reduction for the underlying modular multiplier. The implementation results on a Xilinx Zynq 7020 Field Programmable Gate Array (FPGA) achieved a throughput of 1087 ECC point multiplications (ECPM) per second and consumed 1580 logic slices and 33 Digital Signal Processor (DSP) blocks. Their design also offered basic side-channel protections, such as scalar blinding and base-point randomization. Furthermore, they extended their previous design with additional protection against horizontal attacks in [11] by adding a re-randomization countermeasure. At the same time, they evaluated their countermeasure with scalar- and base-point-dependent leakage side-channel evaluations.

In [12], Shah et al. proposed a hardware design of Curve448 utilizing LookUp Tables (LUTs) only, which aims to be platform independent. They adopted the redundant-signed-digit (RSD) representation for arithmetic operations and the segmentation approach at the architectural level to reduce the required number of clock cycles for ECPM operations. Their implementation results targeting Virtex-7 achieved a throughput of 869 ECPM per second utilizing 50,143 LUTs.

The proposal by Niasar et al. [13] represents a very recent hardware implementation of Curve448. They investigated three different implementation strategies (i.e., lightweight, area-time efficient, and high-performance architectures) targeting the Xilinx Zynq 7020 FPGA. Their high-performance architecture increased throughput by 12% by executing 1,219 ECPM per second and increased efficiency by 40% in terms of required clock cycles×utilized area compared to the initial work in [10]. They achieved their speed-up by utilizing 81 DSPs for parallelization in the lowest level of Karatsuba computation. To the best of our knowledge, their high-performance variant is the state of the art of Curve448 hardware implementation in terms of ECPM throughput.

The Karatsuba-Ofman formula [14], also known as the Karatsuba formula, has been a widely used method of multiplying two n-bit arbitrary-precision numbers, which reduces the asymptotic complexity to $O(n^{1.585})$ bit operations compared to $O(n^2)$ bit operations for the schoolbook method. However, the nature of its algorithm, which uses recursion to construct higher precision numbers, leads to the extra overhead of additions. In particular, implementing parallel Karatsuba in hardware is problematic in that it increases the critical delay path due to the addition tree, despite using parallel DSP blocks for digit multipliers at lower levels. Therefore, despite reducing the number of required DSP blocks, the overall operating frequency remains low, as shown in the implementation results in [13] and [15].

Awaludin et al. [16] demonstrated a new way of using the Karatsuba formula for a high-speed hardware parallel multiplier without the cost of increasing the critical delay path. The technique employs the combination of the schoolbook method and the Karatsuba algorithm with a compressor circuit (i.e., carry-save-adder tree (CSAT)), despite requiring slightly more DSP blocks than the original Karatsuba method. Apparently, the presented equation is similar to the method discovered earlier by Khachatrian et al. [17], which was then formalized by [18], called the arbitrary degree variant of Karatsuba (ADK). The method was initially intended to avoid overflow during the accumulation of the partial products on a typical word-based processor (i.e., software implementation), which is technically implemented in an iterative way.

Apart from optimizing the cost of extra addition, the method employed by [16] does not leverage the full capability of DSP blocks (i.e., Xilinx DSP48E1), as they use a symmetric 16x16-bit digit multiplier. Thus, the use of the Karatsuba formula with the asymmetric feature of DSP blocks remains unexplored. To the best of our knowledge, Roy et al. [19] represents the most recent work that uses the full capability of asymmetric DSP blocks, which reduces the required DSP blocks in the schoolbook method using the nonstandard tiling method. This method is also used by [20] to construct a 257-bit signed multiplier for the hardware implementation of PQC SIKE.

Our contributions: The contributions of this paper are summarized as follows:

1) We present a novel variant of the Karatsuba formula for asymmetric digit multiplier, which reduces DSP block utilization while offering a high-speed multiplier through parallelization and pipelining. To the best of our knowledge, this is the first work considering the full capability of DSP blocks with the Karatsuba algorithm. Furthermore, it can be generalized for broader use in a cryptographic algorithm that employs multiplication.
2) We then present a high-performance ECC processor architecture over Curve448 that, to the best of our knowledge, outperforms the existing architecture in terms of execution time as well as Area×Time efficiency.
3) For the underlying architecture, we propose a 12-stage pipelined modular multiplier with four stages of input delay, which is built from a five-stage 224-bit fully pipelined multiplier with an interleaved fast reduction over the modulus $p = 2^{448} - 2^{224} - 1$.
4) The presented five-stage 224-bit fully pipelined multiplier is constructed from the novel variant of Karatsuba in point 1. At the same time, the interleaved fast reduction is obtained by exploiting the Solinas prime with the golden ratio $\phi = 2^{224}$.
5) We provide an efficient Montgomery ladder scheduling algorithm without the requirement of an additional temporary register.
6) Lastly, the proposed architecture is extended with side-channel attack countermeasures such as scalar blinding, base-point randomization, and continuous randomization, which are expected to resist vertical and horizontal attacks.

The rest of this paper is organized as follows: Section 2 gives a brief introduction to Curve448 with the underlying group arithmetic and field arithmetic. Section 3 describes the proposed novel variant of the Karatsuba formula for asymmetric digit multiplier. Section 4 presents the proposed hardware architecture of the ECC processor over Curve448. Section 5 extends the architecture with side-channel attack countermeasures. Then, in Section 6, we present our hardware implementation results and compare them to those of the existing methods. Lastly, Section 7 concludes the paper.

2 PRELIMINARIES

Ed448-Goldilocks is an elliptic curve over the prime field $GF(p)$ with a 224-bit security level, introduced by Hamburg in [5], which is defined in untwisted Edwards form:

$$E_d : y^2 + x^2 = 1 + d x^2 y^2 \bmod p \qquad (1)$$

with $d = -39081$ and $p = 2^{448} - 2^{224} - 1$. The curve is birationally equivalent to the Montgomery curve defined in RFC 7748 [21] called Curve448, the term we will use for the rest of the paper. Curve448 satisfies the requirement of SafeCurves and is included in TLS standard 1.3 [6].

2.1 ECC Group Law

Let $k$ be a scalar, and let $P = (x_P, y_P)$ and $Q = (x_Q, y_Q)$ be two points represented in affine coordinates, where $P, Q \in E$ and $x_P, y_P \in GF(p)$. An ECC point multiplication (ECPM), $Q = k \cdot P$, is a $k$-times addition of point $P$ (i.e., $P + P + \dots + P$), which can be performed with the group operations of point doubling (PD) and point addition (PA). Typically, the projective coordinate representation is used to avoid modular inversion during intermediate computation, where an affine point $P = (x_P, y_P)$ can be converted to a projective point $P = (X, Y, Z)$ such that $x_P = X/Z$ and $y_P = Y/Z$.

The Montgomery ladder was introduced to perform ECPM over the Montgomery curve, which processes the point doubling and point addition computations in a single step [22]. A single step of the Montgomery ladder is computed with the following formula (taken from [4]):

$$\begin{aligned}
X_{PD} &= (X_2 - Z_2)^2 (X_2 + Z_2)^2 \\
Z_{PD} &= \left[(X_2 + Z_2)^2 - (X_2 - Z_2)^2\right] \cdot \left[(X_2 + Z_2)^2 + a_{24}\left((X_2 + Z_2)^2 - (X_2 - Z_2)^2\right)\right] \\
X_{PA} &= \left((X_2 - Z_2)(X_3 + Z_3) + (X_2 + Z_2)(X_3 - Z_3)\right)^2 \\
Z_{PA} &= \left((X_2 - Z_2)(X_3 + Z_3) - (X_2 + Z_2)(X_3 - Z_3)\right)^2 x_P
\end{aligned} \qquad (2)$$

where $Q_2 = 2P_2$ and $Q_3 = P_2 + P_3$ with $Q_2 = (X_{PD}, Z_{PD})$, $Q_3 = (X_{PA}, Z_{PA})$, $P_2 = (X_2, Z_2)$, and $P_3 = (X_3, Z_3)$. A constant value $a_{24} = 39081$ is used specifically for Curve448. Note that this formula needs only the x-coordinate of the base point $P$ to perform ECPM. The formula requires ten modular multiplications and eight modular additions/subtractions.

2.2 Field Arithmetic

The name "Goldilocks" refers to the prime modulus of Curve448, which is defined as the Solinas trinomial prime with the golden ratio $\phi = 2^{224}$ and offers fast arithmetic on typical (i.e., 32-bit or 64-bit) machines. Moreover, since $\phi^2 \equiv \phi + 1 \pmod p$, the golden ratio allows the Karatsuba multiplication of two operands $A = (a_1\phi + a_0)$ and $B = (b_1\phi + b_0)$, $A, B \in GF(p)$, to be calculated efficiently as follows:

$$\begin{aligned}
C &= (a_1\phi + a_0) \cdot (b_1\phi + b_0) \\
  &\equiv (a_1 b_1 + a_0 b_0) + (a_1 b_0 + a_0 b_1 + a_1 b_1)\phi \pmod p \\
  &= (a_1 b_1 + a_0 b_0) + \left((a_0 + a_1)(b_0 + b_1) - a_0 b_0\right)\phi
\end{aligned} \qquad (3)$$

A modular inversion is required to convert the projective coordinate representation $Q = (X, Z)$ back to the affine coordinate representation $Q = (x_Q)$ at the end of the ECPM operation. A constant-time modular inversion can be implemented using Fermat's Little Theorem (FLT) such that $Z^{-1} \equiv Z^{p-2} \bmod p$. Finally, the affine representation of the point $Q$ is calculated as $x_Q = X Z^{-1}$, where $Z^{-1}$ is computed using FLT.

3 NOVEL VARIANT OF KARATSUBA FORMULA FOR ASYMMETRIC DIGIT MULTIPLIER

Consider two n-bit arbitrary-precision numbers $A$ and $B$ represented in asymmetric radixes $\alpha$ and $\beta$, where $\alpha \neq \beta$:

$$A = \sum_{i=0}^{u-1} a_i \alpha^i, \qquad B = \sum_{j=0}^{v-1} b_j \beta^j \qquad (4)$$

where $u$ and $v$ are the numbers of digits of $A$ and $B$, respectively. The product $C = A \cdot B$ is calculated as follows:

$$C = \sum_{i=0}^{u-1} \sum_{j=0}^{v-1} a_i \alpha^i b_j \beta^j = \sum_{i=0}^{u-1} a_i \alpha^i \sum_{j=0}^{v-1} b_j \beta^j \qquad (5)$$

The schoolbook algorithm multiplies a $u$-digit number and a $v$-digit number by multiplying each digit of one input by each digit of the other, which takes $O(uv)$ digit multiplications in total. Clearly, it requires $uv$ DSP blocks when performing full parallelization on digit multiplication. We investigate a novel variant of the Karatsuba formula for asymmetric digit multiplier, which reduces the complexity as well as the number of required DSPs compared to other similar works (i.e., [16], [19], [23]).

We rewrite Equation 5 as follows:

$$C = \sum_{i=0}^{u-1} \sum_{j=0}^{\frac{i\alpha}{\beta}} a_i \alpha^i b_j \beta^j + \sum_{k=0}^{u-1} \sum_{l=\frac{k\alpha}{\beta}}^{v-1} a_k \alpha^k b_l \beta^l \qquad (6)$$

where, when $\alpha^i \beta^j = \alpha^k \beta^l$, we obtain the following identity:

$$a_i \alpha^i b_j \beta^j + a_k \alpha^k b_l \beta^l = \left[(a_i - a_k)(b_j - b_l) + a_i b_l + a_k b_j\right] \alpha^i \beta^j \qquad (7)$$

Equation 7 shows that two multiplications can be reduced to one multiplication under the condition $\alpha^i \beta^j = \alpha^k \beta^l$, because the products $a_i b_l$ and $a_k b_j$ are already required elsewhere in the sum. This is similar to the Karatsuba formula [14] for $\alpha = \beta$, where it reduces four digit multiplications to three digit multiplications, as a generalization of our problem illustrated in Fig. 1.

Fig. 1. Karatsuba on Asymmetric Digit Multiplier

The red points are digit multiplications that are calculated prior to other digit multiplications. At the same time, the black points, which are connected by a line, are the two digit multiplications that can later be reduced to one digit multiplication. In general, the Karatsuba formula can be applied when two digit multiplications are connected via a diagonal line, without being restricted by the radix used. This method works ideally on a radix that is a power of two.

In particular, if we let $\alpha = 2^{w_1}$ and $\beta = 2^{w_2}$, a greater reduction in complexity can be obtained when $\mathrm{GCD}(w_1, w_2) \neq 1$. Table 1 shows the complexity comparison of our method with existing methods. By setting $\alpha = 2^{24}$ and $\beta = 2^{16}$, our method reduces the DSP utilization compared to the existing methods in the literature.

TABLE 1
Comparison of required digit multipliers for different operand widths with existing methods

Operand   Schoolbook   Schoolbook +      Nonstandard   Our
Width                  Karatsuba [16]    Tiling [19]   method
192       96           90                78            60
224       140          120               105           88
256       176          160               136           113
384       368          360               300           216
521       682          660               561           433

4 PROPOSED HARDWARE ARCHITECTURE

Fig. 2 depicts the proposed top-level architecture of Curve448. This is the typical architecture consisting of the control unit, modular multiplier module, and modular adder/subtractor module. In contrast to the architecture proposed in [13], which uses RAM to store the ladder variables, we use register files built from flip-flops (FFs). This is because in a typical FPGA (i.e., Xilinx FPGA [24]), the availability of FFs is higher than that of LUT cells (e.g., in Xilinx, a single slice consists of four LUTs and eight FFs). Therefore, in a design that has higher LUT cell utilization than FF utilization, increasing the FF utilization will not drastically increase the slice utilization. Moreover, utilizing FFs instead of Block Random Access Memory (BRAM) preserves the overall performance without introducing overhead on memory read/write access. Apart from performance and utilization considerations, the use of BRAM introduces a new opportunity for attackers to extract secret scalar information by recovering information on the BRAM addressing pattern using Differential Power Analysis (DPA). Although it can be protected via address scrambling, such as the design proposed in [11], it clearly consumes more area due to the utilization of logic cells. Furthermore, all the computation steps are one-way controlled by precise scheduling of the Montgomery ladder, without the need for a handshake process such as a valid/ready protocol.

Fig. 2. Top-level Architecture

4.1 Modular Multiplier

We construct a 12-stage pipelined modular multiplier with four stages of input delay based on a 224-bit pipelined multiplier. We choose a 224-bit width multiplier because the prime number of Curve448 has a golden ratio $\phi = 2^{224}$, which later optimizes the reduction step as we propose the reduction algorithm for Curve448.

Fig. 3 shows the overall structure of our modular multiplication design. It consists of a 224-bit pipelined multiplier followed by two Ripple-Carry-Save Adders (RCSAs). An RCSA is actually a pair of adders, which in our case are carry-compact adders (CCAs) [25] that are used to limit the critical delay path of the ripple-carry adder at some point, as shown in Fig. 4. The carry for the first half is not propagated; instead, it is saved and included as an input for the other half in the next stage. This method is suitable for accumulator circuits. Additionally, pipelined one-hot encoding is used with a simple shift register to control the input-output signal between the stages. Note that the output valid and busy signals are not necessary in our design since we use a precise ladder scheduling that takes into account the restriction in the modular multiplier module (i.e., it requires four cycles of input delay).

Fig. 3. Proposed 12-stage Pipelined (with four stages of input delay) Modular Multiplier

4.1.1 224-bit Pipelined Multiplier

The construction of the 224-bit pipelined multiplier based on Equation 7 is given in Fig. 5. As shown, all the

red points represent a single digit multiplication, while two black points connected by a line represent a single digit multiplication that is reduced from two digit multiplications. Note that some lines may pass through multiple points, yet the relation should satisfy Equation 7. There are some exceptions; when a point is already applied as a red point (i.e., due to the Karatsuba multiplication on the counterpart), it cannot be further reduced, even if it satisfies Equation 7. Thus, it remains a red point (including the counterpart), as in our case, as shown in points (8,2), (9,2), (8,5), (9,5), (8,2), and (9,2) in Fig. 5. Finally, we obtain less complexity, as it requires only 88 DSPs instead of 140 DSPs in the schoolbook method or 105 DSPs in the nonstandard tiling method [19].

Fig. 4. Ripple-Carry-Save Adder (RCSA). Technically, it limits the carry propagation to a predefined delay, which is the delay of a 224-bit Carry-Compact-Adder (CCA) in our case.

Fig. 5. Construction of 224-bit Multiplication with Asymmetric Digit Multiplier ($\alpha = 2^{24}$ and $\beta = 2^{16}$). Note that since 224 is not divisible by 24, there will be unused bits at the most significant bits of the intermediate results.

Fig. 6 shows the architecture of the 224-bit pipelined multiplier. The architecture contains five fully pipelined stages, which means it can process an input on each cycle. Our calculation steps of $C = A \cdot B$ are described as follows:

• Stages 1 and 2: The parallel 16-bit ripple-carry adder (RCA) is used to compute $b_j - b_l$. The output of the 16-bit RCA is wired to 25x17-bit signed Multiply-Accumulate (MAC) modules, which also have a pre-adder input to compute $a_i - a_k$ before going to the multiplication stage. At the same time, parallel 24x16-bit signed multiplier (MUL) modules are used to compute $a_i b_l$ and $a_k b_j$. Both the 25x17-bit signed MAC with the pre-adder and the 24x16-bit signed MUL are built from DSP primitives with a three-stage and a two-stage pipeline, respectively, to achieve maximum performance, as recommended in [26] and shown in Fig. 7.
• Stage 3: The output of the 24x16-bit signed MUL available in this stage is then used by the 40-bit CCA to calculate $a_i b_l + a_k b_j$. The output of the 40-bit CCA is routed to the input accumulator of the MAC modules. Note that the output of the 24x16-bit signed MUL is also stored in registers, as it will be used in the compression stage (Stage 4).
• Stage 4: Before being processed by the CSAT, all intermediate values are grouped and aligned into 40-bit segments to reduce the number of inputs in the CSAT as well as the depth of the tree. However, while the output of the 24x16-bit signed MUL is already 40 bits wide, the calculation of $(a_i - a_k)(b_j - b_l) + a_i b_l + a_k b_j$ produces up to a 41-bit output width. We employ an alignment method similar to that used by [16] to handle the overflow bit (i.e., the 41st bit). All intermediate values are compressed using homogeneous 3:2 compression to achieve balanced performance.
• Stage 5: In this stage, a final propagated addition of sum and carry from the output of the CSAT is performed using the CCA proposed in [25]. We obtained the optimal CCA parameters $H = 3$ and $L = 30$ experimentally, by trial and error after synthesis and implementation on the FPGA. Furthermore, the input and output of the CCA are enclosed by registers to minimize the critical delay path.

4.1.2 Fast Reduction over $p = 2^{448} - 2^{224} - 1$

We propose a fast reduction technique interleaved with the intermediate output from the 224-bit pipelined multiplier, which is given in Algorithm 1. The multiplication of $A$ and $B$, which each have a 448-bit width, can be decomposed into four 224-bit multiplications. Note that we do not take the Karatsuba approach recommended by [5], since it does not give an advantage in our reduction step; rather, we take the additional cost of one clock cycle on the pipelined multiplier. We perform partial reduction for three intermediate results in advance (i.e., $z_4 = (z_1 + z_2 + z_3) \cdot 2^{224} \bmod p$), while the second term (i.e., $z_0 + z_3 \bmod p$) is accumulated with the first reduction step result and we perform the second reduction accordingly (i.e., $C = z_0 + z_3 + z_4 \bmod p$). This technique relies on the following property:

$$(a + b) \bmod p = (a + (b \bmod p)) \bmod p \qquad (8)$$

Considering the advantage of the Goldilocks modulus $p = 2^{448} - 2^{224} - 1$ and the fact that $2^{448} \equiv 2^{224} + 1 \bmod p$, the reduction of $z_4 = T \cdot 2^{224} \bmod p$, where $T = z_1 + z_2 + z_3$, can be performed efficiently, as mentioned in Step 13 of Algorithm 1. Referring to the structure of the RCSA, the actual addition is performed only in the second adder (i.e., 226-bit CCA). Accordingly, the reduction of $C = G \bmod p$, where $G = z_0 + z_3 + z_4$ yields 3 bits of overflow, can be performed efficiently with the RCSA as mentioned in Steps 18–20 of Algorithm 1. Note that the final reduction might produce a carry at the first adder of the RCSA; this carry needs to be propagated to the second adder at the final step.

Fig. 6. Proposed Five-stage 224-bit Fully Pipelined Multiplier.

Fig. 7. Digital Signal Processing (DSP) utilization for (a) a three-stage 25x17-bit signed Multiply-Accumulator with pre-adder and (b) a two-stage 24x16-bit signed multiplier.

Algorithm 1 Proposed Interleaved Fast Reduction for the modulus $p = 2^{448} - 2^{224} - 1$
Require: Integers $A$, $B$ satisfying $0 \le A, B < p$
Ensure: $C = A \cdot B \bmod p$
224-bit pipelined multiplications (Steps 1 to 8):
1: $a_0 \leftarrow A_{[223:0]}$
2: $a_1 \leftarrow A_{[447:224]}$
3: $b_0 \leftarrow B_{[223:0]}$
4: $b_1 \leftarrow B_{[447:224]}$
5: $z_1 \leftarrow a_0 \cdot b_1$
6: $z_2 \leftarrow a_1 \cdot b_0$
7: $z_3 \leftarrow a_1 \cdot b_1$
8: $z_0 \leftarrow a_0 \cdot b_0$
Interleaved fast reduction (Steps 9 to 20):
9: $T \leftarrow z_1 + z_2 + z_3$ {450-bit}
10: $t_0 \leftarrow T_{[223:0]}$
11: $t_1 \leftarrow T_{[449:224]}$
12: $t_2 \leftarrow T_{[449:448]}$
13: $z_4 \leftarrow (t_0 + t_1 + t_2) \,\|\, t_{1[223:0]}$ {450-bit}
14: $G \leftarrow z_3 + z_0 + z_4$ {451-bit}
15: $g_0 \leftarrow G_{[223:0]}$
16: $g_1 \leftarrow G_{[447:224]}$
17: $g_2 \leftarrow G_{[450:448]}$
18: $U \leftarrow g_0 + g_2$ {225-bit}
19: $V \leftarrow g_1 + g_2$ {224-bit}
20: $C \leftarrow (V + U_{[224]}) \,\|\, U_{[223:0]}$ {448-bit}
21: return $C$

Algorithm 2 Fermat-based inversion for Curve448 ($p = 2^{448} - 2^{224} - 1$)
Require: Integer $z$ satisfying $0 < z < p$
Ensure: Modular inverse $z^{-1} \equiv z^{p-2} \bmod p$
1: $u \leftarrow z^{2^{1}} \cdot z$ {$z^{2^{2} - 1}$}
2: $u \leftarrow u^{2^{1}} \cdot z$ {$z^{2^{3} - 1}$}
3: $u \leftarrow u^{2^{3}} \cdot u$ {$z^{2^{6} - 1}$}
4: $u \leftarrow u^{2^{6}} \cdot u$ {$z^{2^{12} - 1}$}
5: $u \leftarrow u^{2^{1}} \cdot z$ {$z^{2^{13} - 1}$}
6: $u \leftarrow u^{2^{13}} \cdot u$ {$z^{2^{26} - 1}$}
7: $u \leftarrow u^{2^{1}} \cdot z$ {$z^{2^{27} - 1}$}
8: $u \leftarrow u^{2^{27}} \cdot u$ {$z^{2^{54} - 1}$}
9: $u \leftarrow u^{2^{1}} \cdot z$ {$z^{2^{55} - 1}$}
10: $u \leftarrow u^{2^{55}} \cdot u$ {$z^{2^{110} - 1}$}
11: $u \leftarrow u^{2^{1}} \cdot z$ {$z^{2^{111} - 1}$}
12: $v \leftarrow u^{2^{111}} \cdot u$ {$z^{2^{222} - 1}$}
13: $u \leftarrow v^{2^{1}} \cdot z$ {$z^{2^{223} - 1}$}
14: $u \leftarrow u^{2^{223}} \cdot v$ {$z^{2^{446} - 2^{222} - 1}$}
15: $u \leftarrow u^{2^{2}} \cdot z$ {$z^{2^{448} - 2^{224} - 3}$}
16: return $u$

The precise scheduling of modular multiplication is presented in Fig. 8. The first four stages are used to calculate $z_1$, $z_2$, $z_3$, and $z_0$. In these stages, the inputs $A$ and $B$ are held in the input register, placing the modular multiplier core in a busy state and causing an input delay of four cycles. Stages 7 and 8 perform the first accumulation $z_1 + z_2 + z_3$, followed by the addition $t_0 + t_1 + t_2$ in Stage 9 using A1. At the same time, Stages 8 to 10 perform the second accumulation $z_3 + z_0 + z_4$, followed by two parallel additions $g_0 + g_2$ and $g_1 + g_2$ in Stage 10 using A2. Lastly, the output product is available in Stage 12. Therefore, the modular multiplication takes 12 cycles, pipelined with four stages of input delay.

4.2 Modular Adder/Subtractor

A unified modular adder/subtractor is built from a single CCA, which calculates $C = A \pm B \pm p$, as shown in Fig. 9. The calculation takes two steps: it first calculates $r_1 = A \pm B$ and then calculates $s = r_1 \pm p$, with op controlling the sign of $\pm B$ and $\pm p$ using masking. The output $C$ is selected between $r_1$ and $s$ depending on the value of sel, which is the XOR of the cout of the first step and op. Basically, it detects whether the first step calculation produces a carry/borrow. While the sign of $\pm B$ is converted with two's complement, the sign of $\pm p$ is handled more efficiently due to the special
form of its prime number. With the value of $2^{448} - 2^{224} - 1$, we can construct its value with the following signal instead of masking (written in Verilog syntax):

Therefore, it takes two cycles to complete a single modular addition/subtraction. The critical path of this module is defined by the CCA circuit with optimal parameters $H = 3$ and $L = 30$ obtained experimentally on FPGA.

Fig. 8. Modular Multiplication Calculation Steps. M, A1, and A2 are a 224-bit multiplier, first RCSA, and second RCSA, respectively.

Fig. 9. Proposed Modular Adder/Subtractor Module

4.3 Modular Inverse

A modular inversion is required to transform back from projective coordinates to affine coordinates at the end of the ECPM operation. A fully constant-time modular inversion can be performed based on Fermat's Little Theorem (FLT). Let $p = 2^{448} - 2^{224} - 1$ be the prime of Curve448; then, the modular inverse $z^{-1}$ can be calculated as $z^{-1} \equiv z^{2^{448} - 2^{224} - 3} \pmod p$. The modular inversion calculation via exponentiation can be performed with a total of 462 modular multiplications, as given in Algorithm 2, which can reuse the modular multiplier module. Therefore, no additional module for inversion is employed.

4.4 Efficient Montgomery Ladder Scheduling

The scheduling of the Montgomery ladder algorithm, Equation 2, is given in Fig. 10. The gray line in the multiplier indicates the busy signal of the 224-bit pipelined multiplier modules, where a single modular multiplication takes four 224-bit multiplications, as mentioned previously. It shows that the 224-bit pipelined modular multiplier module is busy nearly all the time, which yields high usage efficiency for the pipelined architecture. It takes 52 cycles to perform a single Montgomery ladder step. Furthermore, with the pipelined architecture of the modular multiplier, no temporary register in addition to the input registers (i.e., $x_P$, $X_2$, $Z_2$, $X_3$, and $Z_3$) is required.

The Montgomery ladder in Equation 2 requires a conditional swap such that:

$$(X_2, Z_2, X_3, Z_3) = (X_{2+b}, Z_{2+b}, X_{3-b}, Z_{3-b}) \qquad (9)$$

with $b$ determined by the two most significant bit values of the scalar on each iteration, assuming the scalar register is shifted left. A constant-time conditional swap can be implemented easily in hardware since the updates of $X_2$, $Z_2$, $X_3$, and $Z_3$ are naturally performed in parallel.

5 SIDE-CHANNEL ATTACK COUNTERMEASURES

In this section, we extend our proposed architecture with side-channel attack protection for both vertical (classical) and horizontal attacks by incorporating several well-known methods from the literature.

5.1 Secure against Vertical Side-channel Attack

Our proposed architecture, presented in Section 4, is naturally resistant to timing attacks and simple power analysis due to inherently resistant algorithms (i.e., Montgomery ladder and FLT). To provide protection against Differential Power Analysis (DPA), additional methods such as base-point randomization and scalar blinding have to be implemented [27]. Enabling these countermeasures provides protection against vertical side-channel attacks, in which the attacker tries to observe multiple runs of the ECPM operation.

5.1.1 Base-Point Randomization

Point randomization can be achieved by multiplying a random value $\lambda \in \mathbb{Z}_{2^{448}} \setminus \{0\}$ to the projective point $P = (X, Z)$ such that $P = (\lambda X, \lambda Z)$. The output of ECPM is not changed in this respect, which can be proven as follows:

$$x_P = \frac{X}{Z} = \frac{\lambda X}{\lambda Z} \qquad (10)$$

Base-point randomization provides different point representations corresponding to the entropy given by the random value $\lambda$ to prevent any information extraction using statistical analysis. In particular, this process initializes the
z-coordinate with λ and uses a modular multiplication to update the
x-coordinate accordingly. Hence, this countermeasure can be integrated easily
by using an additional multiplication call during the initialization phase of
the ECPM operation.

Fig. 10. Pipelined Montgomery Ladder Scheduling of Curve448. The total latency
is 52 clock cycles, without the requirement of an additional temporary
register. The constant a24 is equal to 39081.

5.1.2 Scalar Blinding

Scalar blinding can be achieved by adding a multiple of the group order #E to
the scalar k such that $k_r = k + r \times \#E$, where r is a random value. The
correctness of this approach can be proven as follows:

$k_r P = (k + r \times \#E)P = kP + r\Theta = kP$   (11)

Note that the multiplication of point P by the group order #E results in the
point at infinity Θ. The computation removes the correlation between the
Montgomery ladder swap function and the corresponding bit in scalar k. For ECC
over a special prime field (i.e., a Solinas prime), it is recommended to use a
sufficiently large blinding factor r, as investigated in [27], which is at
least half of the field size. Thus, a blinding factor r of 224-bit length
yields $k_r$ of 672-bit length. The latency of ECPM increases accordingly.

5.2 Secure against Horizontal Side-channel Attack

The horizontal side-channel attack is another type of attack in which the
attacker observes leakage within a single run of the ECPM operation.
Continuous point randomization for each Montgomery ladder step within a single
ECPM operation can be applied sequentially to prevent horizontal attacks. It
requires two more modular multiplications applied to the intermediate outputs
(i.e., $\lambda X_{PA}$ and $\lambda Z_{PA}$) to re-randomize the Montgomery
ladder computation.

Hence, enabling horizontal attack protection with continuous point
randomization increases the Montgomery ladder time and the total latency. In
particular, the Montgomery ladder scheduling in Fig. 10 is enlarged to 64
cycles. We assume that the random number is provided externally with
sufficient throughput, such as by the Random Number Generator (RNG) design
proposed in [28].

6 HARDWARE IMPLEMENTATION RESULT AND COMPARISON

The proposed design has been described in SystemVerilog HDL. Synthesis,
mapping, placement, and routing were carried out using Xilinx Vivado 2020.1,
targeting four modern devices (Xilinx Virtex-7 [XC7VX690T], Kintex-7
[XC7K325T], Artix-7 [XC7A100T], and Zynq 7020 [XC7Z020] FPGA) for a more
comprehensive evaluation with other related works. The correctness of the
implementation was verified by using a testbench with reference to the test
vectors provided in RFC 7748 [21].

The results of our ECC processor implementation, as well as those of several
related papers over Curve448, are presented in Table 2. We achieve the lowest
latency among the proposals targeting the Xilinx Zynq 7020 FPGA, with 0.24 and
0.39 ms for the unprotected and protected designs, respectively.

Additionally, we provide the implementation results on various devices for
future reference, such as Artix-7, Kintex-7, and Virtex-7, achieving latencies
of 0.24, 0.13, and 0.12 ms for the unprotected design, and 0.40, 0.22, and 0.20
ms for the protected design, respectively. For the unprotected design, our
fastest implementation (Virtex-7) requires 7,521 slices, while Kintex-7,
Artix-7, and Zynq 7020 utilize 7,210, 6,826, and 6,946 slices, respectively.
On all four platforms, we utilize 88 DSPs and no BRAM. As can be inferred from
the table, our architecture yields the highest efficiency in terms of the
Area×Time and DSP×Time tradeoffs compared to other existing architectures.

TABLE 2
Performance comparison of the proposed High-Performance ECC Processor over Curve448 with existing literature

| Design    | Platform  | SCA* | Slices       | DSP | BRAM | Latency [CCs] | Max. Freq [MHz] | Total Time [ms] | Throughput [OP/s] | Area×Time** [×10^-3] | DSP×Time [×10^-3] |
|-----------|-----------|------|--------------|-----|------|---------------|-----------------|-----------------|-------------------|----------------------|-------------------|
| [10]      | Zynq 7020 | (-)  | 1,580        | 33  | 14   | 328,286       | 357             | 0.92            | 1,087             | 4,490                | 30.36             |
| [10]      | Zynq 7020 | (+)  | 1,648        | 35  | 14   | 473,926       | 335             | 1.41            | 709               | 7,259                | 49.35             |
| [11]      | Zynq 7020 | (+)  | 1,985        | 33  | 14   | 499,344       | 341             | 1.46            | 685               | 7,716                | 48.18             |
| [11]      | Zynq 7020 | (++) | 2,056        | 33  | 14   | 547,728       | 341             | 1.61            | 621               | 8,623                | 53.13             |
| [12]      | Virtex-7  | (-)  | 50,143 (LUT) | -   | -    | 372,742       | 325             | 1.15            | 870               | 14,416***            | -                 |
| [13]      | Zynq 7020 | (-)  | 4,354        | 81  | -    | 77,702        | 95              | 0.82            | 1,220             | 10,212               | 66.42             |
| [13]      | Zynq 7020 | (++) | 4,424        | 81  | -    | 133,254       | 95              | 1.40            | 714               | 17,534               | 113.40            |
| This work | Zynq 7020 | (-)  | 6,946        | 88  | -    | 30,469        | 128             | 0.24            | 4,167             | 3,779                | 21.12             |
| This work | Zynq 7020 | (++) | 6,984        | 88  | -    | 49,735        | 126             | 0.39            | 2,564             | 6,156                | 34.32             |
| This work | Artix-7   | (-)  | 6,826        | 88  | -    | 30,469        | 127             | 0.24            | 4,167             | 3,750                | 21.12             |
| This work | Artix-7   | (++) | 6,934        | 88  | -    | 49,735        | 125             | 0.40            | 2,500             | 6,294                | 35.20             |
| This work | Kintex-7  | (-)  | 7,210        | 88  | -    | 30,469        | 237             | 0.13            | 7,692             | 2,081                | 11.44             |
| This work | Kintex-7  | (++) | 7,269        | 88  | -    | 49,735        | 230             | 0.22            | 4,545             | 3,535                | 19.36             |
| This work | Virtex-7  | (-)  | 7,521        | 88  | -    | 30,469        | 250             | 0.12            | 8,333             | 1,959                | 10.56             |
| This work | Virtex-7  | (++) | 7,666        | 88  | -    | 49,735        | 245             | 0.20            | 5,000             | 3,293                | 17.60             |

* (-): no protection, (+): scalar blinding and point randomization countermeasures,
  (++): scalar blinding and point re-randomization countermeasures
** Area = Slices + DSPs×100
*** Area = LUTs/4 (assuming 1 slice contains 4 LUTs, as mentioned in the specification [24])

To the best of our knowledge, the method by Niasar et al. [13] represents the
state-of-the-art high-performance hardware implementation of Curve448. They
provide three different designs (i.e., lightweight, area-time efficient, and
high-performance); in particular, we compare our proposed design with their
high-performance variant. Our proposed design increases the throughput by 242%
for the unprotected design and by 259% for the protected design. Their
approach is based on the refined Karatsuba formula by Bernstein in [29],
employing five levels of Karatsuba computation and parallel multiplication
using 81 DSP cores. However, the multilevel Karatsuba approach yields a longer
addition tree that increases the critical path delay, limiting their operating
frequency to 95 MHz, which is lower than that of our design.

Table 3 provides the detailed performance analysis with a comparison to their
design. The latency of our architecture outperforms the state-of-the-art in
all underlying field arithmetic and ECC group operations. The significant
latency improvement is mainly due to a pipelined modular multiplier, which is
constructed from a 244-bit fully pipelined multiplier and the proposed fast
reduction over $p = 2^{448} - 2^{224} - 1$. Thanks to the novel variant of the
Karatsuba formula, we can enable parallelization at the digit multiplication
level without the large delay propagation caused by additions in the recursion
tree, while offering relatively low DSP block utilization. Although a single
modular multiplication operation does not give a significant latency
improvement (i.e., 12 cycles compared to 15 cycles), employing multiple
operations (i.e., 10 modular multiplications as in Equation 2) results in a
significant latency improvement due to pipelining compared to their design
(i.e., 52 cycles compared to 158 cycles). Furthermore, all the stages in the
224-bit multiplier are busy almost all the time during the Montgomery ladder
operation, with a utilization of $\frac{48}{52} \times 100 \simeq 92\%$, which
makes highly efficient use of the pipelined architecture. On the other hand,
the modular inversion via FLT consumes almost 18% of the total latency and is
considered an inefficient method in our design. This is because the
exponentiation $z^{2^{448} - 2^{224} - 3} \bmod p$ requires 462 consecutive
modular multiplications, which cannot be parallelized through the pipelined
architecture.

TABLE 3
Performance Analysis of Proposed ECC Processor in comparison with State-of-the-art on Zynq 7020 FPGA

| Operations                      | Clock Cycles: Niasar et al. [13] @95 MHz | Clock Cycles: Our Work @128 MHz |
|---------------------------------|------------------------------------------|---------------------------------|
| 1 × Modular Addition            | 7                                        | 2                               |
| 1 × Modular Subtraction         | 7                                        | 2                               |
| 1 × Modular Multiplication      | 15                                       | 12                              |
| 10 × Modular Multiplication     | 150                                      | 48                              |
| 1 × Modular Inverse             | 6,917                                    | 5,544                           |
| Montgomery Ladder Step          | 158                                      | 52                              |
| Single ECC Point Multiplication | 77,702                                   | 30,469                          |
| Total Latency [ms]              | 0.82                                     | 0.24                            |

In terms of area, their design has lower slice utilization (i.e., 4,354 slices
for the unprotected design and 4,424 slices for the protected design).
However, in terms of the Area×Time tradeoff, our design is 63% more efficient
for both the unprotected and protected designs. The cost of higher utilization
is thus well absorbed by the latency improvement. Note that we use the same
assumption as they do, where the area is equivalent to slices + DSPs, with
each DSP assumed to be equivalent to 100 slices.

It is worth mentioning that the first hardware implementation of Curve448 was
carried out by Sasdrich and Güneysu in [10], who later proposed a protected
architecture by considering side-channel attack countermeasures [11]. They
demonstrated an evaluation to detect scalar- and base-point-dependent leakage
on hardware with side-channel protections (i.e., scalar blinding and point
randomization) and proved that their methods are secure against side-channel
attacks. Thanks to their results, we also include side-channel protections
(i.e., scalar blinding, base-point randomization, and continuous point
randomization) in our protected design, yet we present a 313% speed-up
compared to their results on the same target device (i.e., Zynq 7020).
Furthermore, Shah et al. [12] proposed a LUT-based implementation
targeting Virtex-7, employing the RSD technique for the arithmetic operations.
Their proposed design aimed to be platform independent by using LUTs only,
consuming 50,143 LUTs with a throughput of 870 ECPM operations per second, yet
our design is 858% faster than their design.

7 CONCLUSIONS

In this paper, we proposed a high-performance ECC processor over Curve448 that
outperformed all the previous results in terms of execution time. The
implementation on the Xilinx 7-series FPGAs Virtex-7, Kintex-7, Artix-7, and
Zynq 7020 yielded execution times of 0.12, 0.13, 0.24, and 0.24 ms,
respectively. The speed was obtained by utilizing a novel variant of the
Karatsuba formula for an asymmetric digit multiplier, constructing a
high-throughput 244-bit fully pipelined multiplier. The method combined
schoolbook long and Karatsuba multiplication, allowing its digit
multiplications to be performed in parallel while leveraging the full
capability of asymmetric DSP blocks. It is worth mentioning that the algorithm
even works on arbitrary degrees, which means it can be generalized for wider
use in any cryptographic algorithm that requires multiplication. Subsequently,
the interleaved fast reduction over $2^{448} - 2^{224} - 1$ was presented,
yielding a high-throughput 12-stage modular multiplier with four stages of
input delay. Furthermore, we also proposed certain components to maximize the
speed gain and the overall performance, such as a low-latency modular
adder/subtractor as well as efficient scheduling of the Montgomery ladder.
Finally, the proposed architecture was extended with both vertical and
horizontal side-channel protection through well-known countermeasures such as
scalar blinding, base-point randomization, and continuous randomization.

ACKNOWLEDGMENTS

This work was supported by the Institute of Information & Communications
Technology Planning & Evaluation (IITP) grant funded by the Korea government
(MSIT) (2019-0-01343, Regional strategic industry convergence security core
talent training business) and supported by the MSIT (Ministry of Science and
ICT), Korea, under the ITRC (Information Technology Research Center) support
program (IITP-2021-2020-0-01797) supervised by the IITP (Institute for
Information & Communications Technology Planning & Evaluation).

REFERENCES

[1] 3GPP, “Security architecture and procedures for 5G system,” Technical Specification (TS) 3GPP TS 33.501 V17.4.1 (2022–01), 2022.
[2] C. Fan, S. Ghaemi, H. Khazaei, and P. Musilek, “Performance evaluation of blockchain systems: A systematic survey,” IEEE Access, vol. 8, pp. 126927–126950, 2020.
[3] A. Langley, M. Hamburg, and S. Turner, “RFC 7748: Elliptic curves for security,” Internet Research Task Force (IRTF), 2016.
[4] D. J. Bernstein, “Curve25519: new Diffie-Hellman speed records,” in International Workshop on Public Key Cryptography. Springer, 2006, pp. 207–228.
[5] M. Hamburg, “Ed448-Goldilocks, a new elliptic curve.” IACR Cryptol. ePrint Arch., vol. 2015, p. 625, 2015.
[6] E. Rescorla, “The transport layer security (TLS) protocol version 1.3,” Internet Requests for Comments, RFC Editor, RFC 8446, August 2018.
[7] L. Chen, D. Moody, A. Regenscheid, and K. Randall, “Recommendations for discrete logarithm-based cryptography: Elliptic curve domain parameters,” National Institute of Standards and Technology, Tech. Rep., 2019.
[8] P. W. Shor, “Algorithms for quantum computation: discrete logarithms and factoring,” in Proceedings 35th Annual Symposium on Foundations of Computer Science. IEEE, 1994, pp. 124–134.
[9] N. Bindel, U. Herath, M. McKague, and D. Stebila, “Transitioning to a quantum-resistant public key infrastructure,” in International Workshop on Post-Quantum Cryptography. Springer, 2017, pp. 384–405.
[10] P. Sasdrich and T. Güneysu, “Cryptography for next generation TLS: Implementing the RFC 7748 elliptic Curve448 cryptosystem in hardware,” in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 2017, pp. 1–6.
[11] P. Sasdrich and T. Güneysu, “Exploring RFC 7748 for hardware implementation: Curve25519 and Curve448 with side-channel protection,” Journal of Hardware and Systems Security, vol. 2, no. 4, pp. 297–313, 2018.
[12] Y. A. Shah, K. Javeed, M. I. Shehzad, and S. Azmat, “LUT-based high-speed point multiplier for Goldilocks-Curve448,” IET Computers & Digital Techniques, vol. 14, no. 4, pp. 149–157, 2020.
[13] M. B. Niasar, R. Azarderakhsh, and M. M. Kermani, “Efficient hardware implementations for elliptic curve cryptography over Curve448,” in International Conference on Cryptology in India. Springer, 2020, pp. 228–247.
[14] A. A. Karatsuba and Y. P. Ofman, “Multiplication of many-digital numbers by automatic computers,” in Doklady Akademii Nauk, vol. 145, no. 2. Russian Academy of Sciences, 1962, pp. 293–294.
[15] R. Salarifard and S. Bayat-Sarmadi, “An efficient low-latency point-multiplication over Curve25519,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 10, pp. 3854–3862, 2019.
[16] A. M. Awaludin, H. T. Larasati, and H. Kim, “High-speed and unified ECC processor for generic Weierstrass curves over GF(p) on FPGA,” Sensors, vol. 21, no. 4, p. 1451, 2021.
[17] G. H. Khachatrian, M. K. Kuregian, K. R. Ispiryan, and J. L. Massey, “Fast multiplication of integers for public-key applications,” in International Workshop on Selected Areas in Cryptography. Springer, 2001, pp. 245–254.
[18] M. Scott, “Missing a trick: Karatsuba variations,” Cryptography and Communications, vol. 10, no. 1, pp. 5–15, 2018.
[19] D. B. Roy, D. Mukhopadhyay, M. Izumi, and J. Takahashi, “Tile before multiplication: An efficient strategy to optimize DSP multiplier for accelerating prime field ECC for NIST curves,” in Proceedings of the 51st Annual Design Automation Conference, 2014, pp. 1–6.
[20] P. M. C. Massolino, P. Longa, J. Renes, and L. Batina, “A Compact and Scalable Hardware/Software Co-design of SIKE,” IACR Transactions on Cryptographic Hardware and Embedded Systems, 2020.
[21] A. Langley, M. Hamburg, and S. Turner, “Elliptic curves for security,” Internet Requests for Comments, RFC Editor, RFC 7748, January 2016.
[22] P. L. Montgomery, “Speeding the Pollard and elliptic curve methods of factorization,” Mathematics of Computation, vol. 48, no. 177, pp. 243–264, 1987.
[23] B. Devlin, Blockchain Acceleration Using FPGAs - Elliptic curves, zk-SNARKs, and VDFs, ZCASH Foundation, 2019.
[24] Xilinx, 7 Series FPGAs Data Sheet: Overview, 2020 (accessed January 26, 2022), https://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf.
[25] T. B. Preußer, M. Zabel, and R. G. Spallek, “Accelerating computations on FPGA carry chains by operand compaction,” in 2011 IEEE 20th Symposium on Computer Arithmetic. IEEE, 2011, pp. 95–102.
[26] Xilinx, 7 Series DSP48E1 Slice User Guide, 2018 (accessed December 28, 2020), https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf.
[27] W. Schindler and A. Wiemers, “Efficient side-channel attacks on scalar blinding on elliptic curves with special structure,” in NIST Workshop on ECC Standards, 2015.
[28] A. M. Awaludin, D. Pratama, and H. Kim, “AnyTRNG: Generic, high-throughput, low-area true random number generator based on synchronous edge sampling,” in International Conference on Information Security Applications. Springer, 2021, pp. 157–168.
[29] D. J. Bernstein, “Batch binary Edwards,” in Annual International Cryptology Conference. Springer, 2009, pp. 317–336.
