2021 IEEE 28th Symposium on Computer Arithmetic (ARITH)

High-Speed NTT-based Polynomial Multiplication Accelerator for Post-Quantum Cryptography
Mojtaba Bisheh-Niasar, CEECS Department, Florida Atlantic University, Boca Raton, FL, [email protected]
Reza Azarderakhsh, CEECS Department, Florida Atlantic University, Boca Raton, FL, [email protected]
Mehran Mozaffari-Kermani, CSE Department, University of South Florida, Tampa, FL, [email protected]
2021 IEEE 28th Symposium on Computer Arithmetic (ARITH) | 978-1-6654-2293-2/21/$31.00 ©2021 IEEE | DOI: 10.1109/ARITH51176.2021.00028

Abstract—This paper demonstrates an architecture for accelerating polynomial multiplication using the number theoretic transform (NTT). Kyber is one of the finalists in the third round of the NIST post-quantum cryptography standardization process. At the same time, the performance of NTT execution is its main challenge, requiring large memory and a complex memory access pattern. In this paper, an efficient NTT architecture is presented to improve the respective computation time. We propose several optimization strategies for efficiency improvement targeting different performance requirements for various applications. Our NTT architecture, including four butterfly cores, occupies only 798 LUTs and 715 FFs on a small Artix-7 FPGA, showing more than 44% improvement compared to the best previous work. We also implement a coprocessor architecture for Kyber KEM benefiting from our high-speed NTT core to accomplish the three phases of the key exchange in 9, 12, and 19 μs, respectively, operating at 200 MHz.

Index Terms—FPGA, hardware architecture, Kyber, lattice-based cryptography, NTT, post-quantum cryptography.

I. INTRODUCTION

The security of classical public-key cryptosystems relies on the presumed hardness of problems such as integer factorization, discrete logarithm, and elliptic curve discrete logarithm. However, these problems can be solved with quantum algorithms such as Shor's algorithm [1] once a large-scale quantum computer is built. Hence, the National Institute of Standards and Technology (NIST) started a post-quantum cryptography (PQC) standardization process in 2016. In round 3 of this competition, the four key encapsulation mechanism (KEM) finalists, i.e., Classic-McEliece, Kyber, NTRU, and Saber, were announced in July 2020. Among all promising candidates, lattice-based cryptography is a very attractive alternative, mainly because it offers a good trade-off between security and efficiency.

Kyber KEM [2] is part of the Cryptographic Suite for Algebraic Lattices (CRYSTALS) and shares a common framework with the Dilithium signature scheme [3]. Kyber bases its security on the hardness assumptions over module learning with errors (Module-LWE) and is believed to be quantum-resistant. The main characteristic of Kyber is polynomial multiplication over the polynomial ring $\mathbb{Z}_{3329}[X]/(X^{256}+1)$, providing a significant increase in efficiency. Hence, the most computationally intensive operations, i.e., matrix-vector and vector-vector multiplication, can be optimized with the fast number-theoretic transform (NTT), which can reduce computational complexity from $O(n^2)$ to roughly $O(n \log n)$. Since the implementation of NTT-based multiplication is still a performance bottleneck in lattice-based cryptography, improving NTT efficiency has recently received significant attention.

Reducing the computational complexity of polynomial multiplication is essential for faster key encapsulation and for optimizing the resource utilization of the entire cryptosystem. This acceleration of polynomial multiplication would be challenging for various applications due to their resource constraints, strict performance, and flexibility requirements. However, for a widely-deployed cryptosystem, the overall complexity, consisting of the utilized resources and the required latency, will have to be minimal to be standardized by NIST [4]. To address these challenges, hardware implementation of the cryptosystem will be critical since it accelerates the core arithmetic operation while occupying limited resources.

Overall, there are two possible strategies to deploy hardware accelerators: (i) hardware/software co-design approaches and (ii) pure hardware architectures. Although hardware/software co-design approaches are more flexible and easier to develop compared to pure hardware architectures, they may not lead to the best performance. Most hardware accelerators focus on the FPGA platform to take advantage of its reconfigurability. An FPGA can provide an appropriate balance between flexibility and performance, which is especially important for a rapidly evolving field like PQC.

A. Related Work

There are prominent works to accelerate polynomial multiplication in the literature. The work of [5] proposed the negative wrapped convolution (NWC) to eliminate the overhead of zero padding in the polynomial multiplication over $\mathbb{Z}_q[X]/(X^n+1)$. The authors in [6] introduced a low-complexity NTT by merging the pre-processing of NTT into butterfly operations. Furthermore, a low-complexity INTT is proposed in [7] to avoid post-processing overhead. Longa et al. in [8] proposed the KRED and KRED-2X reduction algorithms to speed up the NTT computation. This work also reduces post-processing computation of INTT at the cost of

more memory utilization. Furthermore, employing Cooley-Tukey (CT) and Gentleman-Sande (GS) butterfly configurations reduces the bit-reverse operation, which was implemented in [9]. The authors in [10] presented a processor benefiting from the polynomial vector structure in the Kyber algorithm to reduce memory access overhead.

A flexible and scalable NTT architecture was presented in [11], [12]. Furthermore, the work of [13] implemented a scalable NTT architecture on RISC-V. In [14], a low-power NTT was proposed to reduce the required latency.

Although a compact design of NTT employing only one butterfly core requires few hardware resources, it is too slow to meet the throughput requirements of high-performance applications. The work of [15] employed four butterfly cores for NewHope implementation. However, increasing the number of butterfly cores in unmerged implementations increases memory access overhead. Hence, merging NTT layers was studied in [16] using a 2 × 2 butterfly structure. This design was customized in [17] for NewHope using KRED and KRED-2X reductions in their proposed architecture. The authors in [18], [19] used the same architecture for Kyber KEM, employing the high-level synthesis (HLS) approach. Implementing KRED and KRED-2X modular reductions increases the performance on software platforms, while it doubles the occupied resources in hardware. Furthermore, the required memory for the pre-computed values is increased to store two sets of constants. Additionally, the authors in [20] implemented a 3-layer merged NTT for NewHope using RISC-V ISA features, while they claimed that using this method for Kyber cannot improve efficiency. The prior hardware NTT designs have so far been fixed in throughput. Furthermore, since the same butterfly configurations are used for both NTT and INTT, a bit-reverse function is required.

Implementations of Ring-LWE have increased since it offers high-performance and compact architectures compared to both PQC schemes [21], [22] and even pre-quantum cryptosystems [23], [24]. Although many efforts towards the HLS [25] and the hardware/software co-design implementation of PQC accelerators have been made [9], [20], [26], [27], there are merely a few developed pure hardware architectures for Kyber KEM. The first hardware implementation of Kyber is reported in [28], employing an RTL-based methodology providing good performance and smaller area consumption compared to the HLS-based approach. Furthermore, the authors in [29] proposed an architecture of Kyber, which heavily relies on BlockRAM primitives between components. Recently, the work of [30] implemented a compact FPGA-based architecture occupying only 3 BRAMs.

Fig. 1 shows a performance and resource utilization comparison between software, hardware/software, and pure hardware implementations of Kyber. Software benchmarking [31], [32], [33] reports 60-80% of the overall required cycles for hashing and sampling, while hardware/software accelerators can reduce it. However, Keccak latency can be hidden by a pure hardware design when it works in a parallel fashion with the NTT core. A wide range of NTT computation shares (25-90%) has been reported in the literature for the hardware/software approach since different optimization perspectives have been targeted. Therefore, implementation gaps are identified in accelerating and compacting the NTT in a pure hardware architecture to reduce the required time and resources.

[Figure 1: two bar charts. Top: Time (s) on a log scale from 10^-5 to 10^-1 for SW, HW/SW, and HW. Bottom: Area (#LUT) from 0 to 20k for HW/SW and HW. Bars are split into Keccak, NTT, and Control.]

Figure 1. Performance (in log10) and resource utilization comparison in three different Kyber implementation approaches: software (SW), hardware/software (HW/SW), and hardware (HW). The Kyber architecture is broken down into three main cores, including Keccak (hashing and sampling), NTT (polynomial multiplication), and Control (controller and all other required functions).

B. Contributions

Polynomial multiplication computations take a significant portion of Kyber KEM latency in hardware implementations. Therefore, to improve the efficiency of Kyber, one should increase the efficiency of the NTT core, providing higher throughput using fewer hardware resources. This paper proposes algorithmic optimizations and hardware optimizations to design an efficient pure hardware architecture of a high-speed polynomial multiplication core (PMC) on FPGA to accelerate Kyber KEM. Algorithmic optimizations include modular reduction and efficient NTT computation. The hardware optimizations are achieved by designing a reconfigurable butterfly core (BF) and by judicious rearrangement of the sequence of the operations to leverage pipelining and parallelism at multiple layers within each unit's implementation.

The contributions and novelties of this paper are as follows:

1) We propose a hardware-friendly modular reduction algorithm, which requires few resources without the additional cost of memory utilization. Reductions are only carried out after multiplications to avoid occupying other resources.

2) We propose an improved reconfigurable hardware architecture for NTT and INTT with highly efficient modular reduction. This reconfigurability, supporting both decimation-in-frequency (DIF) and decimation-in-time (DIT) NTT algorithms, avoids utilizing additional resources for the same computations while reducing the pre-processing cost of NTT and post-processing cost of INTT. The proposed architecture significantly reduces the overall area and memory consumption with no impact on performance.

3) We implement a parameterized design of the NTT module using VHDL and prototype it on an Artix-7 FPGA. Our NTT core shows an efficiency improvement

by 44% with at least 25% and 80% fewer Slice and BRAM resource utilization.

4) We propose a high-performance coprocessor architecture for lattice-based public-key cryptography with Kyber KEM as a case study. Our result utilizes the proposed high-speed NTT core and outperforms all reported implementations by reducing the total time.

The rest of the paper is organized as follows. In Section II, we discuss the preliminaries. In Section III, our proposed algorithms and architectures are discussed. The details of FPGA implementations are provided in Section IV. We discuss our results and compare them to the counterparts in Section V. Finally, we conclude the paper in Section VI.

II. PRELIMINARIES

In this section, the Kyber protocol and the relevant mathematical background are briefly described.

A. The Kyber Protocol

Kyber is an IND-CCA secure KEM [34], including three algorithms, i.e., key generation, encryption, and decryption. In key generation, a matrix A and a secret key s are sampled from a uniform and a binomial distribution, respectively. Then a public key is computed by multiplying A and s in the NTT domain and adding noise to the product. In encryption, a message m is added to the product of the public key and a sampled random r in the normal domain to generate a vector v. Additionally, another polynomial multiplication is performed between r and a uniformly distributed matrix to compute u. The encryption output, called ciphertext ct, is composed of the compression of u and v, while the message can then be decrypted by recovering an approximation of v by computing the product of the secret key and u.

All polynomials in the Kyber scheme have 256 coefficients over k-dimensional vectors, where k = 2, 3, 4 indicates the three different post-quantum security levels. Kyber uses these functions to construct a Chosen Plaintext Attack (CPA) secure public-key encryption scheme. A CCA-secure KEM can be constructed using an adapted Fujisaki-Okamoto transformation [35]. For details, we refer interested readers to [2].

B. Polynomial Multiplication

Polynomial multiplication is the bottleneck of lattice-based cryptography, and it can be done using either the NTT or the schoolbook polynomial multiplication algorithm. The former can be exploited to compute polynomial multiplication efficiently over a polynomial ring $\mathbb{Z}_q[X]/(X^n+1)$. The NTT is a generalization of the fast Fourier transform (FFT) defined in a finite field. Let $f$ be a polynomial of degree at most $n-1$, where $f = \sum_{i=0}^{n-1} f_i X^i$ and $f_i \in \mathbb{Z}_q$, and let $\omega_n$ be a primitive $n$-th root of unity such that $\omega_n^n = 1 \bmod q$. The forward NTT is defined by $\hat{f} = \mathrm{NTT}(f)$, such that $\hat{f}_i = \sum_{j=0}^{n-1} f_j \omega_n^{ij} \bmod q$. Furthermore, the inverse NTT is given by $f = \mathrm{INTT}(\hat{f})$, such that $f_i = n^{-1} \sum_{j=0}^{n-1} \hat{f}_j \omega_n^{-ij} \bmod q$. An NTT-based polynomial multiplication between $f$ and $g$ can be performed such that $f \cdot g = \mathrm{INTT}(\mathrm{NTT}(f) \circ \mathrm{NTT}(g))$.

To avoid applying an NTT of length $2n$ with $n$ zero padding of inputs, NWC [5] is proposed at the cost of pre-processing of NTT and post-processing of INTT. Let $\psi = \sqrt{\omega_n}$ be a primitive $2n$-th root of unity. Pre-processing of NTT includes multiplication between the coefficients of the input polynomials and $\psi^i$, while the post-processing of INTT is multiplication between the coefficients of the output polynomial and $\psi^{-i}$.

NTT computation can be implemented by the CT or GS butterfly. The bit-reverse function is the bit-wise reversal of the binary representation of the coefficient index. Performing the CT butterfly for NTT and the GS butterfly for INTT avoids the bit-reverse permutation [8]. Fig. 2 illustrates an 8-point NTT-based multiplication employing both CT and GS butterfly operations.

[Figure 2 (dataflow graph): inputs s(0)…s(7) pass through a three-stage (Stage 0 to Stage 2) CT-butterfly NTT, are multiplied point-wise with â(0), â(4), â(2), â(6), â(1), â(5), â(3), â(7), and pass through a three-stage GS-butterfly INTT to give t(0)…t(7). CT butterfly: (u, v) → ((u + ωv) mod q, (u − ωv) mod q). GS butterfly: (u, v) → ((u + v) mod q, (u − v)ω mod q).]

Figure 2. An 8-point NTT-based polynomial multiplication: (Left) Dataflow graph including CT butterfly-based NTT, point-wise multiplication, and GS butterfly-based INTT. Polynomial â is in the NTT domain and s and t are in the normal domain. (Right) CT and GS butterfly configurations.

In order to perform point-wise multiplication in Kyber, we have to compute 128 multiplications of two-coefficient polynomials, each taken modulo $X^2 - \omega_n^{2\,\mathrm{br}_7(i)+1}$, such that $(\hat{a}_{j,2i} + \hat{a}_{j,2i+1} X) \cdot (\hat{s}_{2i} + \hat{s}_{2i+1} X) = (\hat{a}_{j,2i}\hat{s}_{2i} + \hat{a}_{j,2i+1}\hat{s}_{2i+1}\,\omega_n^{2\,\mathrm{br}_7(i)+1}) + (\hat{a}_{j,2i}\hat{s}_{2i+1} + \hat{a}_{j,2i+1}\hat{s}_{2i})X$, where $\mathrm{br}_7$ is the 7-bit bit-reversal function.

C. Modular Reduction

Different modular reductions can be implemented in the butterfly core, including Barrett reduction and Montgomery reduction. A variant of Montgomery reduction was introduced by [8], benefiting from a special form of prime $q = k \cdot 2^m + 1$. This method includes two functions, i.e., KRED and KRED-2X, which take any integer $C$ and return an integer $D$ such that $D \equiv k \cdot C \bmod q$ and $D \equiv k^2 \cdot C \bmod q$, respectively. However, we can eliminate the extra factor of $k^s$ with $s \in \{1, 2\}$ by using $k^{-s} \cdot \omega_n^{ij}$ instead of $\omega_n^{ij}$ in the NTT algorithm. Although these functions do not compute the exact value of $C \bmod q$, they bring the output range close to that of the exact value.
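As a rough software illustration of the KRED idea described above, the sketch below takes $q = 3329 = 13 \cdot 2^8 + 1$ as the example modulus. The function names `kred` and `mul_twiddle` are hypothetical, and the final `% Q` is only there to normalise the sketch's output; it is not claimed to match any hardware datapath. The identity used is $k \cdot 2^m \equiv -1 \pmod q$, so for $C = C_l + 2^m C_h$ we get $k \cdot C \equiv k \cdot C_l - C_h$.

```python
# Minimal sketch of KRED for a prime of the form q = k * 2^m + 1.
# Names and structure are illustrative assumptions, not the authors' design.
Q, K, M = 3329, 13, 8              # 3329 = 13 * 2^8 + 1

def kred(c: int) -> int:
    """Return D with D ≡ k*c (mod q); D is not fully reduced."""
    c_lo = c & ((1 << M) - 1)      # low m bits of c
    c_hi = c >> M                  # remaining high bits of c
    return K * c_lo - c_hi         # k*c_lo - c_hi ≡ k*c because k*2^m ≡ -1 (mod q)

K_INV = pow(K, -1, Q)              # k^-1 mod q (Python 3.8+)

def mul_twiddle(a: int, w: int) -> int:
    """Compute a*w mod q by folding k^-1 into the stored twiddle factor."""
    w_scaled = (K_INV * w) % Q     # the value that would be precomputed
    return kred(a * w_scaled) % Q  # extra factor k cancels against k^-1

if __name__ == "__main__":
    print(mul_twiddle(1234, 17), (1234 * 17) % Q)
```

The design point this is meant to show is that the reduction itself needs only a mask, a shift, and a multiplication by the small constant $k$, while the unwanted factor $k$ is absorbed into the precomputed twiddle table.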

In Kyber with $q = 3329$, we have $k = 13$ and $m = 8$. These functions do not need any multiplications in hardware and can be implemented with shifters and adders.

III. PROPOSED ARCHITECTURE FOR HIGH-SPEED POLYNOMIAL MULTIPLIER

A. Modular Reduction

Implementing KRED and KRED-2X requires storing $k^{-1} \cdot \omega_n^{ij}$ and $k^{-2} \cdot \omega_n^{ij}$ in ROM. Furthermore, KRED-2X returns $k^2 \cdot C_0 - k \cdot C_1 + C_2$, where $C_0$, $C_1$, and $C_2$ are the $m$-bit chunks of the input $C$. Thus, for $k = 13$ it needs 5 shifts and 7 additions to output 16-bit data. However, it allows the output to grow up to 32 bits. Hence, we propose the K²-RED reduction, a modified version of the KRED algorithm, presented in Algorithm 1. It includes two steps of performing KRED, so its output is $k^2 \cdot C \bmod q$. This reduction needs 4 shift and 6 addition operations and keeps the output width at 12 bits. Furthermore, by implementing this reduction after multiplication, we do not need to implement another reduction unit in the butterfly core, and the required memory is halved. Fig. 3 shows the reduction architecture of a 24-bit input using Algorithm 1 to compute a 12-bit output.

Algorithm 1 Proposed K²-RED Reduction Algorithm
Input: A binary number $C = (c_{23}, \ldots, c_0)_2$, $k = 13$, $m = 8$, $q = 3329 = k \cdot 2^m + 1$
Output: $C = k^2 C \bmod q$
Step 1:
  1: $C_l = (c_7, \ldots, c_0)_2$
  2: $C_h = (c_{23}, \ldots, c_8)_2$
  3: $C \leftarrow k \cdot C_l - C_h$
Step 2:
  4: $C_l = (c_7, \ldots, c_0)_2$
  5: $C_h = (c_{15}, \ldots, c_8)_2$
  6: $C \leftarrow k \cdot C_l - C_h$
  7: return $C$

[Figure 3 (block diagram): NTT RAM (addr_a, addr_b, write, data in, data out of width 4(log q + 1)), twiddle-factor ROM, a 2×2 array of reconfigurable butterfly cores (BF) with a mode input, output Buffers (4) to (7), and the K²-RED datapath, which reduces a 24-bit input (c23…c0) through a 16-bit intermediate (c′15…c′0) to a 12-bit output (c″11…c″0) using shifts by 3 and 2, additions, and subtractions.]

Figure 3. Proposed polynomial multiplication architecture employing 2×2 reconfigurable butterfly cores and K²-RED reduction.

B. Reconfigurable Butterfly Core

To avoid the bit-reverse cost in polynomial multiplication, two different butterfly configurations, i.e., CT and GS, are required for NTT and INTT, respectively. Hence, a reconfigurable butterfly core is proposed to support both CT and GS operations and reduce the required hardware resources. We implement a 2 × 2 butterfly core to merge two layers of NTT/INTT and perform two butterfly operations in each layer. The proposed architecture for the PMC is depicted in Fig. 3, employing four butterfly cores. Each butterfly core includes a multiplication, a modular reduction, an addition, and a subtraction, while there are also some registers to balance the pipeline latency in different configurations. The signal mode chooses between NTT and INTT operations. The core also supports point-wise multiplication, polynomial addition, and polynomial subtraction employing additional control logic which is not shown in Fig. 3 for brevity. When mode is set to 0, the butterfly works in CT configuration in the NTT computation and computes $u + v\omega$ and $u - v\omega$. The butterfly cores are reconfigured when mode = 1 for GS in the INTT operation, while its output is manipulated compared to standard GS to reduce the required memory. The proposed architecture supports both even and odd numbers of layers employing pipeline stages. Hence, to support an odd number of layers, mode is set to 2 for the first butterfly row in the last layer of computation to only pass the data. The proposed NTT algorithm is shown in Algorithm 2 for even layers.

In each cycle, four coefficients are read from the NTT RAM to feed the cores, and their outputs are buffered in four serial-in, parallel-out shift registers with different lengths. The results are written back to the NTT RAM sequentially. The address and data flow of the NTT RAM for read and write operations in every clock cycle are given in Fig. 4 for $n = 128$. After 4 cycles, the first buffer is full, and 4 coefficients can be stored in the RAM. The same scenario is performed after one cycle for the second and then for the third and fourth buffer, and its first 4 coefficients will be stored. Each round of NTT includes $n/4$ read and store operations, which are fully pipelined to increase throughput. The pipeline latency between read and write sequences consists of 2 cycles for reading from RAM, 8 cycles for two butterfly operations, and 4 cycles for buffering the results in registers. Furthermore, to avoid any memory conflict, we consider 6 idle cycles between each round.

The required twiddle factors for NTT are stored in a ROM. Based on the symmetry property of twiddle factors in NTT and INTT, i.e., $\omega_n^{i}$ and $\omega_n^{-i}$ respectively, we have $\omega_n^{-i} = -\omega_n^{n-i}$.
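Two arithmetic facts used in this section can be checked with a short script: that the two steps of Algorithm 1 return a value congruent to $k^2 \cdot C$, and that the stated twiddle symmetry holds. The sketch rests on assumptions that are not taken from the paper: intermediate values are treated as signed integers (Python's `&` and `>>` behave like two's complement), the result is only checked for congruence and a loose range, and the symmetry is checked for powers of $\zeta = 17$, the primitive 256-th root of unity modulo 3329 from the Kyber specification, with $n = 128$ so that $\zeta^{n} \equiv -1$.

```python
# Sketch under stated assumptions; not the authors' RTL.
Q, K, M = 3329, 13, 8      # parameters listed in Algorithm 1
ZETA, N = 17, 128          # assumed twiddle base with ZETA^N ≡ -1 (mod Q)
MASK = (1 << M) - 1

def k2_red(c: int) -> int:
    """Two KRED steps as in Algorithm 1; result is congruent to k^2 * c mod q."""
    # Step 1: split the 24-bit input into low 8 bits and high 16 bits.
    c = K * (c & MASK) - (c >> M)
    # Step 2: repeat on the (possibly negative) intermediate value.
    c = K * (c & MASK) - (c >> M)
    return c

def check_k2_red(step: int = 997) -> bool:
    """Sample products of two values below q and check congruence and range."""
    for c in range(0, (Q - 1) ** 2 + 1, step):
        r = k2_red(c)
        if r % Q != (K * K * c) % Q or not (-Q < r < 2 * Q):
            return False
    return True

def check_twiddle_symmetry() -> bool:
    """Check ZETA^(-i) ≡ -ZETA^(N-i) (mod Q) for i = 0..N."""
    for i in range(N + 1):
        if pow(ZETA, -i, Q) != (-pow(ZETA, N - i, Q)) % Q:
            return False
    return True

if __name__ == "__main__":
    print(check_k2_red(), check_twiddle_symmetry())
```

In this sketch the output of `k2_red` can still be slightly negative or slightly above $q$, so a final conditional correction would be needed to obtain a canonical 12-bit value; how the hardware handles that step is not shown here.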

Hence, to reduce the required memory, we can use the NTT twiddle factors for INTT by (i) reversing the order of reading the ROM, and (ii) computing $v - u$ instead of $u - v$ in the GS configuration. Our proposed architecture can perform NTT and INTT operations in around $\frac{n}{8}\log n$ and $\frac{n}{8}(\log n + 1)$ cycles for even and odd numbers of layers, respectively.

Algorithm 2 Proposed NTT Algorithm Based on Cooley-Tukey Butterfly
Input: a polynomial $a(x) \in \mathbb{Z}_q[X]/(X^n + 1)$, $n$-th primitive root of unity $\omega_n \in \mathbb{Z}_q$, $n = 2^l$
Output: $a(x) = \mathrm{NTT}_{\omega_n}(a) \in \mathbb{Z}_q[X]/(X^n + 1)$
  1: for (s = 0, s < log(n), s = s + 2) do
  2:   $m = 2^s$
  3:   t = t 2
  4:   for (i = 0, i < m, i++) do
  5:     for (j = 4i · t, j < 4i · t + t, j++) do
  6:       $u_{00} \leftarrow a_j$, $v_{00} \leftarrow a_{j+t}$, $u_{01} \leftarrow a_{j+2t}$, $v_{01} \leftarrow a_{j+3t}$
  7:       $\omega_{00} \leftarrow \psi_{k^{-2}}[m + i]$
  8:       $(u_{10}, u_{11}) \leftarrow$ BF_CT$(u_{00}, v_{00}, \omega_{00})$
  9:       $(v_{10}, v_{11}) \leftarrow$ BF_CT$(u_{01}, v_{01}, \omega_{00})$
  10:      $\omega_{10} \leftarrow \psi_{k^{-2}}[2 \times (m + i)]$, $\omega_{11} \leftarrow \psi_{k^{-2}}[2 \times (m + i) + 1]$
  11:      $(u_{20}, u_{21}) \leftarrow$ BF_CT$(u_{10}, v_{10}, \omega_{10})$
  12:      $(v_{20}, v_{21}) \leftarrow$ BF_CT$(u_{11}, v_{11}, \omega_{11})$
  13:      $a_j \leftarrow u_{20}$, $a_{j+t} \leftarrow v_{20}$, $a_{j+2t} \leftarrow u_{21}$, $a_{j+3t} \leftarrow v_{21}$
  14:    end for
  15:  end for
  16: end for
  17: return $a(x)$

C. Area/Performance Trade-offs

The main goal of the proposed architecture is to achieve high-speed computation with small area requirements. However, we can target different area/performance trade-offs by increasing the number of PMCs, taking advantage of the polynomial vector structure in the Kyber algorithm. Since NTT/INTT can be computed for odd and even coefficients of each polynomial in Kyber separately, two PMCs can be implemented for each polynomial vector. Hence, for Kyber-512 having 2 polynomial vectors, increasing the number of implemented PMCs from 1 to 2 or 4 can reduce the NTT/INTT latency to a half or a quarter.

Nevertheless, implementing more PMCs needs more bandwidth for feeding the butterfly cores and storing their results. On the other hand, due to the data width limitation of BRAM, one BRAM cannot support two PMCs. Thus, the number of utilized BRAMs should be matched with the number of PMCs to provide the required bandwidth by implementing more BRAMs in parallel.

IV. ARCHITECTURE OF CRYSTAL-KYBER

The proposed highly optimized architecture for the Kyber coprocessor can compute all the operations described in the Kyber protocol. It includes a PMC, Keccak, binomial sampler, rejection sampler, and compress/decompress units. The architecture of Kyber is designed to perform in constant time.

The Keccak used in the SHA3 standard is Keccak-f[1600], which performs four functions, including SHA3-256, SHA3-512, SHAKE-128, and SHAKE-256, during KEM. To design a high-performance architecture, we modify the high-speed core implementation of Keccak provided by [36]. It requires 24 clock cycles to execute 24 rounds of the Keccak sponge function computation. We also develop a dedicated SIPO and PISO for interfacing with this core at its input and output, respectively. The SIPO takes data in 64-bit width and delivers 1344-bit data to the Keccak core, while the PISO takes 1344-bit data and divides it into 21 chunks of 64-bit width.

Since the CT configuration is used in NTT, we assume that the input polynomials are in normal order, while the public and secret keys are in bit-reverse order. Hence, the point-wise multiplication works in bit-reverse order in the NTT domain, and the results are transformed back to the normal domain with normal order employing the GS configuration.

In order to reduce the total cycle count, operations are performed in a parallel fashion. Hence, the latency of the samplers can be entirely absorbed by the Keccak core. To accelerate the KEM computation, we duplicate the PMC to maximize the polynomial multiplication speed, while NTT/INTT is independently performed for odd and even coefficients.

V. IMPLEMENTATION RESULTS AND COMPARISONS

Our proposed architecture is synthesized with Xilinx Vivado 2019.2 and implemented on a Xilinx Artix XC7A100T-3 FPGA device, which is recommended by NIST.

A. Implementation Results of NTT Core

Table I reports implementation results for different alternative reduction algorithms for $q = 3{,}329$. As one can see, our proposed K²-RED algorithm is more compact compared to other algorithms and maintains an output of 12 bits to reduce the required memory. It also requires half of the precomputed twiddle factors compared with KRED since the latter needs storing $k^{-1} \cdot \omega_n^{ij}$ and $k^{-2} \cdot \omega_n^{ij}$ in ROM for reduction.

Table I
IMPLEMENTATION RESULTS FOR DIFFERENT MODULAR REDUCTION ALGORITHMS

| Reduction Algorithm | CPD [ns] | #LUTs | #FFs | #Slices | #DSPs | Output Width |
| Barrett Reduction | 1.34 | 59 | 31 | 26 | 2 | 12 |
| Montgomery [19] | 2.10 | 391 | 382 | 91 | 1 | 12¹ |
| KRED [19] | 1.99 | 80 | 47 | 31 | 0 | 16¹ |
| K²-RED | 0.91 | 54 | 30 | 18 | 0 | 12 |

¹ Our estimation by re-implementing this work.

Table II reports area and time specifications for our PMC core in NTT and INTT mode. Other state-of-the-art NTT designs with the merged-layer NTT structure are also listed. Additionally, we report the results for both Kyber with $q = 3{,}329$, $n = 256$, and NewHope with $q = 12{,}289$, $n = 1024$ to show the superiority of the proposed architecture in different schemes. For comparison, A × T is reported, where A and T are the utilized LUTs and the time in μs, respectively. It should be noted that, for the works which do not report frequency, we assume the same operating frequency as our architecture in computing A × T. An operating frequency in a limited range is mostly considered to reduce the required power. Thus,

[Figure 4 (memory map and timing diagram): memory configuration at the beginning of Round 1 (address 0 holds a0, a32, a64, a96) and at the beginning of Round 2 (address 0 holds a0, a8, a16, a24), together with the per-cycle read and write address sequences while Round 1 performs Stages 1 and 2.]

Figure 4. Memory address and data flow when the NTT operation is performed.
Table II
IMPLEMENTATION RESULTS FOR DIFFERENT NTT IMPLEMENTATIONS ON FPGA

| Parameters | Work | Platform | Butterfly Arrangement | NTT/INTT Cycles | Freq [MHz] | Time [μs] | #LUTs | #FFs | #Slices | #DSPs | #BRAM | Speedup Ratio | A×C | A×T |
| n = 1024, q = 12,289 | Zhang et al. [7] | Artix-7 | 2 | 2,825¹/2,825¹ | 244 | 11.58 | 847 | 375 | - | 2 | 6 | 1.70 | 2.4 (45.8%) | 9.8 (44.9%) |
| n = 1024, q = 12,289 | Xing et al. [15] | Zynq-7000 | 4 | 2,688²/2,688² | 153 | 5.52 | 4823 | 2901 | - | 8 | 0 | 2.58 | 13.0 (90.0%) | 84.7 (93.6%) |
| n = 1024, q = 12,289 | Mert et al. [12] | Virtex-7 | 32 | 200/- | 125 | 1.60 | 17,188 | - | - | 96 | 48 | 0.24 | 3.4 (61.8%) | 27.5 (80.4%) |
| n = 1024, q = 12,289 | Kuo et al. [17] | Zynq-7000 | 2×2 | 2,616/- | 150 | 17.44 | 2832 | 1381 | - | 8 | 10 | 2.57 | 7.4 (82.4%) | 49.4 (89.1%) |
| n = 1024, q = 12,289 | Nguyen et al. [18] | Zynq-7000 | 2×2 | 2,032/- | 188 | 10.81 | 898 | 1117 | 357 | 4 | 10 | 1.59 | 1.8 (27.8%) | 9.7 (44.3%) |
| n = 1024, q = 12,289 | This Work | Artix-7 | 2×2 | 1,591/1,591 | 234 | 6.80 | 798 | 715 | 268 | 4 | 2 | 1.00 | 1.3 | 5.4 |
| n = 256, q = 3,329 | Fritzmann et al. [26] | Zynq-7000 | 2 | 1,935/1,930 | - | - | 2908 | 170 | - | 9 | 0 | 5.97 | 5.6 (94.6%) | 25.3 (95.3%) |
| n = 256, q = 3,329 | Karabulut et al. [13] | Virtex-7 | 1 | 43,756/- | - | - | 417 | 462 | NA | 0 | 0 | 135.05 | 18.2 (98.4%) | 82.2 (98.5%) |
| n = 256, q = 3,329 | Alkim et al. [20] | Artix-7 | 1 | 6,868/6,367 | 59 | 116.41 | - | - | - | - | - | 79.73 | - | - |
| n = 256, q = 3,329 | Huang et al. [29] | Artix-7 | 2 | 1,834/- | 155 | 11.83 | - | - | - | - | - | 8.10 | - | - |
| n = 256, q = 3,329 | Xing et al. [30] | Artix-7 | 2 | 512/576 | 161 | 3.18 | 1,737 | 1,167 | - | 2 | 3 | 2.18 | 0.9 (66.7%) | 5.5 (78.2%) |
| n = 256, q = 3,329 | This Work | Artix-7 | 2×2 | 324/324 | 222 | 1.46 | 801 | 717 | 312 | 4 | 2 | 1.00 | 0.3 | 1.2 |

¹ This number is obtained by adding the reported cycles for the butterfly operations (i.e., 2569 cycles) with n/4 = 256 cycles for the scramble function.
² This number is obtained by adding the reported butterfly cycles (i.e., 1280 cycles) with 1280 and 128 cycles for the scramble function and pre/post-processing.

A × C can be computed for a fair comparison, where C is the required clock cycles.

The results show that our proposed architecture is the fastest and smallest architecture for n = 1024. Although [17] and [18] implement a 2 × 2 butterfly structure, they use the KRED algorithm over a fixed butterfly configuration. Nonetheless, our proposed reduction algorithm reduces the required resources, especially in terms of occupied BRAM, and increases the maximum operating frequency. Furthermore, employing the reconfigurable PMC eliminates the bit-reverse function and the pre-processing and post-processing cost. For instance, [17] and [18] need 1,330 and 1,324 cycles for the butterfly operations alone, respectively, while ours requires 1,320 cycles. In [17], the reduction unit is implemented with DSP blocks, so it uses twice as many DSPs as ours. Our architecture improves A × T by approximately 90% and reduces the total time for NTT computation by 2.57× compared to [17]. Although [18] implements the reduction unit without DSP blocks, this design needs a larger area and more cycles. Hence, our proposed design achieves a 44% A × T improvement and a 1.59× speedup compared to [18].

The work of [7] occupies two butterfly cores and a highly optimized reduction hardware tailored only for the special value. However, this approach requires more BRAM and LUT to implement a low-complexity NTT utilizing 2 DSPs. As a result, our architecture achieves a speedup factor of 1.70× and improves A × T by almost 45%.

Our results for Kyber parameters show a significant improvement, requiring only 1.46 μs. Since the Kyber parameters were changed during round 2 of the NIST competition, we only list previous works implementing Kyber v-3 parameters for a fair comparison. The work in [26] optimized an NTT core based on a hardware/software approach over a RISC-V architecture, while it works at 45 MHz on the ASIC platform. If this design runs at the same frequency as ours, its A × T and total time are 21× and 5.97× greater than those of our proposed design. The works in [13] and [20] also presented an NTT architecture over RISC-V, which requires a considerably greater cycle count, while our optimized design achieves 135.05× and 79.73× speedup, respectively. An FPGA-based design employing Montgomery reduction was proposed in [29]. The required hardware resources for the NTT core were not reported; however, our design reduces the required cycles, achieving a speedup factor of 8.1. In [30], two butterfly cores for even and odd coefficients are used, employing 2 DSPs at the cost of utilizing 2.17× and 1.63× more LUTs and FFs than our design. Our result shows 2.18× faster computing and a 78.2% A × T improvement compared to [30].

B. Implementation Results of CRYSTALS-Kyber

Table III lists the detailed resource consumption, performance results, and comparison in terms of A × T for all NIST security levels. The total time is the required time for a key encapsulation and a decapsulation (Encaps + Decaps), as the key generation can be done offline. We utilize 2, 3, and 4 PMCs in our proposed architecture for security levels 1, 3, and 5, respectively. As one can see, our design requires 10,502 LUTs, 9,859 FFs, 8 DSPs, and 13 BRAMs for NIST security level 1, performing the Kyber protocol in almost 31 μs.

There are several hardware/software implementations targeting Kyber KEM in the literature. However, a direct comparison is not possible between the listed hardware implementations due to the varying techniques of different FPGA generations, targeting different optimization goals, and using different design methodologies. The work in [9] implemented a configurable coprocessor based on a RISC-V architecture that can be used for multiple lattice-based schemes including Kyber. Its ar-

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL. Downloaded on January 05,2023 at 23:39:24 UTC from IEEE Xplore. Restrictions apply.
Table III
FPGA IMPLEMENTATION RESULTS AND COMPARISON WITH STATE-OF-THE-ART

Scheme | Work | Platform | #LUTs | #FFs | #Slices | #DSPs | #BRAMs | KeyGen [CCs] | Encaps [CCs] | Decaps [CCs] | Freq [MHz] | Total Time [μs] | Throughput [KEM/s] | A × T
Kyber-512 | Basu et al. [25]¹ | Virtex-7 | 1,977,896 | 194,126 | NA | 0 | 0 | - | 31,669 | 43,018 | 67 | 1,115 | 897 | 2,214.2 (99.9%)
Kyber-512 | Banerjee et al. [9] | Artix-7 | 14,975 | 2,539 | 4,173 | 11 | 14 | 74,519 | 131,698 | 142,309 | 25 | 10,960 | 91 | 164.4 (99.8%)
Kyber-512 | Fritzmann et al. [26] | Zynq-7000 | 23,947 | 10,847 | NA | 21 | 32 | 150,106 | 193,076 | 204,843 | - | - | - | 47.6 (99.3%)
Kyber-512 | Alkim et al. [20] | Artix-7 | 1,842 | 1,634 | NA | 5 | 34 | 710,000 | 971,000 | 870,000 | 59 | 31,203 | 32 | 57.5 (99.4%)
Kyber-512 | Huang et al. [29]¹ | Artix-7 | 88,901 | NA | 141,825 | 354 | 202 | - | 49,015 | 68,815 | 155 | 760 | 1,315 | 67.8 (99.5%)
Kyber-512 | Xing et al. [30] | Artix-7 | 7,412 | 4,644 | 2,126 | 2 | 3 | 3,768 | 5,079 | 6,668 | 161 | 73 | 13,705 | 0.54 (34.0%)
Kyber-512 | Dang et al. [28] | Artix-7 | 11,864 | 10,348 | 3,989 | 8 | 15 | - | 3,025 | 4,395 | 210 | 35 | 28,301 | 0.42 (21.4%)
Kyber-512 | This work | Artix-7 | 10,502 | 9,859 | 3,547 | 8 | 13 | 1,882 | 2,446 | 3,754 | 200 | 31 | 32,258 | 0.33
Kyber-768 | Banerjee et al. [9] | Artix-7 | 14,975 | 2,539 | 4,173 | 11 | 14 | 111,525 | 177,540 | 190,579 | 25 | 14,725 | 67 | 220.5 (99.8%)
Kyber-768 | Fritzmann et al. [26] | Zynq-7000 | 23,947 | 10,847 | NA | 21 | 32 | 273,370 | 325,888 | 340,418 | - | - | - | 79.8 (99.4%)
Kyber-768 | Huang et al. [29]¹ | Artix-7 | 110,260 | NA | 167,293 | 292 | 202 | - | 77,481 | 102,113 | 155 | 1,159 | 863 | 127.7 (99.6%)
Kyber-768 | Xing et al. [30] | Artix-7 | 7,412 | 4,644 | 2,126 | 2 | 3 | 6,316 | 7,925 | 10,049 | 161 | 112 | 8,957 | 0.83 (43.4%)
Kyber-768 | Dang et al. [28] | Artix-7 | 11,884 | 10,380 | 3,984 | 8 | 15 | - | 4,065 | 5,555 | 210 | 46 | 21,829 | 0.54 (13.0%)
Kyber-768 | This work | Artix-7 | 11,783 | 10,424 | 3,952 | 12 | 14 | 2,667 | 3,251 | 4,805 | 200 | 40 | 24,826 | 0.47
Kyber-1024 | Banerjee et al. [9] | Artix-7 | 14,975 | 2,539 | 4,173 | 11 | 14 | 148,547 | 223,469 | 240,977 | 25 | 18,578 | 53 | 278.2 (99.7%)
Kyber-1024 | Fritzmann et al. [26] | Zynq-7000 | 23,947 | 10,847 | NA | 21 | 32 | 349,673 | 405,477 | 424,682 | - | - | - | 99.4 (99.2%)
Kyber-1024 | Alkim et al. [20] | Artix-7 | 1,842 | 1,634 | NA | 5 | 34 | 2,203,000 | 2,619,000 | 2,429,000 | 59 | 85,559 | 11 | 157.6 (99.5%)
Kyber-1024 | Huang et al. [29]¹ | Virtex-7 | 132,918 | NA | 172,489 | 548 | 202 | - | 107,054 | 135,553 | 192 | 1,264 | 791 | 167.9 (99.6%)
Kyber-1024 | Xing et al. [30] | Artix-7 | 7,412 | 4,644 | 2,126 | 2 | 3 | 9,380 | 11,321 | 13,908 | 161 | 157 | 6,381 | 1.16 (35.3%)
Kyber-1024 | Dang et al. [28] | Artix-7 | 12,183 | 12,441 | 4,511 | 8 | 15 | - | 5,785 | 7,395 | 210 | 63 | 15,933 | 0.76 (1.3%)
Kyber-1024 | This work | Artix-7 | 13,347 | 11,639 | 4,585 | 16 | 16 | 3,459 | 4,122 | 6,257 | 185 | 56 | 17,824 | 0.75

¹ Different architectures for Encaps and Decaps are used.
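
The Total Time, Throughput, and A × T entries of the "This work" rows in Table III follow from the cycle counts and clock frequency. The sketch below recomputes them. The total time is Encaps + Decaps cycles divided by the frequency, as stated in the text. Taking A × T as the LUT count times the total time in seconds is an assumption that happens to reproduce the tabulated values. The function names are hypothetical.

```python
# "This work" rows of Table III: LUTs, Encaps/Decaps cycles, frequency in MHz.
THIS_WORK = {
    "Kyber-512": {"luts": 10502, "encaps": 2446, "decaps": 3754, "freq_mhz": 200},
    "Kyber-768": {"luts": 11783, "encaps": 3251, "decaps": 4805, "freq_mhz": 200},
    "Kyber-1024": {"luts": 13347, "encaps": 4122, "decaps": 6257, "freq_mhz": 185},
}


def total_time_us(row):
    """Encaps + Decaps latency in microseconds (KeyGen is done offline)."""
    return (row["encaps"] + row["decaps"]) / row["freq_mhz"]


def throughput_kem_per_s(row):
    """Number of Encaps + Decaps pairs completed per second."""
    return 1e6 / total_time_us(row)


def area_time(row):
    """A x T, assumed here to be LUTs multiplied by total time in seconds."""
    return row["luts"] * total_time_us(row) * 1e-6


for scheme, row in THIS_WORK.items():
    print(
        f"{scheme}: {total_time_us(row):.1f} us, "
        f"{throughput_kem_per_s(row):.0f} KEM/s, "
        f"A x T = {area_time(row):.2f}"
    )
```

The same three functions can be applied to any other row that reports cycles, frequency, and LUTs, which makes it easy to check a comparison at a common clock frequency.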

chitecture performs almost 91 KEM per second for Kyber-512, which is 353× slower than our design. Our proposed design also achieves a 99.8% improvement in terms of A × T. In [26], another RISC-V-based architecture was proposed to accelerate NTT-based schemes. This design requires 64× more cycles for encapsulation and decapsulation while consuming 2.3×, 1.1×, 2.6×, and 2.1× more LUTs, FFs, DSPs, and BRAMs, respectively. Additionally, [20] proposed a RISC-V design to accelerate Kyber KEM employing customized instructions. Although the design of [20] is lightweight, its required latency is significantly greater than ours. Thus, our hardware implementation of Kyber is around 1,000 times faster and 180 times more efficient than their hardware/software implementation.

An HLS evaluation was proposed in [25] for Kyber-512 employing different implementations for encapsulation and decapsulation. However, this approach comes at the cost of considerably larger area consumption. Our design achieves almost 7,000 times better A × T compared to the HLS-based implementation.

Compared to the pure hardware architecture in [29], our design achieves a 24.5× faster KEM and improves A × T by 99.5% while occupying 8.4×, 44.2×, and 13.5× fewer LUTs, DSPs, and BRAMs, respectively. The high-speed implementation of Kyber was reported in [28] for two different platforms, i.e., Artix-7 and Virtex-7. In security level 1, our proposed architecture reduces the total time by 11.4% and improves A × T by 21.4% on the same platform. Besides, our design reduces the required cycles by 16% and 21% in security levels 3 and 5 by implementing parallel PMCs to accelerate NTT computation. Moreover, our design has 2.35× and 34% better time and A × T, respectively, compared to the compact design in [30] in security level 1, while ours utilizes 1.4×, 2.1×, 4×, and 4.3× more LUTs, FFs, DSPs, and BRAMs, respectively.

Table IV lists other PQC scheme results implemented on the FPGA platform for NIST security level 1. Elkhatib et al. in [21] implemented a supersingular isogeny-based KEM that runs in 8.8 ms. Howe et al. [37] presented a flexible FrodoKEM architecture that performs 825 encapsulations and 710 decapsulations per second. The work of [38] proposed an instruction-set coprocessor performing Saber in 60 μs.

Table IV
COMPARISON WITH OTHER PQC SCHEMES IN NIST SECURITY LEVEL 1.

Protocol | Platform | #LUTs | #FFs | #Slices | #DSPs | #BRAMs | Freq [MHz] | Time [μs]
SIKEp434 [21] | Virtex-7 | 12,818 | 18,271 | 5,527 | 195 | 32 | 249.6 | 8,800
Frodo-640 [37] | Artix-7 | 6,881 | 5,081 | 1,947 | 16 | 12.5 | 149 | 2,621
LightSaber [38] | UltraScale+ | 23,686 | 9,805 | NA | 0 | 2 | 150 | 60
Kyber-512 [Ours] | Artix-7 | 10,502 | 6,859 | 3,547 | 8 | 15 | 200 | 31

The experimental results show that using the proposed PMC to implement lattice-based KEM schemes as a full-hardware architecture results in a high-speed and efficient design. For Kyber KEM, our coprocessor architecture outperforms all the reported implementations in the literature. Our proposed implementation already reaches performance levels comparable to or even significantly better than pre-quantum algorithms [23], [24], [39].

VI. CONCLUSION

This paper proposed a high-performance and efficient architecture for NTT-based polynomial multiplication and a lattice-based public-key cryptography coprocessor with Kyber KEM as a case study. We optimize the implementation of the NTT core by merging the layers and using an efficient reduction unit within a configurable butterfly core. Besides, we propose a coprocessor architecture that can perform all KEM operations for Kyber. Overall, our NTT core shows more than 44% improvement in terms of A × T. The proposed Kyber coprocessor architecture also performs key generation, encapsulation, and decapsulation in 9, 12, and 19 μs, respectively, for a security level comparable to AES-128 on an Artix-7 FPGA.

ACKNOWLEDGMENT

The authors would like to thank the reviewers for their comments. This work is supported in part by a grant from NSF-1801341.

REFERENCES

[1] P. W. Shor, “Algorithms for quantum computation: Discrete logarithms and factoring,” in 35th Annual Symposium on Foundations of Computer Science, Santa Fe, New Mexico, USA, 20-22 November 1994, pp. 124–134, 1994.

[2] R. Avanzi, J. Bos, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, J. M. Schanck, P. Schwabe, G. Seiler, and D. Stehle, “CRYSTALS-Kyber: Algorithm specification and supporting documentation (version 3.0). submission to the NIST post-quantum cryptography standardization project,” 2020.
[3] NISTIR 8309, “Status report on the second round of the NIST post-quantum cryptography standardization process,” National Institute of Standards and Technology, 2020.
[4] NIST, “Submission requirements and evaluation criteria for the post-quantum cryptography standardization process,” National Institute of Standards and Technology, 2016.
[5] T. Pöppelmann and T. Güneysu, “Towards efficient arithmetic for lattice-based cryptography on reconfigurable hardware,” in Progress in Cryptology - LATINCRYPT 2012 - 2nd International Conference on Cryptology and Information Security in Latin America, Santiago, Chile, October 7-10, 2012. Proceedings, pp. 139–158, 2012.
[6] S. S. Roy, F. Vercauteren, N. Mentens, D. D. Chen, and I. Verbauwhede, “Compact Ring-LWE cryptoprocessor,” in Cryptographic Hardware and Embedded Systems - CHES 2014 - 16th International Workshop, Busan, South Korea, September 23-26, 2014. Proceedings, pp. 371–391, 2014.
[7] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, and L. Liu, “Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT,” IACR Trans. Cryptogr. Hardw. Embed. Syst., vol. 2020, no. 2, pp. 49–72, 2020.
[8] P. Longa and M. Naehrig, “Speeding up the number theoretic transform for faster ideal lattice-based cryptography,” in Cryptology and Network Security - 15th International Conference, CANS 2016, Milan, Italy, November 14-16, 2016, Proceedings, pp. 124–139, 2016.
[9] U. Banerjee, T. S. Ukyab, and A. P. Chandrakasan, “Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols (extended version),” IACR Cryptol. ePrint Arch., vol. 2019, p. 1140, 2019.
[10] Z. Chen, Y. Ma, T. Chen, J. Lin, and J. Jing, “Towards efficient Kyber on FPGAs: A processor for vector of polynomials,” in 25th Asia and South Pacific Design Automation Conference, ASP-DAC 2020, Beijing, China, January 13-16, 2020, pp. 247–252, 2020.
[11] A. C. Mert, E. Karabulut, E. Öztürk, E. Savas, M. Becchi, and A. Aysu, “A flexible and scalable NTT hardware: Applications from homomorphically encrypted deep learning to post-quantum cryptography,” in 2020 Design, Automation & Test in Europe Conference & Exhibition, DATE 2020, Grenoble, France, March 9-13, 2020, pp. 346–351, 2020.
[12] A. C. Mert, E. Karabulut, E. Öztürk, E. Savas, and A. Aysu, “An extensive study of flexible design methods for the number theoretic transform,” IEEE Transactions on Computers, pp. 1–1, 2020.
[13] E. Karabulut and A. Aysu, “RANTT: A RISC-V architecture extension for the number theoretic transform,” in 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), pp. 26–32, 2020.
[14] T. Fritzmann and J. Sepúlveda, “Efficient and flexible low-power NTT for lattice-based cryptography,” in IEEE International Symposium on Hardware Oriented Security and Trust, HOST 2019, McLean, VA, USA, May 5-10, 2019, pp. 141–150, 2019.
[15] Y. Xing and S. Li, “An efficient implementation of the NewHope key exchange on FPGAs,” IEEE Trans. Circuits Syst. I Regul. Pap., vol. 67-I, no. 3, pp. 866–878, 2020.
[16] C. Du, G. Bai, and X. Wu, “High-speed polynomial multiplier architecture for Ring-LWE based public key cryptosystems,” in Proceedings of the 26th edition on Great Lakes Symposium on VLSI, GLVLSI 2016, Boston, MA, USA, May 18-20, 2016, pp. 9–14, 2016.
[17] P.-C. Kuo, W.-D. Li, Y.-W. Chen, Y.-C. Hsu, B.-Y. Peng, C.-M. Cheng, and B.-Y. Yang, “High performance post-quantum key exchange on FPGAs,” IACR Cryptology ePrint Archive, p. 690, 2017.
[18] D. T. Nguyen, V. B. Dang, and K. Gaj, “A high-level synthesis approach to the software/hardware codesign of NTT-based post-quantum cryptography algorithms,” in International Conference on Field-Programmable Technology, FPT 2019, Tianjin, China, December 9-13, 2019, pp. 371–374, 2019.
[19] D. T. Nguyen, V. B. Dang, and K. Gaj, “High-level synthesis in implementing and benchmarking number theoretic transform in lattice-based post-quantum cryptography using software/hardware codesign,” in Applied Reconfigurable Computing. Architectures, Tools, and Applications - 16th International Symposium, ARC 2020, Toledo, Spain, April 1-3, 2020, Proceedings [postponed], pp. 247–257, 2020.
[20] E. Alkim, H. Evkan, N. Lahr, R. Niederhagen, and R. Petri, “ISA extensions for finite field arithmetic accelerating Kyber and NewHope on RISC-V,” IACR Trans. Cryptogr. Hardw. Embed. Syst., vol. 2020, no. 3, pp. 219–242, 2020.
[21] R. Elkhatib, R. Azarderakhsh, and M. Mozaffari Kermani, “Highly optimized montgomery multiplier for SIKE primes on FPGA,” in 27th IEEE Symposium on Computer Arithmetic, ARITH 2020, Portland, OR, USA, June 7-10, 2020, pp. 64–71, 2020.
[22] M. Anastasova, R. Azarderakhsh, and M. Mozaffari Kermani, “Fast strategies for the implementation of SIKE round 3 on ARM Cortex-M4,” IACR Cryptol. ePrint Arch., vol. 2021, p. 115, 2021.
[23] M. Bisheh Niasar, R. E. Khatib, R. Azarderakhsh, and M. Mozaffari Kermani, “Fast, small, and area-time efficient architectures for key-exchange on Curve25519,” in 27th IEEE Symposium on Computer Arithmetic, ARITH 2020, Portland, OR, USA, June 7-10, 2020, pp. 72–79, 2020.
[24] M. Bisheh Niasar, R. Azarderakhsh, and M. Mozaffari Kermani, “Efficient hardware implementations for elliptic curve cryptography over Curve448,” in Progress in Cryptology - INDOCRYPT 2020 - 21st International Conference on Cryptology in India, Bangalore, India, December 13-16, 2020, Proceedings, pp. 228–247, 2020.
[25] K. Basu, D. Soni, M. Nabeel, and R. Karri, “NIST post-quantum cryptography- A hardware evaluation study,” IACR Cryptol. ePrint Arch., vol. 2019, p. 47, 2019.
[26] T. Fritzmann, G. Sigl, and J. Sepúlveda, “RISQ-V: Tightly coupled RISC-V accelerators for post-quantum cryptography,” IACR Trans. Cryptogr. Hardw. Embed. Syst., vol. 2020, no. 4, pp. 239–280, 2020.
[27] G. Xin, J. Han, T. Yin, Y. Zhou, J. Yang, X. Cheng, and X. Zeng, “VPQC: A domain-specific vector processor for post-quantum cryptography based on RISC-V architecture,” IEEE Trans. Circuits Syst. I Regul. Pap., vol. 67-I, no. 8, pp. 2672–2684, 2020.
[28] V. B. Dang, F. Farahmand, M. Andrzejczak, K. Mohajerani, D. T. Nguyen, and K. Gaj, “Implementation and benchmarking of round 2 candidates in the NIST post-quantum cryptography standardization process using hardware and software/hardware co-design approaches,” IACR Cryptol. ePrint Arch., vol. 2020, p. 795, 2020.
[29] Y. Huang, M. Huang, Z. Lei, and J. Wu, “A pure hardware implementation of CRYSTALS-Kyber PQC algorithm through resource reuse,” IEICE Electronics Express, vol. advpub, 2020.
[30] Y. Xing and S. Li, “A compact hardware implementation of CCA-secure key exchange mechanism CRYSTALS-KYBER on FPGA,” IACR Trans. Cryptogr. Hardw. Embed. Syst., vol. 2021, no. 2, pp. 328–356, 2021.
[31] L. Botros, M. J. Kannwischer, and P. Schwabe, “Memory-efficient high-speed implementation of Kyber on Cortex-M4,” in Progress in Cryptology - AFRICACRYPT 2019 - 11th International Conference on Cryptology in Africa, Rabat, Morocco, July 9-11, 2019, Proceedings, pp. 209–228, 2019.
[32] E. Alkim, Y. A. Bilgin, M. Cenk, and F. Gérard, “Cortex-m4 optimizations for {R, M} LWE schemes,” IACR Trans. Cryptogr. Hardw. Embed. Syst., vol. 2020, no. 3, pp. 336–357, 2020.
[33] M. J. Kannwischer, J. Rijneveld, P. Schwabe, and K. Stoffelen, “PQM4: post-quantum crypto library for the ARM Cortex-M4,” 2018.
[34] J. W. Bos, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, J. M. Schanck, P. Schwabe, G. Seiler, and D. Stehlé, “CRYSTALS-Kyber: A CCA-secure module-lattice-based KEM,” in 2018 IEEE European Symposium on Security and Privacy, EuroS&P 2018, London, United Kingdom, April 24-26, 2018, pp. 353–367, 2018.
[35] E. Fujisaki and T. Okamoto, “Secure integration of asymmetric and symmetric encryption schemes,” in Advances in Cryptology - CRYPTO ’99, 19th Annual International Cryptology Conference, Santa Barbara, California, USA, August 15-19, 1999, Proceedings, pp. 537–554, 1999.
[36] G. Bertoni, J. Daemen, S. Hoffert, M. Peeters, and G. V. Assche, “Keccak in VHDL,” 2020.
[37] J. Howe, M. Martinoli, E. Oswald, and F. Regazzoni, “Exploring parallelism to improve the performance of frodokem in hardware.” Cryptology ePrint Archive, Report 2021/155, 2021.
[38] S. S. Roy and A. Basso, “High-speed instruction-set coprocessor for lattice-based key encapsulation mechanism: Saber in hardware,” IACR Trans. Cryptogr. Hardw. Embed. Syst., vol. 2020, no. 4, pp. 443–466, 2020.
[39] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, “Area-time efficient hardware architecture for signature based on Ed448,” IEEE Transactions on Circuits and Systems II: Express Briefs, pp. 1–1, 2021.
