Conceptual Review On Number Theoretic Transform and Comprehensive Review On Its Implementations
ABSTRACT The Number Theoretic Transform (NTT) is a powerful mathematical tool that has become
increasingly important in developing Post Quantum Cryptography (PQC) and Homomorphic Encryption
(HE). Its ability to efficiently calculate polynomial multiplication using the convolution theorem with a quasi-
linear complexity O(n log n) when implemented with Fast Fourier Transform-style algorithms has made it
a key component in modern cryptography. The FFT-style NTT algorithm, or fast-NTT, is particularly useful in
lattice-based cryptography, which relies on the hardness of certain mathematical problems to ensure security.
Its importance in these fields continues to grow as quantum computing technology advances and traditional
encryption methods become vulnerable. In this report, we discuss the mathematical concepts of polynomial
multiplications using NTT and provide a comprehensive review of the latest implementation and state-of-
the-art of NTT in both PQC and HE schemes.
INDEX TERMS Number theoretic transform, post quantum cryptography, homomorphic encryption.
schoolbook algorithm with a quadratic complexity of $O(n^2)$. However, other alternatives exist, such as the Karatsuba algorithm [18], [19], the Toom-Cook algorithm [20], [21], and the Discrete Fourier Transform (DFT)-based algorithm [22], [23], [24]. The Karatsuba algorithm applies the divide-and-conquer principle to reduce the complexity by dividing the original polynomial into two parts, resulting in $O(n^{\log_2 3})$ or $O(n^{1.58})$ [25]. The Toom-Cook algorithm generalizes the Karatsuba algorithm by dividing into $k$ parts, giving $O(n^{\log_k (2k-1)})$ complexity [26].
The Discrete Fourier Transform (DFT) and its variant in the polynomial ring, the Number Theoretic Transform (NTT), can be utilized to multiply two polynomials via the convolution theorem [27], [28]. However, the classical algorithm to compute the DFT or NTT is also $O(n^2)$. The fundamental difference between DFT and NTT is the ring they use to transform the polynomial. DFT uses a complex ring with a twiddle factor of $e^{-2\pi j/n}$, while NTT uses an integer polynomial ring with a twiddle factor of its n-th root of unity. The exclusive use of integers makes NTT popular among researchers because there is no need to implement complicated schemes such as fixed-point or floating-point arithmetic architecture. This advantage also eliminates the precision problem that may arise from implementing such architectures [29].
Many optimized versions of DFT have been proposed in the past few decades due to their prominent use in signal and image processing. The most widely used fast algorithm is the Fast Fourier Transform (FFT), which Gauss first proposed in 1805 [30]. It gained widespread attention in the 1960s when Cooley-Tukey [23] and Gentleman-Sande [24] published their works, lending their names to the well-known CT and GS butterfly architectures for FFT. The FFT has a quasilinear complexity of $O(n \log n)$, which gives a massive advantage over other methods, especially when calculating higher-degree polynomial multiplications. NTT is a variant of the DFT, so one can apply FFT algorithms to calculate NTT [31].
However, using NTT also has limitations: it requires very specific parameters. Implementing FFT algorithms requires the array length n to be a power of two; in other words, the polynomials need to have degree $2^k - 1$ [32]. It also only works on a specific prime modulus. Positive-wrapped convolution (PWC)-based NTT requires the prime modulus q to have a primitive n-th root of unity in the ring $\mathbb{Z}_q$. Moreover, negative-wrapped convolution (NWC)-based NTT needs an additional 2n-th root of unity [33].
The parameter requirements of NTT make it not always available to use in lattice-based cryptosystems. Out of the three standardized PQC schemes, while Dilithium and Falcon can apply PWC-based and NWC-based NTT, Kyber can only use PWC-based NTT due to its chosen parameters. In the other finalist schemes, NTRU and Saber, NTT cannot be used due to the power-of-two modulus and the chosen ring, respectively [33]. However, many researchers are working on workarounds to implement NTT on such systems using various mathematical techniques, such as the Residue Number System (RNS) and the Chinese Remainder Theorem (CRT) [34], [35].
NTT is also important in Homomorphic Encryption (HE) schemes such as Brakerski-Fan-Vercauteren (BFV) [36], BGV (Brakerski-Gentry-Vaikuntanathan) [37], [38], and CKKS (Cheon-Kim-Kim-Song) [39], which are based on the Ring Learning With Errors (RLWE) problem. In the BFV and BGV schemes, NTT performs the modulus-switching operation to reduce the noise in the encrypted data. In the CKKS homomorphic encryption scheme, NTT performs the "relinearization" operation, which reduces the size of the ciphertexts after multiplication operations [36], [37], [38], [39]. Microsoft SEAL is one of the most prominent libraries implementing the aforementioned schemes [40], [41]. The noticeable difference between the PQC and HE schemes is the modulus size. While PQC schemes usually use a small number as their modulus, HE schemes use a large number, which makes the implementation techniques vastly different between the two schemes.
Most of the NTT implementation reports briefly introduce NTT and recent literature reviews. However, those reports focus on their implementation techniques of NTT on various platforms and do not provide a comprehensive understanding of the NTT concepts. This motivates us to briefly introduce NTT concepts and summarize the state of the art of NTT implementations in the PQC and HE schemes. We summarize the contributions of our work as follows:
1) We briefly introduce the basic concepts of linear, cyclic, and negacyclic convolutions via traditional schoolbook algorithms, traditional NTT, and FFT-like versions of NTT. While other literature briefly introduces the concepts, the explanations are scattered across many sources. They require significant effort to learn, especially for those who begin researching the area and come from the implementation side.
2) We provide consistent toy examples through different concepts and algorithms to further enhance the conceptual understanding of the NTT. However, the focus of our report is the implementation of NTT. For the mathematical understanding of NTT, [33] provides a comprehensive conceptual explanation of the topic.
3) We summarize and provide a comprehensive review of the recent research on the NTT implementations for PQC schemes on various platforms such as FPGA, ASIC, CPU, and GPU.
4) Similarly, we also summarize and provide a comprehensive review of NTT implementations for HE schemes, which are usually a combination of RNS and CRT.
We hope that our report provides researchers in relevant fields with a general understanding of NTT from an implementation point of view and also shows the state of the art of NTT implementations in various architectures.
$$\mathrm{NWC}(x) = \sum_{k=0}^{n-1} c_k x^k \tag{5}$$
where $c_k = \sum_{i=0}^{k} g_i h_{k-i} - \sum_{i=k+1}^{n-1} g_i h_{k+n-i} \bmod q$. If $Y(x)$ is the result of their linear convolution in the ring $\mathbb{Z}[x]$, it can also be defined as
$$\mathrm{NWC}(x) = Y(x) \bmod (x^n + 1) \tag{6}$$
Example 2.3: Let $G(x) = 1 + 2x + 3x^2 + 4x^3$ and $H(x) = 5 + 6x + 7x^2 + 8x^3$, or in vector notation g = [1, 2, 3, 4] and h = [5, 6, 7, 8]. The result of the negacyclic convolution is $\mathrm{NWC}(x) = -56 - 36x + 2x^2 + 60x^3$ or [−56, −36, 2, 60]. Figure 3 shows how schoolbook long division calculates a negacyclic convolution as the remainder of the division.

$$\omega^n \equiv 1 \bmod q \tag{7}$$
and
$$\omega^k \not\equiv 1 \bmod q \tag{8}$$
for $0 < k < n$.
One thing to note is that the primitive n-th root of unity in a ring $\mathbb{Z}_q$ might not be unique. We show the following example for q = 7681, used in Kyber in Rounds 1 and 2 of the NIST-PQC Competition [13], [15]. In our toy example, however, we use n = 4 instead of n = 256.
Example 3.1: In the ring $\mathbb{Z}_{7681}$ with n = 4, the 4-th roots of unity which satisfy the condition $\omega^4 \equiv 1 \bmod 7681$, besides the trivial root 1, are {3383, 4298, 7680}. Out of these three roots, 7680 is not a primitive n-th root of unity, as there exists k = 2 < n that satisfies $\omega^2 \equiv 1 \bmod 7681$. Therefore ω = 3383 and ω = 4298 are the primitive 4-th roots of unity in $\mathbb{Z}_{7681}$.
The value of ω will be important in calculating NTT and positive-wrapped convolution. Calculating the ω of a ring with a large modulus q by hand is tricky and tedious. One library that provides a function to calculate ω is Sympy, via the function nthroot_mod [47].
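The search described in Example 3.1 can be sketched in a few lines of plain Python. This is only a brute-force illustration for toy moduli; the function name `primitive_roots_of_unity` is our own and is not taken from Sympy or from the paper.

```python
def primitive_roots_of_unity(n, q):
    """Return all primitive n-th roots of unity in Z_q by exhaustive search.

    A candidate w qualifies when w^n = 1 (mod q) and w^k != 1 (mod q)
    for every 0 < k < n, i.e. conditions (7) and (8).
    Brute force: only sensible for small q such as the toy modulus 7681.
    """
    roots = []
    for w in range(2, q):
        if pow(w, n, q) != 1:
            continue  # not an n-th root of unity at all
        if all(pow(w, k, q) != 1 for k in range(1, n)):
            roots.append(w)
    return roots


if __name__ == "__main__":
    # Example 3.1: n = 4, q = 7681
    print(primitive_roots_of_unity(4, 7681))
```

For n = 4 and q = 7681 the search keeps 3383 and 4298 and rejects 7680, whose square is already 1.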
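Notice that the power of ω is the product of the row and column numbers. As ω is the n-th root of unity, $\omega^k = \omega^{(k \bmod n)}$ for $k \ge n$. Thus:
$$\hat{g} = \begin{bmatrix} \omega^0 & \omega^0 & \omega^0 & \omega^0 \\ \omega^0 & \omega^1 & \omega^2 & \omega^3 \\ \omega^0 & \omega^2 & \omega^4 & \omega^6 \\ \omega^0 & \omega^3 & \omega^6 & \omega^9 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} \omega^0 & \omega^0 & \omega^0 & \omega^0 \\ \omega^0 & \omega^1 & \omega^2 & \omega^3 \\ \omega^0 & \omega^2 & \omega^0 & \omega^2 \\ \omega^0 & \omega^3 & \omega^2 & \omega^1 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}$$
From Example 3.1 we obtained that one of the primitive n-th roots of unity in $\mathbb{Z}_{7681}$ is ω = 3383. Substituting into the equation:
$$\hat{g} = \begin{bmatrix} 3383^0 & 3383^0 & 3383^0 & 3383^0 \\ 3383^0 & 3383^1 & 3383^2 & 3383^3 \\ 3383^0 & 3383^2 & 3383^0 & 3383^2 \\ 3383^0 & 3383^3 & 3383^2 & 3383^1 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 3383 & 7680 & 4298 \\ 1 & 7680 & 1 & 7680 \\ 1 & 4298 & 7680 & 3383 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 10 \\ 913 \\ 7679 \\ 6764 \end{bmatrix}$$

$\omega$, $\omega^{-1} = 4298$ and the scaling factor $n^{-1} = 5761$. One can calculate $\mathrm{INTT}(\hat{g}) = \mathrm{INTT}(\mathrm{NTT}(g))$ by the following matrix multiplication:
$$g = n^{-1} \begin{bmatrix} \omega^{-0\times 0} & \omega^{-0\times 1} & \omega^{-0\times 2} & \omega^{-0\times 3} \\ \omega^{-1\times 0} & \omega^{-1\times 1} & \omega^{-1\times 2} & \omega^{-1\times 3} \\ \omega^{-2\times 0} & \omega^{-2\times 1} & \omega^{-2\times 2} & \omega^{-2\times 3} \\ \omega^{-3\times 0} & \omega^{-3\times 1} & \omega^{-3\times 2} & \omega^{-3\times 3} \end{bmatrix} \begin{bmatrix} 10 \\ 913 \\ 7679 \\ 6764 \end{bmatrix} = n^{-1} \begin{bmatrix} \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} \\ \omega^{0} & \omega^{-1} & \omega^{-2} & \omega^{-3} \\ \omega^{0} & \omega^{-2} & \omega^{-0} & \omega^{-2} \\ \omega^{0} & \omega^{-3} & \omega^{-2} & \omega^{-1} \end{bmatrix} \begin{bmatrix} 10 \\ 913 \\ 7679 \\ 6764 \end{bmatrix}$$
$$g = 5761 \begin{bmatrix} 4298^0 & 4298^0 & 4298^0 & 4298^0 \\ 4298^0 & 4298^1 & 4298^2 & 4298^3 \\ 4298^0 & 4298^2 & 4298^0 & 4298^2 \\ 4298^0 & 4298^3 & 4298^2 & 4298^1 \end{bmatrix} \begin{bmatrix} 10 \\ 913 \\ 7679 \\ 6764 \end{bmatrix} = 5761 \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 4298 & 7680 & 3383 \\ 1 & 7680 & 1 & 7680 \\ 1 & 3383 & 7680 & 4298 \end{bmatrix} \begin{bmatrix} 10 \\ 913 \\ 7679 \\ 6764 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}$$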
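The matrix form of the NTT and INTT used in this toy example can be reproduced with a direct $O(n^2)$ sketch. The function names `ntt_matrix` and `intt_matrix` are illustrative choices of ours, and the modular inverses rely on Python's three-argument `pow` with a negative exponent (Python 3.8 or later).

```python
def ntt_matrix(a, omega, q):
    """Classical O(n^2) NTT: a_hat[i] = sum_j omega^(i*j) * a[j] mod q."""
    n = len(a)
    return [
        sum(pow(omega, (i * j) % n, q) * a[j] for j in range(n)) % q
        for i in range(n)
    ]


def intt_matrix(a_hat, omega, q):
    """Classical O(n^2) INTT: replace omega by omega^-1 and scale by n^-1."""
    n = len(a_hat)
    omega_inv = pow(omega, -1, q)  # 4298 for omega = 3383, q = 7681
    n_inv = pow(n, -1, q)          # 5761 for n = 4, q = 7681
    return [
        n_inv * sum(pow(omega_inv, (i * j) % n, q) * a_hat[j] for j in range(n)) % q
        for i in range(n)
    ]


if __name__ == "__main__":
    q, omega = 7681, 3383
    g_hat = ntt_matrix([1, 2, 3, 4], omega, q)
    print(g_hat)                         # forward transform of g
    print(intt_matrix(g_hat, omega, q))  # recovers g
```

The exponent is reduced modulo n before exponentiation, which mirrors the simplification $\omega^k = \omega^{(k \bmod n)}$ used in the matrices above.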
Note that the differences between $\mathrm{NTT}_\psi$ and $\mathrm{INTT}_\psi$ are the scaling factor $n^{-1}$, the replacement of ψ by $\psi^{-1}$, and the transpose of the exponent matrix of ψ.
Example 3.10: Let $\mathrm{NTT}_\psi(g) = \hat{g} = [1467, 2807, 3471, 7621]$ and ψ = 1925 in the ring $\mathbb{Z}_{7681}$. Note that $\psi^{-1} = 1213$ and $n^{-1} = 5761$. The vector g can be calculated by the following matrix multiplication:
$$g = n^{-1} \begin{bmatrix} \psi^{-0} & \psi^{-0} & \psi^{-0} & \psi^{-0} \\ \psi^{-1} & \psi^{-3} & \psi^{-5} & \psi^{-7} \\ \psi^{-2} & \psi^{-6} & \psi^{-10} & \psi^{-14} \\ \psi^{-3} & \psi^{-9} & \psi^{-15} & \psi^{-21} \end{bmatrix} \begin{bmatrix} 1467 \\ 2807 \\ 3471 \\ 7621 \end{bmatrix} = n^{-1} \begin{bmatrix} \psi^{-0} & \psi^{-0} & \psi^{-0} & \psi^{-0} \\ \psi^{-1} & \psi^{-3} & \psi^{-5} & \psi^{-7} \\ \psi^{-2} & \psi^{-6} & \psi^{-2} & \psi^{-6} \\ \psi^{-3} & \psi^{-1} & \psi^{-7} & \psi^{-5} \end{bmatrix} \begin{bmatrix} 1467 \\ 2807 \\ 3471 \\ 7621 \end{bmatrix}$$
$$g = 5761 \begin{bmatrix} 1213^0 & 1213^0 & 1213^0 & 1213^0 \\ 1213^1 & 1213^3 & 1213^5 & 1213^7 \\ 1213^2 & 1213^6 & 1213^2 & 1213^6 \\ 1213^3 & 1213^1 & 1213^7 & 1213^5 \end{bmatrix} \begin{bmatrix} 1467 \\ 2807 \\ 3471 \\ 7621 \end{bmatrix} = 5761 \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1213 & 5756 & 6468 & 1925 \\ 4298 & 3383 & 4298 & 3383 \\ 5756 & 1213 & 1925 & 6468 \end{bmatrix} \begin{bmatrix} 1467 \\ 2807 \\ 3471 \\ 7621 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}$$
Therefore g = [1, 2, 3, 4].
Example 3.11: Let $\mathrm{NTT}_\psi(h) = \hat{h} = [2489, 7489, 6478, 6607]$ and ψ = 1925 in the ring $\mathbb{Z}_{7681}$. The vector h can be calculated by the following matrix multiplication:
$$h = 5761 \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1213 & 5756 & 6468 & 1925 \\ 4298 & 3383 & 4298 & 3383 \\ 5756 & 1213 & 1925 & 6468 \end{bmatrix} \begin{bmatrix} 2489 \\ 7489 \\ 6478 \\ 6607 \end{bmatrix} = \begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix}$$
Therefore h = [5, 6, 7, 8].
3) USING NTTψ TO CALCULATE NEGATIVE-WRAPPED CONVOLUTIONS
Like its positive-wrapped version, the negative-wrapped NTT can evaluate the negative-wrapped convolutions, commonly referred to as negacyclic convolutions.
Proposition 3.2: Let a and b be the coefficient vectors of the multiplicand polynomials. The negative-wrapped convolution c of a and b can be calculated by:
$$c = \mathrm{INTT}_{\psi^{-1}}\left(\mathrm{NTT}_\psi(a) \circ \mathrm{NTT}_\psi(b)\right) \tag{18}$$
where ◦ is an element-wise vector multiplication in $\mathbb{Z}_q$.
Example 3.12: Let g = [1, 2, 3, 4] and h = [5, 6, 7, 8]. From Examples 3.8 and 3.9, we know that their $\mathrm{NTT}_\psi$ in $\mathbb{Z}_{7681}$ are $\hat{g} = [1467, 2807, 3471, 7621]$ and $\hat{h} = [2489, 7489, 6478, 6607]$ when ψ = 1925. We can calculate their negative-wrapped convolution by:
$$\mathrm{INTT}\left( \begin{bmatrix} 1467 \\ 2807 \\ 3471 \\ 7621 \end{bmatrix} \circ \begin{bmatrix} 2489 \\ 7489 \\ 6478 \\ 6607 \end{bmatrix} \right) = \mathrm{INTT}\left( \begin{bmatrix} 2888 \\ 6407 \\ 2851 \\ 2992 \end{bmatrix} \right) = 5761 \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1213 & 5756 & 6468 & 1925 \\ 4298 & 3383 & 4298 & 3383 \\ 5756 & 1213 & 1925 & 6468 \end{bmatrix} \begin{bmatrix} 2888 \\ 6407 \\ 2851 \\ 2992 \end{bmatrix} = \begin{bmatrix} 7625 \\ 7645 \\ 2 \\ 60 \end{bmatrix}$$
Therefore, [7625, 7645, 2, 60], or [−56, −36, 2, 60] when written with negative numbers, is their negacyclic convolution, the same result as calculated by schoolbook multiplication and long division in Example 2.3.

E. THE CHOICE OF MODULUS
To make the NTT available, the modulus q has to satisfy the following requirements:
1) The n-th root of unity ω exists in the ring $\mathbb{Z}_q$. The existence of ω enables one to utilize NTT to perform positive-wrapped convolutions.
2) Furthermore, the 2n-th root of unity ψ exists in the ring $\mathbb{Z}_q$ to make negative-wrapped convolutions work.
The modulus q has to satisfy the following theorem to guarantee that ω exists [27], [29], [49]:
Theorem 3.1: If q is prime, then n must divide q − 1. If q is composite such that:
$$q = q_1^{m_1} \cdot q_2^{m_2} \cdot q_3^{m_3} \cdots q_k^{m_k}$$
then n must divide the greatest common divisor (GCD) of $(q_1 - 1, q_2 - 1, q_3 - 1, \ldots, q_k - 1)$.
However, while Theorem 3.1 guarantees the existence of ω, it does not guarantee the existence of ψ. To guarantee the existence of ψ in $\mathbb{Z}_q$:
Theorem 3.2: If q is prime, then 2n must divide q − 1. If q is composite such that:
$$q = q_1^{m_1} \cdot q_2^{m_2} \cdot q_3^{m_3} \cdots q_k^{m_k}$$
then 2n must divide the greatest common divisor (GCD) of $(q_1 - 1, q_2 - 1, q_3 - 1, \ldots, q_k - 1)$.
Many researchers have proposed various moduli that might satisfy the requirements, such as Mersenne [27] and Fermat [50] prime numbers. Here we define an NTT-friendly modulus based on the type of convolution it can perform:
Definition 3.7: A modulus q is PWC-NTT friendly if and only if an n-th root of unity, ω, exists in $\mathbb{Z}_q$.
Definition 3.8: A modulus q is NWC-NTT friendly if and only if an n-th root of unity, ω, and a 2n-th root of unity, ψ, exist in $\mathbb{Z}_q$.
In the schemes proposed for the NIST-PQC competition, the values of n and q are standardized. Table 1 summarizes the schemes and their NTT-friendliness.
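The divisibility conditions of Theorems 3.1 and 3.2 are easy to check numerically. The following is a minimal sketch under the assumption that the modulus is small enough for trial-division factoring; the helper names (`prime_factors`, `root_order_bound`, `is_pwc_friendly`, `is_nwc_friendly`) are hypothetical and only mirror Definitions 3.7 and 3.8.

```python
from math import gcd


def prime_factors(q):
    """Distinct prime factors of q by trial division (fine for small moduli)."""
    factors, p = set(), 2
    while p * p <= q:
        while q % p == 0:
            factors.add(p)
            q //= p
        p += 1
    if q > 1:
        factors.add(q)
    return factors


def root_order_bound(q):
    """GCD of (q_i - 1) over the prime factors q_i of q, as in Theorem 3.1."""
    g = 0
    for p in prime_factors(q):
        g = gcd(g, p - 1)
    return g


def is_pwc_friendly(n, q):
    """Theorem 3.1 condition: n divides gcd(q_1 - 1, ..., q_k - 1)."""
    return root_order_bound(q) % n == 0


def is_nwc_friendly(n, q):
    """Theorem 3.2 condition: 2n divides gcd(q_1 - 1, ..., q_k - 1)."""
    return root_order_bound(q) % (2 * n) == 0


if __name__ == "__main__":
    # (n, q) pairs quoted in this document for the toy ring and the PQC schemes
    for name, n, q in [("toy", 4, 7681), ("Dilithium", 256, 8380417),
                       ("Falcon-512", 512, 12289), ("Falcon-1024", 1024, 12289),
                       ("Kyber", 256, 3329)]:
        print(name, is_pwc_friendly(n, q), is_nwc_friendly(n, q))
```

With these parameters only Kyber's final modulus q = 3329 fails the 2n condition, since 512 does not divide 3328, which matches the discussion of Kyber elsewhere in the text.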
TABLE 1. The values of n and q of standardized NIST-PQC scheme.
Notice that $A_j$ and $B_j$ can be obtained as n/2-point NTTs. If n is a power of two, the process can be repeated for all the coefficients. Figure 4 shows the visualization of Equation (23), usually called the CT butterfly in reference to its proposers, Cooley and Tukey [23].
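Let $A_i = \sum_{j=0}^{n/2-1} \hat{a}_j \psi^{-4ij}$ and $B_i = \sum_{j=0}^{n/2-1} \hat{a}_{j+n/2} \psi^{-4ij}$.
Based on the periodicity and symmetry of $\psi^{-1}$, for the even term:
$$a_{2i} = \psi^{-2i} \left[ \sum_{j=0}^{n/2-1} \psi^{-4ij} \hat{a}_j + \sum_{j=0}^{n/2-1} \psi^{-4i(j+n/2)} \hat{a}_{j+n/2} \right] \bmod q$$
$$a_{2i} = \psi^{-2i} \sum_{j=0}^{n/2-1} \left[ \hat{a}_j + \hat{a}_{j+n/2} \right] \psi^{-4ij} \bmod q \tag{25}$$
Doing the same derivation for the odd term:
$$a_{2i+1} = \psi^{-2i} \sum_{j=0}^{n/2-1} \left[ \hat{a}_j - \hat{a}_{j+n/2} \right] \psi^{-(2j+1)} \psi^{-4ij} \bmod q \tag{26}$$

Based on $\psi^{-1}$ symmetry:
$$g = n^{-1} \begin{bmatrix} \psi^{-0} & \psi^{-0} & \psi^{-0} & \psi^{-0} \\ \psi^{-1} & \psi^{-3} & -\psi^{-1} & -\psi^{-3} \\ \psi^{-2} & -\psi^{-2} & \psi^{-2} & -\psi^{-2} \\ \psi^{-3} & \psi^{-1} & -\psi^{-3} & -\psi^{-1} \end{bmatrix} \begin{bmatrix} 1467 \\ 2807 \\ 3471 \\ 7621 \end{bmatrix}$$
Breaking down for each element:
$$g_0 = [1467\psi^{-0} + 2807\psi^{-0} + 3471\psi^{-0} + 7621\psi^{-0}]\, n^{-1}$$
$$g_1 = [1467\psi^{-1} + 2807\psi^{-3} - 3471\psi^{-1} - 7621\psi^{-3}]\, n^{-1}$$
$$g_2 = [1467\psi^{-2} - 2807\psi^{-2} + 3471\psi^{-2} - 7621\psi^{-2}]\, n^{-1}$$
$$g_3 = [1467\psi^{-3} + 2807\psi^{-1} - 3471\psi^{-3} - 7621\psi^{-1}]\, n^{-1}$$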
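The even/odd split of Equations (25) and (26) can be applied recursively, which gives a Gentleman-Sande style fast INTT. The sketch below is a recursive illustration rather than the in-place, bit-reversed form used in practical implementations; the names `ntt_psi`, `intt_psi_gs`, and `_gs_unscaled` are our own. In the odd branch each difference is multiplied by $\psi^{-(2j+1)}$, as in the element-wise breakdown of $g_1$ and $g_3$ above.

```python
def ntt_psi(a, psi, q):
    """Direct O(n^2) negative-wrapped NTT: a_hat[j] = sum_i psi^(2ij+i) * a[i]."""
    n = len(a)
    return [sum(pow(psi, 2 * i * j + i, q) * a[i] for i in range(n)) % q
            for j in range(n)]


def _gs_unscaled(a_hat, psi_inv, q):
    """Recursive split of the unscaled INTT into even and odd output indices."""
    n = len(a_hat)
    if n == 1:
        return list(a_hat)
    half = n // 2
    # Even outputs use the sums, odd outputs the twiddled differences.
    u = [(a_hat[j] + a_hat[j + half]) % q for j in range(half)]
    v = [(a_hat[j] - a_hat[j + half]) * pow(psi_inv, 2 * j + 1, q) % q
         for j in range(half)]
    psi_inv_sq = psi_inv * psi_inv % q  # half-size transforms use psi^-2
    out = [0] * n
    out[0::2] = _gs_unscaled(u, psi_inv_sq, q)
    out[1::2] = _gs_unscaled(v, psi_inv_sq, q)
    return out


def intt_psi_gs(a_hat, psi, q):
    """Negative-wrapped INTT: recursive butterflies followed by n^-1 scaling."""
    n_inv = pow(len(a_hat), -1, q)
    return [x * n_inv % q for x in _gs_unscaled(a_hat, pow(psi, -1, q), q)]


if __name__ == "__main__":
    q, psi = 7681, 1925
    g_hat = ntt_psi([1, 2, 3, 4], psi, q)
    h_hat = ntt_psi([5, 6, 7, 8], psi, q)
    product = [x * y % q for x, y in zip(g_hat, h_hat)]
    print(intt_psi_gs(product, psi, q))  # negacyclic convolution of g and h
```

The usage at the bottom follows Proposition 3.2: transform both inputs, multiply element-wise, and transform back, which reproduces the negacyclic product of Example 3.12.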
TABLE 4. Normal and bit-reversed order for n = 16.
to have NO-input & BO-output. Figure 8 shows all possible configurations for NTT CT and INTT GS Butterfly for n = 8.
Using normal order as NTT input is called decimation in time, while bit-reversed order input is called decimation in frequency [52]. Another thing to notice is that the power of ψ follows the bit-reversed order index. The set of all the powers of ψ is called the twiddle factors.

D. MODULAR ARITHMETIC UNITS
One of the challenges of NTT-based multiplication is that addition and multiplication have to be done in $\mathbb{Z}_q$. All the CT and GS butterfly operators require modular arithmetic, a non-standard feature for most implementation platforms.
1) MODULAR ADDER
To calculate modular addition, (A + B) mod q, we can simply use a piece-wise function:
$$(A + B) \bmod q = \begin{cases} A + B & A + B < q \\ A + B - q & A + B \ge q \end{cases} \tag{29}$$
Equation (29) is relatively easy to implement using a set of adders, subtractors, and multiplexers.
While a modular adder is simple and easy to implement, modular multipliers are trickier. The standard algorithm for modular multiplication uses trial division, which is inefficient, not scalable, and difficult to implement in hardware architecture. The most popular workaround for implementing modular multiplication is the Barrett or Montgomery modular multiplication algorithm.
2) MODULAR REDUCTION: BARRETT METHOD
The main idea behind Barrett reduction is to approximate the division by the modulus using pre-computed values, which allows for faster modular multiplication [53], [54]. Algorithm 1 shows how to multiply two integers modulo q using Barrett reduction. As the value of q is usually fixed,
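The modular adder of Equation (29) and the idea of Barrett reduction can be sketched as follows. This is a textbook-style illustration and is not claimed to match Algorithm 1 line by line; the function names and the choice of $k$ as the bit length of q with $\mu = \lfloor 4^k / q \rfloor$ are assumptions made for this example.

```python
def mod_add(a, b, q):
    """Equation (29): one addition, one conditional subtraction (a, b < q)."""
    s = a + b
    return s - q if s >= q else s


def barrett_setup(q):
    """Precompute k = bit length of q and mu = floor(4^k / q) once per modulus."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu


def barrett_reduce(x, q, k, mu):
    """Reduce 0 <= x < q*q modulo q using a multiply and a shift instead of division."""
    t = (x * mu) >> (2 * k)  # estimate of floor(x / q), never too large
    r = x - t * q
    while r >= q:            # small correction for the underestimate
        r -= q
    return r


def barrett_mul(a, b, q, k, mu):
    """Modular multiplication a*b mod q for operands already reduced mod q."""
    return barrett_reduce(a * b, q, k, mu)


if __name__ == "__main__":
    Q = 7681
    K, MU = barrett_setup(Q)
    print(mod_add(7000, 1000, Q))
    print(barrett_mul(1467, 2489, Q, K, MU))
```

Because q is fixed, `barrett_setup` runs once and every later reduction costs only multiplications, a shift, and a few subtractions, which is what makes the method attractive for hardware butterflies.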
Transform back to standard representation:
$$c = c_m \times r_{inv} \bmod q = 8697 \times 7200 \bmod 7681 = 2888$$
Transforming to and from Montgomery representation is an expensive operation, which is usually done iteratively by subtracting q multiple times. One needs to minimize the number of transformations to use Montgomery modular reduction efficiently.
Many researchers perform various workarounds and optimizations for NTT/INTT implementation using previously discussed concepts in various Post-Quantum Cryptography applications, which we will discuss in the following chapter.

V. NTT IN POST QUANTUM CRYPTOGRAPHY SCHEME
All the NIST-PQC competition winners, Dilithium [10], Falcon [12], and Kyber [15], [16], include NTT/INTT in their specifications for modular polynomial multiplication. In this section, we survey the implementations of NTT/INTT for each scheme on various platforms based on their novelty claims, algorithms, and implementation strategies. We also present common optimizations implemented by various researchers.

A. DILITHIUM, KYBER, AND FALCON OVERVIEW
Dilithium [10] is one of the standardized algorithms in the NIST Post-Quantum Cryptography (PQC) competition. It is a signature scheme based on the problem of finding short lattice vectors, which is believed to be hard even for quantum computers. The Dilithium algorithm is designed to provide strong security while remaining efficient enough for practical use in digital signature applications.
Kyber [15], [16] is a key encapsulation mechanism (KEM) and one of the proposed algorithms in the NIST Post-Quantum Cryptography (PQC) competition. It is a lattice-based cryptosystem that relies on the hardness of the Learning With Errors (LWE) problem and its variants, which are believed to resist quantum attacks.
Falcon [12] is one of the candidate algorithms for digital signature schemes in NIST-PQC (National Institute of Standards and Technology Post-Quantum Cryptography). Falcon is a family of lattice-based signature schemes designed to be secure against attacks by quantum computers. Falcon uses a variation of the Ring Learning With Errors (RLWE) problem, which is believed to be resistant to attacks by both classical and quantum computers. The security of Falcon relies on the hardness of the underlying mathematical problem of finding the shortest vector in a lattice. Falcon provides efficient signature generation and verification, making it a practical option for real-world applications. It is also designed to resist side-channel attacks, which exploit weaknesses in the physical implementation of a cryptographic system.
This report highlights the NTT/INTT specifications and the Dilithium, Kyber, and Falcon scheme implementations. Optimizations and various implementations outside the NTT/INTT in the scheme are out of the scope of our work.

B. NTT IN FINALIZED PQC SCHEMES
NTT is a part of the Dilithium specification, with the parameters set as polynomials of degree n = 256 and the modulus $q = 2^{23} - 2^{13} + 1 = 8380417$, used in the extended cyclotomic ring $\mathbb{Z}_q[x]/(x^{256} + 1)$. Notice that the chosen q is an NWC-NTT friendly prime where ψ exists. Dilithium also specifies the chosen 2n-th root of unity, ψ = 1753. These parameters were chosen based on a trade-off between security and efficiency [10].
NTT is also a part of the Falcon and Kyber specifications. Falcon [12] specifies n = 512 or n = 1024, depending on the desired security level, and the modulus q is chosen to be 12289, which is an NWC-NTT friendly modulus for both values of n. Kyber [16] specifies n = 256 and q = 3329 in its finalized version. Table 5 shows the NTT parameters summary for Dilithium, Falcon, and Kyber.
TABLE 5. The values of n, q, ω, and ψ of standardized NIST-PQC scheme. Note that only Dilithium specifies the actual value of ψ, others do not.
As we can see, the NTT specification in Kyber is unique because the chosen modulus in the final version, q = 3329, is not an NWC-NTT friendly modulus, which requires a special trick called truncated NTT to calculate its negative-wrapped convolution. Truncated NTT requires the NTT calculation to be divided into two parts: for Kyber with n = 256, it requires two NTT calculations with n = 128 by separating the odd and even parts [58]. Notice that when n = 128 and q = 3329, it is an NWC-NTT friendly modulus, with ψ = 892 as one possible choice. In the following toy example for truncated NTT, we can calculate NTT/INTT with n = 8 by breaking it down into two NTT/INTT calculations with n = 4.
Example 5.1: Let A = [0, 1, 2, 3, 4, 5, 6, 7] and B = [8, 9, 10, 11, 12, 13, 14, 15] in the ring $\mathbb{Z}_Q$ with Q = 7681. We need to find the negacyclic convolution of A and B.
Calculating the results using previously explained methods in normal order: Using ψ = 7154, we can get:
$\mathrm{NTT}_\psi(A)$ = [0, 7154, 2426, 2497, 1830, 4245, 3812, 4081]
$\mathrm{NTT}_\psi(B)$ = [8, 2938, 4449, 4035, 5490, 3356, 3774, 1064]
Element-wise multiplication between the two yields:
$\mathrm{NTT}_\psi(A) \circ \mathrm{NTT}_\psi(B)$ = [3213, 7391, 1790, 5474, 5572, 2527, 2633, 7341]
FIGURE 10. Calculating 256-element NTT Transformation using 4-element CT butterfly iteratively.
Taking the INTT of the result yields the negacyclic convolution between A and B:
$$\mathrm{INTT}_{\psi^{-1}}\left(\mathrm{NTT}_\psi(A) \circ \mathrm{NTT}_\psi(B)\right)$$
TABLE 6. Values of $\psi^{2i+1}$, which are important for determining the modulus of the schoolbook multiplication.
TABLE 7. (Continued.) Summary of Dilithium, Kyber, and Falcon’s NTT hardware implementations.
Theorem 6.3: For encryption and decryption, the complexities for processing h plaintexts in parallel with a C-core GPU are:
$$O\left(\frac{6nh}{C}(1 + \log n)\right)$$
and
$$O\left(\frac{2nh}{C}(1 + \log n)\right)$$
respectively, where n is the number of inputs or coefficients in NTT/INTT.
Proof: As in Lemmas 6.1 and 6.2, encryption requires at most 3 polynomial multiplications and 3 polynomial additions, while decryption requires a polynomial multiplication and a polynomial addition. With a C-core GPU and h parallel processes, we have the theorem. □
By targeting high utilization of a GPU, if 6nh is less than C, we will have a very fast $O(\log n)$ execution time.

3) HE IMPLEMENTATIONS IN FPGAs
On the other hand, FPGAs also offer area-efficient and fast implementations of homomorphic encryption (HE), as presented in [117], [127], [132], [133], [134], [135], [136], [137], [138], and [139]. The implementation of HE in FPGA also consists of 3 approaches. First, pre/post-processing is usually required in the NTT/INTT by using the RNS-CRT method and modular multipliers such as Barrett, Montgomery, or LUT-based reduction [127], [132], [133], [134], [137], [139]. The second is the parallel implementation of processing elements (PEs) or butterfly units (BUs), in serial or pipeline parallelization [127], [132], [133], [134], [135], [138], [139]. Lastly, the optimization is done through reconfigurable designs by implementing custom PEs or instructions using RISC-V [117], [132], [138], [139].
Figures 17 and 18 show the NTT/INTT architecture implemented in FPGAs. In [135], the butterfly unit array is constructed with 8 × 4 arrays as shown in Figure 17. The architecture is capable of transforming 16 coefficients
Example 6.1: Let A = [123456, 7891011, 121314, 151617] and B = [181920, 212223, 232425, 262728] in the ring $\mathbb{Z}_Q$ with Q = 456149404001. We need to find the negacyclic convolution of A and B.
Calculating the results using previously explained methods in normal order: Using ψ = 12967992388, we can get:
$\mathrm{NTT}_\psi(A)$ = [164909637252, 371837718802, 52022178059, 323529767713]
$\mathrm{NTT}_\psi(B)$ = [94256621661, 54777633553, 418495999451, 344769281017]

Converting each element back to normal representation can be done using the Chinese Remainder Theorem. Taking the element (129, 4265, 4017) as an example, we need to find x from the following system of equations:
$$x \equiv 129 \bmod 6841,$$
$$x \equiv 4265 \bmod 7681,$$
$$x \equiv 4017 \bmod 8681$$
This is a classical textbook problem for CRT. Solving it, we can obtain x = 169643576476. Transforming back all C to normal representation yields:
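The CRT reconstruction of a single RNS element, such as (129, 4265, 4017) in this example, can be sketched with the standard constructive formula. The function name `crt` is our own, and the code assumes pairwise coprime moduli and Python 3.8 or later (for `math.prod` and modular inverses via `pow`).

```python
from math import prod


def crt(residues, moduli):
    """Recover x mod Q (Q = product of moduli) from x mod q_i.

    Uses x = sum_i r_i * Q_i * (Q_i^-1 mod q_i) mod Q with Q_i = Q / q_i.
    Assumes the moduli are pairwise coprime.
    """
    big_q = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        q_i = big_q // m
        x += r * q_i * pow(q_i, -1, m)
    return x % big_q


if __name__ == "__main__":
    moduli = [6841, 7681, 8681]
    print(prod(moduli))                   # the large modulus Q of Example 6.1
    print(crt([129, 4265, 4017], moduli)) # the reconstructed element x
```

The same routine would be applied to every coefficient of the result, one residue tuple at a time, which is why RNS-based designs keep the small-modulus arithmetic in the NTT and defer this reconstruction to a post-processing step.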
[41] H. Chen, K. Laine, and R. Player, "Simple encrypted arithmetic library-SEAL v2.1," in Proc. Int. Conf. Financial Cryptogr. Data Secur. Sliema, Malta: Springer, Apr. 2017, pp. 3–18.
[42] H. J. Nussbaumer, "Elements of number theory and polynomial algebra," in Fast Fourier Transform and Convolution Algorithms, 1982, pp. 4–31.
[43] Convolution and Polynomial Multiplication in MATLAB. Accessed: May 2, 2023. [Online]. Available: https://www.mathworks.com/help/MATLAB/ref/conv.html
[44] Numpy Convolution. Accessed: May 2, 2023. [Online]. Available: https://numpy.org/doc/stable/reference/generated/numpy.convolve.html
[45] Modulo-n Circular Convolution in MATLAB. Accessed: May 2, 2023. [Online]. Available: https://www.mathworks.com/help/signal/ref/cconv.html
[46] I. Syafalni, G. Jonatan, N. Sutisna, R. Mulyawan, and T. Adiono, "Efficient homomorphic encryption accelerator with integrated PRNG using low-cost FPGA," IEEE Access, vol. 10, pp. 7753–7771, 2022.
[47] Sympy 1.11 Documentation. Accessed: May 2, 2023. [Online]. Available: https://docs.sympy.org/latest/modules/ntheory.html#sympy.ntheory.residue_ntheory.nthroot_mod
[48] A. Schönhage and V. Strassen, "Schnelle multiplikation grosser Zahlen," Computing, vol. 7, nos. 3–4, pp. 281–292, 1971.
[49] V. S. Dimitrov, T. V. Cooklev, and B. D. Donevsky, "Generalized Fermat–Mersenne number theoretic transform," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 41, no. 2, pp. 133–139, Feb. 1994.
[50] C. M. Rader, "Discrete convolutions via Mersenne transforms," IEEE Trans. Comput., vol. C-100, no. 12, pp. 1269–1273, Dec. 1972.
[51] P. Heckbert, "Fourier transforms and the fast Fourier transform (FFT) algorithm," Comput. Graph., vol. 2, p. 463, Feb. 1995.
[52] A. Saidi, "Decimation-in-time-frequency FFT algorithm," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Apr. 1994, p. 453.
[53] P. Barrett, "Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor," in Proc. Conf. Theory Appl. Cryptograph. Techn. Cham, Switzerland: Springer, 2000, pp. 311–323.
[54] T. Wu, S. Li, and L. Liu, "Modular multiplier by folding Barrett modular reduction," in Proc. IEEE 11th Int. Conf. Solid-State Integr. Circuit Technol., Oct. 2012, pp. 1–3.
[55] L. Hars, "Long modular multiplication for cryptographic applications," in Proc. CHES. Cham, Switzerland: Springer, 2004, pp. 45–61.
[56] P. L. Montgomery, "Modular multiplication without trial division," Math. Comput., vol. 44, no. 170, pp. 519–521, 1985.
[57] C. K. Koc, T. Acar, and B. S. Kaliski, "Analyzing and comparing Montgomery multiplication algorithms," IEEE Micro, vol. 16, no. 3, pp. 26–33, Jun. 1996.
[58] T. T. Nguyen, S. Kim, Y. Eom, and H. Lee, "Area-time efficient hardware architecture for CRYSTALS–Kyber," Appl. Sci., vol. 12, no. 11, p. 5305, May 2022.
[59] A. Abdulrahman, V. Hwang, M. J. Kannwischer, and A. Sprenkels, "Faster Kyber and Dilithium on the Cortex-M4," in Proc. Int. Conf. Appl. Cryptogr. Netw. Secur. Rome, Italy: Springer, Jun. 2022, pp. 853–871.
[60] F. Yaman, A. C. Mert, E. Öztürk, and E. Savas, "A hardware accelerator for polynomial multiplication operation of CRYSTALS-KYBER PQC scheme," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Feb. 2021, pp. 1020–1025.
[61] A. Aikata, A. C. Mert, M. Imran, S. Pagliarini, and S. S. Roy, "KaLi: A crystal for post-quantum security using Kyber and Dilithium," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 70, no. 2, pp. 747–758, Feb. 2023.
[62] Y. Kim, J. Song, T.-Y. Youn, and S. C. Seo, "CRYSTALS–Dilithium on ARMv8," Secur. Commun. Netw., vol. 2022, pp. 1–12, Feb. 2022.
[63] X. Chen, B. Yang, Y. Lu, S. Yin, S. Wei, and L. Liu, "Efficient access scheme for multi-bank based NTT architecture through conflict graph," in Proc. 59th ACM/IEEE Design Autom. Conf., Jul. 2022, pp. 91–96.
[64] T. Wang, C. Zhang, P. Cao, and D. Gu, "Efficient implementation of Dilithium signature scheme on FPGA SoC platform," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 30, no. 9, pp. 1158–1171, Sep. 2022.
[65] X. Chen, B. Yang, S. Yin, S. Wei, and L. Liu, "CFNTT: Scalable radix-2/4 NTT multiplication architecture with an efficient conflict-free memory mapping scheme," IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2022, pp. 94–126, Nov. 2021.
[66] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, and L. Liu, "Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT," IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2020, pp. 49–72, Mar. 2020.
[67] G. Mao, D. Chen, G. Li, W. Dai, A. I. Sanka, Ç. K. Koç, and R. C. C. Cheung, "High-performance and configurable SW/HW Co-design of post-quantum signature CRYSTALS-Dilithium," ACM Trans. Reconfigurable Technol. Syst., vol. 16, no. 3, pp. 1–28, Sep. 2023.
[68] N. Gupta, A. Jati, A. Chattopadhyay, and G. Jha, "Lightweight hardware accelerator for post-quantum digital signature CRYSTALS-Dilithium," IACR Cryptol. ePrint Arch., vol. 2022, p. 496, Jan. 2022.
[69] J. Zheng, F. He, S. Shen, C. Xue, and Y. Zhao, "Parallel small polynomial multiplication for Dilithium: A faster design and implementation," in Proc. 38th Annu. Comput. Secur. Appl. Conf., Dec. 2022, pp. 304–317.
[70] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin, S. Wei, and L. Liu, "A compact and high-performance hardware architecture for CRYSTALS-Dilithium," IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2022, pp. 270–295, Nov. 2021.
[71] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proc. Int. Conf. Parallel Process., Apr. 1996, pp. 766–770.
[72] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 43, no. 10, pp. 677–688, Oct. 1996.
[73] G. Land, P. Sasdrich, and T. Güneysu, "A hard CRYSTAL - Implementing Dilithium on reconfigurable hardware," in Proc. Int. Conf. Smart Card Res. Adv. Appl. Lübeck, Germany: Springer, Nov. 2022, pp. 210–230.
[74] Z. Zhou, D. He, Z. Liu, M. Luo, and K.-K.-R. Choo, "A software/hardware co-design of CRYSTALS-Dilithium signature scheme," ACM Trans. Reconfigurable Technol. Syst., vol. 14, no. 2, pp. 1–21, Jun. 2021.
[75] D. T. Nguyen, V. B. Dang, and K. Gaj, "A high-level synthesis approach to the software/hardware codesign of NTT-based post-quantum cryptography algorithms," in Proc. Int. Conf. Field-Program. Technol. (ICFPT), Dec. 2019, pp. 371–374.
[76] P. Longa and M. Naehrig, "Speeding up the number theoretic transform for faster ideal lattice-based cryptography," in Proc. Int. Conf. Cryptol. Netw. Secur. Milan, Italy: Springer, Nov. 2016, pp. 124–139.
[77] L. Beckwith, D. T. Nguyen, and K. Gaj, "High-performance hardware implementation of CRYSTALS-Dilithium," in Proc. Int. Conf. Field-Program. Technol. (ICFPT), Dec. 2021, pp. 1–10.
[78] K. D. Ortega L. and L. J. Dominguez Perez, "Implementing CRYSTAL-Dilithium on FRDM-K64," in Proc. IEEE 12th Annu. Ubiquitous Comput., Electron. Mobile Commun. Conf. (UEMCON), Dec. 2021, pp. 178–183.
[79] S. Ricci, L. Malina, P. Jedlicka, D. Smékal, J. Hajny, P. Cibik, P. Dzurenda, and P. Dobias, "Implementing CRYSTALS-Dilithium signature scheme on FPGAs," in Proc. 16th Int. Conf. Availability, Rel. Secur., Aug. 2021, pp. 1–11.
[80] H. Becker, V. Hwang, M. J. Kannwischer, B.-Y. Yang, and S.-Y. Yang, "Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1," IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2022, pp. 221–244, Nov. 2021.
[81] A. Basso, F. Aydin, D. Dinu, J. Friel, A. Varna, M. Sastry, and S. Ghosh, "Where star wars meets star trek: Saber and Dilithium on the same polynomial multiplier," Cryptol. ePrint Arch., to be published.
[82] D. O. C. Greconici, M. J. Kannwischer, and D. Sprenkels, "Compact Dilithium implementations on Cortex-M3 and Cortex-M4," IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2021, pp. 1–24, Dec. 2020.
[83] V. B. Dang, K. Mohajerani, and K. Gaj, "High-speed hardware architectures and FPGA benchmarking of CRYSTALS-Kyber, NTRU, and Saber," IEEE Trans. Comput., vol. 72, no. 2, pp. 306–320, Feb. 2023.
[84] M. Knezevic, F. Vercauteren, and I. Verbauwhede, "Faster interleaved modular multiplication based on Barrett and Montgomery reduction methods," IEEE Trans. Comput., vol. 59, no. 12, pp. 1715–1721, Dec. 2010.
[85] Y. Zhao, R. Xie, G. Xin, and J. Han, "A high-performance domain-specific processor with matrix extension of RISC-V for module-LWE applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 7, pp. 2871–2884, Jul. 2022.
[86] D. W. Kim, D. I. Maulana, and W. Jung, "Kyber accelerator on FPGA using energy-efficient LUT-based Barrett reduction," in Proc. 19th Int. SoC Design Conf. (ISOCC), Oct. 2022, pp. 83–84.
[87] L. Wan, F. Zheng, G. Fan, R. Wei, L. Gao, Y. Wang, J. Lin, and J. Dong, "A novel high-performance implementation of CRYSTALS-Kyber with AI accelerator," in Proc. Eur. Symp. Res. Comput. Secur. Copenhagen,
[107] K. Yao, D. Kundi, C. Wang, M. O'Neill, and W. Liu, "Towards CRYSTALS-Kyber: A M-LWE cryptoprocessor with area-time trade-off," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–5.
Denmark: Springer, Sep. 2022, pp. 514–534. [108] C. Zhang, D. Liu, X. Liu, X. Zou, G. Niu, B. Liu, and Q. Jiang, ‘‘Towards
[88] H. Nguyen and L. Tran, ‘‘Design of polynomial NTT and INTT acceler- efficient hardware implementation of NTT for Kyber on FPGAs,’’ in
ator for post-quantum cryptography CRYSTALS-Kyber,’’ Arabian J. Sci. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–5.
Eng., vol. 48, pp. 1527–1536, Feb. 2022. [109] Y. Huang, M. Huang, Z. Lei, and J. Wu, ‘‘A pure hardware implementation
[89] P. Sanal, E. Karagoz, H. Seo, R. Azarderakhsh, and of CRYSTALS-KYBER PQC algorithm through resource reuse,’’ IEICE
M. Mozaffari-Kermani, ‘‘Kyber on ARM64: Compact implementations Electron. Exp., vol. 17, no. 17, 2020, Art. no. 20200234.
of Kyber on 64-bit arm cortex-a processors,’’ in Proc. Int. Conf. Secur. [110] Z. Chen, Y. Ma, T. Chen, J. Lin, and J. Jing, ‘‘Towards efficient Kyber on
Privacy Commun. Syst. Cham, Switzerland: Springer, Sep. 2021, FPGAs: A processor for vector of polynomials,’’ in Proc. 25th Asia South
pp. 424–440. Pacific Design Autom. Conf. (ASP-DAC), Jan. 2020, pp. 247–252.
[90] J. N. Ortiz, F. C. Rodrigues, D. G. Filho, C. Teixeira, J. López, and [111] L. Botros, M. J. Kannwischer, and P. Schwabe, ‘‘Memory-efficient high-
R. Dahab, ‘‘Evaluation of CRYSTALS-Kyber and saber on the ARMv8 speed implementation of Kyber on Cortex-m4,’’ in Proc. Int. Conf.
architecture,’’ in Proc. Anais do XXII Simpósio Brasileiro em Segurança Cryptol. Africa. Rabat, Morocco: Springer, Jul. 2019, pp. 209–228.
da Informação e de Sistemas Computacionais, 2022, pp. 372–377. [112] Y. Kim, J. Song, and S. C. Seo, ‘‘Accelerating falcon on ARMv8,’’ IEEE
[91] J. Huang, J. Zhang, H. Zhao, Z. Liu, R. C. C. Cheung, Ç. K. Koç, Access, vol. 10, pp. 44446–44460, 2022.
and D. Chen, ‘‘Improved plantard arithmetic for lattice-based cryptog- [113] W.-K. Lee, R. K. Zhao, R. Steinfeld, A. Sakzad, and S. O. Hwang,
raphy,’’ IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2022, ‘‘High throughput lattice-based signatures on GPUs: Comparing Falcon
pp. 614–636, Aug. 2022. and Mitaka,’’ Cryptol. ePrint Arch., to be published.
[92] T. Plantard, ‘‘Efficient word size modular arithmetic,’’ IEEE Trans. [114] (2023). Australian Research Data Commons Nectar Research Cloud
Emerg. Topics Comput., vol. 9, no. 3, pp. 1506–1518, Jul. 2021. System. [Online]. Available: https://ardc.edu.au/services/
[93] Z. Ye, R. C. C. Cheung, and K. Huang, ‘‘PipeNTT: A pipelined number [115] P. Karl, J. Schupp, T. Fritzmann, and G. Sigl, ‘‘Post-quantum signatures
theoretic transform architecture,’’ IEEE Trans. Circuits Syst. II, Exp. on RISC-V with hardware acceleration,’’ ACM Trans. Embedded Comput.
Briefs, vol. 69, no. 10, pp. 4068–4072, Oct. 2022. Syst., to be published.
[94] M. Li, J. Tian, X. Hu, and Z. Wang, ‘‘Reconfigurable and high-efficiency [116] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi,
polynomial multiplication accelerator for CRYSTALS-Kyber,’’ IEEE E. Flamand, F. K. Gürkaynak, and L. Benini, ‘‘Near-threshold RISC-V
Trans. Comput.-Aided Design Integr. Circuits Syst., early access, core with DSP extensions for scalable IoT endpoint devices,’’ IEEE Trans.
Dec. 19, 2022, doi: 10.1109/TCAD.2022.3230359. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2700–2713,
[95] D. Kundi, Y. Zhang, C. Wang, A. Khalid, M. O’Neill, and W. Liu, Oct. 2017.
‘‘Ultra high-speed polynomial multiplications for lattice-based cryptog-
[117] E. Karabulut and A. Aysu, ‘‘RANTT: A RISC-V architecture extension
raphy on FPGAs,’’ IEEE Trans. Emerg. Topics Comput., vol. 10, no. 4,
for the number theoretic transform,’’ in Proc. 30th Int. Conf. Field-
pp. 1993–2005, Oct. 2022.
Program. Log. Appl. (FPL), Aug. 2020, pp. 26–32.
[96] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani,
[118] D. Harvey and J. van der Hoeven, ‘‘Polynomial multiplication over finite
‘‘A monolithic hardware implementation of Kyber: Comparing apples to
fields in time O(n log n,’’ J. ACM, vol. 69, no. 2, pp. 1–40, Apr. 2022.
apples in PQC candidates,’’ in Proc. Int. Conf. Cryptol. Inf. Secur. Latin
[119] L. Ducas and D. Micciancio, ‘‘FHEW: Bootstrapping homomorphic
Amer. Bogotá, Colombia: Springer, 2021, pp. 108–126.
encryption in less than a second,’’ in Proc. Annu. Int. Conf. Theory Appl.
[97] Y. Xing and S. Li, ‘‘A compact hardware implementation of CCA-secure
Cryptograph. Techn. Sofia, Bulgaria: Springer, Apr. 2015, pp. 617–640.
key exchange mechanism CRYSTALS-KYBER on FPGA,’’ IACR Trans.
Cryptograph. Hardw. Embedded Syst., vol. 2021, pp. 328–356, Feb. 2021. [120] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, ‘‘TFHE: Fast
fully homomorphic encryption over the torus,’’ J. Cryptol., vol. 33, no. 1,
[98] P. Nannipieri, S. Di Matteo, L. Zulberti, F. Albicocchi, S. Saponara,
pp. 34–91, Jan. 2020.
and L. Fanucci, ‘‘A RISC-V post quantum cryptography instruction set
extension for number theoretic transform to speed-up CRYSTALS algo- [121] C. Gentry, A. Sahai, and B. Waters, ‘‘Homomorphic encryption
rithms,’’ IEEE Access, vol. 9, pp. 150798–150808, 2021. from learning with errors: Conceptually-simpler, asymptotically-faster,
attribute-based,’’ in Proc. Annual Cryptol. Conf. Santa Barbara, CA,
[99] W. Guo, S. Li, and L. Kong, ‘‘An efficient implementation of Kyber,’’
USA: Springer, Aug. 2013, pp. 75–92.
IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 69, no. 3, pp. 1562–1566,
Mar. 2022. [122] S. Halevi and V. Shoup, ‘‘Design and implementation of helib: A homo-
[100] L. Zhao, J. Zhang, J. Huang, Z. Liu, and G. Hancke, ‘‘Efficient imple- morphic encryption library,’’ Cryptol. ePrint Arch., to be published.
mentation of Kyber on mobile devices,’’ in Proc. IEEE 27th Int. Conf. [123] A. Al Badawi et al., ‘‘OpenFHE: Open-source fully homomorphic
Parallel Distrib. Syst. (ICPADS), Dec. 2021, pp. 506–513. encryption library,’’ in Proc. 10th Workshop Encrypted Comput. Appl.
[101] D. T. Nguyen and K. Gaj, ‘‘Fast NEON-based multiplication for lattice- Homomorphic Cryptogr., Nov. 2022, pp. 53–63.
based NIST post-quantum cryptography finalists,’’ in Proc. Int. Conf. [124] C. V. Mouchet, J.-P. Bossuat, J. R. Troncoso-Pastoriza, and J.-P. Hubaux,
Post-Quantum Cryptogr. Daejeon, South Korea: Springer, Jul. 2021, ‘‘Lattigo: A multiparty homomorphic encryption library in go,’’ in Proc.
pp. 234–254. 8th Workshop Encrypted Comput. Appl. Homomorphic Cryptogr., 2020,
[102] Z. Chen, Y. Ma, T. Chen, J. Lin, and J. Jing, ‘‘High-performance area- pp. 64–70.
efficient polynomial ring processor for CRYSTALS-Kyber on FPGAs,’’ [125] Z. Wang, P. Li, R. Hou, Z. Li, J. Cao, X. Wang, and D. Meng, ‘‘HE-
Integration, vol. 78, pp. 25–35, May 2021. booster: An efficient polynomial arithmetic acceleration on GPUs for
[103] Z. Liu, H. Seo, S. Sinha Roy, J. Großschädl, H. Kim, and I. Verbauwhede, fully homomorphic encryption,’’ IEEE Trans. Parallel Distrib. Syst.,
‘‘Efficient Ring-LWE encryption on 8-bit AVR processors,’’ in Proc. vol. 34, no. 4, pp. 1067–1081, Apr. 2023.
Int. Workshop Cryptograph. Hardw. Embedded Syst. Saint-Malo, France: [126] Z. Zheng, ‘‘Encrypted cloud using GPUs,’’ Ph.D. dissertation, KU Leu-
Springer, Sep. 2015, pp. 663–682. ven, Leuven, Belgium, 2020. [Online]. Available: https://www~
[104] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, ‘‘High- [127] E. Öztürk, Y. Doröz, E. Savas, and B. Sunar, ‘‘A custom accelerator for
speed NTT-based polynomial multiplication accelerator for post-quantum homomorphic encryption applications,’’ IEEE Trans. Comput., vol. 66,
cryptography,’’ in Proc. IEEE 28th Symp. Comput. Arithmetic (ARITH), no. 1, pp. 3–16, Jan. 2017.
Jun. 2021, pp. 94–101. [128] W. Wang, Z. Chen, and X. Huang, ‘‘Accelerating leveled fully homo-
[105] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, morphic encryption using GPU,’’ in Proc. IEEE Int. Symp. Circuits Syst.
‘‘Instruction-set accelerated implementation of CRYSTALS-Kyber,’’ (ISCAS), Jun. 2014, pp. 2800–2803.
IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 11, pp. 4648–4659, [129] W. Dai, Y. Doröz, and B. Sunar, ‘‘Accelerating NTRU based homomor-
Nov. 2021. phic encryption using GPUs,’’ in Proc. IEEE High Perform. Extreme
[106] L. Ma, X. Wu, and G. Bai, ‘‘Parallel polynomial multiplication optimized Comput. Conf. (HPEC), Sep. 2014, pp. 1–6.
scheme for CRYSTALS-KYBER post-quantum cryptosystem based on [130] A. S. Özcan, C. Ayduman, E. R. Türkoglu, and E. Savas, ‘‘Homomorphic
FPGA,’’ in Proc. Int. Conf. Commun., Inf. Syst. Comput. Eng. (CISCE), encryption on GPU,’’ IEEE Access, early access, Apr. 7, 2023, doi:
May 2021, pp. 361–365. 10.1109/ACCESS.2023.3265583.
ARDIANTO SATRIAWAN received the B.S. and M.S. degrees in electrical engineering from Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 2013 and 2015, respectively. He is a member of the Computer Engineering Research Group, School of Electrical Engineering and Informatics, ITB. His research interests include virtual reality, machine learning, computer networks, and information security.

INFALL SYAFALNI received the B.Eng. degree in electrical engineering from Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 2008, the M.Sc. degree in electronic engineering from the University of Science Malaysia (USM), Penang, Malaysia, in 2011, and the Dr.Eng. degree in engineering from the Kyushu Institute of Technology (KIT), Iizuka, Fukuoka, Japan, in 2014. From 2014 to 2015, he held a research position with KIT. From 2015 to 2018, he was an ASIC Engineer with the ASIC Development Group, Logic Research Company Ltd., Fukuoka, Japan. In 2019, he joined ITB, where he is currently an Assistant Professor with the School of Electrical Engineering and Informatics and a Researcher with the University Center of Excellence on Microelectronics. His research interests include logic synthesis, logic design, VLSI design, and efficient circuits and algorithms.

RELLA MARETA (Graduate Student Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 2011 and 2014, respectively. She is currently pursuing the Ph.D. degree with the Digital Integrated Systems Laboratory, Inha University.

ISA ANSHORI (Member, IEEE) received the B.S. degree in engineering physics from Institut Teknologi Bandung, Indonesia, in 2009, and the M.Eng. degree in materials science and the Ph.D. degree in nanoscience and nanotechnology from the University of Tsukuba, Japan, in 2015 and 2018, respectively. He has been an Assistant Professor with the Department of Biomedical Engineering, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, since 2018. His research interests include bio/chemical sensors, microfluidics, IoT devices, and lab-on-chip.

WERVYAN SHALANNANDA received the B.S. degree in telecommunications engineering and the M.S. degree in electrical engineering (telematics and telco networks) from the Bandung Institute of Technology, in 2013 and 2015, respectively. He joined the Bandung Institute of Technology, in 2016, as an Academic Assistant and then as a Lecturer, in 2018. His research interests include networked systems and security and artificial intelligence in telecommunications.

ALEAMS BARRA received the B.S. and M.S. degrees in mathematics from Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 1998 and 2002, respectively, and the Ph.D. degree in mathematics from the University of Kentucky, Kentucky, USA, in 2012. He is currently a member of the Algebra Research Group, Faculty of Mathematics and Natural Sciences, ITB.