Conceptual Review On Number Theoretic Transform and Comprehensive Review On Its Implementations

This document provides a conceptual review of the Number Theoretic Transform (NTT) and its implementations, emphasizing its significance in Post Quantum Cryptography (PQC) and Homomorphic Encryption (HE). The NTT is highlighted for its efficiency in polynomial multiplication, which is crucial for modern cryptographic systems, especially as quantum computing advances. The paper also summarizes various NTT implementation techniques across different platforms and discusses the challenges and requirements for using NTT in lattice-based cryptography.


Received 19 May 2023, accepted 8 July 2023, date of publication 11 July 2023, date of current version 14 July 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3294446

Conceptual Review on Number Theoretic Transform and Comprehensive Review on Its Implementations
ARDIANTO SATRIAWAN 1, INFALL SYAFALNI 1, RELLA MARETA 2 (Graduate Student Member, IEEE), ISA ANSHORI 1 (Member, IEEE), WERVYAN SHALANNANDA 1, AND ALEAMS BARRA 3
1 School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung 40132, Indonesia
2 Department of Information and Communication Engineering, Inha University, Incheon 22212, South Korea
3 Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung, Bandung 40132, Indonesia
Corresponding author: Ardianto Satriawan ([email protected])
This work was supported by the School of Electrical Engineering and Informatics, Institut Teknologi Bandung. The work of Infall Syafalni
was supported by the 2023 Bank Central Asia (BCA) Innovation Awards arranged by the Lembaga Pengembangan Inovasi dan
Kewirausahaan (LPIK)-Institut Teknologi Bandung (ITB).

ABSTRACT The Number Theoretic Transform (NTT) is a powerful mathematical tool that has become increasingly important in developing Post Quantum Cryptography (PQC) and Homomorphic Encryption (HE). Its ability to efficiently calculate polynomial multiplication using the convolution theorem, with a quasi-linear complexity of $O(n \log n)$ when implemented with Fast Fourier Transform-style algorithms, has made it a key component in modern cryptography. The FFT-style NTT algorithm, or fast-NTT, is particularly useful in lattice-based cryptography, which relies on the hardness of certain mathematical problems to ensure security. Its importance in these fields continues to grow as quantum computing technology advances and traditional encryption methods become vulnerable. In this report, we discuss the mathematical concepts of polynomial multiplication using NTT and provide a comprehensive review of the latest implementations and the state of the art of NTT in both PQC and HE schemes.

INDEX TERMS Number theoretic transform, post quantum cryptography, homomorphic encryption.

The associate editor coordinating the review of this manuscript and approving it for publication was Mohamad Afendee Mohamed.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 11, 2023
A. Satriawan et al.: Conceptual Review on NTT and Comprehensive Review on Its Implementations

I. INTRODUCTION
Most classical cryptosystems are based on the assumption that the prime factorization of a large integer is a computationally hard problem [1]. However, this assumption will no longer hold in the near future because of recent developments in quantum computing research. Quantum computers can factorize large integers exponentially faster than classical computers [2]. This vulnerability has led to the need for quantum-resistant cryptosystems. There are five families of post-quantum cryptosystems: lattice-based, hash-based, code-based, multivariate-based, and isogeny-based schemes [3]. Lattice-based cryptography is a promising and the most researched scheme due to its balance of computing complexity, communication bandwidth, and security [4].

The US National Institute of Standards and Technology (NIST) initiated an effort to standardize cryptographic algorithms that are resistant to attacks by quantum computers, called the Post-Quantum Cryptography (PQC) Competition, which started in 2016 and was finalized in 2022 [5]. Lattice-based cryptography is the most frequently proposed system, making up 26 of 64 in the first round [6], 12 of 16 in the second round [7], 7 out of 15 in the third round [8], and 3 out of 4 final standardized schemes [9]. Those three lattice-based standardized cryptosystems are Dilithium [10], [11], Falcon [12], and Kyber [13], [14], [15], [16].

The bottleneck of lattice-based cryptography implementations is their fundamental building block: modular polynomial multiplication, which is a very time-consuming operation [17]. Traditionally, it is computed by the schoolbook algorithm with a quadratic complexity of $O(n^2)$. However, other alternatives exist, such as the Karatsuba algorithm [18], [19], the Toom-Cook algorithm [20], [21], and the Discrete Fourier Transform (DFT)-based algorithm [22], [23], [24]. The Karatsuba algorithm applies the divide-and-conquer principle to reduce the complexity by dividing the original polynomial into two parts, resulting in $O(n^{\log_2 3})$, or about $O(n^{1.58})$ [25]. The Toom-Cook algorithm generalizes the Karatsuba algorithm by dividing into $k$ parts, giving $O(n^{\log_k(2k-1)})$ complexity [26].

The Discrete Fourier Transform (DFT) and its variant in the polynomial ring, the Number Theoretic Transform (NTT), can be utilized to multiply two polynomials via the convolution theorem [27], [28]. However, the classical algorithm to compute DFT or NTT is also $O(n^2)$. The fundamental difference between DFT and NTT is the ring they use to transform the polynomial. DFT uses a complex ring with a twiddle factor of $e^{-2\pi j/n}$, while NTT uses an integer polynomial ring with a twiddle factor of its $n$-th root of unity. Because it uses only integers, NTT is popular among researchers: there is no need to implement complicated schemes such as fixed-point or floating-point arithmetic architectures. This advantage also eliminates the precision problems that may arise from implementing such architectures [29].

Many optimized versions of DFT have been proposed in the past few decades due to their prominent use in signal and image processing. The most widely used fast algorithm is the Fast Fourier Transform (FFT), which Gauss first proposed in 1805 [30]. It gained widespread attention in the 1960s when Cooley-Tukey [23] and Gentleman-Sande [24] published their works, which gave their names to the well-known CT and GS butterfly architectures for FFT. The FFT has a quasilinear complexity of $O(n \log n)$, which gives a massive advantage over other methods, especially when calculating higher-degree polynomial multiplications. NTT is also a version of DFT, so one can apply FFT algorithms to calculate NTT [31].

However, using NTT also has limitations: it requires very specific parameters. Implementing FFT algorithms requires the array length $n$ to be a power of two; in other words, the polynomials need to have degree $2^k - 1$ [32]. It also only works with a specific prime modulus. Positive-wrapped convolution (PWC)-based NTT requires the prime modulus $q$ to have a primitive $n$-th root of unity in the ring $\mathbb{Z}_q$. Moreover, negative-wrapped convolution (NWC)-based NTT needs an additional $2n$-th root of unity [33].

Because of these parameter requirements, NTT is not always available in lattice-based cryptosystems. Of the three standardized PQC schemes, Dilithium and Falcon can apply both PWC-based and NWC-based NTT, while Kyber can only use PWC-based NTT due to its chosen parameters. In the other finalist schemes, NTRU and Saber, NTT cannot be used due to the power-of-two modulus and the chosen ring, respectively [33]. However, many researchers are working on workarounds to implement NTT on such systems using various mathematical techniques, such as Residue Number Systems (RNS) and the Chinese Remainder Theorem (CRT) [34], [35].

NTT is also important in Homomorphic Encryption (HE) schemes such as Brakerski-Fan-Vercauteren (BFV) [36], Brakerski-Gentry-Vaikuntanathan (BGV) [37], [38], and Cheon-Kim-Kim-Song (CKKS) [39], which are based on the Ring Learning With Errors (RLWE) problem. In the BFV and BGV schemes, NTT performs the modulus-switching operation to reduce the noise in the encrypted data. In the CKKS homomorphic encryption scheme, NTT performs the "relinearization" operation, which reduces the size of the ciphertexts after multiplication operations [36], [37], [38], [39]. Microsoft SEAL is one of the most prominent libraries implementing the aforementioned schemes [40], [41]. The noticeable difference between the PQC and HE schemes is the modulus size. While PQC schemes usually use a small modulus, HE schemes use a large one, which makes the implementation techniques vastly different between the two.

Most NTT implementation reports briefly introduce NTT and review recent literature. However, those reports focus on their own implementation techniques on various platforms and do not provide a comprehensive understanding of the NTT concepts. This motivates us to briefly introduce NTT concepts and summarize the state of the art of NTT implementations in the PQC and HE schemes. We summarize the contributions of our work as follows:
1) We briefly introduce the basic concepts of linear, cyclic, and negacyclic convolutions via traditional schoolbook algorithms, traditional NTT, and FFT-like versions of NTT. While other literature briefly introduces these concepts, the material is scattered and requires significant effort to learn, especially for those who are beginning research in the area and come from the implementation side.
2) We provide consistent toy examples through different concepts and algorithms to further enhance the conceptual understanding of the NTT. However, the focus of our report is the implementation of NTT. For the mathematical understanding of NTT, [33] provides a comprehensive conceptual explanation of the topic.
3) We summarize and provide a comprehensive review of the recent research on NTT implementations for PQC schemes on various platforms such as FPGA, ASIC, CPU, and GPU.
4) Similarly, we also summarize and provide a comprehensive review of NTT implementations for HE schemes, which are usually a combination of RNS and CRT.
We hope that our report provides researchers in relevant fields with a general understanding of NTT from the implementation point of view and also shows the state of the art of NTT implementations in various architectures.

The rest of the paper is organized as follows. Section II discusses the fundamental mathematical definitions and basic concepts of convolutions in polynomial rings. Section III explains convolutions based on the Number Theoretic Transform. Section IV explains FFT-like algorithms for calculating NTT. Section V reviews current research on NTT implementations in Post Quantum Cryptography (PQC) schemes. Section VI reviews current research on NTT implementations in Homomorphic Encryption (HE) schemes. Finally, Section VII concludes the paper and discusses possible future works.

II. PRELIMINARIES: SCHOOLBOOK CONVOLUTIONS
This section briefly explains the definitions of linear, cyclic, and negacyclic convolutions between polynomials with integer coefficients to show their basic concepts and differences. We also provide simple and consistent toy examples throughout the section to clarify how the different concepts work. In this section, we assume the modulus, $q$, is large enough so that the arithmetic calculations do not cause integer overflows.

A. POLYNOMIAL MULTIPLICATION AND LINEAR CONVOLUTION
Definition 2.1: Suppose that $G(x)$ and $H(x)$ are polynomials of degree $n-1$ in the ring $\mathbb{Z}_q[x]$, where $q \in \mathbb{Z}$ and $x$ is the polynomial variable. The polynomial multiplication of $G(x)$ and $H(x)$ is defined as:

$$Y(x) = G(x) \cdot H(x) = \sum_{k=0}^{2(n-1)} y_k x^k \tag{1}$$

where $y_k = \sum_{i=0}^{k} g_i h_{k-i} \bmod q$, and $g$ and $h$ are the polynomial coefficients of $G(x)$ and $H(x)$, respectively (coefficients with an index outside $0, \ldots, n-1$ are taken as zero).

Polynomial multiplication is equivalent to a discrete linear convolution between the coefficient vectors $g$ and $h$ [42]:

$$y[k] = (g * h)[k] = \sum_{i=0}^{k} g[i]\, h[k-i] \tag{2}$$

Example 2.1: Let $G(x) = 1 + 2x + 3x^2 + 4x^3$ and $H(x) = 5 + 6x + 7x^2 + 8x^3$, or in vector notation $g = [1, 2, 3, 4]$ and $h = [5, 6, 7, 8]$. The result of the linear convolution is $Y(x) = 5 + 16x + 34x^2 + 60x^3 + 61x^4 + 52x^5 + 32x^6$, or $y = [5, 16, 34, 60, 61, 52, 32]$.

Figure 1 shows the schoolbook method by which a typical polynomial multiplication or linear convolution is done. This traditional multiplication algorithm has $O(n^2)$ complexity. The algorithm is available in many mathematical programming libraries, such as MATLAB's conv [43] and NumPy's convolve [44], with integer array inputs combined with modular arithmetic operations.

FIGURE 1. Schoolbook method for polynomial multiplication or linear convolution.

B. CYCLIC CONVOLUTION
Definition 2.2: Suppose that $G(x)$ and $H(x)$ are polynomials of degree $n-1$ in the quotient ring $\mathbb{Z}_q[x]/(x^n - 1)$, where $q \in \mathbb{Z}$. A cyclic convolution or positive-wrapped convolution, $PWC(x)$, is defined as:

$$PWC(x) = \sum_{k=0}^{n-1} c_k x^k \tag{3}$$

where $c_k = \sum_{i=0}^{k} g_i h_{k-i} + \sum_{i=k+1}^{n-1} g_i h_{k+n-i} \bmod q$. If $Y(x)$ is the result of their linear convolution in the ring $\mathbb{Z}_q[x]$, it can also be defined as

$$PWC(x) = Y(x) \bmod (x^n - 1) \tag{4}$$

The traditional schoolbook method to calculate a cyclic convolution is a polynomial multiplication, as shown in Example 2.1, followed by a long division. The method has $O(n^2)$ complexity.

Example 2.2: Let $G(x) = 1 + 2x + 3x^2 + 4x^3$ and $H(x) = 5 + 6x + 7x^2 + 8x^3$, or in vector notation $g = [1, 2, 3, 4]$ and $h = [5, 6, 7, 8]$. The result of the cyclic convolution is $PWC(x) = 66 + 68x + 66x^2 + 60x^3$, or $[66, 68, 66, 60]$.

Figure 2 shows how schoolbook long division is used to calculate a cyclic convolution, with the linear convolution result of $G(x)$ and $H(x)$ as the dividend. The remainder of the long division algorithm is the cyclic convolution result. Notice that we present the result sorted in increasing power in Example 2.2.

FIGURE 2. Schoolbook method for positively wrapped modular polynomial multiplication or cyclic convolution.

The MATLAB function cconv [45] can calculate a cyclic convolution using integer array inputs and modular arithmetic operations. Notice that the result of cyclic convolution, unlike linear convolution, has a length of $n$ instead of $2n - 1$.

C. NEGACYCLIC CONVOLUTION
Definition 2.3: Suppose that $G(x)$ and $H(x)$ are polynomials of degree $n-1$ in the quotient ring $\mathbb{Z}_q[x]/(x^n + 1)$, where $q \in \mathbb{Z}$. A negacyclic convolution or negative-wrapped convolution, $NWC(x)$, is defined as:

$$NWC(x) = \sum_{k=0}^{n-1} c_k x^k \tag{5}$$

where $c_k = \sum_{i=0}^{k} g_i h_{k-i} - \sum_{i=k+1}^{n-1} g_i h_{k+n-i} \bmod q$. If $Y(x)$ is the result of their linear convolution in the ring $\mathbb{Z}_q[x]$, it can also be defined as

$$NWC(x) = Y(x) \bmod (x^n + 1) \tag{6}$$

Example 2.3: Let $G(x) = 1 + 2x + 3x^2 + 4x^3$ and $H(x) = 5 + 6x + 7x^2 + 8x^3$, or in vector notation $g = [1, 2, 3, 4]$ and $h = [5, 6, 7, 8]$. The result of the negacyclic convolution is $NWC(x) = -56 - 36x + 2x^2 + 60x^3$, or $[-56, -36, 2, 60]$.

Figure 3 shows how schoolbook long division calculates a negacyclic convolution as the remainder of the division.

FIGURE 3. Schoolbook method for negatively wrapped modular polynomial multiplication or negacyclic convolution.

Note that the only difference between cyclic and negacyclic convolution is the divisor. The cyclic convolution uses $x^n - 1$, while the negacyclic convolution uses $x^n + 1$.

Those schoolbook algorithms have $O(n^2)$ complexity. Many efforts have been made to reduce their complexity by dividing the multiplier and multiplicand into several parts [18], [19], [20], [21] or by parallelizing the algorithm on the implementation side [46]. However, those efforts are not scalable as the polynomial degree grows higher.

III. NTT-BASED CONVOLUTIONS
In this section, we present the basics of NTT-based convolutions. Many researchers do not differentiate between the term NTT and the FFT-based algorithms used to calculate NTT, which creates confusion when understanding the topic. This report refers to the transformation itself as NTT and to the FFT-like algorithms as fast-NTT, which are explained in Section IV. The classical NTT has a quadratic complexity of $O(n^2)$ when computed directly, while fast-NTT algorithms have a more efficient quasi-linear complexity of $O(n \log n)$.

A. PRIMITIVE n-TH ROOT OF UNITY
Definition 3.1: Let $\mathbb{Z}_q$ be the integer ring modulo $q$, and let $n-1$ be the polynomial degree of $G(x)$ and $H(x)$. Such a ring has a multiplicative identity (unity) of 1. Define $\omega$ as a primitive $n$-th root of unity in $\mathbb{Z}_q$ if and only if:

$$\omega^n \equiv 1 \bmod q \tag{7}$$

and

$$\omega^k \not\equiv 1 \bmod q \tag{8}$$

for $1 \le k < n$.

One thing to note is that the primitive $n$-th root of unity in a ring $\mathbb{Z}_q$ might not be unique. We show the following example for $q = 7681$, used in Kyber in Rounds 1 and 2 of the NIST-PQC Competition [13], [15]; however, in our toy example we use $n = 4$ instead of $n = 256$.

Example 3.1: In the ring $\mathbb{Z}_{7681}$ with $n = 4$, the 4-th roots of unity other than the trivial root 1 that satisfy $\omega^4 \equiv 1 \bmod 7681$ are $\{3383, 4298, 7680\}$. Of these three roots, 7680 is not a primitive $n$-th root of unity, as there exists $k = 2 < n$ that satisfies $\omega^2 \equiv 1 \bmod 7681$. Therefore $\omega = 3383$ and $\omega = 4298$ are the primitive 4-th roots of unity in $\mathbb{Z}_{7681}$.

The value of $\omega$ will be important in calculating NTT and positive-wrapped convolution. Calculating the $\omega$ of a ring with a large modulus $q$ by hand is tricky and tedious. One library that provides a function to calculate $\omega$ is SymPy, via the function nthroot_mod [47].

B. NTT-BASED POSITIVE-WRAPPED CONVOLUTION
This section explains the definition of the Number Theoretic Transform (NTT) and its inverse (INTT) based on the $n$-th root of unity, $\omega$. The NTT of a polynomial does not have any physical meaning, unlike the Discrete Fourier Transform (DFT), which represents a signal in the frequency domain. However, NTT preserves one of the important properties of DFT: the convolution theorem, which is valuable in calculating polynomial multiplication.

1) NUMBER THEORETIC TRANSFORM BASED ON ω
Definition 3.2: The Number Theoretic Transform (NTT) of a vector of polynomial coefficients $a$ is defined as $\hat{a} = \mathrm{NTT}(a)$, where:

$$\hat{a}_j = \sum_{i=0}^{n-1} \omega^{ij} a_i \bmod q \tag{9}$$

and $j = 0, 1, 2, \ldots, n-1$.

Example 3.2: Let $G(x) = 1 + 2x + 3x^2 + 4x^3$, or in vector notation $g = [1, 2, 3, 4]$. We can infer that $n = 4$. Suppose we work in the ring $\mathbb{Z}_{7681}$ and $\omega$ is its primitive $n$-th root of unity. The NTT of $g$, $\hat{g}$, can be calculated by the following matrix multiplication:

$$\hat{g} = \begin{bmatrix} \omega^{0\times0} & \omega^{0\times1} & \omega^{0\times2} & \omega^{0\times3} \\ \omega^{1\times0} & \omega^{1\times1} & \omega^{1\times2} & \omega^{1\times3} \\ \omega^{2\times0} & \omega^{2\times1} & \omega^{2\times2} & \omega^{2\times3} \\ \omega^{3\times0} & \omega^{3\times1} & \omega^{3\times2} & \omega^{3\times3} \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}$$

Notice that the power of $\omega$ is the product of the row and column numbers. As $\omega$ is an $n$-th root of unity, $\omega^k = \omega^{k \bmod n}$ for $k \ge n$. Thus:

$$\hat{g} = \begin{bmatrix} \omega^0 & \omega^0 & \omega^0 & \omega^0 \\ \omega^0 & \omega^1 & \omega^2 & \omega^3 \\ \omega^0 & \omega^2 & \omega^4 & \omega^6 \\ \omega^0 & \omega^3 & \omega^6 & \omega^9 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} \omega^0 & \omega^0 & \omega^0 & \omega^0 \\ \omega^0 & \omega^1 & \omega^2 & \omega^3 \\ \omega^0 & \omega^2 & \omega^0 & \omega^2 \\ \omega^0 & \omega^3 & \omega^2 & \omega^1 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}$$

From Example 3.1, one of the primitive $n$-th roots of unity in $\mathbb{Z}_{7681}$ is $\omega = 3383$. Substituting into the equation:

$$\hat{g} = \begin{bmatrix} 3383^0 & 3383^0 & 3383^0 & 3383^0 \\ 3383^0 & 3383^1 & 3383^2 & 3383^3 \\ 3383^0 & 3383^2 & 3383^0 & 3383^2 \\ 3383^0 & 3383^3 & 3383^2 & 3383^1 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 3383 & 7680 & 4298 \\ 1 & 7680 & 1 & 7680 \\ 1 & 4298 & 7680 & 3383 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 10 \\ 913 \\ 7679 \\ 6764 \end{bmatrix}$$

Therefore, $\mathrm{NTT}(g) = [10, 913, 7679, 6764]$ in $\mathbb{Z}_{7681}$.

Example 3.3: Let $H(x) = 5 + 6x + 7x^2 + 8x^3$, or in vector notation $h = [5, 6, 7, 8]$, in the ring $\mathbb{Z}_{7681}$ with $\omega = 3383$. The NTT of $h$ is:

$$\hat{h} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 3383 & 7680 & 4298 \\ 1 & 7680 & 1 & 7680 \\ 1 & 4298 & 7680 & 3383 \end{bmatrix} \begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix} = \begin{bmatrix} 26 \\ 913 \\ 7679 \\ 6764 \end{bmatrix}$$

Therefore, $\mathrm{NTT}(h) = [26, 913, 7679, 6764]$ in $\mathbb{Z}_{7681}$.

Note that the NTT of a particular polynomial is not always unique. It depends on the choice of $\omega$. The NTT results of Examples 3.2 and 3.3 will differ if one uses $\omega = 4298$ instead of $\omega = 3383$.

2) INVERSE NUMBER THEORETIC TRANSFORM BASED ON ω
Definition 3.3: The Inverse Number Theoretic Transform (INTT) of an NTT vector $\hat{a}$ is defined as $a = \mathrm{INTT}(\hat{a})$, where:

$$a_i = n^{-1} \sum_{j=0}^{n-1} \omega^{-ij} \hat{a}_j \bmod q \tag{10}$$

and $i = 0, 1, 2, \ldots, n-1$.

Note that the INTT has a very similar formula to the NTT. The only differences are that $\omega$ is replaced by its inverse in $\mathbb{Z}_q$ and that there is an $n^{-1}$ scaling factor. It always holds that $a = \mathrm{INTT}(\mathrm{NTT}(a))$.

Example 3.4: Given $\mathrm{NTT}(g) = \hat{g} = [10, 913, 7679, 6764]$ in $\mathbb{Z}_{7681}$ and $\omega = 3383$, we can calculate the inverse of $\omega$, $\omega^{-1} = 4298$, and the scaling factor $n^{-1} = 5761$. One can calculate $\mathrm{INTT}(\hat{g})$ by the following matrix multiplication:

$$g = n^{-1} \begin{bmatrix} \omega^{-0\times0} & \omega^{-0\times1} & \omega^{-0\times2} & \omega^{-0\times3} \\ \omega^{-1\times0} & \omega^{-1\times1} & \omega^{-1\times2} & \omega^{-1\times3} \\ \omega^{-2\times0} & \omega^{-2\times1} & \omega^{-2\times2} & \omega^{-2\times3} \\ \omega^{-3\times0} & \omega^{-3\times1} & \omega^{-3\times2} & \omega^{-3\times3} \end{bmatrix} \begin{bmatrix} 10 \\ 913 \\ 7679 \\ 6764 \end{bmatrix} = n^{-1} \begin{bmatrix} \omega^{0} & \omega^{0} & \omega^{0} & \omega^{0} \\ \omega^{0} & \omega^{-1} & \omega^{-2} & \omega^{-3} \\ \omega^{0} & \omega^{-2} & \omega^{-0} & \omega^{-2} \\ \omega^{0} & \omega^{-3} & \omega^{-2} & \omega^{-1} \end{bmatrix} \begin{bmatrix} 10 \\ 913 \\ 7679 \\ 6764 \end{bmatrix}$$

$$g = 5761 \begin{bmatrix} 4298^0 & 4298^0 & 4298^0 & 4298^0 \\ 4298^0 & 4298^1 & 4298^2 & 4298^3 \\ 4298^0 & 4298^2 & 4298^0 & 4298^2 \\ 4298^0 & 4298^3 & 4298^2 & 4298^1 \end{bmatrix} \begin{bmatrix} 10 \\ 913 \\ 7679 \\ 6764 \end{bmatrix} = 5761 \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 4298 & 7680 & 3383 \\ 1 & 7680 & 1 & 7680 \\ 1 & 3383 & 7680 & 4298 \end{bmatrix} \begin{bmatrix} 10 \\ 913 \\ 7679 \\ 6764 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}$$

Therefore, $g = [1, 2, 3, 4]$, which is the initial vector of polynomial coefficients given in Example 3.2.

Example 3.5: Given $\mathrm{NTT}(h) = \hat{h} = [26, 913, 7679, 6764]$ in $\mathbb{Z}_{7681}$ and $\omega = 3383$, we can calculate the INTT in the same way as in the previous example:

$$h = 5761 \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 4298 & 7680 & 3383 \\ 1 & 7680 & 1 & 7680 \\ 1 & 3383 & 7680 & 4298 \end{bmatrix} \begin{bmatrix} 26 \\ 913 \\ 7679 \\ 6764 \end{bmatrix} = \begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix}$$

Therefore, $h = [5, 6, 7, 8]$, which is the initial vector of polynomial coefficients given in Example 3.3.

3) USING NTT TO CALCULATE POSITIVE-WRAPPED CONVOLUTIONS
Because NTT is a variant of DFT in the polynomial ring, one can apply DFT's convolution theorem to calculate positive-wrapped convolution [27], [28]:

Proposition 3.1: Let $a$ and $b$ be the multiplicands' vectors of polynomial coefficients. The positive-wrapped convolution of $a$ and $b$, $c$, can be calculated by:

$$c = \mathrm{INTT}(\mathrm{NTT}(a) \circ \mathrm{NTT}(b)) \tag{11}$$

where $\circ$ is an element-wise vector multiplication in $\mathbb{Z}_q$.

Example 3.6: Let $g = [1, 2, 3, 4]$ and $h = [5, 6, 7, 8]$. From Examples 3.2 and 3.3, we know that their NTTs in $\mathbb{Z}_{7681}$ are $\hat{g} = [10, 913, 7679, 6764]$ and $\hat{h} = [26, 913, 7679, 6764]$ when $\omega = 3383$. We can calculate their positive-wrapped convolution by:

$$\mathrm{INTT}\left( \begin{bmatrix} 10 \\ 913 \\ 7679 \\ 6764 \end{bmatrix} \circ \begin{bmatrix} 26 \\ 913 \\ 7679 \\ 6764 \end{bmatrix} \right) = \mathrm{INTT}\left( \begin{bmatrix} 260 \\ 4021 \\ 4 \\ 3660 \end{bmatrix} \right) = 5761 \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 4298 & 7680 & 3383 \\ 1 & 7680 & 1 & 7680 \\ 1 & 3383 & 7680 & 4298 \end{bmatrix} \begin{bmatrix} 260 \\ 4021 \\ 4 \\ 3660 \end{bmatrix} = \begin{bmatrix} 66 \\ 68 \\ 66 \\ 60 \end{bmatrix}$$

Therefore, their positive-wrapped convolution is $[66, 68, 66, 60]$, the same result as calculated by schoolbook multiplication and long division in Example 2.2.

While positive-wrapped convolution, commonly known as cyclic convolution, is useful, its implementation is primarily outside the cryptography domain. One such example is the implementation of the Schönhage-Strassen algorithm [48] for large integer multiplication. However, in the context of PQC and HE, the chosen ring is mostly $\mathbb{Z}_q[x]/(x^n + 1)$ instead of $\mathbb{Z}_q[x]/(x^n - 1)$. In such rings, one must calculate polynomial multiplications via the negative-wrapped convolution.

C. PRIMITIVE 2n-TH ROOT OF UNITY
To calculate negative-wrapped convolution, one needs the primitive $2n$-th root of unity, $\psi$.

Definition 3.4: Let $\mathbb{Z}_q$ be the integer ring modulo $q$, let $n-1$ be the polynomial degree of $G(x)$ and $H(x)$, and let $\omega$ be its primitive $n$-th root of unity. Define $\psi$ as the primitive $2n$-th root of unity if and only if:

$$\psi^2 \equiv \omega \bmod q \tag{12}$$

and

$$\psi^n \equiv -1 \bmod q \tag{13}$$

Example 3.7: In the ring $\mathbb{Z}_{7681}$ with $n = 4$, when $\omega = 3383$, the value of $\psi$ can be 1925 or 5756, as $1925^2 \equiv 5756^2 \equiv 3383 \bmod 7681$ and $1925^4 \equiv 5756^4 \equiv 7680 \equiv -1 \bmod 7681$. Therefore, one can choose the value $\psi = 1925$ or $\psi = 5756$.

D. NTT-BASED NEGATIVE-WRAPPED CONVOLUTION
This section explains the definition of the Number Theoretic Transform (NTT) and its inverse (INTT) based on the $2n$-th root of unity, $\psi$, and how to utilize them to calculate negative-wrapped or negacyclic convolution.

1) NUMBER THEORETIC TRANSFORM BASED ON ψ
Definition 3.5: The Negative-Wrapped Number Theoretic Transform ($\mathrm{NTT}^{\psi}$) of a vector of polynomial coefficients $a$ is defined as $\hat{a} = \mathrm{NTT}^{\psi}(a)$, where:

$$\hat{a}_j = \sum_{i=0}^{n-1} \psi^{i} \omega^{ij} a_i \bmod q \tag{14}$$

and $j = 0, 1, 2, \ldots, n-1$. As $\psi^2 \equiv \omega \bmod q$, we can substitute $\omega = \psi^2$ into Equation (14):

$$\hat{a}_j = \sum_{i=0}^{n-1} \psi^{2ij+i} a_i \bmod q \tag{15}$$

Example 3.8: Let $g = [1, 2, 3, 4]$, $n = 4$, and $\psi = 1925$ in the ring $\mathbb{Z}_{7681}$. $\mathrm{NTT}^{\psi}(g) = \hat{g}$ can be calculated by the following matrix multiplication:

$$\hat{g} = \begin{bmatrix} \psi^{2(0\times0)+0} & \psi^{2(0\times1)+1} & \psi^{2(0\times2)+2} & \psi^{2(0\times3)+3} \\ \psi^{2(1\times0)+0} & \psi^{2(1\times1)+1} & \psi^{2(1\times2)+2} & \psi^{2(1\times3)+3} \\ \psi^{2(2\times0)+0} & \psi^{2(2\times1)+1} & \psi^{2(2\times2)+2} & \psi^{2(2\times3)+3} \\ \psi^{2(3\times0)+0} & \psi^{2(3\times1)+1} & \psi^{2(3\times2)+2} & \psi^{2(3\times3)+3} \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}$$

$$\hat{g} = \begin{bmatrix} \psi^0 & \psi^1 & \psi^2 & \psi^3 \\ \psi^0 & \psi^3 & \psi^6 & \psi^9 \\ \psi^0 & \psi^5 & \psi^{10} & \psi^{15} \\ \psi^0 & \psi^7 & \psi^{14} & \psi^{21} \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} \psi^0 & \psi^1 & \psi^2 & \psi^3 \\ \psi^0 & \psi^3 & \psi^6 & \psi^1 \\ \psi^0 & \psi^5 & \psi^2 & \psi^7 \\ \psi^0 & \psi^7 & \psi^6 & \psi^5 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}$$

$$\hat{g} = \begin{bmatrix} 1925^0 & 1925^1 & 1925^2 & 1925^3 \\ 1925^0 & 1925^3 & 1925^6 & 1925^1 \\ 1925^0 & 1925^5 & 1925^2 & 1925^7 \\ 1925^0 & 1925^7 & 1925^6 & 1925^5 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 & 1925 & 3383 & 6468 \\ 1 & 6468 & 4298 & 1925 \\ 1 & 5756 & 3383 & 1213 \\ 1 & 1213 & 4298 & 5756 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1467 \\ 2807 \\ 3471 \\ 7621 \end{bmatrix}$$

Therefore, $\mathrm{NTT}^{\psi}(g) = [1467, 2807, 3471, 7621]$ when $\psi = 1925$ in $\mathbb{Z}_{7681}$.

Example 3.9: Let $h = [5, 6, 7, 8]$, $n = 4$, and $\psi = 1925$ in the ring $\mathbb{Z}_{7681}$. $\mathrm{NTT}^{\psi}(h) = \hat{h}$ can be calculated similarly by the following matrix multiplication:

$$\hat{h} = \begin{bmatrix} 1 & 1925 & 3383 & 6468 \\ 1 & 6468 & 4298 & 1925 \\ 1 & 5756 & 3383 & 1213 \\ 1 & 1213 & 4298 & 5756 \end{bmatrix} \begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix} = \begin{bmatrix} 2489 \\ 7489 \\ 6478 \\ 6607 \end{bmatrix}$$

Therefore, $\mathrm{NTT}^{\psi}(h) = [2489, 7489, 6478, 6607]$.

2) INVERSE NUMBER THEORETIC TRANSFORM BASED ON ψ
Definition 3.6: The Negative-Wrapped Inverse Number Theoretic Transform ($\mathrm{INTT}^{\psi^{-1}}$) of an NTT vector $\hat{a}$ is defined as $a = \mathrm{INTT}^{\psi^{-1}}(\hat{a})$, where:

$$a_i = n^{-1} \psi^{-i} \sum_{j=0}^{n-1} \omega^{-ij} \hat{a}_j \bmod q \tag{16}$$

and $i = 0, 1, 2, \ldots, n-1$. Substituting $\omega = \psi^2$ yields:

$$a_i = n^{-1} \sum_{j=0}^{n-1} \psi^{-(2ij+i)} \hat{a}_j \bmod q \tag{17}$$
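The negative-wrapped transform pair can be checked numerically by evaluating the defining sums directly. The following is a minimal sketch, not the authors' implementation: the names `ntt_psi` and `intt_psi` are hypothetical, the parameters are the toy values q = 7681 and ψ = 1925 used in the examples, and the cost is the quadratic one of the direct definition. The inverse is written with the exponent −(2ij + i), where i is the output index, which is the form for which the round trip returns the original vector.

```python
Q = 7681    # toy modulus used throughout the examples
PSI = 1925  # primitive 2n-th root of unity for n = 4 (Example 3.7)


def ntt_psi(a, psi=PSI, q=Q):
    """Negative-wrapped NTT by direct summation: a_hat[j] = sum_i psi^(2ij+i) * a[i]."""
    n = len(a)
    return [sum(pow(psi, 2 * i * j + i, q) * a[i] for i in range(n)) % q
            for j in range(n)]


def intt_psi(a_hat, psi=PSI, q=Q):
    """Inverse transform: a[i] = n^-1 * sum_j psi^-(2ij+i) * a_hat[j]."""
    n = len(a_hat)
    n_inv = pow(n, -1, q)      # modular inverse, needs Python 3.8+
    psi_inv = pow(psi, -1, q)  # 1213 for psi = 1925
    return [n_inv * sum(pow(psi_inv, 2 * i * j + i, q) * a_hat[j]
                        for j in range(n)) % q
            for i in range(n)]


g_hat = ntt_psi([1, 2, 3, 4])
h_hat = ntt_psi([5, 6, 7, 8])
print(g_hat)            # [1467, 2807, 3471, 7621]
print(h_hat)            # [2489, 7489, 6478, 6607]
print(intt_psi(g_hat))  # [1, 2, 3, 4]
print(intt_psi(h_hat))  # [5, 6, 7, 8]
```

Three-argument `pow` keeps every intermediate value reduced modulo q, so the sketch also works for moduli far larger than the toy value.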

Note that the differences between NTTψ and INTTψ are the their negative-wrapped convolution by:
scaling factor n−1 , the replacement of ψ by ψ −1 , and the 
1467
 
2489

transpose of the exponents of ψ matrix. 2807 7489
Example 3.10: Let NTTψ (g) = ĝ = [1467, 2807, 3471, INTT( 3471 ◦ 6478)
  
7621] and ψ = 1925 in the ring Z7681 . Note that ψ −1 = 7621 6607
1213 and n−1 = 5761. The vector g can be calculated by the  
2888
following matrix multiplication: 6407
= INTT( 2851)

ψ ψ ψ ψ −0
 −0 −0 −0  
1467 2992
ψ −1 ψ −3 ψ −5 ψ −7  2807     
g = n−1 ψ −2 ψ −6 ψ −10 ψ −14  3471
  1 1 1 1 2888 7625
1213 5756 6468 1925 6407 7645
ψ −3 ψ −9 ψ −15 ψ −21 7621 = 5761 4298 3383 4298 3383 2851 =  2 
   

ψ ψ ψ ψ
 −0 −0 −0 −0   
1467 5756 1213 1925 6468 2992 60
ψ −1 ψ −3 ψ −5 ψ −7  2807
g = n−1  Therefore, [7625, 7645, 2, 60] – or when written with
ψ −2 ψ −6 ψ −2 ψ −6  3471
 
negative numbers [−56, −36, 2, 60] is their negacyclic con-
ψ −3 ψ −1 ψ −7 ψ −5 7621 volution, the same result as calculated by schoolbook multi-
1213 1213 1213 12130
0 0 0 plication and long division in Example 2.3
  
1467
12131 12133 12135 12137  2807
g = 5761  
12132 12136 12132 12136  3471

E. THE CHOICE OF MODULUS
12133 12131 12137 12135 7621 To make NTT transformation available, the modulus q has to
     satisfy the following requirements:
1 1 1 1 1467 1
1213 5756 6468 1925 2807 2 1) The n-th root of unity ω exists in ring Zq . The existence
g = 5761 4298 3383 4298 3383 3471 = 3
    of ω enables one to utilize NTT to perform positive-
5756 1213 1925 6468 7621 4 wrapped convolutions.
2) Furthermore, the 2n-th root of unity ψ exists in ring Zq
Therefore g = [1, 2, 3, 4].
to make negative-wrapped convolutions work.
Example 3.11: Let NTTψ(h) = ĥ = [2489, 7489, 6478, 6607] and ψ = 1925 in the ring $Z_{7681}$. The vector h can be calculated by the following matrix multiplication:

\[
h = 5761
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1213 & 5756 & 6468 & 1925 \\
4298 & 3383 & 4298 & 3383 \\
5756 & 1213 & 1925 & 6468
\end{bmatrix}
\begin{bmatrix} 2489 \\ 7489 \\ 6478 \\ 6607 \end{bmatrix}
=
\begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix}
\]

Therefore, h = [5, 6, 7, 8].

3) USING NTTψ TO CALCULATE NEGATIVE-WRAPPED CONVOLUTIONS
Like its positive-wrapped version, the negative-wrapped NTT can evaluate negative-wrapped convolutions, commonly referred to as negacyclic convolutions.
Proposition 3.2: Let a and b be the coefficient vectors of the multiplicand polynomials. The negative-wrapped convolution of a and b, denoted c, can be calculated by:

\[
c = \mathrm{INTT}^{\psi^{-1}}\left(\mathrm{NTT}^{\psi}(a) \circ \mathrm{NTT}^{\psi}(b)\right) \tag{18}
\]

where ◦ is an element-wise vector multiplication in $Z_q$.
Example 3.12: Let g = [1, 2, 3, 4] and h = [5, 6, 7, 8]. From Examples 3.8 and 3.9, we know that their NTTψ in $Z_{7681}$ are ĝ = [1467, 2807, 3471, 7621] and ĥ = [2489, 7489, 6478, 6607] when ψ = 1925. We can calculate

The modulus q has to satisfy the following theorem to guarantee that ω exists [27], [29], [49]:
Theorem 3.1: If q is prime, then n must divide q − 1. If q is composite such that:

\[
q = q_1^{m_1} \cdot q_2^{m_2} \cdot q_3^{m_3} \cdots q_k^{m_k}
\]

then n must divide the greatest common divisor (GCD) of $(q_1 - 1, q_2 - 1, q_3 - 1, \ldots, q_k - 1)$.
However, while Theorem 3.1 guarantees the existence of ω, it does not guarantee the existence of ψ. To guarantee the existence of ψ in $Z_q$:
Theorem 3.2: If q is prime, then 2n must divide q − 1. If q is composite such that:

\[
q = q_1^{m_1} \cdot q_2^{m_2} \cdot q_3^{m_3} \cdots q_k^{m_k}
\]

then 2n must divide the greatest common divisor (GCD) of $(q_1 - 1, q_2 - 1, q_3 - 1, \ldots, q_k - 1)$.
Many researchers have proposed various moduli that might satisfy the requirements, such as Mersenne [27] and Fermat [50] prime numbers. Here we define an NTT-friendly modulus based on the type of convolution it can perform:
Definition 3.7: A modulus q is PWC-NTT friendly if and only if an n-th root of unity, ω, exists in $Z_q$.
Definition 3.8: A modulus q is NWC-NTT friendly if and only if both an n-th root of unity, ω, and a 2n-th root of unity, ψ, exist in $Z_q$.
In the schemes proposed for the NIST-PQC competition, the values of n and q are standardized. Table 1 summarizes the schemes and their NTT-friendliness.
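The condition of Theorem 3.2 for a prime modulus can be checked mechanically. The following Python sketch is illustrative only: the helper names `is_prime`, `is_nwc_ntt_friendly`, and `find_psi` are hypothetical, n is assumed to be a power of two, and the brute-force search is meant for small moduli rather than as an efficient method.

```python
def is_prime(q):
    """Trial-division primality test, adequate for small moduli."""
    if q < 2:
        return False
    i = 2
    while i * i <= q:
        if q % i == 0:
            return False
        i += 1
    return True


def is_nwc_ntt_friendly(n, q):
    """Theorem 3.2 for a prime q: a 2n-th root of unity exists iff 2n divides q - 1."""
    return is_prime(q) and (q - 1) % (2 * n) == 0


def find_psi(n, q):
    """Search for a primitive 2n-th root of unity modulo a prime q (n a power of two)."""
    if not is_nwc_ntt_friendly(n, q):
        return None
    for x in range(2, q):
        psi = pow(x, (q - 1) // (2 * n), q)
        if pow(psi, n, q) == q - 1:  # psi^n = -1, so the order of psi is exactly 2n
            return psi
    return None


# (n, q) pairs taken from the examples and schemes discussed in this paper.
for label, n, q in [("toy example", 4, 7681), ("Dilithium", 256, 8380417),
                    ("Falcon", 1024, 12289), ("Kyber", 256, 3329),
                    ("Kyber, n halved", 128, 3329)]:
    print(label, n, q, is_nwc_ntt_friendly(n, q))
```

Under these assumptions, the check reports q = 3329 as not NWC-NTT friendly for n = 256 but friendly for n = 128, which is the situation handled by the truncated NTT discussed later.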

70294 VOLUME 11, 2023


A. Satriawan et al.: Conceptual Review on NTT and Comprehensive Review on Its Implementations

TABLE 1. The values of n and q of standardized NIST-PQC scheme.

In the context of Post-Quantum Cryptography and Homomorphic Encryption, the terms "Number Theoretic Transform" and "convolution" most often refer to their negacyclic, or negative-wrapped, versions. Therefore, for the rest of the report, the terms "NTT", "INTT", and "convolution" refer to their negative-wrapped versions.

IV. FAST NTT
To reduce the complexity and speed up the matrix multiplication needed for the NTT transformation, one can use "divide and conquer" techniques by utilizing the periodicity and symmetry properties of ψ:

\[
\text{periodicity: } \psi^{k+2n} = \psi^{k} \tag{19}
\]
\[
\text{symmetry: } \psi^{k+n} = -\psi^{k} \tag{20}
\]

where k is a non-negative integer. The calculation of an n-point NTT or INTT can be divided into two n/2-point calculations. However, the dividing techniques for NTT and INTT are slightly different.

A. COOLEY-TUKEY (CT) ALGORITHM FOR FAST-NTT
From equation (15), one can separate the summation into two parts based on the parity of the summation index:

\[
\begin{aligned}
\hat{a}_j &= \sum_{i=0}^{n-1} \psi^{2ij+i} a_i \bmod q \\
&= \sum_{i=0}^{n/2-1} \psi^{4ij+2i} a_{2i} + \sum_{i=0}^{n/2-1} \psi^{4ij+2j+2i+1} a_{2i+1} \bmod q \\
&= \sum_{i=0}^{n/2-1} \psi^{4ij+2i} a_{2i} + \psi^{2j+1} \sum_{i=0}^{n/2-1} \psi^{4ij+2i} a_{2i+1} \bmod q
\end{aligned} \tag{21}
\]

Based on the periodicity and symmetry properties of ψ:

\[
\hat{a}_{j+n/2} = \sum_{i=0}^{n/2-1} \psi^{4ij+2i} a_{2i} - \psi^{2j+1} \sum_{i=0}^{n/2-1} \psi^{4ij+2i} a_{2i+1} \bmod q \tag{22}
\]

Let $A_j = \sum_{i=0}^{n/2-1} \psi^{4ij+2i} a_{2i}$ and $B_j = \sum_{i=0}^{n/2-1} \psi^{4ij+2i} a_{2i+1}$; then equations (21) and (22) become:

\[
\begin{aligned}
\hat{a}_j &= A_j + \psi^{2j+1} B_j \bmod q \\
\hat{a}_{j+n/2} &= A_j - \psi^{2j+1} B_j \bmod q
\end{aligned} \tag{23}
\]

Notice that $A_j$ and $B_j$ can be obtained as n/2-point NTTs. If n is a power of two, the process can be repeated for all the coefficients. Figure 4 shows the visualization of Equation (23), usually called the CT butterfly in reference to its proposers, Cooley and Tukey [23].

FIGURE 4. Cooley-Tukey (CT) butterfly unit for calculating NTT.

One can configure several butterfly units to calculate the entire length-n NTT.
Example 4.1: From Example 3.8, one can calculate the NTT by the matrix multiplication:

\[
\hat{g} =
\begin{bmatrix}
\psi^{0} & \psi^{1} & \psi^{2} & \psi^{3} \\
\psi^{0} & \psi^{3} & \psi^{6} & \psi^{9} \\
\psi^{0} & \psi^{5} & \psi^{10} & \psi^{15} \\
\psi^{0} & \psi^{7} & \psi^{14} & \psi^{21}
\end{bmatrix}
\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}
\]

Based on the ψ periodicity:

\[
\hat{g} =
\begin{bmatrix}
\psi^{0} & \psi^{1} & \psi^{2} & \psi^{3} \\
\psi^{0} & \psi^{3} & \psi^{6} & \psi^{1} \\
\psi^{0} & \psi^{5} & \psi^{2} & \psi^{7} \\
\psi^{0} & \psi^{7} & \psi^{6} & \psi^{5}
\end{bmatrix}
\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}
\]

Based on the ψ symmetry:

\[
\hat{g} =
\begin{bmatrix}
\psi^{0} & \psi^{1} & \psi^{2} & \psi^{3} \\
\psi^{0} & \psi^{3} & -\psi^{2} & \psi^{1} \\
\psi^{0} & -\psi^{1} & \psi^{2} & -\psi^{3} \\
\psi^{0} & -\psi^{3} & -\psi^{2} & -\psi^{1}
\end{bmatrix}
\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}
\]

Breaking down for each element:

\[
\begin{aligned}
\hat{g}_0 &= 1\psi^{0} + 2\psi^{1} + 3\psi^{2} + 4\psi^{3} \\
\hat{g}_1 &= 1\psi^{0} + 2\psi^{3} - 3\psi^{2} + 4\psi^{1} \\
\hat{g}_2 &= 1\psi^{0} - 2\psi^{1} + 3\psi^{2} - 4\psi^{3} \\
\hat{g}_3 &= 1\psi^{0} - 2\psi^{3} - 3\psi^{2} - 4\psi^{1}
\end{aligned}
\]

Factoring:

\[
\begin{aligned}
\hat{g}_0 &= \psi^{0}(1 + 3\psi^{2}) + \psi^{1}(2 + 4\psi^{2}) \\
\hat{g}_1 &= \psi^{0}(1 - 3\psi^{2}) + \psi^{3}(2 - 4\psi^{2}) \\
\hat{g}_2 &= \psi^{0}(1 + 3\psi^{2}) - \psi^{1}(2 + 4\psi^{2}) \\
\hat{g}_3 &= \psi^{0}(1 - 3\psi^{2}) - \psi^{3}(2 - 4\psi^{2})
\end{aligned} \tag{24}
\]

The idea is to calculate similar terms in the brackets once, then distribute the results instead of calculating them multiple times. Figure 5 shows the visualization of Equation (24).
The number of stages required is $\log_2(n)$. For our case here, as n = 4, two stages are required. For this example, the result of stage 1 is [2469, 5853, 5214, 1832], and stage 2 is


[1467, 3471, 2807, 7621]. By reordering the result of stage 2, we can get the correct NTT result: [1467, 2807, 3471, 7621].
The order of the results of the CT butterfly is called bit-reversed order (BO), while the correct order of the NTT is called normal order (NO). We will discuss the ordering in more detail in Subsection IV-C.

FIGURE 5. Cooley-Tukey butterflies for n = 4 and [1, 2, 3, 4] as its input.

Example 4.2: Redoing Example 3.9, using the same butterfly configuration as Figure 5 with [5, 6, 7, 8] as the input, the result of stage 1 is [643, 4027, 7048, 3666], and stage 2 is [2489, 6478, 7489, 6607]. Reordering it to normal order gives the NTT result: [2489, 7489, 6478, 6607].
However, calculating the INTT requires a different, though similar, "divide and conquer" approach.

B. GENTLEMAN-SANDE (GS) ALGORITHM FOR FAST-INTT
For the INTT, instead of dividing the summation by its index parity, it is separated into the lower and upper halves of the summation. From equation (15) and ignoring the $n^{-1}$ term:

\[
\begin{aligned}
a_i &= \sum_{j=0}^{n-1} \psi^{-(2j+1)i} \hat{a}_j \bmod q \\
&= \left[ \sum_{j=0}^{n/2-1} \psi^{-(2j+1)i} \hat{a}_j + \sum_{j=0}^{n/2-1} \psi^{-(2(j+n/2)+1)i} \hat{a}_{j+n/2} \right] \bmod q \\
&= \psi^{-i} \left[ \sum_{j=0}^{n/2-1} \psi^{-2ij} \hat{a}_j + \sum_{j=0}^{n/2-1} \psi^{-2i(j+n/2)} \hat{a}_{j+n/2} \right] \bmod q
\end{aligned}
\]

Based on the periodicity and symmetry of $\psi^{-1}$, for the even term:

\[
\begin{aligned}
a_{2i} &= \psi^{-2i} \left[ \sum_{j=0}^{n/2-1} \psi^{-4ij} \hat{a}_j + \sum_{j=0}^{n/2-1} \psi^{-4i(j+n/2)} \hat{a}_{j+n/2} \right] \bmod q \\
a_{2i} &= \psi^{-2i} \sum_{j=0}^{n/2-1} \left[ \hat{a}_j + \hat{a}_{j+n/2} \right] \psi^{-4ij} \bmod q
\end{aligned} \tag{25}
\]

Doing the same derivation for the odd term:

\[
a_{2i+1} = \psi^{-2i} \sum_{j=0}^{n/2-1} \left[ \hat{a}_j - \hat{a}_{j+n/2} \right] \psi^{-4ij} \bmod q \tag{26}
\]

Let $A_i = \sum_{j=0}^{n/2-1} \hat{a}_j \psi^{-4ij}$ and $B_i = \sum_{j=0}^{n/2-1} \hat{a}_{j+n/2} \psi^{-4ij}$; Equations (25) and (26) become:

\[
\begin{aligned}
a_{2i} &= (A_i + B_i)\psi^{-2i} \bmod q \\
a_{2i+1} &= (A_i - B_i)\psi^{-2i} \bmod q
\end{aligned} \tag{27}
\]

Notice that $A_i$ and $B_i$ can be obtained as n/2-point INTTs. If n is a power of two, the process can be repeated for all the coefficients. Figure 6 shows the visualization of Equation (27), usually called the GS butterfly in reference to its proposers, Gentleman and Sande [24].

FIGURE 6. Gentleman-Sande (GS) butterfly unit for calculating INTT.

Because the separation is done differently, the GS butterflies' input is usually in bit-reversed order (BO) and the output is in normal order (NO).
Example 4.3: Repeating Example 3.10, let NTTψ(g) = ĝ = [1467, 2807, 3471, 7621]. The INTT can be calculated by using matrix multiplication:

\[
g = n^{-1}
\begin{bmatrix}
\psi^{-0} & \psi^{-0} & \psi^{-0} & \psi^{-0} \\
\psi^{-1} & \psi^{-3} & \psi^{-5} & \psi^{-7} \\
\psi^{-2} & \psi^{-6} & \psi^{-10} & \psi^{-14} \\
\psi^{-3} & \psi^{-9} & \psi^{-15} & \psi^{-21}
\end{bmatrix}
\begin{bmatrix} 1467 \\ 2807 \\ 3471 \\ 7621 \end{bmatrix}
\]

Based on $\psi^{-1}$ periodicity:

\[
g = n^{-1}
\begin{bmatrix}
\psi^{-0} & \psi^{-0} & \psi^{-0} & \psi^{-0} \\
\psi^{-1} & \psi^{-3} & \psi^{-5} & \psi^{-7} \\
\psi^{-2} & \psi^{-6} & \psi^{-2} & \psi^{-6} \\
\psi^{-3} & \psi^{-1} & \psi^{-7} & \psi^{-5}
\end{bmatrix}
\begin{bmatrix} 1467 \\ 2807 \\ 3471 \\ 7621 \end{bmatrix}
\]

Based on $\psi^{-1}$ symmetry:

\[
g = n^{-1}
\begin{bmatrix}
\psi^{-0} & \psi^{-0} & \psi^{-0} & \psi^{-0} \\
\psi^{-1} & \psi^{-3} & -\psi^{-1} & -\psi^{-3} \\
\psi^{-2} & -\psi^{-2} & \psi^{-2} & -\psi^{-2} \\
\psi^{-3} & \psi^{-1} & -\psi^{-3} & -\psi^{-1}
\end{bmatrix}
\begin{bmatrix} 1467 \\ 2807 \\ 3471 \\ 7621 \end{bmatrix}
\]

Breaking down for each element:

\[
\begin{aligned}
g_0 &= [1467\psi^{-0} + 2807\psi^{-0} + 3471\psi^{-0} + 7621\psi^{-0}]\, n^{-1} \\
g_1 &= [1467\psi^{-1} + 2807\psi^{-3} - 3471\psi^{-1} - 7621\psi^{-3}]\, n^{-1} \\
g_2 &= [1467\psi^{-2} - 2807\psi^{-2} + 3471\psi^{-2} - 7621\psi^{-2}]\, n^{-1} \\
g_3 &= [1467\psi^{-3} + 2807\psi^{-1} - 3471\psi^{-3} - 7621\psi^{-1}]\, n^{-1}
\end{aligned}
\]


Factoring:

\[
\begin{aligned}
g_0 &= [(1467 + 3471)\psi^{-0} + (2807 + 7621)\psi^{-0}]\,\psi^{-0} n^{-1} \\
g_1 &= [(1467 - 3471)\psi^{-1} + (2807 - 7621)\psi^{-3}]\,\psi^{-0} n^{-1} \\
g_2 &= [(1467 + 3471)\psi^{-0} - (2807 + 7621)\psi^{-0}]\,\psi^{-2} n^{-1} \\
g_3 &= [(1467 - 3471)\psi^{-1} - (2807 - 7621)\psi^{-3}]\,\psi^{-2} n^{-1}
\end{aligned} \tag{28}
\]

Similar to NTT, the idea is to calculate the similar terms in the brackets once, then distribute the results instead of calculating them multiple times. By first reordering the input, we can visualize Equation (28) as shown in Figure 7.

FIGURE 7. Gentleman-Sande butterflies for n = 4 and [1467, 2807, 3471, 7621] reordered as bit-reversed order as its input.

The result of stage 1 is [4938, 4025, 2747, 3664], and stage 2 is [4, 8, 12, 16]. After scaling by a factor of $4^{-1} = 5761$, we get the INTT result of [1, 2, 3, 4].
Example 4.4: Redoing Example 3.11, using the same butterfly configuration as Figure 7, we reorder the input from normal order [2489, 7489, 6478, 6607] to bit-reversed order [2489, 6478, 7489, 6607]. The result of stage 1 is [1286, 373, 6415, 7332], the result of stage 2 is [20, 24, 28, 32], and the INTT result after scaling is [5, 6, 7, 8].
For polynomial multiplication, one can use CT butterflies to transform both inputs to the NTT domain, then use element-wise multiplication for the NTT outputs. The result is then transformed back using GS butterflies to perform INTT. As the butterflies reduce the number of operations to a quasilinear scale, the complexity of the polynomial multiplication is reduced from $O(n^2)$ to $O(n \log n)$. The larger the polynomial degree, the larger the speed and cost gain [51].
Example 4.5: From Example 4.1, we get that the NTT transformation of [1, 2, 3, 4] in bit-reversed order is [1467, 3471, 2807, 7621]. From Example 4.2, we get that the NTT transformation of [5, 6, 7, 8] in bit-reversed order is [2489, 6478, 7489, 6607]. Using element-wise multiplication for those two results, we get [2888, 2851, 6407, 2992] in bit-reversed order. Transforming back the results using GS butterflies, we get [7625, 7645, 2, 60], or [−56, −36, 2, 60] when written using negative numbers, which is the same result as Example 3.12.

TABLE 2. Normal and bit-reversed order for n = 4.

TABLE 3. Normal and bit-reversed order for n = 8.

C. NORMAL ORDER AND BIT-REVERSED ORDER
As encountered in Subsections IV-A and IV-B, typically, the input of the CT butterfly is in Normal Order (NO), and the output is in Bit-reversed Order (BO). Conversely, the input of the GS butterfly is in BO, and the output is in NO. This section clarifies the formal definition of Normal and Bit-reversed Order and provides examples for n = 4 and n = 8.
Definition 4.1: Let n be a power of two, and let b be a non-negative integer with b < n. The bit-reversal of b is defined as:

\[
\mathrm{brv}_n\left(b_{\log n - 1} 2^{\log n - 1} + \cdots + b_1 2 + b_0\right) = b_0 2^{\log n - 1} + \cdots + b_{\log n - 2} 2 + b_{\log n - 1}
\]

where $b_i$ is the i-th bit of the binary expansion of b [33].
Example 4.6: Consider n = 4. The index of the array in normal order is [0, 1, 2, 3]. Table 2 shows the binary representation of each index in $\log_2 n = 2$ bits, its bit-reversal in binary, and the decimal representation of the bit-reversal.
From the table, we know that the index of the normal order is [0, 1, 2, 3] and the index of the bit-reversed order is [0, 2, 1, 3].
Example 4.7: Similarly, when considering n = 8, we can construct a similar table with $\log_2 n = 3$ as the length of the binary representation.
We get that the NO index is [0, 1, 2, 3, 4, 5, 6, 7], and the BO index is [0, 4, 2, 6, 1, 5, 3, 7].
Example 4.8: Redoing the previous examples for n = 16, with 4 as the length of the binary representation.
Therefore the NO index is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] and the BO index is [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15].
A typical NTT-CT butterfly configuration has NO-input and BO-output, while an INTT-GS configuration usually has BO-input and NO-output. However, one can reconfigure the CT butterfly to have BO-input & NO-output and GS butterfly


to have NO-input & BO-output. Figure 8 shows all possible configurations for NTT CT and INTT GS butterflies for n = 8.
Using normal order as NTT input is called decimation in time, while bit-reversed order input is called decimation in frequency [52]. Another thing to notice is that the power of ψ follows the bit-reversed order index. The set of all the powers of ψ used is called the twiddle factors.

FIGURE 8. All possible CT and GS butterfly configurations for n = 8.

TABLE 4. Normal and bit-reversed order for n = 16.

D. MODULAR ARITHMETIC UNITS
One of the challenges of NTT-based multiplication is that addition and multiplication have to be done in $Z_q$. All the CT and GS butterfly operators require modular arithmetic, a non-standard feature for most implementation platforms.

1) MODULAR ADDER
To calculate modular addition, (A + B) mod q, we can simply use a piece-wise function:

\[
(A + B) \bmod q =
\begin{cases}
A + B & A + B < q \\
A + B - q & A + B \geq q
\end{cases} \tag{29}
\]

Equation (29) is relatively easy to implement using a set of adders, subtractors, and multiplexers.
While a modular adder is simple and easy to implement, modular multipliers are trickier. The standard algorithm for modular multiplication uses trial division, which is inefficient, not scalable, and difficult to implement in a hardware architecture. The most popular workarounds are the Barrett and Montgomery modular multiplication algorithms.

2) MODULAR REDUCTION: BARRETT METHOD
The main idea behind Barrett reduction is to approximate the division by the modulus using pre-computed values, which allows for faster modular multiplication [53], [54]. Algorithm 1 shows how to multiply two integers modulo q using Barrett reduction. As the value of q is usually fixed,


we can pre-compute the values of k, r, and µ while designing the unit. The floor of a division by r, a power-of-two integer, can be replaced by a right shift, which is easy, cheap, and efficient to implement in hardware.

Algorithm 1 Modular Multiplication by Barrett Reduction
Input: a, b, q ∈ Z
Output: a × b mod q
    // Pre-computation
 1: k = ⌈log2 q⌉    // number of bits in q
 2: r = 2^k
 3: µ = ⌊r^2 / q⌋
    // Multiplication
 4: z = a × b
    // Barrett Reduction
 5: m1 = ⌊z / r⌋
 6: m2 = m1 × µ
 7: m3 = ⌊m2 / r⌋
 8: t = z − m3 × q
 9: if t ≥ q then
10:     return t − q
11: else
12:     return t
13: end if

This method is suitable for modular multiplication between unrelated numbers [55]. In the CT and GS butterflies, this type of multiplication is used to multiply the polynomial coefficients with various twiddle factors. It is also used in element-wise multiplication between two NTT vectors.
Example 4.9: To calculate 1467 × 2489 mod 7681 (used in Example 3.12) using Barrett reduction, we set a = 1467, b = 2489, and q = 7681.

\[
\begin{aligned}
k &= \lceil \log_2 q \rceil = \lceil \log_2 7681 \rceil = 13 \\
r &= 2^{k} = 2^{13} = 8192 \\
\mu &= \left\lfloor \frac{r^2}{q} \right\rfloor = \left\lfloor \frac{8192^2}{7681} \right\rfloor = 8736 \\
z &= a \times b = 1467 \times 2489 = 3651363 \\
m_1 &= \left\lfloor \frac{z}{r} \right\rfloor = \left\lfloor \frac{3651363}{8192} \right\rfloor = 445 \\
m_2 &= m_1 \times \mu = 445 \times 8736 = 3887520 \\
m_3 &= \left\lfloor \frac{m_2}{r} \right\rfloor = \left\lfloor \frac{3887520}{8192} \right\rfloor = 474 \\
t &= z - m_3 \times q = 3651363 - 474 \times 7681 = 10569
\end{aligned}
\]

As t ≥ 7681, the result is t − q = 10569 − 7681 = 2888.

3) MODULAR REDUCTION: MONTGOMERY METHOD
Another alternative is to perform modular multiplication by Montgomery reduction [56], [57], which is shown in Algorithm 2. The main idea is that it avoids direct divisions by the modulus by transforming the numbers to the Montgomery representation.

Algorithm 2 Modular Multiplication by Montgomery Reduction
Input: a, b, q ∈ Z
Output: a × b mod q
    // Pre-computation
 1: k = ⌈log2 q⌉    // number of bits in q
 2: r = 2^k
 3: rinv = r^(−1) mod q
 4: q′ = −q^(−1) mod r
    // Convert to Montgomery representation
 5: am = a × r mod q
 6: bm = b × r mod q
    // Multiplication in Montgomery representation
 7: t = am × bm
 8: u = t × q′ mod r
 9: cm = (t + u × q)/r
    // Convert back to the standard representation
10: c = cm × rinv mod q
11: return c

The main drawback of Montgomery reduction is the requirement to transform the numbers into Montgomery representation in $Z_q$, in contrast to Barrett reduction, in which all the calculation is done in Z. However, this drawback can also be advantageous when calculating the same multiplication multiple times. Hence, this method is suitable for modular exponentiation [55]. In the case of NTT, it is useful for calculating the powers of ψ used in the CT and GS butterflies as the twiddle factors.
Example 4.10: To calculate 1467 × 2489 mod 7681 (used in Example 3.12) using Montgomery reduction, we set a = 1467, b = 2489, and q = 7681.
Pre-computation:

\[
\begin{aligned}
k &= \lceil \log_2 q \rceil = \lceil \log_2 7681 \rceil = 13 \\
r &= 2^{k} = 2^{13} = 8192 \\
r_{inv} &= r^{-1} \bmod q = 8192^{-1} \bmod 7681 = 7200 \\
q' &= -q^{-1} \bmod r = -(7681^{-1}) \bmod 8192 = 7679
\end{aligned}
\]

Convert a and b to Montgomery representation:

\[
\begin{aligned}
a_m &= a \times r \bmod q = 1467 \times 8192 \bmod 7681 = 4580 \\
b_m &= b \times r \bmod q = 2489 \times 8192 \bmod 7681 = 4514
\end{aligned}
\]

Multiplication in Montgomery representation:

\[
\begin{aligned}
t &= a_m \times b_m = 4580 \times 4514 = 20674120 \\
u &= t \times q' \bmod r = 20674120 \times 7679 \bmod 8192 = 6584
\end{aligned}
\]

Result in Montgomery representation:

\[
\begin{aligned}
c_m &= (t + u \times q)/r \\
&= (20674120 + 6584 \times 7681)/8192 \\
&= 8697
\end{aligned}
\]


TABLE 5. The values of n, q, ω, and ψ of standardized NIST-PQC scheme. Note that only Dilithium specifies the actual value of ψ, others do not.

Transform back to standard representation:

\[
\begin{aligned}
c &= c_m \times r_{inv} \bmod q \\
&= 8697 \times 7200 \bmod 7681 \\
&= 2888
\end{aligned}
\]
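The modular adder of Equation (29) and the steps of Algorithms 1 and 2 can be sketched in Python as follows. This is a minimal illustration rather than a hardware-oriented design: the function names are hypothetical, the inputs are assumed to satisfy 0 ≤ a, b < q, the Montgomery variant assumes an odd q, and the Barrett variant uses a loop for the final correction as a safety margin where Algorithm 1 uses a single conditional subtraction.

```python
def mod_add(a, b, q):
    """Equation (29): one conditional subtraction, assuming 0 <= a, b < q."""
    s = a + b
    return s - q if s >= q else s


def barrett_mul(a, b, q):
    """a * b mod q following the steps of Algorithm 1."""
    k = (q - 1).bit_length()       # k = ceil(log2 q)
    mu = (1 << (2 * k)) // q       # mu = floor(r^2 / q) with r = 2^k (pre-computable)
    z = a * b
    m3 = ((z >> k) * mu) >> k      # floor divisions by r become right shifts
    t = z - m3 * q
    while t >= q:                  # final correction step
        t -= q
    return t


def montgomery_mul(a, b, q):
    """a * b mod q following the steps of Algorithm 2 (q must be odd)."""
    k = (q - 1).bit_length()
    r = 1 << k
    r_inv = pow(r, -1, q)          # modular inverse via pow requires Python 3.8+
    q_prime = (-pow(q, -1, r)) % r
    a_m = (a * r) % q              # convert to Montgomery representation
    b_m = (b * r) % q
    t = a_m * b_m
    u = (t * q_prime) & (r - 1)    # "mod r" is a bit mask because r is a power of two
    c_m = (t + u * q) >> k         # exact division by r
    return (c_m * r_inv) % q       # convert back to the standard representation


print(barrett_mul(1467, 2489, 7681), montgomery_mul(1467, 2489, 7681))
```

With the inputs of Examples 4.9 and 4.10, both functions should return 2888. In practice the conversions into and out of Montgomery form would be amortized over many multiplications instead of being repeated per call as in this sketch.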
Transforming to and from Montgomery representation is an expensive operation, which is usually done iteratively by subtracting q multiple times. One needs to minimize the number of transformations to use Montgomery modular reduction efficiently.
Many researchers perform various workarounds and optimizations for NTT/INTT implementation using previously discussed concepts in various Post-Quantum Cryptography applications, which we will discuss in the following section.

V. NTT IN POST QUANTUM CRYPTOGRAPHY SCHEME
All the NIST-PQC competition winners, Dilithium [10], Falcon [12], and Kyber [15], [16], include NTT/INTT in their specifications for modular polynomial multiplication. In this section, we survey the implementations of NTT/INTT for each scheme on various platforms based on their novelty claims, algorithms, and implementation strategies. We also present common optimizations implemented by various researchers.

A. DILITHIUM, KYBER, AND FALCON OVERVIEW
Dilithium [10] is one of the standardized algorithms in the NIST Post-Quantum Cryptography (PQC) competition. It is a signature scheme based on the problem of finding short lattice vectors, which is believed to be hard even for quantum computers. The Dilithium algorithm is designed to provide strong security while remaining efficient enough for practical use in digital signature applications.
Kyber [15], [16] is a key encapsulation mechanism (KEM) and one of the algorithms in the NIST Post-Quantum Cryptography (PQC) competition. It is a lattice-based cryptosystem that relies on the hardness of the Learning With Errors (LWE) problem and its variants, which are believed to resist quantum attacks.
Falcon [12] is one of the digital signature algorithms in the NIST (National Institute of Standards and Technology) PQC competition. Falcon is a family of lattice-based signature schemes designed to be secure against attacks by quantum computers. Falcon uses a variation of the Ring Learning With Errors (RLWE) problem, which is believed to be resistant to attacks by both classical and quantum computers. The security of Falcon relies on the hardness of the underlying mathematical problem of finding the shortest vector in a lattice. Falcon provides efficient signature generation and verification, making it a practical option for real-world applications. It is also designed to resist side-channel attacks, which exploit weaknesses in the physical implementation of a cryptographic system.
This report highlights the NTT/INTT specifications and the Dilithium, Kyber, and Falcon scheme implementations. Optimizations and various implementations outside the NTT/INTT in the schemes are out of the scope of our work.

B. NTT IN FINALIZED PQC SCHEMES
NTT is a part of the Dilithium specification, with polynomials of degree n = 256 and the modulus $q = 2^{23} - 2^{13} + 1 = 8380417$ used in the extended cyclotomic ring $Z_q[x]/(x^{256} + 1)$. Notice that the chosen q is an NWC-NTT friendly prime where ψ exists. Dilithium also specifies the chosen 2n-th root of unity, ψ = 1753. These parameters were chosen based on a trade-off between security and efficiency [10].
NTT is also a part of the Falcon and Kyber specifications. Falcon [12] specifies n = 512 or n = 1024, depending on the desired security level, and the modulus q is chosen to be 12289, which is an NWC-NTT friendly modulus for both values of n. Kyber [16] specifies n = 256 and q = 3329 in its finalized version. Table 5 shows the NTT parameters summary for Dilithium, Falcon, and Kyber.
As we can see, the NTT specification in Kyber is unique because the chosen modulus in the final version, q = 3329, is not an NWC-NTT friendly modulus for n = 256, which requires a special trick called truncated NTT to calculate its negative-wrapped convolution. Truncated NTT divides the NTT calculation into two parts: for Kyber with n = 256, it requires two NTT calculations with n = 128, obtained by separating the odd and even parts [58]. Notice that when n = 128, q = 3329 is an NWC-NTT friendly modulus, with ψ = 892 as one possible choice. In the following toy example for truncated NTT, we calculate NTT/INTT with n = 8 by breaking it down into two NTT/INTT calculations with n = 4.
Example 5.1: Let A = [0, 1, 2, 3, 4, 5, 6, 7] and B = [8, 9, 10, 11, 12, 13, 14, 15] in the ring $Z_Q$ with Q = 7681. We need to find the negacyclic convolution of A and B.
Calculating the results using previously explained methods in normal order: Using ψ = 7154, we can get:

\[
\begin{aligned}
\mathrm{NTT}^{\psi}(A) &= [0, 7154, 2426, 2497, 1830, 4245, 3812, 4081] \\
\mathrm{NTT}^{\psi}(B) &= [8, 2938, 4449, 4035, 5490, 3356, 3774, 1064]
\end{aligned}
\]

Element-wise multiplication between the two yields:

\[
\mathrm{NTT}^{\psi}(A) \circ \mathrm{NTT}^{\psi}(B) = [3213, 7391, 1790, 5474, 5572, 2527, 2633, 7341]
\]


Taking the INTT of the result yields the negacyclic convolution between A and B:

\[
\mathrm{INTT}^{\psi^{-1}}\left(\mathrm{NTT}^{\psi}(A) \circ \mathrm{NTT}^{\psi}(B)\right) = [7373, 7369, 7391, 7441, 7521, 7633, 98, 280]
\]

Therefore the negacyclic convolution between them is [7373, 7369, 7391, 7441, 7521, 7633, 98, 280].

FIGURE 9. Schoolbook multiplication instead of element-wise for Example 5.1.

FIGURE 10. Calculating 256-element NTT Transformation using 4-element CT butterfly iteratively.

TABLE 6. Values of $\psi^{2i+1}$, which is important to determine the modulus of the schoolbook multiplication.

Calculating the results by truncated NTT: Group A and B into pairs:

\[
\begin{aligned}
A &= [(0, 1), (2, 3), (4, 5), (6, 7)] \\
B &= [(8, 9), (10, 11), (12, 13), (14, 15)]
\end{aligned}
\]

Then, separate them into odd and even parts:

\[
\begin{aligned}
A_{even} &= [0, 2, 4, 6] \\
A_{odd} &= [1, 3, 5, 7] \\
B_{even} &= [8, 10, 12, 14] \\
B_{odd} &= [9, 11, 13, 15]
\end{aligned}
\]

Transform them using the NTT with n = 4, q = 7681, and ψ = 1925:

\[
\begin{aligned}
\mathrm{NTT}^{\psi}(A_{even}) &= [2423, 3273, 1598, 387] \\
\mathrm{NTT}^{\psi}(A_{odd}) &= [6519, 603, 4270, 3974] \\
\mathrm{NTT}^{\psi}(B_{even}) &= [4467, 4956, 7612, 6040] \\
\mathrm{NTT}^{\psi}(B_{odd}) &= [882, 2286, 2603, 1946]
\end{aligned}
\]

Note that we present the results here in normal order. We can group them again:

\[
\begin{aligned}
\mathrm{NTT}^{\psi}(A) &= [(2423, 6519), (3273, 603), (1598, 4270), (387, 3974)] \\
\mathrm{NTT}^{\psi}(B) &= [(4467, 882), (4956, 2286), (7612, 2603), (6040, 1946)]
\end{aligned}
\]

In the usual NTT/INTT scheme, we multiply element-wise in this step. However, in this scheme, we do a schoolbook multiplication between each pair of grouped elements, treated as polynomial coefficients [59]. The schoolbook multiplication is performed modulo $X^2 - \psi^{2i+1}$, where i is the index of the grouped elements. Figure 9 illustrates how our example case is performed, while Table 6 shows the $\psi^{2i+1}$ values.
From the calculations, we obtain the following:

\[
\hat{C} = \mathrm{NTT}^{\psi}(A) \circ \mathrm{NTT}^{\psi}(B) = [(2567, 3470), (6052, 1343), (4959, 1421), (552, 199)]
\]

Separating odd and even values of $\hat{C}$:

\[
\begin{aligned}
\hat{C}_{even} &= [2567, 6052, 4959, 552] \\
\hat{C}_{odd} &= [3470, 1343, 1421, 199]
\end{aligned}
\]

Transform them back using INTT:

\[
\begin{aligned}
\mathrm{INTT}^{\psi^{-1}}(\hat{C}_{even}) &= [7373, 7391, 7521, 98] \\
\mathrm{INTT}^{\psi^{-1}}(\hat{C}_{odd}) &= [7369, 7441, 7633, 280]
\end{aligned}
\]

Grouping them:

\[
C = [(7373, 7369), (7391, 7441), (7521, 7633), (98, 280)]
\]

This is the same result as the regular NTT/INTT negative-wrapped convolution performed previously.
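The regular and truncated computations of Example 5.1 can be cross-checked with a short Python sketch built on the CT and GS butterflies of Section IV. The code is an illustration under stated assumptions: the function names are hypothetical, the loop structure is one common in-place formulation (CT with normal-order input and bit-reversed output, GS with bit-reversed input and normal-order output), and, unlike the example, intermediate results are kept in bit-reversed order, so the schoolbook modulus at position p uses the exponent 2·brv(p) + 1.

```python
def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x (Definition 4.1)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def ntt_ct(a, psi, q):
    """Negacyclic NTT with CT butterflies: normal-order input, bit-reversed output."""
    a, n = list(a), len(a)
    bits = n.bit_length() - 1
    tw = [pow(psi, bit_reverse(i, bits), q) for i in range(n)]  # twiddles in BO
    t, m = n, 1
    while m < n:
        t //= 2
        for i in range(m):
            s = tw[m + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t] * s % q
                a[j], a[j + t] = (u + v) % q, (u - v) % q
        m *= 2
    return a


def intt_gs(a_hat, psi, q):
    """Negacyclic INTT with GS butterflies: bit-reversed input, normal-order output."""
    a, n = list(a_hat), len(a_hat)
    bits = n.bit_length() - 1
    psi_inv = pow(psi, -1, q)  # requires Python 3.8+
    tw = [pow(psi_inv, bit_reverse(i, bits), q) for i in range(n)]
    t, m = 1, n
    while m > 1:
        h, j1 = m // 2, 0
        for i in range(h):
            s = tw[h + i]
            for j in range(j1, j1 + t):
                u, v = a[j], a[j + t]
                a[j], a[j + t] = (u + v) % q, (u - v) * s % q
            j1 += 2 * t
        t, m = 2 * t, h
    n_inv = pow(n, -1, q)
    return [x * n_inv % q for x in a]


def negacyclic_full(a, b, psi, q):
    """Equation (18): psi must be a primitive 2n-th root of unity."""
    fa, fb = ntt_ct(a, psi, q), ntt_ct(b, psi, q)
    return intt_gs([x * y % q for x, y in zip(fa, fb)], psi, q)


def negacyclic_truncated(a, b, psi_half, q):
    """Truncated NTT: psi_half only needs to be a primitive n-th root of unity."""
    h = len(a) // 2
    bits = h.bit_length() - 1
    ae, ao = ntt_ct(a[0::2], psi_half, q), ntt_ct(a[1::2], psi_half, q)
    be, bo = ntt_ct(b[0::2], psi_half, q), ntt_ct(b[1::2], psi_half, q)
    ce, co = [], []
    for p in range(h):
        # Schoolbook product of (ae + ao*X)(be + bo*X) modulo X^2 - zeta.
        zeta = pow(psi_half, 2 * bit_reverse(p, bits) + 1, q)
        ce.append((ae[p] * be[p] + ao[p] * bo[p] * zeta) % q)
        co.append((ae[p] * bo[p] + ao[p] * be[p]) % q)
    c = [0] * len(a)
    c[0::2], c[1::2] = intt_gs(ce, psi_half, q), intt_gs(co, psi_half, q)
    return c


A, B = list(range(8)), list(range(8, 16))
print(negacyclic_full(A, B, 7154, 7681))
print(negacyclic_truncated(A, B, 1925, 7681))
```

Both calls should print the convolution obtained in Example 5.1, and `ntt_ct([1, 2, 3, 4], 1925, 7681)` should give the bit-reversed output of Example 4.1.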


C. IMPLEMENTATION OPTIMIZATIONS
As the values of n and q are relatively small in the finalized PQC schemes, one can straightforwardly implement CT and GS butterflies for NTT/INTT with 256 points. Nevertheless, various optimizations on the implementation side exist. This section discusses what other researchers do to achieve their desired goals.

1) ITERATIVE CT/GS BUTTERFLIES
One may look to optimize the NTT/INTT for a smaller area in the hardware implementation. This goal can be achieved using an iterative approach, which involves performing the butterflies repeatedly at the cost of more clock cycles. This means one can break the transformation into smaller units instead of doing a single 256-element CT/GS butterfly. For instance, since 256 can be expressed as $16^2 = 4^4 = 2^8$, one can do a 16-element CT/GS butterfly twice (called radix-16), a 4-element butterfly four times (called radix-4), or a 2-element butterfly eight times (called radix-2). This allows for a reduction in the area required to implement the butterflies while still maintaining the same transformation result.
However, the iterative method has another trade-off: the need to generate a different set of twiddle factors for each clock cycle. This means the twiddle factors must be recalculated for each iteration, adding to the processing time. This trade-off is necessary to ensure that the correct results are obtained at each stage of the iterative transformation.
Figure 10 illustrates how a 4-element CT butterfly can be used to calculate a 256-element NTT transformation. For each clock cycle, the twiddle factors $\psi^a$, $\psi^b$, and $\psi^c$ are changed according to the input elements. The order in which the groups of four elements are fed in must also be managed carefully to ensure the correct results. This scheme can also be applied similarly to GS butterflies.
This optimization is common among researchers and used in almost all papers we surveyed due to its effectiveness in saving area and its design flexibility.

2) BITWISE OPERATIONS FOR MODULAR REDUCTION
Another common optimization among researchers is utilizing the fact that the modulus q is already set and fixed for each scheme. This fact can be exploited to design a custom modular reduction module specific to the targeted modulus by only using hardware-friendly operations such as add and shift.
For example, in Dilithium, the value of the modulus is fixed at q = 8380417. This number can also be written as $2^{23} - 2^{13} + 1$, from which one can derive that $2^{23} \equiv 2^{13} - 1 \bmod q$. A similar derivation can also be done for the Falcon and Kyber moduli. Instead of using Barrett or Montgomery reduction, a series of bit-wise operations based on this fact is used to perform the modular reduction.
The idea of designing a custom bit-wise operations modular reduction module is attributed to [60] by many researchers we surveyed.

3) HARDWARE SPECIFIC FEATURES UTILIZATION
Another common optimization among researchers is using more efficient and ready-to-use features of their targeted devices. Some platforms provide useful features, such as ready-to-use multipliers in the DSP module of an FPGA, vector arithmetic, large register sets, specialized instruction sets, and parallelization. However, this optimization is very specific to the targeted device, and not all researchers can access the specialized hardware.

D. IMPLEMENTATIONS COMPARISON
To close this section, Table 7 summarizes the implementation comparisons between various researchers based on their novelty claims on NTT/INTT implementation algorithms, target device or hardware, their presented NTT/INTT implementations, and how they implement modular reduction algorithms.

VI. NTT IN HOMOMORPHIC ENCRYPTION ARCHITECTURES
Another use case of NTT is its application in Homomorphic Encryption (HE). This section explores the NTT implementations of HE schemes on various platforms, such as CPUs, GPUs, and FPGAs. NTT is one of the key components in Homomorphic Encryption (HE) algorithms that rely on polynomial arithmetic (addition and multiplication). NTT is an integer version of the FFT, which is well known to run in $O(n \log n)$ time. Suppose we have two degree-n polynomials, A and B. Theoretically, they can be multiplied in time $O(n \log n)$ [118]. Moreover, the authors showed that multiplying polynomials of degree less than n in $R_q$, with q elements, costs $O(n \log q \log(n \log q))$. One big question is how to implement the $O(n \log n)$ complexity on platforms such as GPUs and FPGAs. This section will answer it.
To have a comprehensive understanding of the architectures, first, we explain how the CPU, GPU, and FPGA work. Figures 11 and 12 show the CPU, GPU, and FPGA architectures, respectively. A CPU is built from several modules such as registers, an arithmetic logic unit (ALU), a control unit, and instruction memory. The left side of Figure 11 shows a multi-core CPU that supports the execution of multiple instructions simultaneously. Typical multi-core CPUs are also supported by caches that temporarily hold the data and instructions to speed up data movement compared with accessing the DRAM. However, CPUs are specialized for serial processing with higher clock frequencies, which may not be suitable for the data format of polynomial operations in homomorphic encryption.
A GPU, or graphics processing unit, is a processor for handling complex mathematical and graphical computations. A GPU contains many more cores than a CPU, e.g., hundreds, thousands, or even more cores that can work simultaneously in parallel. The right side of Figure 11 shows the typical GPU architecture that consists of streaming multiprocessors (SM) containing the cores.


TABLE 7. Summary of Dilithium, Kyber, and Falcon’s NTT hardware implementations.


This feature of the GPU is very suitable for processing homomorphic encryption, which requires polynomial operations with high degrees.
A field programmable gate array (FPGA) is a reconfigurable device that can perform specific functions or tasks. The biggest advantage of an FPGA is the ability to be reprogrammed with higher flexibility than CPUs and GPUs. As shown in Figure 12, the FPGA consists of configurable logic blocks and interconnects that can be changed by loading a bit stream onto the FPGA. An FPGA is suitable for applying high-performance cryptography, such as homomorphic encryption, because of its ability to perform computations in parallel and its greater flexibility in embedding special functions on the hardware. Moreover, it also allows us to add interfaces such as memory, Ethernet, video, and audio interfaces. Finally, the comparisons of CPU, GPU, and FPGA are shown in Table 8.

TABLE 8. CPU, GPU, and FPGA comparison.

FIGURE 11. CPU vs. GPU architecture.

FIGURE 12. FPGA architecture.


FIGURE 13. Computation pattern [46].

FIGURE 14. BFV accelerator architecture [46].

A. OVERVIEW OF HOMOMORPHIC ENCRYPTION
First, we explain the basic properties used in homomorphic encryption. $Z[x]$ denotes the polynomials with integer coefficients. Thus, $Z_q[x]$ denotes the polynomials whose coefficients are integers in [0, q), where q is the coefficient modulus. Second, the polynomial ring is $R_q = Z_q[x]/\phi_m(x)$, where $\phi_m(x)$ is a cyclotomic polynomial of the form $(x^n + 1)$ with n as the polynomial degree.

1) ENCRYPTION IN HE
Definition 6.1: The following represents the encryption of FHE [46]:

\[
(c_0, c_1) = \left([\Delta \cdot m + p_0 \cdot u + e_1]_q,\ [p_1 \cdot u + e_2]_q\right) \tag{30}
\]

where $c_0$ and $c_1$ are the ciphertexts with errors $e_1, e_2 \leftarrow \chi$, $p_0$ and $p_1$ are the public keys, $m \in R_t$ is the message, $u \leftarrow R_2^n$ is the sparse random polynomial, and $\Delta = q/t$ is the ratio between the ciphertext coefficient modulus and the plaintext coefficient modulus.

2) DECRYPTION IN HE
Definition 6.2: The following represents the decryption of FHE [46]:

\[
m = \left[ \left\lfloor \frac{[c_0 + c_1 \cdot s]_q}{\Delta} \right\rceil \right]_t \tag{31}
\]

where $c_0$ and $c_1$ are the ciphertexts, s is the secret key, q is the ciphertext coefficient modulus, t is the plaintext coefficient modulus, and the ratio between the ciphertext coefficient modulus and the plaintext coefficient modulus is $\Delta = q/t$.

Moreover, a random number generator to generate the sparse random polynomials and the errors $e_1$ and $e_2$ is proposed as a Gaussian PRNG module.
Some polynomial operations are also used in other parts of homomorphic encryption, such as public key generation, relinearization, and bootstrapping. Some of them are computationally expensive, especially bootstrapping, which refreshes the noise in the ciphertext after homomorphic operations are performed. Bootstrapping includes evaluation and re-encryption of the ciphertext. They require polynomial multiplication and addition. Finally, the NTT module is implemented in the polynomial arithmetic cores to speed up the computation efficiently.

B. RLWE HOMOMORPHIC ENCRYPTION SCHEMES
In this subsection, we review some ring learning with errors (RLWE) based homomorphic encryption schemes, i.e., FHEW [119], CKKS [39], BFV [36], BGV [37], [38], and TFHE [120]. One of the pioneers of HE is FHEW (Fastest Homomorphic Encryption in the West), an RLWE-based homomorphic encryption scheme proposed by Ducas and Micciancio, which achieves homomorphic operations on ciphertexts through a fast bootstrapping procedure. FHEW is designed to be efficient in computation and
3) COMPUTATIONAL PATTERN AND ARCHITECTURE OF HE memory usage, making it suitable for resource-constrained
In Equations (30) and (31), polynomial multiplications are environments.
performed in the [·] operator while polynomial additions The CKKS (Cheon-Kim-Kim-Song) scheme [39] is a
are performed in the [+] operator. Figure 13 shows a com- Leveled Homomorphic Encryption scheme designed to sup-
putational pattern proposed in [46]. We can see that the port real-valued data in complex numbers. It is well-suited
polynomial multiplication operations are used for public keys for data with a large dynamic range and is sensitive to
(pk[0], pk[1]) in the encryption and ciphertext (ct[1]) in the numerical errors, such as the data from machine learning
decryption. models. CKKS uses a polynomial ring to represent encrypted
Figure 14 shows an architecture of BFV homomorphic data and employs the relinearization technique to reduce
encryption. Note that the polynomial multiplication is per- computational complexity. It is known for its fast perfor-
formed in the ring polymult core while the adder module mance on real-world applications, such as machine learning
performs the polynomial addition after the polymult core. inference.
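To make Equations (30) and (31) concrete, the following is a minimal Python sketch of encryption and decryption over $R_q = \mathbb{Z}_q[x]/(x^n + 1)$. It is an illustration under stated assumptions, not the design of [46]: the toy parameters (n = 8, q = 2^20, t = 16), the ternary distribution standing in for χ and for the sparse polynomial u, the key generation p0 = -(a·s + e), p1 = a, and all function names are our own choices. A schoolbook product stands in for the NTT-based polynomial multiplier.

```python
import random

N, Q, T = 8, 1 << 20, 16      # toy parameters (assumed; far too small to be secure)
DELTA = Q // T                # Delta = q / t

def poly_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def poly_mul(a, b):
    """Schoolbook product in Z_q[x]/(x^n + 1); terms past x^(n-1) wrap with a sign flip."""
    c = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            k = i + j
            if k < N:
                c[k] = (c[k] + x * y) % Q
            else:
                c[k - N] = (c[k - N] - x * y) % Q
    return c

def small_poly(rng):
    """Ternary polynomial, used for s, u and the errors (an assumption of this sketch)."""
    return [rng.choice((-1, 0, 1)) % Q for _ in range(N)]

def keygen(rng):
    s = small_poly(rng)
    a = [rng.randrange(Q) for _ in range(N)]
    e = small_poly(rng)
    p0 = [(-x) % Q for x in poly_add(poly_mul(a, s), e)]   # p0 = -(a*s + e), p1 = a
    return s, (p0, a)

def encrypt(pk, m, rng):
    p0, p1 = pk
    u, e1, e2 = small_poly(rng), small_poly(rng), small_poly(rng)
    scaled = [(DELTA * x) % Q for x in m]
    c0 = poly_add(poly_add(scaled, poly_mul(p0, u)), e1)   # Eq. (30), first component
    c1 = poly_add(poly_mul(p1, u), e2)                     # Eq. (30), second component
    return c0, c1

def decrypt(s, ct):
    c0, c1 = ct
    v = poly_add(c0, poly_mul(c1, s))                      # [c0 + c1*s]_q
    return [((x * T + Q // 2) // Q) % T for x in v]        # Eq. (31): round(v / Delta) mod t

rng = random.Random(2023)
secret, public = keygen(rng)
message = [3, 1, 4, 1, 5, 9, 2, 6]
recovered = decrypt(secret, encrypt(public, message, rng))
```

With these parameters the accumulated noise stays far below Δ/2, so the rounding in `decrypt` always recovers the message. Encryption uses two polynomial multiplications and decryption one, which is where an NTT core would be applied.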


The BFV (Brakerski/Fan-Vercauteren) scheme [36] is a type of FHE with integer representations. Next, the BGV (Brakerski-Gentry-Vaikuntanathan) scheme [37], [38] is a type of FHE designed to work with integers and binary data. It is also well suited to applications built on modular arithmetic and logical operations, such as privacy-preserving data analysis. BGV uses a polynomial ring to represent encrypted data and employs bootstrapping to reduce computational complexity. Bootstrapping enables the FHE scheme to evaluate complex functions on encrypted data by recursively re-encrypting ciphertexts. BGV is known for its ability to handle more complex computations than BFV.
The TFHE (Fast Fully Homomorphic Encryption over the Torus) scheme [120] is a fully homomorphic encryption scheme that is based on the GSW scheme [121] and the ring learning with errors (RLWE) problem. It is designed to be efficient in computation and memory usage. TFHE is a single-key scheme, meaning the same key is used for encryption, decryption, and homomorphic operations. It is also designed to be fully compatible with standard Boolean circuits, making it suitable for various applications.

C. HOMOMORPHIC ENCRYPTION IMPLEMENTATIONS IN VARIOUS PLATFORMS
1) HE IMPLEMENTATIONS IN CLASSICAL CPU
Some software libraries for homomorphic encryption are HELib [122], SEAL [41], OpenFHE [123], and LATTIGO [124]. First, HELib [122] is a C++ library that implements the BGV and CKKS schemes over lattice polynomial rings. It supports high-performance computing environments with parallelization, optimized arithmetic operations, and memory management. Second, Microsoft SEAL [41] implements BFV, BGV, and CKKS. It is a well-known library whose efficiency was improved by adopting the residue number system (RNS) variant of FV. SEAL also supports various parameters. Next, PALISADE (Privacy-Enhancing Technologies: Algorithms, Implementations, and Development Environments), now called OpenFHE [123], is a modular C++ library for FHE that supports multiple schemes, including CKKS, BGV, and the Fan-Vercauteren (FV) scheme. It is designed to be extensible and includes various features like multi-threading, GPU acceleration, and network communication. Finally, LATTIGO [124] is a Go library that supports CKKS and BGV. It is fast and designed for ease of use. It includes features like automatic parameter selection, key management, and ciphertext packing and provides an interface for easy integration with machine learning frameworks like TensorFlow.
Next, we explore various NTT implementations in GPUs and FPGAs as hardware accelerators. As explained before, the advantages of GPU and FPGA over CPU are in the flexibility, configurability, and performance due to massive cores and specific functions. We will see how parallel architecture is applied to fast large-integer arithmetic through the residue number system and through the transform, in this case, the butterfly operations in the NTT.

2) HE IMPLEMENTATIONS IN GPUs
Some works in GPUs for homomorphic encryption are presented in [125], [126], [127], [128], [129], and [130]. The optimizations implemented in GPUs can be classified into 3 major groups. The first is the parallel implementation of butterfly modules in the NTT, since a GPU is capable of massively parallel computing [125], [126], [130], [131]. The second is optimizing the kernel and memory management [125], [126], [130], [131]. The last is the optimization of modular arithmetic operations, such as the Barrett multiplication, the Montgomery multiplication, and the Residue Number System (RNS) based on the Chinese Remainder Theorem (CRT) [125], [126], [128], [129], [130], [131].
Figure 15 depicts the implementation of HE-Booster. It consists of 5 phases to do the homomorphic encryption operations [125]. The first phase is CRT, which decomposes the ciphertext into multiple independent sub-spaces. The coefficient reduction operation provides a higher utilization of the GPU parallel architecture. Second, NTT is performed by using the CT butterfly. It uses inter-thread local synchronization for optimization. Next, element-wise modular operations are performed in the third phase, dyadic computation. Fourth, the INTT phase uses the GS butterfly with the same mechanism as the second (NTT) phase. Finally, the inverse CRT phase reconstructs the multiple residue polynomials in the independent sub-spaces into a single polynomial in ciphertext space.
Figure 16 shows the GPU thread strategy for executing the butterflies in the NTT [131]. In this case, the NTT operation uses 4 iterations with 16 inputs. In each iteration, there are 8 butterfly operations. Thus, we have the following:
Lemma 6.1: Suppose that we have n-input NTT/INTT. The complexity of the serial implementation of NTT/INTT is

$$O\left(\frac{n}{2}\log n\right).$$

Proof: It is clear that to achieve the last butterfly operation between input with index i and i + 1, it requires log n iterations where there are n/2 butterfly operations in each iteration. □
Moreover, if we analyze the complexity of polynomial multiplication using NTT/INTT, we have the following:
Lemma 6.2: A polynomial multiplication of A and B using NTT/INTT that has n coefficients or inputs using serial computation takes:

$$O(n + 2n\log n).$$

Proof: A polynomial multiplication requires an NTT for each of the polynomials A and B, a component-wise product, and an INTT. An NTT/INTT takes $O(\frac{n}{2}\log n)$ while an element-wise operation takes O(n). Thus, we have the lemma. □
Now, we define a quota of parallelization in the GPU as:
Definition 6.3: Suppose that C is the number of GPU cores, and h is the number of parallel polynomial operations executed in the GPU. The quota in executing a polynomial operation in a GPU is C/h cores.
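The butterfly count in Lemma 6.1 can be checked directly on the 16-input case of Figure 16. The sketch below is a plain serial Cooley-Tukey NTT with a counter, not the GPU kernel of [131]; the modulus q = 17, the 16th root of unity ω = 3, and the function names are assumptions chosen only to keep the example small.

```python
def bit_reverse(values):
    """Reorder a list of length 2^k into bit-reversed index order."""
    n = len(values)
    bits = n.bit_length() - 1
    out = [0] * n
    for i, v in enumerate(values):
        r = int(format(i, f"0{bits}b")[::-1], 2)
        out[r] = v
    return out

def ntt_iterative(a, omega, q):
    """Serial radix-2 NTT; returns the transform and the number of butterflies."""
    n = len(a)
    a = bit_reverse(a)
    butterflies = 0
    length = 2
    while length <= n:                       # log2(n) iterations
        w_len = pow(omega, n // length, q)
        for start in range(0, n, length):
            w = 1
            for j in range(length // 2):     # n/2 butterflies per iteration in total
                u = a[start + j]
                v = a[start + j + length // 2] * w % q
                a[start + j] = (u + v) % q
                a[start + j + length // 2] = (u - v) % q
                w = w * w_len % q
                butterflies += 1
        length *= 2
    return a, butterflies

def ntt_naive(a, omega, q):
    """Direct O(n^2) evaluation, used only as a reference."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q for i in range(n)]

q, omega = 17, 3                 # 3 has multiplicative order 16 modulo 17
data = list(range(16))
result, count = ntt_iterative(data, omega, q)
```

For n = 16 the counter ends at 32, that is, 4 iterations of 8 butterflies, matching (n/2) log n. On a GPU the 8 butterflies of one iteration are independent and can be assigned to separate threads.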


FIGURE 15. HE-Booster GPU Implementation [125].

FIGURE 16. GPU thread assignment for n = 16 [131].

FIGURE 17. Parallel NTT-INTT architecture implementation [135].

Theorem 6.3: For encryption and decryption, the complexities for processing h plaintexts in parallel with a C-core GPU are:

$$O\left(\frac{6nh}{C}(1 + \log n)\right)$$

and

$$O\left(\frac{2nh}{C}(1 + \log n)\right),$$

respectively, where n is the number of inputs or coefficients in NTT/INTT.
Proof: By Lemmas 6.1 and 6.2, encryption requires at most 3 polynomial multiplications and 3 polynomial additions, while decryption requires a polynomial multiplication and a polynomial addition. With a C-core GPU and h parallel processes, we have the theorem. □
By targeting high utilization of a GPU, if 6nh is less than C, we will have a very fast O(log n) execution time. Thus, n, h, and C determine the system's scalability. The calculation also applies to FPGA implementations with parallel processing elements.

3) HE IMPLEMENTATIONS IN FPGAs
On the other hand, FPGAs also offer area- and speed-efficient implementations for homomorphic encryption (HE), as presented in [117], [127], [132], [133], [134], [135], [136], [137], [138], and [139]. The implementation of HE in FPGA also consists of 3 approaches. First, pre/post-processing is usually required in the NTT/INTT by using the RNS-CRT method and modular multipliers such as Barrett, Montgomery, or LUT-based reduction [127], [132], [133], [134], [137], [139]. The second is the parallel implementation of processing elements (PEs) or butterfly units (BUs), in serial or pipeline parallelization [127], [132], [133], [134], [135], [138], [139]. The last is optimization through reconfigurable designs that implement custom PEs or instructions using RISC-V [117], [132], [138], [139].
Figures 17 and 18 show the NTT/INTT architecture implemented in FPGAs. In [135], the butterfly unit array is constructed with 8 × 4 arrays as shown in Figure 17. The architecture is capable of transforming 16 coefficients in the pipeline. The work is targeted for the RNS-CKKS scheme.

TABLE 9. Each modulus calculation of NWC via NTT/INTT.

FIGURE 18. NTT-INTT architecture implementation with custom RISC-V instructions [138].

FIGURE 19. ReMCA architecture implementation for NTT, INTT and modular operations [139].
In [138], an NTT/INTT architecture is proposed with custom instructions for a Linux-ready RISC-V core. The number of processing elements (PEs) is varied. The BRAMs have three parameters, i.e., depth, width, and number of PEs. Increasing the number of PEs reduces the depth of the memories. However, more memories need to be instantiated. The architecture in Figure 18 uses one-dimensional PEs, while the architecture in Figure 17 uses two-dimensional PEs in the form of butterfly units (BUs). Regarding the speed of execution, the architecture in [135] is better, while in terms of programmability and area, the architecture in [138] is better. Another work [117] also proposes a RISC-V architecture extension for NTT without additional hardware modification.
Figure 19 shows a unified, dynamically reconfigurable architecture that can be switched among several modes, i.e., a butterfly unit for NTT (CT-form), a butterfly unit for INTT (GS-form), and a modular multiplication. The work implements a BFV algorithm with NTT/INTT pre/post-processing with RNS.
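The three operating modes named above (CT-form butterfly, GS-form butterfly, and modular multiplication) can be written down in a few lines. The sketch below is a software illustration only, not the datapath of [139]; the modulus q = 7681, the twiddle value used in the usage lines, and the function names are assumptions, and Barrett reduction is used as one of the reduction options mentioned for FPGA designs.

```python
Q = 7681                        # example prime modulus (assumed for this sketch)
K = Q.bit_length()              # 13 bits
MU = (1 << (2 * K)) // Q        # precomputed Barrett constant floor(4^K / Q)

def barrett_reduce(x):
    """Reduce 0 <= x < Q*Q modulo Q using a multiply and a shift instead of a division."""
    t = (x * MU) >> (2 * K)     # estimate of floor(x / Q); never overshoots
    r = x - t * Q
    while r >= Q:               # small final correction
        r -= Q
    return r

def mod_mul(a, b):
    """Modular multiplication mode."""
    return barrett_reduce(a * b)

def ct_butterfly(a, b, w):
    """Cooley-Tukey form (NTT): multiply by the twiddle first, then add and subtract."""
    t = mod_mul(b, w)
    return (a + t) % Q, (a - t) % Q

def gs_butterfly(a, b, w):
    """Gentleman-Sande form (INTT): add and subtract first, then multiply by the twiddle."""
    return (a + b) % Q, mod_mul((a - b) % Q, w)

# A GS butterfly with the inverse twiddle undoes a CT butterfly up to a factor of 2.
w = 62
hi, lo = ct_butterfly(1234, 5678, w)
back = gs_butterfly(hi, lo, pow(w, -1, Q))
```

Both butterflies share one modular multiplier, one adder, and one subtractor; only the order of operations differs, which is what makes a single reconfigurable unit plausible.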
FIGURE 20. ICRT architecture implementation [136].

From Subsection III-E we know that we can choose a composite modulus instead of a prime. A large composite modulus $Q$ can be factored into a product of small primes $q_i$ such that $Q = \prod_{i=0}^{l-1} q_i$, where $l$ is the number of small primes. An example implementation of RNS based on CRT is proposed in [136] as depicted in Figure 20. It reconstructs the residues modulo $p_1$ and $p_2$ into a value modulo the larger modulus $p_1 p_2$. Note that reconstructing a modulus from two smaller moduli requires 3 pipeline stages.
The biggest advantage of breaking down into RNS representations is that the NTT/INTT for each modulus can be calculated independently, without any dependency on one another. This advantage makes it easy to parallelize NTT/INTT with a composite modulus. We provide a toy example of using NTT, RNS, and CRT combinations in Example 6.1.
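The independence of the residue channels can also be checked numerically. The sketch below reuses the inputs and moduli of Example 6.1; a schoolbook negacyclic convolution stands in for the per-modulus NTT/INTT pipeline, and the helper names are our own assumptions.

```python
from math import prod

def negacyclic_convolution(a, b, q):
    """Product of a and b in Z_q[x]/(x^n + 1), computed directly."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c

def to_rns(x, moduli):
    """Residue number system representation of x."""
    return tuple(x % m for m in moduli)

def crt(residues, moduli):
    """Chinese Remainder Theorem reconstruction for pairwise coprime moduli."""
    big_q = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        m_i = big_q // m
        x += r * m_i * pow(m_i, -1, m)   # pow(..., -1, m) is the modular inverse
    return x % big_q

MODULI = (6841, 7681, 8681)              # Q = 456149404001
A = [123456, 7891011, 121314, 151617]
B = [181920, 212223, 232425, 262728]

# Each channel only sees residues modulo its own small prime.
channels = [
    negacyclic_convolution([x % m for x in A], [x % m for x in B], m)
    for m in MODULI
]
# Recombine the i-th coefficient of every channel into one value modulo Q.
C = [crt(residues, MODULI) for residues in zip(*channels)]
```

The three channel computations share no data, so they can run on separate cores or processing elements; only the final `crt` step combines them.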


TABLE 10. Summary of NTT hardware implementations for homomorphic encryption.


Example 6.1: Let A = [123456, 7891011, 121314, 151617] and B = [181920, 212223, 232425, 262728] in the ring $\mathbb{Z}_Q$ with Q = 456149404001. We need to find the negacyclic convolution of A and B.
Calculating the results using previously explained methods in normal order: Using ψ = 12967992388, we can get:

$\mathrm{NTT}_\psi(A)$ = [164909637252, 371837718802, 52022178059, 323529767713]
$\mathrm{NTT}_\psi(B)$ = [94256621661, 54777633553, 418495999451, 344769281017]

Element-wise multiplication between the two (modulo Q) yields:

$\mathrm{NTT}_\psi(A) \circ \mathrm{NTT}_\psi(B)$ = [108709440146, 11392463004, 411682833192, 146789569562]

Taking the INTT of the result yields the negacyclic convolution between A and B:

$\mathrm{INTT}_{\psi^{-1}}(\mathrm{NTT}_\psi(A) \circ \mathrm{NTT}_\psi(B))$ = [169643576476, 26172545988, 317135487954, 95233749301]

Therefore the negacyclic convolution between them is [169643576476, 26172545988, 317135487954, 95233749301].
Calculating the results using RNS and CRT: Notice that the modulus Q is a composite number that can be factored into three primes $Q = q_1 \times q_2 \times q_3$, where $q_1 = 6841$, $q_2 = 7681$, and $q_3 = 8681$. We can use the modulus factors as the moduli set for a residue number system: {6841, 7681, 8681}. We can represent each element in A and B in the RNS representation. Take the element 123456 as an example:

123456 ≡ 318 mod 6841,
123456 ≡ 560 mod 7681,
123456 ≡ 1922 mod 8681

Therefore 123456 can be represented as (318, 560, 1922) in the RNS representation based on our chosen moduli. Hence, transforming A and B into the RNS representation with our chosen moduli:

A = [(318, 560, 1922), (3338, 2624, 8663), (5017, 6099, 8461), (1115, 5678, 4040)]
B = [(4054, 5257, 8300), (152, 4836, 3879), (6672, 1995, 6719), (2770, 1574, 2289)]

NTT-INTT can calculate the negacyclic convolution for each modulus in the set. Table 9 shows the calculation details. From the last column, we get the RNS representation of the negacyclic convolution between A and B:

C = [(129, 4265, 4017), (1912, 7029, 8106), (6335, 1887, 6657), (3594, 2848, 2055)]

Converting each element back to the normal representation can be done using the Chinese Remainder Theorem. Taking the element (129, 4265, 4017) as an example, we need to find x from the following system of equations:

x ≡ 129 mod 6841,
x ≡ 4265 mod 7681,
x ≡ 4017 mod 8681

This is a classical textbook problem for CRT. Solving it, we obtain x = 169643576476. Transforming all of C back to the normal representation yields:

C = [169643576476, 26172545988, 317135487954, 95233749301]

which is the same result as the negacyclic convolution when calculated directly.

D. IMPLEMENTATION COMPARISON
Finally, to close this section, Table 10 summarizes the implementation comparisons between various researchers in the HE schemes based on their novelty claims on NTT/INTT implementation algorithms, target device or hardware, their presented NTT/INTT implementations, and how they implement the modular reduction.

VII. CONCLUSION AND FUTURE WORKS
A. CONCLUSION
We reviewed the concepts of the Number Theoretic Transform (NTT) and its inverse (INTT). We also provided a comprehensive survey of their implementation in the Post-Quantum Cryptographic (PQC) schemes standardized by NIST and in Homomorphic Encryption (HE). In summary, we conclude that:
1) We comprehensively introduced the concepts of NTT/INTT and the other concepts surrounding it. Many other pieces of literature briefly introduce the concepts, but they are scattered, requiring significant effort to learn. Our report should be helpful, especially for those who are beginning research in the area and come from the engineering or implementation side.
2) We provided consistent, small, and understandable toy examples across the different concepts and algorithms to further enhance the conceptual understanding of the NTT/INTT.
3) We summarized and provided a comprehensive review of the recent research on the NTT/INTT implementations for Post Quantum Cryptography (PQC) schemes in various platforms such as FPGAs, GPUs, and various embedded systems.
4) Similarly, we also summarized and provided a comprehensive review of another use case of NTT/INTT implementations, for Homomorphic Encryption (HE) schemes, including its optimizations, such as the combinations of NTT-RNS-CRT for its parallelization.


B. FUTURE WORKS
While NTT/INTT is useful for many applications in Post-Quantum Cryptography and Homomorphic Encryption, most researchers currently do not consider the secure implementation of NTT/INTT. Side-channel attacks have increasingly become a concern because they can use leaked information to recover some secrets of cryptographic schemes.
One study suggests that all operations should follow a constant-time implementation strategy to avoid timing attacks [57]. There are also known types of attacks on NTT/INTT implementations, including single trace attacks, simple power analysis, and fault attacks [4], [143], [144].
Therefore, we suggest that future work should also consider the secure implementation of NTT/INTT.

REFERENCES
[1] P. W. Shor, ‘‘Algorithms for quantum computation: Discrete logarithms and factoring,’’ in Proc. 35th Annu. Symp. Found. Comput. Sci., Nov. 1994, pp. 124–134.
[2] P. W. Shor, ‘‘Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer,’’ SIAM Rev., vol. 41, no. 2, pp. 303–332, Jan. 1999.
[3] J. A. Buchmann, D. Butin, F. Göpfert, and A. Petzoldt, ‘‘Post-quantum cryptography: State of the art,’’ in The New Codebreakers: Essays Dedicated to David Kahn Occasion His 85th Birthday. Berlin, Germany: Springer, 2016, pp. 88–108.
[4] H. Nejatollahi, N. Dutt, S. Ray, F. Regazzoni, I. Banerjee, and R. Cammarota, ‘‘Post-quantum lattice-based cryptography implementations: A survey,’’ ACM Comput. Surv., vol. 51, no. 6, pp. 1–41, Nov. 2019.
[5] L. Chen, D. Moody, and Y. Liu, ‘‘NIST post-quantum cryptography standardization,’’ Transition, vol. 800, no. 131A, p. 164, 2017.
[6] G. Alagic, ‘‘Status report on the first round of the NIST post-quantum cryptography standardization process,’’ 2019.
[7] G. Alagic, Status Report on the Second Round of the NIST Post-Quantum Cryptography Standardization Process, vol. 2. Gaithersburg, MD, USA: U.S. Department of Commerce, NIST, 2020.
[8] G. Alagic, Status Report on the Third Round of the NIST Post-Quantum Cryptography Standardization Process. Gaithersburg, MD, USA: U.S. Department of Commerce, NIST, 2022.
[9] NIST. Selected Algorithms 2022. Accessed: Mar. 8, 2023. [Online]. Available: https://csrc.nist.gov/Projects/post-quantum-cryptography/selected-algorithms-2022
[10] L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe, G. Seiler, and D. Stehlé, ‘‘CRYSTALS-Dilithium: A lattice-based digital signature scheme,’’ IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2018, pp. 238–268, Feb. 2018.
[11] V. Lyubashevsky, L. Ducas, E. Kiltz, T. Lepoint, P. Schwabe, G. Seiler, D. Stehlé, and S. Bai, ‘‘CRYSTALS-Dilithium,’’ in Algorithm Specifications Supporting Documentation, 2020.
[12] P.-A. Fouque, ‘‘FALCON: Fast-Fourier lattice-based compact signatures over NTRU,’’ in Submission to NIST’s Post-Quantum Cryptography Standardization Process, vol. 36, no. 5, 2018.
[13] R. Avanzi, J. Bos, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, J. M. Schanck, P. Schwabe, G. Seiler, and D. Stehlé, ‘‘CRYSTALS–Kyber algorithm specifications and supporting documentation (version 1.0),’’ NIST Post-Quantum Cryptogr. Standardization Process, vol. 2, no. 4, pp. 1–43, 2017.
[14] J. Bos, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, J. M. Schanck, P. Schwabe, G. Seiler, and D. Stehle, ‘‘CRYSTALS–Kyber: A CCA-secure module-lattice-based KEM,’’ in Proc. IEEE Eur. Symp. Secur. Privacy (EuroS&P), Apr. 2018, pp. 353–367.
[15] R. Avanzi, J. Bos, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, J. M. Schanck, P. Schwabe, G. Seiler, and D. Stehlé, ‘‘CRYSTALS–Kyber algorithm specifications and supporting documentation (version 2.0),’’ NIST Post-Quantum Cryptogr. Standardization Process, to be published.
[16] R. Avanzi, J. Bos, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, J. M. Schanck, P. Schwabe, G. Seiler, and D. Stehlé, ‘‘CRYSTALS–Kyber algorithm specifications and supporting documentation (version 3.0),’’ NIST Post-Quantum Cryptogr. Standardization Process, to be published.
[17] V. Migliore, M. M. Real, V. Lapotre, A. Tisserand, C. Fontaine, and G. Gogniat, ‘‘Exploration of polynomial multiplication algorithms for homomorphic encryption schemes,’’ in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Dec. 2015, pp. 1–6.
[18] A. Karatsuba and Y. Ofman, ‘‘Multiplication of many-digital numbers by automatic computers,’’ Doklady Akademii Nauk SSSR, vol. 145, no. 2, pp. 293–294, 1962.
[19] A. Weimerskirch and C. Paar, ‘‘Generalizations of the Karatsuba algorithm for efficient implementations,’’ Cryptol. ePrint Arch., to be published.
[20] A. L. Toom, ‘‘The complexity of a scheme of functional elements realizing the multiplication of integers,’’ Soviet Math. Doklady, vol. 3, no. 4, pp. 714–716, 1963.
[21] S. A. Cook and S. O. Aanderaa, ‘‘On the minimum computation time of functions,’’ Trans. Amer. Math. Soc., vol. 142, pp. 291–314, Aug. 1969.
[22] E. Chu and A. George, Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms. Boca Raton, FL, USA: CRC Press, 1999.
[23] J. W. Cooley and J. W. Tukey, ‘‘An algorithm for the machine calculation of complex Fourier series,’’ Math. Comput., vol. 19, no. 90, pp. 297–301, 1965.
[24] W. M. Gentleman and G. Sande, ‘‘Fast Fourier transforms: For fun and profit,’’ in Proc. November 7–10, Fall Joint Comput. Conf. XX AFIPS (Fall), 1966, pp. 563–578.
[25] T. Jebelean, ‘‘Practical integer division with Karatsuba complexity,’’ in Proc. Int. Symp. Symbolic Algebr. Comput. (ISSAC), 1997, pp. 339–341.
[26] M. Bodrato, ‘‘Towards optimal Toom–Cook multiplication for univariate and multivariate polynomials in characteristic 2 and 0,’’ in Proc. Int. Workshop Arithmetic Finite Fields (WAIFI), Madrid, Spain: Springer, Jun. 2007, pp. 116–133.
[27] R. C. Agarwal and C. S. Burrus, ‘‘Number theoretic transforms to implement fast digital convolution,’’ Proc. IEEE, vol. 63, no. 4, pp. 550–560, Apr. 1975.
[28] H. Nussbaumer, ‘‘Fast polynomial transform algorithms for digital convolution,’’ IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-28, no. 2, pp. 205–215, Apr. 1980.
[29] J. M. Pollard, ‘‘The fast Fourier transform in a finite field,’’ Math. Comput., vol. 25, no. 114, pp. 365–374, 1971.
[30] C. Gauss, Theoria Interpolationis Methodo Nova Tractata Werke Band 3, 265–327. Göttingen, Germany: Königliche Gesellschaft der Wissenschaften, 1886.
[31] D. Harvey and J. van der Hoeven, ‘‘Integer multiplication in time O(n log n),’’ Ann. Math., vol. 193, no. 2, pp. 563–617, Mar. 2021.
[32] E. W. Weisstein. (2015). Fast Fourier Transform. [Online]. Available: https://mathworld.wolfram.com/
[33] Z. Liang and Y. Zhao, ‘‘Number theoretic transform and its applications in lattice-based cryptosystems: A survey,’’ 2022, arXiv:2211.13546.
[34] P. B. Modak, ‘‘Implementation of an RNS based sequential NTT convolver,’’ Tech. Rep., 1982.
[35] H. Krishna, K.-Y. Lin, and B. Krishna, ‘‘Rings, fields, the Chinese remainder theorem and an extension-Part II: Applications to digital signal processing,’’ IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 41, no. 10, pp. 656–668, 1994.
[36] J. Fan and F. Vercauteren, ‘‘Somewhat practical fully homomorphic encryption,’’ Cryptol. ePrint Arch., to be published.
[37] Z. Brakerski and V. Vaikuntanathan, ‘‘Fully homomorphic encryption from ring-LWE and security for key dependent messages,’’ in Proc. Annual Cryptol. Conf., Santa Barbara, CA, USA: Springer, Aug. 2011, pp. 505–524.
[38] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, ‘‘(Leveled) fully homomorphic encryption without bootstrapping,’’ ACM Trans. Comput. Theory, vol. 6, no. 3, pp. 1–36, Jul. 2014.
[39] J. H. Cheon, A. Kim, M. Kim, and Y. Song, ‘‘Homomorphic encryption for arithmetic of approximate numbers,’’ in Proc. Int. Conf. Theory Appl. Cryptol. Inf. Secur. Hong Kong: Springer, Dec. 2017, pp. 409–437.
[40] K. Laine and R. Player, ‘‘Simple encrypted arithmetic library-SEAL (v2.0),’’ Tech. Rep., 2016.


[41] H. Chen, K. Laine, and R. Player, ‘‘Simple encrypted arithmetic library- [66] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, and L. Liu, ‘‘Highly
SEAL v2. 1,’’ in Proc. Int. Conf. Financial Cryptogr. Data Secur. Sliema, efficient architecture of NewHope-NIST on FPGA using low-complexity
Malta: Springer, Apr. 2017, pp. 3–18. NTT/INTT,’’ IACR Trans. Cryptograph. Hardw. Embedded Syst.,
[42] H. J. Nussbaumer, ‘‘Elements of number theory and polynomial algebra,’’ vol. 2020, pp. 49–72, Mar. 2020.
in Fast Fourier Transform and Convolution Algorithms, 1982, pp. 4–31. [67] G. Mao, D. Chen, G. Li, W. Dai, A. I. Sanka, Ç. K. Koç,
[43] Convolution and Polynomial Multiplication in MATLAB. Accessed: and R. C. C. Cheung, ‘‘High-performance and configurable SW/HW
May 2, 2023. [Online]. Available: https://www.mathworks.com/help/ Co-design of post-quantum signature CRYSTALS-Dilithium,’’ ACM
MATLAB/ref/conv.html Trans. Reconfigurable Technol. Syst., vol. 16, no. 3, pp. 1–28, Sep. 2023.
[44] Numpy Convolution. Accessed: May 2, 2023. [Online]. Available: [68] N. Gupta, A. Jati, A. Chattopadhyay, and G. Jha, ‘‘Lightweight hardware
https://numpy.org/doc/stable/reference/generated/numpy.convolve.html accelerator for post-quantum digital signature CRYSTALS-Dilithium,’’
[45] Modulo-n Circular Convolution in MATLAB. Accessed: May 2, 2023. IACR Cryptol. ePrint Arch., vol. 2022, p. 496, Jan. 2022.
[Online]. Available: https://www.mathworks.com/help/signal/ref/cconv. [69] J. Zheng, F. He, S. Shen, C. Xue, and Y. Zhao, ‘‘Parallel small polynomial
html multiplication for Dilithium: A faster design and implementation,’’ in
[46] I. Syafalni, G. Jonatan, N. Sutisna, R. Mulyawan, and T. Adiono, ‘‘Effi- Proc. 38th Annu. Comput. Secur. Appl. Conf., Dec. 2022, pp. 304–317.
cient homomorphic encryption accelerator with integrated PRNG using [70] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin,
low-cost FPGA,’’ IEEE Access, vol. 10, pp. 7753–7771, 2022. S. Wei, and L. Liu, ‘‘A compact and high-performance hardware archi-
[47] Sympy 1.11 Documentation. Accessed: May 2, 2023. [Online]. Available: tecture for CRYSTALS-Dilithium,’’ IACR Trans. Cryptograph. Hardw.
https://docs.sympy.org/latest/modules/ntheory.html#sympy.ntheory. Embedded Syst., vol. 2022, pp. 270–295, Nov. 2021.
residue_ntheory.nthroot_mod [71] S. He and M. Torkelson, ‘‘A new approach to pipeline FFT processor,’’ in
[48] A. Schönhage and V. Strassen, ‘‘Schnelle multiplikation grosser Zahlen,’’ Proc. Int. Conf. Parallel Process., Apr. 1996, pp. 766–770.
Computing, vol. 7, nos. 3–4, pp. 281–292, 1971. [72] R. I. Hartley, ‘‘Subexpression sharing in filters using canonic signed digit
[49] V. S. Dimitrov, T. V. Cooklev, and B. D. Donevsky, ‘‘Generalized multipliers,’’ IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process.,
Fermat–Mersenne number theoretic transform,’’ IEEE Trans. Circuits vol. 43, no. 10, pp. 677–688, Oct. 1996.
Syst. II, Analog Digit. Signal Process., vol. 41, no. 2, pp. 133–139, [73] G. Land, P. Sasdrich, and T. Güneysu, ‘‘A hard CRYSTAL—
Feb. 1994. Implementing Dilithium on reconfigurable hardware,’’ in Proc. Int. Conf.
[50] C. M. Rader, ‘‘Discrete convolutions via Mersenne transrorms,’’ IEEE Smart Card Res. Adv. Appl. Lübeck, Germany: Springer, Nov. 2022,
Trans. Comput., vol. C-100, no. 12, pp. 1269–1273, Dec. 1972. pp. 210–230.
[51] P. Heckbert, ‘‘Fourier transforms and the fast Fourier transform (FFT) [74] Z. Zhou, D. He, Z. Liu, M. Luo, and K.-K.-R. Choo, ‘‘A soft-
algorithm,’’ Comput. Graph., vol. 2, p. 463, Feb. 1995. ware/hardware co-design of CRYSTALS-Dilithium signature scheme,’’
ACM Trans. Reconfigurable Technol. Syst., vol. 14, no. 2, pp. 1–21,
[52] A. Saidi, ‘‘Decimation-in-time-frequency FFT algorithm,’’ in Proc. IEEE
Jun. 2021.
Int. Conf. Acoust., Speech Signal Process., Apr. 1994, p. 453.
[75] D. T. Nguyen, V. B. Dang, and K. Gaj, ‘‘A high-level synthesis approach
[53] P. Barrett, ‘‘Implementing the Rivest Shamir and Adleman public key
to the software/hardware codesign of NTT-based post-quantum cryptog-
encryption algorithm on a standard digital signal processor,’’ in Proc.
raphy algorithms,’’ in Proc. Int. Conf. Field-Program. Technol. (ICFPT),
Conf. Theory Appl. Cryptograph. Techn. Cham, Switzerland: Springer,
Dec. 2019, pp. 371–374.
2000, pp. 311–323.
[76] P. Longa and M. Naehrig, ‘‘Speeding up the number theoretic transform
[54] T. Wu, S. Li, and L. Liu, ‘‘Modular multiplier by folding Barrett modular
for faster ideal lattice-based cryptography,’’ in Proc. Int. Conf. Cryptol.
reduction,’’ in Proc. IEEE 11th Int. Conf. Solid-State Integr. Circuit
Netw. Secur. Milan, Italy: Springer, Nov. 2016, pp. 124–139.
Technol., Oct. 2012, pp. 1–3.
[77] L. Beckwith, D. T. Nguyen, and K. Gaj, ‘‘High-performance hardware
[55] L. Hars, ‘‘Long modular multiplication for cryptographic applications,’’ implementation of CRYSTALS-Dilithium,’’ in Proc. Int. Conf. Field-
in Proc. CHES. Cham, Switzerland: Springer, 2004, pp. 45–61. Program. Technol. (ICFPT), Dec. 2021, pp. 1–10.
[56] P. L. Montgomery, ‘‘Modular multiplication without trial division,’’ Math. [78] K. D. Ortega L. and L. J. Dominguez Perez, ‘‘Implementing CRYSTAL-
Comput., vol. 44, no. 170, pp. 519–521, 1985. Dilithium on FRDM-K64,’’ in Proc. IEEE 12th Annu. Ubiquitous
[57] C. K. Koc, T. Acar, and B. S. Kaliski, ‘‘Analyzing and comparing Comput., Electron. Mobile Commun. Conf. (UEMCON), Dec. 2021,
Montgomery multiplication algorithms,’’ IEEE Micro, vol. 16, no. 3, pp. 178–183.
pp. 26–33, Jun. 1996. [79] S. Ricci, L. Malina, P. Jedlicka, D. Smékal, J. Hajny, P. Cibik, P. Dzurenda,
[58] T. T. Nguyen, S. Kim, Y. Eom, and H. Lee, ‘‘Area-time efficient hardware and P. Dobias, ‘‘Implementing CRYSTALS-Dilithium signature scheme
architecture for CRYSTALS–Kyber,’’ Appl. Sci., vol. 12, no. 11, p. 5305, on FPGAs,’’ in Proc. 16th Int. Conf. Availability, Rel. Secur., Aug. 2021,
May 2022. pp. 1–11.
[59] A. Abdulrahman, V. Hwang, M. J. Kannwischer, and A. Sprenkels, [80] H. Becker, V. Hwang, M. J. Kannwischer, B.-Y. Yang, and S.-Y. Yang,
‘‘Faster Kyber and Dilithium on the Cortex-M4,’’ in Proc. Int. Conf. Appl. ‘‘Neon NTT: Faster Dilithium, Kyber, and saber on Cortex-A72
Cryptogr. Netw. Secur. Rome, Italy: Springer, Jun. 2022, pp. 853–871. and Apple M1,’’ IACR Trans. Cryptograph. Hardw. Embedded Syst.,
[60] F. Yaman, A. C. Mert, E. Öztürk, and E. Savas, ‘‘A hardware accel- vol. 2022, pp. 221–244, Nov. 2021.
erator for polynomial multiplication operation of CRYSTALS-KYBER [81] A. Basso, F. Aydin, D. Dinu, J. Friel, A. Varna, M. Sastry, and S. Ghosh,
PQC scheme,’’ in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), ‘‘Where star wars meets star trek: Saber and Dilithium on the same
Feb. 2021, pp. 1020–1025. polynomial multiplier,’’ Cryptol. ePrint Arch., to be published.
[61] A. Aikata, A. C. Mert, M. Imran, S. Pagliarini, and S. S. Roy, ‘‘KaLi: [82] D. O. C. Greconici, M. J. Kannwischer, and D. Sprenkels, ‘‘Compact
A crystal for post-quantum security using Kyber and Dilithium,’’ IEEE Dilithium implementations on Cortex-M3 and Cortex-M4,’’ IACR Trans.
Trans. Circuits Syst. I, Reg. Papers, vol. 70, no. 2, pp. 747–758, Feb. 2023. Cryptograph. Hardw. Embedded Syst., vol. 2021, pp. 1–24, Dec. 2020.
[62] Y. Kim, J. Song, T.-Y. Youn, and S. C. Seo, ‘‘CRYSTALS–Dilithium on [83] V. B. Dang, K. Mohajerani, and K. Gaj, ‘‘High-speed hardware archi-
ARMv8,’’ Secur. Commun. Netw., vol. 2022, pp. 1–12, Feb. 2022. tectures and FPGA benchmarking of CRYSTALS-Kyber, NTRU, and
[63] X. Chen, B. Yang, Y. Lu, S. Yin, S. Wei, and L. Liu, ‘‘Efficient saber,’’ IEEE Trans. Comput., vol. 72, no. 2, pp. 306–320, Feb. 2023.
access scheme for multi-bank based NTT architecture through con- [84] M. Knezevic, F. Vercauteren, and I. Verbauwhede, ‘‘Faster interleaved
flict graph,’’ in Proc. 59th ACM/IEEE Design Autom. Conf., Jul. 2022, modular multiplication based on Barrett and Montgomery reduction
pp. 91–96. methods,’’ IEEE Trans. Comput., vol. 59, no. 12, pp. 1715–1721,
[64] T. Wang, C. Zhang, P. Cao, and D. Gu, ‘‘Efficient implementation Dec. 2010.
of Dilithium signature scheme on FPGA SoC platform,’’ IEEE Trans. [85] Y. Zhao, R. Xie, G. Xin, and J. Han, ‘‘A high-performance domain-
Very Large Scale Integr. (VLSI) Syst., vol. 30, no. 9, pp. 1158–1171, specific processor with matrix extension of RISC-V for module-LWE
Sep. 2022. applications,’’ IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 7,
[65] X. Chen, B. Yang, S. Yin, S. Wei, and L. Liu, ‘‘CFNTT: Scalable radix-2/4 pp. 2871–2884, Jul. 2022.
NTT multiplication architecture with an efficient conflict-free memory [86] D. W. Kim, D. I. Maulana, and W. Jung, ‘‘Kyber accelerator on FPGA
mapping scheme,’’ IACR Trans. Cryptograph. Hardw. Embedded Syst., using energy-efficient LUT-based Barrett reduction,’’ in Proc. 19th Int.
vol. 2022, pp. 94–126, Nov. 2021. SoC Design Conf. (ISOCC), Oct. 2022, pp. 83–84.

70314 VOLUME 11, 2023


A. Satriawan et al.: Conceptual Review on NTT and Comprehensive Review on Its Implementations

[87] L. Wan, F. Zheng, G. Fan, R. Wei, L. Gao, Y. Wang, J. Lin, and J. Dong, [107] K. Yao, D. Kundi, C. Wang, M. O’Neill, and W. Liu, ‘‘Towards
‘‘A novel high-performance implementation of CRYSTALS-Kyber with CRYSTALS-Kyber: A M-LWE cryptoprocessor with area-time trade-
ai accelerator,’’ in Proc. Eur. Symp. Res. Comput. Secur. Copenhagen, off,’’ in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–5.
Denmark: Springer, Sep. 2022, pp. 514–534. [108] C. Zhang, D. Liu, X. Liu, X. Zou, G. Niu, B. Liu, and Q. Jiang, ‘‘Towards
[88] H. Nguyen and L. Tran, ‘‘Design of polynomial NTT and INTT acceler- efficient hardware implementation of NTT for Kyber on FPGAs,’’ in
ator for post-quantum cryptography CRYSTALS-Kyber,’’ Arabian J. Sci. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–5.
Eng., vol. 48, pp. 1527–1536, Feb. 2022. [109] Y. Huang, M. Huang, Z. Lei, and J. Wu, ‘‘A pure hardware implementation
[89] P. Sanal, E. Karagoz, H. Seo, R. Azarderakhsh, and of CRYSTALS-KYBER PQC algorithm through resource reuse,’’ IEICE
M. Mozaffari-Kermani, ‘‘Kyber on ARM64: Compact implementations Electron. Exp., vol. 17, no. 17, 2020, Art. no. 20200234.
of Kyber on 64-bit arm cortex-a processors,’’ in Proc. Int. Conf. Secur. [110] Z. Chen, Y. Ma, T. Chen, J. Lin, and J. Jing, ‘‘Towards efficient Kyber on
Privacy Commun. Syst. Cham, Switzerland: Springer, Sep. 2021, FPGAs: A processor for vector of polynomials,’’ in Proc. 25th Asia South
pp. 424–440. Pacific Design Autom. Conf. (ASP-DAC), Jan. 2020, pp. 247–252.
[90] J. N. Ortiz, F. C. Rodrigues, D. G. Filho, C. Teixeira, J. López, and [111] L. Botros, M. J. Kannwischer, and P. Schwabe, ‘‘Memory-efficient high-
R. Dahab, ‘‘Evaluation of CRYSTALS-Kyber and saber on the ARMv8 speed implementation of Kyber on Cortex-m4,’’ in Proc. Int. Conf.
architecture,’’ in Proc. Anais do XXII Simpósio Brasileiro em Segurança Cryptol. Africa. Rabat, Morocco: Springer, Jul. 2019, pp. 209–228.
da Informação e de Sistemas Computacionais, 2022, pp. 372–377. [112] Y. Kim, J. Song, and S. C. Seo, ‘‘Accelerating falcon on ARMv8,’’ IEEE
[91] J. Huang, J. Zhang, H. Zhao, Z. Liu, R. C. C. Cheung, Ç. K. Koç, Access, vol. 10, pp. 44446–44460, 2022.
and D. Chen, ‘‘Improved plantard arithmetic for lattice-based cryptog- [113] W.-K. Lee, R. K. Zhao, R. Steinfeld, A. Sakzad, and S. O. Hwang,
raphy,’’ IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2022, ‘‘High throughput lattice-based signatures on GPUs: Comparing Falcon
pp. 614–636, Aug. 2022. and Mitaka,’’ Cryptol. ePrint Arch., to be published.
[92] T. Plantard, ‘‘Efficient word size modular arithmetic,’’ IEEE Trans. [114] (2023). Australian Research Data Commons Nectar Research Cloud
Emerg. Topics Comput., vol. 9, no. 3, pp. 1506–1518, Jul. 2021. System. [Online]. Available: https://ardc.edu.au/services/
[93] Z. Ye, R. C. C. Cheung, and K. Huang, ‘‘PipeNTT: A pipelined number [115] P. Karl, J. Schupp, T. Fritzmann, and G. Sigl, ‘‘Post-quantum signatures
theoretic transform architecture,’’ IEEE Trans. Circuits Syst. II, Exp. on RISC-V with hardware acceleration,’’ ACM Trans. Embedded Comput.
Briefs, vol. 69, no. 10, pp. 4068–4072, Oct. 2022. Syst., to be published.
[94] M. Li, J. Tian, X. Hu, and Z. Wang, ‘‘Reconfigurable and high-efficiency [116] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi,
polynomial multiplication accelerator for CRYSTALS-Kyber,’’ IEEE E. Flamand, F. K. Gürkaynak, and L. Benini, ‘‘Near-threshold RISC-V
Trans. Comput.-Aided Design Integr. Circuits Syst., early access, core with DSP extensions for scalable IoT endpoint devices,’’ IEEE Trans.
Dec. 19, 2022, doi: 10.1109/TCAD.2022.3230359. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2700–2713,
[95] D. Kundi, Y. Zhang, C. Wang, A. Khalid, M. O’Neill, and W. Liu, Oct. 2017.
‘‘Ultra high-speed polynomial multiplications for lattice-based cryptog-
[117] E. Karabulut and A. Aysu, ‘‘RANTT: A RISC-V architecture extension
raphy on FPGAs,’’ IEEE Trans. Emerg. Topics Comput., vol. 10, no. 4,
for the number theoretic transform,’’ in Proc. 30th Int. Conf. Field-
pp. 1993–2005, Oct. 2022.
Program. Log. Appl. (FPL), Aug. 2020, pp. 26–32.
[96] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani,
[118] D. Harvey and J. van der Hoeven, ‘‘Polynomial multiplication over finite
‘‘A monolithic hardware implementation of Kyber: Comparing apples to
fields in time O(n log n,’’ J. ACM, vol. 69, no. 2, pp. 1–40, Apr. 2022.
apples in PQC candidates,’’ in Proc. Int. Conf. Cryptol. Inf. Secur. Latin
[119] L. Ducas and D. Micciancio, ‘‘FHEW: Bootstrapping homomorphic
Amer. Bogotá, Colombia: Springer, 2021, pp. 108–126.
encryption in less than a second,’’ in Proc. Annu. Int. Conf. Theory Appl.
[97] Y. Xing and S. Li, ‘‘A compact hardware implementation of CCA-secure
Cryptograph. Techn. Sofia, Bulgaria: Springer, Apr. 2015, pp. 617–640.
key exchange mechanism CRYSTALS-KYBER on FPGA,’’ IACR Trans.
Cryptograph. Hardw. Embedded Syst., vol. 2021, pp. 328–356, Feb. 2021. [120] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, ‘‘TFHE: Fast
fully homomorphic encryption over the torus,’’ J. Cryptol., vol. 33, no. 1,
[98] P. Nannipieri, S. Di Matteo, L. Zulberti, F. Albicocchi, S. Saponara,
pp. 34–91, Jan. 2020.
and L. Fanucci, ‘‘A RISC-V post quantum cryptography instruction set
extension for number theoretic transform to speed-up CRYSTALS algo- [121] C. Gentry, A. Sahai, and B. Waters, ‘‘Homomorphic encryption
rithms,’’ IEEE Access, vol. 9, pp. 150798–150808, 2021. from learning with errors: Conceptually-simpler, asymptotically-faster,
attribute-based,’’ in Proc. Annual Cryptol. Conf. Santa Barbara, CA,
[99] W. Guo, S. Li, and L. Kong, ‘‘An efficient implementation of Kyber,’’
USA: Springer, Aug. 2013, pp. 75–92.
IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 69, no. 3, pp. 1562–1566,
Mar. 2022. [122] S. Halevi and V. Shoup, ‘‘Design and implementation of helib: A homo-
[100] L. Zhao, J. Zhang, J. Huang, Z. Liu, and G. Hancke, ‘‘Efficient imple- morphic encryption library,’’ Cryptol. ePrint Arch., to be published.
mentation of Kyber on mobile devices,’’ in Proc. IEEE 27th Int. Conf. [123] A. Al Badawi et al., ‘‘OpenFHE: Open-source fully homomorphic
Parallel Distrib. Syst. (ICPADS), Dec. 2021, pp. 506–513. encryption library,’’ in Proc. 10th Workshop Encrypted Comput. Appl.
[101] D. T. Nguyen and K. Gaj, ‘‘Fast NEON-based multiplication for lattice- Homomorphic Cryptogr., Nov. 2022, pp. 53–63.
based NIST post-quantum cryptography finalists,’’ in Proc. Int. Conf. [124] C. V. Mouchet, J.-P. Bossuat, J. R. Troncoso-Pastoriza, and J.-P. Hubaux,
Post-Quantum Cryptogr. Daejeon, South Korea: Springer, Jul. 2021, ‘‘Lattigo: A multiparty homomorphic encryption library in go,’’ in Proc.
pp. 234–254. 8th Workshop Encrypted Comput. Appl. Homomorphic Cryptogr., 2020,
[102] Z. Chen, Y. Ma, T. Chen, J. Lin, and J. Jing, ‘‘High-performance area- pp. 64–70.
efficient polynomial ring processor for CRYSTALS-Kyber on FPGAs,’’ [125] Z. Wang, P. Li, R. Hou, Z. Li, J. Cao, X. Wang, and D. Meng, ‘‘HE-
Integration, vol. 78, pp. 25–35, May 2021. booster: An efficient polynomial arithmetic acceleration on GPUs for
[103] Z. Liu, H. Seo, S. Sinha Roy, J. Großschädl, H. Kim, and I. Verbauwhede, fully homomorphic encryption,’’ IEEE Trans. Parallel Distrib. Syst.,
‘‘Efficient Ring-LWE encryption on 8-bit AVR processors,’’ in Proc. vol. 34, no. 4, pp. 1067–1081, Apr. 2023.
Int. Workshop Cryptograph. Hardw. Embedded Syst. Saint-Malo, France: [126] Z. Zheng, ‘‘Encrypted cloud using GPUs,’’ Ph.D. dissertation, KU Leu-
Springer, Sep. 2015, pp. 663–682. ven, Leuven, Belgium, 2020. [Online]. Available: https://www~
[104] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, ‘‘High- [127] E. Öztürk, Y. Doröz, E. Savas, and B. Sunar, ‘‘A custom accelerator for
speed NTT-based polynomial multiplication accelerator for post-quantum homomorphic encryption applications,’’ IEEE Trans. Comput., vol. 66,
cryptography,’’ in Proc. IEEE 28th Symp. Comput. Arithmetic (ARITH), no. 1, pp. 3–16, Jan. 2017.
Jun. 2021, pp. 94–101. [128] W. Wang, Z. Chen, and X. Huang, ‘‘Accelerating leveled fully homo-
[105] M. Bisheh-Niasar, R. Azarderakhsh, and M. Mozaffari-Kermani, morphic encryption using GPU,’’ in Proc. IEEE Int. Symp. Circuits Syst.
‘‘Instruction-set accelerated implementation of CRYSTALS-Kyber,’’ (ISCAS), Jun. 2014, pp. 2800–2803.
IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 11, pp. 4648–4659, [129] W. Dai, Y. Doröz, and B. Sunar, ‘‘Accelerating NTRU based homomor-
Nov. 2021. phic encryption using GPUs,’’ in Proc. IEEE High Perform. Extreme
[106] L. Ma, X. Wu, and G. Bai, ‘‘Parallel polynomial multiplication optimized Comput. Conf. (HPEC), Sep. 2014, pp. 1–6.
scheme for CRYSTALS-KYBER post-quantum cryptosystem based on [130] A. S. Özcan, C. Ayduman, E. R. Türkoglu, and E. Savas, ‘‘Homomorphic
FPGA,’’ in Proc. Int. Conf. Commun., Inf. Syst. Comput. Eng. (CISCE), encryption on GPU,’’ IEEE Access, early access, Apr. 7, 2023, doi:
May 2021, pp. 361–365. 10.1109/ACCESS.2023.3265583.


ARDIANTO SATRIAWAN received the B.S. and M.S. degrees in electrical engineering from Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 2013 and 2015, respectively. He is a member of the Computer Engineering Research Group, School of Electrical Engineering and Informatics, ITB. His research interests include virtual reality, machine learning, computer networks, and information security.

INFALL SYAFALNI received the B.Eng. degree in electrical engineering from Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 2008, the M.Sc. degree in electronic engineering from the University of Science Malaysia (USM), Penang, Malaysia, in 2011, and the Dr.Eng. degree in engineering from the Kyushu Institute of Technology (KIT), Iizuka, Fukuoka, Japan, in 2014. From 2014 to 2015, he held a research position with KIT. From 2015 to 2018, he was an ASIC Engineer with the ASIC Development Group, Logic Research Company Ltd., Fukuoka, Japan. In 2019, he joined ITB, where he is currently an Assistant Professor with the School of Electrical Engineering and Informatics and a Researcher with the University Center of Excellence on Microelectronics. His research interests include logic synthesis, logic design, VLSI design, and efficient circuits and algorithms.

RELLA MARETA (Graduate Student Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 2011 and 2014, respectively. She is currently pursuing the Ph.D. degree with the Digital Integrated Systems Laboratory, Inha University.

ISA ANSHORI (Member, IEEE) received the B.S. degree in engineering physics from Institut Teknologi Bandung, Indonesia, in 2009, and the M.Eng. degree in materials science and the Ph.D. degree in nanoscience and nanotechnology from the University of Tsukuba, Japan, in 2015 and 2018, respectively. He has been an Assistant Professor with the Department of Biomedical Engineering, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, since 2018. His research interests include bio/chemical sensors, microfluidics, IoT devices, and lab-on-chip.

WERVYAN SHALANNANDA received the B.S. degree in telecommunications engineering and the M.S. degree in electrical engineering (telematics and telco networks) from the Bandung Institute of Technology, in 2013 and 2015, respectively. He joined the Bandung Institute of Technology, in 2016, as an Academic Assistant and then as a Lecturer, in 2018. His research interests include networked systems and security and artificial intelligence in telecommunications.

ALEAMS BARRA received the B.S. and M.S. degrees in mathematics from Institut Teknologi Bandung (ITB), Bandung, Indonesia, in 1998 and 2002, respectively, and the Ph.D. degree in mathematics from the University of Kentucky, Kentucky, USA, in 2012. He is currently a member of the Algebra Research Group, Faculty of Mathematics and Natural Sciences, ITB.

