Implementation of CRYSTALS-Kyber Post-Quantum Algorithm Using RISC-V Processor
Abstract: Post-Quantum Cryptography (PQC) has gained prominence due to advancements in computing, particularly quantum computing. As quantum computers pose a potential threat to traditional encryption algorithms, the development of PQC algorithms has gained momentum. CRYSTALS-Kyber is one such algorithm developed in this era. This paper presents an acceleration approach for the CRYSTALS-Kyber PQC algorithm using a RISC-V core, specifically the Ibex core. The acceleration strategy involves a three-pronged approach: analyzing code profiling results, incorporating custom hardware into the hardware architecture, and simulating the results. Notably, this work introduces an innovative approach to run and accelerate the CRYSTALS-Kyber algorithm using a core-only implementation, eliminating the other microcontroller hardware to reduce the used area.

Index Terms: Post Quantum Cryptography, PQC, CRYSTALS-Kyber, Keccak, Accelerator Design

I. INTRODUCTION

Throughout the history of cryptology, the strength of encryption has traditionally depended on whether a device capable of breaking the ciphertext exists. Notably, in 1994, Peter Shor developed Shor's algorithm [1], which utilizes quantum operations to efficiently find the prime factors of large numbers. Subsequently, as quantum computers capable of performing these operations began to emerge, the National Institute of Standards and Technology (NIST) initiated a Post-Quantum Cryptography (PQC) standardization competition in 2016 with the aim of creating encryption systems that would remain resilient even in the face of quantum computers [2]. CRYSTALS-Kyber is an algorithm designed for public key exchange and public key encryption models, specifically engineered to withstand attacks by quantum computers.

The CRYSTALS-Kyber algorithm consumes more execution time and memory than previous encryption algorithms when implemented in pure software, owing to its inherently more complex structure. From a system perspective, both memory usage and execution time play crucial roles. Therefore, implementing CRYSTALS-Kyber functions on additional hardware, such as an accelerator, can reduce memory consumption and processing time compared to a fully software-based implementation. However, it is important to note that the accelerator will occupy additional space, and the critical path of the accelerator may limit the maximum clock frequency at which the processor can operate. It is crucial that the system is optimized to the greatest extent possible.

While determining the design constraints and developing the system, an open-source core architecture such as RISC-V [3] may be used. It provides a flexible, customizable, and simplified design structure for developers.

II. BACKGROUND OF CRYSTALS-KYBER ALGORITHM

The CRYSTALS-Kyber algorithm, one of the finalists that NIST selected in 2022, is a lattice-based encryption method [4]. In this technique, an encrypted output is obtained by performing a linear combination of n linearly independent vectors defined in the $\mathbb{R}^n$ space, forming an integral lattice. Without further measures, however, it is relatively easy to recover the encrypted message using methods like Gaussian elimination. To counter this vulnerability, a small error is intentionally introduced to the secret multiplied by the lattice, turning the equation into a machine-learning problem. An attacker must then solve the nearest point problem to recover the secret, and this approach is known as "Learning with Errors" (LWE). When the angle between the output lattices, affected by this small error, and the encrypted basis lattices is close to 0 degrees, it significantly complicates solving the LWE problem. As a result, the algorithm provides secure encryption that is hard for quantum computers to decipher. Matrix multiplication operations on high-order polynomials in the encryption and decryption steps come at a time cost, and to address this, the Number Theoretic Transform (NTT) is
Authorized licensed use limited to: National Institute of Technology - Agartala. Downloaded on January 03,2025 at 06:05:55 UTC from IEEE Xplore. Restrictions apply.
employed as an effective method for cyclic and inverse-cyclic convolutions on high-order polynomials.

The matrices A, s, and e are generated using standardized secure random generation functions. In this notation, A represents the coefficient matrix of a polynomial, s serves as the secret key, and e stands for the error matrix. The public key, denoted pk, is then calculated with (1). To generate the A matrix, the SHAKE-128 [5] algorithm, which also conforms to NIST standards, is employed as an Extendable Output Function (XOF). Each element within the A matrix, obtained through the XOF, is transformed into a polynomial and situated within the NTT domain. s is generated using SHAKE-256 [5], also referred to as a Pseudo Random Function (PRF). The error matrix e is likewise obtained using the PRF. To facilitate operations on polynomials, both the s and e matrices are transferred to the NTT domain.

$pk_{n \times k} = A_{n \times n} \cdot s_{n \times k} + e_{n \times k}$    (1)

In the encryption stage, the XOF uses pk as input and generates the A matrix in the NTT domain. Additionally, the random variable matrix r and the error matrices e_1 and e_2 are obtained using the PRF. To encrypt the message, the vector u is generated as A · r + e_1, and the vector v is formed as t · r + e_2 + m, where t is a form of pk in the NTT domain. The final cipher message becomes [u | v], as shown in (2).

$r \cdot [A \mid t] + [e_1 \mid e_2] + [0 \mid m] = [u \mid v]$    (2)

The received encrypted message is divided into the u and v vectors. The secret key is multiplied by u and the result is subtracted from the v matrix in the NTT domain. After the errors are removed from the result, the message m is recovered in real space, as given in (3).

$v - u \cdot s = m + e$    (3)

III. PURE SOFTWARE IMPLEMENTATION OF CRYSTALS-KYBER ALGORITHM ON IBEX RISC-V CORE

The Ibex core has been selected as the RISC-V Central Processing Unit (CPU) because of its ease of use and comprehensive documentation. It is an open-source 32-bit RISC-V core that has micro, small, maxperf and maxperf-pmp-bmfull configurations [6]. To strike a balance between performance and area, the "small" configuration was chosen, which utilizes the RV32IMC instruction set. For the implementation of the RISC-V CPU, we chose the Nexys A7-100T Field-Programmable Gate Array (FPGA) board, using the XC7A100T-1CSG324C FPGA chip [7]. Table I provides the utilization report.

TABLE I
IBEX CORE AREA USAGE
Resource   Used     Available   Utilization (%)
LUT        5 965    63 400      9.41
LUTRAM     545      19 000      2.87
FF         5 657    126 800     4.46
BRAM       67       135         49.63
DSP        1        240         0.42
IO         34       210         16.19
PLL        1        6           16.67

The CRYSTALS-Kyber algorithm has been implemented on the selected core, Ibex, using the reference implementation of the algorithm without any modifications [8]. This implies that this work has addressed the worst case scenario of the algorithm running on the Ibex core, and the project results remain valid even without altering the reference code. Within the reference code, there exist three different security levels: 512, 768, and 1024, each offering two different variants, namely "Standard Kyber" based on Keccak and "Kyber-90s" based on SHA-2/AES. For the implementation, the test vectors code of the Standard Kyber implementation was used at security level 768, which represents a mid-level of security. The parameter NTESTS, which decides how many times the algorithm will run, is set to 1 in params.h. With these settings, the program executed on the RISC-V core is the test_vectors768 code.

After the software compilation phase of the CRYSTALS-Kyber algorithm for the Ibex CPU, the resulting hexadecimal program memory, comprising 30 078 bytes, was imported into the Xilinx Vivado design and simulation tool [9]. During the simulation, the LED address was used to verify the results. In the main function, the functions crypto_kem_keypair, crypto_kem_enc, and crypto_kem_dec were executed consecutively. Once all the processes were completed, the encapsulated and decapsulated keys were compared byte-by-byte. If all bytes are equal, the number 40190031 is seen at the LED address. Otherwise, if there is any disparity, the number 40190046 is seen, indicating a potential issue with the key generation, encryption, and/or decryption phases of the process. After the simulation of the pure software implementation, the number 40190031 was seen at the LED address, signifying the correct operation of the system. The simulation result is given in Fig. 1.

Fig. 1. CRYSTALS-Kyber Algorithm Simulation Result

In the simulation, the clock frequency of the processor was set to 50 MHz. According to the simulation results, the key generation, encryption, decryption, and comparison processes took a total of 339.42 ms, which corresponds to 16 971 279.5 clock cycles. However, this result also includes the byte comparison. When considering only the CRYSTALS-Kyber algorithm, the results related to its functionalities are presented in Table II.
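The timing figures quoted for the pure software run can be cross-checked with a few lines of arithmetic. The minimal sketch below converts a simulated cycle count to wall-clock time at the 50 MHz clock and subtracts the per-function counts of Table II from the total to isolate the cost of the byte comparison. The helper name `cycles_to_ms` and the variable names are illustrative and are not part of the reference code.

```python
CLOCK_HZ = 50_000_000  # simulation clock frequency stated in the text (50 MHz)


def cycles_to_ms(cycles: float, clock_hz: float = CLOCK_HZ) -> float:
    """Convert a clock-cycle count to milliseconds."""
    return cycles / clock_hz * 1e3


# Total cycle count of the pure software run, including the byte comparison.
total_with_compare = 16_971_279.5

# Per-function cycle counts as reported in Table II.
table_ii = {
    "keypair": 4_702_047.5,
    "encapsulation": 5_886_277.5,
    "decapsulation": 6_222_787.5,
}

# Cycles spent in the three Kyber functions only.
kyber_only = sum(table_ii.values())

# Whatever remains is the comparison and surrounding test harness.
compare_overhead = total_with_compare - kyber_only

print(f"total run time    : {cycles_to_ms(total_with_compare):.3f} ms")
print(f"Kyber-only cycles : {kyber_only:,.1f}")
print(f"comparison cycles : {compare_overhead:,.1f}")
```

On these figures, the key comparison and the surrounding harness account for roughly 160 000 cycles, which is under 1% of the whole run.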
TABLE II
CRYSTALS-KYBER ALGORITHM EXECUTION TIMES WITH PURE-SOFTWARE IMPLEMENTATION
Function             Execution (Clock Cycles)
Keypair Generation   4 702 047.5
Encapsulation        5 886 277.5
Decapsulation        6 222 787.5
Overall              16 811 112.5

IV. CODE PROFILING FOR CRYSTALS-KYBER ALGORITHM

To determine which functionality of the algorithm should be implemented in hardware to accelerate its performance, code profiling was employed. Profiling the code helps identify which functions are called most frequently and consume the most time during algorithm execution. For this purpose, the code profiling tool GNU gprof was utilized. It is worth noting that while this profiling tool is not specifically designed for RISC-V architectures, it still provides reasonable results, as mentioned in [10]. The profiling was conducted with the -O0 flag (no optimization), and the code was profiled with NTESTS=1000 because the CRYSTALS-Kyber algorithm utilizes a deterministic randombytes method by Daniel J. Bernstein sourced from SUPERCOP [11]. The decision to set NTESTS to 1000 was made to minimize the impact of randomness and the possibility of misleading results. The profiling results are presented in Table III.

TABLE III
PROFILING RESULTS
Function                   Calls        Time (%)
KeccakF1600_StatePermute   143 207      19.13
montgomery_reduce          47 616 000   12.51
basemul                    4 608 000    9.09

In light of the findings in Table III, the decision has been made to implement the most time-consuming function, KeccakF1600_StatePermute, in hardware. According to the test results, the KeccakF1600 function requires 50 748 clock cycles for execution in Ibex when implemented purely in software. Incorporating dedicated hardware for KeccakF1600 within the CRYSTALS-Kyber algorithm is expected to yield a significant enhancement in its speed. This is especially crucial since KeccakF1600 is called 44 times in key generation, 56 times in encapsulation, and 46 times in decapsulation. Moreover, the KeccakF1600 function is widely used across all key parts of the CRYSTALS-Kyber algorithm, with its usage almost evenly distributed. This makes the work particularly advantageous for distributed systems.

V. HARDWARE/SOFTWARE CO-DESIGN FOR CRYSTALS-KYBER ALGORITHM

KECCAK-f[x] is a cryptographic permutation function that is based on the sponge construction and works with different lengths. The number x ∈ {25, 50, 100, 200, 400, 800, 1600} denotes the length of the permutation. In this study, KeccakF1600 is employed and implemented in hardware. Within the Keccak function, there are 24 rounds, each of which has five essential steps that enhance security. These steps include Theta (θ), which mixes bits using column parities, Rho (ρ), which rotates bits, Pi (π), which rearranges data on a 2D plane, Chi (χ), which applies a nonlinear combination (XOR, AND, and NOT) within each row, and Iota (ι), which introduces round constants to vary the function's behavior in different rounds. In the Keccak hardware, each round needs one clock cycle to execute. Based on this, it is reasonable to anticipate a substantial reduction in the number of clock cycles required for the permutation itself, from 50 748 to just 24 cycles, thanks to the utilization of the hardware. While there are existing optimal Keccak hardware implementations in the literature, such as the Keccak Team KeccakF1600 hardware implementation [12], this study introduces a simpler hardware design for the initial phase of acceleration and ease of simulation. The Keccak hardware report is given in Table IV.

TABLE IV
KECCAKF1600 HARDWARE AREA USAGE
Resource   Used     Available   Utilization (%)
LUT        6 933    63 400      10.94
FF         3 543    126 800     2.79

When incorporating the Keccak hardware with the Ibex CPU, the decision was made to use the hardware with an interrupt mechanism. Specifically, memory addresses in the range of 0x14000-0x14008 were allocated for the hardware's operation. Importantly, there is no requirement for a data bus like the Advanced eXtensible Interface (AXI) in the system, which would otherwise increase area usage.

In this setup, the software function KeccakF1600 in the CRYSTALS-Kyber reference implementation was transformed into an interrupt source for the KeccakF1600 hardware. From the software's point of view, the hardware is solely an address. When it is time to execute the KeccakF1600 function, an interrupt is triggered by writing '1' to the corresponding address. This enables the hardware, causing the CPU to enter a stall state. With the activation of the enable signal, the Keccak Read Unit retrieves the corresponding 1600 bits from RAM and feeds them to the KeccakF1600 hardware, which begins its operation after 50 clock cycles of reading memory. After the hardware completes its task, the output is written back to RAM in another 50 clock cycles. The keccak_done signal is then triggered to release the CPU from its stalled state. The system's architecture is illustrated in Fig. 2.

The CRYSTALS-Kyber algorithm was simulated using the architecture shown in Fig. 2, operating with a 50 MHz clock frequency. The simulation concluded in 191.95 ms, which equals 9 597 321.5 clock cycles, including byte-by-byte comparisons. The simulation results are illustrated in Fig. 3. Additionally, Table V compares the execution times of the CRYSTALS-Kyber algorithm between this simulation and a pure software implementation.
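A back-of-the-envelope estimate helps explain where the savings come from. The sketch below assumes that each hardware invocation costs 50 cycles to read the state, 24 cycles for the rounds, and 50 cycles to write the result back, and that interrupt and stall overhead are negligible. This per-call total and all names in the code are assumptions made for illustration, not measured values. The sketch multiplies the per-call saving by the call counts given in Section IV, subtracts it from the Table II figures, and compares the prediction with the simulated counts of Table V.

```python
# Per-call cycle costs. The software figure is quoted in the text; the hardware
# figure is an ASSUMPTION: 50 cycles to read the state, 24 rounds at one cycle
# each, and 50 cycles to write back, with no interrupt or stall overhead.
SW_CYCLES_PER_CALL = 50_748
HW_CYCLES_PER_CALL = 50 + 24 + 50

# KeccakF1600 calls per operation, as stated in Section IV.
KECCAK_CALLS = {"keypair": 44, "encapsulation": 56, "decapsulation": 46}

# Pure-software cycle counts from Table II.
SOFTWARE_CYCLES = {
    "keypair": 4_702_047.5,
    "encapsulation": 5_886_277.5,
    "decapsulation": 6_222_787.5,
}

# Simulated hardware-accelerated cycle counts from Table V, for comparison.
MEASURED_CYCLES = {
    "keypair": 2_465_323.5,
    "encapsulation": 3_115_537.5,
    "decapsulation": 3_957_289.5,
}


def estimated_saving(calls: int) -> int:
    """Cycles removed when `calls` software permutations move to hardware."""
    return calls * (SW_CYCLES_PER_CALL - HW_CYCLES_PER_CALL)


def estimated_cycles(operation: str) -> float:
    """Predicted cycle count of one operation after the Keccak offload."""
    return SOFTWARE_CYCLES[operation] - estimated_saving(KECCAK_CALLS[operation])


def relative_error(operation: str) -> float:
    """Relative gap between the estimate and the simulated count."""
    measured = MEASURED_CYCLES[operation]
    return abs(estimated_cycles(operation) - measured) / measured


for op in KECCAK_CALLS:
    print(f"{op:14s} estimated {estimated_cycles(op):>12,.1f}  "
          f"measured {MEASURED_CYCLES[op]:>12,.1f}  "
          f"gap {relative_error(op):.2%}")
```

Under these assumptions the estimate lands within 3% of the simulated counts for all three operations, which suggests that replacing the software permutation accounts for nearly all of the reported improvement.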
Fig. 2. Hardware Architecture

Fig. 3. Hardware Accelerated Result of CRYSTALS-Kyber Algorithm

TABLE V
COMPARISON AGAINST PURE SOFTWARE RESULTS
Function        Pure Software (Clocks)   HW Accelerated (Clocks)   Reduction (%)   Speed-Up
Keypair Gen.    4 702 047.5              2 465 323.5               47.57           1.91
Encapsulation   5 886 277.5              3 115 537.5               47.07           1.89
Decapsulation   6 222 787.5              3 957 289.5               36.41           1.57
Overall         16 811 112.5             9 538 150.5               43.26           1.76

VI. CONCLUSION, COMPARISON AND FUTURE WORK

The CRYSTALS-Kyber algorithm has been successfully accelerated, resulting in significant improvements across all three functionalities. The reductions in clock cycles are as follows: 47.57% in keypair generation, 47.07% in encapsulation, and 36.41% in decapsulation. Comparing with previous studies, Fritzmann, Sigl, and Sepúlveda achieved an 8.2 times faster algorithm by designing the NTT and Keccak blocks in hardware [13]. Similarly, Nannipieri et al. designed an NTT block in hardware and achieved a 1.84 times faster algorithm [14]. Lastly, Masera, Martina, and Dolmeta obtained a speed-up factor of 2.47 with Keccak hardware in the Kyber-768 algorithm [15]. The notable difference between our work and previous efforts lies in our decision to forgo the utilization of an additional data bus, such as AXI. The primary objective of our work is to assess the CRYSTALS-Kyber algorithm's performance using only a CPU (without relying on a microcontroller), memory, and accelerator hardware. Implementing a microcontroller like PULPissimo could consume up to 90% of the selected FPGA's resources [15]. By eliminating the need for additional hardware, we have managed to enhance the algorithm's speed with minimal hardware additions, resulting in reduced FPGA area usage. Notably, this approach does not introduce the overhead of additional data buses and peripherals. This strategy has delivered a 43.26% reduction in the overall clock-cycle count (a 1.76 times speed-up), at the cost of an additional 10.94% of the available LUTs. The overall FPGA LUT area usage now stands at 20.35%, with 7.25% of the available FFs being utilized. This trade-off has proven advantageous. Moreover, this study serves to illustrate the worst-case scenario for the CRYSTALS-Kyber algorithm, using only the Ibex core and accelerator, as no software optimization has been performed for the architecture. As future work, we will design the software in accordance with the RISC-V architecture and instruction sets. Additionally, the integration of further hardware, guided by the profiling results in Table III, is expected to provide even greater acceleration for the algorithm.

REFERENCES

[1] E. Gerjuoy, "Shor's Factoring Algorithm and Modern Cryptography: An Illustration of the Capabilities Inherent in Quantum Computers," American Journal of Physics, vol. 73, no. 5, pp. 409-412, May 2005, doi: 10.1119/1.1891170.
[2] NIST CSRC, "Post-quantum cryptography standardization - post-quantum cryptography," 2017. [Online]. Available: https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Post-Quantum-Cryptography-Standardization. [Accessed: June 28, 2023].
[3] RISC-V Foundation, "RISC-V," July 15, 2023. [Online]. Available: https://riscv.org/.
[4] G. Alagic, D. Apon, D. Cooper, Q. Dang, T. Dang, J. Kelsey, et al., "Status Report on the Third Round of the NIST Post-Quantum Cryptography Standardization Process," NIST Interagency/Internal Report (NISTIR), National Institute of Standards and Technology, Gaithersburg, MD, 2022, https://doi.org/10.6028/NIST.IR.8413-upd1.
[5] "SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions," https://www.nist.gov. [Online]. Available: https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf. [Accessed: July 15, 2023].
[6] "Ibex Documentation." [Online]. Available: https://ibex-core.readthedocs.io/en/latest/. [Accessed: June 18, 2023].
[7] Digilent, "Nexys A7™ FPGA Board Reference Manual," n.d. [Revised July 10, 2019].
[8] pq-crystals, "Kyber," 2023. [Online]. Available: https://github.com/pq-crystals/kyber. [Accessed: June 18, 2023].
[9] Xilinx, "Vivado Design Suite User Guide," 2012 [Revised Nov. 2023].
[10] Ö. Altınay and S.B.Ö. Yalçın, "Instruction Extension of RV32I and GCC Back End for Ascon Lightweight Cryptography Algorithm," in 2021 IEEE International Conference on Omni-Layer Intelligent Systems (COINS), Barcelona, Spain, 2021, pp. 1-6, doi: 10.1109/COINS51742.2021.9524190.
[11] D. J. Bernstein, T. Lange, and P. Schwabe, "The security impact of a new cryptographic library," in Proceedings of Progress in Cryptology - LATINCRYPT 2012, A. Hevia and G. Neven, Eds., vol. 7533, LNCS, pp. 159-176, Springer, 2012. [Online]. Available: http://cryptojedi.org/papers/coolnacl.
[12] G. Bertoni, J. Daemen, S. Hoffert, M. Peeters, G. Van Assche, and R. V. Keer, "Keccak in VHDL," Keccak Team, 2020. [Online]. Available: https://keccak.team/hardware.html. [Accessed: June 18, 2023].
[13] T. Fritzmann, G. Sigl, and J. Sepúlveda, "RISQ-V: Tightly Coupled RISC-V Accelerators for Post-Quantum Cryptography," TCHES, vol. 2020, no. 4, pp. 239-280, Aug. 2020.
[14] P. Nannipieri, S. Di Matteo, L. Zulberti, F. Albicocchi, S. Saponara and L. Fanucci, "A RISC-V Post Quantum Cryptography Instruction Set Extension for Number Theoretic Transform to Speed-Up CRYSTALS Algorithms," in IEEE Access, vol. 9, pp. 150798-150808, 2021, doi: 10.1109/ACCESS.2021.3126208.
[15] G. Masera, M. Martina, "Integration and optimization of a RISC-V based Keccak accelerator," M.S. thesis, Corso di laurea magistrale in Ingegneria Elettronica (Electronic Engineering), Politecnico di Torino, Torino, 2023. [Online]. Available: https://webthesis.biblio.polito.it/26725/