Flexible and Efficient Implementation of CRYSTALS-KYBER SIMD RISC-V Coprocessor Based On Customized Vector Instruction-Set Extension
Jiaming Zhang, Jiahao Lu, Dongsheng Liu, Aobo Li, Xiang Li, Shuo Yang, Ang Hu, Xuecheng Zou

Huazhong University of Science and Technology

With the development of quantum computers in recent years, the security of traditional public-key encryption algorithms is facing serious threats, and post-quantum cryptography (PQC) algorithms that can resist quantum computer attacks are urgently needed. CRYSTALS-KYBER, the key-encapsulation scheme finalized by NIST, is advancing through the standardization process. Existing hardware implementations of Kyber mostly use compact architectures that pursue high speed and high performance at the cost of programmability, while most hardware-software co-designs suffer from low parallelism and performance. To implement the key encapsulation mechanism (KEM) of Kyber flexibly and efficiently, this work presents a single instruction multiple data (SIMD) Kyber coprocessor that supports the RISC-V instruction set. A reconfigurable polynomial and logic unit (PLU) is designed, which can accelerate all types of polynomial vector instruction operations, ...

... reduced by 87.5%. To improve computational efficiency, this work designs an out-of-order instruction flow: the coprocessor fetches the next instruction immediately after the previous instruction is dispatched. This allows the poly unit to be activated while the sample unit starts the next round of vector sampling. To avoid memory access conflicts, this work proposes a dynamic hardware scheduling strategy: the poly unit uses two poly RAMs for ping-pong operation after completing the first round of fetching and computing the vectors from the sample RAM, and the sample unit waits for the end of the first round before performing absorb and squeeze operations on the vectors in the sample RAM. This strategy improves the parallelism and throughput of the entire coprocessor while reducing the required storage capacity.

Figure 5 shows the three-stage pipeline structure of the whole SIMD coprocessor. The RISC-V core fetches instructions and sends scalar R-type instructions sequentially; the coprocessor then converts these instructions into extended vector instructions and checks for Read After Write (RAW) hazards between different types of instructions. Since Kyber is a lattice-based algorithm whose largest operation matrix is 4×4, and each basic polynomial has 256 coefficients, the RAM of depth 256 can be divided into 8 vector addresses, and the vs1/vs2/vd operands of instructions point to ...
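The handling of R-type vector instructions and the RAW check described above can be illustrated with a minimal Python sketch. It assumes the standard RISC-V R-type bit layout that Fig. 5 labels as Func7 | VS2 | VS1 | Func3 | VD | Opcode. The opcode value (the RISC-V custom-0 opcode `0x0B`), the Func7/Func3 values, and the helper names `encode`, `decode`, and `has_raw_hazard` are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple

CUSTOM0_OPCODE = 0x0B  # RISC-V custom-0 opcode; its use here is an assumption


class VecInst(NamedTuple):
    func7: int
    vs2: int
    vs1: int
    func3: int
    vd: int
    opcode: int


def encode(func7: int, vs2: int, vs1: int, func3: int, vd: int, opcode: int) -> int:
    """Pack the six R-type fields into one 32-bit instruction word."""
    return (func7 << 25) | (vs2 << 20) | (vs1 << 15) | (func3 << 12) | (vd << 7) | opcode


def decode(word: int) -> VecInst:
    """Split a 32-bit R-type word into its fields (bits 31:25 down to 6:0)."""
    return VecInst(
        func7=(word >> 25) & 0x7F,
        vs2=(word >> 20) & 0x1F,
        vs1=(word >> 15) & 0x1F,
        func3=(word >> 12) & 0x7,
        vd=(word >> 7) & 0x1F,
        opcode=word & 0x7F,
    )


def has_raw_hazard(older: VecInst, younger: VecInst) -> bool:
    """True if the younger instruction reads the vector the older one writes.

    Simplification: both source fields are always treated as read, so a
    single-operand instruction with a stale vs2 field may be flagged too.
    """
    return older.vd in (younger.vs1, younger.vs2)


# Hypothetical pair: the first instruction writes v4, the second reads v4.
first = decode(encode(0b0000001, 3, 2, 0b000, 4, CUSTOM0_OPCODE))
second = decode(encode(0b0000010, 0, 4, 0b000, 5, CUSTOM0_OPCODE))
print(has_raw_hazard(first, second))
```

In a dispatcher of this kind, an instruction flagged by the check would be held back until the older instruction's unit reports completion, while independent instructions for another unit could still be issued.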
2023 IEEE Asian Solid-State Circuits Conference (A-SSCC) | 979-8-3503-3003-8/23/$31.00 ©2023 IEEE | DOI: 10.1109/A-SSCC58667.2023.10347942
[Figure 1: block diagram. Scalar RISC-V Kyber_Core (IFU, EXU, ITCM, DTCM, RegFile, Mem Arbitrator) and Kyber coprocessor (IFU, Dec, Sample Unit with SHA unit in SHAKE-128, SHAKE-256, SHA3-256 and SHA3-512 modes, reject sampler and CBD sampler, Sample RAM1, Poly RAM2, Poly RAM3, Poly ROM, Poly SIMD unit with PLU). Annotations: Sample A ≈80 cycles; Sample s/e/r 16 cycles.]
Fig. 1. SIMD Kyber coprocessor with scalar RISC-V core overall system architecture

[Figure 2: circuit diagram. Panels: Modular Adder, Modular Subtractor, Modular Divider, Modular Multiplier, Modular Reduction, Compress, Decompress; four PLU lanes PLU_OE0 to PLU_OE3.]
Fig. 2. Hardware architecture of polynomial and logic unit
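The modular adder, subtractor, and divider named in Fig. 2 can be sketched in software as follows. This is an illustration under stated assumptions, not the paper's circuit: it assumes the Kyber modulus q = 3329, and it assumes that the divider block halves its input, which is suggested by the `>>1` and `12'd1665` labels in the figure, since 1665 = (q + 1) / 2. The function names are hypothetical.

```python
Q = 3329              # Kyber modulus
HALF = (Q + 1) // 2   # 1665, the value of 1/2 modulo Q


def mod_add(a: int, b: int) -> int:
    """Add two residues in [0, Q) with one conditional subtraction."""
    s = a + b
    return s - Q if s >= Q else s


def mod_sub(a: int, b: int) -> int:
    """Subtract two residues in [0, Q) with one conditional addition."""
    d = a - b
    return d + Q if d < 0 else d


def mod_half(a: int) -> int:
    """Divide a residue by 2 modulo Q using a shift and a conditional add.

    For even a the result is a >> 1. For odd a, (a + Q) / 2 equals
    (a >> 1) + (Q + 1) / 2, so adding HALF after the shift is enough.
    """
    return (a >> 1) + HALF if a & 1 else a >> 1


print(mod_add(3000, 400), mod_sub(5, 7), mod_half(1))
```

Halving of this form is commonly used in inverse-NTT butterflies so that the final scaling by 1/n is absorbed stage by stage; whether the PLU applies it that way is an assumption here.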
[Figure 3: PLU_E and PLU_O configurations assembled from M+, M-, M× and M÷ blocks for NTT, Css/Decss, and Karatsuba pointwise multiplication (first and second operation).]
Fig. 3. PLU functional reconfiguration and Karatsuba polynomial pointwise multiplication

[Figure 4: coefficient layout across PLU_OE0 to PLU_OE3 for Round1 to Round7, and a scheduling timeline. Timeline rows: seeds d, σ, σ, ρ, ρ, ρ, ρ, σ, σ; Poly_Unit: NTT(S0), NTT(S1), PWM(A0 S0), PWM(A1 S1), PWM(A2 S0), PWM(A3 S1), ADD1, ADD2; W/R_Unit: Write; memories: Sample RAM1, Poly RAM2.]
Fig. 4. Coefficient storage method and coprocessor dynamic hardware scheduling strategy
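The Karatsuba pointwise multiplication named in Fig. 3 can be illustrated in software. In Kyber's NTT domain, pointwise multiplication multiplies pairs of degree-1 polynomials modulo X² − ζ. The sketch below compares the schoolbook form with a Karatsuba form that computes the cross term from one extra product instead of two. It assumes q = 3329 and uses ζ = 17 purely as a sample value. The function names are hypothetical, and the mapping of these steps onto the PLU lanes is not modeled.

```python
Q = 3329  # Kyber modulus


def basemul_schoolbook(a, b, zeta):
    """(a0 + a1*X)(b0 + b1*X) mod (X^2 - zeta), using five multiplications."""
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 * b0 + a1 * b1 * zeta) % Q
    c1 = (a0 * b1 + a1 * b0) % Q
    return c0, c1


def basemul_karatsuba(a, b, zeta):
    """Same product with four multiplications.

    The cross term a0*b1 + a1*b0 is recovered as
    (a0 + a1)*(b0 + b1) - a0*b0 - a1*b1.
    """
    a0, a1 = a
    b0, b1 = b
    m0 = a0 * b0 % Q
    m1 = a1 * b1 % Q
    m2 = (a0 + a1) * (b0 + b1) % Q
    c0 = (m0 + m1 * zeta) % Q
    c1 = (m2 - m0 - m1) % Q
    return c0, c1


a, b, zeta = (1234, 3000), (77, 2500), 17
print(basemul_karatsuba(a, b, zeta) == basemul_schoolbook(a, b, zeta))
```

The trade is one multiplier for a few extra additions and subtractions, which suits a datapath where modular adders are cheaper than modular multipliers.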
[Figure 5: three pipeline stages (Stage1, Stage2, Stage3) on a time axis 0, T, 2T. Blocks: R_Launch, Decoder, RAW check, Distributor (Sample_Inst, Poly_Inst, W/R_Inst), Sample_Unit, Poly_Unit, Write/Read, Mem.req, Rsp. Vector instruction format, VLEN: 96-bit: Func7 [31:25], VS2 [24:20], VS1 [19:15], Func3 [14:12], VD [11:7], Opcode [6:0]. Vectors occupy rows 0 to 31 in Sample RAM1, Poly RAM2 and Poly RAM3.]

| Instruction | Operation |
|---|---|
| NTT/INTT | v[vd]=ntt/intt(v[vs1]) |
| Poly_ADD | v[vd]=add(v[vs1],v[vs2]) |
| Css/DeCss | v[vd]=css(v[vs1]) |
| G | v[vd]=sha_G(v[vs1]) |
| PRF/KDF | v[vd]=sha_F(v[vs1]) |
| XOF | v[vd]=sha_X(v[vs1]) |
| H | v[vd]=sha_H(v[vs1]) |
| Write | v[vd]←d[vs1] |
| Read | d[vd]←v[vs1] |

Fig. 5. Three-stage pipeline SIMD coprocessor and customized vector instruction set extension

Table I Results Comparison of FPGA Implementation for Kyber1024

| | TCASI'21 [1] | TCASII'22 [2] | ASSCC'22 [3] | This work |
|---|---|---|---|---|
| Platform | Virtex-7 | Artix-7 | UltraScale+ | UltraScale+ |
| Frequency | 156 MHz | 159 MHz | 250 MHz | 360 MHz |
| Slices | 5000 | 2300 | 3997 | 3202 |
| DSPs | 12 | 4 | 36 | 8 |
| BRAMs | 17 | 16 | 5 | 9 |
| Keygen / Encaps / Decaps (μs) | 64.1 / 89.7 / 115.4 | 49.1 / 52.8 / 66.0 | - / 70.4 / 87.3 | 11.1 / 17.5 / 20.3 |
| AT* | 1.00 | 0.98 | 1.35 | 0.28 |
| Flexibility | Fixed Inst. | Fixed Inst. | Config. Inst. | Programmable Inst. |

*FPGA area and time production (AT) = (DSP×100+BRAM×196+Slices) × Total Time (s) [6]

| | CHES'20 [4] | TCASI'21 [1] | TCASI'22 [5] | This work |
|---|---|---|---|---|
| Frequency | 45 MHz | 200 MHz | 540 MHz | 660 MHz |
| Logic (kGE) | 21 | 104 | 623 | 83 |
| SRAM (kB) | 465 | 190 | 36.75 | 8 |
| Total Energy (μJ)* | 307.68 | - | 16.24 | 0.97 |
| AT** | 550.6 | 16.6 | 13.7 | 2.2 |

*Post-synthesis energy consumption is performed through PTPX
**ASIC area and time production (AT) = Logic Gate (kGE) × Total Time (ms) [5]

Fig. 6. Results comparison
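The FPGA area-time figure defined in the Table I footnote can be reproduced with a short calculation. The sketch below applies the footnote formula to the "This work" column. It assumes that Total Time is the sum of the Keygen, Encaps, and Decaps latencies, which is not stated explicitly in the footnote. The function name `fpga_at` is hypothetical.

```python
def fpga_at(slices: int, dsps: int, brams: int, times_us) -> float:
    """AT = (DSP*100 + BRAM*196 + Slices) * total time in seconds.

    times_us holds the Keygen, Encaps and Decaps latencies in microseconds;
    summing them into one total time is an assumption of this sketch.
    """
    equivalent_slices = dsps * 100 + brams * 196 + slices
    total_time_s = sum(times_us) * 1e-6
    return equivalent_slices * total_time_s


# "This work" column of Table I (UltraScale+, 360 MHz)
at_this_work = fpga_at(slices=3202, dsps=8, brams=9, times_us=(11.1, 17.5, 20.3))
print(round(at_this_work, 2))
```

Under that assumption the result rounds to the 0.28 listed in the AT row for this work.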
Kyber Coprocessor On UltraScale+ FPGA

| Coprocessor Module | LUTs | Flip-flops | DSPs |
|---|---|---|---|
| IFU | 295 | 185 | 0 |
| Decoder | 66 | 52 | 0 |