
Flexible and Efficient Implementation of CRYSTALS-KYBER SIMD RISC-V Coprocessor Based On Customized Vector Instruction-Set Extension

The document presents a flexible and efficient implementation of the CRYSTALS-KYBER key encapsulation mechanism using a SIMD RISC-V coprocessor with a customized vector instruction-set extension. It highlights improvements in computational efficiency through an out-of-order instruction flow and dynamic hardware scheduling, achieving significant speedups and power efficiency compared to existing implementations. The proposed architecture is evaluated on an Ultrascale+ FPGA platform, demonstrating high performance and programmability for post-quantum cryptography applications.


1 IEEE ASSCC 2023/ Session X/ Paper X.

Flexible and Efficient Implementation of CRYSTALS-KYBER SIMD RISC-V Coprocessor Based on Customized Vector Instruction-Set Extension

Jiaming Zhang, Jiahao Lu, Dongsheng Liu, Aobo Li, Xiang Li, Shuo Yang, Ang Hu, Xuecheng Zou

Huazhong University of Science and Technology

2023 IEEE Asian Solid-State Circuits Conference (A-SSCC) | 979-8-3503-3003-8/23/$31.00 ©2023 IEEE | DOI: 10.1109/A-SSCC58667.2023.10347942

With the development of quantum computers in recent years, the security of traditional public-key encryption algorithms is facing serious threats, and post-quantum cryptography (PQC) algorithms that can resist quantum computer attacks are urgently needed. CRYSTALS-KYBER, the key-encapsulation scheme finalized by NIST, continues to advance through the standardization process. Existing hardware implementations of Kyber mostly use compact architectures that pursue high speed and high performance at the cost of programmability, while most hardware-software co-designs suffer from low parallelism and performance. Aiming at a flexible and efficient implementation of the key encapsulation mechanism (KEM) of Kyber, this work presents a single instruction multiple data (SIMD) Kyber coprocessor that supports the RISC-V instruction set. A reconfigurable polynomial and logic unit (PLU) is designed, which can accelerate all types of polynomial vector instruction operations, and a dynamic hardware scheduling strategy is proposed to enable different types of instructions to be executed in parallel, improving the coprocessor pipeline throughput. Implemented on the UltraScale+ FPGA platform and evaluated under SMIC 40nm technology, the proposed coprocessor achieves the fastest computing speed with the lowest power consumption among the compared designs and a 3.5×/6.2× improvement in FPGA/ASIC AT product efficiency.

Figure 1 shows the entire Kyber coprocessor with a scalar RISC-V core. The coprocessor can be tightly coupled into the RISC-V core pipeline alongside the load/store unit (LSU), sharing the tightly coupled memory (TCM). Vector instruction-set extensions are used in the instruction fetching unit (IFU) to make the coprocessor suitable for fine-grained operations while providing programmability. The sample unit uses a single Keccak core and implements a 4-parallel rejection sampler and a 16-parallel central binomial distribution (CBD) sampler, requiring only about 80 and 16 clock cycles to complete a polynomial sampling of A and s/e/r, respectively. The poly unit consists of four groups of parity PLUs as the core calculators that implement the polynomial vector operations. Finally, three dual-port RAMs store the coefficients during the computation and one ROM stores the number theoretic transform (NTT) twiddle factors.

Figure 2 shows the hardware architecture of the PLU. The modular multiplier consumes one DSP resource to multiply two coefficients and then uses a divide-and-conquer method to perform modular reduction on the product. Compression and decompression of polynomials are achieved by multiplexing the same DSPs. The proposed modular adder, modular subtractor and modular divider are implemented using a combination of carry save adders (CSA) and carry propagate adders (CPA). Three levels of registers are inserted into the modular multiplier so that the PLU forms a pipelined architecture, which significantly improves the performance of the PLU and makes it more suitable for vector instruction operations.

As shown in Figure 3, a PLU_OE consists of a pair of PLUs that process odd and even vectors respectively. By controlling the selector, the PLU_OE can be reconfigured to implement different polynomial operations. The polynomial pointwise multiplication (PWM) operation uses the Karatsuba algorithm, which reduces the number of multiplications from 5 to 4 at the cost of two extra additions and one extra subtraction. Benefiting from the hardware structure of the PLU_OE array, the clock cycles of a vector PWM are reduced from 640 to 72, a reduction of 88.75%.

As shown in Figure 4, coefficients of the Kyber polynomial vector are stored in dual-port RAM with a parity sequential storage strategy, and the PLU_OE array reads two rows of coefficients at a time. To ensure pipelined operation, the VLEN of the customized vector instruction set is set to 96 bits to match the bandwidth of the PLU_OE array. Taking the most time-consuming operation, the NTT, as an example, this memory operation strategy reduces the delay of a 256-point NTT by 87.5%.

To improve the computational efficiency, an out-of-order instruction flow is designed in this work. The coprocessor fetches the next instruction immediately after the previous instruction is dispatched. This allows the poly unit to be activated while the sample unit starts the next round of vector sampling. To avoid memory access conflicts, this work proposes a dynamic hardware scheduling strategy: the poly unit uses two poly RAMs for ping-pong operation after completing the first round of fetching and computing the vectors from the sample RAM, and the sample unit waits for the end of the first round before performing absorb and squeeze operations on the vectors in the sample RAM. This strategy improves the parallelism and throughput of the entire coprocessor while reducing the required storage capacity.

Figure 5 shows the three-stage pipeline structure of the whole SIMD coprocessor. The RISC-V core fetches instructions and sends scalar R-type instructions sequentially; the coprocessor then interprets these instructions as extended vector instructions and checks for read-after-write (RAW) hazards between different types of instructions. Since Kyber is a lattice-based algorithm whose largest operation matrix is 4×4 and whose basic polynomial has 256 coefficients, the RAM of depth 256 can be divided into 8 vector addresses, and the vs1/vs2/vd operands of instructions point to these vector spaces, enabling the coprocessor to complete a polynomial vector operation with a single instruction.

Figure 6 compares this work with a series of Kyber1024 security-level implementations in Table I. This work is implemented on the Xilinx UltraScale+ platform. Benefiting from the parallelism of vector instructions and the reconfigurability of the PLU_OE array, this work achieves 5.5×, 3.5× and 3.3× speedups compared to [1], [2] and [3] respectively, and realizes the highest operating frequency. A maximum 3.5× improvement in FPGA AT product efficiency is obtained in this work. Compared with the DC synthesis results of other works, this work achieves the highest operating frequency with the lowest power consumption under SMIC 40nm technology, owing to the dynamic hardware scheduling strategy. Since [4] only implements a scalar PQC accelerator with poor performance, this work improves ASIC AT product efficiency by 250× over it. Compared to [5], which is based on a scalar RISC-V processor, the improvement in AT reaches 6.2×, and compared to [1] it reaches 7.5×. This application-customized instruction-set coprocessor is programmable and has a high degree of operational parallelism, allowing flexible and efficient implementation of the KEM operations of Kyber.

Acknowledgements:
This work is supported by the National Key Research and Development Program of China (No. 2021YFA0715502), the National Natural Science Foundation of China (No. 61874163, 62104076, 62134002), the National Key Analog Integrated Circuit Laboratory Project of China (No. JCKY2021210C004), the Introduced Innovative R&D Team of Dongguan (No. 201760712600139), and the Laboratory Open Fund of Beijing Smart-chip Microelectronics Technology Co., Ltd. The corresponding author is Jiahao Lu. E-mail: lujiahaohust@foxmail.com

References:
[1] M. Bisheh-Niasar et al., "Instruction-Set Accelerated Implementation of CRYSTALS-Kyber," TCAS-I: Regular Papers, vol. 68, no. 11, pp. 4648-4659, Nov. 2021.
[2] W. Guo et al., "An Efficient Implementation of KYBER," TCAS-II: Express Briefs, vol. 69, no. 3, pp. 1562-1566, Mar. 2022.
[3] A. Li et al., "A Flexible Instruction-based Post-quantum Cryptographic Processor with Modulus Reconfigurable Arithmetic Unit for Module LWR&E," A-SSCC 2022, pp. 1-3.
[4] T. Fritzmann et al., "RISQ-V: Tightly Coupled RISC-V Accelerators for Post-Quantum Cryptography," TCHES 2020:446.
[5] Y. Zhao et al., "A High-Performance Domain-Specific Processor With Matrix Extension of RISC-V for Module-LWE Applications," TCAS-I: Regular Papers, vol. 69, no. 7, pp. 2871-2884, July 2022.
[6] D. Kundi et al., "Ultra High-Speed Polynomial Multiplications for Lattice-Based Cryptography on FPGAs," Transactions on Emerging Topics in Computing, vol. 10, no. 4, pp. 1993-2005, Oct.-Dec. 2022.
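As an illustration of the vector instructions discussed with Figure 5, the following sketch packs and unpacks a 32-bit word using the field positions given in that figure (Func7 at bits 31:25, VS2 at 24:20, VS1 at 19:15, Func3 at 14:12, VD at 11:7, Opcode at 6:0), which coincide with the standard RISC-V R-type layout. The paper does not list opcode or Func7 values, so `CUSTOM0` (the RISC-V custom-0 major opcode) and `FUNC7_PWM` are assumptions chosen only for demonstration, and the function names are hypothetical.

```python
def encode_vector_inst(func7, vs2, vs1, func3, vd, opcode):
    """Pack the six fields into one 32-bit instruction word."""
    fields = (("func7", func7, 7), ("vs2", vs2, 5), ("vs1", vs1, 5),
              ("func3", func3, 3), ("vd", vd, 5), ("opcode", opcode, 7))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return ((func7 << 25) | (vs2 << 20) | (vs1 << 15)
            | (func3 << 12) | (vd << 7) | opcode)


def decode_vector_inst(word):
    """Split a 32-bit instruction word back into its fields."""
    return {
        "func7": (word >> 25) & 0x7F,
        "vs2": (word >> 20) & 0x1F,
        "vs1": (word >> 15) & 0x1F,
        "func3": (word >> 12) & 0x7,
        "vd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }


CUSTOM0 = 0b0001011    # assumed opcode; not stated in the paper
FUNC7_PWM = 0b0000001  # hypothetical selector for a PWM instruction

# v[3] = pwm(v[1], v[2]) under the assumed encoding
word = encode_vector_inst(FUNC7_PWM, vs2=2, vs1=1, func3=0, vd=3, opcode=CUSTOM0)
fields = decode_vector_inst(word)
```

Because the operands are vector addresses rather than scalar registers, a single word of this shape is enough to name a whole polynomial vector as source or destination.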
Authorized licensed use limited to: National Institute of Technology - Agartala. Downloaded on January 03,2025 at 06:05:28 UTC from IEEE Xplore. Restrictions apply.
[Figure 1: block diagram. A scalar RISC-V core (IFU, EXU, LSU, ITCM, DTCM, RegFile) is coupled to the Kyber core, which contains an IFU with sample and poly instruction decoders, memory arbitrators, the sample unit (SHA unit with SHAKE-128, SHAKE-256, SHA3-256 and SHA3-512 modes, rejection sampler, CBD sampler), the poly SIMD unit (PLU_OE0 to PLU_OE3), Sample RAM1, Poly RAM2, Poly RAM3 and a ROM. The figure also tabulates the two samplers:]

| Sampler | Parallel Num | Input Bits | Output Bits | Cycles |
|---|---|---|---|---|
| Reject | 4 | 48 | 48 | Sample A: ≈80 |
| CBD | 16 | 4/6 | 12 | Sample s/e/r: 16 |

Fig. 1. SIMD Kyber coprocessor with scalar RISC-V core overall system architecture

[Figure 2: PLU datapath with 12-bit inputs A_in, B_in, C_in and W_in, consisting of a modular adder (M+), a modular subtractor (M-), a modular divider (M÷, drawn with a >>1 shift and the constant 12'd1665), and a DSP-based modular multiplier (M×) with a modular reduction stage and compress/decompress paths.]

Fig. 2. Hardware architecture of polynomial and logic unit
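The modular divider in Fig. 2 is drawn with a right shift by one and the constant 12'd1665. This is consistent with halving modulo the Kyber modulus q = 3329, because 1665 = (q + 1)/2 is the inverse of 2 modulo q. The sketch below illustrates that reading together with a conditional-correction adder and subtractor; it is a behavioural model under that assumption, with illustrative function names, and it does not model the CSA/CPA structure of the actual circuit.

```python
Q = 3329          # Kyber modulus
HALF_INV = 1665   # (Q + 1) // 2, the inverse of 2 modulo Q


def mod_add(a: int, b: int) -> int:
    """a + b mod Q for inputs in [0, Q), using one conditional subtraction."""
    s = a + b
    return s - Q if s >= Q else s


def mod_sub(a: int, b: int) -> int:
    """a - b mod Q for inputs in [0, Q), using one conditional addition."""
    d = a - b
    return d + Q if d < 0 else d


def mod_half(a: int) -> int:
    """a / 2 mod Q for a in [0, Q): shift right, add 1665 when a is odd."""
    return (a >> 1) + (HALF_INV if a & 1 else 0)
```

Halving of this kind is typically needed in inverse-NTT butterflies that fold the 1/2 scaling into every stage, which would explain why the divider sits next to the adder and subtractor in the PLU.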
[Figure 3: PLU_OE reconfiguration. Each PLU_OE pairs an even PLU (PLU_E) and an odd PLU (PLU_O) built from M+, M-, M× and M÷ blocks. Configurations are shown for NTT, INTT, PWM_F, PWM_L and Css/Decss, and the Karatsuba pointwise multiplication is drawn as a first and a second operation.]

Fig. 3. PLU functional reconfiguration and Karatsuba polynomial pointwise multiplication

[Figure 4: coefficient storage and scheduling. Coefficients are stored eight per row and processed by PLU_OE3 to PLU_OE0 over Round1 to Round7, alternating between Poly_RAM2 and Poly_RAM3. The scheduling timeline lists, per unit:
Sample_Unit: G(d), S0, S1, A00, A01, A10, A11, e0, e1
Poly_Unit: NTT(S0), NTT(S1), PWM(A0 S0), PWM(A1 S1), PWM(A2 S0), PWM(A3 S1), ADD1, ADD2
W/R_Unit: Write
The sample unit states are Idle (Round1), Absorb (Round2), Working (Round3~6) and Squeeze (Round7), with Sample RAM1, Poly RAM2 and Poly RAM3 shared between the sample unit and the poly unit.]

Fig. 4. Coefficient storage method and coprocessor dynamic hardware scheduling strategy
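The multiplication count quoted for the Karatsuba PWM can be checked on Kyber's base-case product, which multiplies two degree-1 polynomials modulo X² − ζ. The sketch below is a plain software model with assumed function names, not the PLU_OE datapath; ζ = 17 is used only as a sample value in the usage line.

```python
Q = 3329  # Kyber modulus


def basemul_schoolbook(a0, a1, b0, b1, zeta):
    """(a0 + a1*X)(b0 + b1*X) mod (X^2 - zeta) with five multiplications."""
    c0 = (a0 * b0 + a1 * b1 * zeta) % Q   # a0*b0, a1*b1, (a1*b1)*zeta
    c1 = (a0 * b1 + a1 * b0) % Q          # a0*b1, a1*b0
    return c0, c1


def basemul_karatsuba(a0, a1, b0, b1, zeta):
    """Same product with four multiplications."""
    m0 = (a0 * b0) % Q                    # multiplication 1
    m1 = (a1 * b1) % Q                    # multiplication 2
    m2 = ((a0 + a1) * (b0 + b1)) % Q      # multiplication 3, after two extra additions
    c0 = (m0 + m1 * zeta) % Q             # multiplication 4
    c1 = (m2 - (m0 + m1)) % Q             # cross term recovered by one subtraction
    return c0, c1


# Both variants agree on an arbitrary input.
result = basemul_karatsuba(1234, 567, 89, 3000, zeta=17)
```

The cross term a0·b1 + a1·b0 is never computed directly: it falls out of (a0 + a1)(b0 + b1) − a0·b0 − a1·b1, which is where the saved multiplication comes from.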
[Figure 5: three-stage pipeline (Stage1, Stage2, Stage3) in which R-type instructions launched by the RISC-V core pass through a decoder and distributor with RAW checking to the Sample_Unit, Poly_Unit and Write/Read unit, which return a response. Vector instruction format, VLEN: 96-bit:]

| Bits | 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|---|---|---|---|---|---|---|
| Field | Func7 | VS2 | VS1 | Func3 | VD | Opcode |

| Group | Instruction | Operation |
|---|---|---|
| Poly | NTT/INTT | v[vd]=ntt/intt(v[vs1]) |
| Poly | PWM1/PWM2 | v[vd]=pwm(v[vs1],v[vs2]) |
| Poly | Poly_ADD | v[vd]=add(v[rs1],v[vs2]) |
| Poly | Css/DeCss | v[vd]=css(v[vs1]) |
| Sample | G | v[vd]=sha_G(v[vs1]) |
| Sample | PRF/KDF | v[vd]=sha_F(v[vs1]) |
| Sample | XOF | v[vd]=sha_X(v[vs1]) |
| Sample | H | v[vd]=sha_H(v[vs1]) |
| W/R | Write | v[vd]←d[vs1] |
| W/R | Read | d[vd]←v[vs1] |

Fig. 5. Three-stage pipeline SIMD coprocessor and customized vector instruction set extension

Table I Results Comparison of FPGA Implementation for Kyber1024

| | TCASI'21 [1] | TCASII'22 [2] | ASSCC'22 [3] | This work |
|---|---|---|---|---|
| Platform | Virtex-7 | Artix-7 | UltraScale+ | UltraScale+ |
| Frequency | 156MHz | 159MHz | 250MHz | 360MHz |
| Slices | 5000 | 2300 | 3997 | 3202 |
| DSPs | 12 | 4 | 36 | 8 |
| BRAMs | 17 | 16 | 5 | 9 |
| Keygen/Encaps/Decaps (μs) | 64.1/89.7/115.4 | 49.1/52.8/66.0 | -/70.4/87.3 | 11.1/17.5/20.3 |
| AT* | 1.00 | 0.98 | 1.35 | 0.28 |
| Flexibility | Fixed Inst. | Fixed Inst. | Config. Inst. | Programmable Inst. |

*FPGA area and time production (AT) = (DSP×100+BRAM×196+Slices) × Total Time (s) [6]

Table II Results Comparison of ASIC Synthesis for Kyber1024

| | CHES'20 [4] | TCASI'21 [1] | TCASI'22 [5] | This work |
|---|---|---|---|---|
| Tech | 65nm | 65nm | 28nm | 40nm |
| Frequency | 45MHz | 200MHz | 540MHz | 660MHz |
| Logic (kGE) | 21 | 104 | 623 | 83 |
| SRAM (kB) | 465 | 190 | 36.75 | 8 |
| Total Time (μs) | 26222.2 | 210.0 | 29.4 | 27.1 |
| Total Energy (μJ)* | 307.68 | - | 16.24 | 0.97 |
| AT** | 550.6 | 16.6 | 13.7 | 2.2 |

*Post-synthesis energy consumption is performed through PTPX
**ASIC area and time production (AT) = Logic Gate (kGE) × Total Time (ms) [5]

Fig. 6. Results comparison
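The two AT footnotes can be evaluated directly. The sketch below applies them to the "This work" columns only; taking the FPGA total time as the sum of Keygen, Encaps and Decaps is an assumption, since the footnote does not define it, and the function names are illustrative.

```python
def fpga_at(dsps, brams, slices, total_time_us):
    """Table I footnote: (DSP*100 + BRAM*196 + Slices) * total time in seconds."""
    return (dsps * 100 + brams * 196 + slices) * total_time_us * 1e-6


def asic_at(logic_kge, total_time_us):
    """Table II footnote: logic area in kGE * total time in milliseconds."""
    return logic_kge * total_time_us * 1e-3


# "This work" column of Table I; total time assumed to be Keygen + Encaps + Decaps.
fpga = fpga_at(dsps=8, brams=9, slices=3202, total_time_us=11.1 + 17.5 + 20.3)

# "This work" column of Table II.
asic = asic_at(logic_kge=83, total_time_us=27.1)

print(round(fpga, 2), round(asic, 1))  # prints: 0.28 2.2
```

Under these assumptions the results round to the tabulated 0.28 and 2.2. Other columns are not checked here and may rest on inputs or scaling that the tables do not show.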


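| Coprocessor Module | LUTs | Flip-flops | DSPs |
|---|---|---|---|
| IFU | 295 | 185 | 0 |
| PLU_Unit | 6539 | 2101 | 8 |
| Sample_Unit | 11670 | 3835 | 0 |
| Decoder | 66 | 52 | 0 |
| Other blocks | 859 | 0 | 0 |
| All | 19429 | 6173 | 8 |

[Figure 7: photograph of the test setup showing the Keygen/Enc/Dec data, the FPGA layout, and the Kyber coprocessor on the UltraScale+ FPGA.]

Fig. 7. Experiment on UltraScale+ FPGA platform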
