Flexible and Efficient Implementation of CRYSTALS-KYBER SIMD RISC-V Coprocessor Based On Customized Vector Instruction-Set Extension

The document presents a flexible and efficient implementation of the CRYSTALS-KYBER key encapsulation mechanism using a SIMD RISC-V coprocessor with a customized vector instruction-set extension. It highlights improvements in computational efficiency through an out-of-order instruction flow and dynamic hardware scheduling, achieving significant speedups and power efficiency compared to existing implementations. The proposed architecture is evaluated on an Ultrascale+ FPGA platform, demonstrating high performance and programmability for post-quantum cryptography applications.


1 IEEE ASSCC 2023/ Session X/ Paper X.

Flexible and Efficient Implementation of CRYSTALS-KYBER SIMD RISC-V Coprocessor Based on Customized Vector Instruction-Set Extension

Jiaming Zhang, Jiahao Lu, Dongsheng Liu, Aobo Li, Xiang Li, Shuo Yang, Ang Hu, Xuecheng Zou

Huazhong University of Science and Technology

With the development of quantum computers in recent years, the security of traditional public-key encryption algorithms is facing serious threats, and post-quantum cryptography (PQC) algorithms that can resist quantum computer attacks are urgently needed. CRYSTALS-KYBER, the key-encapsulation scheme finalized by NIST, is continuously advancing through the standardization process. Existing hardware implementations of Kyber mostly use compact architectures that pursue high speed and high performance at the cost of programmability, while most hardware-software co-designs suffer from low parallelism and performance. To implement the key encapsulation mechanism (KEM) of Kyber flexibly and efficiently, this work presents a single instruction multiple data (SIMD) Kyber coprocessor that supports the RISC-V instruction set. A reconfigurable polynomial and logic unit (PLU) is designed, which can accelerate all types of polynomial vector instruction operations, and a dynamic hardware scheduling strategy is proposed to enable different types of instructions to be executed in parallel, improving the coprocessor pipeline throughput. Implemented on the UltraScale+ FPGA platform and evaluated under SMIC 40nm technology, the proposed coprocessor achieves the fastest computing speed with the lowest power consumption and a 3.5×/6.2× improvement in FPGA/ASIC AT product efficiency.

Figure 1 shows the entire Kyber coprocessor with a scalar RISC-V core. The coprocessor can be tightly coupled into the RISC-V core pipeline alongside the load/store unit (LSU), sharing the tightly coupled memory (TCM). A vector instruction-set extension is used in the instruction fetching unit (IFU) to make the coprocessor suitable for fine-grained operations while providing programmability. The sample unit uses a single Keccak core and implements a 4-parallel rejection sampler and a 16-parallel central binomial distribution (CBD) sampler, requiring only about 80 and 16 clock cycles to complete the sampling of one polynomial of A and of s/e/r, respectively. The poly unit consists of four groups of parity PLUs as the core calculators that implement the polynomial vector operations. Finally, three dual-port RAMs are used to store the coefficients during the computation, and one ROM stores the number theoretic transform (NTT) twiddle factors.

Figure 2 shows the designed hardware architecture of the PLU. The modular multiplier consumes one DSP resource to multiply two coefficients and then uses a divide-and-conquer method to perform the modular reduction of the product. Compression and decompression of polynomials can also be achieved by multiplexing the DSPs. The proposed modular adder, modular subtractor and modular divider are implemented using a combination of carry save adders (CSA) and carry propagate adders (CPA). Three levels of registers are inserted into the modular multiplier so that the PLU forms a pipelined architecture, which significantly improves the performance of the PLU and makes it more suitable for vector instruction operations.

As shown in Figure 3, a PLU_OE consists of a pair of PLUs that process the odd and even vectors, respectively. By controlling the selector, the PLU_OE can be reconfigured to implement different polynomial operations. The polynomial pointwise multiplication (PWM) operation uses the Karatsuba algorithm, which reduces the number of multiplications from 5 to 4 at the cost of two extra additions and one subtraction. Benefiting from the hardware structure of the PLU_OE array, the clock cycles of a vector PWM are reduced from 640 to 72, a decrease of 88.75% compared to the original time.

As shown in Figure 4, the coefficients of the Kyber polynomial vector are stored in dual-port RAM with a parity sequential storage strategy, and the PLU_OE array reads two rows of coefficients at a time. To ensure pipelined operation, the VLEN in the customized vector instruction set is set to 96 to match the bandwidth of the PLU_OE array. Taking the most time-consuming operation, the NTT, as an example, this memory operation strategy reduces the delay of a 256-point NTT operation by 87.5%. To improve the computational efficiency, an out-of-order instruction flow is designed in this work. The coprocessor fetches the next instruction immediately after the previous instruction is dispatched. This allows the poly unit to be activated while the sample unit starts the next round of vector sampling. To avoid memory access conflicts, this work proposes a dynamic hardware scheduling strategy: the poly unit uses two poly RAMs for ping-pong operation after completing the first round of fetching and computing the vectors from the sample RAM, and the sample unit waits for the end of the first round before performing absorb and squeeze operations on the vectors in the sample RAM. This strategy improves the parallelism and throughput of the entire coprocessor while reducing the storage capacity.

Figure 5 shows the three-stage pipeline structure of the whole SIMD coprocessor. The RISC-V core fetches the instructions and sends scalar R-type instructions sequentially; the coprocessor then customizes these instructions into extended vector instructions and checks for Read After Write (RAW) hazards between different types of instructions. Since Kyber is a lattice-based algorithm with 4×4 as the largest operation matrix, and the number of basic polynomial coefficients is 256, the RAM of depth 256 can be divided into 8 vector addresses, and the vs1/vs2/vd operands of instructions point to these vector spaces, enabling the coprocessor to complete a polynomial vector operation with a single instruction.

Figure 6 compares this work with a series of Kyber1024 security-level implementations in Table I. This work is implemented on the Xilinx UltraScale+ platform. Benefiting from the parallelism of vector instructions and the reconfigurability of the PLU_OE array, this work achieves 5.5×, 3.5× and 3.3× speedups compared to [1], [2] and [3], respectively, and realizes the highest operating frequency. A maximum 3.5× improvement in FPGA AT product efficiency is obtained in this work. Compared with the DC synthesis results of other works, this work achieves the highest operating frequency with the lowest power consumption under SMIC 40nm technology, owing to the dynamic hardware scheduling strategy. Since [4] only implements a scalar PQC accelerator with poor performance, this work improves ASIC AT product efficiency by 250×. Compared to [5], which is based on a scalar RISC-V processor, the improvement in AT reaches 6.2×, and compared to [1] it is 7.5×. This application-customized instruction-set coprocessor is programmable and has a high degree of operational parallelism, allowing flexible and efficient implementation of the KEM operations of Kyber.

Acknowledgements:
This work is supported by the National Key Research and Development Program of China (No. 2021YFA0715502), the National Natural Science Foundation of China (No. 61874163, 62104076, 62134002), the National Key Analog Integrated Circuit Laboratory Project of China (No. JCKY2021210C004), the Introduced Innovative R&D Team of Dongguan (No. 201760712600139), and the Laboratory Open Fund of Beijing Smart-chip Microelectronics Technology Co., Ltd. The corresponding author is Jiahao Lu. E-mail: lujiahaohust@foxmail.com

References:
[1] M. Bisheh-Niasar et al., "Instruction-Set Accelerated Implementation of CRYSTALS-Kyber," TCAS-I: Regular Papers, vol. 68, no. 11, pp. 4648-4659, Nov. 2021.
[2] W. Guo et al., "An Efficient Implementation of KYBER," TCAS-II: Express Briefs, vol. 69, no. 3, pp. 1562-1566, March 2022.
[3] A. Li et al., "A Flexible Instruction-based Post-quantum Cryptographic Processor with Modulus Reconfigurable Arithmetic Unit for Module LWR&E," A-SSCC 2022:1-3.
[4] F. Tim et al., "RISQ-V: Tightly Coupled RISC-V Accelerators for Post-Quantum Cryptography," TCHES 2020:446.
[5] Y. Zhao et al., "A High-Performance Domain-Specific Processor With Matrix Extension of RISC-V for Module-LWE Applications," TCAS-I: Regular Papers, vol. 69, no. 7, pp. 2871-2884, July 2022.
[6] D. Kundi et al., "Ultra High-Speed Polynomial Multiplications for Lattice-Based Cryptography on FPGAs," Transactions on Emerging Topics in Computing, vol. 10, no. 4, pp. 1993-2005, Oct.-Dec. 2022.

2023 IEEE Asian Solid-State Circuits Conference (A-SSCC) | 979-8-3503-3003-8/23/$31.00 ©2023 IEEE | DOI: 10.1109/A-SSCC58667.2023.10347942
Authorized licensed use limited to: National Institute of Technology - Agartala. Downloaded on January 03,2025 at 06:05:28 UTC from IEEE Xplore. Restrictions apply.
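The two samplers of the sample unit described in the text can be illustrated in software. The following is a minimal sketch, not the authors' hardware: it assumes the standard Kyber parameters (q = 3329, 256 coefficients, η = 2 for Kyber1024), uses Python's hashlib SHAKE functions in place of the Keccak core, and the function names and the plain seed argument are hypothetical. Each 3-byte group of SHAKE-128 output yields two 12-bit candidates, so the 48 input bits per cycle of the 4-parallel rejection sampler correspond to two such groups.

```python
import hashlib

Q = 3329  # Kyber modulus
N = 256   # coefficients per polynomial


def reject_sample(seed: bytes, n: int = N) -> list[int]:
    """Uniform coefficients in [0, Q) drawn from a SHAKE-128 stream by rejection."""
    coeffs: list[int] = []
    length = 3 * 168  # start with three SHAKE-128 rate blocks
    while len(coeffs) < n:
        stream = hashlib.shake_128(seed).digest(length)
        coeffs = []
        for i in range(0, len(stream) - 2, 3):
            b0, b1, b2 = stream[i], stream[i + 1], stream[i + 2]
            # Split 24 bits into two 12-bit candidates.
            for cand in (b0 | ((b1 & 0x0F) << 8), (b1 >> 4) | (b2 << 4)):
                if cand < Q and len(coeffs) < n:
                    coeffs.append(cand)
        length += 168  # too many rejections: squeeze one more block and retry
    return coeffs


def cbd_sample(seed: bytes, eta: int = 2, n: int = N) -> list[int]:
    """Centered binomial coefficients: popcount(a) - popcount(b) over eta-bit halves."""
    stream = hashlib.shake_256(seed).digest(2 * eta * n // 8)
    bits = int.from_bytes(stream, "little")
    mask = (1 << eta) - 1
    coeffs = []
    for i in range(n):
        chunk = (bits >> (2 * eta * i)) & ((1 << (2 * eta)) - 1)
        a = bin(chunk & mask).count("1")
        b = bin(chunk >> eta).count("1")
        coeffs.append((a - b) % Q)  # stored as a residue modulo Q
    return coeffs
```

With η = 2 each CBD coefficient consumes 4 input bits, which is why the CBD path can be made much wider than the rejection path for the same Keccak output width.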
[Figure 1: block diagram of the scalar RISC-V core (IFU, EXU, LSU, ITCM, DTCM, RegFile) and the Kyber core. The Kyber core contains the sample unit (SHA unit with SHAKE-128, SHAKE-256, SHA3-256 and SHA3-512 modes, reject sampler, CBD sampler), the poly SIMD unit (PLU_OE0 to PLU_OE3), RAM1 to RAM3, the ROM and the memory arbitrators. Sampler summary: reject sampler, parallel number 4, 48 input bits, sample A in about 80 cycles; CBD sampler, parallel number 16, 4/6 input bits, sample s/e/r in 16 cycles.]

Fig. 1. SIMD Kyber coprocessor with scalar RISC-V core overall system architecture

[Figure 2: PLU datapath with 12-bit inputs A_in, B_in, C_in and W_in and outputs Po1 and Po2. Blocks: modular adder (M+), modular subtractor (M-), modular divider (M÷, using the constant 12'd1665), modular multiplier (M×, one DSP followed by shift-based modular reduction), compress and decompress.]

Fig. 2. Hardware architecture of polynomial and logic unit
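The modular adder, subtractor and divider of the PLU can be modeled in a few lines. This is a behavioral sketch only, not the CSA/CPA circuit: it assumes q = 3329 and interprets the modular divider as a halving unit, which is an assumption based on the constant 12'd1665 = (q + 1)/2 that appears in Figure 2. The function names are hypothetical.

```python
Q = 3329             # Kyber modulus
HALF = (Q + 1) // 2  # 1665, the inverse of 2 modulo Q


def mod_add(a: int, b: int) -> int:
    """(a + b) mod Q for a, b in [0, Q): one trial subtraction, select on borrow."""
    s = a + b
    d = s - Q
    return s if d < 0 else d


def mod_sub(a: int, b: int) -> int:
    """(a - b) mod Q for a, b in [0, Q): add Q back when the difference borrows."""
    d = a - b
    return d + Q if d < 0 else d


def mod_half(a: int) -> int:
    """a / 2 mod Q for a in [0, Q): shift right, add (Q + 1) / 2 when a is odd."""
    return (a >> 1) + (HALF if a & 1 else 0)
```

For odd a the identity (a + Q)/2 = (a >> 1) + (Q + 1)/2 holds, so halving needs only a shift and a conditional constant addition, with no multiplier.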
[Figure 3: PLU_OE, built from an even PLU (PLU_E) and an odd PLU (PLU_O), reconfigured through its selectors into the NTT, INTT, PWM_F, PWM_L and Css/Decss modes; the Karatsuba pointwise multiplication is performed as a first operation followed by a second operation.]

Fig. 3. PLU functional reconfiguration and Karatsuba polynomial pointwise multiplication

[Figure 4: coefficient rows of eight coefficients each distributed over PLU_OE0 to PLU_OE3 across Round1 to Round7, alternating between Poly_RAM2 and Poly_RAM3. Scheduling timeline: Sample_Unit: G(d), S0, S1, A00, A01, A10, A11, e0, e1; Poly_Unit: NTT(S0), NTT(S1), PWM(A0 S0), PWM(A1 S1), PWM(A2 S0), PWM(A3 S1), ADD1, ADD2; W/R_Unit: Write, G. RAM usage is shown for the Idle, Absorb, Working and Squeeze states of the sample unit during NTT Round1, Round2, Round3~6 and Round7.]

Fig. 4. Coefficient storage method and coprocessor dynamic hardware scheduling strategy
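The saving from 5 to 4 multiplications quoted for the Karatsuba PWM can be checked with a small model of one Kyber base multiplication, that is, the product of two degree-1 polynomials modulo X^2 - zeta. The sketch below assumes q = 3329; the function names are hypothetical, and the value zeta = 17 in the usage note is only an example twiddle factor.

```python
Q = 3329  # Kyber modulus


def basemul_schoolbook(a0, a1, b0, b1, zeta):
    """(a0 + a1*X)(b0 + b1*X) mod (X^2 - zeta): 5 multiplications, 2 additions."""
    c0 = (a0 * b0 + (a1 * b1 % Q) * zeta) % Q  # a0*b0, a1*b1, (a1*b1)*zeta
    c1 = (a0 * b1 + a1 * b0) % Q               # a0*b1, a1*b0
    return c0, c1


def basemul_karatsuba(a0, a1, b0, b1, zeta):
    """Same result with 4 multiplications, 4 additions and 1 subtraction."""
    m0 = a0 * b0 % Q
    m1 = a1 * b1 % Q
    m2 = (a0 + a1) * (b0 + b1) % Q  # two extra additions feed one multiplication
    c0 = (m0 + m1 * zeta) % Q       # fourth multiplication: m1 * zeta
    c1 = (m2 - (m0 + m1)) % Q       # one extra addition and one subtraction
    return c0, c1
```

For example, basemul_karatsuba(1, 2, 3, 4, 17) and basemul_schoolbook(1, 2, 3, 4, 17) both return (139, 10). The cross term a0*b1 + a1*b0 is recovered from (a0 + a1)(b0 + b1) - a0*b0 - a1*b1, which is where the two extra additions and one subtraction mentioned in the text come from.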
[Figure 5: three-stage pipeline (Stage1, Stage2, Stage3, at instruction times 0, T and 2T) with R_Launch, Decoder, RAW check and Distributor issuing Sample_Inst to the Sample_Unit, Poly_Inst to the Poly_Unit and W/R_Inst to the Write/Read unit, with Mem.req and Rsp signals. The instructions address vector spaces in Sample RAM1, Poly RAM2 and Poly RAM3.]

Vector instruction format (VLEN: 96-bit):
Bits     31-25   24-20   19-15   14-12   11-7   6-0
Field    Func7   VS2     VS1     Func3   VD     Opcode

Customized vector instructions:
Type     Instruction   Operation
Poly     NTT/INTT      v[vd]=ntt/intt(v[vs1])
Poly     PWM1/PWM2     v[vd]=pwm(v[vs1],v[vs2])
Poly     Poly_ADD      v[vd]=add(v[rs1],v[vs2])
Poly     Css/DeCss     v[vd]=css(v[vs1])
Sample   G             v[vd]=sha_G(v[vs1])
Sample   PRF/KDF       v[vd]=sha_F(v[vs1])
Sample   XOF           v[vd]=sha_X(v[vs1])
Sample   H             v[vd]=sha_H(v[vs1])
W/R      Write         v[vd]←d[vs1]
W/R      Read          d[vd]←v[vs1]

Fig. 5. Three-stage pipeline SIMD coprocessor and customized vector instruction set extension

Table I Results Comparison of FPGA Implementation for Kyber1024
              TCASI’21 [1]   TCASII’22 [2]   ASSCC’22 [3]    This work
Platform      Virtex-7       Artix-7         UltraScale+     UltraScale+
Frequency     156MHz         159MHz          250MHz          360MHz
Slices        5000           2300            3997            3202
DSPs          12             4               36              8
BRAMs         17             16              5               9
Keygen (μs)   64.1           49.1            -               11.1
Encaps (μs)   89.7           52.8            70.4            17.5
Decaps (μs)   115.4          66.0            87.3            20.3
AT*           1.00           0.98            1.35            0.28
Flexibility   Fixed Inst.    Fixed Inst.     Config. Inst.   Programmable Inst.
*FPGA area and time production (AT) = (DSP×100+BRAM×196+Slices) × Total Time (s) [6]

Table II Results Comparison of ASIC Synthesis for Kyber1024
                     CHES’20 [4]   TCASI’21 [1]   TCASI’22 [5]   This work
Tech                 65nm          65nm           28nm           40nm
Frequency            45MHz         200MHz         540MHz         660MHz
Logic (kGE)          21            104            623            83
SRAM (kB)            465           190            36.75          8
Total Time (μs)      26222.2       210.0          29.4           27.1
Total Energy (μJ)*   307.68        -              16.24          0.97
AT**                 550.6         16.6           13.7           2.2
*Post-synthesis energy consumption is performed through PTPX
**ASIC area and time production (AT) = Logic Gate (kGE) × Total Time (ms) [5]

Fig. 6. Results comparison
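The vector instruction format of Fig. 5 uses the standard RISC-V R-type field layout (Func7, VS2, VS1, Func3, VD, Opcode). A minimal sketch of packing and unpacking such a 32-bit word follows. The helper names and the field values in the usage note are hypothetical; the paper does not list the opcode or function codes of its instructions.

```python
# (name, least significant bit, width), following the bit positions of Fig. 5
FIELDS = (
    ("func7", 25, 7),
    ("vs2", 20, 5),
    ("vs1", 15, 5),
    ("func3", 12, 3),
    ("vd", 7, 5),
    ("opcode", 0, 7),
)


def encode_vector_inst(**values: int) -> int:
    """Pack the six fields into one 32-bit instruction word."""
    word = 0
    for name, lsb, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << lsb
    return word


def decode_vector_inst(word: int) -> dict[str, int]:
    """Split a 32-bit instruction word back into its fields."""
    return {name: (word >> lsb) & ((1 << width) - 1) for name, lsb, width in FIELDS}
```

As a usage example with made-up codes, encode_vector_inst(func7=1, vs2=2, vs1=1, func3=0, vd=3, opcode=0x0B) gives 0x0220818B, and decode_vector_inst recovers the same six values. Because vs1, vs2 and vd select whole vector spaces rather than scalar registers, a single such word is enough to describe a polynomial vector operation.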

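The FPGA AT metric in the footnote of Table I can be evaluated directly from the table columns. The sketch below assumes that "Total Time" is the sum of the Keygen, Encaps and Decaps latencies; that reading is an assumption, although it does reproduce the AT entries listed for [2] and for this work. The function name is hypothetical.

```python
def fpga_at(dsps: int, brams: int, slices: int, times_us: tuple[float, ...]) -> float:
    """AT = (DSP*100 + BRAM*196 + Slices) * total time in seconds."""
    equivalent_slices = dsps * 100 + brams * 196 + slices
    total_time_s = sum(times_us) * 1e-6
    return equivalent_slices * total_time_s


# Resource and latency values copied from Table I.
at_this_work = fpga_at(8, 9, 3202, (11.1, 17.5, 20.3))  # rounds to 0.28
at_ref2 = fpga_at(4, 16, 2300, (49.1, 52.8, 66.0))      # rounds to 0.98

# Ratio of the two AT values, which rounds to 3.5.
improvement = at_ref2 / at_this_work
```

Under this assumption the ratio between [2] and this work matches the 3.5× FPGA AT improvement stated in the text.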

Coprocessor Module   LUTs    Flip-flops   DSPs
IFU                  295     185          0
PLU_Unit             6539    2101         8
Sample_Unit          11670   3835         0
Decoder              66      52           0
Other blocks         859     0            0
All                  19429   6173         8

[Figure 7: photograph of the Kyber coprocessor running on the UltraScale+ FPGA, with the Keygen/Enc/Dec data and the FPGA layout.]

Fig. 7. Experiment on UltraScale+ FPGA platform

