This article has been accepted for publication in IEEE Transactions on Circuits and Systems--II: Express Briefs.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TCSII.2023.3273821

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, VOL. XX, NO. X, MONTH 20XX 1

High-speed Design of Post Quantum Cryptography with Optimized Hashing and Multiplication
Malik Imran, Graduate Student Member, IEEE, Aikata Aikata, Sujoy Sinha Roy,
and Samuel Pagliarini, Member, IEEE

This work was partially supported by the EC through the European Social Fund in the context of the project “ICT programme”. It was also supported by European Union’s H2020 research and innovation programme under grant agreement No 952252 (SAFEST). It is also partially supported by the State Government of Styria, Austria – Department Zukunftsfonds Steiermark.
M. Imran and S. Pagliarini are with the Centre for Hardware Security, Department of Computer Systems, Tallinn University of Technology, Tallinn, Estonia. (e-mail: {malik.imran, samuel.pagliarini}@taltech.ee)
A. Aikata and S. S. Roy are affiliated with the Institute of Applied Information Processing and Communications, Graz University of Technology, Graz, Austria. (e-mail: {aikata, sujoy.sinharoy}@iaik.tugraz.at)

Abstract: In this brief, we realize different architectural techniques for improving the performance of post-quantum cryptography (PQC) algorithms when implemented as hardware accelerators on an application-specific integrated circuit (ASIC) platform. Using SABER as a case study, we designed a 256-bit wide architecture geared for high-speed cryptographic applications that incorporates smaller and distributed SRAM memory blocks. Moreover, we have adapted the building blocks of SABER to process 256-bit words. We have also used a buffering technique for efficient polynomial coefficient multiplications to reduce the clock cycle count. Finally, double-sponge functions are combined serially (one after another) in a high-speed KECCAK core to improve the hash operations of SHA/SHAKE. For key-generation, encapsulation, and decapsulation operations of SABER, our 256-bit wide accelerator with a single sponge function is 1.71x, 1.45x, and 1.78x faster than the raw clock cycle count of a serialized SABER design. Similarly, our 256-bit implementation with double-sponge functions takes 1.08x, 1.07x & 1.06x fewer clock cycles compared to its single-sponge counterpart. The studied optimization techniques are not specific to SABER – they can be utilized for improving the performance of other lattice-based PQC accelerators.

Index Terms: PQC, ASIC design, hardware accelerator, cryptocore, SABER.

I. INTRODUCTION
High-performance hardware-based cryptographic accelerators are essential for wireless, telecom, cloud, data centers, and enterprise systems. As examples, the 8920 and 8955 Intel chipsets can process 5k and 40k RSA decryption operations per second [1]. The IBM 4769 hardware security module offers key exchange and signature generation/verification using Elliptic Curve Cryptography (ECC) and RSA standards [2]. Even if these remarkable chips deliver thousands of operations per second, they might become compromised since the security strength of ECC and RSA can be broken using Shor’s algorithm [3] on a quantum computer. Recently, Google’s Sycamore [4] delivered a 53-qubit quantum computer that can do in 200 seconds a task that would take a classical computer 10,000 years. Different labs worldwide have developed even more powerful quantum computers [5]. Hence, high-speed quantum-resistant cryptographic hardware accelerators are mandated to supersede ECC- and RSA-based devices.

Existing architectures for post-quantum cryptography (PQC) algorithms on field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) platforms are demonstrated in [6]–[16]. These accelerators reveal that the PQC algorithms need secure hash functions for different purposes, e.g., binomial sampling. For instance, the recently standardized CRYSTALS-Kyber algorithm requires variants of SHA3 and an extendable-output function (XOF), that is, SHAKE. The execution of variants of SHA3 and an XOF depends on a KECCAK sponge function to compute state permutations. The building blocks of the KECCAK sponge function, that is, theta, pi, rho, chi, and iota, can operate (only) on 64-bit words. This encourages designers to select 64 bits for memory width and for datapaths in their PQC accelerators. This is the case for different PQC accelerators in [6]–[16]. Moreover, PQC algorithms require relatively large storage elements to keep initial, intermediate, and final results. For example, a memory size of 1024×64 is needed to implement different variants of SABER [17], that is, LightSABER, SABER, and FireSABER. There are several possibilities for organizing this memory; one choice is to use one single 1024×64 memory as in [10]. This choice does not allow for parallel read/write operations, resulting in a higher cycle count. Another solution is to use multiple smaller memories like those employed in SABER designs of [11]–[15]. These implementations, however, do not take full advantage of the smaller memories because the read/write operations are performed serially instead of in parallel – even if the memories have different purposes.

Hence, in this brief, we present an ASIC 256-bit accelerator for SABER to showcase the advantages of wider datapaths and the memory decisions accompanying it. These advantages also apply to other PQC algorithms. For reducing the clock cycle count, we employed four high-speed SRAM memories of sizes 256×64 each and described their control logic to allow for parallel read/write operations. The building blocks of SABER are implemented to process 256-bit words. We have also used a long buffer approach for multiplying polynomial coefficients in parallel. Finally, double-sponge functions are connected serially (one after another) in a high-speed KECCAK core to improve further the studied accelerator’s performance.

II. PROPOSED CRYPTO ACCELERATOR
Fig. 1 shows the block diagram of our proposed crypto accelerator architecture. It includes data memory, an address decoder unit, and a SABER crypto core. The data memory

Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 09,2023 at 10:53:04 UTC from IEEE Xplore. Restrictions apply.
© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

holds initial, intermediate, and final results. Each memory can read/write one 64-bit word in one clock cycle. So, four memory instances in parallel can read/write one 256-bit word in one cycle. The address decoder unit selects an appropriate memory for reading/writing a 64-bit word. Also, it communicates to the SABER controller to pass/collect 64-bit (for SHA3 variants) or 256-bit (for other SABER blocks) data as input/output to/from the SABER core. The SABER crypto core includes the required building blocks and is wrapped by a dedicated controller that handles 64-bit or 256-bit data for write/read operations. The controller generates the control signals for the corresponding SABER building blocks. Additionally, it allows one SABER block to operate at a time. Next, we describe the implementation of the SABER blocks.

Fig. 1. Block diagram of our proposed crypto accelerator architecture.

A. Optimization of SHA3-256/512 & SHAKE128
Since all of the SHA3 variants utilize the KECCAK sponge function [18], we operate the SHA3-256, SHA3-512, and SHAKE-128 like a wrapper in our proposed architecture. Moreover, details about the utilized KECCAK cores with single- and double-sponge functions are described below.

The gray diagram in Fig. 2 (left) depicts the high-speed KECCAK core of [19]. As can be seen, it needs an instance each of (i) Buffer_Unit, (ii) Const_Gen_Unit and (iii) Round_Unit. The Buffer_Unit holds the initial vectors and keeps the intermediate and final results. Const_Gen_Unit generates the round vectors based on a 5-bit counter value (coming from Buffer_Unit). The Round_Unit is the KECCAK sponge function and operates the KECCAK building blocks (theta, pi, rho, chi and iota) based on the round constants and a 64-bit buffered value from the Buffer_Unit. Moreover, it generates a 64-bit vector as output which is further connected as an input to a register inside the Buffer_Unit. This strategy requires 28 cycles to operate 24 rounds iteratively: 24 cycles are for 24 KECCAK rounds and an additional 4 cycles specify the ‘wait’ until the registers in the datapath are free. Previously, the KECCAK core of [19] has been utilized in [10], [14], [15] for SABER hardware accelerators.

Fig. 2. KECCAK optimizations. The gray diagram corresponds to the high-speed KECCAK core of [19]. The purple diagram indicates our optimized KECCAK where additional blocks appear in orange.

The purple-colored diagram in Fig. 2 details how the number of clock cycles of the KECCAK core can be reduced by half using additional orange-colored boxes. We modify the Buffer_Unit by including a register and an accumulator. Moreover, we used additional instances of Const_Gen_Unit and Round_Unit. Each instance of a Const_Gen_Unit takes a 5-bit counter value as input and generates a 64-bit constant vector as an output. Moreover, each instance of the Round_Unit (or sponge function) takes two 64-bit inputs and produces a single 64-bit output. The first 64-bit input to the corresponding sponge function is from the round constants block. The second 64-bit input to the first sponge function is from the KECCAK buffer and its output goes as an input to the second sponge function. This means the sponge functions are connected serially one after another. The outputs of the first and second sponge functions are connected as inputs to the KECCAK buffer to accumulate the results. With this strategy, 14 clock cycles are required to operate 24 rounds of KECCAK. Hence, the cycle count is halved compared to [10], [14], [15].

B. Fully Parallel Schoolbook Multiplier
We have utilized long public and secret polynomial buffers to load coefficients of public and secret polynomials at once. This one-time data loading from memory helps to reduce the cycle count. To compute the multiplications, the long poly buffers need an m-bit shift towards left/right. We shift left by 256 bits as our accelerator deals with 256-bit data for reading/writing operations to/from data memory. SABER requires a matrix multiplication for multiplying polynomial coefficients, as presented in Eq. 1. The matrices P, S and R hold the public, secret and resultant polynomial coefficients.

\[
\begin{bmatrix}
P_{(0,0)} & P_{(0,1)} & \cdots & P_{(0,255)} \\
P_{(1,0)} & P_{(1,1)} & \cdots & P_{(1,255)} \\
P_{(2,0)} & P_{(2,1)} & \cdots & P_{(2,255)}
\end{bmatrix}
\cdot
\begin{bmatrix} S_0 \\ S_1 \\ S_2 \end{bmatrix}
=
\begin{bmatrix} R_0 \\ R_1 \\ R_2 \end{bmatrix}
\tag{1}
\]

As shown in Fig. 3, our fully parallelized polynomial multiplication architecture consists of two long polynomial buffers (LPPB and LSPB) and three copies of a schoolbook multiplier, that is, SBM1, SBM2, and SBM3. The length of LPPB and LSPB is proportional to the size of the matrix P and matrix S, respectively. Each row of matrix P contains 256 13-bit polynomial coefficients. Each row of the matrix S contains 256 4-bit polynomial coefficients. Therefore, 768 coefficients are in three rows of a matrix P and a matrix S. Then, the length of LPPB is 9984 bits (768×13) and the length of LSPB is 3072 bits (768×4). Multiplication starts with loading 768 polynomial coefficients into LPPB and LSPB buffers.

Fig. 3. Fully parallelized schoolbook multiplier for SABER.
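The LPPB and LSPB lengths quoted in this subsection follow from simple arithmetic. The short Python sketch below is illustrative only: the function names are hypothetical and are not taken from the original design. It recomputes both buffer lengths from the stated parameters (three rows, 256 coefficients per row, 13-bit public and 4-bit secret coefficients). As an added illustration, it also counts how many bus-wide words would be needed to fill each buffer over a 64-bit and a 256-bit data bus, under the assumption that one word is transferred per access.

```python
# Parameters taken from the text: three rows of 256 coefficients each,
# with 13-bit public and 4-bit secret polynomial coefficients.
ROWS = 3
COEFFS_PER_ROW = 256
PUBLIC_BITS = 13
SECRET_BITS = 4


def buffer_length_bits(coeff_bits: int) -> int:
    """Total number of bits needed to hold all coefficients of one matrix."""
    return ROWS * COEFFS_PER_ROW * coeff_bits


def words_to_fill(buffer_bits: int, bus_bits: int) -> int:
    """Bus-wide words needed to fill a buffer (ceiling division).

    Assumption for illustration: one bus-wide word is moved per access.
    """
    return -(-buffer_bits // bus_bits)


lppb_bits = buffer_length_bits(PUBLIC_BITS)  # long public poly buffer
lspb_bits = buffer_length_bits(SECRET_BITS)  # long secret poly buffer

if __name__ == "__main__":
    print("LPPB length (bits):", lppb_bits)
    print("LSPB length (bits):", lspb_bits)
    for bus in (64, 256):
        print(
            f"{bus}-bit bus:",
            words_to_fill(lppb_bits, bus), "words for LPPB,",
            words_to_fill(lspb_bits, bus), "words for LSPB",
        )
```

Widening the bus from 64 to 256 bits cuts the number of words per buffer by a factor of four in this model, which is the intuition behind loading the long buffers once over the wide datapath.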


After loading all the 768 polynomial coefficients into LPPB and LSPB buffers, the corresponding 256 public and secret polynomial coefficients are forwarded to multipliers SBM1, SBM2 and SBM3. As detailed in Fig. 3, the SBM1 multiplier consists of three buffers (i.e., PB1, SP1, and AB1) and 256 MAC (multiply-and-accumulate) units. PB1 and SP1 contain the 256 coefficients of the first row of the matrix P and matrix S for multiplication. Then, the execution of multiplication using the MAC units takes 256 cycles. Each MAC unit takes 13- and 4-bit public and secret polynomial coefficients as inputs and results in a 13-bit polynomial as output, as presented in Fig. 3. A 13-bit output polynomial from each MAC depends on the 4-bit secret polynomial. Two bits from the LSB side of a secret polynomial decide between shifted 13-bit public polynomial coefficients (a, 2a, 3a, 4a) using a multiplexer M1. A third bit from the LSB side is a sign bit. Finally, the last bit of a secret polynomial coefficient determines the modular addition or subtraction operation to execute for a 13-bit multiplication result. Moreover, AB1 accumulates the multiplication results. The same multiplication strategy is applied in SBM2 and SBM3 multipliers of Fig. 3. In the SBM2 multiplier, PB2 and SP2 keep the public and secret polynomial coefficients of the second row of the matrix P and matrix S. Similarly, PB3 and SP3 hold the public and secret polynomial coefficients for the third row of matrices P and S. As presented in Fig. 3, an additional RAB buffer accumulates the multiplication results from SBM1, SBM2, and SBM3 multipliers before writing back on the data memory. Since all three multipliers (SBM1, SBM2 and SBM3) operate in parallel, 256 clock cycles are required to multiply SABER polynomial coefficients.

In our previously implemented schoolbook multipliers of [10], [14], [15], we utilized 256 MAC units, and these MACs are operated serially to compute the polynomial multiplications in 768 clock cycles. In this work, our fully-parallel multiplier utilizes 768 MAC units and takes 256 cycles. Also, our buffer approach is beneficial to avoid frequent memory access for read/write operations as we have a 256-bit data bus instead of the typical 64-bit size found in the literature. The total cycle cost of loading public and secret polynomials from data memory is 156 and 48 for the schoolbook designs of [10], [14], [15]. The fully-parallelized architecture of this work reduces these costs to 39 and 12 cycles. As implied by the block diagram of Fig. 3, the area of our multiplier is approximately 3 times that of a serialized schoolbook multiplier.

C. Other implemented SABER building blocks
A sampler is needed to compute the sample from a pseudo-random input string. The binomial sampler in our proposed architecture is a combinational block. It maps 256-bit pseudo-random bits to a 256-bit sample value in one clock cycle. The transformation from a byte into a bit string is the task of the Unpack unit. A copy block is only needed during the KEM key-generation process. It transforms the rows and columns to determine a transpose of a matrix generated using SHAKE128. The verify block is only required during the decapsulation operation. It provides a word-by-word comparison between the received ciphertext and the re-encrypted ciphertext. The result of verify block is stored in a register that is used by CMOV to either copy the decrypted session key or a pseudo-random string at a specified memory address. The AddPack performs coefficient-wise addition with a constant followed by the generated message and packs the resultant bits into a byte string. Like the AddPack block, AddRound computes coefficient-wise addition of a constant followed by coefficient-wise rounding. The BS2POLVECp block converts the byte string into a polynomial vector.

III. RESULTS AND COMPARISONS
In Fig. 4, we show the clock cycle count for serial and parallel SABER architectures. From left to right, the first row with three panels in Fig. 4 specifies the key generation, encapsulation and decapsulation operations for a serial SABER architecture. Similarly, the second row includes three panels for the same three operations on a parallel SABER architecture with a single sponge in its KECCAK block. The considered SABER architecture has double-sponge functions in the third row of Fig. 4. The bottom panel of Fig. 4 provides the total cycle counts for key generation, encapsulation, and decapsulation operations of all three considered designs. Moreover, in Fig. 4, hash denotes the SHA3-256/512 and SHAKE128 functions. Notably, the multiplier and hash operations dominate the computation time, so they are prime targets for optimizations.

[Figure: three rows of bar-chart panels titled Key Generation, Encapsulation, and Decapsulation, plus a bottom panel of total clock cycle counts; legend: Serial + single-sponge, Parallel + single-sponge, Parallel + double-sponge.]
Fig. 4. Clock cycle distribution of SABER for serial and parallel architectures. Serial to parallel designs with single-sponge function results in an average 39% cycle reduction. In parallel designs, moving from single- to double-sponge functions, we obtained a 7% reduction in clock cycles.

As expected, Fig. 4 shows a decrease in clock cycles for key generation, encapsulation, and decapsulation operations when moving from a serial to a parallel design with a single-sponge function (see blue and red bars). Similarly, we have a decrease in clock cycles for hash operation when comparing two parallel SABER designs with single- and double-sponge functions in the KECCAK (see red and green bars). The last panel in Fig. 4 highlights the total cycle count for each operation on all architectures. On average, the number of clock cycles required to execute key generation, encapsulation, and decapsulation operations using a parallel accelerator with one sponge function is 1.65× lower compared to the serial SABER architectures of [14], [15]. The use of double-sponge functions in our parallel accelerator further reduces the clock cycle requirement by 1.07× when compared to a parallel implementation with one sponge function. Therefore, a significant decrease in the clock cycle count when moving


from a serialized design to parallel architectures, reveals that the realized approaches in this work can be utilized in other PQC algorithms for performance improvements.

On a commercial 28nm ASIC technology, the frequency, latency, area, power, and energy results (after synthesis) of our proposed parallel SABER architectures are given in Table I. KG, ENC and DEC in Table I define the SABER key-generation, encapsulation, and decapsulation operations. Similarly, Lkg and Dyn are the leakage and dynamic power consumption. We have utilized the Vivado IDE tool for simulations and Cadence Genus for logic synthesis. Both implementations operate at 2500MHz. The use of a double-sponge function allows us to minimize the computation time (i.e., latency, calculated as clock cycles over frequency) at a modest increase in power (+4.63% and +6.84% for leakage and dynamic power, respectively) and area (+1.57%). The max frequency is obtained by pushing the timing constraint until the slack is close to zero. Despite the small area and power increase, the double-sponge function has higher merit as it consumes nearly the same energy (product of dynamic power and computation time) as the single-sponge version.

TABLE I
RESULTS OF PROPOSED CRYPTO ACCELERATOR ON 28 NM TECHNOLOGY.

Implementation details        single-sponge        double-sponge
Maximum Frequency (MHz)       2500                 2500
Latency (KG/ENC/DEC) (µs)     1.66/1.96/2.09       1.53/1.82/1.96
Utilized Area (mm²)           0.251                0.255
Power (Lkg/Dyn) (mW)          10.96/556.25         11.49/597.05
Energy (µJ)                   0.923/1.090/1.162    0.913/1.086/1.170

ASIC implementations of recent PQC accelerators are compared in Table II. The clock cycles (CC) and latency (Lat) values are reported for KG, ENC, and DEC operations. Moreover, the architectures marked with the blue checkmark in Table II give the best-in-class results.

TABLE II
ASIC COMPARISON WITH EXISTING ACCELERATORS FROM LITERATURE.

Ref.    Cycles (K)        Latency (µs)         Freq     Area     Pow
        KG/ENC/DEC        KG/ENC/DEC           (MHz)    (mm²)    (mW)
65nm technology
[14]    7.1/7.1/9.3       7.1/7.1/9.3          1002     0.314    142.5
[12]    14.3/18.7/23.3    89.6/116.9/146.1     160      0.158    –
[15]    7.1/7.1/9.3       10.0/9.9/13.0        715      1        153.6
[16]    350/405/425       7740/9011/9437       45       0.840    2.6
TW†     4.1/4.9/5.2       4.1/4.9/5.2 ✔        1002     0.944    647.2
TW‡     3.8/4.5/4.9       4.0/4.8/5.2 ✔        936      1.026    860.9
40nm technology
[11]    1.0/1.4/1.6       2.6/3.6/4.2          400      0.380    –
TW†     4.1/4.9/5.2       2.45/2.90/3.09 ✔     1694     0.846    163.2
TW‡     3.8/4.5/4.9       3.47/4.10/4.47       1095     0.767    137.0
28nm technology
[9]     9/11/13           4.54/5.67/6.95       2000     0.263    367.1
[13]    –/–/–             –/–/–                500      3.6      39–368
TW†     4.1/4.9/5.2       1.6/1.9/2.0          2500     0.251    567.1
TW‡     3.8/4.5/4.9       1.5/1.8/1.9 ✔        2500     0.255    608.4

TW† & TW‡: our designs with single- & double-sponge, Fabricated: [12], [13], [15], Technology mapped: [9], [11], [14], [16], TW† & TW‡, Area (chip size): [13], [15], CRYSTALS-Kyber: [9], [16], SABER: Others.

Due to the parallel use of smaller SRAM memories, our architecture with a single-sponge function requires 1.73, 1.44 and 1.78 times fewer clock cycles for SABER key-generation, encapsulation and decapsulation operations when compared to [14]. Our SABER design with double-sponge functions results in 1.86, 1.57 and 1.89 times lower cycle count. Our single-sponge SABER design requires 1.73, 1.44 and 1.78 times lower computation time (i.e., latency). Similarly, when using a double-sponge function, the latency values are 1.77, 1.47 and 1.89 times lower. As seen in the last two columns of Table II, the area and power values of our designs are higher than [14] as we are utilizing a parallelized 256-bit architecture.

A 64-bit SABER chip fabricated in [12] requires 3.48, 3.81, and 4.48 times more clock cycles compared to our parallel SABER design with a single-sponge function. With double-sponge functions, the cycle requirement of our design is 3.76, 4.15, and 4.48 times lower than [12]. Our 256-bit implemented SABER design with single-sponge and double-sponge functions show 6.26 and 5.85 times speedup in clock frequency. Moreover, our single-sponge SABER design displays 21.85, 23.85, and 28.09 times lower latency. For double-sponge functions, the required computation time is 22.4, 24.35, and 28.09 times lower. We utilize 5.97 and 6.49 times more hardware resources with single and double-sponge functions. Two different operating frequencies, 160MHz and 10MHz, are reported in [12]. For 160MHz, the consumed power is not reported in the reference design. However, for 10MHz, the consumed power is 0.3339mW. Our parallel SABER architectures with single- and double-sponge functions consume 647.2 and 860.9mW power at 1002 and 936MHz clock frequency. This increase is expected given that our frequency of operation is 1-2 orders of magnitude higher.

If we compare our results to [15], our proposed design with a single-sponge function takes 1.73, 1.44, and 1.78 times fewer clock cycles. The design with double-sponge functions requires 1.86, 1.57, and 1.89 times fewer clock cycles. The reasons are the parallel use of smaller memories and a fully parallel multiplier in our SABER design. On the other hand, in [15], smaller memories are accessed serially and an iterative schoolbook multiplier is utilized. Our single-sponge and double-sponge implemented SABER designs are 1.40 and 1.30 times faster (in frequency). As shown in column three of Table II, the computational cost of our SABER design is much lower than [15]. Column five shows that our SABER core utilizes an area (almost) equivalent to the chip size of [15]. Due to parallel computations in this work, our single-sponge and double-sponge functions consume 4.21 and 5.60 times higher power than [15]. There is always a trade-off between processing speed and area/power parameters.

The results on 40nm process technology deserve a closer look. SABER designs with single-sponge and double-sponge functions utilize 0.079mm² and 0.115mm² area for SHA3-256/512 and SHAKE128. For identical SABER designs, the hardware utilization of our fully parallelized multiplier is 0.637mm² and 0.523mm². In Table II, if we see the total utilized area and consumed power of our design with single-sponge and double-sponge functions, the SABER design with double-sponge functions takes lower resources and consumes less power than the SABER design with single-sponge function. This is counterintuitive at first but becomes clear once we notice the significant frequency decrease, from 1694MHz to 1095MHz, in column four of Table II. This happens due to the different critical paths shifting from one


design to another (the double-sponge becomes the critical path). In other words, the critical paths of the synthesis of our single- & double-sponge functions lie in the KECCAK but with different start & end points. Additionally, the choice between single- and double-sponge is a function of the technology: the relative speed of logic versus that of the memory dictates where the critical path lies and whether the design can accommodate a double-sponge KECCAK. This consideration applies not only to SABER but also to other PQC accelerators.

Compared to [11], our parallel designs take more clock cycles for KG, ENC, and DEC operations of SABER. The reason is the parallel use of smaller memories in our design while dedicated memories for specific SABER computations are utilized in [11]. Our SABER design with single-sponge and double-sponge functions is 4.23 and 2.73 times faster in clock frequency. Moreover, our implementation with a single-sponge function requires lower computation time (see column three in Table II). We are utilizing more area than [11] because our focus was to reduce the computation time and improve the circuit frequency. The comparison with power results is impossible as they are unavailable in the reference design.

A flexible design [13] for SABER, NTRU, Dilithium, Rainbow, CRYSTALS-Kyber and McEliece PQC algorithms is five times slower in clock frequency than our dedicated SABER design. The utilized area is in chip size (3.6mm²), as seen in Table II, so a fair one-to-one comparison is impossible. Similarly, a reasonable comparison with consumed power is impossible as the power values are given as a range from 39mW to 368mW. The information about the clock cycle and latency parameters is not reported in the reference design of [13]. Therefore, this comparison is (also) not possible.

On 65 and 28nm technologies, the flexible accelerators of

be considered in high-speed implementations of CRYSTALS-Kyber and CRYSTALS-Dilithium accelerators.

REFERENCES
[1] Intel, “Integrated cryptographic and compression accelerators on Intel architecture platforms,” last accessed on September 29, 2022, available at: https://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/quickassist-adapter-8920-brief.pdf.
[2] IBM, “IBM CEX7S / 4769 PCIe cryptographic coprocessor (HSM),” last accessed on October 20, 2022, available at: https://public.dhe.ibm.com/security/cryptocards/pciecc4/docs/4769 Data Sheet.pdf.
[3] P. W. Shor, “Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer,” SIAM J. Comput., vol. 26, no. 5, pp. 1484–1509, 1997.
[4] F. Arute, K. Arya, R. Babbush, et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, pp. 505–510, 2019.
[5] M. Gong, S. Wang, C. Zha, et al., “Quantum walks on a programmable two-dimensional 62-qubit superconducting processor,” Science, vol. 372, no. 6545, pp. 948–952, 2021.
[6] L. Beckwith, D. T. Nguyen, and K. Gaj, “High-performance hardware implementation of CRYSTALS-Dilithium,” Cryptology ePrint Archive, Paper 2021/1451, 2021.
[7] G. Land, P. Sasdrich, and T. Güneysu, “A hard crystal - implementing Dilithium on reconfigurable hardware,” in Smart Card Research and Advanced Applications: 20th International Conference, CARDIS 2021, Lübeck, Germany, November 11–12, 2021, Revised Selected Papers. Berlin, Heidelberg: Springer-Verlag, 2021, pp. 210–230.
[8] Z. Zhou, D. He, Z. Liu, M. Luo, and K.-K. R. Choo, “A software/hardware co-design of CRYSTALS-Dilithium signature scheme,” ACM Trans. Reconfigurable Technol. Syst., vol. 14, no. 2, pp. 1–21, 2021.
[9] A. Aikata, A. C. Mert, M. Imran, S. Pagliarini, and S. S. Roy, “Kali: A crystal for post-quantum security using Kyber and Dilithium,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 70, no. 2, pp. 747–758, 2023.
[10] S. Sinha Roy and A. Basso, “High-speed instruction-set coprocessor for lattice-based key encapsulation mechanism: Saber in hardware,” IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2020, pp. 443–466, 2020.
[11] Y. Zhu, M. Zhu, B. Yang, W. Zhu, C. Deng, C. Chen, S. Wei, and L. Liu, “LWRpro: An energy-efficient configurable crypto-processor for module-LWR,” IEEE Transactions on Circuits and Systems I: Regular Papers,
[9], [16] implement multiple PQC algorithms and, as expected, vol. 68, no. 3, pp. 1146–1159, 2021.
their area is higher than in our SABER accelerators. If we [12] A. Ghosh, J. Mera, A. Karmakar, D. Das, S. Ghosh, I. Verbauwhede, and
consider a variant of CRYSTALS-Kyber (i.e., Kyber-1024) S. Sen, “A 334µw 0.158mm2 saber learning with rounding based post-
quantum crypto accelerator,” in 2022 IEEE Custom Integrated Circuits
from these accelerators for comparison, Table II shows that our Conference (CICC), 2022, pp. 1–2.
accelerators outperform in clock cycles, latency and frequency. [13] Y. Zhu, W. Zhu, M. Zhu, C. Li, C. Deng, C. Chen, S. Yin, S. Yin,
The consumed power of our accelerators is high because we S. Wei, and L. Liu, “A 28nm 48kops 3.4µj/op agile crypto-processor for
post-quantum cryptography on multi-mathematical problems,” in 2022
operate related operations on a much higher frequency. IEEE International Solid- State Circuits Conference (ISSCC), vol. 65,
2022, pp. 514–516.
IV. L ESSONS L EARNED & C ONCLUSIONS [14] M. Imran, F. Almeida, J. Raik, A. Basso, S. S. Roy, and S. Pagliarini,
The comparison and discussions reveal that the parallel “Design space exploration of saber in 65nm asic,” in Proceedings of the
5th Workshop on Attacks and Solutions in Hardware Security, 2021, p.
use of several smaller memories is more beneficial to reduce 85–90.
frequent read/write access from the data memory. One-time [15] M. Imran, F. Almeida, A. Basso, S. S. Roy, and S. Pagliarini, “High-
data loading from data memory helps to decrease clock cycles. speed saber key encapsulation mechanism in 65nm cmos,” Journal
of Cryptographic Engineering (JCEN), 2023. [Online]. Available:
Also, the one-time loading benefits the design of a compact https://doi.org/10.1007/s13389-023-00316-2
and a parallel NTT (number-theoretic-transform) multiplier [16] T. Fritzmann, G. Sigl, and J. Sepúlveda, “Risq-v: Tightly coupled risc-
for CRYSTALS-Kyber and CRYSTALS-Dilithium PQC stan- v accelerators for post-quantum cryptography,” IACR Transactions on
Cryptographic Hardware and Embedded Systems, vol. 2020, no. 4, p.
dards. The PQC algorithms involve secure hash computations; 239–280, Aug. 2020.
hence, efficient hash computations allow optimization of the [17] A. Basso, J. M. B. Mera, J.-P. D’Anvers, A. Karmakar, S. S. Roy, M. V.
circuit frequency and also help to minimize the cycles. Beirendonck, and F. Vercauteren, “Saber: Mod-lwr based kem (round
3 submission),” last accessed on March 23, 2022, available at https:
This article shows that our SABER design with a single- //www.esat.kuleuven.be/cosic/pqcrypto/saber/files/saberspecround3.pdf.
sponge function performs better in achieving higher clock [18] NIST, “Sha-3 standard: Permutation-based hash and extendable-output
functions,” FIPS PUB 202, last accessed on March 9, 2022, available at
frequency on 65nm and 40nm process technologies. However, https://doi.org/10.6028/NIST.FIPS.202.
on a 28nm technology, our SABER designs with single- and [19] K. Team, “Keccak in vhdl: High-speed core,” last accessed on September
double-sponge functions outperform the state-of-the-art in fre- 16, 2022, available at: https://keccak.team/hardware.html.
quency and latency. The adopted wider datapath strategy, one-
time data loading approach and KECCAK optimizations can
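
The comparison with [11] in Section III combines two opposing effects: our designs need more clock cycles but run at a higher clock frequency. The following is a minimal sketch of why the computation time can still be lower, assuming the usual relation between latency, cycle count, and clock frequency. The symbols t, N, f, and k are introduced here for illustration only and are not notation from the article; the factor 4.23 is the single-sponge frequency ratio quoted in Section III.

```latex
% Assumed relation: latency of one operation (KG, ENC, or DEC)
% equals its cycle count N divided by the clock frequency f.
\[
  t = \frac{N}{f}
\]

% Latency of our single-sponge design relative to the reference design,
% with k the ratio of cycle counts and 4.23 the ratio of clock frequencies.
\[
  \frac{t_{\mathrm{ours}}}{t_{\mathrm{ref}}}
  = \frac{N_{\mathrm{ours}}}{N_{\mathrm{ref}}} \cdot \frac{f_{\mathrm{ref}}}{f_{\mathrm{ours}}}
  = \frac{k}{4.23},
  \qquad
  k = \frac{N_{\mathrm{ours}}}{N_{\mathrm{ref}}}
\]
```

Under this assumption, the single-sponge design has a lower latency than the reference whenever its cycle count is less than 4.23 times that of the reference (k < 4.23). For the double-sponge design the corresponding bound would be k < 2.73.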

