Howe Optimised Lattice Based
Hardware
1 Introduction
The future development of a scalable quantum computer will allow us to solve, in
polynomial time, several problems which are considered intractable for classical
computers. Certain fields, such as biology and physics, would certainly benefit
from this “quantum speed-up”; however, it could be disastrous for security. The
security of our current public-key infrastructure is based on the computational
hardness of the integer factorization problem (RSA) and the discrete logarithm
problem (ECC). These problems, however, will be solvable in polynomial time by
a machine capable of executing Shor’s algorithm [9].
To react promptly to this threat, the scientific community has started to study,
propose, and implement public-key algorithms that can be deployed on classical
computers but are based on problems that are computationally difficult to solve
on both classical and quantum computers. This effort is supported by governmental
and standardisation agencies, which are pushing for new, quantum-resistant
algorithms. The most notable example of these activities is the open contest
that NIST [6] is running to select the next standardised public-key
algorithms. The contest started at the end of 2017 and is expected to run for 5
to 7 years.
Approximately seventy algorithms were submitted to the standardisation
process, the large majority of them based on the hardness of lattice
problems. Lattice-based cryptographic algorithms are a class of algorithms
which base their security on the hardness of problems such as finding the
shortest non-zero vector in a lattice. The reason for such a large number of
candidates is that lattice-based algorithms are extremely promising: they can be
implemented efficiently and they are extremely versatile, allowing the efficient
implementation of cryptographic primitives such as digital signatures, key
encapsulation, and identity-based encryption.
As in the past standardisation of AES and SHA-3, the selection criteria
include the security of the algorithm and its efficiency
when implemented in hardware and software. NIST have also stated that algorithms
which can be made robust against physical attacks in an effective and
efficient way will be preferred [7]. Thus, it is important, during the scrutiny of
the candidates, to explore the potential of implementing these algorithms on
a variety of platforms, and to assess the overhead of adding countermeasures
against physical attacks.
To this end, this paper concentrates on FrodoKEM, a key encapsulation
algorithm submitted to NIST as a potential post-quantum standard. FrodoKEM
is a conservative candidate because its hardness is based on standard lattices,
as opposed to Ring-LWE or Module-LWE; as such, it has had limited practical
evaluation. Thus, we explore the possibility of implementing it efficiently in
hardware, and we estimate the overhead of protecting it against power analysis
attacks using first-order masking. To maximise the throughput while keeping
the area occupation minimal, we rely on a parallelised implementation of the
matrix multiplication. To be parallelised, however, the matrix multiplication
requires a smaller and more performant pseudo-random number generator (PRNG).
We achieve the required PRNG performance by using Trivium in place of AES or
(c)SHAKE.
The rest of the paper is organised as follows. Section 2 discusses the back-
ground and the related works. Section 3 introduces the proposed hardware ar-
chitectures and the main design decisions. Section 4 reports the results obtained
while synthesising our design on reconfigurable hardware and compares our per-
formance against the state-of-the-art. We conclude the paper in Section 5.
                    FrodoKEM-640              FrodoKEM-976
Matrix dimensions   n = 640, n̄ = m̄ = 8        n = 976, n̄ = m̄ = 8
Modulus (q)         2^15 = 32768              2^16 = 65536
Distribution (χ)    σ = 2.8                   σ = 2.3
Security            128 bits                  192 bits
Naehrig et al. [5] report implementation results on a 64-bit ARM
Cortex-A72 (with the best performance achieved by using the OpenSSL AES
implementation, which benefits from the NEON engine) and an Intel Core i7-6700
(x64 implementation using AVX2 and AES-NI instructions). Because the modulus
satisfies q ≤ 2^16, modular arithmetic can use efficient and easy-to-implement
single-precision arithmetic. The sampling of the error term (16 bits per sample) is done
Algorithm 2 FrodoKEM encapsulation
1: procedure Encaps(pk = seed_A || b)
2: Choose a uniformly random key µ ← U({0, 1}^{len_µ})
3: Generate pseudo-random values seed_E || k || d ← G(pk || µ)
4: Sample error S′ ← Frodo.SampleMatrix(seed_E, m̄, n, T_χ, 4)
5: Sample error E′ ← Frodo.SampleMatrix(seed_E, m̄, n, T_χ, 5)
6: Generate A ∈ Z_q^{n×n} via A ← Frodo.Gen(seed_A)
7: Compute B′ ← S′A + E′
8: Compute c1 ← Frodo.Pack(B′)
9: Sample error E″ ← Frodo.SampleMatrix(seed_E, m̄, n̄, T_χ, 6)
10: Compute B ← Frodo.Unpack(b, n, n̄)
11: Compute V ← S′B + E″
12: Compute C ← V + Frodo.Encode(µ)
13: Compute c2 ← Frodo.Pack(C)
14: Compute ss ← F(c1 || c2 || k || d)
15: return ciphertext c1 || c2 || d and shared secret ss
16: end procedure
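As a rough illustration of Line 7 of Algorithm 2, the following sketch computes B′ = S′A + E′ modulo q on toy dimensions. The helper names (gen_matrix, sample_error, lwe_product) are hypothetical, Python's random module merely stands in for Frodo.Gen, and the uniform error sampler is a placeholder rather than the distribution χ, so this is not the reference implementation.

```python
import random

def gen_matrix(seed, n, q):
    """Stand-in for Frodo.Gen: expand a seed into an n x n matrix mod q.
    Python's `random` is NOT cryptographic; it is used for illustration only."""
    rng = random.Random(seed)
    return [[rng.randrange(q) for _ in range(n)] for _ in range(n)]

def sample_error(rng, rows, cols):
    """Toy small-error sampler (uniform in -2..2), not the Frodo distribution."""
    return [[rng.randint(-2, 2) for _ in range(cols)] for _ in range(rows)]

def lwe_product(S, A, E, q):
    """Compute (S*A + E) mod q for S (m x n), A (n x n), E (m x n)."""
    m, n = len(S), len(A)
    return [[(sum(S[i][k] * A[k][j] for k in range(n)) + E[i][j]) % q
             for j in range(n)]
            for i in range(m)]

# Toy parameters: FrodoKEM-640 uses n = 640; n = 16 keeps the example fast.
q, n, m_bar = 2**15, 16, 8
rng = random.Random(1)
A = gen_matrix(seed=42, n=n, q=q)
S_prime = sample_error(rng, m_bar, n)
E_prime = sample_error(rng, m_bar, n)
B_prime = lwe_product(S_prime, A, E_prime, q)  # m_bar x n matrix mod q
```

Each entry of B′ needs n multiply-accumulate steps, each consuming one entry of A, which is why the generation of A dominates the PRNG demand.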
random and added to the secret. The resulting masked value, which is effectively
a one-time pad, and the mask are jointly called shares: taken individually, they
are statistically independent of the secret, and they must be combined to
recover the secret. Any operation that previously involved the secret has to
be turned into an operation over its shares. As long as the shares are not
combined, any leakage from them is statistically independent of the secret too. In
our context, we show how masking can easily be applied to FrodoKEM at a very
low cost. We therefore argue that the overhead incurred by a protected hardware
implementation of FrodoKEM is minimal, making it a strong candidate when
side-channel attacks are a concern. The reason is that the only
operation using the secret matrix S is the computation of the matrix M as
C − B′S during decapsulation. When S is split into two (or more) shares using
addition modulo q, the multiplication by B′ can simply be applied to
each share independently. The results are then subtracted from C one by one, so
that no computation ever depends on both shares simultaneously.
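The share-wise computation described above can be sketched as follows, on toy dimensions. The function names are hypothetical, and the seeded random module only keeps the example reproducible; a real design would draw the mask from a cryptographic source.

```python
import random

def matmul_mod(X, Y, q):
    """Matrix product X*Y with every entry reduced mod q."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) % q
             for j in range(cols)]
            for row in X]

def sub_mod(X, Y, q):
    """Entry-wise (X - Y) mod q."""
    return [[(x - y) % q for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split_shares(S, q, rng):
    """Split S into two additive shares with S = S1 + S2 (mod q).
    S1 is a uniformly random mask; S2 is the masked value."""
    S1 = [[rng.randrange(q) for _ in row] for row in S]
    S2 = sub_mod(S, S1, q)
    return S1, S2

def masked_m(C, B_prime, S1, S2, q):
    """Compute M = C - B'S (mod q) by processing one share at a time."""
    T = sub_mod(C, matmul_mod(B_prime, S1, q), q)     # touches share 1 only
    return sub_mod(T, matmul_mod(B_prime, S2, q), q)  # touches share 2 only

# Toy sizes: B' is m_bar x n, S is n x n_bar, C is m_bar x n_bar.
q, n, m_bar, n_bar = 2**15, 16, 8, 8
rng = random.Random(7)  # illustration only, not a cryptographic RNG
S = [[rng.randint(-2, 2) % q for _ in range(n_bar)] for _ in range(n)]
B_prime = [[rng.randrange(q) for _ in range(n)] for _ in range(m_bar)]
C = [[rng.randrange(q) for _ in range(n_bar)] for _ in range(m_bar)]

S1, S2 = split_shares(S, q, rng)
M_masked = masked_m(C, B_prime, S1, S2, q)
M_plain = sub_mod(C, matmul_mod(B_prime, S, q), q)
```

Because matrix multiplication is linear modulo q, B′S1 + B′S2 equals B′S, so M_masked matches M_plain while the unmasked S is never used in the masked path.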
3 Hardware Design
Our main design goal is to improve the throughput of the lattice-based key
encapsulation scheme FrodoKEM [5] when implemented in hardware. As described
in Section 2, FrodoKEM is one of the leading conservative candidates submitted
to the NIST post-quantum standardisation effort [6]. Moreover, it has been
shown to have appealing qualities which make it an ideal candidate for hardware
implementations, such as a power-of-two modulus and significantly easier
parameter selection. However, a complete exploration of the hardware
optimisations applicable to FrodoKEM is yet to come. For instance, previous
implementations do not consider parallelisation or other design alternatives
capable of significantly improving the throughput.
As described in Section 2, FrodoKEM makes heavy use of PRNGs. The
algorithm specification suggests using either (c)SHAKE or AES. In particular,
the most computationally intensive operation, Line 7 of Algorithm 2,
requires n × n (for n = 640 or 976) 16-bit pseudo-random values. To avoid
becoming the bottleneck, the PRNG needs to achieve a high throughput, typically
in the range of 16 bits per clock cycle. In a previous hardware design, proposed
by Howe et al. [4], high PRNG throughput was achieved by pre-calculating randomness
and storing it in BRAM. Newly calculated random data was then written into
the memory, overwriting the random data previously stored. This is an efficient
approach; however, a more efficient PRNG that does not require BRAM
has the potential to increase the operating frequency of the design and
thus improve its throughput.
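To give a sense of scale for this requirement, the short calculation below counts the 16-bit values needed for the n × n matrix A and the clock cycles a PRNG delivering 16 bits per cycle would need to produce them. The function name prng_demand is hypothetical, and the estimate deliberately ignores every other consumer of randomness and any pipeline or control overhead.

```python
def prng_demand(n, bits_per_value=16, bits_per_cycle=16):
    """Return (values, total_bits, cycles) needed to generate an n x n matrix
    of pseudo-random values at a fixed PRNG output rate."""
    values = n * n
    total_bits = values * bits_per_value
    cycles = -(-total_bits // bits_per_cycle)  # ceiling division
    return values, total_bits, cycles

for n in (640, 976):
    values, total_bits, cycles = prng_demand(n)
    print(f"n = {n}: {values} values, {total_bits} bits, {cycles} cycles")
```

For n = 640 this gives 409600 values (6553600 bits), and for n = 976 it gives 952576 values (15241216 bits), so at 16 bits per cycle the PRNG alone accounts for at least n × n cycles unless several generators run in parallel.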
Another issue with the use of AES or (c)SHAKE is the relatively large area
overhead. For example, cSHAKE used within FrodoKEM-640 Encaps occupies
42% of the overall hardware resources [4]. Bos et al. [2] recently improved the
throughput of software implementations of FrodoKEM by leveraging a different
PRNG, xoshiro128**. To improve the parallelism of our implementation, we carry
this idea over to hardware and replace the suggested PRNG. We explored several
options and decided to integrate into our design an unrolled x32 Trivium
[3] module. This is compatible with the security requirements of the FrodoKEM
submission: the authors of the algorithm suggest that replacing the
PRNG with another that still has good statistical pseudo-random properties
preserves the security claims of FrodoKEM. The Trivium architecture we
integrate has high throughput and maintains the cryptographic security required
by the FrodoKEM specification, and thus fits our needs well.
B = SA + E, (1)
[Figure 1 diagram omitted. Labelled blocks: Gaussian sampler, cSHAKE, Trivium
instances, MAC, DSP-1 to DSP-k, Encode(µ), arithmetic unit; outputs c1, c2, ss.]
Fig. 1: A high-level overview of the proposed hardware designs for FrodoKEM for k
parallel multipliers.
To avoid using BRAM while keeping the throughput needed by the
MAC operations of the matrix multiplications, the designs require 16 bits of
pseudo-randomness per multiplication per clock cycle. Thus, for every two
parallel multiplications we require one Trivium instantiation, whose 32-bit output
per clock cycle is split to form two 16-bit pseudo-random integers. This
pseudo-randomness forms the matrix A in Equation 1, whereas the matrices S
and E require randomness drawn from a Gaussian-like distribution. The cumulative
distribution table (CDT) sampler technique has been shown to be the most
suitable one for hardware. However, compared with previous works, we replace
AES with Trivium as the source of pseudo-random input. This ensures the same
high throughput, but requires significantly less area on the FPGA.
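The mapping from Trivium output to multiplier inputs can be sketched as below. This is a software model under stated assumptions, not the hardware description: the function names are hypothetical, the choice of which half feeds which multiplier is arbitrary, and reduction modulo q by bit-masking is assumed from the power-of-two modulus.

```python
def split_word(word32):
    """Split one 32-bit PRNG output word into two 16-bit values (high, low)."""
    if not 0 <= word32 < 2**32:
        raise ValueError("expected a 32-bit unsigned value")
    return (word32 >> 16) & 0xFFFF, word32 & 0xFFFF

def reduce_mod_q(value16, log2_q):
    """Reduce modulo q = 2**log2_q by keeping the low log2_q bits."""
    return value16 & ((1 << log2_q) - 1)

def trivium_instances(k):
    """Number of 32-bit-per-cycle generators needed to feed k multipliers
    with one 16-bit value each per clock cycle."""
    return (k + 1) // 2

high, low = split_word(0xDEADBEEF)
a0 = reduce_mod_q(high, 15)  # element of A for the first multiplier, q = 2**15
a1 = reduce_mod_q(low, 15)   # element of A for the second multiplier
counts = {k: trivium_instances(k) for k in (1, 4, 8, 16)}
```

With this model, the 1, 4, 8, and 16 multiplier configurations need 1, 2, 4, and 8 generators respectively for the matrix A.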
In this section we present the results obtained when implementing our FrodoKEM
architecture. The first analysis concerns the PRNG.
Compared to cSHAKE, the PRNG previously used in the literature, Trivium
(the PRNG we propose to use) occupies 4.5x fewer FPGA slices. This means that
when we instantiate a higher number of parallel multipliers, we consume far less
FPGA area than would be needed with cSHAKE, as discussed in
the algorithm proposal. The additional multipliers are
essentially the only reason for the area increase when we move from
a base design to a design of the same module with a higher number of parallel
multipliers. This is because the vector being multiplied remains constant;
we only require some additional registers to store the extra random elements.
Additionally, we are able to use a much smaller version of SHA-3 (< 400 FPGA
slices) for generating the random seeds and shared secrets, as its computational
requirements have significantly decreased.
There is a significant increase in the area consumption of all the decapsulation
designs which do not utilise BRAM. This is mainly due to the need to store the
public-key and secret-key matrices. We provide results for architectures both
with and without BRAM. The design without BRAM has a significantly higher
throughput, due to its much higher frequency. These results are reported in
Figure 4, which shows the efficiency of each design (namely its throughput
per FPGA slice utilised). Figure 3 shows a slice count summary of all the proposed
designs, showing a consistent and fairly linear increase in slice utilisation as the
number of parallel multipliers increases. For decapsulation, Figure 3 also marks
where the results would lie if BRAM were used; hence the totals
without BRAM include both red areas. In most cases, slice counts at least
double for decapsulation when BRAM is removed, with only slight increases in
throughput; hence it might be prudent in some use cases to keep BRAM.
Compared to previous work, we show significant savings in FPGA area
consumption. For instance, compared to the FrodoKEM module of [4] (that
is, using one multiplier), we reduce slice consumption by 3.6x and 5.4x for key
generation and by 1.6x for encapsulation, all without requiring any BRAM, whereas
the previous results utilise BRAM. For decapsulation, we save between 1.6x and 2.6x
slices when BRAM is used, and use 1.5x and 1.1x more slices if BRAM is
not used. This increase is expected, since more than half of it is due to storage
otherwise held in BRAM.
Since the majority of the proposed designs operate without BRAM, we were
able to attain a much higher frequency than previous works. Overall, our
throughput outperforms previous comparable results by factors between 1.13x and
1.19x [4]. Moreover, whilst consuming less area than previous
research, we were able to increase the number of parallel multipliers used. As a
result, we can achieve up to 840 key generations per second (a 16.5x increase),
825 encapsulations per second (a 16.2x increase), and 710 operations per second
(a 15.6x increase).
[Figure 3 chart omitted. Slice count summary for FrodoKEM-640 and FrodoKEM-976
with 1x, 4x, 8x, and 16x parallel multipliers; series: KeyGen, Encaps, Decaps,
*Decaps.]
5 Conclusions
The main contribution of this research is to evaluate the lattice-based key
encapsulation mechanism and potential NIST post-quantum standard, FrodoKEM [5],
in hardware. We develop designs which can reach up to 825 operations per second,
where most of the designs fit in under 1500 slices. We significantly improve
the state of the art by increasing the number of parallel multipliers used during
matrix multiplication. In order to do this efficiently, we replace the
PRNG previously used, cSHAKE, with a much faster and smaller PRNG, Trivium.
As a result, we attain significantly higher throughput efficiency
than previous research. Our implementations also run in constant
time, and the designs comply with the Round 2 version of FrodoKEM
in all aspects except for this PRNG choice. To further evaluate the performance
of FrodoKEM, we implemented first-order masking for decapsulation, and we
showed that it can be achieved with almost no effect on performance.
The results show that FrodoKEM is an ideal candidate for hardware designs,
showing potential for high-throughput performance whilst still maintaining
relatively small FPGA area consumption. Moreover, compared to other NIST
[Figure 4 plot omitted. y-axis: operations per second per slice; x-axis: number
of DSP multipliers (1, 4, 8, 16); series: KeyGen-640, Encaps-640, Decaps-640,
*Decaps-640, KeyGen-976, Encaps-976, Decaps-976, *Decaps-976.]
Fig. 4: Comparison of the throughput performance per FPGA slice on a Xilinx Artix-7.
References
1. Bos, J.W., Costello, C., Ducas, L., Mironov, I., Naehrig, M., Nikolaenko, V.,
Raghunathan, A., Stebila, D.: Frodo: Take off the ring! Practical, quantum-secure
key exchange from LWE. In: Proceedings of the 2016 ACM SIGSAC Conference on
Computer and Communications Security, Vienna, Austria, October 24-28, 2016.
pp. 1006–1018 (2016)
2. Bos, J.W., Friedberger, S., Martinoli, M., Oswald, E., Stam, M.: Fly, you fool!
Faster Frodo for the ARM Cortex-M4. Cryptology ePrint Archive, Report 2018/1116
(2018), https://eprint.iacr.org/2018/1116
3. De Canniere, C., Preneel, B.: Trivium. In: New Stream Cipher Designs, pp. 244–266.
Springer (2008)
4. Howe, J., Oder, T., Krausz, M., Güneysu, T.: Standard lattice-based key
encapsulation on embedded devices. IACR Transactions on Cryptographic Hardware and
Embedded Systems pp. 372–393 (2018)
5. Naehrig, M., Alkim, E., Bos, J., Ducas, L., Easterbrook, K., LaMacchia, B., Longa,
P., Mironov, I., Nikolaenko, V., Peikert, C., Raghunathan, A., Stebila, D.: FrodoKEM.
Tech. rep., National Institute of Standards and Technology (2017), available at
https://csrc.nist.gov/projects/post-quantum-cryptography/round-1-submissions
6. NIST: Post-quantum crypto project.
http://csrc.nist.gov/groups/ST/post-quantum-crypto/ (2016)
7. NIST: Submission requirements and evaluation criteria for the post-quantum
cryptography standardization process.
https://csrc.nist.gov/csrc/media/projects/post-quantum-cryptography/documents/call-for-proposals-final-dec-2016.pdf
(2016)
8. Regev, O.: On lattices, learning with errors, random linear codes, and
cryptography. In: Proceedings of the 37th Annual ACM Symposium on Theory of
Computing, Baltimore, MD, USA, May 22-24, 2005. pp. 84–93 (2005).
https://doi.org/10.1145/1060590.1060603,
http://doi.acm.org/10.1145/1060590.1060603
9. Shor, P.W.: Polynomial-time algorithms for prime factorization and discrete
logarithms on a quantum computer. SIAM J. Comput. 26(5), 1484–1509 (Oct 1997)
Table 2: FPGA resource consumption of the proposed FrodoKEM hardware designs,
using Trivium as a PRNG, with 1, 4, 8, or 16 parallel multipliers, for both
parameter sets FrodoKEM-640 and FrodoKEM-976. Results with BRAM usage are marked
with an asterisk (*). Also shown are the hardware results of Trivium and the error
sampler. All results are for a Xilinx Artix-7 FPGA.