
Design of a FHE library for RISC-V processors Using the

RISC-V Vector extension

Tiago Filipe de Castro Pena Brandão Rodrigues

Thesis to obtain the Master of Science Degree in

Electrical and Computer Engineering

Supervisors: Prof. Leonel Augusto Pires Seabra de Sousa


Prof. Pedro Filipe Zeferino Aidos Tomás

Examination Committee

Chairperson: Prof. António Manuel Raminhos Cordeiro Grilo


Supervisor: Prof. Leonel Augusto Pires Seabra de Sousa
Member of the Committee: Prof. Carlos Nuno da Cruz Ribeiro

December 2023
Declaration

I declare that this document is an original work of my own authorship and that it fulfills all the
requirements of the Code of Conduct and Good Practices of the Universidade de Lisboa.
Acknowledgments

I would like to express my deepest gratitude to both of my supervisors, Professor Leonel Augusto
Pires Seabra de Sousa and Professor Pedro Filipe Zeferino Aidos Tomás, for their careful verification
and assistance throughout the development of this dissertation, which went above and beyond the call
of duty.
Professor Leonel Augusto Pires Seabra de Sousa, for presenting the topic of Homomorphic Encryption
in one of his classes, which piqued my interest and led me to investigate the topic further, culminating
in the decision to write a dissertation on it; and, furthermore, for the algorithmic recommendations and
knowledge on the topic, which were indispensable during development.
Professor Pedro Filipe Zeferino Aidos Tomás, for his extensive knowledge of acceleration techniques,
as well as accelerators, which constituted one of, if not the, most important parts of the dissertation;
without his guidance, the development process would have been impossible.
Finally, I would like to thank my family for all the emotional support given over the years. Moreover,
I would also like to thank them for the direct support given during the development process, which greatly
helped with finishing the document.
This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia
(FCT) under projects UIDB/50021/2020, 2022.06780.PTDC, and from the European High Performance
Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 and Specific
Grant Agreement No 101036168 (EPI SGA2). The JU receives support from the European Union’s
Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy,
Netherlands, Portugal, Spain, Sweden, and Switzerland.
Abstract

With the increasing importance of cloud computing, the privacy of the information sent has become a
major concern to users. In the usual cloud computing paradigm, data must be decrypted before
computations can be performed on it. This inadvertently exposes user information to any malicious agent
that has gained access to the cloud computing system. This problem can be tackled with the use of
Homomorphic Encryption (HE), which allows computations to be performed directly on encrypted data.
Hence, even a malicious agent with root access to the remote machine cannot retrieve the information,
as the data is never decrypted on the remote server. However, HE operations are significantly slower
than their non-encrypted counterparts, which hinders adoption. On the other hand, such operations
provide several opportunities for parallelization, which allows for a significant reduction in computation
time. This dissertation therefore exploits the RISC-V Vector extension (RVV) to accelerate the HE CKKS
scheme. Several server-side and client-side operations are implemented in RVV using intrinsics,
operating on ciphertext-ciphertext or ciphertext-plaintext pairs, or exclusively on one ciphertext, and
tested on an RVV vector unit being developed under the scope of the European Processor Initiative
(EPI) project. As the Number-Theoretic Transform (NTT) is a centrepiece of many HE routines, a
specific vector kernel was developed for this operation that exploits a vector-masked hybrid approach
with Barrett reduction. Experimental results on the NTT showed a speed-up of 1.81x, with speed-ups
ranging up to 12.23x on the ciphertext-ciphertext ADD operation.

Keywords

Homomorphic Encryption; CKKS; NTT; RISC-V; RVV.

Resumo

Com o aumento da importância da computação na nuvem, a privacidade da informação que é enviada
para esta tem-se tornado uma preocupação crescente para os utilizadores. No paradigma comum
de computação na nuvem, para se realizarem operações na informação, é necessário que esta seja
desencriptada, o que obrigatoriamente expõe os dados do utilizador a agentes maliciosos que tenham
acesso ao sistema de computação na nuvem. Este problema pode ser resolvido a partir da utilização de
Encriptação Homomórfica (EH), que permite a realização de operações em dados encriptados. Estas
operações, contudo, são significativamente mais lentas que as correspondentes operações não encriptadas,
o que dificulta a sua adoção. Por outro lado, essas operações apresentam consideráveis oportunidades
de paralelização, que permitem reduzir consideravelmente o tempo de computação. Nesta
dissertação, apresenta-se a aceleração destas operações utilizando a extensão vetorial do RISC-V (RVV),
para o esquema de Encriptação Homomórfica CKKS. Operações quer do lado do cliente quer do servidor
foram implementadas com intrínsecas de RVV, executando como cifrotexto-cifrotexto, cifrotexto-textoclaro
ou exclusivamente num só cifrotexto, e foram testadas numa unidade vetorial RVV desenvolvida
no âmbito do projeto European Processor Initiative (EPI). Dada a importância da
Number-Theoretic Transform (NTT) como componente de várias rotinas de EH, um kernel vetorial
específico para esta operação foi desenvolvido, que utiliza uma técnica híbrida vetorial com máscaras e
a redução de Barrett. Obteve-se para a NTT uma aceleração de 1.81x, com valores de até 12.23x de
aceleração para a operação de cifrotexto-cifrotexto ADD.

Palavras Chave

Encriptação Homomórfica; CKKS; NTT; RISC-V; RVV.

Contents

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Background and State-of-the-art 7


2.1 Homomorphic Encryption Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 First FHE Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 FHE Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 In-Depth Analysis of CKKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Client-Side Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Server-Side Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 Comparison between CKKS and BFV . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Acceleration Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Algorithm optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Hardware implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3 Compiler Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.4 Different ISAs and their specific optimizations . . . . . . . . . . . . . . . . . . . . . 25
2.4 Relevant FHE Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Microsoft SEAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2 OpenFHE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.3 Scheme Specific libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3 FHE Base Operations 29

4 Vector Acceleration of FHE operations 39

5 Experimental Results 49
5.1 Experimental Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2 HADD/HSUB operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.3 CADD/CSUB operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 CMULT operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5 HMULT operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.6 CWM operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.7 NTT/INTT operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.8 Other operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.9 General speedups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.10 RNS representation test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6 Conclusions 63
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

Bibliography 66

A Proof of Correctness 73

B Table of HE parameters 75

List of Figures

1.1 CKKS Implemented operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1 Different operations and their direct dependencies. . . . . . . . . . . . . . . . . . . . . . . 9


2.2 Comparison between p and log Q (taken from [1]). . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Client-side operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 High-level overview of the CKKS scheme (taken from [2]). . . . . . . . . . . . . . . . . . . 15
2.5 Server-side operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Different butterflies in NTT and INTT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1 Simple Ciphertext-Ciphertext operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


3.2 Simple Plaintext-Ciphertext operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 HMULT operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 CMULT operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Relinearize implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Rescale implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7 Integer Barrett implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.8 Barrett reduction implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.9 Coefficient Wise Multiplication implementation. . . . . . . . . . . . . . . . . . . . . . . . . 38

4.1 Vector implementation of the HADD operation for VL=4. . . . . . . . . . . . . . . . . . . . 41


4.2 Vector implementation of the CADD operation for VL=4. . . . . . . . . . . . . . . . . . . . 41
4.3 Vector implementation of the HMULT operation for VL=4. . . . . . . . . . . . . . . . . . . 42
4.4 Vector implementation of the CMULT operation for VL=4. . . . . . . . . . . . . . . . . . . 43
4.5 Vector implementation of the Relinearize operation for VL=4. . . . . . . . . . . . . . . . . 44
4.6 Vector implementation of the Rescaling operation for VL=4. . . . . . . . . . . . . . . . . . 45
4.7 Memory access of the Cooley-Tukey operation. . . . . . . . . . . . . . . . . . . . . . . . 46
4.8 Vector implementation of the CWM operation for VL=4. . . . . . . . . . . . . . . . . . . . . 48

5.1 Avispado architecture (taken from [3]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2 HADD/HSUB cycles to perform for different p. . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 CADD/CSUB cycles to perform for different p. . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4 CMULT cycles to perform for different p. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5 HMULT cycles to perform for different p. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.6 CWM cycles to perform for different p. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.7 NTT cycles to perform for p = 1024. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.8 INTT cycles to perform for p = 1024. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.9 NTT Cooley-Tukey breakdown by stages. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.10 NTT Cooley-Tukey breakdown by stages. . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.11 Relinearize/Rescale speedup values for different p. . . . . . . . . . . . . . . . . . . . . . . 60
5.12 HADD RNS performance comparison with different values of p. . . . . . . . . . . . . . . . 62

List of Tables

2.1 CKKS keys. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


2.2 Basic Server-side CKKS operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Different HE Libraries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.1 Different operations in FHE that were developed. . . . . . . . . . . . . . . . . . . . . . . . 30

4.1 Different operations in FHE that were vectorized. . . . . . . . . . . . . . . . . . . . . . . . 40

5.1 Speedup values for p = 1024, compared to "HADD". . . . . . . . . . . . . . . . . . . . . . 53

5.2 Speedup values for p = 1024, compared to "HSUB". . . . . . . . . . . . . . . . . . . . . . 53
5.3 Speedup values for p = 1024, compared to "CADD % reduct". . . . . . . . . . . . . . . . . 54
5.4 Speedup values for p = 1024 and Q = Default, compared to "CSUB Barrett". . . . . . . . 55
5.5 Speedup values for p = 1024 and Q = Default, compared to "CWM Barrett". . . . . . . . . 57
5.6 Speedup values for p = 1024 and Q = Default, compared to "CT Barrett". . . . . . . . . . 59
5.7 Speedup values for p = 1024 and Q = Default, compared to "GS Barrett". . . . . . . . . . 60
5.8 Speedups for the different operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

B.1 Post-quantum security, for different secret key distributions (taken from [4]). . . . . . . . . 76

List of Algorithms

2.1 Barrett Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21


2.2 Cooley-Tukey algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Listings

4.1 Mask Creation Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


4.2 Integer Barrett implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Barrett implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

Acronyms

BFV Brakerski Fan Vercauteren

BGV Brakerski Gentry Vaikuntanathan

BSC Barcelona Supercomputing Center

CKKS Cheon Kim Kim Song

CRT Chinese Remainder Theorem

CWM Coefficient-Wise Multiplication

FFT Fast Fourier Transform

FHE Fully Homomorphic Encryption

HAL Hardware Abstraction Layer

HE Homomorphic Encryption

INTT Inverse-NTT

ISA Instruction Set Architecture

LHE Leveled Homomorphic Encryption

LWE Learning with Error

MALU Modular Arithmetic Logic Unit

MPC Multi-Party Computation

NTT Number Theoretic Transform

PHE Partially Homomorphic Encryption

PMNS Polynomial Modular Number System

RISC Reduced Instruction Set Computer

RNS Residue Number System

RVV RISC-V Vector

R-LWE Ring-LWE

SEAL Simple Encrypted Arithmetic Library

SHE Somewhat Homomorphic Encryption

SIMD Single Instruction, Multiple Data

SVP Shortest Vector Problem

VL Vector Length

VPU Vector Processing Unit

Introduction
1
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

In this chapter, the motivation behind the dissertation is presented. This encompasses not only the
importance of Fully Homomorphic Encryption (FHE) in the current cloud-based computation paradigm,
with regard to both security and accessibility, but also the feasibility of FHE, given its higher
computational time compared to non-encrypted computation. FHE is therefore analysed with regard
to possible real-world applications, such as the previously mentioned cloud computing over
encrypted data, as well as some of the hindrances to wide-scale adoption and possible solutions to these
problems. Furthermore, the objectives of the dissertation are presented, along with the achieved
contributions, with references to both the acceleration obtained and the environment in which these results
were achieved. Finally, the organization of the document, chapter by chapter, is presented.

1.1 Motivation

With the increase in both internet speed and bandwidth, cloud computing has become a viable way not
only of performing computationally intensive operations, which would prove infeasible for a user to
perform on their own hardware, but also of allowing a user to carry out these computations from any
device, provided it can access the internet server. Furthermore, other advantages such as scalability and
flexibility [5] have made an increasing number of companies switch to a cloud computing paradigm.
However, sending data to an online service requires the user to trust that the cloud provider is
trustworthy and that the data sent to it will not be provided to third parties. Furthermore, if a
server is compromised, the information of its users could be exposed to malicious agents as part of
cloud leaks [6]. Another concern is the fact that some types of data, such as medical records, cannot
be freely distributed to third parties without prior consent from patients or authorities [7], thus requiring
the data to remain encrypted, meaning it would not be possible to perform any operations on them under
the classical models of cloud computing.
Homomorphic Encryption (HE) presents itself as a solution, based on cryptographic concepts, that not
only ensures all data is securely encrypted, but also allows operations to be performed on that same
encrypted data. This approach guarantees that, no matter the security standards implemented by the
cloud server, and whether or not it has been compromised, the user data cannot be extracted from the
information sent, nor from the final result of the computation, since it too remains encrypted afterwards.
Furthermore, the different types of HE implementations, that is, the different schemes, with their different
functions and different input values, guarantee that HE can be used as a solution to different types of
client requirements.
The reason Homomorphic Encryption is not as mainstream as it potentially could be, given the
considerable advantages it provides regarding the security and privacy of data, is the fact that, when
compared to non-encrypted computations, the different operations can take considerably longer to
perform, with some of the potential applications being impossible to implement when data has to
be computed and analysed in a small time-frame. Furthermore, some specific operations that a client
application might need to perform, such as certain comparison operations, may be impossible to
implement correctly in certain HE schemes, which further hinders the adoption of HE.

Nevertheless, with the current increasing interest in HE, many new computationally simpler schemes,
which are faster and support a wider range of operations than the original scheme, have been developed,
allowing applications that require real-time computation to be performed in an encrypted manner.
Some of the more complex operations, however, continue to be major obstacles to adoption.
Therefore, to make Homomorphic Encryption viable, one must accelerate the operations that take the
most time to perform, either by using different algorithms that prove more efficient, by better exploiting
the characteristics of modern processors, or by developing custom accelerators. In particular, a common
way of achieving acceleration is to take advantage of the considerable data parallelism potential of the
different operations by exploiting Single Instruction, Multiple Data (SIMD) instruction set extensions,
such as Intel's AVX extensions [8], ARM's SVE [9], or the RISC-V Vector (RVV) extension.

This dissertation explores this avenue by devising new strategies to accelerate one particular HE
scheme using the RVV extension. The chosen scheme was Cheon Kim Kim Song (CKKS), both because
it has, on average, a faster execution time for the standard HE operations, and because, unlike other
schemes that only accept integers as inputs, it can perform operations on real numbers.

It should further be noted that a cryptography standardization extension for the RISC-V Instruction
Set Architecture (ISA) is currently being developed, for both scalar and vector operations [10]. Yet, as
these operations are not currently available in real processors, they were left out of the development
process.

Homomorphic Encryption is not the only proposed alternative to the classic cloud computing paradigm.
For instance, there are Yao's garbled circuits, a type of Multi-Party Computation (MPC), which work by
obfuscating a boolean gate truth table for inputs from two different users who distrust each other and
do not want the other to learn their value. Nevertheless, garbled circuits only work for functions that can
be described as boolean circuits, and their application scope is different from that of HE, since they
target two parties that each hold data they want to keep secret from the other party, while still being
able to produce an output from the combined binary data. Furthermore, there is also Functional
Encryption, which is able to perform operations on ciphertexts. Just like in HE, both inputs are
completely encrypted, meaning that if the encryption is correctly performed, it is impossible to obtain
the original values from the input. Where Functional Encryption differs from HE is that the output
produced is not encrypted. For this reason, it also falls short of FHE: since it produces the output of the
function being performed in plaintext form, that specific information can potentially be compromised.
Nevertheless, for some specific applications in which the only information a party wishes to preserve
is the input, it might be a viable alternative.

1.2 Objectives

This dissertation addresses the acceleration of FHE operations based on the vector extension of the
RISC-V ISA. Considering this, the main objectives are to review the previously developed state of the
art, with special regard to acceleration techniques, namely the different algorithmic accelerations and
the hardware accelerators already developed, and to develop an FHE library for the CKKS scheme,
accelerating the different operations using the RVV extension.

1.3 Contributions

The contributions of this dissertation are as follows.

• The development of scalar implementations for the different CKKS scheme operations, which are
summarized in Figure 1.1.

• The vectorization of said operations using the RVV extension.

• The testing of the developed operations on the Barcelona Supercomputing Center (BSC) FPGA,
which implements the Avispado RISC-V processor with the RVV processing extension, on which
speedups of 12.23x and 8.22x over the scalar implementation were obtained for the HADD and
HSUB operations, respectively. Furthermore, a speedup of 1.81x was obtained for the NTT operation.

[Figure: tree of the implemented CKKS operations, split into client-side operations (Key Generation of
the Secret, Public, Relinearization, Galois, and Bootstrapping keys; Encrypt; Decrypt; Encode; Decode)
and server-side operations (HADD, HSUB, HMULT, CADD, CSUB, CMULT, Relinearize, Rescale,
Rotate, NTT, INTT, Bootstrapping, CWM), each marked as fully implemented, partially implemented
(error generation not developed), or not implemented.]

Figure 1.1: CKKS Implemented operations.

1.4 Organization of the Dissertation

This dissertation is organized as follows. Chapter 1 presents a brief introduction to the problem at hand
and the motivation behind the dissertation, as well as its main objectives and contributions.
In Chapter 2, the background and the state of the art of FHE are presented: the CKKS scheme is
compared with other popular schemes, and the different operations in FHE are described in a top-down
fashion, along with some of the possible optimizations that can be performed on an algorithmic,
hardware, and ISA basis; furthermore, some of the more popular HE libraries are discussed. In
Chapter 3, the scalar implementation of the different operations described in Chapter 2 is presented,
along with a description of the nuances that can be taken advantage of in the different components. In
Chapter 4, the techniques used to vectorize the different operations are presented, in which the NTT/INTT
operations take center stage, given their importance. In Chapter 5, experimental results are presented,
with a main focus on the speedup achieved for the different accelerated operations. In Chapter 6, the
concluding remarks of the dissertation are reported, along with possible future developments of the work
presented.

Background and State-of-the-art
2
Contents
2.1 Homomorphic Encryption Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 In-Depth Analysis of CKKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3 Acceleration Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4 Relevant FHE Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

This chapter presents the background required to understand this dissertation, as well as the state
of the art of the different accelerators. Section 2.1 provides a description of HE, a presentation of the
different scheme generations that exist, and a figure with the different operations of a scheme. In
Section 2.2, a deep analysis of the different operations and parameters of the CKKS scheme is
presented, together with a comparison to other common schemes, discussing the advantages and
disadvantages of each. Furthermore, in Section 2.3, possible acceleration techniques for the slower
operations are presented, both at the algorithmic level and in terms of the accelerators that have
already been developed. Finally, Section 2.4 presents the different library implementations of HE
schemes, as well as the vectorization instructions provided in different ISAs.

2.1 Homomorphic Encryption Schemes

Homomorphic Encryption can be achieved in various ways; for instance, some constructions perform
approximate arithmetic whereas others are exact. Each such construction is described as a scheme. Prior
to Craig Gentry's paper on Homomorphic Encryption [11], most schemes would be considered Partially
Homomorphic Encryption (PHE), meaning they could perform either an unbounded number of additions
or an unbounded number of multiplications on encrypted data, but not both. In contrast, Somewhat
Homomorphic Encryption (SHE) allows not only additions but also multiplications, with an intrinsic
limitation on the total number of operations. This is due to the fact that every time one of these
operations is performed, an amount of noise is added, meaning that after a number of operations it
becomes impossible to decrypt the ciphertext correctly. FHE, such as the scheme presented by Gentry,
allows both an unbounded number of additions and of multiplications on the encrypted data.
Another common classification is Leveled Homomorphic Encryption (LHE). These schemes are able
to run any application as long as the depth of the computation (usually the multiplicative depth) is
known beforehand, since multiplication is the operation that introduces the most noise. If the
application is known, the parameters of the scheme can be set to the smallest values that still allow
decryption even after many operations have been performed on the data, which is beneficial, as choosing
arbitrarily large parameters can result in longer-than-needed execution times. The main operations of
the FHE schemes are presented in Figure 2.1, together with the dependencies the different functions
have on other operations, such as the Relinearization operation, which requires a relinearization key.

2.1.1 First FHE Scheme

The first scheme that could be described as FHE was Craig Gentry's [11]. To do so, Gentry started with
an SHE scheme, which, as previously described, is limited by noise growth. However, Gentry developed
a novel technique called bootstrapping, which allowed the scheme to refresh its noise, thus allowing for
unbounded applications. This scheme was based on ideal lattices, which in cryptographic systems can
be used to reduce the number of parameters needed to define a lattice, and which were first used in
security applications in [12]. It is considered to be a first-generation scheme.

[Figure: tree of the main FHE operations and their direct dependencies, split into client-side operations
(numbered key-generation functions for the Secret, Public, Relinearization, Galois, and Bootstrapping
keys; Encrypt; Decrypt; Encode; Decode) and server-side operations (HADD, HSUB, HMULT, CADD,
CSUB, CMULT, Relinearize, NTT, INTT, Rotate, Bootstrapping, CWM), with each function numbered
and annotated with its direct dependency, e.g. Relinearize depends on the Relinearization Key.]

Figure 2.1: Different operations and their direct dependencies.

2.1.2 FHE Schemes

From Gentry's initial paper, multiple other schemes have been developed. This section covers the
most notable ones. One point that differentiates the schemes is the message input data, which
ranges from bits, as in Gentry's scheme and more recently in TFHE [13], and integers, as in BGV [14]
and BFV [15], to complex numbers [16]. Another difference is the set of operations that can be performed on
the encrypted ciphertext, with some schemes allowing only the simplest required instructions and
others being able to perform any arbitrary operation through the use of bootstrapping, as well as the type of
arithmetic used, with some schemes employing approximate arithmetic. Finally, they also rely on different
computational problems to guarantee the security of the encryption, ranging from lattice-based
problems to Ring-LWE (R-LWE) [14] and NTRU [17].

BGV Scheme

In 2011, Brakerski and Vaikuntanathan presented a new HE implementation [14]. This scheme was
a port of Gentry's scheme; however, instead of using the Shortest Vector Problem (SVP) to guarantee
hardness, they used Learning with Errors (LWE), which is at least as hard as the SVP. The scheme
itself is SHE; however, by adding a technique called relinearization, they were able to make it LHE.
From this implementation, the Brakerski Gentry Vaikuntanathan (BGV) scheme [18] was developed. It
is also LHE, but it gives users the choice between LWE and R-LWE as the security
guarantee. R-LWE is considered not only faster than LWE but also harder,
which resulted in the BGV scheme gaining more popularity. The difference between LWE and R-LWE is
that R-LWE uses polynomials in a specific ring, rather than a single integer modulus. It is considered to
be second-generation.

BFV Scheme

In 2012, Brakerski went on to develop a variant of the BGV scheme that removed the need for the modulus-
switching operation [19]. Fan and Vercauteren later implemented the same scheme, with some minor
optimization techniques [15], thus creating the Brakerski Fan Vercauteren (BFV) scheme. Unlike BGV, this
scheme is exclusively R-LWE based, and it is also considered to be second-generation.

FHEW and TFHE Scheme

In 2014, a new scheme was proposed that claimed to perform the bootstrapping operation in less
than 1 second. Compared to the previously mentioned schemes, this was a considerable acceleration.
This scheme was named FHEW [20] and, similarly to Gentry's original, it was based on boolean
logic, with bit messages. A further optimization was later developed that boasted the ability
to perform the bootstrapping operation in less than 0.1 seconds, which led to the creation of TFHE [21].
Both schemes are considered to be third-generation.

CKKS Scheme

More recently, in 2016 [16], Cheon, Kim, Kim and Song proposed a new Homomorphic Encryption
scheme, also based on R-LWE, but with one key difference: unlike the others, which perform exact
arithmetic, this scheme performs approximate arithmetic. The original name given by the authors
was the HEAAN scheme [16]; however, it is more commonly known as CKKS, to avoid confusion with
the library implementation of the same name [22]. Another particularity of this scheme over others is
that the input values do not need to be integers and can even be complex numbers. One
disadvantage of CKKS is, as stated before, that the arithmetic is approximate, meaning extra caution must be
taken when using this scheme. Another problem is that for some specific operations, such as
encoding/decoding, this scheme significantly underperforms compared to others, and the number of values that
can be batched during encoding is half of what other schemes can encode. Nevertheless, when considering
full batching, it performs the majority of the operations faster than the other schemes. For this
reason, it was the scheme adopted for this dissertation.

2.2 In-Depth Analysis of CKKS

Notation

For the rest of this document, p is defined as the dimension of the polynomials, that is, the polynomial degree
modulus of the different polynomials used in this scheme. For the coefficient modulus, Q is defined, and
q1, q2, ..., qk denote the smaller coefficient moduli used in the Residue Number System (RNS) representation.
∆ is defined as the scaling factor used in the encoding/decoding and rescaling operations, while
[·]Q denotes reduction of a value modulo Q. The initial input array of values is defined as the
message, which in the CKKS scheme belongs to C^(p/2). The message is later
encoded into the plaintext, which belongs to R = Z/(x^p + 1). After applying the encryption operation to
the plaintext, the ciphertext is obtained, which belongs to R = Z_Q/(x^p + 1). For the Number Theoretic
Transform (NTT) operations, the p-th root of unity is defined as ω, and the 2p-th root of unity as ψ. χ
represents the discrete Gaussian probability distribution.

Context Parameters

The CKKS scheme, just like other schemes, requires a few initial parameters to be defined before any
operation can be performed. These parameters define the security and efficiency of the scheme. By
analysing the application being performed, the depth is determined. Afterwards, the value of Q is defined
to accommodate this depth. After determining Q, the parameter p is chosen depending on the
security level required for the application. The value of p is usually a power of two, where a larger Q
requires a greater p. This variable defines the maximum degree that the polynomials, that is,
both the plaintext and the ciphertext, can take. Finally, ∆, a parameter exclusive to the CKKS scheme,
is used during the encoding and decoding operations and, in the RNS representation, should
have a value similar to the qi.
These parameters should be chosen wisely, as wrong parameters may result in insecure ciphertexts
or incorrect results. Figure 2.2 presents the relation between the polynomial degree modulus, p, and
the coefficient modulus, Q. It should be noted that the parameters can also be chosen depending
on the number of bits of security required for an operation. These values are given
in the homomorphic encryption security standard [4] and are reproduced in Appendix B, Table B.1.

Figure 2.2: Comparison between p and log Q (taken from [1]).

2.2.1 Client-Side Operations

Some of the operations in a HE scheme are performed not by the server where the
computations take place, but on the system of the user whose data will be processed.
Such operations encompass the generation of the keys, since the private key belongs exclusively
to the user, who must therefore be the one to create it, as well as other operations such as
encoding/decoding and encrypting/decrypting the data. It should be noted, as suggested in Figure 2.1, that
the client-side operations took a backseat in the development process, as the main focus of this
dissertation was to accelerate the more costly server-side operations, such as HMULT. That said, the
key-generation and encrypt/decrypt operations were partially implemented, in that
the error terms were not added to the functions; this stems from the myriad of different
algorithms that have been developed exclusively for noise generation, which could be the focus of
an entirely separate dissertation. Therefore, although these functions can be used to verify that all the
operations work correctly, they should not be used in a secure application. The server-side
operations do not have this problem, as they do not require noise generators. Figure 2.3 presents the
breakdown of the different client-side operations, as well as what part of the partially implemented
operations is missing.

Key Generation

After the definition of the parameters, the different keys required for the Homomorphic Encryption
scheme must be created. Below are presented the different keys and how they are generated.

Figure 2.3: Client-side operations.

Private Key: To create the private key, a polynomial of degree p − 1 with coefficients of value −1, 0 or
1 is generated. The values of the coefficients are determined randomly with equal probability among
the three values. In a condensed manner,

SK = a_0 × x^0 + a_1 × x^1 + ... + a_{p−1} × x^{p−1} ,  a_i ∈ {−1, 0, 1}    (2.1)

where SK is the secret key being generated and the a_i are the coefficients associated with the
corresponding powers of x in the polynomial.
Public Key: In asymmetric cryptography in particular in HE schemes such as the CKKS, the public
key is dependent on the secret key, which means that to create the public key, the private key must
have already been created. The public key obtained in the CKKS scheme is a pair of polynomials PK =
(P K1 ,P K2 ), which are defined as follows

PK_1 = [−a × SK + e]_Q ,
PK_2 = a    (2.2)

where SK is the private key that was created previously, a is a randomly generated polynomial in the
ring ZQ /(xp + 1) and the error e is sampled from χ, which is the Gaussian probability discrete distribution.
Relinearization Key: After a multiplication is performed, the ciphertext is no longer composed of only two
polynomials; instead, it has three. To reduce the dimension of the ciphertext back to its normal
size, the relinearization key, also known as the evaluation key, must be used. This key is composed of a
pair of polynomials EK = (EK_1, EK_2), defined as follows

EK_1 = [−a × SK + SK^2 + e]_Q ,
EK_2 = a    (2.3)

where SK is the private key that was created previously, a is a randomly generated polynomial in the
ring ZQ /(xp + 1) and the error e is sampled from χ.
Galois Keys: One interesting operation that the CKKS scheme also implements is the ability to
rotate the different values of an encoded vector through the ciphertext. This operation requires the
creation of additional keys which are the Galois keys.
In Table 2.1, the different keys are presented in a summarized way.

Key Name        | Acr. | Equation
Secret Key      | SK   | SK = a_0 × x^0 + a_1 × x^1 + ... + a_{p−1} × x^{p−1} ,  a_i ∈ {−1, 0, 1}
Public Key      | PK   | PK_1 = [−pl × SK + e]_Q ,  PK_2 = pl ,  pl ∈ Z_Q/(x^p + 1)
Evaluation Key  | EK   | EK_1 = [−pl × SK + SK^2 + e]_Q ,  EK_2 = pl ,  pl ∈ Z_Q/(x^p + 1)
Table 2.1: CKKS keys.
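The key-generation procedures summarized in Table 2.1 can be sketched in pure Python. This is a toy illustration only: the parameters p and Q below are far too small to be secure, the helper names (polymul, sample_ternary, sample_error, keygen) are invented for this example, and the rounded-Gaussian error sampling is a simplification of χ.

```python
import random

p = 8                  # toy polynomial degree modulus (insecure, for illustration)
Q = (1 << 31) - 1      # toy coefficient modulus (insecure, for illustration)

def polymul(a, b, q, n):
    """Schoolbook multiplication in Z_q[x]/(x^n + 1) (negacyclic wrap-around)."""
    res = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                res[k] = (res[k] + a[i] * b[j]) % q
            else:
                res[k - n] = (res[k - n] - a[i] * b[j]) % q
    return res

def sample_ternary(n):
    """Secret-key coefficients: -1, 0 or 1 with equal probability (Equation 2.1)."""
    return [random.choice((-1, 0, 1)) for _ in range(n)]

def sample_error(n, sigma=3.2):
    """Small error polynomial, a simplified stand-in for sampling from chi."""
    return [round(random.gauss(0, sigma)) for _ in range(n)]

def keygen():
    sk = sample_ternary(p)
    # Public key (Equation 2.2): PK = ([-a*SK + e]_Q, a)
    a = [random.randrange(Q) for _ in range(p)]
    pk1 = [(-c + e) % Q for c, e in zip(polymul(a, sk, Q, p), sample_error(p))]
    # Evaluation key (Equation 2.3): EK = ([-a'*SK + SK^2 + e]_Q, a')
    a2 = [random.randrange(Q) for _ in range(p)]
    sk2 = polymul(sk, sk, Q, p)
    ek1 = [(-c + s + e) % Q
           for c, s, e in zip(polymul(a2, sk, Q, p), sk2, sample_error(p))]
    return sk, (pk1, a), (ek1, a2)
```

As a sanity check of Equation 2.2, evaluating PK_1 + PK_2 × SK modulo Q should leave only the small error polynomial e.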

Encoding and Decoding

The encoding operation in CKKS is slightly more complicated than in other schemes, as it is based on a
complex inverse Fast Fourier Transform (FFT). As mentioned before, the input message can contain complex
numbers; however, the plaintext is in Z/(x^p + 1). Therefore, the encoding operation must map a
message in C^(p/2) to a plaintext. To do so, the message is first expanded to H = {z ∈ C^p : z_j = z̄_{−j}} by
means of the inverse of the natural projection, which is presented in the original CKKS paper [16] as π^−1.
The result of the previous operation must be projected onto the canonical embedding map σ(Z/(x^p + 1)),
and it is encoded to the plaintext by applying the inverse of the canonical embedding map, originally
presented as σ^−1 [16]. Afterwards, this value is multiplied by ∆ to ensure the precision of the
encryption.

The decoding operation does the opposite of the encoding, as it converts the plaintext polynomial
back to a message. It starts by dividing the plaintext by ∆, and afterwards applies σ. This value is then
projected onto the message by π.
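These maps can be illustrated with a small pure-Python sketch of σ^−1 and σ for toy parameters; the names DELTA, ROOTS, encode and decode are invented for this example, and p is far too small for real use. The sketch relies on the fact that the embedding matrix built from the 2p-th roots of unity is √p-scaled unitary, so σ^−1 is a scaled conjugate transpose rather than a generic matrix inverse.

```python
import cmath

p = 8              # toy ring degree; the number of slots is p // 2
DELTA = 2 ** 20    # illustrative scaling factor

# sigma evaluates a polynomial at the odd powers of the 2p-th root of unity.
ROOTS = [cmath.exp(1j * cmath.pi * (2 * j + 1) / p) for j in range(p)]

def encode(z):
    """pi^-1, then sigma^-1, then scaling: message in C^(p/2) -> integer plaintext."""
    assert len(z) == p // 2
    # pi^-1: expand to a conjugate-symmetric vector in H (z_j = conj(z_{-j})).
    v = list(z) + [x.conjugate() for x in reversed(z)]
    coeffs = []
    for k in range(p):
        # The embedding matrix U_{jk} = ROOTS[j]**k satisfies U U^H = p I,
        # so sigma^-1(v) = (1/p) U^H v.
        c = sum(v[j] * ROOTS[j] ** (-k) for j in range(p)) / p
        coeffs.append(round(c.real * DELTA))  # imaginary parts cancel by symmetry
    return coeffs

def decode(pt):
    """Divide by DELTA and apply sigma (pi then keeps the first p/2 slots)."""
    return [sum(pt[k] * r ** k for k in range(p)) / DELTA
            for r in ROOTS[: p // 2]]
```

Encoding followed by decoding recovers the message up to the rounding error introduced by ∆.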

Encryption and Decryption

To do the encryption operation, the public key generated prior is used. This operation is done on the
plaintext, and results in a pair of polynomials, CT = (CT1 ,CT2 ), that both belong to ZQ /(xp + 1), and are
defined as follows:

CT_1 = [PK_1 × u + e_1 + PT]_Q ,
CT_2 = [PK_2 × u + e_2]_Q    (2.4)

where PK is the public key that was previously created, u is a randomly generated polynomial sampled
from the ring Z2 /(xp + 1) and both error e1 and e2 are sampled from χ.
The decryption operation uses the private key and returns the ciphertext back to a plaintext in Z/(xp
+ 1). The operation is as follows:
M = [CT_1 + CT_2 × SK]_Q    (2.5)

where the pair CT is the ciphertext and SK is the generated secret key. Since only the user has access
to the secret key, the privacy of the data is guaranteed.
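Equations 2.4 and 2.5 can be sketched as follows. Mirroring the partial implementation described in this chapter, the error terms e1 and e2 are omitted (so decryption is exact, but the sketch is not secure); the parameter values and helper names are illustrative only.

```python
import random

p, Q = 8, (1 << 31) - 1   # toy, insecure parameters

def polymul(a, b, q=Q, n=p):
    """Schoolbook multiplication in Z_q[x]/(x^n + 1)."""
    res = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                res[k] = (res[k] + a[i] * b[j]) % q
            else:
                res[k - n] = (res[k - n] - a[i] * b[j]) % q
    return res

def keygen():
    sk = [random.choice((-1, 0, 1)) for _ in range(p)]
    a = [random.randrange(Q) for _ in range(p)]
    pk1 = [(-c) % Q for c in polymul(a, sk)]   # e omitted, as in the partial implementation
    return sk, (pk1, a)

def encrypt(pt, pk):
    """Equation 2.4 with e1 = e2 = 0; u is sampled from Z_2[x]/(x^p + 1)."""
    u = [random.randrange(2) for _ in range(p)]
    ct1 = [(c + m) % Q for c, m in zip(polymul(pk[0], u), pt)]
    ct2 = polymul(pk[1], u)
    return ct1, ct2

def decrypt(ct, sk):
    """Equation 2.5: M = [CT_1 + CT_2 * SK]_Q."""
    return [(x + y) % Q for x, y in zip(ct[0], polymul(ct[1], sk))]
```

With the errors omitted, PK_1 × u + PT + PK_2 × u × SK collapses to PT, so decrypt(encrypt(pt, pk), sk) returns the plaintext exactly.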
In Figure 2.4, a summarized view of the encoding/decoding and encryption/decryption operations is
presented, showing the different sizes of the polynomials.

Figure 2.4: High-level overview of the CKKS scheme (taken from [2]).

2.2.2 Server-Side Operations

There are different types of server-side operations in HE. Some are performed between two ciphertexts,
whereas others can be performed with only one of the inputs encrypted, and the other unencrypted. No
operation can be done between two unencrypted values as it would defeat the purpose of HE. Of the
server-side operations, only the relinearization operation requires the usage of any of the keys that were
generated in the client. In Figure 2.5, the operations that were implemented and not implemented in the
scope of this dissertation are referred.

Figure 2.5: Server-side operations.

Addition/Subtraction

Addition/Subtraction can be performed both by adding/subtracting a plaintext with a ciphertext (plaintext-
ciphertext), or by adding/subtracting two ciphertexts together (ciphertext-ciphertext). The operation
behind the ciphertext-ciphertext addition/subtraction, which produces a two-polynomial result CT3 =
(CT3_1, CT3_2), is as follows

CT3_1 = [CT1_1 ± CT2_1]_Q ,
CT3_2 = [CT1_2 ± CT2_2]_Q    (2.6)

where CT1 and CT2 are ciphertexts with two polynomials each. The plaintext-ciphertext addition/subtraction
operation is even simpler, as the plaintext, PT, is only added/subtracted to the first polynomial

CT3_1 = [CT1_1 ± PT]_Q ,
CT3_2 = CT1_2    (2.7)

All of these operations are element-wise operations, as the coefficients of the polynomials are
added/subtracted together.
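A minimal sketch of Equations 2.6 and 2.7, using an illustrative toy modulus (the function names are invented for this example; subtraction follows by replacing the + with −):

```python
Q = (1 << 31) - 1   # toy coefficient modulus

def ct_add(ct_a, ct_b, q=Q):
    """Ciphertext-ciphertext addition (Equation 2.6): element-wise on both polynomials."""
    return tuple([(x + y) % q for x, y in zip(pa, pb)]
                 for pa, pb in zip(ct_a, ct_b))

def pt_add(ct, pt, q=Q):
    """Plaintext-ciphertext addition (Equation 2.7): only the first polynomial changes."""
    return ([(x + m) % q for x, m in zip(ct[0], pt)], list(ct[1]))
```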

Multiplication

Multiplication can be performed both by multiplying a plaintext with a ciphertext (plaintext-ciphertext),
or by multiplying two ciphertexts together (ciphertext-ciphertext). The operation behind the ciphertext-
ciphertext multiplication, which produces a three-polynomial result CT3 = (CT3_1, CT3_2, CT3_3), is as
follows

CT3_1 = [CT1_1 × CT2_1]_Q ,
CT3_2 = [CT1_1 × CT2_2 + CT1_2 × CT2_1]_Q ,
CT3_3 = [CT1_2 × CT2_2]_Q    (2.8)

where CT1 and CT2 are ciphertexts with two polynomials each. This operation produces three polynomials,
since it consists of the dyadic product between the input ciphertexts. It
should be noted that plaintext-ciphertext multiplication does not produce a three-polynomial output,
as the plaintext, PT, is simply multiplied by both ciphertext polynomials, that is,

CT3_1 = [CT1_1 × PT]_Q ,
CT3_2 = [CT1_2 × PT]_Q    (2.9)
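Equation 2.8 can be checked numerically: the three output polynomials satisfy CT3_1 + CT3_2 × SK + CT3_3 × SK^2 = (CT1_1 + CT1_2 × SK) × (CT2_1 + CT2_2 × SK), which is exactly why the product decrypts to the product of the messages. A toy sketch (parameter values and helper names are illustrative only):

```python
p, Q = 8, (1 << 31) - 1   # toy parameters

def polymul(a, b, q=Q, n=p):
    """Schoolbook multiplication in Z_q[x]/(x^n + 1)."""
    res = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                res[k] = (res[k] + a[i] * b[j]) % q
            else:
                res[k - n] = (res[k - n] - a[i] * b[j]) % q
    return res

def padd(a, b, q=Q):
    """Element-wise polynomial addition modulo q."""
    return [(x + y) % q for x, y in zip(a, b)]

def ct_mult(ct_a, ct_b):
    """Equation 2.8: the three polynomials of a ciphertext-ciphertext product."""
    d1 = polymul(ct_a[0], ct_b[0])
    d2 = padd(polymul(ct_a[0], ct_b[1]), polymul(ct_a[1], ct_b[0]))
    d3 = polymul(ct_a[1], ct_b[1])
    return d1, d2, d3
```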

Relinearization

After a multiplication, the size of the output has increased to three polynomials. To return to a
two-polynomial ciphertext, a technique called relinearization must be applied. This operation requires
the relinearization/evaluation key, EK, and generates the two-polynomial result CT* =
(CT*_1, CT*_2) as follows

CT*_1 = [CT_1 + EK_1 × CT_3]_Q ,
CT*_2 = [CT_2 + EK_2 × CT_3]_Q    (2.10)
where EK is the evaluation key and CT is the three polynomial ciphertext input. This way, the ciphertext
is reduced back to two polynomials.
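A sketch of Equation 2.10, using an evaluation key with the error term omitted so that the identity is exact. Real implementations additionally decompose CT_3 in a numeric base to control noise growth; that detail, and all names below, are simplifications for illustration.

```python
p, Q = 8, (1 << 31) - 1   # toy, insecure parameters

def polymul(a, b, q=Q, n=p):
    """Schoolbook multiplication in Z_q[x]/(x^n + 1)."""
    res = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                res[k] = (res[k] + a[i] * b[j]) % q
            else:
                res[k - n] = (res[k - n] - a[i] * b[j]) % q
    return res

def padd(a, b, q=Q):
    return [(x + y) % q for x, y in zip(a, b)]

def relinearize(d1, d2, d3, ek):
    """Equation 2.10: fold the third polynomial into the first two using EK."""
    return padd(d1, polymul(ek[0], d3)), padd(d2, polymul(ek[1], d3))
```

Because EK_1 + EK_2 × SK = SK^2 (error omitted), the relinearized pair decrypts to the same value as the three-polynomial ciphertext.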

Rescaling

After a multiplication, by looking at the underlying mathematics, it can be noted that the
value encrypted in the ciphertext is actually M1 × M2 multiplied by ∆². This means
that, when decryption and decoding are performed, the result obtained from the multiplication will be
∆ × M1 × M2 instead of M1 × M2. Therefore, after every multiplication, an operation that reduces the
scale of the ciphertext must be performed. This is the Rescaling operation, which is unique to the CKKS
scheme and is as follows
 
CT*_1 = [⌊CT_1/∆⌉]_Q ,
CT*_2 = [⌊CT_2/∆⌉]_Q    (2.11)

where CT ∗ is the output ciphertext, CT is the input ciphertext and ∆ is the scale value defined in the
context parameters. By performing the division by ∆ on the Ciphertext, the underlying message will
have the right scaling factor to guarantee correct decoding.
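A coefficient-wise sketch of Equation 2.11. In an RNS implementation, rescaling is instead performed by dropping one limb, but the plain form can be illustrated directly; the moduli below are illustrative toy values and the names are invented for this example.

```python
Q_BIG = 1 << 40     # illustrative ciphertext modulus before rescaling
DELTA = 1 << 20     # illustrative scaling factor

def rescale_poly(poly, delta=DELTA, q=Q_BIG):
    """Equation 2.11: round each coefficient of CT/delta; the modulus shrinks to q/delta."""
    out = []
    for c in poly:
        c = c - q if c > q // 2 else c            # centered representative of [c]_q
        out.append(((c + delta // 2) // delta) % (q // delta))
    return out

def rescale(ct):
    """Apply the coefficient-wise rescaling to both ciphertext polynomials."""
    return tuple(rescale_poly(poly) for poly in ct)
```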

Rotation and Bootstrapping

As shown in Figure 2.5, neither of these operations was implemented. The reason is that they directly
depend on client-side key-generation operations that were not implemented: the Galois keys and the
bootstrapping keys, respectively.

The Rotation operation allows the different elements of a ciphertext to be rotated without decrypting
the array. This operation has limited applicability, being used only in specific cases.

The Bootstrapping operation allows completely unbounded computations to be performed, even
when the depth of the circuit is not known beforehand. However, it requires sizable keys,
and if the circuit depth is known beforehand, any operation can be performed without it. For example,
the CKKS scheme, when it was presented, was originally LHE, since a bootstrapping protocol was not
provided by the authors. This operation was later developed by different authors, and currently some
libraries already implement bootstrapping for the CKKS scheme.
In Table 2.2, a summarized description of the different operations is presented. Of note, the Coefficient-
Wise Multiplication (CWM) operation, which was not previously described, is included. This operation
is the Hadamard dyadic multiplication between the input ciphertexts, and is of particular importance after
the NTT operation has been performed on the inputs.

Operation    | Type  | Equation                                                        | Comp. complex.
HADD/HSUB    | CT-CT | CTRes_1 = [CT1_1 ± CT2_1]_Q ,  CTRes_2 = [CT1_2 ± CT2_2]_Q      | O(n)
CADD/CSUB    | CT-PT | CTRes_1 = [CT1_1 ± [PT]_Q]_Q ,  CTRes_2 = CT1_2                 | O(n)
HMULT        | CT-CT | CTRes_1 = [CT1_1 × CT2_1]_Q ,                                   | O(n^2)
             |       | CTRes_2 = [CT1_1 × CT2_2 + CT1_2 × CT2_1]_Q ,                   |
             |       | CTRes_3 = [CT1_2 × CT2_2]_Q                                     |
CMULT        | CT-PT | CTRes_1 = [CT1_1 × [PT]_Q]_Q ,  CTRes_2 = [CT1_2 × [PT]_Q]_Q    | O(n^2)
RELINEARIZE  | CT    | CT*_1 = [CT_1 + EK_1 × CT_3]_Q ,  CT*_2 = [CT_2 + EK_2 × CT_3]_Q | O(n^2)
RESCALE      | CT    | CT*_1 = [⌊CT_1/∆⌉]_Q ,  CT*_2 = [⌊CT_2/∆⌉]_Q                    | O(n)
CWM          | CT-CT | CTRes_1 = [CT1_1 ∘ CT2_1]_Q ,                                   | O(n)
             |       | CTRes_2 = [CT1_1 ∘ CT2_2 + CT1_2 ∘ CT2_1]_Q ,                   |
             |       | CTRes_3 = [CT1_2 ∘ CT2_2]_Q                                     |

Table 2.2: Basic Server-side CKKS operations.

2.2.3 Comparison between CKKS and BFV

The CKKS and BFV schemes have many operations that are either identical or very similar. For starters,
both implement the addition and relinearization operations in the same manner. Furthermore, the key
generation process is identical in both schemes. When it comes to encryption/decryption, the only change
the BFV scheme introduces is that the message is multiplied/divided by a scalar, respectively, whereas
in the CKKS scheme such a scalar is not required. Both schemes have p and Q as context parameters;
however, BFV has an additional parameter called the plaintext modulus, t, that influences the values in the
ring of the plaintext.

Encoding and Batching

One major difference between these two schemes is the encoding algorithm, owing to the
different input vectors used. In CKKS, the message belongs to C^(p/2), whereas in BFV
it belongs to Z^p. Therefore, the original authors of CKKS had to develop an encoding algorithm for their
scheme, whereas BFV has multiple encoding algorithms that can be used. Another important
difference is that the BFV scheme can encode twice as many values as the CKKS scheme;
however, the plaintext created in BFV belongs to the ring Z_t/(x^p + 1), with
t described as the plaintext modulus in the original article. It defines the magnitude of the
plaintext values and the consumption of noise budget when performing multiplications. If the value of t is
not large enough, the final results will be incorrect, as the BFV scheme will wrap around in modular
arithmetic, unlike CKKS, which does not have this restriction on the plaintext. Furthermore, some of the
BFV operations are slower than in the CKKS scheme.

Bootstrapping

On the subject of bootstrapping, both of these schemes were originally presented by their authors
without the bootstrapping operation. This was intentional, as both schemes were designed to be Leveled
Homomorphic Encryption schemes. At a later stage, however, this operation was added to both CKKS [23] and
BFV [15].

2.3 Acceleration Techniques

Given that HE is considerably slower than the corresponding operations on unencrypted data,
the different operations must be accelerated to make it viable in real-time and general-purpose
scenarios. This acceleration can be performed in different ways; however, from a top-down view, there is a
distinction between algorithm acceleration and hardware acceleration.

2.3.1 Algorithm optimizations

Modular Reduction

As mentioned before, the ciphertext exists in the ring Z_Q/(x^p + 1), which means that after certain
operations it may be necessary to reduce the obtained value modulo Q. In some instances, such as
operations that require polynomial multiplication, this reduction can be a major time contributor.
The standard way to reduce values is a variant of the Euclidean division, as shown in Equation
2.12.

 
x mod Q = x − ⌊x/Q⌋ × Q    (2.12)

This operation, however, requires a division, which is not only costly but also does not execute in
constant time. To accelerate this reduction, the Barrett algorithm is commonly used.
It allows the reduction of values as long as Q is constant and the value to be reduced satisfies x < Q². It
accomplishes the modular operation by converting the division into a multiplication by a fixed value,
which can be pre-computed, and a constant shift operation. One advantage of the Barrett algorithm
over other algorithms is that it performs in close to constant time for any value. This is
important, considering that operations that do not perform in constant time can leak some of
the underlying computations through timing attacks. The Barrett reduction is
presented in Algorithm 2.1. The value k = 45 is adapted from [24] and is chosen because, for this value,
the Barrett reduction works correctly on any unsigned 32-bit integer.
Algorithm 2.1: Barrett Reduction
Data: x ≤ Q² ∧ Q = constant
Result: y = x mod Q
k ← 45;
m ← ⌊2^k / Q⌋;
aux ← ((x × m) >> k) × Q;
x ← x − aux ;               /* Value of x is in [0, 2Q) */
if x ≥ Q then
    x ← x − Q;
return x;

There are variants of the reduction techniques presented. For instance, as a counterpart to the
Barrett reduction, there is the Montgomery reduction/multiplication [25], which is commonly used
when multiplications between two integers are performed modulo a constant Q, and the Karatsuba
algorithm, which is commonly used for fast multiplications between very large values, although it is
usually paired with the Barrett algorithm to perform the reduction afterwards. There are also modular
reductions based on look-up tables with pre-computed values, such as [26] and [27]. The Barrett
algorithm in particular will be vectorized in Chapter 4, and a signed-integer implementation is presented
in Chapter 3, since the Barrett reduction only works for unsigned integer values.
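Algorithm 2.1 can be sketched in Python, where arbitrary-precision integers stand in for the 64-bit intermediates a C implementation would use (the function name is invented for this example):

```python
def barrett_reduce(x, q, k=45):
    """Barrett reduction (Algorithm 2.1): x mod q without a runtime division.

    With k = 45, the result is correct for any unsigned 32-bit x. The constant
    m depends only on q and would be precomputed once in a real implementation."""
    m = (1 << k) // q            # precomputed per modulus
    est = ((x * m) >> k) * q     # approximation of floor(x / q) * q
    x -= est                     # x is now in [0, 2q)
    if x >= q:
        x -= q
    return x
```

The single conditional subtraction at the end is what keeps the running time close to constant, as opposed to a loop of repeated subtractions.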

Residue Number System

Given that Q can be a large value, close to 1000 bits, a general-purpose computer is unable to
perform computations directly on such numbers. One possible solution is to use the RNS [28], which
allows an integer modulo Q to be represented by a set of residues modulo smaller q1, q2, ... . The RNS
representation is based on the Chinese Remainder Theorem (CRT), which shows that the remainder of a value X by
a divisor Q can be uniquely represented by the remainders of the division of X by a set of co-prime
q's. Using RNS, it is possible to perform homomorphic operations on general-purpose computers by
performing them on each of the created polynomials in the ring Z_qk/(x^p + 1), where the values of qk are a
dimension smaller than the maximum possible integer value on a specific processor. Q and the qk
must be chosen such that Q = ∏k qk and the values of q are pairwise co-prime. This
representation is summarized in Equation 2.13.

 
 [X]q1  res1
 


 [X]q2 k  res2


Y RNS Decoding
[X]Q ⇒ [X]q3 ,Q = qi ⇒ res3 =========⇒ Result (2.13)
 ... ...
 
 i=0 

 
[X]qk resk
 

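The forward conversion and the CRT-based RNS decoding of Equation 2.13 can be sketched as follows; the moduli shown are illustrative pairwise co-prime toy values, far smaller than the machine-word-sized moduli used in practice, and the function names are invented for this example.

```python
from math import prod

def to_rns(x, moduli):
    """Forward conversion: the residues [x]_{q_i} for pairwise co-prime moduli."""
    return [x % q for q in moduli]

def from_rns(residues, moduli):
    """RNS decoding via the CRT: recover [x]_Q, with Q the product of the moduli."""
    Q = prod(moduli)
    x = 0
    for r, q in zip(residues, moduli):
        Qi = Q // q
        x = (x + r * Qi * pow(Qi, -1, q)) % Q   # pow(Qi, -1, q): modular inverse
    return x
```

Arithmetic modulo Q can then be carried out independently, and in parallel, on each residue channel: multiplying the residue vectors element-wise and decoding yields the same result as multiplying modulo Q directly.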
The Polynomial Modular Number System (PMNS) is a potential alternative to RNS. It represents
numbers in a slightly different way and seems to outperform the RNS representation for smaller
values of Q [29]. It should be noted that both the Relinearize and the Rescale operations are done
slightly differently when the scheme is implemented in RNS representation.

Number Theoretic Transform

Among the most important operations to consider for acceleration are the HMULT, CMULT and
Relinearization operations, as they are considerably more computationally demanding than
the others. Their complexity is O(p²), since they are based on polynomial multiplication,
as mentioned before, which means that for applications that employ a large p they become the
biggest time contributors. Nevertheless, these operations can be sped up using the NTT and Inverse-
NTT (INTT). The NTT is a type of FFT for discrete numbers in a finite field which, given that
the ciphertext space of polynomials is Z_Q/(x^p + 1), can be used directly on the ciphertexts. The NTT
algorithm, although it adds overhead by converting the ciphertext into the NTT domain, has the advantage
of performing polynomial multiplications in O(p log p), which for a large p results in a considerable
speedup. Although client-side operations, such as encrypt and decrypt, took a backseat in this
dissertation, it should be noted that they can be accelerated as well, utilizing both the Barrett reduction
and the NTT/INTT operations.
To implement the standard NTT algorithm, it is necessary to use ω. For Homomorphic Encryption,
since the ring space of the polynomials is Z_Q/(x^p + 1), negative-wrapped convolutions have to be
performed, which requires the output polynomial to be of size 2p. Given that this
implementation requires an auxiliary 2p vector followed by a reduction, it is the least preferred implementation.
For some values of Q and p, there exist values of ψ such that ψ² = ω, ψ^p = −1 mod Q and ψ < Q.
When this is possible, implementations that use this ψ value can be advantageous. Those
implementations have the advantage that the output polynomial is of size p, meaning no auxiliary
2p polynomial is required, and the transform can be performed directly on the input. One of the most common
algorithms, and the one that will be vectorized further ahead, is the Cooley-Tukey NTT [30], which performs
the NTT in a sequence of stages. The operation performed at each stage is the Cooley-Tukey butterfly,
presented in Figure 2.6(a). Since the implementation developed uses ψ, it will not work for some values
of Q and p. The Cooley-Tukey algorithm, based on this butterfly, is presented in Algorithm
2.2. It should be noted that this Cooley-Tukey NTT produces values in bit-reversed index order.
The INTT operation is similar, although the prominent algorithm used is the Gentleman-Sande
algorithm [31]. It differs from the Cooley-Tukey algorithm in that it uses ψ^−1 instead of ψ. Furthermore, the
final results have to be multiplied by a scalar value, and the operation performed in the
inner loop is a slightly different butterfly: the Gentleman-Sande butterfly, presented in Figure 2.6(b).
This algorithm takes values in bit-reversed index order and produces the result in normal order, which
means that by applying this INTT after the Cooley-Tukey NTT, the results return to the correct order.

Algorithm 2.2: Cooley-Tukey algorithm
Data: p = 2^k ∧ ψ < Q ∧ ψ² = ω ∧ [ψ^p]_Q = −1
Result: Ciphertext in NTT domain
m ← 1;
stride ← p;
while m < p do
    stride ← stride / 2;
    j ← 0;
    while j < m do
        auxj1 ← 2 × j × stride;
        auxj2 ← auxj1 + stride − 1;
        S ← [ψ^reverseBits(m+j)]_Q ;    /* This step is usually precomputed */
        k ← auxj1;
        while k ≤ auxj2 do
            U ← Ciphertext[k];
            V ← [Ciphertext[k + stride] × S]_Q ;
            Ciphertext[k] ← [U + V]_Q ;
            Ciphertext[k + stride] ← [U − V]_Q ;
            k ← k + 1;
        j ← j + 1;
    m ← 2 × m;

Furthermore, since the operations on the NTT converted values are all element-wise operations, the
result produced from the Cooley-Tukey NTT does not have to be put in normal order for the operations
to be performed.
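Algorithm 2.2 and its Gentleman-Sande inverse can be sketched in Python with toy parameters (p = 8, Q = 17, ψ = 3, so that ψ^p ≡ −1 mod Q); the parameter choices and function names are illustrative only, and the twiddle factors are computed on the fly here, although, as noted in the algorithm, they are usually precomputed.

```python
P, Q, PSI = 8, 17, 3   # toy parameters: PSI is a 2P-th root of unity, PSI^P = -1 mod Q

def bit_reverse(x, bits):
    """Reverse the low `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def ntt(a):
    """Cooley-Tukey forward NTT (Algorithm 2.2); output is in bit-reversed order."""
    a, bits = list(a), P.bit_length() - 1
    m, stride = 1, P
    while m < P:
        stride //= 2
        for j in range(m):
            s = pow(PSI, bit_reverse(m + j, bits), Q)   # usually precomputed
            start = 2 * j * stride
            for k in range(start, start + stride):
                u, v = a[k], (a[k + stride] * s) % Q
                a[k], a[k + stride] = (u + v) % Q, (u - v) % Q
        m *= 2
    return a

def intt(a):
    """Gentleman-Sande inverse NTT; takes bit-reversed input, returns normal order."""
    a, bits = list(a), P.bit_length() - 1
    psi_inv = pow(PSI, -1, Q)
    t, m = 1, P
    while m > 1:
        h, j1 = m // 2, 0
        for i in range(h):
            s = pow(psi_inv, bit_reverse(h + i, bits), Q)
            for j in range(j1, j1 + t):
                u, v = a[j], a[j + t]
                a[j], a[j + t] = (u + v) % Q, ((u - v) * s) % Q
            j1 += 2 * t
        t, m = 2 * t, h
    n_inv = pow(P, -1, Q)
    return [(x * n_inv) % Q for x in a]   # final scaling by p^-1 mod Q
```

After the forward transforms, a coefficient-wise (CWM) multiplication followed by the INTT yields the negacyclic polynomial product, which is where the O(p log p) speedup comes from.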

2.3.2 Hardware implementations

To accelerate Homomorphic Encryption, some authors have designed hardware architectures that take
advantage of the potential parallelism in these applications. These accelerators can exploit the fact
that element-wise operations over a large polynomial are highly parallelizable, and/or that the different
limbs of the RNS representation can, for some of the operations, be processed in
parallel. Furthermore, some authors take a more specific approach and try to accelerate a specific
instruction rather than the entire scheme. Most authors tend to focus on a single scheme,
although, given the similarities between the schemes mentioned in Section 2.2.3, they can often
accelerate more than one.

CPUs

Some authors take advantage of the parallel capabilities of multicore CPUs to accelerate the different
instructions. Since these approaches utilize components that most clients already have, they are a great
way not only to accelerate operations but also to make FHE more common. For example, some authors
utilize both MPI and OpenMP to accelerate on a CPU, such as [32], which by using CPU parallelism
(a) Cooley-Tukey butterfly    (b) Gentleman-Sande butterfly

Figure 2.6: Different butterflies in the NTT and INTT.

accelerated the inference operation of convolutional neural networks on encrypted data, obtaining
considerable speedups for common benchmarks such as the AlexNet convolutional network [33]. The work
in [33] is based on the BGV and BFV FHE schemes. For other schemes, such as TFHE,
there is [34], which implemented a parallel version of that scheme on both CPU and GPU. Furthermore,
some authors have also accelerated operations using CPUs with vector instructions,
which will be further analysed in Section 2.3.4.

GPUs

GPUs have a much greater number of cores than a CPU, meaning that, in general, they tend to
outperform CPUs for highly parallel applications. Furthermore, just like CPUs, they are widely available to clients.
As mentioned before, FHE operations are highly parallel, which has led some authors to accelerate
them using GPUs. One such implementation is TensorFHE [35], whose main goal was to accelerate
the NTT operation using CUDA cores, achieving speedups of up to 1000x over the
CPU implementation for the HADD operation.

FPGAs and ASICs

There are also implementations that use hardware completely dedicated to the acceleration of FHE,
in the form of FPGAs and ASICs. One of the more important FPGA implementations is HEAX [36], which
provides acceleration for both the NTT/INTT and other FHE operations, uncoupled from specific
parameters, for the CKKS scheme. Multiple acceleration techniques, such as the choice of the RNS
representation constants, as well as an NTT implementation based on the novel Montgomery butterfly,
are implemented in FPGAs and ASICs [37]. Furthermore, [37] proposes a custom Modular Arithmetic
Logic Unit (MALU) with five modular instructions, obtaining a speedup of close to 2x over the
state of the art.

On the ASIC side, the main designs are F1+ [38], CraterLake [39], ARK [40] and, more recently,
SHARP [41]. Of particular note, CraterLake takes advantage of RNS and NTT by designing specific
functional units for these operations.

2.3.3 Compiler Optimizations

Another interesting way in which a homomorphic encryption program can be accelerated is through
the compiler. Given the novelty of FHE, and the fact that most applications being developed do
not target an HE paradigm, many programs may be unoptimized or difficult to port to homomorphic
computation. Many current compiler optimizations try to solve both of these issues. For example,
in certain applications it might be beneficial to use the rotation operation to optimize execution, while in
others, by reordering operations, the same result can be obtained with less noise budget consumption;
that is, for complex applications, an inexperienced user might produce an inefficient implementation that a
compiler is able to optimize. A list of relevant compilers is presented in [42]. Furthermore, among
compilers that can analyse non-homomorphic programs and convert them to homomorphic encryption
operations, there is HECO, an end-to-end design for FHE compilers that converts imperative Python code
into FHE programs [43].

2.3.4 Different ISAs and their specific optimizations

Depending on the ISA for which an HE operation is being accelerated, there may be instructions
specific to that ISA which can be used to obtain further speedups. Vector instructions and
extensions in particular can be used to obtain major speedups for the more vectorizable operations.

x86 and AVX-512

x86 is the ISA used by most Intel processors. For x86, the AVX-512 extension provides
512-bit SIMD instructions. By using this extension, Intel HEXL [44] was able to vectorize both the NTT
operation and the modular multiplication, as well as other less significant operations.

ARM and Neon

ARM is a Reduced Instruction Set Computer (RISC) architecture. The vector instructions in ARM have
also been explored to accelerate HE, with Neon NTT [45] utilizing them to accelerate the NTT operation.

RISC-V and RVV extension

RISC-V is an ISA that has recently gained notoriety because, unlike the previously mentioned
alternatives, it is open-source. Similarly to ARM, it is a RISC architecture. Another benefit of RISC-V
is that, apart from the standard instructions, others can be added through extensions, meaning that
a processor can be chosen such that it only has the specific extensions needed for a given application.
Of particular note in this dissertation is the RVV extension [46].

2.4 Relevant FHE Libraries

Regarding HE, libraries implement a high-level view of the different operations that can be executed in
a scheme. Some libraries implement additional operations, such as bootstrapping or rotation;
however, the main operations, such as homomorphic addition and multiplication, are implemented in
all libraries for the supported schemes. There are many implementations of homomorphic encryption;
however, in this report, we only refer to the most relevant ones. An extensive list of HE libraries can be
found in [47], each with its advantages and drawbacks, as well as different programming languages.
In Table 2.3, the main HE libraries are listed.

2.4.1 Microsoft SEAL

The Simple Encrypted Arithmetic Library (SEAL) [48] was developed by Microsoft as a homomorphic
encryption library that implements the usual functions for multiple schemes. It was originally
written in C++; however, there are multiple ports to other programming languages. SEAL implements
more than one scheme, namely BFV and CKKS. For both of these schemes, SEAL provides the usual
addition and multiplication operations, and it also supports additional operations such as rotation and
plaintext-ciphertext operations. Another advantage of SEAL over other homomorphic encryption libraries
is that it can use the ZLIB [49] or Zstandard [50] libraries to compress data. It can also work with both
symmetric and asymmetric algorithms. One disadvantage of SEAL is that it does not implement any type
of bootstrapping. One limitation of SEAL, and of other library implementations, is that operations such
as comparisons are in most cases infeasible to execute on encrypted data.

2.4.2 OpenFHE

OpenFHE [51] is a library initially released in 2022 as a successor to the previous PALISADE FHE
library. It implements a wide variety of schemes, as well as many of the additional operations for each
scheme, such as bootstrapping. Furthermore, it has the particular advantage that an application running
on a specific scheme can be changed to another scheme, which can be used for further acceleration
when, in one specific part of an application, one scheme outperforms another. It also implements a
Hardware Abstraction Layer (HAL), which allows for support of different hardware acceleration
technologies, such as the accelerators previously mentioned in Section 2.3.2.

2.4.3 Scheme Specific libraries

Many schemes, when initially presented, are bundled together with a library implementation by their
authors. For example, the third-generation schemes FHEW [52] and TFHE [53] both have library
implementations. The Homomorphic Encryption for Arithmetic of Approximate Numbers (HEAAN) [22]
library was originally presented in the CKKS paper and is the implementation developed by the original
authors. One advantage of this library over other libraries is that it has been updated to implement the
bootstrapping operation for the CKKS scheme. Nevertheless, unlike SEAL, it only supports this scheme,
which makes it less than ideal for comparisons between different schemes.

                       HE schemes
Library     BGV   CKKS   BFV   FHEW   TFHE   Bootstrapping?
SEAL         X     X      X     -      -     No
OpenFHE      X     X      X     X      X     Yes
HEAAN        -     X      -     -      -     Yes
TFHE         -     -      -     -      X     Yes

Table 2.3: Different HE libraries.

In this chapter, the different operations present in the CKKS scheme were analysed, through the
description not only of their mathematical basis, by the use of equations, but also of the different figures
that describe which specific keys each operation requires. Furthermore, some of the acceleration
avenues were also presented: in particular, the RNS representation for the acceleration of very large
Q; the Barrett reduction, which not only is more secure in terms of timing attacks, but can also be used
to accelerate some of the different fragments of the operations; and, finally, the representation of the
ciphertexts in another domain by the usage of the NTT, which can be used to accelerate polynomial
multiplications. These operations will be implemented in Chapter 3, in which some specific strategies in
code are used to accelerate parts of the functions. Furthermore, those operations will be vectorized
using the RVV extension in Chapter 4.

3
FHE Base Operations
In this chapter, the different operations mentioned in Section 2.2 are implemented in a scalar way. The
caveats regarding each operation are presented, and a few possible acceleration techniques
are utilized, such as an overflow test and the Barrett reduction for some of the operations. Furthermore,
some of the roadblocks in each operation are described, as well as how they were
tackled. It should also be mentioned that the operations requiring multiplication of polynomials are
implemented here without the NTT, to create a baseline for these operations with O(n^2) complexity.
In Table 3.1, the names of the different operations and the types of operations implemented in this
dissertation are presented. These names are in part adapted from [35], which was mentioned earlier in
Section 2.3.2. All the source code for the different functions can be found in a GitHub repository [54].
It should also be noted that, for the majority of the operations, the portion of code pertaining to the
specific mathematical operation (e.g., the polynomial addition in HADD) has to be performed more than
once, since the ciphertexts have two or more polynomials and many of the operations must be applied
to all of them.

Operation          CT-CT         CT-PT
Addition           HADD          CADD
Subtraction        HSUB          CSUB
Multiplication     HMULT         CMULT
Relinearization*   RELINEARIZE   -
Rescaling*         RESCALE       -
NTT*               NTT           -
INTT*              INTT          -
CWM*               CWM           -
* = only one CT as input

Table 3.1: Different operations in FHE that were developed.

HADD/HSUB operations

The HADD operation is relatively simple to implement, as it exclusively requires an addition of the
polynomials and a reduction of the final value. However, one acceleration can be performed for the
reduction of the final result. Since the ciphertext values used as inputs are both in [0,Q(, the addition
results in a value in [0,2Q(. Therefore, to reduce the final result, it suffices to subtract Q whenever the
value is equal to or above Q. This way, neither the costly % operation nor the Barrett reduction is
required. To add support for the RNS representation in the HADD operation, the implementation was
changed slightly, with an outer loop that iterates over the different limbs of the polynomials to perform
the addition. This way, it is possible to re-use the inner addition of polynomials previously implemented
for the different limbs. In Figure 3.1(a), the block diagram of the HADD implementation is presented.
A similar procedure is also performed for the HSUB operation; however, instead of adding both
ciphertexts together, the second ciphertext is subtracted from the first. Because of that, the final result
is technically in )-Q,Q(. However, considering that both inputs are unsigned values, the final result will
actually either be in the correct ring, that is [0,Q(, or be a very large unsigned number. The final result
will be this very large unsigned number whenever the second ciphertext value is bigger than the first,
since the result is also a ciphertext, and therefore an array of unsigned integers. Because of that, to
perform the reduction, it is possible to take advantage of this overflow value, as it will always be bigger
than Q, as long as the word size is at least twice as large as Q. For this reason, the reduction is only an
if statement and an addition. For this operation to support RNS, the process is exactly the same as for
HADD, in that an outer loop is added to perform the operation for the different limbs. In Figure 3.1(b),
the block diagram of the HSUB implementation is presented.

Figure 3.1: Simple Ciphertext-Ciphertext operations: (a) HADD; (b) HSUB.
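The conditional-subtraction reduction and the RNS outer loop described above can be sketched in C as follows. Function names are illustrative (the library's actual code is in [54]), and Q = 12289 is used only as a sample modulus.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the scalar HADD inner loop: both inputs are assumed to lie in
 * [0, Q), so the sum lies in [0, 2Q) and a single conditional subtraction
 * replaces the costly % reduction. */
static void hadd_limb(uint64_t *res, const uint64_t *ct1, const uint64_t *ct2,
                      size_t p, uint64_t q)
{
    for (size_t i = 0; i < p; i++) {
        uint64_t s = ct1[i] + ct2[i];      /* in [0, 2Q) */
        res[i] = (s >= q) ? s - q : s;     /* conditional subtract, no division */
    }
}

/* RNS variant: an outer loop applies the same kernel to every limb, each
 * with its own modulus q[l]. */
static void hadd_rns(uint64_t *res, const uint64_t *ct1, const uint64_t *ct2,
                     size_t limbs, size_t p, const uint64_t *q)
{
    for (size_t l = 0; l < limbs; l++)
        hadd_limb(res + l * p, ct1 + l * p, ct2 + l * p, p, q[l]);
}
```

HSUB differs only in that the subtraction may wrap below zero, in which case the result is a very large unsigned value and Q is added back, as described above.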

CADD/CSUB operations

The CADD operation is also simple to implement, since for the most part it is exactly like the
HADD operation. However, since one of the values is a plaintext value, it will not be in [0,Q(. This
means that, to obtain a correct result, the plaintext value must first be reduced to [0,Q(; then the
operation can be performed exactly as in the HADD operation, which means that the final reduction can
also be performed by a simple subtraction of Q, when necessary. The reduction of the plaintext value to
the correct ring is, nevertheless, a bit more complicated. That is because this value can be considerably
bigger than Q, so the previous approach of only subtracting Q will not work. Furthermore, the plaintext
can also be a negative integer, making a correct implementation even harder. One potential solution is
to use a while loop that subtracts Q repeatedly while the plaintext value is positive, and adds Q
repeatedly while it is negative. This approach works for small plaintext values, but quickly becomes
very slow for large numbers. Furthermore, it also runs in non-constant time, which opens the door to
timing attacks. Another solution is to use the native % operation, which performs the correct reduction
when the plaintext value is non-negative, but produces a value in )-Q,0] when the value is negative,
thus requiring a final addition of Q. However, as discussed in Section 2.3.1, this operation requires a
costly division. Therefore, to obtain a substantial acceleration, the Barrett reduction can be used. It
should be noted that, since the plaintext is a signed integer, applying the Barrett algorithm directly will
not work; a variant of the Barrett algorithm for signed integer values can be used, as described later in
this chapter. This outperforms the % operation, since the compiler uses the Euclidean division to
perform the modulus operation, resulting in a costly division. To sum up, Equation 2.7 is slightly
changed to

CT3_1 = [CT1_1 + [PT]_Q]_Q    (3.1)

which, in terms of the underlying mathematical equation, results in the exact same value, but allows
the implementation to operate entirely on unsigned integers after the reduction of the plaintext is
performed. It also allows code to be re-utilized, as the latter part of the algorithm in Figure 3.2 is the
same as in Figure 3.1. In Figure 3.2(a), the block diagram of the CADD implementation is presented.
For the CSUB operation, after the reduction of the plaintext is performed, the implementation can take
advantage of the same overflow strategy as in the HSUB operation. In Figure 3.2(b), the block diagram
of the CSUB implementation is presented. The RNS representation of both the CADD and CSUB
operations is obtained by adding an outer loop, similar to the HADD and HSUB operations.

HMULT operation

The HMULT operation has to be implemented in a slightly different way. Firstly, the output ciphertext has
to be allocated to allow for one extra polynomial, since the result is a dyadic product between
ciphertexts, as seen in Equation 2.8. Furthermore, since the operation entails the product of two
polynomials, the output polynomial will be of degree 2p − 2. This constitutes a considerable implementation
Figure 3.2: Simple Plaintext-Ciphertext operations: (a) CADD; (b) CSUB.

challenge: firstly, because the allocated ciphertext polynomials have a maximum degree of p − 1;
furthermore, because the ring of ciphertext polynomials in HE is x^p + 1. This requires that, after the
multiplication is performed, the values with degree equal to or larger than p be subtracted from the
corresponding slots of the output polynomial. Therefore, an auxiliary 2p array is utilized to store the
temporary outputs. Since the intermediate multiplications are between unsigned integers, the reduction
that can be used is the unaltered Barrett reduction. After the polynomial multiplications are performed,
the final result in each slot of the auxiliary array will be in [0,Q(. Therefore, when reducing the
polynomial degree modulo x^p + 1, which entails subtracting the values, the overflow test used in HSUB
and CSUB can be used again to accelerate the final reduction. The HMULT operation can also be
converted to RNS by adding an outer loop, since each multiplication is only concerned with the current
limb being processed on both ciphertexts. In Figure 3.3, the implementation of the HMULT operation is
presented.

Figure 3.3: HMULT operation.
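One polynomial product inside HMULT can be sketched as follows: the O(p^2) schoolbook multiplication accumulates into a 2p auxiliary array (plain % stands in for the Barrett reduction used in the thesis), and the upper half is then folded back with a subtraction, since x^p = -1 in the ring. The fixed-size array and the function name are assumptions of this sketch.

```c
#include <stdint.h>
#include <stddef.h>

/* Negacyclic schoolbook product in Z_Q[x]/(x^p + 1); assumes p <= 64
 * so the auxiliary 2p array fits the fixed buffer of this sketch. */
static void poly_mul_negacyclic(uint64_t *res, const uint64_t *a,
                                const uint64_t *b, size_t p, uint64_t q)
{
    uint64_t aux[128] = {0};               /* 2p auxiliary array */
    for (size_t i = 0; i < p; i++)
        for (size_t j = 0; j < p; j++)
            aux[i + j] = (aux[i + j] + a[i] * b[j]) % q;   /* Barrett in the library */
    for (size_t i = 0; i < p; i++) {
        uint64_t d = aux[i] - aux[i + p];  /* wraps to a huge value on underflow */
        res[i] = (d >= q) ? d + q : d;     /* overflow test, as in HSUB/CSUB */
    }
}
```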

CMULT operation

The CMULT operation, just like the HMULT operation, requires the usage of an auxiliary 2p array.
However, unlike the HMULT operation, the output is a two-polynomial ciphertext, meaning it does not
require a special allocation for the output ciphertext. Since one of the inputs is a plaintext, just like in
the CADD and CSUB operations, the input plaintext values must be reduced using the integer Barrett
reduction. Afterwards, the implementation is exactly like the HMULT implementation described
previously. Another similarity with HMULT is that CMULT can also easily be converted to the RNS
representation with the addition of an outer loop. The new equation for the CMULT implementation is
described as

CT3_1 = [CT1_1 × [PT]_Q]_Q,
CT3_2 = [CT1_2 × [PT]_Q]_Q,    (3.2)

which again, in terms of the underlying mathematical equation, results in the same value, but the
implementation after the initial reduction of the plaintext is done exclusively on unsigned integers. In
Figure 3.4, the implementation of the CMULT operation is presented.

Figure 3.4: CMULT operation.

Relinearization

The relinearization operation presents a similar problem to the HMULT operation, in that the size of the
output ciphertext is different from that of the input ciphertext, since this operation reduces the ciphertext
from three polynomials to two. Furthermore, it also requires the multiplication of the ciphertext
polynomials by the relinearization keys, which are themselves a pair of polynomials. Therefore, the
same problem previously mentioned has to be tackled, in that an auxiliary 2p array has to be used. For
these reasons, the implementation of the relinearization operation is very similar to the HMULT
operation. It should also be noted that, since the values are in the ring x^p + 1, the second half of the
final polynomial also has to be subtracted from the first half. In Figure 3.5, the implementation of the
relinearization operation is presented.

Figure 3.5: Relinearize implementation.

Rescale operation

The Rescale operation is exclusive to the CKKS scheme and entails the division of the final results by
a scalar value. The choice of the scalar value is not important for the implementation of the algorithm;
therefore, the implementation should support any value for the division (apart from zero and negative
values). The operation is implemented similarly to the HADD operation, since it also only requires an
element-wise operation to be performed, in this case the division by a scalar. Since the division results
in a smaller value, the reduction by Q is not necessary, as the output will be in the ring [0,Q( with
certainty. It should be noted that the operation can be performed exclusively on integers, and
particularly on unsigned values, since both the input ciphertext and the output have to be in Z_Q,
meaning the fractional part of the division is not important. In Figure 3.6, the implementation of the
Rescale operation is presented.

Figure 3.6: Rescale implementation.

Encrypt/Decrypt operations

For the client-side operations developed, the encrypt and decrypt operations again have the same
problem as the HMULT and relinearize operations, in that the output size is different from the input size.
For the encrypt operation, the one-polynomial plaintext is converted to a two-polynomial ciphertext; for
the decrypt operation, the two-polynomial ciphertext is converted to a one-polynomial plaintext.
Furthermore, both operations require the auxiliary 2p array, since they both contain polynomial
multiplications. Of note, in the encryption operation, and as previously mentioned, the error generation
was not developed, and therefore the error term is assumed to be zero. These operations can thus be
used to show that the different operations are correct, but they are not to be used as a secure
encryption.

NTT

The NTT, being one of the most time-consuming operations, was one of the main focuses of this
dissertation. The problem with implementing any of the widely available algorithms is that they are not
developed for the ring used in HE, that is, x^p + 1. Most FFT and NTT algorithms are designed for the
ring x^p − 1, meaning their direct utilization in an HE application results in incorrect values. For this
reason, to be able to use the already developed algorithms, the input polynomial must be extended to
a 2p array, with the ciphertext values that are actually being operated on in the first half, and all zeros
in the second half. This results in the values in the latter part of the auxiliary 2p array in the HMULT
and CMULT operations not being subtracted, and therefore the result is a polynomial of degree up to
2p − 2. After performing the INTT operation, the final result is the degree-(2p − 2) polynomial that
would be produced by the multiplication operations, and therefore a subtraction similar to the one in
the HMULT/CMULT operations has to be performed. For this reason, although these operations were
implemented, they will not be further presented here, but can be found in the GitHub library [54]. The
implementation that was explored the most, and further ahead vectorized, was the Cooley-Tukey
algorithm. Firstly, the input and output size can be a p polynomial, meaning it can be performed directly
on the ciphertext's defined structure. Each stage is performed by another loop, with the butterflies being
executed inside, in two separate loops. As shown in Figure 2.6(a), the value of V is multiplied by ψ.
Both of these values are unsigned integers, so the Barrett algorithm can be used to accelerate this
reduction. The reductions by Q that have to be applied when adding/subtracting U and [V × ψ]_Q can
again be performed by subtracting Q, and by using the overflow test previously mentioned. This ψ
value is what places the operation in the ring x^p + 1; without it, the Cooley-Tukey NTT and
Gentleman-Sande INTT would perform the operation in the ring x^p − 1. Both the NTT and INTT
algorithms implemented were also extended to support the RNS representation.
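The negacyclic Cooley-Tukey NTT and Gentleman-Sande INTT pair described above can be sketched as follows, using the standard merged-ψ formulation. Plain % stands in for the Barrett reduction and the conditional subtractions used in the library, P = 8 and Q = 12289 are sample parameters, and the table and function names are illustrative rather than the library's API.

```c
#include <stdint.h>
#include <stddef.h>

#define P 8            /* polynomial degree (power of two) */
#define Q 12289u       /* NTT-friendly prime: 2*P divides Q-1 */

static uint64_t mulmod(uint64_t a, uint64_t b) { return (a * b) % Q; }

static uint64_t powmod(uint64_t b, uint64_t e)
{
    uint64_t r = 1;
    for (b %= Q; e; e >>= 1, b = mulmod(b, b))
        if (e & 1) r = mulmod(r, b);
    return r;
}

/* bit-reversal of k over `bits` bits */
static unsigned brv(unsigned k, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++)
        r |= ((k >> i) & 1u) << (bits - 1u - i);
    return r;
}

/* Forward negacyclic NTT (Cooley-Tukey), in place; input in normal order,
 * output in bit-reversed order. psi_rev[k] holds psi^brv(k), where psi is a
 * primitive 2P-th root of unity mod Q (psi^P = -1): these twiddles are what
 * keep the transform in x^P + 1. */
static void ntt_ct(uint64_t *a, const uint64_t *psi_rev)
{
    for (size_t m = 1, t = P / 2; m < P; m <<= 1, t >>= 1)
        for (size_t i = 0; i < m; i++) {
            uint64_t s = psi_rev[m + i];
            for (size_t j = 2 * i * t; j < 2 * i * t + t; j++) {
                uint64_t u = a[j], v = mulmod(a[j + t], s);
                a[j] = (u + v) % Q;          /* library: conditional subtract */
                a[j + t] = (u + Q - v) % Q;  /* library: overflow test       */
            }
        }
}

/* Inverse negacyclic NTT (Gentleman-Sande), in place; undoes ntt_ct.
 * ipsi_rev[k] holds psi^{-brv(k)}. */
static void intt_gs(uint64_t *a, const uint64_t *ipsi_rev)
{
    for (size_t h = P / 2, t = 1; h >= 1; h >>= 1, t <<= 1)
        for (size_t i = 0; i < h; i++) {
            uint64_t s = ipsi_rev[h + i];
            for (size_t j = 2 * i * t; j < 2 * i * t + t; j++) {
                uint64_t u = a[j], v = a[j + t];
                a[j] = (u + v) % Q;
                a[j + t] = mulmod(u + Q - v, s);
            }
        }
    uint64_t ninv = powmod(P, Q - 2);        /* P^{-1} mod Q (Q is prime) */
    for (size_t j = 0; j < P; j++)
        a[j] = mulmod(a[j], ninv);
}
```

A forward transform followed by the inverse recovers the input, and a coefficient-wise product between two transformed polynomials followed by the inverse yields their negacyclic product.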

Integer Barrett reduction and the Barrett reduction

As shown in the operations that require a plaintext as an input, the Barrett reduction cannot be directly
used, since it works only for non-negative values. However, by using the implementation in Figure 3.7,
it is possible to reduce any value.

Figure 3.7: Integer Barrett implementation.

This implementation takes advantage of the fact that if a negative value is negated, and the Barrett
reduction is then performed on it, the result will be the correct reduction value of the ring )-Q,0], but in
positive form. Because of that, Q must be subtracted and the value negated again, resulting in a final
value in [0,Q(.
Finally, the implementation of the Barrett reduction itself can easily be performed using multiplications
and the logical shift right operation. For the specific value of Q = 12289, the shift value is k = 45 and
the auxiliary multiplication constant is m = 2863078532. Figure 3.8 depicts the diagram of the Barrett
reduction operation implemented.

Figure 3.8: Barrett reduction implementation.
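The Barrett reduction with the constants given above (k = 45, m = floor(2^45 / Q)), together with the signed variant of Figure 3.7, can be sketched as follows. Function names are illustrative, and the sketch assumes the x·m product fits in 64 bits, which holds for the products of values below Q that occur in the operations.

```c
#include <stdint.h>

static const uint64_t q_mod   = 12289;         /* Q */
static const uint64_t k_shift = 45;            /* shift value k */
static const uint64_t m_barr  = 2863078532ull; /* m = floor(2^45 / 12289) */

/* Barrett reduction: the estimated quotient is off by at most one, so a
 * single conditional subtraction finishes the reduction (no division). */
static uint64_t barrett(uint64_t x)
{
    uint64_t qhat = (x * m_barr) >> k_shift;   /* approximate x / Q */
    uint64_t r = x - qhat * q_mod;             /* in [0, 2Q) */
    return (r >= q_mod) ? r - q_mod : r;
}

/* Signed variant of Figure 3.7: negate, reduce, and map back into [0, Q). */
static uint64_t barrett_signed(int64_t x)
{
    if (x >= 0) return barrett((uint64_t)x);
    uint64_t r = barrett((uint64_t)(-x));      /* |x| mod Q */
    return r == 0 ? 0 : q_mod - r;
}
```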

Coefficient Wise Multiplication (CWM)

After the NTT operation is performed on a ciphertext, the multiplication operation is performed
element-wise. In the library implementation [54], this operation is called CWM, and it performs similarly
to the HADD operation, although it uses multiplication rather than addition. Furthermore, the final
reduction also has to be the Barrett reduction, since the result of the multiplication can potentially be
much larger than Q. This operation also has an implemented RNS variant. One advantage of using the
CWM operation over the basic multiplication ones is that it does not require an auxiliary array, and it is
also much faster than the other operations. In Figure 3.9, the implementation of the CWM operation is
presented.

Figure 3.9: Coefficient Wise Multiplication implementation.
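The CWM kernel reduces to a single element-wise loop, since no auxiliary 2p array is needed once both operands are in the NTT domain. A sketch (plain % stands in for the Barrett reduction; the function name is illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Coefficient-wise product of two NTT-domain polynomials; the per-slot
 * reduction is the only remaining work. */
static void cwm_limb(uint64_t *res, const uint64_t *ct1, const uint64_t *ct2,
                     size_t p, uint64_t q)
{
    for (size_t i = 0; i < p; i++)
        res[i] = (ct1[i] * ct2[i]) % q;  /* products fit in 64 bits for q < 2^32 */
}
```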

With all the different operations implemented, and with most of them supporting the RNS number
system, a scalar benchmark has been established, and the different acceleration techniques that can
be used in each block of the operations have been applied. With that, one can start accelerating these
algorithms by directly transposing them to parallel SIMD instructions. For that, the intrinsics in the RVV
extension will be used in Chapter 4. For the operations that cannot easily be transposed directly to
vector instructions, or that produce poor results, new vector algorithms will be presented.

4
Vector Acceleration of FHE Operations
Of the implementations presented in Chapter 3, some were also vectorized. In this chapter, the
vectorization of the different operations is presented, referencing some of the techniques used to adapt
the scalar implementations to vector form. In Table 4.1, the operations that were and were not
vectorized are presented, where CT-CT denotes Ciphertext-Ciphertext operations and CT-PT denotes
Ciphertext-Plaintext operations. It should be noted that, for the scalar operations that were expanded to
the RNS representation, the corresponding vector implementations also support RNS. That is because,
in all of the vector operations implemented, the vectorization happened in an inner loop, and since the
limbs are independent from each other, the vector instructions can be performed for all the limbs
separately. Again, all the source code can be found in [54].

Operation                     CT-CT   CT-PT
Addition                      Yes     Yes
Subtraction                   Yes     Yes
Multiplication                Yes     Yes
Relinearization*              Yes     -
Rescaling*                    Yes?    -
NTT*                          Yes     -
INTT*                         Yes     -
CWM*                          Yes     -
Encrypt*                      No      -
Decrypt*                      No      -
Barrett Reduction**           Yes     -
Integer Barrett Reduction**   Yes     -
* = only one CT as input
** = input is any array of values

Table 4.1: Different operations in FHE that were vectorized.

HADD/HSUB operations

Firstly, to implement the vectorization of the HADD operation, one can perform the addition on the
different slots of the ciphertexts in a parallel manner, that is, vectorize the innermost loop and take
advantage of the large values of p for major speedup gains. As previously stated in the implementation
of the scalar algorithm, the reduction is exclusively a subtraction by Q. However, since the function now
operates on a large number of values rather than just one, the vector operations must only subtract in
the correct slots. The RVV extension has a masked variant for most operations, meaning that, by
creating a mask wherever the values are equal to or larger than Q, the scalar operation can easily be
vectorized. In Figure 4.1, the vector implementation of the HADD operation is presented, for a Vector
Length (VL) of dimension 4.
Figure 4.1: Vector implementation of the HADD operation for VL=4.

The HSUB operation is performed similarly, since the extension provides comparison operations that
take into account whether the values are unsigned, meaning the overflow technique previously
mentioned works as well.
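The masked flow of Figure 4.1 can be modelled per lane in portable C as follows; the commented names are the RVV intrinsics (v0.12 `__riscv_` naming is assumed here) that each step would map to in an actual RVV build. This is a sketch of the technique, not the library's exact intrinsic sequence.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define VL 4   /* one strip of VL elements, as in Figure 4.1 */

/* Per-lane model of the vectorized HADD: add, compare against Q to build a
 * mask, then subtract Q only in the masked lanes. */
static void hadd_strip(uint64_t *res, const uint64_t *ct1, const uint64_t *ct2,
                       uint64_t q)
{
    uint64_t sum[VL];
    bool mask[VL];

    for (size_t i = 0; i < VL; i++)      /* __riscv_vadd_vv_u64m4 */
        sum[i] = ct1[i] + ct2[i];
    for (size_t i = 0; i < VL; i++)      /* __riscv_vmsgeu_vx_u64m4_b16 */
        mask[i] = sum[i] >= q;
    for (size_t i = 0; i < VL; i++)      /* __riscv_vsub_vx_u64m4_mu (masked) */
        res[i] = mask[i] ? sum[i] - q : sum[i];
}
```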

CADD/CSUB operations

The implementation of the CADD operation follows the scalar implementation previously presented,
which means it has the same challenges. Firstly, the plaintext values can be both positive and negative
integers; therefore, the integer variant of the Barrett algorithm has to be used. However, since the
operation is performed on only some of the values being processed in parallel, a mask must be created
for the negative and positive values, which are processed slightly differently. As seen in Figure 3.7, the
negative values are negated at the start and then negated again at the end, with the middle part being
exactly the same for all values, meaning the unsigned Barrett reduction does not have to be performed
twice. In Figure 4.2, the vector implementation of the CADD operation is presented for a VL of
dimension 4.

Figure 4.2: Vector implementation of the CADD operation for VL=4.

The CSUB operation uses both the vector reduction of the plaintext previously presented and the
vector overflow technique.

HMULT operation

The HMULT vector implementation has the same ciphertext output size problems described for the
scalar implementation, as well as the need for an auxiliary 2p array. However, there is one interesting
instruction that can be used to accelerate the computation, and which does not exist in the scalar ISA:
the multiply-accumulate (vmacc) instruction, which fuses the addition of the previous value in the
auxiliary array slot with the multiplication of the two new polynomial terms being processed. The
vectorization was performed both on the main computation loop and on the final reduction that
converts the result into the ring x^p + 1. In Figure 4.3, the vector implementation of the HMULT
operation is presented for a VL of dimension 4. The vectorization is for the O(n^2) complexity
implementation of the HMULT operation, and therefore does not use the NTT algorithm.

Figure 4.3: Vector implementation of the HMULT operation for VL=4.
CMULT operation

The CMULT vector implementation does not have the ciphertext output size problem of the HMULT
operation, although, since one of its inputs is a plaintext, it requires the reduction of the plaintext. This
reduction can be performed before the main computations, with the values stored in an auxiliary
unsigned integer array. This way, it is possible to perform the equation in a very similar way to the
HMULT operation. Here, the vmacc instruction can also be used to slightly accelerate the operation.
In Figure 4.4, the vector implementation of the CMULT operation is presented for a VL of dimension 4.
The vectorization is for the O(n^2) complexity implementation of the CMULT operation, and therefore
does not use the NTT algorithm.

Figure 4.4: Vector implementation of the CMULT operation for VL=4.

Relinearization

The vector relinearization implementation is very similar to the CMULT implementation; however, there
are a few key differences. One of them is the difference in size from the input ciphertext to the output
ciphertext, from three polynomials to two. Furthermore, the relinearization key is an unsigned array in
[0,Q(, meaning it does not need to be reduced. Finally, before the main inner loop (which is the
multiplication of CT3 with the EK), the first/second ciphertext polynomial value needs to be added to
the array. This is done in a scalar way, since the vectorization is concentrated on the innermost loop.
In Figure 4.5, the vector implementation of the Relinearize operation is presented for a VL of
dimension 4.


Figure 4.5: Vector implementation of the Relinearize operation for VL=4.

Rescale operation

The vector implementation of the Rescale operation can be a direct port of the scalar implementation,
since the operation is element-wise. However, the version of the vector extension that was used
implements neither the element-wise unsigned integer division operation nor the signed integer
division operation, so an element-wise addition was substituted for both. In Figure 4.6, the
vector implementation of the Rescale operation is presented for a VL of dimension 4, as it would
look if the intrinsic had been implemented.

Figure 4.6: Vector implementation of the Rescaling operation for VL=4.
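The element-wise work the missing division intrinsic would perform can be sketched in scalar C. This is illustrative only: the function name, the rounding choice, and the parameter q_last are our assumptions, not the library's API.

```c
#include <stdint.h>

/* Illustrative sketch of the element-wise Rescale step: divide every
 * coefficient by the dropped modulus q_last, with rounding. The vector
 * version would perform the same division lane by lane, had the
 * integer-division intrinsic been available. */
void rescale_coeffs(const uint64_t *in, uint64_t *out, int n, uint64_t q_last)
{
    for (int i = 0; i < n; i++)
        out[i] = (in[i] + q_last / 2) / q_last;  /* rounded integer division */
}
```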

NTT/INTT

The implementation of the vector NTT Cooley-Tukey operation was slightly more difficult, since the
scalar algorithm does not map to an efficient vector length in the later stages: each stage processes
half as many values in parallel as the previous one, until the last stage processes a single value,
making it effectively scalar. This happens because the innermost loop shrinks with each stage,
whereas the middle loop grows. The solution is to fuse these loops for the later stages. In Figure 4.7,
the memory access for the different stages is presented.

As can be observed, each NTT stage halves the previous block size and processes only the first half
of each new block. The vector length is set to the maximum possible, which in this case is 256, and
the operation must be performed exclusively on the first half of the different blocks. To process
exactly those specific chunks, mask operations are used, for which a mask generation algorithm must
be developed. This algorithm must produce runs of ones and zeros, each the size of half the block that
is currently being processed. In Listing 4.1, the mask creation implementation is presented,

Listing 4.1: Mask Creation Loop

for (int j = 0; j < max_vector_length; j++)
{
    mask_aux[j] = !(1 & (j >> t_aux));
}

45

Figure 4.7: Memory access of the Cooley-Tukey operation.

where t_aux is the shift amount that divides j by half the size of the block currently being processed.
This approach works correctly; however, for the first stages it is not optimal. That is, if the scalar
implementation is directly converted to a vector one, it will outperform the masked implementation
during the first stages. Therefore, the optimal solution is a hybrid version of the two. The switch
condition is defined by the dimension of the vector length, m < p / VL, where m is a variable that is
multiplied by 2 every stage. In regards to the execution of the CT butterfly, one approach could be to
use vmacc, as in the CMULT/HMULT operations, to calculate U + ψ × V . However, the butterfly requires
two operations to be performed with the result ψ × V , as can be seen in Figure 2.6(a), meaning it is
faster to reduce this result first, and then use the mask reduction for U + [ψ × V ]Q and the overflow
technique for U − [ψ × V ]Q . Another possible technique for implementing the NTT in a vector domain
is to use an algorithm more amenable to vector instructions than Cooley-Tukey, such as the
Korn-Lambiotte NTT [55] or the Pease NTT [56].
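The butterfly strategy described above, reducing ψ × V once and reusing it for both outputs, can be sketched in scalar C. This is an illustrative sketch, not the vectorized thesis code; Q = 12289 and the 45-bit Barrett shift follow the text, while the helper names are ours.

```c
#include <stdint.h>

#define NTT_Q 12289ULL
#define NTT_SHIFT 45
#define NTT_M ((1ULL << NTT_SHIFT) / NTT_Q)  /* Barrett constant floor(2^45/Q) */

static uint64_t bred(uint64_t v)             /* Barrett reduction into [0, Q) */
{
    uint64_t r = v - ((v * NTT_M) >> NTT_SHIFT) * NTT_Q;
    return r >= NTT_Q ? r - NTT_Q : r;
}

/* Cooley-Tukey butterfly: psi*V is reduced once and reused for both
 * outputs. The "+ Q" before the subtraction is the overflow trick that
 * keeps the computation in the unsigned domain. */
void ct_butterfly(uint64_t *u, uint64_t *v, uint64_t psi)
{
    uint64_t a = *u;
    uint64_t t = bred(psi * *v);
    uint64_t s = a + t;
    *u = s >= NTT_Q ? s - NTT_Q : s;         /* U + [psi*V]_Q, masked reduction */
    uint64_t d = a + NTT_Q - t;              /* U - [psi*V]_Q, overflow trick */
    *v = d >= NTT_Q ? d - NTT_Q : d;
}
```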

For the INTT Gentleman-Sande operation, the implementation is very similar, with the Barrett reduction
likewise used to reduce the values, and masks used to accelerate the algorithm. However, the final
values must also be multiplied by a scalar, which can itself be vectorized.
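The Gentleman-Sande counterpart can be sketched the same way. Again an illustrative scalar sketch: the twiddle here would be an inverse root of unity, and the final multiplication by n⁻¹ mod Q mentioned above is left out of the sketch.

```c
#include <stdint.h>

#define GS_Q 12289ULL
#define GS_SHIFT 45
#define GS_M ((1ULL << GS_SHIFT) / GS_Q)

static uint64_t gs_bred(uint64_t v)          /* Barrett reduction into [0, Q) */
{
    uint64_t r = v - ((v * GS_M) >> GS_SHIFT) * GS_Q;
    return r >= GS_Q ? r - GS_Q : r;
}

/* Gentleman-Sande butterfly: U' = [U + V]_Q, V' = [(U - V) * psi]_Q.
 * The difference uses the same unsigned overflow trick before the
 * multiplication by the (inverse) twiddle factor; a + Q - b < 2Q, so
 * the product stays well inside the Barrett reduction's valid range. */
void gs_butterfly(uint64_t *u, uint64_t *v, uint64_t psi)
{
    uint64_t a = *u, b = *v;
    uint64_t s = a + b;
    *u = s >= GS_Q ? s - GS_Q : s;
    *v = gs_bred((a + GS_Q - b) * psi);
}
```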

Barrett Reduction and Integer Barrett Reduction

To implement the integer Barrett reduction, and by consequence the Barrett reduction, masks are
required: first to check whether a value is smaller than zero, that is, negative, which requires a
signed less-than comparison; and second to check whether the values covered by this mask are different
from zero. Fortunately, for the first problem, the vector extension provides a signed integer
comparison, and for the second, the mask creation operations also have a masked variant, meaning
the comparison with zero can be evaluated only in selected slots. The resulting implementation is
presented in Listing 4.2.

Listing 4.2: Integer Barrett implementation

// Load plaintext
__epi_1xi64 v_pt = __builtin_epi_vload_1xi64(&plaintext1.values[j], gvl);
// First mask
__epi_1xi1 v_mask = __builtin_epi_vmslt_1xi64(v_pt, v_0, gvl);
__epi_1xi64 v_aux_neg = __builtin_epi_vsub_1xi64_mask(v_pt, v_0, v_pt, v_mask, gvl);
// Reduction of the plaintext
Barrett_reduction(v_pt, v_m, v_shift_val);
// Second mask
__epi_1xi1 zero_mask = __builtin_epi_vmseq_1xi64(v_aux_neg, v_0, gvl);
__epi_1xi1 last_mask = __builtin_epi_vmandnot_1xi1(v_mask, zero_mask, gvl);
v_aux_neg = __builtin_epi_vsub_1xi64_mask(v_aux_neg, v_aux_neg, v_Q, last_mask, gvl);
v_pt = __builtin_epi_vsub_1xi64_mask(v_aux_neg, v_0, v_aux_neg, last_mask, gvl);

In Listing 4.2, the integer Barrett reduction is presented, showing the creation of the previously
mentioned masks. As for the Barrett reduction itself, it can be implemented directly from the scalar
algorithm using the corresponding vector operations, namely the vector multiplication and the
element-wise logical shift right. In Listing 4.3, the Barrett reduction operation is presented.

Listing 4.3: Barrett implementation

// Variable initializations
__epi_1xi64 v_coef_mod = __builtin_epi_vbroadcast_1xi64(Q, gvl);
__epi_1xi64 v_shift_val = __builtin_epi_vbroadcast_1xi64(45, gvl);

// Reduction
__epi_1xi64 v_q = __builtin_epi_vmul_1xi64(v_values, v_m, gvl);
v_q = __builtin_epi_vsrl_1xi64(v_q, v_shift_val, gvl);
__epi_1xi64 v_aux1 = __builtin_epi_vmul_1xi64(v_q, v_coef_mod, gvl);
v_values = __builtin_epi_vsub_1xi64(v_values, v_aux1, gvl);
__epi_1xi1 mask = __builtin_epi_vmsleu_1xi64(v_coef_mod, v_values, gvl);
v_values = __builtin_epi_vsub_1xi64_mask(v_values, v_values, v_coef_mod, mask, gvl);

CWM

The CWM operation, the operation used after the NTT for the multiplication of the polynomials, can be
implemented by directly transposing the scalar operation to vector operations. That is because this
operation is only an element-wise multiplication followed by the vector Barrett reduction previously
presented. This operation can be implemented without the auxiliary array, and its implementation is
presented in Figure 4.8, for a VL of dimension 4.


Figure 4.8: Vector implementation of the CWM operation for VL=4.
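The element-wise structure of the CWM operation can be sketched in scalar C, as an illustrative stand-in for the vector version (Q = 12289 and the 45-bit shift assumed as elsewhere; names are ours).

```c
#include <stdint.h>

#define CWM_Q 12289ULL
#define CWM_SHIFT 45
#define CWM_M ((1ULL << CWM_SHIFT) / CWM_Q)

static uint64_t cwm_bred(uint64_t v)         /* Barrett reduction into [0, Q) */
{
    uint64_t r = v - ((v * CWM_M) >> CWM_SHIFT) * CWM_Q;
    return r >= CWM_Q ? r - CWM_Q : r;
}

/* Coefficient-wise multiplication in the NTT domain: a single pass of
 * multiply + Barrett reduce per coefficient, with no auxiliary array. */
void cwm(const uint64_t *a, const uint64_t *b, uint64_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = cwm_bred(a[i] * b[i]);
}
```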

After the implementation of the scalar operations in Chapter 3, where the different acceleration
techniques were explored, and the implementation of the vector operations in this chapter, in which
some of the operations had to be adapted to work correctly in vector form, speedup values can be
obtained for the best iterations of both, along with several benchmarks. Furthermore, the comparison
of the different versions of the implemented operations is also presented. These benchmarks are
presented and evaluated for different values of both p and Q, to assess their influence on the final
speedups. In Chapter 5, the results obtained on the emulator for the Avispado RISC-V processor with
the RVV extension are presented.

Experimental Results
5
Contents
5.1 Experimental Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.2 HADD/HSUB operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.3 CADD/CSUB operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5.4 CMULT operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.5 HMULT operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.6 CWM operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.7 NTT/INTT operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.8 Other operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.9 General speedups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.10 RNS representation test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

In this chapter, the experimental results obtained for the operations and functions of the previous
chapters are presented. These experiments entail varying the VL for the different operations, as well
as the values of p and Q, to test their impact on the final results. Benchmarks are also presented for
some of the operations. The setup in which the different tests were conducted is also described.

5.1 Experimental Environment

The following experimental results were obtained on the FPGA hosted at the BSC. It is a VCU128
board configured to implement [57]:

a single RISC-V scalar Avispado core + Vector Processing Unit (VPU) accelerator running @ 50 MHz;
four HomeNodes adding up to 1 MiB of L2 cache;
4 GB of DDR4-2666 RAM.

The Semidynamics Avispado processor [3] is a RISCV64GCV processor that allows 2-wide in-order
operation. It also has a specialized vector unit that executes the RVV instructions. The maximum
Vector Length (VL) used in the experimental results is 256 for 64-bit elements, which is the maximum
VL the VPU can use. In Figure 5.1, the architecture of the Avispado processor is presented.

Figure 5.1: Avispado architecture (taken from [3]).

5.2 HADD/HSUB operations

In Figure 5.2(a), the number of cycles needed to perform the operation versus the value of p is presented
for the different implementations of the HADD operation in this dissertation. These implementations are

the standard scalar operation, which serves as a reference implementation; an implementation using a
value of Q different from Q = 12289, in this case Q = 20000 (although any Q larger than the sum of the
two ciphertexts would yield similar results, since the conditional subtraction by Q would never
trigger), to show the impact of the coefficient modulus on the HADD operation; the vector
implementation that was developed; and the same vector implementation for a sub-optimal value of the
VL, in this case V L = 128. These tests were also performed for the HSUB operation, and their results
are presented in Figure 5.2(b).

(a) HADD operation

(b) HSUB operation

Figure 5.2: HADD/HSUB cycles to perform for different p.

From the results presented, some conclusions can be drawn. In Figure 5.2(a) the different operations
can be seen for differing values of p, which are defined as powers of two, and different HADD function
implementations. "HADD" is the scalar implementation using the mask reduction technique previously
presented. "HADD % reduct" is the reduction using the standard % operator. It can be observed that for
small values of p, the latter implementation outperforms the masked one; since the dimension of p is
rather small, this could be due to the if statement in the masked implementation taking longer than
the division. This quickly changes once p increases to 8, at which point the masked scalar operation
is already faster than using %, with a speedup of 1.18x for p = 1024. The "HADD Q = 20000"
implementation is the scalar implementation with a Q that is larger than the sum of both ciphertext
values. Because of that, the subtraction by Q is never performed, resulting in a slightly faster
operation, although not considerably so, with an average speedup of 1.12x over the scalar operation,
meaning the if statement is likely more time consuming than the subtraction. In regards to
vectorization, it can be noted that for small values of p, the scalar implementations mostly
outperform the vector ones, with the best case scenario having a speedup of 0.43x over the scalar
implementation. This is because the operations in the vector implementation take more time on average
than their scalar counterparts, which means that only when the number of values processed in parallel
is larger than a certain threshold (in this simulation it was p = 16 for the best vector
implementation) will there be significant speedups. The best vector implementation, "HADD vect", is
able to obtain a speedup of 12.23x over the scalar one for p = 1024. This shows that using vector
intrinsics to accelerate the different HE operations is a viable solution. There is also a "HADD auto"
operation, which is the auto-vectorization performed by the clang compiler. This approach is faster
than the scalar ones for large values of p; however, "HADD vect" outperforms it by a median factor of
1.57. These results indicate that even for the simpler operations, such as HADD, the compiler's
auto-vectorization needs further work to compete with fully vectorized implementations, although for
quick tests where the scalar operation can be directly transposed to vector form, it could
significantly reduce programming time. Finally, there is also "HADD vect, vl=128", which is the best
vector implementation with the VL limited to half of the maximum possible value. For values of p from
1 to the size of the VL, both implementations perform exactly the same, because the internal loop
completes in a single iteration for both functions. However, once p becomes larger than the VL, the
implementation with V L = 128 has to perform the inner loop twice, resulting in a consistently slower
operation, with a speedup of 0.71x at p = 1024. This shows that the VL is not important for smaller
values of p; however, should p increase above the VL being used, a major performance hit is noticed.
In Figure 5.2(b), similar results are shown for the HSUB operation, with slight caveats. A speedup of
8.22x for p = 1024 was obtained when compared to the scalar operation. For both values of Q, the
subtraction resulted in a value in the correct ring, meaning both implementations are exactly equal.
Moreover, for V L = 128, the result was practically equal to the implementation with V L = 256. In
Table 5.1, the summarized speedups of the different HADD implementations are presented. In Table 5.2,
these speedups are presented for the HSUB operations.

Scalar Vectorial
Q = default Q = 20000 Q = default
If reduct % reduct If reduct If reduct If reduct, vl = 128 auto
1.00x 0.85x 1.07x 12.23x 8.72x 9.04x

Table 5.1: Speedup values for p = 1024, compared to ”HADD”.

Scalar Vectorial
Q = default Q = 20000 Q = default
overflow technique overflow technique overflow technique overflow technique, vl = 128 auto
1.00x 1.00x 8.22x 8.26x 6.55x

Table 5.2: Speedup values for p = 1024, compared to ”HSUB”.
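The scalar variants compared above differ only in how the sum re-enters [0, Q). A minimal sketch of the "if reduct" form, which the tables take as the baseline (Q = 12289 assumed; names are ours):

```c
#include <stdint.h>

#define HQ 12289ULL

/* "If reduct" HADD: limb-wise addition followed by a conditional
 * subtraction of Q, instead of the costlier % operator. Inputs are
 * assumed already in [0, Q), so one subtraction always suffices. */
void hadd(const uint64_t *a, const uint64_t *b, uint64_t *out, int n)
{
    for (int i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        out[i] = s >= HQ ? s - HQ : s;
    }
}
```

The vectorized counterpart replaces the branch with a masked subtraction, which is what "HADD vect" does.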

5.3 CADD/CSUB operations

In Figure 5.3(a), the number of cycles needed to perform the operation versus the value of p is
presented for the different implementations of the CADD operation. Just like for the previous
operation, the basic scalar operation is presented, as well as the operation using a different Q.
Furthermore, the scalar operation using the Barrett reduction is also shown. Although in the previous
example the value of Q did not considerably influence the execution time, in this example it could, if
the plaintext value is much larger than Q. These tests were also performed for the CSUB operation, and
their results are presented in Figure 5.3(b).
From the results presented, some conclusions can be drawn. In Figure 5.3(a) the different operations
can be seen for a variety of values of p, which are defined as powers of two, and different CADD
function implementations. "CADD while reduct" is the implementation that uses a while loop to reduce
the values by subtracting Q until the plaintext value is in the correct ring. For a plaintext value
close to Q, this implementation performs competitively with the other scalar implementations, although
it remains the slowest, with a speedup of 0.44x compared to them. It should also be noted that for a
smaller value of Q, such as Q = 500, the time required for this operation skyrockets, taking 20 times
longer than for a larger Q when p = 1024. In regards to the other scalar operations, "CADD % reduct",
which performs the reduction using %, outperforms the Barrett implementation by a speedup of 1.27x.
This is likely because the plaintext value is not considerably larger than Q, meaning that for the
tests performed, the Barrett reduction was not able to outperform the standard operation. For
operations that involve multiplications, however, this is unlikely to hold. In regards to
vectorization, it can be observed that both of the auto-vectorized implementations were not able to
(a) CADD operation

(b) CSUB operation

Figure 5.3: CADD/CSUB cycles to perform for different p.

vectorize the instruction, achieving exactly the same performance as their original scalar
counterparts. More importantly, however, the "CADD Barrett vect" implementation was able to outperform
all the others for large values of p, which, as mentioned before, is because for smaller p values the
overhead of the vector operations prevents them from being efficient. A speedup of 3.95x was obtained
when comparing with the best scalar implementation at p = 1024. In Figure 5.3(b), similar results were
obtained for the CSUB operation, with a speedup of 5.01x for the vector implementation. In Table 5.3,
the summarized speedups of the different CADD implementations are presented.

Scalar Vectorial
Q = Default Q = 500 Q = Default
while reduct Barrett % reduct while reduct while reduct auto Barrett auto Barrett
0.45x 0.79x 1.00x 0.02x 0.44x 0.79x 3.95x

Table 5.3: Speedup values for p = 1024, compared to ”CADD % reduct”.

Scalar Vectorial
while reduct Barrett Barrett
0.31x 1.00x 5.01x

Table 5.4: Speedup values for p = 1024 and Q = Default, compared to ”CSUB Barrett”.

5.4 CMULT operation

In Figure 5.4, the number of cycles needed to perform the operation versus the value of p is presented
for different implementations of the CMULT operation.

Figure 5.4: CMULT cycles to perform for different p.

The results obtained show that for the CMULT operation, the implementation using the Barrett reduction
slightly outperforms the % operation. As mentioned in the CADD Section, this is likely because for
values of considerable size, the Barrett reduction is faster. In regards to the vector implementation,
it outperforms the scalar operation, although not by a large margin, with a speedup of 1.49x. This is
likely due to the need for auxiliary arrays, or to the scalar algorithm not porting easily to a vector
implementation, which made the operation much slower than expected. It should also be noted that,
compared to the previous operations, CMULT has a much longer execution time, being over 60 times more
time consuming than the scalar HADD operation.

5.5 HMULT operation

The HMULT operation should yield similar results to the CMULT operation, given their similarities,
although since the final result is a three-polynomial ciphertext and the operation is slightly more
complex, slower execution could be expected. The values obtained are shown in Figure 5.5.

Figure 5.5: HMULT cycles to perform for different p.

As seen for the CMULT operation, the Barrett reduction again slightly outperforms the % implementation.
In regards to the vectorized implementation, the same underwhelming results are observed, with a
speedup of 1.20x, caused by the same hindrances stated for the CMULT implementation.

5.6 CWM operation

The CWM operation essentially substitutes the CMULT and HMULT operations. It is the coefficient-wise
multiplication of the values, and produces the result of the multiplication after the ciphertext is
converted back with the INTT operation. If this operation is considerably faster than the
polynomial-multiplication-based ones, then the NTT approach proves viable, at least for the chosen
parameters. In Figure 5.6, the final results are presented.

Figure 5.6: CWM cycles to perform for different p.

When it comes to the scalar implementations, it can firstly be seen that the reduction based on
repeated subtraction of Q is hugely inefficient, because the CWM operation multiplies values,
producing results much larger than Q; it has a speedup of 0.002x relative to the Barrett reduction. It
is also shown that the Barrett reduction is considerably faster than the % reduction when the
bottleneck is exclusively the reduction, with the Barrett reduction obtaining a speedup of 1.52x. This
suggests that for CMULT and HMULT, the main bottleneck might be the auxiliary operations rather than
the reduction itself. As for the vector implementation, it obtains a speedup of 8.87x over the best
scalar algorithm.

Scalar Vectorial
while reduct Barrett % reduct Barrett
0.002x 1.00x 0.66x 8.87x

Table 5.5: Speedup values for p = 1024 and Q = Default, compared to ”CWM Barrett”.

5.7 NTT/INTT operations

With the CWM shown to outperform the CMULT and HMULT operations, if the NTT and INTT operations can be
performed in reasonable time, their combination will achieve speedups over the previously implemented
polynomial multiplication operations. In Figure 5.7, the benchmarks for the NTT operation are
presented.

Figure 5.7: NTT cycles to perform for p = 1024.

In this figure, different implementations of the NTT Cooley-Tukey algorithm can be observed. On the
scalar front, the algorithm that uses successive subtractions of Q to reduce the results majorly
under-performs compared to the Barrett reduction. This is to be expected, since the reduction applies
to a product of values, ψ and V , which is much larger than Q. The speedup of this implementation is
0.008x compared to the Barrett version. In regards to "CT Barrett vect", it underperforms its scalar
counterpart, with a speedup of 0.65x. Similar results were obtained for the INTT operation, with the
standard vector implementation being slower than the scalar one, as shown in Figure 5.8, with a
speedup of 0.77x obtained.

Figure 5.8: INTT cycles to perform for p = 1024.

To understand why both of these implementations are slower, a breakdown of the Cooley-Tukey algorithm
by stage was performed. The result of this test is presented in Figure 5.9.

Figure 5.9: NTT Cooley-Tukey breakdown by stages.

From analysing the figure, it becomes apparent that the stages contributing most to the slower
execution time of the Barrett vect operation are the later ones, with most of the execution time
concentrated in the last 2-3 stages of the function. This is caused by the scalar algorithm not
translating easily to a vector implementation: for the first stages the VL is fully utilized, with the
first two stages running at VL = 256, but with each subsequent stage the VL is halved, resulting in an
exponential increase in execution time. In the last stage, for example, the VL is one, meaning the
operation executes in a scalar manner using vector operations, which are considerably slower than
their scalar counterparts. To achieve the speedup observed for the CT Barrett vect mask hybrid
implementation, the later stages of the previous algorithm had to be changed. For the first stages,
where the VL is acceptable, CT Barrett vect is used; as soon as the VL drops below a threshold, the
masked implementation takes over, which maintains a constant VL of 128, half of the maximum possible
value. In Figure 5.10, the breakdown of the different stages for the CT Barrett vect mask hybrid is
presented.

Figure 5.10: NTT Cooley-Tukey breakdown by stages.

As can be observed, there is a sharp increase at stage m = 4, which corresponds to the switching of
the algorithm. Afterwards, the different stages take the same time to perform, due to the VL being
fixed at 128. With this implementation, a speedup of 1.81x over the Barrett reduction scalar algorithm
is obtained. In Tables 5.6 and 5.7, the summarized speedups of the different NTT and INTT
implementations are presented, respectively.

Scalar Vectorial
while reduct Barrett Barrett Barrett mask hybrid
0.008x 1.00x 0.65x 1.81x

Table 5.6: Speedup values for p = 1024 and Q = Default, compared to ”CT Barrett”.

Scalar Vectorial
Barrett Barrett
1.00x 0.77x

Table 5.7: Speedup values for p = 1024 and Q = Default, compared to ”GS Barrett”.

5.8 Other operations

Of the implemented operations yet to be presented, there are the Relinearization and Rescaling
operations on the server side, and the encrypt and decrypt operations on the client side. In regards
to the Rescale operation, as mentioned in Chapter 4, the division intrinsic is not implemented and was
substituted by the "vadd" intrinsic; however, the results can still be used to analyse the best-case
implementation of the vector operation.

(a) Relinearize operation

(b) Rescale operation

Figure 5.11: Relinearize/Rescale speedup values for different p.

In Figure 5.11(a), it can be seen that the vector implementation was not able to achieve a speedup
over the corresponding scalar Barrett implementation, reaching 0.83x. It can be seen, however, that
the speedup grows with p, which means that for a large enough p, the vector implementation might be
able to outperform the scalar one. In regards to the Rescale operation, Figure 5.11(b) shows that once
again the auto-vectorization approach was unsuccessful. The vector implementation, however, achieved a
major speedup of 15.51x, which is likely due to the way the division intrinsic was implemented (or
rather not implemented) in the RVV version being used.

5.9 General speedups

Finally, with all the previous tests performed on the different operations, a summarized table can be
obtained for p = 1024 and Q = 12289. In Table 5.8, the number of cycles for the best scalar and vector
implementations is presented.

Table 5.8: Speedups for the different operations.

Operation Scalar Version Vectorial Version Speedup


HADD 27336.5 cycles 2235.5 cycles 12.23x
CADD 22257 cycles 5630 cycles 3.95x
HMULT 2556279 cycles 2128319 cycles 1.20x
CMULT 1860984 cycles 1249534 cycles 1.49x
NTT 383978 cycles 212469.5 cycles 1.81x
INTT 690770.5 cycles 898632.5 cycles 0.77x
CWM 52831 cycles 5958 cycles 8.87x
HSUB 25766 cycles 3134 cycles 8.22x
CSUB 28167 cycles 5617 cycles 5.01x
Relinearize 2413243 cycles 2922932 cycles 0.83x
Rescale 26838.5 cycles 1730 cycles 15.51x

In this table, we can see that for the most part, the operations obtained considerable speedups in
their vector implementations when compared to the best possible scalar versions. Furthermore, it
should be noted that the Barrett reduction was an essential part of the acceleration process,
resulting in speedup increases for several of the functions, e.g. the CWM operation.

5.10 RNS representation test

As mentioned before, many of the operations were implemented with support for the RNS number system.
In this Section, a small example is shown for the HADD operation using the values q1 = 12289,
q2 = 12289 and q3 = 10, which results in Q = 1510195210. These values are obviously not co-prime, and
therefore correct decoding would not be possible; however, since the underlying HE operation is not
concerned with the chosen coefficient modulus (it is a limb-wise addition of the polynomials), these
values can still be used as a benchmark. In Figure 5.12, the implementations for different values of p
are presented.
It can be observed that regardless of the value of p, the RNS system under-performs compared to the
standard implementation, with a speedup of 0.29x for p = 1024. However, what has to

Figure 5.12: HADD RNS performance comparison with different values of p.

be understood is that the value of Q chosen in this example, although large, is still an unsigned
32-bit integer, which means the processor can perform the operation just as quickly as for the smaller
values chosen, hence the speedup being around one third of the normal implementation. However, if the
value of Q were much larger than the processor's standard word size, not only would the processor be
unable to perform the operation without alterations, it would also be slower than the RNS counterpart.
This means that for a standard 32-bit word-size processor, RNS is the only possible way to operate on
larger values of Q.
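The limb-wise character of the RNS HADD tested above can be sketched as follows. The moduli here are small co-prime toys rather than the benchmark values, purely for illustration:

```c
#include <stdint.h>

#define LIMBS 3

/* RNS addition: each residue channel is an independent modular addition,
 * so coefficients of a large Q split into LIMBS word-sized operations. */
void rns_add(const uint64_t a[LIMBS], const uint64_t b[LIMBS],
             const uint64_t q[LIMBS], uint64_t out[LIMBS])
{
    for (int i = 0; i < LIMBS; i++) {
        uint64_t s = a[i] + b[i];
        out[i] = s >= q[i] ? s - q[i] : s;
    }
}
```

Because every limb fits in a machine word, this is what lets a 32-bit processor operate on a Q wider than its word size.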
In this chapter, the different implemented operations of the CKKS scheme were tested and assessed for
different parameters. It was found that for some operations, such as CADD, different values of Q can
result in much longer execution times. Furthermore, it was also found that for some operations the
Barrett algorithm cannot outperform the standard % reduction, such as for the CADD operations, while
for others, such as the CWM, it provides a considerable speedup, due to the larger size of the results
involved. For all operations, a bigger p resulted in slower execution times; however, the speedup of
the vector implementations over the scalar ones also generally increases with p. In the next chapter,
conclusions based on the previous chapters are presented, and some possible improvements are discussed
in the Future Work Section.

Conclusions
6

Contents
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

In this chapter, the concluding remarks of the dissertation are presented, in regards to the methods
developed to vectorize the CKKS FHE scheme on RISC-V. Furthermore, some potential future avenues for
development are mentioned, both for operations not yet implemented and for possible further
acceleration of the developed implementations.

6.1 Conclusions

The present dissertation reports the development of a RISC-V HE library for the CKKS scheme, using the RVV extension. This development entailed the implementation of multiple functions, which can generally be classified as client-side, run on the client's system, and server-side, the computationally intensive operations run on the cloud server. Many of these operations were completely implemented; in particular, most of the server-side operations are fully functional. Nevertheless, not all the operations were implemented in full. With regard to the client-side operations, many require the introduction of an error term, as can be seen in Section 2.2.1. The error generation functions can be implemented in different ways, each with its own advantages and disadvantages, and since exploring them further was outside the scope of this dissertation, they were not developed. Nevertheless, all the functions that require an error term accept, as an input, a function for this purpose, meaning that should an author create a specific error generation function, it can be embedded into the library. Of the server-side operations, the ones not implemented were the rotation and bootstrapping operations.
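As an illustration of what such a plug-in error-generation interface might look like, the sketch below threads a user-supplied sampler through a routine that adds noise to a polynomial. The type name `error_sampler_t`, the ternary placeholder sampler, and `add_error` are hypothetical names, not taken from the library, and `rand()` is only a stand-in for a cryptographically secure source.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical callback type: fills `out` with `n` small error samples.
 * Library functions that need noise would take a pointer of this type,
 * so a user-supplied sampler (e.g. a discrete Gaussian) can be plugged in. */
typedef void (*error_sampler_t)(int64_t *out, size_t n);

/* Placeholder sampler: centered ternary distribution over {-1, 0, 1}.
 * A real deployment would replace rand() with a secure source. */
static void ternary_sampler(int64_t *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = (int64_t)(rand() % 3) - 1;
}

/* Sketch of an entry point that threads the sampler through:
 * adds one error sample to each coefficient of `poly`. */
static void add_error(int64_t *poly, size_t n, error_sampler_t sample) {
    int64_t e[4096];
    if (n > 4096) return;           /* keep the sketch simple */
    sample(e, n);
    for (size_t i = 0; i < n; i++)
        poly[i] += e[i];
}
```

With this shape, swapping in a different noise distribution only requires passing a different function pointer, without touching the encryption code itself.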
Furthermore, many of the implemented operations were also vectorized using the RVV intrinsics, resulting in considerable speedups. For example, the HADD operation, which is a relatively basic function, obtained a speedup of 12.23x when compared to the best scalar implementation. Considerable speedups were obtained for most of the operations tested, although some, such as the CMULT/HMULT and Relinearize operations, fell short of what would be expected. These speedups are summarized in Table 5.8. With regard to acceleration techniques, many were attempted. For example, Barrett reduction was used extensively for Plaintext operations, as well as for those involving multiplications; without it, many of the speedup values could not have been achieved. Furthermore, some specific techniques, such as the previously mentioned overflow technique, could be used to avoid changes in the data structures across the different operations, with a main focus on keeping operations in the unsigned domain. Among the vectorized operations, there were also those that could not be directly translated from the scalar algorithm, since a direct translation would be extremely slow. Of note is the Cooley-Tukey NTT, shown in Section 5.7, whose speedup became mostly dependent on the last stages, where VL was smaller than 16. However, by using a hybrid solution, a speedup of 1.81x was attained. Some operations, however, did not receive as much attention in the vectorization. For example, the CMULT and HMULT operations had a very basic vector implementation, because both can be accelerated through NTT-domain operations.
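To illustrate why Barrett reduction pays off for multiplication-heavy operations, the sketch below replaces the hardware division in `x % q` with a precomputed reciprocal and one high multiplication. The constant 2^62, the struct layout, and the function names are illustrative choices, not the library's actual implementation; the sketch assumes a modulus below 2^31 and inputs below 2^62 (e.g. a product of two 31-bit residues).

```c
#include <stdint.h>

/* Barrett reduction sketch for x < 2^62 modulo q < 2^31.
 * mu = floor(2^62 / q) is precomputed once per modulus, so each
 * reduction costs a multiplication instead of a division. */
typedef struct {
    uint64_t q;
    uint64_t mu;   /* floor(2^62 / q) */
} barrett_t;

static barrett_t barrett_init(uint64_t q) {
    barrett_t b = { q, (uint64_t)(((__uint128_t)1 << 62) / q) };
    return b;
}

static uint64_t barrett_reduce(uint64_t x, const barrett_t *b) {
    /* quotient estimate: floor(x * mu / 2^62), at most slightly below x/q */
    uint64_t qhat = (uint64_t)(((__uint128_t)x * b->mu) >> 62);
    uint64_t r = x - qhat * b->q;
    while (r >= b->q) r -= b->q;   /* at most a couple of corrections */
    return r;
}
```

For an addition, the input already fits in one conditional subtraction, which is why a precomputed-reciprocal scheme only helps once a wide product has to be reduced.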
Finally, there was also an attempt to use the auto-vectorization tool provided by clang. However, this approach yielded poor results: when tested on the different operations, it was only able to vectorize the implementations of the simpler ones, i.e., the HADD and HSUB operations. More complex operations resulted in a warning that the loop had not been vectorized. This is probably a consequence of the auto-vectorization tool being unable to vectorize loops that contain another loop inside them. This constitutes a considerable problem, since most of the operations require nested loops, which means that, for the majority of the operations, the auto-vectorization approach was unsuccessful. It should also be noted that, for the operations where the vectorization was successful, the speedup obtained was slightly lower than that of the hand-written vector implementation, which means that this tool still has room for improvement.
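The contrast can be sketched with two hypothetical loops in the style described above: a flat element-wise loop of the kind auto-vectorizers typically handle, and the nested, per-limb RNS-style loop shape of the kind that triggered the not-vectorized warning. The function names and data layouts are illustrative only, not the library's API.

```c
#include <stdint.h>
#include <stddef.h>

/* Flat loop in the style of HADD: each iteration is independent,
 * which is the form clang's auto-vectorizer usually handles. */
static void hadd_flat(uint64_t *r, const uint64_t *a, const uint64_t *b,
                      size_t n, uint64_t q) {
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        r[i] = (s >= q) ? s - q : s;   /* branch-free modular add */
    }
}

/* Nested loop in the style of a per-limb RNS operation: the inner
 * loop inside the outer one is the pattern that commonly defeats
 * auto-vectorization, even though each body is just as simple. */
static void hadd_rns(uint64_t *r, const uint64_t *a, const uint64_t *b,
                     size_t limbs, size_t n, const uint64_t *q) {
    for (size_t l = 0; l < limbs; l++)
        for (size_t i = 0; i < n; i++) {
            uint64_t s = a[l * n + i] + b[l * n + i];
            r[l * n + i] = (s >= q[l]) ? s - q[l] : s;
        }
}
```

Both routines compute the same modular additions; only the loop structure differs, which is what makes the hand-written RVV implementations necessary for the nested case.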

6.2 Future Work

With regard to future work, since this dissertation describes an ongoing implementation of a vector CKKS library for RISC-V, there are a few different possibilities that could prove interesting. The main avenues that were not explored, and that could be of interest, are presented as follows:
Implementation of the missing and auxiliary operations: As mentioned before, some of the operations were not completely implemented. Firstly, since the server-side operations were the main focus of the dissertation, many of the client-side operations were not implemented, or were only partially implemented. Operations such as encode and decode, which are considerably more complex, could be implemented not only to further establish the correctness of the library, but also to give further incentive to its utilization. The client-side operations that require specific error generation, or random number generation, should have dedicated functions developed for that purpose. On the server side, the rotation and bootstrapping operations could also be developed. These operations are of interest since the former can be used to accelerate certain programs, and the latter is the basis for true FHE; without it, any implementation remains LHE. Finally, some auxiliary operations that could make the library easier to use could be developed, ranging from checking whether the introduced parameters are secure, to cross-support for the OpenFHE library.
Development of acceleration for different ISAs: This could also be considered part of the previous topic; however, since it is somewhat separate from the development process that was performed, it is presented separately. Since all the functions were accelerated for RISC-V, it could also be interesting to, in the future, develop them for x86 and ARM systems. This would be less of a new concept, given that accelerations using vector instructions have already been developed for both, as discussed in Section 2.3.4; however, some speedups over them could potentially be attained.
Further acceleration of operations: With regard to further accelerating the already developed operations, some interesting avenues are available. For instance, although most of the implemented functions have basic support for RNS, that is, most operations can be executed polynomial by polynomial of the RNS system, some operations work slightly differently than in the non-RNS domain and were not implemented. In particular, both the Rescaling and Relinearizing operations require a slightly more complex implementation in RNS. Furthermore, by representing the polynomials in PMNS, potential speedups for some values of Q could be obtained. Cache analysis could also prove a possible way to accelerate these operations: a reordering of the ciphertext structure might be advantageous for certain cache specifications, or the different loops could be accelerated by traversing them in a different manner (e.g., using a block approach [58]). Finally, other schemes could also be implemented; by doing so, a technique called scheme switching could be used to change schemes midway through an application, should a specific portion be faster to perform in a different scheme.
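As a sketch of the block approach mentioned above, the following hypothetical routine processes a ciphertext in fixed-size chunks so that a second pass reuses data while it is still resident in cache. The block size, the pair of fused operations, and the function name are illustrative assumptions, not the library's code.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative block size; in practice it would be tuned to the
 * target's cache geometry. */
#define BLOCK 256

/* Cache-blocking sketch: modular addition followed by multiplication
 * by a constant, applied block by block so the second pass finds the
 * block still cached. Assumes q < 2^32 and c < q to avoid overflow. */
static void fused_add_mulconst(uint64_t *r, const uint64_t *a,
                               const uint64_t *b, uint64_t c,
                               uint64_t q, size_t n) {
    for (size_t base = 0; base < n; base += BLOCK) {
        size_t end = base + BLOCK < n ? base + BLOCK : n;
        /* first pass over the block: modular addition */
        for (size_t i = base; i < end; i++) {
            uint64_t s = a[i] + b[i];
            r[i] = (s >= q) ? s - q : s;
        }
        /* second pass reuses the same block while it is still hot */
        for (size_t i = base; i < end; i++)
            r[i] = (r[i] * c) % q;
    }
}
```

The same reordering applies unchanged when each pass is an RVV loop, since blocking only changes the traversal order, not the arithmetic.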
Security testing for the different operations: As a final note, the security of the implementation should also be tested, to make sure that all the functions are secure from an implementation standpoint with regard to timing and power attacks.
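One small ingredient of such hardening can be sketched as a branchless modular addition, in which the usual data-dependent branch `if (s >= q) s -= q;` is replaced by an arithmetic mask, so the executed instruction sequence does not depend on the operand values. This illustrates the idea only; it is not, by itself, a full defence against timing attacks, let alone power attacks.

```c
#include <stdint.h>

/* Constant-time conditional subtraction: `mask` is all-ones exactly
 * when s >= q, so `q & mask` subtracts q without branching.
 * Assumes a, b < q and q < 2^63, so a + b cannot overflow. */
static uint64_t ct_mod_add(uint64_t a, uint64_t b, uint64_t q) {
    uint64_t s = a + b;
    uint64_t mask = (uint64_t)-(int64_t)(s >= q);  /* 0 or ~0 */
    return s - (q & mask);
}
```

Auditing each primitive for this kind of branch-freedom, and measuring execution time across inputs, would be the starting point of the security testing proposed here.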

Bibliography

[1] Intro to homomorphic encryption. Accessed in January 2023. [Online]. Available: https://www.youtube.com/watch?v=SEBdYXxijSo

[2] Tutorial 2: Working with approximate numbers. Accessed in January 2023. [Online]. Available: https://github.com/OpenMined/TenSEAL/blob/main/tutorials%2FTutorial%202%20-%20Working%20with%20Approximate%20Numbers.ipynb

[3] Semidynamics, "Avispado - 64-bit in-order RISC-V customisable core," 2023. [Online]. Available: https://semidynamics.com/products/avispado

[4] M. Albrecht, M. Chase, H. Chen, J. Ding, S. Goldwasser, S. Gorbunov, S. Halevi, J. Hoffstein, K. Laine, K. Lauter, S. Lokam, D. Micciancio, D. Moody, T. Morrison, A. Sahai, and V. Vaikuntanathan, "Homomorphic encryption security standard," HomomorphicEncryption.org, Toronto, Canada, Tech. Rep., November 2018.

[5] L. Wang, G. Von Laszewski, A. Younge, X. He, M. Kunze, J. Tao, and C. Fu, "Cloud computing: a perspective study," New Generation Computing, vol. 28, pp. 137–146, 2010.

[6] A. T. Tunggal, "What are cloud leaks?: UpGuard," Feb 2023. [Online]. Available: https://www.upguard.com/blog/what-are-cloud-leaks

[7] [Online]. Available: https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164/subpart-C/section-164.312

[8] [Online]. Available: https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html

[9] Arm Ltd., "Neon – Arm®." [Online]. Available: https://www.arm.com/technologies/neon

[10] RISC-V, "riscv/riscv-crypto: RISC-V cryptography extensions standardisation work," 2023. [Online]. Available: https://github.com/riscv/riscv-crypto

[11] C. Gentry, "Fully homomorphic encryption using ideal lattices," in Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, ser. STOC '09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 169–178. [Online]. Available: https://doi.org/10.1145/1536414.1536440

[12] D. Micciancio, "Generalized compact knapsacks, cyclic lattices, and efficient one-way functions," Computational Complexity, Dec 2007. [Online]. Available: https://link.springer.com/article/10.1007/s00037-007-0234-9

[13] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, "TFHE: Fast fully homomorphic encryption library," August 2016. [Online]. Available: https://tfhe.github.io/tfhe/

[14] Z. Brakerski and V. Vaikuntanathan, "Efficient fully homomorphic encryption from (standard) LWE," Cryptology ePrint Archive, Paper 2011/344, 2011. [Online]. Available: https://eprint.iacr.org/2011/344

[15] J. Fan and F. Vercauteren, "Somewhat practical fully homomorphic encryption," Cryptology ePrint Archive, Paper 2012/144, 2012. [Online]. Available: https://eprint.iacr.org/2012/144

[16] J. H. Cheon, A. Kim, M. Kim, and Y. Song, "Homomorphic encryption for arithmetic of approximate numbers," Cryptology ePrint Archive, Paper 2016/421, 2016. [Online]. Available: https://eprint.iacr.org/2016/421

[17] K. Kluczniak, "NTRU-ν-um: Secure fully homomorphic encryption from NTRU with small modulus," in Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, 2022, pp. 1783–1797.

[18] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, "Fully homomorphic encryption without bootstrapping," Cryptology ePrint Archive, Paper 2011/277, 2011. [Online]. Available: https://eprint.iacr.org/2011/277

[19] Z. Brakerski, "Fully homomorphic encryption without modulus switching from classical GapSVP," Cryptology ePrint Archive, Paper 2012/078, 2012. [Online]. Available: https://eprint.iacr.org/2012/078

[20] L. Ducas and D. Micciancio, "FHEW: Bootstrapping homomorphic encryption in less than a second," Cryptology ePrint Archive, Paper 2014/816, 2014. [Online]. Available: https://eprint.iacr.org/2014/816

[21] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, "Faster fully homomorphic encryption: Bootstrapping in less than 0.1 seconds," in Advances in Cryptology – ASIACRYPT 2016, J. H. Cheon and T. Takagi, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2016, pp. 3–33.

[22] Homomorphic encryption for arithmetic of approximate numbers. Accessed in December 2022. [Online]. Available: https://github.com/snucrypto/HEAAN

[23] J. H. Cheon, K. Han, A. Kim, M. Kim, and Y. Song, "Bootstrapping for approximate homomorphic encryption," Cryptology ePrint Archive, Paper 2018/153, 2018. [Online]. Available: https://eprint.iacr.org/2018/153

[24] J. A. Navas, B. Dutertre, and I. A. Mason, "Verification of an optimized NTT algorithm," in Software Verification: 12th International Conference, VSTTE 2020, and 13th International Workshop, NSV 2020, Los Angeles, CA, USA, July 20–21, 2020, Revised Selected Papers. Berlin, Heidelberg: Springer-Verlag, 2020, pp. 144–160. [Online]. Available: https://doi.org/10.1007/978-3-030-63618-0_9

[25] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.

[26] Z. Cao, R. Wei, and X. Lin, "A fast modular reduction method," Cryptology ePrint Archive, 2014.

[27] C. H. Lim, H. S. Hwang, and P. J. Lee, "Fast modular reduction with precomputation," in Proceedings of Korea-Japan Joint Workshop on Information Security and Cryptology (JWISC'97). Citeseer, 1997, pp. 65–79.

[28] H. L. Garner, "The residue number system," in Papers Presented at the March 3-5, 1959, Western Joint Computer Conference, ser. IRE-AIEE-ACM '59 (Western). New York, NY, USA: Association for Computing Machinery, 1959, pp. 146–153. [Online]. Available: https://doi.org/10.1145/1457838.1457864

[29] L.-S. Didier, J.-M. Robert, F. Y. Dosso, and N. El Mrabet, "A software comparison of RNS and PMNS," in 2022 IEEE 29th Symposium on Computer Arithmetic (ARITH), 2022, pp. 86–93.

[30] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[31] W. M. Gentleman and G. Sande, "Fast Fourier transforms: for fun and profit," in Proceedings of the November 7-10, 1966, Fall Joint Computer Conference, 1966, pp. 563–578.

[32] S. Meftah, B. H. M. Tan, K. M. M. Aung, L. Yuxiao, L. Jie, and B. Veeravalli, "Towards high performance homomorphic encryption for inference tasks on CPU: An MPI approach," Future Generation Computer Systems, vol. 134, pp. 13–21, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X22001145

[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., vol. 25. Curran Associates, Inc., 2012. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

[34] T. Morshed, M. M. A. Aziz, and N. Mohammed, "CPU and GPU accelerated fully homomorphic encryption," 2020.

[35] S. Fan, Z. Wang, W. Xu, R. Hou, D. Meng, and M. Zhang, "TensorFHE: Achieving practical computation on encrypted data using GPGPU," in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 922–934.

[36] M. S. Riazi, K. Laine, B. Pelton, and W. Dai, "HEAX: An architecture for computing on encrypted data," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 1295–1309.

[37] R. Paludo and L. Sousa, "NTT architecture for a Linux-ready RISC-V fully-homomorphic encryption accelerator," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 7, pp. 2669–2682, 2022.

[38] A. Feldmann, N. Samardzic, A. Krastev, S. Devadas, R. Dreslinski, K. Eldefrawy, N. Genise, C. Peikert, and D. Sanchez, "F1: A fast and programmable accelerator for fully homomorphic encryption (extended version)," 2021.

[39] N. Samardzic, A. Feldmann, A. Krastev, N. Manohar, N. Genise, S. Devadas, K. Eldefrawy, C. Peikert, and D. Sanchez, "CraterLake: a hardware accelerator for efficient unbounded computation on encrypted data," in Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 173–187.

[40] J. Kim, G. Lee, S. Kim, G. Sohn, M. Rhu, J. Kim, and J. H. Ahn, "ARK: Fully homomorphic encryption accelerator with runtime data generation and inter-operation key reuse," in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, Oct 2022. [Online]. Available: https://doi.org/10.1109/MICRO56248.2022.00086

[41] J. Kim, S. Kim, J. Choi, J. Park, D. Kim, and J. H. Ahn, "SHARP: A short-word hierarchical accelerator for robust and practical fully homomorphic encryption," in Proceedings of the 50th Annual International Symposium on Computer Architecture, ser. ISCA '23. New York, NY, USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/10.1145/3579371.3589053

[42] A. Viand, P. Jattke, and A. Hithnawi, "SoK: Fully homomorphic encryption compilers," in 2021 IEEE Symposium on Security and Privacy (SP), 2021, pp. 1092–1108.

[43] A. Viand, P. Jattke, M. Haller, and A. Hithnawi, "HECO: Automatic code optimizations for efficient fully homomorphic encryption," 2022. [Online]. Available: https://arxiv.org/abs/2202.01649

[44] F. Boemer, S. Kim, G. Seifu, F. D. M. de Souza, and V. Gopal, "Intel HEXL: Accelerating homomorphic encryption with Intel AVX512-IFMA52," 2021.

[45] H. Becker, V. Hwang, M. J. Kannwischer, B.-Y. Yang, and S.-Y. Yang, "Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1," Cryptology ePrint Archive, Paper 2021/986, 2021. [Online]. Available: https://eprint.iacr.org/2021/986

[46] RISC-V, "riscv/riscv-v-spec: Working draft of the proposed RISC-V V vector extension," 2023. [Online]. Available: https://github.com/riscv/riscv-v-spec

[47] Homomorphic encryption standardization. Accessed in January 2023. [Online]. Available: https://homomorphicencryption.org/introduction/

[48] Simple Encrypted Arithmetic Library (SEAL). Accessed in December 2022. [Online]. Available: https://github.com/microsoft/SEAL

[49] zlib software library. Accessed in January 2023. [Online]. Available: https://zlib.net/

[50] Zstandard library. Accessed in January 2023. [Online]. Available: https://github.com/facebook/zstd

[51] A. Al Badawi, J. Bates, F. Bergamaschi, D. B. Cousins, S. Erabelli, N. Genise, S. Halevi, H. Hunt, A. Kim, Y. Lee et al., "OpenFHE: Open-source fully homomorphic encryption library," in Proceedings of the 10th Workshop on Encrypted Computing & Applied Homomorphic Cryptography, 2022, pp. 53–63.

[52] L. Ducas, "lducas/FHEW." [Online]. Available: https://github.com/lducas/FHEW#a-fully-homomorphic-encryption-library

[53] TFHE, "tfhe/tfhe: TFHE: Fast fully homomorphic encryption library over the torus." [Online]. Available: https://github.com/tfhe/tfhe

[54] T. Rodrigues, "CKKS-RISC-VV-Library," 2023. [Online]. Available: https://github.com/TiagoRodrigues1111/CKKS-RISC-VV-Library

[55] D. Korn and J. J. Lambiotte, "Computing the fast Fourier transform on a vector computer," Mathematics of Computation, vol. 33, pp. 977–992, 1979. [Online]. Available: https://api.semanticscholar.org/CorpusID:37227799

[56] M. C. Pease, "An adaptation of the fast Fourier transform for parallel processing," J. ACM, vol. 15, no. 2, pp. 252–264, Apr 1968. [Online]. Available: https://doi.org/10.1145/321450.321457

[57] [Online]. Available: https://repo.hca.bsc.es/gitlab/epi-public/risc-v-vector-simulation-environment/-/wikis/SDV-Access-Guide#fpga-sdv-for-users

[58] Intel, "Loop optimizations where blocks are required." [Online]. Available: https://www.intel.com/content/www/us/en/developer/articles/technical/loop-optimizations-where-blocks-are-required.html
Appendix A

Proof of Correctness

In this appendix, the correctness of the different CKKS operations is shown on a mathematical basis. Proof that the decryption operation can decrypt a ciphertext is shown in Equation A.1:

\begin{align*}
CT1_1 + CT1_2 \times SK &= PK_1 \times u + e_1 + M + (PK_2 \times u + e_2) \times SK \\
&= (-a \times SK + e) \times u + e_1 + M + (a \times u + e_2) \times SK \\
&= -a \times SK \times u + e \times u + e_1 + M + a \times SK \times u + SK \times e_2 \\
&= M + e \times u + e_1 + e_2 \times SK \tag{A.1}
\end{align*}

which, for small values of the noise $e \times u + e_1 + e_2 \times SK$, results in a correct decryption. In Equation A.2, the proof of correctness for the HADD operation is shown:

\begin{align*}
\mathrm{HADD}(CT1, CT2) &= (CT1_1 + CT2_1;\ CT1_2 + CT2_2) \\
\Rightarrow\ (CT1_1 + CT2_1) &+ (CT1_2 + CT2_2) \times SK \\
&= (PK_1 \times u + e1_1 + M_1 + PK_1 \times u + e2_1 + M_2) + (PK_2 \times u + e1_2 + PK_2 \times u + e2_2) \times SK \\
&= (2 \times PK_1 \times u + e1_1 + e2_1 + M_1 + M_2) + (2 \times PK_2 \times u + e1_2 + e2_2) \times SK \\
&= 2 \times (-a \times SK + e) \times u + e1_1 + e2_1 + M_1 + M_2 + (2 \times a \times u + e1_2 + e2_2) \times SK \\
&= -2 \times a \times SK \times u + 2 \times e \times u + e1_1 + e2_1 + M_1 + M_2 + 2 \times a \times SK \times u + SK \times (e1_2 + e2_2) \\
&= M_1 + M_2 + 2 \times e \times u + e1_1 + e2_1 + (e1_2 + e2_2) \times SK \\
&= M_1 + M_2 + 2 \times e \times u + e_{add_1} + e_{add_2} \times SK \tag{A.2}
\end{align*}

in which, since $e_{add_1} \approx e_1$ and $e_{add_2} \approx e_2$, the noise after the HADD operation is approximately twice the noise introduced by encrypting a ciphertext.
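The identity in Equation A.1 can also be checked numerically on a toy scalar instance, treating each polynomial as a single integer modulo a toy q and ignoring the ring structure and scaling factor. All constants below (a, SK, e, u, e1, e2, q) are illustrative values, not real CKKS parameters.

```c
#include <stdint.h>

/* Euclidean remainder, so negative intermediates land in [0, q). */
static int64_t mod(int64_t x, int64_t q) {
    int64_t r = x % q;
    return r < 0 ? r + q : r;
}

/* Toy scalar instance of Equation A.1:
 * PK = (-a*SK + e, a); CT = (PK1*u + e1 + M, PK2*u + e2);
 * decryption CT1 + CT2*SK should return M plus a small noise term.
 * Returns the residual noise dec - M (mod q), which by A.1 equals
 * e*u + e1 + e2*SK = 2*3 + 1 + 1*7 = 14, independently of M. */
static int64_t toy_decrypt_noise(int64_t M, int64_t q) {
    int64_t a = 12345, SK = 7, e = 2, u = 3, e1 = 1, e2 = 1;
    int64_t pk1 = mod(-a * SK + e, q), pk2 = mod(a, q);
    int64_t ct1 = mod(pk1 * u + e1 + M, q);
    int64_t ct2 = mod(pk2 * u + e2, q);
    int64_t dec = mod(ct1 + ct2 * SK, q);
    return mod(dec - M, q);
}
```

That the residual noise comes out constant across messages mirrors the derivation above: every term containing a cancels, leaving only the message and the small error combination.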

Appendix B

Table of HE parameters

In this appendix, the tables referenced in [4] are presented. For different values of p, and different types of secret key generation, these tables present which parameters should be chosen to achieve a specific level of bit security. The tables also present time-of-attack estimations for three different possible attacks that could be performed on the scheme, as further discussed in [4].

distribution   p       security level   log Q   uSVP    dec     dual
uniform        1024    128              27      132.2   149.3   164.5
                       192              19      199.3   241.6   261.6
                       256              15      262.9   341.1   360.8
               2048    128              53      128.1   137.6   147.6
                       192              37      193.6   215.8   231.4
                       256              29      257.2   297.9   316.6
               4096    128              103     129.1   134.2   141.7
                       192              72      193.8   206.2   217.2
                       256              56      259.2   281.9   296.5
               8192    128              206     128.2   130.7   136.6
                       192              143     192.9   199.3   207.3
                       256              111     258.4   270.8   280.7
               16384   128              413     128.2   129.0   132.7
                       192              286     192.1   195.3   201.4
                       256              222     257.2   263.1   270.6

distribution   p       security level   log Q   uSVP    dec     dual
error          1024    128              27      132.2   149.3   144.5
                       192              19      199.3   241.6   224.0
                       256              15      262.9   341.1   302.3
               2048    128              53      128.1   137.6   134.8
                       192              37      193.6   215.8   206.7
                       256              29      257.2   297.9   281.4
               4096    128              103     129.1   134.2   133.1
                       192              72      193.8   206.2   201.8
                       256              56      259.2   281.9   270.4
               8192    128              206     128.2   130.7   130.1
                       192              143     192.9   199.3   198.5
                       256              111     258.4   270.8   266.6
               16384   128              413     128.2   129.0   130.1
                       192              286     192.1   195.3   196.6
                       256              222     257.2   263.1   265.8

distribution   p       security level   log Q   uSVP    dec     dual
(-1, 1)        1024    128              25      132.6   165.5   142.3
                       192              17      199.9   284.1   222.2
                       256              13      262.6   423.1   296.6
               2048    128              51      128.6   144.3   133.4
                       192              35      193.5   231.9   205.2
                       256              27      257.1   327.8   274.4
               4096    128              101     129.6   137.4   131.5
                       192              70      193.7   213.6   198.8
                       256              54      259.7   295.2   270.6
               8192    128              202     129.8   130.7   128.0
                       192              141     192.9   202.5   196.1
                       256              109     258.3   276.6   263.1
               16384   128              411     128.2   129.5   129.0
                       256              220     257.2   265.8   260.7

Table B.1: Post-quantum security, for different secret key distributions (taken from [4]).
