
F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption (Extended Version)

Axel Feldmann¹*, Nikola Samardzic¹*, Aleksandar Krastev¹, Srini Devadas¹, Ron Dreslinski², Karim Eldefrawy³, Nicholas Genise³, Chris Peikert², Daniel Sanchez¹
¹Massachusetts Institute of Technology   ²University of Michigan   ³SRI International
{axelf, nsamar, alexalex, devadas, sanchez}@csail.mit.edu   {dreslin, cpeikert}@umich.edu   {karim.eldefrawy, nicholas.genise}@sri.com

* A. Feldmann and N. Samardzic contributed equally to this work.
This is an extended version of a paper that will appear in the Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021 [29].

arXiv:2109.05371v2 [cs.CR] 25 Sep 2021

ABSTRACT
Fully Homomorphic Encryption (FHE) allows computing on encrypted data, enabling secure offloading of computation to untrusted servers. Though it provides ideal security, FHE is expensive when executed in software, 4 to 5 orders of magnitude slower than computing on unencrypted data. These overheads are a major barrier to FHE's widespread adoption.
We present F1, the first FHE accelerator that is programmable, i.e., capable of executing full FHE programs. F1 builds on an in-depth architectural analysis of the characteristics of FHE computations that reveals acceleration opportunities. F1 is a wide-vector processor with novel functional units deeply specialized to FHE primitives, such as modular arithmetic, number-theoretic transforms, and structured permutations. This organization provides so much compute throughput that data movement becomes the key bottleneck. Thus, F1 is primarily designed to minimize data movement. Hardware provides an explicitly managed memory hierarchy and mechanisms to decouple data movement from execution. A novel compiler leverages these mechanisms to maximize reuse and schedule off-chip and on-chip data movement.
We evaluate F1 using cycle-accurate simulation and RTL synthesis. F1 is the first system to accelerate complete FHE programs, and outperforms state-of-the-art software implementations by gmean 5,400× and by up to 17,000×. These speedups counter most of FHE's overheads and enable new applications, like real-time private deep learning in the cloud.

[Figure 1: FHE allows a user to securely offload computation to an untrusted server. The client encrypts x and sends Encrypted(x) across the trust barrier; the server computes f on the encrypted data using the F1 FHE accelerator and returns Encrypted(f(x)), which the client decrypts to obtain f(x).]

1 INTRODUCTION
Despite massive efforts to improve the security of computer systems, security breaches are only becoming more frequent and damaging, as more sensitive data is processed in the cloud [43, 69]. Current encryption technology is of limited help, because servers must decrypt data before processing it. Once data is decrypted, it is vulnerable to breaches.
Fully Homomorphic Encryption (FHE) is a class of encryption schemes that address this problem by enabling generic computation on encrypted data. Fig. 1 shows how FHE enables secure offloading of computation. The client wants to compute an expensive function f (e.g., a deep learning inference) on some private data x. To do this, the client encrypts x and sends it to an untrusted server, which computes f on this encrypted data directly using FHE, and returns the encrypted result to the client. FHE provides ideal security properties: even if the server is compromised, attackers cannot learn anything about the data, as it remains encrypted throughout.
FHE is a young but quickly developing technology. First realized in 2009 [33], early FHE schemes were about 10⁹ times slower than performing computations on unencrypted data. Since then, improved FHE schemes have greatly reduced these overheads and broadened its applicability [2, 59]. FHE has inherent limitations (for example, data-dependent branching is impossible, since data is encrypted), so it won't subsume all computations. Nonetheless, important classes of computations, like deep learning inference [17, 25, 26], linear algebra, and other inference and learning tasks [40] are a good fit for FHE. This has sparked significant industry and government investments [4, 9, 23] to widely deploy FHE.
Unfortunately, FHE still carries substantial performance overheads: despite recent advances [15, 25, 26, 61, 66], FHE is still 10,000× to 100,000× slower than unencrypted computation when executed in carefully optimized software. Though this slowdown is large, it can be addressed with hardware acceleration: if a specialized FHE accelerator provides large speedups over software execution, it can bridge most of this performance gap and enable new use cases.
For an FHE accelerator to be broadly useful, it should be programmable, i.e., capable of executing arbitrary FHE computations. While prior work has proposed several FHE accelerators, they do not meet this goal. Prior FHE accelerators [20, 21, 27, 65, 66, 71] target individual FHE operations, and miss important ones that they leave to software. These designs are FPGA-based, so they are small and miss the data movement issues facing an FHE ASIC accelerator. These designs also overspecialize their functional units to specific parameters, and cannot efficiently handle the range of parameters needed within a program or across programs.
In this paper we present F1, the first programmable FHE accelerator. F1 builds on an in-depth architectural analysis of the
characteristics of FHE computations, which exposes the main challenges and reveals the design principles a programmable FHE architecture should exploit.
Harnessing opportunities and challenges in FHE: F1 is tailored to the three defining characteristics of FHE:
(1) Complex operations on long vectors: FHE encodes information using very large vectors, several thousand elements long, and processes them using modular arithmetic. F1 employs vector processing with wide functional units tailored to FHE operations to achieve large speedups. The challenge is that two key operations on these vectors, the Number-Theoretic Transform (NTT) and automorphisms, are not element-wise and require complex dataflows that are hard to implement as vector operations. To tackle these challenges, F1 features specialized NTT units and the first vector implementation of an automorphism functional unit.
(2) Regular computation: FHE programs are dataflow graphs of arithmetic operations on vectors. All operations and their dependences are known ahead of time (since data is encrypted, branches or dependences determined by runtime values are impossible). F1 exploits this by adopting static scheduling: in the style of Very Long Instruction Word (VLIW) processors, all components have fixed latencies and the compiler is in charge of scheduling operations and data movement across components, with no hardware mechanisms to handle hazards (i.e., no stall logic). Thanks to this design, F1 can issue many operations per cycle with minimal control overheads; combined with vector processing, F1 can issue tens of thousands of scalar operations per cycle.
(3) Challenging data movement: In FHE, encrypting data increases its size (typically by at least 50×); data is grouped in long vectors; and some operations require large amounts (tens of MBs) of auxiliary data. Thus, we find that data movement is the key challenge for FHE acceleration: despite requiring complex functional units, in current technology, limited on-chip storage and memory bandwidth are the bottleneck for most FHE programs. Therefore, F1 is primarily designed to minimize data movement. First, F1 features an explicitly managed on-chip memory hierarchy, with a heavily banked scratchpad and distributed register files. Second, F1 uses mechanisms to decouple data movement and hide access latencies by loading data far ahead of its use. Third, F1 uses new, FHE-tailored scheduling algorithms that maximize reuse and make the best out of limited memory bandwidth. Fourth, F1 uses relatively few functional units with extremely high throughput, rather than lower-throughput functional units as in prior work. This reduces the amount of data that must reside on-chip simultaneously, allowing higher reuse.
In summary, F1 brings decades of research in architecture to bear, including vector processing and static scheduling, and combines them with new specialized functional units (Sec. 5) and scheduling algorithms (Sec. 4) to design a programmable FHE accelerator. We implement the main components of F1 in RTL and synthesize them in a commercial 14nm/12nm process. With a modest area budget of 151 mm², our F1 implementation provides 36 tera-ops/second of 32-bit modular arithmetic, 64 MB of on-chip storage, and a 1 TB/s high-bandwidth memory. We evaluate F1 using cycle-accurate simulation running complete FHE applications, and demonstrate speedups of 1,200×–17,000× over state-of-the-art software implementations. These dramatic speedups counter most of FHE's overheads and enable new applications. For example, F1 executes a deep learning inference that used to take 20 minutes in 240 milliseconds, enabling secure real-time deep learning in the cloud.

2 BACKGROUND
Fully Homomorphic Encryption allows performing arbitrary arithmetic on encrypted plaintext values, via appropriate operations on their ciphertexts. Decrypting the resulting ciphertext yields the same result as if the operations were performed on the plaintext values "in the clear."
Over the last decade, prior work has proposed multiple FHE schemes, each with somewhat different capabilities and performance tradeoffs. BGV [14], B/FV [13, 28], GSW [35], and CKKS [17] are popular FHE schemes.* Though these schemes differ in how they encrypt plaintexts, they all use the same data type for ciphertexts: polynomials where each coefficient is an integer modulo Q. This commonality makes it possible to build a single accelerator that supports multiple FHE schemes; F1 supports BGV, GSW, and CKKS.
* These scheme names are acronyms of their authors' last names. For instance, BGV is Brakerski-Gentry-Vaikuntanathan.
We describe FHE in a layered fashion: Sec. 2.1 introduces FHE's programming model and operations, i.e., FHE's interface; Sec. 2.2 describes how FHE operations are implemented; Sec. 2.3 presents implementation optimizations; and Sec. 2.4 performs an architectural analysis of a representative FHE kernel to reveal acceleration opportunities.
For concreteness, we introduce FHE using the BGV scheme, and briefly discuss other FHE schemes in Sec. 2.5.

2.1 FHE programming model and operations
FHE programs are dataflow graphs: directed acyclic graphs where nodes are operations and edges represent data values. Data values are inputs, outputs, or intermediate values consumed by one or more operations. All operations and dependences are known in advance, and data-dependent branching is impossible.
In FHE, unencrypted (plaintext) data values are always vectors; in BGV [14], each vector consists of N integers modulo an integer t. BGV provides three operations on these vectors: element-wise addition (mod t), element-wise multiplication (mod t), and a small set of particular vector permutations.
We stress that this is BGV's interface, not its implementation: it describes unencrypted data, and the homomorphic operations that BGV implements on that data in its encrypted form. In Sec. 2.2 we describe how BGV represents encrypted data and how each operation is implemented.
At a high level, FHE provides a vector programming model with restricted operations where individual vector elements cannot be directly accessed. This causes some overheads in certain algorithms. For example, summing up the elements of a vector is non-trivial, and requires a sequence of permutations and additions.
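To make this restriction concrete, the following is an illustrative sketch in plain Python over an unencrypted list (not FHE code): it sums a vector's elements using only whole-vector rotations and element-wise additions, the same pattern used by the innerSum helper in Listing 2 later.

# Illustrative sketch (not FHE code): summing a vector's elements using only
# whole-vector rotations and element-wise additions, the access pattern FHE forces.
# After log2(N) rotate-and-add steps, every slot holds the total sum.

def rotate(v, amount):
    """Cyclically rotate a list left by `amount` positions."""
    return v[amount:] + v[:amount]

def inner_sum(v):
    """Reduce with log2(len(v)) rotations; len(v) must be a power of two."""
    n = len(v)
    step = 1
    while step < n:
        v = [a + b for a, b in zip(v, rotate(v, step))]
        step *= 2
    return v  # every element now equals the sum of the original vector

if __name__ == "__main__":
    v = list(range(8))            # plaintext stand-in for a ciphertext's slots
    assert inner_sum(v)[0] == sum(range(8))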
Despite these limitations, prior work has devised reasonably efficient implementations of key algorithms, including linear algebra [38], neural network inference [15, 36], logistic regression [39], and genome processing [11]. These implementations are often coded by hand, but recent work has proposed FHE compilers to automate this translation for particular domains, like deep learning [25, 26].
Finally, note that not all data must be encrypted: BGV provides versions of addition and multiplication where one of the operands is unencrypted. Multiplying by unencrypted data is cheaper, so algorithms can trade privacy for performance. For example, a deep learning inference can use encrypted weights and inputs to keep the model private, or use unencrypted weights, which does not protect the model but keeps inputs and inferences private [15].

2.2 BGV implementation overview
We now describe how BGV represents and processes encrypted data (ciphertexts). The implementation of each computation on ciphertext data is called a homomorphic operation. For example, the homomorphic multiplication of two ciphertexts yields another ciphertext that, when decrypted, is the element-wise multiplication of the encrypted plaintexts.
Data types: BGV encodes each plaintext vector as a polynomial with N coefficients mod t. We denote the plaintext space as R_t, so
    a = a_0 + a_1 X + ... + a_{N-1} X^{N-1} ∈ R_t
is a plaintext. Each plaintext is encrypted into a ciphertext consisting of two polynomials of N integer coefficients modulo some Q ≫ t. Each ciphertext polynomial is a member of R_Q.
Encryption and decryption: Though encryption and decryption are performed by the client (so F1 need not accelerate them), they are useful to understand. In BGV, the secret key is a polynomial s ∈ R_Q. To encrypt a plaintext m ∈ R_t, one samples a uniformly random a ∈ R_Q, an error (or noise) e ∈ R_Q with small entries, and computes the ciphertext ct as
    ct = (a, b = a·s + t·e + m).
Ciphertext ct = (a, b) is decrypted by recovering e' = t·e + m = b − a·s mod Q, and then recovering m = e' mod t. Decryption is correct as long as e' does not "wrap around" modulo Q, i.e., its coefficients have magnitude less than Q/2.
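The toy sketch below simply exercises the encryption and decryption equations above with tiny, insecure parameters and simplified non-negative noise; it is a hypothetical illustration, not how BGV is implemented in practice.

# Toy numeric sketch of the BGV encrypt/decrypt equations above.
# Parameters are tiny and the error is non-negative for simplicity; this only
# shows that b - a*s mod Q recovers t*e + m, and that mod t then recovers m.
import random

N, t, Q = 8, 7, 2**20          # toy ring degree, plaintext modulus, ciphertext modulus

def polymul(x, y):
    """Negacyclic product in Z_Q[X]/(X^N + 1)."""
    res = [0] * N
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            k = i + j
            if k < N:
                res[k] = (res[k] + xi * yj) % Q
            else:
                res[k - N] = (res[k - N] - xi * yj) % Q
    return res

def encrypt(m, s):
    a = [random.randrange(Q) for _ in range(N)]
    e = [random.randrange(3) for _ in range(N)]            # small noise
    b = [(ai_s + t * ei + mi) % Q
         for ai_s, ei, mi in zip(polymul(a, s), e, m)]
    return (a, b)

def decrypt(ct, s):
    a, b = ct
    e_prime = [(bi - asi) % Q for bi, asi in zip(b, polymul(a, s))]
    return [x % t for x in e_prime]                        # m = e' mod t

s = [random.randrange(2) for _ in range(N)]                # small secret key
m = [random.randrange(t) for _ in range(N)]
assert decrypt(encrypt(m, s), s) == m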
The security of any encryption scheme relies on the ciphertexts not revealing anything about the value of the plaintext (or the secret key). Without adding the noise term e, the original message m would be recoverable from ct via simple Gaussian elimination. Including the noise term entirely hides the plaintext (under cryptographic assumptions) [49].
As we will see, homomorphic operations on ciphertexts increase their noise, so we can only perform a limited number of operations before the resulting noise becomes too large and makes decryption fail. We later describe noise management strategies (Sec. 2.2.2) to keep this noise bounded and thereby allow unlimited operations.

2.2.1 Homomorphic operations.
Homomorphic addition of ciphertexts ct_0 = (a0, b0) and ct_1 = (a1, b1) is done simply by adding their corresponding polynomials:
    ct_add = ct_0 + ct_1 = (a0 + a1, b0 + b1).
Homomorphic multiplication requires two steps. First, the four input polynomials are multiplied and assembled:
    ct_× = (l2, l1, l0) = (a0·a1, a0·b1 + a1·b0, b0·b1).
This ct_× can be seen as a special intermediate ciphertext encrypted under a different secret key. The second step performs a key-switching operation to produce a ciphertext encrypted under the original secret key s. More specifically, l2 undergoes this key-switching process to produce two polynomials (u1, u0) = KeySwitch(l2). The final output ciphertext is ct_mul = (l1 + u1, l0 + u0).
As we will see later (Sec. 2.4), key-switching is an expensive operation that dominates the cost of a multiplication.
Homomorphic permutations permute the N plaintext values (coefficients) that are encrypted in a ciphertext. Homomorphic permutations are implemented using automorphisms, which are special permutations of the coefficients of the ciphertext polynomials. There are N automorphisms, denoted σ_k(a) and σ_{-k}(a) for all positive odd k < N. Specifically,
    σ_k(a): a_i → (−1)^s · a_{i·k mod N}   for i = 0, ..., N−1,
where s = 0 if i·k mod 2N < N, and s = 1 otherwise. For example, σ_5(a) permutes a's coefficients so that a_0 stays at position 0, a_1 goes from position 1 to position 5, and so on (these wrap around, e.g., with N = 1024, a_205 goes to position 1, since 205 · 5 mod 1024 = 1).
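A small sketch of the index map defined above, checked against the σ_5 example from the text; the helper is purely illustrative and is not F1's hardware implementation.

# Sketch of the automorphism sigma_k defined above: coefficient a_i moves to
# position i*k mod N, negated when i*k mod 2N lands in the upper half.
def automorphism(a, k):
    N = len(a)
    out = [0] * N
    for i, coeff in enumerate(a):
        pos = (i * k) % N
        sign = -1 if (i * k) % (2 * N) >= N else 1
        out[pos] = sign * coeff
    return out

# The sigma_5 example from the text, with N = 1024: a_1 goes to position 5,
# and a_205 wraps around to position 1 (205 * 5 mod 1024 = 1).
N = 1024
a = list(range(N))             # a_i = i, so values double as labels
res = automorphism(a, 5)
assert res[0] == 0 and res[5] == 1 and abs(res[1]) == 205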
To perform a homomorphic permutation, we first compute an automorphism on the ciphertext polynomials: ct_σ = (σ_k(a), σ_k(b)). Just as in homomorphic multiplication, ct_σ is encrypted under a different secret key, requiring an expensive key-switch to produce the final output ct_perm = (u1, σ_k(b) + u0), where (u1, u0) = KeySwitch(σ_k(a)).
We stress that the permutation applied to the ciphertext does not induce the same permutation on the underlying plaintext vector. For example, using a single automorphism and careful indexing, it is possible to homomorphically rotate the vector of the N encrypted plaintext values.

2.2.2 Noise growth and management.
Recall that ciphertexts have noise, which limits the number of operations that they can undergo before decryption gives an incorrect result. Different operations induce different noise growth: addition and permutations cause little growth, but multiplication incurs much more significant growth. So, to a first order, the amount of noise is determined by multiplicative depth, i.e., the longest chain of homomorphic multiplications in the computation.
Noise forces the use of a large ciphertext modulus Q. For example, an FHE program with multiplicative depth of 16 needs Q to be about 512 bits. The noise budget, and thus the tolerable multiplicative depth, grow linearly with log Q.
FHE uses two noise management techniques in tandem: bootstrapping and modulus switching.
Bootstrapping [33] enables FHE computations of unbounded depth. Essentially, it removes noise from a ciphertext without access to the secret key. This is accomplished by evaluating the decryption function homomorphically. Bootstrapping is an expensive procedure that consists of many (typically tens to hundreds) homomorphic operations. FHE programs with a large multiplicative depth can be divided into regions of limited depth, separated by bootstrapping operations.
Even with bootstrapping, FHE schemes need a large noise budget (i.e., a large Q) because (1) bootstrapping is computationally expensive, and a higher noise budget enables less-frequent bootstrapping, and (2) bootstrapping itself consumes a certain noise budget (this is similar to why pipelining circuits hits a performance ceiling: registers themselves add area and latency).
Modulus switching rescales ciphertexts from modulus Q to a modulus Q', which reduces the noise proportionately. Modulus switching is usually applied before each homomorphic multiplication, to reduce its noise blowup.
For example, to execute an FHE program of multiplicative depth 16, we would start with a 512-bit modulus Q. Right before each multiplication, we would switch to a modulus that is 32 bits shorter. So, for example, operations at depth 8 use a 256-bit modulus. Thus, beyond reducing noise, modulus switching reduces ciphertext sizes, and thus computation cost.

2.2.3 Security and parameters.
The dimension N and modulus Q cannot be chosen independently; N/log Q must be above a certain level for sufficient security. In practice, this means that using a wide modulus to support deep programs also requires a large N. For example, with 512-bit Q, N = 16K is required to provide an acceptable level of security, resulting in very large ciphertexts.

2.3 Algorithmic insights and optimizations
F1 leverages two optimizations developed in prior work:
Fast polynomial multiplication via NTTs: Multiplying two polynomials requires convolving their coefficients, an expensive (naively O(N²)) operation. Just like convolutions can be made faster with the Fast Fourier Transform, polynomial multiplication can be made faster with the Number-Theoretic Transform (NTT) [54], a variant of the discrete Fourier transform for modular arithmetic. The NTT takes an N-coefficient polynomial as input and returns an N-element vector representing the input in the NTT domain. Polynomial multiplication can be performed as element-wise multiplication in the NTT domain. Specifically,
    NTT(ab) = NTT(a) ⊙ NTT(b),
where ⊙ denotes component-wise multiplication. (For this relation to hold with N-point NTTs, a negacyclic NTT [49] must be used (Sec. 5.2).)
Because an NTT requires only O(N log N) modular operations, multiplication can be performed in O(N log N) operations by using two forward NTTs, element-wise multiplication, and an inverse NTT. And in fact, optimized FHE implementations often store polynomials in the NTT domain rather than in their coefficient form across operations, further reducing the number of NTTs. This is possible because the NTT is a linear transformation, so additions and automorphisms can also be performed in the NTT domain:
    NTT(σ_k(a)) = σ_k(NTT(a))
    NTT(a + b) = NTT(a) + NTT(b)
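The sketch below illustrates the multiplication identity above for negacyclic polynomials. For clarity it uses a naive O(N²) transform rather than the O(N log N) butterfly networks of real implementations (including F1's NTT unit, Sec. 5.2), and assumes a toy prime q with q ≡ 1 mod 2N.

# Sketch of the NTT identity above: multiply two polynomials in Z_q[X]/(X^N + 1)
# by transforming, multiplying element-wise, and transforming back.
N, q = 8, 257                      # 257 is prime and 257 = 1 mod 2N

def find_psi(n, q):
    """Find a primitive 2n-th root of unity mod q (psi^n = -1 mod q)."""
    for g in range(2, q):
        psi = pow(g, (q - 1) // (2 * n), q)
        if pow(psi, n, q) == q - 1:
            return psi
    raise ValueError("no 2n-th root of unity")

psi = find_psi(N, q)
omega = psi * psi % q              # primitive N-th root, used by the cyclic transform

def ntt(a, invert=False):
    """Negacyclic NTT via twisting by psi^i, then a naive cyclic transform."""
    w = pow(omega, q - 2, q) if invert else omega
    if not invert:
        a = [ai * pow(psi, i, q) % q for i, ai in enumerate(a)]
    out = [sum(ai * pow(w, i * j, q) for i, ai in enumerate(a)) % q
           for j in range(N)]
    if invert:
        n_inv = pow(N, q - 2, q)
        psi_inv = pow(psi, q - 2, q)
        out = [x * n_inv % q * pow(psi_inv, i, q) % q for i, x in enumerate(out)]
    return out

def negacyclic_mul(a, b):          # reference: schoolbook product mod X^N + 1
    res = [0] * N
    for i in range(N):
        for j in range(N):
            k, s = (i + j) % N, -1 if i + j >= N else 1
            res[k] = (res[k] + s * a[i] * b[j]) % q
    return res

a, b = [3, 1, 4, 1, 5, 9, 2, 6], [2, 7, 1, 8, 2, 8, 1, 8]
prod = ntt([x * y % q for x, y in zip(ntt(a), ntt(b))], invert=True)
assert prod == negacyclic_mul(a, b)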
Avoiding wide arithmetic via Residue Number System (RNS) representation: FHE requires wide ciphertext coefficients (e.g., 512 bits), but wide arithmetic is expensive: the cost of a modular multiplier (which takes most of the compute) grows quadratically with bit width in our range of interest. Moreover, we need to efficiently support a broad range of widths (e.g., 64 to 512 bits in 32-bit increments), both because programs need different widths, and because modulus switching progressively reduces coefficient widths.
RNS representation [31] enables representing a single polynomial with wide coefficients as multiple polynomials with narrower coefficients, called residue polynomials. To achieve this, the modulus Q is chosen to be the product of L smaller distinct primes, Q = q_1·q_2 ··· q_L. Then, a polynomial in R_Q can be represented as L polynomials in R_{q_1}, ..., R_{q_L}, where the coefficients in the i-th polynomial are simply the wide coefficients modulo q_i. For example, with W = 32-bit words, a ciphertext polynomial with 512-bit modulus Q is represented as L = log Q / W = 16 polynomials with 32-bit coefficients.
All FHE operations can be carried out under RNS representation, and have either better or equivalent bit-complexity than operating on one wide-coefficient polynomial.
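A brief sketch of RNS decomposition and CRT reconstruction for a single coefficient; the moduli below are small illustrative primes, not the 32-bit moduli F1 uses.

# Sketch of RNS: a wide value mod Q = q1*q2*...*qL is stored as L narrow
# residues, and recovered with the Chinese Remainder Theorem.
from math import prod

qs = [97, 101, 103, 107]                  # pairwise-coprime "word-sized" moduli
Q = prod(qs)

def to_rns(x):
    return [x % qi for qi in qs]          # one narrow residue per modulus

def from_rns(residues):
    x = 0
    for ri, qi in zip(residues, qs):
        Qi = Q // qi
        x += ri * Qi * pow(Qi, -1, qi)    # CRT reconstruction
    return x % Q

x, y = 123_456_789 % Q, 987_654_321 % Q
# Additions and multiplications can be done residue-by-residue:
prod_rns = [(a * b) % qi for a, b, qi in zip(to_rns(x), to_rns(y), qs)]
assert from_rns(prod_rns) == (x * y) % Q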
2.4 Architectural analysis of FHE
We now analyze a key FHE kernel in depth to understand how we can (and cannot) accelerate it. Specifically, we consider the key-switching operation, which is expensive and takes the majority of work in all of our benchmarks.
Listing 1 shows an implementation of key-switching. Key-switching takes three inputs: a polynomial x, and two key-switch hint matrices ksh0 and ksh1. x is stored in RNS form as L residue polynomials (RVec). Each residue polynomial x[i] is a vector of N 32-bit integers modulo q_i. Inputs and outputs are in the NTT domain; only the y[i] polynomials (line 3) are in coefficient form.

1   def keySwitch(x: RVec[L],
2                 ksh0: RVec[L][L], ksh1: RVec[L][L]):
3       y = [INTT(x[i], q_i) for i in range(L)]
4       u0: RVec[L] = [0, ...]
5       u1: RVec[L] = [0, ...]
6       for i in range(L):
7           for j in range(L):
8               xqj = x[i] if i == j else NTT(y[i], q_j)
9               u0[j] += xqj * ksh0[i,j] mod q_j
10              u1[j] += xqj * ksh1[i,j] mod q_j
11      return (u0, u1)

Listing 1: Key-switch implementation. RVec is an N-element vector of 32-bit values, storing a single RNS polynomial in either the coefficient or the NTT domain.

Computation vs. data movement: A single key-switch requires L² NTTs, 2L² multiplications, and 2L² additions of N-element vectors. In RNS form, the rest of a homomorphic multiplication (excluding key-switching) is 4L multiplications and 3L additions (Sec. 2.2), so key-switching is dominant.
However, the main cost at high values of L and N is data movement. For example, at L = 16, N = 16K, each RNS polynomial (RVec) is 64 KB; each ciphertext polynomial is 1 MB; each ciphertext is 2 MB; and the key-switch hints dominate, taking up 32 MB. With F1's compute throughput, fetching the inputs of each key-switching from off-chip memory would demand about 10 TB/s of memory bandwidth. Thus, it is crucial to reuse these values as much as possible.
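The quoted sizes follow directly from the parameters; a quick back-of-the-envelope check, assuming 4-byte coefficients and two L × L hint matrices of RVecs as described above:

# Back-of-the-envelope check of the sizes quoted above, at L = 16, N = 16K,
# with 32-bit (4-byte) RNS coefficients.
L, N, word_bytes = 16, 16 * 1024, 4

rvec = N * word_bytes                   # one residue polynomial (RVec)
ct_poly = L * rvec                      # one ciphertext polynomial
ciphertext = 2 * ct_poly                # a ciphertext is two polynomials
ksh = 2 * L * L * rvec                  # two L x L key-switch hint matrices of RVecs

print(rvec // 1024, "KB per RVec")                         # 64 KB
print(ct_poly // 2**20, "MB per ciphertext polynomial")    # 1 MB
print(ciphertext // 2**20, "MB per ciphertext")            # 2 MB
print(ksh // 2**20, "MB of key-switch hints")              # 32 MB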
Fortunately, key-switch hints can be reused: all homomorphic multiplications use the same key-switch hint matrices, and each automorphism has its own pair of matrices. But values are so large that few of them fit on-chip.
Finally, note that there is no effective way to decompose or tile this operation to reduce storage needs while achieving good reuse: tiling the key-switch hint matrices on either dimension produces
many long-lived intermediate values; and tiling across RVec elements is even worse because in NTTs every input element affects every output element.
Performance requirements: We conclude that, to accommodate these large operands, an FHE accelerator requires a memory system that (1) decouples data movement from computation, as demand misses during frequent key-switches would tank performance; and (2) implements a large amount of on-chip storage (over 32 MB in our example) to allow reuse across entire homomorphic operations (e.g., reusing the same key-switch hints across many homomorphic multiplications).
Moreover, the FHE accelerator must be designed to use the memory system well. First, scheduling data movement and computation is crucial: data must be fetched far ahead of its use to provide decoupling, and operations must be ordered carefully to maximize reuse. Second, since values are large, excessive parallelism can increase footprint and hinder reuse. Thus, the system should use relatively few high-throughput functional units rather than many low-throughput ones.
Functionality requirements: Programmable FHE accelerators must support a wide range of parameters, both N (polynomial/vector sizes) and L (number of RNS polynomials, i.e., number of 32-bit prime factors of Q). While N is generally fixed for a single program, L changes as modulus switching sheds off polynomials.
Moreover, FHE accelerators must avoid overspecializing in order to support algorithmic diversity. For instance, we have described an implementation of key-switching, but there are others [34, 45] with different tradeoffs. For example, an alternative implementation requires much more compute but has key-switch hints that grow with L instead of L², so it becomes attractive for very large L (~20).
F1 accelerates primitive operations on large vectors: modular arithmetic, NTTs, and automorphisms. It exploits wide vector processing to achieve very high throughput, even though this makes NTTs and automorphisms costlier. F1 avoids building functional units for coarser primitives, like key-switching, which would hinder algorithmic diversity.
Limitations of prior accelerators: Prior work has proposed several FHE accelerators for FPGAs [20, 21, 27, 52, 53, 65, 66, 71]. These systems have three important limitations. First, they work by accelerating some primitives but defer others to a general-purpose host processor, and rely on the host processor to sequence operations. This causes excessive data movement that limits speedups. Second, these accelerators build functional units for fixed parameters N and L (or log Q for those not using RNS). Third, many of these systems build overspecialized primitives that limit algorithmic diversity.
Most of these systems achieve limited speedups, about 10× over software baselines. HEAX [65] achieves larger speedups (200× vs. a single core). But it does so by overspecializing: it uses relatively low-throughput functional units for primitive operations, so to achieve high performance, it builds a fixed-function pipeline for key-switching.

2.5 FHE schemes other than BGV
We have so far focused on BGV, but other FHE schemes provide different tradeoffs. For instance, whereas BGV requires integer plaintexts, CKKS [17] supports "approximate" computation on fixed-point values. B/FV [13, 28] encodes plaintexts in a way that makes modulus switching before homomorphic multiplication unnecessary, thus easing programming (but forgoing the efficiency gains of modulus switching). And GSW [35] features reduced, asymmetric noise growth under homomorphic multiplication, but encrypts a small amount of information per ciphertext (not a full N/2-element vector).
Because F1 accelerates primitive operations rather than full homomorphic operations, it supports BGV, CKKS, and GSW with the same hardware, since they use the same primitives. Accelerating B/FV would require some other primitives, so, though adding support for them would not be too difficult, our current implementation does not target it.

3 F1 ARCHITECTURE
Fig. 2 shows an overview of F1, which we derive from the insights in Sec. 2.4.

[Figure 2: Overview of the F1 architecture. High-bandwidth memory (memory controllers) feeds 16 scratchpad banks, which connect through an on-chip network (3 16×16 crossbars) to 16 compute clusters; each cluster has a banked vector register file and 128-lane vector functional units (NTT, automorphism, modular multipliers, and modular adders), all under distributed control.]

Vector processing with specialized functional units: F1 features wide-vector execution with functional units (FUs) tailored to primitive FHE operations. Specifically, F1 implements vector FUs for modular addition, modular multiplication, NTTs (forward and inverse in the same unit), and automorphisms. Because we leverage RNS representation, these FUs use a fixed, small arithmetic word size (32 bits in our implementation), avoiding wide arithmetic.
FUs process vectors of configurable length N using a fixed number of vector lanes E. Our implementation uses E = 128 lanes and supports power-of-two lengths N from 1,024 to 16,384. This covers the common range of FHE polynomial sizes, so an RNS polynomial maps to a single vector. Larger polynomials (e.g., of 32K elements) can use multiple vectors.
All FUs are fully pipelined, so they achieve the same throughput of E = 128 elements/cycle. FUs consume their inputs in contiguous chunks of E elements in consecutive cycles. This is easy for element-wise operations, but hard for NTTs and automorphisms. Sec. 5 details our novel FU implementations, including the first vector implementation of automorphisms. Our evaluation shows that these FUs achieve much higher performance than those of prior work. This is important because, as we saw in Sec. 2.4, having fewer high-throughput FUs reduces parallelism and thus memory footprint.
Compute clusters: Functional units are grouped in compute clusters, as Fig. 2 shows. Each cluster features several FUs (1 NTT, 1 automorphism, 2 multipliers, and 2 adders in our implementation) and a banked register file that can (cheaply) supply enough
operands each cycle to keep all FUs busy. The chip has multiple clusters (16 in our implementation).
Memory system: F1 features an explicitly managed memory hierarchy. As Fig. 2 shows, F1 features a large, heavily banked scratchpad (64 MB across 16 banks in our implementation). The scratchpad interfaces with both high-bandwidth off-chip memory (HBM2 in our implementation) and with compute clusters through an on-chip network.
F1 uses decoupled data orchestration [60] to hide main memory latency. Scratchpad banks work autonomously, fetching data from main memory far ahead of its use. Since memory has relatively low bandwidth, off-chip data is always staged in scratchpads, and compute clusters do not access main memory directly.
The on-chip network connecting scratchpad banks and compute clusters provides very high bandwidth, which is necessary because register files are small and achieve limited reuse. We implement a single-stage bit-sliced crossbar network [58] that provides full bisection bandwidth. Banks and the network have wide ports (512 bytes), so that a single scratchpad bank can send a vector to a compute unit at the rate it is consumed (and receive it at the rate it is produced). This avoids long staging of vectors at the register files.
Static scheduling: Because FHE programs are completely regular, F1 adopts a static, exposed microarchitecture: all components have fixed latencies, which are exposed to the compiler. The compiler is responsible for scheduling operations and data transfers in the appropriate cycles to prevent structural or data hazards. This is in the style of VLIW processors [30].
Static scheduling simplifies logic throughout the chip. For example, FUs need no stalling logic; register files and scratchpad banks need no dynamic arbitration to handle conflicts; and the on-chip network uses simple switches that change their configuration independently over time, without the buffers and arbiters of packet-switched networks.
Because memory accesses do have a variable latency, we assume the worst-case latency, and buffer data that arrives earlier (note that, because we access large chunks of data, e.g., 64 KB, this worst-case latency is not far from the average).
Distributed control: Though static scheduling is the hallmark of VLIW, F1's implementation is quite different: rather than having a single stream of instructions with many operations each, in F1 each component has an independent instruction stream. This is possible because F1 does not have any control flow: though FHE programs may have loops, we unroll them to avoid all branches, and compile programs into linear sequences of instructions.
This approach may appear costly. But vectors are very long, so each instruction encodes a lot of work and this overhead is minimal. Moreover, this enables a compact instruction format, which encodes a single operation followed by the number of cycles to wait until running the next instruction. This encoding avoids the low utilization of VLIW instructions, which leave many operation slots empty. Each FU, register file, network switch, scratchpad bank, and memory controller has its own instruction stream, which a control unit fetches in small blocks and distributes to components. Overall, instruction fetches consume less than 0.1% of memory traffic.

[Figure 3: Overview of the F1 compiler. An FHE program in the DSL (e.g., x = InputCT(); y = InputCT(); prod = Mul(x, y)) and an architecture description file (e.g., numClusters, numBanks) are processed by (1) the homomorphic operation compiler into an instruction dataflow graph, then by the off-chip data movement scheduler into a data-movement dataflow graph, and finally by (2) the cycle-level scheduler into a static per-component schedule (e.g., "Cycle 37: move RF1[0] <- B3[2]; issue NTT3(RF3[2])").]

Register file (RF) design: Each cluster in F1 requires 10 read ports and 6 write ports to keep all FUs busy. To enable this cheaply, we use an 8-banked element-partitioned register file design [5] that leverages long vectors: each vector is striped across banks, and each FU cycles through all banks over time, using a single bank each cycle. By staggering the start of each vector operation, FUs access different banks each cycle. This avoids multiporting, requires a simple RF-FU interconnect, and performs within 5% of an ideal infinite-ported RF.
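The following small sketch illustrates the staggered, element-partitioned access pattern just described. The bank count matches the text, but the number of FUs and the start offsets are illustrative assumptions, not F1's exact configuration.

# Sketch of the staggered banked register file access pattern: vectors are
# striped across 8 banks, and each FU starts its vector operation on a
# different cycle, so no two FUs touch the same bank in any given cycle.
NUM_BANKS = 8

def bank_accessed(start_cycle, cycle):
    """Bank an FU touches at `cycle` if its vector op began at `start_cycle`."""
    return (cycle - start_cycle) % NUM_BANKS

# Six FUs whose operations were staggered to start on consecutive cycles.
starts = [0, 1, 2, 3, 4, 5]
for cycle in range(6, 64):
    banks = [bank_accessed(s, cycle) for s in starts]
    assert len(set(banks)) == len(banks)   # never a bank conflict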
4 SCHEDULING DATA AND COMPUTATION
We now describe F1's software stack, focusing on the new static scheduling algorithms needed to use hardware well.
Fig. 3 shows an overview of the F1 compiler. The compiler takes as input an FHE program written in a high-level domain-specific language (Sec. 4.1). The compiler is structured in three stages. First, the homomorphic operation compiler orders high-level operations to maximize reuse and translates the program into a computation dataflow graph, where operations are computation instructions but there are no loads or stores. Second, the off-chip data movement scheduler schedules transfers between main memory and the scratchpad to achieve decoupling and maximize reuse. This phase uses a simplified view of hardware, considering it as a scratchpad directly attached to functional units. The result is a dataflow graph that includes loads and stores from off-chip memory. Third, the cycle-level scheduler refines this dataflow graph. It uses a cycle-accurate hardware model to divide instructions across compute clusters and schedule on-chip data transfers. This phase determines the exact cycles of all operations, and produces the instruction streams for all components.
This multi-pass scheduling primarily minimizes off-chip data movement, the critical bottleneck. Only in the last phase do we consider on-chip placement and data movement.
Comparison with prior work: We initially tried static scheduling algorithms from prior work [7, 12, 37, 50, 57], which primarily target VLIW architectures. However, we found these approaches ill-suited to F1 for multiple reasons. First, VLIW designs have less-flexible decoupling mechanisms and minimizing data movement is secondary to maximizing compute operations per cycle. Second, prior algorithms often focus on loops, where the key concern is to find a compact repeating schedule, e.g., through software pipelining [47]. By contrast, F1 has no flow control and we can schedule each operation independently. Third, though prior work has proposed register-pressure-aware instruction scheduling algorithms, they targeted small register files and basic blocks, whereas we must manage a large scratchpad over a much longer horizon. Thus, the
algorithms we tried either worked poorly [37, 50, 57] or could not scale to the sizes required [7, 10, 70, 77].
For example, when considering an algorithm such as Code Scheduling to Minimize Register Usage (CSR) [37], we find that the schedules it produces suffer from a large blowup of live intermediate values. This large footprint causes scratchpad thrashing and results in poor performance. Furthermore, CSR is also quite computationally expensive, requiring long scheduling times for our larger benchmarks. We evaluate our approach against CSR in Sec. 8.3.
We also attempted to frame scheduling as a register allocation problem. Effectively, the key challenge in all of our schedules is data movement, not computation. Finding a register allocation which minimizes spilling could provide a good basis for an effective schedule. However, our scratchpad stores at least 1024 residue vectors (1024 at maximum N = 16K, more for smaller values of N), and many of our benchmarks involve hundreds of thousands of instructions, meaning that register allocation algorithms simply could not scale to our required sizes [7, 10, 70, 77].

4.1 Translating the program to a dataflow graph
We implement a high-level domain-specific language (DSL) for writing F1 programs. To illustrate this DSL and provide a running example, Listing 2 shows the code for matrix-vector multiplication. This follows HELib's algorithm [38], which Fig. 4 shows. This toy 4×16K matrix-vector multiply uses input ciphertexts with N = 16K. Because accessing individual vector elements is not possible, the code uses homomorphic rotations to produce each output element.

1   p = Program(N = 16384)
2   M_rows = [p.Input(L = 16) for i in range(4)]
3   output = [None for i in range(4)]
4   V = p.Input(L = 16)
5
6   def innerSum(X):
7       for i in range(log2(p.N)):
8           X = Add(X, Rotate(X, 1 << i))
9       return X
10
11  for i in range(4):
12      prod = Mul(M_rows[i], V)
13      output[i] = innerSum(prod)

Listing 2: (4×16K) matrix-vector multiply in F1's DSL.

[Figure 4: Example matrix-vector multiply using FHE. Each matrix row M[0]–M[3] and the vector V is a 16K-slot ciphertext; each dot-product M[i]·V is produced by an element-wise multiply followed by a sequence of rotations (Rotate(1) ... Rotate(14)) and additions.]

As Listing 2 shows, programs in this DSL are at the level of the simple FHE interface presented in Sec. 2.1. There is only one aspect of the FHE implementation in the DSL: programs encode the desired noise budget (L = 16 in our example), as the compiler does not automate noise management.

4.2 Compiling homomorphic operations
The first compiler phase works at the level of the homomorphic operations provided by the DSL. It clusters operations to improve reuse, and translates them down to instructions.
Ordering homomorphic operations seeks to maximize the reuse of key-switch hints, which is crucial to reduce data movement (Sec. 2.4). For instance, the program in Listing 2 uses 15 different sets of key-switch hint matrices: one for the multiplies (line 12), and a different one for each of the rotations (line 8). If this program was run sequentially as written, it would cycle through all 15 key-switching hints (which total 480 MB, exceeding on-chip storage) four times, achieving no reuse. Clearly, it is better to reorder the computation to perform all four multiplies, and then all four Rotate(X, 1), and so on. This reuses each key-switch hint four times.
To achieve this, this pass first clusters independent homomorphic operations that reuse the same hint, then orders all clusters through simple list-scheduling. This generates schedules with good key-switch hint reuse.
Translation: Each homomorphic operation is then compiled into instructions, using the implementation of each operation in the target FHE scheme (BGV, CKKS, or GSW). Each homomorphic operation may translate to thousands of instructions. These instructions are also ordered to minimize the amount of intermediates. The end result is an instruction-level dataflow graph where every instruction is tagged with a priority that reflects its global order.
The compiler exploits algorithmic choice. Specifically, there are multiple implementations of key-switching (Sec. 2.4), and the right choice depends on L, the amount of key-switch reuse, and load on FUs. The compiler leverages knowledge of operation order to estimate these and choose the right variant.

4.3 Scheduling data transfers
The second compiler phase consumes an instruction-level dataflow graph and produces an approximate schedule that includes data transfers decoupled from computation, minimizes off-chip data transfers, and achieves good parallelism. This requires solving an interdependent problem: when to bring a value into the scratchpad and which one to replace depends on the computation schedule; and to prevent stalls, the computation schedule depends on which values are in the scratchpad. To solve this problem, this scheduler uses a simplified model of the machine: it does not consider on-chip data movement, and simply treats all functional units as being directly connected to the scratchpad.
The scheduler is greedy, scheduling one instruction at a time. It considers instructions ready if their inputs are available in the scratchpad, and follows instruction priority among ready ones. To schedule loads, we assign each load a priority
    p(load) = max{ p(u) | u ∈ users(load) },
then greedily issue loads as bandwidth becomes available. When issuing an instruction, we must ensure that there is space to store its result. We can often replace a dead value. When no such value exists, we evict the value with the furthest expected time to reuse. We estimate time to reuse as the maximum priority among unissued users of the value. This approximates Belady's optimal replacement policy [8]. Evictions of dirty data add stores to the dataflow graph.
When evicting a value, we add spill (either dirty or clean) and fill instructions to our dataflow graph.
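The sketch below captures the load-priority and eviction heuristics described in Sec. 4.3; the data structures (priority and user maps) are simplified stand-ins for the compiler's internal representation, not its actual classes.

# Sketch of the replacement decision in Sec. 4.3: loads inherit the highest
# priority of their users, and on pressure we evict the scratchpad value whose
# remaining uses lie furthest in the planned order (a Belady-style heuristic).

def load_priority(load, users, priority):
    """p(load) = max{ p(u) | u in users(load) }."""
    return max(priority[u] for u in users[load])

def choose_victim(resident, unissued_users, priority):
    """Evict the value whose estimated time to reuse is furthest away, i.e.,
    whose maximum priority among unissued users is largest. Dead values
    (no remaining users) are evicted first."""
    def time_to_reuse(value):
        remaining = unissued_users[value]
        return float("inf") if not remaining else max(priority[u] for u in remaining)
    return max(resident, key=time_to_reuse)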
4.4 Cycle-level scheduling
Finally, the cycle-level scheduler takes in the data movement schedule produced by the previous phase, and schedules all operations for all components considering all resource constraints and data dependences. This phase distributes computation across clusters and manages their register files and all on-chip transfers. Importantly, this scheduler is fully constrained by its input schedule's off-chip data movement. It does not add loads or stores in this stage, but it does move loads to their earliest possible issue cycle to avoid stalls on missing operands. All resource hazards are resolved by stalling.
In practice, we find that this separation of scheduling into data movement and instruction scheduling produces good schedules in reasonable compilation times.
This stage works by iterating through all instructions in the order produced by the previous compiler phase (Sec. 4.3) and determining the minimum cycle at which all required on-chip resources are available. We consider the availability of off-chip bandwidth, scratchpad space, register file space, functional units, and ports.
During this final compiler pass, we finally account for store bandwidth, scheduling stores (which result from spills) as needed. In practice, we find that this does not hurt our performance much, as stores are infrequent across most of our benchmarks due to our global schedule and replacement policy design. After the final schedule is generated, we validate it by simulating it forward to ensure that no clobbers or resource usage violations occur.
It is important to note that because our schedules are fully static, our scheduler also doubles as a performance measurement tool. As illustrated in Fig. 3, the compiler takes in an architecture description file detailing a particular configuration of F1. This flexibility allows us to conduct design space explorations very quickly (Sec. 8.4).

5 FUNCTIONAL UNITS
In this section, we describe F1's novel functional units. These include the first vectorized automorphism unit (Sec. 5.1), the first fully-pipelined flexible NTT unit (Sec. 5.2), and a new simplified modular multiplier adapted to FHE (Sec. 5.3).

5.1 Automorphism unit
Because F1 uses E vector lanes, each residue polynomial is stored and processed as G groups, or chunks, of E elements each (N = G · E). An automorphism σ_k maps the element at index i to index k·i mod N; there are N automorphisms total, two for each odd k < N (Sec. 2.2). The key challenge in designing an automorphism unit is that these permutations are hard to vectorize: we would like this unit to consume and produce E = 128 elements/cycle, but the vectors are much longer, with N up to 16K, and elements are permuted across different chunks. Moreover, we must support variable N and all automorphisms.
Standard solutions fail: a 16K×16K crossbar is much too large; a scalar approach, like reading elements in sequence from an SRAM, is too slow (taking N cycles); and using banks of SRAM to increase throughput runs into frequent bank conflicts: each automorphism "spreads" elements with a different stride, so regardless of the banking scheme, some automorphisms will map many consecutive elements to the same bank.
We contribute a new insight that makes vectorizing automorphisms simple: if we interpret a residue polynomial as a G × E matrix, an automorphism can always be decomposed into two independent column and row permutations. If we transpose this matrix, both column and row permutations can be applied in chunks of E elements. Fig. 5 shows an example of how automorphism σ_3 is applied to a residue polynomial with N = 16 and E = 4 elements/cycle. Note how the permute column and row operations are local to each 4-element chunk. Other σ_k induce different permutations, but with the same row/column structure.

[Figure 5: Applying σ_3 on an RNS polynomial of four 4-element chunks by using only permutations local to chunks. The stages are: input, transpose, permute column (i.e., transposed column), permute row, and transpose.]

[Figure 6: Automorphism unit. The pipeline applies the column permutation (sign flip and cyclic shift stages), a transpose, the row permutation, and a final transpose.]

Our automorphism unit, shown in Fig. 6, uses this insight to be both vectorized (consuming E = 128 elements/cycle) and fully pipelined. Given a residue polynomial of N = G · E elements, the automorphism unit first applies the column permutation to each E-element input. Then, it feeds this to a transpose unit that reads in the whole residue polynomial interpreting it as a G × E matrix, and produces its transpose E × G. The transpose unit outputs E elements per cycle (outputting multiple rows per cycle when G < E). Row permutations are applied to each E-element chunk, and the reverse transpose is applied.
Further, we decompose both the row and column permutations into a pipeline of sub-permutations that are fixed in hardware, with each sub-permutation either applied or bypassed based on simple control logic; this avoids using crossbars for the E-element permute row and column operations.
Transpose unit: Our quadrant-swap transpose unit transposes an E × E (e.g., 128 × 128) matrix by recursively decomposing it into quadrants and exploiting the identity
    [[A, B], [C, D]]^T = [[A^T, C^T], [B^T, D^T]].
The basic building block is a k × k quadrant-swap unit, which swaps quadrants B and C, as shown in Fig. 7 (left). Operationally, the quadrant swap procedure consists of three steps, each taking k/2 cycles:
(1) Cycle i in the first step reads A[i] and C[i] and stores them in top[i] and bottom[i], respectively.
(2) Cycle i in the second step reads B[i] and D[i]. The unit activates the first swap MUX and the bypass line, thus storing D[i] in top[i] and outputting A[i] (by reading from top[i]) and B[i] via the bypass line.
(3) Cycle i in the third step outputs D[i] and C[i] by reading from top[i] and bottom[i], respectively. The second swap MUX is activated so that C[i] is on top.
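A minimal software sketch of the quadrant-swap identity above: it transposes a matrix by swapping quadrants B and C and recursing into each quadrant. This is plain Python recursion over lists, whereas the hardware unrolls the same identity into pipelined stages.

# Sketch of the recursive quadrant-swap decomposition the transpose unit uses.
def transpose(m):
    n = len(m)
    if n == 1:
        return [row[:] for row in m]
    h = n // 2
    a = [row[:h] for row in m[:h]]        # top-left quadrant
    b = [row[h:] for row in m[:h]]        # top-right quadrant
    c = [row[:h] for row in m[h:]]        # bottom-left quadrant
    d = [row[h:] for row in m[h:]]        # bottom-right quadrant
    ta, tb, tc, td = transpose(a), transpose(b), transpose(c), transpose(d)
    # [[A, B], [C, D]]^T = [[A^T, C^T], [B^T, D^T]]: B and C swap places.
    top = [ra + rc for ra, rc in zip(ta, tc)]
    bottom = [rb + rd for rb, rd in zip(tb, td)]
    return top + bottom

E = 8
m = [[r * E + c for c in range(E)] for r in range(E)]
assert transpose(m) == [list(col) for col in zip(*m)]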
[Figure 7: Transpose unit (right) and its component quadrant-swap unit (left). The quadrant-swap unit uses top and bottom buffers, two swap MUXes, and a bypass line; the transpose unit composes an 8×8 quadrant swap with 4×4 and 2×2 quadrant swaps that are applied or bypassed depending on N.]

Note that step 3 for one input can be done in parallel with step 1 for the next, so the unit is fully pipelined.
The transpose is implemented by a full E × E quadrant-swap followed by log2 E layers of smaller transpose units to recursively transpose A, B, C, and D. Fig. 7 (right) shows an implementation for E = 8. Finally, by selectively bypassing some of the initial quadrant swaps, this transpose unit also works for all values of N (N = G × E with power-of-2 G < E).
Prior work has implemented transpose units for signal-processing applications, either using registers [76, 78] or with custom SRAM designs [68]. Our design has three advantages over prior work: it uses standard SRAM memory, so it is dense without requiring complex custom SRAMs; it is fully pipelined; and it works for a wide range of dimensions.

5.2 Four-step NTT unit
There are many ways to implement NTTs in hardware: an NTT is like an FFT [19] but with a butterfly that uses modular multipliers. We implement N-element NTTs (from 1K to 16K) as a composition of smaller E = 128-element NTTs, since implementing a full 16K-element NTT datapath is prohibitive. The challenge is that standard approaches result in memory access patterns that are hard to vectorize.
To that end, we use the four-step variant of the FFT algorithm [6], which adds an extra multiplication to produce a vector-friendly decomposition. Fig. 8 illustrates our four-step NTT pipeline for E = 4; we use the same structure with E = 128. The unit is fully pipelined and consumes E elements per cycle. To compute an N = E × E NTT, the unit first computes an E-point NTT on each E-element group, multiplies each group with twiddles, transposes the E groups, and computes another E-element NTT on each transpose. The same NTT unit implements the inverse NTT by storing multiplicative factors (twiddles) required for both forward and inverse NTTs in a small twiddle SRAM.

[Figure 8: Example of a four-step NTT datapath that uses 4-point NTTs to implement 16-point NTTs. Inputs x0–x15 flow through a DIT NTT, a twiddle multiply (fed by a twiddle SRAM with an Inverse? select), a transpose, and a DIF NTT to produce outputs f0–f15.]

Crucially, we are able to support all values of N using a single four-step NTT pipeline by conditionally bypassing layers in the second NTT butterfly. We use the same transpose unit implementation as with automorphisms.
Our four-step pipeline supports negacyclic NTTs (NCNs), which are more efficient than standard non-negacyclic NTTs (that would require padding, Sec. 2.3). Specifically, we extend prior work [49, 62, 67] in order to support both forward and inverse NCNs using the same hardware as for the standard NTT. Namely, prior work shows how to either (1) perform a forward NCN via a standard decimation-in-time (DIT) NTT pipeline, or (2) perform an inverse NCN via a standard decimation-in-frequency (DIF) NTT pipeline. The DIF and DIT NTT variants use different hardware; therefore, this approach requires separate pipelines for forward and inverse NCNs. Prior work [49] has shown that separate pipelines can be avoided by adding a multiplier either before or after the NTT: doing an inverse NCN using a DIT NTT requires a multiplier unit after the NTT, while doing a forward NCN using a DIF NTT requires a multiplier unit before the NTT.
We now show that both the forward and inverse NCN can be done in the same standard four-step NTT pipeline, with no additional hardware. This is because the four-step NTT already has a multiplier and two NTTs in its pipeline. We set the first NTT to be decimation-in-time and the second to be decimation-in-frequency (Fig. 8). To do a forward NTT, we use the forward NCN implementation via DIT NTT for the first NTT; we modify the contents of the twiddle SRAM so that the multiplier does the pre-multiplication necessary to implement a forward NCN in the second NTT (which is DIF and thus requires the pre-multiplication). Conversely, to do an inverse NTT, we modify the twiddle SRAM contents to do the post-multiplication necessary to implement an inverse NCN in the first NTT (which is DIT); and we use the inverse NCN implementation via DIF NTT for the second NTT.
The NTT unit is large: each of the 128-element NTTs requires E(log(E) − 1)/2 = 384 multipliers, and the full unit uses 896 multipliers. But its high throughput improves performance over many low-throughput NTTs (Sec. 8). This is the first implementation of a fully-pipelined four-step NTT unit, improving NTT performance by 1,600× over the state of the art (Sec. 8.1).

5.3 Optimized modular multiplier
Modular multiplication computes a · b mod q. This is the most expensive and frequent operation. Therefore, improvements to the modular multiplier have an almost linear impact on the computational capabilities of an FHE accelerator.
Prior work [51] recognized that a Montgomery multiplier [55] within NTTs can be improved by leveraging the fact that the possible values of modulus q are restricted by the number of elements the
NTT is applied to. We notice that if we only select moduli q_i such that q_i ≡ 1 mod 2^16, we can remove a multiplier stage from [51]; this reduces area by 19% and power by 30% (Table 1). The additional restriction on q is acceptable because FHE requires at most tens of moduli [34], and our approach allows for 6,186 prime moduli.

    Multiplier             Area [µm²]   Power [mW]   Delay [ps]
    Barrett                     5,271        18.40        1,317
    Montgomery                  2,916         9.29        1,040
    NTT-friendly                2,165         5.36        1,000
    FHE-friendly (ours)         1,817         4.10        1,000

Table 1: Area, power, and delay of modular multipliers.
Add FU 0.03 0.05
Vector RegFile (512 KB) 0.56 1.67
6 F1 IMPLEMENTATION Compute cluster 3.97 8.75
We have implemented F1’s components in RTL, and synthesize them (NTT, Aut, 2⇥ Mul, 2⇥ Add, RF)
in a commercial 14/12nm process using state-of-the-art tools. These Total compute (16 clusters) 63.52 140.0
include a commercial SRAM compiler that we use for scratchpad
and register �le banks. Scratchpad (16⇥4 MB banks) 48.09 20.35
We use a dual-frequency design: most components run at 1 GHz, 3⇥NoC (16⇥16 512 B bit-sliced [58]) 10.02 19.65
but memories (register �les and scratchpads) run double-pumped at Memory interface (2⇥HBM2 PHYs) 29.80 0.45
2 GHz. Memories meet this frequency easily and this enables using Total memory system 87.91 40.45
single-ported SRAMs while serving up to two accesses per cycle. Total F1 151.4 180.4
By keeping most of the logic at 1 GHz, we achieve higher energy
Table 2: Area and Thermal Design Power (TDP) of F1, and
e�ciency. We explored several non-blocking on-chip networks
breakdown by component.
(Clos, Benes, and crossbars). We use 3 16⇥16 bit-sliced crossbars [58]
(scratchpad!cluster, cluster!scratchpad, and cluster!cluster).
Table 2 shows a breakdown of area by component, as well as
dataset [46]. LoLa-MNIST includes two variants with unencrypted
the area of our F1 con�guration, 151.4 mm2 . FUs take 42% of the
and encrypted weights; LoLa-CIFAR is available only with unen-
area, with 31.7% going to memory, 6.6% to the on-chip network,
crypted weights. These three benchmarks use relatively low ! val-
and 19.7% to the two HBM2 PHYs. We assume 512 GB/s bandwidth
ues (their starting ! values are 4, 6, and 8, respectively), so they are
per PHY; this is similar to the NVIDIA A100 GPU [18], which has
less memory-bound. They also feature frequent automorphisms,
2.4 TB/s with 6 HBM2E PHYs [56]. We use prior work to estimate
showing the need for a fast automorphism unit.
HBM2 PHY area [24, 63] and power [32, 63].
DB Lookup is adapted from HELib’s BGV_country_db_lookup [41].
This design is constrained by memory bandwidth: though it has
A BGV-encrypted query string is used to traverse an encrypted
1 TB/s of bandwidth, the on-chip network’s bandwidth is 24 TB/s,
key-value store and return the corresponding value. The original
and the aggregate bandwidth between RFs and FUs is 128 TB/s. This
implementation uses a low security level for speed of demonstra-
is why maximizing reuse is crucial.
tion, but in our version, we implement it at ! =17, # =16K for
realism. We also parallelize the CPU version so it can e�ectively
7 EXPERIMENTAL METHODOLOGY use all available cores. DB Lookup is both deep and wide, so running
Modeled system: We evaluate our F1 implementation from Sec. 6. it on F1 incurs substantial o�-chip data movement.
We use a cycle-accurate simulator to execute F1 programs. Because Bootstrapping: We evaluate bootstrapping benchmarks for BGV
the architecture is static, this is very di�erent from conventional and CKKS. Bootstrapping takes an ! = 1 ciphertext with an ex-
simulators, and acts more as a checker: it runs the instruction hausted noise budget and refreshes it by bringing it up to a chosen
stream at each component and veri�es that latencies are as expected top value of ! = !<0G , then performing the bootstrapping compu-
and there are no missed dependences or structural hazards. We tation to eventually obtain a usable ciphertext at a lower depth (e.g.,
use activity-level energies from RTL synthesis to produce energy !<0G 15 for BGV).
breakdowns. For BGV, we use Sheri� and Peikert’s algorithm [3] for non-
Benchmarks: We use several FHE programs to evaluate F1. All packed BGV bootstrapping, with !<0G = 24. This is a particu-
programs come from state-of-the-art software implementations, larly challenging benchmark because it features computations at
which we port to F1: large values of !. This exercises the scheduler’s algorithmic choice
Logistic regression uses the HELR algorithm [40], which is based component, which selects the right key-switch method to balance
on CKKS. We compute a single batch of logistic regression train- computation and data movement.
ing with up to 256 features, and 256 samples per batch, starting at For CKKS, we use non-packed CKKS bootstrapping from HEA-
computational depth ! = 16; this is equivalent to the �rst batch of AN [16], also with !<0G = 24. CKKS bootstrapping has many
HELR’s MNIST workload. This computation features ciphertexts fewer ciphertext multiplications than BGV, greatly reducing reuse
with large log & (! = 14, 15, 16), so it needs careful data orchestra- opportunities for key-switch hints.
tion to run e�ciently. Baseline systems: We compare F1 with a CPU system running
Neural network benchmarks come from Low Latency CryptoNets the baseline programs (a 4-core, 8-thread, 3.5 GHz Xeon E3-1240v5).
(LoLa) [15]. This work uses B/FV, an FHE scheme that F1 does not Since prior accelerators do not support full programs, we also in-
support, so we use CKKS instead. We run two neural networks: clude microbenchmarks of single operations and compare against
LoLa-MNIST is a simple, LeNet-style network used on the MNIST HEAX [65], the fastest prior accelerator.
dataset [48], while LoLa-CIFAR is a much larger 6-layer network
(similar in computation to MobileNet v3 [42]) used on the CIFAR-10
10
Execution time (ms) on CPU F1 Speedup
LoLa-CIFAR Unencryp. Wghts. 1.2 ⇥ 106 241 5, 011⇥
LoLa-MNIST Unencryp. Wghts. 2, 960 0.17 17, 412⇥
LoLa-MNIST Encryp. Wghts. 5, 431 0.36 15, 086⇥
Logistic Regression 8, 300 1.15 7, 217⇥
DB Lookup 29, 300 4.36 6, 722⇥
BGV Bootstrapping 4, 390 2.40 1, 830⇥
CKKS Bootstrapping 1, 554 1.30 1, 195⇥
gmean speedup 5, 432⇥
∗ LoLa’s
release did not include MNIST with encrypted weights, so
we reimplemented it in HELib.
Table 3: Performance of F1 and CPU on full FHE bench-
marks: execution times in milliseconds and F1’s speedup.

(a) (b)
8 EVALUATION
Figure 9: Per-benchmark breakdowns of (a) data movement
8.1 Performance and (b) average power for F1.
Benchmarks: Table 3 compares the performance of F1 and the CPU are even larger than in full benchmarks. This is because mi-
CPU on full benchmarks. It reports execution time in millisec- crobenchmarks are pure compute, and thus miss the data movement
onds for each program (lower is better), and F1’s speedup over the bottlenecks of FHE programs.
CPU (higher is better). F1 achieves dramatic speedups, from 1,195⇥
to 17,412⇥ (5,432⇥ gmean). CKKS bootstrapping has the lowest
speedups as it’s highly memory-bound; other speedups are within 8.2 Architectural analysis
a relatively narrow band, as compute and memory tra�c are more To gain more insights into these results, we now analyze F1’s data
balanced. movement, power consumption, and compute.
These speedups greatly expand the applicability of FHE. Consider Data movement: Fig. 9a shows a breakdown of o�-chip memory
deep learning: in software, even the simple LoLa-MNIST network tra�c across data types: key-switch hints (KSH), inputs/outputs,
takes seconds per inference, and a single inference on the more and intermediate values. KSH and input/output tra�c is broken into
realistic LoLa-CIFAR network takes 20 minutes. F1 brings this down compulsory and non-compulsory (i.e., caused by limited scratchpad
to 241 milliseconds, making real-time deep learning inference prac- capacity). Intermediates, which are always non-compulsory, are
tical: when o�oading inferences to a server, this time is comparable classi�ed as loads or stores.
to the roundtrip latency between server and client. Fig. 9a shows that key-switch hints dominate in high-depth work-
Microbenchmarks: Table 4 compares the performance of F1, the loads (LogReg, DB Lookup, and bootstrapping), taking up to 94%
CPU, and HEAXf on four microbenchmarks: the basic NTT and of tra�c. Key-switch hints are also signi�cant in the LoLa-MNIST
automorphism operations on a single ciphertext, and homomorphic variants. This shows why scheduling should prioritize them. Sec-
multiplication and permutation (which uses automorphisms). We ond, due our scheduler design, F1 approaches compulsory tra�c
report three typical sets of parameters. We use microbenchmarks for most benchmarks, with non-compulsory accesses adding only
to compare against prior accelerators, in particular HEAX. But 5-18% of tra�c. The exception is LoLa-CIFAR, where intermedi-
prior accelerators do not implement automorphisms, so we extend ates consume 75% of tra�c. LoLa-CIFAR has very high reuse of
each HEAX key-switching pipeline with an SRAM-based, scalar key-switch hints, and exploiting it requires spilling intermediate
automorphism unit. We call this extension HEAXf . ciphertexts.
Table 4 shows that F1 achieves large speedups over HEAXf , Power consumption: Fig. 9b reports average power for each bench-
ranging from 172⇥ to 1,866⇥. Moreover, F1’s speedups over the mark, broken down by component. This breakdown also includes

# = 212 , log & = 109 # = 213 , log & = 218 # = 214 , log & = 438
F1 vs. CPU vs. HEAXf F1 vs. CPU vs. HEAXf F1 vs. CPU vs. HEAXf
NTT 12.8 17,148⇥ 1,600⇥ 44.8 10,736⇥ 1,733⇥ 179.2 8,838⇥ 1,866⇥
Automorphism 12.8 7,364⇥ 440⇥ 44.8 8,250⇥ 426⇥ 179.2 16,957⇥ 430⇥
Homomorphic multiply 60.0 48,640⇥ 172⇥ 300 27,069⇥ 148⇥ 2,000 14,396⇥ 190⇥
Homomorphic permutation 40.0 17,488⇥ 256⇥ 224 10,814⇥ 198⇥ 1,680 6,421⇥ 227⇥
Table 4: Performance on microbenchmarks: F1’s reciprocal throughput, in nanoseconds per ciphertext operation (lower is
better) and speedups over CPU and HEAXf (HEAX augmented with scalar automorphism units) (higher is better).

11
For the FU experiments, our goal is to show the importance of
having high-throughput units. Therefore, the low-throughput vari-
ants use many more (NTT or automorphism) FUs, so that aggregate
throughput across all FUs in the system is the same. Also, the sched-
uler accounts for the characteristics of these FUs. In both cases,
performance drops substantially, by gmean 2.6⇥ and 3.3⇥. This is
because achieving high throughput requires excessive parallelism,
which hinders data movement, forcing the scheduler to balance
both.
Finally, the scheduler experiment uses register-pressure-aware
scheduling [37] as the o�-chip data movement scheduler instead,
Figure 10: Functional unit and HBM utilization over time for operating on the full data�ow graph. This algorithm was proposed
the LoLa-MNIST PTW benchmark. for VLIW processors and register �les; we apply it to the larger
scratchpad. The large slowdowns show that prior capacity-aware
Benchmark LT NTT LT Aut CSR schedulers are ine�ective on F1.
LoLa-CIFAR Unencryp. Wghts. 3.5⇥ 12.1⇥ —∗
LoLa-MNIST Unencryp. Wghts. 5.0⇥ 4.2⇥ 1.1⇥ 8.4 Scalability
LoLa-MNIST Encryp. Wghts. 5.1⇥ 11.9⇥ 7.5⇥ Finally, we study how F1’s per-
Logistic Regression 1.7⇥ 2.3⇥ 11.7⇥ formance changes with its area
DB Lookup 2.8⇥ 2.2⇥ —∗ budget: we sweep the number
BGV Bootstrapping 1.5⇥ 1.3⇥ 5.0⇥ of compute clusters, scratchpad
CKKS Bootstrapping 1.1⇥ 1.2⇥ 2.7⇥ banks, HBM controllers, and
network topology to �nd the
gmean speedup 2.5⇥ 3.6⇥ 4.2⇥
most e�cient design at each
∗ CSR
is intractable for this benchmark. area. Fig. 11 shows this Pareto
Table 5: Speedups of F1 over alternate con�gurations: LT NT-
frontier, with area in the G-
T/Aut = Low-throughput NTT/Automorphism FUs; CSR =
axis and performance in the
Code Scheduling to minimize Register Usage [37].
~-axis. This curve shows that,
as F1 scales, it uses resources Figure 11: Performance vs.
e�ciently: performance grows area across F1 con�gurations.
o�-chip memory power (Table 2 only included the on-chip compo- about linearly through a large range of areas.
nent). Results show reasonable power consumption for an accelera-
tor card. Overall, computation consumes 20-30% of power, and data 8.5 Functional Simulation
movement dominates.
Here we describe our software simulation e�orts for F1. Currently,
Utilization over time: F1’s average FU utilization is about 30%.
we have a functional simulator written in C++ on top of Shoup’s
However, this doesn’t mean that fewer FUs could achieve the same
performance: benchmarks have memory-bound phases that weigh Number Theory Library.† This simulator measures input-output
down average FU utilization. To see this, Fig. 10 shows a break- correctness and calls to functional units throughout a computation.
down of FU utilization over time for LoLa-MNIST Plaintext Weights. The underlying algorithms are not the same as F1’s functional units,
Fig. 10 also shows o�-chip bandwidth utilization over time (black but they match common methods used in software (i.e., HElib’s
line). The program is initially memory-bound, and few FUs are algorithms). This allows one to verify correctness of FHE algo-
active. As the memory-bound phase ends, compute intensity grows, rithms and to create a data�ow graph. The simulator has all our
utilizing a balanced mix of the available FUs. Finally, due to decou- functional units implemented in software: modular additions, mod-
pled execution, when memory bandwidth utilization peaks again, ular multiplications, automorphisms, and NTTs. We then build
F1 can maintain high compute intensity. The highest FU utilization ciphertext-level operations by calls to these algorithms: ciphertext
happens at the end of the benchmark and is caused by processing addition, ciphertext multiplication, rotations, modulus-switching,
the �nal (fully connected) layer, which is highly parallel and already and a simpli�ed bootstrapping procedure, for non-packed cipher-
has all inputs available on-chip. texts. Our functional simulator works for the parameter ranges
discussed throughout the paper: polynomial/ring dimension # as
8.3 Sensitivity studies an arbitrary power of 2 (usually 1024-16384 for security) and RNS
moduli where each is an NTT-friendly prime, @8 ⌘ 1 mod 2# ,
To understand the impact of our FUs and scheduling algorithms,
roughly 24 bits long. Further, each moduli is sampled randomly,
we evaluate F1 variants without them. Table 5 reports the slowdown
similarly to other FHE RNS implementations.
(higher is worse) of F1 with: (1) low-throughput NTT FUs that
follow the same design as HEAX (processing one stage of NTT 9 RELATED WORK
butter�ies per cycle); (2) low-throughput automorphism FUs using
a serial SRAM memory, and (3) Goodman’s register-pressure-aware We now discuss related work not covered so far.
scheduler [37]. † https://libntl.org/

12
FHE accelerators: Prior work has proposed accelerators for indi- GPUs achieve only modest performance gains. For instance, Badawi
vidual FHE operations, but not full FHE computations [20, 21, 22, et al. [1] accelerate B/FV multiplication using GPUs, and achieve
27, 52, 53, 65, 66, 71]. These designs target FPGAs and rely on a speedups of around 10⇥ to 100⇥ over single-thread CPU execution
host processor; Sec. 2.4 discussed their limitations. Early designs (and thus commensurately lower speedups over multicore CPUs,
accelerated small primitives like NTTs, and were dominated by as FHE operations parallelize well).
host-FPGA communication. State-of-the-art accelerators execute
a full homomorphic multiplication independently: Roy et al. [66] 10 CONCLUSION
accelerate B/FV multiplication by 13⇥ over a CPU; HEAWS [71] FHE has the potential to enable computation o�oading with guar-
accelerates B/FV multiplication, and uses it to speed a simple bench- anteed security. But FHE’s high computation overheads currently
mark by 5⇥; and HEAX [65] accelerates CKKS multiplication and limit its applicability to narrow cases (simple computations where
key-switching by up to 200⇥. These designs su�er high data move- privacy is paramount). F1 tackles this challenge, accelerating full
ment (e.g., HEAX does not reuse key-switch hints) and use �xed FHE computations by over 3-4 orders of magnitude. This enables
pipelines with relatively low-throughput FUs. new use cases for FHE, like secure real-time deep learning inference.
We have shown that accelerating FHE programs requires a di�er- F1 is the �rst FHE accelerator that is programmable, i.e., capa-
ent approach: data movement becomes the key constraint, requiring ble of executing full FHE programs. In contrast to prior accelera-
new techniques to extract reuse across homomorphic operations; tors, which build �xed pipelines tailored to speci�c FHE schemes
and �xed pipelines cannot support the operations of even a single and parameters, F1 introduces a more e�ective design approach:
benchmark. Instead, F1 achieves �exibility and high performance it accelerates the primitive computations shared by higher-level
by exploiting wide-vector execution with high-throughput FUs. operations using novel high-throughput functional units, and hard-
This lets F1 execute not only full applications, but di�erent FHE ware and compiler are co-designed to minimize data movement, the
schemes. key bottleneck. This �exibility makes F1 broadly useful: the same
Hybrid HE-MPC accelerators: Recent work has also proposed hardware can accelerate all operations within a program, arbitrary
ASIC accelerators for some homomorphic encryption primitives in FHE programs, and even multiple FHE schemes. In short, our key
the context of oblivious neural networks [44, 64]. These approaches contribution is to show that, for FHE, we can achieve ASIC-level
are very di�erent from FHE: they combine homomorphic encryp- performance without sacri�cing programmability.
tion with multi-party computation (MPC), executing a single layer
of the network at a time and sending intermediates to the client, ACKNOWLEDGMENTS
which computes the �nal activations. Gazelle [44] is a low-power
We thank the anonymous reviewers, Maleen Abeydeera, Hyun Ry-
ASIC for homomorphic evaluations, and Cheetah [64] introduces
ong Lee, Quan Nguyen, Yifan Yang, Victor Ying, Guowei Zhang, and
algorithmic optimizations and a large ASIC design that achieves
Joel Emer for feedback on the paper; Tutu Ajayi, Austin Rovinski,
very large speedups over Gazelle.
and Peter Li for help with the HDL toolchain setup; Shai Halevi, Wei
These schemes avoid high-depth FHE programs, so server-side
Dai, Olli Saarikivi, and Madan Musuvathi for email correspondence.
homomorphic operations are cheaper. But they are limited by client-
This research was developed with funding from the Defense Ad-
side computation and client-server communication: Cheetah and
vanced Research Projects Agency (DARPA) under contract number
Gazelle use ciphertexts that are up to ⇠ 40⇥ smaller than those used
Contract No. HR0011-21-C-0035. The views, opinions and/or �nd-
by F1; however, they require the client to re-encrypt ciphertexts
ings expressed are those of the author and should not be interpreted
every time they are multiplied on the server to prevent noise blowup.
as representing the o�cial views or policies of the Department of
CHOCO [72] shows that client-side computation costs for HE-MPC
Defense or the U.S. Government. Nikola Samardzic was supported
are substantial, and when they are accelerated, network latency
by the Jae S. and Kyuho Lim Graduate Fellowship at MIT.
and throughput overheads dominate (several seconds per DNN
inference). By contrast, F1 enables o�oading the full inference
REFERENCES
using FHE, avoiding frequent communication. As a result, a direct
[1] A. Q. A. Al Badawi, Y. Polyakov, K. M. M. Aung, B. Veeravalli, and K. Rohlo�,
comparison between these accelerators and F1 is not possible. “Implementation and performance evaluation of RNS variants of the BFV homo-
F1’s hardware also di�ers substantially from Cheetah and Gazelle. morphic encryption scheme,” IEEE Transactions on Emerging Topics in Computing,
vol. 9, no. 2, 2021.
First, Cheetah and Gazelle implement �xed-function pipelines (e.g., [2] M. Albrecht, M. Chase, H. Chen, J. Ding, S. Goldwasser, S. Gorbunov, S. Halevi,
for output-stationary DNN inference in Cheetah), whereas F1 is J. Ho�stein, K. Laine, K. Lauter, S. Lokam, D. Micciancio, D. Moody, T. Morrison,
programmable. Second, Cheetah, like HEAX, uses many FUs with A. Sahai, and V. Vaikuntanathan, “Homomorphic encryption security standard,”
HomomorphicEncryption.org, Tech. Rep., 2018.
relatively low throughput, whereas F1 uses few high-throughput [3] J. Alperin-Sheri� and C. Peikert, “Practical bootstrapping in quasilinear time,” in
units (e.g., 40⇥ faster NTTs). Cheetah’s approach makes sense for Annual Cryptology Conference, 2013.
their small ciphertexts, but as we have seen (Sec. 8.3), it is impracti- [4] D. Altavilla, “Intel and Microsoft Collaborate on DARPA Program that Pioneers
A New Frontier Of Ultra-Secure Computing,” https://www.forbes.com/sites/
cal for FHE. davealtavilla/2021/03/08/intel-and-microsoft-collaborate-on-darpa-program-
GPU acceleration: Finally, prior work has also used GPUs to ac- that-pioneers-a-new-frontier-of-ultra-secure-computing/?sh=60db31567c1a
archived at https://perma.cc/YYE6-5FT4, 2021.
celerate di�erent FHE schemes, including GH [74, 75], BGV [73], [5] K. Asanovic, “Vector microprocessors,” Ph.D. dissertation, EECS Department,
and B/FV [1]. Though GPUs have plentiful compute and band- University of California, Berkeley, 1998.
width, they lack modular arithmetic, their pure data-parallel ap- [6] D. H. Bailey, “FFTs in external of hierarchical memory,” in Proceedings of the 1989
ACM/IEEE conference on Supercomputing, 1989.
proach makes non-element-wise operations like NTTs expensive, [7] G. Barany, “Register reuse scheduling,” in 9th Workshop on Optimizations for DSP
and their small on-chip storage adds data movement. As a result, and Embedded Systems (ODES-9), 2011.
13
[8] L. A. Belady, “A study of replacement algorithms for a virtual-storage computer,” [34] C. Gentry, S. Halevi, and N. P. Smart, “Homomorphic evaluation of the AES
IBM Systems journal, vol. 5, no. 2, 1966. circuit,” in Annual Cryptology Conference, 2012.
[9] F. Bergamaschi, “IBM Releases Fully Homomorphic Encryption Toolkit for Ma- [35] C. Gentry, A. Sahai, and B. Waters, “Homomorphic encryption from learning with
cOS and iOS,” https://www.ibm.com/blogs/research/2020/06/ibm-releases-fully- errors: Conceptually-simpler, asymptotically-faster, attribute-based,” in Annual
homomorphic-encryption-toolkit-for-macos-and-ios-linux-and-android- Cryptology Conference, 2013.
coming-soon/ archived at https://perma.cc/U5TQ-K49C, 2020. [36] R. Gilad-Bachrach, N. Dowlin, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing,
[10] D. A. Berson, R. Gupta, and M. L. So�a, “URSA: A Uni�ed ReSource Allocator for “Cryptonets: Applying neural networks to encrypted data with high throughput
Registers and Functional Units in VLIW Architectures,” in Proceedings of the IFIP and accuracy,” in Proceedings of the International Conference on Machine Learning
WG10.3 Working Conference on Architectures and Compilation Techniques for Fine (ICML), 2016.
and Medium Grain Parallelism (PACT’93), 1993. [37] J. R. Goodman and W.-C. Hsu, “Code scheduling and register allocation in large
[11] M. Blatt, A. Gusev, Y. Polyakov, and S. Goldwasser, “Secure large-scale genome- basic blocks,” in Proceedings of the 2nd International Conference on Supercomputing
wide association studies using homomorphic encryption,” Proceedings of the (ICS), 1988.
National Academy of Sciences, vol. 117, no. 21, 2020. [38] S. Halevi and V. Shoup, “Algorithms in HElib,” in Annual Cryptology Conference,
[12] G. E. Blelloch, P. B. Gibbons, and Y. Matias, “Provably e�cient scheduling for 2014.
languages with �ne-grained parallelism,” Journal of the ACM (JACM), vol. 46, [39] K. Han, S. Hong, J. H. Cheon, and D. Park, “E�cient logistic regression on large
no. 2, 1999. encrypted data,” IACR Cryptol. ePrint Arch., 2018.
[13] Z. Brakerski, “Fully homomorphic encryption without modulus switching from [40] K. Han, S. Hong, J. H. Cheon, and D. Park, “Logistic regression on homomorphic
classical GapSVP,” in Annual Cryptology Conference, 2012. encrypted data at scale,” in Proceedings of the AAAI Conference on Arti�cial
[14] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, “(leveled) fully homomorphic Intelligence, vol. 33, no. 01, 2019.
encryption without bootstrapping,” ACM Transactions on Computation Theory [41] HElib, “HElib country lookup example,” https://github.com/homenc/HElib/tree/
(TOCT), vol. 6, no. 3, 2014. master/examples/BGV_country_db_lookup archived at https://perma.cc/U2MW-
[15] A. Brutzkus, R. Gilad-Bachrach, and O. Elisha, “Low latency privacy preserving QLRJ, 2019.
inference,” in Proceedings of the International Conference on Machine Learning [42] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu,
(ICML), 2019. R. Pang, V. Vasudevan et al., “Searching for mobilenet v3,” in Proceedings of the
[16] J. H. Cheon, K. Han, A. Kim, M. Kim, and Y. Song, “Bootstrapping for approximate IEEE/CVF International Conference on Computer Vision, 2019.
homomorphic encryption,” in Annual International Conference on the Theory and [43] IBM, “Cost of a Data Breach Report,” Tech. Rep., 2020.
Applications of Cryptographic Techniques, 2018. [44] C. Juvekar, V. Vaikuntanathan, and A. Chandrakasan, “GAZELLE: A low latency
[17] J. H. Cheon, A. Kim, M. Kim, and Y. Song, “Homomorphic encryption for arith- framework for secure neural network inference,” in 27th USENIX Security Sympo-
metic of approximate numbers,” in International Conference on the Theory and sium (USENIX Security 18), 2018.
Application of Cryptology and Information Security, 2017. [45] M. Kim, Y. Song, S. Wang, Y. Xia, and X. Jiang, “Secure logistic regression based
[18] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky, “Nvidia a100 on homomorphic encryption: Design and evaluation,” JMIR medical informatics,
tensor core gpu: Performance and innovation,” IEEE Micro, vol. 41, no. 2, 2021. vol. 6, no. 2, 2018.
[19] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of [46] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University
complex Fourier series,” Mathematics of computation, vol. 19, no. 90, 1965. of Toronto, Tech. Rep., 2009.
[20] D. B. Cousins, K. Rohlo�, and D. Sumorok, “Designing an FPGA-accelerated [47] M. S. Lam, “Software pipelining,” in Proceedings of the ACM SIGPLAN Conference
homomorphic encryption co-processor,” IEEE Transactions on Emerging Topics in on Programming Language Design and Implementation (PLDI), 1988.
Computing, vol. 5, no. 2, 2017. [48] Y. Lecun, L. Bottou, Y. Bengio, and P. Ha�ner, “Gradient-based learning applied
[21] D. B. Cousins, J. Golusky, K. Rohlo�, and D. Sumorok, “An FPGA co-processor im- to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, 1998.
plementation of homomorphic encryption,” in Proceedings of the IEEE Conference [49] V. Lyubashevsky, C. Peikert, and O. Regev, “On ideal lattices and learning with er-
on High Performance Extreme Computing (HPEC), 2014. rors over rings,” in Annual International Conference on the Theory and Applications
[22] D. B. Cousins, K. Rohlo�, C. Peikert, and R. Schantz, “An update on SIPHER of Cryptographic Techniques, 2010.
(Scalable Implementation of Primitives for Homomorphic EncRyption) - FPGA [50] L. Marchal, B. Simon, and F. Vivien, “Limiting the memory footprint when dy-
implementation using Simulink,” in Proceedings of the IEEE Conference on High namically scheduling dags on shared-memory platforms,” Journal of Parallel and
Performance Extreme Computing (HPEC), 2012. Distributed Computing, vol. 128, 2019.
[23] DARPA, “DARPA Selects Researchers to Accelerate Use of Fully Homomorphic [51] A. C. Mert, E. Öztürk, and E. Savaş, “Design and implementation of a fast and
Encryption,” https://www.darpa.mil/news-events/2021-03-08 archived at https: scalable NTT-based polynomial multiplier architecture,” in 2019 22nd Euromicro
//perma.cc/6GHW-2MSN, 2021. Conference on Digital System Design (DSD), 2019.
[24] S. Dasgupta, T. Singh, A. Jain, S. Na�ziger, D. John, C. Bisht, and P. Jayaraman, [52] A. C. Mert, E. Öztürk, and E. Savaş, “Design and Implementation of Encryp-
“Radeon RX 5700 Series: The AMD 7nm Energy-E�cient High-Performance tion/Decryption Architectures for BFV Homomorphic Encryption Scheme,” IEEE
GPUs,” in Proceedings of the IEEE International Solid-State Circuits Conference Transactions on Very Large Scale Integration (VLSI) Systems, 2019.
(ISSCC), 2020. [53] V. Migliore, C. Seguin, M. M. Real, V. Lapotre, A. Tisserand, C. Fontaine, G. Gog-
[25] R. Dathathri, B. Kostova, O. Saarikivi, W. Dai, K. Laine, and M. Musuvathi, “EVA: niat, and R. Tessier, “A high-speed accelerator for homomorphic encryption using
An encrypted vector arithmetic language and compiler for e�cient homomorphic the karatsuba algorithm,” ACM Trans. Embedded Comput. Syst., vol. 16, no. 5s,
computation,” in Proceedings of the 41st ACM SIGPLAN Conference on Programming 2017.
Language Design and Implementation, 2020. [54] R. T. Moenck, “Practical fast polynomial multiplication,” in Proceedings of the
[26] R. Dathathri, O. Saarikivi, H. Chen, K. Laine, K. Lauter, S. Maleki, M. Musuvathi, third ACM symposium on Symbolic and algebraic computation, 1976.
and T. Mytkowicz, “CHET: an optimizing compiler for fully-homomorphic neural- [55] P. L. Montgomery, “Modular multiplication without trial division,” Mathematics
network inferencing,” in Proceedings of the 40th ACM SIGPLAN Conference on of computation, vol. 44, no. 170, 1985.
Programming Language Design and Implementation, 2019. [56] NVIDIA, “NVIDIA DGX station A100 system architecture,” https://images.
[27] Y. Doröz, E. Öztürk, and B. Sunar, “Accelerating fully homomorphic encryption nvidia.com/aem-dam/Solutions/Data-Center/nvidia-dgx-station-a100-system-
in hardware,” IEEE Trans. Computers, vol. 64, no. 6, 2015. architecture-white-paper.pdf archived at https://perma.cc/3CSS-PXU7, 2021.
[28] J. Fan and F. Vercauteren, “Somewhat practical fully homomorphic encryption.” [57] E. Ozer, S. Banerjia, and T. M. Conte, “Uni�ed assign and schedule: A new ap-
IACR Cryptol. ePrint Arch., 2012. proach to scheduling for clustered register �le microarchitectures,” in Proceedings
[29] A. Feldmann, N. Samardzic, A. Krastev, S. Devadas, R. Dreslinski, C. Peikert, and of the 31st annual ACM/IEEE International Symposium on Microarchitecture, 1998.
D. Sanchez, “F1: A fast and programmable accelerator for fully homomorphic [58] G. Passas, M. Katevenis, and D. Pnevmatikatos, “Crossbar NoCs are scalable
encryption,” in Proceedings of the 54th annual ACM/IEEE International Symposium beyond 100 nodes,” IEEE Transactions on Computer-Aided Design of Integrated
on Microarchitecture, 2021. Circuits and Systems, vol. 31, no. 4, 2012.
[30] J. A. Fisher, “Very long instruction word architectures and the ELI-512,” in Pro- [59] C. Peikert, “A decade of lattice cryptography,” Foundations and Trends in Theoret-
ceedings of the 10th annual international symposium on Computer architecture, ical Computer Science, vol. 10, no. 4, 2016.
1983. [60] M. Pellauer, Y. S. Shao, J. Clemons, N. Crago, K. Hegde, R. Venkatesan, S. W. Keck-
[31] H. L. Garner, “The residue number system,” in Papers presented at the the March ler, C. W. Fletcher, and J. Emer, “Bu�ets: An e�cient and composable storage idiom
3-5, 1959, Western Joint Computer Conference, 1959. for explicit decoupled data orchestration,” in Proceedings of the Twenty-Fourth
[32] W. Ge, M. Zhao, C. Wu, and J. He, “The design and implementation of ddr International Conference on Architectural Support for Programming Languages and
phy static low-power optimization strategies,” in Communication Systems and Operating Systems, 2019.
Information Technology, 2011. [61] Y. Polyakov, K. Rohlo�, and G. W. Ryan, “Palisade lattice cryptography library
[33] C. Gentry et al., A fully homomorphic encryption scheme. Stanford University, user manual,” Cybersecurity Research Center, New Jersey Institute ofTechnology
2009, vol. 20, no. 9. (NJIT), Tech. Rep, vol. 15, 2017.

14
[62] T. Pöppelmann, T. Oder, and T. Güneysu, “High-performance ideal lattice-based [70] S.-A.-A. Touati, “Register saturation in instruction level parallelism,” International
cryptography on 8-bit atxmega microcontrollers,” in International Conference on Journal of Parallel Programming, vol. 33, 2005.
Cryptology and Information Security in Latin America, 2015. [71] F. Turan, S. Roy, and I. Verbauwhede, “HEAWS: An Accelerator for Homomorphic
[63] Rambus Inc., “White paper: HBM2E and GDDR6: Memory solutions for AI,” 2020. Encryption on the Amazon AWS FPGA,” IEEE Transactions on Computers, 2020.
[64] B. Reagen, W. Choi, Y. Ko, V. Lee, G.-Y. Wei, H.-H. S. Lee, and D. Brooks, “Cheetah: [72] M. van der Hagen and B. Lucia, “Practical encrypted computing for iot clients,”
Optimizations and methods for privacy preserving inference via homomorphic arXiv preprint arXiv:2103.06743, 2021.
encryption,” in Proceedings of the 27th IEEE international symposium on High [73] W. Wang, Z. Chen, and X. Huang, “Accelerating leveled fully homomorphic
Performance Computer Architecture (HPCA-27), 2021. encryption using gpu,” in Proceedings of the IEEE International Symposium on
[65] M. S. Riazi, K. Laine, B. Pelton, and W. Dai, “HEAX: An architecture for computing Circuits and Systems (ISCAS), 2014.
on encrypted data,” in Proceedings of the 25th international conference on Architec- [74] W. Wang, Y. Hu, L. Chen, X. Huang, and B. Sunar, “Accelerating fully homo-
tural Support for Programming Languages and Operating Systems (ASPLOS-XXV), morphic encryption using gpu,” in Proceedings fo the IEEE conference on High
2020. Performance Extreme Computing (HPEC), 2012.
[66] S. S. Roy, F. Turan, K. Järvinen, F. Vercauteren, and I. Verbauwhede, “Fpga- [75] W. Wang, Y. Hu, L. Chen, X. Huang, and B. Sunar, “Exploring the feasibility of
based high-performance parallel architecture for homomorphic computing on fully homomorphic encryption,” IEEE Transactions on Computers, vol. 64, no. 3,
encrypted data,” in Proceedings of the 25th IEEE international symposium on High 2013.
Performance Computer Architecture (HPCA-25), 2019. [76] Y. Wang, Z. Ma, and F. Yu, “Pipelined algorithm and modular architecture for
[67] S. S. Roy, F. Vercauteren, N. Mentens, D. D. Chen, and I. Verbauwhede, “Compact matrix transposition,” IEEE Transactions on Circuits and Systems II: Express Briefs,
ring-LWE cryptoprocessor,” in International workshop on cryptographic hardware vol. 66, no. 4, 2018.
and embedded systems, 2014. [77] W. Xu and R. Tessier, “Tetris: a new register pressure control technique for VLIW
[68] Q. Shang, Y. Fan, W. Shen, S. Shen, and X. Zeng, “Single-port sram-based transpose processors,” ACM SIGPLAN Notices, vol. 42, no. 7, 2007.
memory with diagonal data mapping for large size 2-d dct/idct,” IEEE Transactions [78] B. Zhang, Z. Ma, and F. Yu, “A novel pipelined algorithm and modular architecture
on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 11, 2014. for non-square matrix transposition,” IEEE Transactions on Circuits and Systems
[69] Z. M. Smith, E. Lostri, and J. A. Lewis, “The Hidden Costs of Cybercrime,” Center II: Express Briefs, 2020.
for Strategic and International Studies, Tech. Rep., 2020.

15

You might also like