0% found this document useful (0 votes)
7 views18 pages

Beacon

The document presents novel specialized protocols for secure floating-point training using secure 2-party computation (2PC), significantly improving performance and precision compared to existing libraries. The implementation, called B EACON, outperforms state-of-the-art methods by over 6× and is designed to facilitate secure training of deep neural networks while maintaining data privacy. The paper details the technical advancements in handling compound operations and optimizing bitwidths for efficient computations, particularly in low bitwidth formats like BFloat16.

Uploaded by

panuwushoo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views18 pages

Beacon

The document presents novel specialized protocols for secure floating-point training using secure 2-party computation (2PC), significantly improving performance and precision compared to existing libraries. The implementation, called B EACON, outperforms state-of-the-art methods by over 6× and is designed to facilitate secure training of deep neural networks while maintaining data privacy. The paper details the technical advancements in handling compound operations and optimizing bitwidths for efficient computations, particularly in low bitwidth formats like BFloat16.

Uploaded by

panuwushoo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 18

Secure Floating-Point Training

Deevashwer Rathee1 , Anwesh Bhattacharya2 , Divya Gupta2 , Rahul Sharma2 , and Dawn Song1
1 University of California, Berkeley
2 Microsoft Research

Abstract Cleartext ML frameworks like PyTorch, when running on


Secure 2-party computation (2PC) of floating-point arithmetic CPUs, decompose a training algorithm into many floating-
is improving in performance and recent work runs deep learn- point scalar operations or their SIMD2 (Single Instruction
ing algorithms with it, while being as numerically precise Multiple Data) counterparts that are present in Intel’s libraries.
as commonly used machine learning (ML) frameworks like By running efficient and precise 2PC protocols for these
PyTorch. We find that the existing 2PC libraries for floating- scalar/SIMD floating-point operations, one can get secure
point support generic computations and lack specialized sup- training implementations that are as precise as PyTorch (see
port for ML training. Hence, their latency and communication Section 4.2 for definition of precision) and have tractable
costs for compound operations (e.g., dot products) are high. latency, which is the approach taken by the recent work of
We provide novel specialized 2PC protocols for compound S EC F LOAT [60]. Note that S EC F LOAT [60] is the current
operations and prove their precision using numerical analysis. state-of-the-art in 2PC of floating-point that provably meets
Our implementation B EACON outperforms state-of-the-art the precision specified by Intel’s libraries and outperforms
libraries for 2PC of floating-point by over 6×. all prior 2PC libraries for secure floating-point (ABY [22],
EMP [3], and MP-SPDZ [40]) by 3 − 230×. We observe that
while running ML training algorithms with S EC F LOAT, over
1 Introduction 99% of the execution time is spent in linear layers, i.e., in
convolutions and fully connected layers (Appendix A), which
Deep Neural networks (DNNs) are now being deployed in evaluate compound operations such as matrix multiplications
domains with sensitive data, such as healthcare and finance. or dot products3 . Hence, to reduce the 2PC latency of ML
The more diverse data a DNN is trained on, the more useful training, we focus on building specialized secure protocols
it becomes. While diverse data can be obtained by pooling for these compound operations.
data of multiple organizations, it is challenging to do so due
to privacy policies that restrict data sharing.
This motivates the problem of secure training that allows 1.1 Our contributions
many mutually distrustful parties to collaboratively learn a
We provide novel specialized protocols for compound op-
DNN model of their secret data without revealing anything
erations occurring in ML training that are as precise as
about their data. At the end of secure training, each party
S EC F LOAT-based protocols while being much more efficient.
learns the model and nothing else1 about the data of the other
Note that, apart from S EC F LOAT, all other prior works in se-
parties beyond what can be deduced from the model. The
cure training use approximations that lack formal precision
seminal work of SecureML [55] showed that secure training
guarantees [16, 17, 41, 43, 54, 55, 64, 66, 67]. Among these,
of DNNs can be solved by secure multi-party computation
KS22 [41] is the state-of-the-art whose approximations have
(MPC) and specifically by secure 2-party computation (2PC)
high efficiency and have been shown to empirically match
in the client-server model [27, 56, 57] with two non-colluding
the end-to-end training accuracy provided by floating-point
servers (Section 3.3). In general, 2PC protocols [29, 73] allow
training on MNIST [46]/CIFAR [44] datasets. We show that
two mutually distrustful parties to compute functions over
the latency overheads of our provably precise protocols are
their secret inputs with the guarantee that nothing beyond the
< 6× over KS22. In particular,
function output is revealed about the sensitive inputs.
2 For example, a SIMD addition of two vectors (x , . . . , x ) and (y , . . . , y )
1 n 1 n
1 Approaches based on trusted hardware and federated learning suffer from gives (x1 + y1 , . . . , xn + yn ).
additional leakage and do not provide this cryptographic security guarantee. 3 Also known as inner product.
• We provide the first specialized 2PC protocols for floating- bitwidth of intermediate values, and in turn, the cost of sub-
point compound operations (e.g., dot products) and formally sequent operations. Reducing normalizations and rounding
prove their accuracy using numerical error analysis. operations while guaranteeing precision and performance is
• We implement our protocols in B EACON4 and design them challenging. We achieve our results by working over carefully
to be drop-in replacements for standard tensor operations determined minimal bitwidths needed for intermediate com-
(Section 8). B EACON is the first PPML library to support putations that balance precision and performance. We note
training over various floating-point representations, e.g., that even though protocols for all compound operations such
Google’s BFloat16, Nvidia’s TensorFloat FP19 and stan- as summation and dot products are designed with the same
dard 32-bit floating-point FP32 (Appendix D). goal of minimizing these expensive steps, they require differ-
ent insights and numerical analyses to determine the exact
• B EACON enables push button secure evaluation of PyTorch parameters for efficiency and to prove precision guarantees.
training algorithms. We evaluate B EACON on multiple mod-
els for MNIST and CIFAR-10 datasets. B EACON improves
the latency of secure training by > 6× over the state-of-the- 1.1.2 BFloat16 training
art libraries for 2PC of floating-point (Figure 5) and has The cleartext training algorithms are adopting low bitwidth
< 6× overhead over KS22. floating-point representations such as the 16-bit BFloat16 or
BF16 format. Compared to standard 32-bit floating-point rep-
Note that B EACON enables secure n-party training in the
resentation, FP32, BF16 uses the same number of exponent
standard client-server model with two non-colluding semi-
bits as FP32, i.e., 8, but reduces the number of bits in man-
honest servers [27, 55–57] (see Section 3.3). Researchers
tissa from 23 to 7. In BFloat16 training [37], BF16 numbers
that wish to use B EACON for floating-point tasks outside the
are used to store activations and the arithmetic happens in
context of secure training should note that we have chosen
FP32. For example, the specification of a matrix multiplica-
to omit support for special values like subnormals [28] in
tion in BFloat16 training says that given two input matrices
B EACON as Intel’s libraries with default compilation and prior
that are in BF16 format the output should be as precise as
2PC floating-point implementations, including S EC F LOAT,
the result of the following computation: convert the inputs to
also don’t support them [47, 60]. Now, we explain the main
FP32, then perform all the arithmetic in FP32, and then round
source of our efficiency gains at a high level.
the result to BF16. Kalamkar et al. [37] evaluate on many
ML models and shows that BFloat16 training matches the
1.1.1 Our protocols for precise compound operations accuracy of standard FP32 training. In fact, hardware manu-
facturers are putting native support for BFloat16 training in
In addition to a sign-bit, a floating-point value consists of an
CPUs, GPUs, TPUs, etc. Note that the decomposition-based
integer and a fixed-point number, corresponding to the expo-
mechanism is incompatible with the compound operations in
nent and the mantissa, respectively. There are complex invari-
BFloat16 training. The compound operation is decomposed
ants which the exponent and the mantissa need to maintain to
into scalar/SIMD operations over either FP32 or BF16. The
qualify as a valid floating-point value. During arithmetic oper-
former loses any performance advantage that secure BFloat16
ations, the intermediate results do not respect these invariants.
training can provide over secure FP32 training and the latter
Hence, to return a valid floating-point output, these invariants
loses precision as the operations on BF16 are over 65000×
need to be restored through expensive rounding and normal-
less precise5 than the operations over FP32, as required by
ization steps (Section 4). These steps are core to floating-point
the above specification.
arithmetic, are necessary for precision, and are also the main
In our protocols for BF16 compound operations, interme-
reason behind performance overheads associated with 2PC
diate values use impure representations that are neither FP32
of floating-point. In particular, for adding two floating-point
nor BF16. The bitwidths of these represenatations are care-
numbers using S EC F LOAT, over 82% of the time is spent in
fully chosen to ensure that the computations are exact. These
rounding and normalization (Appendix B). Now consider a
bitwidths are still lower than those required for FP32 compu-
compound operation, e.g., a summation on an n + 1-length
tations and outperform B EACON’s protocols for FP32. The
vector, i.e., given x = (x0 , x1 , . . . , xn ) compute ∑ni=0 xi . Decom-
impure representations do not satisfy the preconditions re-
posing a summation as n floating-point additions will require
quired by the protocols designed for pure representations.
n rounding and n normalization steps. In contrast, we design
Hence, as another technical contribution we generalize our
our specialized protocols for compound operations to only
underlying protocols to handle impure representations.
require a single normalization and rounding operation while
guaranteeing numerically precise results. Thus, B EACON provides two improvements over decom-
Every normalization that is not done is a threat to cor- 5 A floating-point representation with q-bit matissa incurs a relative round-
rectness and every rounding operation omitted increases the ing error of 2−q−1 [28]. Since BF16 has 7 mantissa bits, it incurs a relative
rounding error of 2−7−1 , while FP32 with 23-bit mantissa only incurs 2−23−1 ,
4 Implementation available at https://github.com/mpc-msri/EzPC. which is 65536× lower.
posing BF16 compound operations to scalar FP32 operations PyTorch can incur a relative error of εκn where ε is machine
and running them with S EC F LOAT. For example, while sum- epsilon [28] of the floating-point representation and κ is the
ming 2000 BF16 values, we can either use B EACON’s sum- condition number [65] of the summation problem (see The-
mation over FP32 or B EACON’s specialized summation over orem 1 for exact definition), which is a real-valued quantity
BF16. The former is 8× faster than S EC F LOAT and the latter that is independent of how summation is implemented in
is 2× faster than the former. Overall, our specialized protocol finite-precision floating-point.
in B EACON for BF16 is 16× faster than S EC F LOAT for this To address this performance bottleneck in summation,
task. Finally, our protocols for BFloat16 training are parame- our first idea is to perform intermediate computations in
terized on the number representation and directly generalize to large-enough bitwidths to replicate exact real arithmetic
TensorFloat training, which is the same as BFloat16 training followed by one final normalization and rounding step.
except that it uses Nvidia’s 19-bit TensorFloat representation However, this approach requires the bitwidth to depend on
(8-bit exponent and 10-bit mantissa), FP19, instead of BF16. the difference in magnitudes of the largest and the smallest
values being added and could be as large as 276 + log n bits
1.2 Organization for FP32. Hence, this turns out to be quite expensive in 2PC
and also wasteful. In light of this, we set our goal to have
The rest of the paper is organized as follows. Section 2 de- smaller worst case relative error compared to traditional
scribes our protocols at a high level. Section 3 provides back- approaches, say, close to εκ. With this error in mind, we
ground on 2PC, our threat model, and 2PC protocols over carefully determine a threshold such that we can pick a
integers. Section 4 provides background on floating-point reasonable bitwidth ℓ to ensure that (i) all values with
and 2PC of scalar/SIMD operations over floating-point. Sec- magnitude larger than the threshold are exactly summed
tion 5 provides our protocols for compound operations over and (ii) ignoring all values with magnitude smaller than the
a unique number representation, e.g., operations that arise in threshold leads to at most εκ error. After doing summation of
FP32 training. Section 6 provides our protocols for compound values that matter in ℓ bits, we perform one normalization
operations that switch number representations, e.g., operations and rounding. The final rounding leads to additional ε error
that arise in BFloat16 training. Section 7 argues security of and our overall relative error is at most ε(κ + 1). We note that
B EACON, Section 8 discusses implementation details, and unlike all traditional approaches, our error is independent of
Section 9 evaluates B EACON on ML tasks. Section 10 dis- n. Our approach also results in 5× fewer communication
cusses related work and Section 11 concludes with directions rounds over S EC F LOAT (Section 5.1) for n = 1000.
for future research.
BFloat16 training. Recall from Section 1.1.2 that in
2 Technical overview BFloat16 (or BF16) training, dot products (and matrix multi-
plications) multiply and sum BF16 values in FP32. However,
In this section, we provide a high level intuition of our simply performing all arithmetic in FP32 is wasteful and we
techniques. To ease exposition, we abstract out the actual miss out on any performance benefits of using BF16. Our
floating-point representation (Section 4) that uses exponents starting point is the observation that the precision of the exact
and mantissas. product of two BF16 values (14-bits) is smaller than the preci-
sion of FP32 (23-bits), and thus, we can compute over smaller
Novel protocol for summation. For secure training with intermediate bitwidths than in FP32. However, our FP32 sum-
S EC F LOAT, linear layers (i.e., matrix multiplications and con- mation protocol expects a precision of 23-bits. Consequently,
volutions) are the performance bottlenecks and consume 99% it will pick a larger threshold leading to more values being
of the runtime and communication (Appendix A). Here, over ignored, and a higher (and unacceptable) final error. Hence,
85% (Appendix B) of matrix multiplication runtime is spent to provide the same guarantees as the underlying FP32 com-
in summations (i.e., computing ∑ni=0 xi ), which makes sum- putation, we artificially increase the precision of intermediate
mation the main performance bottleneck of secure training. products, which were computed exactly, to match that of FP32.
As summarized in Section 1.1.1, traditional approaches We also remove the normalization step used in multiplications.
for summation, including the ones used by PyTorch, require Since this violates the input precondition of summation de-
n rounding and n normalization steps that are expensive in scribed above, we need to further generalize our summation
2PC and are the main performance bottlenecks. However, protocol to accept unnormalized inputs without losing its ben-
normalization and rounding are crucial for two reasons: (i) efits. Overall, we meet the specified precision and obtain a
preventing overflows while operating on finite bits, and (ii) performance improvement of 1.7× over dot product (length-
subsequent operations guarantee precision assuming a normal- 1000 vectors) in FP32 (Table 7) that can be attributed to use
ized floating-point input. Repeated rounding also leads to ag- of lower bitwidths in both the multiplications for intermediate
gregation of rounding errors, and final relative error depends products as well as the summation, and avoiding the use of
on n, the length of the summation performed. In particular, operations like rounding.
Finally, even for scalar/SIMD non-linear activations such 2PC protocols, for operations in machine learning, that go
as Sigmoid and Tanh, the specification requires computations from shares of input to shares of output securely.
over FP32 followed by rounding to BF16. We exploit the 2PC threat model. Our threat model is same as SecureML [55]
lower bitwidth of BF16 and domain knowledge to provide and considers a static probabilistic polynomial-time (PPT)
efficient protocols that beat the approach of computing inter- semi-honest adversary. It corrupts one of the parties (P0 or P1 )
mediate results over FP32 by 5×. at the beginning of the protocol and tries to learn information
about honest party’s sensitive input while following the pro-
tocol specification faithfully. We argue security against this
3 Preliminaries adversary in the standard simulation paradigm [13, 29, 48]
that argues indistinguishability of the views of the adversary
We define notation, secret sharing, threat model for secure
in the real execution and ideal execution in the presence of a
training followed by 2PC building blocks over integers.
trusted third party that takes inputs and provides the function
outputs alone.
3.1 Notation Client-Server Model for secure training of DNNs [55] consid-
ers n clients with sensitive training data (Ci has xi , i ∈ [n]) and
We denote the computational security parameter by λ and [n] 2 inputless servers, S0 and S1 . The goal is for the clients to
denotes the set {1, . . . , n}. Variable names with Roman letters learn the output of an ML training algorithm y = f (x1 , . . . , xn ).
(a, b, etc.) are integer-typed and those with Greek letters (α, β, We consider an static semi-honest adversary that corrupts at
etc.) are floating-point. The indicator function 1{P} returns 1 most one server and n − 1 clients. For secure training in this
if the predicate P is true and 0 otherwise; x||y concatenates setting, the clients first secret share their inputs among two
bit-strings x and y. An ℓ-bit integer x ∈ Z2ℓ can be interpreted servers S0 and S1 , who run the 2PC protocol for training to ob-
as an unsigned integer uint(x) = ζℓ (x), where ζℓ is a lossless tain shares of y that are sent to the clients. Clients reconstruct
lifting operator from bitstrings in Z2ℓ to integers in Z. Using y to learn the output of the training algorithm.
the 2’s complement encoding, x can also be interpreted as a
signed integer int(x) = uint(x) − MSB(x) · 2ℓ , where MSB(x)
is the most-significant bit of x. 3.4 Integer Building Blocks
Fixed-point: An unsigned fixed-point number x ∈ Z2ℓ with
Table 1 describes the fixed-point or integer operations that
bitwidth ℓ and scale s represents the real number uint2s(x) ∈ Q. B EACON uses in its protocols for floating-point compound
operations. The communication and round costs6 of the cor-
3.2 Secret Sharing responding protocols are given in Table 2. Improvement in
these costs will directly improve B EACON. All of these op-
Secret sharing [9, 63] is a technique of distributing a secret erations except Round-Nearest (RN) have been considered
among a group of parties by allocating each party a share of and discussed in detail in the cited works. RN rounds an ℓ-bit
the secret. In this work, we consider 2-out-of-2 additive secret value to a (ℓ − s)-bit value. Note that S EC F LOAT [60] pro-
sharing where a secret x ∈ Z2ℓ is split into two shares ⟨x⟩ℓ = vided a more complex protocol that achieves round to nearest
(⟨x⟩ℓ0 , ⟨x⟩ℓ1 ) ∈ (Z2ℓ )2 and party Pi holds ⟨x⟩ℓi . Each secret share and ties to even, which was required for its correctness guar-
⟨x⟩ℓi has the property that it reveals nothing about the secret x antees. For our precision guarantees (Section 4.2), a simpler
in isolation but the shares can all be put together to reconstruct function RN(x, s) = ⌊ 2xs ⌉, that does round to nearest and ties
the secret x as follows: x = ⟨x⟩ℓ0 + ⟨x⟩ℓ1 mod 2ℓ . We use the are always rounded up suffices and is also cheaper in 2PC. It
s−1
superscript B to denote secret-shares of boolean values ∈ Z2 . is easy to see that ⌊ 2xs ⌉ = ⌊ x+22s ⌋, and hence, we implement
our protocol as Πℓ,s ℓ ℓ,s
RN (⟨x⟩ ) = ΠTR (⟨x⟩ + 2
ℓ s−1 ), where Πℓ,s
TR
3.3 2PC and Threat Model is the truncate-reduce protocol from [61] that rounds down by
truncating the lower s bits of x.
Secure 2-party computation (2PC) introduced by [29, 73]
allows two parties P0 and P1 to compute an agreed upon
function f on their sensitive inputs x and y, respectively. It 4 Floating-point Background
provides an interactive protocol with the strong guarantee
that the interaction reveals nothing about the sensitive inputs We review the basics of floating-point representation and the
beyond what can be inferred from the output. A common precision guarantees of floating-point implementations. Then
approach for 2PC is where parties begin by secret sharing we define the secret sharing of floating-point values and the
their inputs with the other party and run a protocol that takes helper protocols we use.
shares of (x, y) to securely generate shares of f (x, y). Then, 6 The MSB-to-Wrap optimization from SIRNN [61] is applicable to all
the parties exchange the shares of the output and reconstruct instances of extension and multiplication, and thus, we report the optimized
the output. With this design in mind, it is sufficient to construct costs for these building blocks.
Notation
Functionality Description
Algorithm Protocol
Integer/Fixed-point Building Blocks
Multiplexer [62] z=c?x:y ⟨z⟩ℓ = ΠℓMUX (⟨c⟩B , ⟨x⟩ℓ , ⟨y⟩ℓ ) z = x if c = 1, else z = y
Less-Than [62] c = 1{x < y} ⟨c⟩B = ΠℓLT (⟨x⟩ℓ , ⟨y⟩ℓ ) Checks if x < y, x, y ∈ Z2ℓ
c = 1{x < y} Checks if x < y and if x = y,
LT&EQ [60] ⟨c⟩B , ⟨e⟩B = ΠℓLT&EQ (⟨x⟩ℓ , ⟨y⟩ℓ )
e = 1{x = y} x, y ∈ Z2ℓ
m,n
Zero-Extension [61] y = ZXt(x, n) ⟨y⟩n = ΠZXt (⟨x⟩m ) ζn (y) = ζm (x) mod 2n , m ⩽ n
Round-Nearest (Section 3.4) y = RN(x, s) ⟨y⟩ℓ−s = Πℓ,s ℓ
RN (⟨x⟩ ) Upper ℓ − s bits of (x + 2s−1 )
Most Significant k, s.t. xk = 1 ∧ ∀i > k, xi = 0,
k, K = MSNZB(x) ⟨k⟩ℓ , ⟨K⟩ℓ = ΠℓMSNZB (⟨x⟩ℓ )
Non-Zero Bit [61, 72] K = 2ℓ−1−k
Lookup Table (LUT) [23] y = L(x), y ∈ Z2n ⟨y⟩n = Πm,n m
LUT (L, ⟨x⟩ ) index x, LUT L, z ∈ Z2n
Unsigned Mixed-bitwidth m,n,ℓ ζℓ (z) = ζm (x) · ζn (y) mod 2ℓ ,
z = x ∗ℓ y ⟨z⟩ℓ = ΠUMult (⟨x⟩m , ⟨y⟩n )
Multiplication [61] ℓ ⩾ max(m, n)
Find Maximum [60] y = maxi∈[n] xi ⟨y⟩ℓ = Πℓ,n ℓ
max ({⟨xi ⟩ }i∈[n] ) Find the maximum xi
Floating-point Building Blocks
p,q,Q
Normalize [60] m′ , e′ = Normalize p,q,Q (m, e) ⟨m′ ⟩Q+1 , ⟨e′ ⟩ p+2 = ΠNormalize (⟨m⟩Q+1 , ⟨e⟩ p+2 ) Section 4.4.1
p,q,Q
Round&Check [60] m′ , e′ = Round&Check p,q,Q (m, e) ⟨m′ ⟩q+1 , ⟨e′ ⟩ p+2 = ΠRound&Check (⟨m⟩Q+1 , ⟨e⟩ p+2 ) Section 4.4.2
p,q
Clip [60] α = Clip p,q
(z, s, e, m) ⟨α⟩FP(p,q) = ΠClip (⟨z⟩B , ⟨s⟩B , ⟨e⟩ p+2 , ⟨m⟩q+1 ) Section 4.4.3

Table 1: 2PC building blocks used by B EACON.

Protocol Communication Rounds 4.2 Precision of floating-point operations


ΠℓMUX [62] 2λ + 2ℓ 2
ΠℓLT [62] λℓ + 14ℓ log(ℓ)

ΠLT&EQ [60] λ(ℓ + 3) + 14ℓ + 60 log(ℓ) + 2 The IEEE-754 standard precisely defines the output of FP32
Πm,n 2λ − m + n + 2
scalar operations, and the handling of error conditions (under-
ZXt [61] 4
ℓ,s flows, overflows, etc.) using special values (infinities, NaNs,
ΠRN (Section 3.4) λ(s + 1) + ℓ + 13s log(s) + 2
and subnormals) and operations on these special values (e.g.,
ΠℓMSNZB [61, 72] λ(5ℓ − 4) + ℓ2 2
∞ − ∞ is NaN). However, the IEEE standard makes no com-
Πm,n
LUT [23] 2λ + 2m n 2
ments about the compound operations, which is the subject
m,n,ℓ λ(2µ + 6) + µ(µ + 2ν)
ΠUMult [61] 4 of our study here.
+3µ + 2ν + 4
Πℓ,n
max [60] (n − 1) · (ΠℓLT + ΠℓMUX ) log(n) · (ΠℓLT + ΠℓMUX )
p,q,Q We make two remarks. First, in numerical analysis text-
ΠNormalize [60] λ(7Q + 3) + (Q + 1)2 4
books [65], the precision of compound operations is estab-
p,q,Q λ(2Q − q + 6) + 28Q
ΠRound&Check [60] log(Q + 1) + 4 lished with pen-and-paper proofs that bound the relative error
−11q + 2p + 21
p,q
ΠClip [60] λ(p + 6) + 2q + 16p + 34 log(p + 2) + 2 between the output of a floating-point implementation and
the ideal Real result. Second, in ML training implementations
Table 2: Cost of 2PC building blocks used by B EACON. All inside PyTorch, the floating-point implementations lack the
communication is in bits. µ = min(m, n), ν = max(m, n). handlers for special values. Hence, to prove the precision of
our protocols, we prove bounds on the relative error between
the output α computed by the protocol and the ideal Real
4.1 Floating-Point Representation result r, assuming that special values do not arise during the
computations. Recall that relative error between α and r is
According to the IEEE-754 standard [4], a floating-point | α−r
r |. We say that a protocol for a compound operation is
value α = (−1)s · 2e−bias · (1 + 2mq ) is represented as s||e||m ∈ precise if the worst case relative error of the protocol output is
{0, 1} p+q+1 , where s is the sign-bit, e is the (biased) p-bit the same or lower than the worst case relative error of the out-
exponent, bias = 2 p−1 − 1, and m is the q-bit mantissa. There put produced by the standard textbook implementations of the
are several floating-point representations defined by tuples operator. Note that all floating-point implementations incur
(p, q), which define the range and precision, respectively. For numerical errors, including the implementations in PyTorch
instance, the representation tuple for FP32 is (8, 23), for BF16 and B EACON, and performing several operations in sequence
is (8, 7), and for FP19 is (8, 10). Rounding to (p, q) represen- accumulates the worst case errors. Given two implementa-
tation introduces a relative error of at most ε = 2−q−1 , that is tions, the one with the lower (or, same) worst case relative
referred to as the machine epsilon. error is said to be more precise.
4.3 Secret sharing of floating-point values 4.4.3 Clip
Identically to [60], we represent a floating-point value α Clipping is used to set results that have a smaller magnitude
parameterized by p, q ∈ Z+ as a 4-tuple (α.z, α.s, α.e, α.m) than the smallest representable value to 0. For instance, FP32
where α.z = 1{α = 0} is the zero-bit, α.s ∈ {0, 1} is the can only represent normal values in the range [2−126 , 2128 ).
p,q
sign-bit (set if α ⩽ 0), α.e ∈ {0, 1} p+2 is the (unbiased) Thus, our clipping protocol ΠClip sets inputs with exponent
exponent7 taking values in [−2 p−1 + 1, 2 p−1 ), and α.m ∈ < −126 to 0 (Appendix F.3). Note that handling subnormal
{0, 1}q+1 is the normalized fixed-point mantissa with scale numbers, i.e., those with magnitude less than 2−126 is straight-
q taking values from [2q , 2q+1 − 1] ∪ {0}. Note that while forward (subnormals are fixed-point numbers with scale 126),
IEEE-754 standard stores just q bits of mantissa m′ as the but it leads to additional performance overheads and is neither
leading bit is always 1 (implicitly), we instead explicitly supported by B EACON nor by the S EC F LOAT baseline. Some
store the mantissa in q + 1 bits along with the leading bit, GPUs don’t support subnormals at all and others provide the
i.e, α.m = 2q + m′ . Consistent with this notation, a secret “fast mode" that, like B EACON, clips subnormals to zero [70].
shared floating-point value α is a tuple of shares ⟨α⟩FP(p,q) =
(⟨α.z⟩B , ⟨α.s⟩B , ⟨α.e⟩ p+2 , ⟨α.m⟩q+1 ). Finally, the Real value 5 Compound Operations in ML
of α, i.e., JαK is zero if α.z = 1 and (−1)α.s · 2int(α.e) · uint2(α.m)
q
otherwise. We describe our protocols for secure compound operations.
We begin by discussing our novel protocol for Summation
that sums up the values in a vector and is the backbone for all
4.4 Floating-point Building Blocks
linear layers. While the techniques from this section work for
Here we discuss sub-operations that are used by scalar opera- general (p, q), it might be useful for a reader to keep in mind
tions over floating-point in [60] and our compound operations. the example of FP32, i.e., p = 8, q = 23. We defer the discus-
2PC protocols for these can be built using the integer building sion of non-linear operations to Appendix G (ReLU, Softmax,
blocks in Table 1 (see Appendix F). etc.) and focus on linear layers where 99% of training time is
spent. Later, in Section 6, we will discuss the techniques and
4.4.1 Normalize further optimizations specific to BFloat16.

A mantissa is said to be normalized if its most significant bit 5.1 Summation


(MSB) is 1. During floating-point operations, the mantissa
can become unnormalized, i.e., its bit-representation will have Given a floating-point vector α = (α1 , . . . , αn ), where each αi
z zeros in the most-significant-bits. A normalization step is a floating-point number, Summation computes γ = ∑ni=1 αi .
adjusts the exponent by decrementing it by z and left shifts Before we discuss 2PC protocols for Summation for vectors
the mantissa by z to get rid of the leading zeros. Note that of length n > 2, we first recall the (high-level) steps involved
computing z requires computing the location of the most for the case of n = 2, i.e., the addition of 2 floating-point
significant non-zero bit (MSNZB) of the mantissa, which is values parameterized by (p, q):
an expensive operation in 2PC (Table 1). The normalization 1. Compare the exponents of the operands and compute their
p,q,Q
protocol ΠNormalize (Appendix F.1) takes as input a p + 2-bit difference, say d.
exponent and a Q + 1-bit mantissa with scale q. It returns
2. Left-shift the mantissa of the operand with larger exponent
a normalized mantissa over Q + 1 bits with scale Q and the
by d. This step aligns both the mantissas with the corre-
adjusted exponent.
sponding exponent as the smaller exponent.
3. Add/subtract the aligned mantissas based on the sign bits.
4.4.2 Round&Check
4. Use Normalize algorithm (Section 4.4) to normalize the
We need to round a normalized mantissa m ∈ [2Q , 2Q+1 ) in mantissa to lie in [1, 2) and adjust the exponent (if needed).
higher precision Q to a normalized mantissa m′ ∈ [2q , 2q+1 ) in
5. Round the normalized mantissa, which is in higher precision,
lower precision q, while incurring a relative error ⩽ ε = 2−q−1 .
Q+1,Q−q to the required precision q (introducing error in computa-
However, simply using ΠRN can result in an unnormal- tion) using Round&Check algorithm (Section 4.4).
ized mantissa (= 2q+1 ) and overflow q + 1 bits if m is very
close to 2Q+1 (see Section V-B in [60] for details). Thus, we 6. Clip the values smaller than the smallest representable
p,q,Q
use Round&Check protocol ΠRound&Check that additionally floating-point number to 0 with Clip (Section 4.4).
performs this check and adjusts the exponent accordingly Given the floating-point addition algorithm, the most
(Appendix F.2). suitable8 option to compute floating-point Summation is
7 The exponent is stored in p + 2-bits like [60] to ensure that all exponent 8 Kahansummation [36] incurs worst-case error of 2εκ but requires O (n)
related comparisons can be performed without overflowing the modulus. normalization and rounding steps, making it more expensive.
Algorithm FPSum p,q,n ({αi }i∈[n] ) Thus, we set the mantissa of all values with exponent < ethr
1: ñ = log(n); ℓ = 2q + 2ñ + 3 to 0 (Step 5).
2: emax = maxi αi .e 2. For all i, left-shift the mantissa of αi by (αi .e − ethr ), essen-
3: ethr = emax − q − ñ − 1 tially setting the exponents of all elements to ethr (Step 6).
4: for i ∈ [n] do Note that a bitwidth of 2q + ñ + 2 suffices for all left-shifted
5: mi = (αi .e < ethr ) ? 0 : αi .m mantissas as (αi .e − ethr ) ⩽ q + ñ + 1.
(align)
6: mi = mi ≪ (αi .e − ethr )
(s) (align) (align) 3. To be able to simply add the mantissas now, we convert
7: mi = αi .s ? − mi : mi (align) (s)
(s) the unsigned mantissas mi to signed mantissas mi by
8: M (s) = ∑i∈[n] mi
multiplexing using the sign bit αi .s (Step 7).
9: (s, z) = LT&EQℓ (M (s) , 0)
4. Next, simply add the n aligned signed mantissas (Step 8).
10: M = s ? − M (s) : M (s)
This step requires additional ñ + 1 bits, i.e., needs to be done
11: (M1 , e1 ) = Normalize p,q,ℓ−1 (M, ethr )
in ℓ = 2q + 2ñ + 3 bits to ensure no overflows.
12: (M2 , e2 ) = Round&Check p,q,ℓ−1 (M1 , e1 )
13: Return Clip p,q (z, s, e2 , M2 ) 5. Find the sign of the resulting signed mantissa M (s) and
whether it is equal to 0 by a (signed) comparison with 0 and
Figure 1: Floating-Point Summation. set the sign bit s and zero bit z accordingly (Step 9).
6. Use the sign bit to revert the mantissa M (s) to unsigned
value (Step 10).
tree-sum (or pairwise summation), that recursively computes
n
2
n
2
7. Use Normalize algorithm (Section 4.4) to normalize the ℓ-
γn = γ n2 ,1 + γ n2 ,2 , where γ n2 ,1 = ∑i=1 αi and γ 2n ,2 = ∑i=1 α n2 +i . bit output mantissa (with scale q) to lie in [1, 2) with scale
However, keeping in mind the above blueprint, even tree-sum (ℓ − 1) and adjust the exponent accordingly (Step 11).
has three drawbacks: (i) number of cryptographically
expensive operations like normalization (step 4) and rounding 8. Round the normalized mantissa which is in higher preci-
(step 5), as well as clip (step 6), scale linearly with n, (ii) sion (ℓ − 1) to the required precision q (Step 12) using
the worst-case relative error compared to the computation Round&Check algorithm (Section 4.4).
over reals is proportional to log(n), and (iii) the number of 9. Finally, clip the values smaller than the smallest repre-
communication rounds are log(n) times the round complexity sentable floating-point number with (p, q) to 0 (Step 13)
for a single floating-point addition. using Clip algorithm (Section 4.4).

Our Algorithm. We propose an algorithm for floating-point The value of ethr has been carefully chosen to help obtain
Summation that addresses all three drawbacks. The key in- low numerical error in Theorem 1. This algorithm assumes
sight in our algorithm is that the expensive steps, i.e., normal- a relation between p and n. In particular, ethr needs to fit in
ization, rounding and clipping, can be performed only once (p + 1) bits. Concretely, for p = 8 (that is true for FP32, BF16,
for the whole vector, as opposed to once per addition. This is FP19), n needs to be smaller than 2105 that trivially holds for
because we only require the final output to be a normalized any practical summation.
floating-point value and no such guarantees are required for Observe that the algorithm invokes the expensive steps
intermediate values. This not only reduces the cryptographic of normalize, round and clip only once, instead of n times
cost greatly (by avoiding unnecessary normalization, rounding in the tree sum algorithm. Next, we report the precision of
and clips), but, as we formally prove, also makes the worst- our algorithm and a description of the corresponding 2PC
case error independent of n. Finally, the round complexity of protocol with its complexity.
the resulting 2PC protocol is log(n) times the rounds for com- Theorem 1. The relative error of Figure 1 is at most ε ·
parison and multiplexer on p + 2 bits, plus a constant. Note (κ + 1) + O(ε2 κ), where ε = 2−q−1 is machine epsilon and
that this is much lower than the tree-sum discussed above. In ∑i∈[n] |αi |
κ= is the condition number of summation.
the remainder of this section, we first describe our algorithm ∑i∈[n] αi
for Summation, then prove its worst case error bounds, and Proof. Let γ̂ and γ represent the output of FPSum and result
finally describe a 2PC protocol for Summation over secret of summation over reals, respectively. Let αk be the element
shared floating-point. with the largest exponent. Note that |αk | ⩾ 2emax and up until
Our Summation algorithm FPSum p,q,n is described for- the rounding step (before Step 12), FPSum computes the sum
mally in Figure 1 and has the following steps: γ′ of all elements ⩾ 2ethr exactly. Thus, the total magnitude
of elements set to 0 in Step 5 that are ignored by FPSum
1. Compare the exponents of all vector elements to find the
is at most n · 2ethr ⩽ 2emax · 2−q−1 ⩽ |αk | · 2−q−1 . Hence, |γ −
largest exponent emax (Step 2). For ñ = log(n), define ethr =
γ′ | ⩽ |αk | · ε, where ε = 2−q−1 is the machine epsilon, and the
emax −q− ñ−1 and threshold Γ = 2ethr such that only vector |γ−γ′ | |αk |·ε ∑i∈[n] |αi |·ε
elements with magnitude ⩾ Γ contribute to the sum (Step 3). relative error of γ′ is |∆| = |γ| ⩽ ∑i∈[n]αi
⩽ ∑i∈[n]αi
= ε · κ.
The final output γ̂ of FPSum is obtained after rounding γ′ (p, q) parameters as the desired output. We propose a gen-
to q mantissa bits. Thus, γ̂ = γ′ (1 + δ) [28], where |δ| ⩽ ε, and eralized Summation that can sum up unnormalized floating-
the absolute error of γ̂ w.r.t. γ is: point values. As we will see later, this generalized Summation
would play a crucial role in our protocols for dot products and
|γ − γ̂| = |γ − γ′ (1 + δ)| = |γ − γ(1 + ∆)(1 + δ)|
BFloat16 datatype as well.
= |γ(δ + ∆ + δ∆)| ⩽ |γ| · (ε + εκ + ε2 κ). First, we set up the problem statement. Input is α =
|γ−γ̂| (α1 , . . . , αn ), where each αi is an unnormalized floating-point
Thus, the relative error of γ̂ is ⩽ ε(κ + 1) + O(ε2 κ).
|γ| value with a (p + 2)-bit signed exponent αi .e ∈ [−2 p−1 +
1, 2 p − 1) and an unnormalized (unsigned) mantissa αi .m
Secure Protocol. Figure 2 describes the 2PC protocol cor- with bitwidth b, lower bound on MSNZB b′ , and scale sc such

responding to our Summation algorithm FPSum (Figure 1). that αi .m ∈ [2b , 2b ). That is, unlike a normalized mantissa,
For each step of the algorithm in Figure 1, we invoke the αi .m can have at most (b − b′ − 1) leading 0’s. The result of
corresponding 2PC protocol that takes the shares of the inputs summation must be a normalized floating-point number with
and returns the shares of the outputs (see Section 3.4 for the parameters (p, q), where sc ⩾ q.
2PC building blocks used for this transformation). With this, Our algorithm for generalized Summation follows the
the transformation from the algorithm to the 2PC protocol is blueprint of FPSum with the following modifications to deal
straightforward for all steps except the left-shift step needed to with unnormalized mantissas. First, after computing emax , we
align the mantissas (Step 6, Figure 1) and additional extension set ethr = emax − sc − ñ − (b − b′ ) so that we can still safely
needed before adding the signed mantissas in Step 8, Figure 1. ignore the values smaller than the threshold Γ = 2ethr . With
In Figure 2, steps 7-9 implement the left-shift of mantissas. this change, the maximum shift amount is sc + ñ + (b − b′ ) (as
Note that left-shift by r-bits can be implemented by multiply- opposed to q + ñ + 1) and consequently, we need to increase
ing the value by 2r (in sufficient bitwidth). Also, since the ℓ to 2b − b′ + sc + 2ñ + 1 to ensure exact summation of n
shift amount for a mantissa is secret-shared, we need to com- aligned and signed mantissas. Since normalization expects an
pute the power-of-2 operation for a secret-shared input. Since input with scale q and the sum of aligned mantissas M has
we have a bound on the shift amount, we can implement it effi- scale sc, we also change the scale of M to q and accordingly
ciently using a lookup table. In more detail, our algorithm left- add q − sc to the exponent.
shifts the ith mantissa mi by ri = (αi .e − ethr ) < (q + ñ + 2). We describe our algorithm for generalized Summation,
We store ri in k = ⌈log(q + ñ + 2)⌉ bits (using a modulo g-FPSum, parameterized by b, b′ , sc, p, q, n in Figure 3 and
by 2k in Step 7), which we assume is smaller than p + 2, prove below that it achieves the same error bounds as be-
as is the case for all floating-point representations used in fore. Moreover, g-FPSum can easily be transformed to a 2PC
b,b′ ,sc,p,q,n
practice. Consider a lookup table pow2 with 2k entries over protocol Πg-FPSum over secret shared input using the same
{0, 1}q+ñ+2 such that pow2[i] = 2i . We do a secret lookup steps as in Section 5.1.
in pow2 at index ri to learn shares of 2ri . Then, we mul-
(align) Theorem 2. The relative error of Figure 3 is at most ε + δκ +
tiply with mi to obtain mi in 2q + ñ + 2 bits. Next, to ∑i∈[n] |αi |
avoid overflows during addition of (aligned) mantissas and O(εδκ), where ε = 2−q−1 , δ = 2−sc−1 , and κ = ∑i∈[n] αi
is the
to accommodate the sign-bit, we require additional ñ + 1 bits. condition number of summation.
Hence, in Step 10, we extend the mantissa to ℓ = 2q + 2ñ + 3
bits. This protocol computes the same result as FPSum and Proof. Let αk be the element with the largest exponent, and
we state security in Section 7. let γ and γ′ represent the sum of all elements and sum of all
Complexity. As can be observed, the round complexity of elements with exponent ⩾ ethr , respectively. We only need
our protocol is equal to the round complexity of Πmax (to
p+2,n to argue that |γ − γ′ | ⩽ |αk | · δ, where δ = 2−sc−1 , and the
compute the maximum exponent) plus roughly the round rest of proof follows in the same way as the proof of The-

complexity of a single floating point addition. In contrast, the orem 1. It is easy to see that |αk | ⩾ 2b −s+emax and each
round complexity of tree-sum based protocol is roughly log n omitted term ⩽ 2b−s+ethr −1 . Thus, the sum of omitted terms

times the round complexity of a floating-point addition. Con- has a magnitude < n · 2b−s+ethr −1 = 2b−s+emax −sc−(b−b )−1 =

cretely, for FP32, we require 6 log n + 73 rounds compared to 2b −s+emax −sc−1 ⩽ |αk | · 2−sc−1 and |γ − γ′ | ⩽ |αk | · δ.
69 log n in S EC F LOAT that uses tree sum. Moreover, commu-
nication complexity of our protocol for FP32 and n = 2000 is 5.3 Dot product and matrix multiplication
4.4× lower than S EC F LOAT (Figure 5).
Given floating-point vectors α = (α1 , . . . , αn ) and β =
(β1 , . . . , βn ), DotProduct is defined as γ = ∑i∈[n] αi · βi . Thus,
5.2 Generalized Summation dot product can be realized naively by first computing the
Our Summation algorithm in Section 5.1 requires that the intermediate products γ = {αi · βi }i∈[n] using floating-point
p,q
inputs are normalized floating-point values with the same multiplication protocol for scalars, ΠFPMul , from [60], and
p,q,n
Protocol ΠFPSum
Input: i ∈ [n], ⟨αi ⟩FP(p,q) .
Output: ⟨γ⟩FP(p,q) s.t. γ = ∑ni=1 αi .
1: ñ = log(n); ℓ = 2q + 2ñ + 3
p+2 p+2,n
2: Call ⟨emax ⟩ = Πmax ({⟨αi .e⟩ p+2 }i∈[n] ) ▷ Find max exponent emax
p+2
3: Set ⟨ethr ⟩ = ⟨emax ⟩ p+2 − q − ñ − 1 ▷ Threshold exponent ethr = emax − q − ñ − 1
// Steps 5-11 implement steps 5,6,7 of Figure 1
4: for i ∈ [n] do
p+2
5: Call ⟨ci ⟩B = ΠLT (⟨αi .e⟩ p+2 , ⟨ethr ⟩ p+2 ) ▷ Compare αi ’s exponent with < ethr
q+1 q+1
6: Call ⟨mi ⟩ = ΠMUX (⟨ci ⟩B , 0, ⟨αi .m⟩q+1 ) ▷ Set mantissa of αi to 0 if αi .e < ethr
7: Set ⟨ri ⟩k = (⟨αi .e⟩ p+2 − ⟨ethr ⟩ p+2 ) mod 2k , where k = ⌈log(q + ñ + 2)⌉ ▷ Compute shift amount ri ⩽ q + ñ + 1 in k bits
k,q+ñ+2
8: Call ⟨Ri ⟩q+ñ+2 = ΠLUT (pow2, ⟨ri ⟩k ) ▷ Ri = 2ri = pow2(ri )
(align) 2q+ñ+2
D E
q+1,q+ñ+2,2q+ñ+2
9: Call mi = ΠUMult (⟨mi ⟩q+1 , ⟨Ri ⟩q+ñ+2 ) ▷ Align the mantissa mi by left-shifting by ri
 
(ext) ℓ (align) 2q+ñ+2
D E D E
2q+ñ+2,ℓ
10: Call mi = ΠZXt mi ▷ Extend to make space for addition of n elements and sign-bit
(s) ℓ (ext) ℓ (ext) ℓ
D E D E D E
11: Call mi = ΠℓMUX (⟨αi .s⟩B , −1 · mi , mi ) ▷ Set sign of mantissa same as input αi
D Eℓ D Eℓ
(s)
12: Set M (s) = ∑i∈[n] mi ▷ Add the aligned mantissas to get M (s)
D Eℓ
13: Call (⟨s⟩B , ⟨z⟩B ) = ΠℓLT&EQ ( M (s) , 0) ▷ Set s = 1{M (s) < 0}, z = 1{M (s) = 0}
D Eℓ D Eℓ
14: Call ⟨M⟩ℓ = ΠℓMUX (⟨s⟩B , −1 · M (s) , M (s) ) ▷ M = |M (s) |
ℓ p,q,ℓ−1
15: Call (⟨M1 ⟩ , ⟨e1 ⟩
p+2
) = ΠNormalize (⟨M⟩ℓ , ⟨ethr ⟩ p+2 ) ▷ ethr is the exponent for unnormalized M
p,q,ℓ−1
16: Call (⟨M2 ⟩ , ⟨e2 ⟩ ) = ΠRound&Check (⟨M1 ⟩ℓ , ⟨e1 ⟩ p+2 )
q+1 p+2
▷ Reduce precision of mantissa from (ℓ − 1) to q
p,q
17: Call and return ⟨γ⟩FP(p,q) = ΠClip (⟨z⟩B , ⟨s⟩B , ⟨e2 ⟩ p+2 , ⟨M2 ⟩q+1 ) ▷ Clip smaller than smallest representable values to 0

Figure 2: Protocol for Floating-Point Summation.

p,q,n p,q,n
then calling our protocol for Summation, ΠFPSum (Figure 2) ΠFPDotProd . This protocol first computes the product of man-
on γ . We reduce the cost of this approach further by remov- tissas mi exactly, and then rounds it to q + 2 bits. Impor-
ing the normalization step in the floating-point multiplication tantly, q + 2 bits suffice for the output of rounding m′i as mi ∈
followed by using our protocol for generalized Summation [22q , (2q+1 − 1)2 ], and thus, m′i = RN(mi , q) ∈ [2q , 2q+2 − 2].
(Section 5.2). In our dot product protocol, the exponents of Note that the rounding operation introduces a relative error
q−1
αi and βi are added and the mantissas of αi and βi are mul- of |δ1 | ⩽ 2|mi | ⩽ 2−q−1 to the intermediate product γi . Un-
tiplied. The latter creates unnormalized 2q + 2-bit mantissas less γi is smaller than the smallest representable value, i.e.,
with scale 2q and values in ∈ [1, 4). We round these interme- q+2,q,q,p,q,n
αi .e + βi .e < 1 − 2 p−1 , it is then input as is to Πg-FPSum ,
diate products to q + 2 bits with scale q, perform clipping to
which adds a relative error of |δ2 | ⩽ ε(κ + 1) + O(ε2 κ) to
0 (to satisfy the constraints on exponents of inputs in general-
its output where ε = 2−q−1 (Theorem 2). Thus, the ab-
ized Summation in Section 5.2), and then invoke generalized p,q,n
q+2,q,q,p,q,n solute error of ΠFPDotProd is | ∑i (γi · (1 + δ2 ) − αi βi )| =
summation (Πg-FPSum ) to get the output of the dot product.
p,q,n | ∑i αi βi · (1 + δ1 )(1 + δ2 ) − αi βi | ⩽ | ∑i αi βi · (ε(κ + 2) +
Our protocol ΠFPDotProd for dot product is formally described O(ε2 (κ + 1)))|, which implies a worst-care relative error of
in Figure 7, Appendix E. For n = 1000, our approach has ε(κ + 2) + O(ε2 (κ + 1)).
1.2× lower communication than the naive approach and is
just as precise as proved below. The protocols for matrix mul-
tiplication and convolutions build on top of dot product , and Now, we look at the worst-case error of the naïve solution.
their description is deferred to the full version of the paper. p,q
ΠFPMul introduces a worst-case relative error of ε to the inter-
p,q,n
Theorem 3. Our dot product protocol ΠFPDotProd is as pre-
p,q,n mediate products, the same as ΠFPDotProd . It also clips inter-
p,q,n p,q
cise as ΠFPSum ({ΠFPMul (⟨αi ⟩FP(p,q) , ⟨βi ⟩FP(p,q) )}i∈[n] ). mediate products when αi .e + βi .e < 1 − 2 p−1 . The interme-
p,q,n
diate products in the naïve solution are then input to ΠFPSum
Proof. We first look at the worst-case relative error of which has the worst-case relative error of ε(κ + 1) + O(ε2 κ),
′ As can be seen easily, this protocol is at least as expensive as
Algorithm g-FPSumb,b ,sc,p,q,n ({αi }i∈[n] ), sc ⩾ q
Π8,23,n
FPDotProd , i.e., dot product of length n over FP32. Another
1: ñ = log(n); ℓ = 2b − b′ + sc + 2ñ + 1
2: emax = maxi αi .e approach to compute a dot product over BF16, which is much
3: ethr = emax − sc − ñ − (b − b′ ) more efficient, is to directly invoke Π8,7,nFPDotProd and return
4: for i ∈ [n] do its output as the final output. However, this has much worse
5: mi = (αi .e < ethr ) ? 0 : αi .m error compared to the first approach that works over higher
(align)
6: mi = mi ≪ (αi .e − ethr ) precision (Section 1.1.2). We now describe our protocol that
7:
(s)
mi = αi .s ? − mi
(align) (align)
: mi achieves the best of both worlds: it is only < 30% more ex-
8: M (s) = ∑i∈[n] mi
(s) pensive than Π8,7,nFPDotProd , and has the same precision as the
standard BF16.
9: (s, z) = LT&EQℓ (M (s) , 0)
If we look closely at the naïve solution, it is first left-shifting
10: M = s ? − M (s) : M (s)
input mantissas by 16 bits each, then multiplying them, and
11: (e, M ′ ) = Normalize p,q,ℓ−1 (ethr + q − sc, M)
finally, rounding the multiplication result by 23 bits to get the
12: (e1 , M1 ) = Round&Check p,q,ℓ−1 (e, M ′ )
mantissas for the intermediate products, the least significant
13: Return Clip p,q (z, s, e1 , M1 )
9 bits of which are always 0. It is easy to see that this is quite
wasteful, and we can instead simply multiply the input mantis-
Figure 3: Generalized Floating-Point Summation.
sas to get 16-bit mantissas with scale 14 for the intermediate
products without losing precision. However, there is an issue
q+2,q,q,p,q,n p,q,n with this change. Since the mantissas being added only have
the same as Πg-FPSum . Thus, ΠFPDotProd is as precise as
scale 14 instead of scale 23 in the naïve approach, the follow-
the naïve solution.
ing generalized Summation would ignore more values than
the naïve approach (ethr depends on the scale sc and lower
6 BFloat16 training scale leads to higher ethr and larger magnitude values being
dropped from the sum). Hence, we fix this in the second step,
BFloat16 or BF16 is essentially a lower-precision version by increasing the scale of mantissa (but not the bitwidth) by
of FP32 with the same dynamic range: mantissa bits q are 9-bits and accordingly adding 9 to the exponent to account
reduced from 23 to 7 and exponent bits p = 8 are the same. for the scale change, thereby obtaining an exact intermediate
In this section, we discuss our techniques for secure imple- product in 16 bits and scale 23. These mantissas with higher
mentation of BF16 that give performance improvements over scale are now fed into the generalized Summation protocol
B EACON’s FP32 while being more precise than standard plat- by invoking Π16,14,23,8,7,n
g-FPSum . Our approach is much more ef-
forms for BF16. Recall that in all platforms, BF16 is used just ficient as it avoids the expensive steps of multiplication on
as a data-storage format, i.e., the inputs and outputs to each 24-bit inputs (Step 3) and rounding by 23-bits (Step 6) in
layer of the model are stored as BF16, and the arithmetic is Π8,23,n
FPDotProd , as well as operates on mantissas of 16-bits as
performed in higher precision FP32 (Section 1.1). Although opposed to 25 bits in generalized summation. Our BF16 dot-
we focus on BF16, our techniques generalize in a straight- product protocol ΠnFPDotProdBF16 is described in Figure 4 and
forward manner to other representations, e.g., TensorFloat its precision is proved below. From Section 5.3, we get the
which uses q = 10. Below, we discuss linear layers and defer BF16 matrix-multiplication protocol by building upon our
non-linear layers tothe full version of the paper. BF16 dot-product protocol.

Theorem 4. The relative error of ΠnFPDotProdBF16 is at most


6.1 Linear Layers δκ + ε + O(εδκ), where ε = 2−8 , δ = 2−24 , and κ is the con-
As discussed in Section 5, our protocols for linear layers build dition number.
upon the protocol for dot product. Hence, we discuss our Proof. We first calculate the relative error of our protocol
techniques for secure dot product over BF16. For vectors of ΠnFPDotProdBF16 . We note that the intermediate products are
size n, the naïve approach for dot product that converts to calculated exactly, unless they are much smaller than the
FP32 before computing works as follows: smallest representable exponent, i.e., αi .e + βi .e + 9 < −127.
1. Left-shift the mantissas of the BF16 input vectors by 16 bits Hence, the relative error of our scheme is same as relative
to convert them into FP32 representation. error introduced by generalized summation Π16,14,23,8,7,n
g-FPSum , i.e.,
−8
δκ + ε + O(εδκ) for ε = 2 , δ = 2 −24 by Theorem 2.
2. Invoke the dot-product protocol Π8,23,n
FPDotProd (Section 5.3)
on FP32 vectors.

3. Round the output mantissa obtained above by 16 bits using Π^{8,7,23}_{Round&Check} to get the final BF16 output.

Protocol Π^n_{FPDotProdBF16}({α_i, β_i}_{i∈[n]})

Input: i ∈ [n], ⟨α_i⟩^{FP(8,7)}, ⟨β_i⟩^{FP(8,7)}.
Output: ⟨γ⟩^{FP(8,7)} s.t. γ = Σ_{i∈[n]} α_i · β_i.

1: for i ∈ [n] do
2:   Set ⟨e_i⟩^{10} = ⟨α_i.e⟩^{10} + ⟨β_i.e⟩^{10} + 9
3:   Call ⟨m_i⟩^{16} = Π^{8,8,16}_{UMult}(⟨α_i.m⟩^{8}, ⟨β_i.m⟩^{8})
4:   Set ⟨s_i⟩^{B} = ⟨α_i.s⟩^{B} ⊕ ⟨β_i.s⟩^{B}
5:   Call ⟨z_i⟩^{B} = Π_{OR}(⟨α_i.z⟩^{B}, ⟨β_i.z⟩^{B})
6:   Call ⟨c⟩^{B} = Π^{10}_{LT}(⟨e_i⟩^{10}, −127)
7:   Call ⟨m′_i⟩^{16} = Π^{16}_{MUX}(⟨c⟩^{B}, 0, ⟨m_i⟩^{16})
8:   Call ⟨e′_i⟩^{10} = Π^{10}_{MUX}(⟨c⟩^{B}, −127, ⟨e_i⟩^{10})
9:   Set ⟨δ_i⟩^{FP′} = (⟨z_i⟩^{B}, ⟨s_i⟩^{B}, ⟨e′_i⟩^{10}, ⟨m′_i⟩^{16})
10: Return ⟨γ⟩^{FP(8,7)} = Π^{16,14,23,8,7,n}_{g-FPSum}({⟨δ_i⟩^{FP′}})

Figure 4: Protocol for BF16 Dot Product.

Corollary 1. Π^n_{FPDotProdBF16} is more precise than the naïve solution that left-shifts input mantissas, performs Π^{8,23,n}_{FPDotProd}, and rounds by Π^{8,7,23}_{Round&Check}.

Proof. For comparing our error with that of the naïve solution, we first observe that it also computes products exactly, modulo clipping small values to 0. However, the naïve solution clips values for which α_i.e + β_i.e < −127; that is, our approach clips fewer elements before summation. Next, the relative error of the naïve solution depends on the relative errors of Π^{25,23,23,8,23,n}_{g-FPSum} and Π^{8,7,23}_{Round&Check}. Again by Theorem 2, the relative error of Π^{25,23,23,8,23,n}_{g-FPSum} is δ(κ + 1) + O(δ²κ). Next, since Π^{8,7,23}_{Round&Check} introduces a worst-case relative error of ε, the final relative error of the naïve solution is δ(κ + 1) + ε + O((ε + δ)δκ), using a similar argument on combining relative errors as in the proof of Theorem 3. This clearly shows that the relative error of the naïve solution is larger than that of our protocol Π^n_{FPDotProdBF16}.
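To make the intuition behind Corollary 1 concrete, the following cleartext C++ sketch mirrors the accumulation strategy of Π^n_{FPDotProdBF16}: products of 8-bit mantissas are exact, only products whose exponent falls below −127 are clipped, the sum is accumulated in a wider type (standing in for the generalized sum), and BF16 rounding happens exactly once at the end. This is our own illustration of the underlying cleartext arithmetic, not the 2PC protocol or BEACON's API; the function names and the exact clipping threshold are our assumptions.

#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Round an FP32 value to BF16 (round-to-nearest-even on the dropped 16 bits).
// NaN/overflow handling is omitted for brevity.
static float round_to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    uint32_t lsb = (bits >> 16) & 1u;   // last mantissa bit that survives
    bits += 0x7FFFu + lsb;              // nearest, ties to even
    bits &= 0xFFFF0000u;                // clear the 16 discarded bits
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

// Dot product of BF16-valued inputs (stored as float): exact products,
// one wide accumulation, and a single final rounding.
float bf16_dot_product(const std::vector<float>& a, const std::vector<float>& b) {
    double acc = 0.0;                                    // wide accumulator in place of g-FPSum
    for (size_t i = 0; i < a.size(); ++i) {
        double prod = static_cast<double>(a[i]) * b[i];  // 8-bit x 8-bit mantissas: exact
        if (std::fabs(prod) < std::ldexp(1.0, -127))     // clip only when the exponent drops below -127
            prod = 0.0;
        acc += prod;
    }
    return round_to_bf16(static_cast<float>(acc));       // round to BF16 exactly once
}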

7 2PC for training and security proofs

We note that all our protocols for linear layers (Sections 5 and 6) and non-linear layers (Appendix G) satisfy the condition that parties/servers P0 and P1 start with secret shares of inputs and end up with secret shares of outputs. Using this, our 2-party protocol for end-to-end secure training works by putting together protocols for linear layers and non-linear layers as specified by the training algorithm for both the forward and the backward passes. As is standard, the security of our training algorithms can be argued in the hybrid model [13] by composing the building blocks, and we defer the complete security proof to the full version of the paper.

8 Implementation

We have implemented BEACON as a library in C++ on top of SecFloat with 5k LOC. This library's API provides many operators arising in secure training, e.g., matrix multiplications, convolutions with various paddings and strides, ReLU and Maxpool, softmax, loss functions like mean squared error (MSE) and cross-entropy, etc. For other operators, e.g., trigonometric sine, we use SecFloat [60] as is.

9 Evaluation

We provide empirical evidence for the claims in Section 1.1, i.e., BEACON outperforms the state-of-the-art in secure floating-point by over 6× (Section 9.1), and achieves secure floating-point training with < 6× the latency of secure fixed-point training with KS22 [41] (Section 9.2).

Evaluation Setup. We perform our experiments in the LAN setup between two 2.35 GHz AMD 16-core machines with 64 GiB memory that are connected through a network with 10 Gbps bandwidth and 73 µs latency (measured through netperf). We measure both end-to-end runtime and communication, without assuming an offline phase (similar to prior works [5, 33, 43, 45, 60-62, 64]). All experiments use 16 threads for BEACON as well as all the baselines.

Benchmarks. We use the MNIST-10 dataset that has 60,000 28 × 28 monochrome images and the CIFAR-10 dataset that has 50,000 32 × 32 colored RGB images. We consider the following training benchmarks: MNIST-Logistic [55] (a single-layer logistic classifier for MNIST), MNIST-FFNN [41, 43, 55] (a 3-layer feed-forward neural network for MNIST), CIFAR-LeNet [43] (a CNN with 2 convolutions), and CIFAR-HiNet [31] (a CNN with three convolutions). The description of these models is present in [1, 2].

We also use the following microbenchmarks (designed to have similar runtimes), which are commonly occurring computations in ML: Summation-2k (2000 summations over vectors of length 2000 each), DotProduct-1k (1000 inner products between vectors of length 1000), and MatMul-100 (multiplying a 100 × 100 matrix with another 100 × 100 matrix).

9.1 Secure training with BEACON

Figure 5 compares the time and communication of BEACON with SecFloat [60], the state-of-the-art in secure floating-point, on the microbenchmarks for linear layers (Figure 5(a)-5(c)) and the training benchmarks for a batch iteration (Figure 5(d)-5(g)). We relegate the evaluation of non-linear microbenchmarks to Appendix C as 99% of secure training cost in SecFloat comes from linear layers (Appendix A). In addition to our training benchmarks, we also consider the Relevance model (Figure 5(h)), a benchmark proposed by SecFloat [60]. We observe that BEACON has 3.4-8.1× lower latency and 3.1-4.4× lower communication than SecFloat over FP32 tasks. The SecFloat baseline decomposes a compound operation into individual scalar operations that suffer from performance overheads caused by normalization and rounding steps.
[Figure 5 appears here: grouped bar charts for (a) Summation-2k, (b) DotProduct-1k, (c) MatMul-100, (d) MNIST-Logistic, (e) MNIST-FFNN, (f) CIFAR-LeNet, (g) CIFAR-HiNet, and (h) Relevance; plot data is not recoverable from this extraction.]

Figure 5: Performance comparison of SecFloat (time, comm) with BEACON FP32 (time, comm) and BEACON BF16 (time, comm). The left bar group compares latency and the right group compares communication. Improvement factors of BEACON (both FP32 and BF16) are also shown.

When compared with SecFloat over BF16 tasks, BEACON improves latency by 6.3-16.1× and communication by 5.4-9.5×. Our further improvements with BF16 are again due to SecFloat's use of scalar operations, which require that all arithmetic be performed in FP32 (Section 1.1.2).

Similar to KS22, we set the mini-batch size to 128 for all benchmarks except for Relevance, where SecFloat sets the mini-batch size to 32. Our evaluation in Figure 5 is for one training iteration with these mini-batch sizes. We observe that BEACON has 6.3-7.4× lower latency and 6.2-6.5× lower communication than the baseline on training tasks over BF16, thus demonstrating the performance benefits of our protocols.

Benchmark         Time (minutes)              Comm. (GiB)
                  BEACON          KS22        BEACON              KS22
MNIST-Logistic    56 (3.3×)       16          2,011 (34.8×)       58
MNIST-FFNN        626 (5.9×)      106         30,820 (97.5×)      316
CIFAR-LeNet*      3,285 (3.1×)    1,065       165,594 (21×)       7,881
CIFAR-HiNet*      100,827 (3.1×)  32,137      5,280,375 (24.7×)   214,061

Table 3: Time (in minutes) and communication (in GiB) per epoch of BEACON vs KS22 [41]. Numbers in parentheses represent the overhead of BEACON over KS22. *: extrapolated from 10 iterations.

9.2 Cost of BEACON vs. KS22

A curious reader might wonder about the overheads of running secure floating-point w.r.t. secure training with fixed-point approximations. In this section, we compare the training cost of one epoch of BEACON (over BF16) with the state-of-the-art secure fixed-point training framework KS22 [41]. We instantiated KS22 with the configuration from the paper, i.e., using 64-bit fixed-point and the default semi-honest secure 2PC backend (semi-homomorphic encryption or hemi-party.x). Sometimes KS22 goes out of memory⁹ when running a full epoch, and in these cases we have extrapolated the time per epoch based on iterations that can run on our current setup. Table 3 summarizes the results. We found that the latency of BEACON is within 3-6× of KS22 for our training benchmarks. The communication of BEACON, on the other hand, is 21-100× higher than that of KS22. This is due to BEACON's use of oblivious transfers (OT), which are known to be communication heavy compared to the homomorphic encryption based protocols used in KS22. Techniques such as Silent OT [11] can significantly reduce the communication overheads of BEACON.

⁹ KS22 generates and stores pre-processing material that can overflow memory for a large number of iterations.

10 Related Work

Training algorithms. We have focused on FP32 and BF16 training. Other ML training algorithms include hybrid algorithms that mix floating-point and integers. In quantized training, performance-critical operations like matrix multiplications are performed on 8-bit/16-bit integers and precision-critical operations like softmax or weight updates are performed in floating-point [8, 18, 20, 21, 34, 59, 71, 75, 76]. In block floating-point training, different layers use different scales and these scales are updated dynamically depending on the magnitudes of the runtime values [24, 68]. The advantage of this approach is that all weights of a layer share a common exponent and hence all matrix multiplications can be done over integers. Recently, 8-bit floating-point training is gaining traction [7, 52].
Defenses against attacks like data poisoning [19] and techniques like differential privacy [25] are orthogonal to BEACON. These mitigations involve changing the training algorithms, and BEACON is expressive enough to run the modified algorithms as well.

The techniques for training in the security literature can be classified as centralized, federated, and MPC-based.

Centralized. A centralized approach to secure training is for all the parties to give all their sensitive training data to trusted hardware that does the floating-point computation in a trusted execution environment (TEE) and returns the result. However, TEEs are susceptible to side-channel attacks [12, 30], and the use of secure 2PC makes BEACON provably resistant to them.

Federated. A well-known decentralized approach to multi-party training is federated learning [50]. Here, multiple parties iteratively train in floating-point, aggregate their gradients, and update their models with the aggregated gradient. However, these leaked aggregated gradients have been used in various attacks to reveal information about the sensitive datasets [32, 51, 77].

MPC-based works. Kelkar et al. [39] run fixed-point training of Poisson regression models with 2PC. Helen [74] provides stronger malicious security but is limited to the fixed-point training of linear classifiers. There are many works that use 2PC for the related problem of secure inference [10, 33, 35, 49, 53, 55, 58, 61, 62]. Prior works on secure floating-point in the honest majority setting include [6, 14, 15, 38, 42]. Other secure training works use threat models different from 2PC, e.g., honest majority [54, 64, 66, 67] and dealer-based [43, 69].

11 Conclusion

BEACON beats prior secure floating-point arithmetic work on ML tasks by over 6× while providing formal precision guarantees. There are three primary directions, orthogonal to this work, in which future research can extend BEACON, building upon our novel algorithms for precise compound operations. a) ImageNet-scale training is currently out of reach. Secure training on datasets with millions of images will require GPU support that BEACON lacks. b) Like all prior works on secure 2-party training, we have focused on security against semi-honest adversaries and would like to extend BEACON to provide security against active adversaries. One way to achieve this is by running our algorithms with MP-SPDZ, as all our building blocks are easily supported by edaBits [26]. c) Finally, we would like to improve the performance of BEACON by introducing a trusted dealer that provides correlated randomness, e.g., oblivious transfers.

References

[1] Deep learning training with multi-party computation. https://github.com/csiro-mlai/deep-mpc.

[2] Deep learning training with multi-party computation. https://github.com/csiro-mlai/deep-mpc/tree/more-models.

[3] EMP-toolkit: Efficient MultiParty computation toolkit. https://github.com/emp-toolkit, 2016.

[4] IEEE standard for floating-point arithmetic. IEEE STD 754-2019 (Revision of IEEE 754-2008), 2019.

[5] Nitin Agrawal, Ali Shahin Shamsabadi, Matt J. Kusner, and Adrià Gascón. QUOTIENT: Two-party secure neural network training and prediction. In CCS, 2019.

[6] Mehrdad Aliasgari, Marina Blanton, Yihua Zhang, and Aaron Steele. Secure computation on floating point numbers. In NDSS, 2013.

[7] Michael Andersch, Greg Palmer, Ronny Krashinsky, Nick Stam, Vishal Mehta, Gonzalo Brito, and Sridhar Ramaswamy. NVIDIA Hopper Architecture In-Depth. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/, 2022.

[8] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of neural networks. In NIPS, 2018.

[9] G. R. Blakley. Safeguarding cryptographic keys. In International Workshop on Managing Requirements Knowledge, 1979.

[10] Fabian Boemer, Rosario Cammarota, Daniel Demmler, Thomas Schneider, and Hossein Yalame. MP2ML: A mixed-protocol machine learning framework for private inference. In ARES, 2020.

[11] Elette Boyle, Geoffroy Couteau, Niv Gilboa, Yuval Ishai, Lisa Kohl, Peter Rindal, and Peter Scholl. Efficient two-round OT extension and silent non-interactive secure computation. In CCS, 2019.

[12] Ferdinand Brasser, Urs Müller, Alexandra Dmitrienko, Kari Kostiainen, Srdjan Capkun, and Ahmad-Reza Sadeghi. Software grand exposure: SGX cache attacks are practical. In USENIX WOOT, 2017.

[13] Ran Canetti. Security and composition of multiparty cryptographic protocols. J. Cryptology, 2000.

[14] Octavian Catrina. Evaluation of floating-point arithmetic protocols based on Shamir secret sharing. In ICETE (Selected Papers), 2019.
[15] Octavian Catrina. Performance analysis of secure floating-point sums and dot products. In COMM, 2020.

[16] Harsh Chaudhari, Arpita Patra, Rahul Rachuri, and Ajith Suresh. Tetrad: Actively secure 4PC for secure training and inference. In NDSS, 2022.

[17] Harsh Chaudhari, Rahul Rachuri, and Ajith Suresh. Trident: Efficient 4PC framework for privacy preserving machine learning. In NDSS, 2020.

[18] Xi Chen, Xiaolin Hu, Hucheng Zhou, and Ningyi Xu. FxpNet: Training a deep convolutional neural network in fixed-point representation. In IJCNN, 2017.

[19] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. CoRR, abs/1712.05526, 2017.

[20] Matthieu Courbariaux and Yoshua Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. ArXiv, abs/1602.02830, 2016.

[21] Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesus Corbal, Nikita Shustrov, Roma Dubtsov, Evarist Fomenko, and Vadim Pirogov. Mixed precision training of convolutional neural networks using integer operations, 2018.

[22] Daniel Demmler, Ghada Dessouky, Farinaz Koushanfar, Ahmad-Reza Sadeghi, Thomas Schneider, and Shaza Zeitouni. Automated synthesis of optimized circuits for secure computation. In CCS, 2015.

[23] Ghada Dessouky, Farinaz Koushanfar, Ahmad-Reza Sadeghi, Thomas Schneider, Shaza Zeitouni, and Michael Zohner. Pushing the communication barrier in secure computation using lookup tables. In NDSS, 2017.

[24] Mario Drumond, Tao Lin, Martin Jaggi, and Babak Falsafi. Training DNNs with hybrid block floating point. In NIPS, 2018.

[25] Cynthia Dwork. Differential privacy: A survey of results. In TAMC, 2009.

[26] Daniel Escudero, Satrajit Ghosh, Marcel Keller, Rahul Rachuri, and Peter Scholl. Improved primitives for MPC over mixed arithmetic-binary circuits. In CRYPTO, 2020.

[27] Adrià Gascón, Phillipp Schoppmann, Borja Balle, Mariana Raykova, Jack Doerner, Samee Zahur, and David Evans. Secure linear regression on vertically partitioned datasets. IACR Cryptol. ePrint Arch., 2016.

[28] David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv., 1991.

[29] Oded Goldreich, Silvio Micali, and Avi Wigderson. How to play any mental game or a completeness theorem for protocols with honest majority. In STOC, 1987.

[30] Johannes Götzfried, Moritz Eckert, Sebastian Schinzel, and Tilo Müller. Cache attacks on Intel SGX. In EUROSEC, 2017.

[31] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors, 2012.

[32] Briland Hitaj, Giuseppe Ateniese, and Fernando Pérez-Cruz. Deep models under the GAN: Information leakage from collaborative deep learning. In CCS, 2017.

[33] Zhicong Huang, Wen-jie Lu, Cheng Hong, and Jiansheng Ding. Cheetah: Lean and fast secure two-party deep neural network inference. In USENIX Security Symposium, 2022.

[34] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, 2018.

[35] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. Gazelle: A low latency framework for secure neural network inference. In USENIX Security Symposium, 2018.

[36] William M. Kahan. Further remarks on reducing truncation errors. Communications of the ACM, 1965.

[37] Dhiraj D. Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. A study of BFLOAT16 for deep learning training. CoRR, abs/1905.12322, 2019.

[38] Liina Kamm and Jan Willemson. Secure floating point arithmetic and private satellite collision analysis. Int. J. Inf. Sec., 2015.
[39] Mahimna Kelkar, Phi Hung Le, Mariana Raykova, and Karn Seth. Secure Poisson regression. In USENIX Security Symposium, 2022.

[40] Marcel Keller. MP-SPDZ: A versatile framework for multi-party computation. In CCS, 2020.

[41] Marcel Keller and Ke Sun. Secure quantized training for deep learning. In ICML, 2022.

[42] Liisi Kerik, Peeter Laud, and Jaak Randmets. Optimizing MPC for robust and scalable integer and floating-point arithmetic. In FC, 2016.

[43] Brian Knott, Shobha Venkataraman, Awni Hannun, Shubhabrata Sengupta, Mark Ibrahim, and Laurens van der Maaten. CrypTen: Secure multi-party computation meets machine learning. In NIPS, 2021.

[44] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

[45] Nishant Kumar, Mayank Rathee, Nishanth Chandran, Divya Gupta, Aseem Rastogi, and Rahul Sharma. CrypTFlow: Secure TensorFlow inference. In IEEE S&P, 2020.

[46] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.

[47] Jay P. Lim and Santosh Nagarakatte. RLIBM-32: High performance correctly rounded math libraries for 32-bit floating point representations. In PLDI, 2021.

[48] Yehuda Lindell. How to simulate it - a tutorial on the simulation proof technique. Tutorials on the Foundations of Cryptography, 2017.

[49] Jian Liu, Mika Juuti, Yao Lu, and N. Asokan. Oblivious neural network predictions via MiniONN transformations. In CCS, 2017.

[50] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.

[51] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting unintended feature leakage in collaborative learning. In IEEE S&P, 2019.

[52] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart F. Oberman, Mohammad Shoeybi, Michael Y. Siu, and Hao Wu. FP8 formats for deep learning. CoRR, abs/2209.05433, 2022.

[53] Pratyush Mishra, Ryan Lehmkuhl, Akshayaram Srinivasan, Wenting Zheng, and Raluca Ada Popa. Delphi: A cryptographic inference service for neural networks. In USENIX Security Symposium, 2020.

[54] Payman Mohassel and Peter Rindal. ABY3: A mixed protocol framework for machine learning. In CCS, 2018.

[55] Payman Mohassel and Yupeng Zhang. SecureML: A system for scalable privacy-preserving machine learning. In IEEE S&P, 2017.

[56] Valeria Nikolaenko, Stratis Ioannidis, Udi Weinsberg, Marc Joye, Nina Taft, and Dan Boneh. Privacy-preserving matrix factorization. In CCS, 2013.

[57] Valeria Nikolaenko, Udi Weinsberg, Stratis Ioannidis, Marc Joye, Dan Boneh, and Nina Taft. Privacy-preserving ridge regression on hundreds of millions of records. In IEEE S&P, 2013.

[58] Arpita Patra, Thomas Schneider, Ajith Suresh, and Hossein Yalame. ABY2.0: Improved mixed-protocol secure two-party computation. In USENIX Security Symposium, 2021.

[59] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.

[60] Deevashwer Rathee, Anwesh Bhattacharya, Rahul Sharma, Divya Gupta, Nishanth Chandran, and Aseem Rastogi. SecFloat: Accurate floating-point meets secure 2-party computation. In IEEE S&P, 2022. https://ia.cr/2022/322.

[61] Deevashwer Rathee, Mayank Rathee, Rahul Kranti Kiran Goli, Divya Gupta, Rahul Sharma, Nishanth Chandran, and Aseem Rastogi. SIRNN: A math library for secure inference of RNNs. In IEEE S&P, 2021.

[62] Deevashwer Rathee, Mayank Rathee, Nishant Kumar, Nishanth Chandran, Divya Gupta, Aseem Rastogi, and Rahul Sharma. CrypTFlow2: Practical 2-party secure inference. In CCS, 2020.

[63] Adi Shamir. How to share a secret. CACM, 1979.

[64] Sijun Tan, Brian Knott, Yuan Tian, and David J. Wu. CryptGPU: Fast privacy-preserving machine learning on the GPU. In IEEE S&P, 2021.

[65] Lloyd N. Trefethen and David Bau III. Numerical linear algebra. SIAM, 1997.
[66] Sameer Wagh, Divya Gupta, and Nishanth Chandran. SecureNN: 3-party secure computation for neural network training. PoPETs, 2019.

[67] Sameer Wagh, Shruti Tople, Fabrice Benhamouda, Eyal Kushilevitz, Prateek Mittal, and Tal Rabin. Falcon: Honest-majority maliciously secure framework for private deep learning. PoPETs, 2021.

[68] Maolin Wang, Seyedramin Rasoulinezhad, Philip H. W. Leong, and Hayden Kwok-Hay So. NITI: Training integer neural networks using integer-only arithmetic. IEEE TPDS, 2022.

[69] Jean-Luc Watson, Sameer Wagh, and Raluca Ada Popa. Piranha: A GPU platform for secure computation. In USENIX Security Symposium, 2022.

[70] N. Whitehead and A. Fit-Florea. Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs. NVIDIA technical white paper, 2011.

[71] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. In ICLR, 2018.

[72] Andrew Yao. How to generate and exchange secrets (extended abstract). In FOCS, 1986.

[73] Andrew C. Yao. Protocols for secure computations. In FOCS, 1982.

[74] Wenting Zheng, Raluca Ada Popa, Joseph E. Gonzalez, and Ion Stoica. Helen: Maliciously secure coopetitive learning for linear models. In IEEE S&P, 2019.

[75] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. ArXiv, abs/1606.06160, 2016.

[76] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. In ICLR, 2017.

[77] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In NIPS, 2019.
A Cost split between linear and non-linear layers with SecFloat

In Table 4, we show how the total runtime/communication of the training benchmarks is divided between linear and non-linear layers for the prior state-of-the-art SecFloat. It is clearly seen that linear layers dominate in all DNNs, consuming > 99% of both runtime and communication. Even for the single-layered MNIST-Logistic, linear layers contribute 96% of the runtime.

                   Time (s)                            Comm (GiB)
                   Linear    Non-Linear  % Linear      Linear     Non-Linear  % Linear
MNIST-Logistic     44.2      1.8         95.96%        26.48      0.13        99.47%
MNIST-FFNN         558       1.97        99.65%        424.86     0.14        99.97%
Relevance          448       1.57        99.65%        325.03     0.07        99.98%
CIFAR-LeNet        3484      5.5         99.84%        2710.67    0.86        99.97%
CIFAR-HiNet        103572    899         99.14%        87787      122.8       99.86%

Table 4: Split in cost between linear and non-linear layers for SecFloat.

B Profiling of SecFloat's operations

In Section 1.1, we claimed that over 82% of the runtime in SecFloat's addition operation was spent in normalization and rounding. To obtain this split (Table 5), we measured the runtime/communication for 10,000 instances of addition. A large number of instances (10,000) was chosen because the design of SecFloat is inherently SIMD. Similarly, it was also claimed in Section 2 that 85% of the runtime in matrix multiplication was spent in summations. This split (Table 6) is obtained by multiplying a 100 × 100-sized matrix with another 100 × 100-sized matrix, one of the microbenchmarks considered in Section 9.

              Total     Norm.    % Norm.
Time (s)      0.384     0.316    82.3%
Comm (MiB)    105.58    81.28    76.9%

Table 5: Split of SecFloat's Addition. Norm. refers to normalization and rounding steps.

              Total     Summ.    % Summ.
Time (s)      20.24     17.35    85.7%
Comm (MiB)    13668     10784    78.9%

Table 6: Split of SecFloat's Matrix Multiplication. Summ. refers to the Summation step.

C Evaluation on non-linear layers

Figure 6 shows empirical improvements of BEACON over SecFloat for Softmax-100 (1000 softmax over vectors of length 100 each) and Sigmoid-1m (pointwise sigmoid of a vector of 1 million elements). For ReLUs, BEACON's performance is similar to SecFloat. For sigmoid, BEACON's specialized BF16 protocol improves the performance by over 5× compared to SecFloat. The improvement is much smaller for FP32 as BEACON uses SIMD FP32 exponentiations in this case. For Softmax-100, FP32 exponentiations are again the bottleneck for BEACON and SecFloat over both BF16 and FP32, and thus, the improvements from compound operations in BEACON lead to comparatively modest benefits. In fact, BF16 performance is even worse than FP32 in this case due to the FP32 to BF16 conversion required at the end.

[Figure 6 appears here: grouped bar charts for (a) Softmax-100 and (b) Sigmoid-1m, reporting time (s) and communication (GiB); plot data is not recoverable from this extraction.]

Figure 6: Comparing the performance of SecFloat (dotted) with BEACON (striped, both FP32 and BF16) on non-linear functions. The left group of bars compares latency and the right compares communication (comm). The improvement factor of BEACON is also shown.

D Multiple number representations

Table 7 shows how the performance of BEACON changes with the number representation used for data storage. We evaluate FP32 (IEEE's representation with 8-bit exponent and 23-bit mantissa, supported in most CPUs), FP19 (NVIDIA's representation with 8-bit exponent and 10-bit mantissa, supported in the latest GPUs), and BF16 (Google's representation with 8-bit exponent and 7-bit mantissa, supported on the latest TPUs, some GPUs, and some CPUs). Our novel protocols ensure that the representations that require a lower number of bits have better performance even though the underlying computation in each of these cases is required to be as precise as FP32. In particular, the performance of BF16 is about 2× better than standard 32-bit floating-point.

Microbenchmark            Time (s)         Comm. (GiB)
Summation-2k     FP32     10.29            9.05
                 FP19     5.83 (1.77×)     4.9 (1.85×)
                 BF16     5.18 (1.99×)     4.15 (2.18×)
DotProduct-1k    FP32     4.65             4.17
                 FP19     3.47 (1.34×)     2.94 (1.42×)
                 BF16     2.74 (1.7×)      2.38 (1.75×)
MatMul-100       FP32     4.34             3.53
                 FP19     2.94 (1.47×)     2.51 (1.41×)
                 BF16     2.72 (1.59×)     2.1 (1.68×)

Table 7: Comparing cost of BEACON for various number representations. Improvement factor of FP19/BF16 over FP32 shown in parentheses.
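All three formats in Table 7 share FP32's 8-bit exponent and differ only in mantissa width, so storing a value in a narrower format amounts to rounding away low-order FP32 mantissa bits. The following cleartext C++ sketch (our own illustration, not BEACON code) shows this conversion with round-to-nearest-even; special values (NaN, overflow) are ignored for brevity.

#include <cstdint>
#include <cstring>

// Keep `mantissa_bits` of FP32's 23-bit mantissa (0 < mantissa_bits < 23),
// rounding the dropped bits to nearest, ties to even.
static float round_fp32_mantissa(float x, int mantissa_bits) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    int drop = 23 - mantissa_bits;                  // number of discarded bits
    uint32_t lsb  = (bits >> drop) & 1u;            // last surviving bit (decides ties)
    uint32_t bias = (1u << (drop - 1)) - 1u;        // just below the halfway point
    bits += bias + lsb;                             // round to nearest, ties to even
    bits &= ~((1u << drop) - 1u);                   // clear the discarded bits
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

static float to_bf16(float x) { return round_fp32_mantissa(x, 7);  }  // 1 sign + 8 exp + 7 mantissa
static float to_fp19(float x) { return round_fp32_mantissa(x, 10); }  // 1 sign + 8 exp + 10 mantissa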

E Dot Product Protocol Π^{p,q,n}_{FPDotProd}

Our protocol for dot product is formally described in Figure 7.

Protocol Π^{p,q,n}_{FPDotProd}

Input: i ∈ [n], ⟨α_i⟩^{FP(p,q)}, ⟨β_i⟩^{FP(p,q)}.
Output: ⟨γ⟩^{FP(p,q)} s.t. γ = Σ_{i∈[n]} α_i · β_i.

1: for i ∈ [n] do
2:   Set ⟨e_i⟩^{p+2} = ⟨α_i.e⟩^{p+2} + ⟨β_i.e⟩^{p+2}
3:   Call ⟨m_i⟩^{2q+2} = Π^{q+1,q+1,2q+2}_{UMult}(⟨α_i.m⟩^{q+1}, ⟨β_i.m⟩^{q+1})
4:   Set ⟨s_i⟩^{B} = ⟨α_i.s⟩^{B} ⊕ ⟨β_i.s⟩^{B}
5:   Call ⟨z_i⟩^{B} = Π_{OR}(⟨α_i.z⟩^{B}, ⟨β_i.z⟩^{B})
6:   Call ⟨m′_i⟩^{q+2} = Π^{2q+2,q}_{RN}(⟨m_i⟩^{2q+2})
7:   Call ⟨c⟩^{B} = Π^{p+2}_{LT}(⟨e_i⟩^{p+2}, 1 − 2^{p−1})
8:   Call ⟨m″_i⟩^{q+2} = Π^{q+2}_{MUX}(⟨c⟩^{B}, 0, ⟨m′_i⟩^{q+2})
9:   Call ⟨e′_i⟩^{p+2} = Π^{p+2}_{MUX}(⟨c⟩^{B}, 1 − 2^{p−1}, ⟨e_i⟩^{p+2})
10:  Set ⟨δ_i⟩^{FP′} = (⟨z_i⟩^{B}, ⟨s_i⟩^{B}, ⟨e′_i⟩^{p+2}, ⟨m″_i⟩^{q+2})
11: Return ⟨γ⟩^{FP(p,q)} = Π^{q+2,q,q,p,q,n}_{g-FPSum}({⟨δ_i⟩^{FP′}})

Figure 7: Protocol for Dot Product.

F Helper Protocols

F.1 Normalization

For a fixed-point value x ∈ Z_{2^ℓ} with scale s, we use ⟦x⟧_{ℓ,s} to denote the corresponding real value. Normalization takes an unnormalized mantissa m of Q+1 bits with scale q and an exponent e in p+2 bits, and outputs a normalized mantissa m′ in Q+1 bits with scale Q and an adjusted exponent e′ such that 2^{e′} · ⟦m′⟧_{Q+1,Q} = 2^{e} · ⟦m⟧_{Q+1,q}. Let m ∈ [2^k, 2^{k+1}) for some k ∈ [0, Q]. The normalization works as follows: first compute m′ = m ≪ (Q − k) ∈ [2^Q, 2^{Q+1}) and set its scale to Q. As a result, we get a normalized mantissa with scale Q, as expected from the output. To account for these changes to the mantissa, the exponent needs to be adjusted. We add Q − q to the exponent and subtract Q − k from it to account for the increase in scale and the left-shift, respectively. Combining the two operations, we get our output exponent e′ = e + k − q. From these steps, it is easy to see that 2^{e′} · ⟦m′⟧_{Q+1,Q} = 2^{e} · ⟦m⟧_{Q+1,q}. The normalization protocol is described in Figure 8.
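The cleartext effect of this step can be summarized by the following C++ sketch (our own illustration on public integers; the protocol in Figure 8 performs the same computation on secret shares via Π_MSNZB and Π_UMult):

#include <cstdint>
#include <utility>

// Normalize an unnormalized mantissa m (scale q, m != 0, Q < 64) into
// [2^Q, 2^{Q+1}) and adjust the exponent so that 2^e * m / 2^q = 2^{e'} * m' / 2^Q.
static std::pair<uint64_t, int64_t> normalize(uint64_t m, int64_t e, int Q, int q) {
    int k = 0;                               // index of the most significant non-zero bit (MSNZB)
    for (uint64_t t = m; t > 1; t >>= 1) ++k;
    uint64_t m_norm = m << (Q - k);          // left-shift by Q - k, i.e., multiply by K = 2^{Q-k}
    int64_t  e_norm = e + k - q;             // +(Q - q) for the scale change, -(Q - k) for the shift
    return {m_norm, e_norm};
}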
Protocol Π^{p,q,Q}_{Normalize}

Input: Unnormalized (⟨m⟩^{Q+1}, ⟨e⟩^{p+2}), where the scale of m is q.
Output: Normalized (⟨m′⟩^{Q+1}, ⟨e′⟩^{p+2}).

1: Call ⟨k⟩^{Q+1}, ⟨K⟩^{Q+1} = Π^{Q+1}_{MSNZB}(⟨m⟩^{Q+1})
2: Call ⟨m′⟩^{Q+1} = Π^{Q+1,Q+1,Q+1}_{UMult}(⟨m⟩^{Q+1}, ⟨K⟩^{Q+1})
3: Set ⟨e′⟩^{p+2} = ⟨e⟩^{p+2} + (⟨k⟩^{Q+1} mod 2^{p+2}) − q
4: Return (⟨m′⟩^{Q+1}, ⟨e′⟩^{p+2})

Figure 8: Normalize mantissa and adjust exponent accordingly.

F.2 Round and Check

The protocol for Round&Check is described in Figure 9. The normalization check is quite simple: since we know that only m ∈ [2^{Q+1} − 2^{Q−q−1}, 2^{Q+1}) leads to the one and only unnormalized output m′ = 2^{q+1}, we can simply check for this condition and set the output accordingly. In particular, instead of outputting m′ = 2^{q+1}, e′ = e in this case, we output m′ = 2^q, e′ = e + 1, which doesn't introduce any error; thus, the final output has relative error ⩽ ε.

Protocol Π^{p,q,Q}_{Round&Check}

Input: High-precision ⟨m⟩^{Q+1} and ⟨e⟩^{p+2}.
Output: Low-precision ⟨m′⟩^{q+1} and ⟨e′⟩^{p+2}.

1: Call ⟨c⟩^{B} = Π^{Q+1}_{LT}(⟨m⟩^{Q+1}, 2^{Q+1} − 2^{Q−q−1})
2: Call ⟨m_c⟩^{q+1} = Π^{Q+1,Q−q}_{RN}(⟨m⟩^{Q+1})
3: Call ⟨m′⟩^{q+1} = Π^{q+1}_{MUX}(⟨c⟩^{B}, ⟨m_c⟩^{q+1}, 2^q)
4: Call ⟨e′⟩^{p+2} = Π^{p+2}_{MUX}(⟨c⟩^{B}, ⟨e⟩^{p+2}, ⟨e⟩^{p+2} + 1)
5: Return (⟨m′⟩^{q+1}, ⟨e′⟩^{p+2})

Figure 9: Round mantissa and check for its overflow.
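For reference, a cleartext analogue of Round&Check is sketched below in C++ (our own illustration on public integers; the tie-breaking rule and the overflow handling are our reading of the protocol): it rounds the top q+1 bits of a normalized (Q+1)-bit mantissa to nearest-even and, in the single overflow case, emits 2^q with an incremented exponent.

#include <cstdint>
#include <utility>

// Round a normalized mantissa m in [2^Q, 2^{Q+1}) with scale Q down to scale q,
// bumping the exponent when rounding overflows to 2^{q+1} (the only such case).
static std::pair<uint64_t, int64_t> round_and_check(uint64_t m, int64_t e, int Q, int q) {
    int drop = Q - q;                                   // number of mantissa bits dropped
    uint64_t half = 1ull << (drop - 1);
    uint64_t lsb  = (m >> drop) & 1ull;                 // last surviving bit (decides ties)
    uint64_t mc   = (m + half - 1 + lsb) >> drop;       // round to nearest, ties to even
    if (mc == (1ull << (q + 1)))                        // rounding overflowed the mantissa range
        return {1ull << q, e + 1};                      //   output 2^q and increment the exponent
    return {mc, e};
}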

F.3 Clip details

For target representation (p, q) and input α, the clipping protocol Π^{p,q}_{Clip} is easily realized by first performing a comparison ⟨c⟩^{B} = Π^{p+2}_{LT}(⟨α.e⟩^{p+2}, −2^{p−1} + 2) of p+2 bits to check if the exponent is less than or equal to the smallest representable exponent, and then setting the mantissa to 0 and the exponent to −2^{p−1} + 1 in case c = 1 using two multiplexer operations: Π^{q+1}_{MUX}(⟨c⟩^{B}, 0, ⟨α.m⟩^{q+1}) and Π^{p+2}_{MUX}(⟨c⟩^{B}, −2^{p−1} + 1, ⟨α.e⟩^{p+2}), respectively. Since the comparison operation is signed, we require |α.e| < 2^{p+1} for its correctness.

G Non-Linear Layers for FP32

Non-linear layers require both compound operations and SIMD operations. BEACON provides novel protocols for compound operations. For SIMD operations, we use the existing state-of-the-art protocols given by SecFloat [60] for pointwise multiplication, addition, division, comparison, and exponentiation.

G.1 ReLU

ReLU(x) = max(x, 0) can simply be computed as a multiplexer over the sign bit.

G.2 Softmax

Given a vector α ∈ R^n as input, softmax outputs a vector δ such that δ_i = e^{α_i} / Σ_{j∈[n]} e^{α_j}. In practice, to avoid overflows in exponentiation, the maximum element is subtracted from every element of the array to create β, i.e., β_i = α_i − α′, where α′ = max_j α_j. At a high level, softmax has the following steps:

1. Compute the maximum element α′ and subtract it from every vector element to get the vector β (with all non-positive entries).
2. Compute exponentiation on β to get γ.
3. Sum the elements of γ to get θ.
4. Divide γ by θ and output the resulting vector.

All the above steps can be implemented using operations provided in SecFloat [60]. We improve upon this solution by optimizing steps 3 and 4. First, we use vector sum (Section 5.1) in step 3, which not only improves the efficiency but also the accuracy of this step. Next, we observe that in step 4, all vector elements are being divided by the same value θ, and we can get better amortization cost for this operation. In more detail, computation of the output mantissa in the division functionality from [60] involves first approximating the reciprocal of the divisor, followed by a multiplication with the dividend and a rounding operation. Since the divisor is the same in all of the division operations, the reciprocal computation can be done just once.

Non-linear layers require both compound operations and G.3 Sigmoid/Tanh


SIMD operations. B EACON provides novel protocols for com-
pound operations. For SIMD operations, we use the exist- For α ∈ R, sigmoid is defined as Sigmoid(α) = 1+e1−α .
ing state-of-the-art protocols given by S EC F LOAT [60] for Clearly, this is equivalent to a softmax on vector of length 2,
pointwise multiplication, addition, division, comparison and i.e., softmax on β = [0, α].
α −α
exponentiation. We use Tanh(α) = eeα −e +e−α
= 2 · Sigmoid(2α) − 1.

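For completeness, the Tanh identity above follows by dividing the numerator and denominator by e^{α}:

Tanh(α) = (e^{α} − e^{−α})/(e^{α} + e^{−α}) = (1 − e^{−2α})/(1 + e^{−2α}) = 2/(1 + e^{−2α}) − 1 = 2 · Sigmoid(2α) − 1.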