Beacon
Deevashwer Rathee^1, Anwesh Bhattacharya^2, Divya Gupta^2, Rahul Sharma^2, and Dawn Song^1
^1 University of California, Berkeley
^2 Microsoft Research
Our Algorithm. We propose an algorithm for floating-point Summation that addresses all three drawbacks. The key insight in our algorithm is that the expensive steps, i.e., normalization, rounding and clipping, can be performed only once for the whole vector, as opposed to once per addition. This is because we only require the final output to be a normalized floating-point value and no such guarantees are required for intermediate values. This not only reduces the cryptographic cost greatly (by avoiding unnecessary normalization, rounding, and clipping), but, as we formally prove, also makes the worst-case error independent of n. Finally, the round complexity of the resulting 2PC protocol is log(n) times the rounds for comparison and multiplexer on (p+2) bits, plus a constant. Note that this is much lower than the tree-sum discussed above. In the remainder of this section, we first describe our algorithm for Summation, then prove its worst-case error bounds, and finally describe a 2PC protocol for Summation over secret-shared floating-point.

Our Summation algorithm FPSum_{p,q,n} is described formally in Figure 1 and has the following steps:

1. Compare the exponents of all vector elements to find the largest exponent e_max (Step 2). For ñ = log(n), define e_thr = e_max − q − ñ − 1 and threshold Γ = 2^{e_thr} such that only vector elements with magnitude ⩾ Γ contribute to the sum (Step 3).

The value of e_thr has been carefully chosen to help obtain low numerical error in Theorem 1. This algorithm assumes a relation between p and n. In particular, e_thr needs to fit in (p+1) bits. Concretely, for p = 8 (that is true for FP32, BF16, FP19), n needs to be smaller than 2^{105}, which trivially holds for any practical summation.

Observe that the algorithm invokes the expensive steps of normalize, round and clip only once, instead of n times in the tree-sum algorithm. Next, we report the precision of our algorithm and a description of the corresponding 2PC protocol with its complexity.

Theorem 1. The relative error of Figure 1 is at most ε·(κ+1) + O(ε²κ), where ε = 2^{−q−1} is machine epsilon and κ = (∑_{i∈[n]} |α_i|) / (∑_{i∈[n]} α_i) is the condition number of summation.

Proof. Let γ̂ and γ represent the output of FPSum and the result of summation over reals, respectively. Let α_k be the element with the largest exponent. Note that |α_k| ⩾ 2^{e_max} and up until the rounding step (before Step 12), FPSum computes the sum γ′ of all elements ⩾ 2^{e_thr} exactly. Thus, the total magnitude of elements set to 0 in Step 5 that are ignored by FPSum is at most n·2^{e_thr} ⩽ 2^{e_max}·2^{−q−1} ⩽ |α_k|·2^{−q−1}. Hence, |γ − γ′| ⩽ |α_k|·ε, where ε = 2^{−q−1} is the machine epsilon, and the relative error of γ′ is |∆| = |γ − γ′| / |γ| ⩽ |α_k|·ε / (∑_{i∈[n]} α_i) ⩽ (∑_{i∈[n]} |α_i|·ε) / (∑_{i∈[n]} α_i) = ε·κ.

The final output γ̂ of FPSum is obtained after rounding γ′ to q mantissa bits. Thus, γ̂ = γ′(1 + δ) [28], where |δ| ⩽ ε, and the absolute error of γ̂ w.r.t. γ is:

|γ − γ̂| = |γ − γ′(1 + δ)| = |γ − γ(1 + ∆)(1 + δ)| = |γ(δ + ∆ + δ∆)| ⩽ |γ|·(ε + εκ + ε²κ).

Thus, the relative error of γ̂ is |γ − γ̂| / |γ| ⩽ ε(κ + 1) + O(ε²κ).
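To make the data flow concrete, the following plaintext Python sketch (ours, not part of the paper's artifact; fp_decompose and fpsum are illustrative names) mirrors FPSum for FP32-style inputs: it aligns every surviving mantissa to the threshold exponent, adds exact integers, and performs a single rounding at the very end (here, the int-to-double conversion inside ldexp).

import math, random

Q = 23                                   # FP32 mantissa bits (q in the text)

def fp_decompose(x):
    # Split |x| into (mantissa, exponent) with mantissa in [2^Q, 2^(Q+1)).
    m, e = math.frexp(abs(x))            # |x| = m * 2^e with m in [0.5, 1)
    return int(m * (1 << (Q + 1))), e - (Q + 1)

def fpsum(xs):
    n_tilde = math.ceil(math.log2(len(xs)))
    decomp = [fp_decompose(x) for x in xs]
    e_max = max(e for _, e in decomp)
    e_thr = e_max - Q - n_tilde - 1      # threshold exponent, as in Step 3
    acc = 0
    for x, (m, e) in zip(xs, decomp):
        if e >= e_thr:                   # elements below 2^e_thr are dropped
            sign = -1 if x < 0 else 1
            acc += sign * (m << (e - e_thr))   # align and add exactly
    return math.ldexp(acc, e_thr)        # normalize/round/clip happen once, here

xs = [random.uniform(-1, 1) for _ in range(1000)]
print(abs(fpsum(xs) - math.fsum(xs)) / abs(math.fsum(xs)))  # ~eps * kappa or smaller

Running this prints a relative error on the order of 2^{−24}·κ, matching the flavor of Theorem 1, even though n additions were performed with no intermediate rounding.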
Secure Protocol. Figure 2 describes the 2PC protocol corresponding to our Summation algorithm FPSum (Figure 1). For each step of the algorithm in Figure 1, we invoke the corresponding 2PC protocol that takes the shares of the inputs and returns the shares of the outputs (see Section 3.4 for the 2PC building blocks used for this transformation). With this, the transformation from the algorithm to the 2PC protocol is straightforward for all steps except the left-shift step needed to align the mantissas (Step 6, Figure 1) and the additional extension needed before adding the signed mantissas in Step 8, Figure 1. In Figure 2, Steps 7-9 implement the left-shift of mantissas. Note that a left-shift by r bits can be implemented by multiplying the value by 2^r (in sufficient bitwidth). Also, since the shift amount for a mantissa is secret-shared, we need to compute the power-of-2 operation for a secret-shared input. Since we have a bound on the shift amount, we can implement it efficiently using a lookup table. In more detail, our algorithm left-shifts the i-th mantissa m_i by r_i = (α_i.e − e_thr) < (q + ñ + 2). We store r_i in k = ⌈log(q + ñ + 2)⌉ bits (using a modulo by 2^k in Step 7), which we assume is smaller than p + 2, as is the case for all floating-point representations used in practice. Consider a lookup table pow2 with 2^k entries over {0,1}^{q+ñ+2} such that pow2[i] = 2^i. We do a secret lookup in pow2 at index r_i to learn shares of 2^{r_i}. Then, we multiply with m_i to obtain m_i^{(align)} in 2q + ñ + 2 bits. Next, to avoid overflows during addition of the (aligned) mantissas and to accommodate the sign-bit, we require additional ñ + 1 bits. Hence, in Step 10, we extend the mantissa to ℓ = 2q + 2ñ + 3 bits. This protocol computes the same result as FPSum and we state security in Section 7.
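The plaintext identity behind Steps 7-9 can be checked directly; this small Python check (ours, with FP32-style parameters assumed) shows that storing the shift amount modulo 2^k is lossless and that one multiplication by a table entry realizes the data-dependent left shift:

import math

q, n_tilde = 23, 10                      # FP32 mantissa bits; n_tilde = log2(n) for n = 1024
k = math.ceil(math.log2(q + n_tilde + 2))        # k = 6; the protocol's table has 2^k entries
pow2 = [1 << i for i in range(q + n_tilde + 2)]  # pow2[i] = 2^i (padding to 2^k entries omitted)

m = (1 << q) | 0b1011                    # a normalized (q+1)-bit mantissa
for r in range(q + n_tilde + 2):         # r_i = alpha_i.e - e_thr < q + n_tilde + 2
    assert r % (1 << k) == r             # Step 7: reducing mod 2^k loses nothing
    assert m * pow2[r] == m << r         # Steps 8-9: one lookup + one multiply = shift
print("secret left shift = LUT + multiplication")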
Complexity. As can be observed, the round complexity of our protocol is equal to the round complexity of Π_max^{p+2,n} (to compute the maximum exponent) plus roughly the round complexity of a single floating-point addition. In contrast, the round complexity of the tree-sum based protocol is roughly log n times the round complexity of a floating-point addition. Concretely, for FP32, we require 6 log n + 73 rounds compared to 69 log n in SecFloat that uses tree sum. Moreover, the communication complexity of our protocol for FP32 and n = 2000 is 4.4× lower than SecFloat's (Figure 5).
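As a quick sanity check on these counts (our back-of-the-envelope script, assuming log denotes ⌈log₂⌉; these are formula evaluations, not measurements):

import math

def rounds_ours(n):        # 6*log(n) + 73, the FP32 count quoted above
    return 6 * math.ceil(math.log2(n)) + 73

def rounds_tree_sum(n):    # 69*log(n) for the tree-sum baseline
    return 69 * math.ceil(math.log2(n))

for n in (128, 2000, 1 << 20):
    print(n, rounds_ours(n), rounds_tree_sum(n))
# e.g. n = 2000: 139 vs 759 rounds, roughly 5.5x fewer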
5.2 Generalized Summation

Our Summation algorithm in Section 5.1 requires that the inputs are normalized floating-point values with the same (p, q) parameters as the desired output. We propose a generalized Summation that can sum up unnormalized floating-point values. As we will see later, this generalized Summation would play a crucial role in our protocols for dot products and the BFloat16 datatype as well.

First, we set up the problem statement. The input is α = (α_1, ..., α_n), where each α_i is an unnormalized floating-point value with a (p+2)-bit signed exponent α_i.e ∈ [−2^{p−1}+1, 2^p−1) and an unnormalized (unsigned) mantissa α_i.m with bitwidth b, lower bound on MSNZB b′, and scale sc such that α_i.m ∈ [2^{b′}, 2^b). That is, unlike a normalized mantissa, α_i.m can have at most (b − b′ − 1) leading 0's. The result of summation must be a normalized floating-point number with parameters (p, q), where sc ⩾ q.

Our algorithm for generalized Summation follows the blueprint of FPSum with the following modifications to deal with unnormalized mantissas. First, after computing e_max, we set e_thr = e_max − sc − ñ − (b − b′) so that we can still safely ignore the values smaller than the threshold Γ = 2^{e_thr}. With this change, the maximum shift amount is sc + ñ + (b − b′) (as opposed to q + ñ + 1) and consequently, we need to increase ℓ to 2b − b′ + sc + 2ñ + 1 to ensure exact summation of n aligned and signed mantissas. Since normalization expects an input with scale q and the sum of aligned mantissas M has scale sc, we also change the scale of M to q and accordingly add q − sc to the exponent.

We describe our algorithm for generalized Summation, g-FPSum, parameterized by b, b′, sc, p, q, n in Figure 3 and prove below that it achieves the same error bounds as before. Moreover, g-FPSum can easily be transformed to a 2PC protocol Π_{g-FPSum}^{b,b′,sc,p,q,n} over secret-shared input using the same steps as in Section 5.1.

Theorem 2. The relative error of Figure 3 is at most ε + δκ + O(εδκ), where ε = 2^{−q−1}, δ = 2^{−sc−1}, and κ = (∑_{i∈[n]} |α_i|) / (∑_{i∈[n]} α_i) is the condition number of summation.

Proof. Let α_k be the element with the largest exponent, and let γ and γ′ represent the sum of all elements and the sum of all elements with exponent ⩾ e_thr, respectively. We only need to argue that |γ − γ′| ⩽ |α_k|·δ, where δ = 2^{−sc−1}, and the rest of the proof follows in the same way as the proof of Theorem 1. It is easy to see that |α_k| ⩾ 2^{b′−sc+e_max} and each omitted term is ⩽ 2^{b−sc+e_thr−1}. Thus, the sum of omitted terms has a magnitude < n·2^{b−sc+e_thr−1} = 2^{b−sc+e_max−sc−(b−b′)−1} = 2^{b′−sc+e_max−sc−1} ⩽ |α_k|·2^{−sc−1}, and |γ − γ′| ⩽ |α_k|·δ.
Protocol Π_{FPSum}^{p,q,n}

Input: ⟨α_i⟩^{FP(p,q)} for i ∈ [n].
Output: ⟨γ⟩^{FP(p,q)} s.t. γ = ∑_{i=1}^{n} α_i.

1: ñ = log(n); ℓ = 2q + 2ñ + 3
2: Call ⟨e_max⟩^{p+2} = Π_max^{p+2,n}({⟨α_i.e⟩^{p+2}}_{i∈[n]}) ▷ Find max exponent e_max
3: Set ⟨e_thr⟩^{p+2} = ⟨e_max⟩^{p+2} − q − ñ − 1 ▷ Threshold exponent e_thr = e_max − q − ñ − 1
// Steps 5-11 implement Steps 5, 6, 7 of Figure 1
4: for i ∈ [n] do
5:   Call ⟨c_i⟩^B = Π_LT^{p+2}(⟨α_i.e⟩^{p+2}, ⟨e_thr⟩^{p+2}) ▷ Check if α_i's exponent is < e_thr
6:   Call ⟨m_i⟩^{q+1} = Π_MUX^{q+1}(⟨c_i⟩^B, 0, ⟨α_i.m⟩^{q+1}) ▷ Set mantissa of α_i to 0 if α_i.e < e_thr
7:   Set ⟨r_i⟩^k = (⟨α_i.e⟩^{p+2} − ⟨e_thr⟩^{p+2}) mod 2^k, where k = ⌈log(q+ñ+2)⌉ ▷ Compute shift amount r_i ⩽ q+ñ+1 in k bits
8:   Call ⟨R_i⟩^{q+ñ+2} = Π_LUT^{k,q+ñ+2}(pow2, ⟨r_i⟩^k) ▷ R_i = 2^{r_i} = pow2(r_i)
9:   Call ⟨m_i^{(align)}⟩^{2q+ñ+2} = Π_UMult^{q+1,q+ñ+2,2q+ñ+2}(⟨m_i⟩^{q+1}, ⟨R_i⟩^{q+ñ+2}) ▷ Align the mantissa m_i by left-shifting by r_i
10:  Call ⟨m_i^{(ext)}⟩^ℓ = Π_ZXt^{2q+ñ+2,ℓ}(⟨m_i^{(align)}⟩^{2q+ñ+2}) ▷ Extend to make space for addition of n elements and sign-bit
11:  Call ⟨m_i^{(s)}⟩^ℓ = Π_MUX^ℓ(⟨α_i.s⟩^B, −1·⟨m_i^{(ext)}⟩^ℓ, ⟨m_i^{(ext)}⟩^ℓ) ▷ Set sign of mantissa same as input α_i
12: Set ⟨M^{(s)}⟩^ℓ = ∑_{i∈[n]} ⟨m_i^{(s)}⟩^ℓ ▷ Add the aligned mantissas to get M^{(s)}
13: Call (⟨s⟩^B, ⟨z⟩^B) = Π_LT&EQ^ℓ(⟨M^{(s)}⟩^ℓ, 0) ▷ Set s = 1{M^{(s)} < 0}, z = 1{M^{(s)} = 0}
14: Call ⟨M⟩^ℓ = Π_MUX^ℓ(⟨s⟩^B, −1·⟨M^{(s)}⟩^ℓ, ⟨M^{(s)}⟩^ℓ) ▷ M = |M^{(s)}|
15: Call (⟨M_1⟩^ℓ, ⟨e_1⟩^{p+2}) = Π_Normalize^{p,q,ℓ−1}(⟨M⟩^ℓ, ⟨e_thr⟩^{p+2}) ▷ e_thr is the exponent for unnormalized M
16: Call (⟨M_2⟩^{q+1}, ⟨e_2⟩^{p+2}) = Π_Round&Check^{p,q,ℓ−1}(⟨M_1⟩^ℓ, ⟨e_1⟩^{p+2}) ▷ Reduce precision of mantissa from (ℓ−1) to q
17: Call and return ⟨γ⟩^{FP(p,q)} = Π_Clip^{p,q}(⟨z⟩^B, ⟨s⟩^B, ⟨e_2⟩^{p+2}, ⟨M_2⟩^{q+1}) ▷ Clip values smaller than the smallest representable value to 0

Figure 2: 2PC protocol for floating-point Summation.
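Note the distinction in Figure 2 between "Call" (an interactive sub-protocol) and "Set" (local computation): Step 12 is a plain sum of additively secret-shared values, so each party adds its own shares with no communication. A minimal Python model of 2-out-of-2 additive sharing mod 2^ℓ (ours, for intuition only; ℓ = 57 corresponds to q = 23 and ñ = 4, i.e., n = 16):

import random

l = 57                        # l = 2q + 2*n_tilde + 3 for q = 23, n_tilde = 4
MOD = 1 << l

def share(x):
    # Additively secret-share x mod 2^l between parties P0 and P1.
    x0 = random.randrange(MOD)
    return x0, (x - x0) % MOD

vals = [random.randrange(-(1 << 40), 1 << 40) for _ in range(16)]
shares = [share(v) for v in vals]

M0 = sum(s0 for s0, _ in shares) % MOD     # P0's local "Set" in Step 12
M1 = sum(s1 for _, s1 in shares) % MOD     # P1's local "Set" in Step 12
assert (M0 + M1) % MOD == sum(vals) % MOD  # together: shares of the sum of mantissas
print("Step 12 needs no interaction")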
5.3 Dot product and matrix multiplication

Given floating-point vectors α = (α_1, ..., α_n) and β = (β_1, ..., β_n), DotProduct is defined as γ = ∑_{i∈[n]} α_i·β_i. Thus, dot product can be realized naively by first computing the intermediate products γ = {α_i·β_i}_{i∈[n]} using the floating-point multiplication protocol for scalars, Π_FPMul^{p,q}, from [60], and then calling our protocol for Summation, Π_FPSum^{p,q,n} (Figure 2), on γ. We reduce the cost of this approach further by removing the normalization step in the floating-point multiplication and then using our protocol for generalized Summation (Section 5.2). In our dot product protocol, the exponents of α_i and β_i are added and the mantissas of α_i and β_i are multiplied. The latter creates unnormalized (2q+2)-bit mantissas with scale 2q and values in [1, 4). We round these intermediate products to q+2 bits with scale q, perform clipping to 0 (to satisfy the constraints on exponents of inputs in generalized Summation in Section 5.2), and then invoke generalized Summation (Π_{g-FPSum}^{q+2,q,q,p,q,n}) to get the output of the dot product. Our protocol Π_FPDotProd^{p,q,n} for dot product is formally described in Figure 7, Appendix E. For n = 1000, our approach has 1.2× lower communication than the naive approach and is just as precise, as proved below. The protocols for matrix multiplication and convolutions build on top of dot product, and their description is deferred to the full version of the paper.

Theorem 3. Our dot product protocol Π_FPDotProd^{p,q,n} is as precise as Π_FPSum^{p,q,n}({Π_FPMul^{p,q}(⟨α_i⟩^{FP(p,q)}, ⟨β_i⟩^{FP(p,q)})}_{i∈[n]}).

Proof. We first look at the worst-case relative error of Π_FPDotProd^{p,q,n}. This protocol first computes the product of mantissas m_i exactly, and then rounds it to q+2 bits. Importantly, q+2 bits suffice for the output of rounding m′_i as m_i ∈ [2^{2q}, (2^{q+1}−1)²], and thus, m′_i = RN(m_i, q) ∈ [2^q, 2^{q+2}−2]. Note that the rounding operation introduces a relative error of |δ_1| ⩽ 2^{q−1}/|m_i| ⩽ 2^{−q−1} to the intermediate product γ_i. Unless γ_i is smaller than the smallest representable value, i.e., α_i.e + β_i.e < 1 − 2^{p−1}, it is then input as is to Π_{g-FPSum}^{q+2,q,q,p,q,n}, which adds a relative error of |δ_2| ⩽ ε(κ+1) + O(ε²κ) to its output, where ε = 2^{−q−1} (Theorem 2). Thus, the absolute error of Π_FPDotProd^{p,q,n} is |∑_i (γ_i·(1+δ_2) − α_iβ_i)| = |∑_i α_iβ_i·(1+δ_1)(1+δ_2) − α_iβ_i| ⩽ |∑_i α_iβ_i·(ε(κ+2) + O(ε²(κ+1)))|, which implies a worst-case relative error of ε(κ+2) + O(ε²(κ+1)).

Now, we look at the worst-case error of the naïve solution. Π_FPMul^{p,q} introduces a worst-case relative error of ε to the intermediate products, the same as Π_FPDotProd^{p,q,n}. It also clips intermediate products when α_i.e + β_i.e < 1 − 2^{p−1}. The intermediate products in the naïve solution are then input to Π_FPSum^{p,q,n}, which has a worst-case relative error of ε(κ+1) + O(ε²κ), the same as Π_{g-FPSum}^{q+2,q,q,p,q,n}. Thus, Π_FPDotProd^{p,q,n} is as precise as the naïve solution.
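The mantissa bookkeeping in the first half of the proof is easy to verify empirically; the check below (ours; RN is modeled as round-half-up, whereas an implementation may use ties-to-even) confirms that rounding the exact mantissa product by q bits lands in [2^q, 2^{q+2}−2] with absolute error at most 2^{q−1}:

import random

q = 23
def rn(x, bits):
    # Round-to-nearest by dropping `bits` low bits (ties rounded up here).
    return (x + (1 << (bits - 1))) >> bits

for _ in range(100000):
    a = random.randrange(1 << q, 1 << (q + 1))   # normalized (q+1)-bit mantissas
    b = random.randrange(1 << q, 1 << (q + 1))
    prod = a * b                                 # exact, in [2^(2q), (2^(q+1)-1)^2]
    mp = rn(prod, q)
    assert (1 << q) <= mp <= (1 << (q + 2)) - 2  # q+2 bits suffice for the result
    assert abs((mp << q) - prod) <= 1 << (q - 1) # |delta_1| <= 2^(q-1)/prod <= 2^(-q-1)
print("rounded products fit in q+2 bits")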
Algorithm g-FPSum^{b,b′,sc,p,q,n}({α_i}_{i∈[n]}), sc ⩾ q

1: ñ = log(n); ℓ = 2b − b′ + sc + 2ñ + 1
2: e_max = max_i α_i.e
3: e_thr = e_max − sc − ñ − (b − b′)
4: for i ∈ [n] do
5:   m_i = (α_i.e < e_thr) ? 0 : α_i.m
6:   m_i^{(align)} = m_i ≪ (α_i.e − e_thr)
7:   m_i^{(s)} = α_i.s ? −m_i^{(align)} : m_i^{(align)}
8: M^{(s)} = ∑_{i∈[n]} m_i^{(s)}
9: (s, z) = LT&EQ^ℓ(M^{(s)}, 0)
10: M = s ? −M^{(s)} : M^{(s)}
11: (e, M′) = Normalize^{p,q,ℓ−1}(e_thr + q − sc, M)
12: (e_1, M_1) = Round&Check^{p,q,ℓ−1}(e, M′)
13: Return Clip^{p,q}(z, s, e_1, M_1)

Figure 3: Generalized Floating-Point Summation.
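For intuition, here is a compact plaintext Python rendering of Figure 3 (ours; Steps 9-13 are collapsed into one conversion to a double, so it models the arithmetic rather than the 2PC protocol). The example call uses the dot-product parameters b = q+2, b′ = q, sc = q from Section 5.3 with q = 23:

import math

def g_fpsum(vals, b, b_prime, sc, n):
    # Each input is (sign, mant, exp) with mant in [2^b_prime, 2^b);
    # its value is sign * mant * 2^(exp - sc).
    n_tilde = math.ceil(math.log2(n))
    e_max = max(e for _, _, e in vals)              # Step 2
    e_thr = e_max - sc - n_tilde - (b - b_prime)    # Step 3
    acc = 0
    for s, m, e in vals:                            # Steps 4-8
        if e >= e_thr:
            acc += s * (m << (e - e_thr))           # align, apply sign, add exactly
    return math.ldexp(acc, e_thr - sc)              # Steps 9-13: one normalize/round

vals = [(1, (1 << 23) + 12345, -3), (-1, (1 << 24) + 99, -5), (1, (1 << 23) | 7, -4)]
exact = sum(s * m * 2.0 ** (e - 23) for s, m, e in vals)
print(g_fpsum(vals, b=25, b_prime=23, sc=23, n=3), exact)  # identical on this input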
6 BFloat16 training

BFloat16 or BF16 is essentially a lower-precision version of FP32 with the same dynamic range: the mantissa bits q are reduced from 23 to 7 and the exponent bits p = 8 are the same. In this section, we discuss our techniques for secure implementation of BF16 that give performance improvements over BEACON's FP32 while being more precise than standard platforms for BF16. Recall that in all platforms, BF16 is used just as a data-storage format, i.e., the inputs and outputs to each layer of the model are stored as BF16, and the arithmetic is performed in higher-precision FP32 (Section 1.1). Although we focus on BF16, our techniques generalize in a straightforward manner to other representations, e.g., TensorFloat, which uses q = 10. Below, we discuss linear layers and defer non-linear layers to the full version of the paper.

Following these platforms, the naïve solution for a BF16 dot product converts the inputs to FP32 and invokes the FP32 dot-product protocol. As can be seen easily, this protocol is at least as expensive as Π_FPDotProd^{8,23,n}, i.e., a dot product of length n over FP32. Another approach to compute a dot product over BF16, which is much more efficient, is to directly invoke Π_FPDotProd^{8,7,n} and return its output as the final output. However, this has much worse error compared to the first approach that works over higher precision (Section 1.1.2). We now describe our protocol that achieves the best of both worlds: it is only < 30% more expensive than Π_FPDotProd^{8,7,n}, and has the same precision as the standard BF16.

If we look closely at the naïve solution, it first left-shifts the input mantissas by 16 bits each, then multiplies them, and finally rounds the multiplication result by 23 bits to get the mantissas for the intermediate products, the least significant 9 bits of which are always 0. It is easy to see that this is quite wasteful, and we can instead simply multiply the input mantissas to get 16-bit mantissas with scale 14 for the intermediate products without losing precision. However, there is an issue with this change. Since the mantissas being added only have scale 14 instead of scale 23 in the naïve approach, the following generalized Summation would ignore more values than the naïve approach (e_thr depends on the scale sc, and a lower scale leads to a higher e_thr and larger-magnitude values being dropped from the sum). Hence, we fix this in the second step, by increasing the scale of the mantissa (but not the bitwidth) by 9 bits and accordingly adding 9 to the exponent to account for the scale change, thereby obtaining an exact intermediate product in 16 bits and scale 23. These mantissas with higher scale are now fed into the generalized Summation protocol by invoking Π_{g-FPSum}^{16,14,23,8,7,n}. Our approach is much more efficient as it avoids the expensive steps of multiplication on 24-bit inputs (Step 3) and rounding by 23 bits (Step 6) in Π_FPDotProd^{8,23,n}, and it also operates on mantissas of 16 bits as opposed to 25 bits in generalized summation. Our BF16 dot-product protocol Π_FPDotProdBF16^n is described in Figure 4 and its precision is proved below. From Section 5.3, we get the BF16 matrix-multiplication protocol by building upon our BF16 dot-product protocol.
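The scale-relabeling trick described above is easy to check in plaintext. In the sketch below (ours; the 8-bit mantissas are made-up values, and a BF16 value is taken as m·2^{e−7} with m ∈ [2^7, 2^8)), the naïve FP32 route and the relabeled 16-bit product represent exactly the same number:

a_m, a_e = 0b1011_0011, 5                 # two hypothetical BF16 operands (q = 7)
b_m, b_e = 0b1110_0001, -2

prod_m, prod_e = a_m * b_m, a_e + b_e     # exact 16-bit product, scale 14

# Naive FP32 route: pad mantissas to 24 bits, multiply, round off 23 bits;
# the 9 lowest of the dropped bits are always zero, so the padding buys nothing.
naive_m = ((a_m << 16) * (b_m << 16)) >> 23
assert naive_m == prod_m << 9             # a 25-bit mantissa at scale 23

# Our route: keep the 16-bit product and relabel it: scale 14 -> 23, exponent e -> e + 9.
assert prod_m * 2.0 ** ((prod_e + 9) - 23) == naive_m * 2.0 ** (prod_e - 23)
print("same value, 16-bit mantissa at scale 23, ready for g-FPSum(16,14,23,8,7,n)")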
We note that all our protocols for linear layers (Sections 5, 6) and non-linear layers (Appendix G) satisfy the condition that parties/servers P_0 and P_1 start with secret shares of the inputs and end up with secret shares of the outputs. Using this, our 2-party protocol for end-to-end secure training works by putting together protocols for linear layers and non-linear layers as specified by the training algorithm for both the forward and the backward passes. As is standard, the security of our training algorithms can be argued in the hybrid model [13] by composing the building blocks, and we defer the complete security proof to the full version of the paper.

8 Implementation

We have implemented BEACON as a library in C++ on top of SecFloat with 5k LOC. This library's API provides ...

9.1 Secure training with BEACON

Figure 5 compares the time and communication of BEACON with SecFloat [60], the state-of-the-art in secure floating-point, on the microbenchmarks for linear layers (Figures 5(a)-5(c)) and the training benchmarks for a batch iteration (Figures 5(d)-5(g)). We relegate the evaluation of non-linear microbenchmarks to Appendix C, as 99% of the secure training cost in SecFloat comes from linear layers (Appendix A). In addition to our training benchmarks, we also consider the Relevance model (Figure 7(h)), a benchmark proposed by SecFloat [60]. We observe that BEACON has 3.4−8.1× lower latency and 3.1−4.4× lower communication than SecFloat over FP32 tasks. The SecFloat baseline decomposes a compound operation into individual scalar operations that suffer from performance overheads caused by many normalization and rounding steps.
Figure 5: Performance comparison of SecFloat (time, comm) with BEACON FP32 (time, comm) and BEACON BF16 (time, comm). The left bar group compares latency and the right group compares communication. Improvement factors of BEACON (both FP32 and BF16) are also shown. [Bar charts omitted from this extract; panels report Time (s) and Comm (GiB), with improvement factors ranging from 3.1× to 16.1×.]
When compared with SecFloat over BF16 tasks, BEACON improves the latency by 6.3−16.1× and the communication by 5.4−9.5×. Our further improvements with BF16 are again due to SecFloat's use of scalar operations, which require that all arithmetic be performed in FP32 (Section 1.1.2).

Similar to KS22, we set the mini-batch size to 128 for all benchmarks except for Relevance, where SecFloat sets the mini-batch size to 32. Our evaluation in Figure 5 is for one training iteration with these mini-batch sizes. We observe that BEACON has 6.3−7.4× lower latency and 6.2−6.5× lower communication than the baseline on training tasks over BF16, thus demonstrating the performance benefits of our protocols.

9.2 Cost of BEACON vs. KS22

Benchmark        Time (minutes)              Comm. (GiB)
                 BEACON          KS22        BEACON             KS22
MNIST Logistic   56 (3.3×)       16          2,011 (34.8×)      58
MNIST FFNN       626 (5.9×)      106         30,820 (97.5×)     316
CIFAR LeNet*     3,285 (3.1×)    1,065       165,594 (21×)      7,881
CIFAR HiNet*     100,827 (3.1×)  32,137      5,280,375 (24.7×)  214,061

Table 3: Time (in minutes) and communication (in GiB) per epoch of BEACON vs. KS22 [41]. Numbers in parentheses represent the overhead of BEACON over KS22. *: extrapolated from 10 iterations.
[21] Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesus Corbal, Nikita Shustrov, Roma Dubtsov, Evarist Fomenko, and Vadim Pirogov. Mixed Precision Training of Convolutional Neural Networks using Integer Operations, 2018.

[22] Daniel Demmler, Ghada Dessouky, Farinaz Koushanfar, Ahmad-Reza Sadeghi, Thomas Schneider, and Shaza Zeitouni. Automated synthesis of optimized circuits for secure computation. In CCS, 2015.

[23] Ghada Dessouky, Farinaz Koushanfar, Ahmad-Reza Sadeghi, Thomas Schneider, Shaza Zeitouni, and Michael Zohner. Pushing the Communication Barrier in Secure Computation using Lookup Tables. In NDSS, 2017.

[24] Mario Drumond, Tao Lin, Martin Jaggi, and Babak Falsafi. Training DNNs with hybrid block floating point. In NIPS, 2018.

[25] Cynthia Dwork. Differential privacy: A survey of results. In TAMC, 2009.

[26] Daniel Escudero, Satrajit Ghosh, Marcel Keller, Rahul Rachuri, and Peter Scholl. Improved primitives for MPC over mixed arithmetic-binary circuits. In CRYPTO, 2020.

[33] Zhicong Huang, Wen-jie Lu, Cheng Hong, and Jiansheng Ding. Cheetah: Lean and fast secure two-party deep neural network inference. In USENIX Security Symposium, 2022.

[34] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, 2018.

[35] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. Gazelle: A low latency framework for secure neural network inference. In USENIX Security Symposium, 2018.

[36] William M. Kahan. Further remarks on reducing truncation errors. Communications of the ACM, 1965.

[37] Dhiraj D. Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. A study of BFLOAT16 for deep learning training. CoRR, abs/1905.12322, 2019.

[38] Liina Kamm and Jan Willemson. Secure floating point arithmetic and private satellite collision analysis. Int. J. Inf. Sec., 2015.

[39] Mahimna Kelkar, Phi Hung Le, Mariana Raykova, and Karn Seth. Secure Poisson regression. In USENIX Security Symposium, 2022.

[40] Marcel Keller. MP-SPDZ: A versatile framework for multi-party computation. In CCS, 2020.

[41] Marcel Keller and Ke Sun. Secure quantized training for deep learning. In ICML, 2022.

[42] Liisi Kerik, Peeter Laud, and Jaak Randmets. Optimizing MPC for robust and scalable integer and floating-point arithmetic. In FC, 2016.

[43] Brian Knott, Shobha Venkataraman, Awni Hannun, Shubhabrata Sengupta, Mark Ibrahim, and Laurens van der Maaten. CrypTen: Secure multi-party computation meets machine learning. In NIPS, 2021.

[44] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

[45] Nishant Kumar, Mayank Rathee, Nishanth Chandran, Divya Gupta, Aseem Rastogi, and Rahul Sharma. CrypTFlow: Secure TensorFlow inference. In IEEE S&P, 2020.

[46] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.

[47] Jay P. Lim and Santosh Nagarakatte. RLIBM-32: High performance correctly rounded math libraries for 32-bit floating point representations. In PLDI, 2021.

[48] Yehuda Lindell. How to simulate it – a tutorial on the simulation proof technique. Tutorials on the Foundations of Cryptography, 2017.

[49] Jian Liu, Mika Juuti, Yao Lu, and N. Asokan. Oblivious Neural Network Predictions via MiniONN Transformations. In CCS, 2017.

[50] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.

[51] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting unintended feature leakage in collaborative learning. In IEEE S&P, 2019.

[52] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart F. Oberman, Mohammad Shoeybi, Michael Y. Siu, and Hao Wu. FP8 formats for deep learning. CoRR, abs/2209.05433, 2022.

[53] Pratyush Mishra, Ryan Lehmkuhl, Akshayaram Srinivasan, Wenting Zheng, and Raluca Ada Popa. Delphi: A cryptographic inference service for neural networks. In USENIX Security Symposium, 2020.

[54] Payman Mohassel and Peter Rindal. ABY3: A Mixed Protocol Framework for Machine Learning. In CCS, 2018.

[55] Payman Mohassel and Yupeng Zhang. SecureML: A System for Scalable Privacy-Preserving Machine Learning. In IEEE S&P, 2017.

[56] Valeria Nikolaenko, Stratis Ioannidis, Udi Weinsberg, Marc Joye, Nina Taft, and Dan Boneh. Privacy-preserving matrix factorization. In CCS, 2013.

[57] Valeria Nikolaenko, Udi Weinsberg, Stratis Ioannidis, Marc Joye, Dan Boneh, and Nina Taft. Privacy-preserving ridge regression on hundreds of millions of records. In IEEE S&P, 2013.

[58] Arpita Patra, Thomas Schneider, Ajith Suresh, and Hossein Yalame. ABY2.0: Improved Mixed-Protocol Secure Two-Party Computation. In USENIX Security Symposium, 2021.

[59] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.

[60] Deevashwer Rathee, Anwesh Bhattacharya, Rahul Sharma, Divya Gupta, Nishanth Chandran, and Aseem Rastogi. SecFloat: Accurate Floating-Point meets Secure 2-Party Computation. In IEEE S&P, 2022. https://ia.cr/2022/322.

[61] Deevashwer Rathee, Mayank Rathee, Rahul Kranti Kiran Goli, Divya Gupta, Rahul Sharma, Nishanth Chandran, and Aseem Rastogi. SIRNN: A math library for secure inference of RNNs. In IEEE S&P, 2021.

[62] Deevashwer Rathee, Mayank Rathee, Nishant Kumar, Nishanth Chandran, Divya Gupta, Aseem Rastogi, and Rahul Sharma. CrypTFlow2: Practical 2-Party Secure Inference. In CCS, 2020.

[63] Adi Shamir. How to share a secret. CACM, 1979.

[64] Sijun Tan, Brian Knott, Yuan Tian, and David J. Wu. CryptGPU: Fast privacy-preserving machine learning on the GPU. In IEEE S&P, 2021.

[65] Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra. SIAM, 1997.

[66] Sameer Wagh, Divya Gupta, and Nishanth Chandran. SecureNN: 3-party secure computation for neural network training. PoPETs, 2019.

[67] Sameer Wagh, Shruti Tople, Fabrice Benhamouda, Eyal Kushilevitz, Prateek Mittal, and Tal Rabin. Falcon: Honest-majority maliciously secure framework for private deep learning. PoPETs, 2021.

[68] Maolin Wang, Seyedramin Rasoulinezhad, Philip H. W. Leong, and Hayden Kwok-Hay So. NITI: Training integer neural networks using integer-only arithmetic. IEEE TPDS, 2022.

[69] Jean-Luc Watson, Sameer Wagh, and Raluca Ada Popa. Piranha: A GPU platform for secure computation. In USENIX Security Symposium, 2022.

[70] N. Whitehead and A. Fit-Florea. Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs. NVIDIA technical white paper, 2011.

[71] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. In ICLR, 2018.

[72] Andrew Yao. How to Generate and Exchange Secrets (Extended Abstract). In FOCS, 1986.

[73] Andrew C. Yao. Protocols for secure computations. In FOCS, 1982.

[74] Wenting Zheng, Raluca Ada Popa, Joseph E. Gonzalez, and Ion Stoica. Helen: Maliciously secure coopetitive learning for linear models. In IEEE S&P, 2019.

B Profiling of SecFloat's operations

In Section 1.1, we claimed that over 82% of the runtime in SecFloat's addition operation was spent in normalization and rounding. To obtain this split (Table 5), we measured the runtime/communication for 10,000 instances of addition. A large number of instances (10,000) was chosen because the design of SecFloat is inherently SIMD. Similarly, it was also claimed in Section 2 that 85% of the runtime in matrix multiplication was spent in summations. This split (Table 6) is obtained by multiplying a 100×100-sized matrix with another 100×100-sized matrix, one of the microbenchmarks considered in Section 9.

C Evaluation on non-linear layers

Figure 6 shows empirical improvements of BEACON over SecFloat for Softmax-100 (1000 softmaxes over vectors of length 100 each) and Sigmoid-1m (pointwise sigmoid of a vector of 1 million elements). For ReLUs, BEACON's performance is similar to SecFloat's. For sigmoid, BEACON's specialized BF16 protocol improves the performance by over 5× compared to SecFloat. The improvement is much smaller for FP32, as BEACON uses SIMD FP32 exponentiations in this case. For Softmax-100, FP32 exponentiations are again the bottleneck for BEACON and SecFloat over both BF16 and FP32, and thus, the improvements from compound operations in BEACON lead to comparatively modest benefits. In fact, BF16 performance is even worse than FP32 in this case due to the FP32-to-BF16 conversion required at the end.
Table 4: Split in cost between linear and non-linear layers for SecFloat.
(a) Softmax-100  (b) Sigmoid-1m

Figure 6: Comparing the performance of SecFloat (dotted) with BEACON (striped, both FP32 and BF16) on non-linear functions. The left group of bars compares latency and the right compares communication (comm). The improvement factor of BEACON is also shown. [Bar charts omitted from this extract; panels report Time (s) and Comm (GiB).]
Protocol Π_Normalize^{p,q,Q}

Input: Unnormalized ⟨m⟩^{Q+1}, ⟨e⟩^{p+2}; the scale of m is q.
Output: Normalized ⟨m′⟩^{Q+1}, ⟨e′⟩^{p+2}.

1: Call ⟨k⟩^{Q+1}, ⟨K⟩^{Q+1} = Π_MSNZB^{Q+1}(⟨m⟩^{Q+1})
2: Call ⟨m′⟩^{Q+1} = Π_UMult^{Q+1,Q+1,Q+1}(⟨m⟩^{Q+1}, ⟨K⟩^{Q+1})
3: Set ⟨e′⟩^{p+2} = ⟨e⟩^{p+2} + (⟨k⟩^{Q+1} mod 2^{p+2}) − q
4: Return (⟨m′⟩^{Q+1}, ⟨e′⟩^{p+2})

Figure 8: Normalize mantissa and adjust exponent accordingly.

G.2 Softmax

Given a vector α ∈ R^n as input, softmax outputs a vector δ such that δ_i = e^{α_i} / ∑_{j∈[n]} e^{α_j}. In practice, to avoid overflows in exponentiation, the maximum element is subtracted from every element of the array to create β, i.e., β_i = α_i − α′, where α′ = max_j α_j. At a high level, softmax has the following steps (a plaintext sketch follows the list):

1. Compute the maximum element α′ and subtract it from every vector element to get the vector β (with all negative entries).

2. Compute exponentiation on β to get γ.

3. Sum the elements of γ to get θ.
3. Sum the elements of γ to get θ.