
Bayesian Differential Privacy for Machine Learning

Aleksei Triastcyn¹   Boi Faltings¹

arXiv:1901.09697v5 [cs.LG] 19 Aug 2020

Abstract

Traditional differential privacy is independent of the data distribution. However, this is not well-matched with the modern machine learning context, where models are trained on specific data. As a result, achieving meaningful privacy guarantees in ML often excessively reduces accuracy. We propose Bayesian differential privacy (BDP), which takes into account the data distribution to provide more practical privacy guarantees. We also derive a general privacy accounting method under BDP, building upon the well-known moments accountant. Our experiments demonstrate that in-distribution samples in classic machine learning datasets, such as MNIST and CIFAR-10, enjoy significantly stronger privacy guarantees than postulated by DP, while models maintain high classification accuracy.

¹ Artificial Intelligence Lab, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland. Correspondence to: Aleksei Triastcyn <aleksei.triastcyn@epfl.ch>. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

Machine learning (ML) and data analytics offer vast opportunities for companies, governments and individuals to take advantage of the accumulated data. However, their ability to capture fine levels of detail can compromise privacy of data providers. Recent research (Fredrikson et al., 2015; Shokri et al., 2017; Hitaj et al., 2017) suggests that even in a black-box setting it is possible to infer information about individual records in the training set.

Numerous solutions have been proposed to address this problem, varying in the extent of data protection and how it is achieved. In this work, we consider a notion that is viewed by many researchers as the gold standard – differential privacy (DP) (Dwork, 2006). Initially, DP algorithms focused on sanitising simple statistics, but in recent years, it made its way to machine learning (Abadi et al., 2016; Papernot et al., 2016; 2018; McMahan et al., 2018).

Despite notable advances, differentially private ML still suffers from two major problems: (a) utility loss due to excessive noise added during training and (b) difficulty in interpreting the privacy parameters ε and δ. In many cases where the first problem appears to be solved, it is actually being hidden by the second. We design a motivational example in Section 3 that illustrates how a seemingly strong privacy guarantee allows for the attacker accuracy to be as high as 99%. Although this guarantee is very pessimistic and holds against a powerful adversary with any auxiliary information, it can hardly be viewed as a reassurance to a user. Moreover, it provides only the worst-case bound, leaving users to wonder how far it is from a typical case.

In this paper, we focus on practicality of privacy guarantees and propose a variation of DP that provides more meaningful guarantees for typical scenarios on top of the global DP guarantee. We name it Bayesian differential privacy (BDP).

The key to our privacy notion is the definition of typical scenarios. We observe that machine learning models are designed and tuned for a particular data distribution (for example, an MRI dataset is very unlikely to contain a picture of a car). Moreover, such prior distribution of data is often already available to the attacker. Thus, we consider a scenario typical when all sensitive data is drawn from the same distribution. While the traditional differential privacy treats all data as equally likely and hides differences by large amounts of noise, Bayesian DP calibrates noise to the data distribution and provides much tighter expected guarantees. As the data distribution is usually unknown, BDP estimates the necessary statistics from data as shown in the following sections. Furthermore, since typical scenarios are determined by data, the participants of the dataset are covered by the BDP guarantee with high probability.

To accompany the notion of Bayesian DP (Section 4.1), we provide its theoretical analysis and the privacy accounting framework (Section 4.2). The latter considers the privacy loss random variable and employs principled tools from probability theory to find concentration bounds on it. It provides a clean derivation of privacy accounting in general (Sections 4.2 and 4.3), as well as in the special case of the subsampled Gaussian noise mechanism. Moreover, we show that it is a generalisation of the well-known moments accountant (MA) (Abadi et al., 2016) (Section 4.4.2).
Since our privacy accounting relies on data distribution samples, a natural concern is that the data not present in the dataset are not taken into account, and thus, are not protected. However, our finite sample estimator is specifically designed to address this issue (see Section 4.3).

Our contributions in this paper are the following:

• we propose a variation of DP, called Bayesian differential privacy, that allows to provide more practical privacy guarantees in a wide range of scenarios;
• we derive a clean, principled privacy accounting method that generalises the moments accountant;
• we experimentally demonstrate advantages of our method (Section 5), including the state-of-the-art privacy bounds in deep learning (Section 5.2).

2. Related Work

With machine learning applications becoming more and more ubiquitous, vulnerabilities and attacks on ML models get discovered, raising the need for matching defences. These attacks can be based on both passive adversaries, such as model inversion (Fredrikson et al., 2015) and membership inference (Shokri et al., 2017), and active adversaries (for example, (Hitaj et al., 2017)).

One of the strongest privacy standards that can be employed to protect ML models from these and other attacks is differential privacy (Dwork, 2006; Dwork et al., 2006). Pure ε-DP is hard to achieve in many realistic learning settings, and therefore, a notion of approximate (ε, δ)-DP is used across-the-board in machine learning. It is typically accomplished by applying the Gaussian noise mechanism (Dwork et al., 2014) during the gradient descent update (Abadi et al., 2016). Privacy accounting, i.e. computing the privacy guarantee throughout multiple iterations of the algorithm, is typically done by the moments accountant (MA) (Abadi et al., 2016). In Section 4.4.2, we discuss the link between MA and our accounting method, as well as connection to a closely related notion of Rényi DP (Mironov, 2017). Similarly, a link can be established to concentrated DP definitions (Dwork & Rothblum, 2016; Bun & Steinke, 2016).

A number of previous relaxations considered a similar idea of limiting the scope of protected data or using the data generating distribution, either through imposing a set of data evolution scenarios (Kifer & Machanavajjhala, 2014), policies (He et al., 2014), distributions (Blum et al., 2013; Bhaskar et al., 2011), or families of distributions (Bassily et al., 2013; Bassily & Freund, 2016). Some of these definitions (e.g. (Blum et al., 2013)) may require more noise, because they are stronger than DP in the sense that datasets can differ in more than one data point. This is not the case with our definition: like DP, it considers adjacent datasets differing in a single data point. The major problem of such definitions, however, is that in real-world scenarios it is not feasible to exactly define distributions or families of distributions that generate data. And even if this problem is solved by restricting the query functions to enable the usage of the central limit theorem (e.g. (Bhaskar et al., 2011; Duan, 2009)), these guarantees would only hold asymptotically and may require prohibitively large batch sizes. While Bayesian DP can be seen as a special case of some of the above definitions, the crucial difference with the prior work is that our additional assumptions allow the Bayesian accounting (Sections 4.2, 4.3) to provide guarantees w.h.p. with a finite number of samples from data distributions, and hence, enable a broad range of real-world applications.

Finally, there are other approaches that use the data distribution information in one way or another, and coincidentally share the same (Yang et al., 2015) or similar (Leung & Lui, 2012) names. Yet, similarly to the methods discussed above, their assumptions (e.g. bounds on the minimum probability of a data point) and implementation requirements (e.g. potentially constructing correlation matrices for millions of data samples) make practical applications very difficult. Perhaps, the most similar to our approach is random differential privacy (Hall et al., 2011). The main difference is that Hall et al. (2011) consider the probability space over all data points, while we only consider the space over a single differing example. As a result, our guarantees are more practical to compute for large, realistic ML datasets. Furthermore, Hall et al. (2011) only propose a basic composition theorem, which is not tight enough for accounting in iterative methods, and to the best of our knowledge, there are no proofs for other crucial properties, such as post-processing and group privacy.

3. Motivation

Before we proceed, we find it important to motivate research on alternative privacy definitions, as opposed to fully concentrating on new mechanisms for DP. On the one hand, there is always a combination of data and a desired statistic that would yield large privacy loss in the DP paradigm, regardless of the mechanism. In other words, there can always be data outliers that are difficult to hide without a large drop in accuracy. On the other hand, we cannot realistically expect companies to sacrifice model quality in favour of privacy. As a result, we get models with impractical worst-case guarantees (as we demonstrate below) without any indication of what the privacy guarantee is for the majority of users.
Consider the following example. The datasets D, D′ consist of income values for residents of a small town. There is one individual x′ whose income is orders of magnitude higher than the rest, and whose residency in the town is what the attacker wishes to infer. The attacker observes the mean income w sanitised by a differentially private mechanism with δ = 0 (we consider the stronger, pure DP for simplicity). What we are interested in is the change in the posterior distribution of the attacker after they see the private model compared to prior (Mironov, 2017; Bun, 2017). If the individual is not present in the dataset, the probability of w being above a certain threshold is extremely small. On the contrary, if x′ is present, this probability is higher (say it is equal to r). The attacker computes the likelihood of the observed value under each of the two assumptions, the corresponding posteriors given a flat prior, and applies a Bayes optimal classifier. The attacker then concludes that the individual is present in the dataset and is a resident.

By the definition of pure DP, r can only be e^ε times larger than the respective probability without x′. However, if the latter is small enough, then the probability P(A) of the attacker's guess being correct is as high as r / (r + r e^{−ε}), or

P(A) = 1 / (1 + e^{−ε}).   (1)

To put it in perspective, for a DP algorithm with ε = 2, the upper bound on the accuracy of this attack is as high as 88%. For ε = 5, it is 99.33%. For ε = 10, 99.995%. Importantly, these values of ε are very common in DP ML literature (Shokri & Shmatikov, 2015; Abadi et al., 2016; Papernot et al., 2018), and they can be even higher in real-world deployments¹.

¹ https://www.macobserver.com/analysis/google-apple-differential-privacy/

This guarantee does not tell us anything other than that this outlier cannot be protected while preserving utility. But what is the guarantee for other residents of the town? Intuitively, it should be much stronger. In the next section, we present a novel DP-based privacy notion. It uses the same privacy mechanism and augments the general DP guarantee with a much tighter guarantee for the expected case, and, by extension, for any percentile of the user/data population.
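To make the numbers above easy to reproduce, here is a minimal sketch (ours, not taken from the paper's released code) that evaluates the bound of Eq. 1 for a few values of ε; it recovers the 88%, 99.33% and 99.995% figures quoted in this section.

```python
import math

def attack_success_bound(eps: float) -> float:
    """Upper bound on the Bayes-optimal attacker's accuracy from Eq. (1):
    P(A) = 1 / (1 + exp(-eps))."""
    return 1.0 / (1.0 + math.exp(-eps))

for eps in [0.5, 1.0, 2.0, 5.0, 10.0]:
    print(f"eps = {eps:5.1f}  ->  P(A) <= {attack_success_bound(eps):.5f}")
# eps = 2 gives ~0.881, eps = 5 gives ~0.99331, eps = 10 gives ~0.99995
```

The same formula is used later to fill the P(A) column of Table 1.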
4. Bayesian Differential Privacy

In this section, we define Bayesian differential privacy (BDP). We then derive a practical privacy loss accounting method, and discuss its relation to the moments accountant. All the proofs are available in the supplementary material.

4.1. Definition

Let us define strong Bayesian differential privacy (Definition 1) and (weak) Bayesian differential privacy (Definition 2). The first provides a better intuition, connection to concentration inequalities, and is being used for privacy accounting. Unfortunately, it may not be closed under post-processing, and therefore, the actual guarantee provided by BDP is stated in Definition 2 and mimics the (ε, δ)-differential privacy (Dwork et al., 2014). The reason Definition 1 may pose a problem with post-processing is that it does not consider sets of outcomes, and a routine that integrates groups of values into one value could therefore invalidate the guarantee by increasing the probability ratio beyond epsilon. On the other hand, it can still be used for accounting with adaptive composition, because in this context, every next step is conditioned on a single outcome of the previous step. This separation mirrors the moments accountant approach of bounding tails of the privacy loss random variable and converting it to the (ε, δ)-DP guarantee (Abadi et al., 2016), but does so in a more explicit manner.

Definition 1 (Strong Bayesian Differential Privacy). A randomised function A : D → R with domain D, range R, and outcome w = A(·), satisfies (ε_µ, δ_µ)-strong Bayesian differential privacy if for any two adjacent datasets D, D′ ∈ D, differing in a single data point x′ ∼ µ(x), the following holds:

Pr[L_A(w, D, D′) ≥ ε_µ] ≤ δ_µ,   (2)

where the probability is taken over the randomness of the outcome w and the additional example x′.

Here, L_A(w, D, D′) is the privacy loss defined as

L_A(w, D, D′) = log [ p(w | D) / p(w | D′) ],   (3)

where p(w | D), p(w | D′) are private outcome distributions for the corresponding datasets. For brevity, we often omit parameters and denote the privacy loss simply by L.

We use the subscript µ to underline the main difference between the classic DP and Bayesian DP: in the classic definition the probability is taken only over the randomness of the outcome (w), while the BDP definition contains two random variables (w and x′). Therefore, the privacy parameters ε and δ depend on the data distribution µ(x).

The addition of another random variable yields the change in the meaning of δ_µ compared to the δ of DP. In Bayesian differential privacy, it also accounts for the privacy mechanism failures in the tails of data distributions in addition to the tails of outcome distributions.

Definition 2 (Bayesian Differential Privacy). A randomised function A : D → R with domain D and range R satisfies (ε_µ, δ_µ)-Bayesian differential privacy if for any two adjacent datasets D, D′ ∈ D, differing in a single data point x′ ∼ µ(x), and for any set of outcomes S the following holds:

Pr[A(D) ∈ S] ≤ e^{ε_µ} Pr[A(D′) ∈ S] + δ_µ.   (4)

Proposition 1. (ε_µ, δ_µ)-strong Bayesian differential privacy implies (ε_µ, δ_µ)-Bayesian differential privacy.

Bayesian DP repeats some basic properties of the classic DP, such as composition, post-processing resilience and group privacy. More details, proofs for these properties and the above proposition, can be found in supplementary material.

While Definitions 1 and 2 do not specify the distribution of any point in the dataset other than the additional example x′, it is natural to assume that all examples in the dataset are drawn from the same distribution µ(x). This holds in many real-world applications, including applications evaluated in this paper, and it allows using dataset samples instead of requiring knowing the true distribution.

We also assume that data points are exchangeable (Aldous, 1985), i.e. any permutation of data points has the same joint probability. It enables tighter accounting for iterative applications of the privacy mechanism (see Section 4.2), is weaker than independence and is naturally satisfied in the considered scenarios.
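As a concrete illustration of Definition 1 (our own sketch, not from the paper), the snippet below estimates Pr[L ≥ ε_µ] for a one-dimensional sum query released through the Gaussian mechanism, with the differing example x′ drawn from an assumed data distribution µ = N(0, 1). For this mechanism the privacy loss has a closed form, so both sources of randomness (w and x′) can be sampled directly.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 4.0       # Gaussian mechanism noise scale (assumed toy value)
eps_mu = 2.0      # candidate eps_mu
n = 2_000_000     # Monte Carlo draws

# Definition 1 takes the probability over BOTH sources of randomness:
# the differing example x' ~ mu(x) (here an assumed mu = N(0, 1)) and the
# mechanism outcome w. For a sum query, D' = D u {x'} shifts the outcome
# distribution by x', so writing w = f(D) + z with z ~ N(0, sigma^2) the
# privacy loss is L = log p(w|D)/p(w|D') = (x'^2 - 2 x' z) / (2 sigma^2).
x_prime = rng.normal(0.0, 1.0, size=n)
z = rng.normal(0.0, sigma, size=n)
loss = (x_prime**2 - 2.0 * x_prime * z) / (2.0 * sigma**2)

# Conservative two-sided check (the sign of L flips when D and D' are swapped).
delta_mu_hat = np.mean(np.abs(loss) >= eps_mu)
print(f"Estimated Pr[|L| >= {eps_mu}] ~ {delta_mu_hat:.1e}")

# A worst-case DP analysis would instead have to budget for the largest
# plausible |x'| on every draw; under mu such values are rare, which is
# exactly the slack that BDP exploits.
```

This is only an illustration of the two-random-variable semantics; the accounting machinery of Section 4.2 replaces the naive Monte Carlo tail estimate with concentration bounds.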
4.2. Privacy Accounting

In the context of learning, it is important to be able to keep track of the privacy loss over iterative applications of the privacy mechanism. And since the bounds provided by the basic composition theorem are loose, we formulate the advanced composition theorem and develop a general accounting method for Bayesian differential privacy, the Bayesian accountant, that provides a tight bound on privacy loss and is straightforward to implement. We draw inspiration from the moments accountant (Abadi et al., 2016).

Observe that Eq. 4 is a typical concentration bound inequality, which are well studied in probability theory. One of the most common examples of such bounds is Markov's inequality. In its extended form, it states the following:

Pr[L ≥ ε_µ] ≤ E[φ(L)] / φ(ε_µ),   (5)

where φ(·) is a monotonically increasing non-negative function. It is immediately evident that it provides a relation between ε_µ and δ_µ (i.e. δ_µ = E[φ(L)] / φ(ε_µ)), and in order to determine them we need to choose φ and compute the expectation E[φ(L(w, D, D′))]. Note that L(w, D, D′) = −L(w, D′, D), and since the inequality has to hold for any pair of D, D′, we can use L instead of |L|.

We use the Chernoff bound that can be obtained by choosing φ(L) = e^{λL}. It is widely known because of its tightness, and although not explicitly stated, it is also used by Abadi et al. (2016). The inequality in this case transforms to

Pr[L ≥ ε_µ] ≤ E[e^{λL}] / e^{λ ε_µ}.   (6)

This inequality requires the knowledge of the moment generating function of L or some bound on it. The choice of the parameter λ can be arbitrary, because the bound holds for any value of it, but it determines how tight the bound is. By simple manipulations we obtain

E[e^{λL}] = E[ e^{λ log (p(w|D)/p(w|D′))} ] = E[ ( p(w|D) / p(w|D′) )^λ ].   (7)

If the expectation is taken only over the outcome randomness, this expression is the function of Rényi divergence between p(w|D) and p(w|D′), and following this path yields re-derivation of Rényi differential privacy (Mironov, 2017). However, by also taking the expectation over additional examples x′ ∼ µ(x), we can further tighten this bound. By the law of total expectation,

E[ ( p(w|D) / p(w|D′) )^λ ] = E_x[ E_w[ ( p(w|D) / p(w|D′) )^λ | x′ ] ],   (8)

where the inner expectation is again the function of Rényi divergence, and the outer expectation is over µ(x).

Combining Eq. 7 and 8 and plugging it in Eq. 6, we get

Pr[L ≥ ε_µ] ≤ E_x[ e^{λ D_{λ+1}(p(w|D) ‖ p(w|D′)) − λ ε_µ} ].   (9)

This expression determines how to compute δ_µ for a fixed ε_µ (or vice versa) for one invocation of the privacy mechanism. However, to accommodate the iterative nature of learning, we need to deal with the composition of multiple applications of the mechanism. We already determined that our privacy notion is naively composable, but in order to achieve better bounds we need a tighter composition theorem.

Theorem 1 (Advanced Composition). Let a learning algorithm run for T iterations. Denote by w^(1), ..., w^(T) a sequence of private learning outcomes at iterations 1, ..., T, and L^(1:T) the corresponding total privacy loss. Then,

E[ e^{λ L^(1:T)} ] ≤ ∏_{t=1}^{T} ( E_x[ e^{T λ D_{λ+1}(p_t ‖ q_t)} ] )^{1/T},

where p_t = p(w^(t) | w^(t−1), D), q_t = p(w^(t) | w^(t−1), D′).

Proof. See supplementary material.

Unlike the moments accountant, our composition theorem presents an upper bound on the total privacy loss due to computing expectation over the distribution of the same example over all iterations. However, we found that the inequality tends to be tight in practice, and there is little overhead compared to naïvely swapping the product and the expectation.

We denote the logarithm of the quantity inside the product in Theorem 1 as c_t(λ, T) and call it the privacy cost of the iteration t:

c_t(λ, T) = log ( E_x[ e^{T λ D_{λ+1}(p_t ‖ q_t)} ] )^{1/T}.   (10)

The privacy cost of the whole learning process is then a sum of the costs of each iteration. We can now relate ε and δ parameters of BDP through the privacy cost.

Theorem 2. Let the algorithm produce a sequence of private learning outcomes w^(1), ..., w^(T) using a known probability distribution p(w^(t) | w^(t−1), D). Then, for a fixed ε_µ:

log δ_µ ≤ Σ_{t=1}^{T} c_t(λ, T) − λ ε_µ.

Corollary 1. Under the conditions above, for a fixed δ_µ:

ε_µ ≤ (1/λ) Σ_{t=1}^{T} c_t(λ, T) − (1/λ) log δ_µ.

Theorems 1, 2 and Corollary 1 immediately provide us with an efficient privacy accounting algorithm. During training, we compute the privacy cost c_t(λ, T) for each iteration t, accumulate it, and then use it to compute the (ε_µ, δ_µ) pair. This process is ideologically close to that of the moments accountant but accumulates a different quantity (note the change from the privacy loss random variable to Rényi divergence). We further explore this connection in Section 4.4.2.

The link to Rényi divergence is an advantage for applicability of this framework: if the outcome distribution p(w|D) has a known analytic expression for Rényi divergence (Gil et al., 2013; Van Erven & Harremos, 2014), it can be easily plugged into our accountant.

For the popular subsampled Gaussian mechanism (Abadi et al., 2016), we can demonstrate the following.

Theorem 3. Given the Gaussian noise mechanism with the noise parameter σ and subsampling probability q, the privacy cost for λ ∈ N at iteration t can be expressed as

c_t(λ, T) = max{ c_t^L(λ, T), c_t^R(λ, T) },

where

c_t^L(λ, T) = (1/T) log E_x[ ( E_{k∼B(λ+1, q)}[ e^{(k²−k) ‖g_t − g′_t‖² / (2σ²)} ] )^T ],

c_t^R(λ, T) = (1/T) log E_x[ ( E_{k∼B(λ, q)}[ e^{(k²+k) ‖g_t − g′_t‖² / (2σ²)} ] )^T ],

and B(λ, q) is the binomial distribution with λ experiments and the probability of success q.
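The closed form in Theorem 3 is straightforward to evaluate numerically. The sketch below is a simplified reimplementation of ours (not the authors' released accountant): it computes the per-sample quantity e^{λD} for integer λ from gradient-difference norms ‖g_t − g′_t‖ using the two binomial expectations, and then replaces the expectation over µ(x) by a plain sample mean. The pessimistic estimator that should be used instead of the sample mean is introduced in Section 4.3.

```python
import numpy as np
from scipy.stats import binom
from scipy.special import logsumexp

def log_exp_lambda_renyi(d, sigma, q, lam):
    """log of e^{lambda * D_{lambda+1}} for the subsampled Gaussian mechanism
    (Theorem 3), for a single gradient-difference norm d = ||g_t - g'_t||.
    Both directions (the c^L and c^R integrands) are evaluated in log-space and
    the larger one is returned."""
    k = np.arange(lam + 2)                                   # k = 0 .. lambda+1
    log_left = logsumexp(binom.logpmf(k, lam + 1, q)
                         + (k**2 - k) * d**2 / (2 * sigma**2))
    k = np.arange(lam + 1)                                   # k = 0 .. lambda
    log_right = logsumexp(binom.logpmf(k, lam, q)
                          + (k**2 + k) * d**2 / (2 * sigma**2))
    return max(log_left, log_right)

def privacy_cost(dists, sigma, q, lam, T):
    """c_t(lambda, T) = (1/T) log E_x[(e^{lambda D})^T], with E_x replaced here
    by a plain sample mean over observed distances; a real accountant should
    use the pessimistic estimator of Section 4.3 instead."""
    log_vals = np.array([log_exp_lambda_renyi(d, sigma, q, lam) for d in dists])
    return (logsumexp(T * log_vals) - np.log(len(dists))) / T

# toy usage with hypothetical numbers
dists = np.abs(np.random.default_rng(0).normal(0.0, 0.3, size=256))
print(privacy_cost(dists, sigma=1.0, q=64 / 60000, lam=32, T=10_000))
```

Working in log-space avoids the overflow that the T-th power would otherwise cause; setting every distance to the clipping norm C recovers the moments accountant computation, as discussed in Section 4.4.2.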
4.3. Privacy Cost Estimator

Computing c_t(λ, T) precisely requires access to the data distribution µ(x), which is unrealistic. Therefore, we need an estimator for E[e^{T λ D_{λ+1}(p_t ‖ q_t)}].

Typically, having access to the distribution samples, one would use the law of large numbers and approximate the expectation with the sample mean. This estimator is unbiased and converges with the growing number of samples. However, these are not the properties we are looking for. The most important property of the estimator in our context is that it does not underestimate E[e^{T λ D_{λ+1}(p_t ‖ q_t)}], because the bound (Eq. 6) would not hold for this estimate otherwise.

We employ the Bayesian view of the parameter estimation problem (Oliphant, 2006) and design an estimator with this single property: given a fixed γ, it returns the value that overestimates the true expectation with probability 1 − γ. We then incorporate the estimator uncertainty γ in δ_µ.

4.3.1. Binary Case

Let us demonstrate the process of constructing the expectation estimator with the above property on a simple binary example. This technique is based on (Oliphant, 2006) and it translates directly to other classes of distributions with minor adjustments. Here, we also address the concern of not taking into account the data absent from the dataset.

Let the data x_1, x_2, ..., x_N, such that x_i ∈ {0, 1}, have a common mean and a common variance. As this information is insufficient to solve our problem, let us also assume that the data comes from the maximum entropy distribution. This assumption adds the minimum amount of information to the problem and makes our estimate pessimistic.

For the binary data with the common mean ρ, the maximum entropy distribution is the Bernoulli distribution:

f(x_i | ρ) = ρ^{x_i} (1 − ρ)^{1 − x_i},   (11)

where ρ is also the probability of success (x_i = 1). Then,

f(x_1, ..., x_N | ρ) = ρ^{N_1} (1 − ρ)^{N_0},   (12)

where N_0 and N_1 is the number of 0's and 1's in the dataset.

We impose the flat prior on ρ, assuming all values in [0, 1] are equally likely, and use Bayes' theorem to determine the distribution of ρ given the data:

f(ρ | x_1, ..., x_N) = [ Γ(N_0 + N_1 + 2) / (Γ(N_0 + 1) Γ(N_1 + 1)) ] ρ^{N_1} (1 − ρ)^{N_0},   (13)

where the normalisation constant in front is obtained by setting the integral over ρ equal to 1.

Now, we can use the above distribution of ρ to design an estimator ρ̂, such that it overestimates ρ with high probability, i.e. Pr[ρ ≤ ρ̂] ≥ 1 − γ. Namely, ρ̂ = F^{−1}(1 − γ), where F^{−1} is the inverse of the CDF:

F^{−1}(1 − γ) = inf{ z ∈ R : ∫_{−∞}^{z} f(t | x_1, ..., x_N) dt ≥ 1 − γ }.

We refer to γ as the estimator failure probability, and to 1 − γ as the estimator confidence.

To demonstrate the resilience of this estimator to unseen data, consider the following example. Let the true expectation be 0.01, and let the data consist of 100 zeros, and no ones. A typical "frequentist" mean estimator would confidently output 0. However, our estimator would never output 0, unless the confidence is set to 0. When the confidence is set to 1 (γ = 0), the output is 1, which is the most pessimistic estimate. Finally, the output ρ̂ = ρ = 0.01 will be assigned the failure probability γ = 0.99^{101} ≈ 0.36, which is the probability of not drawing a single 1 in 101 draws.

In a real-world system, the confidence would be set to a much higher level (in our experiments, we use γ = 10^{−15}), and the probability of 1 would be significantly overestimated. Thus, unseen data do not present a problem for this estimator, because it exaggerates the probability of data that increase the estimated expectation.

4.3.2. Continuous Case

For applications evaluated in this paper, we are primarily concerned with the continuous case. Thus, let us define the following m-sample estimator of c_t(λ, T) for continuous distributions with existing mean and variance:

ĉ_t(λ, T) = log [ M(t) + ( F^{−1}(1 − γ, m − 1) / √(m − 1) ) S(t) ],   (14)

where M(t) and S(t) are the sample mean and the sample standard deviation of e^{λ D̂^{(t)}_{λ+1}}, F^{−1}(1 − γ, m − 1) is the inverse of the Student's t-distribution CDF at 1 − γ with m − 1 degrees of freedom, and

D̂^{(t)}_{λ+1} = max{ D_{λ+1}(p̂_t ‖ q̂_t), D_{λ+1}(q̂_t ‖ p̂_t) },
p̂_t = p(w^(t) | w^(t−1), B^(t)),
q̂_t = p(w^(t) | w^(t−1), B^(t) \ {x_i}).   (15)

Since in many cases learning is performed on mini-batches, we can similarly compute Rényi divergence on batches B^(t).

Theorem 4. Estimator ĉ_t(λ, T) overestimates c_t(λ, T) with probability 1 − γ. That is,

Pr[ĉ_t(λ, T) < c_t(λ, T)] ≤ γ.

The proof follows the steps of the binary example above.

Remark. By adapting the maximum entropy probability distribution an equivalent estimator can be derived for other classes of distributions (e.g. discrete).

To avoid introducing new parameters in the privacy definition, we can incorporate the probability γ of underestimating the true expectation in δ_µ. We can re-write:

Pr[L_A(w^(t), D, D′) ≥ ε_µ]
= Pr[L_A(w^(t), D, D′) ≥ ε_µ, ĉ_t(λ, T) ≥ c_t(λ, T)]
+ Pr[L_A(w^(t), D, D′) ≥ ε_µ, ĉ_t(λ, T) < c_t(λ, T)].   (16)

When ĉ_t(λ, T) ≥ c_t(λ, T), using the Chernoff inequality, the first summand is bounded by δ = exp( Σ_{t=1}^{T} ĉ_t(λ, T) − λ ε_µ ).

Whenever ĉ_t(λ, T) < c_t(λ, T),

Pr[L_A(w^(t), D, D′) ≥ ε_µ, ĉ_t(λ, T) < c_t(λ, T)] ≤ Pr[ĉ_t(λ, T) < c_t(λ, T)] ≤ γ.   (17)

Therefore, the true δ_µ is bounded by δ + γ, and despite the incomplete data, we can claim that the mechanism is (ε_µ, δ_µ)-Bayesian differentially private, where δ_µ = δ + γ.

Remark. This step further changes the interpretation of δ_µ in Bayesian differential privacy compared to the classic δ of DP. Apart from the probability of the privacy loss exceeding ε_µ, e.g. in the tails of its distribution, it also incorporates our uncertainty about the true data distribution (in other words, the probability of underestimating the true expectation because of not observing enough data samples). It can be intuitively understood as accounting for unobserved (but feasible) data in δ_µ, rather than in ε_µ.
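A minimal sketch of the m-sample estimator of Eq. 14 (our rendering, with hypothetical variable names): the sample mean of e^{λD̂} is inflated by a Student-t margin so that, per Theorem 4, the true expectation is underestimated only with probability γ, which is then folded into δ_µ as described above.

```python
import numpy as np
from scipy.stats import t as student_t

def c_hat(exp_lambda_D, gamma):
    """Pessimistic privacy-cost estimate in the spirit of Eq. (14).

    exp_lambda_D : array of e^{lambda * D_hat} values, one per sampled pair of
                   neighbouring batches (see Eq. 15).
    gamma        : estimator failure probability, later added to delta_mu.
    """
    vals = np.asarray(exp_lambda_D, dtype=float)
    m = len(vals)
    M = vals.mean()                       # sample mean M(t)
    S = vals.std(ddof=1)                  # sample standard deviation S(t)
    margin = student_t.ppf(1.0 - gamma, df=m - 1) / np.sqrt(m - 1)
    return np.log(M + margin * S)

# hypothetical usage: 32 sampled pairs, gamma = 1e-15 as in the experiments
rng = np.random.default_rng(1)
samples = np.exp(rng.normal(loc=1e-3, scale=5e-4, size=32))  # stand-in values
print(c_hat(samples, gamma=1e-15))
```

Note how the margin grows both with the required confidence (through the t-quantile) and with the spread of the observed divergences, which is what makes the estimate pessimistic rather than merely consistent.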
4.4. Discussion

4.4.1. Relation to DP

To better understand how the BDP bound relates to the traditional DP, consider the following conditional probability:

∆(ε, x′) = Pr[ L(w, D, D′) > ε | D, D′ = D ∪ {x′} ].   (18)

The moments accountant outputs δ that upper-bounds ∆(ε, x′) for all x′. It is not true in general for other accounting methods, but let us focus on MA, as it is by far the most popular. Consequently, the MA bound is

max_x ∆(ε, x) ≤ δ,   (19)

where δ is a chosen constant. At the same time, BDP bounds the probability that is not conditioned on x′, but we can transform one to another through marginalisation and get:

E_x[∆(ε, x)] ≤ δ_µ.   (20)

Since ∆(·) is a non-negative random variable in x, we can apply Markov's inequality and obtain a tail bound on it using δ_µ. We can therefore find a pair (ε, δ)_p that holds for any percentile p of the data distribution, not just in expectation. In all our experiments, we consider bounds well above the 99th percentile, so it is very unlikely to encounter data for which the equivalent DP guarantee doesn't hold. Moreover, it is possible to characterise privacy by building a curve for different percentiles, and hence, gain more insight into how well users and their data are protected.

4.4.2. Relation to Moments Accountant

As mentioned in Section 4.2, removing the distribution requirement on D, D′ and further simplifying Eq. 9, we can recover the relation between Rényi DP and (ε, δ)-DP.

At the same time, our accounting technique closely resembles the moments accountant. In fact, we can show that the moments accountant is a special case of Theorem 3. Ignoring the data distribution information and substituting the expectation by max_{D,D′} yields the substitution of ‖g_t − g′_t‖ for C in Theorem 3, where C is the sensitivity (or clipping threshold), which turns out to be the exact moments accountant bound. In addition, there are some extra benefits, such as avoiding numerical integration when using λ ∈ N due to connection to the Binomial distribution, which improves numerical stability and computational efficiency.

4.4.3. Privacy of ĉ_t(λ, T)

Due to calculating ĉ_t(λ, T) from data, our privacy guarantee becomes data-dependent and may potentially leak information. To obtain a theoretical bound on this leakage, we need to get back to the maximum entropy assumption in Section 4.3, and assume that M(t) and S(t) are following some specific distributions, such as Gaussian and χ² distributions. Hence, in case of simple random sampling, these statistics for two neighbour datasets are differentially private and the privacy parameters can be computed using Rényi divergence. Furthermore, these guarantees are controlled by the number of samples used to compute the statistics: the more samples are used, the more accurate the statistics are, and the less privacy leakage occurs. This property can be used to control the privacy of the estimates without sacrificing their tightness, only at the cost of extra computation time. Without distributional assumptions, the bound can be computed in the limit of the sample size used by the estimator, using the CLT.

On the other hand, consider the fact that the information from many high-dimensional vectors gets first compressed down to their pairwise distances, which are not as informative in high-dimensional spaces (i.e. the curse of dimensionality), and then down to one number. Intuitively, at this rate of compression very little knowledge can be gained by an attacker in practice.

The first approach would provide little information about real-world cases due to potentially unrealistic assumptions, and hence, we opt for the second approach. We examine pairwise gradient distances of the points within the training set and outside, and demonstrate that the privacy leakage is not statistically significant in practice (see Section 5.2).
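Putting Sections 4.2–4.4 together, a hypothetical end-to-end accounting loop could look as follows. The helpers reuse the sketches above (privacy_cost() from the Theorem 3 snippet) and are not part of the released implementation; the per-step distances here are random stand-ins for ‖g_t − g′_t‖, and the final conversion uses Corollary 1 (with the estimator's failure probability folded into δ_µ as in Section 4.3.2).

```python
import numpy as np

def epsilon_from_cost(total_cost, lam, delta):
    """Corollary 1: eps <= (1/lambda) * sum_t c_t(lambda, T) - (1/lambda) * log(delta)."""
    return (total_cost - np.log(delta)) / lam

# Hypothetical accounting loop; privacy_cost() is the Theorem 3 sketch above.
lam, sigma, q, T = 32, 1.0, 64 / 60000, 1_000
delta_mu = 1e-10
rng = np.random.default_rng(0)

total_cost = 0.0
for step in range(T):
    # stand-in for ||g_i - g_j|| over sampled pairs of clipped per-example
    # gradients at this step (a real run computes these from the model)
    dists = np.abs(rng.normal(0.0, 0.3, size=64))
    total_cost += privacy_cost(dists, sigma, q, lam, T)

print("BDP: eps_mu <=", epsilon_from_cost(total_cost, lam, delta_mu))

# Moments-accountant special case (Section 4.4.2): drop the distribution and
# use the clipping threshold C in place of every distance.
C = 1.0
ma_total = T * privacy_cost(np.array([C]), sigma, q, lam, T)
print("DP:  eps    <=", epsilon_from_cost(ma_total, lam, 1e-5))
```

The last two lines make the comparison of the next section concrete: both guarantees come from the same mechanism and the same λ, and differ only in whether the gradient-difference distribution or its worst case is plugged into the cost.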
5. Evaluation

This experimental section comprises two parts. First, we examine how well Bayesian DP composes over multiple steps. We use the Bayesian accountant and compare to the state-of-the-art DP results obtained by the moments accountant. Second, we consider the context of machine learning. In particular, we use the differentially private stochastic gradient descent (DP-SGD), a well known privacy-preserving learning technique broadly used in combination with the moments accountant, to train neural networks on the classic image classification tasks MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky, 2009). We then compare the accuracy and privacy guarantees obtained under BDP and under DP. We also perform experiments with variational inference on the Abalone (Waugh, 1995) and Adult (Kohavi, 1996) datasets.

Importantly, DP and BDP can use the same privacy mechanism and be accounted in parallel to ensure the DP guarantees hold if BDP assumptions fail. Thus, all comparisons in this section should be viewed in the following way: the reported BDP guarantee would apply to typical data (i.e. data drawn from the same distribution as the dataset); the reported DP guarantee would apply to all other data; their difference is the advantage for typical data we gain by using Bayesian DP. In some experiments we use smaller noise variance for BDP in order to speed up training, meaning that the reported BDP guarantees will further improve if noise variance is increased to DP levels. For more details and additional experiments, we refer the reader to the supplementary material, while the source code is available on GitHub².

² https://github.com/AlekseiTriastcyn/bayesian-differential-privacy

5.1. Composition

First, we study the growth rate of the privacy loss over a number of mechanism invocations. This experiment is carried out using synthetic gradients drawn from the Weibull distribution with the shape parameter < 1 to imitate a more difficult case of heavy-tailed gradient distributions. We do not clip gradients for BDP in order to show the raw effect of the signal-to-noise ratio on the privacy loss behaviour. Technically, bounded sensitivity is not as essential for BDP, because extreme individual contributions are mitigated by their low probability. However, in practice it is still advantageous to have a better control over privacy loss spikes and ensure that the worst-case DP guarantee is preserved.

In Figure 1, we plot ε and ε_µ as a function of steps for different levels of noise. Naturally, as the noise standard deviation gets closer to the expected gradients norm, the growth rate of the privacy loss decreases dramatically. Even when the noise is at the 0.25-quantile, the Bayesian accountant matches the moments accountant. It is worth noting that DP behaves the same in all these experiments because the gradients get clipped at the noise level C. Introducing clipping for BDP yields the behaviour of Figure 1d, as we demonstrate in the next section on real data.

[Figure 1: Evolution of ε and ε_µ over multiple steps of the Gaussian noise mechanism with σ = C for DP (with clipping) and BDP (without clipping). Sub-captions indicate the noise variance relative to the gradient norms distribution: (a) 0.05-quantile of ‖∇f‖; (b) 0.25-quantile; (c) 0.75-quantile; (d) 0.95-quantile.]

5.2. Learning

We now consider the application to privacy-preserving deep learning. Our setting closely mimics that of (Abadi et al., 2016) to enable a direct comparison with the moments accountant and DP. We use a version of DP-SGD that has been extensively applied to build differentially private machine learning models. The idea of DP-SGD is to clip the gradient norm to some constant C (ensuring bounded sensitivity) and add Gaussian noise with variance C²σ² at every iteration of SGD. For Abalone and Adult, we use variational inference in a setting similar to (Jälkö et al., 2016).

Using the gradient distribution information allows the BDP models to reach the same accuracy at a much lower ε (for 99.999% of data points from this distribution, see Section 4.4.1). On MNIST, we manage to reduce it from 2.2 to 0.95. For CIFAR10, from 8.0 to 0.76. See details in Table 1. Moreover, since less noise is required for Bayesian DP, the models reach the same test accuracy much faster. For example, our model reaches 96% accuracy within 50 epochs for MNIST, while the DP model requires more noise and slower training over hundreds of epochs to avoid ε blowing up. These results confirm that discounting outliers in the privacy accounting process is highly beneficial for getting high accuracy and tighter guarantees for all the other points. To make our results more transparent, we include in Table 1 the potential attack success probability P(A) computed using Eq. 1. In this interpretation, the benefits of using BDP become even more apparent.

Table 1: Estimated privacy bounds ε for δ = 10⁻⁵ and δ_µ = 10⁻¹⁰ for the MNIST, CIFAR10, Abalone and Adult datasets. In parenthesis, a potential attack success probability P(A).

Dataset    Accuracy (baseline)   Accuracy (private)   Privacy: DP ε (P(A))   Privacy: BDP ε (P(A))
MNIST      99%                   96%                  2.2 (0.898)            0.95 (0.721)
CIFAR10    86%                   73%                  8.0 (0.999)            0.76 (0.681)
Abalone    77%                   76%                  7.6 (0.999)            0.61 (0.649)
Adult      81%                   81%                  0.5 (0.623)            0.2 (0.55)

An important aspect of BDP, discussed in Section 4.4.3, is the potential privacy leakage of the privacy cost estimator. To illustrate that this leakage is minimal, we conduct the following experiment. After training the model (to ensure it contains as much information about data as possible), we compute the gradient pairwise distances over train and test sets. We then plot the histograms of these distances to inspect any divergence that would distinguish the data that was used in training. Note that this is more information than what is available to an adversary, who only observes ε_µ, δ_µ.

[Figure 2: Pairwise gradient distances distribution. (a) MNIST. (b) CIFAR10.]
As it turns out, these distributions are nearly identical (see Figures 2a and 2b), and we do not observe any correlation with the fact of the presence of data in the training set. For example, the sample mean of the test set can be both somewhat higher or lower than that of the train set. We also run the t-test for equality of means and Levene's test for equality of variances, obtaining p-values well over the 0.05 threshold, suggesting that the difference of the means and the variances of these distributions is not statistically significant and the equality hypothesis cannot be rejected.

6. Conclusion

We introduce the notion of (ε_µ, δ_µ)-Bayesian differential privacy, a variation of (ε, δ)-differential privacy for sensitive data that are drawn from an arbitrary (and unknown) distribution µ(x). It is a reasonable assumption in many ML scenarios where models are designed for and trained on specific data distributions (e.g. emails, face images, ECGs, etc.). For example, trying to hide music records in a training set for ECG analysis may be unjustified, because the probability of it appearing is actually much smaller than δ.

We present the advanced composition theorem for Bayesian DP that allows for efficient and tight privacy accounting. Since the data distribution is unknown, we design an estimator that overestimates the privacy loss with high, controllable probability. Moreover, as the data sample is finite, we employ the Bayesian parameter estimation approach with the flat prior and the maximum entropy principle to avoid underestimating probabilities of unseen examples.

Our evaluation confirms that Bayesian DP is highly beneficial in the ML context where its additional assumptions are naturally satisfied. First, it needs less noise for comparable privacy guarantees (with high probability, as per Section 4.4.1). Second, models train faster and can reach higher accuracy. Third, it may be used along with DP to ensure the worst-case guarantee for out-of-distribution samples and outliers, while providing tighter guarantees for most cases. In our supervised learning experiments, ε always remains below 1, translating to much more meaningful bounds on a potential attacker success probability.

References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318. ACM, 2016.

Aldous, D. J. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII—1983, pp. 1–198. Springer, 1985.

Bassily, R. and Freund, Y. Typical stability. arXiv preprint arXiv:1604.03336, 2016.

Bassily, R., Groce, A., Katz, J., and Smith, A. Coupled-worlds privacy: Exploiting adversarial uncertainty in statistical data privacy. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 439–448. IEEE, 2013.

Bhaskar, R., Bhowmick, A., Goyal, V., Laxman, S., and Thakurta, A. Noiseless database privacy. In International Conference on the Theory and Application of Cryptology and Information Security, pp. 215–232. Springer, 2011.

Blum, A., Ligett, K., and Roth, A. A learning theory approach to noninteractive database privacy. Journal of the ACM (JACM), 60(2):12, 2013.

Bun, M. A teaser for differential privacy. 2017.

Bun, M. and Steinke, T. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pp. 635–658. Springer, 2016.

Duan, Y. Privacy without noise. In Proceedings of the 18th ACM conference on Information and knowledge management, pp. 1517–1520. ACM, 2009.

Dwork, C. Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), volume 4052, pp. 1–12, Venice, Italy, July 2006. Springer Verlag. ISBN 3-540-35907-9. URL https://www.microsoft.com/en-us/research/publication/differential-privacy/.

Dwork, C. and Rothblum, G. N. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp. 265–284. Springer, 2006.

Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.

Fredrikson, M., Jha, S., and Ristenpart, T. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333. ACM, 2015.

Gil, M., Alajaji, F., and Linder, T. Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences, 249:124–131, 2013.
Hall, R., Rinaldo, A., and Wasserman, L. Random differential privacy. arXiv preprint arXiv:1112.2680, 2011.

He, X., Machanavajjhala, A., and Ding, B. Blowfish privacy: Tuning privacy-utility trade-offs using policies. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pp. 1447–1458. ACM, 2014.

Hitaj, B., Ateniese, G., and Pérez-Cruz, F. Deep models under the GAN: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 603–618. ACM, 2017.

Jälkö, J., Dikmen, O., and Honkela, A. Differentially private variational inference for non-conjugate models. arXiv preprint arXiv:1610.08749, 2016.

Kifer, D. and Machanavajjhala, A. Pufferfish: A framework for mathematical privacy definitions. ACM Transactions on Database Systems (TODS), 39(1):3, 2014.

Kohavi, R. Scaling up the accuracy of naive-bayes classifiers: a decision-tree hybrid. Citeseer, 1996.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Leung, S. and Lui, E. Bayesian mechanism design with efficiency, privacy, and approximate truthfulness. In International Workshop on Internet and Network Economics, pp. 58–71. Springer, 2012.

McMahan, H. B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. 2018.

Mironov, I. Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th, pp. 263–275. IEEE, 2017.

Oliphant, T. E. A bayesian perspective on estimating mean, variance, and standard-deviation from data. 2006.

Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I., and Talwar, K. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.

Papernot, N., Song, S., Mironov, I., Raghunathan, A., Talwar, K., and Erlingsson, Ú. Scalable private learning with PATE. arXiv preprint arXiv:1802.08908, 2018.

Shokri, R. and Shmatikov, V. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pp. 1310–1321. ACM, 2015.

Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on, pp. 3–18. IEEE, 2017.

Van Erven, T. and Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.

Waugh, S. G. Extending and benchmarking Cascade-Correlation: extensions to the Cascade-Correlation architecture and benchmarking of feed-forward supervised artificial neural networks. PhD thesis, University of Tasmania, 1995.

Yang, B., Sato, I., and Nakagawa, H. Bayesian differential privacy on correlated data. In Proceedings of the 2015 ACM SIGMOD international conference on Management of Data, pp. 747–762. ACM, 2015.
Appendix

A. Proofs

A.1. Proofs of Propositions

Proposition 1. (ε_µ, δ_µ)-strong Bayesian differential privacy implies (ε_µ, δ_µ)-Bayesian differential privacy.

Proof. Let us define a set of outcomes for which the privacy loss variable exceeds the ε threshold: F(x′) = { w : L_A(w, D, D′) > ε }, and its complement F^c(x′).

We have,

Pr[A(D) ∈ S]
= ∫ Pr[A(D) ∈ S, x′] dx′   (21)
= ∫ ( Pr[A(D) ∈ S ∩ F^c(x′), x′] + Pr[A(D) ∈ S ∩ F(x′), x′] ) dx′   (22–23)
= ∫ ( Pr[A(D) ∈ S ∩ F^c(x′) | x′] µ(x′) + Pr[A(D) ∈ S ∩ F(x′), x′] ) dx′   (24–25)
≤ ∫ ( e^ε Pr[A(D′) ∈ S ∩ F^c(x′) | x′] µ(x′) + Pr[A(D) ∈ S ∩ F(x′), x′] ) dx′   (26–27)
≤ ∫ ( e^ε Pr[A(D′) ∈ S, x′] + Pr[A(D) ∈ S ∩ F(x′), x′] ) dx′   (28–29)
≤ e^ε Pr[A(D′) ∈ S] + δ_µ,   (30)

where we used the observation that L ≤ ε implies Pr[A(D) ∈ S ∩ F^c(x′)] ≤ e^ε Pr[A(D′) ∈ S ∩ F^c(x′)], and therefore, Pr[A(D) ∈ S ∩ F^c(x′) | x′] ≤ e^ε Pr[A(D′) ∈ S ∩ F^c(x′) | x′], because A(D) does not depend on x′, and A(D′) is already conditioned on x′ through D′. Additionally, in the first line we used marginalisation, and the last inequality is due to the fact that

∫ Pr[A(D) ∈ S ∩ F(x′), x′] dx′   (31)
≤ ∫ Pr[A(D) ∈ F(x′), x′] dx′   (32)
= ∫ µ(x′) Pr[A(D) ∈ F(x′) | x′] dx′   (33)
= ∫ µ(x′) ∫_{w ∈ F(x′)} p_A(w | D, x′) dw dx′   (34)
= E_{x′}[ E_w[ 1{L > ε} ] ]   (35)
≤ δ_µ.   (36)

Proposition 2 (Post-processing). Let A : D → R be a (ε_µ, δ_µ)-Bayesian differentially private algorithm. Then for any arbitrary randomised data-independent mapping f : R → R′, f(A(D)) is (ε_µ, δ_µ)-Bayesian differentially private.

Proof. First, by Proposition 1, (ε_µ, δ_µ)-strong BDP implies the weak sense of BDP:

Pr[A(D) ∈ S] ≤ e^{ε_µ} Pr[A(D′) ∈ S] + δ_µ,   (37)

for any set of outcomes S ⊂ R.

For a data-independent function f(·):

Pr[f(A(D)) ∈ T] = Pr[A(D) ∈ S]   (38)
≤ e^{ε_µ} Pr[A(D′) ∈ S] + δ_µ   (39)
= e^{ε_µ} Pr[f(A(D′)) ∈ T] + δ_µ,   (40)

where S = f^{−1}[T], i.e. S is the preimage of T under f.

Proposition 3 (Basic composition). Let A_i : D → R_i, ∀i = 1..k, be a sequence of (ε_µ, δ_µ)-Bayesian differentially private algorithms. Then their combination, defined as A_{1:k} : D → R_1 × ... × R_k, is (kε_µ, kδ_µ)-Bayesian differentially private.

Proof. Let us denote L = log [ p(w_1, ..., w_k | D) / p(w_1, ..., w_k | D′) ]. Also, let L_i = log [ p(w_i | D, w_{i−1}, ..., w_1) / p(w_i | D′, w_{i−1}, ..., w_1) ]. Then,

Pr[L ≥ kε_µ] = Pr[ Σ_{i=1}^{k} L_i ≥ kε_µ ]   (41)
≤ Σ_{i=1}^{k} Pr[L_i ≥ ε_µ]   (42)
≤ Σ_{i=1}^{k} δ_µ   (43)
≤ kδ_µ.   (44)

For the weak sense of BDP, the proof follows the steps of Dwork et al. (2014, Appendix B).
Proposition 4 (Group privacy). Let A : D → R be a (ε_µ, δ_µ)-Bayesian differentially private algorithm. Then for all pairs of datasets D, D′ ∈ D, differing in k data points x_1, ..., x_k s.t. x_i ∼ µ(x) for i = 1..k, A(D) is (kε_µ, k e^{kε_µ} δ_µ)-Bayesian differentially private.

Proof. Let us define a sequence of datasets D^i, i = 1..k, s.t. D = D^0, D′ = D^k, and D^i and D^{i−1} differ in a single example. Then,

p(w|D) / p(w|D′) = [ p(w|D^0) p(w|D^1) ... p(w|D^{k−1}) ] / [ p(w|D^1) p(w|D^2) ... p(w|D^k) ].   (45)

Denote L_i = log [ p(w|D^{i−1}) / p(w|D^i) ] for i = 1..k. Finally, applying the definition of (ε_µ, δ_µ)-Bayesian differential privacy,

Pr[L ≥ kε_µ] = Pr[ Σ_{i=1}^{k} L_i ≥ kε_µ ]   (46)
≤ Σ_{i=1}^{k} Pr[L_i ≥ ε_µ]   (47)
≤ kδ_µ.   (48)

For the weak sense of BDP,

Pr[A(D) ∈ S] ≤ e^{ε_µ} Pr[A(D^1) ∈ S] + δ_µ   (49)
≤ e^{ε_µ} ( e^{ε_µ} Pr[A(D^2) ∈ S] + δ_µ ) + δ_µ   (50)
≤ e^{2ε_µ} Pr[A(D^2) ∈ S] + e^{ε_µ} δ_µ + δ_µ   (51)
≤ ...   (52)
≤ e^{kε_µ} Pr[A(D^k) ∈ S] + ( (e^{kε_µ} − 1) / (e^{ε_µ} − 1) ) δ_µ   (53)
≤ e^{kε_µ} Pr[A(D^k) ∈ S] + k e^{kε_µ} δ_µ   (54)
= e^{kε_µ} Pr[A(D′) ∈ S] + k e^{kε_µ} δ_µ,   (55)

where in (53) we use the formula for the sum of a geometric progression; in (54), the facts that e^x − 1 ≤ xe^x, for x > 0, and e^x ≥ x + 1.

A.2. Proof of Theorem 1

Let us restate the theorem:

Theorem 1 (Advanced Composition). Let a learning algorithm run for T iterations. Denote by w^(1), ..., w^(T) a sequence of private learning outcomes at iterations 1, ..., T, and L^(1:T) the corresponding total privacy loss. Then,

E[ e^{λ L^(1:T)} ] ≤ ∏_{t=1}^{T} ( E_x[ e^{T λ D_{λ+1}(p_t ‖ q_t)} ] )^{1/T},

where p_t = p(w^(t) | w^(t−1), D), q_t = p(w^(t) | w^(t−1), D′).

Proof. The proof closely follows (Abadi et al., 2016). First, we can write

L^(1:T) = log [ p(w^(1), ..., w^(T) | D) / p(w^(1), ..., w^(T) | D′) ]   (56)
= log ∏_{t=1}^{T} [ p(w^(t) | w^(t−1), ..., w^(1), D) / p(w^(t) | w^(t−1), ..., w^(1), D′) ]   (57–58)
= Σ_{t=1}^{T} L^(t).   (59)

Unlike the composition proof of the moments accountant by Abadi et al. (2016), we cannot simply swap the product and the expectation in our proof, because the additional example x′ remains the same in all applications of the privacy mechanism and probability distributions will not be independent. However, we can use generalised Hölder's inequality:

‖ ∏_{t=1}^{T} f_t ‖_r ≤ ∏_{t=1}^{T} ‖f_t‖_{p_t},   (60)

where p_t are such that Σ_{t=1}^{T} 1/p_t = 1/r, and ‖f‖_r = ( ∫_S f^r dx )^{1/r}.

Choosing r = 1 and p_t = T:

E[ e^{λ L^(1:T)} ] = E[ ∏_{t=1}^{T} ( p(w^(t) | w^(t−1), D) / p(w^(t) | w^(t−1), D′) )^λ ]   (61)
= E_x[ E_w[ ∏_{t=1}^{T} ( p_t / q_t )^λ | x′ ] ]   (62)
= E_x[ ∏_{t=1}^{T} E_{w^(t)}[ ( p_t / q_t )^λ | x′ ] ]   (63)
= E_x[ ∏_{t=1}^{T} e^{λ D_{λ+1}(p_t ‖ q_t)} ]   (64)
≤ ∏_{t=1}^{T} ( E_x[ e^{T λ D_{λ+1}(p_t ‖ q_t)} ] )^{1/T},   (65)

where (62) is by the law of total expectation; (63) is due to independence of noise between iterations, similarly to (Abadi et al., 2016); and (65) is by Hölder's inequality.
A.3. Proof of Theorem 3

Let us restate the theorem:

Theorem 3. Given the Gaussian noise mechanism with the noise parameter σ and subsampling probability q, the privacy cost for λ ∈ N at iteration t can be expressed as

c_t(λ, T) = max{ c_t^L(λ, T), c_t^R(λ, T) },

where

c_t^L(λ, T) = (1/T) log E_x[ ( E_{k∼B(λ+1, q)}[ e^{(k²−k) ‖g_t − g′_t‖² / (2σ²)} ] )^T ],

c_t^R(λ, T) = (1/T) log E_x[ ( E_{k∼B(λ, q)}[ e^{(k²+k) ‖g_t − g′_t‖² / (2σ²)} ] )^T ],

and B(λ, q) is the binomial distribution with λ experiments and the probability of success q.

Proof. Without loss of generality, assume D′ = D ∪ {x′}. For brevity, let d_t = ‖g_t − g′_t‖.

Let us first consider D_{λ+1}(p(w|D′) ‖ p(w|D)):

E[ ( p(w|D′) / p(w|D) )^{λ+1} ]
= E[ ( ( (1−q) N(0, σ²) + q N(d_t, σ²) ) / N(0, σ²) )^{λ+1} ]   (66)
= E[ ( (1−q) + q N(d_t, σ²) / N(0, σ²) )^{λ+1} ]   (67)
= E[ ( (1−q) + q e^{−((w−d_t)² − w²) / (2σ²)} )^{λ+1} ]   (68)
= E[ ( (1−q) + q e^{(2 d_t w − d_t²) / (2σ²)} )^{λ+1} ]   (69)
= E[ Σ_{k=0}^{λ+1} C(λ+1, k) q^k (1−q)^{λ+1−k} e^{(2 d_t k w − k d_t²) / (2σ²)} ]   (70)
= Σ_{k=0}^{λ+1} C(λ+1, k) q^k (1−q)^{λ+1−k} E[ e^{(2 d_t k w − k d_t²) / (2σ²)} ]   (71)
= Σ_{k=0}^{λ+1} C(λ+1, k) q^k (1−q)^{λ+1−k} e^{(k²−k) d_t² / (2σ²)}   (72)
= E_{k∼B(λ+1, q)}[ e^{(k²−k) ‖g_t − g′_t‖² / (2σ²)} ].   (73)

Here, in (70) we used the binomial expansion, in (71) the fact that the factors in front of the exponent do not depend on w, and in (72) the property E_w[exp(2aw / (2σ²))] = exp(a² / (2σ²)) for w ∼ N(0, σ²). Plugging the above in the privacy cost formula (Eq. 10 in the main paper), we get the expression for c_t^L(λ).

Computing D_{λ+1}(p(w|D) ‖ p(w|D′)) is a little more challenging. Let us first change to D_λ(p(w|D) ‖ p(w|D′)), so that the expectation is taken over N(0, σ²). Then, we can bound it observing that f(x) = 1/x is convex for x > 0 and using the definition of convexity, and apply the same steps as above:

E[ ( p(w|D) / p(w|D′) )^λ ]
= E[ ( N(0, σ²) / ( (1−q) N(0, σ²) + q N(d_t, σ²) ) )^λ ]   (74)
≤ E[ ( (1−q) + q e^{(d_t² − 2 d_t w) / (2σ²)} )^λ ]   (75)
= E_{k∼B(λ, q)}[ e^{(k²+k) ‖g_t − g′_t‖² / (2σ²)} ].   (76)

In practice, we haven't found any instance of D_{λ+1}(p(w|D′) ‖ p(w|D)) < D_{λ+1}(p(w|D) ‖ p(w|D′)) when the latter was computed using numerical integration, although it may happen when using this theoretical upper bound.
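The algebra above is easy to sanity-check numerically. The following sketch (ours, not part of the paper) compares the closed-form binomial expression of Eq. 73 against a Monte Carlo estimate of the underlying expectation over w ∼ N(0, σ²), for hypothetical parameter values.

```python
import numpy as np
from scipy.stats import binom

q, sigma, lam, d = 0.01, 1.0, 8, 0.7
rng = np.random.default_rng(0)

# Closed form (Eq. 73): E_{k ~ B(lam+1, q)}[ exp((k^2 - k) d^2 / (2 sigma^2)) ]
k = np.arange(lam + 2)
closed = np.sum(binom.pmf(k, lam + 1, q)
                * np.exp((k**2 - k) * d**2 / (2 * sigma**2)))

# Monte Carlo of E_w[ ((1-q) + q exp((2 d w - d^2)/(2 sigma^2)))^{lam+1} ],
# with w ~ N(0, sigma^2), i.e. the quantity in Eq. 69.
w = rng.normal(0.0, sigma, size=2_000_000)
mc = np.mean(((1 - q) + q * np.exp((2 * d * w - d**2) / (2 * sigma**2)))**(lam + 1))

print(closed, mc)   # the two values should agree up to Monte Carlo error
```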
A.4. Proof of Theorem 4

Let us restate the theorem:

Theorem 4. Estimator ĉ_t(λ, T) overestimates c_t(λ, T) with probability 1 − γ. That is,

Pr[ĉ_t(λ, T) < c_t(λ, T)] ≤ γ.

Proof. First of all, we can drop the logarithm from our consideration because of its monotonicity.

Now, assuming that samples e^{λ D̂^{(t)}_{λ+1}} have a common mean and a common variance, and applying the maximum entropy principle in combination with an uninformative (flat) prior, one can show that the quantity ( M(t) − E[ e^{λ D̂^{(t)}_{λ+1}} ] ) √(m − 1) / S(t) follows the Student's t-distribution with m − 1 degrees of freedom (Oliphant, 2006).

Finally, we use the inverse of the Student's t CDF to find the value that this random variable would only exceed with probability γ. The result follows by simple arithmetical operations.

B. Evaluation

B.1. Experimental setting

All experiments were performed on a machine with Intel Xeon E5-2680 (v3), 256 GB of RAM, and two NVIDIA TITAN X graphics cards. We train a classifier represented by a neural network on MNIST (LeCun et al., 1998) and on CIFAR10 (Krizhevsky, 2009) using DP-SGD. The first dataset contains 60,000 training examples and 10,000 testing images. We use large batch sizes of 1024, clip gradient norms to C = 1, and σ = 0.1. We also experimented with the idea of dropping updates for a random subset of weights, and achieved the best performance with updating 10% of weights at each iteration. The second dataset consists of 50,000 training images and 10,000 testing images of objects split in 10 classes. For this dataset, we use the batch size of 512, C = 1, and σ = 0.8. We fix δ = 10⁻⁵ in all experiments, and δ_µ = 10⁻¹⁰ to achieve an (ε, 10⁻⁵) bound for 99.999% of the data distribution using Markov inequality.

MNIST experiments are performed with the CNN model from the Tensorflow tutorial (the same as in (Abadi et al., 2016), except we do not use PCA), trained using SGD with the learning rate 0.02. In case of CIFAR10, in order for our results to be comparable to (Abadi et al., 2016), we pre-train convolutional layers of the model on a different dataset and retrain a fully-connected layer in a privacy-preserving way. We were unable to reproduce the experiment exactly as specified in (Abadi et al., 2016) and chose a different model (VGG-16 pre-trained on ImageNet), guided by maintaining a similar or lower non-private accuracy. The model was trained using Adam with the learning rate of 0.001. Since the goal of these experiments is to show relative performance of private methods, we did not perform an exhaustive search for hyperparameters, either using default or previously published values or values that yield reasonable training behaviour.

Privacy accounting with DP-SGD works in the following way. The non-private learning outcome at each iteration t is the gradient g_t of the loss function w.r.t. the model parameters, the outcome distribution is the Gaussian N(g_t, σ²C²). Before adding noise, the norm of the gradients is clipped to C. For the moments accountant, the privacy loss is calculated using this C and σ. For the Bayesian accountant, either pairs of examples x_i, x_j or pairs of batches are sampled from the dataset at each iteration, and used to compute ĉ_t(λ). Although clipping gradients is no longer necessary with the Bayesian accountant, it is highly beneficial for incurring lower privacy loss at each iteration and obtaining tighter composition. Moreover, it ensures the classic DP bounds on top of BDP bounds.

We also run evaluation on two binary classification tasks taken from the UCI database: Abalone (Waugh, 1995) (predicting the age of abalone from physical measurements) and Adult (Kohavi, 1996) (predicting income based on a person's attributes). In this setting, we compare differentially private variational inference (DPVI-MA (Jälkö et al., 2016)) to the variational inference with BDP. The datasets have 4,177 and 48,842 examples with 8 and 14 attributes accordingly. We use the same pre-processing and models as (Jälkö et al., 2016). We run experiments using the authors' original implementation³ with slight modifications (e.g. accounting randomness of sampling from variational distributions, instead of adding noise, using the Bayesian accountant, and performing classification with variational samples instead of optimal variational parameters).

³ https://github.com/DPBayes/DPVI-code
B.2. Effect of σ and bounded sensitivity

The primary goal of our paper is to obtain more meaningful privacy guarantees sacrificing as little utility as possible. The main factor in the loss of utility is the variance of the noise we add during training. Therefore it is critical to examine how our guarantee behaves compared to the classic DP for the same amount of noise. Or equivalently, how much noise does it require to reach the same ε.

As stated above, there are two possible regimes of operation for the Gaussian noise mechanism under Bayesian differential privacy: with bounded sensitivity and with unbounded sensitivity. The first is just like the classic DP: there is a maximum bound on the contribution of an individual example, and the noise is scaled to it. The second does not have a bound on contribution and mitigates it by taking into account the low probability of extreme contributions.

[Figure 3: Dependency between σ and ε for different C when clipping for both DP and BDP. Panels: (a) clipping at the 0.01-quantile of ‖∇f‖; (b) at the 0.50-quantile; (c) at the 0.99-quantile.]

[Figure 4: Dependency between σ and ε for different C when clipping for DP and not clipping for BDP. Panels: (a) noise at the 0.05-quantile of ‖∇f‖; (b) at the 0.50-quantile; (c) at the 0.95-quantile.]

Figures 3 and 4 demonstrate the dependency between σ and ε for different clipping thresholds C chosen relative to the quantiles of the gradient norm distribution. If we bound sensitivity by clipping the gradients, it ensures that BDP always requires less noise than DP to reach the same ε, as seen in Figure 3. As we decrease the clipping threshold C, more and more gradients get clipped and the BDP curve approaches the DP curve (Figure 3a). However, as we observe in Figure 4 comparing DP with bounded sensitivity and BDP with unbounded sensitivity, using unclipped gradients results in less consistent behaviour. It may require a more thorough search for the right noise variance to reach the same ε.

B.3. Effect of λ

As mentioned in Section 4.2, the privacy cost, and therefore the final value of ε, depend on the choice of λ. We run the Bayesian accountant for the Gaussian mechanism with fixed pairwise gradient distances (s.t. these results apply exactly to the moments accountant) for different signal-to-noise ratios and different λ.

[Figure 5: Dependency of λ and ε for different clipping thresholds C, q = 64/60000, σ = 1.0.]

Depicted in Figure 5 is ε as a function of λ for 10000 steps. We observe that λ has a clear effect on the final ε value. In some cases this effect is very significant and the change is sharp. It suggests that in practice one should be careful about the choice of λ. We also note that for lower signal-to-noise ratios (e.g. C = 0.1, σ = 1) the optimal choice of λ is much further on the real line and may well be outside the typical range computed in the literature.