Summary Statistic Privacy in Data Sharing
Summary Statistic Privacy in Data Sharing
5, 2024 369
A. Contributions
Our contributions are as follows.
• Lower bounds (Section IV): We derive general lower
bounds on distortion given a privacy budget for any
mechanism. These bounds depend on both the secret
function and the data distribution. We derive closed- Fig. 2. An illustrative example of why naive differential privacy mechanisms
do not protect summary statistics. Suppose we want to protect the mean of
form lower bounds for a number of case studies (i.e., the data. A typical differential privacy algorithm [23] would add zero-mean
combinations of prior beliefs on the data distribution and noise (e.g., Laplace noise) to the bins. This mechanism does not change the
secret functions). expected mean of the data.
• Mechanism design and upper bounds (Section V): We
propose a class of mechanisms that achieve summary
statistic privacy called quantization mechanisms, which A natural first attempt at the problem might use DP by
intuitively quantize a data distribution’s parameters1 into treating M as the data release mechanism that takes the
bins. We show that for the case studies analyzed theoret- original dataset as input and outputs the released dataset.
ically in Table I, the quantization mechanism achieves a For example, suppose we want to release the histogram in
privacy-distortion tradeoff within a small constant factor Fig. 2, representing the number of items sold by a company
of optimal (usually ≤3) in the regime where quantization at different prices. Suppose the mean of this distribution is
bins are small relative to the overall support set of the sensitive, as it can be used to determine the company’s overall
distribution parameters. We present a sawtooth technique trade volume. Two natural approaches for using DP arise:
for theoretically analyzing the quantization mechanism’s (1) Per-record DP: A typical DP algorithm [23] would add
privacy tradeoff under various types of secret func- zero-mean noise (e.g., Laplace noise) to each histogram bin in
tions and data distributions (Section V-C). Intuitively, Fig. 2. This prevents the adversary from inferring whether any
the sawtooth technique exploits the geometry of the individual record was in the dataset, but allows the attacker
distribution parameter(s) to divide the parametric space to derive an unbiased estimator of the mean from the released
into two regions: one in which privacy risk is small and data. In other words, the threat model of (this usage of)
analytically tractable, and another in which privacy risk DP and our framework are different: DP hides whether any
can be high, but which occurs with low probability. For given sample contributed to the shared data, whereas we want
the case studies that we do not analyze theoretically, we to hide functions of the underlying distribution. To show
provide a dynamic programming algorithm that efficiently this formally, one can construct counterexamples where a
numerically instantiates the quantization mechanism. data release mechanism is DP, but cannot protect a summary
• Empirical evaluation (Section VII): We give empirical
statistic of a data distribution (or dataset). We construct such
results showing how to use summary statistic privacy a counterexample in Appendix A-A in the supplementary
to release a real dataset, and how to evaluate the cor- material for the scenario of hiding the mean of a dataset of
responding summary statistic privacy metric. We show scalar numbers. The example shows that if the data holder
that the proposed quantization mechanism achieves better applies a local DP mechanism [24] (Gaussian mechanism) to
privacy-distortion tradeoffs than other related privacy their dataset, as the number of dataset samples n grows, the
mechanisms. released noisy mean concentrates around the original mean.
Hence, the adversary can guess the mean to within a tolerance
> 0 with a probability that tends to 1 as n → ∞.
II. R ELATED W ORK
(2) Per-attribute-per-record DP: One can also design a local
We divide the related work into two categories: approaches DP mechanism that adds independent noise to each record of
based on indistinguishability over candidate inputs, and the dataset, with independent noise of different scales used
information-theoretic approaches. for different attributes. Consider an example of hiding the
difference in mean salaries between males and females in a
A. Indistinguishability-Based Approaches gender-salary dataset. A data holder could use a local DP
Differential privacy (DP) [16] is one of the most mechanism that adds independent noise of different scales to
commonly-adopted privacy frameworks. A random mecha- each of the gender and salary attributes. Since the mechanism
nism M is (, δ)-differentially-private if for any neighboring itself is locally DP, it provides privacy guarantees at the
datasets X0 and X1 (i.e., X0 and X1 differ one sample), and level of each record, as well as each attribute of each record
any set S ⊆ range(M), we have (i.e., the DP privacy guarantee protects individual cells in the
dataset). However, such a class of mechanism still cannot
P(M(X0 ) ∈ S) ≤ e · P(M(X1 ) ∈ S) + δ. hide distributional properties (in our example, the mean salary
1 We assume data distributions are drawn from a parametric family; more difference between males and females). In Appendix A-B
details in Section III. in the supplementary material, we precisely formulate and
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
LIN et al.: SUMMARY STATISTIC PRIVACY IN DATA SHARING 371
analyze this example, and show that there exists an attack summary statistic privacy (though the privacy guarantees are
strategy such that the probability the adversary guesses the different, so it is difficult to do a fair comparison).
secret to within tolerance > 0 tends to 1 as dataset size Distribution inference [7], [8] considers a hypothesis test
n → ∞. in which the adversary must choose whether released data
(3) Per-dataset DP: A third natural alternative is to devise comes from one of two fixed input data distributions ω1 , ω2 .
a DP-like definition that explicitly protects the secret quan- Both distributions are assumed to be known to all parties.
tity. For instance, we could ask that for any pair of input By defining the attacker’s guessed as ω̂ and
distribution
distributions that differ in their secret quantity, the data the attacker’s advantage as |P ω̂|ω1 − P ω̂|ω2 |, distribution
release mechanism outputs similar released data distributions. inference requires that the attacker’s advantage be negligible.
Several per-dataset methods are listed below. Used naively, However, it is unclear how to establish a reasonable pair of
such an approach provides strong privacy guarantees, but candidate distributions; moreover, as with distribution privacy
may have poor utility. For instance,
consider
two Gaussian and attribute privacy, distribution inference may require high
input distributions N μ1 , σ12 and N μ2 , σ22 with the secret noise since it requires the data distributions to be indistin-
as the mean. The values of σ12 and σ22 could be arbitrarily guishable.
different. To make input distributions indistinguishable given
the released data, we must destroy information about the true
σ , which requires adding potentially unbounded noise. While B. Information-Theoretic Approaches
relaxations like metric differential privacy may help [25], they The second category of frameworks use information-
may introduce new challenges, e.g., how to choose the metric theoretic measures of privacy and utility [20], [22], [29], [30],
function to map dataset distance to a privacy parameter. [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41],
Attribute privacy [18] tackles these challenges in part [42], [43], [44]. Such works often measure disclosure via
by constraining the space of distributions that should be divergences, such as mutual information [19], [34], [45], [46],
indistinguishable [11]. Attribute privacy protects a function [47], [48], [49], [50], [51], f -divergences [21], [52], or min-
of a sensitive column in the dataset (named dataset attribute entropy [22], [30], [31], [38], [39], [40], [53], [54], [55]. We
privacy) or a sensitive parameter of the underlying distribution discuss a few examples here.
from which the data is sampled (named distribution attribute Privacy funnel [19], [29] is a well-known information-
privacy). It addresses the previously-mentioned shortcomings theoretic privacy framework. Let X be the random variable
of vanilla DP under the pufferfish privacy framework [26]. of the original data, containing sensitive information U,
Precisely, let X be the dataset, G be the possible range of a and let Y represent the (random) released data. The pri-
secret g, and Ga , Gb ⊆ G be two non-overlapping subsets of vacy funnel framework evaluates privacy leakage with
the secret range G. A mechanism M is (, δ)-attribute private the mutual information I(U; Y), and the utility of Y
if for any dataset X , secret range pairs Ga , Gb , and any set with mutual information I(X; Y). To find a data release
S ⊆ range(M): mechanism PY|X , privacy funnel solves the optimization
minPY|X :I(X;Y)≥R I(U; Y), where R is a desired threshold on the
P(M(X ) ∈ S|g(X ) ∈ Ga ) ≤ e P(M(X ) ∈ S|g(X ) ∈ Gb ) + δ. utility of Y. Adopting the same privacy and utility metrics,
Zamani et al. [46] instead analyze an upper bound on utility
Attribute privacy focuses on algorithms that output a statistical under a privacy constraint, i.e., supPY|X :I(U;Y)≤ I(X; Y), where
query of the dataset instead of the entire dataset. Though represents the privacy constraint. However, prior work
we may apply attribute privacy to analyze full-dataset-sharing has argued that mutual information is not a good metric for
algorithms; it may need to add substantial noise due to the either privacy or utility [20]. On the privacy front, there exist
high dimensionality of the dataset (Section VII). mechanisms that reduce I(U; Y) while allowing the attacker
Distribution privacy [27] is a closely related notion, which to guess U correctly from Y with higher probability (see [20,
releases a full data distribution under DP-style indistinguisha- Example 1]). On the utility front, high mutual information
bility guarantees. Roughly, for any two input distributions I(X; Y) does not mean that the released data Y is a useful
with parameters θ0 and θ1 from a pre-defined set of candi- representation of X; for instance, Y could be an arbitrary one-
date distributions, a distribution-private mechanism outputs a to-one transformation of X.
distribution M(θi ) for i ∈ {0, 1} such that for any set S in Rate-distortion formulations. Although rate-distortion the-
the output space, we have P[M(θi ) ∈ S] ≤ e P[M(θ1−i ) ∈ ory was originally proposed in the context of source
S] + δ. By obfuscating the whole distribution, distribution coding [56], it has more recently been used to model privacy
privacy inherently protects the private information. However problems as follows. Let X be the random variable with finite
the required noise may be more than what is needed to protect or countable alphabets X with prior distribution π , and M be
only select secret(s). For example, as mentioned above, two the mechanism that encodes X to Y. For a distortion d : X ×
datasets can have exactly the same secret statistic (e.g., mean), Y → R≥0 and a threshold d̂, the problem is to find the optimal
while differing significantly in other respects (e.g., variance)— mechanism that minimizes MI subject to distortion constraint
this requires significant noise in general. A recent work [28] d̂: M∗ = arg minE[D(X,M,d)]≤d̂ I(X; YX,M ), where YX,M is
proposes mechanisms for distribution privacy, and we observe the encoding of X. Several papers have used this formulation
this trend experimentally in Section VII; the noise added by to model tradeoffs between privacy (mutual information) and
the mechanisms in [28] is larger than what we require with utility (distortion), particularly in the context of location
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
372 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 5, 2024
privacy [47], [48]. These works use the celebrated Blahut- same shortcomings as mutual information, namely that any
Arimoto algorithm [57], [58] to identify a mechanism that random one-to-one mapping can achieve a high utility metric
Pareto-optimally trades off mutual information for average without having practical utility.
distortion. Quantitative information flow: The concept of quantitative
As mentioned earlier, mutual information has some information flow (QIF) was first introduced in [59], [60],
shortcomings as a privacy metric [20]. Nevertheless, the rate- with the goal of quantitatively measuring the amount of
distortion formulation of privacy is related to our work in information leaked about a secret by observing the output data.
that it uses distortion as the measure of utility. Whereas rate- QIF broadly encompasses several of the privacy frameworks
distortion-based formulations use average-case distortion as a we have mentioned previously. Early works mainly adopted
distortion metric, we use worst-case distortion (Section III). If mutual information as the leakage definition [49], [50], [51],
we replace the average distortion in rate-distortion theory by while Smith [30] showed that it fails to capture vulnera-
the worst-case distortion, finding the Pareto-optimal mecha- bility: the probability of an attacker successfully guessing
nism may be substantially more challenging since the objective the secret in one try. This led to the introduction of min-
function M∗ = arg minmaxx∈X,y∈Y d(x,y)≤d̂ I(X; YX,M ) is no entropy leakage, which is a normalized variant of our privacy
x,M
longer a convex program in general. metric. Generalizations of min-entropy leakage have been
Maximal leakage [20] is an information-theoretic frame- proposed [31], [53], [55]; in particular, g-leakage [53] intro-
work for quantifying the leakage of sensitive information. duces a gain function that models partial guessing or multiple
Using the same notation as before, the adversary’s guess of guessing scenarios. Recently, g-leakage was used as a pri-
secret U is denoted by Û. Based on this setup, the Markov vacy metric to study a variety of applications, including the
chain U − X − Y − Û holds. Maximal leakage L from X to Y combination of local DP and shuffling [38], average-case
is defined as utility in privacy-preserving pipelines [40], and cyber-attack
defense problems [39]. However, unlike our work, these works
P U = Û consider the entire input dataset as the secret information that
L(X → Y) = sup log , (1)
U−X−Y−Û
maxu PU (u) needs to be protected, whereas we only need to protect the
sensitive information U contained in X, while maximizing
where the sup is taken over U (i.e., considering the worst- the disclosure of nonsensitive information in X. Because of
case secret) and Û (i.e., considering the strongest attacker). this, our goal is to minimize the information leakage of U
Intuitively, Eq. (1) evaluates the ratio (in nats) of the proba- while maximizing the leakage of other information in X,
bilities of guessing the secret U correctly with and without and to derive fundamental limits on tradeoffs between these
observing Y. Variants and generalizations of maximal leakage quantities. Although we could have used min-entropy leakage
have been proposed,
modifying Eq. (1) to penalize different in our privacy metric design, their analysis applies to discrete
values of P U = Û differently, using so-called gain func- alphabets and cannot be trivially extended to the continuous
tions [33], [35], [36], [37]. Maximal leakage and its variants case, as the probability density of an attacker guessing the
assume that the secret U is unknown a priori and therefore exact secret is zero. Moreover, the only utility analysis in these
considers the worst-case leakage over all possible secrets. works is average-case, whereas our paper adopts a worst-case
However, in our problem, data holders know what secret they utility metric, which significantly changes the mechanisms
want to protect. and conclusions we draw (e.g., leading to quantization-based
Min-entropy metrics: Several papers have studied privacy mechanisms).
metrics related to min-entropy, or the probability of guessing Noiseless privacy-preserving policies: An interesting
the secret correctly [22], [30], [31], [53], [54]. Among these, property of the mechanism we study—the quantization
the most closely related paper is by Asoodeh et al. [22], mechanism—is that it is deterministic. Several prior works
which directly analyzes the probability of guessing the secret, have studied noiseless privacy mechanisms under various
as we do (within a threshold). Adopting the same notation assumptions on the generative process for the data [41], [42],
as before (i.e., the Markov chain U − X − Y), [22] aims to [43], [44]. For example, adopting non-stochastic information
maximize the disclosure of X (i.e., maxf P(X = f (Y)), where theoretic methods, Farokhi [42] use maximin information [61]
the max is taken over all functions f ) to ensure high utility. and non-stochastic information leakage as privacy metrics and
This optimization is subject to a privacy constraint on the measure utility as the worst-case difference between the input
sensitive information U: maxĝ P U = ĝ(Y) ≤ T, where the and output responses. For instance, maximin information is
max is taken over all attack strategies ĝ. However, the authors defined as I(X ; Y) = log(|ϒ(X , Y)|); here X , Y represent
assume that for random variables X and Y, the value of each the original and released datasets, and ϒ(X , Y) represents
dimension can only be either 0 or 1 (i.e., each dimension of the unique taxicab partition of X , Y, where X , Y denotes
the data distribution parameter is binary). Since their analysis the set of all feasible input-output dataset pairs. Roughly,
relies on the properties of Bernoulli distribution, the results a taxicab partition consists of a sequence of dataset pairs
cannot be trivially extended to non-binary case, significantly such that any two consecutive pairs share the same X or
constraining the range of distribution settings this framework Y (formal definition in [42]). Reference [42] proves that
can analyze. Furthermore, they assess utility based on the the quantization mechanism is the optimal privacy-preserving
probability of precisely guessing the original data. However, policy over the set of deterministic piecewise differentiable
in data-sharing contexts, this utility measure suffers from the mechanisms (which quantizes the input X as several bins and
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
LIN et al.: SUMMARY STATISTIC PRIVACY IN DATA SHARING 373
the output Y is differentiable for each bin) that maximize example Xθ ∼ N (μ, σ ), suppose the data holder wishes to
the privacy level subject to a utility constraint for maximin hide the mean; we thus have that g(μ, σ ) = μ.
information. Data release mechanism: The data holder releases data
In our work, despite not constraining the mechanism space by passing the private parameter θ through a data release
to deterministic mechanisms, we also find that the quantization mechanism Mg . That is, for a given θ , the data holder
mechanism is near-optimal. However, the proof in [42] is first draws internal randomness z ∼ ωZ , and then releases
based on the property of taxicab connectivity [61] from another distribution parameter θ = Mg (θ, z), where Mg is
non-stochastic information theory, which cannot be directly a deterministic function, and ωZ is a fixed distribution from
adopted in our analysis. The similarities in our findings which z is sampled. Note that we assume both the input and
(albeit over quite different problem formulations and analysis output of Mg are distribution parameters. It is straightforward
techniques) suggest that quantization-based mechanisms may to generalize to the case when the input and/or output are
be a universally good solution for private data release problems datasets of samples (see Section VI).
subject to worst-case distortion constraints. This question may For example, in the Gaussian case discussed above, one
be an interesting direction for future work. data release mechanism could be Mg ((μ, σ ), z) = (μ + z, σ )
where z ∼ N (0, 1). I.e., the mechanism shifts the mean by a
random amount drawn from a standard Gaussian distribution
III. P ROBLEM F ORMULATION
and keeps the variance.
Notation: We denote random variables with uppercase English Threat model: We assume that the attacker knows the
letters or upright Greek letters (e.g., X, μ), and their realiza- parametric family from which the data is drawn, and has a
tions with italicized lowercase letters (e.g., x, μ). For a random prior over the parameter realization, but does not know the
variable X, we denote its probability density function (PDF) as initial parameter θ . The attacker also knows the data release
fX , and its distribution measure as ωX . If a random variable X is mechanism Mg and output θ but not the realization of the
drawn from a parametric family (e.g., Gaussian with specified data holder’s internal randomness z. The attacker guesses the
mean and covariance), the parameters will be denoted with a
) based on the released parameter θ according
initial secret g(θ
subscript of X, i.e., the above notations become Xθ , fXθ , ωXθ to estimate ĝ θ . ĝ can be either random or deterministic, and
respectively for parameters θ ∈ Rq , where q ≥ 1 denotes the we assume no computational bounds on the adversary. For
dimension of the parameters. In addition, we denote fX|Y as the instance, in therunning Gaussian example, an attacker may
conditional PDF or PMF of X given another random variable choose ĝ μ , σ = μ .
Y. We use Z, Z>0 , N, R, R>0 , to denote the set of integers,
Privacy metric: The data holder wishes to prevent an attacker
positive integers, natural numbers, real numbers, and positive
from guessing its secret g(θ ).
real numbers, respectively.
Inspired by min-entropy, we define our privacy metric
Original data: Consider a data holder who possesses a dataset privacy ,ω as the attacker’s probability of guessing the
of n samples X = {x1 , . . . , xn }, where for each i ∈ [n], secret(s) to within a tolerance , taken worst-case over all
xi ∈ R is drawn i.i.d. from an underlying distribution. We attackers ĝ:
assume the distribution comes from a parametric family, and
the parameter vector θ ∈ Rq of the distribution fully specifies ,ω sup P |ĝ θ − g(θ )| ≤ . (2)
the distribution. That is, xi ∼ ωXθ , where we further assume ĝ
that θ is itself a realization of random parameter vector , The probability is taken over the randomness of the original
and ω is the probability measure for . We will discuss data distribution (θ ∼ ω ), the data release mechanism (z ∼
how to relax the assumption on this prior distribution of θ in ωZ ), and the attacker strategy (ĝ).
Section VIII. We assume that the data holder knows θ (and Remark: This privacy metric is an average-case guarantee
hence knows its full data distribution ωXθ ); our results and over the prior distribution of the parameters ω . A natural
mechanisms generalize to the case when the data holder only question is whether this can be converted into a worst-case
possesses the dataset X (see Section VI). privacy guarantee, as is common in many privacy frame-
For example, suppose the original data samples come from works [16], [18], [28]. However, a worst-case variant of the
a Gaussian distribution. We have θ = (μ, σ ), and Xθ ∼ metric (worst-case over prior distributions) is too stringent; no
N (μ, σ ). ω (or f ) describes the prior distribution over mechanism can achieve meaningful privacy (< 1), as formally
(μ, σ ). For example, if we know a priori that the mean of the stated below (proof in Appendix D-A in the supplementary
Gaussian is drawn from a uniform distribution between 0 and material).
1, and σ is always 1, we could have f (μ, σ ) = I(μ ∈ [0, 1])· Proposition 1: There is no data release mechanism whose
δ(σ ), where I(·) is the indicator function, and δ is the Dirac privacy value satisfies
delta function.
sup ,ω < 1.
Statistical secret to protect: We assume the data holder wants ω
to hide a secret quantity, which is defined as a function of
the original data distribution. Since the true data distribution Distortion metric: The main goal of data sharing is to provide
is fully specified by parameter vector θ , we define the secret useful data; hence, we (and data holders and users) want to
as a function of θ as follows: g(θ ):Rq → R. In the Gaussian understand how much the released data distorts the original
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
374 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 5, 2024
data. We define the distortion of a mechanism as the worst- The proof is shown as below. From Theorem 1 we see
case distance between the original distribution and the released that the lower bound of distortion scales inversely with the
distribution: privacy budget and positively with the tolerance threshold .
The dependent quantity γ in Eq. (5) can be thought of as a
sup d ωXθ ωXθ , (3) conversion factor that bounds the translation from probability
θ∈Supp(ω ),θ ,
z∈Supp(ωZ ):Mg (θ,z)=θ of detection to distributional distance. Note that we have not
made γ exact as its form depends on the type of the secret and
where d is Wasserstein-1 distance. We use a worst-case prior distribution of data. We will instantiate it in the cases
definition of distortion because in data sharing settings, data studies in Section VI.
is typically released in one shot, so that data should be Proof: Our proof proceeds by constructing an ensemble of
useful even in the worst case. Wasserstein-1 distance is attackers, such that at least one of them will be correct by
commonly used as the distance metric in neural network construction. We do this by partitioning the space of possible
design (e.g., [62], [63]). Note that the definition in Eq. (3) can secret values, and having each attacker output the midpoint
be extended to data release mechanisms that take datasets as of one of the subsets of the partition. We then use the fact
inputs and/or outputs. that each attacker can be correct with probability at most
Formulation: To summarize, the data holder’s objective is to T, combined with γ , which intuitively relates the distance
choose a data release mechanism that minimizes distortion between distributions to the distance between their secrets, to
subject to a constraint on privacy ,ω : derive the claim. Recall that θ is the true private parameter
vector, θ is the released parameter vector as a result of the
min
Mg data release mechanism.
subject to ,ω ≤ T. (4) T ≥ ,ω
The reverse formulation, minMg ,ω subject to ≤ T is = sup P ĝ θ ∈ g(θ ) − , g(θ ) +
analyzed in Appendix B in the supplementary material. ĝ
The optimal data release mechanisms for Eq. (4) depends
= sup E P ĝ θ ∈ g(θ ) − , g(θ ) + θ
on the secrets and the characteristics of the original data. Data ĝ
holders specify the secret function they want to protect and
select the data release mechanism to process the raw data for = E sup P ĝ θ ∈ g(θ ) − , g(θ ) + θ , (7)
sharing. ĝ
Our goal is to study: (1) What are fundamental limits on
where Eq. (7) is due to the following
facts:
the tradeoff between privacy and distortion? (2) Do there
• LHS ≤ RHS, as supĝ P ĝ θ ∈ g(θ ) − , g(θ ) + |θ
exist data release mechanisms that can match or approach
≥ P ĝ θ ∈ g(θ ) − , g(θ ) + |θ for any θ .
these fundamental limits? In general, these questions can
• RHS ≤ LHS: Let us define
have different answers for different parametric families of
data distributions and secret functions. In Sections IV and V,
tθ sup P ĝ θ ∈ g(θ ) − , g(θ ) + θ .
we first present general results that do not depend on data ĝ
distribution or secret function. We then present case studies for
specific secret functions and data distributions in Section VI. RHS=
θ f (θ )tθ dθ . We can define an attacker as
ĝ θ = tθ . In that case,
IV. G ENERAL L OWER B OUND ON P RIVACY -D ISTORTION
T RADEOFFS E P ĝ θ ∈ g(θ ) − , g(θ ) + θ
Given a privacy budget T, we first present a lower bound
on distortion that applies regardless of the prior distribution = f θ tθ dθ .
θ
of data ω and regardless of the secret g. In other words, this
Therefore, LHS≥RHS.
applies for arbitrary correlations between parameters, which
Thus, there exists θ s.t.
are captured by the prior ω .
Theorem 1 (Lower Bound
of Privacy-Distortion Tradeoff):
sup P ĝ θ ∈ g(θ ) − , g(θ ) + θ ≤ T.
Let D Xθ1 , Xθ2 12 d ωXθ1 ωXθ2 , where d(· ·) denotes ĝ
Wasserstein-1 distance. Further, let R Xθ1 , Xθ2 Let
|g(θ1 ) − g(θ2 )| and
Lθ inf g(θ ),
D Xθ1 , Xθ2 θ∈Supp(ω ),z:Mg (θ,z)=θ
γ inf . (5)
θ1 ,θ2 ∈Supp(ω ) R Xθ1 , Xθ2 Rθ sup g(θ ).
θ∈Supp(ω ),z:Mg (θ,z)=θ
For any T ∈ (0, 1), when ,ω ≤ T,
We can define
a sequence of attackers and a constant N such
1 that ĝi θ = Lθ + (i + 0.5) · 2 for i ∈ {0, 1, . . . , N − 1} and
> − 1 · 2γ . (6)
T Lθ + 2N ≥ Rθ > Lθ + 2(N − 1) (Fig. 3). From the above,
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
LIN et al.: SUMMARY STATISTIC PRIVACY IN DATA SHARING 375
Therefore, we have
sup D Xθ1 , Xθ2 = D Xθ1 , Xθ2
θi ∈Supp(ω ),zi :Mg (θi ,zi )=θ
1
≥ γ · R Xθ1 , Xθ2 > − 1 · 2γ ,
T
• Eq. (10): Let Note that the policy is fully determined by Si and θi∗ .
We will show different ways of instantiating the quantization
θ1 , θ2 = arg sup D Xθ1 , Xθ2 , mechanism to approach the lower bound in Section IV.
θi ∈Supp(ω ):∃zi s.t.Mg (θi ,zi )=θ
Intuitively, quantization
mechanisms
will have a bounded
θ1 , θ2 = arg inf R Xθ1 , Xθ2 .
θi ∈Supp(ω ):∃zi s.t.Mg (θi ,zi )=θ distortion as long as d ωXθ ωXθ ∗ is bounded for all θ ∈
I(θ)
We have Supp( ). At the same time, they obfuscate the secret as
different data distributions within the same set are mapped
D Xθ1 , Xθ2
γ inf to the same released parameter. This simple deterministic
θ1 ,θ2 ∈Supp(ω ) R Xθ1 , Xθ2
mechanism is sufficient to achieve the (order) optimal privacy-
D Xθ1 , Xθ2 D Xθ1 , Xθ2 distortion trade-offs in many cases, as opposed to differential
≤ ≤ . privacy, which requires randomness to provide theoretical
R Xθ1 , Xθ2 R Xθ1 , Xθ2 guarantees [16] (examples in the case studies of Section VI).
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
376 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 5, 2024
B. Algorithms for Instantiating the Quantization Mechanism possible parameters is divided into infinitely many subsets
To implement the quantization mechanism, we need to Sμ,i , each consisting of a diagonal line segment (parallel blue
define the quantization bins Si and the released parameter per lines in Fig. 4). The space of possible σ values is divided
bin θi∗ . Depending on the data distribution, the secret function, into segments of length s, which correspond to the horizontal
and quantization mechanism parameters, the mechanism can bands in Fig. 4. Given this choice of intervals, the mechanism
have very different privacy-distortion tradeoffs. We present proceeds as follows: when the true distribution parameters fall
two methods for selecting quantization parameters: (1) an in one of these intervals, the mechanism releases the midpoint
analytical approach, and (2) a numeric approach. of the interval. The fact that the intervals Sμ,i are diagonal
1) Analytical Approach (Sketch): In some cases, outlined lines arises from choosing t(θ1 , θ2 ) = μσ11 −μ 2
−σ2 ; each interval
in the case studies of Section VI and the appendices, we can corresponds to a set of points (μ̃, σ̃ ) that satisfy t(θ1 , θ2 ) = t0 ,
find analytical expressions for Si and θi∗ while (near-)optimally i.e., with slope 1/t0 .
trading off privacy for distortion. This is usually possible when We show how to use this construction to upper bound
the lower bound depends on the problem parameters in a privacy-distortion tradeoffs in Section V-C.
specific way (see below). We will next illustrate the procedure 2) Numeric Approach: In some cases, the above procedure
through an example; precise analysis is given in Section VI. may not be possible. We next present a dynamic programming
For example, for the Gaussian distribution where θ = algorithm to numerically compute the quantization mechanism
(μ, σ ), when secret=standard deviation, we can work out the parameters. This algorithm achieves an optimal privacy-
lower bound from Theorem 1 (details in Appendix H in the distortion tradeoff [64] among quantization algorithms with
supplementary material). Note that the lower bound is tight if finite precision and continuous intervals Si . We use this
our mechanism minimizes algorithm (presented for univariate data distributions) in some
of the case studies in Section VI.
D Xμ1 ,σ1 , Xμ2 ,σ2 −μ2 2
1 − 12 μσ1 −σ
= e 1 2 We assume Supp( ) = θ, θ , where θ , θ are lower
R Xμ1 ,σ1 , Xμ2 ,σ2 2π and upper bounds of θ , respectively. We consider the
μ1 − μ2 1 μ1 − μ2 class of quantization mechanisms such that Si = θ i , θ i ,
− − , (11)
σ1 − σ2 2 σ1 − σ2 i.e., each subset of parameters are in a continuous range.
Furthermore, we explore mechanisms such that θ i , θ i , θi∗ ∈
where D Xθ1 , Xθ2 and R Xθ1 , Xθ2 are defined in Theorem 1,
and denotes the CDF of the standard Gaussian distribution. θ , θ + κ, θ + 2κ, . . . , θ , where κ is a hyper-parameter that
That is, for any true parameters μ1 and σ1 , the mechanism encodes numeric precision (and therefore divides (θ − θ )).
should always choose to release μ2 and σ2 such that Eq. (11) For example, if we want to hide the mean of a Geometric
is as small as possible. The exact form of Eq. (11) is not random variable with θ = 0.1 and θ = 0.9, we could consider
important for now; notice instead that the problem parameters three-decimal-place precision, i.e., κ = 0.001 and θ i , θ i , θi∗ ∈
(σi , μi ) take the same form every time they appear in this {0.100, 0.101, 0.102, . . . , 0.900}.
equation. We define t(θ1 , θ2 ) = μσ11 −μ 2 2
−σ2 to be that form. Next,
Since (Eq. (3)) is defined as the worst-case distortion
we find the t(θ1 , θ2 ) that minimizes Eq. (11): whereas ,ω (Eq. (2)) is defined as a probability, which
is related to the original data distribution, optimizing ,ω
D Xθ , Xθ2 given bounded (Eq. (15)) is easier to solve than the final
t0 arg inf 1
t(θ1 ,θ2 ) R Xθ1 , Xθ2 goal of optimizing given bounded ,ω (Eq. (4)).
For instance, in our Gaussian example, we can write t0 as min ,ω subject to ≤ T. (15)
Mg
1 − 1 (t(θ1 ,θ2 ))2 1
t0 = arg inf e 2 − (t(θ1 , θ2 )) − (t(θ1 , θ2 )) , Observing that in Eq. (4) the optimal value of minMg is
t(θ1 ,θ2 ) 2π 2
a monotonic decreasing function w.r.t. the threshold T, we
which can be solved numerically.
Finally, we can choose Si can use a binary search algorithm (shown in Appendix C
and θi∗ to be sets for which t θ, θi∗ = t0 , ∀θ ∈ Si . Using this in the supplementary material) to reduce problem Eq. (4)
rule, we derive the mechanism: to problem Eq. (15). It calls an algorithm that finds the
s s
Sμ,i = μ + t0 · t, σ + (i + 0.5) · s + t |t ∈ − , ,(12) optimal quantization mechanism with numerical precision over
2 2 continuous intervals under a distortion budget T (i.e., solving
∗
θμ,i = μ, σ + (i + 0.5) · s , (13) Eq. (15)). This problem can be solved by a dynamic program-
I = {(μ, i)|i ∈ N, μ ∈ supp(ωV)}, (14) ming algorithm. Let pri(t∗ ) (t∗ ∈ θ , θ + κ, θ + 2κ, . . . , θ )
be
the minimal
privacy ,ω we can get for Supp( ) =
where
s is a hyper-parameter of the mechanism that divides Xθ :θ ∈ θ , t∗ such that ≤ T. Denote D(θ1 , θ2 ) as the
σ − σ , and σ , σ are upper and lower bounds on σ , deter- minimal distortion a quantization mechanism can achieve
mined by the adversary’s prior. under the quantization bin [θ1 , θ2 ), we have
For our Gaussian example, the resulting sets Sμ,i for the
quantization mechanism are shown in Fig. 4; the space of D(θ1 , θ2 ) = infq sup d ωXθ ωXθ ,
θ∈R θ ∈[θ1 ,θ2 )
2 Indeed, for many of the case studies in Section VI, t(θ ) takes an analogous
form; we will see the implications of this in the analysis of the upper bound where d(· ·) is defined in Eq. (3). We also denote
in Section V-C. D∗ (θ1 , θ2 ) = arg infθ∈[θ1 ,θ2 ) supθ ∈[θ1 ,θ2 ) d ωXθ ωXθ . If the
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
LIN et al.: SUMMARY STATISTIC PRIVACY IN DATA SHARING 377
prior over parameters is f , we have the Bellman equation Algorithm 1: Dynamic-Programming-Based Data Release
θ Mechanism for Single-Parameter Distributions
θ f (t)dt
pri t∗ = min pri(θ )
t∗ Input: Parameter range: θ, θ
θ∈[ θ,t∗ −κ ],D(θ,t∗ )≤T θ f (t)dt
Prior over parameter: f
t∗
θ f (t)dt ∗ Distortion budget: T
+ t∗
P θ, t Step size: κ (which divides θ − θ )
θ f (t)dt
1
←0
pri(θ)
with the initial state pri θ = 0, where 2 I θ ←∅
for t∗ ← θ + κ, θ + 2κ, . . . , θ do
P θ, t∗ = P ĝ∗ θ ∈ g(θ0 ) − , g(θ0 ) + |θ0 ∈ θ, t∗ , θ 3
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
LIN et al.: SUMMARY STATISTIC PRIVACY IN DATA SHARING 379
TABLE I
S UMMARY OF THE C ASE S TUDIES W E C OVER , AND L INKS TO THE C ORRESPONDING R ESULTS
Proposition 2: Under Assumption 1, Mechanism 1 has the operator to divide λ − λ , where λ and λ are upper and
privacy ,ω ≤ 2s and distortion = 2s < 2 opt ( ,ω ), lower bounds of λ.
where opt ( ,ω ) is the minimal distortion any data release • Exponential:
mechanism can achieve given privacy level ,ω .
The proof is in Appendix D-C in the supplementary mate- Si = λ + i · s, λ + (i + 1) · s ,
rial. The two takeaways from this proposition are that: (1) the θi∗ = λ + (i + 0.5) · s,
data holder can use s to control the trade-off between distortion
I = N.
and privacy, and (2) the mechanism achieves an order-optimal
distortion with multiplicative factor 2. • Shifted exponential:
s s
B. Secret = Quantiles Si,h = λ + (i + 0.5)s + t, h − t0 · t |t ∈ − , ,
2 2
In this section, we show how to protect the α-quantile of ∗
θi,h = λ + (i + 0.5)s, h ,
the exponential distribution and the shifted exponential distri-
bution. We analyze the Gaussian and uniform distributions in I = {(i, h)|i ∈ N, h ∈ R},
Appendix G in the supplementary material. We choose these
distributions as the starting point of our analysis as many where
⎧
distributions in real-world data can be approximated by one ⎨ −1 − ln(1 − α) − W−1 − ln(1−α)+1 α ∈ [0, 1 − e−1 )
of these distributions. t0 = 2(1−α)e .
⎩ −1 − ln(1 − α) − W0 − ln(1−α)+1 α ∈ [1 − e−1 , 1)
In our analysis, the parameters of (shifted) exponential 2(1−α)e
distributions are denoted by:
• Exponential distribution: θ = λ, where λ is the scale
For the privacy-distortion trade-off analysis of
parameter: fXλ (x) = λ1 e−x/λ . Mechanism 2, we assume that the parameters of the original
• Shifted exponential distribution generalizes the exponen-
data are drawn from a uniform distribution with lower and
tial distribution with an additional shift parameter h: θ = upper bounds. Again, we relax this assumption to Lipschitz
(λ, h). In other words, fXλ,h (x) = λ1 e−(x−h)/λ . priors in Appendix E-B in the supplementary material.
As before, we first present a lower bound. Precisely,
Corollary 2 (Privacy lower bound, secret = α-quantile of a Assumption 2: The prior over distribution parameters is:
• Exponential: λ follows the uniform distribution over
continuous distribution): Consider the secret function g(θ ) =
α-quantile of fXθ . For λ, λ .
any T ∈ (0, 1), when ,ω ≤ T, we • Shifted exponential: (λ, h) follows the uniform distribu-
have > T1 − 1 · 2γ , where γ is defined as follows:
tion over (a, b)|a ∈ λ, λ , b ∈ h, h .
• Exponential:
We relax Assumption 2 and analyze the privacy-distortion
1 trade-off of Mechanism 2 in Appendix E-B in the supplemen-
γ =− . tary material.
2 ln(1 − α)
Proposition 3: Under Assumption 2, Mechanism 2 has the
• Shifted exponential: following ,ω and value/bound.
⎧
• Exponential:
⎪
⎪
⎪ ln(1−α)+1 −1
⎨ 2 1 + W−1 − ln(1−α)+1 α ∈ [0, 1 − e )
1
⎪
2 1
= , = s<2 opt .
2(1−α)e
γ = , ,ω
⎪
⎪ − ln(1 − α)s 2
⎪ 1 1 + ln(1−α)+1
⎪ α ∈ [1 − e−1 , 1)
⎩2 W0 − ln(1−α)+1
2(1−α)e • Shifted exponential:
where W−1 and W0 are Lambert W functions. 2 |t0 |s
The proof is given in Appendix D-D in the supplementary ,ω < + ,
|ln(1 − α) + t0 |s h − h
material. Next, we provide data release mechanisms for each of s
the distributions that achieve trade-offs close to these bounds. (t0 − 1) + se−t0
=
2
Mechanism 2 (For secret = α-quantile of a continuous dis-
tribution): We design mechanisms for each of the distributions. |t0 | · |ln(1 − α) + t0 |s2
< 2+ opt .
In both cases, s > 0 is the quantization bin size chosen by h−h
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
380 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 5, 2024
s2
Under the high-precision regime where → 0 as For example, it can be used to modify synthetically-generated
h−h
s, (h − h) → ∞, when α ∈ [0.01, 0.25] ∪ [0.75, 0.99], samples after they are generated, or to modify the training
satisfies dataset for a generative model, or to directly modify the
original data for releasing.
lim sup <3 opt .
s2
→0
h−h VII. E XPERIMENTS
opt is the optimal achievable distortion given the privacy In the previous sections, we theoretically demonstrated the
achieved by Mechanism 2, and t0 is a constant defined in privacy-distortion tradeoffs of our data release mechanisms
Mechanism 2. in some special case studies. In this section, we focus on
The proof is in Appendix D-E in the supplementary mate- orthogonal questions through real-world experiments: (1) how
rial. Note that the quantization bin size s cannot be too small, well our data release mechanisms perform in practice when
or the attacker can always successfully guess the secret within the assumptions do not hold, and (2) how summary statistic
a tolerance (i.e., ,ω = 1). Therefore, for the “high- privacy quantitatively compares with existing privacy frame-
precision” regime, we consider the asymptotic scaling as both works (which we explained qualitatively in Section II).3
2
s and h − h grow. When s > 1, the scaling condition s → 0 Datasets. We use two real-world datasets to simulate the
h−h
implies a more interpretable condition of s → 0, which says motivating scenarios.
h−h
that the bin size is small relative to the parameter space. For 1) Wikipedia Web Traffic Dataset (WWT) [69] contains
example, this condition is required when the secret tolerance the daily page views of 145,063 Wikipedia web pages
> 1/2 (i.e., we need a bin size s > 1 to achieve non-trivial in 2015-2016. To preprocess it for our experiments, we
privacy guarantees). remove the web pages with empty page view record
Proposition 3 shows that the quantization mechanism is on any day (117,277 left), and compute the mean page
order-optimal with multiplicative factor 2 for the expo- views across all dates for each web page. Our goal is
nential distribution. For shifted exponential distribution, to release the page views (i.e., a 117,277-dimensional
order-optimality holds asymptotically in the high-precision vector) while protecting the mean of the distribution
regime. (which reveals the business scales of the company).
2) Measuring Broadband America Dataset (MBA) [70]
C. Extending Data Release Mechanisms for Dataset Inputs contains network statistics (including network traf-
and Outputs fic counters) collected by United States Federal
Communications Commission from homes across
The data release mechanisms discussed in previous sec- United States. We select the average network traffic
tions assume that data holders know the distribution parameter (GB/measurement) from AT&T clients as our data. Our
of the original data. In practice, data holders often only have a goal is to release a copy of this data while hiding the
dataset of samples from the data distribution and do not know 0.95-quantile (which reveals the network capability).
the parameters of the underlying distributions. Quantization
data release mechanisms can be easily adapted to handle Baselines. We compare our mechanisms discussed in
dataset inputs and outputs. Section VI with three popular mechanisms proposed in prior
The high-level idea is that the data holders can estimate work (Section II): differentially-private density estimation [23]
the distribution parameters θ from the data samples and (shortened to DP), attribute-private Gaussian mechanism [18]
find the corresponding quantization bins Si according to the (shortened to AP), and Wasserstein mechanism for distribution
estimated parameters, and then modify the original samples privacy [28] (shortened to DistP). As these mechanisms
as if they were sampled according to the released parameter provide different privacy guarantees than summary statistic
θi∗ . This may be infeasible for high-dimensional parameter privacy, it is difficult to do a fair comparison between these
vectors θ ; we did not explore this question in the current baselines and our quantization mechanism. We include them to
work. For brevity, we only present the concrete procedure for quantitatively show the differences (and similarities) between
secret=mean in continuous distributions as an example. For a various privacy frameworks.
dataset of X = {x1 , . . . , xn }, the procedure is the following. For a dataset of samples X = {x1 , . . . , xn }, DP works
1) Estimate the mean from the data samples: μ̂ = by: (1) Dividing the space into m bins:
B1, . . . , Bm . (2)
1 Computing the histogram Ci = nj=1 I xj ∈ Bi . (3) Adding
n i∈[n] xi .
2) According to Eq. (16), compute the index of the corre- noise to the histograms
Di = max 0, Ci + Laplace 0, β 2 ,
μ̂−μ
sponding set i = s . where Laplace 0, β 2 means a random noise from Laplace
3) According to Eq. (13), change the mean of the data distribution with mean 0 and variance β 2 . (4) Normalizing
samples to μtarget = μ + (i + 0.5) · s. This can be done the histogram pi = mDi D . We can then draw yi
j=1 j
by a sample-wise operation xi = xi −μ̂ + μtarget. according to the histogram and release Y = {y1 , . . . , yn }
4) The released dataset is Mg (X , z) = x1 , . . . , xn . with differential
privacy
n guarantees. AP works by releasing
Note that this mechanism applies to samples. Therefore, Y = xi + N 0, β 2 i=1 . DistP works by releasing Y =
it can be applied either to the original data, or as an add-
on to existing data sharing tools [15], [65], [66], [67], [68]. 3 Code available at https://github.com/fjxmlzn/summary_statistic_privacy.
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
LIN et al.: SUMMARY STATISTIC PRIVACY IN DATA SHARING 381
n
xi + Laplace 0, β 2 i=1 . Note that for each of these mecha- real data exactly. Our data release mechanism for mean (i.e.,
nisms, normally their noise parameters would be set carefully Mechanism 1 used in WWT) supports general continuous
to match the desired privacy guarantees (e.g., differential distributions. Indeed, even for our surrogate metrics, our
privacy). In our case, since our privacy metric is different, it Mechanism 1 is also optimal (see Appendix D-F in the
is unclear how to set the noise parameters for a fair privacy supplementary material, Fig. 6(a)). However, the quantization
comparison. For this reason, we evaluate different settings of data release mechanisms for quantiles (i.e., Mechanism 2
the noise parameters, and measure the empirical tradeoffs. used in Fig. 6(b)) are order-optimal only when the dis-
Metrics. Our privacy and distortion metrics depend on the tributions are within certain classes (Section VI-B). Since
prior distribution of the original data θ ∼ ω (though network traffic in MBA is one-sided and heavy-tailed, we
the mechanism does not). In practice (and also in these use the data release mechanism for exponential distributions
experiments), the data holder only has one dataset. Therefore, (Mechanism 2), which is not heavy-tailed. Despite the dis-
we cannot empirically evaluate the proposed privacy and tribution mismatch, the quantization data release mechanism
distortion metrics, and resort to surrogate metrics to bound our still achieves a good (surrogate) privacy-distortion compared
true privacy and distortion. to DP, AP, and DistP (Fig. 6(b)).
Surrogate privacy metric. For an original dataset X = (2) The quantization data release mechanisms achieve better
{x1 , . . . , xn } and the released dataset Y = {y1 , . . . , yn }, we privacy-distortion trade-off than DP, AP, and DistP. AP and
define the surrogate privacy metric ˜ as the error of an DistP directly add Gaussian/Laplace noise to each sample.
attacker who guesses the secret of the released dataset as This process does not change the mean of the distribution on
the true secret: ˜ −|g(X ) − g(Y)|, where g(D) = mean expectation. Therefore, Figure 6 shows that AP and DistP have
of D and 0.95-quantile of D in WWT and MBA datasets a bad privacy-distortion tradeoff. DP quantizes (bins) the sam-
respectively. Note that in the definition of ˜ , a minus sign ples before adding noise. Quantization has a better property in
is added so that a smaller value indicates stronger privacy, as terms of protecting the mean of the distribution, and therefore
in privacy metric Eq. (2). This simple attacker strategy is in we see that DP has a better privacy-distortion tradeoff than AP
fact a good proxy for evaluating the privacy ,ω due to the and DistP, but still worse than the quantization mechanism.
following facts. (1) For our data release mechanisms for these Note that in Fig. 6(b), a few of the DP instances have better
secrets Mechanism 1,Mechanism 2, when the prior distribution privacy-distortion trade-offs than ours. This is not an indication
is uniform, this strategy is actually optimal, so there is a direct that DP is fundamentally better. Due to the randomness in DP
mapping between ˜ and ,ω . (from the added Laplace noise), some realizations of the noise
(2) For AP applied on protecting mean of the data (i.e., in this experiment give a better trade-off. Another instance
Wikipedia Web Traffic Dataset experiments), this strategy of the DP algorithm could give a worse trade-off, so DP’s
gives an unbiased estimator of the secret. (3) For DP and achievable trade-off points are widespread.
AP on other cases, this mechanism may not be an unbiased In summary, these empirical results confirm the intuition
estimator of the secret, but it gives an upper bound on the in Section II that DP, AP, and DistP may not achieve good
attacker’s error. privacy-utility tradeoffs for our problem. This is expected—
Surrogate distortion metric. We define our surrogate distortion they are designed for a different objective. Additional results
metric as the Wasserstein-1 distance between the two datasets: on downstream tasks are in Appendix J in the supplementary
˜ d pX pY where pD denotes the empirical distribution of material.
a dataset D. This metric evaluates how much the mechanism
distorts the dataset. VIII. D ISCUSSION AND F UTURE W ORK
In fact, we can deduce a theoretical lower bound for the This work introduces a framework for summary statistic
surrogate privacy and distortion metrics for secret = mean privacy concerns in data sharing applications. This framework
(shown later in Fig. 6) using similar techniques as the proofs can be used to analyze the leakage of statistical information
in the main paper (see Appendix D-F in the supplementary and the privacy-distortion trade-offs of data release mecha-
material). nisms (Sections III and IV). The quantization data release
mechanisms can be used to protect statistical information
A. Results (Sections V and VI). However, many interesting open ques-
We enumerate the hyper-parameters of each method (bin tions for future work remain.
size and β for DP, β for AP and DistP, and s for ours). For each Number of secrets. In this work, we studied the case where
method and each hyper-parameter, we compute their surrogate the data holder only wishes to hide a single secret. In practice,
privacy and distortion metrics. The results are shown in Fig. 6 data holders often want to hide multiple properties of their
(bottom left is best); each data point represents one realization underlying data. The challenges of studying this setting are
of mechanism Mg under a distinct hyperparameter setting. twofold. The first is metric design. Although we can adopt
Two takeaways are below. the same high level ideas of designing privacy and distortion
(1) The proposed quantization data release mechanisms has metrics in this paper, the data holders’ tolerances ( in the
a good surrogate privacy-distortion trade-off, even when the current work) can differ for different secrets (e.g., from not
assumptions do not hold. The data distributions analyzed allowing any secret to be disclosed, to tolerating at most a
in Section VI and in the Appendices may not always match small subset of secrets being revealed). It is not clear which
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
382 IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, VOL. 5, 2024
Fig. 6. Privacy (lower is better) and distortion (lower is better) of AP, DP, DistP, and ours. Each point represents one instance of data release mechanism with
one hyper-parameter. “Lower bound” is the theoretical lower bound of the achievable region. Our data release mechanisms achieve better privacy-distortion
tradeoff than AP, DP, and DistP.
Authorized licensed use limited to: California State University Fullerton. Downloaded on November 11,2024 at 16:09:30 UTC from IEEE Xplore. Restrictions apply.
LIN et al.: SUMMARY STATISTIC PRIVACY IN DATA SHARING 383
,ω considers how much additional "information" the released data provides to the attacker in the worst case (see also inferential privacy [71]).

REFERENCES

[1] H. L. Lee and S. Whang, "Information sharing in a supply chain," Int. J. Manuf. Technol. Manage., vol. 1, no. 1, pp. 79–93, 2000.
[2] N. Choucri, S. Madnick, and P. Koepke, Institutions for Cyber Security: International Responses and Data Sharing Initiatives. Cambridge, MA, USA: Massachusetts Inst. Technol., 2016.
[3] J. B. Jacobs and D. Blitsa, "Sharing criminal records: The United States, the European Union and Interpol compared," Loyola Los Angeles Int. Comput. Law Rev., vol. 30, p. 125, Jan. 2008.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE CVPR, 2009, pp. 248–255.
[5] C. Reiss, J. Wilkes, and J. L. Hellerstein, "Google cluster-usage traces: Format + schema," Google Inc., Mountain View, CA, USA, White Paper, pp. 1–14, 2011.
[6] S. Luo et al., "Characterizing microservice dependency and performance: Alibaba trace analysis," in Proc. ACM SoCC, 2021, pp. 412–426.
[7] A. Suri and D. Evans, "Formalizing and estimating distribution inference risks," 2021, arXiv:2109.06024.
[8] A. Suri, Y. Lu, Y. Chen, and D. Evans, "Dissecting distribution inference," in Proc. 1st IEEE Conf. Secure Trustworthy Mach. Learn., 2023, pp. 1–16.
[9] G. Ateniese, L. V. Mancini, A. Spognardi, A. Villani, D. Vitali, and G. Felici, "Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers," Int. J. Security Netw., vol. 10, no. 3, pp. 137–150, 2015.
[10] K. Ganju, Q. Wang, W. Yang, C. A. Gunter, and N. Borisov, "Property inference attacks on fully connected neural networks using permutation invariant representations," in Proc. ACM SIGSAC Conf. Comput. Commun. Security, 2018, pp. 619–633.
[11] W. Zhang, S. Tople, and O. Ohrimenko, "Leakage of dataset properties in multi-party machine learning," in Proc. USENIX Security Symp., 2021, pp. 2687–2704.
[12] S. Mahloujifar, E. Ghosh, and M. Chase, "Property inference from poisoning," in Proc. Security Privacy, 2022, pp. 1–18.
[13] H. Chaudhari, J. Abascal, A. Oprea, M. Jagielski, F. Tramèr, and J. Ullman, "SNAP: Efficient extraction of private properties with poisoning," in Proc. IEEE SP, 2022, pp. 1935–1952.
[14] B. Imana, A. Korolova, and J. Heidemann, "Institutional privacy risks in sharing DNS data," in Proc. ANRW, 2021, pp. 69–75.
[15] Z. Lin, A. Jain, C. Wang, G. Fanti, and V. Sekar, "Using GANs for sharing networked time series data: Challenges, initial promise, and open questions," in Proc. ACM IMC, 2020, pp. 464–483.
[16] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Proc. TCC, New York, NY, USA, Mar. 2006, pp. 265–284.
[17] C. Reiss, J. Wilkes, and J. L. Hellerstein, "Obfuscatory obscanturism: Making workload traces of commercially-sensitive systems safe to release," in Proc. IEEE NOMS, 2012, pp. 1279–1286.
[18] W. Zhang, O. Ohrimenko, and R. Cummings, "Attribute privacy: Framework and mechanisms," in Proc. FAccT, 2022, pp. 1–20.
[19] A. Makhdoumi, S. Salamatian, N. Fawaz, and M. Médard, "From the information bottleneck to the privacy funnel," in Proc. IEEE ITW, 2014, pp. 501–505.
[20] I. Issa, A. B. Wagner, and S. Kamath, "An operational approach to information leakage," IEEE Trans. Inf. Theory, vol. 66, no. 3, pp. 1625–1657, Mar. 2020.
[21] H. Wang, L. Vo, F. P. Calmon, M. Médard, K. R. Duffy, and M. Varia, "Privacy with estimation guarantees," IEEE Trans. Inf. Theory, vol. 65, no. 12, pp. 8025–8042, Dec. 2019.
[22] S. Asoodeh, M. Diaz, F. Alajaji, and T. Linder, "Privacy-aware guessing efficiency," in Proc. ISIT, 2017, pp. 1–5.
[23] L. Wasserman and S. Zhou, "A statistical framework for differential privacy," J. Amer. Stat. Assoc., vol. 105, no. 489, pp. 375–389, 2010.
[24] B. Bebensee, "Local differential privacy: A tutorial," 2019, arXiv:1907.11908.
[25] K. Chatzikokolakis, M. E. Andrés, N. E. Bordenabe, and C. Palamidessi, "Broadening the scope of differential privacy using metrics," in Proc. PETS, Bloomington, IN, USA, Jul. 2013, pp. 82–102.
[26] D. Kifer and A. Machanavajjhala, "Pufferfish: A framework for mathematical privacy definitions," ACM Trans. Database Syst., vol. 39, no. 1, pp. 1–36, 2014.
[27] Y. Kawamoto and T. Murakami, "Local obfuscation mechanisms for hiding probability distributions," in Proc. Comput. Security, 2019, pp. 128–148.
[28] M. Chen and O. Ohrimenko, "Protecting global properties of datasets with distribution privacy mechanisms," in Proc. AISTATS, 2023, pp. 1–20.
[29] H. Yamamoto, "A source coding problem for sources with additional outputs to keep secret from the receiver or wiretappers (corresp.)," IEEE Trans. Inf. Theory, vol. 29, no. 6, pp. 918–923, Nov. 1983.
[30] G. Smith, "On the foundations of quantitative information flow," in Proc. FoSSaCS, 2009, pp. 288–302.
[31] M. S. Alvim, K. Chatzikokolakis, A. McIver, C. Morgan, C. Palamidessi, and G. Smith, "Additive and multiplicative notions of leakage, and their capacities," in Proc. IEEE CSF, 2014, pp. 308–322.
[32] S. Asoodeh, M. Diaz, F. Alajaji, and T. Linder, "Information extraction under privacy constraints," Information, vol. 7, no. 1, p. 15, 2016.
[33] J. Liao, O. Kosut, L. Sankar, and F. du Pin Calmon, "Tunable measures for information leakage and applications to privacy-utility tradeoffs," IEEE Trans. Inf. Theory, vol. 65, no. 12, pp. 8043–8066, Dec. 2019.
[34] F. P. Calmon, A. Makhdoumi, and M. Médard, "Fundamental limits of perfect privacy," in Proc. ISIT, 2015, pp. 1796–1800.
[35] S. Saeidian, G. Cervia, T. J. Oechtering, and M. Skoglund, "Pointwise maximal leakage," in Proc. IEEE ISIT, 2022, pp. 626–631.
[36] G. R. Kurri, L. Sankar, and O. Kosut, "An operational approach to information leakage via generalized gain functions," 2022, arXiv:2209.13862.
[37] A. Gilani, G. R. Kurri, O. Kosut, and L. Sankar, "Unifying privacy measures via maximal (α, β)-leakage (MαbeL)," IEEE Trans. Inf. Theory, vol. 70, no. 6, pp. 4368–4395, Jun. 2024.
[38] M. Jurado, R. G. Gonze, M. S. Alvim, and C. Palamidessi, "Analyzing the shuffle model through the lens of quantitative information flow," 2023, arXiv:2305.13075.
[39] R. Jin, X. He, and H. Dai, "On the security-privacy tradeoff in collaborative security: A quantitative information flow game perspective," IEEE Trans. Inf. Forensics Security, vol. 14, pp. 3273–3286, 2019.
[40] M. S. Alvim, N. Fernandes, A. McIver, C. Morgan, and G. H. Nunes, "A novel analysis of utility in privacy pipelines, using Kronecker products and quantitative information flow," in Proc. ACM CCS, 2023, pp. 1718–1731.
[41] R. Bhaskar, A. Bhowmick, V. Goyal, S. Laxman, and A. Thakurta, "Noiseless database privacy," in Proc. ASIACRYPT, Seoul, South Korea, Dec. 2011, pp. 215–232.
[42] F. Farokhi, "Development and analysis of deterministic privacy-preserving policies using non-stochastic information theory," IEEE Trans. Inf. Forensics Security, vol. 14, pp. 2567–2576, 2019.
[43] F. Farokhi, "Noiseless privacy: Definition, guarantees, and applications," IEEE Trans. Big Data, vol. 9, no. 1, pp. 51–62, Feb. 2023.
[44] F. Farokhi and G. Nair, "Non-stochastic private function evaluation," in Proc. IEEE ITW, 2021, pp. 1–5.
[45] B. Rassouli and D. Gündüz, "On perfect privacy," IEEE J. Sel. Areas Inf. Theory, vol. 2, no. 1, pp. 177–191, Mar. 2021.
[46] A. Zamani, T. J. Oechtering, and M. Skoglund, "Bounds for privacy-utility trade-off with non-zero leakage," in Proc. IEEE ISIT, 2022, pp. 620–625.
[47] S. Biswas and C. Palamidessi, "PRIVIC: A privacy-preserving method for incremental collection of location data," in Proc. Privacy Enhanc. Technol., 2024, pp. 1–15.
[48] S. Oya, C. Troncoso, and F. Pérez-González, "Back to the drawing board: Revisiting the design of optimal location privacy-preserving mechanisms," in Proc. ACM CCS, 2017, pp. 1959–1972.
[49] D. Clark, S. Hunt, and P. Malacaria, "Quantitative analysis of the leakage of confidential data," Electron. Notes Theor. Comput. Sci., vol. 59, no. 3, pp. 238–251, 2002.
[50] D. Clark, S. Hunt, and P. Malacaria, "A static analysis for quantifying information flow in a simple imperative language," J. Comput. Security, vol. 15, no. 3, pp. 321–371, 2007.
[51] P. Malacaria, "Assessing security threats of looping constructs," in Proc. ACM POPL, 2007, pp. 225–235.
[52] B. Rassouli and D. Gündüz, "Optimal utility-privacy trade-off with total variation distance as a privacy measure," IEEE Trans. Inf. Forensics Security, vol. 15, pp. 594–603, 2019.
[53] M. S. Alvim, K. Chatzikokolakis, C. Palamidessi, and G. Smith, "Measuring information leakage using generalized gain functions," in Proc. CSF, 2012, pp. 265–279.
[54] S. Asoodeh, M. Diaz, F. Alajaji, and T. Linder, "Estimation efficiency under privacy constraints," IEEE Trans. Inf. Theory, vol. 65, no. 3, pp. 1512–1534, Mar. 2019.
[55] N. E. Bordenabe and G. Smith, "Correlated secrets in quantitative information flow," in Proc. CSF, 2016, pp. 93–104.
[56] C. E. Shannon, "Coding theorems for a discrete source with a fidelity criterion," IRE Int. Conv. Rec., vol. 4, pp. 142–163, 1959.
[57] S. Arimoto, "An algorithm for computing the capacity of arbitrary discrete memoryless channels," IEEE Trans. Inf. Theory, vol. IT-18, no. 1, pp. 14–20, Jan. 1972.
[58] R. Blahut, "Computation of channel capacity and rate-distortion functions," IEEE Trans. Inf. Theory, vol. 18, no. 4, pp. 460–473, 1972.
[59] D. E. R. Denning, Cryptography and Data Security, vol. 112. Reading, MA, USA: Addison-Wesley, 1982.
[60] J. W. Gray, III, "Toward a mathematical foundation for information flow security," J. Comput. Security, vol. 1, nos. 3–4, pp. 255–294, 1992.
[61] G. N. Nair, "A nonstochastic information theory for communication and state estimation," IEEE Trans. Autom. Control, vol. 58, no. 6, pp. 1497–1510, Jun. 2013.
[62] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proc. ICML, 2017, pp. 214–223.
[63] Z. Lin, A. Khetan, G. Fanti, and S. Oh, "PacGAN: The power of two samples in generative adversarial networks," in Proc. NeurIPS, 2018, pp. 1–10.
[64] R. Bellman, "Dynamic programming," Science, vol. 153, no. 3731, pp. 34–37, 1966.
[65] C. Esteban, S. L. Hyland, and G. Rätsch, "Real-valued (medical) time series generation with recurrent conditional GANs," 2017, arXiv:1706.02633.
[66] Y. Yin, Z. Lin, M. Jin, G. Fanti, and V. Sekar, "Practical GAN-based synthetic IP header trace generation using NetShare," in Proc. ACM SIGCOMM, 2022, pp. 458–472.
[67] J. Jordon, J. Yoon, and M. van der Schaar, "PATE-GAN: Generating synthetic data with differential privacy guarantees," in Proc. ICLR, 2018, pp. 1–21.
[68] J. Yoon, D. Jarrett, and M. van der Schaar, "Time-series generative adversarial networks," in Proc. NeurIPS, vol. 32, 2019, pp. 1–11.
[69] "Web traffic time series forecasting," Google, 2018. [Online]. Available: https://www.kaggle.com/c/web-traffic-time-series-forecasting
[70] (Federal Commun. Comm., Washington, DC, USA). Raw Data—Measuring Broadband America—Seventh Report. (2018). [Online]. Available: https://www.fcc.gov/reports-research/reports/measuring-broadband-america/raw-data-measuring-broadband-america-seventh
[71] A. Ghosh and R. Kleinberg, "Inferential privacy guarantees for differentially private mechanisms," in Proc. ITCS, 2017, pp. 1–31.

Zinan Lin received the B.E. degree from Tsinghua University in 2017, and the Ph.D. degree from Carnegie Mellon University in 2023. He is a Senior Researcher with Microsoft Research. His research interests are generative modeling and privacy. His work has been recognized with several awards, including the Outstanding Paper/Oral/Spotlight Awards at NeurIPS and the Best Paper Finalist at IMC.

Shuaiqi Wang (Graduate Student Member, IEEE) received the B.E. degree from Shanghai Jiao Tong University in 2020. He is currently pursuing the Ph.D. degree with Carnegie Mellon University. His work has been recognized with the Carnegie Institute of Technology Dean's Fellowship. His research interests are data privacy and security in machine learning.

Vyas Sekar received the B.Tech. degree from the Indian Institute of Technology Madras (IIT Madras), and the Ph.D. degree from Carnegie Mellon University, Pittsburgh, where he is a Tan Family Professor with the ECE Department. He also serves as the Chief Scientist with Conviva, and as a Cofounder with Rockfish Data, a startup commercializing his academic research on synthetic data. His work has been recognized with numerous awards, including the SIGCOMM Rising Star Award, the SIGCOMM Test of Time Award, the NSA Science of Security Prize, the NSF CAREER Award, the Internet Research Task Force Applied Networking Research Award, the Intel Outstanding Researcher Award, and the IIT Madras Young Alumni Achiever Award. He was awarded the President of India Gold Medal from IIT Madras.

Giulia Fanti (Member, IEEE) received the B.S. degree in ECE from the Olin College of Engineering, and the Ph.D. degree in EECS from the University of California at Berkeley. She is an Assistant Professor of Electrical and Computer Engineering with Carnegie Mellon University. Her research interests span the security, privacy, and efficiency of distributed systems. She is a two-time Fellow of the World Economic Forum's Global Future Council on Cybersecurity and a member of NIST's Information Security and Privacy Advisory Board. Her work has been recognized with several awards, including best paper awards, a Sloan Fellowship, an Intel Rising Star Faculty Award, and an ACM SIGMETRICS Rising Star Award.