Privately Learning Markov Random Fields
Huanyu Zhang  Gautam Kamath  Janardhan Kulkarni  Zhiwei Steven Wu
August 17, 2020
Abstract
We consider the problem of learning Markov Random Fields (including the prototypical
example, the Ising model) under the constraint of differential privacy. Our learning goals include
both structure learning, where we try to estimate the underlying graph structure of the model,
as well as the harder goal of parameter learning, in which we additionally estimate the parameter
on each edge. We provide algorithms and lower bounds for both problems under a variety of
privacy constraints – namely pure, concentrated, and approximate differential privacy. While
non-privately, both learning goals enjoy roughly the same complexity, we show that this is not
the case under differential privacy. In particular, only structure learning under approximate
differential privacy maintains the non-private logarithmic dependence on the dimensionality of
the data, while a change in either the learning goal or the privacy notion would necessitate
a polynomial dependence. As a result, we show that the privacy constraint imposes a strong
separation between these two learning problems in the high-dimensional data regime.
1 Introduction
Graphical models are a common structure used to model high-dimensional data, which find a
myriad of applications in diverse research disciplines, including probability theory, Markov Chain
Monte Carlo, computer vision, theoretical computer science, social network analysis, game theory,
and computational biology [LPW09, Cha05, Fel04, DMR11, GG86, Ell93, MS10]. While statistical
tasks involving general distributions over p variables often run into the curse of dimensionality (i.e.,
an exponential sample complexity in p), Markov Random Fields (MRFs) are a particular family
of undirected graphical models which are parameterized by the “order” t of their interactions.
Restricting the order of interactions allows us to capture most distributions which may naturally
arise, and also avoids this severe dependence on the dimension (i.e., we often pay an exponential
dependence on t instead of p). An MRF is defined as follows; see Section 2 for the more precise
definitions and notation we will use in this paper.
Definition 1.1. Let $k, t, p \in \mathbb{N}$, let $G = (V, E)$ be a graph on $p$ nodes, and let $C_t(G)$ be the set of cliques of size at most $t$ in $G$. A Markov Random Field with alphabet size $k$ and $t$-order interactions is a distribution $D$ over $[k]^p$ such that
$$\Pr_{X\sim D}[X = x] \propto \exp\Big(\sum_{I\in C_t(G)} \psi_I(x)\Big),$$

∗Cornell University. [email protected]. Supported by NSF #1815893 and by NSF #1704443. This work was partially done while the author was an intern at Microsoft Research Redmond.
†University of Waterloo. [email protected]. Supported by a University of Waterloo startup grant. Part of this work was done while supported as a Microsoft Research Fellow, as part of the Simons-Berkeley Research Fellowship program, and while visiting Microsoft Research Redmond.
‡Microsoft Research Redmond. [email protected].
§University of Minnesota. [email protected]. Supported in part by the NSF FAI Award #1939606, a Google Faculty Research Award, a J.P. Morgan Faculty Award, a Facebook Research Award, and a Mozilla Research Grant.
¶These authors are in alphabetical order.
1.1 Results and Techniques
We proceed to describe our results on privately learning Markov Random Fields. In this section,
we will assume familiarity with some of the most common notions of differential privacy: pure
ε-differential privacy, ρ-zero-concentrated differential privacy, and approximate (ε, δ)-differential
privacy. In particular, one should know that these are in (strictly) decreasing order of strength
(i.e., an algorithm which satisfies pure DP gives more privacy to the dataset than concentrated DP);
formal definitions appear in Section 2. Furthermore, in order to be precise, some of our theorem
statements will use notation which is defined later (Section 2) – these may be skipped on a first
reading, as our prose will not require this knowledge.
Upper Bounds. Our first upper bounds are for parameter learning. First, we have the following
theorem, which gives an upper bound for parameter learning of pairwise graphical models under
concentrated differential privacy, showing that this learning goal can be achieved with $O(\sqrt{p})$
samples. In particular, this includes the special case of the Ising model, which corresponds to an
alphabet size k = 2. Note that this implies the same result if one relaxes the learning goal to
structure learning, or the privacy notion to approximate DP, as these modifications only make the
problem easier. Further details are given in Section 3.3.
Theorem 1.2. There exists an efficient ρ-zCDP algorithm which learns the parameters of a pair-
wise graphical model to accuracy α with probability at least 2/3, which requires a sample complexity
of
$$n = O\!\left(\frac{\lambda^2 k^5\log(pk)\,e^{O(\lambda)}}{\alpha^4} + \frac{\sqrt{p}\,\lambda^2 k^{5.5}\log^2(pk)\,e^{O(\lambda)}}{\sqrt{\rho}\,\alpha^3}\right).$$
This result can be seen as a private adaptation of the elegant work of [WSD19] (which in turn
builds on the structural results of [KM17]). Wu, Sanghavi, and Dimakis [WSD19] show that ℓ1 -
constrained logistic regression suffices to learn the parameters of all pairwise graphical models. We
first develop a private analog of this method, based on the private Frank-Wolfe method of Talwar,
Thakurta, and Zhang [TTZ14, TTZ15], which is of independent interest. This method is studied
in Section 3.1.
Theorem 1.3. If we consider the problem of private sparse logistic regression, there exists an
efficient ρ-zCDP algorithm that produces a parameter vector wpriv , such that with probability at
least 1 − β, the empirical risk
$$L(w^{priv}; D) - L(w^{erm}; D) = O\!\left(\frac{\lambda^{4/3}\log\left(\frac{np}{\beta}\right)}{(n\sqrt{\rho})^{2/3}}\right).$$
We note that Theorem 1.3 avoids a polynomial dependence on the dimension p in favor of a
polynomial dependence on the “sparsity” parameter λ. The greater dependence on p which arises in
Theorem 1.2 is from applying Theorem 1.3 and then using composition properties of concentrated
DP.
We go on to generalize the results of [WSD19], showing that ℓ1 -constrained logistic regression
can also learn the parameters of binary t-wise MRFs. This result is novel even in the non-private
setting. Further details are presented in Section 4.
The following theorem shows that we can learn the parameters of binary t-wise MRFs with
$\tilde{O}(\sqrt{p})$ samples.
Theorem 1.4. Let D be an unknown binary t-wise MRF with associated polynomial h. Then there
exists a ρ-zCDP algorithm which, with probability at least 2/3, learns the maximal monomials of
h to accuracy α, given n i.i.d. samples Z 1 , · · · , Z n ∼ D, where
$$n = O\!\left(\frac{e^{5\lambda t}\sqrt{p}\log^2(p)}{\sqrt{\rho}\,\alpha^{9/2}} + \frac{t\lambda^2\sqrt{p}\log p}{\sqrt{\rho}\,\alpha^2} + \frac{e^{6\lambda t}\log(p)}{\alpha^6}\right).$$
To obtain the rate above, our algorithm uses the Private Multiplicative Weights (PMW) method
by [HR10] to estimate all parity queries of all orders no more than t. The PMW method runs in
time exponential in p, since it maintains a distribution over the data domain. We can also obtain
an oracle-efficient algorithm that runs in polynomial time when given access to an empirical risk
minimization oracle over the class of parities. By replacing PMW with such an oracle-efficient
algorithm sepFEM in [VTB+ 19], we obtain a slightly worse sample complexity
$$n = O\!\left(\frac{e^{5\lambda t}\sqrt{p}\log^2(p)}{\sqrt{\rho}\,\alpha^{9/2}} + \frac{t\lambda^2 p^{5/4}\log p}{\sqrt{\rho}\,\alpha^2} + \frac{e^{6\lambda t}\log(p)}{\alpha^6}\right).$$
For the special case of structure learning under approximate differential privacy, we provide
a significantly better algorithm. In particular, we can achieve an O(log p) sample complexity,
which improves exponentially on the above algorithm’s sample complexity of $O(\sqrt{p})$. The following
is a representative theorem statement for pairwise graphical models, though we derive similar
statements for binary MRFs of higher order.
Theorem 1.5. There exists an efficient (ε, δ)-differentially private algorithm which, with probability
at least 2/3, learns the structure of a pairwise graphical model, which requires a sample complexity
of
$$n = O\!\left(\frac{\lambda^2 k^4\exp(14\lambda)\log(pk)\log(1/\delta)}{\varepsilon\eta^4}\right).$$
This result can be derived using stability properties of non-private algorithms. In particular,
in the non-private setting, the guarantees of algorithms for this problem recover the entire graph
exactly with constant probability. This allows us to derive private algorithms at a multiplicative cost
of O(log(1/δ)/ε) samples, using either the propose-test-release framework [DL09] or stability-based
histograms [KKMN09, BNSV15]. Further details are given in Section 6.
Lower Bounds. We note the significant gap between the aforementioned upper bounds: in
particular, our more generally applicable upper bound (Theorem 1.2) has an $O(\sqrt{p})$ dependence on
the dimension, whereas the best known lower bound is Ω(log p) [SW12]. However, we show that
our upper bound is tight. That is, even if we relax the privacy notion to approximate differential
privacy, or relax the learning goal to structure learning, the sample complexity is still $\Omega(\sqrt{p})$.
Perhaps surprisingly, if we perform both relaxations simultaneously, this falls into the purview of
Theorem 1.5, and the sample complexity drops to O(log p).
First, we show that even under approximate differential privacy, learning the parameters of a
graphical model requires $\Omega(\sqrt{p})$ samples. The formal statement is given in Section 5.
Theorem 1.6 (Informal). Any algorithm which satisfies approximate differential privacy and learns
the parameters of a pairwise graphical model with probability at least 2/3 requires poly(p) samples.
This result is proved by constructing a family of instances of binary pairwise graphical models
(i.e., Ising models) which encode product distributions. Specifically, we consider the set of graphs
formed by a perfect matching with edges (2i, 2i+1) for i ∈ [p/2]. In order to estimate the parameter
on every edge, one must estimate the correlation between each such pair of nodes, which can be
shown to correspond to learning the mean of a particular product distribution in ℓ∞ -distance. This
problem is well-known to have a gap between the non-private and private sample complexities, due
to methods derived from fingerprinting codes [BUV14, DSS+ 15, SU17], and differentially private
Fano’s inequality [ASZ20b].
Second, we show that learning the structure of a graphical model, under either pure or con-
centrated differential privacy, requires poly(p) samples. The formal theorem appears in Section 7.
Theorem 1.7 (Informal). Any algorithm which satisfies pure or concentrated differential privacy
and learns the structure of a pairwise graphical model with probability at least 2/3 requires poly(p)
samples.
We derive this result via packing arguments [HT10, BBKN14, ASZ20b], by showing that there
exists a large number (exponential in p) of different binary pairwise graphical models which must
be distinguished. The construction of a packing of size $m$ implies lower bounds of $\Omega(\log m)$ and
$\Omega(\sqrt{\log m})$ for learning under pure and concentrated differential privacy, respectively.
VMLC16, KM17, HKM17, RH17, LVMC18, WSD19]. Perhaps a turning point in this literature is
the work of Bresler [Bre15], who showed for the first time that general Ising models of bounded
degree can be learned in polynomial time. Since this result, subsequent works have focused on both
generalizing these results to broader settings (including MRFs with higher-order interactions and
non-binary alphabets) as well as simplifying existing arguments. There has also been work on
learning, testing, and inferring other statistical properties of graphical models [BM16, MdCCU16,
DDK17, MMY18, Bha19]. In particular, learning and testing Ising models in statistical distance
have also been explored [DDK18, GLP18, DMR18, DDK19, BBC+ 19], and are interesting questions
under the constraint of privacy.
Recent investigations at the intersection of graphical models and differential privacy include [BMS+ 17,
CRJ19, MSM19]. Bernstein et al. [BMS+ 17] privately learn graphical models by adding noise to the
sufficient statistics and use an expectation-maximization based approach to recover the parameters.
However, the focus is somewhat different, as they do not provide finite sample guarantees for the
accuracy when performing parameter recovery, nor consider structure learning at all. Chowdhury,
Rekatsinas, and Jha [CRJ19] study differentially private learning of Bayesian Networks, another
popular type of graphical model which is incomparable with Markov Random Fields. McKenna,
Sheldon, and Miklau [MSM19] apply graphical models in place of full contingency tables to privately
perform inference.
Graphical models can be seen as a natural extension of product distributions, which correspond
to the case when the order of the MRF t is 1. There has been significant work in differentially
private estimation of product distributions [BDMN05, BUV14, DMNS06, SU17, KLSU19, CWZ19,
BKSW19]. Recently, this investigation has been broadened into differentially private distribution
estimation, including sample-based estimation of properties and parameters, see, e.g., [NRS07,
Smi11, BNSV15, DHS15, KV18, AKSZ18, KLSU19, BKSW19]. For further coverage of differentially
private statistics, see [KU20].
2 Preliminaries
Given an integer n, we let [n] := {1, 2, · · · , n}. Given a set of points X 1 , · · · , X n , we use super-
scripts, i.e., X i to denote the i-th datapoint. Given a vector X ∈ Rp , we use subscripts, i.e., Xi to
denote its i-th coordinate. We also use X−i to denote the vector after deleting the i-th coordinate,
i.e. X−i = [X1 , · · · , Xi−1 , Xi+1 , · · · , Xp ].
Definition 2.1. The p-variable Ising model is a distribution D(A, θ) on {−1, 1}p that satisfies
$$\Pr(Z = z) \propto \exp\Big(\sum_{1\le i\le j\le p} A_{i,j} z_i z_j + \sum_{i\in[p]} \theta_i z_i\Big),$$
where $A \in \mathbb{R}^{p\times p}$ is a symmetric weight matrix with $A_{ii} = 0$, $\forall i \in [p]$, and $\theta \in \mathbb{R}^p$ is a mean-field vector. The dependency graph of $D(A, \theta)$ is an undirected graph $G = (V, E)$, with vertices $V = [p]$ and edges $E = \{(i, j) : A_{i,j} \ne 0\}$. The width of $D(A, \theta)$ is defined as
$$\lambda(A, \theta) = \max_{i\in[p]} \sum_{j\in[p]} |A_{i,j}| + |\theta_i|.$$
Let $\eta(A, \theta)$ be the minimum edge weight in absolute value, i.e., $\eta(A, \theta) = \min_{i,j\in[p]:\,A_{i,j}\ne 0} |A_{i,j}|$.
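To make the definition concrete, the following minimal (and entirely non-private) Gibbs-sampler sketch draws approximate samples from $D(A, \theta)$, using the conditional $\Pr(Z_i = 1 \mid Z_{-i}) = \sigma\big(2\sum_{j\ne i} A_{i,j}Z_j + 2\theta_i\big)$ derived later in Lemma 3.7. The function name and the burn-in/thinning schedule are illustrative; this sampler plays no role in our algorithms.

```python
import numpy as np


def gibbs_sample_ising(A, theta, n_samples, burn_in=1000, thin=10, rng=None):
    """Draw approximate samples from the Ising model D(A, theta) on {-1, +1}^p.

    A: symmetric (p, p) weight matrix with zero diagonal; theta: length-p
    mean-field vector.  Included only to illustrate Definition 2.1."""
    rng = np.random.default_rng() if rng is None else rng
    p = len(theta)
    z = rng.choice([-1, 1], size=p)
    samples = []
    total_iters = burn_in + n_samples * thin
    for it in range(total_iters):
        for i in range(p):
            # Conditional law of node i given the rest (cf. Lemma 3.7);
            # A[i, i] = 0 by definition, subtracted here only for safety.
            field = 2.0 * (A[i] @ z - A[i, i] * z[i]) + 2.0 * theta[i]
            prob_plus = 1.0 / (1.0 + np.exp(-field))
            z[i] = 1 if rng.random() < prob_plus else -1
        if it >= burn_in and (it - burn_in) % thin == 0:
            samples.append(z.copy())
    return np.array(samples[:n_samples])
```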
We note that the Ising model is supported on $\{-1, 1\}^p$. A natural generalization is to extend its support to $[k]^p$ while maintaining pairwise correlations.
Definition 2.2. The p-variable pairwise graphical model is a distribution D(W, Θ) on [k]p that
satisfies
$$\Pr(Z = z) \propto \exp\Big(\sum_{1\le i\le j\le p} W_{i,j}(z_i, z_j) + \sum_{i\in[p]} \theta_i(z_i)\Big),$$
where $W = \{W_{i,j} \in \mathbb{R}^{k\times k} : i \ne j \in [p]\}$ is a set of weight matrices satisfying $W_{i,j} = W_{j,i}^T$, and
Now we introduce the definition of a δ-unbiased distribution and its properties; the proofs appear in [KM17].
Definition 2.5 (δ-unbiased). Let $S$ be the alphabet set, e.g., $S = \{1, -1\}$ for binary $t$-wise MRFs and $S = [k]$ for pairwise graphical models. A distribution $D$ on $S^p$ is δ-unbiased if for $Z \sim D$, $\forall i \in [p]$, and any assignment $x \in S^{p-1}$ to $Z_{-i}$, $\min_{z\in S} \Pr(Z_i = z \mid Z_{-i} = x) \ge \delta$.
Lemma 2.6. Let $D$ be a δ-unbiased distribution on $S^p$, with alphabet set $S$. For $X \sim D$, $\forall i \in [p]$, the distribution of $X_{-i}$ is also δ-unbiased.
The following lemmas provide δ-unbiased guarantees for various graphical models.
Lemma 2.7. Let D(W, Θ) be a pairwise graphical model with alphabet size k and width λ(W, Θ).
Then D(W, Θ) is δ-unbiased with δ = e−2λ(W,Θ) /k. In particular, an Ising model D(A, θ) is
e−2λ(A,θ) /2-unbiased.
Lemma 2.8. Let $D$ be a binary $t$-wise MRF with width $\lambda$. Then $D$ is δ-unbiased with $\delta = e^{-2\lambda}/2$.
Finally, we define two possible goals for learning graphical models. First, the easier goal is
structure learning, which involves recovering the set of non-zero edges.
Definition 2.9. An algorithm learns the structure of a graphical model if, given samples Z1 , . . . , Zn ∼
D, it outputs a graph Ĝ = (V, Ê) over V = [p] such that Ê = E, the set of edges in the dependency
graph of D.
The more difficult goal is parameter learning, which requires the algorithm to learn not only
the location of the edges, but also their parameter values.
Definition 2.10. An algorithm learns the parameters of an Ising model (resp. pairwise graphical
model) if, given samples Z1 , . . . , Zn ∼ D, it outputs a matrix  (resp. set of matrices Ŵ) such that
$\max_{i,j\in[p]} |A_{i,j} - \hat{A}_{i,j}| \le \alpha$ (resp. $|W_{i,j}(a, b) - \widehat{W}_{i,j}(a, b)| \le \alpha$, $\forall i \ne j \in [p]$, $\forall a, b \in [k]$).
Definition 2.11. An algorithm learns the parameters of a binary $t$-wise MRF with associated polynomial $h$ if, given samples $X^1, \ldots, X^n \sim D$, it outputs another multilinear polynomial $u$ such that for every maximal monomial $I \subseteq [p]$, $|\bar{h}(I) - \bar{u}(I)| \le \alpha$.
The second is concentrated differential privacy [DR16]. In this work, we specifically consider its refinement, zero-concentrated differential privacy (zCDP) [BS16].
Definition 2.13 (Concentrated Differential Privacy (zCDP) [BS16]). A randomized algorithm $M : \mathcal{X}^n \to S$ satisfies $\rho$-zCDP if for every pair of neighboring datasets $X, X' \in \mathcal{X}^n$,
$$\forall \alpha \in (1, \infty): \quad D_\alpha\big(M(X)\,\|\,M(X')\big) \le \rho\alpha,$$
where $D_\alpha$ denotes the Rényi divergence of order $\alpha$.
The following lemma quantifies the relationships between (ε, 0)-DP, ρ-zCDP and (ε, δ)-DP.
Roughly speaking, pure DP is stronger than zero-concentrated DP, which is stronger than
approximate DP.
A crucial property of all the variants of differential privacy is that they can be composed
adaptively. By adaptive composition, we mean a sequence of algorithms A1 (X), . . . , AT (X) where
the algorithm At (X) may also depend on the outcomes of the algorithms A1 (X), . . . , At−1 (X).
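For reference, the standard facts from [BS16] that we use below (via Lemmas 2.14 and 2.15) are: (i) an $\varepsilon$-DP algorithm satisfies $\frac{1}{2}\varepsilon^2$-zCDP; (ii) a $\rho$-zCDP algorithm satisfies $\big(\rho + 2\sqrt{\rho\log(1/\delta)}, \delta\big)$-DP for every $\delta > 0$; and (iii) the adaptive composition of a $\rho_1$-zCDP algorithm with a $\rho_2$-zCDP algorithm satisfies $(\rho_1 + \rho_2)$-zCDP.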
previously studied in [TTZ14]. Before stating their results, we need the following two definitions.
The first definition is regarding Lipschitz continuity.
The performance of the algorithm also depends on the “curvature” of the loss function, which
is defined below, based on the definition of [Cla10, Jag13]. A side remark is that this is a strictly
weaker constraint than smoothness [TTZ14].
Now we are able to introduce the algorithm and its theoretical guarantees.
Algorithm 1: $A_{PFW}(D, L, \rho, C)$: Private Frank-Wolfe Algorithm
Input: Data set $D = \{d_1, \cdots, d_n\}$; loss function $L(w; D) = \frac{1}{n}\sum_{j=1}^n \ell(w; d_j)$ (with Lipschitz constant $L_1$); privacy parameter $\rho$; convex set $C = \mathrm{conv}(S)$ with $\|C\|_1 := \max_{s\in S}\|s\|_1$; number of iterations $T$
1. Initialize $w_1$ to an arbitrary point in $C$
2. For $t = 1$ to $T - 1$:
3.   $\forall s \in S$, $\alpha_s \leftarrow \langle s, \nabla L(w_t; D)\rangle + \mathrm{Lap}\!\left(0, \frac{L_1\|C\|_1\sqrt{T}}{n\sqrt{\rho}}\right)$
4.   $\tilde{w}_t \leftarrow \arg\min_{s\in S}\alpha_s$
5.   $w_{t+1} \leftarrow (1 - \mu_t)w_t + \mu_t\tilde{w}_t$, where $\mu_t = \frac{2}{t+2}$
Output: $w^{priv} = w_T$
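To make the mechanism concrete, here is a minimal Python sketch of Algorithm 1 specialized to the case where $C$ is an $\ell_1$ ball, so the vertex set $S$ consists of the $2p$ points $\pm\|C\|_1 e_j$. The function name, the gradient-oracle interface, and the handling of constants are our own illustrative choices, not part of the formal algorithm or its analysis.

```python
import numpy as np


def private_frank_wolfe(grad_fn, n, p, rho, l1_radius, lipschitz, T, rng=None):
    """Sketch of A_PFW over the l1 ball C = {w : ||w||_1 <= l1_radius}.

    grad_fn(w) should return the gradient of the empirical loss L(w; D),
    averaged over the n data points; `lipschitz` is the constant L_1."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(p)  # an arbitrary starting point in C
    noise_scale = lipschitz * l1_radius * np.sqrt(T) / (n * np.sqrt(rho))
    for t in range(1, T):
        g = grad_fn(w)
        # The 2p vertices of the l1 ball are +/- l1_radius * e_j, so the noisy
        # scores <s, grad L> + Lap(noise_scale) can be computed coordinate-wise.
        scores = np.concatenate([l1_radius * g, -l1_radius * g])
        scores += rng.laplace(scale=noise_scale, size=2 * p)
        idx = int(np.argmin(scores))
        s = np.zeros(p)
        s[idx % p] = l1_radius if idx < p else -l1_radius
        mu = 2.0 / (t + 2.0)
        w = (1.0 - mu) * w + mu * s
    return w
```

For the sparse logistic regression of Corollary 3.4 below, grad_fn(w) would return $\frac{1}{n}\sum_m -y^m\,\sigma(-y^m\langle w, x^m\rangle)\,x^m$, the gradient of the empirical logistic loss.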
Lemma 3.3 (Theorem 5.5 from [TTZ14]). Algorithm 1 satisfies $\rho$-zCDP. Furthermore, let $L_1$, $\|C\|_1$ be defined as in Algorithm 1. Let $\Gamma_\ell$ be an upper bound on the curvature constant of the loss function $\ell(\cdot; d)$ for all $d$, and let $|S|$ be the number of extreme points of $S$. If we set
$$T = \frac{\Gamma_\ell^{2/3}(n\sqrt{\rho})^{2/3}}{(L_1\|C\|_1)^{2/3}},$$
then with probability at least $1 - \beta$ over the randomness of the algorithm,
$$L(w^{priv}; D) - L(w^{erm}; D) = O\!\left(\frac{\Gamma_\ell^{1/3}(L_1\|C\|_1)^{2/3}\log\left(\frac{n|S|}{\beta}\right)}{(n\sqrt{\rho})^{2/3}}\right).$$
Proof. The utility guarantee is proved in [TTZ14]. Therefore, it is enough to prove that the algorithm satisfies $\rho$-zCDP. According to the definition of the Laplace mechanism, every iteration of the algorithm satisfies $\big(\sqrt{\rho/T}, 0\big)$-DP, which naturally satisfies $\frac{\rho}{T}$-zCDP by Lemma 2.14. Then, by the composition theorem of zCDP (Lemma 2.15), Algorithm 1 satisfies $\rho$-zCDP.
If we consider the specific problem of sparse logistic regression, we will get the following corollary.
Corollary 3.4. If we consider the problem of sparse logistic regression, i.e., $L(w; D) = \frac{1}{n}\sum_{j=1}^n \log\big(1 + e^{-y^j\langle w, x^j\rangle}\big)$, with the constraint set $C = \{w : \|w\|_1 \le \lambda\}$, and we further assume that $\forall j$, $\|x^j\|_\infty \le 1$ and $y^j \in \{\pm1\}$, then setting $T = \lambda^{2/3}(n\sqrt{\rho})^{2/3}$, with probability at least $1 - \beta$ over the randomness of the algorithm,
$$L(w^{priv}; D) - L(w^{erm}; D) = O\!\left(\frac{\lambda^{4/3}\log\left(\frac{np}{\beta}\right)}{(n\sqrt{\rho})^{2/3}}\right).$$
Furthermore, the time complexity of the algorithm is $O(T\cdot np + p^2) = O\big(n^{2/3}\cdot np + p^2\big)$.
Proof. First we show that $L_1 \le 2$. If we fix a sample $d = (x, y)$, then for any $w_1, w_2 \in C$,
If we take $q = 1$, $r = +\infty$, then $\Gamma_\ell \le \alpha\lambda^2$, where
$$\alpha = \max_{w\in C,\,\|v\|_1 = 1}\left\|\nabla^2\ell(w; d)\cdot v\right\|_\infty \le \max_{i,j\in[p]}\left|\nabla^2\ell(w; d)_{i,j}\right|.$$
We have $\alpha \le 1$, since $\nabla^2\ell(w; d) = \sigma(\langle w, x\rangle)(1 - \sigma(\langle w, x\rangle))\cdot xx^T$ and $\|x\|_\infty \le 1$.
Finally, given $C = \{w : \|w\|_1 \le \lambda\}$, the number of extreme points of $S$ equals $2p$. By plugging all these parameters into Lemma 3.3, we have proved the loss guarantee in the corollary.
With respect to the time complexity, we note that the time complexity of each iteration is $O(np + p^2)$ and there are $T$ iterations in total.
Now if we further assume the data set D is drawn i.i.d. from some underlying distribution P ,
the following lemma from learning theory relates the true risk and the empirical risk, which shall
be heavily used in the following sections.
Theorem 3.6. If we consider the same problem setting and assumptions as in Corollary 3.4, and
we further assume that the training data set D is drawn i.i.d. from some unknown distribution P ,
then with probability at least 1 − β over the randomness of the algorithm and the training data set,
$$\mathbb{E}_{(X,Y)\sim P}\big[\ell(w^{priv}; (X, Y))\big] - \mathbb{E}_{(X,Y)\sim P}\big[\ell(w^*; (X, Y))\big] = O\!\left(\frac{\lambda^{4/3}\log\left(\frac{np}{\beta}\right)}{(n\sqrt{\rho})^{2/3}} + \lambda\sqrt{\frac{\log(1/\beta)}{n}}\right).$$
Now we need to bound each term. We first bound the first and last terms simultaneously. By the generalization error bound (Lemma 7 from [WSD19]), they are bounded by $O\big(\lambda\sqrt{\log(1/\beta)/n}\big)$ simultaneously. Corollary 3.4 bounds the second term. According to the definition of $w^{erm}$, the third term is at most 0. Therefore, by a union bound, with probability greater than $1 - \beta$,
$$\mathbb{E}_{(X,Y)\sim P}\big[\ell(w^{priv}; (X, Y))\big] - \mathbb{E}_{(X,Y)\sim P}\big[\ell(w^*; (X, Y))\big] = O\!\left(\frac{\lambda^{4/3}\log\left(\frac{np}{\beta}\right)}{(n\sqrt{\rho})^{2/3}} + \lambda\sqrt{\frac{\log(1/\beta)}{n}}\right).$$
An observation of the Ising model is that for any node $Z_i$, the probability of $Z_i = 1$ conditioned on the values of the remaining nodes $Z_{-i}$ is given by a sigmoid function. The next lemma, which comes from [KM17], formalizes this observation.
Lemma 3.7. Let $Z \sim D(A, \theta)$ with $Z \in \{-1, 1\}^p$. Then $\forall i \in [p]$, $\forall x \in \{-1, 1\}^{[p]\setminus\{i\}}$,
$$\Pr(Z_i = 1 \mid Z_{-i} = x) = \sigma\Big(\sum_{j\ne i} 2A_{i,j}x_j + 2\theta_i\Big) = \sigma\big(\langle w, x'\rangle\big).$$
Proof. The proof is from [KM17], and we include it here for completeness. According to the definition of the Ising model,
$$\Pr(Z_i = 1 \mid Z_{-i} = x) = \frac{\exp\Big(\sum_{j\ne i} A_{i,j}x_j + \sum_{j\ne i}\theta_j x_j + \theta_i\Big)}{\exp\Big(\sum_{j\ne i} A_{i,j}x_j + \sum_{j\ne i}\theta_j x_j + \theta_i\Big) + \exp\Big(\sum_{j\ne i}-A_{i,j}x_j + \sum_{j\ne i}\theta_j x_j - \theta_i\Big)} = \sigma\Big(\sum_{j\ne i} 2A_{i,j}x_j + 2\theta_i\Big).$$
By Lemma 3.7, we can estimate the weight matrix by solving a logistic regression for each node,
which is utilized in [WSD19] to design non-private estimators. Our algorithm uses the private Frank-
Wolfe method to solve the per-node logistic regression problem, achieving the following theoretical
guarantee.
Algorithm 2: Privately Learning Ising Models
Input: $n$ samples $\{z^1, \cdots, z^n\}$, where $z^m \in \{\pm1\}^p$ for $m \in [n]$; an upper bound $\lambda$ on the width, $\lambda(A, \theta) \le \lambda$; privacy parameter $\rho$
1. For $i = 1$ to $p$:
2.   $\forall m \in [n]$, $x^m \leftarrow [z^m_{-i}, 1]$, $y^m \leftarrow z^m_i$
3.   $w^{priv} \leftarrow A_{PFW}(D, L, \rho', C)$, where $\rho' = \frac{\rho}{p}$, $D = \{(x^m, y^m)\}_{m=1}^n$, $L(w; D) = \frac{1}{n}\sum_{m=1}^n \log\big(1 + e^{-y^m\langle w, x^m\rangle}\big)$, $C = \{\|w\|_1 \le 2\lambda\}$
4.   $\forall j \in [p]\setminus\{i\}$, $\hat{A}_{i,j} \leftarrow \frac{1}{2}w^{priv}_{\tilde{j}}$, where $\tilde{j} = j$ when $j < i$ and $\tilde{j} = j - 1$ when $j > i$
Output: $\hat{A} \in \mathbb{R}^{p\times p}$
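As a usage illustration, the per-node loop of Algorithm 2 amounts to $p$ calls of the private Frank-Wolfe routine on relabeled data. The sketch below reuses the illustrative private_frank_wolfe function from Section 3.1 and picks the iteration count as in Corollary 3.4; the names and constants are ours, not the paper's.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def learn_ising_privately(z, lam, rho):
    """Sketch of Algorithm 2: z is an (n, p) array with entries in {-1, +1}."""
    n, p = z.shape
    A_hat = np.zeros((p, p))
    for i in range(p):
        # Features: remaining coordinates plus a constant 1; label: node i.
        x = np.hstack([np.delete(z, i, axis=1), np.ones((n, 1))])
        y = z[:, i]

        def grad_fn(w, x=x, y=y):
            # Gradient of (1/n) sum_m log(1 + exp(-y^m <w, x^m>)).
            margins = y * (x @ w)
            return -(x * (y * sigmoid(-margins))[:, None]).mean(axis=0)

        # Iteration count as in Corollary 3.4 (constants illustrative).
        T = int(max(1, round((2 * lam) ** (2 / 3) * (n * np.sqrt(rho / p)) ** (2 / 3))))
        w_priv = private_frank_wolfe(grad_fn, n, p, rho / p, 2 * lam, 2.0, T)
        # First p-1 coordinates correspond to nodes j != i (in increasing order);
        # halve to undo the factor 2 from Lemma 3.7.
        A_hat[i, np.arange(p) != i] = 0.5 * w_priv[:p - 1]
    return A_hat
```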
Theorem 3.8. Let D(A, θ) be an unknown p-variable Ising model with λ(A, θ) ≤ λ. There exists
an efficient ρ-zCDP algorithm which outputs a weight matrix  ∈ Rp×p such that with probability
greater than 2/3, $\max_{i,j\in[p]} |A_{i,j} - \hat{A}_{i,j}| \le \alpha$, if the number of i.i.d. samples satisfies
$$n = \Omega\!\left(\frac{\lambda^2\log(p)e^{12\lambda}}{\alpha^4} + \frac{\sqrt{p}\,\lambda^2\log^2(p)e^{9\lambda}}{\sqrt{\rho}\,\alpha^3}\right).$$
Proof. We first prove that Algorithm 2 satisfies $\rho$-zCDP. Notice that in each iteration, the algorithm solves a private sparse logistic regression under $\frac{\rho}{p}$-zCDP. Therefore, Algorithm 2 satisfies $\rho$-zCDP by composition (Lemma 2.15).
For the accuracy analysis, we start by looking at the first iteration ($i = 1$) and showing that $|A_{1,j} - \hat{A}_{1,j}| \le \alpha$, $\forall j \in [p]$, with probability greater than $1 - \frac{1}{10p}$.
Given a random sample $Z \sim D(A, \theta)$, we let $X = [Z_{-1}, 1]$ and $Y = Z_1$. From Lemma 3.7, $\Pr(Y = 1 \mid X = x) = \sigma(\langle w^*, x\rangle)$, where $w^* = 2[A_{1,2}, \cdots, A_{1,p}, \theta_1]$. We also note that $\|w^*\|_1 \le 2\lambda$, as a consequence of the width constraint of the Ising model.
For any $n$ i.i.d. samples $\{z^m\}_{m=1}^n$ drawn from the Ising model, let $x^m = [z^m_{-1}, 1]$ and $y^m = z^m_1$; it is easy to check that each $(x^m, y^m)$ is a realization of $(X, Y)$. Let $w^{priv}$ be the out-
Lemma 3.10 (Fact 2 of [WSD19]). Let $Z \sim D(W, \Theta)$ with $Z \in [k]^p$. For any $i \in [p]$, any $u \ne v \in [k]$, and any $x \in [k]^{p-1}$,
$$\Pr(Z_i = u \mid Z_i \in \{u, v\}, Z_{-i} = x) = \sigma\Big(\sum_{j\ne i}\big(W_{i,j}(u, x_j) - W_{i,j}(v, x_j)\big) + \theta_i(u) - \theta_i(v)\Big).$$

²The assumption that $W_{i,j}$ is centered is without loss of generality and widely used in the literature [KM17, WSD19]. We present the argument here for completeness. Suppose the $a$-th row of $W_{i,j}$ is not centered, i.e., $\sum_b W_{i,j}(a, b) \ne 0$; we can define $W'_{i,j}(a, b) = W_{i,j}(a, b) - \frac{1}{k}\sum_b W_{i,j}(a, b)$ and $\theta'_i(a) = \theta_i(a) + \frac{1}{k}\sum_b W_{i,j}(a, b)$, and the probability distribution remains unchanged.
Now we introduce our algorithm. Without loss of generality, we consider estimating $W_{1,j}$ for all $j \in [p]$ as a running example. We fix a pair of values $(u, v)$, where $u, v \in [k]$ and $u \ne v$. Let $S_{u,v}$ be the set of samples where $Z_1 \in \{u, v\}$. In order to utilize Lemma 3.10, we perform the following transformation on the samples in $S_{u,v}$: for the $m$-th sample $z^m$, let $y^m = 1$ if $z^m_1 = u$, and $y^m = -1$ otherwise. Let $x^m$ be the one-hot encoding of the vector $[z^m_{-1}, 1]$, where OneHotEncode$(s)$ is a mapping from $[k]^p$ to $\mathbb{R}^{p\times k}$ whose $i$-th row is the $t$-th standard basis vector when $s_i = t$. Then we define $w^* \in \mathbb{R}^{p\times k}$ as follows:
$$w^*(j, \cdot) = W_{1,j+1}(u, \cdot) - W_{1,j+1}(v, \cdot), \quad \forall j \in [p-1];$$
$$w^*(p, \cdot) = [\theta_1(u) - \theta_1(v), 0, \cdots, 0].$$
Lemma 3.10 implies that $\forall t$, $\Pr(Y^t = 1) = \sigma(\langle w^*, X^t\rangle)$, where $\langle\cdot,\cdot\rangle$ denotes the entry-wise inner product of matrices. According to the definition of the width of $D(W, \Theta)$, $\|w^*\|_1 \le \lambda k$. Now we can apply the sparse logistic regression method of Algorithm 3 to the samples in $S_{u,v}$.
Suppose $w^{priv}_{u,v}$ is the output of the private Frank-Wolfe algorithm; we define $U_{u,v} \in \mathbb{R}^{p\times k}$ as follows: $\forall b \in [k]$,
$$U_{u,v}(j, b) = w^{priv}_{u,v}(j, b) - \frac{1}{k}\sum_{a\in[k]} w^{priv}_{u,v}(j, a), \quad \forall j \in [p-1];$$
$$U_{u,v}(p, b) = w^{priv}_{u,v}(p, b) + \frac{1}{k}\sum_{j\in[p-1]}\sum_{a\in[k]} w^{priv}_{u,v}(j, a). \tag{1}$$
$U_{u,v}$ can be seen as a “centered” version of $w^{priv}_{u,v}$ (for the first $p - 1$ rows). It is not hard to see that $\langle U_{u,v}, x\rangle = \langle w^{priv}_{u,v}, x\rangle$, so $U_{u,v}$ is also a minimizer of the sparse logistic regression.
For now, assume that $\forall j \in [p-1]$, $b \in [k]$, $U_{u,v}(j, b)$ is a “good” approximation of $W_{1,j+1}(u, b) - W_{1,j+1}(v, b)$, which we will show later. If we sum over $v \in [k]$, it can be shown that $\frac{1}{k}\sum_{v\in[k]} U_{u,v}(j, b)$ is also a “good” approximation of $W_{1,j+1}(u, b)$, for all $j \in [p-1]$ and $u, b \in [k]$, because of the centering assumption on $W$, i.e., $\forall j \in [p-1]$, $b \in [k]$, $\sum_{v\in[k]} W_{1,j+1}(v, b) = 0$. With these considerations in mind, we are able to introduce our algorithm.
Algorithm 3: Privately Learning Pairwise Graphical Model
Input: alphabet size $k$; $n$ i.i.d. samples $\{z^1, \cdots, z^n\}$, where $z^m \in [k]^p$ for $m \in [n]$; an upper bound $\lambda$ on the width, $\lambda(W, \Theta) \le \lambda$; privacy parameter $\rho$
1. For $i = 1$ to $p$:
2.   For each pair $u \ne v \in [k]$:
3.     $S_{u,v} \leftarrow \{z^m, m \in [n] : z^m_i \in \{u, v\}\}$
4.     $\forall z^m \in S_{u,v}$, $x^m \leftarrow \mathrm{OneHotEncode}([z^m_{-i}, 1])$; $y^m \leftarrow 1$ if $z^m_i = u$, $y^m \leftarrow -1$ if $z^m_i = v$
5.     $w^{priv}_{u,v} \leftarrow A_{PFW}(D, L, \rho', C)$, where $\rho' = \frac{\rho}{k^2 p}$, $D = \{(x^m, y^m) : z^m \in S_{u,v}\}$, $L(w; D) = \frac{1}{|S_{u,v}|}\sum_{m=1}^{|S_{u,v}|}\log\big(1 + e^{-y^m\langle w, x^m\rangle}\big)$, $C = \{\|w\|_1 \le 2\lambda k\}$
6.     Define $U_{u,v} \in \mathbb{R}^{p\times k}$ by centering the first $p - 1$ rows of $w^{priv}_{u,v}$, as in Equation (1)
7.   For $j \in [p]\setminus\{i\}$ and $u \in [k]$:
8.     $\widehat{W}_{i,j}(u, :) \leftarrow \frac{1}{k}\sum_{v\in[k]} U_{u,v}(\tilde{j}, :)$, where $\tilde{j} = j$ when $j < i$ and $\tilde{j} = j - 1$ when $j > i$
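For concreteness, a small sketch of the OneHotEncode map and of the construction of the restricted data set $S_{u,v}$ used in steps 3 and 4; the function names and the convention that the appended constant coordinate is encoded as the value 1 are our own.

```python
import numpy as np


def one_hot_encode(s, k):
    """Map s in [k]^p (1-indexed values) to a p-by-k 0/1 matrix whose i-th row
    is the standard basis vector e_{s_i}."""
    s = np.asarray(s)
    x = np.zeros((len(s), k))
    x[np.arange(len(s)), s - 1] = 1.0
    return x


def pair_restricted_dataset(z, i, u, v, k):
    """Build S_{u,v} for node i: keep the samples with z_i in {u, v}, relabel
    them as +/-1, and one-hot encode [z_{-i}, 1]."""
    keep = np.isin(z[:, i], [u, v])
    xs, ys = [], []
    for row in z[keep]:
        ys.append(1 if row[i] == u else -1)
        reduced = np.append(np.delete(row, i), 1)  # [z_{-i}, 1]
        xs.append(one_hot_encode(reduced, k))
    return np.array(xs), np.array(ys)
```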
The following theorem is the main result of this section. Its proof is structurally similar to that
of Theorem 3.8.
Theorem 3.11. Let $D(W, \Theta)$ be an unknown $p$-variable pairwise graphical model, and suppose that $D(W, \Theta)$ has width $\lambda(W, \Theta) \le \lambda$. There exists an efficient $\rho$-zCDP algorithm which outputs $\widehat{W}$ such that with probability greater than 2/3, $|W_{i,j}(u, v) - \widehat{W}_{i,j}(u, v)| \le \alpha$, $\forall i \ne j \in [p]$, $\forall u, v \in [k]$, if the number of i.i.d. samples satisfies
$$n = \Omega\!\left(\frac{\lambda^2 k^5\log(pk)e^{O(\lambda)}}{\alpha^4} + \frac{\sqrt{p}\,\lambda^2 k^{5.5}\log^2(pk)e^{O(\lambda)}}{\sqrt{\rho}\,\alpha^3}\right).$$
Proof. We consider estimating $W_{1,j}$ for all $j \in [p]$ as an example. Fixing one pair $(u, v)$, let $S_{u,v}$ be the set of samples whose first element is either $u$ or $v$, and let $n_{u,v}$ be the number of samples in $S_{u,v}$. We perform the following transformation on the samples in $S_{u,v}$: for the sample $Z$, let $Y = 1$ if $Z_1 = u$, else $Y = -1$, and let $X$ be the one-hot encoding of the vector $[Z_{-1}, 1]$.
Suppose the underlying joint distribution of $X$ and $Y$ is $P$, i.e., $(X, Y) \sim P$. Then by Theorem 3.6, when
$$n_{u,v} = O\!\left(\frac{\lambda^2 k^2\log^2(pk)}{\gamma^2} + \frac{\sqrt{p}\,\lambda^2 k^3\log^2(pk)}{\gamma^{3/2}\sqrt{\rho}}\right),$$
with probability greater than $1 - \frac{1}{10pk^2}$,
$$\mathbb{E}_{(X,Y)\sim P}\big[\ell(U_{u,v}; (X, Y))\big] - \mathbb{E}_{(X,Y)\sim P}\big[\ell(w^*; (X, Y))\big] \le \gamma.$$
The following lemma, which is analogous to Lemma 3.9 for the Ising model, appears in [WSD19].
Lemma 3.12. Let $D$ be a δ-unbiased distribution on $[k]^{p-1}$. For $Z \sim D$, let $X$ denote the one-hot encoding of $Z$. Let $u_1, u_2 \in \mathbb{R}^{(p-1)\times k}$ be two matrices where $\sum_a u_1(i, a) = 0$ and $\sum_a u_2(i, a) = 0$ for all $i \in [p-1]$. Let $P$ be a distribution such that, given $u_1$ and $\theta_1 \in \mathbb{R}$, $\Pr(Y = 1 \mid X = x) = \sigma(\langle u_1, x\rangle + \theta_1)$ for $(X, Y) \sim P$. Suppose
$$\mathbb{E}_{(X,Y)\sim P}\big[\log\big(1 + e^{-Y(\langle u_1, X\rangle + \theta_1)}\big)\big] - \mathbb{E}_{(X,Y)\sim P}\big[\log\big(1 + e^{-Y(\langle u_2, X\rangle + \theta_2)}\big)\big] \le \gamma$$
for $u_2 \in \mathbb{R}^{(p-1)\times k}$, $\theta_2 \in \mathbb{R}$, and $\gamma \le \delta e^{-2\|u_1\|_{\infty,1} - 2\|\theta_1\|_1 - 6}$. Then $\|u_1 - u_2\|_\infty = O\big(e^{\|u_1\|_{\infty,1} + \|\theta_1\|_1}\cdot\sqrt{\gamma/\delta}\big)$.
By Lemma 2.6, Lemma 2.7 and Lemma 3.12, if we substitute $\gamma = \frac{e^{-6\lambda}\alpha^2}{k}$, then when
$$n_{u,v} = O\!\left(\frac{\lambda^2 k^4\log(pk)e^{O(\lambda)}}{\alpha^4} + \frac{\sqrt{p}\,\lambda^2 k^{4.5}\log^2(pk)e^{O(\lambda)}}{\sqrt{\rho}\,\alpha^3}\right),$$
$$|W_{1,j}(u, b) - W_{1,j}(v, b) - U_{u,v}(j, b)| \le \alpha, \quad \forall j \in [p-1], \forall b \in [k]. \tag{2}$$
By a union bound, Equation (2) holds for all $(u, v)$ pairs simultaneously with probability greater than $1 - \frac{1}{10p}$. If we sum over $v \in [k]$ and use the fact that $\forall j, b$, $\sum_{v\in[k]} W_{1,j}(v, b) = 0$, we have
$$\Big|W_{1,j}(u, b) - \frac{1}{k}\sum_{v\in[k]} U_{u,v}(j, b)\Big| \le \alpha, \quad \forall j \in [p-1], \forall u, b \in [k].$$
Note that we need to guarantee that we obtain $n_{u,v}$ samples for each pair $(u, v)$. Since $D(W, \Theta)$ is δ-unbiased, given $Z \sim D(W, \Theta)$, for all $u \ne v$, $\Pr(Z \in S_{u,v}) \ge 2\delta$. By Hoeffding's inequality, when
$$n = O\!\left(\frac{n_{u,v}}{\delta} + \frac{\log(pk^2)}{\delta^2}\right),$$
with probability greater than $1 - \frac{1}{10p}$, we have enough samples for all $(u, v)$ pairs simultaneously. Substituting $\delta = \frac{e^{-6\lambda}}{k}$, we have
$$n = O\!\left(\frac{\lambda^2 k^5\log(pk)e^{O(\lambda)}}{\alpha^4} + \frac{\sqrt{p}\,\lambda^2 k^{5.5}\log^2(pk)e^{O(\lambda)}}{\sqrt{\rho}\,\alpha^3}\right).$$
The same argument holds for other entries of the matrix. We conclude the proof by a union
bound over p iterations.
Finally, we note that the time complexity of the algorithm is poly(n, p), since the private Frank-Wolfe algorithm is time efficient by Corollary 3.4.
2. find a multilinear polynomial $u$ such that with probability greater than 2/3, for every maximal monomial $I$ of $h$, $|\bar{h}(I) - \bar{u}(I)| \le \alpha$.
We note that our first objective can be viewed as parameter estimation in ℓ1 distance, where
only an average performance guarantee is provided. In the second objective, the algorithm recovers
every maximal monomial, which can be viewed as parameter estimation in ℓ∞ distance. These two
objectives are addressed in Sections 4.1 and 4.2, respectively.
Lemma 4.1 shows that, similar to pairwise graphical models, it also suffices to learn the param-
eters of binary t-wise MRF using sparse logistic regression.
Algorithm 4: Privately Learning a Binary $t$-wise MRF in $\ell_1$ Distance
Input: $n$ i.i.d. samples $\{z^1, \cdots, z^n\}$, where $z^m \in \{\pm1\}^p$ for $m \in [n]$; an upper bound $\lambda$ on $\max_{i\in[p]}\|\partial_i h\|_1$; privacy parameter $\rho$
1. For $i = 1$ to $p$:
2.   $\forall m \in [n]$, $x^m \leftarrow \big[\prod_{j\in I} z^m_j : I \subset [p]\setminus\{i\}, |I| \le t - 1\big]$, $y^m \leftarrow z^m_i$
3.   $w^{priv} \leftarrow A(D, L, \rho', C)$, where $D = \{(x^m, y^m)\}_{m=1}^n$, $\ell(w; d) = \log\big(1 + e^{-y\langle w, x\rangle}\big)$, $C = \{\|w\|_1 \le 2\lambda\}$, and $\rho' = \frac{\rho}{p}$
4.   For $I \subset [p]\setminus\{i\}$ with $|I| \le t - 1$:
5.     $\bar{u}(I \cup \{i\}) \leftarrow \frac{1}{2}w^{priv}(I)$, when $\arg\min(I \cup \{i\}) = i$
Output: $\{\bar{u}(I) : I \in C_t(K_p)\}$, where $K_p$ is the complete graph on $p$ vertices
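A short sketch of the feature map in step 2: for node $i$, the feature vector collects the parities $\prod_{j\in I} z_j$ over all subsets $I \subset [p]\setminus\{i\}$ of size at most $t - 1$ (the empty set contributing the constant 1). The function name and the subset ordering are illustrative choices.

```python
from itertools import combinations

import numpy as np


def monomial_features(z_row, i, t):
    """Feature vector [prod_{j in I} z_j : I subset of [p]\\{i}, |I| <= t-1]
    for one sample z_row in {-1, +1}^p, enumerated by subset size and then
    lexicographically; also returns the subsets themselves so coefficients
    can be mapped back to monomials."""
    p = len(z_row)
    others = [j for j in range(p) if j != i]
    feats, index = [], []
    for size in range(t):  # sizes 0, 1, ..., t-1; size 0 is the constant 1
        for subset in combinations(others, size):
            feats.append(np.prod([z_row[j] for j in subset]) if subset else 1.0)
            index.append(subset)
    return np.array(feats), index
```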
Theorem 4.2. There exists a ρ-zCDP algorithm which, with probability at least 2/3, finds a mul-
tilinear polynomial u such that kh − uk1 ≤ α, given n i.i.d. samples Z 1 , · · · , Z n ∼ D, where
$$n = O\!\left(\frac{(2t)^{O(t)}e^{O(\lambda t)}\cdot p^{4t}\cdot\log(p)}{\alpha^4} + \frac{(2t)^{O(t)}e^{O(\lambda t)}\cdot p^{3t+\frac{1}{2}}\cdot\log^2(p)}{\sqrt{\rho}\,\alpha^3}\right).$$
Proof. Similar to the previous proof, we start by fixing $i = 1$. Given a random sample $Z \sim D$, let $X = \big[\prod_{j\in I} Z_j : I \subset [p]\setminus\{1\}, |I| \le t - 1\big]$ and $Y = Z_1$. According to Lemma 4.1, we know that $\mathbb{E}[Y \mid X] = \sigma(\langle w^*, X\rangle)$, where $w^* = 2\cdot\big[\partial_1 h(I) : I \subset [p]\setminus\{1\}, |I| \le t - 1\big]$. Furthermore, $\|w^*\|_1 \le 2\lambda$ by the width constraint. Now, given $n$ i.i.d. samples $\{z^m\}_{m=1}^n$ drawn from $D$, it is easy to check that for any given $z^m$, its corresponding $(x^m, y^m)$ is one realization of $(X, Y)$. Let $w^{priv}$ be the output of $A\big(D, L, \frac{\rho}{p}, \{w : \|w\|_1 \le 2\lambda\}\big)$, where $D = \{(x^m, y^m)\}_{m=1}^n$ and $\ell(w; (x, y)) = \log\big(1 + e^{-y\langle w, x\rangle}\big)$. By Theorem 3.6, $\mathbb{E}_{Z\sim D}\big[\ell(w^{priv}; (X, Y))\big] - \mathbb{E}_{Z\sim D}\big[\ell(w^*; (X, Y))\big] \le \gamma$ with probability greater than $1 - \frac{1}{10p}$, assuming
$$n = \Omega\!\left(\frac{\sqrt{p}\,\lambda^2\log^2(p)}{\sqrt{\rho}\,\gamma^{3/2}} + \frac{\lambda^2\log(p)}{\gamma^2}\right).$$
Now we need the following lemma from [KM17], which is analogous to Lemma 3.7 for the Ising model.
Lemma 4.3 (Lemma 6.4 of [KM17]). Let $P$ be a distribution on $\{-1, 1\}^{p-1}\times\{-1, 1\}$. Given a multilinear polynomial $u_1 \in \mathbb{R}^{p-1}$, suppose $\Pr(Y = 1 \mid X = x) = \sigma(u_1(X))$ for $(X, Y) \sim P$. Suppose the marginal distribution of $P$ on $X$ is δ-unbiased, and
$$\mathbb{E}_{(X,Y)\sim P}\big[\log\big(1 + e^{-Y u_1(X)}\big)\big] - \mathbb{E}_{(X,Y)\sim P}\big[\log\big(1 + e^{-Y u_2(X)}\big)\big] \le \gamma$$
for another multilinear polynomial $u_2$, where $\gamma \le \delta^t e^{-2\|u_1\|_1 - 6}$. Then $\|u_1 - u_2\|_1 = O\big((2t)^t e^{\|u_1\|_1}\cdot\sqrt{\gamma/\delta^t}\cdot p^t\big)$.
By substituting $\gamma = e^{-O(\lambda t)}\cdot(2t)^{-O(t)}\cdot p^{-3t}\cdot\alpha^2$, we have that with probability greater than $1 - \frac{1}{10p}$, $\sum_{I:\,\arg\min I = 1} |\bar{u}(I) - h(I)| \le \frac{\alpha}{p}$. We note that the coefficients of different monomials are recovered in each iteration. Therefore, by a union bound over $p$ iterations, we prove the desired result.
We first show that if the estimates Q̂ for the parity queries Q are sufficiently accurate, Algo-
rithm 5 solves the ℓ∞ estimation problem, as long as the sample size n1 is large enough.
Lemma 4.4. Suppose that the estimates $\hat{Q}$ satisfy $|\hat{Q}(I) - Q(I)| \le \alpha/(8\lambda)$ for all $I \subset [p]$ such that $|I| \le t$, and $n_2 = \Omega(\lambda^2 t\log(p)/\alpha^2)$. Then with probability at least 3/4, Algorithm 5 outputs a multilinear polynomial $u$ such that for every maximal monomial $I$ of $h$, $|\bar{h}(I) - \bar{u}(I)| \le \alpha$, given $n$ i.i.d. samples $Z^1, \cdots, Z^n \sim D$, as long as
$$n_1 = \Omega\!\left(\frac{e^{5\lambda t}\cdot\sqrt{p}\log^2(p)}{\sqrt{\rho}\,\alpha^{9/2}} + \frac{e^{6\lambda t}\cdot\log(p)}{\alpha^6}\right).$$
Proof. We will condition on the event that $\hat{Q}$ is a “good” estimate of $Q$: $|\hat{Q}(I) - Q(I)| \le \alpha/(8\lambda)$ for all $I \subset [p]$ such that $|I| \le t$. Let us fix $i = 1$. Let $X = \big[\prod_{j\in I} Z_j : I \subset [p]\setminus\{1\}, |I| \le t - 1\big]$ and $Y = Z_1$, and we know that $\mathbb{E}[Y \mid X] = \sigma(\langle w^*, X\rangle)$, where $w^* = 2\cdot\big[\partial_1 h(I) : I \subset [p]\setminus\{1\}, |I| \le t - 1\big]$. Now, given $n_1$ i.i.d. samples $\{z^m\}_{m=1}^{n_1}$ drawn from $D$, let $w^{priv}$ be the output of $A\big(D, L, \frac{\rho}{p}, \{w : \|w\|_1 \le 2\lambda\}\big)$, where $D = \{(x^m, y^m)\}_{m=1}^{n_1}$ and $\ell(w; (x, y)) = \log\big(1 + e^{-y\langle w, x\rangle}\big)$. Similarly, with probability at least $1 - \frac{1}{10p}$,
$$\mathbb{E}_{Z\sim D}\big[\ell(w^{priv}; (X, Y))\big] - \mathbb{E}_{Z\sim D}\big[\ell(w^*; (X, Y))\big] \le \gamma$$
as long as
$$n_1 = \Omega\!\left(\frac{\sqrt{p}\,\lambda^2\log^2(p)}{\sqrt{\rho}\,\gamma^{3/2}} + \frac{\lambda^2\log(p)}{\gamma^2}\right).$$
Now we utilize Lemma 6.4 from [KM17], which states that if $\mathbb{E}_{Z\sim D}\big[\ell(w^{priv}; (X, Y))\big] - \mathbb{E}_{Z\sim D}\big[\ell(w^*; (X, Y))\big] \le \gamma$, then given a random sample $X$, for any maximal monomial $I \subset [p]\setminus\{1\}$ of $\partial_1 h$,
$$\Pr\left(\big|\partial_1 h(I) - \partial_I v_1(X)\big| \ge \frac{\alpha}{4}\right) < O\!\left(\frac{\gamma\cdot e^{3\lambda t}}{\alpha^2}\right).$$
By replacing $\gamma = \frac{e^{-3\lambda t}\cdot\alpha^3}{8\lambda}$, we have $\Pr\big(|\partial_1 h(I) - \partial_I v_1(X)| \ge \frac{\alpha}{4}\big) < \frac{\alpha}{8\lambda}$, as long as
$$n_1 = \Omega\!\left(\frac{\sqrt{p}\,e^{5\lambda t}\log^2(p)}{\sqrt{\rho}\,\alpha^{9/2}} + \frac{e^{6\lambda t}\log(p)}{\alpha^6}\right).$$
Accordingly, for any maximal monomial $I$, $\big|\mathbb{E}[\partial_I v_1(X)] - \partial_1 h(I)\big| \le \mathbb{E}\big[|\partial_I v_1(X) - \partial_1 h(I)|\big] \le \frac{\alpha}{4} + 2\lambda\cdot\frac{\alpha}{8\lambda} = \frac{\alpha}{2}$. By Hoeffding's inequality, given $n_2 = \Omega\big(\frac{\lambda^2 t\log p}{\alpha^2}\big)$, for each maximal monomial $I$, with probability greater than $1 - \frac{1}{p^t}$, $\big|\frac{1}{n_2}\sum_{m=1}^{n_2}\partial_I v_1(X_m) - \mathbb{E}[\partial_I v_1(X)]\big| \le \frac{\alpha}{4}$. Note that $|Q(I) - \hat{Q}(I)| \le \frac{\alpha}{8\lambda}$; then $\big|\frac{1}{n_2}\sum_{m=1}^{n_2}\partial_I v_1(X_m) - \sum_{I'\subset[p]}\partial_I v_1(I')\cdot\hat{Q}(I')\big| \le \frac{\alpha}{8}$. Therefore,
$$\Big|\sum_{I'\subset[p]}\partial_I v_1(I')\cdot\hat{Q}(I') - \partial_1 h(I)\Big| \le \Big|\sum_{I'\subset[p]}\partial_I v_1(I')\cdot\hat{Q}(I') - \frac{1}{n_2}\sum_{m=1}^{n_2}\partial_I v_1(X_m)\Big| + \Big|\frac{1}{n_2}\sum_{m=1}^{n_2}\partial_I v_1(X_m) - \mathbb{E}[\partial_I v_1(X)]\Big| + \Big|\mathbb{E}[\partial_I v_1(X)] - \partial_1 h(I)\Big| \le \frac{\alpha}{8} + \frac{\alpha}{4} + \frac{\alpha}{2} = \frac{7\alpha}{8}.$$
Finally, by a union bound over $p$ iterations and all the maximal monomials, we prove the desired result.
We now consider two private algorithms for releasing the parity queries. The first algorithm is
called Private Multiplicative Weights (PMW) [HR10], which provides a better accuracy guarantee
but runs in time exponential in the dimension p. The following theorem can be viewed as a zCDP
version of Theorem 4.3 in [Vad17], by noting that during the analysis, every iteration satisfies ε0 -DP,
which naturally satisfies $\varepsilon_0^2$-zCDP, and by replacing the strong composition theorem of $(\varepsilon, \delta)$-DP by the composition theorem of zCDP (Lemma 2.15).
Lemma 4.5 (Sample complexity of PMW, modification of Theorem 4.3 of [Vad17]). The PMW algorithm satisfies $\rho$-zCDP and releases $\hat{Q}$ such that with probability greater than $\frac{19}{20}$, for all $I \subset [p]$ with $|I| \le t$, $|\hat{Q}(I) - Q(I)| \le \frac{\alpha}{8\lambda}$, as long as the size of the data set satisfies
$$n_2 = \Omega\!\left(\frac{t\lambda^2\cdot\sqrt{p}\log p}{\sqrt{\rho}\,\alpha^2}\right).$$
The second algorithm sepFEM (Separator-Follow-the-perturbed-leader with exponential mecha-
nism) has slightly worse sample complexity, but runs in polynomial time when it has access to an op-
timization oracle O that does the following: given as input a weighted dataset (I1 , w1 ), . . . , (Im , wm ) ∈
$2^{[p]}\times\mathbb{R}$, find $x \in \{\pm1\}^p$ attaining
$$\max_{x\in\{\pm1\}^p}\sum_{i=1}^m w_i\prod_{j\in I_i} x_j.$$
The oracle O essentially solves cost-sensitive classification problems over the set of parity func-
tions [ZLA03], and it can be implemented with an integer program solver [VTB+ 19, GAH+ 14].
Lemma 4.6 (Sample complexity of sepFEM, [VTB+19]). The sepFEM algorithm satisfies $\rho$-zCDP and releases $\hat{Q}$ such that with probability greater than $\frac{19}{20}$, for all $I \subset [p]$ with $|I| \le t$, $|\hat{Q}(I) - Q(I)| \le \frac{\alpha}{8\lambda}$, as long as the size of the data set satisfies
$$n_2 = \Omega\!\left(\frac{t\lambda^2\cdot p^{5/4}\log p}{\sqrt{\rho}\,\alpha^2}\right).$$
The algorithm runs in polynomial time given access to the optimization oracle O defined above.
Now we can combine Lemmas 4.4, 4.5, and 4.6 to state the formal guarantee of Algorithm 5.
Theorem 4.7. Algorithm 5 is a ρ-zCDP algorithm which, with probability at least 2/3, finds a
multilinear polynomial $u$ such that for every maximal monomial $I$ of $h$, $|\bar{h}(I) - \bar{u}(I)| \le \alpha$, given $n$ i.i.d. samples $Z^1, \cdots, Z^n \sim D$, and
1. if it uses PMW for releasing $\hat{Q}$, it has a sample complexity of
$$n = O\!\left(\frac{e^{5\lambda t}\cdot\sqrt{p}\log^2(p)}{\sqrt{\rho}\,\alpha^{9/2}} + \frac{t\lambda^2\cdot\sqrt{p}\log p}{\sqrt{\rho}\,\alpha^2} + \frac{e^{6\lambda t}\cdot\log(p)}{\alpha^6}\right);$$
5 Lower Bounds for Parameter Learning
The lower bound for parameter estimation is based on mean estimation in ℓ∞ distance.
Theorem 5.1. Suppose $\mathcal{A}$ is an $(\varepsilon, \delta)$-differentially private algorithm that takes $n$ i.i.d. samples $Z^1, \ldots, Z^n$ drawn from any unknown $p$-variable Ising model $D(A, \theta)$ and outputs $\hat{A}$ such that $\mathbb{E}\big[\max_{i,j\in[p]} |A_{i,j} - \hat{A}_{i,j}|\big] \le \alpha \le 1/50$. Then $n = \Omega\!\left(\frac{\sqrt{p}}{\alpha\varepsilon}\right)$.
Proof. Consider an Ising model $D(A, 0)$ with $A \in \mathbb{R}^{p\times p}$ defined as follows: for $i \in [\frac{p}{2}]$, $A_{2i-1,2i} = A_{2i,2i-1} = \eta_i \in [-\ln(2), \ln(2)]$, and $A_{l,l'} = 0$ for all other pairs $(l, l')$. This construction divides the $p$ nodes into $\frac{p}{2}$ pairs, where there is no correlation between nodes belonging to different pairs. It follows that
$$\Pr(Z_{2i-1} = 1, Z_{2i} = 1) = \Pr(Z_{2i-1} = -1, Z_{2i} = -1) = \frac{1}{2}\cdot\frac{e^{\eta_i}}{e^{\eta_i}+1},$$
$$\Pr(Z_{2i-1} = 1, Z_{2i} = -1) = \Pr(Z_{2i-1} = -1, Z_{2i} = 1) = \frac{1}{2}\cdot\frac{1}{e^{\eta_i}+1}.$$
For each observation $Z$, we obtain an observation $X \in \{\pm1\}^{p/2}$ such that $X_i = Z_{2i-1}Z_{2i}$. Then each observation $X$ is distributed according to a product distribution on $\{\pm1\}^{p/2}$ such that the mean of coordinate $i$ is $(e^{\eta_i} - 1)/(e^{\eta_i} + 1) \in [-1/3, 1/3]$.
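As a quick sanity check on this range: the map $\eta \mapsto (e^{\eta} - 1)/(e^{\eta} + 1) = \tanh(\eta/2)$ is increasing, so it suffices to evaluate the endpoints $\eta_i = \pm\ln(2)$:
$$\frac{e^{-\ln 2} - 1}{e^{-\ln 2} + 1} = \frac{1/2 - 1}{1/2 + 1} = -\frac{1}{3}, \qquad \frac{e^{\ln 2} - 1}{e^{\ln 2} + 1} = \frac{2 - 1}{2 + 1} = \frac{1}{3}.$$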
Suppose that an $(\varepsilon, \delta)$-differentially private algorithm takes $n$ observations drawn from any such Ising model distribution and outputs a matrix $\hat{A}$ such that $\mathbb{E}\big[\max_{i,j\in[p]} |A_{i,j} - \hat{A}_{i,j}|\big] \le \alpha$. Let $\hat{\eta}_i = \min\{\max\{\hat{A}_{2i-1,2i}, -\ln(2)\}, \ln(2)\}$ be the value of $\hat{A}_{2i-1,2i}$ rounded into the range $[-\ln(2), \ln(2)]$, and so $|\eta_i - \hat{\eta}_i| \le \alpha$. It follows that
where the last step follows from the fact that $e^a \le 1 + 2a$ for any $a \in [0, 1]$. Thus, such a private algorithm can also estimate the mean of the product distribution accurately:
$$\mathbb{E}\left[\sum_{i=1}^{p/2}\left(\frac{e^{\eta_i} - 1}{e^{\eta_i} + 1} - \frac{e^{\hat{\eta}_i} - 1}{e^{\hat{\eta}_i} + 1}\right)^2\right] \le 32p\alpha^2.$$
Now we will use the following sample complexity lower bound on private mean estimation on
product distributions.
Lemma 5.2 (Lemma 6.2 of [KLSU19]). If M : {±1}n×d → [−1/3, 1/3]d is (ε, 3/(64n))-differentially
private, and for every product distribution P over {±1}d such that the mean of each coordinate µj
satisfies −1/3 ≤ µj ≤ 1/3,
$$\mathbb{E}_{X\sim P^n}\left[\|M(X) - \mu\|_2^2\right] \le \gamma^2 \le \frac{d}{54},$$
then $n \ge d/(72\gamma\varepsilon)$.
Then our stated bound follows by instantiating $\gamma^2 = 32p\alpha^2$ and $d = p/2$ in Lemma 5.2.
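Spelling out the instantiation for concreteness: with $d = p/2$ and $\gamma = \alpha\sqrt{32p}$, Lemma 5.2 gives
$$n \ge \frac{d}{72\gamma\varepsilon} = \frac{p/2}{72\sqrt{32p}\,\alpha\varepsilon} = \frac{\sqrt{p}}{144\sqrt{32}\,\alpha\varepsilon} = \Omega\!\left(\frac{\sqrt{p}}{\alpha\varepsilon}\right).$$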
6 Structure Learning of Graphical Models
In this section, we will give an (ε, δ)-differentially private algorithm for learning the structure of a
Markov Random Field. The dependence on the dimension p will be only logarithmic, in comparison
to the complexity of privately learning the parameters. As we have shown in Section 5, this de-
pendence is necessarily polynomial in p, even under approximate differential privacy. Furthermore,
as we will show in Section 7, if we wish to learn the structure of an MRF under more restrictive
notions of privacy (such as pure or concentrated), the complexity also becomes polynomial in p.
Thus, in very high-dimensional settings, learning the structure of the MRF under approximate
differential privacy is essentially the only notion of private learnability which is tractable.
The following lemma is immediate from stability-based mode arguments (see, e.g., Proposition
3.4 of [Vad17]).
Lemma 6.1. Suppose there exists a (non-private) algorithm which takes $X = (X^1, \ldots, X^n)$ sampled i.i.d. from some distribution $D$, and outputs some fixed value $Y$ (which may depend on $D$) with probability at least 2/3. Then there exists an $(\varepsilon, \delta)$-differentially private algorithm which takes $O\!\left(\frac{n\log(1/\delta)}{\varepsilon}\right)$ samples and outputs $Y$ with probability at least $1 - \delta$.
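One way to picture this reduction is the following Python sketch: split the data into disjoint batches, run the non-private learner on each batch, and release the most frequent output only if a noisy count of its multiplicity clears a threshold, in the spirit of the stability-based histograms of [KKMN09, BNSV15]. The function nonprivate_learner is an assumed black box, and the batch count and threshold constants are illustrative rather than the ones used in the formal analysis.

```python
import math
from collections import Counter

import numpy as np


def private_structure_via_stability(samples, n, eps, delta, nonprivate_learner):
    """Sketch of the stability-based reduction behind Lemma 6.1.

    `nonprivate_learner` takes n i.i.d. samples and returns a hashable encoding
    of its output (e.g. a frozenset of edges), correct with probability >= 2/3.
    """
    # O(log(1/delta)/eps) disjoint batches of size n each (constant illustrative).
    k = math.ceil(48 * math.log(2.0 / delta) / eps)
    if len(samples) < k * n:
        raise ValueError("need O(n log(1/delta)/eps) samples")

    # Run the non-private learner independently on each disjoint batch.
    outputs = [nonprivate_learner(samples[b * n:(b + 1) * n]) for b in range(k)]

    # Stability: changing one input sample changes at most one batch, hence at
    # most one of the k outputs.  Release the most common output only if its
    # noisy count is comfortably above k/2.
    mode, count = Counter(outputs).most_common(1)[0]
    noisy_count = count + np.random.laplace(scale=2.0 / eps)
    threshold = k / 2.0 + 2.0 * math.log(1.0 / delta) / eps  # illustrative choice
    return mode if noisy_count > threshold else None
```

Corollary 6.3 below then corresponds, roughly, to plugging the non-private structure learner of Theorem 6.2 into such a wrapper.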
Theorem 6.2 ([WSD19]). There exists an algorithm which, with probability at least 2/3, learns the structure of a pairwise graphical model. It requires $n = O\!\left(\frac{\lambda^2 k^4 e^{14\lambda}\log(pk)}{\eta^4}\right)$ samples.
Corollary 6.3. There exists an $(\varepsilon, \delta)$-differentially private algorithm which, with probability at least 2/3, learns the structure of a pairwise graphical model. It requires $n = O\!\left(\frac{\lambda^2 k^4 e^{14\lambda}\log(pk)\log(1/\delta)}{\varepsilon\eta^4}\right)$ samples.
For binary MRFs of higher-order, we instead import the following theorem from [KM17]:
Corollary 6.5. There exists an (ε, δ)-differentially private algorithm which, with probability at
least 2/3, learns the structure of a binary t-wise MRF. It requires
$$n = O\!\left(\frac{e^{O(\lambda t)}\log\left(\frac{p}{\eta}\right)\log(1/\delta)}{\varepsilon\eta^4}\right)$$
samples.
The Ising model is a special case of binary t-wise MRFs, corresponding to t = 2. We will show that
under (ε, 0)-DP or ρ-zCDP, a polynomial dependence on the dimension is unavoidable in the sample
complexity.
In Section 7.1, we assume that our samples are generated from an Ising model. In Section 7.2,
we extend our lower bounds to pairwise graphical models.
Proof. Our lower bound argument is in two steps. The first step is to construct a set of distributions, consisting of $2^{p/2}$ different Ising models, such that any feasible structure learning algorithm should output different answers for different distributions. In the second step, we utilize the probabilistic packing argument for DP [ASZ20b], or the packing argument for zCDP [BS16], to get the desired lower bound.
To start, we would like to use the following binary code to construct the distribution set. Let $C = \{0, 1\}^{p/2}$; given $c \in C$, we construct the corresponding distribution $D(A^c, 0)$ with $A^c \in \mathbb{R}^{p\times p}$ defined as follows: for $i \in [\frac{p}{2}]$, $A^c_{2i-1,2i} = A^c_{2i,2i-1} = \eta\cdot c[i]$, and $0$ elsewhere. By construction, we divide the $p$ nodes into $\frac{p}{2}$ different pairs, where there is no correlation between nodes belonging to different pairs. Furthermore, for pair $i$, if $c[i] = 0$, which means the value of node $2i - 1$ is independent of node $2i$, it is not hard to show
$$\Pr(Z_{2i-1} = 1, Z_{2i} = 1) = \Pr(Z_{2i-1} = -1, Z_{2i} = -1) = \frac{1}{4},$$
$$\Pr(Z_{2i-1} = 1, Z_{2i} = -1) = \Pr(Z_{2i-1} = -1, Z_{2i} = 1) = \frac{1}{4}.$$
On the other hand, if $c[i] = 1$,
$$\Pr(Z_{2i-1} = 1, Z_{2i} = 1) = \Pr(Z_{2i-1} = -1, Z_{2i} = -1) = \frac{1}{2}\cdot\frac{e^{\eta}}{e^{\eta}+1},$$
$$\Pr(Z_{2i-1} = 1, Z_{2i} = -1) = \Pr(Z_{2i-1} = -1, Z_{2i} = 1) = \frac{1}{2}\cdot\frac{1}{e^{\eta}+1}.$$
The chi-squared distance between these two distributions is
$$8\left[\left(\frac{1}{2}\cdot\frac{e^{\eta}}{e^{\eta}+1} - \frac{1}{4}\right)^2 + \left(\frac{1}{2}\cdot\frac{1}{e^{\eta}+1} - \frac{1}{4}\right)^2\right] = \left(1 - \frac{2}{e^{\eta}+1}\right)^2 \le 4\eta^2.$$
Now we want to upper bound the total variation distance between $D(A^{c_1}, 0)$ and $D(A^{c_2}, 0)$ for any $c_1 \ne c_2 \in C$. Let $P_i$ and $Q_i$ denote the joint distribution of node $2i - 1$ and node $2i$ corresponding to $D(A^{c_1}, 0)$ and $D(A^{c_2}, 0)$. We have that
$$d_{TV}\big(D(A^{c_1}, 0), D(A^{c_2}, 0)\big) \le \sqrt{2\,d_{KL}\big(D(A^{c_1}, 0), D(A^{c_2}, 0)\big)} = \sqrt{2\sum_{i=1}^{p/2} d_{KL}(P_i, Q_i)} \le \min\big(2\eta\sqrt{p}, 1\big),$$
where the first inequality is by Pinsker's inequality, and the last inequality comes from the fact that the KL divergence is always upper bounded by the chi-squared distance.
In order to attain pure DP lower bounds, we utilize the probabilistic version of the packing
argument in [ASZ20a], as stated below.
Lemma 7.2. Let $V = \{P_1, P_2, \ldots, P_M\}$ be a set of $M$ distributions over $\mathcal{X}^n$. Suppose that for any pair of distributions $P_i$ and $P_j$, there exists a coupling between $P_i$ and $P_j$ such that $\mathbb{E}[d_{ham}(X^n, Y^n)] \le D$, where $X^n \sim P_i$ and $Y^n \sim P_j$. Let $\{S_i\}_{i\in[M]}$ be a collection of disjoint subsets of $S$. If there exists an $\varepsilon$-DP algorithm $A : \mathcal{X}^n \to S$ such that for every $i \in [M]$, given $Z^n \sim P_i$, $\Pr(A(Z^n) \in S_i) \ge \frac{9}{10}$, then
$$\varepsilon = \Omega\!\left(\frac{\log M}{D}\right).$$
For any $c_1, c_2 \in C$, we have $d_{TV}(D(A^{c_1}, 0), D(A^{c_2}, 0)) \le \min(2\eta\sqrt{p}, 1)$. By the property of maximal coupling [dH12], there must exist some coupling between $D^n(A^{c_1}, 0)$ and $D^n(A^{c_2}, 0)$ with expected Hamming distance smaller than $\min(2n\eta\sqrt{p}, n)$. Therefore, we have
$$\varepsilon = \Omega\!\left(\frac{\log|C|}{\min(n\eta\sqrt{p}, n)}\right),$$
and accordingly,
$$n = \Omega\!\left(\frac{\sqrt{p}}{\eta\varepsilon} + \frac{p}{\varepsilon}\right).$$
Now we move to zCDP lower bounds. We utilize a different version of the packing argu-
ment [BS16], which works under zCDP.
Lemma 7.3. Let $V = \{P_1, P_2, \ldots, P_M\}$ be a set of $M$ distributions over $\mathcal{X}^n$. Let $\{S_i\}_{i\in[M]}$ be a collection of disjoint subsets of $S$. If there exists a $\rho$-zCDP algorithm $A : \mathcal{X}^n \to S$ such that for every $i \in [M]$, given $Z^n \sim P_i$, $\Pr(A(Z^n) \in S_i) \ge \frac{9}{10}$, then
$$\rho = \Omega\!\left(\frac{\log M}{n^2}\right).$$
By Lemma 7.3, we derive $\rho = \Omega\!\left(\frac{p}{n^2}\right)$ and $n = \Omega\!\left(\sqrt{\frac{p}{\rho}}\right)$ accordingly.
Proof. Similar to before, we start by constructing a distribution set consisting of $2^{O(kp)}$ different pairwise graphical models such that any accurate structure learning algorithm must output different answers for different distributions.
Let $C$ be the set of real symmetric matrices with each entry constrained to be either $0$ or $\eta$, i.e., $C = \{W \in \{0, \eta\}^{k\times k} : W = W^T\}$. Without loss of generality, we assume $p$ is even. Given $c = [c_1, c_2, \cdots, c_{p/2}]$, where $c_1, c_2, \cdots, c_{p/2} \in C$, we construct the corresponding distribution $D(W^c, 0)$ with $W^c$ defined as follows: for $l \in [\frac{p}{2}]$, $W^c_{2l-1,2l} = c_l$, and for all other pairs $(i, j)$, $W^c_{i,j} = 0$. Similarly, by this construction we divide the $p$ nodes into $\frac{p}{2}$ different pairs, and there is no correlation between nodes belonging to different pairs.
We first prove lower bounds under $(\varepsilon, 0)$-DP. By Lemma 7.2, $\varepsilon = \Omega\!\left(\frac{\log M}{n}\right)$, where $M = |C|^{p/2} = 2^{\frac{k(k+1)}{2}\cdot\frac{p}{2}}$ is the number of constructed distributions, since for any two $n$-sample distributions the expected coupling distance can always be upper bounded by $n$. Therefore, we have $n = \Omega\!\left(\frac{k^2 p}{\varepsilon}\right)$. At the same time, $n = \Omega\!\left(\frac{\sqrt{p}}{\eta\varepsilon}\right)$ is another lower bound, inherited from the easier task of learning Ising models.
With respect to zCDP, we utilize Lemma 7.3 and obtain $\rho = \Omega\!\left(\frac{k^2 p}{n^2}\right)$. Therefore, we have $n = \Omega\!\left(\sqrt{\frac{k^2 p}{\rho}}\right)$.
Acknowledgments
The authors would like to thank Kunal Talwar for suggesting the study of this problem, and Adam
Klivans, Frederic Koehler, Ankur Moitra, and Shanshan Wu for helpful and inspiring conversations.
GK would like to thank Chengdu Style Restaurant (古月飘香) in Berkeley for inspiration in the
conception of this project.
References
[AKN06] Pieter Abbeel, Daphne Koller, and Andrew Y. Ng. Learning factor graphs in
polynomial time and sample complexity. Journal of Machine Learning Research,
7(Aug):1743–1788, 2006.
[AKSZ18] Jayadev Acharya, Gautam Kamath, Ziteng Sun, and Huanyu Zhang. Inspectre: Pri-
vately estimating the unseen. In Proceedings of the 35th International Conference on
Machine Learning, ICML ’18, pages 30–39. JMLR, Inc., 2018.
[ASZ20a] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Personal communication, 2020.
[ASZ20b] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private assouad,
fano, and le cam. arXiv preprint arXiv:2004.06830, 2020.
[BBC+ 19] Ivona Bezakova, Antonio Blanca, Zongchen Chen, Daniel Štefankovič, and Eric
Vigoda. Lower bounds for testing graphical models: Colorings and antiferromagnetic
Ising models. In Proceedings of the 32nd Annual Conference on Learning Theory,
COLT ’19, pages 283–298, 2019.
[BBKN14] Amos Beimel, Hai Brenner, Shiva Prasad Kasiviswanathan, and Kobbi Nissim.
Bounds on the sample complexity for private learning and private data release. Ma-
chine Learning, 94(3):401–437, 2014.
[BDMN05] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy:
The SuLQ framework. In Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems, PODS ’05, pages 128–138, New York,
NY, USA, 2005. ACM.
[BGS14] Guy Bresler, David Gamarnik, and Devavrat Shah. Structure learning of antiferro-
magnetic Ising models. In Advances in Neural Information Processing Systems 27,
NIPS ’14, pages 2852–2860. Curran Associates, Inc., 2014.
[BKSW19] Mark Bun, Gautam Kamath, Thomas Steinke, and Zhiwei Steven Wu. Private hy-
pothesis selection. In Advances in Neural Information Processing Systems 32, NeurIPS
’19. Curran Associates, Inc., 2019.
[BM16] Bhaswar B. Bhattacharya and Sumit Mukherjee. Inference in Ising models. Bernoulli,
2016.
[BMS+ 17] Garrett Bernstein, Ryan McKenna, Tao Sun, Daniel Sheldon, Michael Hay, and
Gerome Miklau. Differentially private learning of undirected graphical models us-
ing collective graphical models. In Proceedings of the 34th International Conference
on Machine Learning, ICML ’17, pages 478–487. JMLR, Inc., 2017.
[BNSV15] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. Differentially private re-
lease and learning of threshold functions. In Proceedings of the 56th Annual IEEE
Symposium on Foundations of Computer Science, FOCS ’15, pages 634–649, Wash-
ington, DC, USA, 2015. IEEE Computer Society.
[Bre15] Guy Bresler. Efficiently learning Ising models on arbitrary graphs. In Proceedings
of the 47th Annual ACM Symposium on the Theory of Computing, STOC ’15, pages
771–782, New York, NY, USA, 2015. ACM.
[BS16] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications,
extensions, and lower bounds. In Proceedings of the 14th Conference on Theory of
Cryptography, TCC ’16-B, pages 635–658, Berlin, Heidelberg, 2016. Springer.
[BUV14] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of
approximate differential privacy. In Proceedings of the 46th Annual ACM Symposium
on the Theory of Computing, STOC ’14, pages 1–10, New York, NY, USA, 2014.
ACM.
[Cha05] Sourav Chatterjee. Concentration Inequalities with Exchangeable Pairs. PhD thesis,
Stanford University, June 2005.
[CL68] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with de-
pendence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
[Cla10] Kenneth L Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe
algorithm. ACM Transactions on Algorithms (TALG), 6(4):63, 2010.
[CRJ19] Amrita Roy Chowdhury, Theodoros Rekatsinas, and Somesh Jha. Data-dependent
differentially private parameter learning for directed graphical models. arXiv preprint
arXiv:1905.12813, 2019.
[CT06] Imre Csiszár and Zsolt Talata. Consistent estimation of the basic neighborhood of
Markov random fields. The Annals of Statistics, 34(1):123–145, 2006.
[CWZ19] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates
of convergence for parameter estimation with differential privacy. arXiv preprint
arXiv:1902.04495, 2019.
[DDK18] Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing Ising mod-
els. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms,
SODA ’18, pages 1989–2007, Philadelphia, PA, USA, 2018. SIAM.
[DDK19] Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing Ising mod-
els. IEEE Transactions on Information Theory, 65(11):6829–6852, 2019.
[DHS15] Ilias Diakonikolas, Moritz Hardt, and Ludwig Schmidt. Differentially private learning
of structured discrete distributions. In Advances in Neural Information Processing
Systems 28, NIPS ’15, pages 2566–2574. Curran Associates, Inc., 2015.
[DKY17] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data
privately. In Advances in Neural Information Processing Systems 30, NIPS ’17, pages
3571–3580. Curran Associates, Inc., 2017.
[DL09] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings
of the 41st Annual ACM Symposium on the Theory of Computing, STOC ’09, pages
371–380, New York, NY, USA, 2009. ACM.
[DLS+ 17] Aref N. Dajani, Amy D. Lauger, Phyllis E. Singer, Daniel Kifer, Jerome P. Reiter, Ash-
win Machanavajjhala, Simson L. Garfinkel, Scot A. Dahl, Matthew Graham, Vishesh
Karwa, Hang Kim, Philip Lelerc, Ian M. Schmutte, William N. Sexton, Lars Vilhu-
ber, and John M. Abowd. The modernization of statistical disclosure limitation at
the U.S. census bureau, 2017. Presented at the September 2017 meeting of the Census
Scientific Advisory Committee.
[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise
to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory
of Cryptography, TCC ’06, pages 265–284, Berlin, Heidelberg, 2006. Springer.
[DMR11] Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch. Evolutionary trees
and the Ising model on the Bethe lattice: A proof of Steel’s conjecture. Probability
Theory and Related Fields, 149(1):149–189, 2011.
[DMR18] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The minimax learning rate
of normal and Ising undirected graphical models. arXiv preprint arXiv:1806.06887,
2018.
[DR16] Cynthia Dwork and Guy N. Rothblum. Concentrated differential privacy. arXiv
preprint arXiv:1603.01887, 2016.
[DSS+ 15] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan.
Robust traceability from trace amounts. In Proceedings of the 56th Annual IEEE Sym-
posium on Foundations of Computer Science, FOCS ’15, pages 650–669, Washington,
DC, USA, 2015. IEEE Computer Society.
[EPK14] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized
aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM
Conference on Computer and Communications Security, CCS ’14, pages 1054–1067,
New York, NY, USA, 2014. ACM.
[FLNP00] Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe’er. Using Bayesian net-
works to analyze expression data. Journal of Computational Biology, 7(3-4):601–620,
2000.
[GAH+ 14] Marco Gaboardi, Emilio Jesús Gallego Arias, Justin Hsu, Aaron Roth, and Zhi-
wei Steven Wu. Dual query: Practical private query release for high dimensional
data. In Proceedings of the 31th International Conference on Machine Learning, ICML
2014, Beijing, China, 21-26 June 2014, pages 1170–1178, 2014.
[GG86] Stuart Geman and Christine Graffigne. Markov random field image models and their
applications to computer vision. In Proceedings of the International Congress of Math-
ematicians, pages 1496–1517. American Mathematical Society, 1986.
[GLP18] Reza Gheissari, Eyal Lubetzky, and Yuval Peres. Concentration inequalities for
polynomials of contracting Ising models. Electronic Communications in Probability,
23(76):1–12, 2018.
[HKM17] Linus Hamilton, Frederic Koehler, and Ankur Moitra. Information theoretic properties
of Markov random fields, and their algorithmic applications. In Advances in Neural
Information Processing Systems 30, NIPS ’17. Curran Associates, Inc., 2017.
[HR10] Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-
preserving data analysis. In Proceedings of the 51st Annual IEEE Symposium on
Foundations of Computer Science, FOCS ’10, pages 61–70, Washington, DC, USA,
2010. IEEE Computer Society.
[HT10] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceed-
ings of the 42nd Annual ACM Symposium on the Theory of Computing, STOC ’10,
pages 705–714, New York, NY, USA, 2010. ACM.
[Isi25] Ernst Ising. Beitrag zur theorie des ferromagnetismus. Zeitschrift für Physik A
Hadrons and Nuclei, 31(1):253–258, 1925.
[JJR11] Ali Jalali, Christopher C. Johnson, and Pradeep K. Ravikumar. On learning discrete
graphical models using greedy methods. In Advances in Neural Information Processing
Systems 24, NIPS ’11, pages 1935–1943. Curran Associates, Inc., 2011.
[JRVS11] Ali Jalali, Pradeep K. Ravikumar, Vishvas Vasuki, and Sujay Sanghavi. On learning
discrete graphical models using group-sparse regularization. In Proceedings of the 14th
International Conference on Artificial Intelligence and Statistics, AISTATS ’11, pages
378–387. JMLR, Inc., 2011.
[KKMN09] Aleksandra Korolova, Krishnaram Kenthapadi, Nina Mishra, and Alexandros Ntoulas.
Releasing search queries and clicks privately. In Proceedings of the 18th International
World Wide Web Conference, WWW ’09, pages 171–180, New York, NY, USA, 2009.
ACM.
[KLSU19] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning
high-dimensional distributions. In Proceedings of the 32nd Annual Conference on
Learning Theory, COLT ’19, pages 1853–1902, 2019.
[KM17] Adam Klivans and Raghu Meka. Learning graphical models using multiplicative
weights. In Proceedings of the 58th Annual IEEE Symposium on Foundations of
Computer Science, FOCS ’17, pages 343–354, Washington, DC, USA, 2017. IEEE
Computer Society.
[KU20] Gautam Kamath and Jonathan Ullman. A primer on private statistics. arXiv preprint
arXiv:2005.00010, 2020.
[KV18] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence inter-
vals. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Sci-
ence, ITCS ’18, pages 44:1–44:9, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-
Zentrum fuer Informatik.
[LAFH01] Charles Lagor, Dominik Aronsky, Marcelo Fiszman, and Peter J. Haug. Automatic
identification of patients eligible for a pneumonia guideline: comparing the diagnostic
accuracy of two decision support models. Studies in Health Technology and Informat-
ics, 84(1):493–497, 2001.
[LPW09] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing
Times. American Mathematical Society, 2009.
[LVMC18] Andrey Y. Lokhov, Marc Vuffray, Sidhant Misra, and Michael Chertkov. Optimal
structure and parameter learning of Ising models. Science Advances, 4(3):e1700791,
2018.
[MdCCU16] Abraham Martín del Campo, Sarah Cepeda, and Caroline Uhler. Exact goodness-of-fit
testing for the Ising model. Scandinavian Journal of Statistics, 2016.
[MMY18] Rajarshi Mukherjee, Sumit Mukherjee, and Ming Yuan. Global testing against sparse
alternatives under Ising models. The Annals of Statistics, 46(5):2062–2093, 2018.
[MS10] Andrea Montanari and Amin Saberi. The spread of innovations in social networks.
Proceedings of the National Academy of Sciences, 107(47):20196–20201, 2010.
[MSM19] Ryan McKenna, Daniel Sheldon, and Gerome Miklau. Graphical-model based estima-
tion and inference for differential privacy. arXiv preprint arXiv:1901.09136, 2019.
[NRS07] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sam-
pling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on
the Theory of Computing, STOC ’07, pages 75–84, New York, NY, USA, 2007. ACM.
[RWL10] Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. High-dimensional
Ising model selection using ℓ1 -regularized logistic regression. The Annals of Statistics,
38(3):1287–1319, 2010.
[Smi11] Adam Smith. Privacy-preserving statistical estimation with optimal convergence rates.
In Proceedings of the 43rd Annual ACM Symposium on the Theory of Computing,
STOC ’11, pages 813–822, New York, NY, USA, 2011. ACM.
[SU17] Thomas Steinke and Jonathan Ullman. Between pure and approximate differential
privacy. The Journal of Privacy and Confidentiality, 7(2):3–22, 2017.
[TTZ14] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Private empirical risk minimiza-
tion beyond the worst case: The effect of the constraint set geometry. arXiv preprint
arXiv:1411.5417, 2014.
[TTZ15] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly-optimal private LASSO.
In Advances in Neural Information Processing Systems 28, NIPS ’15, pages 3025–3033.
Curran Associates, Inc., 2015.
[Vad17] Salil Vadhan. The complexity of differential privacy. In Yehuda Lindell, editor, Tu-
torials on the Foundations of Cryptography: Dedicated to Oded Goldreich, chapter 7,
pages 347–450. Springer International Publishing AG, Cham, Switzerland, 2017.
[VMLC16] Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael Chertkov. Interaction
screening: Efficient and sample-optimal learning of Ising models. In Advances in Neu-
ral Information Processing Systems 29, NIPS ’16, pages 2595–2603. Curran Associates,
Inc., 2016.
[VTB+ 19] Giuseppe Vietri, Grace Tian, Mark Bun, Thomas Steinke, and Zhiwei Steven Wu.
New oracle efficient algorithms for private synthetic data release. NeurIPS PriML
workshop, 2019.
[WSD19] Shanshan Wu, Sujay Sanghavi, and Alexandros G. Dimakis. Sparse logistic regression
learns all discrete pairwise graphical models. In Advances in Neural Information
Processing Systems 32, NeurIPS ’19, pages 8069–8079. Curran Associates, Inc., 2019.
[ZLA03] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-
proportionate example weighting. In Proceedings of the 3rd IEEE International Con-
ference on Data Mining (ICDM 2003), 19-22 December 2003, Melbourne, Florida,
USA, page 435, 2003.