
AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data


Ryan McKenna, Brett Mullins, Daniel Sheldon, Gerome Miklau
University of Massachusetts
Amherst, Massachusetts
{rmckenna,bmullins,sheldon,miklau}@cs.umass.edu
arXiv:2201.12677v1 [cs.DB] 29 Jan 2022

ABSTRACT
We propose AIM, a novel algorithm for differentially private synthetic data generation. AIM is a workload-adaptive algorithm, within the paradigm of algorithms that first select a set of queries, then privately measure those queries, and finally generate synthetic data from the noisy measurements. It uses a set of innovative features to iteratively select the most useful measurements, reflecting both their relevance to the workload and their value in approximating the input data. We also provide analytic expressions to bound per-query error with high probability, which can be used to construct confidence intervals and inform users about the accuracy of generated data. We show empirically that AIM consistently outperforms a wide variety of existing mechanisms across a variety of experimental settings.

1 INTRODUCTION
Differential privacy [15] has grown into the preferred standard for privacy protection, with significant adoption by both commercial and governmental enterprises. Many common computations on data can be performed in a differentially private manner, including aggregates, statistical summaries, and the training of a wide variety of predictive models. Yet one of the most appealing uses of differential privacy is the generation of synthetic data: a collection of records matching the input schema, intended to be broadly representative of the source data. Differentially private synthetic data is an active area of research [1, 2, 5, 11, 12, 19, 25, 27, 29, 30, 43, 45, 46, 48–50, 52–55] and has also been the basis for two competitions, hosted by the U.S. National Institute of Standards and Technology [40].
Private synthetic data is appealing because it fits any data processing workflow designed for the original data, and, on its face, the user may believe they can perform any computation they wish, while still enjoying the benefits of privacy protection. Unfortunately, it is well known that there are limits to the accuracy that can be provided by synthetic data, under differential privacy or any other reasonable notion of privacy [14].
As a consequence, it is important to tailor synthetic data to some class of tasks, and this is commonly done by asking the user to provide a set of queries, called the workload, to which the synthetic data can be tailored. However, as our experiments will show, existing workload-aware techniques often fail to outperform workload-agnostic mechanisms, even when evaluated specifically on their target workloads. Not only do these algorithms fail to produce accurate synthetic data, they provide no way for end-users to detect the inaccuracy. As a result, in practical terms, differentially private synthetic data generation remains an unsolved problem.
In this work, we advance the state-of-the-art of differentially private synthetic data in two key ways. First, we propose a novel workload-aware mechanism that offers lower error than all competing techniques. Second, we derive analytic expressions to bound the per-query error of the mechanism with high probability.
Our mechanism, AIM, follows the select-measure-generate paradigm, which can be used to describe many prior approaches.¹ Mechanisms following this paradigm first select a set of queries, then measure those queries in a differentially private way (through noise addition), and finally generate synthetic data consistent with the noisy measurements. We leverage Private-PGM [37] for the generate step, as it provides a robust and efficient method for combining the noisy measurements into a single consistent representation from which records can be sampled.
¹ Another common approach is based on GANs [20]; however, recent research [44] has shown that published GAN-based approaches rarely outperform simple baselines; therefore we do not compare with those techniques in this paper.
The low error of AIM is primarily due to innovations in the select stage. AIM uses an iterative, greedy selection procedure, inspired by the popular MWEM algorithm for linear query answering. We define a highly effective quality score which determines the private selection of the next best marginal to measure. Through careful analysis, we define a low-sensitivity quality score that is able to take into account: (i) how well the candidate marginal is already estimated, (ii) the expected improvement measuring it can offer, (iii) the relevance of the marginal to the workload, and (iv) the available privacy budget. This novel quality score is accompanied by a host of other algorithmic techniques, including adaptive selection of rounds and budget-per-round, intelligent initialization, and a novel set of candidates from which to select.
In conjunction with AIM, we develop new techniques to quantify uncertainty in query answers derived from the generated synthetic data. The problem of error quantification for data-independent mechanisms like the Laplace or Gaussian mechanism is trivial, as they provide unbiased answers with known variance to all queries. The problem is considerably more challenging for data-dependent mechanisms like AIM, where complex post-processing is performed and only a subset of workload queries have unbiased answers. Some mechanisms, like MWEM, provide theoretical guarantees on their worst-case error, under suitable assumptions. However, this is an a priori bound on error obtained from a theoretical analysis of the mechanism under worst-case datasets. Instead, we develop an a posteriori error analysis, derived from the intermediate differentially private measurements used to produce the synthetic data. Our error estimates therefore reflect the actual execution of AIM on the input data, but do not require any additional privacy budget for their calculation. Formally, our guarantees represent one-sided confidence intervals, and we refer to them simply as "confidence bounds".
To our knowledge, AIM is the only differentially private synthetic data generation mechanism that provides this kind of error quantification. This paper makes the following contributions:
(1) In Section 3, we assess the prior work in the field, characterizing different approaches via key distinguishing elements and limitations, which brings clarity to a complex space.
(2) In Section 4, we propose AIM, a new mechanism for synthetic data generation that is workload-aware (for workloads consisting of weighted marginals) as well as data-aware.
(3) In Section 5, we derive analytic expressions to bound the per-query error of AIM with high probability. These expressions can be used to construct confidence bounds.
(4) In Section 6, we conduct a comprehensive empirical evaluation, and show that AIM consistently outperforms all prior work, improving error over the next best mechanism by 1.6× on average, and up to 5.7× in some cases.

2 BACKGROUND
In this section we provide relevant background and notation on datasets, marginals, and differential privacy required to understand this work.

2.1 Data, Marginals, and Workloads
Data. A dataset 𝐷 is a multiset of 𝑁 records, each containing potentially sensitive information about one individual. Each record 𝑥 ∈ 𝐷 is a 𝑑-tuple (𝑥1, . . . , 𝑥𝑑). The domain of possible values for 𝑥𝑖 is denoted by Ω𝑖, which we assume is finite and has size |Ω𝑖| = 𝑛𝑖. The full domain of possible values for 𝑥 is thus Ω = Ω1 × · · · × Ω𝑑, which has size ∏𝑖 𝑛𝑖 = 𝑛. We use D to denote the set of all possible datasets, which is equal to ∪_{𝑁=0}^{∞} Ω^𝑁.
Marginals. A marginal is a central statistic to the techniques studied in this paper, as it captures low-dimensional structure common in high-dimensional data distributions. A marginal for a set of attributes 𝑟 is essentially a histogram over 𝑥𝑟: it is a table that counts the number of occurrences of each 𝑡 ∈ Ω𝑟.
Definition 1 (Marginal). Let 𝑟 ⊆ [𝑑] be a subset of attributes, Ω𝑟 = ∏_{𝑖∈𝑟} Ω𝑖, 𝑛𝑟 = |Ω𝑟|, and 𝑥𝑟 = (𝑥𝑖)_{𝑖∈𝑟}. The marginal on 𝑟 is a vector 𝜇 ∈ R^{𝑛𝑟}, indexed by domain elements 𝑡 ∈ Ω𝑟, such that each entry is a count, i.e., 𝜇[𝑡] = Σ_{𝑥∈𝐷} 1[𝑥𝑟 = 𝑡]. We let 𝑀𝑟 : D → R^{𝑛𝑟} denote the function that computes the marginal on 𝑟, i.e., 𝜇 = 𝑀𝑟(𝐷).
In this paper, we use the term marginal query to denote the function 𝑀𝑟, and marginal to denote the vector of counts 𝜇 = 𝑀𝑟(𝐷). With some abuse of terminology, we will sometimes refer to the attribute subset 𝑟 as a marginal query as well.
Workload. A workload is a collection of queries the synthetic data should preserve well. It represents the measure by which we will evaluate utility of different mechanisms. We want our mechanisms to take a workload as input, and adapt intelligently to the queries in it, providing synthetic data that is tailored to the queries of interest. In this work, we focus on the special (but common) case where the workload consists of a collection of weighted marginal queries. Our utility measure is stated in Definition 2.
Definition 2 (Workload Error). A workload 𝑊 consists of a list of marginal queries 𝑟1, . . . , 𝑟𝑘 where 𝑟𝑖 ⊆ [𝑑], together with associated weights 𝑐𝑖 ≥ 0. The error of a synthetic dataset 𝐷̂ is defined as:
Error(𝐷, 𝐷̂) = (1 / (𝑘 · |𝐷|)) · Σ_{𝑖=1}^{𝑘} 𝑐𝑖 ‖𝑀𝑟𝑖(𝐷) − 𝑀𝑟𝑖(𝐷̂)‖₁
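To make Definition 2 concrete, the following short Python sketch computes the workload error of a synthetic dataset from integer-coded records. It is an illustration of the definition only; the function names and the numpy-based marginal computation are our own, not part of any released implementation.

import numpy as np

def marginal(df, attrs, domain):
    # M_r(D): vector of counts over all cells of the attribute subset r,
    # assuming each attribute a is integer-coded as 0 .. domain[a]-1.
    sizes = [domain[a] for a in attrs]
    flat = np.ravel_multi_index([df[a].to_numpy() for a in attrs], sizes)
    return np.bincount(flat, minlength=int(np.prod(sizes)))

def workload_error(D, D_hat, workload, domain):
    # Definition 2: Error(D, D_hat) = (1/(k*|D|)) * sum_i c_i * ||M_{r_i}(D) - M_{r_i}(D_hat)||_1
    k, n = len(workload), len(D)
    total = 0.0
    for attrs, c in workload:
        diff = marginal(D, attrs, domain).astype(float) - marginal(D_hat, attrs, domain)
        total += c * np.abs(diff).sum()
    return total / (k * n)

# Example usage with pandas DataFrames D and D_hat:
# domain = {"A": 2, "B": 3, "C": 4}
# workload = [(("A", "B"), 1.0), (("B", "C"), 2.0)]
# err = workload_error(D, D_hat, workload, domain)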
2.2 Differential privacy
Differential privacy protects individuals by bounding the impact any one individual can have on the output of an algorithm. This is formalized using the notion of neighboring datasets. Two datasets 𝐷, 𝐷′ ∈ D are neighbors (denoted 𝐷 ∼ 𝐷′) if 𝐷′ can be obtained from 𝐷 by adding or removing a single record.
Definition 3 (Differential Privacy). A randomized mechanism M : D → R satisfies (𝜖, 𝛿)-differential privacy (DP) if for any neighboring datasets 𝐷 ∼ 𝐷′ ∈ D, and any subset of possible outputs 𝑆 ⊆ R,
Pr[M(𝐷) ∈ 𝑆] ≤ exp(𝜖) Pr[M(𝐷′) ∈ 𝑆] + 𝛿.
A key quantity needed to reason about the privacy of common randomized mechanisms is the sensitivity, defined below.
Definition 4 (Sensitivity). Let 𝑓 : D → R^𝑝 be a vector-valued function of the input data. The 𝐿2 sensitivity of 𝑓 is Δ(𝑓) = max_{𝐷∼𝐷′} ‖𝑓(𝐷) − 𝑓(𝐷′)‖₂.
It is easy to verify that the 𝐿2 sensitivity of any marginal query 𝑀𝑟 is 1, regardless of the attributes in 𝑟. This is because one individual can only contribute a count of one to a single cell of the output vector. Below we introduce the two building block mechanisms used in this work.
Definition 5 (Gaussian Mechanism). Let 𝑓 : D → R^𝑝 be a vector-valued function of the input data. The Gaussian Mechanism adds i.i.d. Gaussian noise with scale 𝜎Δ(𝑓) to each entry of 𝑓(𝐷). That is,
M(𝐷) = 𝑓(𝐷) + 𝜎Δ(𝑓) N(0, I),
where I is a 𝑝 × 𝑝 identity matrix.
Definition 6 (Exponential Mechanism). Let 𝑞𝑟 : D → R be a quality score function defined for all 𝑟 ∈ R and let 𝜖 ≥ 0 be a real number. Then the exponential mechanism outputs a candidate 𝑟 ∈ R according to the following distribution:
Pr[M(𝐷) = 𝑟] ∝ exp( (𝜖 / (2Δ)) · 𝑞𝑟(𝐷) ),
where Δ = max_{𝑟∈R} Δ(𝑞𝑟).
Our algorithm is defined using zCDP, an alternate version of differential privacy which offers beneficial composition properties. We convert to (𝜖, 𝛿) guarantees when necessary.
Definition 7 (zero-Concentrated Differential Privacy (zCDP)). A randomized mechanism M is 𝜌-zCDP if for any two neighboring datasets 𝐷 and 𝐷′, and all 𝛼 ∈ (1, ∞), we have:
𝐷𝛼(M(𝐷) ‖ M(𝐷′)) ≤ 𝜌 · 𝛼,
where 𝐷𝛼 is the Rényi divergence of order 𝛼.
Proposition 1 (zCDP of the Gaussian Mechanism [6]). The Gaussian Mechanism satisfies (1/(2𝜎²))-zCDP.
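As an illustration of Definition 5 and Proposition 1, the sketch below measures a single marginal with the Gaussian mechanism and reports the zCDP cost of doing so. The helper names are ours; under unbounded DP the L2 sensitivity of a marginal query is 1, as noted above.

import numpy as np

def gaussian_mechanism(marginal_counts, sigma, sensitivity=1.0, rng=None):
    # Definition 5: add i.i.d. N(0, (sigma * Delta(f))^2) noise to every cell of f(D).
    rng = np.random.default_rng() if rng is None else rng
    return marginal_counts + sigma * sensitivity * rng.standard_normal(marginal_counts.shape)

def gaussian_zcdp(sigma):
    # Proposition 1: the Gaussian mechanism with noise scale sigma satisfies (1/(2*sigma^2))-zCDP.
    return 1.0 / (2.0 * sigma ** 2)

# Example: measuring one marginal with sigma = 50 costs 1/(2*50^2) = 2e-4 of the zCDP budget.
# noisy_mu = gaussian_mechanism(mu, sigma=50.0)
# rho_used = gaussian_zcdp(50.0)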
Proposition 2 (zCDP of the Exponential Mechanism [10]). The Exponential Mechanism satisfies (𝜖²/8)-zCDP.
We rely on the following propositions to reason about multiple adaptive invocations of zCDP mechanisms, and the translation from zCDP to (𝜖, 𝛿)-DP. The proposition below covers 2-fold adaptive composition of zCDP mechanisms, and it can be inductively applied to obtain analogous k-fold adaptive composition guarantees.
Proposition 3 (Adaptive Composition of zCDP Mechanisms [6]). Let M1 : D → R1 be 𝜌1-zCDP and M2 : D × R1 → R2 be 𝜌2-zCDP. Then the mechanism M = M2(𝐷, M1(𝐷)) is (𝜌1 + 𝜌2)-zCDP.
Proposition 4 (zCDP to DP [9]). If a mechanism M satisfies 𝜌-zCDP, it also satisfies (𝜖, 𝛿)-differential privacy for all 𝜖 ≥ 0 and
𝛿 = min_{𝛼>1} [ exp((𝛼 − 1)(𝛼𝜌 − 𝜖)) / (𝛼 − 1) ] · (1 − 1/𝛼)^𝛼.
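Proposition 4 can be evaluated numerically. The sketch below computes 𝛿 for a given 𝜌 and 𝜖 by a dense grid search over 𝛼 > 1; it is our own illustration of the conversion, written directly from the expression in Proposition 4 rather than taken from any released library.

import numpy as np

def zcdp_to_dp_delta(rho, eps, alphas=None):
    # Proposition 4: delta = min_{alpha > 1} exp((alpha-1)(alpha*rho - eps)) / (alpha - 1) * (1 - 1/alpha)^alpha
    if alphas is None:
        alphas = np.linspace(1.0 + 1e-6, 200.0, 200000)   # grid over alpha > 1
    a = alphas
    log_delta = (a - 1) * (a * rho - eps) - np.log(a - 1) + a * np.log1p(-1.0 / a)
    return float(np.exp(log_delta.min()))

# In practice one fixes a target (eps, delta) and searches (e.g. by bisection on rho)
# for the largest rho such that zcdp_to_dp_delta(rho, eps) <= delta.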
2.3 Private-PGM
An important component of our approach is a tool called Private-PGM [34, 35, 37]. For the purposes of this paper, we will treat Private-PGM as a black box that exposes an interface for solving subproblems important to our mechanism. We briefly summarize Private-PGM and three core utilities it provides. Private-PGM consumes as input a collection of noisy marginals of the sensitive data, in the format of a list of tuples (𝜇̃𝑖, 𝜎𝑖, 𝑟𝑖) for 𝑖 = 1, . . . , 𝑘, where 𝜇̃𝑖 = 𝑀𝑟𝑖(𝐷) + N(0, 𝜎𝑖² I).²
Distribution Estimation. At the heart of Private-PGM is an optimization problem to find a distribution 𝑝̂ that "best explains" the noisy observations 𝜇̃𝑖:
𝑝̂ ∈ argmin_{𝑝∈S} Σ_{𝑖=1}^{𝑘} (1/𝜎𝑖) ‖𝑀𝑟𝑖(𝑝) − 𝜇̃𝑖‖₂²
Here S = {𝑝 | 𝑝(𝑥) ≥ 0 and Σ_{𝑥∈Ω} 𝑝(𝑥) = 𝑛} is the set of (scaled) probability distributions over the domain Ω.³ When 𝜇̃𝑖 are corrupted with i.i.d. Gaussian noise, this is exactly a maximum likelihood estimation problem [34, 35, 37]. In general, convex optimization over the scaled probability simplex is intractable for the high-dimensional domains we are interested in. Private-PGM overcomes this curse of dimensionality by exploiting the fact that the objective only depends on 𝑝 through its marginals. The key observation is that one of the minimizers of this problem is a graphical model 𝑝̂𝜃. The parameters 𝜃 provide a compact representation of the distribution 𝑝̂ that we can optimize efficiently.
Junction Tree Size. The time and space complexity of Private-PGM depends on the measured marginal queries in a nuanced way, the main factor being the size of the junction tree implied by the measured marginal queries [35, 36]. While understanding the junction tree construction is not necessary for this paper, it is important to note that Private-PGM exposes a callable function JT-SIZE(𝑟1, . . . , 𝑟𝑘) that can be invoked to check how large a junction tree is. JT-SIZE is measured in megabytes, and the runtime of distribution estimation is roughly proportional to this quantity. If arbitrary marginals are measured, JT-SIZE can grow out of control, no longer fitting in memory, and leading to unacceptable runtime.
Synthetic Data Generation. Given an estimated model 𝑝̂, Private-PGM implements a routine for generating synthetic tabular data that approximately matches the given distribution. It achieves this with a randomized rounding procedure, which is a lower variance alternative to sampling from 𝑝̂ [35].
² Private-PGM is more general than this, but this is the most common setting.
³ When using unbounded DP, 𝑛 is sensitive and therefore we must estimate it.

3 PRIOR WORK ON SYNTHETIC DATA
In this section we survey the state of the field, describing basic elements of a good synthetic data mechanism, along with novelties of more sophisticated mechanisms. We focus our attention on marginal-based approaches to differentially private synthetic data in this section, as these have generally seen the most success in practical applications. These mechanisms include PrivBayes [52], PrivBayes+PGM [37], MWEM+PGM [37], MST [35], PrivSyn [55], RAP [3], GEM [32], and PrivMRF [8]. We will review other related work in Section 8. We will begin with a formal problem statement:
Problem 1 (Workload Error Minimization). Given a workload 𝑊, our goal is to design an (𝜖, 𝛿)-DP synthetic data mechanism M : D → D such that the expected error defined in Definition 2 is minimized.

3.1 The Select-Measure-Generate Paradigm
We begin by providing a broad overview of the basic approach employed by many differentially private mechanisms for synthetic data. These mechanisms all fit naturally into the select-measure-generate framework. This framework represents a class of mechanisms which can naturally be broken up into 3 steps: (1) select a set of queries, (2) measure those queries using a noise-addition mechanism, and (3) generate synthetic data that explains the noisy measurements well. We consider iterative mechanisms that alternate between the select and measure step to be in this class as well. Mechanisms within this class differ in their methodology for selecting queries, the noise mechanism used, and the approach to generating synthetic data from the noisy measurements.
MWEM+PGM, shown in Algorithm 1, is one mechanism from this class that serves as a concrete example as well as the starting point for our improved mechanism, AIM. As the name implies, MWEM+PGM is a scalable instantiation of the well-known MWEM algorithm [22] for linear query answering, where the multiplicative weights (MW) step is replaced by a call to Private-PGM. It is a greedy, iterative mechanism for workload-aware synthetic data generation, and there are several variants; one variant is shown in Algorithm 1. The mechanism begins by initializing an estimate of the joint distribution to be uniform over the data domain. Then, it runs for 𝑇 rounds, and in each round it does three things: (1) selects (via the exponential mechanism) a marginal query that is poorly approximated under the current estimate, (2) measures the selected marginal using the Gaussian mechanism, and (3) estimates a new data distribution (using Private-PGM) that explains the noisy measurements well. After 𝑇 rounds, the estimated distribution is used to generate synthetic tabular data. In the subsequent subsections, we will characterize existing mechanisms in terms of how they approach these different aspects of the problem.

Algorithm 1 MWEM+PGM
Input: Dataset 𝐷, workload 𝑊, privacy parameter 𝜌, rounds 𝑇
Output: Synthetic Dataset 𝐷̂
  Initialize 𝑝̂0 = Uniform[X]
  𝜖 = 2·√(𝜌/𝑇)
  𝜎 = √(𝑇/𝜌)
  for 𝑡 = 1, . . . , 𝑇 do
    select 𝑟𝑡 ∈ 𝑊 using the exponential mechanism with budget 𝜖 and quality score:
      𝑞𝑟(𝐷) = ‖𝑀𝑟(𝐷) − 𝑀𝑟(𝑝̂𝑡−1)‖₁ − 𝑛𝑟
    measure the marginal on 𝑟𝑡:
      𝜇̃𝑡 = 𝑀𝑟𝑡(𝐷) + N(0, 𝜎² I)
    estimate the data distribution using Private-PGM:
      𝑝̂𝑡 = argmin_{𝑝∈S} Σ_{𝑖=1}^{𝑡} ‖𝑀𝑟𝑖(𝑝) − 𝜇̃𝑖‖₂²
  end for
  generate synthetic data 𝐷̂ from 𝑝̂𝑇 using Private-PGM
  return 𝐷̂
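A compact Python rendering of Algorithm 1 is given below. It treats Private-PGM as a black box (the `estimate` and `synthetic_data` arguments stand in for its distribution-estimation and data-generation routines from Section 2.3) and implements the exponential mechanism with the Gumbel-max trick; all names are illustrative rather than the released API.

import numpy as np

def mwem_pgm(D, workload, rho, T, marginal, estimate, synthetic_data, rng=None):
    # Algorithm 1 (MWEM+PGM), splitting the per-round budget rho/T evenly between
    # the exponential mechanism (select) and the Gaussian mechanism (measure).
    rng = np.random.default_rng() if rng is None else rng
    eps = 2.0 * np.sqrt(rho / T)      # per-round exponential mechanism parameter
    sigma = np.sqrt(T / rho)          # per-round Gaussian noise scale
    measurements = []                 # list of (r, noisy marginal, sigma)
    p_hat = estimate(measurements)    # uniform model when nothing has been measured yet

    for _ in range(T):
        # Select: q_r(D) = ||M_r(D) - M_r(p_hat)||_1 - n_r, sensitivity 1.
        scores = np.array([
            np.abs(marginal(D, r) - marginal(p_hat, r)).sum() - marginal(D, r).size
            for r in workload
        ])
        noisy = scores * eps / 2.0 + rng.gumbel(size=len(workload))  # exponential mech. via Gumbel-max
        r = workload[int(np.argmax(noisy))]

        # Measure: Gaussian mechanism on the selected marginal.
        y = marginal(D, r) + sigma * rng.standard_normal(marginal(D, r).shape)
        measurements.append((r, y, sigma))

        # Estimate: Private-PGM fits a graphical model to all noisy marginals so far.
        p_hat = estimate(measurements)

    return synthetic_data(p_hat)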
3.2 Basic Elements of a Good Mechanism
In this section we outline some basic criteria reasonable mechanisms should satisfy to get good performance. These recommendations primarily apply to the measure step.
Measure Entire Marginals. Marginals are an appealing statistic to measure because every individual contributes a count of one to exactly one cell of the marginal. As a result, we can measure every cell of 𝑀𝑟(𝐷) at the same privacy cost as measuring a single cell. With a few exceptions ([3, 32, 48]), existing mechanisms utilize this property of marginals or can be extended to use it. The alternative of measuring a single counting query at a time sacrifices utility unnecessarily.
Use Gaussian Noise. Back-of-the-envelope calculations reveal that if the number of measurements is greater than roughly log(1/𝛿) + 𝜖, which is often the case, then the standard deviation of the required Gaussian noise is lower than that of the Laplace noise. Many newer mechanisms recognize this and use Gaussian noise, while older mechanisms were developed with Laplace noise, but can easily be adapted to use Gaussian noise instead.
Use Unbounded DP. For fixed (𝜖, 𝛿), the required noise magnitude is lower by a factor of √2 when using unbounded DP (add/remove one record) over bounded DP (modify one record). This is because the 𝐿2 sensitivity of a marginal query 𝑀𝑟 is 1 under unbounded DP, and √2 under bounded DP. Some mechanisms like MST, PrivSyn, and PrivMRF use unbounded DP, while other mechanisms like RAP, GEM, and PrivBayes use bounded DP. We remark that these two definitions of DP are qualitatively different, and because of that, the privacy parameters have different interpretations. The √2 difference could be recovered in bounded DP by increasing the privacy budget appropriately.
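The "measure entire marginals" and "use unbounded DP" recommendations both come down to the L2 sensitivity of a marginal query. The sketch below checks the claimed sensitivities empirically on a toy domain by comparing neighboring datasets; it is a sanity check written for illustration, not part of any mechanism.

import numpy as np

def marginal_vec(records, attrs, sizes):
    # Full marginal over the attribute subset `attrs` for integer-coded records.
    counts = np.zeros([sizes[a] for a in attrs])
    for rec in records:
        counts[tuple(rec[a] for a in attrs)] += 1
    return counts.ravel()

sizes = [2, 3]                  # toy domain with two attributes
attrs = (0, 1)
records = [(0, 1), (1, 2), (0, 0)]

# Unbounded DP neighbor: add one record -> exactly one cell changes by 1, so L2 distance is 1.
add_neighbor = records + [(1, 1)]
print(np.linalg.norm(marginal_vec(records, attrs, sizes) -
                     marginal_vec(add_neighbor, attrs, sizes)))      # 1.0

# Bounded DP neighbor: modify one record -> one cell goes down and another up, so L2 distance is sqrt(2).
modify_neighbor = [(1, 1)] + records[1:]
print(np.linalg.norm(marginal_vec(records, attrs, sizes) -
                     marginal_vec(modify_neighbor, attrs, sizes)))   # ~1.414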
3.3 Distinguishing Elements of Existing Work
Beyond the basics, different mechanisms exhibit different novelties, and understanding the design considerations underlying the existing work can be enlightening. We provide a simple taxonomy of this space in Table 1 in terms of four criteria: workload-, data-, budget-, and efficiency-awareness. These characteristics primarily pertain to the select step of each mechanism.

Table 1: Taxonomy of select-measure-generate mechanisms.
Name                | Year | Workload Aware | Data Aware | Budget Aware | Efficiency Aware
Independent         | -    |                |            |              | ✓
Gaussian            | -    | ✓              |            |              |
PrivBayes [52]      | 2014 |                | ✓          | ✓            | ✓
HDMM+PGM [37]       | 2019 | ✓              |            |              |
PrivBayes+PGM [37]  | 2019 |                | ✓          | ✓            | ✓
MWEM+PGM [37]       | 2019 | ✓              | ✓          |              |
PrivSyn [55]        | 2020 |                | ✓          | ✓            | ✓
MST [35]            | 2021 |                | ✓          |              | ✓
RAP [3]             | 2021 | ✓              | ✓          |              | ✓
GEM [32]            | 2021 | ✓              | ✓          |              | ✓
PrivMRF [8]         | 2021 |                | ✓          | ✓            | ✓
AIM [This Work]     | 2022 | ✓              | ✓          | ✓            | ✓

Workload-awareness. Different mechanisms select from a different set of candidate marginal queries. PrivBayes and PrivMRF, for example, select from a particular subset of 𝑘-way marginals, determined from the data. Other mechanisms, like MST and PrivSyn, restrict the set of candidates to 2-way marginal queries. On the other end of the spectrum, the candidates considered by MWEM+PGM, RAP, and GEM are exactly the marginal queries in the workload. This is appealing, since these mechanisms will not waste the privacy budget to measure marginals that are not relevant to the workload; however, we show the benefit of extending the set of candidates beyond the workload.
Data-awareness. Many mechanisms select marginal queries from a set of candidates based on the data, and are thus data-aware. For example, MWEM+PGM selects marginal queries using the exponential mechanism with a quality score function that depends on the data. Independent, Gaussian, and HDMM+PGM are the exceptions, as they always select the same marginal queries no matter what the underlying data distribution is.
Budget-awareness. Another aspect of different mechanisms is how well they adapt to the privacy budget available. Some mechanisms, like PrivBayes, PrivSyn, and PrivMRF, recognize that we can afford to measure more (or larger) marginals when the privacy budget is sufficiently large. When the privacy budget is limited, these mechanisms recognize that fewer (and smaller) marginals should be measured instead. In contrast, the number and size of the marginals selected by mechanisms like MST, MWEM+PGM, RAP, and GEM does not depend on the privacy budget available.⁴
⁴ The number of rounds to run MWEM+PGM, RAP, and GEM is a hyper-parameter, and the best setting of this hyper-parameter depends on the privacy budget available.
Efficiency-awareness. Mechanisms that build on top of Private-PGM must take care when selecting measurements to ensure JT-SIZE remains sufficiently small to ensure computational tractability. Among these, PrivBayes+PGM, MST, and PrivMRF all have built-in heuristics in the selection criteria to ensure the selected marginal queries give rise to a tractable model. Gaussian, HDMM+PGM and MWEM+PGM have no such safeguards, and they can sometimes select marginal queries that lead to intractable models. In the extreme case, when the workload is all 2-way marginals, Gaussian selects all 2-way marginals, and the model required for Private-PGM explodes to the size of the entire domain, which is often intractable.
Mechanisms that utilize different techniques for post-processing noisy marginals into synthetic data, like PrivSyn, RAP, and GEM, do not have this limitation, and are free to select from a wider collection of marginals. While these methods do not suffer from this particular limitation of Private-PGM, they have other pros and cons, which were surveyed in a recent article [34].
Summary. With the exception of our new mechanism AIM, no mechanism listed in Table 1 is aware of all four factors we discussed. Mechanisms that do not have four checkmarks in Table 1 are not necessarily bad, but there are clear ways in which they can be improved. Conversely, mechanisms that have more checkmarks than other mechanisms are not necessarily better. For example, RAP has 3 checkmarks, but as we show in Section 6, it does not consistently beat Independent, which only has 1 checkmark.

3.4 Other Design Considerations
Beyond the four characteristics summarized in the previous section, different methods make different design decisions that are relevant to mechanism performance but do not correspond to those four criteria. In this section, we summarize some of those additional design considerations.
Selection method. Some mechanisms select marginals to measure in a batch, while other mechanisms select them iteratively. Generally speaking, iterative methods like MWEM+PGM, RAP, GEM, and PrivMRF are preferable to batch methods, because the selected marginals will capture important information about the distribution that was not effectively captured by the previously measured marginals. On the other hand, PrivBayes, MST, and PrivSyn select all the marginals before measuring any of them. It is not difficult to construct examples where a batch method like PrivSyn has suboptimal behavior. For example, suppose the data contains three perfectly correlated attributes. We can expect iterative methods to capture the distribution after measuring any two 2-way marginals. On the other hand, a batch method like PrivSyn will determine that all three 2-way marginals need to be measured.
Budget split. Every mechanism in this discussion, except for PrivSyn, splits the privacy budget equally among selected marginals. This is a simple and natural thing to do, but it does not account for the fact that larger marginals have smaller counts that are less robust to noise, requiring a larger fraction of the privacy budget to answer accurately. PrivSyn provides a simple formula for dividing privacy budget among marginals of different sizes, but this approach is inherently tied to their batch selection methodology. It is much less clear how to divide the privacy budget within a mechanism that uses an iterative selection procedure.
Hyperparameters. All mechanisms have some hyperparameters that can be tuned to affect the behavior of the mechanism. Mechanisms like PrivBayes, MST, PrivSyn, and PrivMRF have reasonable default values for these hyperparameters, and these mechanisms can be expected to work well out of the box. On the other hand, MWEM+PGM, RAP, and GEM have to tune the number of rounds to run, and it is not obvious how to select this a priori. While the open source implementations may include a default value, the experiments conducted in the respective papers did not use these default values, in favor of non-privately optimizing over this hyperparameter for each dataset and privacy level considered [3, 32].

4 AIM: AN ADAPTIVE AND ITERATIVE MECHANISM FOR SYNTHETIC DATA
While MWEM+PGM is a simple and intuitive algorithm, it leaves significant room for improvement. Our new mechanism, AIM, is presented in Algorithm 4, with its subroutines in Algorithms 2 and 3. In this section, we describe the differences between MWEM+PGM and AIM, the justifications for the relevant design decisions, and prove the privacy of AIM.

Algorithm 4 AIM: An Adaptive and Iterative Mechanism
1: Input: Dataset 𝐷, workload 𝑊, privacy parameter 𝜌
2: Output: Synthetic Dataset 𝐷̂
3: Hyper-Parameters: MAX-SIZE = 80MB, 𝑇 = 16𝑑, 𝛼 = 0.9
4: 𝜎0 = √(𝑇/(2𝛼𝜌))
5: 𝜌used = 0
6: 𝑡 = 0
7: Initialize 𝑝̂𝑡 using Algorithm 2
8: 𝑤𝑟 = Σ_{𝑠∈𝑊} 𝑐𝑠 |𝑟 ∩ 𝑠|
9: 𝜎𝑡+1 ← 𝜎0;  𝜖𝑡+1 ← √(8(1 − 𝛼)𝜌/𝑇)
10: while 𝜌used < 𝜌 do
11:   𝑡 = 𝑡 + 1
12:   𝜌used ← 𝜌used + (1/8)𝜖𝑡² + 1/(2𝜎𝑡²)
13:   𝐶𝑡 = {𝑟 ∈ 𝑊₊ | JT-SIZE(𝑟1, . . . , 𝑟𝑡−1, 𝑟) ≤ (𝜌used/𝜌) · MAX-SIZE}
14:   select 𝑟𝑡 ∈ 𝐶𝑡 using the exponential mechanism with:
        𝑞𝑟(𝐷) = 𝑤𝑟 (‖𝑀𝑟(𝐷) − 𝑀𝑟(𝑝̂𝑡−1)‖₁ − √(2/𝜋) · 𝜎𝑡 · 𝑛𝑟)
15:   measure the marginal on 𝑟𝑡:
        𝑦̃𝑡 = 𝑀𝑟𝑡(𝐷) + N(0, 𝜎𝑡² I)
16:   estimate the data distribution using Private-PGM:
        𝑝̂𝑡 = argmin_{𝑝∈S} Σ_{𝑖=1}^{𝑡} (1/𝜎𝑖) ‖𝑀𝑟𝑖(𝑝) − 𝑦̃𝑖‖₂²
17:   anneal 𝜖𝑡+1 and 𝜎𝑡+1 using Algorithm 3
18: end while
19: generate synthetic data 𝐷̂ from 𝑝̂𝑡 using Private-PGM
20: return 𝐷̂

Algorithm 2 Initialize 𝑝̂𝑡 (subroutine of Algorithm 4)
1: for 𝑟 ∈ {𝑟 ∈ 𝑊₊ | |𝑟| = 1} do
2:   𝑡 = 𝑡 + 1;  𝜎𝑡 ← 𝜎0;  𝑟𝑡 ← 𝑟
3:   𝑦̃𝑡 = 𝑀𝑟(𝐷) + N(0, 𝜎𝑡² I)
4:   𝜌used ← 𝜌used + 1/(2𝜎𝑡²)
5: end for
6: 𝑝̂𝑡 = argmin_{𝑝∈S} Σ_{𝑖=1}^{𝑡} (1/𝜎𝑖) ‖𝑀𝑟𝑖(𝑝) − 𝑦̃𝑖‖₂²

Algorithm 3 Budget annealing (subroutine of Algorithm 4)
1: if ‖𝑀𝑟𝑡(𝑝̂𝑡) − 𝑀𝑟𝑡(𝑝̂𝑡−1)‖₁ ≤ √(2/𝜋) · 𝜎𝑡 · 𝑛𝑟𝑡 then
2:   𝜖𝑡+1 ← 2 · 𝜖𝑡
3:   𝜎𝑡+1 ← 𝜎𝑡/2
4: else
5:   𝜖𝑡+1 ← 𝜖𝑡
6:   𝜎𝑡+1 ← 𝜎𝑡
7: end if
8: if (𝜌 − 𝜌used) ≤ 2 · (1/(2𝜎𝑡+1²) + (1/8)𝜖𝑡+1²) then
9:   𝜖𝑡+1 = √(8 · (1 − 𝛼) · (𝜌 − 𝜌used))
10:  𝜎𝑡+1 = √(1/(2 · 𝛼 · (𝜌 − 𝜌used)))
11: end if

Intelligent Initialization. In Line 7 of AIM, we spend a small fraction of the privacy budget to measure 1-way marginals in the set of candidates. Estimating 𝑝̂ from these noisy marginals gives rise to an independent model where all 1-way marginals are preserved well, and higher-order marginals can be estimated under an independence assumption. This provides a far better initialization than the default uniform distribution while requiring only a small fraction of the privacy budget.
New Candidates. In Line 13 of AIM, we make two notable modifications to the candidate set that serve different purposes. Specifically, the set of candidates is a carefully chosen subset of the marginal queries in the downward closure of the workload. The downward closure of the workload is the set of marginal queries whose attribute sets are subsets of some marginal query in the workload, i.e., 𝑊₊ = {𝑟 | 𝑟 ⊆ 𝑠, 𝑠 ∈ 𝑊}.
Using the downward closure is based on the observation that marginals with many attributes have low counts, and answering them directly with a noise addition mechanism may not provide an acceptable signal-to-noise ratio. In these situations, it may be better to answer lower-dimensional marginals, as these tend to exhibit a better signal-to-noise ratio, while still being useful to estimate the higher-dimensional marginals in the workload.
We filter candidates from this set that do not meet a specific model capacity requirement. Specifically, the set will only consist of candidates that, if selected, will lead to a JT-SIZE below a prespecified limit (the default is 80 MB). This ensures that AIM will never select candidates that lead to an intractable model, and hence allows the mechanism to execute consistently with a predictable memory footprint and runtime.
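The candidate set and workload weights used by AIM are simple to compute. The sketch below builds the downward closure 𝑊₊ and the weights w_r = Σ_{s∈W} c_s·|r ∩ s| from a weighted workload of attribute tuples; the JT-SIZE filter is represented by a caller-supplied predicate, since it depends on Private-PGM internals. All names are our own illustration.

from itertools import combinations

def downward_closure(workload):
    # W+ = { r : r is a non-empty subset of some workload query s }
    closure = set()
    for s, _ in workload:
        for k in range(1, len(s) + 1):
            closure.update(combinations(sorted(s), k))
    return sorted(closure)

def workload_weights(candidates, workload):
    # w_r = sum over workload queries s of c_s * |r intersect s|
    return {r: sum(c * len(set(r) & set(s)) for s, c in workload) for r in candidates}

def filter_candidates(candidates, selected, size_ok):
    # Keep candidates r whose junction tree (together with the already-selected marginals)
    # stays under the current capacity limit; `size_ok` stands in for the JT-SIZE check.
    return [r for r in candidates if size_ok(selected + [r])]

# Example:
# workload = [(("A", "B", "C"), 1.0), (("B", "D"), 2.0)]
# W_plus = downward_closure(workload)      # ('A',), ('B',), ..., ('A','B','C'), ('B','D')
# w = workload_weights(W_plus, workload)   # e.g. w[('B',)] = 1*1 + 2*1 = 3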
Better Selection Criteria. In Line 14 of AIM, we make two modifications to the quality score function for marginal query selection to better reflect the utility we expect from measuring the selected marginal. In particular, our new quality score function is
𝑞𝑟(𝐷) = 𝑤𝑟 (‖𝑀𝑟(𝐷) − 𝑀𝑟(𝑝̂𝑡−1)‖₁ − √(2/𝜋) · 𝜎𝑡 · 𝑛𝑟),   (1)
which differs from MWEM+PGM's quality score function 𝑞𝑟(𝐷) = ‖𝑀𝑟(𝐷) − 𝑀𝑟(𝑝̂𝑡−1)‖₁ − 𝑛𝑟 in two ways.
First, the expression inside the parentheses can be interpreted as the expected improvement in 𝐿1 error we can expect by measuring that marginal. It consists of two terms: the 𝐿1 error under the current model minus the expected 𝐿1 error if it is measured at the current noise level (Theorem 5 in Appendix B). Compared to the quality score function in MWEM+PGM, this quality score function penalizes larger marginals to a much more significant degree, since 𝜎𝑡 ≫ 1 in most cases. Moreover, this modification makes the selection criteria "budget-adaptive", since it recognizes that we can afford to measure larger marginals when 𝜎𝑡 is smaller, and we should prefer smaller marginals when 𝜎𝑡 is larger.
Second, we give different marginal queries different weights to capture how relevant they are to the workload. In particular, we weight the quality score function for a marginal query 𝑟 using the formula 𝑤𝑟 = Σ_{𝑠∈𝑊} 𝑐𝑠 |𝑟 ∩ 𝑠|, as this captures the degree to which the marginal queries in the workload overlap with 𝑟. In general, this weighting scheme places more weight on marginals involving more attributes. Note that now the sensitivity of 𝑞𝑟 is 𝑤𝑟 rather than 1. When applying the exponential mechanism to select a candidate, we must either use Δ𝑡 = max_{𝑟∈𝐶𝑡} 𝑤𝑟, or invoke the generalized exponential mechanism instead, as it can handle quality score functions with varying sensitivity [39].
This quality score function exhibits an interesting trade-off: the penalty term √(2/𝜋) 𝜎𝑡 𝑛𝑟 discourages marginals with more cells, while the weight 𝑤𝑟 favors marginals with more attributes. However, if the inner expression is negative, then the larger weight will make it more negative, and much less likely to be selected.
Adaptive Rounds and Budget Split. In Lines 12 and 17 of AIM, we introduce logic to modify the per-round privacy budget as execution progresses, and as a result, eliminate the need to provide the number of rounds up front. This makes AIM hyper-parameter free, relieving practitioners from that often overlooked burden. Specifically, we use a simple annealing procedure (Algorithm 3) that gradually increases the budget per round when an insufficient amount of information is learned at the current per-round budget. The annealing condition is activated if the difference between 𝑀𝑟𝑡(𝑝̂𝑡) and 𝑀𝑟𝑡(𝑝̂𝑡−1) is small, which indicates that not much information was learned in the previous round. If it is satisfied, then 𝜖𝑡 for the select step is doubled, while 𝜎𝑡 for the measure step is cut in half.
This check can pass for two reasons: (1) there were no good candidates (all scores are low in Equation (1)), in which case decreasing 𝜎𝑡 will make more candidates good, and (2) there were good candidates, but they were not selected because there was too much noise in the select step, which can be remedied by increasing 𝜖𝑡. The precise annealing threshold used is √(2/𝜋) · 𝜎𝑡 · 𝑛𝑟𝑡, which is the expected error of the noisy marginal, and an approximation for the expected error of 𝑝̂𝑡 on marginal 𝑟𝑡. When the available privacy budget is small, this condition will be activated more frequently, and as a result, AIM will run for fewer rounds. Conversely, when the available privacy budget is large, AIM will run for many rounds before this condition activates.
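A sketch of the select step and the annealing rule just described is given below, with Equation (1) as the quality score. It uses the Gumbel-max implementation of the exponential mechanism with sensitivity Δ_t = max_r w_r (the generalized exponential mechanism mentioned above would be a drop-in replacement); the function and variable names are illustrative, not the released implementation.

import numpy as np

def quality_scores(D, p_hat, candidates, weights, sigma_t, marginal):
    # Equation (1): q_r(D) = w_r * ( ||M_r(D) - M_r(p_hat)||_1 - sqrt(2/pi) * sigma_t * n_r )
    scores = {}
    for r in candidates:
        err = np.abs(marginal(D, r) - marginal(p_hat, r)).sum()
        penalty = np.sqrt(2.0 / np.pi) * sigma_t * marginal(D, r).size
        scores[r] = weights[r] * (err - penalty)
    return scores

def select_marginal(scores, weights, eps_t, rng):
    # Exponential mechanism with sensitivity Delta_t = max_r w_r, via the Gumbel-max trick.
    delta_t = max(weights[r] for r in scores)
    keys = list(scores)
    noisy = [scores[r] * eps_t / (2.0 * delta_t) + rng.gumbel() for r in keys]
    return keys[int(np.argmax(noisy))]

def anneal(eps_t, sigma_t, improvement, n_rt, rho_remaining, alpha=0.9):
    # Algorithm 3: double the per-round budget when the last measurement changed the model
    # by less than the expected noise; spend everything in one final round if fewer than
    # two rounds of budget remain.
    if improvement <= np.sqrt(2.0 / np.pi) * sigma_t * n_rt:
        eps_t, sigma_t = 2.0 * eps_t, sigma_t / 2.0
    if rho_remaining <= 2.0 * (1.0 / (2.0 * sigma_t ** 2) + eps_t ** 2 / 8.0):
        eps_t = np.sqrt(8.0 * (1.0 - alpha) * rho_remaining)
        sigma_t = np.sqrt(1.0 / (2.0 * alpha * rho_remaining))
    return eps_t, sigma_t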
As 𝜎𝑡 decreases throughout execution, quality scores generally increase, and this has the effect of "unlocking" new candidates that previously had negative quality scores. We initialize 𝜎𝑡 and 𝜖𝑡 conservatively, assuming the mechanism will be run for 𝑇 = 16𝑑 rounds. This is an upper bound on the number of rounds that AIM will run, but in practice the number of rounds will be much less.
As in prior work [8, 55], we do not split the budget equally for the select and measure step, but rather allocate 10% of the budget for the select steps, and 90% of the budget for the measure steps. This is justified by the fact that the quality function for selection is a coarser-grained aggregation than a marginal, and as a result can tolerate a larger degree of noise.
Privacy Analysis. The privacy analysis of AIM utilizes the notion of a privacy filter [41], and the algorithm runs until the realized privacy budget spent matches the total privacy budget available, 𝜌. To ensure that the budget is not over-spent, there is a special condition (Line 8 in Algorithm 3) that checks if the remaining budget is insufficient for two rounds at the current 𝜖𝑡 and 𝜎𝑡 parameters. If this condition is satisfied, 𝜖𝑡 and 𝜎𝑡 are set to use up all of the remaining budget in one final round of execution.
Theorem 1. For any 𝑇 ≥ 𝑑, 0 < 𝛼 < 1, and 𝜌 ≥ 0, AIM satisfies 𝜌-zCDP.
Proof. There are three steps in AIM that depend on the sensitive data: initialization, selection, and measurement. The initialization step satisfies 𝜌0-zCDP for 𝜌0 = |{𝑟 ∈ 𝑊₊ | |𝑟| = 1}| / (2𝜎0²) ≤ 𝑑/(2𝜎0²) = 𝛼𝑑𝜌/𝑇 ≤ 𝜌. For this step, all we need is that the privacy budget is not over-spent. The remainder of AIM runs until the budget is consumed. Each step of AIM involves one invocation of the exponential mechanism, and one invocation of the Gaussian mechanism. By Propositions 1 to 3, round 𝑡 of AIM is 𝜌𝑡-zCDP for 𝜌𝑡 = 𝜖𝑡²/8 + 1/(2𝜎𝑡²). Note that at round 𝑡, 𝜌used = Σ_{𝑖=0}^{𝑡} 𝜌𝑖, and we need to show that 𝜌used never exceeds 𝜌 [41]. There are two cases to consider: the condition in Line 8 of Algorithm 3 is either true or false. If it is false, then we know after round 𝑡 that 𝜌 − 𝜌used ≥ 2𝜌𝑡+1, i.e., the remaining budget is enough to run round 𝑡 + 1 without over-spending the budget. If it is true, then we modify 𝜖𝑡+1 and 𝜎𝑡+1 to exactly use up the remaining budget. Specifically, 𝜌𝑡+1 = 8(1 − 𝛼)(𝜌 − 𝜌used)/8 + 2𝛼(𝜌 − 𝜌used)/2 = 𝜌 − 𝜌used. As a result, when the condition is true, 𝜌used at time 𝑡 + 1 is exactly 𝜌, and after that iteration, the main loop of AIM terminates. The remainder of the mechanism does not access the data. □

5 UNCERTAINTY QUANTIFICATION
In this section, we propose a solution to the uncertainty quantification problem for AIM. Our method uses information from both the noisy marginals, measured with Gaussian noise, and the marginal queries selected by the exponential mechanism. Importantly, the method does not require additional privacy budget, as it quantifies uncertainty only by analyzing the private outputs of AIM. We give guarantees for marginals in the (downward closure of the) workload, which is exactly the set of marginals the analyst cares about. We provide no guarantees for marginals outside this set, which is an area for future work.
We break our analysis up into two cases: the "easy" case, where we have access to unbiased answers for a particular marginal, and the "hard" case, where we do not. In both cases, we identify an estimator for a marginal whose error we can bound with high probability. Then, we connect the error of this estimator to the error of the synthetic data by invoking the triangle inequality. The subsequent paragraphs provide more details on this approach. Proofs of all statements in this section appear in Appendix B.
The Easy Case: Supported Marginal Queries. A marginal query 𝑟 is "supported" whenever 𝑟 ⊆ 𝑟𝑡 for some 𝑡. In this case, we can readily obtain an unbiased estimate of 𝑀𝑟(𝐷) from 𝑦̃𝑡, and analytically derive the variance of that estimate. If there are multiple 𝑡 satisfying the condition above, we have multiple estimates we can use to reduce the variance. We can combine these independent estimates to obtain a weighted average estimator:
Theorem 2 (Weighted Average Estimator). Let 𝑟1, . . . , 𝑟𝑡 and 𝑦̃1, . . . , 𝑦̃𝑡 be as defined in Algorithm 4, and let 𝑅 = {𝑟1, . . . , 𝑟𝑡}. For any 𝑟 ∈ 𝑅₊, there is an (unbiased) estimator 𝑦̄𝑟 = 𝑓𝑟(𝑦̃1, . . . , 𝑦̃𝑡) such that:
𝑦̄𝑟 ∼ N(𝑀𝑟(𝐷), 𝜎̄𝑟² I),  where  𝜎̄𝑟² = [ Σ_{𝑖=1, 𝑟⊆𝑟𝑖}^{𝑡} 𝑛𝑟 / (𝑛𝑟𝑖 𝜎𝑖²) ]⁻¹.
While this is not the only (or best) estimator to use,⁵ the simplicity allows us to easily bound its error, as we show in Theorem 3.
Theorem 3 (Confidence Bound). Let 𝑦̄𝑟 be the estimator from Theorem 2. Then, for any 𝜆 ≥ 0, with probability at least 1 − exp(−𝜆²):
‖𝑀𝑟(𝐷) − 𝑦̄𝑟‖₁ ≤ √(2 log 2) · 𝜎̄𝑟 𝑛𝑟 + 𝜆 𝜎̄𝑟 √(2𝑛𝑟)
Note that Theorem 3 gives a guarantee on the error of 𝑦̄𝑟, but we are ultimately interested in the error of 𝐷̂. Fortunately, it is easy to relate the two by using the triangle inequality, as shown below:
Corollary 1. Let 𝐷̂ be any synthetic dataset, and let 𝑦̄𝑟 be the estimator from Theorem 2. Then with probability at least 1 − exp(−𝜆²):
‖𝑀𝑟(𝐷) − 𝑀𝑟(𝐷̂)‖₁ ≤ ‖𝑀𝑟(𝐷̂) − 𝑦̄𝑟‖₁ + √(2 log 2) · 𝜎̄𝑟 𝑛𝑟 + 𝜆 𝜎̄𝑟 √(2𝑛𝑟)
The LHS is what we are interested in bounding, and we can readily compute the RHS from the output of AIM. The RHS is a random quantity that, with the stated probability, upper bounds the error. When we plug in the realized values we get a concrete numerical bound that can be interpreted as a (one-sided) confidence interval. In general, we expect 𝑀𝑟(𝐷̂) to be close to 𝑦̄𝑟, so the error bound for 𝐷̂ will not be that much larger than that of 𝑦̄𝑟.⁶
The Hard Case: Unsupported Marginal Queries. We now shift our attention to the hard case, providing guarantees about the error of different marginals even for unsupported marginal queries (those not selected during execution of AIM). This problem is significantly more challenging. Our key insight is that marginal queries not selected have relatively low error compared to the marginal queries that were selected. We can easily bound the error of selected queries and relate that to non-selected queries by utilizing the guarantees of the exponential mechanism. In Theorem 4 below, we provide expressions that capture the uncertainty of these marginals with respect to 𝑝̂𝑡−1, the iterates of AIM.
⁵ A better estimator would be the minimum variance linear unbiased estimator. Ding et al. [13] derive an efficient algorithm for computing this from noisy marginals.
⁶ From prior experience, we might expect the error of 𝐷̂ to be lower than the error of 𝑦̄𝑟 [37, 38], so we are paying for this difference by increasing the error bound when we might hope to save instead. Unfortunately, this intuition does not lend itself to a clear analysis that provides better guarantees.
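For supported marginals, the estimator of Theorem 2 and the bound of Theorem 3 are straightforward to compute from AIM's measurements. The sketch below assumes each stored measurement is a triple (r_i, ỹ_i, σ_i) and that a caller-supplied `project(y, r_i, r)` sums the noisy marginal on r_i down to the sub-marginal on r ⊆ r_i; both helper names are our own.

import numpy as np

def weighted_average_estimator(measurements, r, n_r, project):
    # Theorem 2: inverse-variance weighted combination of every noisy marginal y_i with
    # r ⊆ r_i, projected down to r.  Projecting M_{r_i}(D) + N(0, s_i^2 I) onto r sums
    # n_{r_i}/n_r cells, so each projection has per-cell variance s_i^2 * n_{r_i} / n_r.
    estimates, variances = [], []
    for (r_i, y_i, sigma_i) in measurements:
        if set(r) <= set(r_i):
            estimates.append(project(y_i, r_i, r))
            variances.append(sigma_i ** 2 * y_i.size / n_r)
    if not estimates:
        return None, None                       # r is unsupported
    inv = np.array([1.0 / v for v in variances])
    var_bar = 1.0 / inv.sum()                   # = sigma_bar_r^2 from Theorem 2
    y_bar = var_bar * sum(w * est for w, est in zip(inv, estimates))
    return y_bar, var_bar

def supported_error_bound(var_bar, n_r, lam):
    # Theorem 3: with probability >= 1 - exp(-lam^2),
    # ||M_r(D) - y_bar||_1 <= sqrt(2*log 2)*sigma_bar*n_r + lam*sigma_bar*sqrt(2*n_r).
    sigma_bar = np.sqrt(var_bar)
    return np.sqrt(2 * np.log(2)) * sigma_bar * n_r + lam * sigma_bar * np.sqrt(2 * n_r)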
Theorem 4 (Confidence Bound). Let 𝜎𝑡, 𝜖𝑡, 𝑟𝑡, 𝑦̃𝑡, 𝐶𝑡, 𝑝̂𝑡 be as defined in Algorithm 4, and let Δ𝑡 = max_{𝑟∈𝐶𝑡} 𝑤𝑟. For all 𝑟 ∈ 𝐶𝑡, with probability at least 1 − 𝑒^{−𝜆1²/2} − 𝑒^{−𝜆2}:
‖𝑀𝑟(𝐷) − 𝑀𝑟(𝑝̂𝑡−1)‖₁ ≤ 𝑤𝑟⁻¹ ( 𝐵𝑟 + 𝜆1 𝜎𝑡 √𝑛𝑟𝑡 + 𝜆2 · 2Δ𝑡/𝜖𝑡 ),
where 𝐵𝑟 is equal to:
𝑤𝑟𝑡 ‖𝑀𝑟𝑡(𝑝̂𝑡−1) − 𝑦̃𝑡‖₁   (estimated error on 𝑟𝑡)
+ √(2/𝜋) 𝜎𝑡 (𝑤𝑟 𝑛𝑟 − 𝑤𝑟𝑡 𝑛𝑟𝑡)   (relationship to non-selected candidates)
+ (2Δ𝑡/𝜖𝑡) log(|𝐶𝑡|)   (uncertainty from the exponential mechanism)
We can readily compute 𝐵𝑟 from the output of AIM, and use it to provide a bound on error in the form of a one-sided confidence interval that captures the true error with high probability. While these error bounds are expressed with respect to 𝑝̂𝑡−1, they can readily be extended to give a guarantee with respect to 𝐷̂.
Corollary 2. Let 𝐷̂ be any synthetic dataset, and let 𝐵𝑟 be as defined in Theorem 4. Then with probability at least 1 − 𝑒^{−𝜆1²/2} − 𝑒^{−𝜆2}:
‖𝑀𝑟(𝐷) − 𝑀𝑟(𝐷̂)‖₁ ≤ ‖𝑀𝑟(𝐷̂) − 𝑀𝑟(𝑝̂𝑡−1)‖₁ + 𝑤𝑟⁻¹ ( 𝐵𝑟 + 𝜆1 𝜎𝑡 √𝑛𝑟𝑡 + 𝜆2 · 2Δ𝑡/𝜖𝑡 )
Again, the LHS is what we are interested in bounding, and we can compute the RHS from the output of AIM. We expect 𝑝̂𝑡−1 to be reasonably close to 𝐷̂, especially when 𝑡 is larger, so this bound will often be comparable to the original bound on 𝑝̂𝑡−1.
Putting it Together. We've provided guarantees for both supported and unsupported marginals. The guarantees for unsupported marginals also apply for supported marginals, although we generally expect them to be looser. In addition, there is one guarantee for each round of AIM. It is tempting to use the bound that provides the smallest estimate, although unfortunately doing this invalidates the bound. To ensure a valid bound, we must pick only one round, and that cannot be decided based on the value of the bound. A natural choice is to use only the last round, for three reasons: (1) 𝜎𝑡 is smallest and 𝜖𝑡 is largest in that round, (2) the error of 𝑝̂𝑡 generally goes down with 𝑡, and (3) the distance between 𝑝̂𝑡 and 𝐷̂ should be the smallest in the last round. However, there may be some marginal queries which were not in the candidate set for that round. To bound the error on these marginals, we use the last round where that marginal query was in the candidate set.

6 EXPERIMENTS
In this section we empirically evaluate AIM, comparing it to a collection of state-of-the-art mechanisms and baseline mechanisms for a variety of workloads, datasets, and privacy levels.

6.1 Experimental Setup
Datasets. Our evaluation includes datasets with varying size and dimensionality. We describe our exact pre-processing scheme in Appendix A, and summarize the pre-processed datasets and their characteristics in the table below.

Table 2: Summary of datasets used in the experiments.
Dataset      | Records | Dimensions | Min/Max Domains | Total Domain Size
adult [28]   | 48842   | 15         | 2–42            | 4 × 10¹⁶
salary [24]  | 135727  | 9          | 3–501           | 1 × 10¹³
msnbc [7]    | 989818  | 16         | 18              | 1 × 10²⁰
fire [40]    | 305119  | 15         | 2–46            | 4 × 10¹⁵
nltcs [33]   | 21574   | 16         | 2               | 7 × 10⁴
titanic [17] | 1304    | 9          | 2–91            | 9 × 10⁷

Workloads. We consider 3 workloads for each dataset: all-3way, target, and skewed. Each workload contains a collection of 3-way marginal queries. The all-3way workload contains queries for all 3-way marginals. The target workload contains queries for all 3-way marginals involving some specified target attribute. For the adult and titanic datasets, these are the income>50K attribute and the Survived attribute, as those correspond to the attributes we are trying to predict for those datasets. For the other datasets, the target attribute is chosen uniformly at random. The skewed workload contains a collection of 3-way marginal queries biased towards certain attributes and attribute combinations. In particular, each attribute is assigned a weight sampled from a squared exponential distribution. 256 triples of attributes are sampled with probability proportional to the product of their weights. This results in workloads where certain attributes appear far more frequently than others, and is intended to capture the situation where analysts focus on a small number of interesting attributes. All randomness in the construction of the workload was done with a fixed random seed, to ensure that the workloads remain the same across executions of different mechanisms and parameter settings.
Mechanisms. We compare against both workload-agnostic and workload-aware mechanisms in this section. The workload-agnostic mechanisms we consider are PrivBayes+PGM, MST, and PrivMRF. The workload-aware mechanisms we consider are MWEM+PGM, RAP, GEM, and AIM. We set the hyper-parameters of every mechanism to default values available in their open source implementations. We also consider two baseline mechanisms: Independent and Gaussian. The former measures all 1-way marginals using the Gaussian mechanism, and generates synthetic data using an independence assumption. The latter answers all queries in the workload using the Gaussian mechanism (using the optimal privacy budget allocation described in [55]). Note that this mechanism does not generate synthetic data, only query answers.
Privacy Budgets. We consider a wide range of privacy parameters, varying 𝜖 ∈ [0.01, 100.0] and setting 𝛿 = 10⁻⁹. The most practical regime is 𝜖 ∈ [0.1, 10.0], but mechanism behavior at the extremes can be enlightening, so we include them as well.
Evaluation. For each dataset, workload, and 𝜖, we run each mechanism for 5 trials, and measure the workload error from Definition 2. We report the average workload error across the five trials, along with error bars corresponding to the minimum and maximum workload error observed across the five trials.
Runtime Environment. We ran most experiments on a single core of a compute cluster with a 4 GB memory limit and a 24 hour time limit.⁷ These resources were not sufficient to run PrivMRF or RAP, so we utilized different machines to run those mechanisms. PrivMRF requires a GPU to run, so we used one node of a different compute cluster, which has an Nvidia GeForce RTX 2080 Ti GPU. RAP required significant memory resources, so we ran those experiments on a machine with 16 cores and 64 GB of RAM.
⁷ These experiments usually completed in well under the time limit.
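For concreteness, the three workloads described in Section 6.1 could be constructed as in the sketch below. This is our own code, written to match the textual description; in particular, the squared-exponential attribute weights for skewed are implemented as squared draws from an exponential distribution, which is our reading of the text.

import numpy as np
from itertools import combinations

def all_3way(attrs):
    # queries for all 3-way marginals, each with weight 1
    return [(r, 1.0) for r in combinations(attrs, 3)]

def target_workload(attrs, target):
    # all 3-way marginals involving the specified target attribute
    others = [a for a in attrs if a != target]
    return [(tuple(sorted(r + (target,))), 1.0) for r in combinations(others, 2)]

def skewed_workload(attrs, k=256, seed=0):
    # attribute weights drawn from a squared exponential distribution; 256 triples
    # sampled with probability proportional to the product of their weights
    rng = np.random.default_rng(seed)
    w = rng.exponential(size=len(attrs)) ** 2
    triples = list(combinations(range(len(attrs)), 3))
    p = np.array([w[i] * w[j] * w[m] for i, j, m in triples])
    idx = rng.choice(len(triples), size=k, p=p / p.sum())
    return [(tuple(attrs[i] for i in triples[t]), 1.0) for t in idx]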
Figure 1: Workload error of competing mechanisms on the all-3way workload for 𝜖 = 0.01, . . . , 100. Panels: (a) Adult, (b) Salary, (c) MSNBC, (d) Fire, (e) NLTCS, (f) Titanic. x-axis: Epsilon; y-axis: Workload Error. Mechanisms shown: AIM, PrivMRF, MST, PrivBayes+PGM, MWEM+PGM, GEM, RAP, Independent, Gaussian.

6.2 all-3way Workload
Results on the all-3way workload are shown in Figure 1. Workload-aware mechanisms are shown by solid lines, while workload-agnostic mechanisms are shown with dotted lines. From these plots, we make the following observations:
(1) AIM consistently achieves competitive workload error, across all datasets and privacy regimes considered. On average, across all six datasets and nine privacy parameters, AIM improved over PrivMRF by a factor of 1.3×, MST by a factor of 8.4×, MWEM+PGM by a factor of 2.1×, PrivBayes+PGM by a factor of 2.6×, RAP by a factor of 9.5×, and GEM by a factor of 2.3×. In the most extreme cases, AIM improved over PrivMRF by a factor of 3.6×, MST by a factor of 118×, MWEM+PGM by a factor of 16×, PrivBayes+PGM by a factor of 14.7×, RAP by a factor of 47.1×, and GEM by a factor of 11.7×.
(2) Prior to AIM, PrivMRF was consistently the best performing mechanism, even outperforming all workload-aware mechanisms. The all-3way workload is one we expect workload-agnostic mechanisms like PrivMRF to perform well on, so it is interesting, but not surprising, that it outperforms workload-aware mechanisms in this setting.
(3) Prior to AIM, the best workload-aware mechanism varied for different datasets and privacy levels: MWEM+PGM was best in 65% of settings, GEM was best in 35% of settings,⁸ and RAP was best in 0% of settings. Including AIM, we observe that it is best in 85% of settings, followed by MWEM+PGM in 11% of settings and GEM in 4% of settings. Additionally, in the most interesting regime for practical deployment (𝜖 ≥ 1.0), AIM is best in 100% of settings.
⁸ We compare against a variant of GEM that selects an entire marginal query in each round. In results not shown, we also evaluated the variant that measures a single counting query, and found that this variant performs significantly worse.

6.3 target Workload
Results for the target workload are shown in Figure 2. For this workload, we expect workload-aware mechanisms to have a significant advantage over workload-agnostic mechanisms, since they are aware that marginals involving the target are inherently more important for this workload. From these plots, we make the following observations:
(1) All three high-level findings from the previous section are supported by these figures as well.
(2) Somewhat surprisingly, PrivMRF outperforms all workload-aware mechanisms prior to AIM on this workload. This is an impressive accomplishment for PrivMRF, and clearly highlights the suboptimality of existing workload-aware mechanisms like MWEM+PGM, GEM, and RAP. Even though
Figure 2: Workload error of competing mechanisms on the target workload for 𝜖 = 0.01, . . . , 100. Panels: (a) Adult, (b) Salary, (c) MSNBC, (d) Fire, (e) NLTCS, (f) Titanic. x-axis: Epsilon; y-axis: Workload Error.

PrivMRF is not workload-aware, it is clear from their paper that every detail of the mechanism was carefully thought out to make the mechanism work well in practice, which explains its impressive performance. While AIM did outperform PrivMRF again, the relative performance did not increase by a meaningful margin, offering a 1.4× improvement on average and a 4.6× improvement in the best case.

6.4 skewed Workload
Results for the skewed workload are shown in Figure 3. For this workload, we again expect workload-aware mechanisms to have a significant advantage over workload-agnostic mechanisms, since they are aware of the exact (biased) set of marginals used to judge utility. From these plots, we make the following observations:
(1) All four high-level findings from the previous sections are generally supported by these figures as well, with the following interesting exception:
(2) PrivMRF did not score well on salary, and while it was still generally the second best mechanism on the other datasets (again outperforming the workload-aware mechanisms in many cases), the improvement offered by AIM over PrivMRF is much larger for this workload, averaging a 2× improvement with up to a 5.7× improvement in the best case. We suspect that for this setting, workload-awareness is essential to achieve strong performance.

6.5 Tuning Model Capacity
In Line 12 of AIM (Algorithm 4), we construct a set of candidates to consider in the current round based on an upper limit on JT-SIZE. 80 MB was chosen to match prior work,⁹ but in general we can tune it as desired to strike the right accuracy / runtime trade-off. Unlike other hyper-parameters, there is no "sweet spot" for this one: setting larger model capacities should always make the mechanism perform better, at the cost of increased runtime. We demonstrate this trade-off empirically in Figure 4 (a-b). For 𝜖 = 0.1, 1, and 10, we considered model capacities ranging from 1.25 MB to 1.28 GB, and ran AIM on the fire dataset with the all-3way workload. Results are averaged over five trials, with error bars indicating the min/max runtime and workload error across those trials. Our main findings are listed below:
(1) As expected, runtime increases with model capacity, and workload error decreases with capacity. The case 𝜖 = 0.1 is an exception, where both plots level off beyond a capacity of 20 MB. This is because the capacity constraint is not active in this regime: AIM already favors small marginals when the available privacy budget is small by virtue of the quality score function for marginal query selection, so the model remains small even without the model capacity constraint.
(2) Using the default model capacity and 𝜖 = 1 resulted in a 9 hour runtime. We can slightly reduce error further, by about 13%, by increasing the model capacity to 1.28 GB and waiting 7 days. Conversely, we can reduce the model capacity to 5 MB, which increases error by about 75%, but takes less than one hour. The law of diminishing returns is at play.
⁹ Cai et al. [8] limit the size of the largest clique in the junction tree to have at most 10⁷ cells (80 MB with 8-byte floats), while we limit the overall size of the junction tree.
Figure 3: Workload error of competing mechanisms on the skewed workload for 𝜖 = 0.01, . . . , 100. Panels: (a) Adult, (b) Salary, (c) MSNBC, (d) Fire, (e) NLTCS, (f) Titanic. x-axis: Epsilon; y-axis: Workload Error.

Ultimately, the model capacity to use is a policy decision. In real-world deployments, it is certainly reasonable to spend additional computational time for even a small boost in utility.

6.6 Uncertainty Quantification
In this section, we demonstrate that our expressions for uncertainty quantification correctly bound the error, and evaluate how tight the bound is. For this experiment, we ran AIM on the fire dataset with the all-3way workload at 𝜖 = 10. In Figure 4 (c), we plot the true error of AIM on each marginal in the workload against the error bound predicted by our expressions. We set 𝜆 = 1.7 in Corollary 1, and 𝜆1 = 2.7, 𝜆2 = 3.7 in Corollary 2, which provides 95% confidence bounds. Our main findings are listed below:

(1) For all marginals in the (downward closure of the) workload, the error bound is always greater than the true error. This confirms the validity of the bounds, and suggests they are safe to use in practice. Note that even if some errors were above the bounds, that would not be inconsistent with our guarantee, as at a 95% confidence level, the bound could fail to hold 5% of the time. The fact that it does not suggests there is some looseness in the bound.

(2) The true errors and the error bounds vary considerably, ranging from 10^-4 all the way up to and beyond 1. In general, the supported marginals have both lower errors and lower error bounds than the unsupported marginals, which is not surprising. The error bounds are also tighter for the supported marginals. The median ratio between the error bound and the observed error is 4.4 for supported marginals and 8.3 for unsupported marginals. Intuitively, this makes sense because we know selected marginals should have higher error than non-selected marginals, but the error of a non-selected marginal can be far below that of the selected marginal (and hence the bound), which explains the larger gap between the actual error and our predicted bound.
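For concreteness, the following sketch shows how the bound for a supported marginal could be computed from the quantities AIM releases, following Theorems 2 and 3 of Appendix B.1. It is only an illustrative sketch: the function name and example numbers are ours, and 𝜆 = 1.7 corresponds to the roughly 95% level used above.

    import math

    def supported_marginal_bound(n_r, measurements, lam=1.7):
        # n_r          : number of cells in the supported marginal r
        # measurements : list of (n_ri, sigma_i) over measured cliques r_i with r a subset of r_i
        # lam          : the bound holds with probability at least 1 - exp(-lam**2)
        # Theorem 2: inverse-variance weighting of the marginalized noisy measurements.
        precision = sum(n_r / (n_ri * sigma_i ** 2) for n_ri, sigma_i in measurements)
        sigma_bar = math.sqrt(1.0 / precision)
        # Theorem 3: sqrt(2 log 2) * sigma_bar * n_r + lam * sigma_bar * sqrt(2 * n_r).
        return (math.sqrt(2 * math.log(2)) * sigma_bar * n_r
                + lam * sigma_bar * math.sqrt(2 * n_r))

    # A 2-way marginal with 100 cells, measured once directly (sigma = 10)
    # and once inside a 3-way marginal with 1,000 cells (sigma = 20).
    print(supported_marginal_bound(100, [(100, 10.0), (1000, 20.0)]))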
7 LIMITATIONS AND OPEN PROBLEMS
In this work, we have carefully studied the problem of workload-aware synthetic data generation under differential privacy, and proposed a new mechanism for this task. Our work significantly improves over prior work, although the problem remains far from solved, and there are a number of promising avenues for future work in this space. We enumerate some of the limitations of AIM below, and identify potential future research directions.

Handling More General Workloads. In this work, we restricted our attention to the special-but-common case of weighted marginal query workloads. Even in this special case, there are many facets to the problem and nuances to our solution. Designing mechanisms that work for the more general class of linear queries (perhaps defined over the low-dimensional marginals) remains an important open problem. While the prior works MWEM+PGM, RAP, and GEM can handle workloads of this form, they achieve this by selecting a single counting query in each round, rather than a full marginal query, and thus there is likely significant room for improvement.
[Figure 4 has three panels: (a) Workload Error vs. Model Capacity and (b) Runtime vs. Model Capacity, each plotting against the maximum model size in MB for 𝜖 = 0.1, 1.0, and 10.0 (with the 1 hour, 9 hours, and 7 days settings annotated), and (c) True Error vs. Error Bound, plotting the true error against the estimated error bound for supported and unsupported marginals.]

Figure 4

Beyond linear query workloads, other workloads of interest include more abstract objectives like machine learning efficacy and other non-linear query workloads. These metrics have been used to evaluate the quality of workload-agnostic synthetic data mechanisms, but have not been provided as input to the mechanisms themselves. In principle, if we know we want to run a given machine learning model on the synthetic dataset, we should be able to tailor the synthetic data to provide high utility on that model.

Handling Mixed Data Types. In this work, we assumed the input data was discrete, and that each attribute had a finite domain with a reasonably small number of possible values. Data with numerical attributes must be appropriately discretized before running AIM. The quality of the discretization could have a significant impact on the quality of the generated synthetic data. Designing mechanisms that appropriately handle mixed (categorical and numerical) data types is an important problem. There may be more to this problem than meets the eye: a new definition of a workload and utility metric may be in order, and new types of measurements and post-processing techniques may be necessary to handle numerical data. Note that some mechanisms, like GAN-based mechanisms, expect numerical data as input, and categorical data must be one-hot encoded prior to usage. While they do handle numerical data, their utility is often not competitive with even the simplest marginal-based mechanisms we considered in this work [44].

Utilizing Public Data. A promising avenue for future research is to design synthetic data mechanisms that incorporate public data in a principled way. There are many places in which public data can be naturally incorporated into AIM, and exploring these ideas is a promising way to boost the utility of AIM in real-world settings where public data is available. Early work on this problem includes [31, 32, 35], but these solutions leave room for improvement.

8 RELATED WORK
In Section 3 we focused our discussion on marginal-based mechanisms in the select-measure-generate paradigm. While this is a popular approach, it is not the only way to generate differentially private synthetic data. In this section we provide a brief discussion of other methods, and a broad overview of other relevant work.

One prominent approach is based on differentially private GANs. Several architectures and private learning procedures have been proposed under this paradigm [1, 4, 18, 27, 43, 45, 46, 49, 54]. Despite their popularity, we are not aware of evidence that these GAN-based mechanisms outperform even baseline marginal-based mechanisms like PrivBayes on structured tabular data. Most empirical evaluations of GAN-based mechanisms exclude PrivBayes, and the comparisons that we are aware of show the opposite effect: that marginal-based mechanisms outperform the competition [8, 35, 44]. GAN-based methods may be better suited for different data modalities, like image or text data.

One exception is CT-GAN [51], which is an algorithm for synthetic data that does compare against PrivBayes and does outperform it in roughly 85% of the datasets and metrics they considered. However, this method does not satisfy or claim to satisfy differential privacy, and gives no formal privacy guarantee to the individuals who contribute data. Nevertheless, an empirical comparison between CT-GAN and newer methods for synthetic data, like AIM and PrivMRF, would be interesting, since these mechanisms also outperformed PrivBayes+PGM in nearly every tested situation, and PrivBayes+PGM outperforms PrivBayes most of the time as well [8, 37]. Differentially private implementations of CT-GAN have been proposed, but empirical evaluations of the method suggest it is not competitive with PrivBayes [42, 44].

ACKNOWLEDGMENTS
This work was supported by the National Science Foundation under grants IIS-1749854 and CNS-1954814, and by Oracle Labs, part of Oracle America, through a gift to the University of Massachusetts Amherst in support of academic research.
REFERENCES [19] Chang Ge, Shubhankar Mohapatra, Xi He, and Ihab F. Ilyas. 2021. Kamino:
[1] Nazmiye Ceren Abay, Yan Zhou, Murat Kantarcioglu, Bhavani M. Thuraisingham, Constraint-Aware Differentially Private Data Synthesis. Proceedings of the VLDB
and Latanya Sweeney. 2018. Privacy Preserving Synthetic Data Release Using Endowment 14, 10 (2021), 1886–1899. http://www.vldb.org/pvldb/vol14/p1886-
Deep Learning. In Machine Learning and Knowledge Discovery in Databases - Euro- ge.pdf
pean Conference, ECML PKDD 2018, Dublin, Ireland, September 10-14, 2018, Proceed- [20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
ings, Part I (Lecture Notes in Computer Science), Michele Berlingerio, Francesco Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial
Bonchi, Thomas Gärtner, Neil Hurley, and Georgiana Ifrim (Eds.), Vol. 11051. nets. Advances in neural information processing systems 27 (2014).
Springer, 510–526. https://doi.org/10.1007/978-3-030-10925-7_31 [21] William H Greene. 2003. Econometric analysis. Pearson Education India.
[2] Hassan Jameel Asghar, Ming Ding, Thierry Rakotoarivelo, Sirine Mrabet, [22] Moritz Hardt, Katrina Ligett, and Frank McSherry. 2012. A Simple and Practical Al-
and Mohamed Ali Kâafar. 2019. Differentially Private Release of High- gorithm for Differentially Private Data Release. In Advances in Neural Information
Dimensional Datasets using the Gaussian Copula. CoRR abs/1902.01499 (2019). Processing Systems 25: 26th Annual Conference on Neural Information Processing
arXiv:1902.01499 http://arxiv.org/abs/1902.01499 Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada,
[3] Sergul Aydore, William Brown, Michael Kearns, Krishnaram Kenthapadi, Luca United States, Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges,
Melis, Aaron Roth, and Ankit A Siva. 2021. Differentially Private Query Release Léon Bottou, and Kilian Q. Weinberger (Eds.). 2348–2356. https://proceedings.
Through Adaptive Projection. In Proceedings of the 38th International Conference neurips.cc/paper/2012/hash/208e43f0e45c4c78cafadb83d2888cb6-Abstract.html
on Machine Learning (Proceedings of Machine Learning Research), Marina Meila [23] Joachim Hartung, Guido Knapp, Bimal K Sinha, and Bimal K Sinha. 2008. Statis-
and Tong Zhang (Eds.), Vol. 139. PMLR, 457–467. https://proceedings.mlr.press/ tical meta-analysis with applications. Vol. 6. Wiley Online Library.
v139/aydore21a.html [24] Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Yan Chen, and Dan
[4] Brett K Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P Zhang. 2016. Principled evaluation of differentially private algorithms using
Bhavnani, James Brian Byrd, and Casey S Greene. 2019. Privacy-preserving dpbench. In Proceedings of the 2016 International Conference on Management of
generative deep neural networks support clinical data sharing. Circulation: Data. 139–154.
Cardiovascular Quality and Outcomes 12, 7 (2019), e005122. https://doi.org/10. [25] Zhiqi Huang, Ryan McKenna, George Bissias, Gerome Miklau, Michael Hay, and
1161/CIRCOUTCOMES.118.005122 Ashwin Machanavajjhala. [n.d.]. PSynDB: accurate and accessible private data
[5] Vincent Bindschaedler, Reza Shokri, and Carl A. Gunter. 2017. Plausible Deniabil- generation. VLDB Demo ([n. d.]).
ity for Privacy-Preserving Data Synthesis. Proceedings of the VLDB Endowment [26] Norman L Johnson, Adrienne W Kemp, and Samuel Kotz. 2005. Univariate discrete
10, 5 (2017), 481–492. https://doi.org/10.14778/3055540.3055542 distributions. Vol. 444. John Wiley & Sons.
[6] Mark Bun and Thomas Steinke. 2016. Concentrated Differential Privacy: Simpli- [27] James Jordon, Jinsung Yoon, and Mihaela van der Schaar. 2019. PATE-GAN: Gen-
fications, Extensions, and Lower Bounds. In Theory of Cryptography Conference. erating Synthetic Data with Differential Privacy Guarantees. In 7th International
Springer, 635–658. https://doi.org/10.1007/978-3-662-53641-4_24 Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May
[7] Igor Cadez, David Heckerman, Christopher Meek, Padhraic Smyth, and Steven 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=S1zk9iRqF7
White. 2000. Visualization of navigation patterns on a web site using model-based [28] Ron Kohavi et al. 1996. Scaling up the accuracy of naive-bayes classifiers: A
clustering. In Proceedings of the sixth ACM SIGKDD international conference on decision-tree hybrid.. In Kdd, Vol. 96. 202–207.
Knowledge discovery and data mining. 280–284. [29] Haoran Li, Li Xiong, and Xiaoqian Jiang. 2014. Differentially Private Synthesiza-
[8] Kuntai Cai, Xiaoyu Lei, Jianxin Wei, and Xiaokui Xiao. 2021. Data synthesis via tion of Multi-Dimensional Data using Copula Functions. In Proceedings of the 17th
differentially private markov random fields. Proceedings of the VLDB Endowment International Conference on Extending Database Technology, EDBT 2014, Athens,
14, 11 (2021), 2190–2202. Greece, March 24-28, 2014, Sihem Amer-Yahia, Vassilis Christophides, Anastasios
[9] Clément L. Canonne, Gautam Kamath, and Thomas Steinke. 2020. The Discrete Kementsietsidis, Minos N. Garofalakis, Stratos Idreos, and Vincent Leroy (Eds.).
Gaussian for Differential Privacy. In NeurIPS. https://proceedings.neurips.cc/ OpenProceedings.org, 475–486. https://doi.org/10.5441/002/edbt.2014.43
paper/2020/hash/b53b3a3d6ab90ce0268229151c9bde11-Abstract.html [30] Fang Liu. 2016. Model-based differentially private data synthesis. arXiv preprint
[10] Mark Cesar and Ryan Rogers. 2021. Bounding, Concentrating, and Truncating: arXiv:1606.08052 (2016). https://arxiv.org/abs/1606.08052
Unifying Privacy Loss Composition for Data Analytics. In Proceedings of the [31] Terrance Liu, Giuseppe Vietri, Thomas Steinke, Jonathan R. Ullman, and Zhi-
32nd International Conference on Algorithmic Learning Theory (Proceedings of wei Steven Wu. 2021. Leveraging Public Data for Practical Private Query Release.
Machine Learning Research), Vitaly Feldman, Katrina Ligett, and Sivan Sabato In ICML. 6968–6977. http://proceedings.mlr.press/v139/liu21w.html
(Eds.), Vol. 132. PMLR, 421–457. https://proceedings.mlr.press/v132/cesar21a. [32] Terrance Liu, Giuseppe Vietri, and Steven Wu. 2021. Iterative Methods for
html Private Synthetic Data: Unifying Framework and New Methods. In Advances in
[11] Anne-Sophie Charest. 2011. How Can We Analyze Differentially-Private Syn- Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and
thetic Datasets? Journal of Privacy and Confidentiality 2, 2 (2011). https: J. Wortman Vaughan (Eds.).
//doi.org/10.29012/jpc.v2i2.589 [33] Kenneth G. Manton. 2010. National Long-Term Care Survey: 1982, 1984, 1989,
[12] Rui Chen, Qian Xiao, Yu Zhang, and Jianliang Xu. 2015. Differentially private 1994, 1999, and 2004.
high-dimensional data publication via sampling-based inference. In Proceedings [34] Ryan McKenna and Terrance Liu. 2022. A simple recipe for private synthetic
of the 21th ACM SIGKDD International Conference on Knowledge Discovery and data generation. DifferentialPrivacy.org
Data Mining. ACM, 129–138. https://doi.org/10.1145/2783258.2783379 [35] Ryan McKenna, Gerome Miklau, and Daniel Sheldon. 2021. Winning the NIST
[13] Bolin Ding, Marianne Winslett, Jiawei Han, and Zhenhui Li. 2011. Differentially Contest: A scalable and general approach to differentially private synthetic data.
private data cubes: optimizing noise sources and consistency. In Proceedings of Journal of Privacy and Confidentiality 11, 3 (2021).
the ACM SIGMOD International Conference on Management of Data, SIGMOD [36] Ryan McKenna, Siddhant Pradhan, Daniel R Sheldon, and Gerome Miklau. 2021.
2011, Athens, Greece, June 12-16, 2011, Timos K. Sellis, Renée J. Miller, Anastasios Relaxed Marginal Consistency for Differentially Private Query Answering. Ad-
Kementsietsidis, and Yannis Velegrakis (Eds.). ACM, 217–228. https://doi.org/10. vances in Neural Information Processing Systems 34 (2021).
1145/1989323.1989347 [37] Ryan McKenna, Daniel Sheldon, and Gerome Miklau. 2019. Graphical-model
[14] Irit Dinur and Kobbi Nissim. 2003. Revealing information while preserving based estimation and inference for differential privacy. In International Conference
privacy. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART on Machine Learning. 4435–4444. http://proceedings.mlr.press/v97/mckenna19a.
Symposium on Principles of Database Systems, June 9-12, 2003, San Diego, CA, html
USA, Frank Neven, Catriel Beeri, and Tova Milo (Eds.). ACM, 202–210. https: [38] Aleksandar Nikolov, Kunal Talwar, and Li Zhang. 2013. The geometry of differ-
//doi.org/10.1145/773153.773173 ential privacy: the sparse and approximate cases. In Proceedings of the forty-fifth
[15] Cynthia Dwork, Frank McSherry Kobbi Nissim, and Adam Smith. 2006. Cali- annual ACM symposium on Theory of computing. 351–360.
brating Noise to Sensitivity in Private Data Analysis. In TCC. 265–284. https: [39] Sofya Raskhodnikova and Adam Smith. 2016. Lipschitz extensions for node-
//doi.org/10.29012/jpc.v7i3.405 private graph statistics and the generalized exponential mechanism. In 2016
[16] James S Frame. 1945. Mean deviation of the binomial distribution. The American IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). IEEE,
Mathematical Monthly 52, 7 (1945), 377–379. 495–504.
[17] Thomas Cason Frank E. Harrell Jr. [n.d.]. Encyclopedia Titanica. [40] Diane Ridgeway, Mary Theofanos, Terese Manley, Christine Task, et al. 2021.
[18] Lorenzo Frigerio, Anderson Santana de Oliveira, Laurent Gomez, and Patrick Challenge Design and Lessons Learned from the 2018 Differential Privacy Chal-
Duverger. 2019. Differentially Private Generative Adversarial Networks for Time lenges. (2021).
Series, Continuous, and Discrete Open Data. In ICT Systems Security and Privacy [41] Ryan M Rogers, Aaron Roth, Jonathan Ullman, and Salil Vadhan. 2016. Pri-
Protection - 34th IFIP TC 11 International Conference, SEC 2019, Lisbon, Portugal, vacy odometers and filters: Pay-as-you-go composition. Advances in Neural
June 25-27, 2019, Proceedings (IFIP Advances in Information and Communication Information Processing Systems 29 (2016), 1921–1929.
Technology), Gurpreet Dhillon, Fredrik Karlsson, Karin Hedström, and André [42] Lucas Rosenblatt, Xiaoyan Liu, Samira Pouyanfar, Eduardo de Leon, Anuj Desai,
Zúquete (Eds.), Vol. 562. Springer, 151–164. https://doi.org/10.1007/978-3-030- and Joshua Allen. 2020. Differentially private synthetic data: Applied evaluations
22312-0_11 and enhancements. arXiv preprint arXiv:2011.05537 (2020).

[43] Uthaipon Tantipongpipat, Chris Waites, Digvijay Boob, Amaresh Ankit Siva, and
Rachel Cummings. 2019. Differentially Private Mixed-Type Data Generation
For Unsupervised Learning. CoRR abs/1912.03250 (2019). arXiv:1912.03250
http://arxiv.org/abs/1912.03250
[44] Yuchao Tao, Ryan McKenna, Michael Hay, Ashwin Machanavajjhala, and Gerome
Miklau. 2021. Benchmarking Differentially Private Synthetic Data Generation Al-
gorithms. Third AAAI Privacy-Preserving Artificial Intelligence (PPAI-22) workshop
(2021).
[45] Amirsina Torfi, Edward A Fox, and Chandan K Reddy. 2022. Differentially private
synthetic medical data generation using convolutional gans. Information Sciences
586 (2022), 485–500.
[46] Reihaneh Torkzadehmahani, Peter Kairouz, and Benedict Paten. 2019. DP-CGAN:
Differentially Private Synthetic Data and Label Generation. In IEEE Conference on
Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long
Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 98–104.
https://doi.org/10.1109/CVPRW.2019.00018
[47] Michail Tsagris, Christina Beneki, and Hossein Hassani. 2014. On the folded
normal distribution. Mathematics 2, 1 (2014), 12–28.
[48] Giuseppe Vietri, Grace Tian, Mark Bun, Thomas Steinke, and Zhiwei Steven Wu.
2020. New Oracle-Efficient Algorithms for Private Synthetic Data Release. In
Proceedings of the 37th International Conference on Machine Learning, ICML 2020,
13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research), Vol. 119.
PMLR, 9765–9774. http://proceedings.mlr.press/v119/vietri20b.html
[49] Liyang Xie, Kaixiang Lin, Shu Wang, Fei Wang, and Jiayu Zhou. 2018. Differ-
entially Private Generative Adversarial Network. CoRR abs/1802.06739 (2018).
arXiv:1802.06739 http://arxiv.org/abs/1802.06739
[50] Chugui Xu, Ju Ren, Yaoxue Zhang, Zhan Qin, and Kui Ren. 2017. DPPro: Differen-
tially Private High-Dimensional Data Release via Random Projection. IEEE
Transactions on Information Forensics and Security 12, 12 (2017), 3081–3093.
https://doi.org/10.1109/TIFS.2017.2737966
[51] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni.
2019. Modeling Tabular data using Conditional GAN. Advances in Neural
Information Processing Systems 32 (2019), 7335–7345.
[52] Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and
Xiaokui Xiao. 2017. PrivBayes: Private Data Release via Bayesian Networks.
ACM Transactions on Database Systems (TODS) 42, 4 (2017), 25:1–25:41. https:
//doi.org/10.1145/3134428
[53] Wei Zhang, Jingwen Zhao, Fengqiong Wei, and Yunfang Chen. 2019. Differentially
Private High-Dimensional Data Publication via Markov Network. EAI Endorsed
Trans. Security Safety 6, 19 (2019), e4. https://doi.org/10.4108/eai.29-7-2019.
159626
[54] Xinyang Zhang, Shouling Ji, and Ting Wang. 2018. Differentially private releasing
via deep generative model (technical report). arXiv preprint arXiv:1801.01594
(2018). https://arxiv.org/abs/1801.01594
[55] Zhikun Zhang, Tianhao Wang, Ninghui Li, Jean Honorio, Michael Backes, Shibo
He, Jiming Chen, and Yang Zhang. 2021. PrivSyn: Differentially Private Data
Synthesis. In 30th USENIX Security Symposium (USENIX Security 21). USENIX
Association, 929–946. https://www.usenix.org/conference/usenixsecurity21/
presentation/zhang-zhikun

A DATA PREPROCESSING Theorem 5. Let 𝑥 ∼ 𝑁 (0, 𝜎 2 )𝑛 , then:
We apply consistent preprocessing to all datasets in our empirical √︁
E[∥𝑥 ∥ 1 ] = 2/𝜋𝑛𝜎
evaluation. There are three steps to our preprocessing procedure,
described below: and
√︁ √
Attribute selection. For each dataset, we identify a set of at- Pr[∥𝑥 ∥ 1 ≥ 2 log 2𝜎𝑛 + 𝑐𝜎 2𝑛] ≤ exp (−𝑐 2 )
tributes to keep. For the adult, salary, nltcs, and titanic datasets,
we keep all attributes from the original data source. For the fire Proof. First observe that√︁|𝑥𝑖 | is a sample from a half-normal
dataset, we drop the 15 attributes relating to incident times, since af- distribution. Thus, E[𝑥𝑖 ] = 2/𝜋𝜎. From the linearity of expec-
√︁
ter discretization, they contain redundant information. The msnbc tation, we obtain E[∥𝑥 ∥ 1 ] = 2/𝜋𝜎𝑛, as desired. For the second
dataset is a streaming dataset, where each row has a different num- statement, we begin by deriving the moment generating function
ber of entries. We keep only the first 16 entries for each row. of the random variable |𝑥𝑖 |. By definition, we have:
∫ ∞
Domain identification. Usually we expect the domain to be E[exp (𝑡 · |𝑥𝑖 |)] = 𝜙 (𝑧) exp (𝑡 · |𝑧|)𝑑𝑧
supplied separately from the data file. For example, the IPUMS −∞
∫ ∞
website contains comprehensive documentation about U.S. Census
=2 𝜙 (𝑧) exp (𝑡 · 𝑧)𝑑𝑧
data products. However, for the datasets we used, no such domain 0
file was available. Thus, we “cheat” and look at the active domain ∫ ∞  𝑧2 
1
to automatically derive a domain file from the dataset. For each =2 √ exp − 2 exp (𝑡 · 𝑧)𝑑𝑧
0 𝜎 2𝜋 2𝜎
attribute, we identify if it is categorical or numerical. For each cate-

√︂ ∫
gorical attribute, we list the set of observed values (including null) 1 2  𝑧2 
= exp − 2 + 𝑡 · 𝑧 𝑑𝑧
for that attribute, which we treat as the set of possible values for 𝜎 𝜋 0 2𝜎
that attribute. For each numerical attribute, we record the minimum  𝜎 2𝑡 2    𝑡𝜎  
and maximum observed value for that attribute. = exp Φ √ +1
2 2
Discretization. We discretize each numerical attribute into 32
equal-width bins, using the min/max values from the domain file. Í
Moreover, since ∥𝑥 ∥ 1 = 𝑛𝑖=1 |𝑥𝑖 | is a sum of i.i.d random variables,
This turns each numerical attribute into a categorical attribute,
the moment generating function of ∥𝑥 ∥ 1 is:
satisfying our assumption.
 𝜎 2𝑡 2  𝑛   𝑡𝜎  𝑛
B UNCERTAINTY QUANTIFICATION PROOFS E[exp (𝑡 · ∥𝑥 ∥ 1 )] = exp Φ √ +1
2 2
B.1 The Easy Case: Supported Marginals From the Chernoff bound, we have
Theorem 2 (Weighted Average Estimator). Let 𝑟 1, . . . , 𝑟𝑡 and 𝑦1, . . . , 𝑦𝑡
E[exp (𝑡 · ∥𝑥 ∥ 1 )]
be as defined in Algorithm 4, and let 𝑅 = {𝑟 1, . . . , 𝑟𝑡 }. For any 𝑟 ∈ 𝑅+ , Pr[∥𝑥 ∥ 1 ≥ 𝑎] ≤ min
there is an (unbiased) estimator 𝑦¯𝑟 = 𝑓𝑟 (𝑦1, . . . , 𝑦𝑡 ) such that: 𝑡 ≥0 exp (𝑡𝑎)
 𝑛𝜎 2𝑡 2    𝑡𝜎  𝑛
𝑡
h ∑︁ 𝑛𝑟 i −1 = min exp − 𝑡𝑎 Φ √ + 1
𝑦¯𝑟 ∼ N (𝑀𝑟 (𝐷), 𝜎¯𝑟2 I) where 𝜎¯𝑟2 = , 𝑡 ≥0 2 2
2
𝑖=1 𝑛𝑟𝑖 𝜎𝑖  𝑛𝜎 2𝑡 2 
𝑟 ⊆𝑟𝑖 ≤ min 2𝑛 exp − 𝑡𝑎
𝑡 ≥0 2
Proof. For each 𝑟𝑖 ⊇ 𝑟 , we observe 𝑦˜𝑖 ∼ 𝑀𝑟𝑖 (𝐷) + N (0, 𝜎𝑖2 I).  𝑛𝜎 2 (𝑎/𝑛𝜎 2 ) 2 
We can use this noisy marginal to obtain an unbiased estimate ≤ 2𝑛 exp − (𝑎/𝑛𝜎 2 )𝑎
2
𝑀𝑟 (𝐷) by marginalizing out attributes in the set 𝑟𝑖 \𝑟 . This requires  𝑎2 𝑎2 
summing up 𝑛𝑟𝑖 /𝑛𝑟 cells, so the variance in each cell becomes = 2𝑛 exp −
2𝑛𝜎 2 𝑛𝜎 2
𝑛𝑟𝑖 𝜎𝑖2 /𝑛𝑟 . Moreover, the noise is still normally distributed, since  𝑎2 
the sum of independent normal random variables is normal. We = 2𝑛 exp −
thus have such an estimate for each 𝑖 satisfying 𝑟𝑖 ⊇ 𝑟 , and we 2𝑛𝜎 2
can combine these independent estimates using inverse variance
 𝑎2 
= exp − + 𝑛 log 2
weighting [23], resulting in an unbiased estimator with the stated 2𝑛𝜎 2
variance. For the same reason as before, the noise is still normally With some further manipulation of the bound, we obtain:
distributed. □ √   √
Pr[∥𝑥 ∥ 1 ≥ 𝑑𝜎 2𝑛] ≤ exp − 𝑑 2 + 𝑛 log 2 (𝑎 = 𝑑𝜎 2𝑛)
Theorem 3 (Confidence Bound). Let 𝑦¯𝑟 be the estimator from The- √︁ √ √︁
orem 2. Then, for any 𝜆 ≥ 0, with probability at least 1 − exp (−𝜆 2 ): Pr[∥𝑥 ∥ 1 ≥ (𝑐 + 𝑛 log 2)𝜎 2𝑛] ≤ exp (−𝑐 2 ) (𝑑 = 𝑐 + 𝑛 log 2)
√︁ √ √︁ √
∥𝑀𝑟 (𝐷) − 𝑦¯𝑟 ∥ 1 ≤ 2 log 2𝜎¯𝑟 𝑛𝑟 + 𝜆𝜎¯𝑟 2𝑛𝑟 Pr[∥𝑥 ∥ 1 ≥ 2 log 2𝜎𝑛 + 𝑐𝜎 2𝑛] ≤ exp (−𝑐 2 )

Proof. Noting that 𝑀𝑟 (𝐷) − 𝑦¯ ∼ N (0, 𝜎 2 I), the statement is a


direct consequence of Theorem 5, below. □ □
B.2 The Hard Case: Unsupported Marginals 𝑀𝑖 (−𝑡). For simplicity, let 𝜇 = |𝑎𝑖 − 𝑏𝑖 |. We have:
Theorem 4 (Confidence Bound). Let 𝜎𝑡 , 𝜖𝑡 , 𝑟𝑡 , 𝑦˜𝑡 , 𝐶𝑡 , 𝑝ˆ𝑡 be as de-  𝜎 2𝑡 2 
fined in Algorithm 4, and let Δ𝑡 = max𝑟 ∈𝐶𝑡 𝑤𝑟 . For all 𝑟 ∈ 𝐶𝑡 , with 𝑀𝑖 (−𝑡) = exp − 𝜇𝑡 Φ(𝜇/𝜎 − 𝜎𝑡)
2
probability at least 1 − 𝑒 −𝜆1 /2 − 𝑒 −𝜆2 :
2
 𝜎 2𝑡 2 
+ exp + 𝜇𝑡 Φ(−𝜇/𝜎 − 𝜎𝑡)
2
√ 2Δ𝑡   𝜎 2𝑡 2
∥𝑀𝑟 (𝐷) − 𝑀𝑟 (𝑝ˆ𝑡 −1 )∥ 1 ≤ 𝑤𝑟−1 𝐵𝑟 + 𝜆1 𝜎𝑡 𝑛𝑟𝑡 + 𝜆2

𝜖𝑡 = exp − 𝜇𝑡 (1 − Φ(−𝜇/𝜎 + 𝜎𝑡))
2
 𝜎 2𝑡 2 
where 𝐵𝑟 is equal to: + exp + 𝜇𝑡 Φ(−𝜇/𝜎 − 𝜎𝑡)
2
 𝜎 2𝑡 2 
√︁  2Δ𝑡
𝑤𝑟𝑡 𝑀𝑟𝑡 (𝑝ˆ𝑡 −1 ) − 𝑦𝑡 1 + 2/𝜋𝜎𝑡 𝑤𝑟 𝑛𝑟 − 𝑤𝑟𝑡 𝑛𝑟𝑡 + log (|𝐶𝑡 |) = exp − 𝜇𝑡
2
| {z } | {z } 𝜖𝑡  𝜎 2𝑡 2 
− 𝜇𝑡 Φ(−𝜇/𝜎 + 𝜎𝑡)
| {z }
estimated error on 𝑟𝑡 relationship to − exp
non-selected candidates uncertainty from 2
exponential mech.  𝜎 2𝑡 2 
+ exp + 𝜇𝑡 Φ(−𝜇/𝜎 − 𝜎𝑡)
2
Proof. By the guarantees of the exponential mechanism, we  𝜎 2𝑡 2 
≤ exp − 𝜇𝑡 (Lemma 1 below; 𝑎 = 𝜎𝑡, 𝑏 = 𝜇/𝜎)
know that, with probability at most 𝑒 −𝜆2 , for all 𝑟 ∈ 𝐶𝑡 we have: 2

2Δ𝑡
𝑞𝑟 𝑡 ≤ 𝑞𝑟 − (log (|𝐶𝑡 |) + 𝜆2 ) We are now ready to plug this result into the Chernoff bound, which
𝜖𝑡 states:
Now define 𝐸𝑟 = ∥𝑀𝑟 (𝐷) − 𝑀𝑟 (𝑝𝑡 −1 )∥ 1 . Plugging in 𝑞𝑟 = 𝑤𝑟 (𝐸𝑟 − Pr[∥𝑎 − 𝑐 ∥ 1 ≤ 𝑟 ] ≤ min exp (𝑡 · 𝑟 )𝑀 (−𝑡)
√︁ 𝑡 ≥0
2/𝜋𝜎𝑡 𝑛𝑟 ) and rearranging gives: Ö  𝜎 2𝑡 2 
≤ min exp (𝑡 · 𝑟 ) exp − |𝑎𝑖 − 𝑏𝑖 |𝑡
𝑡 ≥0 2
𝑖
√︁
𝑤𝑟𝑡 (𝐸𝑟𝑡 − 2/𝜋𝜎𝑡 𝑛𝑟𝑡 ) + 2Δ
𝜖𝑡 (log (|𝐶𝑡 |) + 𝜆2 ) √︁
𝑡
𝑛𝜎 2𝑡 2
𝐸𝑟 ≥ + 2/𝜋𝜎𝑡 𝑛𝑟 = min exp (𝑡 · 𝑟 + − ∥𝑎 − 𝑏 ∥ 1 𝑡)
𝑤𝑟 𝑡 ≥0 2


From Theorem 6, with probability at most 𝑒 −𝜆1 /2 , we have:
2
Setting 𝑟 = ∥𝑎 − 𝑏 ∥ 1 − 𝜆𝜎 𝑛 gives the desired result

√ Pr[∥𝑎 − 𝑐 ∥ 1 ≤ ∥𝑎 − 𝑏 ∥ 1 − 𝜆𝜎 𝑛]
𝑀𝑟𝑡 (𝑝𝑡 −1 ) − 𝑦𝑡 1 + 𝜆1 𝜎𝑡 𝑛𝑟𝑡 ≤ 𝐸𝑟𝑡
√ 𝑛𝜎 2𝑡 2
≤ min exp (𝑡 · (∥𝑎 − 𝑏 ∥ 1 − 𝜆𝜎 𝑛) + − ∥𝑎 − 𝑏 ∥ 1 𝑡)
Combining these two facts via the union bound, along with some 𝑡 ≥0 2
algebraic manipulation, yields the stated result. □
 √ 2
𝑛𝜎 𝑡 2 
= min exp − 𝑡𝜆𝜎 𝑛 +
𝑡 ≥0 2
2 √
≤ exp (−𝜆 /2) (set 𝑡 = 𝜆/𝜎 𝑛)
Theorem 6. Let 𝑎, 𝑏 ∈ R𝑘 and let 𝑐 = 𝑏 + 𝑧 where 𝑧 ∼ N (0, 𝜎 2 )𝑛 .

√  1 

Pr[∥𝑎 − 𝑐 ∥ 1 ≤ ∥𝑎 − 𝑏 ∥ 1 − 𝜆𝜎 𝑛] ≤ exp − 𝜆 2
2
Lemma 1. Let 𝑎, 𝑏 ≥ 0, and let Φ denote the CDF of the standard
normal distribution. Then,
Proof. First note that |𝑎𝑖 −𝑐𝑖 | = |𝑎𝑖 −𝑏𝑖 −𝑧𝑖 |, which is distributed 1  1 
according to a folded normal distribution with mean |𝑎𝑖 − 𝑏𝑖 |. It exp 𝑎 2 + 𝑎𝑏 Φ(−𝑎 − 𝑏) ≤ exp 𝑎 2 − 𝑎𝑏 Φ(𝑎 − 𝑏)
2 2
is well known [47] that the moment generating function for this
random variable is 𝑀𝑖 (𝑡), where: Proof. First observe that:
1   1 2  Φ(−𝑎 − 𝑏)
exp 𝑎 2 + 𝑎𝑏 Φ(−𝑎 − 𝑏) = exp − 𝑏
𝜙 (−𝑎 − 𝑏)
1 
2 2
𝑀𝑖 (𝑡) = exp 𝜎 2𝑡 2 + |𝑎𝑖 − 𝑏𝑖 |𝑡 Φ(|𝑎𝑖 − 𝑏𝑖 |/𝜎 + 𝜎𝑡)
2 1   1 2  Φ(𝑎 − 𝑏)
1  exp 𝑎 2 − 𝑎𝑏 Φ(𝑎 − 𝑏) = exp − 𝑏
+ exp 𝜎 2𝑡 2 − |𝑎𝑖 − 𝑏𝑖 |𝑡 Φ(−|𝑎𝑖 − 𝑏𝑖 |/𝜎 + 𝜎𝑡). 2 2 𝜙 (𝑎 − 𝑏)
2

Since 𝑎, 𝑏 ≥ 0, we know that −𝑎 − 𝑏 ≤ 𝑎 − 𝑏. We will now argue


Φ(𝛼)
Moreover, the moment generating function of ∥𝑎 − 𝑐 ∥ 1 is 𝑀 (𝑡) = that the function 𝜙 (𝛼) is monotonically increasing in 𝛼, which
Î
𝑖 𝑀𝑖 (𝑡). We will begin by focusing our attention on bounding suffices to prove the desired claim. To prove this, we will observe
that this quantity is known as the Mills ratio [21] for the normal distribution. We know that the Mills ratio is connected to a particular expectation; specifically, if 𝑋 ∼ N(0, 1), then

E[𝑋 | 𝑋 < 𝛼] = −𝜙(𝛼)/Φ(𝛼)

Using this interpretation, it is clear that the LHS (and hence the RHS) is monotonically increasing in 𝛼. Since −𝜙(𝛼)/Φ(𝛼) is monotonically increasing, so is Φ(𝛼)/𝜙(𝛼). □

C INTERPRETABLE ERROR RATE AND SUBSAMPLING MECHANISM
In Section 6, we saw that AIM offers the best error relative to existing synthetic data mechanisms, although it is not obvious whether a given 𝐿1 error should be considered "good". This is necessary for setting the privacy parameters to strike the right privacy/utility tradeoff. We can bring more clarity to this problem by comparing AIM to a (non-private) baseline that simply resamples 𝐾 records from the dataset. Then, if AIM achieves the same error as resampling 𝐾 = 𝑁/2 records, this provides a clear interpretation: that the price of privacy is losing about half the data. Due to the simplicity of this baseline, we can compute the expected workload error in closed form, without actually running the mechanism. We provide details of these calculations in the next section.

Figure 5 plots the performance of AIM on each dataset, epsilon, and workload considered, measured using the fraction of samples needed for the subsampling mechanism to match the performance of AIM. These plots reveal that at 𝜖 = 10, the median subsampling fraction is about 0.37 for the general workload, 0.62 for the target workload, and 0.85 for the weighted workload. At 𝜖 = 1, these numbers are 0.13, 0.15, and 0.21, respectively. The results are comparable across five out of six datasets, with nltcs being a clear outlier. For that dataset, a subsampling fraction of 1.0 was reached by 𝜖 = 0.31 for all workloads. This could be an indication of overfitting to the data; a possible reason for this behavior is that the domain size of the nltcs data is small compared to the number of records. msnbc is also an outlier to a lesser extent, with worse performance than the other datasets for larger 𝜖. A possible reason for this behavior is that msnbc has the most data points, so subsampling with the same fraction of points has much lower error. AIM may not be able to match that low error due to the computational constraints imposed on the model size, combined with the fact that this dataset has a large domain.

C.1 Mathematical Details of Subsampling
We begin by analyzing the expected workload error of the (non-private) mechanism that randomly samples 𝐾 items with replacement from 𝐷. Then, we will connect that to the error of AIM, and determine the value of 𝐾 where the error rates match. Theorem 7 gives a closed form expression for the expected 𝐿1 error on a single marginal as a function of the number of sampled records.

Theorem 7. Let 𝐷̂ be the dataset obtained by sampling 𝐾 items with replacement from 𝐷. Further, let 𝜇 = (1/𝑁) 𝑀𝑟(𝐷) and 𝑠 = ⌈𝐾𝜇⌉. Then (writing C(𝑛, 𝑘) for the binomial coefficient):

E[ ‖(1/𝑁) 𝑀𝑟(𝐷) − (1/𝐾) 𝑀𝑟(𝐷̂)‖_1 ] = (2/𝐾) Σ_{𝑥 ∈ Ω𝑟} 𝑠(𝑥) · C(𝐾, 𝑠(𝑥)) · 𝜇(𝑥)^{𝑠(𝑥)} · (1 − 𝜇(𝑥))^{𝐾−𝑠(𝑥)+1}

Proof. The theorem statement follows directly from Lemma 4 and Lemma 3. □

Lemma 2 (Mean Deviation [16, 26]). Let 𝑘 ∼ Binomial(𝑛, 𝑝), then:

E[ |𝑝 − 𝑘/𝑛| ] = (2/𝑛) · 𝑠 · C(𝑛, 𝑠) · 𝑝^𝑠 · (1 − 𝑝)^{𝑛−𝑠+1}, where 𝑠 = ⌈𝑛 · 𝑝⌉.

Proof. This statement appears and is proved in [16, 26]. □

Lemma 3 (𝐿1 Deviation). Let 𝑘 ∼ Multinomial(𝑛, 𝑝), then:

E[ ‖𝑝 − 𝑘/𝑛‖_1 ] = (2/𝑛) Σ_𝑥 𝑠(𝑥) · C(𝑛, 𝑠(𝑥)) · 𝑝(𝑥)^{𝑠(𝑥)} · (1 − 𝑝(𝑥))^{𝑛−𝑠(𝑥)+1}, where 𝑠(𝑥) = ⌈𝑛 · 𝑝(𝑥)⌉.

Proof. The statement follows immediately from Lemma 2 and the fact that 𝑘(𝑥) ∼ Binomial(𝑛, 𝑝(𝑥)). □

Lemma 4. Let 𝐷̂ be the dataset obtained by sampling 𝐾 items with replacement from 𝐷. Then,

𝑀𝑟(𝐷̂) ∼ Multinomial(𝐾, (1/𝑁) 𝑀𝑟(𝐷))

Proof. The statement follows from the definition of the multinomial distribution. □

D STRUCTURAL ZEROS
In this section, we describe a simple and principled method to specify and enforce structural zeros in the mechanism. These capture attribute combinations that cannot occur in the real data. Without specifying this, synthetic data mechanisms will usually generate records that violate these constraints that hold in the real data, as the process of adding noise can introduce spurious records, especially in high privacy regimes. These spurious records can be confusing for downstream analysis of the synthetic data, and can lead the analyst to distrust the quality of the data. By imposing known structural zero constraints, we can avoid this problem, while also improving the quality of the synthetic data on the workload of interest.

Structural zeros, if they exist, can usually be enumerated by a domain expert. We can very naturally incorporate these into our mechanism with only one minor change to the underlying Private-PGM library. These structural zeros can be specified as input as a list of pairs (𝑟, Z𝑟) where Z𝑟 ⊆ Ω𝑟. The first entry of the pair specifies the set of attributes relevant to the structural zeros, while the second entry enumerates the attribute combinations whose counts should all be zero. The method we propose can be used within any mechanism that builds on top of Private-PGM, and is hence more broadly useful outside the context of AIM.
[Figure 5 has three panels, (a) General, (b) Target, and (c) Weighted, each plotting the number of samples (y-axis, log scale) against epsilon (x-axis, log scale) for the adult, fire, msnbc, nltcs, salary, and titanic datasets.]

Figure 5: Performance of AIM as measured by the number of samples needed to match the achieved workload error.
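As an aid to reproducing baseline numbers of this kind, the sketch below evaluates the closed-form expected 𝐿1 error of the resampling baseline from Theorem 7 above for a single marginal. It is only an illustrative sketch: the function name is ours, and the full workload error additionally requires averaging over the (weighted) marginals in the workload.

    import numpy as np
    from scipy.stats import binom

    def expected_resampling_error(mu, K):
        # mu : marginal probabilities M_r(D)/N over the cells of Omega_r (a 1-D array)
        # K  : number of records drawn with replacement from D
        mu = np.asarray(mu, dtype=float)
        s = np.ceil(K * mu)
        # s(x) * C(K, s(x)) * mu(x)^s(x) * (1-mu(x))^(K-s(x)+1), written via the binomial pmf
        terms = s * binom.pmf(s, K, mu) * (1.0 - mu)
        return 2.0 / K * terms.sum()

    # Example: a uniform marginal with 100 cells, keeping K = N/2 of N = 10,000 records.
    print(expected_resampling_error(np.full(100, 0.01), K=5000))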

To understand the technical ideas in this section, please refer to the background on Private-PGM [37]. Usually Private-PGM is initialized by setting 𝜃𝑟(𝑥𝑟) = 0 for all 𝑟 in the model and all 𝑥𝑟 ∈ Ω𝑟. This corresponds to a model where 𝜇𝑟(𝑥𝑟) is uniform across all 𝑥𝑟. Our basic observation is that by initializing Private-PGM by setting 𝜃𝑟(𝑥𝑟) = −∞ for each 𝑥𝑟 ∈ Z𝑟, the cell of the associated marginal will be 𝜇𝑟(𝑥𝑟) = 0, as desired. Moreover, each update within the Private-PGM estimation procedure will try to update 𝜃𝑟(𝑥𝑟) by a finite amount, leaving it unchanged. Thus, 𝜇𝑟(𝑥𝑟) will remain 0 during the entire estimation procedure. We conjecture that the estimation procedure solves the following modified convex optimization problem:

𝜇̂ = min { 𝐿(𝜇) : 𝜇 ∈ M, 𝜇𝑟(Z𝑟) = 0 }

This approach is appealing because other simple approaches that discard invalid tuples can inadvertently bias the distribution, which is undesirable.

Table 3: Error of AIM on the fire dataset, with and without imposing structural zero constraints.

𝜖      | AIM   | AIM + Structural Zeros | Ratio
0.010  | 0.613 | 0.542 | 1.130
0.031  | 0.303 | 0.263 | 1.151
0.100  | 0.141 | 0.153 | 0.924
0.316  | 0.087 | 0.077 | 1.124
1.000  | 0.052 | 0.053 | 0.979
3.162  | 0.044 | 0.045 | 0.964
10.00  | 0.038 | 0.032 | 1.170
31.62  | 0.029 | 0.026 | 1.149
100.0  | 0.025 | 0.025 | 1.004

The results of this experiment are shown in Table 3. On average, imposing structural zeros improves the performance of the mechanism, although the improvement is not universal across all values of epsilon we tested. Nevertheless, it is still useful to impose these constraints for data quality purposes.
Note that for each clique in the set of structural zeros, we must
include that clique in our model, which increases the size of that Our primary focus in the main body of the paper was mechanism
model. Thus, we should treat it as we would treat a clique selected utility, as measured by the workload error. In this section we discuss
by AIM. That is, when calculating JT-SIZE in line 12 of AIM, we the runtime of AIM, which is an important consideration when
need to include both the cliques selected in earlier iterations, as deploying it in practice. Note that we do not compare against run-
well as the cliques included in the structural zeros. time of other mechanisms here, because different mechanisms were
executed in different runtime environments. Figure 6 below shows
the runtime of AIM as a function of the privacy parameter. As evi-
D.1 Experiments dent from the figure, runtime increases drastically with the privacy
In this section, we empirically evaluate this structural zeros en- parameter. This is not surprising because AIM is budget-aware: it
hancement, showing that it can reduce workload error in some knows to select larger marginals and run for more rounds when the
cases. For this experiment, we consider the general workload on budget is higher, which in turn leads to longer runtime. For large
the fire dataset, and compare the performance of AIM with and 𝜖, the constraint on JT-SIZE is essential to allow the mechanism to
without imposing structural zero constraints. This dataset contains terminate at all. Without it, AIM may try to select marginal queries
several related attributes, like “Zipcode of Incident“ and “City”. that exceed memory resources and result in much longer runtime.
While these attributes are not perfectly correlated, significant num- For small 𝜖, this constraint is not active, and could be removed
bers of attribute combinations are impossible. We identified a total without affecting the behavior of AIM.
of nine attribute pairs which contain some structural zeros, and a Recall that these experiments were conducted on one core of a
total of 2696 structural zero constraints within these nine marginals. compute cluster with 4 GB of memory and a CPU speed of 2.4 GHz.
These machines were used due to the large number of experiments we needed to conduct, but in real-world scenarios we only need to run one execution of AIM, for a single dataset, workload, privacy parameter, and trial. For this, we can use machines with much better specs, which would improve the runtime significantly.

[Figure 6 plots the runtime of AIM in hours (y-axis) against epsilon (x-axis, log scale) for the adult, fire, msnbc, nltcs, salary, and titanic datasets.]

Figure 6: Runtime of AIM on the all-3way workload.

F PRIVATE-PGM VS. RELAXED PROJECTION


In this paper, we built AIM on top of Private-PGM, leveraging prior
work for the generate step of the select-measure-generate para-
digm. Private-PGM is not the only method in this space, although it
was the first general purpose and scalable method to our knowledge.
“Relaxed Projection” [3] is another general purpose and scalable
method that solves the same problem, and could be used in place
of Private-PGM if desired. RAP, the main mechanism that utilizes
this technique, did not perform well in our experiments. However,
it is not clear from our experiments if the poor performance can be
attributed to the relaxed projection algorithm, or some other algo-
rithmic design decisions. In this section, we attempt to precisely
pin down the differences between these two related methods, tak-
ing care to fix possible confounding factors. We thus consider two
mechanisms: MWEM+PGM, which is defined in Algorithm 1, and
MWEM+Relaxed Projection which is identical to MWEM+PGM in
every way, except the call to Private-PGM is replaced with a call to
the relaxed projection algorithm of Aydore et al.
For this experiment, we consider the all-3way workload, and we
run each algorithm for 𝑇 = 5, 10, . . . , 100, with five trials for each
hyper-parameter setting. We average the workload error across the
five trials, and report the minimum workload error across hyper-
parameter settings in Figure 7. Although the algorithms are con-
ceptually very similar, MWEM+PGM consistently outperforms
MWEM+Relaxed Projection, across every dataset and privacy level
considered. The performance difference is modest in many cases,
but significant on the fire dataset.
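The protocol just described can be summarized by a short driver loop. The sketch below is only schematic: run_mechanism is a placeholder for either variant (MWEM+PGM or MWEM+Relaxed Projection) rather than any real library call.

    import numpy as np

    def best_workload_error(run_mechanism, data, workload, epsilon, trials=5):
        # Sweep the number of rounds T = 5, 10, ..., 100, average error over five trials,
        # and report the best (minimum) average, as done for Figure 7.
        averages = []
        for T in range(5, 105, 5):
            errors = [run_mechanism(data, workload, epsilon, rounds=T, seed=s)
                      for s in range(trials)]
            averages.append(np.mean(errors))
        return min(averages)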
AP-PGM [36] offers another alternative to Private-PGM for the generate step, and while it was shown to be an appealing alternative to Private-PGM in some cases, within the context of an MWEM-style algorithm, their own experiments demonstrate the superiority of Private-PGM.

Generator networks [32] offer yet another alternative to Private-PGM for the generate step. To the best of our knowledge, no direct comparison between this approach and Private-PGM has been done to date, where confounding factors are controlled for. Conceptually, this approach is most similar to the relaxed projection approach, so we conjecture the results to look similar to those shown in Figure 7.
[Figure 7 has three panels, (a) 𝜖 = 0.1, (b) 𝜖 = 1.0, and (c) 𝜖 = 10.0, each showing a bar chart of the workload error of MWEM+Relaxed Projection and MWEM+PGM on the adult, fire, loans, msnbc, nltcs, and titanic datasets.]

Figure 7: MWEM+Relaxed Projection vs. MWEM+PGM on the all-3way workload.
