Programming Differential Privacy
Contents

1 Introduction
2 De-identification
   2.1 Preliminary
   2.2 De-identification
   2.3 Linkage Attacks
   2.4 Aggregation
3 k-Anonymity
   3.1 Checking for 𝑘-Anonymity
   3.2 Generalizing Data to Satisfy 𝑘-Anonymity
   3.3 Does More Data Improve Generalization?
   3.4 Removing Outliers
   3.5 The Homogeneity Attack
4 Differential Privacy
   4.1 The Laplace Mechanism
   4.2 How Much Noise is Enough?
   4.3 The Unit of Privacy
   4.4 Bounded and Unbounded Differential Privacy
6 Sensitivity
   6.1 Distance
   6.2 Calculating Sensitivity
   6.3 Clipping
   6.4 Avoiding Sensitivity Underestimation
8 Local Sensitivity
   8.1 Local Sensitivity of the Mean
   8.2 Achieving Differential Privacy via Local Sensitivity?
   8.3 Propose-Test-Release
   8.4 Smooth Sensitivity
   8.5 Sample and Aggregate
13 Machine Learning
   13.1 Logistic Regression with Scikit-Learn
   13.2 What is a Model?
   13.3 Training a Model with Gradient Descent
   13.4 Gradient Descent with Differential Privacy
   13.5 Effect of Noise on Training
15 Synthetic Data
   15.1 Synthetic Representation: a Histogram
   15.2 Adding Differential Privacy
   15.3 Generating Tabular Data
   15.4 Generating More Columns
16 Efficiency
   16.1 Time Efficiency of Differential Privacy
   16.2 Space Cost of Differential Privacy
   16.3 Limitations of Random Number Generation
17 Bibliography
CHAPTER ONE: INTRODUCTION
This is a book about differential privacy, for programmers. It is intended to give you an introduction to the challenges
of data privacy, introduce you to the techniques that have been developed for addressing those challenges, and help you
understand how to implement some of those techniques.
The book contains numerous examples as programs, including implementations of many concepts. Each chapter is gen-
erated from a self-contained Jupyter Notebook. You can click on the “download” button at the top-right of the chapter,
and then select “.ipynb” to download the notebook for that chapter, and you’ll be able to execute the examples yourself.
Many of the examples are generated by code that is hidden (for readability) in the chapters you’ll see here. You can show
this code by clicking the “Click to show” labels adjacent to these cells.
This book assumes a working knowledge of Python, as well as basic knowledge of the pandas and NumPy libraries. You
will also benefit from some background in discrete mathematics and probability - a basic undergraduate course in these
topics should be more than sufficient.
This book is open source, and the latest version will always be available online here. The source code is available on
GitHub. If you would like to fix a typo, suggest an improvement, or report a bug, please open an issue on GitHub.
The techniques described in this book have developed out of the study of data privacy. For our purposes, we will define data privacy this way:

Definition 1 (Data Privacy)
Data privacy techniques have the goal of allowing analysts to learn about trends in sensitive data, without revealing information specific to individuals.
This is a broad definition, and many different techniques fall under it. But it’s important to note what this definition
excludes: techniques for ensuring security, like encryption. Encrypted data doesn’t reveal anything - so it fails to meet the
first requirement of our definition. The distinction between security and privacy is an important one: privacy techniques
involve an intentional release of information, and attempt to control what can be learned from that release; security tech-
niques usually prevent the release of information, and control who can access data. This book covers privacy techniques,
and we will only discuss security when it has important implications for privacy.
This book is primarily focused on differential privacy. The first couple of chapters outline some of the reasons why:
differential privacy (and its variants) is the only formal approach we know about that seems to provide robust privacy
protection. Commonly-used approaches that have been used for decades (like de-identification and aggregation) have
more recently been shown to break down under sophisticated privacy attacks, and even more modern techniques (like
𝑘-Anonymity) are susceptible to certain attacks. For this reason, differential privacy is fast becoming the gold standard
in privacy protection, and thus it is the primary focus of this book.
CHAPTER TWO: DE-IDENTIFICATION
Learning Objectives
After reading this chapter, you will be able to:
• Define the following concepts:
– De-identification
– Re-identification
– Identifying information / personally identifying information
– Linkage attacks
– Aggregation and aggregate statistics
– Differencing attacks
• Perform a linkage attack
• Perform a differencing attack
• Explain the limitations of de-identification techniques
• Explain the limitations of aggregate statistics
2.1 Preliminary
Download the dataset by clicking here and place it in the same directory as this notebook.
The dataset is based on census data. The personally identifiable information (PII) is made up.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
adult = pd.read_csv("adult_with_pii.csv")
adult.head()
2.2 De-identification
De-identification is the process of removing identifying information from a dataset. The term de-identification is sometimes
used synonymously with the terms anonymization and pseudonymization.
Identifying information has no formal definition. It is usually understood to be information which would be used to
identify us uniquely in the course of daily life - name, address, phone number, e-mail address, etc. As we will see
later, it’s impossible to formalize the concept of identifying information, because all information is identifying. The term
personally identifiable information (PII) is often used synonymously with identifying information.
How do we de-identify information? Easy - we just remove the columns that contain identifying information!
We’ll save some of the identifying information for later, when we’ll use it as auxiliary data to perform a re-identification
attack.
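The code for this step is hidden in the original chapter; a minimal sketch is shown below. The DOB and SSN column names are assumptions about which PII columns the dataset contains.

# keep the identifying columns separately, to use later as auxiliary data (DOB and SSN are assumed column names)
adult_pii = adult[['Name', 'DOB', 'SSN', 'Zip']]
# "de-identify" the data by dropping the directly identifying columns
adult_data = adult.copy().drop(columns=['Name', 'SSN'])
adult_data.head()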
2.3 Linkage Attacks
Imagine we want to determine the income of a friend from our de-identified data. Names have been removed, but we
happen to know some auxiliary information about our friend. Our friend’s name is Karrie Trusslove, and we know Karrie’s
date of birth and zip code.
To perform a simple linkage attack, we look at the overlapping columns between the dataset we’re trying to attack, and
the auxiliary data we know. In this case, both datasets have dates of birth and zip codes. We look for rows in the dataset
we’re attacking with dates of birth and zip codes that match Karrie’s date of birth and zip code. In databases, this is called
a join of two tables, and we can do it in Pandas using merge. If there is only one such row, we’ve found Karrie’s row in
the dataset we’re attacking.
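A sketch of this join is shown below, assuming the adult_pii / adult_data split from the previous sketch and a DOB column holding dates of birth.

# Karrie's auxiliary data: her name, date of birth, and ZIP code
karries_info = adult_pii[adult_pii['Name'] == 'Karrie Trusslove']
# join the auxiliary data against the de-identified data on the overlapping columns
pd.merge(karries_info, adult_data, left_on=['DOB', 'Zip'], right_on=['DOB', 'Zip'])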
Hours per week Country Target Age Capital Gain Capital Loss
0 40 United-States <=50K 56 2174 0
Indeed, there is only one row that matches. We have used auxiliary data to re-identify an individual in a de-identified
dataset, and we’re able to infer that Karrie’s income is less than $50k.
This scenario is made up, but linkage attacks are surprisingly easy to perform in practice. How easy? It turns out that in
many cases, just one data point is sufficient to pinpoint a row!
Race Sex Hours per week Country Target Age Capital Gain \
0 White Male 40 United-States <=50K 56 2174
Capital Loss
0 0
So ZIP code is sufficient by itself to allow us to re-identify Karrie. What about date of birth?
This time, there are three rows returned - and we don’t know which one is the real Karrie. But we’ve still learned a lot!
• We know that there’s a 2/3 chance that Karrie’s income is less than $50k
• We can look at the differences between the rows to determine what additional auxiliary information would help us
to distinguish them (e.g. sex, occupation, marital status)
How hard is it to re-identify others in the dataset? Is Karrie especially easy or especially difficult to re-identify? A good
way to gauge the effectiveness of this type of attack is to look at how “selective” certain pieces of data are - how good they
are at narrowing down the set of potential rows which may belong to the target individual. For example, is it common for
birthdates to occur more than once?
We’d like to get an idea of how many dates of birth are likely to be useful in performing an attack, which we can do by
looking at how common “unique” dates of birth are in the dataset. The histogram below shows that the vast majority of
dates of birth occur 1, 2, or 3 times in the dataset, and no date of birth occurs more than 8 times. This means that date of
birth is fairly selective - it’s effective in narrowing down the possible records for an individual.
We can do the same thing with ZIP codes, and the results are even worse - ZIP code happens to be very selective in this
dataset. Nearly all the ZIP codes occur only once.
In this dataset, how many people can we re-identify uniquely? We can use our auxiliary information to find out! First,
let’s see what happens with just dates of birth. We want to know how many possible identities are returned for each data
record in the dataset. The following histogram shows the number of records with each number of possible identities.
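A sketch of this analysis, assuming the adult_pii / adult_data split from the earlier sketch:

# join on date of birth only, then count how many candidate records match each identity
attack = pd.merge(adult_pii, adult_data, left_on=['DOB'], right_on=['DOB'])
attack['Name'].value_counts().hist()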
The results show that we can uniquely identify almost 7,000 of the data records (out of about 32,000), and an additional
10,000 data records are narrowed down to two possible identities.
So it’s not possible to re-identify a majority of individuals using just date of birth. What if we collect more information,
to narrow things down further? If we use both date of birth and ZIP, we’re able to do much better. In fact, we’re able to
uniquely re-identify basically the whole dataset.
When we use both pieces of information, we can re-identify essentially everyone. This is a surprising result, since we
generally assume that many people share the same birthday, and many people live in the same ZIP code. It turns out that
the combination of these factors is extremely selective. According to Latanya Sweeney’s work [1], 87% of people in the
US can be uniquely re-identified by the combination of date of birth, gender, and ZIP code.
Let’s just check that we’ve actually re-identified everyone, by printing out the number of possible data records for each
identity:
Barnabe Haime 2
Antonin Chittem 2
Marchelle Benardette 1
Isis Calfe 1
Kaye Patriche 1
Name: Name, dtype: int64
Looks like we missed two people! In other words, in this dataset, only two people share a combination of ZIP code and
date of birth.
2.4 Aggregation
Another way to prevent the release of private information is to release only aggregate data.
adult['Age'].mean()
41.77250253355035
In many cases, aggregate statistics are broken down into smaller groups. For example, we might want to know the average
age of people with a particular education level.
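For example (a sketch of the hidden code producing the output below):

adult[['Education', 'Age']].groupby('Education', as_index=False).mean().head(3)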
Education Age
0 10th 42.032154
1 11th 42.057021
2 12th 41.879908
Aggregation is supposed to improve privacy because it’s hard to identify the contribution of a particular individual to the
aggregate statistic. But what if we aggregate over a group with just one person in it? In that case, the aggregate statistic
reveals one person’s age exactly, and provides no privacy protection at all! In our dataset, most individuals have a unique
ZIP code - so if we compute the average age by ZIP code, then most of the “averages” actually reveal an individual’s exact
age.
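A sketch of the query producing the output below:

adult[['Zip', 'Age']].groupby('Zip', as_index=False).mean().head(2)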
Zip Age
0 4 72.0
1 12 46.0
The US Census Bureau, for example, releases aggregate statistics at the block level. Some census blocks have large
populations, but some have a population of zero! The situation above, where small groups prevent aggregation from
hiding information about individuals, turns out to be quite common.
How big a group is “big enough” for aggregate statistics to help? It’s hard to say - it depends on the data and on the attack
- so it’s challenging to build confidence that aggregate statistics are really privacy-preserving. However, even very large
groups do not make aggregation completely robust against attacks, as we will see next.
The problems with aggregation get even worse when you release multiple aggregate statistics over the same data. For
example, consider the following two summation queries over large groups in our dataset (the first over the whole dataset,
and the second over all records except one):
adult['Age'].sum()
1360238
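The second query, which excludes Karrie’s row, might look like this (a sketch):

adult[adult['Name'] != 'Karrie Trusslove']['Age'].sum()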
1360182
If we know both answers, we can simply take the difference and determine Karrie’s age completely! This kind of attack
can proceed even if the aggregate statistics are over very large groups.
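For example:

adult['Age'].sum() - adult[adult['Name'] != 'Karrie Trusslove']['Age'].sum()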
56
Summary
• A linkage attack involves combining auxiliary data with de-identified data to re-identify individuals.
• In the simplest case, a linkage attack can be performed via a join of two tables containing these datasets.
• Simple linking attacks are surprisingly effective:
– Just a single data point is sufficient to narrow things down to a few records
– The narrowed-down set of records helps suggest additional auxiliary data which might be helpful
– Two data points are often good enough to re-identify a huge fraction of the population in a particular dataset
– Three data points (gender, ZIP code, date of birth) uniquely identify 87% of people in the US
CHAPTER THREE: K-ANONYMITY
𝑘-Anonymity [2] is a formal privacy definition. The definition of 𝑘-Anonymity is designed to formalize our intuition that
a piece of auxiliary information should not narrow down the set of possible records for an individual “too much.” Stated
another way, 𝑘-Anonymity is designed to ensure that each individual can “blend into the crowd.”
Learning Objectives
After reading this chapter, you will understand:
• The definition of 𝑘-Anonymity
• How to check for 𝑘-Anonymity
• How to generalize data to enforce 𝑘-Anonymity
• The limitations of 𝑘-Anonymity
Informally, we say that a dataset is “𝑘-Anonymized” for a particular 𝑘 if each individual in the dataset is a member of a
group of size at least 𝑘, such that each member of the group shares the same quasi-identifiers (a selected subset of all the
dataset’s columns) with all other members of the group. Thus, the individuals in each group “blend into” their group - it’s
possible to narrow down an individual to membership in a particular group, but not to determine which group member is
the target.
Definition 2 (K-Anonymity)
Formally, we say that a dataset 𝐷 satisfies 𝑘-Anonymity for a value of 𝑘 if:
• For each row $r_1 \in D$, there exist at least $k-1$ other rows $r_2, \ldots, r_k \in D$ such that $\Pi_{qi(D)}\, r_1 = \Pi_{qi(D)}\, r_2, \ldots, \Pi_{qi(D)}\, r_1 = \Pi_{qi(D)}\, r_k$
where 𝑞𝑖(𝐷) is the quasi-identifiers of 𝐷, and Π𝑞𝑖(𝐷) 𝑟 represents the columns of 𝑟 containing quasi-identifiers (i.e. the
projection of the quasi-identifiers).
3.1 Checking for 𝑘-Anonymity
We’ll start with a small dataset, so that we can immediately see by looking at the data whether it satisfies 𝑘-Anonymity or
not. This dataset contains age plus two test scores; it clearly doesn’t satisfy 𝑘-Anonymity for 𝑘 > 1. Any dataset trivially
satisfies 𝑘-Anonymity for 𝑘 = 1, since each row can form its own group of size 1.
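The example dataframe is built by hidden code in the original chapter; a sketch with illustrative (made-up) values is shown below. The column names match those used in the generalization example later in this chapter.

df = pd.DataFrame({
    'age':           [42, 52, 36, 24, 73],
    'preTestScore':  [ 4, 24, 31,  2,  3],
    'postTestScore': [25, 94, 57, 62, 70]
})
df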
To implement a function to check whether a dataframe satisfies 𝑘-Anonymity, we loop over the rows; for each row, we
query the dataframe to see how many rows match its values for the quasi-identifiers. If the number of rows in any group is
less than 𝑘, the dataframe does not satisfy 𝑘-Anonymity for that value of 𝑘, and we return False. Note that in this simple
definition, we consider all columns to contain quasi-identifiers; to limit our check to a subset of all columns, we would
need to replace the df.columns expression with something else.
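A minimal sketch of such a function, treating every column as a quasi-identifier:

def isKAnonymized(df, k):
    # for each row, count how many rows share all of its values
    for index, row in df.iterrows():
        query = ' & '.join([f'{col} == {row[col]}' for col in df.columns])
        rows = df.query(query)
        if rows.shape[0] < k:
            return False
    return True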
As expected, our example dataframe does not satisfy 𝑘-Anonymity for 𝑘 = 2, but it does satisfy the property for 𝑘 = 1.
isKAnonymized(df, 1)
True
isKAnonymized(df, 2)
False
3.2 Generalizing Data to Satisfy 𝑘-Anonymity
The process of modifying a dataset so that it satisfies 𝑘-Anonymity for a desired 𝑘 is generally accomplished by generalizing
the data - modifying values to be less specific, and therefore more likely to match the values of other individuals in the
dataset. For example, an age which is accurate to a year may be generalized by rounding to the nearest 10 years, or a
ZIP code might have its rightmost digits replaced by zeros. For numeric values, this is easy to implement. We’ll use
the apply method of dataframes, and pass in a dictionary named depths which specifies how many digits to replace
by zeros for each column. This gives us the flexibility to experiment with different levels of generalization for different
columns.
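A sketch of such a generalization function:

def generalize(df, depths):
    # replace the rightmost `depths[col]` digits of each value with zeros
    return df.apply(lambda x: x.apply(lambda y: int(int(y / (10**depths[x.name])) * (10**depths[x.name]))))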
Now, we can generalize our example dataframe. First, we’ll try generalizing each column by one “level” - i.e. rounding
to the nearest 10.
depths = {
'age': 1,
'preTestScore': 1,
'postTestScore': 1
}
df2 = generalize(df, depths)
df2
Notice that even after generalization, our example data still does not satisfy 𝑘-Anonymity for 𝑘 = 2.
isKAnonymized(df2, 2)
False
We can try generalizing more - but then we’ll end up removing all of the data!
depths = {
'age': 2,
'preTestScore': 2,
'postTestScore': 2
}
generalize(df, depths)
Important: Achieving 𝑘-Anonymity for meaningful values of 𝑘 often requires removing quite a lot of information from
the data.
3.3 Does More Data Improve Generalization?
Our example dataset is too small for 𝑘-Anonymity to work well. Because there are only 5 individuals in the dataset,
building groups of 2 or more individuals who share the same properties is difficult. The solution to this problem is more
data: in a dataset with more individuals, less generalization will typically be needed to satisfy 𝑘-Anonymity for a desired
𝑘.
Let’s try the same census data we examined for de-identification. This dataset contains more than 32,000 rows, so it
should be easier to achieve 𝑘-Anonymity.
We’ll consider the ZIP code, age, and educational achievement of each individual to be the quasi-identifiers. We’ll project
just those columns, and try to achieve 𝑘-Anonymity for 𝑘 = 2. The data is already 𝑘-Anonymous for 𝑘 = 1.
Notice that we take just the first 100 rows from the dataset for this check - try running isKAnonymized on a larger
subset of the data, and you’ll find that it takes a very long time (for example, running the 𝑘 = 1 check on 5000 rows takes
about 20 seconds on my laptop). For 𝑘 = 2, our algorithm finds a failing row quickly and finishes fast.
df = adult_data[['Age', 'Education-Num']]
df.columns = ['age', 'edu']
isKAnonymized(df.head(100), 1)
True
isKAnonymized(df.head(100), 2)
False
Now, we’ll try to generalize to achieve 𝑘-Anonymity for 𝑘 = 2. We’ll start with generalizing both age and educational
attainment to the nearest 10.
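A sketch of this attempt, reusing the helpers defined earlier; the sample size of 1000 rows is an assumption.

depths = {
    'age': 1,
    'edu': 1
}
df2 = generalize(df.head(1000), depths)
isKAnonymized(df2, 2)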
False
The generalized result still does not satisfy 𝑘-Anonymity for 𝑘 = 2! In fact, we can perform this generalization on all
~32,000 rows and still fail to satisfy 𝑘-Anonymity for 𝑘 = 2 - so adding more data does not necessarily help as much as
we expected.
The reason is that the dataset contains outliers - individuals who are very different from the rest of the population. These
individuals do not fit easily into any group, even after generalization. Even considering only ages, we can see that adding
more data is not likely to help, since very low and very high ages are poorly represented in the dataset.
Achieving the optimal generalization for 𝑘-Anonymity is very challenging in cases like this. Generalizing each row more
would be overkill for the well-represented individuals with ages in the 20-40 range, and would hurt utility. However, more
generalization is clearly needed for individuals at the upper and lower ends of the age range. This is the kind of challenge
that occurs regularly in practice, and is difficult to solve automatically. In fact, optimal generalization for 𝑘-Anonymity
has been shown to be NP-hard.
Important: Outliers make achieving 𝑘-Anonymity very challenging, even for large datasets. Optimal generalization for
𝑘-Anonymity is NP-hard.
3.4 Removing Outliers
One solution to this problem is simply to clip the age of each individual in the dataset to lie within a specific range,
eliminating outliers entirely. This can also hurt utility, since it replaces real ages with fake ones, but it can be better than
generalizing each row more. We can use Numpy’s clip method to perform this clipping. We clip ages to be 10-60, and
require an educational level of at least 5th-6th grade (represented by the index 3 in the dataset).
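A sketch of this step, using the helpers defined earlier; the sample size and generalization depths are assumptions.

depths = {
    'age': 1,
    'edu': 1
}
dfp = df.copy()
dfp['age'] = np.clip(dfp['age'], 10, 60)   # clip ages to the range 10-60
dfp['edu'] = np.clip(dfp['edu'], 3, None)  # require at least 5th-6th grade (index 3)
df2 = generalize(dfp.head(500), depths)
isKAnonymized(df2, 7)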
True
Now, the generalized dataset satisfies 𝑘-Anonymity for 𝑘 = 7! In other words, our level of generalization was appropriate,
but outliers prevented us from achieving 𝑘-Anonymity before, even for 𝑘 = 2.
3.5 The Homogeneity Attack
The homogeneity attack represents a significant limitation to the effectiveness of 𝑘-Anonymity. Ordinarily, the goal of 𝑘-Anonymity is to protect individuals’ identities by ensuring that each individual in a dataset is indistinguishable from at least 𝑘 − 1 others. These 𝑘 − 1 others (in addition to the individual) are called the individual’s “group” within the dataset. However, the homogeneity attack exploits a lack of diversity in the sensitive attribute values within a group: when every member of a group shares the same sensitive value, an adversary who can place a target individual in that group learns the target’s sensitive value without needing to single out their exact record.
This attack vector arises when the data’s quasi-identifiers exhibit a high degree of similarity, and groups have identical
values for their sensitive attributes, making it easier for an attacker to infer sensitive information about specific individuals.
In such cases, sensitive information regarding individuals can be inferred via determination of group membership.
Note: It is worth noting that 𝑘-Anonymity is also highly susceptible to attacks enabled by the availability of background
knowledge regarding individuals. Such knowledge may aid, for example, in re-identification of individuals by simple
process of elimination of other members of their group who do not fit certain criteria.
Addressing the homogeneity attack requires careful manual effort to ensure diversity in the sensitive values within each group, highlighting the need for more sophisticated anonymization techniques to enhance the robustness of privacy
preservation in datasets.
One such technique, which is also immune to the presence of background knowledge, is differential privacy.
Summary
• 𝑘-Anonymity is a property of data, which ensures that each individual “blends in” with a group of at least 𝑘 indi-
viduals.
• 𝑘-Anonymity is computationally expensive even to check: the naive algorithm is 𝑂(𝑛²), and faster algorithms take
considerable space.
• 𝑘-Anonymity can be achieved by modifying a dataset by generalizing it, so that particular values become more
common and groups are easier to form.
• Optimal generalization is extremely difficult, and outliers can make it even more challenging. Solving this problem
automatically is NP-hard.
CHAPTER FOUR: DIFFERENTIAL PRIVACY
Learning Objectives
After reading this chapter, you will be able to:
• Define differential privacy
• Explain the importance of the privacy parameter 𝜖
• Use the Laplace mechanism to enforce differential privacy for counting queries
Like 𝑘-Anonymity, differential privacy [3, 4] is a formal notion of privacy (i.e. it’s possible to prove that a data release has
the property). Unlike 𝑘-Anonymity, however, differential privacy is a property of algorithms, and not a property of data.
That is, we can prove that an algorithm satisfies differential privacy; to show that a dataset satisfies differential privacy, we
must show that the algorithm which produced it satisfies differential privacy.
A function 𝐹 (often called a mechanism) satisfies 𝜖-differential privacy if, for all neighboring datasets 𝑥 and 𝑥′ and all sets 𝑆 of possible outputs:

$$\frac{\Pr[F(x) \in S]}{\Pr[F(x') \in S]} \leq e^\epsilon \tag{4.1}$$
Two datasets are considered neighbors if they differ in the data of a single individual. Note that 𝐹 is typically a randomized
function, which has many possible outputs under the same input. Therefore, the probability distribution describing its
outputs is not just a point distribution.
The important implication of this definition is that 𝐹 ’s output will be pretty much the same, with or without the data of
any specific individual. In other words, the randomness built into 𝐹 should be “enough” so that an observed output from
𝐹 will not reveal which of 𝑥 or 𝑥′ was the input. Imagine that my data is present in 𝑥 but not in 𝑥′ . If an adversary can’t
determine which of 𝑥 or 𝑥′ was the input to 𝐹 , then the adversary can’t tell whether or not my data was present in the
input - let alone the contents of that data.
The 𝜖 parameter in the definition is called the privacy parameter or the privacy budget. 𝜖 provides a knob to tune the
“amount of privacy” the definition provides. Small values of 𝜖 require 𝐹 to provide very similar outputs when given
similar inputs, and therefore provide higher levels of privacy; large values of 𝜖 allow less similarity in the outputs, and
therefore provide less privacy.
How should we set 𝜖 to prevent bad outcomes in practice? Nobody knows. The general consensus is that 𝜖 should be
around 1 or smaller, and values of 𝜖 above 10 probably don’t do much to protect privacy - but this rule of thumb could
turn out to be very conservative. We will have more to say on this subject later on.
Note: Why is 𝑆 a set, and why do we write 𝐹 (𝑥) ∈ 𝑆, instead of 𝐹 (𝑥) = 𝑠 for a single output 𝑠? When 𝐹 returns
elements from a continuous domain (like the real numbers), then the probability Pr[𝐹 (𝑥) = 𝑆] = 0 for all 𝑥 (this is a
property of continuous probability distributions—see here for a detailed explanation). For the definition to make sense
in the context of continuous distributions, it needs to instead consider sets of outputs 𝑆, and use set inclusion (∈) instead
of equality.
If 𝐹 returns elements of a discrete set (e.g. 32-bit floating-point numbers), then the definition can instead consider 𝑆 to
be a single value, and use equality instead of set inclusion:
$$\frac{\Pr[F(x) = S]}{\Pr[F(x') = S]} \leq e^\epsilon \tag{4.2}$$
This definition might be more intuitive, especially if you have not studied probability theory.
4.1 The Laplace Mechanism
Differential privacy is typically used to answer specific queries. Let’s consider a query on the census data, without differ-
ential privacy.
“How many individuals in the dataset are 40 years old or older?”
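Without differential privacy, this is a simple Pandas query:

adult[adult['Age'] >= 40].shape[0]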
17449
The easiest way to achieve differential privacy for this query is to add random noise to its answer. The key challenge is
to add enough noise to satisfy the definition of differential privacy, but not so much that the answer becomes too noisy to
be useful. To make this process easier, some basic mechanisms have been developed in the field of differential privacy,
which describe exactly what kind of - and how much - noise to use. One of these is called the Laplace mechanism [4]. According to the Laplace mechanism, for a function 𝑓(𝑥) which returns a number, the following definition of 𝐹(𝑥) satisfies 𝜖-differential privacy:

$$F(x) = f(x) + \mathsf{Lap}\left(\frac{s}{\epsilon}\right)$$

where 𝑠 is the sensitivity of 𝑓, and Lap(𝑆) denotes sampling from the Laplace distribution with center 0 and scale 𝑆.
The sensitivity of a function 𝑓 is the amount 𝑓’s output changes when its input changes by 1. Sensitivity is a complex
topic, and an integral part of designing differentially private algorithms; we will have much more to say about it later. For
now, we will just point out that counting queries always have a sensitivity of 1: if a query counts the number of rows in the
dataset with a particular property, and then we modify exactly one row of the dataset, then the query’s output can change
by at most 1.
Thus we can achieve differential privacy for our example query by using the Laplace mechanism with sensitivity 1 and an
𝜖 of our choosing. For now, let’s pick 𝜖 = 0.1. We can sample from the Laplace distribution using Numpy’s random.
laplace.
sensitivity = 1
epsilon = 0.1
# counting query with Laplace noise scaled to sensitivity/epsilon
adult[adult['Age'] >= 40].shape[0] + np.random.laplace(loc=0, scale=sensitivity/epsilon)
17492.153260229425
You can see the effect of the noise by running this code multiple times. Each time, the output changes, but most of the time, the answer is close enough to the true answer (17,449) to be useful.
4.2 How Much Noise is Enough?
How do we know that the Laplace mechanism adds enough noise to prevent the re-identification of individuals in the
dataset? For one thing, we can try to break it! Let’s write down a malicious counting query, which is specifically designed
to determine whether Karrie Trusslove has an income greater than $50k.
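A sketch of such a query is shown below; it reuses the Name and Target columns from the linkage-attack example, and its exact form is an assumption.

# hypothetical malicious query: count rows matching Karrie's name with income <= $50k
karrie_count = adult[(adult['Name'] == 'Karrie Trusslove') &
                     (adult['Target'] == '<=50K')].shape[0]
karrie_count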
This result definitely violates Karrie’s privacy, since it reveals the value of the income column for Karrie’s row. Since we
know how to ensure differential privacy for counting queries with the Laplace mechanism, we can do so for this query:
sensitivity = 1
epsilon = 0.1
# noisy version of the malicious counting query (karrie_count as sketched above)
karrie_count + np.random.laplace(loc=0, scale=sensitivity/epsilon)
0.004378856104177986
Is the true answer 0 or 1? There’s too much noise to be able to reliably tell. This is how differential privacy is intended to
work - the approach does not reject queries which are determined to be malicious; instead, it adds enough noise that the
results of a malicious query will be useless to the adversary.
4.3 The Unit of Privacy
The typical definition of differential privacy defines neighboring datasets as any two datasets that differ in “one person’s
data.” It’s often difficult or impossible to figure out how much data “belongs” to which person.
The unit of privacy refers to the formal definition of “neighboring” used in a differential privacy guarantee. The most
common unit of privacy is “one person” - meaning the privacy guarantee protects the whole person, forever. But other
definitions are possible; Apple’s implementation of differential privacy, for example, uses a “person-day” unit of privacy,
meaning that the guarantee applies to the data submitted by one person on a single day.
The unit of privacy can result in surprising privacy failures. For example, in Apple’s system, the differential privacy
guarantee does not protect trends in the data occurring across multiple days - even for individual people. If a person
submits identical data for 365 days in a row, then differential privacy provides essentially no protection for that data.
The “one person” unit of privacy is a good default, and usually avoids surprises. Other units of privacy are usually used
to make it easier to get accurate results, or because it’s hard to tie specific data values to individual people.
4.4 Bounded and Unbounded Differential Privacy
It’s common to make a simplifying assumption that makes it easy to formalize the definition of neighboring datasets:
• Each individual’s data is contained in exactly one row of the data
If this assumption is true, then it’s possible to define neighboring datasets formally, in terms of the format of the data
(see below), and retain the desired “one person” unit of privacy. When it’s not true, the best solution is to transform the
data and queries in order to achieve the “one person” unit of privacy. It’s best to avoid using a different unit of privacy
whenever possible.
Under the “one person = one row” simplification, neighboring datasets differ in one row. What does “differ” mean? There are two ways to define that, too! Here are the two formal definitions:
• Unbounded differential privacy: 𝑥′ can be obtained from 𝑥 by adding or removing one row (so the two datasets differ in size by one).
• Bounded differential privacy: 𝑥′ can be obtained from 𝑥 by changing the values of exactly one row (so the two datasets have the same size).
Summary
• Differential privacy is a property of algorithms, and not a property of data.
• A function which satisfies differential privacy is often called a mechanism.
• The easiest way to achieve differential privacy for a query is to add random noise to its answer.
• The unit of privacy refers to the formal definition of “neighboring” used in a differential privacy guarantee. The
most common unit of privacy is “one person” - meaning the privacy guarantee protects the whole person, forever.
CHAPTER FIVE: PROPERTIES OF DIFFERENTIAL PRIVACY
Learning Objectives
After reading this chapter, you will be able to:
• Explain the concepts of sequential composition, parallel composition, and post processing
• Calculate the cumulative privacy cost of multiple applications of a differential privacy mechanism
• Determine when the use of parallel composition is allowed
This chapter describes three important properties of differentially private mechanisms that arise from the definition of
differential privacy. These properties will help us to design useful algorithms that satisfy differential privacy, and ensure
that those algorithms provide accurate answers.
5.1 Sequential Composition
The first major property of differential privacy is sequential composition [4, 5], which bounds the total privacy cost of releasing multiple results of differentially private mechanisms on the same input data. Formally, the sequential composition theorem for differential privacy says that:
• If 𝐹1(𝑥) satisfies 𝜖1-differential privacy
• And 𝐹2(𝑥) satisfies 𝜖2-differential privacy
• Then the mechanism 𝐺(𝑥) = (𝐹1(𝑥), 𝐹2(𝑥)) which releases both results satisfies 𝜖1 + 𝜖2-differential privacy
Sequential composition is a vital property of differential privacy because it enables the design of algorithms that consult
the data more than once. Sequential composition is also important when multiple separate analyses are performed on a
single dataset, since it allows individuals to bound the total privacy cost they incur by participating in all of these analyses.
The bound on privacy cost given by sequential composition is an upper bound - the actual privacy cost of two particular
differentially private releases may be smaller than this, but never larger.
The principle that the 𝜖s “add up” makes sense if we examine the distribution of outputs from a mechanism which averages
two differentially private results together. Let’s look at some examples.
epsilon1 = 1
epsilon2 = 1
epsilon_total = 2
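The mechanisms F1, F2, F3, and F_combined referenced below are defined by hidden code in the original chapter; a minimal sketch consistent with the discussion (assuming a true answer of 0) is:

def F1():
    return np.random.laplace(loc=0, scale=1/epsilon1)

def F2():
    return np.random.laplace(loc=0, scale=1/epsilon2)

def F3():
    return np.random.laplace(loc=0, scale=1/epsilon_total)

# averages two epsilon=1 results; sequential composition bounds its total cost by epsilon=2
def F_combined():
    return (F1() + F2()) / 2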
If we graph F1 and F2, we see that the distributions of their outputs look pretty similar.
If we graph F1 and F3, we see that the distribution of outputs from F3 looks “pointier” than that of F1, because its
higher 𝜖 implies less privacy, and therefore a smaller likelihood of getting results far from the true answer.
If we graph F1 and F_combined, we see that the distribution of outputs from F_combined is pointier. This means
its answers are more accurate than those of F1, so it makes sense that its 𝜖 must be higher (i.e. it yields less privacy than
F1).
What about F3 and F_combined? Recall that the 𝜖 values for these two mechanisms are the same - both have an 𝜖 of
2. Their output distributions should look the same.
In fact, F3 looks “pointier”! Why does this happen? Remember that sequential composition yields an upper bound on the total 𝜖 of several releases; the actual cumulative impact on privacy might be lower. That’s the case here - the actual privacy loss appears to be somewhat lower than the upper bound 𝜖 determined by sequential composition.
Sequential composition is an extremely useful way to control total privacy cost, and we will see it used in many different
ways, but keep in mind that it is not necessarily an exact bound.
5.2 Parallel Composition
The second important property of differential privacy is called parallel composition [6]. Parallel composition can be seen
as an alternative to sequential composition - a second way to calculate a bound on the total privacy cost of multiple data
releases. Parallel composition is based on the idea of splitting your dataset into disjoint chunks and running a differentially
private mechanism on each chunk separately. Since the chunks are disjoint, each individual’s data appears in exactly one
chunk - so even if there are 𝑘 chunks in total (and therefore 𝑘 runs of the mechanism), the mechanism runs exactly once
on the data of each individual. Formally, if 𝐹(𝑥) satisfies 𝜖-differential privacy, and we split a dataset 𝑋 into 𝑘 disjoint chunks such that 𝑥1 ∪ ... ∪ 𝑥𝑘 = 𝑋, then the mechanism which releases all of the results 𝐹(𝑥1), ..., 𝐹(𝑥𝑘) satisfies 𝜖-differential privacy.
Note that this is a much better bound than sequential composition would give. Since we run 𝐹 𝑘 times, sequential
composition would say that this procedure satisfies 𝑘𝜖-differential privacy. Parallel composition allows us to say that the
total privacy cost is just 𝜖.
The formal definition matches up with our intuition - if each participant in the dataset contributes one row to 𝑋, then
this row will appear in exactly one of the chunks 𝑥1 , ..., 𝑥𝑘 . That means 𝐹 will only “see” this participant’s data one time,
meaning a privacy cost of 𝜖 is appropriate for that individual. Since this property holds for all individuals, the privacy cost
is 𝜖 for everyone.
5.2.1 Histograms
In our context, a histogram is an analysis of a dataset which splits the dataset into “bins” based on the value of one of the
data attributes, and counts the number of rows in each bin. For example, a histogram might count the number of people
in the dataset who achieved a particular educational level.
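A sketch of the query producing the histogram below:

adult['Education'].value_counts().head(5)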
Education
HS-grad 10501
Some-college 7291
Bachelors 5355
Masters 1723
Assoc-voc 1382
Histograms are particularly interesting for differential privacy because they automatically satisfy parallel composition.
Each “bin” in a histogram is defined by a possible value for a data attribute (for example, 'Education' ==
'HS-grad'). It’s impossible for a single row to have two values for an attribute simultaneously, so defining the bins this
way guarantees that they will be disjoint. Thus we have satisfied the requirements for parallel composition, and we can
use a differentially private mechanism to release all of the bin counts with a total privacy cost of just 𝜖.
epsilon = 1
# This analysis has a total privacy cost of epsilon = 1, even though we release many results!
f = lambda x: x + np.random.laplace(loc=0, scale=1/epsilon)
adult['Education'].value_counts().apply(f).head(5)
Education
HS-grad 10502.052933
Some-college 7291.038615
Bachelors 5353.172984
Masters 1719.328962
Assoc-voc 1383.845293
A contingency table or cross tabulation (often shortened to crosstab) is like a multi-dimensional histogram. It counts the
frequency of rows in the dataset with particular values for more than one attribute at a time. Contingency tables are
frequently used to show the relationship between two variables when analyzing data. For example, we might want to see
counts based on both education level and gender:
pd.crosstab(adult['Education'], adult['Sex']).head(5)
Like the histogram we saw earlier, each individual in the dataset participates in exactly one count appearing in this table.
It’s impossible for any single row to have multiple values simultaneously, for any set of data attributes considered in
building the contingency table. As a result, it’s safe to use parallel composition here, too.
ct = pd.crosstab(adult['Education'], adult['Sex'])
f = lambda x: x + np.random.laplace(loc=0, scale=1/epsilon)
ct.applymap(f).head(5)
It’s also possible to generate contingency tables of more than 2 variables. Consider what happens each time we add a
variable, though: each of the counts tends to get smaller. Intuitively, as we split the dataset into more chunks, each chunk
has fewer rows in it, so all of the counts get smaller.
These shrinking counts can have a significant effect on the accuracy of the differentially private results we calculate from
them. If we think of things in terms of signal and noise, a large count represents a strong signal - it’s unlikely to be
disrupted too much by relatively weak noise (like the noise we add above), and therefore the results are likely to be useful
even after the noise is added. However, a small count represents a weak signal - potentially as weak as the noise itself -
and after we add the noise, we won’t be able to infer anything useful from the results.
So while it may seem that parallel composition gives us something “for free” (more results for the same privacy cost),
that’s not really the case. Parallel composition simply moves the tradeoff between accuracy and privacy along a different
axis - as we split the dataset into more chunks and release more results, each result contains a weaker signal, and so it’s
less accurate.
5.3 Post-processing
The third property of differential privacy we will discuss here is called post-processing. The idea is simple: it’s impossible
to reverse the privacy protection provided by differential privacy by post-processing the data in some way. Formally:
Theorem 3 (Post-Processing)
• If 𝐹 (𝑋) satisfies 𝜖-differential privacy
• Then for any (deterministic or randomized) function 𝑔, 𝑔(𝐹 (𝑋)) satisfies 𝜖-differential privacy
The post-processing property means that it’s always safe to perform arbitrary computations on the output of a differentially
private mechanism - there’s no danger of reversing the privacy protection the mechanism has provided. In particular, it’s
fine to perform post-processing that might reduce the noise or improve the signal in the mechanism’s output (e.g. replacing
negative results with zeros, for queries that shouldn’t return negative results). In fact, many sophisticated differentially
private algorithms make use of post-processing to reduce noise and improve the accuracy of their results.
The other implication of the post-processing property is that differential privacy provides resistance against privacy attacks
based on auxiliary information. For example, the function 𝑔 might contain auxiliary information about elements of the
dataset, and attempt to perform a linkage attack using this information. The post-processing property says that such an
attack is limited in its effectiveness by the privacy parameter 𝜖, regardless of the auxiliary information contained in 𝑔.
Summary
• Sequential composition bounds the total privacy cost of releasing multiple results of differentially private mecha-
nisms on the same input data.
• Parallel composition is based on the idea of splitting your dataset into disjoint chunks and running a differentially
private mechanism on each chunk separately.
• The post-processing property means that it’s always safe to perform arbitrary computations on the output of a
differentially private mechanism.
CHAPTER SIX: SENSITIVITY
Learning Objectives
After reading this chapter, you will be able to:
• Define sensitivity
• Find the sensitivity of counting queries
• Find the sensitivity of summation queries
• Decompose average queries into counting and summation queries
• Use clipping to bound the sensitivity of summation queries
As we mentioned when we discussed the Laplace mechanism, the amount of noise necessary to ensure differential privacy
for a given query depends on the sensitivity of the query. Roughly speaking, the sensitivity of a function reflects the amount
the function’s output will change when its input changes. Recall that the Laplace mechanism defines a mechanism 𝐹 (𝑥)
as follows:
$$F(x) = f(x) + \mathsf{Lap}\left(\frac{s}{\epsilon}\right) \tag{6.1}$$
where 𝑓(𝑥) is a deterministic function (the query), 𝜖 is the privacy parameter, and 𝑠 is the sensitivity of 𝑓.
For a function 𝑓 ∶ 𝒟 → ℝ mapping datasets (𝒟) to real numbers, the global sensitivity of 𝑓 is defined as follows:

$$GS(f) = \max_{x, x' : d(x, x') \leq 1} |f(x) - f(x')|$$
Here, 𝑑(𝑥, 𝑥′ ) represents the distance between two datasets 𝑥 and 𝑥′ , and we say that two datasets are neighbors if their
distance is 1 or less. How this distance is defined has a huge effect on the definition of privacy we obtain, and we’ll discuss
the distance metric on datasets in detail later on.
The definition of global sensitivity says that for any two neighboring datasets 𝑥 and 𝑥′ , the difference between 𝑓(𝑥) and
𝑓(𝑥′ ) is at most 𝐺𝑆(𝑓). This measure of sensitivity is called “global” because it is independent of the actual dataset being
queried (it holds for any choice of neighboring 𝑥 and 𝑥′ ). Another measure of sensitivity, called local sensitivity, fixes
one of the datasets to be the one being queried; we will consider this measure in a later section. For now, when we say
“sensitivity,” we mean global sensitivity.
6.1 Distance
The distance metric 𝑑(𝑥, 𝑥′ ) described earlier can be defined in many different ways. Intuitively, the distance between
two datasets should be equal to 1 (i.e. the datasets are neighbors) if they differ in the data of exactly one individual. This
idea is easy to formalize in some contexts (e.g. in the US Census, each individual submits a single response containing
their data) but extremely challenging in others (e.g. location trajectories, social networks, and time-series data).
A common formal definition for datasets containing rows is to consider the number of rows which differ between the two.
When each individual’s data is contained in a single row, this definition often makes sense. Formally, this definition of
distance is encoded as a symmetric difference between the two datasets:
𝑑(𝑥, 𝑥′ ) = |𝑥 − 𝑥′ ∪ 𝑥′ − 𝑥|
6.2 Calculating Sensitivity
How do we determine the sensitivity of a particular function of interest? For some simple functions on real numbers, the
answer is obvious.
• The global sensitivity of 𝑓(𝑥) = 𝑥 is 1, since changing 𝑥 by 1 changes 𝑓(𝑥) by 1
• The global sensitivity of 𝑓(𝑥) = 𝑥 + 𝑥 is 2, since changing 𝑥 by 1 changes 𝑓(𝑥) by 2
• The global sensitivity of 𝑓(𝑥) = 5 ∗ 𝑥 is 5, since changing 𝑥 by 1 changes 𝑓(𝑥) by 5
• The global sensitivity of 𝑓(𝑥) = 𝑥 ∗ 𝑥 is unbounded, since the change in 𝑓(𝑥) depends on the value of 𝑥
For functions that map datasets to real numbers, we can perform a similar analysis. We will consider the functions which
represent common aggregate database queries: counts, sums, and averages.
Counting queries (COUNT in SQL) count the number of rows in the dataset which satisfy a specific property. As a rule
of thumb, counting queries always have a sensitivity of 1. This is because adding a row to the dataset can increase the
output of the query by at most 1: either the new row has the desired property, and the count increases by 1, or it does not,
and the count stays the same (the count may correspondingly decrease when a row is removed).
Example: “How many people are in the dataset?” (sensitivity: 1 - counting rows where the property = True)
adult.shape[0]
32563
Example: “How many people have an educational status above 10?” (sensitivity: 1 - counting rows with a property)
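A sketch of this query, using the Education-Num column referenced elsewhere in this book:

adult[adult['Education-Num'] > 10].shape[0]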
10517
Example: “How many people have an educational status equal to or below 10?” (sensitivity: 1 - counting rows with
a property)
22046
Example: “How many people are named Joe Near?” (sensitivity: 1 - counting rows with a property)
Summation queries (SUM in SQL) sum up the attribute values of dataset rows.
Example: “What is the sum of the ages of people with an educational status above 10?”
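A sketch of this query:

adult[adult['Education-Num'] > 10]['Age'].sum()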
441431
Sensitivity for these queries is not as simple as it is for counting queries. Adding a new row to the dataset will increase the
result of our example query by the age of the new person. That means the sensitivity of the query depends on the contents
of the row we add.
We’d like to come up with a concrete number to represent the sensitivity of the query. Unfortunately, no number really
exists. We could claim, for example, that the sensitivity is 125 - but it may turn out that the row we add to the database
corresponds to a person who is over 125 years old, which would violate our claim. For any number we come up with, it’s
possible for the added row to violate our claim.
You might (rightly) be skeptical of this point. Say we claim the sensitivity is 1000 - it’s very unlikely that we’ll find a
person who is 1000 years old to violate this claim. In this specific domain - ages - there’s a very reasonable upper bound
on how old someone can be. The oldest person ever lived to be 122 years old, so an upper bound of 125 seems reasonable.
But this is not a proof that nobody will ever live to be 126. And in other domains (e.g. income), it can be much harder
to come up with a reasonable upper bound.
As a rule of thumb, summation queries have unbounded sensitivity when no lower and upper bounds exist on the value
of the attribute being summed. When lower and upper bounds do exist, the sensitivity of a summation query is equal to
the difference between them. In the next section, we will see a technique called clipping for enforcing bounds when none
exist, so that summation queries with unbounded sensitivity can be converted into queries with bounded sensitivity.
Average queries (AVG in SQL) calculate the mean of attribute values in a particular column.
Example: “What is the average age of people with an educational status above 10?”
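A sketch of this query:

adult[adult['Education-Num'] > 10]['Age'].mean()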
41.973091185699346
The easiest way to answer an average query with differential privacy is by re-phrasing it as two queries: a summation
query divided by a counting query. For the above example:
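A sketch of the decomposition:

s = adult[adult['Education-Num'] > 10]['Age'].sum()     # summation query
c = adult[adult['Education-Num'] > 10]['Age'].shape[0]  # counting query
s / c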
41.973091185699346
The sensitivities of both queries can be calculated as described above. Noisy answers for each can be calculated (e.g.
using the Laplace mechanism) and the noisy answers can be divided to obtain a differentially private mean. The total
privacy cost of both queries can be calculated by sequential composition.
6.3 Clipping
Queries with unbounded sensitivity cannot be directly answered with differential privacy using the Laplace mechanism.
Fortunately, we can often transform such queries into equivalent queries with bounded sensitivity, via a process called
clipping.
The basic idea behind clipping is to enforce upper and lower bounds on attribute values. For example, ages above 125
can be “clipped” to exactly 125. After clipping has been performed, we are guaranteed that all ages will be 125 or below.
As a result, the sensitivity of a summation query on clipped data is equal to the difference between the upper and lower
bounds used in clipping: 𝑢𝑝𝑝𝑒𝑟 − 𝑙𝑜𝑤𝑒𝑟. For example, the following query has a sensitivity of 125:
adult['Age'].clip(lower=0, upper=125).sum()
1360238
The primary challenge in performing clipping is to determine the upper and lower bounds. For ages, this is simple -
nobody can have an age less than 0, and probably nobody will be older than 125. In other domains, as mentioned earlier,
it’s much more difficult.
Furthermore, there is a tradeoff between the amount of information lost in clipping and the amount of noise needed to
ensure differential privacy. When the upper and lower clipping bounds are closer together, then the sensitivity is lower,
and less noise is needed to ensure differential privacy. However, aggressive clipping often removes a lot of information
from the data; this information loss tends to cause a loss of accuracy which outweighs the improvement in noise resulting
from smaller sensitivity.
As a rule of thumb, try to set the clipping bounds to include 100% of the dataset, or get as close as possible. This is
harder in some domains (e.g. graph queries, which we will study later) than others.
It’s tempting to determine the clipping bounds by looking at the data. For example, we can look at the histogram of ages
in our dataset to determine an appropriate upper bound:
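For example (a sketch of the hidden plotting code):

plt.hist(adult['Age'], bins=20)
plt.xlabel('Age')
plt.ylabel('Number of people')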
It’s clear from this histogram that nobody in this particular dataset is over 90, so an upper bound of 90 would suffice.
However, it’s important to note that this approach does not satisfy differential privacy. If we pick our clipping bounds
by looking at the data, then the bounds themselves might reveal something about the data.
Typically, clipping bounds are decided either by using a property of the dataset that can be known without looking at the
data (e.g. that the dataset contains ages, which are likely to lie between 0 and 125), or by performing differentially private
queries to evaluate different choices for the clipping bounds.
To use the second approach, we typically set the lower bound to 0 and slowly increase the upper bound until the query’s
output stops changing (meaning we haven’t included any new data by increasing the bound). For example, let’s try com-
puting the sum of ages for clipping bounds from 0 to 100, using the Laplace mechanism for each one to ensure differential
privacy:
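A sketch of this experiment; the laplace_mech helper is a hypothetical convenience wrapper, not part of any library.

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity/epsilon)

epsilon_i = 1/100   # 100 queries; total cost epsilon = 1 by sequential composition
bs = range(1, 101)
results = [laplace_mech(adult['Age'].clip(lower=0, upper=b).sum(), b, epsilon_i) for b in bs]
plt.plot(bs, results)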
The total privacy cost for building this plot is 𝜖 = 1 by sequential composition, since we do 100 queries each with
𝜖𝑖 = 0.01. It’s clear that the results level off around a value of upper = 80, so this is a good choice for the clipping
bound.
We can use the same approach for data attributes from any numerical domain, but it helps to know something about the
scale of the data in advance. For example, trying clipping values between 0 and 100 for yearly incomes would not work
very well - we wouldn’t even come close to finding a reasonable upper bound.
One refinement that can work well when the scale of the data is not known is to test upper bounds according to a logarithmic
scale.
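A sketch of this refinement, reusing the hypothetical laplace_mech helper from the previous sketch:

bs = [2**k for k in range(16)]   # upper bounds on a logarithmic scale
results = [laplace_mech(adult['Age'].clip(lower=0, upper=b).sum(), b, 0.01) for b in bs]
plt.plot(bs, results)
plt.xscale('log')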
This approach allows us to test a huge range of possible bounds with a small number of queries, but at the expense of less
precision in determining the perfect bound. As the upper bound gets really large, the noise will start to overwhelm the
signal - notice how the sum fluctuates wildly for the largest clipping parameters! The key is to look for a region of the
graph which is relatively smooth (meaning low noise) and also not increasing (meaning the clipping bound is sufficient).
Here, this occurs at roughly 2⁸ = 256, which is a reasonable approximation of the upper bound we derived earlier.
6.4 Avoiding Sensitivity Underestimation
In order to correctly predict the sensitivity of our queries and transformations, we rely on mathematical reasoning around numeric functions such as arithmetic operators.
If for some reason our numeric functions failed to properly work as anticipated, this would throw off our sensitivity
analysis. This could happen, for example, when integer overflow or floating-point representation error occur [7, 8].
0.1 + 0.2
0.30000000000000004
Oh no! In this example, due to representation error of floating-point numbers in Python, our result is slightly larger than
expected.
This means that there is the potential of sensitivity underestimation in any numeric operations which naively add floating-
point number types in Python.
Fortunately, there are several ways to remedy this issue - in Python and other programming languages - for example via
truncation and rounding, for which there are several techniques. In many cases these techniques dance around trading
efficiency for increased precision.
For example, in Python we can achieve arbitrary precision via the decimal module, which provides support for correctly rounded decimal floating-point arithmetic.

from decimal import Decimal

Decimal('0.1') + Decimal('0.2')

Decimal('0.3')
Using this approach, we achieve the correct result, and the correctness of our sensitivity analysis is preserved.
Tip: In Python 3, integers have arbitrary precision, limited only by available memory. This means that we don't
normally need to worry about integer overflow affecting the correctness of our sensitivity analysis.
Sensitivity underestimation may break the differential privacy guarantee, while sensitivity overestimation leads to unnec-
essary inaccuracy in the private analysis.
Summary
• The amount of noise necessary to ensure differential privacy for a given query depends on the sensitivity of the
query.
• Roughly speaking, the sensitivity of a function reflects the amount the function’s output will change when its input
changes.
• Intuitively, the distance between two datasets should be equal to 1 (i.e. the datasets are neighbors) if they differ in
the data of exactly one individual.
• Queries with unbounded sensitivity cannot be directly answered with differential privacy using the Laplace mech-
anism.
• Fortunately, we can often transform such queries into equivalent queries with bounded sensitivity, via a process
called clipping.
• In order to correctly predict the sensitivity of our queries and transformations, we rely on mathematical reasoning
around numeric functions.
• Sensitivity underestimation may break the differential privacy guarantee, while sensitivity overestimation leads to
unnecessary inaccuracy in the private analysis.
CHAPTER
SEVEN
APPROXIMATE DIFFERENTIAL PRIVACY
Learning Objectives
After reading this chapter, you will be able to:
• Define approximate differential privacy
• Explain the differences between approximate and pure differential privacy
• Describe the advantages and disadvantages of approximate differential privacy
• Describe and calculate L1 and L2 sensitivity of vector-valued queries
• Define and apply the Gaussian mechanism
• Apply advanced composition
Approximate differential privacy [5], also called (𝜖, 𝛿)-differential privacy, has the following definition: a mechanism 𝐹 satisfies (𝜖, 𝛿)-differential privacy if, for all neighboring datasets 𝑥 and 𝑥′ and all sets 𝑆 of possible outputs,

$\Pr[F(x) \in S] \le e^\epsilon \Pr[F(x') \in S] + \delta$
The new privacy parameter, 𝛿, represents a “failure probability” for the definition. With probability 1 − 𝛿, we will get the
same guarantee as pure differential privacy; with probability 𝛿, we get no guarantee. In other words:
• With probability $1 - \delta$, $\frac{\Pr[F(x) = S]}{\Pr[F(x') = S]} \le e^\epsilon$
• With probability 𝛿, we get no guarantee at all
This definition should seem a little bit scary! With probability 𝛿, anything at all could happen - including a release of
the entire sensitive dataset! For this reason, we typically require 𝛿 to be very small - usually 1/𝑛² or less, where 𝑛 is
the size of the dataset. In addition, we’ll see that the (𝜖, 𝛿)-differentially private mechanisms in practical use don’t fail
catastrophically, as allowed by the definition - instead, they fail gracefully, and don’t do terrible things like releasing the
entire dataset.
Such mechanisms are possible, however, and they do satisfy the definition of (𝜖, 𝛿)-differential privacy. We’ll see an
example of such a mechanism later in this section.
Approximate differential privacy has similar properties to pure 𝜖-differential privacy. It satisfies sequential composition: if 𝐹₁ satisfies (𝜖₁, 𝛿₁)-differential privacy and 𝐹₂ satisfies (𝜖₂, 𝛿₂)-differential privacy, then their composition satisfies (𝜖₁ + 𝜖₂, 𝛿₁ + 𝛿₂)-differential privacy.
The only difference from the pure 𝜖 setting is that we add up the 𝛿s as well as the 𝜖s. Approximate differential privacy
also satisfies post-processing and parallel composition.
The Gaussian mechanism is an alternative to the Laplace mechanism, which adds Gaussian noise instead of Laplacian
noise. The Gaussian mechanism does not satisfy pure 𝜖-differential privacy, but does satisfy (𝜖, 𝛿)-differential privacy.
According to the Gaussian mechanism, for a function 𝑓(𝑥) which returns a number, the following definition of 𝐹(𝑥) satisfies (𝜖, 𝛿)-differential privacy:

$F(x) = f(x) + \mathcal{N}(\sigma^2), \text{ where } \sigma^2 = \frac{2 s^2 \log(1.25/\delta)}{\epsilon^2}$

where 𝑠 is the sensitivity of 𝑓, and 𝒩(𝜎²) denotes sampling from the Gaussian (normal) distribution with center 0 and variance 𝜎². Note that here (and elsewhere in these notes), log denotes the natural logarithm.
For real-valued functions 𝑓 ∶ 𝐷 → ℝ, we can use the classic Gaussian mechanism with values of 𝜖 equal to or less than 1,
a range called “high privacy”[9]. This is in contrast to the Laplace mechanism, which can be used in any privacy regime,
high or low.
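As a minimal sketch of this mechanism (assuming numpy is imported as np; the function name is illustrative):

import numpy as np

def gaussian_mech(v, sensitivity, epsilon, delta):
    # sigma^2 = 2 * s^2 * log(1.25/delta) / epsilon^2, per the formula above
    sigma = np.sqrt(2 * sensitivity**2 * np.log(1.25 / delta)) / epsilon
    return v + np.random.normal(loc=0, scale=sigma)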
It’s easy to compare what happens under both mechanisms for a given value of 𝜖.
Here, we graph the empirical probability density function of the Laplace and Gaussian mechanisms for 𝜖 = 1, with
𝛿 = 10−5 for the Gaussian mechanism.
Compared to the Laplace mechanism, the plot for the Gaussian mechanism looks “squished.” Differentially private outputs
which are far from the true answer are much more likely using the Gaussian mechanism than they are under the Laplace
mechanism (which, by comparison, looks extremely “pointy”).
So the Gaussian mechanism has two major drawbacks - it requires the use of the the relaxed (𝜖, 𝛿)-differential privacy
definition, and it’s less accurate than the Laplace mechanism. Why would we want to use it?
So far, we have only considered real-valued functions (i.e. the function’s output is always a single real number). Such
functions are of the form 𝑓 ∶ 𝐷 → ℝ. Both the Laplace and Gaussian mechanism, however, can be extended to vector-
valued functions of the form 𝑓 ∶ 𝐷 → ℝ𝑘 , which return vectors of real numbers. We can think of histograms as
vector-valued functions, which return a vector whose elements consist of histogram bin counts.
We saw earlier that the sensitivity of a real-valued function is the largest possible change in its output between neighboring datasets. The natural extension to a vector-valued function 𝑓 is the 𝐿1 sensitivity:

$GS_1(f) = \max_{d(x, x') \le 1} \lVert f(x) - f(x') \rVert_1$

This is equal to the sum of the elementwise sensitivities. For example, if we define a vector-valued function 𝑓 that returns a length-𝑘 vector of 1-sensitive results, then the 𝐿1 sensitivity of 𝑓 is 𝑘.
Similarly, the 𝐿2 sensitivity of a vector-valued function 𝑓 is:

$GS_2(f) = \max_{d(x, x') \le 1} \lVert f(x) - f(x') \rVert_2$

Using the same example as above, a vector-valued function 𝑓 returning a length-𝑘 vector of 1-sensitive results has 𝐿2 sensitivity of $\sqrt{k}$. For long vectors, the 𝐿2 sensitivity will clearly be much lower than the 𝐿1 sensitivity! For some applications, like machine learning algorithms (which sometimes return vectors with thousands of elements), 𝐿2 sensitivity is significantly lower than 𝐿1 sensitivity.
As mentioned earlier, both the Laplace and Gaussian mechanisms can be extended to vector-valued functions. However,
there’s a key difference between these two extensions: the vector-valued Laplace mechanism requires the use of 𝐿1
sensitivity, while the vector-valued Gaussian mechanism allows the use of either 𝐿1 or 𝐿2 sensitivity. This is a major
strength of the Gaussian mechanism. For applications in which 𝐿2 sensitivity is much lower than 𝐿1 sensitivity, the
Gaussian mechanism allows adding much less noise.
• The vector-valued Laplace mechanism releases $f(x) + (Y_1, \dots, Y_k)$, where the $Y_i$ are drawn i.i.d. from the Laplace distribution with scale $\frac{s}{\epsilon}$ and 𝑠 is the 𝐿1 sensitivity of 𝑓
• The vector-valued Gaussian mechanism releases $f(x) + (Y_1, \dots, Y_k)$, where the $Y_i$ are drawn i.i.d. from the Gaussian distribution with $\sigma^2 = \frac{2 s^2 \log(1.25/\delta)}{\epsilon^2}$ and 𝑠 is the 𝐿2 sensitivity of 𝑓
The definition of (𝜖, 𝛿)-differential privacy says that a mechanism which satisfies the definition must “behave well” with
probability 1 − 𝛿. That means that with probability 𝛿, the mechanism can do anything at all. This “failure probability”
is concerning, because mechanisms which satisfy the relaxed definition may (with low probability) result in very bad
outcomes.
Consider the following mechanism, which we will call the catastrophe mechanism:
With probability 1 − 𝛿, the catastrophe mechanism satisfies 𝜖-differential privacy. With probability 𝛿, it releases the
whole dataset with no noise. This mechanism satisfies the definition of approximate differential privacy, but we probably
wouldn’t want to use it in practice.
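As a sketch (assuming laplace_mech from earlier chapters and a sensitivity-1 query 𝑓), the catastrophe mechanism might look like this:

import random

def catastrophe_mech(f, x, epsilon, delta):
    if random.random() > delta:
        # with probability 1 - delta: an ordinary epsilon-DP Laplace release
        return laplace_mech(f(x), 1, epsilon)
    else:
        # with probability delta: catastrophic failure - release the raw dataset
        return x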
Fortunately, most (𝜖, 𝛿)-differentially private mechanisms don’t have such a catastrophic failure mode. The Gaussian
mechanism, for example, doesn’t ever release the whole dataset. Instead, with probability 𝛿, the Gaussian mechanism
doesn’t quite satisfy 𝜖-differential privacy - it satisfies 𝑐𝜖-differential privacy instead, for some value 𝑐.
The Gaussian mechanism thus fails gracefully, rather than catastrophically, so it’s reasonable to have far more confidence
in the Gaussian mechanism than in the catastrophe mechanism. Later, we will see alternative relaxations of the definition
of differential privacy which distinguish between mechanisms that fail gracefully (like the Gaussian mechanism) and ones
that fail catastropically (like the catastrophe mechanism).
We have already seen two ways of combining differentially private mechanisms: sequential composition and parallel
composition. It turns out that (𝜖, 𝛿)-differential privacy admits a new way of analyzing the sequential composition of
differentially private mechanisms, which can result in a lower privacy cost.
The advanced composition theorem [10] is usually stated in terms of mechanisms which are instances of 𝑘-fold adaptive composition: a sequence of 𝑘 mechanisms, run one after another, in which each mechanism may be chosen based on the outputs of the mechanisms that came before it.
Iterative programs (i.e. loops or recursive functions) are nearly always instances of 𝑘-fold adaptive composition. A for
loop that runs 1000 iterations, for example, is a 1000-fold adaptive composition. As a more specific example, an averaging
attack is a 𝑘-fold adaptive composition:
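A minimal sketch of such an attack (assuming laplace_mech from earlier chapters; the function signature matches the call below):

import numpy as np

def avg_attack(mu, epsilon, k):
    # ask the same sensitivity-1 query k times and average the noisy answers
    results = [laplace_mech(mu, 1, epsilon) for i in range(k)]
    return np.mean(results)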
avg_attack(10, 1, 500)
10.035959112265063
In this example, the sequence of mechanisms is fixed ahead of time (we use the same mechanism each time), and 𝑘 = 500.
The standard sequential composition theorem says that the total privacy cost of this mechanism is 𝑘𝜖 (in this case, 500𝜖).
The advanced composition theorem says that if each of the 𝑘 mechanisms satisfies 𝜖-differential privacy, then for any 𝛿′ ≥ 0, the 𝑘-fold composition satisfies (𝜖′, 𝛿′)-differential privacy, where:

$\epsilon' = 2\epsilon\sqrt{2k \log(1/\delta')}$
So advanced composition derives a much lower bound on 𝜖′ than sequential composition, for the same mechanism. What
does this mean? It means that the bounds given by sequential composition are loose - they don’t tightly bound the actual
privacy cost of the computation. In fact, advanced composition also gives loose bounds - they’re just slightly less loose
than the ones given by sequential composition.
It’s important to note that the two bounds are technically incomparable, since advanced composition introduces a 𝛿. When
𝛿 is small, however, we will often compare the 𝜖s given by both methods.
So, should we always use advanced composition? It turns out that we should not. Let’s try the experiment above for
different values of 𝑘, and graph the total privacy cost under both sequential composition and advanced composition.
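A sketch of this comparison (assuming numpy and matplotlib, and an illustrative value of 𝛿′ for advanced composition):

import numpy as np
import matplotlib.pyplot as plt

epsilon_i = 1
delta_p = 1e-5                      # delta' for advanced composition (an assumed value)
ks = np.arange(1, 500)

seq_eps = ks * epsilon_i                                       # sequential composition: k * epsilon
adv_eps = 2 * epsilon_i * np.sqrt(2 * ks * np.log(1/delta_p))  # advanced composition bound

plt.plot(ks, seq_eps, label='Sequential composition')
plt.plot(ks, adv_eps, label='Advanced composition')
plt.xlabel('Number of queries k')
plt.ylabel('Total privacy cost (epsilon)')
plt.legend();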
Standard sequential composition, it turns out, beats advanced composition for 𝑘 smaller than about 70. Thus, advanced
composition is only really useful when 𝑘 is large (e.g. more than 100). When 𝑘 is very large, though, advanced composition
can make a big difference.
The description of advanced composition above requires the individual mechanisms being composed to satisfy pure 𝜖-differential privacy. However, the theorem also applies if they satisfy (𝜖, 𝛿)-differential privacy instead. The more general statement of the advanced composition theorem is as follows ([10], Theorem 3.20): if each of the 𝑘 mechanisms satisfies (𝜖, 𝛿)-differential privacy, then for any 𝛿′ ≥ 0, the 𝑘-fold composition satisfies (𝜖′, 𝑘𝛿 + 𝛿′)-differential privacy, with 𝜖′ as above.
The only difference is in the failure parameter 𝛿 for the composed mechanism, where we have an additional 𝑘𝛿 term.
When the mechanisms being composed satisfy pure 𝜖-differential privacy, then 𝛿 = 𝑘𝛿 = 0, and we get the same result
as the statement above.
Summary
• For applications in which L2 sensitivity is much lower than L1 sensitivity, the Gaussian mechanism allows adding
much less noise.
• When the number of iterations in a loop is very large, advanced composition can make a big difference in accuracy.
EIGHT
LOCAL SENSITIVITY
Learning Objectives
After reading this chapter, you will be able to:
• Define local sensitivity and explain how it differs from global sensitivity
• Describe how local sensitivity can leak information about the data
• Use propose-test-release to safely apply local sensitivity
• Describe the smooth sensitivity framework
• Use the sample-and-aggregate framework to answer queries with arbitrary sensitivity
So far, we have seen only one measure of sensitivity: global sensitivity. Our definition for global sensitivity considers any
two neighboring datasets. This seems pessimistic, since we’re going to run our differentially private mechanisms on an
actual dataset - shouldn’t we consider neighbors of that dataset?
This is the intuition behind local sensitivity [11]: fix one of the two datasets to be the actual dataset being queried, and consider all of its neighbors. Formally, the local sensitivity of a function 𝑓 ∶ 𝒟 → ℝ at 𝑥 ∶ 𝒟 is defined as:

$LS(f, x) = \max_{x' : d(x, x') \le 1} |f(x) - f(x')|$
Notice that local sensitivity is a function of both the query (𝑓) and the actual dataset (𝑥). Unlike in the case of global
sensitivity, we can’t talk about the local sensitivity of a function without also considering the dataset at which that local
sensitivity occurs.
Local sensitivity allows us to place finite bounds on the sensitivity of some functions whose global sensitivity is difficult
to bound. The mean function is one example. So far, we’ve calculated differentially private means by splitting the query
into two queries: a differentially private sum (the numerator) and a differentially private count (the denominator). By
sequential composition and post-processing, the quotient of these two results satisfies differential privacy.
Why do we do it this way? Because the amount the output of a mean query might change when a row is added to or
removed from the dataset depends on the size of the dataset. If we want to bound the global sensitivity of a mean query,
we have to assume the worst: a dataset of size 1. In this case, if the data attribute values lie between upper and lower
bounds 𝑢 and 𝑙, the global sensitivity of the mean is just |𝑢 − 𝑙|. For large datasets, this is extremely pessimistic, and the
“noisy sum over noisy count” approach is much better.
The situation is different for local sensitivity. In the worst case, we can add a new row to the dataset which contains the
maximum value (𝑢). Let 𝑛 = |𝑥| (i.e. the size of the dataset). We start with the value of the mean:
$f(x) = \frac{\sum_{i=1}^n x_i}{n}$   (8.1)
Now we consider what happens when we add a row:
$|f(x') - f(x)| = \left| \frac{\sum_{i=1}^n x_i + u}{n+1} - \frac{\sum_{i=1}^n x_i}{n} \right|$   (8.2)

$\le \left| \frac{\sum_{i=1}^n x_i + u}{n+1} - \frac{\sum_{i=1}^n x_i}{n+1} \right|$   (8.3)

$= \left| \frac{\sum_{i=1}^n x_i + u - \sum_{i=1}^n x_i}{n+1} \right|$   (8.4)

$= \left| \frac{u}{n+1} \right|$   (8.5)
This local sensitivity measure is defined in terms of the actual dataset’s size, which is not possible under global sensitivity.
We have defined an alternative measure of sensitivity - but how do we use it? Can we just use the Laplace mechanism, in
the same way as we did with global sensitivity? Does the following definition of 𝐹 satisfy 𝜖-differential privacy?
$F(x) = f(x) + \mathrm{Lap}\left(\frac{LS(f, x)}{\epsilon}\right)$   (8.7)
No! Unfortunately not. Since 𝐿𝑆(𝑓, 𝑥) itself depends on the dataset, if the analyst knows the local sensitivity of a query
at a particular dataset, then the analyst may be able to infer some information about the dataset. It’s therefore not possible
to use local sensitivity directly to achieve differential privacy. For example, consider the bound on local sensitivity for the
mean, defined above. If we know the local sensitivity at a particular 𝑥, we can infer the exact size of 𝑥 with no noise:
$|x| = \frac{u}{LS(f, x)} - 1$   (8.8)
Moreover, keeping the local sensitivity secret from the analyst doesn’t help either. It’s possible to determine the scale of
the noise from just a few query answers, and the analyst can use this value to infer the local sensitivity. Differential privacy
is designed to protect the output of 𝑓(𝑥) - not of the sensitivity measure used in its definition.
Several approaches have been proposed for safely using local sensitivity. We’ll explore these in the rest of this section.
Why does leaking the dataset's size matter? Combined with auxiliary data, it can reveal something highly sensitive. For example, if our query is "the average score of people named Joe in the dataset who scored 98% on the exam," then the size of the group being averaged is itself sensitive.
8.3 Propose-Test-Release
The primary problem with local sensitivity is that the sensitivity itself reveals something about the data. What if we make
the sensitivity itself differentially private? This is challenging to do directly, as there’s often no finite bound on the global
sensitivity of a function’s local sensitivity. However, we can ask a differentially private question that gets at this value
indirectly.
The propose-test-release framework [12] takes this approach. The framework first asks the analyst to propose an upper
bound on the local sensitivity of the function being applied. Then, the framework runs a differentially private test to check
that the dataset being queried is “far from” a dataset where local sensitivity is higher than the proposed bound. If the test
passes, the framework releases a noisy result, with the noise calibrated to the proposed bound.
In order to answer the question of whether a dataset is "far from" one with high local sensitivity, we define the notion of local sensitivity at distance 𝑘. We write 𝐴(𝑓, 𝑥, 𝑘) to denote the maximum local sensitivity achievable for 𝑓 by taking 𝑘 steps away from the dataset 𝑥. Formally:

$A(f, x, k) = \max_{y : d(x, y) \le k} LS(f, y)$
Now we're ready to define a query to answer the question: "how many steps are needed to achieve a local sensitivity greater than a given upper bound 𝑏?"

$D(f, x, b) = \mathop{\mathrm{argmin}}_k \; A(f, x, k) > b$

Finally, we define the propose-test-release framework (see Barthe et al., Figure 10), which satisfies (𝜖, 𝛿)-differential privacy:
1. Propose a target bound 𝑏 on local sensitivity.
2. If $D(f, x, b) + \mathrm{Lap}(\frac{1}{\epsilon}) < \frac{\log(2/\delta)}{2\epsilon}$, return ⊥.
3. Otherwise, release $f(x) + \mathrm{Lap}(\frac{b}{\epsilon})$.
Notice that 𝐷(𝑓, 𝑥, 𝑏) has a global sensitivity of 1: adding or removing a row in 𝑥 might change the distance to a "high" local sensitivity by 1. Thus, adding Laplace noise scaled to 1/𝜖 yields a differentially private way to measure local sensitivity.
Why does this approach satisfy (𝜖, 𝛿)-differential privacy (and not pure 𝜖-differential privacy)? It’s because there’s a non-
zero chance of passing the test by accident. The noise added in step 2 might be large enough to pass the test, even though
the value of 𝐷(𝑓, 𝑥, 𝑏) is actually less than the minimum distance required to satisfy differential privacy.
This failure mode is much closer to the catastrophic failure we saw from the “catastrophe mechanism” - with non-zero
probability, the propose-test-release framework allows releasing a query answer with far too little noise to satisfy differ-
ential privacy. On the other hand, it’s not nearly as bad as the catastrophe mechanism, since it never releases the answer
with no noise.
Also note that the privacy cost of the framework is (𝜖, 𝛿) even if it returns ⊥ (i.e. the privacy budget is consumed whether
or not the analyst receives an answer).
Let's implement propose-test-release for our mean query. Recall that the local sensitivity for this query is $\left|\frac{u}{n+1}\right|$; the best way to increase this value is to make 𝑛 smaller. If we take 𝑘 steps from the dataset 𝑥, we can arrive at a local sensitivity of $\left|\frac{u}{(n-k)+1}\right|$. We can implement the framework in Python using the following code.
df = adult['Age']
u = 100 # set the upper bound on age to 100
epsilon = 1 # set epsilon = 1
delta = 1/(len(df)**2) # set delta = 1/n^2
b = 0.005 # propose a sensitivity of 0.005
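A minimal sketch of such a function, following steps 1-3 above (the name ptr_avg, the use of laplace_mech from earlier chapters, and the simplified budget accounting are assumptions):

import numpy as np

def ptr_avg(df, u, b, epsilon, delta):
    # Step 2: a noisy test of the distance to a "high sensitivity" dataset.
    # For the mean, local sensitivity after removing k rows is u / ((n - k) + 1),
    # so the distance to local sensitivity greater than b is roughly n + 1 - u/b.
    n = len(df)
    distance = n + 1 - (u / b)
    noisy_distance = laplace_mech(distance, 1, epsilon)
    if noisy_distance < np.log(2 / delta) / (2 * epsilon):
        return None     # the test failed: return "bottom" instead of an answer
    # Step 3: release the (clipped) mean with noise calibrated to the proposed bound b
    return laplace_mech(df.clip(lower=0, upper=u).mean(), b, epsilon)

ptr_avg(df, u, b, epsilon, delta)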
41.76972403848019
Keep in mind that local sensitivity isn’t always better. For mean queries, our old strategy of splitting the query into two
separate queries (a sum and a count), both with bounded global sensitivity, often works much better. We can implement
the same mean query with global sensitivity:
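A minimal sketch of this approach, reusing the parameters defined above and laplace_mech:

# split the budget between a noisy sum and a noisy count
epsilon_sum = epsilon / 2
epsilon_count = epsilon / 2
noisy_sum = laplace_mech(df.clip(lower=0, upper=u).sum(), u, epsilon_sum)
noisy_count = laplace_mech(len(df), 1, epsilon_count)
noisy_sum / noisy_count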
41.7709655678639
We might do slightly better with propose-test-release, but it’s not a huge difference. Moreover, to use propose-test-release,
the analyst has to propose a bound on sensitivity - and we’ve cheated by “magically” picking a decent value (0.005). In
practice, the analyst would need to perform several queries to explore which values work - which will consume additional
privacy budget.
Our second approach for leveraging local sensitivity is called smooth sensitivity, and is due to Nissim, Raskhodnikova, and Smith [11]. The smooth sensitivity framework, instantiated with Laplace noise, provides (𝜖, 𝛿)-differential privacy:
1. Set $\beta = \frac{\epsilon}{2 \log(2/\delta)}$
2. Let $S = \max_{k = 0, 1, \dots, n} e^{-\beta k} \cdot A(f, x, k)$
3. Release $f(x) + \mathrm{Lap}\left(\frac{2S}{\epsilon}\right)$
The idea behind smooth sensitivity is to use a “smooth” approximation of local sensitivity, rather than local sensitivity
itself, to calibrate the noise. The amount of smoothing is designed to prevent the unintentional release of information
about the dataset that can happen when local sensitivity is used directly. Step 2 above performs the smoothing: it scales
the local sensitivity of nearby datasets by an exponential function of their distance from the actual dataset, then takes the
maximum scaled local sensitivity. The effect is that if a spike in local sensitivity exists in the neighborhood of 𝑥, that
spike will be reflected in the smooth sensitivity of 𝑥 (and therefore the spike itself is “smoothed out,” and doesn’t reveal
anything about the dataset).
Smooth sensitivity has a significant advantage over propose-test-release: it doesn’t require the analyst to propose a bound
on sensitivity. For the analyst, using smooth sensitivity is just as easy as using global sensitivity. However, smooth
sensitivity has two major drawbacks. First, smooth sensitivity is always larger than local sensitivity (by at least a factor of
2 - see step 3), so it may require adding quite a bit more noise than alternative frameworks like propose-test-release (or
even global sensitivity). Second, calculating smooth sensitivity requires finding the maximum smoothed-out sensitivity
over all possible values for 𝑘, which can be extremely challenging computationally. In many cases, it’s possible to prove
that considering a small number of values for 𝑘 is sufficient (for many functions, the exponentially decaying 𝑒−𝛽𝑘 quickly
overwhelms the growing value of 𝐴(𝑓, 𝑥, 𝑘)), but such a property has to be proven for each function we want to use with
smooth sensitivity.
As an example, let’s consider the smooth sensitivity of the mean query we defined earlier.
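A sketch of this computation (reusing u, epsilon, delta, and df from the propose-test-release example above, and assuming numpy and matplotlib):

import numpy as np
import matplotlib.pyplot as plt

n = len(df)
beta = epsilon / (2 * np.log(2 / delta))              # step 1 of the framework
ks = np.arange(0, 200)
# local sensitivity of the mean at distance k: u / ((n - k) + 1)
smoothed = [np.exp(-beta * k) * (u / (n - k + 1)) for k in ks]

S = max(smoothed)                                     # step 2: the smooth sensitivity
print('sensitivity used for the noise in step 3:', 2 * S)
plt.plot(ks, smoothed);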
There are two things to notice here. First, even though we consider only values of 𝑘 less than 200, it’s pretty clear that the
smoothed-out local sensitivity of our mean query approaches 0 as 𝑘 grows. In fact, for this case, the maximum occurs at
𝑘 = 0. This is true in many cases, but if we want to use smooth sensitivity, we have to prove it (which we won’t do here).
Second, notice that the final sensitivity we’ll use for adding noise to the query’s answer is higher than the sensitivity we
proposed earlier (under propose-test-release). It’s not a big difference, but it shows that it’s sometimes possible to achieve
a lower sensitivity with propose-test-release than with smooth sensitivity.
We'll consider one last framework related to local sensitivity, called sample and aggregate (also due to Nissim, Raskhodnikova, and Smith [11]). For any function 𝑓 ∶ 𝐷 → ℝ and upper and lower clipping bounds 𝑢 and 𝑙, the following framework satisfies 𝜖-differential privacy:
1. Split the dataset 𝑥 into 𝑘 disjoint chunks 𝑥₁, … , 𝑥ₖ
2. Compute a clipped answer for each chunk: $a_i = \max(l, \min(u, f(x_i)))$
3. Release the noisy average of the clipped answers: $\frac{1}{k}\left(\sum_{i=1}^k a_i\right) + \mathrm{Lap}\left(\frac{u - l}{k\epsilon}\right)$
Note that this framework satisfies pure 𝜖-differential privacy, and it actually works without the use of local sensitivity. In
fact, we don’t need to know anything about the sensitivity of 𝑓 (global or local). We also don’t need to know anything
about the chunks 𝑥𝑖 , except that they’re disjoint. Often, they’re chosen randomly (“good” samples tend to result in higher
accuracy), but they don’t need to be.
The framework can be shown to satisfy differential privacy just by global sensitivity and parallel composition. We split
the dataset into 𝑘 distinct chunks, so each individual appears in exactly one chunk. We don’t know the sensitivity of 𝑓,
but we clip its output to lie between 𝑢 and 𝑙, so the sensitivity of each clipped answer 𝑓(𝑥ᵢ) is 𝑢 − 𝑙. Since we take the mean of 𝑘 invocations of 𝑓, the global sensitivity of the mean is $\frac{u - l}{k}$.
Note that we’re claiming a bound on the global sensitivity of a mean directly, rather than splitting it into sum and count
queries. We weren’t able to do this for “regular” mean queries, because the number of things being averaged in a “regular”
mean query depends on the dataset. In this case, however, the number of items being averaged is fixed by the analyst, via
the choice of 𝑘 - it’s independent of the dataset. Mean queries like this one - where the number of things being averaged
is fixed, and can be made public - can leverage this improved bound on global sensitivity.
In this simple instantiation of the sample and aggregate framework, we ask the analyst to provide the upper and lower
bounds 𝑢 and 𝑙 on the output of each 𝑓(𝑥𝑖 ). Depending on the definition of 𝑓, this might be extremely difficult to do well.
In a counting query, for example, 𝑓’s output will depend directly on the dataset.
More advanced instantiations have been proposed (Nissim, Raskhodnikova, and Smith discuss some of these) which
leverage local sensitivity to avoid asking the analyst for 𝑢 and 𝑙. For some functions, however, bounding 𝑓’s output is
easy - so this framework suffices. We’ll consider our example from above - the mean of ages within a dataset - with this
property. The mean age of a population is highly likely to fall between 20 and 80, so it’s reasonable to set 𝑙 = 20 and
𝑢 = 80. As long as our chunks 𝑥𝑖 are each representative of the population, we’re not likely to lose much information
with this setting.
The key parameter in this framework is the number of chunks, 𝑘. As 𝑘 goes up, the sensitivity of the final noisy mean
goes down - so more chunks means less noise. On the other hand, as 𝑘 goes up, each chunk gets smaller, so each answer
𝑓(𝑥𝑖 ) is less likely to be close to the “true” answer 𝑓(𝑋). In our example above, we’d like the average age within each
chunk to be close to the average age of the whole dataset - and this is less likely to happen if each chunk contains only a
handful of people.
How should we set 𝑘? It depends on 𝑓 and on the dataset, which makes it tricky. Let’s try various values of 𝑘 for our
mean query.
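A sketch of such an experiment (assuming the adult dataset and laplace_mech; the chunking strategy and the range of 𝑘 values are illustrative choices):

import numpy as np
import matplotlib.pyplot as plt

def f(chunk):
    return chunk.mean()

def saa_avg_age(k, epsilon, l=20, u=80):
    df = adult['Age'].sample(frac=1, random_state=0)   # shuffle, then split into k disjoint chunks
    chunks = np.array_split(df, k)
    answers = [np.clip(f(chunk), l, u) for chunk in chunks]
    # the mean of k clipped answers has global sensitivity (u - l) / k
    return laplace_mech(np.mean(answers), (u - l) / k, epsilon)

ks = range(10, 2000, 100)
results = [saa_avg_age(k, 1) for k in ks]
plt.plot(ks, results);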
So - sample and aggregate isn’t able to beat our global sensitivity-based approach, but it can get pretty close if you choose
the right value for 𝑘. The big advantage is that sample and aggregate works for any function 𝑓, regardless of its sensitivity;
if 𝑓 is well-behaved, then it’s possible to obtain good accuracy from the framework. On the other hand, using sample and
aggregate requires the analyst to set the clipping bounds 𝑢 and 𝑙, and the number of chunks 𝑘.
NINE
VARIANTS OF DIFFERENTIAL PRIVACY
Learning Objectives
After reading this chapter, you will be able to:
• Define Rényi differential privacy and zero-concentrated differential privacy
• Describe the advantages of these variants over (𝜖, 𝛿)-differential privacy
• Convert privacy costs from these variants into (𝜖, 𝛿)-differential privacy
Recall that most of the bounds on privacy cost we have shown are upper bounds, but they sometimes represent very loose
upper bounds - the true privacy cost is much less than the upper bound says. The primary motivation in developing new
variants of differential privacy is to enable tighter bounds on privacy cost - especially for iterative algorithms - while main-
taining privacy definitions which are useful in practice. For example, the catastrophic failure mode of (𝜖, 𝛿)-differential
privacy is not desirable; the variants we’ll see in this section enable even tighter composition for some kinds of queries,
while at the same time eliminating the catastrophic failure mode.
Let’s take a quick look at the tools we have already seen; we’ll look first at sequential composition for 𝜖-differential
privacy. It turns out that sequential composition for 𝜖-differential privacy is tight. What does that mean? It means there's a counterexample showing that no better bound is possible:
• A mechanism 𝐹 exists which satisfies 𝜖-differential privacy
• When composed 𝑘 times, 𝐹 satisfies 𝑘𝜖-differential privacy
• But 𝐹 does not satisfy 𝑐𝜖-differential privacy for any 𝑐 < 𝑘
A neat way to visualize this is to look at what happens to privacy cost when we “vectorize” a query: that is, we merge
lots of queries into a single query which returns a vector of the individual answers. Because the answer is a vector, we
can use the vector-valued Laplace mechanism just once, and avoid composition altogether. Below, we’ll graph how much
noise is needed for 𝑘 queries, first under sequential composition, and then using the “vectorized” form. In the sequential
composition case, each query has a sensitivity of 1, so the scale of the noise for each one is 1/𝜖ᵢ. If we want a total privacy cost of 𝜖, then the 𝜖ᵢs must add up to 𝜖, so 𝜖ᵢ = 𝜖/𝑘. This means that each query gets Laplace noise with scale 𝑘/𝜖. In the "vectorized" case, there's just one query, but it has an 𝐿1 sensitivity of $\sum_{i=1}^k 1 = k$, so the scale of the noise is 𝑘/𝜖 in this case too.
The two lines overlap completely. This means that no matter how many queries we’re running, under 𝜖-differential privacy,
we can’t do any better than sequential composition. That’s because sequential composition is just as good as vectorizing
the query, effectively converting it to a single query without involving composition, and we can’t do any better than that.
What about (𝜖, 𝛿)-differential privacy? The story is a little different there. In the sequential composition case, we can
use advanced composition; we have to be a little careful to ensure that the total privacy cost is exactly (𝜖, 𝛿). Specifically,
we set $\epsilon_i = \frac{\epsilon}{2\sqrt{2k \log(1/\delta')}}$, $\delta_i = \frac{\delta}{2k}$, and $\delta' = \frac{\delta}{2}$ (splitting 𝛿 to go 50% towards the queries, and 50% towards advanced composition). By advanced composition, the total privacy cost for all 𝑘 queries is (𝜖, 𝛿). The scale of the noise, by the Gaussian mechanism, is:

$\sigma^2 = \frac{2 \log(1.25/\delta_i)}{\epsilon_i^2}$   (9.1)

$= \frac{16 k \log(1/\delta') \log(1.25/\delta_i)}{\epsilon^2}$   (9.2)

$= \frac{16 k \log(2/\delta) \log(2.5k/\delta)}{\epsilon^2}$   (9.3)

In the "vectorized" case, we have just one query, with an 𝐿2 sensitivity of $\sqrt{k}$. The scale of the noise, by the Gaussian mechanism, is $\sigma^2 = \frac{2k \log(1.25/\delta)}{\epsilon^2}$.
What does this difference mean in practice? The two behave the same asymptotically in 𝑘, but have different constants,
and the advanced composition case has an additional logarithmic factor in 𝛿. All this adds up to a much looser bound in
the case of advanced composition. Let’s graph the two as we did before.
It’s not even close - the “vectorized” version grows much slower. What does this mean? We should be able to do much
better for sequential composition!
It turns out that the definition of differential privacy can be stated directly in terms of something called max divergence.
In statistics, a divergence is a way of measuring the distance between two probability distributions - which is exactly what
we want to do for differential privacy. The max divergence is the worst-case analog of the Kullback–Leibler divergence,
one of the most common such measures. The max divergence between two probability distributions 𝑌 and 𝑍 is defined
to be:
$D_\infty(Y \| Z) = \max_{S \subseteq \mathrm{Supp}(Y)} \left[ \log \frac{\Pr[Y \in S]}{\Pr[Z \in S]} \right]$
This already looks a lot like the condition for 𝜖-differential privacy! In particular, it turns out that 𝐹 satisfies 𝜖-differential privacy if, for all neighboring datasets 𝑥 and 𝑥′:

$D_\infty(F(x) \| F(x')) \le \epsilon$
An interesting direction for research in differential privacy is the exploration of alternative privacy definitions in terms of
other divergences. Of these, the Rényi divergence is particularly interesting, since it also (like max divergence) allows us
to recover the original definition of differential privacy. The Rényi divergence of order 𝛼 between probability distributions
𝑃 and 𝑄 is defined as (where 𝑃 (𝑥) and 𝑄(𝑥) denote the probability density of 𝑃 and 𝑄 at point 𝑥, respectively):
$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log E_{x \sim Q} \left( \frac{P(x)}{Q(x)} \right)^{\alpha}$
If we set 𝛼 = ∞, then we immediately recover the definition of 𝜖-differential privacy! The obvious question arises: what
happens if we set 𝛼 to something else? As we’ll see, it’s possible to use the Rényi divergence to derive really interesting
relaxations of differential privacy that allow better composition theorems while at the same time avoiding the possibility
of “catastrophe” which is possible under (𝜖, 𝛿)-differential privacy.
In 2017, Ilya Mironov proposed Rényi differential privacy (RDP) [13]. A randomized mechanism 𝐹 satisfies (𝛼, 𝜖̄)-RDP if for all neighboring datasets 𝑥 and 𝑥′:

$D_\alpha(F(x) \| F(x')) \le \bar{\epsilon}$
In other words, RDP requires the Rényi divergence of order 𝛼 between 𝐹(𝑥) and 𝐹(𝑥′) to be bounded by 𝜖̄. Note that we'll use 𝜖̄ to denote the 𝜖 parameter of RDP, in order to distinguish it from the 𝜖 in pure 𝜖-differential privacy and (𝜖, 𝛿)-differential privacy.
A key property of Rényi differential privacy is that a mechanism which satisfies RDP also satisfies (𝜖, 𝛿)-differential privacy. Specifically, if 𝐹 satisfies (𝛼, 𝜖̄)-RDP, then for 𝛿 > 0, 𝐹 satisfies (𝜖, 𝛿)-differential privacy for $\epsilon = \bar{\epsilon} + \frac{\log(1/\delta)}{\alpha - 1}$.
The analyst is free to pick any value of 𝛿; a meaningful value (e.g. 𝛿 ≤ 1/𝑛²) should be picked in practice.
The basic mechanism for achieving Rényi differential privacy is the Gaussian mechanism. Specifically, for a function 𝑓 ∶ 𝒟 → ℝᵏ with 𝐿2 sensitivity Δ𝑓, the following mechanism satisfies (𝛼, 𝜖̄)-RDP:

$F(x) = f(x) + \mathcal{N}(\sigma^2), \text{ where } \sigma^2 = \frac{\Delta f^2 \alpha}{2\bar{\epsilon}}$
We can implement the Gaussian mechanism for Rényi differential privacy as follows:
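A minimal sketch of such an implementation (the function name is illustrative):

import numpy as np

def gaussian_mech_RDP(v, sensitivity, alpha, epsilon_bar):
    # sigma^2 = (Delta_f^2 * alpha) / (2 * epsilon_bar), per the formula above
    sigma = np.sqrt((sensitivity**2 * alpha) / (2 * epsilon_bar))
    return v + np.random.normal(loc=0, scale=sigma)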
The major advantage of Rényi differential privacy is tight composition for the Gaussian mechanism - and this advantage in
composition comes without the need for a special advanced composition theorem. The sequential composition theorem
of Rényi differential privacy states that:
Theorem 7 (Renyi-Sequential-Composition)
• If 𝐹1 satisfies (𝛼, 𝜖1̄ )-RDP
• And 𝐹2 satisfies (𝛼, 𝜖2̄ )-RDP
• Then their composition satisfies (𝛼, 𝜖1̄ + 𝜖2̄ )-RDP
In concurrent work released in 2016, Mark Bun and Thomas Steinke proposed zero-concentrated differential privacy (zCDP) [14]. Like RDP, zCDP is defined in terms of the Rényi divergence, but it includes only a single privacy parameter (𝜌). A randomized mechanism 𝐹 satisfies 𝜌-zCDP if for all neighboring datasets 𝑥 and 𝑥′, and all 𝛼 ∈ (1, ∞):

$D_\alpha(F(x) \| F(x')) \le \rho \alpha$
This is a stronger requirement than RDP, because it restricts the Rényi divergence of many orders; however, the bound
becomes more relaxed as 𝛼 grows. Like RDP, zCDP can be converted to (𝜖, 𝛿)-differential privacy: if 𝐹 satisfies 𝜌-zCDP, then for 𝛿 > 0, 𝐹 satisfies (𝜖, 𝛿)-differential privacy for $\epsilon = \rho + 2\sqrt{\rho \log(1/\delta)}$.
zCDP is also similar to RDP in that the Gaussian mechanism can be used as a basic mechanism. Specifically, for a function 𝑓 ∶ 𝒟 → ℝᵏ with 𝐿2 sensitivity Δ𝑓, the following mechanism satisfies 𝜌-zCDP:

$F(x) = f(x) + \mathcal{N}(\sigma^2), \text{ where } \sigma^2 = \frac{\Delta f^2}{2\rho}$
In another similarity with RDP, zCDP's sequential composition is also asymptotically tight for repeated applications of the Gaussian mechanism. It's also very simple: the 𝜌s add up. Specifically, if 𝐹₁ satisfies 𝜌₁-zCDP and 𝐹₂ satisfies 𝜌₂-zCDP, then their composition satisfies (𝜌₁ + 𝜌₂)-zCDP.
The first thing to note is that using sequential composition under either zCDP or RDP is much better than using advanced
composition with (𝜖, 𝛿)-differential privacy. When building iterative algorithms with the Gaussian mechanism, these
variants should always be used.
The second thing to note is the difference between zCDP (in orange) and RDP (in green). The 𝜖 for RDP grows linearly
in 𝑘, because we have fixed a value for 𝛼. The 𝜖 for zCDP is sublinear in 𝑘, since zCDP effectively considers many 𝛼s.
The two lines touch at some value of 𝑘, depending on the 𝛼 chosen for RDP (for 𝛼 = 20, they touch at roughly 𝑘 = 300).
The practical effect of this difference is that 𝛼 must be chosen carefully when using RDP in order to bound privacy cost
as tightly as possible. This is usually easy to do, since algorithms are usually parameterized by 𝛼; as a result, we can
simply test multiple values of 𝛼 to see which one results in the smallest corresponding 𝜖. Since this test is independent
of the data (it depends mainly on the privacy parameters we pick, and the number of iterations we want to run), we
can test as many values of 𝛼 as we want without paying additional privacy cost. We only need to test a small range of
values for 𝛼 - typically in the range between 2 and 100 - to find a minimum. This is the approach taken in most practical
implementations, including Google’s differentially private version of Tensorflow.
TEN
THE EXPONENTIAL MECHANISM
Learning Objectives
After reading this chapter, you will be able to:
• Define, implement, and apply the Exponential and Report Noisy Max mechanisms
• Describe the challenges of applying the Exponential mechanism in practice
• Describe the advantages of these mechanisms
The fundamental mechanisms we have seen so far (Laplace and Gaussian) are focused on numerical answers, and add
noise directly to the answer itself. What if we want to return a precise answer (i.e. no added noise), but still preserve
differential privacy? One solution is the exponential mechanism [15], which allows selecting the “best” element from a
set while preserving differential privacy.
The analyst defines which element is the “best” by specifying a scoring function that outputs a score for each element in
the set, and also defines the set of things to pick from. The mechanism provides differential privacy by approximately
maximizing the score of the element it returns - in other words, to satisfy differential privacy, the exponential mechanism
sometimes returns an element from the set which does not have the highest score.
Formally, given a scoring function 𝑢(𝑥, 𝑟) with sensitivity Δ𝑢, the exponential mechanism outputs an element 𝑟 of the set (which we'll call ℛ) with probability proportional to:

$\exp\left(\frac{\epsilon u(x, r)}{2\Delta u}\right)$   (10.1)
The biggest practical difference between the exponential mechanism and the previous mechanisms we’ve seen (e.g. the
Laplace mechanism) is that the output of the exponential mechanism is always a member of the set ℛ. This is extremely
useful when selecting an item from a finite set, when a noisy answer would not make sense. For example, we might want
to pick a date for a big meeting, which uses each participant’s personal calendar to maximize the number of participants
without a conflict, while providing differential privacy for the calendars. Adding noise to a date doesn’t make much sense:
it might turn a Friday into a Saturday, and increase the number of conflicts significantly. The exponential mechanism is
perfect for problems like this one: it selects a date without noise.
The exponential mechanism has several interesting properties:
• The privacy cost of the mechanism is just 𝜖, regardless of the size of ℛ - more on this next.
• It works for both finite and infinite sets ℛ, but it can be really challenging to build a practical implementation which
samples from the appropriate probability distribution when ℛ is infinite.
• It represents a “fundamental mechanism” of 𝜖-differential privacy: all other 𝜖-differentially private mechanisms can
be defined in terms of the exponential mechanism with the appropriate definition of the scoring function 𝑢.
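As a sketch, the following implementation selects a marital status from the adult dataset, using each option's count (scaled down by 1000) as its score. The column name, the scaling, and the function names are assumptions chosen to be consistent with the outputs shown below, which correspond to one call to score, one run of the mechanism, and a tally of 200 repeated runs.

import numpy as np

options = adult['Marital Status'].unique()

def score(data, option):
    # score = the option's count in the data, scaled down by 1000
    return data.value_counts()[option] / 1000

def exponential(x, R, u, sensitivity, epsilon):
    # Calculate the score for each element of R
    scores = [u(x, r) for r in R]
    # Calculate a probability for each element, proportional to exp(eps * score / (2 * sensitivity))
    probabilities = np.exp(epsilon * np.array(scores) / (2 * sensitivity))
    probabilities = probabilities / np.linalg.norm(probabilities, ord=1)
    # Select an element of R according to these probabilities
    return np.random.choice(R, 1, p=probabilities)[0]

score(adult['Marital Status'], 'Never-married')
exponential(adult['Marital Status'], options, score, 1, 1)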
10.683
'Married-civ-spouse'
Married-civ-spouse 180
Never-married 19
Divorced 1
dtype: int64
Can we recover the exponential mechanism using the Laplace mechanism? In the case of a finite set ℛ, the basic idea of
the exponential mechanism - to select from a set with differential privacy - suggests a naive implementation in terms of
the Laplace mechanism:
1. For each 𝑟 ∈ ℛ, calculate a noisy score 𝑢(𝑥, 𝑟) + Lap(Δ𝑢/𝜖)
2. Return the element 𝑟 ∈ ℛ with the maximum noisy score
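A sketch of this approach, reusing the score function and options from the exponential mechanism sketch above:

import numpy as np

def report_noisy_max(x, R, u, sensitivity, epsilon):
    # Calculate the score for each element of R, adding Laplace noise to each one
    noisy_scores = [u(x, r) + np.random.laplace(loc=0, scale=sensitivity/epsilon) for r in R]
    # Return the element with the largest noisy score (and nothing else)
    return R[int(np.argmax(noisy_scores))]

report_noisy_max(adult['Marital Status'], options, score, 1, 1)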
'Married-civ-spouse'
# run the mechanism 200 times (using report_noisy_max from the sketch above)
r = [report_noisy_max(adult['Marital Status'], options, score, 1, 1) for i in range(200)]
pd.Series(r).value_counts()
Married-civ-spouse 192
Never-married 8
dtype: int64
So the exponential mechanism can be replaced with report noisy max when the set ℛ is finite, but what about when it’s
infinite? We can’t easily add Laplace noise to an infinite set of scores. In this context, we have to use the actual exponential
mechanism.
In practice, however, using the exponential mechanism for infinite sets is often challenging or impossible. While it’s easy
to write down the probability density function defined by the mechanism, it’s often the case that no efficient algorithm
exists for sampling from it. As a result, numerous theoretical papers appeal to the exponential mechanism to show that a
differentially private algorithm “exists” with certain desirable properties, but many of these algorithms are impossible to
use in practice.
We’ve seen that it’s not possible to recover the exponential mechanism using the Laplace mechanism plus sequential
composition, because we can’t capture the fact that the algorithm we designed doesn’t release all of the noisy scores.
What about the reverse - can we recover the Laplace mechanism from the exponential mechanism? It turns out that we
can!
Consider a query 𝑞(𝑥) ∶ 𝒟 → ℝ with sensitivity Δ𝑞. We can release an 𝜖-differentially private answer by adding Laplace
noise: 𝐹 (𝑥) = 𝑞(𝑥) + Lap(Δ𝑞/𝜖). The probability density function for this differentially private version of 𝑞 is:
$\Pr[F(x) = r] = \frac{1}{2b} \exp\left(-\frac{|r - \mu|}{b}\right)$   (10.2)

$= \frac{\epsilon}{2\Delta q} \exp\left(-\frac{\epsilon |r - q(x)|}{\Delta q}\right)$   (10.3)
Consider what happens when we set the scoring function for the exponential mechanism to 𝑢(𝑥, 𝑟) = −2|𝑞(𝑥) − 𝑟|. The
exponential mechanism says that we should sample from the probability distribution proportional to:
$\Pr[F(x) = r] = \exp\left(\frac{\epsilon u(x, r)}{2\Delta u}\right)$   (10.4)

$= \exp\left(\frac{\epsilon(-2|q(x) - r|)}{2\Delta q}\right)$   (10.5)

$= \exp\left(-\frac{\epsilon |r - q(x)|}{\Delta q}\right)$   (10.6)
So it’s possible to recover the Laplace mechanism from the exponential mechanism, and we get the same results (up to
constant factors - the general analysis for the exponential mechanism is not tight in all cases).
The exponential mechanism is extremely general - it’s generally possible to re-define any 𝜖-differentially private mechanism
in terms of a carefully chosen definition of the scoring function 𝑢. If we can analyze the sensitivity of this scoring function,
then the proof of differential privacy comes for free.
On the other hand, applying the general analysis of the exponential mechanism sometimes comes at the cost of looser
bounds (as in the Laplace example above), and mechanisms defined in terms of the exponential mechanism are often
very difficult to implement. The exponential mechanism is often used to prove theoretical lower bounds (by showing
that a differentially private algorithm exists), but practical algorithms often replicate the same behavior using some other
approach (as in the case of report noisy max above).
ELEVEN
THE SPARSE VECTOR TECHNIQUE
Learning Objectives
After reading this chapter, you will be able to:
• Describe the Sparse Vector Technique and the reasons to use it
• Define and implement Above Threshold
• Apply the Sparse Vector Technique in iterative algorithms
We’ve already seen one example of a mechanism - the exponential mechanism - which achieves a lower-than-expected
privacy cost by withholding some information. Are there others?
There are, and one that turns out to be extremely useful in practical algorithms is the sparse vector technique (SVT) [17].
Important: The sparse vector technique operates on a stream of sensitivity-1 queries over a dataset; it releases the
identity of the first query in the stream which passes a test, and nothing else.
The advantage of SVT is that it incurs a fixed total privacy cost, no matter how many queries it considers.
The most basic instantiation of the sparse vector technique is an algorithm called AboveThreshold (see Dwork and
Roth [16], Algorithm 1). The inputs to the algorithm are a stream of sensitivity-1 queries, a dataset 𝐷, a threshold 𝑇 ,
and the privacy parameter 𝜖; the algorithm preserves 𝜖-differential privacy. A Python implementation of the algorithm
appears below.
import random
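import numpy as np

# What follows is a sketch of AboveThreshold; the noise scales follow Dwork and
# Roth's Algorithm 1, and the function name and signature are chosen to match
# the calls later in this chapter (they are otherwise illustrative).
def above_threshold(queries, df, T, epsilon):
    T_hat = T + np.random.laplace(loc=0, scale=2/epsilon)
    for idx, q in enumerate(queries):
        nu_i = np.random.laplace(loc=0, scale=4/epsilon)
        if q(df) + nu_i >= T_hat:
            return idx
    # if the algorithm "fails" to find a query above the threshold, return an arbitrary index
    return random.randint(0, len(queries)-1)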
The AboveThreshold algorithm returns (approximately) the index of the first query in queries whose result exceeds
the threshold. The algorithm preserves differential privacy by sometimes returning the wrong index; sometimes, the index
returned may be for a query whose result does not exceed the threshold, and sometimes, the index may not be the first
whose query result exceeds the threshold.
The algorithm works by generating a noisy threshold T_hat, then comparing noisy query answers (q(i) + nu_i)
against the noisy threshold. The algorithm returns the index of the first comparison that succeeds.
It’s a little bit surprising that the privacy cost of this algorithm is just 𝜖, because it may compute the answers to many
queries. In particular, a naive version of this algorithm might compute noisy answers to all of the queries first, then select
the index of the first one whose value is above the threshold:
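A sketch of this naive approach (assuming numpy as np):

def naive_above_threshold(queries, df, T, epsilon):
    # compute a noisy answer for every query, paying epsilon for each one
    noisy_answers = [q(df) + np.random.laplace(loc=0, scale=1/epsilon) for q in queries]
    # return the index of the first noisy answer that exceeds the threshold
    for idx, answer in enumerate(noisy_answers):
        if answer >= T:
            return idx
    return None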
For a list of queries of length 𝑛, this version preserves 𝑛𝜖-differential privacy by sequential composition.
Why does AboveThreshold do so much better? As we saw with the exponential mechanism, sequential composition
would allow AboveThreshold to release more information than it actually does. In particular, our naive version of
the algorithm could release the indices of every query exceeding the threshold (not just the first one), plus the noisy query
answers themselves, and it would still preserve 𝑛𝜖-differential privacy. The fact that AboveThreshold withholds all
this information allows for a tighter analysis of privacy cost.
The sparse vector technique is extremely useful when we want to run many different queries, but we only care about the
answer for one of them (or a small subset of them). In fact, this application gives the technique its name: it’s most useful
when the vector of queries is sparse - i.e. most of the answers don’t exceed the threshold.
We’ve already seen a perfect example of such a scenario: selecting a clipping bound for summation queries. Earlier,
we took an approach like the naive version of AboveThreshold defined above: compute noisy answers under many
different clipping bounds, then select the lowest one for which the answer doesn’t change much.
We can do much better with the sparse vector technique. Consider a query which clips the ages of everyone in the dataset,
then sums them up:
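One plausible definition of such a query (assuming the adult DataFrame has an 'Age' column):

def age_sum_query(df, b):
    # clip each age to the range [0, b], then sum the clipped ages
    return df['Age'].clip(lower=0, upper=b).sum()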
age_sum_query(adult, 30)
879617
The naive algorithm for selecting a good value for b is to obtain differentially private answers for many values of b,
returning the smallest one where the value stops increasing:
def naive_select_b(query, df, epsilon):
    bs = range(1, 1000, 10)             # candidate clipping bounds (an assumed range)
    epsilon_i = epsilon / len(bs)       # split the budget across all candidate bounds
    best = 0
    for b in bs:
        r = laplace_mech(query(df, b), b, epsilon_i)
        # if the new answer is close to the previous one, the value has stopped increasing
        if r - best <= 10:
            return b
        best = r
    return bs[-1]

naive_select_b(age_sum_query, adult, 1)
81
Can we use SVT here? We only care about one thing: the value of b where the value of age_sum_query(df, b)
stops increasing. However, the sensitivity of age_sum_query(df, b) is b, because adding or removing a row in
df could change the sum by at most b; to use SVT, we need to build a stream of 1-sensitive queries.
The value we actually care about, though, is whether or not the query’s answer is changing at a specific value of 𝑏 (i.e.
age_sum_query(df, b) - age_sum_query(df, b + 1)). Consider what happens when we add a row to
df: the answer to the first part of the query age_sum_query(df, b) goes up by 𝑏, but the answer to the second part
of the query age_sum_query(df, b + 1) also goes up - by 𝑏 + 1. The sensitivity is therefore |𝑏 − (𝑏 + 1)| = 1
- so each query will be 1-sensitive, as desired! As the value of 𝑏 approaches the optimal one, the value of the difference
we defined above will approach 0:
Let’s define a stream of difference queries, and use AboveThreshold to determine the index of the best value of b
using the sparse vector technique.
def create_query(b):
    return lambda df: age_sum_query(df, b) - age_sum_query(df, b + 1)

bs = range(1, 150, 5)
queries = [create_query(b) for b in bs]
epsilon = .1

bs[above_threshold(queries, adult, 0, epsilon)]
81
Note that it doesn’t matter how long the list bs is - we’ll get accurate results (and pay the same privacy cost) no matter
its length. The really powerful effect of SVT is to eliminate the dependence of privacy cost on the number of queries we
perform. Try changing the range for bs above and re-running the plot below. You’ll see that the output doesn’t depend
on the number of values for b we try - even if the list has thousands of elements!
We can use SVT to build an algorithm for summation queries (and using this, for average queries) that automatically
computes the clipping parameter.
def auto_avg(df, epsilon):
    def create_query(b):
        return lambda df: df.clip(lower=0, upper=b).sum() - df.clip(lower=0, upper=b+1).sum()

    # Construct the stream of difference queries (a wide range, to handle data on many scales)
    bs = range(1, 150000, 5)
    queries = [create_query(b) for b in bs]

    # Run AboveThreshold, using 1/3 of the privacy budget, to find a good clipping parameter
    epsilon_svt = epsilon / 3
    final_b = bs[above_threshold(queries, df, 0, epsilon_svt)]

    # Compute the noisy sum and noisy count, using 1/3 of the privacy budget for each
    epsilon_sum = epsilon / 3
    epsilon_count = epsilon / 3
    noisy_sum = laplace_mech(df.clip(lower=0, upper=final_b).sum(), final_b, epsilon_sum)
    noisy_count = laplace_mech(len(df), 1, epsilon_count)

    return noisy_sum / noisy_count
auto_avg(adult['Age'], 1)
41.77554469012242
This algorithm invokes three differentially private mechanisms: AboveThreshold once, and the Laplace mechanism
twice, each with 1/3 of the privacy budget. By sequential composition, it satisfies 𝜖-differential privacy. Because we are
free to test a really wide range of possible values for b, we’re able to use the same auto_avg function for data on many
different scales! For example, we can also use it on the capital gain column, even though it has a very different scale than
the age column.
auto_avg(adult['Capital Gain'], 1)
1090.4675219264236
Note that this takes a long time to run! That’s because we have to try a lot more values for b before finding a good one,
since the capital gain column has a much larger scale. We can reduce this cost by increasing the step size (5, in our
implementation above) or by constructing bs with an exponential scale.
In the above application, we only needed the index of the first query which exceeded the threshold, but in many other
applications we would like to find the indices of all such queries.
We can use SVT to do this, but we’ll have to pay a higher privacy cost. We can implement an algorithm called sparse
(see Dwork and Roth [16], Algorithm 2) to accomplish the task, using a very simple approach:
1. Start with a stream 𝑞𝑠 = {𝑞1 , … , 𝑞𝑘 } of queries
2. Run AboveThreshold on 𝑞𝑠 to learn the index 𝑖 of the first query which exceeds the threshold
3. Restart the algorithm (go to (1)) with 𝑞𝑠 = {𝑞𝑖+1 , … , 𝑞𝑘 } (i.e. the remaining queries)
If the algorithm invokes AboveThreshold 𝑛 times, with a privacy parameter of 𝜖 for each invocation, then it sat-
isfies 𝑛𝜖-differential privacy by sequential composition. If we want to specify an upper bound on total privacy cost,
we need to bound 𝑛 - so the sparse algorithm asks the analyst to specify an upper bound 𝑐 on the number of times
AboveThreshold will be invoked.
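A sketch of sparse in terms of above_threshold, following the three steps above (the index bookkeeping is an illustrative choice):

def sparse(queries, df, c, T, epsilon):
    idxs = []
    pos = 0
    epsilon_i = epsilon / c
    # stop when we run out of queries, or when we've found c indices above the threshold
    while pos < len(queries) and len(idxs) < c:
        # run AboveThreshold on the remaining queries to find the next index above the threshold
        next_idx = above_threshold(queries[pos:], df, T, epsilon_i)
        pos = pos + next_idx + 1
        idxs.append(pos - 1)
    return idxs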
epsilon = 1
sparse(queries, adult, 3, 0, epsilon)
By sequential composition, the sparse algorithm satisfies 𝜖-differential privacy (it uses 𝜖ᵢ = 𝜖/𝑐 for each invocation of
AboveThreshold). The version described in Dwork and Roth uses advanced composition, setting the 𝜖𝑖 value for
each invocation of AboveThreshold so that the total privacy cost is 𝜖 (zCDP or RDP could also be used to perform
the composition).
A range query asks: “how many rows exist in the dataset whose values lie in the range (𝑎, 𝑏)?” Range queries are counting
queries, so they have sensitivity 1; we can’t use parallel composition on a set of range queries, however, since the rows
they examine might overlap.
Consider a set of range queries over ages (i.e. queries of the form “how many people have ages between 𝑎 and 𝑏?”). We
can generate many such queries at random:
def age_range_query(df, lower, upper):
    # count the rows whose age lies in the range (lower, upper); definition assumed for illustration
    df1 = df[df['Age'] > lower]
    return len(df1[df1['Age'] < upper])

def create_age_range_query():
    lower = np.random.randint(30, 50)
    upper = np.random.randint(lower, 70)
    return lambda df: age_range_query(df, lower, upper)

range_queries = [create_age_range_query() for i in range(10)]
[q(adult) for q in range_queries]
[2617, 13640, 14139, 13132, 12619, 6836, 13614, 3103, 10471, 3699]
The answers to such range queries vary widely - some ranges create tiny (or even empty) groups, with small counts, while
others create large groups with high counts. In many cases, we know that the small groups will have inaccurate answers
under differential privacy, so there’s not much point in even running the query. What we’d like to do is learn which queries
are worth answering, and then pay privacy cost for just those queries.
We can use the sparse vector technique to do this. First, we’ll determine the indices of the range queries in the stream
which exceed a threshold for “goodness” that we decide on. Then, we’ll use the Laplace mechanism to find differentially
private answers for just those queries. The total privacy cost will be proportional to the number of queries above the
threshold - not the total number of queries. In cases where we expect just a few queries to be above the threshold, this
can result in a much smaller privacy cost.
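A sketch of this two-phase approach, reusing sparse, laplace_mech, and the range_queries defined above, and splitting the privacy budget in half between the two phases:

def range_query_svt(queries, df, c, T, epsilon):
    # Phase 1: spend half the budget finding (up to) c queries above the threshold
    above_idxs = sparse(queries, df, c, T, epsilon / 2)
    # Phase 2: spend the other half answering just those queries with the Laplace mechanism
    epsilon_i = (epsilon / 2) / c
    return [laplace_mech(queries[i](df), 1, epsilon_i) for i in above_idxs]

range_query_svt(range_queries, adult, 5, 10000, 1)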
[13642.21477538189,
14144.325351165364,
13128.309265827167,
12616.956987628984,
13617.040247690687]
Using this algorithm, we pay half of the privacy budget to determine the first 𝑐 queries which lie above the threshold of
10000, then the other half of the budget to obtain noisy answers to just those queries. If the number of queries exceeding
the threshold is tiny compared to the total number, then we’re able to obtain much more accurate answers using this
approach.
TWELVE
Issues to Consider
• How many queries are required, and what kind of composition can we use?
– Is parallel composition possible?
– Should we use sequential composition, advanced composition, or a variant of differential privacy?
• Can we use the sparse vector technique?
• Can we use the exponential mechanism?
• How should we distribute the privacy budget?
• If there are unbounded sensitivities, how can we bound them?
• Would synthetic data help?
• Would post-processing to “de-noise” help?
Design a variant of sample and aggregate which does not require the analyst to specify the output range of the query
function 𝑓.
Tip: Use SVT to find good upper and lower bounds on 𝑓(𝑥) for the whole dataset first. The result of
𝑐𝑙𝑖𝑝(𝑓(𝑥), 𝑙𝑜𝑤𝑒𝑟, 𝑢𝑝𝑝𝑒𝑟) has bounded sensitivity, so we can use this query with SVT. Then use sample and aggregate
with these upper and lower bounds.
Google’s RAPPOR system [18] is designed to find the most popular settings for Chrome’s home page. Design an algorithm
which:
• Given a list of the 10,000 most popular web pages by traffic,
• Determines the top 10 most-popular home pages out of the 10,000 most popular web pages
Design an algorithm to produce summary statistics for the U.S. Census. Your algorithm should produce total population
counts at the following levels:
• Census tract
• City / town
• ZIP Code
• County
• State
• USA
Tip: Idea 1: Only compute the bottom level (census tract) counts, using parallel composition. Add up the tract counts to get the city counts, and so on up the hierarchy. Advantage: the privacy budget is spent only once, at the bottom level.
Idea 2: Compute counts at all levels, using parallel composition for each level. Tune the budget split using real data;
probably we need more accuracy for the smaller levels of the hierarchy.
Idea 3: As (2), but also use post-processing to re-scale lower levels of the hierarchy based on higher ones; truncate counts
to whole numbers; move negative counts to 0.
Design an algorithm to accurately answer a workload of range queries. Range queries are queries on a single table of the
form: “how many rows have a value that is between 𝑎 and 𝑏?” (i.e. the count of rows which lie in a specific range).
12.6.1 Part 1
The whole workload is pre-specified as a finite sequence of ranges: {(𝑎1 , 𝑏1 ), … , (𝑎𝑘 , 𝑏𝑘 )}, and
12.6.2 Part 2
The length of the workload 𝑘 is pre-specified, but queries arrive in a streaming fashion and must be answered as they
arrive.
12.6.3 Part 3
• For each range (𝑖, 𝑖+1), find a count (parallel composition). This is a synthetic data representation! We can answer
infinitely many queries by adding up the counts of all the segments in this histogram which are contained in the
desired interval.
• For part 2, use SVT
For SVT: for each query in the stream, ask how far the real answer is from the synthetic data answer. If it’s far, query the
real answer’s range (as a histogram, using parallel composition) and update the synthetic data. Otherwise just give the
synthetic data answer. This way you ONLY pay for updates to the synthetic data.
Deployment of differential privacy involves several steps to ensure the protection of sensitive data while still allowing
useful analysis. Here’s a checklist for deploying a system with differential privacy:
1. Data Classification: Identify and classify sensitive data that needs protection. Understand the sensitivity levels
and potential risks associated with each class of data.
2. Data Preparation: Preprocess the data to remove personally identifiable information (PII) and other sensitive
attributes that are not relevant to the analysis.
3. Define Privacy Budget: Determine acceptable levels of utility or privacy risk for your deployment. This will
govern the amount of noise added to the data to achieve differential privacy.
4. Data Analysis Requirements: Clearly outline the data analysis goals, as well as specific transformations and
queries that need to be supported while maintaining privacy. Ensure that the chosen privacy mechanisms and
privacy unit can accommodate these requirements.
5. Implement Differential Privacy: Choose appropriate differential privacy mechanisms such as Laplace mecha-
nism, Gaussian mechanism, or other advanced techniques based on the analysis requirements. Implement these
mechanisms into the data processing pipeline. Identify a strategy for privacy composition if applicable.
6. Noise Generation: Generate high-quality randomness and introduce noise to the data in accordance with the chosen differential privacy mechanism. Ensure that the noise is calibrated to achieve the desired privacy guarantee.
7. Testing and Verification: Conduct thorough testing to validate the effectiveness of the deployed differential privacy
mechanisms. Test the system with a variety of queries and scenarios to ensure that privacy is preserved while
maintaining data utility. Ideally, conduct tests on public or synthetic data.
8. Performance Evaluation: Evaluate the performance overhead introduced by the time and storage cost of differ-
ential privacy mechanisms. Monitor system metrics such as latency, throughput, and compute resource utilization
to ensure acceptable levels of efficiency.
9. Documentation and Compliance: Document the deployment process, including the differential privacy mecha-
nisms used, privacy parameters chosen, and any other relevant details. Ensure compliance with relevant privacy
laws, regulations and standards.
10. Additional Security Measures: Implement all necessary security measures to protect against potential attacks or
vulnerabilities. This may include encryption of data in transit and at rest, access controls, and auditing mechanisms.
This may also include safeguards against side channel attacks such as isolated compute environments.
11. User Education and Training: Educate users and stakeholders about the principles of differential privacy, its
implications, and the importance of preserving privacy while conducting data analysis.
12. Continuous Monitoring and Maintenance: Establish a process for ongoing monitoring and maintenance of the
deployed system. Regularly review privacy parameters and update mechanisms as needed to adapt to evolving
privacy requirements and threats.
By following this checklist, you can effectively deploy a system with differential privacy to protect sensitive data while
enabling valuable analysis and insights!
THIRTEEN
MACHINE LEARNING
Learning Objectives
After reading this chapter, you will be able to:
• Describe and implement the basic algorithm for gradient descent
• Use the Gaussian mechanism to implement differentially private gradient descent
• Clip gradients to enforce differential privacy for arbitrary loss functions
• Describe the effect of noise on the training process
In this chapter, we’re going to explore building differentially private machine learning classifiers. We’ll focus on a kind of
supervised learning problem: given a set of labeled training examples {(𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 )}, in which 𝑥𝑖 is called the
feature vector and 𝑦𝑖 is called the label, train a model 𝜃 which can predict the label for a new feature vector which was not
present in the training set. Each 𝑥𝑖 is typically a vector of real numbers describing the features of a training example, and each 𝑦𝑖 is drawn from a predefined set of classes (usually expressed as integers). A binary classifier has two classes (usually either 1 and 0, or 1 and -1).
To train a model, we will use some of the data we have available to build a set of training examples (as described earlier),
but we’ll also set aside some of the data as test examples. Once we have trained the model, we want to know how well
it works on examples that are not present in the training set. A model which works well on new examples it hasn’t seen
before is said to generalize well. One which does not generalize well is said to have overfitted the training data.
To test generalization, we’ll use the test examples - we have labels for them, so we can test the generalization accuracy of
the model by asking the model to classify each one, and then comparing the predicted class against the actual label from
our dataset. We’ll split our data into a training set containing 80% of the examples, and a testing set containing 20% of
the examples.
A simple way to build a binary classifier is with logistic regression. The scikit-learn library has a built-in module for
performing logistic regression, called LogisticRegression, and it’s easy to use to build a model using our data.
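The training cell itself is not shown here; a sketch, assuming X_train and y_train come from the 80/20 split described above:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X_train, y_train)
model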
LogisticRegression()
Next, we can use the model’s predict method to predict labels for the test set.
model.predict(X_test)
So, how many test examples does our model get correct? We can compare the predicted labels against the actual labels
from the dataset; if we divide the number of correctly predicted labels by the total number of test examples, we can
measure the percent of the examples which are correctly classified.
np.sum(model.predict(X_test) == y_test)/X_test.shape[0]
0.8243034055727554
Our model predicts the correct label for 82% of the examples in our test set. For this dataset, that’s a pretty decent result.
What exactly is a model? How does it encode the information it uses to make predictions?
There are many different kinds of models, but the ones we’ll explore here are linear models. For an unlabeled example
with a 𝑘-dimensional feature vector 𝑥1 , … , 𝑥𝑘 , a linear model predicts a label by first calculating the quantity:
𝑤1 𝑥1 + ⋯ + 𝑤𝑘 𝑥𝑘 + 𝑏𝑖𝑎𝑠 (13.1)
and then taking the sign of it (i.e. if the quantity above is negative, we predict the label -1; if it’s positive, we predict 1).
The model itself, then, can be represented by a vector containing the values 𝑤1 , … , 𝑤𝑘 and the value for 𝑏𝑖𝑎𝑠. The
model is said to be linear because the quantity we calculate in predicting a label is a polynomial of degree 1 (i.e. linear).
The values 𝑤1 , … , 𝑤𝑘 are often called the weights or coefficients of the model, and 𝑏𝑖𝑎𝑠 is often called the bias term or
intercept.
This is actually how scikit-learn represents its logistic regression model, too! We can check out the weights of our trained
model using the coef_ attribute of the model:
model.intercept_[0], model.coef_[0]
Note that we’ll always have exactly the same number of weights 𝑤𝑖 as we have features 𝑥𝑖 , since we have to multiply each
feature by its corresponding weight. That means our model has exactly the same dimensionality as our feature vectors.
Now that we have a way to get the weights and bias term, we can implement our own function to perform prediction:
# Prediction: take a model (theta) and a single example (xi) and return its predicted label
def predict(xi, theta, bias=0):
    return np.sign(xi @ theta + bias)

# sanity check: our predict, using the sklearn model's weights, matches its accuracy
np.sum(predict(X_test, model.coef_[0], model.intercept_[0]) == y_test)/X_test.shape[0]
0.8243034055727554
We’ve made the bias term optional here, because in many cases it’s possible to do just as well without it. To make things
simpler, we won’t bother to train a bias term in our own algorithm.
How does the training process actually work? The scikit-learn library has some pretty sophisticated algorithms, but we
can do just about as well by implementing a simple one called gradient descent.
Most training algorithms for machine learning are defined in terms of a loss function, which specifies a way to measure
how “bad” a model is at prediction. The goal of the training algorithm is to minimize the output of the loss function - a
model with low loss will be good at prediction.
The machine learning community has developed many different commonly-used loss functions. A simple loss function
might return 0 for each correctly predicted example, and 1 for each incorrectly predicted example; when the loss becomes
0, that means we’ve predicted each example’s label correctly. A more commonly used loss function for binary classification
is called the logistic loss; the logistic loss gives us a measure of “how far” we are from predicting the correct label (which
is more informative than the simple 0 vs 1 approach).
The logistic loss is implemented by the following Python function:
# The loss function measures how good our model is. The training goal is to minimize the loss.
# This is the logistic loss function.
def loss(theta, xi, yi):
    return np.log(1 + np.exp(-yi * xi.dot(theta)))
We can use the loss function to measure how good a particular model is. Let’s try it out with a model whose weights are
all zeros. This model isn’t likely to work very well, but it’s a starting point from which we can train a better one.
theta = np.zeros(X_train.shape[1])
loss(theta, X_train[0], y_train[0])
0.6931471805599453
We typically measure how good our model is over our entire training set by simply averaging the loss over all of the examples in the training data. In this case, the all-zeros model produces exactly the same loss (log 2) for every example, so the average loss on the whole training set is equal to the loss we calculated above for just one example.
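A one-line computation of this average (a sketch, assuming the loss function and the training arrays defined above):

np.mean([loss(theta, xi, yi) for xi, yi in zip(X_train, y_train)])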
0.6931471805599453
Our goal in training the model is to minimize the loss. So the key question is: how do we modify the model to make the
loss smaller?
Gradient descent is an approach that makes the loss smaller by updating the model according to the gradient of the loss.
The gradient is like a multi-dimensional derivative: for a function with multi-dimensional inputs (like our loss function
above), the gradient tells you how fast the function’s output is changing with respect to each dimension of the input. If
the gradient is positive in a particular dimension, that means the function’s value will increase if we increase the model’s
weight for that dimension; we want the loss to decrease, so we should modify our model by negating the gradient - i.e. do
the opposite of what the gradient says. Since we move the model in the opposite direction of the gradient, this is called
descending the gradient.
When we iteratively perform many steps of this descent process, we slowly get closer and closer to the model which
minimizes the loss. This algorithm is called gradient descent. Let’s see how this looks in Python; first, we’ll define the
gradient function.
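The cell defining the gradient is not included above; here is a sketch of the gradient of the logistic loss defined earlier:

# The gradient of the logistic loss for a single example (xi, yi): it tells us how the
# loss changes as we change each weight of the model theta.
def gradient(theta, xi, yi):
    exponent = yi * xi.dot(theta)
    return - (yi * xi) / (1 + np.exp(exponent))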
Next, let’s perform a single step of gradient descent. We can apply the gradient function to a single example from
our training data, which should give us enough information to improve the model for that example. We “descend” the
gradient by subtracting it from our current model theta.
# If we take a step in the *opposite* direction from the gradient (by negating it), we should
# move the model in a direction that makes the loss smaller.
# In this example, we're taking the gradient on just a single training example (the first one).
theta = theta - gradient(theta, X_train[0], y_train[0])
Now, if we call predict on the same example from the training data, its label is predicted correctly! That means our
update did indeed improve the model, since it’s now capable of classifying this example.
# compare the true label with the model's prediction for the same example
y_train[0], predict(X_train[0], theta)

(-1.0, -1.0)
We’ll be measuring the accuracy of our model many times, so let’s define a helper function for measuring accuracy. It
works in the same way as the accuracy measurement for the sklearn model above. We can use it on the theta we’ve
built by descending the gradient for one example, to see how good our model is on the test set.
def accuracy(theta):
return np.sum(predict(X_test, theta) == y_test)/X_test.shape[0]
accuracy(theta)
0.7585139318885449
Our improved model now predicts 75% of the labels for the test set correctly! That’s good progress - we’ve improved the
model considerably.
We need to make two changes to arrive at a basic gradient descent algorithm. First, our single step above used only a single
example from the training data; we want to consider the whole training set when updating the model, so that we improve
the model for all examples. Second, we need to perform multiple iterations, to get as close as possible to minimizing the
loss.
We can solve the first problem by calculating the average gradient over all of the training examples, and using it for the
descent step in place of the single-example gradient we used before. Our avg_grad function calculates the average
gradient over a whole array of training examples and the corresponding labels.
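A sketch of avg_grad, assuming the per-example gradient function above:

def avg_grad(theta, X, y):
    grads = [gradient(theta, xi, yi) for xi, yi in zip(X, y)]
    return np.mean(grads, axis=0)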
To solve the second problem, we’ll define an iterative algorithm that descends the gradient multiple times.
def gradient_descent(iterations):
    # Start by "guessing" what the model should be (all zeros)
    theta = np.zeros(X_train.shape[1])
    # Perform `iterations` steps of descent using the average gradient over the training data
    for i in range(iterations):
        theta = theta - avg_grad(theta, X_train, y_train)
    return theta
theta = gradient_descent(10)
accuracy(theta)
0.7787483414418399
After 10 iterations, our model reaches nearly 78% accuracy - not bad! Our gradient descent algorithm looks simple (and
it is!) but don’t let its simplicity fool you - this basic approach is behind many of the recent successes in large-scale deep
learning, and our algorithm is very close in its design to the ones implemented in popular frameworks for machine learning
like TensorFlow.
Notice that we didn’t quite make it to the 82% accuracy of the sklearn model we trained earlier. Don’t worry - our algorithm is definitely capable of this! We just need more iterations, to get closer to the minimum of the loss.
With 100 iterations, we get much closer - nearly 82% accuracy. However, the algorithm takes a long time to run when we ask for so many iterations. Even worse, the closer we get to minimizing the loss, the more difficult it is to improve - so 100 iterations might get us close to the sklearn model’s accuracy, but it might take 1000 iterations to close the remaining gap. This points to a fundamental tension in machine learning - generally speaking, more iterations of training can improve accuracy, but more iterations require more computation time. Most of the “tricks” used to make large-scale deep learning practical are actually aimed at speeding up each iteration of gradient descent, so that more iterations can be performed in the same amount of time.
One more thing that’s interesting to note: the value of the loss function does indeed go down with each iteration of gradient
descent we perform - so as we perform more iterations, we slowly get closer to minimizing the loss. Also note that the
training and testing loss are very close to one another, suggesting that our model is not overfitting to the training data.
How can we make the above algorithm differentially private? We’d like to design an algorithm that ensures differential
privacy for the training data, so that the final model doesn’t reveal anything about individual training examples.
The only part of the algorithm which uses the training data is the gradient calculation. One way to make the algorithm
differentially private is to add noise to the gradient itself at each iteration before updating the model. This approach is
usually called noisy gradient descent, since we add noise directly to the gradient.
Our gradient function is a vector-valued function, so we can use gaussian_mech_vec to add noise to its output:
def noisy_gradient_descent(iterations, epsilon, delta):
    theta = np.zeros(X_train.shape[1])
    sensitivity = '???'   # placeholder: the sensitivity of the gradient is the open question below

    for i in range(iterations):
        grad = avg_grad(theta, X_train, y_train)
        noisy_grad = gaussian_mech_vec(grad, sensitivity, epsilon, delta)
        theta = theta - noisy_grad

    return theta
There’s just one piece of the puzzle missing - what is the sensitivity of the gradient function? Answering this question
is the central difficulty in making the algorithm work.
There are two major challenges here. First, the gradient is the result of an average query - it’s the mean of many per-
example gradients. As we’ve seen previously, it’s best to split queries like this up into a sum query and a count query.
This isn’t difficult to do - we can compute the noisy sum of the per-example gradients, rather than their average, and
divide by a noisy count later. Second, we need to bound the sensitivity of each per-example gradient. There are two
basic approaches for this: we can either analyze the gradient function itself (as we have done with previous queries) to
determine its worst-case global sensitivity, or we can enforce a sensitivity by clipping the output of the gradient function
(as we did in sample and aggregate).
We’ll start with the second approach - often called gradient clipping - because it’s simpler conceptually and more general
in its applications.
Gradient clipping is a technique often employed in machine learning, including differential privacy scenarios, to mitigate
the risk of large gradients during training. When training a model, the gradient of the loss with respect to the model’s
parameters is computed and used to update the parameters. Gradient clipping involves scaling or limiting the gradients to
a specified threshold if they exceed that threshold. This helps stabilize training and prevents large updates that may lead
to overshooting or divergence.
In the context of differential privacy, gradient clipping is used to control the sensitivity of the model. Recall that, in
the context of differential privacy, sensitivity refers to the maximum amount by which the output of a function (or the
model’s parameters) can change when a single data point is added or removed from the input dataset. It is a crucial
concept in differential privacy because it helps quantify how much an individual’s data can influence the overall outcome,
thus providing a measure of privacy protection. The sensitivity of a model’s update is related to the maximum change in
the model parameters caused by a single training example. By limiting the gradients during training, gradient clipping
indirectly limits the sensitivity of the model, making it more amenable to differential privacy.
To incorporate gradient clipping into a differentially private training process, the clipping is often applied to the gradients
before they are used to update the model parameters. This helps ensure that individual training examples do not overly
influence the model and helps maintain privacy guarantees.
To sum up: sensitivity in the context of differential privacy refers to the maximum impact a single data point can have
on the model’s output, and gradient clipping is a technique used to control and limit this sensitivity during the training of
differentially private models.
Recall that when we implemented sample and aggregate, we enforced a desired sensitivity on a function 𝑓 with unknown sensitivity by clipping its output, so that the sensitivity of the clipped result |clip(𝑓(𝑥), 𝑏) − clip(𝑓(𝑥′), 𝑏)| was bounded. In the worst case, clip(𝑓(𝑥), 𝑏) = 𝑏 and clip(𝑓(𝑥′), 𝑏) = 0, so the sensitivity of the clipped result is exactly 𝑏 (the value of the clipping parameter).
We can use the same trick to bound the 𝐿2 sensitivity of our gradient function. We’ll need to define a function which
“clips” a vector so that it has 𝐿2 norm within a desired range. We can accomplish this by scaling the vector: if we divide
the vector elementwise by its 𝐿2 norm, then the resulting vector will have an 𝐿2 norm of 1. If we want to target a
particular clipping parameter 𝑏, we can multiply the scaled vector by 𝑏 to scale it back up to have 𝐿2 norm 𝑏. We want to
avoid modifying vectors that already have 𝐿2 norm below 𝑏; in that case, we just return the original vector. We can use
np.linalg.norm with the parameter ord=2 to calculate the 𝐿2 norm of a vector.
def L2_clip(v, b):
    norm = np.linalg.norm(v, ord=2)
    if norm > b:
        return b * (v / norm)
    else:
        return v
Now we’re ready to analyze the sensitivity of the clipped gradient. We denote the gradient as ∇(𝜃; 𝑋, 𝑦) (corresponding to gradient in our Python code), and we want to bound the 𝐿2 norm of the difference L2_clip(∇(𝜃; 𝑋, 𝑦), 𝑏) − L2_clip(∇(𝜃; 𝑋′, 𝑦), 𝑏) between clipped gradients computed on neighboring datasets. In the worst case, L2_clip(∇(𝜃; 𝑋, 𝑦), 𝑏) has 𝐿2 norm of 𝑏, and L2_clip(∇(𝜃; 𝑋′, 𝑦), 𝑏) is all zeros - so the 𝐿2 norm of the difference is equal to 𝑏. Thus, the 𝐿2 sensitivity of the clipped gradient is bounded by the clipping parameter 𝑏!
Now we can proceed to compute the sum of clipped gradients, and add noise based on the 𝐿2 sensitivity 𝑏 that we’ve
enforced by clipping.
def gradient_sum(theta, X, y, b):
    gradients = [L2_clip(gradient(theta, xi, yi), b) for xi, yi in zip(X, y)]
    # sum query
    # L2 sensitivity is b (by the clipping performed above)
    return np.sum(gradients, axis=0)
Now we’re ready to complete our noisy gradient descent algorithm. To compute the noisy average gradient, we need to:
1. Add noise to the sum of the gradients based on its sensitivity 𝑏
2. Compute a noisy count of the number of training examples (sensitivity 1)
3. Divide the noisy sum from (1) by the noisy count from (2)
def noisy_gradient_descent(iterations, epsilon, delta):
    theta = np.zeros(X_train.shape[1])
    sensitivity = 5.0   # the clipping parameter b (an example choice)
    noisy_count = laplace_mech(X_train.shape[0], 1, epsilon)

    for i in range(iterations):
        grad_sum = gradient_sum(theta, X_train, y_train, sensitivity)
        noisy_grad_sum = gaussian_mech_vec(grad_sum, sensitivity, epsilon, delta)
        noisy_avg_grad = noisy_grad_sum / noisy_count
        theta = theta - noisy_avg_grad

    return theta
0.7795223352498895
Each iteration of this algorithm satisfies (𝜖, 𝛿)-differential privacy, and we perform one additional query to determine the
noisy count which satisfies 𝜖-differential privacy. If we perform 𝑘 iterations, then by sequential composition, the algorithm
satisfies (𝑘𝜖 + 𝜖, 𝑘𝛿)-differential privacy. We can also use advanced composition to analyze the total privacy cost; even
better, we could convert the algorithm to Rényi differential privacy or zero-concentrated differential privacy, and obtain
tight bounds on composition.
Our previous approach is very general, since it makes no assumptions about the behavior of the gradient. Sometimes,
however, we do know something about the behavior of the gradient. In particular, a large class of useful gradient functions
(including the gradient of the logistic loss, which we’re using here) are Lipschitz continuous - meaning they have bounded
global sensitivity. Formally, it is possible to show that if ‖𝑥𝑖‖2 ≤ 𝑏, then ‖∇(𝜃; 𝑥𝑖, 𝑦𝑖)‖2 ≤ 𝑏. This fact allows us to clip the values of the training examples (i.e. the inputs to the gradient function), instead of the output of the gradient function, and obtain a bound on the 𝐿2 sensitivity of the gradient.
Clipping the training examples instead of the gradients has two advantages. First, it’s often easier to estimate the scale
of the training data (and thus to pick a good clipping parameter) than it is to estimate the scale of the gradients you’ll
compute during training. Second, it’s computationally more efficient: we can clip the training examples once, and re-use
the clipped training data every time we train a model; with gradient clipping, we need to clip each gradient during training.
Furthermore, we’re no longer forced to compute per-example gradients so that we can clip them; instead, we can compute
all of the gradients at once, which can be done very efficiently (this is a commonly used trick in machine learning, but we
won’t discuss it here).
Note, however, that many useful loss functions - in particular, those derived from neural networks in deep learning - do
not have bounded global sensitivity. For these loss functions, we’re forced to use gradient clipping.
We can clip the training examples instead of the gradients with a couple of simple modifications to our algorithm. First,
we clip the training examples using L2_clip before we start training. Second, we simply delete the code for clipping
the gradients.
def gradient_sum(theta, X, y, b):
    gradients = [gradient(theta, xi, yi) for xi, yi in zip(X, y)]
    # sum query
    # L2 sensitivity is b (by the bounded global sensitivity of the gradient)
    return np.sum(gradients, axis=0)

def noisy_gradient_descent(iterations, epsilon, delta):
    theta = np.zeros(X_train.shape[1])
    sensitivity = 5.0   # the clipping parameter b, now applied to the training examples
    noisy_count = laplace_mech(X_train.shape[0], 1, epsilon)
    clipped_X = [L2_clip(x, sensitivity) for x in X_train]

    for i in range(iterations):
        grad_sum = gradient_sum(theta, clipped_X, y_train, sensitivity)
        noisy_grad_sum = gaussian_mech_vec(grad_sum, sensitivity, epsilon, delta)
        noisy_avg_grad = noisy_grad_sum / noisy_count
        theta = theta - noisy_avg_grad
    return theta
0.7797434763379035
Many improvements to this algorithm are possible, which can improve privacy cost and accuracy. Many are drawn from
the machine learning literature. Some examples include:
• Bounding the total privacy cost by 𝜖 by calculating a per-iteration 𝜖𝑖 as part of the algorithm.
• Better composition for large numbers of iterations via advanced composition, RDP, or zCDP.
• Minibatching: calculating the gradient for each iteration using a small chunk of the training data, rather than the
whole training set (this reduces the computation needed to calculate the gradient).
• Parallel composition in conjunction with minibatching.
• Random sampling of batches in conjunction with minibatching.
• Other hyperparameters, like a learning rate 𝜂.
So far, we’ve seen that the number of iterations has a big effect on the accuracy of the model we get, since more iterations
can get you closer to the minimum of the loss. Since our differentially private algorithm adds noise to the gradient, this
can also affect accuracy - the noise can cause our algorithm to move in the wrong direction during training, and actually
make the model worse.
It’s reasonable to expect that smaller values of 𝜖 will result in less accurate models (since this has been the trend in every
differentially private algorithm we have seen so far). This is true, but there’s also a slightly more subtle tradeoff which
occurs because of the composition we need to consider when we perform many iterations of the algorithm: more iterations
means a larger privacy cost. In the standard gradient descent algorithm, more iterations generally result in a better model.
In our differentially private version, more iterations can make the model worse, since we have to use a smaller 𝜖 for each
iteration, and so the scale of the noise goes up. In differentially private machine learning, it’s important (and sometimes,
very challenging) to strike the right balance between the number of iterations used and the scale of the noise added.
Let’s do a small experiment to see how the setting of 𝜖 affects the accuracy of our model. We’ll train a model for several values of 𝜖, using 20 iterations each time, and graph the accuracy of each model against the 𝜖 value used in training it.
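A sketch of such an experiment (the specific 𝜖 values and 𝛿 are illustrative choices):

delta = 1e-5
epsilons = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
accuracies = [accuracy(noisy_gradient_descent(20, epsilon, delta)) for epsilon in epsilons]
plt.xscale('log')
plt.xlabel('Epsilon (per iteration)')
plt.ylabel('Test accuracy')
plt.plot(epsilons, accuracies);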
The plot shows that very small values of 𝜖 result in far less accurate models. Keep in mind that the 𝜖 we specify in the
plot is a per-iteration 𝜖, so the privacy cost is much higher after composition.
FOURTEEN
LOCAL DIFFERENTIAL PRIVACY
Learning Objectives
After reading this chapter, you will be able to:
• Define the local model of differential privacy and contrast it with the central model
• Define and implement the randomized response and unary encoding mechanisms
• Describe the accuracy implications of these mechanisms and the challenges of the local model
So far, we have only considered the central model of differential privacy, in which the sensitive data is collected centrally
in a single dataset. In this setting, we assume that the analyst is malicious, but that there is a trusted data curator who
holds the dataset and correctly executes the differentially private mechanisms the analyst specifies.
This setting is often not realistic. In many cases, the data curator and the analyst are the same, and no trusted third party
actually exists to hold the data and execute mechanisms. In fact, the organizations which collect the most sensitive data
tend to be exactly the ones we don’t trust; such organizations certainly can’t function as trusted data curators.
An alternative to the central model of differential privacy is the local model of differential privacy, in which data is made
differentially private before it leaves the control of the data subject. For example, you might add noise to your data on
your device before sending it to the data curator. In the local model, the data curator does not need to be trusted, since
the data they collect already satisfies differential privacy.
The local model thus has one huge advantage over the central model: data subjects don’t need to trust anyone else but
themselves. This advantage has made it popular in real-world deployments, including the ones by Google and Apple.
Unfortunately, the local model also has a significant drawback: for the same privacy cost, the accuracy of query results in the local model is typically orders of magnitude lower than the accuracy of the same query under central differential privacy. This huge loss in accuracy means that only a small handful of query types are suitable for local differential privacy, and even for these, a large number of participants is required.
In this section, we’ll see two mechanisms for local differential privacy. The first is called randomized response, and the
second is called unary encoding.
Randomized response [19] is a mechanism for local differential privacy which was first proposed in a 1965 paper by S. L. Warner. At the time, the technique was intended to reduce bias in survey responses about sensitive issues, and it was not originally proposed as a mechanism for differential privacy (which wouldn’t be invented for another 40 years). After differential privacy was developed, statisticians realized that this existing technique already satisfied the definition.
Dwork and Roth present a variant of randomized response, in which the data subject answers a “yes” or “no” question as
follows:
1. Flip a coin
2. If the coin is heads, answer the question truthfully
3. If the coin is tails, flip another coin
4. If the second coin is heads, answer “yes”; if it is tails, answer “no”
The randomization in this algorithm comes from the two coin flips. As in all other differentially private algorithms, this
randomization creates uncertainty about the true answer, which is the source of privacy.
As it turns out, this randomized response algorithm satisfies 𝜖-differential privacy for 𝜖 = log(3) = 1.09.
Let’s implement the algorithm for a simple “yes” or “no” question: “is your occupation ‘Sales’?” We can flip a coin in
Python using np.random.randint(0, 2); the result is either a 0 or a 1.
def rand_resp_sales(response):
    truthful_response = response == 'Sales'
    # first coin flip: heads means answer truthfully; tails means flip again and answer randomly
    if np.random.randint(0, 2) == 0:
        return truthful_response
    else:
        # second coin flip: heads means "yes", tails means "no"
        return np.random.randint(0, 2) == 0
Let’s ask 200 people who do work in sales to respond using randomized response, and look at the results.
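One way to simulate this (a sketch, assuming the rand_resp_sales function above):

pd.Series([rand_resp_sales('Sales') for i in range(200)]).value_counts()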
True 151
False 49
dtype: int64
What we see is that we get both “yesses” and “nos” - but that the “yesses” outweigh the “nos.” This output demonstrates
both features of the differentially private algorithms we’ve already seen - it includes uncertainty, which creates privacy,
but also displays enough signal to allow us to infer something about the population.
Let’s try the same thing on some actual data. We’ll take all of the occupations in the US Census dataset we’ve been
using, and encode responses for the question “is your occupation ‘Sales’?” for each one. In an actual deployed system,
we wouldn’t collect this dataset centrally at all - instead, each respondent would run rand_resp_sales locally, and
submit their randomized response to the data curator. For our experiment, we’ll run rand_resp_sales on the existing
dataset.
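A sketch of this step, assuming the adult DataFrame used throughout the book:

responses = [rand_resp_sales(r) for r in adult['Occupation']]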
pd.Series(responses).value_counts()
False 22541
True 10020
dtype: int64
This time, we get many more “nos” than “yesses.” This makes a lot of sense, with a little thought, because the majority
of the participants in the dataset are not in sales.
The key question now is: how do we estimate the actual number of salespeople in the dataset, based on these responses? The number of “yesses” (over 10,000 here) is not a good estimate of the number of salespeople; for comparison, the actual number of salespeople in the dataset is:
len(adult[adult['Occupation'] == 'Sales'])
3650
And this is not a surprise, since many of the “yesses” come from the random coin flips of the algorithm.
In order to get an estimate of the true number of salespeople, we need to analyze the randomness in the randomized
response algorithm and estimate how many of the “yes” responses are from actual salespeople, and how many are “fake”
yesses which resulted from random coin flips. We know that:
• With probability 1/2, each respondent responds randomly
• With probability 1/2, each random response is a “yes”
So, the probability that a respondent responds “yes” by random chance (rather than because they’re a salesperson) is 1/2 ⋅ 1/2 = 1/4. This means we can expect one-quarter of our total responses to be “fake yesses.”
# we expect 1/4 of the responses to be "yes" based entirely on the coin flip
# these are "fake" yesses
fake_yesses = len(responses)/4
# the total number of "yes" responses we actually observed
num_yesses = np.sum([1 if r else 0 for r in responses])
# the number of "real" yesses is the total number of yesses minus the fake yesses
true_yesses = num_yesses - fake_yesses
The other factor we need to consider is that half of the respondents answer randomly, but some of the random respondents might actually be salespeople. How many of them are salespeople? We have no data on that, since they answered randomly! But, since we split the respondents into “truth” and “random” groups randomly (by the first coin flip), we can hope that there are roughly the same number of salespeople in both groups. Therefore, if we can estimate the number of salespeople in the “truth” group, we can double this number to get the number of salespeople in total.
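A sketch of the final estimate, using the quantities computed above:

# double the estimated number of salespeople in the "truth" group to cover both groups
rr_result = true_yesses * 2
rr_result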
3627.5
true_result = len(adult[adult['Occupation'] == 'Sales'])
true_result

3650
pct_error(true_result, rr_result)
0.6164383561643836
With this approach, and fairly large counts (e.g. more than 3000, in this case), we generally get “acceptable” error -
something below 5%. If your goal is to determine the most popular occupation, this approach is likely to work. However,
when counts are smaller, the error will quickly get larger.
Furthermore, randomized response is orders of magnitude worse than the Laplace mechanism in the central model. Let’s
compare the two for this example:
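A sketch of the comparison, assuming 𝜖 = 1 for the central model (slightly lower than the log(3) ≈ 1.1 used by randomized response):

pct_error(true_result, laplace_mech(true_result, 1, 1))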
0.0008829365314378352
Here, we get an error of well under 0.01%, even though our 𝜖 value for the central model is slightly lower than the 𝜖 we used for randomized response.
There are better algorithms for the local model, but the inherent limitations of having to add noise before submitting your
data mean that local model algorithms will always have worse accuracy than the best central model algorithms.
Randomized response allows us to ask a yes/no question with local differential privacy. What if we want to build a
histogram?
A number of different algorithms for solving this problem in the local model of differential privacy have been proposed.
A 2017 paper by Wang et al. [20] provides a good summary of some optimal approaches. Here, we’ll examine the
simplest of these, called unary encoding. This approach is the basis for Google’s RAPPOR system [18] (with a number
of modifications to make it work better for large domains and multiple responses over time).
The first step is to define the domain for responses - the labels of the histogram bins we care about. For our example, we
want to know how many participants are associated with each occupation, so our domain is the set of occupations.
domain = adult['Occupation'].dropna().unique()
domain
We’re going to define three functions, which together implement the unary encoding mechanism:
1. encode, which encodes the response
2. perturb, which flips bits in the encoded response to ensure differential privacy
3. aggregate, which estimates final counts from the collected perturbed responses
We start with encode:
def encode(response):
return [1 if d == response else 0 for d in domain]
encode('Sales')
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
The next step is perturb, which flips bits in the response vector to ensure differential privacy. The probability that a bit
gets flipped is based on two parameters 𝑝 and 𝑞, which together determine the privacy parameter 𝜖 (based on a formula
we will see in a moment).
$$\Pr[B'[i] = 1] = \begin{cases} p & \text{if } B[i] = 1 \\ q & \text{if } B[i] = 0 \end{cases}$$
def perturb(encoded_response):
return [perturb_bit(b) for b in encoded_response]
def perturb_bit(bit):
p = .75
q = .25
sample = np.random.random()
if bit == 1:
if sample <= p:
return 1
else:
return 0
elif bit == 0:
if sample <= q:
return 1
else:
return 0
perturb(encode('Sales'))
[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
Based on the values of 𝑝 and 𝑞, we can calculate the value of the privacy parameter 𝜖. For 𝑝 = .75 and 𝑞 = .25, we will
see an 𝜖 of slightly more than 2.
$$\epsilon = \log\left(\frac{p(1-q)}{(1-p)q}\right) \tag{14.1}$$
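A sketch of the corresponding helper, implementing Equation (14.1):

def unary_epsilon(p, q):
    return np.log(p*(1-q) / ((1-p)*q))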
unary_epsilon(.75, .25)
2.1972245773362196
The final piece is aggregation. If we hadn’t done any perturbation, then we could simply take the set of response vectors
and add them element-wise to get counts for each element in the domain:
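For instance, a sketch assuming the adult dataset and the encode function above:

counts = np.sum([encode(r) for r in adult['Occupation']], axis=0)
list(zip(domain, counts))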
[('Adm-clerical', 3770),
('Exec-managerial', 4066),
('Handlers-cleaners', 1370),
('Prof-specialty', 4140),
('Other-service', 3295),
('Sales', 3650),
('Craft-repair', 4099),
('Transport-moving', 1597),
('Farming-fishing', 994),
('Machine-op-inspct', 2002),
('Tech-support', 928),
('Protective-serv', 649),
('Armed-Forces', 9),
('Priv-house-serv', 149)]
But as we saw with randomized response, the “fake” responses caused by flipped bits cause the results to be difficult to
interpret. If we perform the same procedure with the perturbed responses, the counts are all wrong:
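A sketch of the same aggregation over perturbed responses:

perturbed_counts = np.sum([perturb(encode(r)) for r in adult['Occupation']], axis=0)
list(zip(domain, perturbed_counts))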
[('Adm-clerical', 9997),
('Exec-managerial', 10266),
('Handlers-cleaners', 8935),
('Prof-specialty', 10253),
('Other-service', 9796),
('Sales', 10019),
('Craft-repair', 10125),
('Transport-moving', 8841),
('Farming-fishing', 8451),
('Machine-op-inspct', 9228),
('Tech-support', 8544),
('Protective-serv', 8494),
('Armed-Forces', 8095),
('Priv-house-serv', 8314)]
The aggregate step of the unary encoding algorithm takes into account the number of “fake” responses in each category,
which is a function of both 𝑝 and 𝑞, and the number of responses 𝑛:
$$A[i] = \frac{\sum_j B'_j[i] - nq}{p - q} \tag{14.2}$$
def aggregate(responses):
    p = .75
    q = .25
    sums = np.sum(responses, axis=0)
    n = len(responses)
    return [(v - n*q) / (p - q) for v in sums]
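Putting the pieces together (a sketch, assuming the adult dataset and the functions above):

responses = [perturb(encode(r)) for r in adult['Occupation']]
counts = aggregate(responses)
list(zip(domain, counts))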
[('Adm-clerical', 3735.5),
('Exec-managerial', 4177.5),
('Handlers-cleaners', 1457.5),
('Prof-specialty', 4323.5),
('Other-service', 3519.5),
('Sales', 3713.5),
('Craft-repair', 4217.5),
('Transport-moving', 1529.5),
('Farming-fishing', 1457.5),
('Machine-op-inspct', 1903.5),
('Tech-support', 765.5),
('Protective-serv', 793.5),
('Armed-Forces', -88.5),
('Priv-house-serv', 407.5)]
As we saw with randomized response, these results are accurate enough to obtain a rough ordering of the domain elements
(at least the most popular ones), but orders of magnitude less accurate than we could obtain with the Laplace mechanism
in the central model of differential privacy.
Other methods have been proposed for performing histogram queries in the local model, including some detailed in
the paper linked earlier. These can improve accuracy somewhat, but the fundamental limitations of having to ensure
differential privacy for each sample individually in the local model mean that even the most complex technique can’t
match the accuracy of the mechanisms we’ve seen in the central model.
FIFTEEN
SYNTHETIC DATA
Learning Objectives
After reading this chapter, you will be able to:
• Describe the idea of differentially private synthetic data and explain why it is useful
• Define simple synthetic representations used in generating synthetic data
• Define marginals and implement code to calculate them
• Implement simple differentially private algorithms to generate low-dimensional synthetic data
• Describe the challenges of generating high-dimensional synthetic data
In this section, we’ll examine the problem of generating synthetic data using differentially private algorithms. Strictly
speaking, the input of such an algorithm is an original dataset, and its output is a synthetic dataset with the same shape
(i.e. same set of columns and same number of rows); in addition, we would like the values in the synthetic dataset to have
the same properties as the corresponding values in the original dataset. For example, if we take our US Census dataset
to be the original data, then we’d like our synthetic data to have a similar distribution of ages for the participants as the
original data, and to preserve correlations between columns (e.g. a link between age and occupation).
Most algorithms for generating such synthetic data rely on a synthetic representation of the original dataset, which does not
have the same shape as the original data, but which does allow answering queries about the original data. For example, if
we only care about range queries over ages, then we could generate an age histogram - a count of how many participants
in the original data had each possible age - and use the histogram to answer the queries. This histogram is a synthetic
representation which is suitable for answering some queries, but it does not have the same shape as the original data, so
it’s not synthetic data.
Some algorithms simply use the synthetic representation to answer queries. Others use the synthetic representation to
generate synthetic data. We’ll look at one kind of synthetic representation - a histogram - and several methods of generating
synthetic data from it.
We’ve already seen many histograms - they’re a staple of differentially private analyses, since parallel composition can be
immediately applied. We’ve also seen the concept of a range query, though we haven’t used that name very much. As a
first step towards synthetic data, we’re going to design a synthetic representation for one column of the original dataset
which is capable of answering range queries.
A range query counts the number of rows in the dataset which have a value lying in a given range. For example, “how
many participants are between the ages of 21 and 33?” is a range query.
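The query cell is not shown; a sketch, assuming the adult dataset and treating the upper bound as exclusive:

def range_query(df, col, a, b):
    # count the rows whose value for `col` lies in the range [a, b)
    return len(df[(df[col] >= a) & (df[col] < b)])

range_query(adult, 'Age', 21, 33)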
6245
We can define a histogram query which defines a histogram bin for each age between 0 and 100, and count the number
of people in each bin using range queries. The result looks very much like the output of calling plt.hist on the data
- because we’ve essentially computed the same result manually.
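A sketch of this histogram query, using the range_query sketch above (the names bins and counts are reused below):

bins = list(range(0, 100))
counts = [range_query(adult, 'Age', age, age+1) for age in bins]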
We can use these histogram results as a synthetic representation for the original data! To answer a range query, we can
add up all the counts of the bins which fall into the range.
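A sketch of this idea (the helper name range_query_synth is an assumption; ages index directly into the histogram because the bins start at 0):

def range_query_synth(syn_rep, a, b):
    # add up the counts of all the histogram bins that fall inside the range [a, b)
    return sum(syn_rep[i] for i in range(a, b))

range_query_synth(counts, 21, 33)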
6245
Notice that we get exactly the same result, whether we issue the range query on the original data or our synthetic repre-
sentation. We haven’t lost any information from the original dataset (at least for the purposes of answering range queries
over ages).
We can easily make our synthetic representation differentially private. We can add Laplace noise separately to each count
in the histogram; by parallel composition, this satisfies 𝜖-differential privacy.
epsilon = 1
dp_syn_rep = [laplace_mech(c, 1, epsilon) for c in counts]
We can use the same function as before to answer range queries using our differentially private synthetic representation.
By post-processing, these results also satisfy 𝜖-differential privacy; furthermore, since we’re relying on post-processing,
we can answer as many queries as we want without incurring additional privacy cost.
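For example, the same range query as before, now answered from the noisy representation (using the range_query_synth sketch above):

range_query_synth(dp_syn_rep, 21, 33)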
6245.9194571562375
How accurate are the results? For small ranges, the results we get from our synthetic representation have very similar
accuracy to the results we could get by applying the Laplace mechanism directly to the result of the range query we want
to answer. For example:
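A sketch of such a comparison (the particular range is an arbitrary example):

true_answer = range_query(adult, 'Age', 30, 31)
laplace_answer = laplace_mech(true_answer, 1, epsilon)
synth_answer = range_query_synth(dp_syn_rep, 30, 31)
pct_error(true_answer, laplace_answer), pct_error(true_answer, synth_answer)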
As the range gets bigger, the count gets larger, so we would expect relative error to decrease. We have seen this over
and over again - larger groups means a stronger signal, which leads to lower relative error. With the Laplace mechanism,
we see exactly this behavior. With our synthetic representation, however, we’re adding together noisy results from many
smaller groups - so as the signal grows, so does the noise! As a result, we see roughly the same magnitude of relative
error when using the synthetic representation, regardless of the size of the range - precisely the opposite of the Laplace
mechanism!
This difference demonstrates the drawback of our synthetic representation: it can answer any range query over the range
it covers, but it might not offer the same accuracy as the Laplace mechanism. The major advantage of our synthetic
representation is the ability to answer infinitely many queries without additional privacy budget; the major disadvantage
is the loss in accuracy.
The next step is to go from our synthetic representation to synthetic data. To do this, we want to treat our synthetic
representation as a probability distribution that estimates the underlying distribution from which the original data was
drawn, and sample from it. Because we’re considering just a single column, and ignoring all the others, this is called a
marginal distribution (specifically, a 1-way marginal).
Our strategy here will be simple: we have counts for each histogram bin; we’ll normalize these counts so that they sum to
1, and then treat them as probabilities. Once we have these probabilities, we can sample from the distribution it represents
by randomly selecting a bin of the histogram, weighted by the probabilities. Our first step is to prepare the counts, by
ensuring that none is negative and by normalizing them to sum to 1:
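A sketch of this preparation step (the name syn_normalized matches its use in gen_samples below):

syn_rep_nn = np.clip(dp_syn_rep, 0, None)          # clip negative counts up to 0
syn_normalized = syn_rep_nn / np.sum(syn_rep_nn)   # normalize so the counts sum to 1
np.sum(syn_normalized)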
1.0
Notice that if we plot the normalized counts - which we can now treat as probabilities for each corresponding histogram
bin, since they sum to 1 - we see a shape that looks very much like the original histogram (which, in turn, looks a lot like
the shape of the original data). This is all to be expected - except for their scale, these probabilities are simply the counts.
The final step is to generate new samples based on these probabilities. We can use np.random.choice, which allows
passing in a list of probabilities (in the p parameter) associated with the choices given in the first parameter. It implements
exactly the weighted random selection that we need for our sampling task. We can generate as many samples as we want
without additional privacy cost, since we’ve already made our counts differentially private.
def gen_samples(n):
return np.random.choice(bins, n, p=syn_normalized)
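For example, a few samples, wrapped in a DataFrame as in the displayed output (a sketch):

pd.DataFrame(gen_samples(5), columns=['Age'])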
Age
0 44
1 62
2 18
3 26
4 58
The samples we generate this way will be roughly distributed - we hope - according to the same underlying distribution as
the original data. That means we can use the generated synthetic data to answer the same queries we could answer using
the original data. In particular, if we plot the histogram of ages in a large synthetic dataset, we’ll see the same shape as
we did in the original data.
We can also answer some queries we’ve seen in the past, like averages and range queries:
Our mean query has fairly low error (though still much larger than we would achieve by applying the Laplace mechanism
directly). Our range query, however, has very large error! This is simply because we haven’t quite matched the shape
of the original data - we only generated 10,000 samples, and the original dataset has more than 30,000 rows. We can
perform an additional differentially private query to determine the number of rows in the original data, and then generate
a new synthetic dataset with the same number of rows, and this will improve our range query results.
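A sketch of that adjustment (assuming the gen_samples function above; the 𝜖 spent on the count is an example choice):

n = int(laplace_mech(len(adult), 1, 1.0))                 # noisy row count (sensitivity 1)
syn_data = pd.DataFrame(gen_samples(n), columns=['Age'])  # synthetic dataset of roughly the right size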
So far we’ve generated synthetic data that matches the number of rows of the original dataset, and is useful for answering
queries about the original data, but it has only a single column! How do we generate more columns?
There are two basic approaches. We could repeat the process we followed above for each of 𝑘 columns (generating 𝑘
1-way marginals), and arrive at 𝑘 separate synthetic datasets, each with a single column. Then, we could smash these
datasets together to construct a single dataset with 𝑘 columns. This approach is straightforward, but since we consider
each column in isolation, we’ll lose correlations between columns that existed in the original data. For example, it might
be the case that age and occupation are correlated in the data (e.g. managers are more likely to be old than they are to
be young); if we consider each column in isolation, we’ll get the number of 18-year-olds and the number of managers
correct, but we may be very wrong about the number of 18-year-old managers.
The other approach is to consider multiple columns together. For example, we can consider both age and occupation at the
same time, and count how many 18-year-old managers there are, how many 19-year-old managers there are, and so on.
The result of this modified process is a 2-way marginal distribution. We’ll end up considering all possible combinations
of age and occupation - which is exactly what we did when we built contingency tables earlier! For example:
ct = pd.crosstab(adult['Age'], adult['Occupation'])
ct.head()
Now we can do exactly what we did before - add noise to these counts, then normalize them and treat them as probabilities!
Each count now corresponds to a pair of values - both an age and an occupation - so when we sample from the distribution
we have constructed, we’ll get data with both values at once.
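A sketch of this step (the names dp_ct, vals, and probs are assumptions, and 𝜖 = 1 per count is an example choice):

dp_ct = ct.applymap(lambda x: max(laplace_mech(x, 1, 1), 0))  # noisy, non-negative counts
dp_vals = dp_ct.stack()
vals = list(dp_vals.index)                      # list of (age, occupation) pairs
probs = dp_vals.values / dp_vals.values.sum()   # normalize the noisy counts to probabilities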
Examining the first element of the probabilities, we find that we’ll have a 0.07% chance of generating a row representing
a 17-year-old clerical worker. Now we’re ready to generate some rows! We’ll first generate a list of indices into the vals
list, then generate rows by indexing into vals; we have to do this because np.random.choice won’t accept a list of
tuples in the first argument.
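Continuing the sketch, assuming vals and probs from above:

indices = np.random.choice(len(vals), 10000, p=probs)
synthetic_data = pd.DataFrame([vals[i] for i in indices], columns=['Age', 'Occupation'])
synthetic_data.head()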
Age Occupation
0 24 Machine-op-inspct
1 20 Machine-op-inspct
2 21 Prof-specialty
3 70 Handlers-cleaners
4 47 Transport-moving
The downside of considering two columns at once is that our accuracy will be lower. As we add more columns to
the set we’re considering (i.e., build an 𝑛-way marginal, with increasing values of 𝑛), we see the same effect we did
with contingency tables - each count gets smaller, so the signal gets smaller relative to the noise, and our results are not
as accurate. We can see this effect by plotting the histogram of ages in our new synthetic dataset; notice that it has
approximately the right shape, but it’s less smooth than either the original data or the differentially private counts we used
for the age column by itself.
We see the same loss in accuracy when we try specific queries on just the age column:
Summary
• A synthetic representation of a dataset allows answering queries about the original data
• One common example of a synthetic representation is a histogram, which can be made differentially private by
adding noise to its counts
• A histogram representation can be used to generate synthetic data with the same shape as the original data by
treating its counts as probabilities: normalize the counts to sum to 1, then sample from the histogram bins using
the corresponding normalized counts as probabilities
• The normalized histogram is a representation of a 1-way marginal distribution, which captures the information in
a single column in isolation
• A 1-way marginal does not capture correlations between columns
• To generate multiple columns, we can use multiple 1-way marginals, or we can construct a representation of an 𝑛-way marginal where 𝑛 > 1
• Differentially private 𝑛-way marginals become increasingly noisy as 𝑛 grows, since a larger 𝑛 implies a smaller
count for each bin of the resulting histogram
• The challenging tradeoff in generating synthetic data is thus:
– Using multiple 1-way marginals loses correlation between columns
– Using a single 𝑛-way marginal tends to be very inaccurate
• In many cases, generating synthetic data which is both accurate and captures the important correlations between
columns is likely to be impossible
SIXTEEN
EFFICIENCY
Learning Objectives
After reading this chapter, you will be able to:
• Design experiments to measure time and space overhead for private algorithms
• Consider tradeoffs between space and time efficiency
• Consider techniques for optimization
• Describe the efficiency bottlenecks in differentially private algorithms
When the application of differential privacy is implemented as a direct loop over sensitive entities, there is a significant time cost incurred from the added burden of random number generation.
We can contrast the runtime performance of two versions of a counting query - one with differential privacy, and one
without it.
import itertools
import operator
import time
def time_count(k):
l = [1] * k
start = time.perf_counter()
_ = list(itertools.accumulate(l, func=operator.add))
stop = time.perf_counter()
return stop - start
def time_priv_count(k):
l = [1] * k
start = time.perf_counter()
_ = list(itertools.accumulate(l, func=lambda x, y: x + laplace_mech(y,1,0.1), initial=0))
stop = time.perf_counter()
return stop - start
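The plotting cell is truncated here; a sketch consistent with the surrounding text, with an assumed range of list sizes:

x_axis = [k for k in range(100_000, 1_000_000, 100_000)]
plt.xlabel('Size (N)')
plt.ylabel('Time (seconds)')
plt.plot(x_axis, [time_count(k) for k in x_axis], label='Regular Count')
plt.plot(x_axis, [time_priv_count(k) for k in x_axis], label='Private Count')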
plt.legend();
With the time spent counting on the y-axis, and the list size on the x-axis, we can see that the time cost of differential privacy for this particular operation is essentially linear in the size of the input.
The good news is that the above implementation of differentially private count is rather naive, and we can do much better
using optimization techniques such as vectorization!
For example, when implementing differential privacy for the machine learning operations in a previous chapter, we made use of NumPy functions which are heavily optimized for vector-based operations. In that scenario, leveraging vectorization, we incur negligible additional time cost from the use of differential privacy.
delta = 1e-5
def time_gd(k):
start = time.perf_counter()
gradient_descent(k)
stop = time.perf_counter()
return stop - start
def time_priv_gd(k):
start = time.perf_counter()
noisy_gradient_descent(k, 0.1, delta)
stop = time.perf_counter()
return stop - start
The two plots generally overlap or are close in most runs of this experiment, indicating that private gradient descent incurs a low (roughly constant) time overhead.
We can also analyze the space utilization of differential privacy mechanisms. Python 3.4 introduced tracemalloc, a debug tool capable of tracing the memory blocks allocated during evaluation. We can analyze space overhead by comparing the peak size of memory blocks used during private versus non-private computation.
Using this strategy, we can analyze the space overhead of the private count operation as follows:
import itertools
import operator
import tracemalloc
def space_count(k):
l = [1] * k
itertools.accumulate(l, func=operator.add)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.reset_peak()
return peak/1000
def space_priv_count(k):
    l = [1] * k
    itertools.accumulate(l, func=lambda x, y: x + laplace_mech(y, 1, 0.1))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.reset_peak()
    return peak/1000
tracemalloc.start()
x_axis = [k for k in range(100_000,1_000_000,100_000)]
plt.xlabel('Size (N)')
plt.ylabel('Memory Blocks')
plt.plot(x_axis, [space_count(k) for k in range(100_000,1_000_000,100_000)], label='Regular Count')
plt.plot(x_axis, [space_priv_count(k) for k in range(100_000,1_000_000,100_000)], label='Private Count')
tracemalloc.stop()
plt.legend();
We plot the peak traced memory, scaled down by a factor of one thousand, on the y-axis, against the list size on the x-axis.
In this case, the memory overhead of differential privacy is generally constant.
We also observe a spike in the initial space cost for private count, which we can attribute to memory allocation for setup
of the random number generator resources such as the entropy pool and PRNG (pseudo-random number generator).
We can repeat this experiment to contrast private and non-private machine learning:
def space_gd(k):
    gradient_descent(k)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.reset_peak()
    return peak/1_000_000

def space_priv_gd(k):
    noisy_gradient_descent(k, 0.1, delta)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.reset_peak()
    return peak/1_000_000
tracemalloc.start()
x_axis = [k for k in range(5,10,2)]
plt.xlabel('Size (N)')
plt.ylabel('Memory Blocks')
plt.plot(x_axis, [space_gd(k) for k in range(5,10,2)], label='Regular GD')
plt.plot(x_axis, [space_priv_gd(k) for k in range(5,10,2)], label='Private GD')
tracemalloc.stop()
plt.legend();
In this case, since we are performing gradient descent, the memory allocation is generally much larger - as expected - so we scale the y-axis down by a factor of one million.
We also observe generally constant space overhead during private gradient descent. However, the space overhead is much larger here, since we use extra space for vectorization and concurrency to improve time efficiency.
In conclusion, while some bottlenecks may exist, there are techniques we can use to keep both the space and time cost of
differential privacy generally low (constant) throughout the lifetime of program evaluation.
Why do any bottlenecks exist, and why are they particular to differential privacy? In order to preserve the guarantee of differential privacy, we require high-quality entropy. This is a limited resource, because it generally relies on some external source of non-deterministic environmental input. Examples of non-deterministic entropy sources include disk/network IO, keyboard presses, and mouse movement.
Usually this entropy is stored in some special buffer or file to be retrieved and used later on.
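The cell producing the output below is not shown; one possible sketch, assuming a Unix-like system that exposes the kernel entropy pool at /dev/random:

with open('/dev/random', 'rb') as f:
    f.read(8)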
b'wsyJ\xe4NJ\x9c'
import os
os.urandom(32)
b'\x8a\xea3\xdd\xae\x9e\xf9I\xf3|A\xb0\xab\x84\x89P\x8a\xcd\xf2H:(\x821\xfc\x0c\x16<4\xb4\xa2\x04'
Or:
import secrets
secrets.token_bytes(32)
b'\xb7\x96H\xaf\xe3?\xea\x91N\x0c\xc9\xf5\xc3\xed_\xd0\x87\x10M\xa7\r\x8684`\xb1\xa4\x0er\xbb\x01?'
In general, different programming languages and operating systems offer several options for random number and seed generation, with varying applicability for use cases such as modeling, simulation, or the production of cryptographically strong random bytes suitable for managing sensitive user data at scale.
SEVENTEEN
BIBLIOGRAPHY
[1] Latanya Sweeney. Simple demographics often identify people uniquely. URL: https://dataprivacylab.org/projects/
identifiability/.
[2] Latanya Sweeney. K-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzzi-
ness and Knowledge-Based Systems, 10(05):557–570, 2002. URL: https://doi.org/10.1142/S0218488502001648,
arXiv:https://doi.org/10.1142/S0218488502001648, doi:10.1142/S0218488502001648.
[3] Cynthia Dwork. Differential privacy. In Proceedings of the 33rd International Conference on Automata, Languages
and Programming - Volume Part II, ICALP'06, 1–12. Berlin, Heidelberg, 2006. Springer-Verlag. URL: https://doi.
org/10.1007/11787006_1, doi:10.1007/11787006_1.
[4] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data
analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC'06, 265–284. Berlin, Heidelberg,
2006. Springer-Verlag. URL: https://doi.org/10.1007/11681878_14, doi:10.1007/11681878_14.
[5] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves:
privacy via distributed noise generation. In Serge Vaudenay, editor, Advances in Cryptology - EUROCRYPT 2006,
486–503. Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
[6] Frank D. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Pro-
ceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, 19–30. New
York, NY, USA, 2009. Association for Computing Machinery. URL: https://doi.org/10.1145/1559845.1559850,
doi:10.1145/1559845.1559850.
[7] Ilya Mironov. On significance of the least significant bits for differential privacy. In Proceedings of the 2012 ACM
Conference on Computer and Communications Security, CCS '12, 650–661. New York, NY, USA, 2012. Association
for Computing Machinery. URL: https://doi.org/10.1145/2382196.2382264, doi:10.1145/2382196.2382264.
[8] Sílvia Casacuberta, Michael Shoemate, Salil Vadhan, and Connor Wagaman. Widespread underestimation of sen-
sitivity in differentially private libraries and how to fix it. In Proceedings of the 2022 ACM SIGSAC Conference on
Computer and Communications Security, CCS '22, 471–484. New York, NY, USA, 2022. Association for Comput-
ing Machinery. URL: https://doi.org/10.1145/3548606.3560708, doi:10.1145/3548606.3560708.
[9] Borja Balle and Yu-Xiang Wang. Improving the Gaussian mechanism for differential privacy: analytical calibration
and optimal denoising. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference
on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 394–403. PMLR, 10–15 Jul 2018.
URL: https://proceedings.mlr.press/v80/balle18a.html.
[10] Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual
Symposium on Foundations of Computer Science, volume, 51–60. 2010. doi:10.1109/FOCS.2010.12.
[11] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analy-
sis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, STOC '07, 75–84. New
York, NY, USA, 2007. Association for Computing Machinery. URL: https://doi.org/10.1145/1250790.1250803,
doi:10.1145/1250790.1250803.
115
Programming Differential Privacy
[12] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM
Symposium on Theory of Computing, STOC '09, 371–380. New York, NY, USA, 2009. Association for Computing
Machinery. URL: https://doi.org/10.1145/1536414.1536466, doi:10.1145/1536414.1536466.
[13] Ilya Mironov. Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th, 263–
275. IEEE, 2017.
[14] Mark Bun and Thomas Steinke. Concentrated differential privacy: simplifications, extensions, and lower bounds. In
Theory of Cryptography Conference, 635–658. Springer, 2016.
[15] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on
Foundations of Computer Science (FOCS'07), volume, 94–103. 2007. doi:10.1109/FOCS.2007.66.
[16] Cynthia Dwork, Aaron Roth, and others. The algorithmic foundations of differential privacy. Foundations and
Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
[17] Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum, and Salil Vadhan. On the complexity of differen-
tially private data release: efficient algorithms and hardness results. In Proceedings of the Forty-First Annual ACM
Symposium on Theory of Computing, STOC '09, 381–390. New York, NY, USA, 2009. Association for Computing
Machinery. URL: https://doi.org/10.1145/1536414.1536467, doi:10.1145/1536414.1536467.
[18] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: randomized aggregatable privacy-preserving
ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Se-
curity, CCS '14, 1054–1067. New York, NY, USA, 2014. Association for Computing Machinery. URL: https:
//doi.org/10.1145/2660267.2660348, doi:10.1145/2660267.2660348.
[19] Stanley L. Warner. Randomized response: a survey technique for eliminating evasive answer bias. Journal of the
American Statistical Association, 60(309):63–69, 1965. PMID: 12261830. URL: https://www.tandfonline.com/doi/
abs/10.1080/01621459.1965.10480775, doi:10.1080/01621459.1965.10480775.
[20] Tianhao Wang, Jeremiah Blocki, Ninghui Li, and Somesh Jha. Locally differentially private protocols for fre-
quency estimation. In 26th USENIX Security Symposium (USENIX Security 17), 729–745. Vancouver, BC, Au-
gust 2017. USENIX Association. URL: https://www.usenix.org/conference/usenixsecurity17/technical-sessions/
presentation/wang-tianhao.
116 Bibliography
PROOF INDEX
adaptive-composition (ch6), 45
advanced-composition-def (ch6), 46
approximate-advanced-composition (ch6), 48
approximate-dp (ch6), 41
approximate-sequential-composition (ch6), 42
argmin (ch7), 51
bounded-dp (ch3), 24
catastrophe-mechanism (ch6), 45
data-privacy (intro), 3
gaussian-mechanism-def (ch6), 42
global-sensitivity (ch5), 33
k-anonymity-def (ch2), 15
k-steps (ch7), 51
l1-sensitivity (ch6), 44
l2-sensitivity (ch6), 44
laplace (ch3), 22
local-sensitivity-def (ch7), 49
max-divergence (ch8), 59
mechanism (ch3), 21
parallel-composition-def (ch4), 28
post-processing-def (ch4), 30
propose-test-release-def (ch7), 51
renyi-divergence (ch8), 59
renyi-gauss (ch8), 60
renyi-sequential-composition (ch8), 60
sample-and-aggregate-def (ch7), 54
sequential-composition-def (ch4), 25
smooth-sensitivity-def (ch7), 53
symmetric-difference (ch5), 34
theorem-0 (ch9), 65
unbounded-dp (ch3), 24
zcpd (ch8), 61
zcpd-parallel-composition (ch8), 61