Differential Privacy:
The Pursuit of Protections
by Default
case study
A discussion with Miguel Guevara, Damien Desfontaines, Jim Waldo, and Terry Coatta

Over the past decade, calls for better measures to protect sensitive, personally identifiable information have blossomed into what politicians like to call a "hot-button issue." Certainly, privacy violations have become rampant and people have grown keenly aware of just how vulnerable they are. When it comes to potential remedies, however, proposals have varied widely, leading to bitter, politically charged arguments. To date, what's chiefly come of that have been bureaucratic policies that satisfy almost no one and infuriate many.
Now, into this muddled picture comes differential
privacy. First formalized in 2006, it’s an approach based
on a mathematically rigorous definition of privacy that
allows formalization and proof of the guarantees against
re-identification offered by a system. While differential
privacy has been accepted by theorists for some time, its
implementation has turned out to be subtle and tricky,
with practical applications only now starting to become
available. To date, differential privacy has been adopted by
the U.S. Census Bureau, along with a number of technology
companies, but what this means and how these organizations
have implemented their systems remains a mystery to many.
It’s also unlikely that the emergence of differential privacy
signals an end to all the difficult decisions and tradeoffs, but
it does signify that there now are measures of privacy that
stronger.
JW So, the core idea is that when you query the data, the
answer has some noise added to it, and this gives you
control over privacy because the more noise you add to
the data, the more private it becomes—with the tradeoff
being that the amount of precision goes down as the noise
goes up.
DD That’s right.
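To make that tradeoff concrete, here is a minimal sketch of the Laplace mechanism applied to a count query. It is an illustration only, not Google's library; the exact count and the epsilon values are made up.

    import numpy as np

    def noisy_count(true_count, epsilon, sensitivity=1.0):
        # Adding or removing one person changes a count by at most 1 (the
        # sensitivity), so Laplace noise with scale sensitivity/epsilon
        # gives an epsilon-differentially-private answer.
        return true_count + np.random.laplace(scale=sensitivity / epsilon)

    true_count = 1234  # hypothetical exact result of the query
    for epsilon in (0.1, 1.0, 10.0):
        # Smaller epsilon -> larger noise scale -> stronger privacy, less precision.
        print(epsilon, noisy_count(true_count, epsilon))

The only knob is epsilon: lowering it widens the noise distribution, which is exactly the privacy-versus-precision tradeoff described above.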
JW How is this now being used inside Google?
MG It's mostly used by a lot of internal tools. From the start, we saw it as a way to build tooling that could be used to address some core internal use cases. The first of those was a project where we helped some colleagues who wanted to do some rapid experimentation with data. We discovered that, much of the time, a good way to speed access to the data underlying a system is to add a privacy layer powered by differential privacy. That prompted us to build a system that lets people query underlying data and obtain differentially private results.
After we started to see a lot of success there, we
decided to scale that system—to the point where we’re
now building systems capable of dealing with data volumes
at Google scale, while also finding ways to serve end users,
as well as internal ones. For example, differential privacy
made it possible for Google to produce the COVID-19
Community Mobility Reports [used by public health
officials to obtain aggregated, anonymized insights from
health-care data that can then be used to chart disease
movement trends over time by geography as well as by
locales (such as grocery stores, transit stations, and
workplaces)]. There’s also a business feature in Google
Maps that shows you how busy a place is at any given point
in time. Differential privacy makes that possible as well.
Basically, differential privacy is used by infrastructure
systems at Google to enable both internal analysis and
some number of end-user features.
JW As I understand it, there’s a third variable. There’s how
accurate things are and how much noise you add—and then
there’s the number of queries you allow. Do you take all
three of those into account?
MG It really depends on the system. In theory, you can have
an infinite number of queries. But there’s a critical aspect
of differential privacy called the privacy budget: each time you run a query, you spend some part of your budget. So, let's
say that every time you issue a query, you use half of your
remaining budget. As you continue to issue more queries,
the amount of noise you introduce into your queries will
just increase.
With one of our early systems, we overcame this by doing
something you’re hinting at, which was to limit the number of
queries users could make. That was so we wouldn’t exhaust
the budget too fast and would still have what we needed to
provide meaningful results for our users.
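A rough sketch of the bookkeeping MG describes, in which each query spends half of the remaining budget and later answers get noisier. The class and the numbers are illustrative assumptions, not Google's internal accounting.

    import numpy as np

    class HalvingBudget:
        """Toy privacy-budget tracker using basic sequential composition."""

        def __init__(self, total_epsilon):
            self.remaining = total_epsilon

        def noisy_count(self, true_count):
            epsilon = self.remaining / 2.0       # spend half of what is left
            self.remaining -= epsilon
            # Less epsilon per query means a larger noise scale.
            return true_count + np.random.laplace(scale=1.0 / epsilon)

    budget = HalvingBudget(total_epsilon=1.0)
    for _ in range(4):
        print(round(budget.noisy_count(500)))    # answers get noisier each time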
DD There’s also a question that comes up in the literature
having to do with someone using an engine to run arbitrary
queries over a dataset—typically whenever that person
does not have access to the raw data. In such use cases,
budget tracking becomes very important. Accordingly,
we’ve developed systems with this in mind, using
techniques like sampling, auditing, and limiting the number
of queries that can be run. On the other hand, with many
common applications, you know what kind of query you
we apply to the data we store? How can we request user consent in an understandable, respectful way? And so on. None of these questions is Boolean. Even in adversarial contexts, where the answer seems to be Boolean, it isn't. For example: Is the attacker going to be able to intercept and re-identify data? The answer is either yes or no.
But you still need to think about other questions like:
What is the attacker capable of? What are we trying to
defend against? What’s the worst-case scenario? This is
to say, even without the formal concept of differential
privacy, the notion of privacy in general is far from
Boolean. There always are shades of gray.
What differential privacy does to achieve data
anonymization is to quantify the tradeoffs in a formal,
mathematical way. This makes it possible to move beyond
these shades-of-gray assessments to apply a strong attack
model where you have an attacker armed with arbitrary
background knowledge and computational resources—
which represents the worst possible case—and yet you’re
still able to get strong, quantifiable guarantees. That’s the
essence of differential privacy, and it’s by far the best thing
we have right now in terms of quantifying and measuring
privacy against utility for data anonymization.
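For reference, the guarantee being quantified here is the standard definition of epsilon-differential privacy: a randomized mechanism M satisfies it if, for any two datasets D and D' that differ in one person's data, and for any set S of possible outputs,

    \Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S]

Smaller values of epsilon force the two probabilities closer together, so the output reveals less about any single person, and the bound holds no matter what background knowledge or computing power the attacker has.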
Powerful as differential privacy may be, it's also
highly abstract. Getting users and developers alike
to build confidence around its ability to protect
personally identifiable information has proved to be
challenging.
In an ongoing effort, various approaches are being tried to
help people make the connection between the mathematics
of differential privacy and the realization of actual privacy
protection goals. Progress in this regard is not yet up to
Google scale.
And yet, Google has a clear, vested interest in building
public confidence in the notion that it and other large
aggregators of user data are fully capable of provably
anonymizing the data they utilize. Finding a way to convey
that in a convincing manner to the general public, however,
remains an unsolved problem.
this is mostly wrong. An example is the assumption that each record of the dataset corresponds to a single user. This owes to the fact that the main use case presented in much of the literature relates to medical data, with one record per patient. But, of course, when you're working with datasets like logs of user activities, place visits, or search queries, each user ends up contributing much more than just a single record in the dataset. So, it took some innovations and optimizations to account for this in building some better tooling for our purposes.
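One common way to handle multi-record users, sketched below with invented names and parameters, is to bound each user's contribution and scale the noise to that bound. This illustrates the general technique of contribution bounding, not the code in Google's tooling.

    import collections
    import numpy as np

    def private_event_count(events, epsilon, max_per_user=5):
        """events is a list of (user_id, event) pairs; returns a noisy total
        in which no single user contributes more than max_per_user events."""
        per_user = collections.Counter(user_id for user_id, _ in events)
        bounded_total = sum(min(n, max_per_user) for n in per_user.values())
        # One user can now change the total by at most max_per_user, so that
        # bound is the sensitivity used to scale the Laplace noise.
        return bounded_total + np.random.laplace(scale=max_per_user / epsilon)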
Something else that contributed to the unforeseen
difficulties was that, even though the math is relatively
simple, implementing it in a way that preserves the
guarantees is tricky. It’s a bit like RSA (Rivest-Shamir-
Adleman) in cryptography—simple to understand, yet naïve
implementations will encounter serious issues like timing
attacks. In differential privacy theory, you add a random
number from a continuous distribution to a statistic with
arbitrary precision. To do that with a computer, you need
to use floating-point numbers, and the ways these are
represented come with a lot of subtle issues. For example,
the bits of least precision in the noisy number can leak
information about the original number if you’re not careful.
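One mitigation described in the literature, in the spirit of Mironov's "snapping" analysis of floating-point Laplace noise, is to round the noisy result onto a coarse power-of-two grid and clamp it to a fixed range, so its low-order bits no longer depend on the secret value. The sketch below only illustrates the idea; it is not a production-grade secure sampler.

    import math
    import numpy as np

    def snapped_noisy_value(true_value, epsilon, bound):
        scale = 1.0 / epsilon
        noisy = true_value + np.random.laplace(scale=scale)
        # Round onto a power-of-two grid no finer than the noise scale so the
        # low-order floating-point bits cannot encode the true value.
        granularity = 2.0 ** math.ceil(math.log2(scale))
        snapped = granularity * round(noisy / granularity)
        # Clamp to a fixed, data-independent output range.
        return max(-bound, min(bound, snapped))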
In many ways, the release of an open-source version of
Google’s differential privacy library creates a whole raft
of new challenges. Now there’s an education program to
roll out; users and developers to be supported; new tools
to be built; external contributions to be curated, vetted,
and tested… indeed, a whole new review process to put into
place and an even broader undertaking to tackle in the form
of organizing an external community of developers.
But that’s just what comes with the territory whenever
there are grand aspirations. The goals of Google’s differential
privacy team happen to be quite ambitious indeed.