
Privacy Integrated Data Stream Queries

Lucas Waye
Harvard University
[email protected]

Abstract

Research on differential privacy is generally concerned with examining data sets that are static. Because the data sets do not change, every computation on them produces "one-shot" query results; the results do not change aside from randomness introduced for privacy. There are many circumstances, however, where this model does not apply, or is simply infeasible. Data streams are examples of non-static data sets where results may change as more data is streamed. Theoretical support for differential privacy with data streams has been researched in the form of differentially private streaming algorithms. In this paper, we present a practical framework with which a non-expert can perform differentially private operations on data streams. The system is built as an extension to PINQ (Privacy Integrated Queries), a differentially private programming framework for static data sets. The streaming extension provides a programmatic interface for the different types of streaming differential privacy from the literature so that the privacy trade-offs of each type of algorithm can be understood by a non-expert programmer.

Categories and Subject Descriptors  H.3 [Online Information Services]: Data Sharing

Keywords  Differential privacy; programming languages; privacy

1. Introduction

With the increase of big data services where personal user information is collected and stored in large quantities, the need for privacy is very important. Differential privacy [2] is a particularly strong definition of privacy that has withstood many known forms of privacy attacks. Differential privacy enforces that not much can be learned from a particular participant's data. To help make differential privacy more accessible, programming systems have been developed to help non-experts leverage the guarantees of differential privacy.

Most of the research and techniques of differential privacy have been with respect to a single, non-changing database. This precludes many environments where static data sets are not feasible. Situations where we would like to avoid performing queries on the entire data set include:

• The data is coming in over a long period of time and we would like to observe intermediate query results.

• It is practically infeasible to hold the entire data set in memory at one time.

• We would like to offer privacy guarantees against intrusions into the system running the computation (e.g. an intruder breaks into the system accessing the private database).

To address these issues, many have looked at differential privacy for streaming algorithms. Streaming algorithms can store less state than the entire database by summarizing relevant features of the data, produce intermediate outputs, and also in some cases protect against unannounced intrusions into the system. In this paper, we look at how differentially private streaming algorithms can be used by a non-expert in real-world applications where privacy is important.

1.1 Private Twitter

We are specifically motivated by scenarios where data is plentiful and generated continuously, but where privacy is still a concern. These scenarios come up in many places. For example, in Dwork et al. a website that tracks H1N1 symptoms was described where the intention was to analyze aggregated information, possibly to track its growth or spread [3]. The privacy requirements of the web app were explored and potential privacy mechanisms were identified.

Consider another example web app with privacy expectations: a theoretical alternative to the popular social media website Twitter where Twitter messages (tweets) are visible only to a user's followers, but where we would still like to analyze aggregate message patterns in a privacy-preserving way. In other words, a visitor can only see trends about tweets rather than specific tweets from specific users. Queries include examining topics that receive many tweets, perhaps specific to certain geographic areas.
Other queries might include determining how many individuals are tweeting about certain topics. (Note these are different queries, as the latter de-duplicates tweets from the same user.)

The need for privacy is apparent: users should be guaranteed that their tweets are viewable only to their followers so that they cannot be identified by outsiders, but the service may still like to compute aggregate information using their tweets. Differential privacy is a good notion of privacy for this setting, as any individual user or tweet cannot be singled out and identified.

Depending on the types of queries we are interested in, we find that we need to adjust our definitions of privacy to accommodate many different types of privacy, each with their own trade-offs:

• One may want their query results to be insensitive to correlated events. For example, a particular user frequently tweeting about a specific topic should not heavily influence the observed query results.

• One might also have different goals in mind for the trade-off between accuracy and responsiveness, that is, how frequently we get an intermediate result from the algorithm. One could do daily batches of the events, or one may want updated query results every time an event occurs.

• Another important trade-off includes the amount of space the algorithm uses and, somewhat related to that, how much privacy the algorithm preserves internally. In other words, how is a user's privacy affected if the algorithm's internal state is observed or leaked? For example, an algorithm that stores the last 100 events of a data stream could leak all information relating to those events if its internal state was observed.

These factors impact certain characteristics of an algorithm, including: privacy guarantees to each user, output behavior, and internal state of the algorithm. Related work provides theoretical results for these trade-offs in the form of specialized algorithms that give various guarantees an analyst or data owner might desire. In order to be useful to a non-expert, the salient properties of each streaming algorithm should be clear in a unified framework so that the best algorithm can be chosen for the given constraints (e.g. privacy level or result accuracy) without regard to the algorithm's implementation.

1.2 Contributions

We extended the PINQ (Privacy Integrated Queries) platform [7] to support differentially private streaming algorithms.¹ Streaming PINQ provides a basis for handling streaming data, which was not readily available in PINQ or its underlying system for handling data (described later). The data analyst can program against Streaming PINQ in an intuitive way, similar to PINQ, but instead of receiving results directly after making a query, a handle to a streaming algorithm corresponding to the query is returned. The streaming algorithm has a common interface, but its behavior can vary depending on the properties of the algorithm. We have implemented five different differentially private streaming algorithms. Our goal was not a complete library based on the literature, but rather a good representative sample of different types of streaming algorithms that span the trade-offs of the characteristics identified in this paper. We primarily focused on how the platform would handle the interaction of various types of streaming algorithms while maintaining an intuitive PINQ-like interface for the user of the platform.

¹ Code is available at http://git.io/jeWc2Q

2. Background

In this section we describe the definitions that will be used to reason about the privacy guarantees of the system. We will also describe the underlying platform, PINQ, which provides mechanisms for privately querying static data sets.

2.1 Differential Privacy

Differential privacy enforces that a resulting output distribution does not change much based on a particular individual's data. In particular, we use the following definition of differential privacy.

Definition 2.1 (Differential Privacy). A randomized algorithm M provides ε-differential privacy if for any two adjacent input data sets A and B, and any set of possible outputs S of M,

    Pr[M(A) ∈ S] ≤ e^ε · Pr[M(B) ∈ S]

Intuitively, this definition states that the outputs of differentially private mechanisms should not be sensitive to small changes in the input data sets. For example, a single change in an input data set should not affect the output of a differentially private mechanism very much. From a user's perspective, the mechanism's results will not change very much based on their participation in the data set. As a result, a user does not have to worry about their (possibly sensitive) information being identified based on the outputs of the mechanism. Other mechanisms, such as the removal of Personally Identifiable Information (e.g. name, social security number, etc.), do not have this property. Sweeney has shown that participants of a data set can be identified from looking at the output data set (the original data set with Personally Identifiable Information removed) [11]. The framework presented in this paper is based on differential privacy (as defined in Definition 2.1) as it is a particularly strong notion of privacy that is resistant to many known privacy attacks.
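As a concrete illustration of Definition 2.1, the sketch below releases a count with Laplace noise of scale 1/ε, the standard way to make a sensitivity-1 count query ε-differentially private. The class and method names are ours for illustration only; this is not PINQ's internal code.

    // Minimal sketch of an ε-differentially private count. A count query
    // changes by at most 1 between adjacent data sets, so adding noise
    // drawn from a Laplace distribution with scale 1/ε satisfies
    // Definition 2.1.
    using System;

    static class LaplaceCountSketch
    {
        static readonly Random Rng = new Random();

        // Sample from Laplace(0, scale) via inverse-CDF sampling.
        static double SampleLaplace(double scale)
        {
            double u = Rng.NextDouble() - 0.5;   // uniform in [-0.5, 0.5)
            return -scale * Math.Sign(u) * Math.Log(1 - 2 * Math.Abs(u));
        }

        // True count plus Laplace(1/epsilon) noise.
        public static double NoisyCount(int trueCount, double epsilon)
        {
            return trueCount + SampleLaplace(1.0 / epsilon);
        }
    }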
The sensitivity of a mechanism is based on the adjacency of the inputs and outputs. In most settings where the input data sets are databases, adjacency would be databases where just one row is changed. This form of adjacency implies row-level privacy (i.e. just one row of a database is changed). Note that in a streaming setting, this may not capture an intuitive notion of privacy, so we will present different definitions of adjacency later.

An important concept in programming is the ability to easily compose operations together to form more complicated operations that can be reasoned about in a predictable way. Differential privacy is compatible with this concept. In particular, any sequential composition of differentially private computations implies differential privacy.

Theorem 2.2 (Composition). If M1 is an ε1-differentially private algorithm and M2 is an ε2-differentially private algorithm, then the sequential composition of M1(X) followed by M2(X) provides (ε1 + ε2)-differential privacy.

A corollary of Theorem 2.2 is that a sequence of differentially private algorithms provides privacy equal to the sum of their ε values. Our programming framework makes use of composition to handle multiple streaming algorithms operating on the same data.
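For example, using the PINQ-style API shown later in Figure 1 (and assuming a tweets collection as in that figure), two queries whose ε values sum to the budget exhaust it. This is an illustrative sketch of the corollary, not code from the paper:

    // Sketch of sequential composition (Theorem 2.2) with PINQ-style
    // budgeting: a total budget of 1.0 split across two queries.
    var agent = new PINQAgentBudget(1.0);
    var data  = new PINQueryable<Tweet>(tweets, agent);

    double fromNY = data.Where(t => t.Location.State == "NY").NoisyCount(0.6);
    double fromCA = data.Where(t => t.Location.State == "CA").NoisyCount(0.4);
    // 0.6 + 0.4 = 1.0, so the budget is now exhausted and the agent
    // should reject any further differentially private operations.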
Another important theorem which we make use of involves operating over disjoint input sets. If differentially private algorithms are operating on disjoint subsets of the input data, then we can improve our guarantees.

Theorem 2.3 (Disjointness). If M1 and M2 are ε-differentially private algorithms and X1 and X2 are disjoint subsets of the input domain X, then M1(X1) followed by M2(X2) provides ε-differential privacy.

Intuitively, if algorithms are operating on separate data in parallel, then there is no further privacy loss. This theorem is useful when the input database contains many different properties that an analyst may want to explore independently without using up too much privacy through composition.

2.2 Privacy Integrated Queries

There are a few differential privacy programming frameworks available. PINQ [7] is an extension to LINQ (Language Integrated Query) [8] which provides a programmer with a differentially private view of a given LINQ data source. In its basic feature set, it provides functionality to add Laplace random noise in accordance with an internal privacy budget on the data set, which satisfies differential privacy. The privacy budget is enforced via an agent that is attached to the private query mechanism and is notified every time a differentially private operation takes place. The PINQueryable object is responsible for adding the noise and consulting the agent to check whether the privacy budget is exceeded. As a result, when a developer wants to implement new differentially private primitive operations, it is the responsibility of the developer to prove their correctness. A user of the system must trust the underlying implementation of each PINQueryable. The benefit is that it is very easy to extend the system, but this introduces possible unsoundness if a PINQueryable implementation is unsound. Other systems have a smaller trusted base and can enforce the privacy guarantees of new private functions (e.g. [9]).

    var tweets = ReadAllSavedTweets("tweets.txt");
    var agent = new PINQAgentBudget(1.0);
    var data = new PINQueryable<Tweet>(tweets, agent);

    double tweetsFromNY = data
        .Where(tweet => tweet.Location.State == "NY")
        .NoisyCount(1.0);

    Console.WriteLine("Tweets from NY: " + tweetsFromNY);

Figure 1. PINQ program that provides a differentially private (ε = 1.0) count of the number of tweets from New York.

Figure 1 shows an example PINQ program for the Private Twitter example. It loads all the tweets (stored in "tweets.txt"), then constructs a differentially private queryable object (named data) where ε = 1.0. Then a query is performed on the data source which yields a noisy version of the true number of tweets originating from New York. Note that the privacy budget of the data set in "tweets.txt" has been completely consumed, so no further queries should be performed on that data to preserve privacy.

2.3 Streaming Algorithms

In this paper, we are interested in streaming algorithms. The key difference between streaming algorithms and batch algorithms is that streaming algorithms do not have access to the data all at once. Instead, the data arrives through a stream one event at a time. Dwork et al. researched this setting and came up with various differentially private algorithms one could use, with some extensions to the traditional definition of differential privacy [4].

Event-Level vs. User-Level Privacy  An important change needed to support streaming differential privacy was re-defining the notion of adjacency for streams. Since a stream can be unbounded, the classic definition of row-level adjacency was not sufficient. Dwork et al. distinguish between two types of adjacency that give rise to two types of privacy: event-level privacy and user-level privacy. Intuitively, in event-level privacy one can think of adjacency between two streams as differing in just one event. In user-level privacy, the user could contribute to many events, so an adjacent stream would differ in all events a particular user contributed to. To formalize this notion of adjacency, we use the following definition.

Definition 2.4 (X-Adjacent Data Streams). Data streams (or stream prefixes) S and S′ are X-adjacent if they differ only in the presence or absence of any number of occurrences of a single element x ∈ X. In other words, if all occurrences of x are deleted from both streams, then the resulting streams should be identical.
Definition 2.4 gives rise to both event-level and user-level privacy. In user-level privacy, users are distinguished by their values in the stream. That means that values in the stream need to correspond to a particular user (e.g. user row identifiers or unique names for each user).

Output Behavior  Another important consideration with streaming algorithms is output behavior. Some streaming algorithms are single-output algorithms [4]. That is, they can process a stream for an indefinite period of time, but once they receive a signal to output, they give one output and then need to be reinitialized. In a later publication, Dwork et al. provide a generalized transformation that can take a differentially private streaming algorithm that produces only one output into one that can produce continuous outputs [3]. It is important to note, though, that an upper bound on the number of events to process must be given as a parameter to the transformation upfront. Chan et al. present a differentially private counter with non-trivial error that does not require an upper bound on the number of events to process [1].

Pan-Privacy  A nice characteristic of streaming algorithms is their ability to summarize large amounts of data in a size less than that of the stream. However, it is possible that the internal state of the algorithm may not be differentially private. Consider a private streaming counter that outputs noise over the true count while internally maintaining the true count. If an intrusion into the system were to occur, the algorithm would not maintain privacy, as it would leak the true count. An algorithm that can maintain privacy under an intrusion is said to be pan-private [4]. It is also important to note how many intrusions the algorithm can withstand. Impossibility results are given for some finite-state algorithms (in particular, estimating user density in a stream) against more than one unannounced intrusion.

The above properties of streaming algorithms – which are in general not applicable to non-streaming algorithms – can be stated independently of each other. It is not clear which properties are advantageous over others; it is very dependent on the needs of the analyst and data owner. As a result, the Streaming PINQ framework handles each of these properties separately. For example, we provide a single-output streaming algorithm as well as a continuous-output streaming algorithm (at the expense of accuracy) so that the data analyst can pick the algorithm suitable to their particular needs. The only requirement that we make on algorithms is that they characterize their behaviors according to the above properties (i.e. event-level vs. user-level privacy, output behavior, and level of pan-privacy).

3. Streaming PINQ

Streaming PINQ is an extension to PINQ that supports streaming differentially private algorithms. It is implemented in roughly 1,000 lines of C# code. The framework is meant to have the same "look and feel" as PINQ. Many of the classes are entirely new and do not rely on the PINQ object model directly, but the same coding style is adopted for the programmer's ease of use and understanding.

To support streaming data, we designed a streaming data provider interface. The interface is extendable so that data providers can provide access to their own streams (e.g. network streams, file streams, etc.). To provide privacy, the data stream is wrapped in a StreamingQueryable<T> object that controls access to the events in the stream (in a similar fashion to PINQueryable<T> in PINQ). The wrapper object supports various transformations on the data. The underlying data cannot be accessed except through a differentially private streaming algorithm. The wrapper object consults an agent to determine whether it is safe to run a requested streaming algorithm. We extended the traditional PINQAgent class to distinguish between user-level and event-level private algorithms; PINQ did not have to distinguish between different notions of privacy, so it could have just one implementation to check access. When requesting a differentially private algorithm on the (possibly transformed) data, the StreamingQueryable<T> wrapper object returns a handle to the streaming algorithm for use by the programmer. The supported operations include fetching the output of the algorithm, checking the number of events it has processed, and starting and stopping it. Streaming algorithms are implemented on top of a base class, StreamingAlgorithm<T>. Implementations of streaming algorithms must be trusted to be implemented correctly, as the platform requires that classes that extend StreamingAlgorithm<T> provide the privacy guarantees.

3.1 Streaming PINQ By Example

    1   // The private source of all tweets
    2   var tweetData = new AllTweetsStreamFireHose();
    3
    4   // Get differentially private view of tweetData
    5   var tweets = new StreamingQueryable<Tweet>(
    6       tweetData, new UserLevelPrivacyBudget(1.0));
    7
    8   // Find users discussing #topic
    9   var tweetedTag = tweets
    10      .Where(tweet => tweet.Message.Contains("#tag"))
    11      .Select(tweet => tweet.User)
    12      .UserDensityContinuous(0.5, AllUsers(), 10000);
    13
    14  // Every time a density is made, print it
    15  tweetedTag.OnOutput = (d =>
    16      Console.WriteLine("Percent of users that " +
    17          "tweeted #tag: " + d));
    18
    19  // Process 10000 events and stop
    20  tweetedTag.ProcessEvents(10000);

Figure 2. Example streaming PINQ program.

Figure 2 shows an example streaming PINQ program for a setting similar to the one discussed in Section 1.1.
The program will output an estimate of the fraction of users that have included "#tag" in their tweets every time a tweet is seen, and then stop outputting estimates after 10,000 tweets. Line 2 instantiates a streaming data provider. tweets wraps the private tweet data from the stream with the UserLevelPrivacyBudget agent (discussed in Section 3.3). Lines 9 to 11 transform the data to filter and select only users that are tweeting about a particular topic ("#tag"). Line 12 selects an algorithm that outputs an estimated user density (the fraction of users who have appeared at least once in the stream) and makes an output at every event seen in the stream. The first parameter specifies the intended ε value to use, the second provides the algorithm with the data universe of users (a range of all possible users in the system), and the third parameter gives an upper bound on the number of events that can be processed. Line 15 assigns the output callback to a function which writes the current output to the console. Line 20 blocks the program and waits until 10,000 events have been processed, at which point the streaming algorithm stops listening for events. The next sections describe each component of the streaming PINQ implementation.

3.2 Streams

    public abstract class StreamingDataSource<T>
    {
        Action<T> EventReceived { get; set; }

        StreamingDataSource<T> filter(
            Func<T, bool> predicate);

        StreamingDataSource<U> map<U>(
            Func<T, U> mapper);
    }

Figure 3. Abstract base class for streaming data.

The StreamingDataSource<T> class provides the basis for streaming data. It acts as a wrapper for a C# delegate, with extra functionality provided for filtering and mapping the elements of the stream. Implementers of new data streams can extend this class and invoke the EventReceived delegate when a new event is received. Our implementation includes streaming data source implementations for reading from the console, generating random numbers, and also a functional data stream whose outputs depend on the number of events sent.
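To illustrate how a new provider plugs in, here is a hypothetical data source (not one of the shipped implementations) that pushes pseudo-random integers into the stream by invoking EventReceived. The sketch assumes, as the text above suggests, that filter and map are supplied by the base class.

    // Hypothetical data source: pushes random integers to listeners.
    // (Sketch only; the framework's own random-number source may differ.)
    public class RandomIntSource : StreamingDataSource<int>
    {
        private readonly Random rng = new Random();

        // Emit n events to whatever is subscribed downstream.
        public void Emit(int n)
        {
            for (int i = 0; i < n; i++)
            {
                if (EventReceived != null)
                    EventReceived(rng.Next());
            }
        }
    }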
The filter function behaves like an option monad with null values. In a stream setting, it is possible for an event not to happen, but a value should still be given to the algorithm for that timestep. Dwork et al. referred to these types of events as the "nothing happened" element [3]. This element is encoded as null. So if a null event is received in the filter, it is immediately passed on to the target stream. Otherwise the filter predicate is run; if the predicate matches, the event is passed to the target stream, otherwise null is passed. The same behavior is implemented for map, though the transformed data is passed to the target stream rather than the predicate check.²

² Note that the filter function is just a specialized map function. That is, filter can be implemented as map(input => predicate(input) ? input : null).
3.3 Streaming Agents

    public abstract class PINQStreamingAgent :
        PINQAgent
    {
        bool ApplyEventLevel(double eps);
        double UnapplyEventLevel(double eps);
        bool ApplyUserLevel(double eps);
        double UnapplyUserLevel(double eps);
    }

Figure 4. Streaming agent that distinguishes between event-level and user-level privacy.

Agents are responsible for enforcing that ε-differential privacy is preserved for the stream. Unlike PINQAgent, there are two notions of ε for streaming algorithms: one associated with event-level privacy and another associated with user-level privacy. Since the notion of user-level privacy is based on X-adjacency from Definition 2.4, we can see that event-level privacy is a special case of user-level privacy where the stream length is 1. As a result, if an ε-event-level private algorithm runs for t time steps, it is tε-user-level private. Also note that user-level privacy is a much stronger notion than event-level privacy: if an ε-user-level private algorithm runs for t time steps, it is still just ε-event-level private. Because there is a way to relate these algorithms, they can be used concurrently on the same stream. It should be noted, though, that since user-level private algorithms have a much stronger definition of privacy, running an event-level private algorithm for even a short time has a very dramatic impact on the stream's user-level privacy.

The agent dynamically checks its budget every time an event is received. Since algorithms can attach (start receiving events from the stream) and detach (stop receiving events from the stream) at any time, the agent is notified whenever an algorithm begins listening and when it stops listening. It is the responsibility of the agent to decide what to do when these events occur in order to preserve differential privacy. When an algorithm attaches to the stream, the agent is notified via ApplyEventLevel or ApplyUserLevel, depending on the type of privacy the algorithm respects. The agent can return true, allowing the algorithm to safely listen, or false if the algorithm should not be allowed to listen to events. When an algorithm detaches from the stream, the agent is notified via UnapplyEventLevel or UnapplyUserLevel. The agent returns how much privacy has been returned to the stream for future events.

From the PINQStreamingAgent base class, there are two inheriting classes that enforce either user-level privacy or event-level privacy. In the user-level privacy agent, privacy can never be returned to the stream, since the notion of adjacency extends to the entire stream, even after an algorithm stops listening. In other words, once the algorithm has learned something about a user, it cannot "unlearn" it. On the other hand, when viewing the stream with event-level privacy, intuitively the agent only needs to make sure that at most ε is "learned" for each event. As a result, when an algorithm detaches it allows other algorithms to run afterwards with up to the entire initial privacy budget. The logic encapsulating these changes in ε is encoded in the partially implemented abstract classes PINQEventLevelAgent and PINQUserLevelAgent.

We follow the same pattern as PINQ in establishing a budget that is defined by the data stream owner. The data stream owner provides the maximum amount of ε that can be used and also the type of privacy (event-level or user-level). Any algorithm or combination of algorithms that exceeds that level of ε is stopped. There are budget-based agents for both user-level privacy and event-level privacy, named PINQUserLevelAgentBudget (which extends PINQUserLevelAgent) and PINQEventLevelAgentBudget (which extends PINQEventLevelAgent).

3.4 StreamingQueryable

The StreamingQueryable<T> class is the wrapper around a private stream in much the same way as PINQueryable<T> is in PINQ. It supports transformations on the data and instantiating streaming algorithms on the private data. It also keeps track of active streaming algorithm subscribers to the private stream data. The supported transformation operators are Where, Select, and Partition. The first two transformations simply return a new StreamingQueryable with a filtered or mapped input data stream. The Partition transformation takes a set of possible keys and a key mapping function and produces a dictionary of keys mapping to their designated StreamingQueryable objects. Figure 5 shows an example usage of the Partition transformation. Note that we make use of Theorem 2.3 to enforce that the amount of ε needed is the maximum ε used by the partitioned queryable objects. This ε is enforced in a similar way to PINQ, through a specialized agent that tracks the ε used for each partitioned StreamingQueryable, as described in the previous section. In the case of Figure 5, an ε value of 1.0 on the original wrapper (tweets) would suffice.

    var tweets = new StreamingQueryable<Tweet>(
        tweetData, new EventLevelPrivacyBudget(1.0));

    var tweetsByState = tweets.Partition(
        AllStates(), tweet => tweet.Location.State);

    var countsByState = new Dictionary<
        State, StreamingAlgorithm<State>>();

    foreach (State state in tweetsByState.Keys) {
        countsByState[state] = tweetsByState[state]
            .BinaryCount(1.0, 10000);
        countsByState[state].StartListening();
    }

    Console.WriteLine("By State in last 10k events:");
    foreach (State state in tweetsByState.Keys) {
        // Ensure 10,000 events have been processed
        countsByState[state].ProcessEvents(10000);
        Console.WriteLine(state + ": " +
            countsByState[state].LastOutput);
    }

Figure 5. An example of partitioning disjoint events.
amount of ✏ needed is the maximum ✏ used by the parti- 1 p u b l i c a b s t r a c t c l a s s S t r e a m i n g A l g o r i t h m <T>
tioned queryable objects. This ✏ is enforced in a similar way 23 {
A c t i o n<double> OnOutput { g e t ; s e t ; }
to PINQ through a specialized agent that tracks the used ✏ 4 i n t EventsSeen { get ; protected s e t ; }
for each partitioned StreamingQueryable, as described in 56 bool IsReceivingData { get ; protected s e t ; }

the previous section. In the case of Figure 5, an ✏ value of 7 void ProcessEvents ( i n t n , bool s t o p A f t e r w a r d s ) ;
1.0 on the original wrapper (tweets) would suffice. 8 a b s t r a c t double GetOutput ( ) ;
9 double ? L a s t O u t p u t { g e t ; p r o t e c t e d s e t ; }
A StreamingQueryable object tracks which algorithms 10
are actively receiving events. It is the only part that di- 11 v i r t u a l void S t a r t R e c e i v i n g ( ) ;
12 v i r t u a l void StopReceiving ( ) ;
rectly receives events from the private stream (through the 13
EventReceived delegate described in Section 3.2). 14 abstract void EventReceived (T d a t a ) ;
15 }
Every time an event is received, the StreamingQueryable
object notifies its agent of all event-level algorithms that are
Figure 6. streaming algorithm base class
attached and their corresponding ✏ values. If successful, the
event is passed to every streaming algorithm for processing.
After each algorithm has processed the event, the agent is The StreamingAlgorithm<T> classes provide the mech-
notified that all event-level algorithms have detached. Note anism for streaming differential privacy. The base class pro-
vides functionality to interact with the StreamingQueryable sult, an adversary could use a covert channel like timing to
object to receive events. The base class also has functionality discover information about the private data (e.g. run a very
to get outputs made by the algorithm. This includes a con- long loop when encountering a row of interest).
venience blocking mechanism that waits until a given num- In the cases where a data owner may prefer everyone
ber of events are processed. This method is implemented to use only algorithms that provide a certain property (e.g.
using a Semaphore. Subclasses generally only need to im- pan-privacy), one could imagine a scenario where a data
plement the EventReceived method to process the event. analyst mistakenly uses an algorithm that does not have the
Algorithms differentiate between being user-level private data owner’s desired properties. We leave it for future work
and event-level private by extending the appropriate classes. to extend the system to enforce an algorithm’s properties,
That is, the type is used to differentiate between user-level much in the same way that user-level and event-level privacy
private and event-level private algorithms. is enforced. One could imagine a more granular agent that
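To illustrate the subclassing pattern, here is a toy algorithm (not one of the five implementations described next): it counts events and adds Laplace noise only at output time, so, as discussed in Section 2.3, it is not pan-private because its internal state holds the exact count. A real algorithm would extend the event-level or user-level variant of the base class to declare which guarantee it provides; the class name and members beyond those in Figure 6 are illustrative assumptions.

    // Toy StreamingAlgorithm subclass (illustrative only): a noisy counter.
    // EventReceived tracks the exact count internally, and GetOutput
    // releases it with Laplace(1/ε) noise.
    public class NoisyEventCounter : StreamingAlgorithm<Tweet>
    {
        private readonly Random rng = new Random();
        private readonly double epsilon;
        private int trueCount;                 // exact count => not pan-private

        public NoisyEventCounter(double epsilon) { this.epsilon = epsilon; }

        protected override void EventReceived(Tweet data)
        {
            if (data != null) trueCount++;     // skip "nothing happened" events
        }

        public override double GetOutput()
        {
            double u = rng.NextDouble() - 0.5;
            double noise = -(1.0 / epsilon) * Math.Sign(u) * Math.Log(1 - 2 * Math.Abs(u));
            return trueCount + noise;
        }
    }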
We implemented five algorithms that span various characteristics, summarized in Figure 7:

1. Buffered Average batches the outputs it receives and then invokes PINQ's NoisyAverage on the buffered data when an output is requested.

2. Randomized Response Count performs a randomized response on whether an event was actually seen, so it holds no information about prior events on the stream. As a result, it has no private internal state, so it is pan-private and works on an unbounded number of events.

3. Binary Counter maintains log T non-noisy partial sums (hence why it is not pan-private) and adds noise to each partial sum for output.

4. User Density creates a random sample of candidate users, each included in the count with probability 1/2. When a user that is in the random sample is seen, the probability of being included in the count is re-drawn with probability 1/2 + ε/4. The accuracy depends on how large the initial data universe sample is (its size is computed in terms of α).

5. For the continuous bounded output case of User Density, the general transformation given in Dwork et al. was applied to User Density [3]. This transformation keeps a threshold of when to output a new result based on how much the original algorithm varies. If the algorithm varies frequently, accuracy is lost. For the User Density algorithm, the error after this transformation was calculated as 6α.

For a more detailed description of the algorithms, please see the referenced papers in Figure 7.

3.6 Limitations and Future Work

This framework is designed for non-adversarial users. There are no formal checks on an implemented algorithm's advertised guarantees. For example, an implemented algorithm may be marked as pan-private but the implementation may be buggy and not actually satisfy pan-privacy. Additionally, there are known attacks against PINQ [5] that Streaming PINQ is also susceptible to. For example, the Where method in StreamingQueryable accepts arbitrary C# predicates that can in general have side effects. As a result, an adversary could use a covert channel like timing to discover information about the private data (e.g. run a very long loop when encountering a row of interest).

In cases where a data owner may prefer everyone to use only algorithms that provide a certain property (e.g. pan-privacy), one could imagine a scenario where a data analyst mistakenly uses an algorithm that does not have the data owner's desired properties. We leave it for future work to extend the system to enforce an algorithm's properties, much in the same way that user-level and event-level privacy is enforced. One could imagine a more granular agent that takes in the streaming algorithm object itself and compares it to a whitelist provided by the data owner, or (more ambitiously) even dynamically checks its code to assert its advertised properties.

Another important caveat to streaming differentially private algorithms is that they expose the timing of when events are processed. Although implicit in the literature, one might want to hide the timing of events as it could reveal additional information about the event. For example, in analyzing a stock trading stream, a stock trade being made after normal trading hours may reveal additional information about the trade (e.g. that the trader is likely an institutional trader with special access to the exchange). These leaks could be mitigated by a time-boxed stream that buckets events into windows, or by randomly dispersing events into the stream if order is not important. This mechanism could be easily added to the streams described in Section 3.2, though it is unclear what the formal advantages of this approach are with respect to differential privacy.

This paper does not evaluate the performance and usefulness of the platform on actual data sets. Rather, it presents a design that extends a popular differential privacy programming framework with streaming in a modular way. We incorporated a few of the different notions of streaming differential privacy to show how new definitions of privacy can be added, but our aim was not to be exhaustive. For example, a generalization of event-level privacy called w-event privacy was not implemented in this framework [6]. We leave it as future work to evaluate this framework on real data sets. We hope that this work can serve as a practical benchmarking platform for experimenting with new streaming private algorithms and definitions.

4. Related Work

Fuzz, a programming system developed by Reed and Pierce, implements a language that guarantees differential privacy [9]. That is, any program written in Fuzz is differentially private. This makes it difficult, though, to introduce new differentially private primitives (such as dynamic data) into the system, as it would require modifying the language and compiler. It is not clear to the authors of this paper how Fuzz could be modified to support streaming data.
Algorithm                          | Privacy     | Number of Outputs    | Pan-Private | Additive Error
1. Buffered Average*               | Event-Level | Single               | No          | O(1)
2. Randomized Response Count* [1]  | Event-Level | Continuous Unbounded | Yes         | O(√T)
3. Binary Counter [1]              | Event-Level | Continuous Bounded   | No          | O((log T)^1.5)
4. User Density [4]                | User-Level  | Single               | Yes         | α w.p. 1 − β
5. User Density Continuous [3, 4]  | User-Level  | Continuous Bounded   | Yes         | 6α w.p. 1 − β

Figure 7. Implemented streaming algorithms. Note that ε is removed from accuracy measurements. α and β are user-defined parameters to the algorithms. Algorithms with an asterisk (*) denote known optimal accuracy for their listed properties. Buffered Average simply adds just enough Laplace noise to achieve differential privacy. Randomized Response Count's error matches the theoretical lower bound from Dwork et al.'s negative result [3], given its properties (pan-privacy and continuous observation). Pan-privacy is with respect to just one intrusion.
Airavat, another differentially private programming system, is a MapReduce-based system that uses differentially private composition of computation [10]. The implementation of the system is based on Hadoop. This system is difficult to extend for streaming, as Hadoop has classically been used for batch processing. Streaming data support in Hadoop is currently not well supported and would require much effort to incorporate into the Airavat system.

Streaming PINQ is built as an extension to PINQ, which only supports non-streaming data sets; this work extends it to support streaming data sets. We chose to extend PINQ to incorporate streaming as it allows easy extensibility, but places the proof burden of new mechanisms on the developer. PINQ is also developed as a library written in C#, a general-purpose programming language, which allowed us to easily create a programmatic abstraction for streaming data that could be easily incorporated into the existing PINQ platform.

5. Conclusions

This paper describes an extension to PINQ that supports differentially private streaming algorithms. The platform allows a data analyst to choose the trade-offs in privacy vs. quality of the results. For example, a data analyst might want a very accurate result for user density, but would have to decrease the number of intermediate outputs seen. Also, if a data owner wants to enforce pan-privacy (since he or she may not trust the streaming algorithms to hold the data), then some algorithms would be unusable. We have built the platform to be flexible, allowing data owners and data analysts to decide which algorithms to use based on their needs without understanding the details of the algorithms' implementations. The only requirement the framework makes is that the algorithms provide a form of ε-differential privacy. (If privacy were not needed, a far simpler framework could be used.)

We hope that the platform can serve both as a practical implementation of differentially private streaming algorithms that a data analyst could use, and as a base for implementing and experimenting with new differentially private streaming algorithms.

Acknowledgments

We thank Salil Vadhan, Jonathan Ullman, and Stephen Chong for their comments on an earlier version of this paper and feedback on the project. We are grateful to the reviewers for their helpful comments.

References

[1] T.-H. H. Chan, E. Shi, and D. Song. Private and continual release of statistics. ACM Trans. Inf. Syst. Secur., 14(3):26:1–26:24, Nov. 2011.

[2] C. Dwork. Differential privacy. In ICALP, pages 1–12. Springer, 2006.

[3] C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In Proc. 42nd ACM Symposium on Theory of Computing, STOC '10, pages 715–724, New York, NY, USA, 2010. ACM.

[4] C. Dwork, M. Naor, T. Pitassi, G. N. Rothblum, and S. Yekhanin. Pan-private streaming algorithms. In Proc. ICS, 2010.

[5] A. Haeberlen, B. C. Pierce, and A. Narayan. Differential privacy under fire. In Proc. 20th USENIX Conference on Security, SEC '11, pages 33–33, Berkeley, CA, USA, 2011. USENIX Association.

[6] G. Kellaris, S. Papadopoulos, X. Xiao, and D. Papadias. Differentially private event sequences over infinite streams. PVLDB, 7(12):1155–1166, 2014.

[7] F. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun. ACM, 53(9):89–97, Sept. 2010.

[8] Microsoft. LINQ (Language Integrated Query).

[9] J. Reed and B. C. Pierce. Distance makes the types grow stronger: a calculus for differential privacy. In Proc. 15th ACM SIGPLAN International Conference on Functional Programming, ICFP '10, pages 157–168, New York, NY, USA, 2010. ACM.

[10] I. Roy, S. T. V. Setty, A. Kilzer, V. Shmatikov, and E. Witchel. Airavat: security and privacy for MapReduce. In Proc. 7th NSDI, NSDI '10, pages 20–20, Berkeley, CA, USA, 2010. USENIX Association.

[11] L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine & Ethics, 25(2 & 3):98–110, 1997.
