Lucas Waye
Harvard University
[email protected]
data analyst can pick the algorithm suitable to their particular needs. The only requirement that we make on algorithms is that they characterize their behaviors according to the above properties (i.e. event-level vs. user-level privacy, output behavior, and level of pan-privacy).

    // Every time a density is made, print it
    tweetedTag.OnOutput = (d =>
        Console.WriteLine("Percent of users that " +
            "tweeted #tag: " + d));

    // Process 10000 events and stop
    tweetedTag.ProcessEvents(10000);
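One way to make such a characterization machine-readable is to attach the three properties to each algorithm as metadata that an analyst can filter on. The Python sketch below is purely illustrative: the class names, the property values assigned to the two example algorithms, and the pick helper are assumptions, not part of the framework's actual API.

```python
from dataclasses import dataclass
from enum import Enum

class PrivacyUnit(Enum):
    EVENT_LEVEL = "event-level"
    USER_LEVEL = "user-level"

class OutputBehavior(Enum):
    SINGLE = "single output"
    CONTINUOUS = "continuous observation"

@dataclass(frozen=True)
class AlgorithmProperties:
    name: str
    unit: PrivacyUnit
    output: OutputBehavior
    pan_private: bool  # internal state reveals nothing under intrusion

# Illustrative catalog; the property assignments are assumptions.
CATALOG = [
    AlgorithmProperties("Buffered Average", PrivacyUnit.EVENT_LEVEL,
                        OutputBehavior.SINGLE, pan_private=False),
    AlgorithmProperties("Randomized Response Count", PrivacyUnit.EVENT_LEVEL,
                        OutputBehavior.CONTINUOUS, pan_private=True),
]

def pick(catalog, *, pan_private=None, unit=None):
    """Return the algorithms matching the requested properties."""
    return [a for a in catalog
            if (pan_private is None or a.pan_private == pan_private)
            and (unit is None or a.unit == unit)]
```

A data owner or analyst could then select, say, only pan-private algorithms with `pick(CATALOG, pan_private=True)`.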
the previous section. In the case of Figure 5, an ε value of 1.0 on the original wrapper (tweets) would suffice.

A StreamingQueryable object tracks which algorithms are actively receiving events. It is the only part that directly receives events from the private stream (through the EventReceived delegate described in Section 3.2).

    void ProcessEvents(int n, bool stopAfterwards);
    abstract double GetOutput();
    double? LastOutput { get; protected set; }

    virtual void StartReceiving();
    virtual void StopReceiving();

    abstract void EventReceived(T data);
    }

Figure 6. Streaming algorithm base class

Every time an event is received, the StreamingQueryable object notifies its agent of all event-level algorithms that are attached and their corresponding ε values. If successful, the event is passed to every streaming algorithm for processing.
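This per-event flow can be sketched as follows. It is a simplified Python model: the StreamingQueryable and agent names come from the text, but the method signatures, the agent's additive budget rule, and the CountingAlgorithm class are assumptions made for illustration.

```python
class Agent:
    """Toy budget agent: approves an event only if the combined
    event-level epsilon charge fits in the remaining budget."""
    def __init__(self, budget):
        self.remaining = budget

    def approve(self, epsilons):
        cost = sum(epsilons)
        if cost > self.remaining:
            return False
        self.remaining -= cost
        return True

class StreamingQueryable:
    """Tracks attached algorithms; the only component that
    receives events from the private stream."""
    def __init__(self, agent):
        self.agent = agent
        self.algorithms = []

    def attach(self, algorithm):
        self.algorithms.append(algorithm)

    def event_received(self, data):
        # Notify the agent of every attached event-level algorithm
        # and its epsilon before releasing the event to any of them.
        event_level = [a for a in self.algorithms if a.event_level]
        if not self.agent.approve([a.epsilon for a in event_level]):
            return False
        for a in self.algorithms:
            a.event_received(data)
        # After processing, event-level algorithms are treated as
        # detached for this event; per-event charging resumes next time.
        return True

class CountingAlgorithm:
    """Minimal stand-in for a streaming algorithm subclass."""
    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.event_level = True
        self.count = 0

    def event_received(self, data):
        self.count += 1

agent = Agent(budget=1.0)
queryable = StreamingQueryable(agent)
counter = CountingAlgorithm(epsilon=0.4)
queryable.attach(counter)
```

With a budget of 1.0 and a per-event charge of 0.4, this toy agent approves the first two events and rejects the third; the real framework's accounting may differ.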
After each algorithm has processed the event, the agent is notified that all event-level algorithms have detached.

The StreamingAlgorithm<T> classes provide the mechanism for streaming differential privacy. The base class provides functionality to interact with the StreamingQueryable object to receive events. The base class also has functionality to get outputs made by the algorithm. This includes a convenience blocking mechanism that waits until a given number of events have been processed; this method is implemented using a Semaphore. Subclasses generally only need to implement the EventReceived method to process the event.

Algorithms differentiate between being user-level private and event-level private by extending the appropriate classes. That is, the type is used to distinguish user-level private algorithms from event-level private ones.

We implemented five algorithms that span various characteristics, summarized in Figure 7:

1. Buffered Average batches the outputs it receives and then invokes PINQ's NoisyAverage on the buffered data when an output is requested.

2. Randomized Response Count performs a randomized response on whether an event was actually seen, so it holds no information about prior events on the stream. As a result, it has no private internal state, so it is pan-private and works on an unbounded number of events.

3. Binary Counter maintains log T non-noisy partial sums (which is why it is not pan-private) and adds noise to each partial sum for output.

4. User Density creates a random sample of candidate users, each included in the count with probability 1/2. When a user in the random sample is seen, the probability of being included in the count is re-drawn as 1/2 + ε/4. The accuracy depends on how large the initial data universe sample is (its size is computed in terms of α).

5. For the continuous bounded output case of User Density, the general transformation given in Dwork et al. was applied to User Density [3]. This transformation keeps a threshold for when to output a new result based on how much the original algorithm varies. If the algorithm varies frequently, accuracy is lost. For the User Density algorithm, the error after this transformation was calculated as 6α.

For a more detailed description of the algorithms, please see the referenced papers in Figure 7.

An adversary could use a covert channel such as timing to discover information about the private data (e.g. run a very long loop when encountering a row of interest). In cases where a data owner may prefer that everyone use only algorithms that provide a certain property (e.g. pan-privacy), one could imagine a scenario where a data analyst mistakenly uses an algorithm that does not have the data owner's desired properties. We leave it for future work to extend the system to enforce an algorithm's properties, much in the same way that user-level and event-level privacy is enforced. One could imagine a more granular agent that takes in the streaming algorithm object itself and compares it to a whitelist provided by the data owner, or (more ambitiously) even dynamically checks its code to assert the algorithm's advertised properties.

Another important caveat of streaming differentially private algorithms is that they expose the timing of when events are processed. Although implicit in the literature, one might want to hide the timing of events, as it can reveal additional information about an event. For example, in analyzing a stock trading stream, a trade made after normal trading hours may reveal additional information about the trade (e.g. that the trader is likely an institutional trader with special access to the exchange). These leaks could be mitigated by a time-boxed stream that buckets events into windows, or by randomly dispersing events into the stream if order is not important. This mechanism could easily be added to the streams described in Section 3.2, though it is unclear what formal advantages this approach offers with respect to differential privacy.

This paper does not evaluate the performance and usefulness of the platform on actual data sets. Rather, it presents a design that extends a popular differential privacy programming framework with streaming in a modular way. We incorporated a few of the different notions of streaming differential privacy to show how new definitions of privacy can be added, but our aim was not to be exhaustive. For example, a generalization of event-level privacy called w-event privacy was not implemented in this framework [6]. We leave it as future work to evaluate this framework on real data sets. We hope that this work can serve as a practical benchmarking platform for experimenting with new streaming private algorithms and definitions.
Figure 7. Implemented streaming algorithms. Note that ε is removed from accuracy measurements. α and β are user-defined parameters to the algorithm. Algorithms marked with an asterisk (*) have known optimal accuracy for their listed properties. Buffered Average simply adds just enough Laplace noise to achieve differential privacy. Randomized Response Count's error matches the theoretical lower bound from Dwork et al.'s negative result [3], given its properties (pan-privacy and continuous observation). Pan-privacy is with respect to just one intrusion.
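To make the table's Randomized Response Count entry concrete, the sketch below implements the textbook randomized-response estimator in Python: each event is recorded only as a randomized bit, so there is no private internal state for an intruder to inspect. The response probability e^ε/(1+e^ε) and the unbiasing step are the standard construction and may differ from the exact variant implemented in the framework.

```python
import math
import random

def randomized_response_count(events, predicate, eps, rng):
    """Estimate how many events satisfy `predicate`, keeping only
    randomized bits (classic randomized response)."""
    p = math.exp(eps) / (1.0 + math.exp(eps))  # prob. of a truthful report
    n = 0
    reported = 0
    for e in events:
        truth = 1 if predicate(e) else 0
        # Report the true bit with probability p, its flip otherwise.
        bit = truth if rng.random() < p else 1 - truth
        n += 1
        reported += bit
    if n == 0:
        return 0.0
    # Unbias: E[reported] = count * p + (n - count) * (1 - p)
    return (reported - n * (1.0 - p)) / (2.0 * p - 1.0)

rng = random.Random(0)
events = [1] * 3000 + [0] * 7000
estimate = randomized_response_count(events, lambda e: e == 1, eps=1.0, rng=rng)
```

Because only the randomized bits are retained, the estimator's error grows with the square root of the stream length, consistent with the lower bound cited in the caption.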