
COMP 480/580 — Probabilistic Algorithms and Data Structures Sept 15, 2022

Lecture 8: Introduction to Stream Computing and Reservoir Sampling


Lecturer: Anshumali Shrivastava, Zhaozhuo Xu
Scribe By: David Zhao, Meng-Chi Yu, Kristen Curry

1 Data Streams
A data stream is a continuous flow of information. Some characteristics that are important to
take into account include:
• data is continuously generated at a fast rate
• data set is too large to store in its entirety
• complete information (e.g. size) is unknown
• data can be viewed as infinite

Figure 1: Continuous data stream visualization [1].

A data stream is visualized in Figure 1. Because the data is effectively infinite, conventional approaches to extracting information from it cannot be applied. Yet essentially everything transmitted over the internet takes this form. Thus, how do we perform critical calculations over the stream using only limited memory?

1.1 Examples and applications


Some examples of data streams on the internet include Google queries and Twitter feeds. Specifically, Google may want to extract information about queries made today as opposed to yesterday, and Twitter may want current information on which topics are trending. Data streams also arise in other areas of technology, such as sensor networks, telephone call records, and IP packets monitored at a switch. One sensor example is an air conditioner: set to maintain a given temperature, it takes the current temperature as a constant stream of information and acts accordingly.

1.2 Formulation: one pass model


The main complication with data streams is that we cannot store all of the information. Thus, we make a single pass over the stream, keeping only what is desired from the data. Once the data has passed, it can no longer be recovered. This problem is formulated as a one pass model, where n, the amount of data, is unknown:

D_n = x_1, x_2, ..., x_n    (1)

Here, D_n is the stream of information, where we observe x_t at time t. A common scenario is to compute some function f(D_t) of the stream observed up to a given time t.

1.2.1 One pass model example


Let’s say we want to find the average of the stream at a given time t. We can achieve this by storing two values: the sum of seen values s_t and the current count c_t. We can then find the average at any time t as f(t) = s_t / c_t using O(1) memory.
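As an illustration, here is a minimal Python sketch of this constant-memory average (the class and variable names are ours, not the lecture's):

class RunningMean:
    """Average of a stream using O(1) memory: a running sum and a count."""

    def __init__(self):
        self.total = 0.0  # s_t: sum of all values seen so far
        self.count = 0    # c_t: number of values seen so far

    def observe(self, x):
        self.total += x
        self.count += 1

    def mean(self):
        return self.total / self.count

stream = [4.0, 7.0, 1.0, 8.0]  # stands in for an unbounded stream
rm = RunningMean()
for x in stream:
    rm.observe(x)
    print(rm.mean())  # average of everything seen so far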

2 Data stream sampling


We can also estimate values by extracting and storing a representative sample of the data
stream, then performing our function on just the subset. The question then becomes: how do
we take an unbiased representative sample of the data stream?

2.1 Naive sampling method


A naive approach to getting a representative sample is to keep every j-th element. For example, say we have n elements, yet our memory only allows storage for m elements. We could then store every j = n/m-th element (a sketch of this sampler appears after the list below). This has two issues:

1. n is not defined at the start, thus we cannot guarantee we will not run out of memory

2. this structured method of sampling can cause bias
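A minimal sketch of the fixed-stride sampler (the function name is ours; note that choosing j requires knowing n up front, which is exactly issue 1):

def every_jth(stream, j):
    """Keep every j-th element of the stream (naive, structured sampling)."""
    sample = []
    for i, x in enumerate(stream, start=1):
        if i % j == 0:
            sample.append(x)
    return sample

print(every_jth(range(1, 21), j=5))  # [5, 10, 15, 20]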

2.2 Naive random sampling method


In an attempt to combat the bias issue, we can instead randomly sample a fraction m/n of the elements by generating a random number from 1 to n/m for each element. If the selected number is 1, we keep the element in the sample. However, we can demonstrate that a bias still occurs through the following example:
Say we have the following data set:
• U = number of unique elements
• 2D = number of pairwise duplicate elements
• N = U + 2D = total number of elements in the set
• α = 2D/(U + 2D) = true fraction of duplicates
Now we want to estimate α, the fraction of duplicates in our data set, by counting the numbers of duplicate and unique elements in the random sample and applying the formula accordingly. An element only counts as a duplicate if both of its identical copies land in the random sample, which is unlikely. Thus, we would expect to underestimate the value of α, meaning a bias still occurs.
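A quick Monte Carlo check of this bias (a sketch under illustrative parameters; the estimator counts elements that appear twice in the sample):

import random

def estimate_alpha(sample):
    """Estimated fraction of duplicate elements in a sample."""
    seen, repeats = set(), 0
    for x in sample:
        if x in seen:
            repeats += 1  # second copy of a pair observed
        else:
            seen.add(x)
    return 2 * repeats / len(sample)  # each repeat means 2 duplicate elements

U, D = 1000, 500                  # 1000 unique values, 500 duplicated pairs
stream = list(range(U))           # the unique elements
for i in range(D):
    stream += [U + i, U + i]      # each duplicate value appears twice
true_alpha = 2 * D / (U + 2 * D)  # 0.5

random.seed(0)
keep = 0.1                        # keep each element with probability m/n
sample = [x for x in stream if random.random() < keep]
print(true_alpha, estimate_alpha(sample))  # the estimate comes out far too low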

2.3 Reservoir sampling


Our goal of obtaining an unbiased sample can be restated as two goals:
1. exactly s elements are in the sample
2. every element x_t seen so far has probability s/t of being in the sample

Our two naive approaches cannot achieve this, since their memory use is unbounded when n is unknown. However, it can be achieved with reservoir sampling. The algorithm keeps the first s elements of the data stream D. Each subsequent element x_t, with probability s/t, uniformly selects an element of the current reservoir sample S_{t-1} to replace. The algorithm is shown in Algorithm 1.

Algorithm 1 Reservoir Sampling

Input: x_t ← element observed at time t, s ← desired sample size, S_{t-1} ← uniform sample at time t − 1
Output: S_t, a uniform sample at time t
if t ≤ s then
    S_t ← S_{t-1} ∪ {x_t}
else
    with probability s/t do
        v ← element selected uniformly at random from S_{t-1}
        S_t ← (S_{t-1} − {v}) ∪ {x_t}
    otherwise
        S_t ← S_{t-1}
    end
end if
return S_t
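A runnable Python version of Algorithm 1 (a sketch; the function name and loop structure are ours):

import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform sample of size s over a stream of unknown length."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= s:
            reservoir.append(x)        # keep the first s elements outright
        elif rng.random() < s / t:     # accept x_t with probability s/t
            victim = rng.randrange(s)  # evict a uniformly chosen element
            reservoir[victim] = x
    return reservoir

print(reservoir_sample(range(10000), s=5))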

We now prove the two conditions. First, it is clear that the sample has size s. To prove that each element of the stream has an equal probability of being in the sample, we use induction. Our inductive hypothesis is that after observing t elements, each element is in the reservoir with probability s/t.
Basis: Each of the first t = s elements in the data stream is sampled with probability 1 = s/t.
Inductive Step: We must show that both the newly arrived element and the elements already in the reservoir have selection probability s/t.
• The new element x_t is in the sample with probability s/t, as this is exactly the probability with which it is selected.
• For an element x already in the sample, we first evaluate the probability that x stays in the reservoir after the new stream element x_t is seen, given that x is already in the reservoir:

P(x stays) = P(x_t rejected) + P(x_t accepted) · P(x not selected)
           = (1 − s/t) + (s/t) · ((s − 1)/s)
           = (t − 1)/t
We can now find the probability that x is in the reservoir sample, given that it has already been seen in the data stream D:

P(x ∈ S_t) = P(x ∈ S_{t-1}) · P(x stays)
           = (s/(t − 1)) · ((t − 1)/t)
           = s/t

This shows that each element x in the data stream has probability s/t of being in the sample S_t and is therefore uniformly sampled.
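The s/t guarantee is easy to check empirically; a sketch reusing the reservoir_sample function defined above:

from collections import Counter
import random

rng = random.Random(42)
counts = Counter()
trials, n, s = 20000, 100, 10
for _ in range(trials):
    for x in reservoir_sample(range(n), s, rng):
        counts[x] += 1

# Every element should appear in roughly s/n = 10% of the trials.
print(min(counts.values()) / trials, max(counts.values()) / trials)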

2.4 Weighted reservoir sampling


Weighted reservoir sampling is a generalized version of reservoir sampling. Now suppose each element of the data stream has a weight corresponding to how important it is that the element be kept in the sample. As before, we would like to sample elements from the stream such that at time n we always have s elements sampled. However, we now base the sampling probability on the elements' weights, so that each element x_i is in the sample with probability

P[x_i being sampled] = w_i / Σ_{j=1}^{n} w_j    (2)

An algorithm along these lines was given by Pavlos S. Efraimidis and Paul G. Spirakis in 2006. They suggested the sampling should proceed as follows:

• Observe x_i

• Generate r_i uniformly in [0, 1]

• Set score_i = r_i^{1/w_i}

• Keep the s highest-scoring elements in the sampling pool

In order to maintain the list of top s scores, we can use a min-heap of element-score tuples (x_i, score_i), so that we can find and replace the lowest-scoring tuple in O(log s) time. Note that Algorithm 1 enjoys O(1) insertion and deletion, but this is nothing to worry about: the overhead is only logarithmic, and s does not have to be large.
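A sketch of this weighted scheme using Python's heapq as the min-heap (the function name and the example weights are ours):

import heapq
import random

def weighted_reservoir_sample(stream, s, rng=random):
    """Efraimidis-Spirakis sampling: keep the s items with the highest
    score r ** (1 / w), using a min-heap keyed on the score."""
    heap = []  # (score, element) tuples; heap[0] holds the lowest score
    for x, w in stream:
        score = rng.random() ** (1.0 / w)
        if len(heap) < s:
            heapq.heappush(heap, (score, x))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, x))  # evict the lowest score
    return [x for _, x in heap]

# Higher weight makes an element more likely to survive in the sample.
items = [("a", 1.0), ("b", 10.0), ("c", 1.0), ("d", 5.0)]
print(weighted_reservoir_sample(items, s=2))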

References
[1] What is Data Stream: Definition — Blog OnAudience.com, April 2019.
