Lec 8
1 Data Streams
A data stream is a continuous flow of information. Some characteristics that are important to
take into account include:
• data is continuously generated at a fast rate
• data set is too large to store in its entirety
• complete information (e.g size) is unknown
• data can be viewed as infinite
A data stream is visualized in Figure 1. The infinite aspect of the data is such that conven-
tional approaches of extracting information from the data cannot be applied. Yet, essentially
everything is transmitted over the internet. Thus, how do we perform critical calculations
from the stream using limited memory?
Here, Dn is the stream of information in which we observe xt at time t. A common task is to compute some function f (Dt ) of the stream at a given time t.
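As a concrete instance of computing f (Dt ) with limited memory, here is a small sketch (names are my own, not from the lecture) that maintains the running mean of a stream in O(1) space, without ever storing Dt itself:

```python
def stream_mean(stream):
    """Maintain the running mean of a stream in O(1) memory.

    After each element x_t, `mean` equals f(D_t) = (1/t) * (x_1 + ... + x_t).
    """
    mean, t = 0.0, 0
    for x in stream:
        t += 1
        mean += (x - mean) / t  # incremental update; D_t is never stored
        yield mean

# Example: the running mean of 1..5 after each element
list(stream_mean([1, 2, 3, 4, 5]))  # [1.0, 1.5, 2.0, 2.5, 3.0]
```

The same incremental pattern applies to other functions (counts, variances, extrema) that admit a constant-size summary.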
1. n is not defined at the start; thus we cannot guarantee that we will not run out of memory
Our two naive approaches cannot achieve this since their memory is unbounded. However, this can be achieved with reservoir sampling. The algorithm keeps the first s elements of the data stream D. Each subsequent element xt is accepted with probability s/t; when accepted, it replaces a uniformly selected element of the current reservoir sample St−1 . The algorithm is shown in Algorithm 1.
We now must prove our two conditions. First, it is clear that the sample is of size s. In
order to prove that each element in the stream has an equal probability of being selected for
the sample, we can form a proof by induction. Our inductive hypothesis is that after observing t elements, each observed element is in the reservoir with probability s/t.
Basis: The first t = s elements in the data stream are sampled with probability 1 = s/t.
Inductive Step: We must show that both the newly added element and the elements already in the reservoir have selection probability s/t.
• The new element xt is in the sample with probability s/t, as this is the probability that it gets selected.
• To prove that an element x already in the sample stays in the sample, we first evaluate the probability that x remains in the reservoir after the new stream element xt is seen, given that x is already in the reservoir.
P (x stays) = P (xt rejected) + P (xt accepted) P (x not selected)
            = (1 − s/t) + (s/t)((s − 1)/s)
            = (t − 1)/t
We can now find the probability x is in the reservoir sample, given that it has already
been seen in the data stream D.
P (x ∈ St ) = P (x ∈ St−1 ) P (x stays)
            = (s/(t − 1)) ((t − 1)/t)
            = s/t
This shows that each element x in the data stream has probability s/t of being in the sample S and is therefore uniformly selected.
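The s/t guarantee proved above can also be checked empirically. The sketch below (my own experiment, not from the lecture) runs reservoir sampling many times over a short stream and verifies that every element lands in the sample at close to the predicted rate s/n:

```python
import random
from collections import Counter

def reservoir_sample(stream, s):
    """Uniform random sample of size s from a stream (as in Algorithm 1)."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= s:
            reservoir.append(x)
        elif random.random() < s / t:
            reservoir[random.randrange(s)] = x
    return reservoir

# Empirical check: with n = 20 and s = 5, each element should appear
# in the final sample about s/n = 25% of the time.
random.seed(1)
trials = 20_000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(20), 5))
rates = [counts[i] / trials for i in range(20)]
# each entry of `rates` should be close to 0.25
```

The observed rates cluster tightly around 0.25, matching the s/t analysis.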
Some guidelines for sampling from weighted streams were given by Pavlos S. Efraimidis and Paul G. Spirakis in 2006. They suggested the sampling algorithm should follow:
• Observe xi with weight wi
• Draw ri uniformly from [0, 1]
• Set scorei = ri^(1/wi)
In order to maintain a list of the top s scores, we can take advantage of a min-heap of element-score tuples (xi , scorei ), so that we can find and replace the lowest-score tuple in O(log(s)) time. Note that Algorithm 1 enjoys O(1) insertion and deletion, but this should not worry us, as the heap cost is only logarithmic and s does not have to be large.
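Putting the score rule and the heap together, a minimal sketch of this weighted scheme (my own rendering, using Python's `heapq` for the min-heap) might look like:

```python
import heapq
import random

def weighted_reservoir_sample(stream, s):
    """Efraimidis-Spirakis weighted sampling: keep the s largest scores.

    `stream` yields (x_i, w_i) pairs with w_i > 0. Each element's chance
    of surviving grows with its weight.
    """
    heap = []  # min-heap of (score_i, x_i); the root holds the lowest score
    for x, w in stream:
        score = random.random() ** (1.0 / w)  # score_i = r_i^(1/w_i)
        if len(heap) < s:
            heapq.heappush(heap, (score, x))
        elif score > heap[0][0]:
            # new score beats the current minimum: O(log s) replacement
            heapq.heapreplace(heap, (score, x))
    return [x for _, x in heap]
```

With all weights equal, this reduces to a uniform sample; a heavily weighted element is almost certain to survive, since its score ri^(1/wi) is pushed toward 1.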