Lec 8
1 Data Streams
A data stream is a continuous flow of information. Some characteristics that are important to
take into account include:
• data is continuously generated at a fast rate
• data set is too large to store in its entirety
• complete information (e.g size) is unknown
• data can be viewed as infinite
A data stream is visualized in Figure 1. The infinite aspect of the data is such that conven-
tional approaches of extracting information from the data cannot be applied. Yet, essentially
everything is transmitted over the internet. Thus, how do we perform critical calculations
from the stream using limited memory?
Here, Dn is the stream of information in which we observe xt at time t. A common task is to compute some function f (Dt ) of the stream at a given time t.
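As a concrete instance of computing f (Dt ) with limited memory, here is a small sketch (names are my own, not from the lecture) that maintains the running mean of a stream in O(1) space, without ever storing Dt itself:

```python
def stream_mean(stream):
    """Maintain the running mean of a stream in O(1) memory.

    After each element x_t, `mean` equals f(D_t) = (1/t) * (x_1 + ... + x_t).
    """
    mean, t = 0.0, 0
    for x in stream:
        t += 1
        mean += (x - mean) / t  # incremental update; D_t is never stored
        yield mean

# Example: the running mean of 1..5 after each element
list(stream_mean([1, 2, 3, 4, 5]))  # [1.0, 1.5, 2.0, 2.5, 3.0]
```

The same incremental pattern applies to other functions (counts, variances, extrema) that admit a constant-size summary.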
1. n is not defined at the start; thus we cannot guarantee that we will not run out of memory
Our two naive approaches cannot achieve this since their memory is unbounded. However, this can be achieved with reservoir sampling. The algorithm keeps the first s elements of the data stream D. Each subsequent element xt is accepted with probability s/t; when accepted, it replaces a uniformly selected element of the current reservoir sample St−1 . The algorithm is shown in Algorithm 1.
We now must prove our two conditions. First, it is clear that the sample is of size s. In
order to prove that each element in the stream has an equal probability of being selected for
the sample, we can form a proof by induction. Our inductive hypothesis is that after observing t elements, each observed element is in the reservoir with probability s/t.
Basis: The first t = s elements in the data stream are sampled with probability 1 = s/t.
Inductive Step: We must show that both the newly added element and the elements already in the reservoir have selection probability s/t.
• The new element xt is in the sample with probability s/t, as this is the probability that it gets selected.
• To prove that an element x already in the sample stays in the sample, we first evaluate the probability that x remains in the reservoir after the new stream element xt is seen, given that x is already in the reservoir.
P (x stays) = P (xt rejected) + P (xt accepted) P (x not selected)
            = (1 − s/t) + (s/t)((s − 1)/s)
            = (t − 1)/t
We can now find the probability x is in the reservoir sample, given that it has already
been seen in the data stream D.
P (x ∈ St ) = P (x ∈ St−1 ) P (x stays)
            = (s/(t − 1)) ((t − 1)/t)
            = s/t
This shows that each element x in the data stream has probability s/t of being in the sample S and is therefore uniformly selected.
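The s/t guarantee proved above can also be checked empirically. The sketch below (my own experiment, not from the lecture) runs reservoir sampling many times over a short stream and verifies that every element lands in the sample at close to the predicted rate s/n:

```python
import random
from collections import Counter

def reservoir_sample(stream, s):
    """Uniform random sample of size s from a stream (as in Algorithm 1)."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= s:
            reservoir.append(x)
        elif random.random() < s / t:
            reservoir[random.randrange(s)] = x
    return reservoir

# Empirical check: with n = 20 and s = 5, each element should appear
# in the final sample about s/n = 25% of the time.
random.seed(1)
trials = 20_000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(20), 5))
rates = [counts[i] / trials for i in range(20)]
# each entry of `rates` should be close to 0.25
```

The observed rates cluster tightly around 0.25, matching the s/t analysis.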
Some guidelines for sampling from weighted streams were given by Pavlos S. Efraimidis and Paul G. Spirakis in 2006. They suggested the sampling algorithm should follow:
• Observe xi with weight wi
• Draw ri uniformly from [0, 1]
• Set scorei = ri^(1/wi)
In order to maintain a list of the top s scores, we can take advantage of a min-heap of element-score tuples (xi , scorei ), so that we can find and replace the lowest-score tuple in O(log(s)) time. Note that Algorithm 1 enjoys O(1) insertion and deletion, but this should not worry us, as the heap cost is only logarithmic and s does not have to be large.
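Putting the score rule and the heap together, a minimal sketch of this weighted scheme (my own rendering, using Python's `heapq` for the min-heap) might look like:

```python
import heapq
import random

def weighted_reservoir_sample(stream, s):
    """Efraimidis-Spirakis weighted sampling: keep the s largest scores.

    `stream` yields (x_i, w_i) pairs with w_i > 0. Each element's chance
    of surviving grows with its weight.
    """
    heap = []  # min-heap of (score_i, x_i); the root holds the lowest score
    for x, w in stream:
        score = random.random() ** (1.0 / w)  # score_i = r_i^(1/w_i)
        if len(heap) < s:
            heapq.heappush(heap, (score, x))
        elif score > heap[0][0]:
            # new score beats the current minimum: O(log s) replacement
            heapq.heapreplace(heap, (score, x))
    return [x for _, x in heap]
```

With all weights equal, this reduces to a uniform sample; a heavily weighted element is almost certain to survive, since its score ri^(1/wi) is pushed toward 1.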