Mod2_Data_Streams
(Part 1)
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff
Ullman Stanford University
http://www.mmds.org
Contents
◼Introduction to Data Streams
◼ Examples of Streams
◼ The Stream Model
◼ Filtering Streams: The Bloom Filter
◼ The Count-Distinct Problem
- The Flajolet-Martin Algorithm
◼ Estimating Moments: AMS Algorithm
◼ Higher-Order Moments
◼ Querying on Windows
▪ Counting Ones
▪ DGIM algorithm
[Figure: course overview table; recoverable entries: dimensionality reduction, spam detection, web advertising, Perceptron and kNN, duplicate document detection]
◼Sensor data
◼Image Data
◼Internet and Web Traffic
[Figure: The stream model. Several streams enter the processor over time, e.g. ". . . 1, 5, 2, 7, 0, 9, 3", ". . . a, r, v, t, y, h, b" and ". . . 0, 0, 1, 0, 1, 1, 0"; each stream is composed of elements/tuples. The processor holds standing queries, produces output, and has access to limited working storage and archival storage.]
2) Standing queries
Example: The stream produced by the ocean-surface-
temperature sensor might have a standing query to output an
alert whenever the temperature exceeds 25 degrees centigrade.
This query is easily answered, since it depends only on the most
recent stream element.
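A minimal Python sketch of this standing query, assuming the sensor readings arrive as an iterable of numbers. The function name `temperature_alerts` and the sample readings are illustrative and not part of the slides.

```python
from typing import Iterable, Iterator

THRESHOLD_C = 25.0  # alert threshold in degrees centigrade, as in the example


def temperature_alerts(readings: Iterable[float],
                       threshold: float = THRESHOLD_C) -> Iterator[str]:
    """Yield an alert for every reading above the threshold."""
    for t, value in enumerate(readings, start=1):
        # Only the most recent element is inspected; no history is stored.
        if value > threshold:
            yield f"t={t}: temperature {value} exceeds {threshold}"


# Hypothetical readings, used only to exercise the query.
alerts = list(temperature_alerts([24.1, 24.9, 25.3, 24.8, 26.0]))
for line in alerts:
    print(line)
```

Because the query looks at one element at a time, it needs constant working storage regardless of the stream length.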
(1) Filtering Data Streams
Filtering Data Streams
Each element of the data stream is a tuple.
Given a list of keys S,
determine which tuples of the stream are in S.
Bloom Filters – Introduction
Example: creating a Gmail account (checking whether a username is already taken).
◼ A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
◼ The price we pay for this efficiency is that the structure is probabilistic: it may return some false positive results.
◼ A false positive means the filter may report that a given username is already taken when it actually is not.
The Bloom Filter
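A Bloom filter consists of:
▪ An array of n bits, initially all 0
▪ A collection of hash functions h1, h2, ..., hk, each mapping a key to one of the n bit positions
▪ A set S of m key values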
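A minimal Python sketch of a Bloom filter, assuming the usual layout of a bit array plus k hash functions. The class name `BloomFilter`, the use of salted SHA-256 to simulate k independent hash functions, and the sample usernames are assumptions made for illustration, not the construction used in class.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits: int, n_hashes: int):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Simulate k hash functions by salting one hash with the index i.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(key))


taken = BloomFilter(n_bits=1000, n_hashes=3)
for username in ["alice", "bob"]:
    taken.add(username)

print(taken.might_contain("alice"))  # a stored key is never reported absent
print(taken.might_contain("carol"))  # usually False, but may be a false positive
```

The filter never produces a false negative: every key that was added sets all of its bit positions, so a later lookup for that key always succeeds.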
Examples
◼Refer to the numerical examples worked in class.
Applications
Formal Definition
Using Small Storage
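Flajolet-Martin Algorithm
◼ Pseudocode (stepwise solution):
1. Choose a hash function h(x).
2. For each stream element x, compute h(x) and write it in binary as binary_value.
3. r(x) = number of trailing zeros in binary_value
4. R = max(r(x))
5. Estimated number of distinct elements = 2^R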
Example
Consider stream, x=[ 1,5,10,5,15,1 ]
h(x) = x mod 11
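The example can be traced with a short Python sketch of the Flajolet-Martin estimate, assuming the standard form of the algorithm (count trailing zeros of each hash value, take the maximum R, report 2^R). The helper names are illustrative, and treating a hash value of 0 as having zero trailing zeros is a convention assumed here.

```python
def trailing_zeros(value: int) -> int:
    """Number of trailing zero bits in the binary form of value."""
    if value == 0:
        return 0  # assumed convention for a hash value of 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


def flajolet_martin(stream, h) -> int:
    """Estimate the number of distinct elements as 2**R."""
    R = 0
    for x in stream:
        R = max(R, trailing_zeros(h(x)))
    return 2 ** R


stream = [1, 5, 10, 5, 15, 1]
for x in stream:
    print(x, x % 11, bin(x % 11), trailing_zeros(x % 11))

estimate = flajolet_martin(stream, lambda x: x % 11)
print(estimate)  # 4: the largest tail is r(15) = 2, since h(15) = 4 = 100 in binary
```

Here the estimate happens to equal the true number of distinct elements (1, 5, 10, 15); in general a single hash function gives only a rough power-of-two estimate.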
Example (2)
S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
h(x) = (6x + 1) mod 5
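A worked trace of this example, under two assumptions that the slide does not spell out: the hash is read as (6x + 1) mod 5, and a hash value of 0 is given tail length r = 0.

```latex
\begin{array}{c|c|c|c}
x & h(x) = (6x+1) \bmod 5 & \text{binary} & r(x) \\ \hline
1 & 7 \bmod 5 = 2  & 010 & 1 \\
2 & 13 \bmod 5 = 3 & 011 & 0 \\
3 & 19 \bmod 5 = 4 & 100 & 2 \\
4 & 25 \bmod 5 = 0 & 000 & 0
\end{array}
\qquad
R = \max_x r(x) = 2, \qquad \text{estimate} = 2^{R} = 4
```

The stream contains the four distinct values 1, 2, 3 and 4, so the estimate matches the true count in this case.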
(3) Computing Moments
Generalization: Moments
◼Suppose a stream has elements chosen
from a set A of N values
◼Let $m_i$ be the number of times value i occurs in the stream
◼The k-th moment is $\sum_{i \in A} (m_i)^k$
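For a stream small enough to count exactly, the k-th moment can be computed directly from the item counts. The sketch below is a baseline for checking estimates, not a streaming algorithm; the function name `kth_moment` is illustrative, and the sample stream is the one used later in the expectation analysis.

```python
from collections import Counter


def kth_moment(stream, k: int) -> int:
    """Exact k-th moment: the sum over items i of (m_i)**k."""
    counts = Counter(stream)  # m_i for every item i that occurs
    return sum(m ** k for m in counts.values())


stream = list("aabbbaba")     # m_a = 4, m_b = 4
print(kth_moment(stream, 0))  # 2  (number of distinct items)
print(kth_moment(stream, 1))  # 8  (length of the stream)
print(kth_moment(stream, 2))  # 32 (second moment)
```

The 0th and 1st moments recover familiar quantities, which is why the count-distinct problem is a special case of moment estimation.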
AMS Method
◼AMS method works for all moments
◼Gives an unbiased estimate
◼We will just concentrate on the 2nd moment S
◼We pick and keep track of many variables X:
▪ For each variable X we store X.el and X.val
▪ X.el corresponds to the item i
▪ X.val corresponds to the count of item i
▪ Note this requires a count in main memory,
so number of Xs is limited
◼Our goal is to compute $S = \sum_i m_i^2$
◼ AMS:
◼ Assume that at “random” we pick the 3rd, 8th, and
13th positions
◼ Calculate X1, X2, X3
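A minimal Python sketch of the AMS bookkeeping for the 2nd moment, assuming the stream is a finite list of known length n and that each variable X contributes the estimate n(2·X.val − 1), as in the standard AMS presentation. The function name and the chosen positions are illustrative; the stream is the short "a a b b b a b a" stream from the expectation analysis, not the one the 3rd, 8th and 13th positions refer to.

```python
def ams_second_moment(stream, positions):
    """Average of n * (2 * X.val - 1) over variables started at `positions`.

    `positions` are 0-based indices into the stream.
    """
    n = len(stream)
    chosen = set(positions)
    variables = []  # each X is a dict with keys "el" and "val"
    for t, item in enumerate(stream):
        for X in variables:
            if X["el"] == item:
                X["val"] += 1          # one more occurrence of X.el seen
        if t in chosen:
            variables.append({"el": item, "val": 1})  # start counting here
    estimates = [n * (2 * X["val"] - 1) for X in variables]
    return sum(estimates) / len(estimates)


stream = list("aabbbaba")
print(ams_second_moment(stream, [0, 2, 5]))   # estimate from three variables
print(ams_second_moment(stream, range(8)))    # 32.0, the exact second moment
```

Starting a variable at every position averages the estimate over all choices of t, which reproduces the true second moment; with only a few variables the result is an unbiased but noisy estimate.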
Expectation Analysis
Stream: a a b b b a b a (the occurrences of a are counted 1, 2, 3, ..., $m_a$)
◼2nd moment is $S = \sum_i m_i^2$
◼$c_t$ … number of times the item at time t appears
from time t onwards ($c_1 = m_a$, $c_2 = m_a - 1$, $c_3 = m_b$)
◼$m_i$ … total count of item i in the stream
(we are assuming the stream has length n)
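The expectation that these definitions set up can be written out as follows. This is a sketch that assumes the estimate attached to a variable is $f(X) = n(2c - 1)$ with $c$ = X.val, and that the starting time t is chosen uniformly at random from the n positions.

```latex
E[f(X)] = \frac{1}{n} \sum_{t=1}^{n} n\,(2c_t - 1)
        = \sum_{i} \bigl(1 + 3 + 5 + \dots + (2m_i - 1)\bigr)
        = \sum_{i} m_i^2 = S
```

The middle step groups the time steps by item: for item i, the values $c_t$ run through $m_i, m_i - 1, \dots, 1$, and the first $m_i$ odd numbers sum to $m_i^2$. For the stream above, $m_a = m_b = 4$, so $S = 16 + 16 = 32$.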
▪ So: $2c - 1 = c^2 - (c-1)^2$
▪ For k=3: $c^3 - (c-1)^3 = 3c^2 - 3c + 1$
◼Generally: Estimate $= n\,(c^k - (c-1)^k)$
Sampling from a Data Stream:
Techniques of Sampling:
1) Sampling a fixed proportion
2) Fixed Size Sampling
3) Biased Reservoir Sampling
To sample a fixed proportion a/b: hash each tuple's key into b buckets and keep the tuple if it lands in one of the first a buckets.
How to generate a 30% sample?
Hash into b=10 buckets and take the tuple if it hashes to one of the first 3 buckets.
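A minimal Python sketch of fixed-proportion sampling by hashing, assuming each tuple has a string key. The function name `in_sample`, the choice of SHA-256 as the hash, and the synthetic `user…` keys are assumptions made for illustration.

```python
import hashlib


def in_sample(key: str, a: int = 3, b: int = 10) -> bool:
    """Keep the tuple if its key hashes to one of the first a of b buckets."""
    digest = hashlib.sha256(key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % b
    return bucket < a


# Hypothetical keys, used only to check the sampling rate.
users = [f"user{i}" for i in range(10_000)]
sample = [u for u in users if in_sample(u)]
fraction = len(sample) / len(users)
print(round(fraction, 3))  # close to 0.3
```

Because the decision depends only on the key, every tuple with the same key is either always kept or always dropped, so the sample stays consistent as the stream grows.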
Queries over a (long) Sliding Window
Sliding Windows
◼ A useful model of stream processing is that
queries are about a window of length N –
the N most recent elements received
◼ Interesting case: N is so large that the data cannot
be stored in memory, or even on disk
▪ Or, there are so many streams that windows
for all cannot be stored
◼ Amazon example:
▪ For every product X we keep a 0/1 stream recording whether
that product was sold in the n-th transaction
▪ We want to answer queries of the form: how many times have we sold
X in the last k sales?
Sliding Window: 1 Stream
[Figure: a window sliding over the stream q w e r t y u i o p a s d f g h j k l z x c v b n m in four successive steps; elements to the left of the window are in the past, elements to the right are in the future.]
◼Obvious solution:
Store the most recent N bits
▪ When new bit comes in, discard the N+1st bit
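A minimal Python sketch of this obvious solution, assuming N is small enough for the window to fit in memory. The class name `ExactWindow` and the sample bit string with N = 6 are illustrative choices.

```python
from collections import deque


class ExactWindow:
    """Store the most recent N bits exactly."""

    def __init__(self, n: int):
        self.window = deque(maxlen=n)  # a full deque drops its oldest item

    def add(self, bit: int) -> None:
        self.window.append(bit)  # the (N+1)st most recent bit is discarded

    def count_ones(self, k=None) -> int:
        """Number of 1s among the last k bits (default: the whole window)."""
        k = len(self.window) if k is None else min(k, len(self.window))
        return sum(list(self.window)[-k:]) if k else 0


w = ExactWindow(6)
for ch in "010011011101010110110110":
    w.add(int(ch))

print(list(w.window))   # [1, 1, 0, 1, 1, 0]
print(w.count_ones())   # 4
print(w.count_ones(3))  # 2
```

The answers are exact, but the storage is N bits per stream, which is the cost the approximate methods for long windows are designed to avoid.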
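Suppose N = 6 and the stream so far is 010011011101010110110110 (past on the left, future on the right).
[Figure: a longer stream, 010011100010100100010110110111001010110011010, with a window of length N marked; past on the left, future on the right.]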
In the given data stream, let us assume that each new bit arrives from the right. When the new bit = 0:
After the new bit (0) arrives with timestamp 101, there is no change in the buckets.
But if the new bit that arrives is 1, we need to make changes to the buckets.
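The two cases can be sketched in Python as follows. The sketch assumes the standard DGIM rules, which are not all visible on these slides: each bucket records the timestamp of its most recent 1 and a power-of-two size, at most two buckets of any size are kept, three buckets of one size trigger a merge of the two oldest, and the estimate counts only half of the oldest bucket. The class and method names are illustrative.

```python
class DGIM:
    def __init__(self, window_size: int):
        self.N = window_size
        self.t = 0
        self.buckets = []  # newest first: (timestamp of most recent 1, size)

    def add(self, bit: int) -> None:
        self.t += 1
        # Drop the oldest bucket once its most recent 1 leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 0:
            return  # a 0 never changes the remaining buckets
        self.buckets.insert(0, (self.t, 1))
        i = 0
        # Three buckets of one size: merge the two oldest, then cascade.
        while (i + 2 < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + 2][1]):
            ts, size = self.buckets[i + 1]
            self.buckets[i + 1:i + 3] = [(ts, 2 * size)]
            i += 1

    def estimate(self) -> int:
        """Approximate number of 1s among the last N bits."""
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] // 2  # count half of the oldest


d = DGIM(window_size=10)
for bit in [1, 1, 1]:
    d.add(bit)
print(d.buckets)     # [(3, 1), (2, 2)]
print(d.estimate())  # 2 (the true count is 3)
```

Under these rules the estimate is off by at most half of the oldest bucket, which bounds the relative error at 50% of the true count.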
Storage Requirements for the DGIM Algorithm
◼ Constraint on buckets:
Number of 1s must be a power of 2
▪ That explains the O(log log N) in (B) above
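A small Python calculation of why the power-of-two constraint keeps buckets cheap. It assumes the usual accounting, which is only partly visible on this slide: a timestamp stored modulo N needs about log2 N bits, and a bucket size 2^j is stored as its exponent j, which needs about log2 log2 N bits. The function name `bucket_bits` is illustrative.

```python
import math


def bucket_bits(window_size: int):
    """Bits needed per bucket for the timestamp and for the size exponent."""
    timestamp_bits = math.ceil(math.log2(window_size))   # timestamp mod N
    max_exponent = math.floor(math.log2(window_size))    # size is at most N
    size_bits = max(1, math.ceil(math.log2(max_exponent + 1)))
    return timestamp_bits, size_bits


print(bucket_bits(2 ** 20))  # (20, 5)
```

For a window of about a million bits, each bucket costs roughly 25 bits, and only the exponent of the size has to be stored.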
[Figure: the stream 1001010110001011010101010101011010101010101110101010111010100010110010 with a window of length N marked.]
Representing a Stream by Buckets
[Figure: buckets covering a window of length N over the stream 1001010110001011010101010101011010101010101110101010111010100010110010. From oldest to newest: at least 1 bucket of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1.]