
Mod2_Data_Streams


Mining Data Streams (Part 1)
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Contents
◼ Introduction to Data Streams
◼ Examples of Streams
◼ The Stream Model
◼ Filtering Streams: The Bloom Filter
◼ The Count-Distinct Problem: The Flajolet-Martin Algorithm
◼ Estimating Moments: AMS Algorithm
◼ Higher-Order Moments
◼ Querying on Windows
▪ Counting Ones
▪ DGIM Algorithm

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


New Topic: Infinite Data
[Figure: map of course topics, with this lecture under "Infinite data".
High-dimensional data: locality-sensitive hashing, clustering, dimensionality reduction.
Graph data: PageRank, SimRank; community detection; spam detection.
Infinite data: filtering data streams, queries on streams, web advertising.
Machine learning: SVM, decision trees, perceptron, kNN.
Apps: recommender systems, association rules, duplicate document detection.]


Data Management vs. Stream Management


Example of Stream sources

◼Sensor data
◼Image data
◼Internet and Web traffic


Data Streams
◼In many data mining situations, we do not
know the entire data set in advance
◼Stream management is important when the
input rate is controlled externally:
▪ Google queries
▪ Twitter or Facebook status updates
◼We can think of the data as infinite and
non-stationary (the distribution changes
over time)
General Stream Processing Model
[Figure: several streams (. . . 1, 5, 2, 7, 0, 9, 3; . . . a, r, v, t, y, h, b; . . . 0, 0, 1, 0, 1, 1, 0) enter a Processor over time. Each stream is composed of elements/tuples. The Processor answers standing queries and ad-hoc queries, produces output, and uses limited working storage plus archival storage.]


The Stream Model
◼Input elements enter at a rapid rate,
at one or more input ports (i.e., streams)
▪ We call elements of the stream tuples
◼The system cannot store the entire stream
accessibly
◼Q: How do you make critical calculations
about the stream using a limited amount of
(secondary) memory?


Types of queries
1) Ad-hoc queries
A question asked once about the current state of a
stream or streams.
Example: What is the max value seen so far?

2) Standing queries
A query that is stored permanently and evaluated as each new element arrives.
Example: The stream produced by the ocean-surface-temperature
sensor might have a standing query to output an alert whenever
the temperature exceeds 25 degrees centigrade.
This query is easily answered, since it depends only on the most
recent stream element.
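
The two query types can be sketched together in a few lines. This is only an illustration under assumptions: the class name StreamMonitor, the sample readings, and the idea of keeping a running maximum as the summary for the ad-hoc query are not from the slides.

```python
from typing import Optional


class StreamMonitor:
    """Processes one temperature reading at a time with O(1) state."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold               # alert level of the standing query
        self.max_seen: Optional[float] = None    # summary kept for the ad-hoc query

    def process(self, value: float) -> bool:
        """Consume one element; return True if the standing query fires."""
        if self.max_seen is None or value > self.max_seen:
            self.max_seen = value
        # The standing query depends only on the most recent element.
        return value > self.threshold


readings = [21.5, 24.0, 25.5, 23.0, 26.1]   # hypothetical sensor values
monitor = StreamMonitor(threshold=25.0)
alerts = [r for r in readings if monitor.process(r)]

print("alerts:", alerts)                  # readings that exceeded 25 degrees
print("max so far:", monitor.max_seen)    # answer to the ad-hoc query
```

The ad-hoc "max so far" question can only be answered here because the summary was maintained in advance; a question nobody anticipated would need the archived stream.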



Problems on Data Streams

◼Types of queries one wants to answer on
a data stream:
▪ Sampling data from a stream
  ▪ Construct a random sample
▪ Queries over sliding windows
  ▪ Number of items of type x in the last k elements
    of the stream


Problems on Data Streams

◼Types of queries one wants to answer on
a data stream:
▪ Filtering a data stream
  ▪ Select elements with property x from the stream
▪ Counting distinct elements
  ▪ Number of distinct elements in the last k elements
    of the stream
▪ Estimating moments
  ▪ Estimate avg./std. dev. of last k elements
▪ Finding frequent elements


Applications (1)
◼Mining query streams
▪ Google wants to know what queries are
more frequent today than yesterday
◼Mining click streams
▪ Yahoo wants to know which of its pages are
getting an unusual number of hits in the past hour
◼Mining social network news feeds
▪ E.g., look for trending topics on Twitter, Facebook


Applications (2)
◼Sensor Networks
▪ Many sensors feeding into a central controller
◼Telephone call records
▪ Data feeds into customer bills as well as
settlements between telephone companies
◼IP packets monitored at a switch
▪ Gather information for optimal routing
▪ Detect denial-of-service attacks



Algorithms for streams:
(1) Filtering a data stream: Bloom filters
Select elements with property x from stream
(2) Counting distinct elements: Flajolet-Martin
Number of distinct elements in the last k elements
of the stream
(3) Estimating moments: AMS method
Estimate std. dev. of last k elements
(4) Counting frequent items

(1) Filtering Data Streams
Filtering Data Streams
◼Each element of the data stream is a tuple
◼Given a list of keys S
◼Determine which tuples of the stream are in S

◼Obvious solution: Hash table
▪ But suppose we do not have enough memory to store
all of S in a hash table
▪ E.g., we might be processing millions of filters on the same
stream
Bloom Filters – Introduction
Example: checking whether a username is already taken when creating a Gmail account.

A Bloom filter is a space-efficient probabilistic data structure that is
used to test whether an element is a member of a set.

The price we pay for this efficiency is that the structure is
probabilistic: it may return false positives.

A false positive means the filter may report that a given
username is already taken when it actually is not.
The Bloom Filter
A Bloom filter consists of:

1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk.
Each hash function maps “key” values to n
buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
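
The three components above can be sketched as follows. This is a minimal illustration, not the class's reference implementation: the name BloomFilter, the use of salted SHA-256 to obtain the k hash functions, and the byte-per-bit array are all assumptions chosen for readability. A key that was inserted is always reported as present; a key that was never inserted may occasionally be reported as present (a false positive).

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits: int, k_hashes: int) -> None:
        self.n = n_bits
        self.k = k_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(n_bits)

    def _positions(self, key: str):
        # Assumed construction: h_i(key) = SHA-256("i:key") mod n, for i = 0..k-1.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key: str) -> None:
        """Insert a member of S: set the bit at every h_i(key)."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely not in S; True means probably in S."""
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter(n_bits=1000, k_hashes=3)
for username in ["alice", "bob", "carol"]:   # hypothetical key set S
    bf.add(username)

print(bf.might_contain("alice"))   # an inserted key always passes
print(bf.might_contain("dave"))    # usually False, but a false positive is possible
```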

Examples
◼Refer to the numerical examples worked in class.
◼Bloom Filters - Introduction and Implementation - GeeksforGeeks
(2) Counting Distinct Elements
Counting Distinct Elements

Applications

Formal Definition

Using Small Storage
Flajolet Martin Algorithm
◼ Pseudocode, step by step:

1. Select a hash function h that maps each element of the set to
a bit string of at least log2 n bits.

2. Convert each hash output h(x) to its binary value.

3. For each element x, let r(x) = the number of trailing zeros in
the binary value of h(x).

4. R = max r(x) over all elements x seen so far.

=> Estimated number of distinct elements = 2^R

Example
Consider stream, x=[ 1,5,10,5,15,1 ]
h(x) = x mod 11
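
The Flajolet-Martin steps can be traced on this stream with a short sketch. The function names are illustrative, and treating a hash value of 0 as having zero trailing zeros is an assumed convention that the slides do not specify.

```python
def trailing_zeros(value: int) -> int:
    """Number of trailing 0 bits in value (assumed to be 0 when value == 0)."""
    if value == 0:
        return 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


def flajolet_martin(stream, h) -> int:
    """Return 2^R, where R is the largest trailing-zero count of any h(x)."""
    R = 0
    for x in stream:
        R = max(R, trailing_zeros(h(x)))
    return 2 ** R


stream = [1, 5, 10, 5, 15, 1]


def h(x: int) -> int:
    return x % 11


for x in sorted(set(stream)):
    print(x, h(x), format(h(x), "b"), trailing_zeros(h(x)))

estimate = flajolet_martin(stream, h)
print("estimated distinct elements:", estimate)
```

Here h maps 1, 5, 10, 15 to 1, 5, 10, 4 (binary 1, 101, 1010, 100), so the trailing-zero counts are 0, 0, 1, 2, giving R = 2 and an estimate of 2^2 = 4, which happens to equal the true number of distinct elements in this small stream.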

Example (2)
Example:
S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
h(x) = (6x + 1) mod 5

Numerical example solved in lecture.
(3) Computing Moments
Generalization: Moments
◼Suppose a stream has elements chosen
from a set A of N values
◼Let m_i be the number of times value i occurs
in the stream
◼The kth moment is
$$\sum_{i \in A} (m_i)^k$$


Special Cases

$$\sum_{i \in A} (m_i)^k$$

◼0th moment = number of distinct elements
▪ The problem just considered
◼1st moment = count of the number of
elements = length of the stream
▪ Easy to compute
◼2nd moment = surprise number S =
a measure of how uneven the distribution is


Example: Surprise Number
◼Stream of length 100
◼11 distinct values
◼Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9
Surprise S = 910
◼Item counts: 90, 1, 1, 1, 1, 1, 1, 1 ,1, 1, 1
Surprise S = 8,110
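
The two surprise numbers can be checked directly from the item counts. The helper name moment is illustrative only.

```python
def moment(counts, k: int) -> int:
    """k-th moment: sum of (m_i)^k over the item counts m_i."""
    return sum(m ** k for m in counts)


even_counts = [10] + [9] * 10     # 10, 9, 9, ..., 9  (11 values, total 100)
skewed_counts = [90] + [1] * 10   # 90, 1, 1, ..., 1  (11 values, total 100)

print(moment(even_counts, 0), moment(even_counts, 1))      # distinct values, stream length
print(moment(even_counts, 2), moment(skewed_counts, 2))    # the two surprise numbers
```

Both streams have the same 0th and 1st moments (11 and 100); only the 2nd moment separates the even distribution (910) from the skewed one (8,110).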



AMS Method [Alon, Matias, and Szegedy]
◼AMS method works for all moments
◼Gives an unbiased estimate
◼We will just concentrate on the 2nd moment S
◼We pick and keep track of many variables X:
▪ For each variable X we store X.el and X.val
▪ X.el corresponds to the item i
▪ X.val corresponds to the count of item i
▪ Note this requires a count in main memory,
so the number of Xs is limited
◼Our goal is to compute $S = \sum_i m_i^2$


One Random Variable (X)
◼How to set X.val and X.el?
▪ Assume stream has length n (we relax this later)
▪ Pick some random time t (t<n) to start,
so that any time is equally likely
▪ Let at time t the stream have item i. We set X.el = i
▪ Then we maintain a count c (X.val = c) of the number
of i's in the stream starting from the chosen time t
◼Then the estimate of the 2nd moment ($\sum_i m_i^2$) is:
$$S = f(X) = n\,(2c - 1)$$
▪ Note, we will keep track of multiple Xs (X1, X2, … Xk),
and our final estimate will be $S = \frac{1}{k}\sum_{j=1}^{k} f(X_j)$
Example
◼ The stream is a, b, c, b, d, a, c, d, a, b, d, c, a, a, b.
◼ n= 15

◼ AMS:
◼ Assume that at “random” we pick the 3rd, 8th, and
13th positions
◼ Calculate X1, X2, X3
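
The example can be worked through with a small offline simulation. This sketch assumes the estimate n(2c - 1) per variable and a plain average over the variables; it scans a stored list rather than a live stream, and the function names are illustrative.

```python
from collections import Counter

stream = list("abcbdacdabdcaab")   # a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
n = len(stream)


def ams_variable(stream, t: int):
    """Return (X.el, X.val) for a variable started at 1-based position t."""
    element = stream[t - 1]
    value = stream[t - 1:].count(element)   # occurrences from time t onwards
    return element, value


def ams_second_moment(stream, positions) -> float:
    """Average of n * (2 * X.val - 1) over the chosen start positions."""
    n = len(stream)
    estimates = [n * (2 * ams_variable(stream, t)[1] - 1) for t in positions]
    return sum(estimates) / len(estimates)


variables = [ams_variable(stream, t) for t in (3, 8, 13)]
estimate = ams_second_moment(stream, (3, 8, 13))
exact = sum(m ** 2 for m in Counter(stream).values())
all_positions = ams_second_moment(stream, range(1, n + 1))

print(variables)      # (element, count) for X1, X2, X3
print(estimate, exact, all_positions)
```

For positions 3, 8, and 13 the variables are (c, 3), (d, 2), (a, 2), giving estimates 75, 45, 45 and an average of 55, against an exact second moment of 5^2 + 4^2 + 3^2 + 3^2 = 59. Averaging over all 15 start positions reproduces 59 exactly, which is the unbiasedness claim in concrete form.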
Expectation Analysis
[Figure: a stream a a b b b a b a ... with a running count 1, 2, 3, ..., m_a marking the occurrences of a.]

◼2nd moment is $S = \sum_i m_i^2$
◼c_t … number of times the item at time t appears
from time t onwards (c_1 = m_a, c_2 = m_a - 1, c_3 = m_b)
◼m_i … total count of item i in the stream
(we are assuming the stream has length n)
◼Group times by the value seen:
$$E[f(X)] = \frac{1}{n}\sum_{t=1}^{n} n\,(2c_t - 1) = \frac{1}{n}\sum_i n\,\bigl(1 + 3 + 5 + \cdots + (2m_i - 1)\bigr)$$
▪ The term 1 comes from the time t when the last i is seen (c_t = 1), the term 3 from the time when the penultimate i is seen (c_t = 2), and the term 2m_i - 1 from the time when the first i is seen (c_t = m_i)


Expectation Analysis (continued)
$$E[f(X)] = \frac{1}{n}\sum_i n\,\bigl(1 + 3 + 5 + \cdots + (2m_i - 1)\bigr)$$
▪ Little side calculation:
$$\sum_{c=1}^{m_i} (2c - 1) = 2\,\frac{m_i(m_i + 1)}{2} - m_i = m_i^2$$
◼Then $E[f(X)] = \frac{1}{n}\sum_i n\,m_i^2$
◼So, $E[f(X)] = \sum_i m_i^2 = S$
◼We have the second moment (in expectation)!


Higher-Order Moments
◼For estimating the kth moment we essentially use
the same algorithm but change the estimate:
▪ For k=2 we used n (2·c - 1)
▪ For k=3 we use: n (3·c^2 - 3c + 1) (where c=X.val)
◼Why?
▪ For k=2: Remember we had the sum 1 + 3 + 5 + … + (2m_i - 1), and we showed that the terms
2c - 1 (for c = 1, …, m) sum to m^2
▪ So: 2c - 1 = c^2 - (c - 1)^2
▪ For k=3: c^3 - (c - 1)^3 = 3c^2 - 3c + 1
◼Generally: Estimate = n (c^k - (c - 1)^k)
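
The telescoping argument can be checked numerically. The sketch below assumes the general estimate n(c^k - (c - 1)^k) and reuses a 15-element example stream; averaging the estimate over every possible start time should give the exact kth moment, because the terms c^k - (c - 1)^k for c = 1, …, m_i sum to m_i^k.

```python
from collections import Counter


def kth_moment_estimate(stream, t: int, k: int) -> int:
    """Estimate n * (c^k - (c-1)^k) from a variable started at 1-based time t."""
    n = len(stream)
    c = stream[t - 1:].count(stream[t - 1])   # X.val for this start time
    return n * (c ** k - (c - 1) ** k)


stream = list("abcbdacdabdcaab")
n = len(stream)

for k in (2, 3, 4):
    exact = sum(m ** k for m in Counter(stream).values())
    average = sum(kth_moment_estimate(stream, t, k) for t in range(1, n + 1)) / n
    print(k, exact, average)   # the two values agree for every k
```

For k = 3 the variable started at position 3 has c = 3, so its estimate is 15 · (27 - 8) = 15 · (3·9 - 3·3 + 1) = 285, matching the closed form for k = 3.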
Sampling from a Data Stream:
Techniques of Sampling:
1) Sampling a fixed proportion
2) Fixed Size Sampling
3) Biased Reservoir Sampling



Sampling from a Data Stream

◼Since we cannot store the entire stream,
one obvious approach is to store a sample
◼Two different problems:
▪ (1) Sample a fixed proportion of elements
in the stream (say 1 in 10)
▪ (2) Maintain a random sample of fixed size
over a potentially infinite stream
  ▪ At any “time” k we would like a random sample
    of s elements
  ▪ What is the property of the sample we want to maintain?
    For all time steps k, each of the k elements seen so far has
    equal probability of being sampled
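
For problem (2), the Summary names reservoir sampling as the technique, although the algorithm itself is not spelled out in these slides. The following is a sketch of the standard version under that assumption: keep the first s elements, then let the k-th element replace a uniformly chosen slot with probability s/k, so that every element seen so far is in the sample with probability s/k.

```python
import random


def reservoir_sample(stream, s: int, rng: random.Random):
    """Maintain a uniform random sample of s elements from a stream."""
    sample = []
    for k, item in enumerate(stream, start=1):
        if k <= s:
            sample.append(item)               # fill the reservoir first
        elif rng.random() < s / k:
            sample[rng.randrange(s)] = item   # evict a uniformly chosen slot
    return sample


rng = random.Random(0)   # fixed seed only to make the demo repeatable
sample = reservoir_sample(range(1000), s=10, rng=rng)
print(sample)
```

The memory used is s elements regardless of how long the stream runs, which is the point of the fixed-size variant.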
Sampling a Fixed Proportion

◼Problem 1: Sampling a fixed proportion
◼Scenario: Search engine query stream
▪ Stream of tuples: (user, query, time)
▪ Answer questions such as: How often did a user
run the same query in a single day?
▪ Have space to store 1/10th of the query stream
◼Naïve solution:
▪ Generate a random integer in [0..9] for each query
▪ Store the query if the integer is 0, otherwise
discard



Problem with Naïve Approach



Solution: Sample Users
Solution:
◼Pick 1/10th of users and take all their
searches in the sample
◼Use a hash function that hashes the
user name or user id uniformly into 10
buckets



Generalized Solution
◼Stream of tuples with keys:
▪ Key is some subset of each tuple’s components
▪ e.g., tuple is (user, search, time); key is user
▪ Choice of key depends on application

◼To get a sample of a/b fraction of the stream:
▪ Hash each tuple’s key uniformly into b buckets, numbered 0 to b-1
▪ Pick the tuple if its hash value is less than a

[Figure: hash table with b buckets; a tuple is picked if it hashes to one of the first a buckets.]
How to generate a 30% sample?
Hash into b=10 buckets, take the tuple if it hashes to one of the first 3 buckets
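
Key-based sampling can be sketched as below. The choice of SHA-256 as the uniform hash, the bucket numbering 0 to b-1, and the sample tuples are assumptions for illustration. The property to observe is that a user is either entirely in the sample or entirely out of it, which is what the naïve per-query coin flip fails to guarantee.

```python
import hashlib


def bucket(key: str, b: int) -> int:
    """Hash a key into one of b buckets, numbered 0 .. b-1."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % b


def in_sample(record, a: int, b: int) -> bool:
    """Keep the tuple if its key (the user) hashes into the first a of b buckets."""
    user = record[0]
    return bucket(user, b) < a


# Hypothetical (user, query, time) tuples.
queries = [
    ("u1", "cats", 1), ("u2", "dogs", 2), ("u1", "cats", 3),
    ("u3", "fish", 4), ("u2", "birds", 5), ("u4", "cats", 6),
]

sample = [q for q in queries if in_sample(q, a=3, b=10)]   # about a 30% sample of users
print(sample)
```

Because the decision depends only on the key, no per-user state is stored: the same user hashes to the same bucket every time a tuple arrives.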
Queries over a (long) Sliding Window
Sliding Windows
◼ A useful model of stream processing is that
queries are about a window of length N:
the N most recent elements received
◼ Interesting case: N is so large that the data cannot
be stored in memory, or even on disk
▪ Or, there are so many streams that windows
for all cannot be stored
◼ Amazon example:
▪ For every product X we keep a 0/1 stream of whether
that product was sold in the n-th transaction
▪ We want to answer queries such as: how many times have we sold
X in the last k sales?
Sliding Window: 1 Stream

◼Sliding window on a single stream: N=6

[Figure: the stream q w e r t y u i o p a s d f g h j k l z x c v b n m shown at four successive time steps, with a window covering the 6 most recent elements; the past is to the left and the future to the right.]


Counting Bits (1)
◼Problem:
▪ Given a stream of 0s and 1s
▪ Be prepared to answer queries of the form
How many 1s are in the last k bits? where k ≤ N

◼Obvious solution:
Store the most recent N bits
▪ When a new bit comes in, discard the (N+1)st bit
[Figure: bit stream 010011011101010110110110, past on the left and future on the right; suppose N=6.]
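
The obvious solution can be sketched with a bounded queue. The class name ExactWindowCounter is illustrative; the bit string is the one from the figure, read left (oldest) to right (newest).

```python
from collections import deque


class ExactWindowCounter:
    """Stores the most recent N bits exactly; costs N bits of memory."""

    def __init__(self, N: int) -> None:
        self.N = N
        self.window = deque(maxlen=N)   # appending to a full deque drops the oldest bit

    def add(self, bit: int) -> None:
        self.window.append(bit)

    def count_ones(self, k: int) -> int:
        """Exact number of 1s among the last k bits, for k <= N."""
        if k > self.N:
            raise ValueError("k must not exceed the window length N")
        bits = list(self.window)
        return sum(bits[max(0, len(bits) - k):])


counter = ExactWindowCounter(N=6)
for ch in "010011011101010110110110":
    counter.add(int(ch))

print(counter.count_ones(6), counter.count_ones(3))
```

After the whole stream has arrived, the window holds 1 1 0 1 1 0, so the query for k = 6 returns 4 and the query for k = 3 returns 2. The answers are exact, but the memory grows linearly with N, which is what the following slides set out to avoid.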



Counting Bits (2)
◼You cannot get an exact answer without
storing the entire window
◼Real Problem:
What if we cannot afford to store N bits?
▪ E.g., we’re processing 1 billion streams and
N = 1 billion
◼But we are happy with an approximate
answer
An attempt: Simple solution

[Figure: bit stream 010011100010100100010110110111001010110011010 with a window of the N most recent bits marked; past on the left, future on the right.]


DGIM Method [Datar, Gionis, Indyk, Motwani]


DGIM Elements
◼Timestamp – modulo N
◼Bucket size
◼Rules:
1. The right end of a bucket is always a position with a 1.
2. Every position with a 1 is in some bucket.
3. Size of a bucket = the number of 1’s it contains
4. There are one or two buckets of any given size, up to
some maximum size.
5. All sizes must be a power of 2.
6. Buckets cannot decrease in size as we move to the left
(back in time)
DGIM Example
◼. . 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0
This picture shows how we can form the buckets based on the number
of ones by following the rules.

In the given data stream let us assume the new bit arrives from the right. First case: the new bit = 0.

After the new bit (0) arrives with timestamp 101, there is no change in the buckets.
But if the new bit that arrives is 1, then we need to make changes.
Storage Requirements for the DGIM Algorithm

◼Each bucket can be represented by O(log N)
bits. If the window has length N, then there
are no more than N 1’s, surely.

◼Suppose the largest bucket is of size 2^j. Then j
cannot exceed log2 N, or else there are more
1’s in this bucket than there are 1’s in the
entire window.
Idea: Exponential Windows

◼Solution that doesn’t (quite) work:
▪ Summarize exponentially increasing regions
of the stream, looking backward
▪ Drop small regions if they begin at the same point
as a larger region
[Figure: the bit stream 010011100010100100010110110111001010110011010 with the window N marked and counts of 1s for regions of exponentially increasing width; for example, the window of width 16 has 6 1s, and the count for the oldest, partially covered region is marked "?".]
We can reconstruct the count of the last N bits, except we
are not sure how many of the last 6 1s are included in the N
What’s Good?



What’s Not So Good?
◼As long as the 1s are fairly evenly distributed,
the error due to the unknown region is small
– no more than 50%
◼But it could be that all the 1s are in the
unknown area at the end
◼In that case, the error is unbounded!
Fixup: DGIM Method [Datar, Gionis, Indyk, Motwani]

◼Idea: Instead of summarizing fixed-length
blocks, summarize blocks with a specific
number of 1s:
▪ Let the block sizes (number of 1s) increase
exponentially

◼When there are few 1s in the window, block
sizes stay small, so errors are small
[Figure: bit stream 1001010110001011010101010101011010101010101110101010111010100010110010 with the window of length N marked.]


DGIM: Timestamps



DGIM: Buckets
◼ A bucket in the DGIM method is a record
consisting of:
▪ (A) The timestamp of its end [O(log N) bits]
▪ (B) The number of 1s between its beginning and
end [O(log log N) bits]

◼ Constraint on buckets:
Number of 1s must be a power of 2
▪ That explains the O(log log N) in (B) above
Representing a Stream by Buckets

◼Either one or two buckets with the same
power-of-2 number of 1s
◼Buckets do not overlap in timestamps
◼Buckets are sorted by size
▪ Earlier buckets are not smaller than later buckets
◼Buckets disappear when their
end-time is > N time units in the past


Example: Bucketized Stream

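[Figure: the bit stream 1001010110001011010101010101011010101010101110101010111010100010110010 with the window of length N marked, divided into buckets from oldest to newest: at least 1 of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1.]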

Three properties of buckets that are maintained:


- Either one or two buckets with the same power-of-2 number of 1s
- Buckets do not overlap in timestamps
- Buckets are sorted by size
Updating Buckets (1)
◼When a new bit comes in, drop the last
(oldest) bucket if its end-time is prior to N
time units before the current time
◼2 cases: Current bit is 0 or 1
◼If the current bit is 0:
no other changes are needed



Updating Buckets (2)
◼ If the current bit is 1:
▪ (1) Create a new bucket of size 1, for just this bit
▪ End timestamp = current time
▪ (2) If there are now three buckets of size 1,
combine the oldest two into a bucket of size 2
▪ (3) If there are now three buckets of size 2,
combine the oldest two into a bucket of size 4
▪ (4) And so on …



Example: Updating Buckets

Current state of the stream:
1001010110001011010101010101011010101010101110101010111010100010110010
Bit of value 1 arrives
0010101100010110101010101010110101010101011101010101110101000101100101
Two orange buckets get merged into a yellow bucket
0010101100010110101010101010110101010101011101010101110101000101100101
Next bit 1 arrives, new orange bucket is created, then 0 comes, then 1:
0101100010110101010101010110101010101011101010101110101000101100101101
Buckets get merged…
0101100010110101010101010110101010101011101010101110101000101100101101
State of the buckets after merging
0101100010110101010101010110101010101011101010101110101000101100101101
How to Query?
◼ To estimate the number of 1s in the most
recent N bits:
1. Sum the sizes of all buckets but the last (oldest)
(note “size” means the number of 1s in the
bucket)
2. Add half the size of the last bucket

◼ Remember: We do not know how many 1s
of the last bucket are still within the wanted
window
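
The update and query rules can be put together in one compact sketch. It makes two simplifying assumptions that are not in the slides: timestamps are plain integers rather than values modulo N, and buckets are kept in a Python list ordered newest first. The class and method names are illustrative.

```python
class DGIM:
    def __init__(self, N: int) -> None:
        self.N = N
        self.time = 0
        self.buckets = []   # [end_timestamp, size] pairs, newest first

    def add(self, bit: int) -> None:
        self.time += 1
        # Drop the oldest bucket once its end time has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.N:
            self.buckets.pop()
        if bit != 1:
            return   # a 0 needs no other change
        self.buckets.insert(0, [self.time, 1])
        # While three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            newer_end, size = self.buckets[i + 1]
            self.buckets[i + 1:i + 3] = [[newer_end, 2 * size]]
            i += 1   # the merged bucket may now be the third of the next size

    def estimate(self, k=None) -> float:
        """Estimated number of 1s in the last k bits (k defaults to N)."""
        k = self.N if k is None else k
        sizes = [size for end, size in self.buckets if end > self.time - k]
        if not sizes:
            return 0
        # All buckets but the oldest count fully; the oldest counts half.
        return sum(sizes[:-1]) + sizes[-1] / 2


dgim = DGIM(N=16)
for bit in [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]:
    dgim.add(bit)
print(dgim.buckets, dgim.estimate())
```

Since only the oldest bucket is uncertain and the more recent buckets together hold at least as many 1s as it minus one, the estimate stays within 50% of the true count.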



Example: Bucketized Stream

[Figure: the bit stream 1001010110001011010101010101011010101010101110101010111010100010110010 with the window of length N marked, divided into buckets from oldest to newest: at least 1 of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1.]


Extensions
◼Can we use the same trick to answer queries
“How many 1’s in the last k?” where k < N?
▪ A: Find the earliest bucket B that overlaps with k.
Number of 1s is the sum of sizes of more recent
buckets + ½ size of B
[Figure: the same bit stream with the last k bits marked.]

◼Can we handle the case where the stream is
not bits, but integers, and we want the sum
of the last k elements?
Extensions

◼c_i … estimated count for the i-th bit
◼Idea: Sum in each bucket is at most 2^b (unless the bucket
has only 1 integer)
[Figure: integer stream 2 5 7 1 3 8 4 6 7 9 1 3 7 6 5 3 5 7 1 3 3 1 2 2 6, followed by newly arriving elements 3, 2, and 5, grouped into buckets. Bucket sizes: 16 8 4 2 1.]
Summary
◼Sampling a fixed proportion of a stream
▪ Sample size grows as the stream grows
◼Sampling a fixed-size sample
▪ Reservoir sampling
◼Counting the number of 1s in the last N
elements
▪ Exponentially increasing windows
▪ Extensions:
▪ Number of 1s in any last k (k < N) elements
▪ Sums of integers in the last N elements
