Mining Data Streams (Part 2) : Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman
Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
Publish-subscribe systems
You are collecting lots of messages (news articles). People express interest in certain sets of keywords. The task is to determine whether each message matches a user's interest.
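As a rough illustration (not from the slides), matching a message against a few keyword subscriptions is just a set-intersection check; the user names, keywords, and example message below are invented:

```python
# Hypothetical sketch of publish-subscribe keyword matching.
# The subscriptions dict and the example message are made up for illustration.
subscriptions = {
    "alice": {"election", "economy"},
    "bob": {"football", "election"},
}

def matching_users(message: str) -> list[str]:
    """Return the users whose keyword interests intersect the message's words."""
    words = set(message.lower().split())
    return [user for user, keywords in subscriptions.items() if keywords & words]

print(matching_users("Election results move the economy"))  # ['alice', 'bob']
```

With billions of messages and many users, doing this exact check for every message is what the filtering machinery below is meant to avoid.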
[Diagram: each incoming item is fed through a hash function h, which maps it into the filter's bit array]
In our case:
Targets = bits/buckets
Darts = hash values of items
Probability a given target X is not hit by any of the m darts: (1 - 1/n)^m = ((1 - 1/n)^n)^(m/n) ≈ e^(-m/n)
Probability at least one dart hits target X: 1 - (1 - 1/n)^m ≈ 1 - e^(-m/n)
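As a concrete (and deliberately simplified) sketch of the filter itself, here is a single-hash Bloom filter in Python; the bit-array size and the SHA-256-based hash are arbitrary choices for illustration, not something prescribed by the slides:

```python
import hashlib

class BloomFilter:
    """Minimal single-hash Bloom filter: bits are the targets, hashed items are the darts."""

    def __init__(self, n_bits: int):
        self.n = n_bits
        self.bits = bytearray(n_bits)  # n targets, all initially 0

    def _hash(self, item: str) -> int:
        # Throw a dart: map the item to one of the n targets.
        digest = hashlib.sha256(item.encode()).hexdigest()
        return int(digest, 16) % self.n

    def add(self, item: str) -> None:
        self.bits[self._hash(item)] = 1

    def might_contain(self, item: str) -> bool:
        # False positives are possible (another item's dart may have hit this target);
        # false negatives are not.
        return self.bits[self._hash(item)] == 1

bf = BloomFilter(8_000_000)
bf.add("politics")
print(bf.might_contain("politics"))   # True
print(bf.might_contain("gardening"))  # usually False; occasionally a false positive
```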
Analysis: Throwing Darts (3)
Fraction of 1s in the array B = probability of false positive = 1 - e^(-m/n)
[Plot: false-positive probability (vertical axis roughly 0.04 to 0.1) versus the number of darts thrown; the slide asks what happens as we keep throwing more darts]
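A quick numerical check of the approximation, using illustrative values of m and n (these particular numbers are my own, not taken from the text):

```python
import math

m, n = 1_000_000_000, 8_000_000_000   # m darts (hashed items), n targets (bits)
exact = 1 - (1 - 1 / n) ** m           # exact probability a given bit is set
approx = 1 - math.exp(-m / n)          # the e^(-m/n) approximation
print(round(exact, 4), round(approx, 4))  # both about 0.1175
```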
Counting Distinct Elements
Obvious approach: maintain the set of elements seen so far; that is, keep a hash table of all the distinct elements seen so far.
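A sketch of this obvious approach, mainly to make the cost concrete: the hash set grows with the number of distinct elements, which is exactly what streaming methods try to avoid.

```python
# Exact distinct counting with a hash set: memory is proportional to the
# number of distinct elements seen so far.
def count_distinct_exact(stream):
    seen = set()
    for item in stream:
        seen.add(item)
    return len(seen)

print(count_distinct_exact(["a", "a", "b", "b", "b", "a", "b", "a"]))  # 2
```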
Generalization: Moments
The kth moment of the stream is Σ_{i∈A} (m_i)^k, where A is the set of distinct values in the stream and m_i is the number of occurrences of value i.
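For reference, the exact kth moment can be computed directly from the per-item counts; this sketch simply restates the definition in code:

```python
from collections import Counter

# Exact k-th moment: sum over distinct items i of (m_i)**k,
# where m_i is the number of occurrences of item i.
def kth_moment(stream, k):
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

stream = ["a", "a", "b", "b", "b", "a", "b", "a"]
print(kth_moment(stream, 0), kth_moment(stream, 1), kth_moment(stream, 2))
# 0th = number of distinct items (2), 1st = stream length (8), 2nd = 4**2 + 4**2 = 32
```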
AMS Method
AMS method works for all moments
Gives an unbiased estimate
We will just concentrate on the 2nd moment S
We pick and keep track of many variables X:
For each variable X we store X.el and X.val
X.el corresponds to the item i
X.val corresponds to the count of item i
Note this requires a count in main memory,
so number of Xs is limited
Our goal is to compute S = Σ_i (m_i)^2
Stream: a a b b b a b a
2nd moment is S = Σ_i (m_i)^2
c_t … number of times the item at time t appears from time t onwards (c_1 = m_a, c_2 = m_a - 1, c_3 = m_b)
m_i … total count of item i in the stream (we are assuming the stream has length n)
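A small check of these quantities on the example stream (the helper code is mine, not from the slides):

```python
stream = ["a", "a", "b", "b", "b", "a", "b", "a"]
n = len(stream)

# c_t: number of times the item at time t appears from time t onwards
c = [stream[t:].count(stream[t]) for t in range(n)]
print(c)                                           # [4, 3, 4, 3, 2, 2, 1, 1]
print({x: stream.count(x) for x in set(stream)})   # m_a = 4, m_b = 4

# Averaging n(2c_t - 1) over the n start times gives sum_t (2c_t - 1),
# which should equal the true 2nd moment S = 4**2 + 4**2 = 32.
print(sum(2 * ct - 1 for ct in c))                 # 32
```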
Stream: a a b b b a b a
The estimate from a single variable is f(X) = n(2·X.val - 1); averaging over the random choice of the start time t gives E[f(X)] = (1/n) Σ_t n(2c_t - 1).
Little side calculation: 1 + 3 + 5 + … + (2m_i - 1) = Σ_{c=1..m_i} (2c - 1) = (m_i)^2
Then, grouping the times t by the item they hold: E[f(X)] = (1/n) Σ_i n(1 + 3 + 5 + … + (2m_i - 1))
So, E[f(X)] = Σ_i (m_i)^2 = S
We have the second moment (in expectation)!
For higher moments we use the same algorithm but change the estimate. For k=2 the term 2c - 1 works because 2c - 1 = c^2 - (c - 1)^2, so the terms telescope to (m_i)^2.
For k=3: c^3 - (c - 1)^3 = 3c^2 - 3c + 1, so the estimate is n(3c^2 - 3c + 1)
Generally: Estimate = n(c^k - (c - 1)^k)
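A rough sketch of the AMS estimator for the 2nd moment, assuming the stream length n is known in advance; for brevity X.val is recomputed with a count here, whereas a real one-pass implementation would maintain it incrementally as the stream arrives:

```python
import random

def ams_second_moment(stream, num_vars=100, seed=0):
    """Average the estimate n(2*X.val - 1) over num_vars randomly started variables."""
    rng = random.Random(seed)
    n = len(stream)
    estimates = []
    for _ in range(num_vars):
        t = rng.randrange(n)           # pick a uniformly random start time
        el = stream[t]                 # X.el: the item seen at time t
        val = stream[t:].count(el)     # X.val: occurrences of X.el from time t onwards
        estimates.append(n * (2 * val - 1))
    return sum(estimates) / len(estimates)

stream = ["a", "a", "b", "b", "b", "a", "b", "a"]
print(ams_second_moment(stream))  # close to the true 2nd moment, 32
```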
Combining Samples
In practice:
Compute f(X) = n(2·X.val - 1) for as many variables X as you can fit in memory
Average them in groups
Take the median of the averages
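A sketch of this combining rule (the group size and the input values below are arbitrary illustrations):

```python
import statistics

def median_of_averages(estimates, group_size=20):
    """Average per-variable estimates in groups, then take the median of the group averages."""
    groups = [estimates[i:i + group_size] for i in range(0, len(estimates), group_size)]
    averages = [sum(g) / len(g) for g in groups]
    return statistics.median(averages)

print(median_of_averages([8, 24, 40, 56] * 25))  # 32.0
```

The averaging reduces variance within each group, and the median step makes the combined estimate robust to a few badly-off groups.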
Drawbacks:
Only approximate
Number of itemsets is way too big
...
Exponentially decaying window: an item that arrived t time steps ago has weight (1 - c)^t.
Important property: the sum over all weights is Σ_{t≥0} (1 - c)^t = 1/[1 - (1 - c)] = 1/c
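A quick numerical confirmation of the weight-sum property, with an arbitrary decay constant c:

```python
# Sum of the geometric weights (1 - c)^t over a long horizon approaches 1/c.
c = 1e-3
weights_sum = sum((1 - c) ** t for t in range(200_000))
print(round(weights_sum, 3), 1 / c)  # both about 1000
```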