ch04 Streams2
Note to other teachers and users of these slides: We would be delighted if you found our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
Filtering data streams:
▪ Each element of the data stream is a tuple
▪ Given a list of keys S, determine which tuples of the stream are in S
▪ Obvious solution: a hash table
▪ But suppose we do not have enough memory to store all of S in a hash table
▪ E.g., we might be processing millions of filters on the same stream

Applications:
▪ Email spam filtering: we know 1 billion "good" email addresses; if an email comes from one of these, it is NOT spam
▪ Publish-subscribe systems: you are collecting lots of messages (news articles); people express interest in certain sets of keywords; determine whether each message matches a user's interest

First-cut solution:
▪ Given a set of keys S that we want to filter
▪ Create a bit array B of n bits, initially all 0s
▪ Choose a hash function h with range [0, n)
▪ Hash each member s of S to one of the n buckets, and set that bit to 1, i.e., B[h(s)] = 1
▪ Hash each element a of the stream and output only those that hash to a bit that was set to 1, i.e., output a if B[h(a)] == 1
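A minimal sketch of this first-cut filter in Python (the array size, the SHA-256-based hash, and the example addresses are illustrative choices, not from the slides; a Python set stands in for the bit array):

```python
import hashlib

n = 1_000_000_000            # size of the bit array (illustrative)

def h(key: str) -> int:
    """Hash a key to a bucket in [0, n)."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n

S = {"alice@example.com", "bob@example.com"}   # stand-in for the key set S
B = set()                    # stand-in for the bit array: bit i is 1 iff i in B
for s in S:
    B.add(h(s))              # B[h(s)] = 1

def passes_filter(a: str) -> bool:
    """True if a *may* be in S (false positives possible, no false negatives)."""
    return h(a) in B         # output a iff B[h(a)] == 1
```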
[Figure: an item from the stream is hashed by h into the bit array B (e.g., 0010001011000); if it hashes to a bit set to 0, it is surely not in S, so we drop it]

If the email address is in S, then it surely hashes to a bucket whose bit is set to 1, so it always gets through (no false negatives)

With 1 billion addresses hashed into 8 billion bits, approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses not in S get through to the output (false positives)
▪ Actually, less than 1/8th, because more than one address might hash to the same bit

The scheme creates false positives but no false negatives
▪ If the item is in S we surely output it; if not, we may still output it

Analysis: if we throw m darts into n equally likely targets, what is the probability that a given target gets at least one dart?
In our case:
▪ Targets = bits/buckets
▪ Darts = hash values of items
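The standard calculation behind this, consistent with the (1 − e^(−km/n)) expression used on the following slides:

```latex
P(\text{a given target gets at least one dart})
  = 1 - \Big(1 - \tfrac{1}{n}\Big)^{m}
  = 1 - \Big(\big(1 - \tfrac{1}{n}\big)^{n}\Big)^{m/n}
  \approx 1 - e^{-m/n}
```

With m = 1 billion darts and n = 8 billion targets this gives 1 − e^(−1/8) ≈ 0.1175, the fraction of set bits quoted above.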
What fraction of the bit vector B are 1s?
▪ Throwing k·m darts at n targets
▪ So the fraction of 1s is (1 − e^(−km/n))

But we have k independent hash functions, and we only let element x through if all k of them hash x to a bucket of value 1
▪ So the false positive probability is (1 − e^(−km/n))^k

Example: m = 1 billion, n = 8 billion
▪ k = 1: (1 − e^(−1/8)) = 0.1175
▪ k = 2: (1 − e^(−1/4))² ≈ 0.0489

What happens as we keep increasing k?
[Plot: false positive probability (y axis, 0 to 0.2) vs. number of hash functions k (x axis, 0 to 20); the curve falls, bottoms out, then rises. The minimum is at k = (n/m)·ln 2, here 8·ln 2 ≈ 5.5, so k = 6 is near-optimal]

Bloom filters guarantee no false negatives, and use limited memory
▪ Great for pre-processing before more expensive checks
Suitable for hardware implementation
▪ Hash function computations can be parallelized
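A minimal, self-contained Bloom filter sketch in Python (the parameter choices, the salted-SHA-256 hashing scheme, and the class name are ours, not from the slides):

```python
import hashlib

class BloomFilter:
    """Bloom filter sketch: k hash functions over an n-bit array."""

    def __init__(self, n_bits: int, k: int):
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)

    def _hashes(self, key: str):
        # Derive k (approximately independent) hash values by salting one hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key: str):
        for h in self._hashes(key):
            self.bits[h // 8] |= 1 << (h % 8)          # set bit h to 1

    def __contains__(self, key: str) -> bool:
        # True  => key *may* be in S (false positive possible)
        # False => key is definitely not in S (no false negatives)
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(key))

# Usage: n/m = 8 bits per key, k = 6 (near the optimal (n/m)·ln 2).
bf = BloomFilter(n_bits=8 * 1_000_000, k=6)
bf.add("good@example.com")
assert "good@example.com" in bf      # always passes: no false negatives
print("spam@example.com" in bf)      # usually False; True w/ prob. ≈ 0.02
```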
Counting distinct elements:
▪ The data stream consists of elements chosen from a universe of size N
▪ Maintain a count of the number of distinct elements seen so far

Obvious approach: maintain the set of elements seen so far
▪ That is, keep a hash table of all the distinct elements seen so far

Applications:
▪ How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?)
▪ How many different Web pages does each customer request in a week?
▪ How many distinct products have we sold in the last week?
Real problem: what if we do not have space to maintain the set of elements seen so far?
▪ Estimate the count in an unbiased way
▪ Accept that the count may have a little error, but limit the probability that the error is large

Flajolet-Martin approach:
▪ Pick a hash function h that maps each of the N elements to at least log₂ N bits
▪ For each stream element a, let r(a) be the number of trailing 0s in h(a)
▪ r(a) = position of the first 1, counting from the right; e.g., if h(a) = 12, then 12 is 1100 in binary, so r(a) = 2
▪ Record R = the maximum r(a) seen: R = max_a r(a), over all the items a seen so far
▪ Estimated number of distinct elements = 2^R

Very rough and heuristic intuition for why Flajolet-Martin works:
▪ h(a) hashes a with equal probability to any of N values, so h(a) is a sequence of log₂ N bits, where a 2^(−r) fraction of all a's have a tail of r zeros
▪ About 50% of a's hash to ***0
▪ About 25% of a's hash to **00
▪ So, if the longest tail we saw is r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far
▪ It takes hashing about 2^r distinct items before we see one with a zero-suffix of length r
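A single-hash Flajolet-Martin sketch in Python (the MD5-based hash and the test stream are illustrative choices):

```python
import hashlib

def trailing_zeros(x: int) -> int:
    """r(a): number of trailing 0s in the binary representation of x."""
    if x == 0:
        return 0  # convention; a hash of exactly 0 is vanishingly rare
    return (x & -x).bit_length() - 1   # isolate lowest set bit, take its index

def flajolet_martin(stream) -> int:
    """Estimate the number of distinct elements as 2^R."""
    R = 0
    for a in stream:
        h = int(hashlib.md5(str(a).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

# E.g., 10,000 distinct values, each repeated 5 times:
stream = [i % 10_000 for i in range(50_000)]
print(flajolet_martin(stream))  # a power of 2, typically within a factor of a few of 10,000
```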
Now we show why Flajolet-Martin works. Formally, we will show that the probability of finding a tail of r zeros:
▪ Goes to 1 if m ≫ 2^r
▪ Goes to 0 if m ≪ 2^r
where m is the number of distinct elements seen so far in the stream.

The probability that a given h(a) ends in at least r zeros is 2^(−r)
▪ h(a) hashes elements uniformly at random, and the probability that a random number ends in at least r zeros is 2^(−r)

Then the probability of NOT seeing a tail of length r among m elements is (1 − 2^(−r))^m
▪ (1 − 2^(−r)) is the probability that a given h(a) ends in fewer than r zeros; raising it to the mth power gives the probability that all m hashes do

Note: (1 − 2^(−r))^m = ((1 − 2^(−r))^(2^r))^(m·2^(−r)) ≈ e^(−m·2^(−r))

So the probability of NOT finding a tail of length r is:
▪ If m ≪ 2^r: (1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 1 as m/2^r → 0, so the probability of finding a tail of length r tends to 0
▪ If m ≫ 2^r: (1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 0 as m/2^r → ∞, so the probability of finding a tail of length r tends to 1

Thus, 2^R will almost always be around m!
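A quick numeric check of this approximation (values are illustrative):

```python
import math

r = 10                                   # so 2^r = 1024
for m in (10, 1_000, 1_000_000):
    exact = (1 - 2**-r) ** m             # P(no tail of length r among m items)
    approx = math.exp(-m * 2**-r)
    print(f"m={m:>9}: exact={exact:.4f}, approx={approx:.4f}")
# m << 2^r  -> probability near 1 (a tail of length r is unlikely to appear)
# m >> 2^r  -> probability near 0 (a tail of length r almost surely appears)
```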
Generalization to moments: the kth moment of a stream is Σ_{i∈A} (m_i)^k, where A is the set of distinct values and m_i is the number of times value i occurs in the stream
▪ 0th moment = number of distinct elements (the problem just considered)
▪ 1st moment = count of the number of elements = the length of the stream (easy to compute)
▪ 2nd moment = surprise number S, a measure of how uneven the distribution is

Example: stream of length 100 with 11 distinct values
▪ Item counts 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9: surprise S = 910
▪ Item counts 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1: surprise S = 8,110

AMS method:
▪ Works for all moments and gives an unbiased estimate
▪ We will just concentrate on the 2nd moment S
▪ We pick and keep track of many variables X:
▪ For each variable X we store X.el and X.val
▪ X.el corresponds to the item i; X.val corresponds to the count of item i
▪ Note this requires keeping counts in main memory, so the number of Xs is limited
▪ Our goal is to compute S = Σ_i (m_i)²
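A two-line check of these surprise numbers in Python:

```python
def surprise(counts):
    """2nd moment: S = sum of m_i^2 over all distinct items i."""
    return sum(m * m for m in counts)

print(surprise([10] + [9] * 10))   # 910   (even distribution, low surprise)
print(surprise([90] + [1] * 10))   # 8110  (skewed distribution, high surprise)
```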
How to set X.val and X.el?
▪ Assume the stream has length n (we relax this later)
▪ Pick some random time t (t < n) to start, so that any time is equally likely
▪ If at time t the stream has item i, we set X.el = i
▪ From then on we maintain the count c (X.val = c) of the number of i's in the stream starting from the chosen time t

Then the estimate of the 2nd moment (Σ_i (m_i)²) is S = f(X) = n·(2c − 1)
▪ Note: we will keep track of multiple Xs (X_1, X_2, …, X_k), and our final estimate will be S = (1/k)·Σ_{j=1}^{k} f(X_j)

Expectation analysis (example stream: a a b b b a b a):
▪ Let c_t be the number of times the item at time t appears from time t onwards (here c_1 = m_a, c_2 = m_a − 1, c_3 = m_b)
▪ m_i is the total count of item i in the stream (we are assuming the stream has length n)

E[f(X)] = (1/n)·Σ_{t=1}^{n} n·(2c_t − 1)

▪ Group the times t by the value i seen there: the time when the last i is seen has c_t = 1, the time when the penultimate i is seen has c_t = 2, …, the time when the first i is seen has c_t = m_i
▪ So E[f(X)] = (1/n)·Σ_i n·(1 + 3 + 5 + ⋯ + (2m_i − 1))
▪ Little side calculation: 1 + 3 + 5 + ⋯ + (2m_i − 1) = Σ_{j=1}^{m_i} (2j − 1) = 2·(m_i(m_i + 1)/2) − m_i = (m_i)²
▪ Then E[f(X)] = (1/n)·Σ_i n·(m_i)²
▪ So E[f(X)] = Σ_i (m_i)² = S: we have the second moment (in expectation)!
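A minimal sketch of this fixed-length-stream estimator in Python, a direct translation of f(X) = n·(2c − 1); the example stream and choice of k are illustrative:

```python
import random
from collections import Counter

def ams_second_moment(stream, k: int) -> float:
    """AMS estimate of S = sum_i m_i^2 using k variables X."""
    n = len(stream)
    times = random.sample(range(n), k)        # k uniformly random start times t
    estimates = []
    for t in times:
        el = stream[t]                        # X.el = item at time t
        c = stream[t:].count(el)              # X.val = count of el from t onward
        estimates.append(n * (2 * c - 1))     # f(X) = n(2c - 1)
    return sum(estimates) / k                 # average the k unbiased estimates

stream = list("aabbbaba")                     # m_a = 4, m_b = 4, so true S = 32
print(ams_second_moment(stream, k=4))
print(sum(m * m for m in Counter(stream).values()))   # exact answer: 32
```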
For estimating the kth moment we essentially use the same algorithm but change the estimate:
▪ For k = 2 we used n·(2c − 1)
▪ For k = 3 we use n·(3c² − 3c + 1) (where c = X.val)
▪ Why? For k = 2, remember we had 1 + 3 + 5 + ⋯ + (2m_i − 1), and the terms 2c − 1 (for c = 1, …, m) sum to m²:
Σ_{c=1}^{m} (2c − 1) = Σ_{c=1}^{m} c² − Σ_{c=1}^{m} (c − 1)² = m², so 2c − 1 = c² − (c − 1)²
▪ For k = 3: c³ − (c − 1)³ = 3c² − 3c + 1
▪ Generally: estimate = n·(c^k − (c − 1)^k)

In practice:
▪ Compute f(X) = n·(2c − 1) for as many variables X as you can fit in memory
▪ Average them in groups, then take the median of the averages

Problem: streams never end
▪ We assumed there was a number n, the number of positions in the stream
▪ But real streams go on forever, so n is a variable: the number of inputs seen so far

Fixups (sketched in code below):
(1) The variables X have n as a factor: keep n separately and just hold the count in X
(2) Suppose we can only store k counts; we must throw some Xs out as time goes on
▪ Objective: each starting time t is selected with probability k/n
▪ Solution: fixed-size (reservoir) sampling!
▪ Choose the first k times for the k variables
▪ When the nth element arrives (n > k), choose its time as a start time with probability k/n
▪ If you choose it, throw out one of the previously stored variables X, with equal probability
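A sketch of the streaming version, combining the two fixups (n kept separately, reservoir sampling of start times); class and method names are our own:

```python
import random

class StreamingAMS:
    """AMS 2nd-moment estimator over an unbounded stream, keeping k variables."""

    def __init__(self, k: int):
        self.k = k
        self.n = 0          # number of stream elements seen so far
        self.vars = []      # list of [X.el, X.val]

    def add(self, item):
        self.n += 1
        # Every live variable whose element reappears increments its count.
        for X in self.vars:
            if X[0] == item:
                X[1] += 1
        # Reservoir sampling over start times: keep time n with prob. k/n.
        if len(self.vars) < self.k:
            self.vars.append([item, 1])
        elif random.random() < self.k / self.n:
            self.vars[random.randrange(self.k)] = [item, 1]

    def estimate(self) -> float:
        # f(X) = n(2c - 1), averaged over the variables.
        return sum(self.n * (2 * c - 1) for _, c in self.vars) / len(self.vars)

ams = StreamingAMS(k=20)
for x in "aabbbaba" * 100:
    ams.add(x)
print(ams.estimate())   # true S here: 400^2 + 400^2 = 320,000
```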
New problem: given a stream of baskets, which items appear more than s times in the window?
Possible solution: think of the stream of baskets as one binary stream per item
▪ 1 = item present; 0 = item not present
▪ Use DGIM to estimate the count of 1s for each item

[Figure: a 0/1 stream such as 010011100010100100010110110111001010110011010 over a window of length N, covered by DGIM buckets of exponentially growing sizes]

In principle, you could count frequent pairs or even larger itemsets the same way
▪ One stream per itemset
Drawbacks:
▪ Only approximate
▪ The number of itemsets is way too big
Exponentially decaying windows: if the stream is a_1, a_2, … and we are taking a sum over the stream, take the answer at time t to be:
S_t = Σ_{i=1}^{t} a_i·(1 − c)^(t−i)
▪ c is a constant, presumably tiny, like 10^(−6) or 10^(−9)
▪ When a new element a_{t+1} arrives: multiply the current sum by (1 − c) and add a_{t+1}

Counting a single item x: imagine a separate 0/1 stream for each item (1 if x appears, 0 if x does not)
▪ When a new item x arrives:
▪ Multiply all counts by (1 − c)
▪ Add +1 to the count for element x
▪ Call this sum the "weight" of item x

Important property: the sum over all weights, Σ_t (1 − c)^t, is 1/[1 − (1 − c)] = 1/c
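A sketch of a decaying item counter in Python. Rather than literally multiplying every count on each arrival, this variant (our design choice, not the slides') keeps a global scale factor and applies it lazily on read, which is mathematically equivalent:

```python
class DecayingCounts:
    """Exponentially decaying item weights (c is illustrative)."""

    def __init__(self, c: float = 1e-6):
        self.decay = 1.0 - c
        self.scale = 1.0          # (1-c)^t, applied lazily
        self.raw = {}             # item -> weight / scale

    def add(self, x):
        self.scale *= self.decay  # "multiply all counts by (1 - c)"
        self.raw[x] = self.raw.get(x, 0.0) + 1.0 / self.scale  # "+1 for x"
        # (a production version would renormalize occasionally: scale -> 0 over time)

    def weight(self, x) -> float:
        return self.raw.get(x, 0.0) * self.scale

counts = DecayingCounts(c=0.1)    # large c so the decay is visible
for x in ["m1", "m2", "m1", "m1"]:
    counts.add(x)
print(counts.weight("m1"))        # 0.9^3 + 0.9 + 1 = 2.629
print(counts.weight("m2"))        # 0.9^2 = 0.81
```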
Example: what are "currently" the most popular movies?
▪ Suppose we want to find movies of weight > 1/2
▪ By the property above, the sum over all weights Σ_t (1 − c)^t is 1/[1 − (1 − c)] = 1/c
▪ Thus: there cannot be more than 2/c movies with weight 1/2 or more
▪ So 2/c is a limit on the number of movies being counted at any time

Extension: count (some) itemsets in an exponentially decaying window
▪ What are currently "hot" itemsets?
▪ Problem: too many itemsets to keep counts of all of them in memory
When a basket B comes in (sketched in code at the end of this section):
▪ Multiply all counts by (1 − c)
▪ Add 1 to the count of any item in B and of any itemset contained in B that is already being counted; for uncounted items in B, create a new count
▪ Drop counts < 1/2
▪ Initiate new counts, as follows

Initiating new counts: start a count for an itemset S ⊆ B if every proper subset of S had a count prior to the arrival of basket B
▪ Intuitively: if all subsets of S are being counted, they are "frequent/hot", so S has the potential to be "hot"
▪ Example: start counting S = {i, j} iff both i and j were counted prior to seeing B
▪ Start counting S = {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B
Memory requirement: counts for single items < (2/c)·(average number of items in a basket)
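A sketch of the whole basket-update loop under these rules, reusing the lazy global-scale trick from the earlier sketch (the decay constant and test baskets are illustrative; a real system would also bound the itemset size):

```python
from itertools import combinations

c = 1e-3                          # decay constant (illustrative)
decay, scale = 1.0 - c, 1.0
counts = {}                       # frozenset(itemset) -> weight / scale

def process_basket(basket):
    global scale
    scale *= decay                # multiply all counts by (1 - c), lazily
    items = frozenset(basket)
    before = set(counts)          # itemsets counted *prior* to this basket
    # Add 1 for each item in B, creating counts for uncounted items.
    for i in items:
        s = frozenset([i])
        counts[s] = counts.get(s, 0.0) + 1.0 / scale
    # Add 1 to any already-counted itemset contained in B.
    for S in before:
        if len(S) > 1 and S <= items:
            counts[S] += 1.0 / scale
    # Initiate S ⊆ B if every proper subset of S was counted before B arrived
    # (checking the immediate subsets suffices, as in the slides' example).
    for size in range(2, len(items) + 1):
        for S in map(frozenset, combinations(sorted(items), size)):
            if S not in counts and all(
                frozenset(T) in before for T in combinations(S, size - 1)
            ):
                counts[S] = 1.0 / scale
    # Drop counts that fell below 1/2.
    for S in [S for S, w in counts.items() if w * scale < 0.5]:
        del counts[S]

for B in [{"i", "j"}, {"i", "j", "k"}, {"i", "j", "k"}]:
    process_basket(B)
print({tuple(sorted(S)): round(w * scale, 2) for S, w in counts.items()})
# {i,j} starts being counted only once both i and j were seen in an earlier basket;
# {i,j,k} would start only after all three of its pairs had counts.
```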