ch04 Streams2

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views4 pages

ch04 Streams2

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

Note to other teachers and users of these slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org

 More algorithms for streams:
▪ (1) Filtering a data stream: Bloom filters
  ▪ Select elements with property x from the stream
▪ (2) Counting distinct elements: Flajolet-Martin
  ▪ Number of distinct elements in the last k elements of the stream
▪ (3) Estimating moments: AMS method
  ▪ Estimate the std. dev. of the last k elements
▪ (4) Counting frequent items

 Each element of the data stream is a tuple
 Given a list of keys S
 Determine which tuples of the stream are in S
 Obvious solution: Hash table
▪ But suppose we do not have enough memory to store all of S in a hash table
▪ E.g., we might be processing millions of filters on the same stream

 Example: Email spam filtering
▪ We know 1 billion “good” email addresses
▪ If an email comes from one of these, it is NOT spam
 Publish-subscribe systems
▪ You are collecting lots of messages (news articles)
▪ People express interest in certain sets of keywords
▪ Determine whether each message matches a user’s interest

 First cut (a single hash function): given the set of keys S that we want to filter
▪ Create a bit array B of n bits, initially all 0s
▪ Choose a hash function h with range [0, n)
▪ Hash each member s ∈ S to one of the n buckets, and set that bit to 1, i.e., B[h(s)] = 1
▪ Hash each element a of the stream and output only those that hash to a bit that was set to 1, i.e., output a if B[h(a)] == 1 (a minimal sketch follows)
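A minimal sketch of this first-cut filter, assuming string keys and using Python's built-in hash() reduced modulo n as the hash function h (any hash with range [0, n) would do); the addresses and the tiny n are made up for illustration:

```python
n = 8            # number of bits in B (tiny, for illustration only)
B = [0] * n      # bit array, initially all 0s

def h(key):
    return hash(key) % n          # stand-in for a hash function with range [0, n)

# Initialization: set B[h(s)] = 1 for every key s in S
S = {"alice@example.com", "bob@example.com"}
for s in S:
    B[h(s)] = 1

# Stream processing: output element a only if B[h(a)] == 1
stream = ["alice@example.com", "spammer@example.com", "bob@example.com"]
for a in stream:
    if B[h(a)] == 1:
        print("may be in S:", a)  # could be a false positive
    # else: surely not in S, so the element is dropped
```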

Output the item since it may be in S.


Item hashes to a bucket that at least
 |S| = 1 billion email addresses  More accurate analysis for the number of
one of the items in S hashed to. |B|= 1GB = 8 billion bits false positives
Item

Hash
 If the email address is in S, then it surely  Consider: If we throw m darts into n equally
func h hashes to a bucket that has the big set to 1,
likely targets, what is the probability that
so it always gets through (no false negatives)
0010001011000 Bit array B a target gets at least one dart?
Drop the item.  Approximately 1/8 of the bits are set to 1, so
It hashes to a bucket set
to 0 so it is surely not in S. about 1/8th of the addresses not in S get  In our case:
through to the output (false positives) ▪ Targets = bits/buckets
 Creates false positives but no false negatives
▪ Actually, less than 1/8th, because more than one ▪ Darts = hash values of items
▪ If the item is in S we surely output it, if not we may address might hash to the same bit
still output it
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 7 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9

 We have m darts and n targets
 What is the probability that a target gets at least one dart?
▪ Probability that some target X is not hit by any dart: (1 − 1/n)^m = ((1 − 1/n)^n)^(m/n), and (1 − 1/n)^n → 1/e as n → ∞, so this is approximately e^(−m/n)
▪ Probability that at least one dart hits target X: 1 − (1 − 1/n)^m ≈ 1 − e^(−m/n)

 Fraction of 1s in the array B = probability of a false positive = 1 − e^(−m/n)
 Example: 10^9 darts, 8·10^9 targets
▪ Fraction of 1s in B = 1 − e^(−1/8) = 0.1175
▪ Compare with our earlier estimate: 1/8 = 0.125

 The Bloom filter: consider |S| = m, |B| = n, and use k independent hash functions h_1, ..., h_k
 Initialization:
▪ Set B to all 0s
▪ Hash each element s ∈ S using each hash function h_i, and set B[h_i(s)] = 1, for each i = 1, ..., k (note: we have a single array B!)
 Run-time:
▪ When a stream element with key x arrives, if B[h_i(x)] = 1 for all i = 1, ..., k then declare that x is in S; that is, x hashes to a bucket set to 1 for every hash function h_i
▪ Otherwise discard the element x (a sketch follows)
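The same idea with k hash functions sharing a single bit array, sketched below. Deriving the k hash functions from two halves of a SHA-256 digest (double hashing) is an implementation convenience assumed here, not something the slides prescribe:

```python
import hashlib

class BloomFilter:
    def __init__(self, n, k):
        self.n = n                 # number of bit positions in B
        self.k = k                 # number of hash functions
        self.B = bytearray(n)      # single array B, all 0s (one byte per bit, for simplicity)

    def _hashes(self, x):
        d = hashlib.sha256(x.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1          # keep h2 odd
        return [(h1 + i * h2) % self.n for i in range(self.k)]

    def add(self, s):
        for pos in self._hashes(s):    # set B[h_i(s)] = 1 for every i
            self.B[pos] = 1

    def might_contain(self, x):
        # declare x in S only if B[h_i(x)] == 1 for all i = 1, ..., k
        return all(self.B[pos] == 1 for pos in self._hashes(x))

bf = BloomFilter(n=8000, k=6)
bf.add("good@example.com")
print(bf.might_contain("good@example.com"))   # True: no false negatives
print(bf.might_contain("bad@example.com"))    # usually False; occasionally a false positive
```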

 What fraction of the bit vector B are 1s?
▪ We are throwing k·m darts at n targets
▪ So the fraction of 1s is (1 − e^(−km/n))
 But we have k independent hash functions, and we only let the element x through if all k of them hash x to a bucket of value 1
 So, false positive probability = (1 − e^(−km/n))^k

 Example: m = 1 billion, n = 8 billion
▪ k = 1: (1 − e^(−1/8)) = 0.1175
▪ k = 2: (1 − e^(−1/4))^2 ≈ 0.049
 What happens as we keep increasing k?
[Figure: false positive probability vs. number of hash functions k (k = 0 to 20); the curve falls to a minimum around k = 6 and then rises again]
 “Optimal” value of k: (n/m) ln 2
▪ In our case: optimal k = 8 ln 2 = 5.54 ≈ 6
▪ Error at k = 6: (1 − e^(−6/8))^6 ≈ 0.0216 (checked numerically below)

 Bloom filters guarantee no false negatives and use limited memory
▪ Great for pre-processing before more expensive checks
 Suitable for hardware implementation
▪ Hash function computations can be parallelized
 Is it better to have 1 big B or k small Bs?
▪ It is the same: (1 − e^(−km/n))^k vs. (1 − e^(−m/(n/k)))^k
▪ But keeping 1 big B is simpler
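A quick numeric check of the figures above (m = 10^9 keys, n = 8·10^9 bits), using only the standard library:

```python
import math

m, n = 10**9, 8 * 10**9

def false_positive_prob(k, m, n):
    return (1 - math.exp(-k * m / n)) ** k    # (1 - e^(-km/n))^k

print(false_positive_prob(1, m, n))           # ~0.1175
print(false_positive_prob(2, m, n))           # ~0.049
print((n / m) * math.log(2))                  # optimal k = (n/m) ln 2, about 5.55
print(false_positive_prob(6, m, n))           # ~0.0216 at k = 6
```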
 Problem:
▪ The data stream consists of elements chosen from a universe of size N
▪ Maintain a count of the number of distinct elements seen so far
 Obvious approach: maintain the set of elements seen so far
▪ That is, keep a hash table of all the distinct elements seen so far

 Applications:
▪ How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?)
▪ How many different Web pages does each customer request in a week?
▪ How many distinct products have we sold in the last week?

 Real problem: what if we do not have space to maintain the set of elements seen so far?
 Estimate the count in an unbiased way
 Accept that the count may have a little error, but limit the probability that the error is large

 The Flajolet-Martin approach:
▪ Pick a hash function h that maps each of the N elements to at least log2 N bits
▪ For each stream element a, let r(a) be the number of trailing 0s in h(a)
  ▪ r(a) = position of the first 1, counting from the right
  ▪ E.g., say h(a) = 12; 12 is 1100 in binary, so r(a) = 2
▪ Record R = the maximum r(a) seen: R = max_a r(a) over all the items a seen so far
▪ Estimated number of distinct elements = 2^R (a sketch follows)

 Very rough and heuristic intuition for why Flajolet-Martin works:
▪ h(a) hashes a with equal probability to any of N values, so h(a) is a sequence of log2 N bits in which a 2^(−r) fraction of all a's have a tail of r zeros
  ▪ About 50% of a's hash to ***0
  ▪ About 25% of a's hash to **00
▪ So, if the longest tail we saw has r = 2 (i.e., some item hash ends in *100), then we have probably seen about 4 distinct items so far
▪ In general, it takes hashing about 2^r distinct items before we see one with a zero-suffix of length r
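A minimal single-hash sketch of the estimator just described; hashing to 64 bits via SHA-256 stands in for the "at least log2 N bits" hash h, and the stream contents are made up:

```python
import hashlib

def h(a):
    # hash a to a 64-bit integer (plays the role of h mapping to >= log2 N bits)
    return int.from_bytes(hashlib.sha256(str(a).encode()).digest()[:8], "big")

def r(x):
    # number of trailing zeros in the binary representation of x
    if x == 0:
        return 64          # all bits zero; cap at the hash width
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

R = 0
stream = ["a", "b", "a", "c", "b", "d", "a", "e"]   # hypothetical stream
for a in stream:
    R = max(R, r(h(a)))

print("estimated number of distinct elements:", 2 ** R)
```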

−=
−−

−r −r
Note: (1 − 2−r )m = (1 − 2−r )2 ( m2 )  e−m2
r
 Now we show why Flajolet-Martin works  What is the probability that a given h(a) ends 
in at least r zeros is 2-r  Prob. of NOT finding a tail of length r is:
 Formally, we will show that probability of ▪ h(a) hashes elements uniformly at random ▪ If m << 2r, then prob. tends to 1
finding a tail of r zeros: −r
▪ (1 − 2− r )m  e− m 2 = 1 as m/2r→ 0
▪ Probability that a random number ends in
▪ Goes to 1 if 𝒎 ≫ 𝟐𝒓 at least r zeros is 2-r ▪ So, the probability of finding a tail of length r tends to 0
▪ Goes to 0 if 𝒎 ≪ 𝟐𝒓  Then, the probability of NOT seeing a tail ▪ If m >> 2r, then prob. tends to 0
−r
where 𝒎 is the number of distinct elements of length r among m elements: ▪ (1 − 2− r )m  e− m 2 = 0 as m/2r → 
seen so far in the stream 𝟏 − 𝟐−𝒓 𝒎 ▪ So, the probability of finding a tail of length r tends to 1
 Thus, 2R will almost always be around m!
Prob. all end in Prob. that given h(a) ends  Thus, 2R will almost always be around m!
fewer than r zeros. in fewer than r zeros

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 22 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 23 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 24

 One subtlety: E[2^R] is actually infinite
▪ The probability halves each time R → R + 1, but the value 2^R doubles
 The workaround is to use many hash functions h_i and obtain many samples R_i
 How are the samples R_i combined?
▪ Average? One very large value 2^(R_i) can dominate the average
▪ Median? All estimates are a power of 2
▪ Solution: partition the samples into small groups, take the median of each group, then take the average of those medians (a sketch follows)
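A sketch of that combining rule, taking the median within small groups and then the mean of the group medians; the estimate values are made-up stand-ins for the per-hash-function samples 2^(R_i):

```python
import statistics

def combine(estimates, group_size=5):
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    medians = [statistics.median(g) for g in groups]   # median within each small group
    return statistics.mean(medians)                    # average of the medians

estimates = [2**7, 2**8, 2**6, 2**8, 2**9,   # hypothetical samples from h_1 ... h_5
             2**8, 2**7, 2**10, 2**8, 2**7]  # hypothetical samples from h_6 ... h_10
print(combine(estimates))
```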

[Alon, Matias, and Szegedy]

 iA
(mi ) k 

Stream of length 100
11 distinct values


AMS method works for all moments
Gives an unbiased estimate
 0thmoment = number of distinct elements  We will just concentrate on the 2nd moment S
 Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9  We pick and keep track of many variables X:
▪ The problem just considered Surprise S = 910
 1st moment = count of the numbers of ▪ For each variable X we store X.el and X.val
▪ X.el corresponds to the item i
elements = length of the stream  Item counts: 90, 1, 1, 1, 1, 1, 1, 1 ,1, 1, 1
▪ X.val corresponds to the count of item i
▪ Easy to compute Surprise S = 8,110
▪ Note this requires a count in main memory,
 2nd moment = surprise number S = so number of Xs is limited
a measure of how uneven the distribution is  Our goal is to compute 𝑺 = σ𝒊 𝒎𝟐𝒊
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 28 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 29 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 30
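As a quick check of these numbers: with counts 10, 9, ..., 9 the surprise number is S = 10² + 10·9² = 100 + 810 = 910, while with counts 90, 1, ..., 1 it is S = 90² + 10·1² = 8,100 + 10 = 8,110, so the second, much more skewed stream is far more "surprising" even though both streams have length 100.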
 AMS method [Alon, Matias, and Szegedy]
▪ Works for all moments and gives an unbiased estimate
▪ We will concentrate on the 2nd moment S
 We pick and keep track of many variables X:
▪ For each variable X we store X.el and X.val
  ▪ X.el corresponds to the item i
  ▪ X.val corresponds to the count of item i
▪ Note this requires a count in main memory, so the number of Xs is limited
 Our goal is to compute S = Σ_i m_i²

 How to set X.el and X.val?
▪ Assume the stream has length n (we relax this later)
▪ Pick some random time t (t < n) to start, so that any time is equally likely
▪ If at time t the stream has item i, set X.el = i
▪ Then maintain a count c (X.val = c) of the number of occurrences of i in the stream from time t onwards
 The estimate of the 2nd moment Σ_i m_i² is then S = f(X) = n (2c − 1)
▪ Note, we will keep track of multiple variables X_1, X_2, ..., X_k, and our final estimate will be S = (1/k) Σ_j f(X_j)

 Why is this unbiased? Consider the example stream a a b b b a b a, with item counts m_a and m_b
 Let c_t be the number of times the item arriving at time t appears from time t onwards (here c_1 = m_a, c_2 = m_a − 1, c_3 = m_b, ...)
 Since each start time t is chosen with probability 1/n, E[f(X)] = (1/n) Σ_{t=1}^{n} n (2c_t − 1)
▪ Group the times t by the value i seen there: the time when the last i arrives has c_t = 1, the penultimate i has c_t = 2, ..., the first i has c_t = m_i
▪ So E[f(X)] = (1/n) Σ_i n (1 + 3 + 5 + ... + (2m_i − 1))
▪ Little side calculation: 1 + 3 + 5 + ... + (2m_i − 1) = Σ_{j=1}^{m_i} (2j − 1) = 2 · m_i(m_i + 1)/2 − m_i = (m_i)²
 Then E[f(X)] = (1/n) Σ_i n (m_i)² = Σ_i (m_i)² = S
 We have the second moment in expectation! (A sketch of the estimator follows.)
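A minimal sketch of the estimator for a stream of known length n, following the construction above; the number of variables k and the toy stream are arbitrary choices for illustration:

```python
import random
from collections import Counter

def ams_second_moment(stream, k):
    n = len(stream)
    estimates = []
    for _ in range(k):
        t = random.randrange(n)                        # pick a random start time t
        el = stream[t]                                 # X.el = item at time t
        c = sum(1 for a in stream[t:] if a == el)      # X.val = count of X.el from t onwards
        estimates.append(n * (2 * c - 1))              # f(X) = n(2c - 1)
    return sum(estimates) / k                          # average over the k variables

stream = list("aabbbaba")   # the toy stream above: m_a = m_b = 4, so S = 16 + 16 = 32
print(ams_second_moment(stream, k=200))                # close to 32 on average
print(sum(v * v for v in Counter(stream).values()))    # exact S, for comparison
```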

 For estimating kth moment we essentially use the  In practice:  (1) The variables X have n as a factor –
same algorithm but change the estimate: ▪ Compute 𝒇(𝑿) = 𝒏(𝟐 𝒄 – 𝟏) for keep n separately; just hold the count in X
▪ For k=2 we used n (2·c – 1) as many variables X as you can fit in memory  (2) Suppose we can only store k counts.
▪ For k=3 we use: n (3·c2 – 3c + 1) (where c=X.val) ▪ Average them in groups We must throw some Xs out as time goes on:
 Why? ▪ Take median of averages ▪ Objective: Each starting time t is selected with
probability k/n
▪ For k=2: Remember we had 1 + 3 + 5 + ⋯ + 2𝑚𝑖 − 1  Problem: Streams never end
and we showed terms 2c-1 (for c=1,…,m) sum to m2 ▪ Solution: (fixed-size sampling!)
▪ We assumed there was a number n, ▪ Choose the first k times for k variables
▪ σ𝑚 𝑚 2 𝑚
𝑐=1 2𝑐 − 1 = σ𝑐=1 𝑐 − σ𝑐=1 𝑐 − 1
2 = 𝑚2
the number of positions in the stream ▪ When the nth element arrives (n > k), choose it with
▪ So: 𝟐𝒄 − 𝟏 = 𝒄𝟐 − 𝒄 − 𝟏 𝟐
▪ But real streams go on forever, so n is probability k/n
▪ For k=3: c3 - (c-1)3 = 3c2 - 3c + 1
a variable – the number of inputs seen so far ▪ If you choose it, throw one of the previously stored
 Generally: Estimate = 𝑛 (𝑐 𝑘 − 𝑐 − 1 𝑘 ) variables X out, with equal probability
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 34 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 35 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 36
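A sketch of how fix (1) and fix (2) combine for an unbounded stream: only (X.el, X.val) pairs are stored, n is kept separately and multiplied in at estimation time, and the variables are maintained by reservoir-style sampling so each start time is kept with probability k/n. Details such as the list representation are assumptions of this sketch:

```python
import random

def ams_streaming(stream_iter, k):
    variables = []           # at most k pairs [X.el, X.val]; n is kept outside the Xs
    n = 0
    for a in stream_iter:
        n += 1
        for v in variables:                  # every stored X counts further occurrences of X.el
            if v[0] == a:
                v[1] += 1
        if len(variables) < k:               # the first k times each start a variable
            variables.append([a, 1])
        elif random.random() < k / n:        # keep time n with probability k/n ...
            victim = random.randrange(k)     # ... evicting one old X chosen uniformly
            variables[victim] = [a, 1]
    # multiply by n only now, since n kept growing while the stream was read
    return sum(n * (2 * c - 1) for _, c in variables) / len(variables)

print(ams_streaming(iter("aabbbaba"), k=4))
```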

 New problem: given a stream of baskets, which items appear more than s times in the window?
 Possible solution: think of the stream of baskets as one binary stream per item
▪ 1 = item present; 0 = item not present
▪ Use DGIM to estimate the count of 1s for each item
[Figure: DGIM buckets of exponentially growing sizes over the binary stream 010011100010100100010110110111001010110011010, within a window of length N]
 In principle, you could count frequent pairs or even larger itemsets the same way
▪ One stream per itemset
 Drawbacks:
▪ Only approximate
▪ The number of itemsets is far too big

 Exponentially decaying windows: a heuristic for selecting likely frequent item(sets)
▪ What are “currently” the most popular movies?
▪ Instead of computing the raw count over the last N elements, compute a smooth aggregation over the whole stream
 If the stream is a_1, a_2, ... and we are taking the sum of the stream, take the answer at time t to be Σ_{i=1}^{t} a_i (1 − c)^(t−i)
▪ c is a constant, presumably tiny, like 10^(−6) or 10^(−9)
 When a new element a_{t+1} arrives: multiply the current sum by (1 − c) and add a_{t+1}
 Important property: the sum over all weights, Σ_t (1 − c)^t, is 1/[1 − (1 − c)] = 1/c

 If each a_i is an “item”, we can compute the characteristic function of each possible item x as an exponentially decaying window
▪ That is: Σ_{i=1}^{t} δ_i · (1 − c)^(t−i), where δ_i = 1 if a_i = x, and 0 otherwise
▪ Imagine that for each item x we have a binary stream (1 if x appears, 0 if x does not)
▪ When a new item x arrives: multiply all counts by (1 − c), then add 1 to the count for element x (a sketch follows)
 Call this sum the “weight” of item x
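A sketch of these decayed per-item weights; c is deliberately large here so the decay is visible on a short made-up stream (in practice c would be tiny, e.g. 10^(−6)):

```python
c = 0.1
weights = {}                          # item -> exponentially decayed "weight"

def arrive(x):
    for item in weights:              # multiply all counts by (1 - c)
        weights[item] *= (1 - c)
    weights[x] = weights.get(x, 0.0) + 1.0   # add 1 to the count for element x

for movie in ["A", "A", "B", "A", "C", "A", "B", "A"]:
    arrive(movie)

print(weights)                        # "A" carries by far the largest weight
```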

 What are “currently” the most popular movies? Suppose we want to find movies of weight > 1/2
▪ Recall the important property: the sum over all weights, Σ_t (1 − c)^t, is 1/[1 − (1 − c)] = 1/c
▪ Thus there cannot be more than 2/c movies with weight 1/2 or more
▪ So 2/c is a limit on the number of movies being counted at any time

 Counting (some) itemsets in an exponentially decaying window
▪ What are the currently “hot” itemsets?
▪ Problem: too many itemsets to keep counts of all of them in memory
 When a basket B comes in:
▪ Multiply all counts by (1 − c)
▪ For uncounted items in B, create a new count
▪ Add 1 to the count of any item in B and of any itemset contained in B that is already being counted
▪ Drop counts < 1/2
▪ Initiate new counts as follows
 Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to the arrival of basket B
▪ Intuitively: if all subsets of S are being counted, they are “frequent/hot”, so S has the potential to be “hot”
▪ Example: start counting S = {i, j} iff both i and j were counted prior to seeing B
▪ Example: start counting S = {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B (a sketch follows)
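A sketch of this basket-processing loop with the 1/2 threshold; representing itemsets as frozensets and the choice of c are assumptions of the sketch, and the baskets are made up:

```python
from itertools import combinations

c = 0.01
counts = {}                                    # frozenset (item or itemset) -> decayed count

def process_basket(basket):
    basket = set(basket)
    before = set(counts)                       # itemsets counted prior to this basket
    for s in list(counts):                     # multiply all counts by (1 - c)
        counts[s] *= (1 - c)
        if counts[s] < 0.5:                    # drop counts < 1/2
            del counts[s]
    for item in basket:                        # items in B: add 1, creating counts if needed
        key = frozenset([item])
        counts[key] = counts.get(key, 0.0) + 1.0
    for size in range(2, len(basket) + 1):     # itemsets S contained in B
        for s in combinations(sorted(basket), size):
            s = frozenset(s)
            if s in counts:                    # already being counted: add 1
                counts[s] += 1.0
            elif all(frozenset(p) in before    # start a count only if every proper
                     for p in combinations(s, size - 1)):  # subset was counted before B
                counts[s] = 1.0

process_basket(["milk", "bread"])
process_basket(["milk", "bread", "beer"])      # now {milk, bread} starts being counted
print(counts)
```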
 Counts for single items < (2/c) · (average number of items in a basket)
 Counts for larger itemsets = ??
 But we are conservative about starting counts of large sets
▪ If we counted every set we saw, one basket of 20 items would initiate about 1 million (2^20) counts
