ch04 Streams2
Note to other teachers and users of these slides: We would be delighted if you found our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
Filtering data streams:
▪ Each element of the data stream is a tuple
▪ Given a list of keys S, determine which tuples of the stream are in S
▪ Obvious solution: a hash table
▪ But suppose we do not have enough memory to store all of S in a hash table
▪ E.g., we might be processing millions of filters on the same stream

Applications:
▪ Email spam filtering: we know 1 billion "good" email addresses; if an email comes from one of these, it is NOT spam
▪ Publish-subscribe systems: you are collecting lots of messages (news articles); people express interest in certain sets of keywords; determine whether each message matches a user's interest

First-cut solution:
▪ Given a set of keys S that we want to filter
▪ Create a bit array B of n bits, initially all 0s
▪ Choose a hash function h with range [0, n)
▪ Hash each member s of S to one of the n buckets, and set that bit to 1, i.e., B[h(s)] = 1
▪ Hash each element a of the stream and output only those that hash to a bit that was set to 1, i.e., output a if B[h(a)] == 1
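A minimal sketch of this first-cut filter in Python (the array size, the SHA-256-based hash, and the example addresses are illustrative choices, not from the slides; a Python set stands in for the bit array):

```python
import hashlib

n = 1_000_000_000            # size of the bit array (illustrative)

def h(key: str) -> int:
    """Hash a key to a bucket in [0, n)."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n

S = {"alice@example.com", "bob@example.com"}   # stand-in for the key set S
B = set()                    # stand-in for the bit array: bit i is 1 iff i in B
for s in S:
    B.add(h(s))              # B[h(s)] = 1

def passes_filter(a: str) -> bool:
    """True if a *may* be in S (false positives possible, no false negatives)."""
    return h(a) in B         # output a iff B[h(a)] == 1
```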
[Figure: an item from the stream is hashed by h into the bit array B (e.g., 0010001011000); if it hashes to a bit set to 0, it is surely not in S, so we drop it]

If the email address is in S, then it surely hashes to a bucket whose bit is set to 1, so it always gets through (no false negatives)

With 1 billion addresses hashed into 8 billion bits, approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses not in S get through to the output (false positives)
▪ Actually, less than 1/8th, because more than one address might hash to the same bit

The scheme creates false positives but no false negatives
▪ If the item is in S we surely output it; if not, we may still output it

Analysis: if we throw m darts into n equally likely targets, what is the probability that a given target gets at least one dart?
In our case:
▪ Targets = bits/buckets
▪ Darts = hash values of items
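The standard calculation behind this, consistent with the (1 − e^(−km/n)) expression used on the following slides:

```latex
P(\text{a given target gets at least one dart})
  = 1 - \Big(1 - \tfrac{1}{n}\Big)^{m}
  = 1 - \Big(\big(1 - \tfrac{1}{n}\big)^{n}\Big)^{m/n}
  \approx 1 - e^{-m/n}
```

With m = 1 billion darts and n = 8 billion targets this gives 1 − e^(−1/8) ≈ 0.1175, the fraction of set bits quoted above.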
What fraction of the bit vector B are 1s?
▪ Throwing k·m darts at n targets
▪ So the fraction of 1s is (1 − e^(−km/n))

But we have k independent hash functions, and we only let element x through if all k of them hash x to a bucket of value 1
▪ So the false positive probability is (1 − e^(−km/n))^k

Example: m = 1 billion, n = 8 billion
▪ k = 1: (1 − e^(−1/8)) = 0.1175
▪ k = 2: (1 − e^(−1/4))² ≈ 0.0489

What happens as we keep increasing k?
[Plot: false positive probability (y axis, 0 to 0.2) vs. number of hash functions k (x axis, 0 to 20); the curve falls, bottoms out, then rises. The minimum is at k = (n/m)·ln 2, here 8·ln 2 ≈ 5.5, so k = 6 is near-optimal]

Bloom filters guarantee no false negatives, and use limited memory
▪ Great for pre-processing before more expensive checks
Suitable for hardware implementation
▪ Hash function computations can be parallelized
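A minimal, self-contained Bloom filter sketch in Python (the parameter choices, the salted-SHA-256 hashing scheme, and the class name are ours, not from the slides):

```python
import hashlib

class BloomFilter:
    """Bloom filter sketch: k hash functions over an n-bit array."""

    def __init__(self, n_bits: int, k: int):
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)

    def _hashes(self, key: str):
        # Derive k (approximately independent) hash values by salting one hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key: str):
        for h in self._hashes(key):
            self.bits[h // 8] |= 1 << (h % 8)          # set bit h to 1

    def __contains__(self, key: str) -> bool:
        # True  => key *may* be in S (false positive possible)
        # False => key is definitely not in S (no false negatives)
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(key))

# Usage: n/m = 8 bits per key, k = 6 (near the optimal (n/m)·ln 2).
bf = BloomFilter(n_bits=8 * 1_000_000, k=6)
bf.add("good@example.com")
assert "good@example.com" in bf      # always passes: no false negatives
print("spam@example.com" in bf)      # usually False; True w/ prob. ≈ 0.02
```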
Counting distinct elements:
▪ The data stream consists of elements chosen from a universe of size N
▪ Maintain a count of the number of distinct elements seen so far

Obvious approach: maintain the set of elements seen so far
▪ That is, keep a hash table of all the distinct elements seen so far

Applications:
▪ How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?)
▪ How many different Web pages does each customer request in a week?
▪ How many distinct products have we sold in the last week?
Real problem: what if we do not have space to maintain the set of elements seen so far?
▪ Estimate the count in an unbiased way
▪ Accept that the count may have a little error, but limit the probability that the error is large

Flajolet-Martin approach:
▪ Pick a hash function h that maps each of the N elements to at least log₂ N bits
▪ For each stream element a, let r(a) be the number of trailing 0s in h(a)
▪ r(a) = position of the first 1, counting from the right; e.g., if h(a) = 12, then 12 is 1100 in binary, so r(a) = 2
▪ Record R = the maximum r(a) seen: R = max_a r(a), over all the items a seen so far
▪ Estimated number of distinct elements = 2^R

Very rough and heuristic intuition for why Flajolet-Martin works:
▪ h(a) hashes a with equal probability to any of N values, so h(a) is a sequence of log₂ N bits, where a 2^(−r) fraction of all a's have a tail of r zeros
▪ About 50% of a's hash to ***0
▪ About 25% of a's hash to **00
▪ So, if the longest tail we saw is r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far
▪ It takes hashing about 2^r distinct items before we see one with a zero-suffix of length r
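A single-hash Flajolet-Martin sketch in Python (the MD5-based hash and the test stream are illustrative choices):

```python
import hashlib

def trailing_zeros(x: int) -> int:
    """r(a): number of trailing 0s in the binary representation of x."""
    if x == 0:
        return 0  # convention; a hash of exactly 0 is vanishingly rare
    return (x & -x).bit_length() - 1   # isolate lowest set bit, take its index

def flajolet_martin(stream) -> int:
    """Estimate the number of distinct elements as 2^R."""
    R = 0
    for a in stream:
        h = int(hashlib.md5(str(a).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

# E.g., 10,000 distinct values, each repeated 5 times:
stream = [i % 10_000 for i in range(50_000)]
print(flajolet_martin(stream))  # a power of 2, typically within a factor of a few of 10,000
```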
Now we show why Flajolet-Martin works. Formally, we will show that the probability of finding a tail of r zeros:
▪ Goes to 1 if m ≫ 2^r
▪ Goes to 0 if m ≪ 2^r
where m is the number of distinct elements seen so far in the stream.

The probability that a given h(a) ends in at least r zeros is 2^(−r)
▪ h(a) hashes elements uniformly at random, and the probability that a random number ends in at least r zeros is 2^(−r)

Then the probability of NOT seeing a tail of length r among m elements is (1 − 2^(−r))^m
▪ (1 − 2^(−r)) is the probability that a given h(a) ends in fewer than r zeros; raising it to the mth power gives the probability that all m hashes do

Note: (1 − 2^(−r))^m = ((1 − 2^(−r))^(2^r))^(m·2^(−r)) ≈ e^(−m·2^(−r))

So the probability of NOT finding a tail of length r is:
▪ If m ≪ 2^r: (1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 1 as m/2^r → 0, so the probability of finding a tail of length r tends to 0
▪ If m ≫ 2^r: (1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 0 as m/2^r → ∞, so the probability of finding a tail of length r tends to 1

Thus, 2^R will almost always be around m!
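A quick numeric check of this approximation (values are illustrative):

```python
import math

r = 10                                   # so 2^r = 1024
for m in (10, 1_000, 1_000_000):
    exact = (1 - 2**-r) ** m             # P(no tail of length r among m items)
    approx = math.exp(-m * 2**-r)
    print(f"m={m:>9}: exact={exact:.4f}, approx={approx:.4f}")
# m << 2^r  -> probability near 1 (a tail of length r is unlikely to appear)
# m >> 2^r  -> probability near 0 (a tail of length r almost surely appears)
```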
Generalization to moments: the kth moment of a stream is Σ_{i∈A} (m_i)^k, where A is the set of distinct values and m_i is the number of times value i occurs in the stream
▪ 0th moment = number of distinct elements (the problem just considered)
▪ 1st moment = count of the number of elements = the length of the stream (easy to compute)
▪ 2nd moment = surprise number S, a measure of how uneven the distribution is

Example: stream of length 100 with 11 distinct values
▪ Item counts 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9: surprise S = 910
▪ Item counts 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1: surprise S = 8,110

AMS method:
▪ Works for all moments and gives an unbiased estimate
▪ We will just concentrate on the 2nd moment S
▪ We pick and keep track of many variables X:
▪ For each variable X we store X.el and X.val
▪ X.el corresponds to the item i; X.val corresponds to the count of item i
▪ Note this requires keeping counts in main memory, so the number of Xs is limited
▪ Our goal is to compute S = Σ_i (m_i)²
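A two-line check of these surprise numbers in Python:

```python
def surprise(counts):
    """2nd moment: S = sum of m_i^2 over all distinct items i."""
    return sum(m * m for m in counts)

print(surprise([10] + [9] * 10))   # 910   (even distribution, low surprise)
print(surprise([90] + [1] * 10))   # 8110  (skewed distribution, high surprise)
```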
How to set X.val and X.el?
▪ Assume the stream has length n (we relax this later)
▪ Pick some random time t (t < n) to start, so that any time is equally likely
▪ If at time t the stream has item i, we set X.el = i
▪ From then on we maintain the count c (X.val = c) of the number of i's in the stream starting from the chosen time t

Then the estimate of the 2nd moment (Σ_i (m_i)²) is S = f(X) = n·(2c − 1)
▪ Note: we will keep track of multiple Xs (X_1, X_2, …, X_k), and our final estimate will be S = (1/k)·Σ_{j=1}^{k} f(X_j)

Expectation analysis (example stream: a a b b b a b a):
▪ Let c_t be the number of times the item at time t appears from time t onwards (here c_1 = m_a, c_2 = m_a − 1, c_3 = m_b)
▪ m_i is the total count of item i in the stream (we are assuming the stream has length n)

E[f(X)] = (1/n)·Σ_{t=1}^{n} n·(2c_t − 1)

▪ Group the times t by the value i seen there: the time when the last i is seen has c_t = 1, the time when the penultimate i is seen has c_t = 2, …, the time when the first i is seen has c_t = m_i
▪ So E[f(X)] = (1/n)·Σ_i n·(1 + 3 + 5 + ⋯ + (2m_i − 1))
▪ Little side calculation: 1 + 3 + 5 + ⋯ + (2m_i − 1) = Σ_{j=1}^{m_i} (2j − 1) = 2·(m_i(m_i + 1)/2) − m_i = (m_i)²
▪ Then E[f(X)] = (1/n)·Σ_i n·(m_i)²
▪ So E[f(X)] = Σ_i (m_i)² = S: we have the second moment (in expectation)!
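A minimal sketch of this fixed-length-stream estimator in Python, a direct translation of f(X) = n·(2c − 1); the example stream and choice of k are illustrative:

```python
import random
from collections import Counter

def ams_second_moment(stream, k: int) -> float:
    """AMS estimate of S = sum_i m_i^2 using k variables X."""
    n = len(stream)
    times = random.sample(range(n), k)        # k uniformly random start times t
    estimates = []
    for t in times:
        el = stream[t]                        # X.el = item at time t
        c = stream[t:].count(el)              # X.val = count of el from t onward
        estimates.append(n * (2 * c - 1))     # f(X) = n(2c - 1)
    return sum(estimates) / k                 # average the k unbiased estimates

stream = list("aabbbaba")                     # m_a = 4, m_b = 4, so true S = 32
print(ams_second_moment(stream, k=4))
print(sum(m * m for m in Counter(stream).values()))   # exact answer: 32
```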
For estimating the kth moment we essentially use the same algorithm but change the estimate:
▪ For k = 2 we used n·(2c − 1)
▪ For k = 3 we use n·(3c² − 3c + 1) (where c = X.val)
▪ Why? For k = 2, remember we had 1 + 3 + 5 + ⋯ + (2m_i − 1), and the terms 2c − 1 (for c = 1, …, m) sum to m²:
Σ_{c=1}^{m} (2c − 1) = Σ_{c=1}^{m} c² − Σ_{c=1}^{m} (c − 1)² = m², so 2c − 1 = c² − (c − 1)²
▪ For k = 3: c³ − (c − 1)³ = 3c² − 3c + 1
▪ Generally: estimate = n·(c^k − (c − 1)^k)

In practice:
▪ Compute f(X) = n·(2c − 1) for as many variables X as you can fit in memory
▪ Average them in groups, then take the median of the averages

Problem: streams never end
▪ We assumed there was a number n, the number of positions in the stream
▪ But real streams go on forever, so n is a variable: the number of inputs seen so far

Fixups (sketched in code below):
(1) The variables X have n as a factor: keep n separately and just hold the count in X
(2) Suppose we can only store k counts; we must throw some Xs out as time goes on
▪ Objective: each starting time t is selected with probability k/n
▪ Solution: fixed-size (reservoir) sampling!
▪ Choose the first k times for the k variables
▪ When the nth element arrives (n > k), choose its time as a start time with probability k/n
▪ If you choose it, throw out one of the previously stored variables X, with equal probability
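A sketch of the streaming version, combining the two fixups (n kept separately, reservoir sampling of start times); class and method names are our own:

```python
import random

class StreamingAMS:
    """AMS 2nd-moment estimator over an unbounded stream, keeping k variables."""

    def __init__(self, k: int):
        self.k = k
        self.n = 0          # number of stream elements seen so far
        self.vars = []      # list of [X.el, X.val]

    def add(self, item):
        self.n += 1
        # Every live variable whose element reappears increments its count.
        for X in self.vars:
            if X[0] == item:
                X[1] += 1
        # Reservoir sampling over start times: keep time n with prob. k/n.
        if len(self.vars) < self.k:
            self.vars.append([item, 1])
        elif random.random() < self.k / self.n:
            self.vars[random.randrange(self.k)] = [item, 1]

    def estimate(self) -> float:
        # f(X) = n(2c - 1), averaged over the variables.
        return sum(self.n * (2 * c - 1) for _, c in self.vars) / len(self.vars)

ams = StreamingAMS(k=20)
for x in "aabbbaba" * 100:
    ams.add(x)
print(ams.estimate())   # true S here: 400^2 + 400^2 = 320,000
```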
New problem: given a stream of baskets, which items appear more than s times in the window?
Possible solution: think of the stream of baskets as one binary stream per item
▪ 1 = item present; 0 = item not present
▪ Use DGIM to estimate the count of 1s for each item

[Figure: a 0/1 stream such as 010011100010100100010110110111001010110011010 over a window of length N, covered by DGIM buckets of exponentially growing sizes]

In principle, you could count frequent pairs or even larger itemsets the same way
▪ One stream per itemset
Drawbacks:
▪ Only approximate
▪ The number of itemsets is way too big
Exponentially decaying windows: if the stream is a_1, a_2, … and we are taking a sum over the stream, take the answer at time t to be:
S_t = Σ_{i=1}^{t} a_i·(1 − c)^(t−i)
▪ c is a constant, presumably tiny, like 10^(−6) or 10^(−9)
▪ When a new element a_{t+1} arrives: multiply the current sum by (1 − c) and add a_{t+1}

Counting a single item x: imagine a separate 0/1 stream for each item (1 if x appears, 0 if x does not)
▪ When a new item x arrives:
▪ Multiply all counts by (1 − c)
▪ Add +1 to the count for element x
▪ Call this sum the "weight" of item x

Important property: the sum over all weights, Σ_t (1 − c)^t, is 1/[1 − (1 − c)] = 1/c
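A sketch of a decaying item counter in Python. Rather than literally multiplying every count on each arrival, this variant (our design choice, not the slides') keeps a global scale factor and applies it lazily on read, which is mathematically equivalent:

```python
class DecayingCounts:
    """Exponentially decaying item weights (c is illustrative)."""

    def __init__(self, c: float = 1e-6):
        self.decay = 1.0 - c
        self.scale = 1.0          # (1-c)^t, applied lazily
        self.raw = {}             # item -> weight / scale

    def add(self, x):
        self.scale *= self.decay  # "multiply all counts by (1 - c)"
        self.raw[x] = self.raw.get(x, 0.0) + 1.0 / self.scale  # "+1 for x"
        # (a production version would renormalize occasionally: scale -> 0 over time)

    def weight(self, x) -> float:
        return self.raw.get(x, 0.0) * self.scale

counts = DecayingCounts(c=0.1)    # large c so the decay is visible
for x in ["m1", "m2", "m1", "m1"]:
    counts.add(x)
print(counts.weight("m1"))        # 0.9^3 + 0.9 + 1 = 2.629
print(counts.weight("m2"))        # 0.9^2 = 0.81
```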
Example: what are "currently" the most popular movies?
▪ Suppose we want to find movies of weight > 1/2
▪ By the property above, the sum over all weights Σ_t (1 − c)^t is 1/[1 − (1 − c)] = 1/c
▪ Thus: there cannot be more than 2/c movies with weight 1/2 or more
▪ So 2/c is a limit on the number of movies being counted at any time

Extension: count (some) itemsets in an exponentially decaying window
▪ What are currently "hot" itemsets?
▪ Problem: too many itemsets to keep counts of all of them in memory
When a basket B comes in (sketched in code at the end of this section):
▪ Multiply all counts by (1 − c)
▪ Add 1 to the count of any item in B and of any itemset contained in B that is already being counted; for uncounted items in B, create a new count
▪ Drop counts < 1/2
▪ Initiate new counts, as follows

Initiating new counts: start a count for an itemset S ⊆ B if every proper subset of S had a count prior to the arrival of basket B
▪ Intuitively: if all subsets of S are being counted, they are "frequent/hot", so S has the potential to be "hot"
▪ Example: start counting S = {i, j} iff both i and j were counted prior to seeing B
▪ Start counting S = {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B
Memory requirement: counts for single items < (2/c)·(average number of items in a basket)
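A sketch of the whole basket-update loop under these rules, reusing the lazy global-scale trick from the earlier sketch (the decay constant and test baskets are illustrative; a real system would also bound the itemset size):

```python
from itertools import combinations

c = 1e-3                          # decay constant (illustrative)
decay, scale = 1.0 - c, 1.0
counts = {}                       # frozenset(itemset) -> weight / scale

def process_basket(basket):
    global scale
    scale *= decay                # multiply all counts by (1 - c), lazily
    items = frozenset(basket)
    before = set(counts)          # itemsets counted *prior* to this basket
    # Add 1 for each item in B, creating counts for uncounted items.
    for i in items:
        s = frozenset([i])
        counts[s] = counts.get(s, 0.0) + 1.0 / scale
    # Add 1 to any already-counted itemset contained in B.
    for S in before:
        if len(S) > 1 and S <= items:
            counts[S] += 1.0 / scale
    # Initiate S ⊆ B if every proper subset of S was counted before B arrived
    # (checking the immediate subsets suffices, as in the slides' example).
    for size in range(2, len(items) + 1):
        for S in map(frozenset, combinations(sorted(items), size)):
            if S not in counts and all(
                frozenset(T) in before for T in combinations(S, size - 1)
            ):
                counts[S] = 1.0 / scale
    # Drop counts that fell below 1/2.
    for S in [S for S, w in counts.items() if w * scale < 0.5]:
        del counts[S]

for B in [{"i", "j"}, {"i", "j", "k"}, {"i", "j", "k"}]:
    process_basket(B)
print({tuple(sorted(S)): round(w * scale, 2) for S, w in counts.items()})
# {i,j} starts being counted only once both i and j were seen in an earlier basket;
# {i,j,k} would start only after all three of its pairs had counts.
```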