
Note to other teachers and users of these slides: We would be delighted if you found our

material useful for giving your own lectures. Feel free to use these slides verbatim, or to
modify them to fit your own needs. If you make use of a significant portion of these slides
in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
Charilaos Kanatsoulis, Stanford University
http://cs246.stanford.edu

Announcements:
▪ We will be releasing practice exam problems this weekend
▪ We will hold extra office hours next week for exam preparation

 Prediction at round 𝒕 is: ŷ_i^(t) = ŷ_i^(t−1) + f_t(x_i)
 Goal: Find the tree 𝒇𝒕(⋅) that minimizes:
  obj^(t) = Σ_i l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t)
 The optimal objective (after a second-order approximation of the loss) is:
  obj* = −(1/2) Σ_{j=1}^{T} G_j^2 / (H_j + λ) + γT
▪ 𝐺_j, 𝐻_j depend on the loss function; T = # of leaves
 In principle we could:
▪ Enumerate all possible tree structures 𝑓 and take the one that minimizes obj

 In practice we grow the tree greedily:
▪ Start with a tree of depth 0
▪ For each leaf node in the tree, try to add a split
▪ The change of the objective after adding a split (the gain) is:
  Gain = (1/2) [ G_L^2/(H_L + λ) + G_R^2/(H_R + λ) − (G_L + G_R)^2/(H_L + H_R + λ) ] − γ
  (score of the left child + score of the right child − score if we do not split, minus the cost γ of the extra leaf)
▪ Take the split that gives the best gain
 Next: How to find the best split?

 For each node, enumerate over all features:
▪ For each feature, sort the instances by feature value
▪ Use a linear scan to decide the best split along that feature
▪ Take the best split solution over all the features (a sketch of this linear scan appears after these slides)
 Pre-stopping:
▪ Stop splitting if the best split has negative gain
▪ But maybe a split can benefit future splits
 Post-pruning:
▪ Grow the tree to maximum depth, then recursively prune away all leaf splits with negative gain

 Add a new tree 𝒇𝒕(𝒙) in each iteration
▪ Compute the necessary statistics for our objective
▪ Greedily grow the tree that minimizes the objective
 Add 𝒇𝒕(𝒙) to our ensemble model: ŷ^(t) = ŷ^(t−1) + 𝜖 f_t(x)
▪ 𝜖 is called the step-size or shrinkage, and is usually set around 0.1
▪ Goal: prevent overfitting
 Repeat until we have an ensemble of 𝑴 trees
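A minimal Python sketch of the linear-scan split search described above. It is not XGBoost's actual implementation; the toy data, the gradients g, hessians h, and the regularization constants lam and gamma are illustrative assumptions.

import numpy as np

def best_split(X, g, h, lam=1.0, gamma=0.0):
    """Exact greedy split search: for every feature, sort instances by feature value
    and linearly scan split points, scoring each with the gain formula above."""
    G, H = g.sum(), h.sum()
    parent_score = G * G / (H + lam)
    best = (None, None, 0.0)                      # (feature, threshold, gain)
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        GL = HL = 0.0
        for rank in range(len(order) - 1):
            i = order[rank]
            GL += g[i]; HL += h[i]
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL*GL/(HL+lam) + GR*GR/(HR+lam) - parent_score) - gamma
            if gain > best[2]:
                best = (j, (X[order[rank], j] + X[order[rank+1], j]) / 2, gain)
    return best

# Toy example: squared-error loss, so g = prediction - y and h = 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(float)
g, h = 0.5 - y, np.ones_like(y)                   # gradients/hessians at prediction 0.5
print(best_split(X, g, h))                        # should split on feature 0 near 0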

 XGBoost: eXtreme Gradient Boosting
▪ A highly scalable implementation of gradient boosted decision trees with regularization
 Widely used by data scientists and provides state-of-the-art results on many problems!
 System optimizations:
▪ Parallel tree construction using a column block structure
▪ Distributed computing for training very large models using a cluster of machines
▪ Out-of-core computing for very large datasets that don't fit into memory

[Course overview diagram — High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction; Graph data: PageRank, SimRank, Community Detection, Spam Detection; Infinite data: Filtering data streams, Queries on streams, Web advertising; Machine learning: Decision Trees, SVM, Parallel SGD; Apps: Recommender systems, Association Rules, Duplicate document detection]

 So far we have worked with datasets or databases where all the data is available
 In contrast, in data streams data arrives one element at a time, often at a rapid rate, so that:
▪ If it is not processed immediately, it is lost forever
▪ It is not feasible to store it all

 In many data mining situations, we do not know the entire data set in advance
 Stream management is important when the input rate is controlled externally:
▪ Google queries
▪ Twitter posts or Facebook status updates
▪ e-Commerce purchase data
▪ Credit card transactions
 Think of the data as infinite and non-stationary (the distribution changes over time)

 Input elements enter at a rapid rate, at one or more input ports (i.e., streams)
▪ We call elements of the stream tuples
 The system cannot store the entire stream accessibly
 Q: How do you make critical calculations about the stream using a limited amount of (secondary) memory?
▪ This is the fun part and why interesting algorithms are needed

 Stochastic Gradient Descent (SGD) is an example of a streaming algorithm
 In Machine Learning we call this: Online Learning
▪ Allows for modeling problems where we have a continuous stream of data
▪ We want an algorithm to learn from it and slowly adapt to the changes in the data
 Idea: Do small updates to the model
▪ SGD makes small updates
▪ So: First train the classifier on training data
▪ Then: For every example from the stream, we slightly update the model (using a small learning rate)
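A minimal sketch of this online-learning pattern, using logistic regression trained by SGD; the feature dimension, learning rate, and toy data are illustrative assumptions.

import numpy as np

class OnlineLogisticRegression:
    """Online learning: after initial training, keep updating on each streamed example."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr                       # small learning rate -> small updates

    def update(self, x, y):
        """One SGD step on a single (x, y) example with y in {0, 1}."""
        p = 1.0 / (1.0 + np.exp(-self.w @ x))
        self.w -= self.lr * (p - y) * x    # gradient of the log loss

    def predict(self, x):
        return 1.0 / (1.0 + np.exp(-self.w @ x))

# First train on a batch of training data, then slightly update on every streamed example
model = OnlineLogisticRegression(dim=2, lr=0.05)
train = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), 0)]
for _ in range(100):
    for x, y in train:
        model.update(x, y)
for x, y in [(np.array([0.9, 0.1]), 1), (np.array([0.2, 1.1]), 0)]:  # the "stream"
    model.update(x, y)
print(model.predict(np.array([1.0, 0.0])))   # close to 1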

[Stream processing model diagram: streams of elements/tuples enter the system over time (e.g., ... 1, 5, 2, 7, 0, 9, 3; ... a, r, v, t, y, h, b; ... 0, 0, 1, 0, 1, 1, 0); a Processor answers Standing Queries and Ad-Hoc Queries and writes Output, using a Limited Working Storage and an Archival Storage]

 Types of queries one wants to answer on a data stream:
▪ Sampling data from a stream: construct a random sample
▪ Filtering a data stream: select elements with property x from the stream
▪ Counting distinct elements: number of distinct elements in the last k elements of the stream
▪ Finding most frequent elements

 Mining query streams
▪ Google wants to know what queries are more frequent today than yesterday
 Mining click streams
▪ Wikipedia wants to know which of its pages are getting an unusual number of hits in the past hour
 Mining social network news feeds
▪ Look for trending topics on Twitter, Facebook
 Why is this important?
▪ Since we cannot store the entire stream, a representative sample can act like the stream
 Two different problems:
▪ (1) Sample a fixed proportion of elements in the stream (say 1 in 10); as the stream grows, the sample also grows
▪ (2) Maintain a random sample of fixed size s over a potentially infinite stream
▪ At any "time" k we would like a random sample of s elements of the stream 1..k
▪ What is the property of the sample we want to maintain? For all time steps k, each of the k elements seen so far must have equal probability of being sampled

 Problem 1: Sampling a fixed proportion
▪ E.g., sample 10% of the stream
▪ As the stream gets bigger, the sample gets bigger
 Naïve solution:
▪ Generate a random integer in [0...9] for each query
▪ Store the query if the integer is 0, otherwise discard it
 Any problem with this approach?
▪ We have to be very careful what queries we answer using this sample

 Scenario: Search engine query stream
▪ Stream of tuples: (user, query, time)
▪ Question: What fraction of unique queries by an average user are duplicates?
▪ Suppose each user issues x queries once and d queries twice (a total of x+2d query instances); then the correct answer is d/(x+d)
▪ Proposed solution: We keep 10% of the queries
▪ The sample will contain (x+2d)/10 elements of the stream
▪ The sample will contain d/100 pairs of duplicates: d/100 = 1/10 ∙ 1/10 ∙ d
▪ There are (x+2d)/10 − d/100 = (10x+19d)/100 unique elements in the sample
▪ So the sample-based answer is (d/100) / ((10x+19d)/100) = d/(10x+19d), which underestimates the correct d/(x+d)

 Solution: Don't sample queries, sample users instead
 Pick 1/10th of the users and take all of their search queries into the sample
 Use a hash function that hashes the user name or user id uniformly into 10 buckets
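A minimal sketch of this fix: hash each user id into 10 buckets and keep all queries from users who land in bucket 0 (the SHA-256 hash and the tuple layout are illustrative choices).

import hashlib

def in_sample(user_id, num_buckets=10, keep_buckets=1):
    """Deterministically keep ~1/10 of users: the same user always maps to the same bucket."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return h % num_buckets < keep_buckets

stream = [("alice", "cats", 1), ("bob", "dogs", 2), ("alice", "cats", 3)]
sample = [t for t in stream if in_sample(t[0])]
# Either both of alice's "cats" queries are in the sample or neither is,
# so duplicate statistics computed on the sample are unbiased for the sampled users.
print(sample)

The same scheme gives an a/b sample of any keyed stream, as described next: hash the key into b buckets and keep the tuple if it falls in the first a buckets.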

 Stream of tuples with keys:
▪ The key is some subset of each tuple's components
▪ E.g., tuple is (user, search, time); key is user
▪ The choice of key depends on the application
 To get a sample of an a/b fraction of the stream:
▪ Hash each tuple's key uniformly into b buckets
▪ Pick the tuple if its hash value is at most a
▪ Example: How to generate a 30% sample? Hash into b = 10 buckets, take the tuple if it hashes to one of the first 3 buckets

 Problem 2: Fixed-size sample
 Suppose we need to maintain a random sample S of size exactly s tuples
▪ E.g., due to a main memory size constraint
 Why? We don't know the length of the stream in advance
 Suppose by time n we have seen n items
▪ Each item should be in the sample S with equal probability s/n
▪ How to think about the problem: say s = 2 and the stream is a x c y z k c d e g …
▪ At n = 5, each of the first 5 tuples is included in the sample S with equal probability
▪ At n = 7, each of the first 7 tuples is included in the sample S with equal probability
▪ An impractical solution would be to store all n tuples seen so far and pick s of them at random

 Algorithm (a.k.a. Reservoir Sampling)
▪ Store all of the first s elements of the stream in S
▪ Suppose we have seen n−1 elements, and now the n-th element arrives (𝒏 > 𝒔)
▪ With probability s/n, keep the n-th element, else discard it
▪ If we picked the n-th element, then it replaces one of the s elements in the sample S, picked uniformly at random
 Claim: This algorithm maintains a sample S with the desired property:
▪ After n elements, the sample contains each element seen so far with probability s/n

 We prove this by induction:
▪ Assume that after n elements, the sample contains each element seen so far with probability s/n
▪ We need to show that after seeing element n+1 the sample maintains the property
▪ Sample contains each element seen so far with probability s/(n+1)
 Base case:
▪ After we see n = s elements, the sample S has the desired property
▪ Each of the n = s elements is in the sample with probability s/s = 1

 Inductive hypothesis: After n elements, the sample S contains each element seen so far with probability s/n
 Inductive step:
▪ New element n+1 arrives; it goes to S with probability s/(n+1)
▪ For all other elements currently in S:
▪ They were in S with probability s/n
▪ The probability that they remain in S is:
  (1 − s/(n+1)) + (s/(n+1)) ∙ ((s−1)/s) = n/(n+1)
  (first term: element n+1 is discarded; second term: element n+1 is kept but this element of the sample is not the one it replaces)
▪ So tuples stay in S with probability n/(n+1)
 So, P(tuple is in S at time n+1) = (s/n) ∙ (n/(n+1)) = s/(n+1)
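A minimal Python sketch of reservoir sampling as described above (the toy stream is illustrative):

import random

def reservoir_sample(stream, s, rng=random.Random(0)):
    """Maintain a uniform random sample of exactly s elements from a stream."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)              # keep all of the first s elements
        elif rng.random() < s / n:           # keep the n-th element with prob. s/n
            sample[rng.randrange(s)] = item  # it replaces a uniformly chosen element
    return sample

# Example: sample 2 elements from a stream of 10
print(reservoir_sample(iter("axcyzkcdeg"), s=2))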

 Each element of the data stream is a tuple
 Given a list of keys S (which is our filter)
 Determine which tuples of the stream have a key in S
 Obvious solution: Hash table
▪ But suppose we do not have enough memory to store all of S in a hash table
▪ E.g., we might be processing millions of filters on the same stream

 Example: Email spam filtering
▪ 1 million users, each user has 1000 "good" email addresses (trusted addresses)
▪ If an email comes from one of these, it is NOT spam
 Example: Content filtering
▪ You want to make sure the user does not see the same ad/recommendation multiple times
 Given a set of keys S that we want to filter
 Create a bit array B of n bits, initially all 0s
 Choose a hash function h with range [0, n)
 Hash each member s ∈ S to one of the n buckets, and set that bit to 1, i.e., B[h(s)] = 1
 Hash each element a of the stream and output only those that hash to a bit that was set to 1
▪ Output a if B[h(a)] == 1

[Diagram: an incoming Item is run through hash function h against bit array B (e.g., 0010001011000). If it hashes to a bucket set to 1, output the item, since it may be in S (the bucket was set by at least one item of S); if it hashes to a bucket set to 0, drop the item, since it is surely not in S]
 This creates false positives but no false negatives
▪ Items that hash to a 1 bucket may or may not be in S
▪ Items that hash to a 0 bucket are surely not in S

 Example: |S| = 1 billion email addresses, |B| = 1 GB = 8 billion bits
 If the email address is in S, then it surely hashes to a bucket whose bit is set to 1, so it always gets through (no false negatives)
 Approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses not in S get through to the output (false positives)
▪ Actually, less than 1/8th, because more than one address might hash to the same bit

 Let's do a more accurate analysis of the number of false positives. We know that:
▪ Fraction of 1s in array B = probability of a false positive
 Darts & Targets: If we throw m darts into n equally likely targets, what is the probability that a target gets at least one dart?
 In our case:
▪ Targets = bits/buckets
▪ Darts = hash values of items

 We have m darts, n targets
 What is the probability that a target gets at least one dart?
  1 − (1 − 1/n)^{n(m/n)} ≈ 1 − e^{−m/n}
▪ (1 − 1/n)^m is the probability that no dart hits target X; since (1 − 1/n)^n → 1/e as n → ∞, the approximation is especially accurate when n is large

 Fraction of 1s in the array B = probability of false positive = 1 − e^{−m/n}
 Example: 10^9 darts, 8∙10^9 targets
▪ Fraction of 1s in B = 1 − e^{−1/8} = 0.1175
▪ Compare with our earlier estimate: 1/8 = 0.125
 To reduce the false positive rate of the Bloom filter we use multiple hash functions
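A quick numerical check of the example above:

import math

m, n = 1e9, 8e9                  # darts (keys), targets (bits)
exact = 1 - (1 - 1/n) ** m       # probability a given bit is set
approx = 1 - math.exp(-m / n)    # 1 - e^(-m/n) approximation
print(exact, approx)             # both ~0.1175, vs. the naive estimate 1/8 = 0.125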

 Consider: |S| = m keys, |B| = n bits
 Use k independent hash functions h_1, …, h_k
 Initialization:
▪ Set B to all 0s
▪ Hash each element s ∈ S using each hash function h_i, and set B[h_i(s)] = 1 for each i = 1, .., k (note: we have a single array B!)
 Run-time:
▪ When a stream element with key x arrives
▪ If B[h_i(x)] = 1 for all i = 1, ..., k then declare that x is in S
▪ That is, x hashes to a bucket set to 1 for every hash function h_i(x)
▪ Otherwise discard the element x

 What fraction of the bit vector B are 1s?
▪ We are throwing k∙m darts at n targets
▪ So the fraction of 1s is (1 − e^{−km/n})
 But we have k independent hash functions, and we only let element x through if all k hash x to a bucket of value 1
 So, false positive probability = (1 − e^{−km/n})^k

 Example: m = 1 billion, n = 8 billion
▪ k = 1: (1 − e^{−1/8}) = 0.1175
▪ k = 2: (1 − e^{−1/4})^2 = 0.0489
 What happens as we keep increasing k?
[Plot: false positive probability (y-axis, 0 to 0.2) vs. number of hash functions k (x-axis, 0 to 20); the curve decreases to a minimum at the optimal k and then increases again]
 Optimal value of k (the k that gives the lowest false positive probability): k = (𝒏/𝒎) ln 𝟐
▪ In our case: Optimal k = 8 ln(2) = 5.54 ≈ 6
▪ Error at k = 6: (1 − e^{−3/4})^6 = 0.0216
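A minimal sketch of a Bloom filter with k hash functions. Deriving the k hashes from salted SHA-256 and the email-address test data are illustrative choices, not how any particular production filter works.

import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: n bits, k hash functions derived from salted SHA-256."""
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)

    def _hashes(self, key):
        for i in range(self.k):  # k "independent" hashes via different salts
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n

    def add(self, key):
        for pos in self._hashes(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._hashes(key))

# Analytic optimal k for m keys and n bits, then a small end-to-end test
m, n = 1_000, 8_000
k_opt = round(n / m * math.log(2))               # (n/m) ln 2  ->  about 6 here
print("optimal k:", k_opt)

bf = BloomFilter(n_bits=n, k=k_opt)
for i in range(m):
    bf.add(f"user{i}@example.com")
print(bf.might_contain("user42@example.com"))    # True: no false negatives
print(bf.might_contain("stranger@example.com"))  # usually False; True with prob ~(1 - e^{-km/n})^k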

 Problem:
▪ Data stream consists of elements chosen from a universe of size N
▪ Maintain a count of the number of distinct elements seen so far
 Obvious approach: Maintain a dictionary of the elements seen so far
▪ Keep a hash table of all the distinct elements seen so far
▪ What if the number of distinct elements is huge?
▪ What if there are many streams that need to be processed at once?

 How many unique users has a website seen in each given month?
▪ Universal set = set of logins for that month
▪ Stream element = each time someone logs in
 How many different words are found among the Web pages being crawled at a site?
▪ Unusually low or high numbers could indicate artificial pages (spam?)
 How many distinct products have we sold in the last week?

 Real problem: What if we do not have space to maintain the set of elements seen so far in every stream?
▪ We have limited working storage
 We use a variety of hashing and randomization to get approximately what we want
 Estimate the count in an unbiased way
 Accept that the count may have a little error, but limit the probability that the error is large

 The Flajolet-Martin approach estimates the number of distinct elements by hashing elements to sufficiently long bit-strings
▪ The bit-strings are long enough that the number of possible hash results is larger than the size of the universal set
 Idea: the more different elements we see in the stream, the more different hash values we shall see
▪ The number of trailing 0s in these hash values estimates the number of distinct elements

 Pick a hash function h that maps each of the N elements to at least log2 N bits
 For each stream element a, let r(a) be the number of trailing 0s in h(a)
▪ r(a) = position of the first 1 counting from the right
▪ E.g., say h(a) = 12; 12 is 1100 in binary, so r(a) = 2
 Record R = the maximum r(a) seen
▪ R = max_a r(a), over all the items a seen so far
 Estimated number of distinct elements = 2^R
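A minimal sketch of the Flajolet-Martin estimate described above (the SHA-256-based hash and the toy stream are illustrative):

import hashlib

def trailing_zeros(x):
    """r(a): number of trailing 0s in the binary representation of x (defined as 0 for x = 0)."""
    if x == 0:
        return 0
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream):
    """Flajolet-Martin: estimate the number of distinct elements as 2^R."""
    R = 0
    for a in stream:
        h = int.from_bytes(hashlib.sha256(str(a).encode()).digest()[:8], "big")
        R = max(R, trailing_zeros(h))
    return 2 ** R

# Example: a stream with 1000 distinct values, each repeated 5 times
stream = [i % 1000 for i in range(5000)]
print(fm_estimate(stream))   # a (noisy) power-of-two estimate near 1000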
 Very rough and heuristic intuition for why Flajolet-Martin works:
▪ h(a) hashes a with equal probability to any of N values
▪ All elements have equal probability of having a tail of r zeros
▪ That is, a 2^{−r} fraction of all a's have a tail of r zeros
▪ About 50% of a's hash to ***0
▪ About 25% of a's hash to **00
▪ So, if the longest tail we saw was r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far
▪ So, it takes hashing about 2^r items before we see one with a zero-suffix of length r

 Now we show why Flajolet-Martin works
 Let 𝒎 be the number of distinct elements seen so far in the stream
 We show that the probability of finding a tail of r zeros:
▪ Goes to 1 if 𝒎 ≫ 𝟐^𝒓
▪ Goes to 0 if 𝒎 ≪ 𝟐^𝒓
 Thus, 2^R will almost always be around m!

 What is the probability that a given h(a) ends in at least r zeros? It is 2^{−r}
▪ h(a) hashes elements uniformly at random
▪ The probability that a random number ends in at least r zeros is 2^{−r}
 Then, the probability of NOT seeing a tail of length r among m elements is:
  (1 − 2^{−r})^m
▪ (1 − 2^{−r} is the probability that a given h(a) ends in fewer than r zeros, so (1 − 2^{−r})^m is the probability that all m hashes do)

−=
−−

−r −r
Note: (1 − 2−r )m = (1 − 2−r )2 ( m2 )  e−m2
r
  E[2R] is actually infinite
 Prob. of NOT finding a tail of length r is: ▪ Probability halves when R → R+1, but value doubles
▪ If m << 2r, then prob. tends to 1  Workaround involves using many hash
−r
▪ (1 − 2− r )m  e− m 2 = 1 as m/2r→ 0 functions hi and getting many samples of Ri
▪ So, the probability of finding a tail of length r tends to 0  How are samples Ri combined?
▪ If m >> 2r, then prob. tends to 0 ▪ Average? What if one very large value 𝟐𝑹𝒊 ?
−r
▪ (1 − 2− r )m  e− m 2 = 0 as m/2r → 
▪ Median? All estimates are a power of 2
▪ So, the probability of finding a tail of length r tends to 1
▪ Solution:
 Thus, 2R will almost always be around m! ▪ Partition your samples into small groups
▪ Take the median of groups
▪ Then take the average of the medians
2/29/2024 Jure Les kovec, Stanford CS246: Mi ning Ma ssive Datasets, http://cs246.stanford.edu 49 2/29/2024 Jure Les kovec, Stanford CS246: Mi ning Ma ssive Datasets, http://cs246.stanford.edu 50
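A minimal sketch of this combining step (the group size and the sample values are illustrative):

from statistics import mean, median

def combine_fm_estimates(estimates, group_size=5):
    """Combine 2^{R_i} samples: median within small groups, then average the medians."""
    groups = [estimates[i:i + group_size] for i in range(0, len(estimates), group_size)]
    return mean(median(g) for g in groups)

# Example: 20 hypothetical estimates 2^{R_i}, one of them wildly large
samples = [1024, 512, 1024, 2048, 1024, 512, 1024, 1024, 2048, 512,
           1024, 1048576, 1024, 512, 2048, 1024, 1024, 512, 2048, 1024]
print(combine_fm_estimates(samples))  # robust to the single huge value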

 New problem: Given a stream of itemsets, which itemsets appear more frequently?
▪ What are the "currently" most popular movies?
 Application:
▪ What are the most frequent products bought together?
▪ What are some "hot" gift items bought together?
 Solution: Exponentially decaying windows
▪ We first use it to count single items
▪ Popular movies, most bought products, etc.
▪ Then we extend it to counting itemsets

 Exponentially decaying windows: A heuristic for selecting likely frequent items (itemsets)
▪ Instead of computing the raw count over the last N elements
▪ Compute a smooth aggregation over the whole stream
 Smooth aggregation: If the stream is a_1, a_2, … then the smooth aggregation at time T is:
  Σ_{t=1}^{T} a_t (1 − c)^{T−t}
▪ c is a constant, presumably tiny, like 10^{−6} or 10^{−9}
▪ a_t is a non-negative integer in general
 When a new a_{T+1} arrives: multiply the current sum by (1 − c) and add a_{T+1}

 Think of the stream of itemsets as one binary stream per item
▪ For every item, form a binary stream
▪ 1 = item present; 0 = not present
▪ Stream of items: brtbhbgbbgzcbabbcbdbdbnbrbpbqbbsbtbababebcbbbvbwbxbwbbbcbdbcgfbabbzdba
▪ Binary stream for "b": 1001010110001011010101010101011010101010101110101010111010100010110010

 If each a_t is an "item", we can compute the characteristic function of each item x as an exponentially decaying window:
▪ That is: Σ_{t=1}^{T} δ_t ⋅ (1 − c)^{T−t}, where 𝜹_t = 𝟏 if 𝒂_t = 𝒙, and 𝟎 otherwise
▪ In other words: imagine that for each item 𝒙 we have a binary stream (𝟏 if 𝒙 appears, 𝟎 if 𝒙 does not appear)
▪ Then, when a new item a_t arrives:
▪ Multiply the summation of each item by (𝟏 − 𝒄)
▪ Add +1 to the summation of item 𝒙 = a_t
 Call this sum the "weight" of item 𝒙 (a sketch of this update appears after the next slide)

 What are the "currently" most popular movies? Suppose we want to find movies of weight > ½
▪ Important property: The sum over all weights, Σ_t 1 ⋅ (1 − c)^t, is 1/[1 − (1 − c)] = 𝟏/𝒄
▪ That means no item can have weight greater than 1/c
▪ An item has weight 𝟏/𝒄 only if its stream is [1,1,1,1,1,…]. Note we have a separate binary stream for each item, so at a given time only one item has 𝛿_t = 1, and all other items have 𝛿_t = 0
 Thus:
▪ There cannot be more than 𝟐/𝒄 movies with weight ½ or more
▪ Why? Assume each counted item has weight ½. How many items n can we have so that the total is < 1/c? Answer: ½ n < 1/c → 𝑛 < 2/𝑐
 So, 𝟐/𝒄 is a limit on the number of movies being counted at any time
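A minimal sketch of the per-item weight update described above. The toy stream and the exaggerated decay constant c are illustrative; a real implementation over a long stream would use a tiny c and drop weights that fall below ½ rather than decay every key explicitly.

def decayed_weights(stream, c=1e-6):
    """Exponentially decaying weight per item: on each arrival, multiply every
    weight by (1 - c) and add 1 to the weight of the arriving item."""
    weights = {}
    for item in stream:
        for x in weights:
            weights[x] *= (1 - c)
        weights[item] = weights.get(item, 0.0) + 1.0
    return weights

w = decayed_weights("brtbhbgbbgzcbabbcbdbdb", c=0.1)
print(sorted(w.items(), key=lambda kv: -kv[1])[:3])  # "b" has by far the largest weight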

 Extension: Count (some) itemsets
▪ What are the currently "hot" itemsets?
▪ Problem: There are too many itemsets to keep counts of all of them in memory
 When a basket 𝑩 comes in:
▪ Multiply all counts by (𝟏 − 𝒄)
▪ Add 𝟏 to the count of any item in 𝑩 and of any itemset contained in 𝑩 that is already being counted
▪ For uncounted items in 𝑩, create a new count
▪ Drop counts < ½
▪ Initiate new itemset counts (next slide)

 Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to the arrival of basket B
▪ Intuitively: if all subsets of S are being counted, they are "frequent/hot" and thus S has the potential to be "hot"
 Example:
▪ Start counting S = {i, j} iff both i and j were counted prior to seeing B
▪ Start counting S = {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B

 Counts for single items < (2/c) ∙ (avg. number of items in a basket)
 Counts for larger itemsets = ??
 But we are conservative about starting counts of large sets
▪ If we counted every set we saw, one basket of 20 items would initiate about 1M counts (2^20 subsets)
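A minimal sketch of the basket update described above, covering decay, crediting, conservative initiation of pairs, and dropping small counts (the grocery baskets and the exaggerated c are illustrative; only pairs are initiated here, though the same rule extends to larger itemsets):

from itertools import combinations

def process_basket(counts, basket, c=1e-6):
    """One step of the decaying-window itemset heuristic."""
    basket = frozenset(basket)
    counted_before = set(counts)              # itemsets counted prior to this basket
    # 1. Decay every existing count
    for key in counts:
        counts[key] *= (1 - c)
    # 2. Add 1 to every single item in the basket (creating counts for new items)
    #    and to every already-counted itemset contained in the basket
    for item in basket:
        key = frozenset([item])
        counts[key] = counts.get(key, 0.0) + 1.0
    for key in counted_before:
        if len(key) > 1 and key <= basket:
            counts[key] += 1.0
    # 3. Conservative initiation: start counting a pair {i, j} only if {i} and {j}
    #    were already counted before this basket arrived
    for i, j in combinations(sorted(basket), 2):
        pair = frozenset([i, j])
        if pair not in counts and frozenset([i]) in counted_before and frozenset([j]) in counted_before:
            counts[pair] = 1.0
    # 4. Drop counts that fell below 1/2
    for key in [k for k, v in counts.items() if v < 0.5]:
        del counts[key]

counts = {}
for b in [{"milk", "bread"}, {"milk", "bread", "beer"}, {"milk", "bread"}]:
    process_basket(counts, b, c=0.1)
print(counts)   # the pair {bread, milk} is initiated once both singles were counted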
 Sampling a fixed proportion of a stream
▪ Sample size grows as the stream grows
 Sampling a fixed-size sample
▪ Reservoir sampling
 Check existence of a set of keys in the stream
▪ Bloom filter
 Counting distinct elements in a stream
▪ Flajolet-Martin algorithm
 Counting frequent elements in a stream
▪ Exponentially decaying window
