16 Streams
Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
In practice we grow the tree greedily:
▪ Start with a tree of depth 0
▪ For each leaf node in the tree, try to add a split
▪ The change of the objective after adding a split is: Gain = (score of left child) + (score of right child) − (score if we do not split)
▪ Greedily grow the tree that minimizes the objective: take the split that gives the best gain
Next: How to find the best split?

How to find the best split:
▪ For each node, enumerate over all features
▪ For each feature, sort the instances by feature value
▪ Use a linear scan to decide the best split along that feature
▪ Take the best split solution along all the features

Goal: prevent overfitting
▪ Pre-stopping: stop splitting if the best split has negative gain
  ▪ But maybe a split can benefit future splits
▪ Post-pruning: grow the tree first, then prune splits with negative gain afterwards

Adding trees to the ensemble:
▪ Add a new tree f_t(x) in each iteration
▪ Compute the necessary statistics for our objective
▪ Greedily grow the tree that minimizes the objective
▪ Add f_t(x) to our ensemble model, scaled by the step-size (shrinkage) ε, usually set around 0.1
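Below is a minimal Python sketch of the linear-scan split search just described. The sse_score leaf objective and the data layout (lists of feature vectors and labels) are illustrative assumptions, not XGBoost's actual regularized objective; a real implementation would keep running gradient statistics instead of rescoring sublists.

```python
def best_split(instances, labels, score):
    """Greedy split search: for each feature, sort instances by value and
    linearly scan candidate cut points; keep the split with the best gain,
    where gain = score(left) + score(right) - score(no split)."""
    best = None  # (gain, feature index, threshold)
    parent = score(labels)
    for f in range(len(instances[0])):
        order = sorted(range(len(instances)), key=lambda i: instances[i][f])
        for cut in range(1, len(order)):
            left = [labels[i] for i in order[:cut]]
            right = [labels[i] for i in order[cut:]]
            gain = score(left) + score(right) - parent
            if best is None or gain > best[0]:
                best = (gain, f, instances[order[cut]][f])
    return best

# Toy leaf objective (an assumption): negative sum of squared deviations, higher is better.
def sse_score(ys):
    mean = sum(ys) / len(ys)
    return -sum((y - mean) ** 2 for y in ys)

print(best_split([[1.0], [2.0], [3.0], [10.0]], [1.0, 1.0, 1.0, 5.0], sse_score))
```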
XGBoost: eXtreme Gradient Boosting
▪ A highly scalable implementation of gradient boosted decision trees with regularization

So far we have worked with data sets or databases: apps where all the data is available. [Course-map figure residue: high-dimensional data, graph data, infinite data, machine learning, apps.]
In many data mining situations, we do not know the entire data set in advance.
Stream Management is important when the input rate is controlled externally:
▪ Google queries
▪ Twitter posts or Facebook status updates
▪ e-Commerce purchase data
▪ Credit card transactions

Input elements enter at a rapid rate, at one or more input ports (i.e., streams)
▪ We call elements of the stream tuples
The system cannot store the entire stream accessibly
Think of the data as infinite and non-stationary (the distribution changes over time)
Q: How do you make critical calculations about the stream using a limited amount of (secondary) memory?
▪ This is the fun part and why interesting algorithms are needed

Stochastic Gradient Descent (SGD) is an example of a streaming algorithm
In Machine Learning we call this: Online Learning
▪ Allows for modeling problems where we have a continuous stream of data
▪ We want an algorithm to learn from it and slowly adapt to the changes in data
Idea: Do small updates to the model
▪ SGD makes small updates
▪ So: First train the classifier on training data
▪ Then: For every example from the stream, we slightly update the model (using a small learning rate), as in the sketch below
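A minimal sketch of this online-learning loop, assuming a linear model with squared loss; the toy stream, feature layout, and learning rate are illustrative assumptions.

```python
def sgd_online_update(weights, x, y, lr=0.01):
    """One online SGD step for a linear model with squared loss:
    nudge the weights slightly toward the new streaming example."""
    prediction = sum(w * xi for w, xi in zip(weights, x))
    error = prediction - y
    return [w - lr * error * xi for w, xi in zip(weights, x)]

# First train on a batch of training data (not shown); then keep adapting:
weights = [0.0, 0.0, 0.0]
stream = [([1.0, 0.5, -0.2], 1.0), ([0.3, -1.0, 0.8], 0.0)]  # stand-in for the stream
for x, y in stream:
    weights = sgd_online_update(weights, x, y)
print(weights)
```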
[Stream model figure: streams entering over time, e.g. … 1, 5, 2, 7, 0, 9, 3; … a, r, v, t, y, h, b; … 0, 0, 1, 0, 1, 1, 0, each composed of elements/tuples; a processor answers standing queries and ad-hoc queries and produces output, using limited working storage backed by archival storage.]

Types of queries one wants to answer on a data stream:
▪ Sampling data from a stream: construct a random sample
▪ Filtering a data stream: select elements with property x from the stream
▪ Counting distinct elements: number of distinct elements in the last k elements of the stream
▪ Finding most frequent elements

Applications:
▪ Mining query streams: Google wants to know what queries are more frequent today than yesterday
▪ Mining click streams: Wikipedia wants to know which of its pages are getting an unusual number of hits in the past hour
▪ Mining social network news feeds: look for trending topics on Twitter, Facebook
Why is this important?
▪ Since we cannot store the entire stream, a representative sample can act like the stream

Two different problems:
▪ (1) Sample a fixed proportion of elements in the stream (say 1 in 10)
  ▪ As the stream grows, the sample also gets bigger
▪ (2) Maintain a random sample of fixed size s over a potentially infinite stream
  ▪ At any "time" k we would like a random sample of s elements of the stream 1..k
  ▪ What is the property of the sample we want to maintain? For all time steps k, each of the k elements seen so far must have equal probability of being sampled

Problem 1: Sampling a fixed proportion
▪ E.g., sample 10% of the stream
▪ As the stream gets bigger, the sample gets bigger
Naïve solution:
▪ Generate a random integer in [0...9] for each query
▪ Store the query if the integer is 0, otherwise discard it
Any problem with this approach?
▪ We have to be very careful what query we answer using this sample
Scenario: Search engine query stream
▪ Stream of tuples: (user, query, time)
▪ Question: What fraction of unique queries by an average user are duplicates?
▪ Suppose each user issues x queries once and d queries twice (a total of x+2d query instances); then the correct answer to the question is d/(x+d)
▪ Proposed solution: We keep 10% of the queries
  ▪ The sample will contain (x+2d)/10 elements of the stream
  ▪ The sample will contain d/100 pairs of duplicates: d/100 = 1/10 ∙ 1/10 ∙ d
  ▪ There are (10x+19d)/100 unique elements in the sample: (x+2d)/10 − d/100 = (10x+19d)/100
  ▪ So the sample-based answer is (d/100) / ((10x+19d)/100) = d/(10x+19d), which underestimates the correct answer d/(x+d)

Solution:
▪ Don't sample queries, sample users instead
▪ Pick 1/10th of the users and take all their search queries in the sample
▪ Use a hash function that hashes the user name or user id uniformly into 10 buckets (a sketch follows below)
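A minimal sketch of the sample-users-not-queries idea; using hashlib.md5 of the user id as the bucketing hash is an illustrative assumption (any hash that spreads users uniformly over the 10 buckets would do).

```python
import hashlib

def keep_query(user_id, num_buckets=10, buckets_kept=1):
    """Sample users, not queries: keep every query whose user hashes into
    one of the first `buckets_kept` of `num_buckets` buckets (here 1 in 10)."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return (h % num_buckets) < buckets_kept

stream = [("alice", "cats", 1), ("bob", "dogs", 2)]  # stand-in (user, query, time) tuples
sample = [(u, q) for (u, q, t) in stream if keep_query(u)]
print(sample)
```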
Algorithm (a.k.a. Reservoir Sampling)
▪ Store all the first s elements of the stream to S
▪ Suppose we have seen n-1 elements, and now the nth element arrives (n > s)
  ▪ With probability s/n, keep the nth element, else discard it
  ▪ If we picked the nth element, then it replaces one of the s elements in the sample S, picked uniformly at random
Claim: This algorithm maintains a sample S with the desired property:
▪ After n elements, the sample contains each element seen so far with probability s/n
(A sketch of the algorithm follows the proof below.)

We prove this by induction:
▪ Assume that after n elements, the sample contains each element seen so far with probability s/n
▪ We need to show that after seeing element n+1 the sample maintains the property
  ▪ Sample contains each element seen so far with probability s/(n+1)
Base case:
▪ After we see n = s elements, the sample S has the desired property
  ▪ Each of the n = s elements is in the sample with probability s/s = 1

Inductive hypothesis: After n elements, the sample S contains each element seen so far with prob. s/n
Inductive step:
▪ New element n+1 arrives; it goes into S with prob. s/(n+1)
▪ For all other elements currently in S:
  ▪ They were in S with prob. s/n
  ▪ The probability that they remain in S is
    (1 − s/(n+1)) + (s/(n+1)) · ((s−1)/s) = n/(n+1)
    (the first term: element n+1 is discarded; the second term: element n+1 is kept but this element of the sample is not the one replaced)
  ▪ So tuples stayed in S with prob. n/(n+1)
▪ So, P(tuple is in S at time n+1) = (s/n) · (n/(n+1)) = s/(n+1)
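A minimal sketch of reservoir sampling as described above, using Python's random module; the example stream is illustrative.

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample S of size s over a stream:
    each element seen so far is in S with probability s/n."""
    S = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            S.append(element)                 # store the first s elements
        elif random.random() < s / n:         # keep the nth element with prob. s/n
            S[random.randrange(s)] = element  # it replaces a uniformly chosen slot
    return S

print(reservoir_sample(range(1000), s=10))
```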
Given a set of keys S that we want to filter:
▪ Create a bit array B of n bits, initially all 0s
▪ Choose a hash function h with range [0, n)
▪ Hash each member s of S to one of n buckets, and set that bit to 1, i.e., B[h(s)] = 1
▪ Hash each element a of the stream and output only those that hash to a bit that was set to 1
  ▪ Output a if B[h(a)] == 1

[Filter figure: an item is hashed by h into the bit array B (e.g., 0010001011000). If its bit is 1, output the item, since it may be in S: it hashes to a bucket that at least one of the items in S hashed to. If its bit is 0, drop the item: it is surely not in S.]

This creates false positives but no false negatives
▪ Items that are hashed to a 1 bucket may or may not be in S
▪ Items that are hashed to a 0 bucket are surely not in S

Example: |S| = 1 billion email addresses, |B| = 1 GB = 8 billion bits
▪ If the email address is in S, then it surely hashes to a bucket that has the bit set to 1, so it always gets through (no false negatives)
▪ Approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses not in S get through to the output (false positives)
  ▪ Actually, less than 1/8th, because more than one address might hash to the same bit
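A minimal sketch of this first-cut filter with a single hash function; hashlib.sha1 as the hash and the tiny bit-array size are illustrative assumptions (a real deployment would use the 8-billion-bit array from the example).

```python
import hashlib

class FirstCutFilter:
    """Membership filter with a single hash function:
    no false negatives, some false positives."""
    def __init__(self, n_bits):
        self.n = n_bits
        self.bits = bytearray(n_bits)  # one byte per bit, for simplicity

    def _h(self, key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16) % self.n

    def add(self, key):
        self.bits[self._h(key)] = 1          # B[h(s)] = 1 for every member of S

    def might_contain(self, key):
        return self.bits[self._h(key)] == 1  # output the element only if its bit is set

f = FirstCutFilter(n_bits=1000)
f.add("alice@example.com")
print(f.might_contain("alice@example.com"), f.might_contain("bob@example.com"))
```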
Let's do a more accurate analysis of the number of false positives. We know that:
▪ Fraction of 1s in array B = prob. of false positive

Darts & Targets: If we throw m darts into n equally likely targets, what is the probability that a target gets at least one dart?
▪ In our case: Targets = bits/buckets; Darts = hash values of items

We have m darts, n targets:
▪ Probability that some target X is not hit by a dart: (1 − 1/n)^(n·(m/n))
▪ Probability that at least one dart hits target X: 1 − (1 − 1/n)^(n·(m/n)) ≈ 1 − e^(−m/n), since (1 − 1/n)^n equals 1/e as n → ∞ (the approximation is especially accurate when n is large)

Fraction of 1s in the array B = probability of false positive = 1 − e^(−m/n)
▪ Example: 10^9 darts, 8·10^9 targets
  ▪ Fraction of 1s in B = 1 − e^(−1/8) = 0.1175
  ▪ Compare with our earlier estimate: 1/8 = 0.125
To reduce the false positive rate of the Bloom filter we use multiple hash functions
Consider: |S| = m keys, |B| = n bits
▪ Use k independent hash functions h1, …, hk, and we only let the element x through if all k hash functions map it to a bit that is set to 1
What fraction of the bit vector B are 1s?
▪ Throwing k·m darts at n targets
▪ Example: m = 1 billion, n = 8 billion
  ▪ k = 1: (1 − e^(−1/8)) = 0.1175
What happens as we keep increasing k?
[Plot residue removed: false-positive rate vs. number of hash functions k (x-axis 0 to 14); only scattered y-axis ticks around 0.06 to 0.2 survived extraction.]
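A quick numeric check of the darts analysis; the k-hash-function formula (1 − e^(−km/n))^k used below follows the same throwing-k·m-darts argument and is stated here as an assumption rather than taken from the slides.

```python
import math

def bloom_false_positive(m, n, k=1):
    """Approximate false-positive probability with m keys, n bits,
    k hash functions: (1 - e^(-k*m/n))^k."""
    return (1.0 - math.exp(-k * m / n)) ** k

m, n = 1_000_000_000, 8_000_000_000
print(bloom_false_positive(m, n, k=1))  # ~0.1175, matching the example above
best_k = min(range(1, 15), key=lambda k: bloom_false_positive(m, n, k))
print(best_k, bloom_false_positive(m, n, best_k))  # k in 1..14 with the lowest rate
```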
Problem:
▪ Data stream consists of a universe of elements chosen from a set of size N
▪ Maintain a count of the number of distinct elements seen so far

Obvious approach: maintain a dictionary of elements seen so far
▪ Keep a hash table of all the distinct elements seen so far
▪ What if the number of distinct elements is huge?
▪ What if there are many streams that need to be processed at once?

Applications:
▪ How many unique users has a website seen in each given month?
  ▪ Universal set = set of logins for that month
  ▪ Stream element = each time someone logs in
▪ How many different words are found at a site that is among the Web pages being crawled?
  ▪ Unusually low or high numbers could indicate artificial pages (spam?)
▪ How many distinct products have we sold in the last week?
Real problem: What if we do not have space to maintain the set of elements seen so far in every stream?
▪ We have limited working storage
▪ We use a variety of hashing and randomization to get approximately what we want
▪ Estimate the count in an unbiased way
▪ Accept that the count may have a little error, but limit the probability that the error is large

Flajolet-Martin approach: estimate the number of distinct elements by hashing elements to a bit-string that is sufficiently long
▪ The length of the bit-string is large enough that the hash function has more possible results than the size of the universal set
▪ Idea: the more different elements we see in the stream, the more different hash values we shall see
  ▪ The number of trailing 0s in these hash values estimates the number of distinct elements

Pick a hash function h that maps each of the N elements to at least log2 N bits
For each stream element a, let r(a) be the number of trailing 0s in h(a)
▪ r(a) = position of the first 1, counting from the right
▪ E.g., say h(a) = 12; 12 is 1100 in binary, so r(a) = 2
Record R = the maximum r(a) seen
▪ R = max_a r(a), over all the items a seen so far
Estimated number of distinct elements = 2^R
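A minimal sketch of the Flajolet-Martin estimate described above; deriving a 64-bit hash from hashlib.sha1 is an illustrative assumption.

```python
import hashlib

def trailing_zeros(x):
    """r(a): number of trailing 0s in the binary representation of x."""
    if x == 0:
        return 0  # convention for the all-zeros hash value
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream):
    """Flajolet-Martin: R = max trailing-zero count over hashed elements;
    estimate the number of distinct elements as 2^R."""
    R = 0
    for a in stream:
        h = int(hashlib.sha1(str(a).encode()).hexdigest()[:16], 16)  # 64-bit hash
        R = max(R, trailing_zeros(h))
    return 2 ** R

# True distinct count is 500; the single-hash estimate is a nearby power of 2.
print(fm_estimate("user%d" % (i % 500) for i in range(10_000)))
```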
Very rough and heuristic intuition for why Flajolet-Martin works:
▪ h(a) hashes a with equal prob. to any of N values
▪ All elements have equal prob. of having a tail of r zeros
  ▪ That is, a 2^(−r) fraction of all a's have a tail of r zeros
  ▪ About 50% of a's hash to ***0; about 25% of a's hash to **00
▪ So, if the longest tail we saw is r = 2 (i.e., an item hash ending *100) then we have probably seen about 4 distinct items so far
▪ So, it takes hashing about 2^r items before we see one with a zero-suffix of length r

Now we show why Flajolet-Martin works. Let m be the number of distinct elements seen so far in the stream. We show that the probability of finding a tail of r zeros:
▪ Goes to 1 if m ≫ 2^r
▪ Goes to 0 if m ≪ 2^r
Thus, 2^R will almost always be around m!

What is the probability that a given h(a) ends in at least r zeros? It is 2^(−r)
▪ h(a) hashes elements uniformly at random
▪ Probability that a random number ends in at least r zeros is 2^(−r)
Then, the probability of NOT seeing a tail of length r among m elements is:
  (1 − 2^(−r))^m
(each given h(a) ends in fewer than r zeros with prob. 1 − 2^(−r); so this is the prob. that all m hash values end in fewer than r zeros)
Note: (1 − 2^(−r))^m = ((1 − 2^(−r))^(2^r))^(m·2^(−r)) ≈ e^(−m·2^(−r))

Prob. of NOT finding a tail of length r is:
▪ If m ≪ 2^r, then the prob. tends to 1
  ▪ (1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 1 as m/2^r → 0
  ▪ So, the probability of finding a tail of length r tends to 0
▪ If m ≫ 2^r, then the prob. tends to 0
  ▪ (1 − 2^(−r))^m ≈ e^(−m·2^(−r)) → 0 as m/2^r → ∞
  ▪ So, the probability of finding a tail of length r tends to 1
Thus, 2^R will almost always be around m!

E[2^R] is actually infinite
▪ Probability halves when R → R+1, but the value doubles
▪ Workaround involves using many hash functions h_i and getting many samples of R_i
How are the samples R_i combined?
▪ Average? What if there is one very large value 2^(R_i)?
▪ Median? All estimates are a power of 2
▪ Solution (see the sketch below):
  ▪ Partition your samples into small groups
  ▪ Take the median of each group
  ▪ Then take the average of the medians
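A minimal sketch of the median-of-groups, average-of-medians combiner, simulating independent hash functions by salting one hash; the number of hashes and the group size are illustrative assumptions.

```python
import hashlib
from statistics import median

def fm_r(stream, salt):
    """Max trailing-zero count R_i for one salted (pseudo-independent) hash."""
    R = 0
    for a in stream:
        h = int(hashlib.sha1((salt + str(a)).encode()).hexdigest()[:16], 16)
        r = 0
        while h and h & 1 == 0:
            h >>= 1
            r += 1
        R = max(R, r)
    return R

def fm_combined(stream, num_hashes=24, group_size=4):
    """Partition the 2^R_i samples into small groups, take the median of each
    group, then average the medians."""
    samples = [2 ** fm_r(stream, salt=str(i)) for i in range(num_hashes)]
    groups = [samples[i:i + group_size] for i in range(0, len(samples), group_size)]
    return sum(median(g) for g in groups) / len(groups)

print(fm_combined(["item%d" % (i % 300) for i in range(5_000)]))  # true count: 300
```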
New Problem: Given a stream of itemsets, which itemsets appear more frequently?
Application:
▪ What are "currently" the most popular movies?
▪ What are the most frequent products bought together?
▪ What are some "hot" gift items bought together?
Solution: Exponentially decaying windows
▪ We first use it to count single items (popular movies, most bought products, etc.)
▪ Then we extend it to counting itemsets

Exponentially decaying windows: a heuristic for selecting likely frequent items (itemsets)
▪ Instead of computing the raw count in the last N elements, compute a smooth aggregation over the whole stream
▪ Smooth aggregation: if the stream is a1, a2, … then the smooth aggregation at time T is Σ_{t=1..T} a_t · (1 − c)^(T−t)
  ▪ c is a constant, presumably tiny, like 10^(−6) or 10^(−9)
  ▪ a_t is a non-negative integer in general
▪ When a new a_{T+1} arrives: multiply the current sum by (1 − c) and add a_{T+1} (see the sketch below)

Think of the stream of itemsets as one binary stream per item
▪ For every item, form a binary stream: 1 = item present; 0 = not present
▪ Stream of items: brtbhbgbbgzcbabbcbdbdbnbrbpbqbbsbtbababebcbbbvbwbxbwbbbcbdbcgfbabbzdba
▪ Binary stream for "b": 1001010110001011010101010101011010101010101110101010111010100010110010
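A minimal sketch of the decayed-sum update just described (multiply the running sum by (1 − c), then add the new value); the toy stream and the value of c are illustrative.

```python
def decayed_sum(stream, c=1e-6):
    """Exponentially decaying window: sum_t a_t * (1 - c)^(T - t),
    maintained incrementally as each new a_{T+1} arrives."""
    total = 0.0
    for a in stream:
        total = total * (1.0 - c) + a  # decay the old sum, then add the new value
    return total

print(decayed_sum([1, 0, 1, 1, 0, 1], c=0.1))
```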
If each a_t is an "item", we can compute the characteristic function of each item x as an exponentially decaying window:
▪ That is: Σ_{t=1..T} δ_t · (1 − c)^(T−t), where δ_t = 1 if a_t = x, and 0 otherwise
▪ In other words: imagine that for each item x we have a binary stream (1 if x appears, 0 if x does not appear)
▪ Then, when a new item a_t arrives:
  ▪ Multiply the summation of each item by (1 − c)
  ▪ Add +1 to the summation of item x = a_t
▪ Call this sum the "weight" of item x
▪ Important property: the sum over all weights, Σ_t 1 · (1 − c)^t = 1/[1 − (1 − c)] = 1/c

What are "currently" the most popular movies?
▪ Suppose we want to find movies of weight > ½
▪ Important property: the sum over all weights Σ_t δ_t · (1 − c)^t is 1/[1 − (1 − c)] = 1/c
  ▪ That means that no item can have weight greater than 1/c
  ▪ An item will have weight 1/c if its stream is [1,1,1,1,1,…]. Note we have a separate binary stream for each item, so at a given time only one item has δ_t = 1, and for all other items δ_t = 0.
▪ Thus: there cannot be more than 2/c movies with weight ½ or more
  ▪ Why? Assume each counted item has weight at least ½. How many items n can we have so that the sum stays < 1/c? Answer: ½·n < 1/c → n < 2/c
▪ So, 2/c is a limit on the number of movies being counted at any time
Extension: Count (some) itemsets
▪ What are currently "hot" itemsets?
▪ Problem: too many itemsets to keep counts of all of them in memory
When a basket B comes in:
▪ Multiply all counts by (1 − c)
▪ For uncounted items in B, create a new count
▪ Add 1 to the count of any item in B and to any itemset contained in B that is already being counted
▪ Drop counts < ½
▪ Initiate new counts for itemsets (rule below)

Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to the arrival of basket B
▪ Intuitively: if all subsets of S are being counted, they are "frequent/hot", and thus S has the potential to be "hot"
▪ Example:
  ▪ Start counting S = {i, j} iff both i and j were counted prior to seeing B
  ▪ Start counting S = {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B

How many counts do we maintain?
▪ Counts for single items < (2/c) ∙ (avg. number of items in a basket)
▪ Counts for larger itemsets = ??
▪ But we are conservative about starting counts of large sets
  ▪ If we counted every set we saw, one basket of 20 items would initiate 1M counts
A sketch of this update rule follows below.
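A minimal sketch of the decayed itemset counting described above, assuming baskets arrive as Python sets and restricting new itemsets to pairs and triples for brevity; the value of c and the example baskets are illustrative.

```python
from itertools import combinations

def update_counts(counts, basket, c=1e-3):
    """One basket B arrives: decay every count, credit counted itemsets that occur
    in B, start counts for new itemsets whose proper subsets were already counted,
    and drop counts that fall below 1/2."""
    prior = set(counts)                      # itemsets counted before this basket
    for key in counts:
        counts[key] *= (1.0 - c)             # multiply all counts by (1 - c)
        if set(key) <= basket:
            counts[key] += 1.0               # +1 for counted itemsets contained in B
    for item in basket:
        counts.setdefault((item,), 1.0)      # uncounted single items start at 1
    for size in (2, 3):                      # pairs and triples only, for brevity
        for S in combinations(sorted(basket), size):
            if S not in counts and all(sub in prior for sub in combinations(S, size - 1)):
                counts[S] = 1.0              # every proper subset was counted prior to B
    for key in [k for k, v in counts.items() if v < 0.5]:
        del counts[key]                      # drop counts < 1/2

counts = {}
for basket in [{"milk", "bread"}, {"milk", "bread", "beer"}, {"milk"}]:
    update_counts(counts, basket)
print(counts)
```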
Summary:
Sampling a fixed proportion of a stream
▪ Sample size grows as the stream grows
Sampling a fixed-size sample
▪ Reservoir sampling
Check existence of a set of keys in the stream
▪ Bloom filter
Counting distinct elements in a stream
▪ Flajolet-Martin algorithm
Counting frequent elements in a stream
▪ Exponentially decaying window