
compsci 514: algorithms for data science

Cameron Musco
University of Massachusetts Amherst. Spring 2020.
Lecture 5

logistics

• Problem Set 1 was released last Thursday and is due Friday 2/14 at 8pm on Gradescope. Don’t leave it until the last minute.
• There is no class this Thursday.

last time

Last Class We Covered:

• Bloom Filters:
  • Random hashing to maintain a large set in very small space.
  • Discussed applications and how the false positive rate is determined.
• Streaming Algorithms and Distinct Elements:
  • Started on streaming algorithms and one of the most fundamental examples: estimating the number of distinct items in a data stream.
  • Introduced an algorithm for doing this via a min-of-hashes approach.

this class

Finish Distinct Elements:

• Finish hashing-based distinct elements algorithm. Learn the ‘median trick’ to boost accuracy.
• Discuss variants and practical implementations.

MinHashing For Set Similarity:

• See how a min-of-hashes approach (MinHash) is used to estimate the overlap between two bit vectors.
• A key idea behind audio fingerprint search (Shazam), document search (plagiarism and copyright violation detection), recommendation systems, etc.

hashing for distinct elements

Distinct Elements (Count-Distinct) Problem: Given a stream x1, . . . , xn, estimate the number of distinct elements.

Hashing for Distinct Elements (variant of Flajolet-Martin):

• Let h : U → [0, 1] be a random hash function (with a real-valued output)
• s := 1
• For i = 1, . . . , n:
  • s := min(s, h(xi))
• Return d̂ = 1/s − 1

hashing for distinct elements

Hashing for Distinct Elements:

• Let h : U → [0, 1] be a random hash function (with a real-valued output)
• s := 1
• For i = 1, . . . , n:
  • s := min(s, h(xi))
• Return d̂ = 1/s − 1

• After all items are processed, s is the minimum of d points chosen uniformly at random on [0, 1], where d = # distinct elements.
• Intuition: The larger d is, the smaller we expect s to be.
• Same idea as the Flajolet-Martin algorithm and HyperLogLog, except they use discrete hash functions.
• Notice: The output does not depend on n at all.
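To make this concrete, here is a minimal Python sketch (not from the slides) of the idealized algorithm. It simulates the real-valued random hash function h with a dictionary of uniform draws, which of course defeats the small-space goal; it is only meant to illustrate the min-of-hashes estimate d̂ = 1/s − 1.

import random

def estimate_distinct(stream):
    hash_values = {}   # simulated random hash h : U -> [0, 1]
    s = 1.0            # running minimum of h(x_i)
    for x in stream:
        if x not in hash_values:
            hash_values[x] = random.random()
        s = min(s, hash_values[x])
    return 1.0 / s - 1.0   # d_hat = 1/s - 1

# Example: 1,000 distinct items, each repeated 10 times.
stream = [i % 1000 for i in range(10_000)]
print(estimate_distinct(stream))   # a (very noisy) estimate of 1000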
performance in expectation

s is the minimum of d points chosen uniformly at random on [0, 1], where d = # distinct elements.

E[s] = 1/(d + 1)   (using E[s] = ∫₀^∞ Pr(s > x) dx and calculus)

• So the estimate d̂ = 1/s − 1 output by the algorithm is correct if s exactly equals its expectation. Does this mean E[d̂] = d? No, but:
• The approximation is robust: if |s − E[s]| ≤ ϵ · E[s] for any ϵ ∈ (0, 1/2):

  (1 − 2ϵ)d ≤ d̂ ≤ (1 + 4ϵ)d.
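For completeness, here is one way to fill in that claim (a sketch of the calculation, not spelled out on the slides), assuming d ≥ 1:

% If |s - E[s]| <= eps * E[s] with E[s] = 1/(d+1), then
% (1-eps)/(d+1) <= s <= (1+eps)/(d+1), so inverting and subtracting 1:
\[
\frac{d-\epsilon}{1+\epsilon} \;\le\; \hat{d} = \frac{1}{s} - 1 \;\le\; \frac{d+\epsilon}{1-\epsilon}.
\]
% Using d >= 1 and 0 < eps <= 1/2:
\[
\frac{d-\epsilon}{1+\epsilon} \ge d\cdot\frac{1-\epsilon}{1+\epsilon} \ge d(1-\epsilon)^2 \ge (1-2\epsilon)d,
\qquad
\frac{d+\epsilon}{1-\epsilon} \le d\cdot\frac{1+\epsilon}{1-\epsilon} \le d(1+\epsilon)(1+2\epsilon) \le (1+4\epsilon)d.
\]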
initial concentration bound

So the question is how well s concentrates around its mean.

E[s] = 1/(d + 1)   and   Var[s] ≤ 1/(d + 1)²   (also via calculus).

Chebyshev’s Inequality:

Pr[|s − E[s]| ≥ ϵE[s]] ≤ Var[s]/(ϵE[s])² ≤ 1/ϵ².

The bound is vacuous for any ϵ < 1. How can we improve accuracy?

s: minimum of d distinct hashes chosen randomly over [0, 1], computed by the hashing algorithm. d̂ = 1/s − 1: estimate of # distinct elements d.

improving performance

Leverage the law of large numbers: improve accuracy via repeated independent trials.

Hashing for Distinct Elements (Improved):

• Let h1, h2, . . . , hk : U → [0, 1] be random hash functions
• s1, s2, . . . , sk := 1
• For i = 1, . . . , n:
  • For j = 1, . . . , k: sj := min(sj, hj(xi))
• s := (1/k) · ∑_{j=1}^k sj
• Return d̂ = 1/s − 1
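A minimal Python sketch (not from the slides) of this improved estimator, again simulating the k independent hash functions with dictionaries of uniform draws:

import random

def estimate_distinct_k(stream, k):
    hashes = [dict() for _ in range(k)]   # simulated h_1, ..., h_k
    mins = [1.0] * k                      # s_1, ..., s_k
    for x in stream:
        for j in range(k):
            if x not in hashes[j]:
                hashes[j][x] = random.random()
            mins[j] = min(mins[j], hashes[j][x])
    s = sum(mins) / k                     # average of the k minimums
    return 1.0 / s - 1.0                  # d_hat = 1/s - 1

stream = [i % 1000 for i in range(10_000)]
print(estimate_distinct_k(stream, k=500))   # a much tighter estimate of 1000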

analysis

s = (1/k) · ∑_{j=1}^k sj. Have already shown that for j = 1, . . . , k:

E[sj] = 1/(d + 1)  =⇒  E[s] = 1/(d + 1)   (linearity of expectation)
Var[sj] ≤ 1/(d + 1)²  =⇒  Var[s] ≤ 1/(k · (d + 1)²)   (linearity of variance)

Chebyshev’s Inequality:

Pr[|d̂ − d| ≥ 4ϵ · d] ≤ Pr[|s − E[s]| ≥ ϵE[s]] ≤ Var[s]/(ϵE[s])² ≤ (E[s]²/k)/(ϵ²E[s]²) = 1/(k · ϵ²)

How should we set k if we want 4ϵ · d error with probability ≥ 1 − δ? k = 1/(ϵ² · δ).

sj: minimum of d distinct hashes chosen randomly over [0, 1]. s = (1/k) · ∑_{j=1}^k sj. d̂ = 1/s − 1: estimate of # distinct elements d.

space complexity

Hashing for Distinct Elements:

• Let h1, h2, . . . , hk : U → [0, 1] be random hash functions
• s1, s2, . . . , sk := 1
• For i = 1, . . . , n:
  • For j = 1, . . . , k: sj := min(sj, hj(xi))
• s := (1/k) · ∑_{j=1}^k sj
• Return d̂ = 1/s − 1

• Setting k = 1/(ϵ² · δ), the algorithm returns d̂ with |d − d̂| ≤ 4ϵ · d with probability at least 1 − δ.
• Space complexity is k = 1/(ϵ² · δ) real numbers s1, . . . , sk.
• δ = 5% failure rate gives a factor 20 overhead in space complexity.
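A quick back-of-the-envelope check of that overhead (an illustration, not from the slides):

# Hash functions needed by the Chebyshev bound: k = 1/(eps^2 * delta).
eps, delta = 0.05, 0.05
k = 1 / (eps**2 * delta)
print(k)   # 8000.0 -- a 1/delta = 20x factor on top of the 1/eps^2 = 400 needed anyway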
improved failure rate

How can we decrease the cost of a small failure rate δ?

One Thought: Apply stronger concentration bounds, e.g., replace Chebyshev with Bernstein. This won’t work. Why?

Bernstein Inequality (Sample Mean): Consider independent random variables X1, . . . , Xk all falling in [−M̄, M̄] and let X = (1/k) · ∑_{i=1}^k Xi. Let µ = E[X] and σ̄² = k · Var[X]. For any t ≥ 0:

Pr(|X − µ| ≥ t) ≤ 2 exp(− t²k / (2σ̄² + (4/3) · M̄ · t)).

For us, t = ϵ/d and M̄ = 1, so t²k / ((4/3) · M̄ · t) = 3ϵk/(4d). So if k ≪ d the exponent has small magnitude (i.e., the bound is bad).

improved failure rate

Exponential tail bounds are weak for random variables with very large ranges compared to their expectation.

improved failure rate

How can we improve our dependence on the failure rate δ?


The median trick: Run t = O(log(1/δ)) trials, each with failure probability δ′ = 1/5 – each using k = 1/(δ′ϵ²) = 5/ϵ² hash functions.

• Letting d̂1, . . . , d̂t be the outcomes of the t trials, return d̂ = median(d̂1, . . . , d̂t).
• If > 2/3 of trials fall in [(1 − 4ϵ)d, (1 + 4ϵ)d], then the median will.
  • Have < 1/3 of trials on both the left and right.
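As a sketch (an illustration, not the lecture’s code), the median trick is just a generic wrapper around any randomized estimator, e.g. the k-hash distinct-elements estimator sketched above with k = 5/ϵ²:

import statistics

def median_trick(estimator, t):
    # Run t independent trials of `estimator` and return the median outcome.
    return statistics.median(estimator() for _ in range(t))

# Hypothetical usage with the estimator sketched earlier:
# d_hat = median_trick(lambda: estimate_distinct_k(stream, k=500), t=30)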
the median trick

• d̂1, . . . , d̂t are the outcomes of the t trials, each falling in [(1 − 4ϵ)d, (1 + 4ϵ)d] with probability at least 4/5.
• d̂ = median(d̂1, . . . , d̂t).

What is the probability that the median d̂ falls in [(1 − 4ϵ)d, (1 + 4ϵ)d]?

• Let X be the # of trials falling in [(1 − 4ϵ)d, (1 + 4ϵ)d]. E[X] = (4/5) · t.

Pr(d̂ ∉ [(1 − 4ϵ)d, (1 + 4ϵ)d]) ≤ Pr(X < (2/3) · t = (5/6) · E[X]) ≤ Pr(|X − E[X]| ≥ (1/6) · E[X])

Apply the Chernoff bound:

Pr(|X − E[X]| ≥ (1/6) · E[X]) ≤ 2 exp(− ((1/6)² · (4/5)t) / (2 + 1/6)) = O(e^(−O(t))).

• Setting t = O(log(1/δ)) gives failure probability e^(−log(1/δ)) = δ.


median trick

Upshot: The median of t = O(log(1/δ)) independent runs of the hashing algorithm for distinct elements returns d̂ ∈ [(1 − 4ϵ)d, (1 + 4ϵ)d] with probability at least 1 − δ.

Total Space Complexity: t trials, each using k = 1/(ϵ²δ′) hash functions, for δ′ = 1/5. Space is 5t/ϵ² = O(log(1/δ)/ϵ²) real numbers (the minimum value of each hash function).

No dependence on the number of distinct elements d or the number of items in the stream n! Both of these numbers are typically very large.

A note on the median: The median is often used as a robust alternative to the mean when there are outliers (e.g., heavy-tailed distributions, corrupted data).
distinct elements in practice

Our algorithm uses continuous-valued, fully random hash functions. These can’t be implemented...

• The idea of using the minimum hash value of x1, . . . , xn to estimate the number of distinct elements naturally extends to hash functions that map to discrete values.
• Flajolet-Martin (LogLog) algorithm and HyperLogLog.

Estimate # distinct elements based on the maximum number of trailing zeros m among the hash values. The more distinct hashes we see, the higher we expect this maximum to be.
loglog counting of distinct elements

Flajolet-Martin (LogLog) algorithm and HyperLogLog.

Estimate # distinct elements based on the maximum number of trailing zeros m.

With d distinct elements, what do we expect m to be?

Pr(h(xi) has ≥ log d trailing zeros) = 1/2^(log d) = 1/d.

So with d distinct hashes, we expect to see 1 with log d trailing zeros. Expect m ≈ log d. m takes log log d bits to store.

Total Space: O(log log d / ϵ² + log d) for an ϵ-approximate count.

Note: Requires careful averaging of estimates from multiple hash functions.
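A minimal single-hash sketch of the idea (an illustration, not real LogLog/HyperLogLog, which bucket the stream, average many such estimates, and apply a bias-correcting constant):

import hashlib

def trailing_zeros(n, width=64):
    # Number of trailing zero bits in the binary representation of n.
    if n == 0:
        return width
    z = 0
    while n & 1 == 0:
        n >>= 1
        z += 1
    return z

def loglog_single(stream):
    m = 0
    for x in stream:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
        m = max(m, trailing_zeros(h))
    return 2 ** m   # expect m ≈ log2(d), so 2^m is a (very crude) estimate of d

print(loglog_single(i % 1000 for i in range(10_000)))   # rough estimate of 1000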


loglog space guarantees

Using HyperLogLog to count 1 billion distinct items with 2% accuracy:

space used = O(log log d / ϵ² + log d)
           = 1.04 · ⌈log2 log2 d⌉ / ϵ² + ⌈log2 d⌉ bits¹
           = 1.04 · 5 / 0.02² + 30 = 13030 bits ≈ 1.6 kB!

Mergeable Sketch: Consider the case (essentially always in practice) that the items are processed on different machines.

• Given data structures (sketches) HLL(x1, . . . , xn), HLL(y1, . . . , yn), it is easy to merge them to give HLL(x1, . . . , xn, y1, . . . , yn). How?
• Set the maximum # of trailing zeros to the maximum in the two sketches.

1. 1.04 is the constant in the HyperLogLog analysis. Not important!
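A sketch of the merge step (an illustration, assuming a sketch is just a list of per-bucket “max trailing zeros” registers built with the same hash functions):

def merge_sketches(registers_a, registers_b):
    # Elementwise max gives exactly the registers we would get by streaming
    # both datasets through a single sketch.
    return [max(a, b) for a, b in zip(registers_a, registers_b)]

# Hypothetical usage: each machine summarizes its own items, a coordinator merges.
# merged = merge_sketches(hll_machine_1, hll_machine_2)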


hyperloglog in practice

Implementations: Google PowerDrill, Facebook Presto, Twitter Algebird, Amazon Redshift.

Use Case: Exploratory SQL-like queries on tables with 100s of billions of rows. ∼ 5 million count-distinct queries per day. E.g.,

• Count the number of distinct users in Germany that made at least one search containing the word ‘auto’ in the last month.
• Count the number of distinct subject lines in emails sent by users that have registered in the last week, in comparison to the number of emails sent overall (to estimate rates of spam accounts).

Traditional COUNT DISTINCT SQL calls are far too slow, especially when the data is distributed across many servers.

in practice

Estimate the number of search ‘sessions’ that happened in the last month (i.e., a single user making possibly many searches at one time, likely surrounding a specific topic).

• Count distinct keys, where the key is (IP, Hr, Min mod 10).
• Using HyperLogLog, the cost is roughly that of a (distributed) scan over the data.
Questions on distinct elements counting?

another fundamental problem

Jaccard Index: A similarity measure between two sets.

J(A, B) = |A ∩ B| / |A ∪ B| = (# shared elements) / (# total elements).

A natural measure for similarity between bit strings – interpret an n-bit string as a set containing the elements corresponding to the positions of its ones. J(x, y) = (# shared ones) / (# total ones).
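A minimal Python sketch (not from the slides) of both views of the Jaccard index:

def jaccard_sets(A, B):
    return len(A & B) / len(A | B)

def jaccard_bits(x, y):
    # x, y are equal-length 0/1 lists, viewed as sets of "one" positions.
    shared = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    total = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return shared / total if total else 0.0

print(jaccard_sets({1, 2, 3}, {2, 3, 4}))         # 2 shared / 4 total = 0.5
print(jaccard_bits([1, 1, 0, 1], [0, 1, 1, 1]))   # 2 shared / 4 total = 0.5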
computing jaccard similarity

J(A, B) = |A ∩ B| / |A ∪ B| = (# shared elements) / (# total elements).

• Computing J(A, B) exactly requires roughly linear time in |A| + |B| (using a hash table or binary search). Not bad.
• Near Neighbor Search: Have a database of n sets/bit strings and, given a set A, want to find if it has high similarity to anything in the database. O(n · average set size) time.
• All-pairs Similarity Search: Have n different sets/bit strings and want to find all pairs with high similarity. O(n² · average set size) time.

Prohibitively expensive when n is very large. We’ll see how to significantly improve on these runtimes with random hashing.
application: document comparison

How should you measure similarity between two documents? E.g., to detect plagiarism and copyright infringement, to see if an email message is similar to previously seen spam, to detect duplicate webpages in search results, etc.

• If the documents are not identical, doing a word-by-word comparison typically gives nothing. Can compute edit distance, but this is very expensive if you are comparing many documents.
• Shingling + Jaccard Similarity: Represent a document as the set of all consecutive substrings of length k (its k-shingles).
• Measure similarity as the Jaccard similarity between shingle sets.
• Also used to measure word similarity, e.g., in spell checkers.
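A minimal sketch of shingling + Jaccard (an illustration; the shingle length k = 5 is an arbitrary choice):

def shingles(text, k=5):
    # Set of all consecutive length-k substrings of the document.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(A, B):
    return len(A & B) / len(A | B)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
print(jaccard(shingles(doc1), shingles(doc2)))   # close to, but less than, 1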
application: audio fingerprinting

How should you measure similarity between two audio clips? E.g., in audio search engines like Shazam, for detecting copyright infringement, for search in sound effect libraries, etc.

Audio Fingerprinting + Jaccard Similarity:

Step 1: Compute the spectrogram: a representation of frequency intensity over time.
Step 2: Threshold the spectrogram to a binary matrix representing the sound clip.

Compare thresholded spectrograms with Jaccard similarity.


application: earthquake detection

Small earthquakes make consistent signatures on seismographs that repeat over time. Detecting repeated signatures lets you detect these otherwise undetectable events.

• Split the data into overlapping windows of 10 seconds.
• Fingerprint each window using the spectrogram (i.e., compute a binary string representing the reading in the window).
• All-pairs search for windows with high Jaccard similarity.
application: collaborative filtering

Online recommendation systems are often based on collaborative filtering. Simplest approach: find similar users and make recommendations based on those users.

• Twitter: represent a user as the set of accounts they follow. Match similar users based on the Jaccard similarity of these sets. Recommend that you follow accounts followed by similar users.
• Netflix: look at sets of movies watched. Amazon: look at products purchased, etc.
application: spam and fraud detection

Many applications to spam/fraud detection. E.g.,

• Fake Reviews: Very common on websites like Amazon. Detection often looks for (near) duplicate reviews on similar products, which have been copied. ‘Near duplicate’ can be measured with shingles + Jaccard similarity.
• Lateral phishing: Phishing emails sent to addresses at a business, coming from a legitimate email address at the same business that has been compromised.
  • One method of detection looks at the recipient list of an email and checks if it has small Jaccard similarity with any previous recipient lists. If not, the email is flagged as possible spam.

why jaccard similarity?

Why use Jaccard similarity over other metrics like Hamming distance (bit strings), correlation (sound waves, seismograms), or edit distance (text, genome sequences, etc.)?

Two Reasons:

• Depending on the application, it often is the right measure.
• Even when not ideal, it is very efficient to compute and to implement near neighbor search and all-pairs similarity search with.
  • This is what we will cover next time, using more random hashing!

Questions?

