
compsci 514: algorithms for data science

Cameron Musco
University of Massachusetts Amherst. Spring 2020.
Lecture 5

logistics

• Problem Set 1 was released last Thursday and is due Friday 2/14 at 8pm on Gradescope. Don’t leave it until the last minute.
• There is no class this Thursday.

last time

Last Class We Covered:

• Bloom Filters:
  • Random hashing to maintain a large set in very small space.
  • Discussed applications and how the false positive rate is determined.
• Streaming Algorithms and Distinct Elements:
  • Started on streaming algorithms and one of the most fundamental examples: estimating the number of distinct items in a data stream.
  • Introduced an algorithm for doing this via a min-of-hashes approach.

this class

Finish Distinct Elements:

• Finish hashing-based distinct elements algorithm. Learn the ‘median trick’ to boost accuracy.
• Discuss variants and practical implementations.

MinHashing For Set Similarity:

• See how a min-of-hashes approach (MinHash) is used to estimate the overlap between two bit vectors.
• A key idea behind audio fingerprint search (Shazam), document search (plagiarism and copyright violation detection), recommendation systems, etc.

hashing for distinct elements

Distinct Elements (Count-Distinct) Problem: Given a stream x1, . . . , xn, estimate the number of distinct elements.

Hashing for Distinct Elements (variant of Flajolet-Martin):

• Let h : U → [0, 1] be a random hash function (with a real-valued output)
• s := 1
• For i = 1, . . . , n:
  • s := min(s, h(xi))
• Return d̂ = 1/s − 1

hashing for distinct elements

Hashing for Distinct Elements:

• Let h : U → [0, 1] be a random hash function (with a real-valued output)
• s := 1
• For i = 1, . . . , n:
  • s := min(s, h(xi))
• Return d̂ = 1/s − 1

• After all items are processed, s is the minimum of d points chosen uniformly at random on [0, 1], where d = # distinct elements.
• Intuition: The larger d is, the smaller we expect s to be.
• Same idea as the Flajolet-Martin algorithm and HyperLogLog, except they use discrete hash functions.
• Notice: The output does not depend on n at all.
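To make this concrete, here is a minimal Python sketch (not from the slides) of the idealized algorithm. It simulates the real-valued random hash function h with a dictionary of uniform draws, which of course defeats the small-space goal; it is only meant to illustrate the min-of-hashes estimate d̂ = 1/s − 1.

import random

def estimate_distinct(stream):
    hash_values = {}   # simulated random hash h : U -> [0, 1]
    s = 1.0            # running minimum of h(x_i)
    for x in stream:
        if x not in hash_values:
            hash_values[x] = random.random()
        s = min(s, hash_values[x])
    return 1.0 / s - 1.0   # d_hat = 1/s - 1

# Example: 1,000 distinct items, each repeated 10 times.
stream = [i % 1000 for i in range(10_000)]
print(estimate_distinct(stream))   # a (very noisy) estimate of 1000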
performance in expectation

s is the minimum of d points chosen uniformly at random on [0, 1], where d = # distinct elements.

E[s] = 1/(d + 1)   (using E[s] = ∫₀^∞ Pr(s > x) dx and calculus)

• So the estimate d̂ = 1/s − 1 output by the algorithm is correct if s exactly equals its expectation. Does this mean E[d̂] = d? No, but:
• The approximation is robust: if |s − E[s]| ≤ ϵ · E[s] for any ϵ ∈ (0, 1/2):

  (1 − 2ϵ)d ≤ d̂ ≤ (1 + 4ϵ)d.
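For completeness, here is one way to fill in that claim (a sketch of the calculation, not spelled out on the slides), assuming d ≥ 1:

% If |s - E[s]| <= eps * E[s] with E[s] = 1/(d+1), then
% (1-eps)/(d+1) <= s <= (1+eps)/(d+1), so inverting and subtracting 1:
\[
\frac{d-\epsilon}{1+\epsilon} \;\le\; \hat{d} = \frac{1}{s} - 1 \;\le\; \frac{d+\epsilon}{1-\epsilon}.
\]
% Using d >= 1 and 0 < eps <= 1/2:
\[
\frac{d-\epsilon}{1+\epsilon} \ge d\cdot\frac{1-\epsilon}{1+\epsilon} \ge d(1-\epsilon)^2 \ge (1-2\epsilon)d,
\qquad
\frac{d+\epsilon}{1-\epsilon} \le d\cdot\frac{1+\epsilon}{1-\epsilon} \le d(1+\epsilon)(1+2\epsilon) \le (1+4\epsilon)d.
\]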
initial concentration bound

So the question is how well s concentrates around its mean.

E[s] = 1/(d + 1)   and   Var[s] ≤ 1/(d + 1)²   (also via calculus).

Chebyshev’s Inequality:

Pr[|s − E[s]| ≥ ϵE[s]] ≤ Var[s]/(ϵE[s])² ≤ 1/ϵ².

The bound is vacuous for any ϵ < 1. How can we improve accuracy?

s: minimum of d distinct hashes chosen randomly over [0, 1], computed by the hashing algorithm. d̂ = 1/s − 1: estimate of # distinct elements d.

improving performance

Leverage the law of large numbers: improve accuracy via repeated independent trials.

Hashing for Distinct Elements (Improved):

• Let h1, h2, . . . , hk : U → [0, 1] be random hash functions
• s1, s2, . . . , sk := 1
• For i = 1, . . . , n:
  • For j = 1, . . . , k: sj := min(sj, hj(xi))
• s := (1/k) · ∑_{j=1}^k sj
• Return d̂ = 1/s − 1
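A minimal Python sketch (not from the slides) of this improved estimator, again simulating the k independent hash functions with dictionaries of uniform draws:

import random

def estimate_distinct_k(stream, k):
    hashes = [dict() for _ in range(k)]   # simulated h_1, ..., h_k
    mins = [1.0] * k                      # s_1, ..., s_k
    for x in stream:
        for j in range(k):
            if x not in hashes[j]:
                hashes[j][x] = random.random()
            mins[j] = min(mins[j], hashes[j][x])
    s = sum(mins) / k                     # average of the k minimums
    return 1.0 / s - 1.0                  # d_hat = 1/s - 1

stream = [i % 1000 for i in range(10_000)]
print(estimate_distinct_k(stream, k=500))   # a much tighter estimate of 1000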

analysis

s = (1/k) · ∑_{j=1}^k sj. Have already shown that for j = 1, . . . , k:

E[sj] = 1/(d + 1)  =⇒  E[s] = 1/(d + 1)   (linearity of expectation)
Var[sj] ≤ 1/(d + 1)²  =⇒  Var[s] ≤ 1/(k · (d + 1)²)   (linearity of variance)

Chebyshev’s Inequality:

Pr[|d̂ − d| ≥ 4ϵ · d] ≤ Pr[|s − E[s]| ≥ ϵE[s]] ≤ Var[s]/(ϵE[s])² ≤ (E[s]²/k)/(ϵ²E[s]²) = 1/(k · ϵ²)

How should we set k if we want 4ϵ · d error with probability ≥ 1 − δ? k = 1/(ϵ² · δ).

sj: minimum of d distinct hashes chosen randomly over [0, 1]. s = (1/k) · ∑_{j=1}^k sj. d̂ = 1/s − 1: estimate of # distinct elements d.

space complexity

Hashing for Distinct Elements:

• Let h1, h2, . . . , hk : U → [0, 1] be random hash functions
• s1, s2, . . . , sk := 1
• For i = 1, . . . , n:
  • For j = 1, . . . , k: sj := min(sj, hj(xi))
• s := (1/k) · ∑_{j=1}^k sj
• Return d̂ = 1/s − 1

• Setting k = 1/(ϵ² · δ), the algorithm returns d̂ with |d − d̂| ≤ 4ϵ · d with probability at least 1 − δ.
• Space complexity is k = 1/(ϵ² · δ) real numbers s1, . . . , sk.
• δ = 5% failure rate gives a factor 20 overhead in space complexity.
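A quick back-of-the-envelope check of that overhead (an illustration, not from the slides):

# Hash functions needed by the Chebyshev bound: k = 1/(eps^2 * delta).
eps, delta = 0.05, 0.05
k = 1 / (eps**2 * delta)
print(k)   # 8000.0 -- a 1/delta = 20x factor on top of the 1/eps^2 = 400 needed anyway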
improved failure rate

How can we decrease the cost of a small failure rate δ?

One Thought: Apply stronger concentration bounds, e.g., replace Chebyshev with Bernstein. This won’t work. Why?

Bernstein Inequality (Sample Mean): Consider independent random variables X1, . . . , Xk all falling in [−M̄, M̄] and let X = (1/k) · ∑_{i=1}^k Xi. Let µ = E[X] and σ̄² = k · Var[X]. For any t ≥ 0:

Pr(|X − µ| ≥ t) ≤ 2 exp(− t²k / (2σ̄² + (4/3) · M̄ · t)).

For us, t = ϵ/d and M̄ = 1, so t²k / ((4/3) · M̄ · t) = 3ϵk/(4d). So if k ≪ d the exponent has small magnitude (i.e., the bound is bad).

improved failure rate

Exponential tail bounds are weak for random variables with very large ranges compared to their expectation.

improved failure rate

How can we improve our dependence on the failure rate δ?


The median trick: Run t = O(log(1/δ)) trials, each with failure probability δ′ = 1/5 – each using k = 1/(δ′ϵ²) = 5/ϵ² hash functions.

• Letting d̂1, . . . , d̂t be the outcomes of the t trials, return d̂ = median(d̂1, . . . , d̂t).
• If > 2/3 of trials fall in [(1 − 4ϵ)d, (1 + 4ϵ)d], then the median will.
  • Have < 1/3 of trials on both the left and right.
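As a sketch (an illustration, not the lecture’s code), the median trick is just a generic wrapper around any randomized estimator, e.g. the k-hash distinct-elements estimator sketched above with k = 5/ϵ²:

import statistics

def median_trick(estimator, t):
    # Run t independent trials of `estimator` and return the median outcome.
    return statistics.median(estimator() for _ in range(t))

# Hypothetical usage with the estimator sketched earlier:
# d_hat = median_trick(lambda: estimate_distinct_k(stream, k=500), t=30)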
the median trick

• d̂1, . . . , d̂t are the outcomes of the t trials, each falling in [(1 − 4ϵ)d, (1 + 4ϵ)d] with probability at least 4/5.
• d̂ = median(d̂1, . . . , d̂t).

What is the probability that the median d̂ falls in [(1 − 4ϵ)d, (1 + 4ϵ)d]?

• Let X be the # of trials falling in [(1 − 4ϵ)d, (1 + 4ϵ)d]. E[X] = (4/5) · t.

Pr(d̂ ∉ [(1 − 4ϵ)d, (1 + 4ϵ)d]) ≤ Pr(X < (2/3) · t = (5/6) · E[X]) ≤ Pr(|X − E[X]| ≥ (1/6) · E[X])

Apply the Chernoff bound:

Pr(|X − E[X]| ≥ (1/6) · E[X]) ≤ 2 exp(− ((1/6)² · (4/5)t) / (2 + 1/6)) = O(e^(−O(t))).

• Setting t = O(log(1/δ)) gives failure probability e^(−log(1/δ)) = δ.


median trick

Upshot: The median of t = O(log(1/δ)) independent runs of the hashing algorithm for distinct elements returns d̂ ∈ [(1 − 4ϵ)d, (1 + 4ϵ)d] with probability at least 1 − δ.

Total Space Complexity: t trials, each using k = 1/(ϵ²δ′) hash functions, for δ′ = 1/5. Space is 5t/ϵ² = O(log(1/δ)/ϵ²) real numbers (the minimum value of each hash function).

No dependence on the number of distinct elements d or the number of items in the stream n! Both of these numbers are typically very large.

A note on the median: The median is often used as a robust alternative to the mean when there are outliers (e.g., heavy-tailed distributions, corrupted data).
distinct elements in practice

Our algorithm uses continuous-valued, fully random hash functions. These can’t be implemented...

• The idea of using the minimum hash value of x1, . . . , xn to estimate the number of distinct elements naturally extends to hash functions that map to discrete values.
• Flajolet-Martin (LogLog) algorithm and HyperLogLog.

Estimate # distinct elements based on the maximum number of trailing zeros m among the hash values. The more distinct hashes we see, the higher we expect this maximum to be.
loglog counting of distinct elements

Flajolet-Martin (LogLog) algorithm and HyperLogLog.

Estimate # distinct elements based on the maximum number of trailing zeros m.

With d distinct elements, what do we expect m to be?

Pr(h(xi) has ≥ log d trailing zeros) = 1/2^(log d) = 1/d.

So with d distinct hashes, we expect to see 1 with log d trailing zeros. Expect m ≈ log d. m takes log log d bits to store.

Total Space: O(log log d / ϵ² + log d) for an ϵ-approximate count.

Note: Requires careful averaging of estimates from multiple hash functions.
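A minimal single-hash sketch of the idea (an illustration, not real LogLog/HyperLogLog, which bucket the stream, average many such estimates, and apply a bias-correcting constant):

import hashlib

def trailing_zeros(n, width=64):
    # Number of trailing zero bits in the binary representation of n.
    if n == 0:
        return width
    z = 0
    while n & 1 == 0:
        n >>= 1
        z += 1
    return z

def loglog_single(stream):
    m = 0
    for x in stream:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
        m = max(m, trailing_zeros(h))
    return 2 ** m   # expect m ≈ log2(d), so 2^m is a (very crude) estimate of d

print(loglog_single(i % 1000 for i in range(10_000)))   # rough estimate of 1000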


loglog space guarantees

Using HyperLogLog to count 1 billion distinct items with 2% accuracy:

space used = O(log log d / ϵ² + log d)
           = 1.04 · ⌈log2 log2 d⌉ / ϵ² + ⌈log2 d⌉ bits¹
           = 1.04 · 5 / 0.02² + 30 = 13030 bits ≈ 1.6 kB!

Mergeable Sketch: Consider the case (essentially always in practice) that the items are processed on different machines.

• Given data structures (sketches) HLL(x1, . . . , xn), HLL(y1, . . . , yn), it is easy to merge them to give HLL(x1, . . . , xn, y1, . . . , yn). How?
• Set the maximum # of trailing zeros to the maximum in the two sketches.

1. 1.04 is the constant in the HyperLogLog analysis. Not important!
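A sketch of the merge step (an illustration, assuming a sketch is just a list of per-bucket “max trailing zeros” registers built with the same hash functions):

def merge_sketches(registers_a, registers_b):
    # Elementwise max gives exactly the registers we would get by streaming
    # both datasets through a single sketch.
    return [max(a, b) for a, b in zip(registers_a, registers_b)]

# Hypothetical usage: each machine summarizes its own items, a coordinator merges.
# merged = merge_sketches(hll_machine_1, hll_machine_2)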


hyperloglog in practice

Implementations: Google PowerDrill, Facebook Presto, Twitter Algebird, Amazon Redshift.

Use Case: Exploratory SQL-like queries on tables with 100s of billions of rows. ∼ 5 million count-distinct queries per day. E.g.,

• Count the number of distinct users in Germany that made at least one search containing the word ‘auto’ in the last month.
• Count the number of distinct subject lines in emails sent by users that have registered in the last week, in comparison to the number of emails sent overall (to estimate rates of spam accounts).

Traditional COUNT DISTINCT SQL calls are far too slow, especially when the data is distributed across many servers.

in practice

Estimate the number of search ‘sessions’ that happened in the last month (i.e., a single user making possibly many searches at one time, likely surrounding a specific topic).

• Count distinct keys, where the key is (IP, Hr, Min mod 10).
• Using HyperLogLog, the cost is roughly that of a (distributed) scan over the data.
Questions on distinct elements counting?

another fundamental problem

Jaccard Index: A similarity measure between two sets.

J(A, B) = |A ∩ B| / |A ∪ B| = (# shared elements) / (# total elements).

A natural measure for similarity between bit strings – interpret an n-bit string as a set containing the elements corresponding to the positions of its ones. J(x, y) = (# shared ones) / (# total ones).
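A minimal Python sketch (not from the slides) of both views of the Jaccard index:

def jaccard_sets(A, B):
    return len(A & B) / len(A | B)

def jaccard_bits(x, y):
    # x, y are equal-length 0/1 lists, viewed as sets of "one" positions.
    shared = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    total = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return shared / total if total else 0.0

print(jaccard_sets({1, 2, 3}, {2, 3, 4}))         # 2 shared / 4 total = 0.5
print(jaccard_bits([1, 1, 0, 1], [0, 1, 1, 1]))   # 2 shared / 4 total = 0.5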
computing jaccard similarity

J(A, B) = |A ∩ B| / |A ∪ B| = (# shared elements) / (# total elements).

• Computing J(A, B) exactly requires roughly linear time in |A| + |B| (using a hash table or binary search). Not bad.
• Near Neighbor Search: Have a database of n sets/bit strings and, given a set A, want to find if it has high similarity to anything in the database. O(n · average set size) time.
• All-pairs Similarity Search: Have n different sets/bit strings and want to find all pairs with high similarity. O(n² · average set size) time.

Prohibitively expensive when n is very large. We’ll see how to significantly improve on these runtimes with random hashing.
application: document comparison

How should you measure similarity between two documents? E.g., to detect plagiarism and copyright infringement, to see if an email message is similar to previously seen spam, to detect duplicate webpages in search results, etc.

• If the documents are not identical, doing a word-by-word comparison typically gives nothing. Can compute edit distance, but this is very expensive if you are comparing many documents.
• Shingling + Jaccard Similarity: Represent a document as the set of all consecutive substrings of length k (its k-shingles).
• Measure similarity as the Jaccard similarity between shingle sets.
• Also used to measure word similarity, e.g., in spell checkers.
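A minimal sketch of shingling + Jaccard (an illustration; the shingle length k = 5 is an arbitrary choice):

def shingles(text, k=5):
    # Set of all consecutive length-k substrings of the document.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(A, B):
    return len(A & B) / len(A | B)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
print(jaccard(shingles(doc1), shingles(doc2)))   # close to, but less than, 1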
application: audio fingerprinting

How should you measure similarity between two audio clips? E.g., in audio search engines like Shazam, for detecting copyright infringement, for search in sound effect libraries, etc.

Audio Fingerprinting + Jaccard Similarity:

Step 1: Compute the spectrogram: a representation of frequency intensity over time.
Step 2: Threshold the spectrogram to a binary matrix representing the sound clip.

Compare thresholded spectrograms with Jaccard similarity.


application: earthquake detection

Small earthquakes make consistent signatures on seismographs that repeat over time. Detecting repeated signatures lets you detect these otherwise undetectable events.

• Split the data into overlapping windows of 10 seconds.
• Fingerprint each window using the spectrogram (i.e., compute a binary string representing the reading in the window).
• All-pairs search for windows with high Jaccard similarity.
application: collaborative filtering

Online recommendation systems are often based on collaborative filtering. Simplest approach: find similar users and make recommendations based on those users.

• Twitter: represent a user as the set of accounts they follow. Match similar users based on the Jaccard similarity of these sets. Recommend that you follow accounts followed by similar users.
• Netflix: look at sets of movies watched. Amazon: look at products purchased, etc.
application: spam and fraud detection

Many applications to spam/fraud detection. E.g.,

• Fake Reviews: Very common on websites like Amazon. Detection often looks for (near) duplicate reviews on similar products, which have been copied. ‘Near duplicate’ can be measured with shingles + Jaccard similarity.
• Lateral phishing: Phishing emails sent to addresses at a business, coming from a legitimate email address at the same business that has been compromised.
  • One method of detection looks at the recipient list of an email and checks if it has small Jaccard similarity with any previous recipient lists. If not, the email is flagged as possible spam.

why jaccard similarity?

Why use Jaccard similarity over other metrics like Hamming distance (bit strings), correlation (sound waves, seismograms), or edit distance (text, genome sequences, etc.)?

Two Reasons:

• Depending on the application, it often is the right measure.
• Even when not ideal, it is very efficient to compute and to implement near neighbor search and all-pairs similarity search with.
  • This is what we will cover next time, using more random hashing!

Questions?

