3. Flajolet-Martin Algorithm
Cameron Musco
University of Massachusetts Amherst. Spring 2020.
Lecture 5
logistics
last time
this class
hashing for distinct elements
$E[s] = \frac{1}{d+1}$ (using $E[s] = \int_0^\infty \Pr(s > x)\,dx$ + calculus)
Chebyshev’s Inequality:
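$$\Pr\big[|s - E[s]| \ge \epsilon E[s]\big] \le \frac{\mathrm{Var}[s]}{(\epsilon E[s])^2} \le \frac{1}{\epsilon^2},$$
using $\mathrm{Var}[s] \le \frac{1}{(d+1)^2} = E[s]^2$.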
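As a quick numerical sanity check on $E[s] = \frac{1}{d+1}$, the following sketch simulates the idealized setting in which the $d$ distinct hash values are independent uniform draws from $[0, 1]$. It is an illustration only: the function name `mean_min_of_uniforms`, the seed, and the parameter values are assumptions, not part of the lecture.

```python
import random

def mean_min_of_uniforms(d, trials, seed=0):
    """Average, over `trials` runs, of the minimum of d independent Uniform[0, 1] draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # One run: the smallest of d idealized hash values.
        total += min(rng.random() for _ in range(d))
    return total / trials

d = 9
estimate = mean_min_of_uniforms(d, trials=20000)
expected = 1.0 / (d + 1)  # the formula E[s] = 1 / (d + 1)
print(estimate, expected)
```

The empirical average should land close to `expected`, while any single run of the minimum fluctuates a lot, which is what the vacuous $1/\epsilon^2$ Chebyshev bound reflects.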
improving performance
analysis
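$s = \frac{1}{k}\sum_{j=1}^k s_j$. Have already shown that for $j = 1, \ldots, k$:
$$E[s_j] = \frac{1}{d+1} \implies E[s] = \frac{1}{d+1} \quad \text{(linearity of expectation)}$$
$$\mathrm{Var}[s_j] \le \frac{1}{(d+1)^2} \implies \mathrm{Var}[s] \le \frac{1}{k \cdot (d+1)^2} \quad \text{(linearity of variance)}$$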
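Chebyshev Inequality:
$$\Pr\big[|d - \hat{d}| \ge 4\epsilon \cdot d\big] \le \Pr\big[|s - E[s]| \ge \epsilon E[s]\big] \le \frac{\mathrm{Var}[s]}{(\epsilon E[s])^2} \le \frac{E[s]^2/k}{\epsilon^2 E[s]^2} = \frac{1}{k \cdot \epsilon^2}.$$
How should we set $k$ if we want $4\epsilon \cdot d$ error with probability $\ge 1 - \delta$?
$k = \frac{1}{\epsilon^2 \cdot \delta}$, so that $\frac{1}{k \cdot \epsilon^2} = \delta$.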
$s_j$: minimum of $d$ distinct hashes chosen randomly over $[0, 1]$. $s = \frac{1}{k}\sum_{j=1}^k s_j$.
$\hat{d} = \frac{1}{s} - 1$: estimate of # distinct elements $d$.
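The estimator summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lecture's implementation: `hash_to_unit` is a hypothetical helper that uses a salted BLAKE2b digest as a stand-in for $k$ idealized random hash functions into $[0, 1]$, and `estimate_distinct` is an assumed name.

```python
import hashlib

def hash_to_unit(item, j):
    """Map (item, j) to a pseudo-random point in [0, 1].
    Assumption: salted BLAKE2b stands in for the idealized j-th hash function."""
    digest = hashlib.blake2b(str(item).encode(), digest_size=8,
                             salt=j.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") / 2**64

def estimate_distinct(stream, k):
    """Keep the minimum hash s_j for each of k hash functions, average them
    into s, and return the estimate 1/s - 1."""
    mins = [1.0] * k
    for x in stream:
        for j in range(k):
            h = hash_to_unit(x, j)
            if h < mins[j]:
                mins[j] = h
    s = sum(mins) / k
    return 1.0 / s - 1.0

# 200 distinct items, each appearing twice; repeats hash to the same values,
# so they cannot change any minimum.
stream = list(range(200)) * 2
print(estimate_distinct(stream, k=100))
```

Only the $k$ running minima are stored, regardless of the stream length, which is the point of the method.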
space complexity
• Setting $k = \frac{1}{\epsilon^2 \cdot \delta}$, algorithm returns $\hat{d}$ with $|d - \hat{d}| \le 4\epsilon \cdot d$ with probability at least $1 - \delta$.
• Space complexity is $k = \frac{1}{\epsilon^2 \cdot \delta}$ real numbers $s_1, \ldots, s_k$.
• $\delta = 5\%$ failure rate gives a factor 20 overhead in space complexity.
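To make the space bound concrete, the arithmetic can be written out as below. The function name `num_estimates_chebyshev` and the sample values of $\epsilon$ and $\delta$ are assumptions chosen for illustration.

```python
import math

def num_estimates_chebyshev(eps, delta):
    """Number k of stored minimum-hash values, k = 1 / (eps^2 * delta), rounded up."""
    return math.ceil(1.0 / (eps * eps * delta))

eps = 0.1
base = 1.0 / (eps * eps)                       # the 1 / eps^2 part of the bound
k = num_estimates_chebyshev(eps, delta=0.05)   # 5% failure rate
overhead = k / base                            # roughly 1 / delta
print(k, overhead)
```

The overhead is $1/\delta$, so it grows linearly as the allowed failure probability shrinks.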
improved failure rate
For us, $t = \frac{\epsilon k}{d}$ and $\bar{M} = 1$. So $\frac{t^2}{\frac{4}{3}\bar{M}t} = \frac{3\epsilon k}{4d}$. So if $k \ll d$ exponent has small magnitude (i.e., bound is bad).
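A small numeric illustration of how weak this is when $k \ll d$: the sketch below evaluates the quantity $\frac{t^2}{\frac{4}{3}\bar{M}t}$ from the line above. The function name `tail_exponent` and the sample values are assumptions.

```python
def tail_exponent(eps, k, d, M=1.0):
    """Evaluate t^2 / ((4/3) * M * t) with t = eps * k / d, as in the text."""
    t = eps * k / d
    return t * t / ((4.0 / 3.0) * M * t)

# With k = 100 stored values and d = one million distinct elements,
# the exponent is tiny, so an exp(-exponent) style bound is close to 1.
print(tail_exponent(eps=0.1, k=100, d=10**6))
```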
• Letting $\hat{d}_1, \ldots, \hat{d}_t$ be the outcomes of the $t$ trials, return $\hat{d} = \mathrm{median}(\hat{d}_1, \ldots, \hat{d}_t)$.
• If $> 2/3$ of trials fall in $[(1 - 4\epsilon)d, (1 + 4\epsilon)d]$, then the median will (more than $1/2$ already suffices).
• Have $< 1/3$ of trials on both the left and right.
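The median step can be illustrated with a toy simulation. Everything here is a hypothetical stand-in rather than the actual estimator: `noisy_trial` simply returns a value inside $[(1 - 4\epsilon)d, (1 + 4\epsilon)d]$ with an assumed probability of 0.8 (matching the 4/5 success probability used in the analysis of the median trick) and a wildly wrong value otherwise.

```python
import random
import statistics

def noisy_trial(d, eps, rng, success_prob=0.8):
    """Hypothetical single trial: inside the target interval with probability
    success_prob, otherwise far too low or far too high."""
    if rng.random() < success_prob:
        return d * (1 + rng.uniform(-4 * eps, 4 * eps))
    return d * rng.choice([0.1, 10.0])

def median_of_trials(d, eps, t, seed=0):
    """Run t independent trials and return their median."""
    rng = random.Random(seed)
    return statistics.median(noisy_trial(d, eps, rng) for _ in range(t))

# Individual trials fail about 1 time in 5, but the median of 101 trials
# is only wrong if more than half of them are bad.
print(median_of_trials(d=1000, eps=0.05, t=101))
```

Using the mean here instead of the median would be pulled far off by the bad trials, which is why the median is the aggregation of choice.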
the median trick
• $\hat{d}_1, \ldots, \hat{d}_t$ are the outcomes of the $t$ trials, each falling in $[(1 - 4\epsilon)d, (1 + 4\epsilon)d]$ with probability at least $4/5$.
• $\hat{d} = \mathrm{median}(\hat{d}_1, \ldots, \hat{d}_t)$.
What is the probability that the median $\hat{d}$ falls in $[(1 - 4\epsilon)d, (1 + 4\epsilon)d]$?
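Let $X$ be the number of trials falling in $[(1 - 4\epsilon)d, (1 + 4\epsilon)d]$, so that $E[X] = \frac{4}{5} \cdot t$ (taking the per-trial success probability to be exactly $4/5$) and $\frac{2}{3} \cdot t = \frac{5}{6} \cdot E[X]$.
$$\Pr\big(\hat{d} \notin [(1 - 4\epsilon)d, (1 + 4\epsilon)d]\big) \le \Pr\big(X < \tfrac{2}{3} \cdot t\big) = \Pr\big(X < \tfrac{5}{6} \cdot E[X]\big) \le \Pr\big(|X - E[X]| \ge \tfrac{1}{6} E[X]\big)$$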
Apply Chernoff bound:
$$\Pr\big(|X - E[X]| \ge \tfrac{1}{6} E[X]\big) \le 2\exp\left(-\frac{\left(\frac{1}{6}\right)^2 \cdot \frac{4}{5} t}{2 + 1/6}\right) = O\left(e^{-O(t)}\right).$$
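Plugging numbers into this bound shows how many trials it asks for. The exponent simplifies to $\frac{(1/6)^2 \cdot (4/5)}{2 + 1/6} \cdot t = \frac{2t}{195}$. The sketch below evaluates the bound and solves for the smallest $t$ that pushes it below a target $\delta$. The function names are assumptions, and the constants come from this particular Chernoff form, so the resulting $t$ is a sufficient count under this bound rather than a tight requirement.

```python
import math

def median_failure_bound(t):
    """Evaluate 2 * exp(-(1/6)^2 * (4/5) * t / (2 + 1/6)) from the bound above."""
    dev = 1.0 / 6.0          # relative deviation from E[X]
    mu = (4.0 / 5.0) * t     # E[X]
    return 2.0 * math.exp(-(dev * dev * mu) / (2.0 + dev))

def trials_needed(delta):
    """Smallest t for which the bound above is at most delta."""
    c = (1.0 / 6.0) ** 2 * (4.0 / 5.0) / (2.0 + 1.0 / 6.0)  # equals 2 / 195
    return math.ceil(math.log(2.0 / delta) / c)

t = trials_needed(0.05)
print(t, median_failure_bound(t))
```

Because $t$ enters through $\ln(2/\delta)$, halving $\delta$ adds a constant number of trials instead of doubling the work.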
Traditional COUNT, DISTINCT SQL calls are far too slow, especially
when the data is distributed across many servers.
in practice
• Count distinct keys where key is (IP, Hr, Min mod 10).
• Using HyperLogLog, cost is roughly that of a (distributed)
Questions on distinct elements counting?
another fundamental problem
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{\#\text{ shared elements}}{\#\text{ total elements}}.$$
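For small sets the definition can be computed directly, as in this minimal sketch. The function name `jaccard` and the convention of returning 1.0 for two empty sets are assumptions; the definition above leaves that case undefined.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)

# Shared elements {2, 3}, total elements {1, 2, 3, 4}.
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```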
why jaccard similarity?
Questions?