Lec1 Bloom Distinctcount
One simple way would be to choose the hash value independently and uniformly at random for every element, but that is expensive!
For our course, we will assume all hash values are chosen independently.
Querying
• ISBN present in collection?
• IP seen by switch? e.g. 10.0.21.102
Solutions
• Universe [U], but need to store only a set S of n items, n ≪ U
• Hash table of size O(n):
  • Space: O(n·log U) bits
  • Query time: O(1) expected
• Bit array of size U:
  • Space = U bits
  • Query time: O(1)
Querying, Monte Carlo style
• In the hash table construction, we used random hash functions
• We never return an incorrect answer
• The query time is a random variable
• These are Las Vegas algorithms; allowing a small probability of error instead gives Monte Carlo algorithms
Bloom filter
[Bloom, 1970]
• A bit-array B of m bits, initially all 0
• k hash functions, h1, …, hk, each hi : U → [m]
(figure: an element 𝒙 hashed by h1, h2, h3 into three positions of the bit-array)
Operations
• Insert(x): for i = 1, …, k, set B[hi(x)] = 1
• Query(x): for i = 1, …, k, if B[hi(x)] = 0 return ABSENT; otherwise return PRESENT
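A minimal Python sketch of these two operations (the class name and the way the k positions are derived from SHA-256 digests are illustrative choices, not from the slides):

```python
import hashlib

class BloomFilter:
    """m-bit array with k hash positions per element (illustrative sketch)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, x):
        # Derive k pseudo-random positions in [m) from SHA-256 digests;
        # this stands in for k independent hash functions h1..hk.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, x):
        # Insert(x): set B[hi(x)] = 1 for i = 1..k
        for p in self._positions(x):
            self.bits[p] = 1

    def query(self, x):
        # Query(x): PRESENT iff B[hi(x)] = 1 for all i = 1..k
        return all(self.bits[p] == 1 for p in self._positions(x))
```

For example, `BloomFilter(1024, 4)` uses 1024 bits and sets at most 4 bits per inserted element.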
Bloom Filter
• If the element has been added to the Bloom filter, then Query always returns PRESENT: insertion set all k of its positions to 1, so there are no false negatives
(figure: a query for 𝒚 can land on bits set by 𝒙, producing a false positive)
Designing Bloom Filter
• Want to minimize the probability that we return a false positive
• Parameters: the array size m and the number of hash functions k
• k = 1 with m = U is just the normal bit-array
Effect of number of hash functions
• Increasing k sets more bits per insertion, so the array fills up faster
• Possibly makes it harder for false positives to happen, because a false positive requires all k hashed positions to be set
False positive analysis
• n elements inserted
• If x has not been inserted, what is the probability that Query(x) returns PRESENT?
• Assume h1, …, hk are independent and uniform over all positions
False positive analysis
• The expected number of zero bits is m(1 − 1/m)^(kn) ≈ m·e^(−kn/m), and the fraction of zeros concentrates around e^(−kn/m) w.h.p.
• Pr[PRESENT] ≈ (1 − e^(−kn/m))^k
Choosing number of hash functions
• (1 − e^(−kn/m))^k is minimized at kn/m = ln 2, i.e. k = (m/n)·ln 2
Bloom filter design
• This “optimal” choice gives false positive ≈ 2^(−k) = 2^(−(m/n)·ln 2) ≈ (0.6185)^(m/n)
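A quick numerical check of these formulas (a sketch; the parameter values are made up for illustration):

```python
import math

def optimal_k(m, n):
    # k minimizing (1 - e^(-kn/m))^k is (m/n) * ln 2
    return (m / n) * math.log(2)

def false_positive_rate(m, n, k):
    # Pr[PRESENT] for a non-inserted element, per the analysis above
    return (1.0 - math.exp(-k * n / m)) ** k

m, n = 10_000, 1_000                  # 10 bits per stored element
k = optimal_k(m, n)                   # ≈ 6.93 hash functions
fp = false_positive_rate(m, n, k)     # ≈ 2^(-6.93) ≈ 0.0082
```

At 10 bits per element, fewer than 1 query in 100 on absent elements comes back as a false positive.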
Applications
• Widespread applications whenever small false-positive rates are tolerable
• Used by browsers to decide whether a URL is potentially malicious: a BF is kept in the browser, and positives are then checked with the server; also used to reject common passwords
• Databases, e.g. BigTable, HBase, Cassandra, PostgreSQL, use BFs to avoid disk lookups for non-existent rows/columns
• Bitcoin for wallet synchronization, “simplified payment verification” (answering questions such as “is user A interested in transaction B?”)
Handling deletions
• Chief drawback: a BF does not allow deletions
• Counting Bloom Filter [Fan et al 00]: replace each bit with a small counter; increment on insert, decrement on delete
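A sketch of the counting variant: each bit becomes a small counter, so deletions become possible (the double-hashing position scheme here is an illustrative choice, not the construction of [Fan et al 00]):

```python
class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = [0] * m          # counters instead of bits

    def _positions(self, x):
        # Illustrative double hashing: position i = h1 + i*h2 (mod m)
        h = hash(x)
        h1 = h % self.m
        h2 = (h // self.m) % self.m or 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, x):
        for p in self._positions(x):
            self.counts[p] += 1

    def delete(self, x):
        # Only valid for elements that were actually inserted
        for p in self._positions(x):
            self.counts[p] -= 1

    def query(self, x):
        return all(self.counts[p] > 0 for p in self._positions(x))
```

Deleting an element that was never inserted can silently corrupt the structure, which is why counting filters only support deletions of known members.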
Other Extensions
• Are there better structures that use less space, less randomness and fewer memory lookups?
Streaming problem: distinct count
• Universe is [U]; number of distinct elements = n; stream size is m, potentially much bigger than n
• Example: all IP addresses seen by a switch
10.1.21.10, 10.93.28.1, …, 98.0.3.1, …, 10.93.28.1, …
Other applications
Solutions
• Naïve solution: O(m·log U) space
  • store all the elements, sort and count distinct
  • store a hash map, insert only if not present in map (O(n·log U) space)
• Bit array: U bits
  • the bit for an element is set to 1 only if the element is seen in the stream; count the set bits at the end
Approximations
• (ε, δ)-approximations
• Algorithm will use random hash functions
• Will return an answer n̂ such that
(1 − ε)·n ≤ n̂ ≤ (1 + ε)·n
• This will happen with probability ≥ 1 − δ over the randomness of the algorithm
First effort
• Stream length: m, distinct elements: n
• Proposed algo: given space S, sample S items from the stream
• Find the number of distinct elements in this sample, d
• Return d · (m/S)
• Not a constant factor approximation:
• 1, 1, 1, …, 1, 2, 3, …, n with the element 1 repeated m − n + 1 times: a small sample consists almost entirely of 1s
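The failure mode is easy to see in code: on a stream of m copies of a single element, every sample has d = 1, so the scaled estimate is m/S regardless of the true answer (a sketch; names are illustrative):

```python
import random

def sample_estimate(stream, S, seed=0):
    # The flawed estimator: distinct count d within a size-S sample,
    # scaled up by m/S.
    rng = random.Random(seed)
    sample = rng.sample(stream, S)
    d = len(set(sample))
    return d * len(stream) / S

# On a stream of 10,000 copies of the element 1, the true distinct
# count is 1, but with S = 100 the estimate is 10_000 / 100 = 100:
# off by a factor of 100.
```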
Linear Counting
• Estimate n?
• Hash each element into a bit array of size m; let Z = number of zero entries
• Return estimate n̂ = m·ln(m/Z)
Linear Counting Analysis
• Pr[a position remaining 0] = (1 − 1/m)ⁿ ≈ e^(−n/m)
• Expected number of positions at zero = m·(1 − 1/m)ⁿ ≈ m·e^(−n/m)
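Inverting E[Z] ≈ m·e^(−n/m) gives the estimator n̂ = m·ln(m/Z). A sketch, where the salted built-in hash is an illustrative stand-in for a random hash function:

```python
import math
import random

def linear_counting(stream, m, seed=0):
    salt = random.Random(seed).getrandbits(64)
    bits = [0] * m
    for x in stream:
        bits[hash((salt, x)) % m] = 1   # hash each element into m bins
    z = bits.count(0)                   # Z = number of zero entries
    if z == 0:
        return float("inf")             # array saturated; unreliable
    return m * math.log(m / z)          # invert E[Z] = m * e^(-n/m)
```

Duplicates in the stream set the same bit, so they do not affect the estimate.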
Let’s try something else
• Suppose you have n distinct numbers, each hashed uniformly at random to a bit string h(x)
• Let zeros(v) = number of trailing zeros in the binary representation of v
• Maintain a counter z, initialized to 0
Process(x)
• if zeros(h(x)) > z: set z ← zeros(h(x))
Estimate:
• return 2^z
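A sketch of this Flajolet–Martin style estimator (the salted built-in hash again stands in for the random hash function h):

```python
import random

def zeros(v):
    # Number of trailing zero bits of v (v = 0 treated as all zeros)
    if v == 0:
        return 64
    count = 0
    while v & 1 == 0:
        v >>= 1
        count += 1
    return count

class FMSketch:
    def __init__(self, seed=0):
        self.salt = random.Random(seed).getrandbits(64)
        self.z = 0                              # max zeros(h(x)) seen so far

    def process(self, x):
        hv = hash((self.salt, x)) & ((1 << 64) - 1)
        self.z = max(self.z, zeros(hv))

    def estimate(self):
        return 2 ** self.z
```

Processing the same element twice cannot change z, which is exactly why duplicates in the stream are handled for free.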
Example h(.)
0110101 (zeros = 0)
1011010 (zeros = 1)
1000100 (zeros = 2)
1111010 (zeros = 1)
Space usage
• We need the hash range [M] for some M ≥ n², say M = U³
• By a birthday paradox analysis, there are no collisions among the n hash values with high probability
Intuition
• Assume hash values are uniformly distributed in [0, max_hash_val]
• The probability that a uniform bit-string
  • is divisible by 2 is ½
  • is divisible by 4 is ¼
  • ….
  • is divisible by 2^r is 1/2^r
• With n hash values, we don’t expect any of them to be divisible by 2^r once 2^r ≫ n
Formalizing intuition
• S = set of elements that appeared in the stream, |S| = n
• For any j ∈ S and r ≥ 0, X_rj = indicator of zeros(h(j)) ≥ r
• Y_r = number of j ∈ S such that zeros(h(j)) ≥ r, i.e. Y_r = Σ_{j∈S} X_rj
Proof of FM
• 2^z ≥ 2^r ⟺ Y_r > 0; equivalently, 2^z < 2^r ⟺ Y_r = 0
• X_rj = 1 with prob 1/2^r, and 0 else
• Hence E[Y_r] = n/2^r
Proof of FM
• var(Y_r) ≤ Σ_{j∈S} E[X_rj²] ≤ n/2^r
• Pr[Y_r > 0] = Pr[Y_r ≥ 1] ≤ E[Y_r] = n/2^r (Markov)
• Pr[Y_r = 0] ≤ Pr[ |Y_r − E[Y_r]| ≥ E[Y_r] ] ≤ var(Y_r)/E[Y_r]² ≤ 2^r/n (Chebyshev)
Upper bound
• Returned estimate 2^z > 4n requires Y_r > 0 for the r with 4n ≤ 2^r < 8n, so Pr[2^z > 4n] ≤ n/2^r ≤ 1/4
Lower bound
• Returned estimate 2^z < n/4 requires Y_r = 0 for the r with n/4 ≤ 2^r < n/2, so Pr[2^z < n/4] ≤ 2^r/n ≤ 1/2
Understanding the bound
• By union bound, with constant probability,
n/4 ≤ n̂ ≤ 4n
• Can get somewhat better constants
• Need only 2-wise independent hash functions, since we only used variances
Improving the probabilities
• To improve the probabilities, a common trick: median of estimates
• Create k independent copies n̂₁, …, n̂ₖ in parallel
• Return their median
• Expect at most a constant fraction of the copies to exceed 4n
• But if the median exceeds 4n, then at least k/2 of them do; using a Chernoff bound, this probability is e^(−Ω(k))
• So the median lies in [n/4, 4n] with probability ≥ 1 − δ when we choose k = O(log(1/δ))
• To get an estimate within (1 ± ε), with probability 1 − δ:
  • First calculate the mean of O(1/ε²) estimates
  • Then calculate the median of O(log(1/δ)) such means
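The median trick as a generic wrapper (the process/estimate interface mirrors the FM sketch above; the ExactCounter stand-in is hypothetical and exists only to exercise the wrapper):

```python
import statistics

def median_of_estimates(stream, sketch_factory, k):
    # Run k independent sketch copies in parallel over the stream and
    # return the median of their estimates.
    sketches = [sketch_factory(seed=i) for i in range(k)]
    for x in stream:
        for s in sketches:
            s.process(x)
    return statistics.median(s.estimate() for s in sketches)

class ExactCounter:
    # Trivial stand-in "sketch" with the same interface, for demonstration;
    # in practice each copy would be an FM sketch with its own hash function.
    def __init__(self, seed=0):
        self.seen = set()
    def process(self, x):
        self.seen.add(x)
    def estimate(self):
        return len(self.seen)
```

Taking a median (rather than a mean) is what makes the boost robust: a few wildly wrong copies cannot drag the answer away.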
Summary
References:
• Probability and Computing by Mitzenmacher and Upfal.
k-wise universal
• For any distinct x₁, …, xₖ and any (not necessarily distinct) y₁, …, yₖ,
Pr[h(x₁) = y₁ ∧ … ∧ h(xₖ) = yₖ] = m^(−k)
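A standard construction of a (nearly) 2-wise independent family is h(x) = ((a·x + b) mod p) mod m for a prime p and random a, b. A sketch, with the Mersenne prime and parameter choices as illustrative assumptions:

```python
import random

def make_2wise_hash(m, p=(1 << 31) - 1, seed=0):
    # h(x) = ((a*x + b) mod p) mod m with random a in [1, p), b in [0, p).
    # Exactly pairwise independent over [p]; close to uniform on [m]
    # when m << p.
    rng = random.Random(seed)
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m
```

Such a family suffices for the FM analysis above, since only variances (i.e. pairwise independence) were used.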