Lecture08_BloomFilter
1 Bloom Filters
A Bloom filter is a data structure used to test whether an element is a member of a set. It is
probabilistic because it allows false positives: it may report that an element belongs to the set when
it does not. However, it does not allow false negatives: it never reports that an element does not
belong to the set when it does. Thus a query returns either “possibly ∈ set” or “definitely ∉ set”.
The motivation for this structure is to provide an alternative to hash tables that consumes much
less space, at the cost of introducing false positives. It is usually
accompanied by a hash table or another data structure that is used to retrieve the
element only if it passes the fast Bloom filter test. It improves query times when most queried
elements do not belong to the set and when the set elements are stored in secondary storage.
An empty Bloom filter is an array of m bits, all set to zero. A hash function is a function whose
parameter is an element to be inserted, and it returns a value in the range {0 . . . m−1}. To insert an
element in the Bloom filter, k different hash functions are applied to the inserted element to return
k positions in the range {0 . . . m − 1}. Then, all bits at all these positions are set to 1. It is not
possible to remove an element from a Bloom filter.
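The insertion procedure above, together with the matching membership test, can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the class name `BloomFilter`, the method names, and the choice of simulating the k hash functions by salting SHA-256 with an index are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter sketch: m bits and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, all set to zero

    def _positions(self, item: str):
        # Assumption: the k "different hash functions" are simulated by
        # prefixing the item with the function index i before hashing.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        # Set the bit at each of the k positions to 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def possibly_contains(self, item: str) -> bool:
        # False means "definitely not in the set";
        # True means "possibly in the set" (may be a false positive).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(m=1024, k=3)
for word in ("apple", "banana", "cherry"):
    bf.insert(word)

print(bf.possibly_contains("banana"))  # True: inserted items are never missed
```

There is no `remove` method because clearing a bit could also erase evidence of other elements that hash to the same position.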
To estimate the effectiveness of a Bloom filter, we need to calculate its false-positive
probability, which depends on the values of k and m and on the number n of inserted elements. Assume
that a hash function returns each value in the range {0 . . . m − 1} with equal probability $\frac{1}{m}$.
Let $p = 1 - q$ be the probability that a certain bit is set to 1.
Let $q = 1 - p$ be the probability that a certain bit is not set to 1.
After one hash function call: $q = 1 - \frac{1}{m}$.
After inserting one element (that is, after k hash function calls): $q = \left(1 - \frac{1}{m}\right)^{k}$.
After inserting n elements: $q = \left(1 - \frac{1}{m}\right)^{kn}$ and $p = 1 - \left(1 - \frac{1}{m}\right)^{kn}$.
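As a sanity check on the expression for p, the sketch below compares the formula with a small simulation. It assumes, as the derivation does, that every hash call picks a position uniformly at random; the function names and the parameter values are illustrative only.

```python
import random


def expected_fill(m: int, k: int, n: int) -> float:
    """p = 1 - (1 - 1/m)^(kn): probability that a given bit is set to 1."""
    return 1.0 - (1.0 - 1.0 / m) ** (k * n)


def simulated_fill(m: int, k: int, n: int, seed: int = 0) -> float:
    """Fraction of bits set after k*n uniformly random hash calls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    bits = [0] * m
    for _ in range(k * n):
        bits[rng.randrange(m)] = 1
    return sum(bits) / m


m, k, n = 10_000, 7, 1_000
print(f"formula:    {expected_fill(m, k, n):.4f}")
print(f"simulation: {simulated_fill(m, k, n):.4f}")
```

The two printed values should be close, with the simulated fraction fluctuating slightly around the formula's value.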
Now, after inserting n elements, suppose that we test the membership of an element that
does not belong to the set (it is not equal to any of the n inserted elements). The probability fp
of declaring a false positive equals the probability that all k hash positions of that
element happen to have been set to 1 by the hash positions of the existing n elements.
So: $\mathit{fp} = p^{k} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$, using the approximation $1 - \frac{1}{m} \approx e^{-1/m}$.
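To see how tight the exponential approximation is, the following sketch evaluates both forms of the false-positive probability. The function names and the sample values of m, k, and n are assumptions chosen only for illustration.

```python
import math


def fp_exact(m: int, k: int, n: int) -> float:
    """fp = (1 - (1 - 1/m)^(kn))^k."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def fp_approx(m: int, k: int, n: int) -> float:
    """fp ~ (1 - e^(-kn/m))^k, using 1 - 1/m ~ e^(-1/m)."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, k, n = 10_000, 7, 1_000
print(f"exact:  {fp_exact(m, k, n):.6f}")
print(f"approx: {fp_approx(m, k, n):.6f}")
```

For these values both results are roughly 0.0082, and the two forms differ only far beyond the digits that matter in practice.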
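Given m and n, the optimal k that minimizes fp is $k = \frac{m}{n} \ln 2$, where $\mathit{fp} = \left(\frac{1}{2}\right)^{\frac{m}{n} \ln 2}$ (*).
Given n and the target fp, the required m is $m = \frac{-n \ln \mathit{fp}}{(\ln 2)^{2}}$ (*). Thus:
Given the target fp, the optimal number of bits per element is $\frac{m}{n} = \frac{-1}{\ln 2} \log_2 \mathit{fp}$, where $k = -\log_2 \mathit{fp}$.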
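The sizing formulas translate directly into a small helper. This is a sketch under stated assumptions: the function name `size_bloom_filter` is hypothetical, m is rounded up to a whole number of bits, and k is rounded to the nearest integer because a filter cannot use a fractional number of hash functions.

```python
import math


def size_bloom_filter(n: int, fp: float) -> tuple[int, int]:
    """Return (m, k) for n elements and a target false-positive rate fp."""
    m = math.ceil(-n * math.log(fp) / (math.log(2) ** 2))  # m = -n ln fp / (ln 2)^2
    k = max(1, round(-math.log2(fp)))                      # k = -log2 fp
    return m, k


m, k = size_bloom_filter(n=1000, fp=0.01)
print(m, k)      # 9586 7
print(m / 1000)  # about 9.6 bits per element
```

For a 1% target rate this gives roughly 9.6 bits per element regardless of how large the elements themselves are, which is where the space saving over a hash table comes from.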
FCAI-CU AdvDS Bloom Filters Amin Allam
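Proof of the optimal k: Let $r = e^{-kn/m}$, then $k = -\frac{m}{n} \ln(r)$.
Thus $\ln(\mathit{fp}) = k \ln(1 - r) = -\frac{m}{n} \ln(r) \ln(1 - r)$. Since minimizing fp is equivalent to minimizing ln(fp), we are now minimizing with respect to r.
Taking the derivative and equating it to zero (ignoring the $-\frac{m}{n}$ constant):
$-\frac{\ln(r)}{1 - r} + \frac{\ln(1 - r)}{r} = 0$. So $r \ln(r) = (1 - r) \ln(1 - r)$.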
Clearly, $r = \frac{1}{2}$ satisfies the above equation, so $k = -\frac{m}{n} \ln\left(\frac{1}{2}\right) = \frac{m}{n} \ln\left(\left(\frac{1}{2}\right)^{-1}\right) = \frac{m}{n} \ln 2$.
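The result can be checked numerically by trying every integer k in a small range and keeping the one with the lowest approximate false-positive probability. The values of m and n and the search range are arbitrary choices for this sketch.

```python
import math


def fp_approx(m: int, k: int, n: int) -> float:
    """fp ~ (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


# r = 1/2 satisfies r ln(r) = (1 - r) ln(1 - r).
r = 0.5
assert math.isclose(r * math.log(r), (1 - r) * math.log(1 - r))

m, n = 10_000, 1_000
k_formula = (m / n) * math.log(2)  # about 6.93
k_best = min(range(1, 31), key=lambda k: fp_approx(m, k, n))

print(f"k from formula:     {k_formula:.2f}")
print(f"best integer k:     {k_best}")
```

The best integer k found by the search is the integer closest to the value from the formula, which is consistent with the derivation.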
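By substituting the optimal $r = e^{-kn/m} = \frac{1}{2}$ and $k = \frac{m}{n} \ln 2$ in the fp formula:
$\mathit{fp} = \left(1 - \frac{1}{2}\right)^{\frac{m}{n} \ln 2} = \left(\frac{1}{2}\right)^{\frac{m}{n} \ln 2} = \left(\left(\frac{1}{2}\right)^{\ln 2}\right)^{\frac{m}{n}}$.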
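(*) Given n and the target fp, the required m is $m = \frac{-n \ln \mathit{fp}}{(\ln 2)^{2}}$.
Proof: By taking the natural logarithm of both sides of the formula $\mathit{fp} = \left(\frac{1}{2}\right)^{\frac{m}{n} \ln 2}$:
$\ln(\mathit{fp}) = \frac{m}{n} (\ln 2) \ln\left(\frac{1}{2}\right) = \frac{m}{n} (\ln 2)(-(\ln 2)) = -\frac{m}{n} (\ln 2)^{2}$. Thus $m = \frac{-n \ln \mathit{fp}}{(\ln 2)^{2}}$.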