Bloom Filter Guo
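Assign ID to many cells: Hash1, Hash2, and Hash3 each map the ID to one cell.
Each cell stores three fields: count, idSum, and hashSum.
To add ID A to a cell: XOR A into idSum, XOR H(A) into hashSum, and increment count (count++).
All hosts use the same hash functions.
[Figure: ID A is hashed by Hash1, Hash2, and Hash3 into three cells of an IBF (labeled "IBF: d"); IDs B and C are inserted the same way.]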
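The cell layout and insert step can be sketched in Python as follows. This is only an illustration under stated assumptions: the SHA-256-based hashes, the table size `M`, the split of the table into one sub-table per hash function, and all names (`Cell`, `checksum`, `cell_indices`, `insert`) are hypothetical choices, not taken from the slides.

```python
import hashlib
from dataclasses import dataclass

K = 3    # number of hash functions (Hash1..Hash3 on the slide)
M = 12   # number of cells; tiny on purpose, and a multiple of K (assumption)


@dataclass
class Cell:
    count: int = 0
    id_sum: int = 0
    hash_sum: int = 0


def _h(data: bytes) -> int:
    """64-bit integer derived from SHA-256 (an illustrative choice)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def checksum(key: int) -> int:
    """H(key): the value XORed into hashSum."""
    return _h(b"check" + key.to_bytes(8, "big"))


def cell_indices(key: int) -> list[int]:
    """One cell per hash function. Hash i only addresses sub-table i,
    so the K cells for a key are always distinct (an assumption)."""
    width = M // K
    return [i * width + _h(bytes([i]) + key.to_bytes(8, "big")) % width
            for i in range(K)]


def insert(table: list[Cell], key: int) -> None:
    """Add `key` to each of its K cells."""
    for idx in cell_indices(key):
        cell = table[idx]
        cell.id_sum ^= key
        cell.hash_sum ^= checksum(key)
        cell.count += 1


table = [Cell() for _ in range(M)]
insert(table, 0xA)  # hypothetical ID "A"
```

Because every host runs the same `cell_indices` and `checksum`, the same ID always lands in the same cells with the same contribution, which is what makes the later subtraction step work.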
Invertible Bloom Filters (IBF)
Trade IBFs with remote host
[Figure: Host1 and Host2 each hold a set of objects drawn from A through F (A and F are common to both); the hosts exchange IBF 1 and IBF 2.]
Invertible Bloom Filters (IBF)
Subtract IBF structures
Produces a new IBF containing only unique objects
[Figure: the two hosts' sets with IBF 1 and IBF 2; subtracting them cell by cell yields IBF (2 - 1).]
IBF Subtract
62
Timeout for Intuition
After subtraction, all elements common to both sets have disappeared. Why?
Any common element (e.g., W) is assigned to the same cells on both hosts (assuming the same hash functions on both sides).
On subtraction, W XOR W = 0. Thus, W vanishes.
The elements in the set difference remain, but they may be randomly mixed, so we need a decode procedure.
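The cancellation argument can be checked with a small sketch. The assumptions here are illustrative only: IBFs are stored as three parallel lists, hashes come from SHA-256, counts are subtracted while idSum and hashSum are XORed, and the key values are hypothetical.

```python
import hashlib

K, M = 3, 30  # assumed: 3 hash functions, 30 cells


def h(tag: bytes, key: int) -> int:
    """64-bit hash of `key`, separated by `tag` (illustrative choice)."""
    digest = hashlib.sha256(tag + key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ibf(keys):
    """Return parallel lists (count, id_sum, hash_sum) encoding `keys`."""
    count, id_sum, hash_sum = [0] * M, [0] * M, [0] * M
    width = M // K
    for key in keys:
        for i in range(K):
            j = i * width + h(bytes([i]), key) % width
            count[j] += 1
            id_sum[j] ^= key
            hash_sum[j] ^= h(b"check", key)
    return count, id_sum, hash_sum


def subtract(ibf2, ibf1):
    """Cell-wise IBF (2 - 1): counts subtract, idSum and hashSum XOR."""
    (c2, i2, s2), (c1, i1, s1) = ibf2, ibf1
    return ([a - b for a, b in zip(c2, c1)],
            [a ^ b for a, b in zip(i2, i1)],
            [a ^ b for a, b in zip(s2, s1)])


host1 = {1, 2, 3, 100}   # W = 100 is common to both hosts
host2 = {100, 7, 8}

diff = subtract(build_ibf(host2), build_ibf(host1))
# The same structure built from the non-shared elements alone:
without_common = subtract(build_ibf(host2 - host1), build_ibf(host1 - host2))
print(diff == without_common)
```

The comparison holds because every common element contributes identical values to identical cells on both sides, so its contribution cancels exactly.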
63
Invertible Bloom Filters (IBF)
Decode resulting IBF
Recover object identifiers from IBF structure.
[Figure: decoding IBF (2 - 1) recovers the IDs B, E, C, and D, the objects held by only one of Host1 and Host2.]
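IBF Decode
Test for purity: a cell is pure if H( idSum ) = hashSum.
A cell holding only V passes: H(V) = H(V).
A cell holding V, X, and Z fails (with high probability): its idSum is V XOR X XOR Z, and H(V XOR X XOR Z) does not equal its hashSum, H(V) XOR H(X) XOR H(Z).
[Figure: IBF cells containing mixtures of V, X, and Z, annotated with the purity test.]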
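The purity test drives a peeling decoder: find a pure cell, read off its ID, remove that ID from all of its cells, and repeat. Below is a minimal sketch of that loop, not the authors' implementation. It assumes SHA-256-based hashes, an oversized table of 120 cells, and the extra check that a pure cell has count +1 or -1 (the sign tells which host holds the ID); all names and key values are hypothetical.

```python
import hashlib

K, M = 3, 120  # assumed: 3 hash functions, deliberately oversized table


def h(tag: bytes, key: int) -> int:
    digest = hashlib.sha256(tag + key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def cells_for(key):
    width = M // K
    return [i * width + h(bytes([i]), key) % width for i in range(K)]


def update(ibf, key, sign):
    """Add (sign=+1) or remove (sign=-1) `key` in each of its cells."""
    for j in cells_for(key):
        ibf[j][0] += sign
        ibf[j][1] ^= key
        ibf[j][2] ^= h(b"check", key)


def build_ibf(keys):
    ibf = [[0, 0, 0] for _ in range(M)]  # [count, idSum, hashSum]
    for key in keys:
        update(ibf, key, +1)
    return ibf


def subtract(a, b):
    return [[ca - cb, ia ^ ib, sa ^ sb]
            for (ca, ia, sa), (cb, ib, sb) in zip(a, b)]


def is_pure(cell):
    count, id_sum, hash_sum = cell
    return count in (1, -1) and h(b"check", id_sum) == hash_sum


def decode(ibf):
    """Peel pure cells. Returns (only_in_first, only_in_second, success)."""
    ibf = [cell[:] for cell in ibf]            # work on a copy
    first, second = set(), set()
    pure = [j for j in range(M) if is_pure(ibf[j])]
    while pure:
        j = pure.pop()
        if not is_pure(ibf[j]):                # already peeled via another cell
            continue
        count, key, _ = ibf[j]
        (first if count == 1 else second).add(key)
        update(ibf, key, -count)               # remove key everywhere
        pure.extend(i for i in cells_for(key) if is_pure(ibf[i]))
    success = all(cell == [0, 0, 0] for cell in ibf)
    return first, second, success


host1 = {10, 20, 30, 40, 50}       # hypothetical IDs
host2 = {10, 20, 30, 60, 70, 80}
only2, only1, ok = decode(subtract(build_ibf(host2), build_ibf(host1)))
print(ok, sorted(only1), sorted(only2))
```

Decoding can fail when no pure cell remains while the table is still non-empty, which is why the function reports a success flag instead of assuming completion.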
How many IBF cells?
Small differences: 1.4x to 2.3x space overhead.
Large differences: 1.25x to 1.4x space overhead.
[Figure: space overhead needed to decode at >99% (y-axis) vs. set difference (x-axis), with curves for hash counts 3 and 4.]
How many hash functions?
1 hash function produces many pure cells initially, but there is nothing to undo when an element is removed.
Many (say 10) hash functions: too many collisions.
We find by experiment that 3 or 4 hash functions work well. Is there some theoretical reason?
[Figure: elements A, B, and C placed into cells using one hash function, many hash functions, and a moderate number of hash functions.]
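The trade-off can be explored with a small Monte Carlo sketch. It idealizes the decoder: a cell counts as pure when exactly one undecoded key maps to it, and hash functions are replaced by a seeded pseudorandom generator. The parameters (100 keys, 150 cells, 200 trials) and the function name are assumptions chosen for illustration, and no particular success rate is claimed.

```python
import random
from collections import defaultdict


def peels_completely(d, m, k, rng):
    """Simulate decoding d keys spread over m cells with k hash functions.
    Returns True if peeling pure cells eventually removes every key."""
    width = m // k  # hash i addresses sub-table i (assumption)
    cells_of = [[i * width + rng.randrange(width) for i in range(k)]
                for _ in range(d)]
    members = defaultdict(set)          # cell -> undecoded keys in it
    for key, cells in enumerate(cells_of):
        for c in cells:
            members[c].add(key)

    pure = [c for c, keys in members.items() if len(keys) == 1]
    remaining = d
    while pure:
        c = pure.pop()
        if len(members[c]) != 1:        # no longer pure
            continue
        (key,) = members[c]
        for c2 in cells_of[key]:        # remove the key from all its cells
            members[c2].discard(key)
            if len(members[c2]) == 1:
                pure.append(c2)
        remaining -= 1
    return remaining == 0


rng = random.Random(1)
trials = 200
rates = {k: sum(peels_completely(100, 150, k, rng) for _ in range(trials)) / trials
         for k in (1, 3, 10)}
print(rates)
```

With one hash function, any two keys that share a cell can never be separated; with ten, almost no cell starts out pure. The simulation makes both failure modes visible next to the moderate setting.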
Theory
Let d = difference size, k = # hash functions.
Theorem 1: With $(k + 1)d$ cells, failure probability falls exponentially.
For k = 3, this implies a 4x tax on storage, which is a bit weak.
[Goodrich, Mitzenmacher]: Failure is equivalent to finding a 2-core (loop) in a random hypergraph.
Theorem 2: With $c_k d$ cells, failure probability falls exponentially.
$c_4$ = 1.3x tax, which agrees with experiments.
73
Connection to Coding
Mystery: IBF decode is similar to the peeling procedure used to decode Tornado codes. Why?
Explanation: Set difference is equivalent to coding over insert-delete channels.
Intuition: Given a code for set A, send codewords only to B. Think of B's set as a corrupted form of A's.
Reduction: If the code can correct D insertions/deletions, then B can recover A and the set difference.
Reed Solomon <---> Polynomial Methods
LDPC (Tornado) <---> Difference Digest
Difference Digests
Consists of two data structures:
Invertible Bloom Filter (IBF)
Efficiently computes the set difference.
Needs the size of the difference
Strata Estimator
Approximates the size of the set difference.
Uses IBFs as a building block.
76
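Strata Estimator
Consistent partitioning: divide keys into partitions, with partition $k$ containing ~$1/2^k$ of the keys.
Encode each partition into an IBF of fixed size.
log(n) IBFs of ~80 cells each.
[Figure: keys A, B, and C are consistently partitioned into strata holding ~1/2, ~1/4, ~1/8, and 1/16 of the keys; the strata are encoded as IBF 1 through IBF 4, which together form the estimator.]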
Strata Estimator
Attempt to subtract & decode IBFs at each level.
If level k decodes, then return: $2^k$ x (the number of IDs recovered).
What about the other strata?
Observation: Extra partitions hold useful data.
Sum elements from all decoded strata & return: $2^{k-1}$ x (the number of IDs recovered).
[Figure: Host1's Estimator 1 and Host2's Estimator 2 each hold IBF 1 through IBF 4; IBFs at the same level are subtracted and decoded.]
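The estimator can be sketched end to end as follows. This is one possible reading of the slides, not the authors' code: keys are assigned to a stratum by counting trailing zero bits of a hash, each stratum is an 81-cell IBF (close to the ~80 cells mentioned, and divisible by 3), and decoding proceeds from the sparsest stratum downward until one fails. All names and constants are assumptions.

```python
import hashlib

K, CELLS, LEVELS = 3, 81, 16  # assumed sizes


def h(tag: bytes, key: int) -> int:
    digest = hashlib.sha256(tag + key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def cells_for(key):
    width = CELLS // K
    return [i * width + h(bytes([i]), key) % width for i in range(K)]


def toggle(ibf, key, sign):
    for j in cells_for(key):
        ibf[j][0] += sign
        ibf[j][1] ^= key
        ibf[j][2] ^= h(b"check", key)


def stratum(key):
    """Consistent partitioning: level j (1-based) gets ~1/2^j of the keys."""
    x, level = h(b"strata", key), 1
    while x & 1 == 0 and level < LEVELS:
        x >>= 1
        level += 1
    return level


def build_estimator(keys):
    est = [[[0, 0, 0] for _ in range(CELLS)] for _ in range(LEVELS)]
    for key in keys:
        toggle(est[stratum(key) - 1], key, +1)
    return est


def decode_count(ibf):
    """Peel pure cells; return the number of IDs recovered, or None on failure."""
    ibf = [c[:] for c in ibf]
    recovered, progress = 0, True
    while progress:
        progress = False
        for cell in ibf:
            if cell[0] in (1, -1) and h(b"check", cell[1]) == cell[2]:
                toggle(ibf, cell[1], -cell[0])
                recovered += 1
                progress = True
    return recovered if all(c == [0, 0, 0] for c in ibf) else None


def estimate(est1, est2):
    """Decode from the sparsest stratum down; scale up at the first failure."""
    total = 0
    for level in range(LEVELS, 0, -1):
        a, b = est1[level - 1], est2[level - 1]
        diff = [[x[0] - y[0], x[1] ^ y[1], x[2] ^ y[2]] for x, y in zip(a, b)]
        n = decode_count(diff)
        if n is None:
            # Levels level+1..LEVELS decoded; together they hold ~1/2^level of the keys.
            return (2 ** level) * total
        total += n
    return total  # every stratum decoded, so the count is exact


host1 = set(range(1, 1001))      # hypothetical key sets
host2 = set(range(6, 1006))      # true difference: 10 keys
print(estimate(build_estimator(host1), build_estimator(host2)))
```

Scaling by `2 ** level` at the first failing level matches the slide's $2^{k-1}$ factor, where k is the lowest level that decoded.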
Estimation Accuracy
[Figure: average estimation error (15.3 KBytes); x-axis: set difference; y-axis: relative error in estimation (%).]
Hybrid Estimator
Combine Strata and Min-Wise Estimators.
Use IBF strata for small differences.
Use Min-Wise for large differences.
[Figure: the Strata estimator (IBF 1 through IBF 4) next to the Hybrid estimator (IBF 1 through IBF 3 plus a Min-Wise component).]
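For context on the Min-Wise half of the hybrid, here is a minimal sketch of how a min-wise sketch can yield a difference estimate. The slides do not spell this out, so everything here is an assumption: the SHA-256-based hash family, the signature length of 200, and the conversion from estimated resemblance r = |A∩B| / |A∪B| to a difference size via (1 - r) / (1 + r) · (|A| + |B|).

```python
import hashlib


def _hash(seed: int, key: int) -> int:
    """Seeded 64-bit hash (illustrative stand-in for a min-wise hash family)."""
    data = seed.to_bytes(4, "big") + key.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minwise_signature(keys, num_hashes=200):
    """For each hash function, keep only the minimum hash value over the set."""
    return [min(_hash(seed, key) for key in keys) for seed in range(num_hashes)]


def estimate_difference(sig_a, size_a, sig_b, size_b):
    """Estimate |A symmetric-difference B| from two signatures and set sizes."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    r = matches / len(sig_a)                 # estimated resemblance
    # |A| + |B| = |A∪B| (1 + r) and the difference is |A∪B| (1 - r).
    return (1 - r) / (1 + r) * (size_a + size_b)


set_a = set(range(600))          # hypothetical key sets
set_b = set(range(300, 900))     # true difference: 600 keys
estimate = estimate_difference(minwise_signature(set_a), len(set_a),
                               minwise_signature(set_b), len(set_b))
print(round(estimate))
```

The accuracy of this kind of estimate depends on how many signature positions differ, which is consistent with using it for large differences and leaving small ones to the strata.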
Hybrid Estimator Accuracy
Hybrid matches Strata for small differences.
Converges with Min-Wise for large differences.
[Figure: average estimation error (15.3 KBytes); x-axis: set difference; y-axis: relative error in estimation (%).]
Application: KeyDiff Service
Promising Applications:
File Synchronization
P2P file sharing
Failure Recovery
[Figure: three applications, each attached to a Key Service instance.]
Key Service API: Add(key), Remove(key), Diff(host1, host2)
Difference Digests Summary
Strata & Hybrid Estimators
Estimate the size of the Set Difference.
For 100K sets, 15KB estimator has <15% error
O(log n) communication, O(log n) computation.
Invertible Bloom Filter
Identifies all IDs in the Set Difference.
16 to 28 Bytes per ID in Set Difference.
O(d) communication, O(n+d) computation.
Implemented in KeyDiff Service
85
Conclusions: Got Diffs?
New randomized algorithm (difference digests) for
set difference or insertion/deletion coding
Could it be useful for your system? Need:
Large but roughly equal size sets
Small set differences (less than 10% of set size)
86
Comparison to Logs
IBFs work with no prior context.
Logs work with prior context, BUT
Redundant information when syncing with multiple parties.
Logging must be built into the system for each write.
Logging adds overhead at runtime.
Logging requires non-volatile storage.
Often not present in network devices.
87
IBFs may out-perform logs when:
Synchronizing multiple parties
Synchronizations happen infrequently
88
The main point revised again
Whenever you have a set or list or
function or concurrent state machine or
whatever-will-be-next?, and space is an
issue, an approximate representation, like
a Bloom filter may be a useful alternative.
Just be sure to consider the effects of the
false positives!
89
Extension : Distance-Sensitive Bloom Filters
Instead of answering questions of the form "Is $y \in S$?"
we would like to answer questions of the form "Is $y \approx x$ for some $x \in S$?"
That is, is the query close to some element of the set, under some metric and some notion of close.
Applications:
DNA matching
Virus/worm matching
Databases
Some initial results [KirschMitzenmacher].
90
Variation: Simpler Hashing
[DillingerManolios], [KirschMitzenmacher]
Let $h_1$ and $h_2$ be hash functions.
For $i = 0, 1, 2, \ldots, k-1$ and some f, let
$g_i(x) = h_1(x) + i\,h_2(x) \bmod m$
So 2 hash functions can mimic k hash functions.
Hash functions:
SDBM, BUZ
Fast generation of very high quality pseudorandom numbers
Mersenne twister
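The two-hash construction can be sketched as a small Bloom filter. The assumptions are illustrative: $h_1$ and $h_2$ are taken from two 8-byte slices of a single SHA-256 digest, and the class name and parameters are hypothetical.

```python
import hashlib


def indices(x: bytes, k: int, m: int) -> list[int]:
    """g_i(x) = (h1(x) + i * h2(x)) mod m for i = 0..k-1."""
    digest = hashlib.sha256(x).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % m for i in range(k)]


class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [False] * m

    def add(self, x: bytes) -> None:
        for j in indices(x, self.k, self.m):
            self.bits[j] = True

    def __contains__(self, x: bytes) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[j] for j in indices(x, self.k, self.m))


bf = BloomFilter(m=1021, k=5)
for word in (b"alpha", b"beta", b"gamma"):
    bf.add(word)
print(b"alpha" in bf)
```

Only one digest is computed per item regardless of k, which is the practical appeal of the construction.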