Bloom Filter Guo

This document discusses Bloom filters and several of its variants. It begins with an outline of topics to be covered, which include standard Bloom filters, compressed Bloom filters, counting Bloom filters, and other variants. The document then provides detailed explanations and examples of each variant. The main points are that Bloom filters can be used as space-efficient probabilistic data structures to test for set membership, and that variants like compressed and counting Bloom filters can enhance the original Bloom filter by allowing for compression, deletions, or other improvements.

Bloom Filters and its Variants


Deke Guo
National University of Defense Technology
2012.01.13
Outline
1. Standard Bloom Filters
2. Compressed Bloom Filters
3. Counting Bloom Filters
4. Representation of a set of (key, f(key))
5. Invertible Bloom Filters
The main point
Whenever you have a set or list, and space is an
issue, a Bloom filter may be a useful alternative.
The Problem Solved by BF:
Approximate Set Membership
Given a set $S = \{x_1, x_2, \ldots, x_n\}$, construct a data structure
to answer queries of the form "Is y in S?"
Data structure should be:
Fast (Faster than searching through S).
Small (Smaller than explicit representation).
To obtain speed and size improvements, allow some
probability of error.
False positives: $y \notin S$ but we report $y \in S$
False negatives: $y \in S$ but we report $y \notin S$
1. Standard Bloom Filters
Set representation
[Figure: element a of data set B is mapped by a hash function family into a bit vector. Before "Add a" the vector is 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0; afterwards it is 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0.]
[Figure: after elements a, b, c, d of data set B are added through the hash function family, the bit vector is 0 1 0 1 0 1 0 1 1 0 0 0 0 1 1.]
1. Standard Bloom Filters
Membership query
Constant time (time to hash).
Small amount of space.
But with some probability of being wrong.
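A minimal Python sketch of the set representation and membership query described above. The class name `BloomFilter`, the sizes m = 64 and k = 3, and the use of salted SHA-256 as the hash function family are assumptions for illustration, not taken from the slides.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: an m-bit vector and k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item: str):
        # Assumption: k hash functions are simulated by salting SHA-256 with
        # the index i; any family of well-mixed hash functions would do.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
for element in ["a", "b", "c", "d"]:
    bf.add(element)
```

Every inserted element is reported as present; an element that was never inserted is usually rejected, but may be accepted when all k of its positions happen to be set.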
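[Figure: a query for x, an element of data set A that was never added, goes through the same hash function family and finds all of its bits set in the vector 0 1 0 1 0 1 0 1 1 0 0 0 0 1 1, which represents a, b, c, d of data set B. This is a false positive query.]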
False positive probability
Assumption: We have good hash functions that look random.
Given m bits for the filter and n elements, choose the number k of
hash functions to minimize false positives:
Let $p = \Pr[\text{cell is empty}] = (1 - 1/m)^{kn} \approx e^{-kn/m}$
Let $f = \Pr[\text{false pos}] = (1 - p)^k \approx (1 - e^{-kn/m})^k$
As k increases, more chances to find a 0, but more 1s in
the array.
Find optimal at k = (ln 2)m/n by calculus.
At the optimum, $p = 1/2$ and $f = \Pr[\text{false pos}] = (1/2)^{(\ln 2) m/n} \approx 0.6185^{m/n}$
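The formulas above are easy to check numerically. The following Python sketch evaluates the approximation $f \approx (1 - e^{-kn/m})^k$ and the optimal k; the function names and the choice m = 8000, n = 1000 (so m/n = 8) are illustrative assumptions.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive rate (1 - e^{-kn/m})^k."""
    p_empty = math.exp(-k * n / m)
    return (1.0 - p_empty) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued optimum k = (ln 2) m / n."""
    return math.log(2) * m / n


m, n = 8000, 1000                      # assumed sizes, m/n = 8
k_star = optimal_k(m, n)               # 8 ln 2
# Best integer k among 1..10 under the approximation above.
best_int_k = min(range(1, 11), key=lambda k: false_positive_rate(m, n, k))
rate_at_best = false_positive_rate(m, n, best_int_k)
```

Since k must be an integer in practice, one typically evaluates the two integers on either side of (ln 2)m/n and keeps the better one.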
[Figure: false positive rate (0 to 0.1) versus number of hash functions (0 to 10) for m/n = 8. Opt k = 8 ln 2 = 5.545...]
Alternative Approach for Bloom Filters
Folklore Bloom filter construction.
Recall: Given a set $S = \{x_1, x_2, x_3, \ldots, x_n\}$ on a universe U, want
to answer membership queries.
Method: Find an n-cell perfect hash function for S.
Maps set of n elements to n cells in a 1-1 manner.
Then keep a $\lceil \log_2(1/\epsilon) \rceil$ bit fingerprint of the item in each cell.
Lookups have false positive $< \epsilon$.
Advantage: each bit/item reduces false positives by a factor
of 1/2, vs. a factor of $(1/2)^{\ln 2} \approx 0.6185$ for a standard Bloom filter.
Negatives:
Perfect hash functions non-trivial to find.
Cannot handle on-line insertions.
Perfect Hashing Approach
Element 1 Element 2 Element 3 Element 4 Element 5
Fingerprint(4) Fingerprint(5) Fingerprint(2) Fingerprint(1) Fingerprint(3)
Classic Uses of BF: Spell-Checking
Once upon a time, memory was scarce...
/usr/dict/words -- about 210KB, 25K words
Use 25 KB Bloom filter
8 bits per word.
Optimal 5 hash functions.
Probability of false positive about 2%
False positive = accept a misspelled word
BFs still used to deal with list of words
Password security [Spafford 1992], [Manber & Wu, 94]
Keyword driven ads in web search engines, etc
Classic Uses of BF: Databases
Join: Combine two tables with a common domain
into a single table
Semi-join: A join in distributed DBs in which only
the joining attribute from one site is transmitted to
the other site and used for selection. The selected
records are sent back.
Bloom-join: A semi-join where we send only a BF
of the joining attribute.
Example
Empl    Salary  Addr  City
John    60K           New York
George  30K           New York
Moe     25K           Topeka
Alice   70K           Chicago
Raul    30K           Chicago

City      Cost of living
New York  60K
Chicago   55K
Topeka    30K

Create a table of all employees that make < 40K and
live in a city where COL > 50K.
Join: send (City, COL) for COL > 50K.
Semi-join: send just (City).
Bloom-join: send a Bloom filter for all cities with
COL > 50K.
Result columns: Empl, Salary, Addr, City, COL
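The Bloom-join in this example can be sketched in a few lines of Python. The filter parameters (M = 64 bits, K = 3 hashes), the salted SHA-256 hashing, and the helper names are assumptions made for illustration; salaries and costs are written in thousands.

```python
import hashlib

M, K = 64, 3  # hypothetical filter size in bits and number of hash functions


def positions(key: str):
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M


def build_filter(keys) -> int:
    bits = 0  # the bit vector, stored as a Python integer
    for key in keys:
        for pos in positions(key):
            bits |= 1 << pos
    return bits


def maybe_contains(bits: int, key: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(key))


employees = [("John", 60, "New York"), ("George", 30, "New York"),
             ("Moe", 25, "Topeka"), ("Alice", 70, "Chicago"),
             ("Raul", 30, "Chicago")]
cost_of_living = {"New York": 60, "Chicago": 55, "Topeka": 30}

# City site: send only a Bloom filter of the cities with COL > 50K.
city_filter = build_filter(c for c, col in cost_of_living.items() if col > 50)

# Employee site: select candidate rows with the filter and send them back.
candidates = [e for e in employees
              if e[1] < 40 and maybe_contains(city_filter, e[2])]

# City site: the exact join discards any false positives among the candidates.
result = [(name, salary, city, cost_of_living[city])
          for name, salary, city in candidates
          if cost_of_living[city] > 50]
```

A false positive in the filter only costs an extra row on the wire (here, possibly Moe's); the final exact join keeps the result correct.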
A Modern Application: Distributed Web Caches
[Figure: Web Cache 1, Web Cache 2, Web Cache 3]
Web Caching
Summary Cache: [Fan, Cao, Almeida, & Broder]
If local caches know each other's content...
try local cache before going out to Web
Sending/updating lists of URLs too expensive.
Solution: use Bloom filters.
False positives
Local requests go unfulfilled.
Small cost, big potential gain
2. Compressed Bloom Filters
Insight: Bloom filter is not just a data structure, it is also a
message.
If the Bloom filter is a message, worthwhile to compress it.
Compressing bit vectors is easy.
Arithmetic coding gets close to entropy.
Can Bloom filters be compressed?
Optimization, then Compression
Optimize to minimize false positive.
$p = \Pr[\text{cell is empty}] = (1 - 1/m)^{kn} \approx e^{-kn/m}$
$f = \Pr[\text{false pos}] = (1 - p)^k \approx (1 - e^{-kn/m})^k$
$k = (m \ln 2)/n$ is optimal
At k = m (ln 2)/n, p = 1/2.
Bloom filter looks like a random string.
Can't compress it.
Compressed Bloom Filters
Error optimized Bloom filter is half 0s, half 1s.
Compression would not help.
But this optimization is for a fixed filter size m.
Instead optimize the false positives for a fixed
number of transmitted bits.
Filter size m can be larger, but mostly 0s
Larger, sparser Bloom filter can be compressed.
Useful if transmission cost is bottleneck.
Claim: transmission cost limiting factor.
Updates happen frequently.
Machine memory is cheap.
Benefits of Compressed Bloom Filters
Examples for bounded transmission size.
20-50% reduction in false positive rate.

Array bits per elt.   m/n   8       14      92
Trans. bits per elt.  z/n   8       7.923   7.923
Hash functions        k     6       2       1
False positive rate   f     0.0216  0.0177  0.0108

Array bits per elt.   m/n   16        28        48
Trans. bits per elt.  z/n   16        15.846    15.829
Hash functions        k     11        4         3
False positive rate   f     4.59E-04  3.14E-04  2.22E-04
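The table entries can be reproduced with a short Python sketch, under the assumption that an ideal compressor (for example arithmetic coding) reaches the entropy of the bit vector, so that z = m H(p) with p = e^{-kn/m} the fraction of 0 bits. The function names are illustrative.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a bit that is 0 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def compressed_stats(m_over_n: float, k: int):
    """Return (z/n, f) for an uncompressed size of m/n bits per element.

    Assumption: the compressor achieves the entropy bound exactly.
    """
    p = math.exp(-k / m_over_n)          # probability that a cell is empty
    z_over_n = m_over_n * binary_entropy(p)
    f = (1 - p) ** k
    return z_over_n, f


z, f = compressed_stats(14, 2)   # second column of the first table
```

With m/n = 14 and k = 2 this gives roughly z/n = 7.923 and f = 0.0177, matching the table: a larger, sparser filter that transmits in slightly under 8 bits per element.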
Benefits of Compressed Bloom Filters
Examples with fixed false positive rate.
5-15% compression for transmission size.

Array bits per elt.   m/n   8       12.6    46
Trans. bits per elt.  z/n   8       7.582   6.891
Hash functions        k     6       2       1
False positive rate   f     0.0216  0.0216  0.0215

Array bits per elt.   m/n   16        37.5      93
Trans. bits per elt.  z/n   16        14.666    13.815
Hash functions        k     11        3         2
False positive rate   f     4.59E-04  4.54E-04  4.53E-04
[Figure: Example for z/n = 8: false positive rate (0 to 0.1) versus number of hash functions (0 to 10), with one curve for the original and one for the compressed Bloom filter.]
Results
At k = m (ln 2) /n, false positives are maximized with a
compressed Bloom filter.
Best case without compression is worst case with compression;
compression always helps.
Side benefit: Use fewer hash functions with compression;
possible speedup.
3. Counting Bloom Filters and Deletions
Cache contents change
Items both inserted and deleted.
Insertions are easy: add bits to BF
Can Bloom filters handle deletions?
Use Counting Bloom Filters to track insertions/deletions at
hosts; send Bloom filters.
Handling Deletions
Bloom filters can handle insertions, but not deletions.
If deleting $x_i$ means resetting 1s to 0s, then deleting $x_i$ will
"delete" $x_j$.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
[Figure: $x_i$ and $x_j$ both hash into B and share bit positions.]
Counting Bloom Filters
Start with an m bit array, filled with 0s.
B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hash each item $x_j$ in S k times. If $H_i(x_j) = a$, add 1 to B[a].
B: 0 3 0 0 1 0 2 0 0 3 2 1 0 2 1 0
To delete $x_j$ decrement the corresponding counters.
B: 0 2 0 0 0 0 2 0 0 3 2 1 0 1 1 0
Can obtain a corresponding Bloom filter by reducing to 0/1.
B: 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 0
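A minimal Python sketch of the counting Bloom filter operations just described: increment on insert, decrement on delete, and reduction to a 0/1 Bloom filter. The class name, the sizes, and the salted SHA-256 hashing are assumptions for illustration; counter overflow is not handled here.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with a small counter per cell instead of a single bit."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item: str):
        # Assumption: salted SHA-256 stands in for the hash functions H_i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item: str) -> None:
        # Only valid for items that were actually inserted earlier.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def to_bloom_bits(self):
        """Reduce the counters to the corresponding 0/1 Bloom filter."""
        return [1 if c > 0 else 0 for c in self.counters]


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("x_i")
cbf.add("x_j")
cbf.remove("x_i")          # x_j survives even if it shared cells with x_i
bits = cbf.to_bloom_bits()
```

A host would keep the counters locally and send only the reduced bit vector to its peers.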
Counting Bloom Filters: Overflow
Must choose counters large enough to avoid overflow.
Poisson approximation suggests 4 bits/counter.
Average load using k = (ln 2)m/n counters is ln 2.
Probability a counter has load at least 16:
$\approx e^{-\ln 2} (\ln 2)^{16} / 16! \approx 6.78 \times 10^{-17}$
Failsafes possible.
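The overflow estimate can be checked directly in Python. This evaluates the Poisson term for a load of exactly 16 with mean ln 2, which dominates the tail "at least 16"; the variable names are illustrative.

```python
import math

load = math.log(2)  # average counter load at k = (ln 2) m / n

# Poisson probability of a load of exactly 16 (the dominant term of the tail).
p_16 = math.exp(-load) * load ** 16 / math.factorial(16)

# Full tail P[load >= 16]; later terms shrink quickly, so a few suffice.
p_tail = sum(math.exp(-load) * load ** j / math.factorial(j)
             for j in range(16, 40))
```

Both values are on the order of 10^-17 per counter, which is why 4-bit counters (values 0 to 15) are considered sufficient.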
Bloom Filters: Other Applications?
P2P Keyword Search
P2P Collaboration
Resource Location
Loop detection
Scalable Multicast Forwarding
P2P Keyword Search
Efficient P2P keyword searching [Reynolds &
Vahdat, 2002].
Distributed inverted word index, on top of an overlay
network. Multi-word queries.
Peer A holds list of document IDs containing Word1, Peer
B holds list for Word2.
Need intersection, with low communication.
A sends B a Bloom filter of document list.
B returns possible intersections to A.
A checks and returns to user; no false positives in end
result.
Equivalent to Bloom-join
P2P Collaboration
Informed Content Delivery
Delivery of large, encoded content.
Redundant encoding.
Need a sufficiently large (but not all) number of distinct packets.
Peers A and B have lists of encoded packets.
Can B send A useful packets?
A sends B a Bloom filter; B checks what packets may be
useful.
False positives: not all useful packets sent
Method can be combined with
Min-wise sampling (determine a-priori which peers are
sufficiently different)
Resource Location
Queries sent to root.
Each node keeps a list of resources reachable through it, through children.
List = Bloom filter.
Resource Location: Examples
Secure Discovery Service
[Czerwinski, Zhao, Hodes, Joseph, Katz 99]
Tree of resources.
OceanStore distributed file storage
[Kubiatowicz & al., 2000], [Rhea & Kubiatowicz, 2002]
Attenuated BFs go d levels down in the tree
Geographical region summary service
[Hsiao 2001]
Divide square regions recursively into smaller sub
squares.
Keep and update Bloom filters for each level in
hierarchy.
Loop detection
Idea: Carry small BF in the packet header
Whenever passing a node, the node mask is OR-ed into the
BF
If BF does not change there might be a loop
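The loop-detection idea can be sketched as follows in Python. The header size (M = 64 bits), the K = 3 bits per node mask, the salted SHA-256 hashing, and the node names are assumptions for illustration.

```python
import hashlib

M, K = 64, 3  # hypothetical header size in bits and bits set per node mask


def node_mask(node_id: str) -> int:
    """Bit mask of a node: K positions derived from its identifier."""
    mask = 0
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{node_id}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % M)
    return mask


def forward(header_bf: int, node_id: str):
    """OR the node mask into the header; report whether a loop is suspected."""
    new_header = header_bf | node_mask(node_id)
    possible_loop = new_header == header_bf   # unchanged: maybe seen before
    return new_header, possible_loop


header = 0
path = ["r1", "r2", "r3", "r1"]   # the packet returns to r1
flags = []
for node in path:
    header, suspect = forward(header, node)
    flags.append(suspect)
```

Revisiting r1 always leaves the header unchanged, so the loop is flagged; an unchanged header at a node that was never visited is the false positive case, which is why the slide says only that there "might" be a loop.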
Scalable Multicast Forwarding
Usual arrangement for multicast trees: for each
source address keep list of interfaces where the
packet should go
For many simultaneous multicasts, substantial storage
required
Alternative idea: trade computation for space:
For each interface keep BF of addresses
Packets checked against the BF. Check can be
parallelized
False positives lead to (few) spurious transmissions
4. Representation of a set of (key, f(key))
Hash-Based Approximate Counting
Multiset problem: (Key, frequency)
Space-code Bloom filters(INFOCOM 2004)
Spectral Bloom filters(SIGMOD 2003)
Bloomier Filter
(key, f(key))
Approximate Concurrent State Machines
(Key, state)
Beyond Bloom Filters: Approximate Concurrent State Machines
(SIGCOMM 2006)
Fast Statistical Spam Filter by Approximate Classifications
(SIGMETRICS 2006)
Hash-Based Approximate Counting
Use min-counter associated with flow as
approximation.
Yields approximation for all flows simultaneously.
Gives an upper bound (counters never undercount a flow), and a good approx.
Can prove rigorous bounds on performance.
This hash-based approximate counting structure
has many uses.
Any place you want to keep approximate counts for a
data stream.
Databases, search engines, network flows, etc.
Example
Counters: 0 3 4 1 8 1 1 0 3 2 5 4 2 0
[Figure: flow y hashes to several of these counters; the smallest of them holds 3.]
Use Counting Bloom filter to track bytes per flow.
Potentially heavy flows are recorded
The flow associated with y can only have been
responsible for 3 packets
Example
Counters before: 0 3 4 1 8 1 1 0 3 2 5 4 2 0
Increment +2 for flow y
The flow associated with y can only have been
responsible for 3 packets; counters should be
updated to 5.
Counters after: 0 3 4 1 8 1 1 0 5 2 5 5 2 0
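The update in this example can be sketched in Python as a min-counter estimate followed by a conservative update that raises only the counters lying below the new estimate. The counter values come from the example; the positions that flow y hashes to (8, 10, 11) are a hypothetical choice consistent with the before and after rows, and the function names are illustrative.

```python
def estimate(counters, positions):
    """Min-counter estimate for the flow that hashes to `positions`."""
    return min(counters[p] for p in positions)


def conservative_update(counters, positions, amount):
    """Raise only the counters that lie below the new estimate."""
    target = estimate(counters, positions) + amount
    for p in positions:
        counters[p] = max(counters[p], target)


counters = [0, 3, 4, 1, 8, 1, 1, 0, 3, 2, 5, 4, 2, 0]
y_positions = [8, 10, 11]        # hypothetical hash positions of flow y

before = estimate(counters, y_positions)       # min(3, 5, 4)
conservative_update(counters, y_positions, 2)  # increment +2
after = estimate(counters, y_positions)
```

Counters that already hold at least the new estimate are left alone, which limits how much one flow inflates the counts of the flows it collides with.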
Bloomier Filter
Bloom filters handle set membership.
Counters to handle multi-set/count tracking.
Bloomier filter:
Extend to handle approximate static functions.
Each element of set has associated function value.
Non-set elements should return null.
Want to always return correct function value for set
elements.
Approximate Concurrent State Machines
Motivation: Router State Problem
Suppose each flow has a state to be tracked.
Applications:
Intrusion detection
Quality of service
Distinguishing P2P traffic
Video congestion control
Potentially, lots of others!
Want to track state for each flow.
But compactly; routers have small space.
Flow IDs can be ~100 bits. Can't keep a big lookup
table for hundreds of thousands or millions of flows!
Approximate Concurrent State Machines
Model for ACSMs
We have an underlying state machine, states 1…X.
Lots of concurrent flows.
Want to track state per flow.
Dynamic: Need to insert new flows and delete terminating flows.
Can allow some errors.
Space, hardware-level simplicity are key.
ACSM Basics
Operations
Insert new flow, state
Modify flow state
Delete a flow
Lookup flow state
Errors
False positive: return state for non-extant flow
False negative: no state for an extant flow
False return: return wrong state for an extant flow
Don't know: return "don't know"
"Don't know" may be better than other types of errors for many
applications, e.g., slow path vs. fast path.
ACSM via Counting Bloom Filters
Dynamically track a set of current (FlowID,FlowState)
pairs using a CBF.
Consider first when system is well-behaved.
Insertion easy.
Lookups, deletions, modifications are easy when current state is
given.
If not, have to search over all possible states. Slow, and can
lead to "don't knows" for lookups, other errors for deletions.
Direct Bloom Filter (DBF) Example
Before: 0 0 1 0 2 3 0 0 2 1 0 1 1 2 0 0
Update (123456,3) to (123456,5)
After:  0 0 0 0 1 3 0 0 3 1 1 1 1 2 0 0
Stateful Bloom Filters
Each flow hashed to k cells, like a Bloom filter.
Each cell stores a state.
If two flows collide at a cell, cell takes on "don't
know" value.
On lookup, as long as one cell has a state value, and
there are not contradicting state values, return state.
Deletions handled by timing mechanism (or counters
in well-behaved systems).
Similar in spirit to [KM], Bloom filter summaries for
multiple choice hash tables.
Fingerprint-compressed Filter Approach
Store a fingerprint of flow + state in a d-left hashtable

Fingerprint  State
1001010110   1
1100110000   4
0110111010   2
0111010100   1
1110011101   3
1100000110   3
0000111101   3
...

[Figure: element x is hashed to one bucket in each of the d sub-tables (1, 2, ..., d); the entry shown for x is fingerprint 1110001000 with state 1.]
Fingerprint-compressed Filter Approach
Insert - hash the element, find the corresponding bucket in each hash table,
and insert the fingerprint + state in the bucket with the least number
of elements
Lookup - retrieve the state of the fingerprint
Delete - remove the fingerprint
Update - direct update, or remove old + add new
Timing-based deletion can still be applied
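The four operations above can be sketched in Python with a d-left hash table that stores (fingerprint, state) entries. This is a simplified illustration, not the scheme from the cited papers: the parameters D, BUCKETS and FP_BITS, the salted SHA-256 hashing, unbounded bucket sizes, and the class name `DLeftACSM` are all assumptions.

```python
import hashlib

D, BUCKETS, FP_BITS = 4, 16, 10   # hypothetical table shape, fingerprint size


def _hash(flow_id: str, salt) -> int:
    digest = hashlib.sha256(f"{salt}:{flow_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(flow_id: str) -> int:
    return _hash(flow_id, "fp") % (1 << FP_BITS)


class DLeftACSM:
    def __init__(self):
        # tables[t][b] maps fingerprint -> state for bucket b of sub-table t.
        self.tables = [[{} for _ in range(BUCKETS)] for _ in range(D)]

    def _buckets(self, flow_id: str):
        return [self.tables[t][_hash(flow_id, t) % BUCKETS] for t in range(D)]

    def insert(self, flow_id: str, state: int) -> None:
        # Least-loaded candidate bucket; min() breaks ties to the left.
        target = min(self._buckets(flow_id), key=len)
        target[fingerprint(flow_id)] = state

    def lookup(self, flow_id: str):
        fp = fingerprint(flow_id)
        states = {b[fp] for b in self._buckets(flow_id) if fp in b}
        if not states:
            return None                 # no state for this flow
        if len(states) > 1:
            return "don't know"         # contradicting entries
        return states.pop()

    def update(self, flow_id: str, state: int) -> None:
        fp = fingerprint(flow_id)
        for b in self._buckets(flow_id):
            if fp in b:
                b[fp] = state           # direct update in place

    def delete(self, flow_id: str) -> None:
        fp = fingerprint(flow_id)
        for b in self._buckets(flow_id):
            b.pop(fp, None)             # simplification: may hit a colliding flow


acsm = DLeftACSM()
acsm.insert("123456", 3)
state_before = acsm.lookup("123456")
acsm.update("123456", 5)
state_after = acsm.lookup("123456")
acsm.delete("123456")
state_deleted = acsm.lookup("123456")
```

Two flows with the same fingerprint in the same bucket are indistinguishable in this sketch, which is one source of the false positive and false return errors listed earlier.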
Stateful Bloom Filter (SBF) Example
Before: 1 4 3 4 3 3 0 0 2 1 0 1 4 ? 0 2
Update (123456,3) to (123456,5)
After:  1 4 5 4 5 3 0 0 2 1 0 1 4 ? 0 2
5. Invertible Bloom Filters
What's the Difference?
Efficient Set Reconciliation without Prior Context
Motivation
Distributed applications often need to compare remote state.
[Figure: hosts 1 and 2 with R1 and R2; the partition between them heals.]
Must solve the Set-Difference Problem!
What is the Set-Difference problem?
What objects are unique to host 1?
What objects are unique to host 2?
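[Figure: Host 1 and Host 2 each hold a set of objects. A and F are common to both; B and E are unique to one host, C and D to the other.]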
Example 1: Data Synchronization
Identify missing data blocks
Transfer blocks to synchronize sets
[Figure: Host 1 and Host 2 exchange the missing blocks: C and D in one direction, B and E in the other.]
Identify all unique blocks.
Replace duplicate data with pointers
[Figure: Host 1 and Host 2 with their object sets; A and F are stored on both hosts.]
Set-Difference Solutions
Trade a sorted list of objects.
O(n) communication, O(n log n) computation
Approximate Solutions:
Approximate Reconciliation Tree (Byers)
O(n) communication, O(n log n) computation
Polynomial Encodings (Minsky & Trachtenberg)
Let d be the size of the difference
O(d) communication, $O(dn + d^3)$ computation
Invertible Bloom Filter
O(d) communication, O(n+d) computation
Difference Digests
Efficiently solves the set-difference problem.
Consists of two data structures:
Invertible Bloom Filter (IBF)
Efficiently computes the set difference.
Needs the size of the difference
Strata Estimator
Approximates the size of the set difference.
Uses IBFs as a building block.
Invertible Bloom Filters (IBF)
Encode local object identifiers into an IBF.
[Figure: Host 1 encodes its objects into IBF 1; Host 2 encodes its objects into IBF 2.]
IBF Data Structure
Array of IBF cells
For a set difference of size d, require $\alpha d$ cells
($\alpha > 1$)
Each ID is assigned to many IBF cells
Each IBF cell contains:
idSum: XOR of all IDs in the cell
hashSum: XOR of hash(ID) for all IDs in the cell
count: number of IDs assigned to the cell
IBF Encode
[Figure: ID A is hashed by Hash1, Hash2, Hash3 to three cells of an IBF with $\alpha d$ cells. In each of those cells, A is XORed into idSum, H(A) is XORed into hashSum, and count is incremented. IDs B and C are encoded the same way.]
Assign ID to many cells
All hosts use the same hash functions
Invertible Bloom Filters (IBF)
Trade IBFs with remote host
[Figure: Host 1 and Host 2 exchange IBF 1 and IBF 2.]
Invertible Bloom Filters (IBF)
Subtract IBF structures
Produces a new IBF containing only unique objects
[Figure: IBF 1 is subtracted from IBF 2, cell by cell, to give IBF (2 - 1).]
IBF Subtract
Timeout for Intuition
After subtraction, all elements common to both sets
have disappeared. Why?
Any common element (e.g., W) is assigned to the same cells
on both hosts (assume same hash functions on both
sides)
On subtraction, W XOR W = 0. Thus, W vanishes.
While elements in the set difference remain, they may
be randomly mixed, so we need a decode procedure.
Invertible Bloom Filters (IBF)
Decode resulting IBF
Recover object identifiers from IBF structure.
[Figure: decoding IBF (2 - 1) recovers B, E, C, D, separated into the objects unique to Host 1 and those unique to Host 2.]
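IBF Decode
Test for Purity: H(idSum) = hashSum
Cell holding only V: idSum = V and hashSum = H(V), so H(V) = H(V) and the cell is pure.
Cell holding V, X, Z: $H(V \oplus X \oplus Z) \neq H(V) \oplus H(X) \oplus H(Z)$, so the cell is not pure.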
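The encode, subtract, and decode steps can be put together in a compact Python sketch. It is an illustration under stated assumptions, not the authors' implementation: IDs are nonzero integers, salted SHA-256 stands in for the hash functions, each ID goes to K = 3 cells (one per block of the array), and the 120 cells used for a difference of 4 are deliberately generous so that this tiny example decodes reliably.

```python
import hashlib

K = 3  # hash functions per ID


def _h(value: int, salt: str) -> int:
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def check_hash(value: int) -> int:
    return _h(value, "check")          # the H(.) used for hashSum


def cell_indices(value: int, n_cells: int):
    block = n_cells // K               # one cell per block: K distinct cells
    return [i * block + _h(value, str(i)) % block for i in range(K)]


class IBF:
    def __init__(self, n_cells: int):
        self.n = n_cells
        self.id_sum = [0] * n_cells
        self.hash_sum = [0] * n_cells
        self.count = [0] * n_cells

    def _apply(self, value: int, delta: int) -> None:
        for c in cell_indices(value, self.n):
            self.id_sum[c] ^= value
            self.hash_sum[c] ^= check_hash(value)
            self.count[c] += delta

    def encode(self, ids):
        for value in ids:
            self._apply(value, +1)
        return self

    def subtract(self, other: "IBF") -> "IBF":
        diff = IBF(self.n)
        for c in range(self.n):
            diff.id_sum[c] = self.id_sum[c] ^ other.id_sum[c]
            diff.hash_sum[c] = self.hash_sum[c] ^ other.hash_sum[c]
            diff.count[c] = self.count[c] - other.count[c]
        return diff

    def decode(self):
        """Peel pure cells; return (only_in_self, only_in_other, success)."""
        only_self, only_other = set(), set()
        progress = True
        while progress:
            progress = False
            for c in range(self.n):
                pure = (self.count[c] in (1, -1)
                        and check_hash(self.id_sum[c]) == self.hash_sum[c])
                if pure:
                    value, sign = self.id_sum[c], self.count[c]
                    (only_self if sign == 1 else only_other).add(value)
                    self._apply(value, -sign)   # remove it from all its cells
                    progress = True
        empty = not (any(self.id_sum) or any(self.hash_sum) or any(self.count))
        return only_self, only_other, empty


host1 = {ord(ch) for ch in "ACDF"}
host2 = {ord(ch) for ch in "ABEF"}
ibf1 = IBF(120).encode(host1)
ibf2 = IBF(120).encode(host2)
only2, only1, ok = ibf2.subtract(ibf1).decode()
```

Common IDs (A and F here) cancel during subtraction, so decoding only has to untangle the d differing IDs; `ok` is False when peeling gets stuck, in which case a larger IBF is needed.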
How many IBF cells?
[Figure: space overhead (overhead to decode at >99%) versus set difference, for hash counts 3 and 4.]
Small diffs: 1.4x - 2.3x
Large differences: 1.25x - 1.4x
How many hash functions?
1 hash function produces many pure cells initially
but nothing to undo when an element is removed.
Many (say 10) hash functions: too many collisions.
We find by experiment that 3 or 4 hash functions
work well. Is there some theoretical reason?
Theory
Let d = difference size, k = # hash functions.
Theorem 1: With (k + 1) d cells, failure probability
falls exponentially.
For k = 3, implies a 4x tax on storage, a bit weak.
[Goodrich,Mitzenmacher]: Failure is equivalent to
finding a 2-core (loop) in a random hypergraph
Theorem 2: With $c_k d$ cells, failure probability
falls exponentially.
$c_4$ = 1.3x tax, agrees with experiments
Connection to Coding
Mystery: IBF decode similar to peeling procedure
used to decode Tornado codes. Why?
Explanation: Set Difference is equivalent to coding
with insert-delete channels
Intuition: Given a code for set A, send codewords
only to B. Think of B's set as a corrupted form of
A's.
Reduction: If code can correct D
insertions/deletions, then B can recover A and the set
difference.
Reed Solomon <---> Polynomial Methods
LDPC (Tornado) <---> Difference Digest
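Strata Estimator
[Figure: keys such as A, B, C are assigned by consistent partitioning to strata that hold ~1/2, ~1/4, ~1/8, 1/16 of the keys; each stratum is encoded into its own IBF (IBF 1 to IBF 4), and the IBFs together form the estimator.]
Divide keys into partitions containing ~$1/2^k$ of the keys
Encode each partition into an IBF of fixed size
log(n) IBFs of ~80 cells each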
Strata Estimator
[Figure: Estimator 1 from Host 1 and Estimator 2 from Host 2, each made of IBF 1 to IBF 4, are subtracted stratum by stratum and decoded.]
Attempt to subtract & decode IBFs at each level.
If level k decodes, then return:
$2^k$ x (the number of IDs recovered)
What about the other strata?
Observation: Extra partitions hold useful data
Sum elements from all decoded strata & return:
$2^{(k-1)}$ x (the number of IDs recovered)
Estimation Accuracy
[Figure: average estimation error (15.3 KBytes); relative error in estimation (%) versus set difference.]
Hybrid Estimator
Combine Strata and Min-Wise Estimators.
Use IBF Stratas for small differences.
Use Min-Wise for large differences.
[Figure: the Strata estimator consists of IBF 1 to IBF 4; the Hybrid estimator consists of IBF 1 to IBF 3 plus a Min-Wise estimator.]
Hybrid Estimator Accuracy
Hybrid matches Strata
for small differences.
Converges with Min-wise
for large differences
[Figure: average estimation error (15.3 KBytes); relative error in estimation (%) versus set difference.]
Application: KeyDiff Service
Promising Applications:
File Synchronization
P2P file sharing
Failure Recovery
[Figure: three Application / Key Service pairs; the Key Service interface is Add(key), Remove(key), Diff(host1, host2).]
Difference Digests Summary
Strata & Hybrid Estimators
Estimate the size of the Set Difference.
For 100K sets, 15KB estimator has <15% error
O(log n) communication, O(log n) computation.
Invertible Bloom Filter
Identifies all IDs in the Set Difference.
16 to 28 Bytes per ID in Set Difference.
O(d) communication, O(n+d) computation.
Implemented in KeyDiff Service
Conclusions: Got Diffs?
New randomized algorithm (difference digests) for
set difference or insertion/deletion coding
Could it be useful for your system? Need:
Large but roughly equal size sets
Small set differences (less than 10% of set size)
Comparison to Logs
IBFs work with no prior context.
Logs work with prior context, BUT
Redundant information when syncing with multiple parties.
Logging must be built into system for each write.
Logging adds overhead at runtime.
Logging requires non-volatile storage.
Often not present in network devices.
IBFs may out-perform logs when:
Synchronizing multiple parties
Synchronizations happen infrequently
The main point revised again
Whenever you have a set or list or
function or concurrent state machine or
whatever-will-be-next?, and space is an
issue, an approximate representation, like
a Bloom filter may be a useful alternative.
Just be sure to consider the effects of the
false positives!
Extension: Distance-Sensitive Bloom Filters
Instead of answering questions of the form
"Is $y \in S$?"
we would like to answer questions of the form
"Is $y \approx x \in S$?"
That is, is the query close to some element of the set,
under some metric and some notion of close.
Applications:
DNA matching
Virus/worm matching
Databases
Some initial results [KirschMitzenmacher].
Variation: Simpler Hashing
[DillingerManolios], [KirschMitzenmacher]
Let $h_1$ and $h_2$ be hash functions.
For i = 0, 1, 2, ..., k - 1 and some f, let
$g_i(x) = (h_1(x) + i\,h_2(x)) \bmod m$
So 2 hash functions can mimic k hash functions.
Hash functions:
SDBM, BUZ
Fast generation of very high quality pseudorandom numbers
Mersenne twister
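The double-hashing construction $g_i(x) = (h_1(x) + i\,h_2(x)) \bmod m$ can be sketched in Python as below. Deriving $h_1$ and $h_2$ from two halves of one SHA-256 digest is an assumption for illustration; the slides name SDBM and BUZ as candidate hash functions instead.

```python
import hashlib


def base_hashes(x: str):
    """Two base hash values h1(x), h2(x) taken from a single digest."""
    digest = hashlib.sha256(x.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return h1, h2


def positions(x: str, k: int, m: int):
    """g_i(x) = (h1(x) + i * h2(x)) mod m for i = 0, ..., k - 1."""
    h1, h2 = base_hashes(x)
    return [(h1 + i * h2) % m for i in range(k)]


m, k = 97, 5                      # hypothetical filter size and hash count
pos = positions("example-key", k, m)
```

Only two hash evaluations are needed per item regardless of k; the remaining positions follow by arithmetic.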
