See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/238681522

Bloom Filters – A Tutorial, Analysis, and Survey

Article · January 2002


2 authors:

James Blustein, Dalhousie University
Amal El-Maazawi, The American University in Cairo

All content following this page was uploaded by James Blustein on 13 December 2014.


Bloom Filters – A Tutorial, Analysis, and Survey

James Blustein
Amal El-Maazawi

Technical Report CS-2002-10

Dec 10, 2002

Faculty of Computer Science

6050 University Ave., Halifax, Nova Scotia, B3H 1W5, Canada
Bloom Filters — A Tutorial, Analysis, and
Survey

Authors: James Blustein∗ and Amal El-Maazawi

Faculty of Computer Science
Dalhousie University
6050 University Avenue
Halifax, NS
B3H 1W5
Canada
Contact: Telephone: +1(902)494-6104
Facsimile: +1(902)492-1517
E-mail: [email protected]

Abstract
Bloom filters use superimposed hash transforms to provide a prob-
abilistic membership test. The only types of errors are false positives
(non-members being reported as members). Non-members are typically
detected quickly (requiring only two probes in the optimal case).
This article surveys modern applications of this technique (e.g., in
spell checking and Web caching software) and provides a detailed analysis
of their performance, in theory and practice. The article concludes with
practical advice about implementing this useful and intriguing technique.

Keywords: Bloom filter, hashing, performance analysis, network cache

∗ Corresponding author
Contents

1 Introduction
  1.1 Definition
  1.2 What Is An Error

2 Novel Uses
  2.1 Rule-based systems
  2.2 Spell Checkers
  2.3 Estimating Join Sizes
  2.4 Differential Files
  2.5 Network Applications
  2.6 Attenuated Bloom filters
  2.7 Text Analysis

3 Implementation
  3.1 Hashing
  3.2 Basic Implementation
    3.2.1 Operations on Cells
    3.2.2 Operations on Bloom filters
  3.3 Compressed Bloom filters

4 Analysis
  4.1 Time Complexity
  4.2 Relationship Between Parameters
    4.2.1 The General Case
      The Governing Equation
      Error Rate
      Rejection Time
      Growing Sets
    4.2.2 The Optimal Case
      Error Rate
      Rejection Time
      Governing Equation
  4.3 Performance issues
  4.4 Variation on Standard Bloom Filters

5 Summary
  5.1 Optimal Filters
  5.2 Trade-offs in Filter Performance

References

A Miscellaneous Methods

B Derivation of Rejection Time Inequality

1 Introduction

The Bloom filter is a way of using hash transforms to determine set
membership [1]. Bloom filters find application wherever fast set membership
tests on large data sets are required. Such applications include spell
checking, differential file updating, distributed network caches, and textual
analysis. It is a probabilistic method with a set error rate. Errors can only
occur on the side of inclusion: a true member will never be reported as not
belonging to a set, but some non-members may be reported as members.
We describe Bloom filters, elucidate some of their properties, and present a
survey of their uses. The survey in this paper concentrates on research
findings published between 1996 and 2002 and abstracted or cited in the
following periodical indexes: 1) INSPEC¹, 2) NEC Research Institute
ResearchIndex², and 3) ACM Digital Library³.
The rest of the article is organized as follows: Section 2 outlines novel uses
of Bloom filters. Section 3 describes the implementation of Bloom filters and
the operations conducted. Section 4 is the analysis of Bloom filters and we
present a summary in Section 5.

1.1 Definition

Bloom filters use hash transforms to compute a vector (the filter) that is
representative of the data set. Membership is tested by comparing the results
of hashing on the potential members to the vector. In its simplest form the
vector is composed of N elements, each a bit. An element is set if and only if
some hash transform hashes to that location for some key. Figure 2 shows such
a filter with m = 4 hash transforms and N = 8 bits.

¹ <URL:http://www.iee.org/Publish/INSPEC/>
² <URL:http://citeseer.nj.nec.com/>
³ <URL:http://portal.acm.org/>


[Figure: a key K is passed through the hash transform h, and the result h(K)
addresses an entry in Table]

Figure 1: A typical hash transform in action

m = 4 hash transforms:
    h1(K) = 2
    h2(K) = 5
    h3(K) = 7
    h4(K) = 4

bit #    0 1 2 3 4 5 6 7
filter   0 0 1 0 1 1 0 1    (the filter of N bits)

Figure 2: A simple Bloom filter
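As a minimal sketch of how the filter in Figure 2 comes about, the following Python fragment sets the four bits addressed by h1(K) through h4(K). The hash values are taken directly from the figure; the function name `insert` and the use of a Python list for the bit vector are illustrative assumptions, not part of the report.

```python
def insert(bits, positions):
    """Set the filter element at each hashed position."""
    for p in positions:
        bits[p] = 1


N = 8                              # number of bits in the filter
filter_bits = [0] * N              # Initialize: clear every element
hashed_positions = [2, 5, 7, 4]    # h1(K)..h4(K) as shown in Figure 2

insert(filter_bits, hashed_positions)
print(filter_bits)
```

The resulting vector has bits 2, 4, 5, and 7 set, which matches the bit row drawn in the figure.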

Bloom filters can be combined with other methods, such as signatures [1, 2].
Figure 3 depicts a case in which the filter contains references to information
related to records rather than only the records. In that case the hash transforms
will hash to N/(b + 1) cells, where b is the size of the signature and the extra
bit is used to flag cells containing signatures [1]. The analysis of such a filter is
equivalent to that for the simple case, which this paper discusses.


[Figure: table cells holding nil or non-nil pointers to signatures or other
data. Legend: cell in table; signature or other data; nil pointer; non-nil
pointer]

Figure 3: A Bloom filter with signature information

1.2 What Is An Error

Errors can occur when two or more transforms map to the same element. The
membership test for a key K works by checking the elements that would have
been updated if the key had been inserted into the vector. If all the appropriate
flag bits have been set by hashes then K will be reported as a member of the
set. If the elements have been updated by hashes on other keys — and not
K — then the membership test will incorrectly report K as a member. For
example, if the set Vegetables contains potato and cabbage but not tomato, then
the Bloom filter illustrated in Figure 4 would incorrectly identify tomato as a
Vegetable. Such an error could occur because all the bits that would be set if
tomato were hashed on would already be set in the filter.
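The tomato example can be traced in a few lines of Python. This is only a sketch: the bit positions below are read off the portion of the filter drawn in Figure 4 (counted from the left of the portion shown), and the lookup table `HASHES` stands in for real hash transforms, which the report does not specify.

```python
# Hypothetical positions within the portion of the filter shown in Figure 4.
HASHES = {
    "potato":  [0, 2, 3],
    "cabbage": [1, 4, 6],
    "tomato":  [1, 3, 4],
}


def insert(bits, key):
    for p in HASHES[key]:
        bits[p] = 1


def is_member(bits, key):
    # A key is reported as a member only if every addressed bit is set.
    return all(bits[p] for p in HASHES[key])


bits = [0] * 7
insert(bits, "potato")
insert(bits, "cabbage")

# tomato was never inserted, yet all of its bits were set by other keys.
print(is_member(bits, "tomato"))
```

Every bit that tomato addresses was already set by potato or cabbage, so the test reports a member even though tomato was never inserted.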

2 Novel Uses

This section reviews some of the most interesting applications of Bloom filters.
It is perhaps surprising that what is essentially a set-membership test is of use
in so many important applications.


A portion of the filter

··· P C P P C — C ···
··· — T — T T — — ···
Results of hashing tomato

Key Meaning
P updated by potato
C updated by cabbage
T would be set by tomato
— unset

Figure 4: tomato is erroneously identified as a member

2.1 Rule-based systems

Burton H. Bloom originally proposed filter hashing as part of a program to
automatically hyphenate words. He wanted to separate words that could be
hyphenated by the application of simple rules from the minority that required
extensive analysis. He proposed using his filter method to separate the 10% of
difficult words from the rest [1].

2.2 Spell Checkers

Bloom filters have been successfully applied in spell checking programs such as
cspell [3–5]. They are employed to determine if candidate words are members
of the set of words in a dictionary. In the case of cspell, suggested corrections
are generated by making all single substitutions in rejected words and then
checking if the results are members of the set [3]. Bloom filters perform very
well in such cases [3]. The filter size was chosen to be large enough to allow the
inclusion of additional words added by the user.
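The single-substitution strategy can be sketched as follows. This is not cspell's code: the `BloomFilter` class, the use of SHA-256 to derive m hash values, the filter size, and the three-word dictionary are all assumptions made for illustration.

```python
import hashlib
import string


class BloomFilter:
    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)   # one byte per element, for clarity

    def _positions(self, key):
        # Derive n_hashes pseudo-random positions by salting one hash function.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def insert(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def is_member(self, key):
        return all(self.bits[p] for p in self._positions(key))


def single_substitutions(word):
    """Yield every word obtained by replacing exactly one letter."""
    for i, old in enumerate(word):
        for new in string.ascii_lowercase:
            if new != old:
                yield word[:i] + new + word[i + 1:]


dictionary = BloomFilter(n_bits=1024, n_hashes=4)
for w in ["cat", "car", "cot"]:
    dictionary.insert(w)

suggestions = sorted(
    {c for c in single_substitutions("cst") if dictionary.is_member(c)}
)
print(suggestions)
```

Because a Bloom filter never rejects a true member, "cat" and "cot" are guaranteed to appear among the suggestions; any other entry would be a false positive.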


2.3 Estimating Join Sizes

Mullin [6, 7] suggested using Bloom filters to estimate the size of joins in
databases. This is of particular advantage in the case of distributed databases
where communications costs are to be kept to a minimum. He presented a
method by which filters that are too large to fit in memory can be used [7]. The
method is essentially to use a representative sample of a filter for testing and
ignore all hashes outside the range of the sample. Since hash transforms are
pseudo-random, any sufficiently large portion of a filter can act as a sample.

2.4 Differential Files

A major area of interest in the application of Bloom filters has been their use
in differential file access [4, 5, 8]. A differential file is essentially a separate file
which contains records that are modified in the main file [4]. Differential files
are used as caches in large databases: when a change is made to a record in the
main database the differential file is updated; when all the changes have been
made to the database then the differential file is used to update the database.
When the differential file is much smaller than the database, changes to it can
be made without the overhead needed to search the main file. Of course, it
would be best to keep the entire set of records in memory at once, but this
is not feasible for large data sets and so the probabilistic approach offered by
Bloom filters is used. Bloom filters in core memory are used to predict if a
record will be found in the differential file.
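A rough sketch of that lookup path is given below. The dictionaries standing in for the main and differential files, the account keys, the SHA-256 based hash transforms, and the counter `differential_reads` are all hypothetical; the point is only that the in-memory filter is consulted before the differential file is touched.

```python
import hashlib

N_BITS, N_HASHES = 256, 3


def positions(key):
    for i in range(N_HASHES):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % N_BITS


main_file = {"acct-1": 100, "acct-2": 250, "acct-3": 75}   # stand-in for disk
differential_file = {}                                      # recent changes
filter_bits = bytearray(N_BITS)                             # kept in memory
differential_reads = 0


def update(key, value):
    """Record a change in the differential file and in the filter."""
    differential_file[key] = value
    for p in positions(key):
        filter_bits[p] = 1


def lookup(key):
    """Search the differential file only if the filter predicts a hit."""
    global differential_reads
    if all(filter_bits[p] for p in positions(key)):
        differential_reads += 1
        if key in differential_file:
            return differential_file[key]
        # False positive: fall through to the main file.
    return main_file[key]


update("acct-2", 300)
print(lookup("acct-2"), lookup("acct-1"))
```

A false positive costs one wasted search of the differential file but never a wrong answer, since the main file is still consulted afterwards.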

Benefits and Drawbacks

A common assumption in such analyses is that if a record (e.g., a store’s credit
card account) has been updated recently then it is likely to be updated again
soon [5]. If this assumption does not hold, then Bloom filters may not be
suitable for this application. Another important consideration is how often the
differential file should be reset. The differential file must not be allowed to
grow too large to fit in memory or much of the advantage is lost [5].
The benefits of using this technique include improvements in performance,
greater database reliability, and reduced backup and recovery costs. However,
the technique is effective only within certain parameters. Implementors must
decide on trade-offs between, on the one hand, the size of the filter and the
number of hash transforms to use and, on the other hand, the filter error rate.
Those values will depend in turn on the given transaction volume, the number of
keys to be accessed and updated, and the characteristics of the key set. We
provide an analysis of those trade-offs in Section 4.

2.5 Network Applications

Bloom filters have recently found many applications in networks [9]. Networks
rely on some form of routing to transfer messages between hosts. Routers are
special-purpose network devices that must operate with high efficiency in
real time. Data can be delayed or lost when routers are overwhelmed with
traffic. An interesting solution to the problem of enforcing fairness in
routing is Feng et al.’s Stochastic Fair Blue (SFB) algorithm [10]. When
routers implementing SFB get near their capacity they begin to drop packets
from the various hosts that are connecting to them. The routers use Bloom
filters and labels to probabilistically determine which hosts continue to send
more than their share of traffic even when some of their data are dropped by
the router. Hosts which continue to operate in this non-cooperative fashion
have more of their packets dropped, but traffic from hosts which reduce their
demands on the router is not dropped. Bloom filters are used as a space- and
time-efficient method to keep track of which hosts are sending too much
traffic.


An important aspect of network applications is that they are not independent.
The cost of transactions that rely on network traffic is generally higher
than computations made in a single host. To minimize the time needed to
compute and deliver correct results, distributed applications are designed to
avoid using the network as much as possible and to send messages to the right
host when they must rely on the network. This is a similar concept to the
differential file paradigm we discussed in Section 2.4.

We give an example that is characteristic of the class. Bloom filters are used
in caching proxy servers on the World Wide Web (WWW). The WWW can
be viewed as a distributed system for delivering documents which are sent by
servers to clients only upon request. Caching improves performance when clients
obtain copies of files from neighbouring servers instead of from the originating
server (which may be several slow network links away). Proxy servers intercept
requests from clients and either fulfill the requests themselves or re-issue them
to servers [11].
If the proxy can obtain a copy of the document from a cache (either its own
or that of a nearby co-operating proxy) then the document is retrieved from the
cache and a cache hit is registered. The hit rate of a cache is a measure of
the cache’s effectiveness. Proxies are typically deployed as hierarchies (which
mimic logical network architectures) or as a series of co-operating proxies
(without regard to network architecture) [11]. The performance of a Web cache
scheme depends on the size of its client community; the bigger the user
community, the higher the probability that a cached document will soon be
requested again.
Bloom filters are used in Web caches to efficiently determine the existence
of an object in a cache [10] and they can be used for Web cache sharing too.
Web caches are shared to reduce message traffic. Caching proxies may be
implemented so as not to transfer the exact content of their caches (i.e.,
lists of URLs) but instead to broadcast much smaller Bloom filters that
represent the contents of the cache. If a proxy wants to determine if another
proxy has a page in its cache, it checks the appropriate Bloom filter.
Bloom filters are also used in cache digests. A cache digest is essentially a
lossy compression of all cache keys with a lookup capability. Digests are made
available via HTTP (the main network protocol of the WWW), and a cache
downloads its neighbours’ digests at startup. By checking a neighbour’s digest,
a cache can determine with certainty if a neighbouring cache does not hold a
given object. Their use in cache digests allows caches to efficiently inform
each other about their contents without any per-request delays. The main goal
is to reduce ‘cache directory’ size while keeping the number of collisions low.
Bloom filters are an efficient way of serving those purposes. The small chance
of a false positive is greatly outweighed by the significant reduction in
network traffic achieved by using the succinct Bloom filter instead of sending
the full list of cache contents [12].

2.6 Attenuated Bloom filters

The caches we have seen so far do not replicate data. Where cached data
might be replicated, attenuated Bloom filters (ABFs) may be useful. Rhea and
Kubiatowicz developed such ‘a lossy distributed index’ technique using Bloom
filters for networks with nodes that communicate network topology with each
other [13, p. 1248].
ABFs are composed of arrays of Bloom filters. Each node in the network
stores an ABF for each outgoing link. An ABF is an array of n Bloom filters
which together represent the contents of the data cached at neighbouring nodes
that can be reached within n network links.
An example will make the application clearer. Consider the outgoing link
from node A to node B in the network depicted in Figure 5. The nth Bloom
filter in the ABF from A to B is the union of all the Bloom filters in all nodes
on a path of length n beginning with B.
A basic Bloom filter could represent the probability that a specific datum
is available from a node on a path beginning with B. Attenuated Bloom filters
(ABFs) provide that information and also information about how many links
away that datum is presumed to be. ABFs can be used to speed up searches in
peer-to-peer networks since the searches resemble depth first traversals of the
network as guided by the probabilistic information in the ABFs. Such searches
are biased in favour of nodes that are most likely to contain the data and are
closest to the current root node. The search algorithms apply penalties to Bloom
filters that are along longer paths (because there is a greater cost associated with
traversing more network links, and the longer the path the greater the possibility
of searching many nodes). The exact penalties applied and details of the search
algorithms are experimental. The methods are being used in conjunction with
OceanStore [14], an extremely large global persistent data store.
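The construction of an attenuated Bloom filter for one outgoing link can be sketched as a breadth-first walk that ORs together the filters found at each distance. The topology `links` and the filters for C, D, and A below are hypothetical values chosen for illustration (the filters for B and E follow Figure 5(b)), and the function names are assumptions rather than anything from Rhea and Kubiatowicz.

```python
def union(filters):
    """Bitwise OR of equal-length bit lists."""
    return [int(any(column)) for column in zip(*filters)]


# Hypothetical topology: A-B, B-C, B-D, C-E.
links = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "E"],
    "D": ["B"],
    "E": ["C"],
}
node_filter = {
    "A": [0, 0, 0, 0, 1, 0, 1, 1],
    "B": [0, 1, 0, 1, 0, 1, 0, 0],
    "C": [1, 0, 0, 0, 0, 0, 0, 1],   # assumed value
    "D": [0, 0, 1, 0, 1, 0, 0, 0],   # assumed value
    "E": [1, 0, 1, 0, 1, 0, 0, 0],
}


def attenuated_filter(source, neighbour, depth):
    """Return one merged filter per hop along the link source -> neighbour."""
    levels = []
    seen = {source, neighbour}
    frontier = [neighbour]
    for _ in range(depth):
        if not frontier:
            break
        levels.append(union([node_filter[n] for n in frontier]))
        next_frontier = []
        for n in frontier:
            for peer in links[n]:
                if peer not in seen:
                    seen.add(peer)
                    next_frontier.append(peer)
        frontier = next_frontier
    return levels


abf = attenuated_filter("A", "B", depth=3)
for level, bits in enumerate(abf, start=1):
    print(level, bits)
```

With these assumed values the three levels correspond to nodes B, then C and D, then E, mirroring the layout of Figure 5(b).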

2.7 Text Analysis

Bernstein [15, 16] produced an interactive program that uses Bloom filters to
find related passages in a monograph. It works by constructing a Bloom filter of
all the words in each passage of a monograph and then computing the normalized
dot product of all pairs of them. The result of every dot product is a similarity
measure — the higher the value the more likely the passages are to have related
content. This can be a valuable tool for scholars if, as Bernstein claims, it
often finds connections that would otherwise go unnoticed [15]. Mylonas and
Bernstein [17] have adapted it to work with Latin as well as English texts. They
claim that this tool


[Figure 5(a): network diagram of nodes A, B, C, D, and E, each holding an
8-bit Bloom filter. Filters shown: <0,0,0,0,1,0,1,1>, <0,1,0,1,0,1,0,0>,
<1,0,0,0,0,1,0,1>, <1,0,1,0,1,0,0,0>, and <1,0,1,0,1,0,0,0>]

(a) A network with Bloom filters in each node

bit #    0 1 2 3 4 5 6 7    nodes
         0 1 0 1 0 1 0 0    B
         1 0 1 0 1 0 0 1    C, D
         1 0 1 0 1 0 0 0    E

(b) The attenuated Bloom filter for A → B for the network in Figure 5(a)

Figure 5: Attenuated Bloom Filters

    . . . provide[s] impressionistic information that can open the text
    in new and valuable directions under the reader’s guidance. . . . [It is]
    inexact and error-prone, seeking to provide intriguing suggestions for
    the scholar’s consideration rather than objective data. [17, p. 182]
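The normalized dot product used to compare passages can be illustrated with a short sketch. It is not Bernstein's program: the tokenization by whitespace, the 64-bit filter, the two SHA-256 derived hash transforms, and the sample passages are assumptions.

```python
import hashlib
import math


def passage_filter(text, n_bits=64, n_hashes=2):
    """Build a Bloom filter (as a list of bits) from the words of a passage."""
    bits = [0] * n_bits
    for word in text.lower().split():
        for i in range(n_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
            bits[int.from_bytes(digest[:8], "big") % n_bits] = 1
    return bits


def similarity(a, b):
    """Normalized dot product (cosine) of two bit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    # For 0/1 vectors the squared length is simply the number of set bits.
    norm = math.sqrt(sum(a)) * math.sqrt(sum(b))
    return dot / norm if norm else 0.0


p1 = passage_filter("the filter stores hashed words from the passage")
p2 = passage_filter("hashed words from the passage fill the filter")
p3 = passage_filter("routers drop packets when traffic is heavy")

print(round(similarity(p1, p2), 3), round(similarity(p1, p3), 3))
```

The measure is 1 for identical filters and falls towards 0 as fewer set bits are shared, so higher values suggest passages with more vocabulary in common.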

Bloom filters are used in many areas including updating databases, estimat-
ing the size of database joins, aiding scholarly research, and in spell checking
programs.


3 Implementation

Having now shown some of what Bloom filters can do and where they are used
we will examine how they operate.

3.1 Hashing

Hashing transforms are typically pseudo-random mathematical transforms used
to compute addresses for lookup [18, 19]. Figure 1 shows the use of the hash
transform h, to find an item with a key K, stored at address h(K). The time
complexity of searches by hashing can be as low as O(1) or as high as O(N),
for a hash table with N elements. The worst-case behaviour occurs when two
or more distinct keys Ki ≠ Kj collide, i.e., h(Ki) = h(Kj), and the entire table
must be searched to find the correct entry [18, pp. 507 – 508]. Bloom filters are a
fast method in which the hash transforms always have constant time complexity
because there is no attempt at collision resolution. Knuth [18] described Bloom
filters as a type of superimposed coding because all of the hash transforms map
to the same table.

3.2 Basic Implementation

Bloom filters have three operations: a membership test (Procedure 1),
Initialization (Procedure 5), and Update (Procedure 6). Procedures 2 – 6 are
listed in Appendix A. Initialize clears all the elements in the vector. Insert
computes the values of the m hash transforms for a key and updates the
appropriate elements. In the simplest case the update sets the element’s flag
bit. It requires time proportional to the number of hash transforms. In more
complicated cases, additional information would also be placed in the element.
The time complexities are summarized in Table 1.

Procedure    Parameters                            Time complexity

Initialize   Table of N cells                      O(N)
Set          Cell in Table                         O(1)
Clear        Cell in Table                         O(1)
IsSet        Cell in Table                         O(1)
Insert       Table, Key, and m hash transforms     O(m)
IsMember     Table, Key, and m hash transforms     O(m)

Table 1: Time Complexities of Filter Steps

In the example shown in Figure 2, hash transform h1 updates the value of
element 2 for key K and transform h2 updates the value of element 5 using the
same key. IsMember computes the same hash values as Insert but instead
of updating the elements it checks if they have been set. By definition, only
members have their keys inserted into the vector. If any of the hash transforms,
h_i(K), addresses a vector element that has not been set, then the key K could
not have been inserted into the vector and therefore cannot be a member of the
set. Note that the worst time complexity for IsMember occurs for members
(and non-members that are erroneously reported as members). As we show in
the sections named ‘Rejection Time’ below, the complexity can be considerably
less for non-members.

3.2.1 Operations on Cells

For the purposes of the analysis, we are presenting only the essentials of Bloom
filters — the algorithms are for single bit elements. The analysis of filters with
more complicated cells is essentially the same as for the simple case [1].
Blustein has shown how to efficiently implement these operations using
C [20]. IsMember is presented immediately below. The other algorithms are
presented in Appendix A. Their essential characteristics are presented in Ta-
ble 1.


3.2.2 Operations on Bloom filters

Procedure 1 (IsMember)
IsMember(Table, Key) → Boolean
    i ← 0
    repeat
        i ← i + 1
        (h_i is the ith hash transform, where 1 ≤ i ≤ m)
    until ((i = m) ∨ ¬(IsSet(Table[h_i(Key)])))
    if i = m then
        return(IsSet(Table[h_i(Key)]))
    else
        return(False)
end.
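For readers who prefer running code, Procedure 1 can be transcribed into Python roughly as follows. The three modular hash transforms, the table size of 16, and the integer keys are hypothetical choices made only so the example runs; the early return reproduces the early exit from the repeat loop.

```python
def insert(table, key, hash_transforms):
    """Set the element addressed by each hash transform."""
    for h in hash_transforms:
        table[h(key)] = 1


def is_member(table, key, hash_transforms):
    """Return False at the first unset element, True if all m are set."""
    for h in hash_transforms:
        if not table[h(key)]:
            return False          # rejected without computing later hashes
    return True


N = 16
# Hypothetical hash transforms for integer keys.
transforms = [
    lambda k: k % N,
    lambda k: (3 * k + 1) % N,
    lambda k: (7 * k + 5) % N,
]

table = [0] * N
insert(table, 10, transforms)     # sets elements 10, 15, and 11

print(is_member(table, 10, transforms), is_member(table, 4, transforms))
```

Key 4 is rejected by the first transform alone, because element 4 was never set, whereas the member key 10 needs all three probes.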

3.3 Compressed Bloom filters

Space efficiency is particularly important for applications, such as distributed
caches, that send Bloom filters as messages over networks. Mitzenmacher [21]
proposed a method based on information entropy measures for compressing
such filters to improve transmission rates at the cost of more computing time.
Interestingly, he found that ‘the number of hash functions that minimizes the
false positive [error] rate without compression in fact maximizes the false
positive [error] rate with compression.’ [21, p. 146] The method has not yet been
implemented or tested.


4 Analysis

We now analyse the performance of Bloom filters. First we show the worst-
case times for the algorithms, then we determine the various trade-offs that are
necessary in any practical implementation.

4.1 Time Complexity

Initialize  The naïve implementation requires O(N) time; however, if N is the
    size of a native data type then it can be done in constant time.

Insert  Insertion requires the computation of m hash transforms, each of which
    requires O(1) time. (Since collision detection is not necessary all the hash
    transforms have to do is compute values.) Insert therefore takes O(m)
    time per key or O(mk) for all k keys.

    Note that the filter, or substantial portions of it, can be computed in
    advance. In the case of a spell checking program for instance, a filter of
    the standard dictionary can be built prior to running the program. If the
    program allows a user to add words to the dictionary on-the-fly, then keys
    based on those words would need to be inserted at run-time.

IsMember  The loop in Procedure 1 may require the computation of as many
    as m transforms. Below we prove that, in the optimal case, on average
    only two transforms will be required to reject any non-member. In the
    worst case, when the key is a member of the set, the time complexity is
    O(m).

4.2 Relationship Between Parameters

The behaviour of Bloom filters is determined by four parameters:

N  The number of elements (or cells) in the filter
m  The number of hash transforms to be used
k  The number of set members
f  The fraction of elements (or cells) that are set in the filter

Here we derive equations that describe the relationship between these factors
for the general and optimal cases. The general case is applicable to growing and
static sets but the optimal case occurs when the error rate is minimal. As we
show in Section 4.2.2, optimal performance is predicted only for static sets in
which half of the elements have been updated.
The governing equation provides a way to predict the amount of space a
particular filter will require. The expected fraction of false positive results
given the parameter values is the error rate. The rejection time is the expected
number of hashes that will be required to determine that a key is not a member
of the set. These analytic results are summarized in Table 2 (on page 19).

4.2.1 The General Case

Since hash transforms are pseudo-random, the probability of a particular filter
element being addressed by a hash transform is 1/N. Therefore the probability
of a particular element not being updated is 1 − 1/N. If we assume that the
keys are randomly distributed then the probability of a particular element not
being updated by a single transform after all k keys have been hashed is
$(1 - 1/N)^{k}$.

The Governing Equation
The probability that a particular element is not updated by any of the m
transforms, after all k keys have been entered, is $P_{unset}$.

$$P_{unset} = (1 - 1/N)^{mk} \tag{1}$$

Equation 1 is based on the assumption that the hashes are equally likely to set
any bits.
The probability that an element is set is $P_{set}$.

$$P_{set} = 1 - P_{unset}$$

Error Rate
The analytic probability that all m elements addressed by the hash transforms
for a key are set is $P_{allset}$; for a non-member this is the false positive
rate.

$$P_{allset} = (P_{set})^{m} = \left(1 - (1 - 1/N)^{mk}\right)^{m} \tag{2}$$

Both of these computations are based on the standard assumption that the
hash transforms are independent.
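Equations 1 and 2 are easy to evaluate numerically. The sketch below simply codes them as written; the function names are assumptions, and the sample values N = 8, m = 4 are borrowed from Figure 2 with a single key (k = 1).

```python
def p_unset(N, m, k):
    """Equation 1: probability that a given element is still unset."""
    return (1 - 1 / N) ** (m * k)


def p_allset(N, m, k):
    """Equation 2: probability that all m addressed elements are set."""
    return (1 - p_unset(N, m, k)) ** m


# The filter of Figure 2 after one key has been inserted.
print(p_unset(8, 4, 1), p_allset(8, 4, 1))

# Doubling the filter size lowers the predicted error rate.
print(p_allset(16, 4, 1))
```

Evaluating the two functions over a range of N for fixed m and k gives a quick feel for how much space a target error rate demands.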

Rejection Time
If f is the fraction of the bits that are set in a Bloom filter then a single
hash has a probability 1 − f of rejecting a non-member (and a probability f of
not rejecting it). Assuming that the hash transforms are independent, the hth
hash also has a probability 1 − f of rejecting the non-member, so the
probability that the hth hash is the first to reject it is
$f^{h-1} \times (1 - f)$. In general then the expected number of hashes
required to reject a non-member is $\sum_{h=1}^{m} h \times f^{h-1} \times (1 - f)$.
We can simplify the sum by recognizing $\sum_{h=1}^{m} h \times f^{h-1}$ as the
derivative of the sum of the (finite) geometric series with $a_0 = 1$ [22]. A
detailed derivation is in Appendix B. If m is infinite then the sum converges
when |f| < 1. Clearly 0 ≤ f < 1, since a Bloom filter with all of its bits set
cannot be used to detect non-members and f will be zero only if no keys have
been hashed. Thus Equation 3 represents the relationship between the predicted
number of hashes needed to reject a non-member and the fraction of elements set
in the filter.

$$\sum_{h=1}^{m} h \times f^{h-1} \times (1 - f) \le \frac{1}{1 - f} \tag{3}$$

Note that although non-members can be rejected with fewer than m hashes,
member keys will require all of the hashes to verify their status.
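The inequality in Equation 3 can be checked numerically with a few lines of Python. This is a sanity check under the stated independence assumption, not a substitute for the derivation in Appendix B; the function names are illustrative.

```python
def expected_rejection_hashes(f, m):
    """Left-hand side of Equation 3 for fill fraction f and m transforms."""
    return sum(h * f ** (h - 1) * (1 - f) for h in range(1, m + 1))


def rejection_bound(f):
    """Right-hand side of Equation 3."""
    return 1 / (1 - f)


for f in (0.1, 0.5, 0.9):
    for m in (1, 5, 50):
        lhs = expected_rejection_hashes(f, m)
        print(f, m, round(lhs, 4), round(rejection_bound(f), 4))
```

For f = 1/2 the left-hand side approaches 2 as m grows, which is the value quoted for the optimal case.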

Growing Sets
In applications where the membership set is allowed to grow, e.g., in spell
checkers with user dictionaries, the number of keys should be the total number
of keys expected. For example, if a spell checker is constructed with an
initial dictionary of 30 000 words and it is predicted that another 5000 will
be added as the application runs then the value of k should be 35 000.

4.2.2 The Optimal Case

Analysis of the optimal case is based on the standard assumption of parallel
hash functions each of which covers half of the N elements in the table. The
optimal case is the one where the error rate is minimized for a given filter
size N, i.e. $\frac{d P_{allset}}{dm} = 0$.
It follows immediately that [23]:

$$\begin{aligned}
\frac{d P_{allset}}{dm} &= \frac{d}{dm}\, e^{m \ln\left(1 - (1 - 1/N)^{mk}\right)} \\
&= \left( \ln\left(1 - (1 - 1/N)^{mk}\right) + \frac{m}{1 - (1 - 1/N)^{mk}} \times \left( -\frac{d}{dm} (1 - 1/N)^{mk} \right) \right) \times e^{m \ln\left(1 - (1 - 1/N)^{mk}\right)}
\end{aligned}$$

where

$$\frac{d}{dm} (1 - 1/N)^{mk} = \frac{d}{dm}\, e^{mk \ln(1 - 1/N)} = k \ln(1 - 1/N) \times e^{mk \ln(1 - 1/N)}$$

so that

$$\frac{d P_{allset}}{dm} = \left( \ln\left(1 - (1 - 1/N)^{mk}\right) - \frac{mk \ln(1 - 1/N) \times (1 - 1/N)^{mk}}{1 - (1 - 1/N)^{mk}} \right) \times \left(1 - (1 - 1/N)^{mk}\right)^{m} = 0$$

By dividing by Equation 2, which cannot be zero (unless no keys have been
hashed), and moving the coefficient we obtain

$$\ln\left(1 - (1 - 1/N)^{mk}\right) = \frac{\ln\left((1 - 1/N)^{mk}\right) \times (1 - 1/N)^{mk}}{1 - (1 - 1/N)^{mk}}$$

By substituting x for $(1 - 1/N)^{mk}$ and multiplying both sides by (1 − x)
we obtain $(1 - x) \ln(1 - x) = x \ln x$, which is the same as
$(1 - x)^{(1 - x)} = x^{x}$.
Therefore x = 1 − x, so the error rate will be minimal when
$(1 - 1/N)^{mk} = 1/2$.
Therefore optimal filters have half their flag bits set; i.e., the set is composed
of N/2 elements. In an optimal filter $P_{set} = P_{unset} = 1/2$.
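The conclusion that the error rate is minimal when about half of the bits are set can be checked by brute force. The sketch below evaluates Equation 2 for every integer m in a small range; the values N = 1000 and k = 100 are arbitrary assumptions chosen for illustration.

```python
def p_allset(N, m, k):
    """Equation 2: predicted false positive rate."""
    return (1 - (1 - 1 / N) ** (m * k)) ** m


N, k = 1000, 100

# Pick the integer number of hash transforms with the lowest error rate.
best_m = min(range(1, 21), key=lambda m: p_allset(N, m, k))

# Fraction of elements predicted to be unset at that optimum.
x = (1 - 1 / N) ** (best_m * k)

print(best_m, round(x, 3), p_allset(N, best_m, k))
```

Because m must be an integer, the unset fraction at the best m is close to, rather than exactly, 1/2.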

Error Rate
Combining this with Equation 2 we find that the error rate, for optimal Bloom
filters, is $P_{opt}$.

$$P_{opt} = \left(\frac{1}{2}\right)^{m} \tag{4}$$

Rejection Time
In the optimal case a good hash transform will exclude half the elements of the
vector for any key. One hash will eliminate half the elements; all subsequent
hashes will eliminate half the remaining elements. If one hash eliminates half
of the elements then two hashes will eliminate 1/2 + 1/4 of the elements,
etc. Since the pattern is a geometric series which converges to 2 as the number
of hashes grows, on average only 2 hashes are needed to reject candidates [3].
This prediction is consistent with the general case shown in Equation 3.


Governing Equation
The general governing equation (Equation 1, above) is
$P_{unset} = (1 - 1/N)^{mk}$. In the optimal case this becomes
$1/2 = (1 - 1/N)^{mk}$. From this it is trivial to derive Equation 5.

$$-\ln 2 = mk \ln(1 - 1/N) \tag{5}$$

The Taylor expansion of ln(1 + x) is

$$\ln(1 + x) = x - \frac{x^{2}}{2} + \frac{x^{3}}{3} - \frac{x^{4}}{4} + \cdots \tag{6}$$

In the case of Equation 5, x = −1/N. For N ≫ 1, $x^{2} \approx 0$ and
ln(1 − 1/N) ≈ −1/N. It follows that, in the optimal case, the relationship
described by Equation 7 holds.

$$N \approx mk / \ln 2 \approx mk / 0.69 \tag{7}$$
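Equations 4 and 7 together suggest a simple sizing recipe: choose m from the target error rate, then N from m and k. The sketch below applies it to the 35 000-word spell checker mentioned under Growing Sets; the 1% target error rate and the function name are assumptions for illustration.

```python
import math


def size_optimal_filter(k, target_error):
    """Return (m, N) for an optimal filter holding k keys."""
    # Equation 4: error rate is 2**-m, so take the smallest adequate m.
    m = math.ceil(math.log2(1 / target_error))
    # Equation 7: N is approximately m * k / ln 2.
    N = math.ceil(m * k / math.log(2))
    return m, N


m, N = size_optimal_filter(k=35_000, target_error=0.01)
print(m, N, N / 8 / 1024)   # hash transforms, bits, size in KiB
```

Rounding m up means the predicted error rate is somewhat better than the target, at the cost of slightly more space.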

The analytic results are summarized in Table 2.

                      General Case                   Optimal Case∗

Governing equation    P_unset = (1 − 1/N)^(mk)       P_unset = 1/2
                                                     N ≈ mk/ ln 2
False positive rate   (1 − (1 − 1/N)^(mk))^m         2^(−m)
Rejection time        ≤ 1/(1 − f)                    2

Table 2: Summary of Analytic Results

∗ The optimal case holds for static sets with f = 1/2.

4.3 Performance issues

Of course, the choice of hash transforms will have a major impact on the
performance of the Bloom filter. To be useful, Bloom filters require hashing
transforms that will not hash to the same set of addresses. Gremillion [8] and
Mullin [5] found that, when applied to differential files, the error rate was
much higher than the analysis predicted.
To remedy that deficit Ramakrishna [4, 24] developed so-called perfect hash
transforms. Perfect hash transforms are a class of hash functions that completely
avoid collisions for the specific key set for which they were generated.

His tests were simulations of Bloom filters on differential files of user IDs, the
Unix™ word list (/usr/dict/words), and library call numbers. Simulation
studies of Bloom filters are accepted in the literature as an accurate method
for determining test performance [5]. Ramakrishna reported that all the results
were similar and gave details for the file of user IDs. The results were all within
a standard deviation of what the governing equation predicts. Czech et al. [25]
have since devised a fast algorithm to generate minimal perfect hash transforms.
Perfect hash transforms can be used only when the entire membership set
is known a priori. They are therefore not suitable for applications that use
growing sets. For instance, perfect hash transforms are suitable for use with
differential files of bank accounts because all of the possible account numbers
are known in advance. However, a spell checker that allowed users to include
arbitrary words could not expect optimal results from a Bloom filter built with
perfect hash transforms for the words in its list of correct spellings.
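
The dependence of a perfect hash transform on a fixed key set can be illustrated with a deliberately naive Python sketch. It is not the method of Ramakrishna [4, 24] or of Czech et al. [25]; it simply tries seeds for a salted SHA-256 hash until every key of a small static set lands in a distinct slot. The key set, the table size, and the function names are hypothetical.

```python
import hashlib


def slot(key: str, seed: int, table_size: int) -> int:
    """Map a key to a table slot using a seeded SHA-256 digest."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def find_perfect_seed(keys, table_size, max_seed=100_000):
    """Return the first seed mapping every key to a distinct slot, or None."""
    for seed in range(max_seed):
        slots = {slot(key, seed, table_size) for key in keys}
        if len(slots) == len(keys):
            return seed
    return None


# Hypothetical static membership set, known in advance.
keys = ["apple", "banana", "cherry", "damson", "elder"]
table_size = 16
seed = find_perfect_seed(keys, table_size)
print(seed, sorted(slot(key, seed, table_size) for key in keys))
```

The seed found is collision-free only for the keys it was generated for. A key added later may hash to an occupied slot, which is why such transforms are unsuitable for growing sets.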

4.4 Variation on Standard Bloom Filters

If the membership set is known in advance then better performance than with
a standard Bloom filter can be achieved using related techniques. For instance
we can establish an arbitrarily low false positive rate by using perfect hash
functions and limiting the size of the shared hash table. The important step is
to order the elements by their discriminating power and eliminate those with
the lowest power until the desired tradeoff between the size of the table and the
predicted false positive rate is reached or exceeded.

5 Summary

This paper has described Bloom filters and some of their applications, and has
provided an analysis of the general and optimal performance cases. Bloom filters
used purely for probabilistic membership tests never reject a member: the only
errors are false positives (non-members reported as members).

5.1 Optimal Filters

In the optimal case, non-members are detected within two hashes on average. The
optimal case occurs only for sets in which all of the possible keys are known in
advance and in which half of the filter's elements are set. In practice the optimal
case requires the use of perfect hash transforms (as described in Section 4.3).

5.2 Trade-offs in Filter Performance

The error rate can be decreased by increasing the number of hash transforms
and the space allocated to store the table. The analytic performance of Bloom
filters for growing and static sets is given in Equation 3. Formulae that can be
used to tune filters with respect to error rate, filter size, number of keys, and
hash function are summarized in Table 2.

Bloom filters should be considered for programs where an imperfect set
membership test could be helpfully applied to a large data set of unknown
composition. Such programs include spell checkers and those that use data stored in
differential files. The great advantage of Bloom filters over the use of single
hash transforms is their speed and set error rate. Although the method can be
applied to sets of any size, small sets are better dealt with by trees and heaps
that can determine for certain if a key belongs to a set. Other methods are
generally more accurate for sets whose composition is known in advance but
they require more space than Bloom filters.

Acknowledgments This work was improved by Jim Mullin’s comments on
an earlier draft and Ray Spiteri’s help with the derivation of Equation 3.
Conversations with Jason Rouse were most helpful.


References

[1] Burton H. Bloom. Space/time trade-offs in hash coding with allowable
errors. Communications of the ACM, 13(7):422–426, July 1970. URL
<http://doi.acm.org/10.1145/362686.362692>.

[2] Christos Faloutsos. Access methods for text. ACM Computing Surveys,
17(1):49–74, March 1985. URL <http://doi.acm.org/10.1145/4078.4080>.

[3] James K. Mullin and Daniel J. Margoliash. A tale of three spelling checkers.
Software: Practice and Experience, 20(6):625–630, June 1990.

[4] M. V. Ramakrishna. Practical performance of Bloom filters and parallel
free-text searching. Communications of the ACM, 32(10):1237–1239,
October 1989. URL <http://doi.acm.org/10.1145/67933.67941>.

[5] James K. Mullin. A second look at Bloom filters. Communications of
the ACM, 26(8):570–571, August 1983. URL
<http://doi.acm.org/10.1145/358161.358167>.

[6] James K. Mullin. Optimal semijoins for distributed database systems.
IEEE Transactions on Software Engineering, 16(5):558–560, May 1990.
URL <http://ieeexplore.ieee.org/iel1/32/1900/00052778.pdf>.

[7] James K. Mullin. Estimating the size of a relational join. Information
Systems, 18(3):189–196, 1993. ISSN 0306-4379.

[8] L. L. Gremillion. Designing a Bloom filter for differential file access.
Communications of the ACM, 25(9):600–604, September 1982. URL
<http://doi.acm.org/10.1145/358628.358632>.

[9] Andrei Broder and Michael Mitzenmacher. Network applications of Bloom
filters: A survey. URL
<http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf>.
Submitted to Annual Allerton Conference on Communication, Control, and
Computing, 2002.

[10] Wu-Chang Feng, Dilip D. Kandlur, Debanjan Saha, and Kang G. Shin.
Stochastic fair blue: A queue management algorithm for enforcing fairness.
In INFOCOM 2001: Twentieth Annual Joint Conference of the
IEEE Computer and Communications Societies, volume 3, pages 1520–1529.
IEEE, 2001. URL <http://ieeexplore.ieee.org/iel5/7321/19795/00916648.pdf>.

[11] Jia Wang. A survey of web caching schemes for the internet. ACM SIGCOMM
Computer Communication Review, 29(5):36–39, October 1999.
ISSN 0146-4833. URL <http://doi.acm.org/10.1145/505696.505701>.

[12] Alex Rousskov and Duane Wessels. Cache digests. Computer Networks
and ISDN Systems, 30(22–23):2155–2168, April 1998. URL
<http://www.sciencedirect.com/science/article/B6TYT-3VY4SS7-B/1/868712711e204b138cea744eaf0d39a8>.

[13] Sean C. Rhea and John Kubiatowicz. Probabilistic location and routing.
In INFOCOM 2002: Twenty-First Annual Joint Conference of the IEEE
Computer and Communications Societies, volume 3, pages 1248–1257.
IEEE, 23–27 June 2002. URL
<http://ieeexplore.ieee.org/iel5/7943/21923/01019375.pdf>.

[14] Kris Hildrum. The OceanStore project: Project overview. [Webpage],
2002. URL <http://oceanstore.cs.berkeley.edu/info/overview.html>.
Last modified on 07/08/2002.

[15] Mark Bernstein. An apprentice that discovers hypertext links. In N. Streitz,
A. Rizk, and J. André, editors, Hypertext: Concepts, Systems and Applications,
The Cambridge Series on Electronic Publishing, pages 212–223.
INRIA, France, Cambridge University Press, 1990.

[16] Mark Bernstein, Jay David Bolter, Michael Joyce, and Elli Mylonas.
Architectures for volatile hypertext. In Hypertext ’91 Proceedings, pages
243–260, San Antonio, Texas, December 1991. URL
<http://doi.acm.org/10.1145/122974.122999>.

[17] Elli Mylonas and Mark Bernstein. A literary apprentice. In The 19th
International Conference of the Association for Literary and Linguistic
Computing; and the 12th International Conference on Computers and Humanities,
pages 181–186, Christ Church, Oxford, April 1992.

[18] Donald E. Knuth. Sorting and Searching, volume 3 of The Art of Computer
Programming. Addison-Wesley Publishing Company, 1973.

[19] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete
Mathematics: A Foundation for Computer Science, chapter 8.5 Hashing, pages
397–412. Addison-Wesley Publishing Company, 1989. ISBN 0-201-14236-8.

[20] James Blustein. Implementing bit vectors in C. Dr. Dobb’s Journal,
20(233), August 1995. Updated code is available from URL
<http://www.csd.uwo.ca/~jamie/BitVectors/>.

[21] Michael Mitzenmacher. Compressed Bloom filters. In Proceedings of the
Twentieth Annual ACM Symposium on Principles of Distributed Computing,
pages 144–150, Newport, RI, USA, 2001. ACM SIGACT, ACM
SIGOPS. ISBN 1-58113-383-9. URL
<http://portal.acm.org/citation.cfm?id=384004>.

[22] Thomas Chan, Dan Margoliash, and Howard Shidlowsky. A word legality
module using a Bloom filter and suffix simulation. Unpublished work
obtained from James K. Mullin (Dept. of Computer Science, Univ. of Western
Ontario, London, Ontario N6A 5B7), 1983.

[23] Daniel J. Margoliash. CSpell: A Bloom filter-based spelling correction
program. Master’s thesis, University of Western Ontario, 1987.

[24] M. V. Ramakrishna. Perfect hashing for external files. Technical Report
CS-86-25, University of Waterloo Computer Science Department, June 1986.

[25] Zbigniew J. Czech, George Havas, and Bohdan S. Majewski. An
optimal algorithm for generating minimal perfect hash functions.
Information Processing Letters, 43(5):257–264, October 1992. URL
<http://www.sciencedirect.com/science/article/B6V0F-45FKW7V-68/1/95400691dc2f8738310f31bc965ca964>.


A Miscellaneous Methods

Procedure 2 (Set)
Set(cell)
    ▷ essentially a logical or
    cell ← 1
end.

Procedure 3 (Clear)
Clear(cell)
    cell ← 0
end.

Procedure 4 (IsSet)
IsSet(cell) → Boolean
    return(cell = 1)
end.

Procedure 5 (Initialize)
Initialize(Table)
    for i ← 1 … N do
        Clear(Table[i])
    endfor
end.


Procedure 6 (Insert)
Insert(Table, Key)
    for i ← 1 … m do
        ▷ h_i is the ith hash transform
        Set(Table[h_i(Key)])
    endfor
end.
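
The procedures in this appendix can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the class and method names are hypothetical, each cell occupies a whole byte instead of a single bit, and the hash transforms h_i are stood in for by a SHA-256 digest salted with the index i. A membership test that stops at the first unset cell is included so that the filter can be exercised.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_cells: int, n_hashes: int):
        self.n_cells = n_cells            # N in the text
        self.n_hashes = n_hashes          # m in the text
        self.table = bytearray(n_cells)   # Initialize: every cell cleared

    def _hash(self, i: int, key: str) -> int:
        """Stand-in for h_i: SHA-256 of the key salted with the index i."""
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.n_cells

    def insert(self, key: str) -> None:
        """Insert: set the cell addressed by each of the m hash transforms."""
        for i in range(self.n_hashes):
            self.table[self._hash(i, key)] = 1   # Set(cell)

    def is_member(self, key: str) -> bool:
        """True if every addressed cell is set; all() stops at the first unset cell."""
        return all(self.table[self._hash(i, key)]
                   for i in range(self.n_hashes))


# Hypothetical usage with m = 4 hash transforms and N = 64 cells.
bf = BloomFilter(n_cells=64, n_hashes=4)
for word in ("potato", "carrot", "onion"):
    bf.insert(word)
print(bf.is_member("potato"), sum(bf.table))
```

A production filter would pack the cells into a bit vector, as in [20], to obtain the space savings that motivate the technique.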

B Derivation of Rejection Time Inequality

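\begin{align*}
\sum_{h=1}^{m} h \times f^{h-1} \times (1 - f)
  &= (1 - f) \sum_{h=1}^{m} h \times f^{h-1} \\
  &\le (1 - f) \sum_{h=1}^{\infty} h \times f^{h-1} \\
  &= (1 - f) \sum_{h=1}^{\infty} \frac{d}{df} f^{h} \\
  &= (1 - f) \frac{d}{df} \sum_{h=1}^{\infty} f^{h}, && \text{where } |f| < 1 \\
  &= (1 - f) \frac{d}{df} \frac{1}{1 - f}, && \text{where } |f| < 1 \\
  &= (1 - f) \frac{1}{(1 - f)^{2}}, && \text{where } |f| < 1 \\
  &= \frac{1}{1 - f}, && \text{where } |f| < 1
\end{align*}

Note that $\sum_{h=1}^{\infty} f^{h}$ is the sum of a geometric series.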
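
As a supplementary illustration (not part of the original derivation), the finite sum also has a closed form, obtained by differentiating the finite geometric series $\sum_{h=1}^{m} f^{h} = (f - f^{m+1})/(1 - f)$. It shows how far the truncated sum falls below the bound:

\[
(1 - f) \sum_{h=1}^{m} h f^{h-1}
  = \frac{1 - (m + 1) f^{m} + m f^{m+1}}{1 - f}
  = \frac{1}{1 - f} - \frac{(m + 1) f^{m} - m f^{m+1}}{1 - f}
\]

For example, with $f = 1/2$ and $m = 3$ the left-hand side is $\frac{1}{2}\left(1 + 1 + \frac{3}{4}\right) = 1.375$, and the right-hand side is $2 - \left(\frac{4}{8} - \frac{3}{16}\right) / \frac{1}{2} = 2 - 0.625 = 1.375$, which is below the bound of $1/(1 - f) = 2$.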

