Bloom Filters Short Tutorial

Matei Ripeanu, Adriana Iamnitchi

1. Introduction

Bloom filters [2] are compact data structures for the probabilistic representation of a set in order to support membership queries (i.e., queries that ask: "Is element X in set Y?"). This compact representation is the payoff for allowing a small rate of false positives in membership queries; that is, a query might incorrectly recognize an element as a member of the set. We succinctly present the uses of Bloom filters to date in the next section. In Section 3 we describe Bloom filters in detail, and in Section 4 we give a hopefully precise picture of the space/computing time/error rate tradeoffs.

2. Usage

Since their introduction in [2], Bloom filters have seen various uses:

Web cache sharing ([3]). Collaborating Web caches use Bloom filters (dubbed "cache summaries") as compact representations of the local set of cached files. Each cache periodically broadcasts its summary to all other members of the distributed cache. Using all summaries received, a cache node has a (partially outdated, partially wrong) global image of the set of files stored in the aggregated cache. The Squid Web Proxy Cache [1] uses "Cache Digests" based on a similar idea.

Query filtering and routing ([4, 6, 7]). The Secure wide-area Discovery Service [6], a subsystem of the Ninja project [5], organizes service providers in a hierarchy. Bloom filters are used as summaries for the set of services offered by a node. Summaries are sent upwards in the hierarchy and aggregated. A query is a description of a specific service, also represented as a Bloom filter. Thus, when a member node of the hierarchy generates or receives a query, it has enough information at hand to decide where to forward the query: downward, to one of its descendants (if a solution to the query is present in the filter for the corresponding node), or upward, toward its parent (otherwise).

The OceanStore [7] replica location service uses a two-tiered approach: first it initiates an inexpensive, probabilistic search (based on Bloom filters, similar to Ninja) to try to find a replica. If this fails, the search falls back on an (expensive) deterministic algorithm (based on the Plaxton replica location algorithm). Alas, their description of the probabilistic search algorithm is laconic. (An unpublished text [11] from members of the same group gives some more details, but this does not seem to work well when resources are dynamic.)

Compact representation of a differential file ([9]). A differential file contains a batch of database records to be updated. For performance reasons the database is updated only periodically (e.g., at midnight) or when the differential file grows above a certain threshold. However, in order to preserve integrity, each reference/query to the database has to access the differential file to see if a particular record is scheduled to be updated. To speed up this process, with little memory and computational overhead, the differential file is represented as a Bloom filter.

Free text searching ([10]). Basically, the set of words that appear in a text is succinctly represented using a Bloom filter.

3. Constructing Bloom Filters

Consider a set $A = \{a_1, a_2, \ldots, a_n\}$ of n elements. Bloom filters describe membership information of A using a bit vector of length m. For this, k hash functions, $h_1, h_2, \ldots, h_k$ with $h_i: X \to \{1..m\}$, are used as described below.

The following procedure builds an m-bit Bloom filter, corresponding to a set A and using the hash functions $h_1, h_2, \ldots, h_k$:

Procedure BloomFilter(set A, hash_functions, integer m) returns filter
    filter = allocate m bits initialized to 0
    foreach ai in A:
        foreach hash function hj:
            filter[hj(ai)] = 1
        end foreach
    end foreach
    return filter

Therefore, if ai is a member of the set A, all bits in the resulting Bloom filter that correspond to the hashed values of ai are set to 1. Testing for membership of an element elm is equivalent to testing that all corresponding bits of the filter are set:

Procedure MembershipTest(elm, filter, hash_functions) returns yes/no
    foreach hash function hj:
        if filter[hj(elm)] != 1 return No
    end foreach
    return Yes
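The two procedures above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `BloomFilter`, the helper `build_filter`, and the choice of deriving the k hash functions by salting SHA-256 with the index j are all assumptions made for the example. Bit positions are 0-based (0..m-1) rather than {1..m} as in the text. The `union` method illustrates the bit-wise OR property discussed in the next paragraph.

```python
import hashlib


class BloomFilter:
    """Bit-vector Bloom filter with k hash functions over m bits (sketch)."""

    def __init__(self, m, k):
        self.m = m                            # filter length in bits
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # m bits, all initialized to 0

    def _positions(self, item):
        # Assumption: h_j(item) is SHA-256 of the item salted with j,
        # reduced modulo m. Any family of well-spread hashes would do.
        data = str(item).encode("utf-8")
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:".encode("utf-8") + data).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        # Procedure BloomFilter, inner loop: set filter[h_j(item)] = 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Procedure MembershipTest: "No" as soon as one bit is unset.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

    def union(self, other):
        # Filter for the union of two sets: bit-wise OR of both filters.
        if (self.m, self.k) != (other.m, other.k):
            raise ValueError("filters must share m and k")
        result = BloomFilter(self.m, self.k)
        result.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
        return result


def build_filter(items, m, k):
    """Build an m-bit filter for the given items using k hash functions."""
    bloom = BloomFilter(m, k)
    for item in items:
        bloom.add(item)
    return bloom


cache_a = build_filter(["a.html", "b.html"], m=1024, k=4)
cache_b = build_filter(["c.html"], m=1024, k=4)
print("a.html" in cache_a)                  # inserted elements always test Yes
print("c.html" in cache_a.union(cache_b))   # union contains both sets
```

An inserted element can never test "No"; an element that was never inserted usually tests "No" but may occasionally test "Yes", which is the false positive analyzed in Section 4.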

Nice features: filters can be built incrementally: as new elements are added to a set, the corresponding positions are computed through the hash functions and bits are set in the filter. Moreover, the filter expressing the union of two sets is simply computed as the bit-wise OR applied over the two corresponding Bloom filters.

4. Bloom Filters: the Math (this follows the description in [3])

One prominent feature of Bloom filters is that there is a clear tradeoff between the size of the filter and the rate of false positives. Observe that after inserting n keys into a filter of size m using k hash functions, the probability that a particular bit is still 0 is:
$$p_0 = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} \qquad (1)$$

(Note that we assume perfect hash functions that spread the elements of A evenly throughout the space {1..m}. In practice, good results have been achieved using MD5 and other hash functions [10].)
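As a rough sanity check of the probability that a bit is still 0, the sketch below sets kn uniformly random bit positions (standing in for the "perfect hash functions" assumption) and compares the fraction of bits left at 0 with the exact value $(1 - 1/m)^{kn}$ and its exponential approximation $e^{-kn/m}$. The function name and the parameter values are illustrative assumptions.

```python
import math
import random


def fraction_of_zero_bits(m, n, k, seed=0):
    """Set n*k uniformly random bits in an m-bit vector; return the share still 0."""
    rng = random.Random(seed)      # seeded so the run is repeatable
    bits = [0] * m
    for _ in range(n * k):         # each inserted key sets k positions
        bits[rng.randrange(m)] = 1
    return bits.count(0) / m


m, n, k = 10_000, 1_000, 5
simulated = fraction_of_zero_bits(m, n, k)
exact = (1 - 1 / m) ** (k * n)
approx = math.exp(-k * n / m)
print(f"simulated={simulated:.4f} exact={exact:.4f} approx={approx:.4f}")
```

For these parameters all three values should land close to 0.61; the simulated figure varies slightly with the seed.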

Hence, the probability of a false positive (the probability that all k bits have been previously set) is:

$$p_{err} = (1 - p_0)^k = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k \qquad (2)$$
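Formula (2) translates directly into a small helper. The sketch below is illustrative only; the function name `false_positive_rate` and the `exact` switch are assumptions introduced here, with `exact=True` using $(1 - 1/m)^{kn}$ and the default using the exponential approximation.

```python
import math


def false_positive_rate(m, n, k, exact=False):
    """False positive probability for n keys, m bits, k hash functions, per (2)."""
    if exact:
        p_zero = (1 - 1 / m) ** (k * n)    # exact probability a bit is still 0
    else:
        p_zero = math.exp(-k * n / m)      # exponential approximation
    return (1 - p_zero) ** k               # all k probed bits happen to be set


# Example: 8 bits per entry and 4 hash functions.
print(false_positive_rate(m=8_000, n=1_000, k=4))
print(false_positive_rate(m=8_000, n=1_000, k=4, exact=True))
```

With 8 bits per entry and 4 hash functions, both variants give a rate of roughly 2.4%, which shows how little the approximation costs for filters of realistic size.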

In (2), $p_{err}$ is minimized for $k = \frac{m}{n} \ln 2$ hash functions. In practice, however, only a small number of hash functions are used. The reason is that the computational overhead of each additional hash function is constant, while the incremental benefit of adding a new hash function decreases after a certain threshold (see Figure 1).
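The optimal number of hash functions can be computed as sketched below. Since $\frac{m}{n} \ln 2$ is rarely an integer, this hypothetical helper (`optimal_num_hashes` is a name chosen for the example) evaluates the approximate false positive rate from (2) at the two neighboring integers and keeps the better one.

```python
import math


def optimal_num_hashes(bits_per_entry):
    """Integer k minimizing (1 - e^(-k/(m/n)))^k for a given m/n."""
    k_real = bits_per_entry * math.log(2)          # continuous optimum
    candidates = {max(1, math.floor(k_real)), max(1, math.ceil(k_real))}

    def rate(k):
        return (1 - math.exp(-k / bits_per_entry)) ** k

    return min(candidates, key=rate)


print(optimal_num_hashes(32))   # m/n = 32, the setting used in Figure 1
```

For m/n = 32 the continuous optimum is about 22.2, and the helper returns 22, in line with the minimum reported for Figure 1.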
[Figure 1 plot: false positive rate (log scale) versus number of hash functions]

Figure 1: False positive rate as a function of the number of hash functions used. The size of the Bloom filter is 32 bits per entry (m/n = 32). In this case using 22 hash functions minimizes the false positive rate. Note, however, that adding a hash function does not significantly decrease the error rate when more than 10 hashes are already used.

[Figure 2 plot: bits per entry versus error rate (log scale), one line each for k = 2, 4, 8, 16, 32]

Figure 2: Size of Bloom filter (bits/entry) as a function of the error rate desired. Different lines represent different numbers of hash keys used. Note that, for the error rates considered, using 32 keys does not bring significant benefits over using only 8 keys.

(2) is the base formula for engineering Bloom filters. It allows, for example, computing the minimal memory requirements (filter size) and number of hash functions given the maximum acceptable false positive rate and the number of elements in the set. Solving (2) for the filter size gives:

$$\frac{m}{n} = \frac{-k}{\ln\left(1 - e^{\ln(p_{err})/k}\right)} \quad \text{(bits per entry)} \qquad (3)$$
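Formula (3) can be used for sizing as in the sketch below, which computes the bits per entry needed for a target error rate and a chosen number of hash functions. The function name `bits_per_entry` and the example values are assumptions for illustration; the result inherits the exponential approximation used in (2).

```python
import math


def bits_per_entry(p_err, k):
    """Filter size m/n needed to reach error rate p_err with k hashes, per (3)."""
    if not 0 < p_err < 1:
        raise ValueError("p_err must be strictly between 0 and 1")
    return -k / math.log(1 - math.exp(math.log(p_err) / k))


# Example: a 1% false positive rate with 7 hash functions.
ratio = bits_per_entry(0.01, 7)
n = 1_000_000                      # hypothetical number of elements
print(f"{ratio:.2f} bits per entry, {math.ceil(ratio * n)} bits in total")
```

Plugging the returned ratio back into (2) reproduces the requested error rate, which is a convenient way to check a sizing calculation.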

To summarize: Bloom filters are compact data structures for the probabilistic representation of a set in order to support membership queries. The main design tradeoffs are the number of hash functions used (driving the computational overhead), the size of the filter, and the error (collision) rate. Formula (2) is the main formula to tune parameters according to application requirements.

5. Compressed Bloom Filters

Some applications that use Bloom filters need to communicate these filters across the network. In this case, besides the three performance metrics we have seen so far: (1) the computational overhead to look up a value (related to the number of hash functions used), (2) the size of the filter in memory, and (3) the error rate, a fourth metric can be used: the size of the filter transmitted across the network. M. Mitzenmacher shows in [8] that compressing Bloom filters might lead to significant bandwidth savings at the cost of higher memory requirements (larger uncompressed filters) and some additional computation time to compress the filter that is sent across the network. We do not detail here all theoretical and practical issues analyzed in [8].

6. References
