Bloom Filters - Short Tutorial: Web Cache Sharing ( (3) ) Collaborating Web Caches Use Bloom Filters (Dubbed
Matei Ripeanu, Adriana Iamnitchi

1. Introduction

Bloom filters [2] are compact data structures for probabilistic representation of a set in order to support membership queries (i.e., queries that ask: "Is element X in set Y?"). This compact representation is the payoff for allowing a small rate of false positives in membership queries; that is, queries might incorrectly recognize an element as a member of the set.

We succinctly present the uses of Bloom filters to date in the next section. In Section 3 we describe Bloom filters in detail, and in Section 4 we give a hopefully precise picture of the space/computing time/error rate tradeoffs.

2. Usage

Since their introduction in [2], Bloom filters have seen various uses:

Web cache sharing. [...] broadcasts its summary to all other members of the distributed cache. Using all summaries received, a cache node has a (partially outdated, partially wrong) global image of the set of files stored in the aggregated cache. The Squid Web Proxy Cache [1] uses "Cache Digests" based on a similar idea.

Query filtering and routing. [...] Service [6], a subsystem of the Ninja project [5], organizes service providers in a hierarchy. Bloom filters are used as summaries for the set of services offered by a node. Summaries are sent upwards in the hierarchy and aggregated. A query is a description of a specific service, also represented as a Bloom filter. Thus, when a member node of the hierarchy generates or receives a query, it has enough information at hand to decide where to forward the query: downward, to one of its descendants (if a solution to the query is present in the filter for the corresponding node), or upward, toward its parent (otherwise).

The OceanStore [7] replica location service uses a two-tiered approach: first it initiates an inexpensive, probabilistic search (based on Bloom filters, similar to Ninja) to try to find a replica. If this fails, the search falls back on an (expensive) deterministic algorithm (based on the Plaxton replica location algorithm). Alas, their description of the probabilistic search algorithm is laconic. (An unpublished text [11] from members of the same group gives some more details. But this does not seem to work well when resources are dynamic.)

Compact representation of a differential file. [...] periodically (e.g., at midnight) or when the differential file grows above a certain threshold. However, in order to preserve integrity, each reference or query to the database has to access the differential file to see if a particular record is scheduled to be updated. To speed up this process, with little memory and computational overhead, the differential file is represented as a Bloom filter.

Free text searching. [...], the set of words that appear in a text is succinctly represented using a Bloom filter.

3. Constructing Bloom Filters

Consider a set A = {a_1, a_2, ..., a_n} of n elements. Bloom filters describe membership information of A using a bit vector of length m. For this, k hash functions h_1, h_2, ..., h_k, with h_i : X -> {1..m}, are used as described below.

The following procedure builds an m-bit Bloom filter corresponding to a set A and using the hash functions h_1, h_2, ..., h_k:

Procedure BloomFilter(set A, hash_functions, integer m) returns filter
    filter = allocate m bits initialized to 0
    foreach a_i in A:
        foreach hash function h_j:
            filter[h_j(a_i)] = 1
        end foreach
    end foreach
    return filter

Therefore, if a_i is a member of the set A, all bits of the resulting Bloom filter that correspond to the hashed values of a_i are set to 1. Testing for membership of an element elm is equivalent to testing that all the corresponding bits of the filter are set:

Procedure MembershipTest(elm, filter, hash_functions) returns yes/no
    foreach hash function h_j:
        if filter[h_j(elm)] != 1 return No
    end foreach
    return Yes

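The two procedures above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the class name BloomFilter, the helper build_bloom_filter, and the way the k hash functions are derived (SHA-256 over an index prefix plus the item, reduced modulo m) are choices made for this example and do not come from the tutorial. Bit positions run from 0 to m-1 here rather than from 1 to m.

```python
import hashlib


class BloomFilter:
    """Bit-vector Bloom filter with k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m                            # number of bits in the filter
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # m bits, all initialized to 0

    def _positions(self, item):
        # Assumed hash family (not from the tutorial):
        # h_j(item) = SHA-256(j || item) mod m, for j = 0 .. k-1.
        data = str(item).encode("utf-8")
        for j in range(self.k):
            digest = hashlib.sha256(j.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set the bit selected by each hash function.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # "Yes" only if every selected bit is set; false positives are possible.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_bloom_filter(items, m, k):
    """Counterpart of Procedure BloomFilter: insert every element of a set."""
    bf = BloomFilter(m, k)
    for item in items:
        bf.add(item)
    return bf


if __name__ == "__main__":
    bf = build_bloom_filter(["alpha", "beta", "gamma"], m=1024, k=4)
    print("alpha" in bf)   # inserted elements are always reported as present
    print("delta" in bf)   # most likely False, but a false positive is possible
```

An inserted element is never reported as absent; an element that was not inserted is usually, but not always, reported as absent.
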
Nice features: filters can be built incrementally. As new elements are added to a set, the corresponding positions are computed through the hash functions and the bits are set in the filter. Moreover, the filter expressing the union of two sets is simply computed as the bit-wise OR applied over the two corresponding Bloom filters.

4. Bloom Filters: the Math

(This follows the description in [3].) One prominent feature of Bloom filters is that there is a clear tradeoff between the size of the filter and the rate of false positives. Observe that after inserting n keys into a filter of size m using k hash functions, the probability that a particular bit is still 0 is:

$$ p_0 = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} \qquad (1) $$

(Note that we assume perfect hash functions that spread the elements of A evenly throughout the space {1..m}. In practice, good results have been achieved using MD5 and other hash functions [10].)

Hence, the probability of a false positive (the probability that all k bits have been previously set) is:

$$ p_{err} = (1 - p_0)^k = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k \qquad (2) $$

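As a numeric illustration of formula (2), the sketch below evaluates both the exact expression and its exponential approximation. The function names and the sample values of n, m and k are arbitrary choices for this example, not values taken from the tutorial.

```python
import math


def false_positive_exact(n, m, k):
    """p_err = (1 - (1 - 1/m)^(k*n))^k, the exact form in formula (2)."""
    p0 = (1.0 - 1.0 / m) ** (k * n)   # probability that a given bit is still 0
    return (1.0 - p0) ** k


def false_positive_approx(n, m, k):
    """p_err ~ (1 - e^(-k*n/m))^k, the approximation in formula (2)."""
    return (1.0 - math.exp(-k * n / m)) ** k


if __name__ == "__main__":
    n, m, k = 1000, 8000, 4           # 8 bits per entry, 4 hash functions
    print(false_positive_exact(n, m, k))
    print(false_positive_approx(n, m, k))
```

For these sample values both expressions come out at roughly 0.024, which suggests that the approximation is adequate once m is reasonably large.
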
[...] number of hash functions are used. The reason is that the computational overhead of each additional hash function is constant, while the incremental benefit of adding a new hash function decreases after a certain threshold (see Figure 1).

[Figure 1 plot: false positives rate (log scale) versus number of hash functions]

Figure 1: False positive rate as a function of the number of hash functions used. The size of the Bloom filter is 32 bits per entry (m/n = 32). In this case using 22 hash functions minimizes the false positive rate. Note however that adding a hash function does not significantly decrease the error rate when more than 10 hashes are already used.

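The trend described in the Figure 1 caption can be explored with a short sweep over k using the approximation in formula (2). The function names and the search bound k_max are assumptions made for this sketch. The closed form k = (m/n) ln 2 for the minimizing value is a standard result, quoted here only as a cross-check.

```python
import math


def false_positive_rate(bits_per_entry, k):
    """Approximation from formula (2), written in terms of m/n."""
    return (1.0 - math.exp(-k / bits_per_entry)) ** k


def best_k(bits_per_entry, k_max=64):
    """Integer k in 1..k_max that minimizes the approximate error rate."""
    return min(range(1, k_max + 1),
               key=lambda k: false_positive_rate(bits_per_entry, k))


if __name__ == "__main__":
    bits = 32                                   # m/n = 32, as in Figure 1
    for k in (1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31):
        print(k, false_positive_rate(bits, k))
    print("best k:", best_k(bits))              # integer minimizer
    print("(m/n) ln 2:", bits * math.log(2))    # continuous optimum
```

The sweep makes the diminishing returns visible: each added hash function costs the same amount of computation, while the error rate improves less and less as k approaches the minimizing value.
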
Figure 2: Size of the Bloom filter (bits/entry) as a function of the error rate desired. Different lines represent different numbers of hash keys used. Note that, for the error rates considered, using 32 keys does not bring significant benefits over using only 8 keys.

Formula (2) is the base formula for engineering Bloom filters. It allows, for example, computing the minimal memory requirements (filter size) and the number of hash functions given the maximum acceptable false positive rate and the number of elements in the set (as we detail in [...]).

$$ \frac{m}{n} = \frac{-k}{\ln\left(1 - e^{\ln(p_{err})/k}\right)} \qquad (3) $$

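Formula (3) follows from solving the approximation in formula (2) for m/n. A minimal sketch of how it could be used for sizing is given below. The function names and the sample error rates are assumptions made for this example.

```python
import math


def bits_per_entry(p_err, k):
    """Formula (3): m/n = -k / ln(1 - e^(ln(p_err)/k))."""
    return -k / math.log(1.0 - math.exp(math.log(p_err) / k))


def false_positive_rate(bits, k):
    """Approximation from formula (2), used here as a round-trip check."""
    return (1.0 - math.exp(-k / bits)) ** k


if __name__ == "__main__":
    for p_err in (1e-2, 1e-3, 1e-4):
        for k in (4, 8, 32):
            bits = bits_per_entry(p_err, k)
            # Plugging the size back into formula (2) should recover p_err.
            print(p_err, k, round(bits, 2), false_positive_rate(bits, k))
```

For instance, with k = 4 and a target error rate of 1%, the formula gives between 10 and 11 bits per entry.
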
To summarize: Bloom filters are compact data structures for probabilistic representation of a set in order to support membership queries. The main design tradeoffs are the number of hash functions used (driving the computational overhead), the size of the filter, and the error (collision) rate. Formula (2) is the main formula to tune parameters according to application requirements.

5. Compressed Bloom Filters

Some applications that use Bloom filters need to communicate these filters across the network. In this case, besides the three performance metrics we have seen so far, namely (1) the computational overhead to look up a value (related to the number of hash functions used), (2) the size of the filter in memory, and (3) the error rate, a fourth metric can be used: the size of the filter transmitted across the network. M. Mitzenmacher shows in [8] that compressing Bloom filters might lead to significant bandwidth savings at the cost of higher memory requirements (larger uncompressed filters) and some additional computation time to compress the filter that is sent across the network. We do not detail here all theoretical and practical issues analyzed in [8].

6. References