CS-86-14
Robin Hood Hashing

Pedro Celis
Data Structuring Group
Department of Computer Science, University of Waterloo
Waterloo, Ontario, N2L 3G1

ABSTRACT

This thesis deals with hash tables in which conflicts are resolved by open addressing. The initial contribution is a very simple insertion procedure which, in comparison to the standard algorithm, has the effect of dramatically reducing the variance of the number of probes required for a search. This leads to a new search algorithm which requires less than 2.6 probes on average to perform a successful search, even when the table is nearly full. Unsuccessful searches require only O(ln n) probes. Finally, an extension to these methods yields a new, simple way of performing deletions and subsequent insertions. Experimental results strongly indicate little degeneration in performance for both successful and unsuccessful searches.

Acknowledgements

I am deeply grateful to my supervisor, Professor J. Ian Munro, for introducing me to the areas of analysis of algorithms and computational complexity, and for his encouragement and friendship throughout the course of my graduate studies. Very special thanks are also due to Professor Per-Åke Larson, also my supervisor, for his guidance, constant support, and friendship. They both suggested the thesis topic, spent a great deal of time with me, and read and reread all my work. Their ideas are present throughout this thesis.

Thanks are also due to the other members of the examining committee, Professors Gaston H. Gonnet, Ian P. Goulden, Jeffrey S. Vitter, and Derick Wood, for their helpful criticism. I also thank the Natural Sciences and Engineering Research Council of Canada for its financial support.

On a personal level, my deepest gratitude is to my wife Laura Eugenia, for her love and continuing support, without which my graduate studies could not have been accomplished.
We would like to thank our families in Mexico for their constant encouragement, and our daughter Laura Elisa and our many friends in Canada for helping make our stay here a wonderful experience.

Contents

1 Overview of Hashing
  1.1 Collision Resolution
  1.2 Open Addressing
  1.3 Comparison of Reordering Schemes
    1.3.1 Ordered Hash Tables
    1.3.2 Brent's Method
    1.3.3 Binary Tree Hashing
    1.3.4 Optimal and Min-Max Hashing
  1.4 Summary
  1.5 Thesis Outline
2 The Robin Hood Heuristic
  2.1 The Robin Hood Approach to Hashing
  2.2 A Family of Random Probing Schemes
  2.3 A Single Final Table
  2.4 The Expected Worst Case for Robin Hood Hashing
3 The Distribution of psl
  3.1 The Asymptotic Model
  3.2 The Distribution
4 New Search Algorithms
  4.1 Speeding up Searching
  4.2 Organ-Pipe Searching
  4.3 Smart Searching
  4.4 A Smart findposition
  4.5 Summary
5 Simulation Results
  5.2 Results for Robin Hood Hashing
  5.3 The Standard Method
  5.4 Brent's Method
6 Deletions
  6.1 Deletions in Hashing with Open Addressing
  6.2 Deletions in Robin Hood Hashing
  6.3 Simulation Results
  6.4 Summary
7 Conclusions and Further Research
  7.1 Conclusions
  7.2 Further Research

List of Tables

3.1 Variance at various load factors
3.2 Expected value and variance of psl for α close to 1
5.1 Robin Hood: number of probes to insert (E[psl])
5.2 Robin Hood: variance of probe sequence length (V[psl])
5.3 Organ-Pipe Search: number of probes
5.4 Smart Search: number of probes
5.5 Organ-Pipe: average time (msecs) to insert a record
5.6 Organ-Pipe: average time (msecs) for a successful search
5.7 Smart Searching: average time (msecs) to insert a record
5.8 Smart Searching: average time (msecs) for a successful search
5.9 Robin Hood: longest probe sequence length for a successful search
5.10 Standard method: average number of probes to insert or search
5.11 Standard method: average number of msecs to insert a record
5.12 Standard method: average number of msecs to search
5.13 Standard method: longest probe sequence length
5.14 Brent's method: average number of probes to insert
5.15 Brent's method: average number of probes to search
5.16 Brent's method: average number of msecs to insert
5.17 Brent's method: average number of msecs to search
5.18 Brent's method: longest probe sequence length
7.1 Comparison of hashing schemes for full tables
7.2 Comparison of hashing schemes for nonfull tables

List of Figures

1.1 Standard insertion algorithm
1.2 Standard search algorithm
1.3 Sample insertion in Brent's method
1.4 Sample insertion using Binary Tree Hashing
2.1 Robin Hood insertion algorithm
4.1 Probability distribution of psl for a nearly full table
4.2 Expected search cost for Organ-Pipe Searching
4.3 The Organ-Pipe Search Heuristic
4.4 Robin Hood insertion keeping counters
4.5 The Smart Searching Heuristic
4.6 Expected search cost for Smart Searching
5.1 Longest probe sequence length for Robin Hood Hashing
6.1 Robin Hood insertion algorithm when deletions may have occurred
6.2 Robin Hood insertion keeping counters when deletions may have occurred
6.3 Average cost of successful searches after deletions
6.4 Average cost of unsuccessful searches after deletions
6.5 Average probe position above minimum after deletions

Chapter 1
Overview of Hashing

One of the most natural and indeed important tasks in programming is the implementation of a data structure servicing the operations of insert, delete, and find (also called member). Such a structure is often called a dictionary. A hash table provides a convenient way to implement a dictionary. Ideally, the purpose of a hashing scheme is to be able to determine, solely from the identification of a record (called the record key), the exact location where the record is stored.
Given a record to be inserted or located, a key-to-address transformation is performed using a hash function h(k) : K → {0, …, n−1}, which takes as an argument a key k in the specified universe K and returns an integer h(k) between 0 and n−1, where n is the size of the table. The record is then inserted in the table entry specified by h(k). This causes no problems until a record with key k′ has to be inserted and location h(k′) is already occupied. In this case we say a collision has occurred. Handling collisions is the central issue in hashing and the subject of this thesis.

1.1 Collision Resolution

Collisions are almost certain to occur even if the table is sparsely populated. The famous "birthday paradox" (see for example [FEL68]) asserts that among 23 or more people the probability that at least 2 of them share the same birthday exceeds 1/2. In other words, if we select a random function that maps 23 records into a table of size 365, the probability that no two keys map into the same location is only 0.4927. In general, a hash table of size n is likely (probability > 1/2) to have at least one collision by the time it contains about √(2n) elements.

There are two popular ways of handling collisions: chaining and open addressing. The idea of chaining is to keep, for each location, a linked list of the records that hash to that location. This implies that each entry in the table must have enough space to contain a record and a link field. There are a number of interesting tradeoffs and techniques in connection with chaining. Our interest, however, lies in an approach which calls for no additional storage, namely open addressing.

1.2 Open Addressing

Open addressing seems to have first appeared in the literature in [PET57]. The basic idea is to do away with the links entirely and to insert by probing the table in a systematic way.
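The birthday-paradox figure quoted in Section 1.1 is easy to verify numerically; a short sketch:

```python
def prob_no_collision(keys, table_size):
    """Probability that `keys` records hashed uniformly at random into
    `table_size` locations all land in distinct locations."""
    p = 1.0
    for i in range(keys):
        p *= (table_size - i) / table_size   # i-th key must avoid i taken slots
    return p

print(round(prob_no_collision(23, 365), 4))   # 0.4927
```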
When a collision occurs, one of the colliding records is selected to keep the table location, while the other one continues probing until inserted. The sequence of table entries to be inspected when inserting or searching for a record is called the probe sequence. We can augment the hash function with another parameter, the probe position or try number, and use it to generate the probe sequence for a record. Thus the hash function becomes h(k, i) : K × {1, …, ∞} → {0, …, n−1}.

The simplest open addressing hashing scheme, known as linear probing, uses the hash function h(k, i) = (h1(k) + i − 1) mod n, where h1(k) is an auxiliary hash function. Another open addressing method, called double hashing, uses two independent auxiliary hash functions h1(k) and h2(k) to compute h(k, i) = (h1(k) + (i − 1)·h2(k)) mod n, where h2(k) is relatively prime to n. Double hashing performs much better than linear probing for high load factors because it reduces the probability of two colliding records having the same remaining probe sequence.

Two other open addressing schemes frequently mentioned in the literature and used as models for analysis are uniform hashing, where the hash function provides a random permutation of the numbers {0, …, n−1}, and random probing, where h(k, i) is simply a number chosen at random from {0, …, n−1}. The difference between these two schemes is that random probing is memoryless, meaning that a location may be probed several times before some other location is probed for the first time. Uniform hashing is conjectured to be optimal in the sense that it minimizes the expected number of collisions during the insertion process [ULL72]; the conjecture has been proved for the asymptotic case [YAO85]. Random probing is simpler to analyze and asymptotically has the same performance as uniform hashing for nonfull tables under most conflict resolution schemes.
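The two probe functions just defined can be written down directly. In this sketch `h1` and `h2` stand for precomputed auxiliary hash values (the concrete numbers are invented for illustration):

```python
def linear_probe(h1, i, n):
    """h(k, i) = (h1(k) + i - 1) mod n, with i the 1-based try number."""
    return (h1 + i - 1) % n

def double_hash_probe(h1, h2, i, n):
    """h(k, i) = (h1(k) + (i - 1) * h2(k)) mod n; h2 relatively prime to n
    guarantees the sequence visits every table location."""
    return (h1 + (i - 1) * h2) % n

n = 11   # a prime table size makes every h2 in 1..n-1 relatively prime to n
print([linear_probe(7, i, n) for i in range(1, 6)])          # [7, 8, 9, 10, 0]
print([double_hash_probe(7, 3, i, n) for i in range(1, 6)])  # [7, 10, 2, 5, 8]
```

Note that the double-hashing sequence with h2 = 3 steps through all 11 locations before repeating, which is exactly why h2 must be relatively prime to n.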
Random probing and uniform hashing are not usually implemented, since empirical evidence shows that their performance is close to that of double hashing, which is less costly to implement. Their interest lies in the fact that they are simpler to analyze and approximate closely the performance of double hashing.

Depending on the ratio of the link field size to the record size, open addressing can yield better performance than chaining if the space allocated to the links and overflow records is used to increase the table size¹. The standard search algorithm is to probe locations h(k, i), i = 1, 2, …, in order until the record is found. The standard insertion algorithm is to probe locations h(k, i), i = 1, 2, … until an empty location is found; the new record is placed in that location. Figures 1.1 and 1.2 show these algorithms. Initially m = longestprobe = 0, where m is the number of records in the table and longestprobe is the longest probe sequence length used by any one of the records stored in the table. The table is filled initially by records having the special key value empty. The problem of deletions is addressed in Chapter 6.

1.3 Comparison of Reordering Schemes

This section reviews several more sophisticated schemes for creating a hash table. The key notion is that records already in the table may be moved as a new one is inserted. Such an insertion algorithm we call a reordering scheme. A reordering scheme can be used with any open addressing hashing method, but the performance measures presented below are for either random probing or uniform hashing. As we have already noted, the performance of random probing is similar to that of uniform hashing and double hashing.

¹But see [KNU73], sec. 6.4, exercise 18, p. 543, for a way of reducing this, and the literature cited there for a discussion.
When comparing hashing schemes, we are interested both in the cost of loading the table and in the "quality" of the table produced, that is, both the efficiency and the efficacy of the insertion technique. We will characterize the quality of the hash table by the behavior of the following random variables: the probe sequence length for a key (psl), which is equivalent to the probe position where the key was placed; the longest probe sequence length (lpsl); the unsuccessful probe sequence length (upsl); and the longest unsuccessful probe sequence length (lupsl). We will compare the expected value (denoted by E[·]), and sometimes the variance (denoted by V[·]), of these random variables for both full and nonfull tables. For nonfull tables these expressions are functions of α, the load factor, defined as α = m/n, where m is the number of records in the table and n is its size. Several analyses of hashing schemes, including the one we derive here, have been performed for infinite nonfull tables with load factor α, where α < 1 − ε, ε > 0. Throughout the thesis we refer to these tables as α-full tables.

A very important but often neglected performance measure of a hash table is the longest probe sequence length (lpsl), since it provides a bound on both successful and unsuccessful searches. This value can be used to limit the cost of unsuccessful searches in any open addressing hashing scheme, as was done in Figure 1.2. This elegant but sadly underutilized idea is due to Lyon [LYO78].

For standard uniform hashing with no reordering the following equations can be established [PET57,GON81,GON84]: for a nonfull table
table : array [0..n-1] of RECORD  { all empty }
n (table size), m (records inserted), longestprobe (initially 0) : integer

function insert(Record)
    k := key(Record)
    probeposition := 1
    location := h(k, probeposition)
    while table[location] ≠ empty do
        probeposition := probeposition + 1
        location := h(k, probeposition)
    end while
    longestprobe := max( longestprobe, probeposition )
    table[location] := Record
    m := m + 1
    return(location)
end function insert

Figure 1.1: Standard insertion algorithm

function search(k)
    probeposition := 1
    location := h(k, probeposition)
    while probeposition <= longestprobe and table[location] ≠ empty do
        if key(table[location]) = k then return( location )
        probeposition := probeposition + 1
        location := h(k, probeposition)
    end while
    return(FAIL)   { unsuccessful search }
end function search

Figure 1.2: Standard search algorithm

E[psl] ≈ -α⁻¹ ln(1 - α)
V[psl] ≈ 2(1 - α)⁻¹ + α⁻¹ ln(1 - α) - α⁻² ln²(1 - α)
E[lpsl] = -log_{1-α} n - log_{1-α}(-log_{1-α} n) + O(1)

and, if we use Lyon's trick,

E[lupsl] ≤ -log_{1-α} n - log_{1-α}(-log_{1-α} n) + O(1)

and for a full table

E[psl] = ln n + γ - 1 + o(1)
E[lpsl] = 0.6315… × n + O(1)
E[lupsl] = 0.6315… × n + O(1)

where γ = 0.5772156649… is Euler's constant. The total number of probes needed to load a table with this method is simply m·E[psl], which for full tables is equal to n ln n + O(n).

1.3.1 Ordered Hash Tables

One of the first reordering schemes proposed in the literature was ordered hash tables, by Amble & Knuth [AMBKNU74]. The scheme was not intended to improve the retrieval of records, but rather to improve the processing of unsuccessful searches. The idea is very simple: whenever two records collide, the one with the smaller key is stored in the disputed location. When searching in an ordered hash table, the search is unsuccessful whenever a probed location contains a key larger than the search key.
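The standard algorithms of Figures 1.1 and 1.2 translate directly into executable form. In this sketch the probe function `h` is an illustrative double-hashing choice (it assumes a prime table size), not one prescribed by the thesis:

```python
class StandardHashTable:
    EMPTY = None

    def __init__(self, n):
        self.table = [self.EMPTY] * n
        self.n = n
        self.m = 0                 # records inserted
        self.longestprobe = 0      # longest probe sequence of any stored record

    def h(self, key, i):
        # Illustrative double hashing; n prime makes h2 relatively prime to n.
        h2 = 1 + (key % (self.n - 1))
        return (key + (i - 1) * h2) % self.n

    def insert(self, key):
        i = 1
        loc = self.h(key, i)
        while self.table[loc] is not self.EMPTY:   # Figure 1.1: probe until empty
            i += 1
            loc = self.h(key, i)
        self.longestprobe = max(self.longestprobe, i)
        self.table[loc] = key
        self.m += 1
        return loc

    def search(self, key):
        i = 1
        loc = self.h(key, i)
        # Figure 1.2 with Lyon's trick: give up after longestprobe probes.
        while i <= self.longestprobe and self.table[loc] is not self.EMPTY:
            if self.table[loc] == key:
                return loc
            i += 1
            loc = self.h(key, i)
        return None                # unsuccessful search

t = StandardHashTable(11)
for key in (5, 16, 27):            # all three probe location 5 first
    t.insert(key)
print(t.search(16), t.search(99))  # 1 None
```

The `longestprobe` bound is what keeps the unsuccessful search for key 99 to a single probe here.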
As noted, the distribution of the probe position of the keys in the table remains unchanged; only the expected cost of unsuccessful searches is improved, not its expected worst case. For uniform hashing and a nonfull table, using Lyon's trick, we have

E[upsl] = -α⁻¹ ln(1 - α) + O(1/n)
E[lupsl] ≤ -log_{1-α} n - log_{1-α}(-log_{1-α} n) + O(1)

and for a full table

E[upsl] = ln n + γ - 1 + o(1)
E[lupsl] = 0.6315… × n + O(1)

The total number of probes required to load a full ordered hash table is the same as for the standard algorithm, namely n ln n + γn − n + O(1).

1.3.2 Brent's Method

Brent [BRE73] was the first to propose moving stored records to reduce the expected value of the probe sequence length. During an insertion, a sequence of occupied table entries is probed until an empty location is found. Brent's scheme checks whether any of the records in these occupied locations can be displaced to an empty location further along its probe sequence at a smaller cost, and the minimum is taken. Figure 1.3 shows graphically how one such insertion of a record R might occur. In the example, instead of inserting the record R in its fifth choice and increasing the total table cost by 5, record R3 is displaced to its next choice, and then R is placed in its third choice, the place formerly occupied by R3. The increase in the total table cost is thus reduced from 5 to 4.

[Figure 1.3: Sample insertion in Brent's method]

The tables produced by Brent's scheme have a very good E[psl], even when completely filled. For random probing and α-full tables the expected values are

E[psl] ≈ 1 + α/2 + α³/4 + α⁴/15 - α⁵/18 + 2α⁶/15 + 9α⁷/80 - 293α⁸/5670 - 319α⁹/5600 + ⋯

and for full tables

E[psl] ≈ 2.4941…

So, if the standard search algorithm is used, a record can be retrieved in less than 2.5 probes, on average, regardless of the table size.
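Brent's displacement rule can be sketched as follows. This is an illustrative simplification, not the thesis's formulation: it only considers evicting a record to the next empty slot of its own probe sequence, and the probe function `h` and the demo keys are invented for the example.

```python
def brent_insert(table, key, h):
    """Brent-style insertion (simplified sketch): place `key` so the increase
    in total successful-search cost is minimized, possibly moving one stored
    record to the next empty slot of its own probe sequence.
    `table` maps location -> (key, probeposition) or None; `h(key, i)` is the
    1-based probe function (assumed to eventually reach an empty slot)."""
    # Probe key's sequence up to its first empty slot (plain insertion cost).
    seq, i = [], 1
    loc = h(key, i)
    while table[loc] is not None:
        seq.append(loc)
        i += 1
        loc = h(key, i)
    best_cost, best_plan = i, ("direct", loc, i)
    # Alternatively, evict the record at key's s-th choice (s < i), sending it
    # to the next empty slot in its own probe sequence.
    for s, occ_loc in enumerate(seq, start=1):
        okey, opos = table[occ_loc]
        j = opos + 1
        while table[h(okey, j)] is not None:
            j += 1
        cost = s + (j - opos)        # key's position + occupant's extra steps
        if cost < best_cost:
            best_cost = cost
            best_plan = ("displace", occ_loc, s, h(okey, j), j)
    if best_plan[0] == "direct":
        _, loc, i = best_plan
        table[loc] = (key, i)
    else:
        _, occ_loc, s, new_loc, j = best_plan
        table[new_loc] = (table[occ_loc][0], j)   # move the evicted record
        table[occ_loc] = (key, s)                 # key takes its s-th choice
    return best_cost

def h(key, i):                 # illustrative double-hashing probe function
    return (key + (i - 1) * (1 + key % 6)) % 7

table = [None] * 7
for k in (3, 20, 16):          # each lands at its first choice
    brent_insert(table, k, h)
cost = brent_insert(table, 38, h)   # direct insertion would cost 4; Brent pays 2
print(cost, table[3], table[0])     # 2 (38, 1) (3, 2)
```

In the demo, key 38's first three choices (locations 3, 6, 2) are taken, but moving key 3 one step along its own sequence (to location 0) frees 38's first choice, reducing the cost increase from 4 to 2.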
There has been no successful analysis of the expected values of lpsl and upsl, but it is conjectured [GONMUN79] that they are Θ(√n) for full tables, and this is supported by simulations. Simulations presented in this thesis support the conjecture that Θ(n ln n) time is needed to fill a table. Brent's method does not require any extra memory to process an insert operation. If the search for an empty location is done on a depth-first basis, as suggested in [BRE73], the expected number of times the h2(·) hash function will be computed is α² + α³ + ⋯, eventually approaching Θ(√n) for full tables [KNU73]. The disadvantage of searching in this manner is that a number of locations below the breakeven line will be probed. For example, the fifth probe position of the record R to be inserted would be probed unnecessarily. The number of additional table positions probed during an insertion is approximately α⁴ + α⁵ + ⋯ [KNU73].

Another way of searching for the closest empty location is to do a level search, as recommended in [GON84]. For double hashing, this would imply calling the h2(·) hash function during an insertion up to O(n) times instead of Θ(√n), which may be preferable to probing the extra locations. A disadvantage of Brent's method is that duplicate keys are not detected by the insertion algorithm, so if duplicate insertion requests may occur, an unsuccessful search should precede each insertion. To keep track of the value of longestprobe, the probe position of the stored record that will be displaced must be determined, to establish whether its new probe position will be the new maximum. The probe position of a stored record can be determined by searching for the location where the record is stored along its probe sequence.

1.3.3 Binary Tree Hashing

Binary tree hashing is the natural generalization of Brent's method.
Not only is the record being inserted allowed to displace other records in its probe sequence, but these displaced records may further displace other records in their probe sequences. This is illustrated graphically by Figure 1.4. The method was discovered independently by Mallach [MAL77] and by Gonnet and Munro [GONMUN79]. It is expected to produce better tables at a somewhat higher cost. An approximate model [GONMUN79] yields a power series in α for E[psl] under random probing and α-full tables², and for full tables:

E[psl] ≈ 2.13414…

These results have been validated using simulation. There is no analysis for the expected values of lpsl and upsl, but Gonnet and Munro, based on simulation results, conjectured them to be about lg n + 1 ≈ 1.44 ln n + 1 for full tables. As with Brent's scheme, the expected cost of loading a full table has not been successfully analyzed.

²The expansion of psl given in [GON84] differs slightly from the one in [GONMUN79].

[Figure 1.4: Sample insertion using Binary Tree Hashing]

The natural order for inspecting the table when searching for an empty location is by levels, as suggested in [MAL77] and [MAD80]³. However, the amount of memory required to store the tree of locations probed is large, as Mallach noted. Limited simulations done by the author of this thesis indicate that the number of probes into the table appears to be about 0.5·n^{3/2} ln n, and the amount of memory required to store the tree about 0.15·n^{3/2} ln n. What is worse, the variance of these two measures is very high, so the probability of requiring, say, n² extra memory locations is significant.

Gonnet and Munro [GONMUN79] show how to use an algorithm for the transportation problem, presented in [EDMKAR72], to insert keys into the table. This algorithm has a worst-case runtime of O(n² log n) [FRETAR84], but simulations done by the author of this thesis suggest that the expected cost of loading a full table is Θ(n^{3/2} ln n), and the amount of extra memory required is Θ(n), both with a small variance. While this is a great improvement over the natural algorithm, it is still expensive in both time and memory.

As with Brent's method, some additional probes are required to detect duplicate keys and to keep track of the value of longestprobe.

³The reader is warned that the analysis, conclusions, and comments on Mallach's work presented by Madison [MAD80] are incorrect.

1.3.4 Optimal and Min-Max Hashing

Both Brent's method and binary tree hashing only move stored keys forward in their probe sequences. Arbitrary rearrangements of the stored records must be allowed to obtain an optimal hash table [POO76,RIV78,GONMUN79]. Poonen was the first to note that the optimal placement of keys in a hash table is a special case of the assignment problem [KON31], which can be solved in O(n² log n) time in the worst case [FRETAR84]. Neither the expected cost of finding the optimal hash table nor the expected values of psl, lpsl, and upsl have been determined. For E[psl] and full tables the following bounds exist [GON77,GONMUN79]:

1.688382… ≤ E[psl] ≤ 2.13414…

Simulation results indicate that E[psl] ≈ 1.82…. If, instead of minimizing the expected value of psl, we minimize the expected value of lpsl (called min-max hashing), the bounds become [GON81,RIV78]

ln n + γ + O(1) ≤ E[lpsl]
lpsl < 4 lg n ≈ 5.77 ln n   with probability 1 − ε

where ε is an arbitrarily small constant and the inequality holds for all n > n₀(ε). This last inequality is not sufficient to prove that E[lpsl] = O(ln n). As a corollary of a result obtained in Chapter 2, it can be shown that E[lpsl] = Θ(ln n) for full min-max hash tables. Both optimal and min-max hash tables, in addition to being expensive to build, require O(n) extra memory during the insertion of a record.
1.4 Summary

In summary, we can say that binary tree, optimal, and min-max hashing reduce the expected values of psl and upsl dramatically, but at a very high cost for table creation. The expected number of operations to construct a table using one of these algorithms is high compared to the standard hashing scheme, and they also require a nontrivial amount of extra memory. These methods are best suited for applications in which the set of keys is static and known in advance. In that case, the cost of constructing the table can be amortized over a large number of search operations, and the additional memory space required can be released as soon as the table has been created.

Brent's method is perhaps the one that offers the best overall tradeoff. It is simple to program, the time needed for loading a full table seems to be Θ(n ln n), it requires no additional memory, and it has an expected successful search time which is constant. However, duplicate keys are not automatically detected during the insertion process; that is, an unsuccessful search is a useful prelude to an insertion. Furthermore, some additional probes into the probe sequence of a stored record to be moved are required to determine if its new probe position is the new maximum. But the greatest disadvantage of Brent's method is the expected value of lpsl, and hence of unsuccessful searches; this is Θ(√n), much worse than the Θ(ln n) achieved by binary tree, optimal, and min-max hashing.

Ordered hash tables do not improve the expected value of psl or lpsl at all. The method improves the expected cost of unsuccessful searches but not its expected worst case. The loading cost is almost the same as for the standard algorithm. If unsuccessful searches are expected to be frequent, it should be preferred over the standard algorithm. This method and the standard algorithm are the only two that allow detection of duplicate keys with no additional probes into the table.
As for the standard insertion algorithm, we can say that it is simple to program and efficient to implement, and that the tables it produces have an acceptable expected value of psl for moderate load factors. The main problem with this scheme is that, when the table is full or nearly full, it will take Θ(n) steps to retrieve some keys and to perform unsuccessful searches. The solution that has been suggested is:

"A hash table should never be allowed to get that full. When the load factor is about 0.7 or 0.8, the size of the hash table should be increased, and the records in the table should be rehashed." [MAU75]

This statement is sometimes interpreted as "Do not use hashing for real-time applications". We will see that this is no longer the case.

What is needed is a hashing scheme that combines the best features of each of the methods presented. Such a scheme should:

- Be as simple to program as the standard algorithm,
- Take only Θ(n ln n) probes on average to load a full table,
- Require no additional memory for insertions,
- Perform successful searches in a small number of probes on average, even if the table is nearly full,
- Have a O(ln n) expected value of lpsl and upsl.

And, of less importance:

- Need no additional probes to detect duplicate entries,
- Need no additional probes to keep track of the value of longestprobe.

The method introduced and analyzed in this thesis, called Robin Hood hashing, has all of these characteristics.
In Chapter 2 we modify the standard insertion algorithm to obtain a new method that we call Robin Hood hashing; ‘we also define a family of hashing schemes that have the same expected value of pal and prove that the expected value of Ipal ix O(Inm) for Robin Hood hashing. In Chapter 3 we study the probability distribution of the random variable pel and its moments for a-full tables. In Chapter 4 we present some modifications to the standard search algorithm to be used on a Robin Hood hhash table s0 as to achiove a small constant expected search time for a-full tables, and loading of a table using @(n nn) probes. In Chapter § we present the results of simulating several of the methods presented here and Robin Hood hashing, In Chapter 6 we discuss how deletions can be performed in hash tables with open addressing and we preteat an algorithm to be used with Robin Hood hashing, Simulations revulte for this new algorithm are also presented. Finally in Chapter 7 we present our conclusions and suggestions for further research. Chapter 2 The Robin Hood Heuristic Jn this chapter we present a new algorithm to insert keys into a hash table. We ‘all this algorithm Robin Hood hashing. We then define a family of reorgani- ation schemes that share the same expected probe sequence length (E{psll). Finally we prove that the expected longest probe sequence length (ElIpsl]) is O(tnm} for a Robin Hood hash table with m records. Searching and the cost of loading are discussed in Chapter 4 and deletions in Chapter 6. 2.1 The Robin Hood Approach to Hashing In the standard hashing algorithm, as the table fills, the problem may not be 60 much the average search time, but its expected worst case. An expected value for pel of lnn-+O(1) may be acceptable, but an expected value for Ipal of @(n) is certainly aot. In other words, the problem is not so much the mean of pal, ‘but its high variance. 
As an attempt to reduce the variance of the psl, consider the following modification to the standard open addressing hashing algorithm: when inserting, if a record probes a location that is already occupied, the record that has traveled farther in its probe sequence keeps the location, and the other one continues on its probe sequence. Notice that, in deciding which record keeps the location, no use is made of any knowledge of the remaining probe sequence of the colliding records, or of any other information regarding the rest of the hash table. This implies that, under random probing, the expected number of additional probes into the hash table required to find an empty location is the same for both records. Therefore the expected value of psl is not affected by this modification to the standard algorithm, but its distribution, and hence the distribution of the lpsl, will change. Consider a collision of two records R1 and R2, and assume that their probe positions are i and j, respectively, with i < j; under this heuristic the record at probe position j keeps the location.

function insert(Record)
    k := key(Record)
    probeposition := 1
    location := h(k, probeposition)
    while table[location] ≠ empty do
        recordposition := { probe position of the record stored at location }
        if probeposition > recordposition then begin
            tempRecord := table[location]
            table[location] := Record
            Record := tempRecord
            k := key(Record)
            longestprobe := max( longestprobe, probeposition )
            probeposition := recordposition
        end
        probeposition := probeposition + 1
        location := h(k, probeposition)
    end while
    longestprobe := max( longestprobe, probeposition )
    table[location] := Record
    m := m + 1
    return(location)
end function insert

Figure 2.1: Robin Hood insertion algorithm

The insertion algorithm for all of the hashing schemes belonging to this family is that of Figure 2.1, except for the condition probeposition > recordposition inside the if statement. In the standard algorithm this condition is changed to recordposition = 0 (meaning the location is empty); in ordered hash tables, to k < key(table[location]) (the key of an empty location is larger than any other); and in signature hashing with variable-length separators, to Signature(k, probeposition) < Signature(key(table[location]), recordposition) (the signature of an empty location is higher than all other signatures). Any other condition could be used here. The only restriction is that the decision must be made without looking at the remaining probe sequence of the colliding records. Brent's method, binary tree hashing, optimal hashing, and min-max hashing do not satisfy this restriction, and hence are not members of this family.
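The heuristic of Figure 2.1 is short in executable form. A minimal sketch (the probe function and demo keys are illustrative, and the figure's longestprobe bookkeeping is omitted):

```python
def robin_hood_insert(table, key, h):
    """Robin Hood heuristic (cf. Figure 2.1): on a collision, the record that
    has traveled farther along its probe sequence keeps the slot; the loser
    continues probing. `table` maps location -> (key, probeposition) or None;
    `h(key, i)` is the 1-based probe function."""
    pos = 1
    loc = h(key, pos)
    while table[loc] is not None:
        _, occupant_pos = table[loc]
        if pos > occupant_pos:                 # "rob" the richer record's slot
            table[loc], (key, pos) = (key, pos), table[loc]
            # the displaced record resumes from the position it was stored at
        pos += 1
        loc = h(key, pos)
    table[loc] = (key, pos)
    return loc

def h(key, i):                 # illustrative double-hashing probe function
    return (key + (i - 1) * (1 + key % 6)) % 7

table = [None] * 7
for k in (3, 20, 38):
    robin_hood_insert(table, k, h)
# 38's second choice hits 20 (stored at its first choice), so 38 robs the
# slot and 20 moves on to its own second choice.
print(table[6], table[2], table[3])    # (38, 2) (20, 2) (3, 1)
```

The tuple swap mirrors the tempRecord exchange in the figure: the incoming record is stored with its current probe position, and probing continues on behalf of the evicted record.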
The only restriction is that the decision must be made without looking at the remaining probe ecquence of the colliding records. Brent’s method, binary tree hashing, optimat hashing, and ‘min-max hashing do not satisfy this restriction, and hence are not members of ‘his family. 2.2, A FAMILY OF RANDOM PROBING SCHEMES a Lemma 2.1 Under random probing, and for any hashing scheme in which the decision as to which of the colliding records may stay in the location is not Jased on any knowledge about their future probe sequence, the expected number of probes required to insert m records into a table of size n és [Hn — Bam) ‘where Hy = Tiny} is the i-th harmonic number. The variance of the total ssumber of probes required to insert m recorde it 9 [HQ) — #O),,) — nLfe— Ha-ml where Hf?) = Diy A Prost, Since no knowing about the febere probe sawence of the colliding i ed the the probability that the next location probed is empty is Sultel ‘regardless of which of the records was rejected. The number of prober needed to perform the -th ination ia then s geometrically ditributed random variable with parameter (probability of success) 1— 1, and is inde- ‘pendent of the aumsber of probes required for previous insertions. ‘The expected value and variance of the nuraber of probes needed to perform the i-th insertion are =Tpy and 2S} respectively. The expected number of probes needed to perform m insertions is simply Ee Et tencmth ant [20 — 20), — lta — Bam ‘These derivations are fairly simple and thia could be the reason why, to the best of our knowledge, they have not been published elsewhere. 16 CHAPTER 2. 
CHAPTER 2. THE ROBIN HOOD HEURISTIC

Theorem 2.1 A record inserted using a random probing hashing scheme, in which the decision as to which one of the colliding records may stay in the location is not based on any knowledge about their future probe sequences, has an expected probe sequence length of

E[psl] = (n/m) [H_n − H_{n−m}]

for a table of size n containing m records.

Proof: By Lemma 2.1, the expected total number of probes required to insert the m records is n[H_n − H_{n−m}]. Every probe advances exactly one record one position along its probe sequence, so the total number of probes equals the sum of the final probe sequence lengths; since the decision rule gives every record the same distribution of psl, each record's expected probe sequence length is this total divided by m. □

The heuristic also has a unique final table property: the final table depends only on the set of records stored, not on the order in which they were inserted. To see this, let label(R, i) be the value used to decide collisions (for Robin Hood hashing, the probe position i), and define, for each location ℓ, the set

Z_ℓ = {(R, i+1) | h(R, i+1) = ℓ, and either i = 0 or label(R, i) < max{label(R′, j) | (R′, j) ∈ Z_{h(R, i)}}}.

The second term of this expression can be read as: the (i+1)-st choice of R is ℓ, and either i = 0 or R is rejected from its i-th choice. Then Z_ℓ is the set of record probes that location ℓ receives. Each insertion sequence of the records in R corresponds to a particular order of computing the transitive closure. The sets Z_ℓ are completely defined by the set R of records in the table and the h and label functions, without reference to the order in which the records were inserted. Since a transitive closure is always unique, the set of record probes a location receives is independent of the insertion sequence. □

This unique final table property also applies to some of the other hashing schemes in the family defined in Section 2.2. The above proof can be applied to signature hashing by using the probe signature as the label, and to ordered hash tables by making the key value the label. Amble and Knuth [AMBKNU74] proved that ordered hash tables always produce a single final table, by showing that inserting a set of records with their algorithm is equivalent to inserting the same set in order of increasing key values using the standard insertion algorithm. Rivest [RIV78] proved that the optimal hash table can always be obtained by using the standard insertion algorithm and a permutation of the insertion sequence.
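The unique final table property can be observed directly. In the sketch below (illustrative; the deterministic pseudo-random probe function h and the smaller-key tie breaker are assumptions of the demo, not of the thesis), the same set of keys inserted in opposite orders yields identical tables:

```python
import random

def h(key, i, n):
    """Pseudo-random probing: the i-th probe of `key` is a deterministic
    but uniform-looking location (an illustrative stand-in)."""
    return random.Random(key * 100_003 + i).randrange(n)

def rh_table(keys, n):
    """Build a Robin Hood table; ties are broken in favor of the smaller
    key, so the decision never depends on future probes."""
    table = [None] * n
    for key in keys:
        pos = 1
        loc = h(key, pos, n)
        while table[loc] is not None:
            res_key, res_pos = table[loc]
            if pos > res_pos or (pos == res_pos and key < res_key):
                table[loc] = (key, pos)       # evict the "richer" record
                key, pos = res_key, res_pos   # and carry it onward
            pos += 1
            loc = h(key, pos, n)
        table[loc] = (key, pos)
    return table

keys = [17, 42, 99, 7, 23, 64, 8, 31]
assert rh_table(keys, 16) == rh_table(list(reversed(keys)), 16)
```

The final assertion is exactly the theorem's claim: the table is a function of the key set alone, not of the insertion order.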
It is interesting to note that under Robin Hood hashing, the final table obtained is not necessarily one that could be obtained by using a different permutation of the insertion sequence and the standard insertion algorithm. Consider for example a table of size 3 and the insertion sequence R1, R2, R3, applied to the Robin Hood insertion algorithm with a tie breaker that decides in favor of the record with the smallest subscript. Let the hash function h(R_i, j) be as specified in the table below (locations are numbered 1 to 3):

            j = 1    j = 2
    R1        1        3
    R2        1        2
    R3        2        1

After inserting the 3 records the hash table will be

    location:   1     2     3
    record:     R3    R2    R1

where each record is in its second probe position. Since no record is in its first probe position, this table cannot be obtained using the standard insertion algorithm and a different insertion sequence: under the standard algorithm, the first record inserted always lands in its first probe position.

2.4 The Expected Worst Case for Robin Hood Hashing

In the next chapter we will study the distribution of the probe sequence length, psl, of Robin Hood hashing. In this section we derive bounds for the expected value of the longest probe sequence length (E[lpsl]) and prove that for full tables E[lpsl] = Θ(ln n).

Assume we have a set R = {R_1, ..., R_m} of m records stored in a hash table. Now define the following functions:

- Let σ: R → {0, ..., n−1} represent the table assignment, such that σ(R_i) is the table location in which R_i is stored.
- Let u: R → {1, ..., ∞} be such that u(R_i) is the position, in the probe sequence of R_i, of the location in which R_i is stored.
- Let b: R × {0, ..., ∞} → R be the backup function, defined as b(R_i, j) = σ^{−1}(h(R_i, u(R_i) − j)), that is, the record occupying the location that record R_i probed j steps before its current location.

Assume that the value of the longest probe sequence length (lpsl) is ℓ; that is, at least one record (say R_worst) is in the ℓ-th position of its probe sequence, and none occurs later. Consider the following intuitive argument: Let W_0 = {R_worst}. The location R_worst probed in its (ℓ−1)-st choice must contain a record in at least its (ℓ−1)-st probe position.
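This behavior can be replayed mechanically. The sketch below (illustrative) hard-codes one assignment of probe sequences with the stated property, inserts R1, R2, R3 under the Robin Hood rule with the smallest-subscript tie breaker, and checks that every record ends up in its second probe position:

```python
# One assignment of probe sequences (locations 1..3) realizing the
# example: probes[r][j-1] is the j-th probe location of record r.
probes = {1: (1, 3), 2: (1, 2), 3: (2, 1)}

table = {}                      # location -> (record, probe position)
for r in (1, 2, 3):             # insertion sequence R1, R2, R3
    rec, pos = r, 1
    while True:
        loc = probes[rec][pos - 1]
        if loc not in table:
            table[loc] = (rec, pos)
            break
        res, res_pos = table[loc]
        # Robin Hood rule; ties decided in favor of the smaller subscript.
        if pos > res_pos or (pos == res_pos and rec < res):
            table[loc] = (rec, pos)
            rec, pos = res, res_pos
        pos += 1

assert table == {1: (3, 2), 2: (2, 2), 3: (1, 2)}  # all in 2nd position
```

R3's arrival displaces R1 from location 1, and R1 finishes in location 3; since no record sits in its first probe position, no insertion order under the standard algorithm can produce this table.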
So there are at least two records, R_worst and b(R_worst, 1), in at least their (ℓ−1)-st probe positions. Let W_1 = {R_worst, b(R_worst, 1)}. The records in W_1 are all in at least their (ℓ−1)-st probes. Similarly, if both records are moved back, we find that at least 4 records, R_worst, b(R_worst, 1), b(R_worst, 2), and b(b(R_worst, 1), 1), are in at least their (ℓ−2)-nd probes. Let W_2 = {R_worst, b(R_worst, 1), b(R_worst, 2), b(b(R_worst, 1), 1)}. Care must be taken in such an intuitive argument, since we are sampling the table with replacement, so the cardinalities of the last two sets are probably, but not necessarily, 2 and 4.

The preceding intuitive argument can be adapted into a more precise analysis. W_i will denote a set of records that are in at least their (ℓ−i)-th probe positions, and U_i the set of records that are stored in the locations that the records in the earlier sets would hit if moved back one further position. More formally (note that R_worst is an arbitrarily chosen element in its ℓ-th probe position),

    U_0 = {R_worst}
    W_0 = {R_worst}
    U_i = {R | R = b(R′, j) for some R′ ∈ U_{i−j}, 1 ≤ j ≤ i}
    W_i = W_{i−1} ∪ U_i

Each record in the set W_i is in at least its (ℓ−i)-th probe position. If none of the locations that we sample when moving back a record were repeated, then the cardinality of the set W_i would be 2^i. Since we are sampling the locations with replacement, the cardinality of W_i is a random variable denoted by w_i. We will denote by u_i the cardinality of the set U_i − W_{i−1}, which is the number of records that belong to the set W_i but do not belong to the set W_{i−1}.

We will first find a bound for the expected value of w_i using occupancy distributions [DAVBAR62, JOHKOT77]. w_i is equal to w_{i−1} + u_i.
The distribution of u_i is of the type called classical occupancy with specified boxes (see for example chapter 14 of [DAVBAR62]), where the number of urns is n, among which the number of specified urns is n − w_{i−1}, the number of balls dropped is w_{i−1}, and we are interested in the number of specified urns that are hit. We now define a new sequence of random variables v_i such that E[v_i] ≤ E[w_i] for all i. Initially w_0 = v_0 = 1. Let v_i be the number of urns hit when 2v_{i−1} balls are dropped at random. If we were to guarantee that the first v_{i−1} of these balls all went to different urns and the other half were dropped at random, then there would be no difference between the random variables v_i and w_i. Since this is not the case, the expected value of v_i is less than the expected value of w_i. The distribution of v_i is then of the type called classical occupancy. We therefore have

E[v_i | v_{i−1}] = n (1 − (1 − 1/n)^{2 v_{i−1}}),

which, using inequality 4.2.29 from [ABRSTE70], can be bounded from below to show that E[v_i] essentially doubles at each step while v_{i−1} remains small relative to n; consequently, after at most ⌈lg(m−2)⌉ backup steps the sets W_i are expected to cover a constant fraction of the records, each of which is in at least its (ℓ−i)-th probe position. It follows that E[N′] ≥ (m/3)(E[lpsl] − ⌈lg(m−2)⌉), where N′ is the sum of the probe positions of the records counted in the sets W_i. Let N be the sum of the probe positions of the m records of R stored in the hash table. Since the records counted are just a subset of R, E[N] ≥ E[N′], and

E[N] = m E[psl] ≥ (m/3)(E[lpsl] − ⌈lg(m−2)⌉).

Solving for E[lpsl] we get

E[lpsl] ≤ 3 E[psl] + ⌈lg(m−2)⌉.

The lower bound E[psl] ≤ E[lpsl] holds because psl ≤ lpsl for every table, and it is strict because there is a positive probability that psl < lpsl. □

Corollary 2.4 The expected value of lpsl for a full Robin Hood hash table is Θ(ln n).

Proof: Follows from the theorem and the fact that E[psl] = Θ(ln n) for a full Robin Hood hash table. □
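The bound E[lpsl] ≤ 3E[psl] + ⌈lg(m−2)⌉ can be exercised numerically. In the sketch below (illustrative; a seeded RNG stands in for random probing, and records are anonymous because, at a collision, neither record's future probes are known, so only the two probe positions matter), a table is filled to load factor roughly 0.95 and the longest probe sequence is compared against the bound:

```python
import math
import random

def rh_fill(n, m, seed=1):
    """Fill a size-n table with m records by Robin Hood insertion under
    random probing (each probe uniform, with replacement). Slots hold
    only the occupant's current probe position."""
    rng = random.Random(seed)
    table = [None] * n
    total_probes = 0
    for _ in range(m):
        pos = 1
        total_probes += 1
        loc = rng.randrange(n)
        while table[loc] is not None:
            if pos > table[loc]:                   # incoming traveled farther
                pos, table[loc] = table[loc], pos  # swap: resident evicted
            pos += 1
            total_probes += 1
            loc = rng.randrange(n)
        table[loc] = pos
    psls = [p for p in table if p is not None]
    assert sum(psls) == total_probes  # each probe advances one record
    return psls

psls = rh_fill(1024, 973)             # load factor ~0.95
bound = 3 * sum(psls) / len(psls) + math.ceil(math.log2(len(psls) - 2))
print(f"lpsl = {max(psls)}, bound = {bound:.1f}")  # typically well below
```

The internal assertion checks the bookkeeping identity used in the proof of Theorem 2.1: the total number of probes made equals the sum of the final probe sequence lengths.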
Corollary 2.5 The expected value of lpsl for a full min-max hash table is bounded by

ln n + γ + P(ln n) + o(1) ≤ E[lpsl] ≤ 3 ln n + 3γ + ⌈lg(n−2)⌉ + o(1),

where P(x) is a periodic function with period 1 and magnitude |P(x)| < .0001035, and by

α^{−1} ln(1/(1−α)) ≤ E[lpsl] ≤ 3 α^{−1} ln(1/(1−α)) + ⌈lg(m−2)⌉

for α-full tables with m records and load factor α = m/n.

Proof: The lower bounds are from [GON81]. For every set of records, the value of lpsl for a min-max hash table is less than the corresponding value for any other hashing scheme. Therefore, E[lpsl] for min-max hashing is less than the E[lpsl] for any other hashing scheme; in particular, it is less than the E[lpsl] for Robin Hood hashing. The upper bounds are from the theorem above. □

Chapter 3

The Distribution of psl

In Chapter 2 we introduced a simple modification of the insertion procedure of the standard hashing algorithm. We then showed that all records inserted using this heuristic have the same distribution of the probe sequence length, regardless of their position in the insertion sequence. We also proved that the expected position of a record in its probe sequence is ln(n) + γ + o(1) for a full table, and that the expected value of lpsl is Θ(ln n). In this chapter we derive the distribution of the probe sequence length for a record in a Robin Hood hash table of infinite size. To study the distribution of the random variable psl for an infinite table we will introduce a probability model based on balls and urns. It is important to note that the analysis presented below is for an infinite hash table, and that we have not proved that an analysis for finite tables would converge to the same result.

3.1 The Asymptotic Model

Consider the following urn model, which corresponds to inserting all elements in the table simultaneously. Assume that we have an infinite number of urns and that we drop at random an infinite number of balls, such that the average number of balls per urn is α. Each of these balls is given a label of 1.
After the balls have been dropped, for each urn that contains more than one ball, one ball is selected according to some criterion and the rest are marked. All balls are left in the urn. For each marked ball with label 1, we create a new unmarked ball with label 2 and drop it into a random urn. After all the new balls have been distributed, we check the urns and, for each urn that contains more than one unmarked ball, we select one (unmarked) ball among those with the highest label and mark the rest. We then create an additional unmarked ball with label i+1 for each newly marked ball with label i and drop it at random into the urns. We continue in this fashion until each urn has an average of α unmarked balls, or equivalently, until a fraction α of the urns contains an unmarked ball (since there can be at most one unmarked ball per urn).

In this model, urns represent table locations, balls represent probes into the table, ball labels represent probe positions or try numbers, and marking a ball represents rejecting a record from the location. An unmarked ball with label i represents a record placed in its i-th probe position.

Notice that in a finite table, the total number of unmarked balls with a label greater than i (the number of records placed after their i-th probe position) is equal to the total number of balls, either marked or unmarked, with label i+1 (the number of records that probed their (i+1)-st position). For an infinite table this corresponds to the expected number of unmarked balls with label greater than i per urn being equal to the expected number of balls (marked or unmarked) with label i+1 per urn.

We define the following variables:

- Let λ be the expected number of balls (marked or unmarked) per urn.
- Let λ_i be the expected number of balls (marked or unmarked) with label equal to i per urn.
- Let Λ_i be the expected number of balls (marked or unmarked) with label less than or equal to i per urn.
Then Λ_{i+1} = Λ_i + λ_{i+1}.

- Let t_i be the expected number of unmarked balls with label less than or equal to i per urn.

The expected (per urn) number of unmarked balls with label greater than i is α − t_i. We already mentioned that the expected number of unmarked balls with label greater than i is equal to the expected number of marked or unmarked balls with label i+1. Hence λ_{i+1} = α − t_i. From the definitions above it follows that λ_1 = Λ_1 = t_∞ = α.

Consider an arbitrary but fixed urn. Each ball is labeled before it is dropped, and the urn selected by a ball is independent of its label. Therefore the labels of the balls in the urn, at some stage of the loading process described above, can affect the labels of balls arriving later at the urn only in as much as they can affect the expected (per urn) number of balls created with each label. If we have an infinite number of urns, nothing that occurs in a single urn having a finite number of balls can affect the values of the λ_i's, and with probability 1 the number of balls in the urn is finite.

Another way of viewing this last observation is as follows. The balls in an arbitrary but fixed urn can affect the distribution of the label of another ball hitting the same urn only if that ball was created as a (direct or indirect) consequence of marking those balls. Let us then divide our loading process into two phases. The first phase is the same as the original loading process, except that all balls falling into the specified urn are placed in the urn but otherwise ignored. The second phase consists of: a) marking the balls in the specified urn exactly as would have been done in the normal loading process, and b) creating additional balls with the appropriate labels until no new ball needs to be created. The labels of the balls that fall into the specified urn in the first phase are independent of each other.
The labels of the balls created in the second phase are not independent of the labels of the balls in the specified urn. The balls (with any label) created in the second phase can be viewed as a branching process (see for example chapter XII of [FEL68]). Every time a ball is dropped, two things can happen: the ball hits an empty urn, in which case no descendant is created, or it hits an urn that contains one unmarked ball, in which case one descendant is created. At the end of the first phase the number of balls in the specified urn is finite with probability 1, and hence, with probability 1, all but a finite number of records have been placed in the table. Therefore the fraction of occupied urns is α, the probability of hitting an urn that contains one unmarked ball is α, and the expected number of direct descendants of a ball is also α.

Feller proves (section XII.6 in [FEL68]) that if we start with a finite number of individuals and the expected number of direct descendants of an individual is less than 1, then the total progeny is finite with probability 1. Since the total number of balls created in the second phase is finite, and they are dropped among an infinite number of urns, then, with probability 1, none of them will fall into the specified urn.

A consequence of these two observations is that the labels of the balls in an urn are independent of each other. Note that the second of these observations does not hold if we have only a finite number of urns, since the effect of a single urn on the λ_i's can be arbitrarily large. For a finite number of urns, the labels of the balls in a specified urn are mutually dependent, but the larger the number of urns, the weaker the dependency. Neither does it hold if we have a full table, since with probability 1 the number of balls in an urn is infinite. This is the reason why the analysis presented below is valid only for infinite nonfull tables.
We have already studied the distribution of the number of probes required to load the table in Theorem 2.1 (page 14). From that it follows that λ = −ln(1 − α).

3.2 The Distribution

In this section we find the probability distribution of the number of positions probed by a key in an infinite Robin Hood hash table. We do this by establishing a recurrence relation for t_i. Larson [LAR83] has proved, for infinite α-full tables (α < 1), that the distribution of the number of balls that hit an urn is Poisson with parameter λ.

Consider an arbitrary but fixed urn. The urn will contain an unmarked ball with label less than or equal to i when it has been hit by at least one ball with label less than or equal to i and no ball with label higher than i. Let g_i(x, s) denote the probability that the urn is hit by x balls, s of which have a label higher than i. Then g_i(x, s) is the product of a Poisson and a binomial distribution, as follows:

g_i(x, s) = (e^{−λ} λ^x / x!) (x choose s) ((λ − Λ_i)/λ)^s (Λ_i/λ)^{x−s}.

The Poisson factor gives the probability that the urn receives x balls. (λ − Λ_i)/λ is the probability that a ball is labeled higher than i. This probability is independent of the labels of other balls in the same urn. Hence the probability that s out of x balls have a label higher than i has a binomial distribution. The probability that the urn receives at least one ball with label less than or equal to i and no ball with label higher than i is

t_i = Σ_{x≥1} g_i(x, 0) = e^{−λ} Σ_{x≥1} Λ_i^x / x! = e^{Λ_i − λ} − e^{−λ}.

Define p_i(α) to be the probability that a record probes at least i locations before being placed in the table. Then α p_{i+1}(α) is the expected number of unmarked balls with label greater than i per urn. We already mentioned that this is equal to the expected number of marked or unmarked balls per urn with label equal to i+1.
Therefore p_i(α) satisfies α p_{i+1}(α) = λ_{i+1}, and from the relation λ_{i+1} = α − t_i,

p_{i+1}(α) = 1 − t_i/α = 1 − (e^{Λ_i − λ} − e^{−λ})/α.

Since λ = −ln(1 − α) and Λ_i = λ_1 + ⋯ + λ_i = α (p_1(α) + ⋯ + p_i(α)), we have

p_{i+1}(α) = 1 − ((1 − α)/α) (e^{α(p_1(α) + ⋯ + p_i(α))} − 1).

We summarize this in the following theorem.

Theorem 3.1 In the asymptotic model for an infinite Robin Hood hash table with load factor α (α < 1), the probability p_i(α) that a record is placed in the i-th or a further position in its probe sequence is given by

p_1(α) = 1,
p_{i+1}(α) = 1 − ((1 − α)/α) (e^{α(p_1(α) + ⋯ + p_i(α))} − 1).
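Theorem 3.1's recurrence can be iterated directly. The sketch below (illustrative) computes the tail probabilities p_i(α) and checks that their sum, which equals E[psl], matches α^{−1} ln(1/(1−α)), the expected probe sequence length obtained in Chapter 2:

```python
import math

def psl_tail_probs(alpha, imax=100):
    """Iterate Theorem 3.1: p[i-1] = p_i(alpha) = P(psl >= i) for an
    infinite alpha-full Robin Hood table; p_1 = 1."""
    p = [1.0]
    for _ in range(imax - 1):
        lam_i = alpha * sum(p)             # Lambda_i = alpha*(p_1+...+p_i)
        nxt = 1.0 - (1.0 - alpha) / alpha * (math.exp(lam_i) - 1.0)
        p.append(max(0.0, nxt))            # clamp tiny negative rounding
    return p

alpha = 0.9
p = psl_tail_probs(alpha)
mean = sum(p)                              # E[psl] = sum of tail probs
assert abs(mean + math.log(1 - alpha) / alpha) < 1e-6
assert all(0.0 <= x <= 1.0 for x in p)
```

The tail probabilities shrink very rapidly with i, which is why the iteration converges to the Chapter 2 mean after only a handful of terms.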
