A New Algorithm For Constructing Minimal Perfect H
Definition 7 Let Vcrit ⊆ V be a set of critical vertices. Subset Vcrit contains all vertices from V that are part of cycles or are in a chain connecting two or more cycles, as the vertex 5 in Figure 4.

Definition 9 Let Vncrit = V − Vcrit be a set of non critical vertices. Subset Vncrit contains all vertices from V that are not part of cycles.

Definition 10 Let Vscrit ⊆ Vcrit be a set of all critical vertices that have at least one non critical vertex as adjacent, as the vertex 5 in Figure 4.

Definition 11 Let Encrit = E − Ecrit be a set of non critical edges.

Definition 12 Let Gcrit = (Vcrit, Ecrit) be a critical graph and let Gncrit = (Vncrit ∪ Vscrit, Encrit) be a non critical graph, where the critical subgraph Gcrit corresponds to the cyclic part of G and the non critical subgraph Gncrit corresponds to the acyclic part of G. Thus, G = Gcrit ∪ Gncrit.

Definition 13 Let P|Ecrit| be the probability that G has at most |Ecrit| critical edges.

Definition 14 Let P|Vcrit| be the probability that G has at most |Vcrit| critical vertices.

4 Minimal Perfect Hashing

In a hashing method, different keys might have the same address computed by the hash function, a situation called collision. In this case various schemes for resolving collisions are known. A perfect hash function is an injection h : U → [0, m − 1], which means that for all x, y ∈ S such that x ≠ y we have h(x) ≠ h(y), which implies that m ≥ n. For being an injection, a perfect hash function transforms each key of S into a unique address in the hash table, as depicted in Figure 1(a). Since no collisions occur, each key can be retrieved from the table in one probe. If m = n and h(x) is perfect, then h(x) is a minimal perfect hash function (MPHF), as depicted in Figure 1(b). The perfect hash function h is said to be order preserving if for any pair of keys xi, xj ∈ S then h(xi) < h(xj) if and only if i < j. In other words, the keys in S are arranged in some order and the function h preserves this order in the hash table.

Figure 1: (a) Perfect hash function. (b) Minimal perfect hash function.

In the definitions above the keys to be placed in the hash table are integers in the interval [0, m − 1]. In practice, it is often the case that keys are sequences of characters over some finite and ordered alphabet Σ, such as the ASCII set. In this case, we convert it to a random number modulo |V| for each key. To obtain a random number for each key, we generate a table of random numbers, one for each possible character of Σ at each position i in the key. The construction of the minimal perfect hash function presumes the existence of two random and independent hash functions h1 and h2. For a key x containing |x| characters and two different tables of random numbers table1 and table2, the two hashing functions are:

\[ h_1(x) = \Big( \sum_{i=1}^{|x|} \mathit{table}_1[i, x[i]] \Big) \bmod |V|, \]

\[ h_2(x) = \Big( \sum_{i=1}^{|x|} \mathit{table}_2[i, x[i]] \Big) \bmod |V|. \]

Consider now a problem known as the perfect assignment problem: For a given undirected graph G = (V, E), where |V| = cn and |E| = n, find a function g : V → {0, 1, ..., |V| − 1} such that the function h : E → {0, 1, ..., n − 1}, defined as

\[ h(e) = (g(a) + g(b)) \bmod n \tag{1} \]
is a bijection, where e = {a, b}. This means that we are looking for an assignment of values to vertices so that for each edge the sum of values associated with endpoints taken modulo the number of edges is a unique integer in the range [0, n − 1].

The ordering and searching steps of the MOS approach are a very simple way of solving the perfect assignment problem. Czech, Havas and Majewski [1] showed that the perfect assignment problem can be solved in optimal time if G is acyclic. To generate an acyclic graph two vertices h1(x) and h2(x) are computed for each key x ∈ S. Thus, set S has a corresponding graph G, with V = {0, 1, ..., v} and E = {{h1(x), h2(x)} : x ∈ S}. In order to guarantee acyclicity the algorithm repeatedly selects h1 and h2 until the corresponding graph is acyclic. For the solution to be useful we must have |S| = n and |V| = cn, for some constant c, such that acyclic graphs dominate the space of all random graphs. Havas et al. [10] proved that if |V| = cn holds with c > 2 the probability that G is acyclic is

\[ p = e^{1/c} \cdot \sqrt{\frac{c-2}{c}} \tag{2} \]

For c = 2.09 the probability of a random graph being acyclic is p > 1/3. Consequently, for such c, the expected number of iterations to obtain an acyclic graph is lower than 3 and the g function needs 2.09n integer numbers to be stored, since its domain is the set V. In this paper, the algorithm proposed by Czech, Havas and Majewski [1] will be referred to as CHM from now on.

Given an acyclic graph G, for the ordering step we associate with each edge a unique number h(e) ∈ [0, n − 1] in the order of the keys of S to obtain an order preserving function. Figure 2 illustrates the perfect assignment problem for an acyclic graph with six vertices and with the five table entries assigned to the edges.

The searching step starts from the weighted graph G obtained in the ordering step. For each connected component of G choose a vertex v and set g(v) to 0. For example, suppose that vertex 0 in Figure 2 is chosen and the assignment g(0) = 0 is made. Traverse the graph using a depth-first or a breadth-first search algorithm, beginning with vertex v. If vertex b is reached from vertex a and the value associated with the edge e = {a, b} is h(e), set g(b) to (h(e) − g(a)) mod n. In Figure 2, following the adjacency list of vertex 0, g(2) is set to 3. Next, following the adjacency list of vertex 2, g(1) is set to 2 and g(3) is set to 1, and so on.

Figure 2: Perfect assignment problem for a graph with six vertices and five edges.

Now we show why G must be acyclic. If the graph G was not acyclic, the assignment process might trace around a cycle and insist on reassigning some already-processed vertex with a different g value than the one that has already been assigned to it. For example, let us suppose that in Figure 2 the edge {3, 4} has been replaced by the edge {0, 1}. In this case, two different values are set to g(0). Following the adjacency list of vertex 1, g(0) is set to 4. But g(0) was set to 0 before.

5 The New Algorithm

In this section we present a new algorithm for constructing minimal perfect hash functions, where the order of the keys in S is not preserved. The algorithm is based on the MOS approach and solves the problem presented in Figure 1(b). The main novelty is that the random graph G might have cycles and even so we are able to find a MPHF.

The new algorithm looks for a function g : V → {−|V| + 1, ..., 0, 1, ..., |V| − 1} such that the function h : E → {0, 1, ..., m − 1} defined as

\[ h(e) = g(a) + g(b) \tag{3} \]

is a bijection, where e = {a, b}. This means that we are looking for an assignment of values to vertices so that for each edge the sum of values associated with endpoints is a unique integer in the range [0, m − 1]. Notice that we do not need to take the sum of values associated with endpoints of the edges modulo n.

Figure 3 presents a pseudo code for the new algorithm. The procedure NewAlgorithm (S, g) receives as input the set of keys from S and produces the perfect assignment of vertices represented by the function g. The mapping step generates a random undirected graph G taking S as input. The ordering step determines the order in which hash values are assigned to
keys. It partitions the graph G into Gcrit and Gncrit. The searching step produces the perfect assignment of vertices in G, which is represented by the function g. It starts with Gcrit and finishes with Gncrit.

procedure NewAlgorithm (S, g)
    Mapping (S, G);
    Ordering (G, Gcrit, Gncrit);
    Searching (G, Gcrit, Gncrit, g);

Figure 3: Main steps of the new algorithm.

5.1 Mapping Step

The procedure Mapping (S, G) receives as input the set of keys from S and generates a random undirected graph G without self-loops and multiple edges. To generate the MPHF, the number of critical edges in G must be |Ecrit| ≤ |E|/2. The reason is that the maximal value of h(e) assigned to an edge e ∈ E in this case is m − 1. In Section 5.3.1 we show that the condition |Ecrit| ≤ |E|/2 is necessary and sufficient to generate a MPHF.

The random graph G is generated using two hash functions h1 and h2. The functions h1 and h2 transform the keys from S to integers in [0, |V| − 1], so the set of vertices V has |V| vertices and each one of them is labelled with a distinct value from [0, |V| − 1]. For each key x from S the edge {h1(x), h2(x)} is added to E.

A self-loop occurs when h1(x) = h2(x). To avoid self-loops we modify h2(x) by adding a random number in the range [1, |V| − 1]. When a multiple edge occurs we abort and start a new iteration.

We now show that the expected number of iterations to obtain G is constant. Let p be the probability of generating a random graph G without self-loops and multiple edges. Let X be a random variable counting the number of iterations to generate G. Variable X is said to have the geometric distribution with P(X = i) = p(1 − p)^{i−1}. So, the expected number of iterations to generate G is

\[ N_i(X) = \sum_{j=1}^{\infty} j \, P(X = j) = 1/p \]

and its variance is V(X) = (1 − p)/p^2.

Let ξ be the space of edges in G that may be generated by h1 and h2. The graphs generated in this step are undirected and the number of possible edges in ξ is given by |ξ| = \binom{|V|}{2}. The number of possible edges that might become a multiple edge when the jth edge is added to G is j − 1, and the incremental construction of G implies that p(|V|) is:

\[ p(|V|) = \prod_{j=1}^{n} \frac{\binom{|V|}{2} - (j-1)}{\binom{|V|}{2}} = \prod_{j=0}^{n-1} \frac{\binom{|V|}{2} - j}{\binom{|V|}{2}}. \]

As |V| = cn we can rewrite the probability p(n) as:

\[ p(n) = \prod_{j=0}^{n-1} \left( 1 - \frac{2j}{c^2 n^2 - cn} \right). \]

Using an asymptotic estimate from Palmer [15], for two functions f1 : ℜ → ℜ and f2 : ℜ → ℜ defined by f1(k) = 1 − k and f2(k) = e^{−k}, the inequality f1(k) ≤ f2(k) is true ∀ k ∈ ℜ. Considering k = 2j/(c^2 n^2 − cn), we have

\[ p(n) \le \prod_{j=0}^{n-1} e^{-\frac{2j}{c^2 n^2 - cn}} = e^{-\frac{n-1}{c^2 n - c}}. \]

Thus,

\[ \lim_{n \to \infty} p(n) \simeq e^{-1/c^2}. \tag{4} \]

As Ni(X) = 1/p then Ni(X) ≃ e^{1/c^2}. After that, we empirically determine the c value to obtain a random graph G with |Ecrit| ≤ |E|/2. For this we built 10,000 graphs for each c value and number of keys presented in Table 1. The two collections used in the experiments (TodoBR and TREC-VLC2) are described in Table 4 (see Section 7 for more details).

Table 1 shows that the probability P|Ecrit| that |Ecrit| ≤ |E|/2, with |E| = n, tends to 0 when c < 1.15 and n increases. However, it tends to 1 when c ≥ 1.15 and n increases. Thus, |V| = 1.15n is considered a threshold function (a definition coined by Erdös and Rényi [3, 5]) for generating a random graph G where |Ecrit| ≤ |E|/2 with probability tending to 1 when n increases. Therefore, we use c = 1.15 in the new algorithm.

The MPHF generated by the new algorithm needs 1.15n integer numbers to be stored, since |V| = 1.15n. Thus, the generated function is stored in 55% (1.15n/2.09n) of the space necessary to store the one generated by the CHM algorithm.

As P|Ecrit| tends to 1 when n increases, we consider that the expected number of iterations to generate G is Ni(X) ≃ e^{1/c^2}. For c = 1.15, Ni(X) ≃ 2.13 on average, which is constant. So, the mapping step takes O(n) time.
                        VLC2 (n)                                        TodoBR (n)
c      1,000  10,000  100,000  1,000,000  3,000,000    1,000  10,000  100,000  1,000,000  3,000,000
1.10    0.01    0.00     0.00       0.00       0.00     0.02    0.00     0.00       0.00       0.00
1.11    0.04    0.00     0.00       0.00       0.00     0.05    0.00     0.00       0.00       0.00
1.12    0.12    0.00     0.00       0.00       0.00     0.13    0.00     0.00       0.00       0.00
1.13    0.19    0.03     0.00       0.00       0.00     0.20    0.02     0.00       0.00       0.00
1.14    0.30    0.09     0.00       0.00       0.00     0.31    0.11     0.00       0.00       0.00
1.15    0.50    0.56     0.65       0.89       1.00     0.51    0.57     0.65       0.88       1.00
1.16    0.68    0.70     0.88       0.95       1.00     0.70    0.83     0.95       0.95       1.00
1.17    0.77    0.82     0.90       1.00       1.00     0.78    0.99     0.98       1.00       1.00
1.18    0.91    0.97     0.98       1.00       1.00     0.91    1.00     1.00       1.00       1.00
1.19    0.94    1.00     1.00       1.00       1.00     0.95    1.00     1.00       1.00       1.00
1.20    0.98    1.00     1.00       1.00       1.00     0.99    1.00     1.00       1.00       1.00

Table 1: Probability P|Ecrit| that |Ecrit| ≤ n/2 for different c values and different number of keys for the collections VLC2 and TodoBR.
The new algorithm does not verify whether G really has at most 0.5n critical edges in the mapping step. The rationale is that P|Ecrit| tends to 1 when n increases. However, if some addition g(u) + g(w) is greater than m in the searching step for {u, w} ∈ E then the mapping step is restarted, as shown in line 17 of Figure 7.

Finally, to determine the vertices in Vscrit we collect all vertices v ∈ Vcrit with at least one vertex u that is in Adj(v) and in Vncrit, as the vertex 5 in Figure 4(d). This process takes O(|Vcrit|). Considering that |Vcrit| ≤ |V|, |Vncrit| ≤ |V| and |V| = cn, the ordering step takes O(n) time.
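The partition into Vcrit, Vncrit and Vscrit can be illustrated with a short Python sketch that repeatedly removes vertices of degree 1, so that only cycles and the chains connecting them remain. This is one possible way to compute the sets of Definitions 7 to 12 and is not taken from the paper's pseudo code; the function name `ordering` is hypothetical. The edge list follows the worked examples in the text, except for the edges {9, 10} and {11, 12}, which are assumed here because the text does not list them.

```python
from collections import deque

# Critical edges as named in the Figure 6 walk-through, then non critical edges
# from the Figure 8 walk-through; the last two edges are assumptions.
EDGES = [(5, 6), (5, 4), (6, 7), (6, 8), (7, 8), (4, 2), (4, 3), (2, 3),
         (5, 13), (13, 15), (13, 14), (0, 1), (9, 10), (11, 12)]


def ordering(num_vertices, edges):
    """Return (Vcrit, Vncrit, Vscrit, Ecrit, Encrit) by peeling degree-1 vertices."""
    adj = [set() for _ in range(num_vertices)]
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    degree = [len(neighbours) for neighbours in adj]
    removed = [False] * num_vertices
    queue = deque(v for v in range(num_vertices) if degree[v] == 1)
    while queue:
        v = queue.popleft()
        if degree[v] != 1:
            continue  # its only neighbour was peeled first
        removed[v] = True
        degree[v] = 0
        for u in adj[v]:
            if not removed[u]:
                degree[u] -= 1
                if degree[u] == 1:
                    queue.append(u)
    # Whatever still has degree >= 2 lies on a cycle or on a chain between cycles.
    v_crit = {v for v in range(num_vertices) if degree[v] >= 2}
    v_ncrit = set(range(num_vertices)) - v_crit
    e_crit = [(a, b) for a, b in edges if a in v_crit and b in v_crit]
    e_ncrit = [(a, b) for a, b in edges if a not in v_crit or b not in v_crit]
    v_scrit = {v for v in v_crit if any(u in v_ncrit for u in adj[v])}
    return v_crit, v_ncrit, v_scrit, e_crit, e_ncrit


v_crit, v_ncrit, v_scrit, e_crit, e_ncrit = ordering(16, EDGES)
```

Every vertex enters the queue at most once per degree change and every edge is inspected a constant number of times, which is consistent with the O(n) bound stated for the ordering step.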
[Figure, panels (a)-(d): graph on vertices 0 to 15 annotated with vertex degrees d and a queue Q; drawing and caption not recoverable.]

[Figure 6, panels (a)-(d): assignment of g values to the vertices of Gcrit; drawing not recoverable.]
in Gcrit is presented in Figure 6. Initially, a vertex v is chosen and the assignment g(v) = 0 is made. For example, suppose that vertex 5 in Figure 6(a) is chosen and the assignment g(5) = 0 is made. In Figure 6(b), following the adjacency list of vertex 5, g(6) is set to 1 and g(4) is set to 2, implying that addresses 1 and 2 must be assigned to edges {5, 6} and {5, 4}, respectively. At the same time, addresses 1 and 2 are added to the list of AssignedEdges. In Figure 6(c), following the adjacency list of vertex 6, g(7) is set to 3 and g(8) is set to 4, implying that addresses 4, 5 and 7 must be assigned to edges {6, 7}, {6, 8} and {7, 8}, respectively. Finally, in Figure 6(d), following the adjacency list of vertex 4, g(2) cannot be set to 5, because the sum g(2) + g(4) = 7 would reuse address 7, which is already assigned to edge {7, 8}. The next g value, 6, is used instead, and the assignments g(2) = 6 and g(3) = 7 are made, implying that addresses 8, 9 and 13 must be assigned to edges {4, 2}, {4, 3} and {2, 3}, respectively. This finishes the algorithm with AssignedEdges = {1, 2, 4, 5, 7, 8, 9, 13}.

A pseudo code for the assignment of values to critical vertices is presented in Figure 7. For all edges e = {u, w} ∈ E, g(u) + g(w) must be unique. If this constraint is not enforced then two different keys from S will be mapped to the same hash table location. Thus, the AssignedEdges array is used to ensure that g(u) + g(w) is distinct for all edges in E, as shown in line 18 of Figure 7. The variable Nextg represents g(u).

Now we define certain complexity measures used hereinafter:

1. Let I(u) be the number of iterations of the repeat-until loop from line 13 until line 19, when vertex u is assigned.
procedure CriticalVerticesAssignment (G, Gcrit, g, AssignedEdges)
1   for v ∈ Vcrit do g(v) := −∞;
2   for i := 0 to m − 1 do AssignedEdges[i] := false;
3   for v ∈ Vcrit do
4       if g(v) = −∞ then traverseBfs (G, v, Gcrit, g, AssignedEdges);
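Since only the first lines of the pseudo code are reproduced here, the following Python sketch shows one way to realize the greedy breadth-first assignment that the Figure 6 walk-through describes. It is an illustration under stated assumptions rather than the authors' Figure 7: the function name `assign_critical` is hypothetical, Gcrit is assumed to be connected so that a single start vertex suffices, and the adjacency lists are ordered so that the traversal matches the walk-through.

```python
from collections import deque

# Adjacency lists of Gcrit, ordered to follow the Figure 6 walk-through.
ADJ_CRIT = {
    5: [6, 4], 6: [5, 7, 8], 4: [5, 2, 3],
    7: [6, 8], 8: [6, 7], 2: [4, 3], 3: [4, 2],
}


def assign_critical(adj, m, root):
    """Greedy BFS assignment of g; returns (g, assigned) or None if a sum leaves [0, m - 1]."""
    g = {root: 0}
    assigned = [False] * m  # assigned[t] is True once address t belongs to an edge
    next_g = 1
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u in g:
                continue
            done = [w for w in adj[u] if w in g]  # neighbours that already have a g value
            # Skip candidate values whose edge sums would reuse an assigned address.
            while any(next_g + g[w] < m and assigned[next_g + g[w]] for w in done):
                next_g += 1
            if any(next_g + g[w] >= m for w in done):
                return None  # out of range: the mapping step would be restarted
            g[u] = next_g
            for w in done:
                assigned[next_g + g[w]] = True
            next_g += 1
            queue.append(u)
    return g, assigned


g_crit, assigned = assign_critical(ADJ_CRIT, 14, 5)
used_addresses = [t for t, used in enumerate(assigned) if used]
```

With m = 14 and start vertex 5, the sketch skips the candidate 5 for vertex 2 exactly as in the text and ends with the addresses 1, 2, 4, 5, 7, 8, 9 and 13 in use. Because candidate values only increase, all g values of critical vertices are distinct, so the sums produced for one vertex never collide with each other.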
2. Let Nt be the number of times that AssignedEdges[g(u) + g(w)] is true in the procedure CriticalVerticesAssignment. Thus,

\[ N_t = \sum_{u=1}^{|V_{crit}|} (I(u) - 1) \tag{5} \]

Maximal Value Assigned to An Edge

In this section we present the following conjecture.

Proof: In an undirected graph G, every edge of G is either a tree edge or a back edge. In the subgraph Gncrit there are no back edges because it is an acyclic graph. As shown by Erdös and Rényi [4, 5], when n tends to infinity the random graph G forms, with probability tending to 1, a giant component containing all cycles of G. So considering that Gcrit is connected, the number of tree edges is |Vcrit| − 1. It happens because we have only one tree connecting all vertices in Vcrit. As the total number of edges in Gcrit is |Ecrit| then Nbedges + (|Vcrit| − 1) = |Ecrit|. Thus,
not be assigned again. Consider now two possibilities: (i) If Nt = 0 then the g values will be assigned to vertices in Vcrit sequentially. Therefore, the greatest and the second greatest values assigned to u and w ∈ Vcrit are g(u) = |Vcrit| − 1 and g(w) = |Vcrit| − 2, respectively. Thus, Amax ≤ (|Vcrit| − 1) + (|Vcrit| − 2) since the edge {u, w} may be in Ecrit, in the worst case. (ii) If Nt > 0 then Nextg is incremented by one for each time the condition AssignedEdges[Nextg + g(w)] is true, as shown in line 15 of Figure 7. Thus, in the worst case,

                 P|Vcrit|
n             VLC2    TodoBR
1,000         0.51    0.52
10,000        0.76    0.77
100,000       0.98    0.98
1,000,000     1.00    1.00

Table 2: Probability P|Vcrit| that |Vcrit| ≤ 0.403n for different number of keys for the collections VLC2 and TodoBR.
[Figure 8, panels (a)-(e): assignment of g values to the vertices of Gncrit, with the UnAssignedEdges list shrinking from {0, 3, 6, 10, 11, 12} in (a) to empty in (e); drawing not recoverable.]
Considering that |Adj(u)| = davg on average and that |AssignedVertices| ≤ |Adj(u)|, then |AssignedVertices| ≤ davg. Thus,

\[ C(|V_{crit}|) \le \sum_{u=1}^{|V_{crit}|} \big( d_{avg} + (I(u) \times d_{avg}) + d_{avg} \big) \tag{6} \]

As presented before, davg is a constant and Nt ≤ Nbedges. Therefore, applying davg = 2|Ecrit|/|Vcrit| in Theorem 1 gives:

\begin{align*}
N_t &\le |E_{crit}| - |V_{crit}| + 1 \\
    &\le \frac{d_{avg}}{2} |V_{crit}| - |V_{crit}| + 1 \\
    &\le \left( \frac{d_{avg}}{2} - 1 \right) |V_{crit}| + 1
\end{align*}

Since davg is a constant then Nt = O(|Vcrit|). The number of times that AssignedEdges[Nextg + g(w)] = true is given by Eq. (5). Thus, I(u) must be a constant because

\[ \sum_{u=1}^{|V_{crit}|} (I(u) - 1) = N_t = O(|V_{crit}|). \]

Since I(u) and davg in Eq. (6) are constants, we have that C(|Vcrit|) = O(|Vcrit|). As |Vcrit| ≤ |V| and |V| = cn, the time complexity of the assignment of values to critical vertices is O(n).

5.3.2 Assignment of Values to Non Critical Vertices

The procedure NonCriticalVerticesAssignment (G, Gncrit, AssignedEdges, g) receives G, Gncrit and AssignedEdges as input and produces the assignment of values to vertices in Gncrit, represented by the array g. This finishes the perfect assignment of values to vertices of G. We use a depth-first search algorithm to assign values to vertices in Gncrit.

As Gncrit is acyclic, we can impose the order in which addresses are associated with edges in Gncrit. Therefore, in the assignment of values to vertices in Gncrit we place the unused addresses in the gaps left by the assignment of values to vertices in Gcrit. For that, we start the depth-first search from the vertices in Vscrit because these critical vertices were already assigned, so their g values cannot be changed.
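The gap-filling idea can be sketched in Python as follows. This is a minimal illustration, not the procedure of Figure 9: the function name `assign_non_critical` is hypothetical, an explicit stack replaces a recursive depth-first search, and the edges {9, 10} and {11, 12} are assumptions because the text does not list them. The g values of the critical vertices and the unused addresses are the ones stated in the paper's worked example.

```python
# g values of the critical vertices and the addresses left unused by Gcrit.
G_CRIT = {5: 0, 6: 1, 4: 2, 7: 3, 8: 4, 2: 6, 3: 7}
UNUSED = [0, 3, 6, 10, 11, 12]

# Adjacency lists of Gncrit; the edges {9, 10} and {11, 12} are assumed.
ADJ_NCRIT = {
    5: [13], 13: [5, 15, 14], 15: [13], 14: [13],
    0: [1], 1: [0], 9: [10], 10: [9], 11: [12], 12: [11],
}

CRIT_EDGES = [(5, 6), (5, 4), (6, 7), (6, 8), (7, 8), (4, 2), (4, 3), (2, 3)]
NCRIT_EDGES = [(5, 13), (13, 15), (13, 14), (0, 1), (9, 10), (11, 12)]


def assign_non_critical(adj, g_crit, unused_addresses, scrit):
    """Give every edge of the acyclic part the next unused address t via g(b) = t - g(a)."""
    g = dict(g_crit)
    unused = iter(unused_addresses)

    def traverse(start):
        stack = [start]
        while stack:
            a = stack.pop()
            for b in adj[a]:
                if b not in g:
                    g[b] = next(unused) - g[a]  # edge {a, b} receives this address
                    stack.append(b)

    for v in scrit:          # start from Vscrit, whose g values are fixed
        traverse(v)
    for v in sorted(adj):    # remaining trees: the root gets g = 0
        if v not in g:
            g[v] = 0
            traverse(v)
    return g


g = assign_non_critical(ADJ_NCRIT, G_CRIT, UNUSED, [5])
# Evaluating h(e) = g(a) + g(b) on every edge should hit each address exactly once.
addresses = sorted(g[a] + g[b] for a, b in CRIT_EDGES + NCRIT_EDGES)
```

On this input the sketch yields g(13) = 0, g(15) = 3, g(14) = 6, g(0) = 0 and g(1) = 10, and the 14 edge sums cover the addresses 0 to 13 without repetition, which is the bijection required by Eq. (3).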
Considering the subgraph Gncrit in Figure 4(d), a step by step example of the assignment of values to vertices in Gncrit is presented in Figure 8. Figure 8(a) presents the initial state of the algorithm. The critical vertex 5 is the only one that has non critical vertices as adjacent. In the example presented in Figure 6, the addresses {0, 3, 6, 10, 11, 12} were not used. So, taking the first unused address 0 and the vertex 13, which is reached from the vertex 5, the g value of vertex 13 is set to 0 − g(5) = 0, as shown in Figure 8(b). In Figure 8(c), using the unused addresses 3 and 6, the g values for vertices 15 and 14 are set to 3 − g(13) = 3 and to 6 − g(13) = 6, respectively. Vertices 0, 1, 9, 10, 11 and 12 were not assigned yet, so we continue the assignment of values to non critical vertices from vertex 0. In Figure 8(d), we set g(0) to 0. The only vertex that is reached from vertex 0 is vertex 1, so taking the unused address 10 we set g(1) to 10 − g(0) = 10. This process is repeated until the UnAssignedEdges list becomes empty. The final result is shown in Figure 8(e).

A pseudo code for the assignment of values to non critical vertices is presented in Figure 9.

Complexity Analysis

The assignment of values to vertices in Gncrit is a depth-first search algorithm. Then, its time complexity is O(|Vscrit| + |Vncrit| + |Encrit|). Considering that |Vncrit| ≤ |V|, |Vscrit| ≤ |V|, |V| = cn and |Encrit| ≤ n, the complexity of the assignment of values to non critical vertices is O(n).

6 MPHF Evaluation

Figure 10 presents a pseudo code to evaluate the MPHF generated by the new algorithm. The procedure h (x, g, h1, h2) receives as input a key x ∈ S, the g function, the tables used by h1 and h2 and returns the hash table address assigned to x.

procedure h (x, g, h1, h2)
    u := h1(x);
    v := h2(x);
    return (g(u) + g(v));

Figure 10: Evaluating the MPHF.

7 Experimental Results

In this section we present experimental results to show the efficiency of the new algorithm. Also, a comparison with algorithm CHM (proposed by Czech, Havas and Majewski [1]) is made.

The two algorithms were implemented in the C language. All experiments were carried out on a computer running the Linux operating system, version 2.6.7, with a 2.2 gigahertz Athlon processor and 1 gigabyte of main memory.

Collection    n             Key Size (Avg)
TodoBR        3,541,615     8.3
Random        10,000,000    20.0
VLC2          10,935,900    8.6
URLs          20,000,000    57.4

Table 4: Collections used in the experiments.

We used four collections in the experiments: (i) the vocabulary of the TodoBR search engine (http://www.todobr.com.br); (ii) a collection of keys generated randomly (Random); (iii) the vocabulary extracted from the TREC-VLC2 (Very Large Collection 2) collection [11]; (iv) a set of URLs crawled from the Web. Table 4 presents some details about the collections.

Table 5 presents the main characteristics of the two algorithms. The number of edges of graph G = (V, E) is equal to the size n of the set S of keys for the two algorithms. The number of vertices of G is equal to 1.15n and 2.09n for the new algorithm and the CHM algorithm, respectively. This measure is related to the amount of space to store the array g. The number of critical edges is 0.5|E| and 0, for the new algorithm and the CHM algorithm, respectively.

Characteristics     New algorithm    CHM
|E|                 n                n
|V|                 cn               cn
c                   1.15             2.09
|g|                 1.15n            2.09n
|Ecrit|             0.5|E|           0
G                   cyclic           acyclic
Order preserving    no               yes

Table 5: Main characteristics of the algorithms.

Table 6 presents time results for constructing MPHFs using the two algorithms. The table entries
                     New algorithm, c = 1.15                          CHM, c = 2.09
Collection    Ni     Mapping   Ordering   Searching   Total     Ni     Mapping + Ordering   Searching   Total
TodoBR        1.92    11.33      1.93       0.97       14.23    2.63        19.51              3.03      22.54
Random        1.77    41.90      7.17       3.70       52.77    2.96        59.92             10.31      70.23
VLC2          2.24    44.69      7.00       3.59       55.28    2.94        78.77             11.09      89.86
URLs          2.18   153.23     14.62       7.52      175.37    -           -                  -         -
the new algorithm is time optimal. Evaluating the generated function is very fast and the space needed to store it is O(n log n) bits. Experimental results show that the times to both generate the MPHF and compute a hash table entry by the new algorithm are better than the times obtained by the CHM algorithm, one of the fastest known algorithms.

References

[1] Z.J. Czech, G. Havas, and B.S. Majewski. An optimal algorithm for generating minimal perfect hash functions. Information Processing Letters, 43(5):257–264, 1992.

[2] Z.J. Czech, G. Havas, and B.S. Majewski. Fundamental study: Perfect hashing. Theoretical Computer Science, 182:1–143, 1997.

[3] P. Erdös and A. Rényi. On random graphs. Publicationes Mathematicae, 6:290–297, 1959.

[4] P. Erdös and A. Rényi. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 56:17–61, 1960.

[10] G. Havas, B.S. Majewski, N.C. Wormald, and Z.J. Czech. Graphs, hypergraphs and hashing. In 19th International Workshop on Graph-Theoretic Concepts in Computer Science, pages 153–165. Springer Lecture Notes in Computer Science vol. 790, 1993.

[11] D. Hawking. Overview of TREC-7 very large collection track (draft for notebook), 1998.

[12] D. E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley, second edition, 1973.

[13] B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods. The Computer Journal, 39(6):547–554, 1996.

[14] K. Mehlhorn. Data Structures and Algorithms 1: Sorting and Searching. Springer-Verlag, 1984.

[15] E. M. Palmer. Graphical Evolution: An Introduction to the Theory of Random Graphs. John Wiley & Sons, New York, 1985.