A New Algorithm for Constructing

Minimal Perfect Hash Functions

Fabiano C. Botelho David M. Gomes Nivio Ziviani

Department of Computer Science


Federal University of Minas Gerais
Belo Horizonte, Brazil
{fbotelho,menoti,nivio}@dcc.ufmg.br

Abstract

We present a three-step algorithm for generating minimal perfect hash functions which runs very fast in practice. The first step is probabilistic and involves the generation of random graphs. The second step determines the order in which hash values are assigned to keys. The third step assigns hash values to the keys. We give strong evidence that the first step takes linear random time and that the second and third steps take deterministic linear time. We improve upon the fastest known method for generating minimal perfect hash functions. The total time to find a minimal perfect hash function on a PC was approximately 175 seconds for a collection of 20 million keys. The time to compute a table entry for any key is also fast because it uses only two different hash functions that are computable in time proportional to the size of the key. The amount of space necessary to store the minimal perfect hash function is approximately half the space used by the fastest known algorithm.

1 Introduction

Let S be a set of n distinct keys belonging to a finite universe U of keys. The keys in S are stored so that membership queries asking if key x ∈ U is in S can be answered. This search problem is called the dictionary problem. Various approaches to the dictionary problem have been explored. One of them is to compute a function h(x) to determine the location of a key in a table, leading to a class of very efficient searching methods known as hashing methods.

Hashing methods for non-static sets of keys involve a certain amount of wasted space due to unused locations in a table and wasted time to resolve collisions when two keys are hashed to the same table location. If the set of keys is static, then it is possible to compute a function h(x) to find any key in the table in one probe (no collisions in this case). This function is called a perfect hash function. A perfect hash function that can preserve an a priori key ordering is called an order preserving function. A perfect hash function that stores a set of records in a table of size equal to the number of keys is called a minimal perfect hash function. A minimal perfect hash function totally avoids the common problem of wasted space and time.

Minimal perfect hash functions are used for memory efficient and fast retrieval of items from static sets, such as words in natural languages, reserved words in programming languages or interactive systems, uniform resource locators (URLs) in Web search engines, or item sets in data mining techniques. Therefore, there are applications for minimal perfect hash functions in information retrieval systems, database systems, hypertext, hypermedia, language translation systems, electronic commerce systems, compilers and operating systems, among others.

Finding perfect hash functions, especially for large sets, may not be easy since these functions are very rare. According to Knuth [12], the total number of possible hash functions from S (|S| = n) into [0, m − 1] (m ≥ n) is $m^n$ and only m(m − 1) . . . (m − n + 1) are perfect. Thus, the probability that no collisions occur is the ratio $(m(m-1)\cdots(m-n+1))/m^n$, which tends to zero very fast. For m = 13 and n = 10, the probability that no collisions occur is only about 0.0075.
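The ratio quoted from Knuth can be evaluated exactly for small cases. The following is a minimal sketch, not part of the original paper; the function name `perfect_hash_probability` is a hypothetical helper, and exact rational arithmetic is used only to avoid rounding questions.

```python
from fractions import Fraction


def perfect_hash_probability(m: int, n: int) -> Fraction:
    """Probability that n keys hashed uniformly into m slots never collide.

    Computes m(m-1)...(m-n+1) / m^n as an exact fraction.
    """
    prob = Fraction(1)
    for i in range(n):
        prob *= Fraction(m - i, m)
    return prob


if __name__ == "__main__":
    # The example from the text: m = 13 table slots, n = 10 keys.
    print(float(perfect_hash_probability(13, 10)))
```

Running the helper for growing n with m = n shows how quickly the ratio shrinks, which is why a random search over hash functions is not practical for large key sets.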
Many methods for generating minimal perfect hash functions use a mapping, ordering and searching (MOS) approach, a description coined by Fox, Chen and Heath [8]. In the MOS approach, the construction of a minimal perfect hash function is accomplished in three steps. First, the mapping step transforms the key set from the original universe to a new universe. Second, the ordering step places the keys in a sequential order that determines the order in which hash values are assigned to keys. Third, the searching step attempts to assign hash values to the keys.

In this paper we present a practical and efficient algorithm to find minimal perfect hash functions for very large key collections where no a priori key order must be maintained (e.g., for applications where ordered sequential access is not needed). The algorithm is based on the MOS approach and has linear time complexity with a small constant. For a collection of 20 million keys, the total time to find a minimal perfect hash function on a PC was approximately 175 seconds. The time to compute the table entry h(x) for key x is also very fast, as it uses only two different universal hash functions, each one computable in time proportional to the size of the key x. The space necessary to store the minimal perfect hash function for a set of n keys is O(log n) bits per key.

2 Related Work

Czech, Havas and Majewski [2] provide a comprehensive survey and review some of the most important theoretical results on perfect hashing. Mehlhorn [14] shows that the lower bound to store a perfect hash function is Ω(n/ log n) computer words, where n is the size of the key set. Fox et al. [7] show that the space lower bound to store an order preserving perfect hash function is Ω(n log n) bits.

Using the MOS approach, Fox et al. [8, 9] presented algorithms for finding minimal perfect hash functions in which the space to store the functions is close to the lower bound to store a minimal perfect hash function. However, in [2, Section 6.7] it is shown that their algorithms have exponential running times.

The works in [1, 2, 13] present a family of efficient and practical algorithms for generating order preserving minimal perfect hash functions. They present one of the best known random methods for generating minimal perfect hash functions. One of their algorithms involves the generation of acyclic random graphs G = (V, E), |V| = cn and |E| = n, with c ≥ 2.09.

The differences between our algorithm and the algorithms in [1, 2, 13] are as follows. First, we generate cyclic random graphs G = (V, E), |V| = cn and |E| = n, with c ≥ 1.15, while they generate acyclic random graphs with a greater number of vertices (|V| ≥ 2.09n). Second, their mapping step takes longer because they must generate an acyclic graph, while we do not need to check for this property. Third, the time to compute a table entry is shorter in our case because we save a modulo operation. Fourth, they generate order preserving minimal perfect hash functions while our algorithm does not preserve order.

3 Basic Concepts

This section presents the basic concepts that are used in the next sections.

Definition 1 Let U = {0, 1, . . . , u − 1} be the universe for some arbitrary positive integer u.

Definition 2 Let S be a set of n distinct keys belonging to U, i.e., n is the size of S.

Definition 3 Let h : U → M be a hashing function that maps the keys from S into a given interval of integers [0, m − 1]. Given a key x ∈ S, the hash function computes an integer in [0, m − 1] for the storage or retrieval of x in a hash table, i.e., m is the size of a hash table.

Definition 4 Let G = (V, E) be a random undirected graph without self-loops and multiple edges, where |E| = n, |V| = cn, generated using a variation of the uniform model [6]. In this model, at each step we generate an unordered pair e = {u, v}, where u and v ∈ V, from $\binom{|V|}{2}$ equally likely pairs. If the undirected edge e is neither a self-loop nor a multiple edge then it is added to G. A self-loop occurs when u = v. For an edge e1 = {u′, v′} already added to G, a multiple edge occurs when one of the conditions (u = u′ and v = v′) or (u = v′ and v = u′) is true.

Definition 5 Let Adj(v) be the adjacency list of a vertex v ∈ V.

Definition 6 Let $d_{avg} = 2|E|/|V|$ be the average degree of the vertices V of G.
Definition 7 Let Vcrit ⊆ V be a set of critical vertices. Subset Vcrit contains all vertices from V that are part of cycles or are in a chain connecting two or more cycles, as the vertex 5 in Figure 4.

Definition 8 Let Ecrit ⊆ E be a set of critical edges. Subset Ecrit contains all edges from E connecting critical vertices.

Definition 9 Let Vncrit = V − Vcrit be a set of non critical vertices. Subset Vncrit contains all vertices from V that are not part of cycles.

Definition 10 Let Vscrit ⊆ Vcrit be a set of all critical vertices that have at least one non critical vertex as adjacent, as the vertex 5 in Figure 4.

Definition 11 Let Encrit = E − Ecrit be a set of non critical edges.

Definition 12 Let Gcrit = (Vcrit, Ecrit) be a critical graph and let Gncrit = (Vncrit ∪ Vscrit, Encrit) be a non critical graph, where the critical subgraph Gcrit corresponds to the cyclic part of G and the non critical subgraph Gncrit corresponds to the acyclic part of G. Thus, G = Gcrit ∪ Gncrit.

Definition 13 Let P|Ecrit| be the probability that G has at most |Ecrit| critical edges.

Definition 14 Let P|Vcrit| be the probability that G has at most |Vcrit| critical vertices.

4 Minimal Perfect Hashing

In a hashing method, different keys might have the same address computed by the hash function, a situation called collision. In this case various schemes for resolving collisions are known. A perfect hash function is a function h : U → [0, m − 1] that is injective on S, which means that for all x, y ∈ S such that x ≠ y we have h(x) ≠ h(y), which implies that m ≥ n. Being an injection, a perfect hash function transforms each key of S into a unique address in the hash table, as depicted in Figure 1(a). Since no collisions occur, each key can be retrieved from the table in one probe. If m = n and h(x) is perfect, then h(x) is a minimal perfect hash function (MPHF), as depicted in Figure 1(b). The perfect hash function h is said to be order preserving if for any pair of keys xi, xj ∈ S then h(xi) < h(xj) if and only if i < j. In other words, the keys in S are arranged in some order and the function h preserves this order in the hash table.

[Figure 1, panels (a) and (b): a key set 0, 1, 2, ..., n − 1 mapped into a hash table with entries 0, 1, 2, ..., m − 1 in (a) and with entries 0, 1, 2, ..., n − 1 in (b).]

Figure 1: (a) Perfect hash function. (b) Minimal perfect hash function.

In the definitions above the keys to be placed in the hash table are integers in the interval [0, m − 1]. In practice, it is often the case that keys are sequences of characters over some finite and ordered alphabet Σ, such as the ASCII set. In this case, we convert each key to a random number modulo |V|. To obtain a random number for each key, we generate a table of random numbers, one for each possible character of Σ at each position i in the key. The construction of the minimal perfect hash function presumes the existence of two random and independent hash functions h1 and h2. For a key x containing |x| characters and two different tables of random numbers table1 and table2, the two hashing functions are:

$$h_1(x) = \left( \sum_{i=1}^{|x|} \mathrm{table}_1[i, x[i]] \right) \bmod |V|,$$

$$h_2(x) = \left( \sum_{i=1}^{|x|} \mathrm{table}_2[i, x[i]] \right) \bmod |V|.$$

Consider now a problem known as the perfect assignment problem: For a given undirected graph G = (V, E), where |V| = cn and |E| = n, find a function g : V → {0, 1, . . . , |V| − 1} such that the function h : E → {0, 1, . . . , n − 1}, defined as

$$h(e) = (g(a) + g(b)) \bmod n \tag{1}$$

is a bijection, where e = {a, b}. This means that we are looking for an assignment of values to vertices so that for each edge the sum of values associated with endpoints taken modulo the number of edges is a unique integer in the range [0, n − 1].

The ordering and searching steps of the MOS approach are a very simple way of solving the perfect assignment problem. Czech, Havas and Majewski [1] showed that the perfect assignment problem can be solved in optimal time if G is acyclic. To generate an acyclic graph two vertices h1(x) and h2(x) are computed for each key x ∈ S. Thus, set S has a corresponding graph G, with V = {0, 1, . . . , v} and E = {{h1(x), h2(x)} : x ∈ S}. In order to guarantee acyclicity the algorithm repeatedly selects h1 and h2 until the corresponding graph is acyclic. For the solution to be useful we must have |S| = n and |V| = cn, for some constant c, such that acyclic graphs dominate the space of all random graphs. Havas et al. [10] proved that if |V| = cn holds with c > 2 the probability that G is acyclic is

$$p = e^{1/c} \sqrt{\frac{c-2}{c}} \tag{2}$$

For c = 2.09 the probability of a random graph being acyclic is p > 1/3. Consequently, for such c, the expected number of iterations to obtain an acyclic graph is lower than 3 and the g function needs 2.09n integer numbers to be stored, since its domain is the set V. In this paper, the algorithm proposed by Czech, Havas and Majewski [1] will be referred to as CHM from now on.

Given an acyclic graph G, for the ordering step we associate with each edge a unique number h(e) ∈ [0, n − 1] in the order of the keys of S to obtain an order preserving function. Figure 2 illustrates the perfect assignment problem for an acyclic graph with six vertices and with the five table entries assigned to the edges.

The searching step starts from the weighted graph G obtained in the ordering step. For each connected component of G choose a vertex v and set g(v) to 0. For example, suppose that vertex 0 in Figure 2 is chosen and the assignment g(0) = 0 is made. Traverse the graph using a depth-first or a breadth-first search algorithm, beginning with vertex v. If vertex b is reached from vertex a and the value associated with the edge e = {a, b} is h(e), set g(b) to (h(e) − g(a)) mod n. In Figure 2, following the adjacency list of vertex 0, g(2) is set to 3. Next, following the adjacency list of vertex 2, g(1) is set to 2 and g(3) is set to 1, and so on.

[Figure 2: an acyclic graph with six vertices and five labelled edges, together with the table of assigned values g(0) = 0, g(1) = 2, g(2) = 3, g(3) = 1, g(4) = 0, g(5) = 1.]

Figure 2: Perfect assignment problem for a graph with six vertices and five edges.

Now we show why G must be acyclic. If the graph G were not acyclic, the assignment process might trace around a cycle and insist on reassigning some already-processed vertex with a different g value than the one that has already been assigned to it. For example, let us suppose that in Figure 2 the edge {3, 4} has been replaced by the edge {0, 1}. In this case, two different values are set to g(0). Following the adjacency list of vertex 1, g(0) is set to 4. But g(0) was set to 0 before.

5 The New Algorithm

In this section we present a new algorithm for constructing minimal perfect hash functions, where the order of the keys in S is not preserved. The algorithm is based on the MOS approach and solves the problem presented in Figure 1(b). The main novelty is that the random graph G might have cycles and even so we are able to find a MPHF.

The new algorithm looks for a function g : V → {−|V| + 1, . . . , 0, 1, . . . , |V| − 1} such that the function h : E → {0, 1, . . . , m − 1} defined as

$$h(e) = g(a) + g(b) \tag{3}$$

is a bijection, where e = {a, b}. This means that we are looking for an assignment of values to vertices so that for each edge the sum of values associated with endpoints is a unique integer in the range [0, m − 1]. Notice that we do not need to take the sum of values associated with endpoints of the edges modulo n.

Figure 3 presents a pseudo code for the new algorithm. The procedure NewAlgorithm (S, g) receives as input the set of keys from S and produces the perfect assignment of vertices represented by the function g. The mapping step generates a random undirected graph G taking S as input. The ordering step determines the order in which hash values are assigned to
keys. It partitions the graph G into Gcrit and Gncrit. The searching step produces the perfect assignment of vertices in G, which is represented by the function g. It starts with Gcrit and finishes with Gncrit.

procedure NewAlgorithm (S, g)
    Mapping (S, G);
    Ordering (G, Gcrit, Gncrit);
    Searching (G, Gcrit, Gncrit, g);

Figure 3: Main steps of the new algorithm.

5.1 Mapping Step

The procedure Mapping (S, G) receives as input the set of keys from S and generates a random undirected graph G without self-loops and multiple edges. To generate the MPHF, the number of critical edges in G must be $|E_{crit}| \le \frac{1}{2}|E|$. The reason is that the maximal value of h(e) assigned to an edge e ∈ E in this case is m − 1. In Section 5.3.1 we show that the condition $|E_{crit}| \le \frac{1}{2}|E|$ is necessary and sufficient to generate a MPHF.

The random graph G is generated using two hash functions h1 and h2. The functions h1 and h2 transform the keys from S to integers in [0, |V| − 1], so the set of vertices V has |V| vertices and each one of them is labelled with a distinct value from [0, |V| − 1]. For each key x from S the edge {h1(x), h2(x)} is added to E.

A self-loop occurs when h1(x) = h2(x). To avoid self-loops we modify h2(x) by adding a random number in the range [1, |V| − 1]. When a multiple edge occurs we abort and start a new iteration.

We now show that the expected number of iterations to obtain G is constant. Let p be the probability of generating a random graph G without self-loops and multiple edges. Let X be a random variable counting the number of iterations to generate G. Variable X is said to have the geometric distribution with $P(X = i) = p(1-p)^{i-1}$. So, the expected number of iterations to generate G is $N_i(X) = \sum_{j=1}^{\infty} j P(X = j) = 1/p$ and its variance is $V(X) = (1-p)/p^2$.

Let ξ be the space of edges in G that may be generated by h1 and h2. The graphs generated in this step are undirected and the number of possible edges in ξ is given by $|\xi| = \binom{|V|}{2}$. The number of possible edges that might become a multiple edge when the jth edge is added to G is j − 1, and the incremental construction of G implies that p(|V|) is:

$$p(|V|) = \prod_{j=1}^{n} \frac{\binom{|V|}{2} - (j-1)}{\binom{|V|}{2}} = \prod_{j=0}^{n-1} \frac{\binom{|V|}{2} - j}{\binom{|V|}{2}}.$$

As |V| = cn we can rewrite the probability p(n) as:

$$p(n) = \prod_{j=0}^{n-1} \left( 1 - \frac{2j}{c^2 n^2 - cn} \right).$$

Using an asymptotic estimate from Palmer [15], for two functions f1 : ℜ → ℜ and f2 : ℜ → ℜ defined by $f_1(k) = 1 - k$ and $f_2(k) = e^{-k}$, the inequality f1(k) ≤ f2(k) is true ∀ k ∈ ℜ. Considering $k = \frac{2j}{c^2 n^2 - cn}$, we have

$$p(n) \le \prod_{j=0}^{n-1} e^{-\frac{2j}{c^2 n^2 - cn}} = e^{-\frac{n-1}{c^2 n - c}}.$$

Thus,

$$\lim_{n \to \infty} p(n) \simeq e^{-\frac{1}{c^2}}. \tag{4}$$

As $N_i(X) = 1/p$ then $N_i(X) \simeq e^{1/c^2}$. After that, we empirically determine the c value to obtain a random graph G with $|E_{crit}| \le \frac{1}{2}|E|$. For this we built 10,000 graphs for each c value and number of keys presented in Table 1. The two collections used in the experiments (TodoBR and TREC-VLC2) are described in Table 4 (see Section 7 for more details).

We show in Table 1 that the probability P|Ecrit| that $|E_{crit}| \le \frac{1}{2}|E|$, |E| = n, tends to 0 when c < 1.15 and n increases. However, it tends to 1 when c ≥ 1.15 and n increases. Thus, |V| = 1.15n is considered a threshold function (a definition coined by Erdős and Rényi [3, 5]) for generating a random graph G where $|E_{crit}| \le \frac{1}{2}|E|$ with probability tending to 1 when n increases. Therefore, we use c = 1.15 in the new algorithm.

The MPHF generated by the new algorithm needs 1.15n integer numbers to be stored, since |V| = 1.15n. Thus, the generated function is stored in 55% (1.15n/2.09n) of the space necessary to store the one generated by the CHM algorithm.

As P|Ecrit| tends to 1 when n increases, we consider that the expected number of iterations to generate G is $N_i(X) \simeq e^{1/c^2}$. For c = 1.15, $N_i(X) \simeq 2.13$ on average, which is constant. So, the mapping step takes O(n) time.
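The constants quoted for CHM and for the new algorithm follow directly from Eq. (2) and Eq. (4). The short sketch below is not part of the original paper; the function names are hypothetical helpers that simply evaluate those two formulas and the storage ratio 1.15n/2.09n.

```python
import math


def chm_acyclic_probability(c: float) -> float:
    """Eq. (2): probability that a random graph with |V| = cn (c > 2) is acyclic."""
    return math.exp(1.0 / c) * math.sqrt((c - 2.0) / c)


def expected_mapping_iterations(c: float) -> float:
    """N_i(X) ~ e^(1/c^2), the reciprocal of the limit in Eq. (4)."""
    return math.exp(1.0 / (c * c))


if __name__ == "__main__":
    p_chm = chm_acyclic_probability(2.09)
    print("CHM acyclic probability for c = 2.09:", p_chm)
    print("CHM expected iterations:", 1.0 / p_chm)
    print("New algorithm expected iterations for c = 1.15:",
          expected_mapping_iterations(1.15))
    print("Storage ratio 1.15n / 2.09n:", 1.15 / 2.09)
```

Under these formulas both mapping steps need a constant expected number of attempts; the saving claimed for the new algorithm comes from the smaller vertex set, not from fewer attempts.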
                          VLC2 (n)                                          TodoBR (n)
c      1,000  10,000  100,000  1,000,000  3,000,000     1,000  10,000  100,000  1,000,000  3,000,000
1.10   0.01   0.00    0.00     0.00       0.00          0.02   0.00    0.00     0.00       0.00
1.11   0.04   0.00    0.00     0.00       0.00          0.05   0.00    0.00     0.00       0.00
1.12   0.12   0.00    0.00     0.00       0.00          0.13   0.00    0.00     0.00       0.00
1.13   0.19   0.03    0.00     0.00       0.00          0.20   0.02    0.00     0.00       0.00
1.14   0.30   0.09    0.00     0.00       0.00          0.31   0.11    0.00     0.00       0.00
1.15   0.50   0.56    0.65     0.89       1.00          0.51   0.57    0.65     0.88       1.00
1.16   0.68   0.70    0.88     0.95       1.00          0.70   0.83    0.95     0.95       1.00
1.17   0.77   0.82    0.90     1.00       1.00          0.78   0.99    0.98     1.00       1.00
1.18   0.91   0.97    0.98     1.00       1.00          0.91   1.00    1.00     1.00       1.00
1.19   0.94   1.00    1.00     1.00       1.00          0.95   1.00    1.00     1.00       1.00
1.20   0.98   1.00    1.00     1.00       1.00          0.99   1.00    1.00     1.00       1.00

Table 1: Probability P|Ecrit| that |Ecrit| ≤ n/2 for different c values and different numbers of keys for the collections VLC2 and TodoBR.
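The mapping step of Section 5.1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: keys are assumed to be byte strings, the names `make_table`, `table_hash` and `mapping` are hypothetical, and a single random offset per attempt is used to repair self-loops so that the repair could be repeated at lookup time (the paper only says that a random number in [1, |V| − 1] is added to h2(x)).

```python
import math
import random

ALPHABET_SIZE = 256  # assumption: keys are byte strings


def make_table(max_len, rng):
    """One random number per (position, byte value) pair."""
    return [[rng.randrange(1 << 30) for _ in range(ALPHABET_SIZE)]
            for _ in range(max_len)]


def table_hash(table, key, num_vertices):
    """h(x) = (sum over i of table[i][x[i]]) mod |V|."""
    return sum(table[i][byte] for i, byte in enumerate(key)) % num_vertices


def mapping(keys, c=1.15, seed=0, max_tries=1000):
    """Return (num_vertices, edges) for a graph without self-loops or multiple edges."""
    rng = random.Random(seed)
    num_vertices = math.ceil(c * len(keys))
    max_len = max(len(key) for key in keys)
    for _ in range(max_tries):
        table1 = make_table(max_len, rng)
        table2 = make_table(max_len, rng)
        # Assumption: one offset per attempt, so a lookup could repeat the fix.
        offset = rng.randrange(1, num_vertices)
        edges, seen = [], set()
        for key in keys:
            a = table_hash(table1, key, num_vertices)
            b = table_hash(table2, key, num_vertices)
            if a == b:  # self-loop: shift h2(x) by the offset
                b = (b + offset) % num_vertices
            edge = (min(a, b), max(a, b))
            if edge in seen:  # multiple edge: abort this attempt
                break
            seen.add(edge)
            edges.append(edge)
        else:
            return num_vertices, edges
    raise RuntimeError("no simple graph found; try another seed")


keys = [f"key{i}".encode() for i in range(200)]
num_vertices, edges = mapping(keys)
```

The sketch returns one edge per key, in key order, which is the input the ordering step expects. It does not check the number of critical edges, in line with the paper's choice to detect that failure later in the searching step.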

The new algorithm does not verify if G really has at most 0.5n critical edges in the mapping step. The rationale is that P|Ecrit| tends to 1 when n increases. However, if some addition g(u) + g(w) is greater than m − 1 in the searching step for {u, w} ∈ E then the mapping step is restarted, as shown in line 17 of Figure 7.

5.2 Ordering Step

The procedure Ordering (G, Gcrit, Gncrit) receives as input the graph G and partitions G into two subgraphs Gcrit and Gncrit. To partition the graph G into Gcrit and Gncrit we use an optimal time algorithm, as follows. Figure 4 presents a sample graph with 16 vertices and 14 edges, where the degree of a vertex is shown beside each vertex. Initially, all vertices with degree 1 are added to a queue Q. For the example shown in Figure 4(a), Q = {14, 15, 9, 10, 0, 1, 11, 12} after the initialization step. This initialization takes O(|V|) time, because we need to check the degree of each vertex from V.

Next, we remove one vertex v from the queue, decrement its degree and the degree of vertices with degree greater than 0 in the adjacency list of v, as depicted in Figure 4(b) for v = 14. At this point, the adjacencies of v with degree 1 are inserted into the queue, such as vertex 13 in Figure 4(c). This process is repeated until the queue becomes empty. All vertices with degree 0 are non critical vertices and the others are critical vertices, as depicted in Figure 4(d). This process takes O(|Vncrit|), because each non critical vertex is removed from the queue only once.

Finally, to determine the vertices in Vscrit we collect all vertices v ∈ Vcrit with at least one vertex u that is in Adj(v) and in Vncrit, as the vertex 5 in Figure 4(d). This process takes O(|Vcrit|). Considering that |Vcrit| ≤ |V|, |Vncrit| ≤ |V| and |V| = cn, the ordering step takes O(n) time.

5.3 Searching Step

The procedure Searching (G, Gcrit, Gncrit, g) receives as input G, Gcrit, Gncrit and finds a $\log_2 |V| + 1$ bit value for each vertex v ∈ V, stored in the array g. A pseudo code for the searching step is presented in Figure 5. The searching step is first performed for the vertices in Gcrit and second for the vertices in Gncrit.

5.3.1 Assignment of Values to Critical Vertices

The procedure CriticalVerticesAssignment (G, Gcrit, g, AssignedEdges) receives G and Gcrit as input and produces as output a g value for each vertex in Gcrit and the AssignedEdges array. This array has m entries and indicates the edges for which a value h(e) ∈ [0, m − 1], e ∈ Ecrit, has already been assigned. We use a breadth-first search algorithm to assign values to each vertex in Gcrit. The reason we start the assignment of values with the vertices in Gcrit is to resolve reassignments as early as possible. The reassignment problem is illustrated in the next paragraph.

Considering the subgraph Gcrit in Figure 4(d), a step by step example of the assignment of values to vertices
[Figure 4, panels (a) to (d): the sample graph with the degree d of each vertex and the contents of the queue Q at each stage. (a) Q = 14, 15, 9, 10, 0, 1, 11, 12; (b) Q = 15, 9, 10, 0, 1, 11, 12; (c) Q = 9, 10, 0, 1, 11, 12, 13; (d) Q is empty.]

Figure 4: Ordering step for a graph with 16 vertices and 14 edges.
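The ordering step of Section 5.2 amounts to repeatedly peeling vertices of degree 1. The sketch below is an illustration under stated assumptions rather than the authors' code: the function name `ordering` is hypothetical, the eight critical edges are the ones named in the walk-through of Section 5.3.1, and the three tree edges attached to vertex 5 are invented for the example because the full edge list of Figure 4 is not given in the text.

```python
from collections import deque


def ordering(num_vertices, edges):
    """Return (critical vertices, Vscrit) by peeling degree-1 vertices."""
    adj = [[] for _ in range(num_vertices)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    degree = [len(neighbours) for neighbours in adj]

    # Initialization: every vertex of degree 1 enters the queue.
    queue = deque(v for v in range(num_vertices) if degree[v] == 1)
    while queue:
        v = queue.popleft()
        if degree[v] == 0:
            continue  # its only edge was already removed from the other end
        degree[v] = 0
        for u in adj[v]:
            if degree[u] > 0:
                degree[u] -= 1
                if degree[u] == 1:
                    queue.append(u)

    # Vertices left with a positive degree lie on cycles or between cycles.
    critical = {v for v in range(num_vertices) if degree[v] > 0}
    # Vscrit: critical vertices with at least one non critical neighbour.
    vscrit = {v for v in critical if any(u not in critical for u in adj[v])}
    return critical, vscrit


# Critical edges taken from the example in Section 5.3.1.
critical_edges = [(5, 6), (5, 4), (6, 7), (6, 8), (7, 8), (4, 2), (4, 3), (2, 3)]
# Hypothetical acyclic part hanging off vertex 5 (not the real Figure 4 edges).
tree_edges = [(5, 13), (13, 14), (13, 15)]
critical, vscrit = ordering(16, critical_edges + tree_edges)
```

Each vertex enters the queue at most once, so the loop does work proportional to the number of vertices plus edges, which matches the linear bound claimed for the ordering step.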

procedure Searching (G, Gcrit, Gncrit, g)
    CriticalVerticesAssignment (G, Gcrit, g, AssignedEdges);
    NonCriticalVerticesAssignment (G, Gncrit, AssignedEdges, g);

Figure 5: Pseudo code for the searching algorithm.

[Figure 6, panels (a) to (d): the critical subgraph Gcrit with the g values assigned at each step. The final panel shows g(5) = 0, g(6) = 1, g(4) = 2, g(7) = 3, g(8) = 4, g(2) = 6 and g(3) = 7.]

Figure 6: Example of the critical vertices assignment.

in Gcrit is presented in Figure 6. Initially, a vertex v is chosen and the assignment g(v) = 0 is made. For example, suppose that vertex 5 in Figure 6(a) is chosen and the assignment g(5) = 0 is made. In Figure 6(b), following the adjacency list of vertex 5, g(6) is set to 1 and g(4) is set to 2, implying that addresses 1 and 2 must be assigned to edges {5, 6} and {5, 4}, respectively. At the same time, addresses 1 and 2 are added to the list of AssignedEdges. In Figure 6(c), following the adjacency list of vertex 6, g(7) is set to 3 and g(8) is set to 4, implying that addresses 4, 5 and 7 must be assigned to edges {6, 7}, {6, 8} and {7, 8}, respectively. Finally, in Figure 6(d), following the adjacency list of vertex 4, g(2) cannot be set to 5 because the sum g(2) + g(4) would collide with address 7, already assigned to edge {7, 8}, so the next g value 6 is used instead, and the assignments g(2) = 6 and g(3) = 7 are made, implying that addresses 8, 9 and 13 must be assigned to edges {4, 2}, {4, 3} and {2, 3}, respectively. This finishes the algorithm with AssignedEdges = {1, 2, 4, 5, 7, 8, 9, 13}.

A pseudo code for the assignment of values to critical vertices is presented in Figure 7. For all edges e = {u, w} ∈ E, g(u) + g(w) must be unique. If this constraint is not enforced then two different keys from S will be mapped to the same hash table location. Thus, the AssignedEdges array is used to ensure that g(u) + g(w) is distinct for all edges in E, as shown in line 18 of Figure 7. The variable Nextg represents g(u).

Now we define certain complexity measures used hereinafter:

1. Let I(u) be the number of iterations of the repeat-until loop from line 13 to line 19 when vertex u is assigned.
procedure CriticalVerticesAssignment (G, Gcrit, g, AssignedEdges)
1   for v ∈ Vcrit do g(v) := −∞;
2   for i := 0 to m − 1 do AssignedEdges[i] := false;
3   for v ∈ Vcrit do
4       if g(v) = −∞ then traverseBfs (G, v, Gcrit, g, AssignedEdges);

procedure traverseBfs (G, v, Gcrit, g, AssignedEdges)
5   Nextg := 0;
6   g(v) := Nextg;
7   EnQueue (v, Q);
8   while Q ≠ ∅ do
9       v := DeQueue (Q);
10      for u ∈ Adj(v) and g(u) = −∞ do
11          AssignedVertices := ∅;
12          for w ∈ Adj(u) and g(w) ≠ −∞ do AssignedVertices := AssignedVertices ∪ {w};
13          repeat
14              NoAssignedEdges := true;
15              Nextg := Nextg + 1;
16              for w ∈ AssignedVertices and NoAssignedEdges = true do
17                  if (Nextg + g(w)) ≥ m then restart mapping step;
18                  if AssignedEdges[Nextg + g(w)] = true then NoAssignedEdges := false;
19          until NoAssignedEdges = true;
20          g(u) := Nextg; {set the g value of vertex u, changing g(u) from −∞ to Nextg}
21          for w ∈ AssignedVertices do AssignedEdges[Nextg + g(w)] := true;
22          EnQueue (u, Q);

Figure 7: The critical vertices assignment algorithm.

2. Let Nt be the number of times that AssignedEdges[g(u) + g(w)] is true in the procedure CriticalVerticesAssignment. Thus,

$$N_t = \sum_{u=1}^{|V_{crit}|} (I(u) - 1) \tag{5}$$

Maximal Value Assigned to an Edge

In this section we present the following conjecture.

Conjecture 1 For a random graph G with |Ecrit| = 0.5n and |V| = 1.15n, it is always possible to generate a MPHF because the maximal value Amax assigned to an edge e ∈ Ecrit is at most m − 1 (Amax corresponds to the maximal value generated by the assignment of values to critical vertices in Eq. (3)).

Next, we present two auxiliary theorems that will help us in the discussion of Conjecture 1.

Theorem 1 The number of back edges Nbedges of a random graph G = Gcrit ∪ Gncrit is given by: Nbedges = |Ecrit| − |Vcrit| + 1.

Proof: In an undirected graph G, every edge of G is either a tree edge or a back edge. In the subgraph Gncrit there are no back edges because it is an acyclic graph. As shown by Erdős and Rényi [4, 5], when n tends to infinity the random graph G forms, with probability tending to 1, a giant component containing all cycles of G. So, considering that Gcrit is connected, the number of tree edges is |Vcrit| − 1, because there is only one tree connecting all vertices in Vcrit. As the total number of edges in Gcrit is |Ecrit|, then Nbedges + (|Vcrit| − 1) = |Ecrit|. Thus, Nbedges = |Ecrit| − |Vcrit| + 1. □

Theorem 2 The maximal value Amax assigned to an edge e ∈ Ecrit in the assignment of values to critical vertices is: Amax ≤ 2|Vcrit| − 3 + 2Nt.

Proof: We start the assignment of values to critical vertices using the sequence {0, 1, . . . , Nextg} so that each edge receives the sum of the values associated with its endpoints. The g value for each vertex u in Vcrit is assigned only once, because a g value is assigned to a vertex u if and only if g(u) = −∞. Thus, after g(u) changes from −∞ to the value stored in Nextg, the condition g(u) = −∞ becomes false and g(u) will
not be assigned again. Consider now two possibilities: P|Vcrit |
n
(i) If Nt = 0 then the g values will be assigned to ver- VLC2 TodoBR
tices in Vcrit sequentially. Therefore, the greatest and the second greatest values assigned to u and w ∈ Vcrit are g(u) = |Vcrit| − 1 and g(w) = |Vcrit| − 2, respectively. Thus, Amax ≤ (|Vcrit| − 1) + (|Vcrit| − 2), since the edge {u, w} may be in Ecrit in the worst case. (ii) If Nt > 0 then Nextg is incremented by one each time the condition AssignedEdges[Nextg + g(w)] is true, as shown in line 15 of Figure 7. Thus, in the worst case,

\[
\begin{aligned}
A_{max} &\le (|V_{crit}| - 1 + N_t) + (|V_{crit}| - 2 + N_t) \\
A_{max} &\le 2|V_{crit}| - 3 + 2N_t. \qquad \Box
\end{aligned}
\]

Let us now resume the discussion of Conjecture 1. Let us consider that Nt ≤ Nbedges when the average degree of vertices (davg) in Gcrit is a constant. Substituting Nt ≤ Nbedges in Theorem 2 gives:

\[
A_{max} \le 2|V_{crit}| - 3 + 2N_{bedges}
\]

Substituting the value of Nbedges from Theorem 1 gives:

\[
A_{max} \le 2|V_{crit}| - 3 + 2(|E_{crit}| - |V_{crit}| + 1)
\]

Applying Definition 6 in Gcrit we obtain davg = 2|Ecrit|/|Vcrit|. This implies that |Ecrit| = (davg/2)|Vcrit|. Thus,

\[
\begin{aligned}
A_{max} &\le 2|V_{crit}| - 3 + 2\left(\frac{d_{avg}}{2}|V_{crit}| - |V_{crit}| + 1\right) \\
&\le 2|V_{crit}| - 3 + (d_{avg} - 2)|V_{crit}| + 2 \\
&\le d_{avg}|V_{crit}| - 1 \\
&\le \frac{2|E_{crit}|}{|V_{crit}|}|V_{crit}| - 1 \\
&\le 2|E_{crit}| - 1
\end{aligned}
\]

As |Ecrit| = 0.5n and n = m then Amax ≤ n − 1 ≤ m − 1.

We now show evidence that Nt ≤ Nbedges when davg is a constant. As shown in Section 5.1, |Ecrit| ≤ 0.5n with probability tending to 1 when n increases. So, in order to obtain the average degree davg of vertices in Gcrit we empirically determined that |Vcrit| ≤ 0.35|V|. As |V| = 1.15n then |Vcrit| ≤ 0.403n. Table 2 presents the probability P|Vcrit| that |Vcrit| ≤ 0.403n. As P|Vcrit| tends to 1 when n increases, davg = 2 × 0.5n/0.403n = 2.48 is a constant value. We built 10,000 graphs for each number of keys.

        1,000    0.51    0.52
       10,000    0.76    0.77
      100,000    0.98    0.98
    1,000,000    1.00    1.00

Table 2: Probability P|Vcrit| that |Vcrit| ≤ 0.403n for different numbers of keys for the collections VLC2 and TodoBR.

Finally, we show experimental evidence that Nt ≤ Nbedges. The expected values for |Vcrit| and |Ecrit| are 0.403n and 0.5n, respectively. Then, by Theorem 1, Nbedges = 0.5n − 0.403n + 1 = 0.097n + 1. In Table 3 we show the maximal value of Nt obtained during 10,000 executions of the new algorithm for different sizes of S. As shown in Table 3, the maximal value of Nt is smaller than Nbedges = 0.097n + 1. So, the experiments indicate that Conjecture 1 holds for c = 1.15.

                     Maximal value of Nt
    n                VLC2       TodoBR
        1,000        0.085n     0.093n
       10,000        0.067n     0.069n
      100,000        0.061n     0.061n
    1,000,000        0.059n     0.059n

Table 3: The maximal value of Nt for different sizes of S for the collections VLC2 and TodoBR.

Complexity Analysis

We now show that the time complexity of the pseudo code presented in Figure 7 is O(|Vcrit|). For each unassigned vertex u, Adj(u) must be scanned with complexity |Adj(u)| in order to obtain (in AssignedVertices) the adjacencies of u that have already been assigned, as shown in lines 11 and 12. For each iteration of the repeat-until loop, |AssignedVertices| vertices must be scanned, as shown from lines 13 to 19. As each critical vertex is assigned only once and |AssignedVertices| vertices must be scanned to update the AssignedEdges array (as shown in line 21), the time complexity is given by

\[
C(|V_{crit}|) = \sum_{u=1}^{|V_{crit}|} \left[\, |Adj(u)| + (I(u) \times |AssignedVertices|) + |AssignedVertices| \,\right]
\]
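The constants quoted in the discussion of Conjecture 1 (davg = 2.48, Nbedges = 0.097n + 1 and Amax ≤ n − 1) can be re-derived with a few lines of arithmetic. The following is only an illustrative sketch: the function and variable names are ours, not the paper's, and the coefficients 0.403 and 0.5 are the empirical values reported in the text.

```python
def conjecture_constants(n, vcrit_coeff=0.403, ecrit_coeff=0.5):
    """Re-derive davg, Nbedges and the Amax bound from |Vcrit| = 0.403n, |Ecrit| = 0.5n.

    The names here are illustrative; only the formulas come from the text.
    """
    v_crit = vcrit_coeff * n
    e_crit = ecrit_coeff * n
    d_avg = 2 * e_crit / v_crit        # Definition 6 applied to Gcrit
    n_bedges = e_crit - v_crit + 1     # value of Nbedges used with Theorem 1
    a_max_bound = d_avg * v_crit - 1   # simplifies to 2|Ecrit| - 1 = n - 1
    return d_avg, n_bedges, a_max_bound


# Largest observed Nt / n for (VLC2, TodoBR), copied from Table 3.
TABLE3_MAX_NT = {
    1_000: (0.085, 0.093),
    10_000: (0.067, 0.069),
    100_000: (0.061, 0.061),
    1_000_000: (0.059, 0.059),
}


def table3_within_bound():
    """Return True if every Table 3 entry stays below Nbedges = 0.097n + 1."""
    for n, coeffs in TABLE3_MAX_NT.items():
        _, n_bedges, _ = conjecture_constants(n)
        if any(coeff * n >= n_bedges for coeff in coeffs):
            return False
    return True


if __name__ == "__main__":
    d_avg, n_bedges, a_max_bound = conjecture_constants(1_000)
    print(f"davg = {d_avg:.2f}, Nbedges = {n_bedges:.1f}, Amax bound = {a_max_bound:.1f}")
    print("Table 3 within bound:", table3_within_bound())
```

The check only confirms that the reported numbers are mutually consistent; it says nothing about whether Nt ≤ Nbedges holds for graphs outside the reported experiments.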

[Figure 8, panels (a) to (e): the graph after each assignment step, annotated with the g values set so far. The UnAssignedEdges list shown under the panels is {0, 3, 6, 10, 11, 12} in (a), {3, 6, 10, 11, 12} in (b), {10, 11, 12} in (c), {11, 12} in (d) and empty in (e).]

Figure 8: Example of the non critical vertices assignment.

procedure NonCriticalVerticesAssignment ( G , Gncrit , AssignedEdges , g )
    for i := 0 to m − 1 do
        if AssignedEdges [ i ] = false then UnAssignedEdges := UnAssignedEdges ∪ {i} ;
    for v ∈ Vscrit do traverseDfs ( G , v , Gncrit , g(v) , g , UnAssignedEdges ) ;
    for v ∈ Vncrit and g(v) = −∞ do traverseDfs ( G , v , Gncrit , 0 , g , UnAssignedEdges ) ;

procedure traverseDfs ( G , v , Gncrit , gValue , g , UnAssignedEdges )
    g(v) := gValue ;
    for u ∈ Adj (v) and g(u) = −∞ do
        gValue := NextUnusedAddress ( UnAssignedEdges ) − g(v) ;
        traverseDfs ( G , u , Gncrit , gValue , g , UnAssignedEdges ) ;

Figure 9: The algorithm to assign values to non critical vertices.
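As a rough illustration of Figure 9, the depth-first assignment can be sketched in Python. This is not the authors' C implementation: the adjacency-dictionary representation, the use of None in place of g(v) = −∞ and all identifiers are our assumptions. The example graph is a partial reconstruction from the walk-through in the text (edges {5,13}, {13,15}, {13,14} and {0,1} only); vertices 9 to 12 and addresses 11 and 12 are left out because the text does not give their edges.

```python
from collections import deque

UNASSIGNED = None  # stands in for the paper's g(v) = -infinity


def traverse_dfs(adj, v, g_value, g, unused):
    """Assign g(v), then give each unvisited neighbour the next unused address."""
    g[v] = g_value
    for u in adj[v]:
        if g[u] is UNASSIGNED:
            # Edge {v, u} receives address a, which forces g(u) = a - g(v).
            address = unused.popleft()
            traverse_dfs(adj, u, address - g[v], g, unused)


def assign_non_critical(adj, g, v_scrit, v_ncrit, unused_addresses):
    """Sketch of NonCriticalVerticesAssignment over the edges of Gncrit.

    adj holds only non critical edges; g already contains the values of the
    critical vertices. Recursion depth grows with the tree height, so a real
    implementation for large graphs would use an explicit stack.
    """
    unused = deque(unused_addresses)
    for v in v_scrit:                 # start from already assigned critical vertices
        traverse_dfs(adj, v, g[v], g, unused)
    for v in v_ncrit:                 # then cover trees not attached to Gcrit
        if g[v] is UNASSIGNED:
            traverse_dfs(adj, v, 0, g, unused)
    return g


def figure8_partial_example():
    """Partial version of the Figure 8 walk-through (hypothetical adjacency order)."""
    adj = {5: [13], 13: [5, 15, 14], 15: [13], 14: [13], 0: [1], 1: [0]}
    g = {5: 0, 13: UNASSIGNED, 15: UNASSIGNED, 14: UNASSIGNED,
         0: UNASSIGNED, 1: UNASSIGNED}
    return assign_non_critical(adj, g, v_scrit=[5], v_ncrit=[0, 1, 13, 14, 15],
                               unused_addresses=[0, 3, 6, 10])


if __name__ == "__main__":
    print(figure8_partial_example())
```

In this sketch each non critical edge {u, v} ends up with address g(u) + g(v) equal to one of the unused addresses, which is what allows the gaps left by the critical edges to be filled one by one.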

Considering that |Adj(u)| = davg on average and that |AssignedVertices| ≤ |Adj(u)|, then |AssignedVertices| ≤ davg. Thus,

\[
C(|V_{crit}|) \le \sum_{u=1}^{|V_{crit}|} \left( d_{avg} + (I(u) \times d_{avg}) + d_{avg} \right) \qquad (6)
\]

As presented before, davg is a constant and Nt ≤ Nbedges. Therefore, applying davg = 2|Ecrit|/|Vcrit| in Theorem 1 gives:

\[
\begin{aligned}
N_t &\le |E_{crit}| - |V_{crit}| + 1 \\
&\le \frac{d_{avg}}{2}|V_{crit}| - |V_{crit}| + 1 \\
&\le \left(\frac{d_{avg}}{2} - 1\right)|V_{crit}| + 1
\end{aligned}
\]

Since davg is a constant then Nt = O(|Vcrit|). The number of times that AssignedEdges[Nextg + g(w)] = true is given by Eq. (5). Thus, I(u) must be a constant on average because

\[
\sum_{u=1}^{|V_{crit}|} (I(u) - 1) = N_t = O(|V_{crit}|).
\]

Equivalently, the sum in Eq. (6) equals davg (3|Vcrit| + Nt). Since I(u) and davg in Eq. (6) are constants, we have that C(|Vcrit|) = O(|Vcrit|). As |Vcrit| ≤ |V| and |V| = cn, the time complexity of the assignment of values to critical vertices is O(n).

5.3.2 Assignment of Values to Non Critical Vertices

The procedure NonCriticalVerticesAssignment (G, Gncrit, AssignedEdges, g) receives G, Gncrit and AssignedEdges as input and produces the assignment of values to vertices in Gncrit, represented by the array g. This finishes the perfect assignment of values to vertices of G. We use a depth-first search algorithm to assign values to vertices in Gncrit.

As Gncrit is acyclic, we can impose the order in which addresses are associated with edges in Gncrit. Therefore, in the assignment of values to vertices in Gncrit we place the unused addresses in the gaps left by the assignment of values to vertices in Gcrit. For that, we start the depth-first search from the vertices in Vscrit because these critical vertices were already assigned, so their g values cannot be changed.
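The "gaps" mentioned above are simply the addresses whose AssignedEdges entry is still false once the critical vertices have been processed. A minimal sketch of collecting them follows; the function name and the table size m = 13 are assumptions made for illustration, while the set of unused addresses {0, 3, 6, 10, 11, 12} is the one reported for the running example.

```python
def unused_addresses(assigned_edges):
    """Return, in increasing order, the addresses not taken by critical edges."""
    return [i for i, used in enumerate(assigned_edges) if not used]


def example_assigned_edges(m=13):
    """Build a hypothetical AssignedEdges array whose gaps match the running example."""
    used = {1, 2, 4, 5, 7, 8, 9}   # assumed addresses of the critical edges
    return [i in used for i in range(m)]


if __name__ == "__main__":
    print(unused_addresses(example_assigned_edges()))
```

Keeping the gaps in increasing order matches the walk-through, where the first unused address is handed out first.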

Considering the subgraph Gncrit in Figure 4(d), a step by step example of the assignment of values to vertices in Gncrit is presented in Figure 8. Figure 8(a) presents the initial state of the algorithm. The critical vertex 5 is the only one that has non critical vertices as adjacent. In the example presented in Figure 6, the addresses {0, 3, 6, 10, 11, 12} were not used. So, taking the first unused address 0 and the vertex 13, which is reached from the vertex 5, the g value of vertex 13 is set to 0 − g(5) = 0, as shown in Figure 8(b). In Figure 8(c), using the unused addresses 3 and 6, the g values for vertices 15 and 14 are set to 3 − g(13) = 3 and to 6 − g(13) = 6, respectively. Vertices 0, 1, 9, 10, 11 and 12 were not assigned yet, so we continue the assignment of values to non critical vertices from vertex 0. In Figure 8(d), we set g(0) to 0. The only vertex that is reached from vertex 0 is vertex 1, so taking the unused address 10 we set g(1) to 10 − g(0) = 10. This process is repeated until the UnAssignedEdges list becomes empty. The final result is shown in Figure 8(e). A pseudo code for the assignment of values to non critical vertices is presented in Figure 9.

Complexity Analysis

The assignment of values to vertices in Gncrit is a depth-first search algorithm. Then, its time complexity is O(|Vscrit| + |Vncrit| + |Encrit|). Considering that |Vncrit| ≤ |V|, |Vscrit| ≤ |V|, |V| = cn and |Encrit| ≤ n, the complexity of the assignment of values to non critical vertices is O(n).

6 MPHF Evaluation

Figure 10 presents a pseudo code to evaluate the MPHF generated by the new algorithm. The procedure h (x, g, h1, h2) receives as input a key x ∈ S, the g function, the tables used by h1 and h2 and returns the hash table address assigned to x.

procedure h ( x , g , h1 , h2 )
    u := h1 (x) ;
    v := h2 (x) ;
    return ( g(u) + g(v) ) ;

Figure 10: Evaluating the MPHF.

7 Experimental Results

In this section we present experimental results to show the efficiency of the new algorithm. Also, a comparison with algorithm CHM (proposed by Czech, Havas and Majewski [1]) is made.

The two algorithms were implemented in the C language. All experiments were carried out on a computer running the Linux operating system, version 2.6.7, with a 2.2 gigahertz Athlon processor and 1 gigabyte of main memory.

    Collection    n             Key Size (Avg)
    TodoBR         3,541,615     8.3
    Random        10,000,000    20.0
    VLC2          10,935,900     8.6
    URLs          20,000,000    57.4

Table 4: Collections used in the experiments.

We used four collections in the experiments: (i) the vocabulary of the TodoBR search engine (http://www.todobr.com.br); (ii) a collection of keys generated randomly (Random); (iii) the vocabulary extracted from the TREC-VLC2 (Very Large Collection 2) collection [11]; (iv) a set of URLs crawled from the Web. Table 4 presents some details about the collections.

Table 5 presents the main characteristics of the two algorithms. The number of edges of graph G = (V, E) is equal to the size n of the set S of keys for the two algorithms. The number of vertices of G is equal to 1.15n and 2.09n for the new algorithm and the CHM algorithm, respectively. This measure is related to the amount of space to store the array g. The number of critical edges is 0.5|E| and 0, for the new algorithm and the CHM algorithm, respectively.

    Characteristics     New algorithm    CHM
    |E|                 n                n
    |V|                 cn               cn
    c                   1.15             2.09
    |g|                 1.15n            2.09n
    |Ecrit|             0.5|E|           0
    G                   cyclic           acyclic
    Order preserving    no               yes

Table 5: Main characteristics of the algorithms.

Table 6 presents time results for constructing MPHFs using the two algorithms. The table entries

represent averages over 50 trials. The column labelled as Ni represents the number of iterations to generate the random graph G in the mapping step of the algorithms. The other columns represent the run times for each step of the algorithms. All times are in seconds.

                  New algorithm, c = 1.15                         CHM, c = 2.09
    Collection    Ni     Mapping   Ordering   Searching   Total     Ni     Mapping + Ordering   Searching   Total
    TodoBR        1.92    11.33      1.93       0.97       14.23    2.63        19.51              3.03      22.54
    Random        1.77    41.90      7.17       3.70       52.77    2.96        59.92             10.31      70.23
    VLC2          2.24    44.69      7.00       3.59       55.28    2.94        78.77             11.09      89.86
    URLs          2.18   153.23     14.62       7.52      175.37    -           -                  -          -

Table 6: Time to generate the MPHFs for the new algorithm and the CHM algorithm.

The CHM algorithm performs the ordering step together with the mapping step. In the CHM algorithm the ordering step is just the assignment of hash values to the edges of G.

The mapping step of the new algorithm is faster because the number of iterations to generate G is lower, since G has 1.15n vertices and need not be acyclic. This result is consistent with the theoretical considerations. Using Eq. (4), the expected number of iterations to generate G for the new algorithm is 2.13 and using Eq. (2), the same measure is 2.92 for the CHM algorithm. The CHM algorithm also needs to verify if G is acyclic during the mapping step, which has the same complexity as the ordering step of the new algorithm.

The random graph G generated in the mapping step of the new algorithm has 1.15n vertices and the one generated in the mapping step of the CHM algorithm has 2.09n vertices. That is why the searching step of the new algorithm is faster, since the time complexity of the searching step of the algorithms depends on the number of vertices in G.

We were not able to generate a MPHF for the CHM algorithm using the URLs collection. The reason was that its random graph G has more vertices (|V| = 2.09n) and could not be stored in the main memory of the machine used for the experiments.

The MPHF generated by the new algorithm is slightly faster than the one generated by the CHM algorithm. It happens because we save a modulo operation, as shown in Eq. (3). Table 7 presents the evaluation times, which are averages over 50 trials. Each entry in Table 7 represents the time to evaluate all keys of each collection.

                  Algorithms
    Collection    New algorithm    CHM
    TodoBR         3.33             3.59
    Random        12.29            13.70
    VLC2          12.41            13.81
    URLs          60.03            -

Table 7: Time to compute a hash table entry for the algorithms considered. All times are in seconds.

Finally, Figure 11 presents the time to generate the MPHF by the new algorithm for different numbers of keys of the TREC-VLC2 collection. As claimed, the time to generate a MPHF using the new algorithm grows linearly with n.

[Figure 11 plot, titled TREC-VLC2: time in seconds to generate the MPHF (vertical axis, 0 to 55) against n/1000 (horizontal axis, 0 to 8000).]

Figure 11: Verification of the O(n) complexity to generate a MPHF by the new algorithm.

8 Conclusions

A new algorithm for generating MPHFs has been proposed. Its expected time complexity is O(n), so that

the new algorithm is time optimal. The generated function is very fast to evaluate and the space needed to store it is O(n log n) bits. Experimental results show that the times to both generate the MPHF and compute a hash table entry by the new algorithm are better than the times obtained by the CHM algorithm, one of the fastest known algorithms.

References

[1] Z.J. Czech, G. Havas, and B.S. Majewski. An optimal algorithm for generating minimal perfect hash functions. Information Processing Letters, 43(5):257–264, 1992.

[2] Z.J. Czech, G. Havas, and B.S. Majewski. Fundamental study perfect hashing. Theoretical Computer Science, 182:1–143, 1997.

[3] P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae, 6:290–297, 1959.

[4] P. Erdős and A. Rényi. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 56:17–61, 1960.

[5] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica Scientia Hungary, 12:261–267, 1961.

[6] P. Flajolet, D. E. Knuth, and B. Pittel. The first cycles in an evolving graph. Discrete Math, 75:167–215, 1989.

[7] E. A. Fox, Q. F. Chen, A. M. Daoud, and L. S. Heath. Order preserving minimal perfect hash functions and information retrieval. ACM Trans. Inform. Systems, 9(3):281–308, July 1991.

[8] E.A. Fox, Q.F. Chen, and L.S. Heath. A faster algorithm for constructing minimal perfect hash functions. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 266–273, 1992.

[9] E.A. Fox, L. S. Heath, Q. Chen, and A.M. Daoud. Practical minimal perfect hash functions for large databases. Communications of the ACM, 35(1):105–121, 1992.

[10] G. Havas, B.S. Majewski, N.C. Wormald, and Z.J. Czech. Graphs, hypergraphs and hashing. In 19th International Workshop on Graph-Theoretic Concepts in Computer Science, pages 153–165. Springer Lecture Notes in Computer Science vol. 790, 1993.

[11] D. Hawking. Overview of TREC-7 very large collection track (draft for notebook), 1998.

[12] D. E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley, second edition, 1973.

[13] B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods. The Computer Journal, 39(6):547–554, 1996.

[14] K. Mehlhorn. Data Structures and Algorithms 1: Sorting and Searching. Springer-Verlag, 1984.

[15] E. M. Palmer. Graphical Evolution: An Introduction to the Theory of Random Graphs. John Wiley & Sons, New York, 1985.