Prefix Hash Tree: An Indexing Data Structure Over Distributed Hash Tables
potential application domains:

Databases: Peer-to-peer databases [15] need to support SQL-type relational queries in a distributed fashion. Range predicates are a key component in SQL.

Distributed computing: Resource discovery requires locating resources within certain size ranges in a decentralized manner.

Location-aware computing: Many applications want to locate nearby resources (computing, human or commercial) based on a user's current location, which is essentially a 2-dimensional range query based on geographic coordinates.

Scientific computing: Parallel N-body computations [34] require 3-dimensional range queries for accurate approximations.

In this paper, we address the problem of efficiently supporting 1-dimensional range queries over a DHT. Our main contribution is a novel trie-based distributed data structure called the Prefix Hash Tree (henceforth abbreviated as PHT) that supports such queries. As a corollary, the PHT can also support heap queries ("what is the maximum/minimum?"), proximity queries ("what is the nearest element to X?"), and, in a limited way, multi-dimensional analogues of the above, thereby greatly expanding the querying facilities of DHTs. The PHT is efficient, in that updates are doubly logarithmic in the size of the domain being indexed. Moreover, the PHT is self-organizing and load-balanced. The PHT also tolerates failures well; while it cannot by itself protect against data loss when nodes go down (it can, however, take advantage of any replication or other data-preserving technique employed by the DHT), the failure of any given node in the Prefix Hash Tree does not affect the availability of data stored at other nodes.

But perhaps the most crucial property of the PHT is that it is built entirely on top of the lookup interface, and thus can run over any DHT. That is, the PHT uses only the lookup(key) operation common to all DHTs and does not, as in SkipGraph [1] and other such approaches, assume knowledge of, or require changes to, the DHT topology or routing behavior. While designs that rely on such lower-layer knowledge and modifications are appropriate for contexts where the DHT is expressly deployed for the purpose of supporting range queries, we address the case where one must use a pre-existing DHT. This is particularly important if one wants to make use of publicly available DHT services, such as OpenHash [18].

The remainder of the paper is organized as follows. Section 2 describes the design of the PHT data structure. Section 3 presents the results of an experimental evaluation. Section 4 surveys related work, and Section 5 concludes.

2. DATA STRUCTURE

This section describes the PHT data structure, along with related algorithms.

2.1 PHT Description

For the sake of simplicity, it is assumed that the domain being indexed is {0, 1}^D, i.e., binary strings of length D, although the discussion extends naturally to other domains. Therefore, the data set indexed by the PHT consists of some number N of D-bit binary keys.

In essence, the PHT data structure is a binary trie built over the data set. Each node of the trie is labeled with a prefix that is defined recursively: given a node with label l, its left and right child nodes are labeled l0 and l1 respectively. The root is labeled with the attribute being indexed, and downstream nodes are labeled as above.

The following properties are invariant in a PHT.

1. (Universal prefix) Each node has either 0 or 2 children.

2. (Key storage) A key K is stored at a leaf node whose label is a prefix of K.

3. (Split) Each leaf node stores at most B keys.

4. (Merge) Each internal node contains at least (B + 1) keys in its sub-tree.

5. (Threaded leaves) Each leaf node maintains a pointer to the leaf nodes on its immediate left and immediate right respectively. (A pointer here would be the prefixes of the neighboring leaves and, as a performance optimization, the cached IP addresses of their corresponding DHT nodes.)

Property 1 guarantees that the leaf nodes of the PHT form a universal prefix set. (A set of prefixes is a universal prefix set if and only if, for every infinite binary sequence b, there is exactly one element in the set that is a prefix of b.) Consequently, given any key K ∈ {0, 1}^D, there is exactly one leaf node leaf(K) whose label is a prefix of K. Property 2 states that the key K is stored at leaf(K). Figure 1 provides an example of a PHT containing N = 20 6-bit keys with B = 4. The table on the right in Figure 1 lists the 20 keys and the leaf nodes they are stored in.

Properties 3 and 4 govern how the PHT adapts to the distribution of keys in the data set. Following the insertion of a new key, the number of keys stored at a leaf node may exceed the threshold B, causing property 3 to be violated. To restore the invariant, the node splits into two child nodes, and its keys are redistributed among the children according to property 2. Conversely, following the deletion of an existing key, the number of keys contained in a sub-tree may fall below (B + 1), causing property 4 to be violated. To restore the invariant, the entire sub-tree is merged into a single leaf node, where all the keys are aggregated. Notice that the shape of the PHT depends on the distribution of keys; it is "deep" in regions of the domain which are densely populated, and conversely, "shallow" in regions of the domain which are sparsely populated. Finally, property 5 ensures that the leaves of the PHT form a doubly linked list, which enables sequential traversal of the leaves for answering range queries.

As described thus far, the PHT structure is a fairly routine binary trie. The novelty of the PHT lies in how this logical trie is distributed among the peers in the network, i.e., in how PHT vertices are assigned to DHT nodes. This is achieved by hashing the prefix labels of PHT nodes over the DHT identifier space. A node with label l is thus assigned to the peer to which l is mapped by the DHT, i.e., the peer whose identifier is closest to HASH(l). (Assignment implies that the peer maintains the state associated with the PHT node assigned to it; henceforth, the discussion will use "PHT node" to also refer to the peer assigned that node.) This hash-based assignment implies that given a label, it is possible to locate its corresponding PHT node via a single DHT lookup. This "direct access" property is unlike the successive link traversals associated with typical data structures, and results in the PHT having several desirable features that are discussed subsequently.
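For illustration, the hash-based assignment can be sketched in a few lines of Python. The helpers hash_label and responsible_peer are hypothetical, and SHA-1 together with absolute numeric distance merely stand in for whatever identifier space and closeness metric the underlying DHT actually uses.

import hashlib

def hash_label(label):
    # Map a prefix label such as "0011" into the DHT identifier space.
    # (SHA-1 is only a placeholder for the DHT's own hash function.)
    return int.from_bytes(hashlib.sha1(label.encode()).digest(), "big")

def responsible_peer(label, peer_ids):
    # The PHT node labeled `label` is assigned to the peer whose identifier is
    # closest to HASH(label); any PHT node is therefore one DHT lookup away.
    target = hash_label(label)
    return min(peer_ids, key=lambda p: abs(p - target))

Because the label alone determines the responsible peer, a client can jump directly to any PHT node without traversing the trie from the root.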
2.2 PHT Operations

This section describes algorithms for PHT operations.

2.2.1 Lookup

Given a key K, a PHT lookup operation returns the unique leaf node leaf(K) whose label is a prefix of K. Because there are (D + 1) distinct prefixes of K, there are (D + 1) potential candidates; an obvious algorithm is to perform a linear scan of these (D + 1) nodes until the required leaf node is reached. This is similar to a top-down traversal of the trie, except that a DHT lookup is used to locate a PHT node given its prefix label. Pseudocode for this algorithm is given below.

Algorithm: PHT-LOOKUP-LINEAR
  input : A key K
  output: leaf(K)
  for i ← 0 to D do
      /* P_i(K) denotes the prefix of K of length i */
      node ← DHT-LOOKUP(P_i(K));
      if (node is a leaf node) then return node;
  end
  return failure;

How can this be improved? Given a key K, the above algorithm tries different prefix lengths until the required leaf node is reached. Clearly, linear search can be replaced by binary search on prefix lengths. If the current prefix is an internal node of the PHT, the search tries longer prefixes. Alternatively, if the current prefix is not an internal node of the PHT, the search tries shorter prefixes. The search terminates when the required leaf node is reached. The decision tree to the left in Figure 1 illustrates the binary search. For example, consider a lookup for the key 001100. The binary search algorithm first tries the 3-bit prefix 001* (an internal node), then the 5-bit prefix 00110* (not an internal node), and then finally the 4-bit prefix 0011*, which is the required leaf node. Pseudocode for this algorithm is given below.

Algorithm: PHT-LOOKUP-BINARY
  input : A key K
  output: leaf(K)
  lo ← 0;
  hi ← D;
  while (lo ≤ hi) do
      mid ← ⌊(lo + hi)/2⌋;
      /* P_mid(K) denotes the prefix of K of length mid */
      node ← DHT-LOOKUP(P_mid(K));
      if (node is a leaf node) then return node;
      else
          if (node is an internal node) then lo ← mid + 1;
          else hi ← mid − 1;
      end
  end
  return failure;

Binary search reduces the number of DHT lookups from (D + 1) to ⌊log (D + 1)⌋ + 1 ≈ log D. Nevertheless, linear search is still significant for at least two reasons. First, observe that the (D + 1) DHT lookups in linear search can be performed in parallel, as opposed to binary search, which is inherently sequential. This results in two modes of operation, viz. low-overhead lookups using binary search, and low-latency lookups using parallel search. Second, binary search may fail, i.e., be unable to correctly locate the leaf node, as a result of the failure of an internal PHT node. (Binary search cannot distinguish between the failure of an internal node and the absence of an internal node.) On the other hand, linear search is guaranteed to succeed as long as the leaf node is alive and the DHT is able to route to it, and it therefore provides a failover mechanism. Note that both algorithms are contingent on the fact that the DHT provides a mechanism to locate any PHT node via a single lookup.
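The two lookup strategies can be sketched in Python as follows. This is an illustrative sketch, assuming a hypothetical dht_lookup(label) callable that returns an object whose kind attribute is "leaf" or "internal", or None when no PHT node with that label exists; issuing the linear probes in parallel is omitted.

def prefix(key, i):
    # P_i(K): the first i bits of the key (a string of '0'/'1' characters).
    return key[:i]

def pht_lookup_linear(key, dht_lookup):
    # Try every prefix length 0..D in turn; these (D + 1) lookups could also
    # be issued in parallel, giving the low-latency / failover mode.
    for i in range(len(key) + 1):
        node = dht_lookup(prefix(key, i))
        if node is not None and node.kind == "leaf":
            return node
    return None

def pht_lookup_binary(key, dht_lookup):
    # Binary search on prefix lengths: about log D DHT lookups instead of D + 1.
    lo, hi = 0, len(key)
    while lo <= hi:
        mid = (lo + hi) // 2
        node = dht_lookup(prefix(key, mid))
        if node is not None and node.kind == "leaf":
            return node
        if node is not None and node.kind == "internal":
            lo = mid + 1   # prefix exists but is internal: the leaf is deeper
        else:
            hi = mid - 1   # no PHT node with this label: the leaf is shallower
    return None

The binary variant mirrors PHT-LOOKUP-BINARY: finding an internal node pushes the search toward longer prefixes, while a missing node pushes it toward shorter ones.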
2.2.2 Range Query

Given two keys L and H (L ≤ H), a range query returns all keys K contained in the PHT satisfying L ≤ K ≤ H. Range queries can be implemented in a PHT in several ways; we present two simple algorithms.

The first algorithm is to locate leaf(L) using the PHT lookup operation. The doubly linked list of threaded leaves is then traversed sequentially until the node leaf(H) is reached, and all values satisfying the range query are retrieved. This algorithm is simple and efficient; it initially requires log D DHT lookups to locate leaf(L), but it cannot avoid traversing the remaining nodes to answer the query. The disadvantage of this algorithm is that a sequential scan of the leaf nodes may result in high latency before the query is completely resolved.

The second algorithm is to parallelize. Using the DHT, locate the node whose label corresponds to the smallest prefix range that completely covers the specified range. If this is an internal node, then recursively forward the query onward to those children which overlap with the specified range. This process continues until the leaf nodes overlapping with the query are reached. If this is not an internal node, the required range query is covered by a single leaf node, which can be located by binary search.

[Figure 2: Range queries. (Trie diagram with branches labeled 0/1; annotations: "Parallel", "Sequential".)]

Figure 2 shows an example of range search. Consider a query for the range [001001, 001011]. In the sequential algorithm, a PHT lookup is used to locate the node containing the lower endpoint, i.e., node 00100∗. After this, a traversal of the linked list forwards the query to the next two leaves, 001010∗ and 001011∗, which resolves the query. In the parallel algorithm, we first identify the smallest prefix range that completely covers the query, which is 0010∗. A single DHT lookup is used to directly jump to this node, after which the query is forwarded in parallel within the sub-tree, until all leaf nodes that overlap with the search range are reached.

Note that in the parallel algorithm, it is sometimes desirable to break the search query into two and treat these sub-queries independently. For example, a very small range that contains the midpoint of the space will result in ∗ being the smallest prefix range containing it, thereby potentially overloading the root. To prevent this, we observe that every range is contained in the union of two prefix ranges that are of roughly the same size as the query (within a factor of 2). By handling these separately, it is possible to ensure that a search starts at a level in the PHT that is appropriate for the query, i.e., smaller queries start lower down in the PHT.
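The building blocks of both strategies can be sketched as follows. This illustrative Python assumes leaf objects with hypothetical label, keys, and right_neighbor attributes for the threaded-leaf traversal, takes a PHT lookup function such as pht_lookup_binary above, and models the parallel algorithm's starting point as the longest common prefix of the two endpoints; the recursive parallel forwarding itself is omitted.

def seq_range_query(lo_key, hi_key, pht_lookup):
    # Sequential algorithm: locate leaf(L) with one PHT lookup, then follow the
    # threaded-leaf pointers (property 5) until leaf(H) is reached.
    results = []
    leaf = pht_lookup(lo_key)
    while leaf is not None:
        results.extend(k for k in leaf.keys if lo_key <= k <= hi_key)
        if hi_key.startswith(leaf.label):  # this leaf is leaf(H); stop here
            break
        leaf = leaf.right_neighbor
    return results

def smallest_covering_prefix(lo_key, hi_key):
    # Starting point of the parallel algorithm: the smallest prefix range that
    # covers [L, H] corresponds to the longest common prefix of the endpoints.
    i = 0
    while i < len(lo_key) and i < len(hi_key) and lo_key[i] == hi_key[i]:
        i += 1
    return lo_key[:i]

For the example above, smallest_covering_prefix("001001", "001011") returns "0010", i.e., the 0010∗ sub-trie from which the query is forwarded in parallel.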
2.2.3 Insert / Delete

Insertion and deletion of a key K both require a PHT lookup operation to first locate the leaf node leaf(K). Insertion of a new key can cause this leaf node to split into two children, followed by a redistribution of keys. In most cases, the (B + 1) keys are distributed among the two children such that each of them stores at most B. However, it is possible that all (B + 1) keys are distributed to the same child, necessitating a further split. In the worst case, an insertion can cause splits to cascade all the way to a depth of D, making insertion costs proportional to D. Similarly, in the worst case, deletion can cause an entire sub-tree of depth D to collapse into a single leaf node, incurring a cost proportional to D.

It is possible to reduce update costs and avoid problems of multi-node coordination through staggered updates: only one split operation is allowed per insertion, and similarly, only one merge operation is allowed per deletion. While this reduces update costs to log D DHT lookups (the cost of a PHT lookup to locate the leaf node), it also allows invariants 3 and 4 to be violated; a leaf node can now store up to (B + D) keys. This is not likely to be a problem because in most practical scenarios, B >> D.

2.3 PHT versus Tree-Based Indexes

An alternative design would use a balanced tree index such as a B-tree, with the DHT being used to distribute B-tree nodes across peers in the network. While tree-based indices may be better in traditional indexing applications like databases, we argue the reverse is true for implementation over a DHT.

The primary difference between the two approaches is as follows: a trie partitions the space, while a tree partitions the data set. In other words, a trie node represents a particular region of space, while a tree node represents a particular set of keys. Because a trie uses the space itself, which is constant and independent of the actual data set, there is some implicit knowledge about the location of a key. For example, in a trie, a key is always stored at a prefix of the key, which makes it possible to exploit the mechanism the DHT provides to locate a node via a single DHT lookup. In a tree, this knowledge is lacking, and it is not possible to locate a key without a top-down traversal from the root. Therefore, a tree index cannot use the random-access property of the DHT in the same manner. This translates into several key advantages in favor of the PHT when compared to a balanced tree index.

2.3.1 Efficiency

A balanced tree has a height of log N, and therefore a key lookup requires log N DHT lookups. In addition, updates may require the tree to be re-balanced. The binary search lookup algorithm in the case of the PHT requires only log D DHT operations, and updates have the same cost as well. Comparing the cost of lookups for an index consisting of a million 32-bit keys, a tree index would require 20 DHT lookups, as compared to 6 for the PHT, to retrieve a key. Of course, multiway indexing could be used to reduce the height of the tree, but this would also leave the tree more vulnerable to faults in the indexing structure.

2.3.2 Load Balancing

As mentioned before, every lookup in a tree must go through the root, creating a potential bottleneck. In the case of a trie, binary search allows the load to be spread over 2^(D/2) nodes (assuming uniform lookups), thus eliminating any bottleneck.
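As a quick sanity check of these numbers, a few lines of Python, assuming the parameters stated above (N = 10^6 keys, D = 32 bits), reproduce the 20-versus-6 lookup comparison and the 2^(D/2) spread:

import math

N, D = 10**6, 32
tree_lookups = math.ceil(math.log2(N))            # balanced tree: about log N = 20 DHT lookups
pht_lookups = math.floor(math.log2(D + 1)) + 1    # PHT binary search: floor(log(D + 1)) + 1 = 6
spread = 2 ** (D // 2)                            # binary search entry points: 2^(D/2) = 65536 nodes
print(tree_lookups, pht_lookups, spread)          # -> 20 6 65536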