
Path Caching: A Technique for Optimal External Searching

Sridhar Ramaswamy and Sairam Subramanian
Department of Computer Science
Brown University
Providence, Rhode Island 02912

CS-94-27
May 1994


Path Caching: A Technique for Optimal External Searching
(Extended Abstract)

Sridhar Ramaswamy*          Sairam Subramanian†
Brown University            Brown University
[email protected]            [email protected]

Abstract

External 2-dimensional searching is a fundamental problem with many applications in relational, object-oriented, spatial, and temporal databases. For example, interval intersection can be reduced to 2-sided, 2-dimensional searching, and indexing class hierarchies of objects to 3-sided, 2-dimensional searching. Path caching is a new technique that can be used to transform a number of time/space efficient data structures for internal 2-dimensional searching (such as segment trees, interval trees, and priority search trees) into I/O-efficient external ones. Let n be the size of the database, B the page size, and t the output size of a query. Using path caching, we provide the first data structure with optimal I/O query time O(log_B n + t/B) for 2-sided, 2-dimensional searching. Furthermore, we show that path caching requires a small space overhead O((n/B) log_2 log_2 B) and is simple enough to admit dynamic updates in optimal O(log_B n) amortized time. We also extend this data structure to handle 3-sided, 2-dimensional searching with optimal I/O query time, at the expense of slightly higher storage and update overheads.

1 Introduction and motivation

The successful realization of any data model in a large-scale database requires supporting its language features with efficient secondary storage manipulation. Consider the relational data model of [Cod]. While the declarative programming features (relational calculus and algebra) of the model are important, it is crucial to support these features by data structures for searching and updating that make optimal use of secondary storage. B-trees and their variants, B+-trees [BaM, Com], are examples of such data structures. They have been an unqualified success in supporting external dynamic 1-dimensional range searching in relational database systems.

The general data structure problem underlying efficient secondary storage manipulation for many data models is external dynamic k-dimensional range searching. B-trees, which are good for 1-dimensional range searching, are inefficient for handling more general problems like two- and higher-dimensional range search.

The problem of multi-dimensional range searching in both main memory and secondary memory has been the subject of much research. Many elegant data structures like the priority search tree, segment tree, and interval tree have been proposed for use in main memory for special cases of 2-dimensional range searching. In this paper, we introduce a new technique called path caching that can be used to convert many of these in-core data structures into secondary storage data structures. These data structures have optimal query time at the expense of small storage overheads. The technique is also simple enough to allow updates in optimal amortized time.

We first introduce our model for secondary storage algorithms and then look at the performance of B-trees for 1-dimensional range searching. We make the standard assumption that each secondary memory access transmits one page or B units of data, and we count this as one I/O.¹ The efficiency of our algorithms is measured in terms of the number of I/O operations that they perform. Let R be a relation with n tuples and let the output of a query on R have t tuples. Our I/O bounds are expressed in terms of n, t, and B, and all constants are independent of these three parameters.

* Contact Author. Address: Dept. of Computer Science, Brown University, Box 1910, Providence, RI 02912. Tel: 401-863-7662. Fax: 401-863-7657. Research supported by ONR Contract N00014-91-J-4052, ARPA Order 8225.
† Address: Dept. of Computer Science, Brown University, Box 1910, Providence, RI 02912. Tel: 401-863-7673. Fax: 401-863-7657. Research supported by NSF PYI award CCR-9157620, together with matching PYI funds from Honeywell Corporation, Thinking Machines Corporation, and Xerox Corporation. Additional support provided by ARPA contract N00014-91-J-4052, ARPA Order No. 8225.
¹ We will use the words page and disk block interchangeably, as also in-core and main memory. Also, the symbol log used without a base defaults to base 2.
A B+-tree on attribute x of the n-tuple relation R uses O(n/B) disk blocks. The following operations define the problem of external dynamic 1-dimensional range searching on relational database attribute x: (1) Find all tuples such that for their x attribute, a1 <= x <= a2. If the output size is t tuples, then the B+-tree can answer this query in O(log_B n + t/B) disk I/O's in the worst case. If a1 = a2 and x is a key, then this is key-based searching. (2) Inserting or deleting a given tuple into the B+-tree can be done in O(log_B n) disk I/O's in the worst case. It can be shown that the performance of B+-trees is optimal for this problem. The problem of external dynamic k-dimensional range searching on relational database attributes x1, ..., xk generalizes 1-dimensional range searching to k attributes, with range searching on k-dimensional intervals.

Many efficient algorithms exist for 2-dimensional range searching and its special cases (see [ChT] for a detailed survey). Most of these algorithms are not efficient when mapped to secondary storage. However, the practical need for good I/O support has led to the development of a large number of external data structures, which do not have good theoretical worst-case bounds but have good average-case behavior for common spatial database problems. These include the grid file [NHS], various quad-trees [Sama, Samb], z-orders [Ore] and other space filling curves, k-d-B-trees [Rob], hB-trees [LoS], cell-trees [Gun], and various R-trees [Gut, SRF]. For these external data structures there has been a lot of experimentation but relatively little algorithmic analysis. Their average-case performance (e.g., some achieve the desirable static query I/O time of O(log_B n + t/B) on "average" inputs) is heuristic and usually validated through experimentation. Moreover, their worst-case performance is much worse than the optimal bounds achievable for dynamic external 1-dimensional range searching using B+-trees (see [KRV] for a more complete reference on the field). In this paper, we are interested in obtaining algorithms with good worst-case performance. Using path caching, we study tradeoffs between time and space in secondary memory.

Two special cases of 2-dimensional range searching have been studied extensively in the literature. The first one is dynamic interval management in secondary storage. This problem is crucial to indexing in constraint databases and temporal databases [KKR, KRV]. It is shown in [KRV] that the key component of dynamic interval management is answering stabbing queries. Given a set of input intervals, to answer a stabbing query for a point q we have to report all intervals that intersect q. Elegant solutions exist for this problem in main memory. The segment tree [Ben], interval tree [Edea, Edeb], and the priority search tree [McC] can all solve this problem well. Of these, the priority search tree solves a slightly more general problem (3-sided queries) with optimal query and update times and uses optimal storage. Many algorithms have been presented to solve this problem in secondary memory. These include [BlGa, BlGb, IKO]. The first I/O optimal solution for this problem appeared in [KRV]. [KRV] reduces dynamic interval management to stabbing queries, which in turn reduce to a special case of 2-dimensional range searching called diagonal corner queries (see Figure 1). Diagonal corner queries can be answered in optimal time O(log_B n + t/B) using optimal storage O(n/B). The solution presented in [KRV] is fairly involved and does not support deletion of points. In this paper, we present a data structure for solving a more general version of this problem, namely 2-sided queries (see Figure 1). We use path caching and obtain bounds of O(log_B n + t/B) I/O's for query time and O(log_B n) for amortized updates. The data structure occupies O((n/B) log log B) storage.

Figure 1: Diagonal corner queries, 2-sided, 3-sided and general 2-dimensional range queries.

The second important special case of 2-dimensional range searching is 3-sided range searching (see Figure 1). The priority search tree can answer 3-sided queries in-core in time O(log n + t), using storage O(n). The update time is O(log n) in the worst case. All these bounds are optimal. It is shown in [KRV] that answering 3-sided queries efficiently is key to solving the problem of indexing classes. Indexing classes is the natural generalization of indexing in the context of object-oriented databases and is very important to their good performance (see [KiL, ZdM] for more information on this area). [KKD, LOL] present solutions to the problem of indexing classes. However, their algorithms are based on heuristics and cannot guarantee good worst-case performance.

Previous attempts to answer 3-sided queries in secondary memory by implementing priority search trees in secondary memory [IKO, KRV] did not have optimal query times. [IKO] uses optimal storage but answers queries in O(log n + t/B) time. [KRV] improves on this, answering queries in O(log_B n + log B + t/B) time using optimal storage. Neither of them allows inserts and deletes from the data structure. We present a data structure to solve this problem using path caching that answers queries in optimal time O(log_B n + t/B), but uses storage O((n/B) log B log log B) and performs updates in O(log_B n log^2 B) time. In addition to these data structures, path caching can also be applied to other main memory data structures to obtain optimal query times at the expense of small space overheads. By doing this, we improve on the bounds of [BlGb] for implementing segment trees in secondary memory.

To summarize, we present a simple technique called path caching that can be used to transform many in-core data structures to efficient secondary storage structures.
We show how to use path caching to implement priority search trees in secondary memory for answering general 2- and 3-sided queries with optimal I/O. We also apply this technique to improve on existing bounds for segment trees in secondary memory, and to implement a restricted version of interval trees in secondary memory.

The rest of the paper is organized as follows. Section 2 explains the general principles behind path caching by applying it to segment trees. Section 3 explains the application of path caching to priority search trees to obtain reasonably good worst-case bounds for 2-sided range searching. We also present results about the application of path caching to 3-sided range searching, segment trees and interval trees. Section 4 applies recursion and path caching to 2-sided queries to improve on the bounds of Section 3. It also briefly touches on applying similar ideas to 3-sided queries. Section 5 shows how updates to the data structures can be handled in optimal amortized time. We finally present our conclusions and open problems in Section 6.

2 Path caching

Let us illustrate the idea of path caching by applying it to the main memory data structure segment tree. The segment tree is an elegant data structure that is used to answer stabbing queries on a collection of intervals. Before we discuss the use of path caching in this context we give a brief description of the segment tree; a more complete treatment can be found in [Ben]. For ease of exposition we will assume that none of the input intervals share any endpoints.

To build a segment tree on a set of n intervals we first build a binary search tree T on the 2n endpoints of the intervals. The endpoints e1, e2, ..., ek are stored at the leaves of the search tree in sorted order. With each node x in the tree we associate a half-open interval² called the cover-interval of x. If x is a leaf node containing the endpoint ej then the cover-interval of x is the half-open interval [ej, ej+1). If x is an internal node then its cover-interval is the union of the cover-intervals of its children. To answer stabbing queries we store each input interval I in up to 2 log n nodes of the tree. These nodes are called allocation nodes of interval I. A node x is an allocation node of interval I if I contains the cover-interval of x and does not contain the cover-interval of x's parent. The intervals stored at node x are placed in a list CL(x) called the cover-list of x.

² A half-open interval [a, b) contains all the points between a and b, including a but excluding b.

Given a query point q, let P be the path of T from the root to the leaf y such that q is in the cover-interval of y. It is not hard to show that the intervals containing the query point q are exactly those intervals that are stored at the nodes on P. The time required to answer such a query is O(log n + t) (where t is the number of intervals that contain q), which is optimal. The data structure occupies O(n log n) space because each interval is stored in at most 2 log n nodes of the tree.
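To make the description above concrete, here is a minimal in-core sketch of the segment tree just described. The class and function names (Node, build_tree, insert_interval, stabbing_query) are ours, not the paper's; the sketch assumes every interval endpoint appears in the endpoint set and that the query point lies inside the covered range.

```python
# Minimal in-core segment tree sketch following the description above.
import math

class Node:
    def __init__(self, lo, hi):
        self.cover = (lo, hi)        # half-open cover-interval [lo, hi)
        self.left = self.right = None
        self.cover_list = []         # CL(x): intervals allocated to this node

def build_tree(endpoints):
    """Build a balanced tree whose leaves cover [e_j, e_{j+1}) in sorted order."""
    es = sorted(endpoints) + [math.inf]
    level = [Node(es[i], es[i + 1]) for i in range(len(es) - 1)]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            parent = Node(level[i].cover[0], level[i + 1].cover[1])
            parent.left, parent.right = level[i], level[i + 1]
            nxt.append(parent)
        if len(level) % 2:           # odd node promoted unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

def insert_interval(node, iv):
    """Store iv at its allocation nodes: cover inside iv, parent's cover not."""
    lo, hi = node.cover
    a, b = iv
    if hi <= a or b <= lo:           # disjoint from this subtree
        return
    if a <= lo and hi <= b:          # node's cover-interval contained in iv
        node.cover_list.append(iv)
        return
    if node.left:
        insert_interval(node.left, iv)
        insert_interval(node.right, iv)

def stabbing_query(node, q):
    """Report intervals containing q by walking the root-to-leaf path P."""
    out = []
    while node is not None:
        out += node.cover_list
        if node.left is None:
            break
        node = node.left if q < node.left.cover[1] else node.right
    return out
```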
Let us now try to implement this data structure in secondary storage. Given a block size of B it is easy to see that we require at least t/B I/O's to output all the intervals. Also, it can be shown that we require Ω(log_B n) I/O's to identify the path to y. Thus an ideal implementation of segment trees in secondary memory would require O((n/B) log n) disk blocks of space and would answer stabbing queries with O(log_B n + t/B) I/O's.

To lower the time required to locate the root-to-y path P we can store the tree T in a blocked fashion by mapping subtrees of height log B into disk blocks. The resulting search structure is called the skeletal B-tree and is similar in structure to a B-tree (see Figure 2). With this blocking, and a searching strategy similar to B-trees, we can locate a log B-sized portion of P with every I/O. If the cover-list of each node is stored in a blocked fashion (with B intervals per block) then we could examine the cover-list CL(x) of each node x on P and retrieve the intervals in CL(x) B at a time. A closer look reveals that this approach could result in a query time of O(log n + t/B). This is because even though we can identify P in O(log_B n) time we still have to do at least O(log n) I/O's, one for each cover-list on the path (see Figure 3). These I/O's may be wasteful if the cover-lists contain fewer than B intervals. To avoid paying the additional log n in the query time we need to avoid wasteful I/O's (ones that return fewer than B intervals) as much as possible.
Figure 2: Constructing the skeletal graph.

Figure 3: Underfull cover-lists along the query path result in wasteful I/O's. Path-caching alleviates this problem.

In particular, if the number of wasteful I/O's is smaller than the number of useful I/O's (ones that return B intervals), then the I/O's required to report all the intervals on P will be <= 2t/B = O(t/B).³ This, combined with the fact that we need O(log_B n) time to identify P, would give us the desired query time.

³ In other words, each wasteful I/O will be paid for by performing a useful one.

Consider a node x on P such that its cover-list has at least B intervals. It is then easy to see that the first I/O at node x will be useful. In fact all but the last I/O at node x will return B intervals. This implies that the number of wasteful I/O's at node x is upper-bounded by the number of useful I/O's at that node. Thus the problem nodes are only those that have fewer than B intervals stored at them.

This leads us to the idea of path caching: If we coalesce all the cover-lists on P that have fewer than B elements and store them in a cache C(y) at y, then we could look at C(y) instead of looking at log n possibly underfull cover-lists. If C(y) is stored in a blocked fashion then retrieving intervals from it would cause at most one wasteful I/O instead of log n wasteful ones (see Figure 3). The time for reporting all the intervals would then be <= 2t/B + 1. This, combined with the time for finding P, would give us the desired query time.

We therefore make the following modification to the segment tree: For each leaf y, identify the underfull cover-lists CL1, ..., CLk (cover-lists that contain less than B intervals) along the root-to-y path. Make copies of the intervals in all the underfull cover-lists and store them in a cache C(y) in y. Block C(y) into blocks of size B on to secondary memory.

From the above discussion we can see that using this modified version of a segment tree we can answer stabbing queries with O(log_B n + t/B) I/O's. The only thing left to analyze is the amount of storage required for the modified data structure. The number of disk blocks required to block the search tree T is O(n/B). The total number of intervals in all the cover-lists is O(n log n). These can be stored in O((n/B) log n) disk blocks. At each leaf y we have a cache C(y) that contains up to B log n intervals (from the log n nodes along the root-to-y path). Therefore to store the caches from the 2n leaves we need 2n log n disk blocks. Putting all of this together we see that the space required is O(n log n) blocks.

The space overhead can be reduced to the optimal value O((n/B) log n) by performing two optimizations: (1) building caches at the leaf nodes of the skeletal tree instead of the complete binary tree (we therefore have to build only O(n/B) path-caches); and (2) requiring the query to look at O(log_B n) path-caches instead of just one (this would allow us to build smaller path-caches). We can get a secondary memory implementation of the segment tree that requires O((n/B) log n) space and answers queries with O(log_B n + t/B) I/O's.
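The I/O accounting behind this modification can be illustrated with a small sketch (Python, assumed names). It treats each cover-list and the cache C(y) as sequences of B-sized blocks and simply counts the page reads a query along one root-to-leaf path would incur: full cover-lists are read directly, underfull ones are served from the single coalesced cache.

```python
import math

def build_leaf_cache(path_cover_lists, B):
    """Coalesce the underfull cover-lists (< B intervals) along one
    root-to-leaf path into a single cache C(y)."""
    cache = []
    for cl in path_cover_lists:
        if len(cl) < B:
            cache.extend(cl)
    return cache

def stabbing_query_ios(path_cover_lists, cache, B):
    """Count the I/Os needed to report every interval on the path:
    one read per B intervals of each full cover-list, plus the cache blocks."""
    ios, reported = 0, []
    for cl in path_cover_lists:
        if len(cl) >= B:                      # full lists are read directly
            ios += math.ceil(len(cl) / B)     # all but the last read return B items
            reported.extend(cl)
    ios += max(1, math.ceil(len(cache) / B))  # at most one underfull read here
    reported.extend(cache)
    return reported, ios

# Tiny usage example: three underfull lists and one full list with B = 4.
B = 4
lists = [[("a", 1)], [("b", 2), ("c", 3)], [("d", 4)], [("e", i) for i in range(9)]]
cache = build_leaf_cache(lists, B)
out, ios = stabbing_query_ios(lists, cache, B)
print(len(out), "intervals reported with", ios, "I/Os")   # 13 intervals, 4 I/Os
```

Without the cache the same path would cost six reads here, three of them wasteful; with it, the wasteful reads collapse into at most one.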
3 Priority search trees and path caching

In this section, we apply the idea of path caching to priority search trees and use them to solve special cases of 2-dimensional range searching. We first consider 2-sided 2-dimensional queries that are bounded on two sides and free on the other two, as shown in Figure 1. As before, n will indicate the number of data items, B the size of the disk page, and t the number of items in the query result.

Let us consider the implementation of priority search trees in secondary memory for answering 2-sided queries. The input is a set of n points in the plane. The priority search tree is a combination of a heap and a balanced binary search tree.
The solution proposed in [IKO] works as follows: Find the top B points (on the basis of their y values) in the input set, store them in a disk block, and associate this disk block with the root of the search tree. Divide the remaining points into two equal sets based on their x values. From these two sets, choose the top B points in each set and associate them with the two children of the root. Continue this division recursively. The skeletal structure for the binary tree itself is stored in a B-tree (called the skeletal B-tree). It is clear that to store n points in this fashion, we will use only O(n/B) disk blocks.

As illustrated in Figure 4, each node in the priority search tree defined above corresponds to a rectangular region in the plane that contains all the points stored in that node. Furthermore, the tree as a whole defines a hierarchical decomposition of the plane. As is shown in [IKO], this organization has the following crucial property: A point in a node x can belong in a 2-sided query if (1) the region corresponding to x's parent is completely contained within the query or, (2) the region corresponding to x or the one corresponding to its parent is cut by the left side of the query.

With this division, we can show that a 2-sided query with t points in the output can be answered by looking at only O(log n + t/B) disk blocks. To do that, we classify nodes that contain points inside the query into four categories as follows:

- The corner: This is the node whose region contains the corner of the query.

- Ancestors of the corner: These are nodes whose regions are cut by the left side of the query, and there can be at most O(log n) such nodes.

- Right siblings of the corner and the ancestors: These are nodes whose parents' regions are cut by the left side of the query. There can be at most O(log n) such nodes.

- Descendants of right siblings: There can be an unbounded number of them, but for every such node, its parent's region has to be completely contained inside the query. That pays for the cost of looking into these nodes. That is, for every k descendant blocks that are partially cut by the query, there will be at least k/2 blocks that lie completely inside the query.

The algorithm proceeds by locating the nodes intersecting the left side of our query. This is done by performing a search on the skeletal B-tree. The nodes are examined to find the points inside the query. Next, right siblings of these nodes and their descendants are examined in a top-down fashion until the bottom boundary of the query is crossed. In this algorithm, for each node examined, we perform one I/O operation. The corner, ancestor, and sibling nodes can cause wasteful I/O's but there are at most O(log n) such nodes. For every descendant of a "sibling" that is examined, its parent would have contributed a useful I/O. From this analysis, we can conclude that we can answer 2-sided queries in O(log n + t/B) I/O's.

We now show how to avoid the log n wasteful I/O's by caching the data in the ancestor and sibling nodes. We store two caches associated with the corner. One cache will contain all the data in the ancestors sorted in right-to-left (largest x value first) fashion. Call this cache the A-list. The second cache will contain all the data in the siblings stored in top-to-bottom (largest y value first) fashion. Call this cache the S-list. Using these caches we can answer 2-sided queries in O(log_B n + t/B) I/O's. In order to answer a 2-sided query, we simply look at the skeletal B-tree and locate the corner in O(log_B n) time. We then look at the caches (performing at most two wasteful I/O's) to determine which points from the ancestors/siblings fall into the query. After this, we look into the descendants if necessary. As discussed above, any wasteful query that is caused by examining a descendant can be counted off against a useful query that is the result of examining its parent. The descendants thus pay for looking into them through their parents. The following lemma follows.

Lemma 3.1 Given n input points on the plane, the data structure described above answers any 2-sided query containing t points using O(log_B n + t/B) I/O's. The storage used is O((n/B) log n) disk blocks of size B each.

Now, we show that we can bring the storage overhead down to O((n/B) log B). We do this by observing that maintaining caches of size O(log n) at each node is wasteful. We cut the total path length of log n into log_B n segments of size log B. We maintain A-lists and S-lists at each node as before. However, to construct these lists at a node we only examine the ancestors and siblings that are in the log B segment of the root-to-node path that the node belongs to. Thus the lists at any node contain at most O(log B) disk blocks. Therefore the total storage required comes down to O((n/B) log B). To answer a query, we now have to look at a total of log n / log B = log_B n A-lists and S-lists (one for each of the log B-sized subpaths). The descendants of siblings are handled as they were in the previous construction. We get the following theorem.

Theorem 3.2 Given n input points on the plane, path caching can be used to construct a data structure that answers any 2-sided query using O(log_B n + t/B) I/O's. Here t is the output size of the query. The data structure requires O((n/B) log B) disk blocks of storage.
Figure 4: Binary tree implementation of the priority search tree in secondary memory, showing the corner, an ancestor, a sibling, and a sibling's descendant for a 2-sided query. Here, B is 4.
In the next section, we show how we can combine the idea of path caching with recursion to make the storage required even less while keeping queries efficient. Using similar ideas, we can obtain the following bounds for 3-sided queries, segment trees and interval trees (details will be given in the full version of the paper).

Theorem 3.3 Given n input points on the plane, path caching can be used to construct a data structure that answers any 3-sided query using O(log_B n + t/B) I/O's. Here t is the output size of the query. The data structure requires O((n/B) log^2 B) disk blocks of storage.

Theorem 3.4 We can implement segment trees in secondary memory so that a point enclosure query containing t intervals can be answered in O(log_B n + t/B) I/O's. For n input intervals, the storage used is O((n/B) log n) disk blocks.

Theorem 3.5 We can implement interval trees in secondary memory so that a point enclosure query containing t intervals can be answered in O(log_B n + t/B) I/O's. For n input intervals, the storage used is O((n/B) log B) disk blocks.

In the following sections, we explore 2-sided queries in more detail to improve the bounds obtained in this section. We then discuss algorithms for updating the resulting data structures.

4 Using recursion to reduce the space overhead

In this section we describe how to extend the ideas of Section 3 to develop a recursive data structure that has a much smaller space overhead and still allows queries to be answered in optimal time. Due to space limitations we restrict ourselves to the problem of answering general 2-sided queries by using a secondary memory priority search tree. Similar ideas can be used to get better space overheads for the other data structures as well.

We first describe a two-level scheme for building a secondary memory priority search tree that requires only O((n/B) log log B) storage while still admitting optimal query performance. We then briefly describe a multilevel version of this idea that requires only O((n/B) log* B) storage.

Recall that the basic scheme divides the points into regions of size B. A careful look at this scheme shows that the log B overhead is due to the fact that the ancestor and sibling caches of each of the n/B regions can potentially contain log B blocks. To reduce the space overhead we could either (1) reduce the amount of information stored in each region's cache or (2) reduce the total number of regions.

A closer look shows that to get optimal query time, for any region, we must store the information associated with log B of its ancestors. This is because in the priority tree structure the path length from any block to the root is O(log n). Thus to achieve a query overhead of at most log_B n we must divide such a path into no more than log_B n pieces. Since log_B n = log n / log B, we see that with each node we must store the information associated with O(log B) of its ancestors. We therefore turn to the second idea. To get a linear space data structure we build a basic priority search tree that divides the points into regions of size B log B instead of B. We thus have n/(B log B) regions.

To build the caches associated with each of the regions we proceed in a slightly different fashion. First, we sort the points in each region R right-to-left (i.e., largest to smallest) according to their x-coordinates. We store these points (B at a time) in a list of disk blocks associated with R. In the same fashion we also sort the points top-to-bottom (i.e., largest to smallest) and store them in a list of disk blocks associated with R.
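One possible way to materialize these lists is sketched below (Python, with illustrative names, and with regions represented simply as point sets): each region keeps its points blocked by decreasing x and by decreasing y, and a region's A-list is assembled from just the first X-block of each of its nearest ancestors; the S-list is built analogously from first Y-blocks.

```python
def blocked(points, key, B):
    """Sort points by the given key in decreasing order and cut into B-sized blocks."""
    pts = sorted(points, key=key, reverse=True)
    return [pts[i:i + B] for i in range(0, len(pts), B)]

def region_lists(region_points, B):
    """X-list (by decreasing x) and Y-list (by decreasing y) of one region."""
    return {"X": blocked(region_points, lambda p: p[0], B),
            "Y": blocked(region_points, lambda p: p[1], B)}

def ancestor_cache(ancestor_lists, B):
    """A-list of a region: copy the first block of each ancestor's X-list,
    then re-sort the copies by decreasing x and block them again."""
    first_blocks = [lst["X"][0] for lst in ancestor_lists if lst["X"]]
    merged = [p for blk in first_blocks for p in blk]
    return blocked(merged, lambda p: p[0], B)

# Usage: one region plus two ancestors, B = 2.
B = 2
anc1 = region_lists([(10, 9), (3, 8), (6, 7)], B)
anc2 = region_lists([(8, 6), (1, 5)], B)
a_list = ancestor_cache([anc1, anc2], B)
print(a_list)   # blocks of the 2 + 2 copied points, ordered by decreasing x
```

Since only one block is copied per ancestor (and per sibling), each region's caches stay within a constant number of blocks, which is what drives Lemma 4.1 below.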
Thus the points in each region are blocked according to their x as well as their y coordinates. These lists are called the X-list and the Y-list of R respectively. To build the ancestor cache associated with a region R we look at its log B immediate ancestors. From each of the ancestors' X-lists we copy the points in the first block. We then sort all these points right-to-left according to their x-coordinates and store them in a list of disk blocks associated with region R. These blocks constitute the ancestor list (A-list) of R. Similarly, to build the sibling cache (S-list) of R we consider the first blocks from the Y-lists of the siblings of R and its ancestors. Adding up all the storage requirements we get the following lemma.

Lemma 4.1 The total storage required to implement the top-level priority search tree and the associated A, S, X, and Y lists is O(n/B).

Unlike in the basic scheme we now have regions which contain O(B log B) points or O(log B) disk blocks. To complete our data structure we therefore build secondary-level structures for each of these regions. For each region we build a priority search tree as per Lemma 3.1. In other words, we divide the region into blocks of size B and build ancestor and sibling caches (as before) for each of the blocks. In this case the height of any path is at most O(log log B); therefore for each region we use all of its ancestors and the siblings to construct the ancestor and sibling caches.

We now count the space overhead incurred due to the priority search trees built for each region. For each block in a given region R we need no more than O(log log B) disk blocks to build its ancestor and sibling caches. This follows from the fact that the priority search tree of R has height O(log log B). A given region R has O(log B) disk blocks; therefore the space required for storing the ancestor and sibling caches of all the blocks in R is O(log B log log B). Adding this over all the regions we get the following lemma.

Lemma 4.2 The total storage required by the two-level data structure is O((n/B) log log B).

4.1 Answering queries using the two-level data structure

We now show how to answer 2-sided queries using this data structure. To answer such a query we proceed as follows: As in the basic scheme we first determine the region R (in the top-level priority search tree) that contains the corner of the query. As discussed in Section 3, the points in the query belong to either the region R, or one of R's ancestors Q, or a sibling T of Q (or R), or to a descendant of some sibling T.

To find the points that are in the ancestors and their siblings we look at the log_B n ancestor and sibling caches along the path from R to the root. From these caches we collect all the points that lie inside the query. However, just looking at these two caches is not enough to guarantee that we have collected all the points from the ancestors and their siblings. This is because we only use one block from each ancestor (and each sibling) to build the A- and S-lists.

To collect the other points in these regions that are in our 2-sided query we examine the X- and Y-lists of the ancestors and their siblings respectively. These lists are examined block by block until we reach a block that is not fully contained in our query. The X-list of an ancestor Q of R is examined if and only if all the points from Q that were in the ancestor cache of R are found to be inside the 2-sided query. Similarly the Y-list of a sibling T (along the path from R to the root) is examined if and only if the points from T that are in the sibling cache are all inside our 2-sided query.

We claim that this algorithm will correctly find all points in the ancestor and the sibling regions that are in our query. We now show that all the points in the ancestor regions are found correctly. The case for the siblings of the ancestors is similar. Consider some ancestor region Q of R and its associated right-to-left ordering of the points as represented in its X-list. Since Q is an ancestor, it is cut by the vertical line of the 2-sided query (see Figure 4). Therefore, all the points in Q that are to the right of the vertical line are in the query and the rest aren't. Thus, all the points in the query are present in consecutive disk blocks (starting from the first one) in the X-list of Q. We therefore need to examine the i-th block in the X-list if and only if all the previous i - 1 blocks are completely contained in our query. Since the first block is part of R's ancestor cache, we need to look at the X-list of Q if and only if all of the points of Q that are in the ancestor cache of R are contained in the 2-sided query.

To account for the time taken to find these points we note that there are O(log_B n) caches that we must look at. It is not hard to see that apart from these initial lookups we only look at a disk block from a region Q if our last I/O yielded B points (inside the query) from this region. Therefore all the other I/O's are paid for. Thus the time to find these points is O(log_B n + (t_A + t_S)/B), where t_A and t_S denote the number of points contained in the ancestors and their siblings.

To find the points in the query that are in the descendants of the siblings, we use the same approach as the basic scheme. We find the points in all these regions by scanning their Y-lists. We traverse a region Q if and only if its parent P is fully contained in the query. An argument similar to the one above shows that all such points are found by this algorithm, and that the number of I/O's required is O(t_D/B), where t_D denotes the number of points contained in the descendants of the siblings.
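The conditional examination rule can be stated compactly in code. The sketch below (hypothetical names) handles one ancestor region Q: its X-list is touched only if every point of Q already cached in R's A-list lies inside the query, and is then read block by block until a block is not fully contained.

```python
def collect_from_ancestor(cached_points, x_list_blocks, x1):
    """Points of one ancestor region Q that satisfy x >= x1.

    cached_points are Q's points already present in R's A-list (Q's first
    X-block). Q's own X-list is consulted only when every cached point is
    inside the query, and is then read block by block until a block is not
    fully contained; each extra read is paid for by the B in-query points
    the previous read returned."""
    hits = [p for p in cached_points if p[0] >= x1]
    if len(hits) < len(cached_points):        # some cached point is outside:
        return hits, 0                        # no need to touch Q's X-list
    ios = 0
    for block in x_list_blocks[1:]:           # block 0 is the cached one
        ios += 1
        kept = [p for p in block if p[0] >= x1]
        hits.extend(kept)
        if len(kept) < len(block):            # first partially-contained block
            break
    return hits, ios

# Usage: B = 2, query boundary x1 = 4.
x_list = [[(9, 1), (8, 2)], [(7, 3), (6, 4)], [(3, 5), (2, 6)]]
pts, ios = collect_from_ancestor(x_list[0], x_list, x1=4)
print(pts, ios)   # all four points with x >= 4, found with 2 extra reads
```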
To find the points in the query that are in region R we use the second-level priority tree associated with R. Let t_R denote the number of points in R that belong to the query. We find these points by asking a 2-sided query inside region R. This requires at most O(t_R/B) I/O's (by the arguments in Section 3).

Therefore, the total number of I/O's required to answer a 2-sided query is O(log_B n + t/B), where t is the size of the output. This in conjunction with Lemma 4.2 gives us the following theorem.

Theorem 4.3 There exists a secondary memory static implementation of the priority search tree that can be used to answer general 2-sided queries using O(log_B n + t/B) I/O's, where t is the size of the output. This data structure requires O((n/B) log log B) disk blocks of space to store n points.

4.2 A multilevel scheme to further lower the space overhead

It is possible to reduce the space overhead further by using more than two levels. The idea is the same as before. At the second stage, instead of building a basic priority search tree for each region we build a tree that contains regions of size B log log B and build the X, Y, A, and S lists the same as before. A three-level scheme gives us a space overhead of O((n/B) log log log B) while maintaining optimum query time. If we carry this multilevel scheme further then we get a data structure with the following bounds.

Theorem 4.4 There exists a secondary memory static implementation of the priority search tree that can be used to answer general 2-sided queries using O(log_B n + t/B + log* B) I/O's, where t is the size of the output. This data structure requires O((n/B) log* B) disk blocks of space to store n points.

These ideas can be applied to 3-sided queries as well, to reduce the space overhead incurred by the sibling caches. In particular we can get the following bounds for answering 3-sided queries.

Theorem 4.5 There exists a secondary memory static implementation of the priority search tree that can be used to answer general 3-sided queries using O(log_B n + t/B + log* B) I/O's, where t is the size of the output. This data structure requires O((n/B) log B log* B) disk blocks of space to store n points.

5 A fully dynamic secondary memory data structure for answering 2-sided queries

In this section we show how to dynamize the two-level scheme discussed in Section 4. Our data structure is fully dynamic, i.e., it can handle both additions and deletions of points. The amortized time bound for an update is O(log_B n). In this abstract we only give a brief overview of the dynamization. The details will be given in the full paper.

Before we describe our dynamic data structure we first discuss an alternate way of visualizing the top-level priority search tree in the two-level scheme. In this tree a node corresponds to a region of size B log B. We partition this priority search tree into subtrees of height log B - log log B. Each such subtree is considered a super node. As in Section 3, in order to build the ancestor and sibling cache of any region R, we only consider those ancestors (and their siblings) of R that are in the same super node as R. Considering subtrees of height log B - log log B (instead of log B) does not change the query times because we are now dealing with regions of size B log B.

The advantage of viewing these subtrees as super nodes is that now we can isolate the duplication of information due to the S- and A-lists to within a super node, since none of the caches ever cross the super node boundary. The layer of super nodes (regions) immediately following the last layer in a super node N can be thought of as children of N in this visualization. Note that each super node N contains B/log B regions and has O(B/log B) super nodes as its children. Also note that the number of super nodes on any path in the tree is O(log_B n).

Our dynamic data structure associates an update buffer U of size B with each super node N in the tree. It also associates an update buffer u (also of size B) with each region R. These buffers are used to store incoming updates until we have collected enough of them to account for rebuilding some structure. To process a query, we first use the algorithm from Section 4 to collect the points that have been entered into the data structure. We then look through the associated update buffers to add new points that have been added and to discard old points that have been deleted by an unprocessed update.

We now need to be careful in claiming optimal query time because the points that we collect by searching the priority search structures may then have to be deleted when we look at the respective update buffers. However, it can be shown that for every B log B points we collect we can lose at most B points, thus resulting in at most two wasted I/O's for log B useful ones. Therefore, the loss of points due to unprocessed deletes is very small and does not affect the overall query performance.

Whenever an update occurs, we first locate the super node N where the update should be made. The update could be a point insertion or a deletion. We then log the update in the associated update buffer U. If the buffer does not overflow, we don't do anything.
If U overflows, then we take all the updates collected and propagate them to the regions in N where they should go. For instance, a point insertion is trickled down to the region of N that contains its coordinates. In each such region we then log the update into the local update buffer u associated with it. We now rebuild the X- and Y-lists of each region in N taking into account the updates that have percolated into it. We also rebuild each region's ancestor and sibling caches, again taking into account the updates that have percolated into that region.

The number of I/O's required to rebuild the A, S, X, and Y lists of one region R is O(log B). Therefore, the number of I/O's required to rebuild all the caches is (B/log B) * O(log B) = O(B). Since we do this only once in B updates, the amortized cost of rebuilding the caches is O(1) per update.

If for none of the regions in N any of their buffers overflow, we don't do anything further. Otherwise, for each region R whose update buffer has overflowed we rebuild the second-level priority search tree associated with it. As before, we take into account the updates in the buffer u of R.

The number of I/O's required to rebuild the priority search tree associated with a given region R is O(log B log log B). This is because we need to rebuild the caches of log B blocks, each of which could contain up to log log B disk blocks of information. Since this is done only once every B updates, the amortized time per update is O((log B log log B)/B) = O(1).
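The buffering discipline described so far can be summarized by a small skeleton (Python; the class names are ours and the rebuild work is only counted, not performed), showing where the per-B-updates cache rebuild and the per-region second-level rebuild are triggered.

```python
class Region:
    def __init__(self, B):
        self.buffer, self.B = [], B          # update buffer u, capacity B

class SuperNode:
    def __init__(self, B, regions):
        self.buffer, self.B = [], B          # update buffer U, capacity B
        self.regions = regions

    def log_update(self, upd, io_counter):
        """Log an update; on overflow push the batch down and rebuild caches."""
        self.buffer.append(upd)
        if len(self.buffer) <= self.B:
            return
        for u in self.buffer:                # propagate the batch to its regions
            self.regions[u["region"]].buffer.append(u)
        self.buffer.clear()
        # Rebuilding the A, S, X, Y lists of every region costs
        # (B / log B) * O(log B) = O(B) I/Os, i.e. O(1) amortized per update;
        # here we just charge B abstract I/Os for it.
        io_counter["cache_rebuild"] += self.B
        for r in self.regions.values():      # regions whose own buffer overflowed
            if len(r.buffer) > r.B:          # rebuild their second-level trees
                io_counter["second_level_rebuild"] += 1
                r.buffer.clear()

# Usage: 4 regions, B = 4; insertions are routed by a hypothetical region id.
ios = {"cache_rebuild": 0, "second_level_rebuild": 0}
node = SuperNode(4, {i: Region(4) for i in range(4)})
for k in range(20):
    node.log_update({"op": "insert", "region": k % 4, "point": (k, k)}, ios)
print(ios)
```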
For every super node N, once every B log B updates we do a rebuild. This rebuilding keeps the same x-division as the old one. Keeping the x-divisions the same, we change the y-lines of the regions so that each region now contains exactly B log B points. Using this new region structure we rebuild the A, S, X, and Y lists for each region as well as the secondary-level priority search trees associated with each region. Note that it is important to keep the same x-division to preserve the underlying binary structure of the priority search tree. We cannot view the regions in a given super node in isolation since they are part of a bigger priority search tree.

To keep the invariant that each region in N contains B log B points we may have to push points into its children or we may have to borrow points from them. These are logged as updates in the corresponding super nodes. Pushing points into a node is equivalent to adding points to that region while borrowing points from a node is the same as deleting points from the region. These updates may then cause an overflow in the buffers associated with one or more of those super nodes. We repeat the same process described above with any such super node.

It is easy to show that the number of I/O's required to rebuild a super node is O(B log log B). Since a rebuild is done once every B log B updates, the amortized time required for one such rebuild is O(1). However, since we push updates down when we rebuild super nodes, we may have to do up to log_B n rebuilds (along an entire path) due to a single overflow. Therefore, the amortized cost of a rebuild is O(log_B n).

A moment of thought reveals that pushing points down is not enough to keep the priority search tree balanced. Repeated additions or deletions to one side can make subtrees unbalanced. We therefore periodically rebuild subtrees in the following manner. With each node in the tree we associate a size, which is the number of points in the subtree rooted at that node. We say that a node is unbalanced if the size of one of its children is more than twice the size of its other child. Whenever this happens at a node R we rebuild the subtree rooted at R. The number of I/O's required to rebuild a priority search tree with x points is O((x/B) log_B x + (x/B) log log B). This is because we need to rebuild the secondary-level priority search trees as well as the primary-level tree along with all the caches at both levels. Since a subtree of size x can get unbalanced only after O(x) updates, we get an amortized rebuilding time of O((log_B x + log log B)/B) = O(1).

Summing up all the I/O's that are required to rebuild various things, we see that the total amortized time for an update is O(log_B n). We therefore have the following theorem.

Theorem 5.1 There exists a fully dynamic secondary memory implementation of the priority search tree that can be used to answer general 2-sided queries using O(log_B n + t/B) I/O's, where t is the size of the output. The amortized I/O-complexity of processing both deletions and additions of points is O(log_B n). This data structure requires O((n/B) log log B) disk blocks of space to store n points.

Similar ideas can be used to get a dynamic data structure for answering 3-sided queries as well. The time to answer queries is still optimal but the time to process updates is not as good. In particular we get the following bounds for answering 3-sided queries.

Theorem 5.2 There exists a fully dynamic secondary memory implementation of the priority search tree that can be used to answer general 3-sided queries using O(log_B n + t/B) I/O's, where t is the size of the output. The amortized I/O-complexity of processing both deletions and additions of points is O(log_B n log^2 B). This data structure requires O((n/B) log B log log B) disk blocks of space to store n points.
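As a quick check of the subtree-rebuilding argument above, the sketch below (assumed names; the costs are computed from the stated bounds, not measured) encodes the unbalance trigger and the amortized rebuild cost per update.

```python
import math

def unbalanced(left_size, right_size):
    """Rebuild trigger: one child more than twice as large as the other."""
    return left_size > 2 * right_size or right_size > 2 * left_size

def rebuild_ios(x, B):
    """I/Os to rebuild a subtree with x points (secondary-level trees,
    primary tree, and all caches): (x/B) * log_B x + (x/B) * log log B."""
    return (x / B) * math.log(x, B) + (x / B) * math.log(math.log(B, 2), 2)

def amortized_per_update(x, B):
    """A subtree of size x becomes unbalanced only after Omega(x) updates,
    so the rebuild cost charged to each update stays O(1)."""
    return rebuild_ios(x, B) / x

print(unbalanced(700, 300))                          # True: 700 > 2 * 300
print(round(amortized_per_update(10**6, 1024), 4))   # well below 1 I/O per update
```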
6 Conclusions and open problems

Special cases of 2-dimensional range searching have many applications in databases. We have presented a technique called path caching which can be used to implement many main memory data structures for these problems in secondary memory. Our data structures have optimal query performance at the expense of a slight overhead in storage. Furthermore, our technique is simple enough to allow inserts and deletes in optimal or near-optimal amortized time.

There seem to be some fundamental obstacles to implementing many main memory data structures in secondary memory. We believe that studying space-time tradeoffs, as we have done, is important in understanding the complexities of secondary storage structures. The hope is that this will eventually help us develop efficient data structures that will provide good worst-case bounds on querying as well as update times. As of today, we have to rely on heuristics, which may or may not perform well at all times, to handle many of these problems.

Specifically, the important problem of dynamic interval management that we highlighted in [KRV] remains open. Can we solve this problem optimally, using O(n/B) storage, answering queries in O(log_B n + t/B) time, while being able to perform updates in O(log_B n) worst-case time?

Acknowledgments: We thank Paris Kanellakis for helpful discussions on this area.

References

[BaM] R. Bayer and E. McCreight, "Organization of Large Ordered Indexes," Acta Informatica 1 (1972), 173-189.

[Ben] J. L. Bentley, "Algorithms for Klee's Rectangle Problems," Dept. of Computer Science, Carnegie Mellon Univ., unpublished notes, 1977.

[BlGa] G. Blankenagel and R. H. Guting, "XP-Trees - External Priority Search Trees," FernUniversitat Hagen, Informatik-Bericht Nr. 92, 1990.

[BlGb] G. Blankenagel and R. H. Guting, "External Segment Trees," FernUniversitat Hagen, Informatik-Bericht, 1990.

[ChT] Y.-J. Chiang and R. Tamassia, "Dynamic Algorithms in Computational Geometry," Proceedings of the IEEE, Special Issue on Computational Geometry 80(9) (1992), 362-381.

[Cod] E. F. Codd, "A Relational Model for Large Shared Data Banks," CACM 13(6) (1970), 377-387.

[Com] D. Comer, "The Ubiquitous B-tree," Computing Surveys 11(2) (1979), 121-137.

[Edea] H. Edelsbrunner, "A new Approach to Rectangle Intersections, Part I," Int. J. Computer Mathematics 13 (1983), 209-219.

[Edeb] H. Edelsbrunner, "A new Approach to Rectangle Intersections, Part II," Int. J. Computer Mathematics 13 (1983), 221-229.

[Gun] O. Gunther, "The Design of the Cell Tree: An Object-Oriented Index Structure for Geometric Databases," Proc. of the Fifth Int. Conf. on Data Engineering (1989), 598-605.

[Gut] A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching," Proc. 1984 ACM-SIGMOD Conference on Management of Data (1985), 47-57.

[IKO] C. Icking, R. Klein, and T. Ottmann, "Priority Search Trees in Secondary Memory (Extended Abstract)," Lecture Notes in Computer Science #314, Springer-Verlag, 1988.

[KKR] P. C. Kanellakis, G. M. Kuper, and P. Z. Revesz, "Constraint Query Languages," Proc. 9th ACM PODS (1990), 299-313.

[KRV] P. C. Kanellakis, S. Ramaswamy, D. E. Vengroff, and J. S. Vitter, "Indexing for Data Models with Constraints and Classes," Proc. 12th ACM PODS (1993), 233-243. (A complete version of the paper appears in Technical Report 93-21, Brown University.)

[KKD] W. Kim, K. C. Kim, and A. Dale, "Indexing Techniques for Object-Oriented Databases," in Object-Oriented Concepts, Databases, and Applications, W. Kim and F. H. Lochovsky, eds., Addison-Wesley, 1989, 371-394.

[KiL] W. Kim and F. H. Lochovsky, eds., Object-Oriented Concepts, Databases, and Applications, Addison-Wesley, 1989.

[LoS] D. B. Lomet and B. Salzberg, "The hB-Tree: A Multiattribute Indexing Method with Good Guaranteed Performance," ACM Transactions on Database Systems 15(4) (1990), 625-658.

[LOL] C. C. Low, B. C. Ooi, and H. Lu, "H-trees: A Dynamic Associative Search Index for OODB," Proc. ACM SIGMOD (1992), 134-143.

[McC] E. M. McCreight, "Priority Search Trees," SIAM Journal of Computing 14(2) (1985), 257-276.

[NHS] J. Nievergelt, H. Hinterberger, and K. C. Sevcik, "The Grid File: An Adaptable, Symmetric Multikey File Structure," ACM Transactions on Database Systems 9(1) (1984), 38-71.

[Ore] J. A. Orenstein, "Spatial Query Processing in an Object-Oriented Database System," Proc. ACM SIGMOD (1986), 326-336.

[OSB] M. H. Overmars, M. H. M. Smid, M. T. de Berg, and M. J. van Kreveld, "Maintaining Range Trees in Secondary Memory: Part I: Partitions," Acta Informatica 27 (1990), 423-452.

[Rob] J. T. Robinson, "The K-D-B Tree: A Search Structure for Large Multidimensional Dynamic Indexes," Proc. ACM SIGMOD (1984), 10-18.

[Sama] H. Samet, Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS, Addison-Wesley, 1989.

[Samb] H. Samet, The Design and Analysis of Spatial Data Structures, Addison-Wesley, 1989.

[SRF] T. Sellis, N. Roussopoulos, and C. Faloutsos, "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects," Proc. 1987 VLDB Conference, Brighton, England (1987).

[SmO] M. H. M. Smid and M. H. Overmars, "Maintaining Range Trees in Secondary Memory: Part II: Lower Bounds," Acta Informatica 27 (1990), 453-480.

[ZdM] S. Zdonik and D. Maier, Readings in Object-Oriented Database Systems, Morgan Kaufmann, 1990.
