Data Structures for Range Searching

JON LOUIS BENTLEY


Departments of Computer Science and Mathematics, Carnegie-Mellon University, Pittsburgh,
Pennsylvania 15213
JEROME H. FRIEDMAN
Computation Research Group, Stanford Linear Accelerator Center, Stanford, California 94305
Much research has recently been devoted to "multikey" searching problems. In this paper
the particular multikey problem of range searching is investigated and a number of data
structures that have been proposed as solutions to this problem are surveyed. The
purposes of this paper are to bring together a collection of widely scattered results, to
acquaint the reader with the structures currently available for solving the particular
problem of range searching, and to display a set of general methods for attacking multikey
searching problems.
Keywords and Phrases: analysis of algorithms, orthogonal range queries, range searching,
cells, multidimensional binary search trees, projection
CR Categories: 3.63, 3.74, 5.25
INTRODUCTION
The study of data structures for facilitating
rapid searching is a fascinating subject of
both practical and theoretical interest.
Knuth [KNUT73] provides a definitive treatise
on the subject of searching when the
search is based on only one "key," but he
points out that not much was known at the
time his book was published about data
structures for sets that have many "keys."
This subject area, which is often called
"multikey searching," "multidimensional
searching," or "multiple attribute retrieval,"
has been the focus of a great deal
of research in the past few years. In this
paper we study a small part of this area by
surveying the work that has been done on
one particular multikey searching problem.
This problem is important in itself (having
applications in such areas as database systems,
statistics, and design automation)
and, in addition, serves as a representative
of the entire class of multikey searching
problems.

This research was supported in part by the Office of
Naval Research under Contract N00014-76-C-0370
and in part by the Department of Energy

We need some definitions to describe this
particular searching problem precisely. In
database terminology a file is a collection
of records, each containing several attri-
butes or keys. A query asks for all records
satisfying certain characteristics. An or-
thogonal range query asks for all records
with key values each within specified
ranges (that is, each key is between speci-
fied upper and lower bounds). The process
of retrieving the appropriate records is
called range searching. This problem can
also be cast in geometric terms by regarding
the record attributes as coordinates and the
k values for each record as representing a
point in a k-dimensional coordinate space.
The file of records then becomes a point set
in k-space. The intersection of the query
ranges is a k-dimensional hyperrectangle in
the space (that is, a "box"), and a range
query calls for finding all points lying inside
this hyperrectangle. We will often cast
range searching in this geometric framework
as an aid to intuition.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its
date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To
copy otherwise, or to republish, requires a fee and/or specific permission.
1979 ACM 0010-4892/79/1200-0397 $00.75
Computing Surveys, Vol. 11, No. 4, December 1979
CONTENTS
INTRODUCTION
1. THE DATA STRUCTURES
1.1 Sequential Scan
1.2 Projection
1.3 Cells
1.4 k-d Trees
1.5 Range Trees
1.6 k-ranges
1.7 Other Structures
1.8 Comparison of Methods
2. ADDITIONAL WORK
3. CONCLUSIONS
REFERENCES
Range searching arises in many applica-
tions. In a geographic database of U.S.
cities one might seek a list of all those with
latitude between 37° and 41° and longitude
between 102° and 109° (defining the state
of Colorado). To compile an honor list of
older students, a university administrator
may wish to know those students whose
age is between 21 and 24 years and whose
grade point average is between 3.5 and 4.0.
In data analysis it is often useful to do
separate analyses on sets of data lying in
different regions (hyperrectangles) of the
observation space and then compare (or
contrast) the respective results. (At the
Stanford Linear Accelerator Center, for ex-
ample, over 10 hours per week of IBM 370/
168 time is devoted to this application.) In
statistics, range searching can be employed
to determine the empirical probability con-
tent of a hyperrectangle, to determine em-
pirical cumulative distributions, and to per-
form density estimation (see LOFT65).
Lauther [LAUT78] describes how range
searching can be used to solve a design
automation problem in very large-scale in-
tegrated circuitry (VLSI).
This paper has been written with two
distinct audiences in mind. For the expert
in searching (with background either in
database systems or theoretical computer
science), this paper is intended as a survey
that gathers together and presents in a
common terminology a number of results
that have recently appeared on the problem
of range searching. This problem is of particular
interest for two reasons: First, it is
an important problem in many practical
applications (and a difficult theoretical
problem!); second, the methods that we
investigate are broadly applicable to many
other multikey searching problems. The
second type of reader for whom this paper
is intended is a computer scientist who is
somewhat familiar with data structures for
single-key searching, and who would like a
tutorial on the problem of range searching.
For this reader, the methods that we discuss
are described on an intuitive level, and
references are given to more precise descriptions
elsewhere in the literature.
In Section 1 of this paper we examine six
data structures for the range searching
problem in some detail, and then briefly
compare those structures at the end of the
section. Additional work (that both has
been done and needs to be done) is de-
scribed in Section 2, and conclusions are
then offered in Section 3.
1. THE DATA STRUCTURES
In this section we investigate a number of
search methods for range searching. Each
search method is specified by a data structure
for storing the data and algorithms for
building (which we call preprocessing) and
searching the structure. We will analyze a
search structure (say A) by giving three
cost functions of N (the number of points)
and k (the number of dimensions):
$P_A(N, k)$, the cost of preprocessing N
points in k-space into a data structure;
$S_A(N, k)$, the storage required by the
data structure;
$Q_A(N, k)$, the search time or query cost.
These costs can be analyzed in terms of
their average or their worst case; we usually
speak of the worst-case cost, explicitly men-
tioning the average whenever we employ it.
In many applications one may desire var-
ious utility operations on data structures,
such as insertion and deletion. In this sec-
tion we ignore this issue, considering only
static (unchanging) files; we then return to
the question of dynamic structures in Section 2.
FIGURE 1. Illustration of projection.
1.1 Sequential Scan
The simplest approach to range searching
is to store the N points in a sequential list.
As each query arrives, all elements of the
list are scanned and every record that satisfies
the query is reported. If the queries
do not have to be handled immediately,
then they can be "batched" so that many
queries can be processed with one sequential
pass through the file. Since all k keys of
the N records must be stored and each
k-key record is examined as the structure is
built or searched, it is easy to see that the
sequential scan structure SS has the properties
$P_{SS}(N, k) = O(Nk)$,
$S_{SS}(N, k) = O(Nk)$,
$Q_{SS}(N, k) = O(Nk)$.
Sequential scanning has the advantage of
being trivial to implement on any storage
medium. It is competitive with the more
sophisticated methods described in this pa-
per when the file is small and the number
of attributes is large, or when a large frac-
tion of the records in the file satisfy the
query (or queries, if they are batched).
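The sequential scan, including the batched variant, can be sketched in a few lines of Python. This is only an illustration: the names `in_range`, `sequential_scan`, and `batched_scan`, the representation of a record as a tuple of k keys, and the representation of a query as one closed (low, high) pair per key are all assumptions made here, not notation from the paper.

```python
def in_range(point, box):
    """Return True when every key lies within its closed [low, high] range."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, box))


def sequential_scan(points, box):
    """Answer one range query by testing every record: O(Nk) work."""
    return [p for p in points if in_range(p, box)]


def batched_scan(points, boxes):
    """Answer several queries with a single pass over the file."""
    answers = [[] for _ in boxes]
    for p in points:                      # one pass through the file
        for answer, box in zip(answers, boxes):
            if in_range(p, box):
                answer.append(p)
    return answers
```

The batched version does the same number of inclusion tests as separate scans; what it saves is passes over the file, which matters most when the file lives on a slow sequential medium.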
1.2 Projection
The projection technique involves keeping,
for each attribute, a sequence of the records
in the file sorted by that attribute. One can
view this geometrically as a projection of
the points on each coordinate. The k lists
representing the projections can be obtained
by using a standard sorting algorithm
k times. After preprocessing, a range
query can be answered by the following
search procedure: Choose one of the attributes,
say the ith. Look up the two positions
in the ith sequence (using a binary search)
of the extreme values defining the range on
the ith attribute of the query. All records
satisfying the query will be in the list between
these two positions just found. This
(smaller) list is then searched by brute
force. The projection technique is referred
to as inverted lists by Knuth [KNUT73].
This technique was applied by Friedman,
Baskett, and Shustek [FRIE75] in their so-
lution of the "nearest neighbor" problem
and by Lee, Chin, and Chang [LEEC76] to
a number of database problems.
The projection technique is illustrated in
Figure 1. The points represent a set of
sixteen records of two keys each, repre-
sented by the x- and y-coordinates. The
dashed lines are the projection of the records
onto the x-coordinate (that is, the
records sorted into x-order). The vertical
slab is the x-range of the query, the horizontal
slab is the y-range, and the rectangle
that is their intersection contains those
points which satisfy the query. To answer
this query, we need only investigate the six
points that are inside the vertical slab
marked by the 45° lines.
One can apply the projection technique
with only one sorted list (projection). If the
distribution of values of the various attributes
is more or less uniform over similar
ranges and the query ranges of each attribute
are similar, then one list is sufficient.
If this is not the case, however, then keeping
several lists can often lead to substantial
reductions in the query time. The multiple
projections are exploited by performing two
binary searches in each to find the lower
and upper bounds of the respective range,
and then searching that projection with the
smallest number of records in the range.
The cost analysis of projection is
straightforward. To preprocess a file of N
records of k keys each, we must perform k
sorts of N elements. To store such a file, we
must store k lists of N elements each. These
facts immediately yield
$P_p(N, k) = O(kN \log N)$,
$S_p(N, k) = O(kN)$.
Friedman, Baskett, and Shustek [FRIE75]
show that for searches that have almost
cubical query regions and find a small number
of records (and are therefore similar to
nearest neighbor searches), the query time
of projection is given by
$Q_p(N, k) = O(N^{1-1/k})$ (average case)
when the point set is drawn from a smooth
underlying distribution. The projection
technique is most effective when the queries
almost always contain one range that
excludes most of the file.
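The multiple-projection search described above can be sketched as follows. This is a minimal illustration under stated assumptions: records are tuples of k keys, a query is one closed (low, high) pair per key, and the function names `build_projections` and `projection_query` are hypothetical. Each projection is stored as a sorted key list plus the matching order of record indices, which is one convenient layout among several.

```python
from bisect import bisect_left, bisect_right


def build_projections(points):
    """Preprocess: k sorts of N elements. For each key i, keep the sorted
    key values and the record indices in that same order."""
    k = len(points[0])
    projections = []
    for i in range(k):
        order = sorted(range(len(points)), key=lambda idx: points[idx][i])
        keys = [points[idx][i] for idx in order]
        projections.append((keys, order))
    return projections


def projection_query(points, projections, box):
    """Two binary searches per projection, then brute force over the
    projection with the fewest candidate records."""
    best = None
    for (keys, order), (lo, hi) in zip(projections, box):
        start = bisect_left(keys, lo)     # first position with key >= lo
        stop = bisect_right(keys, hi)     # first position with key > hi
        if best is None or stop - start < best[1] - best[0]:
            best = (start, stop, order)
    start, stop, order = best
    return [points[idx] for idx in order[start:stop]
            if all(l <= x <= h for x, (l, h) in zip(points[idx], box))]
```

Storing indices rather than copies of the records keeps the extra storage at k lists of N entries, in line with the O(kN) figure above.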
1.3 Cells
There are two ways they can search [for the murder
weapon]: from the body outward in a spiral, or divide
the room up into squares--that's the grid method.
From the CBS series Kojak,
"Death Is Not a Passing Grade"
Cartographers as well as detectives use the
grid (or cell) method. Street maps of metropolitan
areas are often printed in the form
of books. The first page of the book shows
the entire area, and the remaining pages
are detailed maps of (say) one-mile-square
regions. To find (for example) all schools in
a specified rectangle, one would look at the
first page to find which squares overlap the
rectangle and then check only on those
pages of the book to find the schools. This
approach can be mechanized immediately.
A square of the map corresponds to a cell
in k-space, and the points of the file within
the cell are stored together in an implementation.
The first page of the map book
corresponds to a directory that allows one
to take a hyperrectangle and look up the
set of cells.
The cell technique is illustrated in Figure
2. The sixteen points in that figure represent
sixteen records containing two keys
each. The points in each cell are stored
together in an implementation. The query
is given by the rectangle in the upper part
of the figure, and to answer it, only those
points in the four dashed cells need be
investigated. The squares in that figure are
the "directory" corresponding to the first
page of the map book.
FIGURE 2. Illustration of cells.
The directory can be implemented in two
ways. If the points are (say) uniformly distributed
on $[0, 10]^2$ and we have chosen
$1 \times 1$ cells, then we can use a two-dimensional
array as the directory, named
DIRECT(0..9, 0..9). In DIRECT(i, j) we
would keep a pointer to a list of all points
in the cell $[i, i + 1] \times [j, j + 1]$. If we
wanted to find all points in $[5.2, 6.3] \times [1.2, 3.4]$,
then we would only have to examine
cells (5, 1), (5, 2), (5, 3), (6, 1), (6, 2), and (6, 3)--we
call this "translating" from a range
query to a set of cell id's. The multidimensional
array works very well when the
points are known a priori to be uniformly
distributed over some given rectangle in the
key space. When this is not known to be
the case, one would probably use a search
method, such as hashing, for the directory.
In this method we name each cell as before;
so cell (i, j) is a pointer to the points in
$[i, i + 1] \times [j, j + 1]$. Instead of storing all
cells, however, we store only those cells
that actually contain records of the file. To
process a query, we translate the rectangle
into a set of cell id's (as we did above), look
up those id's, and check all the points in the
occupied cells for inclusion in the rectangle.
The storage required for the cell technique
is the storage for the directory plus locations
for the linked list representing points
in cells; the size of the directory is usually
much smaller than N.
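The hashed-directory variant can be sketched with a Python dictionary standing in for the hash table. The names `build_cells`, `cell_ids`, and `cell_query` and the `cell_size` parameter are assumptions made for this illustration. One further assumption: each point is assigned to exactly one cell by taking the floor of each coordinate, so cells are treated as half-open on the upper side, whereas the text writes them as closed intervals.

```python
from collections import defaultdict
from itertools import product
from math import floor


def build_cells(points, cell_size=1.0):
    """Hashed directory: cell id (tuple of ints) -> points in that cell.
    Only occupied cells ever get an entry."""
    directory = defaultdict(list)
    for p in points:
        cell_id = tuple(floor(x / cell_size) for x in p)
        directory[cell_id].append(p)
    return directory


def cell_ids(box, cell_size=1.0):
    """Translate a query box into the ids of every cell it overlaps."""
    spans = [range(floor(lo / cell_size), floor(hi / cell_size) + 1)
             for lo, hi in box]
    return product(*spans)


def cell_query(directory, box, cell_size=1.0):
    """Look up each overlapped cell, then run inclusion tests on its points."""
    found = []
    for cid in cell_ids(box, cell_size):
        for p in directory.get(cid, ()):  # .get avoids creating empty cells
            if all(lo <= x <= hi for x, (lo, hi) in zip(p, box)):
                found.append(p)
    return found
```

With unit cells, `cell_ids([(5.2, 6.3), (1.2, 3.4)])` yields the same six cell ids listed in the text, in the same order.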
Knuth [KNUT73] has discussed this
scheme for the two-dimensional case. Lev-
inthal [LEVI66] used a cell technique in
three-dimensional Euclidean space for de-
termining all atoms within 5 angstroms of
every atom in a protein molecule--he re-
ferred to this as "cubing." The idea of using
hashing for the cell directory was first de-
scribed by Yuval [YUVA75], and was later
used by Rabin [RABI76] to solve the "closest
pair" problem. Bentley, Stanat, and
Williams [BENT77] discuss a number of
different implementations for the directory
(two of which we have seen).
The basic parameters of the cell tech-
nique are the size and shape of each cell. In
analyzing a search there are two costs to
count: cell accesses (the number of direc-
tory look-ups) and inclusion tests (testing
whether a point satisfies the range query).
If the cell size is extremely large, there will
be few cell accesses and many inclusion
tests. If the cell size is very small, on the
other hand, there will be very many cell
accesses and very few inclusion tests.
Clearly, either extreme is to be avoided.
The best cell size and shape depend on
the size and shape of the query hyperrec-
tangle. Bentley, Stanat, and Williams
[BENT77] show that if the query hyperrectangles
have constant size and shape so
that only their location (in the coordinate
space) is unspecified, then for a single grid
a nearly optimum size and shape for the
cells are the same as those of the query
hyperrectangle. For this case the number
of cells accessed is $2^k$, and the expected
search time is proportional to $2^k$ times the
number of points in the range. In this context
the performance of cells is given by
$P_c(N, k) = O(Nk)$,
$S_c(N, k) = O(Nk)$,
$Q_c(N, k) = O(2^k F)$ (average),
where F is the number of records found. In
most applications, however, the queries will
vary in size and shape as well as in location,
so there is little information available for
making a good choice of cell size and shape.
1.4 k-d Trees
In this section we examine a data structure
called the "k-dimensional binary search
tree," which is usually abbreviated as "k-d
tree." This structure is a natural generalization
of the standard one-dimensional binary
search tree, so we will briefly review a special
type of that structure (a complete description
of binary trees can be found in
KNUT73). To build a file of single-key records
into a binary search tree, we choose
the median of the set as the discriminator
value and build all records with key values
less than or equal to the discriminator into
the left subtree of the root (recursively) and
all elements with greater key values into
the right subtree. This process continues
recursively until there are only a few (say
six or fewer) nodes in the set, at which point
we store them as a linked list. Note that no
records are stored in the internal nodes of
such a binary search tree; they are contained
only in the leaf nodes or "buckets"
at the bottom of the tree. We can answer a
range search in this structure by a recursive
algorithm that compares the range to the
discriminator of the node it is currently
visiting. If the range is entirely to one side
or the other of the discriminator, only the
appropriate son is searched; otherwise both
sons are searched recursively.
The single-key binary search tree per-
forms three functions at once: It stores the
records of the file (in the external nodes, or
"buckets"), it divides the data space into
segments (by choosing the discriminators),
and it gives a directory among the segments
(the tree structure). We now investigate a
multidimensional generalization of the binary
search tree that performs these same
three functions: storing the records, divid-
ing space into hyperrectangles, and provid-
ing a directory among the hyperrectangles.
It accomplishes this by using the same idea
as the one-dimensional algorithm with one
critical exception: In the one-dimensional
tree we only have one key to use as the
discriminator; in a multidimensional tree
we have to choose at each internal node
one of k keys to use as a discriminator.
The algorithm for constructing a k-d tree
is to choose for the discriminator that coordinate
j for which the spread of attribute
values (as measured by any convenient statistic,
such as variance or distance from
minimum to maximum) is maximum for
the subcollection represented by the node.
The partitioning value is chosen to be the
median value of this attribute. This algorithm
is then applied recursively to the two
subcollections represented by the two sons
of the node just partitioned. The partitioning
is stopped, creating a terminal node (or
bucket), when the cardinality of the subcollection
is less than a prespecified maximum,
which is a parameter of the procedure.
(Friedman, Bentley, and Finkel
[FRIE77] found empirically that values
ranging from 8 to 16 work well in a Fortran
implementation.) The result of this procedure
is that the coordinate space is divided
into a number of buckets, each containing
approximately the same number of points
(by the stopping criterion) and each approximately
"cubical" in shape (by choosing
as discriminator the dimension of maximum
spread, which slowly chops long and
skinny rectangles into cubes).
Range searching with k-d trees is
straightforward. Starting at the root, the
k-d tree is recursively searched in the following
manner. When visiting a node that
discriminates by the jth key (which we call
a j-discriminator), one compares the jth
range of the query with the discriminator
value. If the query range is totally above
(or below) that value, then one need only
search the right subtree (respectively, left)
of that node; the other son can be pruned
from the search because no record it contains
can satisfy the query in that
particular key. If the query range overlaps
the node's key (that is, the key is between
the low and high bounds of the range), then
both sons need be searched. This can be
accomplished by searching both sons recursively
(the search being implemented by a
stack).
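The construction and search just described can be sketched together in Python. This is an illustrative sketch, not the Fortran implementation mentioned above. The assumptions are: spread is measured as maximum minus minimum; a bucket is a plain list and an internal node is a tuple `(j, value, left, right)`; the functions `build_kd` and `kd_query` and the `bucket_size` parameter are names invented here; splitting stops when a subcollection has at most `bucket_size` points, or when heavy duplication leaves nothing above the median.

```python
def build_kd(points, bucket_size=8):
    """Return a bucket (list of points) or an internal node
    (j, value, left, right) that discriminates on key j at its median."""
    if len(points) <= bucket_size:
        return list(points)
    k = len(points[0])
    # Choose the key with the largest spread (max - min) in this subcollection.
    j = max(range(k),
            key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    ordered = sorted(points, key=lambda p: p[j])
    value = ordered[(len(ordered) - 1) // 2][j]       # median value of key j
    left = [p for p in ordered if p[j] <= value]
    right = [p for p in ordered if p[j] > value]
    if not right:
        # No point lies above the median (many duplicate keys):
        # keep these points together as one oversized bucket.
        return list(points)
    return (j, value, build_kd(left, bucket_size), build_kd(right, bucket_size))


def kd_query(tree, box):
    """Range search using an explicit stack instead of recursion."""
    found, stack = [], [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, list):                    # bucket: test each point
            found.extend(p for p in node
                         if all(lo <= x <= hi for x, (lo, hi) in zip(p, box)))
            continue
        j, value, left, right = node
        lo, hi = box[j]
        if lo <= value:      # query range reaches the left son (keys <= value)
            stack.append(left)
        if hi > value:       # query range reaches the right son (keys > value)
            stack.append(right)
    return found
```

A son is pushed only when the query range on key j reaches its side of the discriminator, so a range lying wholly on one side prunes the other subtree exactly as in the description.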
The application of k-d trees to (two-di-
mensional) range searching is illustrated in
Figure 3. The k-d tree is depicted in two
ways: Figure 3a shows the structure in 2-
space, and Figure 3b shows the abstract
tree. The root of the tree is internal node
A; it is an x-discriminator. The vertical line
in the right part of the figure labeled A is
the discriminating line. That is, every point
to the left of t hat vertical line is in the left
subtree of A (with B as root), and every
point to the right is in the subtree with root
C. This partitioning continues recursively,
and the resulting cells (buckets) in this tree
each contain two points. The query rectangle
is illustrated in Figure 3a, and the search
for all points within the rectangle is illustrated
in both figures. The search starts at
the root, and since the query rectangle is
entirely to the right of the vertical line
defined by A, the left subtree of A (with B
as root) can be pruned from the search.
This is illustrated in Figure 3b by the perpendicular
line through the son link from A
to B. The search continues, searching both
sons of C, both sons of F, and only the left
son of G. A total of three buckets are
searched; these buckets are dashed in the
planar representation and are marked by
an S in the tree representation.
FIGURE 3. Illustration of k-d trees: (a) planar representation; (b) tree representation.
In the k-d tree as introduced by Bentley
[BENT75a], the discriminators are chosen
cyclically (that is, the root is discriminated
by the first key, its sons by the second, and
so on). The idea of "adaptive partitioning"
was proposed by Friedman, Bentley, and
Finkel [FRIE77] and makes the k-d tree a
structure very "sensitive" to the particular
file that it represents. The application of
k-d trees to a host of problems can be found
in BENT79b, GOTL78, and SILV78b.
Analysis of k-d trees for range searching
has been considered by several researchers.
The work required to construct a k-d tree
and its storage requirements (see BENT79b)
are
$P_k(N, k) = O(N \log N)$,
$S_k(N, k) = O(Nk)$.
The search cost depends on the nature of
the query. Lee and Wong [LEEW80] have
shown that in the worst case,
$Q_k(N, k) \le O(N^{1-1/k} + F)$
where F is the number of points found in
the region. If the query range is almost
cubical and the number of records that
satisfies the query is small (so that the
range query is similar to a nearest neighbor
search), then Friedman, Bentley, and Finkel's
[FRIE77] analysis shows that
$Q_k(N, k) = O(\log N + F)$
(average case for small answer).
For the case where a large fraction of the
file satisfies the query, Bentley and Stanat
[BENT75b] and Silva-Filho [SILV78a] show
that
$Q_k(N, k) = O(F)$
(average case for large answer).
The k-d tree structure is most effective in
situations where little is known about the
nature of the queries or a wide variety of
queries are expected. It is also useful if
other types of queries (in addition to range
queries) are anticipated; many other quer-
ies supported by k-d trees are discussed by
Bentley [BENT79b].
1.5 Range Trees
A number of very similar structures for
range searching (of primarily theoretical
rather than practical interest) have recently
been described by Lueker [LUEK78], Lee
and Wong [LEEW80], and Willard
[WILL78a]. In this section we investigate
the range tree, a structure introduced by
Bentley [BENT79a] that is also similar to
the former structures. It achieves the best
worst-case search time of all the structures
we have seen so far in this paper, but has
relatively high preprocessing and storage
costs. For most applications the high stor-
age will be prohibitive, but the range tree
is very interesting from a theoretical view-
point. Since the range tree is defined recur-
sively in dimension (that is, the k-dimen-
sional structure is defined in terms of the
(k - 1)-dimensional structure), we begin
our discussion by looking at a one-dimensional
structure and then generalize that
structure to higher dimensions.
The simplest structure for one-dimen-
sional range searching is a sorted array.
The preprocessing sorts the N elements in
ascending order by key. To answer a range
query, we do two binary searches to find
the positions of the low and high end of the
range in the array. After these two positions
have been found, we can list all the points
in that part of the array as the answer to
the range query. (Note that this is precisely
the projection method applied to the one-dimensional
problem.) For this structure
we use linear storage and O(N log N) preprocessing
time. The two binary searches
each cost O(log N), and the cost of listing
the points found in the region will, of
course, be proportional to the number of
such points. Letting F be the number of
points found in the region, we have
$P_r(N, 1) = O(N \log N)$,
$S_r(N, 1) = O(N)$,
$Q_r(N, 1) = O(\log N + F)$.
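The one-dimensional sorted array is short enough to show in full. A minimal sketch, assuming closed query ranges and using Python's standard `bisect` module for the two binary searches; the function names `build_sorted_array` and `range_1d` are illustrative.

```python
from bisect import bisect_left, bisect_right


def build_sorted_array(keys):
    """Preprocessing: sort the N keys in ascending order, O(N log N)."""
    return sorted(keys)


def range_1d(sa, lo, hi):
    """Two binary searches, O(log N), then list the F keys in [lo, hi]."""
    start = bisect_left(sa, lo)    # position of the low end of the range
    stop = bisect_right(sa, hi)    # position just past the high end
    return sa[start:stop]
```

Using `bisect_left` for the low end and `bisect_right` for the high end makes both bounds inclusive, so keys equal to either endpoint are reported.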
We will now build a two-dimensional
range tree, using as a tool the one-dimensional
sorted arrays (SA's) we described
above. The range tree is similar to the
"binary search trees" described by Knuth
[KNUT73, Sect. 6.2], so we will use his terminology
in our discussions. The range tree
is a rooted binary tree in which every node
has a left son, a right son, a discriminating
value (all nodes in the left subtree have a
discriminating value less than the node's),
and (unlike a regular binary search tree)
every node contains an SA. The root of the
range tree contains an SA (sorted by
y-coordinate) and has as a discriminating
value the median x-value for all points. The
left subtree of the root has an SA containing
the N/2 points with x-value less than the median,
sorted by y-coordinate. Similarly, the
right son of the root represents the N/2
points with x-value greater than the median
and has an SA of those points sorted by
y-coordinate. This partitioning continues so
that i levels away from the root we have $2^i$
subtrees, each representing $N/2^i$ points
contiguous in the x-dimension and each
containing an SA of the points sorted by
y-coordinate. This partitioning continues
for a total of (approximately) log N levels;
we handle small point sets (say, fewer than a
dozen points) by brute force.
The search algorithm for a range tree is
most easily described recursively. Each
node in the tree represents a range in the
x-dimension from the least x-value con-
tained in the subtree to the greatest. When
visiting a node, we compare the x-range of
the query to the range of the node, and if
the node's range is entirely within the
query's, then we search that structure's SA
for all points in the query's y-range and
return. If the query's range does not wholly
contain the node's, then we compare the
query's x-range to the node's discriminator
value. If the range is entirely below the
discriminator, we recursively visit the left
subtree; if it is above, we visit the right; and
if the range overlaps the discriminator, then
we visit both subtrees.
The analysis of the planar tree is rather
complicated. Since there are log N levels
in the tree and N points are stored on
each level, the total storage required is
O(N log N). The preprocessing can be performed
in O(N log N) time if clever techniques
are employed. Analysis shows that
at most two SA searches are done on each
level of the tree (each of cost approximately
log N), so the total cost for a search is
$O(\log^2 N)$ plus the time for listing the
points in the region. Letting F stand, as
before, for the total number of points found
in the region we have
$P_r(N, 2) = O(N \log N)$,
$S_r(N, 2) = O(N \log N)$,
$Q_r(N, 2) = O(\log^2 N + F)$.
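A planar range tree along these lines can be sketched as below. Several simplifications are assumptions of this sketch rather than part of the structure as described: the class name `RangeNode`, the functions `build_range_tree` and `range_query`, and the `brute` threshold are invented names; the point set is assumed nonempty; each node stores its least and greatest x-values and the search prunes on those instead of comparing against a discriminator value; and every node simply re-sorts its points by y, which costs O(N log^2 N) preprocessing rather than the O(N log N) obtainable with the cleverer techniques mentioned above.

```python
from bisect import bisect_left, bisect_right


class RangeNode:
    """A node covering a run of points that is contiguous in x."""

    def __init__(self, pts_by_x, brute=12):
        self.xmin, self.xmax = pts_by_x[0][0], pts_by_x[-1][0]
        self.by_y = sorted(pts_by_x, key=lambda p: p[1])  # this node's SA
        self.ys = [p[1] for p in self.by_y]
        self.left = self.right = None
        if len(pts_by_x) > brute:          # small sets are left unsplit
            mid = len(pts_by_x) // 2
            self.left = RangeNode(pts_by_x[:mid], brute)
            self.right = RangeNode(pts_by_x[mid:], brute)


def build_range_tree(points, brute=12):
    return RangeNode(sorted(points), brute)  # sort by x before splitting


def range_query(node, xlo, xhi, ylo, yhi):
    if xhi < node.xmin or xlo > node.xmax:
        return []                            # no overlap in x: prune
    if xlo <= node.xmin and node.xmax <= xhi:
        # Node's x-range lies inside the query's: two binary searches on the SA.
        return node.by_y[bisect_left(node.ys, ylo):bisect_right(node.ys, yhi)]
    if node.left is None:                    # small point set: brute force
        return [p for p in node.by_y
                if xlo <= p[0] <= xhi and ylo <= p[1] <= yhi]
    return (range_query(node.left, xlo, xhi, ylo, yhi)
            + range_query(node.right, xlo, xhi, ylo, yhi))
```

Every level of the tree holds each point in exactly one SA, which is where the O(N log N) storage figure comes from.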
If we step back for a moment, we can see
how we built the structure: We constructed
a two-dimensional structure by building a
tree of one-dimensional structures. We can
perform essentially the same operation to
yield a three-dimensional structure: We
construct a tree containing two-dimen-
sional structures in the nodes. This process
can be continued to yield a structure for
k dimensions, which will be a tree containing
(k - 1)-dimensional structures. This
will yield a structure with performances
$P_r(N, k) = O(N \log^{k-1} N)$,
$S_r(N, k) = O(N \log^{k-1} N)$,
$Q_r(N, k) = O(\log^k N + F)$.
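The storage bound can be checked with a short recurrence. This is a sketch under two assumptions not spelled out in the text: a node representing n points holds a (k - 1)-dimensional structure on those n points, and N is a power of two. The constant c is whatever constant the (k - 1)-dimensional bound hides, and k is taken to be at least 2.

```latex
\begin{align*}
S_r(N, 1) &= O(N), \\
S_r(N, k) &= 2\,S_r(N/2, k) + S_r(N, k-1) \\
          &= \sum_{i=0}^{\log_2 N} 2^i \, S_r(N/2^i, k-1) \\
          &\le \sum_{i=0}^{\log_2 N} 2^i \cdot c\,\frac{N}{2^i}\,\log^{k-2} N
           \;=\; O(N \log^{k-1} N).
\end{align*}
```

Each added dimension therefore multiplies the storage by a factor of log N. The query bound follows the same pattern, since at most two substructures are searched on each of the log N levels.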
The range tree structure is very interest-
ing from a theoretical viewpoint. The
asymptotic search time is very fast, but the
amount of storage used is usually prohibitive
in practice. Although the application of
this structure to practical problems will
probably be limited to cases when k = 2 or
3, it does provide an important theoretical
benchmark. It also gives us an interesting
technique (recursion in dimension) that
might yield fruit in practice. (Indeed, there
are some very interesting relationships between
range trees and the k-d trees of Section 1.4.)
1.6 k-ranges
The k-range is an efficient worst-case struc-
ture for range searching introduced by
Bentley and Maurer [BENT80b]. They de-
veloped two types of k-ranges, overlapping
and nonoverlapping. Both of these struc-
tures involve storing sets of lists of points
sorted by different coordinates; additional
dimensions are added recursively, much
like the range trees of the last section. Be-
cause k-ranges are rather complicated to
describe and are of primarily theoretical
interest, we will not describe them here but
only mention their performance. The over-
lapping k-ranges can be made to have per-
formance
$P_o(N, k) = O(N^{1+\epsilon})$,
$S_o(N, k) = O(N^{1+\epsilon})$,
$Q_o(N, k) = O(\log N + F)$
for any $\epsilon > 0$. It is pleasing to note that the
constants "hidden" in the O's of the above
equations are just $k/\epsilon$. Overlapping k-
ranges have very efficient retrieval time but
somewhat high preprocessing and storage
costs; their dual structures, nonoverlapping
k-ranges, have very efficient preprocessing
and storage costs but increased query
times. Their performance is
$P_n(N, k) = O(N \log N)$,
$S_n(N, k) = O(N)$,
$Q_n(N, k) = O(N^{\epsilon} + F)$
for any fixed $\epsilon > 0$. The details of these
structures can be found in BENT80b. Al-
though these structures were developed pri-
marily as a theoretical device, they might
prove efficient in some implementations.
Computing Surveys, Vol. 11, No. 4, December 1979
(Their primary drawback is that their space
requirements are high, and space is usually
a critical resource.)
1.7 Other Structures
In the previous sections we have investi-
gated six structures for the range searching
problem that (in the authors' opinion) dom-
inate other structures proposed for this
problem. In this section we briefly investi-
gate some of these other structures.
Knuth [KNUT73] points out that the no-
tion of cells can be applied recursively. That
is, when one of the cubes has more than
some certain number of points, the cube is
further divided into subcubes of yet smaller
size. This scheme implies a multidimen-
sional tree with multiway branching. In
terms of both the partitioning imposed on
the space and the ease of implementation,
this idea seems to be dominated by a data
structure called the quad tree.
The quad tree was first described by Fin-
kel and Bentley [FINK74]. It is a generaliza-
tion of the standard binary search tree, in
which every node has $2^k$ sons. Bentley and
Stanat [BENT75b] analyzed the perform-
ance of quad trees for "square" range
searches in uniform planar point sets, and
Linn [LINN73] discussed the fact that quad
trees (which he called "search-sort k trees")
have advantages over binary trees when
used in a synchronized multiprocessor sys-
tem. This application aside, however, the
quad tree seems to be dominated by the
k-d trees of Section 1.4.
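To make the "$2^k$ sons" idea concrete, here is a minimal sketch of a planar point quad tree (k = 2, so four sons per node) with insertion and range search. The names `QuadNode`, `quadrant`, `insert`, and `range_search` are assumptions made for this illustration and do not come from the cited papers; no balancing is attempted.

```python
class QuadNode:
    """Node of a planar point quad tree: one point and 2**2 = 4 sons."""

    def __init__(self, point):
        self.point = point
        self.sons = [None, None, None, None]


def quadrant(center, p):
    """Son index of p: bit 0 is set if p.x >= center.x, bit 1 if p.y >= center.y."""
    return (1 if p[0] >= center[0] else 0) | (2 if p[1] >= center[1] else 0)


def insert(root, p):
    """Insert p and return the (possibly new) root."""
    if root is None:
        return QuadNode(p)
    node = root
    while True:
        q = quadrant(node.point, p)
        if node.sons[q] is None:
            node.sons[q] = QuadNode(p)
            return root
        node = node.sons[q]


def range_search(node, lo, hi, out):
    """Append to out every stored point inside the rectangle [lo, hi]."""
    if node is None:
        return
    x, y = node.point
    if lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1]:
        out.append(node.point)
    for q in range(4):
        # Skip sons whose quadrant cannot intersect the query rectangle.
        if ((q & 1) and hi[0] < x) or (not (q & 1) and lo[0] >= x):
            continue
        if ((q & 2) and hi[1] < y) or (not (q & 2) and lo[1] >= y):
            continue
        range_search(node.sons[q], lo, hi, out)
```

A node compares a point on both coordinates at once, whereas a k-d tree node compares on a single coordinate; that is the main structural difference between the two trees in this sketch.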
A great deal of work has been done re-
cently on multikey searching problems that
are similar in flavor to the range searching
problem. Dobkin and Lipton [DOBK76] and
Bentley [BENT80a] have investigated a
number of searching problems defined on
sets of points in k-dimensional space. Rivest
[RIVE76] provides a number of interesting
data structures for answering "partial-
match" queries, which are essentially range
queries in a file in which the keys assume
discrete values. For discussions of efficient
search methods in the context of database
systems, the reader is referred to such pa-
pers as LIOU77, SHNE77, YANG77, and
YANG78.
1.8 Comparison of Methods
In Sections 1.1 through 1.6 we have dis-
cussed six structures for range searching.
The performances of these six structures
(seven including the two variants of k-
ranges) are summarized in Table 1, which
shows the preprocessing, storage, and query
costs of each structure. All the functions in
that table reflect worst-case costs, except
those query costs that are footnoted. For
those functions the probabilistic assump-
tions are described in the notes.
Four of these six structures (sequential
scan, projection, cells, and k-d trees) have
been presented as providing practical solu-
tions to the range searching problem. For
each structure there are situations in which
it is clearly superior and other situations
where it performs badly. In this section we
will mention some of these situations and
compare the performance of the four
methods.
TABLE 1. Performance of Data Structures for Range Searching
Structure                 P(N, k)                 S(N, k)                 Q(N, k)
Sequential scan           $O(N)$                  $O(N)$                  $O(N)$
Projection                $O(N \log N)$           $O(N)$                  $O(N^{1-1/k} + F)$ a(1)
Cells                     $O(N)$                  $O(N)$                  $O(F)$ a(2)
k-d trees                 $O(N \log N)$           $O(N)$                  $O(N^{1-1/k} + F)$
                                                                          $O(\log N + F)$ a(3)
Nonoverlapping k-ranges   $O(N \log N)$           $O(N)$                  $O(N^{\epsilon} + F)$
Range trees               $O(N \log^{k-1} N)$     $O(N \log^{k-1} N)$     $O(\log^{k} N + F)$
Overlapping k-ranges      $O(N^{1+\epsilon})$     $O(N^{1+\epsilon})$     $O(\log N + F)$
a Query times that indicate average-case analysis. Probabilistic assumptions are
(1) Smooth data sets, very small query region.
(2) Any data set, cell size equals query size.
(3) Smooth data set.
If the file is small and the number of
attributes large, if the file is to be searched
only a few times, or if the queries can be
batched so that nearly all the records in the
file satisfy at least one, then sequential scan
is the method of choice. In other cases one
of the more sophisticated methods is likely
to be more efficient. Projection does best
when the query range on one of the attri-
butes is usually sufficient to eliminate
nearly all the file records. For this case the
low overhead of searching this structure
allows it to dominate the others. In situa-
tions where several or many of the attri-
butes serve to restrict the range query, the
projection technique performs relatively
poorly.
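The trade-off between sequential scan and projection can be illustrated with a short sketch. The names `sequential_scan` and `Projection` are assumptions made here; the sketch keeps one sorted list on a single chosen attribute and reports how many candidate records had to be examined, which is small only when the range on that attribute is selective.

```python
from bisect import bisect_left, bisect_right


def sequential_scan(records, lo, hi):
    """Return every record r with lo[i] <= r[i] <= hi[i] for all attributes i."""
    return [r for r in records
            if all(l <= v <= h for l, v, h in zip(lo, r, hi))]


class Projection:
    """Records sorted on one attribute; other attributes are checked by scanning."""

    def __init__(self, records, attr=0):
        self.attr = attr
        self.sorted = sorted(records, key=lambda r: r[attr])
        self.keys = [r[attr] for r in self.sorted]

    def query(self, lo, hi):
        """Return (matching records, number of candidate records examined)."""
        i = bisect_left(self.keys, lo[self.attr])
        j = bisect_right(self.keys, hi[self.attr])
        candidates = self.sorted[i:j]
        return sequential_scan(candidates, lo, hi), len(candidates)


records = [(i, (i * 7) % 10) for i in range(100)]
hits, examined = Projection(records, attr=0).query((10, 0), (14, 4))
print(hits, examined)  # [(10, 0), (12, 4), (13, 1)] 5
```

Projecting the same file on the second attribute would leave 50 candidates for the same query, because the range 0 to 4 on that attribute eliminates only half of the records.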
Both the cell and k-d tree structures are
appropriate in situations where the query
restricts several of the attributes. If the
approximate size and shape of the queries
are roughly constant and known in ad-
vance, then cells defined by a fixed grid
with size and shape similar to those of the
expected queries are most advantageous. For
queries with sizes and shapes that differ
considerably from the design, however, per-
formance can be quite poor.
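A fixed-grid cell structure can be sketched as below. The class name `CellGrid` and the use of a hash table keyed by integer cell coordinates are assumptions of this illustration. The query returns the number of points examined so that the effect of a mismatch between cell size and query size can be observed directly.

```python
from collections import defaultdict
from itertools import product
from math import floor


class CellGrid:
    """Points bucketed into cubical cells of side cell_size."""

    def __init__(self, points, cell_size):
        self.size = cell_size
        self.cells = defaultdict(list)
        for p in points:
            self.cells[self._cell(p)].append(p)

    def _cell(self, p):
        """Integer coordinates of the cell that contains p."""
        return tuple(floor(c / self.size) for c in p)

    def query(self, lo, hi):
        """Return (points inside [lo, hi], number of points examined)."""
        first, last = self._cell(lo), self._cell(hi)
        hits, examined = [], 0
        # Visit every cell that overlaps the query rectangle.
        ranges = (range(a, b + 1) for a, b in zip(first, last))
        for idx in product(*ranges):
            for p in self.cells.get(idx, ()):
                examined += 1
                if all(l <= c <= h for l, c, h in zip(lo, p, hi)):
                    hits.append(p)
        return hits, examined
```

In this sketch, cells much larger than the query force many non-matching points to be examined, while cells much smaller than the query cause many cells (some of them empty) to be probed.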
The k-d tree structure is characterized by
its robustness to wildly varying queries.
The cell design adapts to the distribution
of the attribute values of the file records in
the k-dimensional coordinate space. The
cells all contain very nearly the same num-
ber of records; there are no empty cells. In
dense regions there are many cells and a
correspondingly fine division of the coordi-
nate space; in sparse regions there is a
coarser division with fewer cells. For most
applications of range searching that are not
characterized in the preceding paragraphs,
k-d trees are likely to be the method of
choice.
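The adaptive behavior described above can be seen in a small sketch of a k-d tree built by median splitting. The function names, the dictionary node layout, and the `bucket` parameter are assumptions made for this illustration and are not taken from the paper.

```python
def build_kd(points, depth=0, bucket=2):
    """Build a k-d tree; leaves ("cells") hold at most `bucket` points (bucket >= 1)."""
    if len(points) <= bucket:
        return {"bucket": list(points)}
    axis = depth % len(points[0])          # cycle through the k coordinates
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                    # median split keeps the tree balanced
    return {
        "axis": axis,
        "split": pts[mid][axis],
        "left": build_kd(pts[:mid], depth + 1, bucket),    # coordinates <= split
        "right": build_kd(pts[mid:], depth + 1, bucket),   # coordinates >= split
    }


def range_search(node, lo, hi, out):
    """Append to out every point p with lo[i] <= p[i] <= hi[i] for all i."""
    if "bucket" in node:
        out.extend(p for p in node["bucket"]
                   if all(l <= c <= h for l, c, h in zip(lo, p, hi)))
        return
    axis, split = node["axis"], node["split"]
    if lo[axis] <= split:
        range_search(node["left"], lo, hi, out)
    if hi[axis] >= split:
        range_search(node["right"], lo, hi, out)


def leaf_sizes(node):
    """Number of points in each leaf cell, left to right."""
    if "bucket" in node:
        return [len(node["bucket"])]
    return leaf_sizes(node["left"]) + leaf_sizes(node["right"])
```

Because each split is made at the median of the points actually present, every leaf cell in this sketch ends up with one or two points regardless of how the points are distributed, so dense regions are divided finely and sparse regions coarsely.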
2. ADDITIONAL WORK
Our discussion of the data structures in
Section 1 is on a very abstract conceptual
level, and we have ignored many problems
that arise in actual applications of range
searching. In this section we briefly exam-
ine some of those problems and the solu-
tions that have been proposed to handle
them.
All files that we have discussed so far
have been static; that is, they represent
unchanging files. Many applications, how-
ever, require dynamic structures, in which
insertions and deletions can be made. The
sequential scan structure is easy to main-
tain dynamically, and so is the projection
structure using methods for maintaining
one-dimensional sorted lists described by
Knuth [KNUT73]. The cell technique can
support insertions and deletions by merely
keeping a linked list of the points in each
cell and inserting or deleting the new or old
record in the appropriate list. Dynamic k-d
trees are a more subtle problem and have
been discussed by Bentley [BENT79b] and
Willard [WILL78b].
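Insertion and deletion in the cell structure can be sketched in a few lines. The class name `DynamicCells` is an assumption of this illustration, and Python lists stand in for the linked lists mentioned above.

```python
from collections import defaultdict


class DynamicCells:
    """Cell structure that supports insertions and deletions."""

    def __init__(self, cell_size):
        self.size = cell_size
        self.cells = defaultdict(list)     # cell coordinates -> list of points

    def _cell(self, p):
        """Integer coordinates of the cell that contains p."""
        return tuple(int(c // self.size) for c in p)

    def insert(self, p):
        """Add p to the list of its cell."""
        self.cells[self._cell(p)].append(p)

    def delete(self, p):
        """Remove one copy of p; return False if p is not stored."""
        key = self._cell(p)
        bucket = self.cells.get(key)
        if bucket is None or p not in bucket:
            return False
        bucket.remove(p)
        if not bucket:
            del self.cells[key]            # drop cells that become empty
        return True
```

Each update touches only the list of a single cell, which is why the cell technique is easy to maintain dynamically; no rebalancing is needed, in contrast to tree structures.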
Considerable research remains to be
done in the development of heuristics for
aiding the search methods we have seen.
For example, if the range queries in a seven-
dimensional problem almost always involve
only two of the attributes, then the design
of the structure should involve only those
two attributes. Heuristics for detecting
these and other similar situations would be
very helpful. Techniques described by Ben-
tley and Burkhard [BENT76] might prove
useful in such an investigation.
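As a purely hypothetical illustration of such a heuristic (it is not one proposed in the literature cited here), one could keep a log of past queries and select the attributes that are restricted in at least some fraction of them. The function name `frequently_restricted`, the log format, and the threshold are all assumptions.

```python
from collections import Counter


def frequently_restricted(query_log, num_attrs, threshold=0.5):
    """Return the attributes restricted in at least `threshold` of the queries.

    Each logged query is a dict that maps an attribute index to the
    (low, high) range imposed on it; unrestricted attributes are absent.
    """
    counts = Counter()
    for query in query_log:
        counts.update(query.keys())
    total = len(query_log)
    if total == 0:
        return []
    return [a for a in range(num_attrs) if counts[a] / total >= threshold]


log = [{0: (1, 2), 3: (0, 9)}, {0: (4, 5)}, {0: (0, 1), 3: (2, 3)}, {3: (1, 1), 5: (0, 0)}]
print(frequently_restricted(log, 7))  # [0, 3]
```

The structure would then be built on the selected attributes only, with the remaining attributes checked by scanning the retrieved records.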
Our discussion of all of the data struc-
tures has been for the case in which they
are implemented in primary memory.
Many applications (particularly databases)
inherently involve secondary storage media
such as disks and tapes. All the structures
of Section 1 can be efficiently implemented
on such media.¹
Several researchers have recently consid-
ered an interesting generalization of the
range searching problem, which calls for
adding a range restriction to an existing
data structure. That is, we already have
some structure for performing a particular
type of query, and we want to have the
capability of saying "perform that query on
all records in which this key lies in that
range." Bentley [BENT79a], Lueker
[LUEK79], and Willard [WILL78a] have de-
veloped a number of transformations on
data structures that allow one to add the
range restriction capability. (These trans-
formations actually led to the discovery of
both the range tree and the k-range data
structures of Section 1.) Although the stor-
age requirements of the resulting structures
seem to be too high to make them of im-
mediate practical interest, this approach is
a novel attack on the problem of construct-
ing data structures for range searching.
¹ For details of these implementations, the reader is
referred to BENT78, which is an earlier version of this
paper.
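One way to picture the idea of adding a range restriction is the following sketch. It is not the transformation of any of the cited papers; it assumes the underlying query is decomposable (the answer for a set can be combined from the answers for its parts), and the class name `RangeRestricted` and the parameters `build`, `ask`, `combine`, and `identity` are invented for the illustration.

```python
class RangeRestricted:
    """Wrap an existing structure so queries can be limited to a key range."""

    def __init__(self, records, key, build, ask, combine, identity):
        self.key, self.ask = key, ask
        self.combine, self.identity = combine, identity
        self.root = self._build(sorted(records, key=key), build)

    def _build(self, recs, build):
        if not recs:
            return None
        node = {"lo": self.key(recs[0]), "hi": self.key(recs[-1]),
                "struct": build(recs),     # existing structure on this subset
                "kids": ()}
        if len(recs) > 1:
            mid = len(recs) // 2
            node["kids"] = (self._build(recs[:mid], build),
                            self._build(recs[mid:], build))
        return node

    def query(self, klo, khi, q):
        """Answer q over the records whose key lies in [klo, khi]."""
        return self._query(self.root, klo, khi, q)

    def _query(self, node, klo, khi, q):
        if node is None or node["hi"] < klo or node["lo"] > khi:
            return self.identity
        if klo <= node["lo"] and node["hi"] <= khi:
            return self.ask(node["struct"], q)
        answer = self.identity
        for kid in node["kids"]:
            answer = self.combine(answer, self._query(kid, klo, khi, q))
        return answer


# Membership query ("is this label present?") restricted to a key range.
records = [(1, "a"), (2, "b"), (3, "a"), (5, "c"), (8, "b")]
rr = RangeRestricted(records, key=lambda r: r[0],
                     build=lambda rs: {r[1] for r in rs},
                     ask=lambda s, q: q in s,
                     combine=lambda x, y: x or y, identity=False)
print(rr.query(2, 5, "a"), rr.query(4, 8, "a"))  # True False
```

In this sketch every record is stored in the structure of each of its ancestors in the tree, which illustrates why the storage of the resulting structures tends to be high.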
An interesting theoretical problem that
could prove to be of practical value is prov-
ing lower bounds on the complexity of the
range searching problem. Saxe [SAXE79]
has investigated this problem using the
standard "decision tree" model of concrete
complexity theory and has shown that
k-ranges have optimal worst-case query
times. These k-ranges have very high stor-
age requirements, however; so it would be
very desirable to have lower bounds that
make stronger statements of the form, "if
you only use this much storage and prepro-
cessing, then this is the fastest search time
you can have." Fredman [FRED79] has re-
cently made progress in this direction. An-
other interesting open problem is to show
lower bounds on the average complexity,
rather than just the worst-case complexity.
3. CONCLUSIONS
In this paper we have investigated a num-
ber of data structures for the range searching
problem. In 1973 Knuth [KNUT73, p. 554]
was able to write that "no really nice data
structures seem to exist" for the problem of
range searching. In this paper we have tried
to show that this situation has changed in
the interim, and that these changes can
have a substantial impact on both the the-
ory and practice of multikey searching.
REFERENCES
BENT75a  BENTLEY, J. L. "Multidimensional binary search trees used for
         associative searching," Comm. ACM 18, 9 (Sept. 1975), 509-517.
BENT75b  BENTLEY, J. L., AND STANAT, D. F. "Analysis of range searches in
         quad trees," Inf. Process. Lett. 3, 6 (July 1975), 170-173.
BENT76   BENTLEY, J. L., AND BURKHARD, W. A. "Heuristics for partial match
         retrieval data base design," Inf. Process. Lett. 4, 5 (Feb. 1976),
         132-135.
BENT77   BENTLEY, J. L., STANAT, D. F., AND WILLIAMS, E. H., JR. "The
         complexity of fixed-radius near neighbor searching," Inf. Process.
         Lett. 6, 6 (Dec. 1977), 209-212.
BENT78   BENTLEY, J. L., AND FRIEDMAN, J. H. A survey of algorithms and data
         structures for range searching, Carnegie-Mellon Computer Science
         Rep. CMU-CS-78-136 and Stanford Linear Accelerator Center Rep.
         SLAC-PUB-2189; preliminary version in Proc. Computer Science and
         Statistics: 11th Ann. Symp. on the Interface, March 1978,
         pp. 297-307.
BENT79a  BENTLEY, J. L. "Decomposable searching problems," Inf. Process.
         Lett. 8, 5 (June 1979), 133-136.
BENT79b  BENTLEY, J. L. "Multidimensional binary search trees in database
         applications," IEEE Trans. Softw. Eng. SE-5, 4 (July 1979),
         333-340.
BENT80a  BENTLEY, J. L. "Multidimensional divide-and-conquer," to appear in
         Comm. ACM.
BENT80b  BENTLEY, J. L., AND MAURER, H. A. "Efficient worst-case data
         structures for range searching," to appear in Acta Inf.
DOBK76   DOBKIN, D., AND LIPTON, R. J. "Multidimensional searching
         problems," SIAM J. Comput. 5, 2 (1976), 181-186.
FINK74   FINKEL, R. A., AND BENTLEY, J. L. "Quad trees: a data structure
         for retrieval on composite keys," Acta Inf. 4, 1 (1974), 1-9.
FRED79   FREDMAN, M. "A near optimal data structure for a type of range
         query problem," in Proc. 11th ACM Symp. Theory of Computing,
         May 1979, pp. 62-66.
FRIE75   FRIEDMAN, J. H., BASKETT, F., AND SHUSTEK, L. J. "An algorithm for
         finding nearest neighbors," IEEE Trans. Comput. C-24, 10
         (Oct. 1975), 1000-1006.
FRIE77   FRIEDMAN, J. H., BENTLEY, J. L., AND FINKEL, R. A. "An algorithm
         for finding best matches in logarithmic expected time," ACM Trans.
         Math. Softw. 3, 3 (Sept. 1977), 209-226.
GOTL78   GOTLIEB, C. C., AND GOTLIEB, L. R. Data types and structures,
         Prentice-Hall, Englewood Cliffs, N.J., pp. 357-363.
KNUT73   KNUTH, D. E. The art of computer programming, vol. 3: sorting and
         searching, Addison-Wesley, Reading, Mass., 1973.
LAUT78   LAUTHER, U. "4-dimensional binary search trees as a means to speed
         up associative searches in design rule verification of integrated
         circuits," J. Des. Autom. Fault-Tolerant Comput. 2, 3 (July 1978),
         241-247.
LEEC76   LEE, R. C. T., CHIN, Y. H., AND CHANG, S. C. "Application of
         principal component analysis to multi-key searching," IEEE Trans.
         Softw. Eng. SE-2, 3 (Sept. 1976), 185-193.
LEEW78   LEE, D. T., AND WONG, C. K. "Worst-case analysis for region and
         partial region searches in multidimensional binary search trees
         and quad trees," Acta Inf. 9, 1 (1978), 23-29.
LEEW80   LEE, D. T., AND WONG, C. K. "Quintary trees: a file structure for
         multidimensional database systems," to appear in ACM Trans.
         Database Syst.
LEVI66   LEVINTHAL, C. "Molecular model-building by computer," Sci. Am. 214
         (June 1966), 42-52.
LINN73   LINN, J. General methods for parallel searching, Tech. Rep. 61,
         Digital Systems Lab., Stanford U., Stanford, Calif., May 1973.
LIOU77   LIOU, J. H., AND YAO, S. B. "Multi-dimensional clustering for data
         base organization," Inf. Syst. 2 (1977), 187-198.
LOFT65   LOFTSGAARDEN, D. O., AND QUESENBERRY, C. P. "A nonparametric
         density function," Ann. Math. Stat. 36 (1965), 1049-1051.
LUEK78   LUEKER, G. "A data structure for orthogonal range queries," in
         Proc. 19th Symp. Foundations of Computer Science, IEEE, Oct. 1978,
         pp. 28-34.
LUEK79   LUEKER, G. "A transformation for adding range restriction
         capability to dynamic data structures for decomposable searching
         problems," Tech. Rep. 129, U. of California at Irvine, 1979.
RABI76   RABIN, M. O. "Probabilistic algorithms," in Algorithms and
         complexity: new directions and recent results, J. F. Traub (Ed.),
         Academic Press, New York, 1976, pp. 21-39.
RIVE76   RIVEST, R. L. "Partial match retrieval algorithms," SIAM J.
         Comput. 5, 1 (March 1976), 19-50.
SAXE79   SAXE, J. B. "On the number of range queries in k-space," to appear
         in Discrete Appl. Math.
SHNE77   SHNEIDERMAN, B. "Reduced combined indexes for efficient multiple
         attribute retrieval," Inf. Syst. 2 (1977), 149-154.
SILV78a  SILVA-FILHO, Y. V. Average case analysis of region search in
         balanced k-d trees, Rep., U. of Kent, Canterbury, England,
         Nov. 1978.
SILV78b  SILVA-FILHO, Y. V. Multidimensional search trees as indices of
         files, Rep., U. of Kent, Canterbury, England, Dec. 1978.
WILL78a  WILLARD, D. E. Predicate-oriented database search algorithms,
         Rep. TR-20-78, Harvard U. Aiken Lab., 1978.
WILL78b  WILLARD, D. E. "Balanced forests of k-d* trees as a dynamic data
         structure," informative abstract, Harvard U., Boston, Mass., 1978.
YANG77   YANG, C. "Avoiding redundant record accesses in unsorted multilist
         file organizations," Inf. Syst. 2 (1977), 155-158.
YANG78   YANG, C. "A class of hybrid list file organizations," Inf. Syst. 3
         (1978), 49-58.
YUVA75   YUVAL, G. "Finding near neighbors in k-dimensional space," Inf.
         Process. Lett. 3, 4 (March 1975), 113-114.
RECEIVED JANUARY 1979; FINAL REVISION ACCEPTED AUGUST 1979.