

Programming Techniques

R. Rivest, Editor

Multidimensional Divide-and-Conquer

Jon Louis Bentley
Carnegie-Mellon University

Most results in the field of algorithm design are single algorithms that solve single problems. In this paper we discuss multidimensional divide-and-conquer, an algorithmic paradigm that can be instantiated in many different ways to yield a number of algorithms and data structures for multidimensional problems. We use this paradigm to give best-known solutions to such problems as the ECDF, maxima, range searching, closest pair, and all nearest neighbor problems. The contributions of the paper are on two levels. On the first level are the particular algorithms and data structures given by applying the paradigm. On the second level is the more novel contribution of this paper: a detailed study of an algorithmic paradigm that is specific enough to be described precisely yet general enough to solve a wide variety of problems.

Key Words and Phrases: analysis of algorithms, data structures, computational geometry, multidimensional searching problems, algorithmic paradigms, range searching, maxima problems, empirical cumulative distribution functions, closest-point problems

CR Categories: 3.73, 3.74, 5.25, 5.31

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

This research was supported in part by the Office of Naval Research under contract N00014-76-C-0370 and in part by the National Science Foundation under a Graduate Fellowship.

Author's address: Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA 15213.

© 1980 ACM 0001-0782/80/0400-0214 $00.75.

1. Introduction

The young field of algorithm design and analysis has made many contributions to computer science of both theoretical and practical significance. One can cite a number of properly designed algorithms that save users thousands of dollars per month when compared to naive algorithms for the same task (sorting and Fourier transforms are examples of such tasks). On the more theoretical side, algorithm design has shown us a number of counterintuitive results that are fascinating from a purely mathematical viewpoint (for instance, there is a faster way to multiply two matrices than the standard "high school" algorithm). In one important sense, however, the study of algorithms is rather unsatisfying; the field consists primarily of a scattered collection of results, without much underlying theory.

Recent research has begun laying the groundwork for a theory of algorithm design by identifying certain algorithmic methods (or paradigms) that are used in the solution of a wide variety of problems. Aho et al. [1, ch. 2] describe a few such basic paradigms, and Weide [27] discusses a number of important analysis techniques in algorithm design. Almost all of the algorithmic paradigms discussed to date are at one of two extremes, however: Either they are so general that they cannot be discussed precisely, or they are so specific that they are useful in solving only one or two problems. In this paper we examine a more "middle of the road" paradigm that can be precisely specified and yet can also be used to solve many problems in its domain of applicability. We call this paradigm multidimensional divide-and-conquer.

Multidimensional divide-and-conquer is applicable to problems dealing with collections of objects in a multidimensional space. In this paper we concentrate on problems dealing with N points in k-dimensional space. In a geometric setting these points might represent N cities in the plane (2-space) or N airplanes in 3-space. Statisticians often view multivariate data with k variables measured on N samples as N points in k-space. Yet another interpretation is used by researchers in database systems, who view N records each containing k keys as points in a multidimensional space. An alternative formalism views the points as N k-vectors; in this paper we use the point formalism, which aids our geometric intuition. The motivating applications for the problems discussed later are phrased in these geometric terms.

Multidimensional divide-and-conquer is a single algorithmic paradigm that can be used to solve many particular problems. It can be described roughly as follows: to solve a problem of N points in k-space, first recursively solve two problems each of N/2 points in k-space, and then recursively solve one problem of N points in (k-1)-dimensional space. In this paper we study a number of different algorithms and see how each can be viewed as an instance of multidimensional divide-and-conquer. There are three distinct benefits resulting from such a study. First, this coherent presentation enables

Communications of the ACM, April 1980, Volume 23, Number 4, p. 214
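The rough description of the paradigm given above can be turned into a toy sketch that merely enumerates the subproblems spawned at each step. This is an illustration of ours, not code from the paper; the function name and base cases are assumptions chosen for the sketch:

```python
def mddc_calls(n, k):
    """Enumerate the (size, dimension) subproblems spawned by the paradigm:
    a problem of n points in k-space yields two problems of n/2 points in
    k-space plus one problem of n points in (k-1)-space."""
    if n <= 1 or k <= 1:
        return []  # base cases are solved directly (e.g. by sorting or a scan)
    subproblems = [(n // 2, k), (n // 2, k), (n, k - 1)]
    calls = list(subproblems)
    for m, d in subproblems:
        calls.extend(mddc_calls(m, d))
    return calls
```

For example, `mddc_calls(2, 2)` reports the three immediate subproblems `[(1, 2), (1, 2), (2, 1)]`, each of which bottoms out at once.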
descriptions of the algorithms to be communicated more easily. Second, by studying the algorithms as a group, advances made in one algorithm can be transferred to others in the group. Third, once the paradigm is understood, it can be used as a tool with which to attack unsolved research problems. Yet another benefit might ultimately result from this study and others like it: a theory of "concrete computational complexity" which explains why (and how) some problems can be solved quickly and why others cannot.

Much previous work has been done on the problems to be discussed. Since most of the work applies to only one problem, we mention that work when discussing the particular problem. Two pieces of work, however, are globally applicable and are therefore mentioned (only) here. Dobkin and Lipton [11] describe a method for multidimensional searching that is radically different from the one that we study. Although their method yields search times somewhat faster than those we discuss, the preprocessing and storage costs of their algorithms are prohibitive for practical applications. Shamos [25, 26] has thoroughly investigated a large number of computational problems in plane geometry and has achieved many fascinating results.

The problems discussed in this paper provide an interesting insight into the relation of theory and practice in algorithm design. On the practical side, the previous best-known algorithms for many of the problems we discuss have running time proportional to N^2 (where N is the number of points). The algorithms discussed in this paper have running time proportional to N lg N (at least for low-dimensional spaces).¹ To make this abstract difference more concrete, we note that if the two algorithms were used to process sets of 1 million points on a 1-million-instruction-per-second computer, then the N^2 algorithm would take over 11 days, while the N lg N algorithm would require only 20 seconds! On the theoretical side, many of the algorithms we discuss can be proved to be the best possible for solving their respective problems, and this allows us to specify precisely the computational complexity of those problems. These problems also show us some interesting interactions between theory and practice: Although some of the theoretically elegant algorithms of Sections 2 and 3 are not suitable for implementation, they suggest certain heuristic algorithms which are currently implemented in several software packages.

This paper is more concerned with expressing the important concepts of multidimensional divide-and-conquer than with scrupulously examining the details of particular algorithms. For this reason we gloss over many (important) details of the algorithms we discuss; the interested reader is referred to papers containing these details. In Section 2 we examine three problems centered around the concept of point domination, and we develop multidimensional divide-and-conquer algorithms for solving those problems. In Section 3 we focus on problems defined by point closeness. These two sections constitute the main part of our discussion of multidimensional divide-and-conquer. In Section 4 we survey additional work that has been done, and we then view the paradigm in retrospect in Section 5.

2. Domination Problems

In this section we investigate three problems defined in terms of point domination. We write A_i for the ith coordinate of point A and say that point A dominates point B if and only if A_i > B_i for all i, 1 ≤ i ≤ k. If neither point A dominates point C nor point C dominates point A, then A and C are said to be incomparable. It is clear from these definitions that the dominance relation defines a partial ordering on any k-dimensional point set. These concepts are illustrated for the case k = 2 in Figure 1. The point A dominates the point B, and both of the pairs A, C and B, C are incomparable.

Fig. 1. Point A dominates point B.

In Section 2.1 we investigate the empirical cumulative distribution function, which asks how many points a given point dominates. In Section 2.2 we study the related question of whether a given point is dominated. In both of these sections we discuss two distinct but related problems. In an all-points problem we are asked to calculate something about every point in a set (How many points does it dominate? Is it dominated?). In a searching problem we must organize the data into some structure such that future queries (How many points does this point dominate? Is this point dominated?) may be answered quickly. In Section 2.3 we examine a searching problem phrased in terms of domination that has no all-points analog.

2.1 Empirical Cumulative Distribution Functions

Given a set S of N points we define the rank of point x to be the number of points in S dominated by x. Figure 2 shows a point set with the rank of each point written near that point. In statistics the empirical cumulative distribution function (ECDF) for a sample set S of N elements, evaluated at point x, is just rank(x)/N. This quantity is the empirical analog of the population cumulative distribution function. Because of the intimate

¹ We will use lg as an abbreviation for log_2 and lg^k N as an abbreviation for (lg N)^k.
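As a concrete reference point, the naive quadratic method that the later algorithms improve upon follows directly from the definitions of dominance and rank. This is a sketch of ours, with illustrative function names not taken from the paper:

```python
def dominates(a, b):
    """True iff point a dominates point b: a_i > b_i in every coordinate."""
    return all(ai > bi for ai, bi in zip(a, b))

def brute_force_ranks(points):
    """All-points ECDF by brute force, using O(N^2) comparisons.
    rank(x) = number of points in the set that x dominates."""
    return [sum(dominates(x, y) for y in points) for x in points]
```

On the set {(1, 1), (2, 2), (3, 0)} this yields ranks [0, 1, 0]: only (2, 2) dominates another point, and (3, 0) is incomparable with both others.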

relation between rank and ECDF, we often write ECDF for rank. With this notation we can state two important computational problems.

(1) All-Points ECDF. Given a set S of N points in k-space, compute the rank of each point in the set.

(2) ECDF Searching. Given a set S, organize it into a data structure such that queries of the form "what is the rank of point x?" can be answered quickly (where x is not necessarily an element of S).

The ECDF is often required in statistical applications because it provides a good estimate of an underlying distribution, given only a set of points chosen randomly from that distribution. A common problem in statistics is hypothesis testing of the following form: Given two point sets, were they drawn from the same underlying distribution? Many important multivariate tests require computing the all-points ECDF to answer this question; these include the Hoeffding, multivariate Kolmogorov-Smirnov, and multivariate Cramer-Von Mises tests. The solution to the ECDF searching problem is required for certain approaches to density estimation, which asks for an estimate of the underlying probability density function given a sample. These and other applications of ECDF problems in statistics are described by Bentley and Shamos [9], which is the source of the algorithms discussed.

In this section we first devote our attention to the all-points ECDF problem in Section 2.1.1 and then solve the ECDF searching problem by analogy with the all-points solution in Section 2.1.2. Our strategy in both sections is to examine solutions to the problem in increasingly higher dimensions, starting with the one-dimensional, or linear, problem.

2.1.1 The all-points ECDF problem. In one dimension the rank of a point x is just the number of points in the set less than x, so the all-points ECDF problem can be solved by sorting. After arranging the points in increasing order we assign the first point the rank 0, the second point rank 1, and so on. It is obvious that given the ranks we could produce such a sorted list, so we know that the complexity of the one-dimensional all-points ECDF problem is exactly the same as that of sorting, which is well known to be O(N lg N). Thus we have developed an optimal algorithm for the one-dimensional case.² The two-dimensional case is not quite as easy, however. To solve this problem we apply the multidimensional divide-and-conquer technique instantiated to two dimensions: To solve a problem of N points in the plane, solve two problems of N/2 points each in the plane and then solve one problem of N points on the line.

Fig. 2. With each point is associated its rank.

Fig. 3. Operation of Algorithm ECDF2.

Our planar ECDF algorithm operates as follows. The first step is to choose some vertical line L dividing the point set S into two subsets A and B, each containing N/2 points.³ This step is illustrated in Figure 3(a). The second step of our algorithm calculates for each point in A its rank among the points in A, and likewise the rank of each point in B among the points of B. The result of

² If we apply the multidimensional divide-and-conquer strategy to the one-dimensional problem, then we achieve a sorting algorithm similar to quicksort (but one that always partitions around the median of the set). We do not discuss that algorithm here.

³ To avoid needless detail we make certain assumptions, such as that N is even and that no pair of points share x- or y-coordinates. To otherwise confront such detail is not particularly illuminating.

this is depicted in Figure 3(b). We now make an important observation that allows us to combine these subsolutions efficiently to form a solution to the original problem. Since every point in A has an x-value less than every point in B, two facts hold: First, no point in A dominates any point in B; and second, a point b in B dominates point a in A iff the y-value of b is greater than the y-value of a. By the first fact we know that the ranks we computed for A are the correct final ranks. We are still faced with the reduced problem of calculating for every point in B the number of points it dominates in A (which we add to the number of Bs it dominates to get the final answer). To solve this reduced problem we use the second fact. If we project all the points in S onto the line L (as depicted in Figure 3(c)), then we can solve the reduced problem by scanning up the line L, keeping track of how many As we have seen, and adding that number to the partial rank of each point in B as we pass that point. This counts the number of points in A with smaller y-values, which are exactly the points it dominates. We implement this solution algorithmically by sorting the As and Bs together and then scanning the sorted list.

Having described the algorithm informally, we are now ready to give it a more precise description as Algorithm ECDF2, a recursive algorithm which is given as input a set S of N points in the plane and returns as its output the rank of each point.

Algorithm ECDF2
1. [Division Step.] If S contains just one element, then return its rank as 0; otherwise proceed.⁴ Choose a cut line L perpendicular to the x-axis such that N/2 points of S have x-value less than L's (call this set of points A) and the remainder have greater x-value (call this set B). Note that L is a median x-value of the set.
2. [Recursive Step.] Recursively call ECDF2(A) and ECDF2(B). After this step we know the true ECDF of all points in A.
3. [Marriage Step.] We must now find for each point in B the number of points in A it dominates (i.e., that have lesser y-value) and add this number to its partial ECDF. To do this, pool the points of A and B (remembering their type) and sort them by y-value. Scan through this sorted list in increasing y-value, keeping track in ACOUNT of the number of As so far observed. Each time a B is observed, add the current value of ACOUNT to its partial ECDF.

That Algorithm ECDF2 correctly computes the rank of each point in S can be established by induction, using the two facts mentioned above. We also use induction to analyze its running time on a random-access computer, by setting up a recurrence relation describing the running time on N points, say T(N), and then solving that recurrence. To set up the recurrence we must count how many operations each step of the algorithm requires. Step 1 can be solved by a fast median algorithm; we can use the algorithm of Blum et al. [10] to accomplish this step in O(N) operations. Because step 2 solves two problems of size N/2, its cost will be 2T(N/2), by induction. The sort of step 3 requires O(N lg N) time, and the scan requires linear time, so the total cost of step 3 is O(N lg N). Adding the costs of the three steps we find that the total cost of the algorithm is

T(N) = O(N) + 2T(N/2) + O(N lg N)
     = 2T(N/2) + O(N lg N).

This recurrence⁵ has solution

T(N) = O(N lg^2 N)

so we know that the running time of Algorithm ECDF2 is O(N lg^2 N).

We can make an observation that will allow us to speed up many multidimensional divide-and-conquer algorithms. In looking carefully at the analysis of Algorithm ECDF2 we see that the running time is dominated by the sort of step 3. To remove this cost we can sort the N points of S once by y-coordinate before any invocation of ECDF2, at a once-for-all cost of O(N lg N). After this we can achieve the effect of sorting (without the cost) by being very careful to maintain "sortedness-by-y" when dividing into sets A and B during step 1. After this modification the recurrence describing the modified algorithm becomes

T(N) = 2T(N/2) + O(N)

which has solution

T(N) = O(N lg N).

This technique is known as presorting and has very broad applicability; we see that it allows us to remove a factor of O(lg N) from the running time of many algorithms.

We now turn our attention to developing an algorithm for solving the ECDF problem for N points in 3-space. The multidimensional divide-and-conquer method we use is analogous to the method used by the previous algorithm: To solve a problem of N points in 3-space, we solve two problems of N/2 points in 3-space and then one problem of N points in 2-space. The first step of our algorithm chooses a cut plane P perpendicular to the x-axis dividing S into sets A and B of N/2 points each. Figure 4 illustrates this division. The second step then (recursively) counts for each point in A the number of points in A it dominates, and likewise for B. By reasoning analogous to that for the planar case we can see that since no point in A dominates any point in B, the final ranks of A are exactly the ranks already computed. By the same reasoning we know that a point b in B dominates point a in A iff b dominates a in their projection on P, the (y, z) plane. The third step of our algorithm therefore projects all points onto plane P (which merely involves ignoring the x-coordinates) and then counts for each B-point the number of A-points it dominates. This reduced problem, however, is just a

⁴ All the algorithms we will see have this test for small input size; we usually omit it for brevity.

⁵ To be precise we should also define the "boundary condition" of the recurrence, which is in this case T(1) = c for some constant c. Since all the recurrences we will see in this paper have the same boundary, we delete it for brevity. The particular value of the constant does not affect the asymptotic growth rate of the functions.
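Algorithm ECDF2 can be rendered as a short program. The following Python sketch is ours, not the paper's: for clarity it sorts to find the median cut (rather than using a linear-time median algorithm) and re-sorts by y in the marriage step, so it realizes the O(N lg^2 N) variant rather than the presorted one:

```python
def ecdf2(points):
    """Algorithm ECDF2: all-points ECDF for points in the plane.
    points: list of (x, y) tuples with distinct x- and y-values.
    Returns a dict mapping each point to its rank."""
    if len(points) == 1:
        return {points[0]: 0}
    pts = sorted(points)                 # step 1: cut at the median x-value
    mid = len(pts) // 2
    A, B = pts[:mid], pts[mid:]
    rank = ecdf2(A)                      # step 2: ranks within A ...
    rank.update(ecdf2(B))                # ... and within B
    # Step 3 (marriage): sweep A and B together in increasing y-order,
    # counting in acount the As seen so far; each B dominates exactly those.
    a_set = set(A)
    acount = 0
    for p in sorted(A + B, key=lambda q: q[1]):
        if p in a_set:
            acount += 1
        else:
            rank[p] += acount
    return rank
```

On the four points (1, 2), (2, 1), (3, 3), (4, 4) it returns ranks 0, 0, 2, 3 respectively.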

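The claim above that T(N) = 2T(N/2) + O(N lg N) solves to O(N lg^2 N) can be checked numerically. This is a sketch of ours in which we take the additive term to be exactly N lg N and the boundary value to be T(1) = 0:

```python
from math import log2

def t(n):
    """Exact solution of T(N) = 2*T(N/2) + N*lg(N), T(1) = 0, N a power of 2."""
    return 0 if n == 1 else 2 * t(n // 2) + n * log2(n)

def closed_form(n):
    # Summing the recurrence level by level gives exactly
    # N * lg(N) * (lg(N) + 1) / 2, which is Theta(N lg^2 N).
    return n * log2(n) * (log2(n) + 1) / 2
```

For N = 1024, both functions give 56,320.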
slightly modified version of the planar ECDF problem, which can be solved in O(N lg N) time.⁶ The recurrence describing our three-dimensional algorithm is then

T(N) = 2T(N/2) + O(N lg N)

which, as we saw previously, has solution T(N) = O(N lg^2 N).

Fig. 4. A three-dimensional problem.

The technique that we just used to solve the two- and three-dimensional problems can be extended to solve the general problem in k-space. The algorithm consists of three steps: divide into A and B, solve the subproblems recursively, and then patch up the partial answers in B by counting for each point in B the number of As it dominates (a (k-1)-dimensional problem). The (k-1)-dimensional subproblem can be solved by a "bookkeeping" modification to the (k-1)-dimensional ECDF algorithm. We can describe the algorithm formally as Algorithm ECDFk.

Algorithm ECDFk
1. Choose a (k-1)-dimensional cut plane P dividing S into two subsets A and B, each of N/2 points.
2. Recursively call ECDFk(A) and ECDFk(B). After this we know the true ECDF of all points in A.
3. [For each B find the number of As it dominates.] Project the points of S onto P, noting for each whether it was an A or a B. Now solve the reduced problem using a modified ECDF(k-1) algorithm and add the calculated values to the partial ranks of B.

To analyze the runtime of Algorithm ECDFk we denote its running time on a set of N points in k-space by T(N, k). For any fixed value of k greater than 2, step 1 can be accomplished in O(N) time. Step 2 requires 2T(N/2, k) time, and the recursive call of step 3 requires T(N, k-1) time. Combining these we have the recurrence

T(N, k) = O(N) + 2T(N/2, k) + T(N, k-1).

We can use as a basis for induction on k the fact that

T(N, 2) = O(N lg N),

as shown previously, and this establishes that

T(N, k) = O(N lg^(k-1) N).⁷

We have therefore exhibited an algorithm that solves the all-points ECDF problem for N points in k-space in O(N lg^(k-1) N) time, for any fixed k greater than 1.

2.1.2 The ECDF searching problem. We now turn our attention to the ECDF searching problem. As in the all-points problem, we first investigate the one-dimensional case and then examine successively higher dimensions. There are three costs associated with a search structure: the preprocessing time required to build the structure, the query time required to search it, and the storage required to represent it in memory. When analyzing a structure containing N points we denote these quantities by P(N), Q(N), and S(N), respectively. We illustrate these quantities as we examine the one-dimensional ECDF searching problem.

In one dimension the ECDF searching problem asks us to organize N points (real numbers) such that when given a new point x (not necessarily in the set), we can quickly determine how many points x dominates. One of the more obvious ways of solving the problem is the "sorted array" data structure. In this scheme the N points are sorted into increasing order and stored in an array. To see how many points a query point dominates we perform a binary search to find the position of that point in the array. This structure has been studied often (see, for example, Knuth [16]) and is known to have properties

P(N) = O(N lg N),
Q(N) = O(lg N),
S(N) = O(N).

In the two-dimensional ECDF searching problem we are to preprocess N points in the plane such that we can quickly answer queries asking the rank of a new point, that is, how many points lie below it and to its left. There are many structures that can be used to solve this problem, but we focus on one called the ECDF tree, which follows from the multidimensional divide-and-conquer paradigm (others are discussed by Bentley and Shamos [9]). The multidimensional divide-and-conquer method applied to planar search structures represents a structure of N points in 2-space by two substructures of N/2 points in 2-space and one substructure of N points in 1-space. We now describe the top level of an ECDF tree storing the point set S. By analogy to the all-points algorithm, we choose a line L dividing S into equal-sized sets A and B. Instead of solving subproblems A and B, however, we now recursively process them into ECDF trees representing their respective subsets. Having built these subtrees we are (almost) prepared to answer ECDF queries in set

⁶ The following "bookkeeping" operations must be added to ECDF2 to enable it to solve this problem in O(N lg N) time: Relabel the As and Bs to be Xs and Ys, respectively. We are now given N points in the plane and asked to count for each Y the number of Xs it dominates. As in ECDF2, we divide into sets A and B and solve those subproblems recursively. We must now count for each Y in B the number of Xs in A it dominates; we do this by projecting only the Xs of A and the Ys of B onto L in step 3 of ECDF2.

⁷ We use the fact that if T(N) = 2T(N/2) + O(N lg^m N), then T(N) = O(N lg^(m+1) N). A more detailed discussion of these recurrences can be found in Monier [21].
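A planar ECDF tree along these lines can be sketched in Python. This is a hedged sketch of ours, not the paper's code: the leaf handling and names are our choices, and the query step anticipates the search procedure described below, which binary-searches A's y-sorted array:

```python
import bisect

class EcdfTree:
    """Planar ECDF tree: each internal node stores the cut x-value, subtrees
    for A and B, and the y-values of A in sorted order."""
    def __init__(self, points):          # points: (x, y), distinct coordinates
        pts = sorted(points)
        if len(pts) == 1:
            self.point, self.left = pts[0], None      # leaf holds one point
            return
        mid = len(pts) // 2
        A, B = pts[:mid], pts[mid:]
        self.cut_x = B[0][0]                          # the dividing line L
        self.left, self.right = EcdfTree(A), EcdfTree(B)
        self.a_ys = sorted(y for _, y in A)

    def rank(self, q):
        """Number of stored points dominated by the query point q = (x, y)."""
        if self.left is None:
            return 1 if (q[0] > self.point[0] and q[1] > self.point[1]) else 0
        if q[0] < self.cut_x:            # q cannot dominate any point of B
            return self.left.rank(q)
        # q lies right of L: its rank in B, plus its y-rank among A's points
        return self.right.rank(q) + bisect.bisect_left(self.a_ys, q[1])
```

For the points (1, 2), (2, 1), (3, 3), (4, 4), a query at (3.5, 3.5) returns 3.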
S. The first step of a query algorithm compares the x-value of the query point with the line L; the two possible outcomes are illustrated in Figure 5 as points u and v. If the point lies to the left of L (as u does), then we find its rank in S by recursively searching the substructure representing A, for it cannot dominate any point in B. If the point lies to the right of L (as v does), then searching B tells how many points in B are dominated by v, but we still must find how many points in A are dominated by v. To do this we need only calculate v's y-rank in A; this is illustrated in Figure 6.

Fig. 5. Two cases of planar queries.

Fig. 6. Calculating v's y-rank in A.

We can now describe the planar ECDF tree more precisely. An internal node representing a set of N points will contain an x-value (representing the line L), a pointer to a left son (representing A, the N/2 points with lesser x-values), a right son representing B, and an array of the N/2 points of A sorted by y-value. To build an ECDF tree recursively, one divides the set into A and B, builds the subtrees representing each, and then sorts the elements of A by y-value (actually by presorting). To search the tree recursively, one first compares the x-value of the node with the x-value of the query point. If the query point is less, then only the left son is searched recursively. If the value is greater, then the right son is searched recursively, a binary search is done in the sorted y-sequence representing A to find the query point's y-rank in A, and the two ranks are added together and returned as the result.

To analyze this search structure we again use recurrences. In counting the preprocessing cost we note that the recurrence describing the algorithm (with presorting) is

P(N) = 2P(N/2) + O(N)

and the solution is

P(N) = O(N lg N).

To store an N-element set we must store two N/2-element sets plus one sorted list of N/2 elements, so the recurrence is

S(N) = 2S(N/2) + N/2

which has solution

S(N) = O(N lg N).

In analyzing the search time our recurrence will depend on whether the point lies in A or B, so we assume it lies in B and analyze its worst case. In this case we must make one comparison, perform a binary search in a structure of size N/2, and then recursively search a structure of size N/2. The cost of this will be

Q(N) = Q(N/2) + O(lg N)

so we know that the worst-case cost of searching is

Q(N) = O(lg^2 N).

Having analyzed the performance of the planar ECDF tree, we can turn our attention to higher-dimensional ECDF searching problems.

A node representing an N-element ECDF tree in 3-space contains two subtrees (each representing N/2 points in 3-space) and a two-dimensional ECDF tree (representing the projection of the points in A onto the cut plane P). This structure is built recursively (analogously to the ECDF3 algorithm). The searching algorithm compares the query point's x-value to the value defining the cut plane and, if it is less, searches only the left substructure. If the query point lies in B, then the right substructure is searched, and a search is done in the two-dimensional ECDF tree. The full k-dimensional structure is analogous: A node in this structure contains two substructures of N/2 points in k-space and one substructure of N/2 points in (k-1)-space. The recurrences describing the structure containing N points in k-space are

P(N, k) = 2P(N/2, k) + P(N/2, k-1) + O(N),
S(N, k) = 2S(N/2, k) + S(N/2, k-1) + O(1),
Q(N, k) = Q(N/2, k) + Q(N/2, k-1) + O(1).

We can use the performance of the two-dimensional structure as a basis for induction on k, and thus establish (for fixed values of k) that

P(N, k) = O(N lg^(k-1) N),
S(N, k) = O(N lg^(k-1) N),
Q(N, k) = O(lg^k N).

It is interesting to note how faithfully the actions of the multidimensional divide-and-conquer algorithms are described by the recurrences. Indeed, the recurrences might

provide a suitable definition of multidimensional divide-and-conquer!

2.1.3 Summary of the ECDF problems. In our study of ECDF problems so far we have concentrated on algorithms for solving the problems without examining lower bounds. We saw that the one-dimensional all-points ECDF problem is equivalent to sorting (that is, one problem can be reduced to the other in linear time), so we know that the ECDF problem has an Ω(N lg N) lower bound in the decision tree model of computation. Since the one-dimensional problem can be embedded in any higher-dimensional space (by simply ignoring some coordinates), this immediately gives an Ω(N lg N) lower bound for the all-points problem in k-space. Lueker [20] has used similar methods to show that Θ(kN lg N) time is necessary and sufficient for the all-points problem in k-space in the decision tree model of computation. Unfortunately, there do not appear to be succinct programs corresponding to the decision trees used in his proof. It therefore appears that a stronger model of computation than decision trees will have to be used to prove lower bounds on these problems; Fredman [12] has recently made progress in this direction. These results show that Algorithm ECDF2 is within a constant factor of optimal; whether Algorithm ECDFk is optimal in some reasonable model of computation remains an open question. Similar methods can be used to show a lower bound on the problem of ECDF searching: it requires at least Ω(lg N) time in the worst case.

The analyses that we have seen for the ECDF algorithms have been "rough" in two respects: We have considered only the case that N is a power of 2, and our analyses were given for fixed k as N grows large. Monier [21] has analyzed the algorithms in this section more exactly, overcoming both of the above objections. His analysis shows that the time taken by Algorithm ECDFk is given by

T(N, k) = c(N lg^(k-1) N)/(k-1)! + O(N lg^(k-2) N)

where c is an implementation-dependent constant. It is particularly pleasing to note that the coefficient of the N lg^(k-1) N term is 1/(k-1)!; this function goes to zero very rapidly. Monier's analyses also showed that the leading terms of the ECDF searching structures' performances have similar coefficients (inverse factorials).

We have now completed our study of the ECDF problems per se, and it is important for us to take a [...] in developing these algorithms is to start with low-dimensional problems and then successively move to higher dimensions. Describing an algorithm recursively leads to two advantages: We can describe it succinctly and then analyze its performance by the use of recurrence relations. The recurrence used most often is

F(N) = 2F(N/2) + O(N lg^m N)

which has solution O(N lg^(m+1) N) for m ≥ 0.

The multidimensional divide-and-conquer paradigm can be applied in the development of data structures as well as all-points algorithms. The data structure strategy can be described as follows:

To store a structure representing N points in k-space, store two structures representing N/2 points each in k-space and one structure of (up to) N points in (k-1)-space.

There are, of course, many similarities between these data structures and the multidimensional algorithms (most importantly, we use such an algorithm to build the structure), so the principles we enumerated above for all-points problems will apply to data structures as well. In addition to the recurrence mentioned above, the recurrence

F(N) = F(N/2) + O(lg^m N)

which has solution F(N) = O(lg^(m+1) N) for m ≥ 0 arises often in the study of these structures.

Presorting is a technique applicable to both multidimensional divide-and-conquer algorithms and data structures. By sorting data once-for-all before a recursive algorithm is initially invoked (and then keeping the data sorted as we divide into subproblems), we can avoid the repetitive cost of sorting. This technique often saves factors of O(lg N). One might hope that presorting could in some way be used many times to save multiple factors of O(lg N), but the author doubts that this can be achieved.

Having made these general observations about our primary algorithm design tool, we are ready to apply it to the solution of other problems. Because we have examined the ECDF algorithms in some detail and the algorithms that we will soon examine are so similar, our discussion of those algorithms will not be so precise; their details can be deduced by analogy with the algorithms of this section.

2.2 Maxima

In this section we investigate problems dealing with
moment to reflect on the things we have learned about
maximal elements, or maxima, of point sets. A point is
multidimensional divide-and-conquer. The paradigm
said to be a maximum of a set if there is no other point
applies directly to all-points problems, and we state it
that dominates it. In Figure 7 we illustrate a planar point
here in its full generality:
set with the maxima of the set circled. We are interested
To solve a problem of N points in k-space, solve two problems of in two types of maxima problems: the all-points problem
N/2 points each in k-space and one problem of (up to) N points in (given a set, fmd all the maxima) and the searching
( k - l)-space.
problem (preprocess a set to answer queries asking if a
Algorithms based on this paradigm have three major new point is a maximum of the set). The problem of
parts: the division, recurs&e, and marriage steps. Because computing maxima arises in many diverse applications.
of the recursion on dimension, an important technique Suppose, for example, that we have a set of programs for
220 Communications April 1980
of Volume 23
the ACM Number 4
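The definitions just given can be made concrete. Below is a minimal Python sketch (the function names are ours, and we take "dominates" to mean at-least-as-large in every coordinate and not identical, one common convention); it finds maxima by definition-chasing in O(N^2) time, a baseline the algorithms of this section improve upon:

```python
def dominates(p, q):
    """p dominates q: p is at least as large in every coordinate
    and the two points are not identical."""
    return p != q and all(a >= b for a, b in zip(p, q))

def maxima_brute(points):
    """Maxima of a point set in any dimension: the points that no
    other point dominates.  O(N^2); kept only as a baseline."""
    return [q for q in points if not any(dominates(p, q) for p in points)]
```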
Fig. 7. Maxima are circled.
Fig. 8. Maxima of A are circled; B's are squared.
Suppose, for example, that we have a set of programs for performing the same task, rated on the two dimensions of space efficiency and time efficiency. If we plot these measures as points in the x-y plane, then a point (program) dominates another only if it is more space efficient and more time efficient. The maximal programs of the set are the only ones we might consider for use, because any other program is dominated by one of the maxima. In general, if we are seeking to maximize some multivariate goodness function (monotone in all variables) over some finite point set, then it suffices to consider only maxima of the set. This observation can significantly decrease the cost of optimization if many optimizations are to be performed. Such computation is common in econometric problems.

Problems about maxima are very similar to problems about ECDFs. If we define the negation of point set A (written -A) to consist of each of the points of A multiplied by -1, then a point is a maximum of A if and only if its rank in -A is zero (for if it is dominated by no points in A, then it dominates no points in -A). By this observation we can solve the all-points maxima problem in O(N lg^(k-1) N) time and the maxima searching problem with similar preprocessing time and space and O(lg^k N) query time, by using the ECDF algorithms of Section 2.1. In this section we investigate a different multidimensional divide-and-conquer algorithm that allows us to reduce those cost functions by a factor of O(lg N). The all-points maxima algorithm we will see is due to Kung et al. [17] (although our presentation is less complicated than theirs). The searching structure of this section is described here for the first time. Although the algorithms that we will see are similar to the ECDF algorithms of the last section in many respects, they do have some interesting expected-time properties that the ECDF algorithms do not have. Having made these introductory comments, we can now turn our attention to the maxima problems, investigating first the all-points problem and then the searching problem.

The maximum of N points on a line is just the maximum element of the set, which can be found in exactly N-1 comparisons. Computing the maxima of N points in the plane is just a bit more difficult. Looking at Figure 7, we notice that the maxima (circled) increase upward as the point set is scanned right to left. This suggests an algorithm: Sort the points into increasing x-order and then scan that sorted list right to left, observing successive "highest y-values so far observed" and marking those points as maxima. It is easy to prove that this algorithm gives exactly the maxima, for a point is maximal if and only if all points with greater x-values (before it on the list) have lesser y-values. The computational cost of the algorithm is O(N lg N) for the sort and then O(N) for the scan. (Note that if we have presorted the list, then the total time for finding the maxima is linear.)

We can also develop a multidimensional divide-and-conquer algorithm to solve the planar problem. As before, we divide by L into A and B and solve those subproblems recursively (finding the maxima of each set). This is illustrated in Figure 8, in which the maxima of A are circled and the maxima of B are in boxes. Because no point in B is dominated by any point in A, the maxima of B are also maxima of the entire set S. Thus the third step (the "marriage" step) of our algorithm must discard points which are maxima of A but not of the whole set, i.e., those maxima of A which are dominated by some point in B. Since all points in B x-dominate all points in A, we need check only for y-domination. We therefore project the maxima of A and B onto L, then discard A-points dominated by B-points on the line. This third step can be easily implemented by just comparing the y-values of all A-maxima with the maximum y-value of the B-maxima and discarding all A's with lesser y-values (we described it otherwise to ease the transition to higher spaces). The running time of this algorithm is described by the recurrence

T(N) = 2T(N/2) + O(N)

which has solution O(N lg N).

We can generalize the planar algorithm to yield a maxima algorithm for 3-space. The first step divides into A and B, and the second step recursively finds the maxima of each of those sets. Since every maximum of B is a maximum of the whole set, the third step must discard every maximum of A which is dominated by a maximum of B. This is accomplished by projecting the respective maxima sets onto the plane and then solving the planar problem. We could modify the two-dimensional maxima algorithm to solve this task, but it will be slightly more efficient to use the "scanning" algorithm.
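The sort-and-scan algorithm for planar maxima translates directly into code; a minimal Python sketch (names are ours, distinct x-values assumed):

```python
def planar_maxima(points):
    """Maxima of a planar point set by sort-and-scan: sort by
    increasing x, scan right to left, and keep each point whose
    y-value exceeds every y-value seen so far."""
    pts = sorted(points)              # O(N lg N); the rest is linear
    maxima = []
    best_y = float("-inf")
    for x, y in reversed(pts):        # right-to-left scan
        if y > best_y:                # no point to its right has a larger y
            maxima.append((x, y))
            best_y = y
    maxima.reverse()                  # report in left-to-right order
    return maxima
```

If the input is presorted, only the O(N) scan remains, matching the linear bound noted in the text.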
Suppose we cut into A and B by the z-coordinate; we must discard all As dominated by any Bs in the x-y plane. If we have presorted by x, then we just scan right to left down the sorted list, discarding As with y-values less than the maximum B y-value observed to date. This marriage step takes linear time (with presorting), so this algorithm has the same recurrence as the two-dimensional one, and its running time is therefore also O(N lg N).

The obvious generalization of this algorithm carries through to k-space without difficulty. We solve a problem of N points in k-space by solving two problems of N/2 points in k-space and then solving one problem of (up to) N points in (k-1)-space. This reduced problem calls for finding all As in the space dominated by any Bs, and we can solve it by modifying the maxima algorithm (similar to our modifications of the ECDF algorithm). The resulting algorithm has the recurrence

T(N, k) = 2T(N/2, k) + T(N, k-1) + O(N)

and we can use the fact that T(N, 3) = O(N lg N) to establish that

T(N, k) = O(N lg^(k-2) N) for k ≥ 3.

The analysis we just performed, though accurate for the worst case, is terribly pessimistic. It assumes that all N points of the original set will be maxima of their subsets, whereas for many sets there will be relatively few maxima of A and B. Results obtained by Bentley et al. [6] show that only a very small number of points usually remain as maxima (for many probability distributions). If only m points remain, then the term T(N, k-1) in the above recurrence is replaced by T(m, k-1), which for small enough m (i.e., m = O(N^p) for some p < 1) is O(N). If this is true, then the recurrence describing the maxima algorithm is

T(N, k) = 2T(N/2, k) + O(N),

which has solution T(N) = O(N lg N). One can formalize the arguments we have just sketched to show that the average running time of the above algorithm is O(N lg N) for a wide class of distributions. The interested reader is referred to [6], in which a linear expected-time maxima algorithm is presented (with poorer worst-case performance than this algorithm); the analysis techniques used therein can be used to prove this result.

We turn our attention now to the maxima searching problem. We start with the planar case, where we must process N points in the plane into a data structure so we can quickly determine if a new point is a maximum (and if not, we must name a point which dominates it). Our structure is a binary tree in which the left son of a given node represents all points with lesser x-values (A), the right son represents B, and an x-value represents the line L. To answer a query asking if a new point q is a maximum of the set represented by a given node, we compare q's x-value to the node's. If the point lies in B (greater x-value), then we search the right subtree and return the answer. If the point lies in A, however, we first search the left subtree; if the point is dominated, we return the dominating point. If it is not dominated by any point in A, then we must check to see if it is dominated by any point in B. This can be accomplished by storing in each node the maximum y-value of any point in B. This structure can be built in O(N lg N) time and requires linear space. Since the worst-case cost of a query satisfies the recurrence

T(N) = T(N/2) + O(1),

the worst-case search time is O(lg N).

This search structure can be generalized to k-space. In that case a structure representing N points in k-space contains two substructures representing N/2 points in k-space and one substructure representing N/2 points in (k-1)-space. To test if a new point is a maximum we first determine if it lies in A or B. If it is in B, then we visit only the right son. If it lies in A, we first see if it is dominated by any point in A (visit the left son), and if not, then we check to see if it is dominated by any point in B (by searching the (k-1)-dimensional structure). The recurrences describing the worst-case performance of this structure are

P(N, k) = 2P(N/2, k) + P(N, k-1) + O(N),
S(N, k) = 2S(N/2, k) + S(N/2, k-1) + O(1),
Q(N, k) = Q(N/2, k) + Q(N/2, k-1) + O(1),

which have solutions

P(N, k) = O(N lg^(k-2) N),
S(N, k) = O(N lg^(k-2) N),
Q(N, k) = O(lg^(k-1) N).

As in the case of the all-points problem, these times are highly pessimistic, and for many point distributions they can be shown to be much less on the average.

Yao [29] has shown that Ω(N lg N) is a lower bound on the decision tree complexity of computing the maxima of N points in 2-space. This result shows that the maxima algorithms we have seen here are optimal for two and three dimensions (by embedding). Lower bounds for the rest of the problems of this section are still open problems, and a model other than the decision tree will have to be used to prove optimal the algorithms that we have seen.

This concludes our study of maxima problems. Clever application of the multidimensional divide-and-conquer strategy allowed us to squeeze a factor of O(lg N) from the running times of the ECDF algorithms. We also glimpsed how an expected-time analysis might be performed on the computation-cost functions.

2.3 Range Searching

In this section we examine the problem of range searching, a searching problem defined by point domination for which there is no corresponding all-points problem.
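The planar maxima-searching tree of Section 2.2 admits a compact sketch. The rendering below is ours (Python; names and details are assumptions): instead of only the maximum y-value of B, each node stores the B-point attaining it, so a dominating witness can be returned; distinct x-values and strict dominance are assumed.

```python
def dominates(p, q):
    # strict dominance in both coordinates
    return p[0] > q[0] and p[1] > q[1]

def build_maxtree(pts):
    """pts must be sorted by increasing (distinct) x-value."""
    if len(pts) == 1:
        return {"point": pts[0]}
    half = len(pts) // 2
    a, b = pts[:half], pts[half:]
    return {
        "mid": a[-1][0],                       # x-value standing for the line L
        "best_b": max(b, key=lambda p: p[1]),  # B-point with maximum y-value
        "left": build_maxtree(a),              # A: lesser x-values
        "right": build_maxtree(b),             # B: greater x-values
    }

def dominated_by(node, q):
    """Return a stored point dominating q, or None if q is a maximum."""
    if "point" in node:
        p = node["point"]
        return p if dominates(p, q) else None
    if q[0] > node["mid"]:                     # q lies in B: no A-point can dominate it
        return dominated_by(node["right"], q)
    witness = dominated_by(node["left"], q)    # q lies in A: check A first...
    if witness is not None:
        return witness
    b = node["best_b"]                         # ...then B, via its max-y point
    return b if dominates(b, q) else None
```

Each query walks one root-to-leaf path with O(1) extra work per node, matching the O(lg N) bound in the text.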
Fig. 9. Number of points in R = r(A) - (r(B) + r(D)) + r(C).
Fig. 10. A node in a planar range tree.
The problem is to build a structure holding N points in k-space to facilitate answering queries of the form "report all points which are dominated by point U and dominate point L." This kind of query is usually called an orthogonal range query because we are in fact giving for each dimension i a range Ri = [li, ui] and then asking the search to report all points x such that xi is in range Ri for all i. A geometric interpretation of the query is that we are asking for all points that lie in a given hyper-rectangle. Such a search might be used in querying a geographic database to list all cities with latitude between 37° and 41° N and longitude between 102° and 109° W (this asks for all cities in Colorado). In addition to database problems, range queries are also used in certain statistical applications. These applications and a survey of the different approaches to the problem are discussed in Bentley and Friedman's [5] survey of range searching. The multidimensional divide-and-conquer technique that we will see has also been applied to this problem by Lee and Wong [18], Lueker [20], and Willard [28], who independently achieved structures very similar to the ones we describe.

In certain applications of the range searching problem we are not interested in actually processing each point found in the query rectangle; it suffices rather to know only how many such points there are. (One such example is multivariate density estimation.) Such a problem can be solved by using the ECDF searching algorithm of Section 2.1 and the principle of inclusion and exclusion. Figure 9 illustrates how four planar rank queries can be combined to tell the number of points in rectangle R (we use "r" as an abbreviation for "rank"); in k-space 2^k rank queries are sufficient.

The sorted array is one suitable structure for range searching in one-dimensional point sets. The points are organized into increasing order exactly as they were for the ECDF searching problem of Section 2.1. To answer a query we do two binary searches in the array to locate the positions of the low and high ends of the range; this identifies a sequence of points in the array which are the answer to the query, and they can then be reported by a simple procedure. The analysis of this structure for range searching is very similar to our previous analysis: The storage cost is linear and the preprocessing cost is O(N lg N). The query cost is then O(lg N) for the binary searches plus O(F), if a total of F points are found to be in the region. Note that any algorithm for range searching must include a term of O(F) in the analysis of query time.

We will now describe range trees, a structure introduced by Bentley [4]; as usual, we first examine the planar case. There are six elements in a range tree's node describing set S. These values are illustrated in Figure 10. The reals LO and HI give the minimum and maximum x-values in the set S (these are accumulated "down" the tree as it is built). The real MID holds the x-value defining the line L, which divides S into A and B as usual; we then store two pointers to range trees representing the sets A and B. The final element stored in the node is a pointer to a sorted array containing the points of S sorted by y-value. A range tree can be built recursively in a manner similar to constructing an ECDF tree. We answer a range query asking for all points with x-value in range X and y-value in range Y by visiting the root of the tree with the following recursive procedure. When visiting node N we compare the range X to the range [LO, HI]. If [LO, HI] is contained in X, then we can do a range search in the sorted array for all points in the range Y (all these points satisfy both the X and Y ranges). If the X range lies wholly to one side of MID, then we search only the appropriate subtree (recursively); otherwise we search both subtrees. If one views this recursive process as happening all at once, we see that we are performing a set of range searches in a set of arrays sorted by y. The preprocessing costs of this structure and the storage costs are both O(N lg N). To analyze the query cost we note that at most two sorted lists are searched at each of the lg N levels of the tree, and each of those searches costs at most O(lg N), plus the number of points found during that search. The query cost of this structure is therefore O(lg^2 N + F), where F (as before) is the number of points found in the desired range.

The range tree structure can of course be generalized to k-space. Each node in such a range tree contains pointers to two subtrees representing N/2 points in k-space and one N-point subtree in (k-1)-space. Analysis of range trees shows that

P(N, k) = O(N lg^(k-1) N),
S(N, k) = O(N lg^(k-1) N),
Q(N, k) = O(lg^k N + F)

where F is the number of points found.
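A planar range tree along these lines can be sketched as follows (a Python sketch; names are ours). It simplifies the descent slightly: rather than comparing X against MID, it prunes on each node's [LO, HI] interval, which visits the same O(lg N) maximal contained nodes; distinct x-values are assumed.

```python
import bisect

def build_range_tree(points):
    """Planar range tree: each node stores its x-range and its whole
    point set as a (y, x) array sorted by y."""
    return _build(sorted(points))

def _build(pts):
    node = {
        "lo": pts[0][0], "hi": pts[-1][0],        # LO, HI: x-range of the set
        "by_y": sorted((y, x) for x, y in pts),   # the set, sorted by y-value
    }
    if len(pts) > 1:
        half = len(pts) // 2                      # split at MID into A and B
        node["left"], node["right"] = _build(pts[:half]), _build(pts[half:])
    return node

def range_query(node, xlo, xhi, ylo, yhi):
    """Report all points (x, y) with xlo <= x <= xhi and ylo <= y <= yhi."""
    if xhi < node["lo"] or node["hi"] < xlo:      # x-ranges disjoint: prune
        return []
    if xlo <= node["lo"] and node["hi"] <= xhi:   # x-range contained in X:
        ys = node["by_y"]                         # binary-search the y-sorted array
        i = bisect.bisect_left(ys, (ylo, float("-inf")))
        j = bisect.bisect_right(ys, (yhi, float("inf")))
        return [(x, y) for y, x in ys[i:j]]
    return (range_query(node["left"], xlo, xhi, ylo, yhi) +
            range_query(node["right"], xlo, xhi, ylo, yhi))
```

The containment case performs the O(lg N) binary search described in the text; storage is O(N lg N) since every point appears in the sorted array of one node per level.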
Saxe [24] has used the decision tree model of computation to show a lower bound on the range searching problem of approximately 2k lg N. Bentley and Maurer [7] have given a range searching data structure that realizes this query time, at the cost of extremely high storage and preprocessing requirements. An interesting open problem is to give bounds on the complexity of this problem in the presence of only limited space (or preprocessing time); Fredman's [12] work is a first step in this direction.

Fig. 11. Fixed-radius near neighbor algorithm.

3. Closest-Point Problems

In Section 2 we investigated problems defined by point domination; in this section we discuss a class of problems defined by point closeness. We saw that multidimensional divide-and-conquer "works" for domination problems because projection onto a plane preserves point domination. In this section we discuss a number of projections that preserve point closeness.

We investigate three problems dealing with closeness. We use as our "closeness" measure the standard Euclidean distance (although the algorithms can be modified to use other measures). The problem we study in Section 3.1 is the easiest of the three problems we discuss because it is defined in terms of "absolute" closeness. The problems of Sections 3.2 and 3.3 are defined in terms of relative distances and are therefore a bit trickier. Throughout this section we describe the algorithms only at a very high level; the interested reader can find the details of these algorithms (as well as a sketch of how they were discovered) in Bentley [3].

3.1 Fixed-Radius Near Neighbors

In this section we discuss problems on point sets which deal with absolute closeness of points, that is, pairs of points within some fixed distance d of one another. We concentrate on the all-points problem, which asks for all pairs within d to be listed, and then we briefly examine the problem of "fixed-radius near neighbor" searching. Fixed-radius problems arise whenever "multidimensional agents" have the capability of affecting all objects within some fixed radius. Such problems arise in air traffic control, molecular graphics, pattern recognition, and certain military applications. One difficulty in approaching the fixed-radius near neighbors problem, however, is that if points are clustered closely together, then there can be O(N^2) close pairs, and we are therefore precluded from finding a fast algorithm to solve the problem.

We can avoid this difficulty by considering only sparse point sets, that is, sets which are not "clustered." We define sparsity as the condition that no d-ball in the space (that is, a sphere of radius d) contains more than some constant c points. This condition ensures that there will be no more than cN pairs of close points found. This condition is guaranteed in certain applications from the natural sciences: if an object can affect all objects within a certain radius, then there cannot be too many "affecting" objects. We will see that this condition also arises naturally in the solution of other closest-point problems. In this section we investigate the sparse all-points near neighbors problem by examining successively higher dimensions, and then we turn our attention to the searching problem.

In the one-dimensional all-points near neighbor problem we are given N points on a line and constants c and d such that no segment on the line of length 2d contains more than c points; our problem is to list all pairs within d of one another. We can accomplish this by sorting the points into a list in ascending order and then scanning down that list. When visiting point x during the scan we check backward and forward on the list a distance of d. By the sparsity condition, this involves checking at most c points for "closeness" to x. The cost of this procedure is O(N lg N) for the sorting and then O(N) for the scan, for a total cost of O(N lg N). Note the very important role sparsity plays in analyzing this algorithm: It guarantees that the cost of the scan is linear in N.

Figure 11 shows how we can use multidimensional divide-and-conquer to solve the planar near neighbor problem. The first and second steps of our algorithm are, as usual, to divide the point set by L into A and B and then find all near neighbor pairs in each recursively. At this point we have almost solved our problem: all that remains to be done is to find all pairs within d which have one element in A and one in B. Note that the "A point" of such a pair must lie in the slab of A which is within d of L, and likewise for B. Our third step thus calls for finding all pairs with one element in A and the other in B, and to do this we can confine our attention to the slab of width 2d centered about line L. But this can be transformed into a one-dimensional problem by projecting all points in the slab onto L. It is not difficult to show that projection preserves sparsity (details of the proof can be found in Bentley [3]), and it is obvious that projection preserves closeness, for projection only decreases the distance between pairs of points.
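The one-dimensional sort-and-scan procedure can be sketched as follows (a Python sketch; names are ours). Scanning forward only reports each close pair once; sparsity bounds the inner loop by the constant c:

```python
def near_pairs_1d(points, d):
    """All pairs of points on a line within distance d of one another.
    Under the sparsity condition the scan is linear, so the total cost
    is O(N lg N) for the sort plus O(N) for the scan."""
    pts = sorted(points)
    pairs = []
    for i, x in enumerate(pts):
        j = i + 1
        while j < len(pts) and pts[j] - x <= d:   # at most c iterations
            pairs.append((x, pts[j]))
            j += 1
    return pairs
```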
Fig. 12. Two cut lines (a bad cut and a good cut).

Our reduced problem is therefore just the one-dimensional sparse near neighbors problem (though it requires checking both to ensure pairs have one element from A and one from B and to ensure that the pairs were close before projection), and this can be accomplished in O(N lg N) time, or linear time if presorting is used. The runtime of our algorithm thus obeys the recurrence

T(N) = 2T(N/2) + O(N)

which has solution T(N) = O(N lg N). Sparsity played two important roles in this algorithm. Since the original point set was sparse, we could guarantee that both A and B would be sparse after the division step (which in no way alters A or B). The sparsity condition was also preserved in the projection of the third step, which allowed us to use the one-dimensional algorithm to solve the resulting subproblem.

The algorithm we just saw can be generalized to three and higher dimensions. In three dimensions we divide the set by a cut plane P into A and B and find all near pairs in those sets recursively. We now need to find all close pairs with one member in A and the other in B, and to do this we confine our attention to the "slab" of all points within distance d of P. If we project all those points onto the slab (remembering whether each was an A or a B), then we have a planar near neighbor problem of (up to) N points. Using our previous planar algorithm gives an algorithm for 3-space with O(N lg^2 N) running time. Extending this to k-space gives us an O(N lg^(k-1) N) algorithm.

Having seen so many O(N lg^(k-1) N) algorithms in this paper may have lulled the reader into a bleary-eyed state of universal acceptance, but the practicing algorithm designer never sleeps well until he has an algorithm with a matching lower bound. For this problem the best known lower bound is Ω(N lg N); so we are encouraged to try to find an O(N lg N) algorithm. First we consider our planar algorithm in its O(N lg^2 N) form, temporarily ignoring the speedup available with presorting. If we ask where the extra logarithmic factor comes from, we see that it is due to the fact that in the worst case all N points can lie in the slab of width 2d; this is illustrated in Figure 12. If the points are configured this way, then we should choose as cut line L a horizontal line dividing the set into halves. It turns out not to be hard to generalize this notion to show that in any sparse point set there is a "good" cut line. By "good" we mean that L possesses the following three properties:

(1) It is possible to locate L in linear time.
(2) The set S is divided approximately in half by L.
(3) Only O(N^(1/2)) points of S are within d of L.

A proof that every sparse point set contains such a cut line can be found in Bentley [3]. We can use the existence of such a cut line to create an O(N lg N) algorithm. The first step of our algorithm takes linear time (by property 1 of L), and the second step is altered (by property 2). The third step is faster because it sorts fewer than N points: only the O(N^(1/2)) points within d of L, by property 3. Since this can be accomplished in much less than linear time, our algorithm has the recurrence*

T(N) = 2T(N/2) + O(N)

which has solution O(N lg N). The gain in speed was realized here by solving only a very small problem on the line, so small that it can be solved in much less than linear time. Not unexpectedly, it can be shown that for sparse point sets in k-space there will always exist good cut planes, which will have not more than O(N^(1-1/k)) points within d of them. These planes imply that the (k-1)-dimensional subproblem can be solved in less than linear time, and the full problem thus obeys the recurrence

T(N, k) = 2T(N/2, k) + O(N).

This establishes that we can solve the general problem in O(N lg N) time.

The techniques which we have used for the all-points near neighbors problem can also be applied to the near neighbor searching problem. In that problem we are given a sparse set of points to preprocess into a data structure such that we can quickly answer queries asking for all points within d of a query point. If we use the general multidimensional divide-and-conquer strategy, then we achieve a structure very similar to the range tree, with performances

P(N) = O(N lg^(k-1) N),
S(N) = O(N lg^(k-1) N),
Q(N) = O(lg^k N).

If we make use of the good cut planes, however, then we can achieve a structure with performance

P(N) = O(N lg N),
S(N) = O(N),
Q(N) = O(lg N).

* The recurrence actually takes on a slightly different form; details are in Bentley [3].
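The planar all-points algorithm of this section, without the good-cut-line refinement, can be sketched as follows (a Python sketch; names are ours). The marriage step tags each slab point with its side, sorts the slab by the projection onto L, and uses sparsity to examine only a constant number of list neighbors per point:

```python
import math

def planar_near_pairs(points, d):
    """All pairs of planar points within distance d, by multidimensional
    divide-and-conquer.  The slab is re-sorted at every level, giving
    O(N lg^2 N); presorting or good cut lines would remove a lg factor."""
    out = []
    _rec(sorted(points), d, out)
    return out

def _rec(pts, d, out):
    if len(pts) <= 1:
        return
    half = len(pts) // 2
    mid_x = pts[half][0]                  # the cut line L
    _rec(pts[:half], d, out)              # close pairs within A
    _rec(pts[half:], d, out)              # close pairs within B
    # marriage step: an A-B pair must lie in the slab of width 2d about L
    slab = [(p, 0) for p in pts[:half] if mid_x - p[0] <= d] + \
           [(p, 1) for p in pts[half:] if p[0] - mid_x <= d]
    slab.sort(key=lambda t: t[0][1])      # project onto L: order by y
    for i, (p, side_p) in enumerate(slab):
        for q, side_q in slab[i + 1:]:
            if q[1] - p[1] > d:           # sparsity: O(1) candidates each
                break
            if side_p != side_q and math.dist(p, q) <= d:
                out.append((p, q))
```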
details are similar to the other multidimensional divide- Fig. 13. A planar closest-pair algorithm.
and-conquer structures we have seen previously.
To show lower bounds on fixed-radius problems in A
k-space we can consider the corresponding problems in
1-space. Fredman and Weide [13] have shown that the
problem of reporting all intersecting pairs among a set
of segments on the line requires f~(N lg N) time; by
embedding, this immediately gives the same lower bound i
on the all-points fixed-radius near neighbors problem in de
k-space. This shows that our algorithm is optimal (to
within a constant factor). Reduction to one dimension
can also be used to show that the data structure is
optimal. d = rain (d^,dB)

In this section we have seen how multidimensional


divide-and-conquer can be applied to closest-point prob-
lems, a totally different kind of problem than the domi- both A and B. Because the closest pair in A is da apart,
nation problems we saw in Section 2. Some of the no da-ball in A can contain more than seven points. This
techniques we have seen in this section will be useful in follows from the fact that at most six unit circles can be
all other closest-point problems. One such technique is made to touch some fLxed unit circle in the plane without
employing the concept of sparsity; it was given in the overlapping; details of the proof are in Bentley [3].
statement of this problem, and we will see how to intro- Likewise we can show that B is sparse in the sense that
duce it into other problems in which it is not given. The second technique that we will use again is projection of all near points onto a cut plane. With these tools in hand, we now investigate other closest-point problems.

3.2 Closest Pair
In this section we examine the closest-pair problem, an all-points problem with no searching analog. We are given N points in k-space and must find the closest pair in the set. Notice that this problem is based on relative, not absolute, distances. Although the distance separating the closest pair could be used as a rotation-invariant "signature" of a point set, its primary interest to us is not as an applications problem but rather in its status as an "easiest" closest-point problem. We call it easiest because there are a number of other geometric problems (such as nearest neighbors and minimal spanning trees) that find the closest pair as part of their solution. For a long time researchers felt that there might be a quadratic lower bound on the complexity of the closest-pair problem, which would have implied a quadratic lower bound on all the other problems. In this section we will see an O(N lg N) closest-pair algorithm, which gives us hope for the existence of fast algorithms for the other problems. (The O(N lg N) planar algorithm we will see was first described by Shamos [26], who attributes to H.R. Strong the idea of using divide-and-conquer to solve this problem.)

The one-dimensional closest-pair problem can be solved in O(N lg N) time by sorting. After performing the sort we scan through the list, checking the distance between adjacent elements. In two dimensions we can use multidimensional divide-and-conquer to solve the problem. The first step divides S by line L into sets A and B, and the second step finds the closest pairs in A and B, the distances between which we denote by dA and dB, respectively. This is illustrated in Figure 13. Note that we have now introduced a sparsity condition into both A and B: no dA-ball in A and no dB-ball in B contains more than seven points. If we let d be the minimum of dA and dB, notice that the whole space is sparse in the sense that no d-ball contains more than 14 points. This observation of "induced" sparsity will make the third step of our algorithm much easier, which is to make sure that the closest pair in the space is actually that corresponding to dA or to dB. We could just run a sparse fixed-radius near neighbor algorithm at this point to find any pairs within d of one another, but there is a more elegant approach. Note that any close pair must have one element in A and one element in B, so all we have to do is consider the slab of all points within d of L, and the third step of this algorithm becomes exactly the third step of the near neighbor algorithm. If we do not use presorting, this gives an O(N lg^2 N) algorithm.

The generalization to 3-space is obvious: We choose a plane P defining A and B and solve the subproblems for those sets. After this we have introduced sparsity into both A and B (relative to dA and dB), and we can ensure that our answer is correct by solving a planar fixed-radius subproblem. In k-space we solve two closest-pair problems of N/2 points each in k-space and one fixed-radius problem of (up to) N points in k - 1 dimensions. If we use the O(N lg N) algorithm for near neighbors, then our recurrence is

T(N) = 2T(N/2) + O(N lg N),

which has solution T(N) = O(N lg^2 N). Although we will not go into the details of the proof here, Bentley [3] has shown how the good cut planes we saw for the fixed-radius problem can be applied to this problem. If they are used appropriately, then the running time of the closest-pair algorithm in k-space can be reduced to O(N lg N). Shamos [26] has shown an Ω(N lg N) lower bound on this problem in 1-space by reduction to the "element uniqueness" problem; this algorithm is therefore optimal to within a constant factor.
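As a check on the claimed bound, the recurrence T(N) = 2T(N/2) + O(N lg N) can be unrolled directly (a sketch, assuming N is a power of 2 and writing the merge cost as cN lg N, with the base cases contributing O(N)):

```latex
\begin{aligned}
T(N) &= 2\,T(N/2) + cN\lg N
      = \sum_{i=0}^{\lg N - 1} 2^i \cdot c\,\frac{N}{2^i}\,\lg\frac{N}{2^i} + O(N) \\
     &= cN \sum_{i=0}^{\lg N - 1} (\lg N - i) + O(N)
      = cN\,\frac{\lg N\,(\lg N + 1)}{2} + O(N)
      = O(N \lg^2 N).
\end{aligned}
```

Each level i of the recursion has 2^i subproblems of size N/2^i, and the merge costs shrink by one lg factor per level, which is where the (lg N - i) terms come from.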
226 Communications April 1980
of Volume 23
the ACM Number 4
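To make the planar closest-pair algorithm of Section 3.2 concrete, here is a minimal Python sketch. The function names and the brute-force base case are mine, not the paper's, and the presorting refinement is omitted (the slab is re-sorted by y at every level), so this version runs in O(N lg^2 N) rather than O(N lg N):

```python
import math

def closest_pair(pts):
    """Planar divide-and-conquer closest pair; returns the minimum distance."""
    pts = sorted(pts)                              # sort once by x-coordinate

    def solve(p):
        n = len(p)
        if n <= 3:                                 # brute-force base case
            return min((math.dist(p[i], p[j])
                        for i in range(n) for j in range(i + 1, n)),
                       default=math.inf)
        mid = n // 2
        x_cut = p[mid][0]                          # vertical cut line L
        d = min(solve(p[:mid]), solve(p[mid:]))    # d = min(dA, dB)
        # Third step: a closer pair must straddle L, so only points
        # within d of L matter; project them onto L (order by y).
        slab = sorted((q for q in p if abs(q[0] - x_cut) < d),
                      key=lambda q: q[1])
        for i, a in enumerate(slab):
            for b in slab[i + 1:]:                 # sparsity: O(1) candidates
                if b[1] - a[1] >= d:
                    break
                d = min(d, math.dist(a, b))
        return d

    return solve(pts)
```

The inner scan is linear after sorting: by the induced-sparsity argument, each slab point need be compared only with a bounded number of successors before the y-gap reaches d.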
3.3 Nearest Neighbors
The final closest-point problem we investigate deals with nearest neighbors. In the all-points form we ask that for each point x the nearest point to x be identified (ties may be broken arbitrarily). In the searching form we give a new point x and ask which of the points in the set is nearest to x. The all-points problem has applications in cluster analysis and multivariate hypothesis testing; the searching problem arises in density estimation and classification. As usual, we begin our discussion of this problem by examining the planar case of the all-points problem.

It is not hard to see how multidimensional divide-and-conquer can be used to solve the planar problem. The first step divides S into A and B and the second step finds for each point in A its nearest neighbor in A (and likewise for each point in B). The third step must "patch up" by finding if any point in A actually has its true nearest neighbor in B, and similarly for points in B. To aid in this step we observe that we have established a particular kind of sparsity condition. We define the NN-ball (for nearest neighbor ball) of point x to be the circle centered at x which has radius equal to the distance from x to x's nearest neighbor. It can be shown (see Bentley [3]) that with this definition no point in the plane is contained in more than seven NN-balls of points in A. We will now discuss one-half of the third step, namely, the process of ensuring for each point in A that its nearest neighbor in A is actually its nearest neighbor in S. In this process we need consider only those points in A with NN-balls intersecting the line L (for if their NN-ball did not intersect L, then their nearest neighbor in A is closer than any point in B). The final step of our algorithm projects all such points of A onto L and then projects every point of B onto L. It is then possible to determine during a linear-time scan of the resulting list if any point x in A has a point in B nearer to x than x's nearest neighbor in A. This results in an O(N lg N) algorithm if presorting is used. Shamos [26] has shown that it is within a constant factor of optimal.

The extension of the algorithm to k-space yields O(N lg^(k-1) N) performance. It is not clear that there is a search structure corresponding to this algorithm. Shamos [26] and Lipton and Tarjan [19] have given nearest neighbor search structures for points in the plane that are analogous to this algorithm. Whether there exists a fast k-dimensional nearest neighbor search structure is still an open question; this approach is certainly one promising point of attack for that problem.

4. Additional Work

In Sections 2 and 3 we saw many aspects of the multidimensional divide-and-conquer paradigm, but there are many other aspects that can only be briefly mentioned. The paradigm has been used to create fast algorithms for other multidimensional problems, such as the all-points problem of finding the minimal-perimeter triangle determined by N points and the searching problem of determining if a query point lies in any of a set of N rectangles. Another aspect of this paradigm is the work of Bentley [3] on heuristics that algorithm designers can use when applying this paradigm to their problems. These heuristics were enumerated after the paradigm had been used to solve the closest-point problems of Section 3 and were then used in developing the algorithms of Section 2, among others. A final aspect of this paradigm is the precise mathematical analysis of the resulting algorithms; Monier [21] has used beautiful combinatorial techniques to analyze all of the algorithms we have seen in this paper.

We now briefly examine two paradigms of algorithm design closely related to multidimensional divide-and-conquer. The first such paradigm, planar divide-and-conquer, is really just the specialization of the general paradigm to the planar case. Shamos [25, 26] has used this technique to solve many computational problems in plane geometry. Among these problems are constructing the convex hulls of planar point sets, constructing Voronoi diagrams (a planar structure which can be used to solve various problems), and two-variable linear programming. It is often easier to apply the paradigm in planar problems than in k-dimensional problems, because the third (marriage) step of the algorithm is one-dimensional, and there are many techniques for solving such problems. Lipton and Tarjan [19] have given a very powerful "planar separator theorem" that often aids in applying the planar divide-and-conquer paradigm.(9)

(9) The author cannot resist pointing out that the planar divide-and-conquer paradigm is also used by police officers. Murray [22] offers the following advice in a hypothetical situation: "A crowd of rioters far outnumbers the police assigned to disperse it. If you were in command, the best action to take would be to split the crowd into two or more parts and disperse the parts separately." An interesting open problem is to apply other algorithmic paradigms to problems in police work, thus establishing a discipline of "computational criminology."

The second related paradigm is what we might call recursive partitioning. This technique is usually applied to searching problems, but it can then be used to solve all-points problems by repeated searching. The idea underlying this technique can be phrased as follows: To store N points in k-space, store two substructures each of N/2 points in k-space. Searches in such structures must then occasionally visit both subtrees of a given node to answer some queries, but with "proper" choice of cut planes this can be made to happen very infrequently. Bentley [2] described a search structure based on this idea which he called the multidimensional binary search tree, abbreviated as a k-d tree when used in k-space. That structure has been used to facilitate fast nearest neighbor searching, range searching, fixed-radius near neighbor searching, and for a database problem called "partial match" searching. Reddy and Rubin [23] use recursive partitioning in algorithms and data structures

for representing objects in computer graphics systems. Friedman [14, 15] has used the idea of recursive partitioning to solve many problems in multivariate data analysis such as classification and regression. In addition to their theoretical interest, these structures are quite easy and efficient to implement; their use has reduced the costs of certain computations by factors of a hundred to a thousand (examples of such savings can be found in the above references).

All of the data structures described in this paper have been static in the sense that once they are built, additional elements cannot be inserted into them. Many applications, however, require a dynamic structure into which additional elements can be inserted. Techniques described by Bentley [4] can be applied to all of the data structures that we have seen in this paper to transform them from static to dynamic. The cost of this transformation is to add an extra factor of O(lg N) to both query and preprocessing times (P(N) now denotes the time required to insert N elements into an initially empty structure), while leaving the storage requirements unchanged. The details of this transformation can be found in Bentley [4]. Recent work by Lueker [20] and Willard [28] can be applied to all of the data structures in this paper to convert them to dynamic at the cost of an O(lg N) increase in P(N), leaving both Q(N) and S(N) unchanged. Additionally, their method facilitates deletion of the elements.

This survey of additional work is not completed by having mentioned only what has been done; there is much more eagerly waiting to be done. Perhaps the single most obvious open problem is that of developing methods to reduce the times of our algorithms from O(N lg^k N) to O(N lg N). We saw how presorting could be used to remove one logarithmic factor (in all the algorithms) and certain other techniques that actually achieved O(N lg N) time, such as the expected analysis of Section 2.2 and the "good" cut planes of Sections 3.1 and 3.2. One might hope for similar techniques of broader applicability to increase the speed of our algorithms even more. Another area which we just barely scratched (in Section 2.2) was the expected analysis of these algorithms; experience indicates that our worst-case analyses are terribly pessimistic. A more specific open problem is to use this method to solve the nearest neighbor searching problem in k-space; one might also hope to use the method to give a fast algorithm for constructing minimal spanning trees. Although it has already been used to solve a number of research problems, much work remains to be done before we can "write the final chapter" on multidimensional divide-and-conquer.

5. Conclusions

In this section we summarize the contributions contained in this paper, but before we do so we will briefly review the results of Sections 2 and 3. Those sections dealt with two basic classes of problems: all-points problems and searching problems. For five all-points problems of N points in k-space we saw algorithms with running time of O(N lg^(k-1) N); for certain of these problems we saw how to reduce their running time even more. For four searching problems we developed data structures that could be built in O(N lg^(k-1) N) time, used O(N lg^(k-1) N) space, and could be searched in O(lg^k N) time. Both the all-points algorithms and the searching data structures were constructed by using one paradigm: multidimensional divide-and-conquer. All of the all-points problems that we saw have Ω(N lg N) lower bounds and all of the searching problems have Ω(lg N) lower bounds; the algorithms and data structures that we have seen are therefore within a constant factor of optimal (some for only small k, others for any k).

The contributions of this paper can be described at two levels. At the first level we have seen a number of particular results of both theoretical and practical interest. The algorithms of Sections 2 and 3 are currently the best algorithms known for their respective problems in terms of asymptotic running time. They (or their variants) can also be implemented efficiently for problems of "practical" size, and several are currently in use in software packages. At a second level this paper contains contributions in the study of a particular algorithmic paradigm, multidimensional divide-and-conquer. This paradigm is essentially a general algorithmic schema, which we instantiated to yield a number of particular algorithms and data structures. The study of this paradigm as a paradigm has three distinct advantages. First, we have been able to present a large number of results rather succinctly. Second, advances made in one problem can be applied to other problems (indeed, once one search structure was discovered, all the rest came about quite rapidly). Third, this paradigm has been used to discover new algorithms and data structures. The author has (on occasion) consciously applied this paradigm in the attempt to solve research problems. Although the paradigm often did not yield fruit, the ECDF, range searching, and nearest neighbor problems were solved in exactly this way.

In this paper the author has tried to communicate some of the flavor of the process of algorithm design and analysis, in addition to the nicely packaged results. It is his hope that the reader takes away from this study not only a set of algorithms and data structures, but also a feeling for how these objects came into being.

Acknowledgments. The presentation of this paper has been greatly improved by the careful comments of J. McDermott, J. Traub, B. Weide, and an anonymous referee. I am also happy to acknowledge assistance received as I was working on the algorithms described in this paper. D. Stanat was an excellent thesis advisor, and M. Shamos was a constant source of problems and algorithmic insight.
Received 9/78; revised 6/79; accepted 1/80

References
1. Aho, A.V., Hopcroft, J.E., and Ullman, J.D. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Mass., 1974.
2. Bentley, J.L. Multidimensional binary search trees used for associative searching. Comm. ACM 18, 9 (Sept. 1975), 509-517.
3. Bentley, J.L. Divide and conquer algorithms for closest point problems in multidimensional space. Unpublished Ph.D. dissertation, Univ. of North Carolina, Chapel Hill, N.C., 1976.
4. Bentley, J.L. Decomposable searching problems. Inform. Proc. Letters 8, 5 (June 1979), 244-251.
5. Bentley, J.L., and Friedman, J.H. Algorithms and data structures for range searching. Comptng. Surv. 11, 4 (Dec. 1979), 397-409.
6. Bentley, J.L., Kung, H.T., Schkolnick, M., and Thompson, C.D. On the average number of maxima in a set of vectors and applications. J. ACM 25, 4 (Oct. 1978), 536-543.
7. Bentley, J.L., and Maurer, H.A. Efficient worst-case data structures for range searching. To appear in Acta Informatica (1980).
8. Bentley, J.L., and Shamos, M.I. Divide and conquer in multidimensional space. In Proc. ACM Symp. Theory of Comptng., May 1976, pp. 220-230.
9. Bentley, J.L., and Shamos, M.I. A problem in multivariate statistics: Algorithm, data structure, and applications. In Proc. 15th Allerton Conf. Communication, Control, and Comptng., Sept. 1977, pp. 193-201.
10. Blum, M., et al. Time bounds for selection. J. Comptr. Syst. Sci. 7, 4 (Aug. 1972), 448-461.
11. Dobkin, D., and Lipton, R.J. Multidimensional search problems. SIAM J. Comptng. 5, 2 (June 1976), 181-186.
12. Fredman, M. A near optimal data structure for a type of range query problem. In Proc. 11th ACM Symp. Theory of Comptng., April 1979, pp. 62-66.
13. Fredman, M., and Weide, B.W. On the complexity of computing the measure of U[ai, bi]. Comm. ACM 21, 7 (July 1978), 540-544.
14. Friedman, J.H. A recursive partitioning decision rule for nonparametric classification. IEEE Trans. Comptrs. C-26, 4 (April 1977), 404-408.
15. Friedman, J.H. A nested partitioning algorithm for numerical multiple integration. Rep. SLAC-PUB-2006, Stanford Linear Accelerator Ctr., 1978.
16. Knuth, D.E. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, Reading, Mass., 1973.
17. Kung, H.T., Luccio, F., and Preparata, F.P. On finding the maxima of a set of vectors. J. ACM 22, 4 (Oct. 1975), 469-476.
18. Lee, D.T., and Wong, C.K. Quintary trees: A file structure for multidimensional database systems. To appear in ACM Trans. Database Syst.
19. Lipton, R., and Tarjan, R.E. Applications of a planar separator theorem. In Proc. 18th Symp. Foundations of Comptr. Sci., Oct. 1977, pp. 162-170.
20. Lueker, G. A data structure for orthogonal range queries. In Proc. 19th Symp. Foundations of Comptr. Sci., Oct. 1978, pp. 28-34.
21. Monier, L. Combinatorial solutions of multidimensional divide-and-conquer recurrences. To appear in the J. of Algorithms.
22. Murray, J.A. Lieutenant, Police Department--The Complete Study Guide for Scoring High (4th ed.). Arco, New York, 1966, p. 184, question 3.
23. Reddy, D.R., and Rubin, S. Representation of three-dimensional objects. Carnegie-Mellon Comptr. Sci. Rep. CMU-CS-78-113, Carnegie-Mellon Univ., Pittsburgh, Pa., 1978.
24. Saxe, J.B. On the number of range queries in k-space. Discrete Appl. Math. 1, 3 (Nov. 1979), 217-225.
25. Shamos, M.I. Computational geometry. Unpublished Ph.D. dissertation, Yale Univ., New Haven, Conn., 1978.
26. Shamos, M.I. Geometric complexity. In Proc. 7th ACM Symp. Theory of Comptng., May 1975, pp. 224-233.
27. Weide, B. A survey of analysis techniques for discrete algorithms. Comptng. Surv. 9, 4 (Dec. 1977), 291-313.
28. Willard, D.E. New data structures for orthogonal queries. Harvard Aiken Comptr. Lab. Rep., Cambridge, Mass., 1978.
29. Yao, F.F. On finding the maximal elements in a set of planar vectors. Rep. UIUCDCS-R-74-667, Comptr. Sci. Dept., Univ. of Illinois, Urbana, July 1974.
