Departments of Computer Science and Mathematics, Carnegie-Mellon University, Pittsburgh, Pennsylvania 15213

JEROME H. FRIEDMAN
Computation Research Group, Stanford Linear Accelerator Center, Stanford, California 94305

Much research has recently been devoted to "multikey" searching problems. In this paper the particular multikey problem of range searching is investigated and a number of data structures that have been proposed as solutions to this problem are surveyed. The purposes of this paper are to bring together a collection of widely scattered results, to acquaint the reader with the structures currently available for solving the particular problem of range searching, and to display a set of general methods for attacking multikey searching problems.

Keywords and Phrases: analysis of algorithms, orthogonal range queries, range searching, cells, multidimensional binary search trees, projection

CR Categories: 3.63, 3.74, 5.25

INTRODUCTION

The study of data structures for facilitating rapid searching is a fascinating subject of both practical and theoretical interest. Knuth [KNUT73] provides a definitive treatise on the subject of searching when the search is based on only one "key," but he points out that not much was known at the time his book was published about data structures for sets that have many "keys." This subject area, which is often called "multikey searching," "multidimensional searching," or "multiple attribute retrieval," has been the focus of a great deal of research in the past few years. In this paper we study a small part of this area by surveying the work that has been done on one particular multikey searching problem.
This problem is important in itself (having applications in such areas as database systems, statistics, and design automation) and, in addition, serves as a representative of the entire class of multikey searching problems.

We need some definitions to describe this particular searching problem precisely. In database terminology a file is a collection of records, each containing several attributes or keys. A query asks for all records satisfying certain characteristics. An orthogonal range query asks for all records with key values each within specified ranges (that is, each key is between specified upper and lower bounds). The process of retrieving the appropriate records is called range searching. This problem can also be cast in geometric terms by regarding the record attributes as coordinates and the k values for each record as representing a point in a k-dimensional coordinate space. The file of records then becomes a point set in k-space. The intersection of the query ranges is a k-dimensional hyperrectangle in the space (that is, a "box"), and a range query calls for finding all points lying inside this hyperrectangle. We will often cast range searching in this geometric framework as an aid to intuition.

This research was supported in part by the Office of Naval Research under Contract N00014-76-C-0370 and in part by the Department of Energy

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 1979 ACM 0010-4892/79/1200-0397 $00.75

Computing Surveys, Vol. 11, No. 4, December 1979

CONTENTS
INTRODUCTION
1. THE DATA STRUCTURES
1.1 Sequential Scan
1.2 Projection
1.3 Cells
1.4 k-d Trees
1.5 Range Trees
1.6 k-ranges
1.7 Other Structures
1.8 Comparison of Methods
2. ADDITIONAL WORK
3. CONCLUSIONS
REFERENCES

Range searching arises in many applications. In a geographic database of U.S. cities one might seek a list of all those with latitude between 37 and 41 and longitude between 102 and 109 (defining the state of Colorado). To compile an honor list of older students, a university administrator may wish to know those students whose age is between 21 and 24 years and whose grade point average is between 3.5 and 4.0. In data analysis it is often useful to do separate analyses on sets of data lying in different regions (hyperrectangles) of the observation space and then compare (or contrast) the respective results. (At the Stanford Linear Accelerator Center, for example, over 10 hours per week of IBM 370/168 time is devoted to this application.) In statistics, range searching can be employed to determine the empirical probability content of a hyperrectangle, to determine empirical cumulative distributions, and to perform density estimation (see LOFT65). Lauther [LAUT78] describes how range searching can be used to solve a design automation problem in very large-scale integrated circuitry (VLSI).

This paper has been written with two distinct audiences in mind. For the expert in searching (with background either in database systems or theoretical computer science), this paper is intended as a survey that gathers together and presents in a common terminology a number of results that have recently appeared on the problem of range searching.
This problem is of particular interest for two reasons: First, it is an important problem in many practical applications (and a difficult theoretical problem!); second, the methods that we investigate are broadly applicable to many other multikey searching problems. The second type of reader for whom this paper is intended is a computer scientist who is somewhat familiar with data structures for single-key searching, and who would like a tutorial on the problem of range searching. For this reader, the methods that we discuss are described on an intuitive level, and references are given to more precise descriptions elsewhere in the literature.

In Section 1 of this paper we examine six data structures for the range searching problem in some detail, and then briefly compare those structures at the end of the section. Additional work (that both has been done and needs to be done) is described in Section 2, and conclusions are then offered in Section 3.

1. THE DATA STRUCTURES

In this section we investigate a number of search methods for range searching. Each search method is specified by a data structure for storing the data and algorithms for building (which we call preprocessing) and searching the structure. We will analyze a search structure (say A) by giving three cost functions of N (the number of points) and k (the number of dimensions):

$P_A(N, k)$, the cost of preprocessing N points in k-space into a data structure;
$S_A(N, k)$, the storage required by the data structure;
$Q_A(N, k)$, the search time or query cost.

These costs can be analyzed in terms of their average or their worst case; we usually speak of the worst-case cost, explicitly mentioning the average whenever we employ it.

In many applications one may desire various utility operations on data structures, such as insertion and deletion. In this section we ignore this issue, considering only static (unchanging) files; we then return to the question of dynamic structures in Section 2.

FIGURE 1. Illustration of projection.

1.1 Sequential Scan

The simplest approach to range searching is to store the N points in a sequential list. As each query arrives, all elements of the list are scanned and every record that satisfies the query is reported. If the queries do not have to be handled immediately, then they can be "batched" so that many queries can be processed with one sequential pass through the file. Since all k keys of the N records must be stored and each k-key record is examined as the structure is built or searched, it is easy to see that the sequential scan structure SS has the properties

\[ P_{SS}(N, k) = O(Nk), \quad S_{SS}(N, k) = O(Nk), \quad Q_{SS}(N, k) = O(Nk). \]

Sequential scanning has the advantage of being trivial to implement on any storage medium. It is competitive with the more sophisticated methods described in this paper when the file is small and the number of attributes is large, or when a large fraction of the records in the file satisfy the query (or queries, if they are batched).

1.2 Projection

The projection technique involves keeping, for each attribute, a sequence of the records in the file sorted by that attribute. One can view this geometrically as a projection of the points on each coordinate. The k lists representing the projections can be obtained by using a standard sorting algorithm k times. After preprocessing, a range query can be answered by the following search procedure: Choose one of the attributes, say the ith.
Look up the two positions in the ith sequence (using a binary search) of the extreme values defining the range on the ith attribute of the query. All records satisfying the query will be in the list between these two positions just found. This (smaller) list is then searched by brute force. The projection technique is referred to as inverted lists by Knuth [KNUT73]. This technique was applied by Friedman, Baskett, and Shustek [FRIE75] in their solution of the "nearest neighbor" problem and by Lee, Chin, and Chang [LEEC76] to a number of database problems.

The projection technique is illustrated in Figure 1. The points represent a set of sixteen records of two keys each, represented by the x- and y-coordinates. The dashed lines are the projection of the records onto the x-coordinate (that is, the records sorted into x-order). The vertical slab is the x-range of the query, the horizontal slab is the y-range, and the rectangle that is their intersection contains those points which satisfy the query. To answer this query, we need only investigate the six points that are inside the vertical slab marked by the 45-degree lines.

One can apply the projection technique with only one sorted list (projection). If the distribution of values of the various attributes is more or less uniform over similar ranges and the query ranges of each attribute are similar, then one list is sufficient. If this is not the case, however, then keeping several lists can often lead to substantial reductions in the query time. The multiple projections are exploited by performing two binary searches in each to find the lower and upper bounds of the respective range, and then searching that projection with the smallest number of records in the range.
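The search procedure with multiple projections can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `build_projections` and `projection_query` are hypothetical, points are assumed to be tuples of comparable numbers, and ranges are assumed to be closed intervals.

```python
from bisect import bisect_left, bisect_right

def build_projections(points):
    """Preprocess: for each attribute j, keep the record indices sorted by attribute j."""
    k = len(points[0])
    projections = []
    for j in range(k):
        order = sorted(range(len(points)), key=lambda idx: points[idx][j])
        keys = [points[idx][j] for idx in order]   # sorted key values, for binary search
        projections.append((keys, order))
    return projections

def projection_query(points, projections, lows, highs):
    """Report every point p with lows[j] <= p[j] <= highs[j] for all j."""
    best = None
    # Two binary searches per projection; remember the one with the fewest candidates.
    for j, (keys, order) in enumerate(projections):
        lo = bisect_left(keys, lows[j])
        hi = bisect_right(keys, highs[j])
        if best is None or hi - lo < best[1] - best[0]:
            best = (lo, hi, order)
    lo, hi, order = best
    # Brute-force inclusion test on the shortest candidate list only.
    return [points[idx] for idx in order[lo:hi]
            if all(l <= x <= h for x, l, h in zip(points[idx], lows, highs))]
```

For example, `projection_query(pts, build_projections(pts), (2, 2), (6, 5))` scans only the candidates of whichever attribute has the narrower slab for that query.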
The cost analysis of projection is straightforward. To preprocess a file of N records of k keys each, we must perform k sorts of N elements. To store such a file, we must store k lists of N elements each. These facts immediately yield

\[ P_p(N, k) = O(kN \log N), \quad S_p(N, k) = O(kN). \]

Friedman, Baskett, and Shustek [FRIE75] show that for searches that have almost cubical query regions and find a small number of records (and are therefore similar to nearest neighbor searches), the query time of projection is given by

\[ Q_p(N, k) = O(N^{1-1/k}) \quad \text{(average case)} \]

when the point set is drawn from a smooth underlying distribution. The projection technique is most effective when the queries almost always contain one range that excludes most of the file.

1.3 Cells

There are two ways they can search [for the murder weapon]: from the body outward in a spiral, or divide the room up into squares--that's the grid method.
From the CBS series Kojak, "Death Is Not a Passing Grade"

Cartographers as well as detectives use the grid (or cell) method. Street maps of metropolitan areas are often printed in the form of books. The first page of the book shows the entire area, and the remaining pages are detailed maps of (say) one-mile-square regions. To find (for example) all schools in a specified rectangle, one would look at the first page to find which squares overlap the rectangle and then check only on those pages of the book to find the schools. This approach can be mechanized immediately. A square of the map corresponds to a cell in k-space, and the points of the file within the cell are stored together in an implementation. The first page of the map book corresponds to a directory that allows one to take a hyperrectangle and look up the set of cells.

The cell technique is illustrated in Figure 2.
The sixteen points in that figure represent sixteen records containing two keys each. The points in each cell are stored together in an implementation. The query is given by the rectangle in the upper part of the figure, and to answer it, only those points in the four dashed cells need be investigated. The squares in that figure are the "directory" corresponding to the first page of the map book.

The directory can be implemented in two ways. If the points are (say) uniformly distributed on $[0, 10]^2$ and we have chosen $1 \times 1$ cells, then we can use a two-dimensional array as the directory, named DIRECT(0..9, 0..9). In DIRECT(i, j) we would keep a pointer to a list of all points in the cell $[i, i + 1] \times [j, j + 1]$. If we wanted to find all points in $[5.2, 6.3] \times [1.2, 3.4]$, then we would only have to examine cells (5, 1), (5, 2), (5, 3), (6, 1), (6, 2), and (6, 3); we call this "translating" from a range query to a set of cell id's. The multidimensional array works very well when the points are known a priori to be uniformly distributed over some given rectangle in the key space. When this is not known to be the case, one would probably use a search method, such as hashing, for the directory. In this method we name each cell as before; so cell (i, j) is a pointer to the points in $[i, i + 1] \times [j, j + 1]$. Instead of storing all cells, however, we store only those cells that actually contain records of the file. To process a query, we translate the rectangle into a set of cell id's (as we did above), look up those id's, and check all the points in the occupied cells for inclusion in the rectangle. The storage required for the cell technique is the storage for the directory plus locations for the linked list representing points in cells; the size of the directory is usually much smaller than N.

FIGURE 2. Illustration of cells.

Knuth [KNUT73] has discussed this scheme for the two-dimensional case. Levinthal [LEVI66] used a cell technique in three-dimensional Euclidean space for determining all atoms within 5 angstroms of every atom in a protein molecule; he referred to this as "cubing." The idea of using hashing for the cell directory was first described by Yuval [YUVA75], and was later used by Rabin [RABI76] to solve the "closest pair" problem. Bentley, Stanat, and Williams [BENT77] discuss a number of different implementations for the directory (two of which we have seen).

The basic parameters of the cell technique are the size and shape of each cell. In analyzing a search there are two costs to count: cell accesses (the number of directory look-ups) and inclusion tests (testing whether a point satisfies the range query). If the cell size is extremely large, there will be few cell accesses and many inclusion tests. If the cell size is very small, on the other hand, there will be very many cell accesses and very few inclusion tests. Clearly, either extreme is to be avoided.

The best cell size and shape depend on the size and shape of the query hyperrectangle. Bentley, Stanat, and Williams [BENT77] show that if the query hyperrectangles have constant size and shape so that only their location (in the coordinate space) is unspecified, then for a single grid a nearly optimum size and shape for the cells are the same as those of the query hyperrectangle.
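The hashed-directory variant of the cell technique can be sketched as follows. This is an illustrative sketch under stated assumptions, not code from the survey: cells are assumed to be cubes of side `cell_size` aligned to the origin, a Python dictionary stands in for the hash table, and the function names (`build_cells`, `translate`, `cell_query`) are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import floor

def build_cells(points, cell_size):
    """Hashed directory: maps a cell id to its points; only occupied cells are stored."""
    directory = defaultdict(list)
    for p in points:
        cell_id = tuple(floor(x / cell_size) for x in p)
        directory[cell_id].append(p)
    return dict(directory)

def translate(cell_size, lows, highs):
    """Translate a range query into the ids of all cells that overlap it."""
    per_axis = [range(floor(l / cell_size), floor(h / cell_size) + 1)
                for l, h in zip(lows, highs)]
    return list(product(*per_axis))

def cell_query(directory, cell_size, lows, highs):
    """Look up each overlapping cell, then run inclusion tests on its points."""
    found = []
    for cell_id in translate(cell_size, lows, highs):   # one cell access per id
        for p in directory.get(cell_id, ()):            # empty cells cost one look-up
            if all(l <= x <= h for x, l, h in zip(p, lows, highs)):
                found.append(p)
    return found
```

With unit cells, `translate(1, (5.2, 1.2), (6.3, 3.4))` yields the six cell id's listed in the text, (5, 1) through (6, 3). The number of id's grows as cells shrink, which is the cell-access versus inclusion-test trade-off described above.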
For this case the number of cells accessed is $2^k$, and the expected search time is proportional to $2^k$ times the number of points in the range. In this context the performance of cells is given by

\[ P_c(N, k) = O(Nk), \quad S_c(N, k) = O(Nk), \quad Q_c(N, k) = O(2^k F) \quad \text{(average)}, \]

where F is the number of records found. In most applications, however, the queries will vary in size and shape as well as in location, so there is little information available for making a good choice of cell size and shape.

1.4 k-d Trees

In this section we examine a data structure called the "k-dimensional binary search tree," which is usually abbreviated as "k-d tree." This structure is a natural generalization of the standard one-dimensional binary search tree, so we will briefly review a special type of that structure (a complete description of binary trees can be found in KNUT73). To build a file of single-key records into a binary search tree, we choose the median of the set as the discriminator value and build all records with key values less than or equal to the discriminator into the left subtree of the root (recursively) and all elements with greater key values into the right subtree. This process continues recursively until there are only a few (say six or fewer) nodes in the set, at which point we store them as a linked list. Note that no records are stored in the internal nodes of such a binary search tree; they are contained only in the leaf nodes or "buckets" at the bottom of the tree. We can answer a range search in this structure by a recursive algorithm that compares the range to the discriminator of the node it is currently visiting. If the range is entirely to one side or the other of the discriminator, only the appropriate son is searched; otherwise both sons are searched recursively.
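The one-dimensional bucketed tree just reviewed can be sketched as below. The tuple-based node layout, the function names, and the handling of many duplicate keys (stop splitting and keep one larger bucket) are assumptions made for this illustration; the bucket limit of six follows the "say six or fewer" remark.

```python
from bisect import bisect_right

def build_bucket_tree(keys, bucket_size=6):
    """Build a binary search tree whose records live only in leaf buckets."""
    return _build(sorted(keys), bucket_size)

def _build(sorted_keys, bucket_size):
    if len(sorted_keys) <= bucket_size:
        return ("bucket", list(sorted_keys))
    disc = sorted_keys[(len(sorted_keys) - 1) // 2]   # median as discriminator
    split = bisect_right(sorted_keys, disc)           # keys <= disc go left
    if split == len(sorted_keys):                     # nothing is greater: cannot split
        return ("bucket", list(sorted_keys))
    return ("node", disc,
            _build(sorted_keys[:split], bucket_size),
            _build(sorted_keys[split:], bucket_size))

def range_search(node, low, high):
    """Report all keys x with low <= x <= high."""
    if node[0] == "bucket":
        return [x for x in node[1] if low <= x <= high]
    _, disc, left, right = node
    if high <= disc:            # range lies entirely on the left side
        return range_search(left, low, high)
    if low > disc:              # range lies entirely on the right side
        return range_search(right, low, high)
    return range_search(left, low, high) + range_search(right, low, high)
```

Only the two "boundary" paths of the tree can visit both sons, which is why the search touches few buckets beyond those that actually contain answers.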
The single-key binary search tree performs three functions at once: It stores the records of the file (in the external nodes, or "buckets"), it divides the data space into segments (by choosing the discriminators), and it gives a directory among the segments (the tree structure). We now investigate a multidimensional generalization of the binary search tree that performs these same three functions: storing the records, dividing space into hyperrectangles, and providing a directory among the hyperrectangles. It accomplishes this by using the same idea as the one-dimensional algorithm with one critical exception: In the one-dimensional tree we only have one key to use as the discriminator; in a multidimensional tree we have to choose at each internal node one of k keys to use as a discriminator.

The algorithm for constructing a k-d tree is to choose for the discriminator that coordinate j for which the spread of attribute values (as measured by any convenient statistic, such as variance or distance from minimum to maximum) is maximum for the subcollection represented by the node. The partitioning value is chosen to be the median value of this attribute. This algorithm is then applied recursively to the two subcollections represented by the two sons of the node just partitioned. The partitioning is stopped, creating a terminal node (or bucket), when the cardinality of the subcollection is less than a prespecified maximum, which is a parameter of the procedure. (Friedman, Bentley, and Finkel [FRIE77] found empirically that values ranging from 8 to 16 work well in a Fortran implementation.) The result of this procedure is that the coordinate space is divided into a number of buckets, each containing approximately the same number of points (by the stopping criterion) and each approximately "cubical" in shape (by choosing as discriminator the dimension of maximum spread, which slowly chops long and skinny rectangles into cubes).
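A minimal sketch of this construction is given below, using distance from minimum to maximum as the spread statistic. The dictionary node layout, the names `build_adaptive_kd` and `leaves`, and the choice to stop when a subcollection has at most `bucket_size` points are assumptions of the sketch; sorting at every level is also a simplification, not a tuned median-finding step.

```python
def build_adaptive_kd(points, bucket_size=8):
    """Split on the coordinate of maximum spread at its median value."""
    if len(points) <= bucket_size:
        return {"bucket": list(points)}
    k = len(points[0])
    # Spread of coordinate d measured as max - min over this subcollection.
    j = max(range(k),
            key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    ordered = sorted(points, key=lambda p: p[j])
    value = ordered[(len(ordered) - 1) // 2][j]        # median of attribute j
    left = [p for p in ordered if p[j] <= value]
    right = [p for p in ordered if p[j] > value]
    if not right:                                      # all values tie: stop splitting
        return {"bucket": ordered}
    return {"disc": j, "value": value,
            "left": build_adaptive_kd(left, bucket_size),
            "right": build_adaptive_kd(right, bucket_size)}

def leaves(tree):
    """Collect the buckets of the tree, left to right."""
    if "bucket" in tree:
        return [tree["bucket"]]
    return leaves(tree["left"]) + leaves(tree["right"])
```

On a point set that is long in x and narrow in y, the root and its sons all discriminate on x until the cells become roughly square, which is the "chopping" behaviour the text describes.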
Range searching with k-d trees is straightforward. Starting at the root, the k-d tree is recursively searched in the following manner. When visiting a node that discriminates by the jth key (which we call a j-discriminator), one compares the jth range of the query with the discriminator value. If the query range is totally above (or below) that value, then one need only search the right subtree (respectively, left) of that node; the other son can be pruned from the search because any node it contains does not satisfy the query in that particular key. If the query range overlaps the node's key (that is, the key is between the low and high bounds of the range), then both sons need be searched. This can be accomplished by searching both sons recursively (the search being implemented by a stack).

The application of k-d trees to (two-dimensional) range searching is illustrated in Figure 3. The k-d tree is depicted in two ways: Figure 3a shows the structure in 2-space, and Figure 3b shows the abstract tree. The root of the tree is internal node A; it is an x-discriminator. The vertical line in the right part of the figure labeled A is the discriminating line. That is, every point to the left of that vertical line is in the left subtree of A (with B as root), and every point to the right is in the subtree with root C. This partitioning continues recursively, and the resulting cells (buckets) in this tree each contain two points. The query rectangle is illustrated in Figure 3a, and the search for all points within the rectangle is illustrated in both figures. The search starts at the root, and since the query rectangle is entirely to the right of the vertical line defined by A, the left subtree of A (with B as root) can be pruned from the search. This is illustrated in Figure 3b by the perpendicular line through the son link from A to B. The search continues, searching both sons of C, both sons of F, and only the left son of G. A total of three buckets are searched; these buckets are dashed in the planar representation and are marked by an S in the tree representation.

FIGURE 3. Illustration of k-d trees: (a) planar representation; (b) tree representation.

In the k-d tree as introduced by Bentley [BENT75a], the discriminators are chosen cyclically (that is, the root is discriminated by the first key, its sons by the second, and so on). The idea of "adaptive partitioning" was proposed by Friedman, Bentley, and Finkel [FRIE77] and makes the k-d tree a structure very "sensitive" to the particular file that it represents. The application of k-d trees to a host of problems can be found in BENT79b, GOTL78, and SILV78b.

Analysis of k-d trees for range searching has been considered by several researchers. The work required to construct a k-d tree and its storage requirements (see BENT79b) are

\[ P_k(N, k) = O(N \log N), \quad S_k(N, k) = O(Nk). \]

The search cost depends on the nature of the query. Lee and Wong [LEEW80] have shown that in the worst case,

\[ Q_k(N, k) \le O(N^{1-1/k} + F) \]

where F is the number of points found in the region.
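The stack-based search can be sketched as follows, here on a tree with cyclically chosen discriminators and buckets of two points as in Figure 3. The tuple node layout and the names `build_cyclic_kd` and `kd_range_search` are hypothetical, and query ranges are assumed to be closed intervals.

```python
def build_cyclic_kd(points, depth=0, bucket_size=2):
    """k-d tree whose discriminators cycle through the keys level by level."""
    if len(points) <= bucket_size:
        return ("bucket", list(points))
    j = depth % len(points[0])                        # cyclic choice of key
    ordered = sorted(points, key=lambda p: p[j])
    value = ordered[(len(ordered) - 1) // 2][j]       # median discriminator value
    left = [p for p in ordered if p[j] <= value]
    right = [p for p in ordered if p[j] > value]
    if not right:                                     # all keys tie: keep one bucket
        return ("bucket", ordered)
    return ("node", j, value,
            build_cyclic_kd(left, depth + 1, bucket_size),
            build_cyclic_kd(right, depth + 1, bucket_size))

def kd_range_search(root, lows, highs):
    """Report all points inside the box [lows, highs], pruning with each discriminator."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node[0] == "bucket":
            found.extend(p for p in node[1]
                         if all(l <= x <= h for x, l, h in zip(p, lows, highs)))
            continue
        _, j, value, left, right = node
        if lows[j] <= value:      # left son holds keys <= value
            stack.append(left)
        if highs[j] > value:      # right son holds keys > value
            stack.append(right)
    return found
```

When the jth query range lies entirely on one side of the discriminator, only one of the two tests succeeds and the other son is pruned, exactly as in the walk-through of Figure 3.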
If the query range is almost cubical and the number of records that satisfies the query is small (so that the range query is similar to a nearest neighbor search), then Friedman, Bentley, and Finkel's [FRIE77] analysis shows that

\[ Q_k(N, k) = O(\log N + F) \quad \text{(average case for small answer)}. \]

For the case where a large fraction of the file satisfies the query, Bentley and Stanat [BENT75b] and Silva-Filho [SILV78a] show that

\[ Q_k(N, k) = O(F) \quad \text{(average case for large answer)}. \]

The k-d tree structure is most effective in situations where little is known about the nature of the queries or a wide variety of queries are expected. It is also useful if other types of queries (in addition to range queries) are anticipated; many other queries supported by k-d trees are discussed by Bentley [BENT79b].

1.5 Range Trees

A number of very similar structures for range searching (of primarily theoretical rather than practical interest) have recently been described by Lueker [LUEK78], Lee and Wong [LEEW80], and Willard [WILL78a]. In this section we investigate the range tree, a structure introduced by Bentley [BENT79a] that is also similar to the former structures. It achieves the best worst-case search time of all the structures we have seen so far in this paper, but has relatively high preprocessing and storage costs. For most applications the high storage will be prohibitive, but the range tree is very interesting from a theoretical viewpoint. Since the range tree is defined recursively in dimension (that is, the k-dimensional structure is defined in terms of the (k - 1)-dimensional structure), we begin our discussion by looking at a one-dimensional structure and then generalize that structure to higher dimensions.

The simplest structure for one-dimensional range searching is a sorted array. The preprocessing sorts the N elements in ascending order by key.
To answer a range query, we do two binary searches to find the positions of the low and high end of the range in the array. After these two positions have been found, we can list all the points in that part of the array as the answer to the range query. (Note that this is precisely the projection method applied to the one-dimensional problem.) For this structure we use linear storage and $O(N \log N)$ preprocessing time. The two binary searches each cost $O(\log N)$, and the cost of listing the points found in the region will, of course, be proportional to the number of such points. Letting F be the number of points found in the region, we have

\[ P_r(N, 1) = O(N \log N), \quad S_r(N, 1) = O(N), \quad Q_r(N, 1) = O(\log N + F). \]

We will now build a two-dimensional range tree, using as a tool the one-dimensional sorted arrays (SA's) we described above. The range tree is similar to the "binary search trees" described by Knuth [KNUT73, Sect. 6.2], so we will use his terminology in our discussions. The range tree is a rooted binary tree in which every node has a left son, a right son, a discriminating value (all nodes in the left subtree have a discriminating value less than the node's), and (unlike a regular binary search tree) every node contains an SA. The root of the range tree contains an SA (sorted by y-coordinate) and has as a discriminating value the median x-value for all points. The left subtree of the root has an SA containing the N/2 points with x-value less than the median, sorted by y-coordinate. Similarly, the right son of the root represents the N/2 points with x-value greater than the median and has an SA of those points sorted by y-coordinate. This partitioning continues so that i levels away from the root we have $2^i$ subtrees, each representing $N/2^i$ points contiguous in the x-dimension and each containing an SA of the points sorted by y-coordinate.
This partitioning continues for a total of (approximately) log N levels; we handle small point sets (say, less than a dozen points) by brute force.

The search algorithm for a range tree is most easily described recursively. Each node in the tree represents a range in the x-dimension from the least x-value contained in the subtree to the greatest. When visiting a node, we compare the x-range of the query to the range of the node, and if the node's range is entirely within the query's, then we search that structure's SA for all points in the query's y-range and return. If the query's range does not wholly contain the node's, then we compare the query's x-range to the node's discriminator value. If the range is entirely below the discriminator, we recursively visit the left subtree; if it is above, we visit the right; and if the range overlaps the discriminator, then we visit both subtrees.

The analysis of the planar tree is rather complicated. Since there are log N levels in the tree and N points are stored on each level, the total storage required is $O(N \log N)$. The preprocessing can be performed in $O(N \log N)$ time if clever techniques are employed. Analysis shows that at most two SA searches are done on each level of the tree (each of cost approximately log N), so the total cost for a search is $O(\log^2 N)$ plus the time for listing the points in the region. Letting F stand, as before, for the total number of points found in the region we have

\[ P_r(N, 2) = O(N \log N), \quad S_r(N, 2) = O(N \log N), \quad Q_r(N, 2) = O(\log^2 N + F). \]

If we step back for a moment, we can see how we built the structure: We constructed a two-dimensional structure by building a tree of one-dimensional structures. We can perform essentially the same operation to yield a three-dimensional structure: We construct a tree containing two-dimensional structures in the nodes.
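The planar range tree can be sketched as below. Several choices here are assumptions of the sketch rather than details from the text: the class and function names are hypothetical, each node prunes by comparing the query with its stored x-range instead of a separate discriminator value, small sets are cut off at `leaf_size` points, and every node's SA is built by re-sorting, which gives $O(N \log^2 N)$ preprocessing rather than the $O(N \log N)$ attainable with cleverer techniques.

```python
from bisect import bisect_left, bisect_right

class RangeTreeNode:
    def __init__(self, pts_by_x, leaf_size=4):
        # x-range covered by this subtree (points arrive sorted by x).
        self.xmin, self.xmax = pts_by_x[0][0], pts_by_x[-1][0]
        self.sa = sorted(pts_by_x, key=lambda p: p[1])   # sorted array by y
        self.ys = [p[1] for p in self.sa]                # y keys for binary search
        self.left = self.right = None
        if len(pts_by_x) > leaf_size:
            mid = len(pts_by_x) // 2                     # split at the median x
            self.left = RangeTreeNode(pts_by_x[:mid], leaf_size)
            self.right = RangeTreeNode(pts_by_x[mid:], leaf_size)

def build_range_tree(points, leaf_size=4):
    """Assumes a non-empty list of (x, y) pairs."""
    return RangeTreeNode(sorted(points), leaf_size)

def range_tree_query(node, x1, x2, y1, y2, out):
    """Append to out every point with x1 <= x <= x2 and y1 <= y <= y2."""
    if node is None or x2 < node.xmin or x1 > node.xmax:
        return                                           # no overlap in x: prune
    lo, hi = bisect_left(node.ys, y1), bisect_right(node.ys, y2)
    if x1 <= node.xmin and node.xmax <= x2:
        out.extend(node.sa[lo:hi])                       # node wholly inside: SA search only
    elif node.left is None:
        out.extend(p for p in node.sa[lo:hi] if x1 <= p[0] <= x2)   # small set: brute force
    else:
        range_tree_query(node.left, x1, x2, y1, y2, out)
        range_tree_query(node.right, x1, x2, y1, y2, out)
```

Each point appears in one SA per level, which is where the $O(N \log N)$ storage in the analysis comes from.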
This process can be continued to yield a structure for k dimensions, which will be a tree containing (k - 1)-dimensional structures. This will yield a structure with performances

\[ P_r(N, k) = O(N \log^{k-1} N), \quad S_r(N, k) = O(N \log^{k-1} N), \quad Q_r(N, k) = O(\log^k N + F). \]

The range tree structure is very interesting from a theoretical viewpoint. The asymptotic search time is very fast, but the amount of storage used is usually prohibitive in practice. Although the application of this structure to practical problems will probably be limited to cases when k = 2 or 3, it does provide an important theoretical benchmark. It also gives us an interesting technique (recursion in dimension) that might yield fruit in practice. (Indeed, there are some very interesting relationships between range trees and the k-d trees of Section 1.4.)

1.6 k-ranges

The k-range is an efficient worst-case structure for range searching introduced by Bentley and Maurer [BENT80b]. They developed two types of k-ranges, overlapping and nonoverlapping. Both of these structures involve storing sets of lists of points sorted by different coordinates; additional dimensions are added recursively, much like the range trees of the last section. Because k-ranges are rather complicated to describe and are of primarily theoretical interest, we will not describe them here but only mention their performance. The overlapping k-ranges can be made to have performance

\[ P_o(N, k) = O(N^{1+\epsilon}), \quad S_o(N, k) = O(N^{1+\epsilon}), \quad Q_o(N, k) = O(\log N + F) \]

for any $\epsilon > 0$. It is pleasing to note that the constants "hidden" in the O's of the above equations are just $k/\epsilon$. Overlapping k-ranges have very efficient retrieval time but somewhat high preprocessing and storage costs; their dual structures, nonoverlapping k-ranges, have very efficient preprocessing and storage costs but increased query times. Their performance is

\[ P_n(N, k) = O(N \log N), \quad S_n(N, k) = O(N), \quad Q_n(N, k) = O(N^{\epsilon} + F), \]

for any fixed $\epsilon > 0$.
The details of these structures can be found in BENT80b. Although these structures were developed primarily as a theoretical device, they might prove efficient in some implementations. (Their primary drawback is that their space requirements are high, and space is usually a critical resource.)

1.7 Other Structures

In the previous sections we have investigated six structures for the range searching problem that (in the authors' opinion) dominate other structures proposed for this problem. In this section we briefly investigate some of these other structures.

Knuth [KNUT73] points out that the notion of cells can be applied recursively. That is, when one of the cubes has more than some certain number of points, the cube is further divided into subcubes of yet smaller size. This scheme implies a multidimensional tree with multiway branching. In terms of both the partitioning imposed on the space and the ease of implementation, this idea seems to be dominated by a data structure called the quad tree.

The quad tree was first described by Finkel and Bentley [FINK74]. It is a generalization of the standard binary search tree, in which every node has $2^k$ sons. Bentley and Stanat [BENT75b] analyzed the performance of quad trees for "square" range searches in uniform planar point sets, and Linn [LINN73] discussed the fact that quad trees (which he called "search-sort k trees") have advantages over binary trees when used in a synchronized multiprocessor system. This application aside, however, the quad tree seems to be dominated by the k-d trees of Section 1.4.

A great deal of work has been done recently on multikey searching problems that are similar in flavor to the range searching problem. Dobkin and Lipton [DOBK76] and Bentley [BENT80a] have investigated a number of searching problems defined on sets of points in k-dimensional space.
Rivest [RIVE76] provides a number of interesting data structures for answering "partial-match" queries, which are essentially range queries in a file in which the keys assume discrete values. For discussions of efficient search methods in the context of database systems, the reader is referred to such papers as LIOU77, SHNE77, YANG77, and YANG78.

1.8 Comparison of Methods

In Sections 1.1 through 1.6 we have discussed six structures for range searching. The performances of these six structures (seven including the two variants of k-ranges) are summarized in Table 1, which shows the preprocessing, storage, and query costs of each structure. All the functions in that table reflect worst-case costs, except those query costs that are footnoted. For those functions the probabilistic assumptions are described in the notes.

TABLE 1. Performance of Data Structures for Range Searching

Structure                 P(N, k)                  S(N, k)                  Q(N, k)
Sequential scan           $O(N)$                   $O(N)$                   $O(N)$
Projection                $O(N \log N)$            $O(N)$                   $O(N^{1-1/k} + F)$ a(1)
Cells                     $O(N)$                   $O(N)$                   $O(F)$ a(2)
k-d trees                 $O(N \log N)$            $O(N)$                   $O(N^{1-1/k} + F)$; $O(\log N + F)$ a(3)
Nonoverlapping k-ranges   $O(N \log N)$            $O(N)$                   $O(N^{\epsilon} + F)$
Range trees               $O(N \log^{k-1} N)$      $O(N \log^{k-1} N)$      $O(\log^k N + F)$
Overlapping k-ranges      $O(N^{1+\epsilon})$      $O(N^{1+\epsilon})$      $O(\log N + F)$

a Query times that indicate average-case analysis. Probabilistic assumptions are
(1) Smooth data sets--very small query region.
(2) Any data set--cell size equals query size.
(3) Smooth data set.

Four of these six structures (sequential scan, projection, cells, and k-d trees) have been presented as providing practical solutions to the range searching problem. For each structure there are situations in which it is clearly superior and other situations where it performs badly. In this section we will mention some of these situations and compare the performance of the four methods.

If the file is small and the number of attributes large, if the file is to be searched only a few times, or if the queries can be batched so that nearly all the records in the file satisfy at least one, then sequential scan is the method of choice. In other cases one of the more sophisticated methods is likely to be more efficient. Projection does best when the query range on one of the attributes is usually sufficient to eliminate nearly all the file records. For this case the low overhead of searching this structure allows it to dominate the others. In situations where several or many of the attributes serve to restrict the range query, the projection technique performs relatively poorly.

Both the cell and k-d tree structures are appropriate in situations where the query restricts several of the attributes. If the size and shape of the queries are roughly constant and known in advance, then cells defined by a fixed grid with size and shape similar to those of the expected queries are most advantageous. For queries with sizes and shapes that differ considerably from the design, however, performance can be quite poor. The k-d tree structure is characterized by its robustness to wildly varying queries. Its cell design adapts to the distribution of the attribute values of the file records in the k-dimensional coordinate space. The cells all contain very nearly the same number of records; there are no empty cells.
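To illustrate how such an adaptive partition can arise, the following is a minimal sketch of a bucketed k-d tree in Python. It is our own illustration, not the implementation from the k-d tree papers cited in this survey: the names `build`, `range_search`, and `leaf_sizes`, the dictionary node layout, and the bucket size of 4 are all assumptions made for this example. Splitting at the median of one coordinate per level means that no leaf bucket is empty and that, in this sketch, bucket sizes stay between about half the bucket size and the bucket size.

```python
import random


def build(points, bucket_size=4, depth=0):
    """Build a bucketed k-d tree by median splits, cycling through the keys."""
    if len(points) <= bucket_size:
        return {"leaf": True, "points": list(points)}
    k = len(points[0])
    axis = depth % k                      # discriminating key for this level
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                   # median position
    return {
        "leaf": False,
        "axis": axis,
        "split": pts[mid][axis],          # left keys <= split <= right keys
        "left": build(pts[:mid], bucket_size, depth + 1),
        "right": build(pts[mid:], bucket_size, depth + 1),
    }


def range_search(node, lo, hi, out):
    """Append to `out` every point p with lo[i] <= p[i] <= hi[i] for all i."""
    if node["leaf"]:
        out.extend(
            p for p in node["points"]
            if all(l <= x <= h for l, x, h in zip(lo, p, hi))
        )
        return out
    axis, split = node["axis"], node["split"]
    if lo[axis] <= split:                 # query range reaches the left son
        range_search(node["left"], lo, hi, out)
    if hi[axis] >= split:                 # query range reaches the right son
        range_search(node["right"], lo, hi, out)
    return out


def leaf_sizes(node):
    """Return the number of records stored in each leaf bucket."""
    if node["leaf"]:
        return [len(node["points"])]
    return leaf_sizes(node["left"]) + leaf_sizes(node["right"])


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(rng.random(), rng.random()) for _ in range(1000)]
    tree = build(pts)
    sizes = leaf_sizes(tree)
    print("buckets:", len(sizes), "smallest:", min(sizes), "largest:", max(sizes))
    print("matches:", len(range_search(tree, (0.2, 0.3), (0.5, 0.6), [])))
```

The sketch re-sorts the points at every level for brevity, which costs more preprocessing than the $O(N \log N)$ bound quoted in Table 1; a linear-time median selection would be needed to match that bound.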
In dense regions there are many cells and a correspondingly fine division of the coordinate space; in sparse regions there is a coarser division with fewer cells. For most applications of range searching that are not characterized in the preceding paragraphs, k-d trees are likely to be the method of choice.

2. ADDITIONAL WORK

Our discussion of the data structures in Section 1 is on a very abstract conceptual level, and we have ignored many problems that arise in actual applications of range searching. In this section we briefly examine some of those problems and the solutions that have been proposed to handle them.

All the files we have discussed so far have been static; that is, they do not change. Many applications, however, require dynamic structures, in which insertions and deletions can be made. The sequential scan structure is easy to maintain dynamically, and so is the projection structure, using the methods for maintaining one-dimensional sorted lists described by Knuth [KNUT73]. The cell technique can support insertions and deletions by merely keeping a linked list of the points in each cell and inserting the new record into, or deleting the old record from, the appropriate list. Dynamic k-d trees are a more subtle problem and have been discussed by Bentley [BENT79b] and Willard [WILL78b].

Considerable research remains to be done in the development of heuristics for aiding the search methods we have seen. For example, if the range queries in a seven-dimensional problem almost always involve only two of the attributes, then the design of the structure should involve only those two attributes. Heuristics for detecting these and other similar situations would be very helpful. Techniques described by Bentley and Burkhard [BENT76] might prove useful in such an investigation.

Our discussion of all of the data structures has been for the case in which they are implemented in primary memory.
Many applications (particularly databases) inherently involve secondary storage media such as disks and tapes. All the structures of Section 1 can be efficiently implemented on such media.¹

¹ For details of these implementations, the reader is referred to BENT78, which is an earlier version of this paper.

Several researchers have recently considered an interesting generalization of the range searching problem, which calls for adding a range restriction to an existing data structure. That is, we already have some structure for performing a particular type of query, and we want to have the capability of saying "perform that query on all records in which this key lies in that range." Bentley [BENT79a], Lueker [LUEK79], and Willard [WILL78a] have developed a number of transformations on data structures that allow one to add the range restriction capability. (These transformations actually led to the discovery of both the range tree and the k-range data structures of Section 1.) Although the storage requirements of the resulting structures seem to be too high to make them of immediate practical interest, this approach is a novel attack on the problem of constructing data structures for range searching.

An interesting theoretical problem that could prove to be of practical value is proving lower bounds on the complexity of the range searching problem. Saxe [SAXE79] has investigated this problem using the standard "decision tree" model of concrete complexity theory and has shown that k-ranges have optimal worst-case query times. These k-ranges have very high storage requirements, however; so it would be very desirable to have lower bounds that make stronger statements of the form, "if you only use this much storage and preprocessing, then this is the fastest search time you can have." Fredman [FRED79] has recently made progress in this direction. Another interesting open problem is to show lower bounds on the average complexity, rather than just the worst-case complexity.

3. CONCLUSIONS

In this paper we have investigated a number of data structures for the range searching problem. In 1973 Knuth [KNUT73, p. 554] was able to write that "no really nice data structures seem to exist" for the problem of range searching. In this paper we have tried to show that this situation has changed in the interim, and that these changes can have a substantial impact on both the theory and practice of multikey searching.

REFERENCES

BENT75a  BENTLEY, J. L. "Multidimensional binary search trees used for associative searching," Comm. ACM 18, 9 (Sept. 1975), 509-517.
BENT75b  BENTLEY, J. L., AND STANAT, D. F. "Analysis of range searches in quad trees," Inf. Process. Lett. 3, 6 (July 1975), 170-173.
BENT76   BENTLEY, J. L., AND BURKHARD, W. A. "Heuristics for partial match retrieval data base design," Inf. Process. Lett. 4, 5 (Feb. 1976), 132-135.
BENT77   BENTLEY, J. L., STANAT, D. F., AND WILLIAMS, E. H., JR. "The complexity of fixed-radius near neighbor searching," Inf. Process. Lett. 6, 6 (Dec. 1977), 209-212.
BENT78   BENTLEY, J. L., AND FRIEDMAN, J. H. A survey of algorithms and data structures for range searching, Carnegie-Mellon Computer Science Rep. CMU-CS-78-136 and Stanford Linear Accelerator Center Rep. SLAC-PUB-2189; preliminary version in Proc. Computer Science and Statistics: 11th Ann. Symp. on the Interface, March 1978, pp. 297-307.
BENT79a  BENTLEY, J. L. "Decomposable searching problems," Inf. Process. Lett. 8, 5 (June 1979), 133-136.
BENT79b  BENTLEY, J. L. "Multidimensional binary search trees in database applications," IEEE Trans. Softw. Eng. SE-5, 4 (July 1979), 333-340.
BENT80a  BENTLEY, J. L. "Multidimensional divide-and-conquer," to appear in Comm. ACM.
BENT80b  BENTLEY, J. L., AND MAURER, H. A. "Efficient worst-case data structures for range searching," to appear in Acta Inf.
DOBK76   DOBKIN, D., AND LIPTON, R. J. "Multidimensional searching problems," SIAM J. Comput. 5, 2 (1976), 181-186.
FINK74   FINKEL, R. A., AND BENTLEY, J. L. "Quad trees--a data structure for retrieval on composite keys," Acta Inf. 4, 1 (1974), 1-9.
FRED79   FREDMAN, M. "A near optimal data structure for a type of range query problem," in Proc. 11th ACM Symp. Theory of Computing, May 1979, pp. 62-66.
FRIE75   FRIEDMAN, J. H., BASKETT, F., AND SHUSTEK, L. J. "An algorithm for finding nearest neighbors," IEEE Trans. Comput. C-24, 10 (Oct. 1975), 1000-1006.
FRIE77   FRIEDMAN, J. H., BENTLEY, J. L., AND FINKEL, R. A. "An algorithm for finding best matches in logarithmic expected time," ACM Trans. Math. Softw. 3, 3 (Sept. 1977), 209-226.
GOTL78   GOTLIEB, C. C., AND GOTLIEB, L. R. Data types and structures, Prentice-Hall, Englewood Cliffs, N.J., 1978, pp. 357-363.
KNUT73   KNUTH, D. E. The art of computer programming, vol. 3: sorting and searching, Addison-Wesley, Reading, Mass., 1973.
LAUT78   LAUTHER, U. "4-dimensional binary search trees as a means to speed up associative searches in design rule verification of integrated circuits," J. Des. Autom. Fault-Tolerant Comput. 2, 3 (July 1978), 241-247.
LEEC76   LEE, R. C. T., CHIN, Y. H., AND CHANG, S. C. "Application of principal component analysis to multi-key searching," IEEE Trans. Softw. Eng. SE-2, 3 (Sept. 1976), 185-193.
LEEW78   LEE, D. T., AND WONG, C. K. "Worst-case analysis for region and partial region searches in multidimensional binary search trees and quad trees," Acta Inf. 9, 1 (1978), 23-29.
LEEW80   LEE, D. T., AND WONG, C. K. "Quintary trees: a file structure for multidimensional database systems," to appear in ACM Trans. Database Syst.
LEVI66   LEVINTHAL, C. "Molecular model-building by computer," Sci. Am. 214 (June 1966), 42-52.
LINN73   LINN, J. General methods for parallel searching, Tech. Rep. 61, Digital Systems Lab., Stanford U., Stanford, Calif., May 1973.
LIOU77   LIOU, J. H., AND YAO, S. B. "Multi-dimensional clustering for data base organization," Inf. Syst. 2 (1977), 187-198.
LOFT65   LOFTSGAARDEN, D. O., AND QUESENBERRY, C. P. "A nonparametric density function," Ann. Math. Stat. 36 (1965), 1049-1051.
LUEK78   LUEKER, G. "A data structure for orthogonal range queries," in Proc. 19th Symp. Foundations of Computer Science, IEEE, Oct. 1978, pp. 28-34.
LUEK79   LUEKER, G. "A transformation for adding range restriction capability to dynamic data structures for decomposable searching problems," Tech. Rep. 129, U. of California at Irvine, 1979.
RABI76   RABIN, M. O. "Probabilistic algorithms," in Algorithms and complexity: new directions and recent results, J. F. Traub (Ed.), Academic Press, New York, 1976, pp. 21-39.
RIVE76   RIVEST, R. L. "Partial match retrieval algorithms," SIAM J. Comput. 5, 1 (March 1976), 19-50.
SAXE79   SAXE, J. B. "On the number of range queries in k-space," to appear in Discrete Appl. Math.
SHNE77   SHNEIDERMAN, B. "Reduced combined indexes for efficient multiple attribute retrieval," Inf. Syst. 2 (1977), 149-154.
SILV78a  SILVA-FILHO, Y. V. Average case analysis of region search in balanced k-d trees, Rep., U. of Kent, Canterbury, England, Nov. 1978.
SILV78b  SILVA-FILHO, Y. V. Multidimensional search trees as indices of files, Rep., U. of Kent, Canterbury, England, Dec. 1978.
WILL78a  WILLARD, D. E. Predicate-oriented database search algorithms, Rep. TR-20-78, Harvard U. Aiken Lab., 1978.
WILL78b  WILLARD, D. E. "Balanced forests of k-d* trees as a dynamic data structure," informative abstract, Harvard U., Boston, Mass., 1978.
YANG77   YANG, C. "Avoiding redundant record accesses in unsorted multilist file organizations," Inf. Syst. 2 (1977), 155-158.
YANG78   YANG, C. "A class of hybrid list file organizations," Inf. Syst. 3 (1978), 49-58.
YUVA75   YUVAL, G. "Finding near neighbors in k-dimensional space," Inf. Process. Lett. 3, 4 (March 1975), 113-114.

RECEIVED JANUARY 1979; FINAL REVISION ACCEPTED AUGUST 1979.