Spatial Data Indexing and Queries

The document discusses spatial data indexing and queries, focusing on how to index multi-dimensional data such as geographic locations. It covers various indexing techniques, including R-trees and space-filling curves like Z-ordering and Hilbert curves, and explores spatial queries such as nearest neighbor and range queries. The document also highlights the importance of minimizing overlap and dead space in spatial indexing for efficient query performance.

Uploaded by

Alex Zhou

Spatial Data Indexing & Queries

(from G.Kollios)
Ahmed Mahmood
[email protected]
10/08/2024
Introduction
Suppose we have the following table:
Sid   Sname  GPA  Location
0010  David  3.9  (-117.4, 33.9)
0011  Kim    3.9  (-118.1, 35.2)
0014  Joe    3.9  (-122.2, 37.4)

We know how to build an index on one-dimensional data, e.g., Sid, Sname, GPA

But how do we index Location in the above table?


Question?
■ What is indexing and why do we need it?
■ What are the main index operations?
■ What types of indexes are you familiar with?
■ What is the time complexity of the main index operations?
Problem
■ Let’s start with spatial points. Assume each point has a latitude and a longitude.
■ Can we use a B-tree to index spatial data?
■ Index along which dimension?


Multi-dimensional Indexing
■ GIS applications (maps):
■ Urban planning, route optimization, fire or pollution monitoring,
utility networks, etc.
- ESRI (ArcInfo), Oracle Spatial, etc.

■ Other applications:
■ VLSI design, CAD/CAM, model of human brain, etc.
Spatial data types

[Figure: examples of a point, a line, and a region]
■ Point: 2 real numbers
■ Line: sequence of points
■ Region: area enclosed by n points
Spatial Relationships
■ Topological relationships:
■ adjacent, inside, disjoint, etc
■ Direction relationships:
■ Above, below, north_of, etc
■ Metric relationships:
■ “distance < 100”
■ And operations to express the relationships
Spatial Queries
■ Selection queries: “Find all objects inside query q”
■ “inside” can be replaced by other predicates: intersects, contains, north_of, etc.
■ q can be a point or a range
■ Nearest-neighbor queries (k-NN): “Find the closest object to a query point q”, or the k closest objects
■ Spatial join queries: given two spatial relations S1 and S2, find all pairs
{(x, y) : x in S1, y in S2, and x rel y = true}, where rel = intersect, inside, etc.
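As a baseline for the three query types, the sketch below answers each one by scanning every row, which is the work a spatial index is meant to avoid. It reuses the rows of the example table and assumes that Location is (longitude, latitude) and that plain Euclidean distance on those coordinates is good enough for illustration. The function names are hypothetical.

```python
import math

# Rows from the example table: (Sid, Sname, Location).
# Assumption: Location is (longitude, latitude).
students = [
    ("0010", "David", (-117.4, 33.9)),
    ("0011", "Kim", (-118.1, 35.2)),
    ("0014", "Joe", (-122.2, 37.4)),
]

def range_query(rows, x_lo, y_lo, x_hi, y_hi):
    """Selection query: rows whose point lies inside the query rectangle."""
    return [r for r in rows
            if x_lo <= r[2][0] <= x_hi and y_lo <= r[2][1] <= y_hi]

def knn_query(rows, q, k):
    """k-NN query: the k rows closest to point q (Euclidean distance assumed)."""
    return sorted(rows, key=lambda r: math.dist(r[2], q))[:k]

def distance_join(s1, s2, max_dist):
    """Spatial join with the metric predicate 'distance < max_dist'."""
    return [(a, b) for a in s1 for b in s2 if math.dist(a[2], b[2]) < max_dist]

in_box = [r[1] for r in range_query(students, -119.0, 33.0, -117.0, 36.0)]
nearest = [r[1] for r in knn_query(students, (-118.0, 34.0), k=1)]
print(in_box)   # ['David', 'Kim']
print(nearest)  # ['David']
```

Each function costs time linear in the table size (quadratic for the join), which motivates organizing the data so that most objects are never examined.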
Spatial Queries
■ Given a collection of geometric objects (points, lines,
polygons, ...)
■ organize them on disk, to answer efficiently
■ point selection queries

■ range selection queries

■ k-nn queries

■ spatial joins (‘all pairs’ queries)


Problem
■ Given a collection of geometric objects (points, lines,
polygons, ...)
■ Organize them on disk, to answer spatial queries (range,
nn, etc)
Space Filling Curves
■ Idea: Linearize spatial data, while maintaining the
spatial locality

■ Common methods:
- Z-ordering Curve
- Hilbert Curve
Z-ordering
■ Basic assumption: finite precision in the representation of each coordinate, K bits (2^K values)
■ The address space is a square (image), represented as a 2^K x 2^K array
■ Each element is called a pixel
Z-ordering
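■ Impose a linear ordering on the pixels of the image → 1-dimensional problem
■ Example on a 4 x 4 grid (K = 2; both axes labeled 00, 01, 10, 11):
   - A at (x, y) = (01, 11): Z_A = shuffle(x_A, y_A) = shuffle(“01”, “11”) = 0111 = (7)_10
   - B at (x, y) = (01, 01): Z_B = shuffle(“01”, “01”) = 0011 = (3)_10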
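The shuffle operation can be sketched as bit interleaving. The version below assumes the convention implied by the example for A and B: bits are taken from the most significant position down, with the x bit placed before the y bit. The function name `z_value` is hypothetical.

```python
def z_value(x: int, y: int, k: int) -> int:
    """Interleave the k-bit coordinates x and y (x bit first, MSB first)."""
    z = 0
    for i in range(k - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)  # next bit of x
        z = (z << 1) | ((y >> i) & 1)  # next bit of y
    return z

z_a = z_value(0b01, 0b11, k=2)
z_b = z_value(0b01, 0b01, k=2)
print(format(z_a, "04b"), z_a)  # 0111 7
print(format(z_b, "04b"), z_b)  # 0011 3
```

Other texts interleave with the y bit first. That yields a mirrored curve, so the convention has to be fixed once and used consistently for both indexing and querying.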
Example of Z-values
• The left part shows a map with spatial objects A, B, and C
• The right and bottom-left parts show the z-values within A, B, and C
• Note that C gets z-values 2 and 8, which are not close
• Exercise: Compute the z-values for B.

Fig 4.7
Z-ordering: Summary
■ Given a point (x, y) and the precision K find the pixel for
the point and then compute the z-value
■ Given a set of points, use a B+-tree to index the z-values
■ A range (rectangular) query in 2-d is mapped to a set of
ranges in 1-d

Queries
■ Find the z-values contained in the query and then the ranges
■ Example on the 4 x 4 grid (both axes labeled 00, 01, 10, 11):
   - Q_A → range [4, 7]
   - Q_B → ranges [2, 3] and [8, 9]
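A small sketch of how a rectangular query turns into 1-d ranges: enumerate the pixels inside the rectangle, compute their z-values, and merge consecutive values into ranges. The query rectangles used here are assumptions inferred from the stated ranges, since the original figure is not available: Q_A covers x in 00..01 and y in 10..11, and Q_B covers x in 01..10 and y in 00..01. Enumerating every pixel is only practical for tiny grids; it is used here to keep the idea visible.

```python
def z_value(x: int, y: int, k: int) -> int:
    """Interleave the k-bit coordinates x and y (x bit first, MSB first)."""
    z = 0
    for i in range(k - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

def z_ranges(x_lo, x_hi, y_lo, y_hi, k):
    """Map an inclusive pixel rectangle to a sorted list of z-value ranges."""
    zs = sorted(z_value(x, y, k)
                for x in range(x_lo, x_hi + 1)
                for y in range(y_lo, y_hi + 1))
    ranges = []
    for z in zs:
        if ranges and z == ranges[-1][1] + 1:
            ranges[-1][1] = z        # extend the current run
        else:
            ranges.append([z, z])    # start a new run
    return [tuple(r) for r in ranges]

q_a = z_ranges(0, 1, 2, 3, k=2)
q_b = z_ranges(1, 2, 0, 1, k=2)
print(q_a)  # [(4, 7)]
print(q_b)  # [(2, 3), (8, 9)]
```

Each resulting range becomes one range scan on the B+-tree over z-values, so a query that straddles quadrant boundaries, like Q_B, costs more scans than one aligned with them, like Q_A.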
Hilbert Curve
■ We want points that are close in 2-d to be close in 1-d
■ Note that in 2-d each point has 4 neighbors, whereas in 1-d it has only 2
■ The Z-curve has some “jumps” that we would like to avoid
■ The Hilbert curve avoids the jumps: recursive definition
Hilbert Curve- example
■ It has been shown that, in general, Hilbert is better than the other space-filling curves for retrieval [Jag90]
■ H_i: the order-i Hilbert curve for a 2^i x 2^i array
Hilbert vs Z-ordering
■ Hilbert tends to map nearby objects to nearby positions in the linear order.
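To make the "no jumps" property concrete, here is a sketch of the widely used iterative conversion from a cell (x, y) to its position along the Hilbert curve on an n x n grid, with n a power of two. The orientation of the curve (where it starts and which way it first turns) is an assumption and may differ from the figures in these slides. The name `hilbert_index` is hypothetical.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along the Hilbert curve on an n x n grid.

    n must be a power of two (n = 2**i for the order-i curve).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)   # which quadrant, in curve order
        if ry == 0:                    # rotate/flip so the sub-curve lines up
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

n = 4
cells = sorted(((x, y) for x in range(n) for y in range(n)),
               key=lambda c: hilbert_index(n, *c))
steps = [abs(ax - bx) + abs(ay - by)
         for (ax, ay), (bx, by) in zip(cells, cells[1:])]
print(cells[:4])   # [(0, 0), (1, 0), (1, 1), (0, 1)]
print(max(steps))  # 1
```

Every step along the curve moves to a grid neighbor. In the Z-order, by contrast, consecutive values can be far apart, for example z = 7 and z = 8 in the 4 x 4 example.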

Motivation for Spatial Indexes
■ No perfect approximation when mapping high-dimensional points to one dimension.
■ Spatial objects include complicated geometric shapes.
■ Space-filling curves depend on a predefined precision, so they cannot handle skew.
R-trees
■ [Guttman 84] Main idea: extend B+-tree to
multi-dimensional spaces!

■ (only deal with Minimum Bounding Rectangles - MBRs)

R-trees
■ A multi-way external memory tree
■ Index nodes and data (leaf) nodes
■ All leaf nodes appear on the same level
■ Every node except the root contains between M/2 and M entries
■ The root node has at least 2 entries (children), unless it is a leaf
Example
■ E.g., with tree fanout 4: group nearby rectangles into parent MBRs; each group → disk page
[Figure: rectangles A to J in the plane]
Example
■ F=4
[Figure: rectangles A to J grouped into parent MBRs P1 to P4]
■ Root: P1 P2 P3 P4
■ Leaves: P1 = {A, B, C}, P2 = {D, E}, P3 = {F, G}, P4 = {H, I, J}
R-trees - format of nodes
■ {(MBR; obj_ptr)} for leaf nodes
[Figure: leaf node with entries A, B, C under the root P1 P2 P3 P4; each entry holds an MBR (x-low, x-high; y-low, y-high) and an obj ptr]
R-trees - format of nodes
■ {(MBR; node_ptr)} for non-leaf nodes
[Figure: root node P1 P2 P3 P4 above the leaf A B C; each entry holds an MBR (x-low, x-high; y-low, y-high) and a node ptr]
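The two node formats can be sketched as plain Python data classes. This is only an in-memory illustration under assumed names (`MBR`, `Entry`, `Node`) and made-up coordinates for A, B, and C. A real R-tree stores each node in a disk page and uses page identifiers rather than object references.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class MBR:
    x_low: float
    x_high: float
    y_low: float
    y_high: float

    def union(self, other: "MBR") -> "MBR":
        """Smallest rectangle covering both self and other."""
        return MBR(min(self.x_low, other.x_low), max(self.x_high, other.x_high),
                   min(self.y_low, other.y_low), max(self.y_high, other.y_high))

@dataclass
class Entry:
    mbr: MBR
    obj_ptr: Optional[str] = None      # set in leaf entries
    node_ptr: Optional["Node"] = None  # set in non-leaf entries

@dataclass
class Node:
    entries: List[Entry] = field(default_factory=list)

    def mbr(self) -> MBR:
        """MBR of the whole node: the union of its entries' MBRs."""
        box = self.entries[0].mbr
        for e in self.entries[1:]:
            box = box.union(e.mbr)
        return box

# Hypothetical coordinates for the leaf P1 = {A, B, C}.
p1 = Node([Entry(MBR(0, 2, 6, 8), obj_ptr="A"),
           Entry(MBR(1, 3, 4, 5), obj_ptr="B"),
           Entry(MBR(3, 4, 7, 9), obj_ptr="C")])
root = Node([Entry(p1.mbr(), node_ptr=p1)])
print(root.entries[0].mbr)  # MBR(x_low=0, x_high=4, y_low=4, y_high=9)
```

The parent entry for P1 stores exactly the union of the MBRs beneath it, which is what lets a search discard the whole subtree with a single rectangle test.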

R-trees – Example with Point Data
[Figure: points a to i plotted on x and y axes ranging from 0 to 10 and enclosed by MBRs E1 to E9, shown next to the corresponding R-tree (root, intermediate entries, and leaf nodes; some contents omitted)]
R-trees: Search
[Figure: a query rectangle over the example tree (root P1 P2 P3 P4; leaves {A, B, C}, {D, E}, {F, G}, {H, I, J}); only the subtrees whose MBR intersects the query are visited]
R-trees: Main points
■ Every parent node completely covers its ‘children’
■ A child MBR may be covered by more than one parent - it is stored
under ONLY ONE of them. (ie., no need for duplicate elimination)
■ A query may follow multiple branches
■ Everything works for any(?) dimensionality
■ well, up to a few dimensions (maybe 6-7)
■ curse of dimensionality (space becomes very sparse)
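The point that a query may follow multiple branches can be seen in a minimal range-search sketch. Rectangles are tuples (x_low, y_low, x_high, y_high). The coordinates for A to J and P1 to P4 are made up, chosen only so that the grouping matches the running example, and the `Node` class is a hypothetical in-memory stand-in for a disk page.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_low, y_low, x_high, y_high)

def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class Node:
    def __init__(self, is_leaf: bool, entries: list):
        self.is_leaf = is_leaf
        self.entries = entries  # list of (MBR, child node or object id)

def search(node: Node, query: Rect, out: List[str]) -> List[str]:
    """Collect ids of all objects whose MBR intersects the query."""
    for rect, child in node.entries:
        if intersects(rect, query):
            if node.is_leaf:
                out.append(child)
            else:
                search(child, query, out)  # descend into every matching branch
    return out

# Hypothetical coordinates for the running example.
p1 = Node(True, [((0, 6, 2, 8), "A"), ((1, 4, 3, 5), "B"), ((3, 7, 4, 9), "C")])
p2 = Node(True, [((2, 0, 4, 1), "D"), ((3, 2, 5, 3), "E")])
p3 = Node(True, [((6, 7, 7, 9), "F"), ((8, 8, 9, 9), "G")])
p4 = Node(True, [((6, 3, 7, 5), "H"), ((8, 5, 10, 6), "I"), ((8, 1, 9, 2), "J")])
root = Node(False, [((0, 4, 4, 9), p1), ((2, 0, 5, 3), p2),
                    ((6, 7, 9, 9), p3), ((6, 1, 10, 6), p4)])

hits = search(root, (2.5, 2.5, 6.5, 4.5), [])
print(hits)  # ['B', 'E', 'H']
```

In this example the query touches P1, P2, and P4, so three leaves are read and only P3 is pruned. The more the parent MBRs overlap or contain dead space, the more often this happens.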

R-trees: Insertion
■ How to find the next node in which to insert the new object?
■ Using ChooseLeaf: at each level, follow the entry whose MBR needs the least enlargement to include the new object. Resolve ties by choosing the entry with the smallest area
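One step of ChooseLeaf can be sketched as a minimum over the node's entries, keyed first by the enlargement needed and then by area. Rectangles are tuples (x_low, y_low, x_high, y_high). The MBRs for P1 to P4 and the new rectangle Y are made-up values, and `choose_subtree` is a hypothetical name for the per-level step.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_low, y_low, x_high, y_high)

def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(mbr: Rect, new: Rect) -> float:
    """Extra area the MBR needs in order to also cover the new rectangle."""
    return area(union(mbr, new)) - area(mbr)

def choose_subtree(entries: List[Tuple[str, Rect]], new: Rect) -> str:
    """Pick the entry needing least enlargement; break ties by smallest area."""
    name, _ = min(entries, key=lambda e: (enlargement(e[1], new), area(e[1])))
    return name

# Hypothetical MBRs for the running example and a new object Y.
entries = [("P1", (0, 4, 4, 9)), ("P2", (2, 0, 5, 3)),
           ("P3", (6, 7, 9, 9)), ("P4", (6, 1, 10, 6))]
y = (0.5, 1, 1.5, 2)
print(choose_subtree(entries, y))  # P2
```

With these numbers P2 needs 4.5 units of extra area, against 12 for P1, 27.5 for P4, and 62 for P3. ChooseLeaf repeats this step from the root down until it reaches a leaf.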

R-trees: Insertion
Insert X
[Figure: X is added to leaf P2, which becomes {D, E, X}]
R-trees: Insertion
Insert Y
■ Extend the parent MBR
[Figure: P2 is extended to cover Y, and its leaf becomes {D, E, Y}]
R-trees:Insertion
■ Use the ChooseLeaf to find the leaf node to insert an
entry E
■ If leaf node is full, then Split, otherwise insert there
■ Propagate the split upwards, if necessary
■ Adjust parent nodes

R-trees: Insertion
■ If node is full then Split: e.g., insert W
[Figure, before the split: W falls in P1, whose leaf {A, B, C, K} is already full (F=4)]
[Figure, after the split: P1 is split into P1 = {A, B} and P5 = {C, K, W}; the root would then hold P1, P5, P2, P3, P4, so it is split as well into Q1 and Q2 under a new root]
R-trees: Split
■ Split node P1: partition the MBRs into two groups; each group will become a node.
• ‘linear’ split
• ‘quadratic’ split
[Figure: node P1 with entries A, B, C, K, W]
R-trees: Split
Why a good split is very important:
The aim is to minimize the area of the newly created nodes. This will make future searches more efficient.
[Figure: a bad split next to a good split]
R-trees: Split
■ Pick two rectangles as ‘seeds’;
■ Assign each rectangle ‘R’ to the ‘closest’ ‘seed’:
■ ‘closest’: the smallest increase in area
[Figure: rectangle R between seed1 and seed2]
R-trees: Split
■ How to pick seeds:
■ Linear: find the highest low side and the lowest high side in each dimension, normalize the separations, and choose the pair with the greatest normalized separation
■ Quadratic: for each pair E1 and E2, calculate the rectangle J = MBR(E1, E2) and d = area(J) - area(E1) - area(E2). Choose the pair with the largest d
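The quadratic seed choice can be sketched directly from the definition of d. Only the quadratic variant is shown. Rectangles are tuples (x_low, y_low, x_high, y_high), the sample rectangles are made up, and `pick_seeds_quadratic` is a hypothetical name.

```python
from itertools import combinations
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_low, y_low, x_high, y_high)

def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def pick_seeds_quadratic(rects: List[Rect]) -> Tuple[int, int]:
    """Return the indices of the pair that would waste the most area together."""
    def wasted(pair: Tuple[int, int]) -> float:
        e1, e2 = rects[pair[0]], rects[pair[1]]
        return area(union(e1, e2)) - area(e1) - area(e2)  # d = J - E1 - E2
    return max(combinations(range(len(rects)), 2), key=wasted)

# Hypothetical entries of an overflowing node.
rects = [(0, 0, 1, 1), (1, 1, 2, 2), (8, 8, 9, 9), (0, 2, 1, 3)]
print(pick_seeds_quadratic(rects))  # (0, 2)
```

The pair (0, 2) has d = 81 - 1 - 1 = 79, the largest of the six pairs, so the two rectangles that least belong together start the two groups. Examining every pair is what makes this variant quadratic in the number of entries.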

R-trees: Split
■ After the two seeds are chosen, PickNext picks which of the remaining entries to insert next:
■ Linear: pick any of the remaining entries
■ Quadratic: let Q be the set of remaining entries. For each entry E in Q calculate
■ d1 = the area increase required in the covering rectangle of Group 1 to include E.
■ d2 = the area increase required in the covering rectangle of Group 2 to include E.
■ Choose the entry with the maximum difference between d1 and d2. (The aim is to pick the entry with the strongest preference for one of the two groups.)
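A sketch of the quadratic PickNext rule inside a simple distribution loop. It is deliberately simplified: the minimum-fill rule that stops one group from ending up under-full is omitted, and ties go to Group 1. Rectangles are tuples (x_low, y_low, x_high, y_high), the sample data is made up, and the function names are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_low, y_low, x_high, y_high)

def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(mbr: Rect, new: Rect) -> float:
    return area(union(mbr, new)) - area(mbr)

def pick_next(remaining: List[Rect], mbr1: Rect, mbr2: Rect) -> Rect:
    """Entry with the largest |d1 - d2|, i.e. the strongest group preference."""
    return max(remaining,
               key=lambda e: abs(enlargement(mbr1, e) - enlargement(mbr2, e)))

def distribute(seed1: Rect, seed2: Rect, rest: List[Rect]):
    """Assign the remaining entries to the two groups (simplified sketch)."""
    g1, g2 = [seed1], [seed2]
    mbr1, mbr2 = seed1, seed2
    remaining = list(rest)
    while remaining:
        e = pick_next(remaining, mbr1, mbr2)
        remaining.remove(e)
        if enlargement(mbr1, e) <= enlargement(mbr2, e):
            g1.append(e)
            mbr1 = union(mbr1, e)
        else:
            g2.append(e)
            mbr2 = union(mbr2, e)
    return g1, g2

g1, g2 = distribute((0, 0, 1, 1), (8, 8, 9, 9),
                    [(1, 1, 2, 2), (7, 7, 8, 8), (3, 3, 4, 4)])
print(g1)  # [(0, 0, 1, 1), (1, 1, 2, 2), (3, 3, 4, 4)]
print(g2)  # [(8, 8, 9, 9), (7, 7, 8, 8)]
```

Entries with an obvious home are placed first, and the ambiguous middle rectangle is decided last, once both group MBRs have already grown.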

R-Trees: Deletion
■ Find the leaf node that contains the entry E
■ Remove E from this node
■ If underflow:
■ Eliminate the node by removing the node entries and the parent entry
■ Reinsert the orphaned entries (the node's other entries) into the tree using Insert
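The underflow handling can be sketched on a deliberately reduced model: a two-level tree represented as a list of leaves, where each leaf is a list of (MBR, object id) entries. The leaf is located by scanning for the object id rather than by an MBR-guided search, and parent MBR adjustment is left out. Both simplifications, the names, and the coordinates are assumptions made for this illustration. The function returns the orphaned entries that the caller would pass back to Insert.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_low, y_low, x_high, y_high)
Entry = Tuple[Rect, str]                  # (MBR, object id)

def delete(leaves: List[List[Entry]], obj_id: str, min_entries: int) -> List[Entry]:
    """Remove obj_id; on underflow drop the leaf and return its orphans."""
    for i, leaf in enumerate(leaves):
        for j, (_, oid) in enumerate(leaf):
            if oid == obj_id:
                del leaf[j]                  # remove E from its leaf
                if len(leaf) < min_entries:  # underflow
                    del leaves[i]            # eliminate the node and its parent entry
                    return list(leaf)        # orphans to be reinserted
                return []
    raise KeyError(obj_id)

# Hypothetical leaves P1 = {A, B, C} and P2 = {D, E}, minimum fill of 2.
leaves = [
    [((0, 6, 2, 8), "A"), ((1, 4, 3, 5), "B"), ((3, 7, 4, 9), "C")],
    [((2, 0, 4, 1), "D"), ((3, 2, 5, 3), "E")],
]
orphans = delete(leaves, "D", min_entries=2)
print([oid for _, oid in orphans])  # ['E']
print(len(leaves))                  # 1
```

Reinserting the orphans through the normal Insert path, instead of merging with a sibling as a B+-tree would, gives them a chance to land in leaves where they fit better.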

R-Tree Performance Parameters
■ The area covered by a directory rectangle should be minimized. Here the aim is to reduce ‘dead’ space (i.e., space covered by the bounding rectangle but not covered by enclosed rectangles).
■ Overlap between directory rectangles should be minimized. This also decreases the number of paths required to be traversed for a query.
■ Margin of a directory rectangle should be minimized (margin is the sum of the sides of a rectangle). With fixed area, the smallest margin is that of a square.
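The three quantities can be computed in a few lines. Rectangles are tuples (x_low, y_low, x_high, y_high), and margin is taken as the full perimeter. The dead-space helper assumes the enclosed rectangles do not overlap each other; otherwise the subtraction counts their shared area twice. All names and sample values are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_low, y_low, x_high, y_high)

def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])

def margin(r: Rect) -> float:
    """Sum of the side lengths (perimeter)."""
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))

def overlap(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles (0 if they are disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0

def mbr(rects: List[Rect]) -> Rect:
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def dead_space(children: List[Rect]) -> float:
    """Area of the MBR not covered by its children (children assumed disjoint)."""
    return area(mbr(children)) - sum(area(c) for c in children)

print(dead_space([(0, 0, 2, 2), (3, 3, 4, 4)]))  # 11
print(overlap((0, 0, 2, 2), (1, 1, 3, 3)))       # 1
print(margin((0, 0, 2, 2)), margin((0, 0, 4, 1)))  # 8 10
```

The last line shows the margin claim on two rectangles of area 4: the 2 x 2 square has margin 8, while the 4 x 1 rectangle has margin 10.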
R-trees: Variations
■ R+-tree: do not allow overlapping, so split the objects
■ R*-tree: change the insertion and deletion algorithms (forced re-insertion)
R+-tree
Note: Figure from https://tildesites.bowdoin.edu/~ltoma/teaching/cs340/spring08/Papers/Rtree-chap2.pdf
R*-tree
■ Splitting Algorithm: consider both minimizing the total area of the bounding box and the overlap between the entries.
■ Reinsertion Strategy: when a node overflows, a fraction of its entries are reinserted into the tree.
■ Node Selection: when inserting new entries, R*-tree uses a more complex method to choose the appropriate node.
Summary
■ Almost every online transaction has a spatial trace
■ Spatial data is multidimensional and requires spatial indexing
■ Spatial data objects have topological relationships
■ Main spatial queries are: Selection, Nearest Neighbor, and Joins
■ Space-filling curves are a way to map multidimensional data into a single dimension
■ The R-tree is a multidimensional index that addresses limitations of space-filling curves
■ The objective of every operation in the R-tree is to minimize the spatial overlap between MBRs, the dead space, and the margin
Thank you
Any questions?
