Advanced Database Indexing
by
Yannis Manolopoulos
Aristotle University, Greece
Yannis Theodoridis
Computer Technology Institute, Greece
Vassilis J. Tsotras
University of California, Riverside, U.S.A.
List of Figures xi
List of Tables xv
Contributors xvii
Preface xix
Figure 9.1. (a) hashing function with collisions, (b) perfect hashing
function and (c) minimal perfect hashing function. 189
Figure 9.2. A dynamic external perfect hashing scheme. 193
Figure 9.3. Dependency graph of the set of six words {chef, clean,
sigma, take, taken, tea}. 197
Figure 9.4. Searching step for the key set of interest. 199
Figure 9.5. A 2-dimensional array-based trie of the word set of the
example. 203
Figure 9.6. The Packed Trie array for the key set of the example. 204
Figure 10.1. The three architectures proposed to support a parallel
database system. 210
Figure 10.2. (a) backend sorting and (b) distributed sorting. 212
Figure 10.3. Example of a parallel merge-sort. 213
Figure 10.4. Calculation of the exact splitting vector and generation
of fragments. 215
Figure 10.5. Example of load balance and load imbalance for a
system with two processors. 216
Figure 11.1. Using multiple disks to store different files. 220
Figure 11.2. A simple tree-based index. 221
Figure 11.3. Record distribution approach for a three-disk system. 222
Figure 11.4. Example of a super page, partitioned into four disks. 223
Figure 11.5. A B-tree with four nodes. 224
Figure 11.6. Distribution of a B-tree to 3 disks. The horizontal links
are omitted for clarity. 226
Figure 11.7. A binary image (left) and the corresponding Region
Quadtree (right). 227
Figure 11.8. The shaded region represents a range query, which is
the area of interest. 229
Figure 11.9. Page P4 has been split to P4a and P4b. The proximity index
of the MBR of P4b with R1, R2, R3 and R4 is calculated. 230
Figure 11.10. An S-tree example. 231
Figure 12.1. Search structure before and after a split. 238
Figure 12.2. Node layout of a Blink-tree. 240
Figure 12.3. Two stages of split in a Blink-tree. 241
Figure 12.4. Node layout for operation specific locking. 243
Figure 12.5. Node of an Rlink-tree. 246
Figure 13.1. The architecture of a data warehouse. 260
Figure 13.2. A data cube. 261
Figure 13.3. A data integration system. 264
Figure 13.4. (a) A T-tree, (b) a T-tree node. 265
List of Tables
Table 9.1. Set of six words with their random h0, h1 and h2 values. 198
Table 9.2. Levels in the Ordering step of the example of the six word
strings. 198
Table 9.3. (a) vertices with their g values, (b) word strings (keys) with
their computed hash addresses 201
Table 9.4. Performance characteristics of hashing techniques that
produce external perfect hashing functions. 206
Table 12.1. Compatibility table for lock modes. 236
Table 12.2. Alternative compatibility table for lock modes. 236
Table 12.3. Performance results for R-trees 249
Contributors
level issues have come to a steady state does not seem to be correct any
more.
Although transparent to the user of a DBMS, access methods play a key
role in database performance. Thus, careful tuning or selection of the
appropriate access method is important in order to develop efficient
systems in the present era of transition towards object-relational and
other special-purpose systems. Also, an understanding of the state of the
art is essential in order to propose more efficient indexing techniques.
This book may serve as a textbook for graduates specializing in database
systems, or for database professionals who are keen to become acquainted
with recent developments. Emphasis has been placed on structure
description, implementation techniques used, and operations performed.
Most books on related topics are based on COBOL, PL/1, Pascal, C or a
pseudo-language. This book uses a simple algorithmic pseudo-language (for
some of the access methods), and the interested reader is encouraged to
implement some of them. Note, also, that code for certain structures can
be found on the Internet.
The book is divided into two parts. The first part, consisting of 3
chapters, contains some fundamentals, which more or less may be found in
most books about file structures, physical database design, or computer
architecture. It serves as the background knowledge for the second part,
which consists of 10 chapters dealing with more advanced material on
access methods. Every chapter ends with references for further reading.
Chapter 1 briefly discusses issues related to storage media, such as
magnetic disks, optical disks and tertiary storage, parallel disks and
RAID systems. The next chapter describes external sorting methods,
introducing notions useful for a chapter in the second part of the book,
dedicated to parallel external sorting. The third chapter is about the
most important file structures, which are currently used in any DBMS,
such as B+-trees, Hashing with Chaining, Linear Hashing and Inverted
Files, as well as other popular structures such as Grid Files and k-d
trees.
The first chapter of the second part, Chapter 4, explains some structures
that are used to store not integers but ranges of integers, e.g.
intervals. These structures are Segment Indices, Interval B-trees and
External Segment trees. Chapter 5 concerns structures for temporal
databases, such as the Snapshot index, Time-Split B-tree, Multiversion
B-trees and Overlapping B-trees. In Chapter 6 we examine structures used
in spatial databases and GISs. The best-known methods, Quadtrees, R-trees
and variants, are examined along with other interesting structures such
as LSD-trees. Chapter 7 deals with spatiotemporal data, or in other
words, spatial data that evolve over time. Certain indexing techniques
based on overlapping and partial persistence are described. Chapter 8
examines representations and access methods for image and multimedia
databases, such as 2-D strings, X-trees, M-trees and R-tree based
methods.
Chapter 9 contains material based on hashing. For a number of years,
perfect hashing was considered a method useful exclusively for main
memory applications, e.g. for tables with a small number of keywords.
Here we describe two methods that can apply perfect hashing to very
large numbers of records. Chapter 10 returns to the issue of external
sorting, which was examined in the second chapter. However, due to the
advent of new architectures and fast disks, a parallel environment is
assumed, and thus the approach is quite different. Chapter 11 introduces
the notion of declustering, i.e. techniques used to distribute a single
file structure across several disks. In this context, B-trees, R-trees,
Quadtrees and Linear Hashing are examined. In Chapter 12 we deal with an
important issue related to the performance of indices, i.e. concurrency
control, and we examine particular techniques such as the Blink-tree and
Rlink-tree methods. Finally, the book ends with a chapter dedicated to
the newest developments in indexing, such as indexing for on-line
analytical processing, data warehouses, semistructured data and
main-memory databases.
Thanks are due to many friends and colleagues for their help during the
various stages of authoring this book. In particular, we would like to
thank (in alphabetical order) Robert Alcock, Alex Biliris, Alex
Nanopoulos, Nikos Karayannidis, George Kollios, Dimitris Papadias,
Apostolos Papadopoulos, Evi Pitoura, Timos Sellis, Eleni Tousidou,
Michael Vassilakopoulos, and especially Theodoros Tzouramanis. Also,
Scott Delman and Melissa Fearon of Kluwer Academic Publishers provided
invaluable support.
Yannis Manolopoulos
Yannis Theodoridis
Vassilis J. Tsotras