0% found this document useful (0 votes)
7 views

Sorting With B Trees

Uploaded by

fovoni
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views

Sorting With B Trees

Uploaded by

fovoni
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Sorting And Indexing With Partitioned B-Trees

Goetz Graefe
Microsoft Corporation
Redmond, WA 98052-6399
USA
[email protected]

Abstract column to a B-tree index. If only a single value for this


Partitioning within a B-tree, based on an artificial lead- leading B-tree column is present, which is the usual and
ing key column and combined with online reorganiza- most desirable state, the B-tree index is rather like a tradi-
tion, can be exploited during external merge sort for tional index. If multiple values are present at any one point
accurate deep read-ahead and dynamic resource alloca- in time, which usually is only a transient state, the set of
tion, during index creation for a reduced delay until the index entries is effectively partitioned. It is rather surpris-
first query can search the new index, during data load- ing how many problems this one simple technique can
ing for streaming integration of new data into a fully help address in a database management product and its
indexed database, and for miscellaneous other opera- real-world usage.
tions. Despite improving multiple fundamental data-
Let us briefly consider some example benefits, which
base operations using a single basic mechanism, the
proposal offers these benefits without requiring data will be explained and discussed in more detail in later sec-
structures or algorithms not yet supported in modern tions of this paper.
relational database management systems. While some First, it permits putting all runs in an external merge sort
of the ideas discussed here have been touched upon into a single B-tree (with the run number as artificial lead-
elsewhere, the focus here is on re-thinking the relation- ing key column), which in turn permits improvements to
ship between sorting and B-trees more thoroughly, on asynchronous read-ahead and to adaptive memory usage.
exploiting this relationship to simplify and unify data Given the trend to remote disks, e.g., in SAN and NAS
structures and algorithms, and on gathering compre- environments, hiding latency by exploiting asynchronous
hensive lists of issues and benefits.
read-ahead is important, and given the continued trend to
striped disks, forecasting multiple I/O operations is gain-
Introduction ing importance. Similarly, given the trend to extremely
Even the most advanced data models rely on very tradi- large online databases, the ability to dynamically grow and
tional data structures and algorithms for storing and man- shrink resources dedicated to a single operation is very
aging records, including efficient query and update proc- important, and the proposed changes permit doing so even
essing. Thus, there is a continuous stream of research into to the extremes of pausing an operation altogether and of
improvements to these data structures, these algorithms, letting a single operation use a machine’s entire memory
and their usage. Among the perpetually interesting data and entire set of processors during an otherwise idle batch
structures in database systems is the B-tree [BM 72] and window.
its many variants, and among the perpetually interesting Second, it substantially reduces by at least a factor of
algorithms is external merge sort. Sorting is used to build two the wait time until a newly created index is available
B-tree indexes efficiently, and B-trees are used to avoid for query answering. While the initial form of the index
the expense of sorting and to reduce the expense of does not perform as well as the final, fully optimized in-
searching during query processing – however, the mutu- dex or a traditional index, at least it is usable by queries
ally beneficial relationship between sorting and B-trees and permits replacing table scans with index searches.
can go substantially further than that. Moreover, the index can be improved incrementally from
The present paper proposes not a new data structure or a its initial form to its final and fully optimized form, which
new search algorithm but an adaptation of well-known is very similar to the final form after traditional index
algorithms and of a well-known data structure. The es- creation. Thus, the final indexes are extremely similar in
sence of the proposal is to add an artificial leading key performance to indexes created offline or with traditional
online methods; the main difference is cutting in half (or
Permission to copy without fee all or part of this material is granted better) the delay between a decision to create a new index
provided that the copies are not made or distributed for direct commer-
cial advantage, the VLDB copyright notice and the title of the publica-
and its first beneficial impact on query processing.
tion and its date appear, and notice is given that copying is by permis- Third, adding a large amount of data to a very large,
sion of the Very Large Data Base Endowment. To copy otherwise, or to fully indexed data warehouse so far has created a dilemma
republish, requires a fee and/or special permission from the Endowment. between dropping and rebuilding all indexes or updating
Proceedings of the 2003 CIDR Conference
all indexes one record at a time, implying random inser- [GKK 01] or as part of data migration in partitioned data
tions, poor performance, a large log volume, and a large stores [LKO 00]. The value of the present proposal, from a
incremental backup. The present proposal resolves this database implementer’s point of view, is that no new data
dilemma in most cases. Note that it does so without spe- structures, algorithms, or quality assurance tests are re-
cial new data structures. Recently proposed approaches to quired, except of course tests of truly new functionality,
this problem have relied on adding a special separate e.g., pausing and resuming a sort operation or querying an
lookup structure in main memory, or on retaining records index still being built. Moreover, the present proposal
waiting to be pushed down within an index tree by divid- provides improvements concurrently in three main areas –
ing each B-tree node into a segment with traditional key- external sorting, index creation, and bulk loading – plus a
pointer pairs and another segment with waiting records. few additional ones.
Special or novel data structures and algorithms can have There is, of course, a vast amount of research on sorting.
enormous costs for real-world database systems, first in The most relevant work is on external merge sort with
development and testing, then when installing the new dynamic memory management [PCL 93, ZL 97]. These
release and reformatting large production databases, and prior algorithms adjusted the merge fan-in between merge
finally for training staff in application development and in steps, which might imply a long delay; the contribution
operations; all this not only for the core database system here is the ability to vary merge fan-in and memory usage
but also for relevant third-party add-on products for capac- dramatically and quickly at any point during a merge step
ity planning, tuning, operations, disaster preparedness and without wasting or repeating any work.
recovery, monitoring, etc.
After a brief summary of related research, the remainder Artificial leading key columns
of this paper first describes precisely how to manage parti-
The essence of the present proposal is to maintain parti-
tions within B-trees and then discusses how this technique
tions within a single B-tree, by means of an artificial lead-
assists in the three situations outlined above, plus a few
ing key column, and to reorganize and optimize such a B-
other ones.
tree online using, effectively, the merge step well known
from external merge sort. This key column probably
Related work should be an integer of 2 or 4 bytes. By default, the same
The present proposal is orthogonal to research into single value appears in all records in a B-tree, and most of
alternative layouts of data within B-tree pages, e.g., in the techniques described later rely on carefully exploiting
[BU 77, DR 01, GL 01, H 81]. Similarly, it is orthogonal multiple alternative values, temporarily in most cases and
to the data collection being indexed or the attribute being permanently for some few techniques. If a table or view in
indexed, which could be a column in a traditional rela- a relational database (or any equivalent concept in another
tional table, a hash value, a location in multi-dimensional data model) has multiple indexes, each index has its own
space mapped to a single dimension [RMF 00], or any artificial leading key column. The values in these columns
other (deterministic) function. are not coordinated or propagated among the indexes. In
Prior research and development into partitioning in par- other words, each artificial leading key column is internal
allel and distributed database systems are closely related, to a single B-tree, such that each B-tree can be reorganized
including [AON 96, HD 91, CAB 88]. However, none of and optimized independently of all others. If a table or
the prior work specifically considers online index opera- index is horizontally partitioned and represented in multi-
tions such as index creation, schema changes, etc., and ple B-trees, the artificial leading key column can be de-
how to exploit partitioning for those purposes. Online in- fined separately for each partition or once for all partitions
dex construction has been considered in the past [MN 92], – the present paper does not consider this issue further.
but not in the contexts of partitioning or of querying an In fact, the leading artificial key column effectively de-
index still in its construction, as proposed here. Mohan fines partitions within a single B-tree. The proposal differs
and Nareng [MN 92] also mention in a footnote that an from traditional horizontal partitioning using a separate B-
index could be made available incrementally, but their tree for each partition in an important way. Most advan-
description implies waiting until the complete sort opera- tages of the present proposal depend on partitions (or dis-
tion starts to emit output, and they do not consider how a tinct values in the leading artificial key column) being
query processor could exploit indexes coming online in- created and removed quite dynamically. In a traditional
crementally, as the present paper does. implementation of partitioning, each creation or removal
Another related research direction has considered fast of a partition is a change of the table’s schema and catalog
insertion into novel data structures derived from B-trees, entries, which requires locks on the table's schema or cata-
both small insertions in OLTP environments and large log entries and thus excludes concurrent or long-running
insertions in bulk loading, e.g., in [JDO 99, JNS 97, user accesses to the table, as well as forcing recompilation
JOY 02, MOP 98, OCG 96]. Other research has consid- of cached query and update plans. If, as proposed, parti-
ered fast bulk deletions, either in response to user requests tions are created and removed as easily as inserting and
deleting rows, smooth continuous operation is relatively distinct values in the leading column (as might be useful
easy to achieve. in a “select distinct …” query) and searching for index
Adding an artificial leading key column to every B-tree entries matching the current query.
raises some obvious concerns, which will now be dis- Presume, for example, that the B-tree in Figure 1 is an
cussed in turn – potential benefits will be discussed in index on column x, and that a user query requests items
subsequent sections. First, the artificial leading key col- with x = 19. The first probe into the B-tree inspects the
umn increases record lengths and therefore total disk us- left edge of the B-tree and determines that the lowest
age as well as required disk bandwidth while reading or value for the artificial leading key column is 0; the second
writing the entire B-tree. However, if prefix truncation is probe finds index entries within partition 0 with x = 19.
used [BU 77], almost all B-tree pages, both leaves and The third probe finds the first item beyond partition 0 and
internal nodes, will store only a single copy of this key thus determines that the next value in the artificial leading
column, since its value will be constant for all records in key column is 3, etc., for a total of 7 probes including the
almost all pages. Note that implementations that exploit left and right edges of the B-tree.
prefix truncation do not necessarily split pages in the mid- Fortunately, this search can be limited at both ends by
dle upon page overflow, instead favoring a split point near the use of integrity constraints, either traditional “hard”
the middle that permits truncating the longest possible constraints or “soft” constraints that are observed auto-
prefix in both pages after the split. Thus, this artificial key matically by the database system and invalidated auto-
imposes negligible new disk space and bandwidth re- matically when a violating record is inserted into the data-
quirements. base [GSZ 01]. In Figure 1, if a constraint limits the parti-
tion number to 4 or less, the probe at the right edge can be
omitted. If there is only one value for the artificial leading
key column in the B-tree, and if integrity constraints for
both ends of the B-tree exist, a probe into the proposed B-
tree is as efficient as a probe into a traditional B-tree.
Fourth, B-tree indexes deliver sorted data streams as
query output or as intermediate query result. In order to
Partition no. 0 3 4 obtain the same sorted output stream, records from multi-
ple partitions of the B-tree must be merged on the fly. If
Figure 1. B-tree with partitions the number of partitions is moderate, this can be achieved
very efficiently, using well known algorithms and data
Second, searches within a page are more expensive, be-
structures used in external merge sort.
cause each comparison must compare the entire key, start-
Fifth, B-tree indexes are often used to efficiently enforce
ing with the artificial leading key column. However, if
uniqueness constraints, and the proposed B-trees with the
prefix truncation is used, the key component that has been
artificial leading key column substantially increase the
truncated because it is constant for all records in a page
expense of checking for a duplicate key value. This check
actually does not participate in comparisons; thus, only
disregards, of course, the artificial leading key column,
comparisons within pages with multiple values of the arti-
and therefore must probe into the B-tree index for each
ficial key column within the page incur some cost, mean-
actual value of the artificial leading key column. Again,
ing hardly any pages and thus hardly any comparisons.
when multiple values for this column are present in the B-
Note that prefix truncation is not really required to reduce
tree, this concern is valid; however, in most cases and at
the comparison cost; “dynamic prefix truncation” requires
most times, there should be only one value present and
that comparison operations indicate where in the compari-
this fact should be known due to hard or soft integrity con-
son arguments the first difference was found, and permits
straints.
comparisons to skip over those leading parts in which
Sixth, selectivity estimation, which is crucial for effec-
lower and upper bound of the remaining search interval
tive query optimization, could be hampered because the
coincide [L 98].
histogram associated with an index describes primarily or
Third, searches in the B-tree are more complex and more
even exclusively the distribution of the leading key col-
expensive than in traditional B-tree indexes, in particular
umn, i.e., the artificial leading key column rather than the
if multiple partitions exist. The situation is, of course, very
first user-chosen key column. Fortunately, most modern
similar to other B-tree indexes with low-cardinality lead-
database systems separate the notions of histograms and
ing columns. Each searching probe into the B-tree must
indexes. While it used to make sense to link the two be-
first determine the lowest actual value for the artificial
cause both needed full data scans and sorting for efficient
leading key, then search for the actual parameter of the
construction, modern database systems build histograms
probe, then search whether there is another value for the
from sampled data and refresh them much more often than
leading artificial key column, etc. [L 95]. The probe pat-
they rebuild B-tree indexes. Typically, a sufficient sample
tern effectively interleaves two sequences: enumerating
easily fits into main memory and thus can be sorted effi- other hand, when merging runs of very different sizes,
ciently. Due to this efficiency, most database systems and substantially more read operations will pertain to the large
installations support statistics for columns that are not input runs – a typical situation occurs when merging some
indexed at all or are not leading columns in indexes, which initial runs (which are about the size of memory) and
is precisely the type of statistics needed here. some intermediate merge results (which are larger than the
Finally, a few more observations that likely are obvious initial runs by a factor equal to the merge fan-in, e.g.,
and thus are mentioned only briefly. The proposed use of 100). Moreover, if the key distribution in the input is
B-trees is entirely orthogonal to the data collection to be skewed, i.e., if there is any form of correlation between
indexed. The proposed technique applies to relational da- input order and output order, even input runs of similar
tabases as well as data models and other storage tech- sizes might require different amounts of read-ahead at
niques that support associative search, it applies to both different times during a merge step.
primary (clustered) and secondary (non-clustered) in- In both cases, deep forecasting is required, i.e., forecast-
dexes, and it applies to indexes on traditional columns as ing that reaches beyond one asynchronous read operation
well as on computed columns, including B-trees on hash and beyond finding the lowest one among the highest keys
values, Z-values (as in “universal B-trees” [RMF 00]), and on each page currently consumed by the merge logic
on user-defined functions. Similarly, it applies to indexes [K 73]. Other researchers have considered technique for
on views (materialized and maintained results of queries) planning the “page consumption sequence” ahead of a
just as well as to indexes on traditional tables. merge step [ZL 98] or as the merge progresses [S 94]. In
To summarize, adding an artificial column to each B- both efforts, a separate data structure was designed to re-
tree index raises several obvious possible concerns, but all tain the highest keys in each data page. In commercial
of them can be mitigated to a negligible level. Having reality, however, every new data structure requires new
considered these concerns, let us now discuss the benefits. development and, maybe more importantly and more ex-
pensively, testing, which is why neither of these designs
Sorting has been transferred into real products.
Retaining all runs in a single B-tree, using the run num-
Virtually all database systems use external merge sort
ber as the artificial leading key column, addresses several
for large inputs, with a variety of algorithms used for in-
issues without introducing the need for a new data struc-
ternal sorting and run generation. One important design
ture. Most immediately, the parent level about the B-tree’s
issue is how to store intermediate runs on disk such that
leaves is a natural storage container for precisely the keys
can be read efficiently in sort order. Many database serv-
needed for accurate deep forecasting. In fact, it is possible
ers use roughly ten times more disk drives than CPUs; in
to forecast arbitrarily deeply, and to do so dynamically
some case, however, the number of disk arms is effec-
while merging progresses, i.e., adapt the forecasting depth
tively unknown to the database management system since
to the current I/O delay as well as add or drop runs from
an entire disk farm or network attached storage is shared
the forecasting logic. Moreover, a scan over the leaves’
by many users and even multiple servers, including multi-
immediate parent nodes is already implemented in some
ple database servers. In order to keep all disk arms use-
database systems because it is also required for multi-page
fully busy and in order to hide all I/O latencies, asynchro-
read-ahead in an ordered key range retrieval, e.g., a large
nous I/O is needed while writing initial runs and while
“between” predicate.
reading and writing runs during merge steps. Asynchro-
The space and I/O overhead for using a B-tree for runs is
nous writing is relatively easy since it is always clear
negligible: internal B-tree nodes of 8 KB have a fan-out of
which pages should be written and since the CPU process
at least 100, meaning that about 99% of all pages in the B-
does not need to wait for completion of the I/O. Asyn-
tree are leaves. A B-tree fan-out of 100 is very conserva-
chronous reading in merge steps requires more attention
tive if prefix and suffix truncation are used and if the
for two reasons. First, if a required page is not yet in
space utilization is 100%, which is possible because the B-
memory, the sorting program must wait, thus relinquish-
tree is loaded sequentially by the merge step. Thus, a B-
ing not only the CPU but also the CPU cache. Second, the
tree fan-out of 400 seems realistic in many cases, meaning
very nature of merging implies that many inputs are read,
about 0.25% of all pages are not leaves. Actually, since
and it is necessary to determine which of the inputs must
the leaves’ immediate parents are equivalent to any other
be read next, commonly known as forecasting [K 73].
data structure that captures the consumption sequence of
Note that double buffering [S 89a] for all input runs does
pages in merge input runs, only 1% of 1% of all pages (or
not truly solve the problem. On one hand, it reduces the
0.25% of 0.25%) in the B-tree is overhead due to using a
merge fan-in to half, whereas good forecasting reduces the
B-tree to store all runs.
fan-in only by a relatively small fixed number. Useful
Another benefit of using a B-tree to store all runs is that
values are the number of disk drives if known or simply
parallel threads can be added or removed from the sort
ten, based on the rule of thumb that there are roughly ten
effort at any time. A new thread can be put to good use
times more disks than CPUs in a balanced server. On the
simply by choosing and assigning a set of runs to merge to ‘z’; all 1,800 records into a single output run. The final
and a key range within those runs. Even in an external sort merge step merges these 1,800 records with 9 times 900
with a single merge step, the final merge can be parallel. records with keys ‘a’ to ‘m’ followed by another 9 times
Inversely, a thread can stop its work at any time – the re- 900 records with keys ‘m’ to ‘z’. Thus, the total merge
maining B-tree is still a valid and consistent collection of effort is 19,800 records – a savings of about 25% in this
runs. No work already performed is wasted and no work is (artificial) example. Starting a merge at such a “given”
performed twice. The operation to delete an entire key key within a run on disk is very inefficient with traditional
range within a merge input run is precisely the same one runs, but is no problem if runs are stored in a B-tree.
that deletes an entire run, and is already implemented in Given all these adaptive mechanisms1, one important de-
B-tree implementations used in data warehousing, where sign issue is management of information about runs, how
entire date ranges are regularly added and removed. Simi- to determine efficiently and at any time which runs cur-
larly, memory can be added and removed from a sort op- rently exist, their sizes and their key ranges. In an index
eration at any time, without loss in I/O efficiency, i.e., with only a few partitions, it is possible to enumerate the
without the need to shrink the units of data transfer. The partitions at the expense of one root-to-leaf probe per par-
merge process can add or drop runs at any time. In the tition, possibly saving the first and the last probe through
extreme case, a merge process can drop all its runs, mean- constraints on the artificial leading key column. In a large
ing that the entire sort operation is paused. With appropri- external merge sort, one can determine the set of runs in
ate transaction support, sort operations can be resumed the same way. Even some of the interesting properties of
even after server restart. Note that it is quite straightfor- runs, e.g., run sizes, can be estimated quite accurately be-
ward to drop runs from the current merge step; adding a cause all leaves and all interior nodes of the B-tree are
run requires finding in an existing run precisely the right filled 100%. It is probably more efficient, however, to
key that matches the current merge progress. This search employ a separate table with information about runs. De-
is obvious and easy with runs in a B-tree, due to B-trees’ pending on the detail captured, e.g., information about
inherent support for “between” predicates, whereas it re- entire runs only or information about key distributions
quires expensive searching in traditional “flat” run files. within runs for virtual concatenation of key ranges, this
The resulting runs with partial key ranges enable optimi- table might need to be stored on disk. One design allocates
zations traditionally conceived for partially pre-sorted a small amount of memory to run management, e.g., two
inputs [H 77]. Two runs with disjoint key ranges can be pages, and overflows all further run descriptors to disk.
thought of as a single run, and can together, one after an- Merge planning is based on those two pages, and only
other, serve as a single input in a future merge step, a when the number of runs has shrunk such that their de-
technique called “virtual concatenation” here. In addition scriptors fit on one page, another page of run descriptors is
to the traditional use of this technique, a B-tree even per- loaded. In an alternative design also using a small amount
mits to rearrange key ranges within runs. Instead of merg- of memory, intermediate merge steps are forced when the
ing or concatenating entire runs, fractions of runs defined number of runs exceeds a given threshold. Note that for
by key ranges could be merged or concatenated. When such forced intermediate merge steps, merge planning
reaching a given pre-planned key, one or multiple merge should attempt to merge runs with the most similar sizes
inputs are removed from the merge logic and other runs rather than the smallest runs, which is the usual optimiza-
added. For example, consider an external merge sort with tion heuristic for merge planning.
memory for a fan-in of 10, and 18 runs remaining to be To summarize this section on sorting, capturing runs in
merged with 1,000 records each. The keys are strings an external merge sort opens new opportunities, princi-
starting with a character in the range ‘a’ to ‘z’. Presume pally in two directions. First, it enables more efficient sort-
both these keys occur in all runs, so traditional virtual ing due to accurate deep forecasting and to virtual con-
concatenation does not apply. However, presume that in 9 catenation of key ranges. Second, it enables mechanisms
of these 18 runs, the key ‘m’ appears in the 100th record; that enable large sort operations to adapt to the current
while in the other runs, it appears in the 900th record. The system load quickly and over a wide range of resource
final merge step in all merge strategies will process all levels. In other words, it enables mechanisms required in
18,000 records, with no savings possible. The required self-tuning database management systems.
intermediate merge step in the standard merge strategy
first chooses the smallest 9 runs (or 9 random runs, since
they all contain 1,000 records), and merge those at a cost 1
For memory adjustment during run generation, Zhang and Lar-
of 9,000 records read, merged, and written. The total son proposed a method that is both adaptive and cache-efficient
merge effort is 9,000 + 18,000 = 27,000 records. The al- [ZL 97]: Sort each incoming data page into a mini-run, and
ternative strategy proposed here merges key ranges. In the merge mini-runs (and remove records from memory) as required
first merge step, 9 times 100 records with keys ‘a’ to ‘m’ to free space for incoming data pages or competing memory
are merged followed by 9 times 100 records with keys ‘m’ users. Techniques from [LG 98] can be adapted to manage space
for individual records, including variable-length records.
Index operations search and update the index, concurrent merge steps
should have excellent online behavior. Specifically, when
Database system use sorting for many purposes, not the
conflicting with a lock held or requested by a concurrent
least among them is efficient construction of B-tree in-
user transaction, the merge step should let the concurrent
dexes. All the sorting techniques discussed above apply to
transaction proceed. Fortunately, as discussed in the sec-
index creation operations, including pause and resume
tion above on sorting, small ranges can be merged indi-
without loosing or wasting work, e.g., after a load spike or
vidually, even concurrently by multiple independent
server shutdown. In addition, online index creation can
threads, and a merge step can commit and terminate at any
exploit B-tree indexes with an artificial leading key col-
time, and resume later without any work being wasted.
umn in an interesting way, as follows. At the end of the
For correct durability after a new index has been com-
run generation phase, a single B-tree contains all future
mitted in its initial format, all further modifications of the
index records, albeit not yet in the final order. Nonethe-
index must be fully logged, including the merge actions.
less, the records are already sufficiently organized to per-
Changes may be logged per page in order to avoid per-
mit reasonably efficient searches. Thus, concurrent queries
record overheads, and it may be possible to combine log
may start exploiting the new index after only a single pass
records for page deletion (in the merge input) and page
over the data, even before the start of the merge phase. If
creation (in the merge output). The initial data transfer
the index creation requires only a single merge step (the
(prior to committing the initial index) may omit data log-
usual case nowadays), this means that the index is avail-
ging, similar to today’s techniques that require flushing
able for querying in half the time of traditional index crea-
the new index to disk and capturing the index contents
tion. For a very large index, the reduction in latency might
when backing up the log, as optional in [MS 98]. Further
even be a factor 3 (for a two-level merge sort).
reductions in logging volume may be possible but require
When searching indexes that are not fully merged and
further research, and they may introduce new tradeoffs
optimized yet, there is a compromise in search efficiency
and compromises, e.g., retaining rather than reclaiming
but, unless the initial runs are very small, it is faster to
data pages of merge input runs.
probe into each run using a root-to-leaf B-tree traversal
If concurrent transactions update the indexed table
than to scan the new B-tree in its entirety. For example,
(view, etc.) while the initial runs are created, these updates
presume that all nodes above the leaves or at least above
must be applied to the future index before it may be que-
the leaves’ immediate parent nodes will fit in the buffer
ried or updated. There are two well-known methods to do
and therefore will not incur I/O during a probe. Thus, each
so [MN 92]: either a log-driven “catch up” phase applies
root-to-leaf traversal will incur at most two random I/Os,
these updates to the index after the index builder com-
which takes about 12 ms using contemporary fast disks.
pletes, or the concurrent transactions apply their updates
Recall that two root-to-leaf passes may be required for
immediately to the index, which in the present proposal
each run or partition within the B-tree, or about 24 ms per
consists of initial runs. Given that the recovery log typi-
distinct value in the artificial leading key column. During
cally cannot be searched by key value, the latter technique
that time, today’s fast disk drives can deliver about 8 MB
is more interesting here. Thus, a new index must be tagged
at their nominal (ideal) speed of 320 MB/s. Thus, if the
“in construction” such that updates but not queries con-
average initial run is longer than 8 MB, queries will per-
sider the index. Deletions of records not yet in the index
form better by probing the new index than by scanning the
insert some special markers or “anti-matter” that will be
old storage structures. Note that a file scan at full speed
applied and cleared out later by the index builder. Trans-
puts much more load on the system’s CPUs, memory, and
actions searching the index before the merge phase com-
bus than repeated index probes; thus, index probes are
pletes must search not only for valid records but also for
even more preferable in a multi-user environment.
anti-matter, very similar to searching in a differential file
For correct transactional behavior, the transaction creat-
[SL 76]. Fortunately, all insertions, deletions, and anti-
ing the index should commit after the initial runs are com-
matter insertions by concurrent transactions can be col-
plete. Concurrent transactions should not query the new
lected in a single partition, i.e., a single, constant, well-
index when its creation might still roll back; in fact, the
known value for the artificial leading key column. Assum-
query optimizer should not create execution plans that
ing record-level or key value locking, the level of lock
search an index whose existence is not yet committed.
contention among concurrent transactions should not be
After the initial index is committed, subsequent merge
greater than lock contention will be in the final index, and
steps may be part of the original statement execution but
thus may be presumed to be acceptable. In order to reduce
should not be part of the original transaction. Instead,
lock and latch contention between concurrent transactions
since they only modify the internal index structure but not
and run generation within the index builder, it is advanta-
database or index contents, they can be system transac-
geous to separate this B-tree partition from the runs, e.g.,
tions that commit independently, rather similar to system
use value 0 in the artificial leading key column for inser-
transactions used routinely today, for example during a
node split in a B-tree index. While concurrent queries
tions and deletions by concurrent transactions and start run ports success to the user. Thus, it leaves all data move-
generation with run number 1. ment and sort work to asynchronous system transactions
The possibility of creating a new index in a single pass that will run later. The price of this flexibility is that even
over the data, to the point of making the index usable to the initial data insertions must be fully logged, like all data
retrieval queries even if it is not immediately optimal, can movement after the existence of the index has been com-
be extended even further. If the data source during the mitted, although further research may be able to reduce
index creation is ordered, e.g., if it is a primary index, key the logging volume.
ranges in the source will approximately correspond to ini- An index can be populated not only for one continuous
tial runs in the index being built. Specifically, if run gen- range, as proposed above, but for multiple ranges that may
eration repeats read-sort-write cycles (as many sort im- or may not be contiguous, called “inclusion ranges” else-
plementations based on quicksort do), initial runs in the where [SS 95]. These ranges are maintained in a control
new index will precisely correspond to key ranges in the table very similar to control tables (also called “tables of
old index. If run generation streams data from the input to contents”) commonly used today for selective replication
the initial runs (as many sort implementation based on or caching. Despite describing the contents of an index,
replacement selection do), the steady pipeline can be inter- these control tables are data, not meta data. Therefore, a
rupted and flushed every now and then, or at least the as- change in a control table does not trigger plan invalidation
signment of run numbers to records entering the priority or query recompilation. Note that a single control table
heap in replacement selection can be modified to flush may suffice for an entire database with all its tables and
input records into the new index. After all records with indexes if normalized keys are used, i.e., in some sense all
key values in the old index within a certain range have indexes only have one search column, which is a binary
been flushed into the new index (albeit in multiple runs), string. While an additional range is being populated, con-
the set of records already captured in the new index can be current updates (by concurrent transactions) must be ap-
described with simple range predicates in both the data plied to the new index, including anti-matter, as discussed
source and the new index being built. In the scanned data above for online index creation without ranges. Thus, in-
source, the predicate uses the search key of that index, and dividual ranges must be tagged as “fully operational” or
in the new index, the predicate uses the run number, i.e., “in construction,” and branch selection in dynamic execu-
the artificial leading key column. Such simple range tion plans for updating a table are controlled slightly dif-
predicates are, of course, fully supported in all implemen- ferently than in dynamic plans for selecting from a table.
tations of indexed (materialized) views; thus, even such a Partial indexes also open a door for another promising
partial index [S 89b] can be made available to the opti- technique. If the query optimizer cannot find a suitable
mizer, very similar to an index on a (materialized) selec- index for a query and must thus plan a table scan, it can
tion view. determine the most useful possible index and then prepare
For views of this type, optimizers can construct dynamic a dynamic plan with two alternative branches. One branch
execution plans with two branches for each table access, exploits this index if it already exists; the other branch
one branch exploiting the new index for a query predicate performs the table scan but leaves behind a B-tree that
subsumed by the predicate describing the range of rows contains the initial runs for this index, leaving it to an
already indexed, and one branch to process the query asynchronous utility operation to optimize this index by
without the new index. Note that some query executions merging those runs into a single traditional B-tree. Actu-
may employ both the old and the new indexes, relying on ally, there must be three branches in the query plan, the
two mutually exclusive range predicates for the two in- third one simply performing the table scan without leaving
dexes to find each qualifying row exactly once. Thus, a anything behind, to be used if the empty initial index can-
new B-tree index can be considered by the query opti- not be created when the query plan needs to start, e.g., due
mizer immediately after index creation begins, and it be- to space constraints or due to locks held by concurrent
comes more and more useful for query processing as index transactions. Note that run generation using a replacement
constructions proceeds, both during initial run generation selection does not substantially alter the flow behavior of
(range by range) and during the subsequent merge phase the file scan. One of the issues that need to be resolved is
(fewer and fewer partitions or runs within the new B-tree). how the file scan produces records for two different trans-
In the extreme case, a new index can be committed in- action contexts, the user query and the index builder. For-
stantly without moving any data at all and independently tunately, using “insert” and “delete” tags familiar from
of the size of the data source and of the new index. In maintenance plans for indexed (materialized) views can
other words, the initial user transaction verifies schema readily be adapted for this purpose.
and permissions, reserves sufficient disk space for the Unique indexes, or indexes built for efficient mainte-
entire future index, creates initial catalog entries including nance of uniqueness constraints, pose an additional chal-
a boundary predicate and a boundary not satisfied by any lenge for online index creation. The traditional approach
data currently in the database, and then commits and re- has been to fail concurrent transactions or the index
builder when a uniqueness violation is detected [MN 92]. amounts of data into a pre-existing large and fully indexed
Thus, index creation may be aborted hours or even days database.
after it starts and minutes before it completes due to a sin- To summarize this section, a new index can be available
gle committed insertion by a concurrent transaction, even for queries substantially earlier than in traditional methods
if another concurrent transaction is about to delete one of if initial sort runs are collected in a single B-tree, and
the duplicate keys and commit it before the index builder online index construction can even be incremental. More-
will complete. Fortunately, a useful technique exists that over, self-tuning query plans as briefly outlined here
also extends “soft constraints” [GSZ 01] from single-row would represent a great step forward compared to the tun-
“check” constraints to uniqueness and key constraints. Its ing capabilities found in most database systems today;
essence is to maintain a counter of uniqueness violations maybe B-tree indexes with an artificial leading key col-
for each possibly desirable uniqueness constraint, in the umn will turn out an important step in this direction due to
minimal case only one but in the maximal case for all pre- the single-pass construction of the initial index. Rather
fixes of the B-tree’s search key. Thus, whenever a B-tree than relying on “wizards” or “advisors” that run outside
entry is inserted or deleted, it must be compared with one the query processor [ACN 00, VZZ 00], creating and
or both of its neighbors, and counters must be incremented populating useful indexes can become a native and inte-
or decremented appropriately. It is important to maintain gral part of query optimization and query execution. Inci-
these counters accurately and with correct transaction se- dentally, query optimizers routinely decide on index crea-
mantics. While escrow locks [O 86] might prove helpful tion and have the query execution plan populate such in-
for such counters, some systems maintain such counters dexes; the main difference is that those temporary indexes
apparently without them [CAB 93], possibly by using today are created, populated, used, and destroyed within a
transaction-internal counters made globally visible only single query plan and transaction context rather than main-
during commit processing. When a counter for a specific tained during database updates and then amortized over
prefix is zero, this set of leading columns is unique and a multiple invocations of the same (possibly parameterized)
uniqueness constraint can be enabled instantly without query.
further validation. If such counters are maintained during
index construction, neither concurrent transactions nor the B-tree loading
index builder must be aborted due to uniqueness viola-
While we may hope that indexed (materialized) views
tions, and the index structure may be retained even if the
substantially alleviate the response time in relational
uniqueness constraint is not satisfied at the time the index
OLAP (online analytical processing), one issue that will
builder completes. Note that this separation of indexes and
remain is importing new data into existing large, popu-
uniqueness is entirely consistent with the physical nature
lated, and indexed data warehouses. In a very common
of indexes (they are an issue of database representation,
scenario, at the end of every month, another month’s
not of database contents) and the logical nature of con-
worth of data is added to the detail data. The key difficulty
straints (which limit and describe the database contents,
is that the largest table typically has multiple indexes, all
not the data representation), and is also reflected appropri-
on different columns, and only one index on time. Tradi-
ately in the SQL standards, which include syntax and se-
tional solutions have been to keep separate tables and their
mantics for constraints but not for indexes.
indexes for each time period, equivalent solutions using
It might seem that an artificial leading key column inhib-
partitioning and “local” indexes (i.e., secondary indexes
its using this technique. Indeed, while there are multiple
are partitioned precisely like the primary index or the base
distinct values for this column, it is impossible to activate
table), dropping all indexes during data import and re-
a uniqueness constraint instantly. Instead, the existing
building them afterwards, or special “update execution
partitions of the index B-tree must be merged to verify
plans” that merge an ordered scan of the old index with a
uniqueness. The result of this merge step can be material-
sorted set of index insertions into an entirely new index. It
ized, in which case this merge step is precisely the final
appears that partitioning with local indexes has shown the
step of sorting or of index creation. Alternatively, the
most desirable properties, namely fast data import and
merge result can be ignored, leaving the final optimization
short delay until queries can exploit new indexes on the
of the index to a subsequent online reorganization. Notice,
new data. The unfortunate aspect of partitioning is that
however, as mentioned earlier, that enforcing a uniqueness
many individual partitions must be managed for each in-
constraint using an index with multiple partitions is more
dex, with additional catalog tables, catalog entries, catalog
expensive than using one with only a single partition.
look-ups, etc.; the other very unfortunate aspect is, of
Thus, finalizing the B-tree index before activating a
course, that each partition must be searched when the
uniqueness constraint based on the new index seems like
query predicate does not restrict the query to a single time
the right approach in general. Nonetheless, there are also
period.
situations in which multiple partitions in persistent in-
B-tree indexes with an artificial leading key column of-
dexes are extremely useful, e.g., when importing large
fer an attractive combination of features in these situa-
tions. In effect, the artificial leading key column defines GKK 01]. Sorting the insert batch typically relies on an
partitions; however, it does so within the structure of a external merge sort. If there are multiple batches, each of
single index, single B-tree, and single partition as far as them is sorted and each of them requires updating many or
the catalogs and the query optimizer are concerned. More- even all leaf pages in the index. The proposed method, on
over, the partitions within the B-tree are temporary, to be the other hand, works efficiently independent of the num-
removed by online reorganization at the earliest conven- ber and size of batches, unless batches are much smaller
ient time. This perspective, using the artificial leading key than the memory that can be dedicated to run generation.
column as a form of partitioning, immediately leads to a Runs are formed and merged using the same amount of
very efficient bulk insert strategy: let each large insert memory and effort in the traditional and in the proposed
define a new partition, and then let incremental index re- strategies; the main difference is that the merge strategy
organization re-optimize the B-tree structure at a conven- can be optimized ignoring which batch generated which
ient time. In the best case, this reorganization is incre- runs.
mental, online, and responsive to the current system load. The proposed bulk insert strategy offers further benefits.
Note that multiple batches of bulk insertions can be in- Maybe most importantly, newly imported data are avail-
serted into a B-tree before reorganization takes place, that able for queries much faster than in traditional strategies,
reorganization may proceed incrementally range by range, even if multiple indexes need to be maintained. Thus, the
that a reorganization step does not necessarily affect all proposed strategy is suitable for indexing and querying
existing partitions, and that reorganization can proceed continuous data streams. Moreover, the proposed algo-
even while another batch is being imported. Note also that rithm reduces the number of index pages that are modified
multiple batches can be inserted concurrently, even by and thus will participate in a transaction rollback (should
multiple users; contention for locks and latches should be that be necessary) or in an incremental backup immedi-
negligible if page splits optimize the key distribution for ately following the data import. In addition, this strategy
maximal prefix truncation [BU 77] rather than assign pre- effectively eliminates lock conflicts between the import
cisely half the data to each resulting page. transaction and any concurrent transactions. Finally, many
Letting each large insert operation create a single new fewer log records are required during bulk insertion be-
partition implies that the insert operation pre-sorts the new cause each actual log record can describe an entire new
data appropriately for each index. Following the discus- page, rather than an individual record insertion. Note that
sion earlier in the paper, this sort operation might employ index entries in secondary indexes are often much smaller
a B-tree to hold intermediate runs. Rather than using a than the fixed space overhead for log records, including
separate, temporary B-tree, however, the bulk import op- previous LSN, next undo LSN, etc. [MHL 92]. An index
eration can immediately use the import target. Thus, when entry of 20 bytes might result in a log record of 80 bytes –
importing into a B-tree index, the incoming data is proc- thus, paying the overhead of a log record once per page
essed into runs, and runs are added immediately as indi- rather than per index entry substantially reduces the log
vidual partitions to the permanent B-tree. When importing volume during the insert operation. If, on the other hand,
data into a table with multiple B-tree indexes, each of bulk insertions are not logged but only flushed to disk at
those forms its own new partitions, which are independent the end of the insert operation, to be made truly durable
of the new partitions in the other B-trees. Thus, runs for all only by a backup, partitioning within a B-tree substantially
B-trees can be formed concurrently, such that incoming reduces both the flush effort and the backup volume.
data is never written to any temporary structures, meaning Unfortunately, there are also some concerns during data
that it is processed in memory only once before it can be- loading. If enforcement of a uniqueness constraint relies
come available for retrieval queries. Both load-sort-store on a B-tree index, duplicate search keys may be located
and replacement selection can be used for run generation. not only in immediately neighboring B-tree entries but
The expected size of the runs relative to the memory size also in other partitions. In other words, if it is truly im-
depends on the choice of algorithms, but runs should be at perative that the B-tree at no time and under no circum-
least as large as the allocated memory in all cases. As dis- stances contain duplicate entries, bulk loading has to
cussed for index creation, parts of the data can be commit- search in all partitions for all search keys. Note that each
ted and made available for queries at any time; of course, such search can be leveraged for multiple new records; the
in this case, all participating indexes must be flushed in resulting algorithm resembles a merge join with each prior
order to ensure transactional consistency among them. partition (actually a merge-based anti-semi-join). If, on the
It might be interesting to compare the effort in a tradi- other hand, it is sufficient that the uniqueness constraint
tional bulk insert with the effort in the scheme proposed holds only for the B-tree entries in the default partition
here, i.e., the combined effort of appending partitions and (say value 0 for the artificial leading key column), bulk
merging them into the main partition. The traditional bulk insert into other partitions can proceed at full speed, and
insert strategy for a single large batch sorts its entire insert verification of the uniqueness constraint can be left to the
set and then merges the sorted set into the B-tree [MS 98, B-tree reorganization performed later. For example, the
reorganization might simply skip over duplicate keys; even possible to append runs to multiple indexes, which is
when reorganization is complete, only duplicate keys are particularly attractive for capturing and indexing non-
left in those partitions. In fact, most implementers and repeatable data streams.
administrators of data warehouses prefer a tolerant data
loading process, because typically only a small minority of Other applications
records violates any constraints and it is not worthwhile to
B-tree indexes with artificial leading key columns can
disrupt a high-speed loading process for a few violations
improve not only large inserts but also small ones. Other
that can be identified and resolved later.
A related issue is the generation of “uniquifiers” in primary indexes. One design lets entries in secondary indexes “point” to entries in the primary index by means of search keys – the advantage of this design is that page splits in the primary index do not affect the secondary indexes [O 93]. If the search keys in the primary index are not unique, an artificial uniquifier column is added as a trailing column to each clustering key. (In efficient implementations, one instance per unique search key may have a NULL uniquifier value, which like other NULL values is stored very compressed in the primary index and in any secondary index.) If multiple partitions may hide actual duplicate search keys in the primary index, either the assignment of uniquifiers must search all partitions or the “pointer” from a secondary index into the primary index must include the value of the artificial leading key column in the primary index. Moreover, any reorganization of the primary index may need to assign new uniquifier values and thus require expensive updates in all secondary indexes.

Perhaps a better design, one that requires less reorganization of secondary indexes, adopts an additional artificial trailing key column in the primary index and a slowly increasing boundary value indicating which values of the artificial leading key column have already been used and reorganized into the main part of the primary index. If the value of the artificial leading column found in a secondary-index entry is higher than this boundary value, it is interpreted as the artificial leading key column in the primary index, as described above. If, however, a value is found that is lower than the boundary value, the pointer from the secondary index into the primary index is dereferenced using the default value for the artificial leading key column, and the old, low partition number is interpreted as the value of the new trailing key column. Thus, a table’s row and all its representations in the primary and all secondary indexes retain the initial partition number forever, but the interpretation of that number changes over time as the primary index is reorganized.
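The interpretation rule for secondary-index pointers under this design fits in a few lines. A hedged sketch, assuming integer partition numbers and a single global boundary value (both assumptions of this example, not requirements):

def primary_search_key(part, user_key, boundary):
    # Interpret the partition number stored in a secondary-index entry.
    if part > boundary:
        # Not yet reorganized: the number still names a load partition,
        # so search that partition of the primary index directly.
        return (part, user_key, None)   # NULL uniquifier, compressed away
    # Already merged: search the default partition, and reuse the old
    # number as the artificial trailing key column (the uniquifier).
    return (0, user_key, part)

Because the stored number itself never changes, secondary indexes need no maintenance when the primary index is reorganized; only the boundary value advances.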
To summarize this section, B-tree indexes with partitions defined by an artificial leading key column transfer the advantages of partitioning without some of the disadvantages. In particular, a large data insert operation or bulk insert can append runs or partitions to all B-tree indexes in a table, whether or not the load file and the indexes are sorted on the same columns, and it does so without lock conflicts and with minimal log volume. It is even possible to append runs to multiple indexes, which is particularly attractive for capturing and indexing non-repeatable data streams.

Other applications

B-tree indexes with artificial leading key columns can improve not only large inserts but also small ones. Other researchers have proposed constructing multiple coordinated structures, e.g., the log-structured adaptations of B-trees [MOP 98, OCG 96], or employing new structures with new algorithms, tuning parameters, etc., e.g., buffer trees [A 95, V 01] or the Y-tree [JDO 99]. Instead, a single traditional B-tree can be used, with multiple partitions based on an artificial leading key column and with one partition small enough to fit in memory. Inserts are directed to this small partition, and online reorganization in the background merges its records into the main partition of the B-tree. Note that this idea works for both updates and retrievals: if certain values are searched more often than others, those can be gathered into one small partition, such that searches for them can always be served entirely from the buffer pool. In a way, this design creates a self-organizing search structure.
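As a toy model of such a self-organizing structure, with Python containers standing in for B-tree pages and the buffer pool, and all names invented for the example, consider:

import bisect

class PartitionedIndex:
    def __init__(self):
        self.main = []   # partition 0: large, sorted (key, payload) pairs
        self.hot = {}    # partition 1: small, memory-resident, absorbs inserts

    def insert(self, key, payload):
        self.hot[key] = payload               # no update of the large partition

    def lookup(self, key):
        if key in self.hot:                   # served from the small partition
            return self.hot[key]
        i = bisect.bisect_left(self.main, (key,))
        if i < len(self.main) and self.main[i][0] == key:
            return self.main[i][1]
        return None

    def reorganize(self):
        # Background step: merge the hot partition into the main partition.
        merged = dict(self.main)
        merged.update(self.hot)
        self.main = sorted(merged.items())
        self.hot.clear()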
Large deletions, on the other hand, can greatly benefit from a preparatory online reorganization. First, all index entries to be deleted are assigned to a single B-tree partition, i.e., are given a new value of the artificial leading key column. When this is complete, a single large range deletion can remove all those entries from the B-tree. The preparatory reorganization is about as fast as the bulk deletion techniques described in [GKK 01], whereas the actual deletion should be an order of magnitude faster. A special application of this technique is data migration in a parallel database or any other partitioned data store, e.g., when adjusting the boundary between two partitions: first prepare all source indexes for a large deletion using small online steps, then move the data using the bulk deletion and bulk insertion strategies proposed in this paper, and finally optimize the destination indexes in small online steps. Note that the transactions performing the initial and final reorganizations are local transactions; therefore, multi-node commit processing is needed only for the actual data movement.
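The two phases can be sketched as follows, with the index modeled simply as a set of (partition, key, payload) tuples and the victim partition number chosen arbitrarily for the example:

def prepare_deletion(index, doomed, victim_part=9):
    # Phase 1, run as many small online transactions: give every entry
    # to be deleted a new value of the artificial leading key column.
    for part, key, payload in list(index):
        if part != victim_part and doomed(key, payload):
            index.remove((part, key, payload))
            index.add((victim_part, key, payload))

def range_delete(index, victim_part=9):
    # Phase 2, one fast operation: drop the entire victim partition.
    index.difference_update({e for e in index if e[0] == victim_part})

In a real system, phase 1 piggybacks on ordinary online reorganization, and phase 2 is a single range deletion in each affected B-tree.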
Another promising application, combining insertions and deletions, is capturing and holding a window within a continuous stream of incoming data, e.g., sensor data. The insertions may be grouped, sorted, and inserted as a batch similar to traditional bulk data import; alternatively, random insertions can always focus on the latest, smallest, most active, and therefore memory-resident partition. If only a window of recent data is to be retained and items older than a preset threshold are to be deleted, e.g., in order to analyze auto-correlations or periodic phenomena within a stream, deletions can similarly either be batched or focused on a single, small, and thus memory-resident partition.
Incremental index maintenance over continuous input streams also enables a symmetric dataflow join that mirrors the benefits of earlier proposals [DST 02, WA 91]: two indexes are built on the two join inputs, input from either side is accepted at any time, and each arriving item is matched against the index that currently exists on the other input. This strategy closely mirrors earlier prototypes of symmetric hash join; the difference is that the in-memory hash table and its overflow mechanism are replaced by a B-tree and the standard buffer manager support for B-tree indexes.
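A compact sketch of the idea, with dictionaries standing in for the two B-trees and their buffer-pool residency (the class and method names are invented for this example):

class SymmetricIndexJoin:
    def __init__(self):
        self.index = ({}, {})   # one "B-tree" per join input: key -> rows

    def arrive(self, side, key, row):
        # Index the new row, then probe the index on the other input.
        self.index[side].setdefault(key, []).append(row)
        matches = self.index[1 - side].get(key, [])
        return [(row, m) if side == 0 else (m, row) for m in matches]

join = SymmetricIndexJoin()
join.arrive(0, 7, "left row")     # no match yet, returns []
join.arrive(1, 7, "right row")    # yields [('left row', 'right row')]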
Presuming incremental online index reorganization is available, the techniques discussed above for creating B-tree indexes can be extended to other index operations, e.g., changing the schema of the records stored and indexed. A typical example is changing an existing column’s type, length, or precision, e.g., from an integer or a decimal numeric type to a floating-point type. If all records in the index are modified immediately as part of the change statement, such a simple statement may run for a long time, typically while holding an exclusive lock on the entire index or even the entire table. Incremental online reorganization is an attractive alternative, although it implies that the index contains both old and new records for a while, and that the index must support both queries and updates for this mixture of record formats, which introduces new complexity. If records and their keys are not normalized, records of old and new formats can be compared correctly, albeit quite expensively – every single comparison must consider the formats of the two records currently being compared. If records and keys are normalized, and in particular if the normalized form is compressed, normalization of the old and new record formats might not permit correct comparisons. In that case, defining two partitions within a B-tree index is a simple and effective solution, with different normalizations in the separate partitions. It permits instant completion of the user’s request as well as efficient normalization and incremental online reorganization.
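The per-partition normalization can be pictured as a pair of decoders keyed by the artificial leading key column; the two formats below (integer versus floating point, stored here as strings) are assumptions chosen only to mirror the example above:

DECODE = {
    0: int,     # partition 0: records in the old format
    1: float,   # partition 1: records in the new format
}

def logical_key(entry):
    part, raw = entry
    return DECODE[part](raw)   # every comparison goes through the format

def search(index, key):
    # Until reorganization completes, a query must probe both partitions.
    return [e for e in index if logical_key(e) == key]

index = [(0, "42"), (1, "42.0"), (0, "7")]
assert search(index, 42) == [(0, "42"), (1, "42.0")]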
Not only different schemas and their normalizations but also different validity statuses can be assigned to different partitions. For example, online index creation “without side file” [MN 92] requires that concurrent transactions insert a deletion marker (“anti-matter”) into an index if they cannot delete an entry because the index builder has not inserted it yet. If, for some reason, it is desirable to keep a large part of an index stable, e.g., in order to limit the size of incremental backups, insertions and deletions may all be inserted into a separate partition, using anti-matter to represent deletions. Of course, this is very similar to differential files [SL 76], but applied specifically within B-tree indexes in database systems. In a variation of this technique, if insertions and deletions are marked with time stamps, multiple partitions can serve as the main B-tree and its associated version store in multi-version concurrency control and snapshot isolation. The required query execution techniques are very similar to those required in some execution plans for large updates, namely those that merge a pre-existing index with the delta stream into an entirely new index. The only difference is that the merge result is not stored in a new index structure but pipelined to the next operation in the query evaluation plan.

To summarize, there seems to be a large variety of situations in which partitioned B-tree indexes using an artificial leading key column can enable, or at least simplify the implementation of, online changes of schema and data in large databases.

Implementation notes

B-tree indexes with artificial leading key columns can be implemented at two levels. If they are not a native feature of a database management system, a database administrator can create such a column, one per index, and adjust various commands to take advantage of the resulting flexibility. For example, bulk insert commands must assign suitable values to these columns, constraints must restrict these columns to a single constant value at most times, and histograms must exist for the columns beyond the artificial leading key column. Online index reorganization can be achieved using scripts that search for indexes with multiple values in the artificial leading key column and, when found, assign new values to some rows selected by ranges of the trailing index columns. While this method is more cumbersome and less efficient than a native implementation, it achieves many of the desired benefits.
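As a concrete illustration of this administrator-level emulation, with an invented table, sqlite3 standing in for the host DBMS, and the partition numbering chosen arbitrarily, the following script appends a load batch as its own partition and then reorganizes one key range at a time:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (part INTEGER NOT NULL, k INTEGER, v TEXT)")
con.execute("CREATE INDEX t_idx ON t (part, k)")   # artificial leading key column

# Bulk insert appends partition 1 instead of touching partition 0.
con.executemany("INSERT INTO t VALUES (1, ?, ?)",
                [(k, "row %d" % k) for k in range(1000)])

# Online reorganization: renumber one range of trailing key values per step.
con.execute("UPDATE t SET part = 0 WHERE part = 1 AND k < 500")
con.commit()

# A query must probe every partition value currently in use.
hits = con.execute("SELECT part, v FROM t WHERE k = 250 AND part IN "
                   "(SELECT DISTINCT part FROM t)").fetchall()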
If, on the other hand, the vendor deems the advantages discussed so far important enough, the artificial leading key column can be a hidden implementation feature of B-tree indexes: whether a customer wants it or not, it is always there. Prefix truncation ensures that its space usage and its overhead in individual comparisons are negligible; soft constraints ensure that it does not introduce additional root-to-leaf B-tree searches (i.e., in most cases the optimizer can exploit that there is only a single value); and a suitable histogram implementation ensures that the artificial leading key column does not confuse selectivity estimation in the query optimizer. Online index reorganization is a great advantage in this implementation strategy.

In order to ensure correct transaction semantics, traditional locking mechanisms suffice. For example, index reorganization today employs many small system transactions – these transactions do not change database contents, only data representation, and therefore can commit even if the invoking transaction might roll back, and they can commit without forcing the commit record to stable storage. If system transactions are small, e.g., B-tree page splits or small reorganizations, a “no steal” buffer policy permits omitting “undo” log records [HR 83]. Commercial database management systems already employ various techniques to avoid “redo” log records for index creation. During run generation in online index creation, each transaction scans a range of input data, produces as many runs as necessary, inserts run descriptors into the auxiliary table, updates the boundary value in the predicate describing the coverage of the new index, and then commits. Concurrent user transactions retain ordinary locks on the auxiliary table and thus prevent runs from vanishing during a user transaction; this is the reason why concurrent user transactions should be small and short, and one of the ways in which large concurrent transactions reduce the efficiency of online index creation.
interesting and helpful suggestions on earlier drafts of this
of online index creation.
paper or its contents, including David Campbell, John
Carlis, César Galindo Legaria, Jim Gray, James Hamilton,
Summary and conclusions

The purpose of this paper has been to re-think techniques that might have seemed completely understood. It turns out that a fairly simple and maybe surprising technique can substantially increase the performance and capabilities of sorting and indexing, in particular in the increasingly important aspects of self-tuning and self-management as well as online operations for continuous availability. Levels of resource allocation in a multi-user server must adapt fast to be truly useful, and online index operations are an important feature for both low-end and high-end database installations: at the high end, online operations improve service availability, and at the low end, they are a crucial enabler of automatic index tuning, because automatically dropping and creating an index is acceptable and invisible only if all applications and all data remain available at all times, without “random” interruptions of service due to automatically initiated tuning or maintenance.

The central idea of this paper is to employ ordinary B-trees in a slightly unusual way, namely by introducing an artificial leading key column and thus, effectively, partitioning within a single B-tree. Earlier sections reviewed possible concerns and overheads, most of which can be overcome or reduced to a truly negligible level, as well as the many benefits of the proposed change.

Supporting multiple partitions in a single B-tree index is an extraordinarily powerful concept, in particular if combined with incremental online reorganization using the merge logic well known from external sorting. It permits using a single B-tree to store all runs in an external merge sort, which in turn enables relatively straightforward implementations of important dynamic sorting techniques, including deep forecasting when merging runs from many disk drives as well as dynamic memory management, even to the extremes of running multiple merge steps for separate key ranges concurrently and of pausing a merge step at any point to resume it later without wasting any work. Partitioning within a single B-tree also enables practically useful advances in online index operations, e.g., index creation or schema modification. The most exciting among these is that a new index can be made available to queries in half the time of the traditional method, and even less if partial indexes are exploited by the query optimizer. Finally, partitioning within a single B-tree can be exploited for speedier updates and retrievals, most importantly bulk insertion, which can proceed with the speed of index creation even when adding new records within the preexisting key ranges of multiple populated B-tree indexes.

References

[A 95] Lars Arge: The Buffer Tree: A New Technique for Optimal I/O-Algorithms. Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science 955, Springer Verlag, 1995: 334-345.
[ACN 00] Sanjay Agrawal, Surajit Chaudhuri, Vivek R. Narasayya: Automated Selection of Materialized Views and Indexes in SQL Databases. VLDB Conf. 2000: 496-505.
[AON 96] Kiran J. Achyutuni, Edward Omiecinski, Shamkant B. Navathe: Two Techniques for On-line Index Modification in Shared Nothing Parallel Databases. SIGMOD Conf. 1996: 125-136.
[BM 72] Rudolf Bayer, Edward M. McCreight: Organization and Maintenance of Large Ordered Indices. Acta Informatica 1: 173-189 (1972).
[BU 77] Rudolf Bayer, Karl Unterauer: Prefix B-Trees. ACM TODS 2(1): 11-26 (1977).
[CAB 88] George P. Copeland, William Alexander, Ellen E. Boughter, Tom W. Keller: Data Placement In Bubba. SIGMOD Conf. 1988: 99-108.
[CAB 93] Richard L. Cole, Mark J. Anderson, Robert J. Bestgen: Query Processing in the IBM Application System/400. IEEE Data Eng. Bull. 16(4): 19-28 (1993).
[DR 01] Kurt W. Deschler, Elke A. Rundensteiner: B+ Retake: Sustaining High Volume Inserts into Large Data Pages. ACM Fourth Int’l Workshop on Data Warehousing and OLAP, Atlanta, GA, 2001.
[DST 02] Jens-Peter Dittrich, Bernhard Seeger, David Scot Taylor, Peter Widmayer: Progressive Merge Join: A Generic and Non-blocking Sort-based Join Algorithm. VLDB Conf. 2002.
[GKK 01] Andreas Gärtner, Alfons Kemper, Donald Kossmann, Bernhard Zeller: Efficient Bulk Deletes in Relational Databases. ICDE 2001: 183-192.
[GL 01] Goetz Graefe, Per-Åke Larson: B-Tree Indexes and CPU Caches. ICDE 2001: 349-358.
[GSZ 01] Jarek Gryz, K. Bernhard Schiefer, Jian Zheng, Calisto Zuzarte: Discovery and Application of Check Constraints in DB2. ICDE 2001: 551-556.
[H 77] Theo Härder: A Scan-Driven Sort Facility for a Relational Database System. VLDB Conf. 1977: 236-244.
[H 81] Wilfred J. Hansen: A Cost Model for the Internal Organization of B+-Tree Nodes. ACM TOPLAS 3(4): 508-532 (1981).
[HD 91] Hui-I Hsiao, David J. DeWitt: A Performance Study of Three High Availability Data Replication Strategies. PDIS 1991: 18-28.
[HR 83] Theo Härder, Andreas Reuter: Principles of Transaction-Oriented Database Recovery. ACM Computing Surveys 15(4): 287-317 (1983).
[JDO 99] Chris Jermaine, Anindya Datta, Edward Omiecinski: A Novel Index Supporting High Volume Data Warehouse Insertion. VLDB Conf. 1999: 235-246.
[JNS 97] H. V. Jagadish, P. P. S. Narayan, S. Seshadri, S. Sudarshan, Rama Kanneganti: Incremental Organization for Data Recording and Warehousing. VLDB Conf. 1997: 16-25.
[JOY 02] Chris Jermaine, Edward Omiecinski, Wai-Gen Yee: Out From Under the Trees. Technical Report, Georgia Inst. of Technology, 2002.
[K 73] Donald E. Knuth: The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, 1973.
[L 95] David Lomet, personal communication, 1995.
[L 98] Per-Åke Larson, personal communication, 1998.
[LG 98] Per-Åke Larson, Goetz Graefe: Memory Management during Run Generation in External Sorting. SIGMOD Conf. 1998: 472-483.
[LKO 00] Mong-Li Lee, Masaru Kitsuregawa, Beng Chin Ooi, Kian-Lee Tan, Anirban Mondal: Towards Self-Tuning Data Placement in Parallel Database Systems. SIGMOD Conf. 2000: 225-236.
[MHL 92] C. Mohan, Donald J. Haderle, Bruce G. Lindsay, Hamid Pirahesh, Peter Schwarz: ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM TODS 17(1): 94-162 (1992).
[MN 92] C. Mohan, Inderpal Narang: Algorithms for Creating Indexes for Very Large Tables Without Quiescing Updates. SIGMOD Conf. 1992: 361-370.
[MOP 98] Peter Muth, Patrick E. O’Neil, Achim Pick, Gerhard Weikum: Design, Implementation, and Performance of the LHAM Log-Structured History Data Access Method. VLDB Conf. 1998: 452-463.
[MS 98] Microsoft Corp.: SQL Server 7.0, product documentation, 1998.
[O 86] Patrick E. O’Neil: The Escrow Transactional Method. ACM TODS 11(4): 405-430 (1986).
[O 93] Edward Omiecinski: An Analytical Comparison of Two Secondary Index Schemes: Physical versus Logical Addresses. Inform. Sys. 18(5): 319-328 (1993).
[OCG 96] Patrick E. O’Neil, Edward Cheng, Dieter Gawlick, Elizabeth J. O’Neil: The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica 33(4): 351-385 (1996).
[PCL 93] HweeHwa Pang, Michael J. Carey, Miron Livny: Memory-Adaptive External Sorting. VLDB Conf. 1993: 618-629.
[RMF 00] Frank Ramsak, Volker Markl, Robert Fenk, Martin Zirkel, Klaus Elhardt, Rudolf Bayer: Integrating the UB-Tree into a Database System Kernel. VLDB Conf. 2000: 263-272.
[S 89a] Betty Salzberg: Merging Sorted Runs Using Large Main Memory. Acta Informatica 27(3): 195-215 (1989).
[S 89b] Michael Stonebraker: The Case for Partial Indexes. SIGMOD Record 18(4): 4-11 (1989).
[S 94] Leonard D. Shapiro, personal communication, 1994.
[SC 92] V. Srinivasan, Michael J. Carey: Performance of On-Line Index Construction Algorithms. EDBT Conf. 1992: 293-309.
[SL 76] Dennis G. Severance, Guy M. Lohman: Differential Files: Their Application to the Maintenance of Large Databases. ACM TODS 1(3): 256-267 (1976).
[SS 95] Praveen Seshadri, Arun N. Swami: Generalized Partial Indexes. ICDE 1995: 420-427.
[V 01] Jeffrey Scott Vitter: External Memory Algorithms and Data Structures. ACM Computing Surveys 33(2): 209-271 (2001).
[VZZ 00] Gary Valentin, Michael Zuliani, Daniel C. Zilio, Guy M. Lohman, Alan Skelley: DB2 Advisor: An Optimizer Smart Enough to Recommend Its Own Indexes. ICDE 2000: 101-110.
[WA 91] Annita N. Wilschut, Peter M. G. Apers: Dataflow Query Execution in a Parallel Main-Memory Environment. PDIS 1991: 68-77.
[ZL 97] Weiye Zhang, Per-Åke Larson: Dynamic Memory Adjustment for External Mergesort. VLDB Conf. 1997: 376-385.
[ZL 98] Weiye Zhang, Per-Åke Larson: Buffering and Read-Ahead Strategies for External Mergesort. VLDB Conf. 1998: 523-533.