Module - 4: 10.1 Indexed Sequential Access
Consideration 2: Reading in or writing out a block should not take very long. Even if we had
an unlimited amount of memory, we would want to place an upper limit on the block size so
we would not end up reading in the entire file just to get at a single record.
10.3 ADDING A SIMPLE INDEX TO THE SEQUENCE SET
Let's see whether we can find an efficient way to locate some specific block containing a
particular record, given the record's key. We can view each of our blocks as containing a
range of records, as illustrated in Fig. 10.2.
It is easy to see how we could construct a simple, single-level index for these blocks. We
might choose, for example, to build an index of fixed-length records that contain the key for
the last record in each block, as shown in Fig. 10.3. Note that we are using the largest key in
the block as the key of the whole block. The combination of this kind of index with the
sequence set of blocks provides complete indexed sequential access. If we need to retrieve a
specific record we consult the index and then retrieve the correct block; if we need sequential
access we start at the first block and read through the linked list of blocks until we have read
them all.
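The two kinds of access described above can be sketched as follows. This is a minimal illustration, not the book's implementation: the block contents are made up, a Python list stands in for the linked list of blocks, and the names `find_block`, `lookup`, and `scan` are hypothetical.

```python
from bisect import bisect_left

# Hypothetical sequence set: each block is a sorted list of keys.
blocks = [
    ["ADAMS", "BAIRD", "BERNE"],
    ["BOLEN", "CAGE"],
    ["CAMP", "DUTTON"],
    ["EMBRY", "EVANS"],
]

# Simple index: the largest (last) key of each block, in block order.
index = [block[-1] for block in blocks]

def find_block(key):
    """Return the position of the only block that could hold key, or None."""
    pos = bisect_left(index, key)  # first block whose largest key is >= key
    return pos if pos < len(blocks) else None

def lookup(key):
    """Indexed access: consult the index, then search the chosen block."""
    pos = find_block(key)
    return pos is not None and key in blocks[pos]

def scan():
    """Sequential access: read every block in order."""
    for block in blocks:
        yield from block
```

The binary search touches only the in-memory index; a single block is then read to look for the record itself.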
The requirement that the index be held in memory is important for two reasons:
Since this is a simple index, we find specific records by means of a binary search of
the index. Binary searching works well if the searching takes place in memory, but it
requires too many seeks if the file is on a secondary storage device.
As the blocks in the sequence set are changed through splitting, merging, and redistribution,
the index has to be updated. Updating a simple, fixed-length record index of this kind works
well if the index is relatively small and contained in memory. If, however, the updating
requires seeking to individual index records on disk, the process can become very expensive.
10.4 THE CONTENT OF THE INDEX: SEPARATORS INSTEAD OF KEYS
The purpose of the index we are building is to assist us when we are searching for a record
with a specific key. The index must guide us to the block in the sequence set that contains the
record, if it exists in the sequence set at all. Given this view of the index set, we can take the
very important step of recognizing that we do not need to have keys in the index set. Our real
need is for separators. Figure 10.4 shows one possible set of separators for the sequence set
in Fig. 10.2.
Note that there are many potential separators capable of distinguishing between two blocks.
For example, all of the strings shown between blocks 3 and 4 in Fig. 10.5 are capable of
guiding us in our choice between the blocks as we search for a particular key.
If a string comparison between the key and any of these separators shows that the key
precedes the separator, we look for the key in block 3. If the key follows the separator, we
look in block 4.
If we are willing to treat the separators as variable-length entities within our index structure
(we talk about how to do this later), we can save space by placing the shortest separator in
the index structure. Consequently, we use E as the separator to guide our choice between
blocks 3 and 4. Note that there is not always a unique shortest separator. For example, BK,
BN, and BO are separators that are all the same length and are equally effective as separators
between blocks 1 and 2 in Fig. 10.4. We choose BO and all of the other separators contained
in Fig. 10.4 by using the logic embodied in the C++ function shown in Fig. 10.6. We must
decide to retrieve the block to the right of the separator or the one to the left of the separator
according to the following rule:
Relation of search key and separator    Decision
Key < separator                         Go left
Key = separator                         Go right
Key > separator                         Go right
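As a rough sketch of both ideas, the code below picks a shortest separator and then applies the decision rule. It is written in Python and is not the C++ function of Fig. 10.6. It assumes the separator is taken as the shortest prefix of the first key in the right-hand block that still compares greater than the last key in the left-hand block, which is consistent with choosing BO rather than BK or BN. The keys BERNE, BOLEN, and DUTTON are illustrative assumptions, and both function names are hypothetical.

```python
from bisect import bisect_right

def shortest_separator(left_last, right_first):
    """Shortest prefix of right_first that compares greater than left_last."""
    assert left_last < right_first, "keys must be in ascending order"
    for n in range(1, len(right_first) + 1):
        prefix = right_first[:n]
        if prefix > left_last:
            return prefix
    return right_first  # not reached when the assertion above holds

def choose_block(separators, key):
    """Key < separator goes left; key >= separator goes right.

    Returns a 0-based position, so position 3 is the fourth block.
    """
    return bisect_right(separators, key)
```

With these assumed keys, `shortest_separator("BERNE", "BOLEN")` gives `"BO"` and `shortest_separator("DUTTON", "EMBRY")` gives `"E"`. Because each separator is a prefix of the first key to its right, a search key equal to the separator must go right, which is why `bisect_right` rather than `bisect_left` matches the rule.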
Since the number of sequence set blocks is unchanged and since no records are moved
between blocks, the index set can also remain unchanged. This is easy to see in the case of
the EMBRY deletion: E is still a perfectly good separator for sequence set blocks 3 and 4, so
there is no reason to change it in the index set.
The effect of inserting into the sequence set new records that do not cause block splitting is
much the same as the effect of these deletions that do not result in merging. The index set
remains unchanged. Suppose, for example, that we insert a record for EATON. Following the
path indicated by the separators in the index set, we find that we will insert the new record
into block 4 of the sequence set. The new record becomes the first record in block 4, but no
change in the index set is necessary.
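A small sketch of these two cases follows. The block contents and the three-separator index are assumptions made for illustration; only EMBRY, EATON, and the separator E come from the text. The helper names are hypothetical.

```python
from bisect import bisect_right, insort

# Assumed contents for blocks 1 to 4 (positions 0 to 3) and their separators.
separators = ["BO", "CAM", "E"]
blocks = [
    ["ADAMS", "BERNE"],
    ["BOLEN", "CAGE"],
    ["CAMP", "DUTTON"],
    ["EMBRY", "EVANS"],
]

def block_for(key):
    """Follow the separators to the block position where key belongs."""
    return bisect_right(separators, key)

def separators_still_valid():
    """Check that every key lies on the correct side of its neighboring separators."""
    for i, block in enumerate(blocks):
        low = separators[i - 1] if i > 0 else ""
        high = separators[i] if i < len(separators) else None
        for key in block:
            if key < low or (high is not None and key >= high):
                return False
    return True

before = list(separators)
blocks[block_for("EMBRY")].remove("EMBRY")   # deletion without merging
insort(blocks[block_for("EATON")], "EATON")  # insertion without splitting
```

After both operations EATON is the first key in block 4, `separators` still equals `before`, and `separators_still_valid()` still holds, so nothing in the index set needs to change.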
10.6.2 Changes Involving Multiple Blocks in the Sequence Set
We begin with an insertion into the sequence set shown in Fig. 10.8. Specifically, let's
assume that there is an insertion into the first block and that this insertion causes the block to
split. A new block (block 7) is brought in to hold the second half of what was originally the
first block. This new block is linked into the correct position in the sequence set, following
block 1 and preceding block 2 (these are the physical block numbers). These changes to the
sequence set are illustrated in Fig. 10.9.
Now let's suppose we delete a record from block 2 of the sequence set and that this causes an
underflow condition and consequent merging of blocks 2 and 3. Once the merging is
complete, block 3 is no longer needed in the sequence set, and the separator that once
distinguished between blocks 2 and 3 must be removed from the index set. Removing this
separator, CAM, causes an underflow in an index set node. Consequently, there is another
merging, this time in the index set, that results in the demotion of the BO separator from the
root, bringing it back down into a node with the AY separator. Once these changes are
complete, the simple prefix B+ tree has the structure illustrated in Fig. 10.10.
Record insertion and deletion always take place in the sequence set, since that is where the
records are. If splitting, merging, or redistribution is necessary, perform the operation just as
you would if there were no index set at all. Then, after the record operations in the sequence
set are complete, make changes as necessary in the index set:
If blocks are split in the sequence set, a new separator must be inserted into the index
set;
If blocks are merged in the sequence set, a separator must be removed from the index
set; and
If records are redistributed between blocks in the sequence set, the value of a
separator in the index set must be changed.
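The three rules can be sketched with a deliberately simplified model: the sequence set is a Python list of sorted blocks in logical order, and the index set is a single flat list in which `separators[i]` sits between `blocks[i]` and `blocks[i + 1]`. A real index set is a multilevel tree whose nodes can themselves split and merge, and blocks are linked by physical block number; neither is modeled here. All function names are hypothetical.

```python
def shortest_separator(left_last, right_first):
    """Shortest prefix of right_first that compares greater than left_last."""
    for n in range(1, len(right_first) + 1):
        if right_first[:n] > left_last:
            return right_first[:n]
    return right_first

def split(blocks, separators, i):
    """Split block i in half: a new separator is inserted into the index."""
    block = blocks[i]
    mid = len(block) // 2
    left, right = block[:mid], block[mid:]
    blocks[i:i + 1] = [left, right]
    separators.insert(i, shortest_separator(left[-1], right[0]))

def merge(blocks, separators, i):
    """Merge block i+1 into block i: the separator between them is removed."""
    blocks[i:i + 2] = [blocks[i] + blocks[i + 1]]
    del separators[i]

def redistribute(blocks, separators, i):
    """Even out blocks i and i+1: the separator between them changes value."""
    combined = blocks[i] + blocks[i + 1]
    mid = len(combined) // 2
    blocks[i], blocks[i + 1] = combined[:mid], combined[mid:]
    separators[i] = shortest_separator(combined[mid - 1], combined[mid])
```

In each function the record-level work on the blocks is done first, and the index adjustment is the last step, mirroring the order given above.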
10.7 INDEX SET BLOCK SIZE
The physical size of a node for the index set is usually the same as the physical size of a
block in the sequence set. There are a number of reasons for using a common block size for
the index and sequence sets:
The block size for the sequence set is usually chosen because there is a good fit among
this block size, the characteristics of the disk drive, and the amount of memory available.
The choice of an index set block size is governed by consideration of the same factors;
therefore, the block size that is best for the sequence set is usually best for the index set.
The index set blocks and sequence set blocks are often mingled within the same file to
avoid seeking between two separate files while accessing the simple prefix B+ tree. Use
of one file for both kinds of blocks is simpler if the block sizes are the same.
Separators (concatenated): AsBaBroCChCraDeleEdiErrFaFle
Index to separators: 00 02 04 07 08 10 13 17 20 23 25
Let's suppose, once again, that we are looking for a record with the key "Beck" and that the
search has brought us to the index set block pictured in Fig. 10.12. The total length of the
separators and the separator count allow us to find the beginning, the end, and consequently
the middle of the index to the separators. We perform a binary search of the separators
through this index, finally concluding that the key "Beck" falls between the separators "Ba"
and "Bro". Conceptually, the relation between the keys and the RBNs is illustrated in Fig.
10.13.
As Fig. 10.13 makes clear, discovering that the key falls between "Ba" and "Bro" allows us
to decide that the next block we need to retrieve has the RBN stored in the B02 position of
the RBN vector. This next block could be another index set block and thus another block of
the road map, or it could be the sequence set block that we are looking for. In either case, the
quantity and arrangement of information in the current index set block is sufficient to let us
conduct our binary search within the index block and proceed to the next block in the simple
prefix B+ tree.
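The search within one index set block can be sketched as below, using the concatenated separators and the index to the separators given earlier. The RBN vector is represented by the placeholder labels B00 through B11, which are assumptions standing in for real block numbers, and Python's `len()` stands in for the stored separator count and total separator length. The function names are hypothetical.

```python
# Contents of one index set block.
separators = "AsBaBroCChCraDeleEdiErrFaFle"
offsets = [0, 2, 4, 7, 8, 10, 13, 17, 20, 23, 25]  # start of each separator

# One more RBN than separators; the labels are placeholders, not real RBNs.
rbns = [f"B{n:02d}" for n in range(len(offsets) + 1)]

def separator_at(i):
    """Slice separator i out of the concatenated string using the offsets."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(separators)
    return separators[offsets[i]:end]

def next_rbn(key):
    """Binary search through the index to the separators.

    A key smaller than a separator goes left; an equal or larger key goes right.
    """
    lo, hi = 0, len(offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        if key < separator_at(mid):
            hi = mid
        else:
            lo = mid + 1
    return rbns[lo]
```

For the key "Beck" the search compares against "Cra", "Bro", and "Ba" in turn and settles on position 2, so `next_rbn("Beck")` returns `"B02"`.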
10.9 LOADING A SIMPLE PREFIX B+ TREE
A simple prefix B+ tree can be viewed as a sequence set with an added index, and it can
also be loaded that way: we build the sequence set first and the index set alongside it, as described below.
We can begin by sorting the records that are to be loaded. Then we can guarantee that the next
record we encounter is the next record we need to load. Working from a sorted file, we can
place the records into sequence set blocks, one by one, starting a new block when the one we
are working with fills up. As we make the transition between two sequence set blocks, we can
determine the shortest separator for the blocks. We can collect these separators into an index
set block that we build and hold in memory until it is full.
To develop an example of how this works let's assume that we have sets of records associated
with terms that are being compiled for a book index. The records might consist of a list of the
occurrences of each term.
In Fig. 10.14 we show four sequence set blocks that have been written out to the disk and one
index set block that has been built in memory from the shortest separators derived from the
sequence set block keys. As you can see, the next sequence set block consists of a set of
terms ranging from CATCH through CHECK, and therefore the next separator is CAT. Let's
suppose that the index set block is now full. We write it out to disk. Now what do we do with
the separator CAT?
Clearly, we need to start a new index block. However, we cannot place CAT into another
index block at the same level as the one containing the separators ALW, ASP, and BET
because we cannot have two blocks at the same level without having a parent block. Instead,
we promote the CAT separator to a higher-level block. However, the higher-level block
cannot point directly to the sequence set; it must point to the lower-level index blocks. This
means that we will now be building two levels of the index set in memory as we build the
sequence set. Figure 10.15 illustrates this working-on-two-levels phenomenon: the addition of
the CAT separator requires us to start a new, root-level index block as well as a lower-level
index block.
Figure 10.16 shows what the index looks like after even more sequence set blocks are added.
As you can see, the lower-level index block that contained no separators when we added CAT
to the root has now filled up.
The principal advantage is that the loading process goes more quickly because
The output can be written sequentially;
We make only one pass over the data, rather than the many passes associated with
random order insertions; and
No blocks need to be reorganized as we proceed.
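The loading procedure of this section can be sketched as follows. This is a simplified, in-memory illustration under several assumptions: blocks are Python lists rather than disk blocks, nothing is actually written out, capacities are counted in entries rather than bytes, and the function name `load` and its parameters are hypothetical. The keys other than CATCH and CHECK are made up so that the separators come out as ALW, ASP, BET, and CAT.

```python
def shortest_separator(left_last, right_first):
    """Shortest prefix of right_first that compares greater than left_last."""
    for n in range(1, len(right_first) + 1):
        if right_first[:n] > left_last:
            return right_first[:n]
    return right_first

def load(sorted_keys, records_per_block, seps_per_index_block):
    """Build sequence set blocks and index levels from already sorted keys."""
    # Fill sequence set blocks one by one, starting a new block when one fills up.
    blocks = [sorted_keys[i:i + records_per_block]
              for i in range(0, len(sorted_keys), records_per_block)]

    # levels[0] is the lowest index level; each level is a list of index blocks.
    levels = []

    def add(level, sep):
        if level == len(levels):
            levels.append([[]])           # first block at a new, higher level
        current = levels[level][-1]
        if len(current) < seps_per_index_block:
            current.append(sep)
        else:
            levels[level].append([])      # full: start a new block at this level
            add(level + 1, sep)           # and promote the separator

    # One separator for each transition between adjacent sequence set blocks.
    for left, right in zip(blocks, blocks[1:]):
        add(0, shortest_separator(left[-1], right[0]))
    return blocks, levels

keys = ["ACCESS", "ALSO", "ALWAYS", "AS", "ASPECT",
        "BEST", "BETTER", "CAST", "CATCH", "CHECK"]
blocks, levels = load(keys, records_per_block=2, seps_per_index_block=3)
```

With three separators per index block, `levels[0]` ends up holding the full block ALW, ASP, BET followed by a new, still empty block, and `levels[1]` holds a root block containing only CAT.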
10.10 B+ TREES
The difference between a simple prefix B+ tree and a plain B+ tree is that the latter
structure does not involve the use of prefixes as separators. Instead, the separators in the
index set are simply copies of the actual keys. Contrast the index set block shown in Fig.
10.17, which illustrates the initial loading steps for a B+ tree, with the index block that is
illustrated in Fig. 10.14, where we are building a simple prefix B+ tree.
There are, however, at least two factors that might favor a B+ tree that uses
full copies of keys as separators.
The reason for using shortest separators is to pack more of them into an index set
block. This implies, ineluctably, the use of variable-length fields within the index
set blocks. For some applications the cost of the extra overhead required to
maintain and use this variable-length structure outweighs the benefits of shorter
separators. In these cases one might choose to build a straightforward B+ tree
using fixed-length copies of the keys from the sequence set as separators.
Some key sets do not show much compression when the simple prefix method is
used to produce separators. For example, suppose the keys consist of large,
consecutive alphanumeric sequences such as 34C18K756, 34C18K757,
34C18K758, and so on. In this case, to enjoy appreciable compression, we need
to use compression techniques that remove redundancy from the front of the key.
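The second factor is easy to check with a short sketch. It assumes, as a worst case, that the last key of one block and the first key of the next are consecutive values from the sequence above, and it reuses the prefix rule for shortest separators as an assumed definition. `os.path.commonprefix` is used only as a convenient character-wise common-prefix helper.

```python
import os.path

def shortest_separator(left_last, right_first):
    """Shortest prefix of right_first that compares greater than left_last."""
    for n in range(1, len(right_first) + 1):
        if right_first[:n] > left_last:
            return right_first[:n]
    return right_first

keys = ["34C18K756", "34C18K757", "34C18K758"]

# Separator between each pair of neighboring keys.
seps = [shortest_separator(a, b) for a, b in zip(keys, keys[1:])]

# The redundancy sits at the front of the keys, not at the end.
shared_front = os.path.commonprefix(keys)
```

Here every separator is the full nine-character key, so the simple prefix method saves nothing, while the eight characters in `shared_front` are repeated in every key. That repeated front is what a front-compression technique would remove.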
10.11 B-TREES, B+ TREES, AND SIMPLE PREFIX B+ TREES IN PERSPECTIVE
B-trees, B+ trees, and simple prefix B+ trees are not a panacea. However, they do have
broad applicability, particularly for situations that require the ability to access a large file
sequentially, in order by key, and through an index. All three of these tools share the
following characteristics:
They are all paged index structures, which means that they bring entire blocks of
information into memory at once. As a consequence, it is possible to choose between a
great many alternatives (for example, the keys for hundreds of thousands of records)
with just a few seeks out to disk storage. The shape of these trees tends to be broad and
shallow.
All three approaches maintain height-balanced trees. The trees do not grow in an uneven
way, which would result in some potentially long searches for certain keys.
In all cases the trees grow from the bottom up. Balance is maintained through block
splitting, merging, and redistribution.
With all three structures it is possible to obtain greater storage efficiency through the use
of two-to-three splitting and of redistribution in place of block splitting when possible.
All three approaches can be implemented as virtual tree structures in which the most
recently used blocks are held in memory.
Any of these approaches can be adapted for use with variable-length records using
structures inside a block similar to those outlined in this chapter.
For all of this similarity, there are some important differences. These differences are
brought into focus through a review of the strengths and unique characteristics of each of
these file structures.
B-Trees
B-trees are multilevel indexes to data files that are entry sequenced. This is the
simplest type of B-tree to implement and is a very efficient representation for most
cases. The strengths of this approach are the simplicity of implementation, the inherent
efficiency of indexing, and a maximization of the breadth of the B-tree. The major
weakness of this strategy is the lack of organization of the data file, resulting in an
excessive amount of seeking for sequential access.
B+ Trees
The primary difference between the B+ tree and the B-tree is that in the B+ tree all the
key and record information is contained in a linked set of blocks known as the sequence
set. The key and record information is not in the upper-level, treelike portion of the B+
tree. Indexed access to this sequence set is provided through a conceptually separate
structure called the index set.
There are three significant advantages that the B+ tree structure provides over the B-tree:
The sequence set can be processed in a truly linear, sequential way, providing
efficient access to records in order by key.
The index is built with a single key or separator per block of data records
instead of one key per data record.
The size of the lowest-level index is reduced by the blocking factor of the data
file. Since there are fewer keys, the index is smaller and hence shallower.
Simple Prefix B+ Trees
The simple prefix B+ tree builds on the advantages of the B+ tree by making the separators in the
index set smaller than the keys in the sequence set, rather than just using copies of these keys. If
the separators are smaller, we can fit more of them into a block to obtain a higher branching
factor out of the block. In a sense, the simple prefix B+ tree takes one of the strongest features of
the B+ tree one step farther.
The price we have to pay to obtain this separator compression and consequent increase in
branching factor is that we must use an index set block structure that supports variable-length
fields.
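To get a feel for the trade-off, the rough arithmetic below compares the two kinds of index entry. Every number in it is a made-up assumption chosen only for illustration: 512-byte index blocks, 4-byte block pointers, 20-byte full keys, separators averaging 4 bytes, and a 2-byte offset entry per separator as the variable-length overhead. The function names are hypothetical.

```python
def entries_per_block(block_size, entry_size):
    """How many index entries of a given size fit in one block."""
    return block_size // entry_size

def index_levels(num_blocks, fanout):
    """Number of index levels needed above num_blocks sequence set blocks."""
    levels, nodes = 0, num_blocks
    while nodes > 1:
        nodes = -(-nodes // fanout)  # ceiling division
        levels += 1
    return levels

# Assumed sizes in bytes; none of these figures come from the text.
BLOCK, POINTER, FULL_KEY, AVG_SEPARATOR, OFFSET = 512, 4, 20, 4, 2

full_key_fanout = entries_per_block(BLOCK, FULL_KEY + POINTER)
prefix_fanout = entries_per_block(BLOCK, AVG_SEPARATOR + OFFSET + POINTER)
```

Under these assumptions a block holds 21 full-key entries but 51 separator entries, and for one million sequence set blocks `index_levels` gives 5 levels in the first case and 4 in the second. The offset entries are the cost of the variable-length structure; with longer separators or larger overhead the advantage shrinks.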