Chapter 7 Indexing
Chapter 7 Indexing
Chapter 7
CS2202‐File Organization
2021‐2022
Chapter 7
Indexing
1
Overview
• An index is a table containing a list of keys associated with
a reference field pointing to the record.
• A simple index is simply an array of (key, reference) pairs.
• An index lets you impose order on a file without
rearranging the file.
- This makes record addition much less expensive than they are with
a sorted file.
- Indexing provides multiple access paths.
- You can have different indexes for the same data file.
• Indexing also give us keyed access to variable-length
record files.
2
A Simple Index for Entry-Sequenced
Files I
• Suppose that we are looking at a collection of
recordings with the following information about each
of them:
– Identification Number
– Title
– Composer
– Artist
– Label (publisher)
3
Contents of Sample Recording File
4
A Simple Index for Entry-Sequenced
Files II
• We choose to organize the file as a series of
variable-length record with a size field
preceding each record.
- The fields within each record are also of variable-
length but are separated by delimiters.
• We form a primary key by concatenating the
record company label code and the record’s
ID number.
- This should form a unique identifier.
5
A Simple Index for Entry-Sequenced
Files III
• In order to provide rapid key access, we build a
simple index with a key field associated with a
reference field which provides the address of the
first byte of the corresponding data record.
• The index may be sorted while the file does not
have to be.
- This means that the data file may be entry
sequenced: the record occur in the order they are
entered in the file.
6
Index of Sample Recording File
7
A Simple Index for Entry-Sequenced
Files IV
• A few comments about our Index Organization:
– The index is easier to use than the data file because
1. it uses fixed-length records
2. it is likely to be much smaller than the data file.
– By requiring fixed-length records in the index file, we
impose a limit on the size of the primary key field.
– The index could carry more information than the key
and reference fields.
– e.g., we could keep the length of each data file record in the
index as well.
8
Basic Operations on an Indexed
Entry-Sequenced File
• Assumption: the index is small enough to be held in memory.
Later on, we will see what can be done when this is not the
case.
• Operations in order to maintain an indexed file
– Create the original empty index and data files
– Load the index into memory before using it.
– Rewrite the index file from memory after using it.
– Add records to the data file and index.
– Delete records from the data file.
– Update records in the data file.
9
Creating, Loading and Re-writing
• The index is represented as an array of records.
– The loading into memory can be done sequentially,
reading a large number of index records (which are short)
at once.
• What happens if the index changed but its re-writing
does not take place or takes place incompletely?
– Use a mechanism for indicating whether or not
the index is out of date.
– Have a procedure that reconstructs the index from
the data file in case it is out of date.
10
Record Addition
• When we add a record, both the data file and the index
should be updated.
• In the data file, the record can be added anywhere.
- However, the byte-offset of the new record should be saved.
• Since the index is sorted, the location of the new record
does matter.
- We have to shift all the records that belong after the one we
are inserting to open up space for the new record.
- However, this operation is not too costly as it is performed in
memory.
11
Record Deletion
• Record deletion can be done using the
methods discussed last lecture.
• In addition, the index record corresponding to
the data record being deleted must also be
deleted.
– Once again, since this deletion takes place in
memory, the record shifting is not too costly.
12
Record Updating
• Record updating falls into two categories:
– The update changes the value of the key field.
– The update does not affect the key field.
• In the first case, both the index and data file may need
to be reordered.
- The update is easiest to deal with if it is conceptualized as a
delete followed by an insert (but the user needs not know
about this).
• In the second case, the index does not need reordering,
but the data file may.
- If the updated record is smaller than the original one, it can be
re-written at the same location.
- However, if it is larger, then a new spot has to be found for it.
Again the delete/insert solution can be used.
13
Indexes that are too large to hold
in memory I
• The indexes that we have considered before could
fit into main memory.
• If this is not the case, we have the following
problems:
– Binary searching requires several seeks rather than
being performed at memory speed.
– Index rearrangement (record addition or deletion)
requires shifting records on secondary storage →
Extremely time consuming.
• Solutions:
– Use a hashed organization (later)
– Use a tree-structured index such as B-trees and B+
trees (later)
1
4
Indexes that are too large to hold
in memory II
• Nonetheless, simple indexes are still useful:
– They allow the use of a binary search in a
variable-length record file.
– Sorting and maintaining an index is less costly
than sorting and maintaining the data file,
because the index is smaller.
– If there are pinned records in the data file,
rearrangements of the keys are possible
without moving the data records.
– They can provide access by multiple keys.
1
5
Indexing to provide access by
multiple keys
• So far, our index only allows key access. i.e., we can
retrieve record DG188807, but we cannot retrieve a
recording of Beethoven’s Symphony no. 9.
→ Not useful!
• We need to use secondary key fields consisting of album
titles, composers, and artists.
–Although it would be possible to relate a secondary key to an
actual byte offset, this is usually not done.
–Instead, we relate the secondary key to a primary key which
then will point to the actual byte offset.
1
6
Index of Sample Recording File
1
7
Composer Index & Title Index
1
8
Record Addition in multiple key
access settings
• When a secondary index is used, adding a record
involves updating the data file, the primary index and
the secondary index.
– The secondary index update is similar to the primary index
update.
• Secondary keys are entered in canonical form (ex. all
capitals).
– The upper- and lower- case form must be obtained from the
data file. As well, because of the length restriction on keys,
secondary keys may sometimes be truncated.
• The secondary index may contain duplicate.
– The primary index couldn’t 7
Record Deletion in multiple key
access settings
• Removing a record from the data file means
removing its corresponding entry in the primary
index and may mean removing all of the entries
in the secondary indexes that refer to this
primary index entry.
• This is too much rearrangement, specially if
indexes cannot fit into main memory
2
0
Record Deletion in multiple key
access settings
• However, it is also possible not to worry about the
secondary index.
– Since, as we mentioned before, secondary keys were made
to point at primary keys. We don’t modify the secondary
index files.
– When accessing the file through a secondary key, the primary
index file will be checked and a deleted record can be
identified.
– The deleted record still occupy space in the secondary key
indexes.
– If a lot of deletions occur, we can periodically cleanup these
deleted records.
2
1
Record Updating in multiple key
access settings
• There are three types of updates:
– The update changes the secondary key.
• We have to rearrange secondary index to stay in sorted
order.
– The update changes the primary key.
• We have to update and reorder the primary index.
• We have to update the references to primary key in the
secondary index.
– Update confined to other fields.
• No changes necessary to primary nor secondary index.
22
Retrieval using combinations of
secondary keys
• With secondary keys, we can now search for things like
all the recordings of “Beethoven’s work” or all the
recordings titled “Violin Concerto”.
• More importantly, we can use combinations of
secondary keys.
– e.g., find all recordings of Beethoven’s Symphony no. 9.
• Without the use of secondary indexes, this request
requires a very expensive sequential search through
the entire file.
– Using secondary indexes, responding to this query is simple
and quick.
23
Improving the secondary index
structure I: The problem
• Secondary indexes lead to two difficulties:
– The secondary index file has to be rearranged
every time a new record is added to the file.
– If there are duplicate secondary keys, the
secondary key field is repeated for each entry.
• Space is wasted.
24
Improving the secondary index
structure II: Solution 1
• Solution 1: Change the secondary index structure so it
associates an array of reference with each secondary
key.
• Advantage: helps avoid the need to rearrange the
secondary index file too often.
• Disadvantages:
– It may restrict the number of references that can be
associated with each secondary key.
– It may cause internal fragmentation, i.e., waste of
space.
25
Array of Reference
26
Improving the secondary index
structure III: Solution 2
• Solution 2: inverted lists.
– Each secondary key points to a different list of primary key
references.
– Each of these lists could grow to be as long as it needs to be
and no space would be lost to internal fragmentation.
• Advantages:
– The secondary index file needs to be rearranged only upon
record addition.
– The rearranging is faster since it is smaller.
– Space from deleted primary index records can easily be
reused, since its records have fixed-length.
• Disadvantage:
– Locality (in the secondary index) has been lost.
• More seeking may be necessary.
27
Inverted Lists
28
Selective Indexes
• Using secondary keys, you can divide the file
into parts and provide a selective view.
• For example, you can build a selective index
that contains only titles to classical recordings
or recordings released prior to 1970, and since
1970.
• A possible query could then be: “List all the
recordings of Beethoven’s Simphony no. 9
released since 1970.
29