Chapter - 3 - Indexing Structures For Files
Contents
Single-level index introduction
◼ A single-level index is an auxiliary file that
makes it more efficient to search for a record in
the data file.
◼ The index is usually specified on one field of the
file (although it could be specified on several
fields).
◼ One form of an index is a file of entries <field
value, pointer to record>, which is ordered by
field value.
◼ The index is called an access path on the field.
◼ The index file usually occupies considerably fewer
disk blocks than the data file because its entries
are much smaller.
◼ A binary search on the index yields a pointer to
the file record.
◼ Indexes can also be characterized as dense or
sparse:
❑ A dense index has an index entry for every search key
value (and hence every record) in the data file.
❑ A sparse (or nondense) index, on the other hand, has
index entries for only some of the search values.
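The difference between the two kinds of index can be sketched in a few lines of Python. This is only an illustration: the three-block data file, the names `data_blocks`, `find_dense` and `find_sparse`, and the choice of the first key in each block as the sparse entry are assumptions, not part of the slides.

```python
from bisect import bisect_left, bisect_right

# Hypothetical ordered data file: three blocks of key values.
data_blocks = [[1, 4, 7], [8, 9, 10], [12, 13, 15]]

# Dense index: one <key, (block, slot)> entry for every record.
dense_index = [(key, (b, s))
               for b, block in enumerate(data_blocks)
               for s, key in enumerate(block)]

# Sparse index: one <first key in block, block> entry per block.
sparse_index = [(block[0], b) for b, block in enumerate(data_blocks)]


def find_dense(key):
    """Binary search on the dense index; the entry points at the record."""
    keys = [k for k, _ in dense_index]
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return dense_index[i][1]
    return None


def find_sparse(key):
    """Binary search for the last entry <= key, then scan that one block."""
    anchors = [k for k, _ in sparse_index]
    i = bisect_right(anchors, key) - 1
    if i < 0:
        return None
    b = sparse_index[i][1]
    if key in data_blocks[b]:
        return (b, data_blocks[b].index(key))
    return None
```

Both functions return the same location for a key that exists, but the dense index stores nine entries here while the sparse one stores three, at the price of scanning one data block.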
Example 1
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R = 150 bytes, block size B = 512 bytes, r = 30,000 records
◼ SSN field size VSSN = 9 bytes, record pointer size PR = 7 bytes
Then, we get:
◼ Blocking factor: bfr = ⌊B/R⌋ = ⌊512/150⌋ = 3 records/block
◼ Number of blocks needed for the file: b = ⌈r/bfr⌉ = ⌈30,000/3⌉ = 10,000 blocks
Types of single-level indexes
◼ Primary indexes
◼ Clustering indexes
◼ Secondary indexes
Primary Index
[Figure: a primary index on the primary key field of an ordered data file.]
Primary Index
◼ Dense or Nondense?
❑ Nondense
Clustering Index
[Figure: a clustering index on the Dept_No clustering field of a data file with fields Dept_No, Name, DoB, Salary, Sex. Each index entry <K(i), P(i)> pairs a clustering field value with a block pointer.]
Clustering Index
◼ Dense or Nondense?
❑ Nondense
Secondary index
◼ A secondary index provides a secondary means of
accessing a file.
❑ The data file is unordered on indexing field.
◼ Indexing field:
❑ secondary key (unique value)
❑ nonkey (duplicate values)
[Figure: a secondary index (<K(i), P(i)> entries) on a secondary key field of the data file.]
◼ Dense or Nondense?
❑ Dense
Secondary index on non-key field
◼ Discussion: how should a secondary index on a non-key field be structured?
◼ Option 1: include duplicate index entries with the
same K(i) value - one for each record.
◼ Option 2: keep a list of pointers <P(i, 1), ..., P(i, k)>
in the index entry for K(i).
◼ Option 3:
❑ more commonly used.
❑ one entry for each distinct index field value + an extra
level of indirection to handle the multiple pointers.
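Option 3 can be pictured with a small Python sketch. Everything here is hypothetical: the `records` list stands in for an unordered data file, record ids stand in for record pointers, and a Python list stands in for a block of record pointers.

```python
from bisect import bisect_left

# Hypothetical unordered data file: (record id, indexing field value).
records = [(0, 3), (1, 5), (2, 1), (3, 3), (4, 2), (5, 3), (6, 5)]

# Extra level of indirection: one "block" of record pointers per value.
pointer_blocks = {}
for record_id, value in records:
    pointer_blocks.setdefault(value, []).append(record_id)

# The index: one <K(i), P(i)> entry per distinct value, ordered by K(i).
# P(i) refers to the pointer block rather than to a single record.
index = sorted(pointer_blocks.items())
index_keys = [k for k, _ in index]


def find(value):
    """Binary search the index, then follow P(i) to the pointer block."""
    i = bisect_left(index_keys, value)
    if i < len(index_keys) and index_keys[i] == value:
        return index[i][1]
    return []
```

The index stays small (four entries for seven records in this toy file), and all duplicates of a value are reached through one pointer block.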
[Figure: a secondary index on a nonkey indexing field using option 3. Each index entry <K(i), P(i)> pairs a field value with a block pointer to a block of record pointers, which point to the matching records in the data file.]
◼ Dense or Nondense?
❑ Dense (option 1) or nondense (options 2 and 3)
Summary of Single-level indexes
◼ Dense index?
❑ Secondary index (on a key field, or on a nonkey field with option 1)
◼ Nondense index?
❑ Primary index
❑ Clustering index
❑ Secondary index (on a nonkey field with options 2 and 3)
Example 2
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R = 150 bytes, block size B = 512 bytes, r = 30,000 records
◼ SSN field size VSSN = 9 bytes, block pointer size P = 6 bytes
Then, we get:
◼ Blocking factor: bfr = ⌊B/R⌋ = ⌊512/150⌋ = 3 records/block
◼ Number of blocks needed for the file: b = ⌈r/bfr⌉ = ⌈30,000/3⌉ = 10,000 blocks
Multi-Level Indexes
◼ Because a single-level index is an ordered file, we
can create a primary index to the index itself.
❑ The original index file is called the first-level index and the
index to the index is called the second-level index.
◼ We can repeat the process, creating a third, fourth,
..., top level until all entries of the top level fit in
one disk block.
◼ A multi-level index can be created for any type of
first-level index (primary, secondary, clustering) as
long as the first-level index consists of more than
one disk block.
[Figure: a two-level primary index resembling ISAM (Indexed Sequential Access Method) organization.]
Example 3
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R = 150 bytes, block size B = 512 bytes, r = 30,000 records
◼ SSN field size VSSN = 9 bytes, block pointer size P = 6 bytes
Then, we get:
◼ Blocking factor: bfr = ⌊B/R⌋ = ⌊512/150⌋ = 3 records/block
◼ Number of blocks needed for the file: b = ⌈r/bfr⌉ = ⌈30,000/3⌉ = 10,000 blocks
For a primary index on the ordering key field SSN (Example 2):
◼ Index entry size: Ri = VSSN + P = 9 + 6 = 15 bytes
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/15⌋ = 34 entries/block
◼ Number of blocks for the index file: bi = ⌈b/bfri⌉ = ⌈10,000/34⌉ = 295 blocks
◼ Searching for and retrieving a record needs ⌈log2 bi⌉ + 1 = ⌈log2 295⌉ + 1 = 9 + 1 = 10 block
accesses
For a multilevel index on the ordering key field SSN:
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/15⌋ = 34 entries/block
❑ This is the fan-out fo of the multilevel index.
◼ Number of 1st-level index blocks: b1 = 295 blocks
◼ Number of 2nd-level index blocks: b2 = ⌈b1/fo⌉ = ⌈295/34⌉ = 9 blocks
◼ Number of 3rd-level index blocks: b3 = ⌈b2/fo⌉ = ⌈9/34⌉ = 1 block → top level
◼ Number of levels of this multilevel index: x = 3 levels
◼ Searching for and retrieving a record needs x + 1 = 4 block accesses
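The arithmetic of this example can be checked with a short Python sketch. The variable names are chosen here for illustration; the figures are the ones given in the example.

```python
from math import ceil, log2

B, R, r = 512, 150, 30000      # block size, record size, number of records
V_SSN, P = 9, 6                # key field size, block pointer size

bfr = B // R                   # records per data block
b = ceil(r / bfr)              # data blocks

Ri = V_SSN + P                 # index entry size
bfri = B // Ri                 # index entries per block (the fan-out)
b1 = ceil(b / bfri)            # first-level index blocks

# Single-level index: binary search on the index plus one data block access.
single_level_accesses = ceil(log2(b1)) + 1

# Multilevel index: keep indexing the previous level until one block remains.
level_blocks = [b1]
while level_blocks[-1] > 1:
    level_blocks.append(ceil(level_blocks[-1] / bfri))

# One block access per index level plus one data block access.
multilevel_accesses = len(level_blocks) + 1

print(level_blocks, single_level_accesses, multilevel_accesses)
# prints: [295, 9, 1] 10 4
```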
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
◼ Most multi-level indexes use B-tree or B+-tree data
structures because of the insertion and deletion
problem.
❑ These structures leave space in each tree node (disk block)
to allow for new index entries.
◼ These data structures are variations of search trees
that allow efficient insertion and deletion of
search values.
◼ In B-Tree and B+-Tree data structures, each node
corresponds to a disk block.
◼ Each node is kept between half-full and completely
full.
◼ An insertion into a node that is not full is quite
efficient.
❑ If a node is full, the insertion causes a split into
two nodes.
◼ Splitting may propagate to other tree levels.
◼ A deletion is quite efficient if a node does not
become less than half full.
◼ If a deletion causes a node to become less than
half full, it must be merged with neighboring
nodes.
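A leaf split can be sketched as follows. The slides do not state the exact split rule, so this sketch assumes one common convention: after an overflow, the first ⌈(pleaf + 1)/2⌉ entries stay in the original leaf, the rest move to a new leaf, and the last value kept on the left is copied up to the parent. The function name `insert_into_leaf` and the list representation of a leaf are hypothetical.

```python
from bisect import insort
from math import ceil


def insert_into_leaf(leaf, key, p_leaf):
    """Insert key into a sorted leaf (a list of keys).

    Returns (left, right, separator). When the leaf does not overflow,
    right and separator are None. Assumed split rule: the left leaf keeps
    the first ceil((p_leaf + 1) / 2) keys and its last key is copied up.
    """
    insort(leaf, key)
    if len(leaf) <= p_leaf:
        return leaf, None, None          # node was not full: cheap insert
    j = ceil((p_leaf + 1) / 2)
    left, right = leaf[:j], leaf[j:]     # node was full: split in two
    return left, right, left[-1]


# With p_leaf = 2, a leaf holding 5 and 8 overflows when 1 arrives.
print(insert_into_leaf([5, 8], 1, p_leaf=2))   # prints: ([1, 5], [8], 5)
```

The separator returned by a split has to be inserted into the parent, which may itself be full; that is how a split propagates to other tree levels.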
Difference between B-tree and B+-tree
B-tree Structures
The Nodes of a B+-Tree
Example 4: Calculate the order of a B-tree
◼ Suppose that:
❑ Search field V = 9 bytes, disk block size B = 512 bytes
❑ Record (data) pointer Pt = 7 bytes, block pointer is P = 6 bytes.
◼ Each B-tree node can have at most p tree pointers, p – 1
data pointers, and p – 1 search key field values.
◼ These must fit into a single disk block if each B-tree node is to
correspond to a disk block:
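(p * P) + ((p – 1) * (Pt + V)) ≤ B
(p * 6) + ((p – 1) * (7 + 9)) ≤ 512
(22 * p) ≤ 528
◼ We can choose p to be a large value that satisfies the above
inequality, which gives p = 23 (p = 24 also satisfies it but is not
chosen, because each node must also store additional information,
such as the number of entries in the node and a pointer to the parent node).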
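The bound on p can be verified numerically. The sketch below only solves the inequality; the function names are illustrative, and the decision to drop from 24 to 23 to leave room for bookkeeping information is the one made in the example, not something the code derives.

```python
def btree_node_size(p, V, Pt, P):
    """Bytes used by p tree pointers, p - 1 data pointers and p - 1 keys."""
    return p * P + (p - 1) * (Pt + V)


def btree_max_order(B, V, Pt, P):
    """Largest p with p*P + (p-1)*(Pt+V) <= B, i.e. p <= (B+Pt+V)/(P+Pt+V)."""
    return (B + Pt + V) // (P + Pt + V)


B, V, Pt, P = 512, 9, 7, 6
p_max = btree_max_order(B, V, Pt, P)
print(p_max, btree_node_size(p_max, V, Pt, P), btree_node_size(23, V, Pt, P))
# prints: 24 512 490
```

With p = 24 the node fills the block exactly, so choosing p = 23 leaves 22 bytes free in each block.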
Example 5: Calculate approximate number
of entries of a B-tree
◼ Suppose that:
❑ The search field of Example 4 is a nonordering key field, and we construct a B-tree on
this field (order p = 23).
❑ Each node of the B-tree is 69 percent full.
◼ Each node, on the average, will have p * 0.69 = 23 * 0.69 = 15.87 ≈ 16
pointers, and hence 15 search key field values.
◼ The average fan-out fo = 16. We can start at the root and see how many
values and pointers can exist, on the average, at each subsequent level:
Level      Nodes          Index entries     Pointers
Root:      1 node         15 entries        16 pointers
Level 1:   16 nodes       240 entries       256 pointers
Level 2:   256 nodes      3,840 entries     4,096 pointers
Level 3:   4,096 nodes    61,440 entries
◼ At each level, number of entries = the total number of pointers at the
previous level * the average number of entries in each node.
◼ A two-level B-tree holds 3,840 + 240 + 15 = 4,095 entries on the average; a
three-level B-tree holds 4,095 + 61,440 = 65,535 entries on the average.
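The level-by-level table can be regenerated with a few lines of Python. This is a sketch under the example's assumptions (order 23, nodes 69 percent full, fan-out rounded to the nearest integer); `btree_capacity` is a name made up for the illustration.

```python
def btree_capacity(p, fill, last_level):
    """Rows of (level, nodes, index entries, pointers) for an average B-tree."""
    fan_out = round(p * fill)          # average pointers per node
    keys_per_node = fan_out - 1        # average search values per node
    rows, nodes = [], 1
    for level in range(last_level + 1):
        rows.append((level, nodes, nodes * keys_per_node, nodes * fan_out))
        nodes *= fan_out               # every pointer leads to one child node
    return rows


rows = btree_capacity(p=23, fill=0.69, last_level=3)
entries_through_level_2 = sum(entries for _, _, entries, _ in rows[:3])
entries_through_level_3 = sum(entries for _, _, entries, _ in rows)
print(entries_through_level_2, entries_through_level_3)   # prints: 4095 65535
```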
Example 6: Calculate the order of a B+-tree
◼ Suppose that:
❑ Search key field V = 9 bytes, block size B = 512 bytes
❑ Record pointer Pr = 7 bytes, block pointer P = 6 bytes
◼ An internal node of the B+-tree can have up to p tree pointers and p – 1
search field values; these must fit into a single block. Hence, we
have:
(p * P) + ((p – 1) * V) ≤ B
(p * 6) + ((p – 1) * 9) ≤ 512
(15 * p) ≤ 521
◼ The largest p that satisfies this inequality is p = 34.
◼ The leaf nodes of the B+-tree hold the same number of
values and pointers, except that the pointers are data
pointers plus one next pointer. Hence, the order pleaf for the
leaf nodes can be calculated as follows:
(pleaf * (Pr + V)) + P ≤ B
(pleaf * (7 + 9)) + 6 ≤ 512
(16 * pleaf) ≤ 506
◼ It follows that each leaf node can hold up to pleaf = 31 key
value/data pointer combinations, assuming that the data
pointers are record pointers.
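Both orders follow from solving the two inequalities for the largest integer, which a short Python sketch can confirm. The function names are illustrative only.

```python
def bplus_internal_order(B, V, P):
    """Largest p with p*P + (p-1)*V <= B, i.e. p <= (B + V) / (P + V)."""
    return (B + V) // (P + V)


def bplus_leaf_order(B, V, Pr, P):
    """Largest p_leaf with p_leaf*(Pr+V) + P <= B (P is the next pointer)."""
    return (B - P) // (Pr + V)


B, V, Pr, P = 512, 9, 7, 6
p = bplus_internal_order(B, V, P)
p_leaf = bplus_leaf_order(B, V, Pr, P)
print(p, p_leaf)   # prints: 34 31
```

Internal nodes store no data pointers, which is why they reach order 34 in the same 512-byte block where a B-tree node reaches only 23 or 24.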
Example 7: Calculate approximate number
of entries of a B+-tree
◼ Suppose that we construct a B+-tree on the field of Example 6:
❑ Search key field V = 9 bytes, block size B = 512 bytes
❑ Record pointer Pr = 7 bytes, block pointer P = 6 bytes
❑ Each node is 69 percent full.
◼ On the average, each internal node will have 34 * 0.69 = 23.46, or
approximately 23, pointers, and hence 22 values.
◼ Each leaf node, on the average, will hold 0.69 * pleaf = 0.69 * 31 = 21.39, or
approximately 21, data record pointers.
◼ A B+-tree will have the following average number of entries at each level:
Level        Nodes           Index entries                  Pointers
Root         1 node          22 entries                     23 pointers
Level 1      23 nodes        23 * 22 = 506 entries          23^2 = 529 pointers
Level 2      529 nodes       529 * 22 = 11,638 entries      23^3 = 12,167 pointers
Leaf level   12,167 nodes    12,167 * 21 = 255,507 data record pointers
◼ A 3-level B+-tree holds about 255,507 record pointers, on the average.
◼ Compare this to the 65,535 entries for the corresponding B-tree in Example 5.
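The same table can be reproduced in Python. The sketch assumes the orders from Example 6 (p = 34, pleaf = 31), 69 percent occupancy, and rounding to the nearest integer; `bplus_capacity` is a hypothetical helper name.

```python
def bplus_capacity(p, p_leaf, fill, internal_levels):
    """Average B+-tree shape: internal level rows, leaf count, record pointers."""
    fan_out = round(p * fill)              # average pointers per internal node
    leaf_entries = round(p_leaf * fill)    # average data pointers per leaf
    levels, nodes = [], 1
    for _ in range(internal_levels):
        # (nodes, index entries, pointers) at this internal level
        levels.append((nodes, nodes * (fan_out - 1), nodes * fan_out))
        nodes *= fan_out
    return levels, nodes, nodes * leaf_entries


levels, leaf_nodes, record_pointers = bplus_capacity(34, 31, 0.69, 3)
print(leaf_nodes, record_pointers)   # prints: 12167 255507
```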
B+-Tree: Insert entry
Example of insertion in B+-tree
p = 3 and pleaf = 2
B+-Tree: Delete entry
◼ Remove the entry from the leaf node.
◼ If the deleted value also occurs in an internal node:
❑ Remove it from the internal node.
❑ The value to its left in the leaf node must replace it in the internal
node.
◼ Deletion may cause underflow in a leaf node:
❑ Try to find a sibling leaf node: a leaf node directly to the left or to
the right of the node with underflow.
❑ Redistribute the entries among the node and its sibling.
(Common method: try the left sibling first, then the right sibling.)
❑ If redistribution fails, merge the node with its sibling.
❑ If a merge occurs, delete the entry (pointing to the node and its
sibling) from the parent node.
Example of deletion from B+-tree
p = 3 and pleaf = 2.
Delete 5
Delete 9:
Underflow (merge with left, redistribute)
Search using B-trees and B+-trees
[Figure: searching for K = 8: 5 < 8 at the first node visited, 7 < 8 ≤ 8 at the next node, and the value is found.]
◼ Search conditions on indexing attributes
❑ =, <, >, ≤, ≥, between, MINIMUM value, MAXIMUM
value
◼ Search results
❑ Zero, one, or many data records
◼ Search cost
❑ B-trees
◼ From 1 to (1 + the number of tree levels) + data accesses
❑ B+-trees
◼ 1 (root level) + the number of tree levels + data accesses
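A B+-tree search that also counts the nodes it visits can be sketched as below. The `Node` class and the small tree are hypothetical, and the sketch assumes the convention that the subtree between two adjacent values K1 and K2 of an internal node holds values X with K1 < X ≤ K2.

```python
from bisect import bisect_left


class Node:
    """A tree node; children is None for a leaf."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children


def search(root, key):
    """Return (found, nodes_visited) for an equality search in a B+-tree."""
    node, visited = root, 1
    while node.children is not None:
        # bisect_left picks child i such that keys[i-1] < key <= keys[i].
        node = node.children[bisect_left(node.keys, key)]
        visited += 1
    return key in node.keys, visited


# Hypothetical tree with p = 3 and p_leaf = 2.
left = Node([3], [Node([1, 3]), Node([5])])
right = Node([7, 8], [Node([6, 7]), Node([8]), Node([9, 12])])
root = Node([5], [left, right])

print(search(root, 8))   # prints: (True, 3)
print(search(root, 4))   # prints: (False, 3)
```

Every search in a B+-tree goes down to a leaf, so the number of nodes visited is the same whether or not the value exists; the data accesses come on top of that.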
Indexes on Multiple Keys
◼ In many retrieval and update requests, multiple
attributes are involved.
◼ If a certain combination of attributes is used
frequently, it is advantageous to set up an access
structure to provide efficient access by a key value
that is a combination of those attributes.
◼ If an index is created on attributes <A1, A2, … , An>,
the search key values are tuples with n values: <v1,
v2, … , vn>.
◼ A lexicographic ordering of these tuple values
establishes an order on this composite search key.
◼ An index on a composite key of n attributes works
similarly to any index discussed so far.
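Lexicographic ordering of composite keys is exactly how Python compares tuples, which makes a small sketch possible. The attribute pair <Dept_No, Age>, the sample entries, and the function names are assumptions made for the illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical composite index on <Dept_No, Age>: (key tuple, record id),
# kept in lexicographic order of the key tuples.
entries = sorted([((4, 59), 0), ((5, 30), 1), ((4, 35), 2),
                  ((5, 59), 3), ((3, 41), 4)])
keys = [k for k, _ in entries]


def find_exact(dept_no, age):
    """Equality search on the full composite key."""
    lo = bisect_left(keys, (dept_no, age))
    hi = bisect_right(keys, (dept_no, age))
    return [rid for _, rid in entries[lo:hi]]


def find_by_dept(dept_no):
    """Search on the leading attribute only (assumes integer Dept_No)."""
    lo = bisect_left(keys, (dept_no,))
    hi = bisect_left(keys, (dept_no + 1,))
    return [rid for _, rid in entries[lo:hi]]


print(find_exact(4, 59))   # prints: [0]
print(find_by_dept(4))     # prints: [2, 0]
```

A search on the leading attribute alone can still use the ordering, whereas a search on Age alone cannot, because entries with the same Age are scattered across the index.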
Other File Indexes
◼ Hash indexes
❑ The hash index is a secondary structure to access the
file by using hashing on a search key other than the one
used for the primary data file organization.
◼ Bitmap indexes
❑ A bitmap index is built on one particular value of a
field (the column in a table) with respect to all the rows
(records) and is an array of bits.
◼ Function-based indexes
❑ In Oracle, a function-based index is one in which the value that
results from applying a function (expression) to one or more fields
becomes the key to the index.
Hash indexes
◼ Hash indexes are access structures similar to indexes, but based on
hashing.
◼ They support equality searches on the hash field.
[Figure: hash-based indexing on Emp_id. Hashing function: the sum of the digits of Emp_id modulo 10.]
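The hashing function named in the figure is easy to write down. The bucket layout, the sample Emp_id values, and the record pointer strings in this sketch are hypothetical.

```python
def h(emp_id):
    """Hashing function: the sum of the digits of Emp_id modulo 10."""
    return sum(int(digit) for digit in str(emp_id)) % 10


# Ten buckets, each holding <Emp_id, record pointer> entries.
buckets = [[] for _ in range(10)]


def add(emp_id, record_pointer):
    buckets[h(emp_id)].append((emp_id, record_pointer))


def find(emp_id):
    """Equality search: hash to one bucket, then compare keys inside it."""
    return [ptr for key, ptr in buckets[h(emp_id)] if key == emp_id]


add(13646, "rec-A")   # 1+3+6+4+6 = 20, bucket 0
add(23402, "rec-B")   # 2+3+4+0+2 = 11, bucket 1
print(h(13646), h(23402), find(23402))   # prints: 0 1 ['rec-B']
```

Only equality searches benefit: two Emp_id values that are close together can land in unrelated buckets, so a range search would have to inspect every bucket.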
Bitmap indexes
◼ A bitmap index is built on one particular value
of a field (the column in a table) with respect to
all the rows (records) and is an array of bits.
❑ Each bit in the bitmap corresponds to a row. If the bit is
set, then the row contains the key value.
◼ In a bitmap index, each indexing field value is
associated with pointers to multiple rows.
◼ Bitmap indexes are primarily designed for data
warehousing or environments in which queries
reference many columns in an ad hoc fashion.
They are suitable when:
❑ The number of distinct values of the indexed field is
small compared to the number of rows.
❑ The indexed table is either read-only or not subject to
significant modification by DML statements.
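A bitmap index can be imitated with Python integers used as bit arrays. The table contents, the column choice (Sex, Zipcode), and the helper names below are hypothetical; a real DBMS stores and compresses bitmaps differently.

```python
# Hypothetical table rows: (Sex, Zipcode). Row i corresponds to bit i.
rows = [("M", 94040), ("F", 94040), ("F", 19046), ("M", 30022), ("F", 94040)]


def build_bitmaps(values):
    """One bitmap per distinct value; bit i is set if row i has that value."""
    bitmaps = {}
    for row_no, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_no)
    return bitmaps


sex_index = build_bitmaps([sex for sex, _ in rows])
zip_index = build_bitmaps([zipcode for _, zipcode in rows])


def matching_rows(bitmap):
    """Row numbers whose bit is set in the bitmap."""
    return [i for i in range(len(rows)) if (bitmap >> i) & 1]


# Sex = 'F' AND Zipcode = 94040 becomes a bitwise AND of two bitmaps.
print(matching_rows(sex_index["F"] & zip_index[94040]))   # prints: [1, 4]
```

Conditions on several columns are combined with cheap bitwise operations before any row is fetched, which is why the structure suits ad hoc queries over columns with few distinct values.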
Function-based indexes
◼ The use of any function on a column prevents the
index defined on that column from being used.
❑ Indexes are only used with some specific search
conditions on indexed columns.
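Why a function on the column defeats an ordinary index, and how indexing the function result helps, can be illustrated with plain Python dictionaries. This is an analogy only, not how any DBMS implements it; the employee names and the helper `build_index` are made up.

```python
# Hypothetical rows: (record id, last name as stored).
employees = [(1, "Smith"), (2, "SMITH"), (3, "Wong")]


def build_index(rows, key_func):
    """Map key_func(name) to the list of record ids with that key."""
    index = {}
    for record_id, name in rows:
        index.setdefault(key_func(name), []).append(record_id)
    return index


# Ordinary index: keyed on the stored value.
name_index = build_index(employees, lambda name: name)

# Function-based index: keyed on the result of UPPER(name).
upper_name_index = build_index(employees, str.upper)

# Condition UPPER(name) = 'SMITH':
print(name_index.get("SMITH", []))         # prints: [2]
print(upper_name_index.get("SMITH", []))   # prints: [1, 2]
```

Looking up the ordinary index with the function result misses record 1, so without a function-based index the condition would have to be checked against every row.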
Index Creation
CREATE [ UNIQUE ] INDEX <index name>
ON <table name> ( <column name> [ <order> ] { , <column name> [ <order> ] } )
[ CLUSTER ] ;
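The general form above can be tried out with the `sqlite3` module from Python's standard library. The table and index names are hypothetical, and SQLite is used only because it needs no setup: it accepts the UNIQUE keyword and a per-column order, but it has no CLUSTER option, so that part of the syntax is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (ssn TEXT, lname TEXT, fname TEXT, salary REAL)"
)

# Unique index on a key field.
conn.execute("CREATE UNIQUE INDEX ssn_index ON employee (ssn)")

# Index on several columns, each with its own order.
conn.execute("CREATE INDEX name_index ON employee (lname ASC, fname DESC)")

# List the indexes recorded in SQLite's catalog.
index_names = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
)
print(index_names)   # prints: ['name_index', 'ssn_index']
```

With the unique index in place, inserting two rows with the same ssn raises `sqlite3.IntegrityError`.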
B-tree index in Oracle 19c
B-tree for a clustered index in MS SQL Server
Review questions
1) Define the following terms: indexing field, primary key field, clustering
field, secondary key field, block anchor, dense index, and nondense
(sparse) index.
2) What are the differences among primary, secondary, and clustering
indexes? How do these differences affect the ways in which these
indexes are implemented? Which of the indexes are dense, and which
are not?
3) Why can we have at most one primary or clustering index on a file, but
several secondary indexes?
4) How does multilevel indexing improve the efficiency of searching an
index file?
5) What is the order p of a B-tree? Describe the structure of B-tree nodes.
6) What is the order p of a B+-tree? Describe the structure of both internal
and leaf nodes of a B+-tree.
7) How does a B-tree differ from a B+-tree? Why is a B+-tree usually
preferred as an access structure to a data file?