
Chapter - 3 - Indexing Structures For Files

Chapter 3 discusses various indexing structures for files, including single-level ordered indexes, multilevel indexes, and dynamic multilevel indexes using B-Trees and B+-Trees. It explains the characteristics of primary, clustering, and secondary indexes, as well as their efficiency in searching records. The chapter also highlights the advantages of using multilevel indexes to improve search performance and the challenges associated with insertion and deletion in these structures.

Chapter 3

Indexing Structures for Files


Contents

1 Single-level Ordered Indexes
2 Multilevel Indexes
3 Dynamic Multilevel Indexes Using B-Trees and B+-Trees
4 Indexes on Multiple Keys
5 Other File Indexes
6 Indexes in Today's DBMSs

Single-level index introduction
◼ A single-level index is an auxiliary file that
makes it more efficient to search for a record in
the data file.
◼ The index is usually specified on one field of the
file (although it could be specified on several
fields).
◼ One form of an index is a file of entries <field
value, pointer to record>, which is ordered by
field value.
◼ The index is called an access path on the field.

Single-level index introduction (cont.)
◼ The index file usually occupies considerably fewer
disk blocks than the data file because its entries
are much smaller.
◼ A binary search on the index yields a pointer to
the file record.
◼ Indexes can also be characterized as dense or
sparse:
❑ A dense index has an index entry for every search key
value (and hence every record) in the data file.
❑ A sparse (or nondense) index, on the other hand, has
index entries for only some of the search values.

Example 1
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R = 150 bytes, block size B = 512 bytes, r = 30,000 records
◼ SSN field size VSSN = 9 bytes, record pointer size PR = 7 bytes
Then, we get:
◼ Blocking factor: bfr = ⌊B/R⌋ = ⌊512/150⌋ = 3 records/block
◼ Number of blocks needed for the file: b = ⌈r/bfr⌉ = ⌈30,000/3⌉ = 10,000 blocks

For a dense index on the SSN field:
◼ Index entry size: Ri = (VSSN + PR) = (9 + 7) = 16 bytes
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/16⌋ = 32 entries/block
◼ Number of blocks for the index file: bi = ⌈r/bfri⌉ = ⌈30,000/32⌉ = 938 blocks
◼ Searching for and retrieving a record needs ⌈log2 bi⌉ + 1 = ⌈log2 938⌉ + 1 = 11
block accesses
◼ Compare this with an average linear search cost of
b/2 = 10,000/2 = 5,000 block accesses
◼ If the file records are ordered, the binary search cost would be
⌈log2 b⌉ = ⌈log2 10,000⌉ = 14 block accesses

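The arithmetic of Example 1 can be checked with a few lines of Python. This is only a sketch: the function name `dense_index_cost` and its parameter names are made up for illustration, and it assumes one index entry per record, as in the example.

```python
from math import ceil, log2

def dense_index_cost(B, R, r, V, P):
    """Return (data blocks, index blocks, block accesses via the index)."""
    bfr = B // R                      # records per data block (floor)
    b = ceil(r / bfr)                 # data blocks
    bfr_i = B // (V + P)              # index entries per block (floor)
    b_i = ceil(r / bfr_i)             # index blocks, one entry per record
    accesses = ceil(log2(b_i)) + 1    # binary search on the index + 1 data block
    return b, b_i, accesses

b, b_i, accesses = dense_index_cost(B=512, R=150, r=30_000, V=9, P=7)
print(b, b_i, accesses)               # 10000 938 11
print(b // 2, ceil(log2(b)))          # 5000 14 (linear search, binary search on the file)
```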
Types of Single-level Ordered Indexes

◼ Primary Indexes

◼ Clustering Indexes

◼ Secondary Indexes

Primary Index

◼ Defined on an ordered data file.


❑ The data file is ordered on a key field.

◼ One index entry for each block in the data file


❑ First record in the block, which is called the block anchor

◼ A similar scheme can use the last record in a block.
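A minimal sketch of a lookup through such a nondense index, assuming block anchors are the first key of each block. The block contents and the function name `lookup` are illustrative assumptions, not part of the slides.

```python
from bisect import bisect_right

# Hypothetical data file ordered on the key: each inner list is one block.
blocks = [[1, 2, 3], [4, 6, 7], [8, 9, 10], [12, 13, 15]]
# One index entry per block: the block anchor (first key in the block).
anchors = [blk[0] for blk in blocks]

def lookup(key):
    """Return (block number, position in block), or None if key is absent."""
    i = bisect_right(anchors, key) - 1   # last anchor <= key
    if i < 0:
        return None                      # key is smaller than every anchor
    block = blocks[i]                    # one data block access
    return (i, block.index(key)) if key in block else None

print(lookup(9))    # (2, 1)
print(lookup(5))    # None
```

The binary search runs over the anchors only, so it touches one entry per block rather than one per record.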

[Figure: primary index on the ordering key field ID of a data file (ID, Name, DoB, Salary, Sex). The index file holds <K(i), P(i)> entries (primary key value, block pointer) for the block anchors 1, 4, 8, and 12.]

Primary Index

◼ Number of index entries?


❑ Number of blocks in data file.

◼ Dense or Nondense?
❑ Nondense

◼ Search/ Insert/ Update/ Delete?

Clustering Index

◼ Defined on an ordered data file.


❑ The data file is ordered on a non-key field.

◼ One index entry for each distinct value of the field.


❑ The index entry points to the first data block that
contains records with that field value

[Figure: clustering index on the non-key ordering field Dept_No of a data file (Dept_No, Name, DoB, Salary, Sex). The index file holds one <K(i), P(i)> entry (clustering field value, block pointer) for each distinct Dept_No value.]

[Figure: a second clustering index on Dept_No over a data file with the same fields.]

Clustering Index

◼ Number of index entries?


❑ Number of distinct indexing field values in the data file.

◼ Dense or Nondense?
❑ Nondense

◼ Search/ Insert/ Update/ Delete?

◼ A file can have at most one primary index or one
clustering index, but not both.

Secondary index
◼ A secondary index provides a secondary means of
accessing a file.
❑ The data file is not ordered on the indexing field.
◼ Indexing field:
❑ secondary key (unique value)
❑ nonkey (duplicate values)

◼ The index is an ordered file with two fields:


❑ The first field: indexing field.
❑ The second field: block pointer or record pointer.

◼ There can be many secondary indexes for the same file.

[Figure: secondary index on a key field. The index file holds one <K(i), P(i)> entry (index field value, block pointer) per record, ordered by field value; the data file is not ordered on that field.]

Secondary index on key field

◼ Number of index entries?


❑ Number of records in the data file

◼ Dense or Nondense?
❑ Dense

◼ Search/ Insert/ Update/ Delete?

Secondary index on non-key field
◼ Discussion: Structure of Secondary index on non-
key field?
◼ Option 1: include duplicate index entries with the
same K(i) value - one for each record.
◼ Option 2: keep a list of pointers <P(i, 1), ..., P(i, k)>
in the index entry for K(i).
◼ Option 3:
❑ more commonly used.
❑ one entry for each distinct index field value + an extra
level of indirection to handle the multiple pointers.
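Option 3 can be sketched in a few lines. The records, the field name Dept_No as used here, and the variable names are illustrative assumptions; a Python list stands in for each block of record pointers.

```python
from collections import defaultdict

# Hypothetical unordered data file: (record id, Dept_No) pairs.
records = [(0, 3), (1, 5), (2, 1), (3, 2), (4, 3), (5, 4), (6, 1), (7, 3)]

# Extra level of indirection: one list of record pointers per distinct value.
pointer_blocks = defaultdict(list)
for rid, dept in records:
    pointer_blocks[dept].append(rid)

# The index itself: one entry per distinct field value, ordered by value.
index = sorted(pointer_blocks)

print(index)               # [1, 2, 3, 4, 5]
print(pointer_blocks[3])   # [0, 4, 7]
```

The index has five entries for eight records, which is why this option yields a nondense index.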

[Figure: secondary index on the non-key field Dept_No, option 3. Each <K(i), P(i)> index entry (field value, block pointer) points to a block of record pointers, which in turn point to the data file records (Dept_No, Name, DoB, Job, Sex) with that value.]

Secondary index on nonkey field

◼ Number of index entries?


❑ Option 1: number of records in the data file
❑ Options 2 and 3: number of distinct index field values

◼ Dense or Nondense?
❑ Dense (option 1) or nondense (options 2 and 3)

◼ Search/ Insert/ Update/ Delete?

Summary of Single-level indexes

◼ Ordered file on indexing field?


❑ Primary index
❑ Clustering index
◼ Indexing field is Key?
❑ Primary index
❑ Secondary index
◼ Indexing field is not Key?
❑ Clustering index
❑ Secondary index
Summary of Single-level indexes

◼ Dense index?
❑ Secondary index

◼ Nondense index?
❑ Primary index
❑ Clustering index
❑ Secondary index

Example 2
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R = 150 bytes, block size B = 512 bytes, r = 30,000 records
◼ SSN field size VSSN = 9 bytes, block pointer size P = 6 bytes
Then, we get:
◼ Blocking factor: bfr = ⌊B/R⌋ = ⌊512/150⌋ = 3 records/block
◼ Number of blocks needed for the file: b = ⌈r/bfr⌉ = ⌈30,000/3⌉ = 10,000 blocks

For a primary index on the ordering key field SSN:
◼ Index entry size: Ri = (VSSN + P) = (9 + 6) = 15 bytes
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/15⌋ = 34 entries/block
◼ Number of blocks for the index file: bi = ⌈b/bfri⌉ = ⌈10,000/34⌉ = 295 blocks
◼ Searching for and retrieving a record needs ⌈log2 bi⌉ + 1 = ⌈log2 295⌉ + 1 = 10
block accesses
◼ Compare this with the 11 block accesses needed by the dense index of Example 1.

Multi-Level Indexes
◼ Because a single-level index is an ordered file, we
can create a primary index to the index itself.
❑ The original index file is called the first-level index and the
index to the index is called the second-level index.
◼ We can repeat the process, creating a third, fourth,
..., top level until all entries of the top level fit in
one disk block.
◼ A multi-level index can be created for any type of
first-level index (primary, secondary, clustering) as
long as the first-level index consists of more than
one disk block.

[Figure: a two-level primary index resembling ISAM (Indexed Sequential Access Method) organization.]

Example 3
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
◼ Record size R = 150 bytes, block size B = 512 bytes, r = 30,000 records
◼ SSN field size VSSN = 9 bytes, block pointer size P = 6 bytes
Then, we get:
◼ Blocking factor: bfr = ⌊B/R⌋ = ⌊512/150⌋ = 3 records/block
◼ Number of blocks needed for the file: b = ⌈r/bfr⌉ = ⌈30,000/3⌉ = 10,000 blocks
For a primary index on the ordering key field SSN (Example 2):
◼ Index entry size: Ri = (VSSN + P) = (9 + 6) = 15 bytes
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/15⌋ = 34 entries/block
◼ Number of blocks for the index file: bi = ⌈b/bfri⌉ = ⌈10,000/34⌉ = 295 blocks
◼ Searching for and retrieving a record needs ⌈log2 bi⌉ + 1 = ⌈log2 295⌉ + 1 = 10 block
accesses
For a multilevel index on the ordering key field SSN:
◼ Index blocking factor: bfri = ⌊B/Ri⌋ = ⌊512/15⌋ = 34 entries/block
❑ This is the fan-out fo of the multilevel index.
◼ Number of 1st-level index blocks: b1 = 295 blocks
◼ Number of 2nd-level index blocks: b2 = ⌈b1/fo⌉ = ⌈295/34⌉ = 9 blocks
◼ Number of 3rd-level index blocks: b3 = ⌈b2/fo⌉ = ⌈9/34⌉ = 1 block → top level
◼ Number of levels of this multilevel index: x = 3 levels
◼ Searching for and retrieving a record needs x + 1 = 4 block accesses

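The level-by-level computation of Example 3 can be written as a short loop. This is a sketch under the example's assumptions; the function name `multilevel_index` is made up for illustration.

```python
from math import ceil

def multilevel_index(first_level_blocks, fan_out):
    """Return the number of blocks at each index level, first level first."""
    levels = [first_level_blocks]
    while levels[-1] > 1:                 # stop once a level fits in one block
        levels.append(ceil(levels[-1] / fan_out))
    return levels

levels = multilevel_index(first_level_blocks=295, fan_out=34)
print(levels)            # [295, 9, 1]
print(len(levels) + 1)   # 4 block accesses: one per level plus the data block
```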
Multi-Level Indexes

◼ Such a multi-level index is a form of search tree.
◼ However, insertion and deletion of index entries are
a severe problem because every level of the index is
an ordered file.

Dynamic Multilevel Indexes Using B-Trees and B+-Trees
◼ Most multi-level indexes use B-tree or B+-tree data
structures because of the insertion and deletion
problem.
❑ These structures leave space in each tree node (disk block)
to allow for new index entries.
◼ These data structures are variations of search trees
that allow efficient insertion and deletion of
search values.
◼ In B-Tree and B+-Tree data structures, each node
corresponds to a disk block.
◼ Each node is kept between half-full and completely
full.

Dynamic Multilevel Indexes Using B-Trees and B+-Trees (cont.)
◼ An insertion into a node that is not full is quite
efficient.
❑ If a node is full, the insertion causes a split into
two nodes.
◼ Splitting may propagate to other tree levels.
◼ A deletion is quite efficient if a node does not
become less than half full.
◼ If a deletion causes a node to become less than
half full, it must be merged with neighboring
nodes.
Difference between B-tree and B+-tree

◼ In a B-Tree, pointers to data records exist at
all levels of the tree.
◼ In a B+-Tree, all pointers to data records exist
at the leaf-level nodes.
◼ A B+-Tree can have fewer levels (or a higher
capacity of search values) than the
corresponding B-tree.

[Figure: B-tree structures]

[Figure: the nodes of a B+-tree]

Example 4: Calculate the order of a B-tree
◼ Suppose that:
❑ Search field V = 9 bytes, disk block size B = 512 bytes
❑ Record (data) pointer Pt = 7 bytes, block pointer is P = 6 bytes.
◼ Each B-tree node can have at most p tree pointers, p – 1
data pointers, and p – 1 search key field values.
◼ These must fit into a single disk block if each B-tree node is to
correspond to a disk block:
(p*P) + ((p-1)*(Pt+V)) ≤ B
⇒ (p*6) + ((p-1)*(7+9)) ≤ 512
⇒ (22*p) ≤ 528
◼ We can choose p to be a large value that satisfies the above
inequality, which gives p = 23 (p = 24 also satisfies it but is not
chosen, because each node may need space for additional information).

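The inequality of Example 4 can be solved directly. In this sketch the function name `btree_order` and the `reserved` parameter are illustrative assumptions; the 8 bytes reserved in the second call are a made-up figure standing in for the additional per-node information the slide mentions.

```python
def btree_order(B, V, P, Pt, reserved=0):
    """Largest p with p*P + (p-1)*(Pt+V) <= B - reserved."""
    # Rearranged: p*(P+Pt+V) <= (B - reserved) + (Pt+V)
    return (B - reserved + Pt + V) // (P + Pt + V)

print(btree_order(B=512, V=9, P=6, Pt=7))              # 24
print(btree_order(B=512, V=9, P=6, Pt=7, reserved=8))  # 23
```

With no space reserved, p = 24 fills the block exactly (22 * 24 = 528), which is why the slide settles on p = 23.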
Example 5: Calculate approximate number
of entries of a B-tree
◼ Suppose that:
❑ Search field of Example 3 is a non-ordering key field, and we construct a B-Tree on
this field.
❑ Each node of the B-tree is 69 percent full.
◼ Each node, on the average, will have p * 0.69 = 23 * 0.69 = 15.87 ≈ 16
pointers and hence 15 search key field values.
◼ The average fan-out fo = 16. We can start at the root and see how many
values and pointers can exist, on the average, at each subsequent level:
Level      Nodes    Index entries    Pointers
Root       1        15               16
Level 1    16       240              256
Level 2    256      3,840            4,096
Level 3    4,096    61,440
◼ At each level, number of entries = the total number of pointers at the
previous level * the average number of entries in each node.
◼ A two-level B-tree holds 3,840 + 240 + 15 = 4,095 entries on the average; a
three-level B-tree holds 65,535 entries on the average.

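The table of Example 5 follows from a simple loop over the levels. This sketch assumes every node has the same average fan-out; the function name `btree_capacity` is made up for illustration.

```python
def btree_capacity(fan_out, levels):
    """Average entries in a B-tree with a root plus `levels` levels below it."""
    entries_per_node = fan_out - 1
    total, nodes = 0, 1
    for _ in range(levels + 1):          # root, level 1, ..., level `levels`
        total += nodes * entries_per_node
        nodes *= fan_out                 # every pointer leads to one child node
    return total

print(btree_capacity(16, 2))   # 4095
print(btree_capacity(16, 3))   # 65535
```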
Example 6: Calculate the order of a B+-tree
◼ Suppose that:
❑ Search key field V = 9 bytes, block size B = 512 bytes
❑ Record pointer is Pr = 7 bytes, block pointer is P = 6 bytes.
◼ An internal node of the B+-tree can have up to p tree pointers and
p-1 search field values; these must fit into a single block. Hence, we
have:
(p*P) + ((p-1)*V) ≤ B
⇒ (p*6) + ((p-1)*9) ≤ 512
⇒ 15*p ≤ 521
◼ We can choose p to be the largest value satisfying the above
inequality, which gives p = 34.
◼ This is larger than the value of 23 for the B-Tree, resulting in a larger
fan-out and more entries in each internal node of a B+-Tree than in
the corresponding B-Tree.

Example 6: Calculate the order of a B+-tree
(cont.)
◼ The leaf nodes of the B+-tree will have the same number of
values and pointers, except that the pointers are data
pointers and a next pointer. Hence, the order pleaf for the
leaf nodes can be calculated as follows:
(pleaf * (Pr+V)) + P ≤ B
⇒ (pleaf * (7+9)) + 6 ≤ 512
⇒ (16 * pleaf) ≤ 506
◼ It follows that each leaf node can hold up to pleaf = 31 key
value/data pointer combinations, assuming that the data
pointers are record pointers.

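Both orders of Example 6 come from solving the block-size inequalities for the largest integer. A minimal sketch; the function names are made up for illustration.

```python
def bplus_internal_order(B, V, P):
    """Largest p with p*P + (p-1)*V <= B."""
    return (B + V) // (P + V)

def bplus_leaf_order(B, V, P, Pr):
    """Largest p_leaf with p_leaf*(Pr+V) + P <= B (P is the next-leaf pointer)."""
    return (B - P) // (Pr + V)

print(bplus_internal_order(B=512, V=9, P=6))    # 34
print(bplus_leaf_order(B=512, V=9, P=6, Pr=7))  # 31
```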
Example 7: Calculate approximate number
of entries of a B+-tree
◼ Suppose that we construct a B+-Tree on the field of Example 6:
❑ Search key field V = 9 bytes, block size B = 512 bytes
❑ Record pointer is Pr = 7 bytes, block pointer is P = 6 bytes.
❑ Each node is 69 percent full.
◼ On the average, each internal node will have 34 * 0.69 = 23.46, or
approximately 23 pointers, and hence 22 values.
◼ Each leaf node, on the average, will hold 0.69 * pleaf = 0.69 * 31 = 21.39, or
approximately 21 data record pointers.
◼ A B+-tree will have the following average number of entries at each level:
Level         Nodes     Index entries            Pointers
Root          1         22                       23
Level 1       23        23 * 22 = 506            23^2 = 529
Level 2       529       529 * 22 = 11,638        23^3 = 12,167
Leaf level    12,167    12,167 * 21 = 255,507
◼ A 3-level B+-tree holds up to 255,507 record pointers, on the average.
◼ Compare this to the 65,535 entries for the corresponding B-tree in Example 5.

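The leaf-level figure of Example 7 can be reproduced as follows. This is a sketch: the function name is made up, and truncating with `int()` is an assumption that happens to match the slide's rounding of 23.46 and 21.39.

```python
def bplus_record_pointers(p, p_leaf, fill, internal_levels):
    """Average number of record pointers held at the leaf level."""
    fan_out = int(p * fill)            # 34 * 0.69 = 23.46 -> 23 pointers
    per_leaf = int(p_leaf * fill)      # 31 * 0.69 = 21.39 -> 21 pointers
    leaves = fan_out ** internal_levels
    return leaves * per_leaf

# Three internal levels: root, level 1, level 2.
print(bplus_record_pointers(p=34, p_leaf=31, fill=0.69, internal_levels=3))  # 255507
```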
B+-Tree: Insert entry

Example of insertion in B+-tree

p = 3 and pleaf = 2

Insertion Sequence: 8, 5, 1, 7, 3, 12, 9, 6
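The insertion sequence can be traced with a small sketch. The split rules below are assumptions, since the slides' algorithm figure did not survive extraction: an overflowing leaf keeps its first ⌈(pleaf+1)/2⌉ keys and copies the last of them up to the parent, an overflowing internal node keeps ⌊(p+1)/2⌋ pointers and moves the middle key up, and a search value equal to a key goes to the left subtree. All class and function names are hypothetical, and next-leaf pointers and data pointers are omitted.

```python
from bisect import bisect_left, insort
from math import ceil

P, P_LEAF = 3, 2   # orders from the slide example

class Node:
    def __init__(self, leaf, keys=None, children=None):
        self.leaf = leaf
        self.keys = keys or []
        self.children = children or []   # used by internal nodes only

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    up_key, right = split
    return Node(False, [up_key], [root, right])   # tree grows by one level

def _insert(node, key):
    """Return None, or (key for the parent, new right sibling) after a split."""
    if node.leaf:
        insort(node.keys, key)
        if len(node.keys) <= P_LEAF:
            return None
        j = ceil((P_LEAF + 1) / 2)           # keys kept in the left leaf
        right = Node(True, node.keys[j:])
        node.keys = node.keys[:j]
        return node.keys[-1], right          # copied up, stays in the leaf
    i = bisect_left(node.keys, key)          # key <= keys[i] goes to child i
    split = _insert(node.children[i], key)
    if split is None:
        return None
    up_key, right = split
    node.keys.insert(i, up_key)
    node.children.insert(i + 1, right)
    if len(node.children) <= P:
        return None
    j = (P + 1) // 2                         # pointers kept in the left node
    mid = node.keys[j - 1]                   # moved up, not copied
    new = Node(False, node.keys[j:], node.children[j:])
    node.keys, node.children = node.keys[:j - 1], node.children[:j]
    return mid, new

def levels(root):
    """Keys of every node, level by level from the root."""
    out, level = [], [root]
    while level:
        out.append([n.keys for n in level])
        level = [c for n in level for c in n.children]
    return out

root = Node(True)
for k in [8, 5, 1, 7, 3, 12, 9, 6]:
    root = insert(root, k)
print(levels(root))
# [[[5]], [[3], [7, 8]], [[1, 3], [5], [6, 7], [8], [9, 12]]]
```

Under these assumptions the tree ends with root 5, internal nodes 3 and 7, 8, and five leaves. A different split convention would give a different but equally valid tree.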

B+-Tree: Delete entry
◼ Remove the entry from the leaf node.
◼ If the deleted value also occurs in an internal node:
❑ Remove it from the internal node.
❑ The value to its left in the leaf node must replace it in the internal
node.
◼ Deletion may cause underflow in a leaf node:
❑ Try to find a sibling leaf node – a leaf node directly to the left or to
the right of the node with underflow.
❑ Redistribute the entries among the node and its sibling.
(Common method: try the left sibling first, then the right sibling.)
❑ If redistribution fails, the node is merged with its sibling.
❑ If a merge occurred, the entry (pointing to the node and its
sibling) must be deleted from the parent node.

B+-Tree: Delete entry (cont.)

◼ If an internal node underflows:
❑ Redistribute the entries among the node, its sibling, and the
parent entry pointing to the node and its sibling.
❑ If redistribution fails, the node is merged with its sibling and
the parent entry pointing to the node and its sibling.
❑ If a merge occurred, the entry pointing to the node and its
sibling must be deleted from the parent node.
❑ If the root node becomes empty → the merged node becomes the
new root node.
◼ Merging could propagate to the root and reduce the number of
tree levels.

Example of deletion from B+-tree

p = 3 and pleaf = 2.

Deletion sequence: 5, 12, 9

◼ Delete 5
◼ Delete 12: underflow (redistribute)
◼ Delete 9: underflow (merge with left, redistribute)

Search using B-trees and B+-trees
[Figure: search for K = 8. Comparisons along the path: 5 < 8, then 7 < 8 <= 8; the value is found.]
◼ Search conditions on indexing attributes
❑ =, <, >, ≤, ≥, between, MINIMUM value, MAXIMUM
value
◼ Search results
❑ Zero, one, or many data records
◼ Search cost
❑ B-trees
◼ From 1 to (1 + the number of tree levels) + data accesses
❑ B+-trees
◼ 1 (root level) + the number of tree levels + data accesses

◼ The index provides a logical ordering of the data file records on the indexing field.

Indexes on Multiple Keys
◼ In many retrieval and update requests, multiple
attributes are involved.
◼ If a certain combination of attributes is used
frequently, it is advantageous to set up an access
structure to provide efficient access by a key value
that is a combination of those attributes.
◼ If an index is created on attributes <A1, A2, … , An>,
the search key values are tuples with n values: <v1,
v2, … , vn>.
◼ A lexicographic ordering of these tuple values
establishes an order on this composite search key.
◼ An index on a composite key of n attributes works
similarly to any index discussed so far.
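Lexicographic ordering of composite keys is easy to try out, because Python compares tuples lexicographically. In this sketch the attribute pair <Dno, Age> and the sample values are illustrative assumptions, and a sorted list stands in for the ordered index.

```python
from bisect import bisect_left, bisect_right

# Hypothetical composite search keys <Dno, Age>.
keys = sorted([(5, 31), (4, 45), (5, 28), (4, 30), (6, 40), (5, 28)])

# Equality on the full composite key.
lo, hi = bisect_left(keys, (5, 28)), bisect_right(keys, (5, 28))
print(hi - lo)                       # 2 matching entries

# Equality on the leading attribute only: all keys with Dno = 5.
lo = bisect_left(keys, (5,))         # (5,) sorts before every (5, x)
hi = bisect_left(keys, (6,))
print(keys[lo:hi])                   # [(5, 28), (5, 28), (5, 31)]
```

A condition on Age alone cannot use this ordering, because entries with the same Age are scattered across different Dno values.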

Other File Indexes
◼ Hash indexes
❑ The hash index is a secondary structure to access the
file by using hashing on a search key other than the one
used for the primary data file organization.
◼ Bitmap indexes
❑ A bitmap index is built on one particular value of a
field (the column in a table) with respect to all the rows
(records) and is an array of bits.
◼ Function-based indexes
❑ In Oracle, an index such that the value that results from
applying a function (expression) on a field or some fields
becomes the key to the index

Hash indexes

◼ The hash index is a secondary structure to access
the file by using hashing on a search key other than
the one used for the primary data file organization.
❑ Access structures similar to indexes, based on hashing
◼ Supports equality searches on the hash field

[Figure: hash-based index on Emp_id. Hashing function: the sum of the digits of Emp_id modulo 10.]

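The hashing function from the figure, the sum of the digits of Emp_id modulo 10, can be sketched as a bucket directory. The Emp_id values and the record pointers below are made up for illustration.

```python
from collections import defaultdict

def h(emp_id):
    """Sum of the digits of Emp_id, modulo 10."""
    return sum(int(d) for d in str(emp_id)) % 10

# Hypothetical Emp_id values mapped to made-up record pointers.
pointers = {51024: "rec0", 23402: "rec1", 34723: "rec2", 12676: "rec3"}

buckets = defaultdict(list)            # bucket number -> [(Emp_id, pointer)]
for emp_id, ptr in pointers.items():
    buckets[h(emp_id)].append((emp_id, ptr))

def find(emp_id):
    """Equality search: probe only the bucket that emp_id hashes to."""
    return [p for e, p in buckets[h(emp_id)] if e == emp_id]

print(h(51024), h(12676))   # 2 2
print(find(12676))          # ['rec3']
```

Two different keys landing in bucket 2 shows why each bucket entry keeps the key alongside the pointer. A range condition gets no help from this structure, since neighboring key values hash to unrelated buckets.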
Bitmap indexes
◼ A bitmap index is built on one particular value
of a field (the column in a table) with respect to
all the rows (records) and is an array of bits.
❑ Each bit in the bitmap corresponds to a row. If the bit is
set, then the row contains the key value.
◼ In a bitmap index, each indexing field value is
associated with pointers to multiple rows.
◼ Bitmap indexes are primarily designed for data
warehousing or environments in which queries
reference many columns in an ad hoc fashion.
❑ The number of distinct values of the indexed field is
small compared to the number of rows.
❑ The indexed table is either read-only or not subject to
significant modification by DML statements.
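A minimal sketch of the idea, using Python integers as bit arrays. The columns Sex and Zipcode and their values are illustrative assumptions, and real systems typically store compressed bitmaps rather than plain integers.

```python
# Hypothetical column values, one per row (row number = list position).
sex = ["M", "F", "F", "M", "F", "M", "M", "F"]
zipcode = [30022, 30022, 94040, 94040, 30022, 19046, 30022, 94040]

def bitmap(column, value):
    """Bitmap for one value as an int: bit i is set if row i holds the value."""
    bits = 0
    for row, v in enumerate(column):
        if v == value:
            bits |= 1 << row
    return bits

def rows(bits):
    """Row numbers whose bit is set."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]

# Rows with Sex = 'F' AND Zipcode = 30022: bitwise AND of the two bitmaps.
both = bitmap(sex, "F") & bitmap(zipcode, 30022)
print(rows(both))   # [1, 4]
```

Conditions on several columns combine through bitwise operations before any data row is fetched.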
Function-based indexes
◼ The use of any function on a column prevents the
index defined on that column from being used.
❑ Indexes are only used with some specific search
conditions on indexed columns.

◼ In Oracle, a function-based index is an index
such that the value that results from applying
some function (expression) on a field or a
collection of fields becomes the key to the index.
❑ A function-based index can be either a B-tree or a
bitmap index.

Index Creation
CREATE [ UNIQUE ] INDEX <index name>
ON <table name> ( <column name> [ <order> ] { , <column name> [ <order> ] } )
[ CLUSTER ] ;

◼ UNIQUE is used to guarantee that no two rows of a table
have duplicate values in the key column or columns.
◼ CLUSTER is used when the index to be created should also
sort the data file records on the indexing attribute.

CREATE INDEX DnoIndex ON EMPLOYEE (Dno)
CLUSTER ;

[Figure: B-tree index in Oracle 19c]

[Figure: B-tree for a clustered index in MS SQL Server]

Review questions
1) Define the following terms: indexing field, primary key field, clustering
field, secondary key field, block anchor, dense index, and nondense
(sparse) index.
2) What are the differences among primary, secondary, and clustering
indexes? How do these differences affect the ways in which these
indexes are implemented? Which of the indexes are dense, and which
are not?
3) Why can we have at most one primary or clustering index on a file, but
several secondary indexes?
4) How does multilevel indexing improve the efficiency of searching an
index file?
5) What is the order p of a B-tree? Describe the structure of B-tree nodes.
6) What is the order p of a B+-tree? Describe the structure of both internal
and leaf nodes of a B+-tree.
7) How does a B-tree differ from a B+-tree? Why is a B+-tree usually
preferred as an access structure to a data file?
