Chapter 7 Indexing Part 1

This document discusses indexing in databases using B+-trees. It begins by explaining basic concepts like indexes, search keys, and primary versus secondary indexes. It then describes the properties of B+-trees, including that they are balanced, support efficient equality and range searches, and have internal nodes with up to n-1 keys and n pointers. The document provides examples of leaf nodes containing keys and pointers to records, and internal nodes containing keys that separate child nodes. It concludes with an example of building a secondary B+-tree index on an attribute to speed up searches.

Acknowledgment: Dr. Chui Chun Kit

Chapter 7. Indexing
Part 1: B+-Tree
COMP3278 Introduction to Database Management Systems
Department of Computer Science, The University of Hong Kong
Slides prepared by - Dr. Reynold Cheng, http://www.cs.hku.hk/~ckcheng/ for students in COMP3278
For other uses, please email : [email protected]

In this chapter…
Outcome 1. Information Modeling
Able to understand the modeling of real-life information in a database system.

Outcome 2. Query Languages
Able to understand and use the languages designed for data access.

Outcome 3. System Design
Able to understand the design of an efficient and reliable database system.

Outcome 4. Application Development
Able to implement a practical application on a real database.

We are going to learn…
Basic concepts
B+-tree
Hashing
Index definition in SQL

Section 1

Basic Concepts

Basic concepts
An index is used to speed up access to desired data.
E.g., the author catalog in a library, a phone directory index, etc.

Search key
An attribute or a set of attributes used to look up records in a file.

Indices are typically much smaller than the original file.

Primary vs. secondary
Primary index - An index whose search key also defines the sequential order of the file.
E.g., access staff records through staffID (the primary search key).
However, the data file can be sorted in one order only.

How about accessing data with a different search key?
E.g., to access staff records through roomID (a secondary search key), we need a secondary index!

staffID  roomID  faculty
10101    49      C.S.
12121    42      Finance
15151    35      Music
22222    10      Physics
32343    15      History
33456    18      C.S.
45565    20      E.E.E.
58583    3       Biology
76543    31      Finance
76766    5       Finance
83821    2       C.S.
98345    24      C.S.

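As a rough illustration of why a secondary index helps, the sketch below keeps the staff records in staffID order and builds a separate sorted list of (roomID, position) pairs. The names `records`, `room_index`, and `find_by_room` are hypothetical and not from the slides, and a real DBMS would store block addresses rather than list positions.

```python
from bisect import bisect_left

# Data file: sorted by staffID (the primary search key), as in the table above.
records = [
    (10101, 49, "C.S."), (12121, 42, "Finance"), (15151, 35, "Music"),
    (22222, 10, "Physics"), (32343, 15, "History"), (33456, 18, "C.S."),
    (45565, 20, "E.E.E."), (58583, 3, "Biology"), (76543, 31, "Finance"),
    (76766, 5, "Finance"), (83821, 2, "C.S."), (98345, 24, "C.S."),
]

# Secondary index: (roomID, position in the file), sorted by roomID.
room_index = sorted((room, pos) for pos, (_, room, _) in enumerate(records))

def find_by_room(room_id):
    """Binary-search the index instead of scanning the whole file."""
    i = bisect_left(room_index, (room_id, -1))
    if i < len(room_index) and room_index[i][0] == room_id:
        return records[room_index[i][1]]
    return None

print(find_by_room(3))  # (58583, 3, 'Biology')
```

The data file itself stays in staffID order; only the small index is sorted by roomID.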
2 classes of indices
Ordered Indices – Search keys are sorted in the index.
Example: indexed-sequential file, B+-tree.

Hash Indices – Search keys are distributed over different buckets using a hash function.
Example: extendable hash-index.

These indices differ in their speeds in answering different queries.

Index evaluation factors
Each indexing technique must be evaluated on the basis of these factors:
Access types – The types of access that are supported efficiently (e.g., equality search or range search? Single-attribute search or multi-attribute search?)
Access time – The time it takes to find a particular data item, or a set of items.
Insertion / deletion time
Space overhead

No one indexing technique is the best. Rather, each technique is best suited to particular database applications.

Section 2

B+-tree

Properties of B+-tree
The B+-tree index structure is one of the most widely used index structures in DBMSs.
All paths from root to leaf are of the same length (i.e., the tree is balanced).
It supports efficient processing of the following queries (assume that the B+-tree is built on attribute A of the relation R):
SELECT * FROM R WHERE R.A = 3
SELECT * FROM R WHERE R.A >= 3 AND R.A < 22

Why not binary search tree?
A balanced binary search tree minimizes the number of key comparisons for finding a search key. Why don’t we use a balanced binary search tree in a database?
Because we want to minimize the number of block retrievals in answering a query (i.e., the number of tree nodes to be accessed) rather than the number of key comparisons.

We need a tree which is:
Node size = 1 block (a node can contain more than one key)
Low in height
Balanced

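To make the height argument concrete, here is a small back-of-the-envelope sketch. The file size of one million keys and the fanout of 100 pointers per block are assumed numbers for illustration only, and `levels_needed` is a hypothetical helper that ignores minimum-occupancy effects.

```python
def levels_needed(num_keys, fanout):
    """Smallest number of levels such that fanout**levels >= num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels

num_keys = 1_000_000                 # assumed file size, for illustration only
print(levels_needed(num_keys, 2))    # binary tree: 20
print(levels_needed(num_keys, 100))  # assumed 100 pointers per block: 3
```

Under these assumptions, a lookup touches about 20 nodes in a binary tree but only about 3 nodes when each node fills a whole block.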
A node in B+-tree
Example: n = 4 pointers, n-1 = 3 search-keys.

A node contains up to n-1 search-key values, and n pointers.

The search-key values within a node are kept in sorted order (e.g., 1 2 5, not 1 5 2).

1. Leaf node
Example leaf node keys: 1 4 6 9 11

A leaf node has at least ⌈(n – 1)/2⌉ and at most (n – 1) values, where n is the number of pointers.
E.g., with n = 4, a leaf node must contain at least 2 values, and at most 3 values.

The last pointer is used to chain together the leaf nodes in search-key order.

1. Leaf node
Leaf node keys: 1 4 6 9 11

Records in the file:
1001  4   Ben      100
1003  1   David    150
1005  11  Kit      40
1006  9   Anthony  90
1011  6   Kenneth  110

The pointer before a search-key value points to the record that contains the search-key (assume no duplicate values).

2. Non-leaf node
(Figure: a non-leaf node with pointers to the leaf nodes below it.)

Non-leaf nodes must hold at least ⌈n/2⌉, and at most n pointers.
E.g., with n = 4, a non-leaf node contains at least 2 pointers, and at most 4 pointers.

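The occupancy rules for leaf and non-leaf nodes can be checked with a few lines of Python. The function names `leaf_bounds` and `nonleaf_bounds` are hypothetical; the formulas are the ones stated on the slides.

```python
from math import ceil

def leaf_bounds(n):
    """(min, max) number of search-key values in a leaf with n pointers."""
    return ceil((n - 1) / 2), n - 1

def nonleaf_bounds(n):
    """(min, max) number of pointers in a non-leaf node."""
    return ceil(n / 2), n

print(leaf_bounds(4))     # (2, 3)
print(nonleaf_bounds(4))  # (2, 4)
```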
2. Non-leaf node
Non-leaf node: [4]
Leaf nodes: [1 2 3] [4 7 8] …

The pointer on the left of a key K points to the part of the subtree that contains those key values less than K.
The pointer on the right of a key K points to the part of the subtree that contains those key values larger than or equal to K.

Example B+-tree
In the file, records are ordered according to the 1st attribute. We would like to build a B+-tree index (secondary index) to speed up the searching on the 2nd attribute.

10101  49  C.S.
12121  42  Finance
15151  35  Music
22222  10  Physics
32343  15  History
33456  18  C.S.
45565  20  E.E.E.
58583  3   Biology
76543  31  Finance
76766  5   Finance
83821  2   C.S.
98345  24  C.S.

Example B+-tree
Root: [31]
Non-leaf nodes: [10 18] [42]
Leaf nodes: [2 3 5] [10 15] [18 20 24] [31 35] [42 49]

With n = 4, a leaf node must contain at least 2 values, and at most 3 values.
With n = 4, a non-leaf node must contain at least 2 pointers, and at most 4 pointers.

Searching
Point query
SELECT * FROM R WHERE R.B = 3

Root: [31]
Non-leaf nodes: [10 18] [42]
Leaf nodes: [2 3 5] [10 15] [18 20 24] [31 35] [42 49]

Step 1. Traverse from root to leaf.
Step 2. Search in the leaf node.
Step 3. Follow the pointer in the leaf node to retrieve the record.

With this B+-tree, how many disk block accesses are needed to answer this query?

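The three search steps can be sketched in Python on the example tree. The `Node` class and the `search` function are hypothetical names, and the tree is built by hand rather than by insertion. The sketch counts the index nodes visited; if each node occupies one block and fetching the record costs one more block access (both assumptions of this sketch), the point query above would need 3 + 1 = 4 block accesses.

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children  # None for a leaf node

# The example tree (n = 4), built by hand.
leaves = [Node([2, 3, 5]), Node([10, 15]), Node([18, 20, 24]),
          Node([31, 35]), Node([42, 49])]
root = Node([31], [Node([10, 18], leaves[:3]), Node([42], leaves[3:])])

def search(node, key):
    """Return (found, number of index nodes visited)."""
    visited = 1
    while node.children is not None:
        # Keys < K go left of K; keys >= K go right of K.
        node = node.children[bisect_right(node.keys, key)]
        visited += 1
    return key in node.keys, visited

found, visited = search(root, 3)
print(found, visited)  # True 3
```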
Searching
Range query
SELECT * FROM R WHERE R.B >= 3 AND R.B < 22

Root: [31]
Non-leaf nodes: [10 18] [42]
Leaf nodes: [2 3 5] [10 15] [18 20 24] [31 35] [42 49]
Start output at key 3 and stop output at key 24 (the first key that is not less than 22).

The B+-tree can also handle range search very well. Search for the left border of the range and traverse the leaf chain until a record with a search-key larger than the right border is encountered.

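A minimal sketch of the range search on the same example tree follows. The `next` attribute stands for the last pointer of each leaf (the leaf chain); `Node` and `range_search` are hypothetical names, and the half-open range `low <= k < high` matches the query on this slide.

```python
from bisect import bisect_left, bisect_right

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children  # None for a leaf node
        self.next = None          # leaf chain pointer

leaves = [Node([2, 3, 5]), Node([10, 15]), Node([18, 20, 24]),
          Node([31, 35]), Node([42, 49])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right             # chain the leaves in search-key order
root = Node([31], [Node([10, 18], leaves[:3]), Node([42], leaves[3:])])

def range_search(node, low, high):
    """Return all keys k with low <= k < high."""
    while node.children is not None:          # descend to the leaf for `low`
        node = node.children[bisect_right(node.keys, low)]
    result = []
    i = bisect_left(node.keys, low)           # first key >= low in that leaf
    while node is not None:
        for key in node.keys[i:]:
            if key >= high:                   # passed the right border: stop
                return result
            result.append(key)
        node, i = node.next, 0                # follow the leaf chain
    return result

print(range_search(root, 3, 22))  # [3, 5, 10, 15, 18, 20]
```

Only one root-to-leaf descent is needed; the rest of the work is a sequential walk along the leaf level.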
Insertion
Assuming no duplicate entries are inserted, insertion is simply searching + inserting the entry.
If a leaf node is full, node splitting has to be performed.
Step 1. Create one more node and distribute the first ⌈n/2⌉ records to one node and the remaining to the other node.
Step 2. Parent nodes (non-leaf nodes) have to be updated accordingly.

E.g., inserting key “1” into the full leaf node [2 3 5] gives the entries 1 2 3 5, which no longer fit in one node.

1. Node splitting (leaf node)
Let’s learn how node splitting is implemented on a leaf node by considering inserting key “1” into the B+-tree below.

Root: [31]
Non-leaf nodes: [10 18] [42]
Leaf nodes: [2 3 5] [10 15] [18 20 24] [31 35] [42 49]

We first search for the leaf node into which the key “1” should be inserted, which is [2 3 5].
Since this node is full, inserting “1” requires SPLITTING this leaf node.

Step 1. Create one more node and distribute the entries.

Root: [31]
Non-leaf nodes: [10 18] [42]
Leaf nodes: [1 2] [3 5] [10 15] [18 20 24] [31 35] [42 49]

Step 2. Update the parent.

Root: [31]
Non-leaf nodes: [3 10 18] [42]
Leaf nodes: [1 2] [3 5] [10 15] [18 20 24] [31 35] [42 49]

DONE

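The leaf split can be sketched as follows. The function `insert_into_leaf` is a hypothetical helper that works on a plain list of keys and leaves the parent update to the caller; it follows the rule stated earlier, keeping the first ⌈n/2⌉ entries in the old node.

```python
from bisect import insort
from math import ceil

def insert_into_leaf(keys, key, n):
    """Insert key into a leaf; split if it then holds more than n - 1 keys.

    Returns (left_keys, right_keys, separator). right_keys and separator
    are None when no split is needed.
    """
    keys = list(keys)
    insort(keys, key)                 # keep keys in sorted order
    if len(keys) <= n - 1:
        return keys, None, None
    cut = ceil(n / 2)                 # first ceil(n/2) entries stay on the left
    left, right = keys[:cut], keys[cut:]
    return left, right, right[0]      # separator is copied up to the parent

print(insert_into_leaf([2, 3, 5], 1, n=4))  # ([1, 2], [3, 5], 3)
```

The separator 3 is the key that appears in the parent node after the split in the example above. Note that it is copied, so it also remains in the new leaf.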
2. Node splitting (non-leaf node)
Splitting of a non-leaf node is a little different from splitting of a leaf node.
Let’s learn how node splitting is implemented on a non-leaf node by considering inserting key “26” into the B+-tree below.

Root: [31]
Non-leaf nodes: [3 10 18] [42]
Leaf nodes: [1 2] [3 5] [10 15] [18 20 24] [31 35] [42 49]

“26” should be inserted into the leaf node [18 20 24].
Since this node is full, node SPLITTING needs to be performed.

Step 1. Create one more node and distribute the entries.

Root: [31]
Non-leaf nodes: [3 10 18] [42]
Leaf nodes: [1 2] [3 5] [10 15] [18 20] [24 26] [31 35] [42 49]

Step 2. Update the parent (the parent node is full!)
As we cannot have 5 pointers stored in a non-leaf node, we need to split this non-leaf node (recursively).

Splitting non-leaf node (recursive)
Step 1. We first create a new node to accommodate the new pointers (the 5 pointers, one for each leaf node).
Step 2. We distribute the pointers among the two nodes: the first 3 pointers stay in the left node and the remaining 2 move to the new node.
Step 3. Then consider the keys that are required in each slot among the two nodes.

Tricky part! Note that the search key that lies between the pointers that stay on the left and the pointers that move to the right node (i.e., 18) is treated differently.
“18” is moved to the parent node to separate the search-keys among the two nodes (if the parent node is full, split the parent node recursively).

Root: [18 31]
Non-leaf nodes: [3 10] [24] [42]
Leaf nodes: [1 2] [3 5] [10 15] [18 20] [24 26] [31 35] [42 49]

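The difference from a leaf split is that the middle key moves up instead of being copied. A minimal sketch, assuming the overfull node temporarily holds n keys and n + 1 pointers and that the left node keeps ⌈(n + 1)/2⌉ pointers (a choice that reproduces the example above; the slides do not state the formula). `split_nonleaf` and the leaf labels are hypothetical.

```python
from math import ceil

def split_nonleaf(keys, children, n):
    """Split an overfull non-leaf node holding n keys and n + 1 pointers.

    Returns (left_keys, left_children, pushed_up_key,
             right_keys, right_children).
    """
    assert len(keys) == n and len(children) == n + 1
    left_ptrs = ceil((n + 1) / 2)          # pointers that stay on the left
    left_keys = keys[:left_ptrs - 1]
    pushed_up = keys[left_ptrs - 1]        # moves up; kept in neither node
    right_keys = keys[left_ptrs:]
    return (left_keys, children[:left_ptrs], pushed_up,
            right_keys, children[left_ptrs:])

# Leaves are named by their first key, for readability only.
result = split_nonleaf([3, 10, 18, 24], ["L1", "L3", "L10", "L18", "L24"], n=4)
print(result)
# ([3, 10], ['L1', 'L3', 'L10'], 18, [24], ['L18', 'L24'])
```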
Deletion
Find the record to be deleted.

Remove it from the file and from the leaf node (if present).

If the leaf node has too few entries due to the removal:
Try to MERGE the node with its sibling node.
Try to REDISTRIBUTE the entries if MERGING fails.

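The merge-or-redistribute decision for a leaf can be written as a small function. This is a sketch under the assumption that merging is possible exactly when the entries of both leaves fit into one node; `after_leaf_deletion` is a hypothetical name.

```python
from math import ceil

def after_leaf_deletion(node_size, sibling_size, n):
    """Decide what to do with a leaf that holds node_size keys after a delete."""
    min_keys, max_keys = ceil((n - 1) / 2), n - 1
    if node_size >= min_keys:
        return "ok"
    if node_size + sibling_size <= max_keys:
        return "merge"            # both nodes fit in a single leaf
    return "redistribute"         # merging fails, so borrow from the sibling

print(after_leaf_deletion(2, 3, n=4))  # ok
print(after_leaf_deletion(1, 2, n=4))  # merge
print(after_leaf_deletion(1, 3, n=4))  # redistribute
```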
1. Merging
Let’s try to remove key “42” from the B+-tree below.

Root: [31]
Non-leaf nodes: [3 10 18] [42]
Leaf nodes: [1 2] [3 5] [10 15] [18 20 24] [31 35] [42 49]

Deletion may cause a node to become underfull.
After removing “42”, the leaf node [49] has only 1 value, which violates the requirement that each leaf node must contain at least ⌈(n – 1)/2⌉ values (i.e., 2 in this case).

Step 1. Merge with sibling node.
After merging, the sibling becomes [31 35 49], and the emptied leaf node is no longer used.

Step 2. Update the parents.

Root: [31]
Non-leaf nodes: [3 10 18] [no key, 1 pointer]
Leaf nodes: [1 2] [3 5] [10 15] [18 20 24] [31 35 49]

The parent node now contains too few pointers. Remember we require a non-leaf node to have at least ⌈n/2⌉ pointers.

Recursively, we try to MERGE these 2 non-leaf nodes.
However, the two nodes cannot be merged as the left node is already full (4 pointers).

When MERGE fails, do REDISTRIBUTION!

2. Redistribution
Step 1. Redistribute the pointers.
Step 2. Update the keys.

Root: [18]
Non-leaf nodes: [3 10] [31]
Leaf nodes: [1 2] [3 5] [10 15] [18 20 24] [31 35 49]

DONE

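The two redistribution steps for non-leaf nodes can be sketched as one rotation through the parent. The function `borrow_from_left` and the leaf labels are hypothetical; the sketch assumes the underfull node borrows exactly one pointer from its left sibling, which is what the example shows.

```python
def borrow_from_left(left_keys, left_children, parent_key,
                     right_keys, right_children):
    """Move the last pointer of the left non-leaf node to the right one.

    The old parent key comes down into the right node, and the last key
    of the left node goes up as the new parent key.
    """
    left_keys, left_children = list(left_keys), list(left_children)
    right_keys, right_children = list(right_keys), list(right_children)
    right_children.insert(0, left_children.pop())
    right_keys.insert(0, parent_key)
    new_parent_key = left_keys.pop()
    return left_keys, left_children, new_parent_key, right_keys, right_children

result = borrow_from_left([3, 10, 18], ["L1", "L3", "L10", "L18"], 31,
                          [], ["L31"])
print(result)
# ([3, 10], ['L1', 'L3', 'L10'], 18, [31], ['L18', 'L31'])
```

This matches the final tree: the root key becomes 18, the left node keeps [3 10], and the right node holds [31].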
Example 1
Delete 35

Before:
Root: [18]
Non-leaf nodes: [3 10] [31]
Leaf nodes: [1 2] [3 5] [10 15] [18 20 24] [31 35 49]

After deletion, the leaf node [31 49] contains 2 values (VALID).
Remember the keys in a node should be in sorted order.

After:
Root: [18]
Non-leaf nodes: [3 10] [31]
Leaf nodes: [1 2] [3 5] [10 15] [18 20 24] [31 49]

DONE

Example 2
Delete 49

Before:
Root: [18]
Non-leaf nodes: [3 10] [31]
Leaf nodes: [1 2] [3 5] [10 15] [18 20 24] [31 49]

Deletion of “49” causes the leaf node [31] to contain only one value, which is underfull.
First, try to MERGE with its sibling node, but the sibling node [18 20 24] is full, so we need to do REDISTRIBUTION.
After REDISTRIBUTION, we need to update the keys.

After:
Root: [18]
Non-leaf nodes: [3 10] [24]
Leaf nodes: [1 2] [3 5] [10 15] [18 20] [24 31]

DONE

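Redistribution between two leaf nodes can be sketched as moving one key and refreshing the separator in the parent. `borrow_key_from_left_leaf` is a hypothetical helper, and borrowing exactly one key from the left sibling is an assumption that reproduces Example 2.

```python
def borrow_key_from_left_leaf(left_keys, right_keys):
    """Move the largest key of the left leaf into the underfull right leaf.

    Returns (left_keys, right_keys, new_separator); the separator in the
    parent becomes the smallest key of the right leaf.
    """
    left_keys, right_keys = list(left_keys), list(right_keys)
    right_keys.insert(0, left_keys.pop())
    return left_keys, right_keys, right_keys[0]

print(borrow_key_from_left_leaf([18, 20, 24], [31]))
# ([18, 20], [24, 31], 24)
```

The new separator 24 replaces the old key 31 in the parent node.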
Example 3
Delete 18

Before:
Root: [18]
Non-leaf nodes: [3 10] [24]
Leaf nodes: [1 2] [3 5] [10 15] [18 20] [24 31]

Deletion of “18” causes the leaf node [20] to contain only one value, which is underfull.
First, try to merge with its sibling node. Which sibling should be merged?
Here it is merged with the sibling under the same parent, giving [20 24 31]. After merging, the emptied leaf node is no longer used.

Root: [18]
Non-leaf nodes: [3 10] [no key, 1 pointer]
Leaf nodes: [1 2] [3 5] [10 15] [20 24 31]

Now the right non-leaf node has only one pointer, which is underfull.
We try merging it with its sibling.

Merging non-leaf nodes
Step 1. Update the pointers.
Step 2. Update the keys. (The new key is “18”, as originally it is the key “18” in the root node that separates the two pointers.)

Note that since we merged the non-leaf nodes, some pointers and parent entries can be removed.

After:
Root: [3 10 18]
Leaf nodes: [1 2] [3 5] [10 15] [20 24 31]

DONE

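Merging two non-leaf nodes can be sketched as concatenating their keys with the separating parent key pulled down in between. `merge_nonleaf` and the leaf labels are hypothetical names used only for this illustration.

```python
def merge_nonleaf(left_keys, left_children, parent_key,
                  right_keys, right_children):
    """Merge two sibling non-leaf nodes into one.

    The parent key that separated the two nodes is pulled down between
    their keys so that every pair of adjacent pointers still has a key.
    """
    keys = list(left_keys) + [parent_key] + list(right_keys)
    children = list(left_children) + list(right_children)
    return keys, children

keys, children = merge_nonleaf([3, 10], ["L1", "L3", "L10"], 18, [], ["L20"])
print(keys, children)
# [3, 10, 18] ['L1', 'L3', 'L10', 'L20']
```

In Example 3 the old root is left without keys after this step, so the merged node becomes the new root and the tree loses one level.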
Discussions
The B+-tree is one of the most important structures in large DB search.
The B+-tree is very good at handling range queries, and has decent update performance.
The B+-tree has a lot of variants, for handling dynamic data and multi-dimensional spatial data.
Next, we study hash tables for DBMS... Stay tuned!

Chapter 7 (Part 1).

END