Tutorial 10 Indexing

The document discusses two main indexing approaches in a DBMS: B+-tree and extendable hashing, noting that these topics will not be included in the exam. It details the creation of indexes in MySQL, the structure and properties of B+-trees, and the process of insertion and deletion within these trees. Additionally, it covers the concept of extendable hashing, highlighting its dynamic nature and the use of a bucket address table for efficient data management.

Uploaded by

wzy190817
Indexing COMP3278 2024-2025

Overview
• Two approaches to indexing in a DBMS
• B+-tree
• Extendable hashing

This topic will not be included in the exam.

CREATE INDEX in MySQL
https://dev.mysql.com/doc/refman/5.7/en/create-index.html

• One can create an index in MySQL using the CREATE INDEX statement.
• This speeds up query execution whenever the index can be used.
• Note that all primary keys are indexed.

Index creation is DBMS-specific: different DBMSs (and storage engines) provide different indexing capabilities. Knowing how to create the proper indexes for an application's database is an important skill for database administrators.

Tree and hash
• Two major approaches are used to provide an index
• A tree keeps the indexed column values in order, so it supports comparison operators and ORDER BY queries
• A hash keeps the indexed column values unordered, but supports fast access by exact value

[Figure: a B+-tree index (left) and an extendable hash index with its bucket address table (right).]

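The contrast between the two approaches can be sketched with Python's standard library. This is only an illustration under assumptions: a sorted list searched with `bisect` stands in for a tree index, a `dict` stands in for a hash index, and the values and record names are made up.

```python
from bisect import bisect_left, bisect_right

values = [20, 5, 16, 7, 14]          # hypothetical indexed column values

# "Tree-like" index: values are kept in order, so ranges and ORDER BY are cheap.
ordered_index = sorted(values)

def range_query(index, low, high):
    """Return all indexed values v with low <= v <= high, in order."""
    return index[bisect_left(index, low):bisect_right(index, high)]

# "Hash-like" index: unordered, but an exact-match lookup is a single probe.
hash_index = {v: f"record-{v}" for v in values}

in_range = range_query(ordered_index, 7, 16)
exact = hash_index.get(14)
```

A range query on the hash-like index would have to scan every entry, which is why the ordered structure is preferred for comparison operators.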
B+-tree – node
A node in a B+-tree consists of (search-key) values and pointers.

[Figure: a node containing the search-key values 14 and 20, with the search-key values and the pointers labelled.]

The number of search-key values per node is configurable (but is fixed within the same tree).

In this example, the tree has order 4, i.e., there are 4 pointers (3 search-key values) in a node.

B+-tree – leaf and non-leaf nodes
• Pointers in a non-leaf node point to nodes at a lower level.
• The pointer to the left of a search-key value points to a node whose values are less than that search-key value; the pointer to its right points to a node whose values are greater than or equal to it.
• In a leaf node, the pointer before a value points to its corresponding record.
• Leaf nodes form a linked list using their last pointers.
• Values in non-leaf nodes are only part of the tree structure; only the values in the leaves are the indexed values.

[Figure: root node (14, 20) with three child ranges: < 14; ≥ 14 and < 20; ≥ 20. The leaves hold 5, 7 | 15 | 20.]

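The pointer rule can be sketched as a lookup routine. This is a minimal sketch, not a full B+-tree: the `Node` class and `search` function are hypothetical names, record pointers and the leaf linked list are omitted, and the tree is built by hand from the root (14, 20) and the leaves 5, 7 | 15 | 20.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list                                     # search-key values in this node
    children: list = field(default_factory=list)   # empty for a leaf node

    @property
    def is_leaf(self):
        return not self.children

def search(root, key):
    """Return (found, nodes_accessed) for an exact-match lookup."""
    node, accessed = root, 1
    while not node.is_leaf:
        # Keys smaller than keys[0] follow pointer 0; a key >= keys[i]
        # follows a pointer to the right of keys[i].
        node = node.children[bisect_right(node.keys, key)]
        accessed += 1
    return key in node.keys, accessed

root = Node([14, 20], [Node([5, 7]), Node([15]), Node([20])])
```

Note that `search(root, 14)` reports "not found" even though 14 appears in the root, because only leaf values are indexed values.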
B+-tree – quick test
Consider this B+-tree:
1. How many nodes are accessed if we want to find the value 16?
2. How many nodes are accessed if we want to find the values between 13 and 17?
3. If no new node is added, how many values could be stored in this tree?

[Figure: a three-level B+-tree. Root: 14, 18. Second-level values: 7, 9, 14, 20. Leaf values: 1, 3, 4, 7, 8, 10, 13, 14, 15, 16, 17, 18, 19, 20, 21.]

Why don't we just use a binary tree?
A B+-tree optimizes memory access: a node fits in a single block of memory, which minimizes the number of block retrievals.

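A rough way to see the difference from a binary tree is to count how many levels are needed to reach a given number of keys. The sketch below assumes every node is full, that one node costs one block retrieval, and a fan-out of 100 pointers per block, which is an illustrative figure rather than one taken from the slides.

```python
def levels_needed(num_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    levels, reachable = 1, fanout
    while reachable < num_keys:
        reachable *= fanout
        levels += 1
    return levels

binary_levels = levels_needed(1_000_000, 2)     # two pointers per node
wide_levels = levels_needed(1_000_000, 100)     # assumed 100 pointers per block
```

Under these assumptions a lookup among a million keys touches about 20 nodes in a binary tree but only 3 in the wide tree.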
B+-tree properties
• Always balanced: all leaf nodes are at the same depth
• Except for the root node:
• A leaf node must be at least half-full (at least half of its value slots are filled), i.e., it holds at least $\lceil (m-1)/2 \rceil$ values for a tree of order $m$.
• A non-leaf node must be at least half-full (at least half of its pointers point to a child), i.e., it has at least $\lceil m/2 \rceil$ pointers for a tree of order $m$.

• In this discussion, we assume that the values to be indexed are unique.
• How non-unique values are stored is implementation-specific.

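The half-full bounds can be written out as small helpers. This sketch assumes the fractions are rounded up, which agrees with the deletion example later in the slides, where a leaf with 1 value in a tree of order 4 counts as underfull. The function names are hypothetical.

```python
import math

def min_leaf_values(order):
    """Minimum number of values in a non-root leaf: ceil((m - 1) / 2)."""
    return math.ceil((order - 1) / 2)

def min_nonleaf_pointers(order):
    """Minimum number of child pointers in a non-root, non-leaf node: ceil(m / 2)."""
    return math.ceil(order / 2)

def leaf_is_underfull(values, order):
    """True if a non-root leaf holds fewer values than the minimum."""
    return len(values) < min_leaf_values(order)
```

For order 4 this gives at least 2 values per leaf and at least 2 pointers per non-leaf node.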
B+-tree – insertion (1)
Simply search and fill if the leaf node is not full.

Insert: 17
[Figure, before: root (14, 20); the leaves hold 5, 7 | 15 | 20.]

B+-tree – insertion (1): result
Insert: 17
[Figure, after: root (14, 20); the leaves hold 5, 7 | 15, 17 | 20.]

B+-tree – insertion (2)
Split the node if inserting into a full node.

Insert: 18
[Figure: root (14, 20); the leaves hold 5, 7 | 15, 17, 19 | 20. The target leaf (15, 17, 19) is already full.]

B+-tree – insertion (2): splitting
Pick a mid-point to split.

Insert: 18
• If adding a value to a node exceeds its capacity, the node should be split and its values redistributed. Here the overfull leaf holds 15, 17, 18, 19.
• If the number of values to be redistributed is odd, the choice of mid-point is implementation-dependent.
• Once the node is split, a new parent value is needed. We pick the first value in the right child as the parent value, here 18.
• We need to insert this value into the parent node.

[Figure: the leaf (15, 17, 18, 19) is split into (15, 17) and (18, 19); 18 is passed up to the parent.]

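The leaf split can be sketched as follows. This is a minimal sketch with a hypothetical `split_leaf` helper working on plain sorted lists; sending the extra value to the right half when the count is odd is one possible choice, since the slides leave it implementation-dependent.

```python
from bisect import insort

def split_leaf(values):
    """Split an overfull, sorted leaf; return (left, right, value_for_parent)."""
    mid = len(values) // 2            # with an odd count, the extra value goes right
    left, right = values[:mid], values[mid:]
    return left, right, right[0]      # first value of the right child is copied up

leaf = [15, 17, 19]                   # full leaf in a tree of order 4
insort(leaf, 18)                      # now overfull: [15, 17, 18, 19]
left, right, up = split_leaf(leaf)
```

The value passed up stays in the right leaf as well, because only leaf values are indexed values.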
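B+-tree – insertion (2): update parent
Insert the new value into the parent recursively.

Insert: 18
The parent (14, 20) has room for 18, so the insertion is done.

[Figure: root (14, 20) receiving 18; the leaves hold 5, 7 | 15, 17 | 18, 19 | 20.]

B+-tree – insertion (2): result
Insert: 18
[Figure: root (14, 18, 20); the leaves hold 5, 7 | 15, 17 | 18, 19 | 20.]
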
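B+-tree – insertion (3)
Node splitting may need to be done recursively.

Insert: 14
[Figure: root (14, 18, 20); the leaves hold 5, 7 | 15, 16, 17 | 18, 19 | 20.]

B+-tree – insertion (3): first split
Insert: 14
The target leaf is split into (14, 15) and (16, 17), and 16 is passed up. The parent (14, 18, 20) is also full, so the parent needs to be split as well.

[Figure: root (14, 18, 20) receiving 16; the leaves hold 5, 7 | 14, 15 | 16, 17 | 18, 19 | 20.]
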
B+-tree – insertion (3): splitting the parent
• For non-leaf nodes, we split according to the pointers, not the values.
• Here we have 5 pointers (values 14, 16, 18, 20); the way to split them is implementation-specific.
• To do this, we pull one value up as the new parent of the two halves, here 18.

[Figure: the node (14, 16, 18, 20) is split into (14, 16) and (20), with 18 pulled up as their parent.]

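Splitting a non-leaf node differs from splitting a leaf in that the middle value moves up instead of being copied. A minimal sketch, assuming a hypothetical `split_internal` helper and using strings as stand-ins for the five child pointers:

```python
def split_internal(keys, children):
    """Split an overfull non-leaf node; the middle key moves up to the parent."""
    mid = len(keys) // 2
    up = keys[mid]
    left = (keys[:mid], children[:mid + 1])
    right = (keys[mid + 1:], children[mid + 1:])
    return left, up, right

# The five pointers are named after the leaves they point to.
keys = [14, 16, 18, 20]
children = ["5,7", "14,15", "16,17", "18,19", "20"]
left, up, right = split_internal(keys, children)
```

With 5 pointers this choice of mid-point gives the left node 3 pointers and the right node 2, matching the split shown on the slide; other divisions are possible.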
B+-tree – insertion (3): result
Insert: 14
[Figure: root (18); non-leaf nodes (14, 16) and (20); the leaves hold 5, 7 | 14, 15 | 16, 17 | 18, 19 | 20.]

B+-tree – deletion (1)
When a value is deleted from a node, the node may have fewer values than the minimum.

Delete: 18
[Figure: root (18); non-leaf nodes (14, 16) and (20); the leaves hold 5, 7 | 14, 15 | 16, 17 | 18, 19 | 20.]

Once 18 is removed, its leaf has only 1 value. Since a node must be at least half-full, this node should be removed.

B+-tree – deletion (1): merging nodes
• If there is a sibling node that can take the remaining value, we merge the two nodes.
• The choice of left or right sibling is implementation-specific.

Delete: 18
[Figure: the same tree. The leaf (16, 17) is marked as NOT a sibling of the target node, because it hangs under a different parent; the target leaf is merged with the leaf (20).]

B+-tree – deletion (1): parent node
Delete: 18
• Once the leaf nodes are merged, the corresponding parent node has to be updated.
• The parent node now has only one pointer, which is less than half-full, so this node has to be removed as well.
• This time we can again merge it with its sibling, the non-leaf node (14, 16).

[Figure: root (18); non-leaf node (14, 16); the leaves hold 5, 7 | 14, 15 | 16, 17 | 19, 20, where (19, 20) is the merged leaf.]

B+-tree – deletion (1): parent node (2)
Delete: 18
What should the new value in the merged non-leaf node be?
• It could be 19, because the first value in the child is 19.
• It could be 18, because the previous value in the parent is 18.

B+-tree – deletion (1): result
Delete: 18
The original root node can be removed, as it has only one child.

[Figure: root (14, 16, 18); the leaves hold 5, 7 | 14, 15 | 16, 17 | 19, 20.]

B+-tree – deletion (2)
If merging fails, we need to redistribute the values with the sibling.

Delete: 18
[Figure: root (18); non-leaf nodes (14, 16) and (20); the leaves hold 5, 7 | 14, 15 | 16, 17 | 18, 19 | 20, 21, 25.]

Once 18 is removed, the sibling node (20, 21, 25) cannot take the remaining value, so a redistribution should be done.

B+-tree – deletion (2): update parent
Delete: 18
The value in the parent node has to be updated.

[Figure: root (18); non-leaf nodes (14, 16) and (21); the leaves hold 5, 7 | 14, 15 | 16, 17 | 19, 20 | 21, 25, where the last two leaves are the redistributed ones.]

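The choice between merging and redistributing can be sketched for a single underfull leaf. This is a sketch under assumptions: the hypothetical `fix_underfull_leaf` helper always uses the right sibling, works on plain sorted lists, and splits the combined values evenly when it redistributes.

```python
def fix_underfull_leaf(node, right_sibling, max_values):
    """Repair an underfull leaf using its right sibling (both sorted).

    Returns ("merge", merged_leaf) or
            ("redistribute", left_leaf, right_leaf, new_parent_value).
    """
    combined = node + right_sibling
    if len(combined) <= max_values:
        return ("merge", combined)          # the sibling can take the values
    mid = len(combined) // 2
    left, right = combined[:mid], combined[mid:]
    return ("redistribute", left, right, right[0])

# Order 4, so a leaf holds at most 3 values.
case_merge = fix_underfull_leaf([19], [20], max_values=3)
case_redistribute = fix_underfull_leaf([19], [20, 21, 25], max_values=3)
```

The two calls correspond to deletion (1) and deletion (2); in the second case the returned parent value 21 replaces 20 in the parent node.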
B+-tree – Summary
• Insertion:
• Split the node and generate a new parent
• Recursively insert the new parent into the parent node
• Deletion:
• Merge the node with a sibling if the node is underfull
• Recursively update parent nodes, delete a node if it is empty
• Redistribute values if merging is not possible
• Update parent nodes

In practice, the parent node is sometimes not updated after a deletion even when there is an underfill. Why?

Extendable hashing
• Traditional hashing is static in nature
• Need to decide on the initial hash table size
• Does not scale well when the database grows
• Dynamic hashing expands the hash table when needed
• Could be optimized by using buckets whose size equals one block of memory
• Extendable hashing is one example

Static hashing
Here is a typical implementation of static hashing for a database.

• Value $x$ is placed in bucket $m$ if $h(x) = m$.
• Each entry points to a record in the database.
• If a bucket is full, an overflow bucket is added through chaining.
• The number of initial buckets equals the number of possible hash values; these buckets are created even if they are empty.

Configuration: the hash function $h(x)$ gives a hash value from 0 to 3. Bucket size is 2.

[Figure: Bucket 0 holds 4; Bucket 1 holds 5 and 21, with an overflow bucket holding 13; Bucket 2 and Bucket 3 are empty.]

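The static scheme can be sketched in a few lines. The hash function is an assumption: the slides only say that $h(x)$ lies between 0 and 3, and `x % 4` happens to place 4, 5, 21 and 13 in the buckets shown in the figure. Each bucket is modelled as a chain of fixed-size blocks, with every block after the first acting as an overflow bucket.

```python
NUM_BUCKETS = 4      # hash values 0..3
BUCKET_SIZE = 2

def h(x):
    # Assumed hash function; not given in the slides.
    return x % NUM_BUCKETS

# All buckets exist from the start, even if they stay empty.
buckets = [[[]] for _ in range(NUM_BUCKETS)]

def insert(x):
    chain = buckets[h(x)]
    if len(chain[-1]) == BUCKET_SIZE:    # last block is full: chain an overflow block
        chain.append([])
    chain[-1].append(x)

def contains(x):
    return any(x in block for block in buckets[h(x)])

for value in (4, 5, 21, 13):
    insert(value)
```

A lookup may have to walk the whole chain of a bucket, which is why long overflow chains hurt as the database grows.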
Extendable hashing
Extendable hashing adds two features:
1. A bucket address table
2. A hash prefix size

• The bucket address table lists all hash prefixes of $n$ bits, where $n$ is given by the table's hash prefix size.
• Multiple entries in the bucket address table can point to the same bucket.
• All entries in the same bucket share the same hash prefix, of the length given by that bucket's hash prefix size.
• If there are many entries with the same hash value, there will be overflow buckets.

Configuration: the hash function $h(x)$ gives a hash value of $n$ bits. Bucket size is 2.

[Figure: a bucket address table with hash prefix size 3 (entries 000 to 111) pointing to buckets with hash prefix sizes 1, 2, 3 and 3; the stored values shown are 1417, 709, 1867, 1346, 1640, 1394 and 1653.]

Extendable hashing – construction example
Suppose we are adding these entries, in this order, to an extendable hashing configuration with a bucket size of 2.

x     h(x)
3     3 (11)
2     2 (10)
5     1 (01)
7     3 (11)
11    3 (11)
13    1 (01)
0     0 (00)
1     1 (01)

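The hash values in the table and the notion of a hash prefix can be reproduced with a short sketch. The hash function `x % 4` is an assumption: the slides do not define $h$, but this choice matches every row of the table. The helper `prefix` is a hypothetical name.

```python
HASH_BITS = 2

def h(x):
    # Assumed hash function; it reproduces every row of the table.
    return x % (1 << HASH_BITS)

def prefix(hash_value, n):
    """First n bits of a HASH_BITS-bit hash value, as a bit string ('' when n == 0)."""
    return format(hash_value, f"0{HASH_BITS}b")[:n]

rows = [(x, h(x), prefix(h(x), HASH_BITS)) for x in (3, 2, 5, 7, 11, 13, 0, 1)]
```

With a prefix size of 0 every value maps to the empty prefix, which is why the initial bucket address table has a single entry.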
Extendable hashing – initial state, add 3, 2
Initial state
• Initially, the hash prefix size is 0.
• There is one entry in the bucket address table ($2^0 = 1$).
• There is one corresponding empty bucket.

Add 3
• $h(3) = 3$; the current hash prefix size is 0.
• The first 0 bits are ∅ (nothing).
• The only bucket is not full, so 3 can be added.

Add 2
• The same happens when adding 2.

[Figure: a bucket address table with hash prefix size 0 and a single entry, pointing to one bucket of hash prefix size 0 that is first empty, then holds 3, then holds 3 and 2.]

Extendable hashing – add 5
• $h(5) = 1$; the first 0 bits are ∅ (nothing).
• The only bucket is full, so we need to split the bucket.
• The hash prefix size of the new buckets is increased by 1.
• The bucket address table has a smaller hash prefix size, so it needs to be expanded.

[Figure: the full bucket (3, 2) is split into a bucket for prefix 0 and a bucket for prefix 1, each with hash prefix size 1; 3 and 2 both go to the bucket for prefix 1. The bucket address table is expanded to hash prefix size 1 with entries 0 and 1, and 5 is added to the bucket for prefix 0.]

Extendable hashing – add 7
• $h(7) = 3$; the first 1 bit is 1.
• The bucket for "1" is full, so we need to split the bucket.
• The hash prefix size of the new buckets is increased by 1.
• The bucket address table has a smaller hash prefix size, so it needs to be expanded.

[Figure: the bucket for prefix 1 (3, 2) is split into a bucket for prefix 10 holding 2 and a bucket for prefix 11 holding 3, each with hash prefix size 2. The bucket address table is expanded to hash prefix size 2 with entries 00, 01, 10 and 11; the bucket holding 5 keeps hash prefix size 1. 7 is added to the bucket for prefix 11.]

Extendable hashing – add 11
• $h(11) = 3$; the first 2 bits are 11.
• The bucket for "11" is full. We consider splitting the bucket to prefix size 3; however, all its entries have the same hash prefix of size 3, so splitting cannot resolve the collision.
• An overflow bucket is added instead.

[Figure: bucket address table with hash prefix size 2 and entries 00, 01, 10 and 11; the bucket with hash prefix size 1 holds 5, and the bucket for prefix 11 holds 3 and 7 with an overflow bucket holding 11.]

Extendable hashing – add 13, 0, 1
You can apply the same procedure to add the remaining 3 entries. This is left as an exercise.

[Figure: the bucket address table with entries 00, 01, 10 and 11 and its buckets after 13, 0 and 1 are added.]

Extendable hashing – Summary
• Initialize the bucket address table with prefix size 0.
• Associate one empty bucket of prefix size 0 with the only entry in the bucket address table.
• To add a value, check the first 𝑛 bits of the hash value, where 𝑛 is the current prefix size, and find the corresponding bucket.
• If the bucket is not full, add the entry to the bucket.
• If the bucket is full, try to split the bucket and redistribute the existing entries.
• If the bucket is still full, use an overflow bucket instead.
• Add the new entry to the newly split bucket. If the prefix size of the bucket is larger than the current prefix size of the bucket address table, expand the bucket address table.
• Then update the bucket address table.
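The procedure can be sketched end to end as follows. This is a simplified sketch under stated assumptions, not a reference implementation: the class and method names are hypothetical, the hash function is assumed to be `x % 4` (a 2-bit hash that matches the construction example), an overflow chain is modelled as a plain list, and a bucket falls back to overflow only when its prefix size already uses every hash bit.

```python
class Bucket:
    def __init__(self, prefix_size):
        self.prefix_size = prefix_size   # hash prefix size of this bucket
        self.items = []
        self.overflow = []               # stands in for a chain of overflow buckets


class ExtendableHash:
    def __init__(self, bucket_size=2, hash_bits=2):
        self.bucket_size = bucket_size
        self.hash_bits = hash_bits
        self.prefix_size = 0             # hash prefix size of the address table
        self.table = [Bucket(0)]         # 2**0 = 1 entry, one empty bucket

    def h(self, x):
        # Assumed hash function; the slides do not define h.
        return x % (1 << self.hash_bits)

    def _bit(self, hash_value, position):
        """Bit number `position` of the hash, counting from 1 at the most significant bit."""
        return (hash_value >> (self.hash_bits - position)) & 1

    def _split(self, bucket):
        if bucket.prefix_size == self.prefix_size:
            # Expand the address table: every entry becomes two adjacent entries.
            self.table = [b for b in self.table for _ in range(2)]
            self.prefix_size += 1
        size = bucket.prefix_size + 1
        halves = (Bucket(size), Bucket(size))
        for item in bucket.items:        # redistribute the existing entries
            halves[self._bit(self.h(item), size)].items.append(item)
        for i, b in enumerate(self.table):
            if b is bucket:
                # Entry i is a prefix of prefix_size bits; pick bit number `size` of it.
                self.table[i] = halves[(i >> (self.prefix_size - size)) & 1]

    def add(self, x):
        hv = self.h(x)
        while True:
            bucket = self.table[hv >> (self.hash_bits - self.prefix_size)]
            if len(bucket.items) < self.bucket_size:
                bucket.items.append(x)
                return
            if bucket.prefix_size == self.hash_bits:
                # No hash bits left to split on: use an overflow bucket.
                bucket.overflow.append(x)
                return
            self._split(bucket)          # then retry with the updated table


index = ExtendableHash()
for value in (3, 2, 5, 7, 11, 13, 0, 1):
    index.add(value)
state = [(b.prefix_size, b.items, b.overflow) for b in index.table]
```

Because the table is indexed by the leading bits of the hash, doubling it keeps entries with the same shorter prefix adjacent, so several entries can keep pointing to one unsplit bucket.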
