Tutorial 10 Indexing
Overview
• Two approaches to indexing in a DBMS
  • B+-tree
  • Extendable hashing
CREATE INDEX in MySQL
https://dev.mysql.com/doc/refman/5.7/en/create-index.html
• One can create an index in MySQL using the CREATE INDEX syntax.
• This speeds up the execution of queries that can use the index.
• Note that every primary key is indexed.
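To see the effect of an index on a query plan, here is a minimal sketch. It uses SQLite through Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained), and the table `student`, its columns, and the index name `idx_student_name` are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO student (id, name, gpa) VALUES (?, ?, ?)",
    [(i, f"name{i}", (i % 40) / 10) for i in range(1000)],
)

def plan(sql):
    """Return the query plan text that SQLite reports for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)  # last column is the plan detail

query = "SELECT id FROM student WHERE name = 'name42'"

plan_before = plan(query)  # no index on `name` yet: full table scan
conn.execute("CREATE INDEX idx_student_name ON student (name)")
plan_after = plan(query)   # the plan now refers to the new index

print(plan_before)
print(plan_after)
```

The exact plan text differs between database systems and versions; the point is only that the plan mentions the index once it exists.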
Tree and hash
• Two major approaches are used to provide an index
  • A tree provides an ordered list of the indexed columns and supports comparison operators and ORDER BY queries
  • A hash provides an unordered list of the indexed columns and supports fast lookup of a given key
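The difference between the two approaches can be illustrated with plain Python containers. This is only an analogy under stated assumptions: a sorted list stands in for a tree index, a `dict` stands in for a hash index, and the keys and `record-…` strings are made up.

```python
import bisect

keys = [20, 5, 14, 7, 15]

# "Tree-like" index: keys are kept in order, so range and ORDER BY style
# queries are cheap.
ordered_index = sorted(keys)

# "Hash-like" index: no order, but direct access to a given key.
hash_index = {k: f"record-{k}" for k in keys}

def range_query(sorted_keys, low, high):
    """Return all keys k with low <= k <= high using binary search."""
    lo = bisect.bisect_left(sorted_keys, low)
    hi = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[lo:hi]

print(range_query(ordered_index, 7, 15))  # [7, 14, 15]
print(hash_index[14])                     # record-14
```

A range query on the hash-like index would have to inspect every key, which is why hash indexes are mainly useful for equality lookups.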
[Figure: an example B+-tree (left) and an example extendable hash structure with its bucket address table (right)]
B+-tree – node
A node in a B+-tree consists of (search-key) values and pointers.
[Figure: a node holding the search-key values 14 and 20, with the search-key values and the pointers labelled]
B+-tree – leaf and non-leaf node
• Pointers in a non-leaf node point to nodes at a lower level.
• The pointer to the left of a search-key value points to a node with values less than the search-key value.
• In a leaf node, the pointer before a value points to its corresponding record.
• Leaf nodes form a linked list using their last pointers.
• Values in non-leaf nodes are part of the tree structure; only the values in the leaves are the indexed values.
[Figure: non-leaf node (14, 20) whose three pointers lead to values < 14, values ≥ 14 and < 20, and values ≥ 20; leaf values shown: 5 7 15 20]
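The search procedure implied by these rules can be sketched as follows. The `Node` class, the function names, and the hand-built example tree are assumptions made for illustration; record pointers are left out and only the keys are stored.

```python
import bisect

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted search-key values
        self.children = children  # child nodes, or None for a leaf
        self.next = None          # next leaf in the linked list (leaves only)

def find_leaf(root, key):
    """Descend from the root to the leaf that may hold `key`.
    Returns (leaf, number_of_nodes_accessed)."""
    node, accessed = root, 1
    while node.children is not None:
        # Keys smaller than keys[i] are in children[i]; keys >= keys[i] go right.
        node = node.children[bisect.bisect_right(node.keys, key)]
        accessed += 1
    return node, accessed

def search(root, key):
    leaf, accessed = find_leaf(root, key)
    return key in leaf.keys, accessed

def range_scan(root, low, high):
    """Collect all keys in [low, high] by following the leaf linked list."""
    leaf, _ = find_leaf(root, low)
    result = []
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result
            if k >= low:
                result.append(k)
        leaf = leaf.next
    return result

# Hand-built example tree (hypothetical layout).
leaves = [Node([5, 7]), Node([14, 15]), Node([16, 17]), Node([18, 19]), Node([20])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Node([18], [Node([14, 16], leaves[:3]), Node([20], leaves[3:])])

print(search(root, 16))          # (True, 3)
print(range_scan(root, 15, 19))  # [15, 16, 17, 18, 19]
```

A point lookup touches one node per level, while a range scan descends once and then walks sideways through the leaves.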
B+-tree – quick test
Consider this B+-tree:
[Figure: a three-level B+-tree; values by row, top to bottom:
14 18
7 9 14 20
1 3 4 7 8 10 13 14 15 16 17 18 19 20 21]
1. How many nodes are accessed if we want to find the value 16?
2. How many nodes are accessed if we want to find values between 13 and 17?
3. If no new node is added, how many values could be stored in this tree?
B+-tree properties
• Always balanced – all leaf nodes are at the same depth
• Except for the root node:
  • A leaf node must be at least half-full (at least half of the value slots are filled)
    • i.e., at least ⌈(𝑚−1)/2⌉ values for a tree of order 𝑚.
  • A non-leaf node must be at least half-full (at least half of the pointers point to a child)
    • i.e., at least ⌈𝑚/2⌉ pointers for a tree of order 𝑚.
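As a quick check of the two bounds, the sketch below computes them for a few orders. It assumes that a node of order 𝑚 has at most 𝑚 pointers and 𝑚−1 values and that "half-full" rounds up; the function names are made up.

```python
import math

def min_leaf_values(m):
    """Minimum number of values in a non-root leaf of a B+-tree of order m."""
    return math.ceil((m - 1) / 2)

def min_nonleaf_pointers(m):
    """Minimum number of child pointers in a non-root, non-leaf node."""
    return math.ceil(m / 2)

for m in (3, 4, 5):
    print(m, min_leaf_values(m), min_nonleaf_pointers(m))
```

For order 4 (at most 3 values per leaf), a leaf with a single value is below the minimum of 2.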
B+-tree – insertion (1)
Simply search and fill if the leaf node is not full.
Insert: 17
14 20
5 7 15 20
B+-tree – insertion (1)
Result
Insert: 17
14 20
5 7 15 17 20
B+-tree – insertion (1)
Split the node if inserting into a full node.
result
Insert: 18
14 20
5 7 15 17 19 20
B+-tree – insertion (2)
Splitting - 1
Insert: 18
• If adding a value to a node exceeds its capacity, the node should be split and the values should be redistributed.
• Pick a mid-point to split. If the number of values to be redistributed is odd, the choice is implementation dependent.
• Once split, a new parent value is needed. We pick the first value in the right child as the parent.
• We need to insert this value into the parent node.
[Figure: the over-full leaf 15 17 18 19 is split in two, with 18 as the value to insert into the parent]
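The leaf split described above could be written as below. This is a sketch under assumptions: a leaf is a plain sorted list, `capacity` is the maximum number of values per leaf, and the split point `len // 2` is one possible implementation choice.

```python
import bisect

def insert_into_leaf(values, new_value, capacity):
    """Insert `new_value` into a sorted leaf.
    Returns (left, right, separator); `right` and `separator` are None
    when no split was needed."""
    values = list(values)
    bisect.insort(values, new_value)
    if len(values) <= capacity:
        return values, None, None
    mid = len(values) // 2                  # implementation-dependent choice
    left, right = values[:mid], values[mid:]
    return left, right, right[0]            # first value of right child goes up

print(insert_into_leaf([15], 17, capacity=3))          # ([15, 17], None, None)
print(insert_into_leaf([15, 17, 19], 18, capacity=3))  # ([15, 17], [18, 19], 18)
```

The returned separator is a copy of the first value in the right leaf: in a B+-tree the value stays in the leaf and is only duplicated in the parent.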
B+-tree – insertion (2)
Update parent
Insert into the parent recursively.
Insert: 18
14 20
15 17 18 19
B+-tree – insertion (2)
result
Insert: 18
14 18 20
5 7 15 17 18 19 20
B+-tree – insertion (3)
Node splitting may need to be done recursively.
Insert: 14
14 18 20
5 7 15 16 17 18 19 20
B+-tree – insertion (3)
First split
Node splitting may need to be done recursively.
Insert: 14
14 18 20
14 15 16 17
B+-tree – insertion (3)
Splitting parent
• For non-leaf nodes, we split according to the pointers, not the values.
• Here we have 5 pointers; the way to split them is implementation specific.
• To do this, we pull one value up as the new parent of the split.
[Figure: the over-full non-leaf node 14 16 18 20 is split into 14 16 and 20, with 18 pulled up as the new parent]
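A non-leaf split differs from a leaf split in that the middle value moves up instead of being copied. The sketch below assumes a node is represented by a list of keys and a list of child pointers (here just labels), and that the middle key `len(keys) // 2` is the one pulled up, which is one of several valid choices.

```python
def split_nonleaf(keys, children):
    """Split an over-full non-leaf node.
    Returns ((left_keys, left_children), up_key, (right_keys, right_children))."""
    assert len(children) == len(keys) + 1
    mid = len(keys) // 2                    # implementation-specific choice
    up_key = keys[mid]                      # pulled up, kept in neither half
    left = (keys[:mid], children[:mid + 1])
    right = (keys[mid + 1:], children[mid + 1:])
    return left, up_key, right

# Five pointers c0..c4 (hypothetical labels for the child nodes).
left, up_key, right = split_nonleaf([14, 16, 18, 20], ["c0", "c1", "c2", "c3", "c4"])
print(left)    # ([14, 16], ['c0', 'c1', 'c2'])
print(up_key)  # 18
print(right)   # ([20], ['c3', 'c4'])
```

With 5 pointers, this choice gives 3 pointers to the left node and 2 to the right node.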
B+-tree – insertion (3)
Result
Insert: 14
18
14 16 20
5 7 14 15 16 17 18 19 20
B+-tree – deletion (1)
When a value is deleted from a node, the node may have fewer values than the minimum.
Delete: 18
18
14 16 20
5 7 14 15 16 17 18 19 20
Once 18 is removed, its leaf node has only 1 value. Since a node must be at least half-full, this node should be removed.
B+-tree – deletion (1)
Merging nodes
If there is a sibling node that can take the value, we merge the two nodes. The choice of left or right sibling is implementation specific.
Delete: 18
18
14 16 20
5 7 14 15 16 17 18 19 20
[Figure annotations: “Merge” marks the leaves being merged; another node is marked “Note that this is NOT a sibling of the target node!”]
B+-tree – deletion (1)
Parent node
Once leaf nodes are merged, the corresponding parent node has to be updated. The parent node now has only one pointer, which is less than half-full, so this node has to be removed as well.
Delete: 18
18
14 16
5 7 14 15 16 17 19 20
[Figure annotation: “Merged” marks the merged leaf]
B+-tree – deletion (1)
Parent node (2)
Delete: 18
18
14 16
5 7 14 15 16 17 19 20
What should this value in the parent be?
• Could be 19, because the first value in the child is 19
• Could be 18, because the previous value in the parent is 18
B+-tree – deletion (1)
Result
14 16 18
5 7 14 15 16 17 19 20
B+-tree – deletion (2)
If merging fails, we need to redistribute the values between the node and its sibling.
Delete: 18
18
14 16 20
5 7 14 15 16 17 18 19 20 21 25
Once 18 is removed, the sibling node cannot take the remaining value, so a redistribution should be done.
B+-tree – deletion (2)
Update parent
The value in the parent node has to be updated.
Delete: 18
18
14 16 21
5 7 14 15 16 17 19 20 21 25
[Figure annotation: “Redistributed” marks the leaves whose values were redistributed]
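The two deletion cases, merging and redistribution, can be sketched for a pair of neighbouring leaves. The function below is an illustration under assumptions: leaves are sorted lists, `target` is the left leaf and `sibling` its right sibling, and `capacity` and `min_values` are passed in rather than derived from the order. Updating the parent is left to the caller.

```python
def delete_from_leaf(target, sibling, value, capacity, min_values):
    """Delete `value` from `target` and fix an underfull leaf using `sibling`.
    Returns (action, left, right, separator):
      "none"         no fix needed
      "merge"        both leaves merged into `left`, `right` is None
      "redistribute" values shared out, `separator` is the new parent value"""
    remaining = [v for v in target if v != value]
    if len(remaining) >= min_values:
        return "none", remaining, list(sibling), None
    combined = remaining + list(sibling)
    if len(combined) <= capacity:           # the sibling can take the values
        return "merge", combined, None, None
    half = len(combined) // 2               # otherwise share the values out
    left, right = combined[:half], combined[half:]
    return "redistribute", left, right, right[0]

# Leaves hold at most 3 values and at least 2.
print(delete_from_leaf([18, 19], [20], 18, capacity=3, min_values=2))
# ('merge', [19, 20], None, None)
print(delete_from_leaf([18, 19], [20, 21, 25], 18, capacity=3, min_values=2))
# ('redistribute', [19, 20], [21, 25], 21)
```

In the redistribution case the returned separator 21 is the value the parent needs in place of 20.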
B+-tree – Summary
• Insertion:
  • Split node and generate a new parent
  • Recursively insert new parent to parent node
• Deletion:
  • Merge node with sibling if node is underfull
    • Recursively update parent nodes, delete node if empty
  • Redistribute values if merging is not possible
    • Update parent nodes
• In practice, the parent node is sometimes not updated after a deletion even when there is an underfill. Why?
Extendable hashing
• Traditional hashing is static in nature
  • The initial hash table size has to be decided up front
  • It does not scale well when the database grows
• Dynamic hashing expands the hash table when needed
  • It could be optimized by using buckets whose size equals one block of memory
  • Extendable hashing is one example
Static hashing
Here is a typical implementation of static hashing for a database.
• Value 𝑥 is placed in bucket 𝑚 if ℎ(𝑥) = 𝑚.
• Each entry points to a record in the database.
• If a bucket is full, an overflow bucket is added through chaining.
[Figure: Bucket 0 holds 4; Bucket 1 holds 5 and 21, with a chained overflow bucket holding 13; Bucket 2 has no entries shown]
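A static hash table with overflow chaining might look like the sketch below. The class name, the number of buckets (4), the bucket size (2), and the hash function ℎ(𝑥) = 𝑥 mod 4 are all assumptions; they were chosen so that the values 4, 5, 21 and 13 land in buckets 0 and 1.

```python
class StaticHash:
    def __init__(self, n_buckets=4, bucket_size=2):
        self.n_buckets = n_buckets
        self.bucket_size = bucket_size
        # Each bucket is a chain of blocks: the primary bucket first,
        # followed by any overflow buckets.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def h(self, x):
        return x % self.n_buckets            # assumed hash function

    def insert(self, x):
        chain = self.buckets[self.h(x)]
        if len(chain[-1]) == self.bucket_size:
            chain.append([])                 # last block is full: chain a new one
        chain[-1].append(x)

    def contains(self, x):
        # A lookup may have to walk the whole overflow chain.
        return any(x in block for block in self.buckets[self.h(x)])

table = StaticHash()
for x in (4, 5, 21, 13):
    table.insert(x)

print(table.buckets[0])  # [[4]]
print(table.buckets[1])  # [[5, 21], [13]]
```

The number of buckets never changes, so as more values hash to the same bucket the chains grow and lookups slow down.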
Extendable hashing
Extendable hashing adds two features:
1. A bucket address table
2. Hash prefix size
• Multiple entries in the bucket address table can point to the same bucket.
• All entries in the same bucket have the same hash prefix of the specified hash prefix size.
• If there are many entries with the same hash value, there will be overflow buckets.
• The bucket address table lists all hash prefixes of size 𝑛 bits, where 𝑛 is indicated in the hash prefix size record.
Configuration:
• The hash function ℎ(𝑥) gives a hash value of 𝑛 bits.
• Bucket size is 2.
[Figure: a bucket address table with hash prefix size 3 (entries 000 to 111) pointing to buckets that carry their own hash prefix sizes; entries shown in the buckets: 1417, 709, 1867, 1346, 1640, 1394, 1653]
Extendable hashing – construction example
Suppose we are adding these entries in a configuration of extendable hashing with a bucket size of 2.
𝒙     𝒉(𝒙)
3     3 (11)
2     2 (10)
5     1 (01)
7     3 (11)
11    3 (11)
13    1 (01)
0     0 (00)
1     1 (01)
Extendable hashing
Initial state, add 3, 2
Initial state: the bucket address table has hash prefix size 0, and its only entry points to one empty bucket of prefix size 0.
Add 3, then add 2: both values go into the only bucket, which is then full.
Add 5:
• ℎ(5) = 1; the first 0 bits are ∅ (nothing).
• The only bucket is full, so we need to split the bucket.
• The hash prefix size for the new buckets is increased by 1.
• The bucket address table has a smaller hash prefix size, so it needs to be expanded.
[Figure: after adding 5, the bucket address table has hash prefix size 1 with entries 0 and 1; the bucket for “0” holds 5 and the bucket for “1” holds 3, 2]
Extendable hashing
add 7
• ℎ(7) = 3; the first 1 bit is 1.
• The bucket for “1” is full, so we need to split the bucket.
• The hash prefix size for the new buckets is increased by 1.
• The bucket address table has a smaller hash prefix size, so it needs to be expanded.
[Figure, in three steps. Split bucket: the bucket for “1” is split into a bucket with prefix 10 (holding 2) and a bucket with prefix 11 (holding 3), each with hash prefix size 2. Expand bucket address table: the table grows from entries 0, 1 to entries 00, 01, 10, 11. Add 7: the bucket with prefix 11 now holds 3, 7; the bucket holding 5 keeps hash prefix size 1]
Extendable hashing
add 11
• ℎ(11) = 3; the first 2 bits are 11.
• The bucket for “11” is full. We consider splitting the bucket to prefix size 3; however, all entries have the same hash prefix of size 3, so splitting cannot resolve the collision.
• An overflow bucket is added instead.
[Figure: bucket address table with hash prefix size 2 and entries 00, 01, 10, 11; the bucket for “11” holds 3, 7 and has an overflow bucket holding 11; the bucket holding 5 still has hash prefix size 1]
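Extendable hashing
add 13, 0, 1
[Figure: final state. Bucket address table with hash prefix size 2 and entries 00, 01, 10, 11, each pointing to a bucket of hash prefix size 2. The bucket for “00” holds 0; the bucket for “01” holds 5, 13 with an overflow bucket holding 1; the bucket for “10” holds 2; the bucket for “11” holds 3, 7 with an overflow bucket holding 11]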
Extendable hashing – Summary
• Initialize the bucket address table with prefix size 0.
  • Associate one empty bucket of prefix size 0 with the only entry in the bucket address table.
• To add a value, check the first 𝑛 bits of the hash value, where 𝑛 is the current prefix size, and find the corresponding bucket.
  • If the bucket is not full, add the entry to the bucket.
  • If the bucket is full, try to split the bucket and redistribute the existing entries.
    • If the bucket is still full, use an overflow bucket instead.
    • Add the new entry to the newly split bucket. If the prefix size of the bucket is larger than the current prefix size in the bucket address table, expand the bucket address table.
    • Then update the bucket address table.
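The steps above can be put together in a short sketch. It makes several assumptions: ℎ(𝑥) = 𝑥 mod 4 (a 2-bit hash that matches the ℎ(𝑥) column of the construction example), a bucket size of 2, and a simplification in which an overflow entry is used only once a bucket's prefix size has reached the full hash length. The class and attribute names are made up.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth     # hash prefix size of this bucket
        self.items = []        # at most `bucket_size` entries
        self.overflow = []     # entries in the chained overflow bucket

class ExtendableHash:
    def __init__(self, bucket_size=2, hash_bits=2):
        self.bucket_size = bucket_size
        self.hash_bits = hash_bits
        self.depth = 0                 # hash prefix size of the address table
        self.table = [Bucket(0)]       # always 2**depth entries

    def h(self, x):
        return x % (1 << self.hash_bits)          # assumed hash function

    def _index(self, x):
        # The first `depth` bits of the hash value select the table entry.
        return self.h(x) >> (self.hash_bits - self.depth)

    def insert(self, x):
        while True:
            b = self.table[self._index(x)]
            if len(b.items) < self.bucket_size:
                b.items.append(x)
                return
            if b.depth == self.hash_bits:         # splitting cannot help
                b.overflow.append(x)
                return
            self._split(b)                        # then retry

    def _split(self, b):
        if b.depth == self.depth:                 # table prefix too small
            self.table = [e for e in self.table for _ in (0, 1)]
            self.depth += 1
        new_depth = b.depth + 1
        zero, one = Bucket(new_depth), Bucket(new_depth)
        for v in b.items:                         # redistribute by the next bit
            bit = (self.h(v) >> (self.hash_bits - new_depth)) & 1
            (one if bit else zero).items.append(v)
        for i, e in enumerate(self.table):        # update the address table
            if e is b:
                bit = (i >> (self.depth - new_depth)) & 1
                self.table[i] = one if bit else zero

index = ExtendableHash()
for x in (3, 2, 5, 7, 11, 13, 0, 1):
    index.insert(x)

for i, b in enumerate(index.table):
    print(format(i, "02b"), b.depth, b.items, b.overflow)
```

Running it on the eight values of the construction example ends with a table of prefix size 2 whose four entries each point to their own bucket, two of which carry an overflow entry.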