Index Architecture: Febriliyan Samopa
Basic Concepts
B+-Tree Index Files
B+-tree indices are an alternative to indexed-sequential files.
• Disadvantage of indexed-sequential files :
• Performance degrades as file grows, since many overflow
blocks get created.
• Periodic reorganization of entire file is required.
• Advantage of B+-tree index files :
• Automatically reorganizes itself with small, local, changes,
in the face of insertions and deletions.
• Reorganization of entire file is not required to maintain
performance.
• (Minor) disadvantage of B+-trees :
• Extra insertion and deletion overhead, space overhead.
• The advantages of B+-trees outweigh the disadvantages, so
B+-trees are used extensively.
B+-Tree Index Files
Example of B+-Tree
B+-Tree Index Files
A B+-tree is a rooted tree satisfying the following
properties:
• All paths from root to leaf are of the same length
• Each node that is not a root or a leaf has between
⌈𝑛/2⌉ and 𝑛 children (𝑛 = number of pointers in a
node).
• A leaf node has between ⌈(𝑛–1)/2⌉ and 𝑛–1
values.
• Special cases:
• If the root is not a leaf, it must have at least 2 children
→ 𝑛 ≥ 3.
• If the root is a leaf, it can have between 0 and (𝑛–1)
values.
B+-Tree Node Structure
• Typical node :
P1 | K1 | P2 | K2 | P3 | K3 | P4
[Figure: node C holding search keys K1 to K3 interleaved with pointers P1 to P4; a callout marks the last non-null pointer in C.]
Handling Duplicates
• With duplicate search keys in both leaf and
internal nodes :
• Cannot guarantee that K1 < K2 < K3 < . . . < Kn–1.
• But can guarantee K1 ≤ K2 ≤ K3 ≤ . . . ≤ Kn–1.
• For search-key values (V) in the subtree to which Pi
points :
• V ≤ Ki but not necessarily V < Ki.
• To see why, suppose the same search-key value V
is present in two leaf nodes Li and Li+1; then in the
parent node Ki must be equal to V.
Handling Duplicates
• Modify find procedure as follows :
• Traverse Pi even if V = Ki.
• As soon as we reach a leaf node C, check whether C has
only search-key values less than V; if so, set C = right
sibling of C before checking whether C contains V.
• Procedure printAll
• Uses modified find procedure to find first occurrence
of V.
• Traverse through consecutive leaves to find all
occurrences of V.
Modified Queries on B+-Trees
Find record with search-key value V :
1. C = root.
2. While C is not a leaf node :
   1. Let i be the least value such that V ≤ Ki.
   2. If no such i exists, set C = last non-null pointer in C.
   3. Else set C = Pi. (First change)
3. If for all Ki in C, Ki < V then C = right sibling of C. (Second change)
4. Let i be the least value such that Ki = V.
5. If there is such a value i, follow pointer Pi to the
desired record.
6. Else no record with search-key value V exists.
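The modified find procedure above can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the `Node` class, its field names, and the sample tree are assumptions made for the example, and records are stood in for by plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical in-memory B+-tree node (not from the slides)."""
    keys: List[str]
    # Internal node: len(children) == len(keys) + 1. Empty for a leaf.
    children: List["Node"] = field(default_factory=list)
    # Leaf only: records[i] is the record pointer that belongs to keys[i].
    records: List[str] = field(default_factory=list)
    # Leaf only: right sibling, or None for the last leaf.
    next_leaf: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return not self.children


def find(root: Node, v: str) -> Optional[str]:
    c = root
    while not c.is_leaf:
        # Least i with V <= Ki; follow Pi even when V == Ki (first change).
        i = next((j for j, k in enumerate(c.keys) if v <= k), None)
        # children[-1] plays the role of the last non-null pointer.
        c = c.children[-1] if i is None else c.children[i]
    # Second change: if every key in this leaf is smaller than V,
    # the first occurrence can only be in the right sibling.
    if all(k < v for k in c.keys) and c.next_leaf is not None:
        c = c.next_leaf
    for k, rec in zip(c.keys, c.records):
        if k == v:
            return rec  # first occurrence of V
    return None


# Sample tree with the duplicate key 'Brandt' spanning two leaves.
leaf3 = Node(keys=["Einstein", "Gold"], records=["r5", "r6"])
leaf2 = Node(keys=["Brandt", "Crick"], records=["r3", "r4"], next_leaf=leaf3)
leaf1 = Node(keys=["Adams", "Brandt"], records=["r1", "r2"], next_leaf=leaf2)
root = Node(keys=["Brandt", "Einstein"], children=[leaf1, leaf2, leaf3])

print(find(root, "Brandt"), find(root, "Einstein"), find(root, "Zed"))
```

Looking up 'Einstein' descends into the middle leaf (because V ≤ K2), finds only smaller keys there, and steps to the right sibling, which is the case the second change exists to handle.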
Queries on B+-Trees
• If there are K search-key values in the file, the height of
the tree is no more than ⌈log_⌈𝑛/2⌉(𝐾)⌉.
• A node is generally the same size as a disk block,
typically 4 kilobytes, and 𝑛 is typically around 100 (40
bytes per index entry).
• With 1 million search key values and 𝑛 = 100, at most
⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a lookup.
• Contrast this with a balanced binary tree with 1 million
search key values: around ⌈log_2(1,000,000)⌉ = 20 nodes are
accessed in a lookup.
• The difference is significant since every node access
may need a disk I/O, costing around 20 milliseconds.
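The arithmetic behind these figures can be checked with a few lines of Python. The function names are made up for this sketch, and the 20 ms per access is the slide's own rough estimate, used here as a default parameter.

```python
import math


def bplus_max_height(num_keys: int, n: int) -> int:
    """Upper bound ceil(log_{ceil(n/2)}(K)) on the height of a B+-tree."""
    min_fanout = math.ceil(n / 2)
    return math.ceil(math.log(num_keys, min_fanout))


def binary_tree_height(num_keys: int) -> int:
    """Approximate height of a balanced binary tree with num_keys keys."""
    return math.ceil(math.log2(num_keys))


def lookup_cost_ms(node_accesses: int, ms_per_io: float = 20.0) -> float:
    """Worst case: every node access needs one disk I/O."""
    return node_accesses * ms_per_io


k = 1_000_000
print(bplus_max_height(k, 100), lookup_cost_ms(bplus_max_height(k, 100)))
print(binary_tree_height(k), lookup_cost_ms(binary_tree_height(k)))
```

Under these assumptions the B+-tree lookup costs at most 4 node accesses (about 80 ms) against about 20 accesses (about 400 ms) for the binary tree.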
Insert on B+-Trees
1. Find the leaf node in which the search-key value would
appear.
2. If the search-key value is already present in the leaf node :
   1. Add record to the file.
   2. If necessary add a pointer to the bucket.
3. If the search-key value is not present, then :
   1. Add the record to the main file (and create a bucket if
   necessary).
   2. If there is room in the leaf node, insert (key-value,
   pointer) pair in the leaf node.
   3. Otherwise, split the node (along with the new (key-value,
   pointer) entry) as discussed in the next slide.
Insert on B+-Trees
• Splitting a leaf node :
• Take the n (search-key value, pointer) pairs (including the
one being inserted) in sorted order. Place the first ⌈𝑛/2⌉
in the original node, and the rest in a new node.
• Let the new node be p, and let k be the least key value in
p. Insert (k,p) in the parent of the node being split.
• If the parent is full, split it and propagate the split further
up.
• Splitting of nodes proceeds upwards till a node that
is not full is found.
• In the worst case the root node may be split increasing
the height of the tree by 1.
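A leaf split as described above can be sketched like this. The sketch assumes a leaf is represented as a sorted Python list of (search-key value, pointer) tuples, with pointers stood in for by strings; `split_leaf` is a hypothetical helper, and updating the parent is left to the caller.

```python
import bisect
import math
from typing import List, Tuple

Entry = Tuple[str, str]  # (search-key value, pointer); pointer is a placeholder


def split_leaf(entries: List[Entry], new_entry: Entry,
               n: int) -> Tuple[List[Entry], List[Entry], str]:
    """Split a full leaf (n-1 entries) while inserting new_entry.

    Returns (original_node_entries, new_node_entries, k), where k is the
    least key in the new node; the caller inserts (k, new node) in the parent.
    """
    assert len(entries) == n - 1, "the leaf must be full before a split"
    combined = list(entries)
    bisect.insort(combined, new_entry)   # n pairs in sorted order
    cut = math.ceil(n / 2)               # first ceil(n/2) stay in place
    left, right = combined[:cut], combined[cut:]
    return left, right, right[0][0]


full_leaf = [("Brandt", "r1"), ("Califieri", "r2"), ("Crick", "r3")]
left, right, k = split_leaf(full_leaf, ("Adams", "r0"), n=4)
print(left, right, k)
```

With 𝑛 = 4, inserting 'Adams' leaves ('Adams', 'Brandt') in the original node and moves ('Califieri', 'Crick') to the new node, so 'Califieri' is the key passed up to the parent.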
Insert on B+-Trees
• Splitting a non-leaf node: when inserting
(k,p) into an already full internal node N :
• Copy N to an in-memory area M with space
for n+1 pointers and n keys.
• Insert (k,p) into M.
• Copy P1, K1, …, K⌈𝑛/2⌉–1, P⌈𝑛/2⌉ from M back into
node N.
• Copy P⌈𝑛/2⌉+1, K⌈𝑛/2⌉+1, …, Kn, Pn+1 from M into newly
allocated node N’.
• Insert (K⌈𝑛/2⌉, N’) into the parent of N.
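The index bookkeeping in the non-leaf split is easy to get wrong, so here is a small sketch of it. It assumes keys and pointers are kept in two parallel Python lists (pointers as placeholder strings) and that the new pointer p belongs immediately to the right of its key k; `split_internal` is a hypothetical helper name.

```python
import bisect
import math
from typing import List, Tuple

Half = Tuple[List[str], List[str]]  # (keys, pointers) of one node


def split_internal(keys: List[str], pointers: List[str],
                   k: str, p: str, n: int) -> Tuple[Half, str, Half]:
    """Insert (k, p) into a full internal node N and split it.

    Returns ((N keys, N pointers), key moved up, (N' keys, N' pointers)).
    """
    assert len(keys) == n - 1 and len(pointers) == n, "N must be full"
    # M: temporary area with room for n keys and n+1 pointers.
    m_keys, m_ptrs = list(keys), list(pointers)
    pos = bisect.bisect_right(m_keys, k)
    m_keys.insert(pos, k)
    m_ptrs.insert(pos + 1, p)            # p sits right of its key k
    h = math.ceil(n / 2)
    n_half = (m_keys[:h - 1], m_ptrs[:h])    # P1, K1, ..., K(h-1), Ph
    up_key = m_keys[h - 1]                   # Kh goes to the parent of N
    new_half = (m_keys[h:], m_ptrs[h:])      # P(h+1), K(h+1), ..., Kn, P(n+1)
    return n_half, up_key, new_half


n_half, up_key, new_half = split_internal(
    ["Califieri", "Einstein", "Gold"], ["L1", "L2", "L3", "L4"],
    k="Brandt", p="Lnew", n=4)
print(n_half, up_key, new_half)
```

Unlike a leaf split, the middle key is moved up rather than copied: after the split 'Califieri' appears in neither N nor N’.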
Insert on B+-Trees Example
• Let’s say that we want to insert data with
search-key value (V) = ‘Adams’.
• Search-key not found → insert new row in file.
• Leaf node C is full → split C into 2 nodes.
• Insert key-value (‘Adams’) into the leaf and modify the
parent node accordingly.
[Figure: animated B+-tree showing node C (K1, K2, K3) and the keys ‘Adams’, ‘Brandt’, ‘Einstein’, ‘Mozart’.]
Delete on B+-Trees
• Find the record to be deleted, and remove it from the
main file and from the bucket (if present).
• Remove (search-key value, pointer) from the leaf node if
there is no bucket or if the bucket has become empty.
• If the node has too few entries due to the removal, and
the entries in the node and a sibling fit into a single
node, then merge siblings :
• Insert all the search-key values in the two nodes into
a single node (the one on the left), and delete the
other node.
• Delete the pair (Ki–1 , Pi ), where Pi is the pointer to
the deleted node, from its parent, recursively using
the above procedure.
Delete on B+-Trees
• Otherwise, if the node has too few entries due to the
removal, but the entries in the node and a sibling do not
fit into a single node, then redistribute pointers :
• Redistribute the pointers between the node and a sibling such
that both have at least the minimum number of entries.
• Update the corresponding search-key value in the parent of
the node.
• The node deletions may cascade upwards until a node
which has ⌈𝑛/2⌉ or more pointers is found.
• If the root node has only one pointer after deletion, it is
deleted and the sole child becomes the root.
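The choice between merging and redistributing after a deletion can be sketched for two adjacent leaves as follows. This is a simplified illustration that tracks only keys (no record pointers or parent update); `rebalance_leaves` is a hypothetical helper, and it splits the entries evenly when redistributing, which is one reasonable policy rather than the only one.

```python
import math
from typing import List, Optional, Tuple


def rebalance_leaves(left: List[str], right: List[str], n: int
                     ) -> Tuple[List[str], Optional[List[str]], Optional[str]]:
    """Fix an underfull leaf using its adjacent sibling.

    left and right are the sorted keys of two adjacent leaves, one of which
    fell below ceil((n-1)/2) entries. Returns (new_left, new_right, separator).
    After a merge new_right and separator are None, and the caller must delete
    the pointer to the right node from the parent (possibly recursively).
    """
    capacity = n - 1
    combined = left + right
    if len(combined) <= capacity:
        # Merge: everything fits in the left node; the right node is deleted.
        return combined, None, None
    # Redistribute: split evenly so both nodes reach the minimum occupancy.
    half = math.ceil(len(combined) / 2)
    new_left, new_right = combined[:half], combined[half:]
    # The parent's separator becomes the least key of the right node.
    return new_left, new_right, new_right[0]


# n = 4: leaves hold between 2 and 3 keys.
merged = rebalance_leaves(["Mozart", "Singh"], ["Wu"], n=4)
shifted = rebalance_leaves(["Einstein", "Gold", "Katz"], ["Mozart"], n=4)
print(merged)
print(shifted)
```

In the first call the three remaining keys fit in one node, so the leaves merge; in the second they do not, so one key moves across and the parent's separator changes to 'Katz'.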
Delete on B+-Trees Example
• Let’s say that we want to delete data with
search-key value (V) = ‘Srinivasan’.
• Remove row from file.
• Remove V (‘Srinivasan’) from leaf node C → C is under
full → merge C with previous node.
• Update C’s parent node, then C’s parent’s parent node.
[Figure: animated B+-tree showing node C (K1, K2, K3) and the keys ‘Mozart’ and ‘Srinivasan’.]
B-Tree Index Files
• Similar to B+-tree, but a B-tree allows search-key
values to appear only once, which eliminates
redundant storage of search keys.
• Because search keys in nonleaf nodes appear nowhere
else in the B-tree, an additional pointer field for
each search key in a nonleaf node must be
included.
Typical B-Tree nodes : (a) Leaf node, (b) Non leaf node
B-Tree Index Files Example
• Filtered indexes :
CREATE INDEX messages_todo
ON messages (receiver)
WHERE processed = 'N'
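To try the filtered index above without a database server, the following sketch uses Python's built-in sqlite3 module; SQLite calls this feature a partial index and accepts the same CREATE INDEX ... WHERE syntax. The table columns other than receiver and processed, and the sample rows, are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only `receiver` and `processed` come from the slide.
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, receiver TEXT, processed TEXT, body TEXT)"
)
# The filtered (partial) index only contains rows still to be processed.
conn.execute(
    "CREATE INDEX messages_todo ON messages (receiver) WHERE processed = 'N'"
)
conn.executemany(
    "INSERT INTO messages (receiver, processed, body) VALUES (?, ?, ?)",
    [("alice", "Y", "old"), ("alice", "N", "hello"), ("bob", "N", "hi")],
)

query = (
    "SELECT body FROM messages "
    "WHERE receiver = ? AND processed = 'N' ORDER BY id"
)
todo = conn.execute(query, ("alice",)).fetchall()
index_names = [
    row[0] for row in
    conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
# The query plan shows whether the optimizer picked the partial index.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("alice",)):
    print(row)
print(todo)
```

The index stays small because processed rows are never added to it, which is the point of filtering on processed = 'N'.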
Join Operation
• Join operations may affect performance.
• A join, or a query involving two or more
tables, might be performed as nested-loop
selects in application code, via an ORM
or a function.
• In that case, network latencies occur on top of disk
latencies, because each select is a separate round trip.
• An SQL join operation generally performs better than
nested-loop selects.
• Two common approaches are the SQL join
operation and the nested query.
SQL Join and Nested Query
•SQL Join :
SELECT S.SalesAmount FROM EMPLOYEE E
JOIN SALES S ON E.EMP_ID = S.EMP_ID
WHERE EMP_NAME = ?
•Nested Query :
SELECT SalesAmount FROM SALES WHERE
EMP_ID IN (SELECT EMP_ID FROM
EMPLOYEE WHERE EMP_NAME = ? )
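The two queries above, and the application-side nested-loop select they replace, can be compared with a small sqlite3 sketch. The table definitions and sample rows are assumptions made for the example; only the table and column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (EMP_ID INTEGER PRIMARY KEY, EMP_NAME TEXT);
    CREATE TABLE SALES (EMP_ID INTEGER, SalesAmount INTEGER);
    INSERT INTO EMPLOYEE VALUES (1, 'Adams'), (2, 'Brandt');
    INSERT INTO SALES VALUES (1, 100), (1, 250), (2, 75);
""")


def sales_by_join(name):
    """One statement: the DBMS chooses how to join the tables."""
    rows = conn.execute(
        "SELECT S.SalesAmount FROM EMPLOYEE E "
        "JOIN SALES S ON E.EMP_ID = S.EMP_ID WHERE E.EMP_NAME = ?",
        (name,),
    )
    return sorted(r[0] for r in rows)


def sales_by_nested_query(name):
    """One statement using a subquery instead of an explicit join."""
    rows = conn.execute(
        "SELECT SalesAmount FROM SALES WHERE EMP_ID IN "
        "(SELECT EMP_ID FROM EMPLOYEE WHERE EMP_NAME = ?)",
        (name,),
    )
    return sorted(r[0] for r in rows)


def sales_by_application_loop(name):
    """Nested-loop select in application code: one extra statement
    (and, on a real server, one extra round trip) per employee found."""
    amounts = []
    employees = conn.execute(
        "SELECT EMP_ID FROM EMPLOYEE WHERE EMP_NAME = ?", (name,)
    ).fetchall()
    for (emp_id,) in employees:
        for (amount,) in conn.execute(
            "SELECT SalesAmount FROM SALES WHERE EMP_ID = ?", (emp_id,)
        ):
            amounts.append(amount)
    return sorted(amounts)


print(sales_by_join("Adams"))
print(sales_by_nested_query("Adams"))
print(sales_by_application_loop("Adams"))
```

All three return the same amounts here; the difference lies in how many statements cross the connection, which an in-memory database hides but a networked one does not.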
Types of Join Operation
• SQL Server employs three types of join
operations :
• Nested loops joins.
• Merge joins.
• Hash joins.
• The type of join operation is chosen
automatically by the DBMS.
• Considerations include the amount of
data and index availability.
Nested Loops Joins
• Employ nested iteration.
• Two tables, A and B, serve as the outer input and
the inner input.
• The outer loop consumes the outer input table.
• The inner loop, executed for each outer row,
searches for matching rows in the inner input table.
• Effective when the outer input is small and
the inner input is pre-indexed.
• Index the join-predicate keys and the WHERE
clause attributes.
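The idea can be illustrated outside any DBMS with plain Python lists of dictionaries. This is a conceptual sketch, not how SQL Server implements the operator: the function names are made up, and a dictionary built up front stands in for an index that already exists on the inner table.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, object]


def nested_loops_join(outer: List[Row], inner: List[Row],
                      key: str) -> List[Tuple[Row, Row]]:
    """Naive form: scan the whole inner input for every outer row."""
    return [(o, i) for o in outer for i in inner if o[key] == i[key]]


def index_nested_loops_join(outer: List[Row], inner: List[Row],
                            key: str) -> List[Tuple[Row, Row]]:
    """Indexed form: each outer row probes an index on the inner input."""
    index = defaultdict(list)          # stands in for a pre-built index
    for row in inner:
        index[row[key]].append(row)
    return [(o, i) for o in outer for i in index.get(o[key], [])]


employees = [{"EMP_ID": 1, "EMP_NAME": "Adams"}]            # small outer input
sales = [
    {"EMP_ID": 1, "SalesAmount": 100},
    {"EMP_ID": 2, "SalesAmount": 75},
    {"EMP_ID": 1, "SalesAmount": 250},
]
naive = nested_loops_join(employees, sales, "EMP_ID")
indexed = index_nested_loops_join(employees, sales, "EMP_ID")
print([i["SalesAmount"] for _, i in indexed])
```

With a small outer input, the indexed form touches only the matching inner rows, which is why the approach pays off when the inner input is already indexed on the join key.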
Nested Loops Joins on SQL Server