Unit 4

A binary search tree is a binary tree with the following properties:
· All items in the left subtree are less than the root.
· All items in the right subtree are greater than or equal to the root.
· Each subtree is itself a binary search tree.
Thus, in a binary search tree, the left subtree contains key values less than the root, and the right
subtree contains key values greater than or equal to the root. A binary search tree is represented
in Figure 18.1.
NOTE: Binary search trees provide an excellent structure for searching a list and, at the same time,
for inserting data into and deleting data from the list.
Some examples of valid binary search trees are shown in Figure 18.2.
Invalid Binary search trees
Some examples of invalid binary search trees are shown in Figure 18.3.
BST Operations
Traversal
There are various traversal schemes such as preorder, inorder, and postorder. The traversal of the
binary search tree shown in Figure 18.4 according to the different traversal schemes is discussed
below.
· Preorder Traversal
According to the preorder traversal of a binary search tree shown in Figure 18.4, the following
sequence is generated.
23 18 12 20 44 35 52
· Postorder Traversal
According to the postorder traversal of a binary search tree shown in Figure 18.4, the following
sequence is generated.
12 20 18 35 52 44 23
· Inorder Traversal
According to the inorder traversal of a binary search tree shown in Figure 18.4, the following
sequence is generated.
12 18 20 23 35 44 52
Note: Inorder traversal of a binary search tree produces an ascending sequence.
· Right-Node-Left Traversal
According to the right-node-left traversal of a binary search tree shown in Figure 18.4, the following
sequence is generated.
52 44 35 23 20 18 12
Note: Right-node-left traversal of a binary search tree produces a descending sequence.
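The traversal orders above can be reproduced with a short C sketch. This is not the code from the presentation; it assumes a node structure with a key and left/right pointers, and it rebuilds the tree of Figure 18.4 from the sequences listed above.

#include <stdio.h>
#include <stdlib.h>

/* Assumed BST node layout: a key and two child pointers. */
struct Node { int key; struct Node *left, *right; };

static struct Node *mk(int key, struct Node *l, struct Node *r) {
    struct Node *n = malloc(sizeof *n);
    n->key = key; n->left = l; n->right = r;
    return n;
}

/* Node-Left-Right */
static void preorder(const struct Node *t) {
    if (t) { printf("%d ", t->key); preorder(t->left); preorder(t->right); }
}
/* Left-Right-Node */
static void postorder(const struct Node *t) {
    if (t) { postorder(t->left); postorder(t->right); printf("%d ", t->key); }
}
/* Left-Node-Right: ascending sequence */
static void inorder(const struct Node *t) {
    if (t) { inorder(t->left); printf("%d ", t->key); inorder(t->right); }
}
/* Right-Node-Left: descending sequence */
static void rightNodeLeft(const struct Node *t) {
    if (t) { rightNodeLeft(t->right); printf("%d ", t->key); rightNodeLeft(t->left); }
}

int main(void) {
    /* The tree of Figure 18.4, reconstructed from the sequences above. */
    struct Node *root =
        mk(23, mk(18, mk(12, NULL, NULL), mk(20, NULL, NULL)),
               mk(44, mk(35, NULL, NULL), mk(52, NULL, NULL)));
    preorder(root);      printf("\n");   /* 23 18 12 20 44 35 52 */
    postorder(root);     printf("\n");   /* 12 20 18 35 52 44 23 */
    inorder(root);       printf("\n");   /* 12 18 20 23 35 44 52 */
    rightNodeLeft(root); printf("\n");   /* 52 44 35 23 20 18 12 */
    return 0;
}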
The algorithm for finding the smallest node in a binary search tree is given below:
· Searching the Largest node:
The algorithm for finding the largest node in a binary search tree is given below:
The algorithm for finding the requested node in a binary search tree is given in the presentation:
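The original algorithms for these three search operations appear in the accompanying presentation and figures; a minimal C sketch, assuming the same key/left/right node layout as before, might look like this:

#include <stddef.h>

/* Assumed BST node layout. */
struct Node { int key; struct Node *left, *right; };

/* Smallest node: keep following left pointers until none remain. */
struct Node *findSmallest(struct Node *root) {
    if (root == NULL) return NULL;
    while (root->left != NULL) root = root->left;
    return root;
}

/* Largest node: keep following right pointers until none remain. */
struct Node *findLargest(struct Node *root) {
    if (root == NULL) return NULL;
    while (root->right != NULL) root = root->right;
    return root;
}

/* Requested node: go left for smaller keys, right for larger ones. */
struct Node *findKey(struct Node *root, int target) {
    while (root != NULL && root->key != target)
        root = (target < root->key) ? root->left : root->right;
    return root;                     /* NULL if the key is not present */
}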
Building BST
To insert data, all we need to do is follow the branches to an empty subtree and then insert the
new node. In other words, all inserts take place at a leaf or at a leaflike node – a node that has
only one null subtree.
The algorithm for inserting a node into a binary search tree is given in the presentation.
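A possible C sketch of this insertion algorithm is shown below. It is only an illustration, not the presentation's code; it assumes the key/left/right node layout used above and sends keys equal to the root to the right subtree, matching the BST property stated at the start of the unit.

#include <stdlib.h>

/* Assumed BST node layout. */
struct Node { int key; struct Node *left, *right; };

/* Insert key and return the (possibly new) root of the subtree.
 * The branches are followed down to an empty subtree, so the new
 * node is always attached at a leaf or leaflike node. */
struct Node *insert(struct Node *root, int key) {
    if (root == NULL) {                       /* empty subtree: attach here */
        struct Node *n = malloc(sizeof *n);
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key < root->key)
        root->left = insert(root->left, key);
    else                                      /* keys >= root go right */
        root->right = insert(root->right, key);
    return root;
}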
Deletion
Rather than simply delete the node, we try to maintain the existing structure as much as possible
by finding data to take the place of the deleted data. This can be done in one of two ways. We can
find the largest node in the deleted node’s left subtree and move its data to replace the deleted
node’s data. Alternatively, we can find the smallest node in the deleted node’s right subtree and move its data to
replace the deleted node’s data.
Note: Either of these moves preserves the integrity of the binary search tree.
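As an illustration of the second option (replacing the deleted data with the smallest key in the right subtree), a rough C sketch, under the same node-layout assumption, might be:

#include <stdlib.h>

/* Assumed BST node layout. */
struct Node { int key; struct Node *left, *right; };

/* Delete key from the subtree rooted at root and return the new root.
 * A node with two subtrees is not removed directly: its data is
 * replaced by the smallest key of its right subtree, and that node
 * is deleted instead. */
struct Node *deleteKey(struct Node *root, int key) {
    if (root == NULL) return NULL;
    if (key < root->key) {
        root->left = deleteKey(root->left, key);
    } else if (key > root->key) {
        root->right = deleteKey(root->right, key);
    } else if (root->left == NULL || root->right == NULL) {
        /* zero or one child: splice the node out */
        struct Node *child = (root->left != NULL) ? root->left : root->right;
        free(root);
        return child;
    } else {
        /* two children: copy the smallest key of the right subtree ... */
        struct Node *succ = root->right;
        while (succ->left != NULL) succ = succ->left;
        root->key = succ->key;
        /* ... then delete that key from the right subtree */
        root->right = deleteKey(root->right, succ->key);
    }
    return root;
}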
• The length of the longest path from the root node to one of the terminal nodes is what we call
the height of a tree.
• The difference between the height of the right subtree and the height of the left subtree is what
we call the balancing factor.
• The binary tree is balanced when the balancing factors of all the nodes are -1, 0 or +1.
Formally, this can be written as |hd – hs| <= 1 for any node X in the tree, where hs and hd
represent the heights of X's left and right subtrees.
Figure 1. Balancing factor equals hd – hs
Here are some examples of how the height of a subtree and the balancing factor are calculated for
the above binary search tree.
· the height of the tree is 4, meaning the length of the longest path from the root to a leaf node.
· the height of the left subtree of the root is 3, meaning that the longest path from node 13 to one
of the leaf nodes (2, 7 or 12) has length 3.
-for finding the balancing factor of the root we subtract the height of the left subtree from the
height of the right subtree: 1 - 3 = -2.
-the balancing factor of the node with the key 12 is very easy to determine. We notice that the
node has no children so the balancing factor is 0.
-for finding the balancing factor of the node with key 5 we subtract the height of the left subtree
from the height of the right subtree: 1 - 0 = +1.
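A small C sketch for these calculations, using the convention of the example above (a leaf has height 1, an empty subtree height 0, and the balancing factor is hd - hs), could be:

/* Assumed node layout, as in the earlier sketches. */
struct Node { int key; struct Node *left, *right; };

/* Height of a subtree: an empty subtree has height 0, a leaf height 1,
 * so the example tree above has height 4. */
int height(const struct Node *t) {
    if (t == NULL) return 0;
    int hl = height(t->left);
    int hr = height(t->right);
    return 1 + (hl > hr ? hl : hr);
}

/* Balancing factor = height of right subtree minus height of left
 * subtree, e.g. 1 - 3 = -2 for the root of the example tree. */
int balancingFactor(const struct Node *t) {
    return (t == NULL) ? 0 : height(t->right) - height(t->left);
}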
Operations on AVL Trees
1.2.1 Insertion into AVL Trees
An insertion into the tree can lead to the imbalance of some nodes, which occurs when the
condition |hd – hs| <= 1 is no longer respected.
Basically, a key is inserted like in an ordinary binary search tree, meaning that we start from the
root following the left or right nodes according to the relation between the key we insert and the
key of the nodes we are passing, until we reach a node which can become parent of the new node.
After insertion we go recursively upwards and search for the first node that is not balanced,
meaning the first node whose subtree heights differ by 2. This node must be balanced and will
fall into one of the four cases which follow.
Case 1. Simple right rotate
Figure 3. Simple right rotate
INSERT KEY 7
The node with key 7 should be inserted at the right of the node with key 5. In this case the
tree is imbalanced, since the balance factor of the root node becomes 2. Figure 7 presents the simple
left rotation that is performed and the final shape of the tree.
INSERT KEY 2
The node with key 2 will be inserted at the left of the node with key 4 without the tree
becoming imbalanced.
INSERT KEY 1
The node with key 1 will be inserted at the left of the node with key 2, but in this case the
tree is unbalanced because key 4 has balance factor -2. Figure 8 presents the simple right rotation
that is performed and the final shape of the tree.
INSERT KEY 3
The node with key 3 will be inserted at the left of the node with key 4, but in this case the
tree is unbalanced because key 5 has balance factor -2. Figure 9 presents the double right rotation
that is performed and the final shape of the tree.
INSERT KEY 6
The node with key 6 will be inserted at the left of the node with key 7, but in this case the
tree is unbalanced because key 5 has balance factor +2. Figure 10 presents the double left rotation
that is performed and the final shape of the tree.
Figure 9. Double right rotation for node with key 5
Figure 10. Double left rotation for node with key 5
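The rotations used in the four cases can be sketched in C as follows. Expressing a double rotation as two simple rotations is a common implementation choice assumed here for illustration; balance factors or heights would have to be updated separately.

/* Assumed AVL node layout (balance information kept elsewhere). */
struct Node { int key; struct Node *left, *right; };

/* Simple right rotation around x: x's left child becomes the new
 * subtree root and x moves down into its right subtree. */
struct Node *rotateRight(struct Node *x) {
    struct Node *y = x->left;
    x->left = y->right;
    y->right = x;
    return y;                      /* new root of this subtree */
}

/* Simple left rotation: the mirror image of the right rotation. */
struct Node *rotateLeft(struct Node *x) {
    struct Node *y = x->right;
    x->right = y->left;
    y->left = x;
    return y;
}

/* Double right rotation (as for node 5 in Figure 9): first a left
 * rotation of the left child, then a right rotation of the node. */
struct Node *doubleRotateRight(struct Node *x) {
    x->left = rotateLeft(x->left);
    return rotateRight(x);
}

/* Double left rotation (as for node 5 in Figure 10): the mirror image. */
struct Node *doubleRotateLeft(struct Node *x) {
    x->right = rotateRight(x->right);
    return rotateLeft(x);
}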
Deletion from AVL Trees
Deletion from AVL trees may be treated as deletion of a node for which the left or right pointer is
NULL. From this point of view the procedure is similar to deletion of nodes from a binary
search tree. The difference is that the path from the root node to the node that is to be deleted
needs to be recorded. The steps necessary for deleting nodes from an AVL tree are:
Step 1. Find the node that needs to be deleted. The path from the root node to the deleted node
needs to be recorded.
Step 2. Find the predecessor/successor of the deleted node. This path also needs to be recorded.
Step 3. Replace the key of the deleted node with the key of the predecessor/successor; it is then the
predecessor/successor node that is actually deleted.
Figure 11. Deletion of node with key 12 – initial situation
Figure 12. Deletion of node with key 12 – after replacing key 12 with key 15
The node with key 15 (the successor of the node with key 12) has replaced the key 12. After this
operation is performed, the balance factor of the node with key 20 becomes 0 and the balance factor
of the node with key 15 becomes -2. This means a rebalancing procedure for node 15 is necessary.
Since node 15 has a balance factor of -2 and its left child (the node with key 8) has a balance factor
of +1, a double right rotation for node 15 is necessary. After this rotation the tree will look as in the
next figure. The balance factor of the node with key 24 also increases, becoming 0.
Figure 13. Deletion of node with key 12 – final shape, after rebalancing
An m-way search tree is a tree in which, for some integer m called the order of the tree, each node
has at most m children. If a node has k children, then it contains exactly k - 1 keys,
which partition all the keys into k subsets consisting of all the keys less than the first key in the
node, all the keys between a pair of keys in the node, and all keys greater than the largest key in
the node. Figure 1.1 depicts the m-way search tree with m=5.
NOTE:
As each node in a binary search tree has one value and two subtrees, each node of an m-way
search tree has up to m - 1 values and m subtrees; m is the degree of the tree. A node with k values
must have k + 1 subtrees. In a node, values are stored in ascending order:
V1 < V2 < ... < Vk
The subtrees are placed between adjacent values: each value has a left and right subtree.
Example: Search for 68 in the 3-way search tree shown in Figure 1.3.
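The SEARCH procedure for an m-way tree can be sketched in C as below. The node layout (a key count, a key array and a child array) and the function name are assumptions made for illustration; child[i] is the subtree holding the keys between key[i-1] and key[i].

#include <stddef.h>

#define M 3                          /* order of the tree (a 3-way tree) */

/* Assumed m-way node layout: k keys in ascending order, k + 1 subtrees. */
struct MNode {
    int nkeys;                       /* k, at most M - 1 */
    int key[M - 1];
    struct MNode *child[M];
};

/* Search for target; on success *pos receives its index in the node. */
struct MNode *msearch(struct MNode *node, int target, int *pos) {
    while (node != NULL) {
        int i = 0;
        while (i < node->nkeys && target > node->key[i]) i++;
        if (i < node->nkeys && target == node->key[i]) {
            *pos = i;                /* found in this node */
            return node;
        }
        node = node->child[i];       /* descend between adjacent values */
    }
    return NULL;                     /* key is not in the tree */
}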
Examples of a 3-way B-tree and a 5-way B-tree are depicted in Figures 21.5(a) and 21.5(b),
respectively.
Note:
A B-tree of order n is also called an n-(n - 1) tree or an (n - 1)-n tree. Thus a 4-5 tree is a B-tree of
order 5 as is a 5-4 tree.
In the descriptions of our algorithms, we assume that M is odd; therefore each node (other than the
root) must contain between (M-1)/2 and M-1 values (keys).
Insertion into a B-Tree
· using the SEARCH procedure for M-way trees (described above) find the leaf node to
which X should be added.
· add X to this node in the appropriate place among the values already there. Being a leaf
node there are no subtrees to worry about.
· if there are M-1 or fewer values in the node after adding X, then we are finished.
If there are M values in the node after adding X, we say the node has overflowed. To repair this, we
split the node into three parts:
Left:
the first (M-1)/2 values
Middle:
the middle value (position 1 + (M-1)/2)
Right:
the last (M-1)/2 values
Notice that Left and Right have just enough values to be made into individual nodes. That's what
we do... they become the left and right children of Middle, which we add in the appropriate place
in this node's parent.
But what if there is no room in the parent? If it overflows we do the same thing again: split it into
Left-Middle-Right, make Left and Right into new nodes and add Middle (with Left and Right as
its children) to the node above. We continue doing this until no overflow occurs, or until the root
itself overflows. If the root overflows, we split it, as usual, and create a new root node with Middle
as its only value and Left and Right as its children (as usual).
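The split step can be illustrated with a small C sketch that works on keys only (child pointers and the actual node structures are omitted, and the names are hypothetical). It reproduces the `Insert 21' split from the example that follows.

#include <stdio.h>

#define M 5                          /* order of the B-tree */

/* Split an overflowed node that temporarily holds M sorted keys into
 * Left (the first (M-1)/2 keys), Middle (the value at position
 * 1 + (M-1)/2) and Right (the last (M-1)/2 keys). Middle would then be
 * promoted into the parent, with Left and Right as its children. */
static void splitKeys(const int keys[M],
                      int left[], int *middle, int right[]) {
    int half = (M - 1) / 2;
    for (int i = 0; i < half; i++) left[i] = keys[i];
    *middle = keys[half];
    for (int i = 0; i < half; i++) right[i] = keys[half + 1 + i];
}

int main(void) {
    /* Insert 21 into the leaf [17 22 44 45]: it overflows with 5 keys. */
    int keys[M] = {17, 21, 22, 44, 45};
    int left[(M - 1) / 2], right[(M - 1) / 2], middle;
    splitKeys(keys, left, &middle, right);
    printf("Left = [%d %d], Middle = %d, Right = [%d %d]\n",
           left[0], left[1], middle, right[0], right[1]);
    /* prints: Left = [17 21], Middle = 22, Right = [44 45] */
    return 0;
}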
For example, let's do a sequence of insertions into this B-tree (M=5, so each node other than the
root must contain between 2 and 4 values):
· Left = [ 2 3 ]
· Middle = 5
· Right = [ 6 7 ]
Left and Right become nodes; Middle is added to the node above with Left and Right as its
children.
The node above (the root in this small example) does not overflow, so we are done.
Insert 21: Add it to the middle leaf. That overflows, so we split it:
· Left = [ 17 21 ]
· Middle = 22
· Right = [ 44 45 ]
Left and Right become nodes; Middle is added to the node above with Left and Right as its
children.
The node above (the root in this small example) does not overflow, so we are done.
Insert 67: Add it to the rightmost leaf. That overflows, so we split it:
· Left = [ 55 66 ]
· Middle = 67
· Right = [ 68 70 ]
Left and Right become nodes; Middle is added to the node above with Left and Right as its
children.
But now the node above does overflow. So it is split in exactly the same manner:
Left and Right become nodes, the children of Middle. If this were not the root, Middle would be
added to the node above and the process repeated. If there is no node above, as in this example, a
new root is created with Middle as its only value.
The tree-insertion algorithms we have previously seen add new nodes at the bottom of the tree, and
then have to worry about whether doing so creates an imbalance. The B-tree insertion algorithm is
just the opposite: it adds new nodes at the top of the tree (a new level is added only when the
root splits). B-trees grow at the root, not at the leaves. Because of this, there is never any doubt
that the tree is always perfectly height balanced: when a new root is created, all existing nodes
become one level deeper in the tree.
Deleting a value from a B-Tree
Recall our deletion algorithm for binary search trees: if the value to be deleted is in a node having
two subtrees, we would replace the value with the largest value in its left subtree and then delete
the node in the left subtree that had contained the largest value (we are guaranteed that this node
will be easy to delete).
We will use a similar strategy to delete a value from a B-tree. If the value to be deleted does not
occur in a leaf, we replace it with the largest value in its left subtree and then proceed to delete that
value from the node that originally contained it. For example, if we wished to delete 67 from the
above tree, we would find the largest value in 67's left subtree, 66, replace 67 with 66, and then
delete the occurrence of 66 in the left subtree. In a B-tree, the largest value in any value's left
subtree is guaranteed to be in a leaf. Therefore, wherever the value to be deleted initially resides, the
following deletion algorithm always begins at a leaf.
To delete value X from a B-tree, starting at a leaf node, there are 2 steps:
· Remove X from the current node. Being a leaf node there are no subtrees to worry about.
· Removing X might cause the node containing it to have too few values.
Recall that we require the root to have at least 1 value in it and all other nodes to have at least
(M-1)/2 values in them. If the node has too few values, we say it has underflowed.
If underflow does not occur, then we have finished the deletion process. If it does occur, it must be
fixed. The process for fixing the root is slightly different from the process for fixing the other nodes,
and will be discussed afterwards.
How do we fix a non-root node that has underflowed? Let us take as a specific example, deleting 6
from this B-tree (of degree 5):
Removing 6 causes the node it is in to underflow, as it now contains just 1 value (7). Our strategy
for fixing this is to try to `borrow' values from a neighbouring node. We join together the current
node and its more populous neighbour to form a `combined node' - and we must also include in the
combined node the value in the parent node that is in between these two nodes.
In this example, we join node [7] with its more populous neighbour [17 22 44 45] and put `10' in
between them, to create
[7 10 17 22 44 45]
The treatment of the combined node is different depending on whether the neighbouring node
contributes exactly (M-1)/2 values or more than this number.
Case 1: Suppose that the neighbouring node contains more than (M-1)/2 values. In this case, the
total number of values in the combined node is strictly greater than 1 + ((M-1)/2 - 1) + ((M-1)/2),
i.e. it is strictly greater than (M-1). So it must contain M values or more.
We split the combined node into three pieces: Left, Middle, and Right, where Middle is a single
value in the very middle of the combined node. Because the combined node has M values or more,
Left and Right are guaranteed to have at least (M-1)/2 values each, and therefore are legitimate nodes.
We replace the value we borrowed from the parent with Middle and we use Left and Right as its
two children. In this case the parent's size does not change, so we are completely finished.
This is what happens in our example of deleting 6 from the tree above. The combined node [7 10
17 22 44 45] contains six values (M values or more), so we split it into:
· Left = [ 7 10 ]
· Middle = 17
· Right = [ 22 44 45 ]
Then put Middle into the parent node (in the position where the `10' had been) with Left and
Right as its children:
Case 2: Suppose, on the other hand, that the neighbouring node contains exactly (M-1)/2 values.
Then the total number of values in the combined node is 1 + ((M-1)/2 - 1) + ((M-1)/2) = M - 1.
In this case the combined node contains the right number of values to be treated as a node. So we
make it into a node and remove from the parent node the value that has been incorporated into the
new, combined node. As a concrete example of this case, suppose that, in the above tree, we had
deleted 3 instead of 6. The node [2 3] underflows when 3 is removed. It would be combined with
its more populous neighbour [6 7] and the intervening value from the parent (5) to create the
combined node [2 5 6 7]. This contains 4 values, so it can be used without further processing. The
result would be:
It is very important to note that the parent node now has one fewer value. This might cause it to
underflow - imagine that 5 had been the only value in the parent node. If the parent node
underflows, it would be treated in exactly the same way - combined with its more populous
neighbour etc. The underflow processing repeats at successive levels until no underflow occurs or
until the root underflows (the processing for root-underflow is described next).
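A simplified C sketch of this repair step is given below. It works on key arrays only, assumes the more populous neighbour lies to the right of the underflowed node, and uses hypothetical names; a real implementation would also have to move child pointers and update the parent node.

#include <stdio.h>
#include <string.h>

#define M 5                          /* order of the B-tree */

/* Simplified underflow repair for two neighbouring leaves, keys only.
 * 'node' has underflowed; 'nbr' is its more populous right neighbour;
 * 'sep' is the value in the parent that sits between the two nodes.
 * Returns 1 when the parent keeps a (new) separator (Case 1, borrow),
 * 0 when the separator must be removed from the parent (Case 2, merge). */
static int fixUnderflow(int node[], int *nNode,
                        int nbr[], int *nNbr, int *sep) {
    int combined[2 * M], total = 0;

    /* Build the combined node: node + separator + neighbour. */
    for (int i = 0; i < *nNode; i++) combined[total++] = node[i];
    combined[total++] = *sep;
    for (int i = 0; i < *nNbr; i++) combined[total++] = nbr[i];

    if (total >= M) {                /* Case 1: split into Left/Middle/Right */
        int mid = (total - 1) / 2;   /* index of the middle value */
        *nNode = mid;
        memcpy(node, combined, (size_t)mid * sizeof(int));
        *sep = combined[mid];        /* Middle replaces the old separator */
        *nNbr = total - mid - 1;
        memcpy(nbr, combined + mid + 1, (size_t)*nNbr * sizeof(int));
        return 1;
    }
    /* Case 2: the combined node fits in a single node; the separator
     * is removed from the parent. */
    *nNode = total;
    memcpy(node, combined, (size_t)total * sizeof(int));
    *nNbr = 0;
    return 0;
}

int main(void) {
    /* Deleting 6 from the example above: [7] borrows from [17 22 44 45]
     * through the separator 10. */
    int node[2 * M] = {7}, nbr[2 * M] = {17, 22, 44, 45};
    int nNode = 1, nNbr = 4, sep = 10;
    fixUnderflow(node, &nNode, nbr, &nNbr, &sep);
    printf("Left = [%d %d], new separator = %d, Right = [%d %d %d]\n",
           node[0], node[1], sep, nbr[0], nbr[1], nbr[2]);
    /* prints: Left = [7 10], new separator = 17, Right = [22 44 45] */
    return 0;
}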
Now let us consider the root. For the root to underflow, it must have originally contained just one
value, which now has been removed. If the root was also a leaf, then there is no problem: in this
case the tree has become completely empty.
If the root is not a leaf, it must originally have had two subtrees (because it originally contained
one value). How could it possibly underflow?
The deletion process always starts at a leaf and therefore the only way the root could have its value
removed is through the Case 2 processing just described: the root's two children have been
combined, along with the root's only value to form a single node. But if the root's two children are
now a single node, then that node can be used as the new root, and the current root (which has
underflowed) can simply be deleted.
The node [3 7] would underflow, and the combined node [3 10 18 20] would be created. This has
4 values, which is acceptable when M=5. So it would be kept as a node, and `10' would be
removed from the parent node - the root. This is the only circumstance in which underflow can
occur in a root that is not a leaf. The situation is this:
Clearly, the current root node, now empty, can be deleted and its child used as the new root.
Assumptions:
• Entry key values can have variable length because of variable-length column values appearing
in the index key.
• When a node split occurs, equal lengths of entry information are placed in the left and right
split node.
• Nodes above the leaf level contain directory entries, with n-1 separator keys and n disk pointers
to lower-level B-tree nodes.
• Nodes at the leaf level contain entries with (keyval, rowid) pairs pointing to individual rows
indexed.
• All nodes below the root are at least half full with entry information.
Searching an unbalanced tree may require traversing an arbitrary and unpredictable number of
nodes and pointers.
Advantages of B-tree
• Searching a balanced tree means that all leaves are at the same depth. There is no runaway
pointer overhead. Indeed, even for very large B-trees it can be guaranteed that only a small number
of nodes must be retrieved to find a given key. For example, a B-tree of 10,000,000 keys with
50 keys per node never needs to retrieve more than 4 nodes to find any key.
• The purpose of the B-tree index is to minimize the number of disk I/Os needed to locate a
row with a given index key value.
• The depth of the B-tree bears a close relationship to the number of disk I/Os used to reach
the leaf-level entry where the rowid is kept.
• The nodes of the B-tree are loaded in a left-to-right fashion so that successive inserts
normally occur in the same leaf node, which is held consistently in a memory buffer.
• When the leaf node splits, the successive leaf node is allocated from the next disk page of
the allocated extent.
• Node splits at every level occur in a controlled way and allow us to leave just the right
amount of free space on each page.
• It is common to estimate the fanout at each level to have a value of n, where n is the expected
number of entries that appear in each node. Assuming that there are n directory entries at the
root node and at every node below it, the number of entries at the second level is n^2, at the
third level n^3, and so on. For a tree of depth K, the number of leaf-level entries is n^K just
before a root split occurs in the tree, making it a tree of depth K+1.
Need of B-trees
• Remember that performance is related to the height of the tree
• Used to process external records (information too large to fit into memory), minimizing the
number of accesses to secondary storage
• B-trees were created to make the most efficient use possible of the relationship between
main memory and secondary storage
• For all of the collections we have studied thus far, our assumption has been that the entire
collection exists in memory at once
Summary
• The B-tree is a tree-like structure that helps us to organize data in an efficient way.
• The B-tree index is a technique used to minimize the disk I/Os needed for the purpose of
locating a row with a given index key value.
• Because of its advantages, the B-tree and the B-tree index structure are widely used in
databases nowadays.
Efficiency of B-Trees
• Just as the height of a binary tree is related to the number of nodes through log2, so the height
of a B-Tree is related through log_m, where m is the order of the tree:

height = log_m n + 1
• This relationship enables a B-Tree to hold vast amounts of data in just a few levels. For
example, if the B-tree is of order 10, then level 0 can hold 9 pieces of data, level 1 can hold
10 nodes each with 9 pieces of data, or 90 pieces of data. Level 2 can hold 10 times 10
nodes (100), each with 9 pieces of data, for a total of 900. Thus the three levels hold a total
of 999 (this arithmetic is checked in the sketch after this list). Searches can become very
efficient because the number of nodes to be examined is reduced by a factor of 10 at each
probe. Unfortunately, there is still some price to pay, because each node can contain 9 keys
and therefore, in the worst case, all 9 keys would have to be searched in each node. Thus
finding the node is of order log_m n, but the total search is (m - 1) log_m n. If the order of
the tree is small there are still a substantial number of comparisons in the worst case.
However, if m is large, then the efficiency of the search is enhanced. Since the data are in
order within any given node, a binary search can be used in the node. However, this is not
of much value unless the order is large, since a simple linear search may be almost as
efficient for short lists.
• It should be clear that although the path length to a node may be very short, examining a
node for a key can involve considerable searching within the node. Because of this the B-
Tree structure is used with very large data sets which cannot easily be stored in main
memory. The tree actually resides on disk. If a node is stored so that it occupies just one
disk block then it can be read in with one read operation. Hence main memory can be used
for fast searching within a node and only one disk access is required for each level in the
tree. In this way the B-Tree structure minimizes the number of disk accesses that must be
made to find a given key.
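As a quick check of the arithmetic in the first point of this list, a small C sketch can compute the maximum number of keys that fit in a B-Tree of a given order and number of levels:

#include <stdio.h>

/* Maximum number of keys in a B-Tree of order m with the given number
 * of levels: level i can hold m^i nodes of m - 1 keys each, so the
 * total is (m - 1) * (1 + m + m^2 + ...). */
static long maxKeys(long m, int levels) {
    long nodes = 1, total = 0;
    for (int i = 0; i < levels; i++) {
        total += nodes * (m - 1);   /* level i: nodes * (m - 1) keys */
        nodes *= m;                 /* the next level has m times as many nodes */
    }
    return total;
}

int main(void) {
    /* The order-10 example above: 9 + 90 + 900 = 999 keys in three levels. */
    printf("order 10, 3 levels: %ld keys\n", maxKeys(10, 3));   /* 999 */
    return 0;
}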