
Paavai Institutions Department of MCA

UNIT-II
BALANCED SEARCH TREES,
SORTING AND INDEXING


TECHNICAL TERMS

1. AVL Tree
An empty tree is height balanced. If T is a non-empty binary tree with TL and TR as its
left and right subtrees, then T is height balanced if
i) TL and TR are height balanced and
ii) |hL - hR| ≤ 1,
where hL and hR are the heights of TL and TR respectively.

2. Balanced trees
Balanced trees have the structure of binary trees and obey binary search tree properties.
Apart from these properties, they have some special constraints, which differ from one data
structure to another. However, these constraints are aimed only at reducing the height of the tree,
because this factor determines the time complexity.
Eg: AVL trees, Splay trees.

3. AVL rotations
Let A be the nearest ancestor of the newly inserted node which has the balance factor ±2.
Then the rotations can be classified into the following four categories:
Left-Left: The newly inserted node is in the left subtree of the left child of A.
Right-Right: The newly inserted node is in the right subtree of the right child of A.
Left-Right: The newly inserted node is in the right subtree of the left child of A.
Right-Left: The newly inserted node is in the left subtree of the right child of A.

4. Heap
A heap is defined to be a complete binary tree with the property that the value of each
node is at least as small as the value of its child nodes, if they exist. The root node of the
heap has the smallest value in the tree.

5. B-tree of order M
A B-tree of order M is a tree that is not binary, with the following structural properties:
• The root is either a leaf or has between 2 and M children.
• All non-leaf nodes (except the root) have between ⌈M/2⌉ and M children.
• All leaves are at the same depth.

6. Applications of B-tree
Database implementation
Indexing on non-primary key fields

7. Hashing
Hashing is the transformation of a string of characters into a usually shorter, fixed-length
value or key that represents the original string. Hashing is used to index and retrieve
items in a database because it is faster to find an item using the short hashed key than
using the original value.

8. Hash table
The hash table data structure is merely an array of some fixed size, containing the keys. A
key is a string with an associated value. Each key is mapped into some number in the
range 0 to tablesize-1 and placed in the appropriate cell.

9. Hash function
A hash function is a key-to-address transformation which acts upon a given key to
compute the relative position of the key in an array. A hash function should be simple
to compute and must distribute the keys evenly. A simple hash function is
hash_key = key mod table_size.
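As a small illustration (not from the original notes), the modulo hash above could be written in C as follows; TABLE_SIZE is a hypothetical table size and the key is assumed to be a non-negative integer:

#define TABLE_SIZE 11               /* hypothetical table size; a prime number works well */

/* hash_key = key mod table_size, assuming a non-negative integer key */
int hash(int key)
{
    return key % TABLE_SIZE;
}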

10. Collision in hashing


A collision occurs when an element being inserted hashes to the same value as an
already inserted element.

11. Separate chaining



Separate chaining is a collision resolution technique in which we keep a list of all elements
that hash to the same value. It is called separate chaining because each hash table cell holds a
separate chain (linked list) containing all the elements whose keys hash to the same index.

12. Open addressing


Open addressing is a collision resolution strategy in which, if a collision occurs, alternative
cells are tried until an empty cell is found. The cells h0(x), h1(x), h2(x), ... are tried
in succession, where hi(x) = (Hash(x) + F(i)) mod TableSize, with F(0) = 0. The function
F is the collision resolution strategy.

13. Probing
Probing is the process of getting next available hash table array cell.

14. Linear probing


Linear probing is an open addressing collision resolution strategy in which F is a linear
function of i, namely F(i) = i. This amounts to probing cells sequentially in search of an empty
cell. If the table is big enough, a free cell can always be found, but the time to do so can get
quite large.
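A minimal sketch of linear probing in C is shown below; the fixed-size int table, the EMPTY sentinel and the restriction to non-negative integer keys are assumptions made purely for this example (deletion and rehashing are ignored):

#define TABLE_SIZE 11
#define EMPTY      (-1)                 /* sentinel for an unused cell (keys assumed non-negative) */

int table[TABLE_SIZE];                  /* every cell must be initialised to EMPTY before use */

/* Insert key by linear probing: try h_i(x) = (hash(x) + i) mod TABLE_SIZE for i = 0, 1, 2, ...
   Returns the index used, or -1 if no empty cell is found. */
int lp_insert(int key)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        int idx = (key % TABLE_SIZE + i) % TABLE_SIZE;
        if (table[idx] == EMPTY || table[idx] == key) {
            table[idx] = key;
            return idx;
        }
    }
    return -1;                          /* table is full */
}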

15. Sorting
Ordering the data in an increasing or decreasing fashion according to some relationship
among the data item is called sorting.

16. External sorting


External sorting is a process of sorting in which large blocks of data stored in storage
devices are moved to the main memory and then sorted.

17. Internal sorting


Internal sorting is a process of sorting the data in the main memory.

18. Insertion sort


The main idea of insertion sort is to insert, in the ith pass, the ith element into its rightful
place among the elements A(1), A(2), ..., A(i).

19. Shell sort


Instead of sorting the entire array at once, the array is first divided into smaller segments,
which are then separately sorted using insertion sort.

20. Quick sort


The purpose of quick sort is to move a data item in the correct direction, just enough
for it to reach its final place in the array.

21. Segment
When large blocks of data are to be sorted, only a portion of the block or file is loaded into
the main memory of the computer, since it cannot hold the entire block. This small
portion of the file is called a segment.

22. External sorting methods


Polyphase merging
Oscillation sorting
Merge sorting

23. Max heap


A heap in which each parent has a larger key than its children is called a max heap.

24. Min heap


A heap in which each parent has a smaller key than its children is called a min heap.


CONTENTS

3. BALANCED SEARCH TREES, SORTING AND INDEXING


3.1 AVL trees
3.2 B-Trees
3.3 Sorting
3.4 Bubble sort
3.5 Quick Sort
3.6 Insertion Sort
3.7 Heap sort
3.8 Hashing -Hashing functions
3.9 Collision Resolution Techniques
3.10 Separate chaining
3.11 Open addressing
3.12 Multiple hashing.


LECTURE NOTES
3.1 AVL TREES

3.1.1 Introduction
Binary search trees are an excellent data structure to implement associative arrays, maps,
sets, and similar interfaces. The main difficulty, as discussed in last lecture, is that they are
efficient only when they are balanced. Straightforward sequences of insertions can lead to highly
unbalanced trees with poor asymptotic complexity and unacceptable practical efficiency. For
example, if we insert n elements with keys that are in strictly increasing or decreasing order, the
complexity will be O(n²). On the other hand, if we can keep the height to O(log(n)), as it is for a
perfectly balanced tree, then the complexity is bounded by O(n log(n)).
The solution is to dynamically rebalance the search tree during insert or search
operations. We have to be careful not to destroy the ordering invariant of the tree while we
rebalance. Because of the importance of binary search trees, researchers have developed many
different algorithms for keeping trees in balance, such as AVL trees, red/black trees, splay trees,
or randomized binary search trees. They differ in the invariants they maintain (in addition to the
ordering invariant), and when and how the rebalancing is done.
In this lecture we use AVL trees, which are a simple and efficient data structure for
maintaining balance, and were also the first such structure to be proposed. They are named after their
inventors, G.M. Adelson-Velskii and E.M. Landis, who described them in 1962.

3.1.2 THE HEIGHT INVARIANT


Recall the ordering invariant for binary search trees.
3.1.2.1 ORDERING INVARIANT
At any node with key k in a binary search tree, all keys of the elements in the left subtree
are strictly less than k, while all keys of the elements in the right subtree are
strictly greater than k.
To describe AVL trees we need the concept of tree height, which we define as the
maximal length of a path from the root to a leaf. So the empty tree has height 0, the tree with one
node has height 1, and a balanced tree with three nodes has height 2. If we add one more node to this
last tree it will have height 3. Alternatively, we can define it recursively by saying that the empty
tree has height 0, and the height of any node is one greater than the maximal height of its two
children. AVL trees maintain a height invariant (also sometimes called a balance invariant).

3.1.2.2 HEIGHT INVARIANT


At any node in the tree, the heights of the left and right subtrees differ by at most 1.
As an example, consider the following binary search tree of height 3.

Figure 3.1 Height of Binary Search tree

If we insert a new element with a key of 14, the insertion algorithm for binary search
trees without rebalancing will put it to the right of 13.


Figure 3.2 Height of Binary Search tree after inserting an element

Now the tree has height 4, and one path is longer than the others. However, it is easy to
check that at each node, the height of the left and right subtrees still differ only by one. For
example, at the node with key 16, the left subtree has height 2 and the right subtree has height 1,
which still obeys our height invariant.
Now consider another insertion, this time of an element with key 15.
This is inserted to the right of the node with key 14.

Figure 3.3 Height of Binary Search tree after inserting another element

All is well at the node labeled 14: the left subtree has height 0 while the right subtree has
height 1. However, at the node labeled 13, the left subtree has height 0, while the right subtree
has height 2, violating our invariant. Moreover, at the node with key 16, the left subtree has
height 3 while the right subtree has height 1, also a difference of 2 and therefore an invariant
violation.


We therefore have to take steps to rebalance the tree. We can see without too much trouble
that we can restore the height invariant if we move the node labeled 14 up and push the node labeled 13
down and to the right, resulting in the following tree.

Figure 3.4 Height of Binary Search tree

The question is how to do this in general. In order to understand this we need a


fundamental operation called a rotation, which comes in two forms, left rotation and right
rotation.

3.1.3 LEFT AND RIGHT ROTATIONS


Below, we show the situation before a left rotation. We have generically denoted the
crucial key values in question with x and y. Also, we have summarized whole subtrees with the
intervals bounding their key values. Even though we wrote −∞ and +∞, when the whole tree is a
subtree of a larger tree these bounds will be generic: a lower bound that is smaller than x and an
upper bound that is greater than y. The tree on the right is after the left rotation.


Figure 3.5 Tree height after Left rotation

From the intervals we can see that the ordering invariants are preserved, as are the
contents of the tree. We can also see that it shifts some nodes from the right subtree to the left
subtree. We would invoke this operation if the invariants told us that we have to rebalance from
right to left. We implement this with some straightforward code. First, recall the type of trees
from last lecture. We do not repeat the function is_ordtree that checks if a tree is ordered.
struct tree {
elem data;
struct tree* left;
struct tree* right;
};
typedef struct tree* tree;
bool is_ordtree(tree T);
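The function is_ordtree itself is assumed from the last lecture and is not repeated here. Purely as an illustration, a version that passes key bounds down the recursion could look like the sketch below; it assumes the compare and elem_key helpers that tree_insert uses later, and uses NULL bounds to mean "unbounded":

/* Sketch only: check the ordering invariant by passing bounds down the tree.
   lo and hi are elements whose keys bound the allowed range (NULL = no bound). */
static bool ordered(tree T, elem lo, elem hi)
{
    if (T == NULL) return true;
    if (lo != NULL && compare(elem_key(lo), elem_key(T->data)) >= 0) return false;
    if (hi != NULL && compare(elem_key(T->data), elem_key(hi)) >= 0) return false;
    return ordered(T->left, lo, T->data)    /* left keys must stay below T->data  */
        && ordered(T->right, T->data, hi);  /* right keys must stay above T->data */
}

bool is_ordtree(tree T)
{
    return ordered(T, NULL, NULL);
}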

The main point to keep in mind is to use (or save) a component of the input before
writing to it. We apply this idea systematically, writing to a location immediately after using it
on the previous line. We repeat the type specification of tree from last lecture.
tree rotate_left(tree T)
//@requires is_ordtree(T);
//@requires T != NULL && T->right != NULL;
//@ensures is_ordtree(\result);
//@ensures \result != NULL && \result->left != NULL;
{
tree root = T->right;
T->right = root->left;
root->left = T;
return root;
}

The right rotation is entirely symmetric. First in pictures:

Figure 3.6 Tree height after Right rotation

Then in code:
tree rotate_right(tree T)
//@requires is_ordtree(T);
//@requires T != NULL && T->left != NULL;
//@ensures is_ordtree(\result);
//@ensures \result != NULL && \result->right != NULL;
{
tree root = T->left;
T->left = root->right;
root->right = T;
return root;
}

3.1.4 SEARCHING FOR A KEY


Searching for a key in an AVL tree is identical to searching for it in a plain binary search
tree, as described in the last lecture. The reason is that we only need the ordering invariant to find the
element; the height invariant is only relevant when inserting an element.
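The search code is therefore the ordinary binary search tree lookup. As a sketch (assuming a client-supplied key type key and the compare and elem_key helpers used by tree_insert later in this section), it could look like this:

/* Sketch: look up a key in the (ordered) tree.
   Returns the element stored with key k, or NULL if it is not present. */
elem tree_search(tree T, key k)
{
    if (T == NULL) return NULL;
    int r = compare(k, elem_key(T->data));
    if (r == 0) return T->data;                        /* found                   */
    else if (r < 0) return tree_search(T->left, k);    /* key is smaller: go left */
    else return tree_search(T->right, k);              /* key is larger: go right */
}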

3.1.5 INSERTING AN ELEMENT


The basic recursive structure of inserting an element is the same as for searching for an
element. We compare the element’s key with the keys associated with the nodes of the trees,
inserting recursively into the left or right subtree. When we find an element with the exact key
we overwrite the element in that node.
If we encounter a null tree, we construct a new tree with the element to be inserted and no
children and then return it. As we return the new subtrees (with the inserted element) towards the
root, we check if we violate the height invariant. If so, we rebalance to restore the invariant and
then continue up the tree to the root.
The main cleverness of the algorithm lies in analyzing the situations when we have to
rebalance and applying the appropriate rotations to restore the height invariant. It turns out that
one or two rotations on the whole tree always suffice for each insert operation, which is a very
elegant result.
First, we keep in mind that the left and right subtrees’ heights before the insertion can
differ by at most one. Once we insert an element into one of the subtrees, they can differ by at
most two. We now draw the trees in such a way that the height of a node is indicated by the
height that we are drawing it at.
The first situation we describe is where we insert into the right subtree, which is already
of height h + 1 where the left subtree has height h. If we are unlucky, the result of inserting into
the right subtree will give us a new right subtree of height h + 2 which raises the height of the
overall tree to h + 3, violating the height invariant.
If the new right subtree has height h+2, either its right or its left subtree must be of
height h+1 (and only one of them; think about why). If it is the right subtree, we are in the
situation depicted below on the left.

Figure 3.7 Tree height after Left rotation at x

We fix this with a left rotation, the result of which is displayed to the right. In the second
case we consider we once again insert into the right subtree, but now the left subtree of the right
subtree has height h + 1.

Figure 3.8 Double rotation at z and x


In that case, a left rotation alone will not restore the invariant (see Exercise 1). Instead,
we apply a so-called double rotation: first a right rotation at z, then a left rotation at the root.
When we do this we obtain the picture on the right, restoring the height invariant. There are two
additional symmetric cases to consider, if we insert the new element on the left.
We can see that in each of the possible cases where we have to restore the invariant, the
resulting tree has the same height h + 2 as before the insertion. Therefore, the height invariant
above the place where we just restored it will be automatically satisfied.

3.1.6 CHECKING INVARIANTS


The interface for the implementation is exactly the same as for binary search trees, as is
the code for searching for a key. In various places in the algorithm we have to compute the
height of the tree. This could be an operation of asymptotic complexity O(n), unless we store it
in each node and just look it up. So we have:
struct tree {
elem data;
int height;
struct tree* left;
struct tree* right;
};
typedef struct tree* tree;
/* height(T) returns the precomputed height of T in O(1) */
int height(tree T) {
return T == NULL ? 0 : T->height;
}
When checking if a tree is balanced, we also check that all the heights that have been computed
are correct.
bool is_balanced(tree T) {
if (T == NULL) return true;
int h = T->height;
int hl = height(T->left);
int hr = height(T->right);
if (!(h == (hl > hr ? hl+1 : hr+1))) return false;
if (hl > hr+1 || hr > hl+1) return false;
return is_balanced(T->left) && is_balanced(T->right);
}
A tree is an AVL tree if it is both ordered and balanced.
bool is_avl(tree T) {
return is_ordtree(T) && is_balanced(T);
}
We use this, for example, in a utility function that creates a new leaf
from an element (which must not be null).
tree leaf(elem e)
//@requires e != NULL;
//@ensures is_avl(\result);
{
tree T = alloc(struct tree);
T->data = e;
T->height = 1;
T->left = NULL;
T->right = NULL;
return T;
}

3.1.7 IMPLEMENTING INSERTION


The code for inserting an element into the tree is mostly identical with the code for plain
binary search trees. The difference is that after we insert into the left or right subtree, we call a
function rebalance_left or rebalance_right, respectively, to restore the invariant if necessary and
calculate the new height.


tree tree_insert(tree T, elem e)


//@requires is_avl(T);
//@ensures is_avl(\result);
{
assert(e != NULL); /* cannot insert NULL element */
if (T == NULL) {
T = leaf(e); /* create new leaf with data e */
} else {
int r = compare(elem_key(e), elem_key(T->data));
if (r < 0) {
T->left = tree_insert(T->left, e);
T = rebalance_left(T); /* also fixes height */
} else if (r == 0) {
T->data = e;
} else { //@assert r > 0;
T->right = tree_insert(T->right, e);
T = rebalance_right(T); /* also fixes height */
}
}
return T;
}

We show only the function rebalance_right; rebalance_left is symmetric.


tree rebalance_right(tree T)
//@requires T != NULL;
//@requires is_avl(T->left) && is_avl(T->right);
/* also requires that T->right is result of insert into T */
//@ensures is_avl(\result);
{
tree l = T->left;
tree r = T->right;
int hl = height(l);
int hr = height(r);
if (hr > hl+1) {
//@assert hr == hl+2;
if (height(r->right) > height(r->left)) {
//@assert height(r->right) == hl+1;
T = rotate_left(T);
//@assert height(T) == hl+2;
return T;
} else {
//@assert height(r->left) == hl+1;
/* double rotate left */
T->right = rotate_right(T->right);
T = rotate_left(T);
//@assert height(T) == hl+2;
return T;
}
} else { //@assert !(hr > hl+1);
fix_height(T);
return T;
}
}
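The helper fix_height called above is not listed in these notes; a minimal sketch, consistent with the height function from Section 3.1.6, would simply recompute the stored height from the children:

/* Sketch: recompute T->height, assuming the children's stored heights are already correct. */
void fix_height(tree T)
//@requires T != NULL;
{
    int hl = height(T->left);
    int hr = height(T->right);
    T->height = (hl > hr ? hl : hr) + 1;
}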
Note that the preconditions are weaker than we would like. In particular, they do not
imply some of the assertions we have added in order to show the correspondence to the pictures.
Such assertions are nevertheless useful because they document expectations based on informal
reasoning we do behind the scenes. Then, if they fail, they may be evidence for some error in our
understanding, or in the code itself, which might otherwise go undetected.

3.1.8 EXPERIMENTAL EVALUATION


We would like to assess the asymptotic complexity and then experimentally validate it. It
is easy to see that both insert and search operations take time O(h), where h is the height of the
tree. But how is the height of the tree related to the number of elements stored, if we use the
balance invariant of AVL trees? It turns out that h is O(log(n)). It is not difficult to prove this,
but it is beyond the scope of this course.
To experimentally validate this prediction, we have to run the code with inputs of
increasing size. A convenient way of doing this is to double the size of the input and compare
running times. If we insert n elements into the tree and look them up, the running time should be
bounded by c·n·log(n) for some constant c. Assume we run it at some size n and observe a running time
r = c·n·log(n). If we double the input size, we have c·(2n)·log(2n) = 2·c·n·(1 + log(n)) =
2r + 2·c·n, so we mainly expect the running time to double, with an additional summand that
roughly doubles as n doubles. In order to smooth out minor variations and get bigger
numbers, we run each experiment 100 times. Here is the table with the results:

Table 3.1 Running times as n doubles (columns: n, AVL trees, increase, BSTs)


In the third column, where 2r stands for double the previous value, we see that we are quite
close to the predicted running time, with an approximately linearly increasing additional
summand.

In the fourth column we have run the experiment with plain binary search trees which do
not rebalance automatically. First of all, we see that they are much less efficient, and second we
see that their behavior with increasing size is difficult to predict, sometimes jumping
considerably and sometimes not much at all. In order to understand this behavior, we need to
know more about the order and distribution of keys that were used in this experiment. They were
strings, compared lexicographically. The keys were generated by counting integers upward and
then converting them to strings. The distribution of these keys is haphazard, but not random. For
example, if we start counting at 0
"0" < "1" < "2" < "3" < "4" < "5" < "6" < "7" < "8" < "9" < "10" < "12" < ...
The first ten strings are in ascending order but then numbers are inserted between "1" and
"2". This kind of haphazard distribution is typical of many realistic applications, and we see that
binary search trees without rebalancing perform quite poorly and unpredictably compared with
AVL trees.

3.2 B-TREES

3.2.1 INTRODUCTION
We have seen binary search trees before. When the data volume is large and does not fit in
memory, an extension of the binary search tree to the disk-based environment is the B-tree,
originally invented by Bayer and McCreight [1]. In fact, since the B-tree is always balanced (all
leaf nodes appear at the same level), it is an extension of the balanced binary search tree. Since
each disk access exchanges a whole block of information between memory and disk rather than a
few bytes, a node of the B-tree is expanded to hold more than two child pointers, up to the block
capacity. To guarantee worst-case performance, the B-tree requires that every node (except the
root) has to be at least half full. An exact match query, insertion or deletion need to access
O(logB n) nodes, where B is the page capacity in number of child pointers, and n is the number
of objects.

3.2.2 THE B-TREE


The problem which the B-tree aims to solve is: given a large collection of objects, each
having a key and a value, design a disk-based index structure which efficiently supports queries
and updates.
Here the query of interest is the exact-match query: given a key k, locate the value of the
object with key=k. The update can be either an insertion or a deletion: that is, insert a new object
into the index, or delete from the index the object with a given key.

3.2.2.1 B-TREE DEFINITION


A B-tree is a tree structure where every node corresponds to a disk page and which
satisfies the following properties:
 A node (leaf or index) x has a value x:num as the number of objects stored in x. It also
stores the list of x:num objects in increasing key order. The key and value of the ith
object (1 ≤ i ≤ x:num) are represented as x:key[i] and x:value[i], respectively.
 Every leaf node has the same depth.
 An index node x stores, besides x:num objects, x:num+1 child pointers. Here each child
pointer is a pageID of the corresponding child node. The ith child pointer is denoted as
x:child[i]. It corresponds to the key range (x:key[i-1], x:key[i]). This means that in the ith
sub-tree, any object key must be larger than x:key[i-1] and smaller than x:key[i]. For
instance, in the sub-tree referenced by x:child[1], the object keys are smaller than
x:key[1]. In the sub-tree referenced by x:child[2], the object keys are between x:key[1]
and x:key[2], and so on.
 Every node except the root node has to be at least half full. That is, suppose an index
node can hold up to 2B child pointers (besides, of course, 2B-1 objects), then any index
node except the root must have at least B child pointers. A leaf node can hold more
objects, since no child pointer needs to be stored. However, for simplicity we assume a
leaf node holds between B and 2B objects.
 If the root node is an index node, it must have at least two children.
A special case of the B-tree is when B = 2. Here every index node must have 2 or 3 or 4
child pointers. This special case is called the 2-3-4 tree.
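As a concrete, purely illustrative picture, one possible in-memory layout of a B-tree node in C is sketched below. The field names num, key, value and child mirror the x:num, x:key[i], x:value[i] and x:child[i] notation used above; ORDER_2B and the use of int keys and values are assumptions made for the example, and a real disk-based node would be a page referenced by a pageID rather than a pointer:

#include <stdbool.h>

#define ORDER_2B 4   /* hypothetical capacity 2B with B = 2: up to 2B-1 = 3 objects and 2B = 4 children */

typedef struct btree_node {
    bool is_leaf;                        /* leaf node or index node                  */
    int  num;                            /* number of objects stored here (x:num)    */
    int  key[ORDER_2B - 1];              /* keys in increasing order (x:key[i])      */
    int  value[ORDER_2B - 1];            /* associated values (x:value[i])           */
    struct btree_node *child[ORDER_2B];  /* child pointers, used only by index nodes */
} btree_node;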


Figure 3.9 shows an example of a B-tree. In particular, it is a 2-3-4 tree.

Figure 3.9 An example of a B-tree.


In the figure, every index node contains between 2 and 4 child pointers, and every leaf
node contains between 2 and 4 objects. The root node A is an index node. Currently it has one
object with key=25 and two child pointers. In the left sub-tree, every object has key<25. In the
right sub-tree, every object has key>25. Every leaf node (D through I) is located at the same
depth: its distance to A is 2. Currently, there are two pages which are full: an index node B and
a leaf node D.

3.2.3 B-TREE QUERY


To find the value of an object with key=k, we call the Query algorithm given below. The
parameters are the tree root pageID and the search key k. The algorithm works as follows.
It follows (at most) a single path from root to leaf. At each index node along the path,
there can be at most one sub-tree whose key range contains k. A recursive call on that sub-tree is
performed (step 2c). Eventually, we reach a leaf node (step 3a). If there exists an object in the
node with key=k, the algorithm returns the value of the object. Otherwise, the object does not
exist in the tree and NULL is returned. Since the index nodes of the B-tree also store objects, it
is possible that the object with key=k is found in an index node. In this case, the algorithm
returns the object value without going down to the next level (step 2a).


Algorithm Query(pageID, k)
Input: pageID of a B-tree node, a key k to be searched.
Output: value of the object with key=k; NULL if it does not exist.
1. x = DiskRead(pageID).
2. if x is an index node
(a) If there is an object o in x s.t. o:key = k, return o:value.
(b) Find the child pointer x:child[i] whose key range contains k.
(c) return Query(x:child[i], k).
3. else
(a) If there is an object o in x s.t. o:key = k, return o:value. Otherwise, return NULL.
4. end if
As an example, Figure 3.10 shows how to perform a search query for k = 13. At node A, we
should follow the left sub-tree since k < 25. At node B, we should follow the third sub-tree since
10 < k < 16. Now we reach leaf node F. An object with key=13 is found in the node.

Figure 3.10: Query processing in a B-tree.


If the query wants to search for k = 12, we still examine the three nodes A, B, F. This
time, no object with key=12 is found in F, and thus the algorithm returns NULL. If the search
key is 10 instead, the algorithm only examines node A and B. Since in node B such an object is
found, the algorithm stops there.


Notice that in the Query algorithm, only the DiskRead function is called. The other
functions, e.g. DiskWrite, are not needed as the algorithm does not modify the B-tree. Since the
query algorithm examines a single path from root to leaf, the complexity of the algorithm in
number of I/Os is O(logB n), where n is the number of objects.
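For illustration only, the Query algorithm translated to the in-memory node layout sketched earlier might look like this; the disk-based version above would call DiskRead on a pageID instead of following a pointer, and -1 stands in for NULL because the values here are plain ints:

/* Sketch: exact-match query on the in-memory B-tree node sketched in Section 3.2.2.1.
   Returns the value stored with key k, or -1 if k is not present. */
int btree_query(const btree_node *x, int k)
{
    if (x == NULL) return -1;
    int i = 0;
    while (i < x->num && k > x->key[i]) i++;     /* skip keys smaller than k             */
    if (i < x->num && x->key[i] == k)
        return x->value[i];                      /* found in this node (leaf or index)   */
    if (x->is_leaf)
        return -1;                               /* reached a leaf: k is not in the tree */
    return btree_query(x->child[i], k);          /* recurse into the sub-tree whose key range contains k */
}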

3.2.4 B-TREE INSERTION


To insert a new object with key k and value v into the index, we call the Insert algorithm
given below.

Algorithm Insert(root, k, v)
Input: root pageID of a B-tree, the key k and the value v of a new object.
Prerequisite: The object does not exist in the tree.
Action: Insert the new object into the B-tree.
1. x = DiskRead(root).
2. if x is full
(a) y = AllocatePage(), z = AllocatePage().
(b) Locate the middle object oi stored in x. Move the objects to the left of oi
into y. Move the objects to the right of oi into z. If x is an index page, also
move the child pointers accordingly.
(c) x:child[1] = y:pageID, x:child[2] = z:pageID.
(d) DiskWrite(x); DiskWrite(y); DiskWrite(z).
3. end if
4. InsertNotFull(x, k, v).
Basically, the algorithm makes sure that the root page is not currently full, and then it calls the
InsertNotFull function to insert the object into the tree. If the root page x is full, the algorithm
will split it into two nodes y and z, and node x will be promoted to a higher level, thus increasing
the height of the tree.
This scenario is illustrated in Figure 3.11. Node x is a full root page. It contains three objects
and four child pointers. If we try to insert some record into the tree, the root node is split into two
nodes y and z. Originally, x contains x:num = 3 objects. The left object (key=6) is moved to a
new node y. The right object (key=16) is moved to a new node z.

Figure 3.11: Splitting the root node increases the height of the tree.

The middle object (key=10) remains in x. Correspondingly, the child pointers D, E, F, G


are also moved. Now, x contains only one object (key=10). We keep it as the new root, and
make y and z its two children. To insert an object into a sub-tree rooted by a non-full
node x, the following algorithm InsertNotFull is used.

Algorithm InsertNotFull(x, k, v)
Input: an in-memory page x of a B-tree, the key k and the value v of a new object.
Prerequisite: page x is not full.
Action: Insert the new object into the sub-tree rooted by x.
1. if x is a leaf page
(a) Insert the new object into x, keeping objects in sorted order.
(b) DiskWrite(x).
2. else
(a) Find the child pointer x:child[i] whose key range contains k.
(b) w = DiskRead(x:child[i]).
(c) if w is full
i. y = AllocatePage().
ii. Locate the middle object oj stored in w. Move the objects to the right of oj into y. If w is an
index page, also move the child pointers accordingly.
iii. Move oj into x. Accordingly, add a child pointer in x (to the right of oj) pointing to y.
iv. DiskWrite(x); DiskWrite(y); DiskWrite(w).
v. If k < oj:key, call InsertNotFull(w, k, v); otherwise, call InsertNotFull(y, k, v).
(d) else
InsertNotFull(w, k, v).
(e) end if
3. end if
Algorithm InsertNotFull examines a single path from root to leaf, and eventually inserts
the object into some leaf page. At each level, the algorithm follows the child pointer whose key
range contains the key of the new object (step 2a). If no node along the path is full, the algorithm
recursively calls itself on each of these nodes (step 2d) down to the leaf level, where the object is
inserted into the leaf node (step 1).
Consider the other case when some node w along the path is full (step 2c). The node is
first split into two (w and y). The right half of the objects from w are moved to y, while the
middle object is pushed into the parent node. After the split, the key range of either w or y, but
not both, contains the key of the new object. A recursive call is performed on the correct node.
As an example, consider inserting an object with key=14 into the B-tree of Figure 3.10. The
result is shown in Figure 3.12. The child pointers that are followed are drawn thick. When we examine the
root node A, we follow the child pointer to B. Since B is full, we first split it into two, by moving
the right half of the objects (only one object in our case, with key=16) into a new node B''. The
child pointers to F and G are moved as well. Further, the previous middle object in B (key=10) is
moved to the parent node A. A new child pointer to B'' is also generated in A. Now, since the
key of the new object is 14, which is bigger than 10, we recursively call the algorithm on B''. At
this node, since 14 < 16, we recursively call the algorithm on node F. Since F is a leaf node, the
algorithm finishes by inserting the new object into F. The accessed disk pages are shown as
shadowed.

Figure 3.12: Inserting an object with key=14 into the B-tree


Since node B is full, it is split into two (B and B''). The object is recursively inserted into the
sub-tree rooted by B''. At the lowest level, it is stored in node F.

3.2.5 B-TREE DELETION


This section describes the Delete algorithm which is used to delete an object with key=k
from the B-tree. It is a recursive algorithm. It takes (besides k) as parameter a tree node, and it
will perform deletion in the sub-tree rooted by that node. We should call the algorithm with the
root node of the tree as parameter.
We know that there is a single path from the root node to the node x that contains k. The
Delete algorithm examines this path. Along the path, at each level when we examine node x, we
first make sure that x has at least one more element than half full (except the case when x is the
root). The reasoning behind this is that in order to delete an element from the sub-tree rooted by
x, the number of elements stored in x can be reduced by at most one. If x has one more element
than half full (minimum occupancy), it can be guaranteed that x will not underflow. We
distinguish three cases:
1. x is a leaf node;
2. x is an index node which contains an object with key=k;
3. x is an index node which does not contain an object with key=k.

Algorithm Delete(x, k)
Input: an in-memory node x of a B-tree, the key k to be deleted.
Prerequisite: an object with key=k exists in the sub-tree rooted by x.
Action: Delete the object from the sub-tree rooted by x.
1. if x is a leaf page
(a) Delete the object with key=k from x.
(b) DiskWrite(x).
2. else if x does not contain the object with key=k
(a) Locate the child x:child[i] whose key range contains k.
(b) y = DiskRead(x:child[i]).
(c) if y is exactly half full
i. If the sibling node z immediately to the left (right) of y has at least one more object than
minimally required, add one more object to y by moving x:key[i] from x to y and moving that last
(first) object from z to x. If y is an index node, the last (first) child pointer in z is also moved to y.
ii. Otherwise, every immediate sibling of y is exactly half full. Merge y with an immediate sibling.
end if
end if
(d) Delete(y; k).
3. else
(a) If the child y that precedes k in x has at least one more object than minimally required, find the
predecessor k' of k in the sub-tree rooted by y, recursively delete k' from the sub-tree and
replace k with k' in x.
(b) Otherwise, y is exactly half full. We check the child z that immediately follows k in x. If z has
at least one more object than minimally required, find the successor k' of k in the sub-tree rooted
by z, recursively delete k' from the sub-tree and replace k with k' in x.
(c) Otherwise, both y and z are half full. Merge them into one node and push k down to the new
node as well. Recursively delete k from this new node.
4. end if


Along the search path from the root to the node containing the object to be deleted, for
each node x we encounter, there are three cases. The simplest scenario is when x is a leaf node
(step 1 of the algorithm). In this case, the object is deleted from the node and the algorithm
returns. Note that there is no need to handle underflow. The reason is: if the leaf node is root,
there is only one node in the tree and it is fine if it has only a few objects; otherwise, the previous
recursive step has already guaranteed that x has at least one more object than minimally required.
Steps 2 and 3 of the algorithm correspond to two different cases of dealing with an index node.
For step 2, the index node x does not contain the object with key=k. Thus there exists a
child node y whose key range contains k. After we read the child node into memory (step 2b), we
will recursively call the Delete algorithm on the sub-tree rooted by y (step 2d).
However, before we do that, step 2(c) of the algorithm makes sure that y contains at least
one more object than half full. Suppose we want to delete 5 from the B-tree shown in the figure.
When we are examining the root node A, we see that child node B should be followed next. Since
B has two more objects than half full, the recursion goes to node B. In turn, since D has two more
objects than minimum occupancy, the recursion goes to node D, where the object can be
removed.
Let us examine another example. Still with the B-tree shown in the figure, suppose we
want to delete 33. The algorithm finds that the child node y = C is half full. One more object
needs to be incorporated into node C before a recursive call on C is performed.
There are two sub-cases. The first sub-case is when one immediate sibling z of node y has
at least one more object than minimally required. This case corresponds to step 2(c)i of the
algorithm. To handle this case, we drag one object down from x to y, and we push one object
from the sibling node up to x. As an example, the deletion of object 33 is shown in the figure.


Figure 3.13: Illustration of step 2(c)i of the Delete algorithm.

Deleting an object with key=33 from the B-tree of Figure 3.10. At node A, we examine
the right child. Since node C only had one object before, a new object was added to it in the
following way: the object with key=25 is moved from A to C, and the object with key=16 is
moved from B to A. Also, the child pointer pointing to G is moved from B to C.

Another sub-case is when all immediate siblings of y are exactly half full. In this case, we
merge y with one sibling. In our 2-3-4-tree example, an index node which is half full contains
one object. If we merge two such nodes together, we also drag an object from the parent node of
them down to the merged node. The node will then contain three objects, which is full but does
not overflow.
For instance, suppose we want to delete object 31 from Figure 3.13. When we are examining
node x = C, we see that we need to recursively delete in the child node y = H. Now, both
immediate siblings of H are exactly half full. So we need to merge H with a sibling, say G.
Besides moving the remaining object 28 from H to G, we also should drag object 25 from the
parent node C to G. The figure is omitted for this case.
The third case is that node x is an index node which contains the object to be deleted.
Step 3 of algorithm Delete corresponds to this scenario. We cannot simply delete the object from
x, because we also need to decrement the number of child pointers by one. In the figure, suppose
we want to delete object with key=25, which is stored in index node C.

Figure 3.14: Illustration of step 3(c) of the Delete algorithm.


Deleting an object with key=25 from the B-tree of Figure 3.13. At node A, we examine
the right child. We see that node C contains the object with key=25. We cannot move an object
up from a child node of C, since both children G and H (around key 25) are exactly half full. The
algorithm merges these two nodes into one, by moving objects 28 and 31 from H to G and then
deleting H. Node C loses an object (key=25) and a child pointer (to H).
However, in our case, both child nodes G and H are half full and thus cannot contribute
an object. Step 3(c) of the algorithm corresponds to this case. As shown in the figure, the two
nodes are merged into one.


3.3 SORTING

Consider sorting the values in an array A of size N. Most sorting algorithms are what
are called comparison sorts; i.e., they work by comparing values. Comparison sorts can never
have a worst-case running time less than O(N log N). Simple comparison sorts are usually
O(N²); the more clever ones are O(N log N).
Three interesting issues to consider when thinking about different sorting algorithms are:
 Does an algorithm always take its worst-case time?
 What happens on an already-sorted array?
 How much space (other than the space for the array itself) is required?

We will discuss four comparison-sort algorithms:

1. Bubble sort
2. Quick Sort
3. Insertion Sort
4. Heap sort
Bubble sort and insertion sort have worst-case time O(N²). Quick sort is also O(N²) in the worst case, but
its expected time is O(N log N).

3.4 BUBBLE SORT


The basic idea of this algorithm is that we bring the smaller elements upward in the array
step by step and, as a result, the larger elements go downward. If we think of the array as a
vertical one, the smaller elements come upward and the larger elements go downward, which
resembles a bubbling phenomenon. Due to this bubbling nature, the algorithm is called bubble sort.
Thus the basic idea is that the lighter bubbles (smaller numbers) rise to the top. This is for
sorting in ascending order; we can do the reverse for sorting in descending order.


3.4.1 BUBBLE SORT- STEPS


The steps in the bubble sort can be described as below
• Exchange neighboring items until the largest item reaches the end of the array
• Repeat the above step for the rest of the array
In this sort algorithm, we do not search the array for the smallest number as in the other
two algorithms. Nor do we insert an element by shifting the other elements. In this
algorithm, we do pair-wise swapping: we take the first pair of elements and, if they are out of
order, swap them so that the smaller comes before the larger. Then we do the same for the next
pair. By repeating this process, the largest number moves to the end of the array and the smaller
elements move towards the start of the array.
Let’s try to understand this phenomenon with the help of figures how bubble sort works.
Consider the same previous array that has elements 19, 12, 5 and 7.

Figure 3.15: Bubble Sort.


First of all, we compare the first pair i.e. 19 and 5. As 5 is less than 19, we swap these
elements. Now 5 is at its place and we take the next pair. This pair is 19, 12 and not 12, 7. In this
pair 12 is less than 19, we swap 12 and 19. After this, the next pair is 19, 7. Here 7 is less than 19
so we swap it. Now 7 is at its place as compared to 19 but it is not at its final position. The
element 19 is at its final position. Now we repeat the pair wise swapping on the array from index
0 to 2 as the value at index 3 is at its position. So we compare 5 and 12. As 5 is less than 12 so it
is at its place (that is, before 12) and we need not swap them. Now we take the next pair, that is,
12 and 7. In this pair, 7 is less than 12 so we swap these elements. Now 7 is at its position with
respect to the pair 12 and 7.
Thus we have sorted the array up to index 2 as 12 is now at its final position. The element
19 is already at its final position. Note that here in the bubble sort, we are not using additional
storage (array). Rather, we are rearranging the elements within the same array. Thus bubble sort is also
an in place algorithm. Now as index 2 and 3 have their final values, we do the swap process up to
the index 1. Here, the first pair is 5 and 7 and in this pair, we need no swapping as 5 is less than 7
and is at its position (i.e. before 7). Thus 7 is also at its final position and the array is sorted.

Following is the code of bubble sort algorithm in C++.


void bubbleSort(int *arr, int N)
{
int i, temp, bound = N-1;
int swapped = 1;
while (swapped > 0 )
{
swapped = 0;
for(i=0; i < bound; i++)
if ( arr[i] > arr[i+1] )
{

temp = arr[i];
arr[i] = arr[i+1];
arr[i+1] = temp;
swapped = i;
}
bound = swapped;
}
}
In line with the previous two sort methods, the bubbleSort method also takes an array and
the size of the array as arguments. The variables i, temp, bound and swapped are declared in the
function. We initialize the variable bound with N-1. This N-1 is our upper limit for the swapping
process. The outer loop, that is the while loop, executes as long as swapping is being done. In the
loop, we initialize the swapped variable with zero. When it is not changed in the for loop, it
means that the array is now in sorted form and we exit the loop. The inner for loop executes from
zero to bound-1.
In this loop, the if statement compares the values at index i and i+1. If the element at index i
(the element on the left side in the array) is greater than the element at index i+1 (the element on
the right side), we swap these elements. We assign the value of i to the swapped variable; a value
greater than zero indicates that swapping has been done. After the for loop, we put the value of the
swapped variable into bound to record that up to this index, swapping has taken place. After the for loop,
if the value of swapped is not zero, the while loop will continue execution. Thus the while loop
will continue as long as swapping is taking place.
Now let’s see the time complexity of bubble sort algorithm.

3.4.2 BUBBLE SORT ANALYSIS


In this algorithm, we see that there is an outer loop and an inner loop in the code. The
outer loop executes N times, as it has to pass through the whole array. Then the inner loop
executes for N times at first, then for N-1 and for N-2 times. Thus its range decreases with each
of the iterations of the outer loop. In the first iteration, we do the swapping over up to N elements and,
as a result, the largest element comes to the last position. The next iteration passes through the N-
1 elements.
Thus the part of the array in which swapping is done decreases after each iteration. At
the end, there remains only one element where no swapping is required. Now if we sum up these
iterations, i.e. 1 + 2 + 3 + ... + (N-1) + N, the summation becomes N(1 + N)/2 = O(N²). In this
expression, the term N² dominates as the value of N increases, and the other term becomes
negligible compared to N². Thus when the value of N increases, the time complexity of this
algorithm grows in proportion to N².

3.5 QUICK SORT

Quick sort (like merge sort) is a divide and conquer algorithm: it works by creating two
problems of half size, solving them recursively, then combining the solutions to the small
problems to get a solution to the original problem. However, quick sort does more work than
merge sort in the "divide" part, and is thus able to avoid doing any work at all in the "combine"
part!

The idea is to start by partitioning the array: putting all small values in the left half and
putting all large values in the right half. Then the two halves are (recursively) sorted. Once that's
done, there's no need for a "combine" step: the whole array will be sorted! Here's a picture that
illustrates these ideas:

The key question is how to do the partitioning? Ideally, we'd like to put exactly half of
the values in the left part of the array, and the other half in the right part; i.e., we'd like to put all
values less than the median value in the left and all values greater than the median value in the
right. However, that requires first computing the median value (which is too expensive). Instead,
we pick one value to be the pivot, and we put all values less than the pivot to its left, and all
values greater than the pivot to its right (the pivot itself is then in its final place).


Here's the algorithm outline:

1. Choose a pivot value.


2. Partition the array (put all value less than the pivot in the left part of the array, then the
pivot itself, then all values greater than the pivot).

3. Recursively, sort the values less than the pivot.

4. Recursively, sort the values greater than the pivot.

Note that, as for merge sort, we need an auxiliary method with two extra parameters -- low
and high indexes to indicate which part of the array to sort. Also, although we could "recurse" all
the way down to a single item, in practice, it is better to switch to a sort like insertion sort when
the number of items to be sorted is small (e.g., 20).

Now let's consider how to choose the pivot item. (Our goal is to choose it so that the "left
part" and "right part" of the array have about the same number of items -- otherwise we'll get a
bad runtime).

An easy thing to do is to use the first value -- A[low] -- as the pivot. However, if A is already
sorted this will lead to the worst possible runtime, as illustrated below:

In this case, after partitioning, the left part of the array is empty, and the right part contains
all values except the pivot. This will cause O(N) recursive calls to be made (to sort from 0 to N-
1, then from 1 to N-1, then from 2 to N-1, etc.). Therefore, the total time will be O(N²).

Another option is to use a random-number generator to choose a random item as the pivot.
This is OK if you have a good, fast random-number generator.

A simple and effective technique is the "median-of-three": choose the median of the values in
A[low], A[high], and A[(low+high)/2]. Note that this requires that there be at least 3 items in the
array, which is consistent with the note above about using insertion sort when the piece of the
array to be sorted gets small.

Once we've chosen the pivot, we need to do the partitioning. (The following assumes that the
size of the piece of the array to be sorted is at least 3.) The basic idea is to use two "pointers"
(indexes) left and right. They start at opposite ends of the array and move toward each other until
left "points" to an item that is greater than the pivot (so it doesn't belong in the left part of the
array) and right "points" to an item that is smaller than the pivot. Those two "out-of-place" items
are swapped, and we repeat this process until left and right cross:

1. Choose the pivot (using the "median-of-three" technique); also, put the smallest of the 3
values in A[low], put the largest of the 3 values in A[high], and put the pivot in A[high-1].
(Putting the smallest value in A[low] prevents "right" from falling off the end of the
array in the following steps.)
2. Initialize: left = low+1; right = high-2

3. Use a loop with the condition:


while (left <= right)
The loop invariant is:
all items in A[low] to A[left-1] are <= the pivot
all items in A[right+1] to A[high] are >= the pivot
Each time around the loop:
left is incremented until it "points" to a value > the pivot
right is decremented until it "points" to a value < the pivot
if left and right have not crossed each other,
then swap the items they "point" to.
4. Put the pivot into its final place.

Here's the actual code for the partitioning step (the reason for returning a value will be clear
when we look at the code for quick sort itself):


private static int partition(Comparable[] A, int low, int high) {


// precondition: A.length >= 3

Comparable pivot = medianOfThree(A, low, high); // this does step 1

int left = low+1, right = high-2;
while ( left <= right ) {
while (A[left].compareTo(pivot) < 0) left++;
while (A[right].compareTo(pivot) > 0) right--;
if (left <= right) {
swap(A, left, right);
left++;
right--;
}
}
swap(A, left, high-1); // step 4
return right;
}
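The helper medianOfThree used above is not listed in these notes. The following sketch illustrates the idea on a plain int array (the code above works on Comparable objects instead): it orders A[low], A[mid] and A[high], hides the pivot in A[high-1] as step 1 requires, and returns the pivot value.

/* Sketch of median-of-three pivot selection on an int array.
   Afterwards A[low] <= pivot <= A[high] and the pivot itself sits in A[high-1]. */
static void swap_int(int A[], int i, int j)
{
    int t = A[i]; A[i] = A[j]; A[j] = t;
}

int median_of_three(int A[], int low, int high)
{
    int mid = (low + high) / 2;
    if (A[mid]  < A[low]) swap_int(A, mid,  low);   /* ensure A[low] <= A[mid]              */
    if (A[high] < A[low]) swap_int(A, high, low);   /* ensure A[low] <= A[high]             */
    if (A[high] < A[mid]) swap_int(A, high, mid);   /* ensure A[mid] <= A[high]             */
    swap_int(A, mid, high - 1);                     /* hide the median (pivot) in A[high-1] */
    return A[high - 1];
}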
After partitioning, the pivot is in A[right+1], which is its final place; the final task is to
sort the values to the left of the pivot, and to sort the values to the right of the pivot. Here's the
code for quick sort (so that we can illustrate the algorithm, we use insertion sort only when the
part of the array to be sorted has less than 3 items, rather than when it has less than 20 items):

public static void quickSort(Comparable[] A) {


quickAux(A, 0, A.length-1);
}
private static void quickAux(Comparable[] A, int low, int high) {
if (high-low < 2) insertionSort(A, low, high);
else {
int right = partition(A, low, high);
quickAux(A, low, right);
quickAux(A, right+2, high);
}}

Note: It is important to handle duplicate values efficiently. In particular, it is not a good


idea to put all values strictly less than the pivot into the left part of the array, and all values
greater than or equal to the pivot into the right part of the array. The code given above for
partitioning handles duplicates correctly at the expense of some "extra" swaps when both left and
right are "pointing" to values equal to the pivot.

Here's a picture illustrating quick sort:


What is the time for Quick Sort?
 If the pivot is always the median value, then the calls form a balanced binary tree (like
they do for merge sort).
 In the worst case (the pivot is the smallest or largest value) the calls form a "linear" tree.
 In any case, the total work done at each level of the call tree is O(N) for partitioning.
So the total time is:
 worst-case: O(N²)
 in practice: O(N log N)
Note that quick sort's worst-case time is worse than merge sort's. However, an advantage of
quick sort is that it does not require extra storage, as merge sort does.

3.6 INSERTION SORT

The idea behind insertion sort is:


1. Put the first 2 items in correct relative order.
2. Insert the 3rd item in the correct place relative to the first 2.

3. Insert the 4th item in the correct place relative to the first 3.

4. etc.


As for selection sort, a nested loop is used; however, a different invariant holds: after the ith
time around the outer loop, the items in A[0] through A[i-1] are in order relative to each other
(but are not necessarily in their final places). Also, note that in order to insert an item into its
place in the (relatively) sorted part of the array, it is necessary to move some values to the right
to make room.

Here's the code:

public static void insertionSort(Comparable[] A) {


int k, j;
Comparable tmp;
int N = A.length;

for (k = 1; k < N; k++) {


tmp = A[k];
j = k - 1;
while ((j >= 0) && (A[j].compareTo(tmp) > 0)) {
A[j+1] = A[j]; // move one value over one place to the right
j--;
}
A[j + 1] = tmp; // insert kth value in correct place relative to previous
// values
}
}

Here's a picture illustrating how insertion sort works on the same array used above for
selection sort:

What is the time complexity of insertion sort? Again, the inner loop can execute a different
number of times for every iteration of the outer loop. In the worst case:


 1st iteration of outer loop: inner executes 1 time


 2nd iteration of outer loop: inner executes 2 times

 3rd iteration of outer loop: inner executes 3 times

 ...

 N-1st iteration of outer loop: inner executes N-1 times

So we get:

1 + 2 + ... + N-1

which is still O(N²).

3.7 HEAP SORT


3.7.1 HEAP REVIEW
Remember that a heap is a way to implement a priority queue.
The property of a priority queue is:
 when you get an item it is the one with the highest priority
Heaps have same complexity as a balanced search tree but:
 they can easily be kept in an array
 they are much simpler than a balanced search tree
 they are cheaper to run in practice
A heap is a binary tree that has special structure compared to a general binary tree:
1. The root is greater than any value in its subtrees
 this means the highest priority is at the root
 this is less information than a BST and this is why it can be easier and faster
2. It is a complete tree
 the height is always log(n) so all operations are O(log(n))
3.7.2 HEAP SORT


If you have values in a heap and remove them one at a time, they come out in (reverse)
sorted order. Since each heap removal has worst-case complexity O(log(n)), it takes O(nlog(n))
to remove all n values in sorted order.
There are a few areas that we want to make this work well:
 how do we form the heap efficiently?
 how can we use the input array to avoid extra memory usage?
 how do we get the result in the normal sorted order?
If we achieve all of this, we have a worst-case O(nlog(n)) sort that does not use extra memory,
which is theoretically the best possible for a comparison sort.
The steps of the heap sort algorithm are:
1. Use data to form a heap
2. remove highest priority item from heap (largest)
3. reform heap with remaining data
You repeat steps 2 & 3 until you finish all the data.
You could do step 1 by inserting the items one at a time into the heap:
 This would be O(nlog(n)). It turns out we can do it in O(n). This does not change the overall
complexity, but it is more efficient.
 You would have to modify the normal heap implementation to avoid needing a second
array.
Instead we will enter all values and make it into a heap in one pass.
As with other heap operations, we first make it a complete binary tree and then fix up so the
ordering is correct. We have already seen that there is a relationship between a complete binary
tree and an array.
Our standard sorting example becomes:


Figure 3.16: Binary Tree.

Now we need to get the ordering correct.


It will work by letting the smaller values percolate down the tree.
To make it into a heap you use an algorithm that fixes the lower part of the tree and works its
way toward the root (a short code sketch follows this list):
 Go from the lowest, rightmost parent (non-leaf) and proceed to the left. When you finish
one level, go to the next, starting again from the right.
 At each node, percolate the item down to its proper place in this part of the tree, i.e., the
subheap.
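The sketch below (in Python; the names percolate_down and build_heap are my own, not from the text) shows this bottom-up construction for a max-heap stored in an array:

def percolate_down(a, i, size):
    # Push a[i] down until the max-heap property holds in its subtree.
    while 2 * i + 1 < size:                  # while node i has at least a left child
        child = 2 * i + 1
        if child + 1 < size and a[child + 1] > a[child]:
            child += 1                       # pick the larger of the two children
        if a[i] >= a[child]:
            break                            # this subtree is already a heap
        a[i], a[child] = a[child], a[i]      # swap the smaller parent down
        i = child

def build_heap(a):
    # Turn an arbitrary array into a max-heap in O(n) time.
    for i in range(len(a) // 2 - 1, -1, -1): # last parent is at index n//2 - 1
        percolate_down(a, i, len(a))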
Here is how the example goes:


Figure 3.17: Binary Heap.


This example has very few swaps. In some cases you have to percolate a value down by
swapping it with several children.
The Weiss book has the details showing that this is worst-case O(n) complexity. It isn't
O(nlog(n)) because the cost of each step is proportional to the height of the subtree currently
being considered, and most of the nodes are roots of subtrees with small height. For example,
about half the nodes have no children (they are leaves).
Now that we have a heap, we just remove the items one after another.
The only new twist here is to keep the removed item in the space of the original array. To
do this you swap the largest item (at root) with the last item (lower right in heap). In our example
this gives:

Figure 3.18: Heap-Sorted Values


The last value of 5 is no longer in the heap.
Now let the new value at the root percolate down to where it belongs.

Figure 3.19: Heap-Sorted Values


Now repeat with the new root value (by chance it happens to be 5 again):

And keep continuing:


Figure 3.20: Heap-Sorted Values


Figure 3.21: Heap-Sorted List


3.7.3 HEAP COMPLEXITY

The part just shown is very similar to removal from a heap, which is O(log(n)). You do it n-1
times, so it is O(nlog(n)). The later steps are cheaper but, for the reverse reason from the building
of the heap, most of them cost log(n), so this part is O(nlog(n)) overall. The build part was O(n), so
it does not dominate. For the whole heap sort you get O(nlog(n)).
There is no extra memory except a few for local temporaries.
Thus, we have finally achieved a comparison sort that uses no extra memory and is
O(nlog(n)) in the worst case.
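Putting the pieces together, here is a minimal heap sort sketch in Python (it reuses the illustrative build_heap and percolate_down helpers sketched earlier; the swap-to-the-end step mirrors Figures 3.18 to 3.21):

def heap_sort(a):
    build_heap(a)                         # O(n) bottom-up construction
    for end in range(len(a) - 1, 0, -1):  # n-1 removals, each O(log n)
        a[0], a[end] = a[end], a[0]       # move the current maximum into its final place
        percolate_down(a, 0, end)         # restore the heap in a[0..end-1]

# Example: a = [5, 3, 8, 9, 7]; heap_sort(a) leaves a == [3, 5, 7, 8, 9], in place.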
In many cases people still use quick sort because it also uses no extra memory and is usually
O(nlog(n)). Quick sort runs faster than heap sort in practice, and its worst case of O(n²) is rarely
seen in practice.

3.8 HASHING
Hashing is an algorithmic procedure and a methodology; it is not a new data
structure, but a way of using an existing one (an array) so that the table operations find, insert
and remove run in constant time on average. Here we are talking about algorithms and
procedures rather than data structures, so we will discuss the relevant strategies and
methodologies; hashing is one of them.
A hash function converts a number in a large range into a number in a smaller range. This
smaller range corresponds to the index numbers in an array.
arrayIndex = hugeNumber % arraySize

Figure 3.22: Hash Function


3.8.1 HASH FUNCTIONS

A good hash function is simple so that it can be computed quickly. A perfect hash
function maps every key into a different table location. Use a prime number as the array size.

Hashing Strings: We can convert short strings to key numbers by multiplying digit codes by
powers of a constant. With the letters coded as a = 1, b = 2, ..., z = 26, the three-letter word ace
could be turned into a number by calculating

key = 1 * 26² + 3 * 26¹ + 5 * 26⁰

This approach has the desirable attribute of involving all the characters in the input string.
The calculated key value can then be hashed into an array index in the usual way:
index = key % arraySize

def hashFunc1 ( key, arraySize ):
    hashVal = 0
    pow26 = 1
    for j in range (len(key) - 1, -1, -1):
        letter = ord (key[j]) - 96      # code the letters a..z as 1..26
        hashVal += pow26 * letter
        pow26 *= 26
    return hashVal % arraySize

The hashFunc1() method is not as efficient as it might be. Other than the character
conversion, there are two multiplications and an addition inside the loop. We can eliminate one
multiplication by using Horner's method:
a₄x⁴ + a₃x³ + a₂x² + a₁x + a₀ = (((a₄x + a₃)x + a₂)x + a₁)x + a₀


The hashFunc1() method cannot handle long strings because hashVal grows very large and,
in a language with fixed-size integers, would overflow. With Horner's method we can apply the
modulo (%) operator at each step of the calculation. This gives the same result as applying the
modulo operator once at the end, but it keeps the intermediate value small (notice that hashVal
always ends up less than the array size) and avoids the overflow.

def hashFunc2 ( key, arraySize ):
    hashVal = 0
    for j in range (len(key)):
        letter = ord (key[j]) - 96
        hashVal = (hashVal * 26 + letter) % arraySize
    return hashVal
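As a quick sanity check (the table size 60 is an arbitrary choice for illustration), both versions map the word ace to the same index:

# key for "ace" is 1*26**2 + 3*26 + 5 = 759, and 759 % 60 == 39
print(hashFunc1("ace", 60))   # 39
print(hashFunc2("ace", 60))   # 39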

3.9 COLLISION RESOLUTION TECHNIQUES


3.9.1 COLLISION
An array into which data is inserted using a hash function is called a hash table. Collision
occurs when two keys map to the same index. Solutions to collision:
1. Open Addressing
2. Separate Chaining

3.9.2 OPEN ADDRESSING


When a data item cannot be placed at the index calculated by the hash function, another
location in the array is sought.
 Linear Probing
 Quadratic Probing

 Double Hashing


3.9.3 SEPARATE CHAINING

A data item's key is hashed to the index in the usual way, and the item is inserted into the
linked list at that index. Other items that hash to the same index are simply added to the linked
list.

3.10 SEPARATE CHAINING


In Separate Chaining a data item's key is hashed to the index in the usual way, and the
item is inserted into the linked list at that index. Other items that hash to the same index are
simply added to the linked list. In separate chaining it is normal to put N or more items into an
N-cell array. Finding the initial cell takes fast O(1) time, but searching through a list takes time
proportional to the number of items on the list - O(m). In separate chaining the load factor can
rise above 1 without hurting performance very much. It is not important to make the table size a
prime number.
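A minimal sketch of separate chaining in Python (the class and method names are illustrative, not from the text; each cell of the array holds a list that acts as the chain):

class ChainedHashTable:
    def __init__(self, size=11):
        self.size = size
        self.slots = [[] for _ in range(size)]     # one chain per cell

    def _index(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        chain = self.slots[self._index(key)]
        for pair in chain:
            if pair[0] == key:                     # key already present: replace its value
                pair[1] = value
                return
        chain.append([key, value])                 # otherwise append to the chain

    def find(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        return None                                # not found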

3.10.1 BUCKETS
Another approach similar to separate chaining is to use an array at each location in the
hash table instead of a linked list. Such arrays are called buckets. This approach is not as
efficient as the linked list approach, however, because of the problem of choosing the size of the
buckets. If they are too small they may overflow, and if they are too large they waste memory.

3.11 OPEN ADDRESSING


3.11.1 LINEAR PROBING
In Linear Probing we search sequentially for vacant cells. As more items are inserted in
the array clusters grow larger. It is not a problem when the array is half full, and still not bad
when it is two-thirds full. Beyond this, however, the performance degrades seriously as the
clusters grow larger and larger. The performance is determined by the Load Factor. The Load
Factor is the ratio of the number of items in a table to the table's size.
loadFactor = nItems / arraySize
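A sketch of insertion with linear probing (illustrative names; integer keys, with None marking an empty cell; deletion is omitted because it needs special handling):

def linear_probe_insert(table, key):
    size = len(table)
    index = key % size
    for _ in range(size):                 # at most size probes
        if table[index] is None:
            table[index] = key            # found a vacant cell
            return index
        index = (index + 1) % size        # step one cell to the right, wrapping around
    raise RuntimeError("hash table is full")

# Example: inserting 15, 30 and 45 into a table of size 15 places them in cells 0, 1 and 2,
# because all three hash to 0 and linear probing searches sequentially for vacant cells.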


3.11.2 QUADRATIC PROBING


If x is the position in the array where the collision occurs, in Quadratic Probing the cells
probed are x + 1, x + 4, x + 9, x + 16, and so on; the offsets from x are the successive squares
1², 2², 3², .... The problem with Quadratic Probing is that it gives rise to secondary clustering.
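A small sketch of the quadratic probe sequence (the generator and the example values are illustrative):

def quadratic_probes(x, table_size, limit=8):
    # Yield the first few cells probed after a collision at index x.
    for i in range(1, limit + 1):
        yield (x + i * i) % table_size    # offsets are 1, 4, 9, 16, ...

# For x = 3 and a table of size 11 this yields 4, 7, 1, 8, 6, 6, 8, 1:
# the repeated cells show why quadratic probing may not reach every cell.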

3.11.3 DOUBLE HASHING OR REHASHING


Hash the key a second time, using a different hash function, and use the result as the step
size. For a given key the step size remains constant throughout a probe, but it is different for
different keys. The secondary hash function must not be the same as the primary hash function
and it must not output 0 (zero).
stepSize = constant - ( key % constant )
The constant is a prime number smaller than the array size. Double hashing requires that
the size of the hash table is a prime number. Using a prime number as the array size makes it
impossible for any step size to divide it evenly, so the probe sequence will eventually check every
cell. Suppose, for example, that the array size is 15 (indices 0 to 14) and we hash the numbers
15, 30, 45, 60, 75, 90, 105; each of these hashes to an initial index of 0 with a step size of 5. The
probe sequence is then 0, 5, 10, 0, 5, 10, and so on, repeating endlessly.
If instead the array size is 13 and the numbers are 13, 26, 39, 42, 65, 78, 91, the step sizes
are 2, 4, 1, 3, 5, 2, 4. Even if the step size for some key were again 5, the probe sequence starting
from index 0 would be 0, 5, 10, 2, 7, 12, 4, 9, 1, 6, 11, 3, 8: every cell gets visited, so if there is
even one empty cell, the probe will find it.
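A sketch of insertion with double hashing, following the stepSize formula above (the constant 5 and the example keys are illustrative):

def double_hash_insert(table, key, constant=5):
    size = len(table)                      # should be prime so every cell can be reached
    index = key % size
    step = constant - (key % constant)     # secondary hash; never 0
    for _ in range(size):
        if table[index] is None:
            table[index] = key
            return index
        index = (index + step) % size      # probe using the key's own step size
    raise RuntimeError("hash table is full")

# With size 13, key 26 starts at index 0 with step 4 while key 39 starts at index 0 with
# step 1, so keys that collide initially follow different probe sequences.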

3.12 MULTIPLE HASHING


Hash functions are typically defined by the way they create hash values from data. There
are two main methodologies a hash algorithm can implement:
3.12.1 ADDITIVE AND MULTIPLICATIVE HASHING
This is where the hash value is constructed by traversing the data and continually
incrementing an initial value by a value calculated from each element within the data. The
calculation done on the element value is usually a multiplication by a prime number.

Figure 3.23: Hash Functions
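A small sketch of this additive/multiplicative pattern in Python (the prime 31 is a common but arbitrary choice; the text does not specify a particular constant):

def multiplicative_hash(data, table_size):
    h = 0
    for ch in data:
        h = (h * 31 + ord(ch)) % table_size   # multiply the running value by a prime, add the element
    return h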

3.12.2 ROTATIVE HASHING


This is the same as additive hashing in that every element in the data string is used to
construct the hash, but unlike additive hashing the values are put through a process of bitwise
shifting, usually a combination of both left and right shifts, with shift amounts that, as before,
are prime. The result of each step is added to an accumulating count, and the final accumulation
is passed back as the hash value.

Figure 3.24: Rotative Hashing
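A sketch of a rotation-style hash (the shift amounts 5 and 2 are illustrative; Python integers are unbounded, so the accumulator is masked to 32 bits):

def rotating_hash(data, table_size):
    h = 0
    for ch in data:
        h = (h << 5) ^ (h >> 2) ^ ord(ch)     # mix each element in with left and right shifts
        h &= 0xFFFFFFFF                       # keep the accumulator at 32 bits
    return h % table_size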

3.12.3 HASH FUNCTIONS AND PRIME NUMBERS


There isn't much rigorous mathematical work that definitively proves the relationship
between prime numbers and pseudo-random number generators (PRNGs). Nevertheless, the best
results have been found to include the use of prime numbers. PRNGs are currently studied as
statistical entities rather than as deterministic entities, so any analysis can only bear witness to
the overall result rather than determine how or why the result came into being. If a more discrete
analysis could be carried out, one could better understand which prime numbers work better and
why, and at the same time why other prime numbers don't work as well. Answering these
questions with stable, repeatable proofs would better equip one for designing better PRNGs and
hence, eventually, better hash functions.

3.12.4 VARIOUS FORMS OF HASHING


Hashing as a tool to associate one set or bulk of data with an identifier has many different
forms of application in the real world. Below are some of the more common uses of hash
functions.

3.12.4.1 STRING HASHING


Used in the area of data storage access, mainly for indexing of data and as a structural
back end to associative containers (i.e., hash tables).

3.12.4.2 CRYPTOGRAPHIC HASHING


Used for data/user verification and authentication. A strong cryptographic hash function
has the property that it is very difficult to reverse the result of the hash and hence reproduce the
original piece of data. Cryptographic hash functions are used to hash users' passwords, so that
the hash of a password, rather than the password itself, is stored on the system. Cryptographic
hash functions can also be seen as irreversible compression functions: being able to represent
large quantities of data with a single ID, they are useful for checking whether or not data has
been tampered with, and they can also be used to sign data in order to prove the authenticity of
a document via other cryptographic means.

3.12.4.3 GEOMETRIC HASHING


This form of hashing is used in the field of computer vision for the detection of classified
objects in arbitrary scenes.


The process involves initially selecting a region or object of interest. From there using
affine invariant feature detection algorithms such as the Harris corner detector (HCD), Scale-
Invariant Feature Transform (SIFT) or Speeded-Up Robust Features (SURF), a set of affine
features is extracted which is deemed to represent that object or region. This set is sometimes
called a macro-feature or a constellation of features. Depending on the nature of the features
detected and the type of object or region being classified it may still be possible to match two
constellations of features even though there may be minor disparities (such as missing or outlier
features) between the two sets. The constellations are then said to be the classified set of features.
A hash value is computed from the constellation of features. This is typically done by
first defining a space in which the hash values are intended to reside; the hash value in this case
is a multidimensional value normalized for the defined space. Coupled with the process for
computing the hash value, another process is needed that determines the distance between two
hash values. A distance measure is required, rather than a deterministic equality operator,
because of the possible disparities between the constellations that went into calculating the hash
values. Also, owing to the non-linear nature of such spaces, the simple Euclidean distance metric
is essentially ineffective, so the process of automatically determining a distance metric for a
particular space has become an active field of research in academia.
Typical examples of geometric hashing include the classification of various kinds of
automobiles, for the purpose of re-detection in arbitrary scenes. The level of detection can be
varied from just detecting a vehicle, to a particular model of vehicle, to a specific vehicle.

3.12.4.4 BLOOM FILTERS


A Bloom filter allows the "state of existence" of a large set of possible values to be
represented with a much smaller amount of memory than the summed size of the values. In
computer science this is known as a membership query and is a core concept in associative
containers.
The Bloom filter achieves this through the use of multiple distinct hash functions and also
by allowing the result of a membership query for the existence of a particular value to have a
certain probability of error. The guarantee a Bloom filter provides is that for any membership
query there will never be any false negatives, however there may be false positives. The false


positive probability can be controlled by varying the size of the table used for the Bloom filter
and also by varying the number of hash functions.
Subsequent research in the area of hash functions, hash tables and Bloom filters by
Mitzenmacher et al. suggests that for most practical uses of such constructs, the entropy in the
data being hashed contributes to the entropy of the hash functions. This leads on to theoretical
results concluding that an optimal Bloom filter (one which provides the lowest false positive
probability for a given table size, or vice versa) with a user-defined false positive probability
can be constructed with at most two distinct, pairwise independent hash functions, greatly
increasing the efficiency of membership queries.
Bloom filters are commonly found in applications such as spell-checkers, string matching
algorithms, network packet analysis tools and network/internet caches.
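A minimal Bloom filter sketch in Python (the bit-array size and the salted use of Python's built-in hash as the family of hash functions are illustrative choices, not from the text):

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive several hash values for the same item by salting it with a seed.
        for seed in range(self.num_hashes):
            yield hash((seed, item)) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not present"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

# bf = BloomFilter(); bf.add("cat")
# bf.might_contain("cat")  -> True (no false negatives for items that were added)
# bf.might_contain("dog")  -> usually False; occasionally a false positive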


3.13 QUESTION BANK

PART A -2 Marks

1. What is meant by sorting?


2. What are the two main classifications of sorting based on the source of data?
3. What is meant by external sorting?
4. What is meant by internal sorting?
5. What are the various factors to be considered in deciding a sorting algorithm?
6. What is the main idea behind insertion sort?
7. What is the main idea behind insertion sort?
8. What is the purpose of quick sort?
9. What is the advantage of quick sort?
10. What is the average efficiency of heap sort?
11. Define segment?
12. When is a sorting method said to be stable?
13. When can we use insertion sort?
14. Define max heap?
15. Define min heap?
16. Define balanced search tree.
17. Define AVL tree.
18. What are the drawbacks of AVL trees?
19. What is a heap?
20. What is the main use of heap?
21. Give three properties of heaps?
22. Give the main property of a heap that is implemented as an array.
23. What are the two alternatives that are used to construct a heap?
24. Give the pseudocode for Bottom-up heap construction.
25. What is the algorithm to delete the root’s key from the heap?
26. Who discovered heapsort and how does it work?
27. What is a min-heap?
28. Define splay tree.
29. Define B-tree?
30. Define Priority Queue?
31. Define Binary Heap?
32. Explain array implementation of Binary Heap.
33. Define Max-heap.
34. Explain AVL rotation.
35. What are the different types of Rotation in AVL Tree?


PART B -16 Marks

1. Show the result of inserting 2,4,1,5,9,3,6,7 in to an initially empty AVL Tree.


2. Write a procedure to implement AVL single and double rotations.
3. Write a routine to perform insertion and deletion in B-Tree
4. Explain heap sort with an example?
5. Explain quick sort with an example?
6. Explain Insertion sort with an example.
7. Explain Bubble sort with an example.
8. Explain Hashing techniques.
9. Explain collision resolution techniques.

