MC4101 - ADSA
UNIT-II
BALANCED SEARCH TREES,
SORTING AND INDEXING
TECHNICAL TERMS
1. AVL Tree
An empty tree is height balanced. If T is a non-empty binary tree with TL and TR as its
left and right subtrees, then T is height balanced if
i) TL and TR are height balanced and
ii) |hL - hR| ≤ 1,
where hL and hR are the heights of TL and TR respectively.
2. Balanced trees
Balanced trees have the structure of binary trees and obey binary search tree properties.
Apart from these properties, they have some special constraints, which differ from one data
structure to another. However, these constraints are aimed only at reducing the height of the tree,
because this factor determines the time complexity.
E.g., AVL trees, splay trees.
3. AVL rotations
Let A be the nearest ancestor of the newly inserted node which has the balance factor ±2.
Then the rotations can be classified into the following four categories:
Left-Left: The newly inserted node is in the left subtree of the left child of A.
Right-Right: The newly inserted node is in the right subtree of the right child of A.
Left-Right: The newly inserted node is in the right subtree of the left child of A.
Right-Left: The newly inserted node is in the left subtree of the right child of A.
4. Heap
A heap is defined to be a complete binary tree with the property that the value of each
node is less than or equal to the values of its child nodes, if they exist. The root node of the
heap therefore has the smallest value in the tree.
5. B-tree of order M
A B-tree of order M is a tree that is not binary, with the following structural properties:
• The root is either a leaf or has between 2 and M children.
• All non-leaf nodes (except the root) have between ⌈M/2⌉ and M children.
• All leaves are at the same depth.
6. Applications of B-tree
Database implementation
Indexing on non-primary-key fields
7. Hashing
Hashing is the transformation of a string of characters into a usually shorter, fixed-length
value or key that represents the original string. Hashing is used to index and retrieve
items in a database because it is faster to find an item using the short hashed key than
to find it using the original value.
8. Hash table
The hash table data structure is merely an array of some fixed size, containing the keys. A
key is a string with an associated value. Each key is mapped into some number in the
range 0 to tablesize-1 and placed in the appropriate cell.
9. Hash function
A hash function is a key-to-address transformation which acts upon a given key to
compute the relative position of the key in an array. The hash function should be
simple to compute and should distribute the keys evenly. A simple hash function is
hash_key = key mod tablesize.
Separate chaining is a collision-resolution technique that keeps a list of all the elements that hash to
the same value. It is called separate chaining because each hash table element is a separate
chain (linked list). Each linked list contains all the elements whose keys hash to the same
index.
13. Probing
Probing is the process of getting the next available hash-table array cell.
15. Sorting
Ordering the data in an increasing or decreasing fashion according to some relationship
among the data item is called sorting.
The main idea of insertion sort is to insert, in the ith pass, the ith element into its rightful
place among A(1), A(2), ..., A(i).
21. Segment
When a large block of data is to be sorted, only a portion of the block or file is loaded
into the main memory of the computer, since it cannot hold the entire block. This small
portion of the file is called a segment.
LECTURE NOTES
3.1 AVL TREES
3.1.1 Introduction
Binary search trees are an excellent data structure to implement associative arrays, maps,
sets, and similar interfaces. The main difficulty, as discussed in the last lecture, is that they are
efficient only when they are balanced. Straightforward sequences of insertions can lead to highly
unbalanced trees with poor asymptotic complexity and unacceptable practical efficiency. For
example, if we insert n elements with keys that are in strictly increasing or decreasing order, the
complexity will be O(n²). On the other hand, if we can keep the height to O(log n), as it is for a
perfectly balanced tree, then the complexity is bounded by O(n log n).
The solution is to dynamically rebalance the search tree during insert or search
operations. We have to be careful not to destroy the ordering invariant of the tree while we
rebalance. Because of the importance of binary search trees, researchers have developed many
different algorithms for keeping trees in balance, such as AVL trees, red/black trees, splay trees,
or randomized binary search trees. They differ in the invariants they maintain (in addition to the
ordering invariant), and when and how the rebalancing is done.
In this lecture we use AVL trees, a simple and efficient data structure for maintaining
balance, and also the first such structure to have been proposed. It is named after its inventors,
G.M. Adelson-Velskii and E.M. Landis, who described it in 1962.
The height of a tree is the maximal number of nodes on a path from the root to a leaf: a tree
with a single node has height 1, and a balanced tree with three nodes has height 2. If we add one
more node to this last tree, it will have height 3. Alternatively, we can define height recursively by
saying that the empty tree has height 0, and the height of any node is one greater than the maximal
height of its two children. AVL trees maintain a height invariant (also sometimes called a balance
invariant): at every node, the heights of its left and right subtrees differ by at most 1.
If we insert a new element with a key of 14, the insertion algorithm for binary search
trees without rebalancing will put it to the right of 13.
Now the tree has height 4, and one path is longer than the others. However, it is easy to
check that at each node, the height of the left and right subtrees still differ only by one. For
example, at the node with key 16, the left subtree has height 2 and the right subtree has height 1,
which still obeys our height invariant.
Now consider another insertion, this time of an element with key 15.
This is inserted to the right of the node with key 14.
Figure 3.3 Height of Binary Search tree after inserting another element
All is well at the node labeled 14: the left subtree has height 0 while the right subtree has
height 1. However, at the node labeled 13, the left subtree has height 0, while the right subtree
has height 2, violating our invariant. Moreover, at the node with key 16, the left subtree has
height 3 while the right subtree has height 1, also a difference of 2 and therefore an invariant
violation.
We therefore have to take steps to rebalance the tree. We can see, without too much trouble,
that we can restore the height invariant if we move the node labeled 14 up and push the node
labeled 13 down and to the right, resulting in the following tree.
From the intervals we can see that the ordering invariants are preserved, as are the
contents of the tree. We can also see that it shifts some nodes from the right subtree to the left
subtree. We would invoke this operation if the invariants told us that we have to rebalance from
right to left. We implement this with some straightforward code. First, recall the type of trees
from last lecture. We do not repeat the function is_ordtree that checks if a tree is ordered.
struct tree {
elem data;
int height; /* height of the subtree rooted at this node, used for balancing */
struct tree* left;
struct tree* right;
};
typedef struct tree* tree;
bool is_ordtree(tree T);
The main point to keep in mind is to use (or save) a component of the input before
writing to it. We apply this idea systematically, writing to a location immediately after using it
on the previous line.
tree rotate_left(tree T)
//@requires is_ordtree(T);
//@requires T != NULL && T->right != NULL;
//@ensures is_ordtree(\result);
//@ensures \result != NULL && \result->left != NULL;
{
tree root = T->right;
T->right = root->left;
root->left = T;
return root;
}
The right rotation is the mirror image of the left rotation. In code:
tree rotate_right(tree T)
//@requires is_ordtree(T);
//@requires T != NULL && T->left != NULL;
//@ensures is_ordtree(\result);
//@ensures \result != NULL && \result->right != NULL;
{
tree root = T->left;
T->left = root->right;
root->right = T;
return root;
}
Consider a node whose left subtree has height h and whose right subtree has height h + 1, so
that the tree itself has height h + 2. If an insertion into the right subtree gives us a new right
subtree of height h + 2, it raises the height of the overall tree to h + 3, violating the height
invariant.
Since the new right subtree has height h + 2, either its right or its left subtree must be of
height h + 1 (and only one of them; think about why). If it is the right subtree, we are in the
situation depicted below on the left.
We fix this with a left rotation, the result of which is displayed to the right. In the second
case we consider, we once again insert into the right subtree, but now the left subtree of the right
subtree has height h + 1.
In that case, a left rotation alone will not restore the invariant (see Exercise 1). Instead,
we apply a so-called double rotation: first a right rotation at the right child, then a left rotation at the root.
When we do this we obtain the picture on the right, restoring the height invariant. There are two
additional symmetric cases to consider, if we insert the new element on the left.
We can see that in each of the possible cases where we have to restore the invariant, the
resulting tree has the same height h + 2 as before the insertion. Therefore, the height invariant
above the place where we just restored it will be automatically satisfied.
We can check the height invariant with a function is_balanced:
bool is_balanced(tree T) {
if (T == NULL) return true;
int h = T->height;
int hl = height(T->left);
int hr = height(T->right);
if (!(h == (hl > hr ? hl+1 : hr+1))) return false;
if (hl > hr+1 || hr > hl+1) return false;
return is_balanced(T->left) && is_balanced(T->right);
}
A tree is an AVL tree if it is both ordered and balanced.
bool is_avl(tree T) {
return is_ordtree(T) && is_balanced(T);
}
We use this, for example, in a utility function that creates a new leaf
from an element (which may not be null).
tree leaf(elem e)
//@requires e != NULL;
//@ensures is_avl(\result);
{
tree T = alloc(struct tree);
T->data = e;
T->height = 1;
T->left = NULL;
T->right = NULL;
return T;
}
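The helper functions height and fix_height used by the balancing code in this section are not shown in these notes; a minimal sketch, using the tree type above and the convention that the empty tree has height 0, is:

/* Height of a possibly empty tree; the empty tree has height 0. */
int height(tree T) {
  return T == NULL ? 0 : T->height;
}

/* Recompute T->height from the heights of its children, which must
   already be correct. */
void fix_height(tree T) {
  int hl = height(T->left);
  int hr = height(T->right);
  T->height = (hl > hr ? hl : hr) + 1;
}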
The function rebalance_right is called when the right subtree may have become too tall after an insertion:
tree rebalance_right(tree T)
//@requires T != NULL;
//@requires is_avl(T->left) && is_avl(T->right);
//@ensures is_avl(\result);
{
tree l = T->left;
tree r = T->right;
int hl = height(l);
int hr = height(r);
if (hr > hl+1) {
//@assert hr == hl+2;
if (height(r->right) > height(r->left)) {
//@assert height(r->right) == hl+1;
T = rotate_left(T);
//@assert height(T) == hl+2;
return T;
} else {
//@assert height(r->left) == hl+1;
/* double rotate left */
T->right = rotate_right(T->right);
T = rotate_left(T);
//@assert height(T) == hl+2;
return T;
}
} else { //@assert !(hr > hl+1);
fix_height(T);
return T;
}
}
Note that the preconditions are weaker than we would like. In particular, they do not
imply some of the assertions we have added in order to show the correspondence to the pictures.
Such assertions are nevertheless useful because they document expectations based on informal
reasoning we do behind the scenes. Then, if they fail, they may be evidence for some error in our
understanding, or in the code itself, which might otherwise go undetected.
In the fourth column we have run the experiment with plain binary search trees which do
not rebalance automatically. First of all, we see that they are much less efficient, and second we
see that their behavior with increasing size is difficult to predict, sometimes jumping
considerably and sometimes not much at all. In order to understand this behavior, we need to
know more about the order and distribution of keys that were used in this experiment. They were
strings, compared lexicographically. The keys were generated by counting integers upward and
then converting them to strings. The distribution of these keys is haphazard, but not random. For
example, if we start counting at 0, the keys are generated in the order
"0", "1", "2", ..., "9", "10", "11", "12", ...
Lexicographically, however, "10", "11", "12", ... sort between "1" and "2", so the first ten keys
arrive in ascending order but later keys are inserted between earlier ones. This kind of haphazard
distribution is typical of many realistic applications, and we see that
binary search trees without rebalancing perform quite poorly and unpredictably compared with
AVL trees.
3.2 B-TREES
3.2.1 INTRODUCTION
We have seen binary search trees before. When data volume is large and does not fit in
memory, an extension of the binary search tree to disk-based environment is the B-tree,
originally invented by Bayer and McCreight [1]. In fact, since the B-tree is always balanced (all
leaf nodes appear at the same level), it is an extension of the balanced binary search tree. Since
each disk access exchanges a whole block of information between memory and disk rather than a
few bytes, a node of the B-tree is expanded to hold more than two child pointers, up to the block
capacity. To guarantee worst-case performance, the B-tree requires that every node (except the
root) has to be at least half full. An exact-match query, insertion or deletion needs to access
O(log_B n) nodes, where B is the page capacity in number of child pointers, and n is the number
of objects.
The problem which the B-tree aims to solve is: given a large collection of objects, each
having a key and a value, design a disk-based index structure which efficiently supports query
and update.
Here the query of interest is the exact-match query: given a key k, locate the value of the
object with key = k. The update can be either an insertion or a deletion: insert a new object
into the index, or delete from the index an object with a given key.
Algorithm Query(pageID, k)
Input: pageID of a B-tree node, a key k to be searched.
Output: value of the object with key = k; NULL if no such object exists.
1. x = DiskRead(pageID).
2. if x is an index node
(a) If there is an object o in x s.t. o.key = k, return o.value.
(b) Find the child pointer x.child[i] whose key range contains k.
(c) return Query(x.child[i], k).
3. else
(a) If there is an object o in x s.t. o.key = k, return o.value. Otherwise, return NULL.
4. end if
As an example, consider a search query for k = 13 in the example B-tree of the figure. At node A, we
should follow the left sub-tree since k < 25. At node B, we should follow the third sub-tree since
10 < k < 16. Now we reach a leaf node F. An object with key = 13 is found in the node.
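As an in-memory illustration only (the algorithms above work on disk pages via DiskRead, whereas here the node layout, the field names and the order MAX_KEYS are simplifying assumptions), a B-tree node and the search routine might be sketched in C as follows:

#define MAX_KEYS 4                       /* illustrative order; a real node fills a disk page */

typedef struct BTreeNode BTreeNode;
struct BTreeNode {
    int num;                             /* number of keys currently stored       */
    int key[MAX_KEYS];                   /* keys, kept in ascending order         */
    int value[MAX_KEYS];                 /* value stored with each key            */
    int is_leaf;                         /* 1 if this node is a leaf              */
    BTreeNode *child[MAX_KEYS + 1];      /* child pointers (unused in a leaf)     */
};

/* Search for key k in the subtree rooted at x; return a pointer to its value,
   or NULL if no object with that key exists.  This mirrors algorithm Query,
   except that children are followed in memory instead of via DiskRead. */
int *btree_search(BTreeNode *x, int k) {
    while (x != NULL) {
        int i = 0;
        while (i < x->num && k > x->key[i]) i++;     /* find the key range   */
        if (i < x->num && k == x->key[i]) return &x->value[i];
        if (x->is_leaf) return NULL;                 /* nowhere left to look */
        x = x->child[i];                             /* descend              */
    }
    return NULL;
}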
Notice that in the Query algorithm, only the DiskRead function is called. The other
functions, such as DiskWrite, are not needed, as the algorithm does not modify the B-tree. Since
the query algorithm examines a single path from root to leaf, the complexity of the algorithm in
number of I/Os is O(log_B n), where n is the number of objects.
Algorithm Insert(root, k, v)
Input: root pageID of a B-tree, the key k and the value v of a new object.
Prerequisite: The object does not exist in the tree.
Action: Insert the new object into the B-tree.
1. x = DiskRead(root).
2. if x is full
(a) y = AllocatePage(), z = AllocatePage().
(b) Locate the middle object o_i stored in x. Move the objects to the left of o_i
into y. Move the objects to the right of o_i into z. If x is an index page, also
move the child pointers accordingly.
(c) x.child[1] = y.pageID, x.child[2] = z.pageID.
(d) DiskWrite(x); DiskWrite(y); DiskWrite(z).
3. end if
4. InsertNotFull(x, k, v).
Basically, the algorithm makes sure that the root page is not currently full, and then it calls the
InsertNotFull function to insert the object into the tree. If the root page x is full, the algorithm
will split it into two nodes y and z, and node x will be promoted to a higher level, thus increasing
the height of the tree.
This scenario is illustrated in Figure 3.11. Node x is a full root page. It contains three objects
and four child pointers. If we try to insert some record into the tree, the root node is split into two
nodes y and z. Originally, x contains x.num = 3 objects. The left object (key=6) is moved to a
new node y. The right object (key=16) is moved to a new node z.
Figure 3.11: Splitting the root node increases the height of the tree.
Algorithm InsertNotFull(x, k, v)
Input: an in-memory page x of a B-tree, the key k and the value v of a new object.
Prerequisite: page x is not full.
Action: Insert the new object into the sub-tree rooted by x.
1. if x is a leaf page
(a) Insert the new object into x, keeping objects in sorted order.
(b) DiskWrite(x).
2. else
(a) Find the child pointer x.child[i] whose key range contains k.
(b) w = DiskRead(x.child[i]).
(c) if w is full
i. y = AllocatePage().
ii. Locate the middle object o_j stored in w. Move the objects to the right of o_j into y. If w is an
index page, also move the child pointers accordingly.
iii. Move o_j into x. Accordingly, add a child pointer in x (to the right of o_j) pointing to y.
iv. DiskWrite(x); DiskWrite(y); DiskWrite(w).
v. If k < o_j.key, call InsertNotFull(w, k, v); otherwise, call InsertNotFull(y, k, v).
(d) else
InsertNotFull(w, k, v).
(e) end if
3. end if
Algorithm InsertNotFull examines a single path from root to leaf, and eventually inserts
the object into some leaf page. At each level, the algorithm follows the child pointer whose key
range contains the key of the new object (step 2a). If no node along the path is full, the algorithm
recursively calls itself on each of these nodes (step 2d) till the leaf level, where the object is
inserted into the leaf node (step 1).
Consider the other case when some node w along the path is full (step 2c). The node is
first split into two (w and y). The right half of the objects from w are moved to y, while the
middle object is pushed into the parent node. After the split, the key range of either w or y, but
not both, contains the key of the new object. A recursive call is performed on the correct node.
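Reusing the illustrative BTreeNode layout from the earlier search sketch, the core of step 2(c), which moves the right half of a full child into a new node and pushes the middle object up into the parent, might be sketched as follows (purely in memory, with no DiskWrite; the function name is illustrative):

#include <stdlib.h>

/* Split the full child x->child[i] into two half-full nodes and move its
   middle key/value up into x.  Assumes x itself is not full, as in
   InsertNotFull. */
void btree_split_child(BTreeNode *x, int i) {
    BTreeNode *w = x->child[i];
    BTreeNode *y = calloc(1, sizeof(BTreeNode));   /* new right sibling        */
    int mid = w->num / 2;                          /* index of the middle key  */

    /* Move the keys (and children) to the right of the middle key into y. */
    y->is_leaf = w->is_leaf;
    y->num = w->num - mid - 1;
    for (int j = 0; j < y->num; j++) {
        y->key[j]   = w->key[mid + 1 + j];
        y->value[j] = w->value[mid + 1 + j];
    }
    if (!w->is_leaf)
        for (int j = 0; j <= y->num; j++)
            y->child[j] = w->child[mid + 1 + j];

    /* Shift x's keys and children to make room, then push the middle key up. */
    for (int j = x->num; j > i; j--) {
        x->key[j]       = x->key[j - 1];
        x->value[j]     = x->value[j - 1];
        x->child[j + 1] = x->child[j];
    }
    x->key[i]   = w->key[mid];
    x->value[i] = w->value[mid];
    x->child[i + 1] = y;
    x->num++;
    w->num = mid;                                  /* w keeps the left half    */
}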
As an example, consider inserting an object with key=14 into the example B-tree. The
result is shown in the figure, where the child pointers that are followed are drawn thick. When we
examine the root node A, we follow the child pointer to B. Since B is full, we first split it into two,
by moving the right half of the objects (only one object in our case, with key=16) into a new node
B''. The child pointers to F and G are moved as well. Further, the previous middle object in B
(key=10) is moved to the parent node A. A new child pointer to B'' is also generated in A. Now,
since the key of the new object is 14, which is bigger than 10, we recursively call the algorithm on
B''. At this node, since 14 < 16, we recursively call the algorithm on node F. Since F is a leaf node, the
algorithm finishes by inserting the new object into F. The accessed disk pages are shown as
shadowed.
Algorithm Delete(x, k)
Input: an in-memory node x of a B-tree, the key k to be deleted.
Prerequisite: an object with key=k exists in the sub-tree rooted by x.
Action: Delete the object from the sub-tree rooted by x.
1. if x is a leaf page
(a) Delete the object with key=k from x.
(b) DiskWrite(x).
2. else if x does not contain the object with key=k
(a) Locate the child x.child[i] whose key range contains k.
(b) y = DiskRead(x.child[i]).
(c) if y is exactly half full
i. If the sibling node z immediately to the left (right) of y has at least one more object than
minimally required, add one more object to y by moving x.key[i] from x to y, and move the last
(first) object from z to x. If y is an index node, the last (first) child pointer in z is also moved to y.
ii. Otherwise, every immediate sibling of y is exactly half full. Merge y with an immediate sibling.
end if
(d) Delete(y, k).
3. else
(a) If the child y that precedes k in x has at least one more object than minimally required, find the
predecessor k' of k in the sub-tree rooted by y, recursively delete k' from the sub-tree, and
replace k with k' in x.
(b) Otherwise, y is exactly half full. We check the child z that immediately follows k in x. If z has
at least one more object than minimally required, find the successor k' of k in the sub-tree rooted
by z, recursively delete k' from the sub-tree, and replace k with k' in x.
(c) Otherwise, both y and z are half full. Merge them into one node and push k down to the new
node as well. Recursively delete k from this new node.
4. end if
Along the search path from the root to the node containing the object to be deleted, for
each node x we encounter, there are three cases. The simplest scenario is when x is a leaf node
(step 1 of the algorithm). In this case, the object is deleted from the node and the algorithm
returns. Note that there is no need to handle underflow. The reason is: if the leaf node is root,
there is only one node in the tree and it is fine if it has only a few objects; otherwise, the previous
recursive step has already guaranteed that x has at least one more object than minimally required.
Steps 2 and 3 of the algorithm correspond to two different cases of dealing with an index node.
For step 2, the index node x does not contain the object with key=k. Thus there exists a
child node y whose key range contains k. After we read the child node into memory (step 2b), we
will recursively call the Delete algorithm on the sub-tree rooted by y (step 2d).
However, before we do that, step 2(c) of the algorithm makes sure that y contains at least
one more object than half full. Suppose we want to delete 5 from the B-tree shown in the figure.
When we are examining the root node A, we see that child node B should be followed next. Since
B has two more objects than half full, the recursion goes to node B. In turn, since D has two more
objects than minimum occupancy, the recursion goes to node D, where the object can be
removed.
Let's examine another example. Still with the B-tree shown in the figure, suppose we
want to delete 33. The algorithm finds that the child node y = C is half full. One more object
needs to be incorporated into node C before a recursive call on C is performed.
There are two sub-cases. The first sub-case is when one immediate sibling z of node y has
at least one more object than minimally required. This case corresponds to step 2(c)i of the
algorithm. To handle this case, we drag one object down from x to y, and we push one object
from the sibling node up to x. As an example, the deletion of object 33 is shown in the figure.
Deleting an object with key=33 from the example B-tree: at node A, we examine
the right child. Since node C only had one object before, a new object was added to it in the
following way: the object with key=25 is moved from A to C, and the object with key=16 is
moved from B to A. Also, the child pointer pointing to G is moved from B to C.
Another sub-case is when all immediate siblings of y are exactly half full. In this case, we
merge y with one sibling. In our 2-3-4-tree example, an index node which is half full contains
one object. If we merge two such nodes together, we also drag an object from the parent node of
them down to the merged node. The node will then contain three objects, which is full but does
not overflow.
For instance, suppose we want to delete object 31 from the resulting tree. When we are
examining node x = C, we see that we need to recursively delete in the child node y = H. Now, both
immediate siblings of H are exactly half full. So we need to merge H with a sibling, say G.
Besides moving the remaining object 28 from H to G, we also should drag object 25 from the
parent node C to G. The figure is omitted for this case.
The third case is that node x is an index node which contains the object to be deleted.
Step 3 of algorithm Delete corresponds to this scenario. We cannot simply delete the object from
x, because we also need to decrement the number of child pointers by one. In the figure, suppose
we want to delete object with key=25, which is stored in index node C.
3.3 SORTING
Consider sorting the values in an array A of size N. Most sorting algorithms involve what
are called comparison sorts; i.e., they work by comparing values. Comparison sorts can never
have a worst-case running time less than O(N log N). Simple comparison sorts are usually
O(N²); the more clever ones are O(N log N).
Three interesting issues to consider when thinking about different sorting algorithms are:
Does an algorithm always take its worst-case time?
What happens on an already-sorted array?
How much space (other than the space for the array itself) is required?
1. Bubble sort
2. Quick sort
3. Insertion sort
4. Heap sort
Insertion sort has worst-case time O(N²). Quick sort is also O(N²) in the worst case, but
its expected time is O(N log N).
Consider, for example, applying bubble sort to the array 19, 5, 12, 7. First of all, we compare the
first pair, i.e. 19 and 5. As 5 is less than 19, we swap these elements. Now 5 is in its place and we
take the next pair. This pair is 19, 12 and not 12, 7. In this pair 12 is less than 19, so we swap 12 and
19. After this, the next pair is 19, 7. Here 7 is less than 19, so we swap them. Now 7 is in its place as
compared to 19, but it is not at its final position. The element 19 is at its final position. Now we
repeat the pair-wise swapping on the array from index 0 to 2, as the value at index 3 is in its final
position. So we compare 5 and 12. As 5 is less than 12, it is already in its place (that is, before 12)
and we need not swap them. Now we take the next pair, that is 12 and 7. In this pair, 7 is less than
12, so we swap these elements. Now 7 is in its position with respect to the pair 12 and 7.
Thus we have sorted the array up to index 2, as 12 is now at its final position. The element
19 is already at its final position. Note that here in bubble sort, we are not using additional
storage (an extra array). Rather, we are rearranging the elements within the same array. Thus
bubble sort is an in-place algorithm. Now, as indices 2 and 3 have their final values, we do the
swap process up to index 1. Here, the first pair is 5 and 7 and in this pair we need no swapping,
as 5 is less than 7 and is in its position (i.e. before 7). Thus 7 is also at its final position and the
array is sorted.
void bubbleSort(int arr[], int N)
{
    int i, temp;
    int bound = N - 1;    /* upper limit of the region where swapping is done */
    int swapped = 1;
    while (swapped > 0) {
        swapped = 0;
        for (i = 0; i < bound; i++) {
            if (arr[i] > arr[i+1]) {
                temp = arr[i];
                arr[i] = arr[i+1];
                arr[i+1] = temp;
                swapped = i;
            }
        }
        bound = swapped;
    }
}
In line with the previous two sort methods, the bubbleSort method also takes an array and
the size of the array as arguments. There are i, temp, bound and swapped variables declared in the
function. We initialize the variable bound with N-1. This N-1 is our upper limit for the swapping
process. The outer loop that is the while loop executes as long as swapping is being done. In the
loop, we initialize the swapped variable with zero. When it is not changed in the for loop, it
means that the array is now in sorted form and we exit the loop. The inner for loop executes from
zero to bound-1.
In this loop, the if statement compares the values at index i and i+1. If the element at index i
(the element on the left side in the array) is greater than the element at i+1 (the element on the
right side in the array), then we swap these elements. We assign the value of i to the swapped
variable; its being greater than zero indicates that swapping has been done. Then, after the for
loop, we put the value of the swapped variable into bound to record up to which index swapping
has taken place. After the for loop, if the value of swapped is not zero, the while loop will continue
execution. Thus the while loop will continue as long as swapping is taking place.
Now let's see the time complexity of the bubble sort algorithm.
In the first pass, the whole array of N elements is traversed, comparing adjacent elements and
swapping them when they are out of order; as a result the largest element comes to the last
position. The next iteration passes through the N-1 remaining elements.
Thus the part of the array in which swapping is being done decreases after each iteration. At
the end, there remains only one element for which no swapping is required. Now if we sum up
these iterations, i.e. 1 + 2 + 3 + ... + (N-1) + N, the summation becomes N(1 + N)/2 = O(N²). In
this expression, the term N² dominates as the value of N increases, and the linear term becomes
negligible in comparison. Thus when the value of N increases, the time complexity of this
algorithm grows in proportion to N².
Quick sort (like merge sort) is a divide and conquer algorithm: it works by creating two
problems of half size, solving them recursively, then combining the solutions to the small
problems to get a solution to the original problem. However, quick sort does more work than
merge sort in the "divide" part, and is thus able to avoid doing any work at all in the "combine"
part!
The idea is to start by partitioning the array: putting all small values in the left half and
putting all large values in the right half. Then the two halves are (recursively) sorted. Once that's
done, there's no need for a "combine" step: the whole array will be sorted! Here's a picture that
illustrates these ideas:
The key question is how to do the partitioning? Ideally, we'd like to put exactly half of
the values in the left part of the array, and the other half in the right part; i.e., we'd like to put all
values less than the median value in the left and all values greater than the median value in the
right. However, that requires first computing the median value (which is too expensive). Instead,
we pick one value to be the pivot, and we put all values less than the pivot to its left, and all
values greater than the pivot to its right (the pivot itself is then in its final place).
Note that, as for merge sort, we need an auxiliary method with two extra parameters -- low
and high indexes to indicate which part of the array to sort. Also, although we could "recurse" all
the way down to a single item, in practice, it is better to switch to a sort like insertion sort when
the number of items to be sorted is small (e.g., 20).
Now let's consider how to choose the pivot item. (Our goal is to choose it so that the "left
part" and "right part" of the array have about the same number of items -- otherwise we'll get a
bad runtime).
An easy thing to do is to use the first value -- A[low] -- as the pivot. However, if A is already
sorted this will lead to the worst possible runtime, as illustrated below:
In this case, after partitioning, the left part of the array is empty, and the right part contains
all values except the pivot. This will cause O(N) recursive calls to be made (to sort from 0 to N-
1, then from 1 to N-1, then from 2 to N-1, etc). Therefore, the total time will be O(N²).
Another option is to use a random-number generator to choose a random item as the pivot.
This is OK if you have a good, fast random-number generator.
A simple and effective technique is the "median-of-three": choose the median of the values in
A[low], A[high], and A[(low+high)/2]. Note that this requires that there be at least 3 items in the
array, which is consistent with the note above about using insertion sort when the piece of the
array to be sorted gets small.
Once we've chosen the pivot, we need to do the partitioning. (The following assumes that the
size of the piece of the array to be sorted is at least 3.) The basic idea is to use two "pointers"
(indexes) left and right. They start at opposite ends of the array and move toward each other until
left "points" to an item that is greater than the pivot (so it doesn't belong in the left part of the
array) and right "points" to an item that is smaller than the pivot. Those two "out-of-place" items
are swapped, and we repeat this process until left and right cross:
1. Choose the pivot (using the "median-of-three" technique); also, put the smallest of the 3
values in A[low], put the largest of the 3 values in A[high], and put the pivot in A[high-
1]. (Putting the smallest value in A[low] prevents "right" from falling off the end of the
array in the following steps.)
2. Initialize: left = low+1; right = high-2
Once the pivot has been chosen, the partitioning step itself can be written as a function that
returns a value; the reason for returning a value will be clear when we look at the code for quick
sort itself.
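The following is a sketch of such a partition function, written to follow the median-of-three steps above; the int array, the swap helper and the function names are assumptions of the sketch rather than the original listing from these notes.

/* Swap two array elements. */
static void swap(int A[], int i, int j) {
    int tmp = A[i]; A[i] = A[j]; A[j] = tmp;
}

/* Median-of-three partition of A[low..high] (at least 3 items).
   Returns the final index of the pivot. */
int partition(int A[], int low, int high) {
    int mid = (low + high) / 2;

    /* Order A[low], A[mid], A[high]; the median of the three becomes the pivot. */
    if (A[mid]  < A[low]) swap(A, mid, low);
    if (A[high] < A[low]) swap(A, high, low);
    if (A[high] < A[mid]) swap(A, high, mid);

    swap(A, mid, high - 1);                 /* park the pivot at A[high-1]        */
    int pivot = A[high - 1];

    int left = low + 1, right = high - 2;
    while (1) {
        while (A[left] < pivot)  left++;    /* stops at the pivot sentinel        */
        while (A[right] > pivot) right--;   /* stops at A[low], which is <= pivot */
        if (left >= right) break;
        swap(A, left, right);               /* both items were out of place       */
        left++; right--;
    }
    swap(A, left, high - 1);                /* put the pivot into its final place */
    return left;
}

The quick sort routine itself would then recurse on A[low..p-1] and A[p+1..high], where p is the returned pivot position, switching to a simple sort such as insertion sort when the piece becomes small, as noted above.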
Another simple comparison sort is insertion sort, which builds up a sorted prefix of the array one
item at a time:
1. Insert the 2nd item in the correct place relative to the 1st.
2. Insert the 3rd item in the correct place relative to the first 2.
3. Insert the 4th item in the correct place relative to the first 3.
4. etc.
As for selection sort, a nested loop is used; however, a different invariant holds: after the ith
time around the outer loop, the items in A[0] through A[i-1] are in order relative to each other
(but are not necessarily in their final places). Also, note that in order to insert an item into its
place in the (relatively) sorted part of the array, it is necessary to move some values to the right
to make room.
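A short sketch of insertion sort in C along these lines (the function name is illustrative):

/* Insertion sort: after the i-th pass, A[0..i] are in order relative to
   each other, though not necessarily in their final places. */
void insertionSort(int A[], int N) {
    for (int i = 1; i < N; i++) {
        int v = A[i];                     /* item to insert                    */
        int j = i;
        while (j > 0 && A[j - 1] > v) {
            A[j] = A[j - 1];              /* shift larger values to the right  */
            j--;
        }
        A[j] = v;                         /* drop the item into the open slot  */
    }
}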
Here's a picture illustrating how insertion sort works on the same array used above for
selection sort:
What is the time complexity of insertion sort? Again, the inner loop can execute a different
number of times for every iteration of the outer loop. In the worst case (an array in reverse
sorted order), the inner loop executes once on the first iteration of the outer loop, twice on the
second, and so on. So we get:
1 + 2 + ... + (N-1) = N(N-1)/2, which is O(N²).
If you have values in a heap and remove them one at a time, they come out in (reverse)
sorted order. Since each removal from a heap has worst-case complexity O(log n), removing all
n values in sorted order takes O(n log n).
There are a few issues we have to address to make this work well:
how do we form the heap efficiently?
how can we use the input array to avoid extra memory usage?
how do we get the result in the normal sorted order?
If we achieve all of this, then we have a worst-case O(n log n) sort that does not use extra memory.
This is the theoretical best for a comparison sort.
The steps of the heap sort algorithm are:
1. Use data to form a heap
2. remove highest priority item from heap (largest)
3. reform heap with remaining data
You repeat steps 2 & 3 until you finish all the data.
You could do step 1 by inserting the items one at a time into the heap. This would be
O(n log n). It turns out we can do it in O(n). This does not change the overall complexity, but it
is more efficient.
You would have to modify the normal heap implementation to avoid needing a second
array. Instead, we will enter all the values and make them into a heap in one pass.
As with other heap operations, we first make it a complete binary tree and then fix up so the
ordering is correct. We have already seen that there is a relationship between a complete binary
tree and an array.
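A compact sketch of the whole procedure in C, using the usual array representation of a complete binary tree (the children of index i sit at indices 2i+1 and 2i+2) and a max-heap so that the result comes out in ascending order; the function names are illustrative:

/* Sift the value at index i down into the max-heap A[0..n-1]. */
static void percolateDown(int A[], int n, int i) {
    while (2 * i + 1 < n) {
        int child = 2 * i + 1;                                   /* left child          */
        if (child + 1 < n && A[child + 1] > A[child]) child++;   /* take larger child   */
        if (A[i] >= A[child]) break;                             /* heap order restored */
        int tmp = A[i]; A[i] = A[child]; A[child] = tmp;
        i = child;
    }
}

void heapSort(int A[], int n) {
    /* Step 1: build the heap in place in O(n) by fixing subtrees bottom-up. */
    for (int i = n / 2 - 1; i >= 0; i--)
        percolateDown(A, n, i);

    /* Steps 2 and 3, repeated: swap the largest item (the root) with the last
       item of the remaining heap, then re-form the heap on the shorter prefix. */
    for (int end = n - 1; end > 0; end--) {
        int tmp = A[0]; A[0] = A[end]; A[end] = tmp;
        percolateDown(A, end, 0);
    }
}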
Our standard sorting example becomes:
This example has very few swaps. In some cases you have to percolate a value down by
swapping it with several children.
The Weiss book has the details showing that building the heap this way has worst-case
O(n) complexity. It isn't O(n log n) because the cost of each step is proportional to the height of
the subtree currently being considered, and most of the nodes are roots of subtrees with small
height. For example, about half the nodes have no children (they are leaves).
Now that we have a heap, we just remove the items one after another.
The only new twist here is to keep the removed item in the space of the original array. To
do this you swap the largest item (at root) with the last item (lower right in heap). In our example
this gives:
Now repeat with the new root value (it just happens to be 5 again):
The part just shown is very similar to removal from a heap, which is O(log n). You do it
n-1 times, so it is O(n log n). The last steps are cheaper, but for the reverse reason from the
building of the heap, most of them still cost about log n, so this part is O(n log n) overall. The
build part was O(n), so it does not dominate. For the whole heap sort you get O(n log n).
There is no extra memory except a few variables for local temporaries.
Thus, we have finally achieved a comparison sort that uses no extra memory and is
O(n log n) in the worst case.
In many cases people still use quick sort because it also uses no extra memory and is usually
O(n log n). Quick sort runs faster than heap sort in practice, and its O(n²) worst case is rarely
seen in practice.
3.8 HASHING
Hashing is an algorithmic procedure and a methodology; it is not a new data
structure. Rather, it is a way of using an existing data structure so that the find, insert and remove
methods of a table take constant time. Here we are talking about strategies, methodologies,
algorithms and procedures rather than about data structures, and hashing is one such strategy.
A hash function converts a number in a large range into a number in a smaller range. This
smaller range corresponds to the index numbers in an array.
arrayIndex = hugeNumber % arraySize
A good hash function is simple so that it can be computed quickly. A perfect hash
function maps every key into a different table location. Use a prime number as the array size.
Hashing strings: We can convert short strings to key numbers by multiplying digit codes by
powers of a constant. For example, giving the letters the codes a = 1, b = 2, ..., z = 26 and using
the constant 27, the three-letter word "ace" could turn into a number by calculating
key = 1*27^2 + 3*27^1 + 5*27^0 = 815
This approach has the desirable attribute of involving all the characters in the input string.
The calculated key value can then be hashed into an array index in the usual way:
index = key % arraySize
The hashFunc1() method is not as efficient as it might be. Other than the character
conversion, there are two multiplications and an addition inside the loop. We can eliminate one
multiplication by using Horner's method:
a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0 = (((a4*x + a3)*x + a2)*x + a1)*x + a0
The hashFunc1() method cannot handle long strings, because hashVal would exceed the size of
an int. Notice that the key must always end up being less than the array size. With Horner's
method we can apply the modulo (%) operator at each step in the calculation. This gives the
same result as applying the modulo operator once at the end, but avoids the overflow.
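A sketch of such a function in C, using the letter codes and the constant 27 from the discussion above and taking the remainder at every step (the function name is illustrative, and the string is assumed to contain only lowercase letters):

/* Hash a lowercase string with Horner's rule, applying % at each step so
   that hashVal never overflows. */
int hashString(const char *key, int arraySize) {
    int hashVal = 0;
    for (int i = 0; key[i] != '\0'; i++) {
        int letter = key[i] - 'a' + 1;               /* 'a' -> 1, ..., 'z' -> 26 */
        hashVal = (hashVal * 27 + letter) % arraySize;
    }
    return hashVal;                                  /* always in 0..arraySize-1 */
}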
Double Hashing
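In double hashing, when a collision occurs, the probe step size is computed by a second hash function of the key, so that keys which collide at the same index follow different probe sequences. A minimal sketch (the constant 5 and the function names are illustrative; the table size should be prime so that every slot can be reached):

int hash1(int key, int tableSize) { return key % tableSize; }

/* The second hash function must never return 0, and should be relatively
   prime to the table size (a prime table size guarantees this). */
int hash2(int key) { return 5 - key % 5; }

/* Probe sequence for a key: index, index+step, index+2*step, ... (mod size). */
int probe(int key, int attempt, int tableSize) {
    return (hash1(key, tableSize) + attempt * hash2(key)) % tableSize;
}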
3.10 SEPARATE CHAINING
In separate chaining, a data item's key is hashed to the index in the usual way, and the item is
inserted into the linked list at that index. Other items that hash to the same index are simply
added to the linked list.
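A sketch of such a table in C, with a fixed-size array of singly linked lists (the names and the table size are illustrative, and keys are assumed to be non-negative integers):

#include <stdlib.h>

#define TABLE_SIZE 101                    /* a prime table size */

typedef struct Node {
    int key;
    int value;
    struct Node *next;
} Node;

Node *table[TABLE_SIZE];                  /* each slot is a separate chain */

/* Insert a key/value pair by prepending a node to the chain at its index. */
void chainInsert(int key, int value) {
    int index = key % TABLE_SIZE;
    Node *n = malloc(sizeof(Node));
    n->key = key;
    n->value = value;
    n->next = table[index];
    table[index] = n;
}

/* Find the node for a key, or return NULL if it is not present. */
Node *chainFind(int key) {
    for (Node *n = table[key % TABLE_SIZE]; n != NULL; n = n->next)
        if (n->key == key) return n;
    return NULL;
}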
3.10.1 BUCKETS
Another approach similar to separate chaining is to use an array at each location in the
hash table instead of a linked list. Such arrays are called buckets. This approach is not as
efficient as the linked list approach, however, because of the problem of choosing the size of the
buckets. If they are too small they may overflow, and if they are too large they waste memory.
The calculation done on the element value is usually in the form of a multiplication by a prime
number.
Good hash functions, like good pseudo-random number generators (PRNGs), have been found
to involve the use of prime numbers. PRNGs are currently studied as statistical entities rather
than as deterministic entities; hence any analysis done can only bear witness to the overall result
rather than determine how or why the result came into being. If a more precise analysis could be
carried out, one could better understand which prime numbers work better and why, and at the
same time why other prime numbers don't work as well. Answering these questions with stable,
repeatable proofs would better equip one for designing better PRNGs and hence, eventually,
better hash functions.
The process involves initially selecting a region or object of interest. From there using
affine invariant feature detection algorithms such as the Harris corner detector (HCD), Scale-
Invariant Feature Transform (SIFT) or Speeded-Up Robust Features (SURF), a set of affine
features are extracted which are deemed to represent said object or region. This set is sometimes
called a macro-feature or a constellation of features. Depending on the nature of the features
detected and the type of object or region being classified it may still be possible to match two
constellations of features even though there may be minor disparities (such as missing or outlier
features) between the two sets. The constellations are then said to be the classified set of features.
A hash value is computed from the constellation of features. This is typically done by
initially defining a space where the hash values are intended to reside; the hash value in this case
is a multidimensional value normalized for the defined space. Coupled with the process for
computing the hash value, another process that determines the distance between two hash values
is needed. A distance measure is required, rather than a deterministic equality operator, due to
the possible disparities between the constellations that went into calculating the hash values.
Also, owing to the non-linear nature of such spaces, the simple Euclidean distance metric is
essentially ineffective; as a result, the process of automatically determining a distance metric for
a particular space has become an active field of research in academia.
Typical examples of geometric hashing include the classification of various kinds of
automobiles, for the purpose of re-detection in arbitrary scenes. The level of detection can be
varied from just detecting a vehicle, to a particular model of vehicle, to a specific vehicle.
The false positive probability can be controlled by varying the size of the table used for the
Bloom filter and also by varying the number of hash functions.
Subsequent research in the area of hash functions, hash tables and Bloom filters by
Mitzenmacher et al. suggests that for most practical uses of such constructs, the entropy in the
data being hashed contributes to the entropy of the hash functions. This leads to theoretical
results which conclude that an optimal Bloom filter (one which provides the lowest false
positive probability for a given table size, or vice versa) with a user-defined false positive
probability can be constructed with at most two distinct hash functions, also known as pairwise
independent hash functions, greatly increasing the efficiency of membership queries.
Bloom filters are commonly found in applications such as spell-checkers, string matching
algorithms, network packet analysis tools and network/internet caches.
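As a small illustration of the idea (the two string hashes and the table size here are arbitrary illustrative choices, not the optimal constructions discussed above):

#include <stdbool.h>

#define BLOOM_BITS 8192                        /* size of the bit table        */

static unsigned char bits[BLOOM_BITS / 8];     /* the Bloom filter's bit array */

/* Two simple, independent-looking string hashes (illustrative only). */
static unsigned h1(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % BLOOM_BITS;
}
static unsigned h2(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 131 + (unsigned char)*s++;
    return h % BLOOM_BITS;
}

/* Adding a word sets one bit per hash function. */
void bloom_add(const char *s) {
    bits[h1(s) / 8] |= 1 << (h1(s) % 8);
    bits[h2(s) / 8] |= 1 << (h2(s) % 8);
}

/* A word is "possibly present" only if all its bits are set; a clear bit
   means it was definitely never added, so there are no false negatives. */
bool bloom_maybe_contains(const char *s) {
    return (bits[h1(s) / 8] & (1 << (h1(s) % 8)))
        && (bits[h2(s) / 8] & (1 << (h2(s) % 8)));
}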
PART A - 2 Marks