DATA STRUCTURES AND ALGORITHMS-2015 Edition
LECTURE MANUAL
By
April, 2014
DATA TYPE
In computer programming, a data type simply refers to a defined kind of data, that is, a set of possible values
and basic operations on those values. When applied in programming languages, a data type defines a set of
values and the allowable operations on those values.
Data types are important in computer programmes because they classify data so that a translator (compiler or
interpreter) can reserve appropriate memory storage to hold all possible values, e.g. integers, real numbers,
characters, strings, and Boolean values, all have very different representations in memory.
Example 1: The basic data types include:
a. Character
b. Numeric integer
c. Numeric real
d. Boolean (logical).
Example 2: In the Java programming language, the "int" type represents the set of 32-bit integers, ranging in
value from -2,147,483,648 to 2,147,483,647, together with operations such as addition, subtraction and
multiplication that can be performed on integers.
The interface of a data structure normally provides a constructor, which returns an abstract handle to new data,
and several operations, which are functions accepting the abstract handle as an argument.
In the design of many types of programmes, the choice of data structures is a primary design consideration,
as experience in building large systems has shown that the difficulty of implementation and the quality and
performance of the final result depend heavily on choosing the best data structure.
ARRAYS
In Computer Science, an array is a data structure consisting of a group of elements that are accessed by
indexing. Each data item of an array is known as an element, and the elements are referenced by a common
name known as the array name.
In Java, an array is first declared and then created. For example:
int[] anArray;
anArray = new int[10];
An array can also be created using a shortcut. For example:
int[] anArray = {1,2,3,4,5,6,7,8,9,10};
An array element can be accessed using an index value. For example: int i = anArray[5];
The size of an array can be found using the length attribute. For example: int len = anArray.length;
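Putting these pieces together, the following short Java sketch (illustrative only) declares an array, accesses an element by index and reads its length:

// A minimal sketch pulling together the array operations above.
public class ArrayBasics {
    public static void main(String[] args) {
        int[] anArray = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // shortcut creation

        int i = anArray[5];       // index access: the sixth element, 6
        int len = anArray.length; // the length attribute: 10

        System.out.println("anArray[5] = " + i + ", length = " + len);
    }
}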
Before any array is used in the computer, some memory locations have to be created for storage of the
elements. This is often done by using the DIM instruction of BASIC programming language or
DIMENSION instruction of FORTRAN programming language. For example, the instruction:
DIM LAGOS (45)
will create 45 memory locations for storage of the elements of the array called LAGOS.
In most programming languages, each element has the same data type
and the array occupies a contiguous area of storage. Most programming languages have a built-in array data
type. Some programming languages support array programming which generalises operations and functions
to work transparently over arrays as they do with scalars, instead of requiring looping over array members.
Declaration of Arrays
Variables normally only store a single value but, in some situations, it is useful to have a variable that can
store a series of related values – using an array. For example, suppose a programme is required that will
calculate the average age among a group of six students. The ages of the students could be stored in six
integer variables in C:
int age1;
int age2;
int age3;
However, a better solution would be to declare a six-element array:
int age[6];
This creates a six element array; the elements can be accessed as age[0] through age[5] in C.
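For illustration, the average-age calculation might be written in Java as follows; the six ages shown here are made-up values, not data from the text:

// A sketch of the average-age calculation over a six-element array.
public class AverageAge {
    public static void main(String[] args) {
        int[] age = {18, 21, 19, 22, 20, 23}; // age[0] through age[5]
        int sum = 0;
        for (int k = 0; k < age.length; k++) {
            sum += age[k]; // accumulate all six ages
        }
        double average = (double) sum / age.length;
        System.out.println("Average age: " + average); // 20.5
    }
}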
A two-dimensional array (in which the elements are arranged into rows and columns), declared by say DIM
X(3,4), can be stored as a linear array in the computer memory by determining the product of the subscripts.
The above can thus be expressed as DIM X (3 * 4) or DIM X (12).
Multi-dimensional arrays can be stored as linear arrays in order to reduce the computation time and memory.
Multi-dimensional Arrays
Ordinary arrays are indexed by a single integer. Also useful, particularly in numerical and graphics
applications, is the concept of a multi-dimensional array, in which we index into the array using an ordered
list of integers, such as in a[3,1,5]. The number of integers in the list used to index into the multi-dimensional
array is always the same and is referred to as the array's dimensionality, and the bounds on each of these are
called the array's dimensions. An array with dimensionality k is often called k-dimensional. One-dimensional
arrays correspond to the simple arrays discussed thus far; two-dimensional arrays are a particularly common
representation for matrices. In practice, the dimensionality of an array rarely exceeds three. Mapping a one-
dimensional array into memory is obvious, since memory is logically itself a (very large) one-dimensional
array. When we reach higher-dimensional arrays, however, the problem is no longer obvious.
Suppose we want to represent a simple two-dimensional array. It is most common to index such an array
using the RC-convention, where elements are referred to in row, column fashion.
Multi-dimensional arrays are typically represented by one-dimensional arrays of references (Iliffe vectors) to
other one-dimensional arrays. The sub-arrays can be either the rows or columns.
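As a sketch of both representations, the following Java fragment stores the same element through a two-dimensional (Iliffe-vector) array and through a flattened one-dimensional array using the row-major mapping index = row * columns + column; the names x, flat, rows and cols are illustrative:

// Java's int[3][4] is an Iliffe vector: an array of references to 3 row
// arrays of length 4. The same data can be flattened into one dimension.
public class TwoDim {
    public static void main(String[] args) {
        int rows = 3, cols = 4;
        int[][] x = new int[rows][cols];   // Iliffe-vector representation
        int[] flat = new int[rows * cols]; // linear (row-major) representation

        x[2][1] = 99;
        flat[2 * cols + 1] = 99; // the same element under the RC-convention

        System.out.println(x[2][1] == flat[2 * cols + 1]); // true
    }
}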
Classification of Arrays
Arrays can be classified as static arrays (i.e. whose size cannot change once their storage has been
allocated), or dynamic arrays, which can be resized.
Processing Arrays
Although array-based iteration is useful when dealing with very simple data structures, it is quite difficult to
construct generalized algorithms that do much more than process every element of an array from start to
finish. For example, suppose you want to process only every second item; include or exclude specific values
based on some selection criteria; or even process the items in reverse order. Being tied to arrays also makes it
difficult to write applications that operate on databases or files without first copying the data into an array for
processing.
Using simple array-based iteration not only ties algorithms to using arrays, but also requires that
the logic for determining which elements stay, which go, and in which order to process them, is known in
advance. Even worse, if you need to perform the iteration in more than one place in your code, you will
likely end up duplicating the logic. This clearly isn’t a very extensible approach. Instead, what’s needed is a
way to separate the logic for selecting the data from the code that actually processes it.
An iterator (also known as an enumerator) solves these problems by providing a generic interface for looping over a set of
data so that the underlying data structure or storage mechanism—such as an array, database, and so on—is
hidden. Whereas simple iteration generally requires you to write specific code to handle where the data is
sourced from or even what kind of ordering or preprocessing is required, an iterator enables you to write
simpler, more generic algorithms. An iterator provides a number of operations for traversing and accessing
data.
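As a sketch of such an interface, the following array-backed iterator (modelled on Java's own java.util.Iterator) hides the underlying array from the client code; the class name ArrayIterator is an assumption of this example:

import java.util.Iterator;

// A minimal array-backed iterator: the client loops with hasNext()/next()
// and never sees the array that actually holds the data.
public class ArrayIterator implements Iterator<Integer> {
    private final int[] data;
    private int pos = 0;

    public ArrayIterator(int[] data) { this.data = data; }

    public boolean hasNext() { return pos < data.length; }

    public Integer next() { return data[pos++]; }

    public static void main(String[] args) {
        Iterator<Integer> it = new ArrayIterator(new int[] {4, 8, 15});
        while (it.hasNext()) {
            System.out.println(it.next()); // client code is storage-agnostic
        }
    }
}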
A Reverse Iterator
Sometimes you will want to reverse the iteration order without changing the code that processes the values.
Imagine an array of names that is sorted in ascending order, A to Z, and displayed to the user somehow. If the
user chose to view the names sorted in descending order, Z to A, you might have to re-sort the array or at the
very least implement some code that traversed the array backward from the end. With a reverse iterator,
however, the same behavior can be achieved without re-sorting and without duplicated code. When the
application calls first(), the reverse iterator actually calls last() on the underlying iterator. When the
application calls next(), the underlying iterator’s previous() method is invoked, and so on. In this way, the
behavior of the iterator can be reversed without changing the client code that displays the results, and without
re-sorting the array, which could be quite processing-intensive, as you will discover when you write some
sorting algorithms.
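A minimal Java sketch of this idea, here wrapping Java's standard ListIterator rather than the first()/last() interface described above, might look like this; the class name ReverseIterator and the sample names are illustrative:

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.ListIterator;

// next() on this wrapper delegates to previous() on the underlying
// ListIterator, so clients see the values in reverse without re-sorting.
public class ReverseIterator<E> implements Iterator<E> {
    private final ListIterator<E> underlying;

    public ReverseIterator(List<E> list) {
        this.underlying = list.listIterator(list.size()); // start at the end
    }

    public boolean hasNext() { return underlying.hasPrevious(); }

    public E next() { return underlying.previous(); }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Ada", "Bola", "Chike"); // sorted A-Z
        Iterator<String> it = new ReverseIterator<>(names);
        while (it.hasNext()) {
            System.out.println(it.next()); // Chike, Bola, Ada
        }
    }
}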
Applications of Arrays
Arrays are employed in many computer applications in which data items need to be saved in the computer
memory for subsequent reprocessing. Due to their performance characteristics, arrays are used to implement
other data structures, such as heaps, hash tables, deques, queues, stacks and strings.
LISTS
A list data structure is a sequential data structure, i.e. a collection of items accessible one after the other,
beginning at the head and ending at the tail. It is a widely used data structure for applications which do not
need random access. Lists differ from the stacks and queues data structures in that additions and removals
can be made at any position in the list.
Elements of a List
The sentence “Dupe is not a boy” can be written as a list as follows:
Dupe → is → not → a → boy → NULL
We regard each word in the sentence above as a data-item or datum, which is linked to the next datum, by a
pointer. Datum plus pointer make one node of a list. The last pointer in the list is called a terminator. It is
often convenient to speak of the first item as the head of the list, and the remainder of the list as the tail.
Operations
The main primitive operations of a list are known as:
Add adds a new node
Set updates the contents of a node
Remove removes a node
Get returns the value at a specified index
IndexOf returns the index in the list of a specified element
Additional primitives can be defined:
IsEmpty reports whether the list is empty
IsFull reports whether the list is full
Initialise creates/initialises the list
Destroy deletes the contents of the list (may be implemented by re-initialising the list)
Initialise Creates the structure – i.e. ensures that the structure exists but contains no elements, e.g.
Initialise(L) creates a new empty list named L
Add
e.g. Add(1,X,L) adds the value X to list L at position 1 (the start of the list is position 0), shifting subsequent
elements up
Fig. 3: List L before the add: A B C; after Add(1,X,L): A X B C
Set
e.g. Set(2,Z,L) updates the value at position 2 to be Z
Fig. 4: List after update: A X Z C
Remove
e.g. Remove(Z,L) removes the node with value Z
Fig. 5: List before removal: A X Z C; after removal: A X C
Get
e.g. Get(2,L) returns the value of the third node, i.e. C
IndexOf
e.g. IndexOf(X,L) returns the index of the node with value X, i.e. 1
List Implementation
There are many ways to implement a list depending on how the programmer will use lists in their
programme. The two most common, are an array-based implementation and a linked list.
1. Array List: As the name suggests, an array list uses an array to hold the values.
2. Linked List: A linked list, conversely, is a chain of elements in which each item has a reference (or link) to
the next (and optionally previous) element.
Array Lists
As the name suggests, an array list uses an array as the underlying mechanism for storing elements. Because
of this, the fact that you can index directly into arrays makes implementing access to elements very easy. It
also makes an array list the fastest implementation for indexed and sequential access. The downside to using
an array is that each time you insert a new element; you need to shift any elements in higher positions one
place to the right by physically copying them. Similarly, when deleting an existing element, you need to shift
any objects in higher positions one place to the left to fill the gap left by the deleted element. Additionally,
because arrays are fixed in size, anytime you need to increase the size of the list, you also need to reallocate a
new array and copy the contents over. This clearly affects the performance of insertion and deletion.
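The following simplified Java sketch shows the shifting and reallocation just described; the class name SimpleArrayList is illustrative and only int values are handled:

// Insertion shifts higher elements one place right; removal shifts them
// left; the array is reallocated and copied when it runs out of room.
public class SimpleArrayList {
    private int[] elements = new int[4];
    private int size = 0;

    public void insert(int index, int value) {
        if (size == elements.length) { // grow: reallocate and copy over
            int[] bigger = new int[elements.length * 2];
            System.arraycopy(elements, 0, bigger, 0, size);
            elements = bigger;
        }
        // shift elements at index..size-1 one place to the right
        System.arraycopy(elements, index, elements, index + 1, size - index);
        elements[index] = value;
        size++;
    }

    public void remove(int index) {
        // shift elements at index+1..size-1 one place to the left
        System.arraycopy(elements, index + 1, elements, index, size - index - 1);
        size--;
    }
}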
Linked List
The Linked List is stored as a sequence of linked nodes. Rather than use an array to hold the elements, a
linked list contains individual elements with links between them. As in the case of the stack, each node in a
linked list contains data AND a reference to the next node. It also makes insertion and deletion much simpler
than it is for an array list.
Figure 7: A singly linked list; a header reference points to the first node, and the nodes a1, a2, …, an-1, an
are linked in sequence up to the tail.
As you might recall from the discussion on array lists, in most cases when deleting or inserting, some portion
of the underlying array needs to be copied. With a linked list, however, each time you wish to insert or delete
an element, you need only update the references to and from the next and previous elements, respectively.
This makes the cost of the actual insertion or deletion almost negligible in all but the most extreme cases. For
lists with extremely large numbers of elements, the traversal time can be a performance issue. A doubly
linked list also maintains references to the first and last elements in the list—often referred to as the head and
tail, respectively. This enables you to access either end with equal performance.
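A minimal sketch of a singly linked node in Java, showing how an insertion needs only two reference updates; the class LinkedNode is an illustrative name, not code from the text:

// Each node holds its data and a reference to the next node; inserting
// after a node touches only two references, with no copying of elements.
public class LinkedNode<E> {
    E data;
    LinkedNode<E> next; // reference to the next node

    LinkedNode(E data) { this.data = data; }

    // Insert a new node carrying 'value' immediately after this node.
    void insertAfter(E value) {
        LinkedNode<E> node = new LinkedNode<>(value);
        node.next = this.next; // new node points to the old successor
        this.next = node;      // this node now points to the new node
    }
}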
STACKS
A stack is a data structure in which items are added and removed at one end only, known as the top of the
stack; it is therefore a LIFO (Last-In First-Out) structure.
Application of Stacks
Stacks are used extensively at every level of a modern computer system. For example, a modern PC uses
stacks at the architecture level, which are used in the basic design of an operating system for interrupt
handling and operating system function calls. Among other uses, stacks are used to run a Java Virtual
Machine, and the Java language itself has a class called "Stack", which can be used by the programmer.
Stacks have many other applications. For example, as a processor executes a programme, when a function call
is made, the called function must know how to return to the caller, so the current address of
programme execution is pushed onto a stack. Once the function is finished, the address that was saved is
removed from the stack, and execution of the programme resumes. If a series of function calls occurs, the
successive return addresses are pushed onto the stack in LIFO order so that each function can return to its
calling programme. Stacks support recursive function calls and subroutine calls, and are also used when
evaluating expressions in “reverse Polish notation”.
Solving a search problem, regardless of whether the approach is exhaustive or optimal, needs stack space.
Examples of exhaustive search methods are brute force and backtracking. Examples of optimal search
exploring methods are branch and bound and heuristic solutions. All of these algorithms use stacks to
remember the search nodes that have been noticed but not explored yet.
Another common use of stacks at the architecture level is as a means of allocating and accessing memory.
Fig. 10: Basic Architecture of a Stack
Operations on a Stack
The stack is usually implemented with two basic operations known as "push" and "pop". Thus, two
operations applicable to all stacks are:
A push operation, in which a data item is placed at the location pointed to by the stack pointer and the
address in the stack pointer is adjusted by the size of the data item; Push adds a given node to the top of the
stack leaving previous nodes below.
A pop or pull operation, in which a data item at the current location pointed to by the stack pointer is
removed, and the stack pointer is adjusted by the size of the data item. Pop removes and returns the current
top node of the stack.
The main primitives of a stack are known as:
Push adds a new node
Pop removes a node
Figure 11 shows the insertion of three data items X, Y and Z onto a stack and the removal of two items, Z and
Y, from the stack.
Fig. 11: The stack pointer p points in turn to X, to Y and then to Z as they are pushed; as Z and Y are popped,
it points back to Y and then to X.
Initialise creates/initialises the stack
Destroy deletes the contents of the stack (may be implemented by re-initialising the stack)
Initialise
Creates the structure – i.e. ensures that the structure exists but contains no elements, e.g. Initialise(S) creates
a new empty stack named S
Push
e.g. Push(X,S) pushes the value X onto the top of stack S
Pop
e.g. Pop(S) removes the TOP node and returns its value
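A minimal array-based stack in Java, sketching the push and pop operations described above; the class IntStack and its fixed capacity are assumptions of this example:

// push places the item at the position indicated by the stack pointer
// and adjusts it; pop adjusts the pointer back and returns the top item.
public class IntStack {
    private final int[] items;
    private int top = 0; // the stack pointer

    public IntStack(int capacity) { items = new int[capacity]; }

    public void push(int value) {
        if (top == items.length) throw new IllegalStateException("stack full");
        items[top++] = value;
    }

    public int pop() {
        if (top == 0) throw new IllegalStateException("stack empty");
        return items[--top];
    }

    public static void main(String[] args) {
        IntStack s = new IntStack(3);
        s.push(1); s.push(2); s.push(3);
        System.out.println(s.pop()); // 3 - last in, first out
        System.out.println(s.pop()); // 2
    }
}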
A stack may be implemented as either a static data structure or a dynamic data structure.
QUEUES
Customers line up in a bank waiting to be served by a teller and in supermarkets waiting to check out. No
doubt you’ve been stuck waiting in a line to speak to a customer service representative at a call center. In
computing terms, however, a queue is a list of data items stored in such a way that they can be retrieved in a
definable order. The main distinguishing feature between a queue and a list is that whereas all items in a list
are accessible—by their position within the list—the only item you can ever retrieve from a queue is the one
at the head. Which item is at the head depends on the specific queue implementation.
More often than not, the order of retrieval is indeed the same as the order of insertion (also known as first-in-
first-out, or FIFO), but there are other possibilities as well. Some of the more common examples include a
last-in-first-out queue and a priority queue, whereby retrieval is based on the relative priority of each item.
You can even create a random queue that effectively “shuffles” the contents. Queues are often described in
terms of producers and consumers. A producer is anything that stores data in a queue, while a consumer is
anything that retrieves data from a queue.
Queues can be either bounded or unbounded. Bounded queues have limits placed on the number of items that
can be held at any one time. These are especially useful when the amount of available memory is
constrained—for example, in a device such as a router or even an in-memory message queue. Unbounded
queues, conversely, are free to grow in size as the limits of the hardware allow.
The queue data structure is characterised by the fact that additions are made at the end, or tail, of the queue
while removals are made from the front, or head of the queue. For this reason, a queue is referred to as a
FIFO structure (First-In First-Out). Figure 14 shows a queue of some letters of the English alphabet.
Fig. 14: A queue; insertions are made at the tail and deletions are made at the head.
Application of Queues
Queues are very important structures in computer simulations, data processing, information management,
and in operating systems.
In simulations, queue structures are used to represent real-life events such as car queues at traffic light
junctions and petrol filling stations, queues of people at the check-out point in supermarkets, queues of bank
customers, etc.
In operating systems, queue structures are used to represent different programmes in the computer memory in
the order in which they are executed. For example, if a programme, J is submitted before programme K, then
programme J is queued before programme K in the computer memory and programme J is executed before
programme K.
Operations on a Queue
The main primitive operations on a queue are known as:
Enqueue: Stores a value in the queue. The size of the queue will increase by one.
Dequeue: Retrieves the value at the head of the queue. The size of the queue will decrease by one. Throws
EmptyQueueException if there are no more items in the queue.
Clear: Deletes all elements from the queue. The size of the queue will be reset to zero (0).
Size: Obtains the number of elements in the queue.
IsEmpty: Determines whether the queue is empty (size() = 0) or not.
Initialise
Creates the structure – i.e. ensures that the structure exists but contains no elements.
e.g. Initialise(Q) creates a new empty queue named Q
Add
e.g. Add(X,Q) adds the value X to the tail of Q
Fig. 15: Queue Q after adding the value X to the tail: X
Then Add(Y,Q) adds the value Y to the tail of Q.
Fig. 16: Queue Q after adding the value Y: X Y
Remove
e.g. Remove(X,Q) removes the head node and returns its value
Fig. 17: Queue Q after the removal: Y
Fig. 18: Storing a queue in an array
The new node is to be added at the tail of the queue. The reference Queue.Tail should point to the new node,
and the NextNode reference of the node previously at the tail of the queue should point to the new node.
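A sketch of this tail insertion in Java; the field names follow the Tail/NextNode/DataItem naming used above, but the code itself is illustrative:

// enqueue links the old tail to the new node and moves the tail reference.
public class LinkedQueue<E> {
    private static class Node<E> {
        E dataItem;
        Node<E> nextNode;
        Node(E dataItem) { this.dataItem = dataItem; }
    }

    private Node<E> head;
    private Node<E> tail;

    public void enqueue(E value) {
        Node<E> node = new Node<>(value);
        if (tail != null) {
            tail.nextNode = node; // old tail now points to the new node
        } else {
            head = node;          // the queue was empty
        }
        tail = node;              // Tail reference points to the new node
    }

    public E dequeue() {
        E value = head.dataItem;  // take the item at the head
        head = head.nextNode;
        if (head == null) tail = null; // queue became empty
        return value;
    }
}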
Blocking Queues
Queues are often used in multi-threaded environments as a form of interprocess communication.
Unfortunately, a plain FIFO queue is totally unsafe for use in situations where multiple consumers would be
accessing it concurrently. Instead, a blocking queue is one way to provide a thread-safe implementation,
ensuring that all access to the data is correctly synchronized. The first main enhancement that a blocking
queue offers over a regular queue is that it can be bounded.
So far, we have only dealt with unbounded queues—those that continue to grow without limit. The blocking
queue enables you to set an upper limit on the size of the queue. Moreover, when an attempt is made to store
an item in a queue that has reached its limit, the queue will, you guessed it, block the thread until space
becomes available—either by removing an item or by calling clear(). In this way, you guarantee that the
queue will never exceed its predefined bounds. The second major feature affects the behavior of dequeue().
When an attempt is made to retrieve an item from an empty queue, a blocking queue will block the current
thread until an item is enqueued. This is good for implementing work queues where multiple, concurrent
consumers need to wait until there are more tasks to perform.
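Java's java.util.concurrent package provides this behaviour through its BlockingQueue interface; the short sketch below uses the standard ArrayBlockingQueue class, whose put() blocks when the queue is full and whose take() blocks when it is empty:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // A bounded, thread-safe queue with room for at most two items.
        BlockingQueue<String> tasks = new ArrayBlockingQueue<>(2);

        tasks.put("task-1");
        tasks.put("task-2"); // a third put() would block until space appeared

        System.out.println(tasks.take()); // task-1 (FIFO order)
        System.out.println(tasks.take()); // task-2; take() on empty blocks
    }
}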
The scheduler makes sure that each dispatch queue is filled with a minimum number of tasks so that a
processor can always find a task in its dispatch queue when it finishes execution of a task. The scheduler
determines a feasible schedule based on the worst-case computation times of tasks, satisfying their timing and
resource constraints. The scheduling algorithm has full knowledge about the currently active set of tasks, but
not about the new set of tasks that may arrive while scheduling the current task set. The objective of
dynamic scheduling is to minimise the makespan, thereby improving the guarantee ratio. The guarantee ratio
is the percentage of tasks that arrived in the system whose deadlines are met. The scheduler must also
guarantee that the tasks already scheduled are going to meet their deadlines. The scheduler model consists of
a minimum of 5 processors and a maximum of 10 processors. The scheduler model is shown in Fig. 3.1.
Fig. 3.1: Scheduler model; the scheduler dispatches tasks from the task queue to processors P1 to P10.
TREES DATA STRUCTURE
A tree is often used to represent a hierarchy. This is because the relationships between the items in
the hierarchy suggest the branches of a botanical tree.
Figure: A simple unordered tree; in this diagram, the node labeled 7 has two children, labeled 2 and 6, and
one parent, labeled 2. The root node, at the top, has no parent.
In computer science, a tree is a widely-used data
structure that emulates a hierarchical tree structure with a set of linked nodes. A node is a structure which
may contain a value, a condition, or represent a separate data structure (which could be a tree of its own).
Each node in a tree has zero or more child nodes, which are below it in the tree (by convention, trees are
drawn growing downwards). A node that has a child is called the child's parent node (or ancestor node, or
superior). A node has at most one parent. Nodes that do not have any children are called leaf nodes. They are
also referred to as terminal nodes.
The height of a node is the length of the longest downward path to a leaf from that node. The height of the
root is the height of the tree. The depth of a node is the length of the path to its root (i.e., its root path). This
is commonly needed in the manipulation of the various self-balancing trees, AVL trees in particular.
Conventionally, the value −1 corresponds to a subtree with no nodes, whereas zero corresponds to a subtree
with one node.
The topmost node in a tree is called the root node. Being the topmost node, the root node will not have
parents. It is the node at which operations on the tree commonly begin (although some algorithms begin with
the leaf nodes and work up ending at the root). All other nodes can be reached from it by following edges or
links. (In the formal definition, each such path is also unique). In diagrams, it is typically drawn at the top. In
some trees, such as heaps, the root node has special properties. Every node in a tree can be seen as the root
node of the subtree rooted at that node.
An internal node or inner node is any node of a tree that has child nodes and is thus not a leaf node.
Similarly, an external node or outer node is any node that does not have child nodes and is thus a leaf.
A subtree of a tree T is a tree consisting of a node in T and all of its descendants in T. (This is different from
the formal definition of subtree used in graph theory.) The subtree corresponding to the root node is the entire
tree; the subtree corresponding to any other node is called a proper subtree (in analogy to the term proper
subset).
Figure: A tree with height 3, showing the root node, a right child node, edges (or links) and leaf nodes.
This forms a complete tree, whose height is defined as the number of links from the root to the
deepest leaf.
Key terms
Root Node
Node at the "top" of a tree - the one from which all operations on the tree commence. The root node
may not exist (a NULL tree with no nodes in it) or have 0, 1 or 2 children in a binary tree.
Leaf Node
Node at the "bottom" of a tree - farthest from the root. Leaf nodes have no children.
Complete Tree
Tree in which each leaf is at the same distance from the root. A more precise and formal definition of
a complete tree is set out later.
Height
Number of nodes which must be traversed from the root to reach a leaf of a tree.
Binary Trees
A binary tree is a tree in which each node has at most two sub-trees, a left and a right. The nodes at the
lowest levels of the tree (the ones with no sub-trees) are called leaves. In an ordered binary tree:
1. the keys of all the nodes in the left sub-tree are less than that of the root,
2. the keys of all the nodes in the right sub-tree are greater than that of the root,
3. the left and right sub-trees are themselves ordered binary trees.
Traversal methods
There are many different applications of trees. As a result, there are many different algorithms for
manipulating them. However, many of the different tree algorithms have in common the
characteristic that they systematically visit all the nodes in the tree. That is, the algorithm walks
through the tree data structure and performs some computation at each node in the tree. This process
of walking through the tree is called a tree traversal.
There are 3 types of walks or traversal in a tree: pre-order, in-order and post-order
Pre-order walk: each parent node is traversed before its children;
Post-order walk: the children are traversed before their respective parents;
In-order walk: a node’s left subtree is traversed first, then the node itself, and finally its right subtree.
Stepping through the items of a tree, by means of the connections between parents and children, is called
walking the tree, and the action is a walk of the tree. Often, an operation might be performed when a pointer
arrives at a particular node. A walk in which each parent node is traversed before its children is called a pre-
order walk; a walk in which the children are traversed before their respective parents are traversed is called
a post-order walk; a walk in which a node's left subtree, then the node itself, and then finally its right
subtree are traversed is called an in-order traversal. (This last scenario, referring to exactly two subtrees, a
left subtree and a right subtree, assumes specifically a binary tree.)
Preorder Traversal
The first depth-first traversal method we consider is called preorder traversal. Preorder traversal is defined
recursively as follows. To do a preorder traversal of a general tree:
1. Visit the root first; and then
2. Do a preorder traversal of each of the subtrees of the root one-by-one in the order given.
Preorder traversal gets its name from the fact that it visits the root first.
In the case of a binary tree, the algorithm becomes:
1. Visit the root first; and then
2. Traverse the left subtree; and then
3. Traverse the right subtree.
Notice that the preorder traversal visits the nodes of the tree in precisely the same order in which
they are written. A preorder traversal is often done when it is necessary to print a textual
representation of a tree.
Postorder Traversal
The second depth-first traversal method we consider is postorder traversal. In contrast with preorder
traversal, which visits the root first, postorder traversal visits the root last. To do a postorder traversal of a
general tree:
1. Do a postorder traversal of each of the subtrees of the root one-by-one in the order given; and then
2. Visit the root.
To do a postorder traversal of a binary tree:
1. Traverse the left subtree; and then
2. Traverse the right subtree; and then
3. Visit the root.
Inorder Traversal
The third depth-first traversal method is inorder traversal. Inorder traversal only makes sense for binary
trees. Whereas preorder traversal visits the root first and postorder traversal visits the root last, inorder
traversal visits the root in between visiting the left and right subtrees:
1. Traverse the left subtree; and then
2. Visit the root; and then
3. Traverse the right subtree.
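The three traversals can be written almost directly from these definitions. The following Java sketch (the TreeNode class is illustrative) prints the keys of a small binary tree in each order:

public class TreeNode {
    int key;
    TreeNode left, right;

    TreeNode(int key) { this.key = key; }

    static void preorder(TreeNode n) {
        if (n == null) return;
        System.out.print(n.key + " "); // visit the root first
        preorder(n.left);
        preorder(n.right);
    }

    static void inorder(TreeNode n) {
        if (n == null) return;
        inorder(n.left);
        System.out.print(n.key + " "); // visit the root in between
        inorder(n.right);
    }

    static void postorder(TreeNode n) {
        if (n == null) return;
        postorder(n.left);
        postorder(n.right);
        System.out.print(n.key + " "); // visit the root last
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode(2);
        root.left = new TreeNode(1);
        root.right = new TreeNode(3);
        preorder(root);   // prints: 2 1 3
        System.out.println();
        inorder(root);    // prints: 1 2 3 (sorted, for an ordered binary tree)
        System.out.println();
        postorder(root);  // prints: 1 3 2
    }
}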
If we relax the restriction that each node can have only one key, we can reduce the height of the tree. An
m-way search tree:
a. is empty, or
b. consists of a root containing j (1 <= j < m) keys, k1, …, kj, and
a set of sub-trees, Ti (i = 0..j), such that
i. if k is a key in T0, then k <= k1
ii. if k is a key in Ti (0 < i < j), then ki <= k <= ki+1
iii. if k is a key in Tj, then k > kj, and
iv. all Ti are non-empty m-way search trees or all Ti are empty
Or in plain English:
i. A node contains 1 to m-1 keys, in ascending order;
ii. All keys in a sub-tree to the left of a key are smaller than it;
iii. All keys in a sub-tree to the right of a key are greater than it;
iv. This is the "standard" recursive part of the definition.
A variation of the B-tree, known as a B+-tree, considers all the keys in nodes except the leaves as dummies.
All keys are duplicated in the leaves. This has the advantage that, as all the leaves are linked together
sequentially, the entire tree may be scanned without visiting the higher nodes at all.
Key Terms
n-ary trees (or n-way trees)
Trees in which each node may have up to n children.
B-tree
Balanced variant of an n-way tree.
B+-tree
B-tree in which all the leaves are linked to facilitate fast in order traversal.
-----------
AVL tree
An AVL tree is another balanced binary search tree. Named after their inventors, Adelson-Velskii
and Landis, they were the first dynamically balanced trees to be proposed. An AVL tree is a self-
balancing Binary Search Tree (BST) where the difference between the heights of the left and right subtrees
cannot be more than one for any node. Equivalently, a binary search tree is an AVL tree if there is no node
that has subtrees differing in height by more than 1.
Adding or removing a leaf from an AVL tree may make many nodes violate the AVL balance
condition, but each violation of AVL balance can be restored by one or two simple changes called
rotations.
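As a sketch of one such rotation, the following Java fragment performs a single right rotation, used when a node's left subtree has become too tall; the AvlNode class and its height bookkeeping are assumptions of this example, not code from the text (note that an empty subtree has height 0 here, rather than the -1 convention mentioned earlier; only the relative comparison matters):

class AvlNode {
    int key, height;
    AvlNode left, right;

    static int height(AvlNode n) { return n == null ? 0 : n.height; }

    // Single right rotation: the left child x becomes the subtree root.
    static AvlNode rotateRight(AvlNode y) {
        AvlNode x = y.left;
        y.left = x.right; // x's right subtree moves under y
        x.right = y;      // y becomes x's right child
        y.height = 1 + Math.max(height(y.left), height(y.right));
        x.height = 1 + Math.max(height(x.left), height(x.right));
        return x;         // x is the new root of this subtree
    }
}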
B-TREE
In computer science, a B-tree is a tree data structure that keeps data sorted and allows
searches, sequential access, insertions and deletions in logarithmic time. The B-tree is a
generalization of a binary search tree in that a node may have more than two children (Comer
1979, p.123). Unlike self-balancing binary search trees, the B-tree is optimized for systems
that read and write large blocks of data. It is commonly used in databases and
file systems.
Fig 2: A B-tree of order 2 (Bayer & McCreight 1972) or order 5 (Knuth 1998).
As depicted in the picture above, in a B-tree the internal (non-leaf) nodes can have a variable
number of child nodes within some pre-defined range. Any time data is inserted into or removed
from a node, the number of child nodes changes; in order to maintain the pre-defined
range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-
trees do not need re-balancing as frequently as other self-balancing search trees, but they may
waste some space, since nodes are not entirely full. The lower and upper bounds on the
number of child nodes are typically fixed for a particular implementation. For instance, in a
2-3 B-tree (often referred to simply as a 2-3 tree), each internal node may have only 2 or 3 child
nodes.
Each internal node of a B-tree contains a number of keys. The keys act as separation
values which divide its sub-trees. A B-tree is kept balanced by keeping all leaf nodes at the
same depth. This depth will increase slowly as elements are added to the tree, but an increase
in the overall depth is infrequent, and results in all leaf nodes being one more node farther away
from the root. B-trees have substantial advantages over alternative implementations when the
time to access the data of a node greatly exceeds the time spent processing the data, because
then the cost of accessing the node may be amortized over multiple operations within the
node. This usually occurs when the node data are in secondary storage such
as disk drives. By maximizing the number of keys within each internal node, the height of the
tree decreases and the number of expensive node accesses is reduced. In addition,
rebalancing of the tree occurs less often. The maximum number of child nodes depends on
the information that must be stored for each child node and the size of a full disk block or an
analogous size in secondary storage. While 2-3 B-trees are easier to explain, practical B-trees
using secondary storage need a large number of child nodes to improve performance.
Note that the term B-tree may refer to a specific design or to a general class of
designs. In the narrow sense, a B-tree stores keys in its internal nodes but need not store
those keys in the records at the leaves. The general class of B-trees includes variations such as
the B+-tree, as we shall see in the next section.
B+-TREE
A B+-tree can be seen as a tree in which each node contains only keys (not key-value pairs),
and to which an additional level is added at the bottom with linked leaves. As depicted in the
simple B+-tree of Fig. 3, the B+-tree links the keys 1-7 to data values d1-d7; the linked list of
leaves (shown in red in the original figure) allows a rapid in-order traversal. The branching
factor here is b = 4.
Fig. 3: A simple B+-tree linking the keys 1-7 to data values d1-d7.
B-TREE versus B+-TREE:
- A B-tree has a lower fan-out; a B+-tree has a very high fan-out (fan-out is the number of
pointers to child nodes in a node).
- In a B-tree, leaf nodes have no linkage; in a B+-tree, the leaf nodes are linked to each other.
- B-trees store data with each key; B+-trees don't have data associated with interior nodes.
M-WAY TREE
An m-way tree is a multi-way tree in which a node can have more than two children. A multi-way tree of
order m (known as an m-way tree) is one in which each node can have up to m children. As with the other
trees that have been previously mentioned, a node in an m-way tree is made up of key
fields, in this case m-1 key fields, and pointers to its children. In order to make the processing of
an m-way tree easier, some type of order is imposed on the keys within each node,
resulting in a multi-way search tree of order m. Hence, by definition, an m-way search tree is an
m-way tree in which the ordering conditions given earlier hold.
Fig 4: M- way tree structure
Storage Management
An executing program uses memory (storage) for many different purposes, such as for the
machine instructions that represent the executable part of the program, the values of data
objects, and the return location for a function invocation.
Static Memory Management
When memory is allocated at compilation time, it is called ‘Static Memory
Management’. This memory is fixed and cannot be increased or decreased after allocation. If
more memory is allocated than required, then memory is wasted. If less memory is
allocated than required, then the program will not run successfully. So exact memory
requirements must be known in advance.
Dynamic Memory Management
When memory is allocated during run/execution time, it is called ‘Dynamic Memory
Management’. This memory is not fixed, and is allocated according to the program’s
requirements. Thus there is no wastage of memory, and there is no need to know the exact
memory requirements in advance.
Storage management typically involves three phases: 1) initial allocation, in which blocks of
storage are assigned as the program requests them; 2) recovery, in which
storage that is no longer needed is made available for reuse; and 3) compaction, in which
blocks of storage that are in use but are separated by blocks of unused storage are moved
together in order to provide larger blocks of available storage. (Compaction is desirable, but
usually it is difficult or impossible to do in practice, so it is not often done.) These phases may
be repeated many times during the execution of a program.
HASHING
Hashing is the technique used for performing almost constant-time search, insertion, deletion
and find operations. Taking a very simple example, an array with its index as the key is
already a simple form of hash table.
If one wants to store a certain set of similar objects and wants to quickly access a given one (or come
back with the result that it is unknown), the first idea would be to store them in a list, possibly sorted
for faster access. This however still needs log(n) comparisons to find a given element or to
decide that it is not yet stored. Therefore one uses a much bigger array and uses a function from the
space of possible objects to integer values to decide where in the array to store a certain object. If
this so-called hash function distributes the actually stored objects well enough over the array,
access to a stored object takes essentially constant time.
The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a
key, the algorithm computes an index that suggests where the entry can be found.
In this method, the hash is independent of the array size, and it is then reduced to an index (a number
between 0 and the array size minus 1), typically by taking the remainder of the hash divided by the
array size.
In the case that the array size is a power of two, the remainder operation is reduced to masking,
which improves speed, but can increase problems with a poor hash function.
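A small Java sketch of both reductions; the key and the table sizes are arbitrary choices for illustration:

public class HashIndex {
    public static void main(String[] args) {
        String key = "Dupe";
        int hash = key.hashCode();

        int sizeAny = 11;                          // any table size
        int index1 = Math.floorMod(hash, sizeAny); // remainder, kept non-negative

        int sizePow2 = 16;                         // power-of-two table size
        int index2 = hash & (sizePow2 - 1);        // masking replaces remainder

        System.out.println(index1 + " " + index2);
    }
}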
HASH FUNCTIONS
A hash function employs some algorithm to compute a key K of fixed size for each data element in the set
U. The same key K can then be used to map data to a hash table, and all the operations like insertion,
deletion and searching should be possible using it. The values returned by a hash function are also referred
to as hash values, hash codes, hash sums, or hashes.
Here are some relatively simple hash functions that have been used:
Division-remainder method: The size of the number of items in the table is estimated. That
number is then used as a divisor of each original value or key to extract a quotient and a
remainder. The remainder is the hashed value. (Since this method is liable to produce a number
of collisions, any search mechanism would have to be able to recognize a collision and offer an
alternative location.)
Folding method: This method divides the original value (digits in this case) into several parts,
adds the parts together, and then uses the last four digits (or some other arbitrary number of digits)
as the hashed value or key.
Radix transformation method: Where the value or key is digital, the number base (or radix) can
be changed, resulting in a different sequence of digits. (For example, a decimal numbered key
could be transformed into a hexadecimal numbered key.) High-order digits can be discarded to
fit a hash value of uniform length.
Digit rearrangement method: This is simply taking part of the original value or key, such as digits
in positions 3 through 6, reversing their order, and then using that sequence of digits as the hash
value or key.
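The following Java sketch illustrates three of these methods for an integer key; the example key 13572468 and the exact digit positions chosen are arbitrary:

public class SimpleHashes {
    // Division-remainder: the remainder after dividing by the table size.
    static int divisionRemainder(int key, int tableSize) {
        return key % tableSize;
    }

    // Folding: split the digits into parts, add them, keep four digits.
    static int folding(int key) {
        int part1 = key / 10000;        // e.g. 1357 from 13572468
        int part2 = key % 10000;        // e.g. 2468
        return (part1 + part2) % 10000; // keep (up to) four digits
    }

    // Digit rearrangement: take digits in positions 3-6 and reverse them.
    static int digitRearrangement(int key) {
        String digits = Integer.toString(key).substring(2, 6);
        return Integer.parseInt(new StringBuilder(digits).reverse().toString());
    }

    public static void main(String[] args) {
        System.out.println(divisionRemainder(13572468, 97));
        System.out.println(folding(13572468));            // 1357 + 2468 = 3825
        System.out.println(digitRearrangement(13572468)); // digits 5724 -> 4275
    }
}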
There are several well-known hash functions used in cryptography. These include the message-
digest hash functions MD2, MD4, and MD5, used for hashing digital signatures into a shorter value
called a message digest, and the Secure Hash Algorithm (SHA), a standard algorithm that makes a
larger (160-bit) message digest and is similar to MD4. A hash function that works well for database
storage and retrieval, however, might not work as well for cryptographic or error-checking purposes.
HASH TABLES
A hash table is a data structure in which data elements are stored (inserted), searched for and deleted based
on keys generated for each element by a hashing function. In a hashing system, the keys are stored in an
array which is called the hash table. A perfectly implemented hash table performs all of these operations in
constant time.
A hash table is a collection of items which are stored in such a way as to make it easy to find them
later. Each position of the hash table, often called a slot, can hold an item and is named by an integer
value starting at 0. For example, we will have a slot named 0, a slot named 1, a slot named 2, and so
on. Initially, the hash table contains no items, so every slot is empty. We can implement a hash table
by using a list with each element initialized to the special Python value None. Figure 1 shows a hash
table of size m = 11. In other words, there are m slots in the table, named 0 through 10.
Figure 1: Hash Table with 11 Empty Slots
The mapping between an item and the slot where that item belongs in the hash table is called the
hash function. The hash function will take any item in the collection and return an integer in the
range of slot names, between 0 and m-1. Assume that we have the set of integer items 54, 26, 93, 17,
77, and 31. Our first hash function, sometimes referred to as the “remainder method,” simply takes
an item and divides it by the table size, returning the remainder as its hash value (h(item) = item % 11).
Table 1 gives all of the hash values for our example items. Note that this remainder method
(modulo arithmetic) will typically be present in some form in all hash functions, since the result
must be in the range of slot names.
Once the hash values have been computed, we can insert each item into the hash table at the
designated position as shown in Figure 2. Note that 6 of the 11 slots are now occupied. This is
referred to as the load factor, commonly denoted by λ = (number of items)/(table size).
Now when we want to search for an item, we simply use the hash function to compute the slot name
for the item and then check the hash table to see if it is present. This searching operation is O(1),
since a constant amount of time is required to compute the hash value and then index the hash table
at that location. If everything is where it should be, we have found a constant-time search algorithm.
We can probably already see that this technique is going to work only if each item maps to a unique
location in the hash table. For example, if the item 44 had been the next item in our collection, it
would have a hash value of 0 (44 % 11 == 0). Since 77 also had a hash value of 0, we would
have a problem. According to the hash function, two or more items would need to be in the same
slot. This is referred to as a collision (it may also be called a “clash”). Clearly, collisions create a
problem for the hashing technique.
HASHING TECHNIQUES
Static hashing: In static hashing, the hash function maps search-key values to a fixed set of
locations.
Dynamic hashing: In dynamic hashing, the hash table can grow to handle a variable set of
locations.
HASH COLLISION
A situation when the resultant hashes for two or more data elements in the data set U, maps to the
same location in the hash table is called a hash collision. In such a situation two or more data
elements would qualify to be stored/mapped to the same location in the hash table.
When hash collision occur, the act of ensuring two or more data elements is not stored or mapped to
the same location is referred to as hash collision resolution. The Techniques utilized in collision
resolution include;
Separate chaining
Open addressing
SEPARATE CHAINING: a technique in which the data is not directly stored at the hash key
index (k) of the hash table. Rather, the entry at the key index (k) in the hash table is a pointer to the
head of the data structure where the data is actually stored. In the simplest and most common
implementations, the data structure adopted for storing the elements is a linked list.
In this technique, when data needs to be searched for, it might become necessary (in the worst case) to
traverse all the nodes in the linked list to retrieve the data.
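A minimal Java sketch of separate chaining, using a linked list per slot; the class name ChainedHashTable is illustrative and only integer keys are stored:

import java.util.LinkedList;

// Each slot of the table holds the head of a linked list, and colliding
// keys are chained in that list.
public class ChainedHashTable {
    private final LinkedList<Integer>[] table;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int size) {
        table = new LinkedList[size];
        for (int i = 0; i < size; i++) table[i] = new LinkedList<>();
    }

    private int slot(int key) { return Math.floorMod(key, table.length); }

    public void insert(int key) { table[slot(key)].add(key); }

    public boolean search(int key) {
        // worst case: traverse every node in the chain at this slot
        return table[slot(key)].contains(key);
    }

    public static void main(String[] args) {
        ChainedHashTable h = new ChainedHashTable(11);
        h.insert(77); h.insert(44); // 77 % 11 == 44 % 11 == 0: a collision
        System.out.println(h.search(44)); // true - found by walking the chain
    }
}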
Advantages:
• The hash table can hold more elements without the large performance deterioration of open
addressing.
• The performance of chaining declines much more slowly than that of open addressing.
• The keys of the objects to be hashed need not be unique.
Disadvantages:
• It requires the implementation of a separate data structure for chains, and code to manage it.
• The main cost of chaining is the extra space required for the linked lists.
• For some languages, creating new nodes (for linked lists) is expensive and slows down the
system.
Open Addressing: In this technique a hash table with a pre-identified size is considered. All items are
stored in the hash table itself. In addition to the data, each hash bucket also maintains one of three
states: EMPTY, OCCUPIED, DELETED. While inserting, if a collision occurs, alternative cells are
tried until an empty bucket is found, for which one of the following techniques is adopted:
i. Linear Probing: One of the simplest re-hashing functions is +1 (or -1) on a collision, i.e. the
function looks in the neighboring slot in the table. It calculates the new address extremely
quickly and may be extremely efficient on a modern RISC processor due to efficient cache
utilization.
ii. Quadratic Probing: Better behavior is usually obtained with quadratic probing, where the
address on the i-th re-hash is
address = h(key) + c·i²
(A more complex function of i may also be used.) Since keys which are
mapped to the same value by the primary hash function follow the same sequence of
addresses, quadratic probing shows secondary clustering. However, secondary clustering is
not nearly as severe as the clustering shown by linear probing.
iii. Double hashing: Re-hashing schemes use a second hashing operation when there is a
collision. If there is a further collision, we re-hash until an empty "slot" in the table is found.
The re-hashing function can either be a new function or a re-application of the original one.
As long as the functions are applied to a key in the same order, then a sought key can always
be located.
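A simplified Java sketch of open addressing with linear probing, using the three cell states described above; the class OpenAddressingTable is illustrative, assumes the table never becomes completely full, and stores only integer keys:

// Every item lives in the table itself; each cell carries a three-state
// flag so that searches can probe past deleted cells correctly.
public class OpenAddressingTable {
    private enum State { EMPTY, OCCUPIED, DELETED }

    private final int[] keys;
    private final State[] states;

    public OpenAddressingTable(int size) {
        keys = new int[size];
        states = new State[size];
        java.util.Arrays.fill(states, State.EMPTY);
    }

    public void insert(int key) {
        int i = Math.floorMod(key, keys.length);
        // linear probing: on a collision, try the neighboring slot (+1)
        while (states[i] == State.OCCUPIED) {
            i = (i + 1) % keys.length; // assumes the table is never full
        }
        keys[i] = key;
        states[i] = State.OCCUPIED;
    }

    public boolean search(int key) {
        int i = Math.floorMod(key, keys.length);
        while (states[i] != State.EMPTY) { // DELETED cells do not stop the probe
            if (states[i] == State.OCCUPIED && keys[i] == key) return true;
            i = (i + 1) % keys.length;
        }
        return false;
    }
}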
Characteristics of open addressing:
• All items are stored in the hash table itself. There is no need for another data structure.
• It requires the use of a three-state (Occupied, Empty, or Deleted) flag in each cell.
Applications of hash tables:
• DATABASE SYSTEMS: A hash table provides a way to locate data in a constant amount of
time.
• SYMBOL TABLES: A hash table is used by a compiler to maintain information about the
symbols in a program.
• DATA DICTIONARIES: A data structure that supports adding, deleting and searching for data.