Data Structures and Algorithms
In computer programming, a data type is a defined kind of data: a set of possible values together with the basic operations on those values. In a programming language, a data type therefore defines both the values a variable may hold and the operations allowed on those values.
Data types are important in computer programmes because they classify data so that a translator (compiler or interpreter) can reserve appropriate memory storage to hold all possible values; e.g. integers, real numbers, characters, strings and Boolean values all have very different representations in memory.
Example 1: The basic data types are:
a. Character
b. Numeric integer
c. Numeric real
d. Boolean (logical).
Example 2: In the Java programming language, the “int” type represents the set of 32-bit integers ranging in value from -2,147,483,648 to 2,147,483,647, together with operations such as addition, subtraction and multiplication that can be performed on integers.
lists), or an instance of a class. (e.g. a list of circles). A data type is abstract in the sense
that it is independent of various concrete implementations.
Object-oriented languages such as C++ and Java provide explicit support for expressing
abstract data types by means of classes. A first-class abstract data type supports the creation of multiple instances of the ADT, and the interface normally provides a constructor, which returns an abstract handle to new data, together with several operations, which are functions accepting the abstract handle as an argument.
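As a small sketch of this idea in Java (the Counter class and its operations below are illustrative, not taken from the text or any particular library), an abstract data type can be expressed as a class whose constructor returns a handle to new data and whose methods operate on that data:

```java
// Illustrative abstract data type: a simple counter.
// The interface exposes a constructor and operations; the concrete
// representation (the private field) is hidden from client code.
public class Counter {
    private int value;                  // hidden representation

    public Counter(int start) {         // constructor returns a new instance
        this.value = start;
    }

    public void increment() { value++; } // operation on the ADT
    public int get() { return value; }   // observer operation

    public static void main(String[] args) {
        Counter c = new Counter(0);      // multiple instances can be created
        c.increment();
        c.increment();
        System.out.println(c.get());     // prints 2
    }
}
```

Because clients can only go through the constructor and the methods, the representation could later be changed (say, to a long) without affecting any client code, which is exactly what makes the type abstract.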
In the design of many types of programmes, the choice of data structures is a primary design consideration, as experience in building large systems has shown that the difficulty of implementation and the quality and performance of the final result depend heavily on choosing the best data structure.
ARRAYS
In Computer Science, an array is a data structure consisting of a group of elements that
are accessed by indexing. Each data item of an array is known as an element, and the
elements are referenced by a common name known as the array name.
For example, in Java, an array of ten integers can be declared and created as follows:
int[] anArray;
anArray = new int[10];
An array can also be created using a shortcut. For example:
int[] anArray = {1,2,3,4,5,6,7,8,9,10};
An array element can be accessed using an index value, e.g. int i = anArray[5]; (since indexing starts at zero, this retrieves the sixth element). The size of an array can be found using the length attribute, e.g. int len = anArray.length;
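Putting these pieces together, a minimal Java programme exercising declaration, the initialiser shortcut, indexing and the length attribute might look like this (the variable names are illustrative):

```java
public class ArrayBasics {
    public static void main(String[] args) {
        // Declaration followed by allocation of ten elements:
        int[] anArray;
        anArray = new int[10];

        // The shortcut form declares, allocates and initialises in one step:
        int[] numbers = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

        int i = numbers[5];        // index 5 is the SIXTH element (indices start at 0)
        int len = numbers.length;  // length is a field, not a method, on Java arrays

        System.out.println(i + " " + len); // prints "6 10"
    }
}
```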
Before any array is used in the computer, some memory locations have to be created for
storage of the elements. This is often done by using the DIM instruction of BASIC
programming language or DIMENSION instruction of FORTRAN programming
language. For example, the instruction:
DIM LAGOS (45)
will create 45 memory locations for storage of the elements of the array called LAGOS.
In most programming languages, each element has the same data type
and the array occupies a contiguous area of storage. Most programming languages have a
built-in array data type. Some programming languages support array programming which
generalises operations and functions to work transparently over arrays as they do with
scalars, instead of requiring looping over array members.
Declaration of Arrays
Variables normally only store a single value but, in some situations, it is useful to have a
variable that can store a series of related values – using an array. For example, suppose a
programme is required that will calculate the average age among a group of six students.
The ages of the students could be stored in six integer variables in C:
int age1;
int age2;
int age3;
However, a better solution would be to declare a six-element array:
int age[6];
This creates a six element array; the elements can be accessed as age[0] through age[5] in
C.
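The average-age calculation described above can be sketched in Java as follows (the student ages are made-up sample data):

```java
public class AverageAge {
    // Compute the average of the values held in an array.
    static double average(int[] values) {
        int sum = 0;
        for (int v : values) {   // visit every element in turn
            sum += v;
        }
        return (double) sum / values.length;
    }

    public static void main(String[] args) {
        int[] age = {18, 21, 19, 20, 22, 20}; // six related values in ONE variable
        System.out.println(average(age));     // prints 20.0
    }
}
```

Storing the six related values in one array, rather than in six separate variables, also means the same loop works unchanged if the class grows to sixty students.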
A two-dimensional array (in which the elements are arranged into rows and columns) declared by, say, DIM X(3,4) can be stored as a linear array in the computer memory by determining the product of the subscripts.
The above can thus be expressed as DIM X (3 * 4) or DIM X (12).
Multi-dimensional arrays can be stored as linear arrays in order to reduce the
computation time and memory.
Multi-dimensional Arrays
Ordinary arrays are indexed by a single integer. Also useful, particularly in numerical and
graphics applications, is the concept of a multi-dimensional array, in which we index into
the array using an ordered list of integers, such as in a[3,1,5]. The number of integers in
the list used to index into the multi-dimensional array is always the same and is referred
to as the array's dimensionality, and the bounds on each of these are called the array's
dimensions. An array with dimensionality k, is often called k-dimensional. One-
dimensional arrays correspond to the simple arrays discussed thus far; two-dimensional
arrays are a particularly common representation for matrices. In practice, the
dimensionality of an array rarely exceeds three. Mapping a one-dimensional array into
memory is obvious, since memory is logically itself a (very large) one-dimensional array.
When we reach higher-dimensional arrays, however, the problem is no longer obvious.
Suppose we want to represent a simple two-dimensional array of rows and columns. It is most common to index such an array using the RC-convention, where elements are referred to in row, column fashion, e.g. a[2][3] for the element in row 2, column 3.
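The mapping from a (row, column) pair onto linear memory can be sketched as follows (a small illustrative Java example; the method name is an assumption, not from the text). This is the usual row-major layout, in which each row is stored one after the other:

```java
public class RowMajor {
    // Map a (row, column) index of an nRows x nCols array onto
    // a single offset into a linear array, row by row.
    static int offset(int row, int col, int nCols) {
        return row * nCols + col;
    }

    public static void main(String[] args) {
        // A 3 x 4 array stored in a linear array of 3 * 4 = 12 elements,
        // as with DIM X(3,4) expressed as DIM X(12):
        int[] linear = new int[3 * 4];
        linear[offset(2, 3, 4)] = 99;        // element in row 2, column 3
        System.out.println(offset(2, 3, 4)); // prints 11, the last cell
    }
}
```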
Classification of Arrays
Arrays can be classified as static arrays (i.e. whose size cannot change once their storage
has been allocated), or dynamic arrays, which can be resized.
Processing Arrays
Although array-based iteration is useful when dealing with very simple data structures, it
is quite difficult to construct generalized algorithms that do much more than process
every element of an array from start to finish. For example, suppose you want to process
only every second item; include or exclude specific values based on some selection
criteria; or even process the items in reverse order. Being tied to arrays also makes it
difficult to write applications that operate on databases or files without first copying the
data into an array for processing. Using simple array-based iteration not only ties
algorithms to using arrays, but also requires that the logic for determining which elements
stay, which go, and in which order to process them, is known in advance. Even worse, if
you need to perform the iteration in more than one place in your code, you will likely end
up duplicating the logic. This clearly isn’t a very extensible approach. Instead, what’s
needed is a way to separate the logic for selecting the data from the code that actually
processes it. An iterator (also known as an enumerator) solves these problems by
providing a generic interface for looping over a set of data so that the underlying data
structure or storage mechanism—such as an array, database, and so on—is hidden.
Whereas simple iteration generally requires you to write specific code to handle where
the data is sourced from or even what kind of ordering or preprocessing is required, an
iterator enables you to write simpler, more generic algorithms. An iterator provides a
number of operations for traversing and accessing data.
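As an illustrative sketch (using Java's standard java.util.Iterator interface; the ArrayIterator class itself is hypothetical), an iterator over an array hides the underlying storage from the traversal code:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// An iterator over an int array: the loop in main() never indexes
// into the array directly, so the storage mechanism stays hidden.
public class ArrayIterator implements Iterator<Integer> {
    private final int[] data;
    private int pos = 0;

    public ArrayIterator(int[] data) { this.data = data; }

    public boolean hasNext() { return pos < data.length; }

    public Integer next() {
        if (!hasNext()) throw new NoSuchElementException();
        return data[pos++];
    }

    public static void main(String[] args) {
        Iterator<Integer> it = new ArrayIterator(new int[]{10, 20, 30});
        int sum = 0;
        while (it.hasNext()) {
            sum += it.next();    // generic traversal: no array indexing here
        }
        System.out.println(sum); // prints 60
    }
}
```

The summing loop would work unchanged against any other Iterator<Integer>, whether backed by an array, a linked list, a file or a database cursor.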
A Reverse Iterator
Sometimes you will want to reverse the iteration order without changing the code that
processes the values. Imagine an array of names that is sorted in ascending order, A to Z,
and displayed to the user somehow. If the user chose to view the names sorted in
descending order, Z to A, you might have to re-sort the array or at the very least
implement some code that traversed the array backward from the end. With a reverse
iterator, however, the same behavior can be achieved without re-sorting and without
duplicated code. When the application calls first(), the reverse iterator actually calls last()
on the underlying iterator. When the application calls next(), the underlying iterator’s
previous() method is invoked, and so on. In this way, the behavior of the iterator can be reversed without changing the client code that displays the results, and without re-sorting the array, which could be quite processing-intensive.
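A reverse iterator along these lines might be sketched as follows (an illustrative example, not a standard API; for simplicity it wraps the array directly rather than an underlying forward iterator):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// A reverse iterator: "first" starts at the END of the array and
// each call to next() walks backwards, so the client display loop
// is unchanged while the order it sees is reversed.
public class ReverseIterator implements Iterator<String> {
    private final String[] data;
    private int pos;

    public ReverseIterator(String[] data) {
        this.data = data;
        this.pos = data.length - 1;  // "first" is the underlying "last"
    }

    public boolean hasNext() { return pos >= 0; }

    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        return data[pos--];          // "next" is the underlying "previous"
    }

    public static void main(String[] args) {
        String[] names = {"Ada", "Bola", "Chi"};  // sorted A to Z (sample data)
        Iterator<String> it = new ReverseIterator(names);
        while (it.hasNext()) {
            System.out.println(it.next());        // prints Chi, Bola, Ada
        }
    }
}
```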
Applications of Arrays
Arrays are employed in many computer applications in which data items need to be saved
in the computer memory for subsequent reprocessing. Due to their performance
characteristics, arrays are used to implement other data structures, such as heaps, hash
tables, deques, queues, stacks and strings.
would now find that the list had grown to include two copies of “swimming”. The major
difference between arrays and lists, however, is that whereas an array is fixed in size, lists
can resize—growing and shrinking—as necessary.
A list data structure is a sequential data structure, i.e. a collection of items accessible one
after the other, beginning at the head and ending at the tail. It is a widely used data
structure for applications which do not need random access. Lists differ from the stacks
and queues data structures in that additions and removals can be made at any position in
the list.
Elements of a List
The sentence “Dupe is not a boy” can be written as a list as follows:
We regard each word in the sentence above as a data-item or datum, which is linked to the next datum by a pointer. A datum plus a pointer make up one node of the list. The last pointer in the list is called a terminator. It is often convenient to speak of the first item as the head of the list, and the remainder of the list as the tail.
Operations
The main primitive operations of a list are known as:
Add adds a new node
Set updates the contents of a node
Remove removes a node
Get returns the value at a specified index
IndexOf returns the index in the list of a specified element
Additional primitives can be defined:
IsEmpty reports whether the list is empty
IsFull reports whether the list is full
Initialise creates/initialises the list
Destroy deletes the contents of the list (may be implemented by re-initialising the list)
Initialise Creates the structure – i.e. ensures that the structure exists but contains no elements, e.g. Initialise(L) creates a new empty list named L
Add
e.g. Add(1,X,L) adds the value X to list L at position 1 (the start of the list is position 0),
shifting subsequent elements up. For example, if L contains A B C, then after Add(1,X,L) it contains A X B C.
Fig. 3: List after adding value
Set
e.g. Set(2,Z,L) updates the value at position 2 to be Z, so the list becomes A X Z C.
Fig. 4: List after update
Remove
e.g. Remove(Z,L) removes the node with value Z, so the list A X Z C becomes A X C.
Fig. 5: List before removal
Get
e.g. Get(2,L) returns the value of the third node, i.e. C
IndexOf
e.g. IndexOf(X,L) returns the index of the node with value X, i.e. 1
List Implementation
There are many ways to implement a list, depending on how the programmer will use lists in their programme. The two most common are an array-based implementation and a linked list.
1. Array List: As the name suggests, an array list uses an array to hold the values.
2. Linked List: A linked list, conversely, is a chain of elements in which each item has a
reference (or link) to the next (and optionally previous) element.
Array Lists
As the name suggests, an array list uses an array as the underlying mechanism for storing
elements. Because of this, the fact that you can index directly into arrays makes
implementing access to elements very easy. It also makes an array list the fastest
implementation for indexed and sequential access. The downside to using an array is that each time you insert a new element, you need to shift any elements in higher positions one place to the right by physically copying them. Similarly, when deleting an existing
element, you need to shift any objects in higher positions one place to the left to fill the
gap left by the deleted element. Additionally, because arrays are fixed in size, anytime
you need to increase the size of the list, you also need to reallocate a new array and copy
the contents over. This clearly affects the performance of insertion and deletion.
Linked List
The Linked List is stored as a sequence of linked nodes. Rather than use an array to hold
the elements, a linked list contains individual elements with links between them. As in
the case of the stack, each node in a linked list contains data AND a reference to the next
node. It also makes insertion and deletion much simpler than it is for an array list.
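A minimal linked-list sketch in Java (the class and method names are illustrative) shows how insertion and removal only relink references, rather than shifting elements as an array list must:

```java
// A minimal singly linked list of integers: each node holds a datum
// and a reference to the next node; the list keeps a head reference.
public class LinkedIntList {
    static class Node {
        int datum;
        Node next;
        Node(int datum) { this.datum = datum; }
    }

    private Node head;

    // Insert at the front: only one reference is updated,
    // no existing elements are copied or shifted.
    void addFirst(int value) {
        Node n = new Node(value);
        n.next = head;
        head = n;
    }

    // Remove the first node holding the given value by relinking around it.
    void remove(int value) {
        if (head == null) return;
        if (head.datum == value) { head = head.next; return; }
        for (Node cur = head; cur.next != null; cur = cur.next) {
            if (cur.next.datum == value) {
                cur.next = cur.next.next;  // unlink the node
                return;
            }
        }
    }

    String contents() {
        StringBuilder sb = new StringBuilder();
        for (Node cur = head; cur != null; cur = cur.next) {
            sb.append(cur.datum).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        LinkedIntList list = new LinkedIntList();
        list.addFirst(3);
        list.addFirst(2);
        list.addFirst(1);
        list.remove(2);
        System.out.println(list.contents()); // prints "1 3"
    }
}
```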
iii. Doubly Linked Lists
This permits scanning or searching of the list in both directions. (To go backwards in a
simple list, it is necessary to go back to the start and scan forwards.) In this case, the node
structure is altered to have two links. This double linking makes it possible to traverse the
elements in either direction. It also makes insertion and deletion much simpler than it is
for an array list.
As you might recall from the discussion on array lists, in most cases when deleting or
inserting, some portion of the underlying array needs to be copied. With a linked list,
however, each time you wish to insert or delete an element, you need only update the
references to and from the next and previous elements, respectively. This makes the cost
of the actual insertion or deletion almost negligible in all but the most extreme cases. For
lists with extremely large numbers of elements, the traversal time can be a performance
issue. A doubly linked list also maintains references to the first and last elements in the
list—often referred to as the head and tail, respectively. This enables you to access either
end with equal performance.
A frequently used metaphor is the idea of a stack of plates in a spring-loaded cafeteria stack. In such a stack, only the top plate is visible and accessible to the user; all other plates remain hidden. As new plates are added, each new plate becomes the top of the
stack, hiding each plate below, pushing the stack of plates down. As the top plate is
removed from the stack, the plates pop back up, and the second plate becomes the top of
the stack.
Application of Stacks
Stacks are used extensively at every level of a modern computer system. For example, a
modern PC uses stacks at the architecture level, which are used in the basic design of an
operating system for interrupt handling and operating system function calls. Among other
uses, stacks are used to run a Java Virtual Machine, and the Java language itself has a
class called "Stack", which can be used by the programmer.
Stacks have many other applications. For example, as a processor executes a programme, when a function call is made, the called function must know how to return to the caller, so the current address of programme execution is pushed onto a stack. Once the function is finished, the address that was saved is removed from the stack, and execution of the programme resumes. If a series of function calls occurs, the successive return addresses are pushed onto the stack in LIFO order so that each function can return to its calling programme. Stacks thus support recursive function calls and subroutine calls, and are also used in expression evaluation, especially when “reverse Polish notation” is involved.
Solving a search problem, regardless of whether the approach is exhaustive or optimal,
needs stack space. Examples of exhaustive search methods are brute force and
backtracking. Examples of optimal search exploring methods are branch and bound and
heuristic solutions. All of these algorithms use stacks to remember the search nodes that
have been noticed but not explored yet.
Another common use of stacks at the architecture level is as a means of allocating and
accessing memory.
Operations on a Stack
The stack is usually implemented with two basic operations known as "push" and "pop".
Thus, two operations applicable to all stacks are:
A push operation, in which a data item is placed at the location pointed to by the stack
pointer and the address in the stack pointer is adjusted by the size of the data item; Push
adds a given node to the top of the stack leaving previous nodes below.
A pop or pull operation, in which a data item at the current location pointed to by the
stack pointer is removed, and the stack pointer is adjusted by the size of the data item.
Pop removes and returns the current top node of the stack.
The main primitives of a stack are known as:
Push adds a new node
Pop removes a node
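These two primitives can be sketched in Java as follows (an illustrative linked-node implementation, not the java.util.Stack class mentioned earlier):

```java
import java.util.EmptyStackException;

// A stack of integers built on linked nodes: push adds a new node on
// top, hiding the previous top; pop removes and returns the current
// top node's value (last in, first out).
public class IntStack {
    private static class Node {
        int value;
        Node below;
        Node(int value, Node below) { this.value = value; this.below = below; }
    }

    private Node top;

    void push(int value) {
        top = new Node(value, top);  // new node becomes the visible top
    }

    int pop() {
        if (top == null) throw new EmptyStackException();
        int value = top.value;
        top = top.below;             // the node below "pops back up"
        return value;
    }

    public static void main(String[] args) {
        IntStack s = new IntStack();
        s.push(1); s.push(2); s.push(3); // push X, Y, Z in turn
        System.out.println(s.pop());     // prints 3: last in, first out
        System.out.println(s.pop());     // prints 2
    }
}
```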
Figure 11 shows the insertion of three data X, Y and Z to a stack and the removal of two
data, Z and Y, from the stack.
Initialise
Creates the structure – i.e. ensures that the structure exists but contains
no elements e.g. Initialise(S) creates a new empty stack named S
Pop
e.g. Pop(S) removes the TOP node and returns its value
Customers line up in a bank waiting to be served by a teller and in supermarkets waiting
to check out. No doubt you’ve been stuck waiting in a line to speak to a customer service
representative at a call center. In computing terms, however, a queue is a list of data
items stored in such a way that they can be retrieved in a definable order. The main
distinguishing feature between a queue and a list is that whereas all items in a list are
accessible—by their position within the list—the only item you can ever retrieve from a
queue is the one at the head. Which item is at the head depends on the specific queue
implementation.
More often than not, the order of retrieval is indeed the same as the order of insertion
(also known as first-in-first-out, or FIFO), but there are other possibilities as well. Some
of the more common examples include a last-in-first-out queue and a priority queue,
whereby retrieval is based on the relative priority of each item. You can even create a
random queue that effectively “shuffles” the contents. Queues are often described in
terms of producers and consumers. A producer is anything that stores data in a queue,
while a consumer is anything that retrieves data from a queue.
Queues can be either bounded or unbounded. Bounded queues have limits placed on the
number of items that can be held at any one time. These are especially useful when the
amount of available memory is constrained—for example, in a device such as a router or
even an in-memory message queue. Unbounded queues, conversely, are free to grow in
size as the limits of the hardware allow.
The queue data structure is characterised by the fact that additions are made at the end, or
tail, of the queue while removals are made from the front, or head of the queue. For this
reason, a queue is referred to as a FIFO (First-In First-Out) structure. Figure 14 shows a queue of some letters of the English alphabet, with insertion at the tail and deletion at the head.
Application of Queues
Queues are very important structures in computer simulations, data processing,
information management, and in operating systems.
In simulations, queue structures are used to represent real-life events such as car queues at traffic light junctions and petrol filling stations, queues of people at the check-out point in supermarkets, queues of bank customers, etc.
In operating systems, queue structures are used to represent different programmes in the
computer memory in the order in which they are executed. For example, if a programme,
J is submitted before programme K, then programme J is queued before programme K in
the computer memory and programme J is executed before programme K.
Operations on a Queue
The main primitive operations on a queue are known as:
Enqueue: Stores a value in the queue. The size of the queue will increase by one.
Dequeue: Retrieves the value at the head of the queue. The size of the queue will
decrease by one. Throws EmptyQueueException if there are no more items in the queue.
Clear: Deletes all elements from the queue. The size of the queue will be reset to zero
(0).
Size: Obtains the number of elements in the queue.
IsEmpty: Determines whether the queue is empty (size() = 0) or not.
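The operations above can be sketched with a linked implementation in Java (illustrative names; for simplicity this sketch throws the standard NoSuchElementException rather than a custom EmptyQueueException, and omits Clear):

```java
import java.util.NoSuchElementException;

// A FIFO queue: enqueue at the tail, dequeue from the head.
public class IntQueue {
    private static class Node {
        int dataItem;
        Node nextNode;
        Node(int dataItem) { this.dataItem = dataItem; }
    }

    private Node head, tail;  // references to both ends, as described below
    private int size;

    void enqueue(int value) {
        Node n = new Node(value);
        if (tail == null) { head = tail = n; }  // first element
        else { tail.nextNode = n; tail = n; }   // append at the tail
        size++;
    }

    int dequeue() {
        if (head == null) throw new NoSuchElementException("empty queue");
        int value = head.dataItem;
        head = head.nextNode;                   // advance the head
        if (head == null) tail = null;          // queue became empty
        size--;
        return value;
    }

    int size() { return size; }
    boolean isEmpty() { return size == 0; }

    public static void main(String[] args) {
        IntQueue q = new IntQueue();
        q.enqueue(1); q.enqueue(2); q.enqueue(3);
        System.out.println(q.dequeue()); // prints 1: first in, first out
        System.out.println(q.size());    // prints 2
    }
}
```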
Initialise
Creates the structure – i.e. ensures that the structure exists but contains no elements.
e.g. Initialise(Q) creates a new empty queue named Q
Add
e.g. Add(X,Q) adds the value X to the tail of Q
Fig. 15: Queue after adding the value X to the tail of Q
Then Add(Y,Q) adds the value Y to the tail of Q, giving the queue X Y.
Remove
e.g. Remove(Q) removes the head node and returns its value (X), leaving the queue Y
Starting from a queue containing A:
Operation    Queue contents    Value removed
Add(B,Q)     A B               -
Add(C,Q)     A B C             -
Remove(Q)    B C               A
Add(F,Q)     B C F             -
Remove(Q)    C F               B
Remove(Q)    F                 C
Remove(Q)    empty             F
Storing a Queue in a Dynamic Data Structure
A queue requires a reference to the head node AND a reference to the tail node. The
following diagram describes the storage of a queue called Queue. Each node consists of
data (DataItem) and a reference (NextNode).
Blocking Queues
Queues are often used in multi-threaded environments as a form of interprocess
communication. Unfortunately, a plain FIFO queue is unsafe for use in situations where multiple consumers would be accessing it concurrently. Instead, a blocking queue is one
way to provide a thread-safe implementation, ensuring that all access to the data is
correctly synchronized. The first main enhancement that a blocking queue offers over a
regular queue is that it can be bounded.
So far, we have only dealt with unbounded queues—those that continue to grow without
limit. The blocking queue enables you to set an upper limit on the size of the queue.
Moreover, when an attempt is made to store an item in a queue that has reached its limit,
the queue will, you guessed it, block the thread until space becomes available—either by
removing an item or by calling clear(). In this way, you guarantee that the queue will
never exceed its predefined bounds. The second major feature affects the behavior of dequeue(). When an attempt is made to retrieve an item from an empty queue, a blocking queue will block the current thread until an item is enqueued. This is good for
implementing work queues where multiple, concurrent consumers need to wait until there
are more tasks to perform.
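Java's standard library provides exactly this behaviour in java.util.concurrent.ArrayBlockingQueue; a small demonstration follows (the task names and the bound of 2 are made up for illustration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A bounded, thread-safe queue: put() blocks when the queue is full,
// take() blocks when it is empty.
public class BlockingQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> tasks = new ArrayBlockingQueue<>(2); // bound of 2

        // Producer thread: stores work items in the queue.
        Thread producer = new Thread(() -> {
            try {
                tasks.put("task-1");
                tasks.put("task-2");
                tasks.put("task-3"); // blocks here until a consumer takes one
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer: take() waits, if necessary, until an item is enqueued.
        System.out.println(tasks.take()); // prints task-1
        System.out.println(tasks.take()); // prints task-2
        System.out.println(tasks.take()); // prints task-3
        producer.join();
    }
}
```

The non-blocking offer() method can be used instead of put() when a producer would rather be told the queue is full than wait.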
There is a dispatch queue associated with each processor. The communication between the scheduler and the processors is through the dispatch queues.
The scheduler makes sure that each dispatch queue is filled with a minimum number of
tasks so that a processor could always find a task in its dispatch queue when it finishes
execution of a task. The scheduler determines a feasible schedule based on the worst
case computation times of tasks satisfying their timing and resource constraints. The
scheduling algorithm has full knowledge about the currently active set of tasks, but not
about the new set of tasks that may arrive while scheduling the current task set. The
objective of dynamic scheduling is to minimize the makespan, thereby improving the guarantee ratio. The guarantee ratio is the percentage of tasks that arrived in the system
whose deadlines are met. The scheduler must also guarantee that the tasks already
scheduled are going to meet their deadlines. The scheduler model consists of a minimum of 5 processors and a maximum of 10 processors. The scheduler model is shown in Fig. 3.1.
Fig. 3.1: The Scheduler Model
(Source: Oluwadare, 2009)
TREES DATA STRUCTURE
A tree is often used to represent a hierarchy. This is because the relationships
between the items in the hierarchy suggest the branches of a botanical tree.
In a simple unordered tree diagram, for instance, the node labeled 7 may have two children, labeled 2 and 6, and one parent, labeled 2; the root node, at the top, has no parent. In computer
science, a tree is a widely-used data structure that emulates a hierarchical tree structure
with a set of linked nodes. A node is a structure which may contain a value, a condition,
or represent a separate data structure (which could be a tree of its own). Each node in a
tree has zero or more child nodes, which are below it in the tree (by convention, trees are
drawn growing downwards). A node that has a child is called the child's parent node (or
ancestor node, or superior). A node has at most one parent. Nodes that do not have any
children are called leaf nodes. They are also referred to as terminal nodes.
The height of a node is the length of the longest downward path to a leaf from that node.
The height of the root is the height of the tree. The depth of a node is the length of the
path to its root (i.e., its root path). This is commonly needed in the manipulation of the
various self balancing trees, AVL Trees in particular. Conventionally, the value −1
corresponds to a subtree with no nodes, whereas zero corresponds to a subtree with one
node.
The topmost node in a tree is called the root node. Being the topmost node, the root node
will not have parents. It is the node at which operations on the tree commonly begin
(although some algorithms begin with the leaf nodes and work up ending at the root). All
other nodes can be reached from it by following edges or links. (In the formal definition,
each such path is also unique). In diagrams, it is typically drawn at the top. In some trees,
such as heaps, the root node has special properties. Every node in a tree can be seen as
the root node of the subtree rooted at that node.
An internal node or inner node is any node of a tree that has child nodes and is thus not
a leaf node. Similarly, an external node or outer node is any node that does not have
child nodes and is thus a leaf.
The subtree corresponding to the root node is the entire tree; the subtree corresponding to any other node is called a proper subtree (in analogy to the term proper subset).
The figure shows a tree of height 3, with the root node, a right child node, the edges or links, and a leaf node labeled. This forms a complete tree, whose height is defined as the number of links from the root to the deepest leaf.
Key terms
Root Node
Node at the "top" of a tree - the one from which all operations on the tree
commence. The root node may not exist (a NULL tree with no nodes in it) or
have 0, 1 or 2 children in a binary tree.
Leaf Node
Node at the "bottom" of a tree - farthest from the root. Leaf nodes have no
children.
Complete Tree
Tree in which each leaf is at the same distance from the root. A more precise and
formal definition of a complete tree is set out later.
Height
Number of nodes which must be traversed from the root to reach a leaf of a tree.
Binary Trees
In a binary tree, each node has at most two sub-trees. The nodes at the lowest levels of the tree (the ones with no sub-trees) are called leaves.
An ordered binary tree (binary search tree) is one in which:
1. the keys of all the nodes in the left sub-tree are less than that of the root,
2. the keys of all the nodes in the right sub-tree are greater than that of the root, and
3. the left and right sub-trees are themselves ordered binary trees.
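A binary search tree respecting these ordering rules can be sketched in Java as follows (illustrative code; this version simply ignores duplicate keys):

```java
// A binary search tree: keys in the left sub-tree are less than the
// root's key, keys in the right sub-tree are greater, and both
// sub-trees are themselves binary search trees.
public class BinarySearchTree {
    private static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;

    void insert(int key) { root = insert(root, key); }

    private Node insert(Node node, int key) {
        if (node == null) return new Node(key);
        if (key < node.key) node.left = insert(node.left, key);
        else if (key > node.key) node.right = insert(node.right, key);
        return node; // duplicate keys are ignored
    }

    boolean contains(int key) {
        Node cur = root;
        while (cur != null) {
            if (key == cur.key) return true;
            cur = key < cur.key ? cur.left : cur.right; // the ordering guides the search
        }
        return false;
    }

    public static void main(String[] args) {
        BinarySearchTree t = new BinarySearchTree();
        for (int k : new int[]{5, 3, 8, 1, 4}) t.insert(k);
        System.out.println(t.contains(4)); // prints true
        System.out.println(t.contains(7)); // prints false
    }
}
```

Because the ordering rules rule out one whole sub-tree at every step, a search in a reasonably balanced tree inspects only about log n of the n nodes.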
Traversal methods
There are many different applications of trees. As a result, there are many different
algorithms for manipulating them. However, many of the different tree algorithms
have in common the characteristic that they systematically visit all the nodes in the
tree. That is, the algorithm walks through the tree data structure and performs
some computation at each node in the tree. This process of walking through the
tree is called a tree traversal.
Stepping through the items of a tree, by means of the connections between parents and
children, is called walking the tree, and the action is a walk of the tree. Often, an
operation might be performed when a pointer arrives at a particular node. A walk in
which each parent node is traversed before its children is called a pre-order walk; a walk
in which the children are traversed before their respective parents are traversed is called a
post-order walk; a walk in which a node's left subtree, then the node itself, and then
finally its right subtree are traversed is called an in-order traversal. (This last scenario,
referring to exactly two subtrees, a left subtree and a right subtree, assumes specifically a
binary tree.)
Preorder Traversal
The first depth-first traversal method we consider is called preorder
traversal. Preorder traversal is defined recursively as follows: To do a
preorder traversal of a general tree:
1. Visit the root first; and then
2. Do a preorder traversal of each of the subtrees of the root, one by one, in the order given.
Preorder traversal gets its name from the fact that it visits the root first.
In the case of a binary tree, the algorithm becomes:
1. Visit the root first; and then
2. Traverse the left subtree; and then
3. Traverse the right subtree.
Notice that the preorder traversal visits the nodes of the tree in precisely the same
order in which they are written. A preorder traversal is often done when it is
necessary to print a textual representation of a tree.
Postorder Traversal
The second depth-first traversal method we consider is postorder
traversal. In contrast with preorder traversal, which visits the root first, postorder
traversal visits the root last. To do a postorder traversal of a general tree:
1. Do a postorder traversal of each of the subtrees of the root, one by one, in the order given; and then
2. Visit the root.
To do a postorder traversal of a binary tree
1. Traverse the left subtree; and then
2. Traverse the right subtree; and then
3. Visit the root.
Inorder Traversal
The third depth-first traversal method is inorder traversal. Inorder
traversal only makes sense for binary trees. Whereas preorder traversal visits the
root first and postorder traversal visits the root last, inorder traversal visits the root
in between visiting the left and right subtrees:
1. Traverse the left subtree; and then
2. Visit the root; and then
3. Traverse the right subtree.
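The three traversals can be sketched for a small binary tree as follows (illustrative Java code; the sample tree contains just three nodes, a root A with children B and C):

```java
// Pre-, in- and post-order traversals of a binary tree, each built
// recursively from the three steps listed above.
public class Traversals {
    static class Node {
        char label;
        Node left, right;
        Node(char label, Node left, Node right) {
            this.label = label; this.left = left; this.right = right;
        }
    }

    // Root first, then left subtree, then right subtree.
    static String preorder(Node n) {
        return n == null ? "" : n.label + preorder(n.left) + preorder(n.right);
    }

    // Left subtree, then root, then right subtree.
    static String inorder(Node n) {
        return n == null ? "" : inorder(n.left) + n.label + inorder(n.right);
    }

    // Left subtree, then right subtree, then root last.
    static String postorder(Node n) {
        return n == null ? "" : postorder(n.left) + postorder(n.right) + n.label;
    }

    public static void main(String[] args) {
        //     A
        //    / \
        //   B   C
        Node root = new Node('A', new Node('B', null, null), new Node('C', null, null));
        System.out.println(preorder(root));  // prints ABC (root first)
        System.out.println(inorder(root));   // prints BAC (root in between)
        System.out.println(postorder(root)); // prints BCA (root last)
    }
}
```

Note that on a binary search tree, an inorder traversal visits the keys in ascending sorted order.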
If we relax the restriction that each node can have only one key, we can reduce the height of the tree. An m-way search tree:
a. is empty, or
b. consists of a root containing j (1 <= j < m) keys, k1, ..., kj, and a set of sub-trees, Ti (i = 0..j), such that
i. if k is a key in T0, then k <= k1
ii. if k is a key in Ti (0 < i < j), then ki <= k <= ki+1
iii. if k is a key in Tj, then k > kj, and
iv. all Ti are non-empty m-way search trees or all Ti are empty
A variation of the B-tree, known as a B+-tree, treats all the keys in nodes other than the leaves as dummies. All keys are duplicated in the leaves. This has the advantage that, as all the leaves are linked together sequentially, the entire tree may be scanned without visiting the higher nodes at all.
Key Terms
n-ary trees (or n-way trees)
Trees in which each node may have up to n children.
B-tree
Balanced variant of an n-way tree.
B+-tree
B-tree in which all the leaves are linked to facilitate fast in order traversal.
-----------
AVL tree
An AVL tree is another balanced binary search tree. Named after their inventors,
Adelson-Velskii and Landis, they were the first dynamically balanced trees to be
proposed. AVL tree is a self-balancing Binary Search Tree (BST) where the
25
difference between heights of left and right subtrees cannot be more than one for
all nodes. An AVL tree is a binary search tree which has the following properties:
A binary search tree is an AVL tree if there is no node that has subtrees differing
Adding or removing a leaf from an AVL tree may make many nodes violate the
AVL balance condition, but each violation of AVL balance can be restored by one
or two simple changes called rotations..
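The balance condition itself can be checked with a short sketch (illustrative Java; it uses the convention stated earlier that an empty subtree has height -1):

```java
// Check the AVL balance condition: at every node, the heights of the
// left and right subtrees differ by at most one.
public class AvlCheck {
    static class Node {
        Node left, right;
        Node(Node left, Node right) { this.left = left; this.right = right; }
    }

    // Height of a subtree; -1 for the empty subtree, 0 for a single node.
    static int height(Node n) {
        return n == null ? -1 : 1 + Math.max(height(n.left), height(n.right));
    }

    static boolean isBalanced(Node n) {
        if (n == null) return true;
        return Math.abs(height(n.left) - height(n.right)) <= 1
                && isBalanced(n.left) && isBalanced(n.right);
    }

    public static void main(String[] args) {
        Node balanced = new Node(new Node(null, null), new Node(null, null));
        Node chain = new Node(new Node(new Node(null, null), null), null); // left-heavy
        System.out.println(isBalanced(balanced)); // prints true
        System.out.println(isBalanced(chain));    // prints false
    }
}
```

A real AVL implementation would not recompute heights from scratch like this; it would store the height (or a balance factor) in each node and restore balance with rotations after each insertion or deletion.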
B-TREE
In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node may have more than two children (Comer 1979, p. 123). Unlike self-balancing binary search trees, the B-tree is optimized for systems that read and write large blocks of data. It is commonly used in database environments and file systems.
Fig 2: A B-tree of order 2 (Bayer & McCreight 1972) or order 5 (Knuth 1998).
As depicted in the picture above, in a B-tree the internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. Any time data is inserted into or removed from a node, the number of child nodes changes; in order to maintain the pre-defined range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but they may waste some space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation. For instance, in a 2-3 B-tree (often referred to as a 2-3 tree), each internal node may have only 2 or 3 child nodes.
Each internal node of a B-tree contains a number of keys. The keys act as
separation values which divide its subtrees. A B-tree is kept balanced
by keeping all leaf nodes at the same depth. This depth increases slowly
as elements are added to the tree, but an increase in the overall depth is
infrequent, and results in all leaf nodes being one more node farther away from
the root. B-trees have substantial advantages over alternative
implementations when the time to access the data of a node greatly exceeds
the time spent processing that data, because then the cost of accessing the
node may be amortized over multiple operations within the node. This
usually occurs when the node data are in secondary storage
such as disk drives. By maximizing the number of keys within each internal
node, the height of the tree decreases and the number of expensive node
accesses is reduced. In addition, rebalancing of the tree occurs less often.
The maximum number of child nodes depends on the information that must
be stored for each child node and the size of a full disk block or an
analogous size in secondary storage. While 2-3 B-trees are easier to explain,
practical B-trees using secondary storage need a large number of child nodes to
improve performance.
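The height reduction from a large fanout can be estimated with a short calculation. The sketch below uses the usual ceil(log_b(n+1)) lower bound as a rough estimate of tree height; it is illustrative, not a property of any particular B-tree implementation:

```python
import math

def min_height(n, b):
    # rough lower bound on the height of a search tree storing
    # n keys with branching factor (fanout) b
    return math.ceil(math.log(n + 1, b))

# a billion keys: a binary tree needs about 30 levels,
# while a B-tree with fanout 100 needs only about 5
```

With each level typically costing one disk access, the difference in height is exactly why practical B-trees favour many children per node.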
Note that the term B-tree may refer to a specific design or to a
general class of designs. In the narrow sense, a B-tree stores keys in its
internal nodes but need not store those keys in the records at the leaves.
The general class of B-trees includes variations such as the B+ tree, which we shall
see in the next section.
B + TREE
A B+ tree can be seen as a B-tree in which each node contains only keys (not
key-value pairs), and to which an additional level of linked leaves is added at
the bottom. In the simple B+ tree depicted in the example below, the tree
links the keys 1-7 to the data values d1-d7. The linked list (shown in
red) allows a rapid in-order traversal; the branching factor is b = 4.
Fig 3: A B+ tree structure.
The primary value of a B+ tree lies in its ability to store data for
efficient retrieval in a block-oriented storage context, particularly in file
systems. This is primarily because, unlike a binary search tree, B+ trees have
a very high fanout (the number of pointers to child nodes in a node, typically
on the order of 100 or more), which reduces the number of I/O
operations required to find an element in the tree.
The NTFS, ReiserFS, NSS, XFS, JFS, ReFS and BFS file systems all use
this type of tree for metadata indexing; BFS also uses B+ trees for storing
directories. Relational database management systems such as IBM DB2,
Informix, Microsoft SQL Server, Oracle 8, Sybase ASE and SQLite use
this type of tree for table indices. Key-value database management systems
such as CouchDB and Tokyo Cabinet use this type of tree for efficient
data access.
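The linked leaves make the fast in-order scan described above straightforward. Here is a minimal sketch; the Leaf class is hypothetical, standing in for the bottom level of a B+ tree:

```python
class Leaf:
    # hypothetical B+-tree leaf: sorted keys plus a link to the next leaf
    def __init__(self, keys):
        self.keys = keys
        self.next = None

def scan_all(first_leaf):
    # in-order traversal using only the leaf links,
    # never visiting the internal nodes
    out, leaf = [], first_leaf
    while leaf is not None:
        out.extend(leaf.keys)
        leaf = leaf.next
    return out
```

Range queries work the same way: descend once to the first qualifying leaf, then follow the links.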
COMPARISON BETWEEN B-TREE AND B+ TREE
B-tree: has a lower fan-out compared with B+ trees.
B+ tree: has a very high fan-out (number of pointers to child nodes in a node).
B-tree: leaf nodes have no linkage (i.e. no leaf node points to another leaf node).
B+ tree: leaf nodes are linked with each other.
B-tree: data is stored with each key, in interior nodes as well as leaves.
B+ tree: interior nodes hold keys only; no data is associated with them.
M-WAY TREE
An m-way tree is a multi-way tree in which a node can have more than two
children. A multi-way tree of order m (known as an m-way tree) is one in
which each node can have at most m children. As with the other trees
previously mentioned, a node in an m-way tree is made up of key fields, in
this case m-1 key fields, and pointers to children. In order to make the
processing of an m-way tree easier, some type of order is imposed on the keys
within each node, resulting in a multi-way search tree of order m. Hence, by
definition, an m-way search tree is an m-way tree in which the keys in each
node are in ascending order, and all keys in the subtree reached through the
i-th pointer lie between the (i-1)-th and i-th keys of the node.
Fig 4: M- way tree structure
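Searching a multi-way search tree follows directly from the ordering condition. The sketch below uses a hypothetical Node class holding sorted keys and child pointers:

```python
class Node:
    # hypothetical m-way node: up to m-1 sorted keys and, if internal,
    # len(keys) + 1 child subtrees
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children  # None for a leaf

def search(node, key):
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1                      # find the first key >= search key
    if i < len(node.keys) and node.keys[i] == key:
        return True                 # found in this node
    if node.children is None:
        return False                # reached a leaf: not present
    return search(node.children[i], key)
```

Within a node the scan is linear here; a real implementation with large m would use binary search on the keys instead.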
Storage Management
An executing program uses memory (storage) for many different purposes,
such as for the machine instructions that represent the executable part of the
program, the values of data objects, and the return location for a function
invocation.
1.2.1 Static Memory Management
When memory is allocated at compilation time, it is called ‘Static
Memory Management’. This memory is fixed and cannot be increased or
decreased after allocation. If more memory is allocated than required,
memory is wasted; if less memory is allocated than required, the
program will not run successfully. So exact memory requirements must be
known in advance.
1.2.2 Dynamic Memory Management
When memory is allocated at run/execution time, it is called ‘Dynamic
Memory Management’. This memory is not fixed; it is allocated according
to the program's requirements, so no memory is wasted and there is no
need to know exact memory requirements in advance.
2.0 Phases of Storage Management
In general, storage is managed in three phases: 1) allocation, in which
needed storage is found from available (unused) storage and assigned to the
program; 2) recovery, in which storage that is no longer needed is made
available for reuse; and 3) compaction, in which blocks of storage that are in
use but are separated by blocks of unused storage are moved together in
order to provide larger blocks of available storage. (Compaction is desirable,
but usually it is difficult or impossible to do in practice so it is not often
done.) These phases may be repeated many times during the execution of a
program.
HASHING
Hashing is a technique for performing insertion, deletion and find operations
in almost constant time.
If one wants to store a certain set of similar objects and wants to quickly access a
given one (or come back with the result that it is unknown), the first idea would be
to store them in a list, possibly sorted for faster access. This, however, would still
need log(n) comparisons to find a given element or to decide that it is not yet
stored. Therefore one uses a much bigger array and an integer-valued function on
the space of possible objects to decide where in the array to store a certain
object. If this so-called hash function distributes the actually stored objects well,
each object can be found in (almost) constant time.
The idea of hashing is to distribute the entries (key/value pairs) across an array of
buckets. Given a key, the algorithm computes an index that suggests where the
entry can be found. In this method, the hash is independent of the array size,
and it is then reduced to an index (a number between 0 and array_size − 1)
using the modulo operator (%). In the case that the array size is a power of two,
the remainder operation is reduced to masking, which improves speed but can
increase problems with a poor hash function.
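The reduction of a hash value to an index can be sketched in a couple of lines. The masking variant relies on the array size being a power of two; the function names are illustrative:

```python
def index_modulo(hash_value, size):
    # general case: reduce any hash value to the range 0 .. size-1
    return hash_value % size

def index_mask(hash_value, size):
    # when size is a power of two, the modulo reduces to a bit mask,
    # which is faster but exposes weaknesses of a poor hash function
    # (only the low-order bits of the hash are used)
    return hash_value & (size - 1)
```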
HASH FUNCTIONS
A hash function employs some algorithm to compute a key K of fixed size for
each data element in the set U. The same key K can be used to map data to a
hash table, and all the operations like insertion, deletion and searching should
be possible using it. The values returned by a hash function are also referred to
as hash values, hash codes, hash sums, or hashes.
Here are some relatively simple hash functions that have been used:
Division-remainder method: The size of the table (the number of items to be
stored) is estimated. That number is then used as a divisor into each original
value or key to extract a quotient and a remainder; the remainder is the hashed
value. (Since this method is liable to produce a number of collisions, any search
mechanism would have to be able to recognize a collision and offer an
alternative search path.)
Folding method: This method divides the original value (digits in this case)
into several parts, adds the parts together, and then uses the last four digits (or
some other arbitrary number of digits that will work) as the hashed value or
key.
Radix transformation method: Where the value or key is digital, the number
base (or radix) can be changed, resulting in a different sequence of digits. (For
example, a decimal-numbered key could be transformed into a hexadecimal-numbered
key.) High-order digits can be discarded to fit a hash value of
uniform length.
Digit rearrangement method: This is simply taking part of the original value or
key, such as the digits in positions 3 through 6, reversing their order, and then
using that sequence of digits as the hashed value or key.
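The folding method above can be sketched as follows; the group size of four digits follows the description in the text, and the function name is illustrative:

```python
def fold_hash(key, digits=4):
    # split the key's decimal digits into groups of `digits`,
    # add the groups, and keep only the last `digits` digits of the sum
    s = str(key)
    parts = [int(s[i:i + digits]) for i in range(0, len(s), digits)]
    return sum(parts) % (10 ** digits)
```

For example, fold_hash(123456789) splits the key into 1234, 5678 and 9, and their sum 6921 becomes the hashed key.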
There are several well-known hash functions used in cryptography. These include
the message-digest hash functions MD2, MD4, and MD5, used for hashing digital
signatures into a shorter value called a message digest, and the Secure Hash
Algorithm (SHA), which produces a larger (160-bit) message digest and is similar
to MD4. A hash function that works well for database storage
and retrieval, however, might not work as well for cryptographic or error-checking
purposes.
HASH TABLES
A hash table is a data structure in which data elements are stored (inserted),
searched and deleted based on the keys generated for each element by a
hashing function. In a hashing system the keys are stored in an array which is
called the hash table. A perfectly implemented hash table would always promise
an average insertion/deletion/retrieval time of O(1).
A hash table is a collection of items which are stored in such a way as to make it
easy to find them later. Each position of the hash table, often called a slot, can
hold an item and is named by an integer value starting at 0. For example, we will
have a slot named 0, a slot named 1, a slot named 2, and so on. Initially, the hash
table contains no items, so every slot is empty. We can implement a hash table by
using a list with each element initialized to the special Python value None. Figure
1 shows a hash table of size \(m=11\). In other words, there are m slots in the table,
named 0 through 10.
The mapping between an item and the slot where that item belongs in the hash
table is called the hash function. The hash function will take any item in the
collection and return an integer in the range of slot names, between 0 and m-1.
Assume that we have the set of integer items 54, 26, 93, 17, 77, and 31. Our first
hash function, sometimes referred to as the remainder method, simply takes an
item and divides it by the table size, returning the remainder as its hash value
(\(h(item)=item \% 11\)). Table 1 gives all of the hash values for our example
items. Note that this remainder method (modulo arithmetic) will typically be
present in some form in all hash functions, since the result must be in the range of
slot names.
Once the hash values have been computed, we can insert each item into the hash
table at the designated position as shown in Figure 2. Note that 6 of the 11 slots
are now occupied. This is referred to as the load factor, and is commonly denoted
by \(\lambda = \frac{\text{number of items}}{\text{table size}}\).
For this example, \(\lambda = \frac{6}{11}\).
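The worked example above can be reproduced in a few lines of Python, using the same items and the remainder-method hash function:

```python
items = [54, 26, 93, 17, 77, 31]
m = 11
table = [None] * m              # empty hash table of size 11

for item in items:
    table[item % m] = item      # h(item) = item % 11

occupied = sum(1 for slot in table if slot is not None)
load_factor = occupied / m      # lambda = 6/11
```

Running this places 77 in slot 0, 26 in slot 4, 93 in slot 5, 17 in slot 6, 31 in slot 9 and 54 in slot 10, matching Table 1 and Figure 2.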
Now when we want to search for an item, we simply use the hash function to
compute the slot name for the item and then check the hash table to see if it is
present. This searching operation is O(1), since a constant amount of time is
required to compute the hash value and then index the hash table at that location.
If everything is where it should be, we have found a constant-time search
algorithm.
We can probably already see that this technique is going to work only if each item
maps to a unique location in the hash table. For example, if the item 44 had been
the next item in our collection, it would have a hash value of 0 (\(44 \% 11 == 0\)).
Since 77 also had a hash value of 0, we would have a problem. According to the
hash function, two or more items would need to be in the same slot. This is
referred to as a collision (it may also be called a “clash”). Clearly, collisions
create a problem for the hashing technique.
HASHING TECHNIQUES
Static hashing: In static hashing, the hash function maps search-key values
to a fixed set of locations (buckets); the number of buckets does not change.
Dynamic hashing: In dynamic hashing, a hash table can grow (or shrink) to
handle more items, with buckets added or removed as the data set changes.
HASH COLLISION
A situation in which the resultant hashes for two or more data elements in the data
set U map to the same location in the hash table is called a hash collision. In such a
situation, two or more data elements qualify to be stored in the same location.
When hash collisions occur, a resolution technique is needed to ensure that two or
more data elements are not stored in the same location. The two common
techniques are:
Separate chaining
Open addressing
SEPARATE CHAINING: is a technique in which the data is not directly stored
at the hash key index (k) of the hash table. Rather, the entry at key index (k) in
the hash table is a pointer to the head of the data structure where the data is
actually stored. In the simplest and most common implementations this data
structure is a linked list, so in the worst case it may take O(n) time
to traverse all the nodes in the linked list to retrieve the data.
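A minimal separate-chaining sketch, using Python lists as the chains (the helper names are illustrative, not a standard API):

```python
m = 11
table = [[] for _ in range(m)]      # one (initially empty) chain per slot

def insert(key):
    table[key % m].append(key)      # data lives in the chain, not the slot

def contains(key):
    # worst case O(n): the whole chain at this slot may be scanned
    return key in table[key % m]
```

Inserting the colliding keys 77 and 44 places both in the chain at slot 0, so no data is lost on a collision.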
• The hash table can hold more elements without the large performance
deterioration of open addressing (the load factor can exceed 1).
• The performance of chaining declines much more slowly than open
addressing.
• The main cost of chaining is the extra space required for the linked lists.
• For some languages, creating new nodes (for linked lists) is expensive and
slows the operation down.
OPEN ADDRESSING: in open addressing, all items are stored in the hash table
itself. In addition to the data, each hash bucket also maintains one of three
states: EMPTY, OCCUPIED, DELETED. While inserting, if a collision occurs,
alternative cells are tried until an empty bucket is found, for which one of the
following techniques is adopted:
i. Linear probing: On a collision, cells are examined in sequence starting from
the original hash address (Address = h(key) + i, wrapping around the table)
until an empty bucket is found.
ii. Quadratic probing: Better behaviour is usually obtained with quadratic
probing, where the secondary hash function depends on the re-hash index:
Address = h(key) + c·i²
This spreads successive probes apart, reducing the clustering of keys
which are mapped to the same value by the primary hash function.
iii. Double hashing: Re-hashing schemes use a second hashing operation when
there is a collision with the original one; if there is a further collision, we
re-hash until an empty slot is found. As long as the functions are applied to a
key in the same order, the key can always be located.
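The probing schemes above can be sketched with quadratic probing as an example (here c = 1; the wrap-around modulo and the RuntimeError on a full table are choices of this sketch, not part of the general technique):

```python
m = 11
EMPTY = None
table = [EMPTY] * m

def insert(key):
    for i in range(m):
        slot = (key % m + i * i) % m   # Address = h(key) + c*i^2, wrapped
        if table[slot] is EMPTY:
            table[slot] = key          # first empty cell on the probe path
            return slot
    raise RuntimeError("table full")
```

Inserting 77 occupies slot 0; inserting the colliding key 44 then probes slot 0, finds it occupied, and settles in slot 1.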
• All items are stored in the hash table itself. There is no need for another
data structure.
• Performance is dependent on choosing a proper table size.
SEARCHING DATA