
D.Y. Patil University, Ambi, Pune


Subject: Advanced Data Structures Code: UPCCCS403
Batch: 2021 Semester: IV
Unit I Hashing
Hash Table- Concepts-hash table, hash function, basic operations, bucket, collision,
probe, synonym, overflow, open hashing, closed hashing, perfect hash function, load
density, full table, load factor, rehashing, issues in hashing, hash functions- properties of
good hash function, division, multiplication, extraction, mid-square, folding and
universal, Collision resolution strategies- open addressing and chaining, Hash table
overflow- open addressing and chaining, extendible hashing, closed addressing and
separate chaining.
Skip List- representation, searching and operations- insertion, removal
What is hashing?
Hashing is the process of transforming any given key or string of characters into another value, usually a shorter, fixed-length value or key that represents the original string and makes it easier to find or employ.
Or
Hashing is a technique or process of mapping keys, and values into the hash table by using a
hash function. It is done for faster access to elements. The efficiency of mapping depends on
the efficiency of the hash function used.
Or
Hashing refers to the process of generating a fixed-size output from an input of variable size
using the mathematical formulas known as hash functions. This technique determines an index
or location for the storage of an item in a data structure.
The most popular use of hashing is the implementation of hash tables. A hash table stores key and value pairs in a list that is accessible through its index. Because the number of possible key and value pairs is unlimited, the hash function maps the keys to indices within the table size. A hash value then becomes the index for a specific element.
A hash function generates new values according to a mathematical hashing algorithm, known
as a hash value or simply a hash. To prevent the conversion of hash back into the original key,
a good hash always uses a one-way hashing algorithm.
Let a hash function H(x) map the value x to the index x % 10 in an array. For example, if the list of values is [11, 12, 13, 14, 15], they will be stored at positions {1, 2, 3, 4, 5} in the array or hash table respectively.
Hashing is relevant to -- but not limited to -- data indexing and retrieval, digital signatures,
cybersecurity and cryptography.

Hash Bucket
Hash buckets are used to apportion data items for sorting or lookup purposes. The aim is to keep the linked lists short so that a search for a specific item completes within a shorter timeframe.

A hash table that uses buckets is actually a combination of an array and a linked list. Each
element in the array [the hash table] is a header for a linked list. All elements that hash into the
same location will be stored in the list. The hash function assigns each record to the first slot
within one of the buckets. In case the slot is occupied, then the bucket slots will be searched
sequentially until an open slot is found. In case a bucket is completely full, the record will get
stored in an overflow bucket of infinite capacity at the end of the table. All buckets share the
same overflow bucket. However, a good implementation will use a hash function that
distributes the records evenly among the buckets so that as few records as possible go into the
overflow bucket.
What is hashing used for?
Hashing uses functions or algorithms to map object data to a representative integer value. A
hash can then be used to narrow down searches when locating these items on that object data
map.
For example, in hash tables, developers store data -- perhaps a customer record -- in the form
of key and value pairs. The key helps identify the data and operates as an input to the hashing
function, while the hash code or the integer is then mapped to a fixed size.
Hash tables support functions/operations that include the following:
insert (key, value)
search (key) /get (key)
delete (key)

Need for Hash data structure
Every day, the data on the internet is increasing multi-fold and it is always a struggle to store
this data efficiently. In day-to-day programming, this amount of data might not be that big, but
still, it needs to be stored, accessed, and processed easily and efficiently. A very common data
structure that is used for such a purpose is the Array data structure.
Now the question arises: if Array was already there, what was the need for a new data structure? The answer lies in the word "efficiency". Though storing in an Array takes O(1) time, searching in it takes at least O(log n) time for a sorted array, and O(n) for an unsorted one. This time appears to be small, but for a large data set it can cause a lot of problems, and this, in turn, makes the Array data structure inefficient.
So now we are looking for a data structure that can store the data and search in it in constant
time, i.e. in O (1) time. This is how Hashing data structure came into play. With the introduction
of the Hash data structure, it is now possible to easily store data in constant time and retrieve
them in constant time as well.

Components of Hashing
There are majorly three components of hashing:
Key: A key can be anything, a string or an integer, that is fed as input to the hash function, the technique that determines an index or location for storage of an item in a data structure.
Hash Function: The hash function receives the input key and returns the index of an element
in an array called a hash table. The index is known as the hash index.
Hash Table: Hash table is a data structure that maps keys to values using a special function
called a hash function. Hash stores the data in an associative manner in an array where each
data value has its own unique index.
Hash Table
In a Hash Table, to store data we use a hash function that takes the data (the key) as input, generates an index from it, and we store the data at that index.

Our data can be of any form i.e. it can be in integer or string or some other data type. But our
main aim should be to convert that data into an integer by doing some pre-processing and then
apply some hash function to it and map the result in the hash table.

For example: If the data that is to be stored is 2, 3, 4, 5, 26, 17 and our hash function is:
hashFunction(k): k % 10
Here, in the above example, we are finding the remainder of the data, and that remainder is our key; based on it we store the value in the hash table. So the possible keys range from 0 to 9 (because we are using modulus 10, the remainder can never exceed 9). Following is a representation of the above example:
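A minimal C sketch of this example (assuming one record per slot, no collisions among these keys, and 0 marking an empty slot):

#include <stdio.h>

#define TABLE_SIZE 10

/* Division hash: maps a key to an index in [0, TABLE_SIZE) */
int hashFunction(int k) { return k % TABLE_SIZE; }

int main(void)
{
    int data[] = { 2, 3, 4, 5, 26, 17 };
    int table[TABLE_SIZE] = { 0 };   /* 0 marks an empty slot here */
    for (int i = 0; i < 6; i++)
        table[hashFunction(data[i])] = data[i];
    for (int i = 0; i < TABLE_SIZE; i++)
        printf("index %d -> %d\n", i, table[i]);
    return 0;
}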

Let's take another example where we want to store some strings in the hash table. Here, we will
pre-process our data(by taking the sum of alphabets i.e. if the string is "ab" then the sum will
be "1+2 = 3") and then based on this we will use the hash function to generate our key.

How does Hashing work?


Suppose we have a set of strings {“ab”, “cd”, “efg”} and we would like to store it in a table.
Our main objective here is to search or update the values stored in the table quickly in O (1)
time, and we are not concerned about the ordering of strings in the table. So each given string can act as a key, and the string itself will act as its value. But how do we store the value corresponding to the key?
Step 1: We know that hash functions (which is some mathematical formula) are used to
calculate the hash value which acts as the index of the data structure where the value will be
stored.
Step 2: So, let’s assign
“a” = 1,
“b”=2, .. etc, to all alphabetical characters.
Step 3: Therefore, compute the numerical value of each string by summing its characters:
“ab” = 1 + 2 = 3,
“cd” = 3 + 4 = 7 ,
“efg” = 5 + 6 + 7 = 18
Step 4: Now, assume that we have a table of size 7 to store these strings. The hash function that
is used here is the sum of the characters in key mod Table size. We can compute the location
of the string in the array by taking the sum (string) mod 7.
Step 5: So we will then store

“ab” in 3 mod 7 = 3,
“cd” in 7 mod 7 = 0, and
“efg” in 18 mod 7 = 4.

The above technique enables us to calculate the location of a given string by using a simple
hash function and rapidly find the value that is stored in that location. Therefore the idea of
hashing seems like a great way to store (key, value) pairs of the data in a table.
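A small C sketch of this scheme (a hypothetical hashString helper; assumes lowercase a-z keys):

/* Sum-of-letters hash: 'a' = 1, 'b' = 2, ..., then mod the table size */
int hashString(const char *s, int tableSize)
{
    int sum = 0;
    for (; *s != '\0'; s++)
        sum += *s - 'a' + 1;   /* assumes lowercase a-z characters */
    return sum % tableSize;
}
/* hashString("ab", 7) == 3, hashString("cd", 7) == 0, hashString("efg", 7) == 4 */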
Another Example: Say we are given keys in the range 0 to 999, and have a hash table of size
10. In this case, a possible hash function might simply divide the key value by 100. Thus, all
keys in the range 0 to 99 would hash to slot 0, keys 100 to 199 would hash to slot 1, and so on.
In other words, this hash function “bins” the first 100 keys to the first slot, the next 100 keys
to the second slot, and so on.
What is a Hash function?
The hash function creates a mapping between a key and a value; this is done through the use of mathematical formulas known as hash functions. The result of the hash function is referred to as a hash value or simply a hash. The hash value is a representation of the original string of characters, but usually smaller than the original.
For example: Consider an array as a Map where the key is the index and the value is the value
at that index. So for an array A if we have index i which will be treated as the key then we can
find the value by simply looking at the value at A[i].
Types of Hash functions
There are many hash functions that use numeric or alphanumeric keys.
Division Method.
Mid Square Method.
Folding Method.
Multiplication Method.
1. Division Method:
This is the simplest method to generate a hash value. The hash function divides the value k by M and then uses the remainder obtained.
Formula:
h(K) = k mod M
Here,
k is the key value, and

M is the size of the hash table.
It is best suited that M is a prime number as that can make sure the keys are more uniformly
distributed. The hash function is dependent upon the remainder of a division.
Example:
k = 12345, M = 95
h(12345) = 12345 mod 95 = 90
k = 1276, M = 11
h(1276) = 1276 mod 11 = 0
Pros:
This method is quite good for any value of M.
The division method is very fast since it requires only a single division operation.
Cons:
This method can lead to poor performance, since consecutive keys map to consecutive hash values in the hash table.
Sometimes extra care must be taken in choosing the value of M.
2. Mid Square Method:
The mid-square method is a very good hashing method. It involves two steps to compute the
hash value-
Square the value of the key k, i.e. compute k².
Extract the middle r digits of k² as the hash value.
Formula:
h(k) = the middle r digits of k × k
Here,
k is the key value.
The value of r can be decided based on the size of the table.
Example:
Suppose the hash table has 100 memory locations. So r = 2 because two digits are required to
map the key to the memory location.

k = 60
k × k = 60 × 60 = 3600
h(60) = 60 (the middle two digits of 3600)
The hash value obtained is 60.
Pros:
The performance of this method is good as most or all digits of the key value contribute to the
result. This is because all digits in the key contribute to generating the middle digits of the
squared result.
The result is not dominated by the distribution of the top digit or bottom digit of the original
key value.
Cons:
The size of the key is one of the limitations of this method: if the key is large, its square will have roughly double the number of digits.
Another disadvantage is that there will be collisions, but we can try to reduce them.
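A C sketch for r = 2 and a table of 100 slots (the digit positions assume a four-digit square, as in the example above):

/* Mid-square: square the key, then extract the middle two digits */
int midSquareHash(int k)
{
    long sq = (long)k * k;           /* e.g. 60 * 60 = 3600 */
    return (int)((sq / 10) % 100);   /* middle two digits: (3600 / 10) % 100 = 60 */
}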
3. Digit Folding Method:
This method involves two steps:
Divide the key-value k into a number of parts i.e. k1, k2, k3,….,kn, where each part has the
same number of digits except for the last part that can have lesser digits than the other parts.
Add the individual parts. The hash value is obtained by ignoring the last carry if any.
Formula:
k = k1, k2, k3, k4, ….., kn
s = k1+ k2 + k3 + k4 +….+ kn
h(K)= s
Here,
s is obtained by adding the parts of the key k
Example:
k = 12345
k1 = 12, k2 = 34, k3 = 5
s = k1 + k2 + k3
= 12 + 34 + 5
= 51

h(K) = 51
Note:
The number of digits in each part varies depending upon the size of the hash table. Suppose for
example the size of the hash table is 100, then each part must have two digits except for the
last part which can have a lesser number of digits.
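A C sketch of the folding idea (this variant takes two digits at a time from the low-order end, so the parts differ from the left-to-right split in the example above, but the principle is the same):

/* Digit folding: split k into two-digit parts and add them up */
int foldingHash(int k, int tableSize)
{
    int sum = 0;
    while (k > 0) {
        sum += k % 100;      /* take the lowest two digits */
        k /= 100;            /* drop them and continue */
    }
    return sum % tableSize;  /* keep the result inside the table */
}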
4. Multiplication Method
This method involves the following steps:
Choose a constant value A such that 0 < A < 1.
Multiply the key value with A.
Extract the fractional part of kA.
Multiply the result of the above step by the size of the hash table i.e. M.
The resulting hash value is obtained by taking the floor of the result obtained in step 4.
Formula:
h(K) = floor (M (kA mod 1))
Here,
M is the size of the hash table.
k is the key value.
A is a constant value.
Example:
k = 12345
A = 0.357840
M = 100
h(12345) = floor[ 100 (12345*0.357840 mod 1)]
= floor[ 100 (4417.5348 mod 1) ]
= floor[ 100 (0.5348) ]
= floor[ 53.48 ]
= 53
Pros:
The advantage of the multiplication method is that it can work with any value of A between 0 and 1, although some values tend to give better results than others.

Cons:
The multiplication method works best when the table size is a power of two, since the whole process of computing the index from the key using multiplication hashing is then very fast; for other table sizes the computation is more involved than the division method.
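A C sketch of the method using the constant from the worked example:

#include <math.h>

/* Multiplication method: h(k) = floor(M * (k * A mod 1)) */
int multiplicationHash(int k, int M)
{
    const double A = 0.357840;            /* constant in (0, 1) */
    double frac = k * A - floor(k * A);   /* fractional part of k * A */
    return (int)(M * frac);               /* multiplicationHash(12345, 100) == 53 */
}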
Properties of a Good hash function
A hash function that maps every item into its own unique slot is known as a perfect hash
function. We can construct a perfect hash function if we know the items and the collection will
never change but the problem is that there is no systematic way to construct a perfect hash
function given an arbitrary collection of items. Fortunately, we will still gain performance
efficiency even if the hash function isn’t perfect. We can achieve a perfect hash function by
increasing the size of the hash table so that every possible value can be accommodated. As a
result, each item will have a unique slot. Although this approach is feasible for a small number
of items, it is not practical when the number of possibilities is large.
So, we can construct our own hash function to do the same, but there are things we must be careful about while constructing it.
A good hash function should have the following properties:
 Efficiently computable.
 Should uniformly distribute the keys (each table position should be equally likely for each key).
 Should minimize collisions.
 Should have a low load factor (number of items in the table divided by the size of the
table).
Complexity of calculating the hash value using the hash function
Time complexity: O(n), where n is the length of the key
Space complexity: O(1)
Problem with Hashing
If we consider the above example, the hash function we used is the sum of the letters; but if we examine the hash function closely, the problem is easy to see: for different strings the same hash value is being generated by the hash function.
For example: {“ab”, “ba”} both have the same hash value, and string {“cd”,”be”} also generate
the same hash value, etc. This is known as collision and it creates problem in searching,
insertion, deletion, and updating of value.
What is collision?
The hashing process generates a small number for a big key, so there is a possibility that two keys could produce the same value. The situation where a newly inserted key maps to an already occupied slot in the hash table is called a collision, and it must be handled using some collision handling technique.

How to handle Collisions?
There are mainly two methods to handle collision:
Separate Chaining:
Open Addressing:

1) Separate Chaining
The idea is to make each cell of the hash table point to a linked list of records that have the
same hash function value. Chaining is simple but requires additional memory outside the table.
Example: We have given a hash function and we have to insert some elements in the hash table
using a separate chaining method for collision resolution technique.

Hash function = key % 5,
Elements = 12, 15, 22, 25 and 37.
Let’s see step by step approach to how to solve the above problem:
Step 1: First draw the empty hash table which will have a possible range of hash values from
0 to 4 according to the hash function provided.

Step 2: Now insert all the keys in the hash table one by one. The first key to be inserted is 12
which is mapped to bucket number 2 which is calculated by using the hash function 12%5=2.

Step 3: Now the next key is 22. It will map to bucket number 2 because 22%5=2. But bucket
2 is already occupied by key 12.

Step 4: The next key is 15. It will map to slot number 0 because 15%5=0.

Step 5: Now the next key is 25. Its bucket number will be 25%5=0. But bucket 0 is already occupied by key 15. So the separate chaining method will again handle the collision by creating a linked list at bucket 0.

Step 6: The last key is 37. Its bucket number is 37%5=2, where keys 12 and 22 already reside, so it is appended to that chain.

Hence, in this way, the separate chaining method is used as the collision resolution technique.
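A minimal C sketch of this example (head insertion into each chain; prints every bucket's chain):

#include <stdio.h>
#include <stdlib.h>

#define SLOTS 5

typedef struct Node { int key; struct Node *next; } Node;
Node *table[SLOTS];   /* each slot heads a chain; NULL means empty */

/* Insert a key at the head of the chain for bucket key % SLOTS */
void chainInsert(int key)
{
    int idx = key % SLOTS;
    Node *n = malloc(sizeof(Node));
    n->key = key;
    n->next = table[idx];
    table[idx] = n;
}

int main(void)
{
    int keys[] = { 12, 15, 22, 25, 37 };
    for (int i = 0; i < 5; i++)
        chainInsert(keys[i]);
    for (int i = 0; i < SLOTS; i++) {
        printf("bucket %d:", i);
        for (Node *p = table[i]; p != NULL; p = p->next)
            printf(" %d", p->key);
        printf("\n");
    }
    return 0;
}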

2) Open Addressing
In open addressing, all elements are stored in the hash table itself. Each table entry contains
either a record or NIL. When searching for an element, we examine the table slots one by one
until the desired element is found or it is clear that the element is not in the table.
2.a) Linear Probing
In linear probing, the hash table is searched sequentially, starting from the original hash location. If the location that we get is already occupied, then we check the next location.
Algorithm:
Calculate the hash key, i.e. key = data % size.
Check if hashTable[key] is empty; if so, store the value directly: hashTable[key] = data.
If the hash index already has some value, check the next index using key = (key + 1) % size.
If the next index hashTable[key] is available, store the value there; otherwise try the next index.
Repeat the above process till we find a free space.
Example: Let us consider a simple hash function as “key mod 5” and a sequence of keys that
are to be inserted are 50, 70, 76, 85, 93.
Step1: First draw the empty hash table which will have a possible range of hash values from 0
to 4 according to the hash function provided.

Step 2: Now insert all the keys in the hash table one by one. The first key is 50. It will map to
slot number 0 because 50%5=0. So insert it into slot number 0.

Step 3: The next key is 70. It will map to slot number 0 because 70%5=0 but 50 is already at
slot number 0 so, search for the next empty slot and insert it.

Step 4: The next key is 76. It will map to slot number 1 because 76%5=1 but 70 is already at
slot number 1 so, search for the next empty slot and insert it.

Step 5: The next key is 85. It will map to slot number 0 because 85%5=0, but slots 0, 1 and 2 are already occupied, so search for the next empty slot and insert 85 at slot 3.

Step 6: The next key is 93. It will map to slot number 3 because 93%5=3, but 85 is already at slot number 3, so search for the next empty slot and insert 93 at slot 4.
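A short C sketch of linear probing insertion (assumes the table is not full; -1 marks an empty slot):

#define SIZE 5
#define EMPTY -1

int table[SIZE] = { EMPTY, EMPTY, EMPTY, EMPTY, EMPTY };

/* Linear probing: start at key % SIZE and step to the next slot until free */
void probeInsert(int key)
{
    int idx = key % SIZE;
    while (table[idx] != EMPTY)    /* collision: move to the next slot */
        idx = (idx + 1) % SIZE;
    table[idx] = key;
}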

2.b) Quadratic Probing


Quadratic probing is an open addressing scheme in computer programming for resolving hash
collisions in hash tables. Quadratic probing operates by taking the original hash index and
adding successive values of an arbitrary quadratic polynomial until an open slot is found.
An example sequence using quadratic probing is:

H + 1², H + 2², H + 3², H + 4², …, H + k²


In this method, we look for the (i²)-th probe (slot) in the i-th iteration, with i = 0, 1, …, n − 1. We always start from the original hash location. If that location is occupied, we check the other slots.
Let hash(x) be the slot index computed using the hash function and n be the size of the hash
table.
If the slot hash(x) % n is full, then we try (hash(x) + 1²) % n.
If (hash(x) + 1²) % n is also full, then we try (hash(x) + 2²) % n.
If (hash(x) + 2²) % n is also full, then we try (hash(x) + 3²) % n.
This process will be repeated for all the values of i until an empty slot is found

Example: Let us consider table size = 7, the hash function Hash(x) = x % 7, and the collision resolution strategy f(i) = i². Insert 22, 30, and 50.

Step 1: Create a table of size 7.

Step 2 – Insert 22 and 30


Hash(22) = 22 % 7 = 1. Since the cell at index 1 is empty, we can easily insert 22 at slot 1.
Hash(30) = 30 % 7 = 2. Since the cell at index 2 is empty, we can easily insert 30 at slot 2.

Step 3: Inserting 50
Hash(50) = 50 % 7 = 1
In our hash table slot 1 is already occupied, so we search for slot 1 + 1², i.e. 1 + 1 = 2.
Slot 2 is also occupied, so we search for cell 1 + 2², i.e. 1 + 4 = 5.
Cell 5 is not occupied, so we place 50 in slot 5.
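A C sketch of quadratic probing insertion (-1 marks an empty slot; returns the slot used, or -1 if no free slot is found along the probe sequence):

/* Quadratic probing: try (h + i*i) % n for i = 0, 1, 2, ... */
int quadInsert(int table[], int n, int key)
{
    for (int i = 0; i < n; i++) {
        int idx = (key % n + i * i) % n;
        if (table[idx] == -1) {   /* free slot found */
            table[idx] = key;
            return idx;
        }
    }
    return -1;   /* probe sequence exhausted */
}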

2.c) Double Hashing
Double hashing is a collision resolution technique used in open-addressed hash tables. Double hashing makes use of two hash functions.
The first hash function is h1(k), which takes the key and gives a location in the hash table. If that location is empty, we can simply place our key there.
But in case the location is occupied (a collision), we use a secondary hash function h2(k) in combination with the first hash function h1(k) to find a new location in the hash table.
This combination of hash functions is of the form
h(k, i) = (h1(k) + i * h2(k)) % n
where
i is a non-negative integer that indicates a collision number,
k = element/key which is being hashed
n = hash table size.
Complexity of the Double hashing algorithm:
Time complexity: O(n)
Example: Insert the keys 27, 43, 692, 72 into the Hash Table of size 7. where first hash-function
is h1(k) = k mod 7 and second hash-function is h2(k) = 1 + (k mod 5)
Step 1: Insert 27
27 % 7 = 6, location 6 is empty so insert 27 into 6 slot.

Step 2: Insert 43
43 % 7 = 1, location 1 is empty so insert 43 into 1 slot.

Step 3: Insert 692


692 % 7 = 6, but location 6 is already being occupied and this is a collision
So we need to resolve this collision using double hashing.
hnew = [h1(692) + i × h2(692)] % 7
= [6 + 1 × (1 + 692 % 5)] % 7
= 9 % 7
= 2
Now, as 2 is an empty slot, we can insert 692 into slot 2.

Step 4: Insert 72
72 % 7 = 2, but location 2 is already being occupied and this is a collision.
So we need to resolve this collision using double hashing.
hnew = [h1(72) + i × h2(72)] % 7
= [2 + 1 × (1 + 72 % 5)] % 7
= 5 % 7
= 5
Now, as 5 is an empty slot, we can insert 72 into slot 5.
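A C sketch of this scheme with the two hash functions from the example (table size 7; -1 marks an empty slot):

int h1(int k) { return k % 7; }         /* primary hash */
int h2(int k) { return 1 + (k % 5); }   /* secondary hash, never zero */

/* Double hashing: probe (h1(k) + i * h2(k)) % n for i = 0, 1, 2, ... */
int doubleHashInsert(int table[], int n, int key)
{
    for (int i = 0; i < n; i++) {
        int idx = (h1(key) + i * h2(key)) % n;
        if (table[idx] == -1) {
            table[idx] = key;
            return idx;
        }
    }
    return -1;   /* no empty slot found along the probe sequence */
}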

S.No. | Separate Chaining | Open Addressing
1. | Chaining is simpler to implement. | Open addressing requires more computation.
2. | In chaining, the hash table never fills up; we can always add more elements to a chain. | In open addressing, the table may become full.
3. | Chaining is less sensitive to the hash function or load factor. | Open addressing requires extra care to avoid clustering and a high load factor.
4. | Chaining is mostly used when it is unknown how many and how frequently keys may be inserted or deleted. | Open addressing is used when the frequency and number of keys are known.
5. | Cache performance of chaining is not good, because keys are stored in linked lists. | Open addressing provides better cache performance, as everything is stored in the same table.
6. | There is wastage of space (some parts of the hash table in chaining are never used). | In open addressing, a slot can be used even if no key maps to it.
7. | Chaining uses extra space for links. | There are no links in open addressing.

What is meant by Load Factor in Hashing?


The load factor of the hash table can be defined as the number of items the hash table contains
divided by the size of the hash table. Load factor is the decisive parameter that is used when
we want to rehash the previous hash function or want to add more elements to the existing hash
table.
It helps us in determining the efficiency of the hash function i.e. it tells whether the hash
function which we are using is distributing the keys uniformly or not in the hash table.
Load Factor = Total elements in hash table/ Size of hash table
What is Rehashing?
As the name suggests, rehashing means hashing again. Basically, when the load factor
increases to more than its predefined value (the default value of the load factor is 0.75), the
complexity increases. So to overcome this, the size of the array is increased (doubled) and all
the values are hashed again and stored in the new double-sized array to maintain a low load
factor and low complexity.
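A C sketch of rehashing under these rules (hypothetical helper; assumes <stdlib.h>, an open-addressing table of ints with -1 for empty, and linear probing for re-insertion):

/* Double the table and hash every key again with the new size */
void rehash(int **tablePtr, int *sizePtr, int count)
{
    if ((double)count / *sizePtr <= 0.75)   /* load factor still acceptable */
        return;
    int oldSize = *sizePtr;
    int *old = *tablePtr;
    int newSize = oldSize * 2;
    int *fresh = malloc(newSize * sizeof(int));
    for (int i = 0; i < newSize; i++)
        fresh[i] = -1;
    for (int i = 0; i < oldSize; i++) {
        if (old[i] == -1) continue;
        int idx = old[i] % newSize;          /* hash again with the new size */
        while (fresh[idx] != -1)             /* linear probe to a free slot */
            idx = (idx + 1) % newSize;
        fresh[idx] = old[i];
    }
    free(old);
    *tablePtr = fresh;
    *sizePtr = newSize;
}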

Applications of Hash Data structure
 Hash is used in databases for indexing.
 Hash is used in disk-based data structures.
 In some programming languages like Python, JavaScript hash is used to implement
objects.
Real-Time Applications of Hash Data structure
 Hash is used for cache mapping for fast access to the data.
 Hash can be used for password verification.
 Hash is used in cryptography as a message digest.
Advantages of Hash Data structure
 Hash provides better synchronization than other data structures.
 Hash tables are, on average, more efficient at lookup than search trees and other comparison-based data structures.
 Hash provides constant time for searching, insertion, and deletion operations on
average.
Disadvantages/Issues of Hash Data structure
 Hash is inefficient when there are many collisions.
 Hash collisions are practically not avoided for a large set of possible keys.
 Hash does not allow null values.

Skip List
A skip list is a data structure that allows for efficient search, insertion and deletion of elements
in a sorted list. It is a probabilistic data structure, meaning that its average time complexity is
determined through a probabilistic analysis.
In a skip list, elements are organized in layers, with each layer having a smaller number of
elements than the one below it. The bottom layer is a regular linked list, while the layers above
it contain “skipping” links that allow for fast navigation to elements that are far apart in the
bottom layer. The idea behind this is to allow for quick traversal to the desired element,
reducing the average number of steps needed to reach it.
Skip lists are implemented using a technique called "coin flipping." In this technique, a random number is generated for each insertion to determine the number of layers the new element will occupy. With this scheme, the expected height of the structure is O(log n), where n is the number of elements in the bottom layer.
Skip lists have an average time complexity of O(log n) for search, insertion and deletion, which
is similar to that of balanced trees, such as AVL trees and red-black trees, but with the
advantage of simpler implementation and lower overhead.
In summary, skip lists provide a simple and efficient alternative to balanced trees for certain
use cases, particularly when the average number of elements in the list is large.

Can we search in a sorted linked list better than O(n) time? The worst-case search time for a
sorted linked list is O(n) as we can only linearly traverse the list and cannot skip nodes while
searching. For a Balanced Binary Search Tree, we skip almost half of the nodes after one
comparison with the root. For a sorted array, we have random access and we can apply Binary
Search on arrays. Can we augment sorted linked lists to search faster? The answer is Skip List.
The idea is simple, we create multiple layers so that we can skip some nodes. See the following
example list with 16 nodes and two layers. The upper layer works as an “express lane” that
connects only the main outer stations, and the lower layer works as a “normal lane” that
connects every station. Suppose we want to search for 50, we start from the first node of the
“express lane” and keep moving on the “express lane” till we find a node whose next is greater
than 50. Once we find such a node (30 is the node in the following example) on “express lane”,
we move to “normal lane” using a pointer from this node, and linearly search for 50 on “normal
lane”. In the following example, we start from 30 on the “normal lane” and with linear search,
we find 50.

What is the time complexity with two layers? The worst-case time complexity is the number of nodes on the "express lane" plus the number of nodes in a segment (a segment is the run of "normal lane" nodes between two "express lane" nodes) of the "normal lane". So if we have n nodes on "normal"
lane”, √n (square root of n) nodes on “express lane” and we equally divide the “normal lane”,
then there will be √n nodes in every segment of “normal lane”. √n is an optimal division with
two layers. With this arrangement, the number of nodes traversed for a search will be O(√n).
Therefore, with O(√n) extra space, we can reduce the time complexity to O(√n).
Advantages of Skip List:
 The skip list is robust and reliable.
 A new node can be inserted into it extremely quickly.
 Easy to implement compared to the hash table and binary search tree.
 As the number of nodes in the skip list increases, the possibility of hitting the worst case decreases.
 Requires only Θ(log n) time in the average case for all operations.
 Finding a node in the list is relatively straightforward.
Disadvantages of Skip List:
 It needs a greater amount of memory than the balanced tree.
 Reverse search is not permitted.
 Searching is slower than in a hash table.
 Skip lists are not cache-friendly because they don’t optimize the locality of reference

Skip List Insertion Operation
Each element in the list is represented by a node, and the level of a node is chosen randomly at insertion time. The level does not depend on the number of elements in the list. The level for a node is decided by the following algorithm:
randomLevel()
    lvl := 1
    // random() returns a random value in [0...1)
    while random() < p and lvl < MaxLevel do
        lvl := lvl + 1
    return lvl
MaxLevel is the upper bound on the number of levels in the skip list. It can be determined as L(N) = log_{1/p}(N). The above algorithm ensures that the random level will never be greater than MaxLevel. Here p is the fraction of the nodes with level i pointers that also have level i+1 pointers, and N is the number of nodes in the list.
Node Structure
Each node carries a key and a forward array carrying pointers to nodes at different levels. A node at level i carries i + 1 forward pointers, indexed 0 through i.
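A C sketch of the node structure and the coin-flipping level generator above (MAX_LEVEL and P are assumed constants):

#include <stdlib.h>

#define MAX_LEVEL 16
#define P 0.5   /* fraction p of level-i nodes promoted to level i+1 */

typedef struct SkipNode {
    int key;
    struct SkipNode *forward[MAX_LEVEL + 1];   /* forward[i] = next node at level i */
} SkipNode;

/* randomLevel() from the pseudocode above */
int randomLevel(void)
{
    int lvl = 1;
    while ((double)rand() / RAND_MAX < P && lvl < MAX_LEVEL)
        lvl++;
    return lvl;
}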

Insertion in Skip List


We will start from the highest level in the list and compare the key of the next node of the current node with the key to be inserted. The basic idea is:
If the key of the next node is less than the key to be inserted, we keep moving forward on the same level.
If the key of the next node is greater than the key to be inserted, we store the pointer to the current node i at update[i], move one level down and continue the search.
At level 0, we will definitely find a position to insert the given key. Following is the pseudocode for the insertion algorithm:
Insert(list, searchKey)
    local update[0...MaxLevel+1]
    x := list -> header
    for i := list -> level downto 0 do
        while x -> forward[i] -> key < searchKey do
            x := x -> forward[i]
        update[i] := x
    x := x -> forward[0]
    lvl := randomLevel()
    if lvl > list -> level then
        for i := list -> level + 1 to lvl do
            update[i] := list -> header
        list -> level := lvl
    x := makeNode(lvl, searchKey, value)
    for i := 0 to lvl do
        x -> forward[i] := update[i] -> forward[i]
        update[i] -> forward[i] := x
Here update[i] holds the pointer to the node at level i from which we moved down to level i-1, and at level 0 the pointer to the node left of the insertion position. Consider this example where we want to insert key 17:

Searching an element in Skip list
Searching an element is very similar to the approach for finding the spot to insert an element in a skip list. The basic idea is:
If the key of the next node is less than the search key, we keep moving forward on the same level.
If the key of the next node is greater than the search key, we move one level down and continue the search.
At the lowest level (0), if the element next to the rightmost element reached (update[0]) has a key equal to the search key, then we have found the key; otherwise the search fails. Following is the pseudocode for searching an element:
Search(list, searchKey)
    x := list -> header
    -- loop invariant: x -> key < searchKey
    for i := list -> level downto 0 do
        while x -> forward[i] -> key < searchKey do
            x := x -> forward[i]
    x := x -> forward[0]
    if x -> key = searchKey then return x -> value
    else return failure
Consider this example where we want to search for key 17-

Deleting an element from the Skip list


Deletion of an element k is preceded by locating the element in the skip list using the search algorithm mentioned above. Once the element is located, pointers are rearranged to remove the element from the list, just as we do in a singly linked list. We start from the lowest level and do the rearrangement until the element next to update[i] is no longer k. After deleting the element there could be levels with no elements, so we remove these levels as well by decrementing the level of the skip list. Following is the pseudocode for deletion:
Delete(list, searchKey)
    local update[0..MaxLevel+1]
    x := list -> header
    for i := list -> level downto 0 do
        while x -> forward[i] -> key < searchKey do
            x := x -> forward[i]
        update[i] := x
    x := x -> forward[0]
    if x -> key = searchKey then
        for i := 0 to list -> level do
            if update[i] -> forward[i] ≠ x then break
            update[i] -> forward[i] := x -> forward[i]
        free(x)
        while list -> level > 0 and list -> header -> forward[list -> level] = NIL do
            list -> level := list -> level - 1
Consider this example where we want to delete element 6 –

Here at level 3, there is no element (arrow in red) after deleting element 6. So we will decrement
level of skip list by 1.

Unit II Graphs
Basic Concepts, Storage representation, Adjacency matrix, adjacency list, adjacency
multi list, inverse adjacency list. Traversals-depth first and breadth first, Minimum
spanning Tree, Greedy algorithms for computing minimum spanning tree- Prim's and
Kruskal's Algorithms, Dijkstra's Single source shortest path, All pairs shortest paths-
Floyd-Warshall Algorithm, Topological ordering.

What is Graph Data Structure?


A Graph is a non-linear data structure consisting of vertices and edges. The vertices are sometimes also referred to as nodes, and the edges are lines or arcs that connect any two nodes in the graph. More formally, a Graph is composed of a set of vertices (V) and a set of edges (E). The graph is denoted by G(V, E).

Components of a Graph
Vertices: Vertices are the fundamental units of the graph. Sometimes vertices are also known as nodes. Every node/vertex can be labeled or unlabelled.
Edges: Edges are drawn or used to connect two nodes of the graph. In a directed graph, an edge is an ordered pair of nodes. Edges can connect any two nodes in any possible way; there are no rules. Sometimes edges are also known as arcs. Every edge can be labeled or unlabelled.
Graphs are used to solve many real-life problems. Graphs are used to represent networks. The
networks may include paths in a city or telephone network or circuit network. Graphs are also
used in social networks like linkedIn, Facebook. For example, in Facebook, each person is
represented with a vertex(or node). Each node is a structure and contains information like
person id, name, gender, locale etc.

Graph data structures are a powerful tool for representing and analyzing complex relationships
between objects or entities. They are particularly useful in fields such as social network
analysis, recommendation systems, and computer networks. In the field of sports data science,
graph data structures can be used to analyze and understand the dynamics of team performance
and player interactions on the field.
Imagine a game of football as a web of connections, where players are the nodes and their
interactions on the field are the edges. This web of connections is exactly what a graph data
structure represents, and it’s the key to unlocking insights into team performance and player
dynamics in sports.
Types of Graph
1. Null Graph
A graph is known as a null graph if there are no edges in the graph.
2. Trivial Graph
A graph having only a single vertex; it is also the smallest graph possible.

3. Regular Graph
A graph in which the degree of every vertex equals K is called a K-regular graph.
4. Complete Graph
A graph in which there is an edge from each node to every other node.

5. Cycle Graph
A graph that forms a single cycle in itself; the degree of each vertex is 2.
6. Cyclic Graph
A graph containing at least one cycle is known as a Cyclic graph.

7. Directed Acyclic Graph


A Directed Graph that does not contain any cycle.
8. Bipartite Graph
A graph whose vertices can be divided into two sets such that no edge connects two vertices within the same set.

Tree v/s Graph


Trees are restricted types of graphs, just with some more rules. Every tree will always be a graph, but not all graphs will be trees. Linked lists, trees, and heaps are all special cases of graphs.

Representation of Graphs
There are two ways to store a graph:
Adjacency Matrix
Adjacency List
Adjacency Matrix
In this method, the graph is stored in the form of the 2D matrix where rows and columns denote
vertices. Each entry in the matrix represents the weight of the edge between those vertices.

Adjacency List
The graph is represented as a collection of linked lists. There is an array of pointers, where the entry for each vertex points to a linked list of the vertices adjacent to it.
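A C sketch of this representation (assuming a directed graph and head insertion):

#include <stdlib.h>

#define MAX_V 50

typedef struct AdjNode { int dest; struct AdjNode *next; } AdjNode;
AdjNode *head[MAX_V];   /* head[v] points to the list of v's neighbours */

/* Add a directed edge u -> v at the head of u's list */
void addEdge(int u, int v)
{
    AdjNode *n = malloc(sizeof(AdjNode));
    n->dest = v;
    n->next = head[u];
    head[u] = n;
}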

Basic Operations on Graphs
Below are the basic operations on the graph:
 Insertion of Nodes/Edges in the graph – Insert a node into the graph.
 Deletion of Nodes/Edges in the graph – Delete a node from the graph.
 Searching on Graphs – Search an entity in the graph.
 Traversal of Graphs – Traversing all the nodes in the graph.
Usage of graphs
Maps can be represented using graphs and then can be used by computers to provide various
services like the shortest path between two cities.
When various tasks depend on each other, the situation can be represented using a Directed Acyclic Graph, and we can find the order in which the tasks can be performed using topological sort.
A state transition diagram represents the legal moves from the current state; it can be used, for example, in a game of tic-tac-toe.
Breadth First Search or BFS for a Graph
The breadth-first search (BFS) algorithm is used to search a tree or graph data structure for a node that meets a set of criteria. It starts at the tree's root (or an arbitrary node of a graph) and searches/visits all nodes at the current depth level before moving on to the nodes at the next depth level. Breadth-first search can be used to solve many problems in graph theory.
Breadth-First Traversal (or Search) for a graph is similar to the Breadth-First Traversal of a
tree.
The only catch here is, that, unlike trees, graphs may contain cycles, so we may come to the
same node again. To avoid processing a node more than once, we divide the vertices into two
categories:
Visited and
Not visited.
A boolean visited array is used to mark the visited vertices. For simplicity, it is assumed that
all vertices are reachable from the starting vertex. BFS uses a queue data structure for traversal.
Example:
In the following graph, we start traversal from vertex 2.

When we come to vertex 0, we look for all adjacent vertices of it.
2 is also an adjacent vertex of 0.
If we don’t mark visited vertices, then 2 will be processed again and it will become a non-
terminating process.
There can be multiple BFS traversals for a graph. Different BFS traversals for the above graph:
2, 3, 0, 1
2, 0, 3, 1
Implementation of BFS traversal on Graph:
Pseudocode:
Breadth_First_Search(Graph, X) // Here, Graph is the graph that we already have and X is the source node
    Let Q be the queue
    Q.enqueue(X) // Insert source node X into the queue
    Mark X as visited
    While (Q is not empty)
        Y = Q.dequeue() // Remove the front node from the queue
        For all the neighbours Z of Y
            If Z is not visited
                Q.enqueue(Z) // Store Z in Q
                Mark Z as visited
Follow the below method to implement BFS traversal.
Declare a queue and insert the starting vertex.
Initialize a visited array and mark the starting vertex as visited.
Follow the below process till the queue becomes empty:
Remove the first vertex of the queue.
Mark that vertex as visited.
Insert all the unvisited neighbors of the vertex into the queue.

Illustration:
Step1: Initially queue and visited arrays are empty.

Step2: Push node 0 into queue and mark it visited.

Step 3: Remove node 0 from the front of queue and visit the unvisited neighbours and push
them into queue.

Step 4: Remove node 1 from the front of queue and visit the unvisited neighbours and push
them into queue.

Step 5: Remove node 2 from the front of queue and visit the unvisited neighbours and push
them into queue.

Step 6: Remove node 3 from the front of queue and visit the unvisited neighbours and push
them into queue.
As we can see, every neighbour of node 3 is visited, so move to the next node at the front of the queue.

Step 7: Remove node 4 from the front of the queue and visit the unvisited neighbours and push them into the queue.
As we can see, every neighbour of node 4 is visited, so move to the next node at the front of the queue.

Now the queue becomes empty, so terminate this process of iteration.


#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_VERTICES 50

// This struct represents a directed graph using
// adjacency matrix representation
typedef struct Graph_t {
    int V; // No. of vertices
    bool adj[MAX_VERTICES][MAX_VERTICES];
} Graph;

// Constructor
Graph* Graph_create(int V)
{
    Graph* g = malloc(sizeof(Graph));
    g->V = V;
    for (int i = 0; i < V; i++) {
        for (int j = 0; j < V; j++) {
            g->adj[i][j] = false;
        }
    }
    return g;
}

// Destructor
void Graph_destroy(Graph* g) { free(g); }

// function to add an edge to graph
void Graph_addEdge(Graph* g, int v, int w)
{
    g->adj[v][w] = true; // Add w to v's list.
}

// prints BFS traversal from a given source s
void Graph_BFS(Graph* g, int s)
{
    // Mark all the vertices as not visited
    bool visited[MAX_VERTICES];
    for (int i = 0; i < g->V; i++) {
        visited[i] = false;
    }

    // Create a queue for BFS
    int queue[MAX_VERTICES];
    int front = 0, rear = 0;

    // Mark the current node as visited and enqueue it
    visited[s] = true;
    queue[rear++] = s;

    while (front != rear) {
        // Dequeue a vertex from queue and print it
        s = queue[front++];
        printf("%d ", s);

        // Get all adjacent vertices of the dequeued
        // vertex s. If an adjacent has not been visited,
        // then mark it visited and enqueue it
        for (int adjacent = 0; adjacent < g->V; adjacent++) {
            if (g->adj[s][adjacent] && !visited[adjacent]) {
                visited[adjacent] = true;
                queue[rear++] = adjacent;
            }
        }
    }
}

// Driver program to test methods of graph struct
int main()
{
    // Create a graph given in the above diagram
    Graph* g = Graph_create(4);
    Graph_addEdge(g, 0, 1);
    Graph_addEdge(g, 0, 2);
    Graph_addEdge(g, 1, 2);
    Graph_addEdge(g, 2, 0);
    Graph_addEdge(g, 2, 3);
    Graph_addEdge(g, 3, 3);

    printf("Following is Breadth First Traversal "
           "(starting from vertex 2) \n");
    Graph_BFS(g, 2);

    Graph_destroy(g);
    return 0;
}

Output
Following is Breadth First Traversal (starting from vertex 2)
2 0 3 1
Time Complexity: O(V+E), where V is the number of nodes and E is the number of edges.
Auxiliary Space: O(V)
Depth First Search or DFS for a Graph
Depth First Traversal (or Search) for a graph is similar to Depth First Traversal of a tree. The
only catch here is, that, unlike trees, graphs may contain cycles (a node may be visited twice).
To avoid processing a node more than once, use a Boolean visited array. A graph can have
more than one DFS traversal.
Example:
Input: n = 4, e = 6
0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 2 -> 3, 3 -> 3
Output: DFS from vertex 1 : 1 2 0 3
Explanation:
DFS Diagram:

Input: n = 4, e = 6
2 -> 0, 0 -> 2, 1 -> 2, 0 -> 1, 3 -> 3, 1 -> 3
Output: DFS from vertex 2 : 2 0 1 3
Explanation:
DFS Diagram:

Depth-first search is an algorithm for traversing or searching tree or graph data structures. The
algorithm starts at the root node (selecting some arbitrary node as the root node in the case of
a graph) and explores as far as possible along each branch before backtracking.
So the basic idea is to start from the root or any arbitrary node and mark the node and move to
the adjacent unmarked node and continue this loop until there is no unmarked adjacent node.
Then backtrack and check for other unmarked nodes and traverse them. Finally, print the nodes
in the path.
Follow the below steps to solve the problem:

Create a recursive function that takes the index of the node and a visited array.
Mark the current node as visited and print the node.
Traverse all the adjacent and unmarked nodes and call the recursive function with the index of
the adjacent node.
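A minimal C sketch following these steps (it reuses the adjacency-matrix idea from the BFS program above; adj, visited and V are assumed globals):

#include <stdbool.h>
#include <stdio.h>

#define MAX_V 50

bool adj[MAX_V][MAX_V];   /* adjacency matrix of the graph */
bool visited[MAX_V];
int V;                    /* number of vertices */

/* Recursive DFS: mark the node, print it, then recurse on unvisited neighbours */
void DFS(int v)
{
    visited[v] = true;
    printf("%d ", v);
    for (int u = 0; u < V; u++)
        if (adj[v][u] && !visited[u])
            DFS(u);
}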
Minimum Spanning Tree
To understand the concept of spanning tree, consider the below graph:

The above graph can be represented as G(V, E), where V is the set of vertices and E is the set of edges. The spanning tree of the above graph would be represented as G'(V', E'). In this case, V' = V means that the number of vertices in the spanning tree is the same as the number of vertices in the graph, but the number of edges is different. The edges of the spanning tree are a subset of the edges of the original graph. Therefore, the edge set can be written as:
E' ⊆ E
The number of edges in the spanning tree can be written as:
|E'| = |V| - 1
Two conditions hold for a spanning tree, which are as follows:
The number of vertices in the spanning tree is the same as the number of vertices in the original graph:
V' = V
The number of edges in the spanning tree equals the number of vertices minus 1:
|E'| = |V| - 1
The spanning tree should not contain any cycle.
The spanning tree should not be disconnected.
Consider the below graph:

The above graph contains 5 vertices. As we know, the vertices in the spanning tree are the same as in the graph; therefore, V' equals 5. The number of edges in the spanning tree equals (5 - 1), i.e., 4. The following are the possible spanning trees:

What is a minimum spanning tree?
The minimum spanning tree is a spanning tree whose sum of the edges is minimum. Consider
the below graph that contains the edge weight:
The following are the spanning trees that we can make from the above graph.
The first spanning tree is a tree in which we have removed the edge between the vertices 1 and
5 shown as below:
The sum of the edges of the above tree is (1 + 4 + 5 + 2): 12
The second spanning tree is a tree in which we have removed the edge between the vertices 1
and 2 shown as below:
The sum of the edges of the above tree is (3 + 2 + 5 + 4) : 14
The third spanning tree is a tree in which we have removed the edge between the vertices 2 and
3 shown as below:
The sum of the edges of the above tree is (1 + 3 + 2 + 5) : 11
The fourth spanning tree is a tree in which we have removed the edge between the vertices 3
and 4 shown as below:
The sum of the edges of the above tree is (1 + 3 + 2 + 4) : 10. The edge cost 10 is minimum so
it is a minimum spanning tree.
General properties of minimum spanning tree:
If we remove any edge from the spanning tree, then it becomes disconnected. Therefore, we
cannot remove any edge from the spanning tree.
If we add an edge to the spanning tree then it creates a loop. Therefore, we cannot add any edge
to the spanning tree.
If each edge of a graph has a distinct weight, then there exists only a single, unique minimum spanning tree. If the edge weights are not distinct, then there can be more than one minimum spanning tree.
A complete undirected graph can have n^(n-2) spanning trees, where n is the number of vertices.
Every connected and undirected graph contains at least one spanning tree.
A disconnected graph does not have any spanning tree.

In a complete graph, we can remove maximum (e-n+1) edges to construct a spanning tree.
Let's understand the last property through an example.
Consider the complete graph which is given below:

The number of spanning trees that can be made from the above complete graph equals n^(n-2) = 4^(4-2) = 4² = 16.
Therefore, 16 spanning trees can be created from the above graph.
The maximum number of edges that can be removed to construct a spanning tree equals e - n + 1 = 6 - 4 + 1 = 3.
There are two methods to find Minimum Spanning Tree
Kruskal's Algorithm
Prim's Algorithm
Kruskal's Algorithm:
An algorithm to construct a Minimum Spanning Tree for a connected weighted graph. It is a greedy algorithm: the greedy choice is to pick the smallest-weight edge that does not cause a cycle in the MST constructed so far.
If the graph is not connected, it instead finds a minimum spanning forest (a minimum spanning tree for each connected component).
Steps for finding MST using Kruskal's Algorithm:
1. Arrange the edges of G in order of increasing weight.
2. Starting with only the vertices of G and proceeding sequentially, add each edge which does not result in a cycle, until (n - 1) edges are used.
3. EXIT.
MST-KRUSKAL (G, w)
1. A ← ∅
2. for each vertex v ∈ V [G]
3.     do MAKE-SET (v)
4. sort the edges of E into non-decreasing order by weight w
5. for each edge (u, v) ∈ E, taken in non-decreasing order by weight
6.     do if FIND-SET (u) ≠ FIND-SET (v)
7.         then A ← A ∪ {(u, v)}
8.             UNION (u, v)
9. return A
Analysis: Where E is the number of edges in the graph and V is the number of vertices, Kruskal's Algorithm can be shown to run in O(E log E) time, or simply O(E log V) time, all with simple data structures. These running times are equivalent because:
E is at most V², and log V² = 2 × log V is O(log V).
If we ignore isolated vertices, which will each form their own component of the minimum spanning forest, V ≤ 2E, so log V is O(log E).
Thus the total time is O(E log E) = O(E log V).
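A C sketch of the algorithm with a simple disjoint-set (union-find), assuming at most 50 vertices and a hypothetical Edge struct:

#include <stdlib.h>

typedef struct { int u, v, w; } Edge;

int parent[50];   /* disjoint-set parent array */

int findSet(int x)   /* FIND-SET with path compression */
{
    return parent[x] == x ? x : (parent[x] = findSet(parent[x]));
}

int cmpEdge(const void *a, const void *b)
{
    return ((const Edge *)a)->w - ((const Edge *)b)->w;
}

/* Kruskal: sort the edges, then add each edge whose endpoints lie in
   different trees; merge the trees after adding. Returns the MST cost. */
int kruskal(Edge edges[], int e, int n)
{
    int cost = 0, used = 0;
    for (int v = 0; v < n; v++)   /* MAKE-SET for every vertex */
        parent[v] = v;
    qsort(edges, e, sizeof(Edge), cmpEdge);   /* non-decreasing weight */
    for (int i = 0; i < e && used < n - 1; i++) {
        int ru = findSet(edges[i].u), rv = findSet(edges[i].v);
        if (ru != rv) {        /* different trees: no cycle is formed */
            parent[ru] = rv;   /* UNION */
            cost += edges[i].w;
            used++;
        }
    }
    return cost;
}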
For Example: Find the Minimum Spanning Tree of the following graph using Kruskal's
algorithm.

Solution: First we initialize the set A to the empty set and create |V| trees, one containing each vertex, with the MAKE-SET procedure. Then we sort the edges in E into non-decreasing order by weight.
There are 9 vertices and 12 edges, so the MST formed will have (9 - 1) = 8 edges.

Now, check for each edge (u, v) whether the endpoints u and v belong to the same tree. If they do, then the edge (u, v) would create a cycle and cannot be added. Otherwise, the two vertices belong to different trees, the edge (u, v) is added to A, and the vertices of the two trees are merged by the UNION procedure.
Step1: So, first take (h, g) edge

Step 2: then (g, f) edge.

Step 3: then (a, b) and (i, g) edges are considered, and the forest becomes

Step 4: Now, edge (h, i). Both h and i vertices are in the same set. Thus it creates a cycle. So
this edge is discarded.
Then edge (c, d), (b, c), (a, h), (d, e), (e, f) are considered, and the forest becomes.

Step 5: For the edge (e, f), both endpoints e and f are in the same tree, so this edge is discarded. The edge (b, h) also creates a cycle and is discarded.
Step 6: After that, edge (d, f) is likewise discarded, and the final spanning tree is shown in dark lines.
Step 7: This is the required Minimum Spanning Tree, because it contains all the 9 vertices and (9 - 1) = 8 edges.
Discarded: e → f, b → h, d → f [a cycle would be formed]

Minimum Cost MST

Prim's Algorithm
It is a greedy algorithm. It starts with an empty spanning tree. The idea is to maintain two sets
of vertices:
Contain vertices already included in MST.
Contain vertices not yet included.
At every step, it considers all the edges that connect the two sets and picks the minimum-weight edge among them. After picking the edge, it moves the other endpoint of the edge to the set containing the MST.
Steps for finding MST using Prim's Algorithm:
Create MST set that keeps track of vertices already included in MST.
Assign key values to all vertices in the input graph. Initialize all key values as INFINITE (∞).
Assign key values like 0 for the first vertex so that it is picked first.
While the MST set doesn't include all vertices:
Pick a vertex u which is not in the MST set and has the minimum key value. Include u in the MST set.
Update the key values of all adjacent vertices of u. To update, iterate through all adjacent vertices. For every adjacent vertex v, if the weight of edge (u, v) is less than the previous key value of v, update the key value to the weight of (u, v).
MST-PRIM (G, w, r)
1. for each u ∈ V [G]
2.     do key [u] ← ∞
3.        π [u] ← NIL
4. key [r] ← 0
5. Q ← V [G]
6. while Q ≠ ∅
7.     do u ← EXTRACT-MIN (Q)
8.        for each v ∈ Adj [u]
9.            do if v ∈ Q and w (u, v) < key [v]
10.               then π [v] ← u
11.                   key [v] ← w (u, v)
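A C sketch of this pseudocode on an adjacency matrix (N is an assumed vertex count; w[u][v] = 0 means no edge; a linear scan stands in for EXTRACT-MIN):

#include <limits.h>

#define N 7

/* Prim: grow the MST from vertex 0, returning its total weight */
int primMST(int w[N][N])
{
    int key[N], parent[N], inMST[N] = { 0 }, total = 0;
    for (int v = 0; v < N; v++) {
        key[v] = INT_MAX;   /* key[u] <- infinity */
        parent[v] = -1;     /* pi[u] <- NIL */
    }
    key[0] = 0;             /* the root is picked first */
    for (int c = 0; c < N; c++) {
        int u = -1;
        for (int v = 0; v < N; v++)   /* EXTRACT-MIN over vertices not in MST */
            if (!inMST[v] && (u == -1 || key[v] < key[u]))
                u = v;
        inMST[u] = 1;
        if (parent[u] != -1)
            total += w[parent[u]][u];
        for (int v = 0; v < N; v++)   /* update keys of u's neighbours */
            if (w[u][v] && !inMST[v] && w[u][v] < key[v]) {
                key[v] = w[u][v];
                parent[v] = u;
            }
    }
    return total;
}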

Example: Generate minimum cost spanning tree for the following graph using Prim's
algorithm.

Solution: In Prim's algorithm, first we initialize the priority queue Q to contain all the vertices, and the key of each vertex to ∞, except for the root, whose key is set to 0. Suppose vertex 0 is the root, i.e., r. By the EXTRACT-MIN(Q) procedure, now u = r and Adj[u] = {5, 1}.
Removing u from the set Q, we add it to the set V - Q of vertices in the tree. Now, update the key and π fields of every vertex v adjacent to u but not in the tree.

Taking 0 as starting vertex


Root = 0
Adj[0] = {5, 1}
Parent: π[5] = 0 and π[1] = 0
key[5] = ∞ and key[1] = ∞
w(0, 5) = 10 and w(0, 1) = 28
w(0, 5) < key[5] and w(0, 1) < key[1]
So the updated key values are key[5] = 10 and key[1] = 28:

Now EXTRACT-MIN(Q) removes 5, because key[5] = 10 is minimum, so u = 5.
Adj[5] = {0, 4}, and 0 is already in the tree.
Taking 4: key[4] = ∞ and w(5, 4) = 25.
Since w(5, 4) < key[4], update key[4] = 25 and π[4] = 5.

Now remove 4 because key[4] = 25 is minimum, so u = 4.

Adj[4] = {6, 3}
key[3] = ∞ and key[6] = ∞
w(4, 3) = 22 and w(4, 6) = 24
w(4, 3) < key[3] and w(4, 6) < key[6]
Update key[3] = 22 and key[6] = 24, and the parent of 3 and 6 as 4:
π[3] = 4, π[6] = 4

u = EXTRACT-MIN(3, 6)
u = 3, because key[3] < key[6] (22 < 24).
Now remove 3 because key[3] = 22 is minimum, so u = 3.

Adj[3] = {4, 6, 2}

4 is already in the tree (4 ∉ Q).
key[2] = ∞ and key[6] = 24
w(3, 2) = 12 and w(3, 6) = 18
w(3, 2) < key[2] and w(3, 6) < key[6], so key[6] = 24 now becomes key[6] = 18.
Now in Q, key[2] = 12, key[6] = 18, key[1] = 28, and the parent of 2 and 6 is 3:
π[2] = 3, π[6] = 3
Now EXTRACT-MIN(Q) removes 2, because key[2] = 12 is minimum.

u = EXTRACT-MIN(2, 6)
u = 2, because key[2] < key[6] (12 < 18).
Now the current node is 2.
Adj[2] = {3, 1}
3 is already in the tree.
Taking 1: key[1] = 28 and w(2, 1) = 16.
Since w(2, 1) < key[1], update key[1] = 16 and its parent as 2:
π[1] = 2

Now EXTRACT-MIN(Q) removes 1, because key[1] = 16 is minimum.

Adj[1] = {0, 6, 2}
0 and 2 are already in the tree.
Taking 6: key[6] = 18 and w(1, 6) = 14.
Since w(1, 6) < key[6], update key[6] = 14 and its parent as 1:
π[6] = 1

Now all the vertices have been spanned. Using the above table, we get the Minimum Spanning Tree:
0 → 5 → 4 → 3 → 2 → 1 → 6
[because π[5] = 0, π[4] = 5, π[3] = 4, π[2] = 3, π[1] = 2, π[6] = 1]
Thus the final spanning Tree is

Total Cost = 10 + 25 + 22 + 12 + 16 + 14 = 99
Dijkstra’s Algorithm – Single Source Shortest Path Algorithm
Dijkstra's Algorithm solves the Single Source Shortest Path (SSSP) problem. It is used to find the shortest path from a source node to a destination node in a graph.
The graph is a widely accepted data structure to represent a distance map; the distances between cities are effectively represented using a graph.
Dijkstra proposed an efficient way to find the single source shortest path from the weighted
graph. For a given source vertex s, the algorithm finds the shortest path to every other vertex v
in the graph.
Assumption : Weight of all edges is non-negative.
Steps of the Dijkstra’s algorithm are explained here:
1. Initializes the distance of source vertex to zero and remaining all other vertices to infinity.
2. Set source node to current node and put remaining all nodes in the list of unvisited vertex
list. Compute the tentative distance of all immediate neighbour vertex of the current node.
3. If the newly computed value is smaller than the old value, then update it.
For example, C is the current node, whose distance from source S is dist (S, C) = 5.
Consider N is the neighbour of C and weight of edge

(C, N) is 3. So the distance of N from source via C would be 8.
If the distance of N from source was already computed and if it is greater than 8 then relax edge
(S, N) and update it to 8, otherwise don’t update it.

d(S, N) = 11
d(S, C) + d(C, N) < d(S, N) ⇒ Relax edge (S, N)
Update d(S, N) = 8

d(S, N) = 7
d(S, C) + d(C, N) > d(S, N) ⇒ Don’t update d(S, N)
Weight updating in Dijkstra’s algorithm
4. When all the neighbours of a current node are explored, mark it as visited. Remove it from
unvisited vertex list. Mark the vertex from unvisited vertex list with minimum distance and
repeat the procedure.
5. Stop when the destination node is tested or when unvisited vertex list becomes empty.
Algorithm for Dijkstra’s Algorithm
Dijkstra’s shortest path algorithm is described below :
Algorithm DIJKSTRA_SHORTEST_PATH(G, s, t)
// s is the source vertex
// t is the target vertex
// π[u] stores the parent / previous node of u

// V is the set of vertices in graph G
dist[s] ← 0
π[s] ← NIL
for each vertex v ∈ V do
if v ≠ s then
dist[v] ← ∞
π[v] ← NIL
end
ENQUEUE(v, Q) // insert v to queue Q
end
while Q is not empty do
u ← vertex in Q having minimum dist[u]
if u == t then
break
end
DEQUEUE(u, Q) // Remove u from queue Q
for each adjacent node v of u do
val ← dist[u] + weight(u, v)
if val<dist[v] then
dist[v] ← val
π[v] ← u
end
end
end
Complexity analysis of Dijkstra’s Algorithm
The first for loop does initialization in O(|V|) time. As there are |V| nodes in the graph, the
size of queue Q would be |V|, and hence the while loop iterates |V| times in the worst case.
The for loop inside the while loop runs at most |V| times, because a node can have at most
|V| – 1 neighbours. The worst-case upper bound on the running time of this algorithm is
therefore O(|V|²).
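As a complement to the pseudocode, here is a minimal runnable sketch in C. It uses an adjacency matrix and a linear scan for the minimum-distance vertex, matching the O(|V|²) bound derived above; the small directed graph in main is hypothetical.

#include <stdio.h>
#include <stdbool.h>
#define N 4
#define INF 1000000

/* dist[] and pi[] play the roles of dist[] and π[] in the pseudocode. */
void dijkstra(int w[N][N], int s, int dist[N], int pi[N]) {
    bool done[N] = {false};
    for (int v = 0; v < N; v++) { dist[v] = INF; pi[v] = -1; }
    dist[s] = 0;
    for (int iter = 0; iter < N; iter++) {
        int u = -1;                            /* vertex in Q with minimum dist */
        for (int v = 0; v < N; v++)
            if (!done[v] && (u == -1 || dist[v] < dist[u])) u = v;
        if (dist[u] == INF) break;             /* remaining vertices unreachable */
        done[u] = true;                        /* DEQUEUE u */
        for (int v = 0; v < N; v++)            /* relax every edge (u, v) */
            if (w[u][v] && dist[u] + w[u][v] < dist[v]) {
                dist[v] = dist[u] + w[u][v];
                pi[v] = u;
            }
    }
}

int main(void) {
    /* A small hypothetical directed graph; w[u][v] = 0 means "no edge". */
    int w[N][N] = {{0, 1, 4, 0},
                   {0, 0, 2, 6},
                   {0, 0, 0, 3},
                   {0, 0, 0, 0}};
    int dist[N], pi[N];
    dijkstra(w, 0, dist, pi);
    for (int v = 0; v < N; v++)
        printf("dist[%d] = %d, pi[%d] = %d\n", v, dist[v], v, pi[v]);
    return 0;
}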

Problem: Suppose Dijkstra’s algorithm is run on the following graph, starting at node A

Draw a table showing the intermediate distance values of all the nodes at each iteration of the
algorithm.
Show the final shortest path tree.
Solution:
Here, source vertex is A.
dist[u] indicates distance of vertex u from source
π[u] indicates parent / previous node of u
Initialization:
dist[source] = 0 ⇒ dist[A] = 0
π[source] = NIL ⇒ π[A] = NIL
dist[B] = dist[C] = dist[D] = dist[E] = dist [F] = dist[G]= dist[H] = ∞
π[B] = π[C] = π[D] = π[E] = π[F] = π[G] = π[H]= NIL

Vertex u A B C D E F G H
dist[u] 0 ∞ ∞ ∞ ∞ ∞ ∞ ∞
π[u] NIL NIL NIL NIL NIL NIL NIL NIL

Iteration 1:
u = unprocessed vertex in Q having minimum dist[u] = A
Adjacent[A] = {B, E, F}

val[B] = dist[A] + weight(A, B)
=0+1
=1
Here, val[B] <dist[B], so update dist[B]
dist[B] = 1, and π[B] = A
val[E] = dist[A] + weight(A, E)
=0+4
=4
Here, val[E] < dist[E], so update dist[E]
dist[E] = 4 and π[E] = A
val[F] = dist[A] + weight(A, F)
=0+8
=8
Here, val[F] < dist[F], so update dist[F]
dist[F] = 8 and π[F] = A

Vertex u A B C D E F G H
dist[u] 0 1 ∞ ∞ 4 8 ∞ ∞
π[u] NIL A NIL NIL A A NIL NIL

Iteration 2:
u = unprocessed vertex in Q having minimum dist[u] = B
Adjacent[B] = {C, F, G}
val[C] = dist[B] + weight(B, C)
=1+2
=3
Here, val[C] < dist[C], so update dist[C]
dist[C] = 3 and π[C] = B
val[F] = dist[B] + weight(B, F)
=1+6
=7
Here, val[F] < dist[F], so update dist[F]

dist[F] = 7 and π[F] = B
val[G] = dist[B] + weight(B, G)
=1+6
=7
Here, val[G] < dist[G], so update dist[G]
dist[G] = 7 and π[G] = B

Vertex u A B C D E F G H
dist[u] 0 1 3 ∞ 4 7 7 ∞
π[u] NIL A B NIL A B B NIL
Iteration 3:
u = unprocessed vertex in Q having minimum dist[u] = C
Adjacent [C] = {D, G}
val[D] = dist[C] + weight(C, D)
=3+1
=4
Here, val[D] < dist[D], so update dist[D]
dist[D] = 4 and π[D] = C
val[G] = dist[C] + weight(C, G)
=3+2
=5
Here, val[G] < dist[G], so update dist[G]
dist[G] = 5 and π[G] = C

Vertex u A B C D E F G H
dist[u] 0 1 3 4 4 7 5 ∞
π[u] NIL A B C A B C NIL
Iteration 4:
u = unprocessed vertex in Q having minimum dist[u] = E

Adjacent[E] = {F}
val[F] = dist[E] + weight(E, F)
=4+5
=9
Here, val[F] > dist[F], so no change in table

Vertex u A B C D E F G H
dist[u] 0 1 3 4 4 7 5 ∞
π[u] NIL A B C A B C NIL

Iteration 5:
u = unprocessed vertex in Q having minimum dist[u] = D
Adjacent[D] = {G, H}
val[G] = dist[D] + weight(D, G)
=4+1
=5
Here, val[G] = dist[G], so don’t update dist[G]
val[H] = dist[D] + weight(D, H)
=4+4
=8
Here, val[H] < dist[H], so update dist[H]
dist[H] = 8 and π[H] = D

Vertex u A B C D E F G H
dist[u] 0 1 3 4 4 7 5 8
π[u] NIL A B C A B C D

Iteration 6:
u = unprocessed vertex in Q having minimum dist[u] = G
Adjacent[G] = { F, H }

val[F] = dist[G] + weight(G, F)
=5+1
=6
Here, val[F] < dist[F], so update dist[F]
dist[F] = 6 and π[F] = G
val[H] = dist[G] + weight(G, H)
=5+1
=6
Here, val[H] < dist[H], so update dist[H]
dist[H] = 6 and π[H] = G

Vertex u A B C D E F G H
dist[u] 0 1 3 4 4 6 5 6
π [u] NIL A B C A G C G
Iteration 7:
u = unprocessed vertex in Q having minimum dist[u] = F
Adjacent[F] = { }
So, no change in table

Vertex u A B C D E F G H
dist[u] 0 1 3 4 4 6 5 6
π[u] NIL A B C A G C G
Iteration 8:
u = unprocessed vertex in Q having minimum dist[u] = H
Adjacent[H] = { }
So, no change in table

Vertex u A B C D E F G H
dist[u] 0 1 3 4 4 6 5 6
π[u] NIL A B C A G C G
We can easily derive the shortest path tree for the given graph from the above table. In the
table, π[u] indicates the parent node of vertex u. The shortest path tree is shown in the following figure.

All Pairs Shortest Path (Floyd-Warshall) Algorithm
All Pairs Shortest Path Algorithm is also known as the Floyd-Warshall algorithm. It is an
optimization problem that can be solved using dynamic programming.
Let G = <V, E> be a directed graph, where V is a set of vertices and E is a set of edges with
nonnegative length. Find the shortest path between each pair of nodes.
L = Matrix, which gives the length of each edge
L[i, j] = 0, if i == j // Distance of node from itself is zero
L[i, j] = ∞ , if i ≠ j and (i, j) ∉ E
L[i, j] = w (i, j), if i ≠ j and (i, j) ∈ E // w(i, j) is the weight of the edge (i, j)
Principle of optimality:
If k is a node on the shortest path from i to j, then the paths from i to k and from k to j must
also be shortest.
In the following figure, the optimal path from i to j is either p or the combination of p1 and p2.

While considering kth vertex as intermediate vertex, there are two possibilities:
If k is not part of shortest path from i to j, we keep the distance D[i, j] as it is.
If k is part of shortest path from i to j, update distance D[i, j] as
D[i, k] + D[k, j].
Optimal sub structure of the problem is given as:
Dk[i, j] = min{ Dk-1[i, j], Dk-1[i, k] + Dk-1[k, j] }

Dk = Distance matrix after kth iteration
Algorithm for All Pairs Shortest Path
This approach is also known as the Floyd-warshall shortest path algorithm. The algorithm for
all pair shortest path (APSP) problem is described below
Algorithm FLOYD_APSP ( L)
// L is the matrix of size n × n representing the original graph
// D is the distance matrix
D←L
for k ← 1 to n do
for i ← 1 to n do
for j ← 1 to n do
Dk[i, j] ← min ( Dk-1[i, j], Dk-1[i, k] + Dk-1[k, j] )
end
end
end
return D
Complexity analysis of All Pairs Shortest Path Algorithm
It is very simple to derive the complexity of the all pairs shortest path problem from the above
algorithm. It uses three nested loops, and the innermost loop has only one statement whose
complexity is Θ(1).
The running time of the algorithm is therefore computed as:
T(n) = n × n × n × Θ(1) = Θ(n³)

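For reference, the triple loop also fits in a few lines of runnable C. The matrix below encodes the 4-vertex graph of the worked problem that follows (reconstructed from its D0 entries), with INF standing for ∞; D is updated in place, which is the usual space-saving form of the recurrence.

#include <stdio.h>
#define N 4
#define INF 1000000

int main(void) {
    /* D starts as the edge-length matrix L: 0 on the diagonal, INF for missing edges. */
    int D[N][N] = {{0,   INF, 3,   INF},
                   {2,   0,   INF, INF},
                   {INF, 7,   0,   1},
                   {6,   INF, INF, 0}};

    for (int k = 0; k < N; k++)            /* allow vertex k as an intermediate */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (D[i][k] + D[k][j] < D[i][j])
                    D[i][j] = D[i][k] + D[k][j];

    for (int i = 0; i < N; i++) {          /* print the final distance matrix */
        for (int j = 0; j < N; j++)
            printf("%7d", D[i][j]);
        printf("\n");
    }
    return 0;
}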
Problem: Apply Floyd’s method to find the shortest path for the below-mentioned all pairs.

Solution:
Optimal substructure formula for Floyd’s algorithm,
Dk[i, j] = min { Dk-1[i, j], Dk-1[i, k] + Dk-1[k, j] }
Iteration 1 : k = 1 :
D1[1, 2] = min { D0[1, 2], D0[1, 1] + D0[1, 2] }
= min {∞, 0 + ∞}
= ∞
D1[1, 3] = min { D0[1, 3], D0[1, 1] + D0[1, 3] }
= min {3, 0 + 3}
= 3
D1[1, 4] = min { D0[1, 4], D0[1, 1] + D0[1, 4] }
= min {∞, 0 + ∞}
= ∞
D1[2, 1] = min { D0[2, 1], D0[2, 1] + D0[1, 1] }
= min {2, 2 + 0}
= 2
D1[2, 3] = min { D0[2, 3], D0[2, 1] + D0[1, 3] }
= min {∞, 2 + 3}
= 5
D1[2, 4] = min { D0[2, 4], D0[2, 1] + D0[1, 4] }
= min {∞, 2 + ∞}
= ∞
D1[3, 1] = min { D0[3, 1], D0[3, 1] + D0[1, 1] }
= min {∞, ∞ + 0}
= ∞
D1[3, 2] = min { D0[3, 2], D0[3, 1] + D0[1, 2] }
= min {7, ∞ + ∞}
= 7
D1[3, 4] = min { D0[3, 4], D0[3, 1] + D0[1, 4] }
= min {1, ∞ + ∞}
= 1
D1[4, 1] = min { D0[4, 1], D0[4, 1] + D0[1, 1] }
= min {6, 6 + 0}
= 6
D1[4, 2] = min { D0[4, 2], D0[4, 1] + D0[1, 2] }
= min {∞, 6 + ∞}
= ∞
D1[4, 3] = min { D0[4, 3], D0[4, 1] + D0[1, 3] }
= min {∞, 6 + 3}
= 9

Note : Path distance for highlighted cell is improvement over original matrix.
Iteration 2 (k = 2) :
D2[1, 2] = D1 [1, 2] = ∞
D2[1, 3] = min { D1 [1, 3], D1 [1, 2] + D1 [2, 3] }
= min {3, ∞ + 5}
=3
D2[1, 4] = min { D1 [1, 4], D1 [1, 2] + D1 [2, 4] }
= min {∞, ∞ + ∞}
=∞
D2[2, 1] = D1 [2, 1] = 2
D2[2, 3] = D1 [2, 3] = 5
D2[2, 4] = D1 [2, 4] = ∞
D2[3, 1] = min { D1 [3, 1], D1 [3, 2] + D1 [2, 1] }
= min {∞, 7 + 2}
= 9
D2[3, 2] = D1 [3, 2] = 7
D2[3, 4] = min { D1 [3, 4], D1 [3, 2] + D1 [2, 4] }
= min {1, 7 + ∞}
=1
D2[4, 1] = min { D1 [4, 1], D1 [4, 2] + D1 [2, 1] }
= min {6, ∞ + 2}
=6
D2[4, 2] = D1 [4, 2] = ∞
D2[4, 3] = min { D1 [4, 3], D1 [4, 2] + D1 [2, 3] }
= min {9, ∞ + 5}
=9

Iteration 3 (k = 3) :
D3[1, 2] = min { D2 [1, 2], D2 [1, 3] + D2 [3, 2] }
= min {∞, 3 + 7}
= 10
D3[1, 3] = D2 [1, 3] = 3
D3[1, 4] = min { D2 [1, 4], D2 [1, 3] + D2 [3, 4] }
= min {∞, 3 + 1}
=4
D3[2, 1] = min { D2 [2, 1], D2 [2, 3] + D2 [3, 1] }
= min {2, 5 + 9}
=2
D3[2, 3] = D2 [2, 3] = 5

D3[2, 4] = min { D2 [2, 4], D2 [2, 3] + D2 [3, 4] }
= min {∞, 5 + 1}
=6
D3[3, 1] = D2 [3, 1] = 9
D3[3, 2] = D2 [3, 2] = 7
D3[3, 4] = D2 [3, 4] = 1
D3[4, 1] = min { D2 [4, 1], D2 [4, 3] + D2 [3, 1] }
= min {6, 9 + 9}
=6
D3[4, 2] = min { D2 [4, 2], D2 [4, 3] + D2 [3, 2] }
= min {∞, 9 + 7}
= 16
D3[4, 3] = D2 [4, 3] = 9

Iteration 4 (k = 4) :
D4[1, 2] = min { D3 [1, 2], D3 [1, 4] + D3 [4, 2] }
= min {10, 4 + 16}
= 10
D4[1, 3] = min { D3 [1, 3], D3 [1, 4] + D3 [4, 3] }
= min {3, 4 + 9}
=3
D4[1, 4] = D3 [1, 4] = 4
D4[2, 1] = min { D3 [2, 1], D3 [2, 4] + D3 [4, 1] }
= min {2, 6 + 6}
=2

D4[2, 3] = min { D3 [2, 3], D3 [2, 4] + D3 [4, 3] }
= min {5, 6 + 9}
=5
D4[2, 4] = D3 [2, 4] = 6
D4[3, 1] = min { D3 [3, 1], D3 [3, 4] + D3 [4, 1] }
= min {9, 1 + 6}
=7
D4[3, 2] = min { D3 [3, 2], D3 [3, 4] + D3 [4, 2] }
= min {7, 1 + 16}
=7
D4[3, 4] = D3 [3, 4] = 1
D4[4, 1] = D3 [4, 1] = 6
D4[4, 2] = D3 [4, 2] = 16
D4[4, 3] = D3 [4, 3] = 9
Final distance matrix is:
        1    2    3    4
1       0   10    3    4
2       2    0    5    6
3       7    7    0    1
4       6   16    9    0

Problem: Obtain all pair shortest path using Floyd’s Algorithm for given weighted graph

Solution:
Optimal substructure formula for Floyd’s algorithm,
Dk[i, j] = min{ Dk-1[i, j], Dk-1[i, k] + Dk-1[k, j] }

Iteration 1 (k = 1) :
D1 [1, 2] = min {D0[1, 2], D0[1, 1] + D0[1, 2] }
= min {8, 0 + 8}
=8
D1[1, 3] = min {D0[1, 3], D0[1, 1] + D0[1, 3] }
= min {5, 0 + 5}
=5
D1[2, 1] = min {D0[2, 1], D0[2, 1] + D0[1, 1] }
= min {2, 2 + 0}
=2
D1[2, 3] = min {D0[2, 3], D0[2, 1] + D0[1, 3] }
= min {∞, 2 + 5}
=7
D1[3, 1] = min {D0[3, 1], D0[3, 1] + D0[1, 1] }
= min {∞, ∞ + 0}
=∞
D1[3, 2] = min {D0[3, 2], D0[3, 1] + D0[1, 2] }
= min {1, ∞ + 8}
= 1

Iteration 2 (k = 2) :
D2[1, 2] = min {D1[1, 2], D1[1, 2] + D1[2, 2]}
= min {8, 8 + 0}
=8
D2[1, 3] = min {D1[1, 3], D1[1, 2] + D1[2, 3] }
= min {5, 8 +7}
=5
D2[2, 1] = min {D1[2, 1], D1[2, 2] + D1[2, 1] }
= min {2, 0 + 2}
=2
D2[2, 3] = min {D1[2, 3], D1[2, 2] + D1[2, 3] }
= min {7, 0 + 7}
=7
D2[3, 1] = min {D1[3, 1], D1[3, 2] + D1[2, 1] }
= min {∞, 1 + 2}
=3
D2[3, 2] = min {D1[3, 2], D1[3, 2] + D1[2, 2] }
= min {1, 1 + 0}
=1

Iteration 3 (k = 3) :

D3[1, 2] = min {D2[1, 2], D2[1, 3] + D2[3, 2]}
= min {8, 5 + 1}
=6
D3[1, 3] = min {D2[1, 3], D2[1, 3] + D2[3, 3] }
= min {5, 5 + 0}
=5
D3[2, 1] = min {D2[2, 1], D2[2, 3] + D2[3, 1] }
= min {2, 7 + 3}
=2
D3[2, 3] = min {D2[2, 3], D2[2, 3] + D2[3, 3] }
= min {7, 7 + 0}
=7
D3[3, 1] = min {D2[3, 1], D2[3, 3] + D2[3, 1] }
= min {3, 0 + 3}
=3
D3[3, 2] = min {D2[3, 2], D2[3, 3] + D2[3, 2] }
= min {1, 0 + 1}
= 1
Since the graph has three vertices, D3 is the final distance matrix.

Real-Life Applications of Graph

Following are the real-life applications:
 Graph data structures can be used to represent the interactions between players on a team,
such as passes, shots, and tackles. Analysing these interactions can provide insights into
team dynamics and areas for improvement.
 Commonly used to represent social networks, such as networks of friends on social
media.
 Graphs can be used to represent the topology of computer networks, such as the
connections between routers and switches.
 Graphs are used to represent the connections between different places in a
transportation network, such as roads and airports.
 Neural Networks: Vertices represent neurons and edges represent the synapses between
them. Neural networks are used to understand how our brain works and how
connections change when we learn. The human brain has about 10^11 neurons and
close to 10^15 synapses.
 Compilers: Graphs are used extensively in compilers. They can be used for type
inference, for so-called data flow analysis, register allocation, and many other purposes.
They are also used in specialized compilers, such as query optimization in database
languages.
 Robot planning: Vertices represent states the robot can be in and the edges the possible
transitions between the states. Such graph plans are used, for example, in planning paths
for autonomous vehicles.
 Graphs are used to represent flow of control in computers.
 Graphs are used in social networking sites where users act as nodes and connection
between them acts as edges.
 In an operating system, graphs are used as resource allocation graphs.
 Graphs are used in Google maps to find the shortest route.
 Graphs are also used in airlines system for effective route optimization.

 In-state transition diagrams, the graph is used to represent their states and their
transition.
 In transportation, graphs are used to find the shortest path.
 In circuits, graphs can be used to represent circuit points as nodes and wires as edges.
 Graphs are used in solving puzzles with only one solution, such as mazes.
 Graphs are used in computer networks for Peer to peer (P2P) applications.
 Graphs, mainly in the form of a DAG (directed acyclic graph), are used as an alternative
to blockchain for cryptocurrencies. For example, cryptocurrencies like IOTA and Nano are
mainly based on DAGs.
When to use Graphs:
 When you need to represent and analyse the relationships between different objects or
entities.
 When you need to perform network analysis.
 When you need to identify key players, influencers or bottlenecks in a system.
 When you need to make predictions or recommendations.
Advantages of Graph:
 By using graphs we can easily find the shortest path, neighbors of the nodes, and many
more.
 Graphs are used to implement algorithms like DFS and BFS.
 It is used to find minimum spanning tree which has many practical applications.
 It helps in organizing data.
 Because of their non-linear structure, graphs help in understanding complex problems and
their visualization.
 Graphs can handle large amounts of data and can easily be distributed across multiple
machines.
 Graphs can be used to model many different types of real-world relationships and
connections.
 Graphs are well suited to handle sparse data.
 Graphs can support multiple types of relationships between entities, such as one-to-one,
one-to-many, and many-to-many.
Disadvantages of Graph:
 Graphs use lots of pointers which can be complex to handle.
 It can have large memory complexity.
 If the graph is represented with an adjacency matrix then it does not allow parallel edges
and multiplication of the graph is also difficult.
 Some graph algorithms have high time complexity, which can slow down the
performance of a system.
 Graphs can have cyclic relationships, which can make it difficult to traverse or process
the data.
 Graphs may not have built-in support for advanced analytics such as machine learning
or data mining.

Unit III Trees
Tree- basic terminology, General tree and its representation, representation using
sequential and linked organization, Binary tree- properties, converting tree to binary
tree, binary tree traversals(recursive and non-recursive) - inorder, preorder, post order,
depth first and breadth first, Operations on binary tree. Huffman Tree (Concept and
Use), Binary Search Tree (BST), BST operations, Threaded binary search tree- concepts,
threading, insertion and deletion of nodes in in-order threaded binary search tree, in
order traversal of in-order threaded binary search tree.
What is a Tree data structure?
A tree data structure is a hierarchical structure that is used to represent and organize data in a
way that is easy to navigate and search. It is a collection of nodes that are connected by edges
and has a hierarchical relationship between the nodes. The topmost node of the tree is called
the root, and the nodes below it are called the child nodes. Each node can have multiple child
nodes, and these child nodes can also have their own child nodes, forming a recursive structure.
This data structure is a specialized method to organize and store data in the computer to be
used more effectively. It consists of a central node, structural nodes, and sub-nodes, which are
connected via edges. We can also say that tree data structure has roots, branches, and leaves
connected with one another.

Recursive Definition:
A tree consists of a root, and zero or more subtrees T1, T2, … , Tk such that there is an edge
from the root of the tree to the root of each subtree.

Why Tree is considered a non-linear data structure?
The data in a tree are not stored in a sequential manner i.e, they are not stored linearly. Instead,
they are arranged on multiple levels or we can say it is a hierarchical structure. For this reason,
the tree is considered to be a non-linear data structure.
Basic Terminologies In Tree Data Structure:
Parent Node: The node which is a predecessor of a node is called the parent node of that node.
{B} is the parent node of {D, E}.
Child Node: The node which is the immediate successor of a node is called the child node of
that node. Examples: {D, E} are the child nodes of {B}.
Root Node: The topmost node of a tree or the node which does not have any parent node is
called the root node. {A} is the root node of the tree. A non-empty tree must contain exactly
one root node and exactly one path from the root to all other nodes of the tree.
Leaf Node or External Node: The nodes which do not have any child nodes are called leaf
nodes. {K, L, M, N, O, P} are the leaf nodes of the tree.
Ancestor of a Node: Any predecessor nodes on the path of the root to that node are called
Ancestors of that node. {A,B} are the ancestor nodes of the node {E}
Descendant: Any successor node on the path from the leaf node to that node. {E,I} are the
descendants of the node {B}.
Sibling: Children of the same parent node are called siblings. {D,E} are called siblings.
Level of a node: The count of edges on the path from the root node to that node. The root node
has level 0.
Internal node: A node with at least one child is called Internal Node.
Neighbour of a Node: Parent or child nodes of that node are called neighbors of that node.
Subtree: Any node of the tree along with its descendant.
Properties of a Tree:
Number of edges: An edge can be defined as the connection between two nodes. If a tree has
N nodes then it will have (N-1) edges. There is only one path from each node to any other node
of the tree.

Depth of a node: The depth of a node is defined as the length of the path from the root to that
node. Each edge adds 1 unit of length to the path. So, it can also be defined as the number of
edges in the path from the root of the tree to the node.
Height of a node: The height of a node can be defined as the length of the longest path from
the node to a leaf node of the tree.
Height of the Tree: The height of a tree is the length of the longest path from the root of the
tree to a leaf node of the tree.
Degree of a Node: The total count of subtrees attached to that node is called the degree of the
node. The degree of a leaf node must be 0. The degree of a tree is the maximum degree of a
node among all the nodes in the tree.
Some more properties are:
Traversing in a tree is done by depth first search and breadth first search algorithm.
It has no loop and no circuit
It has no self-loop
It is a hierarchical model.
Syntax:
struct Node
{
    int data;                      // data stored in the node
    struct Node *left_child;       // pointer to the left subtree
    struct Node *right_child;      // pointer to the right subtree
};

Basic Operation of Tree:


Create − creates a tree in the data structure.
Insert − inserts data in a tree.
Search − searches for specific data in a tree to check whether it is present or not.
Preorder Traversal − traverses a tree in a pre-order manner.
Inorder Traversal − traverses a tree in an in-order manner.
Postorder Traversal − traverses a tree in a post-order manner.

General Tree:
In the data structure, General tree is a tree in which each node can have either zero or many
child nodes. It cannot be empty. In general tree, there is no limitation on the degree of a node.
The topmost node of a general tree is called the root node. There are many subtrees in a general
tree. The subtree of a general tree is unordered because the nodes of the general tree cannot be
ordered according to specific criteria. In a general tree, each node has in-degree (number of
parent nodes) one and maximum out-degree (number of child nodes) n.

Convert a Generic Tree (N-array Tree) to Binary Tree


Following are the rules to convert a Generic(N-array Tree) to a Binary Tree:
 The root of the Binary Tree is the Root of the Generic Tree.
 The left child of a node in the Generic Tree is the Left child of that node in the Binary
Tree.
 The right sibling of any node in the Generic Tree is the Right child of that node in the
Binary Tree.
Examples:
Convert the following Generic Tree to Binary Tree:

Below is the Binary Tree of the above Generic Tree:

Binary Tree Representations
A binary tree data structure is represented using two methods. Those methods are as follows...
Array Representation
Linked List Representation
Consider the following binary tree...

1. Array Representation of Binary Tree


In array representation of a binary tree, we use one-dimensional array (1-D Array) to represent
a binary tree.

Consider the above example of a binary tree and it is represented as follows...

To represent a binary tree of depth 'n' using array representation, we need a one-dimensional
array with a maximum size of 2^(n+1) - 1.
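As a sketch of the convention behind this bound (assuming the root is stored at index 0, which is one common choice), the positions of a node's children and parent follow directly from its index:

#include <stdio.h>

/* Index arithmetic for the array representation, root at index 0. */
int left_child(int i)  { return 2 * i + 1; }
int right_child(int i) { return 2 * i + 2; }
int parent(int i)      { return (i - 1) / 2; }

int main(void) {
    printf("children of index 1: %d and %d\n", left_child(1), right_child(1));
    printf("parent of index 4: %d\n", parent(4));
    return 0;
}

With 1-based indexing the formulas become 2i, 2i + 1 and i / 2 instead.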
2. Linked List Representation of Binary Tree
We use a doubly linked list to represent a binary tree. In a doubly linked list, every node consists
of three fields. First field for storing left child address, second for storing actual data and third
for storing right child address.
In this linked list representation, a node has the following structure...

The above example of the binary tree represented using Linked list representation is shown as
follows...

Properties of Binary Tree


1. The maximum number of nodes at level ‘l’ of a binary tree is 2^l:
Note: Here the level is the number of edges on the path from the root to the node. The level
of the root is 0.
2. The maximum number of nodes in a binary tree of height ‘h’ is 2^h – 1:
Note: Here the height of a tree is the maximum number of nodes on the root-to-leaf path. The
height of a tree with a single node is considered as 1

3. In a Binary Tree with N nodes, the minimum possible height or the minimum number of
levels is Log2(N+1):
Each level should have at least one element, so the height cannot be more than N. A binary tree
of height ‘h’ can have a maximum of 2^h – 1 nodes (previous property). So the number of nodes
will be less than or equal to this maximum value
4. A Binary Tree with L leaves has at least ⌈Log2(L)⌉ + 1 levels:
A binary tree has the maximum number of leaves (and the minimum number of levels) when
all levels are fully filled. Let all leaves be at level l; then L ≤ 2^l, which gives l ≥ Log2(L), so
at least ⌈Log2(L)⌉ + 1 levels are needed.
5. In a Binary tree where every node has 0 or 2 children, the number of leaf nodes is always
one more than nodes with two children:
6. In a non-empty binary tree, if n is the total number of nodes and e is the total number of
edges, then e = n-1:
Some extra properties of binary tree are:
Each node in a binary tree can have at most two child nodes: In a binary tree, each node can
have either zero, one, or two child nodes. If a node has zero children, it is called a leaf node. If
a node has one child, it is called a unary node. If a node has two children, it is called a binary
node.
The node at the top of the tree is called the root node: The root node is the first node in a binary
tree and all other nodes are connected to it. All other nodes in the tree are either child nodes or
descendant nodes of the root node.
Nodes that do not have any child nodes are called leaf nodes: Leaf nodes are the endpoints of
the tree and have no children. They represent the final result of the tree.
The height of a binary tree is defined as the number of edges from the root node to the deepest
leaf node: The height of a binary tree is the length of the longest path from the root node to any
of the leaf nodes. The height of a binary tree is also known as its depth.
In a full binary tree, every node except the leaves has exactly two children: In a full binary tree,
all non-leaf nodes have exactly two children. This means that there are no unary nodes in a full
binary tree.
In a complete binary tree, every level of the tree is completely filled except for the last level,
which can be partially filled: In a complete binary tree, all levels of the tree except the last level
are completely filled. This means that there are no gaps in the tree and all nodes are connected
to their parent nodes.
In a balanced binary tree, the height of the left and right subtrees of every node differ by at
most 1: In a balanced binary tree, the height of the left and right subtrees of every node is
similar. This ensures that the tree is balanced and that the height of the tree is minimized.
The in-order, pre-order, and post-order traversal of a binary tree are three common ways to
traverse the tree: In-order, pre-order, and post-order are three different ways to traverse a binary
tree. In-order traversal visits the left subtree, the node itself, and then the right subtree. Pre-

order traversal visits the node itself, the left subtree, and then the right subtree. Post-order
traversal visits the left subtree, the right subtree, and then the node itself.
Binary tree traversals (recursive and non-recursive)- inorder, preorder, post order, depth
first and breadth first, Operations on binary tree.
Pre-order Traversal without Recursion
The following operations are performed to traverse a binary tree in pre-order using a stack:
 Start with the root node and push it onto the stack.
 Repeat while the stack is not empty:
 POP the top element (PTR) from the stack and process the node.
 PUSH the right child of PTR onto the stack.
 PUSH the left child of PTR onto the stack.
Consider the following tree.

Start with node 4 and push it onto the stack.

Since the stack is not empty, POP 4 from the stack, process it and PUSH its left(7) and right(18)
child onto the stack.

Repeat the same process since the stack is not empty. POP 7 from the stack, process it and
PUSH its left(8) and right (13) child onto the stack.

Again, POP 8 from the stack and process it. Since it has no right or left child we don’t have to
PUSH anything to the stack.

Now POP 13 from the stack and process it. We don’t have to PUSH anything to the stack
because it also doesn’t have any subtrees.

POP 18 from the stack, process it and PUSH its left(5) and right(2) child to the stack.

Similarly POP 5 and 2 from the stack one after another. Since both these nodes don’t have any
child, we don’t have to PUSH anything onto the stack.

The nodes are processed in the order [ 4, 7, 8, 13, 18, 5, 2 ]. This is the required preorder
traversal of the given tree.
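A minimal C sketch of this stack-based pre-order traversal, with the tree from the trace hard-coded in main; the fixed stack capacity of 100 is an arbitrary assumption for the example.

#include <stdio.h>
#include <stdlib.h>

struct Node { int data; struct Node *left, *right; };

void preorder(struct Node *root) {
    struct Node *stack[100];
    int top = -1;
    if (root) stack[++top] = root;                 /* push the root */
    while (top >= 0) {
        struct Node *ptr = stack[top--];           /* pop and process */
        printf("%d ", ptr->data);
        if (ptr->right) stack[++top] = ptr->right; /* right first, so that */
        if (ptr->left)  stack[++top] = ptr->left;  /* left is popped first */
    }
}

struct Node *mk(int v) {
    struct Node *n = malloc(sizeof *n);
    n->data = v; n->left = n->right = NULL;
    return n;
}

int main(void) {
    /* The tree from the trace: 4 -> (7 -> (8, 13), 18 -> (5, 2)) */
    struct Node *root = mk(4);
    root->left = mk(7);        root->right = mk(18);
    root->left->left = mk(8);  root->left->right = mk(13);
    root->right->left = mk(5); root->right->right = mk(2);
    preorder(root);            /* prints 4 7 8 13 18 5 2 */
    printf("\n");
    return 0;
}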
In-order Traversal without Recursion
The following operations are performed to traverse a binary tree in in-order using a stack:
 Start from the root, call it PTR.
 Push PTR onto stack if PTR is not NULL.
 Move to left of PTR and repeat step 2.
 If PTR is NULL and stack is not empty, then Pop element from stack and set as PTR.
 Process PTR and move to the right of PTR, then go to step 2.
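A corresponding C sketch of these in-order steps, assuming the same struct Node and headers as in the pre-order example above:

void inorder(struct Node *root) {
    struct Node *stack[100];
    int top = -1;
    struct Node *ptr = root;
    while (ptr != NULL || top >= 0) {
        while (ptr != NULL) {        /* keep moving left, pushing as we go */
            stack[++top] = ptr;
            ptr = ptr->left;
        }
        ptr = stack[top--];          /* pop the most recently pushed node */
        printf("%d ", ptr->data);    /* process it */
        ptr = ptr->right;            /* then traverse its right subtree */
    }
}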
Post-order Traversal without Recursion
The post order traversal algorithm is more difficult to implement than the previous two
algorithms. The following operations are performed to traverse a binary tree in post-order using
a stack:
 Start from the root, call it PTR.
 Push PTR onto stack if PTR is not NULL.
 Move to left of PTR and repeat step 2.
 If PTR has a right child R, then PUSH -R onto the stack.
 Pop and process positive element from stack and set as PTR.
 If a negative node is popped, (PTR = -N), then set PTR = N and go to step 2.
Tree Traversals (Inorder, Preorder and Postorder) Recursive way
Unlike linear data structures (Array, Linked List, Queues, Stacks, etc) which have only one
logical way to traverse them, trees can be traversed in different ways. The following are the
generally used methods for traversing trees:
Inorder Traversal:
 Traverse the left subtree, i.e., call Inorder(left->subtree)
 Visit the root.
 Traverse the right subtree, i.e., call Inorder(right->subtree)

Preorder Traversal:
 Visit the root.
 Traverse the left subtree, i.e., call Preorder(left->subtree)
 Traverse the right subtree, i.e., call Preorder(right->subtree)
Postorder Traversal:
 Traverse the left subtree, i.e., call Postorder(left->subtree)
 Traverse the right subtree, i.e., call Postorder(right->subtree)
 Visit the root
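These three definitions translate directly into C; a minimal sketch, assuming the usual struct Node:

#include <stdio.h>
struct Node { int data; struct Node *left, *right; };

void inorderRec(struct Node *root) {
    if (root == NULL) return;
    inorderRec(root->left);          /* left subtree  */
    printf("%d ", root->data);       /* visit root    */
    inorderRec(root->right);         /* right subtree */
}

void preorderRec(struct Node *root) {
    if (root == NULL) return;
    printf("%d ", root->data);       /* visit root first */
    preorderRec(root->left);
    preorderRec(root->right);
}

void postorderRec(struct Node *root) {
    if (root == NULL) return;
    postorderRec(root->left);
    postorderRec(root->right);
    printf("%d ", root->data);       /* visit root last */
}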

DFS (Depth First Search) algorithm (Repeated just to understand in reference with trees)
It is a recursive algorithm to search all the vertices of a tree data structure or a graph. The depth-
first search (DFS) algorithm starts with the initial node of graph G and goes deeper until we
find the goal node or the node with no children.
Because of the recursive nature, stack data structure can be used to implement the DFS
algorithm. The process of implementing the DFS is similar to the BFS algorithm.
The step by step process to implement the DFS traversal is given as follows -
 First, create a stack with the total number of vertices in the graph.
 Now, choose any vertex as the starting point of traversal, and push that vertex into the
stack.

 After that, push a non-visited vertex (adjacent to the vertex on the top of the stack) to
the top of the stack.
 Now, repeat steps 3 and 4 until no vertices are left to visit from the vertex on the stack's
top.
 If no vertex is left, go back and pop a vertex from the stack.
 Repeat steps 2, 3, and 4 until the stack is empty.
Algorithm
Step 1: SET STATUS = 1 (ready state) for each node in G
Step 2: Push the starting node A on the stack and set its STATUS = 2 (waiting state)
Step 3: Repeat Steps 4 and 5 until STACK is empty
Step 4: Pop the top node N. Process it and set its STATUS = 3 (processed state)
Step 5: Push on the stack all the neighbors of N that are in the ready state (whose STATUS =
1) and set their STATUS = 2 (waiting state)
[END OF LOOP]
Step 6: EXIT
Pseudocode
DFS(G, v)                  // v is the vertex where the search starts
    Stack S := {};         // start with an empty stack
    for each vertex u, set visited[u] := false;
    push S, v;
    while (S is not empty) do
        u := pop S;
        if (not visited[u]) then
            visited[u] := true;
            for each unvisited neighbour w of u
                push S, w;
        end if
    end while
END DFS()
Example of DFS algorithm
Now, let's understand the working of the DFS algorithm by using an example. In the example
given below, there is a directed graph having 7 vertices.

Now, let's start examining the graph starting from Node H.
Step 1 - First, push H onto the stack.
STACK: H
Step 2 - POP the top element from the stack, i.e., H, and print it. Now, PUSH all the neighbors
of H onto the stack that are in ready state.
Print: H
STACK: A
Step 3 - POP the top element from the stack, i.e., A, and print it. Now, PUSH all the neighbors
of A onto the stack that are in ready state.
Print: A
STACK: B, D
Step 4 - POP the top element from the stack, i.e., D, and print it. Now, PUSH all the neighbors
of D onto the stack that are in ready state.
Print: D
STACK: B, F
Step 5 - POP the top element from the stack, i.e., F, and print it. Now, PUSH all the neighbors
of F onto the stack that are in ready state.
Print: F
STACK: B
Step 6 - POP the top element from the stack, i.e., B, and print it. Now, PUSH all the neighbors
of B onto the stack that are in ready state.
Print: B
STACK: C
Step 7 - POP the top element from the stack, i.e., C, and print it. Now, PUSH all the neighbors
of C onto the stack that are in ready state.
Print: C

STACK: E, G
Step 8 - POP the top element from the stack, i.e., G and PUSH all the neighbors of G onto the
stack that are in ready state.
Print: G
STACK: E
Step 9 - POP the top element from the stack, i.e., E and PUSH all the neighbors of E onto the
stack that are in ready state.
Print: E
STACK:
Now, all the graph nodes have been traversed, and the stack is empty.
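A runnable C sketch of this stack-based DFS over an adjacency matrix; the 5-vertex undirected graph in main is hypothetical and only illustrates the technique.

#include <stdio.h>
#include <stdbool.h>
#define N 5

void dfs(bool adj[N][N], int start) {
    bool visited[N] = {false};
    int stack[N * N], top = -1;       /* capacity: each edge pushes at most once */
    stack[++top] = start;
    while (top >= 0) {
        int u = stack[top--];
        if (visited[u]) continue;
        visited[u] = true;
        printf("%d ", u);                 /* process u */
        for (int w = N - 1; w >= 0; w--)  /* push unvisited neighbours */
            if (adj[u][w] && !visited[w])
                stack[++top] = w;
    }
}

int main(void) {
    bool adj[N][N] = {false};
    int edges[][2] = {{0,1},{0,2},{1,3},{2,4}};
    for (int i = 0; i < 4; i++)
        adj[edges[i][0]][edges[i][1]] = adj[edges[i][1]][edges[i][0]] = true;
    dfs(adj, 0);      /* one possible order: 0 1 3 2 4 */
    printf("\n");
    return 0;
}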
BFS algorithm
Breadth-first search is a graph traversal algorithm that starts traversing the graph from the root
node and explores all the neighbouring nodes. Then, it selects the nearest node and explores all
the unexplored nodes. While using BFS for traversal, any node in the graph can be considered
as the root node.
There are many ways to traverse the graph, but among them, BFS is the most commonly used
approach. It is an algorithm to search all the vertices of a tree or graph data structure. BFS
puts every vertex of the graph into two categories - visited and non-visited. It selects a single
node in a graph and, after that, visits all the nodes adjacent to the selected node.
Algorithm
The steps involved in the BFS algorithm to explore a graph are given as follows –
Step 1: SET STATUS = 1 (ready state) for each node in G
Step 2: Enqueue the starting node A and set its STATUS = 2 (waiting state)
Step 3: Repeat Steps 4 and 5 until QUEUE is empty
Step 4: Dequeue a node N. Process it and set its STATUS = 3 (processed state).
Step 5: Enqueue all the neighbours of N that are in the ready state (whose STATUS = 1) and
set their STATUS = 2 (waiting state)
[END OF LOOP]
Step 6: EXIT
Example of BFS algorithm
Now, let's understand the working of BFS algorithm by using an example. In the example given
below, there is a directed graph having 7 vertices.

In the above graph, minimum path 'P' can be found by using the BFS that will start from Node
A and end at Node E. The algorithm uses two queues, namely QUEUE1 and QUEUE2.
QUEUE1 holds all the nodes that are to be processed, while QUEUE2 holds all the nodes that
are processed and deleted from QUEUE1.
Now, let's start examining the graph starting from Node A.
Step 1 - First, add A to queue1 and NULL to queue2.
QUEUE1 = {A}
QUEUE2 = {NULL}
Step 2 - Now, delete node A from queue1 and add it into queue2. Insert all neighbors of node
A to queue1.
QUEUE1 = {B, D}
QUEUE2 = {A}
Step 3 - Now, delete node B from queue1 and add it into queue2. Insert all neighbors of node
B to queue1.
QUEUE1 = {D, C, F}
QUEUE2 = {A, B}
Step 4 - Now, delete node D from queue1 and add it into queue2. Insert all neighbors of node
D to queue1. The only neighbor of Node D is F since it is already inserted, so it will not be
inserted again.
QUEUE1 = {C, F}
QUEUE2 = {A, B, D}
Step 5 - Delete node C from queue1 and add it into queue2. Insert all neighbors of node C to
queue1.
QUEUE1 = {F, E, G}
QUEUE2 = {A, B, D, C}
Step 6 - Delete node F from queue1 and add it into queue2. Insert all neighbors of node F to
queue1. Since all the neighbors of node F are already present, we will not insert them again.
QUEUE1 = {E, G}

QUEUE2 = {A, B, D, C, F}
Step 7 - Delete node E from queue1. Since all of its neighbors have already been added, we
will not insert them again. Now, all the nodes are visited, and the target node E is encountered
in queue2.
QUEUE1 = {G}
QUEUE2 = {A, B, D, C, F, E}
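A runnable C sketch of BFS using a simple array-based queue; as above, the 5-vertex graph is hypothetical. Note that a node is marked visited when it is enqueued, which is what prevents the duplicate insertions mentioned in the steps above.

#include <stdio.h>
#include <stdbool.h>
#define N 5

void bfs(bool adj[N][N], int start) {
    bool visited[N] = {false};
    int queue[N], front = 0, rear = 0;
    queue[rear++] = start;            /* enqueue the starting node */
    visited[start] = true;
    while (front < rear) {
        int u = queue[front++];       /* dequeue and process */
        printf("%d ", u);
        for (int w = 0; w < N; w++)   /* enqueue unvisited neighbours */
            if (adj[u][w] && !visited[w]) {
                visited[w] = true;
                queue[rear++] = w;
            }
    }
}

int main(void) {
    bool adj[N][N] = {false};
    int edges[][2] = {{0,1},{0,2},{1,3},{2,4}};
    for (int i = 0; i < 4; i++)
        adj[edges[i][0]][edges[i][1]] = adj[edges[i][1]][edges[i][0]] = true;
    bfs(adj, 0);      /* prints 0 1 2 3 4 */
    printf("\n");
    return 0;
}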
Huffman Coding | Greedy Algo-3
Huffman coding is a lossless data compression algorithm. The idea is to assign variable-length
codes to input characters, lengths of the assigned codes are based on the frequencies of
corresponding characters.
The variable-length codes assigned to input characters are Prefix Codes, means the codes (bit
sequences) are assigned in such a way that the code assigned to one character is not the prefix
of code assigned to any other character. This is how Huffman Coding makes sure that there is
no ambiguity when decoding the generated bit stream.
Let us understand prefix codes with a counter example. Let there be four characters a, b, c and
d, and their corresponding variable length codes be 00, 01, 0 and 1. This coding leads to
ambiguity because code assigned to c is the prefix of codes assigned to a and b. If the
compressed bit stream is 0001, the de-compressed output may be “cccd” or “ccb” or “acd” or
“ab”.
There are mainly two major parts in Huffman Coding
Build a Huffman Tree from input characters.
Traverse the Huffman Tree and assign codes to characters.
Steps to build Huffman Tree
Input is an array of unique characters along with their frequency of occurrences and output is
Huffman Tree.
Create a leaf node for each unique character and build a min heap of all leaf nodes (Min Heap
is used as a priority queue. The value of frequency field is used to compare two nodes in min
heap. Initially, the least frequent character is at root)
Extract two nodes with the minimum frequency from the min heap.
Create a new internal node with a frequency equal to the sum of the two nodes frequencies.
Make the first extracted node as its left child and the other extracted node as its right child.
Add this node to the min heap.
Repeat steps #2 and #3 until the heap contains only one node. The remaining node is the root
node and the tree is complete.
Let us understand the algorithm with an example:
character Frequency

88
a 5
b 9
c 12
d 13
e 16
f 45
Step 1. Build a min heap that contains 6 nodes where each node represents root of a tree with
single node.
Step 2 Extract two minimum frequency nodes from min heap. Add a new internal node with
frequency 5 + 9 = 14.

Now min heap contains 5 nodes where 4 nodes are roots of trees with single element each, and
one heap node is root of tree with 3 elements
character Frequency
c 12
d 13
Internal Node 14
e 16
f 45
Step 3: Extract two minimum frequency nodes from heap. Add a new internal node with
frequency 12 + 13 = 25

Now min heap contains 4 nodes where 2 nodes are roots of trees with single element each, and
two heap nodes are roots of trees with more than one node.
character Frequency
Internal Node 14
e 16
Internal Node 25
f 45
Step 4: Extract two minimum frequency nodes. Add a new internal node with frequency 14 +
16 = 30

Now min heap contains 3 nodes.


character Frequency
Internal Node 25
Internal Node 30
f 45
Step 5: Extract two minimum frequency nodes. Add a new internal node with frequency 25 +
30 = 55

Now min heap contains 2 nodes.
character Frequency
f 45
Internal Node 55
Step 6: Extract two minimum frequency nodes. Add a new internal node with frequency 45 +
55 = 100

Now min heap contains only one node.


character Frequency
Internal Node 100
Since the heap contains only one node, the algorithm stops here.
Steps to print codes from Huffman Tree:
Traverse the tree formed starting from the root. Maintain an auxiliary array. While moving to
the left child, write 0 to the array. While moving to the right child, write 1 to the array. Print
the array when a leaf node is encountered.

The codes are as follows:
character code-word
f 0
c 100
d 101
a 1100
b 1101
e 111
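The whole procedure fits in a compact C sketch. To stay short it replaces the min heap with a linear scan for the two minimum-frequency nodes (this changes the running time, not the result) and uses static arrays instead of dynamic allocation; '$' is an arbitrary marker for internal nodes. Running it prints the same code words as the table above.

#include <stdio.h>
#define MAX 64

int freq[MAX], left_[MAX], right_[MAX], alive[MAX], nnodes = 0;
char sym[MAX];

int new_node(char s, int f, int l, int r) {
    sym[nnodes] = s; freq[nnodes] = f;
    left_[nnodes] = l; right_[nnodes] = r;
    alive[nnodes] = 1;                       /* node is still in the "heap" */
    return nnodes++;
}

int extract_min(void) {                      /* linear scan instead of a heap */
    int best = -1;
    for (int i = 0; i < nnodes; i++)
        if (alive[i] && (best == -1 || freq[i] < freq[best])) best = i;
    alive[best] = 0;
    return best;
}

void print_codes(int node, char *buf, int depth) {
    if (left_[node] == -1) {                 /* leaf: print its code word */
        buf[depth] = '\0';
        printf("%c %s\n", sym[node], buf);
        return;
    }
    buf[depth] = '0'; print_codes(left_[node], buf, depth + 1);
    buf[depth] = '1'; print_codes(right_[node], buf, depth + 1);
}

int main(void) {
    char cs[] = {'a','b','c','d','e','f'};
    int fs[] = {5, 9, 12, 13, 16, 45};       /* frequencies from the example */
    for (int i = 0; i < 6; i++) new_node(cs[i], fs[i], -1, -1);

    int remaining = 6;
    while (remaining > 1) {                  /* merge the two smallest nodes */
        int l = extract_min(), r = extract_min();
        new_node('$', freq[l] + freq[r], l, r);
        remaining--;
    }
    char buf[MAX];
    print_codes(nnodes - 1, buf, 0);         /* last created node is the root */
    return 0;
}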
What is a Binary Search tree?
A binary search tree follows some order to arrange the elements. In a Binary search tree, the
value of left node must be smaller than the parent node, and the value of right node must be
greater than the parent node. This rule is applied recursively to the left and right subtrees of the
root.
Let's understand the concept of Binary search tree with an example.

In the above figure, we can observe that the root node is 40, and all the nodes of the left subtree
are smaller than the root node, and all the nodes of the right subtree are greater than the root
node.
Similarly, we can see that the left child of the root node is greater than its own left child and
smaller than its own right child. So, it also satisfies the property of a binary search tree.
Therefore, we can say that the tree in the above image is a binary search tree.
Suppose if we change the value of node 35 to 55 in the above tree, check whether the tree will
be binary search tree or not.

In the above tree, the value of root node is 40, which is greater than its left child 30 but smaller
than right child of 30, i.e., 55. So, the above tree does not satisfy the property of Binary search
tree. Therefore, the above tree is not a binary search tree.
Example of creating a binary search tree
Now, let's see the creation of binary search tree using an example.
Suppose the data elements are - 45, 15, 79, 90, 10, 55, 12, 20, 50
First, we have to insert 45 into the tree as the root of the tree.
Then, read the next element; if it is smaller than the root node, insert it as the root of the left
subtree, and move to the next element.
Otherwise, if the element is larger than the root node, then insert it as the root of the right
subtree.
Now, let's see the process of creating the Binary search tree using the given data element. The
process of creating the BST is shown below -
Step 1 - Insert 45.

Step 2 - Insert 15.


As 15 is smaller than 45, so insert it as the root node of the left subtree.

Step 3 - Insert 79.
As 79 is greater than 45, so insert it as the root node of the right subtree.

Step 4 - Insert 90.


90 is greater than 45 and 79, so it will be inserted as the right subtree of 79.

Step 5 - Insert 10.


10 is smaller than 45 and 15, so it will be inserted as a left subtree of 15.

Step 6 - Insert 55.
55 is larger than 45 and smaller than 79, so it will be inserted as the left subtree of 79.

Step 7 - Insert 12.


12 is smaller than 45 and 15 but greater than 10, so it will be inserted as the right subtree of 10.

Step 8 - Insert 20.


20 is smaller than 45 but greater than 15, so it will be inserted as the right subtree of 15.

Step 9 - Insert 50.
50 is greater than 45 but smaller than 79 and 55. So, it will be inserted as a left subtree of 55.

Now, the creation of binary search tree is completed. After that, let's move towards the
operations that can be performed on Binary search tree.
We can perform insert, delete and search operations on the binary search tree.
Let's understand how a search is performed on a binary search tree.
Searching in Binary search tree
Searching means to find or locate a specific element or node in a data structure. In Binary
search tree, searching a node is easy because elements in BST are stored in a specific order.
The steps of searching a node in Binary Search tree are listed as follows -
 First, compare the element to be searched with the root element of the tree.
 If root is matched with the target element, then return the node's location.
 If it is not matched, then check whether the item is less than the root element, if it is
smaller than the root element, then move to the left subtree.
 If it is larger than the root element, then move to the right subtree.
 Repeat the above procedure recursively until the match is found.
 If the element is not found or not present in the tree, then return NULL.
Now, let's understand the searching in binary tree using an example. We are taking the binary
search tree formed above. Suppose we have to find node 20 from the below tree.
Step 1: Compare 20 with the root 45. Since 20 < 45, move to the left subtree (node 15).
Step 2: Compare 20 with 15. Since 20 > 15, move to the right subtree of 15.
Step 3: The right child of 15 is 20, which matches the target, so the search is successful.
Deletion in Binary Search tree


In a binary search tree, we must delete a node from the tree by keeping in mind that the property
of BST is not violated. To delete a node from BST, there are three possible situations occur -
The node to be deleted is the leaf node, or,
The node to be deleted has only one child, and,
The node to be deleted has two children
We will understand the situations listed above in detail.
When the node to be deleted is the leaf node

It is the simplest case to delete a node in BST. Here, we have to replace the leaf node with
NULL and simply free the allocated space.
We can see the process to delete a leaf node from BST in the below image. In below image,
suppose we have to delete node 90, as the node to be deleted is a leaf node, so it will be replaced
with NULL, and the allocated space will free.

When the node to be deleted has only one child


In this case, we have to replace the target node with its child, and then delete the child node. It
means that after replacing the target node with its child node, the child node will now contain
the value to be deleted. So, we simply have to replace the child node with NULL and free up
the allocated space.
We can see the process of deleting a node with one child from BST in the below image. In the
below image, suppose we have to delete the node 79, as the node to be deleted has only one
child, so it will be replaced with its child 55.
So, the replaced node 79 will now be a leaf node that can be easily deleted.

When the node to be deleted has two children


This case of deleting a node in BST is a bit complex among other two cases. In such a case, the
steps to be followed are listed as follows -
First, find the inorder successor of the node to be deleted.
After that, replace that node with the inorder successor until the target node is placed at the leaf
of tree.

And at last, replace the node with NULL and free up the allocated space.
The inorder successor is required when the right child of the node is not empty. We can obtain
the inorder successor by finding the minimum element in the right child of the node.
We can see the process of deleting a node with two children from BST in the below image. In
the below image, suppose we have to delete node 45 that is the root node, as the node to be
deleted has two children, so it will be replaced with its inorder successor. Now, node 45 will
be at the leaf of the tree so that it can be deleted easily.
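A compact C sketch covering all three deletion cases, assuming the struct Node and headers from the search sketch above; the two-children case copies the inorder successor (the minimum of the right subtree) into the node and then deletes that successor, as described.

struct Node *minNode(struct Node *root) {
    while (root->left != NULL) root = root->left;  /* leftmost = minimum */
    return root;
}

struct Node *deleteNode(struct Node *root, int key) {
    if (root == NULL) return NULL;
    if (key < root->data)
        root->left = deleteNode(root->left, key);
    else if (key > root->data)
        root->right = deleteNode(root->right, key);
    else {
        if (root->left == NULL) {          /* leaf, or only a right child */
            struct Node *tmp = root->right;
            free(root);
            return tmp;
        }
        if (root->right == NULL) {         /* only a left child */
            struct Node *tmp = root->left;
            free(root);
            return tmp;
        }
        /* two children: copy the inorder successor, then delete it */
        struct Node *succ = minNode(root->right);
        root->data = succ->data;
        root->right = deleteNode(root->right, succ->data);
    }
    return root;
}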

Insertion in Binary Search tree


A new key in BST is always inserted at the leaf. To insert an element in BST, we have to start
searching from the root node; if the node to be inserted is less than the root node, then search
for an empty location in the left subtree. Else, search for the empty location in the right subtree
and insert the data. Insert in BST is similar to searching, as we always have to maintain the rule
that the left subtree is smaller than the root, and right subtree is larger than the root.
Now, let's see the process of inserting a node into BST using an example.

What do you mean by Threaded Binary Tree?
In the linked representation of binary trees, more than one half of the link fields contain NULL
values which results in wastage of storage space. If a binary tree consists of n nodes then n+1
link fields contain NULL values. So in order to effectively manage the space, a method was
devised by Perlis and Thornton in which the NULL links are replaced with special links known
as threads. Such binary trees with threads are known as threaded binary trees. Each node in a
threaded binary tree either contains a link to its child node or thread to other nodes in the tree.

One-way threaded Binary trees:

In one-way threaded binary trees, a thread will appear either in the right or left link field of a
node. If it appears in the right link field of a node then it will point to the next node that will
appear on performing in order traversal. Such trees are called Right threaded binary trees. If
thread appears in the left field of a node then it will point to the nodes inorder predecessor.
Such trees are called Left threaded binary trees. Left threaded binary trees are used less often
as they don't yield the same advantages as right threaded binary trees. In one-way threaded binary
trees, the right link field of last node and left link field of first node contains a NULL. In order
to distinguish threads from normal links they are represented by dotted lines.

The above figure shows a binary tree whose inorder traversal yields D, B, E, A, C, F. When
this tree is represented as a right threaded binary tree, the right link field of leaf node D, which
contains a NULL value, is replaced with a thread that points to node B, the inorder successor
of node D. In the same way, the NULL values in the right link fields of the other leaf nodes
are replaced with threads.
Two-way threaded Binary Trees:

In two-way threaded Binary trees, the right link field of a node containing NULL values is
replaced by a thread that points to nodes inorder successor and left field of a node containing
NULL values is replaced by a thread that points to nodes inorder predecessor.

The above figure shows the inorder traversal of this binary tree yields D, B, E, G, A, C, F. If
we consider the two-way threaded Binary tree, the node E whose left field contains NULL is
replaced by a thread pointing to its inorder predecessor i.e. node B. Similarly, for node G whose
right and left linked fields contain NULL values are replaced by threads such that right link
field points to its inorder successor and left link field points to its inorder predecessor. In the
same way, other nodes containing NULL values in their link fields are filled with threads.

In the above figure of two-way threaded Binary tree, we noticed that no left thread is possible
for the first node and no right thread is possible for the last node. This is because they don't
have any inorder predecessor and successor respectively. This is indicated by threads pointing
nowhere. So in order to maintain the uniformity of threads, we maintain a special node called
the header node. The header node does not contain any data part and its left link field points to
the root node and its right link field points to itself. If this header node is included in the two-
way threaded Binary tree then this node becomes the inorder predecessor of the first node and
inorder successor of the last node. Now threads of left link fields of the first node and right link
fields of the last node will point to the header node.

Insertion in Binary threaded tree is similar to insertion in binary tree but we will have to
adjust the threads after insertion of each element.
C representation of Binary Threaded Node:
#include <stdbool.h>

struct Node
{
    struct Node *left, *right;
    int info;
    // true if the left pointer is a thread to the inorder predecessor
    bool lthread;
    // true if the right pointer is a thread to the inorder successor
    bool rthread;
};
In the following explanation, we have considered Binary Search Tree (BST) for insertion as
insertion is defined by some rules in BSTs.
Let tmp be the newly inserted node. There can be three cases during insertion:
Case 1: Insertion in empty tree
Both left and right pointers of tmp will be set to NULL and new node becomes the root.
root = tmp;
tmp -> left = NULL;
tmp -> right = NULL;
Case 2: When the new node is inserted as the left child
After inserting the node at its proper place, we have to make its left and right threads point to
its inorder predecessor and successor respectively. The node that was the inorder predecessor
of the parent becomes the inorder predecessor of tmp, and the parent itself becomes the inorder
successor of tmp. So the left and right threads of the new node will be:
tmp -> left = par ->left;
tmp -> right = par;
Before insertion, the left pointer of parent was a thread, but after insertion it will be a link
pointing to the new node.
par -> lthread = false;
par -> left = tmp;

Following example show a node being inserted as left child of its parent.

After insertion of 13,

Predecessor of 14 becomes the predecessor of 13, so left thread of 13 points to 10.


Successor of 13 is 14, so the right thread of 13 points to 14.
The left pointer of 14 is not a thread now; it points to its left child, which is 13.
Case 3: When new node is inserted as the right child
The parent of tmp is its inorder predecessor. The node which was inorder successor of the
parent is now the inorder successor of this node tmp. So the left and right threads of the new
node will be-
tmp -> left = par;
tmp -> right = par -> right;

Before insertion, the right pointer of parent was a thread, but after insertion it will be a link
pointing to the new node.
par -> rthread = false;
par -> right = tmp;
Following example shows a node being inserted as right child of its parent.

After 15 inserted,

Successor of 14 becomes the successor of 15, so right thread of 15 points to 16


Predecessor of 15 is 14, so left thread of 15 points to 14.
Right pointer of 14 is not a thread now, it points to right child which is 15.

In deletion, first the key to be deleted is searched, and then there are different cases for deleting
the Node in which key is found.
Case A: Leaf Node need to be deleted
In BST, for deleting a leaf Node the left or right pointer of parent was set to NULL. Here
instead of setting the pointer to NULL it is made a thread.
If the leaf node to be deleted is the left child of its parent, then after deletion the left pointer
of the parent should become a thread pointing to the parent's inorder predecessor.
par -> lthread = true;
par -> left = ptr -> left;

If the leaf Node to be deleted is right child of its parent then after deletion, right pointer of
parent should become a thread pointing to its successor. The Node which was inorder successor
of the leaf Node before deletion will become the inorder successor of the parent Node after
deletion.
par -> rthread = true;
par -> right = ptr -> right;

Case B: Node to be deleted has only one child


After deleting the Node as in a BST, the inorder successor and inorder predecessor of the Node
are found out.
s = inSucc(ptr);
p = inPred(ptr);
If Node to be deleted has left subtree, then after deletion right thread of its predecessor should
point to its successor.

p->right = s;

Before deletion, 15 is the predecessor and 20 is the successor of 16. After deletion of 16, node
20 becomes the successor of 15, so the right thread of 15 will point to 20.
If Node to be deleted has right subtree, then after deletion left thread of its successor should
point to its predecessor.
s->left = p;

Before deletion, 25 is the predecessor and 34 is the successor of 30. After deletion of 30, node
25 becomes the predecessor of 34, so the left thread of 34 will point to 25.
Case C: Node to be deleted has two children
We find inorder successor of Node ptr (Node to be deleted) and then copy the information of
this successor into Node ptr. After this inorder successor Node is deleted using either Case A
or Case B.
Inorder Traversal Using Threads
Let us see how inorder traversal actually works in a single threaded binary tree now

The following steps take place in the diagram above:
We go to the leftmost node, Node 1.
Node 1 has the boolean rightThread set to true, hence we next visit its inorder successor, Node 2.
Node 2 has a right child, so we visit Node 3 next.
Node 3 does not have a right child, i.e. its rightThread boolean is true, so we visit Node 4 next.
Node 4 has a right child, so we visit Node 5 and finish the traversal.
We will write a function which will give us the inorder successor of a node:
struct Node* findSuccessor(struct Node* node) {
    if (node->rightThread == true) {
        return node->right;       // the right pointer is a thread: it points
                                  // straight to the inorder successor
    }
    if (node->right != NULL) {    // otherwise go right once, then return the
        node = node->right;       // leftmost node of that subtree
        while (node->left != NULL && node->leftThread == false) {
            node = node->left;
        }
        return node;
    }
    return NULL;                  // no right child and no thread: last node
}
Applications of Tree data structure:
The applications of tree data structures are as follows:
1. Spanning trees: It is the shortest path tree used in the routers to direct the packets to the
destination.
2. Binary Search Tree: It is a type of tree data structure that helps in maintaining a sorted stream
of data.
Full Binary tree

Complete Binary tree
Skewed Binary tree
Strictly Binary tree
Extended Binary tree
3. Storing hierarchical data: Tree data structures are used to store hierarchical data, which
means data arranged in the form of a hierarchy.
4. Syntax tree: The syntax tree represents the structure of the program’s source code, which is
used in compilers.
5. Trie: It is a fast and efficient way for dynamic spell checking. It is also used for locating
specific keys from within a set.
6. Heap: It is also a tree data structure that can be represented in a form of an array. It is used
to implement priority queues.
7. Artificial intelligence: Decision trees and other tree-based models are commonly used in
machine learning and artificial intelligence to make predictions and classify data.
8. Database: Some databases use trees to organize data for efficient searching and sorting.
9. Network: Routing algorithms for networks, such as Internet Protocol (IP) routing, use trees
to find the best path for data to travel from one network to another.
Advantages of Tree data structure
 Efficient insertion, deletion, and search operations.
 Trees are flexibility in terms of the types of data that can be stored.
 It is used to represent hierarchical relationships.
 It has the ability to represent a recursive structure.
 Used in various algorithms and data structures such as Huffman coding and decision
trees.
 Trees use less space than other data structures, like lists and linked lists.
 Trees are dynamic in nature.
 Tree data structures can automatically self-organize as new data is added or removed,
which can improve performance and reduce complexity.
Disadvantages of Tree data structure
 Trees require additional memory for pointers.
 Trees are not the best choice for data that does not have hierarchical relationships.
 Trees with many levels can become expensive to search and traverse.
 Limited scalability compared to other data structures such as hash tables.
 Trees are typically used for storing and manipulating hierarchical data, and may not be
the best choice for other types of data.
 Not suitable for large datasets.
 Trees can become unbalanced, leading to poor performance and decreased efficiency.
Unit IV Search Trees
Symbol Table-Representation of Symbol Tables- Static tree table and Dynamic tree table,
Weight balanced tree - Optimal Binary Search Tree (OBST), OBST as an example of
Dynamic Programming, Height Balanced Tree- AVL tree. Red-Black Tree, AA tree, K-
dimensional tree, Splay Tree

Symbol Table
Symbol table is an important data structure used in a compiler.

Symbol table is used to store information about the occurrence of various entities such as
objects, classes, variable names, interfaces, function names etc. It is used by both the analysis and
synthesis phases.

The symbol table is used for the following purposes:

It is used to store the name of all entities in a structured form at one place.
It is used to verify if a variable has been declared.
It is used to determine the scope of a name.
It is used to implement type checking by verifying that assignments and expressions in the source
code are semantically correct.
A symbol table can either be linear or a hash table. Using the following format, it maintains
the entry for each name.

<symbol name, type, attribute>


For example, suppose the symbol table stores an entry for the following variable declaration:

static int salary


then, it stores an entry in the following format:

<salary, int, static>


The clause attribute contains the entries related to the name.

Implementation
The symbol table can be implemented as an unordered list if the compiler only needs to handle a
small amount of data.

A symbol table can be implemented in one of the following techniques:

Linear (sorted or unsorted) list


Hash table
Binary search tree
Symbol tables are mostly implemented as hash tables.
Operations
The symbol table provides the following operations:
Insert ()
The insert() operation is used more frequently in the analysis phase, when tokens are identified
and names are stored in the table.
The insert() operation is used to insert information in the symbol table, such as a unique name
occurring in the source code.
In the source code, the attribute of a symbol is the information associated with that symbol.
The information contains the state, value, type and scope of the symbol.
The insert() function takes the symbol and its value as arguments.
For example:
int x;
Should be processed by the compiler as:
insert (x, int)
lookup()
In the symbol table, lookup() operation is used to search a name. It is used to determine:
The existence of symbol in the table.
The declaration of the symbol before it is used.
Check whether the name is used in the scope.
Initialization of the symbol.
Checking whether the name is declared multiple times.
The basic format of lookup() function is as follows:
lookup (symbol)
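As an illustrative sketch only (not code from the source), insert() and lookup() can be implemented
over a separate-chaining hash table in C. The table size, field widths and hash function below are
all assumptions:

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211   /* assumed table size, typically a prime */

struct Symbol {
    char name[32];
    char type[16];
    struct Symbol* next;   /* chain for keys that collide */
};

static struct Symbol* table[TABLE_SIZE];

static unsigned hash(const char* s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

void insert(const char* name, const char* type) {
    struct Symbol* sym = malloc(sizeof *sym);
    strncpy(sym->name, name, sizeof sym->name - 1);
    sym->name[sizeof sym->name - 1] = '\0';
    strncpy(sym->type, type, sizeof sym->type - 1);
    sym->type[sizeof sym->type - 1] = '\0';
    unsigned h = hash(name);
    sym->next = table[h];   /* push onto the front of the chain */
    table[h] = sym;
}

struct Symbol* lookup(const char* name) {
    for (struct Symbol* s = table[hash(name)]; s; s = s->next)
        if (strcmp(s->name, name) == 0) return s;
    return NULL;   /* the name has not been declared */
}

For the declaration "int x;", insert("x", "int") would store the entry, and lookup("x") would
retrieve it later.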

Data structure for symbol table


A compiler contains two types of symbol tables: a global symbol table and scope symbol tables.
The global symbol table can be accessed by all the procedures, while a scope symbol table is local
to the scope in which it is created.
The scope of a name and symbol table is arranged in the hierarchy structure as shown below:

int value=10;

void sum_num()
{
int num_1;
int num_2;

{
int num_3;
int num_4;
}

int num_5;

{
int num_6;
int num_7;
}
}

void sum_id()
{
int id_1;
int id_2;

{
int id_3;
int id_4;
}

int num_5;
}
The above code can be represented in a hierarchical data structure of symbol tables:

The global symbol table contains one global variable and two procedure names. The name
mentioned in the sum_num table is not available for sum_id and its child tables.

The data structure hierarchy of the symbol table is stored in the semantic analyzer. If you want to
search for a name in the symbol table, it can be searched using the following algorithm:
First, the symbol is searched in the current symbol table.
If the name is found, the search is complete; otherwise the name is searched in the symbol table
of the parent, and so on, until the name is found or the global symbol table has been searched.

Representing Scope Information


In the source program, every name possesses a region of validity, called the scope of that name.

The rules in a block-structured language are as follows:

If a name is declared within block B, then it will be valid only within B.


If block B1 is nested within B2, then a name that is valid for block B2 is also valid for B1,
unless the name's identifier is re-declared in B1.
These scope rules need a more complicated organization of symbol table than a list of
associations between names and attributes.
Tables are organized into a stack, and each table contains the list of names and their associated
attributes.
Whenever a new block is entered, a new table is pushed onto the stack. The new table
holds the names that are declared as local to this block.
When a declaration is compiled, the table is searched for the name.
If the name is not found in the table, then the new name is inserted.
When a name's reference is translated, each table is searched, starting from the table on top of
the stack.
For example:
int x;
void f(int m) {
float x, y;
{
int i, j;
int u, v;
}
}
int g (int n)
{
bool t;
}
Dynamic Programming (DP) is a technique to solve problems by breaking them down into
overlapping sub-problems that exhibit optimal substructure. Various problems are solved
using DP, such as subset sum, knapsack and coin change. DP can also be applied to trees to solve
some specific problems.

Given a tree with N nodes and N-1 edges, calculate the maximum sum of the node values from
root to any of the leaves without re-visiting any node.
Given above is a diagram of a tree with N=14 nodes and N-1=13 edges. The node values are
3, 2, 1, 10, 1, 3, 9, 1, 5, 3, 4, 5, 9 and 8 for nodes 1, 2, 3, …, 14 respectively.
The diagram below shows all the paths from root to leaves:

All the paths are marked by different colors :


Path 1(red, 3-2-1-4) : sum of all node values = 10
Path 2(orange, 3-2-1-5) : sum of all node values = 11
Path 3(yellow, 3-2-3) : sum of all node values = 8
Path 4(green, 3-1-9-9) : sum of all node values = 22
Path 5(violet, 3-1-9-8) : sum of all node values = 21
Path 6(pink, 3-10-1) : sum of all node values = 14
Path 7(blue, 3-10-5) : sum of all node values = 18
Path 8(brown, 3-10-3) : sum of all node values = 16
The answer is 22, as Path 4 has the maximum sum of values of nodes in its path from a root to
leaves.

The greedy approach fails in this case. Starting from the root, it would greedily take 3 from the
first level, 10 from the next level and 5 from the third level. The result is Path 7 (sum 18), not the
optimal Path 4, hence the greedy approach does not work here.
The problem can be solved using Dynamic Programming on trees. Start memoizing from the
leaves and add the maximum of leaves to the root of every sub-tree. At the last step, there will
be root and the sub-tree under it, adding the value at node and maximum of sub-tree will give
us the maximum sum of the node values from root to any of the leaves.

The diagram above shows how to start from the leaves and add the maximum of the leaves of a
sub-tree to its root. Move upward and repeat the same procedure of storing the maximum of
every sub-tree's leaves and adding it to its root. In this example, the maximum of nodes 11 and
12 is taken and then added to node 5 (in this sub-tree, 5 is the root and 11, 12 are its
leaves). Similarly, the maximum of nodes 13 and 14 is taken and then added to node 7.
Repeat the steps for every sub-tree till we reach the root.

Let DP[i] be the maximum sum of node values on the path between node i and any of its leaves,
moving downwards. Traverse the tree using DFS traversal. For each sub-tree, store the maximum
over the DP values of its children and add it to the value at the root of that sub-tree. At the end,
DP[1] will hold the maximum sum of node values from the root to any of the leaves without
re-visiting any node.
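A minimal C sketch of this DFS is given below. The adjacency-list arrays (adj, deg), the value
array val and the node limit MAXN are all assumptions made for illustration:

#define MAXN 100

int val[MAXN + 1];            /* val[u] = value stored at node u          */
int adj[MAXN + 1][MAXN + 1];  /* adj[u][i] = i-th neighbour of node u     */
int deg[MAXN + 1];            /* deg[u] = number of neighbours of node u  */

/* Returns DP[u]: the maximum sum on any downward path from u to a leaf. */
int dfs(int u, int parent) {
    int best = 0;  /* maximum DP over the children; 0 if u is a leaf */
    for (int i = 0; i < deg[u]; i++) {
        int v = adj[u][i];
        if (v == parent) continue;  /* do not re-visit the node we came from */
        int d = dfs(v, u);
        if (d > best) best = d;
    }
    return val[u] + best;
}

/* dfs(1, 0) gives the answer for a tree rooted at node 1. */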

Height Balanced Trees

A height-balanced binary tree is defined as a binary tree in which the height of the left and the
right subtree of any node differ by not more than 1. AVL tree, red-black tree are examples of
height-balanced trees.

Conditions for Height-Balanced Binary Tree:


Following are the conditions for a height-balanced binary tree:

The difference between the heights of the left and the right subtree for any node is not more
than one.
The left subtree is balanced.
The right subtree is balanced.
Note: An empty tree is also height-balanced.

What is the Height Balance of a node?


To check if a binary tree is height-balanced, you have to check the height balance of
each node. Recomputing the heights of the two subtrees from scratch for every node would be
impractical; instead, we store the height balance information of every subtree in the root
node of that subtree. Thus, each node not only maintains its data and children's information
but also a height balance value.
The height balance of a node is calculated as follows:

height balance of node = height of right subtree – height of left subtree

The above formula means that:
If the right subtree is taller, the height balance of the node will be positive.
If the left subtree is taller, the balance of the node will be negative.

Height of a Height-Balanced Binary Tree:


The height of a node in a tree is the length of the longest path from that node downward to a
leaf, counting both the start and end vertices of the path. The height of a leaf is 1. The height
of a nonempty tree is the height of its root. It can be proved that the height of a height-balanced
binary tree with N nodes is O(logN).

Why do we need a Height-Balanced Binary Tree?


Let’s understand the need for a balanced binary tree through an example.

The above tree is a binary search tree and also a height-balanced tree.
Suppose we want to find the value 79 in the above tree. First, we compare 79 with the value of
the root node. Since 79 is greater than 35, we move to its right child, i.e., 48. Since 79 is greater
than 48, we move to the right child of 48, whose value is 79. The number of hops required to
search for the element 79 is 2.
Similarly, any element can be found with at most 2 jumps, because the height of the tree is 2.

So it can be seen that any value in a balanced binary tree can be searched in O(logN) time
where N is the number of nodes in the tree. But if the tree is not height-balanced then in the
worst case, a search operation can take O(N) time.

Applications of Height-Balanced Binary Tree:


Balanced trees are mostly used for in-memory sets and dictionaries.
Balanced trees are also used extensively in database applications in which insertions and
deletions are few but lookups are frequent.
It is used in applications that require improved searching apart from database applications.
It has applications in storyline games as well.
It is used mainly in corporate sectors where they have to keep the information about the
employees working there and their change in shifts.
Advantages of Height-Balanced Binary Tree:
It will improve the worst-case lookup time at the expense of making a typical case roughly one
lookup slower.
As a general rule, a height-balanced tree works better when the request frequencies across
the data set are evenly spread.
It gives better search time complexity.
Disadvantages of Height-Balanced Binary Tree:
Longer running times for the insert and remove operations.
Must keep balancing info in each node.
To find nodes to balance, must go back up in the tree.
How to check if a given tree is height-balanced:
You can check if a tree is height-balanced using recursion based on the idea that every subtree
of the tree will also be height-balanced. To check if a tree is height-balanced perform the
following operations:

Use recursion and visit the left subtree and right subtree of each node:
Check the height of the left subtree and right subtree.
If the absolute difference between their heights is at most 1 and both subtrees are themselves
balanced, then the node is height-balanced.
Otherwise, that node, and hence the whole tree, is not balanced.
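A minimal C sketch of this recursive check follows. The TNode struct is an assumption; returning
-1 as a "not balanced" sentinel lets the height and the balance check be computed in one O(n) pass:

#include <stdlib.h>

struct TNode { int data; struct TNode *left, *right; };

/* Returns the height of the subtree if it is balanced, or -1 if not. */
int checkHeight(struct TNode* root) {
    if (root == NULL) return 0;            /* an empty tree is balanced */
    int lh = checkHeight(root->left);
    if (lh == -1) return -1;
    int rh = checkHeight(root->right);
    if (rh == -1) return -1;
    if (abs(lh - rh) > 1) return -1;       /* heights differ by more than 1 */
    return (lh > rh ? lh : rh) + 1;
}

int isBalanced(struct TNode* root) {
    return checkHeight(root) != -1;
}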

AVL Tree:
AVL tree is a self-balancing Binary Search Tree (BST) where the difference between heights
of left and right subtrees cannot be more than one for all nodes.
Example of AVL Tree:

The above tree is AVL because the differences between the heights of left and right subtrees
for every node are less than or equal to 1.

Example of a Tree that is NOT an AVL Tree:

The above tree is not AVL because the differences between the heights of the left and right
subtrees for 8 and 12 are greater than 1.

Why AVL Trees?


Most of the BST operations (e.g., search, max, min, insert, delete.. etc) take O(h) time where h
is the height of the BST. The cost of these operations may become O(n) for a skewed Binary
tree. If we make sure that the height of the tree remains O(log(n)) after every insertion and
deletion, then we can guarantee an upper bound of O(log(n)) for all these operations. The height
of an AVL tree is always O(log(n)) where n is the number of nodes in the tree.

Insertion in AVL Tree:


To make sure that the given tree remains AVL after every insertion, we must augment the
standard BST insert operation to perform some re-balancing.
Following are two basic operations that can be performed to balance a BST without violating
the BST property (keys(left) < key(root) < keys(right)).
Left Rotation
Right Rotation
T1, T2 and T3 are subtrees of the tree, rooted with y (on the left side) or x (on the right side)

      y                                 x
     / \      Right Rotation           / \
    x   T3    - - - - - - - >        T1   y
   / \        < - - - - - - -            / \
 T1   T2      Left Rotation            T2   T3

Keys in both of the above trees follow the following order


keys(T1) < key(x) < keys(T2) < key(y) < keys(T3)
So BST property is not violated anywhere.

Steps to follow for insertion:


Let the newly inserted node be w

Perform standard BST insert for w.


Starting from w, travel up and find the first unbalanced node. Let z be the first unbalanced
node, y be the child of z that comes on the path from w to z and x be the grandchild of z that
comes on the path from w to z.
Re-balance the tree by performing appropriate rotations on the subtree rooted with z. There can
be 4 possible cases that need to be handled as x, y and z can be arranged in 4 ways.
Following are the possible 4 arrangements:
y is the left child of z and x is the left child of y (Left Left Case)
y is the left child of z and x is the right child of y (Left Right Case)
y is the right child of z and x is the right child of y (Right Right Case)
y is the right child of z and x is the left child of y (Right Left Case)
Following are the operations to be performed in above mentioned 4 cases. In all of the cases,
we only need to re-balance the subtree rooted with z and the complete tree becomes balanced
as the height of the subtree (After appropriate rotations) rooted with z becomes the same as it
was before insertion.

1. Left Left Case

T1, T2, T3 and T4 are subtrees.


       z                                y
      / \                             /   \
     y   T4    Right Rotate (z)      x     z
    / \        - - - - - - - - ->   / \   / \
   x   T3                         T1  T2 T3  T4
  / \
T1   T2
2. Left Right Case

     z                             z                                x
    / \                           / \                             /   \
   y   T4    Left Rotate (y)     x   T4    Right Rotate (z)      y     z
  / \        - - - - - - - ->   / \        - - - - - - - ->     / \   / \
T1   x                         y   T3                         T1  T2 T3  T4
    / \                       / \
  T2   T3                   T1   T2
3. Right Right Case

   z                               y
  / \                            /   \
T1   y      Left Rotate (z)     z     x
    / \     - - - - - - - ->   / \   / \
  T2   x                     T1  T2 T3  T4
      / \
    T3   T4
4. Right Left Case

   z                             z                                x
  / \                           / \                             /   \
T1   y    Right Rotate (y)    T1   x      Left Rotate (z)      z     y
    / \   - - - - - - - ->        / \     - - - - - - - ->    / \   / \
   x   T4                       T2   y                      T1  T2 T3  T4
  / \                               / \
T2   T3                           T3   T4
Illustration of Insertion at AVL Tree
Approach:
The idea is to use recursive BST insert. After insertion, we get pointers to all ancestors one by
one in a bottom-up manner, so we don't need a parent pointer to travel up: the recursive code
itself travels up and visits all the ancestors of the newly inserted node.

Follow the steps mentioned below to implement the idea:

Perform the normal BST insertion.


The current node must be one of the ancestors of the newly inserted node. Update the height
of the current node.
Get the balance factor (left subtree height – right subtree height) of the current node.
If the balance factor is greater than 1, then the current node is unbalanced and we are either in
the Left Left case or the Left Right case. To check whether it is the Left Left case or not, compare
the newly inserted key with the key in the left subtree root.
If the balance factor is less than -1, then the current node is unbalanced and we are either in the
Right Right case or the Right Left case. To check whether it is the Right Right case or not, compare
the newly inserted key with the key in the right subtree root. A sketch of these steps in code is
given below.
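The following is a minimal C sketch of this recursive AVL insertion (an illustration, not code from
the source); the AVLNode layout and helper names are assumptions:

#include <stdlib.h>

struct AVLNode { int key, height; struct AVLNode *left, *right; };

static int height(struct AVLNode* n) { return n ? n->height : 0; }
static int imax(int a, int b) { return a > b ? a : b; }

static struct AVLNode* newNode(int key) {
    struct AVLNode* n = malloc(sizeof *n);
    n->key = key;
    n->height = 1;               /* a new leaf has height 1 */
    n->left = n->right = NULL;
    return n;
}

static struct AVLNode* rightRotate(struct AVLNode* y) {
    struct AVLNode* x = y->left;
    y->left = x->right;
    x->right = y;
    y->height = imax(height(y->left), height(y->right)) + 1;
    x->height = imax(height(x->left), height(x->right)) + 1;
    return x;
}

static struct AVLNode* leftRotate(struct AVLNode* x) {
    struct AVLNode* y = x->right;
    x->right = y->left;
    y->left = x;
    x->height = imax(height(x->left), height(x->right)) + 1;
    y->height = imax(height(y->left), height(y->right)) + 1;
    return y;
}

struct AVLNode* insert(struct AVLNode* node, int key) {
    if (node == NULL) return newNode(key);    /* 1. normal BST insertion */
    if (key < node->key)
        node->left = insert(node->left, key);
    else if (key > node->key)
        node->right = insert(node->right, key);
    else
        return node;                          /* duplicate keys ignored */

    /* 2. update the height of this ancestor node */
    node->height = 1 + imax(height(node->left), height(node->right));

    /* 3. balance factor = left subtree height - right subtree height */
    int balance = height(node->left) - height(node->right);

    if (balance > 1 && key < node->left->key)      /* Left Left   */
        return rightRotate(node);
    if (balance < -1 && key > node->right->key)    /* Right Right */
        return leftRotate(node);
    if (balance > 1 && key > node->left->key) {    /* Left Right  */
        node->left = leftRotate(node->left);
        return rightRotate(node);
    }
    if (balance < -1 && key < node->right->key) {  /* Right Left  */
        node->right = rightRotate(node->right);
        return leftRotate(node);
    }
    return node;
}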

Red-Black Tree

When it comes to searching and sorting data, one of the most fundamental data structures is
the binary search tree. However, the performance of a binary search tree is highly dependent
on its shape, and in the worst case, it can degenerate into a linear structure with a time
complexity of O(n). This is where Red-Black Trees come in; they are a type of balanced binary
search tree that uses a specific set of rules to ensure that the tree is always balanced. This balance
guarantees that the time complexity for operations such as insertion, deletion, and searching is
always O(log n), regardless of the initial shape of the tree.

Red Black Trees are self-balancing, meaning that the tree adjusts itself automatically after each
insertion or deletion operation. It uses a simple but powerful mechanism to maintain balance,
by coloring each node in the tree either red or black.
Red Black Tree-

Red-Black tree is a binary search tree in which every node is colored either red or black.
It is a type of self-balancing binary search tree and has an efficient worst-case running time
complexity.
Properties of Red Black Tree:
The Red-Black tree satisfies all the properties of binary search tree in addition to that it satisfies
following additional properties –

1. Root property: The root is black.

2. External property: Every leaf (Leaf is a NULL child of a node) is black in Red-Black tree.
3. Internal property: The children of a red node are black. Hence possible parent of red node is
a black node.

4. Depth property: All the leaves have the same black depth.

5. Path property: Every simple path from root to descendant leaf node contains same number
of black nodes.

The result of all these above-mentioned properties is that the Red-Black tree is roughly
balanced.

Rules That Every Red-Black Tree Follows:


Every node has a color either red or black.
The root of the tree is always black.
There are no two adjacent red nodes (A red node cannot have a red parent or red child).
Every path from a node (including the root) to any of its descendant NULL nodes has the same
number of black nodes.
Every leaf (i.e. NULL node) must be colored BLACK.
Why Red-Black Trees?
Most of the BST operations (e.g., search, max, min, insert, delete.. etc) take O(h) time where h
is the height of the BST. The cost of these operations may become O(n) for a skewed Binary
tree. If we make sure that the height of the tree remains O(log n) after every insertion and
deletion, then we can guarantee an upper bound of O(log n) for all these operations. The height
of a Red-Black tree is always O(log n) where n is the number of nodes in the tree.

How does a Red-Black Tree ensure balance?


A simple example to understand balancing is, that a chain of 3 nodes is not possible in the Red-
Black tree. We can try any combination of colors and see if all of them violate the Red-Black
tree property.
Interesting points about Red-Black Tree:
The black height of the red-black tree is the number of black nodes on a path from the root
node to a leaf node. Leaf nodes are also counted as black nodes. So, a red-black tree of height
h has black height >= h/2.
Height of a red-black tree with n nodes is h<= 2 log2(n + 1).
All leaves (NIL) are black.
The black depth of a node is defined as the number of black nodes from the root to that node
i.e the number of black ancestors.
Every red-black tree is a special case of a binary tree.
Black Height of a Red-Black Tree :
Black height is the number of black nodes on a path from the root to a leaf. Leaf nodes are also
counted black nodes. From the above properties 3 and 4, we can derive, a Red-Black Tree of
height h has black-height >= h/2.

Number of nodes from a node to its farthest descendant leaf is no more than twice as the number
of nodes to the nearest descendant leaf.

Every Red Black Tree with n nodes has height <= 2Log2(n+1)
This can be proved using the following facts:

For a general Binary Tree, let k be the minimum number of nodes on all root to NULL paths,
then n >= 2^k – 1 (e.g., if k is 3, then n is at least 7). This expression can also be written as
k <= log2(n+1).
From property 4 of Red-Black trees and above claim, we can say in a Red-Black Tree with n
nodes, there is a root to leaf path with at-most Log2(n+1) black nodes.
From properties 3 and 5 of Red-Black trees, we can claim that the number of black nodes in a
Red-Black tree is at least ⌊ n/2 ⌋ where n is the total number of nodes.
From the above points, we can conclude the fact that Red Black Tree with n nodes has a height
<= 2Log2(n+1)

Search Operation in Red-black Tree:


As every red-black tree is a special case of a binary tree so the searching algorithm of a red-
black tree is similar to that of a binary tree.

Algorithm:

searchElement (tree, val)

Step 1:
    If tree -> data = val OR tree = NULL
        Return tree
    Else
        If val < tree -> data
            Return searchElement (tree -> left, val)
        Else
            Return searchElement (tree -> right, val)
        [ End of if ]
    [ End of if ]

Step 2: END
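A direct C rendering of this algorithm is sketched below. The RBNode layout (data, left, right) is
an assumption, and colour fields are omitted since search never inspects them:

struct RBNode { int data; struct RBNode *left, *right; };

struct RBNode* searchElement(struct RBNode* tree, int val) {
    /* Found the value, or reached a NULL leaf: return either way. */
    if (tree == NULL || tree->data == val)
        return tree;
    if (val < tree->data)
        return searchElement(tree->left, val);
    return searchElement(tree->right, val);
}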

Example: Searching 11 in the following red-black tree.

Solution:

Start from the root.


Compare the element to be searched with the root; if it is less than the root, then recurse for the
left subtree, else recurse for the right subtree.
If the element is found anywhere, return true, else return false.
AA Trees

AA trees are the variation of the red-black trees, a form of binary search tree.

AA trees use the concept of levels to aid in balancing binary trees. The level of node (instead
of colour) is used for balancing information. A link where child and parent’s levels are same,
is called a horizontal link, and is analogous to a red link in the red-black tree.

The level of every leaf node is one.


The level of red nodes is the same as the level of their parent nodes, and the links to them are
called horizontal links.
The level of black nodes is one less than the level of their parent node.
Each node stores its level, which takes O(log n) bits instead of the single colour bit (O(1)) of
red-black trees, but AA trees simplify restructuring by removing many cases.

An AA tree follows the same rules as red-black trees, with the addition of a single new rule:
red nodes cannot be present as left children.

 Every node can be either red (linked horizontally) or black.


 There are no two adjacent red nodes (or horizontal links).
 Every path from root to a NULL node has same number of black nodes (on black links).
 A left link cannot be red (horizontal). (Newly added rule)

Why AA trees :
The implementation and the number of rotation cases in Red-Black Trees are complex. AA trees
simplify the algorithm.

They eliminate half of the restructuring process by eliminating half of the rotation cases, which
makes them easier to code.
It simplifies the deletion process by removing multiple cases.
Below tree is the example of AA tree :

Note that in the above tree there is no left red child, which is the newly added rule of AA Trees.

After re-drawing the above AA tree with levels and horizontal links (the red nodes are shown
connected through horizontal or red links), the tree looks like:

Note that all the nodes on level 1 i.e. 5, 10, 20, 35, 40, 55, 65, 80, 90 are known as leaf nodes.

So, in summarized way, for tree to be AA tree, it must satisfy the following five invariants:

1.) The level of a leaf node is 1.


2.) The level of a left child is exactly one less than that of its parent.
3.) The level of every right child is equal to or one less than that of its parent.
4.) The level of every right grandchild is strictly less than that of its grandparent.
5.) Every node of level greater than one has two children.
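The restructuring simplification comes down to two primitives, commonly called skew and split.
The sketch below is illustrative (the AANode layout is an assumption), not code from the source:

struct AANode { int key, level; struct AANode *left, *right; };

/* Skew: remove a red (horizontal) left link by rotating right. */
struct AANode* skew(struct AANode* t) {
    if (t != NULL && t->left != NULL && t->left->level == t->level) {
        struct AANode* l = t->left;
        t->left = l->right;
        l->right = t;
        return l;
    }
    return t;
}

/* Split: remove two consecutive red right links by rotating left
   and promoting the middle node one level up. */
struct AANode* split(struct AANode* t) {
    if (t != NULL && t->right != NULL && t->right->right != NULL
        && t->level == t->right->right->level) {
        struct AANode* r = t->right;
        t->right = r->left;
        r->left = t;
        r->level++;
        return r;
    }
    return t;
}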

What is K dimension tree?


A K-D Tree(also called as K-Dimensional Tree) is a binary search tree where data in each node
is a K-Dimensional point in space. In short, it is a space partitioning(details below) data
structure for organizing points in a K-Dimensional space. A non-leaf node in K-D tree divides
the space into two parts, called as half-spaces. Points to the left of this space are represented
by the left subtree of that node and points to the right of the space are represented by the right
subtree. We will soon be explaining the concept on how the space is divided and tree is formed.
For the sake of simplicity, let us understand a 2-D Tree with an example. The root would have
an x-aligned plane, the root’s children would both have y-aligned planes, the root’s
grandchildren would all have x-aligned planes, and the root’s great-grandchildren would all
have y-aligned planes and so on.

Generalization: Let us number the planes as 0, 1, 2, …(K – 1). From the above example, it is
quite clear that a point (node) at depth D will have A aligned plane where A is calculated as:
A = D mod K

How to determine if a point will lie in the left subtree or in the right subtree? If a node is
aligned on plane A, then its left subtree will contain all points whose coordinates in that plane
are smaller than that of the node. Similarly, the right subtree will contain all points whose
coordinates in that plane are greater than or equal to that of the node.

Creation of a 2-D Tree: Consider following points in a 2-D plane: (3, 6), (17, 15), (13, 15), (6,
12), (9, 1), (2, 7), (10, 19)

 Insert (3, 6): Since tree is empty, make it the root node.
 Insert (17, 15): Compare it with root node point. Since root node is X-aligned, the X-
coordinate value will be compared to determine if it lies in the right subtree or in the
left subtree. This point will be Y-aligned.
 Insert (13, 15): X-value of this point is greater than X-value of point in root node. So,
this will lie in the right subtree of (3, 6). Again Compare Y-value of this point with the
Y-value of point (17, 15) (Why?). Since, they are equal, this point will lie in the right
subtree of (17, 15). This point will be X-aligned.
 Insert (6, 12): X-value of this point is greater than X-value of point in root node. So,
this will lie in the right subtree of (3, 6). Again Compare Y-value of this point with the
Y-value of point (17, 15) (Why?). Since, 12 < 15, this point will lie in the left subtree
of (17, 15). This point will be X-aligned.
 Insert (9, 1):Similarly, this point will lie in the right of (6, 12).
 Insert (2, 7):Similarly, this point will lie in the left of (3, 6).
 Insert (10, 19): Similarly, this point will lie in the left of (13, 15).
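A minimal C sketch of this insertion procedure is shown below for K = 2. The KDNode layout and
function names are assumptions made for illustration; the axis alternation follows A = D mod K
as described above:

#include <stdlib.h>

#define K 2   /* dimension of the tree; 2 for the 2-D example above */

struct KDNode {
    int point[K];
    struct KDNode *left, *right;
};

/* Insert a point, comparing on axis (depth mod K) at each level. */
struct KDNode* insertKD(struct KDNode* root, const int point[K], int depth) {
    if (root == NULL) {
        struct KDNode* n = malloc(sizeof *n);
        for (int i = 0; i < K; i++) n->point[i] = point[i];
        n->left = n->right = NULL;
        return n;
    }
    int axis = depth % K;                  /* A = D mod K */
    if (point[axis] < root->point[axis])
        root->left = insertKD(root->left, point, depth + 1);
    else                                   /* greater-or-equal goes right */
        root->right = insertKD(root->right, point, depth + 1);
    return root;
}

/* Usage: root = insertKD(root, (int[]){3, 6}, 0); and so on for each point. */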
How is space partitioned? All 7 points will be plotted in the X-Y plane as follows:

1. Point (3, 6) will divide the space into two parts: Draw line X = 3.

2. Point (2, 7) will divide the space to the left of line X = 3 into two parts horizontally.
Draw line Y = 7 to the left of line X = 3.

3. Point (17, 15) will divide the space to the right of line X = 3 into two parts horizontally.
Draw line Y = 15 to the right of line X = 3.
4. Point (6, 12) will divide the space below line Y = 15 and to the right of line X = 3 into two
parts. Draw line X = 6 to the right of line X = 3 and below line Y = 15.

5. Point (13, 15) will divide the space below line Y = 15 and to the right of line X = 6 into two
parts. Draw line X = 13 to the right of line X = 6 and below line Y = 15.

6. Point (9, 1) will divide the space between lines X = 3, X = 6 and Y = 15 into two parts. Draw
line Y = 1 between lines X = 3 and X = 13.
7. Point (10, 19) will divide the space to the right of line X = 3 and above line Y = 15 into two
parts. Draw line Y = 19 to the right of line X = 3 and above line Y = 15.

Splay Tree-
Splay tree is a binary search tree. In a splay tree, M consecutive operations can be performed
in O(M log N) time.

A single operation may require O(N) time, but the average time to perform M operations will be
O(M log N).

When a node is accessed, it is moved to the top through a set of operations known as splaying.
Splaying technique is similar to rotation in an AVL tree. This will make the future access of
the node cheaper.

Unlike AVL tree, splay trees do not have the requirement of storing Balance Factor of every
node. This saves the space and simplifies algorithm to a great extent.

There are two standard techniques of splaying.

1. Bottom up Splaying

2. Top Down Splaying
1. Bottom up Splaying:-
Idea behind bottom up splaying is explained below: Rotation is performed bottom up along the
access path.

Let X be a (non root) node on the access path at which we are rotating.

a) If the parent of X is the root of the tree, rotate X and the parent of X. This will be the last
rotation required.

b) If X has both a parent (P) and Grand parent (G) then like an AVL tree there could be four
cases.

These four cases are:

1. X is a left child and P is a left child.

2. X is a left child and P is a right child.

3. X is a right child and P is a left child

4. X is a right child and P is a right child.

2.Top Down Splaying:-

When an item X is inserted as a leaf, a series of tree rotations brings X to the root. These
rotations are known as splaying. A splay is also performed during searches, and if an item is
not found, a splay is performed on the last node on the access path.

A top down traversal is performed to locate the node.

In bottom-up splaying, the splaying itself is done along the path back up; this can be achieved by
storing the access path on a stack during the top-down traversal.

Top-down splaying, in contrast, is based on splaying along the initial traversal path itself, so a
stack is not required to save the traversal path.

At any point in the access:

1. A current node X is the root of its sub tree and is represented as the “middle” tree.

2. Tree L stores nodes in the tree T that are less than X, but not in X’s sub tree.

3. Tree R stores nodes in the tree T that are larger than X, but not in X’s sub tree.

4. Initially, X is the root of T, and L and R are empty.
The worst case time complexity of Binary Search Tree (BST) operations like search, delete,
insert is O(n). The worst case occurs when the tree is skewed. We can get the worst case time
complexity as O(Logn) with AVL and Red-Black Trees.
Can we do better than AVL or Red-Black trees in practical situations?
Like AVL and Red-Black Trees, Splay tree is also self-balancing BST. The main idea of splay
tree is to bring the recently accessed item to root of the tree, this makes the recently searched
item to be accessible in O(1) time if accessed again. The idea is to use locality of reference (In
a typical application, 80% of the access are to 20% of the items). Imagine a situation where we
have millions or billions of keys and only few of them are accessed frequently, which is very
likely in many practical applications.
All splay tree operations run in O(log n) time on average, where n is the number of entries in
the tree. Any single operation can take Theta(n) time in the worst case.
Search Operation
The search operation in Splay tree does the standard BST search, in addition to search, it also
splays (move a node to the root). If the search is successful, then the node that is found is
splayed and becomes the new root. Else the last node accessed prior to reaching the NULL is
splayed and becomes the new root.
There are following cases for the node being accessed.
1) Node is root We simply return the root, don’t do anything else as the accessed node is already
root.
2) Zig: Node is child of root (the node has no grandparent). Node is either a left child of root
(we do a right rotation) or node is a right child of its parent (we do a left rotation).
T1, T2 and T3 are subtrees of the tree rooted with y (on left side) or x (on right side)

      y                                        x
     / \       Zig (Right Rotation)           / \
    x   T3     - - - - - - - - - ->         T1   y
   / \         < - - - - - - - - -              / \
 T1   T2       Zag (Left Rotation)            T2   T3

3) Node has both parent and grandparent. There can be following subcases.
……..3.a) Zig-Zig and Zag-Zag Node is left child of parent and parent is also left child of grand
parent (Two right rotations) OR node is right child of its parent and parent is also right child
of grand parent (Two Left Rotations).

Zig-Zig (Left Left Case):

        G                            P                              X
       / \                          / \                            / \
      P   T4    rightRotate(G)     X   G      rightRotate(P)     T1   P
     / \        ============>     / \ / \     ============>          / \
    X   T3                      T1 T2 T3 T4                        T2   G
   / \                                                                 / \
 T1   T2                                                             T3   T4

Zag-Zag (Right Right Case):

G P X
/ \ / \ /\

134
T1 P leftRotate(G) G X leftRotate(P) P T4
/ \ ============> / \ / \ ============> / \
T2 X T1 T2 T3 T4 G T3
/\ /\
T3 T4 T1 T2

……..3.b) Zig-Zag and Zag-Zig Node is right child of parent and parent is left child of grand
parent (Left Rotation followed by right rotation) OR node is left child of its parent and parent
is right child of grand parent (Right Rotation followed by left rotation).

Zag-Zig (Left Right Case):

    G                           G                              X
   / \                         / \                            / \
  P   T4    leftRotate(P)     X   T4     rightRotate(G)      P   G
 / \        ============>    / \         ============>      / \ / \
T1  X                       P   T3                        T1 T2 T3 T4
   / \                     / \
 T2   T3                 T1   T2

Zig-Zag (Right Left Case):

   G                             G                              X
  / \                           / \                            / \
T1   P     rightRotate(P)     T1   X       leftRotate(G)      G   P
    / \    =============>         / \      ============>     / \ / \
   X   T4                       T2   P                     T1 T2 T3 T4
  / \                               / \
 T2  T3                           T3   T4

Example:

      100                       100                        [20]
      /  \                      /  \                          \
    50    200                 50    200                        50
    /        search(20)       /        search(20)             /  \
  40         ======>       [20]        ========>            30    100
  /          1. Zig-Zig       \        2. Zig-Zig             \      \
30           at 40             30      at 100                  40     200
/                               \
[20]                             40

The important thing to note is, the search or splay operation not only brings the searched key
to root, but also balances the BST. For example in above case, height of BST is reduced by 1.
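All of the cases above reduce to the two primitive rotations. A minimal sketch in C follows (the
SNode layout is an assumption); a full splay routine would compose these according to the
zig / zig-zig / zig-zag cases:

struct SNode { int key; struct SNode *left, *right; };

/* Zig: right rotation around y (y's left child x becomes the root). */
struct SNode* rightRotate(struct SNode* y) {
    struct SNode* x = y->left;
    y->left = x->right;
    x->right = y;
    return x;
}

/* Zag: left rotation around x (x's right child y becomes the root). */
struct SNode* leftRotate(struct SNode* x) {
    struct SNode* y = x->right;
    x->right = y->left;
    y->left = x;
    return y;
}

/* Zig-Zig at X = rightRotate at the grandparent, then at the parent;
   Zig-Zag = leftRotate at the parent followed by rightRotate at the grandparent. */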
What is Weight Balanced Binary Tree?

A weight-balanced tree is a binary tree in which for each node the number of nodes in the left
subtree is at least half and at most twice the number of nodes in the right subtree. It is a binary
tree that is balanced based on the knowledge of the probabilities of searching for each
individual node. In each sub-tree, the node with the highest weight appears at the root thereby
resulting in more efficient searching performance. The nodes which are most likely to be
searched/accessed have the lowest search time.

Example:

Huffman Tree.

In the above diagram, the letters represent the node values and the numbers represent node
weights.

Why different definitions of balanced

Binary Search Tree (BST) was invented to make searching a more efficient process than
searching in an unordered array. However, when the BST is unbalanced, searching becomes
inefficient. For efficient searching, it is advisable to keep the tree balanced. But
it is difficult and inefficient to keep a BST balanced as the values are added and deleted
frequently. Thus, a way was invented to keep the BST balanced by adding more information
to each node or by allowing a node to have more than two children. Some of the examples of
such invented trees were AVL Tree, 2-3 Tree, B-Tree, Red-Black Tree, etc.

Comparison of Height Balanced and Weight Balanced Tree

A height-balanced tree improves the worst-case lookup time (for a binary tree, it will always
be bounded by log2(n)), at the expense of making the typical case roughly one lookup slower
(approximately half of the nodes will be at the maximum depth).
If your weight is related to frequency-of-lookup, a weight-balanced tree will improve the
average lookup time, at the expense of making the worst-case higher (more frequently
requested items have a higher weight, and will thus tend to be in shallower trees, with the cost
being deeper trees for less-frequently-requested items).

S.No.  Height Balanced Tree                           Weight Balanced Tree

1      It is the binary tree that is balanced         It is the binary tree that is balanced
       based on the height of the subtrees.           based on the weight on the edges of the
                                                      tree.

2      In a height-balanced tree, the absolute        In a weight-balanced tree, the absolute
       difference of the heights of the left          difference between the weights of the left
       subtree and the right subtree should be        subtree and the right subtree should be
       minimum.                                       minimum.

3      It will improve the worst-case lookup          It will improve the average lookup time
       time at the expense of making a typical        at the expense of making the worst case
       case roughly one lookup less.                  higher.

4      The restructuring operation on a node          The restructuring operation on a node
       with n descendants happens every               with n descendants happens every
       2^(-O(lg n)) operation.                        O(1/n) operation.

Which one is best?

The best way to determine which is the best binary search tree out of the two is to measure the
performance of the two trees. This can be done in the following steps:

Gather representative query traffic.


Build a test rig where one can count the tree operations.
Replay the canned queries against both a height-balanced and a weight-balanced tree.
As a general rule, a height-balanced tree would work better the more even the request
frequencies are across the data set, and the more skewed it is, the more advantage one can get
from a weight-balanced tree.

Optimal Binary Search Tree

As we know that in binary search tree, the nodes in the left subtree have lesser value than the
root node and the nodes in the right subtree have greater value than the root node.

We know the key values of each node in the tree, and we also know the frequency of each
node in terms of searching, i.e., how often the node is searched. The frequency and key value
determine the overall cost of searching a node. The cost of searching is a very important factor
in various applications, and the overall cost of searching a node should be low. The time required
to search a node in a BST is greater than in a balanced binary search tree, as a balanced binary
search tree contains fewer levels than an unbalanced BST. There is one technique that can reduce
the cost of a binary search tree, known as an optimal binary search tree.

Let's understand through an example.

If the keys are 10, 20, 30, 40, 50, 60, 70

In the above tree, all the nodes in the left subtree are smaller than the value of the root node,
and all the nodes in the right subtree are larger than the value of the root node. The maximum
time required to search a node is equal to the minimum height of the tree, which is log n.

Now we will see how many binary search trees can be made from the given number of keys.

For example: 10, 20, 30 are the keys, and the following are the binary search trees that can be
made out from these keys.

The formula for calculating the number of trees on n keys is the Catalan number (2n)! / ((n+1)! n!).

Using the above formula with n = 3, it is found that a total of 5 trees can be created.

The cost required for searching an element depends on the comparisons to be made to search
an element. Now, we will calculate the average cost of time of the above binary search trees.
In the above tree, a total of 3 comparisons can be made. The average number of comparisons is:

In the above tree, the average number of comparisons is:

In the above tree, the average number of comparisons is:

In the above tree, a total of 3 comparisons can be made. Therefore, the average number of
comparisons is:

In the above tree, a total of 3 comparisons can be made. Therefore, the average number of
comparisons is:

In the third case, the number of comparisons is less because the height of the tree is less, so it's
a balanced binary search tree.

To find the optimal binary search tree, we will determine the frequency of searching a
key.
Let's assume that frequencies associated with the keys 10, 20, 30 are 3, 2, 5.

The above trees have different costs. The tree with the lowest cost would be considered the
optimal binary search tree. The tree with cost 17 is the lowest, so it would be considered as the
optimal binary search tree.

Dynamic Approach
Consider the below table, which contains the keys and frequencies.

First, we will calculate the values where j-i is equal to zero.

When i=0, j=0, then j-i = 0

When i = 1, j=1, then j-i = 0

When i = 2, j=2, then j-i = 0

When i = 3, j=3, then j-i = 0

When i = 4, j=4, then j-i = 0

Therefore, c[0, 0] = 0, c[1 , 1] = 0, c[2,2] = 0, c[3,3] = 0, c[4,4] = 0
Now we will calculate the values where j-i equal to 1.

When j=1, i=0 then j-i = 1


When j=2, i=1 then j-i = 1

When j=3, i=2 then j-i = 1

When j=4, i=3 then j-i = 1

Now to calculate the cost, we will consider only the jth value.

The cost of c[0,1] is 4 (The key is 10, and the cost corresponding to key 10 is 4).

The cost of c[1,2] is 2 (The key is 20, and the cost corresponding to key 20 is 2).

The cost of c[2,3] is 6 (The key is 30, and the cost corresponding to key 30 is 6)

The cost of c[3,4] is 3 (The key is 40, and the cost corresponding to key 40 is 3)

Now we will calculate the values where j-i = 2

When j=2, i=0 then j-i = 2

When j=3, i=1 then j-i = 2

When j=4, i=2 then j-i = 2

In this case, we will consider two keys.
When i=0 and j=2, then keys 10 and 20. There are two possible trees that can be made out from
these two keys shown below:

In the first binary tree, cost would be: 4*1 + 2*2 = 8

In the second binary tree, cost would be: 4*2 + 2*1 = 10

The minimum cost is 8; therefore, c[0,2] = 8

When i=1 and j=3, then keys 20 and 30. There are two possible trees that can be made out from
these two keys shown below:
In the first binary tree, cost would be: 1*2 + 2*6 = 14

In the second binary tree, cost would be: 1*6 + 2*2 = 10

The minimum cost is 10; therefore, c[1,3] = 10

When i=2 and j=4, we will consider the keys at 3 and 4, i.e., 30 and 40. There are two possible
trees that can be made out from these two keys shown as below:
In the first binary tree, cost would be: 1*6 + 2*3 = 12
In the second binary tree, cost would be: 1*3 + 2*6 = 15

The minimum cost is 12, therefore, c[2,4] = 12

Now we will calculate the values when j-i = 3

When j=3, i=0 then j-i = 3

When j=4, i=1 then j-i = 3

When i=0, j=3 then we will consider three keys, i.e., 10, 20, and 30.
The following are the trees that can be made if 10 is considered as a root node.
In the above tree, 10 is the root node, 20 is the right child of node 10, and 30 is the right child
of node 20.

Cost would be: 1*4 + 2*2 + 3*6 = 26

In the above tree, 10 is the root node, 30 is the right child of node 10, and 20 is the left child of
node 30.

Cost would be: 1*4 + 2*6 + 3*2 = 22

The following tree can be created if 20 is considered as the root node.

In the above tree, 20 is the root node, 30 is the right child of node 20, and 10 is the left child of
node 20.

Cost would be: 1*2 + 2*4 + 2*6 = 22

The following are the trees that can be created if 30 is considered as the root node.
In the above tree, 30 is the root node, 20 is the left child of node 30, and 10 is the left child of
node 20.

Cost would be: 1*6 + 2*2 + 3*4 = 22

In the above tree, 30 is the root node, 10 is the left child of node 30 and 20 is the right child of
node 10.

Cost would be: 1*6 + 2*4 + 3*2 = 20

Therefore, the minimum cost is 20, which is obtained with the 3rd key (30) as the root. So, c[0,3] is equal to 20.

When i=1 and j=4 then we will consider the keys 20, 30, 40
c[1,4] = min{ c[1,1] + c[2,4], c[1,2] + c[3,4], c[1,3] + c[4,4] } + 11

= min{0+12, 2+3, 10+0}+ 11

= min{12, 5, 10} + 11

The minimum value is 5; therefore, c[1,4] = 5+11 = 16
Now we will calculate the values when j-i = 4
When j=4 and i=0 then j-i = 4

In this case, we will consider four keys, i.e., 10, 20, 30 and 40. The frequencies of 10, 20, 30
and 40 are 4, 2, 6 and 3 respectively.

w[0, 4] = 4 + 2 + 6 + 3 = 15

If we consider 10 as the root node then

C[0, 4] = min {c[0,0] + c[1,4]}+ w[0,4]

= min {0 + 16} + 15= 31

If we consider 20 as the root node then

C[0,4] = min{c[0,1] + c[2,4]} + w[0,4]

= min{4 + 12} + 15

= 16 + 15 = 31

If we consider 30 as the root node then,

C[0,4] = min{c[0,2] + c[3,4]} +w[0,4]

= min {8 + 3} + 15

= 26

If we consider 40 as the root node then,
C[0,4] = min{c[0,3] + c[4,4]} + w[0,4]

= min{20 + 0} + 15

= 35

In the above cases, we have observed that 26 is the minimum cost; therefore, c[0,4] is equal to
26.
The optimal binary tree can be created as:

General formula for calculating the minimum cost is:

C[i,j] = min{ C[i,k-1] + C[k,j] } + w(i,j), where the minimum is taken over all roots k with i < k <= j.
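A compact C sketch of this recurrence is given below, pre-loaded with the frequencies from the
worked example (4, 2, 6, 3). The array names mirror the c and w tables above; everything else is
an assumption made for illustration:

#include <stdio.h>
#include <limits.h>

#define N 4   /* number of keys in the worked example */

int main(void) {
    /* 1-indexed frequencies of keys 10, 20, 30, 40 from the example */
    int freq[N + 1] = {0, 4, 2, 6, 3};
    int w[N + 1][N + 1] = {0};   /* w[i][j] = freq[i+1] + ... + freq[j]   */
    int c[N + 1][N + 1] = {0};   /* c[i][j] = optimal cost of keys i+1..j */

    for (int i = 0; i <= N; i++)
        for (int j = i + 1; j <= N; j++)
            w[i][j] = w[i][j - 1] + freq[j];

    for (int len = 1; len <= N; len++) {         /* len = j - i */
        for (int i = 0; i + len <= N; i++) {
            int j = i + len;
            c[i][j] = INT_MAX;
            for (int k = i + 1; k <= j; k++) {   /* try key k as the root */
                int cost = c[i][k - 1] + c[k][j] + w[i][j];
                if (cost < c[i][j]) c[i][j] = cost;
            }
        }
    }
    printf("optimal cost c[0][%d] = %d\n", N, c[0][N]); /* prints 26 */
    return 0;
}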
Unit V Indexing and Multiway Trees
Indexing and Multiway Trees- Indexing, indexing techniques-primary, secondary, dense,
sparse, Multiway search trees, B-Tree- insertion, deletion , B+Tree - insertion, deletion,
use of B+ tree in Indexing, Trie Tree.
Indexing in Databases
Indexing is a way to optimize the performance of a database by minimizing the number of disk
accesses required when a query is processed. It is a data structure technique which is used to
quickly locate and access the data in a database.
Indexes are created using a few database columns.
The first column is the Search key that contains a copy of the primary key or candidate key of
the table. These values are stored in sorted order so that the corresponding data can be accessed
quickly.
Note: The data may or may not be stored in sorted order.
The second column is the Data Reference or Pointer which contains a set of pointers holding
the address of the disk block where that particular key value can be found.

The indexing has various attributes:


Access Types: This refers to the type of access such as value based search, range access, etc.
Access Time: It refers to the time needed to find particular data element or set of elements.
Insertion Time: It refers to the time taken to find the appropriate space and insert a new data.
Deletion Time: Time taken to find an item and delete it as well as update the index structure.
Space Overhead: It refers to the additional space required by the index.
In general, there are two types of file organization mechanism which are followed by the
indexing methods to store the data:
1. Sequential File Organization or Ordered Index File: In this, the indices are based on a sorted
ordering of the values. These are generally fast and a more traditional type of storing
mechanism. These Ordered or Sequential file organization might store the data in a dense or
sparse format:
(i) Dense Index:
For every search key value in the data file, there is an index record.
This record contains the search key and also a reference to the first data record with that search
key value.

(ii) Sparse Index:


The index record appears only for a few items in the data file. Each item points to a block as
shown.
To locate a record, we find the index record with the largest search key value less than or equal
to the search key value we are looking for.
We start at the record pointed to by the index record, and proceed along the pointers in
the file (that is, sequentially) until we find the desired record.
Number of accesses required = log₂(n) + 1 (here n = number of blocks occupied by the index file)
2. Hash File organization: Indices are based on the values being distributed uniformly across a
range of buckets. The buckets to which a value is assigned is determined by a function called
a hash function.
Types of Indexing: There are two ways as following below.
1. Single-level Indexing –
Primary indexing
Clustering Indexing
Secondary Indexing
2. Multilevel Indexing –
B Trees
B+ Trees
There are primarily three methods of indexing:
Clustered Indexing
Non-Clustered or Secondary Indexing
Multilevel Indexing
1. Clustered Indexing
When two or more records are stored in the same file, this type of storing is known as cluster
indexing. By using cluster indexing we can reduce the cost of searching, since multiple records
related to the same thing are stored in one place; it also supports the frequent joining of two or
more tables (records).
Clustering index is defined on an ordered data file. The data file is ordered on a non-key field.
In some cases, the index is created on non-primary key columns, which may not be unique for
each record. In such cases, in order to identify the records faster, we group two or more columns
together to get unique values and create an index out of them. This method is known as the
clustering index. Basically, records with similar characteristics are grouped together and
indexes are created for these groups.
For example, students studying in each semester are grouped together. i.e. 1st Semester
students, 2nd semester students, 3rd semester students etc. are grouped.

Clustered index sorted according to first name (Search key)


Primary Indexing:
This is a type of Clustered Indexing wherein the data is sorted according to the search key and
the primary key of the database table is used to create the index. It is a default format of
indexing where it induces sequential file organization. As primary keys are unique and are
stored in a sorted manner, the performance of the searching operation is quite efficient.
Primary indexing is defined mainly on the primary key of the data-file, in which the data-file
is already ordered based on the primary key. Primary Index is an ordered file whose records
are of fixed length with two fields. The first field of the index replicates the primary key of the
data file in an ordered manner, and the second field of the ordered file contains a pointer that
points to the data-block where a record containing the key is available. The first record of each
block is called the Anchor record or Block anchor. There exists a record in the primary index
file for every block of the data-file. The average number of blocks using the Primary Index is
= log2B+ 1, where B is the number of index blocks.
The reason the average number of block accesses using the Primary Index is log2B + 1 is that
log2B is the number of blocks traversed in the Primary Index (by binary search) to find the entry
that represents the required block on the hard disk. After getting that entry, we move to that
block of the hard disk and search for the record within it. Since we access one more block on the
hard disk, the count of blocks accessed increases by 1. Hence the average number of block
accesses using the Primary Index is log2B + 1.
2. Non-clustered or Secondary Indexing
A non-clustered index just tells us where the data lies, i.e., it gives us a list of virtual pointers or
references to the locations where the data is actually stored. Data is not physically stored in the
order of the index; instead, data is present in leaf nodes. Consider, for example, the contents page
of a book: each entry gives us the page number or location of the information stored. The actual
data (the information on each page of the book) is not ordered, but we have an ordered
reference (the contents page) to where the data actually lies. We can have only dense ordering
in a non-clustered index, as sparse ordering is not possible because the data is not physically
organized accordingly.
It requires more time as compared to the clustered index because some amount of extra work
is done in order to extract the data by further following the pointer. In the case of a clustered
index, data is directly present in front of the index.
Secondary indexing is a database management technique used to create additional indexes on
data stored in a database. The main purpose of secondary indexing is to improve the
performance of queries and to simplify the search for specific records within a database. A
secondary index provides an alternate means of accessing data in a database, in addition to the
primary index. The primary index is typically created when the database is created and is used
as the primary means of accessing data in the database. Secondary indexes, on the other hand,
can be created and dropped at any time, allowing for greater flexibility in managing the
database.
Benefits
Improved Query Performance: Secondary indexes can improve the performance of queries by
reducing the amount of data that needs to be scanned to find the desired records. With a
secondary index, the database can directly access the required records, rather than having to
scan the entire table.
Flexibility: Secondary indexes provide greater flexibility in managing a database, as they can
be created and dropped at any time. This allows for a more dynamic approach to database
management, as the needs of the database can change over time.
Simplified Search: Secondary indexes simplify the search for specific records within a
database, making it easier to find the desired data.
Reduced Data Storage Overhead: Secondary indexes use a compact data structure that requires
less space to store compared to the original data. This means that you can store more data in a
database while reducing the amount of storage space required.
Types of Secondary Indexes
B-tree Index: A B-tree index is a type of index that stores data in a balanced tree structure. B-
tree indexes are commonly used in relational databases and provide efficient search, insert, and
delete operations.
Hash Index: A hash index is a type of index that uses a hash function to map data to a specific
location within the index. Hash indexes are commonly used in non-relational databases, such
as NoSQL databases, and provide fast access to data.
Bitmap Index: A bitmap index is a type of index that uses a bitmap to represent the data in a
database. Each bit in the bitmap represents a specific record in the database, and the value of
the bit indicates whether the record is present or not. Bitmap indexes are commonly used in
data warehousing and business intelligence applications, as they provide efficient access to
large amounts of data.
When to Use Secondary Indexing
Secondary indexing should be used in database management systems when there is a need to
improve the performance of data retrieval operations that search for data based on specific
conditions. Secondary indexing is particularly useful in the following scenarios:
Queries with Complex Search Criteria: Secondary indexes can be used to support complex
queries that search for data based on multiple conditions. By creating a secondary index based
on the columns used in the search criteria, database management systems can access the data
more efficiently.
Large Data Sets: Secondary indexing can be beneficial for large data sets where the time and
resources required for data retrieval operations can be significant. By creating a secondary
index, database management systems can access the data more quickly, reducing the time and
resources required for data retrieval operations.
Frequently Accessed Data: Secondary indexing should be used for frequently accessed data to
reduce the time and resources required for data retrieval operations. This is because secondary
indexes provide a fast and efficient way to access data stored in a database.
Sorting and Aggregating Data: Secondary indexing can be used to support sorting and
aggregating data based on specific columns. By creating a secondary index based on the
columns used for sorting and aggregating, database management systems can access the data
more efficiently, reducing the time and resources required for data retrieval operations.
Data Structure: The data structure of a database can also affect the decision to use secondary
indexing. For example, if the data is structured as a B-tree, a B-tree index may be the most
appropriate type of secondary index.
3. Multilevel Indexing
With the growth of the size of the database, indices also grow. As the index is stored in main
memory, a single-level index might become too large to fit, requiring multiple disk accesses.
Multilevel indexing segregates the main block into various smaller blocks so that each piece can
be stored in a single block. The outer blocks are divided into inner blocks, which in turn point to
the data blocks. This can be easily stored in main memory with fewer overheads.

156
Bitmap Indexing in DBMS
Bitmap Indexing is a special type of database indexing that uses bitmaps. This technique is
used for huge databases when a column has low cardinality and is frequently used in queries.
Need for Bitmap Indexing –
The need for Bitmap Indexing will become clear from the following example:
For example, suppose a company holds an employee table with columns EmpNo, EmpName, Job,
New_Emp and Salary. Assume that employees are hired only once a year, so the table is rarely
updated and remains static most of the time. Its columns, however, are frequently used in
queries to retrieve data, such as the number of female employees in the company. In this case
we need a file organization method fast enough to give quick results, and none of the
traditional file organization methods is that fast; therefore we switch to a better method of
storing and retrieving data known as Bitmap Indexing.
How Bitmap Indexing is done –
In the above example of table employee, we can see that the column New_Emp has only two
values Yes and No based upon the fact that the employee is new to the company or not.
Similarly, let us assume that the Job of the employees is divided into only 4 categories, i.e.,
Manager, Analyst, Clerk and Salesman. Such columns are called low-cardinality columns.
Even though these columns have few unique values, they can be queried very often.
Bit: Bit is a basic unit of information used in computing that can have only one of two values
either 0 or 1. The two values of a binary digit can also be interpreted as logical values true/false
or yes/no.
In Bitmap Indexing these bits are used to represent the unique values in those low-cardinality
columns. This technique of storing the low-cardinality values in the form of bits is called a
bitmap index.
Continuing the Employee example, given below is the Employee table:
If New_Emp is the data to be indexed, the content of the bitmap index is shown as four entries
(as we have four rows in the above table) under the heading Bitmap Indices. Here the bitmap
index for "Yes" has the value 1001 because row one and row four have the value "Yes" in the
column New_Emp.
In this case there are two such bitmaps, one for New_Emp "Yes" and one for New_Emp "No". It is
easy to see that each bit in these bitmap indices shows whether a particular row refers to a
person who is new to the company or not.
The above scenario is the simplest form of Bitmap Indexing. Most columns will have more
distinct values. For example the column Job here will have only 4 unique values (As mentioned
earlier). Variations on the bitmap index can effectively index this data as well. For Job column
the bitmap Indexing is shown below:
Now suppose we want to find the details of the employee who is not new to the company and is a
salesperson; then we run the query:
SELECT *
FROM Employee
WHERE New_Emp = "No" and Job = "Salesperson";
For this query the DBMS will search the bitmap index of both the columns and perform logical
AND operation on those bits and find out the actual result:
Here the result 0100 represents that the second row has to be retrieved as a result.
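The logical AND the DBMS performs on the two bitmaps can be sketched in C. This is a minimal
illustration of the idea, not any particular DBMS's implementation; the bit patterns are taken
from the four-row example above (bit order: row 1 first).

#include <stdio.h>

int main(void)
{
    /* One bit per row; a set bit means the row satisfies the predicate.
       Bit 3 is row 1, bit 0 is row 4 (four rows in the example).       */
    unsigned int new_emp_no   = 0x6; /* 0110: rows 2 and 3 have New_Emp = "No" */
    unsigned int job_salesman = 0x4; /* 0100: row 2 has Job = "Salesperson"    */

    /* Logical AND of the two bitmaps yields the rows satisfying both. */
    unsigned int result = new_emp_no & job_salesman;

    for (int i = 3; i >= 0; i--)          /* print as a 4-bit bitmap */
        printf("%u", (result >> i) & 1u);
    printf("\n");                         /* prints 0100: fetch row 2 */
    return 0;
}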
Bitmap Indexing in SQL – The syntax for creating a bitmap index in SQL is given below:
CREATE BITMAP INDEX Index_Name
ON Table_Name (Column_Name);
For the above example of employee table, the bitmap index on column New_Emp will be
created as follows:
CREATE BITMAP INDEX index_New_Emp
ON Employee (New_Emp);
Advantages –
 Efficiency in terms of insertion, deletion and updation.
 Faster retrieval of records.
Disadvantages –
 Only suitable for large tables.
 Bitmap Indexing is time consuming.
M-way Trees
An M-way (multi-way) tree is a tree that has the following properties:
Each node in the tree can have at most m children.
Nodes in the tree have at most (m-1) key fields and pointers (references) to their children.
Consider the pictorial representation shown below of an M-way tree.
The above image shows a 4-way tree, where each node can have at most 3 (= 4-1) key fields and
at most 4 children. It is also a 4-way search tree.
M-way Search Trees
An M-way search tree is a more constrained M-way tree, and these constraints mainly apply to
the key fields and the values in them. The constraints that make an M-way tree an M-way search
tree are:
Each node in the tree can have at most m children and m-1 key fields.
The keys in any node of the tree are arranged in sorted (ascending) order.
The keys in the first K children are less than the Kth key of this node.
The keys in the last (m-K) children are higher than the Kth key.
Consider the pictorial representation shown below of an M-way search tree:
M-way search trees have an advantage over plain M-way trees: they make the search and update
operations much more efficient. However, they can become unbalanced, which brings back the
problem of searching for a key in a skewed tree, where this advantage is largely lost.
Searching in an M-way Search Tree:
Suppose we want to search for a value X in an M-way search tree, and we are currently at a node
that contains the key values Y1, Y2, Y3, ..., Yk. Four cases are possible:
If X < Y1, then we need to recursively traverse the left subtree of Y1.
If X > Yk, then we need to recursively traverse the right subtree of Yk.
If X = Yi, for some i, then we are done, and can return.
The last remaining case is when, for some i, we have Yi < X < Y(i+1); in this case we
recursively traverse the subtree that lies between Yi and Y(i+1).
For example, consider the 3-way search tree shown above, and say we want to search for a node
with key X equal to 60. Considering the above cases: at the root node the second condition
applies (60 > 40), so we move one level down to the right subtree of 40. At the next node only
the last condition holds, so we traverse the subtree that lies between 55 and 70. Finally,
while traversing down, we reach the value we were looking for.
B Trees Data Structure:
A B tree is an extension of an M-way search tree. Besides having all the properties of an M-
way search tree, it has some properties of its own, these mainly are:
All the leaf nodes in a B tree are at the same level.
All internal nodes must have at least [M/2] children.
If the root node is a non-leaf node, then it must have at least two children.
All nodes except the root node must have at least [M/2]-1 keys and at most M-1 keys.
Consider the pictorial representation of a B tree shown below:
Searching in a B Tree:
Searching for a key in a B tree is exactly like searching in an M-way search tree, which we saw
just above. Consider the pictorial representation of a B tree shown below; say we want to
search for the key 49. We do it as follows:
Compare item 49 with the root node 75. Since 49 < 75, move to its left sub-tree.
Since 40 < 49 < 58, traverse the subtree between 40 and 58.
Since 49 > 44, move to the right and compare: we have found 49, hence we return.
Consider the pictorial representation shown below:
Inserting in a B Tree:
Inserting in a B tree is done at the leaf node level. We follow the given steps to make sure that
after the insertion the B tree is valid, these are:
First, we traverse the B tree to find the appropriate node where the key to be inserted will
fit.
If that node contains fewer than M-1 keys, then we insert the key in increasing order.
If that node already contains exactly M-1 keys, then we insert the new element in increasing
order, split the node into two through the median, push the median element up to its parent
node, and finally, if the parent node also contains M-1 keys, repeat these steps.
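The heart of the overflow step is picking the median and dividing the remaining keys. The
fragment below is a sketch in C for order M = 4, showing only that split arithmetic on an
already-ordered temporary key array; the key values are illustrative, and this is not a full
B tree insert.

#include <stdio.h>

#define M 4  /* order: a node holds at most M-1 = 3 keys */

int main(void)
{
    /* A node that already holds M-1 keys, plus the incoming key 9,
       kept in increasing order in a temporary array of M keys.     */
    int temp[M] = {5, 8, 9, 12};

    int mid = M / 2;                          /* index of the median key */
    printf("push up to parent: %d\n", temp[mid]);

    printf("left node:");
    for (int i = 0; i < mid; i++)
        printf(" %d", temp[i]);

    printf("\nright node:");
    for (int i = mid + 1; i < M; i++)
        printf(" %d", temp[i]);
    printf("\n");
    return 0;
}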
Consider the pictorial representation shown below:
Now, consider that we want to insert a key 9 into the above shown B tree, the tree after inserting
the key 9 will look something like this:
Since a violation occurred, we push the median key up to the parent node and split the node
into two parts; hence the final look of the B tree is:
Deletion in a B Tree:
Deletion of a key in a B tree includes two cases, these are:
Deletion of key from a leaf node
Deletion of a key from an Internal node
Deletion of Key from a leaf node:
If we want to delete a key that is present in a leaf node of a B tree, then we have two cases
possible, these are:
If the node that contains the key that we want to delete, in turn contains more than the minimum
number of keys required for the valid B tree, then we can simply delete that key.
Consider the pictorial representation shown below:
Say we want to delete the key 64. The node in which 64 is present has more than the minimum
number of keys required by the B tree, which is 2. So, we can simply delete this key.
The final tree after the deletion of 64 will look like this:
If the node that contains the key that we want to delete, in turn contains the minimum number
of keys required for the valid B tree, then three cases are possible:
In order to delete this key from the B tree, we can borrow a key from the immediate left
node (left sibling). The process is that we move the highest-value key from the left sibling up
to the parent, and then move the separating parent key down to the node from which we just
deleted our key.
In another case, we might have to borrow a key from the immediate right node (right sibling).
The process is that we move the lowest-value key from the right sibling up to the parent node,
and then move the separating parent key down to the node from which we just deleted our key.
The last case is when neither the left sibling nor the right sibling can spare a key. In this
case we merge the node with one of them; the merge also pulls down a key from the parent, and
we can then delete the key from the merged node.
Case 1 pictorial representation:
After we delete 23, we borrow from the left sibling: we move 16 up to the parent node and push
20 downwards, and the resultant B tree is:
Case 2 pictorial representation:
After we delete 72, we borrow from the right sibling: we move 77 up to the parent node and push
75 downwards, and the resultant B tree is:
Case 3 pictorial representation:
After deleting 65 from the leaf node, we will have the final B tree as:
Deletion of Key from an Internal node:
If we want to delete a key that is present in an internal node, then we can replace it with its
inorder predecessor or, if taking the inorder predecessor would violate the B tree property,
with its inorder successor.
In the inorder predecessor approach, we extract the highest value from the left child subtree
of the node where our key is present.
In the inorder successor approach, we extract the lowest value from the right child subtree of
the node where our key is present.
Pictorial Representation of the above cases:
Inorder predecessor approach
After deletion, our B tree:
Inorder successor approach
After deletion of 95, our tree will look like this:
Key Points:
The time complexity for search, insert and delete operations in a B tree is O(log n).
The minimum number of keys in a non-root node of a B tree is [M/2] - 1.
The maximum number of keys in a node of a B tree is M-1.
All the leaf nodes in a B tree are at the same level.
All the keys within a node of a B tree are in increasing order.
B trees are used in SQL databases to improve the efficiency of queries.
Each node in a B tree can have at most M children.
B+ Trees Data Structure
A B+ tree is an extension of a B tree that makes the search, insert and delete operations more
efficient. B trees keep data pointers, along with the key values, in internal nodes as well as
leaf nodes. This is a drawback: the data pointers take up room that could hold keys, so fewer
keys fit at each level and the number of levels in the tree grows, which is certainly of no
good. B+ trees remove this drawback by storing the data pointers only at the leaf-node level
and keeping only key values in the internal nodes. It should also be noted that the nodes at
the leaf level are linked with each other, making traversal of the data pointers easy and more
efficient.
B+ trees come in handy when we want to index a large amount of data. Since the size of main
memory is limited, B+ trees are arranged so that the internal nodes, which store the keys (used
to access the records), are kept in main memory, whereas the leaf nodes, which contain the data
pointers, are stored in secondary memory.
A pictorial representation of a B+ tree is given below:
Why B+ trees?
B+ trees store the records so that every record can be fetched in the same number of disk
accesses.
The height of a B+ tree remains balanced and much lower than that of a B tree, even when both
store the same number of records.
Having fewer levels makes accessing the records very easy.
As the leaf nodes are connected to each other like a linked list, we can easily search elements
in a sequential manner.
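The linked leaf level is what makes sequential (range) scans cheap. Below is a minimal sketch
in C of a leaf node with a next pointer and a scan that walks the chain; the node layout and
field names are illustrative assumptions, and the data pointers are omitted for brevity.

#include <stdio.h>
#include <stddef.h>

#define LEAF_KEYS 3

struct BPlusLeaf {
    int               n;               /* keys stored in this leaf           */
    int               keys[LEAF_KEYS]; /* sorted keys; data pointers omitted */
    struct BPlusLeaf *next;            /* link to the next leaf              */
};

/* Print every key in [lo, hi], starting from the leaf where lo falls. */
void range_scan(struct BPlusLeaf *leaf, int lo, int hi)
{
    for (; leaf != NULL; leaf = leaf->next)   /* follow the leaf chain */
        for (int i = 0; i < leaf->n; i++) {
            if (leaf->keys[i] > hi)
                return;                       /* past the range: stop  */
            if (leaf->keys[i] >= lo)
                printf("%d\n", leaf->keys[i]);
        }
}

int main(void)
{
    struct BPlusLeaf b = {2, {40, 55}, NULL};
    struct BPlusLeaf a = {2, {10, 25}, &b};
    range_scan(&a, 20, 50);                   /* prints 25 and 40 */
    return 0;
}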
Inserting in B+ tree
Perform a search operation in the B+ tree to find the ideal bucket (leaf node) where the new
key should go.
If the bucket is not full (does not violate the B+ tree property), then add the key to this
bucket.
Otherwise, split the node into two nodes, push the middle (median, to be precise) key up to the
parent node, and then insert the new key.
Repeat the above steps as long as the parent node keeps getting full.
Consider the pictorial representations shown below to understand the Insertion operation in the
B+ tree:
Let us try to insert 57 inside the above-shown B+ tree, the resultant B+ tree will look like this:
We know that the bucket (node) where we inserted the key with value 57 is now violating the
property of the B+ tree, hence we need to split this node as mentioned in the steps above.
After splitting, we push the median key up to the parent node, and the resulting B+ tree will
look like this:
Searching in B+ tree:
Searching in a B+ tree is similar to searching in a BST. At each node we compare the search key
with the stored keys: if it is smaller than a key, we follow the pointer on that key's left;
otherwise we move along the current bucket (node) until we find the ideal location, and at the
leaf level we scan the bucket for the key.
Consider the below representation of a B+ tree to understand the searching procedure. Suppose
we want to search for a key with a value equal to 59 in the below given B+ tree.
Now we know that 59 < 69, hence we traverse the left subtree.
Now we have found the internal pointer that will point us to our required search value.
Finally, we traverse this bucket in a linear fashion to get to our required search value.
Deletion in B+ tree:
Deletion is a slightly more complicated process, and two cases arise:
the key is present only at the leaf level,
or an internal node also contains a pointer to it.
Deletion of only the leaf-node:
If the key is present only at a leaf-node position, then we can simply delete it: we first
perform the search operation and then delete it.
Consider the pictorial representation shown below:
After the deletion of 79, we are left with the following B+ tree.
Deletion if a pointer to a leaf node is there:
After we locate the node we want to delete, we must also delete the internal pointer that
refers to this node, and then move the next node's key up into the parent node.
Comparison between B tree and B+ tree:
1. B Tree: Search keys cannot be stored repeatedly. B+ Tree: Redundant search keys can be
present.
2. B Tree: Data can be stored in leaf nodes as well as internal nodes. B+ Tree: Data can only
be stored in the leaf nodes.
3. B Tree: Searching for some data is a slower process, since data can be found in internal
nodes as well as in leaf nodes. B+ Tree: Searching is comparatively faster, as data can only be
found in the leaf nodes.
4. B Tree: Deletion of internal nodes is complicated and time consuming. B+ Tree: Deletion is
never a complex process, since elements are always deleted from the leaf nodes.
5. B Tree: Leaf nodes cannot be linked together. B+ Tree: Leaf nodes are linked together to
make the search operations more efficient.
What is Trie?
Trie is a type of k-ary search tree used for storing and searching a specific key from a set. Using
Trie, search complexities can be brought to optimal limit (key length).
Definition: A trie (derived from retrieval) is a multiway tree data structure used for storing
strings over an alphabet. It is used to store a large number of strings, and pattern matching
can be done efficiently using tries.
A trie storing the words allot, alone, ant, and, are, bat, bad illustrates the idea: all
strings sharing a common prefix descend from a common node. Tries are used in spell-checking
programs.
Preprocessing the pattern improves the performance of a pattern matching algorithm, but if the
text is very large, it is better to preprocess the text instead of the pattern for efficient
search.
A trie is a data structure that supports pattern matching queries in time proportional to the
pattern size.
If we store keys in a binary search tree, a well balanced BST will need time proportional to M
* log N, where M is the maximum string length and N is the number of keys in the tree. Using
Trie, the key can be searched in O(M) time. However, the penalty is on Trie storage
requirements (Please refer to Applications of Trie for more details).
Trie is also known as digital tree or prefix tree. Refer to this article for more detailed
information.
Structure of Trie node:
Every node of Trie consists of multiple branches. Each branch represents a possible character
of keys. Mark the last node of every key as the end of the word node. A Trie node field
isEndOfWord is used to distinguish the node as the end of the word node.
A simple structure to represent nodes of the English alphabet can be as follows.
// Trie node for keys over the lowercase English alphabet
#include <stdbool.h>

#define ALPHABET_SIZE 26

struct TrieNode
{
    // one child pointer per possible character of the keys
    struct TrieNode *children[ALPHABET_SIZE];

    // isEndOfWord is true if the node
    // represents end of a word
    bool isEndOfWord;
};
Insert Operation in Trie:
Inserting a key into Trie is a simple approach.
Every character of the input key is inserted as an individual Trie node. Note that children is
an array of pointers (or references) to next-level trie nodes.
The key character acts as an index into the array children.
If the input key is new or an extension of the existing key, construct non-existing nodes of the
key, and mark the end of the word for the last node.
If the input key is a prefix of an existing key in the trie, simply mark the last node of the
key as the end of a word.
The key length determines Trie depth.
The following picture explains the construction of trie using keys given in the example below.
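Following the steps above, here is a minimal insert routine in C. It reuses struct TrieNode and
ALPHABET_SIZE as defined earlier; the helper newTrieNode and the assumption that keys contain
lowercase letters only are ours.

#include <stdlib.h>
#include <stdbool.h>
#include <stddef.h>

/* Allocate a node with all children NULL and isEndOfWord false. */
struct TrieNode *newTrieNode(void)
{
    /* calloc zeroes every child pointer and the isEndOfWord flag */
    return calloc(1, sizeof(struct TrieNode));
}

/* Insert key (lowercase letters only) into the trie rooted at root. */
void insert(struct TrieNode *root, const char *key)
{
    struct TrieNode *node = root;
    for (size_t i = 0; key[i] != '\0'; i++) {
        int index = key[i] - 'a';            /* character acts as array index */
        if (node->children[index] == NULL)   /* construct non-existing nodes  */
            node->children[index] = newTrieNode();
        node = node->children[index];
    }
    node->isEndOfWord = true;                /* mark last node as end of word */
}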
Advantages of tries
1. In tries, keys are searched using common prefixes, hence lookup is faster; in a binary
search tree, the lookup time depends on the height of the tree.
2. Tries take less space when they contain a large number of short strings, as nodes are shared
between the keys.
3. Tries help with longest-prefix matching when we want to find a key.
Comparison of tries with hash table
1. Looking up data in a trie is faster in the worst case compared to an imperfect hash table.
2. There are no collisions of different keys in a trie.
3. In a trie, if a single key is associated with more than one value, the structure resembles
buckets in a hash table.
4. There is no hash function in a trie.
5. Sometimes data retrieval from a trie is much slower than hashing.
6. Representing keys as strings is complex. For example, representing floating-point numbers
using strings is really complicated in tries.
7. Tries always take more space than hash tables.
8. Tries are not available in standard programming toolkits, hence the implementation of tries
has to be done from scratch.
Applications of tries
1. Tries support insertion, deletion and search of entries, hence they are used in building
dictionaries, such as entries for telephone numbers or English words.
2. Tries are also used in spell-checking software.
Search Operation in Trie:
Searching for a key is similar to the insert operation; however, we only compare the characters
and move down. The search can terminate due to the end of the string or the lack of the key in
the trie.
In the former case, if the isEndOfWord field of the last node is true, then the key exists in
the trie.
In the latter case, the search terminates without examining all the characters of the key,
since the key is not present in the trie.
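A matching search routine, under the same assumptions as the insert sketch (lowercase keys;
struct TrieNode as defined above):

#include <stdbool.h>
#include <stddef.h>

/* Returns true iff key is present in the trie rooted at root. */
bool search(const struct TrieNode *root, const char *key)
{
    const struct TrieNode *node = root;
    if (node == NULL)
        return false;
    for (size_t i = 0; key[i] != '\0'; i++) {
        int index = key[i] - 'a';
        node = node->children[index];
        if (node == NULL)
            return false;          /* lack of key: search fails early */
    }
    return node->isEndOfWord;      /* string ended: key exists only if
                                      the last node ends a word        */
}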
Note: Insert and search costs O(key_length), however, the memory requirements of Trie is
O(ALPHABET_SIZE * key_length * N) where N is the number of keys in Trie. There are
efficient representations of trie nodes (e.g. compressed trie, ternary search tree, etc.) to
minimize the memory requirements of the trie.
Unit VI File Organization
Files: concept, need, primitive operations. Sequential file organization- concept and
primitive operations, Direct Access File- Concepts and Primitive operations, Indexed
sequential file organization-concept, types of indices, structure of index sequential file,
Linked Organization- multi list files, coral rings, inverted files and cellular partitions.
What is File?
A file is a container in a computer system that stores data, information, settings, or
commands, which are used with a computer program. Graphical user interfaces (GUIs), such as
those of Microsoft operating systems, represent files as icons associated with the program that
opens the file. For instance, a file shown with a Microsoft Word icon is related to Microsoft
Word: if your computer contains this file and you double-click the icon, it will open in the
Microsoft Word installed on the computer.
There are several types of files, such as directory files, data files, text files, binary and
graphic files, and these different kinds of files contain different types of information. In a
computer system, files are stored on hard drives, optical drives, discs, or other storage
devices.
In most operating systems, a file must be saved with a unique name within a given file
directory. However, certain characters cannot be used when creating a file, as they are
considered illegal. A filename usually ends with a file extension, also called a suffix. The
file extension consists of two to four characters following the filename proper, and it helps
to recognize the file format, the type of file, and the attributes related to the file.
Most modern computer systems can protect files from corruption or damage. A file can contain
anything from system-generated information to user-specified information. File management is
sometimes done manually by the user and sometimes handled with the help of third-party tools
and the operating system.
The basic operations that can be performed on a file are given below:
 Closing or terminating a file operation
 Creation of programs
 Reading of data from the file
 Creation of a new file
 Opening the file in order to make the contents available to other
 Modification of data or file attributes
 Writing data to the file
How are the files created?
A software program is used to create a file on the computer. For instance, to create a document
file you would use a word processor; to create a C program file, a code editor or IDE; to
create an image file, an image editor. Specific software is used to create each particular kind
of file.
Where are files stored?
Computer files are stored on storage media such as hard drives, optical discs (like DVDs), and
floppy disks. They can also be stored in folders kept on those drives.
Illegal file characters
The characters given below are considered illegal by most operating systems and hence cannot be
used. If you try to create a file name containing these characters, it will generate an error
or make the file inaccessible.
:*?"<>|\/
File management
File management is also referred to as a file system: the process of creating an organized
structure for storing and retrieving files on a storage medium such as a hard drive. It is a
type of software that usually organizes files into groups called directories. Basically, it is
designed to handle individual or grouped files, like records and special office documents. It
can display report details, such as creation date, state of completion, owner, and other
similar details, which are useful in an office environment.
Nowadays, NTFS (New Technology File System) is the most widely used file system with Windows.
Without file management, files could not be organized, and it would be impossible to keep file
names unique. Often, files are managed in a hierarchical way that allows users to view the
files in the current directory and then navigate into any subdirectories.
File Format
The file format is the structure of a file that arranges the data logically within the file. It
allows a program to represent the information correctly, retrieve the data, and continue
processing. For instance, a Microsoft Word document saved with the .doc file format is best
viewed in Microsoft Word. Although other software may be able to open this file, it may not
have all the features needed to display the document properly, the way Microsoft Word does.
Programs that are compatible with a file format may still only give an overview of the file,
without displaying all of its features.
Additionally, a program that does not support a file format may give you garbage when opening
the file. For example, if you open a .XLS file in another program, like Notepad, it will not
display the document properly and will give you garbage. A file format also minimizes the
required storage space, as it defines the data encoding process; for instance, video and
pictures are encoded by embedded processes like compression, in which a picture is divided into
pixels.
Furthermore, a file format also includes presentation information. For example, a Microsoft
.xls file includes both the document's text and its final form, as well as tables, colors,
calculations, font sizes, charts, and other information that must be organized in a standard
form inside the file.
Common file formats
Below is a table that contains common file formats you are most likely to see while working
on a computer.
Image: .bmp .eps .gif .jpg .pict .png .psd .tif
Text: .asc .doc .docx .rtf .msg .txt .wpd .wps
Video: .avi .mp4 .mpg .mov .wmv
Compressed: .arc .arj .gz .hqx .rar .sit .tar .z .zip
Program: .bat .com .exe
Sound: .aac .au .mid .mp3 .ra .snd .wma .wav
File Extension
A file extension is an identifier that helps identify the type of a file in operating systems
such as Microsoft Windows. It can be classified as a type of metadata, and it helps the
operating system understand the intended use and the characteristics of a file. A filename
extension may contain one to four characters and is used as a suffix to the file name. For
example, in Microsoft Windows the file extension is often three characters long.
A dot (.) symbol is used to separate the file extension from the filename. A filename is
considered incomplete without its extension; therefore, the extension must be included to
complete the filename. Generally, file extensions are hidden from users in Windows operating
systems. Although a file extension can be renamed, renaming it does not necessarily convert the
file from one format to another. File extensions are helpful to both users and the file system
in two ways:
 It helps in identifying the type of data that a file holds.
 It allows the operating system to select the proper program or application with which
to open a file.
What makes a valid file name extension?
A filename extension always appears at the end of the file name, starting after a period (the
dot symbol). Although it is often between one and three characters, some programs also support
more than three characters. For instance, the latest versions of Microsoft Word save files with
the .docx extension, and some web pages use the .html file extension.
Can a file extension be more than three or four characters?
Yes, a file extension can be more than three or four characters; it depends on how the program
was designed. Some programs are designed to identify and open files with a longer (more than
three or four characters) file extension. However, most programs do not exceed four characters,
to keep the overall file name short.
Limit of a file extension
The file name, path, and extension combined must not exceed the maximum file name length of the
operating system. Below is a list of Microsoft Windows versions and their filename character
limits.
Windows XP: It contains a limit of 255 characters.
Windows 7: It includes a limit of 260 characters.
Windows 2000: Its limit is 254 characters.
Windows Vista: Its limit is 260 characters.
Windows 8: It includes a limit of 260 characters.
Windows 10: It contains a limit of 260 characters.
Different types of file extension
There are various types of file extensions, and each can be associated with one or more
applications. Below is a list of some of the more common file extensions and their related
programs.
Music and sound files
.wav
.mp3
Picture files
.bmp
.jpg
.gif
Text and word processing documents
.doc
.rtf
.docx
.txt
Operating system files
.dll
.exe
Web Page files
.htm
.html
Spreadsheet files
.xls
.xlr
.xlsx
.csv
File Compression
File compression is also known as file zipping. It is a data compression method that packs one
or more files or directories into an archive smaller than their original size. It is used to
reduce file size, to save storage space, and to provide faster transmission over a network or
the Internet. Compressed files allow more data to be stored on removable media and make
downloading faster. Common compressed file extensions are .RAR, .ARJ, .ZIP, .TGZ, and .TAR.GZ.
The process of file compression is carried out by data or file compression software, which
processes the files and creates a compressed version. Generally, it scans an entire file,
recognizes repetitive patterns and data, and replaces duplicates with a unique identifier.
Since the identifier is much smaller than the data it replaces, the resulting file is much
smaller than the original. Although there is no fixed compressed size, compression typically
reduces a file's size by 50 to 90 percent.
There are various types of compressed file extensions, below is a table that contains some
common types of compressed file extensions:
.pf, .b64, .cab, .z, .tar.gz, .rar, .btoa, .cpt, .zip, .tbz, .rpm, .bz, .gz, .zipx, .tbz2,
.lzh, .arc, .hqx, .zoo, .sea, .mim, .arj, .iso, .tgz, .sit, .mme, .as, .lha, .uu, .sitx, .pak,
.uue
How to copy files
A file name must be unique within a directory; if you copy a file into the same directory or
folder, a number or the suffix '- Copy' is appended to the end of the copied file's name. For
example, if a file named smith.doc is copied in the same directory or folder, the copied file
may be named smith(1).doc or smith - Copy.doc.
It is very easy to copy computer documents from one location to another. To copy files, follow
the steps below:
Copy a file in Microsoft Windows
First, go to the files or folders that you want to copy and select them by clicking the mouse.
If you want to copy more than one file, you need to select them all; you can highlight multiple
files by holding down the Ctrl or Shift key on your keyboard while clicking the mouse.
After selecting the files, you are required to right-click on the selected files and choose the
copy option from the opened list. You can also press the Ctrl+C shortcut key. Also, in Windows
Explorer, you can click on the edit option at the top of the program window and select copy.
Now, you need to open the destination folder where you want to copy the files, and right-click
an empty space in the destination folder and select the paste option.
How to move files or folders on the computer
There are numerous methods to move files, folders or directories from one location to another
on the computer.
Move a file in windows
In Windows, files can be moved using different methods, such as cut and paste, drag-and-drop,
or the 'Move to Folder' option. All of these methods are described below; you can choose
whichever suits you.
Cut and paste method
To use the cut and paste method, first, you are required to select the file that you want to move.
Then, right-click on the selected file and choose the cut option from the opened list. Now, open
the destination folder where you want to move the file and right-click on the empty space in
the folder and select the paste option from the list that appeared.
On the other hand, select the file and click on the Edit option from the file menu and select the
Cut option. Then, browse the folder where you want to move the files and click on the Edit
option from the file menu and select the Paste option to move the file.
Alternatively, you can also use shortcut keys to move the files. For that, you are required to
highlight the file that you want to move, then press the shortcut key Ctrl+X. Now, browse the
folder where you want to move the files and press the shortcut key Ctrl+V to paste the files.
Drag-and-drop method
First, you are required to select the files that you want to move, then hold down the right mouse
button on the file, and drag the files while continuing to hold down the right mouse button, and
release the mouse button on the location where you want to move the files.
'Move to folder' method
Highlight the file by clicking on the file name, then click on the Edit from the file menu and
click the Move to Folder option. In the new window, browse the folder in which you want to
move the files, then you only need to click on the move button to move the file to the browsed
folder.
What is Sequential File Organisation in DBMS?
Sequential file organization is the simplest type of file organization, in which files are
stored sequentially one after the other; rather than storing the various records of a file in a
row-and-column (tabular) format, it stores the records one after another in a single sequence.
You cannot shorten, lengthen, or delete a record after it has been placed in a sequential file.
However, if the length of the record does not change, you can update it. At the end of the file,
new records are added.
What are the Various Methods of Sequential File Organization in DBMS?
Sequential file organization can be implemented by two methods:
Pile File method
Sorted File method
Pile File method
It is the simplest sequential file organization method, in which records are stored on a
first-come basis: whichever record arrives first is stored first in the sequence. There is no
other ordering; the order in which the records arrive decides the order in which they are
stored.
In case of an update, we first have to find the existing record and then overwrite it with the
new value, without changing the order of the sequence.
Insertion of a New Record: It is a one-step process. Whenever a new record is added to the file
organization system, it is appended to the end of the existing file. Let R1, R2, R8, R6, and R7
be the five records previously stored in the file. Suppose a new record R5 comes; then the
record will be added to the end of the file, i.e., after record R7.
Sorted File Method
As the name suggests, in this method the records are stored in a sorted order (ascending or
descending). The order in which the records arrive has no impact on the final sequence, because
we always end up with a sorted sequence. The order can be defined by the primary key or any
other key/attribute.
In case of an update, we first have to find the existing record and update its value, just like
in the pile file method, and then we have to re-sort the file according to the new record
value. Searching for the existing record takes less time in the sorted file method.
Insertion of a New Record: It is a two-step process. Whenever a new record comes, it is first
added to the end of the file, just like in the pile file method, and then it is moved to its
position according to the sorted order. Let R1, R2, R4, and R6 be the four records previously
in the file, sorted by primary key. Suppose a new record R5 comes; the record is first added at
the end of the file, i.e., after R6, and then moved to its position next to R4, according to
the sorted order.
What are the Pros and Cons of Sequential File Organization in DBMS?
As everything in the world has pros and cons, this sequential file organization also has both
pros and cons.
Pros
Compared to other file organizations, the design is simple, and storing the data does not take
much work.
When there are significant amounts of data, this method is extremely quick and effective. The
approach is useful when most of the information must be accessed, as when computing students'
grades or creating pay stubs, where every record is used in the calculation.
This strategy is also appropriate for report generation and statistical calculations.
Cons
In the case of the pile file approach, we cannot jump directly to a particular record; we must
always travel through the file sequentially from left to right, which takes a lot of time.
Whenever we delete a record from a certain position, the records at all the other positions
remain where they are, which creates unused memory space in the file; hence this is a
memory-inefficient way to delete a record.
The records must always be kept sorted when using the sorted file approach: the file is
re-sorted after each insert, update, and delete operation. Repeatedly sorting after finding,
adding, updating, or deleting a record takes time and can cause the system to slow down.
Direct-Access File Design
Direct access file design also uses the key field of the records but in a different way from
sequential design. The key field provides the only way of accessing data within a direct access
file design. Therefore, records are not stored in any particular order.
The data record being sought is retrieved according to its key value, so records before or after
it are not read. Generally this key is a memory or disk address for a particular record or file.
The address is usually a number from five to seven digits that is related to the physical
characteristics of the storage medium. When a file is created, this address determines where
the record is written. During retrieval, the address determines where to locate the record.
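In the simplest scheme, where the key directly determines a record slot, retrieval is one seek
plus one read. Below is a minimal sketch in C; the fixed-length Record layout and the file name
are our illustrative assumptions.

#include <stdio.h>

struct Record {
    long key;
    char data[24];
};

/* Fetch the record stored at the slot derived from its key.
   Records before or after it are never read.               */
int fetch_record(FILE *f, long key, struct Record *out)
{
    long offset = key * (long)sizeof(struct Record); /* key -> disk address */
    if (fseek(f, offset, SEEK_SET) != 0)
        return -1;
    return fread(out, sizeof *out, 1, f) == 1 ? 0 : -1;
}

int main(void)
{
    FILE *f = fopen("master.dat", "rb");
    struct Record r;
    if (f != NULL && fetch_record(f, 42, &r) == 0)
        printf("found record with key %ld\n", r.key);
    if (f != NULL)
        fclose(f);
    return 0;
}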
Another way to obtain the address of a record is to place the record keys and their
corresponding addresses in a directory (see Figure below). During processing, the computer
searches the directory to locate the address of a particular record.
Direct-access file design is much more efficient than searching an entire data file for a
particular record. It is useful when information must be updated and retrieved quickly and when
current information is crucial. A common application of direct-access file organization is for
airline seat reservations. Current information about available flights and seats must be available
at all times so that flights are not overbooked. Virtually all modern computer systems today
use a Direct Access file mechanism when opening and running computer applications. This
includes portable and desktop computers as you are using now.
In contrast to a batch-processing system, a direct-access system does not require transaction
records to be grouped or sorted before they are processed. Data is submitted to the computer in
the order it occurs, usually using an online access method. Direct-access storage devices
(DASDs), such as magnetic disk drive units, make this type of processing possible. A particular
record on a master file can be accessed directly, using its assigned address or keys, and updated
without all preceding records on the file being read. Only the key to the record needs to be
known. Thus, up-to-the-minute reports can be provided.
With direct-access processing, the computer can locate the record to be updated without
processing all records that precede it. This next Figure shows how direct-access processing
would be used in a business.
MAKING INQUIRIES TO DIRECT-ACCESS FILES
Consider again the two inquiries discussed in connection with sequential files. This time the
inquiries will be made to direct-access files.
List the records of employees with social security numbers 234-02-1532 and 372-84-7436.
With the social security number as the key to the employee file, these two employee records
can be located directly. The computer retrieves each record from the address assigned for each
social security number. It does not have to read all the records in the file.
List all employees from the area with zip code 43403.
The approach used for this inquiry will depend on the organization of the file. If the file is large
and much processing is done based on a geographic breakdown of employees, a directory using
zip codes and their record addresses can be created (as in the Figure below).
However, if there are not many employees and processing is seldom based on the geographic
breakdown of employees, a directory for locating employee records by zip code may have little
value. In this case, the situation is the same as with sequential files: the entire file must
be read to obtain the desired information.
ASSESSMENT OF DIRECT-ACCESS FILE DESIGN
Direct-access processing and file design is suitable for applications with low activity and high
volatility. Examples of such applications (systems requiring only a few records to be updated
frequently) include banking operations and hotel and airline reservation systems.
Advantages of direct-access processing and file design are the following:
 Transaction data can be used directly to update master records via online terminals
without first being sorted. Transactions are processed as they occur.
 The master file is not read completely each time updating occurs; only the master
records to be updated are accessed. This saves time and money.
 Gaining access to any record on a direct-access file takes only a fraction of a second.
 Direct-access files provide flexibility in handling inquiries.
 Several files can be updated at the same time by use of direct-access processing. For
example, when a credit sale is made, the inventory file can be updated, the customer
file can be changed to reflect the current accounts receivable figure, and the sales file
can be updated to show which employee made the sale. Several runs would be required
to accomplish all these operations if sequential processing were used.
Disadvantages of direct-access design include the following:
 During processing, the original record is replaced by the updated record. In effect, it is
destroyed. (In batch processing, a completely new master file is created, but the old
master file remains intact.) Consequently, to provide backup, an organization may have
to make a magnetic-tape copy of the master file weekly and also keep the current week's
transactions so that master records can be reconstructed if necessary. (This school
makes a backup of staff, student, and web resources weekly).
 Since many users may have access to records stored on direct-access devices in online
systems, the chances of accidental destruction of data are greater than with sequential
processing. Special programs are required to edit input data and to perform other checks
to ensure that data is not lost.
Direct-access could present security problems for organizations. Users may be able to access
confidential information. Therefore additional security procedures must be implemented.
Implementation of direct-access systems is often difficult because of the complexity and the
high level of programming (software) support that such systems need. In addition, the cost of
developing and maintaining a direct-access system is greater than the expense of a sequential
processing system.
Indexed sequential access method (ISAM)
An indexed sequential access file combines sequential and direct access file organization. In
an indexed sequential access file, records are stored on a direct access device, such as a
magnetic disk, organized by a primary key.
The file may have multiple keys, which can be alphanumeric; the key by which the records are
ordered is called the primary key.
The data can be accessed either sequentially or randomly using the index. The index is stored
in a file and read into memory when the file is opened.
Advantages of Indexed sequential access file organization
 In an indexed sequential access file, both sequential and random access are possible.
 It accesses records very fast if the index table is properly organized.
 Records can be inserted in the middle of the file.
 It provides quick access for both sequential and direct processing.
 It reduces the degree of sequential search.
Disadvantages of Indexed sequential access file organization
 An indexed sequential access file requires unique keys and periodic reorganization.
 An indexed sequential access file takes longer to search the index for data access or
retrieval.
 It requires more storage space.
 It is expensive because it requires special software.
 It is less efficient in the use of storage space as compared to other file organizations.
ISAM method is an advanced sequential file organization. In this method, records are stored in
the file using the primary key. An index value is generated for each primary key and mapped
with the record. This index contains the address of the record in the file.
If any record has to be retrieved based on its index value, then the address of the data block is
fetched and the record is retrieved from the memory.
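The index-to-block mapping the text describes can be sketched as a binary search over sorted
(key, block address) pairs held in memory. The IndexEntry layout below is an illustrative
assumption, not a specification from the notes.

#include <stddef.h>

struct IndexEntry {
    long key;    /* primary key value         */
    long block;  /* address of the data block */
};

/* Binary search of the in-memory index: returns the block address
   for key, or -1 if the key is not indexed.                       */
long lookup_block(const struct IndexEntry *idx, size_t n, long key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].key == key)
            return idx[mid].block;
        if (idx[mid].key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}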
Pros of ISAM:
In this method, each record's index entry holds the address of its data block, so searching for
a record in a huge database is quick and easy.
This method supports range retrieval and partial retrieval of records. Since the index is based
on the primary key values, we can retrieve the data for the given range of value. In the same
way, the partial value can also be easily searched, i.e., the student name starting with 'JA' can
be easily searched.
Cons of ISAM
This method requires extra space in the disk to store the index value.
When the new records are inserted, then these files have to be reconstructed to maintain the
sequence.
When the record is deleted, then the space used by it needs to be released. Otherwise, the
performance of the database will slow down.
Types of Index
Primary Index
A primary index is an ordered file of fixed-length entries with two fields: the primary key and
a pointer to the corresponding data block.
In the primary index, there is always a one-to-one relationship between the entries in the
index table.
The primary Indexing is divided into two types.
Dense Index
In a dense index, an index record is created for every search key value in the database. This
index helps you to search faster.
• In this indexing method, each index record contains the search key value and a pointer to the
actual record on the disk.
Sparse Index
A sparse index holds an index record for only some of the values in the file.
The sparse index helps resolve the main issue of dense indexing: the size of the index.
Here, a range of index values shares the same data block address, and when data needs to be
retrieved, that block address is fetched.
Since the sparse index stores index records for only some of the search-key values, it needs
less space and less maintenance overhead for insertions and deletions, but it is slower than
the dense index for locating records.
Multilevel Index
• Multilevel indexing is created when a primary index does not fit in memory.
• In this type of indexing method, the number of disk accesses needed to locate any record is
reduced: the index itself is kept on disk as a sequential file, and a sparse index is built on
top of that file.
Secondary Index
• A secondary index can be generated from a field that has a unique value for each record and
should be a candidate key. It is also known as a non-clustering index.
• This two-level database indexing technique is used to keep the first-level mapping small:
each first-level entry covers a large range of numbers, so the first-level mapping always
remains small.
Clustering Index
• In a clustered index, records themselves are stored in the Index and not pointers.
• Sometimes the Index is created on non-primary key columns which might not be unique for
each record.
• In such a situation, you can group two or more columns to get the unique values and create
an index which is called clustered Index. This also helps you to identify the record faster.
• Example: let's assume that a company recruited many employees in various departments. In this
case, a clustering index should be created for all employees who belong to the same department.
• They are considered to form a single cluster, and the index points to the cluster as a whole.
Here, Department_no is a non-unique key.
Indexed file organization stores the records sequentially according to the value of the
RECORD-KEY (generally in ascending order). A RECORD-KEY in an indexed file is a variable that
must be part of the record/data. In the case of indexed files, two types of files are created:
Data file: It consists of the records in sequential order.
Index file: It consists of the RECORD-KEY and the address of the RECORD-KEY in the data
file.
An indexed file can be accessed sequentially, just as in sequential file organization, and also
randomly, provided the RECORD-KEY is known.
Relative file organization stores the records on the basis of their relative address. Each
record is identified by its Relative Record Number, which is the position of the record from
the beginning of the file. These records can be accessed sequentially, just as in sequential
file organization, and also randomly; to access a record randomly, the user must specify its
relative record number.
Difference between Sequential, Indexed, Relative files:
Accessibility:
Sequential files: These files can be accessed only sequentially.
Indexed files: These files can be accessed sequentially as well as randomly, with the help of
the record key.
Relative files: These files can be accessed sequentially as well as randomly, with the help of
their relative record number.
Storage:
Sequential files: The records are stored sequentially.
Indexed files: The records are stored based on the value of the RECORD-KEY, which is part of
the data.
Relative files: The records are stored by their relative address.
Insertion and deletion:
Sequential files: Records cannot be deleted and can only be stored at the end of the file.
Indexed files: It is possible to store records in the middle of the file.
Relative files: Records can be inserted at any given position.
Space:
Sequential files: They occupy less space, as the records are stored in continuous order.
Indexed files: They occupy more space.
Relative files: They occupy more space.
Access speed:
Sequential files: Access is slow, as all previous records must be read to reach any given
record.
Indexed files: Access is also slow (though faster than sequential access), as it takes time to
search the index.
Relative files: Access is fast compared to the other two.
Read/write order:
Sequential files: The records are read and written in sequential order.
Indexed files: The records are written in sequential order but can be read in sequential as
well as random order.
Relative files: The records can be written and read in sequential as well as random order.
Keys:
Sequential files: There is no need to declare any KEY for storing and accessing the records.
Indexed files: One or more KEYs can be created for storing and accessing the records.
Relative files: Only one unique KEY is declared for storing and accessing the records.
Linked Organization:
• Linked organizations differ from sequential organizations essentially in that the logical
sequence of records is generally different from the physical sequence.
• In a sequential organization, if the ith record is placed at location li, then the (i+1)st
record is placed at li + c, where c is the length of the ith record or some fixed constant.
• In a linked organization, the next logical record is obtained by following the link value in
the present record. Linking in order of increasing primary key eases insertion and deletion.
• Searching for a particular record is difficult when no index is available, since only a
sequential search is possible.
• We can speed up searching by maintaining indexes corresponding to ranges of employee numbers,
e.g., 501-700, 701-900. All records in the same range are linked together in a list.
• We can generalize this idea to the secondary key level as well: we simply set up indexes for
each key and allow records to be in more than one list. This leads to the multilist structure
for file representation.
Multilist File Organisation
The basic approach to providing the linkage between an index and the file of data records is
called multilist organisation. A multilist file maintains an index for each secondary key. The
index for a secondary key contains, instead of a list of all primary keys related to that
secondary key, only one primary key value related to it; that record is then linked to the
other records containing the same secondary key in the data file. The multilist organisation
differs from the inverted file in that, while the entry in an inverted file index for a key
value has a pointer to each data record with that key value, the entry in a multilist index for
a key value has just one pointer, to the first data record with that key value.
Linking records together in order of increasing primary key value facilitates easy insertion
and deletion once the place at which the insertion or deletion is to be made is known.
Searching for a
record with a given primary key value is difficult when no index is available, since the only
search possible is a sequential search. To facilitate searching on the primary key as well as on
secondary keys, it is customary to maintain several indexes, one for each key. Using an index
in this way reduces the length of the lists and thus the search time. This idea is very easily
generalised to allow for easy secondary key retrieval. We just set up indexes for each key and
allow records to be in more than one list.
This leads to the multilist structure for file representation.
Multi-list file Organisation
Multi-list file organisation is a multi-index linked file organisation. A linked file
organisation is a logical organisation in which the physical ordering of records is not of
concern. In a linked organisation, the sequence of records is governed by the links that
determine the next record in the series. The linking of records can be unordered, but such
linking makes searching for information in a file very costly. Thus it is a good idea to link
records in order of increasing primary key: this facilitates the deletion and insertion
algorithms and also helps search performance. In addition to imposing order during linking,
searching through a file can be further aided by creating primary and secondary indexes. All of
these ideas are supported in the multi-list file organisation. Let us describe these concepts
further with the help of an example.
Consider the employee data as given in Figure. The record numbers are given as alphabets for
better explanation. Suppose that the Empid is the key field of the data records. Let us describe
the Multi-list file organisation for the data file.
Since the primary key of the file is Empid, the linked order of records is defined as B(500),
E(600), D(700), A(800), C(900). However, as the file grows, its search performance would
deteriorate. Therefore, we can build a primary index on the file (note that in this file the
records are in logical sequence, tied together by links rather than physical placement; thus
the primary index will be a linked index file rather than block indexes).
Inverted File Organisation
In inverted file organisation, a linkage is provided between an index and the file of data records. The inverted index for a key contains every value that the key currently takes in the records of the data file, and each key-value entry in the index points to all of the data records that have the corresponding value.
Inverted files represent one extreme of file organisation, in which only the index structures are important. The records themselves may be stored in any way (sequentially ordered by primary key, randomly, linked and ordered by primary key, and so on).
Inverted files may also result in space saving compared with other file structures when record
retrieval does not require retrieval of key fields. In this case, the key fields may be deleted from
the records.
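For contrast, here is a minimal inverted-index sketch reusing the same illustrative records and the assumed dept field from the multilist sketch above: the index entry for each key value points to every record with that value, so entries are variable length and no links are stored in the records themselves.

from collections import defaultdict

records = {
    "B": {"empid": 500, "dept": "Sales"},
    "E": {"empid": 600, "dept": "Prog"},
    "D": {"empid": 700, "dept": "Sales"},
    "A": {"empid": 800, "dept": "Prog"},
    "C": {"empid": 900, "dept": "Sales"},
}

# One index entry per distinct value, pointing to ALL matching records,
# so entries are variable length and the records carry no link fields.
inverted = defaultdict(list)
for rec_id, rec in records.items():
    inverted[rec["dept"]].append(rec_id)

print(inverted["Sales"])   # ['B', 'D', 'C'] -- every pointer sits in the index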
Both inverted files and multilist files have:
• an index for each secondary key;
• an index entry for each distinct value of the secondary key.
In either organisation, the index may be tabular or tree-structured, the entries in an index may or may not be sorted, and the pointers to data records may be direct or indirect.
The indexes differ in that:
• an entry in an inverted index has a pointer to each data record with that value;
• an entry in a multilist index has a pointer to only the first data record with that value.
Thus an inverted index may have variable-length entries, whereas a multilist index has fixed-length entries. Some of the implications of these differences are the following:
• Index management is easier in the multilist approach because entries are fixed in length.
• The inverted-file approach tends to exhibit better inquiry performance: many types of queries can be answered by accessing the inversion indexes alone, without any access to the data records, thereby reducing the I/O-access requirements.
• Inversion of a file can be transparent to a programmer who accesses that file but does not use the inversion indexes, while a multilist structure affects the file's record layout. The multilist pointers can be made transparent to a programmer if the data manager does not make them available for programmer use and stores them at the end of each record.
Inverted files are similar to multilists. In a multilist, records with the same key value are linked together, with the link information kept in the individual records; in an inverted file, the link information is kept in the index itself.
• For example, assume that every key is dense. Since the index entries are variable length, index maintenance is more complex than for multilists. The benefit is that Boolean queries require only one access per record satisfying the query: a query of the form k1 = xx AND k2 = yy can be handled by intersecting the two inverted lists (see the sketch after this list).
• Retrieval works in two steps. In the first step, the indexes are processed to obtain a list of records satisfying the query; in the second, these records are retrieved using that list. The number of disk accesses needed equals the number of records retrieved plus the number of accesses needed to process the indexes; for example, retrieving 10 matching records at a cost of 3 index accesses takes 13 disk accesses in total.
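A minimal sketch of such a conjunctive query, with illustrative record ids and key values: the two inverted lists are intersected first, so the data file is touched only for records that satisfy the whole query.

# Conjunctive query k1 = xx AND k2 = yy: intersect the two inverted
# lists first; the data file is accessed only for surviving record ids.
k1_index = {"xx": {"A", "C", "E"}}   # inverted list for k1 = xx
k2_index = {"yy": {"C", "D", "E"}}   # inverted list for k2 = yy

matching = k1_index["xx"] & k2_index["yy"]
print(sorted(matching))              # ['C', 'E'] -- one access per qualifying record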
Cellular partitions
To reduce file search times, the storage media may be divided into cells. A cell may be an entire disk pack, or it may simply be a cylinder.
• Lists are localised so that each lies within a single cell.
• If, in a multilist organisation, the list for key1 = prog included records on several different cylinders, we could break it into several smaller lists, each containing only the records on one cylinder.
• The index entry for prog will then contain several entries of the form (addr, length), where addr is a pointer to the start of a list of records with key1 = prog and length is the number of records on that list (see the sketch after this list).
• By doing this, all records in the same cell may be accessed without moving the read/write heads.
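A minimal sketch of such an index entry; the addresses and record counts here are illustrative assumptions. Each key value maps to one (addr, length) pair per cell, so no sub-list ever crosses a cylinder boundary.

# Cellular index entry: one (addr, length) pair per cell, where addr
# points to the start of that cell's sub-list and length counts its records.
index = {
    ("key1", "prog"): [
        (0x0100, 4),   # 4 records on the sub-list in one cylinder
        (0x09A0, 2),   # 2 records on the sub-list in another cylinder
    ],
}

for addr, length in index[("key1", "prog")]:
    print(f"read {length} records from {addr:#06x} without moving the heads")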
Coral rings
A coral ring is a doubly linked multilist structure in which each list is a circular list with a head node.
• The 'alink' field is used to link all records with the same key value.
• The 'blink' field serves as a back pointer for some records and as a pointer to the head node for others. Owing to these back pointers, a record can be deleted without traversing the list from its start.
• Indexes are maintained as in multilists (a node sketch follows).
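A minimal node sketch under these assumptions; for simplicity, blink is kept as a true back pointer on every node here, although the text allows it to shortcut to the head node for some records.

class RingNode:
    """One record in a coral ring (circular, doubly linked multilist)."""
    def __init__(self, rec_id, is_head=False):
        self.rec_id = rec_id
        self.is_head = is_head
        self.alink = self   # next record with the same key value (circular)
        self.blink = self   # back pointer (the text also allows a head-node shortcut)

def insert_after(prev, node):
    node.alink, node.blink = prev.alink, prev
    prev.alink.blink = node
    prev.alink = node

def delete(node):
    """Unlink a node via its back pointer -- no scan from the head needed."""
    node.blink.alink = node.alink
    node.alink.blink = node.blink

head = RingNode("HEAD", is_head=True)
nodes = [RingNode(r) for r in ("B", "D", "C")]   # records sharing one key value
for n in nodes:
    insert_after(head.blink, n)                  # append at the tail of the ring
delete(nodes[-1])                                # removes 'C' in O(1)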
External sorting
External sorting is a technique for sorting data stored on secondary memory: the data is loaded into main memory part by part, each part is sorted there, and the sorted parts are written out to intermediate files.
External sorting is required when the data to be sorted does not fit into main memory. It basically consists of two phases:
Sorting phase: the phase in which manageable portions of the data are sorted and written to intermediate files.
Merge phase: the phase in which the sorted files are combined into a single larger file.
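A minimal external merge sort sketch in Python, assuming line-oriented input and an illustrative in-memory run size: the sorting phase writes sorted runs to temporary files, and the merge phase combines them with heapq.merge.

import heapq
import os
import tempfile

def _write_run(sorted_lines):
    """Write one sorted run to a temporary file and return its path."""
    f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    f.writelines(sorted_lines)
    f.close()
    return f.name

def external_sort(lines, run_size=1000):
    """Yield all input lines in sorted order using bounded memory."""
    runs, buf = [], []
    for line in lines:            # sorting phase: sort run_size lines at a time
        buf.append(line)
        if len(buf) >= run_size:
            runs.append(_write_run(sorted(buf)))
            buf = []
    if buf:
        runs.append(_write_run(sorted(buf)))

    files = [open(r) for r in runs]
    try:
        yield from heapq.merge(*files)   # merge phase: k-way merge of the runs
    finally:
        for f in files:
            f.close()
        for r in runs:
            os.unlink(r)          # clean up the intermediate files

# Each line is fixed-width and newline-terminated so string order == numeric order.
data = [f"{n:05d}\n" for n in (42, 7, 99, 13, 58)]
print("".join(external_sort(data, run_size=2)))

With run_size=2 the five input lines produce three sorted runs, which the merge phase combines into a single sorted stream, mirroring the two phases described above.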

200
201
202

You might also like