
DS Module-5 Notes

The document discusses different graph representations and traversal methods. It defines key graph terminology like vertices, edges, paths, cycles, connected components, and graph representations like adjacency matrix, adjacency list, and adjacency multilist. It also discusses traversal algorithms like breadth first search and depth first search.

18CS32 DSA Notes

MODULE - 5

Graphs: Definitions, Terminologies, Matrix and Adjacency List Representation of Graphs,
Elementary Graph Operations, Traversal Methods: Breadth First Search and Depth First Search.
Sorting and Searching: Insertion Sort, Radix Sort, Address Calculation Sort. Hashing: Hash Table
Organizations, Hashing Functions, Static and Dynamic Hashing.
Files and Their Organization: Data Hierarchy, File Attributes, Text Files and Binary Files, Basic File
Operations, File Organizations and Indexing.

Introduction

The first recorded evidence of the use of graphs dates back to 1736, when Leonhard Euler used them
to solve the classical Konigsberg bridge problem.

Definitions

Graph: A graph G consists of two sets V and E

1. V is a finite nonempty set of vertices and
2. E is a set of pairs of vertices; these pairs are called edges
A graph can be represented as G = (V, E). V(G) represents the set of vertices and E(G)
represents the set of edges of the graph G.

Example:

V(G) = {v1, v2, v3, v4, v5, v6}

E(G) = {e1, e2, e3, e4, e5, e6} = {(v1, v2), (v2, v3), (v1, v3), (v3, v4), (v3, v5), (v5, v6)}

There are six vertices and six edges in the graph.

Undirected Graph: In an undirected graph the pair of vertices representing an edge is unordered.
Thus the pairs (u,v) and (v,u) represent the same edge.
Example:

V(G)={a,b,c,d}

E(G)={(a,b),(a,d),(b,d),(b,c)}

Directed Graph (digraph): In a directed graph each edge is represented by a directed pair <u,v>, where u is
the tail and v is the head of the edge. Therefore <u,v> and <v,u> represent two different edges.

Dept. of ISE, SVIT Page 1



Example:
V(G)={a,b,d}

E(G)={<a,d>, <a,b>, <d,b>}

Self Edges/Self Loops: Edges of the form (v,v) are called self edges or self loops. A self loop is an edge
which starts and ends at the same vertex.
Example:

Multigraph: A graph with multiple occurrences of the same edge is called a multigraph.
Example:

Complete Graph: An undirected graph with n vertices and exactly n(n-1)/2 edges is said to be a
complete graph. In a complete graph every pair of distinct vertices is connected by an edge.
Example : A complete graph with n=3 vertices

Adjacent Vertex

If (u,v) is an edge in E(G), then we say that the vertices u and v are adjacent and that the edge (u,v) is
incident on vertices u and v.

Path: A path from vertex u to vertex v in graph G is a sequence of vertices u, i1, i2, ..., ik, v such that
(u,i1), (i1,i2), ..., (ik,v) are edges in E(G). If G' is directed, then the path consists of the edges
<u,i1>, <i1,i2>, ..., <ik,v> in E(G').

The length of the path is the number of edges in it.


Example:

(B,C), (C,D) is a path from B to D. The length of the path is 2.

A simple path is a path in which all the vertices are distinct.

Cycle: A cycle is a path in which all the vertices are distinct except the first and the last,
which are the same vertex.
Example:
(B,C), (C,D), (D,E), (E,A), (A,B) is a cycle.

Degree of a vertex: In an undirected graph the degree of a vertex is the number of edges incident on that
vertex.

In a directed graph the in-degree of a vertex v is the number of edges for which v is the head, i.e. the
number of edges that are coming into the vertex. The out-degree is defined as the number of edges for
which v is the tail, i.e. the number of edges that are going out of the vertex.

Subgraph: A subgraph of G is a graph G' such that V(G') ⊆ V(G) and E(G') ⊆ E(G).
Example :

Graph(G) Subgraph(G’)

Connected Graph: An undirected graph G is said to be connected if for every pair of distinct
vertices u and v in V(G) there is a path from u to v in G.

Connected Component is a maximal connected subgraph

Strongly connected graph : A directed graph G is said to be strongly connected if for every pair of
distinct vertices u and v in V(G), there is a directed path from u to v and from v to u.

Tree: A tree is a connected acyclic graph.

ADT Graph

Objects: a nonempty set of vertices and a set of undirected edges, where each edge is a pair of
vertices.

Functions: for all graph Graph, v,v1,v2  vertices

Dept. of ISE, SVIT Page 3


18CS32 DSA Notes

Graph create():=
Example: return an empty graph
Graph InsertVertex(graph,v):= return a graph with v inserted. v has no incident edges
Graph InsertEdge(graph,v1,v2) := retrun a graph with a new edge between v1 and v2
Graph DeleteVertex(graph,v) := return a graph in which v and all edges incident to it is
removed
Graph DeleteEdge(graph,v1,v2):= retrun a graph in which the edge (v1,v2) is removed,
leave the incident nodes in the graph
Boolean IsEmpty:= If (graph == empty graph) retrun TRUE else Retrun
FALSE
List Adjacent(graph,v) := retrun a list of all vertices that are adjacent to v

Graph Representation

The three most commonly used representations are

• Adjacency Matrix
• Adjacency List
• Adjacency Multilist

Adjacency Matrix: Let G=(V,E) be a graph with n vertices, n>=1. The adjacency matrix of G is a
two dimensional n*n array, say a, with the property that a[i][j]=1 if the edge (i,j)
(<i,j> for a directed graph) is in E(G), and a[i][j]=0 if there is no such edge in G.

Example: Adjacency matrix of graph G1

     0  1  2  3
0    0  1  1  1
1    1  0  1  1
2    1  1  0  1
3    1  1  1  0

Figure 5.1 Graph G1

• The space requirement to store an adjacency matrix is n² bits.
• The adjacency matrix for an undirected graph is symmetric. About half the space can be saved
in an undirected graph by storing only the upper or lower triangle of the matrix.
• For an undirected graph the degree of any vertex i is its row sum. For a directed graph the row
sum is the out-degree and the column sum is the in-degree.
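
The row-sum and column-sum properties can be sketched in C as follows. This is only an illustration: the constant N, the array name g1 and the function names row_sum and col_sum are assumptions made for this sketch, and the matrix is the one for G1 given above.

```c
#define N 4  /* number of vertices in G1 (fixed for this sketch) */

/* Adjacency matrix of graph G1 on vertices 0..3. */
static const int g1[N][N] = {
    {0, 1, 1, 1},
    {1, 0, 1, 1},
    {1, 1, 0, 1},
    {1, 1, 1, 0},
};

/* Row sum: degree of vertex i (undirected) or out-degree of i (directed). */
int row_sum(const int a[N][N], int i)
{
    int j, sum = 0;
    for (j = 0; j < N; j++)
        sum += a[i][j];
    return sum;
}

/* Column sum: in-degree of vertex j in a directed graph. */
int col_sum(const int a[N][N], int j)
{
    int i, sum = 0;
    for (i = 0; i < N; i++)
        sum += a[i][j];
    return sum;
}
```

Each call scans one full row or column, which is why finding the neighbours of a single vertex costs O(n) with this representation.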

Adjacency list: In the adjacency list representation the n rows of the adjacency matrix are represented as n chains.
There is one chain for each vertex in G. The nodes in chain i represent the vertices that are adjacent
from vertex i. The data field of a chain node stores the index of an adjacent vertex.

Example: The adjacency lists of graph G1 in Figure 5.1 are shown below. Each chain node has a data field and a link field; the last link in a chain is 0 (NULL).

adjLists
[0] -> 1 -> 2 -> 3 -> NULL
[1] -> 0 -> 2 -> 3 -> NULL
[2] -> 0 -> 1 -> 3 -> NULL
[3] -> 0 -> 1 -> 2 -> NULL

• For an undirected graph with n vertices and e edges, the linked adjacency list representation
requires an array of size n and 2e chain nodes.
• The degree of any vertex in an undirected graph may be determined by counting the number
of nodes in its adjacency list.
• For a digraph the number of list nodes is only e.
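
A minimal sketch of the linked adjacency list in C is given below. The names add_arc, add_edge and degree, and the bound MAX_VERTICES, are assumptions for illustration. New nodes are pushed at the front of a chain, so the order of nodes within a chain depends on insertion order.

```c
#include <stdlib.h>

#define MAX_VERTICES 50

struct node {
    int vertex;          /* index of the adjacent vertex */
    struct node *link;   /* next node in the chain */
};

struct node *graph[MAX_VERTICES];  /* graph[i] heads the chain for vertex i */

/* Push w onto the front of v's chain. Returns 0 on success, -1 on failure. */
static int add_arc(int v, int w)
{
    struct node *p = malloc(sizeof *p);
    if (p == NULL)
        return -1;
    p->vertex = w;
    p->link = graph[v];
    graph[v] = p;
    return 0;
}

/* Undirected edge (v,w): one chain node in each endpoint's list, 2e in total. */
int add_edge(int v, int w)
{
    if (add_arc(v, w) != 0)
        return -1;
    return add_arc(w, v);
}

/* Degree of v = number of nodes in v's chain. */
int degree(int v)
{
    int count = 0;
    struct node *p;
    for (p = graph[v]; p != NULL; p = p->link)
        count++;
    return count;
}
```

For a digraph only add_arc would be called once per edge, which gives the e list nodes mentioned above.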

Adjacency Multilists: For each edge there will be exactly one node, but this node will be in two
lists (i.e., the adjacency lists of each of the two vertices on which it is incident). A new field m is necessary
to record whether the edge has been examined, so that it can be marked as examined.

The new node structure is

m Vertex1 Vertex2 Link1 Link2

Example: The adjacency multilist for graph G1 is shown below

adjLists: [0] -> N0, [1] -> N0, [2] -> N1, [3] -> N2

Node  Vertex1  Vertex2  Link1  Link2  Edge
N0    0        1        N1     N3     edge(0,1)
N1    0        2        N2     N3     edge(0,2)
N2    0        3        0      N4     edge(0,3)
N3    1        2        N4     N5     edge(1,2)
N4    1        3        0      N5     edge(1,3)
N5    2        3        0      0      edge(2,3)

The lists are
Vertex 0: N0->N1->N2
Vertex 1: N0->N3->N4
Vertex 2: N1->N3->N5
Vertex 3: N2->N4->N5

Weighted Edges: In many applications the edges of a graph have weights assigned to them. These
weights may represent the distance from one vertex to another or the cost of going from one vertex
to an adjacent vertex. The adjacency matrix and adjacency lists then maintain the weight information as well. A graph
with weighted edges is also called a network.


Example:

Elementary Graph Operations

Given an undirected graph G=(V,E) and a vertex v in V(G), there are two ways to find all the
vertices that are reachable from v, that is, connected to v:

• Depth First Search and
• Breadth First Search

Depth First Search

1. Visit the starting vertex v (visiting consists of printing the node's vertex field).
2. Select an unvisited vertex w from v's adjacency list and carry out a depth first search on w.
3. A stack is maintained to preserve the current position in v's adjacency list.
4. When we reach a vertex u that has no unvisited vertices on its adjacency list, remove a vertex
from the stack and continue processing its adjacency list. Previously visited vertices are
discarded and unvisited vertices are visited and placed on the stack.
5. The search terminates when the stack is empty.

A recursive implementation of depth first search is shown below.

A global array visited is maintained. It is initialized to FALSE, and when we visit a vertex i we change
visited[i] to TRUE.

Global Declarations

#define FALSE 0
#define TRUE 1
short int visited[MAX_VERTICES];

void dfs(int v)
{
    /* graph[] is the array of adjacency list heads; each chain node has the fields vertex and link */
    struct node *w;
    visited[v] = TRUE;
    printf("%d ", v);
    w = graph[v];
    while (w != NULL)
    {
        if (visited[w->vertex] == FALSE)
            dfs(w->vertex);
        w = w->link;
    }
}

Analysis

• If we represent G by its adjacency lists then we can determine the vertices adjacent to v by
following a chain of links. Since dfs examines each node in the adjacency lists at most once,
the time to complete the search is O(e).
• If we represent G by its adjacency matrix then determining all vertices adjacent to v requires
O(n) time. Since we visit at most n vertices the total time is O(n²).

Example: For the graph whose adjacency lists are given below, if the search is initiated from vertex 0 then the vertices are
visited in the order 0, 3, 1, 2.

adjLists
[0] -> 3 -> 1 -> 2 -> NULL
[1] -> 2 -> 0 -> 3 -> NULL
[2] -> 1 -> 0 -> 3 -> NULL
[3] -> 0 -> 1 -> 2 -> NULL
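
The example above can be checked with a small self-contained sketch. Instead of printing, this version records the visit order in an array so the result can be inspected. The helpers make_chain and build_example and the arrays order and n_visited are assumptions introduced only for this illustration.

```c
#include <stdlib.h>

#define MAX_VERTICES 4

struct node {
    int vertex;
    struct node *link;
};

struct node *graph[MAX_VERTICES];   /* adjacency list heads */
short int visited[MAX_VERTICES];
int order[MAX_VERTICES];            /* order[k] is the k-th vertex visited */
int n_visited = 0;

/* Build a chain holding verts[0..n-1] in that order (NULL if allocation fails). */
struct node *make_chain(const int *verts, int n)
{
    struct node *head = NULL;
    int i;
    for (i = n - 1; i >= 0; i--) {
        struct node *p = malloc(sizeof *p);
        if (p == NULL)
            return NULL;            /* sketch only: earlier nodes are not freed */
        p->vertex = verts[i];
        p->link = head;
        head = p;
    }
    return head;
}

/* Set up the adjacency lists of the example graph. */
void build_example(void)
{
    static const int adj[MAX_VERTICES][3] = {
        {3, 1, 2}, {2, 0, 3}, {1, 0, 3}, {0, 1, 2}
    };
    int v;
    for (v = 0; v < MAX_VERTICES; v++) {
        graph[v] = make_chain(adj[v], 3);
        visited[v] = 0;
    }
    n_visited = 0;
}

/* Recursive depth first search; the call stack plays the role of the stack. */
void dfs(int v)
{
    struct node *w;
    visited[v] = 1;
    order[n_visited++] = v;
    for (w = graph[v]; w != NULL; w = w->link)
        if (!visited[w->vertex])
            dfs(w->vertex);
}
```

Calling build_example() and then dfs(0) fills order with 0, 3, 1, 2: vertex 3 is the first entry on vertex 0's list, vertex 1 is the first unvisited entry on vertex 3's list, and vertex 2 is the first entry on vertex 1's list.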

Breadth first Search

1. The search starts at vertex v and marks it as visited.
2. It then visits each of the vertices on v's adjacency list.
3. As we visit each vertex it is placed on a queue.
4. When all the vertices on the adjacency list have been visited we remove a vertex from the queue and
proceed by examining each of the vertices on its adjacency list.
5. Visited vertices are ignored and unvisited vertices are visited and placed on the queue.
6. The search terminates when the queue is empty.

The queue definition and the function prototypes

struct node
{
    int vertex;
    struct node *link;
};
typedef struct node queue;

queue *front, *rear;
int visited[MAX_VERTICES];

void addq(int);
int deleteq(void);

void bfs(int v)
{
    struct node *w;
    front = rear = NULL;
    printf("%d ", v);
    visited[v] = TRUE;
    addq(v);
    while (front)
    {
        v = deleteq();
        /* graph[v] heads the adjacency list of v */
        w = graph[v];
        while (w != NULL)
        {
            if (visited[w->vertex] == FALSE)
            {
                printf("%d ", w->vertex);
                addq(w->vertex);
                visited[w->vertex] = TRUE;
            }
            w = w->link;
        }
    }
}

Analysis of BFS:

• Since each vertex is placed on the queue exactly once, the outer while loop is iterated at most n
times.
• For the adjacency list representation the inner loop has a total cost of O(e) over the whole search.
• For the adjacency matrix representation the inner loop takes O(n) time for each vertex,
therefore the total time is O(n²).
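
The adjacency matrix case of the analysis can be illustrated with the sketch below. It is not the linked-list version given in the notes: it assumes a small hypothetical 6-vertex graph stored as a matrix, uses a plain array as the queue, and records the visit order instead of printing it. All names here (N, adj, bfs, order) are illustrative.

```c
#define N 6

/* Hypothetical undirected graph with edges 0-1, 0-2, 1-3, 1-4, 2-5. */
static const int adj[N][N] = {
    {0, 1, 1, 0, 0, 0},
    {1, 0, 0, 1, 1, 0},
    {1, 0, 0, 0, 0, 1},
    {0, 1, 0, 0, 0, 0},
    {0, 1, 0, 0, 0, 0},
    {0, 0, 1, 0, 0, 0},
};

/* Breadth first search from start. Stores the visit order in order[]
   and returns the number of vertices reached. */
int bfs(int start, int order[N])
{
    int visited[N] = {0};
    int queue[N];               /* each vertex is enqueued at most once */
    int front = 0, rear = 0;
    int count = 0;
    int v, w;

    visited[start] = 1;
    order[count++] = start;
    queue[rear++] = start;

    while (front < rear) {              /* at most n iterations */
        v = queue[front++];
        for (w = 0; w < N; w++) {       /* O(n) scan of row v */
            if (adj[v][w] && !visited[w]) {
                visited[w] = 1;
                order[count++] = w;
                queue[rear++] = w;
            }
        }
    }
    return count;
}
```

Starting from vertex 3, the vertices are reached level by level in the order 3, 1, 0, 4, 2, 5. The outer loop runs at most n times and each iteration scans a full row, which gives the O(n²) bound.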

Insertion Sort: The basic step is to insert a new record into a sorted sequence of i records in such a
way that the resulting sequence of size i+1 is also ordered. The function insert implements this
insertion.

Insert e into the ordered list a[1:i] such that the resulting list a[1:i+1] is also ordered. The array
must have space allocated for at least i+2 elements.

void insert(element e, element a[], int i)
{
    a[0] = e;   /* sentinel: stops the loop when e is smaller than every key */
    while (e.key < a[i].key)
    {
        a[i+1] = a[i];   /* shift the larger record one position to the right */
        i--;
    }
    a[i+1] = e;
}

The following function sorts a[1:n] in increasing order

void insertion_sort(element a[], int n)
{
    int j;
    element temp;
    for (j = 2; j <= n; j++)
    {
        temp = a[j];
        insert(temp, a, j-1);
    }
}

Example:

Analysis: In the worst case insert(e,a,i) makes i+1 comparisons before making the insertion, hence its
complexity is O(i). The function insertion_sort invokes insert n-1 times, with i = 1, 2, ..., n-1, so the worst case time
complexity of insertion sort is O(n²).
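
The quadratic worst case can be observed directly by counting element shifts. The sketch below is a 0-based variant on plain int keys, without the sentinel in a[0]; the function name insertion_sort_count and the shift counter are assumptions added for illustration.

```c
/* Sorts a[0..n-1] in increasing order and returns the number of
   element shifts performed (a 0-based variant without a sentinel). */
long insertion_sort_count(int a[], int n)
{
    long shifts = 0;
    int i, j;
    for (j = 1; j < n; j++) {
        int e = a[j];
        /* move every element larger than e one position to the right */
        for (i = j - 1; i >= 0 && e < a[i]; i--) {
            a[i + 1] = a[i];
            shifts++;
        }
        a[i + 1] = e;
    }
    return shifts;
}
```

For a list in decreasing order every element is shifted past all earlier ones, so n = 5 gives 1 + 2 + 3 + 4 = 10 shifts, that is n(n-1)/2. For an already sorted list no shifts occur and the sort takes O(n) time.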

Radix Sort: Radix sort is the method used to alphabetise a large list of names. (Here the radix is
26, the 26 letters of the alphabet.) The names are first sorted according to the first
letter of each name. That is, the names are arranged in 26 classes, where the first class consists of
those names that begin with the letter "A", the second class consists of those that begin with "B", and so on. During
the second pass each class is alphabetised according to the second letter, and so on. If the
maximum length of a name is n, the names are alphabetised in at most n passes.

Suppose a list of n items A1, A2, ..., An is given. Let d denote the radix (for example d=10 for decimal
digits, d=26 for letters, d=2 for bits), and suppose each item is made of s digits. The radix sort algorithm will
require s passes.

Example: The example below shows how 3-digit numbers can be sorted using reverse-digit sort. First
the numbers are sorted according to the units digit. On the second pass, the numbers are sorted
according to the tens digit. Finally the numbers are sorted according to the hundreds digit.

Consider the list of numbers 348, 143, 361, 423, 538, 128, 321, 543, 366

Pass-1 (units digit)

Input  Bucket
348    8
143    3
361    1
423    3
538    8
128    8
321    1
543    3
366    6

Pass-2 (tens digit)

Input  Bucket
361    6
321    2
143    4
423    2
543    4
366    6
348    4
538    3
128    2

Pass-3 (hundreds digit)

Input  Bucket
321    3
423    4
128    1
538    5
143    1
543    5
348    3
361    3
366    3

Now the numbers are sorted: 128, 143, 321, 348, 361, 366, 423, 538, 543

Analysis of Radix sort

n = number of elements
d = radix
s = number of digits in each element
Then the number of comparisons C(n) <= d*s*n
d is independent of n but s depends on n, therefore
Time complexity worst case (s = n): O(n²)
Time complexity best case (s = log n to the base d): O(n log n)
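
A compact sketch of the passes for decimal keys is given below. It does not use the pockets as separate lists: instead it counts how many keys fall into each bucket and copies them into a scratch array in bucket order, which keeps each pass stable. The function name radix_sort and its parameters are assumptions for illustration.

```c
#define RADIX 10

/* Sorts n non-negative integers of at most `digits` decimal digits.
   tmp must point to scratch space for n integers. */
void radix_sort(int a[], int tmp[], int n, int digits)
{
    int pass, i;
    int div = 1;                       /* 1, 10, 100, ... selects the digit */

    for (pass = 0; pass < digits; pass++, div *= RADIX) {
        int count[RADIX] = {0};
        int start[RADIX];

        /* count how many keys fall into each bucket */
        for (i = 0; i < n; i++)
            count[(a[i] / div) % RADIX]++;

        /* start[b] = position of the first key of bucket b */
        start[0] = 0;
        for (i = 1; i < RADIX; i++)
            start[i] = start[i - 1] + count[i - 1];

        /* distribute in input order, so equal digits keep their order */
        for (i = 0; i < n; i++)
            tmp[start[(a[i] / div) % RADIX]++] = a[i];

        for (i = 0; i < n; i++)
            a[i] = tmp[i];
    }
}
```

With digits = 1 the list from the example becomes 361, 321, 143, 423, 543, 366, 348, 538, 128, which is the input of Pass-2; with digits = 3 it ends fully sorted.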

Drawbacks of Radix sort

1. Radix sort works well only when s (the number of digits in the representation) is small.
2. We may need d*n memory locations. This may be minimized by using linked lists.

Address Calculation Sort

◾ In this method a function f is applied to each key.
◾ The result of this function determines into which of the several subfiles the record is to be
placed.
◾ The function should have the property that if x <= y then f(x) <= f(y). Such a function is called
order preserving.
◾ An item is placed into a subfile in the correct sequence by using any appropriate sorting method;
simple insertion is often used.

Example:
25 57 48 37 12 92 86 33

• Let us create 10 subfiles. Initially each of these subfiles is empty.
• Each number is passed to the hash function, which returns its most significant digit (the tens place
digit). The number is placed in the subfile at that position in the array of pointers.

Analysis: If the n elements are uniformly distributed over m subfiles so that n/m is approximately
1, the time of the sort is near O(n). On the other hand, if most of the elements fall into one or two
subfiles, then n/m is much larger, significant work is required to insert each element into its subfile at its proper
position, and the efficiency approaches O(n²).
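
The method can be sketched in C for two-digit keys as follows. The function f below (tens digit), the record type rec and the function address_calc_sort are assumptions for this illustration; each subfile is kept as a linked list that is maintained in sorted order by simple insertion.

```c
#include <stdlib.h>

#define SUBFILES 10

struct rec {
    int key;
    struct rec *next;
};

/* Order-preserving function: tens digit of a key in 0..99. */
static int f(int key)
{
    return key / 10;
}

/* Sorts a[0..n-1] in place; keys are assumed to lie in 0..99.
   Returns 0 on success, -1 if memory allocation fails (a[] is then unchanged). */
int address_calc_sort(int a[], int n)
{
    struct rec *sub[SUBFILES] = {0};
    struct rec *p, **pp;
    int i, k = 0, status = 0;

    for (i = 0; i < n; i++) {
        p = malloc(sizeof *p);
        if (p == NULL) {
            status = -1;
            break;
        }
        p->key = a[i];
        /* simple insertion: walk the subfile to the first larger key */
        pp = &sub[f(a[i])];
        while (*pp != NULL && (*pp)->key <= a[i])
            pp = &(*pp)->next;
        p->next = *pp;
        *pp = p;
    }

    /* concatenate the subfiles in order, freeing the nodes */
    for (i = 0; i < SUBFILES; i++) {
        while (sub[i] != NULL) {
            p = sub[i];
            sub[i] = p->next;
            if (status == 0)
                a[k++] = p->key;
            free(p);
        }
    }
    return status;
}
```

For the list 25 57 48 37 12 92 86 33 only subfile 3 receives two keys (37 and 33), so almost no insertion work is needed and the result is 12 25 33 37 48 57 86 92.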

Hashing

Hashing enables us to perform dictionary operations like search, insert and delete in O(1) expected time. There
are two types of hashing

◾ Static and
◾ Dynamic

Static Hashing
◾ In static Hashing the dictionary pairs are stored in a table, ht called the hash table.
◾ The hash table is partitioned into b buckets, ht[0],…….ht[b-1]
◾ Each bucket is capable of holding s dictionary pairs.
◾ Thus a bucket is said to consist of s slots. usually s=1
◾ The address or location of a pair whose key is k is determined by hash function h which
maps keys into buckets.
◾ Thus for any key k, h(k) is an integer in range 0 through b-1

Figure: Hash table ht with b buckets (rows 0, 1, 2, ..., b-2, b-1), each with s slots (columns 1, 2, ..., s); h(k) ranges over 0 ... (b-1).

The key density of a hash table is the ratio n/T, where
– n is the number of pairs in the table
– T is the total number of possible keys
The loading density or loading factor of a hash table is α = n/(sb), where
– s is the number of slots per bucket
– b is the number of buckets

Example 1:
b = 26, s = 2
n = 10 distinct identifiers, each representing a C library function
Loading factor α = n/(sb) = 10/52 = 0.19
f(x) = first character of x, mapped to 0 ... 25

x:     acos  define  float  exp  char  atan  ceil  floor  clock  ctime
f(x):  0     3       5      4    2     0     2     5      2      2

Bucket  Slot 0  Slot 1
0       acos    atan
1
2       char    ceil
3       define
4       exp
5       float   floor
.
.
24
25

clock and ctime also hash to bucket 2, which is already full, so inserting them causes an overflow.

Hash Functions: A hash function maps a key into a bucket in the hash table. A function H from the
set K of keys into the set L of memory addresses is called the hash function

H:K->L

Desired Properties are


◾ Easy computation
◾ Minimal number of collisions
◾ Uniformly distribute the hash addresses throughout the set L

Division: Choose a number m larger than the number n of keys in K. The number m is usually chosen to be a
prime number or a number without small divisors, to reduce collisions. The function is defined as

h(k) = k % m, that is, h(k) = k mod m

Bucket addresses range from 0 to m-1 and the hash table must have at least m buckets.

Example: if m=10 then h(25)=5, h(32)=2

Mid Square: In this method the square of the key is found and an appropriate number of bits is taken
from the middle of the square to obtain the bucket address.

• h(K) = middle(K²)
• The number of bits used to obtain the bucket address depends on the table size.
• If r bits are used the range of values is 0 through 2^r - 1.

Example (using the middle two decimal digits): K=3205, K² = 10272025, H(K) = 72
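
The example works on decimal digits rather than bits, and the sketch below follows the example. The function name mid_square and the rule for locating the middle (drop half of the surplus digits from the right) are assumptions chosen so that the example above is reproduced.

```c
/* Returns the middle r decimal digits of key * key.
   If the square has fewer than r digits, the whole square is returned. */
unsigned long mid_square(unsigned long key, int r)
{
    unsigned long long sq = (unsigned long long)key * key;
    unsigned long long t = sq;
    unsigned long long mod = 1;
    int digits = 0, drop, i;

    while (t > 0) {                 /* count the decimal digits of the square */
        digits++;
        t /= 10;
    }
    drop = (digits > r) ? (digits - r) / 2 : 0;
    for (i = 0; i < drop; i++)      /* discard the low-order digits */
        sq /= 10;
    for (i = 0; i < r; i++)
        mod *= 10;
    return (unsigned long)(sq % mod);   /* keep r digits */
}
```

For K = 3205 the square 10272025 has 8 digits; dropping 3 digits from the right leaves 10272, and the last two digits of that give 72.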

Folding: Partition the key k into several parts

• All parts except for the last one have the same length
• The parts are added together to obtain the hash address
• Two possibilities: shift folding and folding at the boundaries

Example (shift folding): k = 12320324111220
x1=123, x2=203, x3=241, x4=112, x5=20, address = 123+203+241+112+20 = 699
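
Shift folding as used in this example can be sketched as below. The key is passed as a string of decimal digits because 12320324111220 does not fit in a 32-bit integer; the function name fold_shift and this string interface are assumptions for illustration.

```c
/* Shift folding: split the decimal digit string into parts of `len`
   digits from the left (the last part may be shorter) and add the parts. */
unsigned long fold_shift(const char *digits, int len)
{
    unsigned long sum = 0, part = 0;
    int count = 0;

    for (; *digits != '\0'; digits++) {
        if (*digits < '0' || *digits > '9')
            continue;                       /* ignore non-digit characters */
        part = part * 10 + (unsigned long)(*digits - '0');
        if (++count == len) {               /* a full part has been read */
            sum += part;
            part = 0;
            count = 0;
        }
    }
    return sum + part;                      /* add the shorter last part, if any */
}
```

In practice the sum would still be reduced to the table range, for example with the division method.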

Digit Analysis

• Useful in the case of a static file where all the keys in the table are known in advance
• Each key is interpreted as a number using some radix r.
• The same radix is used for all the keys in the table
• The digits of each key are examined using this radix
• Digits having the most skewed distributions are deleted.
• Enough digits are deleted so that the remaining digits are few enough to give an address in
the range of the hash table

Converting keys to integers

Two methods used for converting keys to integer are

◾ Converting each character to a unique integer and summing these unique integers.

◾ Shifting the integer corresponding to every other character by 8 bits and then summing it up
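
Both conversions can be sketched in C as follows, assuming the keys are ASCII strings. The function names string_to_int_add and string_to_int_shift are illustrative.

```c
/* Method 1: add the integer codes of all characters. */
unsigned int string_to_int_add(const char *key)
{
    unsigned int number = 0;
    while (*key)
        number += (unsigned char)*key++;
    return number;
}

/* Method 2: as above, but the code of every other character
   (the 2nd, 4th, ...) is shifted left by 8 bits before it is added. */
unsigned int string_to_int_shift(const char *key)
{
    unsigned int number = 0;
    while (*key) {
        number += (unsigned char)*key++;
        if (*key)
            number += ((unsigned int)(unsigned char)*key++) << 8;
    }
    return number;
}
```

With ASCII codes, "for" converts to 102 + 111 + 114 = 327 under the first method. The second method distinguishes keys such as "ab" and "ba", which the first method maps to the same integer.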

Over Flow Handling

Synonyms: A hash function h may map several different keys into the same bucket.
Two keys k1 and k2 are synonyms with respect to h
if h(k1) = h(k2).

An overflow occurs when the home bucket for a new dictionary pair is full at the time we wish to insert
this pair.

A collision occurs when the home bucket for the new pair is not empty at the time of insertion.

Two popular ways to handle overflows

• Open Addressing/Linear Probing
• Chaining

Open Addressing/Linear Probing

◾ When inserting a new pair whose key is k we search the hash table buckets in the order
ht[(h(k)+i) % b], 0 <= i <= b-1, where
   – h is the hash function and
   – b is the number of buckets
◾ The search terminates when we reach the first unfilled bucket and the new pair is inserted into
this bucket.
◾ In case no such bucket is found, the table is full and the size of the hash table needs to be
increased.
◾ For good performance the table size is increased when the loading density exceeds a prescribed
threshold such as 0.75, rather than when the table is full.
◾ When the hash table is resized
   – The hash function changes
   – The home bucket of each key may change


Example:

Suppose the table T has 11 memory locations T[1] ... T[11], and suppose the file F contains 8
records with the following hash addresses

Record   A  B  C  D   E  X   Y  Z
H(K)     4  8  2  11  4  11  5  1

Suppose these 8 records are entered into the hash table in the above order. The hash table will then look as
shown below.

Address  1  2  3  4  5  6  7  8  9  10  11
Table T  X  C  Z  A  E  Y  -  B  -  -   D

The average number S of probes for a successful search is S = (1+1+1+1+2+2+2+3)/8 = 13/8 ≈ 1.6

The average number U of probes for an unsuccessful search is

U = (7+6+5+4+3+2+1+2+1+1+8)/11 = 40/11 ≈ 3.6

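Example-2

Assume a 13-bucket table with 1 slot per bucket. Each identifier is converted to an integer x by adding the ASCII codes of its characters, and its hash address is x % 13.

Identifier  Additive Transform               x    Hash
for         102+111+114                      327  2
do          100+111                          211  3
while       119+104+105+108+101              537  4
if          105+102                          207  12
else        101+108+115+101                  425  9
function    102+117+110+99+116+105+111+110   870  12

Bucket      0         1  2    3   4      5  6  7  8  9     10  11  12
Identifier  function  -  for  do  while  -  -  -  -  else  -   -   if

function hashes to bucket 12, which is already occupied by if, so the probe wraps around and function is placed in bucket 0.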

Searching for a key using linear probing

◾ Compute h(k)
◾ Examine the hash table buckets in the order ht[(h(k)+i) % b], 0 <= i <= b-1, until one of the following
happens
   – The bucket ht[(h(k)+i) % b] contains the key k, and the desired pair is found
   – ht[(h(k)+i) % b] is empty; k is not in the table.
   – We return to ht[h(k)]; the table is full and k is not in the table
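
Insertion and search with linear probing can be sketched as below, using the 13-bucket table of Example-2 with the additive transform as the hash function. The names lp_insert and lp_search, the fixed key length MAX_KEY and the use of an empty string to mark an empty bucket are assumptions for this illustration.

```c
#include <string.h>

#define B 13          /* number of buckets, one slot per bucket */
#define MAX_KEY 16    /* maximum key length including the terminator */

static char ht[B][MAX_KEY];   /* an empty string marks an empty bucket */

/* Additive transform followed by division. */
static unsigned int h(const char *key)
{
    unsigned int sum = 0;
    while (*key)
        sum += (unsigned char)*key++;
    return sum % B;
}

/* Inserts key; returns its bucket, or -1 if the table is full
   or the key is too long to store. */
int lp_insert(const char *key)
{
    unsigned int home = h(key), i;
    if (strlen(key) >= MAX_KEY)
        return -1;
    for (i = 0; i < B; i++) {
        unsigned int b = (home + i) % B;
        if (ht[b][0] == '\0') {             /* first unfilled bucket */
            strcpy(ht[b], key);
            return (int)b;
        }
        if (strcmp(ht[b], key) == 0)        /* key already present */
            return (int)b;
    }
    return -1;                              /* back at the home bucket: full */
}

/* Returns the bucket holding key, or -1 if key is not in the table. */
int lp_search(const char *key)
{
    unsigned int home = h(key), i;
    for (i = 0; i < B; i++) {
        unsigned int b = (home + i) % B;
        if (ht[b][0] == '\0')               /* empty bucket: not in the table */
            return -1;
        if (strcmp(ht[b], key) == 0)
            return (int)b;
    }
    return -1;
}
```

Inserting for, do, while, if, else and function in that order reproduces the table of Example-2, with function wrapping around from bucket 12 to bucket 0.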

Drawbacks of Linear Probing

◾ Identifiers tend to cluster together
◾ Adjacent clusters tend to coalesce
◾ Both effects increase the search time

Example showing the drawback

Insert acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime into a 26-bucket hash table, using the first character of the key as the hash function.

Bucket  x       Buckets searched
0       acos    1
1       atoi    2
2       char    1
3       define  1
4       exp     1
5       ceil    4
6       cos     5
7       float   3
8       atol    9
9       floor   5
10      ctime   9
...
25

We see the number of buckets searched increasing and the keys clustering together.

Quadratic Probing
• Quadratic probing uses a quadratic function of i as the increment.
• Suppose a record R with key k has the hash address H(k)=h. Then instead of searching the
locations with addresses h, h+1, h+2, ..., we search the locations with addresses h, h+1, h+4, h+9, ..., h+i², ... (taken modulo m).
• If the number m of locations in the table T is a prime number, then the above sequence will
access half of the locations in T.

Double hashing
Here a second hash function H' is used for resolving a collision, as follows.
Suppose a record R with key k has the hash addresses H(k)=h and H'(k)=h'≠m. Then we linearly search
the locations with addresses h, h+h', h+2h', h+3h', ... (taken modulo m).
If m is a prime number then the above sequence will access all the locations in the table T.

Note: One major disadvantage of any type of open addressing procedure is in the
implementation of deletion.
Suppose a record R is deleted from location T[r]. If we later reach this empty location during a search, it
does not necessarily mean that the search is unsuccessful.
Thus when deleting a record the location should be labeled to indicate that it previously did contain a
record.

Chaining
• Maintain one list per bucket
• Each list contains the synonyms for that bucket.
• Search involves
   o Computing the hash address h(k), and
   o Examining the keys in the list for h(k)

Example: Insert acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime into a 26-bucket hash
table maintained as hash chains

[0] acos -> atoi -> atol
[1] NULL
[2] char -> ceil -> cos -> ctime
[3] define
[4] exp
[5] float -> floor
[6] NULL
.
.
.
[25] NULL
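
The chains above can be built with the following sketch. It assumes keys that begin with a lowercase letter and uses that letter as the hash address; the names chain_insert and chain_search are illustrative. New keys are appended at the end of a chain so that synonyms appear in insertion order, as in the example.

```c
#include <stdlib.h>
#include <string.h>

#define B 26

struct chain_node {
    const char *key;              /* pointer only: the caller keeps the string alive */
    struct chain_node *link;
};

static struct chain_node *ht[B];  /* ht[i] heads the chain of bucket i */

/* Hash address = position of the first letter (assumes 'a'..'z'). */
static int h(const char *key)
{
    return key[0] - 'a';
}

/* Appends key to the chain of its bucket. Returns 0 on success
   (or if already present), -1 on a bad key or allocation failure. */
int chain_insert(const char *key)
{
    struct chain_node **pp;
    struct chain_node *p;
    int b = h(key);

    if (b < 0 || b >= B)
        return -1;
    for (pp = &ht[b]; *pp != NULL; pp = &(*pp)->link)
        if (strcmp((*pp)->key, key) == 0)
            return 0;
    p = malloc(sizeof *p);
    if (p == NULL)
        return -1;
    p->key = key;
    p->link = NULL;
    *pp = p;                      /* link the new node at the end of the chain */
    return 0;
}

/* Returns the number of keys examined to find key, or 0 if it is absent. */
int chain_search(const char *key)
{
    struct chain_node *p;
    int b = h(key), examined = 0;

    if (b < 0 || b >= B)
        return 0;
    for (p = ht[b]; p != NULL; p = p->link) {
        examined++;
        if (strcmp(p->key, key) == 0)
            return examined;
    }
    return 0;
}
```

After the eleven insertions, finding ctime examines 4 keys (char, ceil, cos, ctime), compared with 9 buckets searched under linear probing in the earlier example.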

Rehashing: When the hash table becomes nearly full, the number of collisions increases, thereby
degrading the performance of insertion and search operations. In such cases, a better option is to create
a new hash table with size double of the original hash table.

All the entries in the original hash table will then have to be moved to the new hash table. This is done
by taking each entry, computing its new hash value, and then inserting it in the new hash table. Though
rehashing seems to be a simple process, it is quite expensive and must therefore not be done frequently.

Example:
Consider the hash table of size 5 given below. The hash function used is h(x)= x % 5.

Rehash the entries into a new hash table using the hash function h(x) = x % 10.

Dynamic hashing

Limitation of static hashing: when the table tends to be full, overflows increase and
performance is reduced.

To ensure good performance, it is necessary to increase the size of a hash table whenever the
loading density exceeds a prescribed threshold.

When the loading density exceeds the threshold, array doubling is used to increase the size of the array from b to
2b+1. This change in the divisor forces us to rebuild the hash table by reinserting every key from the smaller
table into the larger one. Dynamic hashing, also called extendible hashing, reduces the rebuild time.

There are two forms of dynamic hashing

• Dynamic hashing using directories
• Directoryless dynamic hashing

Example: Consider a hash function h that transforms keys into 6-bit non-negative integers. h(k,t) denotes the
integer formed by the t least significant bits of h(k).
The keys in this example are two characters long. h transforms the letters A, B and C into the bit sequences 100, 101 and
110 respectively, and the digits 0 through 7 into their 3-bit binary representations.

k     h(k)
A0    100 000
A1    100 001
B0    101 000
B1    101 001
C1    110 001
C2    110 010
C3    110 011
C5    110 101
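
The functions h(k) and h(k,t) of this example can be written down directly. In the sketch below the name h_t stands for h(k,t); keys are assumed to be valid two-character strings (a letter A to C followed by a digit 0 to 7) and no validation is done.

```c
/* h(k): the letter A, B, C gives the high bits 100, 101, 110 and the
   digit 0..7 gives the low 3 bits. Assumes a valid two-character key. */
unsigned int h(const char *k)
{
    unsigned int hi = 4u + (unsigned int)(k[0] - 'A');
    unsigned int lo = (unsigned int)(k[1] - '0');
    return (hi << 3) | lo;
}

/* h(k,t): the integer formed by the t least significant bits of h(k). */
unsigned int h_t(const char *k, int t)
{
    return h(k) & ((1u << t) - 1u);
}
```

For instance h("C5") is 110 101 = 53, h_t("C5", 2) is 01 = 1 and h_t("C5", 3) is 101 = 5, while A1 and B1 both give 001 for t = 3.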

Dynamic Hashing using directories

◾ A directory d of pointers to buckets is used.
◾ The number of bits of h(k) used to index the directory is called the directory depth.
◾ The size of the directory depends on the number of bits of h(k) used to index into the directory.
◾ Directory size d = 2^t, where t is the number of bits used (the directory depth).
◾ Initially t = 2 bits, so d = 2^2 = 4.
◾ h(k,t) denotes the integer formed by the t least significant bits of h(k).

Example: The figure below shows a dynamic hash table that contains the keys A0, B0, A1, B1, C2 and C3.
Here the directory depth is 2 and the buckets have 2 slots each. For each key k,
we examine the bucket pointed to by d[h(k,t)], where t is the directory depth. Suppose we insert C5
into the hash table. Since h(C5,2)=01 we follow the pointer d[01], and this bucket is full. To resolve
the overflow, we determine the least u such that h(k,u) is not the same for all keys in the overflowed bucket, including the new key (here u = 3). In case u is
greater than the directory depth we increase the directory depth to this least value u. The figure below
shows the table after inserting C5.

Advantages

• Only the directory doubles in size; the rest of the hash table remains the same
• Only the entries in the bucket that overflows need to be rehashed

Directoryless Dynamic Hashing

◾ Also known as linear dynamic hashing
◾ A directory is not used; instead an array ht of buckets is used.
◾ We assume that this array is as large as possible, so there is no possibility of increasing its
size dynamically.
◾ To avoid initializing such a large array, two variables r and q, 0 <= q < 2^r, are used to keep
track of the active buckets.
◾ At any time only the buckets 0 through 2^r + q - 1 are active.
◾ Each active bucket is the start of a chain of buckets.
◾ The remaining buckets in the chain are called overflow buckets.
◾ Each dictionary pair is either in an active bucket or in an overflow bucket.

The figure below shows a directoryless hash table ht with r=2 and q=0. The number of active buckets is
4. The index of the active bucket identifies its chain. Each active bucket has 2 slots.

r=2, q=0

When we insert C5 into the table, chain 01 is examined and we verify that C5 is not present. Since the
active bucket for the searched chain is full we get an overflow. An overflow is handled by activating
bucket 2^r + q, reallocating the entries in chain q, and then incrementing the value of q by 1. In case q
becomes 2^r, we increment r by 1 and reset q to 0. The reallocation is done using h(k,r+1). Finally the
new pair is inserted into the chain.

r=2, q=1

Inserting C1 again results in an overflow, this time in chain 01, so bucket 5 = 101 is activated. Rehashing is done
and the table is as shown below.

r=2, q=2

Files and their Organization

Data Hierarchy

Every file contains data which can be organized in a hierarchy to present a systematic organization.
The data hierarchy includes data items such as fields, records, files, and databases. These terms are
defined below.
• A data field is an elementary unit that stores a single fact. A data field is usually characterized
by its type and size. For example, student's name is a data field that stores the name of a student.
This field is of type character and its size can be set to a maximum of 20 or 30 characters
depending on the requirement.
• A record is a collection of related data fields which is seen as a single unit. For example, the
student's record may contain data fields such as name, address, phone number, roll number,
marks obtained, and so on.
• A file is a collection of related records. For example, if there are 60 students in a class, then
there are 60 records. All these related records are stored in a file.
• A directory stores information about related files. A directory organizes information so that users
can find it easily.

Consider the figure that shows how multiple related files are stored in a student directory.

File Attributes

Every file in a computer system is stored in a directory. Each file has a list of attributes associated
with it. These attributes give the operating system and the application software information about the file and
how it is to be used.

These attributes are

• File name: It is a string of characters that stores the name of a file. File naming conventions vary
from one operating system to another.
• File position: It is a pointer that points to the position at which the next read/write operation
will be performed.
• File structure: It indicates whether the file is a text file or a binary file. In a text file, the
numbers (integer or floating point) are stored as strings of characters. A binary file stores
numbers in the same way as they are represented in the main memory.
• File access method: It indicates whether the records in a file can be accessed sequentially or
randomly. In sequential access mode, records are read one by one. That is, if 60 records of
students are stored in the STUDENT file, then to read the record of the 39th student, you have to
go through the records of the first 38 students. In random access, however, records can be
accessed in any order.
• Attributes flag: A file can have six additional attributes attached to it. These attributes are
usually stored in a single byte, with each bit representing a specific attribute. If a particular bit
is set to '1' then the corresponding attribute is turned on. The table below shows
the list of attributes and their positions in the attribute flag or attribute byte. Note that a
directory is treated as a special file in the operating system, so all these attributes are
applicable to files as well as to directories.

Dept. of ISE, SVIT Page 21


18CS32 DSA Notes

Example: o Read-only A file marked as read-only cannot be deleted or modified For example, if
an attempt is made to either delete or modify a read-only file, then a message ‘access
denied’ is displayed on the screen.
o Hidden A file marked as hidden is not displayed in the directory listing.
o System A file marked as a system file indicates that it is an important file used by the
system and should not be altered or removed from the disk.
o Volume Label Every disk volume is assigned a label for identification. The label can
be assigned at the time of formatting the disk or later through various tools such as the
DOS command LABEL.
o Directory In directory listing, the files and sub-directories of the current directory are
differentiated by a directory-bit. This means that the files that have the directory-bit
turned on are actually sub-directories containing one or more files.
o Archive The archive bit is used as a communication link between programs that modify
files and those that are used for backing up files. Most backup programs allow the user
to do an incremental backup. Incremental backup selects only those files for backup
which have been modified since the last backup. When the backup program archives a file, it
clears the archive bit (sets it to zero). Subsequently, if any program modifies the file, it
turns on the archive bit (sets it to 1). Thus, whenever the backup program is run, it
checks the archive bit to know whether the file has been modified since its last run. The
backup program will archive only those files which were modified.
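
The attribute byte and the archive-bit protocol described above can be sketched as follows. The bit positions used here follow the DOS/FAT attribute byte and are an assumption, since the table of positions is not reproduced in these notes; the helper names (has_attr, set_attr, clear_attr, incremental_backup) are hypothetical.

```python
# Bit masks for the six attributes. Positions follow the DOS/FAT attribute
# byte (an assumption; the notes' own table is not reproduced here).
READ_ONLY = 0x01      # bit 0
HIDDEN = 0x02         # bit 1
SYSTEM = 0x04         # bit 2
VOLUME_LABEL = 0x08   # bit 3
DIRECTORY = 0x10      # bit 4
ARCHIVE = 0x20        # bit 5


def has_attr(flags, attr):
    """True if the given attribute bit is turned on."""
    return (flags & attr) != 0


def set_attr(flags, attr):
    """Turn the attribute bit on (set it to 1)."""
    return flags | attr


def clear_attr(flags, attr):
    """Turn the attribute bit off (set it to 0)."""
    return flags & ~attr


def incremental_backup(files):
    """Back up only files whose archive bit is set, then clear that bit."""
    backed_up = []
    for name, flags in files.items():
        if has_attr(flags, ARCHIVE):
            backed_up.append(name)
            files[name] = clear_attr(flags, ARCHIVE)
    return backed_up


files = {"a.txt": ARCHIVE | READ_ONLY, "b.txt": HIDDEN}
print(incremental_backup(files))  # ['a.txt']
print(incremental_backup(files))  # []
```

A program that modifies b.txt would call set_attr on its flags with ARCHIVE, so that the next backup run picks the file up again.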

Text and binary files

Text file
 Also known as a flat file or an ASCII file.
 It is structured as a sequence of lines of alphabets, numerals, special characters, etc.
However, the data in a text file, whether numeric or non-numeric, is stored using its
corresponding ASCII code.
 The end of a text file is often denoted by placing a special character, called an
end-of-file marker, after the last line in the text file.
Binary file
 Contains any type of data encoded in binary form for computer storage and processing
purposes.
 A binary file stores data in a format that is similar to the format in which the data is stored
in the main memory.
 A binary file is not readable by humans.
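
The difference between the two storage formats can be seen by encoding the same number both ways. This is a minimal sketch; the 4-byte little-endian layout chosen for the binary form is an assumption made for illustration.

```python
import struct

value = 123456789

# Text form: one ASCII character per digit.
as_text = str(value).encode("ascii")

# Binary form: a fixed-size 4-byte little-endian signed integer,
# similar to how the value is held in main memory.
as_binary = struct.pack("<i", value)

print(len(as_text))    # 9
print(len(as_binary))  # 4
print(struct.unpack("<i", as_binary)[0] == value)  # True
```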

Basic file Operations: The basic operations that can be performed on a file are given below

Creating a File- A file is created by specifying its name and mode. Then the file is opened for
writing records that are read from an input device. Once all the records have been written into the
file, the file is closed.

Updating a File : Updating a file means changing the contents of the file. A file can be updated in
the following ways:
 Inserting a new record in the file. For example, if a new student joins the course, we need to
add his record to the STUDENT file.
 Deleting an existing record. For example, if a student quits a course in the middle of the session,
his record has to be deleted from the STUDENT file.
 Modifying an existing record. For example, if the name of a student was spelt incorrectly, then
correcting the name will be a modification of the existing record.

Retrieving from a File- Retrieving means extracting useful data from a given file. Information can
be retrieved from a file either for an inquiry or for report generation. An inquiry retrieves a low
volume of data, while report generation may retrieve a large volume of data from the file.

Maintaining a File - It involves restructuring or re-organizing the file to improve the performance
of the programs that access this file. Restructuring a file keeps the file organization unchanged and
changes only the structural aspects of the file.

FILE ORGANIZATION

A file is a collection of related records. The main issue in file management is the way in which
the records are organized inside the file. The following considerations should be kept in mind
before selecting an appropriate file organization method:
 Rapid access to one or more records
 Ease of inserting/updating/deleting one or more records without disrupting the speed
of accessing record(s).
 Efficient storage of records
 Using redundancy to ensure data integrity

There are three different ways in which a file can be organized.

Sequential File Organization

 Records are written in the order in which they are entered
 Records are read and written sequentially
 Deleting or updating one or more records calls for replacing the original file with a new
file that contains the desired changes
 Records have the same size and the same field format
 Records are sorted on a key value
 Generally used for report generation or sequential reading

Advantages
 Simple and easy to handle
 No extra overheads involved
 Sequential files can be stored on magnetic disks as well as magnetic tapes
 Well suited for batch-oriented applications

Disadvantages
 Records can be read only sequentially. If the ith record has to be read, then the i–1 records
before it must be read first
 Does not support in-place updates. A new file has to be created and the original file has to
be replaced with the new file that contains the desired changes
 Cannot be used for interactive applications
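
The two limitations listed above can be sketched with a small line-per-record file. The file name, the record layout and the helper names (read_ith_record, update_record) are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def read_ith_record(path, i):
    """Return the i-th record (1-based); the i-1 records before it are read first."""
    with open(path) as f:
        for position, line in enumerate(f, start=1):
            if position == i:
                return line.rstrip("\n")
    return None


def update_record(path, i, new_record):
    """Update by writing a new file and replacing the original file with it."""
    new_path = path + ".new"
    with open(path) as old, open(new_path, "w") as new:
        for position, line in enumerate(old, start=1):
            new.write(new_record + "\n" if position == i else line)
    os.replace(new_path, path)


path = os.path.join(tempfile.mkdtemp(), "STUDENT.txt")
with open(path, "w") as f:
    f.write("1,Asha\n2,Ravi\n3,Meena\n")

update_record(path, 2, "2,Ravee")
print(read_ith_record(path, 2))  # 2,Ravee
```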

Relative File Organization

Features

 Provides an effective way to access individual records


 The record number represents the location of the record relative to the beginning of the file.
 Records in a relative file are of fixed length
 Relative files can be used for both random as well as sequential access
 Every location in the table either stores a record or is marked as FREE

Address of ith record = base_address + (i–1) * record_length

If the base address of a file is 1000 and each record occupies 20 bytes, then the address of the
5th record is given by:
1000 + (5–1) * 20
= 1000 + 80
= 1080
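
The address formula translates directly into a seek on a file of fixed-length records. The following is a small sketch using an in-memory file; the record contents and the helper names (record_address, read_record) are hypothetical.

```python
import io

RECORD_LENGTH = 20  # bytes per record, as in the example above


def record_address(base_address, i, record_length=RECORD_LENGTH):
    """Address of the i-th record = base_address + (i - 1) * record_length."""
    return base_address + (i - 1) * record_length


def read_record(f, i, record_length=RECORD_LENGTH):
    """Jump directly to the i-th record of a file whose records start at offset 0."""
    f.seek(record_address(0, i, record_length))
    return f.read(record_length)


# A tiny in-memory "file" of five fixed-length records, padded with spaces.
data = b"".join(
    f"student-{n}".encode("ascii").ljust(RECORD_LENGTH) for n in range(1, 6)
)
f = io.BytesIO(data)

print(record_address(1000, 5))    # 1080
print(read_record(f, 5).strip())  # b'student-5'
```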

Advantages
 Ease of processing
 If the relative record number of the record that has to be accessed is known, then the record
can be accessed instantaneously
 Random access of records makes access to relative files fast
 Allows deletions and updates in the same file
 Provides random as well as sequential access of records with low overhead
 New records can be easily added in the free locations based on the relative record number of
the record to be inserted
 Well suited for interactive applications

Disadvantages
 Use of relative files is restricted to disk devices
 Records can be of fixed length only
 For random access of records, the relative record number must be known in advance

Indexed Sequential File Organization


 Provides fast data retrieval
 Records are of fixed length
 Index table stores the address of the records in the file
 The ith entry in the index table points to the ith record of the file
 While the index table is read sequentially to find the address of the desired record, a direct
access is made to the address of the specified record in order to access it randomly
 Indexed sequential files perform well in situations where sequential access as well as random
access is made to the data
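
A minimal sketch of this organization, assuming fixed-length records held in an in-memory file: the index table is scanned sequentially to find the address, and the record itself is then read with a direct seek. The record layout and the names build_file and lookup are hypothetical.

```python
import io

RECORD_LENGTH = 16  # assumed fixed record length in bytes


def build_file(records):
    """Write fixed-length records in key order; return (file, index table)."""
    f = io.BytesIO()
    index = []  # the i-th entry is (key, address of the i-th record)
    for key, name in sorted(records):
        index.append((key, f.tell()))
        f.write(f"{key}:{name}".encode("ascii").ljust(RECORD_LENGTH))
    return f, index


def lookup(f, index, key):
    """Scan the small index sequentially, then access the record directly."""
    for k, address in index:
        if k == key:
            f.seek(address)
            return f.read(RECORD_LENGTH).decode("ascii").rstrip()
    return None


f, index = build_file([(30, "Meena"), (10, "Asha"), (20, "Ravi")])
print(index)                 # [(10, 0), (20, 16), (30, 32)]
print(lookup(f, index, 20))  # 20:Ravi
```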

Advantages
 The key improvement is that the indices are small and can be searched quickly, allowing the
database to access only the records it needs
 Supports applications that require both batch and interactive processing
 Records can be accessed sequentially as well as randomly
 Updates the records in the same file

Disadvantages
 Indexed sequential files can be stored only on disks
 Needs extra space and overhead to store indices
 Handling these files is more complicated than handling sequential files
 Supports only fixed length records

Indexing

An index for a file can be compared with a catalogue in a library.
Indexed sequential files are very efficient to use.
There are two kinds of indices:
 Ordered indices that are sorted based on one or more key values
 Hash indices that are based on the values generated by applying a hash function

Ordered Indices

Indices are used to provide fast random access to records. A file may have multiple indices
based on different key fields.

An index of a file may be a

◾ Primary index: the index whose search key specifies the sequential order of the file.
Example: roll number.

◾ Secondary index: the index whose search key specifies an order different from the sequential
order of the file. Example: name field. Secondary indices are used to improve the performance of
queries.

Dense and Sparse Indices

 In a dense index the index table stores the address of every record in the file. However, in a
sparse index, the index table stores the address of only some of the records in the file. Although
sparse indices are easy to fit in the main memory, a dense index would be more efficient to use
than a sparse index if it fits in the memory.
 The records need not be stored in consecutive memory locations. The pointer to the next record
stores the address of the next record.
 By looking at the dense index, it can be concluded directly whether the record exists in the file
or not. In a sparse index, to locate a record, we first find the entry in the index table with the
largest search key value that is less than or equal to the search key value of the desired
record. Then, we start at the record pointed to by that entry and proceed along the sequential
pointers in the file until the desired record is found. For example, if we need to access record
number 40 and the largest key in the index that does not exceed 40 is 30, then we jump to the
record pointed to by the entry for 30 and move along the sequential pointers to reach record
number 40. Thus, a sparse index takes more time to find a record with a given key.
 Dense indices are faster to use, while sparse indices require less space and impose less
maintenance overhead for insertions and deletions.
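
The two lookup procedures can be compared in a short sketch. The keys 10, 20, ..., 100 and the choice of indexing every second record in the sparse index are assumptions made so that the record-40 example above can be reproduced; list positions stand in for disk addresses.

```python
from bisect import bisect_right

# Records sorted by key; the list position stands in for a disk address.
records = [(k, f"record-{k}") for k in range(10, 101, 10)]

# Dense index: one entry for every record.
dense_index = {key: pos for pos, (key, _) in enumerate(records)}

# Sparse index: one entry for every second record (keys 10, 30, 50, 70, 90).
sparse_index = [(key, pos) for pos, (key, _) in enumerate(records)][::2]


def dense_lookup(key):
    pos = dense_index.get(key)  # absence is known from the index alone
    return None if pos is None else records[pos][1]


def sparse_lookup(key):
    keys = [k for k, _ in sparse_index]
    i = bisect_right(keys, key) - 1  # largest indexed key <= search key
    if i < 0:
        return None
    pos = sparse_index[i][1]
    # Follow the file sequentially from the indexed record.
    while pos < len(records) and records[pos][0] <= key:
        if records[pos][0] == key:
            return records[pos][1]
        pos += 1
    return None


print(dense_lookup(40))   # record-40
print(sparse_lookup(40))  # record-40 (reached via the entry for key 30)
print(sparse_lookup(45))  # None
```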

Cylinder Surface Indexing

Cylinder surface indexing is a very simple technique used only for the primary key index of a
sequentially ordered file. In a sequentially ordered file, the records are stored sequentially in
increasing order of the primary key.
The index file contains two kinds of fields: a cylinder index and several surface indices. Generally,
there are multiple cylinders, and each cylinder has multiple surfaces. If the file needs m cylinders
for storage, then the cylinder index will contain m entries.

The physical and logical organization of disk is shown in Figure below.

When a record with a particular key value has to be searched, the following steps are performed:
 First the cylinder index of the file is read into memory.
 Second, the cylinder index is searched to determine which cylinder holds the desired record.
 After the cylinder index is searched, appropriate cylinder is determined.
 Depending on the cylinder, the surface index corresponding to the cylinder is then retrieved
from the disk.
 Once the cylinder and the surface are determined, the corresponding track is read and
searched for the record with the desired key.

Hence, the total number of disk accesses is three: first, for accessing the cylinder index; second,
for accessing the surface index; and third, for getting the track address. However, if track sizes
are very large, then we can also include sector addresses, but this would add an extra level of
indexing.
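
The search steps and the count of three accesses can be sketched as below. The layout is hypothetical: it assumes that each cylinder index entry holds the largest key stored in that cylinder and that each surface index entry holds the largest key on that surface's track, which the notes do not state explicitly.

```python
from bisect import bisect_left

# Hypothetical layout: 2 cylinders with 2 surfaces each; a track is a sorted key list.
tracks = {
    (0, 0): [5, 10, 15],
    (0, 1): [20, 25, 30],
    (1, 0): [35, 40, 45],
    (1, 1): [50, 55, 60],
}
cylinder_index = [30, 60]                    # largest key in each cylinder (m = 2)
surface_index = {0: [15, 30], 1: [45, 60]}   # largest key on each surface


def search(key):
    """Return (cylinder, surface, found, disk_accesses)."""
    accesses = 1                              # first access: the cylinder index
    c = bisect_left(cylinder_index, key)
    if c == len(cylinder_index):              # key is larger than every stored key
        return None, None, False, accesses
    accesses += 1                             # second access: that cylinder's surface index
    s = bisect_left(surface_index[c], key)
    accesses += 1                             # third access: fetch the track and search it
    return c, s, key in tracks[(c, s)], accesses


print(search(40))  # (1, 0, True, 3)
```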

The cylinder surface indexing method of maintaining a file and index is referred to as the Indexed
Sequential Access Method (ISAM). This technique is the most popular and simplest file organization
in use for single key values.

Multi-level Indices
Useful for very large files that may contain millions of records.

Consider a file that has 10,000 records. If we use simple indexing, then we need an index table that
can contain at least 10,000 entries to point to 10,000 records. If each entry in the index table
occupies 4 bytes, then we need an index table of 4 * 10000 bytes = 40000 bytes. Finding such a large
block of consecutive space is not always easy. So, a better scheme is to index the index table.

Figure below shows a two-level multi-indexing.

We can continue further by having a three-level indexing and so on. But practically, we use two-level
indexing. Note that two and higher-level indexing must always be sparse, otherwise multi-level
indexing will lose its effectiveness.
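
The arithmetic above and a two-level lookup can be sketched as follows. The block size of 100 entries and the use of consecutive integer keys are assumptions made for illustration; the outer index is sparse and stores only the first key of each inner-index block.

```python
from bisect import bisect_right

ENTRY_SIZE = 4        # bytes per index entry, as in the text
BLOCK_ENTRIES = 100   # assumed number of entries per inner-index block

keys = list(range(1, 10_001))  # 10,000 record keys
inner = [keys[i:i + BLOCK_ENTRIES] for i in range(0, len(keys), BLOCK_ENTRIES)]
outer = [block[0] for block in inner]  # sparse: first key of each block


def find(key):
    """Return (block number, slot) of key in the inner index, or None."""
    b = bisect_right(outer, key) - 1   # search the small outer index first
    if b < 0:
        return None
    block = inner[b]
    slot = bisect_right(block, key) - 1
    return (b, slot) if slot >= 0 and block[slot] == key else None


print(len(keys) * ENTRY_SIZE)   # 40000 bytes for the single-level index
print(len(outer) * ENTRY_SIZE)  # 400 bytes for the outer index
print(find(4321))               # (43, 20)
```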

Inverted indices

◾ Inverted files are commonly used in document retrieval systems for large textual databases.
◾ Reorganizes the structure of an existing data file in order to provide fast access to all records
having one field falling within the set limits.
◾ For example, inverted files are widely used by bibliographic databases that may store author
names, title words, journal names, etc.
◾ Thus, for each keyword, an inverted file contains an inverted list that stores a list of pointers
to all occurrences of that term in the main text.
◾ There are two main variants of inverted indices:
 A record-level inverted index (also known as inverted file index or inverted file)
stores a list of references to documents for each word.
 A word-level inverted index (also known as full inverted index or inverted list), in
addition to a list of references to documents for each word, also contains the positions
of each word within a document.
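
Both variants can be built in a few lines. The three sample documents are hypothetical and serve only to show the difference between the two structures.

```python
from collections import defaultdict

docs = {
    1: "graphs and trees",
    2: "hashing and indexing",
    3: "trees and hashing and trees",
}

record_level = defaultdict(set)   # word -> set of document ids
word_level = defaultdict(list)    # word -> list of (document id, position)

for doc_id, text in docs.items():
    for position, word in enumerate(text.split()):
        record_level[word].add(doc_id)
        word_level[word].append((doc_id, position))

print(sorted(record_level["trees"]))  # [1, 3]
print(word_level["trees"])            # [(1, 2), (3, 0), (3, 4)]
```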

B-Tree Indices

 A database is defined as a collection of data organized in a fashion that facilitates updating,
retrieving, and managing the data.
 Most organizations maintain databases for their business operations. For example, a
university maintains a database of all its students. These real-world databases may contain
millions of records that may occupy gigabytes of storage space.
 For a database to be useful, it must support fast retrieval and storage of data.
 Since it is impractical to maintain the entire database in the memory, B-trees are used to index
the data in order to provide fast access.
 If the database is indexed with a B-tree, the search operation will run in O(log n) time.

The majority of database management systems use the B-tree index technique as the default indexing
method. This technique supersedes other techniques of creating indices, mainly due to
its data retrieval speed, ease of maintenance, and simplicity.

Figure below shows a B-tree index.

It forms a tree structure with the root at the top. The index consists of a B-tree (balanced tree) structure
based on the values of the indexed column. In this example, the indexed column is name and the B-
tree is created using all the existing names that are the values of the indexed column.

The upper blocks of the tree contain index data pointing to the next lower block, thus forming a
hierarchical structure. The lowest level blocks, also known as leaf blocks, contain pointers to the data
rows stored in the table.

The B-tree structure has the following advantages:
 Since the leaf nodes of a B-tree are at the same depth, retrieval of any record from anywhere
in the index takes approximately the same time.
 B-trees improve the performance of a wide range of queries that search either for an exact
match or for values within a specified range.
 B-trees provide fast and efficient algorithms to insert, update, and delete records that maintain
the key order.
 B-trees perform well for small as well as large tables. Their performance degrades only slowly as
the size of a table grows, because the height of the tree grows logarithmically.
 B-trees optimize costly disk access.
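
The search procedure on a B-tree can be sketched as below. This is a search-only illustration on a hand-built tree in which every block stores keys; insertion and block splitting are omitted, and the Node class, the search function and the sample names are hypothetical. The number of blocks visited is at most the height of the tree, which is what gives the O(log n) behaviour.

```python
from bisect import bisect_left


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted keys stored in this block
        self.children = children or []   # empty list for a leaf block


def search(node, key):
    """Return the number of blocks visited if key is found, else None."""
    visited = 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return visited
        # Descend into the child between keys[i-1] and keys[i], if any.
        node = node.children[i] if node.children else None
    return None


# Hand-built tree on a name column; all leaves are at the same depth.
root = Node(["Kiran", "Ravi"], [
    Node(["Asha", "Deepa"]),
    Node(["Meena", "Nitin"]),
    Node(["Sunil", "Varun"]),
])

print(search(root, "Meena"))  # 2
print(search(root, "Zara"))   # None
```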

Hashed Indices

◾ Hashing is used to compute the address of a record by using a hash function on the search key
value.
◾ If two search key values hash to the same address, a collision occurs, and a collision-resolution
scheme is applied to generate a new address.
◾ Choosing a good hash function is critical to the success of this technique.
◾ By a good hash function, we mean two things.
 First, a good hash function, irrespective of the number of search keys, gives an
average-case lookup time that is a small constant.
 Second, the function distributes records uniformly and randomly among the buckets.

The drawbacks of using hashed indices include:

 Though the number of buckets is fixed, the number of files may grow with time.
 If the number of buckets is too large, storage space is wasted.
 If the number of buckets is too small, there may be too many collisions.

◾ The following operations can be performed on a hashed file organization.

a. Insertion- To insert a record that has ki as its search value, use the hash function h(ki)
to compute the address of the bucket for that record. If the bucket is free, store the
record; otherwise, use chaining to store the record.
b. Searching- To search a record having the key value ki, use h(ki) to compute the
address of the bucket where the record is stored. The bucket may contain one or several
records, so check for every record to retrieve the desired record with the given key
value.
c. Deletion- To delete a record with key value ki, use h(ki) to compute the address of
the bucket where the record is stored. The bucket may contain one or several records,
so check every record in the bucket and then delete the matching record.
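
The three operations can be sketched with chained buckets. The number of buckets and the division-method hash function h are assumptions chosen for illustration; each bucket is modelled as a list of (key, record) pairs.

```python
NUM_BUCKETS = 5  # fixed number of buckets (an assumed value)


def h(key):
    """Simple division-method hash, chosen only for illustration."""
    return key % NUM_BUCKETS


buckets = [[] for _ in range(NUM_BUCKETS)]  # each bucket is a chain of (key, record)


def insert(key, record):
    buckets[h(key)].append((key, record))   # a colliding record joins the chain


def search(key):
    for k, record in buckets[h(key)]:       # check every record in the bucket
        if k == key:
            return record
    return None


def delete(key):
    chain = buckets[h(key)]
    for i, (k, _) in enumerate(chain):
        if k == key:
            del chain[i]
            return True
    return False


insert(12, "Asha")
insert(17, "Ravi")   # 17 % 5 == 12 % 5 == 2, so this record is chained
print(search(17))    # Ravi
```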
