DS Module-5 Notes
MODULE - 5
Introduction
The first recorded use of graphs dates back to 1736, when Leonhard Euler used them
to solve the classical Königsberg bridge problem.
Definitions
Example:
Undirected Graph: In an undirected graph the pair of vertices representing an edge is unordered.
Thus the pairs (u,v) and (v,u) represent the same edge.
Example:
V(G)={a,b,c,d}
E(G)={(a,b),(a,d),(b,d),(b,c)}
Directed Graph (digraph): In a directed graph each edge is represented by a directed pair <u,v>, where
u is the tail and v is the head of the edge. Therefore <u,v> and <v,u> represent two different edges.
Example:
V(G)={a,b,d}
Self Edges/Self Loops: Edges of the form (v,v) are called self edges or self loops. A self loop is an
edge which starts and ends at the same vertex.
Example:
Multigraph: A graph with multiple occurrences of the same edge is called a multigraph.
Example:
Complete Graph: An undirected graph with n vertices and exactly n(n-1)/2 edges is said to be a
complete graph. In a complete graph every pair of distinct vertices is connected by an edge.
Example : A complete graph with n=3 vertices
Adjacent Vertex
If (u,v) is an edge in E(G), then we say that the vertices u and v are adjacent and the edge(u,v) is
incident on vertices u and v.
Path: A path from vertex u to vertex v in graph G is a sequence of vertices u,i1,i2,...,ik,v such that
(u,i1),(i1,i2),...,(ik,v) are edges in E(G). If G’ is directed then the path consists of the edges
<u,i1>,<i1,i2>,...,<ik,v> in E(G’).
Example:
Cycle: A cycle is a simple path in which the first and the last vertices are the same; all the other
vertices are distinct.
Example :
(B,C),(C,D),(D,E),(E,A),(A,B) is a cycle
Degree of a vertex: In an undirected graph the degree of a vertex is the number of edges incident on
that vertex.
In a directed graph the in-degree of a vertex v is the number of edges for which v is the head, i.e. the
number of edges coming into the vertex. The out-degree is the number of edges for
which v is the tail, i.e. the number of edges going out of the vertex.
Subgraph: A subgraph of G is a graph G’ such that V(G’) ⊆ V(G) and E(G’) ⊆ E(G).
Example :
Graph(G) Subgraph(G’)
Connected Graph: An undirected graph G is said to be connected if for every pair of distinct
vertices u and v in V(G) there is a path from u to v in G.
Strongly connected graph: A directed graph G is said to be strongly connected if for every pair of
distinct vertices u and v in V(G), there is a directed path from u to v and from v to u.
ADT Graph
Objects: a nonempty set of vertices and a set of undirected edges, where each edge is a pair of
vertices.
Graph Create() := return an empty graph
Graph InsertVertex(graph,v) := return a graph with v inserted; v has no incident edges
Graph InsertEdge(graph,v1,v2) := return a graph with a new edge between v1 and v2
Graph DeleteVertex(graph,v) := return a graph in which v and all edges incident to it are
removed
Graph DeleteEdge(graph,v1,v2) := return a graph in which the edge (v1,v2) is removed;
leave the incident vertices in the graph
Boolean IsEmpty(graph) := if (graph == empty graph) return TRUE else return
FALSE
List Adjacent(graph,v) := return a list of all vertices that are adjacent to v
Graph Representation
Adjacency Matrix
Adjacency List
Adjacency Multilist
Adjacency Matrix: Let G=(V,E) be a graph with n vertices, n>=1. The adjacency matrix of G is a
two-dimensional n*n array, say a, with the property that a[i][j]=1 if the edge (i,j)
(<i,j> for a directed graph) is in E(G), and a[i][j]=0 if there is no such edge in G.
Example:
Adjacency Matrix
    0 1 2 3
0   0 1 1 1
1   1 0 1 1
2   1 1 0 1
3   1 1 1 0
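The adjacency matrix definition can be sketched in C as follows. This is only an illustrative sketch: the names MAX_VERTICES, add_edge, degree and build_example are assumptions made for this example and do not come from the notes.

```c
#define MAX_VERTICES 4   /* assumed size: the example graph has 4 vertices */

int a[MAX_VERTICES][MAX_VERTICES];   /* adjacency matrix, all zeros initially */

/* Record the undirected edge (i,j): the matrix is symmetric. */
void add_edge(int i, int j)
{
    a[i][j] = 1;
    a[j][i] = 1;
}

/* Degree of vertex i = number of 1s in row i. */
int degree(int i)
{
    int j, d = 0;
    for (j = 0; j < MAX_VERTICES; j++)
        d += a[i][j];
    return d;
}

/* Build the example graph, in which every pair of distinct vertices is joined. */
void build_example(void)
{
    int i, j;
    for (i = 0; i < MAX_VERTICES; i++)
        for (j = i + 1; j < MAX_VERTICES; j++)
            add_edge(i, j);
}
```

After build_example() the array a holds the matrix of the example, and degree(i) scans one row, which is why finding the neighbours of a vertex costs O(n) in this representation.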
Adjacency list: In the adjacency list representation the n rows of the adjacency matrix are represented as n chains.
There is one chain for each vertex in G. The nodes in chain i represent the vertices that are adjacent
from vertex i. The data field of a chain node stores the index of an adjacent vertex.
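Example: (each chain node has the fields data and link; 0 marks the end of a chain)
adjLists
[0] -> 1 -> 2 -> 3 -> 0
[1] -> 0 -> 2 -> 3 -> 0
[2] -> 0 -> 1 -> 3 -> 0
[3] -> 0 -> 1 -> 2 -> 0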
For an undirected graph with n vertices and e edges, the linked adjacency lists representation
requires an array of size n and 2e chain nodes.
The degree of any vertex in an undirected graph may be determined by counting the number
of nodes in its adjacency list.
For a digraph the number of list nodes is only e.
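A minimal C sketch of the linked adjacency lists is given below. The helper names add_arc, add_edge and degree are assumptions made for this illustration; the node layout (vertex, link) follows the struct used in the traversal code of these notes.

```c
#include <stdlib.h>

#define MAX_VERTICES 4   /* assumed size for the illustration */

struct node {
    int vertex;           /* index of an adjacent vertex */
    struct node *link;    /* next node in the chain */
};

struct node *graph[MAX_VERTICES];   /* graph[i] heads the chain for vertex i */

/* Push a node for w onto the front of the chain of v. Returns 0 on success. */
static int add_arc(int v, int w)
{
    struct node *p = malloc(sizeof *p);
    if (p == NULL)
        return -1;
    p->vertex = w;
    p->link = graph[v];
    graph[v] = p;
    return 0;
}

/* An undirected edge (v,w) needs two chain nodes, one on each chain. */
int add_edge(int v, int w)
{
    if (add_arc(v, w) != 0)
        return -1;
    return add_arc(w, v);
}

/* Degree of v = number of nodes on its chain. */
int degree(int v)
{
    int d = 0;
    struct node *w;
    for (w = graph[v]; w != NULL; w = w->link)
        d++;
    return d;
}
```

Each call to add_edge allocates two nodes, which is where the 2e chain nodes for an undirected graph come from. For a digraph only add_arc would be called, giving e nodes.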
Adjacency Multilists: For each edge there is exactly one node, but this node is in two
lists (i.e., the adjacency lists of the two vertices on which the edge is incident). A new field is necessary
to mark whether the edge has been examined.
adjLists [0], [1], [2], [3] hold the head pointer of the list for each vertex. In the table below, 0 in a link field denotes a null link.
Node  vertex1  vertex2  link1  link2  edge
N0    0        1        N1     N3     edge(0,1)
N1    0        2        N2     N3     edge(0,2)
N2    0        3        0      N4     edge(0,3)
N3    1        2        N4     N5     edge(1,2)
N4    1        3        0      N5     edge(1,3)
N5    2        3        0      0      edge(2,3)
Weighted Edges: In many applications the edges of a graph have weights assigned to them. These
weights may represent the distance from one vertex to another or the cost of going from one vertex
to an adjacent vertex. The adjacency matrix and the adjacency lists then also store the weight information. A graph
with weighted edges is also called a network.
Example:
Given an undirected graph G=(V,E) and a vertex v in V(G), there are two ways to find all the
vertices that are reachable from v, i.e. connected to v: depth first search (dfs) and breadth first search (bfs).
A global array visited is maintained. It is initialized to FALSE, and when we visit a vertex i we change
visited[i] to TRUE.
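Global Declarations
#define FALSE 0
#define TRUE 1
/* graph[i] points to the adjacency list of vertex i; each list node has the fields vertex and link */
short int visited[MAX_VERTICES];
void dfs(int v)
{
    /* depth first search of a graph beginning at vertex v */
    struct node *w;
    visited[v]=TRUE;
    printf("%d ",v);
    w=graph[v];
    while(w!=NULL)
    {
        if(visited[w->vertex]==FALSE)
            dfs(w->vertex);
        w=w->link;
    }
}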
Analysis
If we represent G by its adjacency lists then we can determine the vertices adjacent to v by
following a chain of links. Since dfs examines each node in the adjacency lists at most once,
the time to complete the search is O(e).
If we represent G by its adjacency matrix then determining all vertices adjacent to v requires
O(n) time. Since we visit at most n vertices the total time is O(n^2).
Example: For the graph given below, if the search is initiated from vertex 0, then the vertices are
visited in the order vertex 3, 1, 2.
struct node
{
    int vertex;
    struct node *link;
};
typedef struct node queue;
queue *front, *rear;
short int visited[MAX_VERTICES];
void addq(int);
int deleteq(void);
void bfs(int v)
{
    /* breadth first search of a graph beginning at vertex v */
    struct node *w;
    front=rear=NULL;
    printf("%d ",v);
    visited[v]=TRUE;
    addq(v);
    while(front)
    {
        v=deleteq();
        w=graph[v];   /* start of the adjacency list of v */
        while(w!=NULL)
        {
            if(visited[w->vertex]==FALSE)
            {
                printf("%d ",w->vertex);
                addq(w->vertex);
                visited[w->vertex]=TRUE;
            }
            w=w->link;
        }
    }
}
Analysis of BFS:
Since each vertex is placed on the queue exactly once, the outer while loop is iterated at most n
times.
For the adjacency list representation the inner loop has a total cost of O(e). For the adjacency
matrix representation the inner loop takes O(n) time for each vertex.
Therefore the total time for the adjacency matrix representation is O(n^2).
Insertion Sort: The basic step is to insert a new record into a sorted sequence of i records in such a
way that the resulting sequence of size i+1 is also ordered. The function insert implements this
insertion.
void insert(element e, element a[], int i)
{
    /* Insert e into the ordered list a[1:i] such that the resulting list a[1:i+1] is also ordered; the array
       a must have space allocated for at least i+2 elements */
    a[0]=e;   /* a[0] acts as a sentinel that stops the loop */
    while(e.key<a[i].key)
    {
        a[i+1]=a[i];
        i--;
    }
    a[i+1]=e;
}
Example:
Analysis: In the worst case insert(e,a,i) makes i+1 comparisons before making the insertion, hence its
complexity is O(i). Function insertion sort invokes the insert function n-1 times, so the worst-case time
complexity of insertion sort is O(n^2).
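The complete sort can be sketched in C as follows. This is an illustrative variant, not the function from the notes: it sorts plain int values, uses 0-based indexing, and replaces the sentinel a[0] with an explicit bounds check. The name insertion_sort is an assumption.

```c
/* Sort a[0..n-1] in nondecreasing order. */
void insertion_sort(int a[], int n)
{
    int j;
    for (j = 1; j < n; j++) {
        int e = a[j];      /* record to insert into the sorted prefix a[0..j-1] */
        int i = j - 1;
        while (i >= 0 && e < a[i]) {
            a[i + 1] = a[i];   /* shift larger records one place to the right */
            i--;
        }
        a[i + 1] = e;
    }
}
```

The outer loop performs the n-1 insertions counted in the analysis, and the inner loop is the shifting step of insert.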
Radix Sort: Radix sort is the method used to alphabetise a large list of names. (Here the radix is
26, the 26 letters of the alphabet.) The list of names is first sorted according to the first
letter of each name. That is, the names are arranged into 26 classes, where the first class consists of
those names that begin with the letter “A”, the second class consists of those that begin with “B”, and so on. During
the second pass each class is alphabetised according to the second letter, and so on. If the
maximum length of a name is n, the names are alphabetised in at most n passes.
Suppose a list of n items A1,A2, ... An is given. Let d denote the radix (for example d=10 for decimal
digits, d=26 for letters, d=2 for bits). Suppose each item is made of s digits. The radix sort algorithm will
require s passes.
Example: The example below shows how 3 digit numbers can be sorted using reverse digit sort. First
the numbers are sorted according to the units digit. On the second pass, the numbers are sorted
according to the tens digit, and on the third pass they are sorted according to the hundreds digit.
Pass-1 (sorted on the units digit)
Input   Bucket
348     8
143     3
361     1
423     3
538     8
128     8
321     1
543     3
366     6
Pass-2 (sorted on the tens digit)
Input   Bucket
361     6
321     2
143     4
423     2
543     4
366     6
348     4
538     3
128     2
Pass-3 (sorted on the hundreds digit)
Input   Bucket
321     3
423     4
128     1
538     5
143     1
543     5
348     3
361     3
366     3
Dept. of ISE, SVIT Page 10
18CS32 DSA Notes
n = number of elements
d = radix
s = number of digits in each element
Then the number of comparisons C(n) <= d*s*n
d is independent of n, but s depends on n. Therefore:
Worst case (s = n): time complexity O(n^2)
Best case (s = log_d n): time complexity O(n log n)
1. Radix sort works well when s (the number of digits in the representation) is small.
2. We may need d*n memory locations. This may be minimized by using linked lists.
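The passes can be sketched in C as shown below. This is a sketch under stated assumptions rather than the method of the notes: instead of d separate buckets (or linked lists) it uses a counting pass per digit, which places the keys in the same order while needing only one scratch array. The names radix_sort, RADIX and tmp are illustrative.

```c
#include <string.h>

#define RADIX 10   /* d = 10 for decimal digits */

/* Sort n non-negative integers of at most s decimal digits.
   tmp must have room for n elements. */
void radix_sort(int a[], int tmp[], int n, int s)
{
    int pass, i, divisor = 1;
    for (pass = 0; pass < s; pass++, divisor *= RADIX) {
        int count[RADIX] = {0};
        /* count how many keys fall in each bucket for this digit */
        for (i = 0; i < n; i++)
            count[(a[i] / divisor) % RADIX]++;
        /* turn the counts into the end position of each bucket */
        for (i = 1; i < RADIX; i++)
            count[i] += count[i - 1];
        /* scan right to left so keys with equal digits keep their order */
        for (i = n - 1; i >= 0; i--)
            tmp[--count[(a[i] / divisor) % RADIX]] = a[i];
        memcpy(a, tmp, (size_t)n * sizeof a[0]);
    }
}
```

Called with s = 3 on the nine numbers of the example, the array after each pass matches the input column of the next pass in the tables above.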
Example:
25 57 48 37 12 92 86 33
Analysis: If all the n elements are uniformly distributed over the m subfiles, then n/m is approximately
1 and the time of the sort is near O(n). On the other hand, if most of the elements fall into one or two
subfiles, then n/m is much larger, significant work is required to insert an element into its subfile at its proper
position, and the efficiency is O(n^2).
Hashing
Hashing enables us to perform dictionary operations like search, insert and delete in O(1) expected time. There
are two types of hashing:
◾ Static and
◾ Dynamic
Static Hashing
◾ In static Hashing the dictionary pairs are stored in a table, ht called the hash table.
◾ The hash table is partitioned into b buckets, ht[0],…….ht[b-1]
◾ Each bucket is capable of holding s dictionary pairs.
◾ Thus a bucket is said to consist of s slots. usually s=1
◾ The address or location of a pair whose key is k is determined by hash function h which
maps keys into buckets.
◾ Thus for any key k, h(k) is an integer in range 0 through b-1
(Figure: hash table ht with buckets 0, 1, ..., b-2, b-1, each bucket having s slots numbered 1, 2, ..., s)
Example 1:
b=26, s=2
n=10 distinct identifiers, each representing a C library function
Loading factor α = n/(sb) = 10/52 = 0.19
f(x) = first character of x (a maps to 0, b to 1, and so on)
x:    acos, define, float, exp, char, atan, ceil, floor, clock, ctime
f(x): 0, 3, 5, 4, 2, 0, 2, 5, 2, 2
Bucket  Slot0   Slot1
0       acos    atan
1
2       char    ceil
3       define
4       exp
5       float   floor
.
.
24
25
clock and ctime also hash to bucket 2, which is already full, so inserting them causes an overflow.
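The hash function of this example can be sketched in C as below. The sketch assumes that every key starts with a letter and that the letters a to z have consecutive character codes (true for ASCII); the constant name B is an assumption.

```c
#include <ctype.h>

#define B 26   /* number of buckets, as in the example */

/* f(x): bucket index taken from the first character of x. */
int f(const char *x)
{
    return (tolower((unsigned char)x[0]) - 'a') % B;
}
```

For instance, f("acos") and f("atan") both give 0, which is why the two identifiers share bucket 0 in the table.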
Hash Functions: A hash function maps a key into a bucket in the hash table. A function H from the
set K of keys into the set L of memory addresses is called a hash function.
H: K -> L
Division: Choose a number m larger than the number n of keys in K. The number m is usually chosen to be a
prime number or a number without small divisors, to reduce collisions. The hash function is defined as
H(k) = k mod m
Bucket addresses range from 0 to m-1 and the hash table must have m buckets.
Mid Square: In this method the square of the key is found and an appropriate number of bits is taken
from the middle of the square to obtain the bucket address.
F(K) = middle(K^2)
The number of bits used to obtain the bucket address depends on the table size.
If r bits are used, the range of values is 0 through 2^r - 1.
Folding: The key is partitioned into several parts.
All parts except for the last one have the same length.
The parts are added together to obtain the hash address.
Two possibilities: shift folding and folding at the boundaries.
Example (shift folding): k = 12320324111220
x1=123, x2=203, x3=241, x4=112, x5=20, address = 123+203+241+112+20 = 699
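The addition of equal-length parts can be sketched in C as follows. The key is passed as a string of decimal digits so that it can be cut into parts from the left; the function name fold and the parameter part_len are assumptions made for this illustration.

```c
#include <string.h>

/* Split the decimal digit string key into parts of part_len digits,
   taken from the left (the last part may be shorter), and add the parts. */
unsigned long fold(const char *key, size_t part_len)
{
    unsigned long sum = 0;
    size_t i, n = strlen(key);
    for (i = 0; i < n; i += part_len) {
        unsigned long part = 0;
        size_t j;
        for (j = i; j < i + part_len && j < n; j++)
            part = part * 10 + (unsigned long)(key[j] - '0');
        sum += part;
    }
    return sum;
}
```

With part_len = 3 the key of the example is cut into 123, 203, 241, 112 and 20. In practice the sum would still be reduced to the range of bucket addresses, for example by taking it modulo the number of buckets.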
Digit Analysis
Useful in the case of a static file where all the keys in the table are known in advance
Each key is interpreted using some radix r.
The same radix is used for all the keys in the table
Digits are examined with this radix
Digits having the most skewed distributions are deleted.
Enough digits are deleted so that the number of remaining digits is small enough to give an address in
the range of the hash table.
◾ Converting each character to a unique integer and summing these unique integers.
◾ Shifting the integer corresponding to every other character by 8 bits and then summing it up
Synonyms: A hash function h may map several different keys into the same bucket.
Two keys k1 and k2 are synonyms with respect to h
if h(k1) = h(k2).
An overflow occurs when the home bucket for a new dictionary pair is full at the time we wish to insert
this pair.
A collision occurs when the home bucket for the new pair is not empty at the time of insertion.
Example:
Suppose the table T has 11 memory locations T[1], ..., T[11] and suppose the file F contains 8
records with the following hash addresses:
Record   A  B  C  D   E  X   Y  Z
H(K)     4  8  2  11  4  11  5  1
Suppose these 8 records are entered into the hash table in the above order, with each collision resolved by moving to the next free location (wrapping around from T[11] to T[1]). The hash table will then look as
shown below.
Address  1  2  3  4  5  6  7  8  9  10  11
Table T  X  C  Z  A  E  Y  -  B  -  -   D
The average number of probes for an unsuccessful search is
U = (7+6+5+4+3+2+1+2+1+1+8)/11 = 40/11 = 3.6
Example-2
0 1 2 3 4 5 6 7 8 9 10 11 12
function for do while else if
◾ Compute h(k)
◾ Examine the hash table buckets in the order ht[(h(k)+i) % b], 0 <= i <= b-1, until one of the following
happens:
The bucket ht[(h(k)+i) % b] contains the key k; the desired pair is found.
ht[(h(k)+i) % b] is empty; k is not in the table.
We return to the starting position ht[h(k)]; the table is full and k is not in the table.
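The search steps above can be sketched in C as follows, together with the matching insert. This is a minimal sketch under stated assumptions: the table size B, the key length limit KEY_LEN, the character-sum hash h, and the names ht_search and ht_insert are all hypothetical choices made for the illustration, with one slot per bucket and the empty string marking an empty bucket.

```c
#include <string.h>

#define B 13              /* assumed number of buckets, one slot each */
#define KEY_LEN 16        /* assumed maximum key length, including '\0' */

char ht[B][KEY_LEN];      /* an empty string marks an empty bucket */

/* Hypothetical hash: sum of the characters, reduced modulo B. */
unsigned h(const char *k)
{
    unsigned sum = 0;
    while (*k)
        sum += (unsigned char)*k++;
    return sum % B;
}

/* Return the bucket holding k, or -1 if k is not in the table. */
int ht_search(const char *k)
{
    unsigned home = h(k);
    int i;
    for (i = 0; i < B; i++) {
        unsigned pos = (home + (unsigned)i) % B;
        if (ht[pos][0] == '\0')
            return -1;                 /* empty bucket: k is absent */
        if (strcmp(ht[pos], k) == 0)
            return (int)pos;           /* desired key found */
    }
    return -1;                         /* back at the start: table is full */
}

/* Insert k in the first empty bucket at or after its home bucket.
   Returns the bucket used, or -1 if the table is full or k is
   empty or too long. */
int ht_insert(const char *k)
{
    unsigned home = h(k);
    int i;
    if (k[0] == '\0' || strlen(k) >= KEY_LEN)
        return -1;
    for (i = 0; i < B; i++) {
        unsigned pos = (home + (unsigned)i) % B;
        if (ht[pos][0] == '\0') {
            strcpy(ht[pos], k);
            return (int)pos;
        }
        if (strcmp(ht[pos], k) == 0)
            return (int)pos;           /* already present */
    }
    return -1;
}
```

Two keys with the same hash value end up in neighbouring buckets, so runs of occupied buckets tend to grow as more keys are inserted.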
Example showing the drawback of linear probing:
Insert acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime into a 26 bucket hash table.
We see the number of searches increasing and the keys clustering together.
Quadratic Probing
Quadratic probing uses a quadratic function of i as the increment
Suppose a record R with key k has the hash address H(k)=h. Then instead of searching the
locations with addresses h, h+1, h+2, ..., we search the locations with addresses h, h+1, h+4, h+9, ..., h+i^2, ...
If the number m of locations in the table T is a prime number, then the above sequence will
access half of the locations in T.
Double hashing
Here a second hash function H’ is used for resolving a collision, as follows.
Suppose a record R with key k has the hash addresses H(k)=h and H’(k)=h’ ≠ m. Then we linearly search
the locations with addresses h, h+h’, h+2h’, h+3h’, ...
If m is a prime number then the above sequence will access all the locations in the table T.
Note: One major disadvantage of any type of open addressing procedure is the
implementation of deletion.
Suppose a record r is deleted from location T(r). If we later reach this empty location during a search, it
does not necessarily mean that the search is unsuccessful.
Thus, when deleting a record, the location should be labeled to indicate that it previously contained a
record.
Chaining
Maintain one list per bucket
Each list containing the synonyms for that bucket.
Search involves
o Computing the hash address h(k), and
o Examining the keys in the list of h(k)
Example: Insert acos, atoi,char,define,exp,ceil,cos, float, atol, floor , ctime into a 26 bucket hash
table maintained as hash chain
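Chaining can be sketched in C as follows. This is an illustration under stated assumptions: the hash is the first-letter function used in the earlier examples, keys are assumed to start with a lowercase letter, and the names chain_node, chain_search and chain_insert are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define B 26   /* number of buckets, as in the example */

struct chain_node {
    const char *key;            /* the caller keeps the key string alive */
    struct chain_node *link;    /* next synonym in the same bucket */
};

struct chain_node *ht[B];       /* ht[i] heads the list of synonyms for bucket i */

/* Hash on the first character; assumes keys start with a lowercase letter. */
int h(const char *k)
{
    return (k[0] - 'a') % B;
}

/* Return the node holding k, or NULL if k is not in its chain. */
struct chain_node *chain_search(const char *k)
{
    struct chain_node *p;
    for (p = ht[h(k)]; p != NULL; p = p->link)
        if (strcmp(p->key, k) == 0)
            return p;
    return NULL;
}

/* Insert k at the front of its chain unless it is already present.
   Returns 0 on success and -1 if memory runs out. */
int chain_insert(const char *k)
{
    struct chain_node *p;
    if (chain_search(k) != NULL)
        return 0;
    p = malloc(sizeof *p);
    if (p == NULL)
        return -1;
    p->key = k;
    p->link = ht[h(k)];
    ht[h(k)] = p;
    return 0;
}
```

A search only compares k with the synonyms in one list, so keys from different buckets never interfere with each other.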
Rehashing: When the hash table becomes nearly full, the number of collisions increases, thereby
degrading the performance of insertion and search operations. In such cases, a better option is to create
a new hash table with size double of the original hash table.
All the entries in the original hash table will then have to be moved to the new hash table. This is done
by taking each entry, computing its new hash value, and then inserting it in the new hash table. Though
rehashing seems to be a simple process, it is quite expensive and must therefore not be done frequently.
Example:
Consider the hash table of size 5 given below. The hash function used is h(x)= x % 5.
Rehash the entries into a new hash table using the hash function h(x) = x % 10.
Dynamic hashing
Limitation of static hashing: when the table tends to be full, overflows increase and
performance is reduced.
To ensure good performance, it is necessary to increase the size of a hash table whenever the
loading density exceeds a prescribed threshold.
When the loading density increases, array doubling can be used to increase the number of buckets from b to
2b+1. This change in the divisor forces us to rebuild the hash table by reinserting all the keys from the smaller
table into the new, larger table. Dynamic hashing, also called extendible hashing, reduces the rebuild time.
Example: Consider a hash function h that transforms keys into 6-bit non-negative integers. h(k,t) denotes the
integer formed by the t least significant bits of h(k).
The example uses two-character keys. h transforms the letters A, B and C into the bit sequences 100, 101 and
110 respectively. The digits 0 through 7 are transformed into their 3-bit representations.
k h(k)
A0 100 000
A1 100 001
B0 101 000
B1 101 001
C1 110 001
C2 110 010
C3 110 011
C5 110 101
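The function h(k,t) for the keys of this table can be sketched in C as follows. The sketch assumes keys of exactly the form used here (one letter A to C followed by one digit 0 to 7); the name h_t is an assumption.

```c
/* h(k) for a two-character key such as "C5": the letter A, B or C gives
   the high three bits (100, 101, 110) and the digit 0..7 the low three. */
unsigned h(const char *k)
{
    unsigned letter = 4u + (unsigned)(k[0] - 'A');   /* A->4, B->5, C->6 */
    unsigned digit  = (unsigned)(k[1] - '0');
    return (letter << 3) | digit;
}

/* h(k,t): the integer formed by the t least significant bits of h(k). */
unsigned h_t(const char *k, unsigned t)
{
    return h(k) & ((1u << t) - 1u);
}
```

For example, h("C5") is 110 101 in binary, so h_t("C5", 2) keeps the last two bits, 01.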
Example: Figure below shows a dynamic hash table that contains the keys A0, B0, A1, B1, C2 and C3.
Here the directory depth is 2 and the buckets have 2 slots each. For each key k,
we examine the bucket pointed to by d[h(k,t)], where t is the directory depth. Suppose we insert C5
into the hash table. Since h(C5,2)=01, we follow the pointer d[01], and this bucket is full. To resolve
the overflow, we determine the least u such that h(k,u) is not the same for all keys in the overflowed bucket. In case u is
greater than the directory depth, we increase the directory depth to this least value u. Figure below
shows the table after inserting C5.
Example:
Advantages
Figure below shows a directoryless hash table ht with r=2 and q=0. The number of active buckets is
4. The index of an active bucket identifies its chain. Each active bucket has 2 slots.
r=2, q=0
When we insert C5 into the table, chain 01 is examined and we verify that C5 is not present. Since the
active bucket for the searched chain is full, we get an overflow. An overflow is handled by activating
bucket 2^r+q, reallocating the entries in chain q between bucket q and the new bucket 2^r+q, and then incrementing q by 1. In case q
becomes 2^r, we increment r by 1 and reset q to 0. The reallocation is done using h(k,r+1). Finally the
new pair is inserted into the chain.
Example:
r=2,q=1
Inserting C1 will again result in an overflow, this time at chain 01, so bucket 5 = 101 is activated. Rehashing is done
and the table is as shown below.
r=2,q=2
Data Hierarchy
Every file contains data which can be organized in a hierarchy to present a systematic organization.
The data hierarchy includes data items such as fields, records, files, and database. These terms are
defined below.
A data field is an elementary unit that stores a single fact. A data field is usually characterized
by its type and size. For example, student’s name is a data field that stores the name of students.
This field is of type character and its size can be set to a maximum of 20 or 30 characters
depending on the requirement.
A record is a collection of related data fields which is seen as a single unit. For example, the
student’s record may contain data fields such as name, address, phone number, roll number,
marks obtained, and so on.
A file is a collection of related records. For example, if there are 60 students in a class, then
there are 60 records. All these related records are stored in a file
A directory stores information of related files. A directory organizes information so that users
can find it easily.
Consider figure that shows how multiple related files are stored in a student directory.
File Attributes
Every file in a computer system is stored in a directory. Each file has a list of attributes associated
with it . It gives the operating system and the application software information about the file and
how it is to be used.
o Read-only A file marked as read-only cannot be deleted or modified. For example, if
an attempt is made to either delete or modify a read-only file, then a message ‘access
denied’ is displayed on the screen.
o Hidden A file marked as hidden is not displayed in the directory listing.
o System A file marked as a system file indicates that it is an important file used by the
system and should not be altered or removed from the disk.
o Volume Label Every disk volume is assigned a label for identification. The label can
be assigned at the time of formatting the disk or later through various tools such as the
DOS command LABEL.
o Directory In directory listing, the files and sub-directories of the current directory are
differentiated by a directory-bit. This means that the files that have the directory-bit
turned on are actually sub-directories containing one or more files.
o Archive The archive bit is used as a communication link between programs that modify
files and those that are used for backing up files. Most backup programs allow the user
to do an incremental backup. Incremental backup selects only those files for backup
which have been modified since the last backup. When the backup program archives a file, it
clears the archive bit (sets it to zero). Subsequently, if any program modifies the file, it
turns on the archive bit (sets it to 1). Thus, whenever the backup program is run, it
checks the archive bit to know whether the file has been modified since its last run. The
backup program will archive only those files which were modified.
Basic File Operations: The basic operations that can be performed on a file are given below.
Creating a File - A file is created by specifying its name and mode. Then the file is opened for
writing records that are read from an input device. Once all the records have been written into the
file, the file is closed.
Updating a File : Updating a file means changing the contents of the file. A file can be updated in
the following ways:
Inserting a new record in the file. For example, if a new student joins the course, we need to
add his record to the STUDENT file.
Deleting an existing record. For example, if a student quits a course in the middle of the session,
his record has to be deleted from the STUDENT file.
Modifying an existing record. For example, if the name of a student was spelt incorrectly, then
correcting the name will be a modification of the existing record.
Retrieving from a File- Extracting useful data from a given file. Information can be retrieved from
a file either for an inquiry or for report generation. An inquiry for some data retrieves low volume of
data, while report generation may retrieve a large volume of data from the file.
Maintaining a File - It involves restructuring or re-organizing the file to improve the performance
of the programs that access this file. Restructuring a file keeps the file organization unchanged and
changes only the structural aspects of the file.
FILE ORGANIZATION
A file is a collection of related records. The main issue in file management is the way in which
the records are organized inside the file. The following considerations should be kept in mind
before selecting an appropriate file organization method:
Rapid access to one or more records
Ease of inserting/updating/deleting one or more records without disrupting the speed
of accessing record(s).
Efficient storage of records
Using redundancy to ensure data integrity
Advantages
Simple and easy to handle
No extra overheads involved
Sequential files can be stored on magnetic disks as well as magnetic tapes
Well suited for batch–oriented applications
Disadvantages
Records can be read only sequentially. If the ith record has to be read, then the first i–1 records
must be read before it
Does not support update operation. A new file has to be created and the original file has to
be replaced with the new file that contains the desired changes
Cannot be used for interactive applications
Features
Suppose the base address of a file is 1000 and each record occupies 20 bytes. Then the address of the
5th record is:
1000 + (5–1) * 20
= 1000 + 80
= 1080
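The same calculation can be written as a small C function. The name record_address and its parameters are illustrative assumptions; the formula is the one used above, with records numbered from 1.

```c
/* Address of the i-th record (i counted from 1) in a file of
   fixed-length records stored contiguously starting at base. */
long record_address(long base, long i, long record_size)
{
    return base + (i - 1) * record_size;
}
```

For the values of the example, record_address(1000, 5, 20) evaluates to 1080. The formula relies on every record having the same length.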
Advantages
Ease of processing
If the relative record number of the record that has to be accessed is known, then the record
can be accessed instantaneously
Random access of records makes access to relative files fast
Allows deletions and updations in the same file
Provides random as well as sequential access of records with low overhead
New records can be easily added in the free locations based on the relative record number of
the record to be inserted
Well suited for interactive applications
Disadvantage
Use of relative files is restricted to disk devices
Records can be of fixed length only
For random access of records, the relative record number must be known in advance
Advantages
The key improvement is that the indices are small and can be searched quickly, allowing the
database to access only the records it needs
Supports applications that require both batch and interactive processing
Records can be accessed sequentially as well as randomly
Updates the records in the same file
Disadvantage
Example:
Indexing
Ordered Indices
Indices are used to provide fast random access to records. A file may have multiple indices
based on different key fields.
◾ Primary index: the index whose search key specifies the sequential order of the file.
Example: roll number.
◾ Secondary index: the search key specifies an order different from the sequential order of the file.
Example: name field. Used to improve the performance of queries.
In a dense index the index table stores the address of every record in the file. In a
sparse index, however, the index table stores the address of only some of the records in the file. Although
sparse indices are easy to fit in the main memory, a dense index would be more efficient to use
than a sparse index if it fits in the memory.
The records need not be stored in consecutive memory locations. The pointer to the next record
stores the address of the next record.
By looking at the dense index, it can be concluded directly whether the record exists in the file
or not. In a sparse index, to locate a record, we first find an entry in the index table with the
largest search key value that is either less than or equal to the search key value of the desired
record. Then, we start at that record pointed to by that entry in the index table and then proceed
searching the record using the sequential pointers in the file, until the desired record is
obtained. For example, if we need to access record number 40, and record number 30 is the
largest key value in the index that is less than 40, we jump to the record pointed to by the entry for record number 30 and
move along the sequential pointers to reach record number 40. Thus a sparse index
takes more time to find a record with a given key.
Dense indices are faster to use, while sparse indices require less space and impose less
maintenance for insertions and deletions.
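The first step of a sparse index lookup, finding the entry with the largest key that is less than or equal to the search key, can be sketched in C as follows. The notes do not prescribe how the index table is searched; this sketch assumes the table is sorted by key and uses binary search. The names index_entry and sparse_lookup are hypothetical.

```c
#include <stddef.h>

struct index_entry {
    int key;        /* search key of an indexed record */
    long address;   /* where that record is stored (illustrative) */
};

/* Return the entry with the largest key <= target, or NULL if every
   indexed key is greater than target. idx must be sorted by key. */
const struct index_entry *sparse_lookup(const struct index_entry idx[],
                                        size_t n, int target)
{
    size_t lo = 0, hi = n;           /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].key <= target)
            lo = mid + 1;            /* the answer is at mid or to its right */
        else
            hi = mid;                /* the answer is to the left of mid */
    }
    return lo == 0 ? NULL : &idx[lo - 1];
}
```

From the returned entry the file is then followed along its sequential pointers until the desired record is reached. With a dense index the entry found would already be the record itself.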
Cylinder surface indexing is a very simple technique used only for the primary key index of a
sequentially ordered file. In a sequentially ordered file, the records are stored sequentially in
increasing order of the primary key.
The index file contains two kinds of fields: a cylinder index and several surface indices. Generally, there
are multiple cylinders, and each cylinder has multiple surfaces. If the file needs m cylinders for
storage, then the cylinder index will contain m entries.
When a record with a particular key value has to be searched, the following steps are performed:
First, the cylinder index of the file is read into memory.
Second, the cylinder index is searched to determine which cylinder holds the desired record.
After the cylinder index is searched, the appropriate cylinder is determined.
Depending on the cylinder, the surface index corresponding to that cylinder is then retrieved
from the disk.
Once the cylinder and the surface are determined, the corresponding track is read and
searched for the record with the desired key.
Hence, the total number of disk accesses is three: first, for accessing the cylinder index; second, for
accessing the surface index; and third, for getting the track address. However, if track sizes are very
large then we can also include sector addresses. But this would add an extra level of indexing.
The cylinder surface indexing method of maintaining a file and index is referred to as the Indexed
Sequential Access Method (ISAM). This technique is the most popular and simplest file organization
in use for single key values.
Multi-level Indices
Useful for very large files that may contain millions of records.
Consider a file that has 10,000 records. If we use simple indexing, then we need an index table that can
contain at least 10,000 entries to point to the 10,000 records. If each entry in the index table occupies 4
bytes, then we need an index table of 4 * 10000 bytes = 40000 bytes. Finding such a big consecutive space
is not always easy. So, a better scheme is to index the index table.
We can continue further by having three-level indexing and so on. But in practice, we use two-level
indexing. Note that two-level and higher-level indexing must always be sparse, otherwise multi-level
indexing will lose its effectiveness.
Inverted indices
◾ Inverted files are commonly used in document retrieval systems for large textual databases.
◾ Reorganizes the structure of an existing data file in order to provide fast access to all records
having one field falling within the set limits.
◾ For example, inverted files are widely used by bibliographic databases that may store author
names, title words, journal names, etc.
◾ Thus, for each keyword, an inverted file contains an inverted list that stores a list of pointers
to all occurrences of that term in the main text.
◾ There are two main variants of inverted indices:
A record-level inverted index (also known as inverted file index or inverted file)
stores a list of references to documents for each word.
A word-level inverted index (also known as full inverted index or inverted list), in
addition to a list of references to documents for each word, also contains the positions
of each word within a document.
B-Tree Indices
Majority of the database management systems use the B-tree index technique as the default indexing
method. This technique supersedes other techniques of creating indices, mainly due to
its data retrieval speed, ease of maintenance, and simplicity.
It forms a tree structure with the root at the top. The index consists of a B-tree (balanced tree) structure
based on the values of the indexed column. In this example, the indexed column is name and the B-
tree is created using all the existing names that are the values of the indexed column.
The upper blocks of the tree contain index data pointing to the next lower block, thus forming a
hierarchical structure. The lowest level blocks, also known as leaf blocks, contain pointers to the data
rows stored in the table.
Example:
The B-tree structure has the following advantages:
Since the leaf nodes of a B-tree are at the same depth, retrieval of any record from anywhere
in the index takes approximately the same time.
B-trees improve the performance of a wide range of queries that either search a value having
an exact match or for a value within specified range.
B-trees provide fast and efficient algorithms to insert, update, and delete records that maintain
the key order.
B-trees perform well for small as well as large tables. Their performance degrades only slowly as
the size of a table grows, because the height of the tree grows logarithmically with the number of keys.
B-trees optimize costly disk access.
Hashed Indices
◾ Hashing is used to compute the address of a record by using a hash function on the search key
value.
◾ If two hashed values map to the same address, a collision occurs, and schemes to resolve such
collisions are applied to generate a new address.
◾ Choosing a good hash function is critical to the success of this technique.
◾ By a good hash function, we mean two things.
First, a good hash function, irrespective of the number of search keys, gives an
average-case lookup that is a small constant.
Second, the function distributes records uniformly and randomly among the buckets.
Though the number of buckets is fixed, the number of files may grow with time.
If the number of buckets is too large, storage space is wasted.
If the number of buckets is too small, there may be too many collisions.
a. Insertion- To insert a record that has ki as its search value, use the hash function h(ki)
to compute the address of the bucket for that record. If the bucket is free, store the
record else use chaining to store the record.
b. Searching- To search a record having the key value ki, use h(ki) to compute the
address of the bucket where the record is stored. The bucket may contain one or several
records, so check for every record to retrieve the desired record with the given key
value.
c. Deletion- To delete a record with key value ki, use h(ki) to compute the address of
the bucket where the record is stored. The bucket may contain one or several records,
so check every record in the bucket for the given key value. Then delete the record.