Unit 5 Sort and Search

The document discusses various sorting techniques including Insertion Sort, Bucket Sort, and Radix Sort, explaining their processes and algorithms. It also covers Address Calculation Sort and Hashing, detailing how hash functions work and their properties. Additionally, it addresses collision resolution techniques in hash tables, including open addressing and chaining.

Uploaded by

tempoabhi1234

SORTING TECHNIQUES

Sorting
• Process of arranging elements in a defined order
• Useful in the searching process
• Useful as a prerequisite in many algorithms (uniqueness check, etc.),
where it improves the efficiency of the algorithm.
Insertion Sort
• The given problem instance is divided into 2 groups: sorted and unsorted.
• Initially, the sorted group contains the first element, and the rest of the
elements are in the unsorted group.
• One element of the unsorted group is taken at a time and inserted into
its correct position in the sorted group.
Example
[Figure: pass-by-pass trace of insertion sort on the eight keys 54, 69, 75, 76, 84, 86, 91, 94 (listed here in sorted order).]
Algorithm
for(i=1 to n-1) do
{
ele=a[i];
for(j=i-1; j>=0 && ele < a[j]; j--)
{
a[j+1] = a[j];
}
a[j+1]=ele;
}
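The algorithm above can be sketched in C as follows. This is a minimal illustration, not a tuned implementation; the function name insertion_sort is a hypothetical choice.

```c
#include <stddef.h>

/* Sort a[0..n-1] in ascending order using insertion sort. */
void insertion_sort(int a[], size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int ele = a[i];          /* next element of the unsorted group */
        size_t j = i;
        /* Shift larger elements of the sorted group one place right. */
        while (j > 0 && a[j - 1] > ele) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = ele;              /* drop the element into the gap */
    }
}
```

Because an element is shifted only when it is strictly greater than ele, equal keys keep their relative order.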
BUCKET SORT

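Bucket sort, or bin sort, is a distribution sorting algorithm.

It is a stable sort (the relative order of any two items
with the same key is preserved), provided items are placed into the
buckets in input order and each bucket is sorted with a stable method.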
It works in the following way:
• set up m buckets where each bucket is responsible for
an equal portion of the range of keys in the array.
• place items in appropriate buckets.
• sort items in each non-empty bucket using insertion
sort.
• concatenate sorted lists of items from buckets to get
final sorted order.
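The four steps above can be sketched in C as follows. This is one possible arrangement under stated assumptions: keys are integers with 0 <= key < range, and the names NBUCKETS, MAXN, bucket_of and bucket_sort are hypothetical. The buckets are kept as consecutive segments of one scratch array, so concatenation is a plain copy.

```c
#include <stddef.h>

#define NBUCKETS 10   /* m: number of buckets (assumed value) */
#define MAXN 100      /* capacity of the scratch array (assumed value) */

/* Bucket index for key in [0, range): each bucket covers an equal slice. */
static size_t bucket_of(int key, int range)
{
    return (size_t)((long)key * NBUCKETS / range);
}

/* Sort a[0..n-1]; assumes 0 <= key < range and n <= MAXN. */
void bucket_sort(int a[], size_t n, int range)
{
    size_t start[NBUCKETS + 1] = {0};
    size_t next[NBUCKETS];
    int tmp[MAXN];

    /* Count the items per bucket, then turn counts into start offsets. */
    for (size_t i = 0; i < n; i++)
        start[bucket_of(a[i], range) + 1]++;
    for (size_t b = 0; b < NBUCKETS; b++)
        start[b + 1] += start[b];

    /* Place items into their buckets in input order. */
    for (size_t b = 0; b < NBUCKETS; b++)
        next[b] = start[b];
    for (size_t i = 0; i < n; i++)
        tmp[next[bucket_of(a[i], range)]++] = a[i];

    /* Insertion-sort each non-empty bucket. */
    for (size_t b = 0; b < NBUCKETS; b++) {
        for (size_t i = start[b] + 1; i < start[b + 1]; i++) {
            int ele = tmp[i];
            size_t j = i;
            while (j > start[b] && tmp[j - 1] > ele) {
                tmp[j] = tmp[j - 1];
                j--;
            }
            tmp[j] = ele;
        }
    }

    /* Buckets are already laid out in order, so copying concatenates them. */
    for (size_t i = 0; i < n; i++)
        a[i] = tmp[i];
}
```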
RADIX SORT
Radix sort is closely related to bucket sort. Radix is the base of a
number system or logarithm.
• Radix sort is a multiple-pass distribution sort.
• It distributes each item to a bucket according to part of the
item's key.
• After each pass, items are collected from the buckets,
keeping the items in order, then redistributed according to
the next most significant part of the key.
• This sorts keys digit-by-digit (hence it is also referred to as digital
sort), or, if the keys are strings that we want to sort
alphabetically, it sorts character-by-character.
• Radix sort uses bucket or counting sort as the stable sorting
algorithm for each pass, where the initial relative order of equal keys is
unchanged.
RADIX SORT
Radix sort is classified based on how it works internally:

Least Significant Digit (LSD) radix sort:
Processing starts from the least significant digit and moves
towards the most significant digit.

Most Significant Digit (MSD) radix sort:
Processing starts from the most significant digit and moves
towards the least significant digit. This is recursive.
RADIX SORT - Example

Input is an array of 14 integers. For integers, the number of buckets is 10, from
0 to 9. The first pass distributes the keys into buckets by the least significant
digit (LSD).

Input: 100, 150, 65, 25, 19, 8, 4, 67, 73, 90, 128, 248, 328, 440

When the first pass is done, we have the following buckets:

Bucket   Keys
0        100, 150, 90, 440
1        (empty)
2        (empty)
3        73
4        4
5        65, 25
6        (empty)
7        67
8        8, 128, 248, 328
9        19

Merge (keys shown padded to three digits):
100 150 090 440 073 004 065 025 067
008 128 248 328 019
Radix Sort – Program
#include <stdio.h>

#define MAX 50

int main(void)
{
    int sorted[MAX], bucket[10][MAX], count[10];
    int j, k, m, p, d, flag = 0, N;

    printf("\nEnter the number of elements to be sorted :");
    if (scanf("%d", &N) != 1 || N < 1 || N > MAX)
        return 1;
    /* Keys must be non-negative and below 1000000000 so that p cannot overflow. */
    printf("\nEnter the elements to be sorted :\n");
    for (k = 0; k < N; k++)
    {
        if (scanf("%d", &sorted[k]) != 1 || sorted[k] < 0 || sorted[k] >= 1000000000)
            return 1;
    }

    /* One pass per decimal digit, least significant digit first.
       flag counts the keys whose current digit is 0; when flag == N
       every digit of every key has been processed and the loop stops. */
    for (p = 1; flag != N; p *= 10)
    {
        flag = 0;
        for (j = 0; j < 10; j++)
            count[j] = 0;

        /* Distribute: append each key to the bucket of its current digit. */
        for (k = 0; k < N; k++)
        {
            d = (sorted[k] / p) % 10;
            bucket[d][count[d]++] = sorted[k];
            if (d == 0)
                flag++;
        }

        /* Collect: read the buckets back in order 0 to 9. */
        for (j = 0, m = 0; j < 10; j++)
            for (k = 0; k < count[j]; k++)
                sorted[m++] = bucket[j][k];
    }

    printf("\nSorted List: \n");
    for (j = 0; j < N; j++)
        printf("%d\t", sorted[j]);
    printf("\n");
    return 0;
}
Address Calculation Sort
• This can be one of the fastest distributive sorting techniques if
enough space is available. It is also called sorting by hashing.
• In this algorithm, a hash function is applied to each element in
the list. The result of the hash function is the address in the
table at which the key is placed.
• Linked lists are used as the address table for storing keys (if the hash
function produces 4 distinct addresses, then 4 linked lists are used).
• The linked lists into which the hash function places the elements are called
subfiles. An item is placed into a subfile in correct sequence by using any
sorting method. After all the elements are placed into subfiles, the lists
(subfiles) are concatenated to produce the sorted list.
Address Calculation Sort
Procedure:
1. In this method a hash function f is applied to each key.
2. The result of this function determines into which of the
several subfiles the record is to be placed. The function should
have the property that if x <= y, then f(x) <= f(y). Such a
function is called order preserving.
3. An item is placed into a subfile in correct sequence by using
any sorting method – simple insertion is often used.
4. After all the elements are placed into subfiles, the lists
(subfiles) are concatenated to produce the sorted list.
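The procedure can be sketched in C as follows. This is only an illustration under stated assumptions: keys lie in 0..99, the order-preserving function is f(x) = x / 10 (the tens digit), and the names SUBFILES, MAXN and address_calc_sort are hypothetical. The linked lists are stored in a small node pool indexed by integers instead of pointers.

```c
#include <stddef.h>

#define SUBFILES 10   /* number of subfiles (assumed) */
#define MAXN 100      /* capacity of the node pool (assumed) */

struct node { int key; int next; };   /* next: pool index, -1 = end of list */

/* Order-preserving hash for keys in 0..99: the tens digit. */
static int f(int key) { return key / 10; }

/* Sort a[0..n-1]; assumes 0 <= key <= 99 and n <= MAXN. */
void address_calc_sort(int a[], size_t n)
{
    struct node pool[MAXN];
    int head[SUBFILES];
    size_t used = 0, out = 0;

    for (int s = 0; s < SUBFILES; s++)
        head[s] = -1;

    for (size_t i = 0; i < n; i++) {
        int s = f(a[i]);
        int prev = -1, cur = head[s];
        /* Simple insertion: walk past the keys that are <= the new key. */
        while (cur != -1 && pool[cur].key <= a[i]) {
            prev = cur;
            cur = pool[cur].next;
        }
        pool[used].key = a[i];
        pool[used].next = cur;
        if (prev == -1)
            head[s] = (int)used;
        else
            pool[prev].next = (int)used;
        used++;
    }

    /* Concatenate the subfiles in address order. */
    for (int s = 0; s < SUBFILES; s++)
        for (int cur = head[s]; cur != -1; cur = pool[cur].next)
            a[out++] = pool[cur].key;
}
```

Because f is order preserving, every key in subfile s is no larger than any key in subfile s + 1, which is why plain concatenation yields the sorted list.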
SEARCHING TECHNIQUES
Hashing
Table 1. Records of employees

Table 2. Records of employees with a five-digit Emp_ID


Hashing
• Table 1 uses a two-digit primary key (Emp_ID) and Table 2 uses a five-digit
key, and there are just 100 employees in the company.
• In both cases we want to use only 100 locations in the array.
• With a two-digit key, the key itself can be used as the array index. With a
five-digit key, using the key directly as the index would require 100,000
locations. In order to keep the array size down to the size that we will actually
be using (100 elements), a good option is to use just the last two digits of the
key to identify each employee.
• For example, the employee with Emp_ID 79439 will be stored in the element of
the array with index 39. Similarly, the employee with Emp_ID 12345 will have
his record stored in the array at index 45.
• In the second solution, the elements are not stored according to the value of the
key. So in this case, we need a way to convert a five-digit key number to a
two-digit array index.
• We need a function which will do the transformation. In this case, we will use the
term hash table for the array, and the function that carries out the
transformation will be called a hash function.
Hash Tables
• Hash table is a data structure in which keys are mapped to array
positions by a hash function
• A value stored in a hash table can be searched in O(1) time on average
(ideally one comparison, when there is no collision) by using a hash
function which generates an address from the key (by producing the
index of the array where the value is stored).
• In a hash table, an element with key k is stored at index h(k)and not
k.
• A hash function is used to calculate the index at which the element
with key k has to be stored.
• The process of mapping the keys to appropriate locations (or indices)
in a hash table is called hashing.
Direct relationship between key and index in the array

Relationship between keys and hash table index

Note that keys k2 and k6 point to the same memory location. This is
known as a collision.
Hash Functions
• A hash function is a mathematical formula which, when
applied to a key, produces an integer which can be used as
an index for the key in the hash table.
• The main aim of a hash function is that elements should be
distributed relatively randomly and uniformly.
• It should produce integers spread across some suitable
range in order to reduce the number of collisions.
• In practice, there is no hash function that eliminates
collisions completely.
• A good hash function can only minimize the number of
collisions by spreading the elements uniformly throughout
the array.
Properties of a Good Hash Function
Low cost: The cost of executing a hash function must be small.

Determinism: A hash procedure must be deterministic. This means
that the same hash value must be generated for a given input value.

Uniformity: A good hash function must map the keys as evenly as
possible over its output range. This means that the probability of
generating every hash value in the output range should roughly be
the same. The property of uniformity also minimizes the number of
collisions.
Hash Functions

• Division Method

• Multiplication Method

• Mid-Square Method

• Folding Method
Division Method: This method divides x by M and then uses
the remainder obtained.
h(x) = x mod M

Choose M to be a prime number, because making M a prime
number increases the likelihood that the keys are mapped
uniformly over the output range of values. M should also
not be too close to an exact power of 2.

Example: Calculate the hash values of keys 1234 and 5642.

Solution: Setting M = 97, hash values can be calculated as:
h(1234) = 1234 % 97 = 70
h(5642) = 5642 % 97 = 16
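A minimal C sketch of the division method, with the hypothetical function name hash_division; it reproduces the two values computed above when M = 97.

```c
/* Division-method hash: h(x) = x mod M. */
unsigned hash_division(unsigned x, unsigned M)
{
    return x % M;
}
```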
Multiplication Method:

The steps involved in the multiplication method are as follows:

Step 1: Choose a constant A such that 0 < A < 1.
Step 2: Multiply the key k by A.
Step 3: Extract the fractional part of kA.
Step 4: Multiply the result of Step 3 by the size of the hash table (m) and take the floor.

Hence, the hash function can be given as:

h(k) = ⌊m (kA mod 1)⌋

where (kA mod 1) denotes the fractional part of kA.
Multiplication Method:

Example: Given a hash table of size 1000, map the key 12345 to
an appropriate location in the hash table.

Solution: We will use A = 0.618033, m = 1000, and k = 12345

h(12345) = ⌊1000 (12345 × 0.618033 mod 1)⌋
h(12345) = ⌊1000 (7629.617385 mod 1)⌋
h(12345) = ⌊1000 (0.617385)⌋
h(12345) = ⌊617.385⌋
h(12345) = 617
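The four steps can be sketched in C as follows. The function name hash_multiplication is hypothetical, and the fractional part is taken by subtracting the truncated product, which assumes a non-negative key.

```c
/* Multiplication-method hash: multiply k by A (0 < A < 1), keep the
   fractional part, scale by the table size m, and drop the fraction. */
unsigned hash_multiplication(unsigned k, double A, unsigned m)
{
    double prod = (double)k * A;
    double frac = prod - (double)(unsigned long)prod;  /* fractional part of kA */
    return (unsigned)(m * frac);                        /* truncation acts as floor */
}
```

With k = 12345, A = 0.618033 and m = 1000 this yields 617, matching the worked example.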
Mid-Square Method:

The mid-square method is a good hash function which works in
two steps:

Step 1: Square the value of the key. That is, find k^2.
Step 2: Extract the middle r digits of the result obtained in Step 1.

In the mid-square method, the same r digits must be chosen from
all the keys. Therefore, the hash function can be given as:
h(k) = s
where s is obtained by selecting r digits from k^2.
Mid-Square Method:

Example: Calculate the hash value for keys 1234 and 5642 using
the mid-square method. The hash table has 100 memory locations.

Solution: Note that the hash table has 100 memory locations whose
indices vary from 0 to 99.
This means that only two digits are needed to map the key to a
location in the hash table, so r = 2.
When k = 1234, k^2 = 1522756, h(1234) = 27
When k = 5642, k^2 = 31832164, h(5642) = 21
Observe that the 3rd and 4th digits starting from the right are
chosen.
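A minimal C sketch of this particular choice (r = 2, taking the 3rd and 4th digits from the right); the function name hash_mid_square is hypothetical, and the key is assumed small enough that its square fits in an unsigned long.

```c
/* Mid-square hash for a 100-slot table: square the key, discard the two
   rightmost digits, and keep the next two. */
unsigned hash_mid_square(unsigned long k)
{
    unsigned long sq = k * k;
    return (unsigned)((sq / 100) % 100);
}
```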
Folding Method:

The folding method works in the following two steps:

Step 1: Divide the key value into a number of parts. That is, divide
k into parts k1, k2, ..., kn, where each part has the same number
of digits except the last part, which may have fewer digits than the
other parts.
Step 2: Add the individual parts. That is, obtain the sum of k1 + k2
+ ... + kn. The hash value is produced by ignoring the last carry, if
any.
Example: Given a hash table of 100 locations, calculate
the hash value using the folding method for keys 5678, 321,
and 34567.
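The example can be worked with the C sketch below. It assumes two-digit parts taken from the left of the decimal representation (the text does not fix the part size or direction), and the function name hash_fold is hypothetical. Under that assumption, 5678 gives 56 + 78 = 134, which becomes 34 after ignoring the carry; 321 gives 32 + 1 = 33; and 34567 gives 34 + 56 + 7 = 97.

```c
#include <stdio.h>
#include <string.h>

/* Folding hash for a 100-slot table: split the decimal digits of the key,
   left to right, into two-digit parts (the last part may be shorter), add
   the parts, and drop any carry beyond two digits. */
unsigned hash_fold(unsigned long key)
{
    char digits[32];
    unsigned sum = 0;
    size_t len, i;

    snprintf(digits, sizeof digits, "%lu", key);
    len = strlen(digits);
    for (i = 0; i < len; i += 2) {
        unsigned part = (unsigned)(digits[i] - '0');
        if (i + 1 < len)
            part = part * 10 + (unsigned)(digits[i + 1] - '0');
        sum += part;
    }
    return sum % 100;   /* ignore the last carry */
}
```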
Collision Resolution Technique
• Collisions occur when the hash function maps two different keys to the
same location. Obviously, two records cannot be stored in the same
location.
• To solve the problem of collisions, a collision resolution
technique is applied. The two most popular methods of resolving
collisions are:
1. Open addressing
2. Chaining
• In open addressing, the hash table contains two types of values: sentinel
values (e.g., –1) and data values. The presence of a sentinel value indicates
that the location contains no data value at present but can be used to hold a value.
• The process of examining memory locations in the hash table is called
probing.
Collision Resolution by Open addressing
Open addressing technique can be implemented using linear probing,
quadratic probing, double hashing, and rehashing.

Linear Probing: In this technique, if a value is already stored at a
location generated by h(k), then the hash function used to resolve the
collision is:
h(k, i) = [h’(k) + i] mod m
where m is the size of the hash table, h’(k) = (k mod m), and i is the
probe number that varies from 0 to m–1.
In linear probing, when the location generated by the hash function is
already occupied, we try the slots [h’(k)] mod m, [h’(k) + 1] mod m,
[h’(k) + 2] mod m, [h’(k) + 3] mod m, [h’(k) + 4] mod m, [h’(k) + 5]
mod m, and so on, until a vacant location is found.
Example: Consider a hash table of size 10. Using linear probing,
insert the keys 72, 27, 36, 24, 63, 81, 92, and 101 into the table.
Let h’(k)= k mod m, m = 10
Key = 72
h(72, 0) = (72 mod 10 + 0) mod 10 = (2) mod 10 = 2
T[2] is vacant, so the key 72 is stored in T[2].
In the same way, keys 27, 36, 24, 63, and 81 are stored in T[7], T[6], T[4], T[3],
and T[1] without collision. Key 92 collides at T[2]; T[3] and T[4] are also
occupied, so it is stored in T[5].
Key = 101
h(101, 0) = (101 mod 10 + 0) mod 10 = (1) mod 10 = 1
T[1] is occupied, so we cannot store the key 101 in T[1]. Therefore, try again for the
next location. Thus probe, i = 1, this time.
Key = 101
h(101, 1) = (101 mod 10 + 1) mod 10 = (1 + 1) mod 10 = 2
T[2] is also occupied, so we cannot store the key in this location. The procedure will be
repeated until the hash function generates the address of location 8, which is vacant
and can be used to store the value.
Searching a Value using Linear Probing
• While searching for a value in a hash table, the array index is re-
computed and the key of the element stored at that location is
compared with the value that has to be searched.
• If a match is found, then the search operation is successful.
• If the key does not match, then the search function begins a
sequential search of the array that continues until:
  • the value is found, or
  • the search function encounters a vacant location in the array,
    indicating that the value is not present, or
  • the search function terminates because it has examined every location
    in the table and the value is not present.
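Insertion and search with linear probing can be sketched in C as follows. This is a minimal sketch assuming non-negative integer keys, a table of size 10 as in the example, and –1 as the sentinel; the names M, EMPTY, lp_init, lp_insert and lp_search are hypothetical.

```c
#define M 10          /* table size, as in the example */
#define EMPTY (-1)    /* sentinel marking a vacant location */

/* Mark every location of the table as vacant. */
void lp_init(int T[M])
{
    for (int i = 0; i < M; i++)
        T[i] = EMPTY;
}

/* Insert key using h(k, i) = (k mod M + i) mod M.
   Returns the index used, or -1 if the table is full. */
int lp_insert(int T[M], int key)
{
    for (int i = 0; i < M; i++) {
        int idx = (key % M + i) % M;
        if (T[idx] == EMPTY) {
            T[idx] = key;
            return idx;
        }
    }
    return -1;
}

/* Search for key along the same probe sequence.
   Returns its index, or -1 if it is not present. */
int lp_search(const int T[M], int key)
{
    for (int i = 0; i < M; i++) {
        int idx = (key % M + i) % M;
        if (T[idx] == key)
            return idx;       /* found */
        if (T[idx] == EMPTY)
            return -1;        /* vacant slot: key cannot be further along */
    }
    return -1;                /* every location examined */
}
```

Inserting 72, 27, 36, 24, 63, 81, 92, and 101 in that order places 92 at index 5 and 101 at index 8, as in the worked example.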
Collision Resolution by Open addressing Contd..
Quadratic Probing: In this technique, if a value is already stored at a
location generated by h(k), then the following hash function is used to
resolve the collision:

h(k, i) = [h’(k) + c1 i + c2 i^2] mod m

where m is the size of the hash table, h’(k) = (k mod m), i is the probe
number that varies from 0 to m–1, and c1 and c2 are constants such that
c1 and c2 ≠ 0.
Example: Consider a hash table of size 10. Using quadratic probing, insert
the keys 72, 27, 36, 24, 63, 81, and 101 into the table. Take c1 = 1 and c2 = 3.
Let h’(k) = k mod m, m = 10, h(k, i) = [h’(k) + c1 i + c2 i^2] mod m
Key = 72
h(72, 0) = [72 mod 10 + 1 × 0 + 3 × 0] mod 10 = [72 mod 10] mod 10 = 2 mod 10 = 2

Key = 101
h(101, 0) = [101 mod 10 + 1 × 0 + 3 × 0] mod 10 = [101 mod 10 + 0] mod 10 = 1 mod 10 = 1
Since T[1] is already occupied, the key 101 cannot be stored in T[1]. Therefore, try again for
the next location. Thus probe, i = 1, this time.
Key = 101
h(101, 1) = [101 mod 10 + 1 × 1 + 3 × 1] mod 10 = [101 mod 10 + 1 + 3] mod 10
= [101 mod 10 + 4] mod 10 = [1 + 4] mod 10 = 5 mod 10 = 5
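A minimal C sketch of quadratic probing, assuming non-negative keys and –1 as the vacant-slot sentinel; the names qp_probe and qp_insert are hypothetical.

```c
/* Quadratic probing: h(k, i) = (k mod m + c1*i + c2*i*i) mod m. */
int qp_probe(int key, int i, int m, int c1, int c2)
{
    return (key % m + c1 * i + c2 * i * i) % m;
}

/* Insert key into T (size m, vacant slots hold -1).
   Returns the index used, or -1 if no vacant slot is found in m probes. */
int qp_insert(int T[], int m, int key, int c1, int c2)
{
    for (int i = 0; i < m; i++) {
        int idx = qp_probe(key, i, m, c1, c2);
        if (T[idx] == -1) {
            T[idx] = key;
            return idx;
        }
    }
    return -1;
}
```

With m = 10, c1 = 1 and c2 = 3, inserting the keys of the example in order stores 101 at index 5 on its second probe.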
Collision Resolution by Open addressing Contd..
Double Hashing: Double hashing uses one hash value and then
repeatedly steps forward by an interval until an empty location is reached.
The interval is decided using a second, independent hash function,
hence the name double hashing. In double hashing, we use two hash
functions rather than a single function. The hash function in the case
of double hashing can be given as:

h(k, i) = [h1(k) + i h2(k)] mod m

where m is the size of the hash table, h1(k) and h2(k) are two hash
functions given as h1(k) = k mod m, h2(k) = k mod m', i is the probe
number that varies from 0 to m–1, and m' is chosen to be less than m.
We can choose m' = m–1 or m–2.
Example: Consider a hash table of size = 10. Using double hashing,
insert the keys 72, 27, 36, 24, 63, 81, 92, and 101 into the table.
Take h1 = (k mod 10) and h2 = (k mod 8).
We have h(k, i) = [h1(k) + i h2(k)] mod m
Key = 72
h(72, 0) = [72 mod 10 + (0 × 72 mod 8)] mod 10 = [2 + (0 × 0)] mod 10 = 2 mod 10 = 2

Note: Although double hashing is a very efficient algorithm, it works best when m
is a prime number. In our case m = 10, which is not a prime number, hence the
degradation in performance. Had m been equal to 11, the algorithm would have
worked very efficiently. Thus, we can say that the performance of the technique is
sensitive to the value of m.
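A minimal C sketch of insertion with double hashing, assuming non-negative keys and –1 as the vacant-slot sentinel; the function name dh_insert and the parameter name m2 (standing for m') are hypothetical.

```c
/* Double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m,
   with h1(k) = k mod m and h2(k) = k mod m2 (m2 < m). */
int dh_insert(int T[], int m, int m2, int key)
{
    int h1 = key % m, h2 = key % m2;
    for (int i = 0; i < m; i++) {
        int idx = (h1 + i * h2) % m;
        if (T[idx] == -1) {      /* -1 marks a vacant slot */
            T[idx] = key;
            return idx;
        }
    }
    return -1;                   /* no vacant slot reached in m probes */
}
```

With m = 10 and m' = 8, key 92 lands at index 0 on its third probe. For key 101 the step is 101 mod 8 = 5, so the probe sequence only alternates between indices 1 and 6, both already occupied, and the sketch reports failure. This illustrates how sensitive the technique is to the choice of m.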
Collision Resolution by Open addressing Contd..
Rehashing:
• When the hash table becomes nearly full, the number of collisions
increases, thereby degrading the performance of insertion and
search operations.
• In such cases, a better option is to create a new hash table that is
double the size of the original hash table.
• All the entries in the original hash table will then have to be moved
to the new hash table.
• This is done by taking each entry, computing its new hash value,
and then inserting it in the new hash table.
• Though rehashing seems to be a simple process, it is quite
expensive and must therefore not be done frequently.
Rehashing: Example
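As an illustration, the C sketch below rehashes a table into one of double the size. It assumes integer keys, –1 as the vacant-slot sentinel, the hash function h(k) = k mod m, and linear probing in the new table; the function name rehash and these choices are hypothetical, not taken from the slides.

```c
#include <stdlib.h>

/* Rehash: build a table twice the size of old (old_m slots, -1 = vacant)
   and re-insert every key with linear probing on h(k) = k mod new size.
   Returns the new table (caller frees it) or NULL on allocation failure. */
int *rehash(const int old[], int old_m, int *new_m)
{
    int m = 2 * old_m;
    int *T = malloc((size_t)m * sizeof *T);
    if (T == NULL)
        return NULL;
    for (int i = 0; i < m; i++)
        T[i] = -1;

    for (int i = 0; i < old_m; i++) {
        if (old[i] == -1)
            continue;
        int idx = old[i] % m;          /* new hash value of the entry */
        while (T[idx] != -1)           /* a vacant slot always exists: m > old_m */
            idx = (idx + 1) % m;
        T[idx] = old[i];
    }
    *new_m = m;
    return T;
}
```

For a five-slot table holding 26, 31, 43, and 17, the ten-slot table stores them at indices 6, 1, 3, and 7 respectively. Every entry is visited once, which is why rehashing is expensive for large tables.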
Collision Resolution by Chaining
• In chaining, each location in a hash table stores a pointer to a linked list
that contains all the key values that were hashed to that location.
• Location l in the hash table points to the head of the linked list of all the
key values that hashed to l.
Operations on a Chained Hash Table

• Searching for a value in a chained hash table is as simple as
scanning a linked list for an entry with the given key.
• The insertion operation appends the key to the end of the linked
list pointed to by the hashed location.
• Deleting a key requires searching the list and removing the
element.
Example: Insert the keys 7, 24, 18, 52, 36, 54, 11, and
23 in a chained hash table of 9 memory locations. Use
h(k) = k mod m.

Step 1: Key = 7
h(k) = 7 mod 9 = 7

Step 2: Key = 24
h(k) = 24 mod 9 = 6
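The three operations can be sketched in C as follows, assuming non-negative integer keys and h(k) = k mod 9 as in the example; the names M, chain_insert, chain_search and chain_delete are hypothetical.

```c
#include <stdlib.h>

#define M 9   /* number of memory locations, as in the example */

struct node {
    int key;
    struct node *next;
};

/* Append key to the end of the list at location key mod M.
   Returns 0 on success, -1 if memory could not be allocated. */
int chain_insert(struct node *table[M], int key)
{
    struct node *n = malloc(sizeof *n);
    struct node **link = &table[key % M];
    if (n == NULL)
        return -1;
    n->key = key;
    n->next = NULL;
    while (*link != NULL)         /* walk to the end of the chain */
        link = &(*link)->next;
    *link = n;
    return 0;
}

/* Scan the chain at location key mod M; returns the node or NULL. */
struct node *chain_search(struct node *table[M], int key)
{
    struct node *p = table[key % M];
    while (p != NULL && p->key != key)
        p = p->next;
    return p;
}

/* Remove key from its chain. Returns 1 if it was found, 0 otherwise. */
int chain_delete(struct node *table[M], int key)
{
    struct node **link = &table[key % M];
    while (*link != NULL && (*link)->key != key)
        link = &(*link)->next;
    if (*link == NULL)
        return 0;
    struct node *victim = *link;
    *link = victim->next;
    free(victim);
    return 1;
}
```

After inserting the eight keys of the example, location 0 holds the chain 18, 36, 54 and location 7 holds the chain 7, 52.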
Files and Their Organization
A file is a block of useful data which is available to a computer
program and is usually stored on a persistent storage medium.

DATA HIERARCHY: The data hierarchy includes data items
such as fields, records, files, and databases.

Data field: An elementary unit that stores a single fact. A data
field is usually characterized by its type and size.

Example: A student’s name is a data field that stores the name of
a student. This field is of type character and its size can be set to a
maximum of 20 or 30 characters depending on the requirement.
Files and Their Organization
Record: A collection of related data fields which is seen as a
single unit from the application point of view.

Example: A student’s record may contain data fields such as
name, address, phone number, roll number, marks obtained, and so
on.

File: A collection of related records. For example, if there are 60
students in a class, then there are 60 records. All these related
records are stored in a file.

Directory: Stores information about related files. A directory organizes
information so that users can find it easily.
Example: Student Directory
FILE ATTRIBUTES

Each file has a list of attributes associated with it that gives the
operating system and the application software information about the
file and how it is intended to be used.

File name: A string of characters that stores the name of a file.
File naming conventions vary from one operating system to another.

File position: A pointer that points to the position at which the
next read/write operation will be performed.

File structure: Indicates whether the file is a text file or a binary
file. In a text file, the numbers (integer or floating point) are stored
as a string of characters. A binary file, on the other hand, stores
numbers in the same way as they are represented in the main memory.

File access method: Indicates whether the records in a file can be
accessed sequentially or randomly. In sequential access mode, records
are read one by one. In random access, records can be accessed in any order.

Attributes flag: A file can have six additional attributes attached to it.
These attributes are usually stored in a single byte, with each bit
representing a specific attribute.
Read-only: A file marked as read-only cannot be deleted or modified.

Hidden: A file marked as hidden is not displayed in the directory listing.

System: A file marked as a system file is an important file
used by the system and should not be altered or removed from the disk.

Volume Label: Every disk volume is assigned a label for identification. The
label can be assigned at the time of formatting the disk or later through
various tools such as the DOS command LABEL.

Directory: In a directory listing, the files and sub-directories of the current
directory are differentiated by a directory bit.

Archive: The archive bit is used as a communication link between programs
that modify files and those that are used for backing up files. Most backup
programs allow the user to do an incremental backup. An incremental backup
selects only those files for backup which have been modified since the last
backup.
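Storing the six attributes in one byte can be sketched in C as follows. The bit positions follow the DOS/FAT directory-entry convention, which is an assumption here since the text does not give the bit layout; the names attr_has and attr_after_backup are hypothetical.

```c
/* One bit per attribute, packed into a single byte. The bit positions
   follow the DOS/FAT convention (an assumption, not given in the text). */
enum {
    ATTR_READ_ONLY    = 0x01,
    ATTR_HIDDEN       = 0x02,
    ATTR_SYSTEM       = 0x04,
    ATTR_VOLUME_LABEL = 0x08,
    ATTR_DIRECTORY    = 0x10,
    ATTR_ARCHIVE      = 0x20
};

/* Returns nonzero if every bit in mask is set in attr. */
int attr_has(unsigned char attr, unsigned char mask)
{
    return (attr & mask) == mask;
}

/* An incremental backup selects files whose archive bit is set
   and clears the bit once the file has been copied. */
unsigned char attr_after_backup(unsigned char attr)
{
    return (unsigned char)(attr & ~ATTR_ARCHIVE);
}
```

A program that modifies a file sets ATTR_ARCHIVE; the backup program tests it with attr_has and clears it afterwards, which is the communication link described above.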
TEXT AND BINARY FILES
Text file:
• also known as a flat file or an ASCII file, is structured as a sequence of
lines of letters, numerals, special characters, etc., each stored using its
corresponding ASCII code.
• can be manipulated by any text editor, but does not provide efficient
storage.
• is readable by humans.

Binary file:
• contains any type of data encoded in binary form for computer storage
and processing purposes.
• provides efficient storage of data, but can be read only through an
appropriate program.
• is not readable by humans.
BASIC FILE OPERATIONS
FILE ORGANIZATION

• A file is a collection of related records.
• The main issue in file management is the way in which the
records are organized inside the file.
• Organization of records means the logical arrangement of
records in the file and not the physical layout of the file as
stored on a storage medium.
Sequential Organization

• A sequentially organized file stores the records in the order in
which they were entered.
• Sequential files can be read only sequentially, starting with the
first record in the file.
• Once we store the records in a file, we cannot make any
changes to the records.
• In sequential file organization, all the records have the same
size and the same field format, and every field has a fixed size.
• The records are sorted based on the value of one field or a
combination of two or more fields.
• This field is known as the key.
• Each key uniquely identifies a record in a file.
Relative File Organization

• Provides an effective way to access individual records directly.
• Records are ordered by their relative key. It means the record number
represents the location of the record relative to the beginning of the
file.
• Records are organized in ascending relative record number.
• Relative files can be used for both random as well as sequential access.
• For sequential access, records are simply read one after another.
• Relative file organization provides random access by directly jumping
to the record which has to be accessed.
• If the records are of fixed length, then given the base address of the file
and the length of the record, any record i can be accessed using
the following formula:
Address of ith record = base_address + (i–1) * record_length
Relative File Organization
Schematic representation of a relative file which has been allocated
enough space to store 100 records

If the base address of a file is 1000 and each record occupies 20
bytes, then the address of the 5th record can be given as:

1000 + (5–1) * 20
= 1000 + 80
= 1080
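The address formula translates directly into C; the function name record_address is hypothetical.

```c
/* Address of the i-th record (i counted from 1) in a relative file
   with fixed-length records. */
long record_address(long base_address, long i, long record_length)
{
    return base_address + (i - 1) * record_length;
}
```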
Indexed Sequential File Organization

• Indexed sequential file organization stores data for fast retrieval. The
records in an indexed sequential file are of fixed length and every
record is uniquely identified by a key field.
• It maintains a table known as the index table which stores the record
number and the address of all the records.
• This type of file organization is called indexed sequential file
organization because physically the records may be stored anywhere,
but the index table stores the address of those records.
• An indexed sequential file uses the concept of both sequential as
well as relative files.
• While the index table is read sequentially to find the address of the
desired record, a direct access is made to the address of the specified
record in order to access it randomly.
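The lookup step can be sketched in C as follows. The struct layout and the names index_entry, record_no and index_lookup are hypothetical; the sketch only shows the sequential read of the index table, after which the caller would jump directly to the returned address.

```c
#include <stddef.h>

struct index_entry {
    int  record_no;   /* record number (key) stored in the index table */
    long address;     /* where the record is physically stored */
};

/* Read the index table sequentially; return the address of the record
   with the given record number, or -1 if it is not in the index. */
long index_lookup(const struct index_entry index[], size_t n, int record_no)
{
    for (size_t i = 0; i < n; i++)
        if (index[i].record_no == record_no)
            return index[i].address;
    return -1;
}
```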
Indexed Sequential File Organization
INDEXING

• Indexed sequential files are very efficient to use, but in real-world
applications, these files are very large and a single file may contain
millions of records.
• In such situations, indexing techniques are used, which are analysed
based on factors such as access type, access time, insertion time,
deletion time, and space overhead involved.
• There are two kinds of indices:
  • Ordered indices: These are sorted based on one or more key values. Indices
    are used to provide fast random access to records. A file may
    have multiple indices based on different key fields. An index of a file may be a
    primary index or a secondary index.
  • Hash indices: These are based on the values generated by applying a hash
    function. Hashing is used to compute the address of a record by using a hash
    function on the search key value.
