File Organization, Hashing and Collision Full Copy. 1
File Organizations
File organization determines how records are arranged in storage so that they are available for processing. One design task is to choose an efficient file organization for each base relation.
For example, if we want to retrieve employee records in alphabetical order of name, sorting the file by employee name is a good file organization. However, if we want to retrieve all employees whose marks are in a certain range, a file ordered by employee name would not be a good file organization.
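The trade-off in this example can be sketched in Python. This is only an illustration: the employee records, the field layout (name, marks) and the function names are made up for the sketch and do not come from the notes.

```python
from bisect import bisect_left

# Hypothetical employee records (name, marks), kept sorted by name.
employees = sorted([("Asha", 72), ("Bilal", 55), ("Chen", 91), ("Dipti", 64)])
names = [name for name, _ in employees]

def find_by_name(name):
    """Binary search is possible because the file is ordered by name."""
    i = bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return employees[i]
    return None

def find_by_marks(low, high):
    """Name order does not help here, so every record has to be scanned."""
    return [rec for rec in employees if low <= rec[1] <= high]
```

A lookup by name touches only a few records, while the range query on marks must read the whole file, which is why the same ordering is good for one request and poor for the other.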
Sequential File Organization
A search of a sequential file starts from the beginning of the file, and new records can be added only at the end of the file.
In a sequential file, it is not possible to add a record in the middle of the file without rewriting the file.
It is one of the simplest methods of file organization. Records are stored one after the other in a sequential manner. This can be achieved in two ways.
In the first method, records are stored one after the other in the order in which they are inserted into the tables. To modify or delete a record, the record is first searched for in the memory blocks. Once it is found, it is marked for deletion and the new record is entered.
In the second method, records are sorted (in either ascending or descending order) each time they are inserted into the system. This method is called the sorted file method. Records may be sorted on the primary key or on any other column. Whenever a new record is inserted, it is added at the end of the file, and the file is then sorted on the key value so that the record ends up in its correct position. In the case of an update, the record is updated and the file is sorted again to place the updated record in the right position. The same applies to a delete.
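The sorted file method can be sketched as follows. This is a minimal in-memory illustration, assuming records are (key, data) tuples held in a Python list; the function names are hypothetical. Instead of appending and re-sorting the whole file as described above, it uses `bisect.insort`, which places the record directly at its sorted position and gives the same final order.

```python
from bisect import insort

def insert_sorted(file_records, record):
    """Insert a (key, data) record so the list stays in ascending key order."""
    insort(file_records, record)

def update_key(file_records, old_key, new_key):
    """Change a record's key, then re-place it at its correct position."""
    for i, (key, data) in enumerate(file_records):
        if key == old_key:
            del file_records[i]
            insort(file_records, (new_key, data))
            return True
    return False

records = []
for rec in [(30, "c"), (10, "a"), (20, "b")]:
    insert_sorted(records, rec)
```

After the three insertions the list is ordered by key even though the records arrived out of order.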
A sequential file is best used when the structure of the storage space is also sequential in nature.
Indexed sequential access is an advanced sequential file organization method. Here records are stored in the file in order of primary key. For each primary key, an index value is generated and mapped to the record. This index is simply the address of the record in the file.
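The idea of an index that maps each primary key to the address of its record can be sketched as below. The record format (one comma-separated record per line, key first), the sample data and the function names are assumptions made for the illustration; an in-memory `io.BytesIO` stands in for the data file.

```python
import io

def build_index(f):
    """Map each primary key to the byte offset of its record (one record per line)."""
    index = {}
    f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        key = int(line.split(b",")[0])
        index[key] = offset
    return index

def fetch(f, index, key):
    """Jump straight to the record's address instead of scanning the file."""
    f.seek(index[key])
    return f.readline().decode().rstrip("\n")

data = io.BytesIO(b"101,Asha\n102,Bilal\n105,Chen\n")
index = build_index(data)
```

With the index in memory, `fetch(data, index, 105)` seeks directly to the stored offset, while reading the file from the start still gives the records in key order.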
Indexed sequential access file organization combines sequential file and direct access file organization.
In an indexed sequential access file, records are stored on a direct access device, such as a magnetic disk, and are located by a primary key.
The file can have multiple keys, and these keys can be alphanumeric. The key by which the records are ordered is called the primary key.
The data can be accessed either sequentially or randomly using the index. The index is stored in a file and read into memory when the file is opened.
Advantages:
Both sequential and random file access are possible.
Records are accessed very quickly if the index table is properly organized.
Disadvantages:
It requires unique keys and periodic reorganization.
Searching the index adds time to every data access or retrieval.
It uses storage space less efficiently than other file organizations.
In a direct access file, all records are stored on a direct access storage device (DASD), such as a hard disk. The records are placed randomly throughout the file.
The records do not need to be in sequence, because they are updated directly and rewritten in the same location.
This file organization is useful for immediate access to a large amount of information. It is used for accessing large databases.
Records are stored at random locations on the disk. This randomization can be achieved by any of several techniques:
1. Direct addressing,
2. Directory lookup,
3. Hashing.
Direct addressing: With equal-size records, the available disk space is divided into nodes, each large enough to hold one record. The numeric value of the primary key determines the node in which a particular record is stored.
Directory lookup: The index is not of the direct access type but is a dense index (there is an index record for every search key value in the database), maintained using a structure suitable for index operations. Retrieving a record involves searching the index for the record address and then accessing the record itself. The storage management scheme depends on whether fixed-size or variable-size nodes are used. Directory lookup requires more accesses for retrieval and update, since searching the index generally takes more than one access. In both direct addressing and directory lookup, some provision must be made to handle collisions.
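Direct addressing with equal-size records can be sketched as follows. The record layout (a 4-byte integer key followed by a 16-byte name), the use of the key itself as the node number and the function names are all assumptions for this illustration; an in-memory `io.BytesIO` stands in for the disk, and no collision handling is included.

```python
import io
import struct

# Assumed fixed-size layout: 4-byte little-endian key + 16-byte name field.
RECORD = struct.Struct("<i16s")

def write_record(f, key, name):
    """Store the record in the node whose number equals the numeric key."""
    f.seek(key * RECORD.size)
    f.write(RECORD.pack(key, name.encode()))

def read_record(f, key):
    """Compute the node address from the key and read that node directly."""
    f.seek(key * RECORD.size)
    raw = f.read(RECORD.size)
    if len(raw) < RECORD.size:
        return None                      # node lies beyond the end of the file
    stored_key, name = RECORD.unpack(raw)
    return stored_key, name.rstrip(b"\x00").decode()

disk = io.BytesIO()
write_record(disk, 3, "Chen")
write_record(disk, 1, "Asha")
```

Because every node has the same size, the address of a record is a simple multiplication, so a read needs a single access regardless of how many records are stored.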
A direct access file is well suited to online transaction processing (OLTP) systems, such as an online railway reservation system.
Its main drawback is that it is expensive.
Hashing
Hashing is an approach in which the time required to search for an element does not depend on the total number of elements. With a hashing data structure, an element can be found in constant time on average, provided collisions are rare. Hashing is an effective way to reduce the number of comparisons needed to search for an element in a data structure.
Hashing is the process of indexing and retrieving an element (data) in a data structure so that the element can be found faster using a hash key.
Here, the hash key is a value that gives the index at which the actual data is likely to be stored in the data structure.
This data structure uses a hash table to store data. Every data value is inserted into the hash table according to its hash key, which maps the data to an index in the table. The hash key is generated for every data value by a hash function, so every entry in the hash table is placed according to the hash key that the hash function produces.
A hash table is an array that maps a key (data) into the data structure with the help of a hash function, so that insertion, deletion and search operations are performed very quickly, in constant time on average.
A hash function is a function that takes a piece of data (i.e. the key) as input and produces an integer (i.e. the hash value) as output, which maps the data to a particular index in the hash table.
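The relationship between key, hash function, hash value and table index can be sketched as below. The particular hash function (sum of character codes), the table size of 10 and the names `put` and `get` are assumptions chosen for the illustration; collisions are deliberately ignored here, so a colliding key simply overwrites the earlier entry.

```python
TABLE_SIZE = 10

def hash_function(key):
    """Turn a string key into an integer hash value, then into a table index."""
    hash_value = sum(ord(ch) for ch in key)
    return hash_value % TABLE_SIZE

table = [None] * TABLE_SIZE

def put(key, value):
    """Store the entry at the index given by the hash function."""
    table[hash_function(key)] = (key, value)   # overwrites on collision

def get(key):
    """Look only at the one index the hash function points to."""
    entry = table[hash_function(key)]
    return entry[1] if entry and entry[0] == key else None
```

Both `put` and `get` compute one index and touch one array slot, which is where the constant average time comes from.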
Hashing Functions
Various types of hash function are used to place data in a hash table.
1. Division method
Here the hash function depends on the remainder of a division. For example, suppose the records 52, 68, 99 and 84 are to be placed in a hash table of size 10. With H(key) = key % table size:
52 % 10 = 2
68 % 10 = 8
99 % 10 = 9
84 % 10 = 4
so the records are stored at indices 2, 8, 9 and 4 respectively.
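The division method for these four records can be checked with a few lines of Python. The function name `division_hash` is chosen for this sketch only.

```python
def division_hash(key, table_size):
    """Division method: the index is the remainder of key / table size."""
    return key % table_size

indices = {key: division_hash(key, 10) for key in (52, 68, 99, 84)}
print(indices)   # {52: 2, 68: 8, 99: 9, 84: 4}
```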
H(key) = 124 + 655 + 12 = 791
Collision
A collision is a situation in which the hash function returns the same hash key for more than one record. Resolving a collision may in turn lead to an overflow condition. A hash function that produces frequent collisions and overflows is a poor hash function.
1) Chaining
In this method an additional field, the chain, is stored with the data. A chain is maintained at the home bucket: when a collision occurs, the colliding data is kept in a linked list attached to that bucket.
Example: Consider a hash table of size 10 with the hash function H(key) = key % table size, and let the keys to be inserted be 31, 33, 77, 61. Keys 31 and 61 both hash to bucket 1, so bucket 1 holds two records, which are maintained in a linked list, that is, by the chaining method.
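The chaining example can be sketched as follows. For brevity the sketch uses a Python list as each bucket's chain rather than an explicit linked list; the function names are hypothetical.

```python
def build_chained_table(keys, size=10):
    """Each bucket holds a chain (here a Python list) of all keys hashing to it."""
    table = [[] for _ in range(size)]
    for key in keys:
        table[key % size].append(key)
    return table

def search(table, key):
    """Go to the home bucket, then walk its chain."""
    return key in table[key % len(table)]

table = build_chained_table([31, 33, 77, 61])
print(table[1])   # [31, 61]
```

Bucket 1 ends up holding both 31 and 61, while 33 and 77 sit alone in buckets 3 and 7.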
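2) Linear probing
In this method, when a collision occurs the record is placed in the next empty bucket found by moving linearly down the table from the home bucket.
Example: Consider a hash table of size 10 with the hash function H(key) = key % table size, and let the keys to be inserted be 56, 64, 36, 71.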
Here 56 and 36 both hash to bucket 6. With linear probing, the colliding record is placed in the next empty bucket further down the table, so 36 is placed at index 7.
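The linear probing example can be traced with a short sketch. The function name and the choice to wrap around to the start of the table when the end is reached are assumptions of this illustration.

```python
def insert_linear(table, key):
    """Place key in its home bucket, or in the next free bucket after it."""
    size = len(table)
    home = key % size
    for step in range(size):
        index = (home + step) % size     # wrap around at the end of the table
        if table[index] is None:
            table[index] = key
            return index
    raise OverflowError("hash table is full")

table = [None] * 10
placed = [insert_linear(table, k) for k in (56, 64, 36, 71)]
print(placed)   # [6, 4, 7, 1]
```

Key 36 finds bucket 6 taken by 56 and moves one step down to index 7; the other keys land in their home buckets.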
3) Quadratic probing
This method is used to address the clustering problem. The x-th probe position is given by H_x(key) = (H(key) + x*x) % table size, for x = 0, 1, 2, ... Suppose we have to insert the following elements: 67, 90, 55, 17, 49.
67, 90 and 55 can be inserted directly, at indices 7, 0 and 5. For 17, x = 0 gives (17 + 0*0) % 10 = 7, but bucket 7 is already occupied by 67. Incrementing x to 1 gives (17 + 1*1) % 10 = 8. Bucket 8 is empty, so 17 is placed at index 8. Finally, 49 hashes to the empty bucket 9.
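The quadratic probing example can be traced with the sketch below. The function name and the limit of `size` probes before giving up are assumptions of this illustration.

```python
def insert_quadratic(table, key):
    """Probe home, home + 1, home + 4, home + 9, ... (mod table size)."""
    size = len(table)
    home = key % size
    for x in range(size):
        index = (home + x * x) % size
        if table[index] is None:
            table[index] = key
            return index
    raise OverflowError("no free bucket found along the probe sequence")

table = [None] * 10
placed = [insert_quadratic(table, k) for k in (67, 90, 55, 17, 49)]
print(placed)   # [7, 0, 5, 8, 9]
```

Only 17 needs a second probe: its home bucket 7 is taken, and x = 1 moves it to index 8.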
4) Double hashing
It is a technique in which two hash functions are used when a collision occurs. The first hash function is the same as in the division method. For the second hash function a commonly used form is
H2(key) = P - (key mod P)
where P is a prime number smaller than the size of the hash table. This form always yields a value from 1 to P, so the step size is never zero.
Consider a hash table of size 10 and the keys 67, 90, 55, 17. The keys 67, 90 and 55 can be inserted using the first hash function, but for 17 the home bucket (index 7) is already full, so the second hash function H2(key) = P - (key mod P) is used. P must be a prime number smaller than the table size, so P is taken to be 7:
H2(17) = 7 - (17 % 7) = 7 - 3 = 4
This means we take 4 jumps from bucket 7 to place 17: (7 + 4) % 10 = 1. Therefore 17 is placed at index 1.
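The double hashing example can be traced with the sketch below. The function name, the default P = 7 and the rule of repeating the same jump on further collisions are assumptions of this illustration; with a table size of 10 an even step does not visit every bucket, so the sketch gives up after `size` probes.

```python
def insert_double(table, key, prime=7):
    """Use H1 = key % size for the home bucket and H2 = P - (key % P) as the jump."""
    size = len(table)
    home = key % size
    step = prime - (key % prime)         # always between 1 and P, never zero
    for i in range(size):
        index = (home + i * step) % size
        if table[index] is None:
            table[index] = key
            return index
    raise OverflowError("no free bucket found along the probe sequence")

table = [None] * 10
placed = [insert_double(table, k) for k in (67, 90, 55, 17)]
print(placed)   # [7, 0, 5, 1]
```

For 17 the home bucket 7 is occupied, the jump is 4, and the first probe after the collision lands on the empty bucket 1.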