Module 5 File Organization 1
File Organization
A file organization is a way of arranging the records in a file when the file is stored on
secondary storage (disk, tape, etc.).
The different ways of arranging the records enable different operations to be carried out efficiently
over the file.
A database management system supports several file organization techniques. The most important
task of the DBA is to choose the best organization for each file, based on its use.
The organization of records in a file is influenced by a number of factors that must be taken into
consideration while choosing a particular technique. These factors are:
a) fast retrieval, updating and transfer of records,
b) efficient use of disk space,
c) high throughput,
d) type of use,
e) efficient manipulation,
f) security from unauthorized access,
g) scalability,
h) reduction in cost,
i) protection from failure.
File & Record
• A file is a sequence of related records.
• A collection of field names and their corresponding data types constitutes a record type.
• A data type, associated with each field, specifies the types of values a field can take. All records
in a file are of the same record type.
Records and Record Types
Data is generally stored in the form of records. A record is a collection of fields or data items,
and each data item is formed of one or more bytes. Each record has a unique identifier called
the record-id. The records in a file are one of the following two types:
(i) Fixed Length records.
(ii) Variable Length records.
Fixed Length Record
Each record in the file has exactly the same size.
The record slots are uniform and are arranged contiguously in the file.
Advantage
1. Insertion and deletion of records in the file are simple to implement, since the space made
available by a deleted record is the same as that needed to insert a new record.
Disadvantages
1. Since the length of every record is fixed, memory space is wasted. For example, if the
length is set to 50 characters and most of the records are shorter than 25
characters, more than half of each slot goes unused.
2. It is an inflexible approach. For example, if the length of the record must be increased,
then major changes in the programs and the database are needed.
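The slot arithmetic behind fixed-length records can be sketched as follows. This is a minimal illustration, not a DBMS implementation: the record layout (a 4-byte integer id plus a 20-byte name) and the helper names `write_record` and `read_record` are assumptions made for the example, and an in-memory `io.BytesIO` stands in for a file on disk.

```python
import io
import struct

# Assumed record layout: 4-byte integer id + 20-byte name, padded with NUL bytes.
RECORD = struct.Struct("<i20s")


def write_record(f, slot, rec_id, name):
    """Store a record in the given slot; every slot has the same size."""
    f.seek(slot * RECORD.size)
    f.write(RECORD.pack(rec_id, name.encode("ascii")))


def read_record(f, slot):
    """Jump straight to a slot: offset = slot number * record size."""
    f.seek(slot * RECORD.size)
    rec_id, raw = RECORD.unpack(f.read(RECORD.size))
    return rec_id, raw.rstrip(b"\x00").decode("ascii")


f = io.BytesIO()                  # stands in for a file on secondary storage
write_record(f, 0, 1, "Asha")
write_record(f, 1, 2, "Ravi")
write_record(f, 1, 3, "Meena")    # reuse slot 1: the new record fits exactly
print(RECORD.size, read_record(f, 0), read_record(f, 1))
```

Note that "Asha" occupies the full 20-byte name field even though it needs only 4 bytes, which is the space wastage described above.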
Variable Length Record
Every record in the file need not be of the same size. Therefore, the records in the file have
different sizes.
Advantages
1. It reduces manual mistakes, as the database automatically adjusts the size of each record.
2. It saves a lot of memory space when record lengths vary.
3. It is a flexible approach, since future enhancements are easy to implement.
Disadvantages
1. It increases the overhead of the DBMS, because the database has to keep track of the size of
every record.
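One common way to keep track of record sizes is to store a length prefix in front of each record. The sketch below assumes a 2-byte prefix and uses hypothetical helpers `append_record` and `read_all`; real systems use other layouts as well (for example slotted pages), so treat this only as an illustration of the bookkeeping overhead.

```python
import io
import struct

LENGTH = struct.Struct("<H")   # assumed 2-byte length prefix stored before each record


def append_record(f, text):
    """Write the record's size first, then its bytes."""
    data = text.encode("utf-8")
    f.seek(0, io.SEEK_END)
    f.write(LENGTH.pack(len(data)) + data)


def read_all(f):
    """Walk the file using the stored sizes to find record boundaries."""
    f.seek(0)
    records = []
    while True:
        header = f.read(LENGTH.size)
        if len(header) < LENGTH.size:
            break
        (size,) = LENGTH.unpack(header)
        records.append(f.read(size).decode("utf-8"))
    return records


f = io.BytesIO()
append_record(f, "Asha")
append_record(f, "Venkataraman")
print(read_all(f), len(f.getvalue()))
```

Each record uses only as many bytes as it needs, plus the 2 bytes of size information the file must maintain.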
Types Of Files
The following three types of files are used in database systems :
1. Master file
2. Transaction file
3. Report file.
1. Master file: This file contains information of a permanent nature about the entities. The master
file acts as a source of reference data for processing transactions and accumulates
information based on the transaction data.
2. Transaction file: This file contains records that describe the activities carried out by the
organization. This file is created as a result of processing transactions and preparing transaction
documents. These records are also used to update the master file permanently.
3. Report file: This file is created by extracting data from different records to prepare a report,
e.g. a report file about the weekly sales of a particular item.
File Organization Techniques
A file organization is a way of arranging the records in a file when the file is stored on secondary
storage (disk, tape, etc.). There are different types of file organizations that are used by
applications. The operations to be performed and the selection of storage device are the major
factors that influence the choice of a particular file organization. The different types of
file organizations are as follows :
1. Heap file organization
2. Sequential file organization
3. Indexed-sequential file organization
4. Hashing or Direct file organization
Heap File Organization
A heap file is an unordered set of records.
Heap file organization is the most basic form of file organization: records are simply placed in
data blocks with no ordering.
In such an organization, records are stored in the file in the order in which they are inserted, and
new records are always placed at the end of the file.
In the file, every record has a unique id, and every page in a file is of the same size. It is the
DBMS's responsibility to store and manage the new records.
Inserting a New Record
The insertion of a new record is very efficient. It is performed in the following steps:
• The last disk block of the file is copied into a buffer.
• The new record is added.
• The block in the buffer is then rewritten back to the disk.
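The three insertion steps can be sketched as below. The disk is modelled as a Python list of blocks, and `BLOCK_CAPACITY` and `heap_insert` are names invented for this example; allocating a fresh block when the last one is full is an assumption the steps above do not spell out.

```python
BLOCK_CAPACITY = 3   # assumed number of records that fit in one block


def heap_insert(disk, record):
    """Append a record to the last block of a heap file (a list of blocks)."""
    if not disk or len(disk[-1]) >= BLOCK_CAPACITY:
        disk.append([])              # assumption: a full last block means a new block
    buffer = list(disk[-1])          # 1. copy the last disk block into a buffer
    buffer.append(record)            # 2. add the new record
    disk[-1] = buffer                # 3. rewrite the buffered block back to disk
    return len(disk) - 1, len(buffer) - 1   # (block number, slot in block)


disk = []
for name in ["r1", "r2", "r3", "r4"]:
    location = heap_insert(disk, name)
print(disk, location)
```

Only the last block is ever touched, which is why insertion stays cheap no matter how large the file grows.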
To delete or update a record, we first need to search for it. Searching for a record is
the same as retrieving it: scan from the beginning of the file until the record is found. If the file is
small, the record can be fetched quickly, but the larger the file, the more time is spent
fetching.
In addition, when a record is deleted, it is removed from its data block, but the space is
not freed and cannot be reused. Hence, as the number of records increases, the file
size also increases and efficiency decreases. For the database to perform better, the DBA has to free
this unused space periodically.
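A rough sketch of searching, deleting and periodic clean-up in a heap file follows. Records are modelled as dictionaries in a list, a deleted slot is marked with a tombstone, and the function names (`heap_search`, `heap_delete`, `compact`) are hypothetical.

```python
DELETED = None   # tombstone: the slot stays in the file but holds no live record


def heap_search(records, key):
    """Linear scan from the start; returns (position, slots examined)."""
    for pos, rec in enumerate(records):
        if rec is not DELETED and rec["id"] == key:
            return pos, pos + 1
    return None, len(records)


def heap_delete(records, key):
    """Find the record, then mark its slot as deleted without freeing it."""
    pos, _ = heap_search(records, key)
    if pos is not None:
        records[pos] = DELETED
    return pos is not None


def compact(records):
    """Periodic reorganization: drop the tombstones to reclaim space."""
    return [rec for rec in records if rec is not DELETED]


heap = [{"id": 7}, {"id": 3}, {"id": 9}, {"id": 1}]
heap_delete(heap, 3)
print(len(heap), len(compact(heap)), heap_search(heap, 9))
```

The file still holds four slots after the delete; only the compaction step shrinks it to three.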
Advantages
1. Insertion of a new record is fast and efficient.
2. The filling factor of this file organization is 100%.
3. Space is fully utilized and conserved.
Disadvantages
1. Searching for and accessing records is slow.
2. Deletion of many records results in wasted space.
3. The cost of updating data is comparatively high.
Sequential File Organization
In sequential file organization, records are stored in a sequential order according to the "search
key".
A search key is an attribute or a set of attributes used to order the records.
The search key need not be the primary key.
It is the simplest method of file organization.
The sequential method is based on the tape model. Devices that support sequential access include
magnetic tapes, cassettes, card readers, etc. Editors and compilers also use this approach to
access files.
The structure of a sequential file is shown in the figure.
The records are stored in sequential order, one after another.
Pointers are used to move from any record to the next one, which allows
fast retrieval of records.
Sequential file organization is a type of file organization used in database management
systems (DBMS) where data is stored sequentially in a file or table.
In this method, data is accessed sequentially in the order it is stored, starting from the
beginning of the file and proceeding towards the end.
In a sequential file, each record is stored one after the other, without any index. This
makes it easy to add new records to the end of the file, but searching for specific records can be
slow and inefficient because the system may have to search through the entire file to find the desired
record.
Sequential file organization is suitable for situations where data is accessed in a serial manner,
such as batch processing or generating reports. It is also used in situations where data is not
frequently updated, as adding or deleting a record can cause the entire file to be rewritten.
However, this method is not efficient for situations that require frequent updates or random
access to data. To improve the efficiency of accessing data in a sequential file, various
techniques like buffering, caching, and memory-mapped files can be used.
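The cost of keeping a sequential file in key order can be illustrated with a small sketch. Records are `(key, value)` tuples in a list, and `insert_sorted` and `scan` are hypothetical helpers; the second value returned by `insert_sorted` counts how many existing records sit after the insertion point and would have to be rewritten on disk.

```python
import bisect


def insert_sorted(records, record):
    """Insert while keeping the file ordered by search key."""
    keys = [r[0] for r in records]
    pos = bisect.bisect_left(keys, record[0])
    moved = len(records) - pos           # records that must shift to make room
    return records[:pos] + [record] + records[pos:], moved


def scan(records, key):
    """Read from the beginning; returns (record or None, records examined)."""
    examined = 0
    for rec in records:
        examined += 1
        if rec[0] == key:
            return rec, examined
        if rec[0] > key:                 # sorted order lets the scan stop early
            break
    return None, examined


records = [(10, "A"), (20, "B"), (40, "D")]
records, moved = insert_sorted(records, (30, "C"))
print(records, moved, scan(records, 30), scan(records, 25))
```

Inserting near the front of a large file moves almost every record, which is why this organization suits batch processing better than frequent updates.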
Advantages
1. It is easy to understand.
2. Efficient for small files.
3. Construction and reconstruction are much easier than for other file organizations.
4. Supports tape media, editors and compilers.
5. It contains sorted records.
Disadvantages
1. Inefficient for medium and large files.
2. Updates and maintenance are not easy.
3. Inefficient use of storage space because of fixed-size blocks.
4. Linear search takes more time.
5. Before an update, all transactions must be stored in sequential order.
Indexed Sequential File Organization
Indexed sequential file organization is used to overcome the disadvantages of sequential file
organization.
It also preserves the advantages of sequential access.
This organization enables fast searching of records through the use of an index.
Some basic terms associated with indexed sequential file organization are as follows:
Block: A block is a unit of storage in which records are saved.
Index: An index is a table with a search key by which the block of a record can be found.
Pointer: A pointer is a variable that points from an index entry to the starting address of a block.
Components Of Indexed Sequential File Organization
Data File: The records in the data file are stored in sorted order based on a key field.
Index File: The index file contains index entries, which are pointers to blocks (or records) in the
data file. These index entries are also sorted based on the key.
Overflow Area: This is used to store new records that are inserted but cannot fit into the primary
data area. The overflow area is also organized in a sorted manner.
Operations of ISAM (Indexed Sequential Access Method)
Searching: When searching for a record, the system first looks in the index to find the disk block
that contains the record. Then, that specific block is fetched to retrieve the record. This usually takes
two disk I/O operations.
Insertion: If a new record needs to be inserted, it is placed in its correct sorted position. If the block
is full, the record is placed in the overflow area.
Deletion: Deleting a record involves marking it as deleted. The space is then reclaimed during a
subsequent reorganization of the data file.
Updating: Updating a record involves either modifying it if the key is not changed, or deleting and
re-inserting it if the key is modified.
Example of ISAM
Let's consider a simplified library database where we have a file containing book information sorted
by ISBN numbers.
Data File              Index File
ISBN   TITLE           ISBN   Block Address
123    Book A          123    1
234    Book B          234    2
345    Book C          345    3
456    Book D          456    4

ISBN   TITLE
133    New Book
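The library example can be turned into a small sketch of ISAM searching and insertion. It assumes one record per block (as the index table suggests) and that the record with ISBN 133 is the new one being inserted; `BLOCK_CAPACITY`, `find_block`, `insert` and `search` are names invented for the illustration.

```python
import bisect

BLOCK_CAPACITY = 1   # assumption: each block holds one record, as in the tables

blocks = {1: [(123, "Book A")], 2: [(234, "Book B")],
          3: [(345, "Book C")], 4: [(456, "Book D")]}
index = [(123, 1), (234, 2), (345, 3), (456, 4)]   # (ISBN, block address), sorted
overflow = []                                       # overflow area, kept sorted


def find_block(isbn):
    """Use the index: pick the last entry whose key is <= the ISBN."""
    keys = [k for k, _ in index]
    i = bisect.bisect_right(keys, isbn) - 1
    return index[max(i, 0)][1]


def insert(isbn, title):
    """Place the record in its block, or in the overflow area if the block is full."""
    addr = find_block(isbn)
    if len(blocks[addr]) < BLOCK_CAPACITY:
        bisect.insort(blocks[addr], (isbn, title))
    else:
        bisect.insort(overflow, (isbn, title))


def search(isbn):
    """Look in the indexed block first, then in the overflow area."""
    for rec in blocks[find_block(isbn)] + overflow:
        if rec[0] == isbn:
            return rec
    return None


insert(133, "New Book")
print(find_block(133), overflow, search(345), search(133))
```

ISBN 133 belongs in block 1, which is already full, so the record lands in the overflow area while the primary data file stays untouched.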
Advantages
1. Efficient for medium and large files.
2. Easy to update.
3. Easier to update than direct files.
4. Efficient use of storage.
5. Searching of records is fast.
6. Maintains the advantages of a sequential file.
Disadvantages
1. Inefficient for small files.
2. It is an expensive method.
3. More complex structure than a sequential file.
4. Indexes need additional storage space.
5. Performance degrades as files grow.
Hash File Organization
Hash file organization is a type of file organization where data is stored in a file or table using a
hash function.
A hash function is a mathematical function that converts a key value into a hash code, which is
used to map the key to the location in the file or table where the data is stored.
Figure: the hash function maps a key field to the address of the record.
Any mathematical function can be used as a hash function. It can be simple or complex.
The hash function is applied to columns or attributes to obtain the block address. Because the
records appear to be stored in random order, this is also known as direct or random file organization.
If the hash function is applied to a key column, that column is called the hash key; if it is
applied to a non-key column, that column is called the hash column.
Hash file organization is suitable for situations where data needs to be accessed quickly and
efficiently based on the value of its key, and where the data is not frequently updated. However,
if the hash function is poorly designed or if there are collisions (where two different keys map to
the same location), it can result in poor performance and decreased efficiency.
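A minimal sketch of a hash file is shown below. It assumes three data blocks and a simple remainder hash; the names `NUM_BLOCKS`, `hash_key`, `insert` and `lookup` are hypothetical, and colliding records are simply kept together in the same block.

```python
NUM_BLOCKS = 3   # assumed number of data blocks (buckets)


def hash_key(key):
    """Map a key to a block address (simple remainder hash)."""
    return key % NUM_BLOCKS


blocks = [[] for _ in range(NUM_BLOCKS)]


def insert(key, value):
    """Store the record in the block chosen by the hash function."""
    blocks[hash_key(key)].append((key, value))


def lookup(key):
    """Go straight to one block; only that block is searched."""
    for k, v in blocks[hash_key(key)]:
        if k == key:
            return v
    return None


for key in (101, 102, 103, 104):
    insert(key, f"record {key}")

print(hash_key(101), hash_key(104), lookup(104), len(blocks[2]))
```

Keys 101 and 104 both hash to block 2, which is a collision: the lookup still works, but the more records share a block, the longer the search inside it takes.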
In the diagram above, a hash table maps the hashed keys to a specific data block in memory.
Data blocks are represented as rectangular boxes, with each box containing one or more data
records. Each data record is shown as a separate row in the data block.
When a new data record is inserted into the hash file, the hash function is applied to its key,
and the resulting hash value determines which data block it should be stored in. The
data record is then inserted into the appropriate data block.
In this example, Record 1 and Record 4 are stored in Data Block 1, Record 5 and Record 2 in
Data Block 2, and Record 3, Record 7, and Record 6 in Data Block 3.
There are various hashing techniques. These are as follows :
Mid Square Method
Folding Method
Division
Division-Remainder Method
Radix Transformation Method
Polynomial Conversion Method
Truncation Method
Conversion using Digital Gates
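Four of the listed techniques can be sketched in a few lines each. These follow the usual textbook definitions for integer keys; the function names and parameters (`table_size`, `digits`, `part_size`) are chosen for this illustration, and variants of each method exist.

```python
def division_remainder(key, table_size):
    """Division-remainder: the address is the remainder of key / table size."""
    return key % table_size


def mid_square(key, digits):
    """Mid-square: square the key and take the middle digits."""
    squared = str(key * key)
    start = max((len(squared) - digits) // 2, 0)
    return int(squared[start:start + digits])


def folding(key, part_size, table_size):
    """Folding: split the key into equal parts, add them, reduce to table size."""
    text = str(key)
    parts = [int(text[i:i + part_size]) for i in range(0, len(text), part_size)]
    return sum(parts) % table_size


def truncation(key, digits):
    """Truncation: keep only the last few digits of the key."""
    return key % 10 ** digits


print(division_remainder(1234, 97))     # 1234 = 12 * 97 + 70
print(mid_square(1234, 3))              # 1234 squared is 1522756
print(folding(123456789, 3, 1000))      # 123 + 456 + 789 = 1368
print(truncation(123456789, 3))
```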
Direct File Organization
Direct file organization is used to access the records of a file randomly.
In direct file organization, records can be stored anywhere in the storage area, but can be
accessed directly without any sequential searching.
It overcomes the drawbacks of sequential, indexed sequential and B-tree organization.
For efficient organization and direct access to individual records, a mapping or
transformation process is required that converts the key field of a record into its physical
address.
Direct file organization depends on hashing, which provides the basis of the mapping
procedure.
Because different keys can map to the same address, collision resolution techniques are needed.
Devices that support direct access include CDs, floppy disks, etc.
Direct file organization is also known as random file organization.
Searching (reading or retrieving from a direct file): To read a record from a direct file, supply
the key field of the record. The hashing algorithm maps that key field to the
physical location of the record.
Updating records in a direct file:
1. Adding a new record: To add a new record to a direct file, specify its key field. The
mapping procedure and the collision resolution technique yield a free address for
that record.
2. Deleting a record: To delete a record, first search for it and then
change its status code to deleted or vacant.
3. Modifying a record: To modify a record, first search for it, then make the
necessary modifications, and rewrite the modified record to the same location.
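These operations can be sketched with a small slot table. The example assumes a remainder hash and linear probing as the collision resolution technique (one of several possible choices); the class `DirectFile` and its method names are invented for the illustration.

```python
VACANT, DELETED = "vacant", "deleted"   # status codes for slots without a live record


class DirectFile:
    def __init__(self, size=7):
        self.slots = [VACANT] * size

    def _probe(self, key):
        """Hashed address first, then the following slots (linear probing)."""
        start = key % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def find(self, key):
        for pos in self._probe(key):
            slot = self.slots[pos]
            if slot == VACANT:
                return None                  # never-used slot: key is not stored
            if slot != DELETED and slot[0] == key:
                return pos
        return None

    def add(self, key, data):
        for pos in self._probe(key):
            if self.slots[pos] in (VACANT, DELETED):
                self.slots[pos] = (key, data)
                return pos
        raise RuntimeError("file is full")

    def delete(self, key):
        pos = self.find(key)
        if pos is not None:
            self.slots[pos] = DELETED        # change the status code only
        return pos

    def modify(self, key, data):
        pos = self.find(key)
        if pos is not None:
            self.slots[pos] = (key, data)    # rewrite at the same location
        return pos


f = DirectFile()
first = f.add(10, "A")      # 10 % 7 == 3
second = f.add(17, "B")     # 17 % 7 == 3 as well: collision, next free slot is used
f.delete(10)
f.modify(17, "B2")
print(first, second, f.find(17), f.add(24, "C"))
```

A deleted slot keeps its "deleted" status so that searches for records stored beyond it still succeed, and the slot can be reused by a later insertion.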
Advantages
1. Records do not need to be sorted when they are added.
2. It gives the fastest retrieval of records.
3. It gives efficient use of memory.
4. Supports fast storage devices.
5. Search time depends on the mapping procedure, not on the logarithm of the number of search keys
as in a B-tree.
Disadvantages
1. Space is wasted if the hashing method is not chosen properly.
2. It does not support sequential storage devices.
3. A direct file system is complex and expensive.
4. Extra overhead due to collision resolution techniques.
Indexing
An index is a collection of data entries used to locate a record in a file.
Each index table record consists of two parts: the first part holds the value of a prime or
non-prime attribute of the file record, known as the indexing field, and the second part holds a
pointer to the location where the record is physically stored.
In general, an index table is like the index of a book, which consists of topic names and
page numbers.
When searching for a file record, the index is searched to find the record's address
instead of scanning the records in secondary memory.
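The two-part index entry can be sketched as a mapping from an indexing-field value to a byte offset. The record format (comma-separated text lines), the field `roll_no` and the helper `fetch` are assumptions for this example, with an in-memory buffer standing in for the data file.

```python
import io

data_file = io.BytesIO()   # stands in for the data file on secondary storage
index = {}                 # indexing-field value -> byte offset of the record

for roll_no, name in [(3, "Meena"), (1, "Asha"), (2, "Ravi")]:
    index[roll_no] = data_file.tell()              # pointer part of the index entry
    data_file.write(f"{roll_no},{name}\n".encode("utf-8"))


def fetch(roll_no):
    """Look up the offset in the index, then read only that record."""
    offset = index.get(roll_no)
    if offset is None:
        return None
    data_file.seek(offset)
    return data_file.readline().decode("utf-8").rstrip("\n")


print(index, fetch(2), fetch(9))
```

The data file itself is never scanned: the index answers "where is it?" and a single seek retrieves the record.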
On the basis of properties that affect the efficiency of searching, the indexes can be
classified into two categories.
1. Ordered indexing
2. Hashed indexing
Ordered indexing: In ordered indexing, the records of the file are stored in some sorted order in
physical memory. The values in the index are ordered (sorted) so that binary search can be
performed on the index.
Hashed indexing: Hashing allows us to avoid accessing an index structure. A hashed index consists of two
fields: the first field holds search-key attribute values and the second field holds a
pointer to the hash file structure. A hashed index relies on record values being
uniformly distributed by a hash function.
Ordered Indexing
Ordered indexes can be divided into two categories.
1. Dense indexing
2. Sparse indexing.
Dense and Sparse Indexes
Dense index: In dense indexing, there is a record in the index table for each unique value of the
search-key attribute of the file, with a pointer to the first data record with that value. The other
records with the same search-key value are stored sequentially after the first record. The order
of data entries in the index differs from the order of data records, as shown in the figure.
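A dense index can be sketched as below, assuming the data file is sorted on the search key (a department name here) so that records with equal values are adjacent. The sample data and the helper `lookup` are invented for the illustration.

```python
import bisect

# Assumed data file, sorted on the search key (department).
data = [("CSE", "Asha"), ("CSE", "Ravi"), ("ECE", "Meena"),
        ("MECH", "John"), ("MECH", "Sara")]

# Dense index: one entry per distinct key value, pointing at its first record.
dense_index = []
for pos, (dept, _) in enumerate(data):
    if not dense_index or dense_index[-1][0] != dept:
        dense_index.append((dept, pos))


def lookup(dept):
    """Binary-search the index, then read the run of matching records."""
    keys = [k for k, _ in dense_index]
    i = bisect.bisect_left(keys, dept)
    if i == len(keys) or keys[i] != dept:
        return []                      # every value is indexed, so a miss is final
    pos = dense_index[i][1]
    result = []
    while pos < len(data) and data[pos][0] == dept:
        result.append(data[pos])
        pos += 1
    return result


print(dense_index, lookup("MECH"), lookup("EEE"))
```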
Advantages
1. It is an efficient technique for small and medium-sized data files.
2. Searching is comparatively fast and efficient.
Disadvantages
1. The index table is large and requires more memory space.
2. Insertion and deletion are comparatively complex.
3. Inefficient for large data files.
Sparse index: In contrast, in sparse indexing the index table holds entries for only some of the
search-key values of the file, each with a pointer to the first data record with that value.
To search for a record using a sparse index, we find the index entry with the largest value that is
less than or equal to the value we are looking for. Starting from the record that entry points to, a
linear search is performed to retrieve the desired record. There is at most one sparse index per
file, since it is not possible to build a sparse index that is not clustered.
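The sparse lookup procedure can be sketched as follows. The data file is assumed to be sorted on the key, and the index keeps one entry for every third record; `STEP` and `lookup` are hypothetical names chosen for the example.

```python
import bisect

# Assumed data file, sorted on the search key.
data = [(10, "A"), (20, "B"), (30, "C"), (40, "D"), (50, "E"), (60, "F")]

STEP = 3   # assumption: one index entry for every third record
sparse_index = [(data[i][0], i) for i in range(0, len(data), STEP)]


def lookup(key):
    """Find the largest index entry <= key, then scan forward linearly."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None                    # key is smaller than every indexed value
    pos = sparse_index[i][1]
    while pos < len(data) and data[pos][0] <= key:
        if data[pos][0] == key:
            return data[pos]
        pos += 1
    return None


print(sparse_index, lookup(50), lookup(35), lookup(5))
```

The index has only two entries for six records, but a lookup for key 50 still needs a short linear scan from the record with key 40.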
Advantages
1. The index table is small and hence saves memory space (especially in large files).
2. Insertion and deletion are comparatively easy.
Disadvantages
1. Searching is comparatively slower, since the index table is searched and then a linear search is
performed in secondary memory.
Clustered and Non-clustered Indexes
Clustered index: In a clustering index file, records are physically stored in order of a non-prime key
attribute that does not have a unique value for each record. The non-prime key field is known as the
clustering field and the index is known as the clustering index. It is structured like a dense index,
with one entry per distinct value of the clustering field, although a clustered index may also be
sparse. A file can have at most one clustered index, as it can be clustered on at most one
search-key attribute.
Non-clustered index: An index that is not clustered is known as a non-clustered index. A data file
can have more than one non-clustered index.