File Organization
Unit-5
• What is a File?
• A file is a collection of related records. The file size is limited by
the size of memory and of the storage medium.
1. File Activity
2. File Volatility
• Advantages of direct access file organization
• Direct access file helps in online transaction processing system (OLTP) like
online railway reservation system.
• In a direct access file, sorting of the records is not required.
• It accesses the desired records immediately.
• It updates several files quickly.
• It has better control over record allocation.
• Disadvantages of direct access file organization
• A direct access file does not provide a backup facility.
• It is expensive.
• It makes less efficient use of storage space than a sequential file.
• A hashed file uses a mathematical function to map each key to an address. The user
gives the key, the function maps the key to the address and passes it to the
operating system, and the record is retrieved.
• Hashing methods
• For key-address mapping, we can select one of several hashing methods. We
discuss a few of them here.
• Direct hashing
• In direct hashing, the key is the data file address without any algorithmic
manipulation. The file must therefore contain a record for every possible key.
• Modulo division hashing
• Also known as division remainder hashing, the modulo division method divides the
key by the file size and uses the remainder plus 1 for the address. This gives the
simple hashing algorithm address = (key mod list_size) + 1, where list_size is the
number of elements in the file. The reason for adding 1 to the mod operation result
is that addresses in our list start at 1 instead of 0.
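A minimal sketch of modulo division hashing as described above. The function name, the sample key 121267 and the list size 307 are illustrative assumptions, not values from these notes.

```python
def modulo_division_hash(key: int, list_size: int) -> int:
    """Map a key to an address in the range 1..list_size."""
    # The remainder lies in 0..list_size-1; adding 1 shifts it to 1..list_size.
    return key % list_size + 1


# Hypothetical six-digit key and a file of 307 elements.
print(modulo_division_hash(121267, 307))  # 121267 = 307 * 395 + 2, so address 3
```

Every key lands in 1..list_size, but different keys can share a remainder and therefore an address.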
• Digit extraction hashing
• Using digit extraction hashing, selected digits are extracted from the key and
used as the address. For example, using our six-digit employee number to hash to a
three-digit address (000–999), we could select the first, third and fourth digits
(from the left) and use them as the address.
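The digit selection above can be sketched as follows. The function name and the sample employee numbers are hypothetical; only the choice of the first, third and fourth digits comes from the text.

```python
def digit_extraction_hash(key: int, positions=(1, 3, 4), width: int = 6) -> int:
    """Build an address from selected digits of the key (positions count from the left, starting at 1)."""
    digits = str(key).zfill(width)  # pad short keys so every key has `width` digits
    return int("".join(digits[p - 1] for p in positions))


# Hypothetical six-digit employee numbers.
print(digit_extraction_hash(125870))  # digits 1, 5, 8 -> 158
print(digit_extraction_hash(379452))  # digits 3, 9, 4 -> 394
```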
• Collision
• Generally, the population of keys for a hashed list is greater than the number of
records in the data file.
• For example, if we have a file of 50 students for a class in which the students are
identified by the last four digits of their social security number, then there are 200
possible keys for each element in the file (10,000/50). Because there are many keys
for each address in the file, there is a possibility that more than one key will hash to
the same address in the file. We call the set of keys that hash to the same address in
our list synonyms.
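To make the 200-keys-per-address figure concrete, the sketch below groups all 10,000 possible four-digit keys by address. It assumes modulo division hashing with addresses 1 to 50; the notes do not say which hashing method the class file uses.

```python
from collections import defaultdict

FILE_SIZE = 50  # records in the file, from the example above


def address(key: int) -> int:
    # Assumed hashing method: modulo division with addresses 1..FILE_SIZE.
    return key % FILE_SIZE + 1


synonyms = defaultdict(list)
for key in range(10_000):  # every possible four-digit key, 0000..9999
    synonyms[address(key)].append(key)

print(len(synonyms))         # 50 distinct addresses
print(len(synonyms[1]))      # 200 keys share address 1
print(synonyms[1][:3])       # [0, 50, 100] are synonyms of one another
```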
• Collision resolution
• With the exception of the direct method, none of the methods we have discussed
for hashing creates one-to-one mappings.
• This means that when we hash a new key to an address, we may create a collision.
There are several methods for handling collisions, each of them independent of the
hashing algorithm. That is, any hashing method can be used with any collision
resolution method.
• Example: Load the keys 23, 13, 21, 14, 7, 8, and 15, in this order,
in a hash table of size 7 using quadratic probing with c(i) = i^2 and
the hash function: h(key) = key % 7
• The required probe sequences are given by:
h_i(key) = (h(key) + i^2) % 7, i = 0, 1, 2, 3
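A sketch of this example, assuming the probe offset is added to the home address, that is, slot = (h(key) + i^2) % 7. The function name and the max_probes parameter are hypothetical.

```python
def load_quadratic(keys, table_size, max_probes=4):
    """Insert keys using quadratic probing: slot = (h(key) + i*i) % table_size."""
    table = [None] * table_size
    for key in keys:
        home = key % table_size          # h(key)
        for i in range(max_probes):      # i = 0, 1, 2, 3
            slot = (home + i * i) % table_size
            if table[slot] is None:
                table[slot] = key
                break
        else:
            raise RuntimeError(f"no free slot found for key {key}")
    return table


table = load_quadratic([23, 13, 21, 14, 7, 8, 15], 7)
print(table)  # [21, 14, 23, 15, 7, 8, 13]
```

Under this assumption key 15 needs all four probes (slots 1, 2, 5, then 3) before it finds a free slot.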
• Double hashing achieves this by having two hash functions that both depend on
the hash key.
• The function c(i) = i*hp(r) satisfies Property 2 provided hp(r) and tableSize are
relatively prime.
Example: Load the keys 18, 26, 35, 9, 64, 47, 96, 36, and 70, in this order, into an
empty hash table of size 13:
(a) using double hashing with the first hash function: h(key) = key % 13 and the
second hash function: hp(key) = 1 + key % 12
(b) using double hashing with the first hash function: h(key) = key % 13 and the
second hash function: hp(key) = 7 - key % 7
Show all computations.
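The computations for both parts can be checked with a short sketch. It assumes the usual double hashing probe sequence, slot = (h(key) + i * hp(key)) % tableSize for i = 0, 1, 2, ...; the function and variable names are hypothetical.

```python
def load_double_hashing(keys, table_size, h, hp):
    """Insert keys using double hashing: slot = (h(key) + i * hp(key)) % table_size."""
    table = [None] * table_size
    for key in keys:
        for i in range(table_size):
            slot = (h(key) + i * hp(key)) % table_size
            if table[slot] is None:
                table[slot] = key
                break
        else:
            raise RuntimeError(f"no free slot found for key {key}")
    return table


KEYS = [18, 26, 35, 9, 64, 47, 96, 36, 70]


def h(key):
    return key % 13


# (a) second hash function hp(key) = 1 + key % 12
table_a = load_double_hashing(KEYS, 13, h, lambda key: 1 + key % 12)
# (b) second hash function hp(key) = 7 - key % 7
table_b = load_double_hashing(KEYS, 13, h, lambda key: 7 - key % 7)

print(table_a)  # [26, None, None, 70, None, 18, 9, 96, 47, 35, 36, None, 64]
print(table_b)  # [26, 9, None, None, None, 18, 70, 96, 47, 35, 36, None, 64]
```

In both parts only 9, 96 and 70 collide at their home address. For example, in (a) key 9 has h = 9 (taken by 35) and hp = 10, so it moves to (9 + 10) % 13 = 6.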
• Inverted files
• The inverted file may be the database file itself, rather than its index. It is the most
popular data structure used in document retrieval systems and is used on a large
scale, for example in search engines.
• An inverted index catalogs a collection of objects in their textual representations. Given a set of
documents, keywords and other attributes (possibly including relevance ranking) are assigned to
each document. The inverted index is the list of keywords, each with links to the
corresponding documents.
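A minimal sketch of an inverted index that maps each keyword to the documents containing it. The three one-sentence documents and the function name are hypothetical, chosen only for illustration; relevance ranking is left out.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword to the sorted list of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            word = token.strip(".,;:!?")  # drop surrounding punctuation
            if word:
                index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical document collection.
documents = {
    1: "That house has a garden.",
    2: "The garden has many flowers.",
    3: "The flowers are beautiful.",
}
index = build_inverted_index(documents)
print(index["garden"])   # [1, 2]
print(index["flowers"])  # [2, 3]
```

A query for a keyword is then a single dictionary lookup instead of a scan over every document.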
• Stopwords
• The stopword list is a group of keywords which should not be indexed. It consists of
keywords that are considered irrelevant to the index: keywords considered too common or
non-specific to be useful as search items. Common stopwords include articles, prepositions
and one-letter words.
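Stopword filtering can be sketched as a step applied before indexing. The STOPWORDS set below is a small hypothetical list, not a standard one, and index_terms is an illustrative name.

```python
import re

# Hypothetical stopword list: articles and other very common words.
STOPWORDS = {"a", "the", "that", "has", "are", "many"}


def index_terms(text):
    """Return the words of the text that should be indexed, in order of appearance."""
    words = re.findall(r"[a-z]+", text.lower())
    # Skip stopwords and one-letter words.
    return [w for w in words if w not in STOPWORDS and len(w) > 1]


text = "That house has a garden. The garden has many flowers. The flowers are beautiful"
print(index_terms(text))
# ['house', 'garden', 'garden', 'flowers', 'flowers', 'beautiful']
```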
Example (occurrences as character positions)
• Text (the number in brackets is the starting character position of each word):
That[1] house[6] has[12] a[16] garden.[18] The[25] garden[29] has[36] many[40]
flowers.[45] The[54] flowers[58] are[66] beautiful[70]
• Inverted file
Vocabulary Occurrences
beautiful 70
flowers 45, 58
garden 18, 29
house 6
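A sketch of how such an occurrence list could be built, recording the 1-based character offset at which each indexed word starts. The function name and the stopword set are hypothetical. Because the code counts every character of the sentence as typed, including the period after the first "garden", the offsets it reports from the second sentence on are one higher than those in the table above (for example 30 instead of 29 for the second "garden").

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "that", "has", "are", "many"}  # hypothetical list


def positional_index(text, stopwords=STOPWORDS):
    """Map each indexed word to the 1-based character offsets where it starts."""
    occurrences = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in stopwords:
            occurrences[word].append(match.start() + 1)
    return dict(sorted(occurrences.items()))  # vocabulary in alphabetical order


text = "That house has a garden. The garden has many flowers. The flowers are beautiful"
for word, positions in positional_index(text).items():
    print(word, positions)
# beautiful [71]
# flowers [46, 59]
# garden [18, 30]
# house [6]
```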
Example (occurrences as block numbers)
• Text:
Block 1 Block 2 Block 3 Block 4
That house has a garden. The garden has many flowers. The flowers are
beautiful
• Inverted file
Vocabulary Occurrences
beautiful 4
flowers 3
garden 2
house 1
Types of Indexes
• Primary Indexes
• Clustering Indexes
• Secondary Indexes
• Advantages of Indexed sequential access file organization
• In an indexed sequential access file, both sequential and random access are possible.
• It accesses the records very fast if the index table is properly organized.
• The records can be inserted in the middle of the file.
• It provides quick access for sequential and direct processing.
• It reduces the degree of the sequential search.
• Disadvantages of Indexed sequential access file organization
• An indexed sequential access file requires unique keys and periodic reorganization.
• Searching the index adds time to every data access or retrieval.
• It requires more storage space.
• It is expensive because it requires special software.
• It is less efficient in the use of storage space as compared to other file
organizations.
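The combination of direct and sequential access can be sketched with a sparse index over sorted blocks of records. This is a simplified in-memory illustration, not a real file layout; the record data and all names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical file: records sorted by unique key and grouped into blocks.
BLOCKS = [
    [(101, "Asha"), (104, "Ben"), (109, "Chen")],
    [(112, "Dina"), (118, "Eli"), (121, "Fay")],
    [(130, "Gus"), (135, "Hana"), (142, "Ivan")],
]
# Index table: the first key of each block.
INDEX = [block[0][0] for block in BLOCKS]  # [101, 112, 130]


def lookup(key):
    """Direct access: use the index to pick one block, then scan only that block."""
    block_no = bisect_right(INDEX, key) - 1
    if block_no < 0:
        return None  # key is smaller than every key in the file
    for record_key, value in BLOCKS[block_no]:
        if record_key == key:
            return value
    return None


def scan_all():
    """Sequential access: read every record in key order."""
    for block in BLOCKS:
        yield from block


print(lookup(118))              # Eli
print(lookup(119))              # None
print([k for k, _ in scan_all()][:4])  # [101, 104, 109, 112]
```

The index lookup narrows the search to one block, which is how the sequential search is shortened; the index itself is the extra storage the list above refers to.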