
File Organization

- File organization refers to how data is structured and stored within a file. There are several types of file organization, including serial, sequential access, direct access, and indexed sequential access.
- Serial file organization stores records one after another with no particular ordering. Sequential access organization stores records in sequential order based on a key field.
- Direct access organization allows random access to records using hashing techniques. Indexed sequential access uses an index to allow quick access to records in sequential order.

Uploaded by

Tulip

File Organization

Unit-5
• What is a File?

• A file is a collection of records that are related to each other. The file size is limited by
the size of memory and of the storage medium.

There are two important features of file:

1. File Activity
2. File Volatility

File activity specifies the percentage of records that are actually processed in a single run.

File volatility addresses the properties of record changes. For a highly volatile file, a
disk-based design is more efficient than tape.
• File organization is a way of organizing the data or records in a file. It does
not refer to how files are organized in folders, but how the contents of a
file are added and accessed.

• For example, if we want to retrieve employee records in alphabetical order
of name, sorting the file by employee name is a good file organization.

• Types of File Organization


• There are different ways of organizing a file:

1. Serial file organization
2. Sequential access file organization
3. Direct access file organization
4. Indexed sequential access file organization
• Serial file organization
• Records in a file are stored and accessed one after another.
• The records are not stored in any particular order on the storage medium. This type of
organization is mainly used on magnetic tapes.

• Advantages of serial file organization


• It is simple
• It is cheap

• Disadvantages of serial file organization


• It is cumbersome to access because you have to read all preceding
records before retrieving the one being searched for.
• Space on the medium is wasted in the form of inter-record gaps.
• It cannot support modern high-speed requirements for quick record access.
• Sequential access file organization
• Storing records in sorted order in contiguous blocks within files on tape or disk is
called sequential access file organization.
• In sequential access file organization, all records are stored in a sequential
order. The records are arranged in the ascending or descending order of a
key field.
• A sequential file search starts from the beginning of the file, and records
can be added at the end of the file.
• In a sequential file, it is not possible to add a record in the middle of the file
without rewriting the file.

• Advantages of sequential file


• It is simple to program and easy to design.
• A sequential file makes the best use of storage space.

• Disadvantages of sequential file


• Processing a sequential file is time consuming.
• It has high data redundancy.
• Random searching is not possible.
• Updating sequential files
• Sequential files must be updated periodically to reflect changes in information. The
updating process is very involved because all the records need to be checked and
updated (if necessary) sequentially.

• Files involved in updating


• There are four files associated with an update program: the new master file, the
old master file, the transaction file and the error report file. All these files are
sorted based on key values.
• Processing file updates
• To make the updating process efficient, all files are sorted on the same key. This
updating process is shown in Figure.
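
The update run described above can be sketched in Python as a single pass over two key-sorted lists. This is a minimal sketch under assumptions that are not in the slides: the transaction codes "A" (add), "C" (change) and "D" (delete), the tuple layouts, the function name update_master, and at most one transaction per key are all hypothetical choices made for illustration.

```python
def update_master(old_master, transactions):
    """Apply key-sorted transactions to a key-sorted master file.

    old_master:   list of (key, data) tuples, sorted by key
    transactions: list of (key, code, data) tuples, sorted by key,
                  with the assumed codes "A" (add), "C" (change), "D" (delete)
    Returns (new_master, error_report).
    """
    new_master, errors = [], []
    i = 0  # read position in the old master file
    for key, code, data in transactions:
        # Copy unchanged master records that come before this key.
        while i < len(old_master) and old_master[i][0] < key:
            new_master.append(old_master[i])
            i += 1
        exists = i < len(old_master) and old_master[i][0] == key
        if code == "A" and not exists:
            new_master.append((key, data))
        elif code == "C" and exists:
            new_master.append((key, data))
            i += 1
        elif code == "D" and exists:
            i += 1  # skip the old record so it is not copied
        else:
            # Add of an existing key, change/delete of a missing key,
            # or an unknown code goes to the error report.
            errors.append((key, f"{code}: cannot apply"))
    new_master.extend(old_master[i:])  # copy the remaining records unchanged
    return new_master, errors


old_master = [(101, "Ann"), (103, "Bob"), (105, "Cy")]
transactions = [(102, "A", "Dee"), (103, "C", "Rob"),
                (104, "D", None), (105, "D", None)]
new_master, errors = update_master(old_master, transactions)
print(new_master)
print(errors)
```

Because both inputs are sorted on the same key, each file is read only once from start to end, which is why the slides insist on sorting before the update run.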
• Random access
• If we need to access a specific record without having
to retrieve all records before it, we use a file structure
that allows random access. Two file structures allow
this: indexed files and hashed files.
• Direct access file organization (Hashing)
• A direct access file is also known as a random access or relative file.
• In a direct access file, all records are stored on a direct access storage device
(DASD), such as a hard disk. The records are placed randomly throughout the
file.
• The records do not need to be in sequence because they are updated directly
and rewritten back in the same location.
• This file organization is useful for immediate access to large amounts of
information. It is used in accessing large databases.
• It is also called hashing.
• Magnetic disks (such as floppy disks) and optical disks (such as CDs) allow data to be
stored and accessed randomly.

• Advantages of direct access file organization

• Direct access files support online transaction processing (OLTP) systems, such as an
online railway reservation system.
• In a direct access file, sorting of the records is not required.
• It accesses the desired records immediately.
• It updates several files quickly.
• It has better control over record allocation.
• Disadvantages of direct access file organization
• A direct access file does not provide a backup facility.
• It is expensive.
• It makes less efficient use of storage space than a sequential file.
• A hashed file uses a mathematical function to map a key to an address. The user
gives the key, the function maps the key to the address and passes it to the
operating system, and the record is retrieved.
• Hashing methods
• For key-address mapping, we can select one of several hashing methods. We
discuss a few of them here.
• Direct hashing
• In direct hashing, the key is the data file address without any algorithmic
manipulation. The file must therefore contain a record for every possible key.
• Modulo division hashing
• Also known as division remainder hashing, the modulo division method divides the
key by the file size and uses the remainder plus 1 for the address. This gives the
simple hashing algorithm that follows, where list_size is the number of elements in
the file:
address = key MOD list_size + 1
The reason for adding 1 to the mod operation result is that our list starts
with 1 instead of 0.
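
As a small illustration of modulo division hashing, the rule "remainder plus 1" can be written as a one-line Python function. The function name and the sample keys are hypothetical; only the formula comes from the text.

```python
def modulo_division_hash(key, list_size):
    """Return an address in the range 1..list_size for an integer key."""
    return key % list_size + 1


# With a file of 50 elements, every key lands on an address from 1 to 50.
address_a = modulo_division_hash(1234, 50)  # 1234 % 50 = 34, plus 1
address_b = modulo_division_hash(50, 50)    # remainder 0 maps to address 1
address_c = modulo_division_hash(49, 50)    # remainder 49 maps to address 50
print(address_a, address_b, address_c)
```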
• Digit extraction hashing
• Using digit extraction hashing, selected digits are extracted from the key and
used as the address. For example, using our six-digit employee number to hash to a
three-digit address (000–999), we could select the first, third and fourth digits
(from the left) and use them as the address.
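
Digit extraction can be sketched the same way. The sample employee number 125870 and the function name are assumptions made for illustration; the choice of the first, third and fourth digits follows the text.

```python
def digit_extraction_hash(key, positions=(1, 3, 4), width=6):
    """Build an address from selected digits of a fixed-width key.

    positions are counted from the left, starting at 1.
    """
    digits = str(key).zfill(width)  # pad short keys with leading zeros
    return int("".join(digits[p - 1] for p in positions))


address = digit_extraction_hash(125870)  # digits 1, 5 and 8
print(address)
```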
• Collision
• Generally, the population of keys for a hashed list is greater than the number of
records in the data file.

• For example, if we have a file of 50 students for a class in which the students are
identified by the last four digits of their social security number, then there are 200
possible keys for each element in the file (10,000/50). Because there are many keys
for each address in the file, there is a possibility that more than one key will hash to
the same address in the file. We call the set of keys that hash to the same address in
our list synonyms.
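
The 200-keys-per-address figure can be checked with a short count. The slides do not say which hash function the class file uses, so modulo division is assumed here, and the two sample keys are hypothetical.

```python
from collections import Counter


def address(key, list_size=50):
    """Modulo division hashing, assumed here for the 50-student file."""
    return key % list_size + 1


# Hash every possible four-digit key (0000 to 9999) and count per address.
keys_per_address = Counter(address(key) for key in range(10_000))
print(len(keys_per_address), set(keys_per_address.values()))

# Two keys that hash to the same address are synonyms.
synonyms = address(1234) == address(5284)
print(synonyms)
```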
• Collision resolution

• With the exception of the direct method, none of the methods we have discussed
for hashing creates one-to-one mappings.

• This means that when we hash a new key to an address, we may create a collision.
There are several methods for handling collisions, each of them independent of the
hashing algorithm. That is, any hashing method can be used with any collision
resolution method.

• Collision Resolution Techniques


• There are two broad ways of collision resolution:
• 1. Separate Chaining: an array of linked lists.
• 2. Open Addressing: an array-based implementation.
• (i) Linear probing (linear search)
• (ii) Quadratic probing (nonlinear search)
• (iii) Double hashing (uses two hash functions)
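
The difference between the two families can be sketched briefly. This is an illustrative sketch, not the slides' own implementation: Python lists stand in for linked lists in the chaining version, the hash function key % size is assumed, and the function names are hypothetical.

```python
def chained_insert(buckets, key):
    """Separate chaining: each slot holds a list of all keys that hash to it."""
    buckets[key % len(buckets)].append(key)


def linear_probe_insert(table, key):
    """Open addressing with linear probing: try home, home+1, home+2, ..."""
    size = len(table)
    for i in range(size):
        slot = (key % size + i) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


buckets = [[] for _ in range(7)]
table = [None] * 7
for key in (21, 14, 7):  # all three keys hash to slot 0
    chained_insert(buckets, key)
    linear_probe_insert(table, key)

print(buckets[0])  # the synonyms share one chain
print(table[:3])   # the synonyms spill into neighbouring slots
```

With chaining the colliding keys stay at their home address, while with linear probing they occupy adjacent slots, which is what produces primary clustering.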
Open Addressing: Quadratic Probing
• Quadratic probing eliminates primary clusters.
• c(i) is a quadratic function in i of the form c(i) = a*i^2 + b*i. Usually c(i) is chosen
as:
c(i) = i^2 for i = 0, 1, . . . , tableSize – 1
or
c(i) = ±i^2 for i = 0, 1, . . . , (tableSize – 1) / 2

• The probe sequences are then given by:

hi(key) = [h(key) + i^2] % tableSize for i = 0, 1, . . . , tableSize – 1
or
hi(key) = [h(key) ± i^2] % tableSize for i = 0, 1, . . . , (tableSize – 1) / 2

• Note for Quadratic Probing:

– Hashtable size should not be an even number; otherwise Property 2 will not be
satisfied.
– Ideally, table size should be a prime of the form 4j+3, where j is an integer. This
choice of table size guarantees Property 2.

Quadratic Probing (cont’d)

• Example: Load the keys 23, 13, 21, 14, 7, 8, and 15, in this order,
in a hash table of size 7 using quadratic probing with c(i) = ±i^2 and
the hash function: h(key) = key % 7
• The required probe sequences are given by:
hi(key) = (h(key) ± i^2) % 7 for i = 0, 1, 2, 3

Quadratic Probing (cont’d)

hi(key) = (h(key) ± i^2) % 7 for i = 0, 1, 2, 3

h0(23) = (23 % 7) % 7 = 2
h0(13) = (13 % 7) % 7 = 6
h0(21) = (21 % 7) % 7 = 0
h0(14) = (14 % 7) % 7 = 0 collision
h1(14) = (0 + 1^2) % 7 = 1
h0(7) = (7 % 7) % 7 = 0 collision
h1(7) = (0 + 1^2) % 7 = 1 collision
h-1(7) = (0 - 1^2) % 7 = -1
NORMALIZE: (-1 + 7) % 7 = 6 collision
h2(7) = (0 + 2^2) % 7 = 4
h0(8) = (8 % 7) % 7 = 1 collision
h1(8) = (1 + 1^2) % 7 = 2 collision
h-1(8) = (1 - 1^2) % 7 = 0 collision
h2(8) = (1 + 2^2) % 7 = 5
h0(15) = (15 % 7) % 7 = 1 collision
h1(15) = (1 + 1^2) % 7 = 2 collision
h-1(15) = (1 - 1^2) % 7 = 0 collision
h2(15) = (1 + 2^2) % 7 = 5 collision
h-2(15) = (1 - 2^2) % 7 = -3
NORMALIZE: (-3 + 7) % 7 = 4 collision
h3(15) = (1 + 3^2) % 7 = 3

Resulting hash table (O = occupied):

Index  Status  Key
0      O       21
1      O       14
2      O       23
3      O       15
4      O       7
5      O       8
6      O       13
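
The quadratic probing example above (keys 23, 13, 21, 14, 7, 8, 15 in a table of size 7) can be reproduced with a short Python sketch. The function name is hypothetical, and the probe order 0, +1, -1, +4, -4, +9, -9 is taken from the worked computation. Python's % operator already returns a non-negative result for a negative left operand, so the separate NORMALIZE step is not needed in this language.

```python
def quadratic_probe_insert(table, key):
    """Insert key using the probe sequence h(key) +/- i^2, i = 0..(size-1)/2."""
    size = len(table)
    home = key % size
    for i in range((size - 1) // 2 + 1):
        # i = 0 has a single probe; every other i is tried as +i^2 then -i^2.
        for sign in ((1,) if i == 0 else (1, -1)):
            slot = (home + sign * i * i) % size
            if table[slot] is None:
                table[slot] = key
                return slot
    raise RuntimeError("no free slot found in the probe sequence")


table = [None] * 7
for key in (23, 13, 21, 14, 7, 8, 15):
    quadratic_probe_insert(table, key)
print(table)  # slot 0 first, slot 6 last
```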
Double Hashing
• To eliminate secondary clustering, synonyms must have different probe sequences.

• Double hashing achieves this by having two hash functions that both depend on
the hash key.

• c(i) = i * hp(key) for i = 0, 1, . . . , tableSize – 1


where hp (or h2) is another hash function.

• The probing sequence is:


hi(key) = [h(key) + i*hp(key)]% tableSize for i = 0, 1, . . . , tableSize – 1

• The function c(i) = i*hp(key) satisfies Property 2 provided hp(key) and tableSize are
relatively prime.

• To guarantee Property 2, tableSize must be a prime number.

• Common definitions for hp are:

– hp(key) = 1 + key % (tableSize - 1)
– hp(key) = q - (key % q) where q is a prime less than tableSize
– hp(key) = q*(key % q) where q is a prime less than tableSize
Double Hashing (cont'd)

Performance of Double hashing:


– Much better than linear or quadratic probing because it eliminates both primary
and secondary clustering.
– BUT requires a computation of a second hash function hp.

Example: Load the keys 18, 26, 35, 9, 64, 47, 96, 36, and 70 in this order, in an
empty hash table of size 13
(a) using double hashing with the first hash function: h(key) = key % 13 and the
second hash function: hp(key) = 1 + key % 12
(b) using double hashing with the first hash function: h(key) = key % 13 and the
second hash function: hp(key) = 7 - key % 7
Show all computations.

Double Hashing (cont’d)

(a) h(key) = key % 13, hp(key) = 1 + key % 12, hi(key) = [h(key) + i*hp(key)] % 13

h0(18) = (18%13)%13 = 5
h0(26) = (26%13)%13 = 0
h0(35) = (35%13)%13 = 9
h0(9) = (9%13)%13 = 9 collision
hp(9) = 1 + 9%12 = 10
h1(9) = (9 + 1*10)%13 = 6
h0(64) = (64%13)%13 = 12
h0(47) = (47%13)%13 = 8
h0(96) = (96%13)%13 = 5 collision
hp(96) = 1 + 96%12 = 1
h1(96) = (5 + 1*1)%13 = 6 collision
h2(96) = (5 + 2*1)%13 = 7
h0(36) = (36%13)%13 = 10
h0(70) = (70%13)%13 = 5 collision
hp(70) = 1 + 70%12 = 11
h1(70) = (5 + 1*11)%13 = 3
Double Hashing (cont'd)

(b) h(key) = key % 13, hp(key) = 7 - key % 7, hi(key) = [h(key) + i*hp(key)] % 13

h0(18) = (18%13)%13 = 5
h0(26) = (26%13)%13 = 0
h0(35) = (35%13)%13 = 9
h0(9) = (9%13)%13 = 9 collision
hp(9) = 7 - 9%7 = 5
h1(9) = (9 + 1*5)%13 = 1
h0(64) = (64%13)%13 = 12
h0(47) = (47%13)%13 = 8
h0(96) = (96%13)%13 = 5 collision
hp(96) = 7 - 96%7 = 2
h1(96) = (5 + 1*2)%13 = 7
h0(36) = (36%13)%13 = 10
h0(70) = (70%13)%13 = 5 collision
hp(70) = 7 - 70%7 = 7
h1(70) = (5 + 1*7)%13 = 12 collision
h2(70) = (5 + 2*7)%13 = 6
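
Both parts of the double hashing example can be checked with one Python sketch that takes the second hash function as a parameter. The function names double_hash_insert and load are hypothetical; the keys, table size and hash functions are the ones given in the example.

```python
def double_hash_insert(table, key, hp):
    """Insert key using the probe sequence (h(key) + i*hp(key)) % tableSize."""
    size = len(table)
    home = key % size   # first hash function h(key)
    step = hp(key)      # second hash function decides the probe step
    for i in range(size):
        slot = (home + i * step) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found in the probe sequence")


def load(keys, size, hp):
    """Load all keys, in order, into an empty table of the given size."""
    table = [None] * size
    for key in keys:
        double_hash_insert(table, key, hp)
    return table


keys = [18, 26, 35, 9, 64, 47, 96, 36, 70]
table_a = load(keys, 13, lambda key: 1 + key % 12)  # part (a)
table_b = load(keys, 13, lambda key: 7 - key % 7)   # part (b)
print(table_a)
print(table_b)
```

The synonyms 18, 96 and 70 all start at slot 5 but follow different probe steps, which is how double hashing avoids secondary clustering.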
Inverted Files
• Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to
speed up the searching task.

• The inverted file may be the database file itself, rather than its index. It is the most popular data
structure used in document retrieval systems, used on a large scale for example in search engines.

• An inverted index catalogs a collection of objects in their textual representations. Given a set of
documents, keywords and other attributes (possibly including relevance ranking) are assigned to
each document. The inverted index is the list of keywords and links to the corresponding
document.

• Structure of inverted file:


– Vocabulary: is the set of all distinct words in the text
– Occurrences: lists containing all information necessary for each word of the vocabulary (text
position, frequency, documents where the word appears, etc.)

• Stopwords
• The stopword list is a group of keywords which should not be indexed. It comprises
keywords that are considered irrelevant to the index: keywords considered too common or
non-specific to be useful as search terms. Common stopwords include articles, prepositions and
one-letter words.
Example
• Text (the number above each word is its position in the text):

1     6      12   16  18       25   29      36   40    45        54   58       66   70
That  house  has  a   garden.  The  garden  has  many  flowers.  The  flowers  are  beautiful

• Inverted file
Vocabulary   Occurrences
beautiful    70
flowers      45, 58
garden       18, 29
house        6
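
A minimal Python sketch of how the vocabulary and occurrence lists above could be built. The positions are copied from the example rather than computed, and the stopword set is an assumption chosen so that the output matches the table; the function name is hypothetical.

```python
from collections import defaultdict

# Assumed stopword list for this example only.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}


def build_inverted_index(occurrences, stopwords=STOPWORDS):
    """Map each non-stopword term to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, word in occurrences:
        term = word.strip(".,").lower()  # drop punctuation, ignore case
        if term not in stopwords:
            index[term].append(position)
    return dict(sorted(index.items()))   # vocabulary in alphabetical order


positions = [1, 6, 12, 16, 18, 25, 29, 36, 40, 45, 54, 58, 66, 70]
text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
index = build_inverted_index(zip(positions, text.split()))
for term, places in index.items():
    print(term, places)
```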
Example
• Text, divided into four blocks (Block 1 to Block 4):

That house has a garden. The garden has many flowers. The flowers are
beautiful
• Inverted file
Vocabulary Occurrences
beautiful 4
flowers 3
garden 2
house 1
Types of Indexes
• Primary Indexes
• Clustering Indexes
• Secondary Indexes
• Advantages of Indexed sequential access file organization

• In an indexed sequential access file, both sequential and random access are possible.
• It accesses the records very fast if the index table is properly organized.
• The records can be inserted in the middle of the file.
• It provides quick access for sequential and direct processing.
• It reduces the degree of the sequential search.

• Disadvantages of Indexed sequential access file organization

• An indexed sequential access file requires unique keys and periodic reorganization.
• It takes longer to search the index for data
access or retrieval.
• It requires more storage space.
• It is expensive because it requires special software.
• It is less efficient in the use of storage space as compared to other file
organizations.