FS Mod2
Module 2
ORGANIZATION OF FILES FOR PERFORMANCE AND INDEXING
DATA COMPRESSION
Data compression involves encoding the information in a file in such a way that it takes up
less space. Many techniques are available for data compression.
Replacing a fixed-length field with a more compact code is a good example of compression
of this type. For example, suppose each student record contains the name of the student's
state as a two-letter abbreviation, which occupies 2 ASCII bytes, i.e. 16 bits. There are
29 states in India, so they can be represented using only 5 bits (2^5 = 32 codes) instead
of 16 bits, thus saving 11 bits, or more than 50% of the space used by the field.
This type of compression technique decreases the number of bits required to store
records, by finding a more compact notation.
Disadvantages:
i) File is unreadable - by using pure binary encoding, we have made the file
unreadable by humans.
ii) Cost of Encoding / Decoding Time – Time is consumed for encoding and
decoding (eg - state name given as KA must be converted into binary format).
File Structures Module 2
This technique is not suitable if the file is very small or if it is used by many programs.
It is suited to files that contain several million records and are processed by only 1 or 2
programs.
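The state-code example above can be sketched in C++ as follows. This is only an illustration: the function names and the four-entry table in the usage note are assumptions, not part of any real record format. bitsNeeded computes how many bits are enough to distinguish a given number of values, and the other two functions convert between the readable abbreviation and its compact numeric code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Smallest number of bits able to distinguish `count` different values.
unsigned bitsNeeded(std::size_t count) {
    unsigned bits = 0;
    while ((std::size_t(1) << bits) < count) ++bits;
    return bits;
}

// Encoding: the compact code is the position of the abbreviation in a
// look-up table (hypothetical). Returns -1 if it is not in the table.
int encodeState(const std::vector<std::string>& table, const std::string& code) {
    for (std::size_t i = 0; i < table.size(); ++i)
        if (table[i] == code) return static_cast<int>(i);
    return -1;
}

// Decoding: look the compact code up in the same table.
std::string decodeState(const std::vector<std::string>& table, int value) {
    return table.at(static_cast<std::size_t>(value));
}
```

With a table such as {"AP", "KA", "KL", "TN"}, encodeState(table, "KA") gives 1 and decodeState(table, 1) gives back "KA". The table look-up on every read and write is the encoding and decoding cost listed among the disadvantages.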
If a file contains long runs of the same repeated value, the compression technique called
run-length encoding can be used.
Imagine an image of a dark sky with bright stars. Here the pixel values of the dark sky are
0’s and those of the bright stars are 1’s. The image consists of only 0’s and 1’s, with long
runs of 0’s, and can be compressed using this technique.
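Example: Encode the following sequence of hexadecimal values, choosing ‘ff’ as the run-length
indicator. Each run is replaced by three bytes: the indicator, the repeated value and the
run length.
Original: 22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
Encoded:  22 23 ff 24 07 25 ff 26 06 25 24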
Disadvantage:
i) No guarantee that space will be saved. [If there are not many repeated
sequences, then no space is saved.]
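A minimal sketch of the encoding step used in the example above is given below. The function name is hypothetical, and three details are assumptions rather than something the notes specify: the indicator byte never occurs in the data itself, a run is encoded only when it has at least 4 bytes (shorter runs would not save space, since an encoded run costs 3 bytes), and a run length is limited to 255 so that it fits in one byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Run-length encode `in`. A run of at least `minRun` equal bytes becomes
// the triple (indicator, value, count); shorter runs are copied unchanged.
std::vector<std::uint8_t> runLengthEncode(const std::vector<std::uint8_t>& in,
                                          std::uint8_t indicator = 0xff,
                                          std::size_t minRun = 4) {
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        // Measure the run starting at position i (capped at 255).
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;

        if (run >= minRun) {
            out.push_back(indicator);
            out.push_back(in[i]);
            out.push_back(static_cast<std::uint8_t>(run));
        } else {
            out.insert(out.end(), run, in[i]);  // copy the short run as-is
        }
        i += run;
    }
    return out;
}
```

Applied to the 18-byte sequence of the example, this produces the 11-byte encoded sequence shown there. On input with no long runs the output is the same size as the input, which is the disadvantage noted above.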
The working principle of variable-length codes is to assign short codes to the most
frequently occurring values and long codes to the least frequent ones.
This technique was used earlier in Morse code for telegraphic communication. Morse code
uses a standard look-up table, so the codes do not change depending upon the occurrence
of the characters in the current input.
In Morse code, . and - are used to represent characters. Different characters have symbol
sequences of different lengths. A look-up table is maintained for each character (A-Z).
More frequently occurring characters are given shorter search paths in a tree.
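The idea of giving frequent characters shorter paths in a tree can be sketched with the usual Huffman construction, in which the two least frequent subtrees are merged repeatedly until one tree remains. The code below is an illustrative sketch only: the names HuffmanNode and huffmanCodeLengths are hypothetical, and it reports just the length of each character's code (its depth in the tree), not the bit patterns.

```cpp
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// One node of the code tree; a leaf has left == right == -1.
struct HuffmanNode {
    long weight;
    int left;
    int right;
    char symbol;
};

// Returns the code length (tree depth) assigned to each character of `text`.
std::map<char, int> huffmanCodeLengths(const std::string& text) {
    std::map<char, long> freq;
    for (char c : text) ++freq[c];

    std::vector<HuffmanNode> nodes;
    typedef std::pair<long, int> Item;  // (weight, index into nodes)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > heap;
    for (const auto& entry : freq) {
        nodes.push_back(HuffmanNode{entry.second, -1, -1, entry.first});
        heap.push(Item(entry.second, static_cast<int>(nodes.size()) - 1));
    }

    // Repeatedly merge the two lightest subtrees under a new parent.
    while (heap.size() > 1) {
        Item a = heap.top(); heap.pop();
        Item b = heap.top(); heap.pop();
        nodes.push_back(HuffmanNode{a.first + b.first, a.second, b.second, '\0'});
        heap.push(Item(a.first + b.first, static_cast<int>(nodes.size()) - 1));
    }

    std::map<char, int> lengths;
    if (nodes.empty()) return lengths;

    // Walk down from the root (the last node created), tracking the depth.
    std::vector<std::pair<int, int> > pending;  // (node index, depth)
    pending.push_back(std::make_pair(static_cast<int>(nodes.size()) - 1, 0));
    while (!pending.empty()) {
        std::pair<int, int> cur = pending.back();
        pending.pop_back();
        const HuffmanNode& n = nodes[cur.first];
        if (n.left < 0) {
            // A text with a single distinct symbol still needs one bit.
            lengths[n.symbol] = cur.second > 0 ? cur.second : 1;
        } else {
            pending.push_back(std::make_pair(n.left, cur.second + 1));
            pending.push_back(std::make_pair(n.right, cur.second + 1));
        }
    }
    return lengths;
}
```

For the text "aaaabbc", the most frequent character 'a' gets a 1-bit code while 'b' and 'c' get 2-bit codes. Unlike the fixed Morse table, the code lengths here depend on the frequencies in the current input.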
e) Compression in Unix
System V Unix has routines called pack and unpack, which use Huffman codes on a
byte-by-byte basis. Typically pack achieves a 25 to 40% reduction on text files, but less
on binary files that have a more uniform distribution of byte values.
pack appends a .z to the name of the file it has compressed.
Berkeley Unix has routines called compress and uncompress, which use an
effective dynamic method called Lempel-Ziv.
compress appends a .Z to the name of the file it has compressed.
Storage compaction makes files smaller by looking for places in a file where there is no
data at all and recovering this space (or)
Removing the unused space from the file is called storage compaction.
Amar|1VAIS09005|23|54|65
*|arath|1VAIS09007|32|54|65
Harish|1VAIS09012|36|54|65
Fig (2): After the second record is marked as deleted.
The * symbol as the first field of the record is used to indicate a deleted record.
After marking the deleted records with *, they are left in file for a period of time.
After many records are deleted, the space from all the deleted records is reclaimed
at once. A special program is used to reconstruct the file with all deleted
records squeezed out, as shown:
Amar|1VAIS09005|23|54|65
Harish|1VAIS09012|36|54|65
Storage compaction can be used with both fixed-length and variable-length records.
However, it does not make the unused space available immediately.
Where space must be reused as soon as it is freed, dynamic storage reclamation is
required; the different techniques used are illustrated in the next topics.
To reuse deleted space dynamically in a file of fixed-length records, we need:
A way to know immediately if there are unused slots (deleted records) in the file.
A way to jump directly to one of those slots if they exist.
The deleted records are marked in some special way, so that it is easy to identify the
unused spaces (space formerly occupied by deleted records). Usually this is indicated
by putting ‘*’ as the first character.
Solution: use a linked list in the form of a stack, where the RRN plays the role of a pointer.
Linked list as stack:
Both of the above requirements can be met by using a linked list that connects all
available (unused) record spaces.
Each node in the list consists of the RRN of deleted record and a link to the next node. A
head reference to the first node in the list is used to move through the list by using the node’s
link field. The traversal of the list is stopped when an end-of-list(-1) is encountered.
A list containing the space made available by the deletion of records is called the avail list.
The avail list is used to find the location of free space available for adding new records. As the
records are of fixed length, all the deleted records are of the same size, so any slot on the list
can hold a new record.
A stack is used to handle the list. A stack is a list in which all insertions and removals of nodes
take place at one end of the list. Each node stores the RRN of a deleted record.
Now, if a new record has to be placed in file, the avail list shows that record with RRN 3
is deleted and that space can be reused. When that space is used by new record, the avail
list becomes,
If the pointer to the top of the stack contains -1 value, then we know that there are no empty
slots and the new records are appended to the end of the file. If the pointer to the stack
contains a valid node reference, then we know that a reusable slot is available, and also know
the exact position.
Linking and Stacking Deleted records: (Another technique for reclaiming space in fixed
length records)
If the deleted records were placed on a separate stack, a separate file would have to be
maintained to hold it. Instead, the deleted records are linked and stacked inside the data
file itself: the links are arranged so that each available record slot points to the next,
with the RRN serving as the pointer.
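The linked avail list described above can be sketched in memory as follows. This is a simplified model under stated assumptions: the struct name FixedRecordFile and its members are hypothetical, a std::vector of strings stands in for the slots of the file, and a deleted slot holds '*' followed by the RRN of the next free slot (-1 at the end of the list).

```cpp
#include <string>
#include <vector>

// In-memory stand-in for a file of fixed-length record slots.
struct FixedRecordFile {
    std::vector<std::string> slots;  // slots[rrn] holds record number rrn
    int firstAvail = -1;             // head of the avail list (-1 = empty)

    // Delete: mark the slot and push it on top of the avail stack.
    void remove(int rrn) {
        slots[rrn] = "*" + std::to_string(firstAvail);
        firstAvail = rrn;
    }

    // Add: reuse the slot on top of the stack, or append when none is free.
    // Returns the RRN where the record was stored.
    int add(const std::string& record) {
        if (firstAvail == -1) {
            slots.push_back(record);
            return static_cast<int>(slots.size()) - 1;
        }
        int rrn = firstAvail;
        firstAvail = std::stoi(slots[rrn].substr(1));  // pop: follow the link
        slots[rrn] = record;
        return rrn;
    }
};
```

If records 5 and 3 are deleted in that order, firstAvail becomes 3 and slot 3 holds "*5". The next add reuses RRN 3 and leaves firstAvail at 5. No separate structure is needed because the links live in the deleted slots themselves.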
Example:
Fig shows sample file illustrating variable length record deletion.
HEAD FIRST_AVAIL: -1
28 Ames|1VA09IS013|034|023|067| 60 James Watt George
Willington|1VA10IS020|075|065|087| 31 Michael|1VA10IS010|050|045|067|
Fig(a): original sample file stored in variable length format with byte count. (Records
of Ames,James and Michael )
HEAD FIRST_AVAIL: 35
a) Before removal
Suppose the new record to be added is of 55 bytes. Traverse the avail list, whose slots have
sizes 47, 38, 72 and 68, until a slot big enough to hold the new record is found (here the
72-byte slot), and remove that slot from the avail list.
Storage Fragmentation:
Internal fragmentation: - wasted space within a record is called internal
fragmentation.
Fixed length record structures often result in internal fragmentation.
Variable – length records do not suffer from internal fragmentation. However
external fragmentation is not avoided.
In variable-length records, there is no fragmentation when records are only inserted.
But when a variable-length record is deleted and a shorter record is placed in that
space, fragmentation occurs.
Example -
HEAD FIRST_AVAIL: -1
28 Ames|1VA09IS013|034|023|067| 60 James Watt George
Willington|1VA10IS020|075|065|087| 31 Michael|1VA10IS010|050|045|067|
After a new record is inserted into a deleted space, any space that remains unused is
internal fragmentation (as the remaining space is not put back on the avail list).
To avoid this, while inserting the new record, the space available is broken into two parts –
a) Space back on avail list – to avoid internal fragmentation
b) Space for new record.
Example -
A new record of size 20 is added, in to the unused space.
HEAD FIRST_AVAIL: 35
33 Ames John|1VA09IS013|034|023|067| 40 *| -1........................
20 Ajay|IS01|012|043|065| 31 Michael|1VA10IS010|050|045|067|
Fig(b): After
External Fragmentation: - Form of fragmentation that occurs in a file when there is unused
space outside or between individual records.
After the available space is reused as shown in the above example, the space that remains may
become so small that it is no longer usable. Such a fragment, which cannot be used by any
record, is called external fragmentation.
Suppose a new record of size 34 is added into the unused space of size 40. The remaining
available space is then of size 6. This small space, which cannot be used to store any
record, is external fragmentation.
HEAD FIRST_AVAIL: 35
33 Ames John|1VA09IS013|034|023|067| 06 *| -1... 34 Anoop Kumar|1VA09IS015|024|045|056|
20 Ajay|IS01|012|043|065| 31 Michael|1VA10IS010|050|045|067|
(The 6-byte slot marked with * is the external fragmentation.)
Fig(c): After insertion of another record into deleted space.
Placement Strategies:
A Placement strategy is a mechanism for selecting the space from the avail list for a new
record.
There are three placement strategies used in variable length records –
First fit placement strategy: Accepts the first available record slot that is
large enough to hold the new record.
Best fit placement strategy: Finds the available record slot that is closest in size
to what is needed to hold the new record.
Worst fit placement strategy: Selects the largest available record slot,
regardless of how small the new record is.
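The three strategies can be compared with a small sketch. Here the avail list is modelled simply as a vector of slot sizes in list order; the function names are hypothetical, and each returns the position of the chosen slot or -1 when no slot is large enough.

```cpp
#include <cstddef>
#include <vector>

// First fit: the first slot that is large enough.
int firstFit(const std::vector<int>& avail, int need) {
    for (std::size_t i = 0; i < avail.size(); ++i)
        if (avail[i] >= need) return static_cast<int>(i);
    return -1;
}

// Best fit: the smallest slot that is still large enough.
int bestFit(const std::vector<int>& avail, int need) {
    int best = -1;
    for (std::size_t i = 0; i < avail.size(); ++i)
        if (avail[i] >= need && (best < 0 || avail[i] < avail[best]))
            best = static_cast<int>(i);
    return best;
}

// Worst fit: the largest slot, provided it is large enough.
int worstFit(const std::vector<int>& avail, int need) {
    int worst = -1;
    for (std::size_t i = 0; i < avail.size(); ++i)
        if (avail[i] >= need && (worst < 0 || avail[i] > avail[worst]))
            worst = static_cast<int>(i);
    return worst;
}
```

With slots of sizes 47, 38, 72 and 68 and a new record of 55 bytes, first fit and worst fit both choose the 72-byte slot, while best fit chooses the 68-byte slot. Best fit and worst fit scan the whole list here; keeping the list ordered by size would avoid that at the cost of extra work on every deletion.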
Binary search algorithm:
Suppose ‘key’ is the search element to be found.
Step 1: Assign low = 0 and high = num_rec - 1.
Step 2: Repeat the following steps while low is less than or equal to high [low <= high]
Step 2a: Find the middle value of low and high: mid = (low + high) / 2
Step 2b: Check if the middle value = key; if so, the key is found [a[mid] == key]
Step 2c: Check if the middle value < key; if so, assign low = mid + 1
Step 2d: Check if the middle value > key; if so, assign high = mid - 1
Step 3: If low becomes greater than high, the key is not present.
int BinarySearch (FixedRecordFile & file, RecordType & obj, KeyType & key)
// if key found, obj contains corresponding record, 1 returned
// if key not found, 0 returned.
{
    int low = 0, high = file.NumRecs() - 1;
    while (low <= high)
    {
        int guess = low + (high - low) / 2;    // middle of the current range
        file.ReadbyRRN (obj, guess);           // one disk read per guess
        if (obj.key() == key) return 1;
        if (obj.key() < key) low = guess + 1;  // key lies in the upper half
        else high = guess - 1;                 // key lies in the lower half
    }
    return 0;
}
Binary search                          Sequential search
Takes O(log2 n) comparisons            Takes O(n) comparisons.
(n -> no. of records).
When the file size is doubled, it      When the no. of records (file size) is
adds only one more guess to our        doubled, it doubles the no. of
worst case.                            comparisons required.
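The difference in growth can be checked with a small in-memory sketch that counts how many keys each search examines. A sorted vector of integers stands in for the sorted file here, and all function names are hypothetical; in a real file each probe would be a record read.

```cpp
#include <cstddef>
#include <vector>

// Binary search on sorted keys; `probes` receives the number of keys examined.
int binarySearch(const std::vector<int>& keys, int key, int& probes) {
    probes = 0;
    int low = 0;
    int high = static_cast<int>(keys.size()) - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        ++probes;
        if (keys[mid] == key) return mid;
        if (keys[mid] < key) low = mid + 1;
        else high = mid - 1;
    }
    return -1;
}

// Sequential search; `probes` receives the number of keys examined.
int sequentialSearch(const std::vector<int>& keys, int key, int& probes) {
    probes = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        ++probes;
        if (keys[i] == key) return static_cast<int>(i);
    }
    return -1;
}

// Helper for experiments: the sorted keys 0, 1, ..., n-1.
std::vector<int> sortedKeys(int n) {
    std::vector<int> keys;
    for (int i = 0; i < n; ++i) keys.push_back(i);
    return keys;
}
```

Searching for a key larger than every key in the list examines 10 keys in a list of 1000 and 11 keys in a list of 2000 with binary search, against 1000 and 2000 keys with sequential search.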
If the entire contents of the file can be held in memory, we can perform an
internal sort (sorting in memory) which is very efficient.
But, most often, the entire file cannot be held in memory. In such case
keysorting can be done.
comparisons.
If each comparison requires a disk access, a series of binary search on a list of one
KEY SORTING
It is a method of sorting a file that does not require holding the entire file in
memory. Only the keys are copied to memory, then these keys are sorted in
memory, and the sorted list of keys is used to construct a new file that has the
records in sorted order.
Advantages: requires less memory than an internal sort.
Disadvantages: the process of constructing the new file requires a lot of seeking for records.
int KeySort (FixedRecordFile & infile, char * outfile)
{
    for (int i = 0; i < infile.numrecs(); i++)
    {
        infile.readbyRRN(obj, i);              // 1: read record number i
        keynodes[i] = keyRRN(obj.key(), i);    // 2: save its key together with its RRN
    }
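The complete keysort idea can be sketched in memory as shown below. This is a self-contained illustration, not the routine from the notes: the function name keySortRecords is hypothetical, a vector of strings stands in for the input file (indexed by RRN), the key is assumed to be the first field of each record up to the '|' delimiter, and the returned vector stands in for the new output file.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Keysort sketch: only (key, RRN) pairs are sorted, then the records are
// fetched by RRN in key order to build the sorted output.
std::vector<std::string> keySortRecords(const std::vector<std::string>& file) {
    // Step 1: copy each key, together with its RRN, into memory.
    std::vector<std::pair<std::string, std::size_t> > keyNodes;
    for (std::size_t rrn = 0; rrn < file.size(); ++rrn) {
        std::string key = file[rrn].substr(0, file[rrn].find('|'));
        keyNodes.push_back(std::make_pair(key, rrn));
    }

    // Step 2: sort the small key nodes, not the records themselves.
    std::sort(keyNodes.begin(), keyNodes.end());

    // Step 3: read the records in sorted-key order. In a real file each
    // access here is a seek to a different RRN.
    std::vector<std::string> sorted;
    for (std::size_t i = 0; i < keyNodes.size(); ++i)
        sorted.push_back(file[keyNodes[i].second]);
    return sorted;
}
```

Step 3 is where the disadvantage shows up: the records are read in key order rather than in physical order, so building the new file needs roughly one seek per record.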
deleted records slots is created by linking all of the available slots together. This linking
is done by writing a link field into each deleted record that points to the next deleted
record. This link field gives information about the exact physical location of next
available record. When a file contains such references to the physical locations of
records, we say that these records are pinned. A pinned record is one that cannot be
moved. There exists a connection from one record to another. So the use of pinned
records in a file makes sorting more difficult and sometimes impossible.
Solution: use index file to keep the sorted order of the records while keeping the data file in
its original order.
INDEXING
WHAT IS AN INDEX?
An index is a table containing a list of keys and corresponding reference fields.
The reference field (address) points to the record where the information
referenced by the key is found.
(Or)
An index is a tool for finding records in a file. It consists of:
A key field on which the index is searched.
A reference field that gives the address of the record associated with the key.
An index lets us impose order on a file without rearranging the file:
the records in the data file are not sorted, but the index file is sorted.
Indexing gives us keyed access to variable-length record files.
Index file
Note on index:
The index is easier to use than the data file because
It uses fixed-length records.
It is likely to be much smaller than the data file.
By requiring fixed-length records in the index file, we impose a limit on the size of the
primary key field, which could cause problems.
The index could carry more information than just the key and reference
fields (e.g., the length of each data record).
Record addition
Adding a new record to data file requires that we also add an entry to the index.
In the data file, the record can be added anywhere; however, the byte offset of the new
record must be saved.
Since the index is kept in sorted order by key, insertion of the new index entry
probably requires some rearrangement of the index.
We have to shift all the entries below the insertion point down by one, to open up
space for the new entry.
However, this operation is not costly (no file access), as it is performed in
memory (the entire index is held in memory).
Record deletion:
The main advantage of indexed file organization is that the index entries are always
in sorted order, even though the records in the data file are not rearranged.
To delete a record, we remove the corresponding entry from the index; the gap
created is closed by shifting up the entries below it.
Since the deletion takes place in memory, shifting the entries is not too costly.
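Addition, deletion and lookup on a simple in-memory index can be sketched as follows. The struct IndexEntry and the function names are hypothetical, and the byte offsets used in the usage note are made-up values. The index is a vector kept sorted by key, so each insert or erase shifts the entries below the affected position, as described above.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One index entry: a primary key and the byte offset of its data record.
struct IndexEntry {
    std::string key;
    long byteOffset;
};

// Ordering used to binary-search the sorted index by key.
bool keyLess(const IndexEntry& entry, const std::string& key) {
    return entry.key < key;
}

// Addition: find the sorted position, then insert (shifts later entries down).
void addEntry(std::vector<IndexEntry>& index, const std::string& key, long byteOffset) {
    auto pos = std::lower_bound(index.begin(), index.end(), key, keyLess);
    index.insert(pos, IndexEntry{key, byteOffset});
}

// Deletion: erase the entry (shifts later entries up). False if key is absent.
bool removeEntry(std::vector<IndexEntry>& index, const std::string& key) {
    auto pos = std::lower_bound(index.begin(), index.end(), key, keyLess);
    if (pos == index.end() || pos->key != key) return false;
    index.erase(pos);
    return true;
}

// Lookup: binary search on the key; returns the byte offset, or -1 if absent.
long findOffset(const std::vector<IndexEntry>& index, const std::string& key) {
    auto pos = std::lower_bound(index.begin(), index.end(), key, keyLess);
    return (pos != index.end() && pos->key == key) ? pos->byteOffset : -1;
}
```

Adding the keys 1VA10IS020, 1VA09IS013 and 1VA10IS010 in that order leaves the index sorted as 1VA09IS013, 1VA10IS010, 1VA10IS020, whatever order the records have in the data file.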
Record updating
Record updating falls in two categories:
The update changes the value of the key field
The update does not affect the key field.
Solution: if the simple index file is too large to be held in memory, the following
techniques can be used –
Hashed organization- If access speed is a top priority.
Tree structured or multilevel index (like B-tree) – if you need the flexibility of both
keyed access and ordered sequential access.
Advantages of storing simple indexes on secondary storage over the use of data file sorted by
key are:
A simple index allows use of binary search in a variable length record file.
If the index entries are substantially smaller than the data file records,
sorting and maintaining the index can be less expensive than the data file.
If there are pinned records in the data file, the use of an index lets us
rearrange the keys without moving the data records.
Provides multiple views of a data file.
The same NAME can occur in different records, i.e. a secondary key may be repeated
(the duplicates are grouped together, as the index file is sorted), but a primary key is
always unique. So, to find the actual byte offset of a record, we relate the secondary key
to a primary key, which in turn points to the actual byte offset.
Record addition:
When a secondary index is used, adding a record involves updating the data file, the
primary index, and the secondary index. Updating the secondary index is similar to
updating the primary index (i.e., in both, entries must be shifted to make room for the new one).
One important difference between secondary index and primary index is that a
secondary index can contain duplicate keys (grouped together) and the primary index
will not have duplicate keys.
In the example shown above, there are three records with the name ‘Chethan’. Within this
group, they should be ordered according to the values of the reference field(primary keys).
Record Deletion
Removing a record from data file means removing the corresponding entry in
primary index and all the entries in secondary indexes that refer to this primary index
entry.
Like primary index, the secondary indexes are maintained in sorted order by key.
Deleting an entry would involve rearranging the remaining entries to close up the
space left open by deletion.
Thus deleting a record costs additional time for every secondary index file.
This can be avoided by deleting only the corresponding entry in the primary index
file and leaving the secondary index files unchanged, which eliminates the modification
and rearrangement of the secondary indexes.
When a deleted record is searched for using a secondary key, the key is found in the
secondary index and yields the corresponding primary key. That primary key is then not
found in the primary index file, so the search ends, informing the user that the record
does not exist.
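The two-step lookup, and the way a stale secondary entry is caught, can be sketched as below. This is an illustration under assumptions: the function name findByName is hypothetical, a std::multimap stands in for the sorted secondary index (name to primary key, duplicates allowed), a std::map stands in for the primary index (primary key to byte offset), and the offsets in the usage note are made up.

```cpp
#include <map>
#include <string>
#include <vector>

// Look up a secondary key, then resolve each primary key it refers to.
// Primary keys that are no longer in the primary index are skipped: they
// are stale entries left behind by a deletion.
std::vector<long> findByName(const std::multimap<std::string, std::string>& secondary,
                             const std::map<std::string, long>& primary,
                             const std::string& name) {
    std::vector<long> offsets;
    auto range = secondary.equal_range(name);
    for (auto it = range.first; it != range.second; ++it) {
        auto hit = primary.find(it->second);
        if (hit != primary.end()) offsets.push_back(hit->second);
    }
    return offsets;
}
```

Suppose two records are named Chethan, with primary keys 1VA09IS015 and 1VA09IS023. After 1VA09IS023 is erased from the primary index only, a search for "Chethan" still finds both secondary entries but returns just the offset of 1VA09IS015. The price is an extra primary-index probe for each stale entry.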
Disadvantage: Deleted records take up space in the secondary index files.
Solution: B-Tree (allows for deletion without having to rearrange a lot of records)
Record updating
There are 3 possible situations
i. Update changes the secondary key:
We may have to rearrange the secondary index so that it stays in sorted order. (Relatively
expensive operation)
ii. Update changes the primary key:
Has a large impact on the primary key index, but in the secondary indexes it often
requires only that we update the affected primary key (reference field). If several
entries share the secondary key of the affected record, that group is re-sorted
according to the primary key.
iii. Update confined to other fields:
No changes are necessary to the secondary indexes. But the primary index file will
change if the address of the modified record changes.
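Suppose there are two secondary indexes, one containing name as the secondary key and the
other containing semester as the secondary key. To retrieve the record of Chetan studying in
the I semester, each secondary index is searched for its key and the primary keys common to
both results are selected.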
The student name and semester requested coincide with the record having the primary key
1VA09IS015. So the byte offset for that primary key is retrieved from the primary index and
used to retrieve the record.
The solution to these difficulties is to create files, such as secondary indexes, in which a
secondary key leads to a set of one or more primary keys; such a file is called an inverted list.
ADV                                   DISADV
Avoids the need to rearrange the      May restrict the number of
secondary index file until a new      references that can be associated
secondary key is added.               with each secondary key.
                                      Causes internal fragmentation.
Fig shows conceptual view of primary key reference fields as a series of lists.
1VA09IS023
1VA09IS045
Advantages
No wastage of space due to internal fragmentation.
When a new record is added, the new primary key is stored and arranged in the
secondary index file.
Disadvantage
A large number of small files are required to store the list of primary keys.
Rearranging the secondary index file is required only when a new secondary key is added to
the data file, i.e. when there is no existing record with that name. When a record with an
existing name is added, the primary key needs to be added to the primary key reference file only.
Advantages
Secondary index file needs to be rearranged only when a record with a
new student name is added (i.e., when a new student’s name is added or an existing name is modified).
Rearranging is faster since there are fewer records and each record is smaller.
There is less need for sorting. Therefore we can keep the secondary index file on
disk, so that more space is available in primary memory.
The primary key reference file is entry sequenced, i.e. the USNs of new records are added
at the end of the reference file, so it never needs to be sorted.
Space from deleted entries in the primary key reference file can easily be reused.
Disadvantage
The reference IDs associated with a given name are no longer guaranteed to be
grouped together physically, i.e. locality in the secondary index has been
lost.
Since the reference file can be very long, it is usually stored in secondary
memory, leading to large seek times.
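The arrangement described above, a secondary index whose entries point into an entry-sequenced primary key reference file, can be sketched as follows. The names Reference and LinkedInvertedIndex are hypothetical, and both files are modelled in memory. As a simplification, a new reference is linked at the head of its list, so a name's keys come back newest first rather than in primary key order.

```cpp
#include <map>
#include <string>
#include <vector>

// One entry of the primary key reference file.
struct Reference {
    std::string primaryKey;
    int next;  // position of the next reference for the same name, -1 = end
};

struct LinkedInvertedIndex {
    std::map<std::string, int> heads;  // secondary index: name -> first reference
    std::vector<Reference> refs;       // reference file, entry sequenced

    // Add: append to the reference file and relink the head of the list.
    // The secondary index gains an entry only when the name is new.
    void add(const std::string& name, const std::string& primaryKey) {
        auto it = heads.find(name);
        int oldHead = (it == heads.end()) ? -1 : it->second;
        refs.push_back(Reference{primaryKey, oldHead});
        heads[name] = static_cast<int>(refs.size()) - 1;
    }

    // Collect every primary key for a name by following the links.
    std::vector<std::string> keysFor(const std::string& name) const {
        std::vector<std::string> keys;
        auto it = heads.find(name);
        int pos = (it == heads.end()) ? -1 : it->second;
        while (pos != -1) {
            keys.push_back(refs[pos].primaryKey);
            pos = refs[pos].next;
        }
        return keys;
    }
};
```

After adding ("Chethan", "1VA09IS023"), ("Anoop", "1VA09IS015") and ("Chethan", "1VA09IS045"), the two Chethan references sit at positions 0 and 2 of the reference file. They are no longer adjacent, which is the loss of locality listed as a disadvantage.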
SELECTIVE INDEXES
A selective index contains keys for only a portion of the records in the data file.
Such an index provides the user with a view of a specific subset of the file’s
records.
BINDING
The process of associating a key with the physical address of its record is called
binding.
The binding of our primary keys takes place at construction time.
Adv: Faster access.
Disadv: Reorganization of the data file must result in modifications to all bound
index files.
The binding of our secondary keys takes place at the time they are used.
Adv: Safer.
Tight binding (construction-time binding, i.e. during preparation of the data file)
is preferable when the data file is static or nearly static, requiring little or no
adding, deleting, or updating.
Note: In tight binding, indexes contain explicit references to the associated physical data
record.
Postponing binding as long as possible is simpler and safer when the data file
requires a lot of adding, deleting and updating.
Note: Here the connection between a key and a particular physical record is
postponed until the record is retrieved in the course of program execution.
Important Questions: