Module 1: INTRODUCTION
The main driving force behind the file structure design is the relatively slow access time of
disk and its enormous, non-volatile capacity.
Good file structure design will give us access to all the capacity of disk, without making our
application spend a lot of time waiting for the disk.
A file structure is a way of organizing data on secondary memory in a
particular fashion, so that retrieving the data becomes easy.
Earlier files were stored on tapes. Access to these tapes was sequential. The cost to access
the tape grew in direct proportion to the size of the file.
As files grew and as storage devices such as disk drives became available, indexes were added to
files. An index consists of a list of keys and corresponding pointers into the file. Indexed access
made it possible to search a file quickly.
As index files grew, they too became difficult to manage. In the early 1960s, the idea of applying
tree structures emerged. But this also resulted in long searches, due to the uneven growth of trees as a
result of additions and deletions of records.
In 1963, researchers developed an elegant, self-adjusting binary tree structure called the AVL
tree for storing data in memory. The problem with this binary tree structure was that dozens of accesses
were required to find a record.
The solution to this problem emerged in the form of the B-tree. A B-tree grows from the bottom up,
as records are inserted. B-trees provided excellent access performance, except that the files couldn't
be accessed sequentially. This problem was solved by introducing B+ trees.
But all the above methods needed more than one disk access.
The hashing approach promised to give the required data in one disk access, but performed badly for
large amounts of data.
A logical file resembles the physical file in its content, but is visible only to the application
which handles it. Changes are made to the logical file, and when the file is finally saved, the
changes are applied to the physical file by the operating system.
Opening files:
Opening a file can be done in 2 ways-
1. Opening an existing file.
2. Creating a new file.
When a file is opened, the read/write pointer is positioned at the beginning of the file,
ready to read or write. The file contents are not disturbed by the open statement. Creating a new
file also opens it; since a new file has no contents, it is always created in write or read/write
mode.
The “open” function is used to open an existing file or to create a new file. This function
is declared in the header “fcntl.h” in the C language.
The syntax of open function is- fd=open(filename,flags[,pmode]);
where,
->fd - is the file descriptor, which is of type int. If there is an error in the
attempt to open the file, this value is negative.
->filename - this is of type char*. This string contains the physical filename.
->flags - this argument controls the operation of the open function. It determines
whether to open an existing file for reading or writing. It also determines whether to
open an existing file or create a new file.
This argument is formed by a bitwise OR of flag values. Commonly used flags are-
O_RDONLY - open for reading only.
O_WRONLY - open for writing only.
O_RDWR - open for reading and writing.
O_CREAT - create the file if it does not exist.
O_APPEND - append every write to the end of the file.
O_TRUNC - delete the existing contents of the file.
->pmode - this is of type int. It is used only if O_CREAT is used. pmode is the protection mode, which
specifies the read, write and execute permissions for the owner, group and others on the created file.
It is a 3-digit octal number. The first digit indicates how the file can be used by the owner, the 2nd the
permissions for the group and the 3rd those for others.
               owner    group    others
               r w x    r w x    r w x
pmode = 0751   1 1 1    1 0 1    0 0 1
Example: opening an existing file for reading and writing, or creating a new one if
necessary (if it doesn't exist)-
fd=open(filename,O_RDWR|O_CREAT,0751);
Closing files:
Files are usually closed automatically by the operating system when a program terminates
normally. The execution of a close statement within a program is needed to protect against data loss,
in the event that the program is interrupted, and to free up logical filenames for reuse.
“close()” function is used to close the file.
The syntax of the read function is- read(source_file, destination_addr, size);
-> source_file : the logical filename from which the data has to be read.
-> destination_addr : specifies the address where the read data has to be stored.
-> size : the maximum number of bytes that can be read from the file.
The syntax of the write function is- write(destination_file, source_addr, size);
-> destination_file : the logical filename that is used for writing the data.
-> source_addr : the address where the information to be written is found.
-> size : the maximum number of bytes that can be written to the file.
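A minimal sketch combining these calls on a UNIX system (the file names "in.txt" and "out.txt", the pmode and the 512-byte buffer are assumptions):

#include <fcntl.h>    // open() and the O_* flags
#include <unistd.h>   // read(), write(), close()
#include <stdio.h>    // perror()

int main()
{
    char buf[512];
    int n;

    // open the source for reading; create the destination with pmode 0751
    int src = open("in.txt", O_RDONLY);
    int dst = open("out.txt", O_WRONLY | O_CREAT, 0751);
    if (src < 0 || dst < 0) {
        perror("open");   // fd is negative on error
        return 1;
    }

    // read up to 512 bytes at a time and write them to the destination
    while ((n = read(src, buf, sizeof(buf))) > 0)
        write(dst, buf, n);

    close(src);   // free the logical filenames/descriptors for reuse
    close(dst);
    return 0;
}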
File handling with C streams and C++ stream classes:
There are 2 ways of manipulating files in C++:
1) Using standard C functions defined in the header file stdio.h - C streams.
2) Using stream classes defined in iostream.h and fstream.h - C++ streams.
Read and write operations are supported by fread, fgetc, fwrite, fputc, fscanf and fprintf.
The C++ streams use equivalent operations, but the syntax is different. The function to open
a file is 'open'.
A C++ program that opens a file for input and reads it, character by character, sending each
character to the screen after it is read from the file (an implementation of the cat command):
#include<fstream>
#include<iostream>
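// A minimal body matching the description above; the input file name
// "in.txt" is an assumption.
using namespace std;
int main()
{
    char ch;
    fstream infile;
    infile.open("in.txt", ios::in);
    infile.unsetf(ios::skipws);   // do not skip white space characters
    while (infile >> ch)          // read one character at a time
        cout << ch;               // send each character to the screen
    infile.close();
    return 0;
}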
2) Implement the UNIX command 'tail -n filename', where n is the number of lines from the
end of the file to be copied to the stdout (screen).
#include<fstream.h>
#include<iostream.h>
#include<conio.h>
#define REC_SIZE 20   // assumed fixed length of each line (record)

void main()
{
    int n;
    char buf[100];
    fstream fp;
    clrscr();
    cout<<"Enter the no. of lines to be displayed\n";
    cin>>n;
    fp.open("in.txt",ios::in);
    if(fp.fail())
        cout<<"could not open the file\n";
    fp.seekg(-(REC_SIZE*n),ios::end);   // move n records back from the end
    while(fp.getline(buf,100))          // read and display the remaining lines
        cout<<buf<<"\n";
    getch();
}
3) Implement the UNIX command ‘cp file1 file2’, where ‘file1’ contents are copied to
‘file2’.
#include<iostream.h>
#include<conio.h>
#include<fstream.h>

void main()
{
    char ch;
    fstream fp1,fp2;
    clrscr();
    fp1.open("in.txt",ios::in);     //'read' mode
    fp2.open("out.txt",ios::out);   //'write' mode
    if(fp1.fail() || fp2.fail())
        cout<<"could not open the file\n";
    while(fp1.get(ch))   // copy the file character by character
        fp2.put(ch);
    getch();
}
Seeking:
The action of moving directly to a certain position in a file is called seeking.
Sometimes in a program we may need to jump to a byte which is ten thousand bytes away, or
to the end of the file, to read or write some contents; the seek() function is used for this.
Eg - seek(data, 370);
The read/write pointer is moved to the 370th byte in the file 'data'.
In C, seeking is done with the fseek function. Its syntax is- pos=fseek(file, byte_offset, origin);
where,
pos - the position of the read/write pointer after the fseek() function.
file - the logical file on which the seek is to be performed.
byte_offset - the number of bytes to move from the origin.
origin - a value that specifies the starting position from which the byte_offset
must be taken; it can have 3 values:
SEEK_SET(0), SEEK_CUR(1), SEEK_END(2).
Eg: pos=fseek(data,370L,0)
The pointer is moved to the 370th position from the start of the file 'data'.
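A small sketch using fseek (the file name 'data' is an assumption):

#include <stdio.h>

int main()
{
    FILE *data = fopen("data", "rb");   // 'data' is an example file name
    if (data == NULL)
        return 1;

    fseek(data, 370L, SEEK_SET);        // move the read/write pointer to byte 370
    int ch = getc(data);                // read the byte at that offset
    if (ch != EOF)
        printf("byte at offset 370: %c\n", ch);

    fclose(data);
    return 0;
}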
A file in the UNIX file system can be one of the following:
1. Regular file.
2. Directory, containing files or directories.
Any file in the UNIX file system can be uniquely identified by its absolute pathname, which begins
at the root directory.
For eg: the absolute pathname of the file 'addr' is '/usr6/mydir/addr'.
All pathnames that begin with the current directory are called relative pathnames. The
special file name "." stands for the current directory and ".." stands for the parent directory.
Physical devices and logical files:
A keyboard is considered a file which produces a sequence of bytes that are sent to the
computer when keys are pressed. The console (monitor) accepts a sequence of bytes and displays the
corresponding symbols on screen. These devices have the filenames '/dev/kbd' and
'/dev/console' respectively.
In UNIX, a file is represented logically by an integer file descriptor. A keyboard, a disk file and
a magnetic tape are all represented by integers. Once the integer that describes a file is obtained, a
program can access that file using the integer.
The statements below show how data can be read from a file using the file descriptor and then
displayed on the console.
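A sketch of such statements (the file name "myfile" is an assumption; descriptor 1 is the console/standard output):

#include <fcntl.h>
#include <unistd.h>

int main()
{
    char ch;
    int fd = open("myfile", O_RDONLY);   // the integer describing the file
    while (read(fd, &ch, 1) > 0)         // read one byte at a time from the file
        write(1, &ch, 1);                // descriptor 1 is the console (stdout)
    close(fd);
    return 0;
}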
The logical file of “newfile.c” is represented by the value returned by fopen() method.
Similarly the console is represented by the value ‘stdout’ defined in ‘stdio.h’.
Similarly the keyboard is represented by ‘stdin’ and the error file is represented by
‘stderr’(standard error) in stdio.h.
I/O redirection allows to specify alternative files for input or output at the execution time.
The notations for input and output redirection on the UNIX command line are '<' (redirect
input) and '>' (redirect output). For example, 'program > myfile' sends the output of program to
myfile instead of the console.
If the output of a program is to be fed as input to another program, then the pipe symbol
('|') is used.
Syntax: program1| program2
The result of program1 is sent as input to program2.
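For example: $cat myfile | wc - the contents of myfile listed by cat are counted by wc.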
The header files required for file handling problems are iostream.h, fstream.h, fcntl.h and
file.h.
The C++ streams are in iostream.h and fstream.h. Many UNIX operations are in fcntl.h
and file.h.
The flags O_RDONLY, O_WRONLY and O_RDWR are usually found in file.h.
Surface of disk:
The significance of the cylinder is that all the information of a single cylinder can be accessed
without moving the read/write head. Moving the arm holding the read/write head is called seeking.
This is the slowest activity in reading the information from the disk.
In a typical disk, each platter has two surfaces, so the number of tracks per cylinder is twice the
number of platters.
The number of cylinders is the same as the number of tracks on a disk surface, and each track has the
same capacity. Hence the capacity of the disk is a function of the number of cylinders, the number of
tracks per cylinder and the capacity of a track.
A cylinder consists of group of tracks. A track consists of group of sectors.
Solve:
1) Suppose we want to store a file of 50,000 fixed length data records on a typical smallcomputer
with the following characteristics:
No of bytes per sector=512 bytes.No of sectors per track = 63.
No of tracks per cylinder = 16. No of cylinders= 4092.
How many cylinders does the file require if each data record requires 200 bytes? What is
the total capacity of the disk?
Solution:
Track capacity = 63 sectors x 512 bytes = 32,256 bytes.
Cylinder capacity = 16 tracks x 32,256 bytes = 516,096 bytes.
Drive capacity = 4092 cylinders x 516,096 bytes = 2,111,864,832 bytes
= 2,111 MB
One sector holds 512/200 = 2 whole records, so the file needs 50,000/2 = 25,000 sectors.
One cylinder holds 16 x 63 = 1008 sectors, so the file requires 25,000/1008 ≈ 24.8 cylinders.
Organizing tracks:
There are 2 ways to organize data on a disk:
1. By sector.
2. By user defined block.
When logically adjacent sectors are also physically adjacent, the controller may not be able to
process one sector before the next has already passed under the read/write head. This problem can be
solved by interleaving the sectors: several physical sectors are placed between two logically adjacent
sectors. With an interleaving factor of 5, there is an interval of 5 physical sectors between logically
adjacent sectors, and only five revolutions are required to read all the 32 sectors of a track.
Clusters:
A cluster is a fixed number of contiguous sectors. Once a given cluster has been found on a
disk, all sectors in that cluster can be accessed without requiring an additional seek.
A part of the operating system called the file manager creates a file allocation table (FAT). The
FAT contains a list of all the clusters of a file, in order, with a pointer to the physical location of each
cluster on the disk.
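The idea can be pictured with a small sketch (the cluster numbers and table size here are made up, not an actual FAT layout):

#include <iostream>

const int END_OF_FILE_MARK = -1;

int main()
{
    // fat[i] = the cluster that follows cluster i, or an end-of-file mark.
    // Here a file occupies the cluster chain 2 -> 5 -> 3.
    int fat[8] = {0, 0, 5, END_OF_FILE_MARK, 0, 3, 0, 0};

    // walk the chain, starting from the file's first cluster
    for (int c = 2; c != END_OF_FILE_MARK; c = fat[c])
        std::cout << "cluster " << c << "\n";
    return 0;
}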
Extents:
Extents of a file are the parts of the file that are stored in contiguous clusters, so that the
number of seeks needed to access the file is reduced.
It is preferable to store the entire file in one extent, but this may not be possible due to the
non-availability of contiguous space, errors in the allocated space, etc.; the file is then divided into
two or more extents.
As the number of extents in a file increase, the file becomes more spread out on the disk and
the amount of seeking required to process the file increases.
Fragmentation:
Since the smallest organizational unit of a disk is one sector, data is stored in terms of whole
sectors and a new file starts in a new sector. This leads to unused (wasted) disk space, which is
called fragmentation.
Suppose the sector size is 512 bytes and the record size is 300 bytes. The records can be stored
in two ways:
• Store only one record per sector. A record can then be read with one sector access, but 212
bytes of every sector are wasted (internal fragmentation).
• Allow records to span sectors, so the beginning of a record might be found in one sector and
the end of it in another. No space is wasted, but reading a record may require two sector accesses.
With user-defined blocks, the same choice arises; if the block size is chosen as a multiple of the
300-byte record size, no space is wasted within a block.
Non-data overhead:
In both block and sector organization of tracks, a certain amount of disk space is taken up by
information about the stored data; this is called non-data overhead. On sector-addressable
disks, the non-data overhead contains information such as sector addresses, block addresses,
sector condition, synchronization marks between fields of information, etc. This non-data overhead
is of no concern to the programmer.
On a block-addressable disk, some of the non-data overhead is the programmer's concern. Since
sub-block information and inter-block gaps have to be provided with every block, there is more
non-data overhead with block organization than with sector organization.
Note: If the blocking factor is large, more records can be stored per track. But a larger blocking
factor may lead to fragmentation within a track.
Solution:
a) If the blocking factor is 10, there are 10 records per block.
Record size = 100 bytes; sub-block and inter-block gap = 300 bytes.
So block size = (100 x 10) + 300 = 1300 bytes (1 block).
b) If the blocking factor is 60, there are 60 records per block.
So block size = (100 x 60) + 300 = 6300 bytes (1 block).
a) Seek time:
The time taken to move the read/write head to the required cylinder, is called seek time.
The amount of seek time spent depends on how far the read/write head has to move.
If we are accessing a file sequentially and the file is packed into several consecutive cylinders, then
the seek time required for consecutive access of data is less.
Since the seek time required for each file operation varies, usually the average seek time is
used.
Most hard disks available today have an average seek time of less than 10 milliseconds.
b) Rotational delay:
The time taken for the disk to rotate so that the required sector comes under the read/write
head is called rotational delay (latency). On average, it is half a revolution.
c) Transfer time:
The time required to transfer one byte of data from track to read/write head or vice versa, is
called Transfer time.
Note:
1) Sequential access of a file takes less time to transfer data when compared to random access.
2) Seek time is more in random access.
Problem 3:
Calculate the data transfer speed of a hard disk if it spins at 10,000 rpm and has 170 sectors
per track. Assume that a sector can store 512 B of data.
Solution:
Bytes per track = 170 x 512 = 87,040 bytes.
Revolutions per second = 10,000/60 = 166.67.
Transfer speed = 87,040 x 166.67 = 14,506,666 bytes/sec ≈ 14.5 MB/sec.
Disk transfer is much slower than the network and the computer CPU, so the
network or CPU has to wait a long time for the disk to transmit data.
A number of techniques are used to solve this problem. One among them is
multiprogramming, in which the CPU works on other jobs while waiting for the data to arrive from
the disk.
Another technique is striping. Disk striping involves splitting a file into parts stored on several
different drives, then letting the separate drives deliver their parts of the file to the network
simultaneously. This improves the throughput of the disk.
Another approach to solving the disk bottleneck is to avoid accessing the disk. As the cost
of memory is steadily decreasing, more programmers are using memory to hold data. Two ways of
using memory instead of secondary storage are RAM disks and disk caches.
A RAM disk is a large part of memory configured to simulate the behavior of a mechanical
disk in every respect except speed and volatility.
The data in a RAM disk can be accessed much faster than on disk, i.e. without a seek or
rotational delay. But the memory is volatile: the contents of a RAM disk are lost when the
computer is turned off.
A disk cache is a large block of memory configured to contain pages of data from a disk.
When data is requested by a program, the file manager first looks into the disk cache to check if it
contains the page with the requested data. If it does, the data can be processed immediately.
Otherwise, the file manager reads the page from the disk, replacing some page already in the disk cache.
RAM disks and disk caches are both examples of buffering.
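The lookup logic can be sketched as follows (the page contents, names and the unbounded cache are simplifying assumptions; a real cache would also limit its size and replace pages):

#include <iostream>
#include <map>
#include <string>
using namespace std;

map<int, string> cache;   // page number -> page contents held in memory

// assumed stand-in for a real disk read
string read_page_from_disk(int page)
{
    return "contents of page " + to_string(page);
}

string get_page(int page)
{
    // first look in the disk cache
    map<int, string>::iterator it = cache.find(page);
    if (it != cache.end())
        return it->second;             // cache hit: no disk access needed

    // cache miss: read from disk and keep the page in the cache
    string data = read_page_from_disk(page);
    cache[page] = data;
    return data;
}

int main()
{
    cout << get_page(7) << "\n";   // read from "disk"
    cout << get_page(7) << "\n";   // served from the cache
    return 0;
}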
Magnetic Tape
Magnetic tape is a device that provides no direct (random) accessing facility, but can
provide very rapid sequential access to data.
Tapes are compact, works in different environmental conditions, are easy to store and
transport and are less expensive than disks. Tapes are commonly used as a backup device.
The parity bit is not part of the data; it is used to check the validity of the data. If odd parity is set
for the tape, this bit is set to make the number of 1 bits in the frame odd.
Frames are grouped into data blocks whose size can vary from a few bytes to many kilobytes,
depending on the needs of the user. Blocks are separated by inter-block gaps, which contain no
information.
Example: A file of 1 million records is stored on tape in 20,000 blocks. Each block occupies
6.4 inches of tape, and blocks are separated by inter-block gaps of 0.3 inches. How long a tape is needed?
The space required is S = n x (b + g), where n is the number of blocks, b is the physical length
of a block and g is the length of an inter-block gap.
S = 20000 x (6.4 + 0.3)
= 134,000 inches [12 inches = 1 foot]
= 11,166.67 feet
So a tape of 11,166.67 feet is used to store 1 million records.
Disk Vs Tape
In the past, both disks and tapes were used for secondary storage. Disks were
preferred for random access and tapes for sequential access.
Now, disks have taken over much of secondary storage because of the decreased cost of
disk and memory storage; tapes are used mainly as tertiary (archival) storage.
Introduction to CD-ROM
CD-ROM is an acronym for Compact Disc Read-Only Memory. A single disc can
hold more than 600 MB of data. CD-ROM is read only, i.e. it is a publishing medium rather than
a data storage and retrieval device like magnetic disks.
CD-ROM strengths – high storage capacity, inexpensive price, durability.
CD-ROM weakness – extremely slow seek performance, which makes intelligent file structure
design critical.
History of CD-ROM
CD-ROM can be used to store any kind of digital information (text, audio and video).
Philips and Sony worked on a way to store music on optical discs in digital
form rather than the analog form used earlier.
Thus CD audio appeared in 1984. CD-ROM was built using the same technology
as CD audio. The CD-ROM drive appeared in 1985; it was a read-only device. With the
introduction of CD-RW (compact disc re-writable) in 1997, the use of CDs became
widespread.
Reading Pits & Lands - (A CD is made of polycarbonate plastic with a thin layer of aluminum to
make a reflective surface.)
• CD-ROMs are stamped from a glass master disk which has a coating that is changed
by the laser beam. When the coating is developed, the areas hit by the laser beam turn
into pits, and the smooth, unchanged areas between the pits are called lands.
• When the stamped copy of the disk is read, a beam of laser light is focused on the
track. The pits scatter the light, but a land reflects most of it back. This
alternating pattern of high- and low-intensity reflections carries the original digital information.
• 1's are represented by the transition from land to pit and back again. 0's are
represented by the amount of time between transitions: the longer the time between transitions,
the more 0's we have.
[Figure: a stretch of track showing pits and lands; each pit/land transition is read as a 1, and the time between transitions as a run of 0's.]
The encoding scheme used makes it impossible to have two adjacent 1's: 1's are always
separated by 2 or more 0's. The data read from the track therefore has to be
translated back to 8-bit patterns of 1's and 0's to recover the original data.
This encoding scheme, called EFM encoding (Eight to Fourteen Modulation), is
done through a lookup table that turns each original 8 bits of data into 14 expanded bits,
which are represented as pits and lands on the disk. Reading reverses this translation.
• Since 0's are represented by the length of time between transitions, the disk must be
rotated at a precise and constant speed. This requirement affects the design of the CD-ROM drive.
• In constant angular velocity (CAV), the tracks are concentric and the sectors are pie-shaped.
Data is written less densely in the outer tracks than in the center tracks, so that there is an equal
amount of data in every sector. This wastes storage capacity in the outer
tracks, but has the advantage that the disk can spin at the same speed for all
positions of the read/write head.
• In constant linear velocity (CLV), a sector towards the center of the disk takes the
same amount of space as a sector towards the outer edge of the disk. Hence data can be
stored at maximum density in all the sectors.
Since reading the data requires that it pass under the optical pickup
device at a constant rate, the disk has to spin more slowly when we are reading at the
outer edges than when we are reading towards the center of the disk. Hence the disk
spins at a varying speed.
CAV:
• Disk is divided into pie-shaped sectors.
• Data is densely packed at the center and loosely packed at the outer edge.
• Wastage of disk space at the outer edge.
• Disk spins at a constant speed.
• Lesser storage capacity than the CLV architecture.

CLV:
• Disk is divided into sectors of equal physical size.
• Data is densely packed all over the disk.
• No wastage of space.
• Disk spins more slowly when reading at the outer edge.
• Much better storage capacity.
3) Storage capacity – A CD-ROM holds more than 600 MB of data, which is a huge
amount of storage. Many typical text databases and document
collections stored on CD-ROM use only a fraction of this capacity. Such large
capacity enables us to build indexes and other support structures that help
overcome the poor seek performance of CD-ROM.
4) Read-only access – CD-ROM is a publishing medium, a storage device that cannot
be changed after the manufacturer creates it. The advantage of this is that the user
need not worry about updating, which simplifies some of the file structures.
5) Asymmetric writing and reading – with CD-ROM, files are placed on
the disc once and their contents are accessed thousands or millions of times. With an
intelligent, carefully designed file structure built once, the user can enjoy the
benefits of that investment again and again.
Storage as a Hierarchy
There are different types of storage devices of different speed, capacity and
cost. The users can select the device depending on their need.
A JOURNEY OF A BYTE
Consider how a byte is stored from a program. The statement
write(textfile,ch,1); //write value of ch to hard disk
calls the operating system. The operating system invokes the file manager, the part of the OS
which deals with file-related matters and I/O devices.
The file manager does the following tasks when a write operation is requested-
− Checks whether the requested write operation is permitted.
− Locates the physical location where the byte has to be stored (i.e. locates the drive, cylinder,
track and sector).
− Finds out whether the sector that is to hold the character ('ch') is already in memory; if not,
loads it into an I/O buffer.
− Puts 'ch' in the buffer.
− Keeps the sector in memory to see if more operations are to be done in the same sector.
Disk Controller
The job of controlling the operation of the disk is done by the disk controller.
- The I/O processor asks the disk controller if the disk drive is available for
writing.
- The disk controller instructs the disk drive to move its read/write head to the right track
and the right sector.
- The disk spins to the right location and the byte is written.
BUFFER MANAGEMENT
Use of Buffer – Buffering involves working with large chunks of data in memory, so that the
number of accesses to secondary storage can be reduced.
Buffer Bottleneck –
• Assume that the system has a single buffer and is performing input and output
alternately, one character at a time.
• In this case, the sector containing the character to be read is constantly overwritten by
the sector containing the spot where the character has to be written, and vice versa.
• In such a case, the system needs more than one buffer.
• Moving data to and from disk is very slow and programs may become I/O bound,
so we need better strategies to avoid this problem.
Double buffering allows the OS to operate on one buffer while the other buffer is being loaded or
emptied. In (a) the contents of system I/O buffer 1 are sent to disk while I/O buffer 2 is being filled,
and in (b) the contents of buffer 2 are sent to disk while I/O buffer 1 is being filled.
Buffer pooling
• When a system buffer is needed, it is taken from a pool of available buffers and
used.
• When the system receives a request to read a certain sector or block, it searches the pool
to see if any buffer already contains that sector or block. If no buffer contains it,
the system takes a free buffer from the pool and loads the sector or block into it.
• Move mode and locate mode – in move mode, data is copied between the system buffer and the
program's data area. In locate mode, the program works on the data directly in the system
buffer, so there is no transfer of data between the system buffer and the data area.
Scatter/Gather I/O
Each block in memory consists of a header followed by data. Suppose many
blocks of a file are read at once. To process the data, the headers of the blocks and the data
of the blocks are moved into two separate buffers, so that only one read operation is required
to read the data of all the blocks. This technique of separating the headers and the data blocks
is called 'scatter' input.
The reverse of scatter input is gather output: the data and header blocks are kept in
separate buffers, and after the data is processed, they are joined together and written with a
single operation.
I/O in Unix
The layered view of UNIX I/O shows the process of transmitting data from a program to an
external device. The topmost layer deals with data in logical terms; the layers below carry
out the task of turning the logical object into a collection of bits on a physical device. These
lower layers make up the kernel.
The top layer consists of processes, associated with solving some problems using shell
commands(like cat, tail, ls etc), user programs that operate on files, and library routines
like scanf and so on. Below this layer is the unix kernel, which consists of all the rest of the
layers.
In UNIX all the operations below the top layer are independent of applications.
Consider the example of writing a character to disk-
write(fd, &ch, 1);
When this system call is executed, the kernel is invoked immediately. The system call
instructs the kernel to write a character to a file.
The kernel I/O system connects the file descriptor (fd) in the program to some file in the
file system. It does this by proceeding through a series of four tables that enable the kernel to
find the file on the disk.
The open file table contains entries for every file open in the system. Every time a file is
opened or created, a new entry is added to this table. It contains information about the file,
such as –
- mode in which file is opened
- number of processes currently accessing that file
- the offset within the file where the pointer is positioned for the next read
or write operation
- array of pointers to generic functions ( generally used functions)
- pointer to file’s inode table
More information about the file is present in the inode table. The file's inode is
kept on the disk with the file. It contains information such as the permissions given to the file
at creation, the owner's id, the file size, the number of blocks used by the file, and the file
allocation table.
The file allocation table (FAT) within the inode associates the clusters of the file
with pointers locating the addresses of those clusters.
Every file path starts with a directory. A directory is just a small file that contains a list of
file names, where each file name is associated with a pointer to the file's inode on disk. This pointer
from a directory to the inode of a file is called a hard link. It provides direct information about the file.
It is possible for a file to be known by different names; in such a case, all of those file names
point to the same inode and there are many hard links to the same file. A field in the inode
tells how many hard links there are to the inode. When a file name is deleted and there are
other file names for the same file, the file itself is not deleted; its inode's hard-link count is
decremented.
A soft link, or symbolic link, links a file name to another file name or path rather than pointing
to an inode. Hence when the original file is deleted, the inode is also deleted and the symbolic link
becomes a dangling reference.
Types of Files
There are three types of files –
Normal files – normal text or program files.
Special files – files that drive some device, such as a line printer or graphics
device (device drivers).
Sockets – abstractions that serve as end points for inter-process communication.
Device Drivers
For every peripheral device, there is a separate set of routines called a device driver. It performs
the I/O operations between the I/O buffer and the device.
A stream file:
The operator << is an overloaded function that writes the fields to a file as a stream of bytes. Eg:
fstream fp1;
fp1.open(filename,ios::out);
fp1<<name<<usn;
name field (fixed length of 10 bytes):
byte:     0 1 2 3 4 5  6 7 8 9
content:  C h a y a \0 # # # #    next field...
In this method, fields are organized by limiting the maximum size of each field.
This is called a fixed-length field.
As the fields are of predictable length, we can access them from a file by
counting the number of characters or till ‘#’ appears.
Disadvantages:
1. Padding is required to bring the fields up to a fixed length, which makes the file
larger.
2. Data might be lost if it does not fit into the allocated space.
Due to the above mentioned problems, the fixed field approach of structuring data fields is
inappropriate for data that contains a large amount of variability in the length of fields.
This method of structuring is very good solution if the fields are already fixed in length or
there is very little variation in field lengths.
AMAR 1VACS054 11 12 13
MARY.S 1VA09IS053 15 17 53
In this method, it is possible to know the end of each field, as the length of the field is
stored at the beginning of the field. Such fields are called length-based fields.
04AMAR081VACS054021102120213
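A small sketch that unpacks such a length-based record (using the record shown above):

#include <iostream>
#include <cstring>
#include <cstdlib>
using namespace std;

int main()
{
    // a record whose fields are each prefixed with a 2-digit length
    const char *rec = "04AMAR081VACS054021102120213";
    int pos = 0, n = strlen(rec);

    while (pos + 2 <= n) {
        // read the 2-digit length indicator
        char lenstr[3] = {rec[pos], rec[pos + 1], '\0'};
        int len = atoi(lenstr);
        pos += 2;

        // copy out the field of that length
        char field[100];
        strncpy(field, rec + pos, len);
        field[len] = '\0';
        pos += len;

        cout << field << "\n";   // prints AMAR, 1VACS054, 11, 12, 13
    }
    return 0;
}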
06MARY.S101VA09IS053021502170253
In this method, the fields are separated by using a special character or sequence of
characters. The special character used to separate the fields is called the delimiter.
The selected delimiter to separate the fields should not appear as a field value.
A delimiter can be ‘|’, ‘#’, space, newline etc.
AMAR|1VACS054|53|32|42
MARY.S|1VA09IS053|54|73|82
In this method, the keyword and its values are stored for each record. It is the first
structure in which a field provides information about itself. Such self- describing structures
are very useful tools for organizing files in many applications.
It is easy to identify the missing fields.
The main disadvantage of this structure is that 50% or more of the file's space is occupied by
the keywords.
Name=AMAR|usn=1VACS054|M1=53|M2=32|M3=42
Name=MARY.S|usn=1VA09IS053|M1=54|M2=73|M3=82
A fixed length record file is one in which each record contains the same number of bytes.
The record size can be determined (fixed) by adding up the maximum space occupied by each
field. This method is also called the counting-bytes structure.
Here the size of the entire record is fixed, while the fields inside the record can be of varying
or fixed size.
AMAR#########1VACS057####17##### 54####13
MARY.S####### 1VA09IS054###15##### 54####17
In this method the number of fields in each record is fixed. This method is also called
as fixed field count structure. Assuming that every record has 5 fields, then each record can be
displayed by counting the fields modulo five.
AMAR|1VACS057|17|54|13| MARY.S|1VA09IS054|15|54|17|…….
In this method, every record would begin with an integer, that indicates how many
bytes are there in the rest of the record. This is a commonly used method for handling variable
length records.
23 AMAR|1VACS057|17|54|13| 27 MARY.S|1VA09IS054|15|….
An index is used to keep the byte offset of each record. The byte offset allows us to
find the beginning of each successive record and to compute the length of each record. The
position of any record can be obtained from the index file, and then we can seek to that record
in the data file.
Index file: 00 23
Data file:  AMAR|1VACS057|17|54|13|# MARY.S|1VA09IS054|15|#….
1. As the length indicator is put at the beginning of every record, the sum of the lengths
of the fields in the record must be known before the fields are written to the file.
2. In what form should the record-length field be written to the file?
To solve the first problem, all the field values are put into a buffer one by one,
with a delimiter separating the fields, and finally the length of the buffer is found using the 'strlen'
function:
char buffer[100];
buffer[0] = '\0';
// append the field values one by one, with a delimiter after each,
// using strcpy() and strcat()
strcpy(buffer, name);  strcat(buffer, "|");
strcat(buffer, usn);   strcat(buffer, "|");
// find the buffer (record) length
int len = strlen(buffer);
// the record length is put to the file first, after converting it to characters
To solve the second problem, the integer value is converted to its character (ASCII)
representation, which is then put to the file.
                         IOBuffer
              (char array for the buffer value)
                /                         \
  VariableLengthBuffer               FixedLengthBuffer
  (read and write operations         (read and write operations
   for variable-length records)       for fixed-length records)
The members and methods that are common to all the buffer classes are
included in the base class IOBuffer. Other methods are in the classes VariableLengthBuffer
and FixedLengthBuffer, which support the read and write operations for the different types of
records. Finally, the classes LengthFieldBuffer, DelimiterFieldBuffer and FixedFieldBuffer
have the pack and unpack methods for the specific field representations.
Record Access:
A standard form of representing a key is called the canonical form. To retrieve a
specific record by its key, the search key must be converted to this canonical form.
The unique key which identifies a single record is the primary key. Ex: USN.
A secondary key is a key that is common to a group of records. Ex: Semester.
Sequential Search:
Sequential search is the technique of searching, where the records are
accessed one-by-one and checked till the searching key is found.
The work required to search sequentially for a record in a file with n
records is proportional to n.
A number of tools are present in UNIX which process a file sequentially, e.g. cat, wc and
grep.
$cat myfile – displays the contents of the file.
AMAR 1VACS057 13 15 17
MARY.S 1VA09IS054 53 17 54
wc – reads through an ASCII file sequentially and counts the number of lines,
words and characters in a file.
$wc myfile
2 10 49
Direct Access:
The alternative to sequential search for a record is the retrieval mechanism known as
direct access. There is direct access to a record, when we can seek directly to the beginning
of the record and read it.
The problem can be solved if we know the RRN of the required record in fixed-length
records. RRN is the Relative Record Number, it gives the position of a record with respect to
the beginning of the file. The first record in the file has RRN0, the next has RRN1 and so
on.
If the records are of variable length, the records must be searched sequentially to reach the
correct RRN. In the case of fixed-length records, the byte offset from the start of the file is-
Byte offset = record size (r) x required RRN (n)
For example, with fixed-length records of size 128 bytes, the byte offset of the record with RRN 2 is
Byte offset = 128 x 2 = 256
so the record with RRN 2 starts from the 256th byte.
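A sketch of direct access by RRN (the record size of 128 bytes and the file name "data.txt" are assumptions):

#include <fstream>
#include <iostream>
using namespace std;

const int REC_SIZE = 128;   // assumed fixed record length

int main()
{
    char rec[REC_SIZE + 1];
    int rrn = 2;   // the record we want

    fstream fp("data.txt", ios::in | ios::binary);
    fp.seekg(rrn * REC_SIZE, ios::beg);   // byte offset = r x n
    fp.read(rec, REC_SIZE);
    rec[REC_SIZE] = '\0';

    cout << rec << "\n";
    return 0;
}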
Header records:
A header record is often placed at the beginning of the file to hold some
general information about a file.
The header keeps track of general information such as the number of records in the file, the size
of each record, the size of the header, the date and time of the last alteration, etc. The header
record usually has a different structure from the data records in the file.
Headers are usually defined in a class by the help of 2 methods-
• int readheader(); - reads the header and returns
▪ 1 –if header is correct and
▪ 0 –if header is wrong.
• int writeheader(); - adds header to the file and returns the number of bytes in the
header.
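A minimal sketch of such a pair of methods (the header layout and the identifying string are assumptions):

#include <fstream>
#include <cstring>
using namespace std;

const char MAGIC[5] = "FILE";   // assumed identifying string

struct Header {
    char magic[5];
    int  reccount;   // number of records in the file
    int  recsize;    // size of each record
};

// reads the header and returns 1 if it is correct, 0 otherwise
int readheader(fstream &f, Header &h)
{
    f.seekg(0, ios::beg);
    f.read((char *)&h, sizeof(h));
    return (f && strcmp(h.magic, MAGIC) == 0) ? 1 : 0;
}

// adds the header to the file and returns the number of bytes in it
int writeheader(fstream &f, const Header &h)
{
    f.seekp(0, ios::beg);
    f.write((const char *)&h, sizeof(h));
    return sizeof(h);
}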
Important Questions
1. What are file structures? Explain briefly the history of file structures design.
2. Explain the different costs of disk access. Define i) seek time ii) rotational delay iii) transfer
time.
3. Explain the functions OPEN, READ and WRITE with parameters.
4. Briefly explain the different basic ways to organize the data on a disk.
5. Briefly explain the organization of data on Nine-Track tapes with a neat diagram