FS M1 Part1
Design.
To Discuss a number of Advanced Data Structure Concepts
that are necessary for achieving high efficiency in File
Operations.
To Develop important programming skills in an Object-
Oriented Language such as C++ or Java.
2
File Structures, An Object-Oriented Approach
with C++
by
Michael J. Folk, Bill Zoellick
and Greg Riccardi
3
CO1 To understand the concepts of storage, manipulation, and processing of files using various
file operations
CO2 Apply various data structures to achieve improved file operations
CO3 Analyze various file indexing techniques to improve performance of file structures
CO4 Illustrate different file organization and storage management techniques
CO5 Design and develop solutions for real-time file management problems
COs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO1 2
CO2 3 2
CO3 3
CO4 3 2
CO5 3
4
Module 1:
Introduction: File Structures: The Heart of File Structure Design, A Short History of File
Structure Design, A Conceptual Toolkit; Fundamental File Operations: Physical Files and
Logical Files, Opening Files, Closing Files, Reading and Writing, Seeking, Special Characters,
The Unix Directory Structure, Physical Devices and Logical Files, File-related Header Files,
UNIX File System Commands; Secondary Storage and System Software: Disks, Magnetic Tape,
Disk versus Tape; CD-ROM: Introduction, Physical Organization, Strengths and Weaknesses;
Storage as a Hierarchy, A Journey of a Byte, Buffer Management, Input/Output in UNIX.
Fundamental File Structure Concepts, Managing Files of Records: Field and Record
Organization, Using Classes to Manipulate Buffers, Using Inheritance for Record Buffer
Classes, Managing Fixed-Length, Fixed-Field Buffers, An Object-Oriented Class for Record
Files, Record Access, More about Record Structures, Encapsulating Record Operations in a
Single Class, File Access and File Organization.
5
Module 2:
Organization of Files for Performance, Indexing: Data Compression, Reclaiming Space in Files,
Internal Sorting and Binary Searching, Keysorting; What is an Index? A Simple Index for Entry-
Sequenced Files, Using Template Classes in C++ for Object I/O, Object-Oriented Support for
Indexed, Entry-Sequenced Files of Data Objects, Indexes that are too Large to Hold in Memory,
Indexing to Provide Access by Multiple Keys, Retrieval Using Combinations of Secondary Keys,
Improving the Secondary Index Structure: Inverted Lists, Selective Indexes, Binding.
Module 3:
Cosequential Processing and the Sorting of Large Files: A Model for Implementing
Cosequential Processes, Application of the Model to a General Ledger Program, Extension of
the Model to Include Multiway Merging, A Second Look at Sorting in Memory, Merging as a Way
of Sorting Large Files on Disk.
Multi-Level Indexing and B-Trees: The Invention of the B-Tree, Statement of the Problem, Indexing
with Binary Search Trees; Multi-Level Indexing, B-Trees, Example of Creating a B-Tree, An
Object-Oriented Representation of B-Trees, B-Tree Methods; Nomenclature, Formal Definition
of B-Tree Properties, Worst-case Search Depth, Deletion, Merging and Redistribution,
Redistribution during Insertion; B* Trees, Buffering of Pages; Virtual B-Trees; Variable-length
Records and Keys.
6
Module 4:
Indexed Sequential File Access and Prefix B+ Trees: Indexed Sequential Access, Maintaining a
Sequence Set, Adding a Simple Index to the Sequence Set, The Content of the Index: Separators
Instead of Keys, The Simple Prefix B+ Tree and its Maintenance, Index Set Block Size, Internal
Structure of Index Set Blocks: A Variable-order B-Tree, Loading a Simple Prefix B+ Tree, B-
Trees, B+ Trees and Simple Prefix B+ Trees in Perspective.
Module 5:
Hashing: Introduction, A Simple Hashing Algorithm, Hashing Functions and Record Distribution,
How Much Extra Memory Should be Used?, Collision Resolution by Progressive Overflow, Buckets,
Making Deletions, Other Collision Resolution Techniques, Patterns of Record Access.
Extendible Hashing: How Extendible Hashing Works, Implementation, Deletion, Extendible
Hashing Performance, Alternative Approaches.
7
What are File Structures?
Why Study File Structure Design
Overview of File Structure Design
8
A File Structure is a combination of representations for
data in files and of operations for accessing the data.
A File Structure allows applications to read, write and
modify data. It might also support finding the data that
matches some search criteria or reading through the
data in some particular order.
9
Computer Data can be stored in three kinds of locations:
◦ Primary Storage ==> Memory [Computer Memory]
◦ Secondary Storage [Online Disk/ Tape/ CDRom that can be
accessed by the computer] <== Our Focus
◦ Tertiary Storage ==> Archival Data [Offline Disk/
Tape/ CDRom not directly available to the computer.]
10
Secondary storage such as disks can pack thousands of megabytes in a
small physical location.
11
By improving the File Structure.
Since the details of the representation of the data and the
implementation of the operations determine the efficiency of the
file structure for particular applications, improving these details can
help improve secondary storage access time.
12
Get the information we need with one access to the disk.
If that’s not possible, then get the information with as few accesses
as possible.
Group information so that we are likely to get everything we need
with only one trip to the disk.
13
It is relatively easy to come up with file structure designs that meet
the general goals when the files never change.
When files grow or shrink as information is added and deleted, it
is much more difficult.
14
Early Work assumed that files were on tape.
Access was sequential and the cost of access grew in direct
proportion to the size of the file.
15
As files grew very large, unaided sequential access was not a good
solution.
Disks allowed for direct access.
Indexes made it possible to keep a list of keys and pointers in a
small file that could be searched very quickly.
With the key and pointer, the user had direct access to the large,
primary file.
16
As indexes also have a sequential flavour, when they grew too
much, they also became difficult to manage.
17
In 1963, researchers came up with the idea of AVL trees for data in memory.
AVL trees, however, did not apply to files because they work well when tree nodes
are composed of single records rather than dozens or hundreds of them.
In the 1970’s came the idea of B-Trees, which require an O(log_k N) access time,
where N is the number of entries in the file and k is the number of entries indexed in
a single block of the B-Tree structure --> B-Trees can guarantee that one can find
one file entry among millions of others with only 3 or 4 trips to the disk.
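As a rough worked illustration (the figures are illustrative, not from the text): if each B-Tree block indexes k = 100 entries, then 100^3 = 1,000,000, i.e. log_100(1,000,000) = 3, so one entry among a million can be located in about 3 block reads -- and often fewer actual disk trips, since the root (and sometimes the next level) tends to stay in memory.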
18
Retrieving entries in 3 or 4 accesses is good, but it does
not reach the goal of accessing data with a single request.
From early on, Hashing was a good way to reach this goal
with files that do not change size greatly over time.
Recently, Extendible Dynamic Hashing guarantees one or
at most two disk accesses no matter how big a file
becomes.
19
Fundamental File Processing Operations
20
Physical versus Logical Files
Opening and Closing Files
Reading, Writing and Seeking
Special Characters in Files
The Unix Directory Structure
Physical Devices and Logical Files
Unix File System Commands
21
Physical File: A collection of bits/bytes stored on the secondary
storage like disk or tape.
Logical File: A channel that connects the program to the physical
file(stream)
22
Logical File: A “Channel” (like a telephone line) that hides
the details of the file’s location and physical format from
the program.
When a program wants to use a particular file:
◦ It has to hook up the logical file to the physical file.
◦ E.g. (COBOL): select inp_file assign to “myfile.dat”
23
◦ This statement tells the operating system to make the hookup
between the logical file and the physical file.
24
Once we have a logical file identifier hooked up to a
physical file or device, we need to declare what we
intend to do with the file:
Open an existing file
Create a new file
That makes the file ready to use by the program
We are positioned at the beginning of the file and are
ready to read or write.
25
System Call: fd = open(filename, flags [, pmode]);
Argument  Type    Explanation
fd        int     Logical file descriptor.
                  (A negative value signals an error opening the file.)
filename  char *  Character string holding the physical file name.
flags     int     O_APPEND (append to the end of the file)
                  O_CREAT (create the file; no effect if the file exists)
                  O_EXCL (return an error if O_CREAT is specified and the
                  file exists)
                  O_RDONLY, O_RDWR, O_WRONLY
                  (open the file in the specified access mode)
                  O_TRUNC (if the file exists, truncate it to length zero,
                  destroying its contents)
27
Argument  Type    Explanation
flags     int     If O_CREAT is specified => pmode is required.
pmode     int     Protection mode:
                  pmode = rwe rwe rwe
                          111 101 001
                          owner group world
E.g.
1. fd = open(filename, O_RDWR | O_CREAT, 0751)
2. If the file already exists and we want to start at its initial position:
   fd = open(filename, O_RDWR)
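A minimal, self-contained sketch of the call described above (the file name myfile.dat and the error handling are assumed for illustration, not taken from the slide):

#include <fcntl.h>      /* open() and the O_... flags */
#include <unistd.h>     /* close() */
#include <stdio.h>      /* perror() */

int main(){
    int fd;
    /* create the file if it does not exist;
       pmode 0751 = rwe for owner, r-e for group, --e for world */
    fd = open("myfile.dat", O_RDWR | O_CREAT, 0751);
    if (fd < 0) {                 /* a negative value signals an error */
        perror("open");
        return 1;
    }
    /* ... read from / write to the file through fd ... */
    close(fd);                    /* give the descriptor back */
    return 0;
}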
28
Makes the logical file name/file descriptor available for
another physical file (it’s like hanging up the telephone
after a call).
Ensures that everything has been written to the file [since
data is written to a buffer prior to the file].
Files are usually closed automatically by the operating
system (unless the program is abnormally interrupted).
29
Read(Source_file, Destination_addr, Size)
30
Write(Destination_file, Source_addr, Size)
31
C Stream: stdio.h
◦ Input stream: stdin
◦ Output stream: stdout
Open function:
◦ file=fopen(filename, type);
file=> File * =>file descriptor
filename=> char *
type=> char * => read/input ("r"), write/output ("w"), append ("a")
32
type=> char * =>
r+ => Open an existing file for input/output
w+ => Create a new file / truncate an existing one for input/
output
a+ => Create a new file / append to an existing one for input/
output
fread, fgetc/fgets, fwrite, fputc/fputs
fscanf, fprintf (formatted I/O)
C++ Stream:
◦ We use cin and cout for input and output
◦ To use these, cin and cout overload the respective
operators (>> and <<).
◦ Other file operations are defined in fstream.h
◦ int open(char * filename, int mode)
◦ ios::in, ios::out, ios::nocreate (fail if the file does not
exist), ios::noreplace (fail if the file exists),
ios::binary (the file is in binary)
These are the 2 constructors in fstream.h which
attach a file to an fstream:
◦ fstream()
◦ fstream(char * filename, int mode)
35
Opening Files:
◦ links a logical file to a physical file.
In C:
FILE * outfile;
outfile = fopen(“myfile.txt”, “w”);
In C++:
fstream outfile;
outfile.open(“myfile.txt”, ios::out);
36
In C :
fclose(outfile);
In C++ :
outfile.close();
37
Read data from a file and place it in a variable inside the
program.
In C:
char c;
FILE * infile;
infile = fopen(“myfile.txt”,”r”);
fread(&c, 1, 1, infile);
In C++:
char c;
fstream infile;
infile.open(“myfile.txt”,ios::in);
infile >> c;
38
Write data from a variable inside the program
into the file.
In C:
char c;
FILE * outfile;
outfile = fopen(“mynew.txt”,”w”);
fwrite(&c, 1, 1, outfile);
In C++:
char c;
fstream outfile;
outfile.open(“mynew.txt”,ios::out);
outfile << c;
39
#include <stdio.h>
#include <string.h>

int main(){
    char ch;
    FILE * file;                  /* logical file (file pointer) */
    char filename[20];
    printf("enter the name of the file:");
    fgets(filename, sizeof(filename), stdin);   /* read the physical file name */
    filename[strcspn(filename, "\n")] = '\0';   /* strip the trailing newline  */
    file = fopen(filename, "r");
    if (file == NULL) return 1;                 /* could not open the file     */
    while (fread(&ch,1,1,file) != 0)            /* element size and no. of elements */
        fwrite(&ch,1,1,stdout);
    fclose(file);
    return 0;
}
40
#include <fstream.h>
#include <iostream.h>
main(){
char ch;
fstream file;                     // logical file
char filename[20];
cout<<"enter the name of the file:"<<flush;
cin>> filename;
file.open(filename, ios::in);
file.unsetf(ios::skipws);         // do not skip whitespace when reading
41
while(1){
file>>ch;               // overloaded operator >> reads a character from the file into ch
if(file.fail())break;   // end of file (or read error)
cout<<ch;
}
file.close();
}
42
A program does not necessarily have to read through a file
sequentially: It can jump to specific locations in the file or to the
end of file so as to append to it.
The action of moving directly to a certain position in a file is often
called seeking.
Seek(Source_file, Offset)
◦ Source_file = the logical file name in which the seek will occur
◦ Offset = the number of positions in the file the pointer is to be
moved from the start of the file.
43
pos=fseek(file, byte_offset, origin)
◦ pos => address to which read/write pointer has moved.
◦ file => file descriptor
◦ byte_offset => no.of bytes(location) from some origin.
◦ Origin => 0 - fseek from the beginning of the file
1 - fseek from the current position in the file
2 - fseek from the end of the file
pos=fseek(file,373L,0); // seek to byte 373
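A small illustrative sketch of seeking in C (RECORD_SIZE and the record-number scheme are assumptions, not from the slide); note that in ANSI C the origin values 0, 1, 2 are usually written with the constants SEEK_SET, SEEK_CUR and SEEK_END:

#include <stdio.h>

#define RECORD_SIZE 64            /* assumed fixed record length in bytes */

/* position the file at record number n (counting from 0), then read it */
int read_record(FILE *file, long n, char *buffer){
    if (fseek(file, n * RECORD_SIZE, SEEK_SET) != 0)   /* from the beginning */
        return -1;
    return fread(buffer, RECORD_SIZE, 1, file) == 1 ? 0 : -1;
}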
44
Seeking in C++ is similar to C.
Two differences:
◦ fstream has 2 pointers
Get pointer(seekg)
file.seekg(byte_offset,origin)
Put pointer(seekp)
file.seekp(byte_offset,origin)
◦ Origin is of class ios
ios::beg, ios::cur, ios::end
e.g.:file.seekg/seekp(373, ios::beg)
45
Sometimes, the operating system attempts to make
“regular” user’s life easier by automatically adding or
deleting characters for them.
These modifications, however, make the life of
programmers building sophisticated file structures (YOU)
more complicated!
46
Control-Z is added at the end of all files (MS-
DOS). This is to signal an end-of-file.
<Carriage-Return> + <Line-Feed> are added to
the end of each line (again, MS-DOS).
<Carriage-Return> is removed and replaced by a
character count on each line of text (VMS)
47
In any computer systems, there are many files (100’s or 1000’s). These
files need to be organized using some method. In Unix, this is called
the File System.
The Unix File System is a tree-structured organization of directories.
With the root of the tree represented by the character “/”.
Each directory can contain regular files or other directories.
The file name stored in a Unix directory corresponds to its physical
name.
48
Any file can be uniquely identified by giving it its
absolute pathname. E.g., /usr6/mydir/addr.
The directory you are in is called your current
directory.
You can refer to a file by the path relative to the
current directory.
“.” stands for the current directory and “..” stands for
the parent directory.
50
Unix has a very general view of what a file is:
◦ It corresponds to a sequence of bytes, without knowing
where the bytes are stored or
where they come from.
Magnetic disks or tapes can be thought of as files
and so can the keyboard and the console.
◦ /dev/kbd (keyboard sends sequence of bytes)
◦ /dev/console (accepts sequence of bytes and displays)
51
No matter what the physical form of a
Unix file is
◦ a real file or a device,
◦ both are represented in the same way in Unix:
by an integer,
i.e. a file descriptor,
which is a logical name.
52
Stdout --> Console
fwrite(&ch, 1, 1, stdout);
Stdin --> Keyboard
fread(&ch, 1, 1, stdin);
Stderr --> Standard Error (again, Console)
[When the compiler detects an error, the error
message is written in this file]
53
If we want to write to a file instead of stdout,
OR
if we want to use the output of one program as the input
to another program,
we use the concept of redirection (and pipes).
54
< filename [redirect stdin to “filename”]
> filename [redirect stdout to “filename”]
E.g., a.out > my-output
program1 | program2 [take any stdout output from
program1 and use it in place of any stdin input to
program2]
E.g., list | sort
55
cat filenames --> Print the content of the named textfiles.
tail filename --> Print the last 10 lines of the text file.
cp file1 file2 --> Copy file1 to file2.
mv file1 file2 --> Move (rename) file1 to file2.
rm filenames --> Remove (delete) the named files.
chmod mode filename --> Change the protection mode on the
named file.
ls --> List the contents of the directory.
mkdir name --> Create a directory with the given name.
rmdir name --> Remove the named directory.
56
Secondary Storage and
System Software: Magnetic
Disks &Tapes
57
The Organization of Disks
Estimating Capacities and Space Needs
Organizing Tracks by Sector
Organizing Tracks by Block
Non Data Overhead
The Cost of a Disk Access
Disk as Bottleneck
58
Having learned how to manipulate files, we now
learn about the nature and limitations of the
devices and systems used to store and retrieve
files, so that we can design good file structures
that arrange the data in ways that minimize
access costs given the device used by the system.
59
Disks belong to the category of Direct Access Storage Devices (DASDs)
because they make it possible to access the data directly.
This is in contrast to Serial Devices (e.g., Magnetic Tapes) which allow
only serial access [all the data before the data we are interested in has to
be read or written in order].
Different Types of Disks:
◦ Hard Disk: High Capacity + Low Cost per bit.
◦ Floppy Disk: Cheap, but slow and holds little data. (zip disks: removable disk
cartridges)
◦ Optical Disk (CD-ROM): Read Only, but holds a lot of data and can be reproduced
cheaply. However, slow.
60
The information to be stored on a disk is stored on
the surface of one or more platters.
The information is stored in successive tracks on
the surface of the disk.
Each track is often divided into a number of
sectors which is the smallest addressable portion
of a disk.
61
When a read statement calls for a particular byte
from a disk file, the computer’s operating system
finds the correct platter, track and sector, reads
the entire sector into a special area in memory
called a buffer, and then finds the requested byte
within that buffer.
66
Disk drives typically have a number of platters and the
tracks that are directly above and below one another
form a cylinder.
All the info on a single cylinder can be accessed
without moving the arm that holds the read/write
heads.
Moving this arm is called seeking. The arm movement
is usually the slowest part of reading information
from a disk.
67
Track Capacity = number of sectors per track *
bytes per sector
Cylinder Capacity = number of tracks per cylinder
* track capacity
Drive Capacity = number of cylinders * cylinder
capacity
68
Suppose we want to store a file of 50,000 fixed-length
data records with the following characteristics:
◦ No. of bytes per sector=512
◦ No. of sectors per track=63
◦ No. of tracks per cylinder=16
◦ No. of cylinders=4092
How many cylinders are required to store the above
data if data record is 256 bytes?
69
As 1 data record is 256 bytes,
each sector (512 bytes) can hold 2 records.
No. of sectors needed: 50000/2 = 25000
One cylinder can hold:
◦ sectors per track * tracks per cylinder = 63 * 16 = 1008 sectors
No. of cylinders needed:
◦ 25000/1008 = 24.8 cylinders
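The same arithmetic, written out as a small C sketch (figures taken from the exercise above):

#include <stdio.h>
#include <math.h>

int main(){
    double records           = 50000;   /* records to store       */
    double bytes_per_sector  = 512;
    double record_size       = 256;     /* bytes per record       */
    double sectors_per_track = 63;
    double tracks_per_cyl    = 16;

    double records_per_sector = bytes_per_sector / record_size;      /* 2     */
    double sectors_needed     = records / records_per_sector;        /* 25000 */
    double sectors_per_cyl    = sectors_per_track * tracks_per_cyl;  /* 1008  */
    double cylinders          = sectors_needed / sectors_per_cyl;    /* 24.8  */

    printf("cylinders needed: %.1f (i.e. %d whole cylinders)\n",
           cylinders, (int)ceil(cylinders));
    return 0;
}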
70
Data on disks can be organized in 2 ways:
1. By sector
2. By user defined block
71
The logical organization of sectors on a track is that sectors are:
◦ adjacent,
◦ fixed-sized segments of a track
◦ that hold part of a file.
Physically, however, this organization is not optimal:
◦ after reading the data, it takes the disk controller some time
to process the received information before it is ready to
accept more;
◦ if the sectors were physically adjacent,
we would miss the start of the next sector while processing the info
just read in.
73
Traditional Solution: Interleave the sectors.
Namely, leave an interval of several physical
sectors between logically adjacent sectors.
74
Nowadays, however, the controller’s speed has
improved so that no interleaving is necessary
anymore.
75
The file can also be viewed as a series of
clusters of sectors which represent a fixed
number of (logically) contiguous sectors.
Once a cluster has been found on a disk, all
sectors in that cluster can be accessed without
requiring an additional seek.
76
When a program wants to access a file, it is the task of the
file manager (which maps the logical file to the physical
one) to find it.
The file is viewed as a series of clusters of sectors.
A cluster's physical location is identified by the
File Allocation Table (FAT), which maps logical sectors
to the physical clusters they belong to.
77
We use all the above methods to reduce seeking.
If a file is stored in one contiguous run of clusters (one
extent), the file can be processed with a minimum of seeking time.
To do this, the file manager checks whether enough contiguous
space on the disk is free:
◦ If a free contiguous slot exists => place the file in it.
◦ If not, the file is split into 2 or more non-contiguous
pieces (2 or more extents).
◦ As the number of extents in a file increases, the file
becomes more spread out on the disk, and the
amount of seeking necessary increases.
79
If the size of a sector is 512 bytes
and all records in a file are 300 bytes,
how do we store this data?
There are 2 possible organizations for the records (if the
records are smaller than the sector size):
1. Store 1 record per sector
2. Store the records successively (i.e., one record may span two sectors)
81
Trade-Offs
Advantage of 1: Each record can be retrieved from 1 sector.
Disadvantage of 1: Loss of Space with each sector ==>
Internal Fragmentation
Advantage of 2: No internal fragmentation
Disadvantage of 2: 2 sectors may need to be accessed to
retrieve a single record.
The use of clusters also leads to internal fragmentation.
83
Rather than being divided into sectors,
◦ The disk tracks may be divided into user-defined blocks.
84
Blocks don’t have the sector-spanning and
fragmentation problem of sectors
since they vary in size to fit the logical organization of the data.
A block consists of an integral number of logical
records.
85
The blocking factor indicates the number of
records that are to be stored in each block in a file.
Each block is usually accompanied by sub-blocks:
◦ key-subblock (each data block can be given a key
depending on the last record)
◦ count-subblock (no. of bytes in a data block)
86
Key sub-block: The disk controller can search a
particular block on a track by the key defined.
Accessing is efficient.
87
• FAT stands for File Allocation Table, and FAT32 is an extension in which
the table entries are 32 bits wide. This is an older type
of file system that isn't commonly used these days.
• NTFS stands for New Technology File System; it took over
from FAT as the primary file system used in Windows.
88
Whether using a block or a sector organization, some
space on the disk is taken up by non-data overhead.
◦ i.e., information stored on the disk during pre-formatting.
90
The greater the blocking factor,
◦ Advantage: the more efficient the use of storage.
◦ Disadvantage: since tracks are fixed in size, some space may be left
unused at the end of each track, which leads to internal track
fragmentation.
The flexibility introduced by the use of blocks rather
than sectors
◦ can save time, since it lets the programmer determine, to a large
extent, how the data is to be organized physically on disk.
On the negative side: the overhead of determining the
data organization falls on the programmer and the
operating system.
91
Seek Time is the time required to move the access arm to the
correct cylinder.
Rotational Delay is the time it takes for the disk to rotate so that the
sector we want is under the read/write head.
92
Transfer Time =
(Number of Bytes Transferred/ Number of Bytes on a Track)
* Rotation Time
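As an illustration with assumed figures (not from the slide): a 7200-RPM drive rotates once in 60,000/7,200 ≈ 8.33 ms, so the average rotational delay is about 4.17 ms; transferring one sector from a 63-sector track takes roughly (1/63) * 8.33 ms ≈ 0.13 ms; adding an assumed average seek of 8 ms gives about 12.3 ms for a random single-sector read. Seek time and rotational delay, not the transfer itself, dominate the cost.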
93
Processes are often Disk-Bound, i.e., the network and
the CPU often have to wait inordinate lengths of time for
the disk to transmit data.
Solution 1: Multiprogramming (CPU works on other jobs
while waiting for the disk)
Solution 2: Striping: splitting the parts of a file on several
different drives, then letting the separate drives deliver
parts of the file to the network simultaneously ==>
Parallelism
94
Solution 3: RAID: Redundant Array of Independent
Disks(is a data storage virtualization technology that combines multiple
physical disk drive components into a single logical unit )
95
Secondary Storage and System
Software: Magnetic Disks &Tapes
96
Description of Tape Systems
Organization of Data on Nine-Track Tapes
Estimating Tape Length Requirements
Estimating Data Transmission Times
Disk versus Tape
97
No direct accessing facility, but very rapid
sequential access.
Easy to store and transport, cheaper than disk.
Used for application data
Currently, tapes are primarily used as archival
storage.
98
On a tape, the logical position of a byte within a file
◦ corresponds directly to its physical position relative to
the start of the file.
The surface of a typical (nine-track) tape can be seen
◦ as a set of parallel tracks,
◦ each of which is a sequence of bits.
◦ A one-bit-wide slice of tape across the tracks is called a
frame; one frame holds 1 byte of data + a parity bit.
99
In odd parity, the bit is set to make the number
of bits in the frame odd. This is done to check
the validity of the data.
Frames are organized into data blocks of
variable size separated by inter block gaps (long
enough to permit stopping and starting)
101
Calculate how much tape space is needed
◦ to store 1 million 100-byte records
◦ on a 6250 bpi (bytes per inch) tape
◦ with an interblock gap of 0.3 inches.
102
Let b= the physical length of a data block
Let g= the length of an interblock gap, and
Let n= the number of data blocks.
The space requirement, s, for storing the file is
s = n * (b+g)
103
b = block size (i.e., bytes per block) / tape density (i.e., bytes per inch)
104
b = 100/6250 = 0.016 inch
If blocking factor = 1 then
n = 1 000 000 / 1 = 1 000 000
s = 1 000 000 * (0.016 + 0.3)
  = 316 000 inches = 26,333 feet
105
If blocking factor = 50 then
n = 1 000 000 / 50 = 20 000
b = (50 * 100)/6250 = 0.8 inch
s = 20 000 * (0.8 + 0.3)
  = 22 000 inches (about 1,833 feet)
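The same tape-space arithmetic as a small C sketch (figures from the example above):

#include <stdio.h>

/* inches of tape needed for n_records records of record_size bytes, written
   blocking_factor records per block at the given density (bytes per inch),
   with gap inches of interblock gap after every block                       */
double tape_inches(double n_records, double record_size,
                   double blocking_factor, double density, double gap){
    double b = (blocking_factor * record_size) / density;   /* block length  */
    double n = n_records / blocking_factor;                 /* no. of blocks */
    return n * (b + gap);                                   /* s = n * (b+g) */
}

int main(){
    /* 1,000,000 100-byte records, 6250 bpi, 0.3-inch gaps */
    printf("blocking factor 1 : %.0f inches\n",
           tape_inches(1000000, 100, 1, 6250, 0.3));         /* 316000 inches */
    printf("blocking factor 50: %.0f inches\n",
           tape_inches(1000000, 100, 50, 6250, 0.3));        /* 22000 inches  */
    return 0;
}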
106
The number of records stored in a physical block
is called the blocking factor.
If the blocking factor is 1, each block holds exactly one record.
107
The greater the blocking factor, the less space is lost to interblock gaps and the better the space
utilization.
Hence the space needed depends on the blocking factor.
A generalized measure of the effect of choosing different block sizes is the
Effective Recording Density:
= (number of bytes per block) / (number of inches required to store a block)
= 100 / (0.016 + 0.3) ≈ 316 bpi   (for blocking factor 1 in the example above)
Space utilization is sensitive to the relative sizes of data blocks and
interblock gaps.
108
Nominal Data Transmission Rate =
(Tape Density (bpi)) * (Tape Speed (ips))
Interblock gaps, however, must be taken into
consideration ==>
Effective Transmission Rate =
(Effective Recording Density) * (Tape Speed)
where Effective Recording Density =
(No. of bytes per block) / (No. of inches required to store a block)
109
In the past: Both Disks and Tapes were used for
secondary storage. Disks were preferred for
random access and tape was better for sequential
access.
Now (1): Disks have taken over much of secondary
storage ==> Because of the decreased cost of disk
+ memory storage
Now (2): Tapes are used as Tertiary storage
(Cheap, fast & easy to stream large files or sets of
files between tape and disk)
110
Secondary Storage and System
Software:
CD-ROM & Issues in Data
Management
111
CD-ROM (Compact Disk, Read-Only
Memory)
A Journey of a Byte
Buffer Management
I/O in Unix
112
A single disc can hold more than 600 megabytes of data (~
400 books of the textbook’s size)
CD-ROM is read only. i.e., it is a publishing medium rather
than a data storage and retrieval medium like magnetic disks.
CD-ROM Strengths: High storage capacity, inexpensive and
durable.
CD-ROM Weaknesses: extremely slow seek performance
(between 1/2 a second to a second) ==> Intelligent File
Structures are critical.
113
CD-ROM is a descendant of the audio CD. i.e.,
listening to music is sequential and does not
require fast random access to data.
Reading Pits and Lands:
◦ CD-ROMs are stamped from a glass master disk which
has a coating
◦ that is changed where the laser beam strikes it.
114
Reading Pits and Lands:
◦ When the coating is developed, the areas hit by the laser
beam turn into pits along the track followed by the beam.
The smooth unchanged areas between the pits are called
lands.
115
◦ To read data from CD ROM
A beam of laser light is passed on the track as it
moves under the optical pickup.
The pits scatter the light, but the lands reflect
most of it back to the pickup.
This alternating pattern of high- and low-intensity
reflected light is the signal used to reconstruct
the original digital information.
116
1’s are represented by the transition from pit to
land and back again.
0’s are represented by the amount of time
between transitions. The longer between
transitions, the more 0s we have.
117
Given this scheme, it is not possible to have 2
adjacent 1s: 1s are always separated by 0s. As a
matter of fact, because of physical limitations,
there must be at least two 0s between any pair of
1s.
Raw patterns of 1s and 0s have to be translated to
get the 8-bit patterns of 1s and 0s that form the
bytes of the original data.
118
E.g.
◦ EFM encoding (Eight-to-Fourteen Modulation) turns the
original 8 bits of data into 14 expanded bits that can be
represented in the pits and lands on the disc.
119
CLV - Constant Linear Velocity
CAV - Constant Angular Velocity
Data on a CD-ROM is stored in a single, spiral track (CLV).
This allows the data to be packed as tightly as possible,
since all the sectors have the same size (whether in the
center or at the edge).
This CLV pattern is used in audio systems.
In the "regular" (CAV) arrangement, the data is packed more
densely in the center than at the edge ==> space is lost at
the edge.
120
Since reading the data requires that it passes under
the optical pick-up device at a constant rate,
The disc has to spin more slowly when reading the
outer edges than when reading towards the center.
121
Part of the problem is the need to change
rotational speed.
In CLV, the rotation speed has to change as the head moves.
In CAV, data is packed more densely at the center and less
densely at the edges.
◦ Some space is wasted at the edges in CAV.
122
To read the address info that is stored on the disc
◦ Along with the user’s data, we need to be moving the
data under the optical pick up at the correct speed.
◦ But to know how to adjust the speed, we need to be able
to read the address info so we know where we are.
◦ How do we break this loop? By guessing and through trial
and error ==> Slows down performance.
123
Different from the “regular” disk method.
Each second of playing time on a CD is divided into 75 sectors.
Each sector holds 2 Kilobytes of data. Each CD-ROM contains
at least one hour of playing time.
==> The disc is capable of holding at least
◦ 60 min * 60 sec/min * 75 sectors/sec * 2 Kilobytes/sector =
540,000 KBytes
Often, it is actually possible to store over 600,000 KBytes.
Sectors are addressed by min:sec:sector, e.g., 16:22:34
124
Seek Performance: very bad
Data Transfer Rate: Not Terrible/Not Great
Storage Capacity: Great
◦ Benefit: enables us to build indexes and other support structures that
can help overcome some of the limitations associated with CD-ROM’s
poor performance.
Read-Only Access: There can’t be any changes ==> File organization can
be optimized.
127
Part that takes place in memory:
The statement calls the Operating System (OS), which
oversees the operation.
128
File manager (Part of the OS that deals with I/O)
◦ Checks whether the operation is permitted
◦ Locates the physical location where the byte will be stored
(Drive, Cylinder, Track & Sector)
◦ Finds out whether the sector that will hold the ‘P’ is already in
memory (if not, reads it into an I/O buffer)
◦ Puts ‘P’ in the I/O buffer
◦ Keeps the sector in memory to see if more bytes will be going
to the same sector of the file
129
Part that takes place outside of memory:
I/O Processor: Wait for an external data path to become
available (CPU is faster than data-paths ==> Delays)
Disk Controller:
◦ I/O Processor asks the disk controller if the disk drive is
available for writing
◦ Disk Controller instructs the disk drive to move its read/
write head to the right track and sector.
◦ Disk spins to right location and byte is written
132
What happens to data travelling between a program’s
data area and secondary storage?
We use temporary storage i.e. BUFFER
The use of Buffers: Buffering involves
◦ working with a large chunk of data in memory so the
number of accesses to secondary storage can be
reduced.
133
Problems:
It depends on how many buffers are used. Suppose we
◦ have a single buffer,
◦ performing both input and output,
◦ one character at a time, alternately.
In this case,
◦ the input sector is read into the buffer to get the next character,
◦ then the buffer must be overwritten with the output sector so the
character can be written,
◦ and vice versa: one sector read and one sector write per character.
134
In such a case, the system needs more than 1
buffer: at least, one for input and the other one for
output.
Moving data to or from disk is very slow and
programs may become I/O Bound ==> Find better
strategies to avoid this problem.
135
Multiple Buffering
Move Mode and Locate Mode
Scatter/Gather I/O
136
Multiple Buffering: using 2 or more buffer to
perform Input/Output. There are two types :
◦ Double Buffering
◦ Buffer Pooling
137
Double Buffering:
◦ Suppose we have task with only write(to disk)
operations
◦ If we use 2 buffers where I/O and CPU operations are
overlapped.
If Buffer-1 =>used by the CPU to fill data
Then Buffer-2 =>used to transmit the data to disk.
If Buffer-2 =>used by the CPU to fill data
Then Buffer-1 =>used to transmit the data to disk.
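A minimal sketch of the idea in C (the buffer size, output file name and the stub data producer are assumptions; real overlap of the CPU and the disk would need asynchronous I/O or a second thread, which is omitted here):

#include <stdio.h>
#include <string.h>

#define BUFSIZE 4096                        /* assumed buffer size */

/* stub data producer: fills buf and reports how many bytes it produced,
   0 when there is nothing left (here: 10 blocks of 'x' characters)      */
static size_t produce(char *buf, size_t cap){
    static int blocks_left = 10;
    if (blocks_left-- <= 0) return 0;
    memset(buf, 'x', cap);
    return cap;
}

int main(){
    static char buf_a[BUFSIZE], buf_b[BUFSIZE];
    char *fill  = buf_a;                    /* buffer the CPU fills next        */
    char *flush = buf_b;                    /* buffer being transmitted to disk */
    FILE *out   = fopen("out.dat", "wb");   /* assumed output file name         */
    size_t n;
    if (out == NULL) return 1;
    while ((n = produce(fill, BUFSIZE)) > 0) {
        /* in a real double-buffering scheme this fwrite would overlap in time
           with the next produce() call (asynchronous I/O or a second thread)  */
        fwrite(fill, 1, n, out);
        /* swap roles: the buffer just written becomes the next one to fill */
        { char *tmp = fill; fill = flush; flush = tmp; }
    }
    fclose(out);
    return 0;
}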
138
Buffer pooling:
◦ Pool of buffers will be available
◦ On requirement a buffer will be chosen from the pool.
◦ Method to choose buffer from pool:
LRU (Least Recently Used) strategy
Maintained by a queue.
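A compact sketch of an LRU buffer pool in C (the pool size, sector size and the stub read_sector routine are assumptions for illustration):

#include <string.h>

#define POOL_SIZE   8          /* assumed number of buffers in the pool */
#define SECTOR_SIZE 512        /* assumed sector size in bytes          */

typedef struct {
    long sector;               /* which sector this buffer holds              */
    long last_used;            /* logical clock value of the last access      */
                               /* (0 = buffer never used yet)                 */
    char data[SECTOR_SIZE];
} Buffer;

static Buffer pool[POOL_SIZE];
static long   clock_tick = 0;

/* stub for the low-level routine that reads one sector from the disk */
static void read_sector(long sector, char *dest){
    (void)sector;
    memset(dest, 0, SECTOR_SIZE);
}

/* return a buffer holding the requested sector, reusing the
   least-recently-used buffer when the sector is not already pooled */
Buffer *get_buffer(long sector){
    Buffer *victim = &pool[0];
    for (int i = 0; i < POOL_SIZE; i++) {
        if (pool[i].last_used > 0 && pool[i].sector == sector) {
            pool[i].last_used = ++clock_tick;   /* found in the pool: refresh its age */
            return &pool[i];
        }
        if (pool[i].last_used < victim->last_used)
            victim = &pool[i];                  /* remember the LRU candidate */
    }
    read_sector(sector, victim->data);          /* replace the LRU buffer */
    victim->sector    = sector;
    victim->last_used = ++clock_tick;
    return victim;
}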
140
Move Mode and Locate Mode:
Buffers can work in two modes:
◦ Move mode:
deals with the movement of data from the system buffer
to the program's data area and vice versa.
Problem: to access a byte of data we need an extra
transfer of the data between the system buffer and the
program area.
141
• Locate mode:
To avoid the above problem,
we can avoid the unnecessary transfer of data.
This is done as follows:
the system buffer is used for all operations, and
the program is given a pointer to the data's location in the system buffer.
This mode is called locate mode.
142
Scatter Input and Gather Output:
Suppose we have to read a file with many blocks,
where each block consists of a header followed by
data.
Without scatter/gather this is done in 2 steps:
read the entire block into a single big buffer,
then copy the header into one buffer
and the data into another buffer.
143
To avoid the 2 steps we use scatter input:
◦ a single block of data is scattered into a collection of
buffers,
◦ with the data read in a single read call.
Gather output:
◦ several buffers can be gathered and
◦ written in a single write call,
◦ avoiding the need to copy them into a single big output buffer.
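On Unix this idea maps onto the readv()/writev() system calls; a minimal scatter-input sketch (the header size, data size and file name are assumptions):

#include <sys/uio.h>    /* readv, writev, struct iovec */
#include <fcntl.h>
#include <unistd.h>

#define HEADER_SIZE 16          /* assumed header length */
#define DATA_SIZE   496         /* assumed data length   */

int main(void){
    char header[HEADER_SIZE];
    char data[DATA_SIZE];
    struct iovec iov[2];
    int fd = open("block.dat", O_RDONLY);    /* assumed file name */
    if (fd < 0) return 1;

    /* scatter input: one read call fills both buffers, in order */
    iov[0].iov_base = header;  iov[0].iov_len = HEADER_SIZE;
    iov[1].iov_base = data;    iov[1].iov_len = DATA_SIZE;
    if (readv(fd, iov, 2) < 0) { close(fd); return 1; }

    close(fd);
    return 0;
}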
144
The Kernel:
145
When a program executes the following instruction (journey of
a byte):
◦ write(fd, &ch, 1)
1. A system call is invoked,
2. which invokes the kernel (the system call asks the kernel to write a
character).
3. The kernel I/O system connects the file descriptor to a file or
I/O device by using four tables.
146
Four tables are:
1.File descriptor table (Program)
2.Open file table (Kernel)
3.File allocation table (Kernel)
4.Table of Index nodes (Part of file system)
147
File descriptor table:
◦ Owned by the program.
◦ Contains the information needed to reach the open file table.
148
Open file table:
◦ Contains information about the opened files.
149
The file allocation table is part of the inode.
When a file is opened, a copy of its inode is added to the inode table.
An inode is a structure used to describe a file.
150
Once the file is identified, device drivers are called to
access the data.
Linking file names to files:
◦ Two links
Hard links
Pointer from the directory to the inode of a file
Soft links/ symbolic link
Specifies a path name
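Both kinds of link can also be created from a C program with the POSIX calls link() and symlink() (the file names below are only examples); from the shell the equivalents are ln and ln -s:

#include <unistd.h>

int main(void){
    /* hard link: a second directory entry pointing at the same inode */
    link("data.txt", "data_hard.txt");

    /* soft (symbolic) link: a new file that merely stores the path name */
    symlink("data.txt", "data_soft.txt");
    return 0;
}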
151