File Systems
File Systems
RAJAN KARMACHARYA
DEPARTMENT OF COMPUTER SCIENCE AND INFORMATION
Includes…
4.1 Files
4.2 Directories
4.3 File System Implementation
2
Er. Rajan Karmacharya
Files
l
• Issues
▫ How to store the large amount of data into the computer?
▫ What happens when process terminates or is killed using some data?
▫ How to assign the same data to the multiple process
• The solution to all these problems is to store information on disks or on other
external media called FILES.
3
Er. Rajan Karmacharya
Files
l
• A file is a named collection of related information normally resides on a
secondary storage device such as disk or tape.
d t d i h di k t
• Commonly, files represent programs (both source and object forms) and data;
d
data files may be numeric, alphanumeric or binary
fil b i l h i bi
• Information stored in files must be persistent, not be effected by power failures
and system reboot
d b
• The files are managed by the OS and part of the OS that is responsible to
manage files is known as the file system
4
Er. Rajan Karmacharya
File System Issues
l S I
• How to create a file?
• How are they named?
• How are they structured?
• What operations are allowed on files?
• How to protect them?
• How are they accessed or used?
• p
How to implement them?
5
Er. Rajan Karmacharya
File Naming
l
• When a process creates a file, a file name is given which continues to exist and
can be accessed by other processes even after the process terminates.
b d b th ft th t i t
• A file is named, for the convenience of its human users and it is referred to by
its name.
• A name is a string of characters which may be digits or special characters
• Some system differentiate between lowercase and uppercase characters such
as UNIX and other consider equivalent like MS DOS.
• Normally the string of max 8 characters (MS DOS)are legal file name but
recent systems support as long as 255 characters (Windows)
6
Er. Rajan Karmacharya
File Naming
l
• Many OS support two part file names separated by a period.
• The part following the period is known as extension
• The extension name resembles the type of the file
• Some systems may have two or more extensions such as Prog1.c.Z ‐‐‐‐ A C
source file compressed using Ziv‐Lampel Algorithm
7
Er. Rajan Karmacharya
Typical File Extensions
l l E
8
Er. Rajan Karmacharya
File Structure
l S
• Files must be structured and understood by the OS.
• Files can be structured in many ways and the most general structures are
▫ Unstructured
▫ Record Structured
▫ Tree Structured
T S d
9
Er. Rajan Karmacharya
File Structure
l S
Er. Rajan Karmacharya
File Structures
l S
• Unstructured
▫ Consists of unstructured sequence of bytes or words
▫ OS does not know or care what is in the file
▫ Any meaning must be imposed by the user level program
▫ Provides
Provides maximum flexibility; user can put anything they want and name them
maximum flexibility; user can put anything they want and name them
anyway that is convenient
▫ Both UNIX and Windows use these approach.
Both UNIX and Windows use these approach.
11
Er. Rajan Karmacharya
File Structures
l S
• Record Structured
▫ A file is a sequence of fixed length records, each with some internal structure.
▫ Each read operation returns one records and write operation overwrites or append
one record
one record
▫ May old mainframe systems use this structure
12
Er. Rajan Karmacharya
File Structures
l S
• Tree Structured
▫ File consists of tree of records not necessarily all the same length.
▫ Each containing a key field in a fixed position in the record sorted on the key to
allow rapid searching
allow rapid searching.
▫ The operation is to get the record with the specific key
▫ Used in large mainframe for commercial data processing
13
Er. Rajan Karmacharya
File Types
l
• Many OS support several types of files
• Regular files: contains user information and are generally ACSII or Binary
• Directories: system files for maintaining the structure of the file system
• Character Special Files:
p related to I/O and used to model serial I/O devices
such as terminals, printers and networks
• Block special files:
p f used to model disks
14
Er. Rajan Karmacharya
File Types
l
• ASCII Files
▫ Consists of line of text where each line is terminated either by carriage return or by
line feed character or both
▫ They can be displayed and printed as it is and can be edited with ordinary text
editor
• Binary Files
▫ Consists
Consists of sequence of byte only
of sequence of byte only
▫ They have some internal structure known to programs that use them
▫ Many OS use extension to identify the file type; but UNIX like OS use a magic
y yp
number to identify the file type
15
Er. Rajan Karmacharya
A
Access Methods
h d
• Sequential Access
▫ read all bytes/records from the beginning
d ll b t / d f th b i i
▫ cannot jump around, could rewind or back up
▫ convenient when medium was magnetic tape
• Direct Access
▫ bytes/records read in any order
▫ essential for data base systems
essential for data base systems
▫ Used for immediate access to large amount if information
▫ read can be …
x move file marker (seek), then read or …
move file marker (seek), then read or …
x read and then move file marker
16
Er. Rajan Karmacharya
File Attributes
l A b
• In addition to name
and data all other
and data, all other
information about file
is termed as file
attribute
• The file attributes
may vary from system
t
to system
t
17
Er. Rajan Karmacharya
File Operations
l O
• OS provides system calls to perform operations on files. Some common calls
are
• Create : If disk space is available, it creates new file without data.
• Delete: Deletes files to free up disk space.
• Open: Before using a file, a process must open it.
• Close: When all access are finished, the file should be closed to free up the
internal table space.
• Read : Reads data from file
• Write: Writes data to a file
• Append: Adds data to the end of a file
• Seek : Repositions the file pointer to a specific place in the file
• Get : Returns file attributes for processing
• Set : To set the user settable attributes when files changed
g
• Rename : Rename a file
18
Er. Rajan Karmacharya
Directory Structure
S
• A directory is a node containing information about files.
• Directories may have different structures
19
Er. Rajan Karmacharya
Directory Structure
S
• Single Level Directory
▫ All files are contained in the same directory
▫ Easy to support and understand but difficult to manage large amount of files and to
manage different users
▫ Different users may accidently use the same name for their files where the later
Different users may accidently use the same name for their files where the later
overwrites the earlier
▫ Advantage of this scheme is the simplicity and ability to locate files quickly; there is
y p
only once place to look after all.
• A single level directory system
▫ contains 4 files
▫ owned by 3 different people, A, B,
y p p
and C
20
Er. Rajan Karmacharya
Directory Structure
S
• Two Level Directory
▫ Separate directory for each user
▫ Used on a multiuser computer and on a simple network computer
▫ It has problem when users want to cooperate on some task and to access one
another’ss files.
another files
▫ It also cause problem when a single user has large number of files.
21
Er. Rajan Karmacharya
Directory Structure
S
• Hierarchical Level Directory
▫ Generalization of two level structure to a tree of arbitrary height
▫ This allows the user to create their own subdirectories and to organize their files.
▫ To allow to share the directory for different user acyclic graph is used
▫ N l ll
Nearly all modern file systems are organized in this manner
d fil t i d i thi
22
Er. Rajan Karmacharya
Path Names
h
• Absolute Path Names
▫ Path name starting from root directory to the file
▫ Path separated by / in UNIX ,\ in Windows and > in MULTICS
▫ Windows ‐‐‐‐ \usr\ast\mailbox
▫ Li
Linux ‐‐‐‐ / / t/ ilb
/usr/ast/mailbox
▫ MULTICS ‐‐‐ >usr>ast>mailbox
• Relative Path Names
▫ Concept of working directory
▫ A user can designate one directory as the current working directory in which all
path names not beginning at the root directory are taken relative to the working
directory.
▫ E.g. bin/lab2 is enough to locate same file if current working directory is /usr/user1
• cp /usr/ast/mailbox /usr/ast/mailbox1
• cp mailbox mailbox1
ilb ilb 1
• do exactly the same work if the working directory is /ust/ast
23
Er. Rajan Karmacharya
Directory Operations
O
• Create : A directory is created
• D l t A di t
Delete : A directory is deleted
i d l t d
• OpenDir : Directories can be read
• CloseDir : After read, it should be closed to free up internal table space
• Rename : Rename a directory
Rename : Rename a directory
• Link : allows a file to appear in more than one directory. This system call
specifies an existing file and a path name, and creates a link from
g
the existing file to the name specified by the path.
p y p
• Unlink : A directory entry is removed
24
Er. Rajan Karmacharya
File System Implementation
l S I l
• How are files and directories are stored?
• How disk space is managed?
• How to make every thing work efficiently and reliably?
• Files can be implemented in two ways
▫ Contiguous Allocation Method
▫ Linked Allocation Method
25
Er. Rajan Karmacharya
All
Allocation Method: Contiguous Allocation
h d C All
• The simplest allocation scheme is to store each file as a contiguous run of disk
bl k
blocks.
• Each file occupy a set of contiguous block on the disk
• Disk address define a linear ordering on the disk.
• File is defined by the disk address and length in block units
• On a disk with 1 kb blocks, a 50 kb file is allocated 50 consecutive block
• With 2 kb blocks, a 50 kb file would be allocated 25 consecutive blocks
• Both sequential and direct access can be supported by contiguous allocation
26
Er. Rajan Karmacharya
All
Allocation Method: Contiguous Allocation
h d C All
(a) Contiguous allocation of disk space for 7 files
(a) Contiguous allocation of disk space for 7 files
(b) State of the disk after files D and E have been
removed
27
Er. Rajan Karmacharya
All
Allocation Method: Contiguous Allocation
h d C All
• Advantages:
▫ Simple to implement:
l l A
Accessing a file that has been allocated contiguously is easy
fl h h b ll d l
▫ High performance: The entire file can be read in single operation i.e. decrease the seek time
• Problems:
P bl
▫ Fragmentation: When files are allocated and deleted the free disk space is broken into holes
▫ Dynamic storage allocation problem: Searching of right holes requires pre‐information of file
size to put it into a hole
size to put it into a hole
28
Er. Rajan Karmacharya
All
Allocation Method: Linked Allocation
h d k d All
• Each file is a linked list of disk blocks; the disk block may be scattered
anywhere on the disk.
h th di k
• Each block contains a pointer to the next block of same file.
• To create the new file, we simply create a new entry in the directory; with
li k d ll
linked allocation, each directory entry has a pointer to the first disk block of
i h di h i h fi di k bl k f
the file
• Unlike contiguous allocation, every disk can be used in this method.
• No space is lost to disk fragmentation except for internal fragmentation.
l d kf f lf
29
Er. Rajan Karmacharya
All
Allocation Method: Linked Allocation
h d k d All
• File A uses disk block 4,7,2,10 and 12 and
• File B uses disk block 6,3,11 and 14
30
Er. Rajan Karmacharya
All
Allocation Method: Linked Allocation
h d k d All
• It solves all problems of contiguous allocation but it can be used only for
sequential access file: random access is extremely slow
ti l fil d i t l l
• Each block access requires disk seek
• It also requires space for pointer
• Solution : Using File Allocation Table (FAT)
▫ The table has one entry for each disk block containing next block number for the file. This resides at the
beginning of each disk partition.
31
Er. Rajan Karmacharya
Directory Implementation
I l
• The directory entry provides the information needed to find the disk block.
• The file attributes are stored in the directory
• Directories can be implemented in two ways
▫ Linear List
▫ Hash Table
H h T bl
32
Er. Rajan Karmacharya
Directory Implementation‐
I l Linear List
• Use the linear list of the file names with pointer to the data blocks
• It requires linear search to find a particular entry
• Advantages: Simple to implement but time consuming to execute
• Problem: Linear search to a file is slow
• To create a new file, we must first search the directory to be sure that no existing file
has the same name. Then we add a new entry at the end of the directory. To delete a file,
we search for the named file and release the space allocated to it.
• Requires cache and sorting and may use B‐Tree
33
Er. Rajan Karmacharya
Directory Implementation –
I l Hash Table
h bl
• It consists linear list with has table.
• The hash table takes a value computed from the file name and returns a pointer to
Th h h bl k l df h fil d i
the file name in the linear list. Thus, it greatly decreases the search time.
• If that key is already in use, a linked list is constructed.
• Advantages:
Ad t greatly decreases the file search time.
tl d th fil h ti
• Problem: It is fixed size and dependence of the hash function on that size.
• Insertion and deletion are also fairly straight forward although some provision must be
made for collisions situations in which two file names has to the same location.
• If we make a hash table that holds 64 entries, it converts filenames into integers from 0 to
If we make a hash table that holds 64 entries it converts filenames into integers from 0 to
63
• Later if we create a 65th file, we must enlarge the hash table say 128 entries. As a result we
need a new hash function that must map files to the range of 0 – 127 and reorganize
need a new hash function that must map files to the range of 0 127 and reorganize
existing directories
34
Er. Rajan Karmacharya
Free Space Management
S
• To keep track of free blocks, system maintains the free space list
• The free space list records all free blocks those not allocated to some file or
directory
• To create a file, system searches the free space list for required amount of
space, and allocate that space to the new file and removed from free space list.
d ll h h fil d df f li
• When a file is deleted, its disk space is added to the free space list
• The free space list can be implemented in two ways
▫ Bitmap
▫ Linked List
35
Er. Rajan Karmacharya
Free Space Management : Bitmap
S
• Each block is represented by a bit. If the block is free, the bit is 1; if the block is
allocated the bit is 0.
ll t d th bit i 0
• A disk with n blocks requires a bitmap with n bits.
• Eg: consider a disk where blocks 2,3,4,5,8,9,10,11,12,13,17,18,25,26,27 are free
and rest are allocated.
d ll d
• The free space bitmap would be 0011110011111100011000000111
• Advantages:
▫ Simple and efficient in finding first free blocks or n consecutive free blocks
• Problem:
▫ Inefficient unless the entire bitmap is kept in the main memory
▫ Keeping in main memory is possible only for small disk, when disk is large the
bitmap would be large
36
Er. Rajan Karmacharya
Free Space Management : Linked List
S k d
• Link together all the free disk blocks,
k
keeping a pointer to the first free
i i t t th fi t f
block in a special location on the disk
and caching it in memory.
• The first block contains a pointer to
Th fi t bl k t i i t t
the next free block
• The previous example can be
represented in liked list as
represented in liked list as
• Advantage
• Only one block is kept in memory
• Problem
• Not efficient to traverse list, it
must read each block
d h bl k
File Sharing
In a multiuser system, there is almost always a requirement for files to be shared among a number of
users. Two issues arise: access rights and the management of simultaneous access.
Access Rights
The file system should provide a flexible tool for allowing extensive file sharing among users. The file
system should provide a number of options so that the way in which a particular file is accessed can be
controlled. Typically, users or groups of users are granted certain access rights to a file. A wide range of
access rights have been used. The following list is representative of access rights that can be assigned to
a particular file
None: The user may not even learn of the existence of the file, much less access it. To enforce this
restriction, the user would not be allowed to read the user directory that includes this file.
Knowledge: The user can determine that the file exists and who its owner is. The user is then able to
petition (ask) the owner for additional access rights.
Execution: The user can load and execute a program but cannot copy it.
Reading : The user can read the file for any purpose, including copying and execution.
Appending : The user can add data to the file, often only at the end but canʹt modify or delete any of
the fileʹs contents. This right is useful in collecting data from a number of sources.
Updating: The user can modify, delete and add to the fileʹs data. This normally includes writing the
file initially, rewriting it completely or in part and removing all or a portion of the data.
Changing Protection: The user can change the access rights granted to other users. Typically, this
right is held only by the owner and may extend this right to others.
Deletion: The user can delete the file from the file system.
These rights can be considered to constitute a hierarchy, with each right implying those that precede it.
Access can be provided to different classes of users
Specific Users: Individual users who are designated by user ID.
User Groups: A set of users who are not individually defined. The system must have some way of
keeping track of the membership of user groups.
All: All users who have access to this system. These are public files.
Simultaneous Access
When access is granted to append or update a file to more than one user, the operating system or file
management system must enforce discipline.
A brute‐force approach is to allow a user to lock the entire file when it is to be updated. A finer grain
of control is to lock individual records during update.
Issues of mutual exclusion and deadlock must be addressed in designing the access capability.
File System Reliability
Major problem in maintaining a file system
• To protect it from being corrupted by both systems failure and errors in systemʹs software.
• If a computerʹs file system is irrecoverably lost, due to hardware or software, restoring all the
information will be difficult, time consuming and in many cases, impossible.
Solution
The usual way of ensuring reliability is by making duplicates copies of files.
Backups – a duplicate copy of the files used incase the original damages
Companies, who well understand the value of back up do it once a day, usually to a tape. Modern
tapes hold 100 gigabytes of data and cost pennies per gigabyte.
Backup tapes are generally made to
• Recover from disaster – due to disk crash, fire, flood or other natural calamities
• Recover from stupidity – accidental removal of files that is required later (may be temporarily
in Recycle Bin)
Making a backup takes a longer time and occupies a large amount of space‐‐‐ doing this efficiently and
conveniently in a major issue.
Periodic dump
• All files in the file system are copied to another device, usually a magnetic tape. This is done
regularly and may be done as often as weekly if the files are highly volatile.
• Wasteful to back up files that have not changed since the last backup
Incremental dump
• Make a complete dump periodically and make a daily dump of those files that get modified
from the last dump. This is faster to do and usually done more often.
• Makes recovery more complicated because first the most recent full dump is to be restored
followed by all the incremental dumps in reverse order.
Compression
• With immense amount of data, compression seems to be feasible but with many compression
algorithms, a single bad spot on the backup tape can make the entire data unreadable.
Security Issue
• Backup introduces non technical problems in an organization
• The best online security system in the world may be useless if the system administrator keeps
all the backup tapes in his office and leaves it open and unguarded. So, Good Bye Security.
Two strategies can be used for dumping a disk to a tape: a physical dump or a logical dump.
Physical Dump
A physical dump starts at block 0 of the disk, writes all the disk blocks onto the output tape in
order, and stops when it has copied the last one. Such a program is so simple that it can probably be
made 100% bug free, sometime that can probably not be said about any other useful program.
Advantages
• Simplicity and great speed (basically, it can run at the speed of disk).
Problem
• There is no value in backing up unused disk blocks.
• Dumping bad blocks creates endless disk read errors during the dumping process.
• Unable to skip selected directories and makes incremental dumps.
Logical Dump
A logical dump starts at one or more specified directories and recursively dumps all files and
directories found there that have been changed since some given base date (the last backup for an
incremental dump for a full dump).
In a logical dump, the dump tape gets a series of carefully identified directories and files, which makes
it easy to restore a specific file or directory upon request.
Logical dumping is the most common form. In the example below, a tree with directories (squares) and
files (circles) is shown. The shaded items have been modified since the base date and thus need to be
dumped. The unshaded ones do not need to be dumped.
File System Consistency
• Another important aspect of a file system is its consistency. When a system crash occurs, it is
possible that a modified block may not have been written yet. The standard way of checking
for consistency is to read all the files and the blocks allocated to the files of the file system and
then compare this with the blocks in the free list.
• To deal with the problem of inconsistent file system, most computers have a utility program
that checks file system consistency. UNIX has fsck and Windows has scandisk. These utilities
run when the system is booted after a crash.
File System Performance
Access to disk is much slower than access to memory. Hence, many file systems have been designed
with various optimizations to improve the performance
Caching
• The most common technique used to reduce disk access is the block cache or the buffer cache.
A cache is a collection of blocks that logically belong to the disk but are being kept in the
memory for performance reasons.
• Check all the requests to see if the needed block is in the cache. If it is, the read request can be
satisfied without a disk access and if not, it is first read into the cache and then copied to
wherever it is needed. Subsequent requests for the same block can be satisfied from the cache.
Block Read Ahead
• Another technique to improve file system performance is to try to get blocks into the cache
before they are needed to increase the hit rate.
• In particular, many files are read sequentially. When the file system is asked to produce block k
in a file, it does that, but when it is finished, it makes a sneaky check in the cache to see if block
k+1 is already there. If it is not, it schedules a read for block k+1 in the hope that when it is
needed, it will have already arrived in the cache. At the very least, it will be on the way.
• Only helps the sequentially read files.(not in random access method)
Reducing Disk Arm Motion
• The disk arm motion can be reduced by putting blocks that are likely to be accessed in sequence
close to each other, preferably in the same cylinder. (i.e., blocks lie in the outer most track, say
track 1, of platter 1, platter 2 , platter 3 and so on)
• When the blocks lie in the same cylinder, the disk arm does not require moving and only the
read write head moves. Hence data access is faster when the disk arm motion is reduced.