File Systems (1) : XVII-1
CS 167 1
File Systems
In this lecture we cover the "traditional" UNIX file systems: S5FS and UFS. We
then look at how kernel-supported buffering is used to speed file-system operations
and the trouble this sometimes causes.
S5FS: Inode
[Figure: an inode, containing: device, inode number, mode, link count, owner and group, size, and the disk map.]
In both S5FS and UFS, a data structure known as the inode (for index node) is
used to represent a file. These inodes are the focus of all file activity, i.e., every
access to a file must make use of information from the inode. Every file has an inode
on permanent (disk) storage. While a file is active (e.g., it is open), its inode is
brought into primary storage.
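As a sketch, an S5FS-style on-disk inode can be declared as below. The field names and widths are illustrative (the historical dinode packed its 13 block addresses as 3-byte values), not a byte-exact copy of the System V layout.

```c
#include <stdint.h>

/* Illustrative layout of an S5FS on-disk inode. A mode of 0 marks
 * the inode as free (used by the free-inode cache described later). */
struct s5_dinode {
    uint16_t di_mode;      /* file type and permission bits; 0 = free */
    int16_t  di_nlink;     /* link count */
    uint16_t di_uid;       /* owner */
    uint16_t di_gid;       /* group */
    int32_t  di_size;      /* size in bytes (signed 32-bit) */
    uint8_t  di_addr[39];  /* disk map: 13 block addresses, 3 bytes each */
    uint32_t di_atime;     /* last access */
    uint32_t di_mtime;     /* last modification */
    uint32_t di_ctime;     /* last inode change */
};
```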
Disk Map
[Figure: the disk map, an array of 13 block pointers numbered 0 through 12. Pointers 0 through 9 are direct; pointer 10 is single indirect, pointer 11 is double indirect, and pointer 12 is triple indirect.]
The first file system we discuss is known as the S5 File System -- it is based on the
original UNIX file system, developed in the early seventies. The name S5 comes from
the fact that this was the only file system supported in early versions of what's
known as UNIX System V.
The purpose of the disk-map portion of the inode is to map block numbers relative
to the beginning of a file into block numbers relative to the beginning of the file
system. Each block is 1024 (1K) bytes long.
The disk map consists of 13 pointers to disk blocks, the first 10 of which point to
the first 10 blocks of the file. Thus the first 10 KB of a file are accessed directly. If the
file is larger than 10 KB, then pointer number 10 points to a disk block called an
indirect block. This block contains up to 256 (4-byte) pointers to data blocks (i.e.,
256 KB of data). If the file is bigger than this (256 KB + 10 KB = 266 KB), then pointer
number 11 points to a double indirect block containing 256 pointers to indirect
blocks, each of which contains 256 pointers to data blocks (64 MB of data). If the file
is bigger than this (64 MB + 256 KB + 10 KB), then pointer number 12 points to a
triple indirect block containing up to 256 pointers to double indirect blocks, each of
which contains up to 256 pointers to single indirect blocks, each of which contains
up to 256 pointers to data blocks (potentially 16 GB, although the real limit is 2 GB,
since the file size, a signed number of bytes, must fit in a 32-bit word).
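The level boundaries above (10 direct blocks, then 256, 256², and 256³ blocks per indirect level) can be sketched as a small helper; this is an illustration of the arithmetic, not the kernel's actual bmap routine.

```c
/* S5FS disk-map geometry: 1 KB blocks, 10 direct pointers,
 * 256 pointers per indirect block. */
enum { NDIRECT = 10, NINDIR = 256 };

/* Given a block number relative to the start of a file, return which
 * level of the disk map holds it: 0 = direct, 1 = single indirect,
 * 2 = double indirect, 3 = triple indirect, -1 = out of range. */
int s5_bmap_level(long bn)
{
    if (bn < 0)
        return -1;
    if (bn < NDIRECT)
        return 0;
    bn -= NDIRECT;
    if (bn < NINDIR)
        return 1;
    bn -= NINDIR;
    if (bn < (long)NINDIR * NINDIR)
        return 2;
    bn -= (long)NINDIR * NINDIR;
    if (bn < (long)NINDIR * NINDIR * NINDIR)
        return 3;
    return -1;
}
```

For example, file block 265 is the last block reachable through the single indirect block, and block 266 is the first needing the double indirect block.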
This data structure allows the efficient representation of sparse files, i.e., files
whose content is mainly zeros. Consider, for example, the effect of creating an empty
file and then writing one byte at location 2,000,000,000. Only four disk blocks are
allocated to represent this file: a triple indirect block, a double indirect block, a
single indirect block, and a data block. All pointers in the disk map, except for the
last one, are zero. All bytes up to the last one read as zero. This is because a zero
pointer is treated as if it points to a block containing all zeros: a zero pointer to an
indirect block is treated as if it pointed to an indirect block filled with zero pointers,
each of which is treated as if it pointed to a data block filled with zeros. However,
one must be careful about copying such a file, since commands such as cp and tar
actually attempt to write all the zero blocks! (The dump command, on the other
hand, copes with sparse files properly.)
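The sparse-file effect can be demonstrated from user space: seek far past end-of-file and write a single byte. The file name and offset below are arbitrary; a smaller offset than the 2,000,000,000 in the text is used just to keep the demo quick.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/* Create a sparse file: seek past end-of-file, write one byte.
 * The logical size covers the hole, but on file systems with
 * sparse-file support almost no data blocks are allocated.
 * Returns the resulting file size, or -1 on error. */
long sparse_demo(const char *path, long offset)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (lseek(fd, offset, SEEK_SET) != offset) { close(fd); return -1; }
    if (write(fd, "x", 1) != 1)                { close(fd); return -1; }
    struct stat st;
    if (fstat(fd, &st) != 0)                   { close(fd); return -1; }
    close(fd);
    return (long)st.st_size;
}
```

All bytes in the hole read back as zero, exactly as the zero disk-map pointers described above are interpreted.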
S5FS Layout
[Figure: S5FS layout, from the start of the disk: boot block, superblock, I-list, data region.]
S5FS Free List
[Figure: the free-block list. The superblock caches up to 100 addresses of free blocks (99, 98, 97, ...); the last cached address refers to a block holding the next 100 addresses, and so on, with the chain ending in a zero pointer.]
Free disk blocks are organized as shown in the picture. The superblock contains
the addresses of up to 100 free disk blocks. The last of these disk blocks contains 100
pointers to additional free disk blocks. The last of these pointers points to another
block containing up to 100 more pointers, and so on, until all free disk blocks are
represented. Thus most requests for a free block can be satisfied by merely taking
an address from the superblock. When the last block referenced by the superblock is
consumed, however, a disk read must be done to fetch the addresses of up to 100
more free disk blocks. Freeing a disk block results in rebuilding this list
structure.
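The allocation path can be sketched with an in-memory simulation; the real kernel reads the next chain block from disk where the comment indicates, and the structure names here are invented for the sketch.

```c
#include <string.h>

enum { NICFREE = 100 };   /* free-block addresses cached in the superblock */

struct sim_disk {
    long chain[NICFREE];  /* contents of the next list block, if any */
    int  have_chain;
};

struct sim_super {
    int  s_nfree;             /* number of cached addresses */
    long s_free[NICFREE];     /* cached free-block addresses */
};

/* Allocate one block: pop an address from the superblock cache.
 * When the last cached address is consumed, refill the cache from
 * the block it names (simulated here by copying from sim_disk). */
long s5_balloc(struct sim_super *sb, struct sim_disk *disk)
{
    if (sb->s_nfree == 0)
        return -1;                        /* file system is full */
    long bno = sb->s_free[--sb->s_nfree];
    if (sb->s_nfree == 0 && bno != 0 && disk->have_chain) {
        /* A real implementation issues a disk read of block bno here. */
        memcpy(sb->s_free, disk->chain, sizeof disk->chain);
        sb->s_nfree = NICFREE;
        disk->have_chain = 0;
    }
    return bno;
}
```

Note how the common case touches only the superblock; the disk read happens once per 100 allocations.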
This organization, though very simple, scatters the blocks of files all over the
surface of the disk. When allocating a block for a file, one must always use the next
block from the free list; there is no way to request a block at a specific location. No
matter how carefully the free list is ordered when the file system is initialized, it
becomes fairly well randomized after the file system has been used for a while.
S5FS Free Inode List
[Figure: the free-inode cache in the superblock and the I-list. The cache holds indices of free inodes; a free inode is one whose mode field in the I-list is zero, while in-use inodes have nonzero modes.]
Inodes are allocated from the I-list. Free inodes are represented simply by zeroing
their mode bits. The superblock contains a cache of indices of free inodes. When a
free inode is needed (i.e., to represent a new file), its index is taken from this cache.
If the cache is empty, then the I-list is scanned sequentially until enough free inodes
are found to refill the cache.
To speed this search somewhat, the cache contains a reference to the inode with
the smallest index that is known to be free. When an inode is freed, it is added to the
cache if there is room, and its mode bits are zeroed on disk.
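The cache-then-scan behavior can be sketched as below; the names and the refill order are illustrative, not the exact System V code.

```c
enum { NICINOD = 100 };   /* free-inode indices cached in the superblock */

struct sim_isuper {
    int s_ninode;            /* number of cached indices */
    int s_inode[NICINOD];    /* cached free-inode indices */
};

/* Allocate an inode: take an index from the cache; if the cache is
 * empty, refill it by scanning the I-list (an array of mode words,
 * where mode 0 means free). Returns the inode index, or -1 if none. */
int s5_ialloc(struct sim_isuper *sb, unsigned short *modes, int ninodes)
{
    if (sb->s_ninode == 0) {
        for (int i = 0; i < ninodes && sb->s_ninode < NICINOD; i++)
            if (modes[i] == 0)
                sb->s_inode[sb->s_ninode++] = i;
        if (sb->s_ninode == 0)
            return -1;                    /* no free inodes at all */
    }
    int ino = sb->s_inode[--sb->s_ninode];
    modes[ino] = 0100644;                 /* nonzero mode marks it in use */
    return ino;
}
```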
UFS
UFS was developed at the University of California at Berkeley as part of the version
of UNIX known as 4.2 BSD (BSD stands for Berkeley Software Distribution). It was
designed to be much faster than the S5 file system and to eliminate some of its
restrictions, such as the length of components within directory path names.
This material is covered in The Design and Implementation of the 4.3BSD UNIX
Operating System, by Leffler et al.
UFS Directory Format
[Figure: a directory block containing three variable-length entries: inode 117, record length 16, name length 4, name "unix"; inode 4, record length 12, name length 3, name "etc"; inode 18, record length 484, name length 3, name "usr". The last entry's record length extends through the free space at the end of the block.]
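The entry format in the figure can be sketched as a C struct plus a routine that walks one directory block. The struct mirrors the 4.2BSD layout (inode number, record length, name length, name); the walker is an illustrative helper, not kernel code.

```c
#include <stdint.h>

/* A UFS directory entry: variable length. d_reclen locates the next
 * entry and also absorbs any free space at the end of the block. */
struct ufs_direct {
    uint32_t d_ino;      /* inode number; 0 means the entry is unused */
    uint16_t d_reclen;   /* total record length, a multiple of 4 */
    uint16_t d_namlen;   /* length of name, excluding the NUL */
    char     d_name[];   /* NUL-terminated name, padded to 4 bytes */
};

/* Count the named (in-use) entries in one directory block. */
int ufs_count_entries(const char *block, int blocksize)
{
    int n = 0;
    for (int off = 0; off < blocksize; ) {
        const struct ufs_direct *dp =
            (const struct ufs_direct *)(block + off);
        if (dp->d_reclen == 0)
            break;                 /* corrupt block; stop scanning */
        if (dp->d_ino != 0)
            n++;
        off += dp->d_reclen;
    }
    return n;
}
```

Note that record lengths in the figure sum to the block size (16 + 12 + 484 = 512): deleting an entry is done by folding its space into the preceding entry's record length.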
Doing File-System I/O Quickly
The UFS file system uses three techniques to improve I/O performance. The first
technique, which has perhaps the greatest payoff, maximizes the amount of data
transferred with each I/O request by using a relatively large block size. UFS block
sizes may be either 4 KB or 8 KB (the size is fixed for each individual file
system). A disadvantage of using a large block size is the waste due to internal
fragmentation: on average, half of a disk block is wasted for each file. To alleviate
this problem, blocks may be shared among files under certain circumstances.
The second technique to improve performance is to minimize seek time by
attempting to locate the blocks of a file near to one another.
Finally, UFS attempts to minimize rotational latency, i.e., to reduce the amount of
time spent waiting for the disk to rotate until the desired block is underneath the
disk head (many modern disk controllers make it either impossible or unnecessary
to apply this technique).
UFS Layout
[Figure: UFS layout. The disk begins with a boot block, followed by cylinder groups 0 through n-1. Each cylinder group contains a copy of the superblock, a cylinder-group block, inodes, and data; the cylinder-group summary is kept with cylinder group 0.]
Minimizing Fragmentation Costs
• A file system block may be split into
fragments that can be independently
assigned to files
– fragments assigned to a file must be
contiguous and in order
• The number of fragments per block (1, 2, 4, or
8) is fixed for each file system
• Allocation in fragments may only be done on
what would be the last block of a file, and
only if the file does not contain indirect
blocks
Fragmentation is the name Berkeley gave to its technique for reducing the disk
space wasted when file sizes are not an integral multiple of the block size. Files are
normally allocated in units of blocks, since this allows the system to transfer data in
relatively large, block-size units. But this causes space problems if we have lots of
small files, where the average amount of space wasted per file (half the block size) is
an appreciable fraction of the size of the file (the wastage can be far greater than the
size of the file for very small files). The ideal solution might be to reduce the block
size for small files, but this could cause other problems; e.g., small files might grow
to be large files. The solution used in UFS is to have all of the blocks of a file be the
standard size, except for perhaps the last block of a file. This block may actually be
some number of contiguous, in-order fragments of a standard block.
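The resulting storage arithmetic can be sketched as follows, assuming an 8 KB block with 1 KB fragments (the constants are parameters of each file system, chosen here for illustration).

```c
/* UFS storage for a file: all blocks full-sized except possibly the
 * last, which may be some number of contiguous fragments. */
enum { BSIZE = 8192, FSIZE = 1024 };   /* 8 fragments per block */

struct layout {
    long full_blocks;   /* number of full-sized blocks */
    int  last_frags;    /* fragments in the final, partial block */
};

struct layout ufs_layout(long size)
{
    struct layout l;
    l.full_blocks = size / BSIZE;
    long tail = size % BSIZE;
    l.last_frags = (int)((tail + FSIZE - 1) / FSIZE);   /* round up */
    return l;
}
```

With these constants, the 18-fragment file A of the next slides is two full blocks plus two fragments, and the 12-fragment file B is one full block plus four fragments.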
Use of Fragments (1)
[Figure: a block divided into fragments, shared by the ends of two files: file A's last fragments at the top, two free fragments in the middle, and file B's last fragments at the bottom.]
This example illustrates a difficulty associated with the use of fragments. The file
system must preserve the invariant that fragments assigned to a file must be
contiguous and in order, and that allocation of fragments may be done only on what
would be the last block of the file. In the picture, the direction of growth is
downwards. Thus file A may easily grow by up to two fragments, but file B cannot
easily grow within this block.
In the picture, file A is 18 fragments in length, file B is 12 fragments in length.
Use of Fragments (2)
[Figure: file A has grown into the two free fragments; the block is now entirely in use.]
Use of Fragments (3)
[Figure: file A's fragments have been copied into a newly allocated block with room to grow; file B remains in the old block.]
File A grows by two more fragments, but since there is no space for it, the file
system allocates another block and copies file A's fragments into it. How much space
should be available in the newly allocated block? If the newly allocated block is
entirely free, i.e., none of its fragments are used by other files, then further growth
by file A will be very cheap. However, if the file system uses this approach all the
time, then we do not get the space-saving benefits of fragmentation. An alternative
approach is to use a "best-fit" policy: find a block that contains exactly the number
of free fragments needed by file A, or if such a block is not available, find a block
containing the smallest number of contiguous free fragments that will satisfy file A's
needs.
Which approach is taken depends upon the degree to which the file system is
fragmented. If disk space is relatively unfragmented, then the first approach is taken
("optimize for time"). Otherwise, i.e., when disk space is fragmented, the file system
takes the second approach ("optimize for space").
The points at which the system switches between the two policies are parameterized
in the superblock: a certain percentage of the disk space, by default 10%, is reserved
for superuser. (Disk-allocation techniques need a reasonable chance of finding free
disk space in each cylinder group in order to optimize the layout of files.) If the total
amount of fragmented free disk space (i.e., the total amount of free disk space not
counting that portion consisting of whole blocks) increases to 8% of the size of the
file system (or, more generally, increases to 2% less than the reserve), then further
allocation is done using the best-fit approach. Once this approach is being used, if
the total amount of fragmented free disk space drops below 5% (or half of the
reserve), then further allocation is done using the whole-block technique.
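The hysteresis between the two policies can be sketched as a small state function; the function and enum names are invented for the sketch, and the thresholds are expressed in terms of the reserve as described above.

```c
/* Policy switch for fragment allocation. With a 10% reserve:
 * switch to best-fit ("optimize for space") when fragmented free
 * space reaches reserve - 2% of the file system, and back to
 * whole-block allocation ("optimize for time") when it falls
 * below reserve / 2. */
enum policy { OPT_TIME, OPT_SPACE };

enum policy next_policy(enum policy cur,
                        double frag_free_pct,   /* fragmented free space, % of fs */
                        double reserve_pct)     /* superuser reserve, % of fs */
{
    if (cur == OPT_TIME && frag_free_pct >= reserve_pct - 2.0)
        return OPT_SPACE;
    if (cur == OPT_SPACE && frag_free_pct < reserve_pct / 2.0)
        return OPT_TIME;
    return cur;
}
```

The gap between the 8% and 5% thresholds prevents the allocator from flapping between the two policies on every allocation.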
Minimizing Seek Time
• The principle:
– keep related information as close together as possible
– distribute information sufficiently to make the above
possible
• The practice:
– attempt to put new inodes in the same cylinder group
as their directories
– put inodes for new directories in cylinder groups with
"lots" of free space
– put the beginning of a file (direct blocks) in the inode's
cylinder group
– put additional portions of the file (each 2MB) in
cylinder groups with "lots" of free space
One of the major components (in terms of time) of a disk I/O operation is the
positioning of the disk head. In the S5 file system we didn't worry about this, but in
the UFS file system we would like to lay out files on disk so as to minimize the time
required to position the disk head. If we know exactly what the contents of an entire
file system will be when we create it, then, in principle, we could lay files out
optimally. But we don't have this sort of knowledge, so, in the UFS file system, a
reasonable effort is made to lay files out "pretty well."
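The "lots of free space" heuristic from the list above can be sketched as picking the cylinder group with the most free blocks; this is a simplification (the real code also weighs inode counts and other factors), and the function name is invented.

```c
/* Choose a cylinder group for a new directory's inode: take the
 * group with the most free blocks. free_blocks[i] is the free-block
 * count of cylinder group i. */
int pick_cg_for_dir(const long *free_blocks, int ncg)
{
    int best = 0;
    for (int i = 1; i < ncg; i++)
        if (free_blocks[i] > free_blocks[best])
            best = i;
    return best;
}
```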
Minimizing Latency (1)
[Figure: a disk track with eight blocks, numbered 1 through 8, laid out in consecutive positions around the track.]
Latency is the time spent waiting for the disk platter to rotate, bringing the desired
sector underneath the disk head. In most disks, latency time is dominated by seek
time, but if we've done a good job improving seek time, perhaps we can do
something useful with latency time.
A naive way of laying out consecutive blocks of the file on a track would be to put
them in consecutive locations. The problem with this is that some amount of time
passes between the completion of one disk request and the start of the next. During
this time, the disk rotates a certain distance, probably far enough that the disk head
is positioned after the next block. Thus it is necessary to wait for the disk to rotate
almost a complete revolution for it to bring the beginning of the next block
underneath the disk head. This delay could cause a significant slowdown.
Minimizing Latency (2)
[Figure: the same track with blocks 1 through 4 spaced around it, leaving a gap between consecutive blocks.]
A better technique is not to lay out the blocks on the track consecutively, but to
leave enough space between them that the disk rotates no further than to the
position of the next block during the time between disk requests.
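This spacing is the classic interleave-table construction; the sketch below builds such a table by stepping around the track by the interleave factor and sliding forward when a slot is taken. It illustrates the idea rather than reproducing UFS's actual rotational-layout tables.

```c
/* Build an interleave table for a track of n slots (n <= 64).
 * slot_of[i] receives the physical slot of logical block i: step
 * "factor" slots per block, advancing past already-used slots. */
void build_interleave(int *slot_of, int n, int factor)
{
    int used[64] = {0};
    int pos = 0;
    for (int i = 0; i < n; i++) {
        while (used[pos])
            pos = (pos + 1) % n;   /* preferred slot taken; slide on */
        slot_of[i] = pos;
        used[pos] = 1;
        pos = (pos + factor) % n;
    }
}
```

With 8 slots and a factor of 2, logical blocks land at slots 0, 2, 4, 6, 1, 3, 5, 7, leaving one slot of "think time" between consecutive blocks.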
It may be that when a new block is allocated for a file, the optimal position for the
next block is already occupied. If so, one may be able to find a block that is just as
good. If the disk has multiple surfaces (and multiple heads), then we can make the
reasonable assumption that the blocks underneath each head can be accessed
equally quickly. Thus the stack of blocks underneath the disk heads at one instant
are said to be rotationally equivalent. If all of these blocks are occupied, then the
next stack of rotationally equivalent blocks in the opposite direction of disk rotation
is almost as good as the first. If all of these blocks are taken, then the third stack is
almost as good, and so forth all the way around the cylinder. If all of these are taken,
then any block within the cylinder group is chosen.
This technique is perhaps not as useful today as in the past, since many disk
controllers buffer entire tracks and hide the relevant disk geometry.
The Buffer Cache
[Figure: a user process's buffer above, the kernel's buffer cache below; data moves between the user buffer, the buffer cache, and the disk.]
File I/O in UNIX is not done directly to the disk drive, but through an
intermediary, the buffer cache.
The buffer cache has two primary functions. The first, and most important, is to
make possible concurrent I/O and computation within a UNIX process. The second
is to insulate the user from physical block boundaries.
From a user thread's point of view, I/O is synchronous. By this we mean that when
the I/O system call returns, the system no longer needs the user-supplied buffer.
For example, after a write system call, the data in the user buffer has either been
transmitted to the device or copied to a kernel buffer -- the user can now scribble
over the buffer without affecting the data transfer. Because of this synchronization,
from a user thread's point of view, no more than one I/O operation can be in
progress at a time. Thus user-implemented multibuffered I/O is not possible (in a
single-threaded process).
The buffer cache provides a kernel implementation of multibuffering I/O, and thus
concurrent I/O and computation are possible even for single-threaded processes.
Multi-Buffered I/O
[Figure: multi-buffered I/O. A process read()s block i (the current block); block i-1 (the previous block) has already been delivered, and block i+1 (the probable next block) is being fetched.]
The use of read-aheads and write-behinds makes concurrent I/O and computation
possible: if the block currently being fetched is block i and the previous block
fetched was block i-1, then block i+1 is also fetched. Modified blocks are normally
not written out synchronously; instead they are written out asynchronously,
sometime after they were modified.
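The read-ahead decision reduces to a sequentiality check; the sketch below records the prefetch instead of starting real I/O, and the structure is invented for illustration.

```c
/* Read-ahead heuristic: when block bn is read and the previous read
 * was bn-1, also start fetching bn+1 (here, just record it; the
 * kernel's breada() would start an asynchronous disk read). */
struct rstate {
    long last_read;    /* block number of the previous read */
    long prefetched;   /* block a prefetch was started for, or -1 */
};

void cache_read(struct rstate *st, long bn)
{
    if (bn == st->last_read + 1)
        st->prefetched = bn + 1;   /* sequential access detected */
    st->last_read = bn;
}
```

A random access resets the pattern, so one stray read does not trigger useless prefetching.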
Maintaining the Cache
[Figure: the buffer free list, ordered from oldest to youngest. Buffer requests take the oldest buffer; returns of no-longer-active buffers go to the oldest end, and returns of active buffers (whose contents are likely to be reused) go to the youngest end.]
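The age-ordered free list in the figure can be sketched as a small queue; a real buffer cache uses linked lists and hash chains, so this array version only illustrates the replacement policy.

```c
enum { NBUF = 4 };

struct freelist {
    int buf[NBUF];   /* buf[0] is the oldest buffer */
    int n;
};

/* Satisfy a buffer request: take the oldest buffer (-1 if none). */
int take_oldest(struct freelist *fl)
{
    if (fl->n == 0)
        return -1;
    int b = fl->buf[0];
    for (int i = 1; i < fl->n; i++)
        fl->buf[i - 1] = fl->buf[i];
    fl->n--;
    return b;
}

/* Return a buffer likely to be reused: youngest end (no overflow
 * checks in this sketch; callers must not exceed NBUF buffers). */
void release_young(struct freelist *fl, int b)
{
    fl->buf[fl->n++] = b;
}

/* Return a buffer unlikely to be reused: oldest end, so it is the
 * first candidate for replacement. */
void release_old(struct freelist *fl, int b)
{
    for (int i = fl->n; i > 0; i--)
        fl->buf[i] = fl->buf[i - 1];
    fl->buf[0] = b;
    fl->n++;
}
```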
File-System Consistency (1)
[Figure: a well-formed linked list of three nodes: 1, 2, 3.]
In the event of a crash, the contents of the file system may well be inconsistent
with any view of it the user might have. For example, a programmer may have
carefully added a node to the end of the list, so that at all times the list structure is
well-formed.
File-System Consistency (2)
[Figure: CRASH! The list now runs 1 through 5, but the two new nodes (4 and 5) are not on disk.]
But, if the new node and the old node are stored on separate disk blocks, the
modifications to the block containing the old node might be written out first; the
system might well crash before the second block is written out.
Keeping It Consistent
To deal with this problem, one must make certain that the target of a pointer is
safely on disk before the pointer is set to point to it. This is done for certain system
data structures (e.g., directory entries, inodes, indirect blocks, etc.).
No such synchronization is done for user data structures: not enough is known
about the semantics of user operations to make this possible. However, a user
process called update executes a sync system call every 30 seconds, which initiates
the writing out to disk of all dirty buffers. Alternatively, the user can open a file with
the synchronous option so that all writes are waited for; i.e., the buffer cache acts as
a write-through cache (N.B.: this is expensive!).
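The "target before pointer" ordering can be demonstrated from user space with fsync. In the sketch below, two ordinary files stand in for the two disk blocks, and the file names are arbitrary; the point is that the node's data is forced to disk before the pointer to it is written.

```c
#include <fcntl.h>
#include <unistd.h>

/* Write a "node", force it to disk with fsync, and only then write
 * the "pointer" that refers to it. If the system crashes between the
 * two writes, the pointer is absent but nothing dangles.
 * Returns 0 on success, -1 on error. */
int ordered_update(const char *node_path, const char *ptr_path)
{
    int fd = open(node_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, "new node", 8) != 8 || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);          /* the node is durable before we ... */

    fd = open(ptr_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, "ptr", 3) != 3) {
        close(fd);
        return -1;
    }
    close(fd);          /* ... publish the pointer to it */
    return 0;
}
```

Opening the file with O_SYNC instead would make every write behave this way, which is exactly the expensive write-through mode described above.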