Operating System Concepts, Chapter 12: File-System Implementation
School of Computer Science and Engineering,
Southeast University
A/Prof. Kai Dong Operating System Concepts Chapter 12. File-System Implementation 1 / 55
0.Prologue 1.VSFS 2.FFS 3.FSCK & Journaling
Contents
1 Warm-up
Warm-up
File System Measurement Summary
Size of A File
• Why do the files have these sizes, and why do they occupy this much space?
Objectives
• We are now discussing a simple file system implementation, known as VSFS (the
Very Simple File System)
• A simplified version of a typical UNIX file system.
• You should understand
• Data structures: what types of on-disk structures are utilized by the file
system to organize its data and metadata?
• Access methods: How does it map the calls made by a process onto its
structures?
• Divide the disk into blocks (with a commonly-used block size of 4 KB).
• Assume a really small disk, with just 64 blocks.
[Figure: the 64 blocks of the disk, numbered 0 to 63.]
[Figure: the data region (D) occupies blocks 8 to 63; blocks 0 to 7 are left for metadata.]
• To track information about each file, the inodes are stored in the inode table.
• Assume 5 of 64 blocks for inodes.
Inodes Data Region
I I I I I I I I D D D D D D D D D D D D D D D D D D D D D D D D
0 7 8 15 16 23 24 31
Data Region
D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D
32 39 40 47 48 55 56 63
• Assuming 256 bytes per inode, our file system contains ? total inodes.
• This number represents the maximum number of files we can have in our
file system.
• How can the file system know which inodes and data blocks are free?
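The inode count asked for above follows directly from the slide's numbers. A minimal sketch of the arithmetic; the constant and function names are illustrative, not part of VSFS:

```python
BLOCK_SIZE = 4096    # bytes per block (4 KB, as on the slide)
INODE_SIZE = 256     # bytes per inode (as on the slide)
INODE_BLOCKS = 5     # blocks reserved for the inode table (as on the slide)


def inodes_per_block(block_size=BLOCK_SIZE, inode_size=INODE_SIZE):
    """Number of whole inodes that fit in one block."""
    return block_size // inode_size


def total_inodes(inode_blocks=INODE_BLOCKS):
    """Maximum number of files the file system can hold."""
    return inode_blocks * inodes_per_block()


print(inodes_per_block())  # 16
print(total_inodes())      # 80
```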
• To track whether inodes or data blocks are free or allocated, an inode bitmap
and a data bitmap are required.
Inodes Data Region
S i d I I I I I D D D D D D D D D D D D D D D D D D D D D D D D
(S = superblock, i = inode bitmap, d = data bitmap, I = inode table, D = data)
0 7 8 15 16 23 24 31
Data Region
D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D
32 39 40 47 48 55 56 63
[Figure: close-up of the inode table. The superblock (Super), inode bitmap (i-bmap), and data bitmap (d-bmap) are followed by the inode blocks, which hold inodes 0 through 79.]
Directory entry: name = test, start block = 217
FAT (index → next block):
217 → 618
618 → 339
339 → -1 (end of file)
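Following a file through the FAT is a linked-list walk: start at the block named in the directory entry and keep looking up the next block until the end-of-file marker. A minimal sketch using the values from the figure; the dictionaries stand in for the on-disk table and directory and are illustrative only:

```python
END_OF_FILE = -1

# FAT entries from the figure: index -> next block of the same file.
fat = {217: 618, 618: 339, 339: END_OF_FILE}

# Directory entry from the figure: file name -> start block.
directory = {"test": 217}


def file_blocks(name):
    """Return the blocks of a file in order by following the FAT chain."""
    blocks = []
    block = directory[name]
    while block != END_OF_FILE:
        blocks.append(block)
        block = fat[block]
    return blocks


print(file_blocks("test"))  # [217, 618, 339]
```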
• NTFS (new technology file system) with MFT (master file table)
• A record in the MFT (each is 1 KB in size):
• The body contains attribute data, or a pointer to an extent
• Contiguous allocation
• Each file occupies a set of contiguous blocks on the disk.
• ext4, NTFS
• Linked allocation
• Each file is a linked list of disk blocks
• FAT
• Indexed allocation
• Brings all pointers together into the index block
• ext2, ext3
• Imagine a file system which uses inodes to manage files on disk. Each inode
consists of a file name (4 bytes), user id (2 bytes), three timestamps (4 bytes
each), protection bits (2 bytes), a reference count (2 bytes), a file type (2 bytes),
and the file size (4 bytes). Additionally, the inode contains 13 direct indices, 1
index to a single indirect block, 1 index to a double indirect block, and one index
to a triple indirect block. Each of these indices (block pointer) is 4 bytes. The
file system also stores the first 356 bytes of each file in the inode.
1 Three major methods of allocating disk space are introduced in our
textbook. What are these three allocation methods? Which one does the
file system described above use?
2 Assume a disk sector is 512 bytes and that each indirect block fills a single
sector. What is the maximum file size for this file system? Show your work
clearly. You need not do the arithmetic to get full credit.
3 Is there any benefit to including the first 356 bytes of the file in the inode?
If so, what is the reason? If not, why not?
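• Indexed allocation.
• $(512/4)^3 \times 512 + (512/4)^2 \times 512 + (512/4)^1 \times 512 + 13 \times 512 + 356$ bytes.
• Yes. It improves both time and space efficiency. Most files are small. For a small
file (≤356 bytes), the data arrives together with the inode, so the disk does not
need to be accessed twice, and no data block has to be allocated, which saves
disk space (no internal fragmentation within a block).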
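The maximum file size can be checked by evaluating the expression term by term. The sketch below assumes, as the answer does, that every data block and every indirect block is one 512-byte sector; the function and parameter names are illustrative.

```python
def max_file_size(sector=512, pointer=4, direct=13, inline=356):
    """Largest file in bytes: inline bytes + direct + single/double/triple indirect."""
    ptrs = sector // pointer                 # pointers per indirect block (128)
    return (
        inline                               # bytes stored inside the inode
        + direct * sector                    # 13 direct blocks
        + ptrs * sector                      # single indirect
        + ptrs ** 2 * sector                 # double indirect
        + ptrs ** 3 * sector                 # triple indirect
    )


print(max_file_size())  # 1082202980 bytes, slightly more than 1 GiB
```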
• Deleting a file (e.g., calling unlink()) can leave an empty space in the
middle of the directory, and hence there should be some way to mark that
as well (e.g., with a reserved inode number such as zero).
• Such a delete is one reason the record length (reclen) is used: a new entry
may reuse an old, bigger entry and thus have extra space within.
• A directory has an inode, somewhere in the inode table (with the type field
of the inode marked as “directory” instead of “regular file”).
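The role of the reserved inode number and of reclen can be sketched as follows. This is a simplified in-memory model, not an on-disk format: the DirEntry fields, the 8-byte header size, and the helper names are assumptions made for illustration.

```python
from dataclasses import dataclass

HEADER_BYTES = 8  # assumed size of the fixed fields in front of the name


@dataclass
class DirEntry:
    inum: int      # inode number; 0 marks a free (deleted) slot
    reclen: int    # total bytes reserved for this entry, including slack
    name: str


def needed_bytes(name):
    """Bytes an entry needs: header plus the name and its terminator."""
    return HEADER_BYTES + len(name) + 1


def unlink(entries, name):
    """Delete by marking the slot free; the slot itself stays in the directory."""
    for entry in entries:
        if entry.inum != 0 and entry.name == name:
            entry.inum = 0
            return True
    return False


def add_entry(entries, name, inum):
    """Reuse the first free slot that is big enough, else append a new one."""
    for entry in entries:
        if entry.inum == 0 and entry.reclen >= needed_bytes(name):
            entry.inum, entry.name = inum, name   # reclen unchanged: extra space stays inside
            return entry
    entry = DirEntry(inum, needed_bytes(name), name)
    entries.append(entry)
    return entry
```

In this model, deleting a long name and then adding a short one reuses the old slot, so the new entry's reclen is larger than the space its name actually needs.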
• Suppose we want to open a file, e.g., /foo/bar, read it, then close it.
• Opening a file from disk
• The file system must traverse the pathname and thus locate the desired inode.
1 Read the inode of the root directory which is simply called /.
2 Look inside the inode to find pointers to data blocks, which contain the
contents of the root directory.
3 Find the entry for foo, and the inode number.
4 ···
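The traversal can be sketched over a toy in-memory inode table. Everything here is illustrative: the inode numbers 44 and 45, the dictionary layout, and the choice of 2 as the root inode number (a common UNIX convention) are assumptions, not taken from the slides.

```python
ROOT_INUM = 2  # assumption: the root directory's inode number is well known

# Toy inode table: inode number -> inode. Directory contents map name -> inode number.
inodes = {
    2: {"type": "directory", "entries": {"foo": 44}},
    44: {"type": "directory", "entries": {"bar": 45}},
    45: {"type": "regular", "data": b"hello"},
}


def lookup(path):
    """Walk the path from the root; return (inode number, list of reads issued)."""
    inum = ROOT_INUM
    reads = [("inode", inum)]                 # read the root inode
    for name in [part for part in path.split("/") if part]:
        inode = inodes[inum]
        if inode["type"] != "directory":
            raise NotADirectoryError(path)
        reads.append(("data", inum))          # read the directory's data blocks
        if name not in inode["entries"]:
            raise FileNotFoundError(path)
        inum = inode["entries"][name]
        reads.append(("inode", inum))         # read the inode of the next component
    return inum, reads


inum, reads = lookup("/foo/bar")
print(inum, len(reads))  # 45 5
```

Opening /foo/bar costs five reads in this model: root inode, root data, foo inode, foo data, and bar inode.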
1 The first read (at offset 0 unless lseek() has been called) will thus read in
the first block of the file, consulting the inode to find the location of such
a block.
2 ···
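Consulting the inode to find a block amounts to turning the current file offset into an index into the inode's pointers. A minimal sketch, assuming 4 KB blocks and a file small enough to need only direct pointers; the function name and the pointer values are hypothetical:

```python
BLOCK_SIZE = 4096  # bytes per block, as assumed for VSFS


def block_for_offset(direct_pointers, offset):
    """Return (disk block, offset within that block) for a file offset."""
    index = offset // BLOCK_SIZE
    if index >= len(direct_pointers):
        raise ValueError("offset is beyond the blocks reachable via direct pointers")
    return direct_pointers[index], offset % BLOCK_SIZE


pointers = [40, 41, 42]                   # hypothetical direct pointers in the inode
print(block_for_offset(pointers, 0))      # (40, 0): the first read starts here
print(block_for_offset(pointers, 5000))   # (41, 904)
```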
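Timeline of I/Os when reading /foo/bar (in order; the data bitmap and inode bitmap are not touched):
• open(bar): read root inode, read root data, read foo inode, read foo data, read bar inode
• read(): read bar inode, read bar data[0], write bar inode
• read(): read bar inode, read bar data[1], write bar inode
• read(): read bar inode, read bar data[2], write bar inode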
• Writing to disk
write()
• Considering file creation, the total amount of I/O traffic to do so is quite high:
• one read to the inode bitmap (to find a free inode),
• one write to the inode bitmap (to mark it allocated),
• one write to the new inode itself (to initialize it),
• one write to the data of the directory (to link the high-level name of the
file to its inode number), and
• one read and write to the directory inode to update it.
• If the directory needs to grow to accommodate the new entry, additional
I/Os (i.e., to the data bitmap, and the new directory block) will be needed
too.
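Timeline of I/Os when creating /foo/bar and writing three blocks to it (in order):
• create(/foo/bar): read root inode, read root data, read foo inode, read foo data, read inode bitmap, write inode bitmap, write foo data, read bar inode, write bar inode, write foo inode
• write(): read bar inode, read data bitmap, write data bitmap, write bar data[0], write bar inode
• write(): read bar inode, read data bitmap, write data bitmap, write bar data[1], write bar inode
• write(): read bar inode, read data bitmap, write data bitmap, write bar data[2], write bar inode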
• Performance problems of the old UNIX file system
• Expensive positioning costs, because data was spread all over the disk.
• The data blocks of a file were often very far away from its inode, thus
inducing an expensive seek whenever one first read the inode and
then the data blocks.
• The file system became fragmented, because free space was not carefully managed.
• The original block size was too small (512 bytes).
• Fast File System (FFS): disk awareness
• Design the file system structures and allocation policies to be “disk aware”
and thus improve performance.
• It keeps the same interface to the FS, but changes the internal
implementation.
• Modern drives do not export enough information for the file system to truly
understand whether a particular cylinder is in use.
• Modern file systems (such as Linux ext2, ext3, and ext4) instead organize the
drive into block groups.
• Whether you call them cylinder groups or block groups, these groups are the
central mechanism that FFS uses to improve performance.
• By placing two files within the same group, FFS can ensure that accessing
one after the other will not result in long seeks across the disk.
• FFS keeps within a single cylinder group all the structures you might expect a
file system to have.
[Layout of one group: S | ib | db | Inodes | Data]
• FFS keeps a copy of the super block (S) in each group for reliability
reasons.
• A per-group inode bitmap (ib) and data bitmap (db) to track whether the
inodes and data blocks of the group are allocated.
• The inode and data block regions are just like those in the previous
very-simple file system.
• Assume three directories (/, /a, and /b), and four files (/a/c, /a/d, /a/e, and
/b/f).
• In a general file system (inodes and data spread evenly across groups)
group inodes data
0 /--------- /---------
1 a--------- a---------
2 b--------- b---------
3 c--------- c---------
4 d--------- d---------
5 e--------- e---------
6 f--------- f---------
7 ---------- ----------
···
• In FFS
group inodes data
0 /--------- /---------
1 acde------ accddee---
2 bf-------- bff-------
3 ---------- ----------
···
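The FFS layout above can be reproduced with a simplified placement rule: put each new directory in the group that currently holds the fewest directories, and put each file in the group of its parent directory. This is a sketch under that stated assumption; the function name is hypothetical, and a real allocator would also weigh free inodes and free blocks.

```python
def ffs_place(directories, files, num_groups=8):
    """Assign directories and files to groups with a simplified FFS-style policy."""
    dirs_per_group = [0] * num_groups
    dir_group = {}
    for d in directories:
        # Pick the group with the fewest directories (lowest index wins ties).
        group = min(range(num_groups), key=lambda g: dirs_per_group[g])
        dir_group[d] = group
        dirs_per_group[group] += 1

    file_group = {}
    for f in files:
        parent = f.rsplit("/", 1)[0] or "/"
        file_group[f] = dir_group[parent]      # keep a file near its directory
    return dir_group, file_group


dir_group, file_group = ffs_place(["/", "/a", "/b"], ["/a/c", "/a/d", "/a/e", "/b/f"])
print(dir_group)   # {'/': 0, '/a': 1, '/b': 2}
print(file_group)  # {'/a/c': 1, '/a/d': 1, '/a/e': 1, '/b/f': 2}
```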
• Considering consistency
• How to update persistent data structures despite the presence of a power
loss or system crash?
• Crash-consistency problem
• Imagine you have to update two on-disk structures, A and B, in order to
complete a particular operation. Because the disk only services a single
request at a time, one of these requests will reach the disk first (either A or
B). If the system crashes or loses power after one write completes, the
on-disk structure will be left in an inconsistent state.
• How to update the disk despite crashes?
• Two approaches
• A file system checker (fsck)
• Journaling (also known as write-ahead logging)
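[Figure: on-disk state before and after the update. Regions: Inode Bmap, Data Bmap, Inodes, Data Blocks.]
• From: the inode is I[v1]; the data blocks hold Da.
• To: the inode is I[v2]; the data blocks hold Da and Db.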
• Superblock: Fsck first sanity-checks the superblock. If it finds a suspect (corrupt)
superblock, the system (or administrator) may decide to use an alternate copy of the superblock.
• Free blocks: Fsck scans the inodes, indirect blocks, double indirect blocks, etc.,
to build an understanding of which blocks are currently allocated within the file
system. It uses this knowledge to produce a correct version of the allocation
bitmaps; thus, if there is any inconsistency between bitmaps and inodes, it is
resolved by trusting the information within the inodes.
• Inode state: Each inode is checked for corruption or other problems. A suspect
inode that cannot easily be fixed is cleared by fsck.
• Inode links: Fsck scans through the entire directory tree to verify the link count
of each allocated inode.
• Duplicates: Fsck checks for duplicate pointers, i.e., two inodes that refer to the
same block. If one inode is obviously bad, it may be cleared. Alternatively, the
pointed-to block could be copied, thus giving each inode its own copy as desired.
• Bad blocks: A pointer is considered “bad” if it obviously points to something
outside its valid range.
• Directory checks: Fsck performs additional integrity checks on the contents of
each directory, e.g., making sure that each inode referred to in a directory entry
is allocated and no directory is linked to more than once.
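The free-blocks check can be sketched as: rebuild the data bitmap from the block pointers found in the inodes, then compare it with the bitmap on disk and trust the rebuilt one. This is a toy model; the inode layout (a flat list of pointers per inode) and the function names are assumptions, and indirect blocks are ignored.

```python
def rebuild_data_bitmap(inodes, num_blocks):
    """Derive which blocks are in use by scanning every inode's pointers."""
    in_use = set()
    for inode in inodes.values():
        in_use.update(inode["pointers"])
    return [block in in_use for block in range(num_blocks)]


def check_free_blocks(bitmap, inodes):
    """Return (blocks where the on-disk bitmap is wrong, corrected bitmap)."""
    correct = rebuild_data_bitmap(inodes, len(bitmap))
    wrong = [b for b, (have, want) in enumerate(zip(bitmap, correct)) if have != want]
    return wrong, correct


inodes = {1: {"pointers": [4, 5]}, 2: {"pointers": [7]}}
on_disk = [False, False, False, False, True, False, True, False]
wrong, fixed = check_free_blocks(on_disk, inodes)
print(wrong)  # [5, 6, 7]: 5 and 7 are used but marked free, 6 is marked used but unreferenced
```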
• Fsck is too slow: it lets inconsistencies happen and then finds and fixes them
later, when rebooting.
• An alternate solution is journaling (a.k.a. write-ahead logging).
• When updating the disk, before overwriting the structures in place, first
write down a little note (somewhere else on the disk, in a well-known
location) describing what you are about to do.
• If a crash takes place, the note tells the file system exactly what to fix (and
how to fix it), so it does not have to scan the entire disk.
• Linux ext2 without journaling
• Recall our update example, where we wish to write the inode (I[v2]), bitmap
(B[v2]), and data block (Db) to disk.
• Before writing them to their final disk locations, we first write them to the
log (a.k.a. journal).
Journal: TxB | I[v2] | B[v2] | Db | TxE
• The transaction begin (TxB) tells us about this update, and contains a
transaction identifier (TID).
• The middle three blocks just contain the exact contents of the blocks
themselves; this is known as physical logging (as opposed to logical logging).
• The final block (TxE) is a marker of the end of this transaction, and also
contains the TID.
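A transaction with this shape can be modeled as a list of records. The sketch below is illustrative only: the tuple layout, the "BLOCK" tag, and the block addresses 3, 2, and 9 are assumptions, not an actual journal format.

```python
def make_transaction(tid, updates):
    """Build journal records for one transaction.

    updates is a list of (final block address, new contents) pairs. With
    physical logging, the journal stores the full new contents of each block.
    """
    records = [("TxB", tid)]                                  # begin marker with the TID
    records += [("BLOCK", addr, data) for addr, data in updates]
    records.append(("TxE", tid))                              # end marker with the same TID
    return records


# Hypothetical final addresses for the inode block, bitmap block, and data block.
txn = make_transaction(1, [(3, "I[v2]"), (2, "B[v2]"), (9, "Db")])
for record in txn:
    print(record)
```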
• Once this transaction is safely on disk, we are ready to overwrite the old
structures in the file system; this process is called checkpointing.
• To checkpoint the file system (i.e., bring it up to date with the pending update
in the journal), we issue the writes I[v2], B[v2], and Db to their disk locations.
• If these writes complete successfully, we have successfully checkpointed the file
system and are basically done.
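Checkpointing, and replay after a crash, can both be sketched as applying every committed transaction in the journal to the final disk locations. The record format below (tuples tagged "TxB", "BLOCK", "TxE") and the block addresses are assumptions for illustration. A transaction counts as committed only if its TxE with a matching TID is present.

```python
def committed_transactions(journal):
    """Yield the block updates of each transaction that has a matching TxB and TxE."""
    current_tid, pending = None, []
    for record in journal:
        kind = record[0]
        if kind == "TxB":
            current_tid, pending = record[1], []
        elif kind == "BLOCK" and current_tid is not None:
            pending.append((record[1], record[2]))
        elif kind == "TxE" and record[1] == current_tid:
            yield pending
            current_tid, pending = None, []


def checkpoint(journal, disk):
    """Write every committed update to its final location; skip incomplete transactions."""
    for updates in committed_transactions(journal):
        for addr, data in updates:
            disk[addr] = data
    return disk


journal = [
    ("TxB", 1), ("BLOCK", 3, "I[v2]"), ("BLOCK", 2, "B[v2]"), ("BLOCK", 9, "Db"), ("TxE", 1),
    ("TxB", 2), ("BLOCK", 3, "I[v3]"),        # crash before TxE 2: never applied
]
disk = {2: "B[v1]", 3: "I[v1]", 9: None}
print(checkpoint(journal, disk))  # {2: 'B[v2]', 3: 'I[v2]', 9: 'Db'}
```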
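• Thus the file system issues the transactional write in two steps: first it writes
every block except TxE to the journal, and only after those writes complete does
it write TxE.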
• Block Reuse
• User adds an entry to the directory foo, assume the location of the foo
directory data is block 1000.
• User deletes everything in the directory as well as the directory itself,
freeing up block 1000 for reuse.
• User creates a new file foobar, which ends up reusing the same block 1000.
[Figure: the journal]
• Now assume a crash occurs and all of this information is still in the log.
What will happen?
• Solutions:
• Never reuse blocks until the delete of said blocks is checkpointed out of
the journal
• Linux ext3: add a new type of record (revoke) between two transactions.
Such revoked data is never replayed.
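The effect of a revoke record can be sketched with a toy replay loop. This model makes several assumptions: the journal is a flat list of records without TxB/TxE markers, the old directory block was journaled, and the new file's data was written straight to block 1000 without being journaled. The record tags and function name are hypothetical.

```python
def replay(journal, disk):
    """Replay journaled blocks, skipping any block that was revoked later in the log."""
    last_revoke = {}
    for position, record in enumerate(journal):        # pass 1: find revoke records
        if record[0] == "REVOKE":
            last_revoke[record[1]] = position
    for position, record in enumerate(journal):        # pass 2: replay surviving blocks
        if record[0] == "BLOCK":
            addr, data = record[1], record[2]
            if last_revoke.get(addr, -1) > position:
                continue                                # revoked: never replayed
            disk[addr] = data
    return disk


old_log = [("BLOCK", 1000, "foo directory data")]
print(replay(old_log, {1000: "foobar file data"}))
# {1000: 'foo directory data'}: without a revoke, replay clobbers the new file

safe_log = [("BLOCK", 1000, "foo directory data"), ("REVOKE", 1000)]
print(replay(safe_log, {1000: "foobar file data"}))
# {1000: 'foobar file data'}: the revoked block is skipped
```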