Ext3/4 File Systems: Don Porter CSE 506
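Logical Diagram
[Figure: layered course diagram (User / Kernel / Hardware) with boxes labeled Binary Formats, Memory Allocators, Threads, System Calls, RCU, File System, Networking, Sync, Interrupts, Disk, Net, and Consistency; a "Today's Lecture" marker highlights the current topic.]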
Ext2 review
• Very reliable, “best-of-breed” traditional file system design
• Much like the JOS file system you are building now
  • Fixed-location superblocks
  • A few direct blocks in the inode, followed by indirect blocks for large files
  • Directories are a special file type with a list of file names and inode numbers
  • Etc.
File systems and crashes
• What can go wrong?
  • Write a block pointer in an inode before marking the block as allocated in the allocation bitmap
  • Write a second block allocation before clearing the first – the block ends up in 2 files after reboot
  • Allocate an inode without putting it in a directory – “orphaned” after reboot
  • Etc.
Deeper issue
• Operations like creation and deletion span multiple on-disk data structures
  • Each requires more than one disk write
• Think of disk writes as a series of updates
  • A system crash can happen between any two updates
  • A crash between the wrong two updates leaves on-disk data structures inconsistent!
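The failure mode above can be sketched as a small simulation. This is a toy model, not ext2 code: the `Disk` class, the three-write `create_file` sequence, and the `crash_after` parameter are hypothetical names chosen for illustration.

```python
class Crash(Exception):
    """Raised to simulate power loss between two disk writes."""


class Disk:
    """Toy model of three on-disk structures touched by file creation."""

    def __init__(self, crash_after=None):
        self.inode_bitmap = set()   # inode numbers marked allocated
        self.inodes = {}            # inode number -> inode contents
        self.directory = {}         # file name -> inode number
        self.writes = 0
        self.crash_after = crash_after  # crash once this many writes are done

    def write(self, update):
        # Each individual write is atomic, like a single-sector write.
        if self.crash_after is not None and self.writes >= self.crash_after:
            raise Crash()
        update()
        self.writes += 1


def create_file(disk, name, ino):
    """File creation spans three separate writes (hypothetical ordering)."""
    disk.write(lambda: disk.inode_bitmap.add(ino))
    disk.write(lambda: disk.inodes.__setitem__(ino, {"links": 1}))
    disk.write(lambda: disk.directory.__setitem__(name, ino))


def is_consistent(disk):
    """Every allocated inode must be initialized and named by the directory."""
    named = set(disk.directory.values())
    return disk.inode_bitmap == set(disk.inodes) == named


def try_create(crash_after):
    """Run one creation, crashing after `crash_after` writes; report consistency."""
    disk = Disk(crash_after)
    try:
        create_file(disk, "a.txt", 7)
    except Crash:
        pass
    return is_consistent(disk)
```

In this model, a crash before the first write or after the last one leaves the structures consistent; a crash after one or two of the three writes does not.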
Atomicity
• The property that something either happens or it doesn’t
  • No partial results
• This is what you want for disk updates
  • Either the inode bitmap, inode, and directory are all updated when a file is created, or none of them are
• But disks only give you atomic writes for a single sector
• For each inode, check the reference count and make sure all referenced blocks are marked as allocated
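A minimal sketch of this kind of consistency pass, assuming a toy in-memory image. The function name `check_inodes` and the dictionary layout are assumptions made for illustration, not the real fsck data structures.

```python
def check_inodes(inodes, directory_entries, block_bitmap):
    """Return a list of inconsistencies found in a toy file system image.

    inodes: dict of inode number -> {"links": int, "blocks": [block numbers]}
    directory_entries: list of (name, inode number) pairs from all directories
    block_bitmap: set of block numbers marked allocated
    """
    problems = []

    # Count how many directory entries actually point at each inode.
    observed = {}
    for _name, ino in directory_entries:
        observed[ino] = observed.get(ino, 0) + 1

    for ino, inode in sorted(inodes.items()):
        # The stored reference count must match the observed one.
        if inode["links"] != observed.get(ino, 0):
            problems.append(("bad_link_count", ino))
        # Every block the inode points to must be marked allocated.
        for blk in inode["blocks"]:
            if blk not in block_bitmap:
                problems.append(("block_not_allocated", ino, blk))
    return problems
```

Note that the pass touches every inode and every block pointer, which is why a full check gets expensive on a large file system.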
• Example:
  • I modify an inode and write it to the journal
  • The journal commits, and the inode is ready to be written back
  • I want to make another change to the inode
  • I cannot safely change the in-memory inode until I have either written it to the file system or created another journal entry
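The rule in this example can be sketched as follows. This is a toy model under stated assumptions: the `Inode` class and its method names are hypothetical and do not correspond to the ext3 or JBD interfaces.

```python
class Inode:
    """Toy inode with an in-memory copy, an on-disk copy, and a journal."""

    def __init__(self):
        self.memory = {"size": 0}       # in-memory contents
        self.disk = {"size": 0}         # contents in the file system proper
        self.journal = []               # committed journal entries (snapshots)
        self.pending_writeback = False  # committed but not yet written back

    def journaled_update(self, **changes):
        """Change the inode by first logging the new contents to the journal."""
        new = {**self.memory, **changes}
        self.journal.append(new)
        self.memory = new
        self.pending_writeback = True

    def modify_in_place(self, **changes):
        """Change the in-memory copy directly, without a journal entry."""
        if self.pending_writeback:
            # The in-memory copy must still match what the journal committed.
            raise RuntimeError("inode is pinned by a committed transaction")
        self.memory.update(changes)

    def write_back(self):
        """Copy the committed contents to the file system."""
        self.disk = dict(self.memory)
        self.pending_writeback = False
```

After `journaled_update`, a direct change is refused until either `write_back` runs or the change is made through another `journaled_update`.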
Another example
• Suppose journal transaction 1 modifies a block, then transaction 2 modifies the same block.
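• Only metadata goes in the journal, but data blocks must be written to disk before the metadata that points to them is committed to the journal
  • Faster than full data journaling, but the write-ordering constraint makes it slower than metadata-only journaling
• Metadata only – fastest, most dangerous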
• More efficient for large files (both in space and disk scheduling)
• Disk blocks can be used for either data or inodes, but that assignment can’t change after the file system is created
  • Why?
Why?
• Simplicity
  • Fixed-location inodes mean you can take the inode number and the total number of inodes, and find the right block using math
  • Dynamic inodes introduce another data structure to track this mapping, which can get corrupted on disk (losing all contained files!)
  • Bookkeeping gets a lot more complicated when blocks change type
• Downside: potentially wasted space if you guess the wrong number of files
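The "find the right block using math" step can be sketched as below. The function name `locate_inode` and its parameters are assumptions for illustration; the sketch assumes an ext2-style layout in which inode numbers start at 1 and each block group has its own inode table.

```python
def locate_inode(ino, inodes_per_group, inode_size, block_size, inode_table_start):
    """Map an inode number to (block number, byte offset within that block).

    inode_table_start: list giving the first inode-table block of each group
    (an assumed input; on a real disk this comes from the group descriptors).
    """
    index = ino - 1                      # inode numbers start at 1
    group = index // inodes_per_group    # which block group holds the inode
    slot = index % inodes_per_group      # position within that group's table
    byte_offset = slot * inode_size
    block = inode_table_start[group] + byte_offset // block_size
    return block, byte_offset % block_size
```

For example, with 128-byte inodes and 4096-byte blocks, each table block holds 32 inodes, so inode 34 lands in the second block of its group's table at byte offset 128. No lookup structure is needed, only arithmetic on fixed parameters.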
Directory scalability
• An ext3 directory can have at most about 32,000 subdirectories
• Painfully slow to search – remember, this is just a simple array on disk (linear scan to look up a file)
• Replace this in ext4 with an HTree
  • Hash-based custom BTree
  • Relatively flat tree to reduce the risk of corruption
  • Big performance wins on large directories – up to 100x
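The difference between a linear scan and a hash-indexed lookup can be sketched as below. This is a simplified single-level index, not the ext4 HTree on-disk format: `HashedDirectory`, `leaf_size`, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import bisect
import zlib


def name_hash(name):
    # Stand-in hash for the sketch; the real file system uses its own hash.
    return zlib.crc32(name.encode())


def linear_lookup(entries, name):
    """Unindexed directory: scan every (name, inode) entry in order."""
    for entry_name, ino in entries:
        if entry_name == name:
            return ino
    return None


class HashedDirectory:
    """One index level mapping hash ranges to small leaf blocks of entries."""

    def __init__(self, entries, leaf_size=4):
        ordered = sorted(entries, key=lambda e: name_hash(e[0]))
        self.leaves = [ordered[i:i + leaf_size]
                       for i in range(0, len(ordered), leaf_size)]
        # Lowest hash in each leaf, used to pick a leaf by binary search.
        self.lower_bounds = [name_hash(leaf[0][0]) for leaf in self.leaves]

    def lookup(self, name):
        if not self.leaves:
            return None
        h = name_hash(name)
        i = max(bisect.bisect_right(self.lower_bounds, h) - 1, 0)
        while True:
            # Only one small leaf is scanned in the common case.
            for entry_name, ino in self.leaves[i]:
                if entry_name == name:
                    return ino
            # Equal hashes may straddle a leaf boundary; look left only then.
            if i == 0 or name_hash(self.leaves[i - 1][-1][0]) != h:
                return None
            i -= 1
```

The indexed lookup does a binary search over leaf boundaries and then scans one small leaf, instead of scanning the whole directory.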
Other goodies
• Improvements to help with locality
  • Preallocation and hints keep blocks that are often accessed together close on the disk
• Checksumming of disk blocks is a good idea
  • Especially for journal blocks
• Fsck on a large fs gets expensive
  • Put used inodes at the front if possible, and skip large swaths of unused inodes if possible
Summary
• ext2 – Great implementation of a “classic” file system
• ext3 – Adds a journal for faster crash recovery and less risk of data loss