0% found this document useful (0 votes)
2 views

5.FileSystems

The document discusses various file systems, focusing on FAT and EXT2, detailing their structures, operations, and methods for maintaining consistency. It also covers advanced concepts like journaling, copy-on-write, and the Btrfs file system, which utilizes a B-tree structure for efficient data management. Key topics include inode management, bitmap allocation, and strategies for handling file system inconsistencies after crashes.

Uploaded by

soumyapadhy3
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

5.FileSystems

The document discusses various file systems, focusing on FAT and EXT2, detailing their structures, operations, and methods for maintaining consistency. It also covers advanced concepts like journaling, copy-on-write, and the Btrfs file system, which utilizes a B-tree structure for efficient data management. Key topics include inode management, bitmap allocation, and strategies for handling file system inconsistencies after crashes.

Uploaded by

soumyapadhy3
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 33

Unit-III: File Systems

Subhrendu Chattopadhyay, IDRBT

Disclaimer: A few of the Images are taken from the Internet and Textbooks 1
FAT File Allocation Table

2
FAT
• Simple file system popularized by MS-DOS
• First introduced in 1977
• Most devices today use the FAT32 spec from 1996
• FAT12, FAT16, VFAT, FAT32, etc.
• Still quite popular today
• Default format for USB sticks and memory cards
• Used for EFI boot partitions
• Name comes from the index table used to track directories and files
FAT Layout
Superblock FAT
● Stores basic info about the file system ● File allocation table (FAT)
● FAT version, location of boot files ● Marks which blocks are free or in-use
● Total number of blocks ● Linked-list structure to manage large
● Index of the root directory in the FAT files

Data Block
● Store file and directory data
● Each block is a fixed size (4KB – 64KB)
● Files may span multiple blocks

https://elixir.bootlin.com/linux/v6.11-rc4/source/mm
FAT
• Entry size depends on FAT-12, FAT-16, FAT-32
• Directories are special files
• Contains a list of entries inside the directory file
• Possible values for FAT entries:
• 0 – entry is empty
• 1 – reserved by the OS
• 1 < N < 0xFFFF – next block in a chain
• 0xFFFF – end of a chain
EXT2 Extended File System

6
Ext2
• Inode: Every file or directory is represented by an
inode. Special data structure to index
files/directories
• Block Groups: The file system is divided into block
groups, which contain a fixed number of inodes,
data blocks, and a superblock. This layout improves
performance and allows for better disk usage.
Directories are files. Contains
the list of entries in the directory ● Each inode can directly point to 12 blocks
● Can also indirectly point to blocks at 1, 2, and 3
Ext2 levels of depth

Name Inode
bin

. 0
/ home
bin 1

home 2

initrd.img 3

Inode Data
Inode Data Block
Bitmap Bitmap
Root inode = 0
/*
Ext2: Inode Entries * Structure of an inode on the disk
*/
struct ext2_inode { … }

Field Size Description

Mode 2B File type and permissions (indicates file type, owner, group, and other permissions).

UID 2B User ID of the file's owner.

Size 4B File size in bytes. For directories, it is the total size of the directory contents

Atime 4B Last access time (in seconds since Unix epoch).

Ctime 4B Creation time or last inode status change time.

Mtime 4B Last modification time.

Dtime 4B Deletion time. Stores when the file was deleted (if applicable).

https://elixir.bootlin.com/linux/v6.11-rc4/source/fs/ext2/ext2.h#L290
Ext2: Inode Entries
Field Size Description

GID 2B Group ID of the file.

Links count 2B Number of hard links to the file.

Flags 4B File attributes (e.g., secure deletion, immutability, compression).

Blocks 4B Total number of 512-byte blocks used by the file (including indirect blocks).

Pointers to data Blocks 60B 12 direct, 1 single indirect, 1 double indirect, 1 triple indirect block pointers.

Generation number 4B File versioning number, used in network file systems (e.g., NFS).

File ACL 4B Access control list information for additional file permissions.

Directory ACL 4B ACL information specific to directories.

Fragment address 4B Not commonly used; holds file fragment information if the file is fragmented.

OS-specific 12B OS-specific data fields for additional functionality.


Ext2: Bitmaps: Block Bitmap
• The block bitmap tracks the allocation status of data blocks within a block group.
• The size of the block bitmap depends on the number of blocks in the block group. Each bit in the
bitmap represents a single block in the block group.
• 1: The block is allocated (in use).
• 0: The block is free (available for allocation).
• Usage:
• When a file is created or expanded, the filesystem scans the block bitmap to find free blocks
(bits set to 0). Once blocks are allocated, the corresponding bits in the block bitmap are set to
1.
• During file deletion, the blocks are deallocated, and the corresponding bits in the bitmap are
cleared (set to 0), marking those blocks as free for future use.

Block Bitmap: | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | |
| B1| B2| B3| B4| B5| B6| B7| B8| B9|B10|
Ext2: Bitmaps: inode Bitmap
• The inode bitmap tracks the allocation status of inodes within a block group.
• The size of the inode bitmap depends on the number of inodes in the block group. Each bit in the
inode bitmap represents a single inode.
• 1: The inode is allocated (in use, assigned to a file or directory).
• 0: The inode is free (available for use).
• Usage:
• When a new file or directory is created, the inode bitmap is scanned to find a free inode (bit
set to 0). The inode is then allocated, and the corresponding bit in the bitmap is set to 1.
• When a file or directory is deleted, the inode is deallocated, and its corresponding bit is
cleared to 0, marking the inode as free.

Inode Bitmap: | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | |
| I1| I2| I3| I4| I5| I6| I7| I8| I9|I10|
Ext2: Write Operation Example
• Let’s go through an example where you write a 10 KB file to an EXT2 filesystem, assuming the block size is 4 KB.
• Open the File:
• The kernel opens the file, looks up its inode, and initializes the necessary metadata.
• Write 10 KB of Data:
• The kernel splits the 10 KB of data into blocks. Since the block size is 4 KB, the file will require 3 blocks: two full 4 KB
blocks and one partial 2 KB block.
• Block Allocation:
• The kernel checks the block bitmap and allocates 3 blocks for the file. The block pointers in the inode are updated to
point to these blocks.
• Data Write:
• The kernel writes the first 4 KB of data to the first block, the next 4 KB to the second block, and the remaining 2 KB to
the third block.
• Update Inode:
• The inode is updated to reflect the new file size (10 KB) and the block pointers are updated. The modification time
(i_mtime) is also updated.
• Delayed Write:
• The data may initially be written to the page cache and only later written to the disk during a sync operation or a
periodic flush.
Ext2: Inconsistency
• Many operations results in multiple, independent writes to the file system
• Example: append a block to an existing file
• Update the free data bitmap
• Update the inode
• Write the user data
• What happens if the computer crashes in the middle of this process?
Ext2: Inconsistency
• The disk guarantees that sector writes are atomic
• No way to make multi-sector writes atomic
• How to ensure consistency after a crash?
• Don’t bother to ensure consistency
• Accept that the file system may be inconsistent after a crash
• Run a program that fixes the file system during bootup
• File system checker (fsck)
• Use a transaction log to make multi-writes atomic
• Log stores a history of all writes to the disk
• After a crash the log can be “replayed” to finish updates
• Journaling file system
Approach-1: Check file-system
• Key idea: fix inconsistent file systems during bootup
• Unix utility called fsck (chkdsk on Windows)
• Scans the entire file system multiple times, identifying and correcting inconsistencies
• Why during bootup?
• No other file system activity can be going on
• After fsck runs, bootup/mounting can continue
fsck
• Superblock: validate the superblock, replace it with a backup if it is corrupted
• Free blocks and inodes: rebuild the bitmaps by scanning all inodes
• Reachability: make sure all inodes are reachable from the root of the file system
• inodes: delete all corrupted inodes, and rebuild their link counts by walking the directory tree
• directories: verify the integrity of all directories
• … and many other minor consistency checks

• Advantages of fsck
• Doesn’t require the file system to do any work to ensure consistency
• Makes the file system implementation simpler
• Disadvantages of fsck
• Very complicated to implement the fsck program
• Many possible inconsistencies that must be identified
• Many difficult corner cases to consider and handle
• fsck is super slow
• Scans the entire file system multiple times
• Imagine how long it would take to fsck a 40 TB RAID array
Approach-2: Journaling
• Key idea: make writes transactional by using a write-ahead log
• Commonly referred to as a journal
• Ext3 and NTFS use journaling
• After the log is written, the writes execute normally
• In essence, the log records transactions
Approach-2: Journaling
• What happens after a crash…
• If the writes to the log are interrupted?
• The transaction is incomplete
• The user’s data is lost, but the file system is consistent
• If the writes to the log succeed, but the normal writes are interrupted?
• The file system may be inconsistent, but…
• The log has exactly the right information to fix the problem
Approach-2: Journaling
• Advantages of journaling
• Robust, fast file system recovery
• No need to scan the entire journal or file system
• Relatively straightforward to implement
• Disadvantages of journaling
• Write traffic to the disk is doubled
• Especially the file data, which is probably large
B+ Tree Use of B+ tree in File Systems

21
inode and it’s problem
• Recall: inodes use indirection to acquire additional blocks of pointers
• Problem: inodes are not efficient for large files
• Example: for a 100MB file, you need 25600 block pointers (assuming 4KB blocks)
• This is unavoidable if the file is 100% fragmented
• However, what if large groups of blocks are contiguous?
• Extents are better suited for contiguous files
• B Trees are widely used for file system representation (WAFL, ZFS, BTRFS)
• Logarithmic time key search, insert and remove
• Well represents sparse files
• The File System as a large tree made up of fixed size pages
• Shadowing:
• Technique to support atomic updates over persistent data structures
• Implement snapshots, crash recovery, write-batching, RAID
Shadowing
• To update an on-disk page (the page is in the disk, not available in the memory)
• Read the entire page in the memory
• Modify the page
• Write in an alternate location
• When a page is shadowed, its location on the disk changes
• Update (and shadow) the immediate ancestor of the
• page with the new address
• Propagates up to the file system root
Copy-on-Write (COW)
• The core idea behind copy-on-write is to delay the copying of data until it is absolutely necessary,
saving resources like memory and disk space. Instead of creating a full copy of data when it’s shared
or duplicated, the system shares the same data with multiple processes or users and only creates a
copy when someone tries to modify the shared data.
• Some filesystems, like Btrfs, ZFS, and snapshots in EXT4 (with certain configurations), use
copy-on-write to efficiently manage file modifications and snapshots.
• Snapshots: When creating a snapshot, the filesystem doesn’t immediately duplicate the entire file or
directory. Instead, it marks the snapshot as sharing the original data. If the original data is modified,
the filesystem only copies the blocks being modified, leaving the unmodified blocks shared between
the original file and the snapshot.
btrfs

25
History
• A file system based on COW principle -- initially designed at Oracle Corporation for use in Linux
• The development began in 2007, since November 2013 it has been marked as stable
• Principal Btrfs author: Chris Mason
• “to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but
also means being able to administer and to manage it with a clean interface.”
History & Basics
• A file system based on COW principle -- initially designed at Oracle Corporation for use in Linux
• The development began in 2007, since November 2013 it has been marked as stable
• Principal Btrfs author: Chris Mason
• “to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but
also means being able to administer and to manage it with a clean interface.”

• Page, block: A 4KB contiguous region on disk and in memory. This is the standard Linux page size.
• Extent: A contiguous on-disk area. It is page aligned, and its length is a multiple of pages.
• Copy-on-write (COW): Creating a new version of an extent or a page at a different location
• The data is loaded from disk to memory, modified, and then written elsewhere
• Do not update the original location in place, risking a power failure and partial update
BTRFS B-tree
• A generic structure with three types of data structures: keys, items, and block headers
• Block header: A fixed size data structure, holds fields like checksums, flags, filesystem ids, generation
number, etc.
• Key: describes an object address,
• Item: is a key with additional offset and size fields.
• Internal tree nodes hold only [key, block-pointer] pairs
• Leaf nodes hold arrays of [item, data] pairs.
• The offset field describes data held in an extent.
BTRFS B-tree
• A leaf stores
• an array of items in the beginning
• a reverse sorted data array at the ends
• These arrays grow towards each other.
BTRFS: Small files
• Small files that occupy less than one leaf block are packed into the b-tree inside the extent item
• Key offset if the byte offset of the data in the file
• The size field indicates how much data is stored
BTRFS: Large files
• Large files are stored in extents -- contiguous on-disk areas that hold user-data without additional headers
or formatting
• Extent maintains a [disk block, disk num blocks] pair to record the area of disk corresponding to the file.
BTRFS: Large files
• A directory holds an array of dir_item elements
• Maps a file name (string) to a 64bit object_id
• Directory lookup index - [dir_item_key, filename 64bit hash]
BTRFS
• BTRFS does its own device management

BTRFS

Traditional File Systems with DMs

You might also like