2023 334 The3
CENG 334
Operating Systems
Spring 2022 - Homework 3
The Transactional Metadata Journaling EXT2 file system
1 Introduction
This final homework aims to familiarize you with the implementation of file systems. You will be required to
implement various utilities on a simplified version of the ext2 file system, which is a traditional file system.
Additionally, there is a bonus opportunity to extend the ext2 file system by incorporating a feature called
“transactional metadata journaling”. This feature logs metadata updates as transactions so that the file
system can be brought back to a consistent state after a crash or power failure.
Detailed explanations of the journaling modification will be provided, along with some utilities to assist
with image creation and checking. Your code will directly operate on the ext2 file system image, performing
operations that update the file system. It’s a challenging task ahead, so get ready to dive in!
Figure 1: Overall ext2 Layout
struct ext2_super_block {
uint32_t inode_count; /* Total number of inodes in the fs */
uint32_t block_count; /* Total number of blocks in the fs */
uint32_t reserved_block_count; /* Number of blocks reserved for root */
uint32_t free_block_count; /* Number of free blocks */
uint32_t free_inode_count; /* Number of free inodes */
uint32_t first_data_block; /* The first data block number */
uint32_t log_block_size; /* 2^(10 + this value) gives the block size */
uint32_t log_fragment_size; /* Same for fragments (we won't use fragments) */
uint32_t blocks_per_group; /* Number of blocks for each block group */
uint32_t fragments_per_group; /* Same for fragments */
uint32_t inodes_per_group; /* Number of inodes for each block group */
uint32_t mount_time; /* Some less relevant fields */
uint32_t write_time;
uint16_t mount_count;
uint16_t max_mount_count;
uint16_t magic;
uint16_t state;
uint16_t errors;
uint16_t minor_rev_level;
uint32_t last_check_time;
uint32_t check_interval;
uint32_t creator_os;
uint32_t rev_level; /* Revision level: 0 or 1 */
uint16_t default_uid;
uint16_t default_gid;
uint32_t first_inode; /* First non-reserved inode in the file system */
uint16_t inode_size; /* Size of each inode */
/* More, less relevant fields follow */
};
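As a minimal sketch of how the superblock fields above are typically used, the helpers below derive the block size from log_block_size and the number of block groups from block_count, first_data_block and blocks_per_group. The function names are hypothetical (they are not part of ext2fs.h), and the group-count formula assumes the usual ext2 convention of counting blocks from first_data_block and rounding up.

```c
#include <assert.h>
#include <stdint.h>

/* Block size in bytes: 2^(10 + log_block_size), as the field comment states. */
static uint32_t ext2_block_size(uint32_t log_block_size)
{
    assert(log_block_size <= 21);          /* 1024 << 21 still fits in 32 bits */
    return UINT32_C(1024) << log_block_size;
}

/* Number of block groups: blocks from first_data_block onward, rounded up. */
static uint32_t ext2_group_count(uint32_t block_count,
                                 uint32_t first_data_block,
                                 uint32_t blocks_per_group)
{
    assert(blocks_per_group > 0 && block_count > first_data_block);
    return (block_count - first_data_block + blocks_per_group - 1)
           / blocks_per_group;
}
```

For instance, a log_block_size of 1 corresponds to the 2048-byte blocks used in the mke2fs example later in this text.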
• Block group descriptor table blocks: These blocks follow the superblock.
• Block group descriptors: They store information about each block group.
• Information stored in block group descriptors: This includes the positions of the bitmaps,
inode table, and other details such as the number of free blocks in the block group.
– Bitmaps: These mark allocated block and inode positions in the block group.
– Inode table: It provides space for all the inodes in the block group and is static. The
maximum number of inodes is fixed, and unallocated inodes are represented as sequences of
zeroes in the table.
– Data blocks: The remaining blocks are allocated for storing data to be used by files.
– Backups in some block groups: Apart from the first block group, some block groups may
also contain backups for the superblock and block group descriptor table.
• Structure definition: The block group descriptor structure is given below.
struct ext2_block_group_descriptor {
uint32_t block_bitmap; /* Block containing the block bitmap */
uint32_t inode_bitmap; /* Block containing the inode bitmap */
uint32_t inode_table; /* First block of the inode table */
uint16_t free_block_count; /* Number of free blocks in the group */
uint16_t free_inode_count; /* Number of free inodes in the group */
uint16_t used_dirs_count; /* Number of directories in the group */
uint16_t pad; /* Padding to 4 byte alignment */
uint32_t reserved[3]; /* Unused, reserved 12 bytes */
};
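The descriptor's inode_table field, together with inodes_per_group and inode_size from the superblock, is enough to find an inode on disk. The sketch below assumes the standard ext2 convention that inode numbers start at 1; the struct and function names are hypothetical helpers, not part of the provided header.

```c
#include <assert.h>
#include <stdint.h>

struct inode_location {
    uint32_t group;   /* block group that holds the inode */
    uint32_t index;   /* slot inside that group's inode table */
};

/* Inode numbers are 1-based, so subtract one before dividing. */
static struct inode_location locate_inode(uint32_t ino, uint32_t inodes_per_group)
{
    assert(ino >= 1 && inodes_per_group > 0);
    struct inode_location loc = { (ino - 1) / inodes_per_group,
                                  (ino - 1) % inodes_per_group };
    return loc;
}

/* Byte offset of the inode in the image, given the group's inode_table block. */
static uint64_t inode_byte_offset(uint32_t inode_table_block, uint32_t block_size,
                                  uint32_t index, uint16_t inode_size)
{
    return (uint64_t)inode_table_block * block_size
           + (uint64_t)index * inode_size;
}
```

The group number returned by locate_inode selects which block group descriptor to read inode_table from.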
• Indirect blocks:
– Indirect blocks are data blocks filled with pointers.
– They enable indirect indexing, allowing for multiple data blocks to be accessed through these
pointers.
As an example, an ext2 file system with a block size of 4096 will be able to store 4096/4 = 1024 block
numbers in an indirect block. Thus, a single indirect block would be able to index 1024 · 4096 = 4MB of
data. A double indirect block would index 1024 indirect blocks, indexing a total of 1024·1024·4096 = 4GB
of data. A triple indirect block would be able to index 4TB of data! The data can have holes, meaning
that some intermediate blocks may not be allocated. You can see the inode structure below.¹
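The arithmetic above can be checked with a short sketch. It assumes 4-byte block numbers, as in the example; the function name and the level parameter (1 for single, 2 for double, 3 for triple indirection) are illustrative choices.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of file data reachable through one indirect block of the given level. */
static uint64_t indirect_capacity(uint32_t block_size, int level)
{
    assert(level >= 1 && level <= 3);
    uint64_t pointers = block_size / 4;    /* 4-byte block numbers per block */
    uint64_t blocks = 1;
    for (int i = 0; i < level; i++)
        blocks *= pointers;                /* each level multiplies the fan-out */
    return blocks * block_size;
}
```

With a block size of 4096, the three levels give 4MB, 4GB and 4TB respectively, matching the figures in the text.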
Note that inodes do not contain file names. These are instead contained in directory entries referring to
inodes. Thus, it’s possible to have entries with different names in different directories referring to the same
file (inode). Each such entry is called a hard link.
Directories are also files and thus have inodes and data blocks. However, data blocks of directories have
a special structure. They are filled with directory entry structures forming a singly linked list:
struct ext2_dir_entry {
uint32_t inode; /* inode number of the file */
uint16_t length; /* Record length, aligned on 4 bytes */
uint8_t name_length; /* 255 is the maximum allowed name length */
uint8_t file_type; /* Not used in rev. 0, file type identifier in rev. 1 */
char name[]; /* File name. This is called a 'flexible array member' in C. */
};
These are records of variable size, essentially 8 bytes plus space for the name, aligned on a 4-byte boundary².
The next entry can be reached by adding length bytes to the current entry’s offset. The last entry has
its length padded to the size of the block so that the next entry corresponds to the offset of the end of the
block. Once the end of the block is reached, it’s time to move on to the next data block of the directory
file. As one last thing, a 0 inode value indicates an entry which should be skipped (can be padding or
pre-allocation).
¹ Although the maximum file size for a 4KB block ext2 file system would be around 2TB due to the 4-byte limit
of the i_blocks field (named block_count_512 in ext2fs.h) in the inode.
² Why? Remember your computer organization course!
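The directory-entry walk described above can be sketched as follows. This is only an illustration under a few assumptions: the block has already been read into memory, the host has the same byte order as the image, and the helper names dir_entry_size and find_in_dir_block are hypothetical. The header is copied with memcpy so that the code does not rely on the buffer's alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct ext2_dir_entry {
    uint32_t inode;
    uint16_t length;
    uint8_t  name_length;
    uint8_t  file_type;
    char     name[];
};

/* Record size for a name: 8 header bytes plus the name, rounded up to 4. */
static uint16_t dir_entry_size(uint8_t name_length)
{
    return (uint16_t)((8u + name_length + 3u) & ~3u);
}

/* Returns the inode number of `name` in one directory block, or 0 if absent. */
static uint32_t find_in_dir_block(const uint8_t *block, uint32_t block_size,
                                  const char *name)
{
    size_t want = strlen(name);
    uint32_t off = 0;
    while (off + 8 <= block_size) {
        struct ext2_dir_entry hdr;
        memcpy(&hdr, block + off, 8);              /* fixed 8-byte header */
        if (hdr.length < 8 || off + hdr.length > block_size)
            break;                                 /* corrupt record: stop */
        if (hdr.inode != 0 && hdr.name_length == want &&
            8u + hdr.name_length <= hdr.length &&
            memcmp(block + off + 8, name, want) == 0)
            return hdr.inode;                      /* entries with inode 0 are skipped */
        off += hdr.length;                         /* jump to the next record */
    }
    return 0;
}
```

A full lookup would call find_in_dir_block once per data block of the directory until a non-zero inode number is returned.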
Some important information before finishing up:
The following links contain details about ext2 and should be your go-to references when writing your code:
Then, format the file into a file system image via mke2fs. The following example creates an ext2 file
system with 64 inodes and a block size of 2048.
You can dump file system details with the dumpe2fs command:
$ dumpe2fs example.img
Now that you have a file system image, you can mount the image onto a directory. The FUSE-based
fuseext2 command allows you to do this in userspace without any superuser privileges (use this on the
ineks!³). The below example mounts our example file system in read-write mode:
$ mkdir fs-root
$ fuseext2 -o rw+ example.img fs-root
³ Note that fusermount does not currently work on the ineks without setting group read-execute permissions
for the mount directory (chmod g+rx) and group execute permissions (g+x) for other directories on the path to the
mount directory (including your home directory). This is not ideal and is being looked into so that it can work
with user permissions only. An announcement will follow when a fix happens.
Now, fs-root will be the root of your file system image. You can cd to it, create files and directories, do
whatever you want. To unmount once you are done, use fusermount:
$ fusermount -u fs-root
Make sure to unmount the file system before running programs on the image. On systems where you have
superuser privileges, you can use the more standard mount and umount:
You can check the consistency of your file system after modifications with e2fsck. The below example
forces the check in verbose mode and refuses fixes (-n flag), since e2fsck attempts to fix problems by
default. This will help you find bugs in your implementation later on.
2.3 Implementation
You will write a utility program je2fs that will be able to perform some operations on the ext2 images.
For the base part of the homework, you can assume that all utility functions you write should behave
the same as the corresponding ext2 utility functions; you can cheat from them (:.
• /abs/path/to/dirname: An absolute path to the file starting from the root directory “/”
(specification 4). However, the separator is not guaranteed to be a single slash (specification 5).
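Because the separator may be a run of slashes, splitting on every single “/” would produce empty components. One possible way to tokenize such paths is sketched below; next_component is a hypothetical helper, and it truncates names that do not fit into the caller's buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the next path component into `out` and returns a pointer just past
 * it, or NULL when no component is left. Any run of '/' counts as one
 * separator. */
static const char *next_component(const char *path, char *out, size_t out_size)
{
    assert(out_size > 0);
    while (*path == '/')
        path++;                            /* skip repeated separators */
    if (*path == '\0')
        return NULL;                       /* only slashes (or nothing) left */
    size_t len = strcspn(path, "/");       /* length up to the next '/' */
    size_t copy = len < out_size - 1 ? len : out_size - 1;
    memcpy(out, path, copy);
    out[copy] = '\0';
    return path + len;
}
```

Calling it in a loop on /////abs/path///to///filename yields abs, path, to and filename; the last component is the name, and the ones before it form the parent path.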
Let’s consider a directory creation scenario: There is a directory under the given path that has no entry
named “directory”. Let’s say the smallest unallocated inode index is 15, and the smallest unallocated
block index is 585. Additionally, the directory under the path has a single block at index 540 to store its entries
and there is enough space to store the entry named “directory”.
Therefore in this scenario, we would expect to see such a case:
The corresponding algorithm to be executed is as follows:
procedure ext2_mkdir(path=“/path/to/dirname”)
parent_path, name ← tokenize the path
if (/abs/path/to/ does not exist) or (/abs/path/to/dirname exists) then
return
end if
parent_inode ← get inode of the parent_path
• /abs/path/to/dirname: An absolute path to the file starting from the root directory “/”
(specification 4). However, the separator is not guaranteed to be a single slash (specification 5).
Ext2 Remove Directory Algorithm
procedure ext2_rmdir(path=“/path/to/dirname”)
parent_path, name ← tokenize the path
if (/abs/path/to/dirname does not exist) or (/abs/path/to/dirname has more than two
dir entries) then
return
end if
parent_inode ← get inode of the parent_path
inode ← get inode of the path
// unlink parent_inode –> inode with name
remove a dir entry with name under parent_inode
unlink the created dir entry with name and inode
// unlink dir entries under inode
decrement inode link count by two
// make inode’s blocks and inode itself free again
for each block in inode’s direct and indirect blocks do
// i didn’t say unlink :)
deallocate the block
end for
deallocate the inode
// update time and write them back
traverse back from parent_inode to ROOT_INODE
update their modification and access times
end procedure
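The “deallocate the block” and “deallocate the inode” steps come down to clearing a bit in the corresponding bitmap, and finding the smallest unallocated index is a scan for the first zero bit. The helpers below are a hypothetical sketch that assumes the usual ext2 bit order (bit i lives in byte i / 8, least significant bit first); updating the free counters in the superblock and the group descriptor is left out.

```c
#include <assert.h>
#include <stdint.h>

/* In standard ext2, block b of a group maps to bit
 * (b - first_data_block) % blocks_per_group, and inode n maps to bit
 * (n - 1) % inodes_per_group. */
static int bitmap_test(const uint8_t *bitmap, uint32_t bit)
{
    return (bitmap[bit / 8] >> (bit % 8)) & 1;
}

static void bitmap_set(uint8_t *bitmap, uint32_t bit)      /* allocate */
{
    bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
}

static void bitmap_clear(uint8_t *bitmap, uint32_t bit)    /* deallocate */
{
    bitmap[bit / 8] &= (uint8_t)~(1u << (bit % 8));
}

/* Index of the first zero bit, or nbits when every bit is set. */
static uint32_t bitmap_first_free(const uint8_t *bitmap, uint32_t nbits)
{
    for (uint32_t bit = 0; bit < nbits; bit++)
        if (!bitmap_test(bitmap, bit))
            return bit;
    return nbits;
}
```

After changing a bitmap in memory, the block that holds it still has to be written back to the image.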
Let’s consider a scenario: There is a directory under the given path that has only two entries, named “.” and
“..”. Let’s say the directory’s inode index is 15, and the two entries are located in the block at index 585.
Additionally, the directory name is an entry in the block index 540 which is linked to the parent inode.
Therefore in this scenario, we would expect to see such a case:
• /abs/path/to/dirname: An absolute path to the file starting from the root directory “/”
(specification 4). However, the separator is not guaranteed to be a single slash (specification 5).
procedure ext2_read_file(path=“/path/to/fname”)
parent_path, fname ← tokenize the path
if /abs/path/to/fname does not exist then
return
end if
parent_inode ← get inode of the parent_path
inode ← get inode of the path
// remove the link (parent_inode –> inode) with fname
remove a dir entry with fname under parent_inode
unlink the created dir entry with fname and inode
if inode’s link_count is zero then
for each block in inode’s direct and indirect blocks do
deallocate the block
end for
deallocate the inode
end if
// update time and write them back
traverse back from parent_inode to ROOT_INODE
update their access times
end procedure
Let’s consider a scenario: There is a file, under the given path, whose content is among the blocks 555,
556, and 557. Let’s say the file’s inode index is 15, and the file name is in an entry at the block index
540 which is linked to the parent inode. Additionally, assume that there is no hard link to the file.
Therefore in this scenario, we would expect to see such a case:
• INDEX is the offset in the file at which the insertion starts.
– you can assume that INDEX is less than or equal to the STRING length.
• /abs/path/to/dirname: An absolute path to the file starting from the root directory “/”
(specification 4). However, the separator is not guaranteed to be a single slash (specification 5).
• STRING is enclosed by single quotes so that you don’t have to deal with joining multiple argv’s.
For our examples, let’s say there is a file, under the given path,
• which contains LENGTH (= 3 ∗ BLOCK_SIZE − 12) number of characters as its content.
• whose file entry is at the block 540 which is contained by the parent directory.
Additionally assume that we have an input string whose length is STRLEN = BLOCK_SIZE + 20,
where BLOCK_SIZE > 20. Let’s consider an insert-mode example:
This is simply an insert command. We will start at INDEX, insert string, and append the remaining
content after INDEX to the file. Since 4 ∗ BLOCK_SIZE < (3 ∗ BLOCK_SIZE − 12) + (BLOCK_SIZE + 20)
< 5 ∗ BLOCK_SIZE, this command will allocate two new blocks which are 558 and 559.
Let’s consider a replace-mode example:
This is simply a replacement command. We will start at INDEX, delete BLOCK_SIZE + 20 characters,
and insert string which is also BLOCK_SIZE + 20 characters. Therefore this command will only modify
times in terms of inode metadata. Let’s consider a more general example:
In this case, we will start at INDEX, and delete BLOCK_SIZE + 15 characters. We will then insert
the string, and append the remaining content after INDEX to the file. Since 2 ∗ BLOCK_SIZE <
(3 ∗ BLOCK_SIZE − 12) + (BLOCK_SIZE + 20) − (BLOCK_SIZE + 15) < 3 ∗ BLOCK_SIZE, this
command will deallocate two blocks which are 556 and 557.
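The block counts in the insert-mode and replace-mode examples above follow from a ceiling division of the file length by the block size. The sketch below shows that calculation; blocks_needed and block_delta are hypothetical helper names, and indirect blocks are ignored for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Data blocks needed to hold `length` bytes (ceiling division). */
static uint32_t blocks_needed(uint64_t length, uint32_t block_size)
{
    assert(block_size > 0);
    return (uint32_t)((length + block_size - 1) / block_size);
}

/* Change in data block count after deleting `deleted` bytes and inserting
 * `inserted` bytes: positive means allocate, negative means deallocate. */
static int64_t block_delta(uint64_t old_length, uint64_t deleted,
                           uint64_t inserted, uint32_t block_size)
{
    assert(deleted <= old_length);
    uint64_t new_length = old_length - deleted + inserted;
    return (int64_t)blocks_needed(new_length, block_size)
           - (int64_t)blocks_needed(old_length, block_size);
}
```

Taking a block size of 1024 as an example, the file holds 3060 characters in 3 blocks; inserting 1044 characters gives 4104 characters and 5 blocks, so two blocks are allocated, while replacing 1044 characters with 1044 characters leaves the count unchanged.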
The corresponding algorithm you will write is given below:
Ext2 Edit File Algorithm
3 ext2journal (bonus 50 points)
3.1 Details
Now that the basics of ext2 are out of the way, it’s time to extend the second extended file system! What
we want is the ability to journal the metadata of inodes and blocks, so that we can have atomic-like
operations on disk to prevent inconsistencies in case of any power failure.
Thankfully, Stephen Tweedie has developed a journaling extension for the ext2 code, providing several
benefits. One major advantage is the prevention of metadata corruption and the elimination of the need
to wait for e2fsck to complete after a system crash. Importantly, this extension can be implemented
without modifying the existing on-disk ext2 layout. The journal itself functions as a regular file, storing
modified metadata blocks (and optionally data blocks) before they are written to the file system. This
approach allows for the addition of a journal to an ext2 file system without requiring any data conversion.
When file system changes occur, such as file renaming, they are recorded as transactions within the journal.
These transactions can be either complete or incomplete at the time of a system crash. If a transaction is
complete at the time of a crash (or in the normal case, where the system does not crash), the blocks within
that transaction are guaranteed to represent a valid state of the file system and are subsequently copied
into the file system. However, if a transaction is incomplete at the time of a crash, the blocks within that
transaction lack consistency guarantees and are therefore discarded. As a result, any file system changes
represented by those discarded blocks are lost.
The following links contain lots of details about ext2 journaling and should be your go-to references when
writing your code:
A journal metadata block contains the entire contents of a single block of file system metadata as updated
by a transaction. This means that however small a change we make to a file system metadata block, we
have to write an entire journal block out to log the change. However, this turns out to be relatively cheap
for two reasons:
struct ext2_journal_descriptor_block {
/* number of metadata blocks to be written */
uint32_t metadata_blocks_count;
/* Disk block numbers of the metadata blocks */
uint32_t metadata_blocks_array[EXT2_MAX_METADATA_BLOCKS];
};
Descriptor blocks are journal blocks which describe other journal metadata blocks. Whenever we want to
write out metadata blocks to the journal, we need to record which disk blocks the metadata normally lives
at, so that the recovery mechanism can copy the metadata back into the main file system. A descriptor
block is written out before each set of metadata blocks in the journal and contains the number of metadata
blocks to be written plus their disk block numbers.
Both descriptor and metadata blocks are written sequentially to the journal, starting again from the start
of the journal whenever we run off the end. At all times, we maintain the current head of the log (the block
number of the last block written) and the tail (the oldest block in the log which has not been unpinned,
as described below). Whenever we run out of log space –the head of the log has looped back around and
caught up with the tail– we stall new log writes until the tail of the log has been cleaned up to free more
space.
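The wrap-around behaviour of the log can be illustrated with a small sketch. It assumes that positions are counted as indices 0 .. journal_blocks − 1 inside the journal file and that the log currently holds at least one block; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Position that follows `pos`, wrapping to the start of the journal. */
static uint32_t journal_next(uint32_t pos, uint32_t journal_blocks)
{
    assert(journal_blocks > 0 && pos < journal_blocks);
    return (pos + 1) % journal_blocks;
}

/* The log is out of space when the slot after the head (the last block
 * written) is the tail (the oldest block not yet unpinned). */
static int journal_is_full(uint32_t head, uint32_t tail, uint32_t journal_blocks)
{
    return journal_next(head, journal_blocks) == tail;
}
```

When journal_is_full reports true, a writer would have to wait until the tail has advanced before appending another block.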
Finally, the journal file contains a number of header blocks at fixed locations. These record the current
head and tail of the journal, plus a sequence number. At recovery time, the header blocks are scanned to
find the block with the highest sequence number, and when we scan the log during recovery we just run
through all journal blocks from the tail to the head, as recorded in that header block.
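Selecting the header block to recover from can be sketched as a scan for the highest sequence number. The struct layout below is an assumption made for illustration only (the actual header layout is whatever ext2fs.h defines), and sequence number wrap-around is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory copy of one journal header block. */
struct journal_header {
    uint32_t head;       /* last block written */
    uint32_t tail;       /* oldest block not yet unpinned */
    uint32_t sequence;   /* larger means more recent */
};

/* Returns the header with the highest sequence number, or NULL if n == 0. */
static const struct journal_header *
latest_header(const struct journal_header *headers, size_t n)
{
    const struct journal_header *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (best == NULL || headers[i].sequence > best->sequence)
            best = &headers[i];
    return best;
}
```

Recovery would then walk the journal blocks from the tail to the head recorded in the returned header.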
$ ./mkfs.ext2j example.img
The mkfs.ext2j executable⁴ compiled for x86_64 Linux is provided to help you deal with ext2 images.
Rather than creating images from scratch, it converts existing ext2 images into the journaled ext2 by
allocating an inode at a specific index for the journal file.
Note that this will only create a journal file whose inode is indexed at 12. For checking general consistency,
you should continue to use e2fsck. It should not report any problems after conversion (it’s a bug in
mkfs.ext2j if it does).
Important: Programming correctly is a difficult art to master, and as such mkfs.ext2j may contain some
bugs. If you suspect a bug, save the command and image reproducing the bug and send them over to
[email protected].
⁴ Because it’s going to be your sidekick! Also, anthropomorphizing your executables will help you feel less lonely
when coding long stuff :) What? No, no, I’m totally fine!
The most recent version of mkfs.ext2j will always be available at
https://user.ceng.metu.edu.tr/~adhd/334hw3/mkfs.ext2j along with a changelog at
https://user.ceng.metu.edu.tr/~adhd/334hw3/changelog.txt. Please check the changelog for known bugs and
possible bug fixes if you have issues and update your executable.
See the appendix for more details about the tool and processes.
3.3 Implementation
In this part, you will keep extending the utility program je2fs that performs operations on ext2 images.
Unlike in the base part of the homework, you may assume that your code will be recompiled with an
extra option -DJOURNAL during compilation.
In terms of journaling, your functions are expected to behave the same as in the previous part with respect
to output and end-user behaviour. The main difference is that a small portion of the disk (called the
journal blocks or the journal file) is used as a buffer or a cache. That is, your journaled functions are
expected to consult the journal first when reading (part 1) and to write to the journal (part 2) until a
commit is received. Once it is received, the changes in the journal file will be reflected (part 3) to the rest
of the disk, which is called the permanent data blocks. For simplification, you can assume that there will
be exactly one journal file whose inode is set in the header file.
In the first command, you are expected to fetch the inode from either the journal or the permanent blocks
and print.
In the second command, you are expected to calculate the permanent data block’s actual block index in the
disk given as BLOCK_INDEX. Then, print all entries in the block at that index. Corresponding print_inode
and print_dir_entry functions will be given as utility functions.
The corresponding algorithm you will write is given below:
Ext2 Read Algorithm
// if there is no metadata journal block for the block at block index yet
if the descriptor block at the header block.head is full then
copy num bytes from the ext2 image starting at offset to the mem
end if
end procedure
Ext2 Write Algorithm
procedure ext2_write(off_t offset, const void *mem, size_t num, bool journal)
if journal is false then
copy num bytes from the mem to the ext2 image starting at offset
return
end if
// if there is no metadata journal block for the block at block index yet
if the descriptor block at the header block.head is full then
descriptor ← allocate and link a new block
push descriptor block’s index as the current head of the header journal block
else
descriptor ← seek the current head of the header journal block
end if
descriptor.array[descriptor.count++] ← allocate and link a new block
offset ← (the new block’s index * block size) + (offset modulo block size)
copy num bytes from the mem to the ext2 image starting at offset
end procedure
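The offset redirection at the end of ext2_write can be written out as plain arithmetic. The sketch below only covers that one step: journal_block stands for the index of the newly allocated journal block, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the file system block that contains byte `offset`. */
static uint32_t block_index_of(uint64_t offset, uint32_t block_size)
{
    assert(block_size > 0);
    return (uint32_t)(offset / block_size);
}

/* Same position inside the block, but in the journal block instead of the
 * original one: (journal_block * block_size) + (offset modulo block_size). */
static uint64_t journal_redirect(uint64_t offset, uint32_t journal_block,
                                 uint32_t block_size)
{
    assert(block_size > 0);
    return (uint64_t)journal_block * block_size + offset % block_size;
}
```

For example, with 2048-byte blocks a write to byte 100 of block 540 that is redirected to journal block 600 lands at byte 100 of block 600.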
$ ./je2fs FS_IMAGE commit
During the evaluation, images from the previous read and write parts may also be subjected to noise
(random bit flips), and we will check whether the commit command recovers the image to its original
state just before the noise.
The corresponding algorithm you will write is given below:
4 Specifications
1. Your code must be written in C or C++.
2. Your implementation will be compiled and evaluated on the ineks, so you should make sure that
your code works on them.
5. However, the separator is not guaranteed to be a single slash. Therefore, the following paths are valid:
• /////abs/path///to///filename
• //////////////dirname//////
6. You are supposed to read the file system data structures into memory, modify them and write
them back to the image. Mounting the file system in your code is forbidden; do not run any other
executables from your code using things like system() or exec*(). This includes mkfs.ext2j which
is provided to journal an ext2 file system.
7. Including POSIX ext2/ext3/ext4 libraries (as well as kernel code and its variations etc.) is not
allowed.
8. The ext2fs.h header file is provided for your convenience. You are free to include it, modify it,
remove it or do whatever you want with it.
9. We have a zero-tolerance policy against cheating. All the code you submit must be your own
work. Sharing code with your friends, using code from the internet or previous years’ homework
are all considered plagiarism and strictly forbidden.
10. Follow the course page on ODTUClass and COW for possible updates and clarifications.
11. Please ask your questions on COW instead of sending an email for questions that do not contain
code or solutions, so that all may benefit.
6 Submission
Submission will be done via ODTUClass. Create a gzipped tarball file named hw3.tar.gz that contains
all your source code files together with your Makefile. Your archive file should not contain any subfolders.
Your code should compile and your executable should run with the following command sequence:
If there is a mistake in any of these steps mentioned above, you will lose 10 points.
Late Submission: a penalty of 5 · (late days)² will be applied to your final grade.
• metadata block: refers to a single block of filesystem metadata that is updated by a transaction.
This can include information such as file permissions, ownership, timestamps, and other attributes
associated with files and directories on the filesystem. When changes are made to a metadata block,
the entire contents of the block must be written out to the journal in order to log the change. This
ensures that any updates are recorded in case of an unexpected shutdown or power failure, allowing
for quick and reliable recovery without data loss.
• descriptor block: refers to a journal block that describes other journal metadata blocks. Whenever
metadata blocks are written out to the journal, we need to record which disk blocks the metadata
normally lives at, so that the recovery mechanism can copy the metadata back into the main filesys-
tem. A descriptor block is written out before each set of metadata blocks in the journal and contains
the number of metadata blocks to be written plus their disk block numbers. The descriptor block is
used by ext2fs during recovery to locate and restore the associated metadata blocks back into the
main filesystem.
• the journal file: contains a number of header blocks at fixed locations. These record the current
head and tail of the journal, plus a sequence number. The header block is used by ext2fs to keep
track of the state of the journal and ensure that it remains consistent after a crash or power failure.
The header block is also used to locate and read other blocks in the journal, such as descriptor
blocks and metadata blocks.