The Second Extended File System
Copyright 2001-2011 Dave Poirier

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license can be acquired electronically from http://www.fsf.org/licenses/fdl.html or by writing to 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Table of Contents
About this book
1. Historical Background
2. Definitions
    2.1. Blocks
    2.2. Block Groups
    2.3. Directories
    2.4. Inodes
    2.5. Superblocks
    2.6. Symbolic Links
3. Disk Organization
    3.1. Superblock
        3.1.1. s_inodes_count
        3.1.2. s_blocks_count
        3.1.3. s_r_blocks_count
        3.1.4. s_free_blocks_count
        3.1.5. s_free_inodes_count
        3.1.6. s_first_data_block
        3.1.7. s_log_block_size
        3.1.8. s_log_frag_size
        3.1.9. s_blocks_per_group
        3.1.10. s_frags_per_group
        3.1.11. s_inodes_per_group
        3.1.12. s_mtime
        3.1.13. s_wtime
        3.1.14. s_mnt_count
www.nongnu.org/ext2-doc/ext2.html 1/47
12/06/12
        3.1.15. s_max_mnt_count
        3.1.16. s_magic
        3.1.17. s_state
        3.1.18. s_errors
        3.1.19. s_minor_rev_level
        3.1.20. s_lastcheck
        3.1.21. s_checkinterval
        3.1.22. s_creator_os
        3.1.23. s_rev_level
        3.1.24. s_def_resuid
        3.1.25. s_def_resgid
        3.1.26. s_first_ino
        3.1.27. s_inode_size
        3.1.28. s_block_group_nr
        3.1.29. s_feature_compat
        3.1.30. s_feature_incompat
        3.1.31. s_feature_ro_compat
        3.1.32. s_uuid
        3.1.33. s_volume_name
        3.1.34. s_last_mounted
        3.1.35. s_algo_bitmap
        3.1.36. s_prealloc_blocks
        3.1.37. s_prealloc_dir_blocks
        3.1.38. s_journal_uuid
        3.1.39. s_journal_inum
        3.1.40. s_journal_dev
        3.1.41. s_last_orphan
        3.1.42. s_hash_seed
        3.1.43. s_def_hash_version
        3.1.44. s_default_mount_options
        3.1.45. s_first_meta_bg
    3.2. Block Group Descriptor Table
        3.2.1. bg_block_bitmap
        3.2.2. bg_inode_bitmap
        3.2.3. bg_inode_table
        3.2.4. bg_free_blocks_count
        3.2.5. bg_free_inodes_count
        3.2.6. bg_used_dirs_count
        3.2.7. bg_pad
        3.2.8. bg_reserved
    3.3. Block Bitmap
    3.4. Inode Bitmap
    3.5. Inode Table
        3.5.1. i_mode
        3.5.2. i_uid
        3.5.3. i_size
        3.5.4. i_atime
        3.5.5. i_ctime
        3.5.6. i_mtime
        3.5.7. i_dtime
        3.5.8. i_gid
        3.5.9. i_links_count
        3.5.10. i_blocks
        3.5.11. i_flags
        3.5.12. i_osd1
        3.5.13. i_block
        3.5.14. i_generation
        3.5.15. i_file_acl
        3.5.16. i_dir_acl
        3.5.17. i_faddr
        3.5.18. Inode i_osd2 Structure
    3.6. Locating an Inode
4. Directory Structure
    4.1. Linked List Directory
        4.1.1. inode
        4.1.2. rec_len
        4.1.3. name_len
        4.1.4. file_type
        4.1.5. name
        4.1.6. Sample Directory
    4.2. Indexed Directory Format
        4.2.1. Indexed Directory Root
        4.2.2. Indexed Directory Entry
        4.2.3. Lookup Algorithm
        4.2.4. Insert Algorithm
        4.2.5. Splitting
        4.2.6. Key Collisions
        4.2.7. Hash Function
        4.2.8. Performance
5. File Attributes
    5.1. Standard Attributes
        5.1.1. SUID, SGID and -rwxrwxrwx
        5.1.2. File Size
        5.1.3. Owner and Group
    5.2. Extended Attributes
        5.2.1. Extended Attribute Block Layout
        5.2.2. Extended Attribute Block Header
        5.2.3. Attribute Entry Header
    5.3. Behaviour Control Flags
        5.3.1. EXT2_SECRM_FL - Secure Deletion
        5.3.2. EXT2_UNRM_FL - Record for Undelete
        5.3.3. EXT2_COMPR_FL - Compressed File
        5.3.4. EXT2_SYNC_FL - Synchronous Updates
        5.3.5. EXT2_IMMUTABLE_FL - Immutable File
        5.3.6. EXT2_APPEND_FL - Append Only
        5.3.7. EXT2_NODUMP_FL - Do Not Dump/Delete
        5.3.8. EXT2_NOATIME_FL - Do Not Update .i_atime
        5.3.9. EXT2_DIRTY_FL - Dirty
        5.3.10. EXT2_COMPRBLK_FL - Compressed Blocks
        5.3.11. EXT2_NOCOMPR_FL - Access Raw Compressed Data
        5.3.12. EXT2_ECOMPR_FL - Compression Error
        5.3.13. EXT2_BTREE_FL - B-Tree Format Directory
        5.3.14. EXT2_INDEX_FL - Hash Indexed Directory
        5.3.15. EXT2_IMAGIC_FL
        5.3.16. EXT2_JOURNAL_DATA_FL - Journal File Data
        5.3.17. EXT2_RESERVED_FL - Reserved
A. Credits

List of Tables
2-1. Impact of Block Sizes
3-1. Sample Floppy Disk Layout, 1KiB blocks
3-2. Sample 20mb Partition Layout
3-3. Superblock Structure
3-4. Defined s_state Values
3-5. Defined s_errors Values
3-6. Defined s_creator_os Values
3-7. Defined s_rev_level Values
3-8. Defined s_feature_compat Values
3-9. Defined s_feature_incompat Values
3-10. Defined s_feature_ro_compat Values
3-11. Defined s_algo_bitmap Values
3-12. Block Group Descriptor Structure
3-13. Inode Structure
3-14. Defined Reserved Inodes
3-15. Defined i_mode Values
3-16. Defined i_flags Values
3-17. Inode i_osd2 Structure: Hurd
3-18. Inode i_osd2 Structure: Linux
3-19. Inode i_osd2 Structure: Masix
3-20. Sample Inode Computations
4-1. Linked Directory Entry Structure
4-2. Defined Inode File Type Values
4-3. Sample Linked Directory Data Layout, 4KiB blocks
4-4. Indexed Directory Root Structure
4-5. Defined Indexed Directory Hash Versions
4-6. Indexed Directory Entry Structure (dx_entry)
4-7. Indexed Directory Entry Count and Limit Structure
5-1. Extended Attribute Block Layout
5-2. ext2_xattr_header structure
5-3. Behaviour Control Flags

List of Figures
4-1. Performance of Indexed Directories
5-1. ext2_xattr_header structure
Chapter 2. Definitions
The Second Extended Filesystem uses blocks as the basic unit of storage, inodes as the means of keeping track of files and system objects, block groups to logically split the disk into more manageable sections, directories to provide a hierarchical organization of files, block and inode bitmaps to keep track of allocated blocks and inodes, and superblocks to define the parameters of the file system and its overall state. Ext2 shares many properties with traditional Unix filesystems. It has space in the specification for Access Control Lists (ACLs), fragments, undeletion and compression. There is also a versioning mechanism to allow new features (such as journalling) to be added in a maximally compatible manner, as was done in Ext3 and Ext4.
2.1. Blocks
A partition, disk, file or block device formatted with a Second Extended Filesystem is divided into small groups of sectors called "blocks". These blocks are then grouped into larger units called block groups. The size of the blocks is usually determined when formatting the disk and will have an impact on performance, maximum file size, and maximum file system size. Block sizes commonly implemented include 1KiB, 2KiB, 4KiB and 8KiB although provisions in the superblock allow for block sizes as big as 1024 * (2^31)-1 (see s_log_block_size). Depending on the implementation, some architectures may impose limits on which block sizes are supported. For example, a Linux 2.6 implementation on DEC Alpha uses blocks of 8KiB but the same implementation on an Intel 386 processor will support a maximum block size of 4KiB.

Table 2-1. Impact of Block Sizes

Upper Limits              1KiB                          2KiB                          4KiB                          8KiB
file system blocks        2,147,483,647                 2,147,483,647                 2,147,483,647                 2,147,483,647
blocks per block group    8,192                         16,384                        32,768                        65,536
inodes per block group    8,192                         16,384                        32,768                        65,536
bytes per block group     8,388,608 (8MiB)              33,554,432 (32MiB)            134,217,728 (128MiB)          536,870,912 (512MiB)
file system size (real)   4,398,046,509,056 (4TiB)      8,796,093,018,112 (8TiB)      17,592,186,036,224 (16TiB)    35,184,372,080,640 (32TiB)
file system size (Linux)  2,199,023,254,528 (2TiB) [a]  8,796,093,018,112 (8TiB)      17,592,186,036,224 (16TiB)    35,184,372,080,640 (32TiB)
blocks per file           16,843,020                    134,217,728                   1,074,791,436                 8,594,130,956
file size (real)          17,247,252,480 (16GiB)        274,877,906,944 (256GiB)      2,199,023,255,552 (2TiB)      2,199,023,255,552 (2TiB)
file size (Linux 2.6.28)  17,247,252,480 (16GiB)        274,877,906,944 (256GiB)      2,199,023,255,552 (2TiB)      2,199,023,255,552 (2TiB)
Notes:
a. This limit comes from the maximum size of a block device in Linux 2.4; it is unclear whether a Linux 2.6 kernel using a 1KiB block size could properly format and mount an Ext2 partition larger than 2TiB.

Note: the 2TiB file size is limited by the i_blocks value in the inode, which indicates the number of 512-byte sectors rather than the actual number of ext2 blocks allocated.
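Several of the limits in Table 2-1 follow from the block size alone. The sketch below is only an illustration in Python; the function name block_size_limits is made up for this example. It assumes 32-bit block ids, bitmaps that occupy exactly one block, and the 12 direct pointers plus single, double and triple indirect blocks described in section 2.4. It does not model the i_blocks cap that limits files to 2TiB.

```python
def block_size_limits(block_size):
    """Derive a few Table 2-1 style limits from a block size in bytes."""
    pointers_per_block = block_size // 4      # assumes 32-bit block ids
    blocks_per_group = block_size * 8         # one bit per block, bitmap is one block
    bytes_per_group = blocks_per_group * block_size
    # 12 direct blocks, then single, double and triple indirect blocks
    blocks_per_file = (12
                       + pointers_per_block
                       + pointers_per_block ** 2
                       + pointers_per_block ** 3)
    return {
        "blocks_per_group": blocks_per_group,
        "bytes_per_group": bytes_per_group,
        "blocks_per_file": blocks_per_file,
    }


for size in (1024, 4096, 8192):
    print(size, block_size_limits(size))
```

For 1KiB, 4KiB and 8KiB blocks this reproduces the "blocks per block group", "bytes per block group" and "blocks per file" rows of the table.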
2.3. Directories
This definition comes from the Linux Kernel Documentation with some minor alterations.

A directory is a filesystem object and has an inode just like a file. It is a specially formatted file containing records which associate each name with an inode number. Later revisions of the filesystem also encode the type of the object (file, directory, symlink, device, fifo, socket) to avoid the need to check the inode itself for this information.

The inode allocation code should try to assign inodes which are in the same block group as the directory in which they are first created.

The original Ext2 revision used a singly-linked list to store the filenames in the directory; newer revisions are able to use hashes and binary trees.

Also note that as a directory grows, additional blocks are assigned to store the additional file records. When filenames are removed, some implementations do not free these additional blocks.
2.4. Inodes
This definition comes from the Linux Kernel Documentation with some minor alterations.

The inode (index node) is a fundamental concept in the ext2 filesystem. Each object in the filesystem is represented by an inode. The inode structure contains pointers to the filesystem blocks which contain the data held in the object and all of the metadata about an object except its name. The metadata about an object includes the permissions, owner, group, flags, size, number of blocks used, access time, change time, modification time, deletion time, number of links, fragments, version (for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).

There are some reserved fields which are currently unused in the inode structure and several which are overloaded. One field is reserved for the directory ACL if the inode is a directory and alternately for the top 32 bits of the file size if the inode is a regular file (allowing file sizes larger than 2GB). The translator field is unused under Linux, but is used by the HURD to reference the inode of a program which will be used to interpret this object. Most of the remaining reserved fields have been used up for both Linux and the HURD for larger owner and group fields. The HURD also has a larger mode field so it uses another of the remaining fields to store the extra bits.

There are pointers to the first 12 blocks which contain the file's data in the inode. There is a pointer to an indirect block (which contains pointers to the next set of blocks), a pointer to a doubly-indirect block (which contains pointers to indirect blocks) and a pointer to a trebly-indirect block (which contains pointers to doubly-indirect blocks).

Some filesystem specific behaviour flags are also stored and allow for specific filesystem behaviour on a per-file basis. There are flags for secure deletion, undeletable, compression, synchronous updates, immutability, append-only, dumpable, no-atime, indexed directories, and data-journaling.

Many of the filesystem specific behaviour flags, like journaling, have been implemented in newer filesystems like Ext3 and Ext4, while some others are still under development.

All the inodes are stored in inode tables, with one inode table per block group.
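To make the pointer scheme concrete, the following sketch maps the index of a data block within a file to the pointer level that reaches it. It is an illustration rather than code from any implementation: the name locate_block is hypothetical, and it assumes block pointers are 32 bits wide, so a block holds block_size / 4 of them.

```python
def locate_block(index, block_size):
    """Return (level, path) for the index-th data block of a file.

    level 0 = direct pointer, 1 = indirect, 2 = doubly-indirect,
    3 = trebly-indirect. path lists the slot to follow at each step.
    """
    per_block = block_size // 4   # assumes 32-bit block pointers
    if index < 0:
        raise ValueError("block index must not be negative")
    if index < 12:
        return 0, [index]
    index -= 12
    if index < per_block:
        return 1, [index]
    index -= per_block
    if index < per_block ** 2:
        return 2, [index // per_block, index % per_block]
    index -= per_block ** 2
    if index < per_block ** 3:
        return 3, [index // per_block ** 2,
                   (index // per_block) % per_block,
                   index % per_block]
    raise ValueError("block index beyond what the inode can address")


print(locate_block(11, 1024))    # last direct block
print(locate_block(12, 1024))    # first block behind the indirect block
print(locate_block(268, 1024))   # first block behind the doubly-indirect block
```

With 1KiB blocks each indirect block holds 256 pointers, so the indirect block covers file blocks 12 to 267 and the doubly-indirect block starts at file block 268.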
2.5. Superblocks
This definition comes from the Linux Kernel Documentation with some minor alterations. The superblock contains all the information about the configuration of the filesystem. The information in the superblock contains fields such as the total number of inodes and blocks in the filesystem and how many are free, how many inodes and blocks are in each block group, when the filesystem was mounted (and if it was cleanly unmounted), when it was modified, what version of the filesystem it is and which OS created it. The primary copy of the superblock is stored at an offset of 1024 bytes from the start of the device, and it is essential to mounting the filesystem. Since it is so important, backup copies of the superblock are stored in block groups throughout the filesystem.
The first version of ext2 (revision 0) stores a copy at the start of every block group, along with backups of the group descriptor block(s). Because this can consume a considerable amount of space for large filesystems, later revisions can optionally reduce the number of backup copies by only putting backups in specific groups (this is the sparse superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7. Revision 1 and higher of the filesystem also store extra fields, such as a volume name, a unique identification number, the inode size, and space for optional filesystem features to store configuration info. All fields in the superblock (as in all other ext2 structures) are stored on the disc in little endian format, so a filesystem is portable between machines without having to know what machine it was created on.
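The rule for choosing the backup groups can be sketched as a short predicate. This is an illustrative reading of the sentence above (groups 0, 1 and powers of 3, 5 and 7); the function names are hypothetical.

```python
def is_power_of(n, base):
    """True if n equals base raised to some non-negative integer power."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def has_superblock_copy(group):
    """True if the block group holds a superblock copy when the sparse
    superblock feature is enabled."""
    if group in (0, 1):
        return True
    return any(is_power_of(group, base) for base in (3, 5, 7))


print([g for g in range(60) if has_superblock_copy(g)])
```

Without the sparse superblock feature (revision 0), every block group carries a copy and no such test is needed.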
In revision 0 of Ext2, each block group consists of a copy of the superblock, a copy of the block group descriptor table, a block bitmap, an inode bitmap, an inode table, and data blocks.

With the introduction of revision 1 and the sparse superblock feature in Ext2, only specific block groups contain copies of the superblock and block group descriptor table. All block groups still contain the block bitmap, inode bitmap, inode table, and data blocks. The shadow copies of the superblock can be located in block groups 0, 1 and powers of 3, 5 and 7.

The block bitmap and inode bitmap are limited to 1 block each per block group, so the total blocks per block group is therefore limited. (More information in Table 2-1, Impact of Block Sizes.)

Each data block may also be further divided into "fragments". As of Linux 2.6.28, support for fragments was still not implemented in the kernel; it is therefore suggested to ensure the fragment size is equal to the block size so as to maintain compatibility.

Table 3-1. Sample Floppy Disk Layout, 1KiB blocks

Block Offset    Length        Description
byte 0          512 bytes     boot record (if present)
byte 512        512 bytes     additional boot record data (if present)
-- block group 0, blocks 1 to 1439 --
byte 1024       1024 bytes    superblock
block 2         1 block       block group descriptor table
block 3         1 block       block bitmap
block 4         1 block       inode bitmap
block 5         23 blocks     inode table
block 28        1412 blocks   data blocks
For the curious, block 0 always points to the first sector of the disk or partition and will always contain the boot record if one is present.

The superblock is always located at byte offset 1024 from the start of the disk or partition. In a file system formatted with a 1KiB block size, this is block 1, but it will always be block 0 (at 1024 bytes within block 0) in file systems with larger block sizes.

And here's the organisation of a 20MB ext2 file system, using 1KiB blocks:

Table 3-2. Sample 20mb Partition Layout

Block Offset    Length        Description
byte 0          512 bytes     boot record (if present)
byte 512        512 bytes     additional boot record data (if present)
-- block group 0, blocks 1 to 8192 --
byte 1024       1024 bytes    superblock
block 2         1 block       block group descriptor table
block 3         1 block       block bitmap
block 4         1 block       inode bitmap
block 5         214 blocks    inode table
block 219       7974 blocks   data blocks
-- block group 1, blocks 8193 to 16384 --
block 8193      1 block       superblock backup
block 8194      1 block       block group descriptor table backup
block 8195      1 block       block bitmap
block 8196      1 block       inode bitmap
block 8197      214 blocks    inode table
block 8411      7974 blocks   data blocks
-- block group 2, blocks 16385 to 24576 --
block 16385     1 block       block bitmap
block 16386     1 block       inode bitmap
block 16387     214 blocks    inode table
block 16601     3879 blocks   data blocks
The layout on disk is very predictable as long as you know a few basic pieces of information: block size, blocks per group, inodes per group. This information is all located in, or can be computed from, the superblock structure.

Nevertheless, unless the image was crafted with controlled parameters, the position of the various structures on disk (except the superblock) should never be assumed. Always load the superblock first.

Notice how block 0 is not part of block group 0 in 1KiB block size file systems. The reason for this is that block group 0 always starts with the block containing the superblock. Hence, on 1KiB block systems, block group 0 starts at block 1, but on larger block sizes it starts at block 0. For more information, see the s_first_data_block superblock entry.
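The relationship between the fixed byte offset of the superblock and the block that contains it can be sketched in a couple of lines. The helper name superblock_location is hypothetical; the only facts used are the 1024-byte offset and the block size.

```python
SUPERBLOCK_OFFSET = 1024  # bytes from the start of the disk or partition


def superblock_location(block_size):
    """Return (block number, byte offset inside that block) of the
    primary superblock for a given block size in bytes."""
    return divmod(SUPERBLOCK_OFFSET, block_size)


for size in (1024, 2048, 4096):
    print(size, superblock_location(size))
```

The block number returned here is also where block group 0 starts: block 1 for 1KiB blocks, block 0 for anything larger.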
3.1. Superblock
The superblock is always located at byte offset 1024 from the beginning of the file, block device or partition formatted with Ext2 and later variants (Ext3, Ext4).

Its structure is mostly constant from Ext2 to Ext3 and Ext4 with only some minor changes.

Table 3-3. Superblock Structure

Offset (bytes)  Size (bytes)  Description
0               4             s_inodes_count
4               4             s_blocks_count
8               4             s_r_blocks_count
12              4             s_free_blocks_count
16              4             s_free_inodes_count
20              4             s_first_data_block
24              4             s_log_block_size
28              4             s_log_frag_size
32              4             s_blocks_per_group
36              4             s_frags_per_group
40              4             s_inodes_per_group
44              4             s_mtime
48              4             s_wtime
52              2             s_mnt_count
54              2             s_max_mnt_count
56              2             s_magic
58              2             s_state
60              2             s_errors
62              2             s_minor_rev_level
64              4             s_lastcheck
68              4             s_checkinterval
72              4             s_creator_os
76              4             s_rev_level
80              2             s_def_resuid
82              2             s_def_resgid
-- EXT2_DYNAMIC_REV Specific --
84              4             s_first_ino
88              2             s_inode_size
90              2             s_block_group_nr
92              4             s_feature_compat
96              4             s_feature_incompat
100             4             s_feature_ro_compat
104             16            s_uuid
120             16            s_volume_name
136             64            s_last_mounted
200             4             s_algo_bitmap
-- Performance Hints --
204             1             s_prealloc_blocks
205             1             s_prealloc_dir_blocks
206             2             (alignment)
-- Journaling Support --
208             16            s_journal_uuid
224             4             s_journal_inum
228             4             s_journal_dev
232             4             s_last_orphan
-- Directory Indexing Support --
236             4 x 4         s_hash_seed
252             1             s_def_hash_version
253             3             padding - reserved for future expansion
-- Other options --
256             4             s_default_mount_options
260             4             s_first_meta_bg
264             760           Unused - reserved for future revisions
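As an illustration of how the revision 0 part of this structure might be decoded, here is a minimal Python sketch using the standard struct module. The names BASE_FORMAT, BASE_FIELDS, parse_superblock and read_superblock are hypothetical. The sketch assumes the fields at offsets 0 to 83 are packed in the order of the field descriptions that follow, little endian and without gaps, and it treats s_log_frag_size as signed because its description allows negative values.

```python
import struct

# Little-endian layout of the revision 0 fields (byte offsets 0 to 83).
BASE_FORMAT = "<7Ii5I6H4I2H"
BASE_FIELDS = (
    "s_inodes_count", "s_blocks_count", "s_r_blocks_count",
    "s_free_blocks_count", "s_free_inodes_count", "s_first_data_block",
    "s_log_block_size", "s_log_frag_size", "s_blocks_per_group",
    "s_frags_per_group", "s_inodes_per_group", "s_mtime", "s_wtime",
    "s_mnt_count", "s_max_mnt_count", "s_magic", "s_state", "s_errors",
    "s_minor_rev_level", "s_lastcheck", "s_checkinterval", "s_creator_os",
    "s_rev_level", "s_def_resuid", "s_def_resgid",
)
EXT2_SUPER_MAGIC = 0xEF53


def parse_superblock(raw):
    """Decode the revision 0 fields from the raw superblock bytes."""
    if len(raw) < struct.calcsize(BASE_FORMAT):
        raise ValueError("superblock buffer too short")
    fields = dict(zip(BASE_FIELDS, struct.unpack_from(BASE_FORMAT, raw, 0)))
    if fields["s_magic"] != EXT2_SUPER_MAGIC:
        raise ValueError("s_magic is not 0xEF53, not an ext2 superblock")
    return fields


def read_superblock(path):
    """Read the primary superblock from an image file or block device."""
    with open(path, "rb") as image:
        image.seek(1024)              # fixed byte offset of the superblock
        return parse_superblock(image.read(1024))
```

A caller would typically pass the path of a file system image to read_superblock and then compute the block size from s_log_block_size before touching any other structure.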
3.1.1. s_inodes_count
32bit value indicating the total number of inodes, both used and free, in the file system. This value must be lower than or equal to (s_inodes_per_group * number of block groups). It must be equal to the sum of the inodes defined in each block group.
3.1.2. s_blocks_count
32bit value indicating the total number of blocks in the system including all used, free and reserved. This value must be lower than or equal to (s_blocks_per_group * number of block groups). It must be equal to the sum of the blocks defined in each block group.
3.1.3. s_r_blocks_count
32bit value indicating the total number of blocks reserved for the usage of the super user. This is most useful if for some reason a user, maliciously or not, fills the file system to capacity; the super user will then still have this specified amount of free blocks at their disposal to edit and save configuration files.
3.1.4. s_free_blocks_count
32bit value indicating the total number of free blocks, including the number of reserved blocks (see s_r_blocks_count). This is a sum of all free blocks of all the block groups.
3.1.5. s_free_inodes_count
32bit value indicating the total number of free inodes. This is a sum of all free inodes of all the block groups.
3.1.6. s_first_data_block
32bit value identifying the first data block, in other words the id of the block containing the superblock structure.

Note that this value is always 0 for file systems with a block size larger than 1KB, and always 1 for file systems with a block size of 1KB. The superblock always starts at the 1024th byte of the disk, which normally happens to be the first byte of the 3rd sector.
3.1.7. s_log_block_size
The block size is computed using this 32bit value as the number of bits to shift left the value 1024. This value may not be negative.

    block size = 1024 << s_log_block_size;

Common block sizes include 1KiB, 2KiB, 4KiB and 8KiB. For information about the impact of selecting a block size, see Impact of Block Sizes.

In Linux, at least up to 2.6.28, the block size must be at least as large as the sector size of the block device, and cannot be larger than the memory page size supported by the architecture.
3.1.8. s_log_frag_size
The fragment size is computed using this 32bit value as the number of bits to shift left the value 1024. Note that a negative value would shift the bits right rather than left.

    if( positive )
      fragment size = 1024 << s_log_frag_size;
    else
      fragment size = 1024 >> -s_log_frag_size;
As of Linux 2.6.28 no support exists for an Ext2 partition with fragment size smaller than the block size, as this feature seems to not be available.
3.1.9. s_blocks_per_group
32bit value indicating the total number of blocks per group. This value in combination with s_first_data_block can be used to determine the block group boundaries.
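A sketch of that boundary computation follows. The function names are hypothetical, and the sketch assumes blocks are numbered from 0 to s_blocks_count - 1, so the last group may be shorter than s_blocks_per_group.

```python
def block_group_count(s_blocks_count, s_first_data_block, s_blocks_per_group):
    """Number of block groups, rounding up for a partial last group."""
    covered = s_blocks_count - s_first_data_block
    return (covered + s_blocks_per_group - 1) // s_blocks_per_group


def block_group_range(group, s_blocks_count, s_first_data_block,
                      s_blocks_per_group):
    """Return (first, last) absolute block ids of a block group."""
    first = s_first_data_block + group * s_blocks_per_group
    if group < 0 or first >= s_blocks_count:
        raise ValueError("no such block group")
    last = min(first + s_blocks_per_group, s_blocks_count) - 1
    return first, last


# Values matching the 20MB sample partition with 1KiB blocks.
for g in range(block_group_count(20480, 1, 8192)):
    print(g, block_group_range(g, 20480, 1, 8192))
```

For the 20MB sample this yields three groups starting at blocks 1, 8193 and 16385, with the last group cut short at the end of the partition.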
3.1.10. s_frags_per_group
32bit value indicating the total number of fragments per group. It is also used to determine the size of the block bitmap of each block group.
3.1.11. s_inodes_per_group
32bit value indicating the total number of inodes per group. This is also used to determine the size of the inode bitmap of each block group. Note that you cannot have more than (block size in bytes * 8) inodes per group as the inode bitmap must fit within a single block. This value must be a perfect multiple of the number of inodes that can fit in a block ((1024<<s_log_block_size)/s_inode_size).
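Both constraints can be checked mechanically. The sketch below is illustrative only; the function name is hypothetical and the default inode size of 128 bytes is the revision 0 value.

```python
def check_inodes_per_group(s_inodes_per_group, s_log_block_size,
                           s_inode_size=128):
    """Return a list of violated constraints (empty when the value is fine)."""
    block_size = 1024 << s_log_block_size
    inodes_per_block = block_size // s_inode_size
    problems = []
    if s_inodes_per_group > block_size * 8:
        problems.append("inode bitmap would not fit in a single block")
    if s_inodes_per_group % inodes_per_block != 0:
        problems.append("not a multiple of the inodes that fit in a block")
    return problems


# 214 inode table blocks of 1KiB hold 214 * 8 = 1712 inodes of 128 bytes.
print(check_inodes_per_group(1712, 0))
print(check_inodes_per_group(8193, 0))
```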
3.1.12. s_mtime
Unix time, as defined by POSIX, of the last time the file system was mounted.
3.1.13. s_wtime
Unix time, as defined by POSIX, of the last write access to the file system.
3.1.14. s_mnt_count
16bit value indicating how many times the file system was mounted since the last time it was fully verified.
3.1.15. s_max_mnt_count
16bit value indicating the maximum number of times that the file system may be mounted before a full check is performed.
3.1.16. s_magic
16bit value identifying the file system as Ext2. The value is currently fixed to EXT2_SUPER_MAGIC of value 0xEF53.
3.1.17. s_state
16bit value indicating the file system state. When the file system is mounted, this state is set to EXT2_ERROR_FS. After the file system was cleanly unmounted, this value is set to EXT2_VALID_FS.

When mounting the file system, if a value of EXT2_ERROR_FS is encountered it means the file system was not cleanly unmounted and most likely contains errors that will need to be fixed. Typically under Linux this means running fsck.

Table 3-4. Defined s_state Values

Constant Name    Value  Description
EXT2_VALID_FS    1      Unmounted cleanly
EXT2_ERROR_FS    2      Errors detected
3.1.18. s_errors
16bit value indicating what the file system driver should do when an error is detected. The following values have been defined: Table 3-5. Defined s_errors Values
3.1.19. s_minor_rev_level
16bit value identifying the minor revision level within its revision level.
3.1.20. s_lastcheck
Unix time, as defined by POSIX, of the last file system check.
3.1.21. s_checkinterval
Maximum Unix time interval, as defined by POSIX, allowed between file system checks.
3.1.22. s_creator_os
32bit identifier of the OS that created the file system. Defined values are:

Table 3-6. Defined s_creator_os Values

Constant Name      Value  Description
EXT2_OS_LINUX      0      Linux
EXT2_OS_HURD       1      GNU HURD
EXT2_OS_MASIX      2      MASIX
EXT2_OS_FREEBSD    3      FreeBSD
EXT2_OS_LITES      4      Lites
3.1.23. s_rev_level
32bit revision level value.

Table 3-7. Defined s_rev_level Values

Constant Name        Value  Description
EXT2_GOOD_OLD_REV    0      Revision 0
EXT2_DYNAMIC_REV     1      Revision 1 with variable inode sizes, extended attributes, etc.
3.1.24. s_def_resuid
16bit value used as the default user id for reserved blocks. In Linux this defaults to EXT2_DEF_RESUID of 0.
3.1.25. s_def_resgid
16bit value used as the default group id for reserved blocks. In Linux this defaults to EXT2_DEF_RESGID of 0.
3.1.26. s_first_ino
32bit value used as index to the first inode useable for standard files. In revision 0, the first non-reserved inode is fixed to 11 (EXT2_GOOD_OLD_FIRST_INO). In revision 1 and later this value may be set to any value.
3.1.27. s_inode_size
16bit value indicating the size of the inode structure. In revision 0, this value is always 128 (EXT2_GOOD_OLD_INODE_SIZE). In revision 1 and later, this value must be a perfect power of 2 and must be smaller than or equal to the block size (1024<<s_log_block_size).
3.1.28. s_block_group_nr
16bit value used to indicate the block group number hosting this superblock structure. This can be used to rebuild the file system from any superblock backup.
3.1.29. s_feature_compat
32bit bitmask of compatible features. The file system implementation is free to support them or not without risk of damaging the meta-data.

Table 3-8. Defined s_feature_compat Values

Constant Name                      Value   Description
EXT2_FEATURE_COMPAT_DIR_PREALLOC   0x0001  Block pre-allocation for new directories
EXT2_FEATURE_COMPAT_EXT_ATTR       0x0008  Extended inode attributes are present
EXT2_FEATURE_COMPAT_RESIZE_INO     0x0010  Non-standard inode size used
EXT2_FEATURE_COMPAT_DIR_INDEX      0x0020  Directory indexing (HTree)
3.1.30. s_feature_incompat
32bit bitmask of incompatible features. The file system implementation should refuse to mount the file system if any of the indicated features is unsupported.

An implementation not supporting these features would be unable to properly use the file system. For example, if compression is being used, an executable file would be unusable after being read from the disk if the system does not know how to uncompress it.

Table 3-9. Defined s_feature_incompat Values

Constant Name                        Value   Description
EXT2_FEATURE_INCOMPAT_COMPRESSION    0x0001  Disk/File compression is used
EXT2_FEATURE_INCOMPAT_FILETYPE       0x0002
EXT3_FEATURE_INCOMPAT_RECOVER        0x0004
EXT3_FEATURE_INCOMPAT_JOURNAL_DEV    0x0008
EXT2_FEATURE_INCOMPAT_META_BG        0x0010
3.1.31. s_feature_ro_compat
32bit bitmask of "read-only" features. The file system implementation should mount as read-only if any of the indicated features is unsupported.

Table 3-10. Defined s_feature_ro_compat Values

Constant Name                          Value   Description
EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER    0x0001  Sparse Superblock
EXT2_FEATURE_RO_COMPAT_LARGE_FILE      0x0002  Large file support, 64-bit file size
EXT2_FEATURE_RO_COMPAT_BTREE_DIR       0x0004
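Taken together, the three feature bitmasks describe a simple mount decision. The following sketch is one way to express it and is not taken from any driver; the function name mount_mode and its return strings are hypothetical. The "supported" arguments are bitmasks of the features the implementation understands. Compatible features (s_feature_compat) never affect the decision, so they are not passed in.

```python
def mount_mode(s_feature_incompat, s_feature_ro_compat,
               supported_incompat, supported_ro_compat):
    """Decide how a file system may be mounted from its feature bitmasks."""
    if s_feature_incompat & ~supported_incompat:
        return "refuse"        # an unknown incompatible feature is set
    if s_feature_ro_compat & ~supported_ro_compat:
        return "read-only"     # an unknown read-only feature is set
    return "read-write"


EXT2_FEATURE_INCOMPAT_COMPRESSION = 0x0001
EXT2_FEATURE_INCOMPAT_FILETYPE = 0x0002

# A driver that understands file types but not compression.
print(mount_mode(EXT2_FEATURE_INCOMPAT_FILETYPE, 0x0001,
                 EXT2_FEATURE_INCOMPAT_FILETYPE, 0x0001))
print(mount_mode(EXT2_FEATURE_INCOMPAT_COMPRESSION, 0x0000,
                 EXT2_FEATURE_INCOMPAT_FILETYPE, 0x0001))
```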
3.1.32. s_uuid
128bit value used as the volume id. This should, as much as possible, be unique for each file system formatted.
3.1.33. s_volume_name
16-byte volume name, mostly unused. A valid volume name would consist of only ISO-Latin-1 characters and be 0 terminated.
3.1.34. s_last_mounted
64-byte directory path where the file system was last mounted. While not normally used, it could serve for auto-finding the mountpoint when not indicated on the command line. Again the path should be zero terminated for compatibility reasons. A valid path is constructed from ISO-Latin-1 characters.
3.1.35. s_algo_bitmap
32bit value used by compression algorithms to determine the compression method(s) used.

Compression is supported in Linux 2.4 and 2.6 via the e2compr patch. For more information, visit http://e2compr.sourceforge.net/

Table 3-11. Defined s_algo_bitmap Values

Constant Name      Bit Number  Description
EXT2_LZV1_ALG      0           Binary value of 0x00000001
EXT2_LZRW3A_ALG    1           Binary value of 0x00000002
EXT2_GZIP_ALG      2           Binary value of 0x00000004
EXT2_BZIP2_ALG     3           Binary value of 0x00000008
EXT2_LZO_ALG       4           Binary value of 0x00000010
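Since each algorithm is identified by a bit number, decoding the field is a matter of testing individual bits. A minimal sketch, with a hypothetical helper name:

```python
ALGORITHMS = {
    0: "EXT2_LZV1_ALG",
    1: "EXT2_LZRW3A_ALG",
    2: "EXT2_GZIP_ALG",
    3: "EXT2_BZIP2_ALG",
    4: "EXT2_LZO_ALG",
}


def compression_methods(s_algo_bitmap):
    """List the names of the algorithms whose bit is set in the bitmap."""
    return [name for bit, name in sorted(ALGORITHMS.items())
            if s_algo_bitmap & (1 << bit)]


print(compression_methods(0x00000014))
```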
3.1.36. s_prealloc_blocks
8-bit value representing the number of blocks the implementation should attempt to pre-allocate when creating a new regular file. Linux 2.6.28 will only perform pre-allocation using Ext4 although no problem is expected if any version of Linux encounters a file with more blocks present than required.
3.1.37. s_prealloc_dir_blocks
8-bit value representing the number of blocks the implementation should attempt to pre-allocate when creating a new directory. Linux 2.6.28 will only perform pre-allocation using Ext4 and only if the EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is present. Since Linux does not de-allocate blocks from directories after they were allocated, it should be safe to perform pre-allocation and maintain compatibility with Linux.
3.1.38. s_journal_uuid
16-byte value containing the uuid of the journal superblock. See Ext3 Journaling for more information.
3.1.39. s_journal_inum
32-bit inode number of the journal file. See Ext3 Journaling for more information.
3.1.40. s_journal_dev
32-bit device number of the journal file. See Ext3 Journaling for more information.
3.1.41. s_last_orphan
32-bit inode number, pointing to the first inode in the list of inodes to delete. See Ext3 Journaling for more information.
3.1.42. s_hash_seed
An array of 4 32bit values containing the seeds used for the hash algorithm for directory indexing.
3.1.43. s_def_hash_version
An 8bit value containing the default hash version used for directory indexing.
3.1.44. s_default_mount_options
A 32bit value containing the default mount options for this file system. TODO: Add more information here!
3.1.45. s_first_meta_bg
A 32bit value indicating the block group ID of the first meta block group. TODO: Research if this is an Ext3-only extension.
For each block group in the file system, such a group_desc is created. Each represents a single block group within the file system and the information within any one of them is pertinent only to the group it is describing. Every block group descriptor table contains all the information about all the block groups.

NOTE: All indicated "block id" are absolute.
3.2.1. bg_block_bitmap
32bit block id of the first block of the "block bitmap" for the group represented. The actual block bitmap is located within its own allocated blocks starting at the block ID specified by this value.
3.2.2. bg_inode_bitmap
32bit block id of the first block of the "inode bitmap" for the group represented.
3.2.3. bg_inode_table
32bit block id of the first block of the "inode table" for the group represented.
3.2.4. bg_free_blocks_count
16bit value indicating the total number of free blocks for the represented group.
3.2.5. bg_free_inodes_count
16bit value indicating the total number of free inodes for the represented group.
3.2.6. bg_used_dirs_count
16bit value indicating the number of inodes allocated to directories for the represented group.
3.2.7. bg_pad
16bit value used for padding the structure on a 32bit boundary.
3.2.8. bg_reserved
12 bytes of reserved space for future revisions.
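The field sizes above add up to 32 bytes per descriptor (3 x 4 + 4 x 2 + 12). The following sketch decodes a descriptor table on that basis. It assumes the fields are stored little endian, in the order of sections 3.2.1 to 3.2.8 and without gaps; the names GROUP_DESC_FORMAT and parse_group_descriptors are hypothetical.

```python
import struct

# Three 32-bit block ids, three 16-bit counters, then bg_pad (2 bytes)
# and bg_reserved (12 bytes), which are skipped.
GROUP_DESC_FORMAT = "<3I3H2x12x"
GROUP_DESC_FIELDS = (
    "bg_block_bitmap", "bg_inode_bitmap", "bg_inode_table",
    "bg_free_blocks_count", "bg_free_inodes_count", "bg_used_dirs_count",
)


def parse_group_descriptors(table, group_count):
    """Decode group_count descriptors from the raw descriptor table bytes."""
    size = struct.calcsize(GROUP_DESC_FORMAT)
    descriptors = []
    for group in range(group_count):
        values = struct.unpack_from(GROUP_DESC_FORMAT, table, group * size)
        descriptors.append(dict(zip(GROUP_DESC_FIELDS, values)))
    return descriptors
```

Because every block id in a descriptor is absolute, the bitmaps and inode table of a group can be read directly at block id * block size once the table has been decoded.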
its associated group descriptor. When the inode table is created, all the reserved inodes are marked as used. In revision 0 this is the first 11 inodes.
The first few entries of the inode tables are reserved. In revision 0 there are 11 entries reserved while in revision 1 (EXT2_DYNAMIC_REV) and later the number of reserved inode entries is specified in the s_first_ino of the superblock structure. Here's a listing of the known reserved inode entries:
Table 3-14. Defined Reserved Inodes

Constant Name           Value  Description
EXT2_BAD_INO            1      bad blocks inode
EXT2_ROOT_INO           2      root directory inode
EXT2_ACL_IDX_INO        3      ACL index inode (deprecated?)
EXT2_ACL_DATA_INO       4      ACL data inode (deprecated?)
EXT2_BOOT_LOADER_INO    5      boot loader inode
EXT2_UNDEL_DIR_INO      6      undelete directory inode
3.5.1. i_mode
16bit value used to indicate the format of the described file and the access rights. Here are the possible values, which can be combined in various ways:

Table 3-15. Defined i_mode Values

Constant        Value    Description
-- file format --
EXT2_S_IFSOCK   0xC000   socket
EXT2_S_IFLNK    0xA000   symbolic link
EXT2_S_IFREG    0x8000   regular file
EXT2_S_IFBLK    0x6000   block device
EXT2_S_IFDIR    0x4000   directory
EXT2_S_IFCHR    0x2000   character device
EXT2_S_IFIFO    0x1000   fifo
-- process execution user/group override --
EXT2_S_ISUID    0x0800   Set process User ID
EXT2_S_ISGID    0x0400   Set process Group ID
EXT2_S_ISVTX    0x0200   sticky bit
-- access rights --
EXT2_S_IRUSR    0x0100   user read
EXT2_S_IWUSR    0x0080   user write
EXT2_S_IXUSR    0x0040   user execute
EXT2_S_IRGRP    0x0020   group read
EXT2_S_IWGRP    0x0010   group write
EXT2_S_IXGRP    0x0008   group execute
EXT2_S_IROTH    0x0004   others read
EXT2_S_IWOTH    0x0002   others write
EXT2_S_IXOTH    0x0001   others execute
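As an illustration of how the i_mode values combine, the following Python sketch splits a mode into its file format, a permission string and the override bits. The mask 0xF000 (named EXT2_S_IFMT here) and the function name describe_mode are assumptions made for this example; only the constants of Table 3-15 come from the table itself.

```python
# Assumed mask covering the four file format bits; not listed in Table 3-15.
EXT2_S_IFMT = 0xF000

# File format values from Table 3-15.
FILE_FORMATS = {
    0xC000: "socket",
    0xA000: "symbolic link",
    0x8000: "regular file",
    0x6000: "block device",
    0x4000: "directory",
    0x2000: "character device",
    0x1000: "fifo",
}

# Override bits from Table 3-15.
OVERRIDES = ((0x0800, "setuid"), (0x0400, "setgid"), (0x0200, "sticky"))


def describe_mode(i_mode):
    """Return (file format, 'rwxrwxrwx' style string, list of override names)."""
    file_format = FILE_FORMATS.get(i_mode & EXT2_S_IFMT, "unknown")
    perms = ""
    # User, group and others each occupy three bits: read, write, execute.
    for shift in (6, 3, 0):
        bits = (i_mode >> shift) & 0x7
        perms += "r" if bits & 4 else "-"
        perms += "w" if bits & 2 else "-"
        perms += "x" if bits & 1 else "-"
    overrides = [name for bit, name in OVERRIDES if i_mode & bit]
    return file_format, perms, overrides
```

The file format is compared after masking rather than tested bit by bit, because several formats (for example 0xA000 and 0xC000) share bits with others.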
3.5.2. i_uid
3.5.3. i_size
In revision 0, (signed) 32bit value indicating the size of the file in bytes. In revision 1 and later revisions, and only for regular files, this represents the lower 32 bits of the file size; the upper 32 bits are located in i_dir_acl.
3.5.4. i_atime
32bit value representing the number of seconds since January 1st, 1970 of the last time this inode was accessed.
3.5.5. i_ctime
32bit value representing the number of seconds since January 1st, 1970 of when the inode was created.
3.5.6. i_mtime
32bit value representing the number of seconds since January 1st, 1970 of the last time this inode was modified.
3.5.7. i_dtime
32bit value representing the number of seconds since January 1st, 1970 of when the inode was deleted.
3.5.8. i_gid
16bit value of the POSIX group having access to this file.
3.5.9. i_links_count
16bit value indicating how many times this particular inode is linked (referred to). Most files will have a link count of 1. Files with hard links pointing to them will have an additional count for each hard link. Symbolic links do not affect the link count of an inode. When the link count reaches 0 the inode and all its associated blocks are freed.
3.5.10. i_blocks
32-bit value representing the total number of 512-byte blocks reserved to contain the data of this inode, regardless of whether these blocks are used or not. The block numbers of these reserved blocks are contained in the i_block array. Since this value represents 512-byte blocks and not file system blocks, it should not be used directly as an index into the i_block array. Rather, the maximum index of the i_block array should be computed from i_blocks / ((1024<<s_log_block_size)/512), or once simplified, i_blocks/(2<<s_log_block_size).
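The conversion from 512-byte units to file system blocks can be sketched in Python as follows. The function name fs_blocks_reserved is hypothetical; the arithmetic is the simplified expression i_blocks/(2<<s_log_block_size) given in the text.

```python
def fs_blocks_reserved(i_blocks, s_log_block_size):
    """Convert i_blocks (512-byte units) into file system blocks."""
    # (1024 << s_log_block_size) / 512 simplifies to 2 << s_log_block_size.
    units_per_block = 2 << s_log_block_size
    return i_blocks // units_per_block
```

For example, with 4KiB blocks (s_log_block_size = 2) each file system block accounts for eight 512-byte units.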
3.5.11. i_flags
32bit value indicating how the ext2 implementation should behave when accessing the data for this inode.

Table 3-16. Defined i_flags Values

Constant Name          Value        Description
EXT2_SECRM_FL          0x00000001   secure deletion
EXT2_UNRM_FL           0x00000002   record for undelete
EXT2_COMPR_FL          0x00000004   compressed file
EXT2_SYNC_FL           0x00000008   synchronous updates
EXT2_IMMUTABLE_FL      0x00000010   immutable file
EXT2_APPEND_FL         0x00000020   append only
EXT2_NODUMP_FL         0x00000040   do not dump/delete file
EXT2_NOATIME_FL        0x00000080   do not update .i_atime
-- Reserved for compression usage --
EXT2_DIRTY_FL          0x00000100   Dirty (modified)
EXT2_COMPRBLK_FL       0x00000200   compressed blocks
EXT2_NOCOMPR_FL        0x00000400   access raw compressed data
EXT2_ECOMPR_FL         0x00000800   compression error
-- End of compression flags --
EXT2_BTREE_FL          0x00001000   b-tree format directory
EXT2_INDEX_FL          0x00001000   hash indexed directory
EXT2_IMAGIC_FL         0x00002000   AFS directory
EXT3_JOURNAL_DATA_FL   0x00004000   journal file data
EXT2_RESERVED_FL       0x80000000   reserved for ext2 library
3.5.12. i_osd1
32bit OS dependent value.
3.5.12.2. Linux
3.5.13. i_block
15 x 32bit block numbers pointing to the blocks containing the data for this inode. The first 12 entries are direct blocks.

The 13th entry in this array is the block number of the first indirect block, which is a block containing an array of block IDs of the blocks containing the data. Therefore, the 13th block of the file will be the first block ID contained in the indirect block. With a 1KiB block size, blocks 13 to 268 of the file data are contained in this indirect block.

The 14th entry in this array is the block number of the first doubly-indirect block, which is a block containing an array of indirect block IDs, with each of those indirect blocks containing an array of IDs of the blocks containing the data. With a 1KiB block size, there would be 256 indirect blocks per doubly-indirect block, with 256 direct blocks per indirect block, for a total of 65536 blocks per doubly-indirect block.

The 15th entry in this array is the block number of the triply-indirect block, which is a block containing an array of doubly-indirect block IDs, with each of those doubly-indirect blocks containing an array of indirect block IDs, and each of those indirect blocks containing an array of direct block IDs. With a 1KiB block size, this would be a total of 16777216 blocks per triply-indirect block.

A value of 0 in this array effectively terminates it with no further block being defined. All the remaining entries of the array should still be set to 0.
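The mapping from a file block to an i_block entry can be sketched in Python. This is a minimal illustration under stated assumptions: the function name locate_block is hypothetical, file blocks are counted from 0 here (so the "13th block" of the text is index 12), and each indirect block is assumed to hold block_size / 4 block IDs, which matches the 256 entries quoted for 1KiB blocks.

```python
def locate_block(n, block_size=1024):
    """Map 0-based file block n to (i_block entry, indices inside each indirect level)."""
    ids_per_block = block_size // 4  # 32bit block IDs per indirect block
    if n < 12:
        return n, []                 # direct block
    n -= 12
    if n < ids_per_block:
        return 12, [n]               # through the indirect block
    n -= ids_per_block
    if n < ids_per_block ** 2:
        # doubly-indirect: pick the indirect block, then the entry inside it
        return 13, [n // ids_per_block, n % ids_per_block]
    n -= ids_per_block ** 2
    if n < ids_per_block ** 3:
        # triply-indirect: doubly-indirect index, indirect index, entry index
        return 14, [n // ids_per_block ** 2,
                    (n // ids_per_block) % ids_per_block,
                    n % ids_per_block]
    raise ValueError("block index beyond the triply-indirect range")
```

With 1KiB blocks, index 267 (the 268th block) is the last one reached through the indirect block and index 268 is the first one reached through the doubly-indirect block.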
3.5.14. i_generation
32bit value used to indicate the file version (used by NFS).
3.5.15. i_file_acl
32bit value indicating the block number containing the extended attributes. In revision 0 this value is always 0. Patches and implementation status of ACL under Linux can generally be found at http://acl.bestbits.at/
3.5.16. i_dir_acl
In revision 0 this 32bit value is always 0. In revision 1, for regular files this 32bit value contains the high 32 bits of the 64bit file size.
Linux sets this value to 0 if the file is not a regular file (i.e. block devices, directories, etc). In theory, this value could be set to point to a block containing extended attributes of the directory or special file.
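A short Python sketch of how the 64bit size of a regular file could be assembled from i_size and i_dir_acl. The function name file_size, the rev_level parameter and the 0xF000 format mask are assumptions made for this example.

```python
EXT2_S_IFMT = 0xF000   # assumed mask for the file format bits of i_mode
EXT2_S_IFREG = 0x8000  # regular file, from the i_mode table


def file_size(i_size, i_dir_acl, i_mode, rev_level):
    """Return the file size in bytes, using i_dir_acl as the high 32 bits when it applies."""
    is_regular = (i_mode & EXT2_S_IFMT) == EXT2_S_IFREG
    if rev_level >= 1 and is_regular:
        return (i_dir_acl << 32) | i_size
    # Revision 0, or not a regular file: only i_size is meaningful.
    return i_size
```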
3.5.17. i_faddr
32bit value indicating the location of the file fragment. In Linux and GNU HURD, since fragments are unsupported this value is always 0. In Ext4 this value is now marked as obsolete. In theory, this should contain the block number which hosts the actual fragment. The fragment number and its size would be contained in the i_osd2 structure.
3.5.18.1. Hurd
Table 3-17. Inode i_osd2 Structure: Hurd

Offset (bytes)  Size (bytes)  Description
0               1             h_i_frag
1               1             h_i_fsize
2               2             h_i_mode_high
4               2             h_i_uid_high
6               2             h_i_gid_high
8               4             h_i_author

3.5.18.1.1. h_i_frag
8bit fragment number. Always 0 in GNU HURD since fragments are not supported. Obsolete with Ext4.
3.5.18.1.2. h_i_fsize
8bit fragment size. Always 0 in GNU HURD since fragments are not supported. Obsolete with Ext4.
3.5.18.1.3. h_i_mode_high
3.5.18.1.4. h_i_uid_high
3.5.18.1.5. h_i_gid_high
3.5.18.1.6. h_i_author
32bit user id of the assigned file author. If this value is set to -1, the POSIX user id will be used.
3.5.18.2. Linux
Table 3-18. Inode i_osd2 Structure: Linux

Offset (bytes)  Size (bytes)  Description
0               1             l_i_frag
1               1             l_i_fsize
2               2             reserved
4               2             l_i_uid_high
6               2             l_i_gid_high
8               4             reserved

3.5.18.2.1. l_i_frag
8bit fragment number. Always 0 in Linux since fragments are not supported.
A new implementation of Ext2 should completely disregard this field if the i_faddr value is 0; in Ext4 this field is combined with l_i_fsize to become the high 16bit of the 48bit blocks count for the inode data.
3.5.18.2.2. l_i_fsize
A new implementation of Ext2 should completely disregard this field if the i_faddr value is 0; in Ext4 this field is combined with l_i_frag to become the high 16bit of the 48bit blocks count for the inode data.
3.5.18.2.3. l_i_uid_high
3.5.18.2.4. l_i_gid_high
3.5.18.3. Masix
Table 3-19. Inode i_osd2 Structure: Masix

Offset (bytes)  Size (bytes)  Description
0               1             m_i_frag
1               1             m_i_fsize
2               10            reserved

3.5.18.3.1. m_i_frag
8bit fragment number. Always 0 in Masix as fragments are not supported. Obsolete with Ext4.
3.5.18.3.2. m_i_fsize
8bit fragment size. Always 0 in Masix as fragments are not supported. Obsolete with Ext4.
Knowing that inode 1 is the first inode defined in the inode table, one can use the following formulas:

    block group = (inode - 1) / s_inodes_per_group

Once the block group is identified, the local inode index for the local inode table can be identified using:

    local inode index = (inode - 1) % s_inodes_per_group
Here are a couple of sample values that could be used to test your implementation:

Table 3-20. Sample Inode Computations (s_inodes_per_group = 1712)

Inode Number  Block Group Number  Local Inode Index
1             0                   0
963           0                   962
1712          0                   1711
1713          1                   0
3424          1                   1711
3425          2                   0
As many of you are most likely already familiar with, an index of 0 means the first entry. The reason behind using 0 rather than 1 is that it can more easily be multiplied by the structure size to find the final byte offset of its location in memory or on disk.
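The two computations, block group = (inode - 1) / s_inodes_per_group and local inode index = (inode - 1) % s_inodes_per_group, can be sketched in Python as below. The function names locate_inode and inode_byte_offset are hypothetical, and the inode_size parameter stands for the size of one inode structure (128 is assumed as the default here).

```python
def locate_inode(inode, s_inodes_per_group):
    """Return (block group number, local inode index) for a 1-based inode number."""
    if inode < 1:
        raise ValueError("inode numbers start at 1")
    block_group = (inode - 1) // s_inodes_per_group
    local_index = (inode - 1) % s_inodes_per_group
    return block_group, local_index


def inode_byte_offset(local_index, inode_size=128):
    """Byte offset of the inode from the start of its group's inode table."""
    # A 0-based index can be multiplied directly by the structure size.
    return local_index * inode_size
```

The sample values of Table 3-20 can be fed to locate_inode with s_inodes_per_group set to 1712 to check an implementation.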
In revision 0, the type of the entry (file, directory, special file, etc) has to be looked up in the inode of the file. In revision 0.5 and later, the file type is also contained in the directory entry structure.

Table 4-1. Linked Directory Entry Structure

Offset (bytes)  Size (bytes)  Description
0               4             inode
4               2             rec_len
6               1             name_len[a]
7               1             file_type[b]
8               0-255         name

Notes:
a. Revision 0 of Ext2 used a 16bit name_len; since most implementations restricted filenames to a maximum of 255 characters this value was truncated to 8bit with the upper 8bit recycled as file_type.
b. Not available in revision 0; this field was part of the 16bit name_len field.
4.1.1. inode
32bit inode number of the file entry. A value of 0 indicates that the entry is not used.
4.1.2. rec_len
16bit unsigned displacement to the next directory entry from the start of the current directory entry. This field must have a value at least equal to the length of the current record. The directory entries must be aligned on 4-byte boundaries and there cannot be any directory entry spanning multiple data blocks. If an entry cannot completely fit in one block, it must be pushed to the next data block and the rec_len of the previous entry properly adjusted. Since this value cannot be negative, when a file is removed the previous record within the block has to be modified to point to the next valid record within the block or to the end of the block when no other directory entry is present. If the first entry within the block is removed, a blank record will be created and point to the next directory entry or to the end of the block.
4.1.3. name_len
8bit unsigned value indicating how many bytes of character data are contained in the name. This value must never be larger than rec_len - 8. If the directory entry name is updated and cannot fit in the existing directory entry, the entry may have to be relocated in a new directory entry of sufficient size and possibly stored in a new data block.
4.1.4. file_type
8bit unsigned value used to indicate file type. In revision 0, this field was the upper 8-bit of the then 16-bit name_len. Since all implementations still limited the file names to 255 characters this 8-bit value was always 0. This value must match the inode type defined in the related inode entry.

Table 4-2. Defined Inode File Type Values

Constant Name      Value  Description
EXT2_FT_UNKNOWN    0      Unknown File Type
EXT2_FT_REG_FILE   1      Regular File
EXT2_FT_DIR        2      Directory File
EXT2_FT_CHRDEV     3      Character Device
EXT2_FT_BLKDEV     4      Block Device
EXT2_FT_FIFO       5      Buffer File
EXT2_FT_SOCK       6      Socket File
EXT2_FT_SYMLINK    7      Symbolic Link
4.1.5. name
Name of the entry. The ISO-Latin-1 character set is expected on most systems. The name must be no longer than 255 bytes after encoding.
For which the following data representation can be found on the storage device:

Table 4-3. Sample Linked Directory Data Layout, 4KiB blocks

Offset (bytes)  Size (bytes)  Description
Directory Entry 0
0               4             inode number: 783362
4               2             record length: 12
6               1             name length: 1
7               1             file type: EXT2_FT_DIR=2
8               1             name: .
9               3             padding
Directory Entry 1
12              4             inode number: 1109761
16              2             record length: 12
18              1             name length: 2
19              1             file type: EXT2_FT_DIR=2
20              2             name: ..
22              2             padding
Directory Entry 2
24              4             inode number: 783364
28              2             record length: 24
30              1             name length: 13
31              1             file type: EXT2_FT_REG_FILE
32              13            name: .bash_profile
45              3             padding
Directory Entry 3
48              4             inode number: 783363
52              2             record length: 16
54              1             name length: 7
55              1             file type: EXT2_FT_REG_FILE
56              7             name: .bashrc
63              1             padding
Directory Entry 4
64              4             inode number: 783377
68              2             record length: 12
70              1             name length: 4
71              1             file type: EXT2_FT_REG_FILE
72              4             name: mbox
Directory Entry 5
76              4             inode number: 783545
80              2             record length: 20
82              1             name length: 11
83              1             file type: EXT2_FT_DIR=2
84              11            name: public_html
95              1             padding
Directory Entry 6
96              4             inode number: 669354
100             2             record length: 12
102             1             name length: 3
103             1             file type: EXT2_FT_DIR=2
104             3             name: tmp
107             1             padding
Directory Entry 7
108             4             inode number: 0
112             2             record length: 3988
114             1             name length: 0
115             1             file type: EXT2_FT_UNKNOWN
116             0             name:
116             3980          padding
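The sample layout of Table 4-3 can be rebuilt and walked with a short Python sketch. It is only an illustration: the helper names pack_entry and walk_directory_block are hypothetical, and the fields are assumed to be stored little-endian on disk.

```python
import struct

# (inode, rec_len, file_type, name) for each entry of Table 4-3.
SAMPLE_ENTRIES = [
    (783362, 12, 2, b"."),
    (1109761, 12, 2, b".."),
    (783364, 24, 1, b".bash_profile"),
    (783363, 16, 1, b".bashrc"),
    (783377, 12, 1, b"mbox"),
    (783545, 20, 2, b"public_html"),
    (669354, 12, 2, b"tmp"),
    (0, 3988, 0, b""),  # unused entry covering the rest of the block
]


def pack_entry(inode, rec_len, file_type, name):
    """Build one linked directory entry, zero-padded up to rec_len bytes."""
    header = struct.pack("<IHBB", inode, rec_len, len(name), file_type)
    return (header + name).ljust(rec_len, b"\0")


def walk_directory_block(block):
    """Yield (offset, inode, file_type, name) for every used entry of one block."""
    offset = 0
    while offset < len(block):
        inode, rec_len, name_len, file_type = struct.unpack_from("<IHBB", block, offset)
        # rec_len must cover at least the 8-byte header and stay inside the block.
        if rec_len < 8 or offset + rec_len > len(block):
            raise ValueError("corrupt rec_len at offset %d" % offset)
        if inode != 0:  # an inode of 0 marks an unused entry
            yield offset, inode, file_type, block[offset + 8:offset + 8 + name_len]
        offset += rec_len
```

Joining pack_entry(*e) for every tuple in SAMPLE_ENTRIES gives a 4096-byte block, and walk_directory_block then follows rec_len from entry to entry, skipping the final unused record.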
4    2    rec_len: 12
6    1    name_len: 1
7    1    file_type: EXT2_FT_DIR=2
8    1    name: .
9    3    padding
-- Linked Directory Entry: .. --
12   4    inode: parent directory
16   2    rec_len: (blocksize - this entry's length(12))
18   1    name_len: 2
19   1    file_type: EXT2_FT_DIR=2
20   2    name: ..
22   2    padding
-- Indexed Directory Root Information Structure --
24   4    reserved, zero
28   1    hash_version
29   1    info_length
30   1    indirect_levels
31   1    reserved - unused flags

4.2.1.1. hash_version
8bit value representing the hash version used in this indexed directory.

Table 4-5. Defined Indexed Directory Hash Versions

Constant Name      Value  Description
DX_HASH_LEGACY     0      TODO: link to section
DX_HASH_HALF_MD4   1      TODO: link to section
DX_HASH_TEA        2      TODO: link to section

4.2.1.2. info_length
8bit length of the indexed directory information structure (dx_root); currently equal to 8.

4.2.1.3. indirect_levels
8bit value indicating how many indirect levels of indexing are present in this hash. In Linux, as of 2.6.28, the maximum indirect levels value supported is 1.
The indexed directory entries are used to quickly look up the inode number associated with the hash of a filename. These entries are located immediately following the fake linked directory entry of the directory data blocks, or immediately following the indexed directory root described in Section 4.2.1. The first indexed directory entry, rather than containing an actual hash and block number, contains the maximum number of indexed directory entries that can fit in the block and the actual number of indexed directory entries stored in the block. The format of this special entry is detailed in Table 4-7. The other directory entries are sorted by hash value starting from the smallest to the largest numerical value.

Table 4-6. Indexed Directory Entry Structure (dx_entry)

Offset (bytes)  Size (bytes)  Description
0               4             hash
4               4             block

Table 4-7. Indexed Directory Entry Count and Limit Structure

Offset (bytes)  Size (bytes)  Description
0               2             limit
2               2             count

4.2.2.1. hash
32bit hash of the filename represented by this entry.

4.2.2.2. block
32bit block index of the directory inode data block containing the (linked) directory entry for the filename.

4.2.2.3. limit
16bit value representing the total number of indexed directory entries that fit within the block, after removing the other structures, but including the count/limit entry.

4.2.2.4. count
16bit value representing the total number of indexed directory entries present in the block. TODO: Research if this value includes the count/limit entry.
- Compute a hash of the name
- Read the index root
- Use binary search (linear in the current code) to find the first index or leaf block that could contain the target hash (in tree order)
- Repeat the above until the lowest tree level is reached
- Read the leaf directory entry block and do a normal Ext2 directory block search in it.
- If the name is found, return its directory entry and buffer
- Otherwise, if the collision bit of the next directory entry is set, continue searching in the successor block
Normally, two logical blocks of the file will need to be accessed, and one or two metadata index blocks. The effect of the metadata index blocks can largely be ignored in terms of disk access time since these blocks are unlikely to be evicted from cache. There is some small CPU cost that can be addressed by moving the whole directory into the page cache.
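The binary search step of the lookup can be sketched in Python. This is a simplified illustration and not the Linux code: the index entries of one block are assumed to be already decoded into a list of (hash, block) pairs sorted by hash, the first entry is assumed to stand for the lowest hash values, and the function name find_block is hypothetical.

```python
from bisect import bisect_right


def find_block(entries, target_hash):
    """Return the block of the last entry whose hash is <= target_hash.

    entries: list of (hash, block) pairs sorted by hash, smallest first.
    """
    if not entries:
        raise ValueError("index block has no entries")
    hashes = [h for h, _ in entries]
    # bisect_right gives the position just past the last hash <= target_hash.
    position = bisect_right(hashes, target_hash) - 1
    # Assumption: anything below the first hash still falls under the first entry.
    return entries[max(position, 0)][1]
```

The same probe would be repeated at each level of the index until a leaf block number is obtained, after which the leaf is searched like a normal linked directory block.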
The details of splitting and hash collision handling are somewhat messy, but I will be happy to dwell on them at length if anyone is interested.
4.2.5. Splitting
In brief, when a leaf node fills up and we want to put a new entry into it the leaf has to be split, and its share of the hash space has to be partitioned. The most straightforward way to do this is to sort the entries by hash value and split somewhere in the middle of the sorted list. This operation is O(n log n) in the number of entries in the leaf and is not a great cost so long as an efficient sorter is used. I used Combsort for this, although Quicksort would have been just as good in this case since average case performance is more important than worst case. An alternative approach would be just to guess a median value for the hash key, and the partition could be done in linear time, but the resulting poorer partitioning of hash key space outweighs the small advantage of the linear partition algorithm. In any event, the number of entries needing sorting is bounded by the number that fit in a leaf.
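The sort-and-split approach can be sketched in Python as follows. This is only an illustration under stated assumptions: the leaf is modelled as a list of (hash, name) pairs, the function name split_leaf is hypothetical, Python's built-in sort is used in place of Combsort, and runs of identical hashes are not treated specially.

```python
def split_leaf(entries):
    """Split a full leaf into two halves by hash value.

    entries: list of (hash, name) pairs in any order.
    Returns (low_half, high_half, split_hash), where split_hash is the
    lowest hash of the upper half and would go into the index entry
    pointing at the new block.
    """
    if len(entries) < 2:
        raise ValueError("need at least two entries to split")
    ordered = sorted(entries, key=lambda entry: entry[0])
    middle = len(ordered) // 2
    low_half, high_half = ordered[:middle], ordered[middle:]
    return low_half, high_half, high_half[0][0]
```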
avoid splitting such sequences between blocks, so the split point of a block is adjusted with this in mind. But the possibility still remains that if the block fills up with identically-hashed entries, the sequence may still have to be split. This situation is flagged by placing a 1 in the low bit of the index entry that points at the successor block, which is naturally interpreted by the index probe as an intermediate value without any special coding. Thus, handling the collision problem imposes no real processing overhead, just some extra code and a slight reduction in the hash key space. The hash key space remains sufficient for any conceivable number of directory entries, up into the billions.
4.2.8. Performance
OK, if you have read this far then this is no doubt the part you've been waiting for. In short, the performance improvement over normal Ext2 has been stunning. With very small directories performance is similar to standard Ext2, but as directory size increases standard Ext2 quickly blows up quadratically, while htree-enhanced Ext2 continues to scale linearly. Uli Luckas ran benchmarks for file creation in various sizes of directories ranging from 10,000 to 90,000 files. The results are pleasing: total file creation time stays very close to linear, versus quadratic increase with normal Ext2. Time to create:

Figure 4-1. Performance of Indexed Directories

               Indexed     Normal
               =======     ======
10000 Files:   0m1.350s    0m23.670s
20000 Files:   0m2.720s    1m20.470s
30000 Files:   0m4.330s    3m9.320s
40000 Files:   0m5.890s    5m48.750s
50000 Files:   0m7.040s    9m31.270s
60000 Files:   0m8.610s    13m52.250s
70000 Files:   0m9.980s    19m24.070s
80000 Files:   0m12.060s   25m36.730s
90000 Files:   0m13.400s   33m18.550s
A graph is posted at: http://www.innominate.org/~phillips/htree/performance.png

All of these tests are CPU-bound, which may come as a surprise. The directories fit easily in cache, and the limiting factor in the case of standard Ext2 is the looking up of directory blocks in buffer cache, and the low level scan of directory entries. In the case of htree indexing there are a number of costs to be considered, all of them pretty well bounded. Notwithstanding, there are a few obvious optimizations to be done:

- Use binary search instead of linear search in the interior index nodes.
- If there is only one leaf block in a directory, bypass the index probe, go straight to the block.
- Map the directory into the page cache instead of the buffer cache.
Each of these optimizations will produce a noticeable improvement in performance, but naturally it will never be anything like the big jump going from N**2 to Log512(N), ~= N. In time the optimizations will be applied and we can expect to see another doubling or so in performance. There will be a very slight performance hit when the directory gets big enough to need a second level. Because of caching this will be very small. Traversing the directory's metadata index blocks will be a bigger cost, and once again, this cost can be reduced by moving the directory blocks into the page cache. Typically, we will traverse 3 blocks to read or write a directory entry, and that number increases to 4-5 with really huge directories. But this is really nothing compared to normal Ext2, which traverses several hundred blocks in the same situation.
Attribute values are aligned to the end of the block, stored in no specific order. They are also padded to EXT2_XATTR_PAD (4) byte boundaries. No additional gaps are left between them.

Table 5-1. Extended Attribute Block Layout

Attribute Block Header
Attribute Entry 1        |
Attribute Entry 2        |  growing downwards
Attribute Entry 3        V
4 null bytes
unused space...
Attribute Value 1        ^
Attribute Value 3        |  growing upwards
Attribute Value 2        |
5.2.2.2. h_refcount
32bit value used as reference count. This value is incremented every time a link is created to this attribute block and decremented when a link is destroyed. Whenever this value reaches 0 the attribute block can be freed.

5.2.2.3. h_blocks
32bit value indicating how many blocks are currently used by the extended attributes. In Linux a value of h_blocks higher than 1 is considered invalid. This effectively restricts the amount of extended attributes to what can fit in a single block.
There does not seem to be any support for extended attributes in Ext2 under GNU HURD.
5.2.2.4. h_hash
32bit hash value of all attribute entry header hashes.

Procedure to compute Extended Attribute Header Hash
1. Initialize the 32bit hash to 0.
2. Check if there is any extended attribute entry left to process; if not, we are done.
3. Do a cyclic bit shift of 16 bits to the left of the 32bit hash value, effectively swapping the upper and lower 16 bits of the hash.
4. Perform a bitwise exclusive OR (XOR) between the extended attribute entry hash and the header hash being computed.
5. Go back to step 2.
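The procedure can be sketched in Python. The function name xattr_header_hash is hypothetical and the entry hashes are assumed to be already available as a list of 32bit integers. The combining step is written as XOR, which is an assumption based on the behaviour of the Linux implementation.

```python
def xattr_header_hash(entry_hashes):
    """Combine the per-entry hashes into the 32bit header hash."""
    header_hash = 0
    for entry_hash in entry_hashes:
        # Cyclic shift by 16 bits: swaps the upper and lower halves.
        header_hash = ((header_hash << 16) | (header_hash >> 16)) & 0xFFFFFFFF
        # Assumed combining operation: XOR with the entry hash.
        header_hash ^= entry_hash & 0xFFFFFFFF
    return header_hash
```

With a single entry the header hash equals the entry hash, since rotating the initial 0 leaves it unchanged.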
The total size of an attribute entry is always rounded to the next 4-byte boundary.

5.2.3.1. e_name_len
8bit unsigned value indicating the length of the name.

5.2.3.3. e_value_offs
16bit unsigned offset to the value within the value block.
5.2.3.5. e_value_size
32bit unsigned value indicating the size of the attribute value.
Value        Description
0x00000001   secure deletion
0x00000002   record for undelete
0x00000004   compressed file
0x00000008   synchronous updates
0x00000010   immutable file
0x00000020   append only
0x00000040   do not dump/delete file
0x00000080   do not update .i_atime
0x00000100   dirty (file is in use?)
0x00000200   compressed blocks
0x00000400   access raw compressed data
0x00000800   compression error
0x00001000   b-tree format directory
0x00001000   Hash indexed directory
0x00002000   ?
0x00004000   journal file data
0x80000000   reserved for ext2 implementation
5.3.13. EXT2_BTREE_FL - B-Tree Format Directory
5.3.14. EXT2_INDEX_FL - Hash Indexed Directory
When this bit is set, the format of the directory file is hash indexed. This is covered in detail in Section 4.2.
5.3.15. EXT2_IMAGIC_FL
5.3.16. EXT2_JOURNAL_DATA_FL - Journal File Data
5.3.17. EXT2_RESERVED_FL - Reserved
Appendix A. Credits
I would like to personally thank everybody who contributed to this document, you are numerous and in many cases I haven't kept track of all of you. Be sure that if you are not in this list, it's a mistake and do not hesitate to contact me, it will be a pleasure to add your name to the list.
Peter Rottengatter
    Corrections to Section 3.1.11
    Corrections to Table 3-1 and Table 3-2
    Corrections to Section 3.2
Ryan Cuthbertson (ryan.cuthbertson@adelaide.edu.au)
    Corrections to Section 3.5.10
    Corrections to Chapter 3
Andreas Gruenbacher (a.gruenbacher@bestbits.at)
    Section 5.2
Daniel Phillips (phillips@innominate.de)
    Section 4.2.3
    Section 4.2.4
    Section 4.2.5
    Section 4.2.6
    Section 4.2.7
    Section 4.2.8
Jeremy Stanley of Access Data Inc.
    Pointed out the inversed values for EXT2_S_IFSOCK and EXT2_S_IFLNK