
Billion-files File Systems (BfFS): A Comparison

Sohail Shaikh
Department of Computer Science
George Mason University, Fairfax, Virginia
[email protected]

Abstract—As the volume of data being produced is increasing at an exponential rate and needs to be processed quickly, it is reasonable that the data needs to be available very close to the compute devices to reduce transfer latency. Due to this need, local filesystems are getting close attention to understand their inner workings, performance, and, more importantly, their limitations. This study analyzes a few popular Linux filesystems, EXT4, XFS, BtrFS, ZFS, and F2FS, by creating, storing, and then reading back one billion files from the local filesystem. The study also captured and analyzed read/write throughput, storage block usage, disk space utilization and overheads, and other metrics useful for system designers and integrators. Furthermore, the study explored other side effects such as filesystem performance degradation during and after these large numbers of files and folders are created.

Index Terms—Linux, filesystem, EXT4, XFS, BtrFS, F2FS, ZFS, performance, statistics

I. INTRODUCTION

The local filesystem, where the operating system and applications reside, forms the core of many systems. As local hard drives scale up, the number of objects stored in each node grows. This study investigates the hypothesis that there are likely some limitations and performance impacts of storing and accessing large numbers of files (e.g., a billion files) of varying sizes on popular local filesystems such as EXT4, XFS, BtrFS, F2FS, and ZFS. The study intends to evaluate different local filesystems available for the Linux operating system, their pros and cons, their ability to store a billion files and directories, and their suitability for the purpose under specific workloads on different types of storage media such as SSD and HDD.

Linux EXT4, the third revision of the EXT filesystem, has been around since the mid-2000s and has performed very well as a general-purpose and stable local filesystem for popular Linux distributions. However, as the data volume has been increasing rapidly, with large amounts of data being placed closer to the compute devices to improve the performance of data-intensive computations and avoid data transfers over the network, the importance of local filesystem performance and its limitations has increased rapidly. The core idea of this study is to answer two basic questions supported by analysis: a) Can these filesystems handle one billion or more files and folders? and b) Does performance deteriorate as files and folders are added? If a filesystem is unable to store a billion files, the study analyzes whether these limits can be overcome.

EXT4 is currently the default filesystem for most popular Linux distributions. EXT2 and earlier filesystems were prone to catastrophic corruption when the system loses power during a write operation. EXT3 solved this problem with journaling, a two-phase writing process for metadata and data. EXT3 was limited to a 2TB file size and a maximum 16TB filesystem size. EXT4 increased these limits to a 16TB file size and a 1EB filesystem size. There are other improvements in EXT4 such as different types of block allocation, extents, an unlimited number of subdirectories using HTree indices [16], checksums for journals, nanosecond timestamps, and online defragmentation [7].

XFS is a high-performance 64-bit journaling filesystem and was ported to the Linux kernel in 2001. XFS excels in the execution of parallel input/output (I/O) operations due to its design [4]. The XFS design enables extreme scalability of I/O threads, filesystem bandwidth, file sizes, and the size of the filesystem itself when spanning multiple physical storage devices. XFS ensures the consistency of data by employing metadata journaling and supporting write barriers. Space allocation is performed via extents with data structures stored in B+ trees, improving the overall performance of the file system, especially when handling large files.

BtrFS is a modern, modified B-tree-based copy-on-write (CoW) filesystem in which files are not overwritten in place; instead, a new copy with the updates is created. It is aimed at implementing advanced features while also focusing on fault tolerance, repair, and easy administration. Its main features and benefits are snapshots, RAID, self-healing, a 16EB maximum file size, dynamic inode allocation, SSD awareness, automatic defragmentation, and scrubbing [6].

ZFS is another filesystem, developed by Sun Microsystems and acquired by Oracle, that combines the roles of a volume manager and a filesystem and is intended for critical systems where data loss is not acceptable. ZFS targets large infrastructures where large numbers of storage devices are deployed, and it requires appropriate configuration to support these storage devices.

F2FS (Flash-Friendly File System) is a filesystem specifically intended for NAND-based flash memory devices equipped with a Flash Translation Layer (FTL). It is supported by Linux kernel 3.8 and above and supports compression as well as encryption natively. F2FS was designed around the idea of a log-structured filesystem, adapted to newer forms of storage, and overcame issues such as the snowball effect of wandering trees and high cleaning overhead. F2FS has a weak fsck that can lead to data loss in case of a sudden power loss [8].
II. METHODS

To compare the target filesystems, they needed to be exercised thoroughly and compared against a baseline. The baseline is established by creating 10 million files of different sizes that are normally distributed between 1KB and 10KB and stored evenly among 100 root folders. To establish this baseline, the first step is to set up the environment for the experiments. Due to several factors, including the introduction of unpredictable internet and ISP-induced latencies and the response time of cloud components, it was decided to avoid the cloud altogether and rely on on-prem server hardware for the experiments. The specification of the on-prem HP Z820 workstation is: 2x Xeon E5-2670 2.6GHz CPUs with 16 cores/32 threads total, 256GB RAM, dual NICs connected to two independent subnets, one 1TB SSD for Linux Ubuntu 22.04, a 2TB Samsung 870 EVO SATA SSD [12] with 550MB/sec read/write performance for applications, and a 14TB Seagate IronWolf™ HDD with a 256MB cache and a 210MB/s sustained transfer rate [11]. In addition, C/C++ development tools, Java OpenJDK 11, Visual Studio, and other tools such as filebench, fio, vmstat, iotop, dstat, atop, and visualjvm were installed, and bcc was considered [17]. The goal is to capture IOPS, read and write performance, and the I/O latency of reading and writing files. To exercise these filesystems with file and folder reads and writes, an application was developed. The operating system was deliberately not fine-tuned in any way, to make this study reproducible and useful for others.

Algorithm 1 Folders/Files create/write/read algorithm
1: Files ← 10M or 100M or 1B files
2: Folders ← 100
3: Subfolders ← computed from Files
4: for i = 1 to Folders do
5:   Create folder
6:   for j = 1 to Subfolders do
7:     Create subfolder
8:     for k = 1 to 100K do
9:       Create file of normal random data size with CRC-32
10:    end for
11:  end for
12:  Statistics ← collect folder/file statistics
13: end for
14: Statistics ← collect storage statistics
15: for i = 1 to Folders do
16:   Search folder
17:   for j = 1 to Subfolders do
18:     Search subfolder
19:     for k = 1 to 100K do
20:       Read file and verify content using CRC-32
21:     end for
22:   end for
23:   Statistics ← collect folder/file statistics
24: end for
25: Process collected statistics for analysis

A. Application

After compiling filebench and making several attempts to test with it that failed at around the 10-million-file mark due to crashes, it became evident that it was time to write some purpose-built tools. Initially, a Java environment was used, for its flexibility and consistent performance, to develop the benchmarking and evaluation application [2]. To ensure that application overheads are consistent regardless of the number of files to be created or read, simple programming constructs were used. In addition, to eliminate any performance degradation due to the introduction of a JVM, the application was also (re)developed in the C language using the lowest-level Linux filesystem calls [19]. This paper uses the performance metrics from the C application. The high-level pseudocode for folder and file creation and verification is shown in Algorithm 1.

The application consists of two parts, a Creator and a Reader: routines that create folders, subfolders, and files, and read folders, subfolders, and files, respectively. To accommodate 1 billion files with a planned file size range of 1KB to 10KB, a 14TB HDD was used. The rationale behind the file size range is that a typical disk block size is 4KB, so a file may use anywhere from one to three 4KB blocks [3]. During file creation, each file is filled with random binary data and appended with an 8-byte CRC-32 checksum generated on-the-fly so that the Reader can verify the file's content during reading.

The Creator routine also captures file sizes and the individual write throughput distribution during the runs. At the end of the run, it summarizes the captured metrics, which are then fed into an Excel spreadsheet for further analysis. The Reader performs the reads, iterating through the folders and files created by the Creator earlier. The Reader reads each file's content, extracts the last 8 bytes of the content (the CRC-32 checksum), computes the CRC-32 checksum of the remaining content, and compares the two to ensure the file's integrity. The application counts the number of failed checksums and reports the file count discrepancy for the folder as well as for the entire run at the end. The following files/folders schedule is used for testing these file systems: Files = Folders × Subfolders × 100K (e.g., 100 × 100 × 100,000 = 1 billion files).

TABLE I: Files/Folders Read/Write Schedule
Files | Folders | Subfolders/Folder | Files/Subfolder
10M   | 100     | 1                 | 100,000
100M  | 100     | 10                | 100,000
1B    | 100     | 100               | 100,000

In addition to reading and writing files and folders, the application captures metrics for macrobenchmarking and microbenchmarking. Macrobenchmarking captures read/write performance metrics for the individual folders and the files contained in them, as well as for the entire run at the end, whereas microbenchmarking focuses on individual file read/write performance and its distribution (min, ave., max). It is important to note that captured timings are in microseconds.
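To make the Creator/Reader write-and-verify path concrete, below is a minimal C sketch of how a single file could be written with a trailing checksum and verified on read-back. It is an illustration only, not the BfFS tool itself [19]: the function names are hypothetical, zlib's crc32() stands in for whatever CRC-32 implementation the tool uses, and the 32-bit CRC is stored in the low bytes of an 8-byte trailer to match the 8-byte checksum described above.

/* Sketch of the Creator/Reader per-file path (hypothetical, not the BfFS code). */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <zlib.h>      /* crc32(); link with -lz */

/* Write `len` random payload bytes followed by an 8-byte trailer holding the CRC-32. */
static int write_file_with_crc(const char *path, const unsigned char *payload, size_t len)
{
    uint64_t trailer = (uint64_t)crc32(0L, payload, (uInt)len);  /* CRC-32 widened to 8 bytes */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, payload, len) != (ssize_t)len ||
        write(fd, &trailer, sizeof trailer) != (ssize_t)sizeof trailer) {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Read the file back, recompute the CRC over everything but the last 8 bytes, and compare. */
static int read_and_verify(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    off_t size = lseek(fd, 0, SEEK_END);
    if (size < 8) { close(fd); return -1; }
    size_t body = (size_t)size - 8;

    unsigned char *buf = malloc((size_t)size);
    if (!buf) { close(fd); return -1; }
    if (pread(fd, buf, (size_t)size, 0) != (ssize_t)size) { free(buf); close(fd); return -1; }
    close(fd);

    uint64_t stored;
    memcpy(&stored, buf + body, sizeof stored);
    int ok = ((uint64_t)crc32(0L, buf, (uInt)body) == stored);
    free(buf);
    return ok ? 0 : 1;   /* 0 = checksum matches, 1 = corruption detected */
}

The failed-checksum count described above would then simply accumulate the non-zero results of read_and_verify() per folder and per run.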
B. Linux Filesystem

Focusing only on the Linux operating system for this study, the Linux filesystem is designed around several abstract layers built on top of each other for ease of interfacing and to isolate applications from unnecessary complexities. Applications access block devices, such as hard disks, through system calls (e.g., open(), read(), write()) to the Virtual File System (VFS). VFS is the software layer in the kernel that provides the filesystem interface to user-space programs [5] and interfaces with one or more installed filesystems such as EXT4, XFS, and others. Between the hardware and the filesystem there exist two more layers: the page cache and the block layer. The Linux page cache layer is the main disk cache and improves filesystem performance by caching data to and from the disk. The kernel first looks in the cache to see if the data is available and only accesses the disk if the data is not found in the cache. Once the data is read from the disk, the page cache is updated. In the case of a write operation, the data is checked to see if it is already in the cache; if not, a new entry is added, but the data is not written immediately to the disk, allowing more data to accumulate before committing to the disk and reducing I/O overheads.

Fig. 1: Linux VFS Architecture

This delayed write to the disk is handled by the I/O scheduler in the block layer (which improves write speed). Block drivers hide the complexity of the underlying storage hardware, e.g., hard disks (HDD), solid-state disks (SSD), optical disks, etc. In the context of this paper, there are a few other important aspects of the filesystem implemented in the Linux VFS that are useful to understand for this study and are explained in the following sections, e.g., the directory entry cache, the inode, and the file object. The read/write latency for the filesystem can be described as follows, where each layer contributes some amount of latency: Latency_read/write = Latency_application + Latency_VFS + Latency_filesystem + Latency_driver + Latency_diskIO.

Application latency is largely dependent on the programming language and constructs used to access the filesystem, and on any high-level abstractions used to make the filesystem transparent to the programmer. Linux Virtual File System (VFS) latency is related to abstracting the filesystem's internal workings from the high-level system utilities as well as the applications, so that different filesystems look the same to the applications [10]. However, the latency imposed by the application is assumed to be constant or to vary minimally between runs. VFS also imposes a negligible overhead compared to the amount of time spent in other layers. Since the application, block driver, and block device layers are constant, the only layer left is the filesystem, which is the layer this study tries to capture and analyze. The study has analyzed the portion of run time that is consumed by the application, the other software layers, and the hard disk I/O.

1) Directory Entry Cache: File-related system calls such as open() and create() require a path as an argument. This argument is used by the VFS to search for the directory entry in a lookup cache, called the Directory Entry Cache (dcache), which provides a view into the entire filespace. Due to the constraint of limited physical memory, entries (dentries) are sometimes created on the fly using inode information.

2) Inode Cache and Inode: The inode cache (icache) goes together with the dcache in a master-slave configuration. If there is a dentry, there is an entry in the icache. The reason the icache exists in the first place is that the inode gets updated frequently while the file is open, but the inode is saved on the disk, unlike the dcache. An inode, or index node, exists one per object in the filesystem for all object types (files, directories, etc.). It is a structure that holds not only the inode number but also the file size, the owner's user id and group id, access and creation times, etc.

Fig. 2: EXT4 Internal Structures

3) File Object: For each opened file in a Linux system, a file object exists and contains information such as its path, dentry entry, file operations, flags, permissions, mode, and other metadata [10]. Each file is assigned a file descriptor (FD) which points to the file object and is used for file operations. Linux reserves the first three FDs for standard input, output, and error and imposes a configurable 1024-FD per-process limit.
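The per-process FD limit matters for a tool that touches hundreds of millions of files (one at a time in this study, but still worth checking at startup). A minimal sketch of how a program can inspect and, within the hard limit, raise the RLIMIT_NOFILE soft limit on Linux is shown below; it is illustrative and not part of the BfFS tool.

/* Inspect and raise the per-process open-file limit (RLIMIT_NOFILE). Illustrative only. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    /* Raise the soft limit up to the hard limit (e.g., above the default 1024). */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    return 0;
}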
C. Discovery Approach

To determine the read/write behavior of the target filesystems, namely EXT4 (baseline), XFS, BtrFS, ZFS, and F2FS, several tools were explored, but nothing satisfied the goals of this study. To overcome this roadblock, an application was developed in Java, and later in the C language, which creates files and folders and captures performance metrics for further analysis. The targeted metrics are described below for writes and reads.

1) Baseline Metrics - EXT4_10M filesystem: As the first step, a baseline needs to be established for comparison before conducting any further analysis. EXT4, which comes out of the box with most Linux distributions, is used as the baseline filesystem and allows further analysis and comparison of the targeted filesystems to support one billion files. The Creator and Reader programs ran on the 14TB HDD and produced the results listed in Table II.

TABLE II: EXT4_10M Baseline Disk Metrics
Parameter | Before        | Used        | After
Inodes    | 1200005235    | 10000200    | 1190005035
Blocks    | 3171666459    | 19270745    | 3152395714
Disk size | 1.2991×10^13  | 78932971520 | 1.2912×10^13

2) File Writes: The application provides the capability to create folders and then create files between the sizes of 1K and 10K following a normal distribution, using the Box-Muller transform [18], which produced consistent file sizes with the median falling around 5500 bytes and a standard deviation of 1024 bytes. The reason for choosing this range of file sizes is to ensure that one billion files fit on the 14TB hard disk and that at least 66% of the file sizes are over the 4KB block size, occupying at least two blocks (except in the case of ZFS, which uses 128KB blocks). The file size distribution pattern is shown in Figure 3.

Fig. 3: File Size Distribution

The tool also captured additional information to assess the performance. Individual file write times are aggregated for each folder and used for the cumulative write time of the entire run, shown in Table III as the Total Write Time (TWT). The captured individual file write times are also used to determine the min, max, and average file write times. The total bytes written to the disk are captured to determine the disk utilization and the overheads of adding the files. The tool captures the exact number of bytes per file and per folder, which are added up to get the cumulative size written to the disk for each run. After each run, the inode count, the blocks used, and the increase in disk space utilization are captured to determine the overheads.
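As an illustration of the file-size sampling described above, the following is a minimal C sketch of a Box-Muller sampler clamped to the 1KB-10KB range. The mean of 5500 bytes and the standard deviation of 1024 bytes follow the text, while the function name, the use of rand(), and the clamping behavior are assumptions rather than the BfFS implementation.

/* Box-Muller sampling of file sizes: mean ~5500 bytes, std. dev. 1024, clamped to [1KB, 10KB]. */
#include <math.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static size_t sample_file_size(void)
{
    /* Two independent uniforms in (0, 1]; rand() is used here only for brevity. */
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 1.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 1.0);

    /* Standard normal variate via the Box-Muller transform. */
    double z = sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);

    double size = 5500.0 + 1024.0 * z;      /* scale and shift to the target distribution */
    if (size < 1024.0)  size = 1024.0;      /* clamp into the 1KB-10KB window */
    if (size > 10240.0) size = 10240.0;
    return (size_t)size;
}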
3) File Reads: To understand the differences between write performance and read performance, we have to consider that during the write operation the page cache is used, so the system calls return more quickly than in the case of reads, since the application reads all the bytes into memory, computes the CRC-32 checksum, and compares it with the one stored in the file before proceeding to the next file. The filesystems provide different read performance depending on the number of files generated earlier. For this comparison, the paper considers the average write and average read statistics. The captured average write times fall roughly between 10 and 30 µsec, whereas average read times vary between about 5 and 230 µsec. Except in the case of ZFS, the difference between the write speed and the read speed of the same files was not too great; e.g., on average the EXT4_100M write is 24 µsec whereas the read is 62 µsec, which is about 3 times slower. However, for ZFS_100M the write is 16 µsec but the average read is 201 µsec, about 12 times slower.

4) Disk Utilization Overheads: Disk utilization overhead is another aspect of the file creation process, where the disk space used is greater than the bytes added. It is calculated as ((Disk Space Used - Bytes Added) / Disk Space Used) * 100 and has been observed to be anywhere from 30% to 60% for the different filesystems. Except in one case (XFS), the number of files and folders did not affect this overhead; XFS's overhead varies because of its dynamic handling of folder and file metadata.

Though the total folder creation time is relatively small, the tool captures these metrics to ensure it is not high enough to skew the analysis, especially for the larger runs, e.g., 10M files and above. The per-folder write throughput is determined by the individual file write times divided by the number of files per folder. It is used to analyze performance degradation as folders and files are added to the filesystem. Each file's write speed is captured, sorted, and saved into a bucket to determine the distribution pattern, as shown in Figure 4. For example, the majority of EXT4 file write speeds fall in two buckets, 10-15µs and 15-20µs, whereas XFS write speeds fall at precisely 15µs regardless of the number of files. Each of the file systems listed below requires a slightly different configuration to ensure it can accommodate one billion or more files and directories.

EXT4 is the general-purpose filesystem for the majority of Linux distributions and gets installed as the default. However, it is unable to accommodate one billion files by default due to the small number of inodes available, typically set at 240 million files or folders, created during installation. To increase the inodes, the EXT4 filesystem needs to be recreated on the device with a custom configuration using the mkfs command. The option -b 4096 creates blocks of 4KB each, resulting in about 1.2 billion inodes on a 14TB drive: mkfs.ext4 -N 1200005248 -b 4096 /dev/xxx

XFS creates inodes dynamically as files are being created, but it requires that enough space is allocated to the inode data structures on the disk; therefore XFS can handle one billion or more files and folders. This dynamic creation of inodes was observed when the progress bar momentarily paused and then continued during the runs. For this purpose the maxpct flag is used with the command, which allocates a percentage of the disk to the inode structures, e.g., 10% as shown in the command: mkfs.xfs -i maxpct=10 -f /dev/xxx

BtrFS does not use inodes like the other filesystems; therefore it always reports 0 inodes to the df command. BtrFS has many other advanced features of a modern filesystem such as snapshots, fault tolerance, copy-on-write, support for huge file sizes, etc. The command to create BtrFS is: mkfs.btrfs -f /dev/xxx

ZFS requires that a pool is created first and then mounted; only then is it available for write/read operations. A pool is created using: zpool create -f zfspoolname /dev/xxx
Fig. 4: File Write Speed Distribution (in µsec); panels: (a) All Files, (b) 1 Billion Files, (c) 100 Million Files, (d) 10 Million Files

Fig. 5: File Read Speed Distribution (in µsec); panels: (a) All Files, (b) 1 Billion Files, (c) 100 Million Files, (d) 10 Million Files

Fig. 6: Folder Write Throughput Distributed Over Time (µsec vs 20 buckets); panels: (a) All Files, (b) 1 Billion Files, (c) 100 Million Files, (d) 10 Million Files

Fig. 7: Folder Read Throughput Distributed Over Time (µsec vs 20 buckets); panels: (a) All Files, (b) 1 Billion Files, (c) 100 Million Files, (d) 10 Million Files
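The distributions in Figs. 4-7 come from placing each captured per-file (or per-folder) timing into a bucket. A minimal sketch of such bucketing, assuming fixed 5 µsec-wide bins as suggested by the 10-15µs and 15-20µs buckets mentioned above, is shown below; the bin width and function names are illustrative assumptions, not the BfFS code.

/* Histogram of per-file latencies in fixed 5-µsec buckets (illustrative sketch). */
#include <stdio.h>

#define BUCKET_WIDTH_US 5
#define NUM_BUCKETS     20          /* last bucket also catches everything >= 95 µsec */

static unsigned long buckets[NUM_BUCKETS];

static void record_latency(unsigned long usec)
{
    unsigned long idx = usec / BUCKET_WIDTH_US;
    if (idx >= NUM_BUCKETS)
        idx = NUM_BUCKETS - 1;      /* clamp outliers into the final bucket */
    buckets[idx]++;
}

static void print_histogram(void)
{
    for (int i = 0; i < NUM_BUCKETS; i++)
        printf("%3d-%3d us: %lu\n",
               i * BUCKET_WIDTH_US, (i + 1) * BUCKET_WIDTH_US, buckets[i]);
}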

F2FS is a specialized filesystem to support SSDs. Upon installation, F2FS establishes a segmented disk layout. It uses inodes to point to files and folders. The command used was: mkfs.f2fs -i -s 10 -z 10 -f /dev/xxx. This generated only 630 million inodes and was unable to create 1 billion inodes, so the test was restricted to 100 million files and the corresponding folders.

The tables below capture the performance statistics for writes and reads for the file systems listed above. The explanation of the write performance numbers is as follows:

• FWT(min, ave, max): File Write Time (min, ave, and max) in µsec.
• WTh: Write Throughput in bytes per microsecond.
• TFWT: Total Folder Write Time, the cumulative time to create (sub)folders, in seconds.
• TfWT: Total File Write Time, the time to create/write files, in seconds.
• TWT: Total Write Time, the time to create folders, subfolders, and files, in seconds.
• FWs: Files written per second.
• BkWs: Blocks written per second. Typically 4KB blocks, but 128KB for ZFS.
• TFCWT: Total File Create for Write Time, the cumulative time to create files, not including writing the data/bytes to the files.
• FCWT: Average time to create a file, not including writing the bytes to the file.
• TByW: Total Bytes Written in gigabytes (GB).
• TBkW: Total 4KB or 128KB Blocks Written.
• DSU: Disk Space Used in gigabytes (GB).
• DSUO: Disk Space Utilization Overheads.
• Inodes: Inodes used to store folders, subfolders, and files.

The explanation of the read performance numbers is as follows:

• FRT(min, ave, max): File Read Time (min, ave, and max) in µsec.
• RTh: Read Throughput in bytes per microsecond.
• TFRT: Total Folder Read Time, the cumulative time to search (sub)folders, in seconds.
• TfRT: Total File Read Time, the time to read files, in seconds.
• TRT: Total Read Time, the time to search (sub)folders and files, in seconds.
• FRs: Files read per second.
• BkRs: Blocks read per second. Typically 4KB blocks, but 128KB for ZFS.
• TFORT: Total File Open for Read Time, the cumulative time to open files, not including reading the data/bytes from the files.
• FORT: Average time to open and read a file.
• TByR: Total Bytes Read in gigabytes (GB).
• TBkR: Total 4KB or 128KB Blocks Read.
• TRT: Total Run Time in seconds.
• CPUO: Computation Overheads, the total time taken by the run not including I/O.
• BkSize: Block size, typically 4KB except for ZFS's 128KB.

The Total Run Time (TRT) can be loosely defined as the Total Write Time (TFCWT + TWT) plus the Total Read Time (TFORT + TRT) plus processing overheads.

III. DISCUSSION

The application generates folders, subfolders, and files per the schedule in Table I. The files are created with random data following the normal distribution shown in Figure 3 and are appended with a CRC-32 checksum computed over the random data generated for the file. Once all the files are generated, they are read back according to the same schedule. The file data is verified using the CRC-32 read from the file and a CRC-32 checksum generated from the data in the file. The EXT4 filesystem with 10 million files spread between 100 root folders, with 1 subfolder each, is used as the baseline to compare the other file systems. The reason is to minimize the differences between each run and the configuration, and all the runs were carefully conducted: Linux Ubuntu 22.04 was the out-of-the-box version with no configuration changes except regular system updates, re-creating the file system on the 14TB drive, and rebooting the server between switching to a different file system as well as between runs when generating 1 billion files, as these runs sometimes lasted 36 to 48 hours. Most of the initial testing was conducted using EXT4 with 10 million files. Comparison analysis is performed between the baseline EXT4 and the other file systems.

The file write and read performance provided a glimpse of each file system and how it behaves with different numbers of folders, subfolders, and files, as each consumes an inode. As captured, the application is I/O-intensive, which means it spent the majority of its time waiting for disk I/O, i.e., writing, reading, or searching the files. The CPU overhead captures the percentage of the time the system was *not* performing I/O. The study started with an initial assumption of worse write performance in comparison with read performance, which turned out to be false because of Linux's built-in write caching mechanism. For example, the EXT4 1B average file write time is only 14 µsec but the average read time is 62 µsec, about four times slower. The ZFS 1B run is even worse in this regard, since its average file write time is 16 µsec but the average file read time is 218 µsec, more than 12 times slower. The file read performance numbers provided a consistently interesting pattern, as the graphs show: in the small runs (10M files), for all filesystems, the read times were lower than the write times.

One of the questions that needed investigation was: is there any performance degradation as the number of files in the filesystem increases? To answer this question, the Creator code computes and uses folder performance over the entire run. Due to the large number of folders it was difficult to show the trend over all folders, so it was decided to take 20 samples spaced evenly over the entire run.

The graphs in Figures 6a and 7a show all filesystems over their different runs. These graphs are further categorized by the number of files generated or read to show the different patterns at different times of the runs. The purpose is to show the long-term trends, especially during the creation of the folders, subfolders, and files. In some cases write performance recovered and in some cases it continued to deteriorate. Starting with EXT4, both read and write performance was consistent for all three runs, though EXT4 showed a momentary deterioration early in the 1 billion files run which was not reproducible, so it was recorded as a possible anomaly. EXT4 gave relatively consistent write and read performance as well as long-term performance degradation trends among the set of file systems explored.

XFS turned out to be about the same as EXT4 in performance and in its long-term performance deterioration trend (EXT4/XFS average file write times of 15/12, 24/28, and 14/31 µsec and average read times of 10/5, 62/62, and 62/73 µsec for the 10M, 100M, and 1B runs, respectively; see Tables III and IV). In the 1 billion files run, the folder read/write performance of XFS was slower than that of EXT4 (see Figures 6b and 7b). We need to point out that there was an initial performance degradation in the 100 million files run for both EXT4 and XFS, which seems to have stabilized after about 25 folders and their files were created. This degradation was not observed in the 1 billion or 10 million files runs.
TABLE III: File Write Performance Metrics
Filesystem FWTmin,ave,max WTh TFWT TfWT TWT FWs BkWs TFCWT FCWT TByW TBkW DSU DSUO Inodes
EXT410MBL 5, 15, 2.53×106 348 0.02 157 157 63 121 253 25 54.99 19.18 78.93 30% 10000200
EXT4100M 5, 24, 2.50×106 224 0.12 2452 2452 41 78 3103 31 549.95 191.82 789.29 30% 100001100
EXT41B 5, 14, 2.23×106 382 18.33 14387 14405 70 133 24762 24 5507.59 1919.61 7898.75 30% 1000010100
XFS10M 7, 12, 0.22×106 442 0.09 124 124 80 154 224 22 54.99 19.18 138.34 60% 10000200
XFS100M 7, 28, 0.76×106 191 0.31 2871 2872 35 66 2425 24 549.96 191.83 903.31 39% 100001100
XFS1B 7, 31, 0.72×106 176 8.07 31208 31216 32 61 26242 26 5499.58 1918.25 8432.62 35% 1000010100
BtrFS10M 6, 11, 0.14×106 464 0.01 118 118 85 162 220 22 54.99 19.18 85.23 35% 0
BtrFS100M 6, 12, 30.73×106 445 0.07 1234 1234 81 155 4968 49 549.93 191.82 911.62 40% 0
F2FS10M 4, 11, 0.19×106 495 0.01 110 110 90 172 1131 113 55.00 19.18 118.99 54% 10000400
F2FS100M 4, 11, 0.39×106 471 0.05 1166 1166 86 164 11969 119 549.97 191.83 1189.78 54% 100003100
ZFS10M 11, 16, 0.05×106 337 0.02 169 163 61 61 394 39 55.00 10.00 82.33 33% 10000200
ZFS100M 11, 15, 0.09×106 358 5.92 1535 1541 65 65 3901 39 549.94 100.00 823.22 33% 100001100
ZFS1B 11, 16, 0.13×106 337 13.76 16316 16330 61 61 40887 40 5499.59 1000.00 8232.76 33% 1000010100
All performance numbers are in microseconds. The table is divided into rows and columns: rows are for the file system and the number of files generated, whereas columns capture particular metrics. Each column is explained in detail in the paper. EXT4 with 10M files is the baseline (EXT410MBL). The BtrFS file system does not use inodes, therefore it is shown with 0 inodes. The 1 billion files could not be read back for the BtrFS file system due to exponentially increasing read-back times, therefore that run is not included in the write or read tables. Please note that folders are also represented by inodes.
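As a quick check of the DSUO column using the document's own formula, taking DSU as the disk space used and TByW as the bytes added, the EXT4 baseline row gives DSUO = ((78.93 - 54.99) / 78.93) * 100 ≈ 30%, matching the reported value; the same arithmetic for the XFS 10M row gives ((138.34 - 54.99) / 138.34) * 100 ≈ 60%.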

TABLE IV: File Read Performance Metrics


Filesystem FRTmin,ave,max RTh TFRT TfRT TRT FRs BkRs TFORT FORT TByR TBkR TRT CPUO BkSize
EXT410MBL 2, 10, 4.12×106 505 0.00 108 108 92 176 23 2 54.99 19.18 683.66 21% 4096
EXT4100M 2, 62, 1.73×106 88 0.05 6226 6226 16 30 1688 16 549.95 191.82 15049.99 10% 4096
EXT41B 2, 62, 0.95×106 88 185.42 62200 62386 16 30 20181 20 5507.59 1919.61 136914.58 11% 4096
XFS10M 2, 5, 2.16×106 1028 0.00 53 53 187 358 23 2 54.99 19.18 571.45 26% 4096
XFS100M 3, 62, 1.17×106 87 0.09 6285 6285 16 30 1351 13 549.96 191.83 14560.15 11% 4096
XFS1B 37, 73, 1.17×106 74 253.65 73852 74105 14 25 39588 39 5499.58 1918.25 188036.72 9% 4096
BtrFS10M 2, 4, 0.0×106 1349 0.00 40 40 245 470 24 2 54.99 19.18 553.70 27% 4096
BtrFS100M 2, 70, 0.62×106 77 0.08 7071 7071 14 27 1759 17 549.93 191.82 16674.51 10% 4096
F2FS10M 1, 2, 0.01×106 2038 0.00 26 26 371 710 23 2 55.00 19.18 1436.08 10% 4096
F2FS100M 2, 131, 0.66×106 41 1.36 13163 13164 8 14 20780 207 549.97 191.83 48771.11 3% 4096
ZFS10M 11, 16, 0.06×106 332 0.00 165 165 61 60 30 3 55.00 10.00 912.24 17% 131072
ZFS100M 83, 217, 0.65×106 25 2.91 21707 21710 5 4 3961 39 549.94 100.00 32981.95 6% 131072
ZFS1B 73, 230, 0.83×106 23 10.52 230447 230457 4 4 41671 41 5499.59 1000.00 348312.94 5% 131072
All performance numbers are in microseconds. The table is divided into rows and columns: rows are for the file system and the number of files read, whereas columns capture a particular metric. Each column is explained in detail in the paper. The rightmost three columns are included here even though they are common to the write and read operations. Total Run Time (TRT) is the total run time of the application against a target file system. CPUO is the CPU-bound processing overhead that excludes any I/O. BkSize is the block size used by the file system.
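As a rough consistency check of the throughput columns (and assuming GB here means 10^9 bytes and the total file write/read times are in seconds), WTh and RTh are approximately the total bytes moved divided by the total file write/read time. For the EXT4 10M baseline rows: 54.99 GB / 157 s ≈ 350 bytes/µsec for writes (reported WTh 348) and 54.99 GB / 108 s ≈ 509 bytes/µsec for reads (reported RTh 505), with the small differences attributable to rounding.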
To capture the long-term trend and assess performance deterioration, we captured the write and read performance of every 5th folder created and read back, and stored these numbers in 20 buckets. BtrFS testing took more time than all the other file system tests combined because its read performance deteriorated to the point that the 1 billion files run never completed, even after running for 72 hours straight. The write portion of the run completed, but reads slowed to over 60 minutes per folder and kept increasing, so the runs were killed, the server was rebooted, the file system was re-installed, and it still failed. BtrFS file writes were comparable to EXT4, but the file read performance was slightly slower (around 65 µsec). We did not observe any long-term write or read performance degradation for BtrFS in the 100 million and 10 million files runs.

F2FS, though meant for flash drives, was tested on the 14TB HDD to be consistent with the other file systems under test. It was not expected that we could create 1 billion files, but we tested F2FS anyway by increasing the inodes. We were unable to increase beyond 600 million, so F2FS was restricted to a maximum of 100 million files. F2FS was slightly better than EXT4 in file write performance, but in file read performance F2FS was slower than EXT4 in the 100 million files runs. In the long-term folder write and read trends, F2FS was stable in both the 100 million and 10 million files runs.

ZFS was different in terms of its installation. Once installed, it behaved like any other filesystem, i.e., no changes to the application code were required. ZFS file write performance was very consistent (16/15/16 µsec), but file read performance was much higher (16/217/230 µsec). ZFS runs took much more wall-clock time than the other file systems. For the 1 billion files write and read-back it took 330 thousand seconds, about 92 hours, to complete.

CPU overheads are defined as the time the code spends not performing any I/O, such as calculating statistics or other housekeeping work. The decision to switch to pure C from Java was deliberate, in order to avoid any overhead imposed by the JVM or Java programming constructs. The numbers captured showed a consistent pattern: about 20% for the 10 million files runs, 10% for the 100 million files runs, and 10% for the 1 billion files runs. CPU overheads were computed using the Total Run Time (TRT), Total Write Time (TWT), Total File Create for Write Time (TFCWT), Total Read Time (TRT), and Total File Open for Read Time (TFORT): CPUO = (TRT - (TWT + TFCWT + TRT + TFORT)) / TRT, where the inner TRT is the total read time and the outer TRT is the total run time.
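As a worked check of this formula against the tables (reading the inner TRT as the total read time from Table IV and the write terms from Table III), the EXT4 10M baseline gives CPUO = (683.66 - (157 + 253 + 108 + 23)) / 683.66 ≈ 0.21, i.e., about 21%, which matches the CPUO column reported in Table IV.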
IV. CONCLUSION

This study investigated and compared several popular Linux filesystems, EXT4, XFS, BtrFS, F2FS, and ZFS, for their ability to create and manage one billion files, measuring file creation and reading times and any performance degradation during and after creating the files. Except for F2FS, all filesystems were able to handle one billion files after increasing the inodes and re-mounting the filesystem, i.e., no other operating system configuration changes were made. BtrFS was tested with 1 billion files, but reading these files back was too slow to capture the final performance metrics. To exercise these file systems and capture the desired metrics, a tool was developed (initially in Java) in the C programming language. The second decision was to use an on-prem server rather than the cloud, to avoid unnecessary round-trip delays.

The decision to pick popular Linux filesystems was based on a couple of factors: their availability in most Linux distributions, being non-clustered, and at least one of them, F2FS, supporting SSDs natively. EXT4 comes with many Linux distributions by default, therefore it was chosen to establish the baseline for performance benchmarking. XFS, BtrFS, and ZFS were chosen because they were comparable in features with EXT4 for this paper.

All of the filesystems except F2FS were able to handle one billion files. Once they ran successfully and the essential metrics were captured, these large tests were not repeated because they took days to complete. One of the fundamental limitations that was discovered was the default number of inodes: EXT4 needed to be configured and reinstalled, whereas XFS, BtrFS, and ZFS handle inode creation automatically. However, XFS requires that, when installing the filesystem, a certain percentage of the storage be dedicated to the inode data structures. EXT4 is a good choice if a large number of small files needs to be stored on the filesystem. XFS has better block-based performance if performance is the primary requirement.

The original questions asked at the beginning are partially answered: the ability to create one billion files was demonstrated on the selected filesystems except for F2FS, due to its lack of the needed inodes. In addition, the study was able to answer the question about performance degradation: except for EXT4, the other filesystems show some signs of performance deterioration as the number of files increases, and BtrFS was the worst in this regard. Disk utilization overheads were also observed for all of the filesystems, with XFS having the worst overhead followed by F2FS, which is useful information for storage estimation by system designers.

The study captured extensive amounts of data for the filesystems, including individual read and write speeds as well as cumulative read and write performance, read and write throughput, the number of files written and read per second, disk block utilization, and disk utilization overheads. Further work can be conducted to measure performance using real-world databases on these filesystems with high data volumes and high transaction rates. Additional statistical analysis can be performed on the raw data captured to establish a statistical basis for the findings, using techniques including confidence intervals, paired observations, linear regression, hypothesis testing, and analysis of variance (ANOVA). We hope that this study is useful for system designers in selecting the right file system for their purpose.

ACKNOWLEDGMENT

I would like to acknowledge my professor Dr. Yue Cheng ([email protected]), who provided guidance and timely advice in conducting this study and in the preparation of this paper.
REFERENCES

[1] V. Tarasov, S. Bhanage, E. Zadok, and M. Seltzer, "Benchmarking file system benchmarking: it *IS* rocket science," in Workshop on Hot Topics in Operating Systems, 2011.
[2] S. He, X.-H. Sun, and Y. Yin, "BPS: A Performance Metric of I/O System," in 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, 2013, pp. 1954-1962.
[3] "XFS," Wikipedia, https://en.wikipedia.org/wiki/XFS (accessed Apr. 29, 2022).
[4] "BtrFS," Wikipedia, https://en.wikipedia.org/wiki/BtrFS (accessed Apr. 29, 2022).
[5] "BtrFS Wiki," https://btrfs.wiki.kernel.org/index.php/Main_Page (accessed Apr. 29, 2022).
[6] J. Salter, "Understanding Linux filesystems: EXT4 and beyond," https://opensource.com/article/18/4/ext4-filesystem (accessed Apr. 29, 2022).
[7] "F2FS," ArchWiki, https://wiki.archlinux.org/title/F2FS (accessed Apr. 29, 2022).
[8] R. Gooch, "Overview of the Linux Virtual File System," https://www.kernel.org/doc/html/latest/filesystems/vfs.html, 1999 (accessed Apr. 29, 2022).
[9] M. Senofsky and P. Dietl, "Introduction to the Linux Virtual Filesystem (VFS)," https://www.starlab.io/blog/introduction-to-the-linux-virtual-filesystem-vfs-part-i-a-high-level-tour, 2020 (accessed Apr. 29, 2022).
[10] Seagate Technology, "Seagate 6TB-14TB IronWolf NAS HDD Technical Specifications," https://www.seagate.com/www-content/datasheets/pdfs/ironwolf-14tb-DS1904-10-1807US-en_US.pdf, 2022 (accessed Apr. 29, 2022).
[11] Samsung, "Samsung 2TB 870 EVO SSD Technical Specifications," https://www.samsung.com/us/computing/memory-storage/solid-state-drives/870-evo-sata-2-5-ssd-2tb-mz-77e2t0b-am/, 2022 (accessed Apr. 29, 2022).
[12] "Hard disk drive performance characteristics," Wikipedia, https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics, Aug. 2022 (accessed Apr. 29, 2022).
[13] M. Jones, "Anatomy of the Linux virtual file system switch," https://developer.ibm.com/tutorials/l-virtual-filesystem-switch/, Aug. 2009 (accessed Apr. 29, 2022).
[14] M. Larabel, "Linux 5.14 SSD Benchmarks With Btrfs vs. EXT4 vs. F2FS vs. XFS," https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-File-Systems, Aug. 2021 (accessed Apr. 29, 2022).
[15] D. Phillips, "A Directory Index for Ext2," http://www.linuxshowcase.org/2001/full_papers/phillips/phillips_html/index.html, Sep. 2021 (accessed Apr. 29, 2022).
[16] B. Gregg, "Linux Performance," https://www.brendangregg.com/linuxperf.html (accessed Apr. 29, 2022).
[17] D. J. Lilja, Measuring Computer Performance. Cambridge: Cambridge University Press, 2000.
[18] "Box-Muller transform," Wikipedia, https://en.wikipedia.org/wiki/Box–Muller_transform (accessed Jun. 30, 2024).
[19] S. Shaikh, BfFS (Version 1.0) [Computer software], 2024. https://github.com/sshaikh5/BfFS
