
Full-Scale File System Acceleration on GPU

Peter Maucher, Lennard Kittner, Nico Rath, Gregor Lucka, Lukas Werling, Yussuf Khalil, Thorsten Gröninger, Frank Bellosa
Karlsruhe Institute of Technology, Germany

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. DOI: https://doi.org/10.18420/fgbs2024f-03. FGBS '24, March 14-15, 2024, Bochum, Germany.

ABSTRACT

Modern HPC and AI computing solutions regularly use GPUs as their main source of computational power. This creates a significant imbalance for storage operations of GPU applications, as every such storage operation has to be signalled to and handled by the CPU. In GPU4FS, we propose a radical solution to this imbalance: move the file system implementation into the application, and run the complete file system on the GPU. This requires multiple changes across the complete file system stack, from the actual storage layout up to the file system interface. Additionally, this approach frees the CPU from file system management tasks, which allows for more meaningful usage of the CPU. In our preliminary implementation, we show that a fully-featured file system running on the GPU with minimal CPU interaction is possible, and even bandwidth-competitive depending on the underlying storage medium.

KEYWORDS

File System, GPU, Direct Storage Access, GPU-Acceleration, GPU-Offloading

1 INTRODUCTION

Graphics processing units (GPUs) offer massive parallelism for data-parallel applications. This data needs to be loaded into GPU memory in some way. Currently, data management is handled mostly on the CPU, with the GPU only signalling progress to the CPU. Assuming this data comes from storage, the request has to go through the interconnect to the CPU. After the CPU actually starts working on the request, it has to go through the file system (FS) implementation until it hits the storage device, and after receiving the response, it has to signal completion to the GPU. Some of this latency is unavoidable, especially the storage latency, but a lot of it can be avoided if the GPU accesses storage directly. Even with CPU file systems, Volos et al. [26] argue that some overhead is added by old FS interfaces originally designed for HDDs.

In this paper, we present GPU4FS, a modern, fully-featured GPU-side FS designed to remove both the latency induced by the FS interface and the latency of the trip through the interconnect. GPU4FS runs on the GPU, in parallel to the actual GPU-side application, which allows for low-latency access and shared-memory communication between the application and the FS. Given the lack of a system call instruction on GPUs, this communication is done via shared video memory (VRAM), utilizing a parallel work queue implementation. We also place work queues in DRAM, which enables fast, parallel, user-space access to the FS for CPU-side applications.

We designed a file system with a feature set closely following modern file systems, as the file system interface is widely used and well understood by programmers. Instead of having to port each and every application, we can completely hide the implementation details from unaware applications, but enable an opt-in for CPU-side applications to benefit from the changed semantics. On the GPU, the application needs to be modified for any kind of storage access. Using GPUfs, Silberstein et al. [24] demonstrate that offering a library interface to a CPU FS eases access to storage for GPU programmers, but GPUfs only calls into a CPU-side file system. GPU4FS offers a similar interface to GPUfs, but runs the file system on the GPU.

In our preliminary implementation, we demonstrate GPU access to Intel Optane Persistent Memory (PMem) [19], and show that we are write-bandwidth competitive with contemporary Optane file systems. We also demonstrate an implementation of GPU-optimized folder structures and a RAID implementation. We argue that this shows that a fully-featured GPU-side file system is both useful and can work well.

2 METHODOLOGY

Given the novel idea of moving the file system implementation from the CPU to the GPU, GPU4FS differs substantially from contemporary file systems in its implementation. To retain compatibility with preexisting CPU-side applications and familiarity for new GPU-side applications, these differences need to be hidden by the interface.

2.1 Interface

The most interesting aspect for compatibility is the file system interface. In POSIX [17], accesses are commonly handled using syscalls, which is not feasible in GPU4FS for several reasons: a classical system call instruction is not available on GPUs, and even with such an instruction, it would most certainly not be usable from the CPU. Instead, we elected the common approach (e.g., FUSE [6]) of using a shared-memory buffer for communication. A requesting process inserts a command for every intended file system operation, and the GPU4FS process then parses the information and handles the request.
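
To make the command-buffer mechanism concrete, the following C sketch shows one possible layout for such a command slot and the client-side submit step. The struct fields, their names, the fixed-size path field, and the three-state status word are illustrative assumptions of ours, not the actual GPU4FS command format.

/* Minimal sketch of a shared-memory command slot (layout assumed, not
 * the real GPU4FS format). The client fills the slot and publishes it
 * with a release store; the file system answers in place. */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

enum gpu4fs_op { GPU4FS_OP_LOAD_FILE = 1, GPU4FS_OP_MKDIR_PATH = 2 };
enum gpu4fs_slot_state { SLOT_FREE = 0, SLOT_SUBMITTED = 1, SLOT_DONE = 2 };

struct gpu4fs_cmd {
    _Atomic uint32_t status;   /* SLOT_FREE -> SLOT_SUBMITTED -> SLOT_DONE */
    uint32_t opcode;           /* requested operation */
    char     path[256];        /* operation argument, e.g. a file path */
    uint64_t result_offset;    /* filled by the FS: offset of the answer in the shared buffer */
    uint64_t result_length;    /* filled by the FS: length of the returned data */
    int32_t  error;            /* 0 on success, errno-style code otherwise */
};

/* Client side: fill a free slot and publish it with a release store so the
 * file system sees all fields before it sees the status change. */
static void gpu4fs_submit(struct gpu4fs_cmd *slot, uint32_t opcode, const char *path)
{
    slot->opcode = opcode;
    strncpy(slot->path, path, sizeof(slot->path) - 1);
    slot->path[sizeof(slot->path) - 1] = '\0';
    atomic_store_explicit(&slot->status, SLOT_SUBMITTED, memory_order_release);
}

The release/acquire pairing on the status word is what lets both sides use the slot without any syscall or interrupt; the serving side simply polls for SLOT_SUBMITTED.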

GPU4FS is intended to offer two sets of commands: one closely aligned with the POSIX interface, and a high-level interface intended to reduce the number of requests and thus decrease latency and increase bandwidth. Volos et al. [26] show that major improvements can be achieved by reducing the number of interactions: reading a single file into memory using POSIX interfaces requires five operating system interactions: a call to open() for the file descriptor, a call to fstatat() to get the length of the file, followed by malloc() for memory allocation, then one or more calls to read() to fetch the actual data, and close() to release the file descriptor. This does not meaningfully improve with the usage of mmap(), as the only call that can be left out is the call to malloc(). In both cases, read() and mmap(), the process is inherently racy, as other tasks might change both the file content and metadata such as the length at any time. Instead, we offer an interface to load a complete file, with the allocation of memory for the loaded file handled inside the command handler. The resulting data is written into the shared buffer, and a pointer/length pair is returned. This way, the operation can be handled atomically, and only one request is needed. We expect a mixture of both interfaces to be used in GPU applications: for parallel reads of a folder, the POSIX interface can be used so that each item is only loaded once, but each individual file can be loaded using the atomic request. Nevertheless, the POSIX interface is required for compatibility.
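
The contrast between the two access paths can be sketched as follows. The POSIX sequence is standard; gpu4fs_load_file() is a hypothetical wrapper around the single "load complete file" command, invented here only to illustrate the shape of the interface.

#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* POSIX path: five interactions, and inherently racy against concurrent
 * changes to the file between fstatat() and read(). Error handling and
 * short reads are omitted for brevity. */
static void *read_file_posix(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);            /* 1: open()    */
    struct stat st;
    fstatat(AT_FDCWD, path, &st, 0);          /* 2: fstatat() */
    void *buf = malloc(st.st_size);           /* 3: malloc()  */
    *len = (size_t)read(fd, buf, st.st_size); /* 4: read()    */
    close(fd);                                /* 5: close()   */
    return buf;
}

/* GPU4FS high-level path: a single command answered with a pointer/length
 * pair into the shared buffer (hypothetical function name). */
struct gpu4fs_file { void *data; size_t len; };
struct gpu4fs_file gpu4fs_load_file(const char *path);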

One important consideration for compatible applications is support for mmap(). In Linux, one can use either a private or a shared mapping, which governs two independent features: a shared mapping implies that modified memory is written back to disk, but it also allows for shared-memory communication with other processes. We assume that most applications use a shared mapping to persist their changes rather than for shared-memory communication. In a CPU-only file system with a DRAM page cache, implementing both is relatively easy, but mixing in multiple devices with their own memory adds complications and forces the data to be held in DRAM to allow access from all devices. To allow for high-bandwidth VRAM-side mmap, we add an additional write-back mode that only flushes data to disk when the mapping is unmapped. Additionally, this lets us avoid costly page table changes. We expect this mode to be widely used in GPU-side applications that are inherently incompatible with any other file system.
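
The following short C sketch contrasts the conventional shared-mapping write-back with the relaxed flush-on-unmap mode described above; the gpu4fs_* function names are assumptions made for illustration, and only the msync()/munmap() semantics are standard POSIX.

#include <stddef.h>
#include <sys/mman.h>

/* Conventional shared mapping: modifications are written back to the file,
 * at the latest on msync() or munmap(), and are visible to other processes. */
static void flush_posix_shared(void *map, size_t len)
{
    /* ... application modifies map[0..len) ... */
    msync(map, len, MS_SYNC);   /* force write-back to storage */
}

/* Proposed VRAM write-back mode: the buffer is mapped read/write once and
 * only flushed to storage when it is unmapped, avoiding page table updates
 * in between. Both function names are hypothetical. */
void *gpu4fs_mmap_vram(const char *path, size_t *len);
void  gpu4fs_munmap_flush(void *buf);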

2.2 GPU Implementation

As a broad overview, a system running GPU4FS has at least two corresponding processes running: the main GPU-side GPU4FS process handling the complete file system, and a CPU-side process responsible for setting up the GPU process and establishing communication. Additionally, there may be multiple further processes running on both CPU and GPU that interface with the two GPU4FS processes. In this section, we present the GPU-side implementation, which is responsible for communicating with requesting processes via the shared command buffers described above and for communicating with the storage backend. FS management tasks like garbage collection and defragmentation are also controlled and run by the GPU.

A request to the file system always arrives through one of the command buffers, either through a buffer in VRAM issued by another GPU-side process or through a buffer in DRAM for any other request. Such a request usually causes accesses to the disk or PMem, either to load or to store information. Additionally, even a data load request can cause data changes on the storage medium, as metadata needs to be updated.

Read Request. First, we describe a read request without any data changes: after the request is parsed, the GPU tries to load the data from one of its caches, either in VRAM or in DRAM. We employ both caching levels, as VRAM is usually a lot smaller and DRAM is still a lot faster than contemporary storage devices like NVMe and PMem, and an early paper suggests the same behavior for Compute Express Link (CXL) implementations [25]. If the requested data is not cached, the data is instead fetched from storage. The main goal of GPU4FS is to bypass the CPU for storage access: PMem is mapped directly to the GPU, and NVMe memory is accessed via peer-to-peer DMA (P2PDMA) as suggested by Qureshi et al. [21] for their BaM system architecture. After the data is fetched, it is possibly cached depending on application hints. We add caching hints because some applications only require a file once; not caching such data frees up cache capacity for repeated accesses. Except for a shared mmap mapping, we allocate space for the result in the shared memory area and store the fetched data there. To signal completion, we store the data pointer and length into the initial command structure, and conclude the interaction with an atomic store to a completion flag.
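
A minimal sketch of this read path, under the assumption of invented helper functions for the cache and storage layers (none of these names come from GPU4FS itself), looks as follows.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct extent { const void *data; size_t len; };

/* Placeholders for the cache and storage layers; not the real GPU4FS code. */
extern bool vram_cache_lookup(const char *path, struct extent *out);
extern bool dram_cache_lookup(const char *path, struct extent *out);
extern struct extent fetch_from_storage(const char *path);  /* direct PMem load or P2PDMA */
extern void cache_insert(const char *path, struct extent e);
extern struct extent copy_to_shared_area(struct extent e);

static void handle_read(const char *path, bool cache_hint,
                        uint64_t *result_ptr, uint64_t *result_len,
                        _Atomic uint32_t *completion_flag)
{
    struct extent e;
    if (!vram_cache_lookup(path, &e) && !dram_cache_lookup(path, &e)) {
        e = fetch_from_storage(path);
        if (cache_hint)                 /* only cache data the app will reuse */
            cache_insert(path, e);
    }
    struct extent result = copy_to_shared_area(e);
    *result_ptr = (uint64_t)(uintptr_t)result.data;  /* pointer/length pair */
    *result_len = result.len;
    /* All stores above become visible before the completion flag flips. */
    atomic_store_explicit(completion_flag, 1, memory_order_release);
}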

Write Request. In the opposite direction, when data is to be written, a few more steps are required to guarantee data consistency and to allocate space for new files. Even though latency reduction for GPU-side applications is a major goal of GPU4FS, the overall latency for CPU-side applications is expected to grow. This increases the amount of data in flight, and therefore the pressure on the consistency system. We intend to solve this issue using a combination of log-structuring [23] and journaling [20]: data writes are done in a log-structured way, while metadata updates are journaled. The journal is therefore a major point of contention. To solve this issue, we reserve a big chunk of persistent storage for the journal, and assign each independent GPU workgroup one smaller subarea of this large chunk. To avoid multiple conflicting journal entries for the same object, we limit every file system object to at most one workgroup at runtime. A journal entry is fixed-size, and allocated using a single atomic store to a flag variable. Another atomic store marks the completion of the journal entry setup. That way, should a crash occur, the whole journal area can be scanned and operation completion can be verified. To commit a data write, the inode structure needs to be modified to add the pointers to the new file system blocks. Compared to a file system on a block device, PMem only allows for a smaller granularity of data persistence, which means that we cannot atomically change the inode in place. Instead, the journal allows recovery of either the old or the new state even though we use in-place modifications.
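
The two-store journaling idea can be sketched as below. The entry layout is an assumption of ours; only the ordering (one atomic store to allocate the fixed-size entry, a second to mark its setup as complete) follows the description above, and a real PMem implementation would additionally flush the payload between the two stores.

#include <stdatomic.h>
#include <stdint.h>

enum { ENTRY_FREE = 0, ENTRY_ALLOCATED = 1, ENTRY_COMPLETE = 2 };

struct journal_entry {
    _Atomic uint32_t state;   /* FREE -> ALLOCATED -> COMPLETE */
    uint32_t object_id;       /* file system object the entry belongs to */
    uint64_t old_block_ptr;   /* previous pointer, for recovery of the old state */
    uint64_t new_block_ptr;   /* pointer to the newly written blocks */
};

/* Each workgroup owns its own journal subarea, so a slot can be claimed
 * without contention in this sketch. */
static void journal_append(struct journal_entry *e, uint32_t object_id,
                           uint64_t old_ptr, uint64_t new_ptr)
{
    /* First atomic store: the fixed-size entry is now allocated. */
    atomic_store_explicit(&e->state, ENTRY_ALLOCATED, memory_order_release);
    e->object_id = object_id;
    e->old_block_ptr = old_ptr;
    e->new_block_ptr = new_ptr;
    /* Second atomic store: setup complete; a recovery scan may trust it. */
    atomic_store_explicit(&e->state, ENTRY_COMPLETE, memory_order_release);
}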


GPU4FS is designed from the ground up for modern storage with good random-access performance and for PMem direct-mapping capabilities. WineFS [11] shows the advantages of using hugepages for direct mapping, as compared to only 4 kB pages. GPU4FS incorporates this into its design, combined with a novel take on extents, a common feature in multiple file systems [7, 22]. We use 64-bit pointers in which two bits serve as flag bits to discern between aligned 4 KiB, 2 MiB, and 1 GiB pages as well as 256-byte inodes. Not only does this allow for direct mapping, but it also decreases the number of block pointers. On the GPU, a decrease in block pointers has the additional advantage of less pointer divergence, which means the GPU's memory management unit can coalesce more accesses. Similar to WineFS, the allocator maintains free lists of aligned pages for each of the sizes, which are split up into smaller blocks as needed. GPU4FS assumes a flat address space, and does not reserve special areas for inodes, directories, or indirect blocks filled with block pointers. This means that gradual fragmentation is a major concern, which we combat mainly by reusing garbage-collected space and by partial defragmentation if needed.
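
A possible encoding of such a tagged 64-bit pointer is sketched below. The concrete bit positions and class values are our assumptions; the text only states that two bits distinguish 256-byte inodes from aligned 4 KiB, 2 MiB, and 1 GiB blocks.

#include <stdint.h>

enum size_class { SC_INODE_256B = 0, SC_PAGE_4K = 1, SC_PAGE_2M = 2, SC_PAGE_1G = 3 };

/* All targets are at least 256-byte aligned, so the low two bits of the
 * offset are free to hold the size class. */
static inline uint64_t ptr_encode(uint64_t offset, enum size_class sc)
{
    return offset | (uint64_t)sc;
}

static inline enum size_class ptr_class(uint64_t p)  { return (enum size_class)(p & 0x3); }
static inline uint64_t        ptr_offset(uint64_t p) { return p & ~0x3ULL; }

static inline uint64_t ptr_block_size(uint64_t p)
{
    switch (ptr_class(p)) {
    case SC_INODE_256B: return 256;
    case SC_PAGE_4K:    return 4096;
    case SC_PAGE_2M:    return 2ull << 20;   /* 2 MiB */
    case SC_PAGE_1G:    return 1ull << 30;   /* 1 GiB */
    }
    return 0;
}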

Additional Features. Additionally, GPU4FS implements features found in modern file systems like BTRFS [22] and ZFS [2], namely checksums, deduplication, and storage distribution over multiple drives (RAID) [18]. We can use checksums to notify the application if data has been modified, which avoids propagating erroneous data into the application. On top of that, checksums enable deduplication to increase the usable storage area if identical blocks are stored, and they improve mapping write-back performance. A RAID implementation lets us combine the storage capacity of multiple drives for larger data sets. Additionally, in combination with checksums, we can use the redundant information in the RAID to recover from checksum errors. Both checksums and RAID require parallel computations, which is well suited to a GPU. For the checksum functionality, we select BLAKE3 [15] as the hash function, as it is designed for parallel computation but is still secure, with 128-bit collision resistance. To further increase the exploitable parallelism, we checksum each block individually. To differentiate between changes in the checksum as compared to the data, we also calculate the hash of every block storing checksums. We further utilize the checksums to avoid storing the same data twice. At runtime, checksums also help with the mmap implementation: if we use mmap in VRAM, the buffer is mapped as read/write to avoid page table modifications, but then no page faults occur on modification. To detect changes when writing back to disk, instead of having to compare every byte and potentially having to fetch it first, we can simply compare the checksums, thus increasing performance and decreasing storage latency.

The other important feature we implement is software RAID: GPU4FS takes heavy inspiration from BTRFS volumes and similarly offers runtime-configurable RAID levels using volumes. The design goal of supporting direct mapping leads to difficulties with parity information, so direct mapping is enabled on a per-file basis. We tag each block pointer as either physical or virtual: physical pointers encode a disk and an offset, while virtual pointers are translated through a BTRFS-inspired volume tree to blocks on disk, potentially multiple. This flexibility even allows different RAID levels for different parts of a file.
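
The checksum-based dirty-block detection at write-back time could look roughly like the sketch below. blake3_hash_256() is a placeholder of ours standing in for the real BLAKE3 library, and the 4 KiB block size is likewise an assumption for illustration.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct block_sum { uint8_t digest[32]; };

/* Placeholder for a BLAKE3 digest of one block (not the real library API). */
extern void blake3_hash_256(const void *data, size_t len, uint8_t out[32]);
extern void write_block_to_storage(size_t block_index, const void *data);

/* Flush only those blocks whose checksum no longer matches the stored one,
 * instead of comparing every byte against the on-disk copy. */
static void writeback_mapping(const uint8_t *mapping, size_t nblocks,
                              struct block_sum *stored)
{
    for (size_t i = 0; i < nblocks; i++) {
        uint8_t fresh[32];
        blake3_hash_256(mapping + i * BLOCK_SIZE, BLOCK_SIZE, fresh);
        if (memcmp(fresh, stored[i].digest, sizeof(fresh)) != 0) {
            write_block_to_storage(i, mapping + i * BLOCK_SIZE);
            memcpy(stored[i].digest, fresh, sizeof(fresh));  /* keep checksum current */
        }
    }
}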

2.3 CPU Implementation

In GPU4FS, the GPU is mainly responsible for the file system. Nonetheless, the CPU-side GPU4FS implementation plays a major role in establishing communication with the FS for both GPU-side and CPU-side clients, and CPU-side applications are able to access the FS on the GPU.

The connection is always established on the CPU, as some parts of the procedure require page table changes and thus kernel privileges. Hence, if a GPU-side client wants to interface with GPU4FS, the CPU-side process managing that GPU-side client is responsible for initializing the communication. The requesting CPU-side process starts by sending an inter-process communication (IPC) message to the GPU4FS CPU-side process, which includes the desired size for the GPU4FS communication buffer. The CPU-side GPU4FS process allocates a part of VRAM for GPU4FS, which is used for shared memory buffers, caches, and internal data, with the remainder of VRAM being used by the actual applications. Similarly, the CPU-side GPU4FS process controls shared memory buffers in DRAM for CPU-side FS clients. After the request, GPU4FS allocates some space in its VRAM area or in DRAM for communication, and maps it as a buffer into the requesting process. Here, the page tables need to be modified, so this part needs to run on the CPU. Given that the GPU controls the file system, the GPU-side GPU4FS is notified before communication can be fully established. The CPU-side and GPU-side GPU4FS processes communicate via the same shared memory interface as the other processes; the only difference is the existence of a few higher-privilege commands. One of these higher-privilege commands is used here by the CPU-side process to inform the GPU of the new connection and the newly allocated buffer. The GPU then initializes the buffer and the further data structures needed for communication, and signals completion to the GPU4FS CPU-side process. With this completion, the CPU-side process sends a response with the address of the new buffer to the requesting CPU-side process, which can then finally start using GPU4FS. The only difference between a CPU-side GPU4FS client and a GPU-side GPU4FS client is that in the GPU case two buffers are initialized, one in DRAM and one in VRAM, and the CPU-side requesting process hands the buffer address to its GPU process.

As mentioned above, during the communication setup, certain changes to the page tables of the requesting process need to be made. Additionally, this operation is needed for the shared mmap case. This operation is usually forbidden in user space and can only be implemented in the kernel. Following the design of Aerie [26], we add a minor kernel modification that allows the CPU-side implementation to map pages into other processes. In normal operation, FS access completely bypasses the kernel, which makes GPU4FS a user space file system and opens the possibility for performance gains. Figure 1 shows the complete setup with the command flow, including the kernel performing page table modifications for shared mmap.
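
Seen from a client, the handshake above boils down to one request and one response; the message layout below is an invented sketch under that assumption, and the actual IPC transport and message format of GPU4FS are not specified here.

#include <stdbool.h>
#include <stdint.h>

struct connect_request {
    uint32_t client_pid;        /* requesting process                        */
    uint64_t buffer_bytes;      /* desired size of the communication buffer  */
    bool     gpu_client;        /* GPU clients get a DRAM and a VRAM buffer  */
};

struct connect_response {
    void *dram_buffer;          /* command buffer mapped into the client     */
    void *vram_buffer;          /* only set for GPU-side clients             */
    int   error;                /* 0 on success                              */
};

/* Hypothetical client entry point: sends the IPC request, waits until the
 * GPU-side initialization has signalled completion, and returns the mapped
 * buffer addresses. */
struct connect_response gpu4fs_connect(const struct connect_request *req);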

Figure 1: Full-Scale GPU4FS design. Processes either in DRAM or VRAM communicate their requests to the GPU, which accesses storage. Optionally, shared mmap is handed to the kernel for page table modifications. (Diagram not reproduced here.)

Table 1: Test Platform
  CPU:   2× Xeon Silver 4215, operating at 2.5 GHz
  DRAM:  8× DDR4 at 2400 MT/s, 16 GiB
  PMem:  4× DDR-T at 2400 MT/s, 128 GiB
  GPU:   AMD Radeon RX 6800, VRAM 16 GiB, 16 PCIe Gen3 lanes

3 PRELIMINARY RESULTS

We implement a demonstrator to build the case for a full-blown GPU4FS implementation. The main concerns are the bandwidth and latency of the file system as compared to simple GPU access to storage without the file system overhead. Our demonstrator uses Vulkan [9] as the programming interface, running on the RADV [5] driver on Linux. Our implementation adds a simple write path to Optane PMem, and we evaluate directory creation using an EXT-inspired H-Tree approach [7] as well as the RAID address translation overhead. Due to the Vulkan implementation, we have to restart the file system for every test, which incorporates a startup latency of 12 ms. All tests run using Intel Optane as the storage medium; the exact specification can be found in Table 1. We make sure that the GPU accesses Optane without going through the inter-processor interconnect.

We achieve a maximum bandwidth of 1.5 GB/s to one Optane DIMM using our configuration. Even though the CPU can write with up to 2 GB/s, we use the GPU bandwidth as the baseline for our evaluation. Similarly, we also have to wait about 12 ms for the GPU to respond to our command, even if nothing happens in the file system. We assume this is because we have to reinitialize large buffers, which seems to be uncommon in the video game applications the driver is optimized for. In our complete implementation, we expect to reach bandwidth equality with the CPU for all storage media, and we expect the latency to drop to a few microseconds if GPU4FS is already running instead of having to be started.

The first file system test inserts a single file into a folder, to verify that, given a large enough file size, the bandwidth limit mentioned above is reached. In the plot given in Figure 2, we reach the maximum possible bandwidth of 1.49 GB/s at a file size of 128 MiB. This means that the file system does not add an inherent overhead to every copy operation. This plot includes the 12 ms startup latency, to show that we can be bandwidth-competitive even including the higher latency.

Figure 2: GPU write bandwidth to Intel Optane, for one file with different sizes. The bandwidth increases to the measured maximum of 1.5 GB/s. (Plot not reproduced; axes: file size in bytes, 10^4 to 10^8, versus bandwidth in GiB/s.)

We also evaluate metadata operations in isolation, but subtract the startup latency here, as the runtime of the metadata operation is similar to the startup latency. As the metadata operation, we create a deep directory chain, similar to a call to mkdir -p a/b/c/d/.... On the GPU, we issue a single command, which also shows the benefit of our more general interface, whereas the CPU has to issue repeated mkdirat() calls, which incurs repeated syscall overhead. We compare to EXT4, as GPU4FS uses EXT4-inspired H-trees. The results can be seen in Figure 3. GPU4FS is slightly slower than the CPU for large enough requests, but we expect metadata operations to be few and their latency to be hidden by other operations. Also, we expect to reduce the additional latency in the future. The main takeaway is that even dependent metadata operations, though rare, can be executed efficiently on the GPU.
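
For reference, the CPU baseline of this benchmark issues one syscall per directory level, while GPU4FS accepts the whole chain as a single command; in the sketch below, gpu4fs_mkdir_path() is an invented name for that single command, and the loop mirrors what mkdir -p does.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* CPU baseline: one mkdirat() syscall per directory level. */
static void mkdir_chain_posix(int depth)
{
    char path[4096] = "";
    for (int i = 0; i < depth; i++) {
        size_t len = strlen(path);
        snprintf(path + len, sizeof(path) - len, len ? "/d%d" : "d%d", i);
        mkdirat(AT_FDCWD, path, 0755);   /* repeated syscall overhead */
    }
}

/* GPU4FS: the full chain is passed in one shared-memory command. */
int gpu4fs_mkdir_path(const char *chain);   /* hypothetical, e.g. "a/b/c/d" */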

Figure 3: Directory creation time per directory tree depth. At about 1000 directories deep, the GPU and CPU come close. (Plot not reproduced; axes: number of directories, 10^0 to 10^3, versus time in seconds on a log scale; series: GPU4FS and CPU EXT4.)

3.1 Discussion

Our current implementation is not fully optimized, as can be seen by the sub-optimal bandwidth and latency. Currently, we use the RADV driver, which is designed for video games, not for compute, which might contribute especially to the latency issues. In the future, we intend to use AMD's ROCm stack [1], an actual compute-focused API, and to optimize our code for the hardware in use based on that new implementation. This will also enable GPU4FS to operate as a daemon instead of having to start the system and incur other slowdowns, such as TLB misses or cache misses. We expect the latency to drop further when requests are issued from the GPU and when the file system runs as a daemon instead of having to set up the GPU for every run. We also expect the bandwidth to increase with further optimizations. Nonetheless, there are definitely cases in which the GPU side comes close to CPU-side implementations, even when tested using CPU access to the GPU and when incurring additional latency. The results show the potential of a full implementation.

4 RELATED WORK

In this section, we compare GPU4FS to other file systems and GPU projects.

4.1 File Systems

Kernel File Systems. File systems have been such a major component of operating systems that they are part of the Portable Operating System Interface (POSIX) [17]. POSIX suggests a file system implementation inside the kernel, thus enforcing system calls for most operations. In Linux, the system calls are the defined interface. POSIX also lists metadata that can be requested for file system objects. Examples of mostly POSIX-compliant file systems are the EXT family of file systems [7] as simple file systems, and more feature-rich file systems with RAID, full data checksums, deduplication, and encryption, like ZFS [2] and BTRFS [22]. To reach our goal of POSIX compliance for legacy applications, we take inspiration from the aforementioned file systems in both features and data layout.

A recent point of comparison are PMem file systems: in NOVA [27], the file system uses direct pointers instead of indices into data structures to cater to modern storage media, and it uses different logging strategies for data and metadata, similar to GPU4FS. In comparison to NOVA, GPU4FS changes the metadata log to a classical journal for ease of access and to move the point of parallelization from the file system object to the workgroup. WineFS [11] inspired our tiered allocator for the pages, but we also use the pages as a lightweight implementation of extents, thus keeping the indirection tree small.

User Space File Systems. As compared to file systems completely or mostly implemented in the kernel, Aerie [26] shows the benefit of file systems in user space, especially for PMem storage. To support POSIX, they add a small kernel module to implement the required semantics. Strata [13] and EvFS [28] show the use of tiered storage and of multiple layers of caching inside user space file systems. However, these file systems all run on the CPU. A GPU is much less optimized for random jumps and pointer chasing, a common feature in these file systems. In comparison, we optimize to extract as much parallelism from the file system as possible, while keeping the benefits of a user space file system.

An interesting hybrid between kernel space and user space file systems is the "Filesystem in Userspace" (FUSE) [6]. FUSE allows the implementation of file systems in user space while preserving the kernel interface, by performing an upcall from the kernel into the user space FUSE driver whenever the kernel is called by an application. FUSE is also implemented using shared memory buffers between user and kernel, similar to GPU4FS, but these buffers are in DRAM, instead of VRAM or even the memory of a different device.

4.2 GPUs

Applications running on GPUs include graphics [4], high performance computing applications [14], and artificial intelligence applications [8], which all require frequent data access or are currently limited by VRAM capacity. Each of these applications can profit from GPU4FS.

GPU-side File Systems. The main comparison point among GPU file systems is GPUfs [24], which demonstrates the use of a file system interface on the GPU by allowing the GPU to access the CPU-side file system. In GPU4FS, we run the complete file system on the GPU, and let the CPU access the file system. Qureshi et al. [21] and Pandey et al. [16] show the validity of direct storage access from the GPU, but they never use it for full file systems. Prior work has implemented several parts of the file system on accelerators, e.g., RAID [3, 12], checksums [15], or encryption [10], but these parts have never been integrated into one file system.

5 CONCLUSION

In conclusion, in this paper we propose the design of GPU4FS, a novel GPU-side file system with interfaces to be used from both the CPU and the GPU. We also build the case for a complete realization using our preliminary implementation, which demonstrates that the bandwidth limits are not imposed by the file system implementation, but instead by the storage bandwidth.

With this result, we intend to first finish the port to the new ROCm [1] platform, and use it to increase performance. The next step is to implement the consistency mode, as its complexity is high and it is a central part of the design. Going on from there, we will focus on allocation and garbage collection, and integrate the remaining features like RAID, checksumming, as well as kernel communication on top.
REFERENCES

[1] Advanced Micro Devices, Inc. 2021. AMD ROCm™ documentation. https://rocmdocs.amd.com/en/latest/index.html
[2] Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, and Mark Shellenbaum. 2002. The Zettabyte File System. https://www.cs.hmc.edu/~rhodes/courses/cs134/fa20/readings/The%20Zettabyte%20File%20System.pdf
[3] Matthew L. Curry, H. Lee Ward, Anthony Skjellum, and Ron Brightwell. 2010. A Lightweight, GPU-Based Software RAID System. In 2010 39th International Conference on Parallel Processing. 565-572. https://doi.org/10.1109/ICPP.2010.64
[4] Blender Developers. 2022. blender. https://www.blender.org/
[5] Freedesktop Developers. 2022. RADV: a Vulkan driver for AMD GCN/RDNA GPUs. https://docs.mesa3d.org/drivers/radv.html
[6] FUSE Developers. May 06, 2022. Filesystem in Userspace. https://github.com/libfuse/libfuse
[7] Linux Kernel Developers. September 20, 2016. EXT4 Linux kernel wiki. https://ext4.wiki.kernel.org/index.php/Main_Page
[8] TensorFlow Developers. 2022. TensorFlow. Zenodo (2022).
[9] Khronos® Group. 2022. Khronos Vulkan Registry. https://registry.khronos.org/vulkan/
[10] Keisuke Iwai, Naoki Nishikawa, and Takakazu Kurokawa. 2012. Acceleration of AES encryption on CUDA GPU. International Journal of Networking and Computing 2, 1 (2012), 131-145.
[11] Rohan Kadekodi, Saurabh Kadekodi, Soujanya Ponnapalli, Harshad Shirwadkar, Gregory R. Ganger, Aasheesh Kolli, and Vijay Chidambaram. 2021. WineFS: a hugepage-aware file system for persistent memory that ages gracefully. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (Virtual Event, Germany) (SOSP '21). Association for Computing Machinery, New York, NY, USA, 804-818. https://doi.org/10.1145/3477132.3483567
[12] Aleksandr Khasymski, M. Mustafa Rafique, Ali R. Butt, Sudharshan S. Vazhkudai, and Dimitrios S. Nikolopoulos. 2012. On the Use of GPUs in Realizing Cost-Effective Distributed RAID. In 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems. 469-478. https://doi.org/10.1109/MASCOTS.2012.59
[13] Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, and Thomas Anderson. 2017. Strata: A Cross Media File System. In Proceedings of the 26th Symposium on Operating Systems Principles (Shanghai, China) (SOSP '17). Association for Computing Machinery, New York, NY, USA, 460-477. https://doi.org/10.1145/3132747.3132770
[14] Christoph A. Niedermeier, Christian F. Janßen, and Thomas Indinger. 2018. Massively-parallel multi-GPU simulations for fast and accurate automotive aerodynamics. In Proceedings of the 7th European Conference on Computational Fluid Dynamics, Glasgow, Scotland, UK, Vol. 6. 2018.
[15] Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and Zooko Wilcox-O'Hearn. 2021. BLAKE3 - One function, fast everywhere. https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf
[16] Shweta Pandey, Aditya K Kamath, and Arkaprava Basu. 2022. GPM: leveraging persistent memory from a GPU. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS '22). Association for Computing Machinery, New York, NY, USA, 142-156. https://doi.org/10.1145/3503222.3507758
[17] PASC. 2018. The Open Group Base Specifications Issue 7, 2018 edition. https://pubs.opengroup.org/onlinepubs/9699919799/
[18] David A. Patterson, Garth Gibson, and Randy H. Katz. 1988. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data (Chicago, Illinois, USA) (SIGMOD '88). Association for Computing Machinery, New York, NY, USA, 109-116. https://doi.org/10.1145/50202.50214
[19] Ivy B. Peng, Maya B. Gokhale, and Eric W. Green. 2019. System Evaluation of the Intel Optane Byte-Addressable NVM. In Proceedings of the International Symposium on Memory Systems (Washington, District of Columbia, USA) (MEMSYS '19). Association for Computing Machinery, New York, NY, USA, 304-315. https://doi.org/10.1145/3357526.3357568
[20] Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2005. Analysis and Evolution of Journaling File Systems. In USENIX Annual Technical Conference, General Track, Vol. 194. 196-215.
[21] Zaid Qureshi, Vikram Sharma Mailthody, Isaac Gelado, Seungwon Min, Amna Masood, Jeongmin Park, Jinjun Xiong, C. J. Newburn, Dmitri Vainbrand, I-Hsin Chung, Michael Garland, William Dally, and Wen-mei Hwu. 2023. GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, New York, NY, USA, 325-339. https://doi.org/10.1145/3575693.3575748
[22] Ohad Rodeh, Josef Bacik, and Chris Mason. 2013. BTRFS: The Linux B-Tree Filesystem. ACM Trans. Storage 9, 3, Article 9 (aug 2013), 32 pages. https://doi.org/10.1145/2501620.2501623
[23] Mendel Rosenblum and John K. Ousterhout. 1992. The Design and Implementation of a Log-Structured File System. ACM Trans. Comput. Syst. 10, 1 (feb 1992), 26-52. https://doi.org/10.1145/146941.146943
[24] Mark Silberstein, Bryan Ford, Idit Keidar, and Emmett Witchel. 2013. GPUfs: Integrating a File System with GPUs. SIGARCH Comput. Archit. News 41, 1 (mar 2013), 485-498. https://doi.org/10.1145/2490301.2451169
[25] Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Ipoom Jeong, Ren Wang, and Nam Sung Kim. 2023. Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices. arXiv:2303.15375 [cs.PF] https://doi.org/10.48550/arXiv.2303.15375
[26] Haris Volos, Sanketh Nalli, Sankarlingam Panneerselvam, Venkatanathan Varadarajan, Prashant Saxena, and Michael M. Swift. 2014. Aerie: Flexible File-System Interfaces to Storage-Class Memory. In Proceedings of the Ninth European Conference on Computer Systems (Amsterdam, The Netherlands) (EuroSys '14). Association for Computing Machinery, New York, NY, USA, Article 14, 14 pages. https://doi.org/10.1145/2592798.2592810
[27] Jian Xu and Steven Swanson. 2016. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. In 14th USENIX Conference on File and Storage Technologies (FAST 16). USENIX Association, Santa Clara, CA, 323-338. https://www.usenix.org/conference/fast16/technical-sessions/presentation/xu
[28] Takeshi Yoshimura, Tatsuhiro Chiba, and Hiroshi Horii. 2019. EvFS: User-Level, Event-Driven File System for Non-Volatile Memory. In Proceedings of the 11th USENIX Conference on Hot Topics in Storage and File Systems (Renton, WA, USA) (HotStorage '19). USENIX Association, USA, 16. https://dl.acm.org/doi/10.5555/3357062.3357083
