SoftSSD: Software-Defined SSD Development Platform for Rapid Flash Firmware Prototyping
Zili Shao
Department of Computer Science and Engineering
The Chinese University of Hong Kong
Hong Kong, China
[email protected]
Abstract—Recently, solid-state drives (SSDs) have been used in a wide range of emerging data processing systems. Essentially, an SSD is a complex embedded system that involves both hardware and software design. For the latter, firmware modules such as the flash translation layer (FTL) orchestrate internal operations and flash management, and are crucial to the overall I/O performance of an SSD. Despite the rapid development of new SSD features in the market, research on flash firmware has been mostly simulation-based due to the lack of a realistic and extensible SSD development platform. In this paper, we propose SoftSSD, a software-defined SSD development platform for rapid flash firmware prototyping. The core of SoftSSD is a novel framework with an event-driven programming model. With the programming model, new FTL algorithms can be implemented and integrated into a full-featured flash firmware in a straightforward way. The resulting flash firmware can be deployed and evaluated on a hardware development board, which can be connected to a host system via PCIe and serve as a normal NVMe SSD. Different from existing hardware-oriented development platforms, SoftSSD implements the majority of SSD components (e.g., the host interface controller) in software, so that data flows and internal states that were once confined in the hardware can now be examined with a software debugger, providing the observability and extensibility that are critical to the rapid prototyping and research of flash firmware. This paper describes the programming model and hardware design of SoftSSD. We also perform experiments with real application workloads on a prototype board to demonstrate the performance and usefulness of SoftSSD, and we have released the open-source code of SoftSSD for public access.

Index Terms—solid-state drives, heterogeneous embedded system, software hardware co-design

I. INTRODUCTION

Recently, solid-state drives (SSDs) have been used in a wide range of emerging data processing systems [12], [10]. Compared to traditional magnetic disks, SSDs deliver higher I/O throughput as well as lower latency, which makes them one of the best choices as the backing storage for large-scale data processing systems demanding high I/O performance. The design of modern SSDs has gone through intensive optimizations to reduce request latency and improve throughput. Meanwhile, novel storage interface protocols, such as NVM Express (NVMe) [3], have been proposed to reduce the communication overhead between the host and the device and to harvest the high I/O performance enabled by modern SSDs. Under the hood, SSDs persist data on NAND flash memory, which retains data after power loss [5]. The flash memory chips typically form an array with multiple channels so that user requests can be distributed among them to better utilize the I/O parallelism.

Although NAND flash memory is the core component of SSDs, it does not work off-the-shelf. A smart storage controller must be used to orchestrate the internal operations of SSDs and the data flows between the host interface and the flash memory. By nature, the storage controller is a complex embedded system that involves both hardware and software design. The former includes the physical interfaces to the buses through which the storage controller is connected to the host and the underlying flash memory array. For the latter, firmware modules such as the flash translation layer (FTL) [7], [13] are implemented to handle the internal operations and flash management of SSDs. In short, the main task of a storage controller is to accept requests from the host interface, perform the necessary transformations on them, and finally dispatch them through multiple channels to the underlying flash memory array.

Storage controllers are responsible for various operations on the critical path of request processing. Thus, the performance delivered by an SSD depends heavily on the storage controller. In order to handle the tremendous number of concurrent I/O requests arriving through the multiple submission queues provided by the flexible host interface protocol, the storage controller, and especially the flash firmware running on top of it, needs to perform internal SSD operations such as address translation and cache lookup with high efficiency. On the other hand, the architecture of modern SSDs has been designed in a hierarchical manner so that flash transactions can be distributed across multiple channels and executed in parallel to obtain maximal throughput [6].
[Fig. 1: Comparison of SoftSSD and existing hardware-oriented platforms. (a) Existing platforms; (b) SoftSSD.]

Thus, the storage controller needs to handle the queuing and scheduling of flash transactions to fully utilize the massive internal I/O parallelism. These internal tasks can incur a high computational load when there is a large number of concurrent user requests, and the flash firmware can soon become the bottleneck if the FTL algorithms are not well designed. Furthermore, manufacturers have begun to employ high-performance multi-core microprocessors on SSDs to adapt to the massive I/O parallelism of the underlying flash storage. With a multi-core processor, the flash firmware can start multiple worker threads to process the computational work in parallel to improve throughput and hide request latency [16]. This opens new opportunities in the research of flash firmware, in which an extensible SSD development board is required for rapid prototyping and evaluation.

In addition, previous research on flash firmware mainly focuses on different aspects of the flash translation layer, such as address mapping and garbage collection, based on an SSD simulator [14], [8], [11]. However, as the hardware organization of SSDs is constantly evolving, it is impossible to evaluate the improvement of an FTL algorithm without putting it into the context of a full SSD system as part of an end-to-end flash firmware. In that case, the FTL algorithms inevitably need to interact with the underlying hardware resources and other low-level components of the flash firmware. To enable rapid flash firmware development, a programming model is required to provide basic hardware abstraction and a runtime, so that FTL algorithms can be implemented, integrated into a full-featured flash firmware, and evaluated in a straightforward way.

Conventionally, many components of a storage controller, such as the host interface controller, the flash transaction scheduler and the ECC engine, are implemented in hardware on an FPGA or ASIC. Although such a hardware implementation can generally achieve higher performance than its software counterpart, it causes several difficulties for investigating and prototyping new flash firmware. Due to the complexity of modern storage protocols, implementing a host interface controller in pure hardware requires non-trivial effort. Also, new extensions and new transports, such as the NVMe key-value (KV) command set and remote direct memory access (RDMA), keep emerging.

[...]

Given the limitations of both simulation-based platforms and hardware-oriented platforms, we propose SoftSSD, a novel softwarized and flexible SSD development platform for rapid flash firmware prototyping, as shown in Figure 1(b). The core of SoftSSD is a development board which can be plugged into a host machine and serve as a normal NVMe SSD. Different from existing hardware-oriented approaches, we implement only a small number of necessary components of the storage controller in hardware on an FPGA. These components include the physical-layer interfaces to the host bus (e.g., the PCIe link) and the NAND flash chips. Such interfaces are defined by specifications and are thus not subject to frequent changes. Furthermore, these hardware components are only required to handle simple tasks, such as receiving raw transaction layer packets from the PCIe link, so they can be extended to support newer revisions of the interfaces. The other components of SoftSSD, including the NVMe interface and the flash transaction scheduler, are developed in pure software as parts of the flash firmware modules and run on an ARM-based embedded processor. These components, as well as the flash firmware, can be reconfigured and reprogrammed for research purposes. With SoftSSD, data flows and internal states that were once confined in the hardware design are now processed by the software and can be examined with a debugger, providing the observability and visibility that are critical to the rapid prototyping and research of flash firmware.

However, implementing SSD components in software brings new challenges. First, compared to specialized hardware implementations, software implementations provide more flexibility at the cost of lower performance, which poses a great challenge for SoftSSD. Second, as SSD components are now an integrated part of the flash firmware, a new programming model is required to hide the details of interacting with the hardware and to enable implementing and assembling these modules, including the FTL, into a flash firmware in a straightforward way. To this end, we propose a novel framework with an event-driven threaded programming model for SSD firmware. Under the framework, user requests are handled as small tasks that can be assigned to multiple threads and scheduled to maximize CPU utilization and thus enhance the performance of SoftSSD. Furthermore, the flash firmware built with the proposed framework can be deployed on a multi-core heterogeneous microprocessor to process I/O requests in parallel.

We implement both the programming model and the hardware components of SoftSSD and carry out a performance evaluation. We connect SoftSSD as a standard NVMe SSD to a real host system and conduct experiments with application workloads to demonstrate the performance of SoftSSD. Experimental results show that SoftSSD can achieve good performance for real I/O workloads while providing observability and extensibility.

[Fig. 2 (caption not recovered); visible labels: Flash Firmware, Processor, DRAM.]

[...]
III. HARDWARE DESIGN

A. PCIe Interface

The root complex and PCIe switches route and forward TLPs based on the destination addresses provided in the packets. Other functionalities of PCIe are built on top of this packet-based communication. For example, to perform a direct memory access (DMA) to the host memory, a PCIe device sends a memory read/write transaction packet with the data payload to the root complex. The root complex has access to the host memory and completes the transaction based on the TLP, optionally sending back a completion packet to the requesting device for non-posted requests (e.g., memory reads). Message-signaled interrupts (MSIs) are sent from a PCIe device to the host by writing pre-configured data to a memory address specified by the interrupt controller.

In SoftSSD, we only build the physical and the data link layers in hardware on the FPGA. The two layers receive/transmit raw TLPs from/to the PCIe link. The TLPs received from the link or to be transmitted are divided into two streams based on whether the device is the requester or the completer of the transaction. Two DMA interfaces move the raw TLPs between the PCIe controller and the device DRAM. After the raw TLPs arrive at the device DRAM, they are parsed and processed by the software PCIe transaction layer. The completer interface handles memory read (MRd) or memory write (MWr) requests from the host to the memory-mapped I/O regions exposed by SoftSSD through the base address registers (BARs). For memory write requests, the write data is attached as payload in the TLPs, and the device does not need to send back completion packets to the host. Memory read requests contain a unique tag but do not carry any data payload. Upon processing a memory read TLP, the software transaction layer sends a completion TLP containing the read data with the same tag as the request packet. The requester interface is used by the flash firmware to initiate DMA requests to the host memory. To send a DMA request, the flash firmware prepares a TLP in the device DRAM and configures the requester DMA controller to move the packet to the PCIe controller for transmission on the PCIe link. For DMA read requests, the host sends back the response data with a completion TLP, which is moved to the device DRAM and processed by the software transaction layer.
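To make the split between the hardware link layer and the software transaction layer concrete, the sketch below shows how a completer-side handler might dispatch raw TLPs. The header field offsets loosely follow the PCIe TLP layout but are simplified, and the bar_read/bar_write/send_completion_tlp helpers and the 32-word payload cap are illustrative assumptions, not SoftSSD's actual interfaces.

    /* Illustrative completer path of the software PCIe transaction
     * layer; treat the field offsets and helpers as assumptions. */
    #include <stdint.h>

    #define TLP_FMT_TYPE(dw0)  (((dw0) >> 24) & 0xffu)
    #define TLP_LEN_DW(dw0)    ((dw0) & 0x3ffu)   /* payload length in DWs */
    #define TLP_TAG(dw1)       (((dw1) >> 8) & 0xffu)
    #define TLP_MRD 0x00u   /* memory read: no payload, completion required */
    #define TLP_MWR 0x40u   /* memory write: payload attached, posted       */

    void bar_write(uint64_t addr, const uint32_t *src, uint32_t ndw);
    void bar_read(uint64_t addr, uint32_t *dst, uint32_t ndw);
    void send_completion_tlp(uint8_t tag, const uint32_t *data, uint32_t ndw);

    /* Called for every raw TLP that the completer DMA interface has
     * parked in the device DRAM. */
    void completer_handle_tlp(const uint32_t *tlp)
    {
        uint32_t ndw  = TLP_LEN_DW(tlp[0]);
        uint64_t addr = ((uint64_t)tlp[2] << 32) | tlp[3];

        switch (TLP_FMT_TYPE(tlp[0])) {
        case TLP_MWR:                      /* posted: no completion sent */
            bar_write(addr, &tlp[4], ndw);
            break;
        case TLP_MRD: {                    /* non-posted: reply with the
                                              same tag as the request   */
            uint32_t data[32];             /* MMIO reads are small here  */
            if (ndw > 32)
                ndw = 32;
            bar_read(addr, data, ndw);
            send_completion_tlp(TLP_TAG(tlp[1]), data, ndw);
            break;
        }
        }
    }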
B. NAND Flash Interface

Besides the PCIe interface, we also implement the interface to the underlying NAND flash array in hardware. As shown in Figure 2, the NAND flash interface consists of one or more flash channel controllers, each of which is responsible for exchanging flash commands and data with the flash chips connected via one flash channel. Each flash channel has a control path, which is used by the flash firmware to issue flash commands, and a data path, which transfers data between the device DRAM and the NAND flash memory. The control path consists of a command interface, which exposes a set of memory-mapped registers for the flash firmware to configure the command opcode and the physical address of the target flash page. Upon receiving a command from the flash firmware, the NAND flash controller converts the command opcode into a raw NAND flash command and transfers it to the NAND flash memory through the flash channel. The control path also reads the status of the flash devices (i.e., whether they are ready to accept new commands) and exposes it through a status register which can be accessed by the flash firmware.
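As a concrete illustration of the control path, the following sketch issues a read-page command through memory-mapped registers. The base address, register offsets and opcode encoding are assumptions made for the example; the paper only specifies that such a register interface exists.

    /* Hypothetical control-path registers of a flash channel
     * controller; the base address and offsets are assumptions. */
    #include <stdint.h>

    #define FLASH_CH_BASE  0xA0000000UL   /* assumed MMIO base address    */
    #define REG_CMD_OPCODE 0x00           /* command opcode               */
    #define REG_CMD_ADDR   0x04           /* physical page address        */
    #define REG_CMD_ISSUE  0x08           /* write 1 to issue the command */
    #define REG_STATUS     0x0C           /* per-chip ready bits          */

    static inline void mmio_write(uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(FLASH_CH_BASE + off) = val;
    }

    static inline uint32_t mmio_read(uint32_t off)
    {
        return *(volatile uint32_t *)(FLASH_CH_BASE + off);
    }

    /* Issue a read-page command once the target chip reports ready;
     * the caller retries later if the chip is busy. */
    int flash_issue_read(uint32_t chip, uint32_t page_addr)
    {
        if (!(mmio_read(REG_STATUS) & (1u << chip)))
            return -1;                        /* chip not ready */
        mmio_write(REG_CMD_OPCODE, 0x30);     /* assumed read opcode */
        mmio_write(REG_CMD_ADDR, page_addr);
        mmio_write(REG_CMD_ISSUE, 1);
        return 0;
    }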
The data path of a flash channel controller is responsible for transferring data between the device DRAM and the NAND flash memory and for ensuring data integrity. It includes a DMA controller, an ECC encoder and an ECC detector. The DMA controller performs data movement between the NAND flash controller and the device DRAM. In SoftSSD, we deliberately implement the ECC engine in software so that different encoding schemes can be used to protect the user data. However, processing all aspects of ECC in software can still incur prohibitive overhead. To reduce the cost of software ECC, we implement a simple error detector in the hardware flash channel controller. If no error is detected in the read data, the expensive software error correction can be bypassed.
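The read-path policy can be summarized in a few lines, as in the sketch below; the function names are placeholders for the hardware detector flag and the software decoder, which the paper does not name.

    /* Read-path ECC policy: bypass software correction when the
     * hardware detector reports a clean page. Names are placeholders. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    bool ecc_hw_error_detected(uint32_t channel);   /* from the channel ctrl */
    int  ecc_sw_correct(uint8_t *data, uint8_t *code, size_t len);

    int read_page_check_ecc(uint32_t channel, uint8_t *data,
                            uint8_t *code, size_t len)
    {
        if (!ecc_hw_error_detected(channel))
            return 0;   /* common case: no bit errors, skip software ECC */
        /* Rare case: hand the data block and its stored code bits to
         * the software ECC engine (dispatched to the ECC core). */
        return ecc_sw_correct(data, code, len);
    }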
IV. SOFTWARE DESIGN

Based on the minimal set of functionalities implemented by the hardware design, we can build the flash firmware to serve user requests. In this section, we discuss the programming model and the implementation of the software components in SoftSSD.

A. Programming model

The FTL processing of a user request involves multiple blocking operations, and the processing may be suspended before such operations complete. For example, before issuing a flash command to write data to the NAND flash memory, the flash firmware needs to wait for the PCIe controller to transfer the write data from the host memory into internal buffers in the device DRAM. In order to exploit the data access parallelism, the flash firmware should continue to process new requests when the current request is blocked on host DMA or flash access operations instead of waiting for the operation to complete. For this reason, the flash firmware needs to efficiently switch between requests once a request is blocked, so that multiple requests can be processed in a concurrent manner. Existing methods tackle this problem by separating request processing into several stages. A user request of arbitrary size is divided into slice commands that request fixed-sized data. Each stage processes one slice command at a time, and when a slice command is blocked, the stage puts the command in an internal output queue for the next stage to retrieve. After the blocking operation completes, the slice command is resumed by the next stage. As such, user requests are processed in a pipelined manner to maximize throughput. However, existing FTL algorithms still need to be re-designed and manually divided into multiple stages based on the possible suspension points in order to be used on such existing SSD platforms. Furthermore, there can be dependencies between different stages. For example, during the address translation stage, requests may be issued to the flash transaction stage to read in the pages that contain translation information. Such dependencies make it more difficult to implement FTL algorithms as multiple stages.
[Fig. 3: An example of request processing under the proposed programming model with two threads.]

[Fig. 4: The multi-core heterogeneous architecture of SoftSSD.]
In SoftSSD, instead of dividing the flash firmware into stages, we map user requests to threads. Each request accepted through the NVMe interface is assigned to a thread, which runs the request until its completion. Once a request is blocked on data transfers or flash operations, its thread is switched out so that other requests can be processed in an interleaved way. Figure 3 shows an example of concurrent request processing with two threads. During the processing of a user request, the execution may be blocked by various operations. For example, during address translation, the FTL may need to issue flash read commands to load missing mapping table pages from the flash memory (Mapping Table Read). It can also be blocked on DMA transfers between the host memory, the device DRAM and the NAND flash memory (Host DMA and Flash DMA). Whenever a thread is blocked on a specific operation, it puts itself in a wait queue and suspends itself. The scheduler then picks another thread with pending user requests to continue execution. Later, when the operation completes, the scheduler is notified, and it checks the wait queue to resume the suspended threads. With this framework, we can overlap computational tasks with blocking I/O operations so that multiple NVMe requests can be processed concurrently to fully utilize the data access parallelism and maximize throughput. Also, the entire request processing is implemented as a monolithic threaded function in a straightforward way, without dividing FTL algorithms into multiple stages connected through message queues.

In this paper, we use a coroutine-based asynchronous framework to implement the proposed programming model. Coroutines are similar to normal functions, but their execution can be suspended and resumed at manually defined suspension points. Local variables used by a coroutine are preserved across suspensions until the coroutine is destroyed. Specifically, we implement one variant called stackful coroutines, in which each coroutine has its own separate runtime stack instead of sharing the stack of the underlying thread. With stackful coroutines, the local state and the call stack of a coroutine are preserved on the private stack. Stackful coroutines can be implemented as a general-purpose library without special syntax (e.g., async/await) provided by the language compilers, and FTL algorithms can be integrated into the asynchronous framework with minor modifications.
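To illustrate what the model buys, the sketch below writes an entire write-request handler as one coroutine function. Every await_*() helper is hypothetical and stands for "enqueue this coroutine on the operation's wait queue, yield to the scheduler, and resume on completion"; the 4KB logical block size is also an assumption.

    /* A write-request handler as a single stackful coroutine. Each
     * await_*() helper is an assumed suspension point: it parks the
     * coroutine on a wait queue and yields until the operation
     * signals completion, so other requests run in the meantime. */
    #include <stdint.h>

    struct nvme_req { uint64_t slba; uint32_t nlb; };

    void    *await_host_dma(const struct nvme_req *r); /* PCIe DMA from host */
    uint64_t await_translate(uint64_t lpn);  /* may block on a mapping read */
    void     await_flash_write(uint64_t ppa, const void *buf);
    void     nvme_complete(const struct nvme_req *r);  /* CQ entry + MSI */

    void handle_write_request(struct nvme_req *r)
    {
        /* Suspension point 1: wait for the write data to land in DRAM. */
        const uint8_t *buf = await_host_dma(r);

        for (uint32_t i = 0; i < r->nlb; i++) {
            /* Suspension point 2: translation may itself issue a flash
             * read for a missing mapping-table page (Fig. 3). */
            uint64_t ppa = await_translate(r->slba + i);
            /* Suspension point 3: wait for the flash program to finish;
             * 4KB logical blocks are assumed for the example. */
            await_flash_write(ppa, buf + (uint64_t)i * 4096);
        }
        nvme_complete(r);
    }

Because the coroutine keeps its own stack, the loop index and buffer pointer survive every suspension without any explicit state machine or per-stage message queue.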
B. Heterogeneous Multi-core Architecture

In order to handle the tremendous number of concurrent I/O requests arriving over multiple queues, manufacturers have begun to employ high-performance multi-core microprocessors with more computational power on SSDs. In SoftSSD, we run the flash firmware on an ARM-based heterogeneous multi-core microprocessor with one 64-bit Cortex-A53 core and two 32-bit Cortex-R5 cores. To better utilize the parallel processing enabled by the multi-core processor, we adopt a layered design for the flash firmware and divide it into three modules that can independently run on different CPU cores. The flash translation layer (FTL) module implements the software NVMe interface, which communicates with the host OS driver through the PCIe controller. It also handles most of the computational tasks (i.e., cache lookup and address mapping) for a user request. Thus, we run the FTL module on the Cortex-A53 core, which has the highest computational power. The flash interface layer (FIL) module manages the low-level flash transactions issued by the FTL module and dispatches them to the NAND flash memory through the flash channel controllers. It also continuously sends commands to the NAND flash memory to poll the NAND device status. Finally, the ECC engine module implements the software error correction logic to ensure data integrity.

The FTL module communicates with the FIL and the ECC engine modules through two ring queue pairs located in a shared memory region. Each queue pair has an available ring and a used ring. When the FTL module needs to send a request to the other two modules, it enqueues a command in the corresponding available ring. The FIL and the ECC engine modules continuously poll the available ring for new commands. For the FIL module, each command represents a read or write flash transaction that needs to be executed on the target flash chip. For the ECC engine module, each command contains the data block and the corresponding stored code bits that need to be corrected. After a command is processed, the two modules enqueue a completion entry in the used ring and send an inter-processor interrupt (IPI) to the FTL module to notify it of the command completion.
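A minimal single-producer/single-consumer ring of this kind is sketched below; the slot layout, sizes and barrier placement are generic assumptions, and the descriptor format is left opaque since the paper does not specify it.

    /* Minimal SPSC ring in the style of the available/used rings
     * shared by the FTL, FIL and ECC cores. Layout is an assumption. */
    #include <stdint.h>
    #include <stdbool.h>

    #define RING_SLOTS 256u          /* power of two: masking, no modulo */

    struct ring {
        volatile uint32_t head;      /* advanced by the consumer */
        volatile uint32_t tail;      /* advanced by the producer */
        uint64_t slot[RING_SLOTS];   /* opaque command descriptors */
    };

    /* Producer side (e.g., the FTL enqueuing a flash transaction). */
    bool ring_push(struct ring *r, uint64_t cmd)
    {
        uint32_t tail = r->tail;
        if (tail - r->head == RING_SLOTS)
            return false;                      /* ring full */
        r->slot[tail & (RING_SLOTS - 1)] = cmd;
        __sync_synchronize();                  /* publish slot before tail */
        r->tail = tail + 1;
        return true;
    }

    /* Consumer side (e.g., the FIL polling the available ring). */
    bool ring_pop(struct ring *r, uint64_t *cmd)
    {
        uint32_t head = r->head;
        if (head == r->tail)
            return false;                      /* ring empty */
        *cmd = r->slot[head & (RING_SLOTS - 1)];
        __sync_synchronize();                  /* read slot before release */
        r->head = head + 1;
        return true;
    }

The used ring works the same way in the opposite direction, with the IPI serving only as a wake-up hint for the FTL core.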
C. NVMe Interface

In SoftSSD, we implement the NVMe protocol completely in software for better observability and extensibility. With NVMe, the host OS creates multiple queue pairs in the host memory for submitting commands and receiving completions. Internally, the flash firmware starts multiple coroutine threads to handle the requests from each queue pair. Each NVMe worker thread is bound to an NVMe submission queue (SQ). After a worker thread is started, it continuously polls the SQ for new commands.
If the SQ is empty (i.e., the head pointer and the tail pointer meet), the worker thread suspends itself and waits until new commands arrive. Otherwise, it fetches the command at the head pointer with a host memory DMA request through the PCIe controller and updates the head pointer. The command is handled as either an admin command or an I/O command based on the SQ from which it is received, and the command result is posted to the corresponding completion queue (CQ). Finally, the worker thread sends a message-signaled interrupt to the host to notify it of the command completion. When the host writes to an SQ tail doorbell, a callback function is invoked with the new tail pointer. If the new tail pointer differs from the current value, the callback updates the tail pointer and wakes up all worker threads assigned to the SQ to process the new commands.
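The worker loop and the doorbell callback might look like the sketch below. The 64-byte SQ entry size comes from the NVMe specification, while the helper names and queue bookkeeping are assumptions for the example.

    /* One NVMe worker coroutine bound to a submission queue. The
     * 64-byte SQ entry size is per the NVMe spec; the helper names
     * and wake-up wiring are illustrative. */
    #include <stdint.h>

    struct nvme_sq {
        uint16_t head, tail;       /* tail is advanced by the doorbell */
        uint16_t depth;
        uint64_t host_base;        /* host-memory address of the queue */
        int      is_admin;
    };

    void host_dma_read(uint64_t host_addr, void *dst, uint32_t len);
    void process_command(const void *sqe, int is_admin); /* posts CQ + MSI */
    void coroutine_wait(struct nvme_sq *sq);  /* suspend until doorbell   */
    void wake_workers(struct nvme_sq *sq);    /* resume suspended workers */

    void nvme_sq_worker(struct nvme_sq *sq)
    {
        uint8_t sqe[64];                      /* one submission queue entry */
        for (;;) {
            if (sq->head == sq->tail) {       /* empty: head meets tail */
                coroutine_wait(sq);
                continue;
            }
            /* Fetch the entry at the head pointer from host memory. */
            host_dma_read(sq->host_base + (uint64_t)sq->head * 64, sqe, 64);
            sq->head = (uint16_t)((sq->head + 1) % sq->depth);
            process_command(sqe, sq->is_admin);
        }
    }

    /* Invoked when the host writes the SQ tail doorbell register. */
    void sq_doorbell_cb(struct nvme_sq *sq, uint16_t new_tail)
    {
        if (new_tail != sq->tail) {
            sq->tail = new_tail;
            wake_workers(sq);
        }
    }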
D. Flash Interface Layer

The flash interface layer (FIL) manages the low-level flash transaction queues and the underlying NAND flash array. To avoid contention between FTL tasks and flash management tasks, we offload the FIL to a dedicated CPU core. The FTL core and the FIL core communicate via a pair of ring queues (the available ring and the used ring) located in a shared memory. The FTL divides user requests of arbitrary sizes into fixed-sized flash transactions and submits them to the available ring. Each flash transaction reads/writes data from/to a flash page identified by its physical address. The physical address is represented as a location vector (channel, way, die, plane, block, page) based on the flash array hierarchy. At runtime, the FIL first polls the available ring for incoming flash transactions. The flash transactions are organized into per-chip transaction queues based on the channel and way numbers in their location vectors. Each flash transaction is executed in two phases: the command execution phase and the data transfer phase. Based on the flash transaction type, the command execution phase either reads data from the NAND flash cells into the chip's internal data cache or programs the data in the cache to the flash cells. The data transfer phase transfers data between the device DRAM and the internal data cache. The two phases can only be started if the target die or channel is idle. Thus, the FIL goes through the transaction queues of all chips that are connected through an idle channel or that have at least one die not executing any flash command, and dispatches flash transactions to them if possible. The FIL also maintains the list of flash channels and dies with outstanding data transfers or flash commands. It polls the status of the flash channel controllers and the flash dies to check whether the data transfers or the flash commands have completed. Once the two execution phases of a flash transaction complete, the transaction is added to the used ring and the FIL generates an inter-processor interrupt (IPI) to notify the FTL core of the completion.
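The location vector and the idle check translate directly into code, as sketched below. The field widths mirror the prototype's 8-channel, 4-way geometry (Section V), the predicate names are assumptions, and the check is simplified to gate dispatch on both the channel and the die at once.

    /* Physical address as a location vector, plus the dispatch gate
     * the FIL applies per transaction. Names are illustrative. */
    #include <stdint.h>
    #include <stdbool.h>

    struct flash_loc {
        uint8_t  channel;    /* 0..7 */
        uint8_t  way;        /* 0..3, chip within the channel */
        uint8_t  die;        /* 0..1 */
        uint8_t  plane;      /* 0..1 */
        uint16_t block;
        uint16_t page;
    };

    bool channel_idle(uint8_t channel);                  /* no transfer   */
    bool die_idle(uint8_t ch, uint8_t way, uint8_t die); /* no command    */
    void start_command_phase(const struct flash_loc *loc);

    /* Dispatch the head transaction of a per-chip queue if its channel
     * and die are free; the FIL sweeps this check across all chips. */
    bool try_dispatch(const struct flash_loc *loc)
    {
        if (!channel_idle(loc->channel) ||
            !die_idle(loc->channel, loc->way, loc->die))
            return false;          /* command or transfer still in flight */
        start_command_phase(loc);  /* the data transfer phase follows */
        return true;
    }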
V. IMPLEMENTATION

We implement the prototype of SoftSSD on a hardware development board, as shown in Figure 5. The core of the SoftSSD board includes a dual-core ARM Cortex-A53 application processing unit (APU), a dual-core ARM Cortex-R5 real-time processing unit (RPU) and programmable logic built into the FPGA. We use one APU core to run the worker coroutine threads of the FTL and the two RPU cores to run the FIL and the ECC engine, respectively. The SoftSSD board is designed as a PCIe expansion card that can be connected to the host system with PCIe Gen3 x8 and serve as a normal NVMe SSD. The flash memory modules are mounted on 8 test sockets. The NAND flash packages are connected to the FPGA with 8 NV-DDR2 channels as specified by the Open NAND Flash Interface (ONFI) [1]. The two packages in the same row share two channels, and the storage controller supports up to an 8-channel, 4-way configuration. SoftSSD enables rapid flash firmware development by implementing a large number of SSD components in software. In this paper, we implement a page-level flash translation layer based on DFTL [7].

[Fig. 5: The hardware development board of the SoftSSD prototype.]
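Since the implemented FTL follows DFTL [7], a translation roughly takes the shape below: hits are served from the cached mapping table, and misses suspend the coroutine on a flash read of the translation page. The helper names, 4-byte entry width and directory layout are illustrative, not the actual implementation.

    /* Demand-based page-level translation in the spirit of DFTL [7].
     * The CMT/GTD helpers and entry width are assumptions. */
    #include <stdint.h>
    #include <stdbool.h>

    #define ENTRIES_PER_TP (16384 / 4)  /* 16KB translation page, 4B entries */

    bool     cmt_lookup(uint64_t lpn, uint32_t *ppn);  /* cached mapping table */
    void     cmt_insert(uint64_t lpn, uint32_t ppn);   /* may evict an entry   */
    uint64_t gtd_lookup(uint64_t tvpn);        /* translation page location   */
    void     await_flash_read(uint64_t ppa, void *buf); /* coroutine suspends  */
    void    *dma_buf_alloc(void);
    void     dma_buf_free(void *buf);

    uint32_t translate(uint64_t lpn)
    {
        uint32_t ppn;
        if (cmt_lookup(lpn, &ppn))
            return ppn;               /* CMT hit: no flash access needed */

        /* CMT miss: fetch the translation page covering this LPN; this
         * is the "Mapping Table Read" suspension point of Fig. 3. */
        uint32_t *tp = dma_buf_alloc();
        await_flash_read(gtd_lookup(lpn / ENTRIES_PER_TP), tp);
        ppn = tp[lpn % ENTRIES_PER_TP];
        cmt_insert(lpn, ppn);
        dma_buf_free(tp);
        return ppn;
    }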
VI. EVALUATION

A. Experiment setup

All experiments are conducted on a host PC with a 6-core Intel(R) Core(TM) i7-8700 CPU and 16GB RAM. The host PC runs Linux kernel 5.16. The workloads used in the experiments are generated by FIO-3.28 [4]. The SoftSSD prototype board is connected to the host PC via PCIe and used as an NVMe block device driven by the OS driver. An additional development PC is used to collect software output and statistics through a debug console over the UART serial port. The development PC also runs the development tools for programming the prototype board.

For the SoftSSD prototype board, we install 8 Micron MT29F1T08 MLC NAND flash memory chips (128GB each). We enable NV-DDR2 timing mode 7 with a maximum data rate of 400MT/s per channel. The OS driver creates 12 NVMe I/O queues on the device (one per host CPU hardware thread) and the maximum queue depth is set to 1024. Table I summarizes the configurations of SoftSSD and the flash microarchitecture used in the experiments.
TABLE I: SSD configurations used in the experiments.

SSD configurations        Capacity: 1TB
                          Mapping table cache: 1GB
                          Write cache: 512MB
                          # Channels: 8
                          # Chips per channel: 4
                          # NVMe I/O queues: 12
                          NVMe queue depth: 1024

Flash microarchitecture   Page size: 16KB + 1872B OOB
                          # Pages per block: 512
                          # Blocks per plane: 1048
                          # Planes per die: 2
                          # Dies per chip: 2
B. Sequential access

We first measure the sequential access performance of the SoftSSD prototype board. We use FIO to generate workloads that issue read or write requests to consecutive logical addresses. The queue depth is set to 1, and the number of outstanding requests to the SSD matches the number of threads. The request size is 1MB for read requests and 16kB for write requests. All experiments start with an empty mapping table, and every accessed logical address is assigned a new physical address by the address mapping unit. For write requests, we issue enough requests to fill the write cache and cause the data cache to evict pages to the NAND flash memory.

[Fig. 6: Sequential access performance of SoftSSD. (a) Read 1MB – Throughput; (b) Read 1MB – Latency CDF; (c) Write 16kB – Throughput; (d) Write 16kB – Latency CDF.]
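For reference, a fio job approximating the sequential-read workload might look as follows; the device path is an assumption, and the actual runs vary numjobs from 1 to 32.

    ; Approximate fio job for the 1MB sequential-read workload;
    ; the device path is an assumption.
    [seq-read]
    filename=/dev/nvme0n1
    ioengine=libaio
    direct=1
    rw=read
    bs=1M
    iodepth=1
    numjobs=16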
As shown in Figure 6, as the number of threads increases from 1 to 16, the throughput for both read and write requests increases linearly. The throughput peaks at 32 threads, where all 32 flash chips (8 channels, 4 ways) are saturated with outstanding flash commands. When the number of threads is small (e.g., 1-4), write requests have a lower throughput than read requests due to the longer flash command latency. As the number of threads increases, there are enough requests to saturate the bandwidth of the software host interface, so that read and write requests achieve similar peak throughput.

Figures 6(b) and (d) also show the CDFs of the end-to-end read/write latencies. Compared to write latencies, read latencies have a more uniform distribution because data read from the NAND flash memory is not cached in the device DRAM, and every read request incurs the same number of transactions to the NAND flash memory. For write requests, data can be buffered in the device DRAM before the data cache is full. After the data cache is filled, data must be evicted and written back to the NAND flash memory to make room for new data, and write requests may incur extra flash write transactions, causing the longer tail latency.

C. Random access

[Fig. 7: 16kB random access performance of SoftSSD. (a) Read – Throughput; (b) Read – Latency CDF; (c) Write – Throughput; (d) Write – Latency CDF.]

Figure 7 shows the average throughput and latency CDFs of random accesses in SoftSSD. The workloads in this experiment issue read/write requests of 16kB data to random logical addresses. The queue depth is set to 1, and the number of outstanding requests to the SSD matches the number of threads. As shown in the figure, as the number of threads increases from 1 to 16, the IOPS increases linearly and peaks at 32 threads. Compared to sequential accesses, random read/write accesses achieve a lower maximum throughput due to the worse locality. This has a negative impact on different components in the FTL. The mapping table needs to allocate and maintain more translation pages to store the physical addresses of the random logical addresses. Also, the data cache cannot utilize temporal locality to coalesce multiple write requests into a cached page, incurring more flash write transactions.
[Fig. 8: Latency CDFs of individual phases of flash transactions. (a) Read – Transfer; (b) Read – Command; (c) Write – Transfer; (d) Write – Command.]

D. Flash transaction latency

In SoftSSD, each flash transaction is executed in two phases. In the transfer phase, data are transferred between the device DRAM and the internal data cache in the NAND flash memory. In the command phase, the NAND flash memory executes a flash command to read/program the data cache from/to the NAND flash cells. With the software implementation, we can use the audit framework to collect performance statistics of different internal operations. Figure 8 shows the latencies of the individual execution phases of flash transactions when running 16kB-aligned random read/write requests.

As shown in the figure, when a single thread is used, SoftSSD achieves deterministic latencies for the transfer phase (∼74μs for reads, ∼108μs for writes) and the read command phase (∼105μs). When 32 threads are used, the flash transaction latencies increase due to scheduling overhead and contention at the bus interconnects which connect the flash channel controllers to the device DRAM. Specifically, when there is a large number of outstanding flash transactions, the FIL core needs to continuously monitor the completion status of up to 64 flash dies, resulting in longer command latencies. Also, when used in MLC mode, each flash cell is shared by a lower (LSB) page and an upper (MSB) page with different read/program latencies, which is reflected by the steps in the write command latency CDF.

VII. CONCLUSION

In this paper, we propose SoftSSD, which enables rapid prototyping of flash firmware on a real hardware platform. The key technique for achieving this is that we implement the majority of SSD components in pure software, so that SoftSSD can provide better observability of the internal states in the SSD, and new storage protocol features can be integrated into the flash firmware without modifying the hardware components. We conduct experiments with real I/O workloads to demonstrate the performance of SoftSSD as a standard NVMe SSD. We believe the observability and extensibility provided by the SoftSSD platform can contribute to future flash firmware development in the research communities.

ACKNOWLEDGMENT

The work described in this paper is partially supported by grants from the National Natural Science Foundation of China (62072333), the Research Grants Council of the Hong Kong Special Administrative Region, China (GRF 15224918), and a Direct Grant for Research, The Chinese University of Hong Kong (Project No. 4055151).

REFERENCES

[1] Home - ONFI. http://www.onfi.org/.
[2] Jasmine OpenSSD Platform - OpenSSDWiki. http://www.openssd-project.org/wiki/Jasmine_OpenSSD_Platform.
[3] NVM Express Base Specification 2.0.
[4] J. Axboe. Flexible I/O Tester, 2022.
[5] J. Boukhobza, S. Rubini, R. Chen, and Z. Shao. Emerging NVM: A Survey on Architectural Integration and Research Challenges. ACM Transactions on Design Automation of Electronic Systems, 23(2):14:1–14:32, Nov. 2017.
[6] C. Gao, L. Shi, M. Zhao, C. J. Xue, K. Wu, and E. H.-M. Sha. Exploiting parallelism in I/O scheduling for access conflict minimization in flash-based solid state drives. In 2014 30th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–11, June 2014.
[7] A. Gupta, Y. Kim, and B. Urgaonkar. DFTL: A flash translation layer employing demand-based selective caching of page-level address mappings. ACM SIGARCH Computer Architecture News, 37(1):229–240, Mar. 2009.
[8] M. Jung, J. Zhang, A. Abulila, M. Kwon, N. Shahidi, J. Shalf, N. S. Kim, and M. Kandemir. SimpleSSD: Modeling Solid State Drives for Holistic System Simulation. IEEE Computer Architecture Letters, 17(1):37–41, Jan. 2018.
[9] J. Kwak, S. Lee, K. Park, J. Jeong, and Y. H. Song. Cosmos+ OpenSSD: Rapid Prototype for Flash Storage Systems. ACM Transactions on Storage, 16(3):15:1–15:35, July 2020.
[10] G. Lee, S. Shin, W. Song, T. J. Ham, J. W. Lee, and J. Jeong. Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 603–616, 2019.
[11] H. Li, M. Hao, M. H. Tong, S. Sundararaman, M. Bjørling, and H. S. Gunawi. The CASE of FEMU: Cheap, Accurate, Scalable and Extensible Flash Emulator. In 16th USENIX Conference on File and Storage Technologies (FAST 18), pages 83–90, 2018.
[12] L. Lu, T. S. Pillai, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. WiscKey: Separating Keys from Values in SSD-conscious Storage. In 14th USENIX Conference on File and Storage Technologies (FAST 16), pages 133–148, 2016.
[13] C. Ma, Y. Wang, Z. Shen, R. Chen, Z. Wang, and Z. Shao. MNFTL: An Efficient Flash Translation Layer for MLC NAND Flash Memory. ACM Transactions on Design Automation of Electronic Systems, 25(6):50:1–50:19, Aug. 2020.
[14] A. Tavakkol, J. Gómez-Luna, M. Sadrosadati, S. Ghose, and O. Mutlu. MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices. In 16th USENIX Conference on File and Storage Technologies (FAST 18), pages 49–66, 2018.
[15] M.-C. Yang, Y.-M. Chang, C.-W. Tsao, P.-C. Huang, Y.-H. Chang, and T.-W. Kuo. Garbage collection and wear leveling for flash memory: Past and future. In 2014 International Conference on Smart Computing, pages 66–73, 2014.
[16] J. Zhang, M. Kwon, M. Swift, and M. Jung. Scalable Parallel Flash Firmware for Many-core Architectures. In 18th USENIX Conference on File and Storage Technologies (FAST 20), pages 121–136, 2020.