
Split-Level I/O Scheduling

Suli Yang, Tyler Harter, Nishant Agrawal, Salini Selvaraj Kowsalya, Anand Krishnamurthy,
Samer Al-Kiswany, Rini T. Kaushik*, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
University of Wisconsin-Madison IBM Research-Almaden*

SOSP'15, October 4–7, 2015, Monterey, CA. Copyright is held by the Owner/Author(s). Publication rights licensed to ACM. ACM 978-1-4503-3834-9/15/10. http://dx.doi.org/10.1145/2815400.2815421

Abstract

We introduce split-level I/O scheduling, a new framework that splits I/O scheduling logic across handlers at three layers of the storage stack: block, system call, and page cache. We demonstrate that traditional block-level I/O schedulers are unable to meet throughput, latency, and isolation goals. By utilizing the split-level framework, we build a variety of novel schedulers to readily achieve these goals: our Actually Fair Queuing scheduler reduces priority-misallocation by 28×; our Split-Deadline scheduler reduces tail latencies by 4×; our Split-Token scheduler reduces sensitivity to interference by 6×. We show that the framework is general and operates correctly with disparate file systems (ext4 and XFS). Finally, we demonstrate that split-level scheduling serves as a useful foundation for databases (SQLite and PostgreSQL), hypervisors (QEMU), and distributed file systems (HDFS), delivering improved isolation and performance in these important application scenarios.

1 Introduction

Deciding which I/O request to schedule, and when, has long been a core aspect of the operating system storage stack [11, 13, 22, 27, 28, 29, 31, 38, 43, 44, 45, 54]. Each of these approaches has improved different aspects of I/O scheduling; for example, research in single-disk schedulers incorporated rotational awareness [28, 29, 44]; other research tackled the problem of scheduling within a multi-disk array [53, 57]; more recent work has targeted flash-based devices [30, 36], tailoring the behavior of the scheduler to this new and important class of storage device. All of these optimizations and techniques are important; in sum total, these systems can improve overall system performance dramatically [22, 44, 57] as well as provide other desirable properties (including fairness across processes [17] and the meeting of deadlines [56]).

Most I/O schedulers (hereafter just "schedulers") are built at the block level within an operating system, beneath the file system and just above the device itself. Such block-level schedulers are given a stream of requests and are thus faced with the question: which requests should be dispatched, and when, in order to achieve the goals of the system?

Unfortunately, making decisions at the block level is problematic, for two reasons. First, and most importantly, the block-level scheduler fundamentally cannot reorder certain write requests; file systems carefully control write ordering to preserve consistency in the event of system crash or power loss [21, 25]. Second, the block-level scheduler cannot perform accurate accounting; the scheduler lacks the requisite information to determine which application was responsible for a particular I/O request. Due to these limitations, block-level schedulers cannot implement a full range of policies.

An alternate approach, which does not possess these same limitations, is to implement scheduling much higher in the stack, namely with system calls [19]. System-call scheduling intrinsically has access to necessary contextual information (i.e., which process has issued an I/O). Unfortunately, system-call scheduling is no panacea, as the low-level knowledge required to build effective schedulers is not present. For example, at the time of a read or write, the scheduler cannot predict whether the request will generate I/O or be satisfied by the page cache, information which can be useful in reordering requests [12, 49]. Similarly, the file system will likely transform a single write request into a series of reads and writes, depending on the crash-consistency mechanism employed (e.g., journaling [25] or copy-on-write [42]); scheduling without exact knowledge of how much I/O load will be generated is difficult and error prone.

In this paper, we introduce split-level I/O scheduling, a novel scheduling framework in which a scheduler is constructed across several layers. By implementing a judiciously selected set of handlers at key junctures within the storage stack (namely, at the system-call, page-cache, and block layers), a developer can implement a scheduling discipline with full control over behavior and with no loss in high- or low-level information. Split schedulers can determine which processes issued I/O (via graph tags that track causality across levels) and accurately estimate I/O costs. Furthermore, memory notifications make schedulers aware of write work as soon as possible (not tens of seconds later when writeback occurs). Finally, split schedulers can prevent file systems from imposing orderings that are contrary to scheduling goals.
We demonstrate the generality of split scheduling by implementing three new schedulers: AFQ (Actually-Fair Queuing) provides fairness between processes, Split-Deadline observes latency goals, and Split-Token isolates performance. Compared to similar schedulers in other frameworks, AFQ reduces priority-misallocation errors by 28×, Split-Deadline reduces tail latencies by 4×, and Split-Token improves isolation by 6× for some workloads. Furthermore, the split framework is not specific to a single file system; integration with two file systems (ext4 [34] and XFS [47]) is relatively simple.

Finally, we demonstrate that the split schedulers provide a useful base for more complex storage stacks. Split-Token provides isolation for both virtual machines (QEMU) and data-intensive applications (HDFS), and Split-Deadline provides a solution to the database community's "fsync freeze" problem [2, 9, 10]. In summary, we find split scheduling to be simple and elegant, yet compatible with a variety of scheduling goals, file systems, and real applications.

The rest of this paper is organized as follows. We evaluate existing frameworks and describe the challenges they face (§2). We discuss the principles of split scheduling (§3) and our implementation in Linux (§4). We implement three split schedulers as case studies (§5), discuss integration with other file systems (§6), and evaluate our schedulers with real applications (§7). Finally, we discuss related work (§8) and conclude (§9).

2 Motivation

Block-level schedulers are severely limited by their inability to gather information from and exert control over other levels of the storage stack. As an example, we consider the Linux CFQ scheduler, which supports an ionice utility that can put a process in idle mode. According to the man page: "a program running with idle I/O priority will only get disk time when no other program has asked for disk I/O" [7]. Unfortunately, CFQ has little control over write bursts from idle-priority processes, as writes are buffered above the block level.

We demonstrate this problem by running a normal process A alongside an idle-priority process B. A reads sequentially from a large file. B issues random writes over a one-second burst. Figure 1 shows the result: B quickly finishes while A (whose performance is shown via the CFQ line) takes over five minutes to recover. Block-level schedulers such as CFQ are helpless to prevent processes from polluting write buffers with expensive I/O. As we will see, other file-system features such as journaling and delayed allocation are similarly problematic.

Figure 1: Write Burst. B's one-second random-write burst severely degrades A's performance for over five minutes. Putting B in CFQ's idle class provides no help.

The idle policy is one of many possible scheduling goals, but the difficulties it faces at the block level are not unique. In this section, we consider three different scheduling goals, identifying several shared needs (§2.1). Next, we describe prior scheduling frameworks (§2.2). Finally, we show these frameworks are fundamentally unable to meet scheduler needs when running in conjunction with a modern file system (§2.3).

2.1 Framework Support for Schedulers

We now consider three I/O schedulers: priority, deadline, and resource-limit, identifying what framework support is needed to implement these schedulers correctly.

Priority: These schedulers aim to allocate I/O resources fairly between processes based on their priorities [1]. To do so, a scheduler must be able to track which process is responsible for which I/O requests, estimate how much each request costs, and reorder higher-priority requests before lower-priority requests.

Deadline: These schedulers observe deadlines for I/O operations, offering predictable latency to applications that need it [3]. A deadline scheduler needs to map an application's deadline setting to each request and issue lower-latency requests before other requests.

Resource-Limit: These schedulers cap the resources an application may use, regardless of overall system load. Limits are useful when resources are purchased and the seller does not wish to give away free I/O. Resource-Limit schedulers need to know the cost and causes of I/O operations in order to throttle correctly.

Although these schedulers have distinct goals, they have three common needs. First, schedulers need to be able to map causes to identify which process is responsible for an I/O request. Second, schedulers need to estimate costs in order to optimize performance and perform accounting properly. Third, schedulers need to be able to reorder I/O requests so that the operations most important to achieving scheduling goals are served first. Unfortunately, as we will see, current block-level and system-call schedulers cannot meet all of these requirements.
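To make these three requirements concrete, the following is a minimal sketch, in C, of the interface a framework would have to expose to such schedulers; the structure and function names are a hypothetical illustration of the three needs, not any framework's actual API.

/*
 * A minimal sketch (not an actual API) of the three capabilities every
 * scheduler in Section 2.1 needs from its framework: cause mapping,
 * cost estimation, and reordering. All names here are hypothetical.
 */
#include <stddef.h>
#include <sys/types.h>

struct io_request {
    pid_t  cause;     /* process that ultimately caused this I/O         */
    off_t  offset;    /* location (file or disk, depending on the level) */
    size_t length;    /* size of the request in bytes                    */
    int    is_write;  /* 1 for writes, 0 for reads                       */
};

struct sched_framework_ops {
    /* Need 1: map an I/O request back to the responsible process. */
    pid_t (*map_cause)(const struct io_request *req);

    /* Need 2: estimate the cost of a request (e.g., device time in us). */
    long (*estimate_cost)(const struct io_request *req);

    /* Need 3: pick which queued request to dispatch next (reordering). */
    struct io_request *(*pick_next)(struct io_request **queue, size_t n);
};

As the rest of this section shows, the block-level and system-call frameworks each fail to supply at least one of these three operations reliably.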
Figure 2: Scheduling Architectures. The boxes show where scheduler hooks exist for reordering I/O requests or doing accounting. Sometimes reads and writes are handled differently at different levels, as indicated by "R" and "W".

2.2 Framework Architectures

Scheduling frameworks offer hooks to which schedulers can attach. Via these hooks, a framework passes information and exposes control to schedulers. We categorize frameworks by the level at which the hooks are available.

Figure 2(a) illustrates block-level scheduling, the traditional approach implemented in Linux [8], FreeBSD [41], and other systems [6]. Clients initiate I/O requests via system calls, which are translated to block-level requests by the file system. Within the block-scheduling framework, these requests are then passed to the scheduler along with information describing them: their location on storage media, size, and the submitting process. Based on such information, a scheduler may reorder the requests according to some policy. For example, a scheduler may accumulate many requests in its internal queues and later dispatch them in an order that improves sequentiality.

Figure 2(b) shows the system-call scheduling architecture (SCS) proposed by Craciunas et al. [19]. Instead of operating beneath the file system and deciding when block requests are sent to the storage device, a system-call scheduler operates on top of the file system and decides when to issue I/O related system calls (read, write, etc.). When a process invokes a system call, the scheduler is notified. The scheduler may put the process to sleep for a time before the body of the system call runs. Thus, the scheduler can reorder the calls, controlling when they become active within the file system.

Figure 2(c) shows the hooks of the split framework, which we describe in a later section (§4.2). In addition to introducing novel page-cache hooks, the split framework supports select system-call and block-level hooks.

2.3 File-System Challenges

Schedulers allocate disk I/O to processes, but processes do not typically use hard disks or SSDs directly. Instead, processes request service from a file system, which in turn translates requests to disk I/O. Unfortunately, file systems make it challenging to satisfy the needs of the scheduler. We now examine the implications of writeback, delayed allocation, journaling, and caching for schedulers, showing how these behaviors fundamentally require a restructuring of the I/O scheduling framework.

2.3.1 Delayed Writeback and Allocation

Delayed writeback is a common technique for postponing I/O by buffering dirty data to write at a later time. Procrastination is useful because the work may go away by itself (e.g., the data could be deleted) and, as more work accumulates, more efficiencies can be gained (e.g., sequential write patterns may become realizable).

Some file systems also delay allocation to optimize data layout [34, 47]. When allocating a new block, the file system does not immediately decide its on-disk location; another task will decide later. More information (e.g., file sizes) becomes known over time, so delaying allocation enables more informed decisions.

Both delayed writeback and allocation involve file-system level delegation, with one process doing I/O work on behalf of other processes. A writeback process submits buffers that other processes dirtied and may also dirty metadata structures on behalf of other processes. Such delegation obfuscates the mapping from requests to processes. To block-level schedulers, the writeback task sometimes appears responsible for all write traffic.

We evaluate Linux's priority-based block scheduler, CFQ (Completely Fair Queuing) [1], using an asynchronous write workload. CFQ aims to allocate disk time fairly among processes (in proportion to priority). We launch eight threads with different priorities, ranging from 0 (highest) to 7 (lowest): each writes to its own file sequentially. A thread's write throughput should be proportional to its priority, as shown by the expectation line of Figure 3 (left). Unfortunately, CFQ ignores priorities, treating all threads equally. Figure 3 (right) shows why: to CFQ all the requests appear to have a priority of 4, because the writeback thread (a priority-4 process) submits all the writes on behalf of the eight benchmark threads.

Figure 3: CFQ Throughput. The left plot shows sequential write throughput for different priorities. The right plot shows the portion of requests for each priority seen by CFQ. Unfortunately, the "Completely Fair Scheduler" is not even slightly fair for sequential writes.
2.3.2 Journaling

Many modern file systems use journals for consistent updates [15, 34, 47]. While details vary across file systems, most follow similar journaling protocols to commit data to disk; here, we discuss ext4's ordered mode to illustrate how journaling severely complicates scheduling.

When changes are made to a file, ext4 first writes the affected data blocks to disk, then creates a journal transaction which contains all related metadata updates and commits that transaction to disk, as shown in Figure 4. The data blocks (D1, D2, D3) must be written before the journal transaction, as updates become durable as soon as the transaction commits, and ext4 needs to prevent the metadata in the journal from referring to data blocks containing garbage. After metadata is journaled, ext4 eventually checkpoints the metadata in place.

Figure 4: Journal Batching. Arrows point to events that must occur before the event from which they point. The event for the blocks is a disk write. The event for an fsync is a return.

Transaction writing and metadata checkpointing are both performed by kernel processes instead of the processes that initially caused the updates. This form of write delegation also complicates cause mapping.

More importantly, journaling prevents block-level schedulers from reordering. Transaction batching is a well-known performance optimization [25], but block schedulers have no control over which writes are batched, so the journal may batch together writes that are important to scheduling goals with less-important writes. For example, in Figure 4, suppose A is higher priority than B. A's fsync depends on transaction commit, which depends on writing B's data. Priority is thus inverted.

When metadata (e.g., directories or bitmaps) is shared among files, journal batching may be necessary for correctness (not just performance). In Figure 4, the journal could have conceivably batched M1 and M2 separately; however, M1 depends on D2, data written by a process C to a different file, and thus A's fsync depends on the persistence of C's data. Unfortunately (for schedulers), metadata sharing is common in file systems.

The inability to reorder is especially problematic for a deadline scheduler: a block-request deadline completely loses its relevance if one request's completion depends on the completion of unrelated I/Os. To demonstrate, we run two threads A (small) and B (big) with Linux's Block-Deadline scheduler [3], setting the block-request deadline to 20 ms for each. Thread A does 4 KB appends, calling fsync after each. Thread B does N bytes of random writes (N ranges from 16 KB to 4 MB) followed by an fsync. Figure 5 shows that even though A only writes one block each time, A's fsync latency depends on how much data B flushes each time.

Figure 5: I/O Latency Dependencies. Thread A keeps issuing fsync to flush one block of data to disk, while thread B flushes multiple blocks using fsync. This plot shows how A's latency depends on B's I/O size.

Most file systems enforce ordering for correctness, so these problems occur with other crash-consistency mechanisms as well. For example, in log-structured file systems [42], writes appended earlier are durable earlier.
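To make the dependency concrete, the following is a minimal userspace sketch of this two-thread experiment; it is an illustration rather than the benchmark code used for Figure 5, and the file names, sizes, and iteration counts are arbitrary. On an ordered-mode ext4 partition, thread A's reported fsync latency tends to grow with FLUSH_BYTES even though A itself writes only one block per call.

/*
 * Sketch of the Figure 5 experiment (not the original benchmark code):
 * thread A appends one 4 KB block and calls fsync, while thread B writes
 * FLUSH_BYTES of random 4 KB blocks and calls fsync. A's fsync latency
 * reveals the ordering dependency introduced by the journal.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK       4096
#define FLUSH_BYTES (1 * 1024 * 1024)   /* how much B flushes per fsync */
#define BIG_FILE    (256 * 1024 * 1024) /* size of B's file             */
#define ROUNDS      100

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *thread_a(void *arg) {
    (void)arg;
    char buf[BLOCK];
    memset(buf, 'a', sizeof(buf));
    int fd = open("small.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
    for (int i = 0; i < ROUNDS; i++) {
        if (write(fd, buf, sizeof(buf)) != BLOCK)
            perror("write");
        double t0 = now_sec();
        fsync(fd);                       /* latency of interest */
        printf("A fsync: %.3f ms\n", (now_sec() - t0) * 1000.0);
    }
    close(fd);
    return NULL;
}

static void *thread_b(void *arg) {
    (void)arg;
    char buf[BLOCK];
    memset(buf, 'b', sizeof(buf));
    int fd = open("big.dat", O_CREAT | O_RDWR, 0644);
    for (int i = 0; i < ROUNDS; i++) {
        for (long done = 0; done < FLUSH_BYTES; done += BLOCK) {
            off_t off = (rand() % (BIG_FILE / BLOCK)) * (off_t)BLOCK;
            if (pwrite(fd, buf, sizeof(buf), off) != BLOCK)  /* random writes */
                perror("pwrite");
        }
        fsync(fd);                       /* forces a large journal commit */
    }
    close(fd);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}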
2.3.3 Caching and Write Amplification

Sequentially reading or writing N bytes from or to a file often does not result in N bytes of sequential disk I/O for several reasons. First, file systems use different disk layouts, and layouts change as file systems age; hence, sequential file-system I/O may become random disk I/O. Second, file-system reads and writes may be absorbed by caches or write buffers without causing I/O. Third, some file-system features amplify I/O. For example, reading a file block may involve additional metadata reads, and writing a file block may involve additional journal writes. These behaviors prevent system-call schedulers from accurately estimating costs.

To show how this inability hurts system-call schedulers, we evaluate SCS-Token [18]. In SCS-Token, a process's resource usage is limited by the number of tokens it possesses. Per-process tokens are generated at a fixed rate, based on user settings. When the process issues a system call, SCS blocks the call until the process has enough tokens to pay for it.
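The admission side of such a scheduler is a standard token bucket; the sketch below illustrates the general idea (it is not SCS-Token's actual code, and all names are hypothetical).

/*
 * A minimal token-bucket admission sketch in the style SCS-Token is
 * described as using (not its implementation): tokens accrue at a fixed
 * rate, each call is charged its estimated cost, and a call blocks until
 * the process holds enough tokens.
 */
#include <time.h>

struct token_bucket {
    double tokens;        /* current balance, in bytes            */
    double rate;          /* refill rate, in bytes per second     */
    double cap;           /* maximum number of tokens held        */
    double last_refill;   /* timestamp of the last refill, in sec */
};

double tb_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

void tb_refill(struct token_bucket *tb) {
    double now = tb_now();
    tb->tokens += (now - tb->last_refill) * tb->rate;
    if (tb->tokens > tb->cap)
        tb->tokens = tb->cap;
    tb->last_refill = now;
}

/* Block (by sleeping) until the caller can pay for `cost` bytes of I/O. */
void tb_charge(struct token_bucket *tb, double cost) {
    tb_refill(tb);
    while (tb->tokens < cost) {
        struct timespec nap = { 0, 10 * 1000 * 1000 };  /* 10 ms */
        nanosleep(&nap, NULL);
        tb_refill(tb);
    }
    tb->tokens -= cost;
}

The difficulty described next lies not in this admission logic but in the cost passed to it: at the system-call level the scheduler cannot tell whether those bytes will hit the cache, be amplified by journaling, or turn into random disk I/O.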
We attempt to isolate a process A's I/O performance from a process B by throttling B's resource usage. If SCS-Token works correctly, A's performance will vary little with respect to B's I/O patterns. To test this behavior, we configure A to sequentially read from a large file while B runs workloads with different I/O patterns. Each of the B workloads involves repeatedly accessing R bytes sequentially from a 10 GB file and then randomly seeking to a new offset. We explore 7 values for R (from 4 KB to 16 MB) for both reads and writes (14 workloads total). In each case, B is throttled to 10 MB/s.
Figure 6: SCS Token Bucket: Isolation. The performance of two processes is shown: a sequential reader, A, and a throttled process, B. B may read (black) or write (gray), and performs runs of different sizes (x-axis).

Figure 6 shows how A's performance varies with B's I/O patterns. Note the large gap between the performance of A with B reading vs. writing. When B is performing sequential writes, A's throughput is as high as 125 MB/s; when B is performing random reads, A's throughput drops to 25 MB/s in the worst case. Writes appear cheaper than reads because write buffers absorb I/O and make it more sequential. Across the 14 workloads, A's throughput has a standard deviation of 41 MB, indicating A is highly sensitive to B's patterns. SCS-Token fails to isolate A's performance by throttling B, as SCS-Token cannot correctly estimate the cost of B's I/O pattern.

2.3.4 Discussion

                   Block   Syscall   Split
  Cause Mapping     ✖       ✔        ✔
  Cost Estimation   ✔       ✖        ✔
  Reordering        ✖       ✔        ✔

Table 1: Framework Properties. A ✔ indicates a given scheduling functionality can be supported with the framework, and an ✖ indicates a functionality cannot be supported.

Table 1 summarizes how different needs are met (or not) by each framework. The block-level framework fails to support correct cause mapping (due to write delegation such as journaling and delayed allocation) or control over reordering (due to file-system ordering requirements). The system-call framework solves these two problems, but fails to provide enough information to schedulers for accurate cost estimation because it lacks low-level knowledge. These problems are general to many file systems; even if journals are not used, similar issues arise from the ordering constraints imposed by other mechanisms such as copy-on-write techniques [16] or soft updates [21]. Our split framework meets all the needs in Table 1 by incorporating ideas from the other two frameworks and exposing additional memory-related hooks.

3 Split Framework Design

Existing frameworks offer insufficient reordering control and accounting knowledge. Requests are queued, batched, and processed at many layers of the stack, thus the limitations of single-layer frameworks are unsurprising. We propose a holistic alternative: all important decisions about when to perform I/O work should be exposed as scheduling hooks, regardless of the level at which those decisions are made in the stack. We now discuss how these hooks support correct cause mapping (§3.1), accurate cost estimation (§3.2), and reordering (§3.3).

3.1 Cause Mapping

A scheduler must be able to map I/O back to the processes that caused it to accurately perform accounting even when some other process is submitting the I/O. Metadata is usually shared, and I/Os are usually batched, so there may be multiple causes for a single dirty page or a single request. Thus, the split framework tags I/O operations with sets of causes, instead of simple scalar tags (e.g., those implemented by Mesnier et al. [35]).

Write delegation (§2.3.1) further complicates cause mapping when one process is dirtying data (not just submitting I/O) on behalf of other processes. We call such processes proxies; examples include the writeback and journaling tasks. Our framework tags a proxy process to identify the set of processes being served instead of the proxy itself. These tags are created when a process starts dirtying data for others and cleared when it is done.

Figure 7 illustrates how our framework tracks multiple causes and proxies. Processes P1 and P2 both dirty the same data page, so the page's tag includes both processes in its set. Later, a writeback process, P3, writes the dirty buffer to disk. In doing so, P3 may need to dirty the journal and metadata, and will be marked as a proxy for {P1, P2} (the tag is inherited from the page it is writing back). Thus, P1 and P2 are considered responsible when P3 dirties other pages, and the tag of these pages will be marked as such. The tag of P3 is cleared when it finishes submitting the data page to the block level.

Figure 7: Set Tags and I/O Proxies. Our tags map metadata and journal I/O to the real causes, P1 and P2, not P3.
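The following is a simplified userspace sketch of how set tags and proxy inheritance compose; it is an illustration rather than the kernel data structures, and the names are hypothetical. The journal page ends up charged to {P1, P2} even though P3 performed the dirtying.

/*
 * A simplified sketch of set-based cause tags and proxy inheritance.
 * A cause set is a small bitmap of process slots; a proxy carries the
 * union of the sets it is working on behalf of.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t cause_set_t;                 /* one bit per process slot  */

#define CAUSE(slot) ((cause_set_t)1 << (slot))

struct page_tag   { cause_set_t causes; };    /* attached to a dirty page  */
struct proc_state { cause_set_t proxy_for; }; /* nonzero while proxying    */

/* A process dirties a page: charge the page to the original causes, not
 * to the proxy itself, if the process is currently acting as a proxy.    */
static void on_buffer_dirty(struct page_tag *pg, int slot,
                            const struct proc_state *self) {
    pg->causes |= self->proxy_for ? self->proxy_for : CAUSE(slot);
}

/* Writeback picks up a dirty page: inherit its cause set as the proxy tag. */
static void proxy_begin(struct proc_state *self, const struct page_tag *pg) {
    self->proxy_for = pg->causes;
}

/* Proxy finished submitting the page to the block level: clear the tag. */
static void proxy_end(struct proc_state *self) {
    self->proxy_for = 0;
}

int main(void) {
    struct page_tag data = {0}, journal = {0};
    struct proc_state p1 = {0}, p2 = {0}, p3 = {0};

    on_buffer_dirty(&data, 1, &p1);          /* P1 dirties the data page  */
    on_buffer_dirty(&data, 2, &p2);          /* P2 dirties the same page  */
    proxy_begin(&p3, &data);                 /* writeback (P3) takes over */
    on_buffer_dirty(&journal, 3, &p3);       /* journal I/O charged to... */
    proxy_end(&p3);
    printf("journal causes: %#llx\n",        /* ...{P1, P2}, not P3       */
           (unsigned long long)journal.causes);
    return 0;
}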
3.2 Cost Estimation

Many policies require schedulers to know how much I/O costs, in terms of device time or other metrics. An I/O pattern's cost is influenced by file-system features, such as caches and write buffers, and by device properties (e.g., random I/O is cheaper on flash than hard disk).

Costs can be most accurately estimated at the lowest levels of the stack, immediately above hardware (or better still in hardware, if possible). At the block level, request locations are known, so sequentiality-based models can estimate costs. Furthermore, this level is below all file-system features, so accounting is less likely to overestimate costs (e.g., by counting cache reads) or underestimate costs (e.g., by missing journal writes).

Unfortunately, writes may be buffered for a long time (e.g., 30 seconds) before being flushed to the block level. Thus, while block-level accounting may accurately estimate the cost of a write, it is not aware of most writes until some time after they enter the system via a write system call. Thus, if prompt accounting is more important than accurate accounting (e.g., for interactive systems), accounting should be done at the memory level. Without memory-level information, a scheduler could allow a low-priority process to fill the write buffers with gigabytes of random writes, as we saw earlier (Figure 1).

Figure 8: Accounting: Memory vs. Block Level. Disk locations for buffered writes may not be known (indicated by the question marks on the blocks) if allocations are delayed.

Figure 8 shows the trade-off between accounting at the memory level (write buffer) and block level (request queue). At the memory level, schedulers do not know whether dirty data will be deleted before a flush, whether other writers will overwrite dirty data, or whether I/O will be sequential or random. A scheduler can guess how sequential buffered writes will be based on file offsets, but delayed allocation prevents certainty about the layout. After a long delay, on-disk locations and other details are known for certain at the block level.

The cost of buffered writes depends on future workload behavior, which is usually unknown. Thus, we believe all scheduling frameworks are fundamentally limited and cannot provide cost estimation that is both prompt and accurate. Our framework exposes hooks at both the memory and block levels, enabling each scheduler to handle the trade-off in the manner most suitable to its goals. Schedulers may even utilize hooks at both levels. For example, Split-Token (§5.3) promptly guesses write costs as soon as buffers are dirtied, but later revises that estimate when more information becomes available (e.g., when the dirty data is flushed to disk).
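A rough sketch of this two-stage strategy is shown below; the constants, thresholds, and function names are invented for illustration and are not the cost models used by our schedulers.

/*
 * A sketch of the two-stage estimation strategy: a prompt guess when a
 * buffer is dirtied, based only on file offsets, and a revision when the
 * block layer reveals disk locations.
 */
#include <stdlib.h>

#define SEQ_COST_PER_4KB   1.0   /* normalized cost of a sequential block */
#define RAND_PENALTY      10.0   /* random block ~= 10 sequential blocks  */

/* Stage 1: memory level. Guess from the distance to the previous dirty
 * offset in the same file; delayed allocation means this is only a guess. */
double prompt_estimate(long file_off, long prev_file_off) {
    long gap = labs(file_off - prev_file_off);
    return (gap <= 4096) ? SEQ_COST_PER_4KB
                         : SEQ_COST_PER_4KB * RAND_PENALTY;
}

/* Stage 2: block level. Once the on-disk location is known, recompute the
 * cost and return the correction to apply (a refund if negative). */
double revise_estimate(long disk_sector, long prev_disk_sector,
                       double charged_so_far) {
    long gap = labs(disk_sector - prev_disk_sector);
    double actual = (gap <= 8) ? SEQ_COST_PER_4KB       /* 8 sectors = 4 KB */
                               : SEQ_COST_PER_4KB * RAND_PENALTY;
    return actual - charged_so_far;
}

A scheduler that only needs promptness (e.g., to stop a burst like the one in Figure 1) can act on the first estimate; a scheduler that needs accurate accounting applies the correction when the flush happens.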
3.3 Reordering

Most schedulers will want to reorder I/O to achieve good performance as well as to meet more specific goals (e.g., low latency or fairness). Reordering for performance requires knowledge of the device (e.g., whether it is useful to reorder for sequentiality), and is best done at a lower level in the stack. We enable reordering at the block level by exposing hooks for both block reads and writes.

Unfortunately, the ability to reorder writes at the block level is greatly limited by file systems (§2.3.2). Thus, reordering hooks for writes (but not reads, which are not entangled by journals) are also exposed above the file system, at the system-call level. By controlling when write system calls run, a scheduler can control when writes become visible to the file system and prevent ordering requirements that conflict with scheduling goals.

Many storage systems have calls that modify metadata, such as mkdir and creat in Linux; the split framework also exposes these. This approach presents an advantage over the SCS framework, which cannot correctly schedule these calls. In particular, the cost of a metadata update greatly depends on file-system internals, of which SCS schedulers are unaware. Split schedulers, however, can observe metadata writes at the block level and accordingly charge the responsible applications.

File-system synchronization points (e.g., fsync or similar) require all dependent data to be flushed to disk and typically invoke the file system's ordering mechanism. Unfortunately, logically independent operations often must wait for the synchronized updates to complete (§2.3.2), so the ability to schedule fsync is essential. Furthermore, writes followed by fsync are more costly than writes by themselves, so schedulers should be able to treat the two patterns differently. Thus, the split framework also exposes fsync scheduling.

4 Split Scheduling in Linux

Split-style scheduling could be implemented in a variety of storage stacks. In this work, we implement it in Linux, integrating with the ext4 and XFS file systems.

4.1 Cross-Layer Tagging

In Linux, I/O work is described by different function calls and data structures at different layers. For example, a write request may be represented by (a) the arguments to vfs_write at the system-call level, (b) a buffer_head structure in memory, and (c) a bio structure at the block level. Schedulers in our framework see the same requests in different forms, so it is useful to have a uniform way to describe I/O across layers. We add a causes tagging structure that follows writes through the stack and identifies the original processes that caused an I/O operation. Split schedulers can thereby correctly map requests back to the application from any layer.

Writeback and journal tasks are marked as I/O proxies, as described earlier (§3.1). In ext4, writeback calls the ext4_da_writepages function ("da" stands for "delayed allocation"), which writes back a range of pages of a given file. We modify this function so that as it does allocation for the pages, it sets the writeback thread's proxy state as appropriate. For the journal proxy, we modify jbd2 (ext4's journal) to keep track of all tasks responsible for adding changes to the current transaction.
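As a rough illustration of the idea (not the actual Linux structures), the same tag can be carried by each layer's representation of a write and unioned when requests are batched; all names below are hypothetical.

/*
 * Simplified sketch of cross-layer tagging: the same `causes` tag is
 * attached to each representation of a write as it moves from the system
 * call, to the in-memory buffer, to the block-level request, so any hook
 * can recover the original causes.
 */
#include <stddef.h>

struct causes { unsigned long pids; };        /* set of causing processes */

struct syscall_write { struct causes *tag; const void *buf; size_t len; };
struct mem_buffer    { struct causes *tag; long file_off; };
struct block_request { struct causes *tag; long disk_sector; };

/* When the write dirties a buffer, the buffer inherits the syscall's tag. */
void tag_buffer(struct mem_buffer *b, const struct syscall_write *w) {
    b->tag = w->tag;
}

/* When the buffer becomes part of a block request, the request inherits it. */
void tag_request(struct block_request *r, const struct mem_buffer *b) {
    r->tag = b->tag;
}

/* Block-level merging: a request built from several buffers carries the
 * union of their cause sets. */
void merge_causes(struct causes *dst, const struct causes *src) {
    dst->pids |= src->pids;
}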
4.2 Scheduling Hooks

We now describe the hooks we expose, which are split across the system-call, memory, and block levels. Table 2 lists a representative sample of the split hooks.

Table 2: Split Hooks. The "Origin" column shows which hooks are new and which are borrowed from other frameworks.

System Call: These hooks allow schedulers to intercept entry and return points for various I/O system calls. A scheduler can delay the execution of a system call by simply sleeping in the entry hook. Like SCS, we intercept writes, so schedulers can separate writes before the file system entangles them. Unlike SCS, we do not intercept reads (no file-system mechanism entangles reads, so scheduling reads below the cache is preferable). Two metadata-write calls, creat and mkdir, and the Linux synchronization call, fsync, are also exposed to the scheduler. It would be useful (and straightforward) to support other metadata calls in the future (e.g., unlink).

Note that in our implementation, the caller is blocked until the system call is scheduled. Other implementations are possible, such as buffering the system calls and returning immediately, or simply returning EAGAIN to tell the caller to issue the system call later. We choose this particular implementation because of its simplicity and POSIX compliance. Linux itself blocks writes (when there are too many dirty pages) and fsyncs, and most applications already deal with this behavior using separate threads; what we do is no different.

Memory: These hooks expose page-cache internals to schedulers. In Linux, a writeback thread (pdflush) decides when to pass I/O to the block-level scheduler, which then decides when to pass that I/O to disk. Both components are performing scheduling tasks, and separating them is inefficient (e.g., writeback could flush more aggressively if it knew when the disk was idle). We add two hooks to inform the scheduler when buffers are dirtied or deleted. The buffer-dirty hook notifies the scheduler when a process dirties a buffer or when a dirty buffer is modified. In the latter case, the framework tells the scheduler which processes previously dirtied the buffer; depending on policy, the scheduler could revise accounting statistics, shifting some (or all) of the responsibility for the I/O to the last writer. The buffer-free hook tells the scheduler if a buffer is deleted before writeback. Schedulers can either rely on Linux to perform writeback and throttle write system calls to control how much dirty data accumulates before writeback, or they can take complete control of the writeback. We evaluate the trade-off of these two approaches later (§7.1.2).

Block: These hooks are identical to those in Linux's original scheduling framework; schedulers are notified when requests are added to the block level or completed by the disk. Although we did not modify the function interfaces at this level, schedulers implementing these hooks in our framework are more informed, given tags within the request structures that identify the responsible processes. The Linux scheduling framework has over a dozen other block-level hooks for initialization, request merging, and convenience. We support all these as well for compatibility, but do not discuss them here.

Implementing the split-level framework in Linux involves ∼300 lines of code, plus some file-system integration effort, which we discuss later (§6). While representing a small change in the Linux code base, it enables powerful scheduling capabilities, as we will show.
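As a rough illustration (the names below are hypothetical and simplified, not the exact kernel interface), a split scheduler amounts to a single handler table spanning the three levels:

/*
 * Sketch of a split scheduler's handler table: one set of hooks spanning
 * the system-call, memory, and block levels. The names only mirror the
 * hook categories described above.
 */
#include <stddef.h>
#include <sys/types.h>

struct buffer;          /* a dirty page-cache buffer (opaque here)  */
struct block_req;       /* a block-level request (opaque here)      */

struct split_sched_ops {
    /* System-call level: sleeping in an entry hook delays the call. */
    void (*write_entry)(pid_t pid, int fd, size_t len);
    void (*fsync_entry)(pid_t pid, int fd);
    void (*creat_entry)(pid_t pid, const char *path);

    /* Memory level: notifications about page-cache state. */
    void (*buffer_dirty)(struct buffer *b, pid_t dirtier);
    void (*buffer_free)(struct buffer *b);

    /* Block level: same shape as the existing Linux elevator hooks,
     * but requests carry cause tags. */
    void (*block_add)(struct block_req *r);
    void (*block_complete)(struct block_req *r);
};

A scheduler such as Split-AFQ would implement the write and fsync entry hooks to hold back low-priority writers while leaving reads to be reordered below the cache at the block level.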
4.3 Overhead

In this section, we evaluate the time and space overhead of the split framework. In order to isolate framework overhead from individual scheduler overhead, we compare no-op schedulers implemented in both our framework and the block framework (a no-op scheduler issues all I/O immediately, without any reordering). Figure 9 shows our framework imposes no noticeable time overhead, even with 100 threads.

Figure 9: Time Overhead. The split framework scales well with the number of concurrent threads doing I/O to an SSD.

The split framework introduces some memory overhead for tagging writes with causes structures (§4.1). Memory overheads roughly correspond to the number of dirty write buffers. To measure this overhead, we instrument kmalloc and kfree to track the number of bytes allocated for tags over time. For our evaluation, we run HDFS with a write-heavy workload, measuring allocations on a single worker machine. Figure 10 shows the results: with the default Linux settings, average overhead is 14.5 MB (0.2% of total RAM); the maximum is 23.3 MB. Most tagging is on the write buffers; thus, a system tuned for more buffering should have higher tagging overheads. With a 50% dirty_ratio [5], maximum usage is still only 52.2 MB (0.6% of total RAM).

Figure 10: Space Overhead. Memory overhead is shown for an HDFS worker with 8 GB of RAM under a write-heavy workload. Maximum and average overhead is measured as a function of the Linux dirty_ratio setting. dirty_background_ratio is set to half of dirty_ratio.

5 Scheduler Case Studies

In this section, we evaluate the split framework's ability to support a variety of scheduling goals. We implement AFQ (§5.1), Split-Deadline (§5.2), and Split-Token (§5.3), and compare these schedulers to similar schedulers in other frameworks. Unless otherwise noted, all experiments run on top of ext4 with the Linux 3.2.51 kernel (most XFS results are similar but usually not shown). Our test machine has an eight-core, 1.4 GHz CPU and 16 GB of RAM. We use 500 GB Western Digital hard drives (AAKX) and an 80 GB Intel SSD (X25-M).

5.1 AFQ: Actually Fair Queuing

As shown earlier (§2.1), CFQ's inability to correctly map requests to processes causes unfairness, due to the lack of information Linux's elevator framework provides. Moreover, file-system ordering requirements limit CFQ's reordering options, causing priority inversions. In order to overcome these two drawbacks, we introduce AFQ (Actually Fair Queuing scheduler) to allocate I/O fairly among processes according to their priorities.

Design: AFQ employs a two-level scheduling strategy. Reads are handled at the block level and writes (and calls that cause writes, such as fsync) are handled at the system-call level. This design allows reads to hit the cache while protecting writes from journal entanglement. Beneath the journal, low-priority blocks may be prerequisites for high-priority fsync calls, so writes at the block level are dispatched immediately. AFQ chooses I/O requests to dequeue at the block and system-call levels using the stride algorithm [51]. Whenever a block request is dispatched to disk, AFQ charges the responsible processes for the disk I/O. The I/O cost is based on a simple seek model.
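As a rough illustration of how stride scheduling picks the next client, consider the following self-contained sketch; it is a simplified illustration rather than AFQ's code. Each pick here charges a fixed quantum, whereas AFQ charges the estimated I/O cost, and the mapping from priorities to tickets is an arbitrary choice.

/*
 * Compact sketch of stride scheduling [51]: each client holds tickets
 * proportional to its priority; the client with the smallest pass value
 * goes next, and its pass advances by its stride, so service converges
 * to the ticket ratios.
 */
#include <stdio.h>

#define STRIDE1 (1 << 20)               /* large constant numerator */

struct client {
    int  tickets;                       /* higher priority => more tickets */
    long pass;                          /* virtual time of next service    */
};

/* Pick the client with the minimum pass and charge it one quantum. */
static int stride_pick(struct client *c, int n) {
    int winner = 0;
    for (int i = 1; i < n; i++)
        if (c[i].pass < c[winner].pass)
            winner = i;
    c[winner].pass += STRIDE1 / c[winner].tickets;
    return winner;
}

int main(void) {
    /* e.g., priorities 0 (high) and 7 (low) mapped to 8 and 1 tickets */
    struct client c[2] = { { 8, 0 }, { 1, 0 } };
    int served[2] = { 0, 0 };
    for (int i = 0; i < 900; i++)
        served[stride_pick(c, 2)]++;
    printf("high: %d, low: %d\n", served[0], served[1]);  /* roughly 8:1 */
    return 0;
}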
Evaluation: We compare AFQ to CFQ with four workloads, shown in Figure 11. Figure 11(a) shows read performance on AFQ and CFQ for eight threads, with priorities ranging from 0 (high) to 7 (low), each reading from its own file sequentially. We see that AFQ's performance is similar to CFQ, and both respect priorities.

Figure 11: AFQ Priority. The plots show the percentage of throughput that threads of each priority receive. The lines show the goal distributions; the labels indicate total throughput.

Figure 11(b) shows asynchronous sequential-write performance, again with eight threads. This time, CFQ fails to respect priorities because of write delegation, but AFQ correctly maps I/O requests via split tags, and thus respects priorities. On average, CFQ deviates from the ideal by 82%, AFQ only by 16% (a 5× improvement).

Figure 11(c) shows synchronous random-write performance: we set up 5 threads per priority level, and each keeps randomly writing and flushing (with fsync) 4 KB blocks. The average throughput of threads at each priority level is shown. CFQ again fails to respect priority; using fsync to force data to disk invokes ext4's journaling mechanism and keeps CFQ from reordering to favor high-priority I/O. AFQ, however, blocks low-priority fsyncs when needed, improving throughput for high-priority threads. As shown, AFQ is able to respect priority, deviating from the ideal value only by 3% on average while CFQ deviates by 86% (a 28× improvement).

Finally, Figure 11(d) shows throughput for a memory-intense workload that just overwrites dirty blocks in the write buffer. One thread at each priority level keeps overwriting 4 MB of data in its own file. Both CFQ and AFQ get very high performance as expected, though AFQ is slightly slower (AFQ needs to do significant bookkeeping for each write system call). The plot has no fairness goal line as there is no contention for disk resources.

In general, AFQ and CFQ have similar performance; however, AFQ always respects priorities, while CFQ only respects priorities for the read workloads.

5.2 Deadline

As shown earlier (Figure 5 in §2.3.2), Block-Deadline does poorly when trying to limit tail latencies, due to its inability to reorder block I/Os in the presence of file-system ordering requirements. Split-level scheduling, with system-call scheduling capabilities and memory-state knowledge, is better suited to this task.

Design: We implement the Split-Deadline scheduler by modifying the Linux deadline scheduler (Block-Deadline). Block-Deadline maintains two deadline queues and two location queues (for both read and write requests) [3]. In Split-Deadline, an fsync-deadline queue is used instead of a block-write deadline queue. During operation, if no read request or fsync is going to expire, block-level read and write requests are issued from the location queues to maximize performance. If some read requests or fsync calls are expiring, they are issued before their deadlines.

Split-Deadline monitors how much data is dirtied for one file using the buffer-dirty hook and thereby estimates the cost of an fsync. If there is an fsync pending that may affect other processes by causing too much I/O, it will not be issued directly. Instead, the scheduler asks the kernel to launch asynchronous writeback of the file's dirty data and waits until the amount of dirty data drops to a point such that other deadlines would not be affected by issuing the fsync. Asynchronous writeback does not generate a file-system synchronization point and has no deadline, so other operations are not forced to wait.
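The fsync handling described above can be sketched as follows; this is a simplified, self-contained illustration of the logic (dirty state is simulated in memory rather than read from the buffer-dirty hook), and the chunk size and budget are arbitrary.

/*
 * Simplified sketch of the fsync admission idea: drain a file's dirty data
 * with asynchronous writeback until the remaining flush is small enough
 * that issuing the real fsync cannot blow other requests' deadlines.
 */
#include <stdio.h>

#define WRITEBACK_CHUNK (64 * 1024)          /* bytes drained per writeback step */

static long dirty_bytes = 4 * 1024 * 1024;   /* pending dirty data for the file  */

/* Simulate one round of asynchronous writeback (no ordering point). */
static void async_writeback_step(void) {
    dirty_bytes -= WRITEBACK_CHUNK;
    if (dirty_bytes < 0)
        dirty_bytes = 0;
}

/* Admit the fsync only once its remaining cost fits under `budget`. */
static void schedule_fsync(long budget) {
    int steps = 0;
    while (dirty_bytes > budget) {           /* drain in the background first */
        async_writeback_step();
        steps++;
    }
    printf("fsync issued after %d writeback steps (%ld bytes still dirty)\n",
           steps, dirty_bytes);
    /* ...now the real fsync is cheap and can meet its deadline. */
}

int main(void) {
    schedule_fsync(256 * 1024);              /* budget chosen for illustration */
    return 0;
}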
Evaluation: We compare Split-Deadline to Block-Deadline for a database-like workload on both hard disk drive (HDD) and solid state drive (SSD). We set up two threads A (small) and B (big); thread A appends to a small file one block (4 KB) at a time and calls fsync (this mimics database log appends) while thread B writes 1024 blocks randomly to a large file and then calls fsync (this mimics database checkpointing).

The deadline settings are shown in Table 3. We choose shorter block-write deadlines than fsync deadlines because each fsync causes multiple block writes; however, our results do not appear sensitive to the exact values chosen. Linux's Block-Deadline scheduler does not support setting different deadlines for different processes, so we add this feature to enable a fair comparison.

                 A                        B
       Block Write   Fsync      Block Write   Fsync
  HDD  10 ms         100 ms     100 ms        6000 ms
  SSD  1 ms          3 ms       10 ms         100 ms

Table 3: Deadline Settings. For Block-Deadline, we set deadlines for block-level writes; for Split-Deadline, we set deadlines for fsyncs.

Figure 12 shows the experiment results on both HDD and SSD. We can see that when no I/O from B is interfering, both schedulers give A low-latency fsyncs. After B starts issuing big fsyncs, however, Block-Deadline starts to fail: A's fsync latencies increase by an order of magnitude; this happens because B generates too much bursty I/O when calling fsync, and the scheduler has no knowledge of or control over when they are coming. Worse, A's operations become dependent on these I/Os.

With Split-Deadline, however, A's fsync latencies mostly fluctuate around the deadline, even when B is calling fsync after large writes. Sometimes A exceeds its goal slightly because our estimate of the fsync cost is not perfect, but latencies are always relatively near the target. Such performance isolation is possible because Split-Deadline can reorder to spread the cost of bursty I/Os caused by fsync without forcing others to wait.

Figure 12: Fsync Latency Isolation. Dashed and solid lines present the goal latencies of A and B respectively. Dots represent the actual latency of B's calls, and pluses represent the actual latency of A's calls. The shaded area represents the time when B's fsyncs are being issued.

5.3 Token Bucket

Earlier, we saw that SCS-Token [18] fails to isolate performance (Figure 6 in §2.3.3). In particular, the throughput of a process A was sensitive to the activities of another process B. SCS underestimates the I/O cost of some B workloads, and thus does not sufficiently throttle B. In this section, we evaluate Split-Token, a reimplementation of token bucket in our framework.

Design: As with SCS-Token, throttled processes are given tokens at a set rate. I/O costs tokens, I/O is blocked if there are no tokens, and the number of tokens that may be held is capped. Split-Token throttles a process's system-call writes and block-level reads if and only if the number of tokens is negative. System-call reads are never throttled (to utilize the cache). Block writes are never throttled (to avoid entanglement).

Our implementation uses memory-level and block-level hooks for accounting. The scheduler promptly charges tokens as soon as buffers are dirtied, and then revises when the writes are later flushed to the block level (§3.2), charging more tokens (or refunding them) based on amplification and sequentiality. Tokens represent bytes, so accounting normalizes the cost of an I/O pattern to the equivalent amount of sequential I/O (e.g., 1 MB of random I/O may be counted as 10 MB).

Split-Token estimates I/O cost based on two models, both of which assume an underlying hard disk (simpler models could be used on SSD). When buffers are first dirtied at the memory level, a preliminary model estimates cost based on the randomness of request offsets within the file. Later, when the file system allocates space on disk for the requests and flushes them to the block level, a disk model revises the cost estimate. The second model is more accurate because it can consider more factors than the first model, such as whether the file system introduced any fragmentation, and whether the file is located near other files being written.
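Putting the pieces together, the sketch below illustrates Split-Token's throttling points and two-step charging as described above; it is a simplified illustration rather than the implementation, and the values used in main are arbitrary.

/*
 * Sketch of Split-Token's decision points: the balance may go negative;
 * only system-call writes and block-level reads are delayed, and only
 * while the balance is negative. Charging happens promptly at dirty time
 * and is corrected when the data reaches the block level.
 */
#include <stdbool.h>
#include <stdio.h>

struct split_token {
    double tokens;     /* bytes of normalized sequential I/O; may go < 0 */
};

/* Memory-level hook: charge promptly using the preliminary estimate. */
static void charge_on_dirty(struct split_token *t, double prompt_cost) {
    t->tokens -= prompt_cost;
}

/* Block-level hook: revise once the disk model knows the real layout
 * (a positive correction charges more; a negative one refunds). */
static void revise_on_flush(struct split_token *t, double correction) {
    t->tokens -= correction;
}

/* Should this operation be delayed right now? */
static bool should_throttle(const struct split_token *t,
                            bool is_syscall_level, bool is_write) {
    if (t->tokens >= 0)
        return false;                 /* in credit: never delay              */
    if (is_syscall_level)
        return is_write;              /* syscall writes yes, reads never     */
    return !is_write;                 /* block reads yes, block writes never */
}

int main(void) {
    struct split_token t = { .tokens = 0 };
    charge_on_dirty(&t, 4096);                      /* prompt charge: 4 KB    */
    printf("throttle syscall write? %d\n",
           should_throttle(&t, true, true));        /* 1: balance is negative */
    revise_on_flush(&t, -2048);                     /* flush was sequential:  */
    printf("balance after refund: %.0f\n", t.tokens);  /* partial refund      */
    return 0;
}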
Evaluation: We repeat our earlier SCS experiments (Figure 6) with Split-Token, as shown in Figure 13. We observe that whether B does reads or writes has little effect on A (the A lines are near each other). Whether B's pattern is sequential or random also has little impact (the lines are flat). Across all workloads, the standard deviation of A's performance is 7 MB, about a 6× improvement over SCS (SCS-Token's deviation was 41 MB).

Figure 13: Isolation: Split-Token with ext4. The same as Figure 6, but for our Split-Token implementation. A is the unthrottled sequential reader, and B is the throttled process performing I/O of different run sizes.

We now directly compare SCS-Token with Split-Token using a broader range of read and write workloads for process B. I/O can be random (expensive), sequential, or served from memory (cheap). As before, A is an unthrottled reader, and B is throttled to 1 MB/s of normalized I/O. Figure 14 (left) shows that Split-Token is near the isolation target all six times, whereas SCS-Token significantly deviates three times (twice by more than 50%), again showing Split-Token provides better isolation.

Figure 14: Split-Token vs. SCS-Token. Left: A's throughput slowdown is shown. Right: B's performance is shown. Process A achieves about 138 MB/s when running alone, and B is throttled to 1 MB/s of normalized I/O, so there should be a 0.7% slowdown for A (shown by a target line). The x-axis indicates B's workload; A always reads sequentially.

After isolation, a secondary goal is the best performance for throttled processes, which we measure in Figure 14 (right). Sometimes B is faster with SCS-Token, but only because SCS-Token is incorrectly sacrificing isolation for A (e.g., B does faster random reads with SCS-Token, but A's performance drops over 80%). We consider the cases where SCS-Token did provide isolation. First, Split-Token is 2.3× faster for "read-mem". SCS-Token logic must run on every read system call, whereas Split-Token does not. SCS-Token still achieves nearly 2 GB/s, though, indicating cache hits are not throttled. Although the goal of SCS-Token was to do system-call scheduling, Craciunas et al. needed to modify the file system to tell which reads are cache hits [19]. Second, Split-Token is 837× faster for "write-mem". SCS-Token does write accounting at the system-call level, so it does not differentiate buffer overwrites from new writes. Thus, SCS-Token unnecessarily throttles B. With Split-Token, B's throughput does not reach 1 MB/s for "read-seq" because the intermingled I/Os from A and B are no longer sequential; we charge it to both A and B.

We finally evaluate Split-Token for a large number of threads; we repeat the six workloads of Figure 14, this time varying the number of B threads performing the I/O task (all threads of B share the same I/O limit). Figure 15 shows the results. For sequential read, the number of B threads has no impact on A's performance, as desired. We do not show random read, sequential write, or random write, as these lines would appear the same as the read-sequential line (varying at most by 1.7%). However, when B is reading or writing to memory, A's performance is only steady if B has 128 threads or less. Since the B threads do not incur any disk I/O, our I/O scheduler does not throttle them, leaving the B threads free to dominate the CPU, indirectly slowing A. To confirm this, we do an experiment (also shown in Figure 15) where B threads simply execute a spin loop, issuing no I/O; A's performance still suffers in this case. This reminds us of the usefulness of CPU schedulers in addition to I/O schedulers: if a process does not receive enough CPU time, it may not be able to issue requests fast enough to fully utilize the storage system.

Figure 15: Split-Token Scalability. A's throughput is shown as a function of the number of B threads performing a given activity. Goal performance is 101.7 MB (these numbers were taken on a 32-core CloudLab node with a 1 TB drive).

5.4 Implementation Effort

Implementing different schedulers within the split framework is not only possible, but relatively easy: Split-AFQ takes ∼950 lines of code to implement, Split-Deadline takes ∼750 lines of code, and Split-Token takes ∼950 lines of code. As a comparison, Block-CFQ takes more than 4000 lines of code, Block-Deadline takes ∼500 lines of code, and SCS-Token takes ∼2000 lines of code (SCS-Token is large because there is not a clean separation between the scheduler and framework).

6 File System Integration

Thus far we have presented results with ext4; now, we consider the effort necessary to integrate ext4 and other file systems, in particular XFS, into the split framework. Integrating a file system involves (a) tagging relevant data structures the file system uses to represent I/O in memory and (b) identifying the proxy mechanisms in the file system and properly tagging the proxies.

In Linux, part (a) is mostly file-system independent as many file systems use generic page buffer data structures to represent I/O. Both ext4 and XFS rely heavily on the buffer_head structure, which we already tag properly. Thus we are able to integrate XFS buffers with split tags by adding just two lines of code, and ext4 with less than 10 lines. In contrast, btrfs [33] uses its own buffer structures, so integration would require more effort.

Part (b), on the other hand, is highly file-system specific, as different file systems use different proxy mechanisms. For ext4, the journal task acts as a proxy when writing the physical journal, and the writeback task acts as a proxy when doing delayed allocation. XFS uses logical journaling, and has its own journal implementation. For a copy-on-write file system, garbage collection would be another important proxy mechanism. Properly tagging these proxies is a bit more involved. In ext4, it takes 80 lines of code across 5 different files. Fortunately, such proxy mechanisms typically only involve metadata, so for data-dominated workloads, partial integration with only (a) should work relatively well.

In order to verify the above hypotheses, we have fully integrated ext4 with the split framework, and only partially integrated XFS with part (a). We evaluate the effectiveness of our partial XFS integration on both data-intensive and metadata-intensive workloads.

Figure 16 repeats our earlier isolation experiment (Figure 13), but with XFS; these experiments are data-intensive. Split-Token again provides significant isolation, with A only having a deviation of 12.8 MB. In fact, all the experiments we show earlier are data intensive, and XFS has similar results (not shown) to ext4.

Figure 16: Isolation: Split-Token with XFS. This is the same as Figure 6 and Figure 14, but for XFS running with our Split implementation of token bucket.
[Figure 18: SQLite Transaction Latencies. 99th and 99.9th percentiles of the transaction latencies are shown. The x-axis is the number of dirty buffers we allow before checkpoint. Panels: (a) Block-Deadline, (b) Split-Deadline; y-axis: Latency (s); x-axis: Buffers/Checkpoint.]

[Figure 19: PostgreSQL Transaction Latencies. A CDF of transaction latencies is shown for three systems. Split-Pdflush is Split-Deadline, but with pdflush controlling writeback separately. y-axis: Percentage of Xacts; x-axis: Transaction Latency (ms).]
Figure 17 shows the performance of a metadata-intense workload for both XFS and ext4. In this experiment, A reads sequentially while B repeatedly creates empty files and flushes them to disk with fsync. B is throttled, A is not. B sleeps between each create for a time varied on the x-axis. As shown in the left plot, B's sleep time influences A's performance significantly with XFS, but with ext4 A is isolated. The right plot explains why: with ext4, B's creates are correctly throttled, regardless of how long B sleeps. With XFS, however, B is unthrottled because XFS does not give the scheduler enough information to map the metadata writes (which are performed by journal tasks) back to B.

We conclude that some file systems can be partially integrated with minimal effort, and data-intense workloads will be well supported. Support for metadata workloads, however, requires more effort.

7 Real Applications

In this section, we explore whether the split framework is a useful foundation for databases (§7.1), virtual machines (§7.2), and distributed file systems (§7.3).

7.1 Databases

To show how real databases could benefit from Split-Deadline's low-latency fsyncs, we measure transaction response time for SQLite3 [26] and PostgreSQL [10] running with both Split-Deadline and Block-Deadline.

7.1.1 SQLite3

We run SQLite3 on a hard disk drive. For Split-Deadline, we set short deadlines (100 ms) for fsyncs on the write-ahead log file and reads from the database file, and set long deadlines (10 seconds) for fsyncs on the database file. For Block-Deadline, the default settings (50 ms for block reads and 500 ms for block writes) are used. We make minor changes to SQLite to allow concurrent log appends and checkpointing and to set appropriate deadlines. For our benchmark, we randomly update rows in a large table, measure transaction latencies, and run checkpointing in a separate thread whenever the number of dirty buffers reaches a threshold.
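The paper does not show the interface used to hand these per-file deadlines to the scheduler, so the following is only a hypothetical sketch of the policy used in this experiment: short deadlines on WAL fsyncs and database reads, and a long deadline on database-file fsyncs. The structure, file-name suffixes, and lookup helper are all assumptions for illustration; a deadline-aware fsync wrapper could consult such a table before issuing the call.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical per-file deadline policy mirroring the SQLite settings
     * above; how such hints reach Split-Deadline is not shown here, so we
     * only model the policy lookup. File names are placeholders. */
    struct deadline_policy {
        const char *suffix;        /* file this policy applies to        */
        int fsync_deadline_ms;     /* deadline for fsync()s on the file  */
        int read_deadline_ms;      /* deadline for reads of the file     */
    };

    static const struct deadline_policy policies[] = {
        { "-wal", 100,   100 },    /* write-ahead log: short deadlines   */
        { ".db",  10000, 100 },    /* main database: long fsync deadline */
    };

    static const struct deadline_policy *lookup(const char *path)
    {
        for (size_t i = 0; i < sizeof(policies) / sizeof(policies[0]); i++) {
            size_t plen = strlen(path), slen = strlen(policies[i].suffix);
            if (plen >= slen && strcmp(path + plen - slen, policies[i].suffix) == 0)
                return &policies[i];
        }
        return NULL;
    }

    int main(void)
    {
        const char *files[] = { "app.db-wal", "app.db" };
        for (int i = 0; i < 2; i++) {
            const struct deadline_policy *p = lookup(files[i]);
            if (p)
                printf("%-10s fsync deadline %5d ms, read deadline %3d ms\n",
                       files[i], p->fsync_deadline_ms, p->read_deadline_ms);
        }
        return 0;
    }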
Figure 18(a) shows the transaction tail latencies (99th and 99.9th percentiles) when we change the checkpointing threshold. When checkpoint thresholds are larger, checkpointing is less frequent, fewer transactions are affected, and thus the 99th-percentile line falls. Unfortunately, this approach does not eliminate tail latencies; instead, it concentrates the cost on fewer transactions, so the 99.9th-percentile line continues to rise. In contrast, Figure 18(b) shows that Split-Deadline provides much smaller tail latencies (a 4× improvement for 1K buffers).

7.1.2 PostgreSQL

We run PostgreSQL [10] on top of an SSD and benchmark it using pgbench [4], a TPC-B-like workload. We change PostgreSQL to set I/O deadlines for each worker thread. We want consistently low transaction latencies (within 15 ms), so we set the foreground fsync deadline to 5 ms and the background checkpointing fsync deadline to 200 ms for Split-Deadline. For Block-Deadline, we set the block write deadline to 5 ms. For block reads, a deadline of 5 ms is used for both Split-Deadline and Block-Deadline. Checkpoints occur every 30 seconds.

Figure 19 shows the cumulative distribution of the transaction latencies. We can see that when running on top of Block-Deadline, 4% of transactions fail to meet their latency target, and over 1% take longer than 500 ms. After further inspection, we found that the latency spikes happen at the end of each checkpoint period, when the system begins to flush a large amount of dirty data to disk using fsync. Such data flushing interferes with foreground I/Os, causing long transaction latencies and low system throughput. The database community has long experienced this "fsync freeze" problem, and has no great solution for it [2, 9, 10]. We show next that Split-Deadline provides a simple solution to this problem.

When running Split-Deadline, we have the ability to schedule fsyncs and minimize their performance impact on foreground transactions. However, pdflush (Linux's writeback task) may still submit many writeback I/Os without scheduler involvement and interfere with foreground I/Os. Split-Deadline maintains deadlines in this case by limiting the amount of data pdflush may flush at any given time by throttling write system calls.
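The throttling just described can be pictured with a small model (ours, not the paper's code): the scheduler bounds the amount of dirty data that pdflush could later flush by refusing to admit write() system calls past a limit, and admits them again as writeback completes. The 16 MB limit and function names below are illustrative assumptions.

    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative model: bound the dirty data that pdflush can later
     * flush by throttling write() admission. */
    static size_t dirty_bytes = 0;
    static size_t dirty_limit = 16 * 1024 * 1024;   /* assumed 16 MB cap */

    /* Called on the write() path: admit the write only if it keeps the
     * amount of flushable dirty data under the limit. */
    static int admit_write(size_t len)
    {
        if (dirty_bytes + len > dirty_limit)
            return 0;               /* caller should wait and retry      */
        dirty_bytes += len;
        return 1;
    }

    /* Called when writeback completes: the dirty data has reached disk,
     * so new writes may be admitted again. */
    static void writeback_done(size_t len)
    {
        dirty_bytes = (len > dirty_bytes) ? 0 : dirty_bytes - len;
    }

    int main(void)
    {
        /* Three 8 MB writes against a 16 MB cap: the third must wait. */
        for (int i = 0; i < 3; i++)
            printf("write %d admitted? %s\n", i,
                   admit_write(8 * 1024 * 1024) ? "yes" : "no (throttled)");
        writeback_done(8 * 1024 * 1024);
        printf("after writeback, retry admitted? %s\n",
               admit_write(8 * 1024 * 1024) ? "yes" : "no");
        return 0;
    }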
[Figure 20: QEMU Isolation. This is the same as Figure 14, but processes A and B run in different QEMU virtual machines (ext4 on the host). B is throttled to 5 MB/s. Reported throughput is for the processes at the guest system-call level. Left panel: A's Slowdown; right panel: B's Bytes/Sec; workloads: rand/seq/mem reads and writes, for Split and SCS.]

[Figure 21: HDFS Isolation. Solid-black and gray bars represent the total throughput of throttled and unthrottled HDFS writers, respectively. Dashed lines represent an upper bound on throughput; solid lines represent Block-CFQ throughput. Panels: (a) 64MB blocks, (b) 16MB blocks; y-axis: Throughput (MB/s); x-axis: Rate Cap (MB/s); bars: throttled, unused, unthrottled.]
In Figure 19 we can see that this approach effectively eliminates tail latency: 99.99% of the transactions are completed within 15 ms. Unfortunately, the median transaction latency is much higher because write buffers are not fully utilized.

When pdflush is disabled, though, Split-Deadline has complete control of writeback, and can allow more dirty data in the system without worrying about untimely writeback I/Os. It then initiates writeback in a way that both observes deadlines and optimizes performance, thus eliminating tail latencies while maintaining low median latencies, as shown in Figure 19.

7.2 Virtual Machines (QEMU)

Isolation is especially important in cloud environments, where customers expect to be isolated from other (potentially malicious) customers. To evaluate our framework's usefulness in this environment, we repeat our token-bucket experiment in Figure 14, this time running the unthrottled process A and throttled process B in separate QEMU instances. The guests run a vanilla kernel; the host runs our modified kernel. Thus, throttling is on the whole VM, not just the benchmark we run inside. We use an 8 GB machine with a four-core 2.5 GHz CPU.

Figure 20 shows the results for QEMU running over both SCS and Split-Token on the host. The isolation results for A (left) are similar to the results when we ran A and B directly on the host (Figure 14): with Split-Token, A is always well isolated, but with SCS, A experiences major slowdowns when B does random I/O.

The throughput results for B (right) are more interesting: whereas before SCS greatly slowed memory-bound workloads, now SCS and Split-Token provide equal performance for these workloads. This is because when a throttled process is memory bound, it is crucial for performance that a caching/buffering layer exist above the scheduling layer. The split and QEMU-over-SCS stacks have this property (and memory workloads are fast), but the raw-SCS stack does not.

7.3 Distributed File Systems (HDFS)

To show that local split scheduling is a useful foundation to provide isolation in a distributed environment, we integrate HDFS with Split-Token to provide isolation to HDFS clients. We modify the client-to-worker protocol so workers know which account should be billed for disk I/O generated by the handling of a particular RPC call. Account information is propagated down to Split-Token and across to other workers (for pipelined writes).
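The protocol change can be sketched as follows. This is a toy model under our own assumptions (a single request struct, a recursive call standing in for the network hop to the next worker), not the actual HDFS wire format: the write request carries an account identifier that each worker bills locally via Split-Token and forwards along the replication pipeline.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical wire format for the modified client-to-worker write
     * request: the only conceptual change is the account field, which a
     * worker hands to its local Split-Token scheduler and forwards to the
     * next worker in the replication pipeline. Field names and sizes are
     * our own invention. */
    struct write_block_request {
        uint64_t block_id;     /* which block is being written          */
        uint32_t length;       /* payload bytes in this request         */
        uint32_t account_id;   /* who should be billed for the disk I/O */
    };

    /* On each worker: bill the account locally, then forward downstream. */
    static void handle_write(const struct write_block_request *req,
                             int replicas_left)
    {
        printf("billing account %u for %u bytes\n",
               req->account_id, req->length);
        if (replicas_left > 1)
            handle_write(req, replicas_left - 1);  /* models the pipeline hop */
    }

    int main(void)
    {
        struct write_block_request req = { 42, 64 * 1024 * 1024, 7 };
        handle_write(&req, 3);   /* triple replication: three workers billed */
        return 0;
    }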
We evaluate our modified HDFS on a 256-core CloudLab cluster (one NameNode and seven workers, each with 32 cores). Each worker has 8 GB of RAM and a 1 TB disk. We run an unthrottled group of four threads and a throttled group of four threads. Each thread sequentially writes to its own HDFS file.

Figure 21(a) shows the result for varying rate limits on the x-axis. The summed throughput (i.e., that of both throttled and unthrottled writers) is similar to the throughput when HDFS runs over CFQ without any priorities set. With Split-Token, though, smaller rate caps on the throttled threads provide the unthrottled threads with better performance (e.g., the gray bars get more throughput when the black bars are locally throttled to 16 MB/s).

Given there are seven datanodes, and data must be triply written for replication, the expected upper bound on total I/O is (ratecap/3) × 7. The dashed lines show these upper bounds in Figure 21(a); the black bars fall short. We found that many tokens go unused on some workers due to load imbalance. The hashed black bars represent the potential HDFS write I/O that was thus lost.

In Figure 21(b), we try to improve load balance by decreasing the HDFS block size from 64 MB (the default) to 16 MB. With smaller blocks, fewer tokens go unused, and the throttled writers achieve I/O rates nearer the upper bound. We conclude that local scheduling can be used to meet distributed isolation goals; however, throttled applications may get worse-than-expected performance if the system is not well balanced.
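The per-account throttling each worker performs can be illustrated with a generic token-bucket sketch; this is our own model of the technique, not the Split-Token implementation, and the 16 MB/s cap and 4 MB burst below are arbitrary example values.

    #include <stdio.h>

    /* Per-account token bucket, refilled at the account's rate cap. */
    struct token_bucket {
        double tokens;       /* available bytes                    */
        double rate;         /* refill rate in bytes per second    */
        double burst;        /* maximum accumulated bytes          */
    };

    static void refill(struct token_bucket *b, double elapsed_sec)
    {
        b->tokens += b->rate * elapsed_sec;
        if (b->tokens > b->burst)
            b->tokens = b->burst;
    }

    /* A worker asks before dispatching an I/O billed to this account. */
    static int try_consume(struct token_bucket *b, double bytes)
    {
        if (b->tokens < bytes)
            return 0;            /* throttled: queue until refill */
        b->tokens -= bytes;
        return 1;
    }

    int main(void)
    {
        /* Throttled account capped at 16 MB/s with a 4 MB burst. */
        struct token_bucket acct = { 0, 16e6, 4e6 };

        refill(&acct, 0.25);     /* 250 ms elapse: +4 MB of tokens */
        printf("4 MB block: %s\n",   try_consume(&acct, 4e6) ? "ok" : "wait");
        printf("another 4 MB: %s\n", try_consume(&acct, 4e6) ? "ok" : "wait");
        return 0;
    }

In the HDFS experiment above, tokens that a throttled account cannot spend on a lightly loaded worker correspond to the unused I/O shown by the hashed bars in Figure 21.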
8 Related Work

Multi-Layer Scheduling: A number of works argue that efficient I/O scheduling requires coordination at multiple layers in the storage stack [45, 50, 52, 56]. Riska et al. [40] evaluated the effectiveness of optimizations at various layers of the I/O path, and found that superior performance is yielded by combining optimizations at various layers. Redline [56] tries to avoid system unresponsiveness during fsync by scheduling at both the buffer cache level and the block level. Argon [50] combines mechanisms at different layers to achieve performance insulation. However, compared to these ad-hoc approaches, our framework provides a systematic way for schedulers to plug in logic at different layers of the storage stack while still maintaining modularity.

Cause Mapping and Tagging: The need to correctly account resource consumption to the responsible entities arises in different contexts. Banga et al. [14] found that the kernel consumes resources on behalf of applications, causing difficulty in scheduling. The hypervisor may also do work on behalf of a virtual machine, making it difficult to isolate performance [24]. We identify the same problem in I/O scheduling, and propose tagging as a general solution. Both Differentiated Storage Services (DSS) [35] and IOFlow [48] also tag data across layers. DSS tags the type of data, IOFlow tags the type and cause, and split scheduling tags with a set of causes.

Software-Defined Storage Stack: In the spirit of moving toward a more software-defined storage (SDS) stack, the split-level framework exposes knowledge and control at different layers to a centralized entity, the scheduler. The IOFlow [48] stack is similar to split scheduling in this regard; both tag I/O across layers and have a central controller.

IOFlow, however, operates at the distributed level; the lowest IOFlow level is an SMB server that resides above a local file system. IOFlow does not address the core file-system issues, such as write delegation or ordering requirements, and thus likely has the same disadvantages as system-call scheduling. We believe that the problems introduced by local file systems, which we identify and solve in this paper, are inherent to any storage stack. We argue that any complete SDS solution would need to solve them, and thus our approach is complementary. Combining IOFlow with split scheduling, for example, could be very useful: flows could be tracked through hypervisor, network, and local-storage layers.

Shue et al. [46] provision I/O resources in a key-value store (Libra) by co-designing the application and I/O scheduler; however, they noted that "OS-level effects due to filesystem operations [...] are beyond Libra's reach"; building such applications with the split framework should provide more control.

Exposing File-System Mechanisms: Split-level scheduling requires file systems to expose certain mechanisms (journaling, delayed allocation, etc.) to the framework by properly tagging them as proxies. Others have also found that exposing file-system information is helpful [20, 37, 55]. For example, in Featherstitch [20], file-system ordering requirements are exposed to the outside as dependency rules so that the kernel can make informed decisions about writeback.

Other I/O Scheduling Techniques: Different approaches have been proposed to improve different aspects of I/O scheduling: to better incorporate rotational awareness [28, 29, 44], to better support different storage devices [30, 36], or to provide better QoS guarantees [23, 32, 39]. All these techniques are complementary to our work and can be incorporated into our framework as new schedulers.

9 Conclusion

In this work, we have shown that single-layer schedulers operating at either the block level or the system-call level fail to support common goals due to a lack of coordination with other layers. While our experiments indicate that simple layering must be abandoned, we need not sacrifice modularity. In our split framework, the scheduler operates across all layers, but is still abstracted behind a collection of handlers. This approach is relatively clean, and enables pluggable scheduling. Supporting a new scheduling goal simply involves writing a new scheduler plug-in, not re-engineering the entire storage system. Our hope is that split-level scheduling will inspire future vertical integration in storage stacks. Our source code is available at http://research.cs.wisc.edu/adsl/Software/split.

Acknowledgement

We thank the anonymous reviewers and Angela Demke Brown (our shepherd) for their insightful feedback. We thank the members of the ADSL research group for their suggestions and comments on this work at various stages. This material was supported by funding from NSF grants CNS-1421033, CNS-1319405, and CNS-1218405, as well as generous donations from Cisco, EMC, Facebook, Google, Huawei, IBM, Los Alamos National Laboratory, Microsoft, NetApp, Samsung, Symantec, Seagate, and VMware as part of the WISDOM research institute sponsorship. Tyler Harter is supported by the NSF Fellowship. Samer Al-Kiswany is supported by the NSERC Postdoctoral Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and may not reflect the views of NSF or other institutions.
References

[1] CFQ (Complete Fairness Queueing). https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt.
[2] Database/kernel community topic at collaboration summit 2014. http://www.postgresql.org/message-id/[email protected].
[3] Deadline IO scheduler tunables. https://www.kernel.org/doc/Documentation/block/deadline-iosched.txt.
[4] Documentation for pgbench. http://www.postgresql.org/docs/9.4/static/pgbench.html.
[5] Documentation for /proc/sys/vm/*. https://www.kernel.org/doc/Documentation/sysctl/vm.txt.
[6] Inside the Windows Vista Kernel: Part 1. http://technet.microsoft.com/en-us/magazine/2007.02.vistakernel.aspx.
[7] ionice(1) - Linux man page. http://linux.die.net/man/1/ionice.
[8] Notes on the Generic Block Layer Rewrite in Linux 2.5. https://www.kernel.org/doc/Documentation/block/biodoc.txt.
[9] pgsql-hackers maillist communication. http://www.postgresql.org/message-id/CA+Tgmobv6gm6SzHx8e2w-[email protected].
[10] Postgresql 9.2.9 documentation. http://www.postgresql.org/docs/9.2/static/wal-configuration.html.
[11] Anand, A., Sen, S., Krioukov, A., Popovici, F., Akella, A., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Banerjee, S. Avoiding File System Micromanagement with Range Writes. In OSDI '08 (San Diego, CA, December 2008).
[12] Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Information and Control in Gray-Box Systems. In ACM SIGOPS Operating Systems Review (2001), vol. 35, ACM, pp. 43–56.
[13] Arpaci-Dusseau, R. H., and Arpaci-Dusseau, A. C. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 2014.
[14] Banga, G., Druschel, P., and Mogul, J. C. Resource containers: A new facility for resource management in server systems. In OSDI (1999), vol. 99, pp. 45–58.
[15] Best, S. JFS Overview. http://jfs.sourceforge.net/project/pub/jfs.pdf, 2000.
[16] Bonwick, J., and Moore, B. ZFS: The Last Word in File Systems. http://opensolaris.org/os/community/zfs/docs/zfs_last.pdf, 2007.
[17] Bosch, P., and Mullender, S. Real-time disk scheduling in a mixed-media file system. In Real-Time Technology and Applications Symposium, 2000. RTAS 2000. Proceedings. Sixth IEEE (2000), pp. 23–32.
[18] Craciunas, S. S., Kirsch, C. M., and Röck, H. The TAP Project: Traffic Shaping System Calls. http://tap.cs.uni-salzburg.at/downloads.html.
[19] Craciunas, S. S., Kirsch, C. M., and Röck, H. I/O Resource Management Through System Call Scheduling. SIGOPS Oper. Syst. Rev. 42, 5 (July 2008), 44–54.
[20] Frost, C., Mammarella, M., Kohler, E., de los Reyes, A., Hovsepian, S., Matsuoka, A., and Zhang, L. Generalized File System Dependencies. In SOSP '07 (Stevenson, WA, October 2007), pp. 307–320.
[21] Ganger, G. R., McKusick, M. K., Soules, C. A., and Patt, Y. N. Soft Updates: A Solution to the Metadata Update Problem in File Systems. ACM Transactions on Computer Systems (TOCS) 18, 2 (2000), 127–153.
[22] Gu, W., Kalbarczyk, Z., Iyer, R. K., and Yang, Z. Characterization of Linux Kernel Behavior Under Errors. In DSN '03 (San Francisco, CA, June 2003), pp. 459–468.
[23] Gulati, A., Merchant, A., and Varman, P. J. pClock: An Arrival Curve Based Approach for QoS Guarantees in Shared Storage Systems. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 2007), SIGMETRICS '07, ACM, pp. 13–24.
[24] Gupta, D., Cherkasova, L., Gardner, R., and Vahdat, A. Enforcing performance isolation across virtual machines in Xen. In Middleware 2006. Springer, 2006, pp. 342–362.
[25] Hagmann, R. Reimplementing the Cedar File System Using Logging and Group Commit. In SOSP '87 (Austin, TX, November 1987).
[26] Hipp, D. R., and Kennedy, D. SQLite, 2007.
[27] Hofri, M. Disk scheduling: FCFS vs. SSTF revisited. Communications of the ACM 23, 11 (1980), 645–653.
[28] Huang, L., and Chiueh, T. Implementation of a Rotation-Latency-Sensitive Disk Scheduler. Tech. Rep. ECSL-TR81, SUNY, Stony Brook, March 2000.
[29] Jacobson, D. M., and Wilkes, J. Disk Scheduling Algorithms Based on Rotational Position. Tech. Rep. HPL-CSP-91-7, Hewlett Packard Laboratories, 1991.
[30] Kim, J., Oh, Y., Kim, E., Choi, J., Lee, D., and Noh, S. H. Disk Schedulers for Solid State Drivers. In Proceedings of the Seventh ACM International Conference on Embedded Software (New York, NY, USA, 2009), EMSOFT '09, ACM, pp. 295–304.
[31] Lumb, C., Schindler, J., Ganger, G., Nagle, D., and Riedel, E. Towards Higher Disk Head Utilization: Extracting "Free" Bandwidth From Busy Disk Drives. In OSDI '00 (San Diego, CA, October 2000), pp. 87–102.
[32] Lumb, C. R., Merchant, A., and Alvarez, G. A. Facade: Virtual storage devices with performance guarantees. In FAST '03 (San Francisco, CA, April 2003).
[33] Mason, C. The Btrfs Filesystem. oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf, September 2007.
[34] Mathur, A., Cao, M., Bhattacharya, S., Dilge, A., Tomas, A., and Vivier, L. The New Ext4 Filesystem: Current Status and Future Plans. In Ottawa Linux Symposium (OLS '07) (Ottawa, Canada, July 2007).
[35] Mesnier, M., Chen, F., Luo, T., and Akers, J. B. Differentiated Storage Services. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11) (Cascais, Portugal, October 2011).
[36] Park, S., and Shen, K. FIOS: A Fair, Efficient Flash I/O Scheduler. In FAST (2012), p. 13.
[37] Pillai, T. S., Chidambaram, V., Alagappan, R., Al-Kiswany, S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (Broomfield, CO, October 2014).
[38] Popovici, F. I., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Robust, Portable I/O Scheduling with the Disk Mimic. In USENIX Annual Technical Conference, General Track (2003), pp. 297–310.
[39] Povzner, A., Kaldewey, T., Brandt, S., Golding, R., Wong, T. M., and Maltzahn, C. Efficient Guaranteed Disk Request Scheduling with Fahrrad. In EuroSys '08 (Glasgow, Scotland UK, March 2008).
[40] Riska, A., Larkby-Lahet, J., and Riedel, E. Evaluating Block-level Optimization Through the IO Path. In USENIX '07 (Santa Clara, CA, June 2007).
[41] Rizzo, L., and Checconi, F. GEOM SCHED: A Framework for Disk Scheduling within GEOM. http://info.iet.unipi.it/~luigi/papers/20090508-geom_sched-slides.pdf.
[42] Rosenblum, M., and Ousterhout, J. The Design and Implementation of a Log-Structured File System. ACM Transactions on Computer Systems 10, 1 (February 1992), 26–52.
[43] Ruemmler, C., and Wilkes, J. An Introduction to Disk Drive Modeling. IEEE Computer 27, 3 (March 1994), 17–28.
[44] Seltzer, M., Chen, P., and Ousterhout, J. Disk Scheduling Revisited. In USENIX Winter '90 (Washington, DC, January 1990), pp. 313–324.
[45] Shenoy, P., and Vin, H. Cello: A Disk Scheduling Framework for Next-generation Operating Systems. In SIGMETRICS '98 (Madison, WI, June 1998), pp. 44–55.
[46] Shue, D., and Freedman, M. J. From application requests to Virtual IOPs: Provisioned key-value storage with Libra. In Proceedings of the Ninth European Conference on Computer Systems (2014), ACM, p. 17.
[47] Sweeney, A., Doucette, D., Hu, W., Anderson, C., Nishimoto, M., and Peck, G. Scalability in the XFS File System. In USENIX 1996 (San Diego, CA, January 1996).
[48] Thereska, E., Ballani, H., O'Shea, G., Karagiannis, T., Rowstron, A., Talpey, T., Black, R., and Zhu, T. IOFlow: A Software-Defined Storage Architecture. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (2013), ACM, pp. 182–196.
[49] Van Meter, R., and Gao, M. Latency Management in Storage Systems. In OSDI '00 (San Diego, CA, October 2000), pp. 103–117.
[50] Wachs, M., Abd-El-Malek, M., Thereska, E., and Ganger, G. R. Argon: Performance insulation for shared storage servers. In FAST '07 (San Jose, CA, February 2007).
[51] Waldspurger, C. A., and Weihl, W. E. Stride Scheduling: Deterministic Proportional Share Resource Management. Massachusetts Institute of Technology, Laboratory for Computer Science, 1995.
[52] Wang, H., and Varman, P. J. Balancing fairness and efficiency in tiered storage systems with bottleneck-aware allocation. In FAST '13 (San Jose, CA, February 2014).
[53] Wilkes, J., Golding, R., Staelin, C., and Sullivan, T. The HP AutoRAID Hierarchical Storage System. ACM Transactions on Computer Systems 14, 1 (February 1996), 108–136.
[54] Worthington, B. L., Ganger, G. R., and Patt, Y. N. Scheduling Algorithms for Modern Disk Drives. In SIGMETRICS '94 (Nashville, TN, May 1994), pp. 241–251.
[55] Yang, J., Sar, C., and Engler, D. EXPLODE: A Lightweight, General System for Finding Serious Storage System Errors. In OSDI '06 (Seattle, WA, November 2006).
[56] Yang, T., Liu, T., Berger, E. D., Kaplan, S. F., and Moss, J. E. B. Redline: First Class Support for Interactivity in Commodity Operating Systems. In OSDI '08 (San Diego, CA, December 2008).
[57] Yu, X., Gum, B., Chen, Y., Wang, R. Y., Li, K., Krishnamurthy, A., and Anderson, T. E. Trading Capacity for Performance in a Disk Array. In OSDI '00 (San Diego, CA, October 2000).