
Making Lockless Synchronization Fast:

Performance Implications of Memory Reclamation

Thomas E. Hart¹*, Paul E. McKenney², and Angela Demke Brown¹

¹ University of Toronto, Dept. of Computer Science, Toronto, ON M5S 2E4, Canada
  {tomhart, demke}@cs.toronto.edu
² IBM Beaverton, Linux Technology Center, Beaverton, OR 97006, USA
  [email protected]

* Supported by an NSERC Canada Graduate Scholarship.

Abstract
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims nodes once they are no longer in use.

The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes—quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation—using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect memory reclamation performance.

1 Introduction

As multiprocessors become mainstream, multithreaded applications will become more common, increasing the need for efficient coordination of concurrent accesses to shared data structures. Traditional locking requires expensive atomic operations, such as compare-and-swap (CAS), even when locks are uncontended. For example, acquiring and releasing an uncontended spinlock requires over 400 cycles on an IBM POWER CPU. Therefore, many researchers recommend avoiding locking [2, 7, 22]. Some systems, such as Linux, use concurrently-readable synchronization, which uses locks for updates but not for reads. Locking is also susceptible to priority inversion, convoying, deadlock, and blocking due to thread failure [3, 10], leading researchers to pursue non-blocking (or lock-free) synchronization [6, 12, 13, 14, 16, 29]. In some cases, lock-free approaches can bring performance benefits [25]. For clarity, we describe all strategies that avoid locks as lockless.

Figure 1. Read/reclaim race.

A major challenge for lockless synchronization is handling the read/reclaim races that arise in dynamic data structures. Figure 1 illustrates the problem: thread T1 removes node N from a list while thread T2 is referencing it. N's memory must be reclaimed to allow reuse, lest memory exhaustion block all threads, but such reuse is unsafe while T2 continues referencing N. For languages like C, where memory must be explicitly reclaimed (e.g. via free()), programmers must combine a memory reclamation scheme with their lockless data structures to resolve these races.¹

Several such reclamation schemes have been proposed. Programmers need to understand the semantics and the performance implications of each scheme, since the overhead of inefficient reclamation can be worse than that of locking. For example, reference counting [5, 29] has high overhead in the base case and scales poorly with data-structure size. This is unacceptable when performance is the motivation for lockless synchronization. Unfortunately, there is no single optimal scheme, and existing work is relatively silent on factors affecting reclamation performance.

¹ Reclamation is subsumed into automatic garbage collectors in environments that provide them, such as Java.


Figure 2. Illustration of QSBR. Black boxes represent quiescent states.

Figure 3. QSBR is inherently blocking.

We address this deficit by comparing three recent reclamation schemes, showing the respective strengths and weaknesses of each. In Sections 2 and 3, we review these schemes and describe factors affecting their performance. Section 4 explains our experimental setup. Our analysis, in Section 5, reveals substantial performance differences between these schemes, the greatest source of which is per-operation atomic instructions. In Section 6, we discuss the relevance of our work to designers and implementers. We show that lockless algorithms and reclamation schemes are mostly independent, by combining a blocking reclamation scheme and a non-blocking algorithm, then comparing this combination to a fully non-blocking equivalent. We also present a new reclamation scheme that combines aspects of two other schemes to give good performance and ease of use. We close with a discussion of related work in Section 7 and summarize our conclusions in Section 8.

2 Memory Reclamation Schemes

This section briefly reviews the reclamation schemes we consider: quiescent-state-based reclamation (QSBR) [22, 2], epoch-based reclamation (EBR) [6], hazard-pointer-based reclamation (HPBR) [23, 24], and reference counting [29, 26]. We provide an overview of each scheme to help the reader understand our work; further details are available in the papers cited.

2.1 Blocking Schemes

We discuss two blocking schemes, QSBR and EBR, which use the concept of a grace period. A grace period is a time interval [a,b] such that, after time b, all nodes removed before time a may safely be reclaimed. These methods force threads to wait for other threads to complete their lockless operations in order for a grace period to occur. Failed or delayed threads can therefore prevent memory from being reclaimed. Eventually, memory allocation will fail, causing threads to block.

2.1.1 Quiescent-State-Based Reclamation (QSBR)

QSBR uses quiescent states to detect grace periods. A quiescent state for thread T is a state in which T holds no references to shared nodes; hence, a grace period for QSBR is any interval of time during which all threads pass through at least one quiescent state. The choice of quiescent states is application-dependent. Many operating system kernels contain natural quiescent states, such as voluntary context switch, and use QSBR to implement read-copy update (RCU) [22, 20, 7].

Figure 2 illustrates the relationship between quiescent states and grace periods in QSBR. Thread T1 goes through quiescent states at times t1 and t5, T2 at times t2 and t4, and T3 at time t3. Hence, a grace period is any time interval containing either [t1, t3] or [t3, t5].

QSBR must detect grace periods so that removed nodes may be reclaimed; however, detecting minimal grace periods is unnecessary. In Figure 2, for example, any interval containing [t1, t3] or [t3, t5] is a grace period; implementations which check for grace periods only when threads enter quiescent states would detect [t1, t5], since T1's two quiescent states form the only pair from a single thread which enclose a grace period.

Figure 3 shows why QSBR is blocking. Here, T2 goes through no quiescent states (for example, due to blocking on I/O). Threads T1 and T3 are then prevented from reclaiming memory, since no grace periods exist. The ensuing memory exhaustion will eventually block all threads.
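To make this bookkeeping concrete, the sketch below shows one simple way QSBR can be implemented. It is our illustration rather than the paper's implementation: the NTHREADS bound, the limbo-list layout, and the helper names other than quiescent_state() and getTID() are assumptions, and per-access memory-ordering details are elided for brevity.

/* Sketch of QSBR bookkeeping (illustrative only). */
#include <stdlib.h>

#define NTHREADS 8

extern int getTID(void);

struct retired {
    struct retired *next;
    unsigned long snap[NTHREADS];   /* counter values at removal time */
    void *obj;
};

static volatile unsigned long qs_count[NTHREADS]; /* per-thread counters */
static struct retired *limbo[NTHREADS];           /* per-thread limbo lists */

/* Called whenever a thread holds no references to shared nodes. */
void quiescent_state(void)
{
    qs_count[getTID()]++;
}

/* Defer freeing obj until a grace period has elapsed. */
void qsbr_retire(void *obj)
{
    int tid = getTID(), i;
    struct retired *r = malloc(sizeof(*r));

    r->obj = obj;
    for (i = 0; i < NTHREADS; i++)
        r->snap[i] = qs_count[i];
    r->next = limbo[tid];
    limbo[tid] = r;
}

/* Free nodes whose removal has been followed by a quiescent state
 * in every thread, i.e. by a grace period. */
void qsbr_reclaim(void)
{
    struct retired **pp = &limbo[getTID()], *r;

    while ((r = *pp) != NULL) {
        int i, safe = 1;
        for (i = 0; i < NTHREADS; i++)
            if (qs_count[i] == r->snap[i])
                safe = 0;           /* thread i not yet quiescent since removal */
        if (safe) {
            *pp = r->next;
            free(r->obj);
            free(r);
        } else {
            pp = &r->next;
        }
    }
}

Note that a thread which never announces a quiescent state keeps safe false forever, which is exactly the blocking behaviour illustrated in Figure 3.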
2.1.2 Epoch-Based Reclamation (EBR)

Fraser's EBR [6] follows QSBR in using grace periods, but uses epochs in place of QSBR's quiescent states. Each thread executes in one of three logical epochs, and may lag at most one epoch behind the global epoch. Each thread atomically sets a per-thread flag upon entry into a critical region, indicating that the thread intends to access shared data without locks. Upon exit, the thread atomically clears its flag. No thread is allowed to access an EBR-protected object outside of a critical region.
Figure 4. Illustration of EBR.

Figure 4 shows how EBR tracks epochs, allowing safe memory reclamation. Upon entering a critical region, a thread updates its local epoch to match the global epoch. Hence, if the global epoch is e, threads in critical regions can be in either epoch e or e-1, but not e+1 (all mod 3). Any node a thread removes during a given epoch may be safely reclaimed the next time the thread re-enters that epoch. Thus, the time period [t1, t2] in the figure is a grace period. A thread will attempt to update the global epoch only if it has not changed for some pre-determined number of critical region entries.

As with QSBR, reclamation can be stalled by threads which fail in critical regions, but threads not in critical regions cannot stall EBR. EBR's bookkeeping is invisible to the applications programmer, making it easy for a programmer to use; however, Section 5 shows that this property imposes significant overhead on EBR.

Figure 5 shows an example of a search of a linked list which allows lockless reads but uses locking for updates. QSBR omits lines 4 and 12, which handle EBR's epoch bookkeeping, but is otherwise identical; QSBR's quiescent states are flagged explicitly at a higher level (see Figure 8).

1  int search (struct list *l, long key)
2  {
3      node_t *cur;
4      critical_enter();
5      for (cur = l->list_head;
6           cur != NULL; cur = cur->next) {
7          if (cur->key >= key) {
8              critical_exit();
9              return (cur->key == key);
10         }
11     }
12     critical_exit();
13     return (0);
14 }

Figure 5. EBR concurrently-readable search.
states are flagged explicitly at a higher level (see Figure 8). references — references to shared nodes that may have been
removed by other threads or that are vulnerable to the ABA
2.2 Non-blocking schemes problem [24]. Such references require hazard pointers. The
algorithm sets a hazard pointer, then checks for node re-
This section presents the non-blocking reclamation moval; if the node has not been removed, then it may safely
schemes we evaluate: hazard-pointer-based reclamation be referenced. As long as the hazard pointer references the
(HPBR) and lock-free reference counting (LFRC). node, HPBR’s reclamation routine refrains from reclaiming
it. Figure 6 illustrates the use of HPBR. Node N has been re-
moved from the linked list, but cannot be reclaimed because
2.2.1 Hazard-Pointer-Based Reclamation (HPBR)
T2’s hazard pointer HP[2] references it.
Michael’s HPBR [23] provides an existence locking mech- Figure 7, showing code adapted from Michael [24],
anism for dynamically-allocated nodes. Each thread per- demonstrates HPBR with a search algorithm corresponding
forming lockless operations has K hazard pointers which to Figure 5. At most two nodes must be protected: the cur-
it uses to protect nodes from reclamation by other threads; rent node and its predecessor (K=2). The code removing
nodes, which is not shown here, uses the low-order bit of 3.2 Data Structures and Workload
the next pointer as a flag. This guarantees that the valida-
tion step on line 14 will fail and retry in case of concurrent Data structures differ in both the operations they provide,
removal. Full details are given by Michael [24]. and in their common workloads. Queues are write-only,
Herlihy et al. [15] presented a very similar scheme called but linked lists and hash tables are often read-mostly [19].
Pass the Buck (PTB). Since HPBR and PTB have similar Blocking schemes may perform poorly with update-heavy
per-operation costs, we believe that our HPBR results apply structures, since the risk of memory exhaustion is higher.
to PTB as well. Conversely, non-blocking schemes may perform poorly
with operations such as list or tree traversal which visit
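The reclamation side of HPBR can be sketched as follows. This is our illustration with assumed names and bounds; Michael [24] gives the real algorithm, which organizes the hazard-pointer snapshot for efficient lookup.

/* Sketch of HPBR retirement and scanning (illustrative only). */
#include <stdlib.h>

#define NTHREADS 8
#define K        2                   /* hazard pointers per thread */
#define H        (NTHREADS * K)      /* total hazard pointers */
#define R        (2*H + 100)         /* scan threshold used in Section 4 */

extern void *HP[H];                  /* global hazard-pointer array */
extern int   getTID(void);

static void *rlist[NTHREADS][R];     /* per-thread lists of retired nodes */
static int   rcount[NTHREADS];

/* Free every retired node to which no hazard pointer currently points. */
static void hpbr_scan(int tid)
{
    int i, j, kept = 0;

    for (i = 0; i < rcount[tid]; i++) {
        void *node = rlist[tid][i];
        int hazardous = 0;
        for (j = 0; j < H; j++)
            if (HP[j] == node)
                hazardous = 1;       /* still protected by some thread */
        if (hazardous)
            rlist[tid][kept++] = node;   /* keep for a later scan */
        else
            free(node);
    }
    rcount[tid] = kept;
}

/* Retire a removed node; scan once R nodes have accumulated. */
void hpbr_retire(void *node)
{
    int tid = getTID();

    rlist[tid][rcount[tid]++] = node;
    if (rcount[tid] == R)
        hpbr_scan(tid);
}

Each scan costs a pass over the hazard-pointer array, and the per-visited-node fence in Figure 7 is why HPBR's traversal cost grows with list length.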
2.2.2 Lock-Free Reference Counting (LFRC)

Lock-free reference counting (LFRC) is a well-known non-blocking garbage-collection technique. Threads track the number of references to nodes, reclaiming any node whose count is zero. Valois's LFRC scheme [29] (corrected by Michael and Scott [26]) uses CAS and fetch-and-add (FAA), and requires nodes to retain their type after reclamation. Sundell's scheme [28], based on Valois's, is wait-free. The scheme of Detlefs et al. [5] allows nodes' types to change upon reclamation, but requires double compare-and-swap (DCAS), which no current CPU supports.

Michael [24] showed that LFRC introduces overhead which often makes lockless algorithms perform worse than lock-based versions. We include some experiments with Valois' scheme to reproduce Michael's findings.
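The per-node cost that makes LFRC expensive is visible even in a highly simplified sketch like the one below. This is our illustration, not Valois's full scheme: it assumes type-stable nodes so that a count may be examined even after the node is logically freed, and it omits the CAS-based link manipulation and the corrections of Michael and Scott [26].

/* Sketch of LFRC reference acquisition and release (illustrative only). */
struct lfrc_node {
    volatile long refcount;
    struct lfrc_node *next;
    long key;
};

extern long faa(volatile long *addr, long delta);       /* atomic fetch-and-add,
                                                            returns the prior value */
extern void memory_fence(void);
extern void node_return_to_pool(struct lfrc_node *n);   /* type-stable pool */

/* Take a reference before dereferencing a node reached from a shared link. */
static void lfrc_acquire(struct lfrc_node *n)
{
    faa(&n->refcount, 1);    /* one atomic instruction per visited node */
    memory_fence();
}

/* Drop a reference; the last holder recycles the node. */
static void lfrc_release(struct lfrc_node *n)
{
    if (faa(&n->refcount, -1) == 1)  /* we removed the final reference */
        node_return_to_pool(n);
}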
3 Reclamation Performance Factors

We categorize factors which can affect reclamation scheme performance; we vary these factors in Section 5.

3.1 Memory Consistency

Current literature on lock-free algorithms generally assumes a sequentially-consistent [18] memory model, which prohibits instruction reordering and globally orders memory references. For performance reasons, however, modern CPUs provide weaker memory consistency models in which ordering can be enforced only when needed via special fence instructions. Although fences are often omitted from pseudocode, they are expensive on most modern CPUs and must be included in realistic performance analyses.

HPBR, EBR, and LFRC require per-operation fences. HPBR, as shown in Figure 7, requires a fence between hazard-pointer setting and validation, thus one fence per visited node. LFRC also requires per-node fences, in addition to atomic instructions needed to maintain reference counts. EBR requires two fences per operation: one when setting a flag when entering a critical region, and one when clearing it upon exit. Since QSBR has no per-operation fences, its per-operation overhead can be very low.

3.2 Data Structures and Workload

Data structures differ in both the operations they provide and in their common workloads. Queues are write-only, but linked lists and hash tables are often read-mostly [19]. Blocking schemes may perform poorly with update-heavy structures, since the risk of memory exhaustion is higher. Conversely, non-blocking schemes may perform poorly with operations such as list or tree traversal which visit many nodes, since they require per-node fences.

3.3 Threads and Scheduling

We expect contention due to concurrent threads to be a minor source of reclamation overhead; however, for the non-blocking schemes, it could be unbounded in degenerate cases: readers forced to repeatedly restart their traversals must repeatedly execute fence instructions for every node.

Thread preemption, especially when there are more threads than processors, can adversely affect blocking schemes. Descheduled threads can delay reclamation, potentially exhausting memory, particularly in update-heavy workloads. Longer scheduling quanta may increase the risk of this exhaustion.

3.4 Memory Constraints

Although lock-free memory allocators exist [25], many allocators use locking. Blocking methods will see greater lock contention because they must access a lock-protected global pool more frequently. Furthermore, if a thread is preempted while holding such a lock, other threads will block on memory allocation. The size of the global pool is finite, and governs the likelihood of a blocking scheme exhausting all memory. Only HPBR [23] provides a provable bound on the amount of unreclaimed memory; it should thus be less sensitive to these constraints.

4 Experimental Setup

We evaluated the memory reclamation strategies with respect to the factors outlined in Section 3 using commodity SMP systems with IBM POWER CPUs. This section provides details on these aspects of our experiments.

4.1 Data Structures Used

We tested the reclamation schemes on linked lists and queues. We used Michael's ordered lock-free linked list [24], which forbids duplicate keys, and coded our concurrently-readable lists similarly. Because linked lists permit arbitrary lengths and read-to-write ratios, we used them heavily in our experiments.
Our lock-free queue follows Michael and Scott's design [24]. Queues allow evaluating QSBR on a write-only data structure, which no prior studies have done.

4.2 Test Program

In our tests, a parent thread creates N child threads, starts a timer, and stops the threads upon timer expiry. Child threads count the number of operations they perform, and the parent then calculates the average execution time per operation by dividing the duration of the test by the total number of operations. The CPU time is the execution time divided by the minimum of the number of threads and the number of processors. CPU time compensates for increasing numbers of CPUs, allowing us to focus on synchronization overhead. Our tests report the average of five trials.

Each thread runs repeatedly through the test loop shown in Figure 8 until the timer expires. QSBR tests place a quiescent state at the end of the loop. The probabilities of inserting and removing nodes are equal, keeping data-structure size roughly constant throughout a given run.

1  while (parent's timer has not expired) {
2      for i from 1 to 100 do {
3          key = random key;
4          op = random operation;
5          d = data structure;
6          op(d, key);
7      }
8      if (using QSBR)
9          quiescent_state();
10 }

Figure 8. Per-thread test pseudocode.

We vary the number of threads and nodes. For linked lists, we also vary the ratio of reads to updates. As shown in Figure 8, each thread performs 100 operations per quiescent state; hence, grace-period-related overhead is amortized across 100 operations. For EBR, each op in Figure 8 is a critical region; a thread attempts to update the global epoch whenever it has entered a critical region 100 times since the last update, again amortizing grace-period-related overhead across 100 operations. For HPBR, we amortized reclamation overhead over R = 2H + 100 node removals. For consistency, QSBR and EBR both used the fuzzy barrier [11] algorithm from Fraser's EBR implementation [6].

The code for our experiments is available at http://www.cs.toronto.edu/~tomhart/perflab/ipdps06.tgz.

4.3 Operating Environment

We performed our experiments on the two machines shown in Table 1. The last line of this table gives the combined costs of locking and then unlocking a spinlock.

Table 1. Characteristics of Machines

          XServe                       IBM POWER
CPUs      2x 2.0GHz PowerPC G5         8x 1.45 GHz POWER4+
Kernel    Linux 2.6.8-1.ydl.7g5-smp    Linux 2.6.13 (kernel.org)
Fence     78ns (156 cycles)            76ns (110 cycles)
CAS       52ns (104 cycles)            59ns (86 cycles)
Lock      231ns (462 cycles)           243ns (352 cycles)

Our experiment implements threads using processes. Our memory allocator is similar to that of Bonwick [4]. Each thread has two freelists of up to 100 elements each, and can acquire more memory from a global non-blocking stack of freelists. This non-blocking allocator allowed us to study reclamation performance without considering pathological locking conditions discussed in Section 3.

We implemented CAS using POWER's LL/SC instructions (larx and stcx), and fences using the eieio instruction. Our spinlocks were implemented using CAS and fences. Our algorithms used exponential backoff [1] upon encountering conflicts.
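As an illustration of the locking baseline, a spinlock built from CAS, fences, and exponential backoff might look like the sketch below. This is our reconstruction under assumed primitive names (cas(), memory_fence(), and the backoff constants), not the exact code used in the experiments.

/* Sketch of a CAS-based spinlock with exponential backoff (illustrative only). */
extern int  cas(volatile long *addr, long oldval, long newval); /* 1 on success */
extern void memory_fence(void);

#define BACKOFF_MIN 1
#define BACKOFF_MAX 1024

void spin_lock(volatile long *lock)
{
    int delay = BACKOFF_MIN;
    volatile int i;

    for (;;) {
        if (*lock == 0 && cas(lock, 0, 1)) {
            memory_fence();      /* keep critical-section accesses after acquisition */
            return;
        }
        /* Conflict: back off exponentially before retrying [1]. */
        for (i = 0; i < delay; i++)
            ;                    /* spin */
        if (delay < BACKOFF_MAX)
            delay *= 2;
    }
}

void spin_unlock(volatile long *lock)
{
    memory_fence();              /* keep critical-section accesses before release */
    *lock = 0;
}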
4.4 Limitations of Experiment

Microbenchmarks are never perfect [17]; however, they allow us to study reclamation performance by varying each of the factors outlined in Section 3 independently. Our results show that these factors significantly affect reclamation performance. In macrobenchmark experiments, it is more difficult to gain insight into the causes of performance differences, and to test the schemes comprehensively.

Some applications may not have natural quiescent states; furthermore, detecting quiescent states in other applications may be more expensive than it is in our experiments. Our QSBR implementation, for example, is faster than that used in the Linux kernel, due to the latter's need to support dynamic insertion and removal of CPUs, interrupt handlers, and real-time workloads.

Our HPBR experiments statically allocate hazard pointers. Although this is sufficient for our experiments, some algorithms, to the best of our knowledge, require unbounded numbers of hazard pointers.

Despite these limitations, we believe that our experiments thoroughly evaluate these schemes, and show when each scheme is and is not efficient.

5 Performance Analysis

We first investigate the base costs for the reclamation schemes: single-threaded execution on small data structures. We then show how workload, list traversal length, number of threads, and preemption affect performance.
5.1 Base Costs

Figure 9 shows the single-threaded base costs of these schemes on non-blocking queues and single-element linked lists with no preemption or contention. We ran LFRC only on read-only workloads; these were sufficient for us to corroborate Michael's [24] result that LFRC performs poorly.

Figure 9. Base costs — single-threaded data from 8-CPU machine.

In these base cases, the dominant influence on performance is per-operation atomic instructions: compare-and-swap (CAS), fetch-and-add (FAA), and fences make LFRC much more expensive than the other schemes. Since EBR requires two fences per operation, and HPBR requires one for most operations considered here, EBR is usually the next most expensive. QSBR, needing no per-operation atomic instructions, is the cheapest scheme in the base case.

Workload affects the performance of these schemes. Under an update-intensive workload, a significant number of operations will involve removing nodes; for each attempt to reclaim a removed node, HPBR must search the array of hazard pointers. This overhead can become significant for update-intensive workloads, as can be seen in Figure 9. We note that in our experiments, the per-operation execution times of QSBR, EBR, and HPBR all increased linearly between read-only and update-only workloads.

5.2 Scalability with Traversal Length

Figure 10 shows the effect of list length on a single-threaded read-only workload. We observed similar results in write-only workloads. As expected, per-element fence instructions degrade HPBR's performance on long chains of elements; QSBR and EBR do much better.

Figure 10. Effect of traversal length — read-only lock-free list, one thread, 8 CPUs.

Figure 11 shows the same scenario, but also includes LFRC. At best, LFRC takes more than twice as long as the next slowest scheme, and the performance gap rapidly increases with list length due to the multiple per-node atomic instructions. Because LFRC is always the worst scheme in terms of performance, we do not consider it further.

Figure 11. Effect of traversal length, including LFRC — read-only lock-free list, one thread, 8 CPUs.

5.3 Scalability with Threads

Concurrent performance is an obvious concern for memory reclamation schemes. We study the effect of threads sharing the data structure when there is no CPU contention, and when threads must also compete for the CPU.

5.3.1 No Preemption

To reduce the effects of CPU contention (thread preemption, migration, etc.), we use a maximum of seven threads, ensuring that one CPU is available for other processes, following Fraser [6].

Figures 12 and 13 show the performance of the reclamation schemes with a read-only workload on a linked list, and with a write-only workload on a queue. All three schemes scale almost linearly in the read-only case. In both cases, the schemes' relative performance seems to be unaffected by the number of threads.

Figure 12. Effect of adding threads — read-only lock-free list, one element, 8 CPUs.

Figure 13. Effect of adding threads — lock-free queue, 8 CPUs.

5.3.2 With Preemption

To evaluate the performance of the reclamation schemes under preemption, we ran our tests on our 2-CPU machine, varying the number of threads from 1 to 32.

Figure 14 shows the performance of the schemes on a one-element lock-free linked list with a read-only workload.
This case eliminates reclamation overhead, focusing solely on read-side and fuzzy barrier overhead. In this case, the algorithms all scale well, with QSBR remaining cheapest. For the write-heavy workloads shown in Figure 15, HPBR performs best due to its non-blocking design.

Figure 14. Effect of preemption — read-only lock-free list, 2 CPUs.

Figure 15. Effect of preemption — lock-free queue, 2 CPUs.

The blocking schemes perform well on this write-heavy workload only because threads yield the processor upon allocation failure. Figure 16 shows the same test as Figure 15, but with busy-waiting upon allocation failure. Here, HPBR performs well, but EBR and QSBR quickly exhaust the pool of free memory. Each thread spins waiting for more memory to become free, thereby preventing grace periods from completing in a timely manner and hence delaying memory reclamation.

Figure 16. Effect of busy waiting when out of memory — lock-free queue, 2 CPUs.

Although this busy waiting would be a poor design choice in a real application, this test demonstrates that preemption and write-heavy workloads can cause QSBR and EBR to exhaust all memory. Similarly, Sarma and McKenney [27] have shown how QSBR-based components of the Linux kernel are vulnerable to denial-of-service attacks. Although this can be avoided with engineering effort – and has been, in Linux – it is in these situations that HPBR's fault-tolerance becomes valuable.

5.4 Summary

We note several trends of interest. First, in the base case, atomic instructions such as fences are the dominant cost. Second, when large numbers of elements must be traversed, HPBR and reference counting suffer from significant overhead due to extra atomic instructions. QSBR and EBR perform poorly when grace periods are stalled and the workload is update-intensive.

6 Consequences of Analysis

We describe the consequences of our analysis for comparing algorithms, designing new reclamation schemes, and choosing reclamation schemes for applications.
6.1 Fair Evaluation of Algorithms

Reclamation schemes have profound performance effects that must be accounted for when experimentally evaluating new lockless algorithms.

For example, one of our early experiments compared a concurrently-readable linked list with QSBR, as is used in the Linux kernel, with a lock-free HPBR equivalent. Our intuition was that lock-free algorithms might pay a performance penalty for their fault-tolerance properties, as they do in read-mostly situations. The LF-HPBR and CR-QSBR traces in Figure 17 might lead to the erroneous conclusion that the concurrently-readable algorithm is always faster. A better analysis takes the LF-QSBR trace into account, noting that as the update fraction increases, lock-free performance improves, since its updates require fewer atomic instructions than does locking. This example shows that one can accurately compare two lockless algorithms only when each is using the same reclamation scheme.

Figure 17. Lock-free versus concurrently-readable algorithms — ten-element lists, one thread, 8 CPUs.

LF-QSBR's higher per-node overhead makes it more attractive when there are fewer nodes. Figure 18 shows the performance of hash tables consisting of arrays of LF-QSBR or CR-QSBR single-element lists being concurrently accessed by four threads. For clarity, we omit HPBR from this graph — our intent is to compare the lock-free and concurrently-readable algorithms using a common reclamation scheme. Here, the lock-free algorithm out-performs locking for update fractions above about 15%. Lock-free lists and hash tables might therefore be practical for update-heavy situations in environments providing QSBR, such as OS kernels like Linux.

Figure 18. Lock-free versus concurrently-readable algorithms — hash tables with load factor 1, four threads, 8 CPUs.

New reclamation schemes should also be evaluated by varying each of the factors that can affect their performance. For example, Gidenstam et al. [8] recently proposed a new non-blocking reclamation scheme that combines reference counting with HPBR, and can be proven to have several attractive properties. However, like HPBR and reference counting, it requires expensive per-node atomic operations. The evaluation of this scheme consisted only of experiments on double-ended queues, thus failing to evaluate scalability with data-structure size, an HPBR weakness. This failing shows the value of our analysis: it is necessary to vary the experimental parameters we considered to gain a full understanding of a given scheme's performance.

6.2 Improving Reclamation Performance

Improved reclamation schemes can be designed based on an understanding of the factors that affect performance. For example, we observe that a key difference between QSBR and EBR is the per-operation overhead of EBR's two fences. This observation allows us to make a modest improvement to EBR called new epoch-based reclamation (NEBR).

NEBR requires compromising EBR's application-independence. Instead of setting and clearing the flag at the start and end of every lockless operation, we set it at the application level before entering any code that might contain NEBR critical regions. Since our flag is set and cleared at the application level, we can amortize the overhead of the corresponding fence instructions over a larger number of operations. We reran the experiment shown in Figure 12, but including NEBR, and found that NEBR scaled linearly and performed slightly better than did HPBR.

Figure 19. Performance of NEBR — lock-free list, 8 CPUs, read-only workload, variable number of threads.
Furthermore, NEBR does not need the expensive per-node atomic operations that ruin HPBR's performance for long traversals.

NEBR is attractive because it is almost as fast as QSBR, but does not require quiescent states. Interestingly, recent realtime variants of the Linux-kernel RCU also dispense with quiescent states [21]. Ongoing work is expected to substantially reduce realtime RCU read-side overhead.

It is interesting to consider what sort of weakened non-blocking property, if any, could be defined such that one could create a corresponding reclamation scheme without requiring expensive per-node atomic operations.
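The following sketch illustrates the NEBR idea as we understand it (our illustration; the entry-point names are assumptions): the application marks entry to and exit from whole regions of lockless activity, so the per-operation critical-region calls of EBR degenerate to cheap epoch bookkeeping with no unconditional fences.

/* Sketch of NEBR's application-level flag (illustrative only). */
#define NTHREADS 8

extern void memory_fence(void);
extern int  getTID(void);

static volatile int global_epoch;
static volatile int in_lockless[NTHREADS];   /* application-level flags */
static volatile int local_epoch[NTHREADS];

/* Called once at the application level, before any code that might
 * contain NEBR critical regions; the fence is amortized over many operations. */
void nebr_thread_online(void)
{
    in_lockless[getTID()] = 1;
    memory_fence();
}

void nebr_thread_offline(void)
{
    memory_fence();
    in_lockless[getTID()] = 0;
}

/* Per-operation entry now only observes the global epoch. */
void nebr_critical_enter(void)
{
    local_epoch[getTID()] = global_epoch;
}

void nebr_critical_exit(void)
{
    /* Nothing to do per operation; epoch advance and reclamation
     * proceed as in EBR. */
}

Whether any ordering is still needed when the local epoch is updated depends on the target memory model; the point is that EBR's two unconditional per-operation fences are no longer on the fast path.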

6.3 Blocking Memory Reclamation for Non-Blocking Data Structures

We have shown that non-blocking data structures often perform better when using blocking reclamation schemes. One might question why one would want to use a non-blocking data structure in this case, since a halted thread would cause an infinite memory leak, thus destroying the non-blocking data structure's fault-tolerance guarantees.

However, non-blocking data structures are often used for reasons other than fault-tolerance; for example, Qprof [3] and Cache Kernel [10] both use such structures because they can be accessed from signal handlers without risk of self-deadlock. Blocking memory reclamation does not remove this benefit. In fact, Cache Kernel uses a blocking implementation of Type-Stable Memory to guard against read-reclamation races; its implementation [9] has similarities to QSBR. Non-blocking algorithms with blocking reclamation schemes similarly continue to benefit from resistance to preemption-induced convoying and priority inversion.

We view combining a non-blocking algorithm with a blocking reclamation scheme as part of a trend towards weakened non-blocking properties [16, 3], designed to preserve selected advantages of non-blocking synchronization while improving performance. In this case, threads have all the advantages of non-blocking synchronization, unless the system runs out of memory.

Conversely, one may also use non-blocking reclamation with blocking algorithms to reduce the amount of memory awaiting reclamation in the face of preempted or failed threads.

7 Related Work

Relevant work on reclamation scheme design was discussed in Section 2. Previous work on the performance of these schemes, however, is limited.

Michael [24] criticized QSBR for its unbounded memory use, but did not compare the performance of QSBR to that of HPBR, or determine when this limitation affects a program.

Auslander implemented a lock-free hash table with QSBR in K42 [19]. No performance evaluation, either between different reclamation methods or between concurrently-readable and lock-free hash tables, was provided. We are unaware of any work combining QSBR with update-intensive non-blocking algorithms such as queues.

Fraser [6] noted, but did not thoroughly evaluate, HPBR's fence overhead, and used his EBR instead. Our work extends Fraser's, showing that EBR itself has high overhead, often exceeding that of HPBR.

8 Conclusions

We have performed the first fair comparison of blocking and non-blocking reclamation across a range of workloads, showing that reclamation has a huge effect on lockless algorithm performance. Choosing the right scheme for the environment in which a concurrent algorithm is expected to run is essential to having the algorithm perform well.

Our results show that QSBR is usually the best-performing reclamation scheme; however, the performance of both QSBR and EBR can suffer due to memory exhaustion in the face of thread preemption or failure. HPBR and EBR have higher base costs than QSBR due to their required fences. For EBR, the worst-case overhead of fences is constant, while for HPBR it is unbounded. HPBR and LFRC scale poorly when many nodes must be traversed.

Our analysis helped us to identify the main source of overhead in EBR and decrease it, resulting in NEBR. Furthermore, understanding the impact of reclamation schemes on algorithm performance enables fair comparison of different algorithms — in our case, lock-free and concurrently-readable lists and hash tables.

We reiterate that blocking reclamation can be useful with non-blocking algorithms: in the absence of thread failure, non-blocking algorithms still benefit from deadlock-freedom, signal handler safety, and avoidance of priority inversion. Nevertheless, the question of what sort of weakened non-blocking property could be satisfied by a reclamation scheme without the per-node overhead of current non-blocking reclamation scheme designs remains open.

9 Acknowledgments

We owe thanks to Maged Michael and Keir Fraser for helpful comments on their respective work, to Faith Fich and Cristiana Amza for much helpful advice, and to Dan Frye for his support of this effort. We are indebted to Martin Bligh, Andy Whitcroft, and the ABAT team for the use of the 8-CPU machine used in our experiments.
10 Legal Statement

IBM and POWER are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others.

References

[1] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel and Distributed Syst., 1(1):6–16, 1990.
[2] A. Arcangeli, M. Cao, P. E. McKenney, and D. Sarma. Using read-copy update techniques for System V IPC in the Linux 2.5 kernel. In Proc. USENIX Annual Technical Conf. (FREENIX Track), pages 297–310. USENIX Association, June 2003.
[3] H.-J. Boehm. An almost non-blocking stack. In Proc. 23rd ACM Symp. on Principles of Distributed Computing, pages 40–49, 2004.
[4] J. Bonwick and J. Adams. Magazines and Vmem: Extending the slab allocator to many CPUs and arbitrary resources. In USENIX Technical Conf., General Track, pages 15–33, 2001.
[5] D. L. Detlefs, P. A. Martin, M. Moir, and G. L. Steele, Jr. Lock-free reference counting. Distributed Computing, 15(4):255–271, 2002.
[6] K. Fraser. Practical Lock-Freedom. PhD thesis, University of Cambridge Computer Laboratory, 2004.
[7] B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. In Proc. 3rd Symp. on Operating Syst. Design and Impl., pages 87–100, 1999.
[8] A. Gidenstam, M. Papatriantafilou, H. Sundell, and P. Tsigas. Efficient and reliable lock-free memory reclamation based on reference counting. In Proc. 8th I-SPAN, 2005.
[9] M. Greenwald. Non-blocking Synchronization and System Design. PhD thesis, Stanford University, 1999.
[10] M. Greenwald and D. Cheriton. The synergy between non-blocking synchronization and operating system structure. In Proc. 2nd USENIX Symp. on Operating Syst. Design and Impl., pages 123–136. ACM Press, 1996.
[11] R. Gupta. The fuzzy barrier: A mechanism for high speed synchronization of processors. In Proc. 3rd Intl. Conf. on Arch. Support for Prog. Lang. and Operating Syst. (ASPLOS), pages 54–63, New York, NY, USA, 1989. ACM Press.
[12] T. L. Harris. A pragmatic implementation of non-blocking linked-lists. In Proc. 15th Intl. Conf. on Distributed Computing, pages 300–314. Springer-Verlag, 2001.
[13] M. Herlihy. Wait-free synchronization. ACM Trans. Prog. Lang. Syst., 13(1):124–149, 1991.
[14] M. Herlihy. A methodology for implementing highly concurrent data objects. ACM Trans. Prog. Lang. Syst., 15(5):745–770, Nov. 1993.
[15] M. Herlihy, V. Luchangco, and M. Moir. The repeat offender problem: A mechanism for supporting dynamic-sized, lock-free data structures. In Proc. 16th Intl. Symp. on Distributed Computing, Oct. 2002.
[16] M. Herlihy, V. Luchangco, and M. Moir. Obstruction-free synchronization: Double-ended queues as an example. In Proc. 23rd Intl. Conf. on Distributed Computing Syst., pages 522–529. IEEE Computer Society, 2003.
[17] S. Kumar, D. Jiang, R. Chandra, and J. P. Singh. Evaluating synchronization on shared address space multiprocessors: Methodology and performance. In Proc. SIGMETRICS 1999, pages 23–34, New York, NY, USA, 1999. ACM Press.
[18] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690–691, Sept. 1979.
[19] P. E. McKenney. Exploiting Deferred Destruction: An Analysis of Read-Copy-Update Techniques in Operating System Kernels. PhD thesis, OGI School of Science and Engineering at Oregon Health and Sciences University, 2004.
[20] P. E. McKenney, J. Appavoo, A. Kleen, O. Krieger, R. Russell, D. Sarma, and M. Soni. Read-copy update. In Ottawa Linux Symp., July 2001.
[21] P. E. McKenney and D. Sarma. Towards hard realtime response from the Linux kernel on SMP hardware. In linux.conf.au, Canberra, AU, April 2005.
[22] P. E. McKenney and J. D. Slingwine. Read-copy update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Syst., pages 509–518, Las Vegas, NV, Oct. 1998.
[23] M. M. Michael. Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In Proc. 21st ACM Symp. on Principles of Distributed Computing, pages 21–30, July 2002.
[24] M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Trans. on Parallel and Distributed Syst., 15(6):491–504, June 2004.
[25] M. M. Michael. Scalable lock-free dynamic memory allocation. In Proc. ACM Conf. on Programming Language Design and Implementation, pages 35–46, June 2004.
[26] M. M. Michael and M. L. Scott. Correction of a memory management method for lock-free data structures. Technical Report TR599, Computer Science Department, University of Rochester, Dec. 1995.
[27] D. Sarma and P. E. McKenney. Issues with selected scalability features of the 2.6 kernel. In Ottawa Linux Symp., July 2004.
[28] H. Sundell. Wait-free reference counting and memory management. In Proc. 19th Intl. Parallel and Distributed Processing Symp., Apr. 2005.
[29] J. D. Valois. Lock-free linked lists using compare-and-swap. In Proc. 14th ACM Symp. on Principles of Distributed Computing, pages 214–222, Aug. 1995.
