Making Lockless Synchronization Fast
University of Toronto, Dept. of Computer Science
Toronto, ON M5S 2E4 Canada
{tomhart, demke}@cs.toronto.edu

IBM Beaverton, Linux Technology Center
Beaverton, OR 97006 USA
[email protected]

* Supported by an NSERC Canada Graduate Scholarship.
Abstract
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims nodes once they are no longer in use.

The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes—quiescent-state-based reclamation (QSBR), epoch-based reclamation (EBR), and hazard-pointer-based reclamation (HPBR)—using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect memory reclamation performance.
1 Introduction

As multiprocessors become mainstream, multithreaded applications will become more common, increasing the need for efficient coordination of concurrent accesses to shared data structures. Traditional locking requires expensive atomic operations, such as compare-and-swap (CAS), even when locks are uncontended. For example, acquiring and releasing an uncontended spinlock requires over 400 cycles on an IBM POWER CPU. Therefore, many researchers recommend avoiding locking [2, 7, 22]. Some systems, such as Linux, use concurrently-readable synchronization, which uses locks for updates but not for reads.
[Figure 1. Read/reclaim race.]

Locking is also susceptible to priority inversion, convoying, deadlock, and blocking due to thread failure [3, 10], leading researchers to pursue non-blocking (or lock-free) synchronization [6, 12, 13, 14, 16, 29]. In some cases, lock-free approaches can bring performance benefits [25]. For clarity, we describe all strategies that avoid locks as lockless.

A major challenge for lockless synchronization is handling the read/reclaim races that arise in dynamic data structures. Figure 1 illustrates the problem: thread T1 removes node N from a list while thread T2 is referencing it. N's memory must be reclaimed to allow reuse, lest memory exhaustion block all threads, but such reuse is unsafe while T2 continues referencing N. For languages like C, where memory must be explicitly reclaimed (e.g. via free()), programmers must combine a memory reclamation scheme with their lockless data structures to resolve these races.1

1 Reclamation is subsumed into automatic garbage collectors in environments that provide them, such as Java.
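To make the race concrete, here is a minimal C sketch (with illustrative names; this is not code from our benchmark) showing how an immediate free() in the remover invalidates a concurrent reader's pointer:

    /* Sketch of the read/reclaim race of Figure 1: thread T2 traverses
     * the list without locks while thread T1 unlinks and frees a node. */
    #include <stdlib.h>

    struct node {
        int key;
        struct node *next;
    };

    /* T2: lockless reader. */
    int list_contains(struct node *head, int key)
    {
        for (struct node *n = head; n != NULL; n = n->next) {
            /* If T1 frees n between the load of n and this dereference,
             * n->key reads freed (and possibly reused) memory. */
            if (n->key == key)
                return 1;
        }
        return 0;
    }

    /* T1: remover. Unlinking alone is fine; the immediate free is not. */
    void list_remove(struct node **prev_next, struct node *n)
    {
        *prev_next = n->next;  /* unlink n from the list */
        free(n);               /* UNSAFE: a reader may still reference n */
    }

A reclamation scheme replaces the unsafe free() with a deferred reclamation step that waits until no reader can still hold a reference.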
Several such reclamation schemes have been proposed. Programmers need to understand the semantics and the performance implications of each scheme, since the overhead of inefficient reclamation can be worse than that of locking. For example, reference counting [5, 29] has high overhead in the base case and scales poorly with data-structure size. This is unacceptable when performance is the motivation for lockless synchronization. Unfortunately, there is no single optimal scheme, and existing work is relatively silent on factors affecting reclamation performance.
Lock-free reference counting (LFRC) is a well-known non-blocking garbage-collection technique. Threads track the number of references to nodes, reclaiming any node whose count is zero. Valois's LFRC scheme [29] (corrected by Michael and Scott [26]) uses CAS and fetch-and-add (FAA), and requires nodes to retain their type after reclamation. Sundell's scheme [28], based on Valois's, is wait-free. The scheme of Detlefs et al. [5] allows nodes' types to change upon reclamation, but requires double compare-and-swap (DCAS), which no current CPU supports.

Michael [24] showed that LFRC introduces overhead which often makes lockless algorithms perform worse than lock-based versions. We include some experiments with Valois's scheme to reproduce Michael's findings.
3 Reclamation Performance Factors

We categorize factors which can affect reclamation scheme performance; we vary these factors in Section 5.
3.1 Memory Consistency

Current literature on lock-free algorithms generally assumes a sequentially-consistent [18] memory model, which prohibits instruction reordering and globally orders memory references. For performance reasons, however, modern CPUs provide weaker memory consistency models in which ordering can be enforced only when needed via special fence instructions. Although fences are often omitted from pseudocode, they are expensive on most modern CPUs and must be included in realistic performance analyses.
HPBR, EBR, and LFRC require per-operation fences. HPBR, as shown in Figure 7, requires a fence between hazard-pointer setting and validation, and thus one fence per visited node. LFRC also requires per-node fences, in addition to the atomic instructions needed to maintain reference counts. EBR requires two fences per operation: one when setting a flag upon entering a critical region, and one when clearing it upon exit. Since QSBR has no per-operation fences, its per-operation overhead can be very low.
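HPBR's per-node cost comes from its publish/fence/validate pattern. The following minimal sketch uses C11 atomics; hp[] and the associated node-retirement machinery are assumed infrastructure and not shown.

    /* Sketch of HPBR's per-node hazard-pointer step during a traversal. */
    #include <stdatomic.h>
    #include <stddef.h>

    struct node { _Atomic(struct node *) next; int key; };

    extern _Atomic(struct node *) hp[];  /* this thread's hazard pointers */

    /* Protect *src in hazard-pointer slot i, retrying until stable. */
    static struct node *protect(_Atomic(struct node *) *src, int i)
    {
        struct node *n = atomic_load(src);
        for (;;) {
            if (n == NULL)
                return NULL;
            atomic_store(&hp[i], n);                    /* publish */
            atomic_thread_fence(memory_order_seq_cst);  /* per-node fence */
            struct node *again = atomic_load(src);      /* validate */
            if (again == n)
                return n;       /* still reachable: safe to dereference */
            n = again;          /* link changed under us; retry */
        }
    }

The fence ensures that the reclaimer sees the published hazard pointer before the reader re-checks the link; without it, the validation is worthless on weakly ordered CPUs.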
3.3 Threads and Scheduling

We expect contention due to concurrent threads to be a minor source of reclamation overhead; however, for the non-blocking schemes, it could be unbounded in degenerate cases: readers forced to repeatedly restart their traversals must repeatedly execute fence instructions for every node.

Thread preemption, especially when there are more threads than processors, can adversely affect blocking schemes. Descheduled threads can delay reclamation, potentially exhausting memory, particularly in update-heavy workloads. Longer scheduling quanta may increase the risk of this exhaustion.
3.4 Memory Constraints

Although lock-free memory allocators exist [25], many allocators use locking. Blocking methods will see greater lock contention because they must access a lock-protected global pool more frequently. Furthermore, if a thread is preempted while holding such a lock, other threads will block on memory allocation. The size of the global pool is finite, and governs the likelihood of a blocking scheme exhausting all memory. Only HPBR [23] provides a provable bound on the amount of unreclaimed memory; it should thus be less sensitive to these constraints.
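The allocation pattern at issue can be sketched as follows; the names are illustrative and not taken from any particular allocator. A reclamation scheme that returns memory to the allocator less promptly forces threads onto the lock-protected slow path more often.

    /* Sketch: per-thread free list backed by a lock-protected global pool. */
    #include <pthread.h>
    #include <stddef.h>

    struct node { struct node *next; };

    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct node *global_pool;                 /* finite shared pool */
    static _Thread_local struct node *local_pool;    /* per-thread cache */

    struct node *node_alloc(void)
    {
        struct node *n = local_pool;
        if (n != NULL) {                  /* fast path: no locking */
            local_pool = n->next;
            return n;
        }
        pthread_mutex_lock(&pool_lock);   /* slow path: contended pool */
        n = global_pool;
        if (n != NULL)
            global_pool = n->next;
        pthread_mutex_unlock(&pool_lock);
        return n;                         /* NULL => pool exhausted */
    }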
4 Experimental Setup

We evaluated the memory reclamation strategies with respect to the factors outlined in Section 3 using commodity SMP systems with IBM POWER CPUs. This section provides details on these aspects of our experiments.
4.1 Data Structures Used

We tested the reclamation schemes on linked lists and queues. We used Michael's ordered lock-free linked list [24], which forbids duplicate keys, and coded our concurrently-readable lists similarly. Because linked lists permit arbitrary lengths and read-to-write ratios, we used lists for most of our experiments.
1 while (parent’s timer has not expired) {
2 for i from 1 to 100 do {
3 key = random key; Table 1. Characteristics of Machines
4 op = random operation; XServe IBM POWER
5 d = data structure; CPUs 2x 2.0GHz PowerPC G5 8x 1.45 GHz POWER4+
6 op(d, key);
7 } Kernel Linux 2.6.8-1.ydl.7g5-smp Linux 2.6.13 (kernel.org)
8 if (using QSBR) Fence 78ns (156 cycles) 76ns (110 cycles)
9 quiescent_state();
10 }
CAS 52ns (104 cycles) 59ns (86 cycles)
Lock 231ns (462 cycles) 243ns (352 cycles)
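The quiescent_state() call in the test loop exists only under QSBR. A hypothetical, much-simplified sketch of the bookkeeping such a call can drive follows (names invented for illustration); real QSBR implementations batch retired nodes and defer or block rather than spinning, and must handle threads that are not currently running.

    /* Minimal sketch of QSBR grace-period bookkeeping. */
    #include <stdatomic.h>

    #define NTHREADS 8

    static atomic_ulong qs_count[NTHREADS];  /* per-thread quiescent counts */

    /* Called by each thread outside any lockless operation; here the
     * thread id is passed explicitly. */
    void quiescent_state(int self)
    {
        atomic_fetch_add(&qs_count[self], 1);
    }

    /* Reclaimer (itself quiescent): snapshot all counters, then wait for
     * each to advance. Nodes retired before the snapshot are unreferenced
     * once this returns, so they may then be freed. */
    void wait_for_grace_period(void)
    {
        unsigned long snap[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            snap[i] = atomic_load(&qs_count[i]);
        for (int i = 0; i < NTHREADS; i++)
            while (atomic_load(&qs_count[i]) == snap[i])
                ;  /* spin; a real implementation would block or defer */
    }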
[Figure 9. Base costs (list reads, list writes, queue) — single-threaded data from 8-CPU machine.]

[Figure 10. Effect of traversal length — read-only lock-free list, one thread, 8 CPUs.]

[Figure 12. Effect of adding threads — read-only lock-free list, one element, 8 CPUs.]

[Figure 13. Effect of adding threads — lock-free queue, 8 CPUs.]

[Figure 15. Effect of preemption — lock-free queue, 2 CPUs.]

[Figure 16. Effect of busy waiting when out of memory — lock-free queue, 2 CPUs.]

(The figures plot average CPU time in nanoseconds for the QSBR, EBR, and HPBR schemes against number of elements, number of threads, or update fraction.)
[Figure 14. Effect of preemption — read-only lock-free list, 2 CPUs.]

HPBR and reference counting suffer from significant overhead due to extra atomic instructions. QSBR and EBR perform poorly when grace periods are stalled and the workload is update-intensive.

6 Consequences of Analysis

We describe the consequences of our analysis for comparing algorithms, designing new reclamation schemes, and choosing reclamation schemes for applications.
[Figure 17. Lock-free versus concurrently-readable algorithms — ten-element lists, one thread, 8 CPUs.]
[Figure 18. Lock-free versus concurrently-readable algorithms — hash tables with load factor 1, four threads, 8 CPUs.]

6.1 Fair Evaluation of Algorithms

Reclamation schemes have profound performance effects that must be accounted for when experimentally evaluating new lockless algorithms.

For example, one of our early experiments compared a concurrently-readable linked list with QSBR, as is used in the Linux kernel, with a lock-free HPBR equivalent. Our intuition was that lock-free algorithms might pay a performance penalty for their fault-tolerance properties, as they do in read-mostly situations. The LF-HPBR and CR-QSBR traces in Figure 17 might lead to the erroneous conclusion that the concurrently-readable algorithm is always faster. A better analysis takes the LF-QSBR trace into account, noting that as the update fraction increases, lock-free performance improves, since its updates require fewer atomic instructions than does locking. This example shows that one can accurately compare two lockless algorithms only when each is using the same reclamation scheme.

LF-QSBR's higher per-node overhead makes it more attractive when there are fewer nodes. Figure 18 shows the performance of hash tables consisting of arrays of LF-QSBR or CR-QSBR single-element lists being concurrently accessed by four threads. For clarity, we omit HPBR from this graph — our intent is to compare the lock-free and concurrently-readable algorithms. Here, lock-free hashing outperforms locking for update fractions above about 15%. Lock-free lists and hash tables might therefore be practical for update-heavy situations in environments providing QSBR, such as OS kernels like Linux.

New reclamation schemes should also be evaluated by varying each of the factors that can affect their performance. For example, Gidenstam et al. [8] recently proposed a new non-blocking reclamation scheme that combines reference counting with HPBR, and can be proven to have several attractive properties. However, like HPBR and reference counting, it requires expensive per-node atomic operations. The evaluation of this scheme consisted only of experiments on double-ended queues, thus failing to evaluate scalability with data-structure size, an HPBR weakness. This failing shows the value of our analysis: it is necessary to vary the experimental parameters we considered to gain a full understanding of a given scheme's performance.
6.2 Improving Reclamation Performance

Improved reclamation schemes can be designed based on an understanding of the factors that affect performance. For example, we observe that a key difference between QSBR and EBR is the per-operation overhead of EBR's two fences. This observation allows us to make a modest improvement to EBR, called new epoch-based reclamation (NEBR).

NEBR requires compromising EBR's application-independence. Instead of setting and clearing the flag at the start and end of every lockless operation, we set it at the application level, before entering any code that might contain NEBR critical regions. Since our flag is set and cleared at the application level, we can amortize the overhead of the corresponding fence instructions over a larger number of operations. We reran the experiment shown in Figure 12, but including NEBR, and found that NEBR scaled linearly and performed slightly better than did HPBR. Furthermore, NEBR does not need the expensive per-node atomic operations that ruin HPBR's performance for long traversals.

[Figure 19. Performance of NEBR — lock-free list, 8 CPUs, read-only workload, variable number of threads.]
NEBR is attractive because it is almost as fast as QSBR, but does not require quiescent states. Interestingly, recent realtime variants of the Linux-kernel RCU also dispense with quiescent states [21]. Ongoing work is expected to substantially reduce realtime RCU read-side overhead.

It is interesting to consider what sort of weakened non-blocking property, if any, could be defined such that one could create a corresponding reclamation scheme without requiring expensive per-node atomic operations.
7 Related Work

Relevant work on reclamation scheme design was discussed in Section 2. Previous work on the performance of these schemes, however, is limited.

Michael [24] criticized QSBR for its unbounded memory use, but did not compare the performance of QSBR to that of HPBR, or determine when this limitation affects a program.

Auslander implemented a lock-free hash table with QSBR in K42 [19]. No performance evaluation, either between different reclamation methods or between concurrently-readable and lock-free hash tables, was provided. We are unaware of any work combining QSBR with update-intensive non-blocking algorithms such as queues.

Fraser [6] noted, but did not thoroughly evaluate, HPBR's fence overhead, and used his EBR instead. Our work extends Fraser's, showing that EBR itself has high overhead, often exceeding that of HPBR.

9 Acknowledgments

We owe thanks to Maged Michael and Keir Fraser for helpful comments on their respective work, to Faith Fich and Cristiana Amza for much helpful advice, and to Dan Frye for his support of this effort. We are indebted to Martin Bligh, Andy Whitcroft, and the ABAT team for the use of the 8-CPU machine used in our experiments.
10 Legal Statement

IBM and POWER are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others.

References

[1] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel and Distributed Syst., 1(1):6–16, 1990.
[2] A. Arcangeli, M. Cao, P. E. McKenney, and D. Sarma. Using read-copy update techniques for System V IPC in the Linux 2.5 kernel. In Proc. USENIX Annual Technical Conf. (FREENIX Track), pages 297–310. USENIX Association, June 2003.
[3] H.-J. Boehm. An almost non-blocking stack. In Proc. 23rd ACM Symp. on Principles of Distributed Computing, pages 40–49, 2004.
[4] J. Bonwick and J. Adams. Magazines and Vmem: Extending the slab allocator to many CPUs and arbitrary resources. In USENIX Technical Conf., General Track, pages 15–33, 2001.
[5] D. L. Detlefs, P. A. Martin, M. Moir, and G. L. Steele, Jr. Lock-free reference counting. Distributed Computing, 15(4):255–271, 2002.
[6] K. Fraser. Practical Lock-Freedom. PhD thesis, University of Cambridge Computer Laboratory, 2004.
[7] B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. In Proc. 3rd Symp. on Operating Syst. Design and Impl., pages 87–100, 1999.
[8] A. Gidenstam, M. Papatriantafilou, H. Sundell, and P. Tsigas. Efficient and reliable lock-free memory reclamation based on reference counting. In Proc. 8th I-SPAN, 2005.
[9] M. Greenwald. Non-blocking Synchronization and System Design. PhD thesis, Stanford University, 1999.
[10] M. Greenwald and D. Cheriton. The synergy between non-blocking synchronization and operating system structure. In Proc. 2nd USENIX Symp. on Operating Syst. Design and Impl., pages 123–136. ACM Press, 1996.
[11] R. Gupta. The fuzzy barrier: A mechanism for high speed synchronization of processors. In Proc. 3rd Intl. Conf. on Arch. Support for Prog. Lang. and Operating Syst. (ASPLOS), pages 54–63, New York, NY, USA, 1989. ACM Press.
[12] T. L. Harris. A pragmatic implementation of non-blocking linked-lists. In Proc. 15th Intl. Conf. on Distributed Computing, pages 300–314. Springer-Verlag, 2001.
[13] M. Herlihy. Wait-free synchronization. ACM Trans. Prog. Lang. Syst., 13(1):124–149, 1991.
[14] M. Herlihy. A methodology for implementing highly concurrent data objects. ACM Trans. Prog. Lang. Syst., 15(5):745–770, Nov. 1993.
[15] M. Herlihy, V. Luchangco, and M. Moir. The repeat offender problem: A mechanism for supporting dynamic-sized, lock-free data structures. In Proc. 16th Intl. Symp. on Distributed Computing, Oct. 2002.
[16] M. Herlihy, V. Luchangco, and M. Moir. Obstruction-free synchronization: Double-ended queues as an example. In Proc. 23rd Intl. Conf. on Distributed Computing Syst., pages 522–529. IEEE Computer Society, 2003.
[17] S. Kumar, D. Jiang, R. Chandra, and J. P. Singh. Evaluating synchronization on shared address space multiprocessors: Methodology and performance. In Proc. SIGMETRICS 1999, pages 23–34, New York, NY, USA, 1999. ACM Press.
[18] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690–691, Sept. 1979.
[19] P. E. McKenney. Exploiting Deferred Destruction: An Analysis of Read-Copy-Update Techniques in Operating System Kernels. PhD thesis, OGI School of Science and Engineering at Oregon Health and Sciences University, 2004.
[20] P. E. McKenney, J. Appavoo, A. Kleen, O. Krieger, R. Russell, D. Sarma, and M. Soni. Read-copy update. In Ottawa Linux Symp., July 2001.
[21] P. E. McKenney and D. Sarma. Towards hard realtime response from the Linux kernel on SMP hardware. In linux.conf.au, Canberra, Australia, April 2005.
[22] P. E. McKenney and J. D. Slingwine. Read-copy update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Syst., pages 509–518, Las Vegas, NV, Oct. 1998.
[23] M. M. Michael. Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In Proc. 21st ACM Symp. on Principles of Distributed Computing, pages 21–30, July 2002.
[24] M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Trans. on Parallel and Distributed Syst., 15(6):491–504, June 2004.
[25] M. M. Michael. Scalable lock-free dynamic memory allocation. In Proc. ACM Conf. on Programming Language Design and Implementation, pages 35–46, June 2004.
[26] M. M. Michael and M. L. Scott. Correction of a memory management method for lock-free data structures. Technical Report TR599, Computer Science Department, University of Rochester, Dec. 1995.
[27] D. Sarma and P. E. McKenney. Issues with selected scalability features of the 2.6 kernel. In Ottawa Linux Symp., July 2004.
[28] H. Sundell. Wait-free reference counting and memory management. In Proc. 19th Intl. Parallel and Distributed Processing Symp., Apr. 2005.
[29] J. D. Valois. Lock-free linked lists using compare-and-swap. In Proc. 14th ACM Symp. on Principles of Distributed Computing, pages 214–222, Aug. 1995.