Linux Multi-Core Scalability
Andi Kleen
Intel Corporation
Germany
[email protected]
1 Abstract
The future of computing is multi-core. Massively multi-core. But how does the Linux kernel cope with
it? This paper takes a look at Linux kernel scalability on many-core systems under various workloads
and discusses some known bottlenecks. The primary focus will be on kernel scalability.
2 Why parallelization
Since CPUs hit the power wall earlier this decade, single-threaded CPU performance has been increasing
at a much lower pace than it historically did in earlier decades. The recent trend in hardware is instead
to go multi-core and multi-threaded for more performance. Multi-core means that the CPU package
contains more than one CPU core and acts like multiple CPUs. Multi-threaded CPUs use multiple
virtual CPUs inside each CPU core to use the execution resources more efficiently. In addition, larger
systems have always used multiple CPU packages for better performance.
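As a concrete illustration of how Linux exposes these logical CPUs (a minimal sketch, not taken from
the paper; error handling is reduced to the bare minimum), the following C fragment prints the number
of online logical CPUs and, for each one, the core and package it belongs to, using the standard sysfs
topology files:

/* cpu_topology.c - print the online logical CPUs and the core/package
 * each one belongs to, using the sysfs topology files. Sketch only. */
#include <stdio.h>
#include <unistd.h>

static int read_id(int cpu, const char *file)
{
        char path[128];
        int id = -1;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, file);
        f = fopen(path, "r");
        if (f) {
                fscanf(f, "%d", &id);
                fclose(f);
        }
        return id;
}

int main(void)
{
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

        printf("%ld logical CPUs online\n", ncpus);
        for (int i = 0; i < ncpus; i++)
                printf("cpu%d: core %d, package %d\n", i,
                       read_id(i, "core_id"),
                       read_id(i, "physical_package_id"));
        return 0;
}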
Exploiting the performance potential of these multiple CPUs requires software improvements to run
in parallel. Traditionally only larger supercomputers and servers needed major software scalability
work, because they have been using many CPU sockets, while cheaper systems only had a low number
of CPUs (one or perhaps two), so major scalability work was not needed. But since individual CPUs
are gaining more and more cores and threads this is changing, and even relatively low-end systems
require extensive scalability work now. The following table shows some current example systems,
illustrating these trends.
3 Parallelization basics
When talking about parallelization, Amdahl’s law[1] is commonly cited as a limitation. Amdahl’s law
essentially states that when an algorithm working on a given data set is parallelized, the performance
improvements are limited by the serial sections of the algorithm. But in practice – and especially
for kernel tuning – we tend to be more guided by Gustafson’s law[2]: when doing more in parallel, the
data set sizes usually also increase, and that more than offsets the speedup limitations from Amdahl’s law.
Applied to the kernel: the kernel will not necessarily get faster for single operations through
parallelization, but it will be able to do many single operations in parallel, allowing more processes
or threads to run.
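To make the two limits concrete (these are the standard textbook formulations, not reproduced from this
paper): with a parallelizable fraction p of the work and N CPUs,

\[ S_{\mathrm{Amdahl}}(N) = \frac{1}{(1-p) + p/N}, \qquad S_{\mathrm{Gustafson}}(N) = (1-p) + pN. \]

Amdahl’s formula bounds the speedup for a fixed problem size (for p = 0.95 it can never exceed 20, no
matter how many CPUs are added), while Gustafson’s scaled speedup assumes the amount of work grows
with N, so 64 CPUs still get through roughly 61 times as much work in the same time.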
This technique can be applied to many other computing-intensive command line tools.
Then there are programs which have some degree of parallelism, but do not scale to large systems
(where large can be as small as a 16-CPU system). That might be because they still have too coarse-
grained locking, run into communication bottlenecks, or suffer from other scalability problems. Unfor-
tunately, these programs are often written to start as many threads as the system has CPUs, but when the
system is larger than their scalability limit, adding more threads might actually scale negatively (that is,
they become slower as more threads are added).
The first measure is to limit them to the maximum number of threads that they can successfully scale
to (or of course fix their scalability problems if possible). This can be done by benchmarking to
find the scalability limit and then configuring the program accordingly.
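One way to enforce such a limit from the outside is to restrict the process to a subset of the CPUs with
sched_setaffinity(2). The sketch below is illustrative only; the cutoff of 8 CPUs is an assumed
scalability limit, not a figure from this paper:

/* limit_cpus.c - restrict the given command to the first NCPUS logical
 * CPUs before executing it. NCPUS is an assumed scalability limit. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define NCPUS 8         /* example cutoff, tune via benchmarking */

int main(int argc, char **argv)
{
        cpu_set_t set;

        if (argc < 2) {
                fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
                return 1;
        }

        CPU_ZERO(&set);
        for (int i = 0; i < NCPUS; i++)
                CPU_SET(i, &set);

        /* pid 0 means the calling process; children inherit the mask */
        if (sched_setaffinity(0, sizeof(set), &set) < 0) {
                perror("sched_setaffinity");
                return 1;
        }

        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
}

This is essentially what the taskset(1) utility does; in practice the program’s own thread count should
be capped as well, so it does not oversubscribe the smaller CPU set.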
This of course leaves some of the CPUs idle, which can be addressed by either running other instances of
the process ("cluster in a box") or running different programs on the same system ("server consoli-
dation").
This raises the question of how to isolate different services on such a large box. A popular approach
is to use virtualization (such as Xen or KVM) to run multiple smaller guests with fewer CPUs on a larger
system. This has relatively high costs in terms of memory consumption and administration overhead.
An alternative is to use the new cgroups subsystem in recent kernels to run applications in kernel-
level containers, limiting them to the number of CPUs they can successfully handle.
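As an illustration of the container approach (a sketch assuming the cpuset controller is mounted at
/sys/fs/cgroup/cpuset as with cgroup v1; the group name "webserver" and the CPU range 0-7 are
made-up examples), a service can be confined to a fixed set of CPUs by creating a cpuset cgroup and
moving its processes into it:

/* cpuset_confine.c - create a cpuset cgroup limited to CPUs 0-7 and
 * move the given pid into it. Paths and values are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

#define CGROUP "/sys/fs/cgroup/cpuset/webserver"

static void write_file(const char *file, const char *val)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "%s/%s", CGROUP, file);
        f = fopen(path, "w");
        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                return 1;
        }

        mkdir(CGROUP, 0755);              /* create the container group */
        write_file("cpuset.cpus", "0-7"); /* CPUs the group may use */
        write_file("cpuset.mems", "0");   /* memory node(s) it may use */
        write_file("tasks", argv[1]);     /* move the given pid into it */
        return 0;
}

Tools such as cgcreate(1) and cgexec(1) from libcgroup wrap these steps; the important point is that
the containment is done by the kernel itself, without the memory overhead of running a separate guest
kernel per service.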
10 Conclusion
Kernel scalability is critical for Linux to run well on today’s and tomorrow’s systems. While a lot of
work has been done in this area, and it works reasonably well for a variety of workloads today, improving
scalability further is an ongoing process that needs constant effort.
References
[1] Amdahl (1967) "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities" http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf
[2] Gustafson (1988) "Reevaluating Amdahl’s Law" Communications of the ACM 31(5) http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html
[3] McKenney, Appavoo, Kleen, Krieger, Russell, Sarma, Soni (2001) "Read-Copy Update", Ottawa Linux Symposium http://www.rdrop.com/users/paulmck/rclock/rclock_OLS.2001.05.01c.pdf
[5] Schimmel (1994) "Unix Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers" Addison-Wesley Publishing Company
[6] McKenney, Sarma, Soni (2004) "Scaling dcache with RCU" http://www.linuxjournal.com/article/7124