LINUX SCALABILITY
TO MANY CORES
Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao,
Aleksey Pesterev, M. Frans Kaashoek, Robert Morris,
Nickolai Zeldovich (MIT)
OSDI 2010
Paper highlights
Asks whether traditional kernel designs apply to multicore architectures
Do they allow efficient usage of the hardware?
Finds that most scaling bottlenecks are caused by unnecessary sharing among cores
Amdahl’s Law

$$S_{\text{latency}}(s) = \frac{1}{(1 - p) + \frac{p}{s}}$$

where
$S_{\text{latency}}(s)$ is the theoretical speedup of the execution of the whole task;
$s$ is the speedup of the part of the task that benefits from improved system resources;
$p$ is the fraction of execution time that the part benefiting from improved resources originally occupied.
Example: Flying Houston-New York
Now
Waiting at airport: 1 hour
Taxiing out: 17 minutes
Air time: 2 hours 56 minutes
Taxiing in: 6 minutes
Total time: 4 hours 19 minutes (259 minutes)
A faster airplane cuts the air time by 50 percent ($s = 2$),
giving $S_{\text{latency}} \approx 1.515$ (worked out below)
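A worked check of these numbers, using the formula above (the air time is 176 of the 259 total minutes):

$$p = \frac{176}{259} \approx 0.68, \qquad s = 2$$
$$S_{\text{latency}} = \frac{1}{(1 - 0.68) + \frac{0.68}{2}} \approx \frac{1}{0.66} \approx 1.515$$

Equivalently, the trip shrinks from 259 minutes to $259 - 88 = 171$ minutes (2 hours 51 minutes), and $259 / 171 \approx 1.515$.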
MOSBENCH Applications
Mail Server:
Exim
Single master process waits for incoming TCP connections
Forks a child process for each new connection (sketched below)
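A minimal sketch of this fork-per-connection pattern, assuming a plain TCP service; the port number and all identifiers are illustrative, not Exim's actual code:

/* Minimal fork-per-connection accept loop (illustrative, not Exim's code). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(2525);            /* illustrative port */

    if (lfd < 0 || bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        exit(1);
    listen(lfd, 128);

    for (;;) {
        int cfd = accept(lfd, NULL, NULL);  /* master waits for connections */
        if (cfd < 0)
            continue;
        if (fork() == 0) {                  /* child handles this connection */
            close(lfd);
            /* ... process the SMTP dialogue, deliver mail, then exit ... */
            close(cfd);
            _exit(0);
        }
        close(cfd);                         /* parent returns to accept() */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;                               /* reap any finished children */
    }
}

Because every message is handled by a short-lived child, this workload stresses process creation/teardown and small-file creation in the kernel.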
MOSBENCH Applications
Object cache:
memcached
Runs as multiple servers, one per core
Spends 80% of its time processing packets in the kernel at one core
MOSBENCH Applications
Web server:
Apache
Single instance listening on port 80
Uses a thread pool to process connections
Requires little application-level synchronization
Should exhibit little contention for read-mostly workloads such as serving static Web pages (see the sketch below)
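A minimal sketch of a fixed thread pool sharing one listening socket, in the spirit of (but far simpler than) Apache's worker model; the port and all identifiers are illustrative:

/* Minimal thread-pool server (illustrative, not Apache's code). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NWORKERS 8

static void *worker(void *arg)
{
    int lfd = *(int *)arg;
    for (;;) {
        /* All workers block in accept(); the kernel hands each
           incoming connection to exactly one of them. */
        int cfd = accept(lfd, NULL, NULL);
        if (cfd < 0)
            continue;
        /* ... parse the request, send a (cached) static page ... */
        close(cfd);
    }
    return NULL;
}

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);   /* port 80 needs privileges; 8080 here */
    if (lfd < 0 || bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        exit(1);
    listen(lfd, 128);

    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, &lfd);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}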
MOSBENCH Applications
File indexer:
psearchy
Focus on the indexing component of psearchy (pedsort)
Becomes more system intensive as the number of cores grows:
With one core, pedsort spends only 1.9% of its time in the kernel
Grows to 23% at 48 cores
MOSBENCH Applications
Parallel build:
gmake
Creates many more processes than there are cores
Execution time dominated by the compiler it runs
Running on a single core, it spends 7.6 percent of its time in the
kernel
MOSBENCH Applications
MapReduce:
Metis
MapReduce library for single multicore servers
Sloppy counters
Represent one logical counter as:
A single shared central counter
A set of per-core counts of spare references
[Figure: per-core spare counts 2, 0, 0, 2 and a central count of 8]
In our example:
8 references in total
4 of them are pre-allocated spare references held by the cores
Incrementing the sloppy counter (I)
If the core has spare pre-allocated references
Subtract increment from local counter
Incrementing the sloppy counter (II)
If the core has no spare pre-allocated references
Add the increment to the central counter
[Figure: successive counter states as cores acquire references and release them as local spares]
Example (II)
[Figure: counter states for Example (II); per-core spare counts of 1 and 2 are released back to the central counter, which ends at 3]
A sketch of the full increment/decrement logic follows.
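A minimal C sketch of the increment/decrement rules above, assuming one thread per core and a mutex guarding the central count; the identifiers, the threshold, and the locking discipline are illustrative, not the kernel's actual implementation:

#include <pthread.h>

#define NCORES    4
#define THRESHOLD 4   /* flush spares back above this many */

struct sloppy_counter {
    pthread_mutex_t lock;      /* protects the central count */
    long central;              /* total references, spares included */
    long local[NCORES];        /* per-core spare references */
};

/* Increment (acquire one reference) on behalf of `core`. */
void sloppy_inc(struct sloppy_counter *c, int core)
{
    if (c->local[core] > 0) {
        c->local[core]--;              /* consume a pre-allocated spare */
    } else {
        pthread_mutex_lock(&c->lock);
        c->central++;                  /* no spares: go to the center */
        pthread_mutex_unlock(&c->lock);
    }
}

/* Decrement (release one reference) on behalf of `core`. */
void sloppy_dec(struct sloppy_counter *c, int core)
{
    c->local[core]++;                  /* keep it as a local spare */
    if (c->local[core] > THRESHOLD) {  /* too many spares: return them */
        pthread_mutex_lock(&c->lock);
        c->central -= c->local[core];
        pthread_mutex_unlock(&c->lock);
        c->local[core] = 0;
    }
}

In the common case a core touches only its own counter; the shared central counter (and its lock) is touched only when spares run out or pile up, which is what makes the counter scale.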
Lock-free comparison (I)
Observed low scalability for name lookups in the directory entry
cache.
Directory entry cache speeds up lookups by mapping a directory
and a file name to a dentry identifying the target file’s inode
When a potential dentry is located
Lookup code gets a per-dentry spin lock to atomically compare
dentry contents with lookup function arguments
Causes a bottleneck
Lock-free comparison (II)
Use instead a lock-free comparison protocol
Similar to the Linux lock-free page cache lookup protocol
If the comparison completes without observing a concurrent modification, the lock is avoided
Else
Fall back to the locking protocol (sketched below)
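A hedged sketch of such a lock-free comparison, using a seqlock-style generation counter to detect concurrent modification; the field and function names are illustrative, and a real kernel version would additionally rely on RCU to keep the dentry's memory valid during the unlocked reads:

#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

struct dentry {
    atomic_uint gen;        /* odd while the dentry is being modified */
    const char *name;
    struct dentry *parent;
    /* per-dentry spin lock, inode pointer, ... omitted */
};

/* The original slow path: compare under the per-dentry spin lock. */
bool dentry_cmp_locked(struct dentry *d, struct dentry *parent,
                       const char *name);

bool dentry_cmp(struct dentry *d, struct dentry *parent, const char *name)
{
    unsigned g1 = atomic_load(&d->gen);
    if (g1 & 1)                         /* modification in progress */
        return dentry_cmp_locked(d, parent, name);

    bool match = d->parent == parent && strcmp(d->name, name) == 0;

    if (atomic_load(&d->gen) != g1)     /* raced with an update: the   */
        return dentry_cmp_locked(d, parent, name); /* result is stale, */
    return match;                       /* so fall back to the lock    */
}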
Per-core data structures
To reduce contention
Split the per-super-block list of open files into per-core lists.
Works in most cases, since operations that must scan all the per-core lists are rare (see the sketch below)
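A minimal sketch of the idea, assuming a fixed number of cores and one lock per list; the identifiers are illustrative, not the kernel's:

#include <pthread.h>
#include <stddef.h>

#define NCORES 4

struct file_node {
    struct file_node *next;
    /* per-open-file state ... */
};

static struct percore_list {
    pthread_mutex_t lock;       /* rarely contended: each core uses its own */
    struct file_node *head;
} open_files[NCORES];

/* Common case: a core adds an open file to its own list. */
void open_file_add(int core, struct file_node *f)
{
    struct percore_list *l = &open_files[core];
    pthread_mutex_lock(&l->lock);
    f->next = l->head;
    l->head = f;
    pthread_mutex_unlock(&l->lock);
}

/* Rare case: walk every per-core list (e.g., before remounting). */
void open_file_for_each(void (*fn)(struct file_node *))
{
    for (int c = 0; c < NCORES; c++) {
        pthread_mutex_lock(&open_files[c].lock);
        for (struct file_node *f = open_files[c].head; f != NULL; f = f->next)
            fn(f);
        pthread_mutex_unlock(&open_files[c].lock);
    }
}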