
AN ANALYSIS OF

LINUX SCALABILITY
TO MANY CORES
Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao,
Aleksey Pesterev, M. Frans Kaashoek, Robert Morris,
Nickolai Zeldovich (MIT)
OSDI 2010
Paper highlights
 Asks whether traditional kernel designs apply to multicore architectures
   Do they allow efficient usage of the architecture?
 Investigated 8 different applications
   Running on a 48-core computer
 Concluded that most kernel bottlenecks could be eliminated using standard parallelizing techniques
 Added a new one: sloppy counters
The challenge
 Multicore architectures
   Do we need new kernel designs?
     Barrelfish, Corey, fos, …
   Can we use traditional kernel architectures?
The approach
 Try to scale up various system applications on
   A 48-core computer
   Running a conventional Linux kernel
 Measure scalability of 8 applications (MOSBENCH) using the unmodified kernel
 Try to fix the scalability bottlenecks
 Measure scalability of the applications once the fixes have been applied
Scalability
 Application speedup/Number of cores ratio
 Ideally, but rarely, 100%
 Typically less than that, due to
 Inherently sequential part(s) of the application
 Other bottlenecks
 Obtaining locks on shared variables, …

 Unnecessary sharing
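Written as a ratio (our notation, with $T(n)$ the execution time of the application on $n$ cores):

$$\text{scalability}(n) \;=\; \frac{\text{speedup}(n)}{n} \;=\; \frac{T(1)}{n\,T(n)}$$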
Amdahl’s Law
$$S_{\text{latency}}(s) \;=\; \frac{1}{(1-p) + \dfrac{p}{s}}$$

 $S_{\text{latency}}(s)$: theoretical speedup of the execution of the whole task
 $s$: speedup of the part of the task that benefits from improved system resources
 $p$: fraction of execution time that the part benefiting from improved resources originally occupied
Example: Flying Houston-New York
 Now
 Waiting at airport: 1 hour
 Taxiing out: 17 minutes
 Air time: 2 hours 56 minutes
 Taxiing in: 6 minutes
   Total time: 4 hours 19 minutes
 A faster airplane cuts the air time by 50 percent ($s = 2$)
   With $p = 176/259 \approx 0.68$, the overall speedup is only $S_{\text{latency}} \approx 1.515$ (see below)
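Plugging the numbers into Amdahl's law (total time 259 minutes, of which the 176 minutes of air time benefit from the faster airplane):

$$p = \frac{176}{259} \approx 0.68, \qquad s = 2$$

$$S_{\text{latency}} = \frac{1}{(1-0.68) + \dfrac{0.68}{2}} = \frac{1}{0.66} \approx 1.515$$

Halving the single largest component makes the whole trip only about 1.5 times faster.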
MOSBENCH Applications
 Mail Server:
 Exim
   Single master process waits for incoming TCP connections
   Forks a child process for each new connection
     Child handles the incoming mail arriving on that connection
     Includes access to a set of shared spool directories and a shared log file
   Spends 69% of its time in the kernel
MOSBENCH Applications
 Object cache:
 memcached
   In-memory key-value store
   A single memcached server would not scale up
     Bottleneck is the internal lock protecting the key-value hash table
   Run multiple memcached servers instead
     Clients deterministically distribute key lookups among the servers
   Spends 80% of its time processing packets in the kernel on one core
MOSBENCH Applications
 Web server:
 Apache
   Single instance listening on port 80
   Uses a thread pool to process connections
   Configuration stresses the network stack and the file system (directory name lookups)
   Running on a single core, it spends 60 percent of its time in the kernel
MOSBENCH Applications
 Database:
 Postgres
   Makes extensive internal use of shared data structures and synchronization
   Should exhibit little contention for read-mostly workloads
   For read-only workloads:
     With one core: spends 1.5% of its time in the kernel
     With 48 cores: 82%
MOSBENCH Applications
 File indexer:
 Psearchy
   Parallel version of searchy, a program to index and query Web pages
   Focus is on the indexing component of psearchy (pedsort)
     More system intensive
   With one core, pedsort spends only 1.9% of its time in the kernel
     Grows to 23% at 48 cores
MOSBENCH Applications
 Parallel build:
 gmake
   Creates many more processes than there are cores
 Execution time dominated by the compiler it runs
 Running on a single core, it spends 7.6 percent of its time in the
kernel
MOSBENCH Applications
 MapReduce:
 Metis
   MapReduce library for single multicore servers
   Workload allocates large amounts of memory to hold temporary tables
   With one core: spends 3% of its time in the kernel
   With 48 cores: 16%
Common scalability issues (I)
 Tasks may lock a shared data structure
 Tasks may write into a shared memory location
 Cache coherence issues even in lock-free shared data
structures.
 Tasks may compete for space in a limited-size shared hardware
cache
 Happens even if tasks never share memory
Common scalability issues (II)
 Tasks may compete for other shared hardware resources
   Inter-core interconnect, DRAM, …
 Too few tasks to keep all cores busy
 Cache consistency issues:
   When a core uses data that other cores have just written
   Delays while the modified cache line is fetched
Hard fixes
 When everything else fails
   Best approach is to change the implementation
 In the stock Linux kernel:
   The set of runnable threads is partitioned into mostly-private per-core scheduling queues
   FreeBSD's low-level scheduler uses a similar approach
Easy fixes
 Well-known techniques such as
 Lock-free protocols
 Fine-grained locking
Multicore packet processing
 Want each packet, queue, and connection to be handled by just one core
 Used Intel's 82599 10Gbit Ethernet (IXGBE) network card
   Provides multiple hardware queues
 Linux can be configured to assign each hardware queue to a different core
   Uses sampling to send each packet to the right core
   Works only for long-term connections
 Authors instead configured the IXGBE to direct each packet to a queue (and core) using a hash of the packet headers (see the sketch below)
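The idea can be sketched in a few lines of C. This is illustrative only: the real card computes a Toeplitz-style hash in hardware, and the mixing function, names, and queue count below are invented.

```c
#include <stdint.h>

#define NQUEUES 48          /* one hardware queue per core */

struct flow_key {           /* connection 4-tuple from the packet headers */
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Pick a receive queue (and hence a core) for a packet. Every packet of
 * a given connection hashes to the same queue, so a single core handles
 * the whole connection without sampling or shared state. */
unsigned pick_queue(const struct flow_key *k)
{
    uint32_t h = k->saddr ^ k->daddr
               ^ (((uint32_t)k->sport << 16) | k->dport);
    h ^= h >> 16;           /* cheap integer mixing, a stand-in for */
    h *= 0x45d9f3bu;        /* the hardware's actual hash function  */
    h ^= h >> 16;
    return h % NQUEUES;
}
```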
Sloppy counters
 Speed up increment/decrement operations on a shared counter
   Local counters: 2, 0, 0, 2; global counter: 8
 One local counter per core
   Represents the number of pre-allocated references held by that specific core
 Global counter represents the total number of committed references
   In use or pre-allocated
In the paper
 Counter used to keep track of the reference count of an object
 Main idea is to pre-allocate spare references to cores
 In our example:
   8 references
   4 of them are pre-allocated references
Incrementing the sloppy counter (I)
 If the core has spare pre-allocated references:
   Subtract the increment from the local counter
   Local counters: 1, 0, 0, 2; global counter: 8
 First core used one of its pre-allocated references
   Global counter remains unchanged
Incrementing the sloppy counter (II)
 If the core does not have any spare pre-allocated reference:
   Add the increment to the global counter
   Local counters: 1, 0, 0, 2; global counter: 9
 Second core requested and obtained one additional reference
   Global counter is updated
Decrementing the sloppy counter
 Always:
   Add the decrement to the local counter
   Local counters: 1, 1, 0, 2; global counter: 9
 Second core releases a reference
   Increments its number of pre-allocated references
   Does not update the global counter
Releasing pre-allocated references
 Always:
   Subtract the same value from both the global and the local counter
   Local counters: 1, 1, 0, 0; global counter: 7
 Fourth core released its two pre-allocated references
How they work (I)
 Represent one logical counter as
 A single shared central counter
 A set of per-core counts of spare references
 When a core increments a sloppy counter by V
 First tries to acquire a spare reference by decrementing its
per-core counter by V
   If the per-core counter is greater than or equal to V, the decrement succeeds
 Otherwise the core increments the shared counter by V
How they work (II)
 When a core decrements a sloppy counter by V:
   Increments its per-core counter by V
 If the local count grows above some threshold (the "sloppiness"):
   Spare references are released by decrementing both the per-core count and the central count
 Sloppy counters maintain the invariant:
   The sum of the per-core counters and the number of resources in use equals the value of the shared counter (see the sketch below)
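To make the mechanics concrete, here is a minimal user-space sketch in C. It assumes the caller runs pinned to one core and is never preempted mid-operation (true in the kernel); the names, fixed threshold, and padding are illustrative, not the paper's actual code.

```c
#include <stdatomic.h>

#define NCORES    48
#define THRESHOLD 8     /* the "sloppiness": max spare references per core */

struct sloppy_counter {
    atomic_long global;                  /* committed refs: in use or spare */
    struct {
        long spare;                      /* spare references held by this core */
        char pad[64 - sizeof(long)];     /* one cache line per core */
    } local[NCORES];
};

/* Acquire (increment) v references on this core. */
void sloppy_inc(struct sloppy_counter *c, int core, long v)
{
    if (c->local[core].spare >= v)
        c->local[core].spare -= v;       /* consume spares: no shared write */
    else
        atomic_fetch_add(&c->global, v); /* fall back to the shared counter */
}

/* Release (decrement) v references on this core. */
void sloppy_dec(struct sloppy_counter *c, int core, long v)
{
    c->local[core].spare += v;           /* keep the references as local spares */
    if (c->local[core].spare > THRESHOLD) {     /* too sloppy: return spares */
        atomic_fetch_sub(&c->global, c->local[core].spare);
        c->local[core].spare = 0;
    }
}
```

Throughout, the invariant holds: the global value equals the sum of the per-core spares plus the references in use, so the true count can always be recovered by collecting the local counts.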
Meaning
 Local counts keep track of the number of spare references held by each core
   Act as a local reserve
 Global count keeps track of the total number of references issued
   Whether sitting in a local reserve or in use
Example (I)
 Local count is equal to 2
 Global count is equal to 6
 Core uses 0 references
 Core needs 2 extra references:
   Decrement local count by 2
 Local count is now equal to 0
 Global count is still equal to 6
 Core now uses 2 references
 Core needs 2 extra references:
   Increment global count by 2
Example (II)
 Local count is equal to 0
 Global count is equal to 8
 Core now uses 4 references
 Core releases 2 references:
   Increment local count by 2
 Local count is now equal to 2
 Global count is still equal to 8
 Core now uses 2 references
 Core releases 2 more references:
   Increment local count by 2
Example (III)
 Local count is equal to 4
 Global count is equal to 8
 Core uses no references
 Local count is too high:
   Return two pre-allocated references
   Decrement both counts by 2
 Local count is now equal to 2
 Global count is now equal to 6
 Core uses no references
A more general view (for your information)
 Replace a shared counter by only
 A global counter
 One local counter per thread
 When a thread wants to increment the counter
 It increments its local value (protected by a local lock)
 Global value becomes out of date
 From time to time,
 Local values are transferred to the global counter
 Local counters are reset to zero
Example (I)
 Initially: local counters 0, 0, 0, 0; global counter 0
 Thread 1 increments its local counter: 1, 0, 0, 0; global counter 0
 Thread 2 increments its local counter: 1, 1, 0, 0; global counter 0
Example (II)
 Thread 2 increments its local counter again: 1, 2, 0, 0; global counter 0
   The global value (0) is now out of date: the true count is 3
 Transfer the local values to the global counter and reset them: 0, 0, 0, 0; global counter 3
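A minimal C sketch of this general scheme, using pthreads and invented names; when and how often to flush ("from time to time") is left to the caller.

```c
#include <pthread.h>

#define NTHREADS 4

struct approx_counter {        /* initialize all locks with pthread_mutex_init() */
    pthread_mutex_t global_lock;
    long global;                       /* may be out of date between flushes */
    struct {
        pthread_mutex_t lock;          /* local lock: almost never contended */
        long value;
        char pad[64];                  /* keep each slot on its own cache line */
    } local[NTHREADS];
};

/* Fast path: a thread increments only its own local value. */
void counter_inc(struct approx_counter *c, int tid)
{
    pthread_mutex_lock(&c->local[tid].lock);
    c->local[tid].value++;
    pthread_mutex_unlock(&c->local[tid].lock);
}

/* From time to time: transfer the local values into the global counter
 * and reset the local counters to zero. */
void counter_flush(struct approx_counter *c)
{
    pthread_mutex_lock(&c->global_lock);
    for (int t = 0; t < NTHREADS; t++) {
        pthread_mutex_lock(&c->local[t].lock);
        c->global += c->local[t].value;
        c->local[t].value = 0;
        pthread_mutex_unlock(&c->local[t].lock);
    }
    pthread_mutex_unlock(&c->global_lock);
}
```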
Lock-free comparison (I)
 Observed low scalability for name lookups in the directory entry
cache.
 Directory entry cache speeds up lookups by mapping a directory
and a file name to a dentry identifying the target file’s inode
 When a potential dentry is located
 Lookup code gets a per-dentry spin lock to atomically compare
dentry contents with lookup function arguments
 Causes a bottleneck
Lock-free comparison (II)
 Use instead a lock-free protocol
   Similar to the Linux lock-free page cache lookup protocol
 Add a generation counter to each dentry
   Incremented after every modification to the dentry
   Temporarily set to zero during the update
Lock-free comparison (III)
 If the generation counter is 0:
   Fall back to the locking protocol
 Otherwise:
   Remember the generation counter value
   Copy the fields of the dentry to local variables
   If the generation now differs from the remembered value:
     Fall back to the locking protocol
Lock-free comparison (IV)
 Compare the copied fields to the arguments
 If there is a match:
   If the reference count is greater than zero:
     Increment the reference count and return the dentry
 Else:
   Fall back to the locking protocol (see the sketch below)
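A minimal user-space C sketch of the generation-counter protocol follows. The field names, fixed-size name buffer, and return convention are hypothetical, and a real implementation additionally needs memory barriers and the kernel's actual dentry layout.

```c
#include <stdatomic.h>
#include <string.h>

struct dentry {
    atomic_uint gen;        /* 0 while an update is in progress, otherwise
                               incremented after every modification */
    unsigned    name_hash;
    char        name[64];
    /* ... spin lock, reference count, inode pointer, ... */
};

/* Returns 1 on a confirmed match, 0 on a confirmed mismatch,
 * -1 if the caller must fall back to the locking protocol. */
int dentry_cmp_lockfree(struct dentry *d, unsigned hash, const char *name)
{
    unsigned gen = atomic_load(&d->gen);
    if (gen == 0)
        return -1;                     /* update in progress: take the lock */

    /* Copy the fields we need into local variables. */
    unsigned local_hash = d->name_hash;
    char local_name[64];
    memcpy(local_name, d->name, sizeof(local_name));

    if (atomic_load(&d->gen) != gen)
        return -1;                     /* dentry changed under us: take the lock */

    return local_hash == hash && strcmp(local_name, name) == 0;
}
```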
Per-core data structures
 To reduce contention:
   Split the per-super-block list of open files into per-core lists
     Works in most cases
   Added per-core vfsmount tables, each acting as a cache for the central vfsmount table
   Used per-core free lists to allocate packet buffers (skbuffs) in the memory system closest to the I/O bus
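As an illustration, here is a minimal C sketch of splitting one contended list into per-core lists; the names are invented, and the kernel uses its own locking primitives rather than pthreads.

```c
#include <pthread.h>

#define NCORES 48

struct file_node { struct file_node *next; /* ... per-file state ... */ };

struct percore_list {
    pthread_mutex_t lock;      /* per-core lock: rarely contended */
    struct file_node *head;
    char pad[64];              /* keep each list head on its own cache line */
};

struct percore_list open_files[NCORES];   /* init each lock before use */

/* Common case: a core adds to its own list and touches only its own
 * cache lines, instead of all cores serializing on one global lock. */
void file_list_add(int core, struct file_node *f)
{
    pthread_mutex_lock(&open_files[core].lock);
    f->next = open_files[core].head;
    open_files[core].head = f;
    pthread_mutex_unlock(&open_files[core].lock);
}

/* Rare whole-set operations (e.g. unmount) must visit all NCORES lists. */
```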
Eliminating false sharing
 Problems occurred because the kernel had placed a variable that it updated often on the same cache line as a variable that it read often
   Cores contended for the falsely shared line
   Degraded Exim per-core performance
 memcached, Apache, and PostgreSQL faced similar false sharing problems
 Placing the heavily modified data on a separate cache line solved the problem
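A minimal C illustration of the fix, assuming 64-byte cache lines and the common GCC/Clang alignment attribute (the field names are invented):

```c
/* Before: a write-hot counter shares a cache line with a read-mostly
 * field, so readers on other cores keep losing the line. */
struct stats_bad {
    unsigned long packets_sent;   /* written constantly, by many cores */
    unsigned long mtu;            /* read often, written almost never  */
};

/* After: force the write-hot field onto its own 64-byte cache line. */
struct stats_good {
    unsigned long packets_sent __attribute__((aligned(64)));
    unsigned long mtu          __attribute__((aligned(64)));
};
```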
Evaluation
[Figure omitted: measured per-application scalability after the fixes]
Note
 We skipped the individual discussions of the performance of each application
   They will not be on any test
Conclusion
 Can remove most kernel bottlenecks with slight modifications to the applications or the kernel
 Except for sloppy counters, most of the changes are applications of standard parallel programming techniques
 Results suggest that traditional kernel designs may be able to achieve application scalability on multicore computers
   Subject to the limitations of the study