
AN ANALYSIS OF

LINUX SCALABILITY
TO MANY CORES
Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao,
Aleksey Pesterev, M. Frans Kaashoek, Robert Morris,
Nickolai Zeldovich (MIT)
OSDI 2010
Paper highlights
 Asks whether traditional kernel designs apply to multicore architectures
   Do they allow efficient usage of the architecture?
 Investigated 8 different applications
   Running on a 48-core computer
 Concluded that most kernel bottlenecks could be eliminated using standard parallelizing techniques
 Added a new one: sloppy counters
The challenge
 Multicore architectures
   Do we need new kernel designs?
     Barrelfish, Corey, fos, …
   Can we use traditional kernel architectures?
The approach
 Try to scale up various system applications on
   A 48-core computer
   Running a conventional Linux kernel
 Measure scalability of 8 applications (MOSBENCH) using the unmodified kernel
 Try to fix the scalability bottlenecks
 Measure scalability of the applications once the fixes have been applied
Scalability
 Application speedup/Number of cores ratio
 Ideally, but rarely, 100%
 Typically less than that, due to
 Inherently sequential part(s) of the application
 Other bottlenecks
 Obtaining locks on shared variables, …

 Unnecessary sharing
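Written as a ratio (our notation, with $T(n)$ the execution time of the application on $n$ cores):

$$\text{scalability}(n) \;=\; \frac{\text{speedup}(n)}{n} \;=\; \frac{T(1)}{n\,T(n)}$$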
Amdahl’s Law
$$S_{\text{latency}}(s) \;=\; \frac{1}{(1-p) + \dfrac{p}{s}}$$

 $S_{\text{latency}}(s)$: theoretical speedup of the execution of the whole task
 $s$: speedup of the part of the task that benefits from improved system resources
 $p$: fraction of execution time that the part benefiting from improved resources originally occupied
Example: Flying Houston-New York
 Now
 Waiting at airport: 1 hour
 Taxiing out: 17 minutes
 Air time: 2 hours 56 minutes
 Taxiing in: 6 minutes
   Total time: 4 hours 19 minutes
 A faster airplane cuts the air time by 50 percent ($s = 2$)
   With $p = 176/259 \approx 0.68$, the overall speedup is only $S_{\text{latency}} \approx 1.515$ (see below)
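Plugging the numbers into Amdahl's law (total time 259 minutes, of which the 176 minutes of air time benefit from the faster airplane):

$$p = \frac{176}{259} \approx 0.68, \qquad s = 2$$

$$S_{\text{latency}} = \frac{1}{(1-0.68) + \dfrac{0.68}{2}} = \frac{1}{0.66} \approx 1.515$$

Halving the single largest component makes the whole trip only about 1.5 times faster.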
MOSBENCH Applications
 Mail Server:
 Exim
   Single master process waits for incoming TCP connections
   Forks a child process for each new connection
     Child handles the incoming mail arriving on that connection
     Includes access to a set of shared spool directories and a shared log file
   Spends 69% of its time in the kernel
MOSBENCH Applications
 Object cache:
 memcached
   In-memory key-value store
   A single memcached server would not scale up
     Bottleneck is the internal lock protecting the key-value hash table
   Run multiple memcached servers instead
     Clients deterministically distribute key lookups among the servers
   Spends 80% of its time processing packets in the kernel on one core
MOSBENCH Applications
 Web server:
 Apache
   Single instance listening on port 80
   Uses a thread pool to process connections
   Configuration stresses the network stack and the file system (directory name lookups)
   Running on a single core, it spends 60 percent of its time in the kernel
MOSBENCH Applications
 Database:
 Postgres
   Makes extensive internal use of shared data structures and synchronization
   Should exhibit little contention for read-mostly workloads
   For read-only workloads:
     With one core: spends 1.5% of its time in the kernel
     With 48 cores: 82%
MOSBENCH Applications
 File indexer:
 Psearchy
   Parallel version of searchy, a program to index and query Web pages
   Focus is on the indexing component of psearchy (pedsort)
     More system intensive
   With one core, pedsort spends only 1.9% of its time in the kernel
     Grows to 23% at 48 cores
MOSBENCH Applications
 Parallel build:
 gmake
   Creates many more processes than there are cores
 Execution time dominated by the compiler it runs
 Running on a single core, it spends 7.6 percent of its time in the
kernel
MOSBENCH Applications
 MapReduce:
 Metis
   MapReduce library for single multicore servers
   Workload allocates large amounts of memory to hold temporary tables
   With one core: spends 3% of its time in the kernel
   With 48 cores: 16%
Common scalability issues (I)
 Tasks may lock a shared data structure
 Tasks may write into a shared memory location
 Cache coherence issues even in lock-free shared data
structures.
 Tasks may compete for space in a limited-size shared hardware
cache
 Happens even if tasks never share memory
Common scalability issues (II)
 Tasks may compete for other shared hardware resources
   Inter-core interconnect, DRAM, …
 Too few tasks to keep all cores busy
 Cache consistency issues:
   When a core uses data that other cores have just written
   Delays while the modified cache line is fetched
Hard fixes
 When everything else fails
   Best approach is to change the implementation
 In the stock Linux kernel:
   The set of runnable threads is partitioned into mostly-private per-core scheduling queues
   FreeBSD's low-level scheduler uses a similar approach
Easy fixes
 Well-known techniques such as
 Lock-free protocols
 Fine-grained locking
Multicore packet processing
 Want each packet, queue, and connection to be handled by just one core
 Used Intel's 82599 10Gbit Ethernet (IXGBE) network card
   Provides multiple hardware queues
 Linux can be configured to assign each hardware queue to a different core
   Uses sampling to send each packet to the right core
   Works only for long-term connections
 Authors instead configured the IXGBE to direct each packet to a queue (and core) using a hash of the packet headers (see the sketch below)
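The idea can be sketched in a few lines of C. This is illustrative only: the real card computes a Toeplitz-style hash in hardware, and the mixing function, names, and queue count below are invented.

```c
#include <stdint.h>

#define NQUEUES 48          /* one hardware queue per core */

struct flow_key {           /* connection 4-tuple from the packet headers */
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Pick a receive queue (and hence a core) for a packet. Every packet of
 * a given connection hashes to the same queue, so a single core handles
 * the whole connection without sampling or shared state. */
unsigned pick_queue(const struct flow_key *k)
{
    uint32_t h = k->saddr ^ k->daddr
               ^ (((uint32_t)k->sport << 16) | k->dport);
    h ^= h >> 16;           /* cheap integer mixing, a stand-in for */
    h *= 0x45d9f3bu;        /* the hardware's actual hash function  */
    h ^= h >> 16;
    return h % NQUEUES;
}
```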
Sloppy counters
 Speed up increment/decrement operations on a shared counter
   Local counters: 2, 0, 0, 2; global counter: 8
 One local counter per core
   Represents the number of pre-allocated references held by that specific core
 Global counter represents the total number of committed references
   In use or pre-allocated
In the paper
 Counter used to keep track of the reference count of an object
 Main idea is to pre-allocate spare references to cores
 In our example:
   8 references
   4 of them are pre-allocated references
Incrementing the sloppy counter (I)
 If the core has spare pre-allocated references:
   Subtract the increment from the local counter
   Local counters: 1, 0, 0, 2; global counter: 8
 First core used one of its pre-allocated references
   Global counter remains unchanged
Incrementing the sloppy counter (II)
 If the core does not have any spare pre-allocated reference:
   Add the increment to the global counter
   Local counters: 1, 0, 0, 2; global counter: 9
 Second core requested and obtained one additional reference
   Global counter is updated
Decrementing the sloppy counter
 Always:
   Add the decrement to the local counter
   Local counters: 1, 1, 0, 2; global counter: 9
 Second core releases a reference
   Increments its number of pre-allocated references
   Does not update the global counter
Releasing pre-allocated references
 Always:
   Subtract the same value from both the global and the local counter
   Local counters: 1, 1, 0, 0; global counter: 7
 Fourth core released its two pre-allocated references
How they work (I)
 Represent one logical counter as
 A single shared central counter
 A set of per-core counts of spare references
 When a core increments a sloppy counter by V
 First tries to acquire a spare reference by decrementing its
per-core counter by V
   If the per-core counter is greater than or equal to V, the decrement succeeds
 Otherwise the core increments the shared counter by V
How they work (II)
 When a core decrements a sloppy counter by V:
   Increments its per-core counter by V
 If the local count grows above some threshold (the "sloppiness"):
   Spare references are released by decrementing both the per-core count and the central count
 Sloppy counters maintain the invariant:
   The sum of the per-core counters and the number of resources in use equals the value of the shared counter (see the sketch below)
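To make the mechanics concrete, here is a minimal user-space sketch in C. It assumes the caller runs pinned to one core and is never preempted mid-operation (true in the kernel); the names, fixed threshold, and padding are illustrative, not the paper's actual code.

```c
#include <stdatomic.h>

#define NCORES    48
#define THRESHOLD 8     /* the "sloppiness": max spare references per core */

struct sloppy_counter {
    atomic_long global;                  /* committed refs: in use or spare */
    struct {
        long spare;                      /* spare references held by this core */
        char pad[64 - sizeof(long)];     /* one cache line per core */
    } local[NCORES];
};

/* Acquire (increment) v references on this core. */
void sloppy_inc(struct sloppy_counter *c, int core, long v)
{
    if (c->local[core].spare >= v)
        c->local[core].spare -= v;       /* consume spares: no shared write */
    else
        atomic_fetch_add(&c->global, v); /* fall back to the shared counter */
}

/* Release (decrement) v references on this core. */
void sloppy_dec(struct sloppy_counter *c, int core, long v)
{
    c->local[core].spare += v;           /* keep the references as local spares */
    if (c->local[core].spare > THRESHOLD) {     /* too sloppy: return spares */
        atomic_fetch_sub(&c->global, c->local[core].spare);
        c->local[core].spare = 0;
    }
}
```

Throughout, the invariant holds: the global value equals the sum of the per-core spares plus the references in use, so the true count can always be recovered by collecting the local counts.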
Meaning
 Local counts keep track of the number of spare references held by each core
   Act as a local reserve
 Global count keeps track of the total number of references issued
   Whether sitting in a local reserve or in use
Example (I)
 Local count is equal to 2
 Global count is equal to 6
 Core uses 0 references
 Core needs 2 extra references:
   Decrement local count by 2
 Local count is now equal to 0
 Global count is still equal to 6
 Core now uses 2 references
 Core needs 2 extra references:
   Increment global count by 2
Example (II)
 Local count is equal to 0
 Global count is equal to 8
 Core now uses 4 references
 Core releases 2 references:
   Increment local count by 2
 Local count is now equal to 2
 Global count is still equal to 8
 Core now uses 2 references
 Core releases 2 more references:
   Increment local count by 2
Example (III)
 Local count is equal to 4
 Global count is equal to 8
 Core uses no references
 Local count is too high:
   Return two pre-allocated references
   Decrement both counts by 2
 Local count is now equal to 2
 Global count is now equal to 6
 Core uses no references
A more general view (for your information)
 Replace a shared counter by only
 A global counter
 One local counter per thread
 When a thread wants to increment the counter
 It increments its local value (protected by a local lock)
 Global value becomes out of date
 From time to time,
 Local values are transferred to the global counter
 Local counters are reset to zero
Example (I)
 Initially: local counters 0, 0, 0, 0; global counter 0
 Thread 1 increments its local counter: 1, 0, 0, 0; global counter 0
 Thread 2 increments its local counter: 1, 1, 0, 0; global counter 0
Example (II)
 Thread 2 increments its local counter again: 1, 2, 0, 0; global counter 0
   The global value (0) is now out of date: the true count is 3
 Transfer the local values to the global counter and reset them: 0, 0, 0, 0; global counter 3
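A minimal C sketch of this general scheme, using pthreads and invented names; when and how often to flush ("from time to time") is left to the caller.

```c
#include <pthread.h>

#define NTHREADS 4

struct approx_counter {        /* initialize all locks with pthread_mutex_init() */
    pthread_mutex_t global_lock;
    long global;                       /* may be out of date between flushes */
    struct {
        pthread_mutex_t lock;          /* local lock: almost never contended */
        long value;
        char pad[64];                  /* keep each slot on its own cache line */
    } local[NTHREADS];
};

/* Fast path: a thread increments only its own local value. */
void counter_inc(struct approx_counter *c, int tid)
{
    pthread_mutex_lock(&c->local[tid].lock);
    c->local[tid].value++;
    pthread_mutex_unlock(&c->local[tid].lock);
}

/* From time to time: transfer the local values into the global counter
 * and reset the local counters to zero. */
void counter_flush(struct approx_counter *c)
{
    pthread_mutex_lock(&c->global_lock);
    for (int t = 0; t < NTHREADS; t++) {
        pthread_mutex_lock(&c->local[t].lock);
        c->global += c->local[t].value;
        c->local[t].value = 0;
        pthread_mutex_unlock(&c->local[t].lock);
    }
    pthread_mutex_unlock(&c->global_lock);
}
```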
Lock-free comparison (I)
 Observed low scalability for name lookups in the directory entry
cache.
 Directory entry cache speeds up lookups by mapping a directory
and a file name to a dentry identifying the target file’s inode
 When a potential dentry is located
 Lookup code gets a per-dentry spin lock to atomically compare
dentry contents with lookup function arguments
 Causes a bottleneck
Lock-free comparison (II)
 Use instead a lock-free protocol
   Similar to the Linux lock-free page cache lookup protocol
 Add a generation counter to each dentry
   Incremented after every modification to the dentry
   Temporarily set to zero during the update
Lock-free comparison (III)
 If the generation counter is 0:
   Fall back to the locking protocol
 Otherwise:
   Remember the generation counter value
   Copy the fields of the dentry to local variables
   If the generation now differs from the remembered value:
     Fall back to the locking protocol
Lock-free comparison (IV)
 Compare the copied fields to the arguments
 If there is a match:
   If the reference count is greater than zero:
     Increment the reference count and return the dentry
 Else:
   Fall back to the locking protocol (see the sketch below)
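A minimal user-space C sketch of the generation-counter protocol follows. The field names, fixed-size name buffer, and return convention are hypothetical, and a real implementation additionally needs memory barriers and the kernel's actual dentry layout.

```c
#include <stdatomic.h>
#include <string.h>

struct dentry {
    atomic_uint gen;        /* 0 while an update is in progress, otherwise
                               incremented after every modification */
    unsigned    name_hash;
    char        name[64];
    /* ... spin lock, reference count, inode pointer, ... */
};

/* Returns 1 on a confirmed match, 0 on a confirmed mismatch,
 * -1 if the caller must fall back to the locking protocol. */
int dentry_cmp_lockfree(struct dentry *d, unsigned hash, const char *name)
{
    unsigned gen = atomic_load(&d->gen);
    if (gen == 0)
        return -1;                     /* update in progress: take the lock */

    /* Copy the fields we need into local variables. */
    unsigned local_hash = d->name_hash;
    char local_name[64];
    memcpy(local_name, d->name, sizeof(local_name));

    if (atomic_load(&d->gen) != gen)
        return -1;                     /* dentry changed under us: take the lock */

    return local_hash == hash && strcmp(local_name, name) == 0;
}
```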
Per-core data structures
 To reduce contention:
   Split the per-super-block list of open files into per-core lists
     Works in most cases
   Added per-core vfsmount tables, each acting as a cache for the central vfsmount table
   Used per-core free lists to allocate packet buffers (skbuffs) in the memory system closest to the I/O bus
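As an illustration, here is a minimal C sketch of splitting one contended list into per-core lists; the names are invented, and the kernel uses its own locking primitives rather than pthreads.

```c
#include <pthread.h>

#define NCORES 48

struct file_node { struct file_node *next; /* ... per-file state ... */ };

struct percore_list {
    pthread_mutex_t lock;      /* per-core lock: rarely contended */
    struct file_node *head;
    char pad[64];              /* keep each list head on its own cache line */
};

struct percore_list open_files[NCORES];   /* init each lock before use */

/* Common case: a core adds to its own list and touches only its own
 * cache lines, instead of all cores serializing on one global lock. */
void file_list_add(int core, struct file_node *f)
{
    pthread_mutex_lock(&open_files[core].lock);
    f->next = open_files[core].head;
    open_files[core].head = f;
    pthread_mutex_unlock(&open_files[core].lock);
}

/* Rare whole-set operations (e.g. unmount) must visit all NCORES lists. */
```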
Eliminating false sharing
 Problems occurred because the kernel had placed a variable that it updated often on the same cache line as a variable that it read often
   Cores contended for the falsely shared line
   Degraded Exim per-core performance
 memcached, Apache, and PostgreSQL faced similar false sharing problems
 Placing the heavily modified data on a separate cache line solved the problem
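A minimal C illustration of the fix, assuming 64-byte cache lines and the common GCC/Clang alignment attribute (the field names are invented):

```c
/* Before: a write-hot counter shares a cache line with a read-mostly
 * field, so readers on other cores keep losing the line. */
struct stats_bad {
    unsigned long packets_sent;   /* written constantly, by many cores */
    unsigned long mtu;            /* read often, written almost never  */
};

/* After: force the write-hot field onto its own 64-byte cache line. */
struct stats_good {
    unsigned long packets_sent __attribute__((aligned(64)));
    unsigned long mtu          __attribute__((aligned(64)));
};
```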
Evaluation
[Figure omitted: measured per-application scalability after the fixes]
Note
 We skipped the individual discussions of the performance of each application
   They will not be on any test
Conclusion
 Can remove most kernel bottlenecks with slight modifications to the applications or the kernel
 Except for sloppy counters, most of the changes are applications of standard parallel programming techniques
 Results suggest that traditional kernel designs may be able to achieve application scalability on multicore computers
   Subject to the limitations of the study