
Go scheduler

Implementing a language with lightweight concurrency

Dmitry Vyukov, [email protected]


Hydra conf, July 12 2019 1
Agenda
● Go specifics
● Scheduler
● Scalability
● Fairness
● Stacks
● Future

2
What is a goroutine?
Logically a thread of execution.

Logically the same as:


● OS thread
● coroutine
● green thread

3
Most material is generic
... and to a large degree applicable to:

● OS thread schedulers
● Coroutine schedulers
● Thread pools
● Other languages

4
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints

5
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)

6
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable

7
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable
● minimal API (no hints)

8
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable
● minimal API (no hints)
● infinite stack

9
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable
● minimal API (no hints)
● infinite stack
● handling of IO, syscalls, C calls

10
A taste of Go
resultChan := make(chan Result)   // FIFO queue
go func() {                       // start a goroutine
    response := sendRequest()     // blocks on IO
    result := parse(response)
    resultChan <- result          // send the result back
}()
process(<-resultChan)             // receive the result

11
How can we implement this?

12
Thread per goroutine?
Would work!

But too expensive:

● memory (at least 32K or so)


● performance (syscalls)
● no infinite stacks

13
Thread pool?
Only makes goroutine creation faster.

But still:
● memory consumption
● performance
● no infinite stacks

14
M:N Threading

[diagram: many goroutines (G) multiplexed over a few threads]

15
M:N Threading

[diagram: many goroutines (G) multiplexed over a few threads]

G (goroutine):
● cheap
● full control

M (thread):
● expensive
● less control
● actual execution
● parallelism

16
Goroutine States
[diagram: goroutines over threads, now labeled with states]

States: runnable, running, blocked

G (goroutine): cheap, full control
M (thread): expensive, less control, actual execution, parallelism

17
Simple M:N Scheduler
Scheduler
G G G G Runnable Goroutines (Run Queue)

MUTEX

18
Simple M:N Scheduler
Scheduler
G G G G Runnable Goroutines (Run Queue)

MUTEX

Thread Thread Thread

G G G Running Goroutines
19
Blocked Goroutines

Channel

Wait Queue G G G

20
Blocked Goroutines

Channel blocking

Wait Queue G G G G

21
Blocked Goroutines

Channel

Wait Queue G G G G

unblocking
Scheduler Run Queue

G G G G

22
Blocked Goroutines

Channel

Wait Queue G G G G

The same mechanism for:


● Mutexes unblocking
● Timers Scheduler Run Queue
● Network IO
G G G G

23
System Calls
Thread

enter

Kernel/C code
24
System Calls
Thread

enter exit

Kernel/C code
25
System Calls
Thread Thread Thread Thread

G G G G

enter enter enter enter

Kernel/C code
26
System Calls
Thread Thread Thread Thread

G G G G

Run Queue

G G
enter enter enter enter

Kernel/C code
27
System Calls
in syscall
Thread Thread Thread Thread

G G G G

Run Queue

G G
enter

Kernel/C code
28
System Calls
in syscall
Thread Thread Thread Thread Thread

G G G G G

Run Queue

G G
enter

Kernel/C code
29
System Calls
idle
Thread Thread Thread Thread Thread

G G G G G

Run Queue

G G G
exit

Kernel/C code
30
#Threads > #Cores

31
√ lightweight goroutines

√ handling of IO and syscalls

√ parallel

32
Not Scalable!
Scheduler
G G G G Runnable Goroutines (Run Queue)

MUTEX

Thread Thread Thread

G G G
33
lock-free?

34
lock-free?


35
👩 🖥

36
👩
👩
👩 🖥
👩
👩
37
👩
👩
👩 🖥
👩
👩
38
Shifts:
8:00 - 16:00
👩
16:00 - 24:00 👩 MUTEX

0:00 - 8:00
👩 🖥
👩
👩
39
👩
👩 LOCK-FREE

👩 🖥
👩
👩
40
DISTRIBUTED

👩 🖥
👩 🖥
👩 🖥
👩 🖥
👩 🖥
41
Distributed Scheduler
Per-thread state Per-thread state Per-thread state

G G G G G

Thread Thread Thread

G G G
42
Distributed Scheduler
Per-thread state Per-thread state Per-thread state Scheduler

G G G G G G G

MUTEX

Thread Thread Thread

G G G
43
Distributed Scheduler
Per-thread state Per-thread state Per-thread state Scheduler

G G G G G G G

malloc cache malloc cache malloc cache

other caches other caches other caches


MUTEX

Thread Thread Thread

G G G
44
Poll Order
Main question: what is the next goroutine to run?

1. Local Run Queue


2. Global Run Queue
3. Network Poller
4. Work Stealing

45
Work Stealing

Per-thread state Per-thread state Per-thread state Per-thread state

G G G G G

Thread Thread Thread Thread

46
√ lightweight goroutines

√ handling of IO and syscalls

√ parallel

√ scalable

47
Threads in syscalls :(
(#threads > #cores)

48
M:P:N Threading
[diagram: goroutines (N) are scheduled onto processors (P), which are multiplexed over threads (M)]

49
M:P:N Threading
[diagram: same M:P:N picture; a goroutine is "running" when its P is attached to a thread]

P (processor): a resource required to run Go code

50
M:P:N Threading
[diagram: threads blocked in syscalls are detached from their P; each P moves to another thread and keeps running goroutines]
51
Distributed 3-Level Scheduler
"Processor" "Processor"

G G G

malloc cache malloc cache

other caches other caches

in syscall in syscall
Thread Thread Thread Thread

G G G G
52
Syscall handling: Handoff
"Processor"

G G

malloc cache

other caches

Thread

G
53
Syscall handling: Handoff
"Processor"

G G

malloc cache

other caches

in syscall
Thread

G
54
Syscall handling: Handoff
"Processor"

G G

malloc cache

other caches

in syscall
Thread Thread

G
55
Syscall handling: Handoff
"Processor"

G G

malloc cache

other caches

in syscall
Thread Thread

G
56
Syscall handling: Handoff
"Processor"

G G

malloc cache

other caches

in syscall
Thread Thread

G G
57
√ lightweight goroutines

√ handling of IO and syscalls

√ parallel

√ scalable

√ efficient

58
Fairness

59
Fairness
What: if a goroutine is runnable, it will run eventually.

Why:

● bad tail latencies


● livelocks
● pathological behaviors

60
Fairness
What: if a goroutine is runnable, it will run eventually.

Why:

● bad tail latencies


● livelocks
● pathological behaviors

Fairness is like Oxygen

61
Fair Scheduling

Fair: FIFO Run Queue

G G G G

62
Fair Scheduling

Fair: FIFO Run Queue

G G G G

Not Fair: LIFO Run Queue

G G G G

63
Fairness/Performance Tradeoff
● Single Run Queue does not scale
● FIFO bad for locality

Want a minimal amount of fairness!

64
Infinite Loops

in infinite loop

65
Infinite Loops

in infinite loop starved

G G

66
Infinite Loops

in infinite loop starved

G G

Solution: preemption (~10ms)

67
Local Run Queue

FIFO

G G G G

68
Local Run Queue

FIFO 1-element LIFO buffer

G G G G G

69
Local Run Queue

FIFO 1-element LIFO buffer

G G G G G

● better locality

70
Local Run Queue

FIFO 1-element LIFO buffer

G G G G G

● better locality
● restricts stealing (the buffered goroutine cannot be stolen for ~3µs)

71
Local Run Queue Starvation

FIFO 1-element LIFO buffer


G

G G G G G

72
Time Slice Inheritance

1-element LIFO buffer


G

Solution: inherit time slice -> looks like infinite loop -> preemption (~10ms)

73
Global Run Queue Starvation

Local Run Queue

G G

Global Run Queue

74
Global Run Queue Starvation

g = pollLocalRunQueue()
if g != nil {
    return g
}
return pollGlobalRunQueue()
75
Global Run Queue Starvation
schedTick++
if schedTick%61 == 0 {
    g = pollGlobalRunQueue()
    if g != nil {
        return g
    }
}
g = pollLocalRunQueue()
if g != nil {
    return g
}
return pollGlobalRunQueue()
76
Why 61?
It is not even 42! ¯\_(ツ)_/¯

Want something:
● not too small
● not too large
● prime to break any patterns

77
Network Poller Starvation

Global/Local Run Queue

G G

Network Poller

78
Network Poller Starvation

Global/Local Run Queue

G G

Network Poller

Solution: a background thread polls the network occasionally


79
Fairness Hierarchy

Goroutine - preemption

Local Run Queue - time slice inheritance

Global Run Queue - check once in a while

Network Poller - background thread

= minimal fairness at minimal cost


80
Stacks

81
Function Frame
void foo()
{
    ...
    int x = 42;
    ...
    return;
}

A frame holds:
● local variables
● return address
● previous frame pointer

82
Thread Stack

Stack

main

grows down

83
Thread Stack

Stack

foo main

grows down

84
Thread Stack

Stack

bar foo main

grows down

85
Thread Stack

Stack

foo main

grows down

86
Thread Stack
stack pointer (RSP)

Stack

foo main

87
Thread Stack
stack pointer (RSP)

Stack

foo main

return prev
local/temp variables
address frame ptr

88
Stack Implementation

Stack (1-8 MB), divided into pages (4K each)

Each page is either protected (guard), not paged-in, or paged-in

89
Stack is cheap!
foo:
    sub  $64, %RSP        // allocate stack frame of size 64
    ...
    mov  %RAX, 16(%RSP)   // store to a local var
    ...
    add  $64, %RSP        // deallocate stack frame
    retq

90
Paging-based infinite stacks?
● Lazy page-in
● 64-bit Virtual Address Space

Can we build "infinite" stacks based on this?

91
What is infinite?

1GB is "infinite" enough

92
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks

93
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB

94
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB
● Slow "page-out"

95
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB
● Slow "page-out"
● No huge pages (2MB, 1GB)

96
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB
● Slow "page-out"
● No huge pages (2MB, 1GB)
● 32-bit systems
○ ARM

97
Normal stack again
foo:
    sub  $64, %RSP
    ...
    mov  %RAX, 16(%RSP)
    ...
    add  $64, %RSP
    retq

98
Goroutine stacks
foo:
    mov  %fs:-8, %RCX     // load G descriptor from TLS
    cmp  16(%RCX), %RSP   // compare the stack limit and RSP
    jbe  morestack        // jump to slow path if not enough stack
    sub  $64, %RSP
    ...
    mov  %RAX, 16(%RSP)
    ...
    add  $64, %RSP
    retq
...
morestack:                // call runtime to allocate more stack
    callq <runtime.morestack>
99
Function Prologue
void foo()
{
    if (RSP < TLS_G->stack_limit)
        morestack();
    ...
}

100
Split Stack

Stack segment (1KB)

main

limit RSP

101
Split Stack

Stack segment (1KB)

foo main

limit RSP

102
Split Stack

Stack segment (1KB)

bar foo main

RSP limit

103
Split Stack

Stack segment (1KB) Stack segment (1KB)

foo main

limit RSP

104
Split Stack

Stack segment (1KB) Stack segment (1KB)

bar foo main

limit RSP

105
Split Stack

Stack segment (1KB)

foo main

limit RSP

106
Split Stack Benefits
● 1M goroutines
● works on 32-bits
● good granularity
● cheap "page-out"
● huge pages

107
"Hot Split" Problem :(

for ... { // hot loop

foo() // causes stack split


}

108
Important Performance Characteristics
1. Transparent
2. Stable

The "Hot Split" problem fails both.

109
Growable Stack

Stack (1KB)

main

limit RSP

110
Growable Stack

Stack (1KB)

foo main

limit RSP

111
Growable Stack

Stack (1KB)

bar foo main

RSP limit

112
Growable Stack

Stack (1KB)

foo main

New Stack (2KB) COPY

foo main

limit RSP

113
Growable Stack

Stack (2KB)

foo main

limit RSP

114
Growable Stack

Stack (2KB)

bar foo main

limit RSP

115
Growable Stack

Stack (2KB)

foo main

limit RSP

116
Stack Performance
Split Stack

● O(1) cost per function call


● repeated

Worst case: stack split in hot loop

117
Stack Performance
Split Stack Growable Stack

● O(1) cost per function call ● O(N) cost per function call
● repeated ● amortized

Worst case: stack split in hot loop Worst case: growing stack for short goroutine

118
Stack Performance
Split Stack Growable Stack

● O(1) cost per function call ● O(N) cost per function call
● repeated ● amortized

Worst case: stack split in hot loop Worst case: growing stack for short goroutine

Penalizing a cheap operation a bit < penalizing an expensive operation significantly

119
Stack Cache

"Processor"

G G

malloc cache

stack cache

other caches

120
Interesting Fact

Split stacks are in gcc:

$ gcc -fsplit-stack prog.c

121
Preemption
What: Asynchronously asking a goroutine to yield.

Why:

● multiplexing multiple goroutines


● auxiliary functions (GC, crashes)

122
Preemption
What: Asynchronously asking a goroutine to yield.

Why:

● multiplexing multiple goroutines


● auxiliary functions (GC, crashes)

Preemption is also like Oxygen

123
Implementation strategy
Signals:

+ Fast

124
Implementation strategy
Signals:

+ Fast
- OS-dependent
- non-preemptible regions
- GC stack/register maps


125
Implementation strategy
Signals: Cooperative checks:

+ Fast + OS-independent
- OS-dependent + non-preemptible regions
- non-preemptible regions + GC stack/register maps
- GC stack/register maps


126
Implementation strategy
Signals: Cooperative checks:

+ Fast + OS-independent
- OS-dependent + non-preemptible regions
- non-preemptible regions + GC stack/register maps
- GC stack/register maps - Slow (1-10%)

⛔ ⛔
127
Function Prologue
foo:
    mov  %fs:-8, %RCX     // load G descriptor from TLS
    cmp  16(%RCX), %RSP   // compare the stack limit and RSP
    jbe  morestack        // jump to slow path if not enough stack
    ...

128
Spoof stack limit!

G->stackLimit = 0xfffffffffffffade

129
Function Prologue
foo:
    mov  %fs:-8, %RCX
    cmp  16(%RCX), %RSP   // guaranteed to fail!
    jbe  morestack
    ...

130
Advantages
+ fast
+ portable
+ simple
+ GC-friendly

131
Advantages
+ fast
+ portable
+ simple
+ GC-friendly
- loops

132
Recap
√ lightweight goroutines
√ handling of IO and syscalls
√ parallel
√ scalable
√ efficient
√ fair
√ infinite stacks
√ preemptible*

133
Thank you!

Q&A

Dmitry Vyukov, [email protected]


134
