Go Scheduler
Go Scheduler
2
What is a goroutine?
Logically a thread of execution.
3
Most material is generic
... and to a large degree applicable to:
● OS thread schedulers
● Coroutine schedulers
● Thread pools
● Other languages
4
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints
5
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
6
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable
7
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable
● minimal API (no hints)
8
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable
● minimal API (no hints)
● infinite stack
9
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable
● minimal API (no hints)
● infinite stack
● handling of IO, syscalls, C calls
10
A taste of Go
resultChan := make(chan Result)   // FIFO queue
go func() {                       // start a goroutine
    response := sendRequest()     // blocks on IO
    result := parse(response)
    resultChan <- result          // send the result back
}()
process(<-resultChan)             // receive the result
11
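For reference, a self-contained version of the snippet above that compiles and runs; Result, sendRequest, parse, and process are hypothetical stand-ins invented for this sketch, not part of the talk.

package main

import "fmt"

type Result string

// Hypothetical stand-ins for the helpers used on the slide.
func sendRequest() string      { return "response" } // imagine this blocks on IO
func parse(resp string) Result { return Result("parsed " + resp) }
func process(r Result)         { fmt.Println(r) }

func main() {
    resultChan := make(chan Result) // FIFO queue
    go func() {                     // start a goroutine
        response := sendRequest()   // blocks on IO
        result := parse(response)
        resultChan <- result        // send the result back
    }()
    process(<-resultChan) // receive the result
}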
How can we implement this?
12
Thread per goroutine?
Would work!
13
Thread pool?
Only faster goroutine creation.
But still:
● memory consumption
● performance
● no infinite stacks
14
M:N Threading
[Diagram: many goroutines (G)]
15
M:N Threading
[Diagram: many goroutines (G) multiplexed onto a few threads (M)]
G: cheap, full control
Threads: expensive, less control; they provide actual execution and parallelism
16
Goroutine States
[Diagram: the M:N picture with goroutine states marked: runnable, blocked, and running (on a thread)]
17
Simple M:N Scheduler
[Diagram: the Scheduler: a mutex-protected Run Queue of runnable goroutines (G)]
18
Simple M:N Scheduler
[Diagram: the Scheduler's mutex-protected Run Queue, plus goroutines currently running on threads]
19
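A minimal Go sketch of the simple design above: one global run queue of runnable goroutines guarded by a single mutex. The G type, field names, and methods are illustrative only, not the runtime's.

package sched

import "sync"

// G is an illustrative stand-in for a goroutine descriptor.
type G struct {
    id int
}

// globalSched models the simple M:N scheduler above: one run queue of
// runnable goroutines shared by every thread, protected by one mutex.
type globalSched struct {
    mu       sync.Mutex
    runQueue []*G // FIFO of runnable goroutines
}

// put makes g runnable.
func (s *globalSched) put(g *G) {
    s.mu.Lock()
    s.runQueue = append(s.runQueue, g)
    s.mu.Unlock()
}

// get returns the next runnable goroutine, or nil if there is none.
// Every thread contends on the same lock here, which is why this design
// stops scaling later in the talk.
func (s *globalSched) get() *G {
    s.mu.Lock()
    defer s.mu.Unlock()
    if len(s.runQueue) == 0 {
        return nil
    }
    g := s.runQueue[0]
    s.runQueue = s.runQueue[1:]
    return g
}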
Blocked Goroutines
[Diagram: a Channel with a Wait Queue of blocked goroutines]
20
Blocked Goroutines
[Diagram: a goroutine blocks on the Channel and joins its Wait Queue]
21
Blocked Goroutines
[Diagram: unblocking: a goroutine moves from the Channel's Wait Queue to the Scheduler's Run Queue]
22
Blocked Goroutines
[Diagram: the Channel's Wait Queue of blocked goroutines]
23
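A conceptual sketch of the blocking and unblocking flow on these slides, reusing the illustrative G and globalSched types from the earlier sketch; the real channel implementation (runtime/chan.go) is considerably more involved.

package sched

// waitQueue is an illustrative stand-in for a channel's wait queue of
// blocked goroutines.
type waitQueue struct {
    blocked []*G
}

// block parks g on the channel's wait queue; the thread that was running
// g then goes back to the scheduler and picks another goroutine.
func (q *waitQueue) block(g *G) {
    q.blocked = append(q.blocked, g)
}

// unblock moves one waiter back to the scheduler's run queue, which is
// what a send or receive from the other side of the channel triggers.
func (q *waitQueue) unblock(s *globalSched) {
    if len(q.blocked) == 0 {
        return
    }
    g := q.blocked[0]
    q.blocked = q.blocked[1:]
    s.put(g) // runnable again
}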
System Calls
[Diagram: a Thread enters kernel/C code (a syscall)]
24
System Calls
[Diagram: a Thread enters and later exits kernel/C code]
25
System Calls
[Diagram: four Threads, each running a goroutine, above kernel/C code]
26
System Calls
[Diagram: all four Threads enter syscalls; runnable goroutines are left sitting in the Run Queue]
27
System Calls
[Diagram: one Thread is in a syscall while another enters; goroutines wait in the Run Queue]
28
System Calls
[Diagram: a fifth Thread is started to keep running goroutines while a Thread is in a syscall]
29
System Calls
[Diagram: the syscall exits, leaving an idle Thread; the Run Queue still holds runnable goroutines]
30
#Threads > #Cores
31
√ lightweight goroutines
√ parallel
32
Not Scalable!
[Diagram: the single mutex-protected Run Queue that every running goroutine/thread must go through]
33
lock-free?
34
lock-free?
⛔
35
[Illustration: one worker at one computer]
36
[Illustration: five workers and a single computer]
37
[Illustration: five workers and a single computer]
38
[Illustration: MUTEX as shifts: the workers take turns at the single computer, 8:00-16:00, 16:00-24:00, 0:00-8:00]
39
[Illustration: LOCK-FREE: all five workers at the single computer at once]
40
DISTRIBUTED
[Illustration: each worker gets her own computer]
41
Distributed Scheduler
[Diagram: per-thread state, each with its own local run queue of goroutines]
42
Distributed Scheduler
[Diagram: per-thread state with local run queues, plus the global mutex-protected Scheduler run queue]
43
Distributed Scheduler
[Diagram: per-thread state with local run queues, plus the global Scheduler run queue]
44
Poll Order
Main question: what is the next goroutine to run?
45
Work Stealing
[Diagram: a thread with an empty local run queue steals goroutines from another thread's queue]
46
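A hedged sketch of work stealing over per-thread local run queues, reusing the illustrative G type from the earlier sketches. proc, popLocal, stealFrom, and findRunnable are made-up names; the real runtime adds synchronization, a randomized search order, and several more sources of work.

package sched

import "math/rand"

// proc is an illustrative per-thread scheduler state with a local run queue.
type proc struct {
    local []*G
}

// popLocal takes the next goroutine from this proc's own queue.
func (p *proc) popLocal() *G {
    if len(p.local) == 0 {
        return nil
    }
    g := p.local[0]
    p.local = p.local[1:]
    return g
}

// stealFrom grabs half of a victim's local queue (ignoring the
// synchronization the real runtime needs to make this safe).
func (p *proc) stealFrom(victim *proc) *G {
    n := len(victim.local) / 2
    if n == 0 {
        return nil
    }
    stolen := victim.local[:n]
    victim.local = victim.local[n:]
    p.local = append(p.local, stolen[1:]...)
    return stolen[0]
}

// findRunnable answers the poll-order question: own local queue first,
// then try to steal from randomly chosen victims.
func findRunnable(self *proc, all []*proc) *G {
    if g := self.popLocal(); g != nil {
        return g
    }
    for range all {
        victim := all[rand.Intn(len(all))]
        if victim == self {
            continue
        }
        if g := self.stealFrom(victim); g != nil {
            return g
        }
    }
    return nil // nothing found; the thread would park here
}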
√ lightweight goroutines
√ parallel
√ scalable
47
Threads in syscalls :(
(#threads > #cores)
48
M:P:N Threading
[Diagram: M:P:N threading: N goroutines multiplexed onto P processors, which run on M threads]
49
M:P:N Threading
[Diagram: M:P:N threading: running goroutines each occupy a processor and a thread]
50
M:P:N Threading
[Diagram: M:P:N threading: threads blocked in syscalls do not hold a processor, so other goroutines keep running]
51
Distributed 3-Level Scheduler
"Processor" "Processor"
G G G
in syscall in syscall
Thread Thread Thread Thread
G G G G
52
Syscall handling: Handoff
"Processor"
G G
malloc cache
other caches
Thread
G
53
Syscall handling: Handoff
"Processor"
G G
malloc cache
other caches
in syscall
Thread
G
54
Syscall handling: Handoff
"Processor"
G G
malloc cache
other caches
in syscall
Thread Thread
G
55
Syscall handling: Handoff
"Processor"
G G
malloc cache
other caches
in syscall
Thread Thread
G
56
Syscall handling: Handoff
"Processor"
G G
malloc cache
other caches
in syscall
Thread Thread
G G
57
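A conceptual sketch of the handoff shown above, reusing the illustrative G type; the thread and processor types and the idleProcessors channel are invented for this sketch, and the real runtime's handoff (idle P lists plus a background monitor) is more involved.

package sched

// processor stands in for the "Processor" on the slides: a local run queue
// plus the malloc cache, stack cache, and other caches.
type processor struct {
    runQueue []*G
    // malloc cache, stack cache, other caches ...
}

// thread stands in for an OS thread; p is nil while the thread is blocked
// inside a syscall.
type thread struct {
    p *processor
}

// idleProcessors is a made-up hand-off point for processors that are not
// currently attached to a running thread.
var idleProcessors = make(chan *processor, 128)

// enterSyscall: before blocking in the kernel, the thread releases its
// processor so another thread can keep running its goroutines.
func enterSyscall(t *thread) {
    p := t.p
    t.p = nil
    idleProcessors <- p
}

// exitSyscall: on return, try to reacquire a processor; if none is free,
// the goroutine goes back onto a run queue and this thread goes idle.
func exitSyscall(t *thread) {
    select {
    case p := <-idleProcessors:
        t.p = p
    default:
        // no processor available: park the goroutine, idle the thread
    }
}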
√ lightweight goroutines
√ parallel
√ scalable
√ efficient
58
Fairness
59
Fairness
What: if a goroutine is runnable, it will run eventually.
Why:
60
Fairness
What: if a goroutine is runnable, it will run eventually.
Why:
61
Fair Scheduling
[Diagram: a set of runnable goroutines]
62
Fair Scheduling
[Diagram: the runnable goroutines each get a turn to run]
63
Fairness/Performance Tradeoff
● Single Run Queue does not scale
● FIFO bad for locality
64
Infinite Loops
[Diagram: a goroutine marked "in infinite loop" occupying its thread]
65
Infinite Loops
[Diagram: other runnable goroutines waiting while the loop runs]
66
Infinite Loops
[Diagram: other runnable goroutines waiting while the loop runs]
67
Local Run Queue
[Diagram: the FIFO local run queue holding goroutines]
68
Local Run Queue
[Diagram: the local run queue with one more goroutine added]
69
Local Run Queue
[Diagram: the local run queue]
● better locality
70
Local Run Queue
[Diagram: the local run queue]
● better locality
● restricts stealing (3us)
71
Local Run Queue Starvation
[Diagram: goroutines stuck in the local run queue]
72
Time Slice Inheritance
Solution: the new goroutine inherits the current time slice -> a chain of such goroutines looks like one infinite loop -> it gets preempted (~10ms)
73
Global Run Queue Starvation
[Diagram: goroutines waiting in the global run queue]
74
Global Run Queue Starvation
g = pollLocalRunQueue()
if g != nil {
    return g
}
return pollGlobalRunQueue()
75
Global Run Queue Starvation
schedTick++
if schedTick%61 == 0 {
    g = pollGlobalRunQueue()
    if g != nil {
        return g
    }
}
g = pollLocalRunQueue()
if g != nil {
    return g
}
return pollGlobalRunQueue()
76
Why 61?
It is not even 42! ¯\_(ツ)_/¯
Want something:
● not too small
● not too large
● prime to break any patterns
77
Network Poller Starvation
[Diagram: goroutines waiting in the Network Poller]
78
Network Poller Starvation
[Diagram: goroutines waiting in the Network Poller]
Goroutine - preemption
81
Function Frame
void foo()
{
    ...
    int x = 42;
    ...
    return;
}

● local variables
● return address
● previous frame pointer
82
Thread Stack
[Diagram: the stack, holding main's frame; the stack grows down]
83
Thread Stack
[Diagram: foo's frame pushed on top of main's; the stack grows down]
84
Thread Stack
[Diagram: the stack; it grows down]
85
Thread Stack
[Diagram: foo's frame pushed on top of main's again; the stack grows down]
86
Thread Stack
[Diagram: the stack pointer (RSP) points at the top of foo's frame]
87
Thread Stack
[Diagram: foo's frame holds local/temp variables, the return address, and the previous frame pointer; RSP points at its top]
88
Stack Implementation
[Diagram: the stack laid out as a sequence of 4K pages]
89
Stack is cheap!
foo:
    sub  $64, %RSP       // allocate stack frame of size 64
    ...
    mov  %RAX, 16(%RSP)  // store to a local var
    ...
    add  $64, %RSP       // deallocate stack frame
    retq
90
Paging-based infinite stacks?
● Lazy page-in
● 64-bit Virtual Address Space
91
What is infinite?
92
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
93
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB
94
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB
● Slow "page-out"
95
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB
● Slow "page-out"
● No huge pages (2MB, 1GB)
96
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB
● Slow "page-out"
● No huge pages (2MB, 1GB)
● 32-bit systems
○ ARM
97
Normal stack again
foo:
98
Goroutine stacks
foo:
    mov  %fs:-8, %RCX    // load G descriptor from TLS
    cmp  16(%RCX), %RSP  // compare the stack limit and RSP
    jbe  morestack       // jump to slow-path if not enough stack
    sub  $64, %RSP
    ...
    mov  %RAX, 16(%RSP)
    ...
    add  $64, %RSP
    retq
    ...
morestack:               // call runtime to allocate more stack
    callq <runtime.morestack>
99
Function Prologue
void foo()
{
    if (RSP < TLS_G->stack_limit)
        morestack();
    ...
}
100
Split Stack
[Diagram: a stack segment holding main's frame; limit and RSP marked]
101
Split Stack
[Diagram: foo's frame placed in a separate segment from main's; limit and RSP marked]
102
Split Stack
[Diagram: RSP up against the segment limit]
103
Split Stack
[Diagram: foo and main on their segments; limit and RSP marked]
104
Split Stack
[Diagram: limit and RSP after foo returns]
105
Split Stack
[Diagram: foo is called again and gets a fresh segment; limit and RSP marked]
106
Split Stack Benefits
● 1M goroutines
● works on 32-bits
● good granularity
● cheap "page-out"
● huge pages
107
"Hop Split" Problem :(
108
Important Performance Characteristics
1. Transparent
2. Stable
109
Growable Stack
[Diagram: a 1KB stack holding main's frame; limit and RSP marked]
110
Growable Stack
[Diagram: the 1KB stack with foo's frame on top of main's]
111
Growable Stack
[Diagram: RSP reaches the 1KB stack's limit]
112
Growable Stack
[Diagram: a bigger stack is allocated and the foo/main frames are copied over]
113
Growable Stack
[Diagram: the 2KB stack now holds foo and main; limit and RSP marked]
114
Growable Stack
[Diagram: the 2KB stack; limit and RSP marked]
115
Growable Stack
[Diagram: execution continues on the 2KB stack with foo and main]
116
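A small runnable illustration of the growable stack: a deep call chain forces the goroutine's initially tiny stack to be grown (copied to a larger one) by the runtime, completely transparently. The recursion depth and padding size below are arbitrary choices for this sketch.

package main

import "fmt"

// deep recurses n times; each frame's local array takes up stack space,
// so a deep chain pushes the goroutine well past its initial stack size
// and the runtime grows the stack as needed.
func deep(n int) int {
    var pad [128]byte // frame-local data
    pad[0] = byte(n)
    if n == 0 {
        return int(pad[0])
    }
    return deep(n-1) + int(pad[0])
}

func main() {
    done := make(chan int)
    go func() { done <- deep(100000) }() // this goroutine's stack grows transparently
    fmt.Println(<-done)
}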
Stack Performance
Split Stack
117
Stack Performance
Split Stack:
● O(1) cost per function call
● repeated
● worst case: a stack split in a hot loop
Growable Stack:
● O(N) cost per function call
● amortized
● worst case: growing the stack for a short goroutine
118
Stack Performance
Split Stack:
● O(1) cost per function call
● repeated
● worst case: a stack split in a hot loop
Growable Stack:
● O(N) cost per function call
● amortized
● worst case: growing the stack for a short goroutine
119
Stack Cache
"Processor"
G G
malloc cache
stack cache
other caches
120
Interesting Fact
121
Preemption
What: Asynchronously asking a goroutine to yield.
Why:
122
Preemption
What: Asynchronously asking a goroutine to yield.
Why:
123
Implementation strategy
Signals:
+ Fast
124
Implementation strategy
Signals:
+ Fast
- OS-dependent
- non-preemptible regions
- GC stack/register maps
⛔
125
Implementation strategy
Signals:
+ Fast
- OS-dependent
- non-preemptible regions
- GC stack/register maps
⛔
Cooperative checks:
+ OS-independent
+ non-preemptible regions
+ GC stack/register maps
126
Implementation strategy
Signals:
+ Fast
- OS-dependent
- non-preemptible regions
- GC stack/register maps
⛔
Cooperative checks:
+ OS-independent
+ non-preemptible regions
+ GC stack/register maps
- Slow (1-10%)
⛔
127
Function Prologue
foo:
    mov  %fs:-8, %RCX    // load G descriptor from TLS
    cmp  16(%RCX), %RSP  // compare the stack limit and RSP
    jbe  morestack       // jump to slow-path if not enough stack
    ...
128
Spoof stack limit!
G->stackLimit = 0xfffffffffffffade
129
Function Prologue
foo:
    mov  %fs:-8, %RCX
    cmp  16(%RCX), %RSP  // guaranteed to fail!
    jbe  morestack
    ...
130
Advantages
+ fast
+ portable
+ simple
+ GC-friendly
131
Advantages
+ fast
+ portable
+ simple
+ GC-friendly
- loops
132
Recap
√ lightweight goroutines
√ handling of IO and syscalls
√ parallel
√ scalable
√ efficient
√ fair
√ infinite stacks
√ preemptible*
133
Thank you!
Q&A