
Go scheduler

Implementing a language with lightweight concurrency

Dmitry Vyukov, [email protected]


Hydra conf, July 12 2019 1
Agenda
● Go specifics
● Scheduler
● Scalability
● Fairness
● Stacks
● Future

2
What is a goroutine?
Logically a thread of execution.

Logically the same as:


● OS thread
● coroutine
● green thread

3
Most material is generic
... and to a large degree applicable to:

● OS thread schedulers
● Coroutine schedulers
● Thread pools
● Other languages

4
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints

5
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)

6
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable

7
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable
● minimal API (no hints)

8
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable
● minimal API (no hints)
● infinite stack

9
Go specifics:
1. Current Go design decisions
2. Go requirements and constraints:
● goroutines are lightweight (1M)
● parallel and scalable
● minimal API (no hints)
● infinite stack
● handling of IO, syscalls, C calls

10
A taste of Go
resultChan := make(chan Result)   // FIFO queue
go func() {                       // start a goroutine
    response := sendRequest()     // blocks on IO
    result := parse(response)
    resultChan <- result          // send the result back
}()
process(<-resultChan)             // receive the result

11
How can we implement this?

12
Thread per goroutine?
Would work!

But too expensive:

● memory (at least 32K or so)


● performance (syscalls)
● no infinite stacks

13
Thread pool?
Only makes goroutine creation faster.

But still:
● memory consumption
● performance
● no infinite stacks

14
M:N Threading

[diagram: many goroutines (G) multiplexed over a few threads]

15
M:N Threading

[diagram: many goroutines (G) multiplexed over a few threads]

G (goroutine):
● cheap
● full control

M (thread):
● expensive
● less control
● actual execution
● parallelism

16
Goroutine States
[diagram: goroutines over threads, now labeled with states]

States: runnable, running, blocked

G (goroutine): cheap, full control
M (thread): expensive, less control, actual execution, parallelism

17
Simple M:N Scheduler
Scheduler
G G G G Runnable Goroutines (Run Queue)

MUTEX

18
Simple M:N Scheduler
Scheduler
G G G G Runnable Goroutines (Run Queue)

MUTEX

Thread Thread Thread

G G G Running Goroutines
19
Blocked Goroutines

Channel

Wait Queue G G G

20
Blocked Goroutines

Channel blocking

Wait Queue G G G G

21
Blocked Goroutines

Channel

Wait Queue G G G G

unblocking
Scheduler Run Queue

G G G G

22
Blocked Goroutines

Channel

Wait Queue G G G G

The same mechanism for:


● Mutexes unblocking
● Timers Scheduler Run Queue
● Network IO
G G G G

23
System Calls
Thread

enter

Kernel/C code
24
System Calls
Thread

enter exit

Kernel/C code
25
System Calls
Thread Thread Thread Thread

G G G G

enter enter enter enter

Kernel/C code
26
System Calls
Thread Thread Thread Thread

G G G G

Run Queue

G G
enter enter enter enter

Kernel/C code
27
System Calls
in syscall
Thread Thread Thread Thread

G G G G

Run Queue

G G
enter

Kernel/C code
28
System Calls
in syscall
Thread Thread Thread Thread Thread

G G G G G

Run Queue

G G
enter

Kernel/C code
29
System Calls
idle
Thread Thread Thread Thread Thread

G G G G G

Run Queue

G G G
exit

Kernel/C code
30
#Threads > #Cores

31
√ lightweight goroutines

√ handling of IO and syscalls

√ parallel

32
Not Scalable!
Scheduler
G G G G Runnable Goroutines (Run Queue)

MUTEX

Thread Thread Thread

G G G
33
lock-free?

34
lock-free?


35
👩 🖥

36
👩
👩
👩 🖥
👩
👩
37
👩
👩
👩 🖥
👩
👩
38
Shifts:
8:00 - 16:00
👩
16:00 - 24:00 👩 MUTEX

0:00 - 8:00
👩 🖥
👩
👩
39
👩
👩 LOCK-FREE

👩 🖥
👩
👩
40
DISTRIBUTED

👩 🖥
👩 🖥
👩 🖥
👩 🖥
👩 🖥
41
Distributed Scheduler
Per-thread state Per-thread state Per-thread state

G G G G G

Thread Thread Thread

G G G
42
Distributed Scheduler
Per-thread state Per-thread state Per-thread state Scheduler

G G G G G G G

MUTEX

Thread Thread Thread

G G G
43
Distributed Scheduler
Per-thread state Per-thread state Per-thread state Scheduler

G G G G G G G

malloc cache malloc cache malloc cache

other caches other caches other caches


MUTEX

Thread Thread Thread

G G G
44
Poll Order
Main question: what is the next goroutine to run?

1. Local Run Queue


2. Global Run Queue
3. Network Poller
4. Work Stealing

45
Work Stealing

Per-thread state Per-thread state Per-thread state Per-thread state

G G G G G

Thread Thread Thread Thread

46
√ lightweight goroutines

√ handling of IO and syscalls

√ parallel

√ scalable

47
Threads in syscalls :(
(#threads > #cores)

48
M:P:N Threading
[diagram: goroutines (N) are scheduled onto processors (P), which are multiplexed over threads (M)]

49
M:P:N Threading
[diagram: same M:P:N picture; a goroutine is "running" when its P is attached to a thread]

P (processor): a resource required to run Go code

50
M:P:N Threading
[diagram: threads blocked in syscalls are detached from their P; each P moves to another thread and keeps running goroutines]
51
Distributed 3-Level Scheduler
"Processor" "Processor"

G G G

malloc cache malloc cache

other caches other caches

in syscall in syscall
Thread Thread Thread Thread

G G G G
52
Syscall handling: Handoff
"Processor"

G G

malloc cache

other caches

Thread

G
53
Syscall handling: Handoff
"Processor"

G G

malloc cache

other caches

in syscall
Thread

G
54
Syscall handling: Handoff
"Processor"

G G

malloc cache

other caches

in syscall
Thread Thread

G
55
Syscall handling: Handoff
"Processor"

G G

malloc cache

other caches

in syscall
Thread Thread

G
56
Syscall handling: Handoff
"Processor"

G G

malloc cache

other caches

in syscall
Thread Thread

G G
57
√ lightweight goroutines

√ handling of IO and syscalls

√ parallel

√ scalable

√ efficient

58
Fairness

59
Fairness
What: if a goroutine is runnable, it will run eventually.

Why:

● bad tail latencies


● livelocks
● pathological behaviors

60
Fairness
What: if a goroutine is runnable, it will run eventually.

Why:

● bad tail latencies


● livelocks
● pathological behaviors

Fairness is like Oxygen

61
Fair Scheduling

Fair: FIFO Run Queue

G G G G

62
Fair Scheduling

Fair: FIFO Run Queue

G G G G

Not Fair: LIFO Run Queue

G G G G

63
Fairness/Performance Tradeoff
● Single Run Queue does not scale
● FIFO bad for locality

Want a minimal amount of fairness!

64
Infinite Loops

in infinite loop

65
Infinite Loops

in infinite loop starved

G G

66
Infinite Loops

in infinite loop starved

G G

Solution: preemption (~10ms)

67
Local Run Queue

FIFO

G G G G

68
Local Run Queue

FIFO 1-element LIFO buffer

G G G G G

69
Local Run Queue

FIFO 1-element LIFO buffer

G G G G G

● better locality

70
Local Run Queue

FIFO 1-element LIFO buffer

G G G G G

● better locality
● restricts stealing (the buffered goroutine cannot be stolen for ~3µs)

71
Local Run Queue Starvation

FIFO 1-element LIFO buffer


G

G G G G G

72
Time Slice Inheritance

1-element LIFO buffer


G

Solution: inherit time slice -> looks like infinite loop -> preemption (~10ms)

73
Global Run Queue Starvation

Local Run Queue

G G

Global Run Queue

74
Global Run Queue Starvation

g = pollLocalRunQueue()
if g != nil {
    return g
}
return pollGlobalRunQueue()
75
Global Run Queue Starvation
schedTick++
if schedTick%61 == 0 {
    g = pollGlobalRunQueue()
    if g != nil {
        return g
    }
}
g = pollLocalRunQueue()
if g != nil {
    return g
}
return pollGlobalRunQueue()
76
Why 61?
It is not even 42! ¯\_(ツ)_/¯

Want something:
● not too small
● not too large
● prime to break any patterns

77
Network Poller Starvation

Global/Local Run Queue

G G

Network Poller

78
Network Poller Starvation

Global/Local Run Queue

G G

Network Poller

Solution: a background thread polls the network occasionally


79
Fairness Hierarchy

Goroutine - preemption

Local Run Queue - time slice inheritance

Global Run Queue - check once in a while

Network Poller - background thread

= minimal fairness at minimal cost


80
Stacks

81
Function Frame
void foo()
{
    ...
    int x = 42;
    ...
    return;
}

A frame holds:
● local variables
● return address
● previous frame pointer

82
Thread Stack

Stack

main

grows down

83
Thread Stack

Stack

foo main

grows down

84
Thread Stack

Stack

bar foo main

grows down

85
Thread Stack

Stack

foo main

grows down

86
Thread Stack
stack pointer (RSP)

Stack

foo main

87
Thread Stack
stack pointer (RSP)

Stack

foo main

return prev
local/temp variables
address frame ptr

88
Stack Implementation

Stack (1-8 MB), divided into pages (4K each)

Each page is either protected (guard), not paged-in, or paged-in

89
Stack is cheap!
foo:
    sub  $64, %RSP        // allocate stack frame of size 64
    ...
    mov  %RAX, 16(%RSP)   // store to a local var
    ...
    add  $64, %RSP        // deallocate stack frame
    retq

90
Paging-based infinite stacks?
● Lazy page-in
● 64-bit Virtual Address Space

Can we build "infinite" stacks based on this?

91
What is infinite?

1GB is "infinite" enough

92
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks

93
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB

94
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB
● Slow "page-out"

95
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB
● Slow "page-out"
● No huge pages (2MB, 1GB)

96
Paging won't work :(
● Not enough Address Space
○ 48 bits address space
○ 1 bit for kernel = 47 bits = 128TB
○ max 128K stacks
● Bad granularity
○ 4KB x 1M = 4GB
● Slow "page-out"
● No huge pages (2MB, 1GB)
● 32-bit systems
○ ARM

97
Normal stack again
foo:
    sub  $64, %RSP
    ...
    mov  %RAX, 16(%RSP)
    ...
    add  $64, %RSP
    retq

98
Goroutine stacks
foo:
    mov  %fs:-8, %RCX     // load G descriptor from TLS
    cmp  16(%RCX), %RSP   // compare the stack limit and RSP
    jbe  morestack        // jump to slow path if not enough stack
    sub  $64, %RSP
    ...
    mov  %RAX, 16(%RSP)
    ...
    add  $64, %RSP
    retq
...
morestack:                // call runtime to allocate more stack
    callq <runtime.morestack>
99
Function Prologue
void foo()
{
    if (RSP < TLS_G->stack_limit)
        morestack();
    ...
}

100
Split Stack

Stack segment (1KB)

main

limit RSP

101
Split Stack

Stack segment (1KB)

foo main

limit RSP

102
Split Stack

Stack segment (1KB)

bar foo main

RSP limit

103
Split Stack

Stack segment (1KB) Stack segment (1KB)

foo main

limit RSP

104
Split Stack

Stack segment (1KB) Stack segment (1KB)

bar foo main

limit RSP

105
Split Stack

Stack segment (1KB)

foo main

limit RSP

106
Split Stack Benefits
● 1M goroutines
● works on 32-bits
● good granularity
● cheap "page-out"
● huge pages

107
"Hot Split" Problem :(

for ... { // hot loop

foo() // causes stack split


}

108
Important Performance Characteristics
1. Transparent
2. Stable

The "Hot Split" problem fails both.

109
Growable Stack

Stack (1KB)

main

limit RSP

110
Growable Stack

Stack (1KB)

foo main

limit RSP

111
Growable Stack

Stack (1KB)

bar foo main

RSP limit

112
Growable Stack

Stack (1KB)

foo main

New Stack (2KB) COPY

foo main

limit RSP

113
Growable Stack

Stack (2KB)

foo main

limit RSP

114
Growable Stack

Stack (2KB)

bar foo main

limit RSP

115
Growable Stack

Stack (2KB)

foo main

limit RSP

116
Stack Performance
Split Stack

● O(1) cost per function call


● repeated

Worst case: stack split in hot loop

117
Stack Performance
Split Stack Growable Stack

● O(1) cost per function call ● O(N) cost per function call
● repeated ● amortized

Worst case: stack split in hot loop Worst case: growing stack for short goroutine

118
Stack Performance
Split Stack Growable Stack

● O(1) cost per function call ● O(N) cost per function call
● repeated ● amortized

Worst case: stack split in hot loop Worst case: growing stack for short goroutine

Penalizing a cheap operation a bit < penalizing an expensive operation significantly

119
Stack Cache

"Processor"

G G

malloc cache

stack cache

other caches

120
Interesting Fact

Split stacks are in gcc:

$ gcc -fsplit-stack prog.c

121
Preemption
What: Asynchronously asking a goroutine to yield.

Why:

● multiplexing multiple goroutines


● auxiliary functions (GC, crashes)

122
Preemption
What: Asynchronously asking a goroutine to yield.

Why:

● multiplexing multiple goroutines


● auxiliary functions (GC, crashes)

Preemption is also like Oxygen

123
Implementation strategy
Signals:

+ Fast

124
Implementation strategy
Signals:

+ Fast
- OS-dependent
- non-preemptible regions
- GC stack/register maps


125
Implementation strategy
Signals: Cooperative checks:

+ Fast + OS-independent
- OS-dependent + non-preemptible regions
- non-preemptible regions + GC stack/register maps
- GC stack/register maps


126
Implementation strategy
Signals: Cooperative checks:

+ Fast + OS-independent
- OS-dependent + non-preemptible regions
- non-preemptible regions + GC stack/register maps
- GC stack/register maps - Slow (1-10%)

⛔ ⛔
127
Function Prologue
foo:
    mov  %fs:-8, %RCX     // load G descriptor from TLS
    cmp  16(%RCX), %RSP   // compare the stack limit and RSP
    jbe  morestack        // jump to slow path if not enough stack
    ...

128
Spoof stack limit!

G->stackLimit = 0xfffffffffffffade

129
Function Prologue
foo:
    mov  %fs:-8, %RCX
    cmp  16(%RCX), %RSP   // guaranteed to fail!
    jbe  morestack
    ...

130
Advantages
+ fast
+ portable
+ simple
+ GC-friendly

131
Advantages
+ fast
+ portable
+ simple
+ GC-friendly
- loops

132
Recap
√ lightweight goroutines
√ handling of IO and syscalls
√ parallel
√ scalable
√ efficient
√ fair
√ infinite stacks
√ preemptible*

133
Thank you!

Q&A

Dmitry Vyukov, [email protected]


134
