
runtime: make the page allocator scale #35112

Closed
mknyszek opened this issue Oct 23, 2019 · 42 comments

Comments

@mknyszek
Contributor

mknyszek commented Oct 23, 2019

Over the course of the last year, we've seen several cases where making relatively minor changes to the allocator slow path, which allocates pages from the heap, caused serious performance issues (#32828, #31678). The problem stemmed largely from contention on the heap lock: a central m-based spinlock (so we can't schedule another g while it's waiting!) which guards nearly all operations on the page heap. Since Go 1.11 (and likely earlier), we've observed barging behavior on this lock in applications which allocate larger objects (~1K) frequently, which indicates a collapsing lock. Furthermore, these applications tend to stay the same or worsen in performance as the number of available cores on the machine increases. This represents a fundamental scalability bottleneck in the runtime.

Currently the page allocator is based on a treap (and formerly had a small linked-list-based cache for smaller free spans), but I propose we rethink it and rewrite it as something that:

  1. Is more cache-friendly, branch-predictor friendly, etc. on the whole.
  2. Has a lockless fast path.

The former just makes the page allocator faster and less likely to fall into bad paths in the microarchitecture, while the latter directly reduces contention on the lock.

While increased performance across the board is what we want, what we're most interested in solving here is the scalability bottleneck: when we increase the number of cores available to a Go application, we want to see a performance improvement.

Edit: Design doc

I've already built both an out-of-tree and an in-tree prototype of a solution based on a bitmap representing the whole heap and a radix tree over that bitmap. The key idea with bitmaps here is that we can "land grab" several pages at once and cache them in a P. Two RPC benchmarks, a Google-internal one and one based on Tile38, show up to a 30% increase in throughput and up to a 40% drop in tail latencies (90th, 99th percentile) on a 48-core machine, with the benefit increasing with the number of cores available to the benchmark.
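
To make the "land grab" idea concrete, here is a minimal sketch under assumed names and sizes (`chunkBits`, `pagesPerChunk`, and the one-bit-per-page layout are illustrative, not the prototype's actual structures): the allocator claims a whole aligned 64-page group from a shared bitmap in one step, and the owning P then serves single-page allocations from that group without any synchronization.

```go
package pagealloc

import "math/bits"

// One bit per page; 1 = in use. The chunk size is illustrative.
const pagesPerChunk = 512

type chunkBits [pagesPerChunk / 64]uint64

// grab claims an entire aligned 64-page group that contains at least
// one free page (the "land grab"), returning the group's base page
// index and a bitmap of the pages in it that were actually free.
func (c *chunkBits) grab() (base uint, free uint64, ok bool) {
	for i, word := range c {
		if word != ^uint64(0) { // a zero bit means a free page exists
			free = ^word      // the free pages within this group
			c[i] = ^uint64(0) // mark the whole group as allocated
			return uint(i) * 64, free, true
		}
	}
	return 0, 0, false
}

// allocPage hands out one page from a grabbed group. The group is
// owned by a single P, so no lock or atomics are needed here.
func allocPage(base uint, free *uint64) (page uint, ok bool) {
	if *free == 0 {
		return 0, false
	}
	i := uint(bits.TrailingZeros64(*free))
	*free &^= 1 << i
	return base + i, true
}
```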

@mknyszek mknyszek added this to the Go1.14 milestone Oct 23, 2019
@mknyszek mknyszek self-assigned this Oct 23, 2019
@gopherbot
Contributor

Change https://golang.org/cl/200439 mentions this issue: runtime: place lower limit on trigger ratio

@gopherbot
Contributor

Change https://golang.org/cl/195698 mentions this issue: runtime: count scavenged bits for new allocation for new page allocator

@gopherbot
Contributor

Change https://golang.org/cl/190621 mentions this issue: runtime: add packed bitmap summaries

@gopherbot
Contributor

Change https://golang.org/cl/196640 mentions this issue: runtime: make more page sweeper operations atomic

@gopherbot
Contributor

Change https://golang.org/cl/195697 mentions this issue: runtime: add scavenging code for new page allocator

@gopherbot
Contributor

Change https://golang.org/cl/196643 mentions this issue: runtime: add page cache and tests

@gopherbot
Contributor

Change https://golang.org/cl/190622 mentions this issue: runtime: add new page allocator core

@gopherbot
Contributor

Change https://golang.org/cl/190619 mentions this issue: runtime: add new page allocator constants and description

@gopherbot
Contributor

Change https://golang.org/cl/196642 mentions this issue: runtime: add per-p mspan cache

@gopherbot
Contributor

Change https://golang.org/cl/201763 mentions this issue: runtime: make the scavenger self-paced

@gopherbot
Contributor

Change https://golang.org/cl/190620 mentions this issue: runtime: add mallocbits and tests

@gopherbot
Contributor

Change https://golang.org/cl/196639 mentions this issue: runtime: remove unnecessary large parameter to mheap_.alloc

@gopherbot
Contributor

Change https://golang.org/cl/195701 mentions this issue: runtime: add per-p page allocation cache

@gopherbot
Contributor

Change https://golang.org/cl/201764 mentions this issue: runtime: integrate new page allocator into runtime

@gopherbot
Contributor

Change https://golang.org/cl/195700 mentions this issue: runtime: remove old page allocator

@gopherbot
Contributor

Change https://golang.org/cl/195699 mentions this issue: runtime: add option to scavenge with lock held throughout

@gopherbot
Contributor

Change https://golang.org/cl/201765 mentions this issue: runtime: switch to new page allocator

@gopherbot
Contributor

Change https://golang.org/cl/196638 mentions this issue: runtime: ensure heap memstats are updated atomically

@gopherbot
Contributor

Change https://golang.org/cl/196641 mentions this issue: runtime: refactor mheap_.alloc* into allocSpan

@mknyszek
Contributor Author

Gopherbot isn't updating the issue with this, but the design doc is at https://go-review.googlesource.com/c/proposal/+/202857. I'll update the first post in the issue once it lands.

CC: @aclements @randall77 @cherrymui

@gopherbot
Contributor

Change https://golang.org/cl/203318 mentions this issue: runtime: fix (*gcSweepBuf).block guarantees

@gopherbot
Contributor

Change https://golang.org/cl/203858 mentions this issue: runtime: compute whether a span needs zeroing in the new page allocator

@gopherbot
Contributor

Change https://golang.org/cl/203859 mentions this issue: runtime: make allocNeedsZero lock-free

@un000

un000 commented Oct 29, 2019

#31222 may be related. Could you test the new allocator with the reproducer from that issue?

@mknyszek
Contributor Author

@un000 It's possible but unlikely this will help.

Let's assume that the problem in that issue is entirely that a long allocation blocks STW for 43 ms. Firstly, 100 MB allocations don't really get any faster with this proposal. In that issue, asking for 100 MB is definitely going to go to the OS and map pages in. No matter what, we're bound by how quickly we can get 100 MB worth of committed address space from the OS.

@mknyszek
Contributor Author

mknyszek commented Oct 29, 2019

@un000 Well, it helps maxWait at least and brings it down to 1.0s vs. 1.3s on my machine, so perhaps I spoke too soon.

But I'm hitting stalls trying to copy the traces to verify whether there's an improvement in the STW delay. I'll try again later and let you know what I find.

gopherbot pushed a commit that referenced this issue Nov 8, 2019
This change defines a maximum supported physical and huge page size in
the runtime based on the new page allocator's implementation, and uses
them where appropriate.

Furthermore, if the system exceeds the maximum supported huge page
size, we simply ignore it silently.

It also fixes a huge-page-related test which is only triggered by a
condition that is definitely wrong.

Finally, it adds a few TODOs related to code clean-up and supporting
larger huge page sizes.

Updates #35112.
Fixes #35431.

Change-Id: Ie4348afb6bf047cce2c1433576d1514720d8230f
Reviewed-on: https://go-review.googlesource.com/c/go/+/205937
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Cherry Zhang <[email protected]>
gopherbot pushed a commit that referenced this issue Nov 8, 2019
mheap_.alloc currently accepts both a spanClass and a "large" parameter
indicating whether the allocation is large. These are redundant, since
spanClass.sizeclass() == 0 is an equivalent way to determine this and is
already used in mheap_.alloc. There are no places in the runtime where
the size class could be non-zero and large == true.

Updates #35112.

Change-Id: Ie66facf8f0faca6f4cd3d20a8ac4bc259e11823d
Reviewed-on: https://go-review.googlesource.com/c/go/+/196639
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
gopherbot pushed a commit that referenced this issue Nov 8, 2019
This change makes it so that allocation and free related page sweeper
metadata operations (e.g. pageInUse and pagesInUse) are atomic rather
than protected by the heap lock. This will help in reducing the length
of the critical path with the heap lock held in future changes.

Updates #35112.

Change-Id: Ie82bff024204dd17c4c671af63350a7a41add354
Reviewed-on: https://go-review.googlesource.com/c/go/+/196640
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
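
For illustration, a minimal sketch of the pattern this commit describes, with hypothetical names (the runtime's actual fields and helpers differ): a per-page bit updated via a CAS loop instead of under the heap lock.

```go
package sweep

import "sync/atomic"

// pageBits is one "in use" bit per page. set flips a bit with a CAS
// loop so allocation and sweeping can update the bitmap concurrently
// without holding the heap lock.
type pageBits []uint32

func (b pageBits) set(i uint) {
	word := &b[i/32]
	mask := uint32(1) << (i % 32)
	for {
		old := atomic.LoadUint32(word)
		if old&mask != 0 || atomic.CompareAndSwapUint32(word, old, old|mask) {
			return
		}
	}
}
```
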
gopherbot pushed a commit that referenced this issue Nov 8, 2019
Currently gcSweepBuf guarantees that push operations may be performed
concurrently with each other and that block operations may be performed
concurrently with push operations as well.

Unfortunately, this isn't quite true. The existing code allows push
operations to happen concurrently with each other, but block operations
may return blocks with nil entries. The way this can happen is if two
concurrent pushers grab a slot to push to, and the first one (the one
with the earlier slot in the buffer) hasn't yet written a span value
when block is called. The existing code in block only checks if the
very last value in the block is nil, when really an arbitrary number of
the last few values in the block may or may not be nil.

Today, this case can't actually happen because when push operations
happen concurrently during a GC (which is the only time block is
called), they only ever happen during an allocation with the heap lock
held, effectively serializing them. A block operation may happen
concurrently with one of these pushes, but its callers will never see a
nil mspan. Outside of a GC, this isn't a problem because although push
operations from allocations can run concurrently with push operations
from sweeping, block operations will never run.

In essence, the real concurrency guarantees provided by gcSweepBuf are
that block operations may happen concurrently with push operations, but
that push operations may not be concurrent with each other if there are
any block operations.

To fix this, and to prepare for push operations happening without the
heap lock held in a future CL, we update the documentation for block to
correctly state that there may be nil entries in the returned slice.
While we're here, make the mspan writes into the buffer atomic to avoid
a block user racing on a nil check, and document that the user should
load mspan values from the returned slice atomically. Finally, we make
all callers of block adhere to the new rules.

We choose to allow nil values rather than filter them out because the
only caller of block is markrootSpans, and if it catches a nil entry,
then there wasn't anything to mark in there anyway since the span is
just being created.

Updates #35112.

Change-Id: I6450aab15f51690d7a000ba5b3d529cf2ca5da1e
Reviewed-on: https://go-review.googlesource.com/c/go/+/203318
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
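
A small sketch of the new rules, using stand-in types rather than the runtime's actual gcSweepBuf: pushers store pointers atomically, and readers load atomically and skip nil entries.

```go
package sweepbuf

import (
	"sync/atomic"
	"unsafe"
)

type mspan struct{ id int } // stand-in for the runtime's mspan

// push publishes s into its claimed slot with an atomic store, so a
// concurrent reader sees either nil or a fully written pointer.
func push(slot *unsafe.Pointer, s *mspan) {
	atomic.StorePointer(slot, unsafe.Pointer(s))
}

// scan walks a block, loading entries atomically and tolerating nil
// slots claimed by pushers that have not yet stored into them.
func scan(block []unsafe.Pointer, mark func(*mspan)) {
	for i := range block {
		p := atomic.LoadPointer(&block[i])
		if p == nil {
			continue // in-flight push; the span is still being created
		}
		mark((*mspan)(p))
	}
}
```
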
gopherbot pushed a commit that referenced this issue Nov 8, 2019
This change combines the functionality of allocSpanLocked, allocManual,
and alloc_m into a new method called allocSpan. While these methods'
abstraction boundaries are OK when the heap lock is held throughout,
they start to break down when we want finer-grained locking in the page
allocator.

allocSpan does just that, and only locks the heap when it absolutely has
to. Piggy-backing off of work in previous CLs to make more of span
initialization lockless, this change makes span initialization entirely
lockless as part of the reorganization.

Ultimately this change will enable us to add a lockless fast path to
allocSpan.

Updates #35112.

Change-Id: I99875939d75fb4e958a67ac99e4a7cda44f06864
Reviewed-on: https://go-review.googlesource.com/c/go/+/196641
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
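
A sketch of the control flow this commit describes, with cacheAlloc, lockedAlloc, and initSpan as illustrative stubs rather than the real runtime functions:

```go
package heap

import "sync"

type mspan struct{ npages uintptr }

var heapLock sync.Mutex

// allocSpan: try the per-P cache without any lock, take the heap lock
// only for the shared page-allocator work, and initialize the span
// outside the lock in both paths.
func allocSpan(npages uintptr) *mspan {
	if s := cacheAlloc(npages); s != nil {
		initSpan(s) // lockless initialization
		return s
	}
	heapLock.Lock()
	s := lockedAlloc(npages) // only this step touches shared structures
	heapLock.Unlock()
	if s != nil {
		initSpan(s) // still outside the lock
	}
	return s
}

// Illustrative stubs for the per-P cache and the locked slow path.
func cacheAlloc(npages uintptr) *mspan  { return nil }
func lockedAlloc(npages uintptr) *mspan { return &mspan{npages: npages} }
func initSpan(s *mspan)                 {}
```
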
gopherbot pushed a commit that referenced this issue Nov 8, 2019
This change adds a per-p mspan object cache similar to the sudog cache.
Unfortunately this cache can't quite operate like the sudog cache, since
it is used in contexts where write barriers are disallowed (i.e.
allocation codepaths), so rather than managing an array and a slice,
it's just an array and a length. A little bit more unsafe, but avoids
any write barriers.

The purpose of this change is to reduce the number of operations which
require the heap lock in allocation, paving the way for a lockless fast
path.

Updates #35112.

Change-Id: I32cfdcd8528fb7be985640e4f3a13cb98ffb7865
Reviewed-on: https://go-review.googlesource.com/c/go/+/196642
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
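
A sketch of the array-plus-length cache in ordinary Go (names and the capacity are illustrative; see the comment for what the real runtime must additionally guarantee):

```go
package spancache

type mspan struct{ npages uintptr }

// spanCache is the array-plus-length shape: a fixed array of cached
// mspan objects and a count of valid entries. Plain Go is shown; the
// runtime version must additionally ensure these pointer writes carry
// no compiler-inserted write barriers, since it runs on allocation
// paths where write barriers are disallowed.
type spanCache struct {
	spans [32]*mspan // capacity is illustrative
	n     int        // number of valid entries at the front of spans
}

func (c *spanCache) get() *mspan {
	if c.n == 0 {
		return nil // empty: caller refills from the central pool
	}
	c.n--
	s := c.spans[c.n]
	c.spans[c.n] = nil
	return s
}

func (c *spanCache) put(s *mspan) bool {
	if c.n == len(c.spans) {
		return false // full: caller returns s to the central pool
	}
	c.spans[c.n] = s
	c.n++
	return true
}
```
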
gopherbot pushed a commit that referenced this issue Nov 8, 2019
This change adds a page cache structure which owns a chunk of free pages
at a given base address. It also adds code to allocate to this cache
from the page allocator. Finally, it adds tests for both.

Notably this change does not yet integrate the code into the runtime,
just into runtime tests.

Updates #35112.

Change-Id: Ibe121498d5c3be40390fab58a3816295601670df
Reviewed-on: https://go-review.googlesource.com/c/go/+/196643
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
gopherbot pushed a commit that referenced this issue Nov 8, 2019
This change adds a per-p free page cache which the page allocator may
allocate out of without a lock. The change also introduces a completely
lockless page allocator fast path.

Although the cache contains at most 64 pages (and usually fewer), the
vast majority (85%+) of page allocations are exactly 1 page in size.

Updates #35112.

Change-Id: I170bf0a9375873e7e3230845eb1df7e5cf741b78
Reviewed-on: https://go-review.googlesource.com/c/go/+/195701
Run-TryBot: Michael Knyszek <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
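
A sketch of what such a per-P cache might look like, with assumed names and an assumed 8 KiB page size:

```go
package pagecache

import "math/bits"

const pageSize = 8192 // assumed page size, for illustration

// pageCache is a 64-page aligned region owned by one P: a base address
// plus one free bit per page. Only the owning P touches it, so alloc
// needs neither a lock nor atomics.
type pageCache struct {
	base  uintptr // base address of the region
	cache uint64  // 1 bit per page; 1 = free
}

func (c *pageCache) alloc() (uintptr, bool) {
	if c.cache == 0 {
		return 0, false // empty: refill from the page allocator (slow path)
	}
	i := uintptr(bits.TrailingZeros64(c.cache))
	c.cache &^= 1 << i
	return c.base + i*pageSize, true
}
```
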
@gopherbot
Contributor

Change https://golang.org/cl/206199 mentions this issue: runtime: copy some functions from math/bits to runtime/internal/sys

gopherbot pushed a commit that referenced this issue Nov 8, 2019
CL 201765 activated calls from the runtime to functions in math/bits.
When coverage and race detection were simultaneously enabled,
this caused a crash when the covered+race-checked code in
math/bits was called from the runtime before there was even a P.

PS Win for gdlv in helping sort this out.

TODO - next CL intrinsifies the new functions in
runtime/internal/sys

TODO/Would-be-nice - Ctz64 and TrailingZeros64 are the same
function; 386.s is intrinsified; clean all that up.

Fixes #35461.
Updates #35112.

Change-Id: I750a54dba493130ad3e68a06530ede7687d41e1d
Reviewed-on: https://go-review.googlesource.com/c/go/+/206199
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
@gopherbot
Contributor

Change https://golang.org/cl/206277 mentions this issue: runtime: fix min/max logic in findScavengeCandidate

@gopherbot
Contributor

Change https://golang.org/cl/206200 mentions this issue: cmd/compile: intrinsify functions added to runtime/internal/sys

gopherbot pushed a commit that referenced this issue Nov 9, 2019
This restores intrinsic status to functions copied from math/bits
into runtime/internal/sys, as an aid to runtime performance.

Updates #35112.

Change-Id: I41a7d87cf00f1e64d82aa95c5b1000bc128de820
Reviewed-on: https://go-review.googlesource.com/c/go/+/206200
Run-TryBot: David Chase <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Cherry Zhang <[email protected]>
gopherbot pushed a commit that referenced this issue Nov 10, 2019
This change makes the test addresses start at 1 GiB instead of 2 GiB to
support mips and mipsle, which only have 31-bit address spaces.

It also changes some tests to use smaller offsets for the chunk index to
avoid jumping too far ahead in the address space to support 31-bit
address spaces. The tests don't require such large jumps for what
they're testing anyway.

Updates #35112.
Fixes #35440.

Change-Id: Ic68ff2b0a1f10ef37ac00d4bb5b910ddcdc76f2e
Reviewed-on: https://go-review.googlesource.com/c/go/+/205938
Run-TryBot: Brad Fitzpatrick <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Cherry Zhang <[email protected]>
gopherbot pushed a commit that referenced this issue Nov 11, 2019
Before this CL, if max > min and max was unaligned to min, then the
function could return an unaligned (unaligned to min) region to
scavenge. On most platforms, this leads to some kind of crash.

Fix this by explicitly aligning max to the next multiple of min.

Fixes #35445.
Updates #35112.

Change-Id: I0af42d4a307b48a97e47ed152c619d77b0298291
Reviewed-on: https://go-review.googlesource.com/c/go/+/206277
Reviewed-by: Ian Lance Taylor <[email protected]>
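
The fix is the standard round-up alignment idiom; a sketch, assuming min is a power of two (as a minimum scavenge size is here):

```go
// alignUp rounds n up to the next multiple of align, where align is a
// power of two. This is what the fix applies to max before choosing a
// scavenge candidate.
func alignUp(n, align uintptr) uintptr {
	return (n + align - 1) &^ (align - 1)
}
```
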
@gopherbot
Contributor

Change https://golang.org/cl/206978 mentions this issue: runtime: check summary before scavenging in fast path

gopherbot pushed a commit that referenced this issue Nov 15, 2019
In scavengeOne's fast path, we currently don't check the summary for the
chunk that scavAddr points to, which means that we might accidentally
scavenge unused address space if the previous scavenge moves the
scavAddr into that space. The result of this today is a crash.

This change makes it so that scavengeOne's fast path only happens after
the check, following the comment in mpagealloc.go. It also adds a test
for this case.

Fixes #35465.
Updates #35112.

Change-Id: I861d44ee75e42a0e1f5aaec243bc449228273903
Reviewed-on: https://go-review.googlesource.com/c/go/+/206978
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Cherry Zhang <[email protected]>
@ianlancetaylor
Contributor

@mknyszek Anything else to do for this issue?

@mknyszek
Contributor Author

mknyszek commented Dec 5, 2019

Nope! Closing.

@gopherbot
Contributor

Change https://golang.org/cl/219121 mentions this issue: design/35112-scaling-the-page-allocator.md: update proposal
