runtime: segmentation fault from vgetrandomPutState and runtime.growslice w/ runtime.OSLockThread #73141
Comments
I can reproduce on my Linux 6.12 machine, with the same stack trace. For reference, I had to run with

Your theory looks correct to me.

@gopherbot Please backport to 1.24. This is a regression that can cause arbitrary crashes in programs that exit goroutines under LockOSThread and are running on Linux 6.11 or higher.

Backport issue(s) opened: #73144 (for 1.24). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

By the way, thank you very much for the easy reproducer and bisect! That makes things much easier.

Change https://go.dev/cl/662455 mentions this issue.

Fix seems correct to me. Thanks, and sorry for the bug.

Change https://go.dev/cl/662496 mentions this issue.

💯 double this! Also, kudos to the various folks in moby/moby#49513 who tried bisecting and reporting the traces they could find. We knew "something" was wrong, but the only reproducer we had was "docker segfaults sometimes", which wasn't very useful to report here as a reproducer, so I'm really happy @sipsma was able to find a MUCH smaller reproducer.

@prattmic thank you so much for the fix! I know that go1.24.2 was just released, but is there any chance a go1.24.3 release with this fix could be pushed forward? There are a number of Linux distributions and downstream projects which are currently stuck at Go 1.23 until a release with this fix happens for 1.24.
We have pushed an updated
@thaJeztah For what it's worth, while I definitely appreciate the basically-solved bug report (it makes my life easier!), in this case I suspect that even with no reproducer we probably could have figured this out from the stack traces alone: in this stack trace we see vgetrandomPutState. That's certainly not going to be possible with every runtime bug, but I don't think it would have hurt to file a bug saying: FYI, you think you've found a problem in 1.24, here are some minimal details, and you are still investigating to narrow it down.
k3s-io/k3s#11973 (comment) reports the same crash in containerd as well. (I'm not sure what's going on with the rest of that issue, which is marked as fixed. The crash report may be unrelated to the rest of the issue.) |
Change https://go.dev/cl/662636 mentions this issue.

@prattmic Thanks for the extra pointers there! I guess I was a bit too conservative; when we received the report, our own builds were not yet on go1.24, and maintainers did not manage to reproduce it, so we didn't want to immediately waste your time in case it was due to some other factors. I recall I left a quick blurb on another ticket, #71932 (comment), but I probably should've opened a new ticket with just the information I had. Will do next time!!
Add a regression test similar to the reproducer from #73141 to try to help catch future issues with vgetrandom and thread exit. The test isn't very precise; it just hammers thread exit. When the test reproduces #73141, it simply crashes with a SIGSEGV and no output or stack trace, which would be very unfortunate on a builder. https://go.dev/issue/49165 tracks collecting core dumps from builders, which would make this more tractable to debug.

For #73141.

Change-Id: I6a6a636c7d7b41e2729ff6ceb30fd7f979aa9978
Reviewed-on: https://go-review.googlesource.com/c/go/+/662636
Reviewed-by: Cherry Mui <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Auto-Submit: Michael Pratt <[email protected]>
Go version
go version go1.24.2 linux/amd64

Output of go env in your module/workspace:

What did you do?
Run the following Go program on a recent Linux kernel (6.13):
I am not sure, but I suspect that having a 6.11+ kernel (where getrandom is optimized to use the vDSO) is important. It's also possible that amd64 is important, but I haven't tried other arches on 6.11+, so I'm not sure.

This is my full uname -a output in case it's helpful:

I did not build/run it in any special way, just:
The machine I ran on had 4 cores, which might be relevant for triggering it quickly while also avoiding thread exhaustion, as pointed out here. You may need to use taskset/GOMAXPROCS or adjust some of the constants in the repro code to hit it consistently.

What did you see happen?
After ~5 seconds, it outputs Segmentation fault (core dumped), with the following core dump output:

What did you expect to see?
I expected it to not crash.
For more context:

Dagger and Docker have both been unable to update from Go 1.23 to any version of Go 1.24 due to periodic segmentation faults. Multiple stack traces shared by other users/debuggers have shown crash stack traces involving runtime.vgetrandomPutState and runtime.growslice, matching what I reproduced in isolation above.

I took a look at the relevant lines from the stack traces and formed the theory that, with runtime.LockOSThread held at goexit, the vgetrandomAlloc.states slice was appended to such that it triggered growslice and thus tried to malloc, but at a point in the m/p lifecycle where that's not allowed (or just doesn't work for some other reason). runtime.LockOSThread is particularly relevant since it potentially explains why Dagger/Docker hit this so quickly while seemingly no other reports have surfaced; Dagger/Docker are some of the rare users of that API (due to doing container-y things).

I am very, very far from a Go runtime expert, so I'm not at all sure the above is correct, but it led me to the repro code above, which does indeed seem to consistently trigger this, whether by coincidence or not 🤷‍♂️
cc @zx2c4