
runtime: segmentation fault from vgetrandomPutState and runtime.growslice w/ runtime.OSLockThread #73141

Closed
sipsma opened this issue Apr 2, 2025 · 19 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), Critical (A critical problem that affects the availability or correctness of production systems built using Go), NeedsFix (The path to resolution is known, but the work has not been done)
Assignees: prattmic
Milestone: Go1.25

Comments


sipsma commented Apr 2, 2025

Go version

go version go1.24.2 linux/amd64

Output of go env in your module/workspace:

AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='0'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/home/arch/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/home/arch/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -fno-caret-diagnostics -Qunused-arguments -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1459677643=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/home/arch/test/go.mod'
GOMODCACHE='/home/arch/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/arch/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/arch/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24.2'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

Run the following Go program on a recent Linux kernel (6.13):

package main

import (
	"runtime"
	"time"

	"golang.org/x/sys/unix"
)

const i = 16 * 1024             // getrandom calls per locked goroutine
const bs = 1024                 // buffer size for each read
const sl = 1 * time.Millisecond // delay between spawned goroutines

func main() {
	for {
		time.Sleep(sl)
		go func() {
			// Lock this goroutine to its OS thread and never unlock it,
			// so the thread is destroyed when the goroutine exits.
			runtime.LockOSThread()
			b := make([]byte, bs)
			for range i {
				if _, err := unix.Getrandom(b, 0); err != nil {
					panic(err)
				}
			}
		}()
		b := make([]byte, bs)
		if _, err := unix.Getrandom(b, 0); err != nil {
			panic(err)
		}
	}
}

I am not sure, but I suspect that a 6.11+ kernel (where getrandom is implemented via the vDSO) is important.

It's also possible that amd64 matters, but I haven't tried other architectures on 6.11+, so I'm not sure.

This is my full uname -a output in case it's helpful:

Linux ip-172-31-34-47 6.13.8-1-ec2 #1 SMP PREEMPT_DYNAMIC Mon, 24 Mar 2025 21:00:24 +0000 x86_64 GNU/Linux

I did not build/run it in any special way, just:

go build -o main main.go && ./main

The machine I ran on had 4 cores, which might be relevant for triggering the crash quickly while also avoiding thread exhaustion, as pointed out here.

  • Others may need to use taskset/GOMAXPROCS, or adjust some of the constants in the repro code, to hit it consistently (see the sketch below).
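
As a purely illustrative aside (not part of the original report), one way to approximate the effect of taskset from within the program is to cap the number of Ps with runtime.GOMAXPROCS; the value 2 below is an arbitrary assumption, chosen to mirror pinning the binary to two CPUs:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Hypothetical tweak: cap the number of Ps before starting the repro loop,
	// similar in spirit to running the binary under `taskset -c 0,1`. The
	// value 2 is an arbitrary choice for illustration.
	prev := runtime.GOMAXPROCS(2)
	fmt.Printf("GOMAXPROCS lowered from %d to 2\n", prev)
}

Setting the GOMAXPROCS environment variable before launching the binary has the same effect without modifying the code.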

What did you see happen?

After ~5 seconds, it outputs Segmentation fault (core dumped), with the following core dump output:

           PID: 854360 (main)
           UID: 1000 (arch)
           GID: 1000 (arch)
        Signal: 11 (SEGV)
     Timestamp: Wed 2025-04-02 20:09:55 UTC (10s ago)
  Command Line: ./main
    Executable: /home/arch/test/main
 Control Group: /user.slice/user-1000.slice/session-25.scope
          Unit: session-25.scope
         Slice: user-1000.slice
       Session: 25
     Owner UID: 1000 (arch)
       Boot ID: 8235bd622064418bb4c88fcfb47876ec
    Machine ID: 51b81e352cd5447891aebaad822ce91e
      Hostname: ip-172-31-34-47
       Storage: /var/lib/systemd/coredump/core.main.1000.8235bd622064418bb4c88fcfb47876ec.854360.1743624595000000.zst (present)
  Size on Disk: 144.4K
       Message: Process 854360 (main) of user 1000 dumped core.

                Stack trace of thread 854441:
                #0  0x0000000000410238 runtime.mallocgcSmallNoscan (/home/arch/test/main + 0x10238)
                #1  0x0000000000463bd9 runtime.mallocgc (/home/arch/test/main + 0x63bd9)
                #2  0x0000000000466149 runtime.growslice (/home/arch/test/main + 0x66149)
                #3  0x00000000004613d6 runtime.vgetrandomPutState (/home/arch/test/main + 0x613d6)
                #4  0x000000000043a265 runtime.mdestroy (/home/arch/test/main + 0x3a265)
                #5  0x0000000000439f1f runtime.mstart0 (/home/arch/test/main + 0x39f1f)
                #6  0x0000000000468b65 runtime.mstart (/home/arch/test/main + 0x68b65)
                #7  0x000000000046c8ef runtime.clone (/home/arch/test/main + 0x6c8ef)

                Stack trace of thread 854440:
                #0  0x00007efcdb520411 n/a (linux-vdso.so.1 + 0x1411)
                #1  0x000000000046ca18 runtime.vgetrandom1 (/home/arch/test/main + 0x6ca18)
                #2  0x000000c00019eb48 n/a (n/a + 0x0)
                #3  0x0000000000467cd5 runtime.vgetrandom (/home/arch/test/main + 0x67cd5)
                #4  0x00000000004738a6 golang.org/x/sys/unix.Getrandom (/home/arch/test/main + 0x738a6)
                #5  0x0000000000473ea9 main.main.func1 (/home/arch/test/main + 0x73ea9)
                #6  0x000000000046aa81 runtime.goexit (/home/arch/test/main + 0x6aa81)

                Stack trace of thread 854361:
                #0  0x000000000046c277 runtime.usleep (/home/arch/test/main + 0x6c277)
                #1  0x0000000000443585 runtime.sysmon (/home/arch/test/main + 0x43585)
                #2  0x0000000000439fd3 runtime.mstart1 (/home/arch/test/main + 0x39fd3)
                #3  0x0000000000439f15 runtime.mstart0 (/home/arch/test/main + 0x39f15)
                #4  0x0000000000468b65 runtime.mstart (/home/arch/test/main + 0x68b65)
                #5  0x000000000046c8ef runtime.clone (/home/arch/test/main + 0x6c8ef)
                #6  0x000000c000020000 n/a (n/a + 0x0)
                ELF object binary architecture: AMD x86-64

What did you expect to see?

I expected it not to crash.


For more context:

Dagger and Docker have both been unable to update from Go 1.23 to any version of Go 1.24 due to periodic segmentation faults.

Multiple crash reports shared by other users/debuggers have shown stack traces involving runtime.vgetrandomPutState and runtime.growslice, matching what I reproduced in isolation above.

I took a look at the relevant lines from the stack traces and arrived at the following theory:

  1. eb6f2c2 is the culprit.
  2. It involves the specific code path followed when:
    • A goroutine's OS thread (M) is being destroyed because runtime.LockOSThread was still held at goexit.
    • The vgetrandomAlloc.states slice is appended to, triggering growslice and thus a malloc, at a point in the M/P lifecycle where that's not allowed (or just doesn't work for some other reason).
    • The use of runtime.LockOSThread is particularly relevant, since it potentially explains why Dagger/Docker hit this so quickly while seemingly no other reports have surfaced; they are among the relatively rare users of that API (due to doing container-y things).

I am very far from a Go runtime expert, so I'm not at all sure whether the above is correct, but it led me to the repro code above, which does seem to consistently trigger this, whether by coincidence or not 🤷‍♂️

cc @zx2c4

@gopherbot added the compiler/runtime label on Apr 2, 2025
@dmitshur added the Critical label on Apr 2, 2025

prattmic commented Apr 3, 2025

I can reproduce on my Linux 6.12 machine, with the same stack trace.

For reference, I had to run with taskset -c 0,1 ./my_binary to get it to crash quickly (~40s). Running on all cores did not crash within 5 minutes.


prattmic commented Apr 3, 2025

Your theory looks correct to me. The mexit -> mdestroy -> vgetrandomPutState path appends to a slice after we have released the P, so allocating is no longer safe. We should avoid that.
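
To make the hazard concrete, here is a simplified, stand-alone sketch in ordinary Go (not the actual runtime code): returning per-thread state to a global free list via append can grow the backing array, i.e. allocate, and inside the runtime that allocation is only safe while the M still holds a P, which is no longer the case by the time mdestroy runs.

package main

import "sync"

// state is a stand-in for a per-thread vgetrandom state buffer.
type state [256]byte

// pool is a stand-in for a global free list such as vgetrandomAlloc.states.
type pool struct {
	mu   sync.Mutex
	free []*state
}

// put returns a state to the pool. The append may need to grow the backing
// array, which is a heap allocation; the runtime's analogous append is only
// safe to perform while the M still holds a P.
func (p *pool) put(s *state) {
	p.mu.Lock()
	p.free = append(p.free, s)
	p.mu.Unlock()
}

func main() {
	p := &pool{}
	p.put(new(state))
	println(len(p.free)) // 1
}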


prattmic commented Apr 3, 2025

@gopherbot Please backport to 1.24. This is a regression that can cause arbitrary crashes in programs that exit goroutines under LockOSThread and are running on Linux 6.11 or higher.

@gopherbot (Contributor)

Backport issue(s) opened: #73144 (for 1.24).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.


prattmic commented Apr 3, 2025

By the way, thank you very much for the easy reproducer and bisect! That makes things much easier.

@prattmic self-assigned this on Apr 3, 2025
@prattmic added this to the Go1.25 milestone on Apr 3, 2025
@prattmic added the NeedsFix label on Apr 3, 2025
@gopherbot (Contributor)

Change https://go.dev/cl/662455 mentions this issue: runtime: cleanup M vgetrandom state before dropping P


zx2c4 commented Apr 3, 2025

Fix seems correct to me. Thanks, and sorry for the bug.

@gopherbot (Contributor)

Change https://go.dev/cl/662496 mentions this issue: [release-branch.go1.24] runtime: cleanup M vgetrandom state before dropping P

@thaJeztah (Contributor)

"By the way, thank you very much for the easy reproducer and bisect! That makes things much easier."

💯 double this! Also kudos to the various folks in moby/moby#49513 who tried bisecting and reported the traces they could find.

We knew "something" was wrong, but the only reproducer we had was "docker segfaults sometimes", which wasn't very useful to report here, so I'm really happy @sipsma was able to find a MUCH smaller reproducer.


mvdan commented Apr 3, 2025

@prattmic thank you so much for the fix! I know that go1.24.2 was just released, but any chance a go1.24.3 release with this fix could be pushed forward? There are a number of Linux distributions and downstream projects which are currently stuck at Go 1.23 until a release with this fix happens for 1.24.


Foxboron commented Apr 3, 2025

We have pushed an updated Go compiler to Arch with the backported patch. Everything works as expected.

λ go » time go run .
signal: segmentation fault (core dumped)
go run .  1.21s user 0.14s system 514% cpu 0.262 total
λ go » sudo pacman -U /var/cache/pacman/pkg/go-2:1.24.1-2-x86_64.pkg.tar.zst
[...snip...]
λ go » time go run .
^Csignal: interrupt
go run .  72.98s user 0.22s system 1808% cpu 4.046 total


prattmic commented Apr 3, 2025

@mvdan Since #73144 is where the backport will be discussed, let's move discussion of a potential earlier release there (and so it won't be hidden inside a closed issue).


prattmic commented Apr 3, 2025

@thaJeztah For what it's worth, while I definitely appreciate the basically-solved bug report (makes my life easier!), in this case I suspect even with no reproducer we probably could have figured this out from:

  1. We are getting random segfaults with this stack trace:
                #0  0x0000000000410238 runtime.mallocgcSmallNoscan (/home/arch/test/main + 0x10238)
                #1  0x0000000000463bd9 runtime.mallocgc (/home/arch/test/main + 0x63bd9)
                #2  0x0000000000466149 runtime.growslice (/home/arch/test/main + 0x66149)
                #3  0x00000000004613d6 runtime.vgetrandomPutState (/home/arch/test/main + 0x613d6)
                #4  0x000000000043a265 runtime.mdestroy (/home/arch/test/main + 0x3a265)
                #5  0x0000000000439f1f runtime.mstart0 (/home/arch/test/main + 0x39f1f)
                #6  0x0000000000468b65 runtime.mstart (/home/arch/test/main + 0x68b65)
                #7  0x000000000046c8ef runtime.clone (/home/arch/test/main + 0x6c8ef)
  2. This seems to be related to upgrading to 1.24.

I say that because in this stack trace we see runtime.vgetrandomPutState, which is new in 1.24 (and not that big or complicated a feature), and runtime.mdestroy, which is a really suspicious place to be crashing.

That's certainly not going to be possible with every runtime bug, but I don't think it would have hurt to file a bug saying "FYI, we think we've found a problem in 1.24; here are some minimal details, and we are still investigating to narrow it down."


prattmic commented Apr 3, 2025

k3s-io/k3s#11973 (comment) reports the same crash in containerd as well. (I'm not sure what's going on with the rest of that issue, which is marked as fixed. The crash report may be unrelated to the rest of the issue.)

@dmitshur changed the title from "runtime: Segmentation fault from vgetrandomPutState and runtime.growslice w/ runtime.OSLockThread" to "runtime: segmentation fault from vgetrandomPutState and runtime.growslice w/ runtime.OSLockThread" on Apr 3, 2025
@gopherbot (Contributor)

Change https://go.dev/cl/662636 mentions this issue: runtime: add thread exit plus vgetrandom stress test

@thaJeztah (Contributor)

"@thaJeztah For what it's worth, while I definitely appreciate the basically-solved bug report (makes my life easier!), in this case I suspect even with no reproducer we probably could have figured this out from:"

@prattmic Thanks for the extra pointers there! I guess I was a bit too conservative; when we received the report, our own builds were not yet on go1.24, and maintainers did not manage to reproduce it, so we didn't want to immediately waste time in case it was due to some other factor. I recall I left a quick blurb on another ticket, #71932 (comment), but I probably should have opened a new ticket right away with just the information I had.

Will do next time!!

gopherbot pushed a commit that referenced this issue Apr 4, 2025
Add a regression test similar to the reproducer from #73141 to try to
help catch future issues with vgetrandom and thread exit. Though the
test isn't very precise, it just hammers thread exit.

When the test reproduces #73141, it simply crashes with a SIGSEGV and no
output or stack trace, which would be very unfortunate on a builder.
https://go.dev/issue/49165 tracks collecting core dumps from builders,
which would make this more tractable to debug.

For #73141.

Change-Id: I6a6a636c7d7b41e2729ff6ceb30fd7f979aa9978
Reviewed-on: https://go-review.googlesource.com/c/go/+/662636
Reviewed-by: Cherry Mui <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Auto-Submit: Michael Pratt <[email protected]>
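
For readers curious what a stress test in this spirit might look like, the following is a rough, hypothetical sketch rather than the actual test added in the CL; it assumes crypto/rand.Read reaches the vDSO getrandom fast path on newer kernels (the original reproducer called golang.org/x/sys/unix.Getrandom directly) and simply hammers the exit of locked OS threads:

package stress_test

import (
	"crypto/rand"
	"runtime"
	"sync"
	"testing"
)

// TestVgetrandomThreadExitStress is a hypothetical sketch, not the CL's test.
// Each goroutine locks itself to its OS thread, reads some random bytes, and
// exits without unlocking, so the runtime destroys an OS thread on every exit.
func TestVgetrandomThreadExitStress(t *testing.T) {
	var wg sync.WaitGroup
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			runtime.LockOSThread()
			b := make([]byte, 64)
			if _, err := rand.Read(b); err != nil {
				t.Error(err)
			}
		}()
	}
	wg.Wait()
}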