Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/sync/errgroup: Group: document that Go may not be concurrent with Wait unless semaphore > 0 #70284

Closed
haaawk opened this issue Nov 11, 2024 · 25 comments
Labels
Documentation Issues describing a change to documentation. help wanted NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@haaawk
Copy link

haaawk commented Nov 11, 2024

(Edit: skip down to #70284 (comment); this is now a doc change request. --@adonovan)

Go version

go version go1.23.1 darwin/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/Users/haaawk/Library/Caches/go-build'
GOENV='/Users/haaawk/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/haaawk/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/haaawk/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/darwin_amd64'
GOVCS=''
GOVERSION='go1.23.1'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/haaawk/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/_y/nj5hcsh93l12tmsszxgx7ntr0000gn/T/go-build2396141927=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

When following code is run:

package main

import (
	"runtime"

	"golang.org/x/sync/errgroup"
)

func main() {
	runtime.GOMAXPROCS(1)
	g := &errgroup.Group{}
	g.SetLimit(1)
	ch := make(chan struct{})
	wait := make(chan struct{}, 2)
	g.Go(func() error {
		<-ch
		wait <- struct{}{}
		return nil
	})
	go g.Go(func() error {
		println("I'm not blocked")
		wait <- struct{}{}
		return nil
	})
	println("Ok let's play")
	close(ch)
	g.Wait()
	println("It's over already?")
	<-wait
	<-wait
}

https://go.dev/play/p/xTIsT1iouTd

What did you see happen?

The program printed:

Ok let's play
It's over already?
I'm not blocked

What did you expect to see?

The program printing:

Ok let's play
I'm not blocked
It's over already?
@gopherbot gopherbot added this to the Unreleased milestone Nov 11, 2024
@haaawk
Copy link
Author

haaawk commented Nov 11, 2024

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/627075 mentions this issue: x/sync/errgroup: ensure all goroutines finish before Wait returns

@jrick
Copy link
Contributor

jrick commented Nov 11, 2024

go g.Go(

I don't think this is what you meant to write.

@haaawk
Copy link
Author

haaawk commented Nov 11, 2024

go g.Go(

I don't think this is what you meant to write.

It is exactly what I meant to write. Otherwise the g.Go would block.

@jrick
Copy link
Contributor

jrick commented Nov 11, 2024

Your program contains a data race, because it's not valid to increment a waitgroup with 0 count while waiting on it. Your submitted patch doesn't change this.

@haaawk
Copy link
Author

haaawk commented Nov 11, 2024

Your program contains a data race, because it's not valid to increment a waitgroup with 0 count while waiting on it. Your submitted patch doesn't change this.

The waitgroup has count of 1 when the go g.Go is called

@jrick
Copy link
Contributor

jrick commented Nov 11, 2024

No. You would need additional synchronization to guarantee this.

package main

import (
        "runtime"

        "golang.org/x/sync/errgroup"
)

func main() {
        runtime.GOMAXPROCS(1)
        g := &errgroup.Group{}
        g.SetLimit(1)
        ch := make(chan struct{})
        wait := make(chan struct{}, 2)
        insideGo := make(chan struct{})
        g.Go(func() error {
                <-ch
                wait <- struct{}{}
                return nil
        })
        go g.Go(func() error {
                close(insideGo)
                println("I'm not blocked")
                wait <- struct{}{}
                return nil
        })
        println("Ok let's play")
        close(ch)
        <-insideGo
        g.Wait()
        println("It's over already?")
        <-wait
        <-wait
}

@cherrymui
Copy link
Member

@jrick is right. go g.Go(...) is not what you want. g.Go is intentionally blocking to limit the number of active goroutines. You've called g.SetLimit(1), so that limits it to 1 active goroutine a time. If you don't want that limit, you can remove that line, or set a higher limit.

In your case, there is no synchronization between g.Wait and the g.Go call in the new goroutine created by go g.Go(...). If is possible that g.Wait has finished while the new goroutine created by go g.Go(...) hasn't run. So this works as intended. Your CL doesn't seem to change it either: g.Wait can still finish even before the new goroutine reaches g.Go.

In general, if you have questions about how to use the API, please see https://go.dev/wiki/Questions . Thanks.

@cherrymui cherrymui closed this as not planned Won't fix, can't repro, duplicate, stale Nov 11, 2024
@adonovan
Copy link
Member

adonovan commented Nov 11, 2024

Perhaps I misunderstand, but I think the Group is working as intended. SetLimit may prevent the Group from accepting new work, and the client must deal with that. It definitely cannot call Go asynchronously as there would be no guarantee that it happened before the later call to Wait.

Perhaps the documentation could be improved, but I don't think there's a bug.

@adonovan
Copy link
Member

Ah, @cherrymui beat me to it!

@haaawk
Copy link
Author

haaawk commented Nov 11, 2024

This was just my failed attempt to have a simple reproducer to the issue that's still there.
In the code below the order is always right and the program still prints the same result. The problem is that g.Go can wake up from sem in a group that was already waited on.
Let me thing of less artificial example and come back. It will need to be more complex unfortunately.

package main

import (
	"runtime"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	runtime.GOMAXPROCS(1)
	g := &errgroup.Group{}
	g.SetLimit(1)
	ch := make(chan struct{})
	wait := make(chan struct{}, 2)
	g.Go(func() error {
		<-ch
		wait <- struct{}{}
		return nil
	})
	go g.Go(func() error {
		println("I'm not blocked")
		wait <- struct{}{}
		return nil
	})
	println("Ok let's play")
	time.Sleep(1 * time.Second)
	close(ch)
	g.Wait()
	println("It's over already?")
	<-wait
	<-wait
}

@adonovan
Copy link
Member

In the code below the order is always right and the program still prints the same result.

This program has a race: the second call to Go races with Wait. Consider: the goroutine created by the first Go function could complete (along with println, Sleep, close) before the goroutine created by the go statement even begins to execute the second call to g.Go.

@haaawk
Copy link
Author

haaawk commented Nov 11, 2024

In the code below the order is always right and the program still prints the same result.

This program has a race: the second call to Go races with Wait. Consider: the goroutine created by the first Go function could complete (along with println, Sleep, close) before the goroutine created by the go statement even begins to execute the second call to g.Go.

I know it's not properly synchronized and I never said it is. I said that the order is right. I wouldn't expect runtime to stay idle during sleep and not execute the runnable goroutine.

I guess a race free reproducer could be:

package main

import (
	"runtime"
	"runtime/pprof"
	"strings"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	runtime.GOMAXPROCS(1)
	g := &errgroup.Group{}
	g.SetLimit(1)
	ch := make(chan struct{})
	wait := make(chan struct{}, 2)
	go func() {
		println("Starting task scheduler")
		println("Scheduling first task")
		g.Go(func() error {
			<-ch
			println("First task ran")
			wait <- struct{}{}
			return nil
		})
		println("Scheduling second task")
		g.Go(func() error {
			println("What about me?")
			wait <- struct{}{}
			return nil
		})
	}()
	time.Sleep(1 * time.Second)
	var b strings.Builder
	pprof.Lookup("goroutine").WriteTo(&b, 1)
	close(ch)
	if strings.Contains(b.String(), "errgroup.go:71") {
		g.Wait()
		println("Game over")
	} else {
		println("Second task not blocked yet")
	}
	<-wait
	<-wait
}

https://go.dev/play/p/Zg5GpZJz2bK

but I have to admit, there's probably no way to reproduce the problem in a completely race free way. If g.Go and g.Wait are in two different goroutines then runtime can always preempt g.Wait in the middle and then schedule g.Go to run and set g.wg.Add(1) while g.Wait is running with counter equal to 0.

@haaawk
Copy link
Author

haaawk commented Nov 11, 2024

I guess it would be cleaner if the docs were saying explicitly that g.Wait has to be called only after all calls to g.Go finish. This is not obvious without looking into the implementation. One could imagine an implementation that does not have a data race when g.Wait and g.Go are called concurrently.

@adonovan
Copy link
Member

I guess it would be cleaner if the docs were saying explicitly that g.Wait has to be called only after all calls to g.Go finish. This is not obvious without looking into the implementation. One could imagine an implementation that does not have a data race when g.Wait and g.Go are called concurrently.

It's fine to call Go concurrently with Wait, and indeed useful, if one item of work might add others to the queue. But you can't use SetLimit in this scenario or else the Group will stop accepting work. We could certainly document that, but we should not try to disallow concurrent calls to Go.

@haaawk
Copy link
Author

haaawk commented Nov 12, 2024

I guess it would be cleaner if the docs were saying explicitly that g.Wait has to be called only after all calls to g.Go finish. This is not obvious without looking into the implementation. One could imagine an implementation that does not have a data race when g.Wait and g.Go are called concurrently.

It's fine to call Go concurrently with Wait, and indeed useful, if one item of work might add others to the queue. But you can't use SetLimit in this scenario or else the Group will stop accepting work. We could certainly document that, but we should not try to disallow concurrent calls to Go.

It is a very unintuitive requirement which is really driven by implementation details not the best API design but you're right.

@adonovan
Copy link
Member

It is a very unintuitive requirement which is really driven by implementation details not the best API design

I think your criticism of the design is unfair. The doc comment for SetLimit says that it "limits the number of active goroutines", and that "any subsequent call to the Go method will block ...". The alternative that you propose would remove the blocking behavior, which would be an incompatible change (since it removes happens-before edges), and it would require that Group maintain an arbitrarily large queue of functions provided to Go before a goroutine is available to run them, which could cause an application to use unbounded memory when dealing with very long streams of tasks.

@haaawk
Copy link
Author

haaawk commented Nov 12, 2024

It is a very unintuitive requirement which is really driven by implementation details not the best API design

I think your criticism of the design is unfair. The doc comment for SetLimit says that it "limits the number of active goroutines", and that "any subsequent call to the Go method will block ...". The alternative that you propose would remove the blocking behavior, which would be an incompatible change (since it removes happens-before edges), and it would require that Group maintain an arbitrarily large queue of functions provided to Go before a goroutine is available to run them, which could cause an application to use unbounded memory when dealing with very long streams of tasks.

So I'm not advocating for my change any more at all. My criticism is about the fact that calling Wait and Go concurrently with no synchronization is a data race and the docs don't say a word about it. Calling Go from a goroutine running inside a group is a form of synchronization. The requirement is really specific - "You can call Go concurrently with Wait only if you have guaranteed that at least 1 other goroutine that is already running inside the group won't finish until after either Go or Wait finishes first - or both.

@adonovan
Copy link
Member

My criticism is about the fact that calling Wait and Go concurrently with no synchronization is a data race and the docs don't say a word about it. Calling Go from a goroutine running inside a group is a form of synchronization. The requirement is really specific - "You can call Go concurrently with Wait only if you have guaranteed that at least 1 other goroutine that is already running inside the group won't finish until after either Go or Wait finishes first - or both.

True, but the exact same principle applies to plain old WaitGroups: to add an item to the group you call Add(1) and later Done. In the simple case you make all the Add calls in sequence and all the Dones happen asynchronously. In more complex cases a task begets more tasks, so it calls Add (for the child) before calling Done for itself, preserving the invariant. But you get into trouble if you make the Add asynchronous to the Wait without otherwise ensuring that the semaphore is nonzero.

Perhaps we could add some clarifying documentation.

@haaawk
Copy link
Author

haaawk commented Nov 12, 2024

My criticism is about the fact that calling Wait and Go concurrently with no synchronization is a data race and the docs don't say a word about it. Calling Go from a goroutine running inside a group is a form of synchronization. The requirement is really specific - "You can call Go concurrently with Wait only if you have guaranteed that at least 1 other goroutine that is already running inside the group won't finish until after either Go or Wait finishes first - or both.

True, but the exact same principle applies to plain old WaitGroups: to add an item to the group you call Add(1) and later Done. In the simple case you make all the Add calls in sequence and all the Dones happen asynchronously. In more complex cases a task begets more tasks, so it calls Add (for the child) before calling Done for itself, preserving the invariant. But you get into trouble if you make the Add asynchronous to the Wait without otherwise ensuring that the semaphore is nonzero.

Perhaps we could add some clarifying documentation.

Yes but WaitGroup has the following in the docs:

Note that calls with a positive delta that occur when the counter is zero must happen before a Wait. Calls with a negative delta, or calls with a positive delta that start when the counter is greater than zero, may happen at any time. Typically this means the calls to Add should execute before the statement creating the goroutine or other event to be waited for. If a WaitGroup is reused to wait for several independent sets of events, new Add calls must happen after all previous Wait calls have returned.

which gives a user a chance to use the API correctly without looking into its implementation.

@haaawk
Copy link
Author

haaawk commented Nov 12, 2024

BTW I came up with this issue because I found in the codebase I'm working on the following code outside of group goroutines:

	if !w.g.TryGo(f) {
		go w.g.Go(f)
	}

and it seemed fishy from the concurrency point of view so I kept digging. Apparently someone was misled by errgroup.Group docs/API.

@adonovan
Copy link
Member

Reopening as a doc change request.

@adonovan adonovan reopened this Nov 13, 2024
@adonovan adonovan changed the title x/sync/errgroup: Group.Wait may return before all Group.Go calls are finished x/sync/errgroup: Group: document that Go may not be concurrent with Wait unless semaphore > 0 Nov 13, 2024
@gopherbot gopherbot added the Documentation Issues describing a change to documentation. label Nov 13, 2024
@ianlancetaylor ianlancetaylor added help wanted NeedsFix The path to resolution is known, but the work has not been done. labels Nov 13, 2024
@ianlancetaylor ianlancetaylor added the gopls Issues related to the Go language server, gopls. label Nov 13, 2024
@findleyr findleyr removed the gopls Issues related to the Go language server, gopls. label Nov 13, 2024
@berbreik
Copy link

berbreik commented Jan 6, 2025

I'd like to work on this issue. I'II update the documentation to clarify the behavior when the semaphore limit is set to zero

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/660075 mentions this issue: errgroup: document calling Go before Wait

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Issues describing a change to documentation. help wanted NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

9 participants