client/server: Don't block the main connection loop for transport IO #73
Conversation
This pulls in a new version of github.com/containerd/ttrpc from a fork to fix the deadlock issue in containerd/ttrpc#72. Will revert back to the upstream ttrpc vendor once the fix is merged (containerd/ttrpc#73).
Signed-off-by: Kevin Parsons <[email protected]>

This pulls in a new version of github.com/containerd/ttrpc from a fork to fix the deadlock issue in containerd/ttrpc#72. Will revert back to the upstream ttrpc vendor once the fix is merged (containerd/ttrpc#73). This fix also included some vendoring cleanup from running "vndr".
Signed-off-by: Kevin Parsons <[email protected]>
LGTM
// the main loop will return and close done, which will cause us to exit as well.
case <-done:
	return
case response := <-responses:
May be slightly clearer to defer close(responses) and just have this be for response := range responses.
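For illustration, a minimal standalone sketch of the shape being suggested; the response type and the values here are placeholders, not the actual ttrpc client code:

```go
package main

import "fmt"

type response struct{ id uint32 }

func main() {
	responses := make(chan response)

	// Producer: the goroutine that owns the channel closes it once it is done sending.
	go func() {
		defer close(responses)
		for i := uint32(1); i <= 3; i++ {
			responses <- response{id: i}
		}
	}()

	// Consumer: ranging over the channel replaces selecting on a separate done
	// channel; the loop exits once responses is closed and drained.
	for r := range responses {
		fmt.Println("got response", r.id)
	}
}
```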
Ah, maybe that's not practical since responses might still be referenced in the call goroutine.
Yes, generally you don't want to close from the read side.
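A hedged sketch of that convention, purely illustrative and not ttrpc code: writers never close the channel themselves; a single coordinator closes it only after it knows all sends have finished, so a close can never race with an in-flight send.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	ch := make(chan int)
	var wg sync.WaitGroup

	// Multiple writers; none of them closes the channel directly.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			ch <- v
		}(i)
	}

	// One coordinator closes the channel only after all writers are done,
	// so the reader can safely range without risking a send on a closed channel.
	go func() {
		wg.Wait()
		close(ch)
	}()

	for v := range ch {
		fmt.Println("received", v)
	}
}
```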
client.go (outdated)
}

go func(streamID uint32, call *callRequest) {
	requests <- streamCall{
Why not just call c.send here directly, rather than pop over to another goroutine?
That could result in multiple of these goroutines calling c.send concurrently, couldn't it?
If you do keep this model, do you need to select here on ctx.Done() so that this goroutine doesn't hang forever? (Alternatively, maybe the other goroutine shouldn't select on ctx.Done() and should use some other scheme to determine when it's done.)
Good point, we don't have any way for these to be cleaned up if the connection closes.
I think we need to keep a single sender goroutine that receives messages via channel and calls c.send. That will ensure we don't interleave the bits from multiple messages on the wire. ctx.Done() seems to be the client's equivalent of the done channel on the server side, so I think that's probably most appropriate to select on to see when we should terminate.
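A sketch of that single-sender shape, along the lines discussed above. The names (sendLoop, streamCall, conn.send) mirror the conversation but this is a simplified standalone analogue under assumed types, not the ttrpc implementation:

```go
package main

import (
	"context"
	"fmt"
)

type streamCall struct {
	streamID uint32
	payload  []byte
}

type conn struct{}

// send writes one message to the transport; because only sendLoop calls it,
// bytes from different messages are never interleaved on the wire.
func (c *conn) send(sc streamCall) {
	fmt.Printf("sent stream %d (%d bytes)\n", sc.streamID, len(sc.payload))
}

func sendLoop(ctx context.Context, c *conn, requests <-chan streamCall) {
	for {
		select {
		case <-ctx.Done():
			// ctx.Done() plays the role of the server side's done channel:
			// stop sending once the connection is being torn down.
			return
		case sc := <-requests:
			c.send(sc)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	requests := make(chan streamCall)
	done := make(chan struct{})

	go func() {
		defer close(done)
		sendLoop(ctx, &conn{}, requests)
	}()

	requests <- streamCall{streamID: 1, payload: []byte("hello")}
	requests <- streamCall{streamID: 3, payload: []byte("world")}

	cancel()
	<-done
}
```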
Answering your question: yes, that's true. I thought that was already possible, but I was wrong. So that's out, I suppose.
A possible problem with this approach is that you've eliminated the backpressure on calls: if the socket is busy, we still keep processing messages from calls, allocating more goroutines without bound. Before, we would stop pulling messages off calls, which would allow someone to select on sending to calls (I doubt this happens, though; I didn't look yet). Also, storing messages on calls is probably more memory- and CPU-efficient than storing them in blocked goroutines.
Not sure if that's a practical consideration.
Ah, indeed we do select on sending to calls. So I think this is a problem worth solving.
I'd suggest trying to process calls directly in the new goroutine. You'll need to come up with a new scheme for synchronizing waiters: although you could play more games with channels, perhaps it's reasonable to just use a mutex in this case.
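For illustration, a hedged sketch of what a mutex-guarded waiters table might look like; the names (waiters, callRequest) and fields are illustrative assumptions, not the actual ttrpc types:

```go
package main

import (
	"fmt"
	"sync"
)

type callRequest struct {
	errs chan error
}

type waiters struct {
	mu    sync.Mutex
	calls map[uint32]*callRequest
}

func newWaiters() *waiters {
	return &waiters{calls: make(map[uint32]*callRequest)}
}

// register records a call that is waiting for a response.
func (w *waiters) register(id uint32, c *callRequest) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.calls[id] = c
}

// complete delivers the result for a stream ID and removes the waiter.
func (w *waiters) complete(id uint32, err error) {
	w.mu.Lock()
	c, ok := w.calls[id]
	delete(w.calls, id)
	w.mu.Unlock()
	if ok {
		c.errs <- err
	}
}

func main() {
	w := newWaiters()
	call := &callRequest{errs: make(chan error, 1)}
	w.register(7, call)

	// Simulate the receive loop completing the call.
	w.complete(7, nil)
	fmt.Println("call 7 finished with err:", <-call.errs)
}
```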
I changed the send to a select with a <-c.ctx.Done() case, so we at least won't leak goroutines. I'll look at refactoring the rest of the flow to add back-pressure soon.
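A minimal sketch of that non-leaking send pattern, under assumed types (streamCall is a placeholder; the unread requests channel simulates a sender loop that has already stopped): instead of blocking forever on the channel send, the goroutine also selects on the connection context and unwinds when it is cancelled.

```go
package main

import (
	"context"
	"fmt"
)

type streamCall struct{ streamID uint32 }

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Nobody ever receives from requests here, simulating a sender loop that
	// has stopped because the connection is going away.
	requests := make(chan streamCall)
	done := make(chan struct{})

	go func(streamID uint32) {
		defer close(done)
		select {
		case requests <- streamCall{streamID: streamID}:
			fmt.Println("queued call", streamID)
		case <-ctx.Done():
			// The connection context was cancelled before the request could be
			// queued; bail out instead of blocking on the channel forever.
			fmt.Println("dropped call", streamID, "because the connection closed")
		}
	}(1)

	cancel() // tear down the "connection"
	<-done
}
```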
I would tend to agree with @jstarks' suggestion here, re: processing calls in a new goroutine and using a mutex to sync waiters.
Restructures both the client and server connection management so that sending messages on the transport is done by a separate "sender" goroutine. The receiving end was already split out like this. Without this change, it is possible for a send to block if the other end isn't reading fast enough, which would then block the main connection loop and prevent incoming messages from being processed.
Signed-off-by: Kevin Parsons <[email protected]>
@fuweid @crosbymichael PTAL
@cpuguy83 @jstarks @katiewasnothere PTAL (I see the PR was updated since your last review comments)
I (finally) revisited this PR and took a different approach. The new PR is #94. Going to close this PR, but PTAL at the new one. :)
Restructures both the client and server connection management so that
sending messages on the transport is done by a separate "sender"
goroutine. The receiving end was already split out like this.
Without this change, it is possible for a send to block if the other end
isn't reading fast enough, which then would block the main connection
loop and prevent incoming messages from being processed.
Signed-off-by: Kevin Parsons <[email protected]>
Fixes #72
Note: I feel there may be other things that can be cleaned up in the client/server connection code, but with this PR I was focused on fixing this bug specifically since we are seeing it in production.