
lxd/instance/exec: Only use keepalives on TCP sockets #12530

Merged (1 commit) on Nov 15, 2023

Conversation

cjwatson
Contributor

@cjwatson cjwatson commented Nov 15, 2023

Keepalives aren't particularly useful on Unix sockets, and there appear to be issues with them sometimes being written but not read. As a result, processes launched via `lxc exec` can eventually hang: a websocket buffer fills up, an attempt to send a keepalive then returns `EAGAIN`, LXD gives up on mirroring output from the corresponding file descriptor, and subprocesses eventually block when the buffer for their standard output/error fills up in turn.

I've observed this particularly on slower architectures such as riscv64 and arm64, in the context of non-interactive `lxc exec` via launchpad-buildd. I don't know the exact set of triggers, so I don't have a minimal test case; the best reproducer I have is building a snap from https://git.launchpad.net/~canonical-lxd/lxd (latest-candidate branch) in launchpad-buildd on riscv64, which reliably hangs prior to this change.

Fixes #10034


Signed-off-by: Colin Watson <[email protected]>
Member

@tomponline tomponline left a comment


Thanks lgtm!

@tomponline tomponline merged commit 9d6eff3 into canonical:main Nov 15, 2023
25 checks passed
@xnox
Contributor

xnox commented Nov 15, 2023

Really nice find indeed.

@tomponline

Reading the docs at https://pkg.go.dev/github.com/gorilla/websocket, the confusion around the use of ping/pong in gorilla/websocket#649, and the example at https://github.com/gorilla/websocket/blob/main/examples/filewatch/main.go, I can't tell whether these keepalive messages are correct for TCP either.

Because I'm not sure ping/pong is supposed to be used for TCP keepalive functionality (why can't we use actual TCP keepalives for it?). I would expect to see ping/pong handlers set somewhere, messages sent, and appropriate gorilla/websocket read and write timeouts configured. As the default handlers do nothing, buffers would indeed eventually fill up if no other messages are flowing. How come nothing seems to read and consume the ping messages, or send a pong message and handle the pong?

I guess my question is: we're confident we've solved this bug for local sockets, but do we still have the same bug on TCP sockets now?

@tomponline
Member

I do plan on reviewing the use of websocket ping/pong indeed. The original intent of using both TCP keepalives and application-level pings was to ensure that an idle exec session was kept alive even when run through an HTTP proxy (where there are two TCP connections involved and we don't necessarily control both sides) and the proxy applies application-level timeouts. This did indeed fix that.

The default ping handler is documented as returning a pong message, so it didn't seem we needed a specific handler on either side (the default pong handler is documented as taking no action, but I interpreted that as the message being consumed and discarded); the ping/pong messages themselves should be enough to stop the proxy killing the connection.

However, I was looking at those examples earlier as well, and it seems we should inline the handling of the ping and pong messages into the main message-consumer goroutines rather than doing it separately.

@tomponline
Member

#11011

@tomponline
Member

#10034 (comment)

@cjwatson cjwatson deleted the keepalives-tcp-only branch November 16, 2023 15:33
Successfully merging this pull request may close these issues.

"lxc exec" frequently runs into I/O timeouts