lxd/instance/exec: Only use keepalives on TCP sockets #12530
Conversation
Keepalives aren't particularly useful on Unix sockets, and it seems that there are some issues with them sometimes being written but not read. This means that processes launched via `lxc exec` can sometimes eventually hang: a websocket buffer fills up, an attempt to send a keepalive returns `EAGAIN`, LXD gives up on mirroring output from the corresponding file descriptor, and subprocesses eventually hang in turn when the buffer for their standard output/error fills up. I've observed this particularly on slower architectures such as riscv64 and arm64 in the context of non-interactive `lxc exec` via `launchpad-buildd`, though I don't know the exact set of triggers and so don't have a minimal test case; the best reproducer I have is building a snap from `https://git.launchpad.net/~canonical-lxd/lxd latest-candidate` in `launchpad-buildd` on riscv64, which reliably hangs prior to this change.

Fixes #10034

Signed-off-by: Colin Watson <[email protected]>
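To make the approach concrete, here is a minimal sketch of gating websocket keepalives on the transport type, assuming gorilla/websocket. The `startKeepalive` helper and its interval handling are illustrative only, not the actual LXD diff:

```go
package main

import (
	"net"
	"time"

	"github.com/gorilla/websocket"
)

// startKeepalive sends periodic pings, but only when the websocket is
// backed by a TCP connection; on Unix sockets no pinger is started.
// (Illustrative sketch: real code over TLS would also need to unwrap
// the *tls.Conn before this type assertion can succeed.)
func startKeepalive(conn *websocket.Conn, interval time.Duration) {
	// UnderlyingConn exposes the net.Conn beneath the websocket.
	if _, ok := conn.UnderlyingConn().(*net.TCPConn); !ok {
		return // Unix socket (or other transport): skip keepalives.
	}

	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			deadline := time.Now().Add(5 * time.Second)
			if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
				return // Connection is gone; stop pinging.
			}
		}
	}()
}
```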
Thanks, LGTM!
Really nice find indeed. Reading the docs at https://pkg.go.dev/github.com/gorilla/websocket, the confusion around the use of ping/pong in gorilla/websocket#649, and the example at https://github.com/gorilla/websocket/blob/main/examples/filewatch/main.go, I can't tell whether these keepalive messages are correct for TCP either. I'm not sure ping/pong is supposed to be used for TCP keepalive functionality (why can't we use actual TCP keepalives for that?). I would also expect to see ping/pong handlers set somewhere, messages sent, and appropriate gorilla/websocket read and write timeouts configured; since the default handlers do nothing, the buffers would indeed eventually fill up if no other messages are flowing. How come nothing seems to read and consume the ping messages, or send a pong and handle it? I guess my question is: we are confident we have solved this bug for local sockets, but do we still have the same bug on TCP sockets?
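For context, the conventional gorilla/websocket heartbeat wiring this comment alludes to looks roughly like the following. The constants and the `heartbeat` helper follow the spirit of the library's published examples, not LXD's code:

```go
package main

import (
	"time"

	"github.com/gorilla/websocket"
)

const (
	pongWait   = 60 * time.Second    // how long we wait for a pong
	pingPeriod = (pongWait * 9) / 10 // must fire before pongWait expires
	writeWait  = 10 * time.Second    // deadline for writing a ping
)

// heartbeat pings the peer and extends the read deadline on each pong,
// so a silent or dead peer eventually fails the next read.
func heartbeat(conn *websocket.Conn) {
	_ = conn.SetReadDeadline(time.Now().Add(pongWait))
	conn.SetPongHandler(func(string) error {
		// A pong arrived, so the peer is alive: push the deadline out.
		return conn.SetReadDeadline(time.Now().Add(pongWait))
	})

	ticker := time.NewTicker(pingPeriod)
	defer ticker.Stop()
	for range ticker.C {
		if err := conn.WriteControl(websocket.PingMessage, nil, time.Now().Add(writeWait)); err != nil {
			return // write failed; assume the connection is dead
		}
	}
}
```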
I do plan to review the use of websocket ping/pong, indeed. The original intent of using both TCP keepalives and application-level pings was to ensure that an idle exec session was kept alive even when run through an HTTP proxy (where there are two TCP connections involved and we don't necessarily control both sides) and the proxy applies application-level timeouts, and this did indeed fix that. The default ping handler is documented as returning a pong message, so it didn't seem we needed a specific handler on either side (the default pong handler is documented as taking no action, but I interpreted that as the message being consumed and discarded); the ping/pong messages alone should be enough to stop the proxy killing the connection. However, I was looking at those examples earlier as well, and it seems we should inline the handling of the ping and pong messages into the main message-consumer goroutines rather than doing it separately.
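One caveat behind the point above: gorilla/websocket only invokes the ping/pong handlers (including the default ping handler that replies with a pong) from inside a read call. A minimal sketch of a consumer goroutine that keeps them serviced, with a hypothetical `handleData` callback, might look like:

```go
package main

import "github.com/gorilla/websocket"

// readLoop keeps draining the connection. Without a loop like this,
// incoming pings are never answered and buffers can back up.
func readLoop(conn *websocket.Conn, handleData func([]byte)) error {
	for {
		// ReadMessage processes any pending control frames first; the
		// default ping handler replies with a pong from here.
		_, message, err := conn.ReadMessage()
		if err != nil {
			return err // read deadline expired or connection closed
		}
		handleData(message) // application-level payload handling
	}
}
```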