lxd/instance/exec: Only use keepalives on TCP sockets #12530
Conversation
Keepalives aren't particularly useful on Unix sockets, and it seems that there are some issues with them sometimes being written but not read. This means that processes launched via `lxc exec` can sometimes eventually hang: a websocket buffer fills up, an attempt to send a keepalive returns `EAGAIN`, LXD gives up on mirroring output from the corresponding file descriptor, and subprocesses eventually hang in turn when the buffer for their standard output/error fills up. I've observed this particularly on slower architectures such as riscv64 and arm64 in the context of non-interactive `lxc exec` via `launchpad-buildd`, though I don't know the exact set of triggers and so don't have a minimal test case; the best reproducer I have is building a snap from `https://git.launchpad.net/~canonical-lxd/lxd latest-candidate` in `launchpad-buildd` on riscv64, which reliably hangs prior to this change.

Fixes #10034

Signed-off-by: Colin Watson <[email protected]>
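To make the approach concrete, here is a minimal sketch of gating websocket keepalives on the transport type, assuming gorilla/websocket. The `startKeepalive` helper and its interval handling are illustrative only, not the actual LXD diff:

```go
package main

import (
	"net"
	"time"

	"github.com/gorilla/websocket"
)

// startKeepalive sends periodic pings, but only when the websocket is
// backed by a TCP connection; on Unix sockets no pinger is started.
// (Illustrative sketch: real code over TLS would also need to unwrap
// the *tls.Conn before this type assertion can succeed.)
func startKeepalive(conn *websocket.Conn, interval time.Duration) {
	// UnderlyingConn exposes the net.Conn beneath the websocket.
	if _, ok := conn.UnderlyingConn().(*net.TCPConn); !ok {
		return // Unix socket (or other transport): skip keepalives.
	}

	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			deadline := time.Now().Add(5 * time.Second)
			if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
				return // Connection is gone; stop pinging.
			}
		}
	}()
}
```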
Thanks, LGTM!
Really nice find indeed. Reading the docs at https://pkg.go.dev/github.com/gorilla/websocket, the confusion around the use of ping/pong in gorilla/websocket#649, and the example at https://github.com/gorilla/websocket/blob/main/examples/filewatch/main.go, I can't tell whether these keepalive messages are correct for TCP either. I'm not sure ping/pong is supposed to be used for TCP keepalive functionality (why can't we use actual TCP keepalives for that?). I would also expect to see ping/pong handlers set somewhere, messages sent, and appropriate gorilla/websocket read and write timeouts configured; since the default handlers do nothing, the buffers would indeed eventually fill up if no other messages are flowing. How come nothing seems to read and consume the ping messages, or send a pong and handle it? I guess my question is: we are confident we have solved this bug for local sockets, but do we still have the same bug on TCP sockets?
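For context, the conventional gorilla/websocket heartbeat wiring this comment alludes to looks roughly like the following. The constants and the `heartbeat` helper follow the spirit of the library's published examples, not LXD's code:

```go
package main

import (
	"time"

	"github.com/gorilla/websocket"
)

const (
	pongWait   = 60 * time.Second    // how long we wait for a pong
	pingPeriod = (pongWait * 9) / 10 // must fire before pongWait expires
	writeWait  = 10 * time.Second    // deadline for writing a ping
)

// heartbeat pings the peer and extends the read deadline on each pong,
// so a silent or dead peer eventually fails the next read.
func heartbeat(conn *websocket.Conn) {
	_ = conn.SetReadDeadline(time.Now().Add(pongWait))
	conn.SetPongHandler(func(string) error {
		// A pong arrived, so the peer is alive: push the deadline out.
		return conn.SetReadDeadline(time.Now().Add(pongWait))
	})

	ticker := time.NewTicker(pingPeriod)
	defer ticker.Stop()
	for range ticker.C {
		if err := conn.WriteControl(websocket.PingMessage, nil, time.Now().Add(writeWait)); err != nil {
			return // write failed; assume the connection is dead
		}
	}
}
```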
I do plan to review the use of websocket ping/pong, indeed. The original intent of using both TCP keepalives and application-level pings was to ensure that an idle exec session was kept alive even when run through an HTTP proxy (where there are two TCP connections involved and we don't necessarily control both sides) and the proxy applies application-level timeouts, and this did indeed fix that. The default ping handler is documented as returning a pong message, so it didn't seem we needed a specific handler on either side (the default pong handler is documented as taking no action, but I interpreted that as the message being consumed and discarded); the ping/pong messages alone should be enough to stop the proxy killing the connection. However, I was looking at those examples earlier as well, and it seems we should inline the handling of the ping and pong messages into the main message-consumer goroutines rather than doing it separately.
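One caveat behind the point above: gorilla/websocket only invokes the ping/pong handlers (including the default ping handler that replies with a pong) from inside a read call. A minimal sketch of a consumer goroutine that keeps them serviced, with a hypothetical `handleData` callback, might look like:

```go
package main

import "github.com/gorilla/websocket"

// readLoop keeps draining the connection. Without a loop like this,
// incoming pings are never answered and buffers can back up.
func readLoop(conn *websocket.Conn, handleData func([]byte)) error {
	for {
		// ReadMessage processes any pending control frames first; the
		// default ping handler replies with a pong from here.
		_, message, err := conn.ReadMessage()
		if err != nil {
			return err // read deadline expired or connection closed
		}
		handleData(message) // application-level payload handling
	}
}
```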