[bug]: pong response failure #9043

AndySchroder · 2024-08-28T19:18:19Z

Background

peers don't stay connected

Your environment

lnd-v0.18.2

Expected behaviour

peers should stay connected and channels remain active.

Actual behaviour

I have two nodes on a local network, Node A and Node B. Node A has port 9735 open on the firewall. Node B has no open firewall ports. Restarting node B causes it to connect to Node A, but then after a few minutes, I get the following errors and channels go inactive. I just upgraded from v0.16.4-beta.rc1 to lnd-v0.18.2. I believe that it worked fine on v0.16.4-beta.rc1 .

Node A

2024-08-28 13:42:07.420 [WRN] PEER: Peer(B): pong response failure for [email protected]:57402: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.420 [INF] PEER: Peer(B): disconnecting [email protected]:57402, reason: pong response failure for [email protected]:57402: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.420 [INF] PEER: Peer(B): unable to read message from peer: read next header: read tcp 192.168.2.A:9735->192.168.2.B:57402: use of closed network connection
2024-08-28 13:42:07.521 [INF] DISC: Removing GossipSyncer for peer=B
2024-08-28 13:42:07.521 [INF] HSWC: ChannelLink(thechannel:1): stopping
2024-08-28 13:42:07.522 [INF] HSWC: ChannelLink(thechannel:1): exited
2024-08-28 13:42:07.522 [INF] HSWC: Removing channel link with ChannelID(thechannelid)

Node B

2024-08-28 13:42:07.435 [WRN] PEER: Peer(A): pong response failure for [email protected]:9735: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.436 [INF] PEER: Peer(A): disconnecting [email protected]:9735, reason: pong response failure for [email protected]:9735: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.538 [INF] DISC: Removing GossipSyncer for peer=A
2024-08-28 13:42:07.539 [INF] HSWC: ChannelLink(thechannel:1): stopping
2024-08-28 13:42:07.540 [INF] HSWC: ChannelLink(thechannel:1): exited
2024-08-28 13:42:07.541 [INF] HSWC: Removing channel link with ChannelID(thechannelid)

AndySchroder · 2024-08-28T19:49:52Z

Also, I have another node C. This node is on the same physical machine as node B. Node A and C can communicate together. Wondering if node A is getting confused between node B and C since they have the same IP address? Node C has port 9735 opened on the firewall.

Also, I have another node D. This is on the same physical machine as A. Node D can stay connected to node A.

Both node B and D have nolisten=true set in lnd.conf.

ViktorTigerstrom · 2024-09-01T23:57:09Z

Hi @AndySchroder,

Could you please check the following just for some initial clarifications:

Does the connection remain up if you remove:
nolisten=true
On node B?
Alternatively does the connection remain up if you keep the nolisten=true on node B, but never start Node C?

ziggie1984 · 2024-09-04T18:51:05Z

I just upgraded from v0.16.4-beta.rc1 to lnd-v0.18.2. I believe that it worked fine on v0.16.4-beta.rc1

Since LND 18 we do enforce pong messages and will disconnect the peer if the don't get a reply in 30sec. Something seems not right with the connection.

Can you set the PEER subsystem to trace and provide the logs.

AndySchroder · 2024-10-08T20:09:20Z

I just upgraded both nodes to lnd-v0.18.3, and I get the same result.

AndySchroder · 2024-10-08T23:24:20Z

Running lncli debuglevel --level PEER=trace seems to fill up the log so quickly it can't capture the whole event in one file, the log begins to rotate.

AndySchroder · 2024-10-08T23:55:25Z

I don't know if it was clear from my previous writing, but after going inactive, it never reconnects until lnd is restarted. Seems to me like both peers should periodically retry connecting every few minutes after disconnection.

AndySchroder · 2024-10-08T23:58:52Z

Does the connection remain up if you remove:
nolisten=true
On node B?

I set the following in lnd.conf on node B:

#nolisten=true
listen=:9999

I still get the same result.

Note, node B has port 9999 blocked on the firewall, which is what I want.

AndySchroder · 2024-10-09T00:13:21Z

2. Alternatively does the connection remain up if you keep the nolisten=true on node B, but never start Node C?

I still get the same result.

AndySchroder · 2024-10-09T00:38:50Z

Here are some more observations. For some reason node A is using IPv6 representation of the IPv4 address. I'm not sure why or if it matters. Also, the Recv-Q seems to be large for some reason on node B, but Send-Q is always 0 on node A. Once the connection goes inactive in lnd, the socket disappears from the ss command's output.

A@A:~$ ss '( sport = :9735 )' dst 192.168.2.B
Netid                      State                      Recv-Q                      Send-Q                                                    Local Address:Port                                                      Peer Address:Port                       Process                      
tcp                        ESTAB                      0                           0                                                 [::ffff:192.168.2.A]:9735                                             [::ffff:192.168.2.B]:57400                                                   



B@B:~$ ss '( dport = :9735 )' dst 192.168.2.A
Netid                        State                        Recv-Q                        Send-Q                                               Local Address:Port                                                  Peer Address:Port                        Process                        
tcp                          ESTAB                        168390                        0                                                     192.168.2.B:57400                                                 192.168.2.A:9735

AndySchroder · 2024-10-09T01:14:06Z

Okay, here is some new information. Node A and C have TOR installed. It seems as though node C is connecting to node A over TOR. Node B and D do not use TOR. I'd rather node A and C not use TOR to connect to each other, there is no reason to use TOR if they are on the same local network, but it seems that this bug triggered them to fall back to the advertised external address and it forgot the manually connected address.

Also, none of my nodes have externalip as I want to keep them private.

Can someone else test? It seems the problem is the following: One node behind firewall, the other not and no externalip defined on the machine that is not firewalled.

I think there are a few parts to the issue:

Disconnecting for unknown reason
Forgetting (until restart) the successful non tor, non public address connected to, so it won't reconnect.
Preferring a TOR address over a manual peer connection with a local IP address.

AndySchroder · 2024-10-09T01:42:34Z

Turning off TOR on nodes A and C caused them to automatically use the local network instead.

Also, the connection between node A and B now persists with TOR off!

Notes

Before turning TOR off, I was using tor.skip-proxy-for-clearnet-targets=1.
Node A has not had TOR installed for very long. It's possible I enabled it at the same time as I upgraded to lnd-v0.18.2. Node C has had TOR running for years.

AndySchroder · 2024-10-09T01:55:46Z

Turning TOR on only on node A results in the connections between node A and B and node A and C to persist. Node A and C connect via the local address and not TOR.

AndySchroder · 2024-10-09T02:07:21Z

Turning TOR on only on node C results in the connections between node A and B and node A and C to persist. Node A and C connect via the local address and not TOR.

So, it seems there is some kind of problem with running TOR on both machines. I'm guessing the tor.skip-proxy-for-clearnet-targets option is not working right and I'm exposing a bug because I'm using not using public addresses that would be accessible over TOR. I'm wondering if tor.skip-proxy-for-clearnet-targets is being turned off on the whole node if one peer uses TOR. Node A has no other .onion peers.

AndySchroder · 2024-10-09T03:37:34Z

For completeness, I re-tested with TOR enabled on both A and C and the connections remained stable. A and C are also connected via the local network. Weird, but good.

I now cannot reproduce this problem. It's possible there is some slowness to the gossip of the external TOR address and as soon as node C sees the advertised tor address for node A again, the problem will return.

Will have to report back in a few days.

AndySchroder · 2024-10-11T07:36:28Z

After 24 hours, node B's channel with node A was disconnected again.

I turned TOR off on node A and restarted LND for it to take affect. Node C was naturally disconnected by this restart, but never reconnected again. Also, node B never reconnected either.

After a while I restarted LND on node B and C. They reconnected on restart and have remained stable with TOR off.

I then restarted node A again, just to see if node B and C would automatically reconnect with TOR off on node A before they initially connected. In fact, node B and C automatically reconnected this time a minute after node A came back up.

So, it seems like the issue is very likely being triggered by TOR.

I don't want to turn TOR off on node C right now for further testing, but I think there is enough details here for someone else to try and reproduce the problem.

ellemouton · 2024-10-14T06:56:41Z

cc @ProofOfKeags - sounds like a matter of making ping/pong timeouts configurable and perhaps having different defaults there for Tor connections since we expect those to take longer.

AndySchroder · 2024-10-14T13:29:48Z

I don't think that is the issue here. The machine that is struggling to stay connected is on local gigabit ethernet and should not be using TOR. Latency is about 2ms.

ProofOfKeags · 2024-10-14T13:46:02Z

This isn't a timeout issue if it's on a local network.

AndySchroder added bug Unintended code behaviour needs triage labels Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug]: pong response failure #9043

[bug]: pong response failure #9043

AndySchroder commented Aug 28, 2024

AndySchroder commented Aug 28, 2024 •

edited

Loading

ViktorTigerstrom commented Sep 1, 2024

ziggie1984 commented Sep 4, 2024

AndySchroder commented Oct 8, 2024

AndySchroder commented Oct 8, 2024

AndySchroder commented Oct 8, 2024

AndySchroder commented Oct 8, 2024 •

edited

Loading

AndySchroder commented Oct 9, 2024

AndySchroder commented Oct 9, 2024

AndySchroder commented Oct 9, 2024 •

edited

Loading

AndySchroder commented Oct 9, 2024 •

edited

Loading

AndySchroder commented Oct 9, 2024

AndySchroder commented Oct 9, 2024 •

edited

Loading

AndySchroder commented Oct 9, 2024

AndySchroder commented Oct 11, 2024

ellemouton commented Oct 14, 2024

AndySchroder commented Oct 14, 2024

ProofOfKeags commented Oct 14, 2024

[bug]: pong response failure #9043

[bug]: pong response failure #9043

Comments

AndySchroder commented Aug 28, 2024

Background

Your environment

Expected behaviour

Actual behaviour

AndySchroder commented Aug 28, 2024 • edited Loading

ViktorTigerstrom commented Sep 1, 2024

ziggie1984 commented Sep 4, 2024

AndySchroder commented Oct 8, 2024

AndySchroder commented Oct 8, 2024

AndySchroder commented Oct 8, 2024

AndySchroder commented Oct 8, 2024 • edited Loading

AndySchroder commented Oct 9, 2024

AndySchroder commented Oct 9, 2024

AndySchroder commented Oct 9, 2024 • edited Loading

AndySchroder commented Oct 9, 2024 • edited Loading

AndySchroder commented Oct 9, 2024

AndySchroder commented Oct 9, 2024 • edited Loading

AndySchroder commented Oct 9, 2024

AndySchroder commented Oct 11, 2024

ellemouton commented Oct 14, 2024

AndySchroder commented Oct 14, 2024

ProofOfKeags commented Oct 14, 2024

AndySchroder commented Aug 28, 2024 •

edited

Loading

AndySchroder commented Oct 8, 2024 •

edited

Loading

AndySchroder commented Oct 9, 2024 •

edited

Loading

AndySchroder commented Oct 9, 2024 •

edited

Loading

AndySchroder commented Oct 9, 2024 •

edited

Loading