Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug]: pong response failure #9043

Open
AndySchroder opened this issue Aug 28, 2024 · 18 comments
Open

[bug]: pong response failure #9043

AndySchroder opened this issue Aug 28, 2024 · 18 comments
Labels
bug Unintended code behaviour needs triage

Comments

@AndySchroder
Copy link

Background

peers don't stay connected

Your environment

lnd-v0.18.2

Expected behaviour

peers should stay connected and channels remain active.

Actual behaviour

I have two nodes on a local network, Node A and Node B. Node A has port 9735 open on the firewall. Node B has no open firewall ports. Restarting node B causes it to connect to Node A, but then after a few minutes, I get the following errors and channels go inactive. I just upgraded from v0.16.4-beta.rc1 to lnd-v0.18.2. I believe that it worked fine on v0.16.4-beta.rc1 .

Node A

2024-08-28 13:42:07.420 [WRN] PEER: Peer(B): pong response failure for [email protected]:57402: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.420 [INF] PEER: Peer(B): disconnecting [email protected]:57402, reason: pong response failure for [email protected]:57402: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.420 [INF] PEER: Peer(B): unable to read message from peer: read next header: read tcp 192.168.2.A:9735->192.168.2.B:57402: use of closed network connection
2024-08-28 13:42:07.521 [INF] DISC: Removing GossipSyncer for peer=B
2024-08-28 13:42:07.521 [INF] HSWC: ChannelLink(thechannel:1): stopping
2024-08-28 13:42:07.522 [INF] HSWC: ChannelLink(thechannel:1): exited
2024-08-28 13:42:07.522 [INF] HSWC: Removing channel link with ChannelID(thechannelid)

Node B

2024-08-28 13:42:07.435 [WRN] PEER: Peer(A): pong response failure for [email protected]:9735: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.436 [INF] PEER: Peer(A): disconnecting [email protected]:9735, reason: pong response failure for [email protected]:9735: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.538 [INF] DISC: Removing GossipSyncer for peer=A
2024-08-28 13:42:07.539 [INF] HSWC: ChannelLink(thechannel:1): stopping
2024-08-28 13:42:07.540 [INF] HSWC: ChannelLink(thechannel:1): exited
2024-08-28 13:42:07.541 [INF] HSWC: Removing channel link with ChannelID(thechannelid)

@AndySchroder AndySchroder added bug Unintended code behaviour needs triage labels Aug 28, 2024
@AndySchroder
Copy link
Author

AndySchroder commented Aug 28, 2024

Also, I have another node C. This node is on the same physical machine as node B. Node A and C can communicate together. Wondering if node A is getting confused between node B and C since they have the same IP address? Node C has port 9735 opened on the firewall.

Also, I have another node D. This is on the same physical machine as A. Node D can stay connected to node A.

Both node B and D have nolisten=true set in lnd.conf.

@ViktorTigerstrom
Copy link
Collaborator

Hi @AndySchroder,

Could you please check the following just for some initial clarifications:

  1. Does the connection remain up if you remove:
    nolisten=true
    On node B?

  2. Alternatively does the connection remain up if you keep the nolisten=true on node B, but never start Node C?

@ziggie1984
Copy link
Collaborator

I just upgraded from v0.16.4-beta.rc1 to lnd-v0.18.2. I believe that it worked fine on v0.16.4-beta.rc1

Since LND 18 we do enforce pong messages and will disconnect the peer if the don't get a reply in 30sec. Something seems not right with the connection.

Can you set the PEER subsystem to trace and provide the logs.

@AndySchroder
Copy link
Author

I just upgraded both nodes to lnd-v0.18.3, and I get the same result.

@AndySchroder
Copy link
Author

Running lncli debuglevel --level PEER=trace seems to fill up the log so quickly it can't capture the whole event in one file, the log begins to rotate.

@AndySchroder
Copy link
Author

I don't know if it was clear from my previous writing, but after going inactive, it never reconnects until lnd is restarted. Seems to me like both peers should periodically retry connecting every few minutes after disconnection.

@AndySchroder
Copy link
Author

AndySchroder commented Oct 8, 2024

  1. Does the connection remain up if you remove:
    nolisten=true
    On node B?

I set the following in lnd.conf on node B:

#nolisten=true
listen=:9999

I still get the same result.

Note, node B has port 9999 blocked on the firewall, which is what I want.

@AndySchroder
Copy link
Author

2. Alternatively does the connection remain up if you keep the nolisten=true on node B, but never start Node C?

I still get the same result.

@AndySchroder
Copy link
Author

Here are some more observations. For some reason node A is using IPv6 representation of the IPv4 address. I'm not sure why or if it matters. Also, the Recv-Q seems to be large for some reason on node B, but Send-Q is always 0 on node A. Once the connection goes inactive in lnd, the socket disappears from the ss command's output.

A@A:~$ ss '( sport = :9735 )' dst 192.168.2.B
Netid                      State                      Recv-Q                      Send-Q                                                    Local Address:Port                                                      Peer Address:Port                       Process                      
tcp                        ESTAB                      0                           0                                                 [::ffff:192.168.2.A]:9735                                             [::ffff:192.168.2.B]:57400                                                   



B@B:~$ ss '( dport = :9735 )' dst 192.168.2.A
Netid                        State                        Recv-Q                        Send-Q                                               Local Address:Port                                                  Peer Address:Port                        Process                        
tcp                          ESTAB                        168390                        0                                                     192.168.2.B:57400                                                 192.168.2.A:9735                                                       



@AndySchroder
Copy link
Author

AndySchroder commented Oct 9, 2024

Okay, here is some new information. Node A and C have TOR installed. It seems as though node C is connecting to node A over TOR. Node B and D do not use TOR. I'd rather node A and C not use TOR to connect to each other, there is no reason to use TOR if they are on the same local network, but it seems that this bug triggered them to fall back to the advertised external address and it forgot the manually connected address.

Also, none of my nodes have externalip as I want to keep them private.

Can someone else test? It seems the problem is the following: One node behind firewall, the other not and no externalip defined on the machine that is not firewalled.

I think there are a few parts to the issue:

  1. Disconnecting for unknown reason
  2. Forgetting (until restart) the successful non tor, non public address connected to, so it won't reconnect.
  3. Preferring a TOR address over a manual peer connection with a local IP address.

@AndySchroder
Copy link
Author

AndySchroder commented Oct 9, 2024

Turning off TOR on nodes A and C caused them to automatically use the local network instead.

Also, the connection between node A and B now persists with TOR off!

Notes

  • Before turning TOR off, I was using tor.skip-proxy-for-clearnet-targets=1.
  • Node A has not had TOR installed for very long. It's possible I enabled it at the same time as I upgraded to lnd-v0.18.2. Node C has had TOR running for years.

@AndySchroder
Copy link
Author

Turning TOR on only on node A results in the connections between node A and B and node A and C to persist. Node A and C connect via the local address and not TOR.

@AndySchroder
Copy link
Author

AndySchroder commented Oct 9, 2024

Turning TOR on only on node C results in the connections between node A and B and node A and C to persist. Node A and C connect via the local address and not TOR.

So, it seems there is some kind of problem with running TOR on both machines. I'm guessing the tor.skip-proxy-for-clearnet-targets option is not working right and I'm exposing a bug because I'm using not using public addresses that would be accessible over TOR. I'm wondering if tor.skip-proxy-for-clearnet-targets is being turned off on the whole node if one peer uses TOR. Node A has no other .onion peers.

@AndySchroder
Copy link
Author

For completeness, I re-tested with TOR enabled on both A and C and the connections remained stable. A and C are also connected via the local network. Weird, but good.

I now cannot reproduce this problem. It's possible there is some slowness to the gossip of the external TOR address and as soon as node C sees the advertised tor address for node A again, the problem will return.

Will have to report back in a few days.

@AndySchroder
Copy link
Author

After 24 hours, node B's channel with node A was disconnected again.

I turned TOR off on node A and restarted LND for it to take affect. Node C was naturally disconnected by this restart, but never reconnected again. Also, node B never reconnected either.

After a while I restarted LND on node B and C. They reconnected on restart and have remained stable with TOR off.

I then restarted node A again, just to see if node B and C would automatically reconnect with TOR off on node A before they initially connected. In fact, node B and C automatically reconnected this time a minute after node A came back up.

So, it seems like the issue is very likely being triggered by TOR.

I don't want to turn TOR off on node C right now for further testing, but I think there is enough details here for someone else to try and reproduce the problem.

@ellemouton
Copy link
Collaborator

cc @ProofOfKeags - sounds like a matter of making ping/pong timeouts configurable and perhaps having different defaults there for Tor connections since we expect those to take longer.

@AndySchroder
Copy link
Author

I don't think that is the issue here. The machine that is struggling to stay connected is on local gigabit ethernet and should not be using TOR. Latency is about 2ms.

@ProofOfKeags
Copy link
Collaborator

This isn't a timeout issue if it's on a local network.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unintended code behaviour needs triage
Projects
None yet
Development

No branches or pull requests

5 participants