[bug]: pong response failure #9043
Also, I have another node, C, on the same physical machine as node B. Nodes A and C can communicate with each other. I'm wondering whether node A is getting confused between nodes B and C, since they have the same IP address. Node C has port 9735 open on the firewall. I also have a node D on the same physical machine as node A. Node D stays connected to node A. Both nodes B and D have …
Hi @AndySchroder, could you please check the following for some initial clarification:
Since LND 0.18, we enforce pong messages and will disconnect the peer if we don't get a reply within 30 seconds. Something doesn't seem right with the connection. Can you set the PEER subsystem to trace and provide the logs?
I just upgraded both nodes to lnd-v0.18.3, and I get the same result.
Running …
I don't know if it was clear from my previous comment, but after going inactive, the peers never reconnect until lnd is restarted. It seems to me that both peers should retry connecting every few minutes after a disconnection.
I set the following in …
I still get the same result. Note, node B has port …
I still get the same result.
Here are some more observations. For some reason, node A is using the IPv6 representation of the IPv4 address. I'm not sure why, or whether it matters. Also, the …
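An IPv6 representation of an IPv4 address is typically an IPv4-mapped address of the form `::ffff:a.b.c.d`, which commonly shows up when a dual-stack socket accepts an IPv4 connection; it refers to the same host. The sketch below uses Go's standard `net/netip` package to show this; `normalizeAddr` is a hypothetical helper and the address is an example value.

```go
package main

import (
	"fmt"
	"net/netip"
)

// normalizeAddr returns the plain IPv4 form of an IPv4-mapped IPv6
// address (::ffff:a.b.c.d) and leaves other addresses unchanged.
func normalizeAddr(s string) (netip.Addr, error) {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return netip.Addr{}, err
	}
	return addr.Unmap(), nil
}

func main() {
	mapped := "::ffff:192.168.2.10" // example LAN address in IPv4-mapped form
	a, err := netip.ParseAddr(mapped)
	if err != nil {
		panic(err)
	}
	fmt.Println(a.Is4In6()) // true: an IPv4 address carried in IPv6 form

	n, err := normalizeAddr(mapped)
	if err != nil {
		panic(err)
	}
	fmt.Println(n, n.Is4()) // 192.168.2.10 true
}
```

So the representation by itself most likely does not change which machine node A is talking to.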
Okay, here is some new information. Nodes A and C have Tor installed, and it seems that node C is connecting to node A over Tor. Nodes B and D do not use Tor. I'd rather nodes A and C not use Tor to connect to each other, since there is no reason to use Tor on the same local network, but it seems this bug caused them to fall back to the advertised external address and forget the manually connected address. Also, none of my nodes have …
Can someone else test? It seems the problem is the following: one node is behind a firewall, the other is not, and no …
I think there are a few parts to the issue: …
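The behaviour the reporter expects (keep using the manually connected LAN address, and only fall back to an advertised Tor address as a last resort) can be expressed as a simple address ranking. This is a hypothetical sketch, not how lnd actually chooses dial addresses; `pickDialAddr` and `rank` are illustrative names, and inputs are assumed to be `host:port` strings.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// rank gives lower numbers to more preferred host:port addresses.
func rank(hostPort string) int {
	if ap, err := netip.ParseAddrPort(hostPort); err == nil {
		if ap.Addr().IsPrivate() {
			return 0 // LAN address, e.g. 192.168.x.x
		}
		return 1 // public IPv4/IPv6
	}
	host, _, _ := strings.Cut(hostPort, ":")
	if strings.HasSuffix(host, ".onion") {
		return 3 // Tor onion service: last resort
	}
	return 2 // hostname or anything else
}

// pickDialAddr prefers a manually configured address; otherwise it picks
// the best-ranked advertised address.
func pickDialAddr(manual string, advertised []string) string {
	if manual != "" {
		return manual
	}
	best, bestRank := "", 1<<30
	for _, a := range advertised {
		if r := rank(a); r < bestRank {
			best, bestRank = a, r
		}
	}
	return best
}

func main() {
	adv := []string{"exampleonionaddress.onion:9735", "203.0.113.7:9735", "192.168.2.10:9735"}
	fmt.Println(pickDialAddr("", adv))                  // 192.168.2.10:9735
	fmt.Println(pickDialAddr("192.168.2.20:9735", adv)) // 192.168.2.20:9735
}
```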
Turning off Tor on nodes A and C caused them to automatically use the local network instead. Also, the connection between nodes A and B now persists with Tor off!
Notes: …
With Tor enabled only on node A, the connections between nodes A and B and between nodes A and C persist. Nodes A and C connect via the local address, not Tor.
With Tor enabled only on node C, the connections between nodes A and B and between nodes A and C persist. Nodes A and C connect via the local address, not Tor. So it seems there is some kind of problem with running Tor on both machines. I'm guessing the …
For completeness, I re-tested with Tor enabled on both A and C, and the connections remained stable. A and C are also connected via the local network. Weird, but good. I can no longer reproduce this problem. It's possible that the external Tor address is slow to propagate via gossip, and as soon as node C sees node A's advertised Tor address again, the problem will return. I'll report back in a few days.
After 24 hours, node B's channel with node A was disconnected again. I turned Tor off on node A and restarted LND for it to take effect. Node C was naturally disconnected by this restart, but never reconnected. Node B never reconnected either. After a while, I restarted LND on nodes B and C. They reconnected on restart and have remained stable with Tor off. I then restarted node A again, to see whether nodes B and C would automatically reconnect when Tor was already off on node A before they connected. In fact, this time nodes B and C automatically reconnected about a minute after node A came back up. So the issue very likely is being triggered by Tor. I don't want to turn Tor off on node C right now for further testing, but I think there are enough details here for someone else to try to reproduce the problem.
cc @ProofOfKeags: this sounds like a matter of making the ping/pong timeouts configurable, perhaps with different defaults for Tor connections, since we expect those to take longer.
I don't think that is the issue here. The machine that is struggling to stay connected is on local gigabit Ethernet and should not be using Tor. Latency is about 2 ms.
This isn't a timeout issue if it's on a local network.
Background
Peers don't stay connected.
Your environment
lnd-v0.18.2
Expected behaviour
Peers should stay connected and channels should remain active.
Actual behaviour
I have two nodes on a local network, node A and node B. Node A has port 9735 open on the firewall; node B has no open firewall ports. Restarting node B causes it to connect to node A, but after a few minutes I get the following errors and the channels go inactive. I just upgraded from v0.16.4-beta.rc1 to lnd-v0.18.2. I believe it worked fine on v0.16.4-beta.rc1.
Node A
2024-08-28 13:42:07.420 [WRN] PEER: Peer(B): pong response failure for B@192.168.2.B:57402: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.420 [INF] PEER: Peer(B): disconnecting B@192.168.2.B:57402, reason: pong response failure for B@192.168.2.B:57402: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.420 [INF] PEER: Peer(B): unable to read message from peer: read next header: read tcp 192.168.2.A:9735->192.168.2.B:57402: use of closed network connection
2024-08-28 13:42:07.521 [INF] DISC: Removing GossipSyncer for peer=B
2024-08-28 13:42:07.521 [INF] HSWC: ChannelLink(thechannel:1): stopping
2024-08-28 13:42:07.522 [INF] HSWC: ChannelLink(thechannel:1): exited
2024-08-28 13:42:07.522 [INF] HSWC: Removing channel link with ChannelID(thechannelid)
Node B
2024-08-28 13:42:07.435 [WRN] PEER: Peer(A): pong response failure for A@192.168.2.A:9735: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.436 [INF] PEER: Peer(A): disconnecting A@192.168.2.A:9735, reason: pong response failure for A@192.168.2.A:9735: timeout while waiting for pong response -- disconnecting
2024-08-28 13:42:07.538 [INF] DISC: Removing GossipSyncer for peer=A
2024-08-28 13:42:07.539 [INF] HSWC: ChannelLink(thechannel:1): stopping
2024-08-28 13:42:07.540 [INF] HSWC: ChannelLink(thechannel:1): exited
2024-08-28 13:42:07.541 [INF] HSWC: Removing channel link with ChannelID(thechannelid)