Try to reconnect to the eth node if the connection fails. #1093
palango merged 3 commits into raiden-network:master from palango:reconnect
Conversation
raiden/raiden_service.py
Outdated
| self.alarm.stop_async()
| self.protocol.stop_and_wait()
|
| timeout_s = 2
Please refrain from unnecessary single-letter abbreviations; timeout_seconds would be just fine here.
IMO this also should be made into a config in DEFAULT_CONFIG, under a new key ethereum_node or something like it.
should we add a command line flag for changing this as well?
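For illustration only, a config entry along these lines might look like the sketch below; the `ethereum_node` key and the field name are assumptions, not Raiden's actual settings schema:

```python
# Hypothetical sketch only: the key and field names below are made up for
# illustration and are not the real entries in Raiden's DEFAULT_CONFIG.
DEFAULT_CONFIG = {
    # ... existing configuration keys ...
    'ethereum_node': {
        # seconds to wait between attempts to reach the ethereum node
        'connection_retry_timeout_seconds': 2,
    },
}
```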
raiden/blockchain/events.py
Outdated
| self.event_listeners = list()
| result = list()
| reinstalled_filters = True
| self.add_proxies_listeners(get_relevant_proxies(
Not sure if recreating the proxy instances here is the right approach (i.e. calling get_relevant_proxies). I suspect that we double the proxy instances with this approach (i.e. the old instances may still be referenced from some other node).
I think konrad is correct here. Perhaps try to see where we save the data you need when we install proxies at raiden_service and extract it from there?
Can you try to re-install from self.event_listeners, or, if they don't hold enough information, extend them to allow for re-installation? That way we also won't have the potentially cyclic dependency of self.chain in this class.
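For illustration, extending the listeners along these lines could look roughly like the sketch below; the field names and the reinstall_listeners helper are assumptions, not the actual Raiden code:

```python
from collections import namedtuple

# Hypothetical shape: keeping the callable that created the filter makes each
# listener self-contained enough to recreate its filter after a reconnect.
EventListener = namedtuple(
    'EventListener',
    ('event_name', 'filter', 'abi', 'filter_creation_function'),
)


def reinstall_listeners(event_listeners, from_block):
    """Recreate every filter without touching the chain proxies again."""
    return [
        listener._replace(
            filter=listener.filter_creation_function(from_block=from_block),
        )
        for listener in event_listeners
    ]
```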
I think it's fine to reinstall the filters. Since filters are not persistent, once the node is restarted and the id from the previous run is used, the error will be "filter not found"; reinstalling the filters seems fine under this circumstance.
@LefterisJP The filters are kept inside the BlockchainEvents only.
NVM, you guys are talking about the proxies, not the filters, my bad.
There are problems with this patch though.
- It may miss events under some race conditions. This will happen if the filters miss a block before being re-installed.
- It's passing the `chain` to yet another object, this may be better handled by something else.
Example for 1.:
- Block 5
- Events are polled
- The node goes offline
- Block 6
- Block 7
- The filters are reinstalled
Under the above scenario, events from block 6 will be lost. Setting [`fromBlock`](https://github.com/ethereum/wiki/wiki/JSON-RPC#eth_newfilter) to the latest polled block should be fine.
Note: Processing the same event multiple times is not a problem, events must be idempotent.
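A minimal sketch of the polling loop implied by this suggestion, assuming the poller remembers the last block it processed. The `poll_all_event_listeners` and `reinstall_listeners` calls and the callback parameters are assumed interfaces, not the real ones, and in Raiden the polling is actually driven by the alarm task, so this standalone loop is only illustrative:

```python
import requests


def poll_forever(chain, blockchain_events, handle_event, wait_for_reconnect):
    """Poll for events and survive node disconnects without skipping blocks."""
    last_polled_block = chain.block_number()
    while True:
        try:
            for event in blockchain_events.poll_all_event_listeners():
                handle_event(event)  # handlers must be idempotent
            last_polled_block = chain.block_number()
        except requests.exceptions.ConnectionError:
            wait_for_reconnect()
            # Recreate the filters starting at the last block that was already
            # polled, so events mined while the node was unreachable are not lost.
            blockchain_events.reinstall_listeners(from_block=last_polled_block)
```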
Any recommendations on how to handle point 2? I'm not sure how to architect this in a nice way.
raiden/network/rpc/client.py
Outdated
| """ | ||
|
|
||
| def _checkNodeConnection(func, *args, **kwargs): | ||
| def retry_waits(): |
I think this should reuse the timeout_exponential_backoff generator (better move it outside that module)
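For reference, a retry decorator built around an exponential-backoff wait generator could look like the sketch below; this does not claim to match the real `timeout_exponential_backoff` signature and is only an illustration of the idea:

```python
import functools
import time

import requests


def exponential_backoff(start=2.0, factor=2.0, maximum=60.0):
    """Yield an ever-growing wait time in seconds, capped at `maximum`."""
    wait = start
    while True:
        yield wait
        wait = min(wait * factor, maximum)


def check_node_connection(func):
    """Retry the wrapped RPC call until the ethereum node answers again."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for wait in exponential_backoff():
            try:
                return func(*args, **kwargs)
            except requests.exceptions.ConnectionError:
                # gevent.sleep would be used in the real, gevent-based code
                time.sleep(wait)
    return wrapper
```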
raiden/network/rpc/client.py
Outdated
| for c in changes
| ]
|
| @_checkNodeConnection
what about the transactions?
Btw, I wonder if the transaction pool is actually persisted; if that is the case the retries may fail with a "known transaction" error, as in #1061.
Apparently it is persistent for one's own transactions.
So, it raises the exception with the message "known transaction"? We need some way to tell if that's bad or good, and handle errors like the one in the PR.
If you want we can merge this and look at the repeated transaction later.
@palango you will need to rebase on top of the changes that split the proxies and filters into their own files.

Yeah, didn't want to rebase before I knew how stuff would work.

Ok, I think this is ready for a second round of reviews. Thanks for the good feedback.
| # contact the disconnected client
| try:
|     with gevent.Timeout(self.shutdown_timeout):
|         self.blockchain_events.uninstall_all_event_listeners()
If the node is offline, the filters are already gone, I don't think there is much use for waiting for it to come back online.
The uninstall_all_event_listeners function calls uninstall on every Filter, which results in an RPC call that will block.
Yes, the point is that the RPC call is useless if the ethereum node is offline. Ideally the code would just give up on uninstalling if the ethereum node is offline, instead of waiting for the timeout.
Note: I'm assuming that raiden and the ethereum node are running on the same machine; this actually makes sense for communication through the network, since we can't know for sure if the node is offline or the network is hiccuping.
I agree with the reasoning but I'm not sure how to handle that. Might be easiest to add a connected property to the JSONRPCClient and set it accordingly in the _checkNodeConnection decorator. Do you think that would work?
Regarding your assumption: I'd say it's a valid assumption to have the ethereum node and raiden on the same machine. Did we discuss that?
> Do you think that would work?

Yes, although it's a bit convoluted; can't you expose a version of the rpc client that does not retry by default?
> Did we discuss that?

I don't recall this being made an explicit assumption.
The problem with exposing different kinds of clients is that most of the interaction happens over the BlockchainService object. Another option would be to introduce a per-method retry parameter. However, I think that complicates things more than using the timeout in the end.
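A minimal sketch of the shutdown path under discussion, assuming a hypothetical `connected` flag on the RPC client; the real code may be structured quite differently:

```python
import logging

import gevent

log = logging.getLogger(__name__)


def uninstall_listeners_on_shutdown(chain, blockchain_events, shutdown_timeout):
    # If we already know the node is offline the filters are gone anyway,
    # so don't block the shutdown on RPC calls that cannot succeed.
    # `connected` is a hypothetical attribute, not part of the real client.
    if not getattr(chain.client, 'connected', True):
        return
    try:
        with gevent.Timeout(shutdown_timeout):
            blockchain_events.uninstall_all_event_listeners()
    except gevent.Timeout:
        log.warning('Timed out uninstalling filters during shutdown')
```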
| for event_listener in self.event_listeners:
|     new_listener = EventListener(
|         event_listener.event_name,
|         event_listener.filter_creation_function(from_block=from_block),
Is the from_block inclusive?
I believe the value of self._blocknumber from RaidenService is the current block, which is being polled for events.
Note: This actually depends on the order of which the callbacks are installed with the alarm task, if the polling is executed first then self._blocknumber is the previous block.
As of this it is inclusive, not sure about parity though.
Btw, could you add a test for this under the assumptions?
| log.info('Client reconnected')
| return result
|
| except (requests.exceptions.ConnectionError, InvalidReplyError):
Why is the client considered offline when an InvalidReplyError is raised?
I'm wondering what is the content of the response, perhaps the node restarted and Raiden polled for a filter that is gone?
Just after the disconnect, tinyrpc gets an empty response, which leads to the InvalidReplyError; in my testing it only happens in this case.
Not sure why this happens.
Ok, next version. I removed the decorator from … It would be nice to merge this now and fix some of the issues you brought up later. These are:

I think these are important but can be handled after this PR. Opinions?
Fixes #707.
This uses the same approach as #767 but a different timeout strategy. I discussed it with Konrad and we think it doesn't make sense to kill the raiden instance. So now raiden tries to reconnect every 3 seconds during the first minute, then every 10 seconds.
One issue I found is that one cannot Ctrl-C stop raiden while it's trying to reconnect. Not sure atm why the events aren't propagated. This is now fixed by adding timeouts to the shutdown sequence.
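For reference, the reconnect schedule described above (every 3 seconds during the first minute, then every 10 seconds) could be expressed as a simple wait-time generator; this is only an illustration of the strategy, not the code in the PR:

```python
def reconnect_waits(initial_wait=3, initial_period=60, later_wait=10):
    """Yield how many seconds to sleep before each reconnection attempt."""
    elapsed = 0
    while elapsed < initial_period:
        yield initial_wait
        elapsed += initial_wait
    while True:
        yield later_wait
```

Each failed connection attempt would pull the next value from the generator and sleep for that long before retrying.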