Try to reconnect to the eth node if the connection fails. #1093
palango merged 3 commits into raiden-network:master from palango:reconnect
Conversation
raiden/raiden_service.py
Outdated
| self.alarm.stop_async()
| self.protocol.stop_and_wait()
|
| timeout_s = 2
Please refrain from unnecessary single-letter abbreviations; timeout_seconds would be just fine here.
IMO this also should be made into a config in DEFAULT_CONFIG, under a new key ethereum_node or something like it.
should we add a command line flag for changing this as well?
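For illustration only, a config entry along these lines might look like the sketch below; the `ethereum_node` key and the field name are assumptions, not Raiden's actual settings schema:

```python
# Hypothetical sketch only: the key and field names below are made up for
# illustration and are not the real entries in Raiden's DEFAULT_CONFIG.
DEFAULT_CONFIG = {
    # ... existing configuration keys ...
    'ethereum_node': {
        # seconds to wait between attempts to reach the ethereum node
        'connection_retry_timeout_seconds': 2,
    },
}
```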
raiden/blockchain/events.py
Outdated
| self.event_listeners = list()
| result = list()
| reinstalled_filters = True
| self.add_proxies_listeners(get_relevant_proxies(
Not sure if recreating the proxy instances here is the right approach (i.e. calling get_relevant_proxies). I suspect that we double the proxy instances with this approach (i.e. the old instances may still be referenced from some other node).
I think konrad is correct here. Perhaps try to see where we save the data you need when we install proxies at raiden_service and extract it from there?
Can you try to re-install from self.event_listeners, or, if they don't hold enough information, extend them to allow for re-installation? That way we also won't have the potentially cyclic dependency of self.chain in this class.
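For illustration, extending the listeners along these lines could look roughly like the sketch below; the field names and the reinstall_listeners helper are assumptions, not the actual Raiden code:

```python
from collections import namedtuple

# Hypothetical shape: keeping the callable that created the filter makes each
# listener self-contained enough to recreate its filter after a reconnect.
EventListener = namedtuple(
    'EventListener',
    ('event_name', 'filter', 'abi', 'filter_creation_function'),
)


def reinstall_listeners(event_listeners, from_block):
    """Recreate every filter without touching the chain proxies again."""
    return [
        listener._replace(
            filter=listener.filter_creation_function(from_block=from_block),
        )
        for listener in event_listeners
    ]
```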
I think it's fine to reinstall the filters. Since filters are not persistent, once the node is restarted and the id from the previous run is used, the error will be "filter not found"; reinstalling the filters seems fine under this circumstance.
@LefterisJP The filters are kept inside the BlockchainEvents only.
NVM, you guys are talking about the proxies, not the filters, my bad.
There are problems with this patch though.
- It may miss events under some race conditions. This will happen if the filters miss a block before being re-installed.
- It's passing the `chain` to yet another object, this may be better handled by something else.
Example for 1.:
- Block 5
- Events are polled
- The node goes offline
- Block 6
- Block 7
- The filters are reinstalled
Under the above scenario, events from block 6 will be lost. Setting [`fromBlock`](https://github.com/ethereum/wiki/wiki/JSON-RPC#eth_newfilter) to the latest polled block should be fine.
Note: Processing the same event multiple times is not a problem, events must be idempotent.
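A minimal sketch of the polling loop implied by this suggestion, assuming the poller remembers the last block it processed. The `poll_all_event_listeners` and `reinstall_listeners` calls and the callback parameters are assumed interfaces, not the real ones, and in Raiden the polling is actually driven by the alarm task, so this standalone loop is only illustrative:

```python
import requests


def poll_forever(chain, blockchain_events, handle_event, wait_for_reconnect):
    """Poll for events and survive node disconnects without skipping blocks."""
    last_polled_block = chain.block_number()
    while True:
        try:
            for event in blockchain_events.poll_all_event_listeners():
                handle_event(event)  # handlers must be idempotent
            last_polled_block = chain.block_number()
        except requests.exceptions.ConnectionError:
            wait_for_reconnect()
            # Recreate the filters starting at the last block that was already
            # polled, so events mined while the node was unreachable are not lost.
            blockchain_events.reinstall_listeners(from_block=last_polled_block)
```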
Any recommendations on how to handle point 2? I'm not sure how to architect this in a nice way.
raiden/network/rpc/client.py
Outdated
| """ | ||
|
|
||
| def _checkNodeConnection(func, *args, **kwargs): | ||
| def retry_waits(): |
I think this should reuse the timeout_exponential_backoff generator (better move it outside that module)
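For reference, a retry decorator built around an exponential-backoff wait generator could look like the sketch below; this does not claim to match the real `timeout_exponential_backoff` signature and is only an illustration of the idea:

```python
import functools
import time

import requests


def exponential_backoff(start=2.0, factor=2.0, maximum=60.0):
    """Yield an ever-growing wait time in seconds, capped at `maximum`."""
    wait = start
    while True:
        yield wait
        wait = min(wait * factor, maximum)


def check_node_connection(func):
    """Retry the wrapped RPC call until the ethereum node answers again."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for wait in exponential_backoff():
            try:
                return func(*args, **kwargs)
            except requests.exceptions.ConnectionError:
                # gevent.sleep would be used in the real, gevent-based code
                time.sleep(wait)
    return wrapper
```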
raiden/network/rpc/client.py
Outdated
| for c in changes
| ]
|
| @_checkNodeConnection
what about the transactions?
Btw, I wonder if the transaction pool is actually persisted; if that is the case the retries may fail with a "known transaction" error, as in #1061.
Apparently it is persistent for one's own transactions.
So, it raises the exception with the message "known transaction"? We need some way to tell if that's bad or good, and handle errors like the one in the PR.
If you want we can merge this and look at the repeated transaction later.
@palango you will need to rebase on top of the changes that split the proxies and filters into their own files.

Yeah, didn't want to rebase before I knew how stuff would work.

Ok, I think this is ready for a second round of reviews. Thanks for the good feedback.
| # contact the disconnected client
| try:
|     with gevent.Timeout(self.shutdown_timeout):
|         self.blockchain_events.uninstall_all_event_listeners()
If the node is offline, the filters are already gone, I don't think there is much use for waiting for it to come back online.
The uninstall_all_event_listeners function calls uninstall on every Filter, which results in an RPC call that will block.
Yes, the point is that the RPC call is useless if the ethereum node is offline. Ideally the code would just give up on uninstalling if the ethereum node is offline, instead of waiting for the timeout.
Note: I'm assuming that raiden and the ethereum node are running on the same machine; this actually makes sense for communication through the network, since we can't know for sure if the node is offline or the network is hiccuping.
I agree with the reasoning but I'm not sure how to handle that. Might be easiest to add a connected property to the JSONRPCClient and set it accordingly in the _checkNodeConnection decorator. Do you think that would work?
Regarding your assumption: I'd say it's a valid assumption to have the ethereum node and raiden on the same machine. Did we discuss that?
> Do you think that would work?

Yes, although it's a bit convoluted; can't you expose a version of the rpc client that does not retry by default?
> Did we discuss that?

I don't recall this being made an explicit assumption.
The problem with exposing different kinds of clients is that most of the interaction happens over the BlockchainService object. Another option would be to introduce a per-method retry parameter. However, I think that complicates things more than using the timeout in the end.
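A minimal sketch of the shutdown path under discussion, assuming a hypothetical `connected` flag on the RPC client; the real code may be structured quite differently:

```python
import logging

import gevent

log = logging.getLogger(__name__)


def uninstall_listeners_on_shutdown(chain, blockchain_events, shutdown_timeout):
    # If we already know the node is offline the filters are gone anyway,
    # so don't block the shutdown on RPC calls that cannot succeed.
    # `connected` is a hypothetical attribute, not part of the real client.
    if not getattr(chain.client, 'connected', True):
        return
    try:
        with gevent.Timeout(shutdown_timeout):
            blockchain_events.uninstall_all_event_listeners()
    except gevent.Timeout:
        log.warning('Timed out uninstalling filters during shutdown')
```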
| for event_listener in self.event_listeners:
|     new_listener = EventListener(
|         event_listener.event_name,
|         event_listener.filter_creation_function(from_block=from_block),
Is the from_block inclusive?
I believe the value of self._blocknumber from RaidenService is the current block, which is being polled for events.
Note: This actually depends on the order of which the callbacks are installed with the alarm task, if the polling is executed first then self._blocknumber is the previous block.
As of this it is inclusive, not sure about parity though.
Btw, could you add a test for this under the assumptions?
| log.info('Client reconnected')
| return result
|
| except (requests.exceptions.ConnectionError, InvalidReplyError):
Why is the client considered offline when an InvalidReplyError is raised?
I'm wondering what is the content of the response, perhaps the node restarted and Raiden polled for a filter that is gone?
Just after the disconnect, tinyrpc gets an empty response, which leads to the InvalidReplyError; in my testing it only happens in this case.
Not sure why this happens.
Ok, next version. I removed the decorator from … It would be nice to merge this now and fix some of the issues you brought up later. These are:

I think these are important but can be handled after this PR. Opinions?
Fixes #707.
This uses the same approach as #767 but a different timeout strategy. I discussed it with Konrad and we think it doesn't make sense to kill the raiden instance. So now raiden tries to reconnect every 3 seconds during the first minute, then every 10 seconds.
One issue I found is that one cannot Ctrl-C stop raiden while it's trying to reconnect. Not sure atm why the events aren't propagated. This is now fixed by adding timeouts to the shutdown sequence.
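For reference, the reconnect schedule described above (every 3 seconds during the first minute, then every 10 seconds) could be expressed as a simple wait-time generator; this is only an illustration of the strategy, not the code in the PR:

```python
def reconnect_waits(initial_wait=3, initial_period=60, later_wait=10):
    """Yield how many seconds to sleep before each reconnection attempt."""
    elapsed = 0
    while elapsed < initial_period:
        yield initial_wait
        elapsed += initial_wait
    while True:
        yield later_wait
```

Each failed connection attempt would pull the next value from the generator and sleep for that long before retrying.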