Fix an inherent race in execution vs. destruction. #1150

Merged: clalancette merged 2 commits into rolling from clalancette/fix-sub-race on Aug 28, 2023

clalancette
Contributor

The rclpy executor collects all of the entities in one pass, then creates async tasks for each of the ready ones and attempts to "take" and execute them. If one of those entities is destroyed after the collection but before we attempt to "take" it, then we can end up attempting to __enter__ a Destroyable-derived class that has already been destroyed. The Destroyable will then raise an InvalidHandle error.

Fix this by explicitly catching the InvalidHandle error that can be raised in all of the Destroyable-derived entities. If we do catch it, then we actually let the machinery continue but tell things to just not execute; in a subsequent executor iteration, the entity will be destroyed and hence not looked at anymore. This seems to fix the race in my testing.
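A minimal sketch of the race and of the fix described above, using toy stand-ins rather than the real rclpy classes. `ToyEntity`, `take`, and `run_once` are hypothetical names invented for illustration, and `InvalidHandle` is redefined locally so the snippet runs without rclpy installed:

```python
class InvalidHandle(Exception):
    """Local stand-in for the InvalidHandle error (assumption, not rclpy's class)."""


class ToyEntity:
    """Hypothetical Destroyable-like entity used only for this sketch."""

    def __init__(self, name):
        self.name = name
        self._destroyed = False

    def destroy(self):
        self._destroyed = True

    def __enter__(self):
        # Entering an entity that was already destroyed is what triggers the error.
        if self._destroyed:
            raise InvalidHandle(f'{self.name} already destroyed')
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions raised inside the block


def take(entity):
    """Return the taken data, or None if the entity was destroyed after collection."""
    try:
        with entity:
            return f'msg from {entity.name}'
    except InvalidHandle:
        # The entity went away between collection and take: report "nothing to run".
        return None


def run_once(ready, executed):
    """Process one batch of previously collected entities."""
    for entity in ready:
        data = take(entity)
        if data is None:
            continue  # skip it; a later collection pass would no longer include it
        executed.append(data)


a, b = ToyEntity('a'), ToyEntity('b')
ready = [a, b]   # collection pass: both entities look ready
b.destroy()      # explicit destruction happens before the take
executed = []
run_once(ready, executed)
print(executed)  # ['msg from a']
```

The point of the sketch is that the stale entry is dropped quietly instead of propagating an exception out of the executor loop.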

Fixes #1147

@sloretz I'm particularly looking for input from you, since I think you have the best handle on what is going on in rclpy. Still a draft until I get that feedback.

@clalancette
Contributor Author

While I do like this cleanup in that it makes things much more consistent, I realized that there is (likely) another way to fix this. The handler function has a problem where the try...finally should really wrap everything starting at the "with work_tracker" statement; that way any failure will result in callback_group.ending_execution being called. I can also make that change, but I'll wait for feedback here before doing that.
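As a rough illustration of that alternative (not the actual rclpy handler), the sketch below puts the whole work-tracker block inside try...finally. `ToyCallbackGroup`, `work_tracker`, `handler`, and `failing_take` are hypothetical stand-ins; only the method names `beginning_execution` and `ending_execution` are borrowed from the discussion:

```python
from contextlib import contextmanager


class ToyCallbackGroup:
    """Hypothetical stand-in that only counts in-flight executions."""

    def __init__(self):
        self.active = 0

    def beginning_execution(self, entity):
        self.active += 1

    def ending_execution(self, entity):
        self.active -= 1


@contextmanager
def work_tracker():
    """Placeholder for the executor's work tracker (assumed to be a context manager)."""
    yield


def handler(entity, group, take, call):
    group.beginning_execution(entity)
    try:
        # Everything from the work tracker onward sits inside try...finally,
        # so a failure in take() or call() still releases the callback group.
        with work_tracker():
            arg = take(entity)
            call(entity, arg)
    finally:
        group.ending_execution(entity)


def failing_take(entity):
    raise RuntimeError('take failed')


group = ToyCallbackGroup()
try:
    handler('sub', group, failing_take, lambda entity, arg: None)
except RuntimeError:
    pass  # the error still propagates; only the bookkeeping is guaranteed

print(group.active)  # 0
```

With the narrower placement, an exception raised during the take would skip `ending_execution` and leave the group marked as busy.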

@dcconner

Just confirming that I did test this patch and it seemed to fix what I observed in #1147

Contributor

@sloretz sloretz left a comment

> If one of those entities is destroyed after the collection but before we attempt to "take" it, then we can end up attempting to __enter__ a Destroyable-derived class that has already been destroyed. The Destroyable will then raise an InvalidHandle error.

One thing to clarify: since the Handler is given a reference to the original entity, we can't get to this case when the entity is garbage collected. It will only happen when the entity is explicitly destroyed.
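A small sketch of why that holds, using a toy class instead of a real rclpy entity (`ToyEntity` and `make_handler` are hypothetical names): the handler's strong reference keeps the object alive, so garbage collection cannot invalidate it, but an explicit `destroy()` call still can.

```python
import gc
import weakref


class ToyEntity:
    """Hypothetical entity with an explicit destroy(), for illustration only."""

    def __init__(self):
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


def make_handler(entity):
    # The closure holds a strong reference to the entity.
    def handler():
        return entity.destroyed
    return handler


entity = ToyEntity()
probe = weakref.ref(entity)   # lets us observe whether the object is still alive
handler = make_handler(entity)

entity.destroy()              # explicit destruction is always possible
del entity                    # drop the user's own reference
gc.collect()

alive = probe() is not None   # True: the handler still references the object
was_destroyed = handler()     # True: the handler sees a destroyed entity
print(alive, was_destroyed)
```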

> Fix this by explicitly catching the InvalidHandle error that can be raised in all of the Destroyable-derived entities. If we do catch it, then we actually let the machinery continue but tell things to just not execute; in a subsequent executor iteration, the entity will be destroyed and hence not looked at anymore.

I think ignoring the work when InvalidHandle is raised is the right idea.

The rclpy executor collects all of the entities in one
pass, then creates async tasks for each of the ready ones
and attempts to "take" and execute them.  If one of those entities
is destroyed after the collection but before we attempt
to "take" it, then we can end up attempting to __enter__
a Destroyable-derived class that has already been destroyed.
The Destroyable will then raise an InvalidHandle error.

Fix this by explicitly catching the InvalidHandle error
that can be raised in all of the Destroyable-derived entities.
If we do catch it, then we actually let the machinery
continue but tell things to just not execute; in a subsequent
executor iteration, the entity will be destroyed and
hence not looked at anymore.  This seems to fix the race
in my testing.

Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: Chris Lalancette <[email protected]>
@clalancette clalancette force-pushed the clalancette/fix-sub-race branch from b79dafa to 55d1fe6 Compare August 21, 2023 22:01
@clalancette clalancette marked this pull request as ready for review August 21, 2023 22:03
@clalancette
Contributor Author

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

Collaborator

@fujitatomoya fujitatomoya left a comment

lgtm

@clalancette
Contributor Author

Since CI is green, this is approved, and this fixes a real user-reported bug, I'm going to go ahead and merge this. @sloretz if you have any further feedback please feel free to leave it and I'll address it in a follow-up PR. Thanks!

@clalancette clalancette merged commit 159ced4 into rolling Aug 28, 2023
@delete-merged-branch delete-merged-branch bot deleted the clalancette/fix-sub-race branch August 28, 2023 17:03
Timple pushed a commit to nobleo/rclpy that referenced this pull request Mar 15, 2024
* Fix an inherent race in execution vs. destruction.
clalancette added a commit that referenced this pull request Mar 20, 2024
* Fix an inherent race in execution vs. destruction.
Co-authored-by: Chris Lalancette <[email protected]>
Development

Successfully merging this pull request may close these issues.

lost subscription after repeated create/destroy cycles