[feat] Fault detect and auto recover for rollout with backed up tokens #4872
base: main
# Conflicts:
#   verl/experimental/agent_loop/agent_loop.py
#   verl/experimental/agent_loop/single_turn_agent_loop.py
Code Review
This pull request introduces a fault detection and auto-recovery mechanism for rollouts, which is a valuable addition for improving the robustness of the system. The changes primarily involve tracking requests with a global_id and utilizing a fault_manager_queue. While the overall approach is sound, there are several instances of using bare except blocks, which can mask underlying issues and make debugging difficult. I've provided suggestions to replace these with more specific exception handling. Additionally, a placeholder exception was found and should be replaced with a more appropriate error type.
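To illustrate the concern about bare except blocks, here is a minimal sketch. The helpers `lookup_bare`, `lookup_narrow`, and the `get_actor` callables are hypothetical stand-ins, not code from this PR; `get_actor` only imitates a lookup that raises ValueError when a named actor is missing.

```python
def lookup_bare(get_actor, name):
    """Look up an actor, swallowing every exception (the pattern under review)."""
    try:
        return get_actor(name)
    except:  # noqa: E722 - also catches SystemExit and KeyboardInterrupt
        return None


def lookup_narrow(get_actor, name):
    """Look up an actor, treating only 'not found' (ValueError) as expected."""
    try:
        return get_actor(name)
    except ValueError:
        return None


def missing_actor(name):
    # Hypothetical stand-in: imitates a lookup failing because the actor is absent.
    raise ValueError(f"Failed to look up actor with name '{name}'")


def shutting_down(name):
    # Hypothetical stand-in: an unrelated failure that should not be hidden.
    raise SystemExit(1)
```

With these stand-ins, both helpers return None for `missing_actor`, but only `lookup_bare` silently returns None for `shutting_down`; `lookup_narrow` lets the SystemExit propagate.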
    try:
        tokens_queue = ray.get_actor("fault_manager_queue")
    except:
        pass
    else:
        await tokens_queue.put.remote((new_request_id, global_id))
Using a bare except is discouraged as it can catch unexpected exceptions like SystemExit or KeyboardInterrupt, making the program harder to debug and control. It's better to catch specific exceptions. In this case, ray.get_actor raises a ValueError if the actor is not found, so catching ValueError would be more appropriate.
    try:
        tokens_queue = ray.get_actor("fault_manager_queue")
    except ValueError:
        # It's expected that the fault manager queue might not exist
        # if fault tolerance is not enabled.
        pass
    else:
        await tokens_queue.put.remote((new_request_id, global_id))

    if output.log_probs is not None or output.routed_experts is not None:
        raise Exception("[fault_manager TODO fix")
Raising a generic Exception with a 'TODO' message is not suitable for production code. It's better to use a more specific exception type like NotImplementedError and provide a more descriptive message to clarify what is not yet supported.
    - if output.log_probs is not None or output.routed_experts is not None:
    -     raise Exception("[fault_manager TODO fix")
    + if output.log_probs is not None or output.routed_experts is not None:
    +     raise NotImplementedError("Fault manager does not yet support log_probs or routed_experts from generate output.")
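A slightly more descriptive variant of the suggested guard could name exactly which unsupported fields were set. This is only a sketch: `GenerateOutput` and `check_recoverable` are hypothetical names used for illustration, and the real output type in the PR may look different.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class GenerateOutput:
    """Hypothetical stand-in for the generate output object in the PR."""

    token_ids: Sequence[int]
    log_probs: Optional[Sequence[float]] = None
    routed_experts: Optional[object] = None


def check_recoverable(output: GenerateOutput) -> None:
    """Raise NotImplementedError if the output carries fields recovery cannot handle yet."""
    unsupported = [
        name
        for name in ("log_probs", "routed_experts")
        if getattr(output, name) is not None
    ]
    if unsupported:
        raise NotImplementedError(
            "Fault manager does not yet support: " + ", ".join(unsupported)
        )
```

Callers can then catch NotImplementedError specifically, and the message tells them which field triggered it.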
    try:
        ray.get_actor("fault_manager_queue")
        from recipe.fault_recover.async_llm import AsyncFaultRecoverLLM as AsyncLLM
    except:
        pass
The bare except can hide important errors, such as an ImportError if the recipe module is not found, or other unexpected exceptions from ray.get_actor. It's better to explicitly catch the expected exceptions, like ValueError from ray.get_actor and ImportError for the dynamic import. This makes the code more robust and easier to debug.
    - try:
    -     ray.get_actor("fault_manager_queue")
    -     from recipe.fault_recover.async_llm import AsyncFaultRecoverLLM as AsyncLLM
    - except:
    -     pass
    + try:
    +     ray.get_actor("fault_manager_queue")
    +     from recipe.fault_recover.async_llm import AsyncFaultRecoverLLM as AsyncLLM
    + except (ValueError, ImportError):
    +     # It's expected that the fault manager queue or the recipe might not exist
    +     # if fault tolerance is not enabled. In that case, we use the default AsyncLLM.
    +     pass
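The fallback behaviour described above can be sketched in isolation, without Ray, roughly as follows. The function `select_llm_class` and its parameters are hypothetical; `get_actor` stands in for a lookup that raises ValueError when the named actor is missing, and the optional class is loaded with the standard library's `importlib.import_module`.

```python
import importlib


def select_llm_class(get_actor, default_cls, module_name, class_name,
                     queue_name="fault_manager_queue"):
    """Return the optional class if the queue actor and module exist, else default_cls.

    Only ValueError (actor not found) and ImportError (module not installed)
    are treated as "feature disabled"; anything else propagates to the caller.
    """
    try:
        get_actor(queue_name)  # assumed to raise ValueError if the actor is absent
        module = importlib.import_module(module_name)
    except (ValueError, ImportError):
        return default_cls
    # Unlike `from module import name`, a missing attribute raises AttributeError
    # here, which is deliberately left uncaught in this sketch.
    return getattr(module, class_name)
```

Keeping the try body small makes it clear which calls are expected to fail when fault tolerance is not enabled.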
What does this PR do?
Adds high availability for vLLM rollout: faults are detected and the rollout recovers automatically using backed-up tokens.
RFC
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
- For multiple modules, list them like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- For a breaking change, add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching
Test
TODO
API and Usage Example
Design & Code Changes
Other changes are in the recipe PR: verl-project/verl-recipe#14
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- Request CI via the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR changes the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.