@Li-Yongwen Li-Yongwen commented Jan 10, 2026

What does this PR do?

High availability for vLLM rollout.
RFC

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

TODO

API and Usage Example

# code
git clone https://github.com/verl-project/verl.git
cd verl/recipe
git pull origin main

# run
python3 -m recipe.fault_recover.main_ppo --config-path=config \
    --config-name='fault_recover_ppo_megatron_trainer.yaml' \
    fault_manager.enable=True
# For other settings, refer to the fault_manager section of
# recipe/fault_recover/config/fault_recover_ppo_megatron_trainer.yaml

Design & Code Changes

  • Patch vLLM output handling.
  • Add fault_recover document in recipe.
  • Add vLLM rollout preprocessing.

Other changes are in the recipe PR: verl-project/verl-recipe#14

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a fault detection and auto-recovery mechanism for rollouts, which is a valuable addition for improving the robustness of the system. The changes primarily involve tracking requests with a global_id and utilizing a fault_manager_queue. While the overall approach is sound, there are several instances of using bare except blocks, which can mask underlying issues and make debugging difficult. I've provided suggestions to replace these with more specific exception handling. Additionally, a placeholder exception was found and should be replaced with a more appropriate error type.

Comment on lines +120 to +125
try:
    tokens_queue = ray.get_actor("fault_manager_queue")
except:
    pass
else:
    await tokens_queue.put.remote((new_request_id, global_id))
high

Using a bare except is discouraged as it can catch unexpected exceptions like SystemExit or KeyboardInterrupt, making the program harder to debug and control. It's better to catch specific exceptions. In this case, ray.get_actor raises a ValueError if the actor is not found, so catching ValueError would be more appropriate.

            try:
                tokens_queue = ray.get_actor("fault_manager_queue")
            except ValueError:
                # It's expected that the fault manager queue might not exist
                # if fault tolerance is not enabled.
                pass
            else:
                await tokens_queue.put.remote((new_request_id, global_id))
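
The difference between the two handlers can be sketched without a Ray cluster. In the snippet below, `lookup_actor` is a hypothetical stand-in for `ray.get_actor`, assumed (as the comment above states) to raise `ValueError` when the named actor is missing. `interrupted_lookup` and the registry dict are likewise illustrative only and not part of this PR.

```python
def lookup_actor(name, registry):
    """Hypothetical stand-in for ray.get_actor: raises ValueError if missing."""
    if name not in registry:
        raise ValueError(f"Failed to look up actor with name '{name}'")
    return registry[name]


def interrupted_lookup(name, registry):
    """Simulates the user pressing Ctrl-C while the lookup is running."""
    raise KeyboardInterrupt


def get_queue_bare(registry, lookup=lookup_actor):
    """Bare except: swallows everything, including KeyboardInterrupt."""
    try:
        return lookup("fault_manager_queue", registry)
    except:  # noqa: E722 - kept only to demonstrate the problem
        return None


def get_queue_specific(registry, lookup=lookup_actor):
    """Specific except: only the expected 'actor not found' case is ignored."""
    try:
        return lookup("fault_manager_queue", registry)
    except ValueError:
        return None
```

With `lookup=interrupted_lookup`, `get_queue_bare` silently returns `None`, while `get_queue_specific` lets the `KeyboardInterrupt` propagate to the caller.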

Comment on lines +89 to +90
if output.log_probs is not None or output.routed_experts is not None:
    raise Exception("[fault_manager TODO fix")
high

Raising a generic Exception with a 'TODO' message is not suitable for production code. It's better to use a more specific exception type like NotImplementedError and provide a more descriptive message to clarify what is not yet supported.

Suggested change
- if output.log_probs is not None or output.routed_experts is not None:
-     raise Exception("[fault_manager TODO fix")
+ if output.log_probs is not None or output.routed_experts is not None:
+     raise NotImplementedError("Fault manager does not yet support log_probs or routed_experts from generate output.")
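
As a rough sketch of how such a guard might look in isolation: `GenerateOutput` and `check_supported` below are hypothetical names invented for illustration. Only the two field names, `log_probs` and `routed_experts`, come from the snippet under review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenerateOutput:
    """Hypothetical stand-in for the rollout output object."""
    token_ids: list
    log_probs: Optional[list] = None
    routed_experts: Optional[list] = None


def check_supported(output):
    """Reject outputs carrying fields the fault manager cannot handle yet."""
    unsupported = [
        name
        for name in ("log_probs", "routed_experts")
        if getattr(output, name) is not None
    ]
    if unsupported:
        # Name the offending fields so the failure is self-explanatory.
        raise NotImplementedError(
            "Fault manager does not yet support: " + ", ".join(unsupported)
        )
    return output
```

Because `NotImplementedError` is a subclass of `RuntimeError`, callers can catch it specifically without resorting to a blanket `except Exception`.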

Comment on lines +411 to +415
try:
    ray.get_actor("fault_manager_queue")
    from recipe.fault_recover.async_llm import AsyncFaultRecoverLLM as AsyncLLM
except:
    pass
high

The bare except can hide important errors, such as an ImportError if the recipe module is not found, or other unexpected exceptions from ray.get_actor. It's better to explicitly catch the expected exceptions, like ValueError from ray.get_actor and ImportError for the dynamic import. This makes the code more robust and easier to debug.

Suggested change
- try:
-     ray.get_actor("fault_manager_queue")
-     from recipe.fault_recover.async_llm import AsyncFaultRecoverLLM as AsyncLLM
- except:
-     pass
+ try:
+     ray.get_actor("fault_manager_queue")
+     from recipe.fault_recover.async_llm import AsyncFaultRecoverLLM as AsyncLLM
+ except (ValueError, ImportError):
+     # It's expected that the fault manager queue or the recipe might not exist
+     # if fault tolerance is not enabled. In that case, we use the default AsyncLLM.
+     pass
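
The optional-override pattern suggested here can be sketched in a self-contained form. `DefaultLLM` and `select_llm_class` are hypothetical names, and the `queue_exists` flag stands in for the `ray.get_actor` lookup, which is assumed to raise `ValueError` when the actor is absent.

```python
import importlib


class DefaultLLM:
    """Hypothetical stand-in for the default AsyncLLM class."""


def select_llm_class(queue_exists, module_name, class_name):
    """Return the fault-recover class if it is available, else the default."""
    try:
        if not queue_exists:
            # Mirrors ray.get_actor raising ValueError for a missing actor.
            raise ValueError("fault_manager_queue not found")
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except (ValueError, ImportError):
        # Expected when fault tolerance is disabled or the recipe is absent.
        return DefaultLLM
```

Only the two anticipated failures trigger the fallback. Anything else, such as an `AttributeError` from a misspelled class name, still surfaces instead of being silently ignored.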
