[feat] Fault detect and auto recover for rollout with backed up tokens #4872

Li-Yongwen · 2026-01-10T07:23:49Z

What does this PR do?

High availability for vLLM rollout.
RFC

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

TODO

API and Usage Example

# code
git clone https://github.com/verl-project/verl.git
cd verl/recipe
git pull origin main

# run
# For the other detailed options, refer to the fault_manager section of
# recipe/fault_recover/config/fault_recover_ppo_megatron_trainer.yaml
python3 -m recipe.fault_recover.main_ppo --config-path=config \
    --config-name='fault_recover_ppo_megatron_trainer.yaml' \
    fault_manager.enable=True

Design & Code Changes

Patch the vLLM output handling.
Add the fault_recover document in recipe.
Add vLLM rollout preprocessing.

Other changes are in the recipe PR: verl-project/verl-recipe#14

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.


CLAassistant · 2026-01-10T07:23:56Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

gemini-code-assist

Code Review

This pull request introduces a fault detection and auto-recovery mechanism for rollouts, which is a valuable addition for improving the robustness of the system. The changes primarily involve tracking requests with a global_id and utilizing a fault_manager_queue. While the overall approach is sound, there are several instances of using bare except blocks, which can mask underlying issues and make debugging difficult. I've provided suggestions to replace these with more specific exception handling. Additionally, a placeholder exception was found and should be replaced with a more appropriate error type.

gemini-code-assist · 2026-01-10T07:26:16Z

verl/experimental/agent_loop/agent_loop.py

+            try:
+                tokens_queue = ray.get_actor("fault_manager_queue")
+            except:
+                pass
+            else:
+                await tokens_queue.put.remote((new_request_id, global_id))


Using a bare except is discouraged as it can catch unexpected exceptions like SystemExit or KeyboardInterrupt, making the program harder to debug and control. It's better to catch specific exceptions. In this case, ray.get_actor raises a ValueError if the actor is not found, so catching ValueError would be more appropriate.

            try:
                tokens_queue = ray.get_actor("fault_manager_queue")
            except ValueError:
                # It's expected that the fault manager queue might not exist
                # if fault tolerance is not enabled.
                pass
            else:
                await tokens_queue.put.remote((new_request_id, global_id))
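To illustrate the difference the comment above describes, here is a minimal sketch that runs without Ray. The `lookup` function is a hypothetical stand-in for `ray.get_actor` (it raises `ValueError` for an unknown name), and `interrupted` simulates a Ctrl-C arriving during the lookup. None of these helpers come from the PR.

```python
def lookup(name, registry):
    """Hypothetical stand-in for ray.get_actor: ValueError if the name is unknown."""
    if name not in registry:
        raise ValueError(f"no actor named {name!r}")
    return registry[name]


def interrupted(name, registry):
    """Simulates the user pressing Ctrl-C while the lookup is in progress."""
    raise KeyboardInterrupt


def get_queue_bare(registry, lookup_fn=lookup):
    try:
        return lookup_fn("fault_manager_queue", registry)
    except:  # noqa: E722 (deliberately bare: also swallows KeyboardInterrupt and SystemExit)
        return None


def get_queue_specific(registry, lookup_fn=lookup):
    try:
        return lookup_fn("fault_manager_queue", registry)
    except ValueError:
        # Only the expected "queue does not exist" case is silenced.
        return None


# Both variants treat a missing queue as "fault tolerance disabled".
print(get_queue_bare({}), get_queue_specific({}))

# Only the bare variant hides the interrupt.
print(get_queue_bare({}, interrupted))
try:
    get_queue_specific({}, interrupted)
except KeyboardInterrupt:
    print("KeyboardInterrupt propagated")
```

With the specific handler, the interrupt reaches the caller, so the process can still be stopped or debugged normally.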

gemini-code-assist · 2026-01-10T07:26:16Z

verl/experimental/agent_loop/single_turn_agent_loop.py

+        if output.log_probs is not None or output.routed_experts is not None:
+            raise Exception("[fault_manager TODO fix")


Raising a generic Exception with a 'TODO' message is not suitable for production code. It's better to use a more specific exception type like NotImplementedError and provide a more descriptive message to clarify what is not yet supported.

Suggested change

-        if output.log_probs is not None or output.routed_experts is not None:
-            raise Exception("[fault_manager TODO fix")
+        if output.log_probs is not None or output.routed_experts is not None:
+            raise NotImplementedError("Fault manager does not yet support log_probs or routed_experts from generate output.")
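One practical benefit of the specific exception type is that callers can handle the unsupported case without catching everything else. The sketch below assumes a hypothetical `FakeOutput` dataclass in place of the real generate output; only the field names `log_probs` and `routed_experts` are taken from the diff.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeOutput:
    """Hypothetical stand-in for the generate output used in the diff."""
    token_ids: list = field(default_factory=list)
    log_probs: Optional[list] = None
    routed_experts: Optional[list] = None


def check_supported(output):
    if output.log_probs is not None or output.routed_experts is not None:
        raise NotImplementedError(
            "Fault manager does not yet support log_probs or routed_experts "
            "from generate output."
        )
    return output.token_ids


def tokens_or_none(output):
    """Caller that handles only the 'not supported yet' case."""
    try:
        return check_supported(output)
    except NotImplementedError:
        return None


print(tokens_or_none(FakeOutput(token_ids=[1, 2, 3])))
print(tokens_or_none(FakeOutput(token_ids=[1], log_probs=[-0.5])))
```

Because the handler names `NotImplementedError`, an unrelated bug such as an `AttributeError` inside `check_supported` would still surface instead of being treated as an unsupported feature.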

gemini-code-assist · 2026-01-10T07:26:16Z

verl/workers/rollout/vllm_rollout/vllm_async_server.py

+        try:
+            ray.get_actor("fault_manager_queue")
+            from recipe.fault_recover.async_llm import AsyncFaultRecoverLLM as AsyncLLM
+        except:
+            pass


The bare except can hide important errors, such as an ImportError if the recipe module is not found, or other unexpected exceptions from ray.get_actor. It's better to explicitly catch the expected exceptions, like ValueError from ray.get_actor and ImportError for the dynamic import. This makes the code more robust and easier to debug.

Suggested change

-        try:
-            ray.get_actor("fault_manager_queue")
-            from recipe.fault_recover.async_llm import AsyncFaultRecoverLLM as AsyncLLM
-        except:
-            pass
+        try:
+            ray.get_actor("fault_manager_queue")
+            from recipe.fault_recover.async_llm import AsyncFaultRecoverLLM as AsyncLLM
+        except (ValueError, ImportError):
+            # It's expected that the fault manager queue or the recipe might not exist
+            # if fault tolerance is not enabled. In that case, we use the default AsyncLLM.
+            pass
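The optional-override pattern in this suggestion can be sketched in isolation as follows. This is an illustration under stated assumptions, not the PR's code: `load_engine`, `DefaultEngine`, and the `actor_lookup` callable are hypothetical names, and `importlib.import_module` stands in for the `from ... import ...` statement.

```python
import importlib


class DefaultEngine:
    """Hypothetical default used when the fault-recover variant is unavailable."""


def load_engine(module_name, class_name, actor_lookup):
    """Return the override class if both the queue and the module exist."""
    try:
        # Stand-in for ray.get_actor: expected to raise ValueError if absent.
        actor_lookup("fault_manager_queue")
        module = importlib.import_module(module_name)
        # Note: a missing attribute raises AttributeError here, which is
        # intentionally not silenced in this sketch.
        return getattr(module, class_name)
    except (ValueError, ImportError):
        # Expected when fault tolerance is not enabled or the recipe is absent.
        return DefaultEngine


def missing_actor(name):
    raise ValueError(f"no actor named {name!r}")


def broken_lookup(name):
    raise TypeError("unexpected failure inside the lookup")


# Queue missing, or module missing: fall back to the default.
print(load_engine("collections", "OrderedDict", missing_actor).__name__)
print(load_engine("no_such_module_for_demo", "X", lambda name: None).__name__)

# Queue and module both present: the override is used.
print(load_engine("collections", "OrderedDict", lambda name: None).__name__)
```

Any other exception, such as the `TypeError` raised by `broken_lookup`, propagates to the caller, which is the debugging benefit the review comment points to.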

Li-Yongwen added 3 commits January 8, 2026 11:56

feat:Fault detect and auto recover for rollout with backed up tokens

add6620

Merge remote-tracking branch 'verl-ras/main1229' into verl-main-pr

bfe007c

# Conflicts: # verl/experimental/agent_loop/agent_loop.py # verl/experimental/agent_loop/single_turn_agent_loop.py

[bug fix] Fault detext and cuto recover for rollout with backed up to…

88d88a8

…kens

Li-Yongwen requested review from PeterSH6, chenhaiq and wuxibin89 as code owners January 10, 2026 07:23

gemini-code-assist bot reviewed Jan 10, 2026


Li-Yongwen mentioned this pull request Jan 10, 2026

[feat] Fault detect and auto recover for rollout with backed up tokens verl-project/verl-recipe#14

Open
