@hanhan-networking hanhan-networking commented Jan 12, 2026

What does this PR do?

Building on the checkpoint engine abstraction (#4775), this PR adds an HCCL backend to support Huawei Ascend NPUs.

TODO:

  • Give more detailed performance testing results.
  • Improve checkpoint engine README.

In the near future, we will:

  • Add the Mooncake transfer engine to support P2P communication for both NPU and GPU.
  • Integrate the Kimi checkpoint engine for more complex communication. For now, the basic functions are tested; we will provide some performance results.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

CLAassistant commented Jan 12, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an HCCLCheckpointEngine backend to support checkpointing on Huawei Ascend NPUs, building upon the existing checkpoint engine abstraction. The changes include the engine implementation, utility functions for creating stateless process groups with IPv6 support, and a new test suite for the HCCL engine. My review primarily focuses on resource management within the new engine, where I've identified a significant resource leak related to ZMQ and socket handling. The provided feedback includes specific suggestions to rectify this issue.
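The two concerns raised in this review, IPv6-aware endpoint handling and deterministic socket cleanup, can be illustrated with a minimal standard-library sketch. This is not the code from the PR: the helper names `is_ipv6`, `format_tcp_endpoint`, and `find_free_port` are hypothetical, and plain `socket` objects stand in for the ZMQ sockets used by the engine. The only assumption carried over is that any socket opened by a helper should be closed on every exit path.

```python
import ipaddress
import socket


def is_ipv6(host: str) -> bool:
    """Return True if `host` is a literal IPv6 address (hypothetical helper)."""
    try:
        return ipaddress.ip_address(host).version == 6
    except ValueError:
        # Hostnames and malformed strings are not IP literals.
        return False


def format_tcp_endpoint(host: str, port: int) -> str:
    """Build a tcp:// endpoint string, bracketing IPv6 literals."""
    if is_ipv6(host):
        return f"tcp://[{host}]:{port}"
    return f"tcp://{host}:{port}"


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a free port on `host`.

    The `with` block closes the probing socket even if bind() raises,
    so no file descriptor is leaked.
    """
    family = socket.AF_INET6 if is_ipv6(host) else socket.AF_INET
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick an unused port
        return sock.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    print(format_tcp_endpoint("127.0.0.1", port))
    print(format_tcp_endpoint("::1", 29500))
```

The same pattern (acquire in a `with` block or release in a `finally` clause) applies to any resource with an explicit close step; whether it maps directly onto the engine's ZMQ usage depends on how the sockets are shared across methods.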

@wuxibin89 wuxibin89 changed the title from "[ckpt]feat: add Hccl ckpt engine backend" to "[ckpt] feat: add Hccl ckpt engine backend" Jan 13, 2026
@wuxibin89
Collaborator

@wuxibin89 wuxibin89 mentioned this pull request Jan 13, 2026
24 tasks
@hanhan-networking
Author

Please format code according to: https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting

Thank you for the kind review. I have fixed the pre-commit issues.
