[ckpt] feat: add Hccl ckpt engine backend #4885
base: main
Code Review
This pull request introduces an HCCLCheckpointEngine backend to support checkpointing on Huawei Ascend NPUs, building upon the existing checkpoint engine abstraction. The changes include the engine implementation, utility functions for creating stateless process groups with IPv6 support, and a new test suite for the HCCL engine. My review primarily focuses on resource management within the new engine, where I've identified a significant resource leak related to ZMQ and socket handling. The provided feedback includes specific suggestions to rectify this issue.
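The two concerns named in this review, IPv6-aware endpoints and sockets that are reliably released, can be illustrated with a minimal standard-library sketch. It is not code from this PR: the function names `is_ipv6_literal`, `format_tcp_endpoint`, and `find_free_port` are hypothetical, and the sketch assumes endpoints are plain `tcp://host:port` strings.

```python
import ipaddress
import socket


def is_ipv6_literal(host: str) -> bool:
    """Return True when host is an IPv6 address literal (not a hostname)."""
    try:
        return ipaddress.ip_address(host).version == 6
    except ValueError:
        # Not an IP literal at all, e.g. a DNS name such as "node-0".
        return False


def format_tcp_endpoint(host: str, port: int) -> str:
    """Build a tcp:// endpoint string (hypothetical helper).

    IPv6 literals are wrapped in brackets so the port separator
    cannot be confused with the colons inside the address.
    """
    if is_ipv6_literal(host):
        return f"tcp://[{host}]:{port}"
    return f"tcp://{host}:{port}"


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a free port on host (hypothetical helper).

    The probe socket is used as a context manager, so it is closed
    even if bind() raises, which avoids leaking a file descriptor.
    """
    family = socket.AF_INET6 if is_ipv6_literal(host) else socket.AF_INET
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]
```

The same pattern, acquiring a socket inside a `with` block or closing it in a `finally` clause, is one common way to avoid the kind of leak the review describes; whether it matches the fix suggested in the PR is not stated here.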
Please format code according to: https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting
Thank you for the kind review. I have fixed the pre-commit comments.
What does this PR do?
Based on the checkpoint engine abstraction (#4775), this PR adds an HCCL backend to support Huawei Ascend NPUs.
TODO:
In the near future, we will
Checklist Before Starting
PR title format: [{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
[BREAKING] to the beginning of the title.
[BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.