Fix: ZenFlow Adam integration for updated PyTorch backward flow (#7759) #7771
base: master
Conversation
- Introduced ZenFlowAdamBuilder in op_builder.
- Updated CPU_Accelerator to include ZenFlowAdamBuilder in the import statements and class handling.
- Modified ZenFlowCPUAdam to utilize ZenFlowAdamBuilder for creating Adam instances.

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
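As a rough illustration of the builder wiring this commit describes (only `ZenFlowAdamBuilder` itself comes from the PR; the import path and the `create_adam()` call below follow DeepSpeed's CPUAdam convention and are assumptions):

```python
# Sketch, not the PR's actual code: shows how an optimizer typically consumes
# an op builder in DeepSpeed. The create_adam() signature mirrors CPUAdam.
from deepspeed.ops.op_builder import ZenFlowAdamBuilder

class ZenFlowCPUAdamSketch:

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
        # Build (or load the cached) native extension through the builder
        # instead of instantiating the Adam kernel directly.
        self.ds_opt_adam = ZenFlowAdamBuilder().load()
        self.opt_id = 0
        # Trailing booleans are adamw_mode and should_log (assumed defaults).
        self.ds_opt_adam.create_adam(self.opt_id, lr, betas[0], betas[1], eps,
                                     weight_decay, True, True)
```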
- Updated the DeepSpeedEngine to ensure ZenFlow manages the backward pass, allowing for selective parameter updates and synchronization boundaries.
- Replaced direct calls to `torch._utils.is_compiling()` with `is_compiling()` from the compiler module for better clarity and maintainability.

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
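The compiler-module indirection in the second bullet usually amounts to a small compatibility shim like the sketch below (DeepSpeed's actual `deepspeed/runtime/compiler.py` may differ in detail):

```python
import torch

def is_compiling() -> bool:
    # Prefer the public API available in newer PyTorch releases and fall back
    # to the older private helper that this commit migrates away from.
    if hasattr(torch, "compiler") and hasattr(torch.compiler, "is_compiling"):
        return torch.compiler.is_compiling()
    return torch._utils.is_compiling()
```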
TODO: Ideally, ZenFlow's backward logic should be fully compatible with the standard …
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Thank you for the fix!
deepspeed/runtime/engine.py
Outdated
```python
return gas_scaled_loss

# TODO: handle these scaling with direct calls to loss.backward()
if isinstance(self.optimizer, ZeROOptimizer):
```
This block was originally outside the compiled_autograd context. I wonder if compiling the scaling might cause an issue when the scaling factor changes. Can we move this block outside the context?
Hi @tohtana, thanks for the review and reminder. Placing this block outside the context should be better.
Yes, that's a good point. I wanted to resolve this so we can pass the full CI test suite, but it makes the core part of the optimizer more Zenflow-specific. We probably shouldn’t cut corners for this. Let me close this for now, and then consider a more general approach.
Good suggestion!
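A minimal sketch of the agreed restructuring: the ZeRO loss scaling stays eager, outside the compiled region, so a changing scale factor is never baked into the compiled graph. `compiler_fn` and the method shape are placeholders, not the engine's real signature:

```python
import torch
from deepspeed.runtime.base_optimizer import ZeROOptimizer  # path may vary

def backward_sketch(self, loss, compiler_fn):
    # Scale eagerly so updates to the scale factor take effect immediately
    # instead of being captured once at compile time.
    if isinstance(self.optimizer, ZeROOptimizer):
        loss = self.optimizer.scale_if_loss(loss)

    # Only the backward pass itself runs under compiled autograd.
    with torch._dynamo.compiled_autograd.enable(compiler_fn):
        loss.backward()
```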
- Added a RuntimeError to prevent direct calls to loss.backward() when ZenFlow is enabled, ensuring proper management of the backward pass.
- Updated position of loss scale block.

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
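A hedged sketch of the new guard from the first bullet (the attribute names here are assumptions; the point is only to fail fast when a user bypasses the engine-managed backward):

```python
def _assert_zenflow_backward(self):
    # Hypothetical check: raise if ZenFlow is enabled but backward was not
    # initiated through engine.backward(loss).
    if getattr(self, "zenflow_enabled", False) and not getattr(
            self, "_in_engine_backward", False):
        raise RuntimeError(
            "Direct calls to loss.backward() are not supported when ZenFlow "
            "is enabled; use engine.backward(loss) so ZenFlow can manage the "
            "backward pass.")
```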
…i/DeepSpeed into tingfeng/zenflow_fix_backward
Hi @tohtana. I’ve added the corresponding handling in the latest commit based on your comment. Could you please help check whether this looks reasonable? Thanks!
Hi @Antlera,

By the way, Stage 3 + … Probably this is unrelated to this PR.

Thanks for pointing this out! Yeah, this looks like an issue with attribute propagation or initialization.
```python
# TODO: handle these scaling with direct calls to loss.backward()
if isinstance(self.optimizer, ZeROOptimizer):
    loss = self.optimizer.scale_if_loss(loss)
```
Hi @tohtana, do you think this is OK?
We could keep this PR as the current fix. Since ZenFlow implements its own backward logic, we don’t need to add an extra layer of handling in the higher-level optimizer for now. This won’t affect the existing optimizer logic.
The test issue mainly comes from how the scaled loss is handled; fixing this part should be sufficient to make the tests pass.
We may reopen this PR if you think this approach works.
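For context on the scaled-loss point above: DeepSpeed divides the loss by the gradient accumulation steps before backward, roughly as below (simplified; `gas_scaled_loss` in the earlier diff is the result of this step):

```python
def _scale_loss_by_gas(self, prescaled_loss):
    # Average the loss across gradient accumulation boundaries so that the
    # accumulated gradients match a single large-batch step.
    if self.gradient_accumulation_steps() > 1:
        return prescaled_loss / self.gradient_accumulation_steps()
    return prescaled_loss
```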
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
@Antlera Sorry, I mistakenly closed this issue, being confused with #7793.

DeepSpeed engine's forward always registers a backward hook that calls … To enable this use case in a general way (not too hard-coded for ZenFlow), we need to allow an advanced custom optimizer to control hooks at an earlier stage (like the engine's initialization).
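A purely illustrative sketch of that more general direction (none of these names exist in DeepSpeed; they only show an optimizer claiming hook ownership during engine initialization):

```python
class HookManagingOptimizer:
    # Hypothetical capability flag the engine would consult at init time.
    manages_backward_hooks = True

class EngineInitSketch:

    def _setup_backward_hooks(self):
        if getattr(self.optimizer, "manages_backward_hooks", False):
            # An advanced optimizer (e.g. ZenFlow) installs its own hooks and
            # controls the synchronization boundaries itself.
            self.optimizer.register_backward_hooks(self.module)
        else:
            # Default engine behavior: hook backward to drive gradient sync.
            self._register_default_backward_hook()
```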
This PR refactors ZenFlow's integration with DeepSpeed's Adam/AdamW optimizers to adapt to recent changes in PyTorch's backward execution model. It is a follow-up to #7759 and a quick patch to restore correct behavior under the new loss.backward() flow.
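For users, the practical implication is that backward must always go through the engine when ZenFlow is on. A minimal training-loop sketch (`model`, `data_loader`, and `ds_config` are assumed to be defined elsewhere, with ZenFlow enabled in `ds_config`):

```python
import deepspeed

engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               model_parameters=model.parameters(),
                                               config=ds_config)

for batch in data_loader:
    loss = engine(batch)   # forward (assumes the model returns the loss)
    engine.backward(loss)  # ZenFlow-managed; a bare loss.backward() now raises
    engine.step()
```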