Conversation

@Antlera Antlera commented Jan 11, 2026

This PR refactors ZenFlow’s integration with DeepSpeed’s Adam/AdamW optimizers to adapt to recent changes in PyTorch’s backward execution model. It is a follow-up to #7759 and a fast patch to restore correct behavior under the new loss.backward() flow.

Antlera and others added 2 commits January 11, 2026 01:38
- Introduced ZenFlowAdamBuilder in op_builder.
- Updated CPU_Accelerator to include ZenFlowAdamBuilder in the import statements and class handling.
- Modified ZenFlowCPUAdam to utilize ZenFlowAdamBuilder for creating Adam instances.

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
- Updated the DeepSpeedEngine to ensure ZenFlow manages the backward pass, allowing for selective parameter updates and synchronization boundaries.
- Replaced direct calls to `torch._utils.is_compiling()` with `is_compiling()` from the compiler module for better clarity and maintainability.

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>

Antlera commented Jan 11, 2026

TODO: Ideally, ZenFlow’s backward logic should be fully compatible with the standard loss.backward() path in the future.

Antlera and others added 2 commits January 11, 2026 01:54
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>

tohtana commented Jan 11, 2026

Thank you for the fix!
Can we throw an error if ZenFlow is enabled and backward is called in loss.backward()-style?

return gas_scaled_loss

# TODO: handle these scaling with direct calls to loss.backward()
if isinstance(self.optimizer, ZeROOptimizer):
tohtana (Collaborator) commented on this diff:

This block was originally outside the compiled_autograd context. I wonder if compiling the scaling might cause an issue when the scaling factor changes. Can we move this block outside the context?
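The concern above can be sketched in a few lines of plain Python. This is only an illustration of the suggested structure, under assumed names: compiled_autograd_ctx stands in for PyTorch's compiled-autograd context manager (e.g. torch._dynamo.compiled_autograd.enable), and the scaling call is simplified to a multiply.

```python
from contextlib import contextmanager

# Illustrative sketch of the review suggestion: keep the loss-scaling step
# eager, outside the compiled_autograd context, so a changing scale factor
# is not baked into the compiled graph. All names here are stand-ins, not
# the real DeepSpeed/PyTorch identifiers.
@contextmanager
def compiled_autograd_ctx():
    yield  # placeholder for the real compiled-autograd context manager

def backward(loss, scale, run_autograd):
    scaled = loss * scale          # scaling happens eagerly, outside the context
    with compiled_autograd_ctx():  # only the autograd execution is compiled
        run_autograd(scaled)
    return scaled
```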

Antlera (Collaborator, Author) replied:

Hi @tohtana, thanks for the review and the reminder. Placing this block outside the context should be better.

tohtana (Collaborator) replied:

Yes, that's a good point. I wanted to resolve this so we could pass the full CI test suite, but it makes the core part of the optimizer more ZenFlow-specific. We probably shouldn’t cut corners here. Let me close this for now and consider a more general approach.


Antlera commented Jan 12, 2026

Thank you for the fix! Can we throw an error if ZenFlow is enabled and backward is called in loss.backward()-style?

Good suggestion!

- Added a RuntimeError to prevent direct calls to loss.backward() when ZenFlow is enabled, ensuring proper management of the backward pass.
- Updated position of loss scale block.

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
…i/DeepSpeed into tingfeng/zenflow_fix_backward

Antlera commented Jan 12, 2026

Hi @tohtana. I’ve added the corresponding handling in the latest commit based on your comment. Could you please help check whether this looks reasonable? Thanks!
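For readers following along, the kind of guard discussed above can be sketched as follows. The names EngineSketch, zenflow_enabled, and _backward_prologue are illustrative stand-ins, not the actual DeepSpeed engine attributes.

```python
# A minimal sketch of the suggested guard: when ZenFlow is enabled, reaching
# the engine's backward prologue via a direct loss.backward() call raises,
# steering users to engine.backward(loss) instead.
class EngineSketch:
    def __init__(self, zenflow_enabled: bool):
        self.zenflow_enabled = zenflow_enabled

    def _backward_prologue(self, loss):
        # Invoked from the hook registered on the loss tensor; the direct
        # loss.backward() path reaches this point without going through
        # engine.backward(), so we reject it when ZenFlow is active.
        if self.zenflow_enabled:
            raise RuntimeError(
                "ZenFlow manages its own backward pass; call "
                "engine.backward(loss) instead of loss.backward()."
            )
        return loss
```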


tohtana commented Jan 12, 2026

Hi @Antlera,
Thank you for the fix! Some tests now fail. I wonder if those might be testing with FP16. Do you have any idea?


tohtana commented Jan 18, 2026

Hi @Antlera, I opened #7793 to fix the current issues of this PR. Do you think it works?


tohtana commented Jan 18, 2026

By the way, Stage 3 + full_warm_up_rounds=0 still fails with:

AttributeError: 'Parameter' object has no attribute 'complete_column_offset'

Probably this is unrelated to this PR.


Antlera commented Jan 19, 2026

Hi @Antlera, I opened #7793 to fix the current issues of this PR. Do you think it works?
Hi @tohtana. Thanks a lot for the fix! It works on my test bed and matches my original fix. It seems I misplaced gas_scaled_loss in the follow-up fix.


Antlera commented Jan 19, 2026

By the way, Stage 3 + full_warm_up_rounds=0 still fails with:

AttributeError: 'Parameter' object has no attribute 'complete_column_offset'

Probably this is unrelated to this PR.

Thanks for pointing this out! Yeah, this looks like an issue with attribute propagation or initialization.
I’ll open a new issue and handle it in a separate PR.


Antlera commented Jan 20, 2026

By the way, Stage 3 + full_warm_up_rounds=0 still fails with:

AttributeError: 'Parameter' object has no attribute 'complete_column_offset'

Probably this is unrelated to this PR.

Hi @tohtana. I’ve just created a follow-up issue for this: #7796

@tohtana tohtana closed this Jan 20, 2026

# TODO: handle these scaling with direct calls to loss.backward()
if isinstance(self.optimizer, ZeROOptimizer):
loss = self.optimizer.scale_if_loss(loss)
Antlera (Collaborator, Author) commented on this diff:

Hi @tohtana, do you think this is OK?
We could keep this PR as the current fix. Since ZenFlow implements its own backward logic, we don’t need to add an extra layer of handling in the higher-level optimizer for now. This won’t affect the existing optimizer logic.

The test issue mainly comes from how the scaled loss is handled; fixing this part should be sufficient to make the tests pass.

Antlera (Collaborator, Author) added:

We may reopen this PR if you think this approach works.

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
@tohtana tohtana reopened this Jan 21, 2026
@tohtana
Copy link
Collaborator

tohtana commented Jan 21, 2026

@Antlera Sorry, I mistakenly closed this PR, confusing it with #7793.
I'm still not very clear about the current change. Can you help me understand the issue?

The DeepSpeed engine's forward always registers a backward hook that calls _backward_prologue on the original loss, which already invokes self.optimizer.backward_prologue() and enter_backward() for ZeRO optimizers. With the new ZenFlow path, engine.backward() also calls self.optimizer.backward(...), and ZenFlowZeroOptimizer.backward() itself calls backward_prologue(), which increments micro_step. Then, backward_epilogue() and exit_backward() are called again.

To enable this use case in a general way (not too hard-coded for ZenFlow), we need to allow an advanced custom optimizer to control hooks at an earlier stage (such as the engine's initialization).
What do you think?
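The duplicated-prologue flow described above can be sketched in a few lines of plain Python. OptimizerSketch and engine_backward are simplified stand-ins for the real DeepSpeed/ZenFlow classes, kept only to show how the counter advances twice.

```python
# Sketch of the double invocation: the engine's autograd hook calls
# backward_prologue(), and ZenFlow's own backward() calls it again, so
# micro_step advances twice for a single backward pass.
class OptimizerSketch:
    def __init__(self):
        self.micro_step = 0

    def backward_prologue(self):
        self.micro_step += 1  # bookkeeping intended to run once per backward

    def backward(self, loss):
        self.backward_prologue()  # ZenFlow-style backward() calls the prologue itself
        # ... gradient work elided ...

def engine_backward(opt, loss):
    opt.backward_prologue()  # the engine's backward hook also calls the prologue
    opt.backward(loss)

opt = OptimizerSketch()
engine_backward(opt, loss=1.0)
# micro_step is now 2, illustrating the duplicated prologue call
```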
