Conversation

@Antlera Antlera commented Jan 11, 2026

This PR refactors ZenFlow’s integration with DeepSpeed’s Adam/AdamW optimizers to adapt to recent changes in PyTorch’s backward execution model. It is a follow-up to #7759 and a fast patch to restore correct behavior under the new loss.backward() flow.

Antlera and others added 2 commits January 11, 2026 01:38
- Introduced ZenFlowAdamBuilder in op_builder.
- Updated CPU_Accelerator to include ZenFlowAdamBuilder in the import statements and class handling.
- Modified ZenFlowCPUAdam to utilize ZenFlowAdamBuilder for creating Adam instances.

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
- Updated the DeepSpeedEngine to ensure ZenFlow manages the backward pass, allowing for selective parameter updates and synchronization boundaries.
- Replaced direct calls to `torch._utils.is_compiling()` with `is_compiling()` from the compiler module for better clarity and maintainability.

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>

Antlera commented Jan 11, 2026

TODO: Ideally, ZenFlow’s backward logic should be fully compatible with the standard loss.backward() path in the future.

Antlera and others added 2 commits January 11, 2026 01:54
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>

tohtana commented Jan 11, 2026

Thank you for the fix!
Can we throw an error if ZenFlow is enabled and backward is called in loss.backward()-style?

return gas_scaled_loss

# TODO: handle these scaling with direct calls to loss.backward()
if isinstance(self.optimizer, ZeROOptimizer):
tohtana (Collaborator) commented on this diff:

This block was originally outside the compiled_autograd context. I wonder if compiling the scaling might cause an issue when the scaling factor changes. Can we move this block outside the context?
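The concern above can be sketched in a few lines of plain Python. This is only an illustration of the suggested structure, under assumed names: compiled_autograd_ctx stands in for PyTorch's compiled-autograd context manager (e.g. torch._dynamo.compiled_autograd.enable), and the scaling call is simplified to a multiply.

```python
from contextlib import contextmanager

# Illustrative sketch of the review suggestion: keep the loss-scaling step
# eager, outside the compiled_autograd context, so a changing scale factor
# is not baked into the compiled graph. All names here are stand-ins, not
# the real DeepSpeed/PyTorch identifiers.
@contextmanager
def compiled_autograd_ctx():
    yield  # placeholder for the real compiled-autograd context manager

def backward(loss, scale, run_autograd):
    scaled = loss * scale          # scaling happens eagerly, outside the context
    with compiled_autograd_ctx():  # only the autograd execution is compiled
        run_autograd(scaled)
    return scaled
```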

Antlera (Collaborator, Author) replied:

Hi @tohtana, thanks for the review and the reminder. Placing this block outside the context should be better.

tohtana (Collaborator) replied:

Yes, that's a good point. I wanted to resolve this so we could pass the full CI test suite, but it makes the core part of the optimizer more ZenFlow-specific. We probably shouldn’t cut corners here. Let me close this for now and consider a more general approach.


Antlera commented Jan 12, 2026

Thank you for the fix! Can we throw an error if ZenFlow is enabled and backward is called in loss.backward()-style?

Good suggestion!

- Added a RuntimeError to prevent direct calls to loss.backward() when ZenFlow is enabled, ensuring proper management of the backward pass.
- Updated position of loss scale block.

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
…i/DeepSpeed into tingfeng/zenflow_fix_backward

Antlera commented Jan 12, 2026

Hi @tohtana. I’ve added the corresponding handling in the latest commit based on your comment. Could you please help check whether this looks reasonable? Thanks!
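For readers following along, the kind of guard discussed above can be sketched as follows. The names EngineSketch, zenflow_enabled, and _backward_prologue are illustrative stand-ins, not the actual DeepSpeed engine attributes.

```python
# A minimal sketch of the suggested guard: when ZenFlow is enabled, reaching
# the engine's backward prologue via a direct loss.backward() call raises,
# steering users to engine.backward(loss) instead.
class EngineSketch:
    def __init__(self, zenflow_enabled: bool):
        self.zenflow_enabled = zenflow_enabled

    def _backward_prologue(self, loss):
        # Invoked from the hook registered on the loss tensor; the direct
        # loss.backward() path reaches this point without going through
        # engine.backward(), so we reject it when ZenFlow is active.
        if self.zenflow_enabled:
            raise RuntimeError(
                "ZenFlow manages its own backward pass; call "
                "engine.backward(loss) instead of loss.backward()."
            )
        return loss
```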


tohtana commented Jan 12, 2026

Hi @Antlera,
Thank you for the fix! Some tests now fail. I wonder if those might be testing with FP16. Do you have any idea?


tohtana commented Jan 18, 2026

Hi @Antlera, I opened #7793 to fix the current issues of this PR. Do you think it works?


tohtana commented Jan 18, 2026

By the way, Stage 3 + full_warm_up_rounds=0 still fails with:

AttributeError: 'Parameter' object has no attribute 'complete_column_offset'

Probably this is unrelated to this PR.


Antlera commented Jan 19, 2026

Hi @Antlera, I opened #7793 to fix the current issues of this PR. Do you think it works?
Hi @tohtana. Thanks a lot for the fix! It works on my test bed and matches my original fix. It seems I misplaced gas_scaled_loss in the follow-up fix.


Antlera commented Jan 19, 2026

By the way, Stage 3 + full_warm_up_rounds=0 still fails with:

AttributeError: 'Parameter' object has no attribute 'complete_column_offset'

Probably this is unrelated to this PR.

Thanks for pointing this out! Yeah, this looks like an issue with attribute propagation or initialization.
I’ll open a new issue and handle it in a separate PR.


Antlera commented Jan 20, 2026

By the way, Stage 3 + full_warm_up_rounds=0 still fails with:

AttributeError: 'Parameter' object has no attribute 'complete_column_offset'

Probably this is unrelated to this PR.

Hi @tohtana. I’ve just created a follow-up issue for this: #7796

@tohtana tohtana closed this Jan 20, 2026

# TODO: handle these scaling with direct calls to loss.backward()
if isinstance(self.optimizer, ZeROOptimizer):
loss = self.optimizer.scale_if_loss(loss)
Antlera (Collaborator, Author) commented on this diff:

Hi @tohtana, do you think this is OK?
We could keep this PR as the current fix. Since ZenFlow implements its own backward logic, we don’t need to add an extra layer of handling in the higher-level optimizer for now. This won’t affect the existing optimizer logic.

The test issue mainly comes from how the scaled loss is handled; fixing this part should be sufficient to make the tests pass.

Antlera (Collaborator, Author) added:

We may reopen this PR if you think this approach works.

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
@tohtana tohtana reopened this Jan 21, 2026
@tohtana
Copy link
Collaborator

tohtana commented Jan 21, 2026

@Antlera Sorry, I mistakenly closed this PR, confusing it with #7793.
I'm still not very clear about the current change. Can you help me understand the issue?

The DeepSpeed engine's forward always registers a backward hook that calls _backward_prologue on the original loss, which already invokes self.optimizer.backward_prologue() and enter_backward() for ZeRO optimizers. With the new ZenFlow path, engine.backward() also calls self.optimizer.backward(...), and ZenFlowZeroOptimizer.backward() itself calls backward_prologue(), which increments micro_step. Then, backward_epilogue() and exit_backward() are called again.

To enable this use case in a general way (not too hard-coded for ZenFlow), we need to allow an advanced custom optimizer to control hooks at an earlier stage (such as the engine's initialization).
What do you think?
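The duplicated-prologue flow described above can be sketched in a few lines of plain Python. OptimizerSketch and engine_backward are simplified stand-ins for the real DeepSpeed/ZenFlow classes, kept only to show how the counter advances twice.

```python
# Sketch of the double invocation: the engine's autograd hook calls
# backward_prologue(), and ZenFlow's own backward() calls it again, so
# micro_step advances twice for a single backward pass.
class OptimizerSketch:
    def __init__(self):
        self.micro_step = 0

    def backward_prologue(self):
        self.micro_step += 1  # bookkeeping intended to run once per backward

    def backward(self, loss):
        self.backward_prologue()  # ZenFlow-style backward() calls the prologue itself
        # ... gradient work elided ...

def engine_backward(opt, loss):
    opt.backward_prologue()  # the engine's backward hook also calls the prologue
    opt.backward(loss)

opt = OptimizerSketch()
engine_backward(opt, loss=1.0)
# micro_step is now 2, illustrating the duplicated prologue call
```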
