Skip to content

Conversation

@wlhgtc
Copy link
Contributor

@wlhgtc wlhgtc commented Jan 9, 2026

What does this PR do?

Using pop in _build_messages removes images from doc, causing length calculation to ignore image tokens.

Then the calculate_length logic will ignore the image length.
CleanShot 2026-01-09 at 18 46 43@2x

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ……
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@wlhgtc
Copy link
Contributor Author

wlhgtc commented Jan 9, 2026

@wuxibin89

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a bug in the doc2len function where image and video tokens were not being counted due to their data being prematurely removed from the sample dictionary. The fix involves changing dict.pop to dict.get in the _build_messages helper function. While this correctly resolves the issue in doc2len, it introduces a regression in the __getitem__ method, which relies on the previous behavior of pop. My review includes a critical comment detailing this regression and suggests an alternative approach to fix the original bug without this side effect.

@wlhgtc wlhgtc changed the title [dataset] fix: Replace pop with get to preserve images/videos in doc2len fix: Use deepcopy in doc2len to preserve images/videos for length calculation Jan 9, 2026
…length calculation

Fixed a bug where _build_messages used pop() to remove images/videos from doc,
causing them to be unavailable when doc2len calculates prompt length. Instead of
changing pop to get (which would break __getitem__'s expectation that images/videos
are removed), we now deepcopy doc before passing it to _build_messages in doc2len.

This ensures multimodal prompt length filtering correctly includes image/video tokens,
preventing overlong samples that could cause OOM during training.
@wlhgtc
Copy link
Contributor Author

wlhgtc commented Jan 9, 2026

Code Review

This pull request fixes a bug in the doc2len function where image and video tokens were not being counted due to their data being prematurely removed from the sample dictionary. The fix involves changing dict.pop to dict.get in the _build_messages helper function. While this correctly resolves the issue in doc2len, it introduces a regression in the __getitem__ method, which relies on the previous behavior of pop. My review includes a critical comment detailing this regression and suggests an alternative approach to fix the original bug without this side effect.

Fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants