fix: Use deepcopy in doc2len to preserve images/videos for length calculation #4861
base: main
Code Review
This pull request fixes a bug in the doc2len function where image and video tokens were not being counted due to their data being prematurely removed from the sample dictionary. The fix involves changing dict.pop to dict.get in the _build_messages helper function. While this correctly resolves the issue in doc2len, it introduces a regression in the __getitem__ method, which relies on the previous behavior of pop. My review includes a critical comment detailing this regression and suggests an alternative approach to fix the original bug without this side effect.
…length calculation
Fixed a bug where _build_messages used pop() to remove images/videos from doc, causing them to be unavailable when doc2len calculates prompt length. Instead of changing pop to get (which would break __getitem__'s expectation that images/videos are removed), we now deepcopy doc before passing it to _build_messages in doc2len.
This ensures multimodal prompt length filtering correctly includes image/video tokens, preventing overlong samples that could cause OOM during training.
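The deepcopy approach described in the commit message can be sketched as follows. This is a minimal illustration, not the project's actual code: `build_messages`, the simplified `doc2len`, the `tokens_per_media` cost, and the whitespace-based token count are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import copy


def build_messages(doc):
    """Hypothetical stand-in for _build_messages: consumes media keys via pop()."""
    images = doc.pop("images", None) or []
    videos = doc.pop("videos", None) or []
    return {"prompt": doc["prompt"], "images": images, "videos": videos}


def doc2len(doc, tokens_per_media=4):
    """Hypothetical length estimate: text tokens plus a fixed cost per image/video."""
    # Work on a deep copy so pop() inside build_messages cannot mutate the caller's doc.
    messages = build_messages(copy.deepcopy(doc))
    n_media = len(messages["images"]) + len(messages["videos"])
    return len(messages["prompt"].split()) + tokens_per_media * n_media


doc = {"prompt": "describe the picture", "images": ["img0", "img1"]}
length = doc2len(doc)
# The original doc still carries its images after the length calculation.
still_has_images = "images" in doc
```

Because the helper keeps using pop(), any caller that relies on the media keys being removed from the dictionary it passes in is unaffected; only the length calculation is isolated from the mutation.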
What does this PR do?
Using pop in _build_messages removes images from doc, so the subsequent length calculation no longer sees the images and ignores the image tokens.
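The side effect can be reproduced in a few lines. The helper below is a hypothetical simplification of the pattern described above (the name `build_messages_with_pop` and the dictionary layout are assumptions for illustration only):

```python
def build_messages_with_pop(doc):
    """Hypothetical helper that removes the media key from doc as a side effect."""
    images = doc.pop("images", [])
    return doc["prompt"], images


doc = {"prompt": "what is shown here", "images": ["img0"]}
prompt, images = build_messages_with_pop(doc)

# Any later lookup on the same dictionary finds no images,
# so a length calculation that reads doc afterwards counts zero image tokens.
images_seen_later = doc.get("images")
```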

Checklist Before Starting
[{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
[BREAKING] to the beginning of the title.
[BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.