@gabeiglio gabeiglio commented Jul 14, 2025

These changes ensure that data files are not all loaded into memory at once when validating replaced partitions. Instead, new files are loaded in batches through ParallelIterable, with a hard limit of 30k files in memory at a time.

This PR deprecates SnapshotUtil.newFiles in favor of SnapshotUtil.newFilesBetween.
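
To illustrate the general idea of bounding how many files are held in memory during validation, here is a minimal sketch in plain Java. It is not the code from this PR and does not use Iceberg's ParallelIterable or SnapshotUtil; the class name, method name, and constant below are hypothetical, and the sketch is single-threaded for simplicity. It only shows the pattern of draining an iterator in fixed-size batches so that at most one batch is resident at a time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only; not the PR's implementation and not an Iceberg API.
final class BatchedValidationSketch {

  // Mirrors the 30k figure from the PR description; the constant name is made up.
  static final int MAX_FILES_IN_MEMORY = 30_000;

  /**
   * Drains {@code source} in batches of at most {@code maxBatchSize} items and
   * hands each batch to {@code validator}. Returns the total number of items seen.
   */
  static <T> long validateInBatches(
      Iterator<T> source, int maxBatchSize, Consumer<List<T>> validator) {
    if (maxBatchSize <= 0) {
      throw new IllegalArgumentException("maxBatchSize must be positive");
    }
    long total = 0;
    List<T> batch = new ArrayList<>();
    while (source.hasNext()) {
      batch.add(source.next());
      if (batch.size() == maxBatchSize) {
        validator.accept(batch);
        total += batch.size();
        // Start a fresh list so the previous batch becomes eligible for collection.
        batch = new ArrayList<>();
      }
    }
    // Validate the final, possibly partial, batch.
    if (!batch.isEmpty()) {
      validator.accept(batch);
      total += batch.size();
    }
    return total;
  }

  public static void main(String[] args) {
    // Stand-in file paths; a real caller would stream data files lazily.
    List<String> paths = new ArrayList<>();
    for (int i = 0; i < 7; i++) {
      paths.add("data-" + i + ".parquet");
    }
    long seen = validateInBatches(
        paths.iterator(), 3, b -> System.out.println("validating " + b.size() + " files"));
    System.out.println("total files: " + seen);
  }
}
```

The key point of the design is that the source is consumed lazily through an iterator, so the memory held by the validation step is bounded by the batch size rather than by the total number of new files.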

@github-actions github-actions bot added the core label Jul 14, 2025
@gabeiglio gabeiglio force-pushed the batch-load-new-files branch from 42f0f83 to 4a0dd13 Compare July 15, 2025 00:31
@gabeiglio gabeiglio force-pushed the batch-load-new-files branch from 4a0dd13 to dcbecc6 Compare July 15, 2025 04:33
@gabeiglio gabeiglio requested a review from bryanck July 18, 2025 16:38
@bryanck bryanck left a comment

LGTM

@amogh-jahagirdar amogh-jahagirdar left a comment

Thanks @gabeiglio, and @bryanck for reviewing!

@amogh-jahagirdar amogh-jahagirdar merged commit 159d253 into apache:main Aug 12, 2025
42 checks passed
fbertsch pushed a commit to fbertsch/iceberg that referenced this pull request Jan 19, 2026
…ons (apache#13556) - 1.4 (apache#824)

Optimize memory utilization by batching new files load when validating a
cherry-pick

Co-authored-by: Gabriel Igliozzi <gaboiglio@gmail.com>