-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: Fix incremental compute of partition stats for various edge cases #13163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/PartitionStatsHandlerTestBase.java
Outdated
Show resolved
Hide resolved
|
@ajantha-bhat Thanks for working on this! Left some comments. |
37ac725 to
5e092ef
Compare
| try (CloseableIterable<PartitionStats> oldStats = | ||
| readPartitionStatsFile(schema(partitionType), Files.localInput(previousStatsFile.path()))) { | ||
| readPartitionStatsFile( | ||
| schema(partitionType), table.io().newInputFile(previousStatsFile.path()))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just using the FileIO as pointed out in the PR.
Unrelated to this bug fix. But related to this feature.
| id -> | ||
| table.snapshot(id).allManifests(table.io()).stream() | ||
| .filter(file -> file.snapshotId().equals(id))) | ||
| .collect(Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also checked that if snapshots are expired, we cannot find previous stats for the table in the caller.
So, it will fallback to full compute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that, because of snapshot id filter,
Each snapshot's added manifest files will be considered only once for compute. So, reused manifests won't be considered again. If manifests are rewritten, entries will be marked as EXISTING and won't be considered for incremental compute from existing logic in collectStatsForManifest.
So, IMO it works for all the scenarios now and we have testcase to cover all the scenarios.
5e092ef to
6a6e2f1
Compare
core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java
Outdated
Show resolved
Hide resolved
|
@lirui-apache: Could you please take another look at the fix, verify and approve the PR if it looks good for you? |
lirui-apache
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @ajantha-bhat . I only have a minor comment. But I have another question related to the feature and may be worth another PR. In PartitionStatsHandler::latestStatsFile, we throw exception if there is any previous stats file but not reachable via the parent snapshot pointers. I think this means we don't allow gaps in snapshot history, which can be common when using tags. Won't it be more user friendly to just fall back to a full compute in that case?
| // So, for incremental computation, gather the manifests added by each snapshot | ||
| // instead of relying solely on those from the latest snapshot. | ||
| List<ManifestFile> manifests = | ||
| snapshotIdsRange.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: How about just call SnapshotUtil::ancestorsBetween and iterate through the ancestors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
Great. I didn't think of this case before. Hence, added that exception before. I made it to fallback now and added the test. |
|
@pvary: Thanks for previous review and approval. Please take another look after @lirui-apache's approval. |
lirui-apache
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating. LGTM
|
Merged to main. |
Fixes an edge case where deleted entires information is not carried to next snapshot. So, incremental stats compute was wrong for this copy on write case. Fix is to apply incremental stats compute snapshot by snapshot for its added manifest.
Also fallback when stats file exist for table but not for current snapshot chain, instead of throwing exception.
Fixes: #13155