Core: Predicate pushdown for all_data_files table #4382
d576ea6 to
803b973
Compare
The failure may not be related, filed: #4383, re-triggering to see if it goes away.
Looks like build is broken in general by 0f6398f, will retrigger after.
/**
 * Alternative to {@link #planFiles()}, allows exploring old snapshots even for empty table.
 */
protected CloseableIterable<FileScanTask> planAllFiles() {
Could this be in BaseFilesTableScan? Just wondering if it would be relevant for say "Manifests" tables?
Yea this was tricky and I went back and forth with various solutions. It is relevant for Files and other tables like AllManifests, AllEntries.
BaseAllMetadataTableScan used to override planFiles(), which AllDataFiles, AllManifests, and AllEntries inherited. This PR moves AllDataFiles to BaseFilesTableScan, so I moved the method up to the common parent, where both hierarchies (BaseAllMetadataTableScan and BaseFilesTableScan) can share it. We no longer get the override for free, though, since AllDataFiles cannot inherit from both classes.
Actually I tried to improve a little bit and made the override at the BaseMetadataTableScan parent class, based on a boolean flag the subclasses configure, let me know if that is better.
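For readers following along, the flag-based approach described above could look roughly like this. It is a minimal sketch with hypothetical stand-in classes and string "tasks", not the actual Iceberg classes; making the flag abstract is an assumption.

```java
import java.util.List;

// Hypothetical stand-in for BaseMetadataTableScan; names and types are illustrative only.
abstract class MetadataScanSketch {
  // Each concrete scan states whether it covers all snapshots ('all_x' tables).
  protected abstract boolean allScan();

  protected List<String> planCurrentSnapshot() {
    return List.of("current-snapshot-files");
  }

  protected List<String> planAllSnapshots() {
    return List.of("snapshot-1-files", "snapshot-2-files");
  }

  // Single override point: the parent dispatches on the flag.
  public final List<String> planFiles() {
    return allScan() ? planAllSnapshots() : planCurrentSnapshot();
  }
}

// Stand-in for a scan in the files hierarchy (current snapshot only).
class FilesScanSketch extends MetadataScanSketch {
  @Override
  protected boolean allScan() {
    return false;
  }
}

// Stand-in for all_data_files: extends the files hierarchy but flips the flag.
class AllDataFilesScanSketch extends FilesScanSketch {
  @Override
  protected boolean allScan() {
    return true;
  }
}

class FlagDemo {
  public static void main(String[] args) {
    System.out.println(new FilesScanSketch().planFiles());
    System.out.println(new AllDataFilesScanSketch().planFiles());
  }
}
```

With this shape, a scan in either hierarchy only has to answer the flag; it does not need to know which planning method to call.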
/**
 * @return if metadata table scan is for all snapshots, ie 'all_x' metadata tables
 */
protected boolean allScan() {
How about allSnapshotsScan? Also, Javadocs that only document the return value should be written as
/**
* Return .....
**/
This populates the Javadoc summary field. Otherwise the summary is blank and the only documentation appears in the method detail.
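As a concrete illustration of the Javadoc point: the standard doclet takes the first sentence of the main description as the method summary, so a comment consisting only of an `@return` tag leaves the summary empty. A small sketch, using a hypothetical class and the suggested method name:

```java
// Hypothetical class, only here to show the two Javadoc styles side by side.
class JavadocSummarySketch {
  private final boolean allSnapshots;

  JavadocSummarySketch(boolean allSnapshots) {
    this.allSnapshots = allSnapshots;
  }

  /**
   * @return whether the scan covers all snapshots
   */
  boolean tagOnlyStyle() {
    // Only a block tag above: the generated method summary row is blank.
    return allSnapshots;
  }

  /**
   * Returns whether this scan covers all snapshots (the 'all_x' metadata tables).
   */
  boolean allSnapshotsScan() {
    // Main description above: its first sentence becomes the summary.
    return allSnapshots;
  }
}
```

Note that the summary sentence ends at the first period followed by whitespace, so abbreviations with a trailing period inside the first sentence can cut the summary short.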
+1, I think allScan is not very descriptive
In fact, what about removing this flag completely and overriding planFiles in BaseAllMetadataTableScan?
The current implementation of planFiles seems a little bit weird with this flag.
@aokolnychyi Yes, I had that version in the previous iteration: a common method shared by BaseAllMetadataTableScan and AllDataFileTableScan (which no longer inherits from it), like:
BaseMetadataTableScan.planFilesForAllSnapshots();
and then having:
@Override
BaseAllMetadataTableScan.planFiles() {
super.planFilesForAllSnapshots();
}
@Override
AllDataFileTableScan.planFiles() {
super.planFilesForAllSnapshots();
}
Is that better than this? I can revert back to that version.
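Spelled out as compilable Java, the shared-helper alternative sketched above might look like the following. All classes are hypothetical stand-ins with string "tasks" instead of FileScanTask; only the method name planFilesForAllSnapshots comes from the comment.

```java
import java.util.List;

// Hypothetical stand-in for the common parent of both hierarchies.
abstract class BaseMetadataScanSketch {
  // Default planning: current snapshot only.
  public List<String> planFiles() {
    return List.of("current-snapshot-files");
  }

  // Shared helper that any 'all_x' scan can delegate to.
  protected List<String> planFilesForAllSnapshots() {
    return List.of("snapshot-1-files", "snapshot-2-files");
  }
}

// Stand-in for the files hierarchy; keeps the default planFiles().
class BaseFilesScanSketch extends BaseMetadataScanSketch {}

// Stand-in for the 'all' hierarchy; every subclass inherits this override.
class BaseAllMetadataScanSketch extends BaseMetadataScanSketch {
  @Override
  public List<String> planFiles() {
    return planFilesForAllSnapshots();
  }
}

// Stand-in for all_data_files: it lives in the files hierarchy,
// so it has to repeat the override explicitly.
class AllDataFilesHelperSketch extends BaseFilesScanSketch {
  @Override
  public List<String> planFiles() {
    return planFilesForAllSnapshots();
  }
}
```

The trade-off is visible in the last class: the override is duplicated, and nothing forces a new 'all_x' scan in the files hierarchy to remember it.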
The reason I changed to this approach is that it more clearly forces implementers to override the flag, rather than having to discover the method by reading the code. It's not a big deal either way, as it's fairly internal.
RussellSpitzer
left a comment
Small note on the "allScan" naming, other than that I think this is good to go
...2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMetadataTables.java
}

@Override
public CloseableIterable<FileScanTask> planFiles() {
I think we should override planFiles in BaseAllMetadataTableScan instead.
We should fix the predicate pushdown in a separate PR as it only uses the current table spec instead of the spec of a manifest that we filter. The logic I am talking about is in
…lanFiles override" This reverts commit 1db465e.
@aokolnychyi I addressed as many comments as I could. Can you check, when you have time, whether it's any clearer?
aokolnychyi
left a comment
Well, I did not realize we have a little bit weird hierarchy in our metadata tables and ideally want to extend two classes, which is obviously not possible. This isn't a problem of this PR and I don't have an obvious solution. Let's get this in and check if we can maybe use composition rather than inheritance in a few places to simplify the hierarchy of metadata tables and scans.
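To make the composition idea concrete, here is one possible shape, purely as a sketch: snapshot selection becomes a pluggable collaborator, so a files scan does not need a second parent class to cover all snapshots. Every name below is hypothetical and none of this is an actual Iceberg API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical strategy: which snapshots a metadata scan should visit.
interface SnapshotSelectionSketch {
  List<String> snapshotsToScan();
}

class CurrentSnapshotOnly implements SnapshotSelectionSketch {
  @Override
  public List<String> snapshotsToScan() {
    return List.of("current");
  }
}

class AllSnapshots implements SnapshotSelectionSketch {
  @Override
  public List<String> snapshotsToScan() {
    return List.of("s1", "s2", "current");
  }
}

// One files scan class; 'files' vs 'all_data_files' differ only in the injected strategy.
class FilesScanComposedSketch {
  private final SnapshotSelectionSketch selection;

  FilesScanComposedSketch(SnapshotSelectionSketch selection) {
    this.selection = selection;
  }

  List<String> planFiles() {
    List<String> tasks = new ArrayList<>();
    for (String snapshot : selection.snapshotsToScan()) {
      tasks.add("data-files-of-" + snapshot);
    }
    return tasks;
  }
}
```

Under this design, the "files" and "all" dimensions vary independently, which is what the two inheritance hierarchies cannot express together.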
(cherry picked from commit 56d8f07)
This moves the all_data_files table onto the new base class added in #4139, which allows it to push down partition predicates.
Change the base class manifests() method slightly from List to Iterable, to support the original ParallelIterable of the AllDataFilesTable.
Inherit some WAP (staging snapshot) handling from BaseAllMetadataTableScan.
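On the List to Iterable change: widening the base method's return type lets one subclass keep returning a materialized list while another returns a lazily produced sequence. A minimal sketch with hypothetical classes and string manifests (the lazy iterator below merely stands in for something like a parallel iterable):

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical base class: callers only need to iterate, not index or size.
abstract class ManifestSourceSketch {
  protected abstract Iterable<String> manifests();

  int countManifests() {
    int count = 0;
    for (String ignored : manifests()) {
      count++;
    }
    return count;
  }
}

// A subclass may still return a List, thanks to covariant return types.
class ListBackedSourceSketch extends ManifestSourceSketch {
  @Override
  protected List<String> manifests() {
    return List.of("m1", "m2");
  }
}

// Another subclass produces manifests lazily, without building a list first.
class LazySourceSketch extends ManifestSourceSketch {
  @Override
  protected Iterable<String> manifests() {
    return () -> new Iterator<String>() {
      private int index = 0;

      @Override
      public boolean hasNext() {
        return index < 3;
      }

      @Override
      public String next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return "m" + index++;
      }
    };
  }
}
```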