@szehon-ho
Member

@szehon-ho szehon-ho commented Mar 23, 2022

This moves the all_data_files table onto the new base class added in #4139, which allows it to push down partition predicates.

Changes the return type of the base class's manifests() method from List to Iterable, to support the ParallelIterable that AllDataFilesTable originally used.
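
A minimal sketch of why the wider return type helps, using hypothetical stand-in classes (`ManifestSourceSketch`, `EagerSourceSketch`, `LazySourceSketch`) rather than Iceberg's real ones: an `Iterable` return type accepts both an eager `List` and a lazily produced sequence, whereas a `List` return type would force the lazy one to be materialized first.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the base scan class; not Iceberg's real API.
abstract class ManifestSourceSketch {
  // Declared as Iterable so subclasses may return a List or a lazy sequence.
  abstract Iterable<String> manifests();

  // Consumers only need to iterate, so they work with either kind of result.
  List<String> collect() {
    List<String> result = new ArrayList<>();
    for (String manifest : manifests()) {
      result.add(manifest);
    }
    return result;
  }
}

// Returns an already materialized list.
class EagerSourceSketch extends ManifestSourceSketch {
  @Override
  Iterable<String> manifests() {
    return List.of("m1.avro", "m2.avro");
  }
}

// Produces elements on demand, loosely standing in for a parallel or lazy iterable.
class LazySourceSketch extends ManifestSourceSketch {
  @Override
  Iterable<String> manifests() {
    return () -> new Iterator<String>() {
      private int index = 1;

      @Override
      public boolean hasNext() {
        return index <= 2;
      }

      @Override
      public String next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return "m" + index++ + ".avro";
      }
    };
  }
}
```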

Inherit some WAP (staging snapshot) handling from BaseAllMetadataTableScan.

@szehon-ho szehon-ho force-pushed the all_data_table_pred_pushdown branch from d576ea6 to 803b973 Compare March 23, 2022 22:07
@szehon-ho
Member Author

The failure may be unrelated. Filed #4383; re-triggering to see if it goes away.

@szehon-ho szehon-ho closed this Mar 24, 2022
@szehon-ho szehon-ho reopened this Mar 24, 2022
@szehon-ho
Member Author

Looks like the build is broken in general by 0f6398f; will retrigger once that is resolved.

@szehon-ho szehon-ho closed this Mar 24, 2022
@szehon-ho szehon-ho reopened this Mar 24, 2022
/**
 * Alternative to {@link #planFiles()}, allows exploring old snapshots even for empty table.
 */
protected CloseableIterable<FileScanTask> planAllFiles() {
Member

Could this be in BaseFilesTableScan? Just wondering if it would be relevant for say "Manifests" tables?

Member Author

@szehon-ho szehon-ho Mar 25, 2022

Yeah, this was tricky, and I went back and forth between several solutions. It is relevant for Files and for other tables such as AllManifests and AllEntries.

BaseAllMetadataTableScan used to override planFiles(), and that override was inherited by AllDataFiles, AllManifests, and AllEntries. This PR moves AllDataFiles to BaseFilesTableScan, so I moved the method up to the common parent, where it can be shared by the two hierarchies (BaseAllMetadataTableScan and BaseFilesTableScan). We no longer get the override for free, though, because AllDataFiles cannot inherit from both classes.

Member Author

Actually, I tried to improve this a bit: the override now lives in the BaseMetadataTableScan parent class and is driven by a boolean flag that the subclasses configure. Let me know if that is better.
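
A rough sketch of that flag-based shape, using hypothetical stand-in classes and `List<String>` in place of `CloseableIterable<FileScanTask>` (the names `planCurrentSnapshot` and `planAllSnapshots` are assumptions for illustration, not the PR's code). The parent owns `planFiles()` and branches on a flag, so a scan in either hierarchy can opt in without needing two superclasses.

```java
import java.util.List;

// Hypothetical stand-in for the common parent; not Iceberg's real class.
abstract class MetadataScanSketch {
  // Subclasses backing 'all_x' tables override this to return true.
  protected boolean allScan() {
    return false;
  }

  // Single override point: the flag decides which planning path is used.
  public List<String> planFiles() {
    return allScan() ? planAllSnapshots() : planCurrentSnapshot();
  }

  protected List<String> planCurrentSnapshot() {
    return List.of("current");
  }

  protected List<String> planAllSnapshots() {
    return List.of("s1", "s2", "current");
  }
}

// First hierarchy: files-style scans.
class FilesScanSketch extends MetadataScanSketch {}

class AllDataFilesScanSketch extends FilesScanSketch {
  @Override
  protected boolean allScan() {
    return true;
  }
}

// Second hierarchy: other 'all_x' scans.
class AllManifestsScanSketch extends MetadataScanSketch {
  @Override
  protected boolean allScan() {
    return true;
  }
}
```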

/**
 * @return if metadata table scan is for all snapshots, ie 'all_x' metadata tables
 */
protected boolean allScan() {
Member

How about allSnapshotsScan? Also, a Javadoc that consists only of an @return tag should instead be written as

/**
* Return .....
**/

This populates the Javadoc summary field; otherwise the summary is blank and the only documentation appears in the method detail.
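
To illustrate the difference, here is a small sketch with a hypothetical class and method names. The Javadoc tool builds the method summary from the first sentence of the main description, and a block tag such as `@return` is not part of that description.

```java
// Hypothetical class for illustration only.
class JavadocSummarySketch {
  /**
   * @return true if the scan covers all snapshots
   */
  boolean tagOnly() {
    // The comment above has no main description, so the summary row is empty.
    return true;
  }

  /**
   * Returns true if the scan covers all snapshots.
   */
  boolean withDescription() {
    // The first sentence above becomes the summary row.
    return true;
  }
}
```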

Contributor

+1, I think allScan is not very descriptive

Contributor

In fact, what about removing this flag completely and overriding planFiles in BaseAllMetadataTableScan?

Contributor

The current implementation of planFiles seems a little bit weird with this flag.

Member Author

@szehon-ho szehon-ho Apr 4, 2022

@aokolnychyi yeah, I had that version in the previous iteration: a common method shared by BaseAllMetadataTableScan and AllDataFileTableScan (which no longer inherits from it), like:

BaseMetadataTableScan.planFilesForAllSnapshots(); 

and then having:

@Override
BaseAllMetadataTableScan.planFiles() {
   super.planFilesForAllSnapshots();
}

@Override
AllDataFileTableScan.planFiles() {
  super.planFilesForAllSnapshots();
}

Is that better than this? I can revert back to that version.

Member Author

I changed to this approach because it is slightly clearer: it forces implementers to override this flag rather than having to discover the method by reading the code. It's not a big deal either way, though, as this is fairly internal.

Member

@RussellSpitzer RussellSpitzer left a comment

Small note on the "allScan" naming; other than that, I think this is good to go.

}

@Override
public CloseableIterable<FileScanTask> planFiles() {
Contributor

I think we should override planFiles in BaseAllMetadataTableScan instead.

@aokolnychyi
Contributor

We should fix the predicate pushdown in a separate PR, as it only uses the current table spec instead of the spec of the manifest being filtered. The logic I am talking about is in filterManifests.
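
A simplified sketch of the concern, under stated assumptions: the classes below are hypothetical stand-ins, partition values are plain ints, and the predicate is a single equality. It is not the filterManifests code. The point is only that each manifest's bounds are laid out by the spec it was written with, so the lookup should go through that manifest's spec ID rather than the table's current spec.

```java
import java.util.List;
import java.util.Map;

// Hypothetical partition spec: just an ordered list of partition field names.
final class SpecSketch {
  final List<String> fields;

  SpecSketch(List<String> fields) {
    this.fields = fields;
  }
}

// Hypothetical manifest: the spec it was written with, plus per-field bounds.
final class ManifestSketch {
  final int specId;
  final int[] lower;
  final int[] upper;

  ManifestSketch(int specId, int[] lower, int[] upper) {
    this.specId = specId;
    this.lower = lower;
    this.upper = upper;
  }
}

final class ManifestPruningSketch {
  // Returns false only when the manifest provably has no rows with field == value.
  static boolean mayContain(
      ManifestSketch manifest, Map<Integer, SpecSketch> specsById, String field, int value) {
    // Resolve the spec per manifest instead of assuming the current table spec.
    SpecSketch spec = specsById.get(manifest.specId);
    int pos = spec.fields.indexOf(field);
    if (pos < 0) {
      // The field is not part of this manifest's spec, so it cannot be pruned.
      return true;
    }
    return manifest.lower[pos] <= value && value <= manifest.upper[pos];
  }
}
```

In this sketch, a manifest written under an older one-field spec would be indexed at the wrong position (or out of bounds) if the two-field current spec were applied to it.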

@szehon-ho
Member Author

@aokolnychyi I addressed as many comments as I could. Can you check, when you have time, whether it's any clearer?

Contributor

@aokolnychyi aokolnychyi left a comment

Well, I did not realize we have a somewhat weird hierarchy in our metadata tables and would ideally want to extend two classes, which is obviously not possible. This isn't a problem with this PR, and I don't have an obvious solution. Let's get this in and check whether we can use composition rather than inheritance in a few places to simplify the hierarchy of metadata tables and scans.
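
One possible shape for the composition idea, sketched with hypothetical names (`SnapshotSelectionSketch`, `ComposedFilesScanSketch`) and strings in place of real scan tasks. It is only meant to show the direction, not a proposed design: the "which snapshots" behavior becomes an injected collaborator, so a files scan does not need a second superclass to cover all snapshots.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical strategy for choosing which snapshots a scan covers.
interface SnapshotSelectionSketch {
  List<String> snapshots();
}

class CurrentSnapshotOnlySketch implements SnapshotSelectionSketch {
  @Override
  public List<String> snapshots() {
    return List.of("current");
  }
}

class AllSnapshotsSketch implements SnapshotSelectionSketch {
  @Override
  public List<String> snapshots() {
    return List.of("s1", "s2", "current");
  }
}

// The scan owns the file-planning logic and delegates snapshot selection.
class ComposedFilesScanSketch {
  private final SnapshotSelectionSketch selection;

  ComposedFilesScanSketch(SnapshotSelectionSketch selection) {
    this.selection = selection;
  }

  List<String> planFiles() {
    List<String> tasks = new ArrayList<>();
    for (String snapshot : selection.snapshots()) {
      tasks.add("files@" + snapshot);
    }
    return tasks;
  }
}
```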

@aokolnychyi aokolnychyi merged commit 56d8f07 into apache:master Apr 6, 2022
sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 9, 2023