Core: Predicate pushdown for files metadata table #2926
Conversation
szehon-ho commented on Aug 3, 2021:
This reuses the logic of Core: Add predicate push down for Partitions table #2358 (the partitions metadata table).

@rdblue @RussellSpitzer fyi when you guys get some time, thanks
RussellSpitzer left a comment:
This looks good to me. Were we waiting on something else?
Thanks for fast response! Not that I know of.
```java
  this.name = name;
}
```

Nit: unnecessary newline.
```java
 * this spec is used to project an expression for the partitions table, the projection will remove predicates for
 * non-partition fields (not in the spec) and will remove the "partition." prefix from fields.
 *
 * @param partitionTableSchema schema of the partition table
```
Looks like this is still assuming the partition table. May want to update it to `tableSchema`.
While we're thinking about this, it may also make sense to allow passing a different prefix. The prefix for the entries table, for example, would be `data_file.partition`.
Good catch, rewrote the comment and added an argument.
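The prefix handling discussed above can be illustrated with a minimal, self-contained sketch (plain Java with a hypothetical helper name, not the actual Iceberg projection code): the projection keeps only field references under the given prefix and strips that prefix, so the same logic can serve both `partition.` on the files/partitions tables and `data_file.partition.` on the entries table.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PrefixProjection {
    // Keeps only column references under the given prefix and strips the
    // prefix, mirroring how predicates on non-partition fields are dropped.
    static List<String> projectColumns(List<String> columns, String prefix) {
        return columns.stream()
                .filter(c -> c.startsWith(prefix))
                .map(c -> c.substring(prefix.length()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> cols = List.of("partition.data_bucket", "record_count");
        // Only the partition-prefixed column survives, with the prefix removed.
        System.out.println(projectColumns(cols, "partition."));  // [data_bucket]
    }
}
```

Parameterizing the prefix is what lets the same projection be reused across metadata tables whose partition struct lives at different paths.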
```java
    .project(rowFilter);

ManifestEvaluator manifestEval = ManifestEvaluator.forPartitionFilter(partitionFilter, table().spec(),
    caseSensitive);
```
Nit: I'd probably wrap so that all the arguments are on the same line. That seems more readable to me.
You are right, done
```java
}

@VisibleForTesting
ManifestFile getManifest() {
```
I realize this is only for testing, but I would still suggest using a method name that we don't have to change later if we want to use the method elsewhere. That means removing `get` because `get` doesn't add anything. Typically, `get` signals that the method name can be simpler (e.g., `manifest` to return the manifest) or should have a more descriptive verb like `find`, `fetch`, `create`, etc. that tells the caller more about what is happening. The only time we use `get` is when the object needs to be a Java bean.
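As a quick illustration of that convention (names invented for the example, not taken from this PR):

```java
public class NamingExample {
    private final String manifestPath = "m1.avro";

    // Plain accessor: the bare noun says everything "get" would.
    String manifest() {
        return manifestPath;
    }

    // Descriptive verb for work beyond a simple accessor.
    String createManifestName(int sequence) {
        return "manifest-" + sequence + ".avro";
    }

    public static void main(String[] args) {
        NamingExample e = new NamingExample();
        System.out.println(e.manifest());               // m1.avro
        System.out.println(e.createManifestName(2));    // manifest-2.avro
    }
}
```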
Ah, got it, changed.
```java
    "Partition data tuple, schema based on the partition spec")).asStruct();

TableScan scanNoFilter = dataFilesTable.newScan().select("partition.data_bucket");
Assert.assertEquals(expected, scanNoFilter.schema().asStruct());
```
I'm not sure you need to add the schema check since this is trying to validate pushdown, but I'm fine keeping it.
```java
TableScan scanNoFilter = dataFilesTable.newScan().select("partition.data_bucket");
Assert.assertEquals(expected, scanNoFilter.schema().asStruct());
CloseableIterable<CombinedScanTask> tasksNoFilter = scanNoFilter.planTasks();
```
I think this would be a bit simpler if you used `planFiles` instead of `planTasks`. That should produce only `FileScanTask` rather than `CombinedScanTask`, so you wouldn't need to flatMap tasks because there is only one data file per task.
I had an issue early on where `planTasks` caught a bug (passing the wrong schema) that `planFiles` did not, for some reason. I changed most of them to `planFiles` and added one specific test for `planTasks`.
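The difference being discussed can be sketched with a plain-Java analogue (self-contained stand-in types, not the Iceberg API): a `planTasks`-style result groups several per-file tasks into each combined task, so reaching the files requires a `flatMap`, while a `planFiles`-style result yields one entry per file directly.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PlanAnalogy {
    // Stand-in for CombinedScanTask: a group of per-file tasks.
    record CombinedTask(List<String> files) {}

    // planTasks-style result: files grouped into combined tasks,
    // so a flatMap is needed to enumerate the underlying files.
    static List<String> filesFromTasks(List<CombinedTask> tasks) {
        return tasks.stream()
                .flatMap(t -> t.files().stream())
                .collect(Collectors.toList());
    }

    // planFiles-style result: one entry per file, no flattening required.
    static List<String> filesFromFiles(List<String> fileTasks) {
        return fileTasks;
    }

    public static void main(String[] args) {
        List<CombinedTask> grouped = List.of(
                new CombinedTask(List.of("a.parquet", "b.parquet")),
                new CombinedTask(List.of("c.parquet")));
        System.out.println(filesFromTasks(grouped).size());  // 3
    }
}
```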
```java
Expression ltAnd = Expressions.and(
    Expressions.lessThan("partition.data_bucket", 2),
    Expressions.greaterThan("record_count", 0));
```
Is the `record_count` filter needed or is it just there for good measure?
Good catch, it's covered by the "and" test, removed.
```java
    .commit();
table.newFastAppend()
    .appendFile(FILE_PARTITION_3)
    .commit();
```
You might consider a helper method to prepare the table for these tests.
Done
rdblue left a comment:
@szehon-ho, this looks great! Nice work. I just noted a few minor things but otherwise it's great. Nice job on the tests, I like how thorough they are.
@rdblue thanks for the detailed review and catching those, updated

@szehon-ho, could you fix the conflict?

Done, seems it passed

Merged! Thanks @szehon-ho, nice work!
Merge remote-tracking branch 'upstream/merge-master-20210816' into master

What does this MR do? Merges upstream/master to pull in recent bug fixes and optimizations.

What changes does it include? Key PRs of interest:
- Predicate pushdown support: https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
- Spark: skip writing an empty dataset instead of raising an error, apache#2960
- Flink UI: add uidPrefix to operators to make tracking multiple Iceberg sink jobs easier, apache#288
- Spark: fix nested struct pruning, apache#2877
- Support creating v2 format tables via table properties, apache#2887
- Add the SortRewriteStrategy framework, incrementally supporting different rewrite strategies, apache#2609 (WIP: apache#2829)
- Spark: support configuring Hadoop properties for a catalog, apache#2792
- Spark: read/write support for timestamps without timezone, apache#2757
- Spark MicroBatch: support a property to skip delete snapshots, apache#2752
- Spark: V2 RewriteDatafilesAction support
- Core: add validation for row-level deletes with rewrites, apache#2865
- Schema time travel: add schema-id (Core: add schema id to snapshot)
- Spark extensions: support identifier fields operations, apache#2560
- Parquet: update to 1.12.0, apache#2441
- Hive: vectorized ORC reads for Hive, apache#2613
- Spark: add an action to remove all referenced files, apache#2415

How was this MR tested? UT (unit tests).