Core: Predicate pushdown for files metadata table #2926
Conversation
szehon-ho commented on Aug 3, 2021:
This reuses the logic of Core: Add predicate push down for Partitions table #2358 (the partitions metadata table).

@rdblue @RussellSpitzer fyi when you guys get some time, thanks
RussellSpitzer left a comment:
This looks good to me. Were we waiting on something else?
Thanks for fast response! Not that I know of.
```java
  this.name = name;
}
```

Nit: unnecessary newline.
```java
 * this spec is used to project an expression for the partitions table, the projection will remove predicates for
 * non-partition fields (not in the spec) and will remove the "partition." prefix from fields.
 *
 * @param partitionTableSchema schema of the partition table
```
Looks like this is still assuming the partition table. May want to update it to `tableSchema`.
While we're thinking about this, it may also make sense to allow passing a different prefix. The prefix for the entries table, for example, would be `data_file.partition`.
Good catch, rewrote the comment and added an argument.
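The prefix handling discussed above can be illustrated with a minimal, self-contained sketch (plain Java with a hypothetical helper name, not the actual Iceberg projection code): the projection keeps only field references under the given prefix and strips that prefix, so the same logic can serve both `partition.` on the files/partitions tables and `data_file.partition.` on the entries table.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PrefixProjection {
    // Keeps only column references under the given prefix and strips the
    // prefix, mirroring how predicates on non-partition fields are dropped.
    static List<String> projectColumns(List<String> columns, String prefix) {
        return columns.stream()
                .filter(c -> c.startsWith(prefix))
                .map(c -> c.substring(prefix.length()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> cols = List.of("partition.data_bucket", "record_count");
        // Only the partition-prefixed column survives, with the prefix removed.
        System.out.println(projectColumns(cols, "partition."));  // [data_bucket]
    }
}
```

Parameterizing the prefix is what lets the same projection be reused across metadata tables whose partition struct lives at different paths.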
```java
    .project(rowFilter);

ManifestEvaluator manifestEval = ManifestEvaluator.forPartitionFilter(partitionFilter, table().spec(),
    caseSensitive);
```
Nit: I'd probably wrap so that all the arguments are on the same line. That seems more readable to me.
You are right, done
```java
}

@VisibleForTesting
ManifestFile getManifest() {
```
I realize this is only for testing, but I would still suggest using a method name that we don't have to change later if we want to use the method elsewhere. That means removing `get` because `get` doesn't add anything. Typically, `get` signals that the method name can be simpler (e.g., `manifest` to return the manifest) or should have a more descriptive verb like `find`, `fetch`, `create`, etc. that tells the caller more about what is happening. The only time we use `get` is when the object needs to be a Java bean.
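As a quick illustration of that convention (names invented for the example, not taken from this PR):

```java
public class NamingExample {
    private final String manifestPath = "m1.avro";

    // Plain accessor: the bare noun says everything "get" would.
    String manifest() {
        return manifestPath;
    }

    // Descriptive verb for work beyond a simple accessor.
    String createManifestName(int sequence) {
        return "manifest-" + sequence + ".avro";
    }

    public static void main(String[] args) {
        NamingExample e = new NamingExample();
        System.out.println(e.manifest());               // m1.avro
        System.out.println(e.createManifestName(2));    // manifest-2.avro
    }
}
```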
Ah, got it, changed.
```java
    "Partition data tuple, schema based on the partition spec")).asStruct();

TableScan scanNoFilter = dataFilesTable.newScan().select("partition.data_bucket");
Assert.assertEquals(expected, scanNoFilter.schema().asStruct());
```
I'm not sure you need to add the schema check since this is trying to validate pushdown, but I'm fine keeping it.
```java
TableScan scanNoFilter = dataFilesTable.newScan().select("partition.data_bucket");
Assert.assertEquals(expected, scanNoFilter.schema().asStruct());
CloseableIterable<CombinedScanTask> tasksNoFilter = scanNoFilter.planTasks();
```
I think this would be a bit simpler if you used `planFiles` instead of `planTasks`. That should produce only `FileScanTask` rather than `CombinedScanTask`, so you wouldn't need to flatMap tasks because there is only one data file per task.
I had an issue early on where `planTasks` caught a bug (passing the wrong schema) that `planFiles` did not, for some reason. I changed most of them to `planFiles` and added one specific test for `planTasks`.
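The difference being discussed can be sketched with a plain-Java analogue (self-contained stand-in types, not the Iceberg API): a `planTasks`-style result groups several per-file tasks into each combined task, so reaching the files requires a `flatMap`, while a `planFiles`-style result yields one entry per file directly.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PlanAnalogy {
    // Stand-in for CombinedScanTask: a group of per-file tasks.
    record CombinedTask(List<String> files) {}

    // planTasks-style result: files grouped into combined tasks,
    // so a flatMap is needed to enumerate the underlying files.
    static List<String> filesFromTasks(List<CombinedTask> tasks) {
        return tasks.stream()
                .flatMap(t -> t.files().stream())
                .collect(Collectors.toList());
    }

    // planFiles-style result: one entry per file, no flattening required.
    static List<String> filesFromFiles(List<String> fileTasks) {
        return fileTasks;
    }

    public static void main(String[] args) {
        List<CombinedTask> grouped = List.of(
                new CombinedTask(List.of("a.parquet", "b.parquet")),
                new CombinedTask(List.of("c.parquet")));
        System.out.println(filesFromTasks(grouped).size());  // 3
    }
}
```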
```java
Expression ltAnd = Expressions.and(
    Expressions.lessThan("partition.data_bucket", 2),
    Expressions.greaterThan("record_count", 0));
```
Is the `record_count` filter needed or is it just there for good measure?
Good catch, it's covered by the "and" test, removed.
```java
    .commit();
table.newFastAppend()
    .appendFile(FILE_PARTITION_3)
    .commit();
```
You might consider a helper method to prepare the table for these tests.
Done
rdblue left a comment:
@szehon-ho, this looks great! Nice work. I just noted a few minor things but otherwise it's great. Nice job on the tests, I like how thorough they are.
@rdblue thanks for the detailed review and catching those, updated

@szehon-ho, could you fix the conflict?

Done, seems it passed

Merged! Thanks @szehon-ho, nice work!
Merge remote-tracking branch 'upstream/merge-master-20210816' into master

What does this MR do? Merges upstream/master to pull in recent bug fixes and optimizations.

What changes does it include? Key PRs of interest:
- Predicate pushdown support: https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
- Spark: skip writing an empty dataset instead of raising an error, apache#2960
- Flink UI: add uidPrefix to operators to make tracking multiple Iceberg sink jobs easier, apache#288
- Spark: fix nested struct pruning, apache#2877
- Support creating v2 format tables via table properties, apache#2887
- Add the SortRewriteStrategy framework, incrementally supporting different rewrite strategies, apache#2609 (WIP: apache#2829)
- Spark: support configuring Hadoop properties for a catalog, apache#2792
- Spark: read/write support for timestamps without timezone, apache#2757
- Spark MicroBatch: support a property to skip delete snapshots, apache#2752
- Spark: V2 RewriteDatafilesAction support
- Core: add validation for row-level deletes with rewrites, apache#2865
- Schema time travel: add schema-id (Core: add schema id to snapshot)
- Spark extensions: support identifier fields operations, apache#2560
- Parquet: update to 1.12.0, apache#2441
- Hive: vectorized ORC reads for Hive, apache#2613
- Spark: add an action to remove all referenced files, apache#2415

How was this MR tested? UT (unit tests).