Conversation

szehon-ho commented Aug 3, 2021

@rdblue @RussellSpitzer FYI when you get some time, thanks.

@RussellSpitzer (Member) left a comment:

This looks good to me. Were we waiting on something else?

@szehon-ho (Member Author):

Thanks for the fast response! Not that I know of.

this.name = name;
}


Reviewer (Contributor):

Nit: unnecessary newline.

* this spec is used to project an expression for the partitions table, the projection will remove predicates for
* non-partition fields (not in the spec) and will remove the "partition." prefix from fields.
*
* @param partitionTableSchema schema of the partition table
Reviewer (Contributor):

Looks like this is still assuming the partition table. May want to update it to tableSchema.

While we're thinking about this, it may also make sense to allow passing a different prefix. The prefix for the entries table, for example, would be data_file.partition.
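The prefix handling under discussion can be sketched in isolation. This is a hypothetical standalone illustration, not the actual Iceberg implementation: a predicate's field path is kept only if it starts with the given prefix ("partition." for the partitions table, "data_file.partition." for the entries table), and the prefix is stripped before matching the field against the partition spec.

```java
public class PartitionPrefix {
    // Hypothetical helper: returns the field name with the prefix removed,
    // or null when the field is not under the prefix (a predicate on such a
    // field would be dropped from the projected partition expression).
    public static String stripPrefix(String fieldPath, String prefix) {
        if (fieldPath.startsWith(prefix)) {
            return fieldPath.substring(prefix.length());
        }
        return null;
    }
}
```

Passing the prefix as an argument, as suggested, lets the same projection logic serve both metadata tables.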

@szehon-ho (Member Author):

Good catch, rewrote the comment and added an argument.

.project(rowFilter);

ManifestEvaluator manifestEval = ManifestEvaluator.forPartitionFilter(partitionFilter, table().spec(),
caseSensitive);
Reviewer (Contributor):

Nit: I'd probably wrap so that all the arguments are on the same line. That seems more readable to me.

@szehon-ho (Member Author):

You are right, done

}

@VisibleForTesting
ManifestFile getManifest() {
Reviewer (Contributor):

I realize this is only for testing, but I would still suggest using a method name that we don't have to change later if we want to use the method elsewhere. That means removing get because get doesn't add anything. Typically, get signals that the method name can be simpler (e.g., manifest to return the manifest) or should have a more descriptive verb like find, fetch, create, etc. that tell the caller more about what is happening. The only time we use get is when the object needs to be a Java bean.
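The convention being described can be illustrated with a small hypothetical class (the names below are invented for illustration and are not from the PR):

```java
// Hypothetical example of the naming convention: a plain accessor is named
// after what it returns, and "get" is reserved for Java-bean style objects.
class ScanResult {
    private final String manifest;

    ScanResult(String manifest) {
        this.manifest = manifest;
    }

    // Preferred: a simple name for a simple accessor, e.g. manifest()
    // rather than getManifest(). A descriptive verb (findManifest(),
    // createManifest(), ...) would be used when the method does real work.
    String manifest() {
        return manifest;
    }
}
```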

@szehon-ho (Member Author):

Ah, got it. Changed.

"Partition data tuple, schema based on the partition spec")).asStruct();

TableScan scanNoFilter = dataFilesTable.newScan().select("partition.data_bucket");
Assert.assertEquals(expected, scanNoFilter.schema().asStruct());
Reviewer (Contributor):

I'm not sure you need to add the schema check since this is trying to validate pushdown, but I'm fine keeping it.


TableScan scanNoFilter = dataFilesTable.newScan().select("partition.data_bucket");
Assert.assertEquals(expected, scanNoFilter.schema().asStruct());
CloseableIterable<CombinedScanTask> tasksNoFilter = scanNoFilter.planTasks();
Reviewer (Contributor):

I think this would be a bit simpler if you used planFiles instead of planTasks. That should produce only FileScanTask rather than CombinedScanTask so you wouldn't need to flatMap tasks because there is only one data file per task.

@szehon-ho (Member Author):

I had an issue early on where planTasks caught a bug (passing the wrong schema) that planFiles did not, for some reason. I changed most of them to planFiles and added one specific test for planTasks.
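The difference the reviewer describes can be sketched with stand-in types; here nested lists stand in for CombinedScanTask grouping FileScanTasks, so this is not the real Iceberg API:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PlanCompare {
    // With planTasks, each CombinedScanTask bundles several file tasks, so a
    // test must flatMap to reach individual files (stand-in: nested lists).
    public static List<String> filesFromCombinedTasks(List<List<String>> combinedTasks) {
        return combinedTasks.stream()
            .flatMap(List::stream)
            .collect(Collectors.toList());
    }

    // With planFiles, each task already corresponds to one data file, so the
    // result can be asserted on directly, with no flattening step.
    public static List<String> filesFromFileTasks(List<String> fileTasks) {
        return fileTasks;
    }
}
```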


Expression ltAnd = Expressions.and(
Expressions.lessThan("partition.data_bucket", 2),
Expressions.greaterThan("record_count", 0));
Reviewer (Contributor):

Is the record_count filter needed or is it just there for good measure?

@szehon-ho (Member Author):

Good catch. It's covered by the "and" test, so I removed it.

.commit();
table.newFastAppend()
.appendFile(FILE_PARTITION_3)
.commit();
Reviewer (Contributor):

You might consider a helper method to prepare the table for these tests.

@szehon-ho (Member Author):

Done

@rdblue (Contributor) left a comment:

@szehon-ho, this looks great! Nice work. I just noted a few minor things. Nice job on the tests; I like how thorough they are.

@szehon-ho (Member Author):

@rdblue, thanks for the detailed review and for catching those; updated.

rdblue commented Aug 9, 2021

@szehon-ho, could you fix the conflict?

@szehon-ho (Member Author):

Done; it seems the checks passed.

rdblue merged commit 7a3bfed into apache:master on Aug 9, 2021.
rdblue commented Aug 9, 2021

Merged! Thanks @szehon-ho, nice work!

chenjunjiedada referenced this pull request in chenjunjiedada/incubator-iceberg Oct 20, 2021
Merge remote-tracking branch 'upstream/merge-master-20210816' into master
## What does this MR address?

Merge upstream/master to pick up recent bug fixes and optimizations.

## What changes does this MR make?

Key PRs of interest:
> Predicate pushdown support, https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: fix the error when writing an empty dataset by simply skipping the write, apache#2960
> Flink: add uidPrefix to operators to make it easier to track multiple Iceberg sink jobs, apache#288
> Spark: fix nested struct pruning, apache#2877
> Support creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework to incrementally support different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties on a catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip-delete-snapshots property, apache#2752
> Spark: V2 RewriteDatafilesAction support
> Core: Add validation for row-level deletes with rewrites, apache#2865
> Schema time travel: add schema-id (Core: add schema id to snapshot)
> Spark Extension: support identifier fields operations, apache#2560
> Parquet: Update to 1.12.0, apache#2441
> Hive: Vectorized ORC reads for Hive, apache#2613
> Spark: Add an action to remove all referenced files, apache#2415

## How was this MR tested?

Unit tests.