Spark 3.3: Implement Position Deletes Table #6716
Conversation
```java
addCloseable(orcFileReader);

TypeDescription fileSchema = orcFileReader.getSchema();
Schema schemaWithoutConstantFields =
```
Previously, a schema already pruned of constant columns was passed into ORCIterable for its filter logic. However, that fails here when binding the constant-column filters, because those columns are no longer in the schema. This change prunes constant columns only where needed and keeps the original schema for the binding.
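To illustrate the idea, here is a minimal hedged sketch, not the PR's actual code: the wrapper class and method names are hypothetical, and the set of constant-column field ids is assumed to be supplied by the caller. Filters are bound against the full schema (so references to constant columns like spec_id resolve), while the constants are pruned only from the projection handed to the file reader.

```java
import java.util.Set;
import org.apache.iceberg.Schema;
import org.apache.iceberg.expressions.Binder;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.types.TypeUtil;

class ConstantColumnBindingSketch {
  // Bind 'filter' against the FULL schema, then prune constant columns
  // only for the projection the ORC/Parquet reader actually needs.
  static Schema projectWithoutConstants(
      Schema fullSchema, Expression filter, Set<Integer> constantIds) {
    // Binding succeeds because the constant columns are still present here;
    // 'bound' would feed the residual/filter evaluation elsewhere.
    Expression bound = Binder.bind(fullSchema.asStruct(), filter, true /* caseSensitive */);

    // The reader's projection excludes the constant columns.
    return TypeUtil.selectNot(fullSchema, constantIds);
  }
}
```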
aokolnychyi left a comment:
This looks almost ready to me. I had some comments in the reader. Let me also think a bit about the Parquet and ORC changes.
Will take another look today.
aokolnychyi left a comment:
Apart from one comment in PositionDeletesTable, this looks OK to me. I am still thinking about the best way to deal with residuals.
szehon-ho left a comment:
@aokolnychyi as discussed offline, I reverted the constant-column fix from the ORC and Parquet readers and instead used @rdblue's new method from #6599.
```java
Set<Integer> fields = schema.idToName().keySet();
Set<Integer> nonConstantFields =
    fields.stream()
        .filter(id -> schema.findField(id).type().isPrimitiveType())
```
We need to pass in only primitive field ids because, internally, ExpressionUtil.extractByIdInclusive uses them to build a dummy identity partition spec, and identity transforms only accept primitive fields.
Alternatively, we could push this filtering down into the method itself, but it is left here so the method stays identical to the one in #6599.
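A self-contained sketch of that filtering, completing the truncated snippet above (the wrapper class and method names are hypothetical):

```java
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.iceberg.Schema;

class PrimitiveFieldIdsSketch {
  // Identity partition transforms only accept primitive types, so nested
  // (struct/list/map) field ids are filtered out before building the
  // dummy spec used for filter extraction.
  static Set<Integer> primitiveFieldIds(Schema schema) {
    return schema.idToName().keySet().stream()
        .filter(id -> schema.findField(id).type().isPrimitiveType())
        .collect(Collectors.toSet());
  }
}
```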
```java
/**
 * A {@link Table} implementation whose {@link Scan} provides {@link PositionDeletesScanTask}, for
 * reading of position delete files.
 */
public class PositionDeletesTable extends BaseMetadataTable {
```
This looks good to me.
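For reference, a minimal sketch of how a caller might obtain this metadata table, assuming the POSITION_DELETES metadata table type from the core-side change (the wrapper class and method names are hypothetical):

```java
import org.apache.iceberg.MetadataTableType;
import org.apache.iceberg.MetadataTableUtils;
import org.apache.iceberg.Table;

class MetadataTableSketch {
  // Obtain the position deletes metadata table from a base table;
  // its scan yields PositionDeletesScanTask instances.
  static Table positionDeletesTable(Table table) {
    return MetadataTableUtils.createMetadataTableInstance(
        table, MetadataTableType.POSITION_DELETES);
  }
}
```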
```java
 * @return specs used to rewrite the metadata table filters to partition filters using an
 *     inclusive projection
 */
static Map<Integer, PartitionSpec> transformSpecs(
```
Looks good too.
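For context, an inclusive projection rewrites a row-level filter into a partition-level filter that matches at least every file the row filter could match. A minimal sketch using Iceberg's Projections utility (the wrapper class and method names are illustrative):

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Projections;

class InclusiveProjectionSketch {
  // Rewrite a row filter into a partition filter that never drops a
  // file the row filter might match (it may keep extra files).
  static Expression toPartitionFilter(PartitionSpec spec, Expression rowFilter) {
    return Projections.inclusive(spec, true /* caseSensitive */).project(rowFilter);
  }
}
```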
aokolnychyi left a comment:
Final nits and should be good to go! We should add @rdblue as co-author while merging.
Thanks, @szehon-ho! It is nice to have this done. I also added @rdblue as a co-author since this PR includes some logic from PR #6599.
Thanks, I filed follow-up issue #6925 to implement pushdown optimization for queries with filters on constant columns (i.e., spec_id, delete_file_path).
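For reference, a hypothetical usage sketch of the new metadata table from Spark; the table identifier is made up, and the spec_id filter is exactly the kind of constant-column predicate #6925 would push down:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class PositionDeletesReadSketch {
  // Read the position_deletes metadata table; until #6925, the filter on
  // the constant spec_id column is applied by Spark rather than pushed
  // down to Iceberg.
  static Dataset<Row> readDeletes(SparkSession spark) {
    return spark.read()
        .format("iceberg")
        .load("db.table.position_deletes") // hypothetical table identifier
        .filter("spec_id = 0");
  }
}
```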
Co-authored-by: Ryan Blue <blue@apache.org>
These are the Spark-side changes for #6365 (or the second part of the initial end-to-end PR #4812).
Some explanations: