
Conversation

@szehon-ho szehon-ho (Member) commented Feb 1, 2023

These are the Spark-side changes for #6365 (the second part of the initial end-to-end PR: #4812).

Some explanations:

  • Because the RowReader is instantiated with PositionDeletesTable but we need the base table's schema, some APIs that expose the base table had to be added (see the sketch after this list).
  • The ORC and Parquet readers are fixed to handle constant column pushdown. (Previously this code path was not exercised because metadata columns are not pushed down by Spark.)
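
A minimal sketch of the kind of base-table access this refers to; the helper name and the BaseMetadataTable#table() accessor are assumptions for illustration, not necessarily the exact API this PR exposes:

import org.apache.iceberg.BaseMetadataTable;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;

final class BaseTableSchemas {
  private BaseTableSchemas() {}

  // Resolve the schema of the underlying base table when the scanned table is a
  // metadata table such as PositionDeletesTable; otherwise use the table itself.
  static Schema baseSchemaOf(Table table) {
    if (table instanceof BaseMetadataTable) {
      return ((BaseMetadataTable) table).table().schema();  // assumed accessor to the base table
    }
    return table.schema();
  }
}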

addCloseable(orcFileReader);

TypeDescription fileSchema = orcFileReader.getSchema();
Schema schemaWithoutConstantFields =
@szehon-ho szehon-ho (Member Author) Feb 2, 2023

Previously, a schema already pruned of constant columns was passed into ORCIterable for its filter logic.

However, that fails here when trying to bind the constant-column filters to it (those columns are no longer in the schema). With this change we prune out constant columns only where needed, and keep the original schema for filter binding.
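
A rough illustration of the idea described above, with illustrative helper names rather than this PR's exact code: bind the row filter against the full expected schema, and drop constant fields only when deriving the projection actually read from the file.

import java.util.Set;

import org.apache.iceberg.Schema;
import org.apache.iceberg.expressions.Binder;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.types.TypeUtil;

final class OrcReadSchemas {
  private OrcReadSchemas() {}

  // Read only non-constant columns from the file: prune the constant field ids here.
  static Schema fileReadSchema(Schema expectedSchema, Set<Integer> constantFieldIds) {
    return TypeUtil.selectNot(expectedSchema, constantFieldIds);
  }

  // Bind filters against the unpruned schema so constant-column predicates still resolve.
  static Expression bindFilter(Schema expectedSchema, Expression filter, boolean caseSensitive) {
    return Binder.bind(expectedSchema.asStruct(), filter, caseSensitive);
  }
}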

@aokolnychyi aokolnychyi (Contributor) left a comment

This looks almost ready to me. I had some comments in the reader. Let me also think a bit about the Parquet and ORC changes.

@aokolnychyi (Contributor)

Will take another look today.

@aokolnychyi aokolnychyi (Contributor) left a comment

Apart from one comment in PositionDeletesTable, this looks OK to me. I am still thinking about the best way to deal with residuals.

@szehon-ho szehon-ho force-pushed the position_delete_spark_only_2 branch from a5f00ca to 9eba5b5 Compare February 14, 2023 23:40
@github-actions github-actions bot added API and removed ORC labels Feb 14, 2023
@szehon-ho szehon-ho (Member Author) left a comment

@aokolnychyi as discussed offline, I reverted the constant column fix from the ORC and Parquet readers, and instead use @rdblue's new method from #6599.

Set<Integer> fields = schema.idToName().keySet();
Set<Integer> nonConstantFields =
    fields.stream()
        .filter(id -> schema.findField(id).type().isPrimitiveType())
@szehon-ho szehon-ho (Member Author)

We need to pass in only primitive field ids, because internally ExpressionUtil.extractByIdInclusive uses them to build a dummy identity partition spec, and identity partitions only accept primitive fields.

Alternatively we could push this filtering down into the method itself, but leaving it here keeps the method the same as in #6599.
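
For context, a completed version of the snippet above (the terminal collector is filled in as an illustration; the surrounding call to ExpressionUtil.extractByIdInclusive lives in the reader code):

import java.util.Set;
import java.util.stream.Collectors;

import org.apache.iceberg.Schema;

final class PrimitiveFieldIds {
  private PrimitiveFieldIds() {}

  // Keep only primitive field ids: extractByIdInclusive builds a dummy identity
  // partition spec from these ids, and identity partitions require primitive source fields.
  static Set<Integer> nonConstantFieldIds(Schema schema) {
    return schema.idToName().keySet().stream()
        .filter(id -> schema.findField(id).type().isPrimitiveType())
        .collect(Collectors.toSet());
  }
}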

* A {@link Table} implementation whose {@link Scan} provides {@link PositionDeletesScanTask}, for
* reading of position delete files.
*/
public class PositionDeletesTable extends BaseMetadataTable {
Contributor

This looks good to me.

* @return specs used to rewrite the metadata table filters to partition filters using an
* inclusive projection
*/
static Map<Integer, PartitionSpec> transformSpecs(
Contributor

Looks good too.
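
For context, the inclusive projection mentioned in the javadoc rewrites a row filter into a partition filter roughly like this (an illustrative helper, not this PR's exact call site):

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Projections;

final class PartitionFilters {
  private PartitionFilters() {}

  // An inclusive projection may match more partitions than the row filter, never fewer,
  // so it is safe to use for pruning which position delete files to scan.
  static Expression toPartitionFilter(PartitionSpec spec, Expression rowFilter, boolean caseSensitive) {
    return Projections.inclusive(spec, caseSensitive).project(rowFilter);
  }
}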

@aokolnychyi aokolnychyi (Contributor) left a comment

Final nits, and this should be good to go! We should add @rdblue as a co-author while merging.

@aokolnychyi aokolnychyi merged commit bd58ca5 into apache:master Feb 17, 2023
@aokolnychyi (Contributor)

Thanks, @szehon-ho! It is nice to have this done. I also added @rdblue as a co-author since this PR includes some logic from PR #6599.

@szehon-ho (Member Author)

Thanks, I filed a follow-up issue, #6925, to implement pushdown optimization for queries with filters on constant columns (i.e., spec_id, delete_file_path).
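
As an illustration of the kind of query that follow-up targets (the table name and literal path below are examples only):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PositionDeletesFilterExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Scan the position_deletes metadata table with a filter on a constant column.
    // #6925 tracks pushing such filters down so matching delete files are pruned at planning time.
    Dataset<Row> deletes =
        spark.read()
            .table("db.tbl.position_deletes")
            .filter("delete_file_path = 'data/delete-0001.parquet'");

    deletes.show();
  }
}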

krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023