Conversation

pvary commented Oct 8, 2021

2nd implementation of #2935.
Based on #3248 I was able to implement the ORC delete writers.

pvary force-pushed the deleteStruct2 branch 2 times, most recently from 798ba0d to 660d660 on October 11, 2021.

pvary commented Oct 11, 2021

This seems like a better solution for #2935.
This PR is built on #3248.

The first commit in this PR contains the squashed commits of #3248; the next three commits are unique to this PR.

If you have time, could you please review: @kbendick, @aokolnychyi, @rdblue?

@openinx: Answered your comments and made the appropriate changes. Thanks for your time!


openinx commented Oct 12, 2021

Thanks @pvary for the work. I plan to check this again once the dependency PR #3248 is merged!


openinx commented Oct 14, 2021

@pvary, we've just merged #3248, so I think it's time to rebase this PR and take another round of review. Thanks!


pvary commented Oct 14, 2021

@openinx: Rebased and reran the flaky test (TestFlinkTableSink#testReplacePartitions). Now we have a clean run.
Could you please review?

Thanks,
Peter


// When a row writer factory is provided, wrap it in the position delete writer so that
// file paths are converted with pathTransformFunc before being written.
if (createWriterFunc != null) {
  appenderBuilder.createWriterFunc((schema, typeDescription) ->
      GenericOrcWriters.positionDelete(createWriterFunc.apply(deleteSchema, typeDescription), pathTransformFunc));
openinx commented:
If people don't provide a rowSchema via DeleteWriteBuilder#rowSchema, then we still use the Flink RowData writer to write the <path, pos> pair, and it is required to convert the path from CharSequence to RowData, I think. That's the hottest code path, because in most cases people won't need the extra rowSchema to attach the original row when writing a PositionDelete, and all the pos-delete writers will run into this line.

In my view, for the case without rowSchema, I think we can use the Record writer to avoid the extra conversion from CharSequence to RowData or InternalRow.
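A minimal sketch of that idea (not the actual code of this PR; GenericOrcWriter.buildWriter and the exact signatures used here are assumptions for illustration):

    // Sketch only: pick the writer based on whether row data needs to be stored.
    if (rowSchema != null && createWriterFunc != null) {
      // <path, pos, row>: wrap the engine-specific writer (Flink RowData / Spark InternalRow)
      // so that file paths are converted with pathTransformFunc.
      appenderBuilder.createWriterFunc((schema, typeDescription) ->
          GenericOrcWriters.positionDelete(createWriterFunc.apply(deleteSchema, typeDescription), pathTransformFunc));
    } else {
      // <path, pos> only: use the generic Record-based writer, so the hot path needs no
      // CharSequence -> RowData/InternalRow conversion and the path can be passed through as-is.
      appenderBuilder.createWriterFunc((schema, typeDescription) ->
          GenericOrcWriters.positionDelete(GenericOrcWriter.buildWriter(typeDescription), Function.identity()));
    }

pvary's later reply describes the approach actually taken in this PR: a dedicated Record-based OrcRowWriter for the pathPosSchema.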

openinx commented:

It would be better if we had a unit test to address this Parquet/ORC issue; I think it's fine to make it a separate issue or PR.

pvary commented:

I have changed the code to match the way it currently works with Parquet. I have created a new OrcRowWriter for the pathPosSchema and used that to write the data. Is this what you were suggesting for the case when no rowSchema is provided?

Are you suggesting that we should make sure the GenericOrcWriter writes the path as expected? Also, if I see correctly, the Parquet code also uses the identity transform for path values; is that why you are suggesting a test case specifically for this?

If I remove the pathTransform from the PositionDeleteStructWriter, then TestFlinkFileWriterFactory#testPositionDeleteWriterWithRow fails immediately with the following exception:

java.lang.String cannot be cast to org.apache.flink.table.data.StringData
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.flink.table.data.StringData
	at org.apache.iceberg.flink.data.FlinkOrcWriters$StringWriter.nonNullWrite(FlinkOrcWriters.java:96)
	at org.apache.iceberg.orc.OrcValueWriter.write(OrcValueWriter.java:42)
	at org.apache.iceberg.data.orc.GenericOrcWriters$StructWriter.write(GenericOrcWriters.java:492)

I expect that the same thing is already handled by the generic Record writer, which is why we do not get the same exception for TestFlinkFileWriterFactory#testPositionDeleteWriter.

Am I missing something?

openinx commented:

Sorry, I should describe this more clearly. I mean we may need to add a unit test for this case, to ensure that the Record PositionDeleteStructWriter is used to write the <path, pos> pair when no rowSchema is provided. That prevents future changes from introducing a write-path performance regression caused by the conversion from CharSequence to RowData or InternalRow.
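Such a test could look roughly like the following sketch (the fixture objects out, table, and partition, as well as the final assertion, are assumed for illustration; the builder calls mirror the ones discussed later in this thread):

    // Regression-test sketch: build a position delete writer WITHOUT a rowSchema and check
    // that plain <path, pos> deletes are written without any engine-specific path conversion.
    @Test
    public void testPositionDeletesWithoutRowSchema() throws IOException {
      PositionDeleteWriter<?> writer = Parquet.writeDeletes(out)
          .withSpec(table.spec())
          .withPartition(partition)
          .overwrite()
          // intentionally no rowSchema() / createWriterFunc(): only <path, pos> is stored
          .buildPositionWriter();

      try {
        writer.delete("/path/to/data-file.parquet", 0L);
      } finally {
        writer.close();
      }

      // A complete test would also assert that the generic Record-based writer was used,
      // for example by reading the delete file back.
      Assert.assertNotNull(writer.toDeleteFile());
    }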

openinx commented:

For these two cases:

  1. Write <path, pos, row-data> into position delete files, where row-data could be Flink's RowData or Spark's InternalRow.
  2. Write <path, pos> without any attached row-data into position delete files.

I think we could use the same PositionDeleteStructWriter to write either <path, pos, row-data> or <path, pos>; the difference is which pathTransformFunc we pass (see the sketch after this list):

  1. For Flink's <path, pos, RowData>, we should pass path -> StringData.fromString(path.toString());
  2. For Spark's <path, pos, InternalRow>, we should pass path -> UTF8String.fromString(path.toString());
  3. For both Spark's and Flink's <path, pos>, we should pass a dummy Function.identity() that does nothing, because we will just use the Record writer.
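As a concrete sketch of those three variants (illustrative only; the class and field names are made up, but StringData.fromString and UTF8String.fromString are the real Flink and Spark conversions):

    import java.util.function.Function;
    import org.apache.flink.table.data.StringData;
    import org.apache.spark.unsafe.types.UTF8String;

    class PathTransforms {
      // 1. Flink <path, pos, RowData>: the file path must become Flink's StringData.
      static final Function<CharSequence, StringData> FLINK = path -> StringData.fromString(path.toString());

      // 2. Spark <path, pos, InternalRow>: the file path must become Spark's UTF8String.
      static final Function<CharSequence, UTF8String> SPARK = path -> UTF8String.fromString(path.toString());

      // 3. Plain <path, pos> written with the Record writer: keep the CharSequence unchanged.
      static final Function<CharSequence, CharSequence> IDENTITY = Function.identity();
    }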

pvary commented:

OK, refactored to match how it works for Parquet.

There is one thing I am not entirely comfortable with:

  • If we provide rowSchema but do not provide the createWriterFunc, then we ignore the provided rowSchema. My understanding was that the value of rowSchema defines whether we write row data to the position delete file or just use it to store the filename and the position. It turns out this is defined by the combined values of these properties.

Wouldn't it be better to have a single storeRows boolean flag on the DeleteWriteBuilder class to define this behaviour, and to check when creating the writer whether every required parameter is set? I think this would make it easier to understand for the next contributors.
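Purely as a hypothetical sketch of that suggestion (no such flag exists in the current DeleteWriteBuilder; every name here is illustrative):

    // Hypothetical: one explicit flag decides whether row data is stored in position deletes.
    private boolean storeRows = false;

    public DeleteWriteBuilder storeRows(boolean shouldStoreRows) {
      this.storeRows = shouldStoreRows;
      return this;
    }

    public <T> PositionDeleteWriter<T> buildPositionWriter() throws IOException {
      if (storeRows) {
        // Writing <path, pos, row> requires both the row schema and the row writer factory.
        Preconditions.checkArgument(rowSchema != null, "Row schema is required when storing rows");
        Preconditions.checkArgument(createWriterFunc != null, "Writer function is required when storing rows");
      }
      // ... build either the <path, pos, row> writer or the plain <path, pos> writer ...
      throw new UnsupportedOperationException("sketch only");
    }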

openinx commented:

Thanks for sharing your thoughts. I think there are two cases that we did not describe clearly in the current Parquet/ORC/Avro position delete writer builders:

  1. rowSchema is null and createWriterFunc is not null. In this case the createWriterFunc is meaningless because we don't need to write any extra row records into position delete files. Using the default Record position delete writer is OK for me.

  2. rowSchema is not null and createWriterFunc is null. In this case, I think we should throw an IllegalArgumentException because we don't know how to construct the column writers for the rowSchema. Adding a Preconditions.checkArgument(rowSchema == null || createWriterFunc != null, "...") should be OK, but currently we fall back to skipping the rowSchema rows when writing the position delete files, which is what confused you, I think (a sketch follows below).

I think it's reasonable to make this a separate PR.
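For reference, a minimal sketch of the check for case 2, using the error message that shows up in the failing tests quoted below (where exactly it sits inside buildPositionWriter() is an assumption):

    // Fail fast instead of silently ignoring rowSchema when no writer function is given.
    Preconditions.checkArgument(rowSchema == null || createWriterFunc != null,
        "Create function should be provided if we write row data");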

pvary commented:

I tried to add a Preconditions.checkArgument to buildPositionWriter(), just like you suggested, but the DeleteReadTests.testPositionDeletes() and DeleteReadTests.testMixedPositionAndEqualityDeletes() tests were failing with this:

Create function should be provided if we write row data
java.lang.IllegalArgumentException: Create function should be provided if we write row data
	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
	at org.apache.iceberg.parquet.Parquet$DeleteWriteBuilder.buildPositionWriter(Parquet.java:600)
	at org.apache.iceberg.data.FileHelpers.writeDeleteFile(FileHelpers.java:59)
	at org.apache.iceberg.data.DeleteReadTests.testPositionDeletes(DeleteReadTests.java:287)
[..]

In the tests we use FileHelpers.writeDeleteFile with a table, which provides the schema, but without a createWriterFunc. So I am not sure if it is a test-only issue or a real use case. Any info around this?

openinx commented Oct 18, 2021:

Looks like it's here that we set the rowSchema by default, while in fact we shouldn't use table.schema() as the rowSchema when building the Parquet posDeleteWriter by default. I would suggest using the following to construct the position delete writer:

    PositionDeleteWriter<?> writer = Parquet.writeDeletes(out)
        .withSpec(table.spec())
        .setAll(table.properties())
        .metricsConfig(MetricsConfig.forTable(table))
        .withPartition(partition)
        .overwrite()
        .buildPositionWriter();

And if people plan to use forTable(table) to construct the position delete writer, then the Preconditions.checkArgument(rowSchema == null || createWriterFunc != null) will remind the devs to add a createWriterFunc or to fall back to the separate setters.

pvary commented:

Created #3305

openinx left a comment:

Thanks for the update, @pvary! I left several comments which I think we need to address.


pvary commented Oct 15, 2021

> Thanks for the update, @pvary! I left several comments which I think we need to address.

Thanks for the review @openinx!
Addressed your comments. I am not sure I understood your suggestion correctly in one case; could you please check it out?

Thanks,
Peter

openinx left a comment:

Almost looks great to me now; I left several minor comments! Thanks @pvary for the update!

github-actions bot added the core label on Oct 18, 2021.
openinx left a comment:

Looks great to me now, thanks @pvary for the patient work!

openinx merged commit 1b920e2 into apache:master on Oct 18, 2021.

pvary commented Oct 18, 2021

Thanks for the review and the merge @openinx!
