ORC: Implement buildEqualityWriter() and buildPositionWriter() - 2nd version #3250
Conversation
Force-pushed from 798ba0d to 660d660.
This seems like a better solution for #2935. The first commit in this PR is the squashed commits of #3248; the next 3 changes are unique to this PR. If you have time, could you please review: @kbendick, @aokolnychyi, @rdblue? @openinx: Answered your comments and made the appropriate changes. Thanks for your time!
@openinx: Rebased, and reran the flaky test (TestFlinkTableSink#testReplacePartitions). Now we have a clean run. Thanks!
```java
if (createWriterFunc != null) {
  appenderBuilder.createWriterFunc((schema, typeDescription) ->
      GenericOrcWriters.positionDelete(createWriterFunc.apply(deleteSchema, typeDescription), pathTransformFunc));
}
```
If people don't provide a rowSchema via DeleteWriteBuilder#rowSchema, then we still use the Flink RowData writer to write the <path, pos> pair, and it is required to convert the path from CharSequence to RowData, I think. That's the hottest code path, because in most cases people won't need the extra rowSchema to attach the original row when writing PositionDelete, and all the pos-delete writers will run into this line.
In my view, for the case without rowSchema, I think we can use the Record writer to avoid the extra conversion from CharSequence to RowData or InternalRow.
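To make the two write paths being discussed concrete, here is a minimal sketch (not taken from this PR) of feeding a position delete writer; the writer variable, dataFilePath, rowPosition, and deletedRow are assumed. Only the second call carries a row payload, so only that path needs the engine-specific row writer; the first only ever sees a CharSequence path and a long position, which a generic Record-based writer can consume without converting the path to RowData or InternalRow.

```java
// Sketch only: writer is an org.apache.iceberg.deletes.PositionDeleteWriter<T>.
// <path, pos> case - no rowSchema, no row payload, no path conversion needed:
writer.delete(dataFilePath, rowPosition);

// <path, pos, row> case - the deleted row is attached, so the engine row writer
// (RowData / InternalRow) is involved and the path must be converted for it:
writer.delete(dataFilePath, rowPosition, deletedRow);
```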
It would be better if we had a unit test to address this Parquet/ORC issue; I think it's fine to make it a separate issue or PR.
I have changed the code to match the way it currently works with Parquet. I have created a new OrcRowWriter for the pathPosSchema and used that to write the data. Is this what you were suggesting for the case when no rowSchema is provided?
Are you suggesting that we should make sure the GenericOrcWriter writes the path as expected? Also, if I see correctly, the Parquet code also uses the identity transform for path values; is that why you are suggesting a test case specifically for this?
If I remove the pathTransform from the PositionDeleteStructWriter then the TestFlinkFileWriterFactory#testPositionDeleteWriterWithRow will fail immediately with the following exception:
java.lang.String cannot be cast to org.apache.flink.table.data.StringData
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.flink.table.data.StringData
at org.apache.iceberg.flink.data.FlinkOrcWriters$StringWriter.nonNullWrite(FlinkOrcWriters.java:96)
at org.apache.iceberg.orc.OrcValueWriter.write(OrcValueWriter.java:42)
at org.apache.iceberg.data.orc.GenericOrcWriters$StructWriter.write(GenericOrcWriters.java:492)
I expect that the same thing is already handled by the generic Record writer, which is why we do not get the same exception for TestFlinkFileWriterFactory#testPositionDeleteWriter.
Am I missing something?
Sorry, I think I should describe this more clearly. I mean we may need to add a unit test for this case, to ensure the Record PositionDeleteStructWriter is used to write the <path, pos> pair if we don't provide any rowSchema. That prevents us from introducing changes that would cause a write-path performance regression because of the conversion from CharSequence to RowData or InternalRow.
For these two cases:
- Write `<path, pos, row-data>` into positional delete files; the row-data could be Flink's `RowData` or Spark's `InternalRow`.
- Write `<path, pos>` without any attached row-data into positional delete files.

I think we could use the same PositionDeleteStructWriter to write the `<path, pos, row-data>` or `<path, pos>`; the difference is which kind of pathTransformFunc we pass (see the sketch below):
- For Flink's `<path, pos, RowData>`, we should pass `path -> StringData.fromString(path.toString())`;
- For Spark's `<path, pos, InternalRow>`, we should pass `path -> UTF8String.fromString(path.toString())`;
- For both Spark's and Flink's `<path, pos>`, we should pass a dummy `Function.identity()` that does nothing, because we will just use the `Record` writer.
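A minimal sketch of the three pathTransformFunc variants described above; the Flink and Spark lambdas are shown side by side purely for illustration (in the codebase they would live in their respective modules), and the surrounding PositionDeleteStructWriter plumbing is assumed.

```java
import java.util.function.Function;

import org.apache.flink.table.data.StringData;
import org.apache.spark.unsafe.types.UTF8String;

public class PathTransformSketch {
  // Flink: the RowData writers expect the file path as StringData.
  static final Function<CharSequence, StringData> FLINK_PATH_TRANSFORM =
      path -> StringData.fromString(path.toString());

  // Spark: the InternalRow writers expect the file path as UTF8String.
  static final Function<CharSequence, UTF8String> SPARK_PATH_TRANSFORM =
      path -> UTF8String.fromString(path.toString());

  // Plain <path, pos> case: the generic Record writer consumes the CharSequence
  // directly, so an identity transform keeps the hot path conversion-free.
  static final Function<CharSequence, CharSequence> RECORD_PATH_TRANSFORM =
      Function.identity();
}
```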
OK, refactored to match how it works for Parquet.
There is one thing I am not entirely comfortable with:
- If we provide a `rowSchema` but do not provide the `createWriterFunc`, then we ignore the provided `rowSchema`. My understanding was that the value of the `rowSchema` defines whether we write rowData to the position delete file or just use it to store the filename and the position. It turns out this is defined by the combined values of these two properties.

Wouldn't it be better to have a single `storeRows` boolean flag on the DeleteWriteBuilder class to define this behaviour, and make the appropriate checks when creating the writer to verify that every required parameter is set? I think this would make it easier to understand for the next contributors.
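A rough, purely hypothetical sketch of the alternative floated above: an explicit storeRows flag on the delete write builder plus up-front parameter checks. The class, field, and method names below are illustrative only and are not the actual Iceberg API.

```java
import java.util.function.BiFunction;

import org.apache.iceberg.Schema;
import org.apache.iceberg.orc.OrcRowWriter;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
import org.apache.orc.TypeDescription;

// Hypothetical builder fragment; not the real DeleteWriteBuilder.
class PositionDeleteBuilderSketch {
  private Schema rowSchema;
  private BiFunction<Schema, TypeDescription, OrcRowWriter<?>> createWriterFunc;
  private boolean storeRows = false;

  PositionDeleteBuilderSketch storeRows(boolean shouldStoreRows) {
    this.storeRows = shouldStoreRows;
    return this;
  }

  void validate() {
    if (storeRows) {
      // With an explicit flag, the required parameters can be checked up front
      // instead of silently ignoring rowSchema when createWriterFunc is missing.
      Preconditions.checkArgument(rowSchema != null,
          "rowSchema is required when storeRows is set");
      Preconditions.checkArgument(createWriterFunc != null,
          "createWriterFunc is required when storeRows is set");
    }
  }
}
```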
Thanks for sharing your feeling; there are two cases that I think we did not describe clearly in the current Parquet/ORC/Avro position writer builders:
- `rowSchema` is null and `createWriterFunc` is not null. In this case, the `createWriterFunc` should be meaningless because we don't need to write any extra row records into position delete files. Using the default `Record` position delete writer is OK for me.
- `rowSchema` is not null and `createWriterFunc` is null. In this case, I think we should throw an IllegalArgumentException because we don't know how to construct the column writers for the `rowSchema`. Adding a `Preconditions.checkArgument(rowSchema != null && createWriterFunc != null, "...")` should be OK (see the sketch below), but currently we fall back to skipping the `rowSchema` row when writing into position delete files, which is your confusion, I think.
Making that into a separate PR sounds reasonable, I think.
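A minimal sketch of the guard being discussed, assuming it lives inside buildPositionWriter() where the builder's rowSchema and createWriterFunc fields are visible; the condition uses the form suggested later in the thread, and the message is the one that shows up in the test failure quoted below.

```java
// Sketch only; exact placement and message in the real builder may differ.
Preconditions.checkArgument(rowSchema == null || createWriterFunc != null,
    "Create function should be provided if we write row data");
```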
I tried to add a Preconditions.checkArgument to buildPositionWriter(), just as you suggested, but the DeleteReadTests.testPositionDeletes() and DeleteReadTests.testMixedPositionAndEqualityDeletes() tests were failing with this:
Create function should be provided if we write row data
java.lang.IllegalArgumentException: Create function should be provided if we write row data
at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
at org.apache.iceberg.parquet.Parquet$DeleteWriteBuilder.buildPositionWriter(Parquet.java:600)
at org.apache.iceberg.data.FileHelpers.writeDeleteFile(FileHelpers.java:59)
at org.apache.iceberg.data.DeleteReadTests.testPositionDeletes(DeleteReadTests.java:287)
[..]
In the tests we use FileHelpers.writeDeleteFile with a table, which provides the schema, but without a createWriterFunc. So I am not sure if it is a test-only issue or a real use case. Any info around this?
Looks like it's here where we set the rowSchema by default, while in fact we shouldn't use table.schema() as the rowSchema by default when building the Parquet posDeleteWriter. I would suggest using the following to construct the PositionWriter:
```java
PositionDeleteWriter<?> writer = Parquet.writeDeletes(out)
    .withSpec(table.spec())
    .setAll(table.properties())
    .metricsConfig(MetricsConfig.forTable(table))
    .withPartition(partition)
    .overwrite()
    .buildPositionWriter();
```
And if people plan to use forTable(table) to construct the position writer, then the Preconditions.checkArgument(rowSchema == null || createWriterFunc != null) will remind the devs to add createWriterFunc or fall back to using the separate setters.
Created #3305
openinx left a comment:
Thanks for the update, @pvary! I left several comments which I think we need to address.
openinx left a comment:
Almost looks great to me now; I left several minor comments! Thanks @pvary for the update!
openinx left a comment:
Looks great to me now, thanks @pvary for the patient work!
Thanks for the review and the merge, @openinx!
2nd implementation of #2935.
Based on #3248, I was able to implement the ORC delete writers.
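As a rough usage sketch of what this PR enables, assuming the new ORC DeleteWriteBuilder mirrors the Parquet builder quoted earlier in the thread; out, table, deleteRowSchema, and equalityFieldIds are assumed to exist, and the method names may not match the merged API exactly:

```java
// Equality delete writer for generic Records (sketch, not taken from the PR diff).
EqualityDeleteWriter<Record> equalityWriter = ORC.writeDeletes(out)
    .forTable(table)
    .rowSchema(deleteRowSchema)
    .createWriterFunc(GenericOrcWriter::buildWriter)
    .equalityFieldIds(equalityFieldIds)
    .overwrite()
    .buildEqualityWriter();

// Position delete writer without a rowSchema: only <path, pos> pairs are written,
// so the generic Record writer is used and no path conversion is needed.
PositionDeleteWriter<Record> positionWriter = ORC.writeDeletes(out)
    .withSpec(table.spec())
    .overwrite()
    .buildPositionWriter();
```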