Core, Parquet, ORC: Fix missing data when writing unknown #12581
Conversation
    new Schema(
        required(1, "id", LongType.get()),
        optional(2, "test_type", type),
        required(3, "trailing_data", Types.StringType.get())));
Without correct alignment, this test will fail for unknown when the null value for field 2 is passed to the writer for field 3.
    }

-  public static <T extends StructLike> ParquetValueWriter<T> create(
+  public static <T extends StructLike> ParquetValueWriter<T> createWriter(
This was needed because InternalWriter::create could refer to either create(MessageType) or create(Schema, MessageType). Adding the ability to pass BiFunction<Schema, MessageType, ParquetValueWriter<?>> to Parquet.createWriterFunc caused compile failures because the calls were ambiguous.
To solve the problem, I renamed this method so it is unambiguous. This has not been in a release, so renaming is safe. The create(MessageType) method has been in a release, so it is deprecated for removal.
I don't love this because we now have a confusing combination of create, createWriter and buildWriter floating around, but I also couldn't find a good alternative.
I think we should just be more aggressive about fixing these. We should deprecate the old ones and support better names.
    public void write(int repetitionLevel, S value) {
      for (int i = 0; i < writers.length; i += 1) {
-        Object fieldValue = get(value, i);
+        Object fieldValue = get(value, fieldIndexes[i]);
The fix is to map from writer index to field index. The index is created below by skipping unknown fields in the data struct.
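The mapping can be sketched as a standalone example. This is a hypothetical illustration, not Iceberg's actual StructWriter code; the buildFieldIndexes helper and the string-based type labels are invented for the sketch. The idea is the one described above: walk the data struct's fields, skip unknown fields (which have no writer), and record the data position for each remaining writer.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexMapSketch {
  // Build a writer-index -> field-index map by skipping "unknown" fields,
  // so the i-th writer reads the correct position from the data record.
  static int[] buildFieldIndexes(List<String> fieldTypes) {
    List<Integer> indexes = new ArrayList<>();
    for (int pos = 0; pos < fieldTypes.size(); pos += 1) {
      if (!"unknown".equals(fieldTypes.get(pos))) {
        indexes.add(pos); // this field has a writer; remember its data position
      }
    }
    return indexes.stream().mapToInt(Integer::intValue).toArray();
  }

  public static void main(String[] args) {
    // Fields from the test schema: id (long), test_type (unknown), trailing_data (string)
    int[] map = buildFieldIndexes(List.of("long", "unknown", "string"));
    // Two writers remain; writer 1 must read data field 2, not field 1.
    System.out.println(map.length + " " + map[0] + " " + map[1]); // prints: 2 0 2
  }
}
```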
    RecordWriter(Types.StructType struct, List<OrcValueWriter<?>> writers) {
      super(struct, writers);
Do we need a similar change in Flink/Spark ORC writers?
We will. Right now, the Spark writers and the Flink ORC writers don't support the new types, so they don't need to be updated in this PR.

Thanks for the reviews, @pvary and @danielcweeks!
This fixes ORC and Parquet writers with unknown columns.
Previously, writers assumed that the Iceberg schema of the incoming data matched the outgoing Parquet schema. However, unknown is not represented in file schemas, which led to an index mismatch: the value for an unknown column was looked up by index and passed to the next writer. In some cases this went uncaught because the misaligned data was still valid (for example, when the only remaining field is optional).
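The mismatch can be shown with a small standalone sketch (hypothetical names, not Iceberg code): once the unknown column is dropped from the file schema, indexing data fields and writers with the same counter shifts every value after the unknown column by one position.

```java
public class ShiftBugSketch {
  public static void main(String[] args) {
    // Data record per the Iceberg schema: [id, unknown (always null), trailing_data]
    Object[] record = {42L, null, "trailing"};

    // The file schema drops the unknown column, leaving two writers:
    // writer 0 for "id" and writer 1 for "trailing_data".

    // Buggy lookup: writer 1 reads record[1] and gets the unknown column's null.
    Object buggy = record[1];

    // Fixed lookup: a writer-to-field index map skips the unknown column.
    int[] fieldIndexes = {0, 2};
    Object fixed = record[fieldIndexes[1]];

    System.out.println(buggy + " -> " + fixed); // prints: null -> trailing
  }
}
```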
The fix is to pass the Iceberg schema into the write builder and account for columns that are present in data records but are not passed to a writer.
This updates all paths that called Parquet's GenericParquetWriter or InternalWriter to pass the data schema. This was not necessary in ORC because the write builder already passes the data schema. Avro does not need to be fixed because unknown is converted to a NULL schema and a null writer is used.