Contributor

rdblue commented Mar 19, 2025

This fixes ORC and Parquet writers with unknown columns.

Previously, writers assumed that the Iceberg schema of the incoming data and the outgoing Parquet schema matched. However, unknown is not represented in file schemas, which led to an index mismatch: the value for an unknown column was read by index and passed to the writer for the next column. In some cases this was not caught, because the data was still valid even though the fields did not align (for example, if there is only one more field and it is optional).

The fix is to pass the Iceberg schema into the write builder and account for columns that are present in data records but are not passed to a writer.

This updates all paths that called Parquet's GenericParquetWriter or InternalWriter to pass the data schema. This was not necessary in ORC because the write builder already passes the data schema.

Avro does not need to be fixed because unknown is converted to a NULL schema and a null writer is used.

new Schema(
required(1, "id", LongType.get()),
optional(2, "test_type", type),
required(3, "trailing_data", Types.StringType.get())));
Contributor Author

Without correct alignment, this test will fail for unknown when the null value for field 2 is passed to the writer for field 3.

}

- public static <T extends StructLike> ParquetValueWriter<T> create(
+ public static <T extends StructLike> ParquetValueWriter<T> createWriter(
Contributor Author

This was needed because InternalWriter::create could refer to either create(MessageType) or create(Schema, MessageType). Adding the ability to pass BiFunction<Schema, MessageType, ParquetValueWriter<?>> to Parquet.createWriterFunc caused compile failures because the calls were ambiguous.

To solve the problem, I renamed this method so it is unambiguous. This has not been in a release so it is safe. The create(MessageType) method has been in a release so it is deprecated for removal.
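The ambiguity can be reproduced outside Iceberg. The following is a minimal sketch, not the Iceberg code: `Schema`, `MessageType`, `Writer`, `Before`, `After`, and this `createWriterFunc` are hypothetical stand-ins that only mirror the shapes named above. It assumes the builder method is overloaded for both `Function` and `BiFunction`, which is what the description of the compile failure suggests.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical stand-ins; none of these are the real Iceberg or Parquet classes.
class OverloadSketch {
  static class Schema {}

  static class MessageType {}

  static class Writer {
    final boolean hasSchema;

    Writer(boolean hasSchema) {
      this.hasSchema = hasSchema;
    }
  }

  // Stand-in for a builder method that accepts either function shape.
  static Writer createWriterFunc(Function<MessageType, Writer> func) {
    return func.apply(new MessageType());
  }

  static Writer createWriterFunc(BiFunction<Schema, MessageType, Writer> func) {
    return func.apply(new Schema(), new MessageType());
  }

  // Before the rename: one name, two arities.
  static class Before {
    static Writer create(MessageType type) {
      return new Writer(false);
    }

    static Writer create(Schema schema, MessageType type) {
      return new Writer(true);
    }
  }

  // After the rename: each name has exactly one arity.
  static class After {
    static Writer create(MessageType type) {
      return new Writer(false);
    }

    static Writer createWriter(Schema schema, MessageType type) {
      return new Writer(true);
    }
  }

  public static void main(String[] args) {
    // javac is expected to reject the next line: Before::create matches both
    // the Function and the BiFunction overload, so the call is ambiguous.
    // createWriterFunc(Before::create);

    // With distinct names, each method reference fits only one overload.
    System.out.println(createWriterFunc(After::create).hasSchema);
    System.out.println(createWriterFunc(After::createWriter).hasSchema);
  }
}
```

An explicit lambda such as `(schema, type) -> Before.create(schema, type)` would also avoid the ambiguity, but it would push the workaround onto every caller that currently uses a method reference.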

Contributor

I don't love this because we now have a confusing combination of create, createWriter and buildWriter floating around, but I also couldn't find a good alternative.

Contributor Author

I think we should just be more aggressive about fixing these. We should deprecate the old ones and support better names.

  public void write(int repetitionLevel, S value) {
    for (int i = 0; i < writers.length; i += 1) {
-     Object fieldValue = get(value, i);
+     Object fieldValue = get(value, fieldIndexes[i]);
Contributor Author

The fix is to map from writer index to field index. The index is created below by skipping unknown fields in the data struct.
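For readers following along, here is a rough, self-contained sketch of that mapping. It is not the code in this PR: `Kind`, `Field`, `fieldIndexes`, and `valuesForWriters` are hypothetical names, and the struct mirrors the three-field test schema (`id`, `test_type`, `trailing_data`) with `test_type` assumed to be unknown. The `aligned` flag switches between the old positional read and the mapped read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins; these are not Iceberg classes.
class FieldIndexSketch {
  enum Kind { LONG, STRING, UNKNOWN }

  static class Field {
    final String name;
    final Kind kind;
    final boolean required;

    Field(String name, Kind kind, boolean required) {
      this.name = name;
      this.kind = kind;
      this.required = required;
    }
  }

  // Maps each writer position to the position of its field in the data
  // struct. Unknown fields have no writer, so they are skipped.
  static int[] fieldIndexes(List<Field> struct) {
    int[] indexes = new int[struct.size()];
    int writerCount = 0;
    for (int pos = 0; pos < struct.size(); pos += 1) {
      if (struct.get(pos).kind != Kind.UNKNOWN) {
        indexes[writerCount] = pos;
        writerCount += 1;
      }
    }
    return Arrays.copyOf(indexes, writerCount);
  }

  // Collects the value each writer would receive and rejects a null that
  // reaches the writer of a required field.
  static List<Object> valuesForWriters(List<Field> struct, Object[] row, boolean aligned) {
    int[] indexes = fieldIndexes(struct);
    List<Object> values = new ArrayList<>();
    for (int i = 0; i < indexes.length; i += 1) {
      Field field = struct.get(indexes[i]); // the field that writer i was built for
      // row[i] is the old positional read; row[indexes[i]] is the mapped read
      Object value = aligned ? row[indexes[i]] : row[i];
      if (value == null && field.required) {
        throw new IllegalStateException("null passed to writer for required field " + field.name);
      }
      values.add(value);
    }
    return values;
  }

  public static void main(String[] args) {
    List<Field> struct = Arrays.asList(
        new Field("id", Kind.LONG, true),
        new Field("test_type", Kind.UNKNOWN, false),
        new Field("trailing_data", Kind.STRING, true));
    Object[] row = {1L, null, "abc"};

    System.out.println(Arrays.toString(fieldIndexes(struct))); // prints [0, 2]
    System.out.println(valuesForWriters(struct, row, true));   // prints [1, abc]
  }
}
```

In this sketch, calling `valuesForWriters(struct, row, false)` throws, because the null stored for `test_type` is handed to the writer for the required `trailing_data` field.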

@danielcweeks
Contributor

You might want to run gradle revapi because I think there are a few issues.

Comment on lines +159 to +160
RecordWriter(Types.StructType struct, List<OrcValueWriter<?>> writers) {
super(struct, writers);
Contributor

Do we need a similar change in Flink/Spark ORC writers?

Contributor Author

We will. Right now, Spark writers and Flink ORC writers don't support the new types so they don't need to be updated in this PR.

rdblue merged commit 7217417 into apache:main Mar 20, 2025
42 checks passed
Contributor Author

rdblue commented Mar 20, 2025

Thanks for the reviews, @pvary and @danielcweeks!
