Core, Parquet, ORC: Fix missing data when writing unknown #12581
Conversation
    new Schema(
        required(1, "id", LongType.get()),
        optional(2, "test_type", type),
        required(3, "trailing_data", Types.StringType.get())));
Without correct alignment, this test will fail for unknown when the null value for field 2 is passed to the writer for field 3.
    }

-  public static <T extends StructLike> ParquetValueWriter<T> create(
+  public static <T extends StructLike> ParquetValueWriter<T> createWriter(
This was needed because InternalWriter::create could refer to either create(MessageType) or create(Schema, MessageType). Adding the ability to pass BiFunction<Schema, MessageType, ParquetValueWriter<?>> to Parquet.createWriterFunc caused compile failures because the calls were ambiguous.
To solve the problem, I renamed this method so it is unambiguous. This has not been in a release, so renaming is safe. The create(MessageType) method has been in a release, so it is deprecated for removal.
I don't love this because we now have a confusing combination of create, createWriter and buildWriter floating around, but I also couldn't find a good alternative.
I think we should just be more aggressive about fixing these. We should deprecate the old ones and support better names.
    public void write(int repetitionLevel, S value) {
      for (int i = 0; i < writers.length; i += 1) {
-        Object fieldValue = get(value, i);
+        Object fieldValue = get(value, fieldIndexes[i]);
The fix is to map from writer index to field index. The index is created below by skipping unknown fields in the data struct.
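The mapping can be sketched as a standalone example. This is a hypothetical illustration, not Iceberg's actual StructWriter code; the buildFieldIndexes helper and the string-based type labels are invented for the sketch. The idea is the one described above: walk the data struct's fields, skip unknown fields (which have no writer), and record the data position for each remaining writer.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexMapSketch {
  // Build a writer-index -> field-index map by skipping "unknown" fields,
  // so the i-th writer reads the correct position from the data record.
  static int[] buildFieldIndexes(List<String> fieldTypes) {
    List<Integer> indexes = new ArrayList<>();
    for (int pos = 0; pos < fieldTypes.size(); pos += 1) {
      if (!"unknown".equals(fieldTypes.get(pos))) {
        indexes.add(pos); // this field has a writer; remember its data position
      }
    }
    return indexes.stream().mapToInt(Integer::intValue).toArray();
  }

  public static void main(String[] args) {
    // Fields from the test schema: id (long), test_type (unknown), trailing_data (string)
    int[] map = buildFieldIndexes(List.of("long", "unknown", "string"));
    // Two writers remain; writer 1 must read data field 2, not field 1.
    System.out.println(map.length + " " + map[0] + " " + map[1]); // prints: 2 0 2
  }
}
```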
    RecordWriter(Types.StructType struct, List<OrcValueWriter<?>> writers) {
      super(struct, writers);
Do we need a similar change in Flink/Spark ORC writers?
We will. Right now, the Spark writers and the Flink ORC writers don't support the new types, so they don't need to be updated in this PR.

Thanks for the reviews, @pvary and @danielcweeks!
This fixes ORC and Parquet writers with unknown columns.
Previously, writers assumed that the Iceberg schema of the incoming data matched the outgoing Parquet schema. However, unknown is not represented in file schemas, which led to an index mismatch: the value for an unknown column was looked up by index and passed to the next writer. In some cases this went uncaught because the misaligned data was still valid (for example, when the only remaining field is optional).
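The mismatch can be shown with a small standalone sketch (hypothetical names, not Iceberg code): once the unknown column is dropped from the file schema, indexing data fields and writers with the same counter shifts every value after the unknown column by one position.

```java
public class ShiftBugSketch {
  public static void main(String[] args) {
    // Data record per the Iceberg schema: [id, unknown (always null), trailing_data]
    Object[] record = {42L, null, "trailing"};

    // The file schema drops the unknown column, leaving two writers:
    // writer 0 for "id" and writer 1 for "trailing_data".

    // Buggy lookup: writer 1 reads record[1] and gets the unknown column's null.
    Object buggy = record[1];

    // Fixed lookup: a writer-to-field index map skips the unknown column.
    int[] fieldIndexes = {0, 2};
    Object fixed = record[fieldIndexes[1]];

    System.out.println(buggy + " -> " + fixed); // prints: null -> trailing
  }
}
```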
The fix is to pass the Iceberg schema into the write builder and account for columns that are present in data records but are not passed to a writer.
This updates all paths that called Parquet's GenericParquetWriter or InternalWriter to pass the data schema. This was not necessary in ORC because the write builder already passes the data schema. Avro does not need to be fixed because unknown is converted to a NULL schema and a null writer is used.