Skip to content

Conversation

@dirtysalt
Copy link

When using variantShreddingFunc to partially shred variant fields, unshredded fields were being lost during serialization. The bug was in ShreddedObject constructor where local variable shreddedFields shadowed the instance field this.shreddedFields, causing unshredded fields to be added to the local map instead of the instance field.

This resulted in the binary value field containing only metadata headers without actual field data, causing IndexOutOfBoundsException on read and permanent data loss.

Fix: Changed all references in the problematic code block to use this.shreddedFields explicitly, ensuring unshredded fields are properly preserved in the instance field and serialized correctly.Added test case testPartialShreddingWithShreddedObject that reproduces the exact scenario from issue #15086.

@nastra
Copy link
Contributor

nastra commented Jan 20, 2026

@huaxingao @aihuaxu could you guys please take a look?

records.add(record);
}

// Shredding function that only shreds the "id" field
Copy link
Member

@RussellSpitzer RussellSpitzer Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what i'm not sure about, this shouldn't be relevant to the fix correct? We would lose the unshredded fields even if no transform is applied. The Transform in the writer isn't relevant to the broken serialization method correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer I may not have enough knowledge about iceberg code, and pardon me if I'm wrong because I just do some tests and observe results, and try to connect them and find reasonable explanation. Maybe my explanation is not the root cause.

I think it's relevant to the fix. The bug is not "we lose the unshredded fields", but is "we will lose shredded fields if we partially shred some fields". If I write the test case like following "shred all fields", then the test case works.

     VariantShreddingFunction partialShredding =
         (id, name) -> {
-          VariantMetadata shreddedMetadata = Variants.metadata("id");
-          ShreddedObject shreddedObject = Variants.object(shreddedMetadata);
-          shreddedObject.put("id", Variants.of(1234L));
-          return ParquetVariantUtil.toParquetSchema(shreddedObject);
+          if (name.equals("var")) {
+            ShreddedObject obj = Variants.object(metadata);
+            obj.put("id", Variants.of(1000L));
+            obj.put("name", Variants.of("user"));
+            obj.put("city", Variants.of("city"));
+            return ParquetVariantUtil.toParquetSchema(obj);
+          }
+          return null;
         };

But if I just do partially shredding, then exception happens.


Actually, the fix should be following, only to shred the id field in variant. But I guess it does not matter much because id field of primitive type does not support shredding.

--- a/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
+++ b/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
@@ -302,13 +302,16 @@ public class TestVariantWriters {
       records.add(record);
     }

-    // Shredding function that only shreds the "id" field
+    // Shredding function that only shreds the "id" field in the variant
     VariantShreddingFunction partialShredding =
         (id, name) -> {
-          VariantMetadata shreddedMetadata = Variants.metadata("id");
-          ShreddedObject shreddedObject = Variants.object(shreddedMetadata);
-          shreddedObject.put("id", Variants.of(1234L));
-          return ParquetVariantUtil.toParquetSchema(shreddedObject);
+          if (name.equals("var")) {
+            VariantMetadata shreddedMetadata = Variants.metadata("id");
+            ShreddedObject shreddedObject = Variants.object(shreddedMetadata);
+            shreddedObject.put("id", Variants.of(1234L));
+            return ParquetVariantUtil.toParquetSchema(shreddedObject);
+          }
+          return null;
         };

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question was whether -

    // Shredding function that only shreds the "id" field
    VariantShreddingFunction partialShredding =
        (id, name) -> null; // No Shredding

Would also be broken, but it looks like it isn't from my internal test. So It must be going down a different serialization path I guess

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think I get it, the test code is essentially creating a fully shredded object at the top with

  obj.put("id", Variants.of(1000L + i));
  obj.put("name", Variants.of("user_" + i));
  obj.put("city", Variants.of("city_" + i));

Then this code is transforming that object into a partially shredded one

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, tranformig that object into a partially shredded one.


creating a fully shredded object

I'm not sure if I want to create a fully shredded object at this moment, although the returned type here is indeed called ShreddedObject.

ShreddedObject obj = Variants.object(metadata);
obj.put("id", Variants.of(1000L + i));
obj.put("name", Variants.of("user_" + i));
obj.put("city", Variants.of("city_" + i));

I'm not sure this is the right way(or expected way) to construct a variant, or should I construct a SerializedObject like what you wrote in this issue (#15086)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants