-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: Fix data loss in partial variant shredding #15087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@huaxingao @aihuaxu could you guys please take a look? |
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Outdated
Show resolved
Hide resolved
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Outdated
Show resolved
Hide resolved
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/variants/ShreddedObject.java
Outdated
Show resolved
Hide resolved
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Show resolved
Hide resolved
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Show resolved
Hide resolved
| records.add(record); | ||
| } | ||
|
|
||
| // Shredding function that only shreds the "id" field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what i'm not sure about, this shouldn't be relevant to the fix correct? We would lose the unshredded fields even if no transform is applied. The Transform in the writer isn't relevant to the broken serialization method correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer I may not have enough knowledge about iceberg code, and pardon me if I'm wrong because I just do some tests and observe results, and try to connect them and find reasonable explanation. Maybe my explanation is not the root cause.
I think it's relevant to the fix. The bug is not "we lose the unshredded fields", but is "we will lose shredded fields if we partially shred some fields". If I write the test case like following "shred all fields", then the test case works.
VariantShreddingFunction partialShredding =
(id, name) -> {
- VariantMetadata shreddedMetadata = Variants.metadata("id");
- ShreddedObject shreddedObject = Variants.object(shreddedMetadata);
- shreddedObject.put("id", Variants.of(1234L));
- return ParquetVariantUtil.toParquetSchema(shreddedObject);
+ if (name.equals("var")) {
+ ShreddedObject obj = Variants.object(metadata);
+ obj.put("id", Variants.of(1000L));
+ obj.put("name", Variants.of("user"));
+ obj.put("city", Variants.of("city"));
+ return ParquetVariantUtil.toParquetSchema(obj);
+ }
+ return null;
};But if I just do partially shredding, then exception happens.
Actually, the fix should be following, only to shred the id field in variant. But I guess it does not matter much because id field of primitive type does not support shredding.
--- a/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
+++ b/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
@@ -302,13 +302,16 @@ public class TestVariantWriters {
records.add(record);
}
- // Shredding function that only shreds the "id" field
+ // Shredding function that only shreds the "id" field in the variant
VariantShreddingFunction partialShredding =
(id, name) -> {
- VariantMetadata shreddedMetadata = Variants.metadata("id");
- ShreddedObject shreddedObject = Variants.object(shreddedMetadata);
- shreddedObject.put("id", Variants.of(1234L));
- return ParquetVariantUtil.toParquetSchema(shreddedObject);
+ if (name.equals("var")) {
+ VariantMetadata shreddedMetadata = Variants.metadata("id");
+ ShreddedObject shreddedObject = Variants.object(shreddedMetadata);
+ shreddedObject.put("id", Variants.of(1234L));
+ return ParquetVariantUtil.toParquetSchema(shreddedObject);
+ }
+ return null;
};There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question was whether -
// Shredding function that only shreds the "id" field
VariantShreddingFunction partialShredding =
(id, name) -> null; // No ShreddingWould also be broken, but it looks like it isn't from my internal test. So It must be going down a different serialization path I guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I think I get it, the test code is essentially creating a fully shredded object at the top with
obj.put("id", Variants.of(1000L + i));
obj.put("name", Variants.of("user_" + i));
obj.put("city", Variants.of("city_" + i));
Then this code is transforming that object into a partially shredded one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, tranformig that object into a partially shredded one.
creating a fully shredded object
I'm not sure if I want to create a fully shredded object at this moment, although the returned type here is indeed called ShreddedObject.
ShreddedObject obj = Variants.object(metadata);
obj.put("id", Variants.of(1000L + i));
obj.put("name", Variants.of("user_" + i));
obj.put("city", Variants.of("city_" + i));
I'm not sure this is the right way(or expected way) to construct a variant, or should I construct a SerializedObject like what you wrote in this issue (#15086)
When using variantShreddingFunc to partially shred variant fields, unshredded fields were being lost during serialization. The bug was in ShreddedObject constructor where local variable
shreddedFieldsshadowed the instance fieldthis.shreddedFields, causing unshredded fields to be added to the local map instead of the instance field.This resulted in the binary value field containing only metadata headers without actual field data, causing IndexOutOfBoundsException on read and permanent data loss.
Fix: Changed all references in the problematic code block to use
this.shreddedFieldsexplicitly, ensuring unshredded fields are properly preserved in the instance field and serialized correctly.Added test case testPartialShreddingWithShreddedObject that reproduces the exact scenario from issue #15086.