Core: Add Variant logical type for Avro #12238
xxubai
left a comment
Thanks for the quick update!
xxubai
left a comment
LGTM
This is close. There are just two blockers for me:
    String name = schema.getFullName();
    Preconditions.checkState(
        !visitor.recordLevels.contains(name), "Cannot process recursive Avro record %s", name);
    Preconditions.checkArgument(
This check is fairly difficult to understand. When you have complicated logic like this, I'd recommend using a variable to make it more straightforward and readable:
    boolean isMalformedVariant = schema.getLogicalType() instanceof VariantLogicalType && !AvroSchemaUtil.isVariantSchema(schema);
    Preconditions.checkArgument(!isMalformedVariant, ...);
However, in this case I think the problem is the structure. I commented below that I think Variant behavior should match map and list behavior. Those don't call field (which is specific to structs in this visitor), and that means even less code is going to be shared with records.
I think the best path forward is to handle the variant separately, rather than trying to share the code that visits fields. Then this Precondition can go back inside the block that validates whether the schema is a variant, which should be just after the visitor.recordLevels.push(name); line:
    visitor.recordLevels.push(name);
    if (schema.getLogicalType() instanceof VariantLogicalType) {
      Preconditions.checkArgument(AvroSchemaUtil.isVariantSchema(schema), ...);
      T metadataResult = new VisitFuture<>(schema.getField(METADATA).schema(), visitor);
      T valueResult = new VisitFuture<>(schema.getField(VALUE).schema(), visitor);
      return visit(schema, metadataResult, valueResult);
    }
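The suggestion above (name the condition, and give the variant its own branch right after the record name is pushed) can be sketched without the Avro or Iceberg classes. This is only an illustration: `RecordSchema`, its fields, and this `isVariantSchema` are hypothetical stand-ins, and the assumed shape check (exactly a `metadata` field and a `value` field) is a guess at what the real utility validates.

```java
import java.util.Deque;
import java.util.List;

final class VariantBranchSketch {
  /** Hypothetical stand-in for an Avro record schema; this is not the Avro API. */
  static final class RecordSchema {
    final String fullName;
    final boolean hasVariantLogicalType;
    final List<String> fieldNames;

    RecordSchema(String fullName, boolean hasVariantLogicalType, List<String> fieldNames) {
      this.fullName = fullName;
      this.hasVariantLogicalType = hasVariantLogicalType;
      this.fieldNames = fieldNames;
    }
  }

  /** Assumed shape check: a variant record has exactly the fields "metadata" and "value". */
  static boolean isVariantSchema(RecordSchema schema) {
    return schema.fieldNames.equals(List.of("metadata", "value"));
  }

  /** Visits one record, tracking its name so recursive schemas can be rejected. */
  static String visit(RecordSchema schema, Deque<String> recordLevels) {
    String name = schema.fullName;
    if (recordLevels.contains(name)) {
      throw new IllegalStateException("Cannot process recursive Avro record " + name);
    }

    recordLevels.push(name);
    try {
      if (schema.hasVariantLogicalType) {
        // A named boolean keeps the validation readable.
        boolean isMalformedVariant = !isVariantSchema(schema);
        if (isMalformedVariant) {
          throw new IllegalArgumentException("Invalid variant record: " + name);
        }
        // Variant branch: no per-field struct handling is shared with records.
        return "variant(" + name + ")";
      }

      return "record(" + name + ", fields=" + schema.fieldNames.size() + ")";
    } finally {
      // The sketch pops in a finally block so the stack is clean even on failure.
      recordLevels.pop();
    }
  }
}
```

The `try`/`finally` is a choice made for this sketch, not something taken from the PR; the reviewed code pushes and pops explicitly.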
Makes sense. I think we don't need visitor.recordLevels.push(name) and visitor.recordLevels.pop() when visiting Variant fields, since we know they are not records.
BTW: I didn't change this, but it seems we should change the following:
    visitor.recordLevels.pop();
    return visitor.record(schema, names, Iterables.transform(results, Supplier::get));
to
    visitor.recordLevels.push(name);
    ...
    Iterable<T> itFields = Iterables.transform(results, Supplier::get);
    visitor.recordLevels.pop();
    return visitor.record(schema, names, itFields);
in order to fail if there are any recursive nodes in the Avro schema?
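The ordering question raised here can be illustrated with a small standalone sketch. It assumes child results are produced lazily through suppliers, as the snippet above suggests, and it resolves them while the record name is still on the stack. Note that Guava's `Iterables.transform` returns a lazy view, so in the real code the results would also have to be consumed before the pop for this to take effect. `Node` and `visit` are hypothetical names, not Iceberg or Avro APIs.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

final class RecursionCheckSketch {
  /** Hypothetical stand-in for a named Avro record with nested record fields. */
  static final class Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    Node(String name) {
      this.name = name;
    }
  }

  /** Counts the records reachable from node, failing if a record contains itself. */
  static int visit(Node node, Deque<String> recordLevels) {
    if (recordLevels.contains(node.name)) {
      throw new IllegalStateException("Cannot process recursive Avro record " + node.name);
    }

    recordLevels.push(node.name);

    // Child visits are deferred, mirroring the supplier-based results in the review.
    List<Supplier<Integer>> results = new ArrayList<>();
    for (Node child : node.children) {
      results.add(() -> visit(child, recordLevels));
    }

    int count = 1;
    try {
      // Resolve the suppliers BEFORE popping, so nested visits still see this name.
      for (Supplier<Integer> result : results) {
        count += result.get();
      }
    } finally {
      recordLevels.pop();
    }

    return count;
  }
}
```

If the pop happened before the suppliers were resolved, a nested visit of the same record would find an empty stack and the recursion check would never fire.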
    Preconditions.checkState(
        !visitor.recordLevels.contains(name), "Cannot process recursive Avro record %s", name);
    visitor.recordLevels.push(name);
Shouldn't this be outside the else block? It should apply to both, right?
@rdblue I commented above. As I understand it, we are trying to error out on recursive Avro schemas. But since we know a Variant record will not have records inside it, I didn't push its name onto visitor.recordLevels.
I agree that this is unlikely to be a problem given that there is a check for the structure of a variant. But it would still be safer to go ahead and push the name in case anything changes in the future. This is low priority though.
    Preconditions.checkState(
        !visitor.recordLevels.contains(name), "Cannot process recursive Avro record %s", name);
    visitor.recordLevels.push(name);
Same here. The record names should still be tracked to avoid name duplication issues.
    return visitor.variant(
        schema,
        visit(schema.getField(METADATA).schema(), visitor),
        visit(schema.getField(VALUE).schema(), visitor));
Should this be visitWithName(METADATA, schema, visitor)? That is used for map key/value and array elements.
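To make the question concrete, here is a sketch of what a name-tracking helper of this kind might do. It assumes that `visitWithName` pushes the child name onto a field-name stack for the duration of the nested visit; the helper's body is not shown in this conversation, so that behavior and every name below are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class VisitWithNameSketch {
  // Names of the fields currently being visited, outermost first.
  private final Deque<String> fieldNames = new ArrayDeque<>();

  // Every path seen by a leaf visit, recorded for inspection.
  final List<String> visitedPaths = new ArrayList<>();

  /** Leaf visit: reports the dotted path built from the current name stack. */
  private String visitLeaf() {
    String path = String.join(".", fieldNames);
    visitedPaths.add(path);
    return path;
  }

  /** Assumed helper behavior: track the name only while the nested visit runs. */
  private String visitWithName(String name) {
    fieldNames.addLast(name);
    try {
      return visitLeaf();
    } finally {
      fieldNames.removeLast();
    }
  }

  /** Visits a variant field by visiting its two children under their own names. */
  String visitVariant(String fieldName) {
    fieldNames.addLast(fieldName);
    try {
      String metadata = visitWithName("metadata");
      String value = visitWithName("value");
      return "variant(" + metadata + ", " + value + ")";
    } finally {
      fieldNames.removeLast();
    }
  }
}
```

With a plain visit, the nested schemas would be visited without their `metadata` and `value` names on the stack, so any visitor logic that depends on the current field path would not see them.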
    Types.NestedField.required(
        8, "tags", Types.ListType.ofRequired(9, Types.StringType.get())),
    Types.NestedField.optional(10, "payload", Types.VariantType.get()))
    .asStruct());
The convert method accepts a schema. Why use the Type version instead of the Schema, String version?
I think I should be able to use the Schema version. Let me do that in a follow-up PR. Thanks for reviewing and merging it.
rdblue
left a comment
I think we can fix the visitWithName issue in a follow up. I'll merge this to unblock other work.
Thanks for working on this, @aihuaxu! And thanks for reviewing, @XBaith!
This PR implements a custom logical type, Variant, for Avro files in Iceberg. It prepares for adding shredded column metadata, which will be written as a Variant in Avro manifest files.
Part of: #10392