Contributor

@aihuaxu aihuaxu commented Feb 12, 2025

This PR implements a custom logical type, Variant, for Avro files in Iceberg. It prepares for adding shredded column metadata, which will be written as a Variant in Avro manifest files.

Part of: #10392

@github-actions github-actions bot added the core label Feb 12, 2025
@aihuaxu aihuaxu force-pushed the variant-type-avro-type branch from 16f8525 to d014477 Compare February 13, 2025 03:52
@aihuaxu aihuaxu force-pushed the variant-type-avro-type branch from d014477 to 5f6cb48 Compare February 20, 2025 07:11
Contributor

@xxubai xxubai left a comment

Thanks for the quick update!

@aihuaxu aihuaxu force-pushed the variant-type-avro-type branch 3 times, most recently from a34b98c to 0d732ae Compare February 20, 2025 21:24
@aihuaxu aihuaxu force-pushed the variant-type-avro-type branch 2 times, most recently from 5dff14c to 901a17d Compare February 21, 2025 06:12
@aihuaxu aihuaxu requested review from rdblue and xxubai February 21, 2025 06:14
Contributor

@xxubai xxubai left a comment

LGTM

@aihuaxu aihuaxu force-pushed the variant-type-avro-type branch from dd2c075 to 4351e8a Compare February 22, 2025 04:23
Contributor

rdblue commented Feb 25, 2025

This is close. There are just two blockers for me:

  1. This should not expose unnecessary constants in Variant. Cross-project constant use is unnecessary API surface and can cause issues because it entails unnecessary classloading.
  2. The custom order schema visitor should also pass metadata and value result futures.

@github-actions github-actions bot removed the API label Feb 25, 2025
@aihuaxu aihuaxu requested a review from rdblue February 25, 2025 19:12
String name = schema.getFullName();
Preconditions.checkState(
!visitor.recordLevels.contains(name), "Cannot process recursive Avro record %s", name);
Preconditions.checkArgument(
Contributor

This check is fairly difficult to understand. When you have complicated logic like this, I'd recommend using a variable to make it more straightforward and readable:

boolean isMalformedVariant = schema.getLogicalType() instanceof VariantLogicalType && !AvroSchemaUtil.isVariantSchema(schema);
Preconditions.checkArgument(!isMalformedVariant, ...);

However, in this case I think the problem is the structure. I commented below that I think Variant behavior should match map and list behavior. Those don't call field (which is specific to structs in this visitor) and that means even less code is going to be shared with records.

I think the best path forward is to handle the variant separately, rather than trying to share the code that visits fields. Then this Precondition can be back inside the block that validates whether the schema is a variant, which should be just after the visitor.recordLevels.push(name); line:

visitor.recordLevels.push(name);

if (schema.getLogicalType() instanceof VariantLogicalType) {
  Preconditions.checkArgument(AvroSchemaUtil.isVariantSchema(schema), ...);
  T metadataResult = new VisitFuture<>(schema.getField(METADATA).schema(), visitor);
  T valueResult = new VisitFuture<>(schema.getField(VALUE).schema(), visitor);
  return visit(schema, metadataResult, valueResult);
}
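
As a rough illustration of the structure suggested above, here is a self-contained sketch that does not use Avro or Iceberg classes. `Node`, `Visitor`, and `OrderVisitor` are hypothetical stand-ins invented for this example; only the idea (validate the variant shape inside the variant branch, then hand the visitor lazily evaluated metadata and value results) is taken from the comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a schema node; this is NOT the Avro Schema class.
final class Node {
  final String name;
  final boolean variant; // plays the role of the variant logical type marker
  final Map<String, Node> fields = new LinkedHashMap<>();

  Node(String name, boolean variant) {
    this.name = name;
    this.variant = variant;
  }

  Node field(String fieldName, Node child) {
    fields.put(fieldName, child);
    return this;
  }
}

// Hypothetical visitor that receives child results as suppliers ("futures"),
// so the visitor decides when, and in what order, children are visited.
abstract class Visitor<T> {
  abstract T record(Node node, List<Supplier<T>> fieldResults);

  abstract T variant(Node node, Supplier<T> metadata, Supplier<T> value);

  abstract T primitive(Node node);

  static <T> T visit(Node node, Visitor<T> visitor) {
    if (node.variant) {
      // Name the condition so the precondition reads clearly.
      boolean isWellFormed =
          node.fields.size() == 2
              && node.fields.containsKey("metadata")
              && node.fields.containsKey("value");
      if (!isWellFormed) {
        throw new IllegalArgumentException("Invalid variant record: " + node.name);
      }
      Supplier<T> metadata = () -> visit(node.fields.get("metadata"), visitor);
      Supplier<T> value = () -> visit(node.fields.get("value"), visitor);
      return visitor.variant(node, metadata, value);
    }

    if (node.fields.isEmpty()) {
      return visitor.primitive(node);
    }

    // Ordinary records keep their own code path.
    List<Supplier<T>> results = new ArrayList<>();
    for (Node child : node.fields.values()) {
      results.add(() -> visit(child, visitor));
    }
    return visitor.record(node, results);
  }
}

// Example visitor: logs the order in which leaves are visited and
// deliberately resolves the variant's value before its metadata.
final class OrderVisitor extends Visitor<String> {
  final List<String> order = new ArrayList<>();

  @Override
  String record(Node node, List<Supplier<String>> fieldResults) {
    StringBuilder sb = new StringBuilder("record(");
    for (Supplier<String> result : fieldResults) {
      sb.append(result.get()).append(';');
    }
    return sb.append(')').toString();
  }

  @Override
  String variant(Node node, Supplier<String> metadata, Supplier<String> value) {
    String valueResult = value.get(); // visited first on purpose
    String metadataResult = metadata.get();
    return "variant(" + metadataResult + "," + valueResult + ")";
  }

  @Override
  String primitive(Node node) {
    order.add(node.name);
    return node.name;
  }
}
```

Because the variant branch passes suppliers rather than already computed results, a visitor such as `OrderVisitor` can choose the traversal order, which is the point of a custom-order visitor.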

Contributor Author

@aihuaxu aihuaxu Feb 27, 2025

Makes sense. I don't think we need visitor.recordLevels.push(name) and visitor.recordLevels.pop() when visiting Variant fields, since we know they are not records.

BTW: I didn't change this, but it seems we should change the following

visitor.recordLevels.pop();
return visitor.record(schema, names, Iterables.transform(results, Supplier::get));

to

visitor.recordLevels.push(name);
...
Iterable<T> itFields = Iterables.transform(results, Supplier::get);
visitor.recordLevels.pop();
return visitor.record(schema, names, itFields);

in order to fail if there are any recursive nodes in the Avro schema?
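
To make the timing question concrete, here is a minimal, self-contained sketch. `Rec` and `RecursionCheck` are hypothetical classes, not Iceberg code, and the depth limit in `countLazy` exists only so the demo terminates. It shows the general effect being asked about: if child results are suppliers and they are resolved only after the name has been popped, the stack is empty when the children run and the recursion check cannot fire. Whether this applies to the actual visitor depends on where its suppliers are resolved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical record node; children may point back to an ancestor.
final class Rec {
  final String name;
  final List<Rec> children = new ArrayList<>();

  Rec(String name) {
    this.name = name;
  }
}

final class RecursionCheck {
  private final Deque<String> recordLevels = new ArrayDeque<>();

  private void checkNotRecursive(Rec rec) {
    if (recordLevels.contains(rec.name)) {
      throw new IllegalStateException("Cannot process recursive record " + rec.name);
    }
  }

  // Resolves child suppliers while the name is still on the stack, then pops.
  int countEager(Rec rec) {
    checkNotRecursive(rec);
    recordLevels.push(rec.name);
    try {
      List<Supplier<Integer>> results = new ArrayList<>();
      for (Rec child : rec.children) {
        results.add(() -> countEager(child));
      }
      int total = 1;
      for (Supplier<Integer> result : results) {
        total += result.get(); // children run before the pop
      }
      return total;
    } finally {
      recordLevels.pop();
    }
  }

  // Pops first and resolves afterwards, so children never see the parent name.
  // depthLimit is a demo-only guard against infinite recursion.
  int countLazy(Rec rec, int depthLimit) {
    if (depthLimit == 0) {
      return 0;
    }
    checkNotRecursive(rec);
    recordLevels.push(rec.name);
    List<Supplier<Integer>> results = new ArrayList<>();
    for (Rec child : rec.children) {
      results.add(() -> countLazy(child, depthLimit - 1));
    }
    recordLevels.pop();
    int total = 1;
    for (Supplier<Integer> result : results) {
      total += result.get(); // children run after the pop
    }
    return total;
  }
}
```

With a record that lists itself as a child, `countEager` throws `IllegalStateException`, while `countLazy` keeps descending until the demo depth limit stops it.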

@aihuaxu aihuaxu force-pushed the variant-type-avro-type branch from 8b50e0b to b3c349c Compare February 27, 2025 23:20
@aihuaxu aihuaxu force-pushed the variant-type-avro-type branch from b3c349c to dcc7f70 Compare February 27, 2025 23:30
Preconditions.checkState(
!visitor.recordLevels.contains(name), "Cannot process recursive Avro record %s", name);

visitor.recordLevels.push(name);
Contributor

Shouldn't this be outside the else block? It should apply to both, right?

Contributor Author

@rdblue I commented above. As I understand it, we are trying to error out on recursive Avro schemas. Since we know a Variant record will not have records inside it, I didn't track it in visitor.recordLevels.

Contributor

I agree that this is unlikely to be a problem given that there is a check for the structure of a variant. But it would still be safer to go ahead and push the name in case anything changes in the future. This is low priority though.

Preconditions.checkState(
!visitor.recordLevels.contains(name), "Cannot process recursive Avro record %s", name);

visitor.recordLevels.push(name);
Contributor

Same here. The record names should still be tracked to avoid name duplication issues.

return visitor.variant(
schema,
visit(schema.getField(METADATA).schema(), visitor),
visit(schema.getField(VALUE).schema(), visitor));
Contributor

@rdblue rdblue Mar 4, 2025

Should this be visitWithName(METADATA, schema, visitor)? That is used for map key/value and array elements.

Types.NestedField.required(
8, "tags", Types.ListType.ofRequired(9, Types.StringType.get())),
Types.NestedField.optional(10, "payload", Types.VariantType.get()))
.asStruct());
Contributor

The convert method accepts a schema. Why use the Type version instead of the Schema, String version?

Contributor Author

I think I should be able to use the Schema version. Let me do that in a following PR. Thanks for reviewing and merging it.

Contributor

@rdblue rdblue left a comment

I think we can fix the visitWithName issue in a follow-up. I'll merge this to unblock other work.

@rdblue rdblue merged commit 1fbf6cd into apache:main Mar 4, 2025
42 checks passed
Contributor

rdblue commented Mar 4, 2025

Thanks for working on this, @aihuaxu! And thanks for reviewing, @XBaith!

aihuaxu added a commit to aihuaxu/iceberg that referenced this pull request Mar 5, 2025
aihuaxu added a commit to aihuaxu/iceberg that referenced this pull request Mar 6, 2025
aihuaxu added a commit to aihuaxu/iceberg that referenced this pull request Apr 8, 2025