Skip to content

Conversation

@bryanck
Copy link
Contributor

@bryanck bryanck commented Feb 11, 2025

With very large table metadata json, for example, those with many snapshots with partition summaries, we sometimes encounter errors involving hash collisions when loading the metadata. This PR disables that hash collision check so the metadata can be parsed without error. We have had this set in our internal fork for a while.

In addition, this PR disable string interning of field names which has lead to performance problems for us when parsing metadata. Given partition summary field names and other snapshot properties are often not reused across different metadata, the interning causes more harm than good. This is especially true when using Iceberg in a server which is loading metadata for many tables.

This also fixes a test classpath issue. The Avatica driver is a shadow jar that bundles an old unshaded version of Jackson.

@github-actions github-actions bot added the core label Feb 11, 2025
@bryanck bryanck marked this pull request as draft February 11, 2025 16:44
@bryanck bryanck marked this pull request as ready for review February 11, 2025 17:38
@stevenzwu
Copy link
Contributor

Given partition summary field names and other snapshot properties are often not reused across different metadata, the interning causes more harm than good.

@bryanck I didn't quite get the partition summary field names. were you referring to PartitionFieldSummaryParser? it seems to have just 4 field names.

String.intern can be helpful for some use cases while harmful for some (like the one you encountered). Disabling interning seems to be a safer option considering diverse scenarios that the code can be used (like REST catalog server).

hash collision check

I definitely understand the situation you described. maybe reach out to the Jackson authors too according to the doc?
https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features

In unlikely event that the exception is triggered for valid data, it may make sense to either disable this feature, or to disable canonicalization. However, Jackson authors would also like to be notified for such usage as it may point to an issue with hashing scheme -- so please file an issue if you encounter this problem.

The doc also mentioned that hash collision check is Only relevant if canonicalization is enabled. wondering if CANONICALIZE_FIELD_NAMES should be disabled too. I imagined it can cause similar memory footprint issue as String interning.

@bryanck
Copy link
Contributor Author

bryanck commented Feb 11, 2025

@bryanck I didn't quite get the partition summary field names. were you referring to PartitionFieldSummaryParser? it seems to have just 4 field names.

String.intern can be helpful for some use cases while harmful for some (like the one you encountered). Disabling interning seems to be a safer option considering diverse scenarios that the code can be used (like REST catalog server).

The information for each partition key has a field name unique to the partition (with the prefix partitions.). There is some discussion around intern here with more links. TL;DR is that intern was disabled by default for Jackson 3 (whenever that is released).

I definitely understand the situation you described. maybe reach out to the Jackson authors too according to the doc? https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features

Sure sounds good, I'll reach out.

The doc also mentioned that hash collision check is Only relevant if canonicalization is enabled. wondering if CANONICALIZE_FIELD_NAMES should be disabled too. I imagined it can cause similar memory footprint issue as String interning.

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

@stevenzwu
Copy link
Contributor

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

canonicalization lifecycle is scoped to a single metadata file? if it is also JVM lifecycle scope (like String intern), it can also be a problem for large tables and a server handling many tables.

@bryanck
Copy link
Contributor Author

bryanck commented Feb 11, 2025

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

canonicalization lifecycle is scoped to a single metadata file? if it is also JVM lifecycle scope (like String intern), it can also be a problem for large tables and a server handling many tables.

I believe it is scoped to a parser instance, and we generally create a new parser for each AFAIK. (https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features)

@bryanck
Copy link
Contributor Author

bryanck commented Feb 11, 2025

Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.

canonicalization lifecycle is scoped to a single metadata file? if it is also JVM lifecycle scope (like String intern), it can also be a problem for large tables and a server handling many tables.

I believe it is scoped to a parser instance, and we generally create a new parser for each AFAIK. (https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features)

Actually that doesn't seem correct, it is for any parser created by the same factory, so we should probably turn canonicalization off instead.

@bryanck
Copy link
Contributor Author

bryanck commented Feb 11, 2025

I made the change to disable canonicalization instead.

@bryanck
Copy link
Contributor Author

bryanck commented Feb 11, 2025

I switched back to the original change, to just disable intern and the hash collision check. Disabling canonicalization altogether can impact performance significantly.

@stevenzwu
Copy link
Contributor

@bryanck thanks for the experimentation with canonicalization. do you have any micro/jmh benchmark for the parser performance? if yes, maybe it would be useful to add it to the Iceberg repo.

Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bryanck !

do you have any micro/jmh benchmark for the parser performance

+1, size and number of snapshot tuple would be great to experiment with and have it commited.

[for my understanding] I thought we had a way to lazy load metadata in REST, the complete metadata parsing would only be required at the time of commit ? Are all the tables write heavy ?

@bryanck
Copy link
Contributor Author

bryanck commented Feb 12, 2025

[for my understanding] I thought we had a way to lazy load metadata in REST, the complete metadata parsing would only be required at the time of commit ? Are all the tables write heavy ?

We have a very high write load and generally have partition summaries turned on.

@nastra nastra merged commit 80a009a into apache:main Feb 13, 2025
46 checks passed
@bryanck bryanck added this to the Iceberg 1.8.1 milestone Feb 18, 2025
nastra pushed a commit to nastra/iceberg that referenced this pull request Feb 19, 2025
nastra added a commit that referenced this pull request Feb 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants