Core: Adjust Jackson settings to handle large metadata json #12224
@bryanck I didn't quite get the partition summary field names. were you referring to String.intern can be helpful for some use cases while harmful for others (like the one you encountered). Disabling interning seems to be the safer option considering the diverse scenarios in which the code can be used (like a REST catalog server).
I definitely understand the situation you described. Maybe reach out to the Jackson authors too, according to the doc? The doc also mentioned that hash collision check is
The information for each partition key has a field name unique to the partition (with the prefix
Sure sounds good, I'll reach out.
Canonicalization can help when field names are reused within a single metadata file, so that seemed helpful still.
Is the canonicalization lifecycle scoped to a single metadata file? If it is also scoped to the JVM lifecycle (like String.intern), it can also be a problem for large tables and for a server handling many tables.
I believe it is scoped to a parser instance, and we generally create a new parser for each AFAIK. (https://github.com/fasterxml/jackson-core/wiki/JsonFactory-Features)
Actually that doesn't seem correct; it applies to any parser created by the same factory, so we should probably turn canonicalization off instead.
I made the change to disable canonicalization instead.
7c48a51 to 5cffecd
5cffecd to 3c5d438
I switched back to the original change, which just disables interning and the hash collision check. Disabling canonicalization altogether can impact performance significantly.
This reverts commit 3c5d438.
@bryanck thanks for the experimentation with canonicalization. Do you have any micro/JMH benchmark for the parser performance? If yes, maybe it would be useful to add it to the Iceberg repo.
singhpk234
left a comment
Thanks @bryanck !
do you have any micro/jmh benchmark for the parser performance
+1, the (size, number of snapshots) tuple would be great to experiment with, and it would be good to have the benchmark committed.
[for my understanding] I thought we had a way to lazy load metadata in REST, so the complete metadata parsing would only be required at the time of commit? Are all the tables write heavy?
We have a very high write load and generally have partition summaries turned on.
With very large table metadata json, for example, those with many snapshots with partition summaries, we sometimes encounter errors involving hash collisions when loading the metadata. This PR disables that hash collision check so the metadata can be parsed without error. We have had this set in our internal fork for a while.
In addition, this PR disables string interning of field names, which has led to performance problems for us when parsing metadata. Given that partition summary field names and other snapshot properties are often not reused across different metadata files, the interning causes more harm than good. This is especially true when using Iceberg in a server that loads metadata for many tables.
This also fixes a test classpath issue. The Avatica driver is a shadow jar that bundles an old unshaded version of Jackson.
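For readers unfamiliar with the Jackson settings discussed here, the following is a minimal sketch of how a factory with field-name interning and the symbol hash overflow check turned off might be built. It assumes Jackson 2.10 or later (for `JsonFactoryBuilder`) with jackson-core and jackson-databind on the classpath. The class name `LargeMetadataJsonSketch` and the sample JSON are illustrative only and are not taken from the PR's actual diff.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonFactoryBuilder;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical class for illustration; not the code changed in this PR.
public class LargeMetadataJsonSketch {

  // Field-name canonicalization stays enabled (the default), but two related
  // features are switched off:
  //  - INTERN_FIELD_NAMES: do not call String.intern() on parsed field names
  //  - FAIL_ON_SYMBOL_HASH_OVERFLOW: do not throw when the symbol table sees
  //    too many hash collisions
  private static final JsonFactory FACTORY =
      new JsonFactoryBuilder()
          .configure(JsonFactory.Feature.INTERN_FIELD_NAMES, false)
          .configure(JsonFactory.Feature.FAIL_ON_SYMBOL_HASH_OVERFLOW, false)
          .build();

  // Every parser created through this mapper shares the factory settings above.
  private static final ObjectMapper MAPPER = new ObjectMapper(FACTORY);

  public static void main(String[] args) throws Exception {
    // Made-up stand-in for a (much larger) table metadata document.
    JsonNode node = MAPPER.readTree("{\"snapshot-id\": 1, \"summary\": {\"added-files\": \"3\"}}");
    System.out.println(node.get("snapshot-id").asLong());
  }
}
```

Leaving `CANONICALIZE_FIELD_NAMES` at its default mirrors the outcome of the conversation: disabling canonicalization altogether was tried and reverted because of its performance impact.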