-
Notifications
You must be signed in to change notification settings - Fork 3k
API, Core: Add manifestLocation API to ContentFile #11044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API, Core: Add manifestLocation API to ContentFile #11044
Conversation
72323a7 to
2d88443
Compare
core/src/main/java/org/apache/iceberg/InheritableMetadataFactory.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think going via InheritableMetadata is the easiest option.
I only have a question about CharSequence, will check the tests tomorrow.
| * Returns the path of the manifest which this file is referenced in or null if it was not read | ||
| * from a manifest. | ||
| */ | ||
| default CharSequence manifestPath() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we put this method before pos() so both are co-located?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to discuss using CharSequence for paths in our APIs. I am not sure this approach is working out very well. I believe the idea was to expose Avro Utf8 directly. The problem is that Utf8 is not serializable. Therefore, we always convert Utf8 to String. Working with CharSequence adds a lot of complexity for no performance benefit. Puffin and a few other places use String for paths.
@rdblue, is my assumption about CharSequence correct? What are your thoughts now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, path() in ManifestFile returns String.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be fine to use String here and always expect to convert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use String. I also published #11092 to deprecate the path API which returns CharSequence, in favor of a location API which returns String. This removal would happen in 2.0
core/src/main/java/org/apache/iceberg/InheritableMetadataFactory.java
Outdated
Show resolved
Hide resolved
|
I took another look. I am on board. My high-level comments are below.
|
Great points on these, will run distributed planning benchmarks to make sure there's not a regression from adding this. |
b923d60 to
494339a
Compare
|
@aokolnychyi I ran the plan benchmarks, and added to the PR description. there's no significant difference in planning before and after.
For tests, we actually already have SparkDistributedDataScanTestBase which tests different combinations of local/distributed for data file and delete file planning. We have TestSparkDistributedDataScanJavaSerialization and TestSparkDistributedDataScanKryoSerialization for testing java/kryo serialization respectively. So I think we're effectively getting coverage in different planning + serialization combinations for the new code path. Please let me know if you think there's a coverage gap though! |
core/src/main/java/org/apache/iceberg/InheritableMetadataFactory.java
Outdated
Show resolved
Hide resolved
494339a to
32145d7
Compare
32145d7 to
499011d
Compare
core/src/test/java/org/apache/iceberg/DataTableScanTestBase.java
Outdated
Show resolved
Hide resolved
aokolnychyi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one nit in the tests. Thanks, @amogh-jahagirdar!
|
Verified TestSparkReaderDeletes#testMultiplePosDeleteFiles() failure is unrelated (ran locally a few times with my changes and it passed). The test may be flaky. I'll push another change to address the nit and we can verify if it's flaky. |
…path to a manifest from which the content file resides in
499011d to
bbf159b
Compare
|
This time |
|
I'm going to go ahead and merge. Thanks for reviewing @aokolnychyi @rdblue! |
This change adds a manifestPath API to ContentFile which will return the path to a manifest from which the content file resides in. If the data or delete file was not read from a manifest, the API returns null. The planned use case for this API is to enable a more optimized determination of which manifests need to be rewritten during commits.
Note: To be very clear this is not a spec change, this is an additional read API on ContentFile which surfaces the manifest path on reading of the manifest entry
Benchmark results prior to the changes:
Benchmark results after the changes: