API, Core: Preserve original Type for upper/lower bounds in Metrics #13695
private Map<Integer, Long> nanValueCounts = null;
private Map<Integer, ByteBuffer> lowerBounds = null;
private Map<Integer, ByteBuffer> upperBounds = null;
private Map<Integer, Type> originalTypes = null;
Do we need to copy this in the BaseFile too?
I don't think we would want to do this in BaseFile, since that would mean exposing an originalTypes() method there, which isn't needed. For DataFiles we do it because most places call withMetrics() and we want to preserve the original types when we later start converting from Metrics to the new structure.
So I think the principle here is that anywhere we need to build new DataFile or DeleteFile metadata we need to make sure that the metrics original types information is copied over so that it can be used when converting to the future structure. The only place that really needs read exposure to the originalTypes is the Metrics class itself.
But I think I agree with @pvary: the field does need to be copied over (we don't necessarily need to expose it). I think it needs to be copied over for the DeleteFile case, since DeleteFile extends BaseFile and we could have stats for those that we would want to convert.
> So I think the principle here is that anywhere we need to build new DataFile or DeleteFile metadata we need to make sure that the metrics original types information is copied over so that it can be used when converting to the future structure. The only place that really needs read exposure to the originalTypes is the Metrics class itself.
Yes that's absolutely correct. The main use case is that we want to convert from a Metrics instance to a Stats instance when e.g. an appender/writer creates a new Metrics instance and before those metrics are written to disk. Also we don't want to store originalTypes on disk, since that would require a spec change and we wouldn't have that information anyway when reading existing data files.
> But I think I agree with @pvary, I do think the field does need to be copied over (we don't need to expose it necessarily). I think the field needs to be copied over for the DeleteFile case since DeleteFile extends BaseFile and we could have stats for those that we would want to convert
I took another look at this but I'm still not sure why we would want to carry forward the originalTypes in BaseFile. There's really only a single use case I could find in the codebase that would require exposing originalTypes() in order to carry it forward, and that is in RewriteTablePathUtil when a delete manifest is rewritten. The metrics of that DeleteFile are carried forward in
iceberg/core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java
Lines 491 to 492 in 62d9ff5
Metrics metricsWithTargetPath =
    ContentFileUtil.replacePathBounds(file, sourcePrefix, targetPrefix);
We could carry originalTypes forward there as well, but the DeleteFile was read from the delete manifest at that point and we don't actually store originalTypes on disk, meaning that it would be null anyway.
That being said, the Metrics class is our main surface area where we want to preserve originalTypes. Once we keep individual metrics fields and don't create a new Metrics instance again from those fields (which is the case with BaseFile - except RewriteTablePathUtil), we shouldn't need to preserve originalTypes anymore.
@pvary @amogh-jahagirdar let me know if that makes sense or whether I'm missing something obvious
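To illustrate the point about originalTypes being null for anything read back from a manifest, here is a minimal, self-contained sketch. It is not Iceberg code: `MetricsToStatsSketch`, `SketchType`, and `typedLowerBounds` are hypothetical names, and the sketch assumes int and long bounds are stored as little-endian bytes. It only shows that a Metrics-to-stats style conversion can produce typed values when the type map is present and has nothing to work with when it is absent.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Iceberg.
final class MetricsToStatsSketch {
  // Hypothetical stand-in for a primitive type tag.
  enum SketchType { INT, LONG }

  // Converts raw lower bounds into typed values for every field whose original type is known.
  static Map<Integer, Object> typedLowerBounds(
      Map<Integer, ByteBuffer> lowerBounds, Map<Integer, SketchType> originalTypes) {
    Map<Integer, Object> typed = new HashMap<>();
    if (lowerBounds == null || originalTypes == null) {
      // No type information (e.g. metadata read back from a manifest): nothing can be converted.
      return typed;
    }

    for (Map.Entry<Integer, ByteBuffer> entry : lowerBounds.entrySet()) {
      SketchType type = originalTypes.get(entry.getKey());
      if (type == null) {
        continue; // bound without a recorded type stays unconverted
      }

      // Read without disturbing the caller's buffer position; assumes little-endian encoding.
      ByteBuffer bytes = entry.getValue().duplicate().order(ByteOrder.LITTLE_ENDIAN);
      if (type == SketchType.INT) {
        typed.put(entry.getKey(), bytes.getInt(bytes.position()));
      } else {
        typed.put(entry.getKey(), bytes.getLong(bytes.position()));
      }
    }

    return typed;
  }
}
```

With a type map the bound for field 1 comes back as a typed value; passing `null` for the type map yields an empty result, which mirrors the "it would be null anyway" case for a DeleteFile read from a delete manifest.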
@nastra: Let me rephrase what I understand from your comment:
- Currently we have Metrics - here in some cases we have some binary data for min and max which is not typed
- In the future we will have only Stats - where the stats will be typed, and this type will help interpret the min/max values
In between, for a while we will have Metrics converted to Stats, and for this we need the type info
> @nastra: Let me rephrase what I understand from your comment:
> - Currently we have Metrics - here in some cases we have some binary data for min and max which is not typed
Currently lower/upper bounds are all binary and we don't know what their original type was.
> - In the future we will have only Stats - where the stats will be typed, and this type will help interpret the min/max values
Yes, the new stats structure will store upper/lower bounds with their actual type.
> In between, for a while we will have Metrics converted to Stats, and for this we need the type info
Correct. We want to convert from metrics to stats and we can only do so if we know the original type of the upper/lower bound.
One of the reasons why I'm doing the metrics -> stats conversion is that our appender/writer APIs currently all return Metrics after data has been written. Changing those APIs to return stats instead would be quite a big change, which I want to avoid until we figure out what we want the new API to look like.
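As a small illustration of why an untyped binary bound cannot be interpreted on its own, the sketch below decodes the same eight bytes once as a long and once as a double. It is a hypothetical, self-contained example and not Iceberg code: `BoundDecodingSketch`, `SketchType`, `decode`, and `longBound` are made-up names, and a little-endian encoding is assumed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; none of these names exist in Iceberg.
final class BoundDecodingSketch {
  // Hypothetical stand-in for a primitive type tag.
  enum SketchType { LONG, DOUBLE }

  // Decodes an 8-byte little-endian bound according to the given type.
  static Object decode(SketchType type, ByteBuffer bytes) {
    // duplicate() leaves the caller's buffer state untouched; byte order is set explicitly.
    ByteBuffer copy = bytes.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    switch (type) {
      case LONG:
        return copy.getLong(copy.position());
      case DOUBLE:
        return copy.getDouble(copy.position());
      default:
        throw new IllegalArgumentException("Unsupported type: " + type);
    }
  }

  // Serializes a long as 8 little-endian bytes.
  static ByteBuffer longBound(long value) {
    ByteBuffer buffer = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
    buffer.putLong(0, value);
    return buffer;
  }
}
```

Decoding `longBound(42L)` with `SketchType.LONG` gives back 42, while decoding the same bytes with `SketchType.DOUBLE` gives a tiny, unrelated floating-point value. The bytes alone do not say which reading is right, which is the gap the original type fills.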
Discussed with @nastra: we will make Metrics.originalTypes() package-private so we don't pollute the API with this method. This way we don't have to explain to users when this field is available and when it is not.
I think this is a good compromise, so we can proceed with implementing the stats, and eventually deprecate the way to generate the stats from metrics.
Ok, this makes sense to me now. I had missed that, for the DV case, the change to FileMetadata already covers propagation of the types required for the future stats conversion (since we just copy over the fields from the metrics in the builder).
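The propagation described here can be pictured with a minimal sketch, assuming a builder that copies each field out of a metrics object and a type accessor that is only visible inside the package. This is not the actual Iceberg implementation: `SketchMetrics` and `SketchFileBuilder` are hypothetical classes, and types are represented as plain strings to keep the example self-contained.

```java
import java.nio.ByteBuffer;
import java.util.Map;

// Hypothetical stand-in for a metrics holder; not the Iceberg Metrics class.
final class SketchMetrics {
  private final Map<Integer, ByteBuffer> lowerBounds;
  private final Map<Integer, String> originalTypes;

  SketchMetrics(Map<Integer, ByteBuffer> lowerBounds, Map<Integer, String> originalTypes) {
    this.lowerBounds = lowerBounds;
    this.originalTypes = originalTypes;
  }

  Map<Integer, ByteBuffer> lowerBounds() {
    return lowerBounds;
  }

  // Package-private on purpose: callers outside the package never see this accessor.
  Map<Integer, String> originalTypes() {
    return originalTypes;
  }
}

// Hypothetical stand-in for a file metadata builder.
final class SketchFileBuilder {
  private Map<Integer, ByteBuffer> lowerBounds = null;
  private Map<Integer, String> originalTypes = null;

  // Copies every field out of the metrics, including the original types.
  SketchFileBuilder withMetrics(SketchMetrics metrics) {
    this.lowerBounds = metrics.lowerBounds();
    this.originalTypes = metrics.originalTypes();
    return this;
  }

  // Rebuilds a metrics object from the copied fields; the type map survives the round trip.
  SketchMetrics metrics() {
    return new SketchMetrics(lowerBounds, originalTypes);
  }
}
```

Because the builder copies the type map along with the bounds, a later conversion step inside the same package can still reach it without any public accessor being added.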
amogh-jahagirdar
left a comment
Thanks @nastra !
thanks for reviewing @pvary and @amogh-jahagirdar, I'll get this merged
This preserves the original type of the upper/lower bound of a particular field metric. This will be later used to convert from Metrics to the new content stats.