@nastra nastra commented Jul 29, 2025

This preserves the original type of the upper/lower bound of a particular field metric.
This will later be used to convert from Metrics to the new content stats.
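As a rough illustration of the idea, the sketch below keeps a per-field type map next to the serialized bounds. `MetricsSketch`, `FieldType`, `intBound`, and `forIntColumn` are hypothetical stand-ins invented for this example and are not Iceberg's actual classes; the little-endian encoding is also an assumption of the sketch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in; this is not Iceberg's actual Metrics class.
final class MetricsSketch {
  // Stand-in for a real type descriptor, reduced to two primitives.
  enum FieldType { INT, LONG }

  private final Map<Integer, ByteBuffer> lowerBounds;
  private final Map<Integer, ByteBuffer> upperBounds;
  // Keyed by field ID like the bounds; may be null when the types are unknown.
  private final Map<Integer, FieldType> originalTypes;

  MetricsSketch(
      Map<Integer, ByteBuffer> lowerBounds,
      Map<Integer, ByteBuffer> upperBounds,
      Map<Integer, FieldType> originalTypes) {
    this.lowerBounds = lowerBounds;
    this.upperBounds = upperBounds;
    this.originalTypes = originalTypes;
  }

  Map<Integer, ByteBuffer> lowerBounds() { return lowerBounds; }
  Map<Integer, ByteBuffer> upperBounds() { return upperBounds; }
  Map<Integer, FieldType> originalTypes() { return originalTypes; }

  // Serializes an int bound as 4 little-endian bytes (byte order is an assumption here).
  static ByteBuffer intBound(int value) {
    return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(0, value);
  }

  // A writer that knows the column type records it next to the opaque bounds.
  static MetricsSketch forIntColumn(int fieldId, int min, int max) {
    Map<Integer, ByteBuffer> lower = new HashMap<>();
    Map<Integer, ByteBuffer> upper = new HashMap<>();
    Map<Integer, FieldType> types = new HashMap<>();
    lower.put(fieldId, intBound(min));
    upper.put(fieldId, intBound(max));
    types.put(fieldId, FieldType.INT);
    return new MetricsSketch(lower, upper, types);
  }

  public static void main(String[] args) {
    MetricsSketch metrics = forIntColumn(1, 5, 42);
    System.out.println("field 1 type: " + metrics.originalTypes().get(1));
  }
}
```

The type map lives only in memory in this sketch, which matches the intent stated in this PR of not persisting the original types.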

@nastra nastra moved this to In review in Content Stats Jul 29, 2025
@nastra nastra force-pushed the metrics-carry-over-original-type branch from f49d706 to 5c90a90 Compare July 29, 2025 09:29
@github-actions github-actions bot added the ORC label Jul 29, 2025
@nastra nastra force-pushed the metrics-carry-over-original-type branch from 5c90a90 to d9ec96a Compare July 29, 2025 09:34
@nastra nastra force-pushed the metrics-carry-over-original-type branch from d9ec96a to 1c3b6ce Compare July 29, 2025 16:06
private Map<Integer, Long> nanValueCounts = null;
private Map<Integer, ByteBuffer> lowerBounds = null;
private Map<Integer, ByteBuffer> upperBounds = null;
private Map<Integer, Type> originalTypes = null;
Contributor

Do we need to copy this in the BaseFile too?

Contributor Author

I don't think we would want to do this in BaseFile, since that would mean we would expose an originalTypes() method there, which isn't needed. For DataFiles we do it because most places call withMetrics() and we want to preserve the original types when we later start converting from Metrics to the new structure

Contributor

So I think the principle here is that anywhere we need to build new DataFile or DeleteFile metadata we need to make sure that the metrics original types information is copied over so that it can be used when converting to the future structure. The only place that really needs read exposure to the originalTypes is the Metrics class itself.

But I think I agree with @pvary: I do think the field needs to be copied over (we don't necessarily need to expose it). I think the field needs to be copied over for the DeleteFile case, since DeleteFile extends BaseFile and we could have stats for those that we would want to convert.

Contributor Author

> So I think the principle here is that anywhere we need to build new DataFile or DeleteFile metadata we need to make sure that the metrics original types information is copied over so that it can be used when converting to the future structure. The only place that really needs read exposure to the originalTypes is the Metrics class itself.

Yes that's absolutely correct. The main use case is that we want to convert from a Metrics instance to a Stats instance when e.g. an appender/writer creates a new Metrics instance and before those metrics are written to disk. Also we don't want to store originalTypes on disk, since that would require a spec change and we wouldn't have that information anyway when reading existing data files.

> But I think I agree with @pvary: I do think the field needs to be copied over (we don't necessarily need to expose it). I think the field needs to be copied over for the DeleteFile case, since DeleteFile extends BaseFile and we could have stats for those that we would want to convert.

I took another look at this but I'm still not sure why we would want to carry forward the originalTypes in BaseFile. There's really only a single use case I could find in the codebase that would require exposing originalTypes() in order to carry it forward and that is in RewriteTablePathUtil when a delete manifest is rewritten. The metrics of that DeleteFile are carried forward in

    Metrics metricsWithTargetPath =
        ContentFileUtil.replacePathBounds(file, sourcePrefix, targetPrefix);
which might indicate that we would need to preserve originalTypes there as well, but the DeleteFile was read from the delete manifest at that point and we don't actually store originalTypes on disk, meaning that it would be null anyway.

That being said, the Metrics class is our main surface area where we want to preserve originalTypes. Once we keep individual metrics fields and don't create a new Metrics instance again from those fields (which is the case with BaseFile - except RewriteTablePathUtil), we shouldn't need to preserve originalTypes anymore.

@pvary @amogh-jahagirdar let me know if that makes sense or whether I'm missing something obvious

Contributor

@nastra: Let me rephrase what I understand from your comment:

  • Currently we have Metrics - here in some cases we have some binary data for min and max which is not typed
  • In the future we will have only Stats - where the stats will be typed, and this type will help interpret the min/max values

In between, for a while we will have Metrics converted to Stats, and for this we need the type info.

Contributor Author

> @nastra: Let me rephrase what I understand from your comment:
>
>   • Currently we have Metrics - here in some cases we have some binary data for min and max which is not typed

Currently, lower/upper bounds are all binary and we don't know what their original type was.

>   • In the future we will have only Stats - where the stats will be typed, and this type will help interpret the min/max values

Yes, the new stats structure will store upper/lower bounds with their actual type.

> In between, for a while we will have Metrics converted to Stats, and for this we need the type info

Correct. We want to convert from metrics to stats, and we can only do so if we know the original type of the upper/lower bound.

One of the reasons why I'm doing the metrics -> stats conversion is that our appender/writer APIs currently all return Metrics after data has been written. Changing those APIs to return stats instead would be quite a big change, which I want to avoid until we figure out what we want the new API to look like.
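A minimal sketch of why the conversion needs the original type: the same four bytes decode to different values depending on the type, and without a recorded type nothing can be decoded at all. `BoundConversionSketch`, `FieldType`, `typedBound`, and `intBytes` are hypothetical names for this example, not the conversion code of this PR, and the little-endian layout is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of turning an untyped binary bound into a typed value.
final class BoundConversionSketch {
  // Stand-in for a real type descriptor.
  enum FieldType { INT, LONG, FLOAT, DOUBLE }

  // Returns the decoded bound, or empty when the bound or its type is unknown
  // (for example when the metrics were read back from disk without type info).
  static Optional<Object> typedBound(
      int fieldId, Map<Integer, ByteBuffer> bounds, Map<Integer, FieldType> originalTypes) {
    if (bounds == null || originalTypes == null) {
      return Optional.empty();
    }
    ByteBuffer raw = bounds.get(fieldId);
    FieldType type = originalTypes.get(fieldId);
    if (raw == null || type == null) {
      return Optional.empty();
    }
    // Work on a duplicate so the caller's buffer position and order stay untouched.
    ByteBuffer buf = raw.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    Object value;
    switch (type) {
      case INT:
        value = buf.getInt(buf.position());
        break;
      case LONG:
        value = buf.getLong(buf.position());
        break;
      case FLOAT:
        value = buf.getFloat(buf.position());
        break;
      case DOUBLE:
        value = buf.getDouble(buf.position());
        break;
      default:
        throw new IllegalArgumentException("Unsupported type: " + type);
    }
    return Optional.of(value);
  }

  // Helper for the example: 4 little-endian bytes holding the given int.
  static ByteBuffer intBytes(int value) {
    return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(0, value);
  }

  public static void main(String[] args) {
    Map<Integer, ByteBuffer> bounds = new HashMap<>();
    bounds.put(1, intBytes(Float.floatToIntBits(1.0f)));
    Map<Integer, FieldType> types = new HashMap<>();
    types.put(1, FieldType.FLOAT);
    System.out.println(typedBound(1, bounds, types));
    System.out.println(typedBound(1, bounds, null));
  }
}
```

Returning an empty `Optional` when the type map is missing is a design choice of this sketch; it mirrors the point made above that files read from existing manifests would not carry the original types.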

Contributor

Discussed with @nastra: we will make Metrics.originalTypes() package-private so that we don't pollute the API with this method. This way we don't have to explain to users when this field is available and when it is not.

I think this is a good compromise, so we can proceed with implementing the stats, and eventually deprecate the way to generate the stats from metrics.
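A small sketch of what a package-private accessor looks like in Java: the method carries no access modifier, so only code in the same package can call it. `VisibilitySketch` and `SimpleMetrics` are hypothetical names for illustration only; the reflection check exists just to make the visibility observable in a single file.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;

// Hypothetical illustration of a package-private accessor.
final class VisibilitySketch {
  static final class SimpleMetrics {
    private final Map<Integer, String> originalTypes;

    SimpleMetrics(Map<Integer, String> originalTypes) {
      this.originalTypes = originalTypes;
    }

    // No access modifier: visible only to classes in the same package,
    // so it does not become part of the public API surface.
    Map<Integer, String> originalTypes() {
      return originalTypes;
    }
  }

  // True when the method is neither public, protected, nor private.
  static boolean isPackagePrivate(Method method) {
    int mods = method.getModifiers();
    return !Modifier.isPublic(mods) && !Modifier.isProtected(mods) && !Modifier.isPrivate(mods);
  }

  static boolean originalTypesIsPackagePrivate() {
    try {
      return isPackagePrivate(SimpleMetrics.class.getDeclaredMethod("originalTypes"));
    } catch (NoSuchMethodException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("package-private: " + originalTypesIsPackagePrivate());
  }
}
```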

Contributor

Ok, this makes sense to me now. I had missed that, for the DV case, the change to FileMetadata already covers propagation of the types required for the future stats conversion (since we just copy over the fields from the metrics in the builder).
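The propagation described here can be sketched as a builder that copies every metrics field, including the type map, when metrics are attached. `FileBuilderSketch`, `SimpleMetrics`, `Builder`, and `exampleBuilder` are hypothetical names for this example and do not reproduce the actual FileMetadata builder.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a file-metadata builder that carries the type map along.
final class FileBuilderSketch {
  // Stand-in for a real type descriptor.
  enum FieldType { INT, LONG }

  static final class SimpleMetrics {
    final Map<Integer, ByteBuffer> lowerBounds;
    final Map<Integer, FieldType> originalTypes;

    SimpleMetrics(Map<Integer, ByteBuffer> lowerBounds, Map<Integer, FieldType> originalTypes) {
      this.lowerBounds = lowerBounds;
      this.originalTypes = originalTypes;
    }
  }

  static final class Builder {
    private Map<Integer, ByteBuffer> lowerBounds = null;
    private Map<Integer, FieldType> originalTypes = null;

    // Copies all fields from the metrics, so the types travel with the bounds.
    Builder withMetrics(SimpleMetrics metrics) {
      this.lowerBounds = metrics.lowerBounds;
      this.originalTypes = metrics.originalTypes;
      return this;
    }

    Map<Integer, ByteBuffer> lowerBounds() { return lowerBounds; }

    // Kept package-private in this sketch: only a later conversion step needs it.
    Map<Integer, FieldType> originalTypes() { return originalTypes; }
  }

  // Builds metrics for a single long column (field ID 3) and attaches them.
  static Builder exampleBuilder() {
    Map<Integer, ByteBuffer> lower = new HashMap<>();
    lower.put(3, ByteBuffer.allocate(8).putLong(0, 7L));
    Map<Integer, FieldType> types = new HashMap<>();
    types.put(3, FieldType.LONG);
    return new Builder().withMetrics(new SimpleMetrics(lower, types));
  }

  public static void main(String[] args) {
    System.out.println("carried type: " + exampleBuilder().originalTypes().get(3));
  }
}
```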

@nastra nastra force-pushed the metrics-carry-over-original-type branch from 1c3b6ce to 76c7a31 Compare July 31, 2025 08:26
@nastra nastra force-pushed the metrics-carry-over-original-type branch from 72c2a55 to e90eb8f Compare August 4, 2025 16:39
@amogh-jahagirdar amogh-jahagirdar left a comment

Thanks @nastra !

nastra commented Aug 5, 2025

thanks for reviewing @pvary and @amogh-jahagirdar, I'll get this merged

@nastra nastra merged commit 32f469b into apache:main Aug 5, 2025
43 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in Content Stats Aug 5, 2025
@nastra nastra deleted the metrics-carry-over-original-type branch August 5, 2025 09:38