API, Core: Preserve original Type for upper/lower bounds in Metrics #13695

nastra · 2025-07-29T08:36:58Z

This preserves the original type of the upper/lower bound of a particular field metric.
This will later be used to convert from Metrics to the new content stats.

parquet/src/test/java/org/apache/iceberg/parquet/TestVariantMetrics.java

api/src/main/java/org/apache/iceberg/Metrics.java

pvary · 2025-07-29T16:56:27Z

core/src/main/java/org/apache/iceberg/DataFiles.java

    private Map<Integer, Long> nanValueCounts = null;
    private Map<Integer, ByteBuffer> lowerBounds = null;
    private Map<Integer, ByteBuffer> upperBounds = null;
+    private Map<Integer, Type> originalTypes = null;


Do we need to copy this in the BaseFile too?

I don't think we would want to do this in BaseFile, since that would mean we would expose an originalTypes() method there, which isn't needed. For DataFiles we do it because most places call withMetrics() and we want to preserve the original types when we later start converting from Metrics to the new structure

So I think the principle here is that anywhere we need to build new DataFile or DeleteFile metadata we need to make sure that the metrics original types information is copied over so that it can be used when converting to the future structure. The only place that really needs read exposure to the originalTypes is the Metrics class itself.

But I think I agree with @pvary, I do think the field does need to be copied over (we don't need to expose it necessarily). I think the field needs to be copied over for the DeleteFile case since DeleteFile extends BaseFile and we could have stats for those that we would want to convert

> So I think the principle here is that anywhere we need to build new DataFile or DeleteFile metadata we need to make sure that the metrics original types information is copied over so that it can be used when converting to the future structure. The only place that really needs read exposure to the originalTypes is the Metrics class itself.

Yes that's absolutely correct. The main use case is that we want to convert from a Metrics instance to a Stats instance when e.g. an appender/writer creates a new Metrics instance and before those metrics are written to disk. Also we don't want to store originalTypes on disk, since that would require a spec change and we wouldn't have that information anyway when reading existing data files.
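To make the in-memory-only nature of the type information concrete, here is a minimal sketch of a Metrics to typed-stats conversion. It is not the Iceberg implementation: `SimpleMetrics`, `SimpleType`, `typedLowerBounds` and `intBound` are hypothetical names invented for illustration, and the little-endian fixed-width encoding of bounds is an assumption of the sketch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class MetricsToStatsSketch {
  // Hypothetical stand-in for the original type of a bound.
  enum SimpleType { INT, LONG }

  // Hypothetical, simplified stand-in for Metrics: binary bounds plus optional types.
  static final class SimpleMetrics {
    final Map<Integer, ByteBuffer> lowerBounds;
    // Only known in memory; null when the metrics were read back from disk.
    final Map<Integer, SimpleType> originalTypes;

    SimpleMetrics(Map<Integer, ByteBuffer> lowerBounds, Map<Integer, SimpleType> originalTypes) {
      this.lowerBounds = lowerBounds;
      this.originalTypes = originalTypes;
    }
  }

  // Returns typed lower bounds, or empty when the type of any bound is unknown.
  static Optional<Map<Integer, Object>> typedLowerBounds(SimpleMetrics metrics) {
    if (metrics.originalTypes == null) {
      return Optional.empty();
    }
    Map<Integer, Object> typed = new HashMap<>();
    for (Map.Entry<Integer, ByteBuffer> entry : metrics.lowerBounds.entrySet()) {
      SimpleType type = metrics.originalTypes.get(entry.getKey());
      if (type == null) {
        return Optional.empty();
      }
      // Read through a duplicate so the caller's buffer position is untouched.
      ByteBuffer view = entry.getValue().duplicate().order(ByteOrder.LITTLE_ENDIAN);
      if (type == SimpleType.INT) {
        typed.put(entry.getKey(), view.getInt(view.position()));
      } else {
        typed.put(entry.getKey(), view.getLong(view.position()));
      }
    }
    return Optional.of(typed);
  }

  // Helper for the sketch: encodes an int bound as 4 little-endian bytes.
  static ByteBuffer intBound(int value) {
    ByteBuffer buffer = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
    buffer.putInt(0, value);
    return buffer;
  }

  public static void main(String[] args) {
    Map<Integer, ByteBuffer> lower = new HashMap<>();
    lower.put(1, intBound(7));
    Map<Integer, SimpleType> types = new HashMap<>();
    types.put(1, SimpleType.INT);
    System.out.println(typedLowerBounds(new SimpleMetrics(lower, types)));
    System.out.println(typedLowerBounds(new SimpleMetrics(lower, null)));
  }
}
```

In this sketch the conversion succeeds only for metrics produced in memory by a writer; metrics rebuilt from stored files carry no types, so the method returns an empty result instead of guessing.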

> But I think I agree with @pvary, I do think the field does need to be copied over (we don't need to expose it necessarily). I think the field needs to be copied over for the DeleteFile case since DeleteFile extends BaseFile and we could have stats for those that we would want to convert

I took another look at this but I'm still not sure why we would want to carry forward the originalTypes in BaseFile. There's really only a single use case I could find in the codebase that would require exposing originalTypes() in order to carry it forward and that is in RewriteTablePathUtil when a delete manifest is rewritten. The metrics of that DeleteFile are carried forward in

iceberg/core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java

Lines 491 to 492 in 62d9ff5

    Metrics metricsWithTargetPath =
        ContentFileUtil.replacePathBounds(file, sourcePrefix, targetPrefix);

which might indicate that we would need to preserve originalTypes there as well, but the DeleteFile was read from the delete manifest at that point and we don't actually store originalTypes on disk, meaning that it would be null anyway.

That being said, the Metrics class is our main surface area where we want to preserve originalTypes. Once we keep individual metrics fields and don't create a new Metrics instance again from those fields (which is the case with BaseFile - except RewriteTablePathUtil), we shouldn't need to preserve originalTypes anymore.

@pvary @amogh-jahagirdar let me know if that makes sense or whether I'm missing something obvious

@nastra: Let me rephrase what I understand from your comment:

Currently we have Metrics - here in some cases we have some binary data for min and max which is not typed

In the future we will have only Stats - where the stats will be typed, and this type will help interpret the min/max values

In between, for a while we will have Metrics converted to Stats, and for this we need the type info

> @nastra: Let me rephrase what I understand from your comment:
>
> Currently we have Metrics - here in some cases we have some binary data for min and max which is not typed

currently lower/upper bounds are all binary and we don't know what their original type was

> In the future we will have only Stats - where the stats will be typed, and this type will help interpret the min/max values

yes the new stats structure will store upper/lower bound with their actual type

> In between, for a while we will have Metrics converted to Stats, and for this we need the type info

correct. We want to convert from metrics to stats and we can only do so if we know the original type of the upper/lower bound.
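A small illustration of why the original type is required: the same bytes decode to different values depending on the type assumed. This is only a sketch; `BoundType`, `decode` and `encodeLong` are hypothetical helpers, and the 8-byte little-endian encoding is an assumption of the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class BoundDecoding {
  // Hypothetical stand-in for the original type of an upper/lower bound.
  enum BoundType { LONG, DOUBLE }

  // Decodes an 8-byte little-endian bound according to the supplied type.
  static Object decode(ByteBuffer bound, BoundType type) {
    // Read through a duplicate so the caller's buffer position is untouched.
    ByteBuffer view = bound.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    if (type == BoundType.LONG) {
      return view.getLong(view.position());
    }
    return view.getDouble(view.position());
  }

  // Helper for the sketch: encodes a long as 8 little-endian bytes.
  static ByteBuffer encodeLong(long value) {
    ByteBuffer buffer = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
    buffer.putLong(0, value);
    return buffer;
  }

  public static void main(String[] args) {
    ByteBuffer bound = encodeLong(42L);
    // Identical bytes, two different interpretations.
    System.out.println(decode(bound, BoundType.LONG));
    System.out.println(decode(bound, BoundType.DOUBLE));
  }
}
```

Without the type, a reader of the buffer cannot tell which of the two interpretations was intended, which is the gap that carrying the original type alongside the bounds closes.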

One of the reasons why I'm doing the metrics -> stats conversion is because currently our appender/writer APIs are all returning Metrics after data has been written. Changing those APIs to return stats instead would be quite a big change, which I want to avoid until we figure out what we want the new API to look like.

Discussed with @nastra: we will make Metrics.originalTypes() package-private so that we don't pollute the API with this method. This way we don't have to explain to users when this field is available and when it is not.

I think this is a good compromise, so we can proceed with implementing the stats, and eventually deprecate the way to generate the stats from metrics.

OK, this makes sense to me now. I had missed that, for the DV case, the change to FileMetadata would already cover propagation of the types required for the future stats conversion (since we just copy over the fields from the metrics in the builder).
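The propagation described in this thread can be sketched as a builder that copies every field from the metrics it is given, including the original types. This is an illustrative sketch only: `SimpleMetrics` and `FileBuilder` are hypothetical classes, types are represented as plain strings, and the real builders carry many more fields.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

final class BuilderCopySketch {
  // Hypothetical, simplified metrics holder.
  static final class SimpleMetrics {
    private final Map<Integer, ByteBuffer> lowerBounds;
    private final Map<Integer, String> originalTypes;

    SimpleMetrics(Map<Integer, ByteBuffer> lowerBounds, Map<Integer, String> originalTypes) {
      this.lowerBounds = lowerBounds;
      this.originalTypes = originalTypes;
    }

    Map<Integer, ByteBuffer> lowerBounds() {
      return lowerBounds;
    }

    // No access modifier: package-private in Java, so it stays out of the public API.
    Map<Integer, String> originalTypes() {
      return originalTypes;
    }
  }

  // Hypothetical file-metadata builder that keeps individual metrics fields.
  static final class FileBuilder {
    private Map<Integer, ByteBuffer> lowerBounds = null;
    private Map<Integer, String> originalTypes = null;

    // Copies all fields over, so the type information is not dropped on the way.
    FileBuilder withMetrics(SimpleMetrics metrics) {
      this.lowerBounds = metrics.lowerBounds();
      this.originalTypes = metrics.originalTypes();
      return this;
    }

    // Rebuilds a metrics object from the copied fields, types included.
    SimpleMetrics metrics() {
      return new SimpleMetrics(lowerBounds, originalTypes);
    }
  }

  public static void main(String[] args) {
    Map<Integer, ByteBuffer> lower = new HashMap<>();
    lower.put(1, ByteBuffer.allocate(8));
    Map<Integer, String> types = new HashMap<>();
    types.put(1, "long");
    SimpleMetrics rebuilt = new FileBuilder().withMetrics(new SimpleMetrics(lower, types)).metrics();
    System.out.println(rebuilt.originalTypes());
  }
}
```

If `withMetrics` skipped the `originalTypes` assignment, the rebuilt metrics would silently lose the types, which is the failure mode the review comments are guarding against.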

api/src/main/java/org/apache/iceberg/Metrics.java

amogh-jahagirdar · 2025-07-30T20:11:46Z

core/src/main/java/org/apache/iceberg/DataFiles.java


So I think the principle here is that anywhere we need to build new DataFile or DeleteFile metadata we need to make sure that the metrics original types information is copied over so that it can be used when converting to the future structure. The only place that really needs read exposure to the originalTypes is the Metrics class itself.

But I think I agree with @pvary, I do think the field does need to be copied over (we don't need to expose it necessarily). I think the field needs to be copied over for the DeleteFile case since DeleteFile extends BaseFile and we could have stats for those that we would want to convert

amogh-jahagirdar

Thanks @nastra !

amogh-jahagirdar · 2025-08-04T22:39:51Z

core/src/main/java/org/apache/iceberg/DataFiles.java


OK, this makes sense to me now. I had missed that, for the DV case, the change to FileMetadata would already cover propagation of the types required for the future stats conversion (since we just copy over the fields from the metrics in the builder).

nastra · 2025-08-05T06:14:33Z

thanks for reviewing @pvary and @amogh-jahagirdar, I'll get this merged

github-actions bot added API parquet core labels Jul 29, 2025

nastra added this to Content Stats Jul 29, 2025

nastra requested review from amogh-jahagirdar and danielcweeks July 29, 2025 08:40

nastra moved this to In review in Content Stats Jul 29, 2025

nastra force-pushed the metrics-carry-over-original-type branch from f49d706 to 5c90a90 Compare July 29, 2025 09:29

github-actions bot added the ORC label Jul 29, 2025

nastra force-pushed the metrics-carry-over-original-type branch from 5c90a90 to d9ec96a Compare July 29, 2025 09:34

pvary reviewed Jul 29, 2025

parquet/src/test/java/org/apache/iceberg/parquet/TestVariantMetrics.java (outdated, resolved)

pvary reviewed Jul 29, 2025

api/src/main/java/org/apache/iceberg/Metrics.java (resolved)

nastra force-pushed the metrics-carry-over-original-type branch from d9ec96a to 1c3b6ce Compare July 29, 2025 16:06

pvary reviewed Jul 29, 2025

amogh-jahagirdar reviewed Jul 30, 2025

nastra added 2 commits July 31, 2025 09:25

API, Core: Preserve original Type for upper/lower bounds in Metrics

806956f

review feedback

76c7a31

nastra force-pushed the metrics-carry-over-original-type branch from 1c3b6ce to 76c7a31 Compare July 31, 2025 08:26

nastra requested review from amogh-jahagirdar and pvary August 1, 2025 10:35

pvary approved these changes Aug 4, 2025

change visibility

e90eb8f

nastra force-pushed the metrics-carry-over-original-type branch from 72c2a55 to e90eb8f Compare August 4, 2025 16:39

amogh-jahagirdar approved these changes Aug 4, 2025

nastra merged commit 32f469b into apache:main Aug 5, 2025
43 checks passed

github-project-automation bot moved this from In review to Done in Content Stats Aug 5, 2025

nastra deleted the metrics-carry-over-original-type branch August 5, 2025 09:38

3 participants