Introduce MetricsMaxInferredColumnDefaultsStrategy #13039

jkolash · 2025-05-13T13:57:17Z

These changes address issue #11253 by allowing a new default strategy to be set, one that considers the total number of field metrics rather than just the number of top-level columns.

The new property is
write.metadata.metrics.max-inferred-column-defaults.strategy

and the valid values are original, depth, and breadth.

It currently preserves the original default behavior, since changing
that could be more disruptive and lead to unexpected
performance regressions. A breadth-first strategy would likely be the most
compatible with the original strategy, so it would be a safer default
than the depth strategy. The original strategy could then be
deprecated and removed in the future.

This could also easily support a previously discussed feature of
reversing the order of field IDs when considering defaults, though that
won't be included in this PR.

I'm inclined to remove the depth strategy unless there is a strong
desire to keep it.

rdblue · 2025-05-30T18:15:12Z

I brought this up at the community sync to gauge what other people thought about changing the behavior here without introducing a "strategy" option and the response was positive. Our rationale was that tables that have deep nesting and a lot of top-level columns are uncommon and would likely benefit from removing a lot of unnecessary metrics overall. So rather than introducing a strategy, I think we should move forward with the "breadth" approach: keep stats for top-level primitive columns, then the next layer, and so on until the 100 primitive field limit is exhausted.
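To make the ordering concrete, here is a minimal sketch of that layer-by-layer selection over a toy schema model. The `Field` class, the queue-based traversal, and this standalone `limitFieldIds` are illustrative assumptions, not Iceberg's classes or the code in this PR; a field with no children stands in for a primitive.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy model (hypothetical, not Iceberg's types): a field is primitive when it has no children. */
final class BreadthFirstLimit {
  static final class Field {
    final int id;
    final List<Field> children;

    Field(int id, Field... children) {
      this.id = id;
      this.children = Arrays.asList(children);
    }

    boolean isPrimitive() {
      return children.isEmpty();
    }
  }

  /** Returns up to {@code limit} primitive field IDs, taking shallower layers first. */
  static List<Integer> limitFieldIds(List<Field> topLevel, int limit) {
    List<Integer> selected = new ArrayList<>();
    Deque<Field> queue = new ArrayDeque<>(topLevel);
    while (!queue.isEmpty() && selected.size() < limit) {
      Field field = queue.removeFirst();
      if (field.isPrimitive()) {
        selected.add(field.id);
      } else {
        // Nested fields wait behind everything already queued from the current layer.
        queue.addAll(field.children);
      }
    }

    return selected;
  }

  public static void main(String[] args) {
    // Schema: 1: int, 2: struct<3: int, 4: struct<5: int>>, 6: int
    List<Field> schema =
        Arrays.asList(
            new Field(1),
            new Field(2, new Field(3), new Field(4, new Field(5))),
            new Field(6));
    System.out.println(limitFieldIds(schema, 3)); // prints [1, 6, 3]
  }
}
```

With a limit of 3, both top-level primitives (1 and 6) are kept before anything nested, and field 5 is the first to be dropped because it sits deepest.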

jkolash · 2025-05-30T18:34:43Z

Thanks, I will simplify this PR to include only the breadth strategy, without a new property.

jkolash · 2025-06-03T00:48:58Z

While doing a self-review, I came to question why a user-provided default should not be bounded. This was the behavior before, but I'm not quite sure it is right. Users can set the write.metadata.metrics.max-inferred-column-defaults property, which gives them more control over the bounding behavior than having none at all.

core/src/main/java/org/apache/iceberg/MetricsConfig.java

Create TestMetricsConfig for these tests as they are independent of iceberg version vs in the TestMetrics class.

jkolash · 2025-06-05T12:43:25Z

Stress testing this: for a schema with 1 million structs, I got ~133 ms per iteration when bounding to 100 fields.

  @Test
  public void perf() {
    AtomicInteger fieldId = new AtomicInteger(0);

    // Each call yields a struct column that wraps a single required int field.
    Supplier<Types.NestedField> newStruct = () -> required(fieldId.getAndIncrement(), String.valueOf(fieldId.get()),
            Types.StructType.of(required(fieldId.getAndIncrement(), String.valueOf(fieldId.get()), Types.IntegerType.get())));

    int items = 1000000;
    Schema schema = new Schema(IntStream.range(0, items).mapToObj(i -> newStruct.get()).collect(Collectors.toList()));

    // Time repeated calls that bound the schema to 100 fields.
    StopWatch sw = StopWatch.createStarted();
    int iterations = 100;
    for (int i = 0; i < iterations; i++) {
      MetricsConfig.boundedBreadthFirstSubSchema(schema, 100);
    }
    sw.stop();
    System.out.println("ms per iteration: " + sw.getTime(TimeUnit.MILLISECONDS) * 1.0 / iterations);
  }

Output:

ms per iteration: 133.08

jkolash · 2025-06-16T11:51:34Z

@rdblue Let me know what else is needed for this PR.

jkolash · 2025-06-16T11:52:03Z

Should I re-open it with a new description?

github-actions · 2025-07-17T00:19:33Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions · 2025-07-25T00:19:13Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

rdblue · 2025-07-29T23:45:01Z

core/src/main/java/org/apache/iceberg/MetricsConfig.java

+          @Override
+          public Iterable<Integer> list(Types.ListType list, Supplier<Iterable<Integer>> future) {
+            List<Integer> returnValue = Lists.newArrayListWithCapacity(1);
+            if (list.elementType().isPrimitiveType()) {


I think that this can be simpler:

  if (list.elementType().isPrimitiveType() || list.elementType().isVariantType()) {
    return ImmutableList.of(list.elementId());
  }

  return future.get();

If the element is a primitive type (or a variant), then we don't need to get the value of the future because we already know that it is going to be Collections.emptyList(). I prefer that implementation because it cuts down on Iterable instances that concatenate things that we know are empty.

rdblue · 2025-07-29T23:45:44Z

core/src/main/java/org/apache/iceberg/MetricsConfig.java

+              Types.MapType map,
+              Supplier<Iterable<Integer>> keyFuture,
+              Supplier<Iterable<Integer>> valueFuture) {
+            List<Integer> returnValue = Lists.newArrayListWithCapacity(2);


I'd make the same simplification as I suggested above here:

  Iterable<Integer> keyResult;
  if (map.keyType().isPrimitiveType() || map.keyType().isVariantType()) {
    keyResult = ImmutableList.of(map.keyId());
  } else {
    keyResult = keyFuture.get();
  }

  Iterable<Integer> valueResult;
  if (map.valueType().isPrimitiveType() || map.valueType().isVariantType()) {
    valueResult = ImmutableList.of(map.valueId());
  } else {
    valueResult = valueFuture.get();
  }

  return Iterables.concat(keyResult, valueResult);

rdblue · 2025-07-29T23:54:15Z

core/src/main/java/org/apache/iceberg/MetricsConfig.java

+
+          @Override
+          public Iterable<Integer> struct(
+              Types.StructType struct, Iterable<Iterable<Integer>> fieldResults) {


Javadoc for the visit method says that the field results are traversed when the Iterable passed here is consumed:

Structs are passed an {@link Iterable} that traverses child fields during iteration.

The intent of this structure is to consume these Iterable instances while working with the object, but this creates a new Iterable using concat that is lazy. The result of returning the field results unconsumed is an unpredictable (or certainly hard to understand) order when visiting fields.

I think that this does the right thing because it will create a structure of iterables that contain the primitive field IDs first and it doesn't really matter when the results are consumed, but I think this structure makes the code harder to understand than it needs to be.

I think I'd prefer a solution that keeps a single top-level list and adds only the requested number of field IDs to it. Most of the code would be the same, with each structure checking whether it contains primitives and adding them in order, before traversing the children if more IDs need to be added.

Here's what I was thinking:

  static Set<Integer> limitFieldIds(Schema schema, int limit) {
    return TypeUtil.visit(
        schema,
        new TypeUtil.CustomOrderSchemaVisitor<>() {
          private final Set<Integer> idSet = Sets.newHashSet();

          private boolean shouldContinue() {
            return idSet.size() < limit;
          }

          @Override
          public Set<Integer> schema(Schema schema, Supplier<Set<Integer>> structResult) {
            return idSet;
          }

          @Override
          public Set<Integer> struct(Types.StructType struct, Iterable<Set<Integer>> fieldResults) {
            Iterator<Types.NestedField> fields = struct.fields().iterator();
            while (shouldContinue() && fields.hasNext()) {
              Types.NestedField field = fields.next();
              if (field.type().isPrimitiveType() || field.type().isVariantType()) {
                idSet.add(field.fieldId());
              }
            }

            // visit children to add more ids
            Iterator<Set<Integer>> iter = fieldResults.iterator();
            while (iter.hasNext() && shouldContinue()) {
              iter.next();
            }

            return null;
          }

          @Override
          public Set<Integer> field(Types.NestedField field, Supplier<Set<Integer>> fieldResult) {
            return fieldResult.get();
          }

          @Override
          public Set<Integer> list(Types.ListType list, Supplier<Set<Integer>> elementResult) {
            if (shouldContinue() && (list.elementType().isPrimitiveType() || list.elementType().isVariantType())) {
              idSet.add(list.elementId());
            }

            if (shouldContinue()) {
              elementResult.get();
            }

            return null;
          }

          @Override
          public Set<Integer> map(
              Types.MapType map, Supplier<Set<Integer>> keyResult, Supplier<Set<Integer>> valueResult) {
            if (shouldContinue() && (map.keyType().isPrimitiveType() || map.keyType().isVariantType())) {
              idSet.add(map.keyId());
            }

            if (shouldContinue() && (map.valueType().isPrimitiveType() || map.valueType().isVariantType())) {
              idSet.add(map.valueId());
            }

            if (shouldContinue()) {
              keyResult.get();
              valueResult.get();
            }

            return null;
          }
        });
  }

I will try this out and also add a test with variant on the v3 parameterization in TestWriterMetrics.java.

OK, I did something along those lines. I avoided

  if (shouldContinue() && (map.keyType().isPrimitiveType() || map.keyType().isVariantType())) {
    idSet.add(map.keyId());
  }

since I felt it was a little unwieldy. I think the updated version is more readable.

rdblue · 2025-07-29T23:54:39Z

core/src/main/java/org/apache/iceberg/MetricsConfig.java

+                orderedFieldIds.add(field.fieldId());
+              }
+            }
+            return Iterables.concat(orderedFieldIds, Iterables.concat(fieldResults));


Style: Iceberg's convention is to leave an empty newline between control flow blocks and the following statement.

rdblue · 2025-07-30T00:12:25Z

core/src/main/java/org/apache/iceberg/MetricsConfig.java

+        });
+  }
+
+  static Schema boundedBreadthFirstSubSchema(Schema schema, int maxInferredDefaultColumns) {


It's good to use breadth-first priority, but I don't think that the policy needs to be exposed to the caller. How about renaming this to something like limitFields?

We need to check this before adding to the idSet and before visiting further children.

rdblue · 2025-08-05T16:01:24Z

core/src/main/java/org/apache/iceberg/MetricsConfig.java

+        defaultMode = DEFAULT_MODE;
+      } else {
+        Schema subSchema = limitSchema(schema, maxInferredDefaultColumns);
+        for (Integer id : TypeUtil.getProjectedIds(subSchema)) {


Why is this limiting to an ID set, then creating a schema, and finally calling getProjectedIds to recover the ID set? Couldn't this just call the ID set method directly?

I think that works. There may be a behavior difference between the IDs returned from limitFieldIds and those from limitSchema: I think limitSchema may include the intermediate field IDs for nested structures. I'm not sure it matters. I will investigate and make the change.

rdblue · 2025-08-05T16:02:48Z

core/src/main/java/org/apache/iceberg/MetricsConfig.java

+          @Override
+          @SuppressWarnings("ReturnValueIgnored")
+          public Set<Integer> field(Types.NestedField field, Supplier<Set<Integer>> fieldResult) {
+            if (shouldContinue()) {


I don't think this needs to call shouldContinue everywhere, as long as it is called before adding an ID to the set. It's not a problem to, but it's okay to traverse a field and return quickly without adding an ID rather than not traverse a field. And for fields specifically, shouldContinue is called before the field is visited.
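As a rough illustration of why one guard is enough, the sketch below enforces the limit only at the point where an ID is added and lets traversal continue unconditionally. It uses a toy model with hypothetical names (`GuardOnAdd`, `Node`, `visitStruct`) that only mimics the primitives-first shape of the visitor discussed in this thread; it is not Iceberg's visitor API.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model with hypothetical names; a node with no children stands in for a primitive. */
final class GuardOnAdd {
  static final class Node {
    final int id;
    final List<Node> children;

    Node(int id, Node... children) {
      this.id = id;
      this.children = Arrays.asList(children);
    }

    boolean isPrimitive() {
      return children.isEmpty();
    }
  }

  private final Set<Integer> idSet = new LinkedHashSet<>();
  private final int limit;
  private int structsVisited = 0;

  GuardOnAdd(int limit) {
    this.limit = limit;
  }

  private boolean shouldContinue() {
    return idSet.size() < limit;
  }

  /** Adds this struct's primitive fields first, then descends into nested structs. */
  void visitStruct(List<Node> fields) {
    structsVisited++;
    for (Node field : fields) {
      // The only place the limit is enforced: right before an ID is added.
      if (field.isPrimitive() && shouldContinue()) {
        idSet.add(field.id);
      }
    }

    for (Node field : fields) {
      // No guard here: a late visit is wasted work, but it cannot add an ID.
      if (!field.isPrimitive()) {
        visitStruct(field.children);
      }
    }
  }

  Set<Integer> ids() {
    return idSet;
  }

  int structsVisited() {
    return structsVisited;
  }

  public static void main(String[] args) {
    // Schema: 1: int, 2: struct<3: int, 4: struct<5: int>>, 6: int
    List<Node> schema =
        Arrays.asList(
            new Node(1), new Node(2, new Node(3), new Node(4, new Node(5))), new Node(6));
    GuardOnAdd visitor = new GuardOnAdd(3);
    visitor.visitStruct(schema);
    System.out.println(visitor.ids() + " after visiting " + visitor.structsVisited() + " structs");
  }
}
```

Run with a limit of 3, the sketch still walks all three structs but ends with exactly IDs 1, 6, and 3. Extra guards before traversal would only save work, not change the result.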

rdblue

There are a couple of minor suggestions, but I think that this is correct and can be committed. I'd also like to get this into 1.10 so I added it to the milestone. We can merge this any time if we need to get a release candidate out, but otherwise I'll watch for the updates. Thanks, @jkolash!

rdblue · 2025-08-07T19:27:13Z

Merged. Thanks, @jkolash!

github-actions bot added core data labels May 13, 2025

jkolash added 4 commits May 13, 2025 10:04

spotlessApply

4d313b7

Fix last minute refactoring changes

f6fd513

conform to checkstyle config

dcbf84f

need to change strategy to bound column metrics currently

2cf476a

jkolash force-pushed the wip-fix-11253 branch from a77d0d6 to 2cf476a Compare May 13, 2025 20:48

Previously this was evaluated when considering only the length of top…

9dd8ec8

… level columns list not the number of total projected fieldIds from the schema.

Simplify changes to just include MetricBreadthPriority

d1791c8

jkolash force-pushed the wip-fix-11253 branch from 38dbca3 to d1791c8 Compare June 3, 2025 00:44

rdblue reviewed Jun 4, 2025

core/src/main/java/org/apache/iceberg/MetricsConfig.java (outdated)

rdblue reviewed Jun 4, 2025

core/src/main/java/org/apache/iceberg/MetricsConfig.java (outdated)

jkolash added 2 commits June 4, 2025 19:37

Simplify this further.

a77c3eb

Eliminate unnecessary class

50706b9

Create TestMetricsConfig for these tests as they are independent of iceberg version vs in the TestMetrics class.

jkolash force-pushed the wip-fix-11253 branch from 9683645 to 50706b9 Compare June 5, 2025 00:34

Avoid nested concat to avoid potential stack overflow

0486ca5

github-actions bot added the stale label Jul 17, 2025

github-actions bot closed this Jul 25, 2025

rdblue reopened this Jul 29, 2025

rdblue reviewed Jul 29, 2025

rdblue reviewed Jul 30, 2025

github-actions bot removed the stale label Jul 30, 2025

jkolash added 4 commits July 29, 2025 22:29

Add metrics test for iceberg v3+ for variant type

f918603

Address further simplification feedback.

a4909ab

consolidate around shouldTerminateEarly

ccc2c86

We need to check this before adding to the idSet and before visiting further children.

Use assertj assumptions

fc44b4a

jkolash force-pushed the wip-fix-11253 branch from b59afcc to fc44b4a Compare July 30, 2025 22:10

Address style check issues

8603117

rdblue reviewed Aug 5, 2025

rdblue added this to the Iceberg 1.10.0 milestone Aug 6, 2025

rdblue approved these changes Aug 6, 2025

Address final feedback removing subschema creation

a25f1c1

rdblue approved these changes Aug 7, 2025

rdblue merged commit 1b0e4b3 into apache:main Aug 7, 2025
42 checks passed

dramaticlly mentioned this pull request Aug 11, 2025

Core, Docs: Update write.metadata.metrics.max-inferred-column-defaults documentation and add benchmark #13785

Merged
