@jkolash jkolash commented May 13, 2025

These changes address issue #11253 by adding a new default strategy that considers the total number of field metrics rather than just the number of top-level columns.

The new property is
write.metadata.metrics.max-inferred-column-defaults.strategy

and the valid values are original, depth, and breadth.

The PR currently preserves the original default behavior, since changing
it could be disruptive and lead to unexpected performance regressions.
A breadth-first strategy would likely be the most compatible with the
original strategy, so it would be a safer default than the depth
strategy. The original strategy could then be deprecated and removed in
the future.

This could also easily support a previously discussed feature of
reversing the order of field IDs when considering defaults, though that
won't be included in this PR.

I'm inclined to remove the depth strategy unless there is a strong
desire to keep it.

… level

columns list not the number of total projected fieldIds from the schema.

rdblue commented May 30, 2025

I brought this up at the community sync to gauge what other people thought about changing the behavior here without introducing a "strategy" option and the response was positive. Our rationale was that tables that have deep nesting and a lot of top-level columns are uncommon and would likely benefit from removing a lot of unnecessary metrics overall. So rather than introducing a strategy, I think we should move forward with the "breadth" approach: keep stats for top-level primitive columns, then the next layer, and so on until the 100 primitive field limit is exhausted.
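
The "breadth" approach described above can be sketched on a toy model. This is a minimal illustration, not the PR's implementation: the `Field` class and this `limitFieldIds` helper are hypothetical stand-ins for Iceberg's schema types, and only structs are modeled (no lists, maps, or variants).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Toy model only: these classes are hypothetical and are not Iceberg's Schema or Types API.
class BreadthFirstLimitSketch {

  // A field is either a primitive leaf (no children) or a struct with child fields.
  static final class Field {
    final int id;
    final List<Field> children;

    Field(int id, Field... children) {
      this.id = id;
      this.children = Arrays.asList(children);
    }

    boolean isPrimitive() {
      return children.isEmpty();
    }
  }

  // Collects primitive field IDs layer by layer, stopping once `limit` IDs are selected.
  static List<Integer> limitFieldIds(List<Field> topLevel, int limit) {
    List<Integer> selected = new ArrayList<>();
    Deque<List<Field>> pending = new ArrayDeque<>();
    pending.add(topLevel);

    while (!pending.isEmpty() && selected.size() < limit) {
      List<Field> fields = pending.poll();

      // First take the primitives of this struct, in declaration order.
      for (Field field : fields) {
        if (field.isPrimitive() && selected.size() < limit) {
          selected.add(field.id);
        }
      }

      // Then queue nested structs so deeper layers are considered only afterwards.
      for (Field field : fields) {
        if (!field.isPrimitive()) {
          pending.add(field.children);
        }
      }
    }

    return selected;
  }

  public static void main(String[] args) {
    // Schema shape: 1: int, 2: struct<3: int, 4: struct<5: int>>, 6: int
    List<Field> schema =
        Arrays.asList(
            new Field(1), new Field(2, new Field(3), new Field(4, new Field(5))), new Field(6));

    System.out.println(limitFieldIds(schema, 3)); // prints [1, 6, 3]
  }
}
```

With a limit of 3, both top-level primitives (1 and 6) are kept before any nested field, and the remaining slot goes to field 3 from the next layer.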

jkolash commented May 30, 2025

Thanks, I will simplify this PR to just include the breadth strategy and without a new property.

jkolash commented Jun 3, 2025

While doing a self-review, I came to question why a user-provided default should not be bounded. That was the behavior before, but I'm not sure it is right. Users can set the write.metadata.metrics.max-inferred-column-defaults property, which gives them more control over the bounding behavior than having none at all.
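
As a rough illustration of how a writer might resolve that limit from table properties, here is a minimal sketch over a plain `Map`. The `maxInferredColumns` helper is hypothetical, and the fallback of 100 is an assumption taken from the 100-field limit mentioned earlier in this thread.

```java
import java.util.HashMap;
import java.util.Map;

class MaxInferredColumnsSketch {
  // Property name quoted from the comment above.
  static final String MAX_INFERRED = "write.metadata.metrics.max-inferred-column-defaults";

  // Assumed fallback, based on the 100-field limit mentioned earlier in this thread.
  static final int ASSUMED_DEFAULT = 100;

  // Hypothetical helper: returns the configured limit, or the assumed default when unset.
  static int maxInferredColumns(Map<String, String> tableProperties) {
    String value = tableProperties.get(MAX_INFERRED);
    return value == null ? ASSUMED_DEFAULT : Integer.parseInt(value.trim());
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    System.out.println(maxInferredColumns(props)); // prints 100

    props.put(MAX_INFERRED, "25");
    System.out.println(maxInferredColumns(props)); // prints 25
  }
}
```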

jkolash added 2 commits June 4, 2025 19:37
Create TestMetricsConfig for these tests as they are independent
of iceberg version vs in the TestMetrics class.

jkolash commented Jun 5, 2025

Stress testing this: for a schema with 1 million structs, I got ~133 ms per iteration when bounding to 100 fields.

  @Test
  public void perf() {
    AtomicInteger fieldId = new AtomicInteger(0);

    // Each call builds a required struct field wrapping a single required int field,
    // drawing fresh field IDs (and ID-derived names) from the shared counter.
    Supplier<Types.NestedField> newStruct =
        () ->
            required(
                fieldId.getAndIncrement(),
                String.valueOf(fieldId.get()),
                Types.StructType.of(
                    required(
                        fieldId.getAndIncrement(),
                        String.valueOf(fieldId.get()),
                        Types.IntegerType.get())));

    int items = 1000000;
    Schema schema =
        new Schema(
            IntStream.range(0, items).mapToObj(i -> newStruct.get()).collect(Collectors.toList()));

    // Time repeated calls that bound the schema to 100 fields.
    StopWatch sw = StopWatch.createStarted();
    int iterations = 100;
    for (int i = 0; i < iterations; i++) {
      MetricsConfig.boundedBreadthFirstSubSchema(schema, 100);
    }

    sw.stop();
    System.out.println(
        "ms per iteration: " + sw.getTime(TimeUnit.MILLISECONDS) * 1.0 / iterations);
  }

Output:

ms per iteration: 133.08

jkolash commented Jun 16, 2025

@rdblue Let me know what else is needed for this PR.

jkolash commented Jun 16, 2025

Should I re-open it with a new description?

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 17, 2025
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jul 25, 2025
@rdblue rdblue reopened this Jul 29, 2025
@Override
public Iterable<Integer> list(Types.ListType list, Supplier<Iterable<Integer>> future) {
List<Integer> returnValue = Lists.newArrayListWithCapacity(1);
if (list.elementType().isPrimitiveType()) {
Contributor

I think that this can be simpler:

  if (list.elementType().isPrimitiveType() || list.elementType().isVariantType()) {
    return ImmutableList.of(list.elementId());
  }

  return future.get();

If the element is a primitive type (or a variant), then we don't need to get the value of the future because we already know that it is going to be Collections.emptyList(). I prefer that implementation because it cuts down on Iterable instances that concatenate things that we know are empty.
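
The short-circuit relies on the future being a lazy `Supplier`: if `get()` is never called, the child traversal never runs. A minimal standalone sketch of that behavior, assuming a hypothetical `list` helper in which `elementIsLeaf` stands in for the primitive-or-variant check:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class LazyFutureSketch {
  // Hypothetical stand-in for a list visitor callback: `elementIsLeaf` plays the role of the
  // primitive-or-variant check and `future` the lazily evaluated child result.
  static List<Integer> list(int elementId, boolean elementIsLeaf, Supplier<List<Integer>> future) {
    if (elementIsLeaf) {
      return Collections.singletonList(elementId); // child traversal is never triggered
    }

    return future.get(); // only nested element types pay for the traversal
  }

  public static void main(String[] args) {
    AtomicInteger traversals = new AtomicInteger();
    Supplier<List<Integer>> future =
        () -> {
          traversals.incrementAndGet();
          return Collections.singletonList(42);
        };

    System.out.println(list(7, true, future)); // prints [7]
    System.out.println(traversals.get()); // prints 0
    System.out.println(list(7, false, future)); // prints [42]
    System.out.println(traversals.get()); // prints 1
  }
}
```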

Types.MapType map,
Supplier<Iterable<Integer>> keyFuture,
Supplier<Iterable<Integer>> valueFuture) {
List<Integer> returnValue = Lists.newArrayListWithCapacity(2);
Contributor

I'd make the same simplification as I suggested above here:

            Iterable<Integer> keyResult;
            if (map.keyType().isPrimitiveType() || map.keyType().isVariantType()) {
              keyResult = ImmutableList.of(map.keyId());
            } else {
              keyResult = keyFuture.get();
            }

            Iterable<Integer> valueResult;
            if (map.valueType().isPrimitiveType() || map.valueType().isVariantType()) {
              valueResult = ImmutableList.of(map.valueId());
            } else {
              valueResult = valueFuture.get();
            }

            return Iterables.concat(keyResult, valueResult);


@Override
public Iterable<Integer> struct(
Types.StructType struct, Iterable<Iterable<Integer>> fieldResults) {
Contributor

Javadoc for the visit method says that the field results are traversed when the Iterable passed here is consumed:

Structs are passed an {@link Iterable} that traverses child fields during iteration.

The intent of this structure is to consume these Iterable instances while working with the object, but this creates a new Iterable using concat that is lazy. The result of returning the field results unconsumed is an unpredictable (or certainly hard to understand) order when visiting fields.

I think that this does the right thing because it will create a structure of iterables that contain the primitive field IDs first and it doesn't really matter when the results are consumed, but I think this structure makes the code harder to understand than it needs to be.

I think I'd prefer a solution that keeps a single top-level list and adds only the requested number of field IDs to it. Most of the code would be the same, with each structure checking whether it contains primitives and adding them in order, before traversing the children if more IDs need to be added.

Contributor

Here's what I was thinking:

  static Set<Integer> limitFieldIds(Schema schema, int limit) {
    return TypeUtil.visit(
        schema,
        new TypeUtil.CustomOrderSchemaVisitor<>() {
          private final Set<Integer> idSet = Sets.newHashSet();

          private boolean shouldContinue() {
            return idSet.size() < limit;
          }

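          @Override
          public Set<Integer> schema(Schema schema, Supplier<Set<Integer>> structResult) {
            // the future is lazy, so trigger traversal of the root struct before returning
            structResult.get();
            return idSet;
          }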

          @Override
          public Set<Integer> struct(Types.StructType struct, Iterable<Set<Integer>> fieldResults) {
            Iterator<Types.NestedField> fields = struct.fields().iterator();
            while (shouldContinue() && fields.hasNext()) {
              Types.NestedField field = fields.next();
              if (field.type().isPrimitiveType() || field.type().isVariantType()) {
                idSet.add(field.fieldId());
              }
            }

            // visit children to add more ids
            Iterator<Set<Integer>> iter = fieldResults.iterator();
            while (iter.hasNext() && shouldContinue()) {
              iter.next();
            }

            return null;
          }

          @Override
          public Set<Integer> field(Types.NestedField field, Supplier<Set<Integer>> fieldResult) {
            return fieldResult.get();
          }

          @Override
          public Set<Integer> list(Types.ListType list, Supplier<Set<Integer>> elementResult) {
            if (shouldContinue()
                && (list.elementType().isPrimitiveType() || list.elementType().isVariantType())) {
              idSet.add(list.elementId());
            }

            if (shouldContinue()) {
              elementResult.get();
            }

            return null;
          }

          @Override
          public Set<Integer> map(
              Types.MapType map,
              Supplier<Set<Integer>> keyResult,
              Supplier<Set<Integer>> valueResult) {
            if (shouldContinue()
                && (map.keyType().isPrimitiveType() || map.keyType().isVariantType())) {
              idSet.add(map.keyId());
            }

            if (shouldContinue()
                && (map.valueType().isPrimitiveType() || map.valueType().isVariantType())) {
              idSet.add(map.valueId());
            }

            if (shouldContinue()) {
              keyResult.get();
              valueResult.get();
            }

            return null;
          }
        });
  }

Contributor Author

I will try this out and also add a test with variant on the v3 parameterization in TestWriterMetrics.java.

Contributor Author

OK, I did something along those lines. I avoided

            if (shouldContinue()
                && (map.keyType().isPrimitiveType() || map.keyType().isVariantType())) {
              idSet.add(map.keyId());
            }

since I felt it was a little unwieldy. I think the updated version is more readable.

orderedFieldIds.add(field.fieldId());
}
}
return Iterables.concat(orderedFieldIds, Iterables.concat(fieldResults));
Contributor

Style: Iceberg's convention is to leave an empty newline between control flow blocks and the following statement.

});
}

static Schema boundedBreadthFirstSubSchema(Schema schema, int maxInferredDefaultColumns) {
Contributor

It's good to use breadth-first priority, but I don't think that the policy needs to be exposed to the caller. How about renaming this to something like limitFields?

@github-actions github-actions bot removed the stale label Jul 30, 2025
defaultMode = DEFAULT_MODE;
} else {
Schema subSchema = limitSchema(schema, maxInferredDefaultColumns);
for (Integer id : TypeUtil.getProjectedIds(subSchema)) {
Contributor

Why is this limiting to an ID set, then creating a schema, and finally calling getProjectedIds to recover the ID set? Couldn't this just call the ID set method directly?

Contributor Author

I think that works. There may be a behavior difference between the IDs returned from limitFieldIds and limitSchema: I think limitSchema may include the intermediate field IDs for nested structures. I'm not sure it matters. I will investigate and make the change.

@Override
@SuppressWarnings("ReturnValueIgnored")
public Set<Integer> field(Types.NestedField field, Supplier<Set<Integer>> fieldResult) {
if (shouldContinue()) {
Contributor

I don't think this needs to call shouldContinue everywhere, as long as it is called before adding an ID to the set. Calling it everywhere isn't a problem, but it's fine to traverse a field and return quickly without adding an ID rather than skipping the traversal. And for fields specifically, shouldContinue is called before the field is visited.

@rdblue rdblue added this to the Iceberg 1.10.0 milestone Aug 6, 2025
@rdblue rdblue left a comment

There are a couple of minor suggestions, but I think that this is correct and can be committed. I'd also like to get this into 1.10 so I added it to the milestone. We can merge this any time if we need to get a release candidate out, but otherwise I'll watch for the updates. Thanks, @jkolash!

@rdblue rdblue merged commit 1b0e4b3 into apache:main Aug 7, 2025
42 checks passed

rdblue commented Aug 7, 2025

Merged. Thanks, @jkolash!
