Introduce MetricsMaxInferredColumnDefaultsStrategy #13039
These changes address issue apache#11253, allowing a new default strategy to be set that considers the total number of field metrics rather than just the number of top-level columns. The new property is `write.metadata.metrics.max-inferred-column-defaults.strategy`, and valid values would be `original`, `depth`, and `breadth`.

It currently preserves the original default behavior, since changing that may be a more disruptive change that could lead to unexpected performance regressions. A breadth-first strategy would likely be the most compatible with the original strategy, so it would be safer to default to than the depth strategy. The original strategy could then be deprecated and removed in the future.

This could also easily support a previously discussed feature of reversing the order of field IDs when considering defaults, though that won't be included in this PR. I'm inclined to remove the depth strategy unless there is a strong desire to keep it.
I brought this up at the community sync to gauge what other people thought about changing the behavior here without introducing a "strategy" option, and the response was positive. Our rationale was that tables that have deep nesting and a lot of top-level columns are uncommon and would likely benefit from removing a lot of unnecessary metrics overall. So rather than introducing a strategy, I think we should move forward with the "breadth" approach: keep stats for top-level primitive columns, then the next layer, and so on until the 100 primitive field limit is exhausted.
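As a rough illustration of the breadth policy described above, here is a minimal, self-contained sketch using a toy field tree (these `Field`, `limitFieldIds` names are stand-ins, not Iceberg's actual `Schema`/`TypeUtil` API): it collects primitive field IDs level by level, so top-level primitives always win over nested ones.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class BreadthFirstLimit {
  // Toy stand-in for a schema field: a primitive leaf, or a struct with children.
  static class Field {
    final int id;
    final List<Field> children; // empty => primitive

    Field(int id, Field... children) {
      this.id = id;
      this.children = Arrays.asList(children);
    }

    boolean isPrimitive() {
      return children.isEmpty();
    }
  }

  // Collect up to `limit` primitive field IDs, breadth-first:
  // all top-level primitives first, then the next level, and so on.
  static List<Integer> limitFieldIds(List<Field> topLevel, int limit) {
    List<Integer> ids = new ArrayList<>();
    Deque<List<Field>> levels = new ArrayDeque<>();
    levels.add(topLevel);
    while (!levels.isEmpty() && ids.size() < limit) {
      List<Field> level = levels.poll();
      List<Field> next = new ArrayList<>();
      for (Field f : level) {
        if (f.isPrimitive()) {
          if (ids.size() < limit) {
            ids.add(f.id);
          }
        } else {
          next.addAll(f.children); // defer nested fields to the next level
        }
      }
      if (!next.isEmpty()) {
        levels.add(next);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    // schema: 1 (primitive), 2 { 3, 4 }, 5 (primitive); limit of 3
    List<Field> schema =
        Arrays.asList(new Field(1), new Field(2, new Field(3), new Field(4)), new Field(5));
    System.out.println(limitFieldIds(schema, 3)); // prints [1, 5, 3]
  }
}
```

Note the top-level primitives (1 and 5) are kept before any nested field, and field 4 is dropped once the limit is exhausted; the actual PR implements this via a schema visitor rather than an explicit queue.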
Thanks, I will simplify this PR to just include the breadth strategy, without a new property.
So while doing a self-review I came to question why a user-provided default should not be bounded. This was the behavior before, but I'm not quite sure it is right. Users can set the
Create TestMetricsConfig for these tests, since they are independent of the Iceberg version, rather than placing them in the TestMetrics class.
Stress testing this. For a schema with 1 million structs I got ~133ms per iteration when bounding to 100 fields.

```java
@Test
public void perf() {
  AtomicInteger fieldId = new AtomicInteger(0);
  Supplier<Types.NestedField> newStruct =
      () ->
          required(
              fieldId.getAndIncrement(),
              String.valueOf(fieldId.get()),
              Types.StructType.of(
                  required(
                      fieldId.getAndIncrement(),
                      String.valueOf(fieldId.get()),
                      Types.IntegerType.get())));
  int items = 1000000;
  Schema schema =
      new Schema(
          IntStream.range(0, items).mapToObj(i -> newStruct.get()).collect(Collectors.toList()));

  StopWatch sw = StopWatch.createStarted();
  int iterations = 100;
  for (int i = 0; i < iterations; i++) {
    MetricsConfig.boundedBreadthFirstSubSchema(schema, 100);
  }
  sw.stop();

  System.out.println("ms per iteration: " + sw.getTime(TimeUnit.MILLISECONDS) * 1.0 / iterations);
}
```
@rdblue Let me know what else is needed for this PR.
Should I re-open it with a new description?
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
```java
@Override
public Iterable<Integer> list(Types.ListType list, Supplier<Iterable<Integer>> future) {
  List<Integer> returnValue = Lists.newArrayListWithCapacity(1);
  if (list.elementType().isPrimitiveType()) {
```
I think that this can be simpler:

```java
if (list.elementType().isPrimitiveType() || list.elementType().isVariantType()) {
  return ImmutableList.of(list.elementId());
}

return future.get();
```

If the element is a primitive type (or a variant), then we don't need to get the value of the future because we already know that it is going to be `Collections.emptyList()`. I prefer that implementation because it cuts down on `Iterable` instances that concatenate things that we know are empty.
```java
    Types.MapType map,
    Supplier<Iterable<Integer>> keyFuture,
    Supplier<Iterable<Integer>> valueFuture) {
  List<Integer> returnValue = Lists.newArrayListWithCapacity(2);
```
I'd make the same simplification as I suggested above here:

```java
Iterable<Integer> keyResult;
if (map.keyType().isPrimitiveType() || map.keyType().isVariantType()) {
  keyResult = ImmutableList.of(map.keyId());
} else {
  keyResult = keyFuture.get();
}

Iterable<Integer> valueResult;
if (map.valueType().isPrimitiveType() || map.valueType().isVariantType()) {
  valueResult = ImmutableList.of(map.valueId());
} else {
  valueResult = valueFuture.get();
}

return Iterables.concat(keyResult, valueResult);
```
```java
@Override
public Iterable<Integer> struct(
    Types.StructType struct, Iterable<Iterable<Integer>> fieldResults) {
```
Javadoc for the visit method says that the field results are traversed when the `Iterable` passed here is consumed:

> Structs are passed an {@link Iterable} that traverses child fields during iteration.

The intent of this structure is to consume these `Iterable` instances while working with the object, but this creates a new `Iterable` using concat that is lazy. The result of returning the field results unconsumed is an unpredictable (or certainly hard-to-understand) order when visiting fields.

I think that this does the right thing, because it will create a structure of iterables that contain the primitive field IDs first and it doesn't really matter when the results are consumed, but I think this structure makes the code harder to understand than it needs to be.

I think I'd prefer a solution that keeps a single top-level list and adds only the requested number of field IDs to it. Most of the code would be the same, with each structure checking whether it contains primitives and adding them in order, before traversing the children if more IDs need to be added.
Here's what I was thinking:

```java
static Set<Integer> limitFieldIds(Schema schema, int limit) {
  return TypeUtil.visit(
      schema,
      new TypeUtil.CustomOrderSchemaVisitor<>() {
        private final Set<Integer> idSet = Sets.newHashSet();

        private boolean shouldContinue() {
          return idSet.size() < limit;
        }

        @Override
        public Set<Integer> schema(Schema schema, Supplier<Set<Integer>> structResult) {
          return idSet;
        }

        @Override
        public Set<Integer> struct(Types.StructType struct, Iterable<Set<Integer>> fieldResults) {
          Iterator<Types.NestedField> fields = struct.fields().iterator();
          while (shouldContinue() && fields.hasNext()) {
            Types.NestedField field = fields.next();
            if (field.type().isPrimitiveType() || field.type().isVariantType()) {
              idSet.add(field.fieldId());
            }
          }

          // visit children to add more ids
          Iterator<Set<Integer>> iter = fieldResults.iterator();
          while (iter.hasNext() && shouldContinue()) {
            iter.next();
          }

          return null;
        }

        @Override
        public Set<Integer> field(Types.NestedField field, Supplier<Set<Integer>> fieldResult) {
          return fieldResult.get();
        }

        @Override
        public Set<Integer> list(Types.ListType list, Supplier<Set<Integer>> elementResult) {
          if (shouldContinue()
              && (list.elementType().isPrimitiveType() || list.elementType().isVariantType())) {
            idSet.add(list.elementId());
          }

          if (shouldContinue()) {
            elementResult.get();
          }

          return null;
        }

        @Override
        public Set<Integer> map(
            Types.MapType map,
            Supplier<Set<Integer>> keyResult,
            Supplier<Set<Integer>> valueResult) {
          if (shouldContinue()
              && (map.keyType().isPrimitiveType() || map.keyType().isVariantType())) {
            idSet.add(map.keyId());
          }

          if (shouldContinue()
              && (map.valueType().isPrimitiveType() || map.valueType().isVariantType())) {
            idSet.add(map.valueId());
          }

          if (shouldContinue()) {
            keyResult.get();
            valueResult.get();
          }

          return null;
        }
      });
}
```
I will try this out and also add a test with variant on the v3 parameterization in TestWriterMetrics.java.
Ok, I did something along those lines. I avoided

```java
if (shouldContinue()
    && (map.keyType().isPrimitiveType() || map.keyType().isVariantType())) {
  idSet.add(map.keyId());
}
```

since I felt it was a little unwieldy. I think the updated version is more readable.
```java
      orderedFieldIds.add(field.fieldId());
    }
  }
  return Iterables.concat(orderedFieldIds, Iterables.concat(fieldResults));
```
Style: Iceberg's convention is to leave an empty newline between control flow blocks and the following statement.
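For illustration only (a hypothetical `clamp` method, not code from this PR), the convention mentioned above looks like this: a blank line separates the `if` block from the statement that follows it.

```java
public class StyleExample {
  static int clamp(int value, int max) {
    if (value > max) {
      return max;
    }

    // blank line above separates the control-flow block from this statement
    return value;
  }

  public static void main(String[] args) {
    System.out.println(clamp(5, 3)); // prints 3
  }
}
```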
```java
  });
}
```
```java
static Schema boundedBreadthFirstSubSchema(Schema schema, int maxInferredDefaultColumns) {
```
It's good to use breadth-first priority, but I don't think that the policy needs to be exposed to the caller. How about renaming this to something like `limitFields`?

We need to check this before adding to the `idSet` and before visiting further children.
```java
    defaultMode = DEFAULT_MODE;
  } else {
    Schema subSchema = limitSchema(schema, maxInferredDefaultColumns);
    for (Integer id : TypeUtil.getProjectedIds(subSchema)) {
```
Why is this limiting to an ID set, then creating a schema, and finally calling `getProjectedIds` to recover the ID set? Couldn't this just call the ID-set method directly?
I think that works. There may be a behavior difference between the IDs returned from `limitFieldIds` and `limitSchema`: `limitSchema` may include the intermediate field IDs for nested structures. I'm not sure it matters. I will investigate and make the change.
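To make the suspected difference concrete, here is a minimal sketch on a toy field tree (the `Field`, `projectedIds`, and `leafIds` names are hypothetical stand-ins, not Iceberg's `TypeUtil` API): a projection-style traversal reports every field ID, including the containing struct's own ID, while a leaf-only collector reports only primitive field IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProjectedIdsDemo {
  // Toy field node: a primitive leaf, or a struct with children.
  static class Field {
    final int id;
    final List<Field> children; // empty => primitive

    Field(int id, Field... children) {
      this.id = id;
      this.children = Arrays.asList(children);
    }
  }

  // Projection-style: every field ID in the tree, including intermediate struct IDs.
  static List<Integer> projectedIds(List<Field> fields) {
    List<Integer> ids = new ArrayList<>();
    for (Field f : fields) {
      ids.add(f.id); // struct containers are included too
      ids.addAll(projectedIds(f.children));
    }
    return ids;
  }

  // Leaf-only style: only primitive field IDs.
  static List<Integer> leafIds(List<Field> fields) {
    List<Integer> ids = new ArrayList<>();
    for (Field f : fields) {
      if (f.children.isEmpty()) {
        ids.add(f.id);
      } else {
        ids.addAll(leafIds(f.children));
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    // schema: struct 1 { 2, 3 }
    List<Field> schema = Arrays.asList(new Field(1, new Field(2), new Field(3)));
    System.out.println(projectedIds(schema)); // prints [1, 2, 3]
    System.out.println(leafIds(schema)); // prints [2, 3]
  }
}
```

Whether the extra struct ID matters depends on how the ID set is consumed downstream, which is exactly what the comment above proposes to investigate.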
```java
@Override
@SuppressWarnings("ReturnValueIgnored")
public Set<Integer> field(Types.NestedField field, Supplier<Set<Integer>> fieldResult) {
  if (shouldContinue()) {
```
I don't think this needs to call `shouldContinue` everywhere, as long as it is called before adding an ID to the set. It's not a problem to, but it's okay to traverse a field and return quickly without adding an ID rather than not traverse a field. And for fields specifically, `shouldContinue` is called before the field is visited.
rdblue left a comment:
There are a couple of minor suggestions, but I think that this is correct and can be committed. I'd also like to get this into 1.10 so I added it to the milestone. We can merge this any time if we need to get a release candidate out, but otherwise I'll watch for the updates. Thanks, @jkolash!
Merged. Thanks, @jkolash!