Contributor

@nandorKollar nandorKollar commented Jul 15, 2025

This PR implements the missing parts of the Arrow conversions for Iceberg's nanosecond-precision timestamp type.

@github-actions github-actions bot added the arrow label Jul 15, 2025
@nandorKollar nandorKollar force-pushed the arrow_support_timestamp_nano branch 12 times, most recently from 1151edf to c88f981 July 16, 2025 14:53
@nandorKollar nandorKollar marked this pull request as ready for review July 16, 2025 14:54
@nandorKollar nandorKollar force-pushed the arrow_support_timestamp_nano branch 3 times, most recently from 47baa5c to c45aad7 July 16, 2025 16:33
@nandorKollar nandorKollar force-pushed the arrow_support_timestamp_nano branch from c45aad7 to 576f1c1 July 17, 2025 12:20
Comment on lines 221 to 224
if (columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation() != null
&& !(columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation()
instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation)) {
allocateVectorBasedOnLogicalTypeAnnotation(columnDescriptor.getPrimitiveType(), arrowField);
Contributor

Why is this change needed?

Contributor Author

@nandorKollar nandorKollar Jul 18, 2025

You mean why use logical type annotations instead of original types? In parquet-format, logical types were originally represented as ConvertedType, which parquet-mr calls OriginalType. Because it was an enum, it was not flexible enough: there was no way to properly represent logical type parameters, such as decimal precision and scale or the time unit of a timestamp. It was therefore replaced with a union of structs called LogicalType. Newer types, such as variant and geometry, have no proper OriginalType representation, only logical types. The nanosecond-precision timestamp type falls into this category: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype.
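
As a rough illustration of the flexibility argument, the sketch below contrasts an enum-style type with a parameterized one. All names here (LegacyTimestampKind, TimestampLogicalType, LogicalTypeDemo) are hypothetical stand-ins invented for this example; they are not the parquet-mr classes.

```java
import java.util.Objects;

class LogicalTypeDemo {
  public static void main(String[] args) {
    // The parameterized type can describe nanosecond precision without a new constant.
    TimestampLogicalType nanos =
        new TimestampLogicalType(TimestampLogicalType.Unit.NANOS, true);
    System.out.println(nanos);
  }
}

// Hypothetical enum-style type: one constant per variant, no room for parameters.
enum LegacyTimestampKind {
  TIMESTAMP_MILLIS,
  TIMESTAMP_MICROS
  // Every new unit or flag combination would need yet another constant.
}

// Hypothetical parameterized type: unit and UTC adjustment are ordinary fields.
final class TimestampLogicalType {
  enum Unit { MILLIS, MICROS, NANOS }

  private final Unit unit;
  private final boolean adjustedToUtc;

  TimestampLogicalType(Unit unit, boolean adjustedToUtc) {
    this.unit = Objects.requireNonNull(unit, "unit");
    this.adjustedToUtc = adjustedToUtc;
  }

  Unit unit() {
    return unit;
  }

  boolean adjustedToUtc() {
    return adjustedToUtc;
  }

  @Override
  public String toString() {
    return "TIMESTAMP(" + unit + ", " + adjustedToUtc + ")";
  }
}
```

The point of the sketch is only that parameters live in fields rather than in the constant's name, which is why types introduced later can exist without an enum counterpart.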

case TIMESTAMP_MILLIS:
case TIMESTAMP_NANOS:
vectorizedColumnIterator
.timestampMillisBatchReader()
Contributor Author

The millis batch reader should work here, since the annotation can be applied to the same physical type. Nevertheless, 'millis' in the method name is misleading here.

Contributor

At least leave a comment stating this

Contributor Author

Makes sense! I was actually wondering whether I should add new methods and classes for nanos. What do you think?

Contributor

I checked, and I'm starting to doubt that this works as expected.
The timestampMillisBatchReader ends up returning VectorizedParquetDefinitionLevelReader.TimestampMillisReader, where:

    @Override
    protected void nextVal(
        FieldVector vector,
        int idx,
        VectorizedValuesReader valuesReader,
        int typeWidth,
        byte[] byteArray) {
      vector.getDataBuffer().setLong((long) idx * typeWidth, valuesReader.readLong() * 1000);
    } 

So it is very much millis
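
To make the concern concrete, here is a minimal sketch of what the same scaling would do to a value that is already stored in nanoseconds. The class and method names are hypothetical and only mirror the "* 1000" arithmetic from the snippet above; this is not Iceberg code.

```java
class TimestampUnitScalingDemo {
  // Same arithmetic as the "* 1000" in the reader above: millis in, micros out.
  static long millisToMicros(long millis) {
    return millis * 1000L;
  }

  public static void main(String[] args) {
    long storedMillis = 1_500L; // 1.5 seconds after the epoch, in milliseconds
    long storedNanos = 1_500L;  // 1.5 microseconds after the epoch, in nanoseconds

    // Intended use: the millisecond value is scaled to microseconds.
    System.out.println(millisToMicros(storedMillis));

    // Misuse: a nanosecond value is inflated 1000x instead of being passed through,
    // so the scaled value no longer equals what was stored.
    System.out.println(millisToMicros(storedNanos) == storedNanos);
  }
}
```

If nanosecond values need no unit conversion, a reader that copies the stored long unchanged would be the safer fit, which matches the conclusion reached later in this thread.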

Contributor

I checked the code and found that readType is never set to TIMESTAMP_NANOS.
MICROS and NANOS are read as ReadType.LONG.
This means that this line is not needed.

Contributor Author

Yes, I actually came to the same conclusion. Nevertheless, I would like to understand why we need timestampMillisBatchReader for millis but not for micros. Is it because we need to convert millis to micros, since the Iceberg timestamp type has microsecond precision? If so, then I think this line is indeed not needed: we're reading timestamp nanos here, so no conversion is required.

Contributor

That is my understanding too

allocateVectorBasedOnOriginalType(columnDescriptor.getPrimitiveType(), arrowField);
if (columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation() != null
&& !(columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation()
instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation)) {
Contributor Author

I added this extra instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation check because the logic that relied on original types didn't handle UUID. UUID has no original type representation; only the logical type annotation exists.

Contributor

Isn't it more natural to add this fallback to the allocateVectorBasedOnLogicalTypeAnnotation method?

Contributor Author

Makes sense. Moving this fallback into the logical type annotation visitor does indeed look better than using instanceof.
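
The idea of letting the visitor own the fallback, rather than filtering with instanceof at the call site, can be sketched roughly as follows. The types below are a hypothetical miniature written for this example; they are not Parquet's visitor interface or the code that was merged.

```java
import java.util.Optional;

class VisitorFallbackDemo {
  // Hypothetical visitor: every visit method defaults to "not handled".
  interface Visitor<T> {
    default Optional<T> visitUuid(UuidType type) {
      return Optional.empty();
    }

    default Optional<T> visitDecimal(DecimalType type) {
      return Optional.empty();
    }
  }

  interface LogicalType {
    <T> Optional<T> accept(Visitor<T> visitor);
  }

  static final class UuidType implements LogicalType {
    @Override
    public <T> Optional<T> accept(Visitor<T> visitor) {
      return visitor.visitUuid(this);
    }
  }

  static final class DecimalType implements LogicalType {
    @Override
    public <T> Optional<T> accept(Visitor<T> visitor) {
      return visitor.visitDecimal(this);
    }
  }

  // The visitor decides per type; UUID is routed to the physical-type path here,
  // so the caller needs no instanceof check.
  static final class AllocationVisitor implements Visitor<String> {
    @Override
    public Optional<String> visitUuid(UuidType type) {
      return Optional.of(allocateByPhysicalType());
    }

    @Override
    public Optional<String> visitDecimal(DecimalType type) {
      return Optional.of("decimal vector");
    }
  }

  static String allocateByPhysicalType() {
    return "fixed-size binary vector";
  }

  static String allocate(LogicalType type) {
    // Types the visitor does not handle also fall back to the physical type.
    return type.accept(new AllocationVisitor())
        .orElseGet(VisitorFallbackDemo::allocateByPhysicalType);
  }

  public static void main(String[] args) {
    System.out.println(allocate(new UuidType()));
    System.out.println(allocate(new DecimalType()));
  }
}
```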

@Override
public Optional<Object> visit(
LogicalTypeAnnotation.DecimalLogicalTypeAnnotation decimalLogicalType) {
vec = arrowField.createVector(rootAlloc);
Contributor

When we assign a value to an instance variable, we use the this. prefix. See: https://iceberg.apache.org/contribute/#accessing-instance-variables

Contributor Author

Simply using this won't work here, as we're inside an anonymous class instance. VectorizedArrowReader.this.vec works though.
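
For readers unfamiliar with the qualified form, here is a small self-contained sketch of the language rule being described. The class name QualifiedThisDemo and its String field are hypothetical; only the Outer.this.field syntax corresponds to what the comment mentions.

```java
class QualifiedThisDemo {
  private String vec;

  Runnable allocator() {
    return new Runnable() {
      @Override
      public void run() {
        // Inside the anonymous class, plain "this" refers to the Runnable,
        // which has no "vec" field, so "this.vec = ..." would not compile.
        QualifiedThisDemo.this.vec = "allocated";
      }
    };
  }

  String vec() {
    return vec;
  }

  public static void main(String[] args) {
    QualifiedThisDemo demo = new QualifiedThisDemo();
    demo.allocator().run();
    System.out.println(demo.vec());
  }
}
```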

@nandorKollar nandorKollar force-pushed the arrow_support_timestamp_nano branch from 58f936e to f524b61 July 28, 2025 09:59
@nandorKollar nandorKollar force-pushed the arrow_support_timestamp_nano branch from f356881 to 1f1fa34 July 28, 2025 13:58
* <li>Iceberg: {@link Types.LongType}, Arrow: {@link MinorType#BIGINT}
* <li>Iceberg: {@link Types.FloatType}, Arrow: {@link MinorType#FLOAT4}
* <li>Iceberg: {@link Types.DoubleType}, Arrow: {@link MinorType#FLOAT8}
* <li>Iceberg: {@link Types.DecimalType}, Arrow: {@link MinorType#DECIMAL}
Contributor Author

Actually decimal support appears to work: #2486 (comment)

Comment on lines 575 to 576
}
if (intLogicalType.getBitWidth() == 64) {
Contributor

Could be in the else

Contributor Author

I'd rather not put this in the else. I was actually thinking about a switch on the width instead. In the unlikely case that an additional bit width is introduced, an else branch would silently do possibly wrong things. Maybe it's better to be more explicit here and throw an exception when the bit width is not an expected value.

Contributor

Go for the switch

Contributor

BTW, we return Optional.empty if an unknown type is used, and I would guess that this throws an exception down the line. So I would not change that unless we know exactly why, and what the effect would be.

So basically, just use a switch to make the different cases more explicit and easier to read.
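
A switch along the lines suggested might look like the sketch below. The helper name, the grouping of 8, 16 and 32 bits, and the string results are assumptions made for illustration; the actual visitor allocates Arrow vectors and is not reproduced here. Classic switch syntax is used so the sketch compiles on Java 11.

```java
import java.util.Optional;

class BitWidthSwitchDemo {
  // Hypothetical helper: maps an integer bit width to the kind of vector to allocate.
  static Optional<String> vectorKindFor(int bitWidth) {
    switch (bitWidth) {
      case 8:
      case 16:
      case 32:
        return Optional.of("int vector");
      case 64:
        return Optional.of("bigint vector");
      default:
        // Unknown width: keep returning empty, as the existing code does,
        // instead of silently treating it like one of the known widths.
        return Optional.empty();
    }
  }

  public static void main(String[] args) {
    System.out.println(vectorKindFor(32));
    System.out.println(vectorKindFor(64));
    System.out.println(vectorKindFor(128).isPresent());
  }
}
```

Reading the bit width once into the switch expression also avoids repeating the getter call in every condition.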

* Types.FixedType} and {@link Types.DecimalType} See
* https://github.com/apache/iceberg/issues/2485 and
* <li>Data types: {@link Types.ListType}, {@link Types.MapType}, {@link Types.StructType} and
* {@link Types.FixedType} See https://github.com/apache/iceberg/issues/2485 and
Contributor

Is Fixed working?

Contributor Author

It appears to, though it isn't in the list of supported types (SUPPORTED_TYPES). It isn't covered by a unit test either; should I add one in this PR?

Contributor

Another PR then

Comment on lines 264 to 269
private void allocateVectorBasedOnLogicalTypeAnnotation(
PrimitiveType primitive, Field arrowField) {
primitive
.getLogicalTypeAnnotation()
.accept(new VectorizedArrowReaderLogicalTypeAnnotationVisitor(arrowField, primitive));
}
Contributor

Is it worth creating a separate method for this?

Contributor Author

Are you questioning the existence of allocateVectorBasedOnLogicalTypeAnnotation? It might not be worth it; at most it improves readability a little bit.

Contributor

Please remove the method then

Comment on lines 460 to 461
VectorizedArrowReader.this.allocateVectorBasedOnTypeName(
VectorizedArrowReader.this.columnDescriptor.getPrimitiveType(), arrowField);
Contributor

The requirement is to use this. when assigning to instance variables. See: https://iceberg.apache.org/contribute/#accessing-instance-variables

In this case it is not needed.

Comment on lines 473 to 474
((FixedSizeBinaryVector) VectorizedArrowReader.this.vec)
.allocateNew(VectorizedArrowReader.this.batchSize);
Contributor

not an assignment... See other cases below

Contributor Author

Ah yes, sorry, I should've been more careful about what I replaced.

new FieldType(true, new ArrowType.Decimal(9, 2, 128), null),
null),
new Field(
"timestamp_ns",
Contributor

Can you please call this either timestamp_ns or timestamp_nano, to align it with the fields that use a timezone?

return Optional.empty();
}

private Optional<LogicalTypeVisitorResult> visitEnumJsonBsonString() {
Contributor

Maybe just give this a better name, since it doesn't do an actual "visit" but rather initializes a vector.

LogicalTypeAnnotation.IntLogicalTypeAnnotation intLogicalType) {
FieldVector vector = arrowField.createVector(rootAlloc);

if (intLogicalType.getBitWidth() == 8
Contributor

maybe extract intLogicalType.getBitWidth() into a var so that the code gets shorter

return reader;
}

private static final class LogicalTypeVisitorResult {
Contributor Author

This class is intended only for a very specific, internal purpose, which is why I marked it final. It isn't meant to be extended; a record would be more accurate here, but records are not available in Java 11. If we want to inherit from it at some point in the future, we can remove the final modifier then. Could you please explain why you think it is better not to mark it final?
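
For context, a Java 11 substitute for a record is typically a final class with private final fields and accessors. The sketch below shows the general shape only; the class name AllocationResult and its two fields are hypothetical and are not the fields of LogicalTypeVisitorResult.

```java
import java.util.Objects;

class RecordLikeDemo {
  public static void main(String[] args) {
    AllocationResult a = new AllocationResult("bigint", 8);
    AllocationResult b = new AllocationResult("bigint", 8);
    // Value semantics: two instances with the same components are equal.
    System.out.println(a.equals(b));
  }
}

// Hypothetical immutable value holder: final class, final fields, no setters.
final class AllocationResult {
  private final String vectorKind;
  private final int typeWidth;

  AllocationResult(String vectorKind, int typeWidth) {
    this.vectorKind = Objects.requireNonNull(vectorKind, "vectorKind");
    this.typeWidth = typeWidth;
  }

  String vectorKind() {
    return vectorKind;
  }

  int typeWidth() {
    return typeWidth;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof AllocationResult)) {
      return false;
    }
    AllocationResult that = (AllocationResult) other;
    return typeWidth == that.typeWidth && vectorKind.equals(that.vectorKind);
  }

  @Override
  public int hashCode() {
    return Objects.hash(vectorKind, typeWidth);
  }
}
```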

schema,
spec,
ImmutableMap.of(TableProperties.FORMAT_VERSION, "3"),
tableLocation); // .create(schema, spec, tableLocation);
Contributor

the comment can be removed now

Contributor Author

Oops, right, that's a leftover comment. Thanks for spotting it. Is it okay to change the format version in this test?

return reader;
}

private static final class LogicalTypeVisitorResult {
Contributor

maybe let's call this Result to make the code shorter

@nandorKollar nandorKollar requested a review from pvary August 21, 2025 08:47
@pvary pvary merged commit efc27f9 into apache:main Aug 21, 2025
39 checks passed
Contributor

pvary commented Aug 21, 2025

Merged to main.
Thanks for the PR @nandorKollar and @nastra for the review!
