Arrow: Add nanosec precision timestamp #13562
| if (columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation() != null | ||
| && !(columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation() | ||
| instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation)) { | ||
| allocateVectorBasedOnLogicalTypeAnnotation(columnDescriptor.getPrimitiveType(), arrowField); |
Why is this change?
You mean why use logical type annotations instead of original types? In parquet-format, logical types were originally represented as ConvertedType, which parquet-mr calls OriginalType. Because it was an enum, it was not flexible enough: there was no way to properly represent logical type parameters such as decimal precision and scale or the time unit of timestamps. It was therefore replaced with a union of structs called LogicalType. Newer types, like variant and geometry, have no proper OriginalType representation, only logical types. The nanosecond precision timestamp type falls into this category: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype.
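To illustrate the point about parameters, here is a minimal, self-contained sketch. It does not use the parquet-mr classes: LegacyConvertedType, TimeUnit and TimestampLogicalType are hypothetical stand-ins, and the only fact it models is that a nanosecond timestamp has no legacy enum equivalent.

```java
import java.util.Optional;

public class LogicalTypeSketch {
  // Hypothetical stand-in for the legacy enum: one constant per (type, unit) pair, no parameters.
  enum LegacyConvertedType {
    TIMESTAMP_MILLIS,
    TIMESTAMP_MICROS
  }

  enum TimeUnit {
    MILLIS,
    MICROS,
    NANOS
  }

  // Hypothetical stand-in for a parameterized logical type: the unit is data, not a constant name.
  static final class TimestampLogicalType {
    private final TimeUnit unit;

    TimestampLogicalType(TimeUnit unit) {
      this.unit = unit;
    }

    // Maps back to the legacy enum when possible; NANOS has no legacy representation.
    Optional<LegacyConvertedType> toLegacy() {
      switch (unit) {
        case MILLIS:
          return Optional.of(LegacyConvertedType.TIMESTAMP_MILLIS);
        case MICROS:
          return Optional.of(LegacyConvertedType.TIMESTAMP_MICROS);
        default:
          return Optional.empty();
      }
    }
  }

  public static void main(String[] args) {
    for (TimeUnit unit : TimeUnit.values()) {
      System.out.println(unit + " -> " + new TimestampLogicalType(unit).toLegacy());
    }
  }
}
```

Code that only inspects the legacy enum would therefore never see the nanosecond case, which is why the allocation logic has to look at the logical type annotation.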
| case TIMESTAMP_MILLIS: | ||
| case TIMESTAMP_NANOS: | ||
| vectorizedColumnIterator | ||
| .timestampMillisBatchReader() |
Millis batch reader should work here, as the annotation can be added to the same physical types. Nevertheless, 'millis' in the method naming is misleading here.
At least leave a comment stating this
Makes sense! I was actually wondering whether I should add new methods and classes for nanos. What do you think?
I checked, and I'm starting to doubt that this works as expected.
timestampMillisBatchReader ends up returning VectorizedParquetDefinitionLevelReader.TimestampMillisReader, where:
@Override
protected void nextVal(
    FieldVector vector,
    int idx,
    VectorizedValuesReader valuesReader,
    int typeWidth,
    byte[] byteArray) {
  vector.getDataBuffer().setLong((long) idx * typeWidth, valuesReader.readLong() * 1000);
}
So it is very much millis.
I checked the code and found that readType is never assigned TIMESTAMP_NANOS.
MICROS and NANOS are read as ReadType.LONG.
This means that this line is not needed.
Yes, I actually came to the same conclusion. Nevertheless, I would like to understand why we need timestampMillisBatchReader for millis but not for micros. Is it because we need to transform millis to micros, since the Iceberg timestamp type represents microsecond precision? If so, then I think this line is indeed not needed: we're reading timestamp nanos here, so no transformation is needed.
That is my understanding too
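The conclusion of this thread can be sketched as follows. This is only an illustration under the assumptions stated above (millis are widened to micros, while micros and nanos are read as plain longs); Unit and toVectorValue are hypothetical names, not part of the Iceberg code base.

```java
public class TimestampUnitSketch {
  enum Unit {
    MILLIS,
    MICROS,
    NANOS
  }

  // Value that would be written into the 64-bit vector slot for a raw Parquet INT64 value.
  static long toVectorValue(long raw, Unit unit) {
    switch (unit) {
      case MILLIS:
        // Mirrors the "* 1000" in the quoted TimestampMillisReader: millis become micros.
        return raw * 1000L;
      case MICROS:
      case NANOS:
        // Already at the precision of the target type, so the value is copied unchanged.
        return raw;
      default:
        throw new IllegalArgumentException("Unexpected unit: " + unit);
    }
  }

  public static void main(String[] args) {
    System.out.println(toVectorValue(5L, Unit.MILLIS));
    System.out.println(toVectorValue(5L, Unit.NANOS));
  }
}
```

Routing nanosecond values through the millis branch would scale them by 1000, which is why the TIMESTAMP_NANOS case must not reuse the millis batch reader.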
| allocateVectorBasedOnOriginalType(columnDescriptor.getPrimitiveType(), arrowField); | ||
| if (columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation() != null | ||
| && !(columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation() | ||
| instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation)) { |
I added this extra instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation check because the logic that relied on original types didn't handle UUID. UUID doesn't have an original type representation; only the logical type annotation exists.
Isn't it more natural to add this fallback to the allocateVectorBasedOnLogicalTypeAnnotation method?
Makes sense, moving this fallback to the logical type annotation visitor indeed looks better than using instanceof.
| @Override | ||
| public Optional<Object> visit( | ||
| LogicalTypeAnnotation.DecimalLogicalTypeAnnotation decimalLogicalType) { | ||
| vec = arrowField.createVector(rootAlloc); |
When we assign values to instance variables, we use the this. prefix. See: https://iceberg.apache.org/contribute/#accessing-instance-variables
Simply using this won't work here, as we're inside an anonymous class instance. VectorizedArrowReader.this.vec works though.
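A minimal sketch of why the qualified form is needed. OuterThisSketch and its Visitor interface are hypothetical stand-ins for VectorizedArrowReader and the Parquet visitor; only the language rule being shown is taken from the comment above.

```java
public class OuterThisSketch {
  private String vec;

  // Hypothetical single-method visitor, standing in for the logical type annotation visitor.
  interface Visitor {
    void visitDecimal();
  }

  Visitor newVisitor() {
    return new Visitor() {
      @Override
      public void visitDecimal() {
        // Inside the anonymous class, "this" refers to the Visitor instance, which has no
        // "vec" field, so "this.vec = ..." would not compile. Qualifying with the enclosing
        // class name reaches the outer instance's field.
        OuterThisSketch.this.vec = "decimal-vector";
      }
    };
  }

  String vec() {
    return vec;
  }

  public static void main(String[] args) {
    OuterThisSketch reader = new OuterThisSketch();
    reader.newVisitor().visitDecimal();
    System.out.println(reader.vec());
  }
}
```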
| * <li>Iceberg: {@link Types.LongType}, Arrow: {@link MinorType#BIGINT} | ||
| * <li>Iceberg: {@link Types.FloatType}, Arrow: {@link MinorType#FLOAT4} | ||
| * <li>Iceberg: {@link Types.DoubleType}, Arrow: {@link MinorType#FLOAT8} | ||
| * <li>Iceberg: {@link Types.DecimalType}, Arrow: {@link MinorType#DECIMAL} |
Actually decimal support appears to work: #2486 (comment)
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
| } | ||
| if (intLogicalType.getBitWidth() == 64) { |
Could be in the else
I'd rather not put this in the else. I was actually thinking about a switch on the bit width instead. In the unlikely case that an additional bit width is introduced, the else branch would silently do something possibly wrong. Maybe it's better to be explicit here and throw an exception when the bit width is not an expected value.
Go for the switch
BTW, we return Optional.empty if unknown type is used, and I would guess that this will throw an exception down the line. So I would not change that, unless we know exactly why, and what will be the effect.
So basically just use a switch to make the different cases more explicit/easy to read
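The suggestion could look roughly like the sketch below. It is not the code from the PR: vectorKindFor is a hypothetical helper, and the grouping of the 8, 16 and 32 bit cases is an assumption made for illustration. What it keeps from the discussion is the explicit switch and the Optional.empty() result for an unknown width.

```java
import java.util.Optional;

public class BitWidthSwitchSketch {
  // Hypothetical helper: names the kind of vector chosen for an integer bit width.
  static Optional<String> vectorKindFor(int bitWidth) {
    switch (bitWidth) {
      case 8:
      case 16:
      case 32:
        // Assumption for this sketch: the narrower widths share one 32-bit vector kind.
        return Optional.of("INT");
      case 64:
        return Optional.of("BIGINT");
      default:
        // Unknown width: keep returning empty, as agreed in the thread, instead of guessing.
        return Optional.empty();
    }
  }

  public static void main(String[] args) {
    System.out.println(vectorKindFor(32));
    System.out.println(vectorKindFor(64));
    System.out.println(vectorKindFor(128));
  }
}
```

Compared with an if/else chain, each accepted width is listed once and an unexpected width cannot fall into a branch meant for a known one.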
| * Types.FixedType} and {@link Types.DecimalType} See | ||
| * https://github.com/apache/iceberg/issues/2485 and | ||
| * <li>Data types: {@link Types.ListType}, {@link Types.MapType}, {@link Types.StructType} and | ||
| * {@link Types.FixedType} See https://github.com/apache/iceberg/issues/2485 and |
Is Fixed working?
It appears it does, though it isn't in the supported type list (called SUPPORTED_TYPES). It isn't covered by unit tests either. Should I add one for it in this PR?
Another PR then
| private void allocateVectorBasedOnLogicalTypeAnnotation( | ||
| PrimitiveType primitive, Field arrowField) { | ||
| primitive | ||
| .getLogicalTypeAnnotation() | ||
| .accept(new VectorizedArrowReaderLogicalTypeAnnotationVisitor(arrowField, primitive)); | ||
| } |
Is it worth creating a separate method for this?
Are you questioning the existence of allocateVectorBasedOnLogicalTypeAnnotation? It might not be worth it; at most it improves the readability a little bit.
Please remove the method then
| VectorizedArrowReader.this.allocateVectorBasedOnTypeName( | ||
| VectorizedArrowReader.this.columnDescriptor.getPrimitiveType(), arrowField); |
The requirement is to use the this. prefix when assigning instance variables. See: https://iceberg.apache.org/contribute/#accessing-instance-variables
In this case it is not needed.
| ((FixedSizeBinaryVector) VectorizedArrowReader.this.vec) | ||
| .allocateNew(VectorizedArrowReader.this.batchSize); |
not an assignment... See other cases below
Ah yes, sorry, I should've been more careful about what I replaced.
| new FieldType(true, new ArrowType.Decimal(9, 2, 128), null), | ||
| null), | ||
| new Field( | ||
| "timestamp_ns", |
can you please either call this timestamp_ns or timestamp_nano in order to align this with the fields that use a timezone
| return Optional.empty(); | ||
| } | ||
|
|
||
| private Optional<LogicalTypeVisitorResult> visitEnumJsonBsonString() { |
maybe just give this a better name, since this doesn't do an actual "visit" but actually inits a vector
| LogicalTypeAnnotation.IntLogicalTypeAnnotation intLogicalType) { | ||
| FieldVector vector = arrowField.createVector(rootAlloc); | ||
|
|
||
| if (intLogicalType.getBitWidth() == 8 |
maybe extract intLogicalType.getBitWidth() into a var so that the code gets shorter
| return reader; | ||
| } | ||
|
|
||
| private static final class LogicalTypeVisitorResult { |
This class is intended to be used only for a very specific, internal purpose, which is why I marked it final. It isn't intended for extension. A 'record' type would be more accurate here, but that's not available in Java 11. If we would like to inherit from this sometime in the future, we can remove the final modifier then. Could you please explain why you think it is better not to mark this final?
| schema, | ||
| spec, | ||
| ImmutableMap.of(TableProperties.FORMAT_VERSION, "3"), | ||
| tableLocation); // .create(schema, spec, tableLocation); |
the comment can be removed now
Ooops right, that's a leftover comment there. Thanks for spotting it. Is it okay to change the format version in this test?
| return reader; | ||
| } | ||
|
|
||
| private static final class LogicalTypeVisitorResult { |
maybe let's call this Result to make the code shorter
Merged to main.
This PR implements missing parts of Arrow conversions for Iceberg nanosec precision timestamp type.