Arrow: Add nanosec precision timestamp #13562

nandorKollar · 2025-07-15T19:27:13Z

This PR implements the missing parts of the Arrow conversions for the Iceberg nanosecond precision timestamp type.

pvary · 2025-07-18T12:35:57Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

+      if (columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation() != null
+          && !(columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation()
+              instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation)) {
+        allocateVectorBasedOnLogicalTypeAnnotation(columnDescriptor.getPrimitiveType(), arrowField);


Why is this change needed?

You mean why use logical type annotations instead of original types? In parquet-format, logical types were originally represented as ConvertedTypes, which parquet-mr calls OriginalType. Because it was an enum, it was not flexible enough: there was no way to properly represent logical type parameters such as decimal precision and scale or the time unit for timestamps. It was therefore replaced with a union of structs called LogicalType. New types such as variant and geometry have no OriginalType representation, only logical types. The nanosecond precision timestamp type falls into this category: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype.
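To make the enum limitation concrete, here is a minimal sketch. It is a simplified, hypothetical model and not the parquet-mr API: `UnitSketch`, `ConvertedTypeSketch`, and `TimestampAnnotationSketch` are invented names, and the mapping deliberately ignores how the UTC flag interacts with the legacy types.

```java
import java.util.Optional;

// Hypothetical model for illustration only; none of these names exist in parquet-mr.
enum UnitSketch { MILLIS, MICROS, NANOS }

// Legacy style: one enum constant per supported combination, with no room for parameters.
enum ConvertedTypeSketch { TIMESTAMP_MILLIS, TIMESTAMP_MICROS }

// Newer style: a parameterized annotation that carries the unit and the UTC flag.
final class TimestampAnnotationSketch {
  final UnitSketch unit;
  final boolean adjustedToUtc;

  TimestampAnnotationSketch(UnitSketch unit, boolean adjustedToUtc) {
    this.unit = unit;
    this.adjustedToUtc = adjustedToUtc;
  }

  // Maps back to the legacy enum where a counterpart exists; NANOS has none.
  Optional<ConvertedTypeSketch> toConvertedType() {
    switch (unit) {
      case MILLIS:
        return Optional.of(ConvertedTypeSketch.TIMESTAMP_MILLIS);
      case MICROS:
        return Optional.of(ConvertedTypeSketch.TIMESTAMP_MICROS);
      default:
        return Optional.empty();
    }
  }
}
```

Code that dispatches only on the legacy enum never sees a nanosecond timestamp in this model, which is the reason for dispatching on the annotation instead.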

nandorKollar · 2025-07-16T14:58:33Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

          case TIMESTAMP_MILLIS:
+          case TIMESTAMP_NANOS:
            vectorizedColumnIterator
                .timestampMillisBatchReader()


Millis batch reader should work here, as the annotation can be added to the same physical types. Nevertheless, 'millis' in the method naming is misleading here.

At least leave a comment stating this

Makes sense! I was actually wondering whether I should add new methods and classes for nano; what do you think?

I checked, and I'm starting to doubt that this works as expected.
timestampMillisBatchReader ends up returning VectorizedParquetDefinitionLevelReader.TimestampMillisReader, where:

    @Override
    protected void nextVal(
        FieldVector vector,
        int idx,
        VectorizedValuesReader valuesReader,
        int typeWidth,
        byte[] byteArray) {
      vector.getDataBuffer().setLong((long) idx * typeWidth, valuesReader.readLong() * 1000);
    }

So it is very much millis

I checked the code and found that readType is never assigned TIMESTAMP_NANOS.
MICROS and NANOS are read as ReadType.LONG.
This means that this line is not needed.

Yes, I actually came to the same conclusion. Nevertheless, I would like to understand why we need timestampMillisBatchReader for millis but not for micros. Is it because we need to convert millis to micros, since the Iceberg timestamp type has microsecond precision? If so, then I think this line is indeed not needed: we're reading timestamp nanos here, so no conversion is needed.

That is my understanding too
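The conclusion of this thread can be sketched with plain arrays. This is a hypothetical stand-in, not Iceberg code: `TimestampReadSketch` and its methods are invented names, and the only behavior taken from the thread is the `* 1000` scaling in the quoted `nextVal` and the fact that micros and nanos are read as plain longs.

```java
// Hypothetical stand-ins for the batch readers discussed above; not Iceberg classes.
final class TimestampReadSketch {
  private TimestampReadSketch() {}

  // Mirrors the quoted nextVal: stored millis are scaled by 1000 to become micros.
  static long[] readMillisAsMicros(long[] stored) {
    long[] out = new long[stored.length];
    for (int i = 0; i < stored.length; i++) {
      out[i] = stored[i] * 1000L;
    }
    return out;
  }

  // Micros and nanos already have the target precision, so values are copied unchanged.
  static long[] readAsLong(long[] stored) {
    return stored.clone();
  }
}
```

Routing nanosecond values through the millis path would multiply them by 1000, which is why the extra `case TIMESTAMP_NANOS` was dropped rather than kept with a comment.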

nandorKollar · 2025-07-16T15:00:51Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

-        allocateVectorBasedOnOriginalType(columnDescriptor.getPrimitiveType(), arrowField);
+      if (columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation() != null
+          && !(columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation()
+              instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation)) {


I added this extra instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation check because the logic that relied on original types didn't handle UUID. UUID has no original type representation; only a logical type annotation exists.

Isn't it more natural to add this fallback to the allocateVectorBasedOnLogicalTypeAnnotation method?

Makes sense, moving this fallback into the logical type annotation visitor indeed looks better than using instanceof.
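A minimal sketch of that fallback idea, under stated assumptions: the visitor returns an empty Optional for annotations it does not handle, and the caller falls back to allocation by physical type name. `VisitorSketch`, `AnnotationSketch`, and `AllocationSketch` are hypothetical names, and strings stand in for Arrow vectors.

```java
import java.util.Optional;

// Hypothetical visitor: every visit method defaults to "not handled".
interface VisitorSketch<T> {
  default Optional<T> visitDecimal() {
    return Optional.empty();
  }

  default Optional<T> visitUuid() {
    return Optional.empty();
  }
}

// Hypothetical annotations that dispatch to the matching visit method.
enum AnnotationSketch {
  DECIMAL {
    @Override
    <T> Optional<T> accept(VisitorSketch<T> visitor) {
      return visitor.visitDecimal();
    }
  },
  UUID {
    @Override
    <T> Optional<T> accept(VisitorSketch<T> visitor) {
      return visitor.visitUuid();
    }
  };

  abstract <T> Optional<T> accept(VisitorSketch<T> visitor);
}

final class AllocationSketch {
  private AllocationSketch() {}

  static String allocate(AnnotationSketch annotation) {
    VisitorSketch<String> visitor =
        new VisitorSketch<String>() {
          @Override
          public Optional<String> visitDecimal() {
            return Optional.of("decimal vector");
          }
          // visitUuid is not overridden, so UUID reaches the fallback below.
        };
    // No instanceof check is needed: an empty result triggers the fallback.
    return annotation.accept(visitor).orElseGet(() -> "vector chosen by physical type name");
  }
}
```

With this shape, supporting another annotation means overriding one more visit method, and the call site stays unchanged.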

nandorKollar · 2025-07-18T13:06:00Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

+      if (columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation() != null
+          && !(columnDescriptor.getPrimitiveType().getLogicalTypeAnnotation()
+              instanceof LogicalTypeAnnotation.UUIDLogicalTypeAnnotation)) {
+        allocateVectorBasedOnLogicalTypeAnnotation(columnDescriptor.getPrimitiveType(), arrowField);


You mean why use logical type annotations instead of original types? In parquet-format, logical types were originally represented as ConvertedTypes, which parquet-mr calls OriginalType. Because it was an enum, it was not flexible enough: there was no way to properly represent logical type parameters such as decimal precision and scale or the time unit for timestamps. It was therefore replaced with a union of structs called LogicalType. New types such as variant and geometry have no OriginalType representation, only logical types. The nanosecond precision timestamp type falls into this category: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype.

pvary · 2025-07-23T13:17:57Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

+              @Override
+              public Optional<Object> visit(
+                  LogicalTypeAnnotation.DecimalLogicalTypeAnnotation decimalLogicalType) {
+                vec = arrowField.createVector(rootAlloc);


When we assign values to an instance variable, we use this. See: https://iceberg.apache.org/contribute/#accessing-instance-variables

Simply using this won't work here, as we're inside an anonymous class instance. VectorizedArrowReader.this.vec works though.
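The language rule behind this reply can be shown in isolation. The class below is a hypothetical example unrelated to the Iceberg code: inside an anonymous class, a bare `this` refers to the anonymous instance, so the enclosing instance has to be named with the qualified form `OuterSketch.this`.

```java
// Illustrative class; the names are hypothetical and unrelated to the Iceberg code.
final class OuterSketch {
  private String vec = "unset";

  Runnable allocator(String value) {
    return new Runnable() {
      @Override
      public void run() {
        // Here `this` is the Runnable, which has no field named vec,
        // so `this.vec = value` would not compile.
        OuterSketch.this.vec = value;
      }
    };
  }

  String vec() {
    return vec;
  }
}
```

Calling `new OuterSketch().allocator("allocated").run()` updates the field on the enclosing instance, which is the effect `VectorizedArrowReader.this.vec` has in the PR.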

nandorKollar · 2025-07-28T14:19:37Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java

 *   <li>Iceberg: {@link Types.LongType}, Arrow: {@link MinorType#BIGINT}
 *   <li>Iceberg: {@link Types.FloatType}, Arrow: {@link MinorType#FLOAT4}
 *   <li>Iceberg: {@link Types.DoubleType}, Arrow: {@link MinorType#FLOAT8}
+ *   <li>Iceberg: {@link Types.DecimalType}, Arrow: {@link MinorType#DECIMAL}


Actually decimal support appears to work: #2486 (comment)

pvary · 2025-07-28T14:51:13Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

+      }
+      if (intLogicalType.getBitWidth() == 64) {


Could be in the else

I'd rather not put this in the else. Actually, I was thinking about a switch on the width instead. In the unlikely case that an additional bit width is introduced, an else branch would silently do possibly wrong things. Maybe it's better to be more explicit here and throw an exception when the bit width is not an expected value.

Go for the switch

BTW, we return Optional.empty if an unknown type is used, and I would guess that this will throw an exception down the line. So I would not change that unless we know exactly why and what the effect will be.

So basically just use a switch to make the different cases more explicit and easier to read
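A sketch of the agreed shape, with assumptions labeled: `BitWidthSketch` and `vectorKindFor` are hypothetical names, strings stand in for Arrow vectors, and grouping 8, 16, and 32 together is a guess based on the condition visible in the diff. The default branch keeps returning an empty Optional, as requested above, instead of throwing.

```java
import java.util.Optional;

// Hypothetical helper; the grouping of widths is an assumption, not the merged code.
final class BitWidthSketch {
  private BitWidthSketch() {}

  static Optional<String> vectorKindFor(int bitWidth) {
    switch (bitWidth) {
      case 8:
      case 16:
      case 32:
        return Optional.of("int");
      case 64:
        return Optional.of("bigint");
      default:
        // Unknown widths stay unhandled so existing downstream behavior is unchanged.
        return Optional.empty();
    }
  }
}
```

Compared with an if/else chain, every accepted width is listed explicitly, so a newly introduced width cannot slip into a branch written for 64-bit values.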

pvary · 2025-07-28T14:51:43Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java

- *       Types.FixedType} and {@link Types.DecimalType} See
- *       https://github.com/apache/iceberg/issues/2485 and
+ *   <li>Data types: {@link Types.ListType}, {@link Types.MapType}, {@link Types.StructType} and
+ *       {@link Types.FixedType} See https://github.com/apache/iceberg/issues/2485 and


Is Fixed working?

It appears it does, though it isn't in the supported type list (SUPPORTED_TYPES). It isn't covered by a unit test either; should I add one in this PR?

Another PR then

pvary · 2025-07-28T14:52:37Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

+  private void allocateVectorBasedOnLogicalTypeAnnotation(
+      PrimitiveType primitive, Field arrowField) {
+    primitive
+        .getLogicalTypeAnnotation()
+        .accept(new VectorizedArrowReaderLogicalTypeAnnotationVisitor(arrowField, primitive));
  }


Is it worth creating a separate method for this?

Are you questioning the existence of allocateVectorBasedOnLogicalTypeAnnotation? It might not be worth it, though it could improve readability a little bit.

Please remove the method then

pvary · 2025-07-28T14:54:36Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

+      VectorizedArrowReader.this.allocateVectorBasedOnTypeName(
+          VectorizedArrowReader.this.columnDescriptor.getPrimitiveType(), arrowField);


The requirement is to use this when assigning instance variables. See: https://iceberg.apache.org/contribute/#accessing-instance-variables

In this case it is not needed

pvary · 2025-07-28T14:55:02Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

+          ((FixedSizeBinaryVector) VectorizedArrowReader.this.vec)
+              .allocateNew(VectorizedArrowReader.this.batchSize);


not an assignment... See other cases below

Ah yes, sorry, I should've been more careful about what I replaced.

nastra · 2025-08-04T11:00:42Z

arrow/src/test/java/org/apache/iceberg/arrow/vectorized/TestArrowReader.java

                new FieldType(true, new ArrowType.Decimal(9, 2, 128), null),
+                null),
+            new Field(
+                "timestamp_ns",


can you please either call this timestamp_ns or timestamp_nano in order to align this with the fields that use a timezone

nastra · 2025-08-04T11:23:35Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

+      return Optional.empty();
+    }
+
+    private Optional<LogicalTypeVisitorResult> visitEnumJsonBsonString() {


maybe just give this a better name, since this doesn't do an actual "visit" but actually inits a vector

nastra · 2025-08-04T11:24:08Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

+        LogicalTypeAnnotation.IntLogicalTypeAnnotation intLogicalType) {
+      FieldVector vector = arrowField.createVector(rootAlloc);
+
+      if (intLogicalType.getBitWidth() == 8


maybe extract intLogicalType.getBitWidth() into a var so that the code gets shorter

nandorKollar · 2025-08-05T07:45:48Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

    return reader;
  }

+  private static final class LogicalTypeVisitorResult {


This class is intended only for a very specific, internal purpose; that's why I marked it final. It isn't intended for extension. A 'record' type would be more accurate here, but records are not available in Java 11. If we want to inherit from it in the future, we can remove the final modifier then. Could you please explain why you think it is better not to mark this final?

nastra · 2025-08-05T09:25:16Z

arrow/src/test/java/org/apache/iceberg/arrow/vectorized/TestArrowReader.java

+            schema,
+            spec,
+            ImmutableMap.of(TableProperties.FORMAT_VERSION, "3"),
+            tableLocation); // .create(schema, spec, tableLocation);


the comment can be removed now

Oops, right, that's a leftover comment. Thanks for spotting it. Is it okay to change the format version in this test?

nastra · 2025-08-05T09:28:42Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

    return reader;
  }

+  private static final class LogicalTypeVisitorResult {


maybe let's call this Result to make the code shorter

pvary · 2025-08-21T12:40:38Z

Merged to main.
Thanks for the PR @nandorKollar and @nastra for the review!

github-actions bot added the arrow label Jul 15, 2025

nandorKollar force-pushed the arrow_support_timestamp_nano branch 12 times, most recently from 1151edf to c88f981 Compare July 16, 2025 14:53

nandorKollar marked this pull request as ready for review July 16, 2025 14:54

nandorKollar force-pushed the arrow_support_timestamp_nano branch 3 times, most recently from 47baa5c to c45aad7 Compare July 16, 2025 16:33

Arrow: Add nanosec precision timestamp

576f1c1

nandorKollar force-pushed the arrow_support_timestamp_nano branch from c45aad7 to 576f1c1 Compare July 17, 2025 12:20

pvary reviewed Jul 18, 2025

nandorKollar commented Jul 18, 2025

pvary reviewed Jul 23, 2025

nandorKollar added 2 commits July 24, 2025 20:02

address code review comments

bb101a1

refactor visitor to inner class

f524b61

nandorKollar force-pushed the arrow_support_timestamp_nano branch from 58f936e to f524b61 Compare July 28, 2025 09:59

remove case for timestamp nanos

1f1fa34

nandorKollar force-pushed the arrow_support_timestamp_nano branch from f356881 to 1f1fa34 Compare July 28, 2025 13:58

decimal support is actually working

89c7fc0

nandorKollar commented Jul 28, 2025

pvary reviewed Jul 28, 2025

pvary reviewed Jul 28, 2025

Address code review comments

44c23d6

pvary reviewed Jul 29, 2025

pvary reviewed Jul 29, 2025

nandorKollar added 2 commits July 29, 2025 10:22

add newlines + use else if

a7b852a

Refactor logical type visitor

004380c

nandorKollar mentioned this pull request Aug 1, 2025

Arrow: test case for fixed type #13700

Merged

Add nano type as supported type in ColumnVector javadoc

4cd4dea

FredKhayat mentioned this pull request Aug 4, 2025

Improve timestamp support apache/iceberg-python#2270

Closed

3 tasks

nastra reviewed Aug 4, 2025

nastra reviewed Aug 4, 2025

Merge remote-tracking branch 'origin/main' into arrow_support_timestamp_nano

70787db

nandorKollar commented Aug 5, 2025

Address code review comments

c962b04

nastra reviewed Aug 5, 2025

nastra approved these changes Aug 5, 2025

nandorKollar added 3 commits August 5, 2025 11:58

remove comment

ee33d81

Fix spotless problem

4e77be0

Merge remote-tracking branch 'origin/main' into arrow_support_timestamp_nano

02f6190

nandorKollar requested a review from pvary August 21, 2025 08:47

pvary approved these changes Aug 21, 2025

pvary merged commit efc27f9 into apache:main Aug 21, 2025
39 checks passed
